Quiz 1 Flashcards

1
Q

what are the steps of a data analysis pipeline

A
  1. Figure out the question.
  2. Find/get relevant data.
  3. Clean & prepare the data.
  4. Analyze the data.
  5. Interpret & present results.
2
Q

why is a full data analysis broken up into many steps

A

it is impractical to rerun the first few steps over and over (e.g. API calls), so each step saves its results for the next step to pick up

3
Q

what does np.vectorize do

A

turns a function into a function that can operate on an entire array in an element by element fashion
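
A minimal sketch of the idea; clip_negative is a made-up function used only for illustration. Note that np.vectorize is a convenience wrapper (essentially a loop), not a speed-up.

```python
import numpy as np

def clip_negative(x):
    # Plain Python function written for a single scalar value.
    return x if x > 0 else 0.0

# np.vectorize wraps it so it can be called on a whole array, element by element.
clip_all = np.vectorize(clip_negative, otypes=[float])

values = np.array([-1.5, 0.0, 2.0, 3.5])
clipped = clip_all(values)
print(clipped)
```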

4
Q

what is NumPy good for

A

storing and operating on arrays

5
Q

what is pandas good for

A

pandas is good for manipulating tabular data: loading, cleaning, filtering, grouping, and joining it

6
Q

what is a Series in pandas

A

a 1D labeled array (values plus an index), with the values usually stored as a NumPy array; each column of a DataFrame is a Series
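
A small sketch to make this concrete; the temperature values and labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# A Series: 1D values plus an index of labels.
temps = pd.Series([21.5, 19.0, 23.2], index=["mon", "tue", "wed"], name="temp_c")

# The values can be pulled out as a plain NumPy array.
as_array = temps.to_numpy()

# Selecting one column of a DataFrame gives back a Series.
df = pd.DataFrame({"temp_c": temps})
column = df["temp_c"]

print(type(as_array).__name__, type(column).__name__)
```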

7
Q

what is a DataFrame in pandas

A

a 2D table of data with labeled rows and columns; each column is a Series
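
For example (a sketch with invented city data), a DataFrame holds several named columns side by side, and whole-column arithmetic works row by row:

```python
import pandas as pd

cities = pd.DataFrame({
    "name": ["A", "B", "C"],
    "population": [1000, 2500, 400],
    "area": [10.0, 20.0, 8.0],
})

# Column arithmetic is element-wise and creates a new column.
cities["density"] = cities["population"] / cities["area"]

n_rows, n_cols = cities.shape
print(cities)
```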

8
Q

define the steps of extract-transform-load

A

extract: get the data you need

transform: fix the data, clean data, get it in a form you want to work with

load: load into next step of your pipeline
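
A rough sketch of the three steps as separate functions. The CSV text, column names, and in-memory "files" are all assumptions made up for the example; a real pipeline would read from and write to actual files or APIs.

```python
import io
import pandas as pd

# Hypothetical raw input; stands in for a file or an API response.
RAW_CSV = "city,area\n Vancouver ,115\nBurnaby,90\nUnknown,\n"

def extract(source):
    # Extract: get the data you need.
    return pd.read_csv(source)

def transform(raw):
    # Transform: drop rows with no area and tidy up the city names.
    cleaned = raw.dropna(subset=["area"]).copy()
    cleaned["city"] = cleaned["city"].str.strip()
    return cleaned

def load(cleaned, target):
    # Load: write the result where the next pipeline step will read it.
    cleaned.to_csv(target, index=False)

cleaned = transform(extract(io.StringIO(RAW_CSV)))
out = io.StringIO()
load(cleaned, out)
print(out.getvalue())
```

Saving the output of each step this way also means later steps can be rerun without repeating the extraction.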

9
Q

what are signal processing and filtering algorithms

A

signal processing uses filtering algorithms to remove noise from a signal

10
Q

what does LOESS smoothing do

A

it’s a technique to smooth a curve, to remove the noise

11
Q

LOESS smoothing takes a local area of the data and fits a line to it. We have to make a decision on how big this area is.

What happens if we pick a small area or a large area?

A

if small, we are more sensitive to noise

if large, we are less sensitive to signal changes
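
The trade-off can be seen with a simplified stand-in for LOESS. This sketch fits a plain (unweighted) least-squares line to the nearest points around each x; real LOESS also weights nearby points more heavily, so treat local_line_smooth and its frac parameter as illustrative assumptions, not a library implementation.

```python
import numpy as np

def local_line_smooth(x, y, frac):
    """Fit a straight line to the nearest `frac` of the points around each x."""
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))       # number of points in each local area
    smoothed = np.empty(n)
    for i in range(n):
        nearest = np.argsort(np.abs(x - x[i]))[:k]
        slope, intercept = np.polyfit(x[nearest], y[nearest], 1)
        smoothed[i] = slope * x[i] + intercept
    return smoothed

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
signal = np.sin(x)
y = signal + rng.normal(0, 0.3, size=x.size)

tight = local_line_smooth(x, y, frac=0.05)   # small area: follows the wiggles
loose = local_line_smooth(x, y, frac=0.9)    # large area: flattens the sine wave

# Total up-and-down movement of each curve, and how far each is from the true signal.
roughness_tight = np.abs(np.diff(tight)).sum()
roughness_loose = np.abs(np.diff(loose)).sum()
error_tight = np.abs(tight - signal).mean()
error_loose = np.abs(loose - signal).mean()
print(roughness_tight, roughness_loose, error_tight, error_loose)
```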

12
Q

is LOESS better with lots of samples or sparse samples

A

better with more samples

13
Q

true or false: LOESS’s parameters are y then x

A

true, at least for statsmodels' lowess(endog, exog, ...): the y values come first, then the x values

14
Q

what is Kalman filtering

A

it lets you express what you know (your observations, your predictions, and how uncertain each one is) and combines them to estimate the most likely value for the truth

15
Q

what does the Kalman filter need

A

we need to give the variance of
1. our observations
2. our predictions

and the covariance between each pair

we also need matrices for both our observations and predictions

the covariance matrices express our uncertainty in the measurements and predictions
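
To see how these inputs fit together, here is a minimal one-variable sketch. It is not a library API: kalman_1d and its parameter names are made up, the single variances stand in for the covariance matrices, and both the observation and prediction (transition) matrices are assumed to be 1, i.e. "the value stays the same and we measure it directly".

```python
def kalman_1d(observations, initial_state, initial_var, transition_var, observation_var):
    """Scalar Kalman filter for a value that is predicted to stay constant."""
    state, var = initial_state, initial_var
    estimates = []
    for z in observations:
        # Predict: state is unchanged, uncertainty grows by the prediction variance.
        var = var + transition_var
        # Update: the gain weighs the observation against the prediction.
        gain = var / (var + observation_var)
        state = state + gain * (z - state)
        var = (1 - gain) * var
        estimates.append(state)
    return estimates

readings = [10.4, 9.7, 10.2, 9.9, 10.1]   # invented noisy sensor readings
estimates = kalman_1d(readings, initial_state=10.0, initial_var=1.0,
                      transition_var=0.01, observation_var=0.25)
print(estimates)
```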

16
Q

in your observation-covariance matrix, which expresses errors in the observations, what do lower and higher values mean

A

lower values: less sensor error, so the observations/measurements have more of an effect on the result

higher values: more sensor noise, so the observations have less of an effect on the result

17
Q

the transition_covariance says what you think about the error in your prediction. What do lower and higher values mean

A

lower: less prediction error, so the prediction affects the result more

higher: the prediction is less accurate, so it affects the result less
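
One way to see both effects is through the Kalman gain for a single variable, which is the weight given to the new observation (0 means "ignore it", 1 means "believe it completely"). This is a simplified scalar sketch; kalman_gain and its arguments are illustrative names, not part of any library.

```python
def kalman_gain(prior_var, transition_var, observation_var):
    # Uncertainty of the prediction = previous uncertainty + prediction error.
    predicted_var = prior_var + transition_var
    # Weight given to the observation when it is combined with the prediction.
    return predicted_var / (predicted_var + observation_var)

# Low vs high observation variance: trust the sensor more vs less.
trust_sensor = kalman_gain(1.0, 0.1, observation_var=0.01)
distrust_sensor = kalman_gain(1.0, 0.1, observation_var=100.0)

# Low vs high transition variance: trust the prediction more vs less.
trust_prediction = kalman_gain(0.0, transition_var=0.01, observation_var=1.0)
distrust_prediction = kalman_gain(0.0, transition_var=100.0, observation_var=1.0)

print(trust_sensor, distrust_sensor, trust_prediction, distrust_prediction)
```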

18
Q

what does it mean to impute data

A

replacing missing values or deleted outliers with plausible, calculated values
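
Two common ways to do this in pandas, sketched on invented temperature readings: fill the gaps with the mean, or interpolate between the neighbouring values.

```python
import numpy as np
import pandas as pd

temps = pd.Series([20.0, np.nan, 22.0, np.nan, 26.0])

# Option 1: replace every missing value with the mean of the known values.
by_mean = temps.fillna(temps.mean())

# Option 2: linear interpolation between the surrounding known values.
by_interp = temps.interpolate()

print(by_interp.tolist())  # [20.0, 21.0, 22.0, 24.0, 26.0]
```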

19
Q

what is entity resolution or record linkage

A

the process of finding multiple values that actually refer to the same entity
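
A very simple sketch: normalize each name and group the ones that end up identical. The names and the normalize helper are invented for illustration; real record linkage often needs fuzzy matching on top of this.

```python
import re

def normalize(name):
    # Lowercase, drop punctuation, and collapse repeated whitespace.
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())

names = ["Tim Horton's", "tim hortons", " TIM  HORTONS ", "Starbucks"]

# Group the raw values by their normalized form.
groups = {}
for n in names:
    groups.setdefault(normalize(n), []).append(n)

print(groups)
```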

20
Q

what is the difference between
city_data = city_data[city_data['area'] <= 10000]
city_data = city_data['area'] <= 10000

A

city_data = city_data[city_data['area'] <= 10000]
keeps only the rows where area is <= 10k (city_data is still a DataFrame)

city_data = city_data['area'] <= 10000
replaces city_data with a single boolean Series (True/False per row) saying whether area <= 10k; the original data is lost
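
A small runnable sketch of the two results, using invented city areas and separate variable names so nothing gets overwritten:

```python
import pandas as pd

city_data = pd.DataFrame({"city": ["A", "B", "C"], "area": [500, 25000, 10000]})

# The comparison alone produces a boolean Series, one True/False per row.
mask = city_data["area"] <= 10000

# Indexing the DataFrame with that Series keeps only the True rows.
filtered = city_data[mask]

print(mask.tolist())               # [True, False, True]
print(filtered["city"].tolist())   # ['A', 'C']
```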

21
Q

how do you write sums in numpy and sums in pandas

A

numpy:
np.sum(totals, axis=x)

pandas:
totals.sum(axis=x)
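
A quick sketch of what axis does in each case, with made-up numbers: axis=0 adds down the rows (one result per column), axis=1 adds across the columns (one result per row).

```python
import numpy as np
import pandas as pd

totals = np.array([[1, 2, 3],
                   [4, 5, 6]])

col_sums = np.sum(totals, axis=0)   # [5, 7, 9]
row_sums = np.sum(totals, axis=1)   # [6, 15]

# Same idea in pandas, called as a method on the DataFrame.
df = pd.DataFrame(totals, columns=["a", "b", "c"])
df_col_sums = df.sum(axis=0)        # Series indexed by column name
df_row_sums = df.sum(axis=1)        # Series indexed by row label

print(col_sums, row_sums)
```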

22
Q

you have a df called counts with a column called 'date'. Make it a datetime column

A

counts['date'] = pd.to_datetime(counts['date'])
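
A runnable version with invented data. Once the column is a datetime column, the .dt accessor gives access to parts of the date.

```python
import pandas as pd

counts = pd.DataFrame({"date": ["2016-01-01", "2016-01-02"], "count": [3, 5]})

# Convert the column of strings into real datetime values.
counts["date"] = pd.to_datetime(counts["date"])

# Datetime columns expose pieces of the date through .dt
months = counts["date"].dt.month
print(counts.dtypes)
```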