Data Engineer Flashcards

(47 cards)

1
Q

Pub/Sub delivery: what order and delivery guarantees?

A

Pub/Sub delivers each message at least once to every subscription that EXISTS at publish time. Ordering is not guaranteed. Unacknowledged messages are retained for 7 days before being dropped.

2
Q

What are the two accumulation modes for DataFlow?

A

accumulatingFiredPanes - each firing emits the entire set of results accumulated so far for the window, so some elements are output multiple times when a window triggers more than once.
discardingFiredPanes - each firing emits only the new results that have arrived since the previous trigger firing within the window.
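A minimal Apache Beam (Python SDK) sketch of where the accumulation mode is set; the stand-in data, 60-second window, and early trigger are assumptions:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterProcessingTime, AfterWatermark)

with beam.Pipeline() as p:
    events = p | beam.Create([("a", 1), ("a", 2), ("b", 3)])  # stand-in data
    windowed_sums = (
        events
        | beam.WindowInto(
            window.FixedWindows(60),
            trigger=AfterWatermark(early=AfterProcessingTime(10)),
            # ACCUMULATING re-emits everything seen so far in the window on
            # each firing; DISCARDING emits only what is new since the last
            # firing.
            accumulation_mode=AccumulationMode.ACCUMULATING)
        | beam.CombinePerKey(sum))
```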

3
Q

Watermark

A

The system’s estimate of how complete the input is in event time, i.e. how far behind event time the processing is. It can be a guaranteed (perfect) watermark when reading from Pub/Sub with publish-time timestamps; otherwise it is a heuristic based on the observed skew over time. The watermark drives the default trigger: results are emitted when the watermark passes the end of the window.

4
Q

Windowing

A

Determines where in event time results are calculated. Windowing subdivides a PCollection according to the timestamps of its individual elements. Dataflow transforms that aggregate multiple elements, such as GroupByKey and Combine, work implicitly on a per-window basis—that is, they process each PCollection as a succession of multiple, finite windows, though the entire collection itself may be of unlimited or infinite size.
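For example, a minimal Beam Python sketch (the stand-in elements and 60-second window size are assumptions):

```python
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    counts_per_window = (
        p
        | beam.Create([("page_a", 1), ("page_a", 1), ("page_b", 1)])  # stand-in data
        # Subdivide the PCollection into 60-second windows by element timestamp;
        # the aggregation below then runs once per key per window.
        | beam.WindowInto(window.FixedWindows(60))
        | beam.CombinePerKey(sum))
```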

5
Q

Triggering

A

Triggering determines when to “close” each finite window as unbounded data arrives, i.e. WHEN in processing time results are emitted. Triggers let you refine the windowing strategy for your PCollection to handle late-arriving data or to emit early (speculative) results. The default is a single firing when the watermark passes the end of the window.
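A sketch of overriding the default watermark trigger to also fire early and handle late data; the stand-in data, window size, and lateness values are assumptions:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (
    AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

with beam.Pipeline() as p:
    results = (
        p
        | beam.Create([("sensor_1", 2), ("sensor_1", 3)])  # stand-in data
        | beam.WindowInto(
            window.FixedWindows(60),
            # Default behavior is one firing when the watermark passes the end
            # of the window; here we also fire early every 30s of processing
            # time and once more for each late element.
            trigger=AfterWatermark(early=AfterProcessingTime(30),
                                   late=AfterCount(1)),
            accumulation_mode=AccumulationMode.ACCUMULATING,
            allowed_lateness=600)
        | beam.CombinePerKey(sum))
```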

6
Q

Side input

A

A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection. When you specify a side input, you create a view of some other data that can be read from within the ParDo transform’s DoFn while processing each element. A side input can be a value computed by a separate branch of your pipeline.
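A sketch of a side input computed on a separate branch of the pipeline and broadcast into a per-element transform; the data and the above-the-mean filter are illustrative:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    values = p | beam.Create([1, 4, 7, 10])  # stand-in data

    # Branch 1: compute a single value (the mean) from the data.
    mean = values | beam.combiners.Mean.Globally()

    # Branch 2: each element is processed with that value available as a
    # side input (passed as the extra keyword argument m).
    above_mean = values | beam.FlatMap(
        lambda x, m: [x] if x > m else [],
        m=beam.pvalue.AsSingleton(mean))
```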

7
Q

Gradient descent

A

Iterative optimization algorithm used to find the best model parameters (weights) by repeatedly stepping them in the direction that reduces the error (loss).
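A toy sketch with made-up data, fitting y = 2x by repeatedly stepping the parameters downhill on the squared error:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # made-up training data
y = np.array([2.0, 4.0, 6.0, 8.0])   # true relationship is y = 2x
w, b, lr = 0.0, 0.0, 0.05            # parameters and learning rate

for step in range(500):
    error = (w * x + b) - y
    w -= lr * 2 * np.mean(error * x)  # gradient of MSE with respect to w
    b -= lr * 2 * np.mean(error)      # gradient of MSE with respect to b

print(w, b)  # w approaches 2.0 and b approaches 0.0
```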

8
Q

Epoch

A

One complete traversal (pass) through the entire training dataset

9
Q

Softmax

A

Activation function for multi-class classification: it exponentiates and normalizes the inputs (amplifying the largest and suppressing the lower ones) so the outputs sum to 1 and can be read as class probabilities.
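A small worked example (the logit values are made up):

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))  # shift by the max for numerical stability
    return z / z.sum()                   # normalize so the outputs sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # -> approx [0.66, 0.24, 0.10]
```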

10
Q

Neuron

A

One unit that combines its inputs: a weighted sum of the inputs passed through an activation function

11
Q

Hidden Layer

A

another layer of neuron(s) to combine outputs from previous layer

12
Q

Inputs

A

data taken into the neuron

13
Q

Features

A

transformations of inputs, such as x^2

14
Q

Feature engineering

A

Determining and constructing the right features to get the best performance out of the machine learning model

15
Q

Accuracy

A

# correct / # total, i.e. (TP + TN) / (TP + TN + FP + FN)

16
Q

Precision

A

TP / (TP + FP)

17
Q

Recall

A

TP / (TP + FN)
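A tiny worked example tying the accuracy, precision, and recall formulas together (the confusion-matrix counts are made up):

```python
TP, FP, FN, TN = 80, 10, 20, 90              # hypothetical counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # 170 / 200 = 0.85
precision = TP / (TP + FP)                   # 80 / 90  ~= 0.89
recall    = TP / (TP + FN)                   # 80 / 100 = 0.80
```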

18
Q

How does TensorFlow evaluate?

A

Lazy evaluation - you build a computation graph, then run it (in a session) to get results

19
Q

Estimator API models

A

LinearRegressor - linear regression
LinearClassifier - linear classification
DNNRegressor - Deep Neural Network regression
DNNClassifier - Deep Neural Network classifier

20
Q

Logistic vs Linear Regression

A

Linear regression outputs a continuous value (how much will this house sell for?), while logistic regression outputs a probability for a binary outcome (will this house sell for more than 500k?).

21
Q

How to encode categorical data

A

Use one-hot encoding when the vocabulary is known and is the same at prediction time as at training time. (Cold start is when you can’t make predictions for new values that weren’t in the training vocabulary.) If you don’t have the vocabulary of all possible values, use a hash bucket instead.
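A tf.feature_column sketch of both approaches (TF 1.x-era Estimator API; the column names, vocabulary, and bucket size are assumptions):

```python
import tensorflow as tf

# Known, stable vocabulary -> one-hot encode it.
color = tf.feature_column.categorical_column_with_vocabulary_list(
    "color", vocabulary_list=["red", "green", "blue"])
color_one_hot = tf.feature_column.indicator_column(color)

# Unknown or open-ended vocabulary -> hash values into a fixed set of buckets.
user_id = tf.feature_column.categorical_column_with_hash_bucket(
    "user_id", hash_bucket_size=1000)
```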

22
Q

When to bucketize

A

Bucketize continuous (float) features into discrete ranges so that similar values share a bucket instead of every distinct value being treated separately
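For example (the feature name and boundaries are assumptions):

```python
import tensorflow as tf

# Map a continuous feature into discrete ranges so nearby values share a bucket.
sqft = tf.feature_column.numeric_column("square_feet")
sqft_bucketized = tf.feature_column.bucketized_column(
    sqft, boundaries=[500, 1000, 1500, 2000, 2500])
```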

23
Q

What type of data are DNNs good at

A

Dense, highly correlated values

24
Q

What type of data are Linear models good at

A

Sparse independent features

25
Q

What model can do a combo of sparse and dense features

A

Wide-and-deep network in tf.estimator

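A minimal sketch of the wide-and-deep estimator (the feature columns and hidden-unit sizes are assumptions):

```python
import tensorflow as tf

# Sparse (wide) columns feed the linear part; dense (deep) columns feed the DNN.
wide_columns = [tf.feature_column.categorical_column_with_hash_bucket(
    "user_id", hash_bucket_size=1000)]
deep_columns = [tf.feature_column.numeric_column("age"),
                tf.feature_column.numeric_column("avg_session_minutes")]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[64, 32])
```
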
26
Q

Hyperparameter tuning

A

Runs multiple training trials to optimize a target variable (metric) that you specify

27
Q

TextLineDataset

A

Used to read data from CSV files/etc. when data can't fit in memory

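A sketch of streaming a CSV with tf.data instead of loading it into memory (TF 1.x-era API; the file path and column layout are assumptions):

```python
import tensorflow as tf

def parse_line(line):
    # Assumed layout: two float features followed by a float label.
    f1, f2, label = tf.decode_csv(line, record_defaults=[[0.0], [0.0], [0.0]])
    return {"f1": f1, "f2": f2}, label

dataset = (tf.data.TextLineDataset("gs://my-bucket/train.csv")  # placeholder path
           .skip(1)          # skip the header row
           .map(parse_line)
           .shuffle(10000)
           .batch(128))
```
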
28
Q

What is BigQuery schema detection?

A

Schema auto-detection is available when you load data into BigQuery, and when you query an external data source. When auto-detection is enabled, BigQuery starts the inference process by selecting a random file in the data source and scanning up to 100 rows of data to use as a representative sample. BigQuery then examines each field and attempts to assign a data type to that field based on the values in the sample.

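A sketch of enabling auto-detection when loading a CSV from Cloud Storage with the google-cloud-bigquery Python client (the bucket, dataset, and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True                          # infer the schema from a sample
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1                      # skip the CSV header

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",
    "my_project.my_dataset.my_table",
    job_config=job_config)
load_job.result()  # block until the load job completes
```
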
29
Q

Supported external data sources for BigQuery

A

Bigtable, Cloud Storage, Google Drive (CSV, JSON, Avro, or Google Sheets)

30
Q

Writing BigQuery query results

A

All BigQuery query results are written to either a permanent or temporary table. Temporary tables are used to cache query results for 24 hours. Permanent tables can be either new or existing tables.

31
Q

Supported data locations for Dataproc

A

HDFS (on cluster), GCS, Bigtable, or BigQuery

32
Q

AutoML offerings

A

Vision, Natural Language, Translation

33
Q

What can the Vision API do?

A

Detect categories of things in an image, extract text (OCR), find topical entities (celebrities, logos, news events, etc.), and moderate content

34
Q

What can the Cloud Video Intelligence API do?

A

Cloud Video Intelligence API makes videos searchable and discoverable by extracting metadata, identifying key nouns, and annotating the content of the video

35
Q

Can you create a VM from a snapshot?

A

Yes

36
Q

How do you share a snapshot across projects?

A

Create a custom image from the snapshot first, then share the image with the other project

37
Q

GCP equivalent: Apache Kafka

A

Cloud Pub/Sub

38
Q

GCP equivalent: Drill

A

BigQuery

39
Q

GCP equivalent: Pig

A

Dataproc

40
Q

GCP equivalent: Spark

A

Dataproc (or Dataflow with some rework)

41
Q

GCP equivalent: Beam

A

Dataflow

42
Q

GCP equivalent: Cassandra

A

Bigtable

43
Q

GCP equivalent: HBase

A

Bigtable

44
Q

GCP equivalent: Redis

A

Memorystore

45
Q

What is Spark?

A

Cluster computing framework; a faster, second-generation engine that succeeded MapReduce-era tools like Pig

46
Q

What is Hive?

A

SQL-like data warehouse on Hadoop; queries run as MapReduce jobs on the backend

47
Q

What is Pig?

A

Scripting layer (Pig Latin) for cluster computing on Hadoop; scripts run as MapReduce jobs on the backend