Course 2 - Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform Flashcards by Rikard Donnelly

In a boolean 2x2 matrix of analyzed vs. collected data, where is the focus of “big data”?

In the quadrant of data that has been collected but not yet analyzed. Think of an iceberg: most of it is unused and ignored.

Give some examples of unused data.

Google Street View data.
Emails.
Parking footage.
Purchase history.

What are some barriers to big data analysis?

Unstructured data
Data volumes that are too large
Data quality
Data streams that arrive too fast

Big data problems are often described as counting problems. What’s the difference between easy and hard counting problems?

Hard problems:
It is difficult to quantify “fitness”, e.g. in vision analysis or natural language processing.

Easy problems:
Straightforward problems, but with large amounts of data.

Is one petabyte large?

It depends on the data type and the budget. A petabyte is a lot of text, but not necessarily a lot of pictures or video.

However, a large volume of data does not necessarily mean a long processing time.

Describe how MapReduce works.

The data is split into small chunks that can be processed in parallel (map). The per-chunk outputs are then aggregated (reduce).
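
A minimal single-process sketch of the idea, using word counting as the example task. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are made up for illustration, and the chunks are processed one after another here rather than on separate workers.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of lines."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the grouped values; here, sum the counts per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, chunk_size=2):
    # Split the input into small chunks that could be handled in parallel.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Map each chunk independently (sequentially here, for simplicity).
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    # Group by key, then aggregate.
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data is big", "data is everywhere", "big clusters"])
print(counts)
```

On a real cluster the map calls would run on different workers, and the grouping step would move data between them.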

What is the difference between typical development with Dataproc and typical Spark/Hadoop?

Dataproc manages all the setup necessary in Spark/Hadoop.

Self-managed Spark/Hadoop requires a lot of setup, configuration and optimization.

What are some drawbacks of managing a Hadoop cluster yourself?

Difficult to scale/add new hardware.
Less than 100% utilization -> bigger cost.
Downtime when upgrading/redistributing tasks.

What is a cluster?

A set of master and worker nodes for processing big data jobs. The master node coordinates the job and distributes (maps) the work to the worker nodes.

Why use nearby zones?

Lower latency.
Egress (exporting data) might incur costs.

What’s the difference between cluster masters and workers?

Master: Coordinates the job and splits the work so that workers can process the data in chunks. This is called mapping. The results are aggregated later, in reducing.

Worker: Compute power attached to a master node. Receives data and processes it. Workers can be configured as preemptible, in which case they may disappear from the cluster.

What is a preemptible worker?

A worker that runs on Google's unused compute capacity. Think of last-minute airplane tickets. It can be revoked when someone else requests that capacity.

What can images do for Dataproc?

Images let clusters be created with different versions of the software stack.

What is the gcloud tool?

A command-line program for interacting with Google Cloud services, including creating Dataproc clusters and submitting jobs.
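
As a rough illustration, the sketch below assembles two typical `gcloud dataproc` invocations as argument lists and prints them, without running anything. The helper names, the cluster name `example-cluster`, the region `us-central1` and the bucket path are all placeholders, not values from the course.

```python
import shlex


def create_cluster_cmd(name, region, num_workers=2):
    """Build the argument list for creating a Dataproc cluster."""
    return ["gcloud", "dataproc", "clusters", "create", name,
            f"--region={region}", f"--num-workers={num_workers}"]


def submit_pyspark_cmd(script, cluster, region):
    """Build the argument list for submitting a PySpark job to a cluster."""
    return ["gcloud", "dataproc", "jobs", "submit", "pyspark", script,
            f"--cluster={cluster}", f"--region={region}"]


# Placeholder names; substitute a real cluster, region and script location.
create = create_cluster_cmd("example-cluster", "us-central1")
submit = submit_pyspark_cmd("gs://example-bucket/wordcount.py",
                            "example-cluster", "us-central1")

# Print shell-safe command lines instead of executing them.
for cmd in (create, submit):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed lines could be pasted into a shell where the Google Cloud SDK is installed and authenticated.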

What is pyspark?

A Python interface to the Spark framework for distributed computing.

What is the hamburger stack?

The three-line menu icon in the top-left corner of the Google Cloud web console.

Which ports does Hadoop use?

HDFS: 50070

Hadoop web interface/YARN: 8088 and 8080 (unsure)

How can you make custom machines? What can be changed?

Through the web console or command line.

CPU and memory can be changed.

How are data and processing structured in MapReduce?

The data and operations are separated.

How is data stored in a Hadoop system?

Data is typically split into multiple parts on the Hadoop Distributed File System (HDFS). This is called sharding.

What is sharding?

Splitting data into several chunks for processing.
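
A toy sketch of the idea, assuming a simple round-robin assignment of records to shards. The `shard` function is hypothetical, and real HDFS splits files into fixed-size blocks rather than assigning individual records.

```python
def shard(records, num_shards):
    """Distribute records round-robin across num_shards lists."""
    shards = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        # Record i goes to shard i modulo the number of shards.
        shards[index % num_shards].append(record)
    return shards


shards = shard(list(range(10)), 3)
for number, part in enumerate(shards):
    print(f"shard {number}: {part}")
```

Each shard can then be handed to a different worker for processing.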

What is the traditional way of storing data on Hadoop vs. Google’s way?

Traditional: Sharded data is transferred to each node separately.

Google’s way: Data is stored in Google Cloud Storage.

Describe a traditional Google workflow.

Ingest -> process -> analyze
using
Pub/Sub -> Dataflow -> BigQuery

What’s a problem with keeping data on Hadoop nodes?

If the node dies, its data must be moved.

How should you move data to Hadoop on Dataproc?

1) Move data to GCS. 2) Update prefixes (hdfs:// to gs://). 3) Start using Hadoop on Dataproc as usual.
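
Step 2 can be sketched as a small path-rewriting helper. The function name `to_gcs_uri`, the bucket `example-bucket` and the sample paths are assumptions for illustration. The sketch also assumes the directory layout in the bucket mirrors the one in HDFS.

```python
def to_gcs_uri(path, bucket):
    """Rewrite an hdfs:// URI to a gs:// URI in the given bucket."""
    prefix = "hdfs://"
    if not path.startswith(prefix):
        # Leave anything that is not an HDFS URI untouched.
        return path
    # Drop the scheme and the namenode authority (host[:port]); keep the path.
    remainder = path[len(prefix):]
    _, _, file_path = remainder.partition("/")
    return f"gs://{bucket}/{file_path}"


old_path = "hdfs://namenode:8020/data/logs/2017.csv"
new_path = to_gcs_uri(old_path, "example-bucket")
print(new_path)
```

The same helper could be applied to every input and output path in an existing job configuration before running it on Dataproc.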

How do you install software to a Dataproc stack?

1) Write an init script. 2) Upload it to GCS. 3) Provide it when creating a Dataproc cluster.
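
The three steps above might look like the following sketch, which only builds and prints the commands. The script name `install.sh`, the bucket, the cluster name and the region are placeholders, and the helper `init_action_steps` is hypothetical.

```python
import shlex


def init_action_steps(local_script, bucket, cluster, region):
    """Return the upload and cluster-creation commands for an init script."""
    remote = f"gs://{bucket}/{local_script}"
    # Step 2: copy the script to Cloud Storage.
    upload = ["gsutil", "cp", local_script, remote]
    # Step 3: point the new cluster at the uploaded script.
    create = ["gcloud", "dataproc", "clusters", "create", cluster,
              f"--region={region}",
              f"--initialization-actions={remote}"]
    return upload, create


# Step 1 (writing install.sh itself) is assumed to have been done already.
upload, create = init_action_steps("install.sh", "example-bucket",
                                   "example-cluster", "us-central1")
for cmd in (upload, create):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Dataproc runs the script on each node while the cluster is being created.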

What is Hadoop?

Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model.

What is Apache Pig?

Apache Pig is an abstraction over MapReduce. It is a platform for analyzing large datasets, with programs expressed as data flows.

What is PySpark?

PySpark is a Python library for interacting with Spark.

What is Spark?

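Apache Spark is a distributed data-processing engine. Like Hadoop MapReduce, it runs jobs across a cluster, but it can keep intermediate data in memory.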

(30 cards)