Flashcards in Course 2 - Leveraging Unstructured Data with Cloud Dataproc on Google Cloud Platform Deck (30):
In a boolean 2x2 matrix of analyzed vs. collected data, where is the focus of "big data"?
In the quadrant of data that has been collected but not yet analyzed. Think of an iceberg: most of the data is unused and ignored.
Give some examples of unused data.
- Google Street View data.
- Parking footage.
- Purchase history.
What are some barriers to big data analysis?
- Unstructured data
- Data volumes that are too large
- Data quality
- Data streams that arrive too fast
Big data problems are often described as counting problems. What's the difference between easy and hard counting problems?
Hard: Difficult to quantify "fitness", e.g. vision analysis or natural language processing.
Easy: Straightforward problems, but with large amounts of data.
Is one petabyte large?
Depends on the data type and the budget. A petabyte is a lot of text, but not necessarily a lot of pictures or video.
However, a large volume does not necessarily mean a long processing time.
Describe how MapReduce works.
Split the data into small chunks that can be processed in parallel (map). The partial outputs are then aggregated (reduce).
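The split-then-aggregate idea can be sketched in plain Python as a word count. This is only an illustration on a single machine, not how Hadoop or Dataproc implement it; the function names and sample lines are made up for the example.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(lines):
    """Map step: count the words in one chunk, independently of all others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical sample data, standing in for a much larger input.
lines = [
    "big data is big",
    "data is collected",
    "collected data is rarely analyzed",
]

chunks = split_into_chunks(lines, chunk_size=1)
# Each chunk is mapped on its own, so on a cluster this could run in parallel.
partial_counts = [map_chunk(chunk) for chunk in chunks]
total = reduce(reduce_counts, partial_counts, Counter())

print(total.most_common(3))
```

Because map_chunk never looks outside its own chunk, the chunks can be handed to different workers; only the reduce step needs to see all partial results.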
What is the difference between typical development with Dataproc and typical Spark/Hadoop?
Dataproc: Manages the setup that Spark/Hadoop requires.
Typical Spark/Hadoop: Requires a lot of manual setup, configuration and optimization.
What are some drawbacks of managing a Hadoop cluster yourself?
- Difficult to scale/add new hardware.
- Utilization below 100% means paying for idle capacity.
- Downtime when upgrading/redistributing tasks.
What is a cluster?
A set of master and worker nodes for processing big data jobs. The master node coordinates the job and distributes (maps) the work to the worker nodes.
Why use nearby zones?
- Lower latency
- Egress (exporting) data might incur costs
What's the difference between cluster masters and workers?
Master: Coordinates the cluster. Splits the work so that workers can process the data in chunks (mapping); the results are aggregated later (reducing).
Worker: Compute power attached to a master node. Receives data and processes it. Workers can be configured as preemptible, in which case they may disappear from the cluster.
What is a preemptible worker?
Unused Google compute capacity that can be allocated and used cheaply. Think of last-minute airplane tickets. It can be reclaimed when someone else requests that capacity.
What can images do for Dataproc?
Image versions let you create clusters with different versions of the software stack.
What is the gcloud tool?
A command-line program for interfacing with Google Cloud services, including creating Dataproc clusters and submitting jobs.
What is PySpark?
A Python interface to the Spark framework for distributed computing.
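As a rough sketch, the two tasks mentioned here map onto the `gcloud dataproc clusters create` and `gcloud dataproc jobs submit pyspark` commands. The Python below only assembles the command lines so they can be inspected; the cluster name, region, worker count and script name are placeholder assumptions, and nothing is actually executed.

```python
import shlex


def create_cluster_command(cluster, region, num_workers):
    """Build the argument list for creating a Dataproc cluster."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--num-workers={num_workers}",
    ]


def submit_pyspark_command(cluster, region, script):
    """Build the argument list for submitting a PySpark job to a cluster."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark", script,
        f"--cluster={cluster}",
        f"--region={region}",
    ]


# Placeholder values, chosen only for illustration.
create_cmd = create_cluster_command("my-cluster", "us-central1", 2)
submit_cmd = submit_pyspark_command("my-cluster", "us-central1", "wordcount.py")

print(shlex.join(create_cmd))
print(shlex.join(submit_cmd))
```

To run a command for real, the list could be passed to `subprocess.run(cmd, check=True)` on a machine where gcloud is installed and authenticated.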
What is the hamburger stack?
The three horizontal lines in the top-left corner of the Google Cloud web console. Clicking them opens the navigation menu.
Which ports does Hadoop use?
YARN ResourceManager web interface: 8088. HDFS NameNode web interface: 50070 in Hadoop 2 (9870 in Hadoop 3). 8080 (unsure).
How can you make custom machines? What can be changed?
Through the web console or command line.
CPU and memory can be changed.
How are data and processing structured in MapReduce?
The data and operations are separated.
How is data stored in a Hadoop system?
Data is typically split into multiple parts on the Hadoop Distributed File System (HDFS). This is called sharding.
What is sharding?
Splitting data into several chunks for processing.
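A minimal sketch of the idea, assuming a simple round-robin assignment of records to shards. HDFS itself splits files into fixed-size blocks rather than dealing out records like this, so the `shard` function here is purely illustrative.

```python
def shard(records, num_shards):
    """Distribute records round-robin across num_shards lists."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        # Record 0 goes to shard 0, record 1 to shard 1, and so on, wrapping around.
        shards[index % num_shards].append(record)
    return shards


records = list(range(10))
shards = shard(records, 3)

for number, contents in enumerate(shards):
    print(number, contents)
```

Each shard can then be handed to a different worker, and no record ends up in more than one shard.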
What is the traditional way of storing data on Hadoop vs. Google's way?
Traditional: Sharded data is transferred to and stored on each node separately.
Google's way: Data is stored in Google Cloud Storage, separate from the cluster nodes.
Describe a traditional Google workflow.
Ingest -> process -> analysis
Pub/Sub -> Dataflow -> BigQuery
What's a problem with keeping data on Hadoop nodes?
If a node dies, its data must be moved.
How should you move data to Hadoop on Dataproc?
1) Move data to GCS.
2) Update prefixes (hdfs:// to gs://).
3) Start using Hadoop on Dataproc as usual.
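Step 2 can be sketched as a small path-rewriting helper. The bucket name and paths below are hypothetical, and the helper assumes the object layout in the bucket mirrors the old HDFS directory layout, which depends on how the data was copied in step 1.

```python
HDFS_PREFIX = "hdfs://"
GCS_PREFIX = "gs://"


def to_gcs_path(path, bucket):
    """Rewrite an hdfs:// path to gs://<bucket>/..., leaving other paths unchanged."""
    if not path.startswith(HDFS_PREFIX):
        return path
    remainder = path[len(HDFS_PREFIX):]
    # Drop the optional namenode host:port part that precedes the first slash.
    _, _, object_path = remainder.partition("/")
    return f"{GCS_PREFIX}{bucket}/{object_path}"


# Hypothetical input paths from an existing job configuration.
old_paths = [
    "hdfs://namenode:8020/data/input.txt",
    "hdfs:///data/input.txt",
    "/local/scratch/tmp.txt",
]

new_paths = [to_gcs_path(path, "my-bucket") for path in old_paths]
print(new_paths)
```

Paths that do not start with hdfs:// are returned untouched, so the helper can be applied to a whole job configuration without affecting local paths.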
How do you install software to a Dataproc stack?
1) Write an init script.
2) Upload it to GCS.
3) Provide it when creating a Dataproc cluster.
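Steps 2 and 3 might look roughly like the commands assembled below: `gsutil cp` to upload the script, then the `--initialization-actions` flag when creating the cluster. The script name, bucket, cluster name and region are placeholder assumptions, and the code only builds the command lines without running them.

```python
def upload_command(local_script, gcs_uri):
    """Build the command that copies the init script to Cloud Storage."""
    return ["gsutil", "cp", local_script, gcs_uri]


def create_with_init_action(cluster, region, init_script_uri):
    """Build the command that creates a cluster and runs the init script on its nodes."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--initialization-actions={init_script_uri}",
    ]


# Placeholder names, chosen only for illustration.
script_uri = "gs://my-bucket/init/install-tools.sh"
upload_cmd = upload_command("install-tools.sh", script_uri)
init_cmd = create_with_init_action("my-cluster", "us-central1", script_uri)

print(" ".join(upload_cmd))
print(" ".join(init_cmd))
```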
What is Hadoop?
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets using the MapReduce programming model.
What is Apache Pig?
Apache Pig is an abstraction over MapReduce. It's a tool/platform for analyzing large datasets by expressing the analysis as a data flow, written in the Pig Latin language.
What is PySpark?
PySpark is a Python library for interacting with Spark.