Data Engineering - Data Transformation for Machine Learning Flashcards

1
Q

What are the two types of data transformation?

A
  • Changing the data structure
  • Cleaning the data
2
Q

What are the 6 issues that might appear in “dirty” data?

A
  1. Inconsistent schema - the names and order of fields vary
  2. Extraneous text - additional unnecessary text in a field
  3. Missing data - empty or N/A values
  4. Redundant information - the same data is available in several fields
  5. Contextual errors - the data is valid but wrong in the real-world context
  6. Junk values - meaningless data in fields
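Several of these issues can be handled with simple field-level cleaning. A minimal Python sketch, assuming illustrative record and field names, that handles two of the issues above (missing data and junk values):

```python
# Hedged sketch: cleaning "missing data" and "junk values" in a small
# list of records; the field names and sentinel values are illustrative.
RAW = [
    {"name": "Ada", "age": "36"},
    {"name": "Bob", "age": "N/A"},  # missing data
    {"name": "Cyd", "age": "###"},  # junk value
]

def clean_age(value):
    """Return an int age, or None when the field is missing or junk."""
    if value in ("", "N/A", None):
        return None
    return int(value) if value.isdigit() else None

cleaned = [{**record, "age": clean_age(record["age"])} for record in RAW]
```

In a real pipeline this kind of per-field logic would typically run at scale inside a Spark or Glue job rather than plain Python.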
3
Q

Describe Apache Spark

A

A data processing framework that can quickly perform processing tasks on very large datasets. It can run on Amazon EMR.

4
Q

Describe Amazon EMR

A

A managed cluster platform that simplifies running big data frameworks such as Apache Hadoop and Apache Spark.

5
Q

Which languages does Apache Spark support?

A

Java, Scala, SQL and Python

6
Q

How can Apache Spark be used with Amazon SageMaker?

A

When run together, Spark is used for pre-processing data and SageMaker for model training and hosting.

7
Q

Describe Amazon Athena

A

A serverless, interactive analytics service built on open-source frameworks, supporting open table and file formats. It allows you to run standard SQL queries on datasets in S3. It requires a Data Catalogue to understand the structure of the data in S3.
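As a sketch of what querying S3 data with Athena looks like, the parameters below are what one would pass to boto3's `start_query_execution`; the database, table, and bucket names are illustrative, not real resources:

```python
# Hedged sketch: parameters for Athena's start_query_execution (boto3).
# The database, table, and output bucket names are illustrative.
QUERY = """
SELECT customer_id, COUNT(*) AS orders
FROM sales.orders          -- table defined in the Glue Data Catalogue
GROUP BY customer_id
"""

query_params = {
    "QueryString": QUERY,
    "QueryExecutionContext": {"Database": "sales"},
    "ResultConfiguration": {"OutputLocation": "s3://my-athena-results/"},
}
# To actually run it (requires AWS credentials):
# import boto3
# boto3.client("athena").start_query_execution(**query_params)
```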

8
Q

Describe AWS Glue

A

An ETL service that, for example, transfers data from a raw S3 bucket to a processed S3 bucket.

9
Q

What is a Glue crawler?

A

A configurable Glue item that can scan a data source such as a database or an S3 bucket and populate a Glue database. It first identifies the format of the data, then analyses its structure, and finally creates tables describing the data in the Glue database.
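A crawler can be defined through boto3's `create_crawler`. A minimal sketch, in which the crawler name, role ARN, database, and S3 path are illustrative:

```python
# Hedged sketch: parameters for boto3's glue.create_crawler.
# The name, role ARN, database, and S3 path are illustrative.
crawler_params = {
    "Name": "raw-sales-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "sales",  # Glue database the crawler populates
    "Targets": {"S3Targets": [{"Path": "s3://raw-sales-bucket/orders/"}]},
}
# To create and start it (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_params)
# glue.start_crawler(Name=crawler_params["Name"])
```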

10
Q

What is a Glue trigger?

A

A configurable Glue item that can start Glue crawlers and jobs. Triggers can be configured to start on a schedule or when an event is detected.
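A scheduled trigger can be defined through boto3's `create_trigger`. A minimal sketch, in which the trigger name, schedule, and job name are illustrative:

```python
# Hedged sketch: parameters for boto3's glue.create_trigger.
# The trigger name, cron schedule, and job name are illustrative.
trigger_params = {
    "Name": "nightly-etl-trigger",
    "Type": "SCHEDULED",                  # alternatives: CONDITIONAL, ON_DEMAND
    "Schedule": "cron(0 2 * * ? *)",      # every night at 02:00 UTC
    "Actions": [{"JobName": "clean-sales-data"}],
    "StartOnCreation": True,
}
# To create it (requires AWS credentials):
# import boto3
# boto3.client("glue").create_trigger(**trigger_params)
```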

11
Q

How does the Glue crawler recognise the format of the data?

A

A prioritised list of data classifiers is used to recognise the format.

12
Q

Describe the Glue Database

A

Comprises the tables that have been created by the Glue Crawler. The tables describe the data structure and are used to retrieve the data and manipulate it during data transformation.

13
Q

What type of store is the Glue Database?

A

Apache Hive Metastore

14
Q

Where does the Glue Database sit?

A

In the Glue Data Catalogue. There is one per region, per account, and it holds all the Glue Databases that have been created by Glue Crawlers.

15
Q

Describe a Glue Job

A

A PySpark or Python program that can access the source data via the Glue Databases. It makes use of an Apache Spark cluster to provide processing power. The Glue Job processes the data, performing data transformation and cleansing.
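A minimal sketch of what such a PySpark Glue Job script looks like. It is held as a string here because the `awsglue` modules only exist inside the Glue runtime; the database, table, field, and bucket names are illustrative:

```python
# Hedged sketch of a PySpark Glue Job script, kept as a string because
# the awsglue modules are only importable inside the Glue runtime.
# The database, table, field, and bucket names are illustrative.
GLUE_JOB_SCRIPT = """
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a table the crawler created in the Glue Database.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="sales", table_name="orders")

# Cleanse: drop a junk field, then write to the processed bucket.
clean = dyf.drop_fields(["junk_field"])
glue_context.write_dynamic_frame.from_options(
    frame=clean, connection_type="s3",
    connection_options={"path": "s3://processed-sales-bucket/orders/"},
    format="parquet")
job.commit()
"""
```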

16
Q

Which parameters can you change in Glue Jobs?

A

The instance type used by Glue and the number of workers that can be created.
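These two parameters appear when the job is defined, e.g. via boto3's `create_job`. A minimal sketch in which the job name, role ARN, and script location are illustrative:

```python
# Hedged sketch: parameters for boto3's glue.create_job, showing the two
# tunables above (worker/instance type and worker count). The job name,
# role ARN, and script location are illustrative.
job_params = {
    "Name": "clean-sales-data",
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
    "Command": {
        "Name": "glueetl",  # Spark ETL job
        "ScriptLocation": "s3://my-glue-scripts/clean_sales_data.py",
        "PythonVersion": "3",
    },
    "GlueVersion": "4.0",
    "WorkerType": "G.1X",    # the worker/instance type
    "NumberOfWorkers": 10,   # how many workers may be created
}
# To create it (requires AWS credentials):
# import boto3
# boto3.client("glue").create_job(**job_params)
```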

17
Q

Which languages do programs executed by Glue need to be written in?

A

PySpark + Python

18
Q

Which services are available to prepare data for ML?

A

Glue, SageMaker, EMR, Athena, Redshift Spectrum, Data Pipeline

19
Q

What aspect of data preparation can EMR be used for?

A

EMR can be used for all aspects of data preparation

20
Q

Under what circumstances would it be a good idea to run SageMaker jobs in EMR?

A

If the data, or part of the data preparation processing, is already hosted in EMR, then it's best if all other processes are run in EMR as well.

21
Q

What processing must occur before Amazon Redshift Spectrum can be used for ad-hoc ETL processing?

A

The data must be processed by Glue crawlers to discover the data’s format and structure and to populate tables in the Data Catalogue.

22
Q

How is metadata loaded into the Data Catalogue?

A

Glue Crawlers

23
Q

When using Glue to run Spark, what are the two programming languages supported?

A

Scala and PySpark