Domain 1: Data Engineering Flashcards
Data that has a well-defined schema and the metadata needed to interpret it, such as the attributes and their data types.
Structured Data
Tabular data is an example of:
Structured Data
T/F: Depending on the column data type, you may have to perform different actions to prepare the data for machine learning.
True
An attribute in a tabular dataset is a _____, and a _____ corresponds to a data point or an observation.
Column/row
Data that does not have a schema or any well-defined structural properties.
Unstructured Data
What makes up the majority of the data most organizations have?
Unstructured Data
Whose job is it to convert unstructured data into some form of structured data for machine learning, or to train an ML model directly on the unstructured data itself?
Data Scientist
Examples include images, videos, audio files, text documents, or application log files.
Unstructured Data
Data in a format such as JSON or XML, for example data that you may have from a NoSQL database.
Semi-structured Data
T/F: You may need to parse this semi-structured data into structured data to make it useful for machine learning.
True
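As a minimal sketch of the parsing step this card refers to, the snippet below flattens nested JSON records into rows that share a fixed set of columns, using only Python's standard library. The record layout, the field names, and the `flatten` helper are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical semi-structured records, e.g. exported from a NoSQL store.
raw = '''
[
  {"id": 1, "user": {"name": "Ana", "age": 34}, "tags": ["a", "b"]},
  {"id": 2, "user": {"name": "Ben"}}
]
'''

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

records = [flatten(r) for r in json.loads(raw)]

# Build a fixed schema: the union of all keys, in first-seen order.
columns = []
for rec in records:
    for key in rec:
        if key not in columns:
            columns.append(key)

# Missing fields become None so that every row has the same columns.
rows = [[rec.get(col) for col in columns] for rec in records]
```

Here `columns` acts as the schema and `rows` as the tabular data. Fields that a record lacks are filled with `None`, which would still need handling (for example imputation) before model training.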
Data that has one or more target columns (dependent variables) or attributes.
Labeled Data
Data with no target attribute or label.
Unlabeled Data
A column in a tabular dataset besides the label column.
Feature
A row in a tabular dataset that consists of one or more features and can also contain one or more labels.
Data Point
A collection of data points that you will use for model training and validation.
Dataset
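The terms feature, label, data point, and dataset can be illustrated with a small sketch. It assumes a hypothetical CSV table in which the column `churned` is the label; the column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical tabular dataset: each row is a data point,
# each column is an attribute; "churned" is the label column.
csv_text = """age,plan,monthly_spend,churned
34,basic,20.5,0
51,premium,75.0,1
27,basic,18.0,0
"""

label_column = "churned"  # assumed name of the target attribute

reader = csv.DictReader(io.StringIO(csv_text))
dataset = list(reader)  # the dataset: a collection of data points

# Every column other than the label column is a feature.
feature_names = [c for c in reader.fieldnames if c != label_column]

# Split every data point into its features and its label.
features = [[row[name] for name in feature_names] for row in dataset]
labels = [int(row[label_column]) for row in dataset]
```

Note that `csv` reads every value as a string, so the feature values here would still need type conversion before training.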
A feature that can be represented by a continuous number or an integer but is unbounded in nature.
Numerical Feature
A feature that is discrete and qualitative, and can only take on a finite number of values.
Categorical Feature
In most machine learning problems, you need to convert _____ features into _____ features using different techniques.
Categorical/numerical
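One common technique for this conversion is one-hot encoding, which the card itself does not name. The following is a minimal sketch in plain Python; the `one_hot_encode` helper and the `colors` feature are hypothetical, and in practice a library encoder would usually be used instead.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector with one position per category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # mark the position of this value's category
        encoded.append(vector)
    return categories, encoded

colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature
categories, encoded = one_hot_encode(colors)
```

Each original value becomes a numerical vector with exactly one `1`, so the encoding does not impose an artificial order on the categories.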
Images that are usually in different formats such as JPEG or PNG.
Image Data
Popular examples of an _____ are the MNIST handwritten digits dataset and ImageNet.
Image dataset
This data usually consists of audio files in MP3 or WAV formats and can arise from call recordings in call centers.
Audio Data
This data is commonly referred to as a corpus and can consist of collections of documents.
Text Data (Corpus)
_____ can be stored in many formats, such as raw PDF or TXT files, JSON, or CSV.
Text Data
Examples of ________ include the newsgroups dataset, Amazon reviews data, the WikiQA corpus, WordNet, and IMDB reviews.
Popular text corpora