Course 1: Introduction to Data Engineering Flashcards

(110 cards)

1
Q

Entities that form a modern data ecosystem

A

1 Data integrated from disparate sources
2 different types of analysis and skills to generate insights
3 stakeholders to act/collaborate on insights
4 tools, apps, infrastructure to store, process, disseminate data

2
Q

Roles and Responsibilities of Data Engineers

A

1 Extract, integrate, and organize data from disparate sources
2 Clean, transform, and prep data
3 design, store, and manage data in data repositories

3
Q

Data Engineer Competencies

A

1 Programming
2 knowledge of systems and tech architectures
3 understanding of relational and non-relational databases

4
Q

Roles and Responsibilities of Data Analysts

A

1 Inspect and clean data for deriving insights
2 identify correlations/patterns and apply statistical methods to analyze and mine data
3 visualize data to interpret and present findings

5
Q

Data Analyst Competencies

A

1 good knowledge of spreadsheets, query writing, and statistical tools to create visuals
2 programming
3 analytical and storytelling skills

6
Q

Roles and Responsibilities of Data Scientist

A

1 analyze data for actionable insights

2 build machine learning models or deep learning models

7
Q

Data Scientist Competencies

A

1 mathematics
2 statistics
3 fair understanding of programming languages, databases, and building data models
4 domain knowledge

8
Q

Roles and Responsibilities of Business Analysts

A

1 leverage the work of data analysts and data scientists to look at implications for their business and recommend actions

9
Q

Roles and Responsibilities of BI Analysts

A

1 same as business analyst except focus is on market forces and external influences
2 provide BI solutions

10
Q

List tasks in typical data engineering lifecycle

A

1 collect data: extract, integrate, and organize data from disparate sources
2 process data: clean, transform, and prep data
3 store data: for reliability and availability

11
Q

Needs for collecting data

A

1 develop tools, workflows, processes
2 design, build, maintain scalable data architectures
3 store in databases, warehouses, lakes, other repositories

12
Q

Needs for processing data

A

1 implement and maintain distributed systems for large-scale processing
2 design pipelines for extraction, transformation, and loading
3 design solutions for safeguarding, quality, privacy, and security
4 optimize tools, systems, and workflows for performance, reliability, and security
5 ensure adherence to regulatory and compliance guidelines

13
Q

Needs for storing data

A

1 architect/implement data stores
2 ensure scalable systems
3 ensure tools/systems in place for privacy, security, compliance, monitoring, backup, and recovery
4 make data available to users through services, APIs, programs
5 provide interfaces and dashboards to present data
6 ensure measures/checks and balances are in place for secure, rights-based access

14
Q

Elements of data engineering ecosystem

A
1 data
2 data repositories
3 data integration platforms
4 data pipelines
5 languages
6 BI and reporting tools
15
Q

structured data with examples

A

objective facts and numbers that can be collected, exported, stored, and organized in typical databases — SQL databases, spreadsheets, OLTP (online transaction processing) systems

16
Q

semi-structured data with examples

A

has some organizational properties but lacks a rigid schema — emails, binary executables, TCP/IP packets, zipped files

17
Q

unstructured data with examples

A

does not have an easily identifiable structure and cannot be organized in a database of rows and columns — web pages, social media feeds, images, audio files, PDFs

18
Q

standard file formats

A
1 delimited text - .CSV, .TSV
2 microsoft excel open XML spreadsheet - .XLSX
3 extensible markup language - .XML
4 portable document format - .PDF
5 javascript object notation - .JSON
19
Q

delimited text file

A

1 store data as text
2 each value separated by a delimiter, which is one or more characters that act as a boundary between values
3 .CSV or .TSV

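To make the delimiter concrete, here is a minimal Python sketch reading a comma-delimited file with the standard library; the file name and column names are hypothetical:

```python
import csv

# Read a comma-delimited file; the delimiter would be "\t" for a .TSV.
# "sales.csv" is a hypothetical file whose first row holds column names.
with open("sales.csv", newline="") as f:
    reader = csv.DictReader(f, delimiter=",")
    for row in reader:
        print(row["region"], row["amount"])
```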
20
Q

microsoft excel file format

A

1 spreadsheet
2 open file format meaning accessible to other apps
3 can use and save all functions available in excel
4 secure format meaning it cannot save malicious code

21
Q

extensible markup language file format

A

1 markup language with set rules for encoding data
2 readable by humans and machines
3 self-descriptive language
4 platform and programming language independent
5 simpler to share between data systems

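As a small illustration of that human- and machine-readable, self-descriptive structure, a Python sketch parsing a tiny XML document (element names invented for the example):

```python
import xml.etree.ElementTree as ET

# A small XML document; the tags describe the data they enclose.
doc = """
<employees>
  <employee id="1"><name>Ana</name><role>engineer</role></employee>
  <employee id="2"><name>Raj</name><role>analyst</role></employee>
</employees>
"""

root = ET.fromstring(doc)
for emp in root.findall("employee"):
    print(emp.get("id"), emp.findtext("name"), emp.findtext("role"))
```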
22
Q

portable document file format

A

1 developed by adobe
2 present documents independent of app software, hardware, or operating systems
3 can be viewed same way on any device

23
Q

javascript object notation file format

A
1 text-based open standard designed for transmitting structured data over the web
2 language-independent data format
3 can be read in any language
4 easy to use
5 compatible with a wide array of browsers
6 one of the best tools for sharing data
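A minimal Python sketch of that round trip, serializing a record to a JSON string and parsing it back (the record itself is made up):

```python
import json

# Python dict -> JSON text (e.g., for transmitting over the web) and back.
record = {"id": 42, "name": "Ana", "skills": ["SQL", "Python"]}

payload = json.dumps(record)    # serialize to a JSON string
restored = json.loads(payload)  # parse it back; any JSON-aware language can do this
print(payload)
print(restored["skills"][0])    # -> SQL
```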
24
Q

common sources of data

A
1 relational databases
2 flat files and XML databases
3 APIs and web services
4 web scraping
5 data streams and feeds
25
relational database examples
1 Microsoft SQL Server
2 Oracle
3 MySQL
4 IBM Db2
26
APIs and web services
1 multiple users or apps can interact with and obtain data for processing/analysis
2 listens for incoming requests, in the form of user web requests or network requests from apps
3 returns data in plain text, HTML, XML, JSON, or media files
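A minimal sketch of that request/response cycle in Python using the third-party requests library; the endpoint URL and query parameter are placeholders, not a real API:

```python
import requests  # third-party: pip install requests

# Send a web request to a (hypothetical) API endpoint and read the JSON it returns.
response = requests.get(
    "https://api.example.com/v1/quotes",  # placeholder URL
    params={"symbol": "IBM"},
    timeout=10,
)
response.raise_for_status()  # fail loudly on HTTP errors
data = response.json()       # many APIs return JSON, as the card notes
print(data)
```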
27
popular examples of APIs
1 Twitter and Facebook APIs for tweets and posts
2 stock market APIs
3 data lookup and validation APIs
28
web scraping or screen scraping
1 download specific data based on defined parameters
2 can extract text, contact info, images, videos, product items, etc.
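A minimal scraping sketch in Python with requests and BeautifulSoup; the URL and CSS class are hypothetical, so the selector would need adapting to a real page:

```python
import requests                 # pip install requests beautifulsoup4
from bs4 import BeautifulSoup

# Download a page and extract product names based on a defined parameter
# (here, a CSS class). Both the URL and the class name are placeholders.
html = requests.get("https://shop.example.com/catalog", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for item in soup.select(".product-title"):  # hypothetical class name
    print(item.get_text(strip=True))
```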
29
popular uses of web scraping or screen scraping
1 providing pricing comparisons by collecting product details from retailer eCommerce websites
2 generating sales leads through public data
3 extracting data from posts and authors on various forums
4 collecting training and testing data for machine learning models
30
data streams and feeds
aggregating streams of data from instruments, IoT devices, GPS data, computer programs, websites, social media posts
31
popular data stream examples
1 stock market tickers for financial trading
2 retail transactions
3 surveillance and video feeds
4 social media feeds
5 sensors
6 web clicks
7 flight events
32
popular data stream technologies
1 Apache Kafka
2 Apache Spark
3 Apache Storm
33
RSS (really simple syndication) feeds
capturing updated data from online forums and news sites where data is refreshed on an ongoing basis
34
types of languages with usage description
1 query - accessing and manipulating data
2 programming - developing apps and controlling app behavior
3 shell and scripting - ideal for repetitive and time-consuming operational tasks
35
typical operations performed by shell scripts
1 file manipulation
2 program execution
3 system admin tasks
4 installation of complex programs
5 executing routine backups
6 running batches
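These operations are normally written as Bash or PowerShell scripts; purely for illustration, here is the routine-backup item sketched in Python, with hypothetical paths:

```python
import shutil
import time
from pathlib import Path

# Copy a data directory to a timestamped backup folder -- the kind of routine
# backup a scheduled shell script would perform. Paths are hypothetical.
source = Path("data")
target = Path("backups") / time.strftime("data-%Y%m%d-%H%M%S")

shutil.copytree(source, target)
print(f"backed up {source} -> {target}")
```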
36
what is PowerShell and what is it used for?
1 cross-platform automation tool and configuration framework by Microsoft, optimized for working with structured data
2 used for data mining, building GUIs, and creating charts, dashboards, and interactive reports
37
metadata
data that provides info about other data
38
3 main types of metadata with description
1 technical - defines data structures in repositories or platforms
2 process - describes the processes that operate behind business systems like data warehouses, accounting systems, or CRM tools
3 business - info about data described in readily interpretable ways
39
metadata management
includes developing and administering policies and procedures to ensure info can be accessed and integrated from various sources and appropriately shared across enterprise
40
why is metadata management important?
help understand both business context and data lineage, which helps improve data governance
41
data repository
general term for data collected, organized, and isolated so that it can be used for business ops or mined for reporting
42
database
collection of data designed for input, storage, search, and modification
43
DBMS (database management system)
set of programs that creates and maintains the database
44
relational database (RDBMS) and difference from flat files
data organized in a tabular format of rows and columns following a well-defined structure and schema. unlike flat files, it is optimized for data operations and querying
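A minimal sketch of the rows-columns-schema idea using Python's built-in sqlite3 module; the table and data are invented for the example:

```python
import sqlite3

# Rows and columns with a defined schema, queried with SQL -- unlike a flat file.
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO employees (name, role) VALUES (?, ?)",
    [("Ana", "engineer"), ("Raj", "analyst")],
)

for row in conn.execute("SELECT name FROM employees WHERE role = 'engineer'"):
    print(row)  # -> ('Ana',)
conn.close()
```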
45
non-relational databases (NoSQL)
built for speed, flexibility, and scale, making it possible to store data in a schema-less fashion
46
data warehouse
central repository for info from disparate sources consolidated through ETL process that enables analytics and BI
47
big data stores
distributed computational and storage infrastructure to store, scale, and process very large data sets
48
popular cloud relational databases services
1 Amazon Relational Database Service (RDS)
2 Google Cloud SQL
3 IBM Db2 on Cloud
4 Oracle Cloud
5 SQL Azure
49
advantages of relational databases
1 create meaningful info by joining tables
2 flexibility
3 reduced redundancy
4 ease of backup and disaster recovery
5 ACID compliance
50
ACID (Atomicity, Consistency, Isolation, Durability) compliance
data in database remains accurate, consistent, reliable despite failures
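A small sketch of atomicity, one of the four properties, using Python's sqlite3: when a statement in the transaction fails, the whole transaction rolls back, so the data stays consistent (table and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

# Atomicity: the debit and the bad insert form one transaction; when the
# second statement fails, the first is rolled back too.
try:
    with conn:  # transaction scope: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute("INSERT INTO accounts VALUES ('a', 0)")  # PRIMARY KEY clash
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT * FROM accounts").fetchall())  # [('a', 100), ('b', 0)]
```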
51
limitations of relational databases
1 doesn't work well with semi-structured or unstructured data
52
data warehouse typical architecture 3 tiers
1 bottom tier - database servers that extract data from various sources
2 middle tier - OLAP server that lets users process and analyze info coming from multiple database servers
3 top tier - client front-end with tools and apps for querying, reporting, and analyzing
53
popularly used data warehouses
1 Teradata Enterprise Data Warehouse
2 Oracle Exadata
3 IBM Db2 Warehouse on Cloud
4 IBM Netezza Performance Server
5 Amazon Redshift
6 Google BigQuery
7 Cloudera Enterprise Data Hub
8 Snowflake Cloud Data Warehouse
54
data mart
sub-section of data warehouse built specifically for a business function or community of users
55
types of data marts
dependent, independent, hybrid
56
dependent data mart
sub-section of the data warehouse that offers analytical capabilities for a restricted area of the warehouse, thereby providing isolated security and performance
57
independent data mart
created from sources other than enterprise data warehouse, like internal operating systems or external data
58
hybrid data mart
combine inputs from enterprise data warehouse, internal systems, and external data
59
data lake
data repository that can store large amounts of data of any type in its native (raw) format
60
benefits of data lakes
1 can store all types of data
2 can scale based on storage capacity
3 saves the time of defining structures, schemas, and transformations
4 can repurpose data in different ways for many use cases
61
considerations for choice of data repository
1 types of data
2 schema of data
3 performance
4 whether data is at rest or streaming
5 data encryption needs
6 volume
7 storage requirements
8 frequency of access
9 organization's policies
62
data extraction types with description and tools
1 batch processing - data is moved in large chunks from source to target system - tools: Stitch, Blendo
2 stream processing - data is moved in real-time and transformed in transit - tools: Apache Samza, Apache Storm, Apache Kafka
63
types of loading in ETL process with descriptions
1 initial - populating all the data in the repository
2 incremental - applying ongoing updates and modifications periodically
3 full refresh - erasing the contents of one or more tables and reloading with fresh data
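A minimal sqlite3 sketch contrasting the three loading types on a toy table (schema and rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

# initial load: populate the empty table
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

# incremental load: apply only the new records
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(3, 7.25)])

# full refresh: erase the contents and reload with fresh data
conn.execute("DELETE FROM sales")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 12.0), (3, 7.25)])

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())  # -> (3,)
```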
64
popular ETL tools
1 IBM InfoSphere
2 AWS Glue
3 Improvado
4 Skyvia
5 Hevo
6 Informatica PowerCenter
65
advantages of ELT process
1 processing large sets of unstructured and non-relational data
2 shortened cycle between extraction and delivery
3 can ingest data immediately as it becomes available
4 greater flexibility for exploratory analytics
66
data integration
discipline comprising the practices, architectural techniques, and tools that allow orgs to ingest, transform, combine, and provision data across various data types
67
big data
dynamic, large, and disparate volumes of data being created by people, tools, and machines
68
elements of big data
velocity, volume, variety, veracity, value
69
big data velocity
speed at which data accumulates
70
big data volume
scale of the data
71
big data variety
diversity of the data
72
big data veracity
quality and origin of data and conformity to facts and accuracy
73
big data value
ability and need to turn data into value
74
3 open source big data technologies
Hadoop, Hive, Apache Spark
75
hadoop
collection of tools that provides distributed storage and processing of big data
76
hive
data warehouse for data query and analysis built on top of hadoop
77
spark
distributed data analytics framework designed to perform complex data analytics in real-time
78
hadoop benefits
1 better real-time data-driven decisions
2 improved data access and analysis
3 data offload and consolidation
79
4 main hadoop components
1 hadoop distributed file system (HDFS) - storage system for big data that runs on multiple commodity hardware machines connected through a network
2 MapReduce - programming model for processing large data sets in parallel across the cluster
3 YARN - resource manager that schedules jobs and allocates cluster resources
4 Hadoop Common - shared utilities and libraries that support the other components
80
HDFS benefits
1 fast recovery from hardware failures
2 access to streaming data because of high throughput rates
3 accommodation of large datasets because it can scale to hundreds of nodes in a single cluster
4 portability across multiple hardware platforms and compatibility with multiple operating systems
81
hive benefits
1 data warehousing tasks such as ETL, reporting, and data analysis
2 easy access to data via SQL
82
data platform layers
1 collection
2 storage and integration
3 processing
4 analysis and user interface
5 data pipeline
83
data collection layer
1 connect to sources
2 transfer data in streaming, batch, or both modes
3 maintain metadata of collection
84
data collection layer tools
1 Google Cloud Dataflow
2 IBM Streams
3 IBM Streaming Analytics on Cloud
4 Amazon Kinesis
5 Apache Kafka
85
data storage layer
1 store data for processing
2 transform and merge extracted data, logically or physically
3 make data available for processing in streaming or batch modes
86
data storage tools
1 IBM Db2
2 Microsoft SQL Server
3 MySQL
4 Oracle Database
5 PostgreSQL
87
data processing layer
1 read data from storage and apply transformations
2 support popular querying tools and programming languages
3 scale to meet the processing demands of a growing dataset
88
primary considerations for designing a data store
1 type of data
2 volume
3 intended use
4 storage
5 privacy, security, and governance
89
scalability
capability to handle growth in the amount of data, workloads, and users
90
normalization of the database
process of efficiently organizing data in a database
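A small sketch of the idea with Python's sqlite3: a repeated customer attribute is factored into its own table so it is stored once, and a join reassembles the combined view (schema invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without normalization, the customer's city would repeat on every order.
# Normalized: customers get their own table and are referenced by id.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
INSERT INTO customers VALUES (1, 'Ana', 'Lagos');
INSERT INTO orders VALUES (101, 1, 25.0), (102, 1, 40.0);
""")

# The city is stored once; a join reassembles the combined view when needed.
for row in conn.execute("""
    SELECT o.id, c.name, c.city, o.amount
    FROM orders o JOIN customers c ON c.id = o.customer_id
"""):
    print(row)
```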
91
throughput and latency

1 throughput - rate at which info can be read from and written to the storage
2 latency - time it takes to access a specific location in the storage
92
Facets of security in data lifecycle management
1 physical infrastructure
2 network
3 application
4 data
93
3 components to creating an effective strategy for info security (known as CIA triad)
1 Confidentiality - through controlling unauthorized access
2 Integrity - through validating that your resources are trustworthy
3 Availability - ensuring users have access to resources when they need them
94
RSS feeds
data source typically used for capturing updated data from online forums and news sites
95
popular data exchange platforms
1 AWS Data Exchange
2 Crunchbase
3 Lotame
4 Snowflake
96
data exchange platforms
facilitate the exchange of data while ensuring that security and governance are maintained
97
importing data process
combining data from different sources to provide a combined view and a single interface for querying and manipulating the data
98
data wrangling
iterative process that involves data exploration, transformation, validation, and making data available
99
transformation tasks with definition
1 structuring - actions that change the form and schema of your data
2 normalizing - cleaning the database of unused data and reducing redundancy (denormalizing reverses this, combining data for faster querying)
3 cleaning - fixing irregularities in the data
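A minimal cleaning-and-structuring sketch using the third-party pandas library, on an invented extract with a duplicate row and a missing value:

```python
import pandas as pd  # third-party: pip install pandas

# Raw extract with typical irregularities: a duplicate row and a missing value.
df = pd.DataFrame({
    "customer": ["Ana", "Ana", "Raj"],
    "amount": [25.0, 25.0, None],
})

cleaned = (
    df.drop_duplicates()                             # cleaning: drop repeated record
      .fillna({"amount": 0.0})                       # cleaning: fix missing value
      .rename(columns={"customer": "customer_name"}) # structuring: change schema
)
print(cleaned)
```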
100
types of performance threats to data pipelines
1 scalability
2 app failures
3 scheduled jobs not starting on schedule
4 tool incompatibilities
101
performance metrics for a data pipeline with definition
1 latency - time it takes for a service to fulfill a request
2 failures - rate at which a service fails
3 resource utilization - how fully the pipeline's compute, memory, and storage are used
4 traffic - number of user requests received in a given period
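As a small illustration, latency can be measured per request with a timer; failures and traffic follow from counting errors and requests over a time window (the workload function below is a stand-in):

```python
import time

def fulfill_request():
    time.sleep(0.05)  # stand-in for real pipeline work

# latency: time for the service to fulfill one request
start = time.perf_counter()
fulfill_request()
latency = time.perf_counter() - start
print(f"latency: {latency * 1000:.1f} ms")

# failures and traffic are tracked the same way: count errors and requests
# over a time window, then derive rates from the counts.
```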
102
steps to troubleshoot performance issues in data pipeline
1 collect as much info as possible
2 check if you're working with all the right versions of software
3 check the logs and metrics to isolate whether the issue is related to infrastructure, data, software, or a combination
103
performance metrics for a database
1 system outages
2 capacity utilization
3 application slowdown
4 performance of queries
5 conflicting activities being executed by multiple users making requests at the same time
6 batch activities
104
capacity planning
process of determining the optimal hardware and software resources required for performance
105
database monitoring tools
take frequent snapshots of the performance indicators of a database
106
application performance management tools
help measure and monitor the performance of applications by tracking request response times, error messages, and the amount of resources used by each process
107
query performance monitoring tools
gather stats about query throughput, execution performance, resource utilization, and patterns
108
pseudonymization
de-identification process where personally identifiable info is replaced with artificial identifiers so that the data can't be traced back to someone's identity
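A minimal Python sketch: replace the PII field with a keyed-hash pseudonym so the same person always maps to the same artificial identifier; the key and records are hypothetical:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # hypothetical key, stored outside the dataset

def pseudonymize(email: str) -> str:
    # Keyed hash: deterministic, so one person -> one pseudonym, but the
    # mapping can't be reversed without the key.
    return hmac.new(SECRET_KEY, email.encode(), hashlib.sha256).hexdigest()[:12]

records = [{"email": "ana@example.com", "amount": 25.0}]
safe = [{"user": pseudonymize(r["email"]), "amount": r["amount"]} for r in records]
print(safe)  # PII replaced by an artificial identifier
```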
109
data erasure
software-based method of permanently clearing data from a system by overwriting
110
DataOps
collaborative management practice focused on improving the communication, integration, and automation of data flows between data managers and consumers