Course 1: Introduction to Data Engineering Flashcards

(110 cards)

1
Q

Entities that form a modern data ecosystem

A

1 Data integrated from disparate sources
2 different types of analysis and skills to generate insights
3 stakeholders to act/collaborate on insights
4 tools, apps, infrastructure to store, process, disseminate data

2
Q

Roles and Responsibilities of Data Engineers

A

1 Extract, integrate, and organize data from disparate sources
2 Clean, transform, and prep data
3 design, store, and manage data in data repositories

3
Q

Data Engineer Competencies

A

1 Programming
2 knowledge of systems and tech architectures
3 understanding of relational and non-relational databases

4
Q

Roles and Responsibilities of Data Analysts

A

1 Inspect and clean data for deriving insights
2 identify correlations/patterns and apply statistical methods to analyze and mine data
3 visualize data to interpret and present findings

5
Q

Data Analyst Competencies

A

1 good knowledge of spreadsheets, query writing, and statistical tools to create visuals
2 programming
3 analytical and storytelling skills

6
Q

Roles and Responsibilities of Data Scientist

A

1 analyze data for actionable insights

2 build machine learning models or deep learning models

7
Q

Data Scientist Competencies

A

1 mathematics
2 statistics
3 fair understanding of programming languages, databases, and building data models
4 domain knowledge

8
Q

Roles and Responsibilities of Business Analysts

A

1 leverage the work of data analysts and data scientists to look at implications for their business and recommend actions

9
Q

Roles and Responsibilities of BI Analysts

A

1 same as business analyst except focus is on market forces and external influences
2 provide BI solutions

10
Q

List tasks in typical data engineering lifecycle

A

1 collect data: extract, integrate, and organize data from disparate sources
2 process data: clean, transform, and prep data
3 store data: for reliability and availability

11
Q

Needs for collecting data

A

1 develop tools, workflows, processes
2 design, build, maintain scalable data architectures
3 store in databases, warehouses, lakes, other repositories

12
Q

Needs for processing data

A

1 implement and maintain distributed systems for large-scale processing
2 design pipelines for extraction, transformation, and loading
3 design solutions for safeguarding, quality, privacy, and security
4 optimize tools, systems, and workflows for performance, reliability, and security
5 ensure adherence to regulatory and compliance guidelines

13
Q

Needs for storing data

A

1 architect/implement data stores
2 ensure scalable systems
3 ensure tools/systems in place for privacy, security, compliance, monitoring, backup, and recovery
4 make data available to users through services, APIs, programs
5 provide interfaces and dashboards to present data
6 ensure measures/checks and balances are in place for secure, rights-based access

14
Q

Elements of data engineering ecosystem

A
1 data
2 data repositories
3 data integration platforms
4 data pipelines
5 languages
6 BI and reporting tools
15
Q

structured data with examples

A

objective facts and numbers that can be collected, exported, stored, and organized in typical databases — SQL databases, spreadsheets, OLTP (online transaction processing) systems

16
Q

semi-structured data with examples

A

has some organizational properties but lacks a rigid schema — emails, binary executables, TCP/IP packets, zipped files

17
Q

unstructured data with examples

A

does not have an easily identifiable structure and cannot be organized in a database of rows and columns — web pages, social media feeds, images, audio files, PDFs

18
Q

standard file formats

A
1 delimited text - .CSV, .TSV
2 microsoft excel open XML spreadsheet - .XLSX
3 extensible markup language - .XML
4 portable document format - .PDF
5 javascript object notation - .JSON
19
Q

delimited text file

A

1 store data as text
2 each value separated by a delimiter, which is one or more characters that act as a boundary between values
3 .CSV or .TSV

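To make the delimiter concrete, here is a minimal Python sketch reading a comma-delimited file with the standard library; the file name and column names are hypothetical:

```python
import csv

# Read a comma-delimited file; the delimiter would be "\t" for a .TSV.
# "sales.csv" is a hypothetical file whose first row holds column names.
with open("sales.csv", newline="") as f:
    reader = csv.DictReader(f, delimiter=",")
    for row in reader:
        print(row["region"], row["amount"])
```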
20
Q

microsoft excel file format

A

1 spreadsheet
2 open file format meaning accessible to other apps
3 can use and save all functions available in excel
4 secure format meaning it cannot save malicious code

21
Q

extensible markup language file format

A

1 markup language with set rules for encoding data
2 readable by humans and machines
3 self-descriptive language
4 platform and programming language independent
5 simpler to share between data systems

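As a small illustration of that human- and machine-readable, self-descriptive structure, a Python sketch parsing a tiny XML document (element names invented for the example):

```python
import xml.etree.ElementTree as ET

# A small XML document; the tags describe the data they enclose.
doc = """
<employees>
  <employee id="1"><name>Ana</name><role>engineer</role></employee>
  <employee id="2"><name>Raj</name><role>analyst</role></employee>
</employees>
"""

root = ET.fromstring(doc)
for emp in root.findall("employee"):
    print(emp.get("id"), emp.findtext("name"), emp.findtext("role"))
```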
22
Q

portable document file format

A

1 developed by adobe
2 present documents independent of app software, hardware, or operating systems
3 can be viewed same way on any device

23
Q

javascript object notation file format

A
1 text-based open standard designed for transmitting structured data over the web
2 language-independent data format
3 can be read in any language
4 easy to use
5 compatible with a wide array of browsers
6 one of the best tools for sharing data
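A minimal Python sketch of that round trip, serializing a record to a JSON string and parsing it back (the record itself is made up):

```python
import json

# Python dict -> JSON text (e.g., for transmitting over the web) and back.
record = {"id": 42, "name": "Ana", "skills": ["SQL", "Python"]}

payload = json.dumps(record)    # serialize to a JSON string
restored = json.loads(payload)  # parse it back; any JSON-aware language can do this
print(payload)
print(restored["skills"][0])    # -> SQL
```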
24
Q

common sources of data

A
1 relational databases
2 flat files and XML databases
3 APIs and web services
4 web scraping
5 data streams and feeds
25
relational database examples
1 Microsoft SQL Server
2 Oracle
3 MySQL
4 IBM Db2
26
APIs and web services
1 multiple users or apps can interact with and obtain data for processing/analysis
2 listens for incoming requests, in the form of user web requests or network requests from apps
3 returns data in plain text, HTML, XML, JSON, or media files
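A minimal sketch of that request/response cycle in Python using the third-party requests library; the endpoint URL and query parameter are placeholders, not a real API:

```python
import requests  # third-party: pip install requests

# Send a web request to a (hypothetical) API endpoint and read the JSON it returns.
response = requests.get(
    "https://api.example.com/v1/quotes",  # placeholder URL
    params={"symbol": "IBM"},
    timeout=10,
)
response.raise_for_status()  # fail loudly on HTTP errors
data = response.json()       # many APIs return JSON, as the card notes
print(data)
```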
27
popular examples of APIs
1 Twitter and Facebook APIs for tweets and posts
2 stock market APIs
3 data lookup and validation APIs
28
web scraping or screen scraping
1 download specific data based on defined parameters
2 can extract text, contact info, images, videos, product items, etc.
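A minimal scraping sketch in Python with requests and BeautifulSoup; the URL and CSS class are hypothetical, so the selector would need adapting to a real page:

```python
import requests                 # pip install requests beautifulsoup4
from bs4 import BeautifulSoup

# Download a page and extract product names based on a defined parameter
# (here, a CSS class). Both the URL and the class name are placeholders.
html = requests.get("https://shop.example.com/catalog", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for item in soup.select(".product-title"):  # hypothetical class name
    print(item.get_text(strip=True))
```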
29
popular uses of web scraping or screen scraping
1 providing pricing comparisons by collecting product details from retailer eCommerce websites
2 generating sales leads through public data
3 extracting data from posts and authors on various forums
4 collecting training and testing data for machine learning models
30
data streams and feeds
aggregating streams of data from instruments, IoT devices, GPS data, computer programs, websites, social media posts
31
popular data stream examples
1 stock market tickers for financial trading
2 retail transactions
3 surveillance and video feeds
4 social media feeds
5 sensors
6 web clicks
7 flight events
32
popular data stream technologies
1 Apache Kafka
2 Apache Spark
3 Apache Storm
33
RSS (really simple syndication) feeds
capturing updated data from online forums and news sites where data is refreshed on an ongoing basis
34
types of languages with usage description
1 query - accessing and manipulating data
2 programming - developing apps and controlling app behavior
3 shell and scripting - ideal for repetitive and time-consuming operational tasks
35
typical operations performed by shell scripts
1 file manipulation
2 program execution
3 system admin tasks
4 installation of complex programs
5 executing routine backups
6 running batches
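These operations are normally written as Bash or PowerShell scripts; purely for illustration, here is the routine-backup item sketched in Python, with hypothetical paths:

```python
import shutil
import time
from pathlib import Path

# Copy a data directory to a timestamped backup folder -- the kind of routine
# backup a scheduled shell script would perform. Paths are hypothetical.
source = Path("data")
target = Path("backups") / time.strftime("data-%Y%m%d-%H%M%S")

shutil.copytree(source, target)
print(f"backed up {source} -> {target}")
```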
36
what is PowerShell and what is it used for?
1 cross-platform automation tool and configuration framework by Microsoft, optimized for working with structured data
2 used for data mining, building GUIs, and creating charts, dashboards, and interactive reports
37
metadata
data that provides info about other data
38
3 main types of metadata with description
1 technical - defines data structures in repositories or platforms
2 process - describes the processes that operate behind business systems like data warehouses, accounting systems, or CRM tools
3 business - info about data described in readily interpretable ways
39
metadata management
includes developing and administering policies and procedures to ensure info can be accessed and integrated from various sources and appropriately shared across enterprise
40
why is metadata management important?
help understand both business context and data lineage, which helps improve data governance
41
data repository
general term for data collected, organized, and isolated so that it can be used for business ops or mined for reporting
42
database
collection of data designed for input, storage, search, and modification
43
DBMS (database management system)
set of programs that creates and maintains the database
44
relational database (RDBMS) and difference from flat files
data organized in a tabular format of rows and columns following a well-defined structure and schema. unlike flat files, it is optimized for data operations and querying
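A minimal sketch of the rows-columns-schema idea using Python's built-in sqlite3 module; the table and data are invented for the example:

```python
import sqlite3

# Rows and columns with a defined schema, queried with SQL -- unlike a flat file.
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO employees (name, role) VALUES (?, ?)",
    [("Ana", "engineer"), ("Raj", "analyst")],
)

for row in conn.execute("SELECT name FROM employees WHERE role = 'engineer'"):
    print(row)  # -> ('Ana',)
conn.close()
```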
45
non-relational databases (NoSQL)
built for speed, flexibility, and scale, making it possible to store data in a schema-less fashion
46
data warehouse
central repository for info from disparate sources consolidated through ETL process that enables analytics and BI
47
big data stores
distributed computational and storage infrastructure to store, scale, and process very large data sets
48
popular cloud relational databases services
1 Amazon Relational Database Service (RDS)
2 Google Cloud SQL
3 IBM Db2 on Cloud
4 Oracle Cloud
5 SQL Azure
49
advantages of relational databases
1 create meaningful info by joining tables
2 flexibility
3 reduced redundancy
4 ease of backup and disaster recovery
5 ACID compliance
50
ACID (Atomicity, Consistency, Isolation, Durability) compliance
data in database remains accurate, consistent, reliable despite failures
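A small sketch of atomicity, one of the four properties, using Python's sqlite3: when a statement in the transaction fails, the whole transaction rolls back, so the data stays consistent (table and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

# Atomicity: the debit and the bad insert form one transaction; when the
# second statement fails, the first is rolled back too.
try:
    with conn:  # transaction scope: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
        conn.execute("INSERT INTO accounts VALUES ('a', 0)")  # PRIMARY KEY clash
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT * FROM accounts").fetchall())  # [('a', 100), ('b', 0)]
```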
51
limitations of relational databases
1 doesn't work well with semi-structured or unstructured data
52
data warehouse typical architecture 3 tiers
1 bottom tier - database servers that extract data from various sources
2 middle tier - OLAP server that lets users process and analyze info coming from multiple database servers
3 top tier - client front-end with tools and apps for querying, reporting, and analyzing
53
popularly used data warehouses
1 Teradata Enterprise Data Warehouse
2 Oracle Exadata
3 IBM Db2 Warehouse on Cloud
4 IBM Netezza Performance Server
5 Amazon Redshift
6 Google BigQuery
7 Cloudera Enterprise Data Hub
8 Snowflake Cloud Data Warehouse
54
data mart
sub-section of data warehouse built specifically for a business function or community of users
55
types of data marts
dependent, independent, hybrid
56
dependent data mart
sub-section of the data warehouse that offers analytical capabilities for a restricted area of the warehouse, thereby providing isolated security and performance
57
independent data mart
created from sources other than enterprise data warehouse, like internal operating systems or external data
58
hybrid data mart
combine inputs from enterprise data warehouse, internal systems, and external data
59
data lake
data repository that can store large amounts of data of any type in its native (raw) format
60
benefits of data lakes
1 can store all types of data
2 can scale based on storage capacity
3 saves the time of defining structures, schemas, and transformations
4 can repurpose data in different ways for many use cases
61
considerations for choice of data repository
1 types of data
2 schema of data
3 performance
4 whether data is at rest or streaming
5 data encryption needs
6 volume
7 storage requirements
8 frequency of access
9 organization's policies
62
data extraction types with description and tools
1 batch processing - data is moved in large chunks from source to target system - tools: Stitch, Blendo
2 stream processing - data is moved in real-time and transformed in transit - tools: Apache Samza, Apache Storm, Apache Kafka
63
types of loading in ETL process with descriptions
1 initial - populating all the data in the repository
2 incremental - applying ongoing updates and modifications periodically
3 full refresh - erasing the contents of one or more tables and reloading with fresh data
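A minimal sqlite3 sketch contrasting the three loading types on a toy table (schema and rows invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

# initial load: populate the empty table
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

# incremental load: apply only the new records
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(3, 7.25)])

# full refresh: erase the contents and reload with fresh data
conn.execute("DELETE FROM sales")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 12.0), (3, 7.25)])

print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())  # -> (3,)
```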
64
popular ETL tools
1 IBM InfoSphere
2 AWS Glue
3 Improvado
4 Skyvia
5 Hevo
6 Informatica PowerCenter
65
advantages of ELT process
1 processing large sets of unstructured and non-relational data
2 shortened cycle between extraction and delivery
3 can ingest data immediately as it becomes available
4 greater flexibility for exploratory analytics
66
data integration
discipline comprising the practices, architectural techniques, and tools that allow orgs to ingest, transform, combine, and provision data across various data types
67
big data
dynamic, large, and disparate volumes of data being created by people, tools, and machines
68
elements of big data
velocity, volume, variety, veracity, value
69
big data velocity
speed at which data accumulates
70
big data volume
scale of the data
71
big data variety
diversity of the data
72
big data veracity
quality and origin of data and conformity to facts and accuracy
73
big data value
ability and need to turn data into value
74
3 open source big data technologies
Hadoop, Hive, Apache Spark
75
hadoop
collection of tools that provides distributed storage and processing of big data
76
hive
data warehouse for data query and analysis built on top of hadoop
77
spark
distributed data analytics framework designed to perform complex data analytics in real-time
78
hadoop benefits
1 better real-time data-driven decisions
2 improved data access and analysis
3 data offload and consolidation
79
4 main hadoop components
1 hadoop distributed file system (HDFS) - storage system for big data that runs on multiple commodity hardware machines connected through a network
2 MapReduce - programming model for processing large data sets in parallel across the cluster
3 YARN - resource manager that schedules jobs and allocates cluster resources
4 Hadoop Common - shared utilities and libraries that support the other components
80
HDFS benefits
1 fast recovery from hardware failures
2 access to streaming data because of high throughput rates
3 accommodation of large datasets because it can scale to hundreds of nodes in a single cluster
4 portability across multiple hardware platforms and compatibility with multiple operating systems
81
hive benefits
1 data warehousing tasks such as ETL, reporting, and data analysis
2 easy access to data via SQL
82
data platform layers
1 collection
2 storage and integration
3 processing
4 analysis and user interface
5 data pipeline
83
data collection layer
1 connect to sources
2 transfer data in streaming, batch, or both modes
3 maintain metadata of collection
84
data collection layer tools
1 Google Cloud Dataflow
2 IBM Streams
3 IBM Streaming Analytics on Cloud
4 Amazon Kinesis
5 Apache Kafka
85
data storage layer
1 store data for processing
2 transform and merge extracted data, logically or physically
3 make data available for processing in streaming or batch modes
86
data storage tools
1 IBM Db2
2 Microsoft SQL Server
3 MySQL
4 Oracle Database
5 PostgreSQL
87
data processing layer
1 read data from storage and apply transformations
2 support popular querying tools and programming languages
3 scale to meet the processing demands of a growing dataset
88
primary considerations for designing a data store
1 type of data
2 volume
3 intended use
4 storage
5 privacy, security, and governance
89
scalability
capability to handle growth in the amount of data, workloads, and users
90
normalization of the database
process of efficiently organizing data in a database
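A small sketch of the idea with Python's sqlite3: a repeated customer attribute is factored into its own table so it is stored once, and a join reassembles the combined view (schema invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without normalization, the customer's city would repeat on every order.
# Normalized: customers get their own table and are referenced by id.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
INSERT INTO customers VALUES (1, 'Ana', 'Lagos');
INSERT INTO orders VALUES (101, 1, 25.0), (102, 1, 40.0);
""")

# The city is stored once; a join reassembles the combined view when needed.
for row in conn.execute("""
    SELECT o.id, c.name, c.city, o.amount
    FROM orders o JOIN customers c ON c.id = o.customer_id
"""):
    print(row)
```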
91
throughput and latency

1 throughput - rate at which info can be read from and written to the storage
2 latency - time it takes to access a specific location in the storage
92
Facets of security in data lifecycle management
1 physical infrastructure
2 network
3 application
4 data
93
3 components to creating an effective strategy for info security (known as CIA triad)
1 Confidentiality - through controlling unauthorized access
2 Integrity - through validating that your resources are trustworthy
3 Availability - ensuring users have access to resources when they need them
94
RSS feeds
data source typically used for capturing updated data from online forums and news sites
95
popular data exchange platforms
1 AWS Data Exchange
2 Crunchbase
3 Lotame
4 Snowflake
96
data exchange platforms
facilitate the exchange of data while ensuring that security and governance are maintained
97
importing data process
combining data from different sources to provide a combined view and a single interface for querying and manipulating the data
98
data wrangling
iterative process that involves data exploration, transformation, validation, and making data available
99
transformation tasks with definition
1 structuring - actions that change the form and schema of your data
2 normalizing - cleaning the database of unused data and reducing redundancy (denormalizing reverses this, combining data for faster querying)
3 cleaning - fixing irregularities in the data
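A minimal cleaning-and-structuring sketch using the third-party pandas library, on an invented extract with a duplicate row and a missing value:

```python
import pandas as pd  # third-party: pip install pandas

# Raw extract with typical irregularities: a duplicate row and a missing value.
df = pd.DataFrame({
    "customer": ["Ana", "Ana", "Raj"],
    "amount": [25.0, 25.0, None],
})

cleaned = (
    df.drop_duplicates()                             # cleaning: drop repeated record
      .fillna({"amount": 0.0})                       # cleaning: fix missing value
      .rename(columns={"customer": "customer_name"}) # structuring: change schema
)
print(cleaned)
```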
100
types of performance threats to data pipelines
1 scalability
2 app failures
3 scheduled jobs not starting on schedule
4 tool incompatibilities
101
performance metrics for a data pipeline with definition
1 latency - time it takes for a service to fulfill a request
2 failures - rate at which a service fails
3 resource utilization - how fully the pipeline's compute, memory, and storage are used
4 traffic - number of user requests received in a given period
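As a small illustration, latency can be measured per request with a timer; failures and traffic follow from counting errors and requests over a time window (the workload function below is a stand-in):

```python
import time

def fulfill_request():
    time.sleep(0.05)  # stand-in for real pipeline work

# latency: time for the service to fulfill one request
start = time.perf_counter()
fulfill_request()
latency = time.perf_counter() - start
print(f"latency: {latency * 1000:.1f} ms")

# failures and traffic are tracked the same way: count errors and requests
# over a time window, then derive rates from the counts.
```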
102
steps to troubleshoot performance issues in data pipeline
1 collect as much info as possible
2 check if you're working with all the right versions of software
3 check the logs and metrics to isolate whether the issue is related to infrastructure, data, software, or a combination
103
performance metrics for a database
1 system outages
2 capacity utilization
3 application slowdown
4 performance of queries
5 conflicting activities being executed by multiple users making requests at the same time
6 batch activities
104
capacity planning
process of determining the optimal hardware and software resources required for performance
105
database monitoring tools
take frequent snapshots of the performance indicators of a database
106
application performance management tools
help measure and monitor the performance of applications by tracking request response times, error messages, and the amount of resources used by each process
107
query performance monitoring tools
gather stats about query throughput, execution performance, resource utilization, and patterns
108
pseudonymization
de-identification process where personally identifiable info is replaced with artificial identifiers so that the data can't be traced back to someone's identity
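A minimal Python sketch: replace the PII field with a keyed-hash pseudonym so the same person always maps to the same artificial identifier; the key and records are hypothetical:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # hypothetical key, stored outside the dataset

def pseudonymize(email: str) -> str:
    # Keyed hash: deterministic, so one person -> one pseudonym, but the
    # mapping can't be reversed without the key.
    return hmac.new(SECRET_KEY, email.encode(), hashlib.sha256).hexdigest()[:12]

records = [{"email": "ana@example.com", "amount": 25.0}]
safe = [{"user": pseudonymize(r["email"]), "amount": r["amount"]} for r in records]
print(safe)  # PII replaced by an artificial identifier
```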
109
data erasure
software-based method of permanently clearing data from a system by overwriting
110
DataOps
collaborative management practice focused on improving the communication, integration, and automation of data flows between data managers and consumers