Course 1: Introduction to Data Engineering Flashcards
(110 cards)
Entities that form a modern data ecosystem
1 Data integrated from disparate sources
2 different types of analysis and skills to generate insights
3 stakeholders to act/collaborate on insights
4 tools, apps, infrastructure to store, process, disseminate data
Roles and Responsibilities of Data Engineers
1 Extract, integrate, and organize data from disparate sources
2 Clean, transform, and prep data
3 design, store, and manage data repositories
Data Engineer Competencies
1 Programming
2 knowledge of systems and tech architectures
3 understanding of relational and non-relational databases
Roles and Responsibilities of Data Analysts
1 Inspect and clean data for deriving insights
2 identify correlations/patterns and apply statistical methods to analyze and mine data
3 visualize data to interpret and present findings
Data Analyst Competencies
1 good knowledge of spreadsheets, query writing, and statistical tools to create visuals
2 programming
3 analytical and storytelling skills
Roles and Responsibilities of Data Scientists
1 analyze data for actionable insights
2 build machine learning models or deep learning models
Data Scientist Competencies
1 mathematics
2 statistics
3 fair understanding of programming languages, databases, and building data models
4 domain knowledge
Roles and Responsibilities of Business Analysts
1 leverage work of data analyst and scientists to look at implications for their business and recommend actions
Roles and Responsibilities of BI Analysts
1 same as business analyst except focus is on market forces and external influences
2 provide BI solutions
List tasks in typical data engineering lifecycle
1 collect data: by extracting, integrating, organizing data from disparate sources
2 process data: cleaning, transforming, prepping
3 store data: for reliability, availability
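A minimal sketch of the three lifecycle steps above, using only the Python standard library. The sample data, the `payments` table, and the function names `collect`, `process`, and `store` are hypothetical and chosen for illustration; a real pipeline would read from actual sources and write to a managed repository.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for data extracted from a source system.
RAW_CSV = "id,name,amount\n1, Alice ,10.5\n2,Bob,\n3, Carol ,7\n"

def collect(raw_text):
    """Collect: read rows from a delimited text source."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def process(rows):
    """Process: trim whitespace, drop rows with no amount, cast types."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append((int(row["id"]), row["name"].strip(), float(amount)))
    return cleaned

def store(records, conn):
    """Store: persist the cleaned records in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, for the sketch only
store(process(collect(RAW_CSV)), conn)
stored = conn.execute("SELECT id, name, amount FROM payments ORDER BY id").fetchall()
print(stored)  # [(1, 'Alice', 10.5), (3, 'Carol', 7.0)]
```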
Needs for collecting data
1 develop tools, workflows, processes
2 design, build, maintain scalable data architectures
3 store in databases, warehouses, lakes, other repositories
Needs for processing data
1 implement and maintain distributed systems for large-scale processing
2 design pipelines for extraction, transformation, and loading
3 design solutions for safeguarding, quality, privacy, and security
4 optimize tools, systems, and workflows for performance, reliability, and security
5 ensure adherence to regulatory and compliance guidelines
Needs for storing data
1 architect/implement data stores
2 ensure scalable systems
3 ensure tools/systems in place for privacy, security, compliance, monitoring, backup, and recovery
4 make data available to users through services, APIs, programs
5 interfaces and dashboards to present data
6 ensure measures/checks and balances in place for secure and rights-based access
Elements of data engineering ecosystem
1 data
2 data repositories
3 data integration platforms
4 data pipelines
5 languages
6 BI and reporting tools
structured data with examples
objective facts and numbers that can be collected, exported, stored, and organized in typical databases. Examples: SQL databases, spreadsheets, OLTP (online transaction processing) systems
semi-structured data with examples
has some organizational properties but lacks a rigid schema. Examples: emails, binary executables, TCP/IP packets, zipped files
unstructured data with examples
does not have an easily identifiable structure and cannot be organized in a database of rows and columns. Examples: web pages, social media feeds, images, audio files, PDFs
standard file formats
1 delimited text - .CSV
2 Microsoft Excel Open XML Spreadsheet - .XLSX
3 extensible markup language - .XML
4 portable document format - .PDF
5 JavaScript object notation - .JSON
delimited text file
1 store data as text
2 each value separated by a delimiter, which is one or more characters that act as a boundary between values
3 .CSV or .TSV
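A small sketch of reading delimited text with Python's standard `csv` module. The sample data and the helper name `read_delimited` are made up for illustration. Note that `csv.reader` accepts only a single-character delimiter, so multi-character delimiters would need different handling.

```python
import csv
import io

# The same two rows, once comma-separated (.CSV) and once tab-separated (.TSV).
csv_text = "city,population\nOslo,700000\n"
tsv_text = "city\tpopulation\nOslo\t700000\n"

def read_delimited(text, delimiter):
    """Split each line into values at the given single-character delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

csv_rows = read_delimited(csv_text, ",")
tsv_rows = read_delimited(tsv_text, "\t")

print(csv_rows)              # [['city', 'population'], ['Oslo', '700000']]
print(csv_rows == tsv_rows)  # True: only the delimiter differs
```

Every value comes back as text, which matches point 1 above; converting to numbers is a separate step.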
microsoft excel file format
1 spreadsheet
2 open file format meaning accessible to other apps
3 can use and save all functions available in excel
4 secure format meaning it cannot save malicious code
extensible markup language file format
1 markup language with set rules for encoding data
2 readable by humans and machines
3 self-descriptive language
4 platform and programming language independent
5 simpler to share between data systems
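To illustrate the self-descriptive, machine-readable points above, here is a short sketch that parses an XML snippet with Python's standard `xml.etree.ElementTree` module. The `employees` document and its tag names are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: the tags themselves describe the data they hold.
xml_text = """<employees>
  <employee id="1"><name>Ada</name><role>Data Engineer</role></employee>
  <employee id="2"><name>Lin</name><role>Data Analyst</role></employee>
</employees>"""

root = ET.fromstring(xml_text)

# Turn each <employee> element into a plain dictionary.
employees = [
    {"id": e.get("id"), "name": e.findtext("name"), "role": e.findtext("role")}
    for e in root.findall("employee")
]

print(employees)
```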
portable document file format
1 developed by adobe
2 present documents independent of app software, hardware, or operating systems
3 can be viewed same way on any device
javascript object notation file format
1 text-based open standard designed for transmitting structured data over the web
2 language-independent data format
3 can be read in any programming language
4 easy to use
5 compatible with a wide range of browsers
6 considered one of the best tools for sharing data
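A brief sketch of a JSON round trip using Python's standard `json` module: a structure is serialized to text for transmission and then read back. The `record` contents are hypothetical.

```python
import json

# Hypothetical record with nested and mixed-type values.
record = {"id": 1, "name": "Ada", "skills": ["SQL", "Python"], "active": True}

text = json.dumps(record)    # serialize to a JSON string
restored = json.loads(text)  # parse the string back into Python objects

print(text)                # {"id": 1, "name": "Ada", "skills": ["SQL", "Python"], "active": true}
print(restored == record)  # True
```

Because the serialized form is plain text, a program written in another language could parse the same string.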
common sources of data
1 relational databases
2 flat files and XML datasets
3 APIs and web services
4 web scraping
5 data streams and feeds