Unit 3 Flashcards
WOOHOO!!! (16 cards)
What is Data Science?
A multidisciplinary field that uses scientific methods,
processes, algorithms, and systems to extract knowledge and
insights from structured, semi-structured and unstructured data.
Data Scientist:
A professional who collects large amounts of data using
analytical, statistical, and programming skills.
Responsible for using data to develop solutions tailored to meet
the organization’s unique needs.
True or false: Data scientists may write programs and develop new algorithms.
True
Data Scientist skills
Programming,
Communication,
Organizational,
Mathematical,
Data analysis,
Problem solving,
Analytical skills.
Quantitative and Qualitative data
Quantitative: numerical (measures quantity)
Qualitative: descriptive
Base of the DIKW pyramid
Data
What is data quality?
A check of the condition of the data.
Measured in: completeness,
uniqueness, consistency, timeliness, validity, and accuracy.
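A few of these dimensions can be scored as simple ratios. The sketch below is only an illustration: the sample records, the function names, and the 0 to 120 age rule are all assumptions made up for this card, not a standard definition.

```python
# Hypothetical sample records; None marks a missing value.
records = [
    {"id": 1, "email": "ana@example.com", "age": 34},
    {"id": 2, "email": None, "age": 29},
    {"id": 2, "email": "bo@example.com", "age": -5},
    {"id": 4, "email": "cy@example.com", "age": 41},
]


def completeness(rows, field):
    """Share of rows where the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, field):
    """Number of distinct values divided by the number of rows."""
    return len({r[field] for r in rows}) / len(rows)


def validity(rows, field, rule):
    """Share of rows whose field value passes the given rule."""
    return sum(bool(rule(r[field])) for r in rows) / len(rows)


print(completeness(records, "email"))                      # one email is missing
print(uniqueness(records, "id"))                           # id 2 appears twice
print(validity(records, "age", lambda a: 0 <= a <= 120))   # -5 fails the rule
```

Consistency, timeliness, and accuracy usually need a second source or a timestamp to compare against, so they are left out of this sketch.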
Knowledge and wisdom
Knowledge is understanding facts (patterns), while wisdom is the ability to apply them thoughtfully and effectively.
Data processing
Turning data into information
Data processing cycle
Collection: gathering raw data
Preparation (data cleaning): removing unnecessary and inaccurate data
Input: data is converted into machine-readable form
Processing: raw data is processed using machine learning and AI algorithms to produce the desired output
Output: data is presented in a user-readable format
Storage: data and metadata are stored for quick retrieval
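The cycle above can be walked through as one small function per stage. This is a minimal sketch under assumptions: the temperature readings, the function names, and the JSON layout are hypothetical, and a plain average stands in for the machine learning step.

```python
import json


def collect():
    # Collection: hypothetical raw readings as they might arrive from a source.
    return ["21.5", "22.0", "", "bad", "23.5"]


def prepare(raw):
    # Preparation: drop entries that cannot be read as numbers.
    cleaned = []
    for item in raw:
        try:
            float(item)
        except ValueError:
            continue
        cleaned.append(item)
    return cleaned


def to_input(cleaned):
    # Input: convert text into a machine-readable form (floats).
    return [float(item) for item in cleaned]


def process(values):
    # Processing: a simple average stands in for the real method.
    return sum(values) / len(values)


def to_output(result):
    # Output: present the result in a user-readable format.
    return f"Average temperature: {result:.1f} C"


def store(values, result):
    # Storage: keep the data together with metadata, serialized as JSON.
    record = {"data": values, "metadata": {"count": len(values), "average": result}}
    return json.dumps(record)


values = to_input(prepare(collect()))
result = process(values)
print(to_output(result))
print(store(values, result))
```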
Alternative cycle of data processing
Input: raw data is supplied for processing
Processing: the data is processed by a suitable method
Output: the outcome of the processing, which provides information
Types of data processing
Classified by the data source and the steps taken to generate output
Batch: data is collected in large amounts and processed in batches; it accumulates and is processed periodically (e.g. weekly or monthly)
Online: data is fed into the CPU as soon as it becomes available, so processing is continuous (e.g. barcode scanning)
Real-time: small amounts of data are processed within seconds (e.g. ATM)
Multiprocessing: data is broken down into frames and processed in parallel within the same computer (e.g. weather forecasting)
Time-sharing: allocates data in time slots to several users simultaneously
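The batch and online styles differ mainly in when the work happens. A rough sketch, assuming a hypothetical list of sale amounts and made-up function names:

```python
def batch_total(transactions):
    # Batch: let the data accumulate, then process it all in one pass.
    return sum(transactions)


def online_totals(transactions):
    # Online: update the result each time a new item becomes available.
    running = 0
    for amount in transactions:
        running += amount
        yield running


sales = [10, 25, 5]  # hypothetical sale amounts
print(batch_total(sales))
print(list(online_totals(sales)))
```

The batch version gives one answer at the end of the period, while the online version has an up-to-date answer after every item.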
What is a data lake?
A centralized storage system that holds vast amounts of raw, unprocessed data in its native format, whether structured, semi-structured, or unstructured, and makes it accessible for analysis, processing, and other uses when needed.
What is a data warehouse?
A centralized storage system designed to store structured and processed data from multiple sources, optimized for querying and analysis to support business intelligence and decision-making.
Data warehouse and data lake difference
Data Warehouse:
Stores structured data (organized into tables and schemas)
Data is processed and cleaned before storage (ETL: Extract, Transform, Load)
Optimized for business analytics and reporting.
Example: Sales records used for generating reports and dashboards.
Data Lake:
Stores raw data in its original form (structured, semi-structured, or unstructured).
Data is processed on demand, not before storage (ELT: Extract, Load, Transform).
Flexible for big data analysis, AI, and machine learning.
Example: Sensor data, images, or logs waiting for later use.
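The ETL versus ELT ordering can be shown with plain lists standing in for the two stores. This is a sketch under assumptions: the raw values, the `transform` helper, and the `warehouse` and `lake` names are hypothetical, and real systems use databases or object storage instead of lists.

```python
raw_events = [" 12.50 ", "7", "n/a", "3.25"]  # hypothetical extracted values


def transform(value):
    """Clean one raw value; return a float, or None if it is unusable."""
    try:
        return float(value.strip())
    except ValueError:
        return None


def clean_all(values):
    # Apply the transform and keep only the usable results.
    return [v for v in (transform(x) for x in values) if v is not None]


# ETL (warehouse): transform first, then load only the cleaned data.
warehouse = clean_all(raw_events)

# ELT (lake): load the raw data as-is, transform later on demand.
lake = list(raw_events)


def query_lake():
    return clean_all(lake)


print(warehouse)
print(lake)
print(query_lake())
```

The lake still holds the unusable "n/a" entry, so a later analysis can decide for itself how to treat it; the warehouse has already discarded it.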