Learning from Data Flashcards
(353 cards)
What are the four Vs of Big Data?
- Volume: Scale of Data
- Variety: Different Forms of Data
- Velocity: Analysis of Streaming Data
- Veracity: Uncertainty of Data
What is Structured Data?
- Data that adheres to a data model
- Conforms to a tabular format with defined relationships between rows and columns
- Makes it easier to contextualise and understand the data
- Examples include tables in SQL databases
- Data elements are addressable for effective analysis
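The tabular, addressable nature of structured data can be sketched with a small in-memory SQL table using Python's built-in sqlite3 module (the table and column names below are invented purely for illustration):

```python
import sqlite3

# In-memory relational database: structured data with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'London'), (2, 'Alan', 'Manchester')")

# Each data element is addressable by row and column for analysis.
row = conn.execute("SELECT name FROM customers WHERE city = 'London'").fetchone()
print(row[0])  # -> Ada
conn.close()
```

Because every row conforms to the same schema, queries like the SELECT above can address individual elements directly, which is what makes structured data easy to contextualise and analyse.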
What is unstructured data?
- Data which is not organised according to a preset data model or schema, therefore cannot be stored in a traditional relational database
- 80% - 90% of data generated and collected by organisations is unstructured. It is rich in content but not immediately usable without first being sorted
What is semi-structured data?
- Data that does not adhere to a data model, but has some level of structure
- It contains tags, hierarchies, and other types of markers that give data structure
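As a minimal sketch of semi-structured data, the JSON document below carries tags and a hierarchy but no fixed relational schema (the field names are invented for illustration; parsing uses Python's built-in json module):

```python
import json

# Tags and nesting give this record structure, but no schema forces
# every record to have the same fields.
doc = '{"name": "Ada", "contacts": {"email": "ada@example.com"}, "tags": ["vip"]}'
record = json.loads(doc)

# Hierarchical access via the markers embedded in the data itself.
print(record["contacts"]["email"])  # -> ada@example.com
```

Another record in the same collection could omit "tags" or add new fields entirely, which is exactly the flexibility that distinguishes semi-structured from structured data.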
What is the technology of different data types?
- Structured: Based on relational database table
- Semi-structured: Based on XML/RDF
- Unstructured: Based on character and binary data
What is the transaction management of different data types?
- Structured: Matured transaction and concurrency techniques
- Semi-structured: Transaction is adopted from DBMS, not matured
- Unstructured: No transaction management and no concurrency
What is the version management of different data types?
- Structured: Versioning over tuples, rows, tables
- Semi-structured: Versioning over tuples or graphs is possible
- Unstructured: Versioned as a whole
What is the flexibility of different data types?
- Structured: Schema dependent, less flexible
- Semi-structured: More flexible than structured, but less than unstructured
- Unstructured: More flexible and there is an absence of a schema
What are the analysis methods of different data types?
- Structured: SQL queries
- Semi-structured: Query languages (e.g., Cassandra's CQL, MongoDB's query language)
- Unstructured: Natural language processing, audio analysis, video analysis, text analysis
What is the primary goal of data integration?
The goal of data integration is to combine data from heterogeneous sources into a single coherent data store, providing users with consistent access and delivery of data across various subjects and data structure types. This is particularly useful when data sources are disparate or siloed, such as across different hardware devices, software applications, or operating systems.
Name and describe the five data integration strategies
- Common User Interface (Manual Integration): Data managers manually handle every step of integration, from retrieval to presentation.
- Middleware Data Integration: Uses middleware software to bridge communication between systems, especially legacy and newer systems.
- Application-Based Integration: Software applications locate, retrieve, and integrate data by making it compatible across systems.
- Uniform Data Access: Provides a consistent view of data without moving or altering it, keeping data in its original location.
- Common Data Storage (Data Warehouse): Stores a duplicate copy of data in a central repository for uniform retrieval and presentation.
What are the advantages and disadvantages of the Common User Interface strategy?
- Advantages: Reduced cost, requires little maintenance, integrates a small number of data sources, and gives users total control.
- Disadvantages: Data must be handled manually at each stage, scaling requires changing code, and the process is labor-intensive.
What are the advantages and disadvantages of the Middleware Data Integration strategy?
- Advantages: Middleware software conducts the integration automatically and in the same way each time.
- Disadvantages: Middleware needs to be deployed and maintained.
What are the advantages and disadvantages of the Application-based Integration strategy?
- Advantages: Simplified process, application allows systems to transfer information seamlessly, much of the process is automated.
- Disadvantages: Requires specialist technical knowledge and maintenance, complicated setup.
What are the advantages and disadvantages of the Uniform Access Integration strategy?
- Advantages: Lower storage requirements, provides a simplified view of the data to the end user, easier data access
- Disadvantages: Can compromise data integrity, and data host systems are not designed to handle the amount and frequency of data requests.
What are the advantages and disadvantages of the Common Data Storage strategy?
- Advantages: Reduced burden on the host system, increased data version management control, can run sophisticated queries on a stored copy of the data without compromising data integrity
- Disadvantages: Need to find a place to store a copy of the data, increases storage cost, and requires technical experts to set up the integration and to oversee and maintain the data warehouse.
What percentage of their time do data scientists spend cleaning and organizing data?
Data scientists spend 60% of their time cleaning and organizing data, making it the most time-consuming part of their work.
What is the least enjoyable part of data science according to surveys?
Cleaning and organizing data is the least enjoyable part, cited by 57% of respondents.
What are the three main types of learning in machine learning?
- Supervised Learning: Uses labeled data to learn a mapping from inputs to outputs.
- Unsupervised Learning: Works with unlabeled data to find patterns or groupings.
- Semi-Supervised Learning: Combines both labeled and unlabeled data.
What is the difference between regression and classification in supervised learning?
- Regression: Predicts a continuous quantitative response (e.g., income, stock price).
- Classification: Predicts a qualitative response (e.g., marital status, cancer diagnosis).
What is the general form of a supervised learning model?
The model learns a mapping function: y_p = f(Ω, x)
where:
y_p = predicted output
Ω = model parameters
x = input features
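A minimal concrete instance of y_p = f(Ω, x) is a linear model, sketched below in plain Python (the parameter values are assumed, not learned, purely to illustrate the mapping):

```python
# Linear model: f(Ω, x) = Ω0 + Ω1*x1 + Ω2*x2
# Ω (the model parameters) would normally be learned from labelled data;
# here they are fixed purely for illustration.
omega = [1.0, 2.0, -0.5]  # [bias, weight for x1, weight for x2]

def f(omega, x):
    """Predict y_p from input features x using parameters omega."""
    return omega[0] + sum(w * xi for w, xi in zip(omega[1:], x))

x = [3.0, 4.0]     # input features
y_p = f(omega, x)  # predicted output
print(y_p)         # -> 5.0  (1 + 2*3 - 0.5*4)
```

Training then amounts to choosing the values in Ω that make predictions like y_p match the labelled outputs as closely as possible.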
What are hyperparameters in machine learning?
Hyperparameters are parameters not learned directly from the data but set before training (e.g., learning rate, number of layers in a neural network). They control the learning process and are often tuned for optimal performance.
How is the quality of predictions measured in supervised learning?
A loss function J(y, y_p) quantifies the difference between predicted values y_p and actual values y. The goal is to minimize this function during training.
What is the purpose of a loss function?
The loss function measures how well the model’s predictions match the actual data. It guides the optimization process to adjust model parameters (Ω) for better accuracy.
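The role of the loss function can be sketched with mean squared error and gradient descent on a one-parameter model (the data points and the learning rate, a hyperparameter, are invented for illustration):

```python
# Model: y_p = omega * x   (a single parameter, for simplicity)
# Loss:  J(y, y_p) = mean((y - y_p)^2)   -- mean squared error

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # true relationship is y = 2x

def loss(omega):
    """Mean squared error of predictions omega*x against the labels."""
    return sum((y - omega * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

omega = 0.0
lr = 0.05  # learning rate: a hyperparameter set before training
for _ in range(100):
    # dJ/dOmega = mean(-2 * x * (y - omega*x))
    grad = sum(-2 * x * (y - omega * x) for x, y in zip(xs, ys)) / len(xs)
    omega -= lr * grad  # adjust Ω in the direction that reduces the loss

print(round(omega, 3))  # converges toward 2.0, where the loss is minimal
```

Each step uses the loss function's gradient to tell the optimiser which way to move Ω, which is exactly how the loss "guides the optimization process" described above.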