Data Flashcards

1
Q

What would be three key components of Data Science (DS)?

According to lecture PPU161_Introduction_to_DataScience_230925.pdf

A
2
Q

What is data mining?

A

Short version: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

  • Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
  • Alternative names: Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.
3
Q

Explain the 3 different data mining process models

A
  • The Knowledge Discovery in Databases (KDD) model is iterative and interactive. It has nine steps in total. It refers to the overall process of finding knowledge in data and emphasizes the high-level application of particular data mining methods.
  • The Cross-Industry Standard Process for Data Mining (CRISP-DM) was launched in late 1996 by DaimlerChrysler (then Daimler-Benz), SPSS (then ISL) and NCR. The model has been refined over the years. It has six steps or phases.
  • The Sample, Explore, Modify, Model, Assess (SEMMA) model was developed by the SAS Institute. It has five phases.
4
Q

Describe the phases of KDD

A

Understanding the Goal: Define the problem you want to address and determine the goal of the knowledge discovery process.

Data Selection: Gather and select the data that is relevant to the problem you’re trying to solve. This step involves choosing the right dataset from various available sources.

Data Preprocessing: Clean and preprocess the data to handle missing values, outliers, and noise. Data preprocessing also involves transforming data into a suitable format for analysis.

Data Transformation: Convert the preprocessed data into appropriate forms for mining. This step can include normalization, aggregation, and other transformations to make the data suitable for the chosen data mining technique.

Data Mining: Apply various data mining techniques to extract patterns, trends, and insights from the transformed data. Common data mining techniques include clustering, classification, regression, and association rule mining.

Pattern Evaluation: Evaluate the discovered patterns to ensure their quality and relevance to the problem at hand. This step involves assessing patterns based on measures like accuracy, precision, recall, and relevance to the problem domain.

Knowledge Representation: Present the discovered knowledge in a comprehensible form, often using visualization techniques. This step is crucial for stakeholders to understand and interpret the results effectively.

Interpretation and Evaluation: Interpret the mined patterns in the context of the problem domain. Evaluate the knowledge discovered to determine its usefulness and effectiveness in addressing the initial goal.

Deployment: Implement and integrate the discovered knowledge into existing systems or processes. Deployed knowledge can lead to informed decision-making and improved outcomes in various applications.

5
Q

Describe the phases of CRISP-DM

From exam 2021-10-23

A

From: PPU161_Introduction_to_DataScience_230925.pdf, p. 27

6
Q

Describe the phases of SEMMA

A
7
Q

Explain/define the following: Artificial Intelligence, Machine Learning and Deep Learning

A
  • Artificial intelligence: Getting machines to do what humans are good at
  • Machine Learning: Feeding an algorithm data to learn and predict something
  • Deep Learning: A subtype of machine learning which utilizes multi-layer neural networks

From lecture: PPU161_Datamining_Visualization_AI_ML_lecture_slides_230928.pdf

8
Q

Big data sources and forms

A

From lecture: PPU161_Introduction_to_DataScience_230925.pdf

9
Q

Big data and its characteristics

A

Big data is any data that is expensive to manage and hard to extract value from due to its associated primary characteristics called the 3Vs:

  • Volume - The size of the data
  • Velocity - The speed at which the data is generated and processed
  • Variety and complexity - The diversity of sources, formats, quality, and structures
  • Other Vs - Veracity, Value, Variability

From lecture: PPU161_Introduction_to_DataScience_230925.pdf

10
Q

Describe the different data preprocessing steps

A
  • Data cleaning - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data integration - Integration of multiple databases, files, or notes
  • Data transformation - Normalization (scaling to a specific range) and summarization/aggregation
  • Data reduction - Obtain a representation that is reduced in volume but produces the same or similar analytical results, e.g. through feature selection, dimensionality reduction, or data compression
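
The cleaning, normalization, and aggregation steps listed above could be sketched roughly as follows. This is only an illustration using the Python standard library: the sensor records, the valid temperature range, and the function names `clean`, `min_max`, and `aggregate` are all hypothetical and not taken from the lecture. Data integration and the other reduction techniques are left out.

```python
from statistics import mean

# Hypothetical toy records (assumption); None marks a missing value.
raw = [
    {"sensor": "a", "temp": 20.0},
    {"sensor": "a", "temp": None},
    {"sensor": "b", "temp": 22.0},
    {"sensor": "b", "temp": 95.0},  # suspicious reading, outside the assumed valid range
    {"sensor": "a", "temp": 21.0},
]

def clean(records, low, high):
    """Data cleaning: drop out-of-range outliers, then fill missing values with the mean."""
    in_range = [r for r in records if r["temp"] is None or low <= r["temp"] <= high]
    fill = mean(r["temp"] for r in in_range if r["temp"] is not None)
    return [{**r, "temp": fill if r["temp"] is None else r["temp"]} for r in in_range]

def min_max(values):
    """Normalization: scale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def aggregate(records):
    """Summarization/aggregation: one mean temperature per sensor."""
    groups = {}
    for r in records:
        groups.setdefault(r["sensor"], []).append(r["temp"])
    return {sensor: mean(temps) for sensor, temps in groups.items()}

cleaned = clean(raw, low=-40.0, high=60.0)
scaled = min_max([r["temp"] for r in cleaned])
per_sensor = aggregate(cleaned)
```

Filling with the mean and dropping out-of-range readings are just two of many possible cleaning choices; which one is appropriate depends on the data.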

From lecture: PPU161_Datamining_Visualization_AI_ML_lecture_slides_230928.pdf

11
Q

Explain supervised learning and under what conditions you would use it.

A

Supervised learning is used when you have training examples with known results (regression and classification problems).

  • Patterns that predict some target value
  • Target/output values do exist and are used

Example: Use the labels to build a model. The model is then used to classify the size of a new house using only its known feature set.
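
A minimal sketch of the house example, assuming a 1-nearest-neighbour classifier (one possible supervised method, not necessarily the one used in the lecture). The training data, the features (number of rooms, price), and the labels "small"/"large" are hypothetical.

```python
import math

# Hypothetical labelled training data (assumption):
# (number of rooms, price in kSEK) -> known size label
training = [
    ((2, 1500), "small"),
    ((3, 2100), "small"),
    ((5, 4200), "large"),
    ((6, 5000), "large"),
]

def predict(features, examples):
    """Return the label of the closest training example (1-nearest neighbour)."""
    # Features are left unscaled here, so price dominates the distance.
    nearest = min(examples, key=lambda ex: math.dist(features, ex[0]))
    return nearest[1]

# A new house with known features but unknown size label.
label = predict((4, 4000), training)
```

The known labels are what make this supervised: the model can only predict classes it has seen examples of.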

12
Q

Explain unsupervised learning and under what conditions you would use it.

A

Unsupervised learning is used when it is not clear which type of information is going to be found.

  • Finding patterns in data without any ground truth
  • Target/output values do not exist
  • Knowledge discovery
  • Most data have this form

Example: Size is missing. We need to look for similarities in the data and group them into clusters.
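
A minimal sketch of the clustering example, assuming plain k-means with two clusters (one common clustering method, chosen here only for illustration). The house data and the choice of starting centroids are hypothetical.

```python
import math

# Hypothetical unlabelled houses (assumption): (number of rooms, price in MSEK).
# No size label exists, so the groups must be found from the features alone.
houses = [(2, 1.5), (3, 2.1), (2, 1.8), (6, 5.0), (5, 4.2), (6, 4.8)]

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids

# Starting centroids are picked by hand to keep the sketch deterministic.
clusters, centroids = k_means(houses, centroids=[houses[0], houses[3]])
```

The algorithm returns two groups of similar houses, but it is up to the analyst to interpret them, for example as "small" and "large".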
