Module 1: Data Science Process Flashcards

(37 cards)

1
Q

What are the 5 V’s of Big Data?

A

Volume - terabytes, petabytes
Variety - structured, unstructured, multimedia
Velocity - batch, real-time, streams
Veracity - reliability, availability, completeness
Value - insights, foresights, actions, decisions

2
Q

What is Survivorship bias? Give an Example

A

During WW2, the statistician Abraham Wald was asked where to add extra armor to planes to stop them being shot down. He could only observe the planes that returned, so instead of putting armor where their bullet holes were, he recommended armoring the places with no holes: planes hit in those spots had not survived to be observed. Drawing conclusions only from the "survivors" in your data is survivorship bias.

3
Q

Advantages of DBMS? (SSRNHC)

A
  • Separation of data from apps
  • Separation of physical structures and logical structures
  • Relational model and theory
  • Non-procedural query language
  • High-performance query processing
  • Concurrency control and recovery
4
Q

DBMS in the big data era

A
  • The closed-world assumption
  • Software independent of hardware platforms for too long
  • Victim of its own success: extensions are not well supported
  • Limited data types
5
Q

Name the steps of the data science process

A
  1. Problem Formulation
  2. Data Collection
  3. Data preparation
  4. Data analysis
  5. Storytelling
6
Q

Describe the Problem Formulation Step

A

Defining the problem in a clear way: explaining its purpose, who will benefit and in what way, how you will measure success, and whether it is a problem that can be solved in a data-driven way.

7
Q

Describe the Data Collection Step

A

What data do you need? Where will you store it, and for how long? Do you own the data? Is it readily available? What do you have to do to gain access to it? Are you allowed to make a copy of it?

8
Q

Describe the Data Preparation Step

A
  • Is the data fit for the purpose you intend to use it for?
  • How will you assess how fit for purpose the data is?
  • What do you need to do to prepare the data for analysis?
  • What assurances do you have that the data was prepared without bias or quality issues?
9
Q

Describe the data analysis step

A

How will you analyze the data? Why is your approach the right one? What methods, algorithms, and models will you use?

10
Q

Describe the Story telling step

A

How you present the results of your analysis. Is your presentation susceptible to misunderstanding? What are the key messages you want the audience to take away from the findings?

11
Q

What are the 4 Human centered principles for data analytics?

A
  1. Every data problem is a human problem
  2. Assume people are poorly represented in the data you have
  3. If possible visit people in their physical locations
  4. Be critical of your assumptions
12
Q

What questions should you ask of the data with regard to your analysis?

A

Who, What, When, Where, Why, How

13
Q

Why might you want to do data sampling?

A

The volume of data may cause storage and accessibility problems.

You need the convenience of working with a smaller set of data (laptop vs. cluster).

A smaller dataset can preserve the same data properties.

14
Q

What are the two broad types of data sampling?

A

Sampling without replacement
- no duplicate items in the sample; successive draws are dependent

Sampling with replacement - each item added to the sample is not excluded from being drawn again; sampled items are independent
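The two sampling modes map directly onto Python's standard library; a minimal sketch (the population here is invented for illustration):

```python
import random

population = list(range(100))

# Sampling without replacement: random.sample never picks the same
# item twice, so successive draws are dependent on earlier ones.
without = random.sample(population, 10)

# Sampling with replacement: random.choices may repeat items,
# so each draw is independent of the others.
with_repl = random.choices(population, k=10)
```

`random.sample` always yields 10 distinct items here; `random.choices` may or may not contain repeats on any given run.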

15
Q

What are the different types of sampling

A

Simple random sampling - from a population of N items, draw a sample of n items (n < N) so that every item is equally likely to be chosen (probability 1/N on each draw)

Weighted random sampling - assign weights to items to capture a particular interest in the data; items are drawn with probability proportional to their weight

Stratified random sampling:

N items in the dataset, each belonging to one of k strata; select s items from each stratum, giving m = sk items for the sample (WR)

For each stratum, choose each of its s samples uniformly at random.

In some studies you may want to preserve the proportions of the strata; in others you may want to oversample rare strata.
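Stratified sampling can be sketched in a few lines of Python. This is illustrative (the function name and example data are invented), and it draws without replacement within each stratum for simplicity:

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, s, seed=0):
    """Choose s items uniformly at random (without replacement)
    from each stratum, giving m = s * k items for k strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, s))
    return sample

# Oversampling a rare stratum: label "b" is 10% of the data,
# but contributes 50% of the sample.
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
picked = stratified_sample(data, stratum_of=lambda x: x[0], s=5)
```

Equal `s` per stratum oversamples rare strata; to preserve proportions instead, you would make `s` proportional to each stratum's size.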

16
Q

What are the key questions to address during data collection?

A

What data do you need?
Do you need all of it, or just a sample?
Are you authorized to acquire the data?
In what form are you going to store (ingest) the data?

17
Q

What are the different methods of ingestion? (TTCM)

A

Tabular data - CSV, JSON
Text data - document stores such as MongoDB
Complex structured data - relational (SQL) databases
Multimedia data - data lakes and cloud storage

18
Q

What are the 7 Data types?

A
  1. Numeric Data
  2. Text Data
  3. Date time data
  4. Spatial Data
  5. Time series data
  6. Graph Data
  7. Multimedia data
19
Q

What are the dimensions of data quality? (FARCC)

A

Freshness
Accuracy
Reliability
Completeness
Consistency

20
Q

What is data quality

A

The degree to which data can be used for its intended purpose, and the degree to which data accurately represents the real world

21
Q

What are the 3 things to check for to see if your data is fit for use?

A

Data Exploration - Discovering and understanding the quality characteristics of the data through exploratory techniques

Data Transformation - transforming the data through cleaning, curating, repairing

Data Enrichment - Enriching the data with data integration and imputation

22
Q

Explain the difference between data integration and data imputation

A

Data integration is joining different kinds of data together that share a common attribute, such as time.

Data imputation is filling in missing values in the data, i.e., recreating lost data.
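As a toy illustration (the hourly sensor readings are invented), integration joins two datasets on their shared attribute, while imputation fills the gap:

```python
# Two datasets sharing a common attribute: the hour of measurement
temps = {0: 14.1, 1: 13.8, 2: None, 3: 13.0}    # one reading is missing
humidity = {0: 0.71, 1: 0.74, 2: 0.76, 3: 0.75}

# Data integration: join the two datasets on the shared hour attribute
integrated = {h: (temps[h], humidity[h]) for h in temps}

# Data imputation: recreate the lost reading, here with the mean of
# the known temperatures (one simple strategy among many)
known = [t for t in temps.values() if t is not None]
imputed = {h: (t if t is not None else sum(known) / len(known))
           for h, t in temps.items()}
```

Mean imputation is only one option; interpolation or model-based methods may suit time series better.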

23
Q

What are Hindsight, Insight and Foresight? Briefly explain

A

Hindsight - what happened (search and query)
Insight - why is it happening (knowledge discovery)
Foresight - what will happen (prediction)

24
Q

What are the three components of story telling?

A

Narrative
Data
Visual

25
Q

What are the 6 Lessons for Effective Storytelling? (UAEFTT)

A
  1. Understand the context
  2. Choose an appropriate visual to display
  3. Eliminate clutter
  4. Focus attention
  5. Think like a designer
  6. Tell a story
26
Q

What are the Gestalt (visual perception) principles? (PECCCS)

A

Proximity, Enclosure, Closure, Continuity, Connectedness, Similarity
27
Q

What are the styles in data stories?

A

Zoom out, drill down, factors, outliers, compare, intersections, change over time, contrast
28
Q

What are 5 common misconceptions in storytelling? (VVIMS)

A
  1. Visualization is for making data flashy
  2. Visualization is too biased to be useful
  3. It has to be exact
  4. The more info in a single graphic, the better
  5. Software does everything
29
Q

Explain step 1. Understanding the context

A

Who is your audience? What do they need to do, or what actions need to be taken? How can you most effectively communicate?
30
Q

Explain step 2. Choosing an appropriate visual to display

A

Choose the visual that is easiest for your audience to read and that tells an effective visual story.
31
Q

Explain step 3. Eliminate clutter (MML)

A

Minimize the mental load on your audience (apply the visual perception principles).
32
Q

Explain step 4. Focus attention where you want it

A

Use highlighting or color to emphasize certain pieces of information.
33
Q

Explain step 5. Think like a designer

A

Create a clear visual hierarchy, highlight the important elements, eliminate distractions, consider accessibility, don't overcomplicate, and attend to aesthetics (color, alignment, spacing).
34
Q

Explain step 6. Tell a story - using styles in data stories

A

Use zoom out, drill down, factors, outliers, comparison, intersections, change over time, and contrast.
35
Q

What are the pitfalls of data science?

A
  1. Using bad data (data good for one purpose may not suit another)
  2. Putting data before theory (stocks, bitcoin, crime predictions)
  3. Worshipping math (unrealistic assumptions, the black swan problem)
  4. Worshipping computers (blind trust, black boxes, lack of common sense)
  5. Torturing data (p-value hacking, cherry picking)
  6. Fooling yourself (hiding data, hiding methods, lacking repeatability)
  7. Confusing correlation with causation (finding and validating cause and effect is much harder than showing association)
  8. Being surprised by regression towards the mean (extreme observations tend to be followed by ones closer to the mean, whether or not you intervene)
  9. Doing harm (fake news, privacy protection)
36
Q

Data quality dimensions provide a way of defining and measuring various data quality problems. Briefly explain how you would detect and/or measure any two of the following problems: 1. Incomplete data 2. Inconsistent data 3. Duplicate data 4. Stale data 5. Inaccurate data

A

Incomplete data: NULL, empty, or blank values in the cells of a dataset. Detected by searching the data with the required conditions for each category of data.

Inconsistent data: detected by format and cross-field checks (e.g., if a column holds a DATE data type, check that the formatting is consistent, such as DD:MM:YYYY).

Duplicate (redundant) data: check for duplicate unique identifiers (e.g., if the primary key is on name, look for duplicate names).

Stale (out-of-date) data: detected using timestamp checks directly from the dataset or external validation (e.g., a user's address could be checked; if the address is outdated, it is a stale data point).

Inaccurate (invalid) data: detected via range, format, and data type checks. Identify values that fall outside a valid range or don't belong to a valid set of predefined values (e.g., a negative age or an invalid product code). Also check that the data follows the required format, such as ensuring email addresses match the username@domain.com pattern, and that all values in each column have the correct data type.
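A few of these detection checks can be sketched in plain Python (the rows and validation rules below are made up for illustration):

```python
import re
from collections import Counter

rows = [
    {"id": 1, "age": 25, "email": "a@x.com"},
    {"id": 2, "age": None, "email": "b@x.com"},
    {"id": 2, "age": 31, "email": "b@x.com"},
    {"id": 4, "age": -3, "email": "not-an-email"},
]

# Incomplete data: rows containing NULL/empty cells
incomplete = [r for r in rows if any(v in (None, "") for v in r.values())]

# Duplicate data: repeated values in the unique-identifier column
counts = Counter(r["id"] for r in rows)
duplicates = [r for r in rows if counts[r["id"]] > 1]

# Inaccurate/invalid data: range check plus format check
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
invalid = [r for r in rows
           if (r["age"] is not None and r["age"] < 0)
           or not EMAIL.match(r["email"])]
```

In practice these checks would run over a whole table (e.g., with pandas), but the detection logic is the same.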
37
Q

Provide 2 reasons why design thinking is a suitable approach for formulating authentic data science problems.

A
  1. Human-centered approach to problems - ensures all user demands and requirements are identified and met.
  2. Iterative and flexible approach - design thinking supports iterative design and works well with data science's exploratory nature.