Module 1: Data Science Process Flashcards

(37 cards)

1
Q

What are the 5 V’s of Big Data?

A

Volume - terabytes, petabytes
Variety - structured, unstructured, multimedia
Velocity - batch, real-time, streams
Veracity - reliability, availability, completeness
Value - insights, foresights, actions, decisions

2
Q

What is Survivorship bias? Give an Example

A

During WW2, the statistician Abraham Wald was asked where to add extra armor to planes to stop them being shot down. He could only observe the planes that returned, so instead of putting armor where their bullet holes were, he recommended armoring the places with no holes: planes hit in those spots had not survived to be observed. Drawing conclusions only from the "survivors" in your data is survivorship bias.

3
Q

Advantages of DBMS? (SSRNHC)

A
  • Separation of data from apps
  • Separation of physical structures and logical structures
  • Relational model and theory
  • Non-procedural query language
  • High-performance query processing
  • Concurrency control and recovery
4
Q

DBMS in the big data era

A
  • The closed-world assumption
  • Software independent of hardware platforms for too long
  • Victim of its own success: extensions are not well supported
  • Limited data types
5
Q

Name the steps of the data science process

A
  1. Problem Formulation
  2. Data Collection
  3. Data preparation
  4. Data analysis
  5. Storytelling
6
Q

Describe the Problem Formulation Step

A

Defining the problem in a clear way: explaining its purpose, who will benefit and in what way, how you will measure success, and whether it is a problem that can be solved in a data-driven way.

7
Q

Describe the Data Collection Step

A

What data do you need? Where will you store it, and for how long? Do you own the data? Is it readily available? What do you have to do to gain access to it? Are you allowed to make a copy of it?

8
Q

Describe the Data Preparation Step

A
  • Is the data fit for the purpose you intend to use it for?
  • How will you assess how fit for purpose the data is?
  • What do you need to do to prepare the data for analysis?
  • What assurances do you have that the data was prepared without bias or quality issues?
9
Q

Describe the data analysis step

A

How will you analyze the data? Why is your approach the right one? What methods, algorithms, and models will you use?

10
Q

Describe the Story telling step

A

How you present the results of your analysis. Is your presentation susceptible to misunderstanding? What are the key messages you want the audience to take away from the findings?

11
Q

What are the 4 Human centered principles for data analytics?

A
  1. Every data problem is a human problem
  2. Assume people are poorly represented in the data you have
  3. If possible visit people in their physical locations
  4. Be critical of your assumptions
12
Q

What questions should you ask of the data with regard to your analysis?

A

Who, What, When, Where, Why, How

13
Q

Why might you want to do data sampling?

A

The volume of data may cause storage and accessibility problems.

You need the convenience of working with a smaller set of data (laptop vs. cluster).

A smaller dataset can preserve the same data properties.

14
Q

What are the two broad types of data sampling?

A

Sampling without replacement
- no duplicate items in the sample; successive draws are dependent

Sampling with replacement - each item added to the sample is not excluded from being drawn again; sampled items are independent
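The two sampling modes map directly onto Python's standard library; a minimal sketch (the population here is invented for illustration):

```python
import random

population = list(range(100))

# Sampling without replacement: random.sample never picks the same
# item twice, so successive draws are dependent on earlier ones.
without = random.sample(population, 10)

# Sampling with replacement: random.choices may repeat items,
# so each draw is independent of the others.
with_repl = random.choices(population, k=10)
```

`random.sample` always yields 10 distinct items here; `random.choices` may or may not contain repeats on any given run.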

15
Q

What are the different types of sampling

A

Simple random sampling - from a population of N items, draw a sample of n items (n < N) so that every item is equally likely to be chosen (probability 1/N on each draw)

Weighted random sampling - assign weights to items to capture a particular interest in the data; items are drawn with probability proportional to their weight

Stratified random sampling:

N items in the dataset, each belonging to one of k strata; select s items from each stratum, giving m = sk items for the sample (WR)

For each stratum, choose each of its s samples uniformly at random.

In some studies you may want to preserve the proportions of the strata; in others you may want to oversample rare strata.
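Stratified sampling can be sketched in a few lines of Python. This is illustrative (the function name and example data are invented), and it draws without replacement within each stratum for simplicity:

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, s, seed=0):
    """Choose s items uniformly at random (without replacement)
    from each stratum, giving m = s * k items for k strata."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, s))
    return sample

# Oversampling a rare stratum: label "b" is 10% of the data,
# but contributes 50% of the sample.
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]
picked = stratified_sample(data, stratum_of=lambda x: x[0], s=5)
```

Equal `s` per stratum oversamples rare strata; to preserve proportions instead, you would make `s` proportional to each stratum's size.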

16
Q

What are the key questions to address during data collection?

A

What data do you need?
Do you need all of it, or just a sample?
Are you authorized to acquire the data?
In what form are you going to store (ingest) the data?

17
Q

What are the different methods of ingestion? (TTCM)

A

Tabular data - CSV, JSON
Text data - document stores such as MongoDB
Complex structured data - relational (SQL) databases
Multimedia data - data lakes and cloud storage

18
Q

What are the 7 Data types?

A
  1. Numeric Data
  2. Text Data
  3. Date time data
  4. Spatial Data
  5. Time series data
  6. Graph Data
  7. Multimedia data
19
Q

What are the dimensions of data quality? (FARCC)

A

Freshness
Accuracy
Reliability
Completeness
Consistency

20
Q

What is data quality

A

The degree to which data can be used for its intended purpose, and the degree to which data accurately represents the real world

21
Q

What are the 3 things to check for to see if your data is fit for use?

A

Data Exploration - Discovering and understanding the quality characteristics of the data through exploratory techniques

Data Transformation - transforming the data through cleaning, curating, repairing

Data Enrichment - Enriching the data with data integration and imputation

22
Q

Explain the difference between data integration and data imputation

A

Data integration is joining different kinds of data together that share a common attribute, such as time.

Data imputation is filling in missing values in the data, i.e., recreating lost data.
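As a toy illustration (the hourly sensor readings are invented), integration joins two datasets on their shared attribute, while imputation fills the gap:

```python
# Two datasets sharing a common attribute: the hour of measurement
temps = {0: 14.1, 1: 13.8, 2: None, 3: 13.0}    # one reading is missing
humidity = {0: 0.71, 1: 0.74, 2: 0.76, 3: 0.75}

# Data integration: join the two datasets on the shared hour attribute
integrated = {h: (temps[h], humidity[h]) for h in temps}

# Data imputation: recreate the lost reading, here with the mean of
# the known temperatures (one simple strategy among many)
known = [t for t in temps.values() if t is not None]
imputed = {h: (t if t is not None else sum(known) / len(known))
           for h, t in temps.items()}
```

Mean imputation is only one option; interpolation or model-based methods may suit time series better.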

23
Q

What are Hindsight, Insight and Foresight? Briefly explain

A

Hindsight - what happened (search and query)
Insight - why is it happening (knowledge discovery)
Foresight - what will happen (prediction)

24
Q

What are the three components of story telling?

A

Narrative
Data
Visual

25
Q

What are the 6 Lessons for Effective Storytelling? (UAEFTT)

A
  1. Understand the context
  2. Choose an appropriate visual to display
  3. Eliminate clutter
  4. Focus attention
  5. Think like a designer
  6. Tell a story
26
Q

What are the Gestalt (visual perception) principles? (PECCCS)

A

Proximity, Enclosure, Closure, Continuity, Connectedness, Similarity
27
Q

What are the styles in data stories?

A

Zoom out, drill down, factors, outliers, compare, intersections, change over time, contrast
28
Q

What are 5 common misconceptions in storytelling? (VVIMS)

A
  1. Visualization is for making data flashy
  2. Visualization is too biased to be useful
  3. It has to be exact
  4. The more info in a single graphic, the better
  5. Software does everything
29
Q

Explain step 1. Understanding the context

A

Who is your audience? What do they need to do, or what actions need to be taken? How can you most effectively communicate?
30
Q

Explain step 2. Choosing an appropriate visual to display

A

Choose the visual that is easiest for your audience to read and that tells an effective visual story.
31
Q

Explain step 3. Eliminate clutter (MML)

A

Minimize the mental load on your audience (apply the visual perception principles).
32
Q

Explain step 4. Focus attention where you want it

A

Use highlighting or color to emphasize certain pieces of information.
33
Q

Explain step 5. Think like a designer

A

Create a clear visual hierarchy, highlight the important elements, eliminate distractions, consider accessibility, don't overcomplicate, and attend to aesthetics (color, alignment, spacing).
34
Q

Explain step 6. Tell a story - using styles in data stories

A

Use zoom out, drill down, factors, outliers, comparison, intersections, change over time, and contrast.
35
Q

What are the pitfalls of data science?

A
  1. Using bad data (data good for one purpose may not suit another)
  2. Putting data before theory (stocks, bitcoin, crime predictions)
  3. Worshipping math (unrealistic assumptions, the black swan problem)
  4. Worshipping computers (blind trust, black boxes, lack of common sense)
  5. Torturing data (p-value hacking, cherry picking)
  6. Fooling yourself (hiding data, hiding methods, lacking repeatability)
  7. Confusing correlation with causation (finding and validating cause and effect is much harder than showing association)
  8. Being surprised by regression towards the mean (extreme observations tend to be followed by ones closer to the mean, whether or not you intervene)
  9. Doing harm (fake news, privacy protection)
36
Q

Data quality dimensions provide a way of defining and measuring various data quality problems. Briefly explain how you would detect and/or measure any two of the following problems: 1. Incomplete data 2. Inconsistent data 3. Duplicate data 4. Stale data 5. Inaccurate data

A

Incomplete data: NULL, empty, or blank values in the cells of a dataset. Detected by searching the data with the required conditions for each category of data.

Inconsistent data: detected by format and cross-field checks (e.g., if a column holds a DATE data type, check that the formatting is consistent, such as DD:MM:YYYY).

Duplicate (redundant) data: check for duplicate unique identifiers (e.g., if the primary key is on name, look for duplicate names).

Stale (out-of-date) data: detected using timestamp checks directly from the dataset or external validation (e.g., a user's address could be checked; if the address is outdated, it is a stale data point).

Inaccurate (invalid) data: detected via range, format, and data type checks. Identify values that fall outside a valid range or don't belong to a valid set of predefined values (e.g., a negative age or an invalid product code). Also check that the data follows the required format, such as ensuring email addresses match the username@domain.com pattern, and that all values in each column have the correct data type.
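A few of these detection checks can be sketched in plain Python (the rows and validation rules below are made up for illustration):

```python
import re
from collections import Counter

rows = [
    {"id": 1, "age": 25, "email": "a@x.com"},
    {"id": 2, "age": None, "email": "b@x.com"},
    {"id": 2, "age": 31, "email": "b@x.com"},
    {"id": 4, "age": -3, "email": "not-an-email"},
]

# Incomplete data: rows containing NULL/empty cells
incomplete = [r for r in rows if any(v in (None, "") for v in r.values())]

# Duplicate data: repeated values in the unique-identifier column
counts = Counter(r["id"] for r in rows)
duplicates = [r for r in rows if counts[r["id"]] > 1]

# Inaccurate/invalid data: range check plus format check
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
invalid = [r for r in rows
           if (r["age"] is not None and r["age"] < 0)
           or not EMAIL.match(r["email"])]
```

In practice these checks would run over a whole table (e.g., with pandas), but the detection logic is the same.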
37
Q

Provide 2 reasons why design thinking is a suitable approach for formulating authentic data science problems.

A
  1. Human-centered approach to problems - ensures all user demands and requirements are identified and met.
  2. Iterative and flexible approach - design thinking supports iterative design and works well with data science's exploratory nature.