Data Science Process Flashcards

(7 cards)

1
Q

Parts of a Data Science Project

A
  1. Form the Question
  2. Find or Generate the Data
  3. Explore the Data
  4. Model the Data, which typically means using statistical or machine learning techniques to analyze the data and answer your question
  5. Analyze the Data
  6. Determine the Conclusions from Your Analysis
  7. Communicate the Results
2
Q

Step 1, Define the question

Case Study: Is Hilary the most poisoned name in US history

A

Have a well-defined question
Additional questions may pop up as you do the analysis, but knowing what you ultimately want to answer is critical.

Example
Is Hilary/Hillary really the most rapidly poisoned name in recorded American history?

3
Q

Step 2, the data

Case Study: Is Hilary the most poisoned name in US history

A

Hilary collected data from the Social Security website. This dataset included the 1,000 most popular baby names from 1880 until 2011.
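A minimal sketch of how yearly name counts like these might be held in memory. The `year,name,sex,count` layout, the helper name `load_counts`, and every row below are assumptions made up for illustration; the source does not describe the file format, and Hilary's own code was written in R.

```python
import csv
import io

# Invented rows in an assumed "year,name,sex,count" layout (not real SSA figures).
SAMPLE = """year,name,sex,count
1991,Hilary,F,1200
1991,Emily,F,8800
1992,Hilary,F,300
1992,Emily,F,9700
"""

def load_counts(text):
    """Return {(year, sex): {name: count}} parsed from CSV text."""
    counts = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (int(row["year"]), row["sex"])
        counts.setdefault(key, {})[row["name"]] = int(row["count"])
    return counts

counts = load_counts(SAMPLE)
```

Keying by year and sex keeps each year's list separate, which makes year-to-year comparisons straightforward.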

4
Q

Step 3, Data Analysis

Case Study: Is Hilary the most poisoned name in US history

A

After obtaining the data, you have to figure out what to do with it to answer your question.

  1. Hilary calculated the relative risk for each of the 4,110 different names in her dataset from one year to the next, from 1880 to 2011
  2. She wrote code in R to do this, and it is available on GitHub
  3. She then looked at the percentage of babies given each name in a particular year
  4. That gave her the information she needed to answer her question
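The steps above can be sketched roughly as follows. This is an illustration in Python, not Hilary's R code. It assumes "relative risk" means the ratio of a name's share of babies in one year to its share the year before, and it computes shares from the listed names only. The function names and the toy counts are hypothetical.

```python
def shares(counts_by_year):
    """Convert {year: {name: count}} into {year: {name: fraction of listed babies}}."""
    result = {}
    for year, names in counts_by_year.items():
        total = sum(names.values())
        result[year] = {name: n / total for name, n in names.items()}
    return result

def relative_risk(share_by_year, name, year):
    """Share in `year` divided by share in the previous year.

    Returns None when the name is missing (or has zero share) in the
    previous year, or is missing in the current year.
    """
    prev = share_by_year.get(year - 1, {}).get(name)
    curr = share_by_year.get(year, {}).get(name)
    if not prev or curr is None:
        return None
    return curr / prev

# Invented toy counts, for illustration only.
toy = {
    1991: {"Hilary": 20, "Emily": 80},
    1992: {"Hilary": 5, "Emily": 95},
}
s = shares(toy)
rr = relative_risk(s, "Hilary", 1992)  # 0.05 / 0.20, i.e. about 0.25
```

A ratio below 1 means the name lost share from one year to the next; the further below 1, the steeper the drop.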

5
Q

Step 4, Exploratory Data Analysis

Case Study: Is Hilary the most poisoned name in US history

A

Hilary made all of her work available on GitHub, so you can see everything she did to get to the final result.
- Data science projects often involve writing a lot of code and generating a lot of figures that aren’t included in your final results.
- This is part of the data science process. Figuring out how to answer your question of interest can be very time-consuming, and much of that work never shows up in your final project.

6
Q

Step 5, Data Analysis Results

Case Study: Is Hilary the most poisoned name in US history

A

For ANY analysis, consider whether or not the results are what you were expecting!
1. Hilary looked at the names with the biggest drop in percentage from one year to the next
2. Hilary was sixth on the list; she had expected it to be number 1
3. The five names above it looked peculiar to her. None of them were names that were popular for long
4. She found that many of the names became popular all of a sudden and then dropped precipitously
5. She decided to only include names that were in the top 1,000 for more than 20 years, to remove flash-in-the-pan names
6. Her thesis proved correct. Hilary had the quickest fall from popularity, in 1992, of any female baby name between 1880 and 2011. Marian also fell, but it was a slow decline. Hilary dropped off a cliff all of a sudden.
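The filtering step can be sketched as below. This is a hedged illustration, not Hilary's R code. It assumes a name counts as "in the top 1,000" for a year whenever it appears in that year's data, and it measures a drop as one minus the ratio of this year's share to last year's. The function name `biggest_drops` and the toy shares are hypothetical.

```python
def biggest_drops(share_by_year, min_years=20):
    """Rank (loss, name, year) tuples, largest one-year loss first.

    Only names present in more than `min_years` years are kept, which
    removes flash-in-the-pan names. A name that vanishes from the data
    entirely in a given year is not scored in this sketch.
    """
    years_seen = {}
    for names in share_by_year.values():
        for name in names:
            years_seen[name] = years_seen.get(name, 0) + 1

    drops = []
    for year in sorted(share_by_year):
        prev = share_by_year.get(year - 1)
        if prev is None:
            continue
        for name, curr in share_by_year[year].items():
            if years_seen[name] > min_years and prev.get(name, 0) > 0:
                drops.append((1 - curr / prev[name], name, year))
    return sorted(drops, reverse=True)

# Invented toy shares: "Flash" appears for only two years, "Steady" for four.
toy_shares = {
    2000: {"Steady": 0.5},
    2001: {"Steady": 0.5},
    2002: {"Steady": 0.5, "Flash": 0.5},
    2003: {"Steady": 0.4, "Flash": 0.1},
}
unfiltered = biggest_drops(toy_shares, min_years=0)
filtered = biggest_drops(toy_shares, min_years=2)
```

In the toy data the short-lived name tops the unfiltered ranking, and it disappears once the minimum-years filter is applied, which mirrors the effect described in step 5.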

7
Q

Step 6, communicating your results

Case Study: Is Hilary the most poisoned name in US history

A

Most projects build on someone else’s work. Make sure to give them credit!
- Hilary wrote a detailed blog post that communicated the results of her analysis, answered the question she set out to answer, and did so in an entertaining way.
- Hilary gave credit to those whose work she built on by linking to a blog post where someone asked a similar question, to the Social Security website where she got the data, and to where she learned about web scraping.
