Data Science Process Flashcards
(7 cards)
Parts of a Data Science Project
- Forming the Question
- Finding or Generating the Data
- Explore the Data
- Model the Data, which typically means using statistical or machine learning techniques to analyze the data and answer your question
- Analyze the Data
- Determine the Conclusions from your Analysis
- Communicate the Results
Step 1, Define the question
Case Study: Is Hilary the most poisoned name in US history?
Have a well-defined question
Additional questions may pop up as you do the analysis, but knowing what you ultimately want to answer is critical.
Example
Is Hilary/Hillary really the most rapidly poisoned name in recorded American history?
Step 2, the data
Hilary, the analyst who posed the question, collected data from the Social Security website. The dataset included the 1,000 most popular baby names for each year from 1880 to 2011.
Step 3, Data Analysis
After obtaining data, you have to figure out what you’re going to do with the data to answer your question.
- Hilary calculated the relative risk for each of the 4,110 different names in her dataset from one year to the next, for every year from 1880 to 2011
- She wrote code in R to do this, and it is available on GitHub
- She then looked at the percentage of babies named each name in a particular year
- That gave her information to answer her question
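The two calculations above can be sketched in a few lines. This is an illustrative Python sketch, not Hilary's R code: the toy records are made up, the function names `yearly_share` and `relative_risk` are hypothetical, and "relative risk" is assumed here to mean a name's share of babies in one year divided by its share in the previous year.

```python
from collections import defaultdict

# Made-up toy records: (year, name, babies given that name, total babies that year).
records = [
    (1991, "Hilary", 1200, 100000),
    (1992, "Hilary", 1100, 100000),
    (1993, "Hilary", 400, 100000),
    (1991, "Emma", 2000, 100000),
    (1992, "Emma", 2100, 100000),
    (1993, "Emma", 2205, 100000),
]

def yearly_share(records):
    """Map name -> {year: percentage of that year's babies given the name}."""
    share = defaultdict(dict)
    for year, name, count, total in records:
        share[name][year] = 100.0 * count / total
    return share

def relative_risk(share):
    """Map name -> {year: share in that year / share in the previous year}.

    Assumed definition: a value below 1 means the name lost ground.
    Years with no data for the previous year are skipped.
    """
    risk = {}
    for name, by_year in share.items():
        risk[name] = {
            year: by_year[year] / by_year[year - 1]
            for year in sorted(by_year)
            if by_year.get(year - 1, 0) > 0
        }
    return risk

share = yearly_share(records)
risk = relative_risk(share)
print(round(risk["Hilary"][1993], 3))  # share in 1993 relative to 1992
```

A ratio well below 1, as for the toy "Hilary" row in 1993, is the kind of sudden drop the case study is looking for.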
Step 4, Exploratory Data Analysis
Hilary made all of her work available on GitHub, so you can see everything she did to get to the final result.
- Data science projects often involve writing a lot of code and generating a lot of figures that aren’t included in your final results.
- This is part of the data science process. Figuring out how to answer your question of interest can be very time-consuming, and much of that work never shows up in the final project.
Step 5, Data Analysis Results
For ANY analysis, consider whether the results are what you were expecting!
1. Hilary looked at the names with the biggest drop in percentage from one year to the next
2. Hilary was sixth on the list, although she had expected it to be number 1
3. The first 5 names looked peculiar to her. None of them were names that had been popular for long
4. She found that many of these names became popular all of a sudden and then dropped precipitously
5. She decided to include only names that were in the top 1,000 for more than 20 years, to remove flash-in-the-pan names
6. Her thesis proved correct. Hilary's drop in 1992 was the quickest fall from popularity of any female baby name between 1880 and 2011. Marian also fell, but its decline was slow, whereas Hilary dropped off a cliff all of a sudden.
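The filtering and ranking in steps 1 to 5 can be sketched as follows. This is a minimal Python illustration with made-up numbers, not Hilary's R code. The function name `biggest_drops` and the input layout are hypothetical, and a name is assumed to be "in the top 1,000" for a year whenever it has an entry for that year.

```python
def biggest_drops(share, min_years=20):
    """Rank names by their largest one-year relative loss in share.

    `share` maps name -> {year: percentage of babies given that name}.
    Names present in `min_years` or fewer years are dropped as
    flash-in-the-pan names before ranking.
    Returns a list of (name, year_of_drop, fraction_lost), largest loss first.
    """
    ranked = []
    for name, by_year in share.items():
        if len(by_year) <= min_years:
            continue  # not in the data for more than `min_years` years
        worst = None
        for year in sorted(by_year):
            prev = by_year.get(year - 1)
            if not prev:
                continue  # no previous-year value to compare against
            loss = 1.0 - by_year[year] / prev
            if worst is None or loss > worst[0]:
                worst = (loss, year)
        if worst is not None:
            ranked.append((name, worst[1], worst[0]))
    return sorted(ranked, key=lambda row: row[2], reverse=True)

# Made-up toy data: one steady name, one that falls off a cliff in 1992,
# and one short-lived fad that the 20-year filter should remove.
share = {
    "Steady": {1970 + i: 1.0 for i in range(25)},
    "Cliff": {1970 + i: (1.0 if i < 22 else 0.3) for i in range(25)},
    "Fad": {1980: 0.1, 1981: 2.0, 1982: 0.1},
}

result = biggest_drops(share)
for name, year, loss in result:
    print(name, year, round(loss, 2))
```

Without the `min_years` filter, the toy "Fad" name would top the ranking, which mirrors why the short-lived names initially pushed Hilary down to sixth place.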
Step 6, Communicating Your Results
Most projects build off of someone else’s work. Make sure to give them credit!!
- Hilary wrote a detailed blog post that communicated the results of her analysis, answered the question she set out to answer, and did so in an entertaining way.
- Hilary gave credit to those whose work she built on by linking to a blog post where someone had asked a similar question, to the Social Security website where she got the data, and to the resource where she learned about web scraping.