Data Science Fundamentals Flashcards
(19 cards)
What is Data Science?
- Using data to answer questions
- Data Science is a broad field, hence the broad description
- A data scientist is broadly defined as someone who combines the skills of a software programmer, statistician, and storyteller/artist to extract the nuggets of gold hidden under mountains of data
Three qualities of big data
- Volume
- Velocity
- Variety
What does Volume stand for in Big Data?
How much data there is
What does Velocity stand for in Big Data?
The rate at which data is being generated
What does Variety stand for in Big Data?
The many forms the data comes in
Diagram of Data Science skills overlap
Data Science steps
- Subject matter expertise: we need enough knowledge of the area we are asking about to formulate good questions.
- Cleaning and formatting data: typically requires some programming.
- Analyzing data: typically requires statistics and math knowledge.
What can you do with R?
- Access data
- Experiment with the data
- Analyze the data
- Plot the data
Why are data scientists in so much demand?
- Because most of the answers are not already outlined in textbooks.
- A data scientist needs to be somebody who knows how to find answers to novel problems.
What is data?
- A set of values
- In statistics, the population you are trying to discover something about
What is a variable?
Measurements or characteristics of an item
What is a qualitative variable?
Measurements or information about qualities
What is a quantitative variable?
Measurements or information about quantities or numerical items
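To make the qualitative/quantitative distinction concrete, here is a minimal Python sketch that labels each variable in a tiny data set by checking whether all of its values are numeric. The records, the variable names, and the `variable_kind` helper are all made up for illustration. The numeric check is only a rough heuristic: numeric codes such as postal codes are stored as numbers but are really qualitative.

```python
from numbers import Number

# Hypothetical records, invented for illustration: each dict is one item,
# and each key is a variable measured on that item.
records = [
    {"country": "Brazil", "height_cm": 170.2, "eye_color": "brown"},
    {"country": "Japan", "height_cm": 165.0, "eye_color": "brown"},
    {"country": "Norway", "height_cm": 181.5, "eye_color": "blue"},
]

def variable_kind(values):
    """Label a variable 'quantitative' if every value is a number, else 'qualitative'.

    Rough heuristic only; booleans are excluded because Python treats them as ints.
    """
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "quantitative" if is_numeric else "qualitative"

# Collect the values of each variable across all records and label it.
kinds = {name: variable_kind([r[name] for r in records]) for name in records[0]}

for name, kind in kinds.items():
    print(f"{name}: {kind}")
```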
Common types of messy data
- Sequencing data
- Population census data
- Electronic Medical Records (EMR) or other large databases
- Geographic information system (GIS) data (mapping)
- Image analysis and image extrapolation
- Language and translations
- Website traffic
- Personal/Ad data (e.g., Facebook, Netflix predictions, etc.)
What is sequencing data?
- Data produced by sequencing machines
- For example, DNA or RNA sequencing data
What format is sequencing data often found in?
- FASTQ format
- This is a raw file format produced by sequencing machines.
- These files are often hundreds of millions of lines long.
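As a rough illustration of why FASTQ files get so long, each read usually occupies four lines: an identifier line starting with `@`, the sequence, a separator line starting with `+`, and a quality string the same length as the sequence. The Python sketch below reads records under that simple four-line assumption (it does not handle sequences wrapped across several lines). The `read_fastq` helper and the two sample reads are made up for illustration and are not taken from any particular library or data set.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a FASTQ text stream.

    Assumes each record is exactly four lines long.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality

# Two invented reads standing in for a real file.
sample = StringIO(
    "@read1\nGATTACA\n+\nIIIIIII\n"
    "@read2\nACGT\n+\n!!!!\n"
)

records = list(read_fastq(sample))
for identifier, sequence, quality in records:
    print(identifier, len(sequence), "bases")
```

Because the function is a generator, it processes one record at a time, so the same pattern should work on an open file handle without loading hundreds of millions of lines into memory.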
Why is image analysis messy data?
There is a lot of information encoded in an image or video, and it has to be extracted.
Why is census information considered messy data?
- Almost every member of a country answers a set of standardized questions
- With that many respondents, the data is large and messy
Is data of secondary or primary importance?
Secondary
Data is important, but a good data scientist asks questions first and seeks out relevant data second.