Introductory terms Flashcards
descriptive statistics
describe a dataset using mathematically calculated values such as the mean and standard deviation
inferential statistics
statistical calculations that enable us to draw conclusions about a larger population based on a sample of data
Normal distribution
the mean sets the middle of the distribution and the standard deviation sets the width.
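A minimal sketch of this idea using NumPy's random generator (the specific mean and standard deviation values are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    narrow = rng.normal(loc=100, scale=5, size=10_000)   # centered at 100, small spread
    wide = rng.normal(loc=100, scale=20, size=10_000)    # same center, four times wider

    print(round(narrow.mean(), 1), round(narrow.std(), 1))  # ~100.0, ~5.0
    print(round(wide.mean(), 1), round(wide.std(), 1))      # ~100.0, ~20.0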
Probability
the mathematical study of what could potentially happen; in data science, probability calculations are used to simulate scenarios and build models, which help us understand data that does not yet exist.
Programming
the act of giving the computer instructions to perform a task
Clustering
a technique in data science that allows us to group similar data points into categories. Programming makes clustering large datasets time-efficient
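One common way to cluster in Python is scikit-learn's KMeans; this is a minimal sketch with made-up 2-D points and an assumed choice of three clusters, not a prescribed method:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(seed=1)
    # three loose groups of 2-D points
    points = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
    ])

    model = KMeans(n_clusters=3, random_state=1).fit(points)
    print(model.labels_[:10])       # cluster assignment for the first 10 points
    print(model.cluster_centers_)   # approximate center of each group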
Domain expertise
the particular set of knowledge that someone cultivates in order to understand their data. My domain expertise is in the food system, agriculture, and sustainability.
Data science process (8 steps)
Ask a question; determine necessary data; get the data; clean and organize the data; explore the data; model the data + analysis; communicate findings; reproducibility and automation.
A/B testing
a process of showing two variants of the same web page to different segments of website visitors at the same time and comparing which variant drives more conversions.
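A minimal sketch of how the two variants might be compared, using a two-proportion z-test computed by hand (the visitor and conversion counts are made-up numbers):

    from math import sqrt, erf

    visitors_a, conversions_a = 5_000, 250   # variant A: 5.0% conversion
    visitors_b, conversions_b = 5_000, 300   # variant B: 6.0% conversion

    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)

    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided, standard normal

    print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")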
Margin of error
the amount by which survey results may differ from the true population value. The larger the margin of error, the less confidence we have in the results.
Confidence level
the probability that, if we ran another survey with the same methodology, we would get the same results. Common levels are 90%, 95%, and 99%.
Population size
size of the population we’re collecting data on. A common number in sample size calculations is 100,000.
Likely sample proportion
the percentage of people surveyed whose responses we anticipate will match the expected outcome. If we don't have historical data, we normally use 50%.
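A minimal sample-size sketch that ties the last four terms together, using Cochran's formula with a finite-population correction (the input values are illustrative assumptions):

    from math import ceil

    z = 1.96                 # z-score for a 95% confidence level
    margin_of_error = 0.05   # plus or minus 5%
    p = 0.50                 # likely sample proportion when there is no historical data
    population_size = 100_000

    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2   # infinite-population sample size
    n = n0 / (1 + (n0 - 1) / population_size)            # finite-population correction

    print(ceil(n))   # roughly 383 respondents needed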
Active data collection
setting up specific conditions in which to get data. On the hunt. E.g.: running different experiments and surveys.
Passive data collection
you're looking for data that already exists. You're foraging for data. E.g.: locating existing datasets and web scraping.
data wrangling
the process of cleaning and organizing raw data, which can come in a variety of file types and formats, into usable datasets.
Pandas (a python library)
a great tool for importing and organizing datasets. It can be used to convert spreadsheet files (like CSVs) into easily readable tables known as DataFrames. We can also use Pandas to transform datasets by adding columns and rows to an existing table, or by merging it with another table.
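A minimal Pandas sketch of importing, transforming, and merging (the file names and column names are illustrative assumptions, not a real dataset):

    import pandas as pd

    # import a CSV into a DataFrame
    farms = pd.read_csv("farms.csv")          # e.g. columns: name, acres, county

    # transform: add a new column derived from an existing one
    farms["hectares"] = farms["acres"] * 0.4047

    # merge: combine with a second table that shares the "county" column
    counties = pd.read_csv("counties.csv")    # e.g. columns: county, region
    combined = farms.merge(counties, on="county")

    print(combined.head())                    # first five rows of the merged table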
Statistical calculations
use descriptive statistics to get a sense of what a dataset contains, including the mean (average), median, and standard deviation. We can use a Python module known as NumPy (Numerical Python) to calculate descriptive statistics values.
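A minimal NumPy sketch of these descriptive statistics (the values are made up for illustration):

    import numpy as np

    yields = np.array([2.1, 2.4, 3.0, 2.8, 5.6, 2.2, 2.9])  # e.g. tons per acre

    print(np.mean(yields))    # average
    print(np.median(yields))  # middle value, less sensitive to the outlier (5.6)
    print(np.std(yields))     # standard deviation: spread around the mean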
Data visualizations
enable us to see patterns, relationships, and outliers, and how they relate to the entire dataset. Especially useful with large datasets. Python data viz libraries like Matplotlib and Seaborn can display distributions and statistical summaries for easy comparison. The JavaScript library D3 enables the creation of interactive data visualizations, which are useful for modeling different scenarios.
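A minimal Matplotlib/Seaborn sketch showing a distribution and a statistical summary side by side (the data is randomly generated for illustration):

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    rng = np.random.default_rng(seed=2)
    harvest = rng.normal(loc=50, scale=10, size=500)   # made-up measurements

    fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
    sns.histplot(harvest, ax=left)     # distribution of the values
    sns.boxplot(x=harvest, ax=right)   # summary: median, quartiles, outliers
    plt.tight_layout()
    plt.show()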
models
To analyze data, we create a model. Models are abstractions of reality, informed by real data, that allow us to understand situations and make guesses about how things might change given different variables.
2 varieties of variables
quantitative and categorical variables
Quantitative variables
Quantitative variables are variables expressed numerically, whether as a count (discrete) or measurement (continuous).
Discrete variables
Discrete variables are numeric values that can only take integer values (counts). They represent whole units, not decimals or fractions.
Continuous variables
Continuous variables are numeric values that can be expressed with decimal precision. Examples are length, weight, and age.