Introductory terms Flashcards
descriptive statistics
describe a dataset using mathematically calculated values such as the mean and standard deviation
inferential statistics
statistical calculations that enable us to draw conclusions about a larger population based on a sample of data
Normal distribution
the mean sets the middle of the distribution and the standard deviation sets the width.
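A minimal sketch of this idea using NumPy's random generator (the specific mean and standard deviation values are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    narrow = rng.normal(loc=100, scale=5, size=10_000)   # centered at 100, small spread
    wide = rng.normal(loc=100, scale=20, size=10_000)    # same center, four times wider

    print(round(narrow.mean(), 1), round(narrow.std(), 1))  # ~100.0, ~5.0
    print(round(wide.mean(), 1), round(wide.std(), 1))      # ~100.0, ~20.0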
Probability
the mathematical study of what could potentially happen; in data science, probability calculations are used to simulate scenarios and build models, which help us understand data that does not yet exist.
Programming
the act of giving the computer instructions to perform a task
Clustering
a technique in data science that allows us to group similar data points into categories. Programming makes clustering large datasets time-efficient
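One common way to cluster in Python is scikit-learn's KMeans; this is a minimal sketch with made-up 2-D points and an assumed choice of three clusters, not a prescribed method:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(seed=1)
    # three loose groups of 2-D points
    points = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
    ])

    model = KMeans(n_clusters=3, random_state=1).fit(points)
    print(model.labels_[:10])       # cluster assignment for the first 10 points
    print(model.cluster_centers_)   # approximate center of each group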
Domain expertise
the particular set of knowledge that someone cultivates in order to understand their data. My domain expertise is in the food system, agriculture, and sustainability.
Data science process (8 steps)
Ask a question; determine necessary data; get the data; clean and organize the data; explore the data; model the data + analysis; communicate findings; reproducibility and automation.
A/B testing
a process of showing two variants of the same web page to different segments of website visitors at the same time and comparing which variant drives more conversions.
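A minimal sketch of how the two variants might be compared, using a two-proportion z-test computed by hand (the visitor and conversion counts are made-up numbers):

    from math import sqrt, erf

    visitors_a, conversions_a = 5_000, 250   # variant A: 5.0% conversion
    visitors_b, conversions_b = 5_000, 300   # variant B: 6.0% conversion

    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)

    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided, standard normal

    print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")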
Margin of error
the amount by which survey results may differ from the true population value. The larger the margin of error, the less confidence we have in the results.
Confidence level
the probability that, if we ran another survey with the same methodology, we would get the same results. Common levels are 90%, 95%, and 99%.
Population size
size of the population we’re collecting data on. A common number in sample size calculations is 100,000.
Likely sample proportion
the percentage of people surveyed whose responses we anticipate will match the expected outcome. If we don't have historical data, we normally use 50%.
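A minimal sample-size sketch that ties the last four terms together, using Cochran's formula with a finite-population correction (the input values are illustrative assumptions):

    from math import ceil

    z = 1.96                 # z-score for a 95% confidence level
    margin_of_error = 0.05   # plus or minus 5%
    p = 0.50                 # likely sample proportion when there is no historical data
    population_size = 100_000

    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2   # infinite-population sample size
    n = n0 / (1 + (n0 - 1) / population_size)            # finite-population correction

    print(ceil(n))   # roughly 383 respondents needed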
Active data collection
setting up specific conditions in which to get data. On the hunt. E.g.: running different experiments and surveys.
Passive data collection
you're looking for data that already exists. You're foraging for data. E.g.: locating existing datasets and web scraping.
data wrangling
the process of cleaning and organizing raw data, which can come in a variety of file types and formats, into usable datasets.
Pandas (a python library)
a great tool for importing and organizing datasets. It can be used to convert spreadsheet files (like CSVs) into easily readable tables known as DataFrames. We can also use Pandas to transform datasets by adding columns and rows to an existing table, or by merging it with another table.
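A minimal Pandas sketch of importing, transforming, and merging (the file names and column names are illustrative assumptions, not a real dataset):

    import pandas as pd

    # import a CSV into a DataFrame
    farms = pd.read_csv("farms.csv")          # e.g. columns: name, acres, county

    # transform: add a new column derived from an existing one
    farms["hectares"] = farms["acres"] * 0.4047

    # merge: combine with a second table that shares the "county" column
    counties = pd.read_csv("counties.csv")    # e.g. columns: county, region
    combined = farms.merge(counties, on="county")

    print(combined.head())                    # first five rows of the merged table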
Statistical calculations
use descriptive statistics to get a sense of what a dataset contains, including the mean (average), median, and standard deviation. We can use a Python module known as NumPy (Numerical Python) to calculate descriptive statistics values.
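A minimal NumPy sketch of these descriptive statistics (the values are made up for illustration):

    import numpy as np

    yields = np.array([2.1, 2.4, 3.0, 2.8, 5.6, 2.2, 2.9])  # e.g. tons per acre

    print(np.mean(yields))    # average
    print(np.median(yields))  # middle value, less sensitive to the outlier (5.6)
    print(np.std(yields))     # standard deviation: spread around the mean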
Data visualizations
enable us to see patterns, relationships, and outliers, and how they relate to the entire dataset. Especially useful with large datasets. Python data viz libraries like Matplotlib and Seaborn can display distributions and statistical summaries for easy comparison. The JavaScript library D3 enables the creation of interactive data visualizations, which are useful for modeling different scenarios.
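A minimal Matplotlib/Seaborn sketch showing a distribution and a statistical summary side by side (the data is randomly generated for illustration):

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    rng = np.random.default_rng(seed=2)
    harvest = rng.normal(loc=50, scale=10, size=500)   # made-up measurements

    fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
    sns.histplot(harvest, ax=left)     # distribution of the values
    sns.boxplot(x=harvest, ax=right)   # summary: median, quartiles, outliers
    plt.tight_layout()
    plt.show()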
models
To analyze data, we create a model. Models are abstractions of reality, informed by real data, that allow us to understand situations and make guesses about how things might change given different variables.
2 varieties of variables
quantitative and categorical variables
Quantitative variables
Quantitative variables are variables expressed numerically, whether as a count (discrete) or measurement (continuous).
Discrete variables
Discrete variables are numeric values that can only take integer values (counts). They represent whole units, not decimals or fractions.
Continuous variables
Continuous variables are numeric values that can be expressed with decimal precision. Examples are length, weight, and age.