Introductory terms Flashcards

1
Q

descriptive statistics

A

describe a dataset using mathematically calculated values such as the mean and standard deviation

2
Q

inferential statistics

A

statistical calculations that enable us to draw conclusions about the larger population from a sample of it

3
Q

Normal distribution

A

a symmetric, bell-shaped distribution in which the mean sets the middle of the distribution and the standard deviation sets the width.
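The roles of the mean and standard deviation can be sketched with Python's standard-library `statistics.NormalDist`. The two distributions below (`narrow`, `wide`) and the helper `share_between` are made-up illustrations, not part of the original card: both share the same middle, but the wider one puts far less of its probability near the mean.

```python
from statistics import NormalDist

def share_between(dist, low, high):
    """Fraction of a normal distribution that falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)

# Hypothetical example: same mean (middle), different standard deviation (width).
narrow = NormalDist(mu=100, sigma=5)
wide = NormalDist(mu=100, sigma=20)

print(narrow.mean, wide.mean)          # both centered at 100
print(share_between(narrow, 90, 110))  # most of the narrow curve lies in this range
print(share_between(wide, 90, 110))    # much less of the wide curve does
```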

4
Q

Probability

A

the mathematical study of what could potentially happen; in data science probability calculations are used to simulate scenarios and build models, which help us understand data that has yet to exist.
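Simulating a scenario can be as small as rolling virtual dice. The sketch below is an illustrative assumption (the function name `simulate_two_dice` and the dice scenario are not from the card): it estimates the probability that two dice sum to 7 and compares the estimate with the exact value of 1/6.

```python
import random

def simulate_two_dice(trials, seed=0):
    """Estimate P(sum of two dice == 7) by repeated simulated rolls."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        if rng.randint(1, 6) + rng.randint(1, 6) == 7:
            hits += 1
    return hits / trials

estimate = simulate_two_dice(100_000)
print(estimate, 1 / 6)  # the estimate should land close to the exact value
```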

5
Q

Programming

A

the act of giving the computer instructions to perform a task

6
Q

Clustering

A

a technique in data science that groups similar data points together without predefined labels. Programming makes clustering large datasets time-efficient.
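Grouping data programmatically can be sketched with a tiny one-dimensional k-means, one common clustering method that the card does not name. The function `kmeans_1d`, the starting centroids, and the sample values are all hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Assign each value to its nearest centroid, then move each centroid to its group's mean."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid if a group ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

# Made-up measurements that visibly fall into a low group and a high group.
values = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids, groups = kmeans_1d(values, centroids=[0.0, 10.0])
print(groups)
print(centroids)
```

Real projects would usually reach for a library implementation; the loop here only shows the idea of repeating "assign, then re-center".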

7
Q

Domain expertise

A

the particular set of knowledge that someone cultivates in order to understand their data. My domain expertise is the food system, agriculture, and sustainability.

8
Q

Data science process (8 steps)

A

Ask a question; determine the necessary data; get the data; clean and organize the data; explore the data; model and analyze the data; communicate findings; ensure reproducibility and automation.

9
Q

a/b testing

A

a process of showing two variants of the same web page to different segments of website visitors at the same time and comparing which variant drives more conversions.
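Comparing two variants comes down to comparing their conversion rates. The sketch below uses made-up visitor counts and hypothetical function names; the significance check is a two-proportion z-test, one common choice that the card itself does not specify.

```python
import math
from statistics import NormalDist

def conversion_rate(conversions, visitors):
    return conversions / visitors

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conversion_rate(conv_b, n_b) - conversion_rate(conv_a, n_a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: 1,000 visitors saw each variant.
rate_a = conversion_rate(50, 1000)
rate_b = conversion_rate(65, 1000)
p_value = two_proportion_p_value(50, 1000, 65, 1000)

print(rate_a, rate_b)  # variant B converts more often in this sample
print(p_value)         # a large p-value means the gap could still be chance
```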

10
Q

Margin of error

A

the amount by which survey results may differ from the true population value. The larger the margin of error, the less confidence we have in the results.

11
Q

Confidence level

A

how often, if we ran the survey again with the same method, the results would fall within the margin of error of the true population value. Common levels are 90%, 95%, and 99%.

12
Q

Population size

A

size of the population we’re collecting data on. A common number in sample size calculations is 100,000.

13
Q

Likely sample proportion

A

the % of people surveyed whose results we anticipate matching the expected outcome. If we don’t have historical data, we normally use 50%.
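The four inputs on the preceding cards (margin of error, confidence level, population size, and likely sample proportion) are typically combined in a sample size calculation. The sketch below assumes Cochran's formula with a finite population correction, which the cards do not name; the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(margin_of_error, confidence_level, population_size, proportion=0.5):
    """Minimum number of responses needed, rounded up (assumed formula: Cochran's)."""
    # z-score for a two-sided confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an effectively unlimited population
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Shrink it for a finite population
    n = n0 / (1 + (n0 - 1) / population_size)
    return math.ceil(n)

# 5% margin of error, 95% confidence, population of 100,000, 50% proportion
n = sample_size(0.05, 0.95, 100_000)
print(n)  # 383
```

Using 50% for the proportion gives the largest required sample, which is why it is the safe default when no historical data is available.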

14
Q

Active data collection

A

setting up specific conditions in which to get data. You're on the hunt, e.g., running different experiments and surveys.

15
Q

Passive data collection

A

looking for data that already exists. You're foraging for data, e.g., locating datasets and web scraping.

16
Q

data wrangling

A

the process of cleaning and organizing datasets. It is needed because raw data can come in a variety of file types and formats.

17
Q

Pandas (a Python library)

A

a great tool for importing and organizing datasets. It can read spreadsheet-style files (like CSV) into easily readable tables known as DataFrames. We can also use Pandas to transform datasets by adding columns and rows to an existing table or by combining tables (merging).
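A minimal sketch of that workflow, assuming pandas is installed. The CSV text, column names, and prices are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file path.
csv_text = "crop,yield_tons\nwheat,3.1\ncorn,9.5\n"
crops = pd.read_csv(io.StringIO(csv_text))

# Add a column computed from an existing one.
crops["yield_kg"] = crops["yield_tons"] * 1000

# Merge in a second table that shares the "crop" column.
prices = pd.DataFrame({"crop": ["wheat", "corn"], "price_per_ton": [250, 180]})
merged = crops.merge(prices, on="crop", how="left")

print(merged)
```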

18
Q

Statistical calculations

A

after importing a dataset, use descriptive statistics to get a sense of what it contains, including the mean (average), median, and standard deviation. We can use a Python library known as NumPy (Numerical Python) to calculate these values.
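A short sketch of those calculations, assuming NumPy is installed. The values are made up, and `np.std` is left at its default, which computes the population standard deviation.

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up sample data

mean = np.mean(values)
median = np.median(values)
std = np.std(values)  # population standard deviation (ddof=0)

print(mean, median, std)  # 5.0 4.5 2.0
```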

19
Q

Data visualizations

A

enables us to see patterns, relationships, and outliers, and how they relate to the entire dataset. Especially useful with large datasets. Python data viz libraries like Matplotlib and Seaborn can display distributions and statistical summaries for easy comparison. The JavaScript library D3 enables the creation of interactive data visualizations, which are useful for modeling different scenarios.

20
Q

models

A

To analyze data, we create a model. Models are abstractions of reality, informed by real data, that allow us to understand situations and make guesses about how things might change given different variables.

21
Q

2 varieties of variables

A

quantitative and categorical variables

22
Q

Quantitative variables

A

Quantitative variables are variables expressed numerically, whether as a count (discrete) or measurement (continuous).

23
Q

Discrete variables

A

Discrete variables are numeric values that can only be integer values (counts). They represent whole units and are not decimals or fractions.

24
Q

Continuous variables

A

Continuous variables are numeric values that can be expressed with decimal precision. Examples are length, weight, and age.

25
Q

Categorical variables

A

Categorical variables group observations into separate categories that can be ordered or unordered. With categorical variables we want to understand how the observations in our dataset can be grouped and separated from one another based on their attributes.

26
Q

Ordinal variables

A

when the groupings have a specific order or ranking. For example, “strongly disagree” to “strongly agree,” customer satisfaction ratings, age groups, standings in a competition.

27
Q

Nominal variables

A

when there is no apparent order or ranking to the categories. For example, eye color, pet type, favorite food, hometown city.

28
Q

Binary variables

A

Binary variables are a special kind of nominal variable that only has two categories. True/false, or 0 and 1 for example.

29
Q

Git

A

a version control system (VCS), loosely like a Google Doc with revision history for coders. A project managed with Git is called a repository. GitHub is a popular hosting service for Git repositories, which lets you store copies of your local repositories in the cloud.

30
Q

CAPTCHA

A

Completely Automated Public Turing Test to tell Computers and Humans Apart

31
Q

SQL

A

Structured Query Language - a programming language used to manage data stored in relational databases
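A brief sketch of SQL in action, run through Python's built-in `sqlite3` module so it stays self-contained. The `crops` table and its rows are hypothetical.

```python
import sqlite3

# An in-memory relational database that disappears when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crops (name TEXT, yield_tons REAL)")
conn.executemany(
    "INSERT INTO crops VALUES (?, ?)",
    [("wheat", 3.1), ("corn", 9.5), ("rice", 4.6)],
)

# Query: which crops yield more than 4 tons, in alphabetical order?
rows = conn.execute(
    "SELECT name FROM crops WHERE yield_tons > 4 ORDER BY name"
).fetchall()
conn.close()

print(rows)  # [('corn',), ('rice',)]
```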

32
Q

HTTP

A

Hypertext Transfer Protocol

33
Q

TCP

A

Transmission Control Protocol