# Data Science Terms and Techniques Flashcards

1
Q

Hypothesis

A

an assumption made about the world that can be tested using data; an educated guess that needs to be validated or disproved by experiment and data

2
Q

Statistical Inference

A

a branch of statistics dedicated to drawing conclusions about the world using smaller data samples

3
Q

Confidence intervals

A

an interval estimate used to express the degree of uncertainty associated with a sample statistic
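
As a rough illustration, the sketch below computes a 95% confidence interval for a sample mean. It assumes a normal approximation (for small samples a t-distribution is usually more appropriate), and the function name `mean_confidence_interval` and the sample values are made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) for the sample mean using a normal approximation."""
    center = mean(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    standard_error = stdev(sample) / sqrt(len(sample))
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return center - margin, center + margin


# Hypothetical measurements, invented for illustration only.
heights_cm = [170.2, 165.4, 180.1, 175.3, 168.9, 172.5, 177.8, 169.4]
low, high = mean_confidence_interval(heights_cm)
print(f"95% CI for the mean: {low:.1f} to {high:.1f}")
```

A wider interval signals more uncertainty about the statistic; asking for higher confidence (say 99%) widens the interval for the same data.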

4
Q

Statistical Significance

A

an estimate of how likely it is that an observed event occurred by random chance alone; the smaller the number, the more likely it is that the observed event reflects a real effect rather than noise.
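
One way to make this concrete is a permutation test, sketched below. It is only one of several ways to estimate significance; the function name `permutation_p_value` and both groups of numbers are invented for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate how often random relabeling gives a mean gap at least as large as the observed one."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def mean_gap(values):
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(group_a) + list(group_b)
    observed = mean_gap(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the group labels are arbitrary
        if mean_gap(pooled) >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical measurements for a control group and a treated group.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treated = [13.0, 13.4, 12.9, 13.2, 13.5, 13.1]
p_value = permutation_p_value(control, treated)
print(f"estimated p-value: {p_value:.4f}")
```

Here the two groups do not overlap at all, so very few random relabelings reproduce the observed gap and the estimated p-value comes out small.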

5
Q

Big Data

A

a collective term used for technology to analyze large amounts of data to unearth insights, typically into human behavior and patterns

6
Q

Data Set

A

a collection of data to be analyzed

7
Q

Analytics

A

a collective term for techniques used to analyze data, mostly to draw business insights

8
Q

Algorithm

A

a well-defined set of steps to solve a specific problem

9
Q

Technology Stack

A

the collective set of tools and programs used in an organization or team

10
Q

Pre-packaged distribution

A

a package that bundles all of the required Python tools and libraries, e.g. numpy, scipy, pandas, scikit-learn, jupyter, matplotlib, seaborn and statsmodels. In the Python world, Anaconda and Canopy are popular distributions for scientific computing and data science.

11
Q

Regular Expressions

A

a technique to quickly search for or substitute complex patterns in strings
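
A short sketch of both uses with Python's built-in `re` module follows. The sample text and the simplified email pattern are assumptions made for illustration; real email validation is considerably more involved.

```python
import re

text = "Contact ana@example.com or bob.smith@example.org before 2024-03-15."

# Search: find every substring that looks like an email address.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(emails)  # ['ana@example.com', 'bob.smith@example.org']

# Substitute: rewrite ISO dates (YYYY-MM-DD) as DD/MM/YYYY using capture groups.
rewritten = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)
print(rewritten)
# Contact ana@example.com or bob.smith@example.org before 15/03/2024.
```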

12
Q

Jupyter

A

formerly known as IPython Notebook, this tool enables data scientists to prototype code rapidly and combine it with useful documentation

13
Q

Raw data

A

data from original or secondary sources that may be unstructured or corrupted and needs more work performed on it before it can be analyzed

14
Q

Data Wrangling

A

process of taking data in its raw form and manipulating it in various ways into a useful form

15
Q

Messy or Dirty data

A

data can be messy or dirty in the sense that it might contain values that are invalid, missing, corrupted, inconsistent or non-uniform

16
Q

Storytelling

A

this highly effective art of communication is not limited to entertainment; it is a crucial skill needed to communicate answers to questions that data scientists ask

17
Q

Visualization

A

a picture is worth a thousand words, and this rings true when you are trying to present data in an easy-to-understand manner. Data visualization is increasingly used to depict data analysis. Think about how data is visualized during election times.

18
Q

Supervised Learning

A

algorithms that create a model of the world by looking at labeled examples
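
As a minimal sketch, the nearest-neighbor classifier below predicts a label for a new point by copying the label of the closest labeled example. The training points, the labels and the function name `predict_nearest` are all hypothetical.

```python
from math import dist


def predict_nearest(labeled_examples, point):
    """Return the label of the training example closest to `point`."""
    _, label = min(labeled_examples, key=lambda example: dist(example[0], point))
    return label


# Labeled examples: (features, label). Values are invented for illustration.
training = [
    ((1.0, 1.1), "small"),
    ((0.9, 1.3), "small"),
    ((5.0, 5.2), "large"),
    ((5.4, 4.8), "large"),
]

print(predict_nearest(training, (1.2, 0.9)))  # small
print(predict_nearest(training, (4.9, 5.1)))  # large
```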

19
Q

Unsupervised Learning

A

algorithms that create a model of the world using examples without labels

20
Q

Bayesian Analysis

A

algorithms based on Bayes' theorem, which makes inferences about the world by combining domain knowledge or assumptions with observed evidence
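
The sketch below applies Bayes' theorem to a made-up screening test: the prior is the assumed domain knowledge and the positive result is the observed evidence. All three probabilities are assumptions chosen for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of seeing a positive test at all.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Assumed numbers: 1% of people have the condition, the test detects it
# 95% of the time, and it wrongly flags 5% of healthy people.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"{p:.3f}")  # 0.161
```

Even with a fairly accurate test, the low prior keeps the posterior at roughly 16%, which shows how prior knowledge and evidence are combined.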

21
Q

Clustering

A

a family of unsupervised learning algorithms used to automatically find groups in data sets
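
One well-known member of this family is k-means. The sketch below is a simplified one-dimensional version with hand-picked starting centers; the data values and the function name `k_means_1d` are assumptions for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            # Assign each value to the index of its closest center.
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Recompute centers; keep the old center if a cluster ended up empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]  # unlabeled values, invented for illustration
centers, clusters = k_means_1d(data, centers=[0.0, 6.0])
print([round(c, 2) for c in centers])  # centers settle near 1.0 and 5.0
print(clusters)
```

No labels are supplied anywhere: the two groups emerge from the distances between the values alone.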

22
Q

What does data science entail?

A

processing, analyzing and visualizing data.

23
Q

Why is Python a popular language for data science?

A

Because it can:
handle large datasets
work with common mathematical functions
create powerful data visualizations

24
Q

What is Python Jupyter Notebook?

A

It’s built around a typical data analysis workflow and is very different from an integrated development environment such as PyCharm, which focuses more on just working with code. In Jupyter, you work with notebooks, which mix plain text, code and code outputs in one view. You can interleave code with markdown text explanations, which enables you to easily explore data, create visualizations and share your results.

25
Q

What is a kernel in Jupyter Notebook?

A

The kernel defines the programming language that the code in the notebook will be written in. This is displayed in the top-right corner of the notebook. When you run code, it’s executed inside the kernel session.

26
Q

What strategies should one use to address missing data?

A
1. Remove any rows that contain missing data.
2. Populate the empty fields with a specified value.
3. Populate the empty fields with a calculated value.
4. Use analysis techniques that work with missing data.
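
The four strategies above can be sketched in plain Python, with `None` standing in for a missing value. The rows and the fill value of 0 are invented for illustration; in pandas, strategies 1 to 3 roughly correspond to `DataFrame.dropna` and `DataFrame.fillna`.

```python
from statistics import mean

# Hypothetical records; None marks a missing age.
rows = [
    {"name": "Ana", "age": 34},
    {"name": "Bob", "age": None},
    {"name": "Cy", "age": 28},
]

# 1. Remove any rows that contain missing data.
complete_rows = [row for row in rows if row["age"] is not None]

# 2. Populate the empty fields with a specified value (0 is an arbitrary choice).
filled_constant = [
    {**row, "age": 0 if row["age"] is None else row["age"]} for row in rows
]

# 3. Populate the empty fields with a calculated value (here, the mean of known ages).
known_mean = mean(row["age"] for row in rows if row["age"] is not None)
filled_mean = [
    {**row, "age": known_mean if row["age"] is None else row["age"]} for row in rows
]

# 4. Use an analysis that tolerates gaps: summarize only the observed values.
observed_ages = [row["age"] for row in rows if row["age"] is not None]
oldest = max(observed_ages)

print(len(complete_rows), filled_constant[1]["age"], filled_mean[1]["age"], oldest)
# 2 0 31 34
```

Which strategy fits depends on how much data is missing and why; dropping rows is simplest but discards information, while filling values can bias later statistics.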