# Data Science Terms and Techniques Flashcards

Hypothesis

an assumption made about the world that can be tested using data; an educated guess that needs to be validated or disproved by experiment and data

Statistical Inference

a branch of statistics dedicated to drawing conclusions about the world (a larger population) from smaller data samples

Confidence intervals

an interval estimate used to express the degree of uncertainty associated with a sample statistic
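As a rough illustration of the definition above, the sketch below computes an interval estimate for a sample mean using only the Python standard library. The function name `mean_confidence_interval` and the measurements are hypothetical, invented for this example, and the normal approximation it uses is an assumption rather than the only way to build such an interval.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) for the sample mean using a normal approximation."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean, based on the sample standard deviation.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value; roughly 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# Hypothetical measurements, invented for illustration only.
heights_cm = [168.2, 171.5, 169.9, 173.1, 170.4, 172.8, 167.6, 170.0]
low, high = mean_confidence_interval(heights_cm)
print(f"95% interval for the mean: {low:.1f} to {high:.1f} cm")
```

For a sample this small, a t-distribution critical value would usually give a somewhat wider and more appropriate interval; the normal approximation is used here only to keep the sketch dependency-free.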

Statistical Significance

an estimate of how likely it is that an observed event occurred by random chance alone; the smaller the number, the less plausible chance is as an explanation and the more likely it is that the observed event reflects a real effect
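One way to make "how likely an event might occur randomly" concrete is a permutation test, sketched below with the standard library. This is one illustrative technique among several, not a method named by the card; the function `permutation_p_value`, its parameters, and both data groups are hypothetical.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_resamples=2000, seed=0):
    """Estimate how often random relabeling yields a mean gap at least as large as observed."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        # Shuffle the pooled values, then split them into two pseudo-groups.
        rng.shuffle(pooled)
        gap = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if gap >= observed:
            hits += 1
    return hits / n_resamples

# Hypothetical measurements, invented for illustration only.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treated = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]
p_value = permutation_p_value(control, treated)
print(f"estimated p-value: {p_value}")
```

A small result means that shuffling the labels rarely reproduces a gap as large as the observed one, so chance alone is a poor explanation for the difference.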

Big Data

a collective term for technologies used to analyze large amounts of data to unearth insights, typically into human behavior and patterns

Data Set

a collection of data to be analyzed

Analytics

a collective term for techniques used to analyze data, mostly to draw business insights

Algorithm

a well-defined set of steps to solve a specific problem

Technology Stack

the collective set of tools and programs used in an organization or team

Pre-packaged distribution

a package that bundles the required Python tools and libraries, e.g. numpy, scipy, pandas, scikit-learn, jupyter, matplotlib, seaborn, and statsmodels. In the Python world, Anaconda and Canopy are popular distributions for scientific computing and data science.

Regular Expressions

a technique to quickly search for or substitute complex patterns in strings
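A short sketch of both uses named above, searching and substituting, with Python's built-in `re` module. The sample text and the two patterns are made up for illustration; in particular, the email pattern is a deliberate simplification and not a full address validator.

```python
import re

# Hypothetical text, invented for illustration only.
text = "Contact: alice@example.com, bob@example.org; phone 555-0101 or 555-0199."

# Search: find every substring shaped like NNN-NNNN.
phones = re.findall(r"\b\d{3}-\d{4}\b", text)

# Substitute: mask the part before "@" in each simple email address,
# keeping the domain via the captured group \1.
masked = re.sub(r"\b[\w.]+@([\w.]+\.\w+)", r"***@\1", text)

print(phones)
print(masked)
```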

Jupyter

formerly known as IPython Notebook, this tool enables data scientists to prototype code rapidly and combine it with useful documentation

Raw data

data from original or secondary sources that may be unstructured or corrupted and needs more work performed on it before it can be analyzed

Data Wrangling

the process of taking data in its raw form and manipulating it in various ways into a useful form
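A minimal sketch of that idea: raw semicolon-delimited strings are turned into typed, consistently formatted records. The input lines, the field layout (name, age, city), and the helper `wrangle` are all assumptions made up for this example.

```python
# Hypothetical raw input, invented for illustration only.
raw_lines = [
    "  Alice ; 34 ; new york ",
    "BOB;41;Boston",
    "carol ;29; san francisco",
]

def wrangle(line):
    """Turn one semicolon-delimited raw line into a typed record."""
    # Split into fields and trim stray whitespace around each one.
    name, age, city = (field.strip() for field in line.split(";"))
    # Normalize capitalization and convert the age to an integer.
    return {"name": name.title(), "age": int(age), "city": city.title()}

records = [wrangle(line) for line in raw_lines]
for record in records:
    print(record)
```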

Messy or Dirty data

data that contains values that are invalid, missing, corrupted, inconsistent, or non-uniform
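To show what a few of these problems can look like in practice, the sketch below flags missing, invalid, and non-uniform values in a handful of rows. The rows, the helper `find_problems`, the 0 to 120 age range, and the "upper-case, no padding" country convention are all hypothetical choices for this example.

```python
def find_problems(rows):
    """Return a list of (row_index, field, problem) tuples."""
    problems = []
    for i, row in enumerate(rows):
        age = row.get("age")
        if age is None or str(age).strip() == "":
            problems.append((i, "age", "missing"))
        else:
            try:
                # Assumed valid range for this example: 0 to 120 inclusive.
                if not 0 <= int(age) <= 120:
                    problems.append((i, "age", "invalid"))
            except (TypeError, ValueError):
                problems.append((i, "age", "invalid"))
        country = row.get("country", "")
        # Assumed convention: upper-case codes with no surrounding spaces.
        if country != country.strip().upper():
            problems.append((i, "country", "non-uniform"))
    return problems

# Hypothetical rows, invented for illustration only.
rows = [
    {"age": "34", "country": "US"},
    {"age": "", "country": "us"},
    {"age": "-5", "country": "US "},
    {"age": "n/a", "country": "DE"},
]
problems = find_problems(rows)
for problem in problems:
    print(problem)
```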