BIG DATA ANALYSIS Flashcards
(66 cards)
Descriptive analysis
What has happened?
- Access and manipulate past data
- Inform decision-making
Predictive analytics
What could happen in the future?
- Use historical data to inform future decisions
- Estimation of variables
Prescriptive analytics
What should we do?
- Optimization and simulation to provide advice
- Explore several possible actions and suggest course of action.
- Build models
5 Vs of data
Volume, Velocity, Variety, Veracity, Value
3 Statistical data types :
Cross-sectional data, time-series data, panel data
Cross-sectional data
Multiple entities of a given kind observed for a single period of time
Time-series data
A single entity observed for multiple periods of time
Panel data
Multiple entities observed for multiple periods of time
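The three data types can be seen as different slices of one table. The sketch below assumes a made-up panel of (entity, year, revenue) rows; the firm names and figures are invented purely for illustration.

```python
# Hypothetical panel: (entity, year, revenue). All values are invented.
panel = [
    ("FirmA", 2021, 10.0), ("FirmA", 2022, 12.0), ("FirmA", 2023, 13.5),
    ("FirmB", 2021, 7.0), ("FirmB", 2022, 6.5), ("FirmB", 2023, 8.0),
]

# Cross-sectional data: several entities, one period.
cross_section = [row for row in panel if row[1] == 2022]

# Time-series data: one entity, several periods.
time_series = [row for row in panel if row[0] == "FirmA"]

# Panel data: several entities AND several periods (the full table).
entities = {row[0] for row in panel}
periods = {row[1] for row in panel}

print(cross_section)
print(time_series)
print(len(entities), "entities x", len(periods), "periods")
```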
3 types of variables
Numerical, categorical, dummy
numerical variables
Data that represent quantities or measurements.
Ex : age
categorical variables
Data that represent distinct categories or groups. Each category can be assigned a numeric code.
dummy variables
Data that represent categorical data as a set of binary indicators (0 and 1). This is the indicator function in mathematics.
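A minimal sketch of dummy encoding in plain Python. The helper `to_dummies` and the sample labels are hypothetical names introduced for this example, not part of the deck.

```python
def to_dummies(values):
    """Encode category labels as 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
    }

colors = ["red", "blue", "red", "green"]  # invented example data
dummies = to_dummies(colors)

for cat, column in dummies.items():
    print(cat, column)
```

Each observation gets exactly one 1 across the columns. In a regression with an intercept, one of the columns is usually dropped so that the dummies are not perfectly collinear with the constant term.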
The Linear Regression Model: Definition
Modeling method that postulates a linear relationship between a dependent variable (𝒚) and one or more independent variables (𝑥1, 𝑥2, … , 𝑥𝑘).
𝑦 depends on 𝑥1, 𝑥2, … , 𝑥𝑘.
Simple linear regression model
𝑥1 is the only independent variable; 𝑦 is the dependent variable
𝑦 = β0 + β1 𝑥1 + ϵ
Beta 0
The intercept: the value of 𝑦 when 𝑥1 = 0.
Beta 1
Unknown slope (the gradient of the line)
β0 + β1 𝑥1
deterministic component of the linear model
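The split between the deterministic component β0 + β1 𝑥1 and the random error ϵ can be sketched as follows. The coefficient values and the choice of a Gaussian error are assumptions made for illustration; the deck does not specify a distribution for ϵ.

```python
import random

def deterministic(x, beta0, beta1):
    """Deterministic component of the simple linear model."""
    return beta0 + beta1 * x

def simulate_y(x, beta0, beta1, sigma, rng):
    """Deterministic component plus a random error (assumed Gaussian here)."""
    return deterministic(x, beta0, beta1) + rng.gauss(0.0, sigma)

rng = random.Random(0)    # fixed seed so runs are repeatable
beta0, beta1 = 2.0, 0.5   # assumed parameter values
x = 4.0

mean_part = deterministic(x, beta0, beta1)
y = simulate_y(x, beta0, beta1, sigma=1.0, rng=rng)

print("deterministic part:", mean_part)
print("one simulated y:", y)
```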
Positive, negative, or no linear relationship
- Positive: upward slope, β1 > 0
- Negative: downward slope, β1 < 0
- No linear relationship: flat line, β1 = 0
For a multiple linear regression model
what is β𝑖, ∀𝑖 ∈ [1; 𝑘]?
Unknown population parameter associated with variable 𝑥𝑖
The multiple linear regression model:
Given 𝑦 the dependent variable and { 𝑥𝑖 | 𝑖 ∈ {1, 2, 3, … , 𝑘} } the independent variables:
𝑦 = β0 + ∑_{𝑖=1}^{𝑘} β𝑖𝑥𝑖 + ϵ = β0 + β1𝑥1 + β2𝑥2 + ⋯ + β𝑘𝑥𝑘 + ϵ
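The deterministic part of the multiple regression equation is an intercept plus a weighted sum. A minimal sketch, where `predict` is a hypothetical helper and the coefficients are invented:

```python
def predict(betas, xs):
    """Return β0 + Σ βi·xi, with betas = [β0, β1, ..., βk] and xs = [x1, ..., xk]."""
    if len(betas) != len(xs) + 1:
        raise ValueError("need one coefficient per variable plus an intercept")
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

betas = [1.0, 2.0, -0.5]  # assumed values for β0, β1, β2
xs = [3.0, 4.0]           # one observation of x1, x2

y_hat = predict(betas, xs)  # 1.0 + 2.0*3.0 + (-0.5)*4.0
print(y_hat)
```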
Residual error
𝑒 = 𝑦 - ŷ (observed value minus predicted value)
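A short sketch of the residual computation for a line whose coefficients are already known. The data points and the coefficients are invented for illustration.

```python
x = [1.0, 2.0, 3.0]
y = [2.0, 4.5, 5.0]        # observed values (invented)
beta0, beta1 = 1.0, 1.5    # assumed coefficients of the fitted line

# Predicted values from the line.
y_hat = [beta0 + beta1 * xi for xi in x]

# Residual = observed minus predicted, one per observation.
residuals = [yi - yhi for yi, yhi in zip(y, y_hat)]

print(y_hat)
print(residuals)
```

A negative residual means the line overestimates that observation; a positive one means it underestimates it.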
Ordinary least squares (OLS) formula:
SSE = ∑ 𝑒𝑖^2, summed over all observations 𝑖
OLS: definition
Method that estimates the coefficients by minimizing the sum of squared errors between the observed data points and the values predicted by the linear model.
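For the simple linear model, the minimizer of the SSE has a well-known closed form: β1 = ∑(𝑥𝑖 - x̄)(𝑦𝑖 - ȳ) / ∑(𝑥𝑖 - x̄)^2 and β0 = ȳ - β1 x̄. The sketch below applies it to invented data; the function names are hypothetical.

```python
def ols_simple(x, y):
    """Closed-form OLS estimates (beta0, beta1) for y = beta0 + beta1*x + error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

def sse(x, y, beta0, beta1):
    """Sum of squared errors of the line beta0 + beta1*x on the data."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 8.0]  # invented data

b0, b1 = ols_simple(x, y)
print("beta0:", b0, "beta1:", b1)
print("SSE at the OLS estimates:", sse(x, y, b0, b1))
print("SSE with a shifted slope:", sse(x, y, b0, b1 + 0.1))
```

Any other choice of coefficients gives an SSE at least as large as the one at the OLS estimates, which is what the last two printed lines illustrate.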
Given 𝑌1, 𝑌2, … , 𝑌𝑁, 𝑁 observations of the dependent variable, in matrix form:
{ 𝑌1 = β0 + β1𝑋1 + ϵ1
𝑌2 = β0 + β1𝑋2 + ϵ2
⋮
𝑌𝑁 = β0 + β1𝑋𝑁 + ϵ𝑁
What is the value of Y?
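𝑌 = (𝑌1, 𝑌2, … , 𝑌𝑁)ᵀ, the 𝑁 × 1 column vector of observations
<=> 𝑌 = 𝑋β + ϵ, where 𝑋 is the 𝑁 × 2 matrix whose 𝑖-th row is (1, 𝑋𝑖), β = (β0, β1)ᵀ and ϵ = (ϵ1, ϵ2, … , ϵ𝑁)ᵀ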
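The matrix form can be sketched in plain Python for the two-coefficient case. This assumes the design matrix has a column of ones followed by the 𝑋𝑖 values, and it estimates β by solving the normal equations (𝑋ᵀ𝑋)β = 𝑋ᵀ𝑌, which is the standard OLS solution; the function names and data are invented.

```python
def design_matrix(xs):
    """Build X with one row (1, x_i) per observation."""
    return [[1.0, x] for x in xs]

def solve_normal_equations(X, Y):
    """Solve (X^T X) beta = X^T Y for a two-column design matrix."""
    # Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, d]].
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    # Entries of the vector X^T Y = [p, q].
    p = sum(r[0] * yi for r, yi in zip(X, Y))
    q = sum(r[1] * yi for r, yi in zip(X, Y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X is singular (all x values are identical)")
    # Apply the explicit inverse of a 2x2 matrix.
    beta0 = (d * p - b * q) / det
    beta1 = (a * q - b * p) / det
    return [beta0, beta1]

xs = [1.0, 2.0, 3.0, 4.0]
Y = [2.0, 4.0, 5.0, 8.0]  # invented observations

X = design_matrix(xs)
beta = solve_normal_equations(X, Y)

# Residual vector: Y - X beta, one entry per observation.
eps = [yi - (r[0] * beta[0] + r[1] * beta[1]) for r, yi in zip(X, Y)]

print("beta:", beta)
print("residuals:", eps)
```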