Supervised Machine Learning Flashcards
(28 cards)
Ordinary Least Squares (OLS)
A regression technique that minimizes the sum of squared differences between actual and predicted values.
Objective of OLS
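Minimize the cost function $J = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value.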
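A minimal sketch of this objective, assuming hypothetical toy data and a helper named sse (neither comes from the lesson). It fits coefficients with np.linalg.lstsq and checks that nudging them away from the fit raises the cost.

```python
import numpy as np

def sse(X, y, coef):
    """Sum of squared differences between actual and predicted values."""
    residuals = y - X @ coef
    return float(residuals @ residuals)

# Hypothetical toy data: an intercept column plus one feature.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
X = np.column_stack([np.ones_like(x), x])

# Least-squares coefficients, ordered as (intercept, slope).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

best_cost = sse(X, y, coef)
# Any other coefficient vector should give a larger cost.
nudged_cost = sse(X, y, coef + np.array([0.0, 0.1]))
print(coef, best_cost, nudged_cost)
```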
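Linear Regression Equation
$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$, where $b_0$ is the intercept and $b_1, \dots, b_p$ are the coefficients of the features $x_1, \dots, x_p$.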
Independent Variable
Variables (features) used to predict the target variable.
Dependent Variable
The target variable or output being predicted (e.g., RH - Relative Humidity).
Dataset Used
Air Quality UCI dataset.
Data Cleaning Steps
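Drop NaNs, replace decimal commas with periods so values parse as floats, remove columns with >10% missing values, replace the -200 placeholder (the dataset's missing-value marker) with the column median.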
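The cleaning steps above could look roughly like this in pandas. The miniature frame is hypothetical, and two things are assumptions: the order of the steps, and computing the median after the -200 values are set aside.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame mimicking the raw file: decimal commas stored
# as strings, -200 as a placeholder reading, and one mostly empty column.
raw = pd.DataFrame({
    "T": ["13,6", "13,3", "-200", "11,0", "11,2"],
    "AH": ["0,7578", "0,7255", "0,7502", "0,7867", "0,7888"],
    "NMHC(GT)": [np.nan, np.nan, np.nan, np.nan, "150"],
})

# 1. Fix commas in floats: "13,6" becomes 13.6.
clean = raw.apply(
    lambda col: pd.to_numeric(col.str.replace(",", ".", regex=False))
)

# 2. Remove columns with more than 10% missing values.
missing_share = clean.isna().mean()
clean = clean.loc[:, missing_share <= 0.10]

# 3. Replace -200 with each column's median (assumption: the median is
#    computed over the remaining values, not over the placeholders).
clean = clean.replace(-200, np.nan)
clean = clean.fillna(clean.median())

# 4. Drop any rows that still contain NaNs.
clean = clean.dropna()
print(clean)
```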
Date Parsing
Convert ‘Date’ column into datetime objects and extract Year, Month, Day, and Day Name.
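A short sketch of the date parsing, assuming the dates are stored as day/month/year strings (the format string and the sample rows are assumptions, not taken from the lesson).

```python
import pandas as pd

# Hypothetical sample rows; format="%d/%m/%Y" assumes day/month/year strings.
df = pd.DataFrame({"Date": ["10/03/2004", "25/12/2004", "01/01/2005"]})

df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Day Name"] = df["Date"].dt.day_name()
print(df)
```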
Pandas .groupby()
Used to group data by ‘Month’ for 2004 and 2005 separately.
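One way the per-year monthly grouping might be written. The card does not say which aggregate was used, so the mean of RH here is an assumption, as are the sample values.

```python
import pandas as pd

# Hypothetical readings with the Year and Month columns already extracted.
df = pd.DataFrame({
    "Year": [2004, 2004, 2004, 2005, 2005],
    "Month": [3, 3, 4, 1, 1],
    "RH": [48.0, 52.0, 60.0, 40.0, 44.0],
})

# Filter one year at a time, then group that year's rows by month.
monthly_2004 = df[df["Year"] == 2004].groupby("Month")["RH"].mean()
monthly_2005 = df[df["Year"] == 2005].groupby("Month")["RH"].mean()
print(monthly_2004)
print(monthly_2005)
```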
Sklearn LinearRegression
Used to fit a linear model to predict Relative Humidity (RH).
Train-Test Split
Performed with 33% test size using train_test_split().
Regression Coefficients
Weights fitted for each feature; each one gives the expected change in the prediction for a one-unit increase in that feature, holding the other features fixed.
Model Evaluation Metrics
R², MSE, RMSE, MAE (from sklearn.metrics).
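The fit, split, coefficient, and metric cards above can be tied together in one sketch. The data is synthetic (two made-up features with known weights), and random_state is added only to make the run repeatable. RMSE is taken as the square root of MSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3*x0 - 1.5*x1 + 10 plus a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 10.0 + rng.normal(scale=0.1, size=200)

# Hold out 33% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Fitted coefficients and intercept.
print(model.coef_, model.intercept_)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = float(np.sqrt(mse))
mae = mean_absolute_error(y_test, y_pred)
print(r2, mse, rmse, mae)
```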
Statsmodels OLS
An alternative to sklearn that provides a detailed regression summary, including p-values and standard errors.
Interpreting Coefficients
Shows how each feature impacts the target RH, e.g., AH has a large positive coefficient.
Heatmap (correlation matrix)
Used to visualize correlations between features (with seaborn.heatmap).
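A minimal heatmap sketch on synthetic columns (the column names echo the dataset, but the values are made up). The Agg backend and the colour settings are choices made here so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic columns with some built-in correlation.
rng = np.random.default_rng(0)
t = rng.normal(size=50)
df = pd.DataFrame({
    "T": t,
    "AH": 0.8 * t + rng.normal(scale=0.3, size=50),
    "RH": -0.5 * t + rng.normal(scale=0.5, size=50),
})

corr = df.corr()  # pairwise Pearson correlations
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Feature correlations")
plt.close(ax.figure)
```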
OLS Summary Output
Displays R², F-statistic, coefficients, confidence intervals, etc.
Matrix Equation for OLS
$XB = E$, where X = feature matrix, B = coefficient vector, E = target vector.
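With X, B, and E as named on this card, a standard way to obtain the least-squares B is the normal equations, $B = (X^T X)^{-1} X^T E$. The card itself does not state this step, so treat the sketch as an illustration on made-up numbers.

```python
import numpy as np

# Feature matrix with an intercept column; target is exactly 1 + 2*x.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
E = np.array([3.0, 5.0, 7.0, 9.0])

# Solve (X^T X) B = X^T E rather than forming the inverse explicitly.
B = np.linalg.solve(X.T @ X, X.T @ E)
print(B)
```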
Matrix Dimensions
Product valid only if cols of first = rows of second; resulting matrix has rows of first, cols of second.
Matrix Multiplication (NumPy)
Done using np.matmul() or @ operator.
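A small example of the dimension rule and both multiplication spellings, with made-up matrices: a 2x3 times a 3x2 gives a 2x2.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])         # shape (3, 2)

C = np.matmul(A, B)  # cols of A (3) match rows of B (3)
D = A @ B            # same operation with the @ operator
print(C.shape)
print(C)
```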
Identity Matrix (Mentioned)
Matrix with 1s on the diagonal, used to solve systems of linear equations (to be covered in next lesson).
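A brief preview with made-up numbers: multiplying by the identity leaves a matrix unchanged, and a matrix times its inverse gives the identity, which is the idea behind solving A x = b.

```python
import numpy as np

A = np.array([[8.0, 7.0],
              [3.0, -5.0]])
I = np.eye(2)  # 2x2 identity: ones on the diagonal, zeros elsewhere

unchanged = A @ I
product = A @ np.linalg.inv(A)  # approximately the identity

# Hypothetical right-hand side chosen so the solution is x = [2, 1].
b = np.array([23.0, 1.0])
x = np.linalg.solve(A, b)
print(unchanged, product, x)
```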
Creating Matrices (NumPy)
Use np.matrix() or nested lists, e.g., np.matrix("8 7; 3 -5").
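Both construction styles side by side. Passing the nested lists to np.array is an assumption about what the card intends; the two results hold the same values.

```python
import numpy as np

# String form: spaces separate columns, semicolons separate rows.
M = np.matrix("8 7; 3 -5")

# Nested-list form: one inner list per row.
N = np.array([[8, 7],
              [3, -5]])

print(M.shape, N.shape)
```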
Invalid Multiplication
Multiplying matrices with mismatched dimensions raises a ValueError.
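A quick demonstration with two made-up 2x3 arrays: the first has 3 columns but the second has only 2 rows, so the product is undefined.

```python
import numpy as np

A = np.ones((2, 3))
B = np.ones((2, 3))  # 3 columns in A do not match 2 rows in B

raised = False
try:
    A @ B
except ValueError as err:
    raised = True
    print("ValueError:", err)
```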
Feature Columns (Air Data)
PT08.S1(CO), C6H6(GT), NMHC(GT), NOx(GT), NO2(GT), O3(GT), T, RH, AH, Year, Month, Day