Evernote Flashcards
(34 cards)
Discrete distribution
- A discrete distribution is one in which the “data can only take on certain values, for example integers (finite)”.
- For a discrete distribution, probabilities can be assigned to the values in the distribution - for example, “the probability that the web page will have 12 clicks in an hour is 0.15.”
Continuous distribution
A continuous distribution is one in which “data can take on any value within a specified range (which may be infinite).”
- The probability associated with any particular value of a continuous distribution is zero.
Therefore, continuous distributions are normally “described in terms of probability density”, which can be converted into the probability that a value will fall within a certain range.
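To make the contrast concrete, here is a minimal sketch, assuming a Poisson click model with mean 10 for the discrete case and a standard normal for the continuous case (both are illustrative choices, not from the cards):

```python
import math

# Discrete: Poisson(mu=10) -- probability of exactly 12 clicks in an hour.
def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

p_12_clicks = poisson_pmf(12, 10)  # a single value gets nonzero probability

# Continuous: Normal(0, 1) -- P(X = x) is zero for any single x, so we
# integrate the density instead, i.e. take a CDF difference over a range.
def normal_cdf(x, mean=0.0, sd=1.0):
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

p_between = normal_cdf(1) - normal_cdf(-1)  # P(-1 < X < 1)
```

Note how the discrete variable assigns probability to an exact value, while the continuous one only assigns probability to an interval.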
Discrete and continuous data
Discrete data involves round, concrete numbers that are determined by counting.
Continuous data involves measured values that can take on any value within a range, often recorded over a specific time interval.
conditional probability
Conditional probability is the “probability of one event occurring with some relationship to one or more other events.” For example:
Event A is that it is raining outside, and it has a 0.3 (30%) chance of raining today.
Event B is that you will need to go outside, and that has a probability of 0.5 (50%).
A conditional probability would look at these two events in relationship with one another, such as the probability that it is both raining and you will need to go outside.
The formula for conditional probability is:
P(B|A) = P(A and B) / P(A)
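Plugging the rain example into the formula, with an assumed joint probability of 0.15 (the card does not state P(A and B), so this value is hypothetical):

```python
# Hypothetical probabilities for the rain example above.
p_a = 0.3            # P(A): it is raining
p_a_and_b = 0.15     # P(A and B): raining AND needing to go outside (assumed)

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a
```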
Bayes' theorem
The fundamental idea of Bayesian inference is to become “less wrong” with more data.
The process is straightforward: we have an initial belief, known as a prior, which we update as we gain additional information.
P(A|B) = P(B|A) P(A) / P(B)
Note: The conclusions drawn from Bayes' theorem are logical but counterintuitive. People almost always pay close attention to the posterior probability, but they overlook the prior probability.
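A short sketch of updating a prior with the formula above, using an assumed diagnostic-test scenario (all rates hypothetical). It also illustrates the note: a strong test result still yields a small posterior when the prior is small:

```python
# Assumed numbers: a test with 99% sensitivity and a 5% false-positive
# rate, for a condition with 1% prior prevalence.
p_d = 0.01                 # prior P(A): has the condition
p_pos_given_d = 0.99       # P(B|A): test positive given the condition
p_pos_given_not_d = 0.05   # P(B|not A): false-positive rate

# P(B) via the law of total probability.
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Posterior: P(A|B) = P(B|A) P(A) / P(B)
posterior = p_pos_given_d * p_d / p_pos  # well under 50% despite the 99% test
```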
Hypothesis testing and confidence interval estimation:
- Using Hypothesis Testing, we try to interpret or draw conclusions about the population using sample data.
- A Hypothesis Test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.
- Whenever we want to make claims about the distribution of data, or about whether one set of results is different from another set of results in applied machine learning, we must rely on statistical hypothesis tests.
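One way to run such a test without any statistics library is a permutation test; this is a sketch on assumed toy samples, where the null hypothesis is that both samples come from the same distribution:

```python
import random

random.seed(0)

# Two hypothetical samples; H0: both come from the same distribution.
a = [2.1, 2.5, 2.8, 3.0, 3.2, 2.9]
b = [3.5, 3.8, 3.1, 4.0, 3.7, 3.9]

observed = abs(sum(a) / len(a) - sum(b) / len(b))  # observed mean difference

pooled = a + b
count = 0
trials = 10_000
for _ in range(trials):
    # Under H0, group labels are exchangeable: reshuffle and resplit.
    random.shuffle(pooled)
    pa, pb = pooled[:len(a)], pooled[len(a):]
    if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
        count += 1

p_value = count / trials  # a small p-value is evidence against H0
```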
Random variable
A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon.
Discrete random variable
A discrete random variable is one which may take on only a “countable number of distinct values” such as 0, 1, 2, 3, 4, …
Discrete random variables are usually (but not necessarily) counts.
If a random variable can take only a finite number of distinct values, then it must be discrete.
Examples of discrete random variables include the number of children in a family, the Friday night attendance at a cinema, the number of patients in a doctor’s surgery, the number of defective light bulbs in a box of ten.
Continuous random variable
A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, the time required to run a mile.
Cumulative distribution function
All random variables (discrete and continuous) have a cumulative distribution function.
For a discrete random variable, the cumulative distribution function is found by summing up the probabilities.
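A minimal sketch of summing a PMF into a CDF, assuming a fair six-sided die as the discrete random variable:

```python
from itertools import accumulate

# PMF of a fair six-sided die (assumed example).
values = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6

# The CDF at each value is the running sum of the probabilities.
cdf = list(accumulate(pmf))

p_at_most_4 = cdf[3]  # P(X <= 4) = 4/6
```

The last CDF entry is always 1, since the probabilities of all outcomes must sum to one.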
Simple linear regression
- Simple linear regression is used to estimate the relationship between two quantitative variables.
- Used for regression problems
- Output range: -inf to +inf
y = B0 + B1 X + e
Linear regression finds the “line of best fit” through your data by searching for the regression coefficient (B1) that minimizes the total error (e) of the model.
Cost function: Mean square error
- y is the predicted value of the dependent variable for any given value of the independent variable (x).
- B0 is the intercept, the predicted value of y when x is 0.
- B1 is the regression coefficient – how much we expect y to change as x increases.
- x is the independent variable (the variable we expect is influencing y).
- e is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.
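The coefficients B0 and B1 that minimize the mean square error have a closed-form least-squares solution; a sketch on assumed toy data:

```python
# Closed-form least-squares fit of y = B0 + B1*x on toy data (assumed values).
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 6.2, 7.9, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# B1 = covariance(x, y) / variance(x); B0 then follows from the means.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

# Cost function: mean square error of the fitted line.
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / n
```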
Simple logistic regression
- Uses sigmoid function
- Better suited for classification problem
- Range within 0-1
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Cost function: log loss (cross-entropy); see the link below.
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
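A minimal sketch of the sigmoid prediction formula above, with hypothetical coefficients b0 and b1 (these are assumed, not fitted):

```python
import math

# y = sigmoid(b0 + b1*x); algebraically equal to
# e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) from the card above.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

b0, b1 = -4.0, 1.5   # hypothetical fitted coefficients
x = 3.0

p = sigmoid(b0 + b1 * x)      # always within (0, 1)
label = 1 if p >= 0.5 else 0  # threshold at 0.5 for classification
```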
Types of logistic regression
- Binary Logistic Regression: the categorical response has only two possible outcomes. Example: Spam or Not Spam.
- Multinomial Logistic Regression: three or more categories without ordering. Example: predicting which food is preferred more (Veg, Non-Veg, Vegan).
- Ordinal Logistic Regression: three or more categories with ordering. Example: movie rating from 1 to 5.
Mean square error
https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
logistic regression cost function
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
gradient descent
Gradient descent is an “optimization algorithm” that’s used when training a machine learning model.
It’s “based on a convex function” and “tweaks its parameters iteratively to minimize a given function to its local minimum.”
Gradient:
“A gradient measures how much the output of a function changes if you change the inputs a little bit.” — Lex Fridman (MIT).
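A minimal sketch of gradient descent on the convex function f(x) = (x - 3)^2, whose minimum is at x = 3 (the function, learning rate, and iteration count are all assumed for illustration):

```python
# f(x) = (x - 3)^2 has gradient f'(x) = 2 * (x - 3) and its minimum at x = 3.
def grad(x):
    return 2 * (x - 3)

x = 0.0      # initial parameter guess
lr = 0.1     # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)   # step in the direction opposite the gradient

# x has converged toward the minimizer, 3
```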
Generalised linear models
In statistics, a generalized linear model is a “flexible generalization of ordinary linear regression” that allows the response variable (Y) to have an “error distribution other than the normal distribution”.
Applicable when the relationship between X and Y is not linear (for example, exponential).
https://www.mygreatlearning.com/blog/generalized-linear-models/
Components of GLM
There are 3 components in GLM.
Systematic Component/Linear Predictor:
It is just the linear combination of the Predictors and the regression coefficients.
β0+β1X1+β2X2
Link Function:
Represented as η or g(μ), it specifies the link between the random and systematic components. It indicates how the expected/predicted value of the response relates to the linear combination of predictor variables.
Random Component/Probability Distribution:
It refers to the probability distribution, from the family of distributions, of the response variable.
The family of distributions, called the exponential family, includes the normal distribution, the binomial distribution, and the Poisson distribution.
Exponential family of distribution
Probability distributions and their corresponding link functions:
- Normal distribution: Identity function
- Binomial distribution: Logit/Sigmoid function
- Poisson distribution: Log function (aka log-linear, log-link)
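A quick sketch of the inverse link functions from the table above, mapping an assumed linear-predictor value η back to the expected response μ:

```python
import math

# eta = b0 + b1*x is the linear predictor; the inverse link maps it to
# the expected response mu. The value below is assumed for illustration.
eta = 0.7

mu_identity = eta                      # Normal: identity link, mu = eta
mu_logit = 1 / (1 + math.exp(-eta))    # Binomial: inverse of the logit link
mu_log = math.exp(eta)                 # Poisson: inverse of the log link

# Applying the logit link to mu recovers eta, as the link definition requires.
assert abs(math.log(mu_logit / (1 - mu_logit)) - eta) < 1e-9
```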
Regularisation
To reduce overfitting of a model
“It is a form of regression that shrinks the coefficient estimates towards zero.” In other words, this technique forces us not to learn a more complex or flexible model, to avoid the problem of overfitting.
Regularisation type
Ridge Regression
Lasso regression
Lasso regression
Lasso regression is another variant of the regularization technique used to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
👉 It is similar to the Ridge Regression except that the penalty term includes the absolute weights instead of a square of weights. Therefore, the optimization function becomes:
Cost function for Lasso regression: the sum of squared errors plus the L1 penalty on the weights,
Cost = Σ(yi − ŷi)² + λ Σ|βj|
👉 In statistics, it is known as the L1 norm.
👉 In this technique, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero, which means there is a complete removal of some of the features for model evaluation when the tuning parameter λ is sufficiently large. Therefore, the lasso method also performs feature selection and is said to yield sparse models.
👉Limitation of Lasso Regression:
- Problems with some types of dataset: if the number of predictors (p) is greater than the number of data points (n), Lasso will select at most n predictors as non-zero, even if all predictors are relevant.
- Multicollinearity problem: if there are two or more highly collinear variables, LASSO regression selects one of them arbitrarily, which is not good for the interpretation of our model.
Key Differences between Ridge and Lasso Regression
👉Ridge regression helps us to reduce only the overfitting in the model while keeping all the features present in the model. It reduces the complexity of the model by shrinking the coefficients whereas Lasso regression helps in reducing the problem of overfitting in the model as well as automatic feature selection.
👉 Lasso Regression tends to make coefficients to absolute zero whereas Ridge regression never sets the value of coefficient to absolute zero.
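The difference can be seen in one dimension: for a single least-squares coefficient, the ridge penalty rescales it toward zero, while the lasso soft-threshold can set it exactly to zero (the coefficient and λ below are assumed values):

```python
# One-dimensional shrinkage under the two penalties.
def ridge_shrink(b, lam):
    # Ridge: divide by (1 + lambda) -- shrinks but never reaches zero.
    return b / (1 + lam)

def lasso_soft_threshold(b, lam):
    # Lasso: subtract lambda from the magnitude, clipping at zero.
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

b, lam = 0.4, 0.5        # assumed coefficient and penalty strength
ridge_b = ridge_shrink(b, lam)          # shrunk, but still nonzero
lasso_b = lasso_soft_threshold(b, lam)  # exactly zero: feature removed
```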
What does Regularization achieve?
👉 In simple linear regression, the standard least-squares model tends to have some variance in it, i.e. this model won’t generalize well for a future data set that is different from its training data.
👉 Regularization tries to reduce the variance of the model, without a substantial increase in the bias.