Evernote Flashcards
(34 cards)
Discrete distribution
- A discrete distribution is one in which the “data can only take on certain values, for example integers (finite)”.
- For a discrete distribution, probabilities can be assigned to the values in the distribution - for example, “the probability that the web page will have 12 clicks in an hour is 0.15.”
Continuous distribution
A continuous distribution is one in which “data can take on any value within a specified range (which may be infinite).”
- The probability associated with any particular value of a continuous distribution is zero.
Therefore, continuous distributions are normally “described in terms of probability density”, which can be converted into the probability that a value will fall within a certain range.
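To make the contrast concrete, here is a minimal sketch, assuming a Poisson click model with mean 10 for the discrete case and a standard normal for the continuous case (both are illustrative choices, not from the cards):

```python
import math

# Discrete: Poisson(mu=10) -- probability of exactly 12 clicks in an hour.
def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

p_12_clicks = poisson_pmf(12, 10)  # a single value gets nonzero probability

# Continuous: Normal(0, 1) -- P(X = x) is zero for any single x, so we
# integrate the density instead, i.e. take a CDF difference over a range.
def normal_cdf(x, mean=0.0, sd=1.0):
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

p_between = normal_cdf(1) - normal_cdf(-1)  # P(-1 < X < 1)
```

Note how the discrete variable assigns probability to an exact value, while the continuous one only assigns probability to an interval.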
Discrete and continuous data
Discrete data involves round, concrete numbers that are determined by counting.
Continuous data involves measured values that can take on any value within a range, often recorded over a specific time interval.
conditional probability
Conditional probability is the “probability of one event occurring with some relationship to one or more other events.” For example:
Event A is that it is raining outside, and it has a 0.3 (30%) chance of raining today.
Event B is that you will need to go outside, and that has a probability of 0.5 (50%).
A conditional probability would look at these two events in relationship with one another, such as the probability that it is both raining and you will need to go outside.
The formula for conditional probability is:
P(B|A) = P(A and B) / P(A)
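Plugging the rain example into the formula, with an assumed joint probability of 0.15 (the card does not state P(A and B), so this value is hypothetical):

```python
# Hypothetical probabilities for the rain example above.
p_a = 0.3            # P(A): it is raining
p_a_and_b = 0.15     # P(A and B): raining AND needing to go outside (assumed)

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_b_given_a = p_a_and_b / p_a
```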
Bayes' theorem
The fundamental idea of Bayesian inference is to become “less wrong” with more data.
The process is straightforward: we have an initial belief, known as a prior, which we update as we gain additional information.
P(A|B) = P(B|A) P(A) / P(B)
Note: The conclusions drawn from Bayes' theorem are logical but counterintuitive. People almost always pay close attention to the posterior probability, but they overlook the prior probability.
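A short sketch of updating a prior with the formula above, using an assumed diagnostic-test scenario (all rates hypothetical). It also illustrates the note: a strong test result still yields a small posterior when the prior is small:

```python
# Assumed numbers: a test with 99% sensitivity and a 5% false-positive
# rate, for a condition with 1% prior prevalence.
p_d = 0.01                 # prior P(A): has the condition
p_pos_given_d = 0.99       # P(B|A): test positive given the condition
p_pos_given_not_d = 0.05   # P(B|not A): false-positive rate

# P(B) via the law of total probability.
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Posterior: P(A|B) = P(B|A) P(A) / P(B)
posterior = p_pos_given_d * p_d / p_pos  # well under 50% despite the 99% test
```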
Hypothesis testing and confidence interval estimation:
- Using Hypothesis Testing, we try to interpret or draw conclusions about the population using sample data.
- A Hypothesis Test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.
- Whenever we want to make claims about the distribution of data, or about whether one set of results is different from another set of results in applied machine learning, we must rely on statistical hypothesis tests.
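One way to run such a test without any statistics library is a permutation test; this is a sketch on assumed toy samples, where the null hypothesis is that both samples come from the same distribution:

```python
import random

random.seed(0)

# Two hypothetical samples; H0: both come from the same distribution.
a = [2.1, 2.5, 2.8, 3.0, 3.2, 2.9]
b = [3.5, 3.8, 3.1, 4.0, 3.7, 3.9]

observed = abs(sum(a) / len(a) - sum(b) / len(b))  # observed mean difference

pooled = a + b
count = 0
trials = 10_000
for _ in range(trials):
    # Under H0, group labels are exchangeable: reshuffle and resplit.
    random.shuffle(pooled)
    pa, pb = pooled[:len(a)], pooled[len(a):]
    if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
        count += 1

p_value = count / trials  # a small p-value is evidence against H0
```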
Random variable
A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon.
Discrete random variable
A discrete random variable is one which may take on only a “countable number of distinct values” such as 0, 1, 2, 3, 4, …
Discrete random variables are usually (but not necessarily) counts.
If a random variable can take only a finite number of distinct values, then it must be discrete.
Examples of discrete random variables include the number of children in a family, the Friday night attendance at a cinema, the number of patients in a doctor’s surgery, the number of defective light bulbs in a box of ten.
Continuous random variable
A continuous random variable is one which takes an infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, the time required to run a mile.
Cumulative distribution function
All random variables (discrete and continuous) have a cumulative distribution function.
For a discrete random variable, the cumulative distribution function is found by summing up the probabilities.
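A minimal sketch of summing a PMF into a CDF, assuming a fair six-sided die as the discrete random variable:

```python
from itertools import accumulate

# PMF of a fair six-sided die (assumed example).
values = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6

# The CDF at each value is the running sum of the probabilities.
cdf = list(accumulate(pmf))

p_at_most_4 = cdf[3]  # P(X <= 4) = 4/6
```

The last CDF entry is always 1, since the probabilities of all outcomes must sum to one.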
Simple linear regression
- Simple linear regression is used to estimate the relationship between two quantitative variables.
- Used for regression problems
- Output range: -inf to +inf
y = B0 + B1 X + e
Linear regression finds the “line of best fit” through your data by searching for the regression coefficient (B1) that minimizes the total error (e) of the model.
Cost function: Mean square error
- y is the predicted value of the dependent variable for any given value of the independent variable (x).
- B0 is the intercept, the predicted value of y when x is 0.
- B1 is the regression coefficient – how much we expect y to change as x increases.
- x is the independent variable (the variable we expect is influencing y).
- e is the error of the estimate, or how much variation there is in our estimate of the regression coefficient.
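The coefficients B0 and B1 that minimize the mean square error have a closed-form least-squares solution; a sketch on assumed toy data:

```python
# Closed-form least-squares fit of y = B0 + B1*x on toy data (assumed values).
x = [1, 2, 3, 4, 5]
y = [2.1, 4.3, 6.2, 7.9, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# B1 = covariance(x, y) / variance(x); B0 then follows from the means.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
     sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

# Cost function: mean square error of the fitted line.
mse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / n
```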
Simple logistic regression
- Uses sigmoid function
- Better suited for classification problem
- Range within 0-1
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Cost function: log loss (cross-entropy); see the link below.
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
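A minimal sketch of the sigmoid prediction formula above, with hypothetical coefficients b0 and b1 (these are assumed, not fitted):

```python
import math

# y = sigmoid(b0 + b1*x); algebraically equal to
# e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) from the card above.
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

b0, b1 = -4.0, 1.5   # hypothetical fitted coefficients
x = 3.0

p = sigmoid(b0 + b1 * x)      # always within (0, 1)
label = 1 if p >= 0.5 else 0  # threshold at 0.5 for classification
```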
Types of logistic regression
- Binary Logistic Regression: the categorical response has only two possible outcomes. Example: Spam or Not Spam.
- Multinomial Logistic Regression: three or more categories without ordering. Example: predicting which food is preferred more (Veg, Non-Veg, Vegan).
- Ordinal Logistic Regression: three or more categories with ordering. Example: movie rating from 1 to 5.
Mean square error
https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
logistic regression cost function
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
gradient descent
Gradient descent is an “optimization algorithm” that’s used when training a machine learning model.
It’s “based on a convex function” and “tweaks its parameters iteratively to minimize a given function to its local minimum.”
Gradient:
“A gradient measures how much the output of a function changes if you change the inputs a little bit.” — Lex Fridman (MIT).
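A minimal sketch of gradient descent on the convex function f(x) = (x - 3)^2, whose minimum is at x = 3 (the function, learning rate, and iteration count are all assumed for illustration):

```python
# f(x) = (x - 3)^2 has gradient f'(x) = 2 * (x - 3) and its minimum at x = 3.
def grad(x):
    return 2 * (x - 3)

x = 0.0      # initial parameter guess
lr = 0.1     # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)   # step in the direction opposite the gradient

# x has converged toward the minimizer, 3
```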
Generalised linear models
In statistics, a generalized linear model is a “flexible generalization of ordinary linear regression” that allows the response variable (Y) to have an “error distribution other than the normal distribution”.
Applicable when the relationship between X and Y is not linear (for example, exponential).
https://www.mygreatlearning.com/blog/generalized-linear-models/
Components of GLM
There are 3 components in GLM.
Systematic Component/Linear Predictor:
It is just the linear combination of the Predictors and the regression coefficients.
β0+β1X1+β2X2
Link Function:
Represented as η or g(μ), it specifies the link between the random and systematic components. It indicates how the expected/predicted value of the response relates to the linear combination of predictor variables.
Random Component/Probability Distribution:
It refers to the probability distribution, from the family of distributions, of the response variable.
The family of distributions, called the exponential family, includes the normal distribution, the binomial distribution, and the Poisson distribution.
Exponential family of distribution
Probability distributions and their corresponding link functions:
- Normal distribution: Identity function
- Binomial distribution: Logit/Sigmoid function
- Poisson distribution: Log function (aka log-linear, log-link)
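A quick sketch of the inverse link functions from the table above, mapping an assumed linear-predictor value η back to the expected response μ:

```python
import math

# eta = b0 + b1*x is the linear predictor; the inverse link maps it to
# the expected response mu. The value below is assumed for illustration.
eta = 0.7

mu_identity = eta                      # Normal: identity link, mu = eta
mu_logit = 1 / (1 + math.exp(-eta))    # Binomial: inverse of the logit link
mu_log = math.exp(eta)                 # Poisson: inverse of the log link

# Applying the logit link to mu recovers eta, as the link definition requires.
assert abs(math.log(mu_logit / (1 - mu_logit)) - eta) < 1e-9
```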
Regularisation
To reduce overfitting of a model
“It is a form of regression that shrinks the coefficient estimates towards zero.” In other words, this technique forces us not to learn a more complex or flexible model, to avoid the problem of overfitting.
Regularisation type
Ridge Regression
Lasso regression
Lasso regression
Lasso regression is another variant of the regularization technique used to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
👉 It is similar to the Ridge Regression except that the penalty term includes the absolute weights instead of a square of weights. Therefore, the optimization function becomes:
Cost function for Lasso regression: the sum of squared errors plus the L1 penalty on the weights,
Cost = Σ(yi − ŷi)² + λ Σ|βj|
👉 In statistics, it is known as the L1 norm.
👉 In this technique, the L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero, which means there is a complete removal of some of the features for model evaluation when the tuning parameter λ is sufficiently large. Therefore, the lasso method also performs feature selection and is said to yield sparse models.
👉Limitation of Lasso Regression:
- Problems with some types of dataset: if the number of predictors (p) is greater than the number of data points (n), Lasso will select at most n predictors as non-zero, even if all predictors are relevant.
- Multicollinearity problem: if there are two or more highly collinear variables, LASSO regression selects one of them arbitrarily, which is not good for the interpretation of our model.
Key Differences between Ridge and Lasso Regression
👉Ridge regression helps us to reduce only the overfitting in the model while keeping all the features present in the model. It reduces the complexity of the model by shrinking the coefficients whereas Lasso regression helps in reducing the problem of overfitting in the model as well as automatic feature selection.
👉 Lasso Regression tends to make coefficients to absolute zero whereas Ridge regression never sets the value of coefficient to absolute zero.
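The difference can be seen in one dimension: for a single least-squares coefficient, the ridge penalty rescales it toward zero, while the lasso soft-threshold can set it exactly to zero (the coefficient and λ below are assumed values):

```python
# One-dimensional shrinkage under the two penalties.
def ridge_shrink(b, lam):
    # Ridge: divide by (1 + lambda) -- shrinks but never reaches zero.
    return b / (1 + lam)

def lasso_soft_threshold(b, lam):
    # Lasso: subtract lambda from the magnitude, clipping at zero.
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

b, lam = 0.4, 0.5        # assumed coefficient and penalty strength
ridge_b = ridge_shrink(b, lam)          # shrunk, but still nonzero
lasso_b = lasso_soft_threshold(b, lam)  # exactly zero: feature removed
```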
What does Regularization achieve?
👉 In simple linear regression, the standard least-squares model tends to have some variance in it, i.e. this model won’t generalize well for a future data set that is different from its training data.
👉 Regularization tries to reduce the variance of the model, without a substantial increase in the bias.