Math Flashcards

1
Q

Alternative hypothesis

A

In statistical hypothesis testing, the null hypothesis and alternative hypothesis are two mutually exclusive statements.
The alternative hypothesis (often denoted as Ha or H1) is a statement that contradicts the null hypothesis and usually assumes that the hypothesised effect exists. It represents the researcher’s hypothesis or the claim to be tested. The alternative hypothesis suggests that there is a significant effect, relationship, or difference between variables in the population, while the null hypothesis usually states that there is no effect.

2
Q

Arg Max function

A

Arg Max (arg max): A mathematical function that returns the input value where a given function achieves its maximum value. In other words, it finds the input that makes the function’s output the highest.

arg maxₓ f(x) = {x | f(x) = max(f(x’))}
(where x’ represents all possible inputs)

Common Uses:
* Optimization problems
* Machine learning algorithms
* Decision-making (finding the best solution)
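
As a quick illustration, here is a minimal NumPy sketch (the function and the grid of candidate inputs are arbitrary choices for the example):

import numpy as np

x = np.linspace(-3, 3, 601)        # candidate inputs
f = -(x - 1.0) ** 2 + 2.0          # a function that peaks at x = 1

best_index = np.argmax(f)          # index of the maximum value of f
x_star = x[best_index]             # arg max: the input achieving that maximum
print(x_star, f[best_index])       # ~1.0, ~2.0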

3
Q

Average (mean)

A

Average (Mean): A measure of central tendency representing the typical or central value in a dataset.

Calculation: Sum of all values divided by the total number of values.

Mathematical Formula:
x̄ = (1/n) Σ_{i=1}^n x_i

(Where x̄ is the average, n is the number of data points, and x_i represents each value)

Uses:
* Descriptive statistics
* Data analysis
* Comparing datasets or groups

4
Q

Base rate

A

Refers to the underlying probability of an event occurring in a population, regardless of other factors. It serves as a benchmark for assessing the likelihood of an event. Understanding the base rate is crucial for making accurate predictions and evaluating the performance of predictive models. For example, in medical diagnosis, the base rate might represent the prevalence of a disease within a certain population, providing valuable context for interpreting diagnostic test results.

5
Q

Basis

A

In linear algebra, a basis is a set of linearly independent vectors that span a vector space, meaning any vector in the space can be expressed as a unique linear combination of basis vectors. Basis vectors form the building blocks for representing and understanding vector spaces, facilitating operations such as vector addition, scalar multiplication, and linear transformations. For example, in Euclidean space, the standard basis consists of orthogonal unit vectors aligned with the coordinate axes (e.g., {(1, 0, 0), (0, 1, 0), (0, 0, 1)} for 3-dimensional space), enabling the representation of any point in the space using coordinates along these axes.

6
Q

Bellman Equations

A

Bellman Equations are a set of recursive equations used in dynamic programming and reinforcement learning to express the value of a decision problem in terms of the values of its subproblems. They provide a way to decompose a complex decision-making process into smaller, more manageable steps.

Application: Bellman Equations are fundamental in reinforcement learning algorithms such as value iteration and Q-learning, where they are used to compute the optimal value function or policy for a given environment.
Example: In a grid world environment where an agent must navigate to a goal while avoiding obstacles, Bellman Equations express the value of each state as the immediate reward plus the discounted value of the subsequent state reached by taking an optimal action.
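
One common written form is the optimal state-value equation (shown here in its standard reinforcement-learning notation, which the card above describes in words):

V*(s) = max_a [ R(s, a) + γ · Σ_s′ P(s′ | s, a) · V*(s′) ]

where γ is the discount factor, R(s, a) is the immediate reward for taking action a in state s, and P(s′ | s, a) is the probability of reaching state s′.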

7
Q

Bernoulli Distribution

A

A discrete probability distribution that models the outcomes of a binary random experiment. It is characterized by a single parameter, p, representing the probability of success (usually denoted by 1) in a single trial and the probability of failure (denoted by 0) as 1 - p. The distribution is commonly used to model simple events with two possible outcomes, such as success or failure, heads or tails, and yes or no.

8
Q

Binomial coefficient formula

A

The formula calculates the number of ways you can choose a smaller group (k) out of a larger group (n) when the order you pick them in doesn’t matter.

Formula: (n k) = n! / (k! (n-k)!)
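
A quick check in Python (Python 3.8+ ships math.comb, which computes exactly this quantity; the numbers are illustrative):

import math

n, k = 5, 2
print(math.comb(n, k))                                                    # 10
print(math.factorial(n) // (math.factorial(k) * math.factorial(n - k)))   # 10, same value from the formula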

9
Q

Binomial distribution

A

Binomial Distribution: A discrete probability distribution describing the number of successes in a fixed number of independent trials, each with the same success probability (p). It is used to model binary outcomes (success/failure) in various fields.

Common Notation: B(n, p)
* n: Number of trials
* p: Probability of success on each trial

The probability mass function (PMF) of the binomial distribution gives the probability of observing exactly k successes in n trials:
P(X = k) = (n k) pᵏ (1 - p)ⁿ⁻ᵏ

(where ‘k’ is the number of successes and (n k) is the binomial coefficient)
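
A small sketch comparing the formula with SciPy’s implementation (the values of n, p, and k are illustrative):

import math
from scipy.stats import binom

n, p, k = 10, 0.3, 4
manual = math.comb(n, k) * p**k * (1 - p)**(n - k)   # PMF written out from the formula above
print(manual)                                        # ~0.2001
print(binom.pmf(k, n, p))                            # same value from scipy.stats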

10
Q

Block Matrices

A

Matrices composed of smaller submatrices arranged in a rectangular array.

11
Q

Capital Sigma Notation

A

The summation, denoted by the capital Greek letter Σ, over a collection X = {x_1, x_2, …, x_(n-1), x_n} or over the attributes of a vector x = [x^(1), x^(2), …, x^(m-1), x^(m)].

12
Q

Cartesian coordinate system

A

The Cartesian coordinate system, named after the French mathematician René Descartes, provides a geometric framework for specifying the positions of points in a plane or space using ordered pairs or triplets of numbers, respectively. In a two-dimensional Cartesian coordinate system, points are located with reference to two perpendicular axes, usually labeled x and y, intersecting at a point called the origin. The coordinates of a point represent its distances from the axes along the respective directions. The Cartesian coordinate system serves as the foundation for analytic geometry, facilitating the study of geometric shapes, equations, and transformations in mathematical analysis and physics.

13
Q

Cauchy Distribution

A

A probability distribution that arises frequently in various areas of mathematics and physics. It is characterized by its symmetric bell-shaped curve and heavy tails, which indicate that extreme values are more likely compared to other symmetric distributions like the normal distribution. The Cauchy distribution has no defined mean or variance due to its heavy tails, making it challenging to work with in statistical analysis. However, it has applications in fields such as physics, finance, and signal processing.

14
Q

Central Limit Theorem (CLT)

A

A key concept in statistics that states that the distribution of sample means from any population approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution. This theorem is crucial in inferential statistics as it allows for the estimation of population parameters and the construction of confidence intervals and hypothesis tests, even when the population distribution is unknown or non-normal. The Central Limit Theorem is widely applied in various fields, including finance, biology, and engineering, where statistical inference is essential.
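
A small simulation illustrating the theorem (the skewed exponential population and the sample sizes are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)   # heavily skewed, far from normal

for n in (2, 30, 500):                                  # increasing sample sizes
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(n, round(means.mean(), 3), round(means.std(), 3))
# As n grows, the sample means cluster more tightly and symmetrically around the population mean (2.0)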

15
Q

Central Tendencies

A

Central tendencies, also known as measures of central tendency, are summary statistics that describe the central location or typical value of a dataset. They provide insights into the distribution of data and help summarize the main features of the dataset. The three main measures of central tendency are the:

  • mean (the arithmetic average of all values in the dataset; sensitive to outliers)
  • median (the middle value of the dataset when the values are arranged in ascending or descending order; robust to outliers)
  • mode (the most frequently occurring value(s) in the dataset; applicable to both numerical and categorical data)
16
Q

Chain rule

A

Chain Rule (Calculus): A fundamental rule for finding the derivative of composite functions (functions made up of other functions).

States: The derivative of the composite function f(g(x)) is equal to the derivative of the outer function f evaluated at the inner function g(x), multiplied by the derivative of the inner function g. In mathematical notation, this can be expressed as:

d/dx [f(g(x))] = f’(g(x)) * g’(x)

where f’(g(x)) represents the derivative of the outer function f evaluated at g(x), and g’(x) represents the derivative of the inner function g.

Combined with the sum rule (the derivative of a sum is the sum of derivatives), the chain rule lets us differentiate a simple neuron:

u = w1x1 + w2x2 + b
y = f(u) //(f being the activation function)

dy/dx1 = (dy/du) * (du/dx1) = f’(u) * w1
dy/dw1 = (dy/du) * (du/dw1) = f’(u) * x1
dy/db = (dy/du) * (du/db) = f’(u) * 1

Each coefficient (weight) or bias has its own “chain” within the overall calculation. The derivative of the activation function (f’(u)) is a common factor dictating how much a change anywhere in the input (u) affects the output. This is the core of why we can calculate the contribution of individual weights and biases to the error during backpropagation!
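
A minimal sketch checking these chain-rule derivatives numerically for a single neuron (the sigmoid activation and the specific numbers are illustrative assumptions):

import numpy as np

def f(u):                       # activation function (sigmoid, chosen for the example)
    return 1.0 / (1.0 + np.exp(-u))

def f_prime(u):                 # its derivative
    s = f(u)
    return s * (1.0 - s)

w1, w2, b = 0.5, -0.3, 0.1
x1, x2 = 2.0, 1.0
u = w1 * x1 + w2 * x2 + b

dy_dw1 = f_prime(u) * x1        # analytic derivative from the chain rule
eps = 1e-6                      # numerical check: nudge w1 slightly
numeric = (f((w1 + eps) * x1 + w2 * x2 + b) - f(u)) / eps
print(dy_dw1, numeric)          # the two values agree closely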

17
Q

Codomain

A

In mathematics, the codomain of a function is the set of all possible output values or elements that the function can produce. It is the set of values to which the function maps its domain elements. The codomain represents the entire range of possible outputs of the function, regardless of whether all elements in the codomain are actually attained by the function.

The codomain is distinct from the range, which refers to the set of actual output values produced by the function when evaluated on its domain. In function notation, the codomain is typically denoted as the set Y in the function f: X → Y, where X is the domain of the function and Y is the codomain.

The codomain provides information about the possible outputs of a function and helps define the scope and range of the function’s behavior.

18
Q

Combinatorics

A

A branch of mathematics concerned with counting, arranging, and analyzing the combinations and permutations of finite sets of objects. In machine learning and artificial intelligence, combinatorics plays a crucial role in feature engineering, model parameterization, and optimization algorithms.

19
Q

Concave function

A

The opposite of a convex function: a function for which, whenever you connect two points on its graph, the connecting line segment lies on or below the graph. Concave functions are essential in convex optimization, where they serve as objective functions or constraints in optimization problems. In machine learning and artificial intelligence, concave functions find applications in convex optimization algorithms, such as gradient descent, for training models, minimizing loss functions, and solving constrained optimization problems. Understanding concave functions is crucial for designing efficient optimization algorithms and analyzing the convergence properties of machine learning models.

20
Q

Conditional distribution

A

The probability distribution of a random variable given the value or values of another variable. It describes the likelihood of observing certain outcomes of one variable given specific conditions on another variable. Conditional distributions are fundamental for modeling dependencies and relationships between variables in probabilistic models, Bayesian inference, and predictive modeling tasks.

21
Q

Confidence Intervals

A

Statistical intervals used to estimate the range of plausible values for a population parameter, such as the mean or proportion, based on sample data. They provide a measure of uncertainty around the point estimate and quantify the precision of estimation. Confidence intervals are essential for hypothesis testing, parameter estimation, and assessing the reliability of statistical inference in machine learning and data analysis.

22
Q

Continuous

A

Continuous variables are those that can take any real value within a certain range or interval. They are characterized by an infinite number of possible values and are typically represented by real numbers. Continuous variables are prevalent in data analysis, modeling, and predictive tasks, such as regression analysis, time series forecasting, and density estimation.

23
Q

Continuous random variable

A

In contrast to discrete random variables, continuous random variables can take on an infinite number of possible values within a specified range. These values are typically associated with measurements or quantities that can take any value within a certain interval. Continuous random variables are described by probability density functions (PDFs), which indicate the likelihood of observing a value within a given range. Examples of continuous random variables include height, weight, temperature, and time.

24
Q

Continuous variable

A

A type of quantitative variable that can take on an infinite number of values within a specified range or interval. Continuous variables are characterized by having an uncountable and infinite number of possible values, including both whole numbers and fractional values. They can take on any value within their range, and the concept of “gaps” between values is not meaningful. Continuous variables are typically represented by real numbers and are subject to arithmetic operations such as addition, subtraction, multiplication, and division.

Examples of continuous variables include measurements such as height, weight, temperature, time, and distance.

25
Q

Convex

A

In mathematics, a set or function is said to be convex if every line segment connecting two points within the set lies entirely within the set itself. In other words, a set is convex if, for any two points x and y in the set, the line segment connecting x and y is also contained in the set. Similarly, a function is convex if its epigraph (the region lying above the graph of the function) is a convex set.

Convexity is a fundamental concept in optimization, geometry, and mathematical analysis, with many important properties and applications. Convex sets and functions have desirable properties such as uniqueness of solutions, global optimality, and efficient optimization algorithms. Convexity plays a crucial role in convex optimization problems, machine learning algorithms, economics, game theory, and signal processing, among other fields.

Convexity plays a crucial role in machine learning optimization problems. It simplifies the optimization process by ensuring well-behaved objective functions, allowing efficient algorithms like gradient descent to find global minima. Convexity guarantees that any local minimum is also a global minimum, providing confidence in the optimality of solutions. Convex problems are robust to initialization, making optimization less sensitive to starting points. Additionally, convexity promotes generalization by leading to simpler models with fewer parameters and facilitating the use of regularization techniques.

26
Q

Correlation matrix

A

A square matrix that summarizes the correlation coefficients between pairs of variables in a dataset. Each entry in the matrix represents the correlation between two variables, indicating the strength and direction of their linear relationship. Correlation matrices are commonly used in exploratory data analysis and feature selection to identify patterns, dependencies, and multicollinearity among variables in machine learning and statistical modeling.

27
Q

Covariance

A

A statistical measure that quantifies the degree of joint variability between two random variables. It indicates the tendency of the variables to vary together, either positively or negatively, from their respective means. Positive covariance indicates that the variables tend to increase or decrease together, while negative covariance indicates that one variable tends to increase as the other decreases. Covariance is a fundamental concept in statistics, machine learning, and finance, where it serves as a measure of linear relationship between variables.

28
Q

Covariance matrix

A

A square matrix that summarizes the covariances between pairs of variables in a dataset. It is a symmetric matrix where each entry represents the covariance between two variables. Covariance matrices are essential in multivariate statistics and machine learning, where they characterize the relationships and variability among multiple variables simultaneously. In machine learning, covariance matrices are used in techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Gaussian distribution modeling for dimensionality reduction, feature selection, and statistical inference.

29
Q

Covariance of data

A
  • Measures: The direction and degree to which two random variables change together.
  • Range: Can range from negative infinity to positive infinity.
  • Units: The units reflect the product of the units of the two variables being measured. This makes it harder to interpret directly.
  • Impact of scaling: If you change the scale of one or both variables (e.g., switch from inches to centimeters), the covariance value will also change.
30
Q

Critical value

A

A threshold or reference point used in statistical hypothesis testing to determine the significance of test results. It represents the boundary beyond which the null hypothesis is rejected or the test statistic is considered extreme enough to warrant further investigation. Critical values are derived from probability distributions, such as the standard normal distribution or t-distribution, and correspond to specific levels of significance or confidence levels. Critical values play a crucial role in hypothesis testing, confidence intervals, and decision-making in statistical analysis.

31
Q

Cumulative distribution function (CDF)

A

A probability distribution function that represents the probability that a random variable takes on a value less than or equal to a given point. In other words, it provides the cumulative probability distribution of a random variable. The CDF is often denoted by F(x) and is used to analyze and understand the probability distribution of continuous and discrete random variables. It is a fundamental concept in statistics and probability theory, commonly used in hypothesis testing, estimation, and modeling.

32
Q

Density estimation

A

A statistical technique used to estimate the probability density function (PDF) of a random variable based on observed data. It involves estimating the underlying distribution of the data points in a continuous domain. Density estimation methods include parametric approaches (such as fitting a Gaussian or another assumed distribution to the data) and non-parametric approaches such as histograms, kernel density estimation, and nearest-neighbor methods. Density estimation is commonly used in exploratory data analysis, modeling univariate and multivariate distributions, and generating synthetic data for simulation and modeling.

It helps in understanding probability distributions: visualizing the overall shape and spread of a distribution from a data sample, identifying modes (peaks) in the data that suggest possible clusters, and detecting outliers (unusual points residing in very low-probability areas).
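
A minimal sketch of non-parametric density estimation with SciPy’s Gaussian KDE (the bimodal sample data are illustrative):

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 200)])  # two clusters

kde = gaussian_kde(data)             # estimate the PDF from the sample
grid = np.linspace(-5, 7, 200)
density = kde(grid)                  # estimated density values on the grid
print(grid[np.argmax(density)])      # location of the highest peak (near -2 here)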

33
Q

Dependent (variable)

A

The relationship between two or more random variables where the value of one variable influences or is influenced by the value of another variable. Dependent variables are interconnected and exhibit some form of correlation, association, or causality. Understanding dependent relationships is crucial for modeling and analyzing complex systems, conducting hypothesis testing, and making predictions in various fields such as finance, economics, and social sciences.

34
Q

Derivative

A

The slope, often referred to as the derivative in calculus, is a fundamental concept that measures how a function changes as its input changes. Geometrically, the slope represents the steepness of the tangent line to the function’s graph at a given point. A positive slope indicates that the function is increasing, while a negative slope indicates that the function is decreasing. A slope of zero indicates that the function is neither increasing nor decreasing at that point.

The derivative represents the rate of change or the slope of the function at a particular point. It measures how the function value changes with respect to a small change in the independent variable. The derivative is a fundamental concept in calculus and mathematical analysis, used to analyze the behavior of functions, optimize functions, and solve differential equations. In machine learning and optimization, derivatives are essential for gradient-based optimization algorithms such as gradient descent.

The general formula for calculating the derivative of a function f(x) with respect to its input variable x is denoted f’(x) or df/dx. It is defined as the limit of the difference quotient as the change in x approaches zero:

f’(x) = lim (h→0) [ f(x+h) - f(x) ] / h
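
A small numerical check of this limit definition (the cubic function and the point x = 2 are illustrative):

def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h     # difference quotient with a small, nonzero step h

f = lambda x: x**3
print(derivative(f, 2.0))            # ~12.0, matching the power rule f'(x) = 3x^2 at x = 2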

35
Q

Derivative of power functions: Constant

A

f’(x) = 0

36
Q

Derivative of power functions: Cubic

A

f’(x) = 3x^2

37
Q

Derivative of power functions: Exponential

A

f(x) = e^x

f’(x) = e^x

38
Q

Derivative of power functions: General formula (power rule)

A

f(x) = x^n

f’(x) = n·x^(n-1)

39
Q

Derivative of power functions: 1/x

A

f(x) = 1/x = x^(-1)

f’(x) = -x^(-2) = -1/x^2

40
Q

Derivative of power functions: Inverse function

A

If g is the inverse function of f (so that y = f(x) and g(y) = x), then:

g’(y) = 1 / f’(x)

41
Q

Derivative of power functions: Line

A

f(x) = ax + b

f’(x) = a

42
Q

Derivative of power functions: Quadratic

A

f’(x) = 2x

43
Q

Derivative of power functions: Trigonometric function

A
  • Sine: The derivative of the function sin(x) is cos(x).
  • Cosine: The derivative of the function cos(x) is -sin(x).
  • Tangent: The derivative of the function tan(x) is sec^2(x), or 1/cos^2(x).
44
Q

Descriptive Statistics

A

Statistical techniques employed to summarize and describe the main features of a dataset. They encompass measures such as the mean, median, mode, standard deviation, range, skewness, and kurtosis. Descriptive statistics offer a comprehensive overview of dataset characteristics, aiding in interpretation, comparison, and decision-making across various fields such as economics, finance, and social sciences. They provide valuable insights into the distribution, variability, and shape of data, facilitating data-driven decision-making and hypothesis testing.

45
Q

Determinant

A

In mathematics, the determinant is a scalar value that is a function of the entries of a square matrix. The determinant of a matrix A is commonly denoted det(A), det A, or |A|. Its value characterizes some properties of the matrix and the linear map represented by the matrix. In particular, the determinant is nonzero if and only if the matrix is invertible and the linear map represented by the matrix is an isomorphism. The determinant of a product of matrices is the product of their determinants. The determinant is used in various mathematical operations and theorems, including solving systems of linear equations, computing eigenvalues and eigenvectors, and determining the orientation and volume of geometric shapes.

46
Q

Diagonal Matrix

A

A square matrix in which all entries outside the main diagonal are zero (the diagonal entries themselves may take any value, including zero).

47
Q

Differentiation

A

A fundamental operation in calculus that involves calculating the rate of change or slope of a function at a given point. It is the process of finding the derivative of a function with respect to one or more variables. The derivative represents how the function’s output changes as its input varies and provides valuable insights into the behavior of functions, including identifying critical points, extrema, and inflection points.

48
Q

Discrete Random variable

A

Random variable that can take on a countable number of distinct values. These values are typically integers and are often the result of counting or enumerating outcomes in a sample space. Discrete random variables are characterized by a probability mass function (PMF), which assigns probabilities to each possible value the variable can take. Examples of discrete random variables include the number of heads obtained in a series of coin flips or the number of defects in a batch of products.

49
Q

Discrete variable

A

A type of variable that can only take on distinct, separate values from a finite or countable set. It is characterized by having gaps or jumps between consecutive values, with no intermediate values allowed. Discrete variables are often categorical or qualitative in nature, representing distinct categories, classes, or labels. Examples of discrete variables include the number of students in a class, the outcomes of a dice roll, the types of animals in a zoo, and the categories of products in a store. They are used to represent countable phenomena and make categorical distinctions.

50
Q

Disjoint (mutually exclusive)

A

Two events or sets are said to be disjoint or mutually exclusive if they have no elements in common, i.e., they cannot occur simultaneously. If events A and B are disjoint, then P(A ∩ B) = 0. Note that disjoint events are generally not independent: if one event occurs, the other cannot, so the occurrence of one event strongly affects the probability of the other.

51
Q

Divide by coefficient (matrix)

A

Dividing each term of an expression or equation by a constant factor or coefficient. It is a common operation used to simplify algebraic expressions, solve equations, or manipulate mathematical formulas. Dividing by a coefficient scales or rescales the expression by the reciprocal of the coefficient, effectively adjusting the magnitude or scale of the terms. Dividing by a coefficient is a fundamental operation in algebra, calculus, and linear algebra, used in various mathematical and scientific contexts.

52
Q

Domain

A

The set of all possible input values or independent variables for which the function is defined. It represents the permissible values that the input variable can take while ensuring that the function produces meaningful output. The domain specifies the range of valid inputs that the function can process and is essential for determining the function’s behavior, range, and properties. The domain of a function is typically described using interval notation, set notation, or inequalities, depending on the nature of the function and its constraints. Understanding the domain of a function is crucial for analyzing its behavior, solving equations, and evaluating its applicability to real-world problems.

53
Q

Dot product

A

Also known as the scalar product or inner product, is an algebraic operation that takes two equal-length sequences of numbers (usually vectors) and returns a single number. It is calculated by multiplying corresponding components of the vectors and then summing the products. The dot product is used to measure the similarity or alignment between vectors, compute projections, and calculate work done by a force acting in a direction. In machine learning and linear algebra, the dot product plays a crucial role in vector spaces, optimization algorithms, and neural network operations.

54
Q

Eigenbases

A

A basis is a minimal set of vectors that spans a space (the number of vectors equals the dimensionality of the space). An eigenbasis is such a set made up of eigenvectors of a linear transformation or matrix. In linear algebra, an eigenbasis is a basis for a vector space consisting entirely of eigenvectors of a linear operator or matrix. Each eigenvector in the eigenbasis is associated with an eigenvalue, and together they form a complete set of linearly independent vectors that diagonalize the matrix. Eigenbases play a fundamental role in diagonalization, spectral decomposition, and solving systems of linear equations, providing a convenient representation for analyzing and understanding linear transformations.

In plain language: imagine stretching or rotating a shape on a grid (a linear transformation). An eigenbasis is a special set of vectors pointing in different directions on the grid. These arrows have a unique property: when the transformation happens, they don’t change direction, they only get longer or shorter. An eigenbasis helps us understand how the transformation affects the grid by showing us which directions stay the same and how much they stretch or shrink.

55
Q

Eigenvectors

A

Special vectors associated with linear transformations or matrices that retain their direction when the transformation is applied. In linear algebra, an eigenvector of a square matrix A is a nonzero vector v such that Av = λv, where λ is a scalar known as the eigenvalue corresponding to v. Eigenvectors represent the directions along which linear transformations stretch or compress space, and eigenvalues represent the scale factors by which these transformations occur. Eigenvectors are used in various applications such as principal component analysis (PCA), spectral analysis, and solving systems of differential equations, providing insights into the behavior and properties of linear systems.

When we apply a transformation to the space, these arrows might change in length, but they don’t change direction. They’re like the backbone of the transformation, showing us the main directions that don’t get twisted or turned. Each arrow has a special number associated with it called an eigenvalue, which tells us how much the arrow stretches or shrinks when the transformation happens.
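
A short NumPy sketch (the matrix is an arbitrary illustrative example):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])           # stretches the x direction by 2 and the y direction by 3

eigenvalues, eigenvectors = np.linalg.eig(A)
v = eigenvectors[:, 0]               # first eigenvector (the columns of the returned matrix)
print(A @ v, eigenvalues[0] * v)     # A v equals lambda v: same direction, only scaled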

56
Q

Elimination method

A

Systematic approach used to solve systems of linear equations by eliminating variables one by one until a solution is found. It involves manipulating equations to cancel out variables or reduce the system to simpler equations with fewer variables. The elimination method is commonly used in algebra and linear algebra to solve systems of equations with multiple unknowns, providing a step-by-step procedure to determine the values of the variables that satisfy all the equations simultaneously.

57
Q

Euclidean distance

A

Measures the straight-line distance between two points in Euclidean space (follows from the Pythagorean theorem).

Formula (2D space):
d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

  • (x1, y1) and (x2, y2) are the coordinates of the two points
  • d is the Euclidean distance

Generalizes to higher-dimensional spaces, where it measures the straight-line distance between points in n-dimensional space.

Applications: Pattern recognition, Clustering, Regression analysis, Nearest neighbor algorithms
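
A small sketch for points in n-dimensional space (the coordinates are illustrative):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

d = np.sqrt(np.sum((q - p) ** 2))    # straight-line distance from the formula above
print(d)                             # 5.0
print(np.linalg.norm(q - p))         # same result using the built-in norm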

58
Q

Euler’s number

A

Euler’s number, denoted by the letter ‘e’, is a mathematical constant approximately equal to 2.71828… It’s an irrational number, meaning it has an infinite, non-repeating decimal expansion. It’s the base of the natural logarithm (ln). Euler’s number is deeply connected to processes that exhibit exponential change, such as compound interest or radioactive decay. It elegantly represents proportional growth and change. The value of Euler’s number arises naturally in various mathematical contexts, particularly in calculus, number theory, and complex analysis.

The function f(x) = e^x is very special: it is (up to a constant multiple) the only function whose derivative (rate of change) is equal to itself. In other words, the rate of change at any point on the curve is exactly equal to the value of the function at that point. This property makes it incredibly useful for modelling growth and decay (for example, in bacterial growth: if the population is currently 100, it is growing at a rate of 100 bacteria per hour; if it is 500, it is growing at a rate of 500 bacteria per hour).

Overall, Euler’s number is a fundamental constant in mathematics with wide-ranging applications across different fields. Its importance lies in its connection to exponential growth, calculus, complex analysis, and other areas of mathematics, making it a cornerstone of mathematical theory and practice. ‘e’ plays a fundamental role in calculus, particularly in solving differential equations and finding integrals. Many phenomena in the world, from population growth to radioactive decay, can be modeled or approximated using functions involving ‘e’. ‘e’ is essential in compound interest calculations used in financial models.

59
Q

Event

A

A possible outcome or occurrence of a random experiment. It represents a specific situation or result that may happen, such as rolling a particular number on a dice, drawing a specific card from a deck, or observing a certain event in a statistical study. Events are fundamental concepts in probability theory and are used to define probability distributions, calculate probabilities, and analyze uncertainty in various domains.

60
Q

Expectation (Mean)

A

Often referred to as the mean or expected value, of a random variable is a measure of the central tendency of its distribution. It represents the average value that the variable would take over a large number of independent repetitions of the random experiment. The expectation is calculated as the weighted sum of all possible values of the random variable, where each value is weighted by its corresponding probability of occurrence. The expectation is a fundamental concept in probability theory and is used to characterize the properties of random variables, estimate population parameters, and make predictions.

61
Q

Exponential

A

The essence of an exponential relationship is that a quantity grows or shrinks by being multiplied by itself repeatedly.

An exponential function has the general form f(x) = a^x, where:
- ‘a’ is the base (the number being multiplied)
- ‘x’ is the exponent (the number of times the base is multiplied by itself)

Exponential function is a mathematical function or distribution characterized by a constant base raised to the power of a variable exponent. The exponential function, f(x) = e^x, where e is Euler’s number (approximately 2.71828), is a common example of an exponential function. Exponential functions exhibit rapid growth or decay, depending on whether the exponent is positive or negative. Exponential distributions describe the behavior of random variables that model processes with constant rates of change over time, such as radioactive decay, population growth, or the waiting times between independent events. Exponential functions and distributions are widely used in mathematics, statistics, and science to model various natural phenomena and processes.

62
Q

Functions

A

A function is a relation that associates each element x of a set X, the domain of the function, to a single element y of another set Y, the codomain of the function. A function usually has a name. If the function is called f, this relation is denoted y = f(x) (read f of x), the element x is the argument or input of the function, and y is the value of the function or the output. The symbol that is used for representing the input is the variable of the function
(we often say that f is a function of the variable x).

63
Q

Geometric (your first success will be on the n-th try)

A

Refers to the probability distribution of the number of independent Bernoulli trials needed to obtain the first success, where every trial has the same success probability p. The probability that the first success occurs on the first try is p, on the second try is p(1-p), on the third try is p(1-p)^2, and so on, following a geometric progression. The geometric distribution is commonly used to model the number of trials needed to achieve the first success, such as the number of coin flips needed to see the first heads.

64
Q

Geometric dot product

A

Also known as the scalar product or inner product, is a mathematical operation that takes two vectors and returns a scalar quantity. It is calculated by multiplying corresponding components of the vectors and summing the results. In geometric terms, the dot product represents the magnitude of one vector projected onto another vector, scaled by the cosine of the angle between them. The dot product is used to measure the similarity or alignment between vectors, calculate projections, and determine angles between vectors. In machine learning and data analysis, the dot product is often used in vector spaces, optimization algorithms, and neural network operations.

65
Q

Global minimum

A

In optimization, the global minimum refers to the lowest possible value of the objective function over the entire feasible domain. It represents the optimal solution that minimizes the objective function and satisfies all constraints, providing the best achievable outcome for the optimization problem. The global minimum is distinguished from local minima, which are lower values of the objective function within specific regions of the feasible domain but may not be the lowest overall. Finding the global minimum is a key objective in optimization problems, as it ensures the best performance or utility of the system under consideration. Various optimization algorithms, such as gradient descent and simulated annealing, are employed to search for the global minimum in complex, high-dimensional optimization landscapes encountered in machine learning, engineering, economics, and other fields.

66
Q

Gradient

A

A vector-valued function that represents the direction and magnitude of the steepest ascent of a scalar-valued function at a given point. It is a generalization of the derivative to multiple dimensions and provides valuable information about the rate of change or slope of the function in each direction. The gradient of a function points in the direction of the greatest increase of the function and has a magnitude equal to the rate of change in that direction. In machine learning, the gradient is commonly used in optimization algorithms, such as gradient descent, to iteratively update the parameters of a model in the direction that minimizes the objective function. By following the negative gradient direction, optimization algorithms can converge towards the optimal solution or minimum of the objective function.

Equivalently, the gradient is calculated by taking the partial derivatives of the function with respect to each of its variables and arranging them into a vector. Geometrically, it represents the direction of steepest ascent of the function’s graph at the given point. In machine learning and optimization, the gradient plays a crucial role in gradient-based optimization algorithms such as gradient descent, where it is used to update the parameters of a model iteratively to minimize a loss function and find the optimal solution.
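
A minimal gradient-descent sketch using the gradient of a simple two-variable function (the function, step size, and starting point are illustrative assumptions):

import numpy as np

def grad(p):
    x, y = p                          # gradient of f(x, y) = (x - 3)^2 + (y + 1)^2
    return np.array([2 * (x - 3), 2 * (y + 1)])

p = np.array([0.0, 0.0])              # starting point
lr = 0.1                              # step size
for _ in range(100):
    p = p - lr * grad(p)              # step against the gradient (steepest descent)
print(p)                              # approaches the minimum at (3, -1)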

67
Q

Hyperplane

A

In geometry and linear algebra, a hyperplane is a flat affine subspace of dimension n−1 embedded in an n-dimensional space. It is defined as the set of points that satisfy a linear equation of the form w⋅x+b=0, where w is a normal vector perpendicular to the hyperplane, x is a point in the space, and b is a scalar bias term. Geometrically, a hyperplane divides the space into two half-spaces and serves as a boundary or separation surface between them. In machine learning, hyperplanes are fundamental concepts in classification and regression tasks, where they are used to define decision boundaries between different classes or regions of the input space. Hyperplanes are also used in clustering, dimensionality reduction, and pattern recognition algorithms for partitioning and organizing data in high-dimensional spaces.

68
Q

Hypothesis testing

A

A statistical hypothesis test is a method of statistical inference used to decide whether the data sufficiently support a particular hypothesis. Hypothesis testing is a statistical method to determine if an observed difference or effect in your data is likely due to a real phenomenon in the larger population, or if it could be simply explained by random chance. It helps you make informed, data-driven decisions about whether changes, treatments, or relationships are truly significant. Many tests have assumptions about your data that need to be checked. Hypothesis testing is all about asking questions like: Is there a true difference between group A and group B? Does this newly developed drug actually work better than the old one? Is there a relationship between a customer’s age and their likelihood to buy a product?

Key Steps:
1. Formulate Hypotheses:
Null Hypothesis (H0): The default statement, usually one of “no effect” or “no difference”.
Alternative Hypothesis (Ha): The statement you want to find evidence to support.

2. Choose a Test Statistic and Significance Level:
Test Statistic: Calculates a value summarizing how different your sample is from what the null hypothesis expects (e.g., t-statistic, z-statistic).
Significance Level (alpha): Your risk tolerance for rejecting the null even if it’s true (common value: 0.05).
3. Calculate the p-value:
The probability of getting a test statistic as extreme or more extreme than what you observed if the null hypothesis were true.
4. Make a Decision:
p-value < alpha: Reject the null hypothesis. You have evidence to support the alternative hypothesis.
p-value >= alpha: Fail to reject the null hypothesis. You don’t have enough evidence to claim the effect or difference exists in the larger population.

Hypothesis testing doesn’t provide definitive proof about your population parameter, just evidence.
It is associated with Errors: Type I error (false positive), Type II error (false negative) are possible.
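
A minimal sketch of these steps as a two-sample t-test in SciPy (the generated data and the alpha level are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)    # e.g. control group
group_b = rng.normal(loc=11.0, scale=2.0, size=50)    # e.g. treatment group

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # H0: equal means; Ha: means differ
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")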

69
Q

Identity Matrix

A

A special diagonal matrix with ones on the main diagonal and zeros elsewhere.

70
Q

Independent (statistics)

A

The core idea of independence is that two events or variables are independent if knowing the outcome of one tells you nothing about the outcome of the other.

Example (Coin Tosses): If you flip two fair coins, the outcome of the first flip doesn’t influence the outcome of the second. These events are independent.

Feature Independence:

Ideally, Features Are Informative Alone: Each feature in your dataset should provide unique information about the target variable you’re trying to predict.
Redundant Features: Highly correlated features can hinder some models, so feature selection processes often aim to identify and potentially remove them.

71
Q

Independent sample

A

A set of data points drawn from a population where each observation is unrelated to or not influenced by others. Independence of samples is fundamental for statistical analysis, ensuring that observations remain statistically independent and free from confounding variables or biases. Independent samples facilitate robust statistical inference, hypothesis testing, and generalizability of findings across different contexts or populations. They provide a reliable basis for making inferences about population parameters and assessing the effectiveness of interventions or treatments in research studies.

72
Q

Inferential Statistics

A

A branch of statistics concerned with making predictions, inferences, or generalizations about a population based on data collected from a sample. It involves using probability theory to draw conclusions about a population parameter, such as a mean or proportion, from sample data. Inferential statistics allows researchers to make informed decisions and predictions based on limited information.

73
Q

Integration

A

Continuous analog of a sum, which is used to calculate areas, volumes, and their generalizations. Integration, the process of computing an integral, is one of the two fundamental operations of calculus, the other being differentiation. The integral can be seen as the opposite of a derivative: if a function represents the rate of change of something, the integral gives the total amount of change accumulated over an interval.

Consider a function f(x) and its graph. The definite integral of f(x) between two points ‘a’ and ‘b’ calculates the signed area enclosed by the function’s curve, the x-axis, and the vertical lines at x=a and x=b.

Integrals help locate the center of mass of objects, especially those with irregular shapes or varying density. The integral of a probability density function (PDF) represents probabilities. The area under the PDF curve within a specific range calculates the probability of a random variable falling within that range.

What Integrals Tell Us:
Geometrically: Integrals reveal the area under a curve.
Physically: Integrals translate rates of change into total quantities accumulated (distance, work, volume, etc.).
Probabilistically: Integrals are key for working with continuous distributions and finding probabilities.
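
A small numerical sketch of a definite integral (the function and the interval are illustrative):

from scipy.integrate import quad

f = lambda x: 3 * x**2               # a rate-of-change function
area, err = quad(f, 0.0, 2.0)        # definite integral from a = 0 to b = 2
print(area)                          # 8.0: the total accumulated change (x^3 evaluated from 0 to 2)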

74
Q

Interval

A

A set of values between two endpoints, typically expressed in terms of the lower and upper bounds. In mathematics, intervals can be open, closed, half-open, or half-closed, depending on whether the endpoints are included or excluded from the set of values. Intervals are commonly used to represent sets of real numbers or continuous ranges of variables in various mathematical contexts, such as calculus, geometry, and statistics.

75
Q

Inverse matrix

A

Think of a regular number, like 5. Its inverse is 1/5, because multiplying 5 by its inverse gets you back to 1 (the identity element for multiplication).
An inverse matrix does something similar: When you multiply a matrix by its inverse, the result is the identity matrix (a special matrix analogous to the number 1).

Not All Matrices Have Inverses
Only Square Matrices: Only square matrices (same number of rows and columns) can potentially have inverses.
Singular Matrices: Matrices with a determinant of zero are called singular and don’t have inverses.
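
A short NumPy sketch, including the singular-matrix check mentioned above (the matrix is illustrative):

import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

if np.isclose(np.linalg.det(A), 0.0):
    print("Matrix is singular: no inverse exists")
else:
    A_inv = np.linalg.inv(A)
    print(A @ A_inv)                 # the product is (numerically) the 2x2 identity matrix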

76
Q

Jacobian matrix

A

The Jacobian matrix is a multivariable extension of the regular derivative. It captures local change: while a derivative tells you how much a single-input function changes at a point, the Jacobian tells you how much a vector-valued function (a function with multiple outputs) changes locally around a specific input point.

Why It’s Important: The Jacobian gives you the best linear approximation of a complex, multivariable function near a point.
This information is used in gradient-based optimization methods. The gradients of different components of the loss function with respect to the model’s parameters form the Jacobian. Analyzing the Jacobian can tell you how sensitive the outputs of your system are to small changes in the inputs. (Jacobians are also used to model the relationship between joint movements and the position of a robot’s end effector.)

Imagine a machine with several knobs (inputs) and a few dials displaying readings (outputs). The inputs and outputs are related, but not in a super straightforward way. The Jacobian is like a tool that tells you, if you tweak one input knob just a tiny bit, how will each of the output dials change in response. The trick is, how much the output changes might depend on the current settings of all the other knobs too.

The Jacobian is a Table where:
Each row is focused on one specific output dial.
Each column is focused on one specific input knob.
The numbers inside the table represent how much wiggling one knob will change the reading on one dial.

Why It Matters:
Think of the Jacobian matrix as a powerful diagnostic tool that reveals the inner workings of a machine learning model, telling you how it reacts to changes and helping you guide the optimization process in the right direction. Gradients tell you in which direction to adjust the parameters to reduce the error. For models with multiple outputs or complex loss functions, the Jacobian matrix packages those gradients neatly. If you only want to make tiny changes to the input, the Jacobian helps you predict what will happen to the outputs. It reveals how the different inputs and outputs of the system interact and influence each other. Techniques like the Jacobian norm can be used to regularize complex models, preventing overfitting. Jacobians can be used in the analysis and stabilization of GAN training.
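
A minimal finite-difference sketch of the “table” described above, for a small made-up function with two knobs (inputs) and two dials (outputs):

import numpy as np

def machine(knobs):
    x, y = knobs                           # two dials computed from two knobs (arbitrary smooth example)
    return np.array([x * y, np.sin(x) + y**2])

def jacobian(f, p, eps=1e-6):
    p = np.asarray(p, dtype=float)
    base = f(p)
    J = np.zeros((base.size, p.size))      # rows: outputs (dials), columns: inputs (knobs)
    for j in range(p.size):
        nudged = p.copy()
        nudged[j] += eps                   # wiggle one knob at a time
        J[:, j] = (f(nudged) - base) / eps # response of every dial to that knob
    return J

print(jacobian(machine, [1.0, 2.0]))
# Analytically the Jacobian is [[y, x], [cos(x), 2y]] = [[2, 1], [0.54, 4]]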

77
Q

Kernels (math)

A

The kernel of a linear transformation (often represented by a matrix) is the set of all vectors that, when transformed, result in the zero vector. It precisely identifies the subspace that gets nullified by the transformation, and so tells us something fundamental about how the transformation squashes or collapses the space it operates on, exposing the transformational properties of the matrix.

The kernel in linear algebra is highly specific – it’s about the null space of a transformation. Inner products help determine membership in the kernel (the null space).

78
Q

Kurtosis

A

A statistical measure describing the shape of the distribution’s tails relative to the normal distribution. Positive kurtosis indicates heavier tails (leptokurtic), implying more extreme values, while negative kurtosis suggests lighter tails (platykurtic), indicating fewer extreme values. Kurtosis provides insights into the distribution’s peakedness or flatness and complements other measures of central tendency and spread. It helps researchers understand the distribution’s characteristics and make informed decisions about data modeling and analysis.

79
Q

Lagrange notation

A

Lagrange notation is a widely used way of representing derivatives in calculus. It’s named after the mathematician Joseph-Louis Lagrange. The derivative of a function f is written f’(x), the second derivative f’’(x), and so on.

It’s more concise than other notations (like Leibniz notation) and easily represents higher derivatives with multiple prime symbols (e.g., f’’’’(x) for the fourth derivative).

80
Q

Law of large numbers

A

A fundamental principle in probability and statistics that states that as the size of a sample or the number of repetitions of a random experiment increases, the sample mean approaches the population mean. In other words, the average of the results obtained from a large number of trials is likely to be close to the expected value. The law of large numbers forms the basis for many statistical procedures and ensures the reliability of statistical inference.

81
Q

Law of total probability

A

A fundamental concept in probability theory that relates marginal probabilities to conditional probabilities. It states that if you have a partition of the sample space (i.e., a collection of disjoint events B1, B2, …, Bn whose union covers the entire sample space), then the probability of an event A can be computed as the sum of the probabilities of A conditioned on each event in the partition, weighted by the probability of that event: P(A) = P(A|B1)P(B1) + P(A|B2)P(B2) + … + P(A|Bn)P(Bn).

The Law of Total Probability is very useful in situations where it may be easier to compute conditional probabilities rather than marginal probabilities directly. It allows us to decompose complex probability problems into simpler, more manageable parts by considering the different scenarios represented by the partition. It’s a fundamental tool in probability theory and is widely used in various fields, including statistics, machine learning, and data science.

Imagine you have a bag of different colored marbles: red, blue, and green. Now, you close your eyes and randomly pick a marble from the bag. You want to know the probability of picking a red marble, but you’re not sure if each color has an equal chance of being picked.

Here’s where the Law of Total Probability comes in handy.

First, you realize that you can break down the event of picking a red marble into smaller events based on the color of the marbles in the bag. Let’s say there are three scenarios:
  • You pick from a bag containing only red marbles.
  • You pick from a bag containing only blue marbles.
  • You pick from a bag containing only green marbles.
The Law of Total Probability tells you that you can find the probability of picking a red marble by considering each of these scenarios separately and then adding up their probabilities.
So, you find the probability of picking a red marble in each scenario:
  • Probability of picking a red marble from the bag of red marbles.
  • Probability of picking a red marble from the bag of blue marbles (which is 0, since there are no red marbles).
  • Probability of picking a red marble from the bag of green marbles (also 0, as there are no red marbles here either).
Finally, you add up these probabilities, each multiplied by the probability of being in that scenario. For example, if the bag of red marbles makes up half of all the marbles in the bag, then you’d multiply the probability of picking a red marble from that bag by 0.5 (the probability of being in that scenario).
In essence, the Law of Total Probability allows you to find the overall probability of an event by considering all possible scenarios and weighing their contributions based on their likelihood. It’s like breaking down a big problem into smaller, more manageable parts and then putting them all together to get the answer.
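
A minimal Python sketch of this calculation, with made-up scenario probabilities (the 0.5 / 0.3 / 0.2 split below is purely illustrative):

# Law of Total Probability: P(red) = sum over scenarios of P(red | scenario) * P(scenario)
# The scenarios form a partition, so their probabilities sum to 1.
p_scenario = {"red_bag": 0.5, "blue_bag": 0.3, "green_bag": 0.2}
p_red_given = {"red_bag": 1.0, "blue_bag": 0.0, "green_bag": 0.0}

p_red = sum(p_red_given[s] * p_scenario[s] for s in p_scenario)
print(p_red)  # 0.5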

82
Q

Left-tailed test

A

A left-tailed test is used to determine whether a sample statistic is significantly smaller than a specified value. Put another way, it is a statistical hypothesis test in which the critical region (the region of rejection) is located entirely on the left side of the distribution curve. This means that the test focuses on determining whether the sample statistic is significantly smaller than a certain value, often a population parameter or a specified threshold.

The null hypothesis in a left-tailed test typically states that there is no significant difference, or that the sample statistic is equal to or greater than the specified value. The alternative hypothesis states the specific direction of the difference we are interested in: in a left-tailed test, it asserts that the sample statistic is significantly smaller than the specified value. If the calculated test statistic falls within the critical region (i.e., it is smaller than the critical value), you reject the null hypothesis in favor of the alternative hypothesis.

83
Q

Likelihood

A

Probability measures the likelihood of an event occurring based on the underlying sample space. In other words, it quantifies the chance that a particular outcome will happen. Likelihood, on the other hand, is used in the context of statistical inference and parameter estimation. It measures the compatibility between observed data and a particular set of parameter values (hypotheses) in a statistical model. In simple terms, likelihood quantifies how well the model, with specific parameter values, explains the observed data.

It’s important to note that likelihood is not a probability distribution. Unlike probabilities, likelihood values can be greater than 1. Also, likelihood is used for inference, such as parameter estimation, hypothesis testing, and model selection, while probabilities are used for predicting the likelihood of future events.

In summary, probability measures the likelihood of events in a sample space, while likelihood measures the compatibility of observed data with specific parameter values in a statistical model. Probability focuses on the chance of future events, while likelihood focuses on the support of observed data for different parameter values in a model.
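
A small sketch of the difference, assuming toy coin-flip data (7 heads in 10 flips) and a binomial model; the likelihood of each hypothesised p is simply P(data | p), and the values need not sum to 1 across hypotheses:

from scipy.stats import binom

heads, flips = 7, 10          # assumed toy data
for p in (0.3, 0.5, 0.7):     # candidate parameter values (hypotheses)
    print(p, binom.pmf(heads, flips, p))   # likelihood L(p; data) = P(data | p)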

84
Q

Linear dependence & independence

A

Linear dependence and independence describe the relationships between the rows or columns of a matrix, which directly translate to the solution behavior of a system of equations. A set of vectors (rows or columns) is linearly dependent if one vector can be expressed as a linear combination of the others. In a system of equations, this means some equations are redundant; they don’t provide new constraints. Conversely, linear independence means none of the vectors can be formed as a combination of the others. Geometrically, linearly independent vectors point in unique directions. This translates to a system of equations where each equation provides essential information, often leading to systems with unique solutions.

In a matrix, we can analyze linear independence in two ways:
Row Independence: The matrix’s rows are linearly independent if none of the rows can be formed as a linear combination of the other rows.
Column Independence: The matrix’s columns are linearly independent if none of the columns can be formed as a linear combination of the other columns.
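
One practical way to check this, sketched here with NumPy's matrix_rank on a toy matrix whose third column is the sum of the first two:

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 3.0, 4.0]])     # third column = first column + second column

rank = np.linalg.matrix_rank(A)
print(rank, rank == A.shape[1])     # 2 False -> the columns are linearly dependent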

85
Q

Linear regression

A

A supervised learning algorithm used for modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the input features and the target variable.

Linear regression searches for the line (or hyperplane in higher dimensions) that minimizes the sum of the squared distances (residuals) between the observed data points and the values predicted on the line. This method is known as the method of least squares.

y = mx + b
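
A minimal least-squares sketch with NumPy, using toy data that roughly follows y = 2x + 1:

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])

X = np.column_stack([x, np.ones_like(x)])       # design matrix [x, 1]
(m, b), *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes the sum of squared residuals
print(m, b)                                     # fitted slope and intercept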

86
Q

Linear transformation

A

A linear transformation is a function that maps vectors from one space to another while preserving two key properties:
Additivity: The transformation of the sum of two vectors equals the sum of their individual transformations.
Scalar Multiplication: Scaling a vector and then transforming it is the same as transforming it and then scaling it by the same amount.

Geometric Interpretation: Linear transformations can be visualized as stretching, rotating, shearing, reflecting, or projecting a space, but without bending or warping it in a non-linear way.

A matrix can act as a linear transformation by performing matrix multiplication. When you multiply a matrix by a vector, you’re effectively applying the transformation that the matrix represents. The columns of a transformation matrix tell you where the original basis vectors (like the standard x and y-axis vectors) end up after the transformation.
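
A small NumPy sketch of this idea, using a 90-degree rotation matrix as the example transformation:

import numpy as np

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # columns show where the x- and y-basis vectors land

v = np.array([2.0, 1.0])
u = np.array([1.0, 3.0])
print(R @ v)                                    # [-1.  2.] : v rotated 90 degrees

# The two defining properties hold:
print(np.allclose(R @ (u + v), R @ u + R @ v))  # additivity
print(np.allclose(R @ (2 * v), 2 * (R @ v)))    # scalar multiplication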

In ML:
1. Data Preprocessing
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) use linear transformations to find new, lower-dimensional representations of data that capture the most important directions of variation. This is crucial for handling high-dimensional datasets and improving computational efficiency.
Feature Scaling and Normalization: Linear transformations are often used to scale features to have comparable ranges or zero mean and unit variance. This can improve the convergence of many ML algorithms.

2. Within Models
Neural Network Layers: The core operation of a dense (fully connected) layer in a neural network is a matrix multiplication, which is a linear transformation. These transformations learn to project the input data into different spaces where it might be easier to classify or make predictions.
Kernel Methods: In techniques like Support Vector Machines (SVMs), linear transformations induced by kernels are used to map data into higher-dimensional spaces where it becomes linearly separable.
3. Interpretability
Analyzing Feature Importance: Linear transformations in simple models can sometimes offer insights into which original features are most heavily weighted, helping with model understanding.
Disentangled Representations: Some ML research focuses on learning linear transformations that create representations where meaningful factors are separated, making it easier to interpret and manipulate the model’s output.

Beyond Linearity
Building Blocks: Linear transformations are foundational. Even non-linear models often combine them with activation functions to create complex, expressive mappings.
Limitations: Linear transformations alone are limited in the patterns they can capture. That’s why techniques like deep learning are so powerful.

87
Q

Local minimum

A

A local minimum is a point on a curve or surface that is lower in value than all neighboring points within a small surrounding neighborhood. In mathematical terms, a local minimum occurs where the derivative of the function is zero and changes sign from negative to positive, meaning the function switches from decreasing to increasing. A local minimum may not be the absolute (global) minimum of the function; it represents a relative low point within a specific region.

88
Q

Log loss

A

Also known as logarithmic loss or cross-entropy loss, this is a measure used to evaluate the performance of a classification model. It quantifies the accuracy of predictions by penalizing incorrect classifications and is defined as the negative logarithm of the predicted probability assigned to the true class. Log loss is primarily used to evaluate classification models that predict probabilities. Unlike accuracy, it has no upper limit: it can get arbitrarily large for extremely wrong predictions. Log loss cares a great deal about how confident the model is, not just whether it is right. It is more forgiving of a slightly incorrect prediction than of one where the model was highly certain but wrong, and it directly evaluates how good the model’s probability estimates are rather than just whether the final classification was correct.

Lower log loss values indicate better performance, with 0 representing perfect predictions. Higher log loss values indicate worse performance. Log loss is sensitive to the correctness and confidence of probability estimates. It heavily penalizes confident but incorrect predictions. Therefore, it’s crucial to ensure that the model’s predicted probabilities are well-calibrated and reflect the true uncertainty in the predictions.
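
A sketch of the calculation for binary classification, with made-up labels and predicted probabilities:

import numpy as np

y_true = np.array([1, 0, 1, 1])            # true labels
y_prob = np.array([0.9, 0.2, 0.6, 0.05])   # predicted P(class = 1)

p = np.clip(y_prob, 1e-15, 1 - 1e-15)      # clip to avoid log(0)
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(log_loss)   # the confident-but-wrong 0.05 prediction dominates the loss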

89
Q

Matrix

A

A matrix is a rectangular array of numbers arranged in rows and columns.

90
Q

Matrix Decompositions

A

Matrices can often be decomposed into simpler forms, which can facilitate various computations and analyses. Common decompositions include:
LU Decomposition: Decomposes a matrix into a lower triangular matrix and an upper triangular matrix.
QR Decomposition: Decomposes a matrix into an orthogonal matrix and an upper triangular matrix.
Singular Value Decomposition (SVD): Decomposes a matrix into three matrices, which reveal information about the matrix’s singular values and singular vectors.
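
A quick sketch of all three on a toy matrix, using NumPy and SciPy (each check reconstructs A from its factors):

import numpy as np
from scipy.linalg import lu, qr

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

P, L, U = lu(A)                  # A = P @ L @ U
Q, R = qr(A)                     # A = Q @ R
U2, s, Vt = np.linalg.svd(A)     # A = U2 @ diag(s) @ Vt

print(np.allclose(A, P @ L @ U),
      np.allclose(A, Q @ R),
      np.allclose(A, U2 @ np.diag(s) @ Vt))   # True True True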

92
Q

Matrix Exponential

A

Generalization of the exponential function for matrices.

93
Q

Matrix Norms

A

Measures of the “size” of a matrix. Matrix norms are like measuring tapes for matrices, giving us a sense of their ‘size’ or magnitude. They go beyond simply counting rows and columns. A matrix norm boils down complex matrix information into a single, non-negative number. Think of it this way: some matrices might have small numbers but be very spread out, while others are compact with large values. Matrix norms are scalar values used to quantify the size or magnitude of a matrix, playing a vital role in analyzing numerical algorithms and understanding how errors might amplify through mathematical operations.

A good matrix norm has several key properties:
Non-negativity: The norm is always zero or positive. It’s zero only for a zero matrix.
Scaling: Multiplying a matrix by a scalar multiplies the norm by the absolute value of that scalar.
Triangle Inequality: The norm of the sum of two matrices is less than or equal to the sum of their individual norms.
Submultiplicative: The norm of a product of matrices is less than or equal to the product of their individual norms.

Common matrix norms include:
Frobenius Norm: Like finding the length (magnitude) of a vector by squaring all the elements, summing them up, and taking the square root.
Induced Norm: Based on how much a matrix can stretch a vector, influenced by the vector norm used. A common example is the L2 norm.
p-norms: A family of norms based on different ways to combine the absolute values of elements (like summing them or finding the maximum).
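
A short NumPy sketch of a few of these norms on a toy matrix:

import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

print(np.linalg.norm(A, 'fro'))   # Frobenius norm: sqrt(1 + 4 + 9 + 16)
print(np.linalg.norm(A, 2))       # induced 2-norm: largest singular value
print(np.linalg.norm(A, 1))       # maximum absolute column sum
print(np.linalg.norm(A, np.inf))  # maximum absolute row sum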

94
Q

Matrix product

A

Matrix product is a way of multiplying two compatible matrices to create a new matrix. It involves multiplying elements from rows of the first matrix with corresponding elements from columns of the second matrix and summing up the products. For a matrix product to be valid, the number of columns in the first matrix must equal the number of rows in the second matrix.

Many ML models rely on linear transformations. Matrix products provide an efficient way to represent and perform these transformations on data. A neural network layer is essentially a matrix multiplication of the input data with a weight matrix followed by an activation function.

Matrix products allow computations that simultaneously involve multiple features of your data:
Calculating Correlations: Covariance matrices (which often involve matrix products) reveal relationships between different features.
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) use matrix products to find new, lower-dimensional representations of data.

Example: A Simple Neural Network Layer
Input data (X): A matrix where each row is a sample, and each column is a feature.
Weight matrix (W): A matrix where each column represents a neuron in the layer.
Output (Y): Y = X ⋅ W (This matrix product represents the output of the layer before applying an activation function)
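
A minimal NumPy sketch of the layer described above, with made-up shapes:

import numpy as np

X = np.random.rand(4, 3)    # 4 samples, 3 features
W = np.random.rand(3, 2)    # weights: 3 inputs -> 2 neurons

Y = X @ W                   # matrix product, shape (4, 2): one row per sample
out = np.maximum(Y, 0)      # e.g. a ReLU activation applied afterwards
print(out.shape)            # (4, 2)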

95
Q

Matrix Properties

A

Matrices have various properties, such as rank, determinant, trace, eigenvalues, and eigenvectors, which provide important information about their behavior and structure.

Rank: The maximum number of linearly independent rows or columns.
Determinant: A scalar value that can be computed for square matrices; used to determine invertibility.
Trace: The sum of the diagonal elements of a square matrix.
Eigenvalues: Scalar values that represent how a linear transformation (described by the matrix) affects the directions of certain vectors.
Eigenvectors: Non-zero vectors that are scaled by the matrix but not rotated during the transformation. They represent the directions along which the linear transformation behaves like simple stretching or compression.

96
Q

Max function

A

The max function, denoted as max(a, b), returns the larger of the two values a and b. It is a mathematical function commonly used to find the maximum value among a set of numbers or to make comparisons between two values.

97
Q

Maximum Likelihood Estimation (MLE)

A

A method used to estimate the parameters of a statistical model by maximizing the likelihood function, which measures how well the model explains the observed data. The parameters that maximize the likelihood function are considered the most likely values given the observed data. MLE is widely used in statistical inference, where the goal is to estimate unknown parameters based on observed data. It provides estimates that are asymptotically efficient and consistent under certain conditions.
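
A small sketch, assuming toy data from a normal model with known standard deviation 1: the numerical maximizer of the likelihood (minimizer of the negative log-likelihood) should agree with the analytic MLE, which is simply the sample mean.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

data = np.array([2.1, 1.9, 2.4, 2.0, 2.6])

neg_log_lik = lambda mu: -np.sum(norm.logpdf(data, loc=mu, scale=1.0))
result = minimize_scalar(neg_log_lik)

print(result.x, data.mean())   # both are approximately 2.2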

98
Q

Mean

A

The mean, also known as the average, is a measure of central tendency calculated by summing all values in a dataset and dividing by the total number of values. It represents the arithmetic average of a set of numbers and is commonly used to describe the typical value or central value of a dataset.

The mean is sensitive to outliers and extreme values in the data.

99
Q

Measures of central tendency

A

Statistical metrics used to define the center or typical value within a dataset. They include:
- Mean: Calculated by summing all values and dividing by the total count of values.
- Median: The middle value when data is sorted in ascending or descending order.
- Mode: The most frequently occurring value in the dataset.

100
Q

Median

A

A measure of central tendency that represents the middle value of a dataset when arranged in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values. The median is less sensitive to outliers than the mean and is often used to describe the central tendency of skewed or non-normally distributed data.

101
Q

Mode

A

A measure of central tendency that represents the most frequently occurring value in a dataset. Unlike the mean and median, which require numerical data, the mode can be calculated for both numerical and categorical data. A dataset can have one mode (unimodal), two modes (bimodal), or more than two modes (multimodal). The mode is useful for identifying the typical or predominant value in a dataset and is commonly used in descriptive statistics and data analysis.

102
Q

Naive Bayes

A

Popular machine learning algorithm based on Bayes’ theorem. Naive refers to an assumption of independence between features. It is commonly used for classification tasks, particularly in text categorization and spam filtering. Despite its simplifying assumption, Naive Bayes often performs well in practice and is computationally efficient, making it suitable for large-scale datasets.

103
Q

Naive assumption

A

Naive Bayes classifiers operate on the key assumption that all features (attributes) within your dataset are conditionally independent of each other given the class (target) variable. This means that the presence or absence of one feature doesn’t affect the probability of any other feature occurring, given a specific class. While often unrealistic, this assumption greatly simplifies the calculations involved in building the model.

Naive Bayes works well when you have a large number of features, or for high-dimensional datasets where feature relationships might be less dominant. If you know that your features strongly depend on each other, other techniques might be more suitable.

104
Q

Natural Number

A

Natural numbers are always whole (no fractions or decimals) and greater than zero. There’s a smallest natural number (1), and you can always get the next one by adding 1 (2, 3, 4…). Mathematically, natural numbers have a rigorous definition, but the essence is that they can be obtained starting at 1 and repeatedly adding 1.

What natural numbers ARE NOT:
Zero: Zero represents the absence of a quantity, not a count of something.
Negative Numbers: These extend the idea of a number line in the opposite direction.
Fractions/Decimals: These represent parts of a whole rather than whole objects themselves.

105
Q

Newtons Method

A

Newton’s method is an iterative process for finding a root of a function (a point where its value equals zero), using the function’s derivative. Imagine zooming in on a curve: as you get closer, it looks like a straight line, its tangent. Newton’s method says: 1) start with a guess for the root, 2) draw the tangent line at that guess, 3) take the point where that line hits the x-axis as your new, better guess at the root. Repeat this and your guesses rapidly get closer to the true root.
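
A compact sketch of the iteration, using f(x) = x² − 2 (whose positive root is √2) as an assumed example:

def newton(f, f_prime, x, steps=10):
    for _ in range(steps):
        x = x - f(x) / f_prime(x)   # follow the tangent line down to the x-axis
    return x

root = newton(lambda x: x**2 - 2, lambda x: 2 * x, x=1.0)
print(root)   # ~1.41421356...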

106
Q

Non-response bias

A

Non-response bias occurs when individuals or groups who choose to participate in a study, survey, or similar process are systematically different from those who choose not to participate. This leads to a sample that doesn’t accurately reflect the target population you’re trying to understand. For example, if people who hold negative opinions about a product are less likely to participate in a product feedback survey, the results may appear overly positive. Non-response bias can significantly distort findings, making it hard to generalize your conclusions to the broader population.

107
Q

Null hypothesis

A

Null Hypothesis (H0): The default statement, usually one of “no effect” or “no difference”. Basis of Statistical Testing: The null hypothesis often states that there is no relationship between two variables, no difference between groups, or that an observed effect is simply due to random chance.
Alternative Hypothesis (Ha): The statement you want to find evidence to support.

The goal of statistical testing is to gather enough evidence to reject the null hypothesis. By showing the null hypothesis is very unlikely, we gain confidence in an alternative hypothesis (that there is an effect or relationship). The null hypothesis provides a starting point for comparison. If you can’t definitively say something is different from the expected, you don’t have strong evidence for a change.

108
Q

Orthogonal Matrices

A

Square matrices whose columns and rows are orthogonal unit vectors. (Perpendicular vectors. Imagine two arrows pointing at right angles to each other, like a perfect T intersection. Those arrows represent orthogonal vectors. In mathematical terms, the angle between them is 90 degrees, and their dot product (a specific operation that measures their alignment) is zero.)

109
Q

Permutations

A

A permutation is an arrangement of objects in a definite order. Imagine you have a set of letters {A, B, C}; some permutations would be ABC, BCA, or CAB. The key concept is that each permutation involves a unique rearrangement of the same elements, and the order in which they appear matters. Permutations are used in calculating probabilities, cryptography, and various areas of computer science where understanding different arrangements is essential.

110
Q

Point estimators

A

Statistical methods used to estimate unknown parameters, such as population mean or variance, based on sample data. A point estimator produces a single value (point estimate) that serves as the best guess or approximation of the true parameter value. Common point estimators include the sample mean, sample variance, and maximum likelihood estimator. The quality of a point estimator is typically assessed based on properties such as unbiasedness, efficiency, and consistency.

111
Q

Poisson Distribution

A

A probability distribution that describes the number of events occurring in a fixed interval of time or space, given a known average rate of occurrence and assuming independence between events. It is characterized by a single parameter, typically denoted by λ, which represents the average rate of occurrence of the events. The Poisson distribution is commonly used to model rare events such as the number of arrivals at a service center, the number of phone calls received per hour, or the number of accidents at an intersection.

112
Q

Policy Iteration

A

Policy iteration is an algorithm in reinforcement learning that helps find the best course of action (policy) to maximize long-term rewards. It works in cycles:

Start Simple: Begin with any policy, no matter how random.
Evaluate: Simulate the environment using that policy and calculate the value (expected future reward) for each state.
Improve: Based on these values, find a new policy that’s better at selecting actions that lead to higher rewards.
Repeat: Keep going back to step 2 with the improved policy until the policy stabilizes, meaning it no longer significantly improves upon itself.
This iterative process ensures you gradually converge on the optimal policy that brings the most rewards in the long run.
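
A minimal sketch of these cycles on a hypothetical two-state, two-action MDP (all transition probabilities and rewards below are made up for illustration):

import numpy as np

# P[s, a, s'] = transition probability, R[s, a] = immediate reward (toy values).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
policy = np.zeros(2, dtype=int)              # step 1: start with an arbitrary policy

while True:
    # Step 2, evaluation: solve (I - gamma * P_pi) V = R_pi for the state values.
    P_pi = P[np.arange(2), policy]
    R_pi = R[np.arange(2), policy]
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

    # Step 3, improvement: act greedily with respect to the current values.
    Q = R + gamma * P @ V                    # Q[s, a]
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):   # step 4: stop once the policy is stable
        break
    policy = new_policy

print("optimal policy:", policy, "state values:", V)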

113
Q

Polynomial

A

A polynomial is a mathematical expression that consists of variables, constants, and exponents combined using addition, subtraction, and multiplication operations. Variables in a polynomial can only have non-negative integer powers (like x², x³, but not x½ or x⁻¹). Polynomials can have one or multiple terms, and their form reveals important properties about their behavior. For example, a linear polynomial (like 3x + 5) forms a straight line when graphed, while a quadratic polynomial (like x² - 2x + 1) forms a parabola.

114
Q

Polynomial transformation

A

A technique used in machine learning and statistics to transform input features by raising them to different powers and combining them through multiplication. Polynomial transformations are commonly used to capture nonlinear relationships between features and target variables in regression and classification tasks. By introducing polynomial terms, such as quadratic or cubic terms, polynomial transformation allows models to fit more complex patterns in the data.
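
A sketch of the idea with plain NumPy, assuming a single toy feature x and a cubic target; expanding x into [1, x, x², x³] lets an ordinary least-squares fit capture the curve:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = x**3 - x + rng.normal(scale=0.2, size=x.shape)           # nonlinear target

X_poly = np.column_stack([np.ones_like(x), x, x**2, x**3])   # polynomial features
coef, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(coef)   # roughly [0, -1, 0, 1], recovering the cubic shape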

115
Q

Population

A

The entire group of individuals, items, or events that researchers want to study and draw conclusions about. In statistical analysis, population parameters, such as mean, median, standard deviation, and variance, describe characteristics of the entire population. While studying an entire population is often impractical, researchers draw conclusions from representative samples to make inferences about the population as a whole, striving to minimize sampling errors and ensure the findings’ validity and reliability. Understanding population characteristics is essential for informed decision-making, policy formulation, and addressing research questions across diverse fields and disciplines.

116
Q

Positive Definite Matrices

A

A symmetric matrix where all eigenvalues are positive.

117
Q

Posterior

A

Posterior probability represents the updated belief or uncertainty about the likelihood of an event occurring after incorporating new evidence or data. It is obtained by applying Bayes’ theorem, which combines the prior probability of an event with the likelihood of observing the data given the event (likelihood function). Posterior probabilities reflect the updated beliefs or probabilities after considering new information, making them more informative and accurate than prior probabilities alone. Bayesian inference involves updating prior beliefs using Bayes’ theorem to obtain posterior probabilities, allowing for a principled approach to reasoning under uncertainty in various fields, including statistics, machine learning, and decision-making.

118
Q

Prior

A

Prior probability represents the initial belief or uncertainty about the likelihood of an event occurring before any new evidence is taken into account. Priors can be uniform (assigning equal probabilities to all possible outcomes), informed (based on available knowledge or data), or subjective (based on personal beliefs or opinions). Priors play a crucial role in Bayesian inference, providing a starting point for updating beliefs in light of new evidence.

119
Q

Probability

A

Probability is the branch of mathematics concerning events and numerical descriptions of how likely they are to occur. A probability assigns every event a value between zero and one, with the requirement that the event made up of all possible results (the entire sample space) is assigned a probability of one.

120
Q

Probability density function (PDF)

A

The probability distribution of a CRV (a continuous probability distribution) is described by a probability density function (pdf). The pdf is a function whose codomain is nonnegative and whose total area under the curve is equal to 1.

A function that describes the probability distribution of a continuous random variable. Unlike the probability mass function (PMF) for discrete random variables, which assigns probabilities to individual values, the PDF assigns probabilities to intervals of values. The integral of the PDF over a given interval gives the probability that the random variable falls within that interval. The PDF must satisfy two properties: it must be non-negative for all possible values of the random variable, and the total area under the curve (the integral over all possible values) must be equal to 1.
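
A tiny SciPy sketch of the point that probabilities come from areas under the PDF, using the standard normal distribution as an example:

from scipy.stats import norm

# P(-1 <= X <= 1): the area under the pdf over [-1, 1], computed via the CDF.
print(norm.cdf(1) - norm.cdf(-1))   # ~0.6827

# The total area under the pdf is 1.
print(norm.cdf(float("inf")))       # 1.0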

121
Q

Probability mass function (PMF)

A

A PMF is a mathematical function used with discrete random variables, variables that can only take on specific, countable values (like the number of times you roll a 6 on a die). The PMF tells you the probability that the random variable will equal each of its possible values. PMFs are the building blocks for more complex probability calculations with discrete variables. All probabilities given by a PMF must be between 0 and 1, and if you add up the probabilities of all possible outcomes, the total must always be 1.

Imagine a recipe where the ingredients are the possible values the variable can take, and the amount of each ingredient is the probability of getting that specific value.

122
Q

Probability trees and diagrams

A

Probability trees and diagrams are helpful tools for visualizing and calculating probabilities in scenarios where multiple events happen in sequence. Think of them as maps with branches! Each branch represents a possible outcome of an event, and the number written on it is the probability of that outcome happening. To find the probability of a chain of events (like flipping a coin twice and getting heads both times), you follow the right branches and multiply the probabilities along the way. These diagrams make it easier to understand complex scenarios and ensure you consider all possible outcomes and their associated probabilities.

123
Q

Random Variable

A

Usually written as an italic capital letter, like X, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. Examples of random phenomena with a numerical outcome include a toss of a coin (0 for heads and 1 for tails), a roll of a die, or the height of the first stranger you meet outside. There are two types of random variables: discrete and continuous.

A discrete random variable takes on only a countable number of distinct values such as red, yellow, blue or 1, 2, 3,…. The probability distribution of a discrete random variable is described by a list of probabilities associated with each of its possible values. This list of probabilities is called a probability mass function (pmf).

A continuous random variable (CRV) takes an infinite number of possible values in some interval. Examples include height, weight, and time. The probability distribution of a CRV (a continuous probability distribution) is described by a probability density function (pdf). The pdf is a function whose codomain is nonnegative and whose total area under the curve is equal to 1.

124
Q

Rank of a matrix

A

The rank of a system of linear equations tells you whether there is a unique solution, multiple solutions, or no solution. The rank of a matrix, in linear algebra, is the maximum number of linearly independent rows or columns within that matrix. A set of vectors (rows or columns) is linearly independent if none of them can be formed as a combination of the others. The rank effectively tells you the true dimension of the space spanned by the matrix’s column vectors (or row vectors). If a matrix has a lower rank than its number of rows or columns, there is some redundancy within it. Think of each row (or column) of a matrix as representing an equation; linear independence means each equation contributes new information to the system.

The rank of a matrix is both:
The maximum number of linearly independent rows.
The maximum number of linearly independent columns.

Singular vs. Non-singular
Singular: A matrix is singular if its determinant is zero. This implies: Not invertible. Linearly dependent rows or columns. Might correspond to systems with no unique solution.
Non-singular: A matrix with a non-zero determinant. This implies: Invertible. Linearly independent rows and columns.

Complete, Redundant, and Contradictory
Complete: A consistent system of equations (it has at least one solution). Whether it’s a unique solution or infinitely many depends on the rank.
Redundant: The augmented matrix has linearly dependent rows. This means there are redundant equations, leading to infinitely many solutions.
Contradictory: The augmented matrix represents an inconsistent system. There’s no solution that can satisfy all the equations.

In ML, rank can be used to assess feature redundancy or find lower-dimensional representations of data.

125
Q

Real Number

A

In mathematics, a real number is a number that can be used to measure a continuous one-dimensional quantity such as a distance, duration, or temperature. Here, continuous means that pairs of values can have arbitrarily small differences. Real numbers include both rational and irrational numbers.

Real numbers are essential for modeling quantities that can have continuous values: length, temperature, time, speed, etc. The analysis of change and smooth curves hinges on real numbers.

There are infinitely many real numbers, and there’s always a real number between any two other real numbers. Real numbers have a natural order (greater than, less than).

Real numbers do not include things like Imaginary numbers (involving the square root of -1)

126
Q

Regularization

A

A technique used in machine learning and statistics to prevent overfitting and improve the generalization performance of a model. It involves adding a penalty term to the objective function being optimized during model training. The penalty term discourages overly complex models by imposing constraints on the model parameters, leading to smoother and more regularized solutions. Common regularization techniques include:
- L1 regularization (Lasso)
- L2 regularization (Ridge),
- Elastic Net regularization.

Regularization is essential for building models that generalize well to unseen data and avoid overfitting to noise in the training data.
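
A small sketch of L2 (Ridge) regularization using its closed-form solution on made-up data; the penalty term alpha * ||w||^2 shrinks the coefficients toward zero:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_w = np.array([1.0, 0.0, -2.0, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=20)

alpha = 1.0
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)   # coefficients pulled slightly toward zero versus plain least squares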

127
Q

Regularization term

A

Regularization term refers to the additional component added to the loss function or objective function during model training. It penalizes overly complex models by adding a cost that grows with the magnitude of the model parameters (for example, the sum of their absolute values or of their squares), thereby imposing constraints on those parameters. The choice of regularization term depends on the specific regularization technique being employed, such as L1 regularization, L2 regularization, or Elastic Net regularization. The regularization term helps to control the trade-off between model complexity and fit to the training data, leading to models that generalize better to unseen data.

128
Q

Representative data set

A

A subset of data that accurately reflects the characteristics of the entire dataset or population. Representative datasets are crucial for making valid inferences and generalizations about the population based on sample data. Ensuring representativeness involves careful selection methods to avoid bias and ensure the sample’s diversity mirrors that of the population, thereby enhancing the reliability and validity of research findings. Representative datasets enable researchers to extrapolate findings to broader populations with confidence, supporting evidence-based decision-making and policy formulation.

129
Q

Right-tailed test

A

A right-tailed test focuses on whether there is evidence that your sample statistic is significantly larger than what the null hypothesis claims. The critical region, where you would reject the null hypothesis, is in the rightmost tail of the distribution. You are asking: “Is my sample so unusually large that it is unlikely to have happened by chance if the null hypothesis were true?” Right-tailed tests look for deviations in one direction (larger). When choosing a test, compare this with the left-tailed test (looking for “smaller than”) and the two-tailed test (looking for any significant difference in either direction).

130
Q

Sample

A

In statistics, a sample refers to a subset of individuals or observations taken from a larger population. Samples are used to make inferences or generalizations about the population from which they are drawn.

131
Q

Sample mean

A

A measure of central tendency that represents the average value of observations in a sample. It is calculated by summing up all the values in the sample and dividing by the number of observations. The sample mean provides an estimate of the population mean and is a fundamental concept in inferential statistics.

132
Q

Sample proportion

A

Used to estimate the proportion of a certain attribute or characteristic within a population based on a sample. It is calculated by dividing the number of individuals in the sample exhibiting the attribute of interest by the total sample size. Sample proportions are often used in hypothesis testing and confidence interval construction for population proportions.

133
Q

Sample statistics

A

Numerical measures calculated from a sample of data that provide information about the characteristics of the sample. These statistics are used to estimate or infer properties of the population from which the sample is drawn. Common sample statistics include measures of central tendency (e.g., mean, median, mode) and measures of variability (e.g., variance, standard deviation). Sample statistics are essential in statistical analysis for making inferences, testing hypotheses, and drawing conclusions about populations based on limited sample data.

134
Q

Sample variance

A

A measure of dispersion or variability within a sample. It quantifies how much individual observations in a sample differ from the sample mean. It is calculated by taking the average of the squared differences between each observation and the sample mean. Sample variance is essential in understanding the spread of data points within a sample and is used in various statistical analyses.

135
Q

Scalar (vector)

A

A single numerical value, typically representing a magnitude or quantity only without any associated direction. Scalars are distinguished from vectors, which are quantities that have both magnitude and direction. Scalars can represent various physical and abstract quantities, such as temperature, mass, time, and energy. In linear algebra, scalars are used to scale vectors or matrices, multiplying each element of the vector or matrix by the scalar value.

If Scalar is a single point on a number line, vector is an arrow with a starting point, length (magnitude), and an arrowhead indicating direction. If Scalar has only magnitude, vector has magnitude and direction.

136
Q

Second derivative

A

The First Derivative: Measures the instantaneous rate of change of a function. It tells you how much the function’s output changes for a tiny change in its input, which corresponds to the slope of the function’s graph.

The Second Derivative: The derivative of the derivative. It measures the rate of change of the first derivative, revealing how the slope of the original function is changing. If the function measures position, the first derivative tells you the exact speed at a specific point, and the second derivative measures the acceleration at any moment.

In optimization, the second derivative test helps identify whether a critical point (where the first derivative is zero) is a local minimum, a local maximum, or neither.

137
Q

Set (Math)

A

A set is an unordered collection of unique elements. We denote a set with a calligraphic capital character, for example, S. A set of numbers can be finite (containing a fixed number of values); in this case, it is denoted using curly braces, for example, {1, 3, 18, 23, 235} or {x1, x2, x3, x4, . . . , xn}. A set can also be infinite and include all values in some interval. If a set includes all values between a and b, including a and b, it is denoted using brackets as [a, b]. If the set doesn’t include the values a and b, it is denoted using parentheses like this: (a, b). For example, the set [0, 1] includes such values as 0, 0.0001, 0.25, 0.784, 0.9995, and 1.0. A special set denoted R includes all numbers from minus infinity to plus infinity.

139
Q

Significance level

A

This is your predetermined threshold for rejecting the null hypothesis. A common level is 0.05, meaning you accept a 5% chance of incorrectly rejecting the null hypothesis (seeing an effect when there actually is none). It controls the rate of Type I errors (false positives).

140
Q

Singularity of matrix

A

If a system of linear equations is represented by a singular matrix, it means that the system does not have a unique solution. This could occur when there are dependent equations, or when there are more equations than unknowns.

A singular matrix is like a broken tool in your mathematical toolbox. Normally, a matrix acts as a transformation (stretching, rotating, etc.); a singular matrix collapses space in some way and loses the ability to fully represent all the original directions of information. This is signaled by the matrix having a determinant of zero. Singular matrices cause trouble because, like dividing by zero, they can lead to undefined or unpredictable results when you try to use them in certain calculations (like finding the inverse).

In the context of data analysis or linear regression, a singular matrix may indicate collinearity or redundancy among the predictors. This means that one or more columns (or rows) of the matrix are linearly dependent on the others, resulting in a loss of information or redundancy in the data.
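
A quick NumPy sketch: a toy matrix with linearly dependent rows has determinant zero and cannot be inverted.

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])          # second row = 2 * first row

print(np.linalg.det(A))             # 0.0 (up to floating-point noise)
try:
    np.linalg.inv(A)
except np.linalg.LinAlgError as err:
    print("not invertible:", err)   # raised because the matrix is singular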

141
Q

Skewness

A

The third moment of a distribution. Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. It quantifies the extent to which the probability mass or density function of a random variable deviates from symmetry around its mean. Positive skewness indicates that the right tail of the distribution is longer or fatter than the left tail, while negative skewness indicates the opposite. Skewness is an important statistical measure used in data analysis and modeling to assess the shape and symmetry of distributions.

142
Q

Slope (derivatives)

A

The slope, often referred to as the derivative in calculus, is a fundamental concept that measures how a function changes as its input changes. Geometrically, the slope represents the steepness of the tangent line to the function’s graph at a given point. A positive slope indicates that the function is increasing, while a negative slope indicates that the function is decreasing. A slope of zero indicates that the function is neither increasing nor decreasing at that point.

The derivative represents the rate of change or the slope of the function at a particular point. It measures how the function value changes with respect to a small change in the independent variable. The derivative is a fundamental concept in calculus and mathematical analysis, used to analyze the behavior of functions, optimize functions, and solve differential equations. In machine learning and optimization, derivatives are essential for gradient-based optimization algorithms such as gradient descent.

The general formula for calculating the derivative of a function f(x) with respect to its input variable x is denoted as f′(x) or df/dx. It is defined as the limit of the difference quotient as the change in x approaches zero:

f′(x) = lim_{h→0} (f(x + h) − f(x)) / h
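
A tiny sketch of this definition in code: for a small h, the difference quotient approximates the derivative.

def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h     # difference quotient with a small h

f = lambda x: x**2                   # f'(x) = 2x
print(derivative(f, 3.0))            # ~6.0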

143
Q

Span

A

In machine learning, “span” usually refers to the concept of linear span within the context of vectors and feature spaces. Imagine your data points as vectors in a high-dimensional space (where each feature is a dimension). The span of a set of vectors is all the possible points you can reach by combining those vectors through linear combinations (scaling and adding). A larger span means the vectors can represent a wider range of potential data points. This is important in ML because:

Model Expressiveness: A model’s ability to learn complex patterns depends on whether its transformations can span the space where the real-world data lies.
Kernel Methods: Techniques like Support Vector Machines (SVMs) use kernels to project data into higher-dimensional spaces. The span in this transformed space determines the SVM’s ability to find complex decision boundaries.
Feature Engineering: Sometimes, creating new features as linear combinations of existing ones can increase the span and improve model performance.

144
Q

Sparse Matrices

A

Matrices where most elements are zero.

145
Q

Square Matrix

A

A matrix with an equal number of rows and columns.

146
Q

Standard deviation (spread)

A

A statistical measure of dispersion (spread) representing the typical distance of data points from the mean value. It quantifies the variability or spread of values within a dataset. A higher standard deviation indicates greater dispersion among data points, whereas a lower value suggests a more concentrated distribution around the mean.

SD = √[ Σ(xi - x̄)² / (n - 1) ]
Explanation of Symbols:

Σ: Summation symbol (means “add up the following terms”)
xi: An individual data point in your dataset
x̄: The mean (average) of your data points
n: The number of data points in your sample
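
A short NumPy sketch showing that the formula above is the sample standard deviation (division by n − 1), which corresponds to ddof=1; NumPy’s default divides by n instead:

import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(np.std(data, ddof=1))   # sample standard deviation, matches the formula
print(np.std(data))           # population standard deviation (divides by n)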

147
Q

Symmetric Matrix

A

A square matrix that is equal to its transpose.

148
Q

T-distribution

A

The T-distribution, also known as the Student’s T-distribution, is a probability distribution that is symmetric and bell-shaped, similar to the normal distribution. However, it has heavier tails, which means it has more probability in the tails and less in the center compared to the normal distribution. The T-distribution is characterized by a single parameter known as the degrees of freedom (df), which determines the shape of the distribution. As the degrees of freedom increase, the T-distribution approaches the normal distribution. The T-distribution is commonly used in statistics, particularly in hypothesis testing and confidence interval estimation when the sample size is small or when the population standard deviation is unknown. It arises naturally in the context of estimating the mean of a normally distributed population when the sample size is small, and the population standard deviation is estimated from the sample.

149
Q

T-test

A

The T-test is a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups. It is based on the T-distribution and is particularly useful when the sample size is small or when the population standard deviation is unknown.

There are several types of T-tests, including the independent samples T-test, paired samples T-test, and one-sample T-test. The independent samples T-test compares the means of two independent groups, the paired samples T-test compares the means of two related groups, and the one-sample T-test compares the mean of a single sample to a known population mean.

In each case, the T-test calculates a test statistic (the T-value) and compares it to a critical value from the T-distribution to determine whether the difference between the means is statistically significant at a given significance level (alpha). If the absolute value of the T-value exceeds the critical value, the null hypothesis of no difference between the means is rejected, indicating that there is a significant difference between the groups. The T-test is widely used in various fields, including medicine, psychology, and business, to compare means and make inferences about population parameters based on sample data.

150
Q

The Hessian matrix

A

The Hessian matrix is a collection of all the second-order partial derivatives of a scalar-valued function (a function that takes multiple variables as input and outputs a single number). Just like the regular derivative tells you about the slope of a curve at a point, the Hessian matrix provides information about the curvature of a multidimensional surface. It reveals how the function’s steepness changes in different directions.

Imagine a topographical map. The Hessian matrix at a particular point would describe whether you are in a valley (positive curvature), on a mountain peak (negative curvature), on a saddle point, or a more complex shape.

In the training of neural networks, the Hessian matrix can be used in second-order optimization methods. Hessian-based methods use curvature information to potentially take more informed steps toward the optimal parameters, compared to gradient descent, which only considers the first-order slope. The Hessian can also provide insight into the uncertainty of function estimates; for instance, the inverse of the Hessian can be related to the covariance matrix, which describes the spread of a probability distribution. Unfortunately, calculating and storing the full Hessian matrix for a large neural network with millions of parameters can be extremely expensive or even infeasible, which often necessitates approximations or workarounds.

151
Q

Transpose

A

In linear algebra, the transpose of a matrix is a new matrix created by flipping its rows and columns. Rows become columns, columns become rows. If the original matrix is A, its transpose is denoted as Aᵀ. The dot product of two vectors can be calculated using a matrix transpose.

In ML, data is often organized into matrices where rows represent individual samples, and columns represent features. The transpose helps switch between this sample-oriented perspective and a feature-oriented perspective, as needed. Certain algorithms or calculations might work more conveniently when features form the rows rather than the columns. The transpose allows for this easy transformation. Calculating covariance matrices, crucial for understanding relationships between features, can involve transposes. Transpose operations can appear when projecting data onto lower-dimensional spaces, a technique used for dimensionality reduction.

The dot product between vectors is a fundamental calculation in many ML algorithms. One way to express a dot product is as a matrix multiplication involving a transpose. During backpropagation in neural networks, gradients are calculated with respect to weight matrices. Transposes often play a role in manipulating these gradients and ensuring the math of backpropagation works out correctly.

152
Q

Two-tailed test

A

A two-tailed test is like looking for something lost in a large field. You’re not sure if it’s to your left or your right, so you search in both directions. In hypothesis testing, a two-tailed test considers the possibility of extreme deviations from your null hypothesis in both directions (greater than or less than the expected value). It’s used when you care about a change in either direction, not just a specific increase or decrease. For example, if testing the effect of a new drug, you might use a two-tailed test since you’d be interested in whether it significantly improves or worsens a patient’s condition.

153
Q

Type I error

A

A Type I error occurs when you reject the null hypothesis even though it’s actually true. In other words, you conclude there’s a significant difference or effect when, in reality, there isn’t one. It’s the risk of believing there’s a meaningful finding when it’s actually due to chance or random variation.

A Type I error could lead to pursuing ineffective treatments, investing in flawed strategies, or making decisions based on false assumptions. False positives contribute to the problem of non-reproducible results in scientific studies. Reducing the risk of a Type I error often increases the risk of a Type II error (failing to reject the null hypothesis when it’s false). It’s a balancing act. Which type of error is considered more serious depends on the specific research question and potential consequences of a wrong decision.

A higher Type I error rate means a greater chance of making false positive predictions. This directly lowers your model’s precision. If you want a precise model, you need to be very careful about controlling your Type I error rate. Precision measures how many of the positive predictions made by your model are actually correct. It focuses on minimizing false positives.

154
Q

Type II error

A

A Type II error occurs when you fail to reject the null hypothesis even though it’s actually false. In other words, you conclude there’s no significant difference or effect when, in reality, there is one. It’s the risk of missing a real effect or failing to discover something meaningful.

Type II errors can lead to overlooking potentially beneficial treatments, missing important discoveries, or failing to identify problems that need addressing. Studies that are too small or have insufficiently sensitive measurements increase the risk of Type II errors.

Statistical Power: Power is the probability of correctly rejecting the null hypothesis when it’s false. You increase power by:
Larger Sample Size: More data gives you a better chance of detecting true effects.
Larger Effect Size: A bigger difference between groups is easier to detect.
Less Variability: Reducing noise or measurement error makes it easier to see the signal.

Reducing the risk of Type II error often increases the risk of a Type I error (false positive). Which type of error is considered more serious depends on the specific research question and potential consequences of a wrong decision.

155
Q

Unbiased estimators

A

Statistical estimators whose expected value is equal to the true parameter value being estimated. In other words, an unbiased estimator produces estimates that, on average, are not systematically too high or too low when considering multiple samples from the population. Unbiasedness is a desirable property in statistical estimation as it ensures that the estimator provides accurate and reliable estimates of population parameters.

156
Q

Uniform distribution

A

The uniform distribution is a probability distribution where all outcomes are equally likely within a specified range. In other words, every value within the range has the same probability of occurring. The uniform distribution is characterized by two parameters: the minimum value (a) and the maximum value (b) of the range. The probability density function (PDF) of the uniform distribution is constant within the range [a, b] and zero outside this range. The cumulative distribution function (CDF) of the uniform distribution increases linearly from 0 to 1 within the range [a, b]. The uniform distribution is commonly used in simulations, random number generation, and statistical modeling when no prior knowledge about the distribution of the data is available. In Python, the uniform distribution is available in libraries such as NumPy and SciPy, allowing users to generate random numbers following a uniform distribution or perform calculations related to the uniform distribution easily.

157
Q

Variance

A

Variance is the second moment of a distribution, providing essential information about the variability or volatility of data sets. It quantifies the average squared deviation of each data point from the mean of the data set. Variance is a measure of the dispersion or spread of a set of data points around their mean. A high variance indicates that the data points are spread out widely from the mean, while a low variance indicates that the data points are clustered closely around the mean.

158
Q

Vector

A

A vector is an ordered collection of numbers, representing both magnitude and direction. Vectors exist in multi-dimensional spaces (e.g., a 2D vector, a 3D vector). A vector is an ordered list of scalar values, called attributes. We denote a vector as a bold character. Vectors can be visualized as arrows pointing in some direction, or as points in a multi-dimensional space. We denote an attribute of a vector as an italic value with an index, like this: W(j) or X(j). The index j denotes a specific dimension of the vector, the position of an attribute in the list. Sometimes the index is written on the upper right; as long as it is in round brackets (), it means the same thing.

If Scalar is a single point on a number line, vector is an arrow with a starting point, length (magnitude), and an arrowhead indicating direction. If Scalar has only magnitude, vector has magnitude and direction. Multiplying a vector by a scalar scales its magnitude. For example, doubling a velocity vector doubles its speed but keeps the direction the same. You cannot directly add a vector and a scalar, as it doesn’t have a clear geometric meaning. Specialized operations like dot product, cross product, and element-wise addition can be performed on vectors

159
Q

Vector Direction

A

The direction of a vector refers to the orientation or angular position of the vector in space relative to a reference axis or coordinate system. It specifies the angular relationship between the vector and a reference direction, usually measured in terms of angles or trigonometric functions. In a two-dimensional Cartesian coordinate system, the direction of a vector can be represented by an angle measured counterclockwise from the positive x-axis. In a three-dimensional space, direction can be specified using spherical coordinates (azimuth and inclination angles) or Cartesian coordinates (x, y, and z components). Alternatively, direction can also be represented using unit vectors, which have a magnitude of 1 and point in the direction of the vector. Understanding the direction of vectors is essential in various fields, including physics, engineering, computer graphics, and machine learning, where vectors are used to represent physical quantities, forces, velocities, displacements, and more.

160
Q

Vector Magnitude

A

The magnitude of a vector, also known as the length or norm, represents the size or length of the vector in space. It is a scalar quantity and is always non-negative. The magnitude of a vector is calculated using the Pythagorean theorem in two or three dimensions, or using the Euclidean distance formula in higher dimensions. For a vector represented by components (x, y, z, …) in Cartesian coordinates, the magnitude can be calculated as the square root of the sum of the squares of its components. Geometrically, the magnitude of a vector represents the distance from the origin to the point represented by the vector in space. In physics, the magnitude of a vector often represents the strength, intensity, or magnitude of a physical quantity or force. In machine learning and data analysis, the magnitude of vectors is often used to quantify similarity, distance, or importance in feature space.
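A minimal NumPy sketch (the vector is chosen for illustration):

import numpy as np

v = np.array([3.0, 4.0])
magnitude = np.sqrt(np.sum(v**2))        # square root of the sum of squared components: 5.0
print(magnitude, np.linalg.norm(v))      # np.linalg.norm computes the same magnitude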

161
Q

Vector Norms

A

Vector norms quantify the “size” or “length” of a vector in a vector space.

Lp Norm: Defined as ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^(1/p), where p is a real number.
L1 Norm (Manhattan Distance): Sums the absolute values of the vector’s components.
L2 Norm (Euclidean Distance): The most common one. It calculates the square root of the sum of the squared components.
L∞ Norm (Infinity Norm): Finds the maximum absolute value among the vector’s components.

Norms provide a measure of distance or similarity between vectors. Norms satisfy properties such as non-negativity, homogeneity, and the triangle inequality (subadditivity). Normalizing a vector by its norm yields a unit vector pointing in the same direction. The choice of norm depends on the problem, with different norms capturing different aspects of vector behavior.

Models can become too complex and “memorize” the training data, failing to generalize to new examples; norms are added as penalty terms to a model’s loss function to counteract this (regularization). Many ML algorithms also rely on quantifying how similar or different data points are, and norms provide those distance metrics, e.g., for clustering (K-means uses the L2 norm) and for KNN (various norms). Norms also support feature scaling, which is needed because having features on a similar range can be crucial for scale-sensitive algorithms (Min-Max scaling, standardization). The direction and size of updates during gradient-based optimization often rely on computing the gradient’s norm, and norms underpin regression metrics such as MSE and MAE.
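A minimal NumPy sketch computing the norms listed above (the vector is illustrative):

import numpy as np

x = np.array([3.0, -4.0, 1.0])
l1   = np.linalg.norm(x, ord=1)            # |3| + |-4| + |1| = 8
l2   = np.linalg.norm(x, ord=2)            # sqrt(9 + 16 + 1) ≈ 5.10
linf = np.linalg.norm(x, ord=np.inf)       # max absolute component = 4
unit = x / l2                              # normalizing by the norm gives a unit vector
print(l1, l2, linf, np.linalg.norm(unit))  # the last value is 1.0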

162
Q

Zero Matrix

A

A matrix with all elements being zero.

163
Q

Eigenvalues

A

Eigenvalues, in linear algebra, are special scalar values associated with a square matrix. They tell you something important about how that matrix transforms vectors.

Transformation: Imagine a matrix like a stretching and twisting machine. It takes an input vector and outputs a transformed version.
Eigenvectors: Special non-zero vectors (called eigenvectors) are unique because when fed into this matrix, they only get stretched or shrunk (and possibly flipped), not twisted or bent; the line they lie on stays the same.
Eigenvalue: The eigenvalue is the scaling factor by which an eigenvector gets stretched. An eigenvalue of 2 means the vector is doubled in length, an eigenvalue of -1 flips its direction while keeping its length, and an eigenvalue of -2 both flips its direction and doubles its length.

So, eigenvalues essentially capture the “stretching power” of a matrix along specific directions (eigenvectors). They are crucial for various applications in physics, engineering, and data analysis.
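A minimal NumPy sketch (the matrix is chosen so the eigenvalues are easy to read off):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, -1.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)                     # [ 2. -1.]
v = eigenvectors[:, 0]                 # an eigenvector (a column of the result)
print(A @ v, eigenvalues[0] * v)       # the matrix only scales v: the two results match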

164
Q

Sparsity

A

Property of having a relatively small number of nonzero elements compared to the total number of elements in a mathematical object. This concept is commonly encountered in various mathematical contexts, including linear algebra, optimization, signal processing, and machine learning.

Here are a few examples of sparsity in different mathematical contexts:

Sparse Matrices: In linear algebra, a matrix is considered sparse if the majority of its elements are zero. Sparse matrices often arise in applications such as network analysis, finite element methods, and solving systems of linear equations. Utilizing the sparsity of matrices can lead to more efficient algorithms and storage methods compared to dense matrices.

Sparse Solutions in Optimization: In optimization problems, a solution is considered sparse if it has only a small number of nonzero components. Sparse solutions are often desirable in various applications, such as compressed sensing, where one seeks to recover a sparse signal from a limited number of observations.

Sparse Signals in Signal Processing: In signal processing, a signal is considered sparse if it has only a few significant components compared to its total length. Sparse signals are common in applications such as image processing, audio processing, and data compression.

Sparse Representations in Machine Learning: In machine learning, sparse representations refer to data representations where only a subset of features or dimensions is relevant or contributes significantly to the underlying structure of the data. Sparse representations are utilized in tasks such as feature selection, dimensionality reduction, and regularization to improve model interpretability, generalization, and efficiency.
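A minimal sketch of a sparse matrix with SciPy (assuming SciPy is installed; the matrix is illustrative):

import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 0, 0]])
m = sparse.csr_matrix(dense)   # stores only the nonzero entries
print(m.nnz)                   # 2 nonzero elements out of 9
print(m.toarray())             # convert back to a dense array when needed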

165
Q

Bias (statistics)

A

Systematic error or deviation of a statistical estimator from the true value of the parameter being estimated. It can arise due to various factors such as sampling methods, measurement errors, or modeling assumptions. Bias affects the accuracy and reliability of statistical inference, leading to incorrect conclusions or predictions. Bias can be classified as either positive (overestimation) or negative (underestimation), and reducing bias is a key objective in statistical analysis to improve the validity of conclusions drawn from data.

166
Q

Coefficient

A

A coefficient is a multiplicative factor in a mathematical expression. It’s a number (or sometimes a symbol) placed before a variable or term, indicating how the value of that variable or term should be scaled. Coefficients play a crucial role in simplifying equations, factoring polynomials, and solving for unknown variables.

Coefficients provide a way to adjust the magnitude and sometimes the direction (through positive or negative signs) of a term or variable.

In statistics, a coefficient measures the strength and direction of a linear relationship between two variables. In linear regression, coefficients represent the change in the outcome variable for a one-unit change in the predictor variable. In ML, analyzing coefficients helps make predictions and understand which features have the strongest impact on those predictions; in logistic regression, coefficients indicate how changes in the features affect the odds of an outcome occurring. Coefficient magnitudes can also be penalized in the loss function (regularization).

167
Q

Consistency (statistics)

A

A property of estimators or statistical procedures whereby the estimate converges to the true value or target distribution as the sample size increases indefinitely. Consistent estimators approach the population parameter or true distribution in probability as the amount of data grows. Consistency is a desirable property for estimators, ensuring that they provide reliable and accurate estimates in the long run.

168
Q

Covariance and multicollinearity

A

Covariance is a pairwise measure of association between two variables. Multicollinearity is a condition where multiple independent variables are highly linearly related, potentially causing problems in statistical models.

169
Q

Covariance vs Correlation

A

Covariance
- Measures: The direction and degree to which two random variables change together.
- Range: Can range from negative infinity to positive infinity.
- Units: The units reflect the product of the units of the two variables being measured. This makes it harder to interpret directly.
- Impact of scaling: If you change the scale of one or both variables (e.g., switch from inches to centimeters), the covariance value will also change.

Correlation
- Measures: The strength and direction of a linear relationship between two variables.
- Range: Always between -1 and +1.
-1: Perfect negative correlation
0: No correlation
+1: Perfect positive correlation
- Units: Dimensionless (no units), making it a standardized measure.
- Impact of scaling: Not affected by changes in scale. If you change units, the correlation will stay the same.

Both covariance and correlation indicate the direction of a relationship between variables. Correlation provides a more easily interpretable measure of the strength of the relationship due to its standardization. Covariance is useful for understanding the raw change between variables, while correlation is better for comparing relationships between different pairs of variables.
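A minimal NumPy sketch illustrating the scaling behavior (the data are synthetic):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly y ≈ 2x

print(np.cov(x, y)[0, 1])                 # covariance: expressed in the units of x times y
print(np.corrcoef(x, y)[0, 1])            # correlation: close to +1, unit-free

x_rescaled = x * 100                      # rescaling x changes the covariance...
print(np.cov(x_rescaled, y)[0, 1])
print(np.corrcoef(x_rescaled, y)[0, 1])   # ...but leaves the correlation unchanged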

170
Q

Efficiency (statistics)

A

The ability of an estimator or statistical procedure to yield precise and accurate estimates of population parameters using the available sample data. An efficient estimator achieves low variance and bias, providing estimates that are close to the true parameter values with high probability. Efficiency is typically measured using criteria such as mean squared error (MSE), efficiency score, or asymptotic efficiency. In statistical inference, efficient estimators require smaller sample sizes to achieve a given level of precision compared to less efficient estimators, making them desirable for practical applications.

171
Q

Embedding Matrix

A

A two-dimensional array used in natural language processing and deep learning to represent word embeddings. Each row of the embedding matrix corresponds to the vector representation (embedding) of a word in a high-dimensional space. Embedding matrices are learned from large text corpora using techniques such as Word2Vec, GloVe, or FastText, capturing semantic and syntactic relationships between words. They are used as lookup tables to convert words into dense vector representations that can be fed into neural networks for tasks such as sentiment analysis, machine translation, and text generation.
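A minimal sketch of an embedding lookup in NumPy; the vocabulary, dimension, and random values are purely illustrative (in practice the matrix is learned):

import numpy as np

vocab = {"cat": 0, "dog": 1, "car": 2}
embedding_dim = 4
embedding_matrix = np.random.randn(len(vocab), embedding_dim)   # one row per word

def embed(word):
    # Look up the dense vector (row) for a word
    return embedding_matrix[vocab[word]]

print(embed("dog"))   # a 4-dimensional vector representing "dog"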

172
Q

Empirical (actual data) vs. Theoretical

A

Empirical (data-driven): This approach focuses on actual observations and measurements. It’s about gathering real-world data and analyzing it to understand patterns and relationships. Think of it as learning from experience. In machine learning, this means training models on real datasets to see how well they perform.

Theoretical: This approach builds on established concepts, principles, and often mathematical models. It’s about using existing knowledge to explain and predict phenomena. In machine learning, this involves using statistical and mathematical frameworks to understand how algorithms should behave under certain conditions.

The Synergy: Neither approach is sufficient alone. Theory provides a foundation and helps us interpret data, while empirical analysis keeps us grounded in reality. Here’s why they’re both crucial:

Theory Guides Exploration: Theoretical frameworks suggest what kind of data to collect and how to analyze it.
Data Reveals the Unexpected: Real-world data can expose limitations or surprising patterns not captured by theory, leading to new theoretical insights.
Machine Learning Example: For instance, a theoretical model might suggest a specific machine learning algorithm for a task. However, empirical evaluation using real data is essential to determine how well it performs in practice and potentially choose a different algorithm that works better.

173
Q

Jaccard Distance

A

A metric used to quantify how dissimilar two sets are, focusing on the elements that are not shared between them. It is used for text similarity, such as comparing the overlap of words between documents, and in recommendation systems, such as finding sets of items (e.g., movies, products) that are dissimilar to what a user has already seen. It can also be used in image segmentation to assess the dissimilarity between a predicted segmentation and the ground truth.

Calculation
Intersection: Find the elements that are common to both sets (the overlap between the sets).
Union: Find all the elements that are present in either set (the total items on both lists combined).
Divide: Divide the size of the intersection (# of common elements) by the size of the union (# of total elements); this ratio is the Jaccard similarity.
Subtract: The Jaccard Distance is 1 minus the Jaccard similarity, so identical sets have distance 0 and disjoint sets have distance 1.

The Jaccard Distance is sensitive to the size of the sets. Large sets with few overlapping elements could still have a relatively high distance.
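A minimal Python sketch following the calculation above (the example word sets are illustrative):

def jaccard_distance(a, b):
    # 1 - (size of intersection / size of union)
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)

doc1 = {"the", "cat", "sat"}
doc2 = {"the", "dog", "sat"}
print(jaccard_distance(doc1, doc2))   # intersection = 2, union = 4, distance = 0.5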

174
Q

L1 Norm

A

Also known as the Manhattan distance or taxicab norm, measures the distance between two points as if you were traveling on a city grid (like a taxi!). It calculates the sum of the absolute differences between the coordinates of the points. The L1 norm measures distance by the shortest path along a grid, not a straight line “as the crow flies”.

The L1 norm is often used in machine learning for regularization techniques like LASSO regression. It has the tendency to produce models with sparse coefficients (i.e., many coefficients are zero), leading to feature selection. Compared to the L2 norm (Euclidean distance), the L1 norm is less sensitive to outliers because it doesn’t square the differences.

175
Q

L1 vs L2 norm

A

The L1 norm, also known as the Manhattan or taxicab norm, provides a way to measure distance by summing the absolute values of the differences between coordinates. This norm is preferred when aiming for feature selection and sparse models, as it has the tendency to drive many coefficients towards exactly zero. The L1 norm is inherently more robust to outliers since it doesn’t amplify the effect of large deviations by squaring the differences. Visually, the L1 norm can be represented as a diamond shape.

The L2 norm, commonly referred to as the Euclidean norm, calculates distance as the square root of the sum of squared differences between coordinates. It’s a popular choice when the goal is to shrink the size of coefficients without necessarily setting them to zero, and when the presence of outliers is less of a concern. The L2 norm penalizes large coefficients and promotes smoother solutions; however, due to the squaring of differences, it exhibits a greater sensitivity to outliers. The L2 norm can be visualized as a circle.
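A minimal scikit-learn sketch of the sparsity difference between the two penalties (assuming scikit-learn is installed; the synthetic data and alpha values are illustrative):

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)   # only 2 features matter

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print(np.sum(lasso.coef_ == 0))      # many coefficients driven exactly to zero
print(np.sum(ridge.coef_ == 0))      # typically none are exactly zero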

176
Q

L2 Norm

A

The L2 norm, also known as the Euclidean distance, measures the distance between two points as a direct, ‘as the crow flies’ line, representing the shortest path between them. It calculates the square root of the sum of the squared differences between the coordinates of the points. The L2 norm penalizes large deviations due to this squaring effect, magnifying their impact on the overall distance calculation.

The L2 norm is often used in machine learning for regularization techniques like Ridge regression. It tends to shrink the size of coefficients, encouraging smoother solutions, but it doesn’t necessarily drive coefficients to zero. Compared to the L1 norm (Manhattan distance), the L2 norm is more sensitive to outliers because it squares the differences, giving larger deviations more weight in the optimization process.

177
Q

Matched pairs experiments

A

Matched pairs experiments are a study design where participants are paired up based on similar characteristics relevant to the outcome you’re interested in. Then, each member of a pair is assigned to a different treatment group. For example, to test a new exercise program, you might pair participants based on fitness level, age, etc. One person in each pair does the new program, the other does a standard routine, and you compare their results. The key idea is that pairing minimizes the influence of those other factors, letting you isolate the effect of the treatment you’re actually testing.
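A minimal SciPy sketch of analyzing matched pairs with a paired t-test (the numbers are made up):

import numpy as np
from scipy import stats

new_program = np.array([12.1, 10.4, 11.8, 13.0, 12.5])   # one member of each pair
standard    = np.array([11.0, 10.1, 11.2, 12.2, 11.9])   # the matched partner

t_stat, p_value = stats.ttest_rel(new_program, standard)  # test on within-pair differences
print(t_stat, p_value)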

178
Q

Moments of distribution

A

In statistics, moments are mathematical quantities that provide information about the shape, center, and spread of a probability distribution. The first moment is the mean, the second central moment is the variance, and the third and fourth standardized moments correspond to skewness and kurtosis.

179
Q

Multiarmed bandit problem

A

Imagine you’re in a casino facing a row of slot machines (one-armed bandits). Each machine has a different, unknown probability of paying out. You have a limited number of pulls. Your goal is to maximize your winnings by figuring out the best machines as quickly as possible. This dilemma is the multi-armed bandit problem: the challenge of balancing exploration (trying different machines to gather information) with exploitation (using your current knowledge to focus on the seemingly best machine at the moment). It represents a fundamental exploration-exploitation trade-off common in reinforcement learning scenarios where an agent must learn through trial and error while optimizing for a reward.
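A minimal epsilon-greedy sketch of this trade-off (the payout probabilities and epsilon are made up):

import numpy as np

rng = np.random.default_rng(42)
true_probs = np.array([0.2, 0.5, 0.7])      # unknown to the agent
counts = np.zeros(3)
values = np.zeros(3)                        # estimated payout per machine
epsilon = 0.1

for _ in range(1000):
    if rng.random() < epsilon:
        arm = int(rng.integers(3))          # explore: pull a random machine
    else:
        arm = int(np.argmax(values))        # exploit: pull the best machine so far
    reward = float(rng.random() < true_probs[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running average of rewards

print(values)   # estimates approach the true payout probabilities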

180
Q

Multivariate

A

In ML, “multivariate” means involving multiple variables or features simultaneously. A multivariate dataset contains several columns, each representing a different feature you want to consider. For example, a dataset for predicting housing prices might have features like square footage, number of bedrooms, neighborhood, etc. Multivariate analyses and models are designed to understand and utilize the relationships and interactions between these multiple features. This is in contrast to “univariate” which focuses on a single feature in isolation. Most real-world ML problems are multivariate as they aim to capture the complexity of the data.

181
Q

Self-selection bias

A

Self-selection bias occurs when individuals or groups choose to participate in a process (like a study, survey, or program) based on factors that also affect the variable you’re trying to measure. This leads to a sample that isn’t truly representative of the population you want to understand. For example, if only people who already feel strongly about a topic fill out a survey, your results will be skewed and not reflect the general population’s views. Self-selection bias can make it difficult to draw accurate conclusions, as the observed differences might be due to the underlying reasons for participation rather than the actual effect you’re trying to study.

182
Q

Statistical Power

A

Power is the probability of correctly rejecting the null hypothesis when it’s false. You increase power by:

Larger Sample Size: More data gives you a better chance of detecting true effects.
Larger Effect Size: A bigger difference between groups is easier to detect.
Less Variability: Reducing noise or measurement error makes it easier to see the signal.
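A minimal statsmodels sketch of a power calculation (assuming statsmodels is installed; the effect size and thresholds are illustrative):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Sample size per group needed to detect a medium effect (d = 0.5)
# with 80% power at a 5% significance level
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n))   # roughly 64 per group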

183
Q

T-statistic

A

A T-statistic is a value used in hypothesis testing to determine if a difference between two groups is statistically significant or likely due to random chance. It’s calculated by taking the difference between the groups’ means and dividing it by a measure of variability (related to standard deviation). Think of it as a signal-to-noise ratio: a large T-statistic means the observed difference is likely real, not just a fluke. T-statistics are used in various scenarios, like comparing a sample mean to a known value or examining differences between groups in an experiment. The T-statistic assumes your data follows a normal distribution. If this assumption is strongly violated, you might need non-parametric alternatives.

When you’re working with smaller datasets (typically below 30 samples), the T-statistic is more reliable than the Z-statistic. This is because it takes into account the increased uncertainty that comes with smaller samples. In most real-world scenarios, you don’t know the true population standard deviation. The T-statistic allows you to estimate it from your sample data, making it widely applicable.
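A minimal SciPy sketch of a two-sample t-test (the samples are illustrative):

import numpy as np
from scipy import stats

group_a = np.array([5.1, 4.9, 5.3, 5.5, 5.0, 5.2])
group_b = np.array([4.6, 4.8, 4.5, 4.9, 4.7, 4.4])

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # difference in means / variability estimate
print(t_stat, p_value)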

184
Q

Test statistic

A

A test statistic is a value calculated from your sample data that helps you make decisions in hypothesis testing. It quantifies how far your observed data deviates from what would be expected if the null hypothesis (your initial assumption) were true. The distribution of this test statistic under the null hypothesis is known, allowing you to calculate a p-value. This p-value represents the probability of observing a test statistic as extreme or more extreme than yours, assuming the null hypothesis is true. It guides your decision to either reject or fail to reject the null hypothesis based on a chosen significance level.

185
Q

Z-statistic

A

A Z-statistic tells you how many standard deviations a specific data point is away from the mean of its population. It converts any data point from a normal distribution onto the standard normal distribution, which has a mean of 0 and a standard deviation of 1. This allows us to compare data points from different distributions apples-to-apples because they’re on the same Z-scale.

Interpretation
Z = 0: The data point is exactly equal to the mean.
Z > 0: The data point is above the mean (number of standard deviations above).
Z < 0: The data point is below the mean (number of standard deviations below).
Magnitude: The larger the absolute value of Z, the further the data point is from the average in terms of standard deviations.

Z-scores outside the range of -3 to +3 are often considered potential outliers. Z-tables (or calculators) let you find the probability of a value falling within a certain range in a normal distribution.
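A minimal SciPy sketch (the population parameters and data point are illustrative):

from scipy import stats

mu, sigma = 100.0, 15.0   # population mean and standard deviation
x = 130.0                 # observed data point

z = (x - mu) / sigma      # 2.0 standard deviations above the mean
print(z)
print(stats.norm.cdf(z))  # probability of a value at or below x ≈ 0.977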