Behaviour module Flashcards
What are the issues with grid search?
- If the best parameter value(s) are outside of the range of values you evaluate, you will obviously not find the best parameter during search
- If the best parameter identified is on the edge of the parameter range evaluated, you likely are missing the true best parameter(s)
- The accuracy of grid search depends on how finely you evaluate the parameter range
- Grid search only works well when the number of fitted parameters is small (2-3 or less)
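A minimal sketch of grid search in Python (illustrative only; the data, grid ranges, and Gaussian model are assumptions, not from the course): evaluate the negative log-likelihood at every combination of a coarse grid over two parameters and keep the best one.
```python
# Assumed example: grid search over the mean and sd of a Gaussian model
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=100)   # synthetic data

mus = np.linspace(0, 4, 41)        # grid over the mean
sigmas = np.linspace(0.5, 3, 26)   # grid over the sd

best = (np.inf, None, None)
for mu in mus:
    for sigma in sigmas:
        nll = -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))
        if nll < best[0]:
            best = (nll, mu, sigma)   # keep the lowest -LL seen so far

print("best -LL, mu, sigma:", best)
```
Note the two limitations above are visible here: the answer can only be as precise as the grid spacing, and the number of evaluations grows multiplicatively with each added parameter.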
why do we maximize LOG likelihood instead of just likelihood?
The likelihood is the product of many numbers between 0 and 1. For large datasets, eventually this number will get rounded down to zero (numerical underflow)
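A quick sketch of the underflow problem (assumed toy numbers): multiplying many probabilities between 0 and 1 rounds to exactly zero, while the sum of log-probabilities stays finite.
```python
import numpy as np

p = np.full(1000, 0.1)     # 1000 observations, each with probability 0.1
print(np.prod(p))          # 0.0 -> numerical underflow
print(np.sum(np.log(p)))   # about -2302.6, still usable
```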
4 steps for maximum likelihood
Step 1: Formulate a model that predicts the probabilities of all possible outcomes as a function of the parameters
Step 2: Calculate the probability of each observation given parameters
Step 3: The product of the probability of all observations is the Likelihood
Step 4: Search/Solve for parameters that maximize Likelihood
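A minimal sketch of the four steps for an assumed toy example (estimating the probability of a correct response from binary choice data; the data are made up):
```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])   # made-up 0/1 observations

# Steps 1-2: the model predicts P(correct) = p; probability of each observation
def neg_log_lik(p):
    probs = np.where(y == 1, p, 1 - p)
    # Step 3: the log-likelihood is the sum of log-probabilities
    return -np.sum(np.log(probs))

# Step 4: search for the parameter value that maximizes the likelihood
res = minimize_scalar(neg_log_lik, bounds=(0.001, 0.999), method="bounded")
print(res.x)   # close to the sample mean of y (0.7), the analytic MLE
```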
What is the difference in how we fit linear and non-linear models to data?
- Linear models: we can directly solve for the parameters that best fit the data using calculus and linear algebra (done automatically by statistical software)
- Non-linear models: we have to iteratively search for the best parameters (more on this later)
Name four types of models used in behavioral sciences
- Simple general linear models with Gaussian error (General Linear Models)
  - Linear regression
  - Comparing groups (t-tests/ANOVA)
- Simple linear models with other error distributions (Generalized Linear Models)
  - Logistic regression
  - Poisson regression
- Non-linear models
  - Descriptive models that are non-linear in the parameters
- Process-based models
  - Aim to describe the underlying mechanisms and sequences of operations that give rise to cognitive functions and behaviour
  - Typically non-linear
What is a poisson distribution?
- A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time
- Assumptions
  - Events occur with a constant mean rate
  - Events occur independently of the time since the last event
When (and when not) would you use a poisson distribution and why?
- Using discrete probability distributions like the Poisson to model count data is generally only required when counts are low
- As λ increases, the Poisson distribution becomes symmetric, and you can use a Normal distribution to model the data
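A small sketch of this point (assumed example, not course code): compare the Poisson PMF to a Normal density with the same mean and variance for a small and a large λ.
```python
import numpy as np
from scipy import stats

for lam in (2, 50):
    k = np.arange(0, lam + 4 * int(np.sqrt(lam)) + 1)
    pois = stats.poisson.pmf(k, mu=lam)                      # exact count probabilities
    norm_approx = stats.norm.pdf(k, loc=lam, scale=np.sqrt(lam))  # Normal approximation
    print(f"lambda={lam}: max |Poisson - Normal| = {np.max(np.abs(pois - norm_approx)):.4f}")
```
The discrepancy is noticeable at λ = 2 (low counts, skewed) and small at λ = 50.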
Process-based/computational models of behavior
- mathematical equations that link the experimentally observable variables (e.g. stimuli, outcomes, past experiences) to behaviour
- computational models represent different “algorithmic hypotheses” about how behaviour is generated
General framework for fitting/analyzing nearly all models
- Maximum Likelihood (quantify goodness of fit)
- Non-linear optimization (finding the parameters that best fit the data)
- Quantifying uncertainty – likelihood profiles and bootstrapping
- Comparing models: Information Criteria and cross-validation
What do we use instead of OLS (in this course) and why?
Likelihood. Reason: OLS does not work for all types of data (e.g. non-normally distributed or binary data)
What is likelihood
Likelihood is the joint probability (or probability density) of the data given a set of parameter values
* In other words “the probability of the data given the parameter values”
* When errors of data points are independent, the joint probability of the data is the product of the probabilities/probability densities of all observations
PMF AND PDF
PMF: probability mass function – for discrete probability distributions (gives the probability of observations as a function of the parameters)
PDF: probability density function – for continuous probability distributions (gives the probability density of observations as a function of the parameters)
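A two-line sketch of the distinction in Python (assumed example): discrete distributions expose a PMF (a probability), continuous distributions expose a PDF (a density).
```python
from scipy import stats

print(stats.poisson.pmf(3, mu=2.5))         # P(X = 3) for a Poisson count model
print(stats.norm.pdf(0.5, loc=0, scale=1))  # a density (not a probability) at x = 0.5
```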
What is the problem we are trying to solve by optimization methods?
This arises with non-linear models.
General problem: we want to find the parameters that maximize the log-likelihood, but we don't know what the likelihood surface looks like; we can only evaluate the likelihood one parameter combination at a time
2 types of optimization methods
Gradient-based methods (e.g. Newton's method, gradient descent); gradient-free methods (e.g. Nelder-Mead simplex)
Nelder-mead simplex
- Nelder-Mead simplex is an algorithm for searching parameter space to find a minimum
- start by going in what seems to be the best direction by reflecting the high (worst) point in the simplex through the face opposite it;
- if the goodness-of-fit at the new point is better than the best (lowest) other point in the simplex, expand the length of the jump in that direction;
- if this jump was bad (the height at the new point is worse than the second-worst point in the simplex), then try a point that's only half as far out as the initial try;
- if this second try, closer to the original, is also bad, then contract the simplex around the current best (lowest) point
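In practice you rarely code these steps yourself; a minimal sketch using scipy's Nelder-Mead implementation on an assumed Gaussian negative log-likelihood (data and parameterization are illustrative):
```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)

def neg_log_lik(params):
    mu, log_sigma = params            # fit log(sd) so the sd stays positive
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
print(res.x[0], np.exp(res.x[1]))     # estimated mean and sd
```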
How does gradient descent find the parameters that maximize the likelihood?
- Calculate the partial derivatives of the –LL with respect to the parameters
- The vector of partial derivatives of the –LL with respect to the parameters is the gradient, which points in the direction of steepest ascent of the –LL
- We want to minimize the –LL, so we move the parameters a small amount in the opposite direction: θ_new = θ_old − η∇(–LL), where η is a small step size
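A minimal gradient-descent sketch (assumed example, not course code): minimize the –LL of a Gaussian with known sd = 1, so the only parameter is the mean and the gradient has a simple closed form.
```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)

mu = 0.0                           # initial guess
eta = 0.001                        # step size (learning rate)
for _ in range(200):
    grad = -np.sum(data - mu)      # d(-LL)/d(mu) for a Gaussian with sd = 1
    mu = mu - eta * grad           # step opposite to the gradient of -LL
print(mu)                          # converges to the sample mean of the data
```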
How do we avoid local minima
- All optimizers require an initial guess for the parameter values, which is where the search process begins
- Good starting guesses for the initial parameters help avoid local minima (based on the data, previous studies, or biological interpretation)
- Generally, gradient-free methods are more robust to local minima
- Optimize multiple times with different initial parameters (e.g. build a coarse grid of parameter values and run the optimization initialized at all combinations of the gridded parameters), as in the sketch below
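A sketch of that multi-start strategy (the model, grid, and data are assumptions for illustration): run Nelder-Mead from every combination of a coarse grid of starting values and keep the best fit overall.
```python
import itertools
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def neg_log_lik(params, data):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

rng = np.random.default_rng(2)
data = rng.normal(loc=1.0, scale=2.0, size=100)

# coarse grid of starting values for (mu, log_sigma)
starts = itertools.product(np.linspace(-2, 2, 3), np.linspace(-1, 1, 3))
fits = [minimize(neg_log_lik, x0=list(s), args=(data,), method="Nelder-Mead")
        for s in starts]
best = min(fits, key=lambda r: r.fun)   # keep the run with the lowest -LL
print(best.x[0], np.exp(best.x[1]))
```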
Pros and cons of grid search, gradient-based and gradient free optimization methods
Grid search
Pros:
*Easy
*Unlikely to miss the global minimum if the grid is set appropriately
Cons:
*Very slow for high dimensional problems
*Only as precise as the grid
Gradient-based (e.g. Newton’s Method)
Pros:
*Fast: converges to minima much faster
Cons:
*Easily caught in local minima if they exist
Gradient free (e.g. Nelder Mead)
Pros:
*Works well for models of intermediate complexity
*Faster than grid search
Cons:
*Slower to converge than gradient-based methods
*Can still get caught in local minima
What question are we trying to answer with parameter recoverability and what steps does it involve?
If this cognitive process/behaviour works like I think it works, will my experiment provide sufficient information to recover the parameters with the desired precision and without bias?
Steps of Parameter recoverability
1. Use your model and known parameter values to generate a synthetic data set
2. Simulate the experiment you plan to conduct (# of replicates, etc.)
3. Fit your model to the simulated data set
4. Compare the true and fitted parameter values
5. Repeat many times and evaluate the distribution of fitted parameter estimates compared to the true value that generated the dataset (to estimate precision and bias)
- When we are uncertain about the likely range of parameter values, we can do parameter recovery over a range of parameter values to see under what range we get precise, unbiased estimates
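A compact sketch of the recovery loop (assumed example using a simple Gaussian model; the "true" values, trial count, and number of repeats are made up):
```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

true_mu, true_sigma, n_trials = 1.5, 0.8, 50   # known generating values

def neg_log_lik(params, data):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

rng = np.random.default_rng(3)
estimates = []
for _ in range(200):                                       # step 5: repeat many times
    data = rng.normal(true_mu, true_sigma, size=n_trials)  # steps 1-2: simulate the experiment
    fit = minimize(neg_log_lik, x0=[0.0, 0.0], args=(data,),
                   method="Nelder-Mead")                   # step 3: fit the model
    estimates.append([fit.x[0], np.exp(fit.x[1])])

est = np.array(estimates)                                  # step 4: compare to truth
print("mean estimates:", est.mean(axis=0), "true:", [true_mu, true_sigma])
print("sd of estimates (precision):", est.std(axis=0))
```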
Why must we use probability density instead of probability for continous distributions
- If a random variable is continuous, there are an infinite number of values it could take
- The probability of observing any specific value exactly is 0
- We can only define probabilities of observing values that fall within a specific range (e.g. between 0 and 1)
3 continuous distributions other than gaussian (normal)
- Exponential distribution
  - Time to event, or time between events, when events happen at a constant rate
- Weibull distribution
  - Time to event, when the probability of the event increases or decreases with time
- Inverse-Gaussian
  - e.g. the expected first-passage time distribution for drift diffusion with one boundary
Probability density function
- A probability density function describes the relative likelihood that a value of a random variable would be equal to a specific value
- If we draw many random numbers from a continuous probability distribution, the probability density tells us the relative likelihood of drawing values near a specific value
- Example: If the prob density of x = 0 is 0.4, and the prob density of x = 1 is 0.2, we should expect to see twice as many observations near 0 compared to near 1.
- Integrates to 1
- Probability of observing a value of x between two bounds is equivalent to the area under the curve between the two bounds
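A one-line sketch of that last point (assumed example): for a continuous distribution, the probability of falling between two bounds is the area under the PDF, which is a difference of CDF values.
```python
from scipy import stats

p = stats.norm.cdf(1) - stats.norm.cdf(0)   # P(0 < X < 1) for a standard Normal
print(p)                                    # about 0.341
```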
What type of distribution is usually used to model RTs?
- Non-normal continuous probability distributions are often used to model reaction times, which tend to be positively skewed
- These distributions generally have multiple parameters that influence both the mean and the shape of the distribution
- We can model how the parameters differ across treatments and groups
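A small sketch of fitting one such skewed distribution to RTs (assumed example: synthetic Wald/inverse-Gaussian RTs and scipy's built-in maximum-likelihood fit; not course code):
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
rts = rng.wald(mean=0.6, scale=2.0, size=500)     # synthetic, positively skewed RTs

# Fit an inverse-Gaussian; fix loc=0 so all mass stays on positive RTs
shape, loc, scale = stats.invgauss.fit(rts, floc=0)
print(shape, scale)                               # parameters controlling mean and skew
```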
Standard error and CIs (in normal distribution)
- Standard error – because the sampling distribution is normally distributed, we can quantify the shape of the sampling distribution by the estimated value of the parameter and the standard deviation of the sampling distribution, which we call the standard error
- Confidence intervals – we can use standard errors to construct confidence intervals. By definition, we expect the true parameter value to fall within the 95% confidence interval 95% of the time
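A tiny sketch of building a 95% CI from an estimate and its standard error under the Normal approximation (the estimate and SE are made-up numbers):
```python
estimate, se = 2.3, 0.4                           # assumed parameter estimate and SE
ci = (estimate - 1.96 * se, estimate + 1.96 * se)
print(ci)                                         # roughly (1.52, 3.08)
```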