# Data Science General Flashcards

## Anything useful I pick up

What is a CHAID model?

Stands for ‘Chi-square Automatic Interaction Detector’.

CHAID is an automated tool used to discover the relationship between variables.

It is a technique created by Gordon V. Kass in 1980.

CHAID analysis builds a predictive model, or tree, to help determine how variables best merge to explain the outcome in the given dependent variable.

The resulting tree is an understandable set of rules, whereas a neural network is opaque, with weights that have no intuitive meaning.
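
The core step can be sketched roughly as follows: cross-tabulate each candidate predictor against the outcome, compute the Pearson chi-square statistic, and split on the predictor with the strongest association. This is only a minimal sketch on made-up data. The column names are hypothetical, and full CHAID also merges similar categories and compares adjusted p-values, which is left out here. Both predictors have two levels, so their statistics can be compared directly.

```python
from collections import Counter

def chi_square(rows, predictor, outcome):
    """Pearson chi-square statistic for the predictor x outcome contingency table."""
    n = len(rows)
    cell = Counter((r[predictor], r[outcome]) for r in rows)
    row_total = Counter(r[predictor] for r in rows)
    col_total = Counter(r[outcome] for r in rows)
    stat = 0.0
    for a in row_total:
        for b in col_total:
            expected = row_total[a] * col_total[b] / n
            stat += (cell[(a, b)] - expected) ** 2 / expected
    return stat

def make_rows(owns_home, gender, responded, count):
    """Build `count` identical hypothetical survey records."""
    return [{"owns_home": owns_home, "gender": gender, "responded": responded}] * count

# Hypothetical toy data: response depends on home ownership, not on gender.
rows = (
    make_rows("yes", "f", "yes", 30) + make_rows("yes", "m", "yes", 30)
    + make_rows("yes", "f", "no", 10) + make_rows("yes", "m", "no", 10)
    + make_rows("no", "f", "yes", 10) + make_rows("no", "m", "yes", 10)
    + make_rows("no", "f", "no", 30) + make_rows("no", "m", "no", 30)
)

scores = {p: chi_square(rows, p, "responded") for p in ("owns_home", "gender")}
best_split = max(scores, key=scores.get)  # predictor most strongly associated with the outcome
print(scores, best_split)
```

The same scoring would then be repeated inside each branch to grow the tree further.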

What is OLAP, and what is it used for?

Online analytical processing.

OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modelling. Imagine a Rubik's cube with products on the vertical axis, sales regions on the horizontal axis, and sales quarters on the depth axis: each individual small cube then holds the figures for one product in one region in one quarter.

OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing, and data mining.
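
The cube picture can be sketched in a few lines, assuming a tiny in-memory cube rather than a real OLAP server. The sales figures and the helper names `slice_cube` and `roll_up` are made up for illustration.

```python
from collections import defaultdict

DIMENSIONS = ("product", "region", "quarter")

# Hypothetical sales figures; each key addresses one small cube (cell).
cube = {
    ("widget", "north", "Q1"): 100,
    ("widget", "north", "Q2"): 120,
    ("widget", "south", "Q1"): 80,
    ("gadget", "north", "Q1"): 50,
    ("gadget", "south", "Q2"): 70,
}

def slice_cube(cube, dimension, value):
    """Keep only the cells where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return {key: v for key, v in cube.items() if key[i] == value}

def roll_up(cube, dimension):
    """Total the measure for each value of one dimension, summing over the others."""
    i = DIMENSIONS.index(dimension)
    totals = defaultdict(int)
    for key, v in cube.items():
        totals[key[i]] += v
    return dict(totals)

sales_by_product = roll_up(cube, "product")
q1_by_region = roll_up(slice_cube(cube, "quarter", "Q1"), "region")
print(sales_by_product, q1_by_region)
```

Slicing fixes one axis of the cube, and rolling up collapses the axes you are not interested in.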

What is data binning?

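Also called discrete binning or bucketing, it is a data pre-processing technique used to reduce the effects of minor observation errors: values that fall within a small interval (a bin) are replaced by a value representative of that interval.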

It is a form of quantization.

What is ‘Statistical data binning’?

For example, if you have data about a group of people, you might arrange their ages into a smaller number of age intervals (say, grouping every five years together).

It can also be used in multivariate statistics, binning in several dimensions at once.
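
The age example can be sketched like this, assuming fixed-width five-year bins. The ages and the helper name `bin_label` are made up for illustration.

```python
from collections import Counter

def bin_label(age, width=5):
    """Return the label of the fixed-width interval that contains `age`."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

ages = [21, 23, 24, 25, 29, 31, 34, 35, 35, 42]  # hypothetical sample
counts = Counter(bin_label(a) for a in ages)

for label, count in sorted(counts.items()):
    print(label, count)
```

For binning in several dimensions at once, the key would become a tuple of labels, for example one for age and one for income.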

What does ‘statistically significant’ mean?

The result is unlikely to be due to chance alone.

It DOESN’T imply that the result is large, important, or significant to us personally.
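
Both points can be illustrated with coin flips. This is a sketch that assumes the null hypothesis is a fair coin and uses the conventional 0.05 cut-off, which is a convention and not something fixed by the definition. The function name and the flip counts are made up.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """Exact two-sided binomial p-value, assuming the null hypothesis of a fair coin."""
    k = max(heads, flips - heads)  # the more extreme side; a fair coin is symmetric
    term = comb(flips, k)          # number of sequences with exactly k heads
    upper_tail = 0
    for i in range(k, flips + 1):
        upper_tail += term
        term = term * (flips - i) // (i + 1)  # C(n, i+1) computed from C(n, i)
    return min(1.0, 2 * upper_tail / 2 ** flips)

p_small = two_sided_p_value(60, 100)       # 60% heads, but only 100 flips
p_large = two_sided_p_value(5100, 10000)   # 51% heads over 10,000 flips

print(f"60/100 heads:      p = {p_small:.4f}")
print(f"5100/10000 heads:  p = {p_large:.4f}")
```

The 60% result falls just short of the 0.05 cut-off, while the 51% result passes it because the sample is so large. A bias of one percentage point is statistically significant here without necessarily mattering in practice.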

Standard Deviation vs Standard Error. Describe.

The standard deviation of the sample is the degree to which individuals within the sample differ from the sample mean.

Whereas:

The standard error of the sample mean is an estimate of how far the sample mean is likely to be from the population mean.
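
A quick sketch of the two quantities on a made-up sample, using the usual estimate of the standard error as the sample standard deviation divided by the square root of the sample size:

```python
import math
import statistics

sample = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0, 9.0, 6.0]  # hypothetical measurements

sd = statistics.stdev(sample)        # spread of individual values around the sample mean
se = sd / math.sqrt(len(sample))     # uncertainty of the sample mean itself

print(f"mean = {statistics.mean(sample)}, sd = {sd:.3f}, se = {se:.3f}")
```

Collecting more data leaves the standard deviation roughly unchanged but shrinks the standard error, because the mean is pinned down more precisely.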

95% confidence level means what?

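A 95% confidence interval is a range of values produced by a procedure that, over many repeated samples, captures the true population mean 95% of the time. Loosely, it is a range that you “can be 95% certain contains the true mean of the population.”

Strictly, it is not correct to say that “there is a 95% chance that the population mean lies within the interval”: the population mean is fixed, so any one computed interval either contains it or does not.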

It is not the same as a range that contains 95% of the values.
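
One way to see what the 95% refers to is to simulate repeated sampling. This is a sketch under stated assumptions: a made-up normal population whose true mean is known only because it is simulated, and the normal-approximation interval (mean ± 1.96 standard errors), which covers slightly less than 95% at this sample size.

```python
import math
import random
import statistics

rng = random.Random(0)            # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 50.0, 10.0   # hypothetical population
SAMPLE_SIZE, TRIALS = 40, 2000

hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    lower, upper = mean - 1.96 * se, mean + 1.96 * se  # normal-approximation interval
    if lower <= TRUE_MEAN <= upper:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

Each individual interval either contains the true mean or misses it. The 95% describes how often the procedure succeeds across samples.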

What is the Difference Between a Statistic and a Parameter?

A statistic describes a sample, whereas a parameter describes an entire population.

A statistic and a parameter are very similar. They are both descriptions of groups, like “50% of dog owners prefer X Brand dog food.”

For example, you randomly poll voters in an election. You find that 55% of the people you polled plan to vote for candidate A. That is a statistic. Why? You only asked a sample (a small percentage) of the population who they are voting for. You estimated what the population was likely to do based on the sample.

You could ask a class of third graders who likes vanilla ice cream. 90% raise their hands. You have a parameter: 90% of that class likes vanilla ice cream. You know this because you asked everyone in the class.
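
The distinction can be sketched with a simulated population, where the parameter is knowable only because the whole population is generated in code. The numbers are made up: 1 means a voter plans to vote for candidate A.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 voters, 53% of whom favour candidate A.
population = [1] * 5300 + [0] * 4700

parameter = statistics.mean(population)   # describes everyone
sample = rng.sample(population, 200)      # poll 200 voters at random
statistic = statistics.mean(sample)       # describes only the people polled

print(f"parameter = {parameter:.3f}, statistic = {statistic:.3f}")
```

A different random sample would give a slightly different statistic, while the parameter stays the same.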