Ipres 3 Flashcards
(61 cards)
Descriptive statistics
is a set of methods used to describe data and their characteristics. For example, if you were investigating the number of visitors to a beach
in August (nice job if you can get it!), you might draw a graph to see how the
number of visitors varied each day, work out how many people visit on an average
day and calculate the proportion of visitors who were male/female or children/
adults. These would all be descriptive statistics.
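As a rough illustration of the beach example, the sketch below works out the visitors on an average day and the proportion of children. All of the numbers are made up for the example and are not real survey data.

```python
from statistics import mean

# Hypothetical daily visitor counts for one week (made-up illustration data).
daily_visitors = [120, 150, 90, 200, 180, 310, 280]

# Hypothetical split of those same visitors into adults and children.
adults = 900
children = 430

# Describe the data: the typical day and the share of visitors who are children.
average_per_day = mean(daily_visitors)
proportion_children = children / (adults + children)

print(f"Average visitors per day: {average_per_day:.1f}")
print(f"Proportion of children: {proportion_children:.2f}")
```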
Inferential statistics
involves using what we know to make inferences (estimates
or predictions) about what we don’t know. For example, if we asked 200 people
who they were going to vote for on the day before a local election we could try to
predict which party would win the election. Or if we asked 50 injecting drug users
whether they share injecting equipment such as needles with other users, we could
try to estimate the proportion of all injecting drug users who share equipment.
Continuous variables
include things like ‘weights of newborn babies’, ‘distance travelled to work by those in full-time employment’, or ‘percentage of children living in lone-parent families’. Such variables are measured in numbers, and an observation may take any value on a continuous scale. For example, distance travelled to work could take a value of 0 miles for people working at home, 1.6 miles,
4.8 miles, or any other value up to 100 miles or more for those commuting long
distances. Similarly, any variable measured as a percentage can take a value of 0%,
100% or anything in between. For continuous variables the standard rules of arithmetic apply, so it makes sense to say that if you commute for 4 miles then that is
twice the commute of someone who commutes 2 miles.
Discrete/categorical variables
are not measured on a continuous
numerical scale. Examples of discrete variables are:
- sex: female/male
- religion: Buddhist/Christian/Hindu/Jewish/Muslim/Sikh/other
- degree subject studied: Politics/Sociology/Social Work
and so on. Such variables have no numeric value. We may assign them a number,
for example, Politics = 1, Sociology = 2, Social Work = 3 and so on, but the actual
numbers do not mean anything. For example, Social Work is not three times greater
than Politics!
types of categorical variables
Nominal variables
are variables that have two or more categories, but which do not have an intrinsic order or inherent numerical quality in themselves. So nominal variables include things like ‘marital status’, ‘ethnicity’ or ‘location’, in addition to the variables already identified including ‘religion’ and ‘degree studied’. In some cases there may be many attributes to a nominal variable. For instance, if we were classifying where people live in the USA by state there would be 50 attributes (or states).
types of categorical variables
Dichotomous variables
are nominal variables which have only two categories or levels. For example, if we were looking at gender, we would generally categorise somebody as ‘male’ or ‘female’. This is also a nominal variable. A further example would be if we asked somebody if they had ever smoked, giving them the possible answers of ‘yes’ or ‘no’.
types of categorical variables
Ordinal variables
have two or more categories, like nominal variables, but the categories
can also be ordered or ranked moving from greater to smaller values (or vice versa). So an opinion poll might ask how likely you were to vote for a particular party at the next election with the possible options ‘very likely’, ‘likely’, ‘not sure’, ‘unlikely’ or ‘very unlikely’. While the responses are in a particular order we cannot (or should not!) place a ‘value’ on them. This is despite our own views about which we prefer! Another example would be if you asked someone to provide an answer to the statement ‘I generally eat healthily’ and had a similar scale from ‘strongly agree’ to ‘strongly disagree’. It is difficult to say whether the distance between the two categories is equal as it will depend upon individual perception. So someone who has their ‘five-a-day’ and five chocolate bars too may consider themselves to eat healthily, whereas another person may consider that eating two chocolate bars in addition to lots of fruit and vegetables means they do not eat healthily. As a result, the distance between the points on the scale is not clear and continuous.
continuous variables
difference between interval and ratio variables
An interval variable has an arbitrary zero point (e.g. temperature on the Celsius scale); a ratio variable has a true zero (e.g. age, weight), so ratios between values are meaningful.
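A small sketch of why the zero point matters. Because 0 on the Celsius scale is arbitrary, dividing one Celsius temperature by another gives a misleading ratio. The Kelvin conversion is brought in here purely as an illustration of a scale with a true zero; it is not part of the original card.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert Celsius (arbitrary zero) to Kelvin (true zero)."""
    return celsius + 273.15

cool, warm = 10.0, 20.0

# On the interval (Celsius) scale the ratio looks like "twice as hot".
naive_ratio = warm / cool

# On a ratio scale (Kelvin) the same two temperatures differ by only a few per cent.
true_ratio = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print(f"Celsius ratio: {naive_ratio:.2f}")
print(f"Kelvin ratio: {true_ratio:.3f}")
```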
Item non-response
A specific type of missing data that occurs when a respondent does not provide a valid answer to a question, which can reduce sample size and introduce bias.
Selection bias
A type of bias that arises when missing data is not random, meaning that certain groups of people or opinions are systematically underrepresented.
Listwise deletion
A method of handling missing data by removing all cases with missing values from the analysis, which can reduce sample size and potentially introduce bias.
Imagine you’re playing a board game with your friends, but one of them doesn’t answer an important question. Instead of guessing their answer, you decide to remove them from the game completely. That’s what listwise deletion does—it removes any data that has a missing answer, even if most of it is still useful.
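A minimal sketch of listwise deletion, assuming each respondent is stored as a dictionary and a missing answer is recorded as `None`. The data and the helper name `listwise_delete` are hypothetical.

```python
# Hypothetical survey responses; None marks a missing answer.
responses = [
    {"id": 1, "age": 34, "income": 28000, "smoker": "no"},
    {"id": 2, "age": 51, "income": None, "smoker": "yes"},
    {"id": 3, "age": None, "income": 41000, "smoker": "no"},
    {"id": 4, "age": 27, "income": 35000, "smoker": "yes"},
]

def listwise_delete(rows):
    """Keep only the cases that have no missing values at all."""
    return [row for row in rows if all(value is not None for value in row.values())]

complete_cases = listwise_delete(responses)

# Respondents 2 and 3 are dropped entirely, even though most of their answers are usable.
print([row["id"] for row in complete_cases])
```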
Imputation
A technique for handling missing data by predicting the missing responses using other available information from the questionnaire.
Let’s say you’re reading a storybook, but one page is missing. Instead of skipping that part, you try to guess what happened based on the rest of the story. That’s what imputation does—it fills in missing answers by using the information that’s already there.
Mean imputation
A basic form of imputation that replaces missing values with the midpoint or mean, which is only reliable if missing data is random.
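A minimal sketch of mean imputation, assuming missing values are recorded as `None`. The incomes are made up for illustration.

```python
from statistics import mean

# Hypothetical incomes; None marks a missing answer.
incomes = [28000, None, 40000, 34000, None]

# The mean is calculated from the observed answers only.
observed = [value for value in incomes if value is not None]
fill_value = mean(observed)

# Every missing answer is replaced with that same mean.
imputed = [fill_value if value is None else value for value in incomes]

print(imputed)
```

Note that every gap receives the same value, which is why this approach understates the spread of the data.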
Regression imputation
A more advanced method that uses regression techniques to estimate missing values based on other answers from the respondent, making it more statistically sound but requiring good predictors.
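A rough sketch of regression imputation with a single predictor, assuming a straight-line relationship fitted by ordinary least squares. The variables (years of education predicting income) and the values are hypothetical, and the data are deliberately perfectly linear so the result is easy to check by hand.

```python
# Hypothetical respondents: (years_of_education, income); None marks a missing income.
rows = [(10, 20000), (12, 24000), (14, 28000), (16, None), (18, 36000)]

# Fit the line using only respondents who answered both questions.
complete = [(x, y) for x, y in rows if y is not None]
n = len(complete)
mean_x = sum(x for x, _ in complete) / n
mean_y = sum(y for _, y in complete) / n

# Ordinary least squares slope and intercept for one predictor.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in complete)
    / sum((x - mean_x) ** 2 for x, _ in complete)
)
intercept = mean_y - slope * mean_x

# Replace each missing income with the value predicted from education.
imputed_rows = [
    (x, y if y is not None else intercept + slope * x) for x, y in rows
]

print(imputed_rows)
```

Unlike mean imputation, the filled-in value depends on the respondent's other answers, so it is only as good as the predictor used.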
Aggregate data
means collecting and combining lots of individual pieces of information to look at overall patterns instead of specific details.
Imagine a jar full of different-colored candies. Instead of counting each color one by one, you just say, “There are 100 candies in total, and about 40% are red.” That’s aggregation—you’re summarizing the data instead of focusing on each tiny piece.
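The candy jar can be sketched in a few lines. The colours other than red, and their counts, are made up to complete the example.

```python
from collections import Counter

# Hypothetical jar: one entry per individual candy.
candies = ["red"] * 40 + ["blue"] * 35 + ["green"] * 25

# Aggregate: collapse the individual candies into counts per colour.
counts = Counter(candies)
total = sum(counts.values())
share_red = counts["red"] / total

print(f"{total} candies in total, {share_red:.0%} are red")
```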
Likert scale (ordinal, often treated as interval data)
A Likert scale is a rating scale used in surveys to measure people’s opinions, attitudes, or behaviors. It typically consists of a series of statements with response options ranging from strongly agree to strongly disagree (or similar). It helps quantify subjective feelings in a structured way.
What is skewness?
Skewness is about how the data in a graph “leans” or is “stretched” to one side. Imagine stacking blocks from smallest to biggest, and then a few really big or really small blocks change the shape.
Positively skewed (right-skewed)
- Most data is small and clumped on the left side of the graph.
- A few really big values stretch the graph to the right.
- The mean (average) is pulled up by the big numbers.
- The median (middle value) stays closer to the center of the data.
👉 Result:
Mean > Median
📌 Why?
Because the mean adds everything up and divides, those few very large values make the total bigger, so the average increases.
Example:
Imagine these numbers: 2, 3, 4, 5, 30
Mean = (2+3+4+5+30)/5 = 8.8
Median = 4
→ Mean is higher than median.
Negatively skewed (left-skewed)
- Most data is large and clumped on the right side of the graph.
- A few really small values stretch the graph to the left.
- The mean gets pulled down by the small numbers.
- The median stays closer to the middle of the data.
👉 Result:
Mean < Median
📌 Why?
Because the mean is dragged down by those very small numbers, even if most values are higher.
Example:
Numbers: 1, 20, 21, 22, 23
Mean = (1+20+21+22+23)/5 = 17.4
Median = 21
→ Mean is lower than median.
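The two worked examples (positively and negatively skewed) can be checked with Python's built-in `statistics` module:

```python
from statistics import mean, median

# The two example datasets from the skewness cards.
right_skewed = [2, 3, 4, 5, 30]     # one very large value
left_skewed = [1, 20, 21, 22, 23]   # one very small value

right_mean, right_median = mean(right_skewed), median(right_skewed)
left_mean, left_median = mean(left_skewed), median(left_skewed)

# Positive skew pulls the mean above the median; negative skew pulls it below.
print(f"Right-skewed: mean={right_mean}, median={right_median}")
print(f"Left-skewed:  mean={left_mean}, median={left_median}")
```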
Guidelines for choosing a measure of the average
- Modes are not used very often, though they can be useful in certain circumstances. Avoid them in general or use them along with other measures!
- The median is more intelligible to the general public because it is the ‘middle’ observation.
- The mean uses all the data, but the median does not. Therefore the mean is ‘influenced’ more by unusual or extreme data. If the data are particularly subject to error, use the median.
- The mean is more useful when the distribution is symmetrical or normal. The median is more useful when the distribution is positively or negatively skewed, because outliers do not affect the median.
- The mean is best for minimising sampling variability. If we take repeated samples from a population, each sample will give us a slightly different mean and median. However, the means will vary less than the medians. (For example, if we took 10 different samples of 20 students and calculated the average weight for each of the 10 groups, the mean weights would differ less between the 10 groups than the median weights.)
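The last guideline can be illustrated with a small simulation. This is only a sketch under stated assumptions: weights are drawn from a normal distribution with a made-up mean of 70 and standard deviation of 10, and many more than 10 samples are taken so that the pattern shows up reliably.

```python
import random
from statistics import mean, median, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

sample_means = []
sample_medians = []
for _ in range(2000):
    # One hypothetical sample of 20 student weights (kg).
    sample = [rng.gauss(70, 10) for _ in range(20)]
    sample_means.append(mean(sample))
    sample_medians.append(median(sample))

# How much each kind of average varies from sample to sample.
spread_of_means = stdev(sample_means)
spread_of_medians = stdev(sample_medians)

print(f"Spread of sample means:   {spread_of_means:.2f}")
print(f"Spread of sample medians: {spread_of_medians:.2f}")
```

For roughly normal data like this, the sample means should vary less than the sample medians. The result can differ for heavily skewed or outlier-prone data.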
other ways of dividing a distribution
- deciles (which divide a distribution into ten equal parts)
- quintiles (which divide a distribution into five equal parts)
- quartiles (which divide a distribution into four equal parts)
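A short sketch of these divisions using `statistics.quantiles` from the Python standard library (Python 3.8 or later). The exam marks are made up. Dividing a distribution into n equal parts needs n - 1 cut points.

```python
from statistics import quantiles

# Hypothetical exam marks.
marks = [35, 41, 47, 52, 55, 58, 60, 63, 67, 71, 74, 78, 82, 85, 90]

quartile_cuts = quantiles(marks, n=4)    # 3 cut points -> 4 equal parts
quintile_cuts = quantiles(marks, n=5)    # 4 cut points -> 5 equal parts
decile_cuts = quantiles(marks, n=10)     # 9 cut points -> 10 equal parts

print("Quartiles:", quartile_cuts)
print("Quintiles:", quintile_cuts)
print("Deciles:  ", decile_cuts)
```

Different textbooks and software packages interpolate between observations slightly differently, so the exact cut points may not match a hand calculation.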
Percentiles
Percentiles tell you how a value compares to the rest of the data by showing the percentage of values that fall below it. If you’re in the 90th percentile on a test, that means 90% of the class scored lower than you — and only 10% scored higher.
Suppose we had a set of exam marks and we knew that the 30th percentile was a mark of 52. This means that 30% of people who sat the exam got a mark of 52 or lower.
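The exam example can be sketched as follows. The marks are made up so that a mark of 52 sits at the 30th percentile, and the helper name `percent_at_or_below` is hypothetical.

```python
def percent_at_or_below(values, mark):
    """Percentage of observations that are less than or equal to `mark`."""
    return 100 * sum(value <= mark for value in values) / len(values)

# Hypothetical marks for ten students.
marks = [35, 41, 52, 58, 60, 63, 67, 71, 78, 85]

share = percent_at_or_below(marks, 52)

# Three of the ten students scored 52 or lower.
print(f"{share:.0f}% of students scored 52 or lower")
```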
Range
The difference between the highest and lowest values in a dataset, used to measure the spread of data.
Inter-Quartile Range (IQR)
The difference between the upper quartile (Q3) and lower quartile (Q1) in a dataset, representing the range of the middle 50% of the data.
To work it out we would position the observations in order from lowest to highest and concentrate on the middle 50 per cent of the distribution. So the inter-quartile range is the range of the middle half of the observations, the difference between the upper quartile and the lower quartile.
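A minimal sketch comparing the range with the inter-quartile range, using made-up values. The quartiles come from `statistics.quantiles`, whose default interpolation method is one of several conventions, so a hand calculation may give slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical observations, already ordered from lowest to highest.
data = [2, 4, 4, 5, 7, 9, 10, 12]

# Range: highest value minus lowest value.
data_range = max(data) - min(data)

# IQR: upper quartile (Q3) minus lower quartile (Q1).
q1, _q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

print(f"Range: {data_range}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
```

Because the IQR ignores the lowest and highest quarters of the data, it is less affected by extreme values than the range.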