Week 3 - Data Manipulation - Recode, Compute, and Count Flashcards

(53 cards)

1
Q

In SPSS what are operators?

A

These are written (words) or symbolic (symbols) commands that make SPSS perform a certain job/operation.

2
Q

What are the logical operators?

A

– OR / |
Do something if EITHER condition is met

– AND / &
Do something if both conditions are met

3
Q

What are assignment/testing operators?

A
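Relational (comparison) operators: do SOMETHING if a variable is equal to a value (= or EQ).
Others include:
- not equal to (~= or NE)
- less than (< or LT)
- greater than (> or GT)
- less than or equal to (<= or LE)
- greater than or equal to (>= or GE)
- not (~ or NOT)
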
4
Q

What are standard arithmetic operators?

A

Mathematical symbols/operations e.g., +, -, *, /, **

- for exponentiation we could say X = Y**2 (i.e., Y squared)

5
Q

In SPSS, what is a function?

A

Consists of a word (which SPSS recognises) that carries out some operation on the variables or values that follow the function name.

e.g., SELECT IF (sex=1 & age>=30)

6
Q

How is the select function written in SPSS Syntax?

A

SELECT IF (sex=1 & age>=30).

or alternatively SELECT IF (sex eq 1 and age ge 30).
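
The selection logic can also be sketched outside SPSS. The following is a minimal Python illustration (not SPSS syntax) of what the condition keeps; the cases are made up, and only the variable names sex and age come from the card.

```python
# Hypothetical cases: the data are invented for illustration only.
cases = [
    {"sex": 1, "age": 25},
    {"sex": 1, "age": 30},
    {"sex": 2, "age": 45},
    {"sex": 1, "age": 52},
]

# Counterpart of SELECT IF (sex = 1 & age >= 30):
# a case is kept only if BOTH conditions are true.
selected = [c for c in cases if c["sex"] == 1 and c["age"] >= 30]

print(selected)
```

Only the second and fourth cases survive; the 45-year-old is dropped because the sex condition fails even though the age condition holds.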

7
Q

What does the function ‘compute rv.normal(0,1)’ do?

A

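Draws a random value for each case from a normal distribution with a mean of zero and an SD of 1. In full syntax the result is assigned to a new variable, e.g., compute x = rv.normal(0,1). (the name x is only illustrative)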

8
Q

What is the difference between system missing and user missing?

A

System missing is automatically assigned by SPSS whenever a cell of a numeric variable is blank.

User missing is when we tell SPSS to treat a value that might otherwise be a normal value (e.g., 9) AS IF it were missing, even though a value is actually stored in the data.

9
Q

What does writing ‘temporary.’ before your syntax do?

e.g.,
temporary.
select if (sex = 2 and age >=30).

A

This makes a temporary selection or change that lasts only for the next procedure (e.g., freq age); once that procedure has run, the full data set is available again.

10
Q

How can you tell that cases have been excluded in the SPSS spreadsheet (Data View)?

A

An oblique (diagonal) line crosses out the row number of each case that has been excluded.

11
Q

Using syntax, how would you select cases that are NOT MISSING on a variable?

A

Select if not(missing(variablename)).

OR

Select if ~missing(variablename).

12
Q

How can sort cases and split file be used to perform the same job as the following syntax?

temporary.
select if (sex = 1).
freq age.

temporary.
select if (sex = 2).
freq age.

A

sort cases by sex. (this orders the data by sex)
split file by sex. (this splits the data into one group per value of sex)
freq age. (this produces the frequencies of age separately for each group)
split file off. (this removes the split)

13
Q

What is the command ‘select if’ used for?

A

To select the cases on which you want to perform some analyses (either inferential or descriptive), i.e., to select a subset of your data and exclude the rest from your operations.

14
Q

What can the recode command be used for?

A
  1. Permanently creating new variables (based on available variables)
    - e.g., grouping values of a numeric variable, such as calculating a total score based on answers to a questionnaire.
  2. Altering variables
    - e.g., breaking a continuous variable (e.g., age) into categories
    - splitting the variable/data into two (or more) groups, e.g., by median
  3. Finding missing values, and excluding or removing them when making a dummy variable
  4. Other - using recode to reverse a scale or using conditional recoding.

15
Q

How would you use syntax to create a variable which takes the continuous variable of age (of people aged between 0-100) and makes it into a single categorical variable with 4 categories?

A

You would use RECODE

e.g., recode age (lo thru 25=1) (26 thru 30 = 2) (31 thru 40 =3) (41 thru hi=4) into agecat.

This can also be done through the Transform menu in SPSS point-and-click.

16
Q

A researcher wants to recode age from a continuous variable into a categorical variable with 4 age brackets, so they write the following syntax:

recode age (lo thru 25=1) (26 thru 30 = 2) (31 thru 40 =3) (41 thru hi=4) into agecat.

HOWEVER, one of the cases has an age of 25.9 and has not been included in a category; how could this issue be fixed?

Explain why you would use the new syntax.

A

The problem is that, in the original syntax, 25.9 and other values that fall between categories are not included in any group. The syntax would need to be rewritten as:

Recode age (lo thru 25=1) (25 thru 30=2) (30 thru 40=3) (40 thru hi=4) into agerec.

By overlapping values which are recoded we can make sure that nothing slips through the cracks. Note that once a value has been recoded it is not recoded again.

17
Q
Is:
Recode age (lo thru 25=1) (25 thru 30 =2) (30 thru 40=3) (40 thru hi = 4) into agerec

The same as:

Recode age (lo thru 25=1) (lo thru 30 =2) (lo thru 40=3) (lo thru hi = 4) into agerec

Why?

A

YES!!

Because each case is assigned to a category only once. Once a value has been recoded it is not recoded again, and the value specifications are applied from left to right.
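
The "first match wins" rule is easier to see in a small simulation. The following is a Python sketch (not SPSS syntax) that mimics how the recode specifications on these cards are applied; the function name recode, the rule lists and the sample ages are all made up for illustration, and None stands in for a value that ends up without a category.

```python
import math

LO, HI = -math.inf, math.inf  # stand-ins for the keywords lo and hi


def recode(value, rules):
    """Return the code of the first (low, high, code) rule that contains value.

    Ranges are inclusive and are tried from left to right, so a value that
    has already matched an earlier rule is never recoded again.
    A value that matches no rule gets None (no category).
    """
    for low, high, code in rules:
        if low <= value <= high:
            return code
    return None


gapped = [(LO, 25, 1), (26, 30, 2), (31, 40, 3), (41, HI, 4)]
overlapping = [(LO, 25, 1), (25, 30, 2), (30, 40, 3), (40, HI, 4)]
all_from_lo = [(LO, 25, 1), (LO, 30, 2), (LO, 40, 3), (LO, HI, 4)]

ages = [18, 25, 25.9, 30, 30.5, 40, 41, 77]

print([recode(a, gapped) for a in ages])       # 25.9 and 30.5 fall in the gaps
print([recode(a, overlapping) for a in ages])
print([recode(a, all_from_lo) for a in ages])  # same result as the overlapping rules
```

With the gapped rules, 25.9 and 30.5 receive no category, whereas the overlapping rules and the "lo thru ..." rules give identical results because each value stops at the first range it fits.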

18
Q

Describe what the following syntax would do:

temporary.
recode salary (1=2) (6 thru hi=5).
add value labels salary 2 'up to $30k' 5 '>50k'.
freq salary.

A

This creates a TEMPORARY RECODE (it does not alter the file; it just creates temporary categories for the frequencies procedure).

It merges the extreme salary brackets into their neighbours (1 into 2; 6 and above into 5), adds descriptive labels to the two merged categories, and then gets the frequencies.

19
Q

Describe how you would use recode (syntax) to perform a median split (e.g., for age)

A

First you would need to find the median:

freq age /statistics=median. (imagine the median is 30)

Then you would perform a recode based on this information:

recode age (lo thru 30=1) (30 thru hi = 2) into agemed.
freq agemed.

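
The two steps (find the median, then split on it) can be sketched in Python as a rough illustration; this is not SPSS syntax, the ages are invented, and median_split is a hypothetical helper name.

```python
import statistics

# Made-up ages; only the variable names age and agemed come from the card.
ages = [19, 22, 25, 30, 30, 34, 41, 58]

# Step 1: find the median (the counterpart of freq age /statistics=median).
median_age = statistics.median(ages)


def median_split(age, cut):
    # (lo thru cut = 1) is tested first, so a case exactly at the median
    # lands in group 1; everything above the median goes to group 2.
    return 1 if age <= cut else 2


# Step 2: recode into the two groups.
agemed = [median_split(a, median_age) for a in ages]
print(median_age, agemed)
```

Note that cases tied at the median all go to the lower group, so the two groups are not necessarily of equal size.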
20
Q

What is visual binning?

A

Visual binning can be found in the Transform menu. It shows a visual diagram of the distribution of a variable (e.g., age) and gives you a number of options for dividing up the variable/data,

e.g., equal percentiles based on scanned cases (equal numbers of cases in each group), or
cutpoints at the mean and selected SDs.

21
Q

Why might you want to make data missing?

A

You may want to make values that can’t be used in a meaningful way ‘missing’ e.g., the few outliers with income in the lowest/highest bracket.

22
Q

What function might you use to exclude (i.e., make missing) data on a variable? (e.g., outliers in low and high salary bracket)

A

Recode

e.g., 
recode salary (1, 6 thru hi=sysmis).   [this will make income categories 1 and 6 (and above) system missing]

This can also be performed through the 'Transform' menu in point and click.

(The temporary command can be used to make the recode TEMPORARY)

23
Q

What function might you use to LOCATE data that is missing?

A

Crosstabs and recode, to find where the missing values are.

24
Q

What is crosstabulation?

A

A process or function that combines and/or summarises data from one or more sources into a concise format for analysis or reporting. Crosstabs display the joint distribution of two or more variables and they are usually represented in the form of a contingency table in a matrix.

It tells you how much data is missing but NOT WHERE it is missing from.

25
Q

Using recode, how can you locate where missing values are in your data?

Imagine you are a researcher who wants to compare salary bracket (variable = salary) by job level (variable = joblev), i.e., how many workers fell in each bracket of salary, how many junior managers fell in each bracket of salary, and so on. You find that you have 3 missing values somewhere in salary, but you don't know which job level the people with these missing values are from.

A

First you would need to RECODE the missing values on salary into a valid value:

recode salary (missing = -99).

Then you can use crosstabs:

crosstabs salary by joblev.

The missing values would then appear in the crosstab table and show you which categories of joblev the missing data is from. (Crosstabs displays a table that shows how many people at each job level fall in each salary bracket.)

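
As a rough illustration of why the recode helps, here is a Python sketch (not SPSS syntax) with invented data: once missing salaries carry the valid code -99, they show up as their own row in the cross-tabulation and can be broken down by job level.

```python
from collections import Counter

# Made-up (salary bracket, job level) pairs; None marks a missing salary.
cases = [(2, 1), (None, 1), (3, 2), (None, 3), (4, 3), (None, 1), (2, 1)]

# Counterpart of recode salary (missing = -99): give missing a valid code.
recoded = [(-99 if salary is None else salary, joblev) for salary, joblev in cases]

# Counterpart of crosstabs salary by joblev: count each combination.
crosstab = Counter(recoded)

# Read off the -99 row: how many missing salaries per job level?
missing_by_joblev = {lev: n for (sal, lev), n in crosstab.items() if sal == -99}
print(missing_by_joblev)
```

In this invented data set two of the three missing salaries belong to job level 1 and one to job level 3.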
26
Q

Would crosstabs alone help you determine how much missing data you had?

A

Yes, it would tell you how many cases were missing, but NOT WHERE they are missing from.

27
Q

Imagine you are a researcher who needs to make dummy variables for a 3-level categorical variable (joblev) where 1 = worker; 2 = mid management; 3 = top management. What syntax could you use? (Assume there are no missing values and that you are not allowed to use the compute function.)

A

```
Recode joblev (1=1) (else=0) into Jlev1.
Recode joblev (2=1) (else=0) into Jlev2.
```

28
Q

Imagine you are a researcher and you have a 3-level categorical variable (joblev) where 1 = worker; 2 = mid management; 3 = top management. What does the following syntax achieve?

```
Recode joblev (1=1) (else=0) into Jlev1.
Recode joblev (2=1) (else=0) into Jlev2.
crosstabs joblev by Jlev1 Jlev2.
```

What is the 'reference group'? When couldn't you use this syntax?

A

This is creating dummy variables.

- Recode joblev (1=1) (else=0) into Jlev1: creates a binary variable called 'Jlev1' where workers are represented by '1' and mid management and top management are represented by '0'.
- Recode joblev (2=1) (else=0) into Jlev2: creates a binary variable called 'Jlev2' where mid management are represented by '1' and workers and top management are represented by '0'.
- crosstabs joblev by Jlev1 Jlev2: checks that the dummy variables are correct.

Top management are the reference group, as they are represented when BOTH Jlev1 and Jlev2 = 0. Jlev1 = workers and Jlev2 = mid management.

You couldn't use this syntax if there was MISSING data on joblev, because else = 0 would code all missing values as 0 as well.

29
Q

Why can't the recode command be used to create dummy variables when there is missing data?

e.g., imagine you are a researcher and you have a 3-level categorical variable (joblev) where 1 = worker; 2 = mid management; 3 = top management, and you make dummy variables using the following syntax:

```
Recode joblev (1=1) (else=0) into Jlev1.
Recode joblev (2=1) (else=0) into Jlev2.
crosstabs joblev by Jlev1 Jlev2.
```

A

Because (else = 0) would code all missing values as 0 as well; they would therefore become part of our reference group, which we intended to include only top management.

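
A small Python sketch (not SPSS syntax) makes the problem visible. The helper dummy_else_zero is a hypothetical stand-in for the (target=1) (else=0) specification, and None stands in for a missing job level.

```python
def dummy_else_zero(joblev, target):
    # Like (target=1) (else=0): every other value, INCLUDING missing (None),
    # is swept up by "else" and coded 0.
    return 1 if joblev == target else 0


joblev = [1, 2, 3, None]  # worker, mid management, top management, missing

jlev1 = [dummy_else_zero(v, 1) for v in joblev]
jlev2 = [dummy_else_zero(v, 2) for v in joblev]

pairs = list(zip(jlev1, jlev2))
print(pairs)
```

The last two pairs are identical, so the case with a missing job level can no longer be told apart from top management.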
30
Q

What function can be used to reverse-code negative items derived from a questionnaire?

A

RECODE!!

e.g., recode b14b b14c b14g b14i b14l b14o (1=7) (2=6) (3=5) (5=3) (6=2) (7=1).

These would be the negative items on a 1-7 Likert scale (e.g., 1 = strongly disagree ... 7 = strongly agree).

31
Q

What are some of the ways that the COMPUTE command can be used?

A

- Transforming variables to manage positive or negative skew
- Creating a scale (sum or mean of items)
- Creating dummy variables
- Creating new variables which represent combinations of variables (e.g., job levels by sex ---> SEXJOBLEV = 1 'male worker' 2 'male mid-manage' 3 'male top-manage' 4 'female worker' 5 'female mid-manage' 6 'female top manage')
- Centring a variable

32
Q

With regard to the skewness of a distribution:
- a value of 0 means?
- An asymmetrical distribution with a long tail towards the higher values (right) has a ____ skew
- An asymmetrical distribution with a long tail towards the lower values (left) has a ____ skew
- if the skewness is greater than 0 it indicates a ____ skew
- if the skewness is less than 0 it indicates a ____ skew

A

- symmetrical, no skew
- positive skew
- negative skew
- positive
- negative

Values greater than 1 can be taken to mean there is substantial skew (though this is a rather arbitrary rule of thumb).

33
Q

Your age (variable) data has a skew of .91! What are the three methods we could use to fix this? (in order from least to most severe)

A

1. Compute sage=SQRT(age).
2. Compute lage=lg10(age).
3. Compute rage=1/age.

34
Q

Your age (variable) data has a skew of .91! What does the syntax Compute sage=SQRT(age) do to fix this?

A

Creates a variable called 'sage' which is the square root of age. This tends to pull in the top half of the distribution and spread the bottom part a bit. A SLIGHT TRANSFORMATION.

35
Q

Your age (variable) data has a skew of .91! What does the syntax Compute lage=lg10(age) do to fix this?

A

Creates a variable called 'lage' which is the log (base 10) of age. A MIDDLE-GROUND TRANSFORMATION (neither slight nor severe) for reducing positive skew.

36
Q

Your age (variable) data has a skew of .91! What does the syntax Compute rage=1/age do to fix this?

A

Creates a variable called 'rage' which is the reciprocal of age, i.e., one divided by age. The order of the scores is reversed: older people will now have a lower value than those who are younger. A VERY SEVERE TRANSFORMATION.

Reciprocal = the quantity obtained by dividing the number one by a given quantity.

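
To get a feel for how strongly each transformation acts, the following Python sketch (not SPSS syntax) applies the three of them to a made-up, positively skewed set of ages. The skewness function here is the simple moment coefficient, which is an assumption for illustration and will not match the bias-adjusted figure SPSS prints exactly.

```python
import math


def skewness(xs):
    """Moment coefficient of skewness: third central moment / variance**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


# Made-up ages with a long tail towards the higher values.
ages = [18, 19, 20, 21, 22, 23, 25, 28, 35, 60]

sage = [math.sqrt(a) for a in ages]   # slight
lage = [math.log10(a) for a in ages]  # middle ground
rage = [1 / a for a in ages]          # severe, and reverses the order of scores

for name, values in [("age", ages), ("sage", sage), ("lage", lage), ("rage", rage)]:
    print(name, round(skewness(values), 3))
```

For this sample the skew shrinks from the raw scores to the square root and again to the log, and the reciprocal turns the youngest person into the highest score.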
37
Q

What additional steps do you have to take when transforming data with a NEGATIVE SKEW (using SQRT, Log10 and Reciprocal transformations)?

A

Using these methods on a negative skew would just make it WORSE. First you need to either:
- square the values (this stretches out the upper end of the scale), e.g., compute perf2 = perf**2.
- reverse the scale, i.e., subtract each value from a value that is 1 higher than the maximum. For values of 1-6, e.g., compute sperf = 7 - perf. [1 becomes 6, 2 becomes 5, and so on. After transforming, the scale has to be reversed back by adding 1 to the highest transformed value (here 2.45, the square root of 6) and then taking away the transformed score, e.g., compute sperf2 = 3.45 - sperf.]

NOW YOU CAN APPLY THE TOOLKIT OF TRANSFORMATIONS!!

38
Q

We would like to create a new variable which is the mean of some of the other variables/items (e.g., the mean of the individual items in a scale). We have 15 items labelled b1 to b15 and we want to find the mean of these items. How would we do this (with no missing data)? How would we do this with missing data? SYNTAX

A

Use the COMPUTE command to make a new variable which is the mean of the items.

No missing data: compute meanofB=mean(b1 to b15).

Missing data: compute meanofB=mean.12(b1 to b15). [the .12 says that at least 12 of the 15 items must be non-missing]

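
The "at least 12 valid items" rule can be sketched in Python (not SPSS syntax). The helper mean_n is a hypothetical name, the responses are invented, and None stands in for a missing answer.

```python
def mean_n(values, min_valid):
    """Mean of the non-missing values, or None if fewer than min_valid are present."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None  # too few answers: the result is treated as missing
    return sum(valid) / len(valid)


complete = [4] * 15                   # all 15 items answered
three_missing = [4] * 12 + [None] * 3  # exactly 12 answered: still allowed
four_missing = [4] * 11 + [None] * 4   # only 11 answered: not enough

print(mean_n(complete, 12), mean_n(three_missing, 12), mean_n(four_missing, 12))
```

The mean is taken over the answered items only, so a respondent with 12 answers is scored on those 12 rather than being penalised for the 3 gaps.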
39
Q

Why wouldn't you use the following syntax to compute the mean of a questionnaire item-sum?

compute B14 = (b14a+b14b+b14c+b14d+b14e+b14f+b14g+b14h ..... (etc) +b14o) /15

A

- TIME CONSUMING
- if any value was missing, B14 would not be calculated for that respondent (this might be good if you required everyone to answer all of the items!)

40
Q

When summing a scale (rather than taking the mean), why is it a bad idea to use the 'sum' function [i.e., compute b14sum = sum(b14a to b14o)]? What should you do instead? (Assume it is a 15-item scale.)

A

If there is any missing data, the sum function just adds up whatever was answered; someone who answered only one item would get that one response as their score, which is terribly misleading. You could use the following syntax instead:

compute b14sum = (mean.12(b14a to b14o))*15.

(Remember, the .12 means they must have answered at least 12 items!)

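
The difference between the two approaches can be sketched in Python (not SPSS syntax). Both helper names are hypothetical, the responses are invented, and None stands in for a missing answer.

```python
def sum_valid(values):
    """Like a plain sum over the items: adds up whatever answers are present."""
    return sum(v for v in values if v is not None)


def scaled_sum(values, min_valid):
    """Like mean.n(...) * number of items: missing if too few items were answered."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid) * len(values)


one_answer = [7] + [None] * 14        # answered a single item
thirteen = [5] * 13 + [None] * 2      # answered 13 of 15 items

print(sum_valid(one_answer), scaled_sum(one_answer, 12))
print(sum_valid(thirteen), scaled_sum(thirteen, 12))
```

The plain sum gives the one-answer respondent a score of 7 and the 13-answer respondent 65, while the scaled mean withholds a score from the first and puts the second on the full 15-item scale.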
41
Q

Imagine you are a researcher who needs to make dummy variables for a 3-level categorical variable (joblev) where 1 = worker; 2 = mid management; 3 = top management. How would you use the compute command to create dummy variables (allowing for missing data)? Could it be shortened if you knew there was no missing data?

A

```
Do if (joblev=1).
compute Jlev1=1.
else.
compute Jlev1=0.
end if.
```

```
Do if (joblev=2).
compute Jlev2=1.
else.
compute Jlev2=0.
end if.
```

- if we knew there was no missing data, we could use the following instead:

compute Jlev1=0.
if (joblev = 1) Jlev1=1.
compute Jlev2=0.
if (joblev = 2) Jlev2=1.

(in this case, if there were missing data, the person would be assigned '0' on both variables, i.e., as top management... this would be very misleading)

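
The contrast between the two versions can be sketched in Python (not SPSS syntax). Both helpers are hypothetical; None stands in for a missing job level, and the first helper assumes, as the card implies, that the do if block leaves the dummy missing when the test cannot be evaluated.

```python
def dummy_do_if(joblev, target):
    # do if version: when joblev is missing the test cannot be evaluated,
    # the whole block is skipped, and the dummy stays missing (None).
    if joblev is None:
        return None
    return 1 if joblev == target else 0


def dummy_compute_if(joblev, target):
    # compute/if version: the dummy starts at 0 and is only switched to 1
    # when the test is true, so a missing joblev keeps the initial 0.
    dummy = 0
    if joblev is not None and joblev == target:
        dummy = 1
    return dummy


for joblev in [1, 2, 3, None]:
    safe = (dummy_do_if(joblev, 1), dummy_do_if(joblev, 2))
    short = (dummy_compute_if(joblev, 1), dummy_compute_if(joblev, 2))
    print(joblev, safe, short)
```

For the three valid job levels both versions agree; only for the missing case does the short version produce (0, 0), which looks exactly like top management.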
42
Q

Imagine you are a researcher with a 3-level categorical variable (joblev) where 1 = worker; 2 = mid management; 3 = top management. What does the following syntax do?

```
Do if (joblev=1).
compute Jlev1=1.
else.
compute Jlev1=0.
end if.
```

```
Do if (joblev=2).
compute Jlev2=1.
else.
compute Jlev2=0.
end if.
```

- what about this?

compute Jlev1=0.
if (joblev = 1) Jlev1=1.
compute Jlev2=0.
if (joblev = 2) Jlev2=1.

A

Computes dummy variables:

Jlev1 = 1 & Jlev2 = 0 ---> Worker
Jlev1 = 0 & Jlev2 = 1 ---> Mid Management
Jlev1 = 0 & Jlev2 = 0 ---> Top Management

The second syntax does the same thing, but can only be used when there is NO MISSING DATA, because otherwise a person with a missing joblev would be assigned '0' on both variables, i.e., as top management... this would be very misleading.

43
Q

How can the logical output of commands be used to create dummy variables? (Imagine you are a researcher with a 3-level categorical variable (joblev) where 1 = worker; 2 = mid management; 3 = top management.)

A

- in SPSS, if a statement is true (e.g., age = 6), then SPSS outputs a '1'
- in contrast, for a false statement (e.g., gender = male when it is in fact female), SPSS outputs a '0'

Therefore the following syntax can also be used to create dummy variables:

compute Jlev1 = (joblev = 1).
compute Jlev2 = (joblev = 2).
freq Jlev1 Jlev2.

i.e., for the first line of the above syntax:
- if joblev is NOT EQUAL TO 1, then the output is 0 (false, and this becomes the reference category)
- if joblev is EQUAL to 1, then the output is 1 (true).

44
Q

What does the following syntax do and why might you want to do this?

```
compute sexjoblev = sex*10+joblev.
recode sexjoblev (11=1) (12=2) (13=3) (21=4) (22=5) (23=6).
value labels sexjoblev 1 'male worker' 2 'male mid-manage' 3 'male top-manage' 4 'female worker' 5 'female mid-manage' 6 'female top manage'.
```

Where joblev = job level with 3 categories (1 = worker, 2 = mid management, 3 = top management) and sex = gender with 2 categories (1 = male, 2 = female).

A

This syntax is a short way to create a new variable which represents job levels by sex (this can then be crosstabulated with a third variable). Specifically:
- it multiplies sex by 10, such that male now = 10 and female now = 20
- it then adds the value of joblev to that value (i.e., 1, 2, or 3, depending on one's job level)
- recode is then used to assign each value on this variable [11, 12, 13 (males by job level) and 21, 22, 23 (females by job level)] to categories 1-6
- and it then labels these categories appropriately.

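
The arithmetic behind the combined code can be checked with a short Python sketch (not SPSS syntax). It assumes sex is coded 1 = male, 2 = female and joblev is coded 1 to 3, which makes the combined codes 11 to 13 and 21 to 23; the dictionary and function names are hypothetical.

```python
# Combined code (sex*10 + joblev) -> final category 1-6.
RECODE_MAP = {11: 1, 12: 2, 13: 3, 21: 4, 22: 5, 23: 6}

LABELS = {
    1: "male worker", 2: "male mid-manage", 3: "male top-manage",
    4: "female worker", 5: "female mid-manage", 6: "female top manage",
}


def sexjoblev(sex, joblev):
    # The tens digit carries sex and the units digit carries job level,
    # so every sex-by-joblev combination gets a unique two-digit code.
    return RECODE_MAP[sex * 10 + joblev]


for sex in (1, 2):
    for joblev in (1, 2, 3):
        code = sexjoblev(sex, joblev)
        print(sex, joblev, code, LABELS[code])
```

Multiplying by 10 works here because joblev never exceeds 9; with more than nine job levels a larger multiplier would be needed to keep the codes unique.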
45
Q

What does the following syntax do?

do repeat x = b14b b14c b14g b14i b14l b14o.
compute x=8-x.
end repeat.

Where b14b to b14o are items on a questionnaire, each rated out of 7, that need to be reversed.

A

Each value to be reversed needs to be subtracted from the highest possible value (7) + 1, i.e., 8. This command makes a loop that does this efficiently.
- The 'x' stands for the first item in the list the first time through; the second time around, the 'x' stands for the second item in the list... and so on.
- compute x = 8-x is what you want to do EACH TIME AROUND with the next variable in the chain.

(EXTRA note: a slash separates two lists and both will be run through simultaneously - see notes if interested)

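
The loop can be mirrored in Python as a rough illustration (not SPSS syntax). The item names come from the card; the respondent's answers are invented, and b14a is included only to show that an item outside the list is left alone.

```python
# One made-up respondent; each item is rated 1-7.
respondent = {"b14a": 6, "b14b": 1, "b14c": 7, "b14g": 4, "b14i": 2, "b14l": 5, "b14o": 3}

negative_items = ["b14b", "b14c", "b14g", "b14i", "b14l", "b14o"]

# Each pass of the loop plays the role of 'x' standing for the next item.
for item in negative_items:
    respondent[item] = 8 - respondent[item]  # 1 becomes 7, 2 becomes 6, ... 7 becomes 1

print(respondent)
```

The midpoint 4 stays at 4, which is why the equivalent recode card does not need a (4=4) specification.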
46
Q

How would we go about centring a variable at its mean?

A

Using the compute command, we would subtract the variable's mean from each individual's score. The syntax might look like this:

descriptives perf. (gets the descriptives for perf, including its mean)
compute perfcent = perf - 4.1418. (perf minus the mean of perf)

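
Centring can be sketched in a few lines of Python (not SPSS syntax). The scores are invented; only the names perf and perfcent follow the card.

```python
import statistics

# Made-up performance scores.
perf = [3.0, 4.0, 4.5, 5.0, 3.5]

# Step 1: get the mean (the counterpart of running descriptives on perf).
perf_mean = statistics.mean(perf)

# Step 2: subtract the mean from every score.
perfcent = [score - perf_mean for score in perf]

print(perf_mean, perfcent)
```

After centring, the scores keep their spacing but their mean is zero, so a positive value means "above average" and a negative value means "below average".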
47
Q

What does the function 'COUNT' do?

A

Counts the number of times a response occurs over a set of variables.
- used less frequently than the RECODE and COMPUTE commands.
- in the count command, we give a list of variables and it counts how often a certain value occurs.

48
Q

What function would we use if we would like to know how many items out of a 15-item scale the respondent 'agrees' with? (on a 7-point Likert scale, where 5, 6, 7 indicate agreement)

A

COUNT function

e.g., count b14agree = b14a to b14o (5,6,7).

(a value of 2 means agreement with two items, and so on)

49
Q

How does the COUNT function handle missing values? (e.g., when we want to know how many items out of a 15-item scale the respondent 'agrees' with, on a 7-point Likert scale where 5, 6, 7 indicate agreement) What should we do?

A

Count doesn't care if a variable is missing or not; it simply looks at the variable for the specified values.
- If one of the specified values is present, it adds '1' to the count.
- If it is not (including when the variable is missing), it just doesn't count it.

It doesn't say "well, the person didn't answer the question so I can't really count them"; rather, it erroneously treats them as not having the values, e.g., as not agreeing.

THEREFORE, we might want to exclude people who didn't answer the questions. To do this we can create a new variable that counts the number of missing values each person has:

count b14mis = b14a to b14o (missing).

50
Q

How might we use the COUNT function to remove missing data in order to do a count? (e.g., when we want to know how many items out of a 15-item scale the respondent 'agrees' with, on a 7-point Likert scale where 5, 6, 7 indicate agreement)

A

We can create a new variable that counts the number of missing values each person has, and then make the agreement count missing for anyone with missing data, e.g.,

```
count b14agree = b14a to b14o (5,6,7).
count b14mis = b14a to b14o (missing).
do if b14mis ne 0.
recode b14agree (lo thru hi=sysmis).
end if.
```

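
The same logic can be sketched in Python as a rough illustration (not SPSS syntax). The helper names are hypothetical, the responses are invented, and None stands in for a missing answer.

```python
AGREE = {5, 6, 7}  # values that indicate agreement on the 7-point scale


def count_values(responses, wanted):
    """How many responses take one of the wanted values (missing never matches)."""
    return sum(1 for r in responses if r in wanted)


def count_missing(responses):
    """How many responses are missing."""
    return sum(1 for r in responses if r is None)


def agree_or_missing(responses):
    # If the respondent skipped any item, the agreement count is not trusted
    # and is set to missing (None) instead.
    if count_missing(responses) != 0:
        return None
    return count_values(responses, AGREE)


complete = [5, 6, 7, 1, 2, 3, 4, 5, 1, 1, 2, 6, 3, 4, 7]  # 15 answers
skipped = [5, 6, None, 1, 2, 3, 4, 5, 1, 1, 2, 6, 3, 4, None]

print(count_values(skipped, AGREE), agree_or_missing(complete), agree_or_missing(skipped))
```

A plain count on the second respondent reports 4 agreements even though two items were never answered; the guarded version returns no score for that person.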
51
Q

The aim is to calculate a new variable named agesq which is the square of an existing variable called age. How would you use syntax to achieve this?

A

compute agesq=age**2.

52
Q

Fill in the blanks below, where the aim is to recode values of age up to 18 as 1, >18 to 50 (inclusive) as 2, and above 50 as 3. Do not use any decimal places in your answer.

recode age (_____)(_____)(_____) (else=copy)

A

recode age (lo thru 18=1)(18 thru 50=2)(50 thru hi=3) (else=copy).

53
Q

Fill in the blank below, where the aim is to recode any value of body mass index (BMI) between 18 and 25 as a new value zero (0). Bear in mind that BMI is measured on a continuous scale. Do not use any decimal places in your answer.

recode bmi (______) (else=copy).

A

recode bmi (18 thru 25=0) (else=copy).