Week 6: Data Summaries & Loops Flashcards
(33 cards)
scatterplot for 2 variables
2 numerical variables
- continuous variables (discrete could work but would be weird if there aren’t many values)
- no categorical data
relationship between the pair can be important
- data points are not related to one another (if neighboring points are related, use a line plot instead)
line plot for 2 variables
2 variables
- same as scatter plot except neighboring points are “connected”
- we want them to be connected (often x is time)
- very good for time series data & prediction
bar plot for 2 variables
- x axis: mainly categorical, discrete if not many (not continuous bc we only have a finite number of bars)
- y axis: discrete or continuous (avoid ordinal; never nominal, bc the taller the bar, the greater the quantity)
- y axis is a variable and not “count” or “frequency”
things that can mislead bar plot
- starting position of the value axis (e.g., a y axis that doesn’t start at 0 exaggerates differences)
- order of the bars (it helps if you order the bars)
compare plots
by plotting 2 variables together
comparing box plots
- side by side box plots
- becomes plot of 2 variables (x axis identifies different plots)
- interaction between numerical (y axis) & categorical (x axis)
- comparison of the same measure among different groups
comparing bar plots (1 variable)
- side by side bars
- becomes plot of 2 variables
- similar for bar plots with 2 variables (y axis is still a count, but each bar itself can identify a variable)
comparing histograms
- compares 2 variables (1 numerical, 1 categorical)
- y axis still count/frequency, histograms themselves can be one variable (ex: color coded), x axis as numerical variable
- lays each histogram on top of each other (usually add transparency)
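A minimal sketch of overlaid histograms, assuming matplotlib is available. The data, the column names 'height' and 'group', and the bin edges are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: one numerical variable and one categorical variable
df = pd.DataFrame({
    "height": [150, 155, 160, 162, 165, 168, 170, 172, 175, 180],
    "group": ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"],
})

fig, ax = plt.subplots()
bins = list(range(145, 190, 5))  # shared bin edges so the histograms line up

# One histogram per category, drawn on the same axes
for name, sub in df.groupby("group"):
    ax.hist(sub["height"], bins=bins, alpha=0.5, label=name)  # alpha adds transparency

ax.set_xlabel("height")  # x axis: the numerical variable
ax.set_ylabel("count")   # y axis: still a count
ax.legend()              # color identifies the categorical variable
```

Sharing the same bin edges across groups keeps the bars comparable; without transparency, the histogram drawn last would hide the one beneath it.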
comparing line plots
- multiple lines on a plot
- plot of 3 variables (x axis, y axis, line itself- ex: dotted vs full line)
- scatter plots are similar
good data visualization
- understand data & purpose (what story does your data tell & what story do you want to tell)
df.describe()
summary statistics
- count (number of non-missing values)
- mean
- std dev
- min
- 25th percentile (shown as 25%)
- median (shown as 50%)
- 75th percentile (shown as 75%)
- max
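A small sketch of these summary statistics on a made-up column (the name 'age' and the values are illustrative only). Note that the missing value is left out of the count:

```python
import pandas as pd

# Made-up data with one missing value
df = pd.DataFrame({"age": [2, 4, 6, 8, None]})

summary = df.describe()
print(summary)

# Individual statistics can be looked up by row label
print(summary.loc["count", "age"])  # non-missing values only
print(summary.loc["50%", "age"])    # the median
```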
pandas
df.groupby
- group together all rows belonging to one category
- category is determined by a single column
need:
- 1 categorical column to group by
- 1+ numerical column to get stats on
- 1 aggregate function (e.g., min, mean, max, etc.) for numerical variables
ex: df.groupby('cat_column_name')[['num_column_name']].mean()
- swap mean for any other aggregate function (min, max, etc.)
df.groupby visual example
start w/ a bunch of different values, each with a color
1) group by “color”
- groups all the values with the same color together
2) look at the “value column”
3) take the mean
- takes the mean of the values in each color group
- don’t group on “value” because “value” is not categorical
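The three steps above can be sketched as follows, using a made-up table with a 'color' column and a 'value' column:

```python
import pandas as pd

# Made-up data: each value has a color
df = pd.DataFrame({
    "color": ["red", "blue", "red", "blue", "green"],
    "value": [1, 2, 3, 4, 5],
})

# 1) group by "color", 2) look at the "value" column, 3) take the mean
means = df.groupby("color")["value"].mean()
print(means)
```

The result has one row per color: red averages 1 and 3, blue averages 2 and 4, and green has a single value.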
example: groupby on loan 50 data
call:
df.groupby('term')['annual_income'].max()
- will group the rows by 'term'
- then take the max of 'annual_income' within each group
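A runnable sketch of this call. The rows below are made up as a stand-in for the real data; only the column names 'term' and 'annual_income' come from the card:

```python
import pandas as pd

# Hypothetical rows standing in for the loan data
loans = pd.DataFrame({
    "term": [36, 60, 36, 60, 36],
    "annual_income": [59000, 60000, 75000, 82000, 41000],
})

# Single brackets return a Series: one max per term
max_income = loans.groupby("term")["annual_income"].max()
print(max_income)

# Double brackets return a DataFrame with the same numbers
max_income_df = loans.groupby("term")[["annual_income"]].max()
print(max_income_df)
```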
pivot tables
- way to summarize info for 3+ variables
suppose: df w/ columns ‘cat1’ and ‘cat2’ for 2 categorical variables & ‘num1’ for a numerical variable
code:
df.pivot_table(index='cat1', columns='cat2', values='num1', aggfunc='sum')
- groups by 2 categorical variables (cat1 & cat2) –> group within a larger group
- summarize 1 numerical variable (num1)
group rows into categories based on cat1/cat2, summarize num1 by aggfunc
pivot table dog example
- categorical: sex & breed
- numerical: age
- aggfunc: mean (average)
will give:
female+breed1 avg age, male+breed1 avg age, female+breed2 avg age, male+breed2 avg age, etc.
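A sketch of the dog example with made-up rows (the breed names and ages are illustrative only):

```python
import pandas as pd

# Made-up data: two categorical columns and one numerical column
dogs = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "breed": ["beagle", "beagle", "poodle", "poodle", "beagle", "poodle"],
    "age": [2, 4, 6, 8, 4, 10],
})

# Rows are labeled by sex, columns by breed, cells hold the mean age
table = dogs.pivot_table(index="sex", columns="breed", values="age", aggfunc="mean")
print(table)

# Look up one sex + breed combination
print(table.loc["female", "beagle"])
```

Each cell is the average age for one sex and breed pair, so the table has one row per sex and one column per breed.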
why is it called “index” in a pivot table?
bc it acts like an index: its values become the row labels (the column on the very left)
combine dataframes
- row-wise
- column-wise
row-wise
stack the dataframes together
- works bc rows are like observations and are “independent” from one another
- columns must be the same to avoid missing values
pd.concat
- called on pandas itself with a list of dataframes: pd.concat([df1, df2])
- add rows to a table
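A short sketch of row-wise stacking with the pandas function pd.concat, using made-up tables. The second half shows the missing values that appear when the columns differ:

```python
import pandas as pd

# Two made-up tables with the same columns
a = pd.DataFrame({"id": [1, 2], "score": [90, 85]})
b = pd.DataFrame({"id": [3], "score": [70]})

# Stack the rows; ignore_index renumbers the rows 0, 1, 2
stacked = pd.concat([a, b], ignore_index=True)
print(stacked)

# A table with a different column: the gaps are filled with NaN
c = pd.DataFrame({"id": [4], "grade": ["B"]})
mismatched = pd.concat([a, c], ignore_index=True)
print(mismatched)
```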
column-wise
we match on a column
- can’t stack next to each other bc we don’t know how to combine them
merging
df1.merge(df2, how='', left_on='', right_on='')
- how: 'inner', 'outer', 'left', 'right'
- left_on: column_x_from_df1
- right_on: column_y_from_df2
ex: 2 tables have ids for responses –> merge –> adds columns, organized by id
inner merge
keep only matching records on a merge
outer merge
keep all records in both df (but if not matching/missing info –> NaN, not a number)
left merge
keep all records in left df
right merge
keep all records in right df
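A sketch comparing the four merge types on two made-up tables. The column names 'resp_id' and 'person_id' are hypothetical; they show left_on and right_on matching columns that are named differently:

```python
import pandas as pd

# Made-up tables: ids 2 and 3 appear in both, 1 only in df1, 4 only in df2
df1 = pd.DataFrame({"resp_id": [1, 2, 3], "q1": ["yes", "no", "yes"]})
df2 = pd.DataFrame({"person_id": [2, 3, 4], "q2": [5, 7, 9]})

# Run the same merge with each value of how
merged = {}
for how in ["inner", "outer", "left", "right"]:
    merged[how] = df1.merge(df2, how=how, left_on="resp_id", right_on="person_id")
    print(how, len(merged[how]), "rows")

# The outer merge keeps everything and fills the gaps with NaN
print(merged["outer"])
```

The inner merge keeps only ids 2 and 3, left and right each keep three rows, and outer keeps all four ids.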
- can loop over lists, dictionaries, or any other collection
- don't refer to loop_variable outside the loop
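A minimal sketch of looping over a list and over a dictionary, assuming these notes refer to a Python for loop. All names and values are made up:

```python
# Loop over a list: price is the loop variable
total = 0
for price in [3, 5, 7]:
    total += price
print(total)

# Loop over a dictionary: .items() gives each key with its value
stock = {"apple": 2, "pear": 0}
in_stock = []
for name, count in stock.items():
    if count > 0:
        in_stock.append(name)
print(in_stock)

# Results are stored in total and in_stock, so nothing after the loops
# needs to refer to price, name, or count.
```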
while: keyword in Python that starts a loop
loop will continue until the condition is false (Boolean: a true/false condition)
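A small sketch of a while loop with made-up numbers. The condition n > 1 is checked before every pass, and the loop stops as soon as it is False:

```python
n = 20
steps = 0

# Keep halving n (integer division) while the condition is True
while n > 1:
    n = n // 2
    steps += 1

print(n, steps)
```

The body must change something the condition depends on (here, n); otherwise the condition never becomes False and the loop never ends.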