Week 6: Data Summaries & Loops Flashcards by Rebecca Tang

scatterplot for 2 variables

2 numerical variables
- continuous variables (discrete could work but would be weird if there aren’t many values)
- no categorical data

relationship between the pair can be important
- data points are not connected to one another (if they should be connected, you need a line plot)

line plot for 2 variables

2 variables

same as scatter plot except neighboring points are “connected”
we want them to be connected (often x is time)
very good for time series data & prediction

bar plot for 2 variables

x axis: mainly categorical, discrete if not many (not continuous bc we only have a finite number of bars)
y axis: discrete or continuous (avoid ordinal, and never nominal, bc a taller bar means a greater quantity)
y axis is a variable and not “count” or “frequency”

things that can mislead bar plot

starting value of the y axis (ex: an axis that does not start at 0 exaggerates differences)
order of the bars (it helps if you order the bars)

compare plots

by plotting 2 variables together

comparing box plots

side by side box plots
becomes plot of 2 variables (x axis identifies different plots)
interaction between numerical (y axis) & categorical (x axis)
comparison of the same measure among different groups

comparing bar plots (1 variable)

side by side bars
becomes plot of 2 variables
similar for bar plots with 2 variables (y axis is still a count, but each bar itself can identify a variable)

comparing histograms

compares 2 variables (1 numerical, 1 categorical)
y axis still count/frequency, histograms themselves can be one variable (ex: color coded), x axis as numerical variable
lays each histogram on top of each other (usually add transparency)

comparing line plots

multiple lines on a plot
plot of 3 variables (x axis, y axis, line itself- ex: dotted vs full line)
scatter plots are similar

good data visualization

understand data & purpose (what story does your data tell & what story do you want to tell)

df.describe

summary statistics
- count (number of non-missing values)
- mean
- std dev
- min
- 25th percentile
- median (50th percentile)
- 75th percentile
- max

pandas
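
A minimal sketch of what describe() returns, assuming a made-up DataFrame with one numerical column named 'age' (the data and the column name are illustrative, not from the course):

```python
import pandas as pd

# Hypothetical data; 'age' is a made-up column with one missing value.
df = pd.DataFrame({"age": [2, 4, 6, 8, None]})

# describe() returns a table whose rows are the summary statistics.
summary = df.describe()
print(summary)

# count only includes non-missing values, so the None is skipped.
n_non_missing = summary.loc["count", "age"]

# The median appears under the row label "50%".
median_age = summary.loc["50%", "age"]
```

Note that the median is labeled "50%" in the output rather than "median".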

df.groupby

group together all rows belonging to one category
category is determined by a single column

need:
- 1 categorical column to group by
- 1+ numerical column to get stats on
- 1 aggregate function (e.g., min, mean, max, etc.) for numerical variables

ex: df.groupby(cat_column_name)[[num_column_name]].aggfunc()

df.groupby visual example

start w/ a bunch of different values, each with a color

1) group by “color”
- groups all the values with the same color together
2) look at the “value column”
3) take the mean
- takes the mean of the values in each color group

don’t group on “value” because “value” is not categorical
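
The three steps above can be sketched in pandas like this, assuming a small made-up table with a 'color' column and a 'value' column (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical rows: each value has a color.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "blue", "green"],
    "value": [1, 2, 3, 4, 5],
})

# 1) group by "color", 2) look at the "value" column, 3) take the mean
mean_by_color = df.groupby("color")["value"].mean()
print(mean_by_color)
```

The result has one row per color, holding the mean of the values in that color group.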

example: groupby on loan 50 data

call:
df.groupby('term')['annual_income'].max()

will group data by the ‘term’
then take the max of the ‘annual income’
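
A sketch of this call on made-up rows (the real loan 50 data is not reproduced here; only the column names 'term' and 'annual_income' come from the card). It also shows that single brackets give back a Series while double brackets give back a DataFrame:

```python
import pandas as pd

# Made-up rows standing in for the loan data.
df = pd.DataFrame({
    "term": [36, 60, 36, 60],
    "annual_income": [50000, 82000, 64000, 71000],
})

# Single brackets: the result is a Series indexed by term.
as_series = df.groupby("term")["annual_income"].max()

# Double brackets: same numbers, but the result is a DataFrame.
as_frame = df.groupby("term")[["annual_income"]].max()

print(as_series)
print(as_frame)
```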

pivot tables

way to summarize info for 3+ variables

suppose: df w/ columns ‘cat1’ and ‘cat2’ for 2 categorical variables & ‘num1’ for a numerical variable

code:
df.pivot_table(index='cat1', columns='cat2', values='num1', aggfunc='sum')

groups by 2 categorical variables (cat1 & cat2) –> group within a larger group
summarize 1 numerical variable (num1)

group rows into categories based on cat1/cat2, summarize num1 by aggfunc

pivot table dog example

categorical: sex & breed
numerical: age
aggfunc: mean (the average)

will give:
female+breed1 avg age, male+breed1 avg age, female+breed2 avg age, male+breed2 avg age, etc.
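
A minimal sketch of the dog example, assuming a made-up table with columns named 'sex', 'breed', and 'age' (the rows and the exact column names are illustrative):

```python
import pandas as pd

# Hypothetical dogs; breed1/breed2 are placeholder breed names.
dogs = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "breed": ["breed1", "breed1", "breed2", "breed2", "breed1"],
    "age": [2, 4, 6, 8, 4],
})

# Rows come from 'sex', columns from 'breed', cells hold the mean age.
table = dogs.pivot_table(index="sex", columns="breed", values="age", aggfunc="mean")
print(table)
```

Each cell is the average age for one sex and breed combination, for example the two female breed1 dogs (ages 2 and 4) average to 3.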

index in pivot table?

bc it acts like an index (it’s the column on the very left)

combine dataframes

row-wise
column-wise

row-wise

stack the dataframes together

works bc rows are like observations and are “independent” from one another
columns must be the same to avoid missing values

pd.concat([df1, df2])
- add rows to a table
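
A short sketch of row-wise stacking, assuming made-up tables with columns 'id' and 'score'. Note that concat is a pandas function that takes a list of DataFrames, not a DataFrame method:

```python
import pandas as pd

# Hypothetical tables with the same columns.
df1 = pd.DataFrame({"id": [1, 2], "score": [10, 20]})
df2 = pd.DataFrame({"id": [3], "score": [30]})

# Stack df2 under df1; ignore_index renumbers the rows 0, 1, 2.
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)

# If the columns differ, the gaps are filled with NaN.
df3 = pd.DataFrame({"id": [4]})
with_gap = pd.concat([df1, df3], ignore_index=True)
print(with_gap)
```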

column-wise

we match on a column

can’t stack next to each other bc we don’t know how to combine them

merging

df1.merge(df2, how='', left_on='', right_on='')
- how: 'inner', 'outer', 'left', 'right'
- left_on: column_x_from_df1
- right_on: column_y_from_df2

ex: 2 tables have ids for responses –> merge –> adds columns, organized by id
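
A sketch of the four values of how, assuming two made-up response tables whose id columns are named 'id' and 'resp_id' (all names and rows are hypothetical):

```python
import pandas as pd

# ids 2 and 3 appear in both tables; 1 is only on the left, 4 only on the right.
left = pd.DataFrame({"id": [1, 2, 3], "q1": ["a", "b", "c"]})
right = pd.DataFrame({"resp_id": [2, 3, 4], "q2": ["x", "y", "z"]})

row_counts = {}
for how in ["inner", "outer", "left", "right"]:
    merged = left.merge(right, how=how, left_on="id", right_on="resp_id")
    row_counts[how] = len(merged)

print(row_counts)
```

The inner merge keeps only the 2 matching ids, the outer merge keeps all 4, and the left and right merges each keep the 3 rows of their own table.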

inner merge

keep only matching records on a merge

outer merge

keep all records in both df (but if not matching/missing info –> NaN, not a number)

left merge

keeps all records on the left df

right merge

keep all records in right df

loops

way to do something over and over again in code
defines a loop variable and loops over its values

loops cookie recipe bad example

print('add sugar to the bowl and mix')
print('add butter to the bowl and mix')
print('add flour to the bowl and mix')

why it's bad:
- copy/paste code is not good (mistakes are easy)
- to adapt it (ex: bowl --> pot), you must change many lines

loops cookie recipe alt example

ingredients = ('sugar', 'butter', 'flour')
for ingredient in ingredients:
    print("add " + ingredient + " to the bowl and mix")

for: word in python that tells us we are creating a loop
'ingredient': name of a new variable that will represent each item (the name can be changed)
in: python word that separates the variable we're creating from the thing it gets its values from
'ingredients': the collection we are looping over
colon (:): start of the indented block
indented block: the actual thing that loops

cookie example what actually happens

prints one statement per ingredient, going down the list of ingredients
- keeps going back and forth between the 'for...' line and the looped 'print...' line until the end of the list
- at the end of the list: jumps outside the loop
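
A small sketch of that flow, using the same ingredients. The 'container' variable and the final 'bake the cookies' step are made up here to show where the code jumps after the loop and how one change (bowl --> pot) updates every line:

```python
ingredients = ("sugar", "butter", "flour")
container = "pot"  # hypothetical: change it once instead of on every line

steps = []
for ingredient in ingredients:
    # This indented line runs once per ingredient.
    steps.append("add " + ingredient + " to the " + container + " and mix")

# Not indented, so it runs once, after the loop has finished.
steps.append("bake the cookies")

for step in steps:
    print(step)
```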

loops in general

for loop_variable in collection:
    do_something

- can use lists or dictionaries or another collection of something
- don't refer to loop_variable outside the loop
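
A sketch of looping over a dictionary instead of a list, assuming a made-up 'prices' dictionary. Looping over a dictionary directly gives its keys; .items() gives key and value together:

```python
# Hypothetical dictionary mapping ingredient names to prices.
prices = {"sugar": 2, "butter": 4}

names = []
for name in prices:  # the loop variable takes each key in turn
    names.append(name)

total = 0
for name, price in prices.items():  # key and value at the same time
    total = total + price

print(names, total)
```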





to do something certain number of times example




something = 1

for variable in range(4):
    print(something)

prints:
1
1
1
1






2 ways to loop through a list




lst = [1, 2, 3, 4]
total = 0

1) loop through list directly (by value)

for item in lst:
    total = total + item

2) loop through an index (and look up items by index)

for idx in range(len(lst)):
    item = lst[idx]
    total = total + item

- looks up item by position or index on the list






"while" loop




keeps doing something as long as a condition is true

while condition_is_true:
    do_something

while: word in python, indicates loop

loop will continue until the condition is false (Boolean: a true/false condition)
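
A minimal sketch of a while loop, assuming a made-up counter that stops at 3:

```python
count = 0
seen = []

# The condition is checked before every pass through the loop.
while count < 3:
    seen.append(count)
    count = count + 1  # without this line the condition never becomes false

print(seen, count)
```

The loop body must eventually make the condition false, otherwise the loop runs forever.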






loop that finds the player with the max points




players = {'play1': 47, 'play2': 55, 'play3': 12}

best_player = None

best_score = -99999999  # a really small number

for player in players:
    current_player_score = players[player]
    if current_player_score > best_score:
        best_player = player
        best_score = current_player_score

print("Best Player = " + best_player)

prints: Best Player = play2
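
For comparison, a sketch of the same result using Python's built-in max with a key function. This is an alternative to the loop on the card, not the card's method:

```python
players = {"play1": 47, "play2": 55, "play3": 12}

# max looks at each key and compares them by the score players.get returns.
best_player = max(players, key=players.get)
best_score = players[best_player]

print("Best Player = " + best_player)
```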

Week 6: Data Summaries & Loops Flashcards

(33 cards)