Week 6: Data Summaries & Loops Flashcards

(33 cards)

1
Q

scatterplot for 2 variables

A

2 numerical variables
- continuous variables (discrete could work but would be weird if there aren’t many values)
- no categorical data

relationship between the pair can be important
- data points are not related to one another (if they were, we would need a line plot)

2
Q

line plot for 2 variables

A

2 variables

  • same as scatter plot except neighboring points are “connected”
  • we want them to be connected (often x is time)
  • very good for time series data & prediction
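A minimal sketch of the scatter vs. line difference, assuming matplotlib is available. The `days` and `temperature` values are made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical time series: x is time (days), y is a measured value
days = [1, 2, 3, 4, 5]
temperature = [12.0, 14.5, 13.0, 16.0, 17.5]

fig, (ax_scatter, ax_line) = plt.subplots(1, 2, figsize=(8, 3))

# Scatter plot: points only, no claim that neighbors are related
ax_scatter.scatter(days, temperature)
ax_scatter.set_title("scatter: points only")

# Line plot: same points, but neighboring points are connected
ax_line.plot(days, temperature, marker="o")
ax_line.set_title("line: neighbors connected")
```

Calling `fig.savefig(...)` or `plt.show()` afterwards would render both panels side by side.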
3
Q

bar plot for 2 variables

A
  • x axis: mainly categorical; discrete works if there aren’t many values (not continuous bc we only have a finite number of bars)
  • y axis: discrete or continuous (avoid ordinal, never nominal, bc the taller the bar, the greater the quantity)
  • y axis is a variable and not “count” or “frequency”
4
Q

things that can mislead bar plot

A
  • starting position of the value axis (ex: bars that don’t start at 0)
  • order of the bars (it helps if you order the bars)
5
Q

compare plots

A

by plotting 2 variables together

6
Q

comparing box plots

A
  • side by side box plots
  • becomes plot of 2 variables (x axis identifies different plots)
  • interaction between numerical (y axis) & categorical (x axis)
  • comparison of the same measure among different groups
7
Q

comparing bar plots (1 variable)

A
  • side by side bars
  • becomes plot of 2 variables
  • similar for bar plots with 2 variables (y axis is still a count, but each bar itself can identify a variable)
8
Q

comparing histograms

A
  • compares 2 variables (1 numerical, 1 categorical)
  • y axis still count/frequency, histograms themselves can be one variable (ex: color coded), x axis as numerical variable
  • lays each histogram on top of each other (usually add transparency)
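A rough sketch of overlaid histograms, assuming matplotlib. The two groups and their heights are hypothetical, and `alpha=0.5` is just one reasonable transparency value:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data: 1 numerical variable (height) split by 1 categorical variable (group)
heights_a = [150, 152, 155, 158, 160, 161, 163, 165]
heights_b = [158, 160, 164, 166, 168, 170, 171, 175]

fig, ax = plt.subplots()
bins = list(range(150, 181, 5))  # shared bin edges so the histograms are comparable

# alpha < 1 adds transparency so the overlapping region stays visible
counts_a, _, _ = ax.hist(heights_a, bins=bins, alpha=0.5, label="group A")
counts_b, _, _ = ax.hist(heights_b, bins=bins, alpha=0.5, label="group B")

ax.set_xlabel("height (cm)")  # x axis: the numerical variable
ax.set_ylabel("count")        # y axis: still a count
ax.legend()                   # color identifies the categorical variable
```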
9
Q

comparing line plots

A
  • multiple lines on a plot
  • plot of 3 variables (x axis, y axis, and the line itself, ex: dotted vs. solid line)
  • scatter plots are similar
10
Q

good data visualization

A
  • understand data & purpose (what story does your data tell & what story do you want to tell)
11
Q

df.describe()

A

summary statistics
- count (number of non-missing values)
- mean
- std dev
- min
- 25th percentile
- median (50th percentile)
- 75th percentile
- max

pandas
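A small sketch of what the summary looks like, assuming a made-up DataFrame with one numerical column (`age`, with one missing value) and one text column (`name`):

```python
import pandas as pd

# Hypothetical data: 'age' has one missing value, 'name' is not numerical
df = pd.DataFrame({
    "age": [2.0, 4.0, 6.0, None],
    "name": ["a", "b", "c", "d"],
})

# By default describe() summarizes only the numerical columns
summary = df.describe()
print(summary)

# count ignores the missing value, so it is 3 here, not 4
print(summary.loc["count", "age"])
# the median shows up under the row label '50%'
print(summary.loc["50%", "age"])
```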

12
Q

df.groupby

A
  • group together all rows belonging to one category
  • category is determined by a single column

need:
- 1 categorical column to group by
- 1+ numerical column to get stats on
- 1 aggregate function (e.g., min, mean, max, etc.) for numerical variables

ex: df.groupby(cat_column_name)[[num_column_name]].aggfunc()
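A minimal sketch of the pattern above, with hypothetical column names `cat` and `num`. It also shows the difference between double and single brackets, and `.agg` for several aggregate functions at once:

```python
import pandas as pd

# Hypothetical data: 1 categorical column and 1 numerical column
df = pd.DataFrame({
    "cat": ["a", "b", "a", "b"],
    "num": [1, 2, 3, 4],
})

# Double brackets [[...]] give back a DataFrame
as_frame = df.groupby("cat")[["num"]].mean()

# Single brackets [...] give back a Series with the same numbers
as_series = df.groupby("cat")["num"].mean()

# .agg takes a list when we want several aggregate functions
several = df.groupby("cat")["num"].agg(["min", "mean", "max"])
print(several)
```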

13
Q

df.groupby visual example

A

start w/ a bunch of different values, each with a color

1) group by “color”
- groups all the values with the same color together
2) look at the “value column”
3) take the mean
- takes the mean of the values in each color group

  • don’t group on “value” because “value” is not categorical
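The three steps above can be sketched like this, assuming a made-up table with a `color` column and a `value` column:

```python
import pandas as pd

# Hypothetical data: each value comes with a color
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue"],
    "value": [1, 2, 3, 4, 6],
})

# Step 1: group by 'color' (peek at which values land in each group)
groups = {color: list(group["value"]) for color, group in df.groupby("color")}
print(groups)

# Steps 2 and 3: look at the 'value' column and take the mean per color group
means = df.groupby("color")["value"].mean()
print(means)
```

Here red has values 1 and 3, so its mean is 2.0; blue has 2 and 6, so its mean is 4.0.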
14
Q

example: groupby on loan 50 data

A

call:
df.groupby('term')['annual_income'].max()

  • will group the data by 'term'
  • then take the max of 'annual_income' within each term group
15
Q

pivot tables

A
  • way to summarize info for 3+ variables

suppose: df w/ columns ‘cat1’ and ‘cat2’ for 2 categorical variables & ‘num1’ for a numerical variable

code:
df.pivot_table(index='cat1', columns='cat2', values='num1', aggfunc='sum')

  • groups by 2 categorical variables (cat1 & cat2) –> group within a larger group
  • summarize 1 numerical variable (num1)

group rows into categories based on cat1/cat2, summarize num1 by aggfunc
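A small sketch of the call above on made-up data (the values in `cat1`, `cat2`, and `num1` are hypothetical). The second computation is an assumption about how to read "group within a larger group": a groupby on both categorical columns gives the same numbers, just laid out differently.

```python
import pandas as pd

# Hypothetical data with 2 categorical columns and 1 numerical column
df = pd.DataFrame({
    "cat1": ["x", "x", "y", "y", "x"],
    "cat2": ["p", "q", "p", "q", "p"],
    "num1": [1, 2, 3, 4, 5],
})

# Rows are labeled by cat1, columns by cat2, cells hold the sum of num1
table = df.pivot_table(index="cat1", columns="cat2", values="num1", aggfunc="sum")
print(table)

# Same numbers via groupby on both columns, then moving cat2 into the columns
same = df.groupby(["cat1", "cat2"])["num1"].sum().unstack()
print(same)
```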

16
Q

pivot table dog example

A
  • categorical: sex & breed
  • numerical: age
  • aggfunc: avg

will give:
female+breed1 avg age, male+breed1 avg age, female+breed2 avg age, male+breed2 avg age, etc.
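A sketch of the dog example, assuming a made-up `dogs` table. The breed names and ages are hypothetical, and `aggfunc="mean"` stands in for "avg":

```python
import pandas as pd

# Hypothetical dog data: 2 categorical columns (sex, breed), 1 numerical (age)
dogs = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "breed": ["beagle", "beagle", "poodle", "poodle", "beagle"],
    "age": [2, 4, 6, 8, 4],
})

# One row per breed (the index), one column per sex, cells hold the average age
avg_age = dogs.pivot_table(index="breed", columns="sex", values="age", aggfunc="mean")
print(avg_age)
```

The two female beagles are 2 and 4 years old, so that cell is 3.0.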

17
Q

why is it called “index” in a pivot table?

A

bc it acts like an index: its values label the rows (it’s the column on the very left)

18
Q

combine dataframes

A
  • row-wise
  • column-wise
19
Q

row-wise

A

stack the dataframes together

  • works bc rows are like observations and are “independent” from one another
  • columns must be the same to avoid missing values

pd.concat([df1, df2])
- add rows to a table
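A minimal sketch of row-wise stacking with pandas' top-level `pd.concat` function. The tables `jan`, `feb`, and `mar` are made up, and `mar` deliberately has a different column to show where missing values come from:

```python
import pandas as pd

# Hypothetical tables with the same columns
jan = pd.DataFrame({"id": [1, 2], "score": [10, 20]})
feb = pd.DataFrame({"id": [3], "score": [30]})

# Same columns: rows simply stack, nothing is missing
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)

# Different columns: pandas fills the gaps with NaN
mar = pd.DataFrame({"id": [4], "grade": ["A"]})
mismatched = pd.concat([jan, mar], ignore_index=True)
print(mismatched)
```

`ignore_index=True` renumbers the rows 0, 1, 2, ... instead of keeping each table's own row labels.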

20
Q

column-wise

A

we match on a column

  • can’t stack next to each other bc we don’t know how to combine them

merging

df1.merge(df2, how='', left_on='', right_on='')
- how: ‘inner’, ‘outer’, ‘left’, ‘right’
- left_on: column_x_from_df1
- right_on: column_y_from_df2

ex: 2 tables have ids for responses –> merge –> adds columns, organized by id
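A sketch of the id example, assuming two made-up tables whose id columns have different names (`resp_id` and `person_id` are hypothetical):

```python
import pandas as pd

# Hypothetical tables: both identify a response/person by an id,
# but the id column is named differently in each
responses = pd.DataFrame({"resp_id": [1, 2, 3], "answer": ["yes", "no", "yes"]})
people = pd.DataFrame({"person_id": [1, 2, 3], "age": [30, 41, 25]})

# left_on names the matching column in the left table,
# right_on names the matching column in the right table
merged = responses.merge(people, how="inner", left_on="resp_id", right_on="person_id")
print(merged)
```

The result has one row per id and the columns of both tables, including both id columns.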

21
Q

inner merge

A

keep only matching records on a merge

22
Q

outer merge

A

keep all records in both df (but if not matching/missing info –> NaN, not a number)

23
Q

left merge

A

keep all records in left df

24
Q

right merge

A

keep all records in right df
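To compare the four kinds of merge side by side, here is a sketch on made-up tables `left` and `right` that share only ids 2 and 3. It uses `on="id"`, which works when the matching column has the same name in both tables:

```python
import pandas as pd

# Hypothetical tables: ids 2 and 3 match, id 1 is only in left, id 4 only in right
left = pd.DataFrame({"id": [1, 2, 3], "x": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "y": ["B", "C", "D"]})

# Count how many rows each kind of merge keeps
rows = {how: len(left.merge(right, how=how, on="id"))
        for how in ["inner", "outer", "left", "right"]}
print(rows)

# Outer merge keeps everything; the non-matching cells become NaN
outer = left.merge(right, how="outer", on="id")
print(outer)
```

Inner keeps the 2 matching ids, left and right keep 3 rows each, and outer keeps all 4 ids.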

25
Q

loops

A

way to do something over and over again in code
  • defines a loop variable and loops over its values

26
Q

loops cookie recipe bad example

A

print('add sugar to the bowl and mix')
print('add butter to the bowl and mix')
print('add flour to the bowl and mix')

why this is bad:
- copy/paste code is not good (mistakes are easy)
- if we wanted to adapt it (ex: bowl --> pot), we must change many lines

27
Q

loops cookie recipe alt example

A

ingredients = ('sugar', 'butter', 'flour')
for ingredient in ingredients:
    print("add " + ingredient + " to the bowl and mix")

  • for: word in python that tells us we are creating a loop
  • 'ingredient': name of a new variable (the loop variable) that will represent each item (the name can be changed)
  • in: python word that separates the variable we're creating and the thing it gets its values from
  • 'ingredients': the collection we are looping over
  • colon (:): start of the indented block
  • indented block: the actual thing that loops

28
Q

cookie example what actually happens

A

prints one statement per ingredient, going down the list of ingredients
- keeps going back and forth between the 'for...' line and the loop body 'print...' until the end of the list
- at end of list: jumps outside the loop

29
Q

loops in general

A

for loop_variable in collection:

- can use lists or dictionaries or another collection of something
- don't refer to loop_variable outside the loop

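A short sketch of the general pattern with two kinds of collection. The `words` list and `prices` dictionary are made up for illustration:

```python
# Looping over a list: the loop variable takes each item in turn
words = ["sugar", "butter", "flour"]
total_chars = 0
for word in words:
    total_chars += len(word)  # 5 + 6 + 5
print(total_chars)

# Looping over a dictionary: the loop variable takes each key in turn,
# and the value is looked up with the key
prices = {"sugar": 2, "butter": 5}
seen = []
for name in prices:
    seen.append((name, prices[name]))
print(seen)
```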
30
Q

to do something a certain number of times example

A

something = 1
for variable in range(4):
    print(something)

output:
1
1
1
1

31
Q

2 ways to loop through a list

A

lst = [1, 2, 3, 4]
total = 0

1) loop through the list directly (by value)
for item in lst:
    total = total + item

2) loop through an index (and look up items by index)
for idx in range(len(lst)):
    item = lst[idx]
    total = total + item
- looks up item by its position (index) in the list

32
Q

"while" loop

A

keeps doing something as long as a condition is true

while condition_is_true:

  • while: word in python, indicates a loop
  • loop will continue until the condition is false (Boolean: true/false condition)

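A minimal sketch of a while loop. The counter name `count` and the limit of 3 are arbitrary choices for illustration:

```python
count = 0

# The condition is checked before every pass through the indented block
while count < 3:
    print("count is", count)
    count = count + 1  # without this line the condition never becomes false

# The loop ends as soon as the condition is false
print("done, count is", count)
```

If nothing inside the loop changes the condition, the loop never stops, so the update line matters.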
33
Q

loop that finds the player with max points

A

players = {'play1': 47, 'play2': 55, 'play3': 12}
best_player = None
best_score = -99999999  # a really small number

for player in players:
    current_player_score = players[player]
    if current_player_score > best_score:
        best_player = player
        best_score = current_player_score

print("Best Player = " + best_player)

prints: Best Player = play2
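For comparison only, a sketch of the same result using Python's built-in `max` instead of a hand-written loop. This is a swapped-in shortcut, not the loop technique the card is about:

```python
players = {'play1': 47, 'play2': 55, 'play3': 12}

# max goes over the keys; key=players.get says "compare them by their score"
best_player = max(players, key=players.get)
best_score = players[best_player]

print("Best Player = " + best_player)
```

The explicit loop is still worth knowing, since it shows what happens step by step.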