M04 - Pandas Flashcards

1
Q

Series

A
  • One-dimensional, labeled array capable of holding any data type
  • Data is linear and has an index that acts like a key in a dictionary
2
Q

Series syntax

A

list_var = ['list']

series_var = pandas^.Series(list_var)

^ = pandas can be whatever alias you assign it when importing the dependency
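A minimal sketch of the Series card, assuming the common alias pd and a short list of school names chosen only for illustration:

```python
import pandas as pd

# Illustrative sample data (an assumption, not part of the flashcards)
list_var = ["Huang High School", "Figueroa High School"]

# Build the Series; pandas assigns the default index 0, 1, ...
series_var = pd.Series(list_var)

# The index works like a dictionary key for looking up a value
second_school = series_var[1]
print(second_school)
```

Typing series_var on its own line in a notebook displays the index next to each value.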

3
Q

Retrieve a series syntax

A

series_var

4
Q

DataFrame

A

- 2-dimensional labeled data structure w/ rows and columns of potentially different data types, where data is aligned in a table

5
Q

DataFrame from Dictionary syntax

A

var_df = pandas^.DataFrame(dict_var)

^ = pandas can be whatever alias you assign when importing the dependency
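A small sketch of building a DataFrame from a dictionary, assuming the alias pd. The keys become column headers and each list becomes a column; the column names and values are sample data for illustration.

```python
import pandas as pd

# Each key is a column header; each list holds that column's values
dict_var = {
    "ID": [0, 1],
    "School": ["Huang High School", "Figueroa High School"],
    "Type": ["District", "District"],
}

var_df = pd.DataFrame(dict_var)
print(var_df)
```

All lists should have the same length, since each one fills the same set of rows.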

6
Q

Retrieve a DataFrame syntax

A

var_df

7
Q

DataFrame naming best practices

A

Name with “_df” at the end to distinguish DataFrames from Series and Variables

8
Q

DataFrame from List(s) syntax

A
# Dependency (pd is the alias used below)
import pandas as pd
# Create empty _df
var_df = pd.DataFrame()
# Add List to _df
var_df['Column Header of my Choosing'] = list_var
9
Q

3 Main Parts of a DataFrame & how you can access them

A
  1. Columns: the column labels across the top (header row)
  2. Index: Numbers down the left-hand margin
  3. Values: values in the columns (the data)

Can be accessed w/ the columns, index, and values attributes
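The three parts can be inspected on a tiny DataFrame. This is a sketch that assumes the alias pd and made-up sample rows:

```python
import pandas as pd

var_df = pd.DataFrame({
    "ID": [0, 1],
    "School": ["Huang High School", "Figueroa High School"],
})

# Column labels across the top
print(var_df.columns)

# Row labels down the left-hand margin (a RangeIndex by default)
print(var_df.index)

# The underlying data as a 2-D array, without any labels
print(var_df.values)
```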

10
Q

Columns attribute syntax + Output

A

var_df.columns

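Index(['Column1', 'Column2', …], dtype='object')

dtype describes the column labels themselves: string labels typically show as object; labels of another type (e.g. integers) show a different dtype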

11
Q

Index attribute syntax + output

A

var_df.index

RangeIndex(start = 0 , stop = endIndex , step = increment)

e.g. var_df has 5 entries, incremented by 1 (stop is exclusive, so here it equals the number of rows)
RangeIndex(start = 0, stop = 5, step = 1)

12
Q

Values attribute syntax + output

A

var_df.values

Outputs the values without column names (ex. below has 3 columns: ID, School, Type):
array([[0, 'Huang High School', 'District'],
       [1, 'Figueroa High School', 'District'], …], dtype=object)

13
Q

Convert csv file into DataFrame syntax/example

A
# Dependencies
import os
import pandas as pd
# Declare filename variable for csv
file_to_load = os.path.join('path', 'filename.csv')
# Create DataFrame
file_data_df = pd.read_csv(file_to_load)
14
Q

head( ) and tail( ) methods: syntax + what they do

A

var_df.head( ) - returns top 5 rows of DF

var_df.tail( ) - returns last 5 rows of DF

Passing a number in the ( ) returns that many rows from the top/bottom, e.g. var_df.head(10) returns the first 10 rows
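A quick sketch of head( ) and tail( ) on a 20-row DataFrame, assuming the alias pd; the single column named value is illustrative only:

```python
import pandas as pd

# 20 rows holding the numbers 0..19
var_df = pd.DataFrame({"value": range(20)})

first_five = var_df.head()    # default: first 5 rows
last_three = var_df.tail(3)   # last 3 rows

print(first_five)
print(last_three)
```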

15
Q

count( ) method: what it does + syntax

A

Provides a count of the rows in each column that contain data. “Null” values are not counted.

var_df.count( )

16
Q

isnull( ) method: what it does + syntax

A

Checks each value for missing data. Returns a DataFrame of booleans: True where a value is missing (null/NaN), False where it is not.

var_df.isnull( )

17
Q

sum( ) method w/ isnull() or notnull(): what it does + syntax + output

A

Gets the total number of missing values (marked “True” by isnull( )) in each column; w/ notnull( ) it totals the non-missing values instead

var_df.isnull( ).sum( )

Outputs all column names and sum of “True” values in each column
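The count( ), isnull( ), and notnull( ) cards can be compared side by side. This is a sketch assuming the alias pd and a made-up DataFrame with a few missing entries:

```python
import pandas as pd

var_df = pd.DataFrame({
    "name": ["Ann", None, "Cy"],
    "score": [90.0, float("nan"), float("nan")],
})

# True/False for every cell, then summed per column (True counts as 1)
missing_per_column = var_df.isnull().sum()
present_per_column = var_df.notnull().sum()

# count() reports the non-missing values per column
count_per_column = var_df.count()

print(missing_per_column)
print(count_per_column)
```

In this sketch count( ) and notnull( ).sum( ) give the same totals, which is a handy cross-check.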

18
Q

notnull( ) method: what it does + syntax

A

Returns T/F for each value, w/ “True” if it’s not empty and “False” if it’s an empty (null/NaN) value

var_df.notnull( )

19
Q

NaN in a DataFrame

A

Means ‘not a number’; it marks a missing value and is not equal to zero (or to any other value)

20
Q

Options for Missing Data

A
  1. Do Nothing
  2. Drop the Row
  3. Fill in the Row
21
Q

Do Nothing (missing data) considerations

A
  • NaNs will not be considered in sums or averages
  • If we multiply/divide with a value that is NaN, the answer will be NaN
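Both considerations show up in a three-value Series. A minimal sketch, assuming the alias pd and illustrative numbers:

```python
import pandas as pd

series_var = pd.Series([10.0, float("nan"), 30.0])

# sum() and mean() skip the NaN by default
total = series_var.sum()      # 40.0
average = series_var.mean()   # 20.0 (averaged over the 2 real values)

# Arithmetic on a NaN entry stays NaN
doubled = series_var * 2

print(total, average)
print(doubled)
```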

22
Q

Drop the Row (missing data) considerations

A
  • Removing the row removes all data associated with that row
    1. How much data would be removed if NaNs are dropped?
    2. How much would this impact the analysis?
23
Q

Method to drop a row with NaNs + syntax + note about indexes

A

dropna( )

var_df.dropna( )

-Indexes do not reset automatically: (0, 1, 2, 3) w/ row 2 dropped becomes (0, 1, 3)
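The index gap can be seen directly in a sketch like the one below, which assumes the alias pd and sample scores. The reset_index(drop=True) call is not covered by the cards; it is shown only as one possible way to renumber the rows.

```python
import pandas as pd

var_df = pd.DataFrame({"score": [90.0, 85.0, float("nan"), 70.0]})

# Drop every row that contains a NaN; the original labels are kept
dropped_df = var_df.dropna()
print(list(dropped_df.index))  # [0, 1, 3]

# Optional: renumber the rows 0, 1, 2 (not part of the flashcards)
renumbered_df = dropped_df.reset_index(drop=True)
print(list(renumbered_df.index))  # [0, 1, 2]
```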

24
Q

Fill in the Row (missing data) considerations

A
  • Must be used with caution

- Must carefully consider the values you insert for every downstream analysis performed

25
Q

Method to fill in rows w/ NaNs + syntax

A

fillna( )

var_df.fillna(value)

value can be a number or a string in quotes, e.g. var_df.fillna(0) or var_df.fillna('value')
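A short sketch of fillna( ), assuming the alias pd and a fill value of 0 chosen purely for illustration (whether 0 is a sensible fill depends on the analysis):

```python
import pandas as pd

var_df = pd.DataFrame({"score": [90.0, float("nan")]})

# fillna returns a new DataFrame; var_df itself is unchanged
filled_df = var_df.fillna(0)

print(filled_df)
```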

26
Q

6 Common Data Types

A
  1. Boolean
  2. Integer (32bit)
  3. Integer (64bit)
  4. Float
  5. Object
  6. Datetime
27
Q

Boolean (Pandas Name/Ex.)

A

Name: bool
Ex: True and False

28
Q

Integer (Pandas Name/Ex for both)

A

Name: int32 or int64
Ex. int32: -2,147,483,648 to 2,147,483,647
Ex. int64: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

29
Q

Float (Pandas Name/Ex)

A

Name: float64

Ex: Floating-point (decimal) number

30
Q

Object (Pandas Name/Ex)

A

Name: O or object

Ex: Typically strings; often used as a catchall for columns w/ different data types or other Python objects like tuples, lists, and dictionaries

31
Q

Datetime (Pandas Name/Ex)

A

datetime64

Ex: Specific moment in time w/ nanosecond precision
2019-06-03 16:04:00.465107

32
Q

dtypes attribute + syntax + output

A

Lets you check the data type of each column on a DataFrame

var_df.dtypes

Returns each column header w/ its Pandas data type name

33
Q

Both syntaxes: dtypes on specific column

A

If column has NO SPACES in name:
var_df.column.dtypes

If column has SPACES:
var_df['column name'].dtypes
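Both access styles can be tried on a small DataFrame. This sketch assumes the alias pd; the column names (one without a space, one with) are made up for illustration.

```python
import pandas as pd

var_df = pd.DataFrame({
    "ID": [0, 1],
    "avg score": [81.5, 77.0],
    "passed": [True, False],
})

# Data type of every column
print(var_df.dtypes)

# No spaces in the name: attribute-style access works
id_type = var_df.ID.dtypes

# Spaces in the name: bracket access is required
score_type = var_df["avg score"].dtypes

print(id_type, score_type)
```

Bracket access also works for names without spaces, so it is a safe default.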

34
Q

Cleaning/Testing code solutions best practices

A

Create a copy of or new, separate file for cleaning/testing from the source code you are working on.

35
Q

tolist( ) method - what it does + syntax

A

-Will add all data in a specified column to a list

tolist_var = var_df['Column Name'].tolist( )

36
Q

split( ) method - what it does + syntax

A
  • Splits a Python string into a list of substrings; w/ no argument it splits on whitespace
    var.split( )
37
Q

Get length of a split syntax

A

len(var.split( ) )

38
Q

set( ) method - what it does + syntax

A

-Returns a set of all unique items/values in a LIST when the list is passed inside the parentheses (set( ) is a built-in Python function)

set(list_var)
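tolist( ) and set( ) are often chained: pull a column into a list, then reduce it to its unique values. A sketch assuming the alias pd and an illustrative Type column:

```python
import pandas as pd

var_df = pd.DataFrame({"Type": ["District", "Charter", "District"]})

# Column -> plain Python list (duplicates kept, order kept)
tolist_var = var_df["Type"].tolist()

# List -> set of unique values (sets are unordered)
unique_types = set(tolist_var)

print(tolist_var)
print(len(unique_types))
```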

39
Q

strip( ) method - what it does + syntax

A
  • Removes any combination of the characters inside the parentheses from the beginning and end of a string (w/ no argument, removes leading/trailing whitespace)
    var.strip("value")
40
Q

replace( ) method - what it does + syntax

A
  • Replaces an ‘old’ phrase/string with a new one

var.replace('Old', 'New')
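The three string methods above (split, strip, replace) behave differently in ways that are easy to mix up. The sketch below uses a made-up name string; note that strip treats its argument as a set of characters rather than a whole word.

```python
name = "Dr. Linda Santiago"

# split(): break on whitespace into a list
parts = name.split()          # ['Dr.', 'Linda', 'Santiago']
word_count = len(parts)       # 3

# strip(): removes any of the given characters from both ends only
trimmed = "xxhixx".strip("x")         # 'hi'
surprise = "Dr. Drew".strip("Dr. ")   # 'ew' (also eats the D and r of Drew)

# replace(): swaps every occurrence of the old text for the new text
no_title = name.replace("Dr. ", "")   # 'Linda Santiago'

print(parts, word_count, trimmed, surprise, no_title)
```

For removing a prefix such as a title, replace( ) is usually the safer choice of the two.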

41
Q

merge( ) method - what it does + syntax

A

Merges two DataFrames on a common column (think Join)

merged_var_df = pandas.merge(var1_df , var2_df , on = ['common_columnheader'] )

on names the column(s) that appear in both DataFrames

42
Q

If you have 2 DataFrames with same info but different column titles, you should:

A

-Rename the columns to match; this helps avoid duplicate columns or merging issues
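The rename-then-merge workflow might look like the sketch below. It assumes the alias pd; the DataFrame names, column names, and rows are hypothetical, and rename(columns=...) is one possible way to make the headers match.

```python
import pandas as pd

students_df = pd.DataFrame({
    "student": ["Ann", "Bo", "Cy"],
    "school_name": ["Huang High School", "Figueroa High School",
                    "Huang High School"],
})
schools_df = pd.DataFrame({
    "name": ["Huang High School", "Figueroa High School"],
    "Type": ["District", "District"],
})

# Make the shared column carry the same header in both DataFrames
schools_df = schools_df.rename(columns={"name": "school_name"})

# Join on the common column; each student row picks up its school's Type
merged_var_df = pd.merge(students_df, schools_df, on=["school_name"])
print(merged_var_df)
```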

43
Q

unique( ) method - what it does (note the output type it provides) + syntax

A

-Returns an ARRAY (not a list) of all unique values in a given column of a DATAFRAME

unique_var = var_df['column_name'].unique( )

44
Q

Get a count of unique values in a DataFrame column syntax

A

len(var_df['column_name'].unique( ) )

45
Q

Method + syntax for getting the average of a column

A

mean( )

var_df['column_name'].mean( )
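Counting unique values and averaging a column can be sketched together. The example assumes the alias pd and made-up Type and score columns:

```python
import pandas as pd

var_df = pd.DataFrame({
    "Type": ["District", "Charter", "District"],
    "score": [70.0, 90.0, 80.0],
})

# Unique values come back in order of first appearance
unique_values = var_df["Type"].unique()
unique_count = len(unique_values)        # 2

# Average of a numeric column
average_score = var_df["score"].mean()   # 80.0

print(unique_values, unique_count, average_score)
```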

46
Q

map( ) function - what it does + syntax

A
  • Used for substituting each value in a Series with another value, where the new value is generated from a function, a dictionary, or a Series
  • Note: if a current value appears multiple times, you only need to map it once; all instances change to the new value

series_var.map( { 'current value1' : 'new value1' , 'current value2' : 'new value 2' , … } )
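A sketch of map( ) with a dictionary and with a function, assuming the alias pd and illustrative values. It also shows a detail the card does not mention: with a dictionary, values that have no matching key come back as NaN.

```python
import pandas as pd

series_var = pd.Series(["District", "Charter", "District"])

# Dictionary: each key is mapped once and applies to every occurrence
mapped = series_var.map({"District": "D", "Charter": "C"})

# Function: called on every value
lengths = series_var.map(len)

# Missing key: "Charter" has no entry, so it becomes NaN
partial = series_var.map({"District": "D"})

print(mapped.tolist())
print(lengths.tolist())
print(partial)
```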

47
Q

Function + 4 Basic Parts

A
  • Smaller, more manageable piece of code
  • Good for repetitive tasks
    1. The name, which is what we call the function
    2. The parameters, which are values we send to the function
    3. The code block, which are the statements under the function that perform the task
    4. The return value, which is what the function gives back, or ‘returns’, to us when the task is complete
48
Q

Function Syntax

A
def fxn_name(parameters):
    # instructions (indented one level under the def line)
    return value
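The four parts can be seen in a concrete function. The name passing_percentage and its parameters are hypothetical, chosen only to illustrate the pattern:

```python
# Name: passing_percentage; parameters: passing_count, total_count
def passing_percentage(passing_count, total_count):
    # Code block: the indented statements that perform the task
    percentage = passing_count / total_count * 100
    # Return value: handed back to the caller
    return percentage


result = passing_percentage(30, 40)
print(result)  # 75.0
```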
49
Q

format( ) function - what it does + syntax

A
  • Used to format a value to a specific format
  • I.e. decimal places, adding separators, etc.

"{value:format specification}".format(value)

Ex. format 92.34 held as my_var
print("{:.0f}".format(my_var))

Output: 92
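A few common format specifications, sketched on sample numbers. The specifications shown (.0f, .1f, and the comma separator) are standard Python; the values are illustrative.

```python
my_var = 92.34

no_decimals = "{:.0f}".format(my_var)       # '92'
one_decimal = "{:.1f}".format(my_var)       # '92.3'
with_separator = "{:,}".format(1234567)     # '1,234,567'

print(no_decimals, one_decimal, with_separator)
```

Note that format( ) returns a string, so the result is no longer a number that can be used in arithmetic.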

50
Q

Reorder Columns syntax

A
#Set var w/ column order how you want
new_column_order = ['col2' , 'col4' , 'col1' , 'col3']
# Assign a new or same DataFrame to the new column order
var_df = var_df[new_column_order]
51
Q

set_index( ) method - what it does + syntax

A

-Returns a DataFrame w/ the index set to a specified column; selecting one column from the result gives a Series

var_name = var_df.set_index( ['column_name1'] )['column_name2']

column_name1 will be the index, column_name2 will supply the values
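Splitting the one-liner into two steps makes the types visible. A sketch assuming the alias pd, with school_name and Type as illustrative column names:

```python
import pandas as pd

var_df = pd.DataFrame({
    "school_name": ["Huang High School", "Figueroa High School"],
    "Type": ["District", "District"],
})

# Step 1: still a DataFrame, now labeled by school_name
indexed_df = var_df.set_index(["school_name"])

# Step 2: picking one column yields a Series keyed by school_name
var_name = indexed_df["Type"]

print(var_name["Huang High School"])
```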

52
Q

value_counts( ) method - what it does + syntax

A

-Returns a Series w/ the count of each unique entry in a column

var_Series = var_df['column_to_count'].value_counts( )
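A sketch of value_counts( ) on a three-row column, assuming the alias pd and illustrative values. The unique entries become the index of the result, sorted from most to least frequent.

```python
import pandas as pd

var_df = pd.DataFrame({"Type": ["District", "Charter", "District"]})

var_Series = var_df["Type"].value_counts()
print(var_Series)

# Look up one count by its label
district_count = var_Series["District"]
```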

53
Q

groupby( ) method - what it does + syntax (with mean( ) )

A

-Splits an object (like a DataFrame) into groups, applies a mathematical operation to each group, and combines the results

var = var_df.groupby( ['column_name'] ).mean( )
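Split-apply-combine on a tiny DataFrame, sketched with the alias pd and made-up Type and score columns:

```python
import pandas as pd

var_df = pd.DataFrame({
    "Type": ["District", "Charter", "District"],
    "score": [70.0, 90.0, 80.0],
})

# Split by Type, average each group's score, combine into one table
var = var_df.groupby(["Type"]).mean()
print(var)

district_mean = var.loc["District", "score"]   # 75.0
```

If the DataFrame also holds text columns, recent pandas versions may raise an error when averaging them; selecting the numeric columns first, or passing numeric_only=True to mean( ), is one way around that.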

54
Q

sort_values( ) function - what it does + syntax

A
  • Sorts a DataFrame or Series by the values in the column(s) passed within the parentheses
  • Can add parameter named 'ascending' (type: bool), default is ascending=True

var = var_df.sort_values(['Column_name'], ascending=False)
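A sketch of a descending sort, assuming the alias pd and illustrative school and score columns. As with dropna( ), the original index labels travel with their rows.

```python
import pandas as pd

var_df = pd.DataFrame({
    "school": ["A", "B", "C"],
    "score": [70, 90, 80],
})

# Highest score first; var_df itself is left unchanged
var = var_df.sort_values(["score"], ascending=False)
print(var)
```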

55
Q

describe( ) method - what it does/returns

A

-Runs on DataFrame or Series (for numeric data)
-Returns:
+Number of non-null values as count
+Average of the values as mean
+St Dev of the values as std
+Minimum value as min
+25th percentile as 25%
+50th percentile (median) as 50%
+75th percentile as 75%
+Maximum value as max

56
Q

describe( ) syntax

A

var_df.describe( )
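describe( ) on a four-value Series, sketched with the alias pd and illustrative numbers. The result is itself a Series, so individual statistics can be looked up by label.

```python
import pandas as pd

series_var = pd.Series([1.0, 2.0, 3.0, 4.0])

summary = series_var.describe()
print(summary)

# Individual statistics are accessed by their label
mean_value = summary["mean"]     # 2.5
median_value = summary["50%"]    # 2.5
```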

57
Q

cut( ) function - what it does + syntax

A
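  • Segments and sorts data values into bins
  • When making a variable for ranges, must include a value lower than the lowest value (i.e. 0 in the case of school district analysis), because by default each bin excludes its left edge

pandas.cut(var_df['column_name'] , var_ranges)

Pass a single column or Series (1-dimensional data), not a whole DataFrame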
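A sketch of binning with cut( ), assuming the alias pd. The sizes, bin edges, and the labels Small/Medium/Large are made up for illustration. The last lines show why the lowest edge must sit below the smallest value.

```python
import pandas as pd

sizes = pd.Series([500, 1500, 3000])
var_ranges = [0, 1000, 2000, 5000]

# Each value is placed in an interval such as (0, 1000]
binned = pd.cut(sizes, var_ranges)

# Optional labels: one per bin (one fewer than the number of edges)
labeled = pd.cut(sizes, var_ranges, labels=["Small", "Medium", "Large"])
print(labeled.tolist())

# A value equal to the lowest edge falls outside every bin and becomes NaN
edge_case = pd.cut(pd.Series([0]), var_ranges)
print(edge_case)
```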

58
Q

Add a List or Series to a DataFrame syntax

A

var_df = pandas.DataFrame({'Col_Name1' : Col_Values1 , 'Col_Name2' : Col_Values2 , … })