M04 - Pandas Flashcards

(58 cards)

1
Q

Series

A
  • One-dimensional, labeled array capable of holding any data type
  • Data is linear and has an index that acts as a key in the dictionary
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Series syntax

A

list_var = [‘list’]

series_var = pandas^.Series(list_var)

^ = pandas can be whatever alias you assign it when importing the dependency

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Retrieve a series syntax

A

series_var

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

DataFrame

A

-2-dimensional labeled data structure w/ rows and columns of potentially different data types where data is aligned in a table

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

DataFrame from Dictionary syntax

A

var_df = pandas^.DataFrame(dict_var)

^ = pandas can be whatever alias you assign when importing the dependency

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Retrieve a DataFrame syntax

A

var_df

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

DataFrame naming best practices

A

Name with “_df” at the end to distinguish DataFrames from Series and Variables

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

DataFrame from List(s) syntax

A
#Create empty _df
var_df = pd.DataFrame( )
#Add List to _df
var_df['Column Header of my Choosing'] = list_var
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

3 Main Parts of a DataFrame & how you can access them

A
  1. Columns: the top/header rows
  2. Index: Numbers down the left-hand margin
  3. Values: values in the columns (the data)

Can be accessed w/ the columns, index, and values attributes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Columns attribute syntax + Output

A

var_df.columns

Index( [‘Column1’ , ‘Column2’ , ….] ), dtype = ‘object’

Object may be other data type? tbd

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Index attribute syntax + output

A

var_df.index

RangeIndex(start = 0 , stop = endIndex , step = increment)

i.e. var_df has 5 entries, incremented by 1
RangeIndex(start = 0, stop = 5, step = 1)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Values attribute syntax + output

A

var_df.values

Outputs the values without column names (ex. below has 3 columns ID, School, Type):
array( [ [ 0, ‘Huang High School’ , ‘District’ ] ,
[1, ‘Figueroa High School’ , ‘District’] , … ] dtype = object)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Convert csv file into DataFrame syntax/example

A
# Declare filename variable for csv
file_to_load = os.path.join('path' , 'filename.csv')
#Create DataFrame
file_data_df = pd.read_csv(file_to_load)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

head( ) and tail( ) methods: syntax + what they do

A

var_df.head( ) - returns top 5 rows of DF

var_df.tail( ) - returns last 5 rows of DF

inserting a number in the ( ) will return that many rows from top/bottom i.e. var_df.head(10) will return top 10 rows

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

count( ) method: what it does + syntax

A

Provies a count for the rows for each column containing data. “Null” values are not counted by default.

var_df.count( )

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

isnull( ) method: what it does + syntax

A

Determines empty rows. Returns boolean T/F. True if empty, False if not.

var_df.isnull( )

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

sum( ) method w/ isnull() or notnull(): what it does + syntax + output

A

Gets total number of empty rows that are marked as “True”

var_df.isnull( ).sum( )

Outputs all column names and sum of “True” values in each column

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

notnull( ) method: what it does + syntax

A

Returns T/F, w/ “True” for not empty and “False” if it’s empty value

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

NaN in a DataFrame

A

Means ‘not a number’ and cannot be equal to zero

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Options for Missing Data

A
  1. Do Nothing
  2. Drop the Row
  3. Fill in the Row
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Do Nothing (missing data) considerations

A
  • NaNs will not be considered in the sum or averages

- If we wish to multiply/divide with a row that has a NaN, the answer will be NaN

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Drop the Row (missing data) considerations

A
  • Removing the row removes the all data in associated with that row
    1. How much data would be removed if NaNs are dropped?
    2. How much would this impact the analysis?
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Method to drop a row with NaNs + syntax + note about indexes

A

dropna( )

var_df.dropna( )

-Indexes do not reset automatically (0, 1, 2, 3) w/ 2 dropped is now (0, 1, 3)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Fill in the Row (missing data) considerations

A
  • Must be used with caution

- Must carefully consider the values you insert for every downstream analysis perfomed

25
Method to fill in rows w/ NaNs + syntax
fillna( ) var_df.fillna(value/'value')
26
6 Common Data Types
1. Boolean 2. Integer (32bit) 3. Integer (64bit) 4. Float 5. Object 6. Datetime
27
Boolean (Pandas Name/Ex.)
Name: bool Ex: True and False
28
Integer (Pandas Name/Ex for both)
Name: int32 or int64 Ex. int32: -2,147,483,648 to 2,147,483,647 Ex. int64: -9,223,3720,036,854,775,808 to 9,223,372,036,854,775,807
29
Float (Pandas Name/Ex)
float64 Floating Decimal
30
Object (Pandas Name/Ex)
Name: O or object Ex: Typically strings; often used as a catchall for columns w/ different data types or other Python objects like tuples, lists, an dictionaries
31
Datetime (Pandas Name/Ex)
datetime64 Ex: Specific moment in time w/ nanosecond precision 2019-06-03 16:04:00.465107
32
dtypes attribute + syntax + output
Lets you check the data type of each column on a DataFrame var_df.dtypes Returns Column headers w/ Pandas Data name
33
Both syntaxes: dtypes on specific column
If column has NO SPACES in name: var_df.column.dtypes If column has SPACES: var_df['column name'].dtypes
34
Cleaning/Testing code solutions best practices
Create a copy of or new, separate file for cleaning/testing from the source code you are working on.
35
tolist( ) method - what it does + syntax
-Will add all data in a specified column to a list tolist_var = var_df[ "Column Name"].tolist( )
36
split( ) method - what it does + syntax
- Will split a Python string object on whitespace, or where there is no text var. split( )
37
Get length of a split syntax
len(var.split( ) )
38
set( ) method - what it does + syntax
-Returns all unique items/values in a LIST when the list is added inside parentheses set(list_var)
39
strip( ) method - what it does + syntax
- Removes any combination of letters and words that are inside the parentheses var. strip("value")
40
replace( ) method - what it does + syntax
- Replaces and 'old' phrase/string with a new one | var. replace('Old' , 'New')
41
merge( ) method - what it does + syntax
Merges two DataFrames on a common column (think Join) merged_var_df = pandas.merge(var1_df , var2_df, on = ['var1_df_columnheader' , 'var2_df_columnheader'] )
42
If you have 2 DataFrames with same info but different column titles, you should:
-Rename the columns to match, this helps avoid duplicate columns or merging issues
43
unique( ) method - what it does (note the output type it provides) + syntax
-Returns an ARRAY or LIST of all unique values in a given column of a DATAFRAME varX_df = var_df['column_name'].unique( )
44
Get a count of unique values in a DataFrame column syntax
len(var_df['column_name'].unique( ) )
45
Method + syntax for getting the average of a column
mean( ) var_df['column_name'].mean( )
46
map( ) function - what it does + syntax
- Used for substituting each value in a Series with another value. Where the new value is generated from a function, a dictionary, or a Series - Note, if there are multiples of a current value, you only need to map it once it will change all instances to the new value series_var.map( { 'current value1' : 'new value1' , 'current value2' : 'new value 2' , ... } )
47
Function + 4 Basic Parts
- Smaller, more manageable piece of code - Good for repetitive tasks 1. The name, which is what we call the function 2. The parameters, which are values we send to the function 3. The code block, which are the statements under the function that perform the task 4. The return value, which is what the function gives back, or 'returns' to use when the task is complete
48
Function Syntax
``` def fxn_name( ): (tab)instructions ```
49
format( ) function - what it does + syntax
- Used to format a value to a specific format - I.e. decimal places, adding separators, etc. "{value : format specification}".format(value) Ex. format 92.34 held as my_var print("{: .0f}".format(my_var) Output: 92
50
Reorder Columns syntax
``` #Set var w/ column order how you want new_column_order = ['col2' , 'col4' , 'col1' , 'col3'] ``` ``` # Assign a new or same DataFrame to the new column order var_df = var_df[new_column_order] ```
51
set_index( ) method - what it does + syntax
-Returns a series wit the index set to a specified column var_name = var_df.set_index( [ 'column_name1' ] ) ["column_name2"] column_name1 will be the index, column_name2 will be the variable
52
value_counts( ) method - what it does + syntax
-Returns a Series that counted + totals each unique entry in a column var_Series = var_df[ 'column_to_count'].value_counts( )
53
grouby( ) function - what it does + syntax (with mean( ) )
-Splits an object (like a DataFrame), apply a mathematical operation, and combine the results var = var_df.groupby( ['column_name'] ).mean( )
54
sort_values( ) function - what it does + syntax
- Sorts values in a DataFrame or Series for a given text, index, or column that is passed within the parentheses - Can add parameter named 'ascending' (type: bool), default is ascending=True var = var_df.sort_values(['Column_name'], ascending=False)
55
describe( ) method - what it does/returns
-Runs on DataFrame or Series -Returns: +Number of rows in DF or Series +Average of the rows as mean +St Dev of the rows as std +Minimum value of the rows as min +25th percentile as 25% +50th percentile as 50% +75th percentile as 75% +Maximum value of the rows as max
56
describe( ) syntax
var_df.describe( )
57
cut( ) function - what it does + syntax
- Segments and sorts data values into bins - When making a variable for ranges, must include a value lower than the lowest value (i.e. 0 in the case of school district analysis) pandas.cut(var_df , var_ranges)
58
Add a List or Series to a DataFrame syntax
var_df = pandas.DataFrame({'Col_Name1': 'Col_Values1' , 'Col_Name2' : Col_Values2 , ... })