Pandas Primer Flashcards Preview


Flashcards in Pandas Primer Deck (56)

How do you read a comma-delimited file in pandas?

df = pd.read_csv(filepath)


How do you create a dataframe in pandas? (2)

  • With a list of rows:
    • df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['col1', 'col2', 'col3'])
  • With a dictionary mapping column names to column values:
    • df = pd.DataFrame({"ID": [1, 2, 3], "First Name": ["John", "Jim", "Joe"], "Last Name": ["Smith", "Hendry", "Wilson"]})


How do you show a dataframe in pandas?

display(df) in a notebook, or print(df) in a plain script


How do you access a cell in a dataframe?

  • .loc[row_label] (e.g. .loc['cobra']) or .iloc[row_index] selects an entire row
  • .loc[row_label, col_label] or .iloc[row_index, col_index] selects a single cell
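
A minimal sketch of the access patterns above. The frame, its row labels "a" to "c", and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical example frame with string row labels.
df = pd.DataFrame(
    {"First Name": ["John", "Jim", "Joe"], "Last Name": ["Smith", "Hendry", "Wilson"]},
    index=["a", "b", "c"],
)

row_by_label = df.loc["b"]                  # whole row, selected by label
row_by_position = df.iloc[1]                # same row, selected by position
cell_by_label = df.loc["b", "Last Name"]    # single cell, by labels
cell_by_position = df.iloc[1, 1]            # single cell, by positions

print(cell_by_label, cell_by_position)      # Hendry Hendry
```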


What selection features does pandas support? (2)

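  • Slicing in .loc and .iloc with start:end (a .loc slice includes the end label; an .iloc slice excludes the end position)
  • Array indexing: pass a list or array of labels (.loc) or integer positions (.iloc)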
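
A short sketch of both selection features, using a made-up frame with string row labels so that label slices and position slices are easy to tell apart.

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame(
    {"ID": [1, 2, 3, 4], "Name": ["John", "Jim", "Joe", "Kim"]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c"]              # label slice: the end label "c" is included
by_position = df.iloc[0:2]              # position slice: position 2 is excluded
picked = df.loc[["a", "d"], ["Name"]]   # array indexing with lists of labels

print(len(by_label), len(by_position))  # 3 2
```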


What's the difference between .loc and .iloc?

  • .loc is label-based: use row and column labels
  • .iloc is integer position-based: use integer positions


How do you show the first few rows of a dataframe?

df.head() shows the first 5 rows by default; df.head(n) shows the first n


What's one interesting thing about pandas slicing?

  • When you want an entire column, you can skip .loc / .iloc (and the ":" placeholder and comma) and index the dataframe directly with the column label
  • When you want an entire row, you can omit the trailing ", :" (df.loc[1] is equivalent to df.loc[1, :])
  • E.g.
    • display(df.loc[:, "Last Name"]) is equivalent to display(df["Last Name"])
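
The equivalence can be checked directly. This sketch reuses the example frame from the dictionary card and uses print-free comparisons instead of display(), so it also runs outside a notebook.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "First Name": ["John", "Jim", "Joe"],
                   "Last Name": ["Smith", "Hendry", "Wilson"]})

# Entire column: explicit ":" for the rows versus the shorthand.
same_column = df.loc[:, "Last Name"].equals(df["Last Name"])

# Entire row: the trailing ", :" is optional.
same_row = df.loc[1, :].equals(df.loc[1])
```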


How do you set an entry in a DataFrame?

df.loc[1,"Last Name"] = "some_val" 


How do you set a row in pandas?

df.loc[3,:] = (100, "Andrew", "Moore") 


What happens if you try to set a row and the input index doesn't exist?

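With .loc, a new row with that label is appended to the end. (.iloc cannot add rows; an out-of-range position raises an IndexError.)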
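
A small sketch of the three assignments from the cards above, applied in sequence to the example frame. The values written into rows 2 and 3 are made up.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "First Name": ["John", "Jim", "Joe"],
                   "Last Name": ["Smith", "Hendry", "Wilson"]})

df.loc[1, "Last Name"] = "some_val"        # overwrite a single cell
df.loc[2, :] = (100, "Andrew", "Moore")    # overwrite the existing row 2
df.loc[3, :] = (101, "Kim", "Kilter")      # label 3 does not exist: a row is appended

print(len(df))  # 4
```

Note that appending a row this way can change a column's dtype (for example, an integer column may become float), so it is worth checking df.dtypes afterwards.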


How do we select a subset of the rows that satisfy some conditions from a dataframe?

df[(df["First Name"] == "Jim") & (df["Last Name"] == "Kilter")]


Given this dataframe, how do you find rows where Last Name has 6 characters?

df[df["Last Name"].str.len() == 6]


Given this dataframe, how do you find rows where First Name contains the substring "Jo"?

df[df["First Name"].str.contains("Jo")]


Given this dataframe, how do you find rows where First Name is either "Jim" or "Kim"?

df[df["First Name"].isin(["Jim", "Kim"])]


What do you do if you want to find rows that do not satisfy a certain condition?

Use the negation symbol ~ 

E.g. df[~df["First Name"].isin(["Jim", "Kim"])]
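
The filters from the last few cards, run on one made-up frame so the results can be compared side by side.

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({"First Name": ["John", "Jim", "Joe", "Kim"],
                   "Last Name": ["Smith", "Kilter", "Wilson", "Hendry"]})

both = df[(df["First Name"] == "Jim") & (df["Last Name"] == "Kilter")]
six_chars = df[df["Last Name"].str.len() == 6]          # Kilter, Wilson, Hendry
has_jo = df[df["First Name"].str.contains("Jo")]        # John, Joe
jim_or_kim = df[df["First Name"].isin(["Jim", "Kim"])]
neither = df[~df["First Name"].isin(["Jim", "Kim"])]    # negated condition
```

Each comparison needs its own parentheses when combined with & or |, because those operators bind more tightly than ==.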


What's one trick we can use to speed up the selection of pandas rows? What does it do?

  • Use a query string to select rows:
    • df.query('(`First Name` == "John") & (`Last Name` == "Smith")')
  • This can avoid creating the intermediate boolean index and reduce runtime / memory usage.
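
A quick check that the query string selects the same rows as the equivalent boolean mask. It makes no timing claim; any speedup depends on the size of the data and the installed pandas setup.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "First Name": ["John", "Jim", "Joe"],
                   "Last Name": ["Smith", "Hendry", "Wilson"]})

# Boolean-mask version: builds intermediate boolean Series first.
masked = df[(df["First Name"] == "John") & (df["Last Name"] == "Smith")]

# Query-string version: backticks quote column names that contain spaces.
queried = df.query('(`First Name` == "John") & (`Last Name` == "Smith")')
```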


What's important to remember about querying? (3)

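  1. The object returned by a query (or a boolean filter) is derived from the original data frame; do not treat it as an independent frame to modify in place.
  2. Modifying it will not affect the original data frame, and may yield a SettingWithCopyWarning depending on the pandas version. Call .copy() first if you intend to modify it.
  3. Unlike NumPy, pandas preserves the original row index after filtering. E.g. if you make a copy of a slice that has rows 2 and 5 and try to select label 1 from it with .loc, a KeyError is raised.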



How do you copy a dataframe?

df_copy = df.copy()


If our dataframe has no row with index 0, what do we do?

Call .reset_index(drop = True)

E.g. df_copy_reset_index = df_copy.reset_index(drop = True)
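
A sketch tying the last few cards together: a filtered copy keeps the original row labels, so label 1 is missing until the index is reset. The frame and the helper name label_1_found are made up.

```python
import pandas as pd

df = pd.DataFrame({"ID": [10, 20, 30, 40, 50, 60]})

# Filter, then copy so later edits cannot touch df.
df_copy = df[df["ID"].isin([30, 60])].copy()
kept_labels = list(df_copy.index)           # original labels survive: [2, 5]

try:
    df_copy.loc[1]
    label_1_found = True
except KeyError:
    label_1_found = False                   # label 1 was filtered out

df_copy.loc[2, "ID"] = -1                   # edits the copy only

# Renumber the rows 0..n-1 and drop the old labels.
df_copy_reset_index = df_copy.reset_index(drop=True)
```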


How can we iterate over rows of a dataframe, from slowest to fastest?


  • Slowest: use .iloc along with the row index.
  • Use the iterrows method.
  • Use apply with axis=1.
  • Fastest: use pandas vectorization.
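
The four approaches, each computing the same per-row sum on a made-up frame. This sketch only shows the syntax; it does not measure the speed ranking itself.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# 1. .iloc with a positional row index
via_iloc = [df.iloc[i]["a"] + df.iloc[i]["b"] for i in range(len(df))]

# 2. iterrows yields (index label, row Series) pairs
via_iterrows = [row["a"] + row["b"] for _, row in df.iterrows()]

# 3. apply with axis=1 calls the function once per row
via_apply = df.apply(lambda row: row["a"] + row["b"], axis=1).tolist()

# 4. vectorized: one operation on whole columns
via_vectorized = (df["a"] + df["b"]).tolist()
```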


How can we iterate over columns of a dataframe? (iteration syntax and how you index the column)


  1. Call .columns to get the list of column names and iterate over it 
  2. Use .iloc along with the column indexes
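
Both column-iteration styles, sketched on the example frame. The lists names_seen and lengths are hypothetical and exist only to collect something from each loop.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "First Name": ["John", "Jim", "Joe"]})

# 1. Iterate over the column names and index by label.
names_seen = []
for name in df.columns:
    column = df[name]
    names_seen.append(name)

# 2. Iterate over column positions and index with .iloc.
lengths = []
for j in range(df.shape[1]):
    column = df.iloc[:, j]
    lengths.append(len(column))
```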


What's important to note about iterrows()?

  • The row returned is generally a copy of the data, not a view
  • So changes made to the row during .iterrows() are not written back to the data frame
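
A sketch of the pitfall, assuming a frame with mixed column types (the case where the row is a copy). The update at the end uses a vectorized assignment, which is one possible alternative rather than the only one.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "First Name": ["John", "Jim", "Joe"]})

for idx, row in df.iterrows():
    row["ID"] = 0                      # changes only the temporary row Series

ids_after_loop = df["ID"].tolist()     # the data frame is unchanged

# To actually update the column, assign to the data frame itself.
df["ID"] = df["ID"] * 10
```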


What is the default of the apply method?

  • To loop through the columns (same as axis=0)
  • To loop through rows, use axis=1
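
A small illustration of the two axes on a made-up numeric frame. The lambdas are arbitrary examples.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Default (axis=0): the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)
```

Here col_ranges is indexed by column name and row_sums by row label.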


What can be used with pandas vectorization?

  1. Built-in pd.Series methods
  2. Operations that are compatible with NumPy arrays, for example basic math operations or Boolean conditions


What is the idea of pandas vectorization?

Applying an operation to the entire column array at once, instead of to individual column elements


If we can't vectorize an operation we want to apply to all rows, what should we do? Why?

  1. Use the apply method
  2. Why?
    • It's the 2nd fastest row iteration approach of the 4
    • It can work with any input function


If we have a vectorizable operation we want to apply over a set of rows, and the columns are numerical, what should we do?

Apply the vectorized operation to the underlying NumPy arrays (obtained by calling df[column_name].to_numpy()) for an even greater speedup.
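
A sketch comparing the Series-level and NumPy-level versions of the same vectorized operation. The column names price and qty are made up, and no timing is measured here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": [1.0, 2.0, 3.0], "qty": [2, 3, 4]})

# Vectorized on pandas Series: result is a Series that keeps the index.
total_series = df["price"] * df["qty"]

# Vectorized on the underlying NumPy arrays: result is a plain ndarray.
total_numpy = df["price"].to_numpy() * df["qty"].to_numpy()

# Assign the array back if the result should live in the data frame.
df["total"] = total_numpy
```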


What's one primary advantage of pandas over other data table packages?

Its powerful data manipulation functions


What's important to remember about most pandas dataframe methods?

Most do not modify the input dataframe; they return the result as a new dataframe
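
A minimal example using sort_values and drop, two methods that return a new dataframe by default. The frame is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": ["x", "y", "z"]})

sorted_df = df.sort_values("a")        # new, sorted frame
dropped_df = df.drop(columns=["b"])    # new frame without column "b"

original_order = df["a"].tolist()      # df itself is unchanged
original_columns = list(df.columns)

# Reassign the name to keep the result.
df = df.sort_values("a")
```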