Section 5 - Data Analysis and Visualization with Python Flashcards
(14 cards)
What is Kaggle?
A website that hosts many public datasets. It is regularly used by data scientists and students.
What is the full form of CSV?
Comma-Separated Values
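What is the loc method in pandas?
loc is an indexer (written with square brackets rather than called like a method) that selects rows and columns of a dataframe by label or by a boolean condition. It is commonly used to retrieve the rows that match a certain condition.
e.g. us_babies.loc[us_babies['Year'] == 2014, :]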
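A minimal sketch of the loc card above. The tiny us_babies table is a made-up stand-in (the names and counts are not real data); only the column names 'Year' and 'Count' come from the cards.

```python
import pandas as pd

# Hypothetical miniature stand-in for the us_babies data; values are made up.
us_babies = pd.DataFrame({
    "Name": ["Emma", "Noah", "Olivia", "Liam"],
    "Year": [2013, 2014, 2014, 2013],
    "Count": [20900, 19200, 19800, 18100],
})

# Boolean mask: True for every row whose Year equals 2014.
mask = us_babies["Year"] == 2014

# Keep the rows where the mask is True and all columns (":").
us_babies_2014 = us_babies.loc[mask, :]

# loc also selects by label: the row labelled 0, the column labelled "Name".
first_name = us_babies.loc[0, "Name"]

print(us_babies_2014)
print(first_name)
```

The condition inside the brackets is just a Series of True/False values, so it can be built and inspected separately, as done with mask here.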
Write a sort_values statement.
us_babies_2014.sort_values('Count', ascending=False)
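A short sketch of the sort_values card, using a made-up three-row table. It also illustrates that sort_values returns a new dataframe and leaves the original order untouched unless the result is assigned.

```python
import pandas as pd

# Hypothetical sample rows; the counts are invented for illustration.
us_babies_2014 = pd.DataFrame({
    "Name": ["Noah", "Olivia", "Ava"],
    "Count": [19200, 19800, 15600],
})

# Sort by Count, largest first. The result is a new dataframe.
by_count = us_babies_2014.sort_values("Count", ascending=False)

print(by_count)
print(us_babies_2014)  # still in the original order
```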
What is an iloc statement?
An iloc statement retrieves rows (and columns) by integer position, for example a range of rows such as the first 5. The end of the range is not included.
e.g. dataframe.iloc[0:5]
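A small sketch of position-based selection with iloc. The dataframe and its letter index are invented so that positions and labels are visibly different.

```python
import pandas as pd

# Hypothetical data: ten rows labelled "a".."j" with values 10..19.
dataframe = pd.DataFrame({"value": range(10, 20)}, index=list("abcdefghij"))

# Positions 0, 1, 2, 3, 4 (the end position 5 is excluded).
first_five = dataframe.iloc[0:5]

# Negative positions count from the end: this is the last row.
last_row = dataframe.iloc[-1]

# Row position 2, column position 0.
cell = dataframe.iloc[2, 0]

print(first_five)
print(last_row["value"], cell)
```

Because iloc ignores the index labels, it behaves the same after sorting or filtering, whereas loc follows the labels.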
How do you check for null values in a dataset?
dataset.isnull()
Returns a dataframe of the same shape with True or False in every cell. True indicates that the cell holds a null (missing) value.
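A minimal sketch of the null check, on a made-up two-column table. Summing the True/False mask per column is a common follow-up, since True counts as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
dataset = pd.DataFrame({
    "city": ["NYC", None, "LA"],
    "price": [100.0, 80.0, np.nan],
})

# Same shape as dataset, True where a cell is missing.
null_mask = dataset.isnull()

# Number of missing cells in each column.
nulls_per_column = dataset.isnull().sum()

print(null_mask)
print(nulls_per_column)
```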
What is a pandas Series?
A one-dimensional labeled array. Each column in a pandas dataframe is a Series.
What does dataset[series name].unique() do? Provide one application.
Returns an array of unique values in a series.
e.g. crime['offense'].unique()
It can be used to check for misspellings, since each misspelled variant shows up as its own unique value.
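A sketch of the misspelling check described above. The offense values are invented, and the clean-up step (title-casing plus a manual replacement) is just one possible approach, not a prescribed one.

```python
import pandas as pd

# Hypothetical data containing a case variant and a typo.
crime = pd.DataFrame({
    "offense": ["Theft", "Assault", "theft", "Theft", "Asault"],
})

# Unique values in order of first appearance; variants stand out here.
raw_values = crime["offense"].unique()
print(raw_values)

# One possible fix: normalise the case, then map the known typo.
cleaned = crime["offense"].str.strip().str.title().replace({"Asault": "Assault"})
print(cleaned.unique())
```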
What are some good practices for data cleaning?
Data cleaning determines the quality of the data analysis, which makes it an essential step.
- Note every change made to a series so that it can be tracked during the analysis.
- Use caution when the analysis relies on a series that has not been cleaned.
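One way these practices might look in code, as a rough sketch: the raw column is kept, the cleaned values go into a new column, and each change is recorded in a plain list. The table, the price_clean column and the cleaning_log list are all hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical data: one missing price and one impossible (negative) price.
data = pd.DataFrame({"price": [120.0, None, 95.0, -10.0]})

# Hypothetical change log: one plain-text note per cleaning step.
cleaning_log = []

# Keep the raw column; write cleaned values to a new column.
# where() keeps values that satisfy the condition and sets the rest to NaN.
data["price_clean"] = data["price"].where(data["price"] > 0)
cleaning_log.append("price_clean: non-positive prices set to missing")

n_missing = int(data["price_clean"].isnull().sum())
cleaning_log.append(f"price_clean: {n_missing} missing values left as NaN")

print(data)
print(cleaning_log)
```

Keeping the untouched price column next to price_clean makes it obvious during analysis which series has been cleaned and which has not.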
Using seaborn, create a bar graph displaying the number of Airbnb listings in each neighbourhood group of New York.
The dataframe is called “listing”.
The column for neighbourhood group is called “neighbourhood_group”.
The table consists of rows of Airbnb listings with a column for neighbourhood_group.
sn.countplot(x="neighbourhood_group", data=listing)  # assumes: import seaborn as sn
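A count plot draws one bar per category, with the bar height equal to the number of rows in that category. The same numbers can be checked in pandas with value_counts, sketched here on a made-up six-row table (the real listing data is not reproduced).

```python
import pandas as pd

# Hypothetical stand-in for the listing dataframe.
listing = pd.DataFrame({
    "neighbourhood_group": [
        "Manhattan", "Brooklyn", "Manhattan",
        "Queens", "Brooklyn", "Manhattan",
    ],
})

# Number of rows (listings) per neighbourhood group.
# These are the bar heights a count plot of this column would show.
counts = listing["neighbourhood_group"].value_counts()

print(counts)
```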
Use seaborn to create a bar graph with x axis as “neighbourhood_group” and y axis as “price”.
sn.barplot(x="neighbourhood_group", y="price", data=listing)
Each bar shows the average price, since each neighbourhood group has multiple entries with different prices and barplot plots the mean by default.
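The averaging can be reproduced in pandas with a groupby, which is a quick way to confirm what the bars represent. The four rows below are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the listing dataframe; prices are made up.
listing = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Manhattan", "Brooklyn"],
    "price": [200.0, 100.0, 300.0, 150.0],
})

# Mean price per neighbourhood group: one number per bar.
mean_price = listing.groupby("neighbourhood_group")["price"].mean()

print(mean_price)
```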
Explain a histogram in simple terms.
A histogram represents the distribution of data: the values are divided into ranges (bins), and the number of values falling in each bin is displayed. X axis - the bins (ranges of values). Y axis - quantity (count per bin).
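The binning behind a histogram can be sketched without any plotting, using numpy.histogram on ten made-up prices. The three bins of width 100 are an arbitrary choice for the example.

```python
import numpy as np

# Hypothetical prices, invented for illustration.
prices = np.array([50, 60, 75, 120, 130, 180, 240, 260, 275, 290])

# Split the range 0..300 into 3 equal bins and count the values in each.
counts, bin_edges = np.histogram(prices, bins=3, range=(0, 300))

# bin_edges marks the X axis ranges; counts are the Y axis bar heights.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```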
Make a scatterplot with matplotlib.pyplot
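The card gives no answer, so the following is only one possible sketch. The x and y values are made up, and the Agg backend line is there so the code runs without a display; it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove in a notebook
import matplotlib.pyplot as plt

# Hypothetical data points, invented for illustration.
x = [1, 2, 3, 5, 7, 14]
y = [150, 120, 110, 95, 90, 70]

fig, ax = plt.subplots()

# One marker per (x, y) pair.
points = ax.scatter(x, y)

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Scatterplot (made-up data)")
```

With an interactive backend, calling plt.show() afterwards displays the figure.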
Make a histogram with matplotlib.pyplot
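The card gives no answer, so this is one possible sketch rather than the expected solution. The prices are made up, the bin settings are arbitrary, and the Agg backend line only exists so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove in a notebook
import matplotlib.pyplot as plt

# Hypothetical prices, invented for illustration.
prices = [50, 60, 75, 120, 130, 180, 240, 260, 275, 290]

fig, ax = plt.subplots()

# hist() bins the values and draws one bar per bin.
# It returns the counts, the bin edges and the drawn bars.
counts, bin_edges, bars = ax.hist(prices, bins=3, range=(0, 300))

ax.set_xlabel("price")
ax.set_ylabel("count")
ax.set_title("Histogram (made-up data)")
```

With an interactive backend, calling plt.show() afterwards displays the figure.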