Data Science Flashcards

(51 cards)

1
Q

What is the difference between qualitative and quantitative data?

A

Qualitative data is descriptive (e.g., names, categories), while quantitative data is numerical (e.g., height, age).

2
Q

What is data cleaning?

A

The process of fixing or removing incorrect, corrupted, or incomplete data within a dataset.

3
Q

What is EDA?

A

Exploratory Data Analysis: the process of analyzing datasets to summarize their main characteristics, often using visual methods.

4
Q

What is the mean?

A

The average of a dataset, calculated by summing all values and dividing by the number of values.

5
Q

What is the median?

A

The middle value in an ordered dataset. With an even number of values, it is the average of the two middle values.

6
Q

What is the mode?

A

The most frequently occurring value in a dataset.
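
The mean, median, and mode cards can be checked together with a small sketch. It uses Python's standard `statistics` module; the `values` list is made up purely for illustration.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # even count, so the two middle values are averaged
mode = statistics.mode(values)      # the most frequently occurring value

print(mean, median, mode)
```

With an even number of values, as here, the median falls between the two middle values (3 and 5) rather than on a value in the list.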

7
Q

What is standard deviation?

A

A measure of how spread out values are around the mean, equal to the square root of the variance.

8
Q

What is variance?

A

The average of the squared differences from the mean.
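
A worked sketch of the definition above, computed step by step. The `values` list is a made-up example, and the code computes the population variance (dividing by the number of values); a sample variance would divide by one less than that.

```python
import math

# Hypothetical values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Squared difference of each value from the mean.
squared_diffs = [(x - mean) ** 2 for x in values]

# Population variance: the average of the squared differences.
variance = sum(squared_diffs) / len(values)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

print(variance, std_dev)
```

The standard library offers the same calculations as `statistics.pvariance` and `statistics.pstdev` (population) and `statistics.variance` and `statistics.stdev` (sample).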

9
Q

What is a normal distribution?

A

A bell-shaped distribution that is symmetrical about the mean and fully described by its mean and standard deviation.

10
Q

What is a p-value?

A

The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.

11
Q

What is a confidence interval?

A

A range of values derived from a sample that is likely to contain the population parameter at a stated confidence level (e.g., 95%).
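
One common way to build such a range for a mean is the normal approximation: sample mean plus or minus a z-value times the standard error. The sketch below assumes that approach and a made-up `sample`; for a sample this small, a t-distribution is usually more appropriate, so treat it as an illustration only.

```python
import math
import statistics

# Hypothetical measurements, chosen only for illustration.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

n = len(sample)
mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation / sqrt(n).
sem = statistics.stdev(sample) / math.sqrt(n)

# z-value that leaves 2.5% in each tail of a standard normal (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower, upper = mean - z * sem, mean + z * sem
print(f"95% CI (normal approximation): {lower:.3f} to {upper:.3f}")
```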

12
Q

What is correlation?

A

A statistical measure that describes the strength and direction of the relationship between two variables, commonly expressed as a coefficient between -1 and 1.
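
As an illustration, the Pearson correlation coefficient (one common measure of linear correlation) can be computed by hand. The helper name `pearson_r` and the `hours` and `scores` data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = pearson_r(hours, scores)
print(round(r, 2))
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship. A high coefficient alone does not show that one variable causes the other.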

13
Q

What is causation?

A

A relationship where one variable causes a change in another variable.

14
Q

What is SQL?

A

Structured Query Language, used to communicate with databases.

15
Q

What does SELECT do in SQL?

A

Retrieves data from a database.

16
Q

What does WHERE do in SQL?

A

Filters records based on specified conditions.

17
Q

What is a JOIN in SQL?

A

Combines rows from two or more tables based on a related column.

18
Q

What is GROUP BY in SQL?

A

Groups rows that share a value in the specified column(s) so that aggregate functions (e.g., COUNT, SUM, AVG) can be computed for each group.
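
A small sketch of GROUP BY in action, run from Python against an in-memory SQLite database. The `sales` table, its columns, and its rows are made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical `sales` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100), ("south", 50), ("north", 25), ("south", 75), ("east", 10)],
)

# One output row per region, with a row count and a total per group.
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

for region, n_rows, total in rows:
    print(region, n_rows, total)
```

Five input rows collapse to three output rows, one for each distinct `region`.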

19
Q

What is a pivot table?

A

A tool in Excel used to summarize and analyze data by grouping rows and computing aggregates such as sums, counts, and averages.

20
Q

What does VLOOKUP do in Excel?

A

Searches for a value in the first column of a table and returns a value in the same row from another column.

21
Q

What is conditional formatting?

A

A feature that changes the appearance of cells based on conditions.

22
Q

What is a histogram?

A

A histogram is a graphical representation of the distribution of numerical data. It groups data into bins (or intervals) and shows how many data points fall into each bin using bars. Unlike a bar chart, the bars in a histogram touch each other, indicating the data is continuous.
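
The binning step behind a histogram can be sketched without a plotting library. The `values` list and the `bin_width` of 5 are assumptions chosen for illustration; each bin covers a half-open interval such as [0, 5).

```python
from collections import Counter

# Hypothetical numerical data, chosen only for illustration.
values = [1, 3, 4, 7, 8, 8, 12, 15, 18, 19]
bin_width = 5

# Map each value to the lower edge of its bin, then count values per bin.
counts = Counter((v // bin_width) * bin_width for v in values)

# Print a simple text histogram, one bar per bin.
for start in sorted(counts):
    label = f"[{start}, {start + bin_width})"
    print(f"{label:<9} {'#' * counts[start]}")
```

Plotting libraries such as matplotlib do this counting internally and draw the bars side by side.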

23
Q

What is a bar chart?

A

A bar chart is a graphical display of categorical data using rectangular bars. Each bar represents a category, and the height or length of the bar shows the frequency or value. Unlike histograms, bars do not touch because the categories are discrete, not continuous.

24
Q

What is a scatter plot?

A

A scatter plot is a type of graph that shows the relationship between two numerical variables. Each point on the plot represents an observation with values on the x-axis and y-axis. It’s useful for identifying correlations, trends, outliers, and patterns in data.

25
What is a line chart used for?
A line chart is a graph that uses points connected by lines to show trends over time or ordered categories. It is ideal for visualizing time series data or showing how one variable changes in relation to another, often used to track progression, patterns, or fluctuations.
26
What is data visualization and what are some tools used?
Data visualization is the process of representing data graphically to make information easier to understand, analyze, and communicate. It helps uncover patterns, trends, and outliers using charts, graphs, and dashboards. Common tools include matplotlib, seaborn, Tableau, Power BI, and Looker Studio.
27
What is Pandas in Python?
Pandas is a Python library used for data manipulation and analysis. It provides two core data structures: Series (1D) and DataFrame (2D). Pandas makes it easy to clean, filter, group, merge, reshape, and analyze structured data, especially from CSV, Excel, SQL, and JSON formats.
28
What is NumPy?
NumPy (Numerical Python) is a Python library for numerical computing. It provides support for multi-dimensional arrays and matrices, along with a large collection of mathematical functions to operate on these arrays. It's the foundation for scientific computing in Python and is often used with pandas, scikit-learn, and matplotlib.
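A brief sketch of what array operations look like in practice, assuming NumPy is installed. The `matrix` values are made up for illustration.

```python
import numpy as np

# A hypothetical 2x3 array (2 rows, 3 columns).
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(matrix.shape)               # dimensions as (rows, columns)

doubled = matrix * 2              # element-wise arithmetic, no explicit loop
col_means = matrix.mean(axis=0)   # mean of each column
total = matrix.sum()              # sum of all elements

print(doubled)
print(col_means, total)
```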
29
What is a DataFrame?
A DataFrame is a 2-dimensional, tabular data structure in pandas with rows and columns, similar to an Excel spreadsheet or SQL table. Each column can contain a different data type (e.g., integers, strings, floats). It's one of the most commonly used structures for data analysis and manipulation in Python.
30
What does .groupby() do in Pandas?
.groupby() in pandas is used to split data into groups based on the values in one or more columns. You can then apply aggregations (like .sum(), .mean(), .count()) or custom functions to each group.
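A minimal sketch of the split-then-aggregate pattern, assuming pandas is installed. The DataFrame contents and column names are hypothetical.

```python
import pandas as pd

# Hypothetical employee data.
df = pd.DataFrame({
    "department": ["sales", "sales", "engineering", "engineering", "support"],
    "salary": [40000, 50000, 80000, 90000, 45000],
})

# Split rows by department, then take the mean salary of each group.
mean_salary = df.groupby("department")["salary"].mean()

# Several aggregations at once: row count and total salary per department.
summary = df.groupby("department")["salary"].agg(["count", "sum"])

print(mean_salary)
print(summary)
```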
31
What is a p-value?
A p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. It helps you determine the significance of your results. A small p-value (typically ≤ 0.05) means the result is statistically significant — you may reject the null hypothesis. A large p-value (> 0.05) suggests weak evidence against the null — you fail to reject it.
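As a worked illustration, suppose the null hypothesis is that a coin is fair and 9 heads are observed in 10 flips. The one-sided p-value is the probability of seeing 9 or more heads under that assumption. The scenario and the helper name `one_sided_p_value` are hypothetical.

```python
from math import comb

def one_sided_p_value(heads: int, flips: int) -> float:
    """P(at least `heads` heads in `flips` flips of a fair coin)."""
    # Count the outcomes at least as extreme as the observed one.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    # Every sequence of flips is equally likely under the null hypothesis.
    return favorable / 2 ** flips

p = one_sided_p_value(9, 10)
reject_null = p <= 0.05  # compare against a 0.05 significance level

print(round(p, 4), reject_null)
```

Here 11 of the 1,024 equally likely outcomes have 9 or more heads, so p is about 0.0107 and the fair-coin hypothesis would be rejected at the 0.05 level.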
32
What does .merge() do in Pandas?
Combines DataFrames using database-style joins.
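A short sketch of two join types, assuming pandas is installed. The `orders` and `customers` DataFrames are made up for illustration.

```python
import pandas as pd

# Hypothetical tables sharing a `customer_id` column.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [10, 20, 40], "name": ["Ana", "Ben", "Cy"]})

# Inner join: keep only rows whose customer_id appears in both DataFrames.
inner = orders.merge(customers, on="customer_id", how="inner")

# Left join: keep every order; unmatched customers get NaN for `name`.
left = orders.merge(customers, on="customer_id", how="left")

print(inner)
print(left)
```

Order 3 has no matching customer, so it is dropped by the inner join and kept with a missing `name` by the left join.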
33
What is hypothesis testing?
A method of making decisions using data, typically involving p-values and significance levels.
34
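What is a P-Value?
The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.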
35
How do you select the 'age' column from a DataFrame called df?
df['age']
36
What is A/B testing?
An experiment comparing two versions of something to determine which performs better.
37
What is root cause analysis?
A method of problem solving that tries to identify the primary cause of a problem.
38
What is outlier detection?
Identifying values that are significantly different from others in the dataset.
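One widely used rule of thumb (an assumption here, not the only method) flags values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third. The `data` list is hypothetical.

```python
import statistics

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 95]

# Quartiles: the three cut points that split the data into four parts.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: anything outside this range is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```

Other approaches, such as z-scores, apply the same idea with a different definition of "significantly different".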
39
What is time series analysis?
Analyzing data points collected or recorded at specific time intervals.
40
What is the SQL syntax to select all columns from a table named 'customers'?
SELECT * FROM customers;
41
How do you filter rows in SQL where age is greater than 30?
SELECT * FROM table_name WHERE age > 30;
42
What is the SQL syntax to count the number of rows in a table?
SELECT COUNT(*) FROM table_name;
43
How do you sort results in SQL by a column named 'price' in descending order?
SELECT * FROM table_name ORDER BY price DESC;
44
What is the SQL syntax for an INNER JOIN between 'orders' and 'customers' on 'customer_id'?
SELECT * FROM orders INNER JOIN customers ON orders.customer_id = customers.customer_id;
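To see which rows an INNER JOIN keeps, the query can be run from Python against an in-memory SQLite database. The table contents below are made up, and the query selects two named columns instead of `*` to keep the output short.

```python
import sqlite3

# In-memory database with hypothetical `customers` and `orders` rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (100, 1), (101, 1), (102, 3);
""")

# Only orders with a matching customer_id in `customers` are returned.
rows = conn.execute(
    "SELECT orders.order_id, customers.name "
    "FROM orders INNER JOIN customers "
    "ON orders.customer_id = customers.customer_id "
    "ORDER BY orders.order_id"
).fetchall()
conn.close()

print(rows)
```

Order 102 refers to a customer that does not exist, and customer Ben has no orders, so neither appears in the result.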
45
How do you import Pandas in Python?
import pandas as pd
46
What is the syntax to read a CSV file named 'data.csv' using Pandas?
pd.read_csv('data.csv')
47
How do you display the first 5 rows of a DataFrame called df?
df.head()
48
What is the syntax to filter rows where 'salary' > 50000 in a DataFrame df?
df[df['salary'] > 50000]
49
What is the syntax to loop through a list called 'items' in Python?
for item in items: print(item)
50
How do you create a list of numbers from 0 to 9 in Python?
list(range(10))
51
How do you import NumPy in Python?
import numpy as np