Priority 5 Flashcards

1
Q

Which is faster, Python lists or NumPy arrays?

A

NumPy arrays, for element-wise numerical operations on large collections of numbers.

2
Q

Why are NumPy arrays faster than Python lists?

A

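NumPy arrays store elements of a single data type in a contiguous block of memory, and their operations run as compiled, vectorized C loops over the whole array. Python lists (although also implemented in C in CPython) hold references to individually typed objects, so element-wise work done in a Python loop pays interpreter and type-checking overhead on every element.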

3
Q

What are the differences between Python lists and tuples?

3 bullet points

A
  • Lists are mutable whereas tuples are not.
  • Lists are defined using square brackets [] whereas tuples are defined using parentheses ().
  • Tuples are generally slightly faster to create and use less memory than lists, because their immutability allows the interpreter to optimize them.
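A minimal sketch of the mutability difference, using made-up variable names; it only illustrates the bullets above and is not a benchmark.

```python
# A list can be changed in place; a tuple cannot.
colors_list = ["red", "green"]
colors_list.append("blue")        # lists support in-place modification

colors_tuple = ("red", "green")
try:
    colors_tuple[0] = "blue"      # item assignment is not allowed on tuples
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

# Both are ordered and support indexing.
first_from_list = colors_list[0]
first_from_tuple = colors_tuple[0]
```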
4
Q

What are the similarities between Python lists and tuples?

3 bullet points

A
  • Both are collections of objects.
  • Both are written as comma-separated values.
  • Both are ordered.
5
Q

What is a Python set?

A

Unordered collection of unique objects

6
Q

What is the typical use case of Python sets?

A

Often used to store a collection of distinct objects and perform membership tests (i.e., to check if an object is in the set).

7
Q

How are Python sets defined?

A

Curly braces around a comma-separated list of values, e.g., {1, 2, 3}. An empty set must be created with set(), because {} creates an empty dictionary.

8
Q

What are the key properties of Python sets?

5 bullet points

A
  • Unordered
  • Elements are unique (duplicates are discarded)
  • Mutable
  • Not indexed/do not support slicing
  • Not hashable (a set cannot be used as a key in a dictionary or as an element of another set); the elements themselves must be hashable
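The properties above can be sketched as follows. The variable names are illustrative, and frozenset (the built-in immutable set type) is brought in here only to contrast with the unhashable set.

```python
unique_tags = {"python", "numpy", "python"}   # the duplicate "python" collapses
unique_tags.add("pandas")                     # sets are mutable
has_numpy = "numpy" in unique_tags            # membership test

empty_set = set()                             # set() makes an empty set
empty_braces_type = type({}).__name__         # {} is a dict, not a set

# A set is unhashable, so it cannot be a dict key; a frozenset can.
try:
    lookup = {unique_tags: 1}
    set_as_key_ok = True
except TypeError:
    set_as_key_ok = False
frozen_key_lookup = {frozenset(unique_tags): 1}
```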
9
Q

What is the difference between Python split and join?

1 bullet point for each

A
  • split() is a string method that creates a list from a string based on some delimiter (e.g., space).
  • join() is a string method that concatenates an iterable of strings into a single string, placing the separator between elements.
10
Q

Syntax: Python split

Include definition of any class objects and/or parameters

A
string.split(separator, maxsplit)
  • string: The string you want to split.
  • separator (optional): The delimiter used to split the string. If not specified, the string is split on runs of whitespace.
  • maxsplit (optional): The maximum number of splits to perform. If not specified, the string is split at all occurrences of the separator.
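A short sketch of the three call forms described above, with made-up input strings:

```python
line = "a,b,c,d"
all_parts = line.split(",")           # split at every comma
limited = line.split(",", 1)          # at most one split
words = "  hello   world ".split()    # no separator: split on runs of whitespace
```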
11
Q

Syntax: Python join

Include definition of any class objects and/or parameters

A
separator.join(iterable)
  • separator: The string that will be used to separate the elements of the iterable in the resulting string.
  • iterable: An iterable object (e.g., a list or tuple of strings, or a string) whose elements will be joined together. Every element must be a string.
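A small sketch of join, including the case where the elements are not strings; the sample values are made up.

```python
words = ["data", "science"]
joined = " ".join(words)

# Non-string elements must be converted first.
csv_row = ",".join(str(n) for n in [1, 2, 3])

try:
    ",".join([1, 2, 3])               # ints are rejected
    join_accepts_ints = True
except TypeError:
    join_accepts_ints = False
```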
12
Q

What are the logical operators in Python? What are they used for?

A
  • and, or, not
  • Used to perform boolean operations on bool values.
13
Q

Logical operators in Python: and

A

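Returns True if both operands are True; otherwise, False. (With non-boolean operands, and returns the first falsy operand, or the last operand if all are truthy.)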

14
Q

Logical operators in Python: or

A

Returns True if either operand is True; returns False if both operands are False.

15
Q

Logical operators in Python: not

A

Returns True if the operand is False; returns False if the operand is True.
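A brief sketch of the three operators. The last two lines show behavior the cards do not spell out: with non-bool operands, and/or return one of the operands and stop evaluating as soon as the result is known.

```python
both = True and False        # both operands must be True
either = True or False       # one True operand is enough
negated = not True           # inverts the operand

# With non-bool operands, and/or return one of the operands.
default_name = "" or "anonymous"     # "" is falsy, so the right side is returned
first_falsy = 0 and 1 / 0            # 0 is falsy, so 1 / 0 is never evaluated
```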

16
Q

What are the top 6 functions used for Python strings?

A
  1. len()
  2. strip()
  3. split()
  4. replace()
  5. upper()
  6. lower()
17
Q

Top 6 functions used for Python strings: len()

A

Returns the length of a string.

18
Q

Top 6 functions used for Python strings: strip()

A

Removes leading and trailing whitespace from a string.

19
Q

Top 6 functions used for Python strings: split()

A

Splits a string into a list of substrings based on a delimiter.

20
Q

Top 6 functions used for Python strings: replace()

A

Returns a new string in which all occurrences of a specified substring are replaced with another string; the original string is unchanged.

21
Q

Top 6 functions used for Python strings: upper()

A

Converts a string to uppercase.

22
Q

Top 6 functions used for Python strings: lower()

A

Converts a string to lowercase.
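The six string functions from cards 17-22 can be sketched together on one made-up input string:

```python
raw = "  Hello, World  "

length = len(raw)                                # counts every character, spaces included
trimmed = raw.strip()                            # removes leading/trailing whitespace
parts = trimmed.split(", ")                      # list of substrings
replaced = trimmed.replace("World", "Python")    # returns a new string
shouted = trimmed.upper()
whispered = trimmed.lower()
```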

23
Q

What is the pass keyword in Python? What is it used for?

A

pass is a null statement that does nothing. It is often used as a placeholder where a statement is required syntactically, but no action needs to be taken.

24
Q

What are some common use cases of the pass keyword in Python?

3 bullet points

A
  • Empty functions or classes: When you define a function/class but haven’t implemented any logic yet. Use pass to avoid syntax errors.
  • Conditional statements: If you need an if statement but don’t want to take any action in the if block, you can use pass.
  • Loops: You can use pass in loops when you don’t want to perform any action in a specific iteration.
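A minimal sketch of the three use cases above; the class and function names are placeholders.

```python
class NotImplementedYet:
    pass                      # empty class body

def todo():
    pass                      # empty function body; calling it returns None

evens = []
for n in range(5):
    if n % 2:
        pass                  # nothing to do for odd numbers
    else:
        evens.append(n)
```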
25
What is the use of the `continue` keyword in Python?
`continue` is used in a loop to skip over the current iteration and move on to the next one.
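For illustration, a loop that skips even numbers with `continue`:

```python
odds = []
for n in range(6):
    if n % 2 == 0:
        continue          # skip the rest of this iteration
    odds.append(n)        # reached only for odd n
```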
26
Definition: immutable data type in Python
Object whose state cannot be modified after it is created.
27
Definition: mutable data type in Python
Object whose state can be modified after it is created.
28
Examples of immutable data types in Python
* Numbers: `int`, `float`, `complex`
* `bool`
* `str`
* Tuples
29
Examples of mutable data types in Python
* Lists
* Dictionaries
* Sets
30
Because numbers are immutable data types in Python, what happens when you change the value of a number variable?
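The variable name is rebound to a new object holding the new value. The old object is left unchanged and is garbage-collected once nothing else references it, freeing the memory used to store it.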
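A small sketch contrasting an immutable `int` with a mutable `list`. The alias names are illustrative; they keep a second reference so the effect of `+=` is visible.

```python
x = 10
alias = x          # second name for the same int object
x += 1             # rebinds x to a different object; the int 10 is unchanged

items = [1]
items_alias = items
items += [2]       # lists are mutable: the same object is modified in place
```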
31
Python variables versus objects
* Variables are names that refer to or hold references to concrete objects.
* Objects are concrete pieces of information that live in specific memory positions on the computer.
32
Can you use `sort()` on tuples? Why or why not?
No. Tuples are immutable. You would have to create a new sorted tuple from the original tuple.
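One way to build the new sorted tuple is the built-in `sorted()`, which accepts any iterable and returns a list:

```python
scores = (3, 1, 2)
sorted_scores = tuple(sorted(scores))        # new tuple; the original is untouched
has_sort_method = hasattr(scores, "sort")    # tuples have no in-place sort()
```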
33
What are `try except` blocks used for in Python?
Exception handling
34
`try except` blocks in Python: what is the `try` block?
Contains code that might cause an exception to be raised.
35
`try except` blocks in Python: what is the `except` block?
Contains code that is executed if an exception is raised during the execution of a `try` block.
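A minimal sketch of a `try`/`except` pair. The `safe_divide` helper is hypothetical, and returning `None` on failure is just one possible recovery choice.

```python
def safe_divide(a, b):
    try:
        return a / b                 # may raise ZeroDivisionError
    except ZeroDivisionError:
        return None                  # runs only if the try block raised

ok = safe_divide(6, 3)
failed = safe_divide(1, 0)
```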
36
What are the similarities between Python functions and methods? | 3 bullet points
* Both are blocks of code that perform a specific task.
* Both can take input parameters and return a value.
* Both are defined using the `def` keyword.
37
What are the key differences between Python functions and methods? | 4 bullet points
* Functions are defined outside of classes; methods are functions that are associated with a specific object or class.
* Functions can be called on a standalone basis; methods are called using the dot notation on an object of a class.
* Functions perform general tasks; methods perform actions specific to the object they belong to.
* Parameters are optional for functions; for methods, the first parameter is usually `self`, which refers to the instance of the class.
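The same computation written both ways, as a sketch; `area` and `Rectangle` are illustrative names.

```python
def area(width, height):
    """Standalone function: called directly, all inputs passed explicitly."""
    return width * height

class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        """Method: called with dot notation; self is the instance."""
        return self.width * self.height

from_function = area(2, 3)
from_method = Rectangle(2, 3).area()
```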
38
How do functions help in code optimization? | 4 high-level points
1. Code reuse
2. Improved readability
3. Easier testing
4. Improved performance
39
Functions + code optimization: Code reuse
Allow you to reuse code by encapsulating it in a single place and calling it multiple times from different parts of your program. Reduces redundancy, making code more concise and easier to maintain.
40
Functions + code optimization: Improved readability
Functions make your code more readable and easier to understand by dividing your code into logical blocks. This makes it easier to identify bugs and make changes.
41
Functions + code optimization: Easier testing
Functions allow you to test individual blocks of code separately, which can make it easier to find and fix bugs.
42
Functions + code optimization: Improved performance
Functions allow you to use optimized code libraries and/or allow the Python interpreter to optimize the code more effectively.
43
Why is NumPy often used for data science? | 3 bullet points
* Fast and efficient operations on arrays and matrices of numerical data versus Python's built-in data structures. This is because it uses optimized C and Fortran code behind the scenes.
* Large number of functions for performing mathematical and statistical operations on arrays and matrices.
* Integrates well with other scientific computing libraries in Python, such as SciPy and pandas.
44
Definition: list comprehension in Python
Shorter syntax when creating a new list based on the values of an existing list.
45
Syntax: Python list comprehension
```
new_list = [expression for item in iterable if condition]
```
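A concrete instance of that pattern, next to the equivalent explicit loop:

```python
squares_of_evens = [n * n for n in range(6) if n % 2 == 0]

# Equivalent loop form
loop_result = []
for n in range(6):
    if n % 2 == 0:
        loop_result.append(n * n)
```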
46
Definition: dict comprehension in Python
Concise way of creating dictionaries in Python
47
Syntax: Python dict comprehension
```
{key: value for item in iterable}
```
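A concrete instance of that pattern, mapping made-up words to their lengths:

```python
words = ["a", "bb", "ccc"]
word_lengths = {word: len(word) for word in words}
```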
48
Definition: global variable in Python
A variable that is defined outside of any function or class
49
Definition: local variable in Python
A variable that is defined inside a function or class
50
Where can a Python global variable be accessed?
Can be accessed from anywhere in the program
51
Where can a Python local variable be accessed?
Can only be accessed within the function or class in which it is defined
52
What happens inside a Python function if you have a local variable and global variable with the same name?
The local variable will take precedence over the global variable within the function or class in which it is defined
53
# What will this code output?
```
x = 10

def func():
    x = 5
    print(x)

func()
print(x)
```
5
10
54
Definition: Python ordered dictionary
Subclass of the Python dictionary class that maintains the order in which elements were added. (Since Python 3.7, regular dictionaries also preserve insertion order.)
55
Python ordered dictionary class name
`OrderedDict`
56
How do Python ordered dictionaries maintain the order of elements in the dictionary?
A doubly linked list
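A short sketch of two order-aware `OrderedDict` operations, `move_to_end` and `popitem(last=False)`; the keys and values are made up.

```python
from collections import OrderedDict

od = OrderedDict()
od["a"] = 1
od["b"] = 2
od["c"] = 3

od.move_to_end("a")                 # order is now b, c, a
first = od.popitem(last=False)      # removes and returns the first item
remaining_keys = list(od)
```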
57
What do `return` and `yield` in Python have in common?
Both are keywords used to send values back from a function
58
What is the functionality of the `return` keyword in Python?
Terminates the function and returns a value to the caller
59
What is the functionality of the `yield` keyword in Python?
Pauses the function's execution and returns a value to the caller but maintains the function's state so that it can be resumed later
60
What is the use case of the `return` keyword in Python?
Used in regular functions when you want to compute a single result and return it
61
What is the use case of the `yield` keyword in Python?
Used to create generator functions that produce a sequence of values over time
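A minimal generator sketch; `countdown` is a hypothetical example function.

```python
def countdown(n):
    while n > 0:
        yield n          # pause here, hand n to the caller, resume on the next request
        n -= 1

gen = countdown(3)
first = next(gen)        # runs until the first yield
rest = list(gen)         # resumes where it left off
```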
62
Definition: Python lambda function
Small anonymous function that can take any number of arguments but can only have one expression
63
Syntax: Python lambda function
```
lambda arguments : expression
```
64
# What will this code output?
```
x = lambda a : a + 10
print(x(5))
```
15
65
How are Python lambda functions typically used in practice?
Often used in combination with higher-order functions, such as `map()`, `filter()`, and `reduce()`
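A sketch of lambdas passed to the three functions named above. Note that `reduce` lives in `functools` in Python 3.

```python
from functools import reduce

nums = [1, 2, 3, 4]
doubled = list(map(lambda n: n * 2, nums))
evens = list(filter(lambda n: n % 2 == 0, nums))
total = reduce(lambda acc, n: acc + n, nums)
```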
66
What does the `assert` keyword in Python do?
Used to test a condition. If the condition is `True`, the program continues to execute. If the condition is `False`, then the program raises an `AssertionError` exception.
67
What is the `assert` keyword in Python used for?
Used for debugging purposes and is not intended to be used as a way to handle runtime errors
68
For exception handling within production Python code, should you use `try-except` or `assert`? Why?
`try-except`
* Allows recovery and custom actions versus termination with `AssertionError`
* Fully customizable exception messages versus limited to raising `AssertionError`
69
What are decorators in Python?
Used to modify or extend the functionality of a function, method, or class without changing its source code
70
Syntax: Python decorators
```
@decorator_function
def function_to_be_decorated():
    # Function code here
    ...
```
71
# What does this code output?
```
def my_decorator(func):
    def wrapper():
        print("Something is happening before the function is called.")
        func()
        print("Something is happening after the function is called.")
    return wrapper

@my_decorator
def say_hello():
    print("Hello!")

say_hello()
```
Something is happening before the function is called.
Hello!
Something is happening after the function is called.
72
What is univariate analysis?
Used to analyze and describe the characteristics of a single variable
73
Common steps when conducting univariate analysis on a numerical variable | 4 bullet points
* Calculate descriptive statistics, such as mean, median, mode, and standard deviation, to summarize the distribution of the data.
* Visualize the distribution of the data using plots such as histograms, boxplots, or density plots.
* Check for outliers and anomalies in the data.
* Check for normality in the data using statistical tests or visualizations such as a Q-Q plot.
74
Common steps when conducting univariate analysis on a categorical variable | 4 bullet points
* Calculate the frequency of each category in the data.
* Calculate the percentage of each category in the data.
* Visualize the distribution of the data using plots such as bar and pie charts.
* Check for imbalances or abnormalities in the distribution of the data.
75
Common ways to find outliers in a data set | 3 bullet points
* **Visual Inspection:** Identification via visual inspection of data using plots such as histograms, scatterplots, or boxplots.
* **Summary Statistics:** Identification via calculating summary statistics, such as mean, median, or interquartile range. For example, if the mean is significantly different from the median, it could indicate the presence of outliers.
* **Z-Score:** The z-score measures how many standard deviations a given data point is from the mean. Data points whose absolute z-score exceeds a threshold (e.g., 3 or 4) may be considered outliers.
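The z-score approach can be sketched with the standard library. The data and the cutoff of 2.5 are assumptions made for this example; the 3 or 4 mentioned above suits larger samples, since a 10-point sample cannot produce a population z-score above 3.

```python
from statistics import mean, pstdev

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 50]   # made-up sample with one extreme value
THRESHOLD = 2.5                                    # assumed cutoff for this small sample

mu = mean(data)
sigma = pstdev(data)                               # population standard deviation

z_scores = [(x - mu) / sigma for x in data]
outliers = [x for x, z in zip(data, z_scores) if abs(z) > THRESHOLD]
```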
76
What are common methods to handle the missing values in a data set? | 5 main points
1. Drop rows
2. Drop columns
3. Imputation with mean or median
4. Imputation with mode
5. Imputation with a predictive model
77
# Drop rows
Common methods to handle the missing values in a data set | Explanation + Pro/Con
Drop rows with null values
* Pro: Simple and fast
* Con: Can significantly reduce sample size and impact the statistical power of the analysis
78
# Drop columns
Common methods to handle the missing values in a data set | Explanation + Pro/Con
Drop columns with null values
* Pro: Can be a good option if many values are missing from the column or the column is irrelevant
* Con: Can result in omitted variable bias
79
# Imputation with mean or median
Common methods to handle the missing values in a data set | Explanation + Pro/Con
Replace null values with the mean or median of the non-null values in the column
* Pro: Good option if the data are missing at random and mean/median is a reasonable representation of the data
* Con: Introduces bias if the data are not missing at random
80
# Imputation with mode
Common methods to handle the missing values in a data set | Explanation + Pro/Con
Replace null values with the mode of the non-null values in the column
* Pro: Good option for categorical data where mode is a reasonable representation of the data
* Con: Introduces bias if the data are not missing at random
81
# Imputation with a predictive model
Common methods to handle the missing values in a data set | Explanation + Pro/Con
Use a predictive model to estimate the missing values based on other available data
* Pro: Can be more accurate/less biased if the data are not missing at random and there is a strong relationship between the missing values and other data
* Con: More complex/time-consuming
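Two of the simpler methods, dropping rows and median imputation, can be sketched in plain Python on a single made-up column. A real analysis would more likely use a data-frame library, which is not shown here.

```python
from statistics import median

ages = [25, None, 35, 40, None]                  # hypothetical column with missing values

# Drop rows: keep only the observed values.
observed = [a for a in ages if a is not None]

# Median imputation: fill each gap with the median of the observed values.
fill_value = median(observed)
imputed = [fill_value if a is None else a for a in ages]
```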
82
Definition: skewness
Measure of the asymmetry of a distribution, i.e., how far it departs from a symmetric shape. A distribution is skewed if it is not symmetrical, with more data points concentrated on one side of the mean than the other.
83
What are the different types of skewness?
* Positive skewness
* Negative skewness
84
# Positive skewness
Different types of skewness | 3 bullet points
* Long tail on the right side
* Majority of data points concentrated on the left side of the mean
* A few extreme values on the right side of the distribution that are pulling the mean to the right
85
# Negative skewness
Different types of skewness | 3 bullet points
* Long tail on the left side
* Majority of data points concentrated on the right side of the mean
* A few extreme values on the left side of the distribution that are pulling the mean to the left
86
What are the three main measures of central tendency?
1. Mean
2. Median
3. Mode
87
# Mean
Three main measures of central tendency | 3 bullet points
* Arithmetic average of a dataset
* Calculated by adding all the values in the dataset and dividing by the number of values
* Sensitive to outliers
88
# Median
Three main measures of central tendency | 3 bullet points
* Middle value of the dataset when the values are arranged in order from smallest to largest
* Arrange the values in order and find the middle value. If there is an odd number of values, the median is the middle value. If there is an even number of values, the median is the mean of the two middle values.
* Not sensitive to outliers
89
# Mode
Three main measures of central tendency | 3 bullet points
* Value that occurs most frequently in a dataset
* May have multiple modes or no modes at all
* Not sensitive to outliers
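A quick sketch of the three measures using the standard `statistics` module, on made-up values where a single outlier (100) pulls the mean but not the median or mode:

```python
from statistics import mean, median, multimode

values = [1, 2, 2, 3, 100]

avg = mean(values)            # pulled upward by the outlier
mid = median(values)          # middle value of the sorted data
modes = multimode(values)     # list of the most frequent value(s)
```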
90
Definition: descriptive statistics
Used to summarize and describe a dataset by using measures of central tendency (mean, median, mode) and measures of spread (standard deviation, variance, range)
91
Definition: inferential statistics
Used to make inferences about a population based on sample data using statistical models, hypothesis testing, and estimation
92
What are the four key elements of an EDA report?
1. Univariate analysis
2. Bivariate analysis
3. Missing data analysis
4. Data visualization
93
# Univariate analysis
Four key elements of an EDA report | How does it contribute to understanding a dataset?
Helps understand the distribution of individual variables
94
# Bivariate analysis
Four key elements of an EDA report | How does it contribute to understanding a dataset?
Helps understand the relationship between variables
95
# Missing data analysis
Four key elements of an EDA report | How does it contribute to understanding a dataset?
Helps understand the quality of the data
96
# Data visualization
Four key elements of an EDA report | How does it contribute to understanding a dataset?
Provides a visual interpretation of the data
97
Definition: central limit theorem | 2 bullet points
* As sample size increases, the distribution of the sample mean will approach a normal distribution
* True regardless of the underlying distribution from which the sample is drawn
98
What is the benefit of the central limit theorem?
Even if the individual data points in a sample are not normally distributed, we can use normal distribution-based methods to make inferences about the population by taking the average of a large enough number of data points
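A small simulation sketch of the idea. The exponential distribution, sample size of 50, and 2000 repetitions are assumptions picked for the example; the point is that means of samples from a skewed distribution cluster around the true mean with spread close to σ/√n.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a right-skewed exponential distribution (true mean 1, sd 1)."""
    return mean([random.expovariate(1.0) for _ in range(n)])

# Distribution of the sample mean for samples of size 50.
sample_means = [sample_mean(50) for _ in range(2000)]

center = mean(sample_means)     # expected to be close to the true mean, 1.0
spread = pstdev(sample_means)   # expected to be close to 1 / sqrt(50)
```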
99
Two main types of target variables for predictive modeling
* Numeric variable
* Categorical variable
100
# Numeric variable
Main types of target variables for predictive modeling | 2 bullet points
* Quantifiable characteristic whose values are numbers
* May be continuous or discrete
101
# Categorical variable
Main types of target variables for predictive modeling
* Takes on one of a limited, usually fixed, number of possible values
102
Definition: binary categorical variable
Categorical variable that can take on exactly two values
103
Definition: polytomous categorical variable
Categorical variable with more than two possible values
104
When will the mean, median, and mode be the same for a given dataset?
Symmetric unimodal distribution: symmetrically distributed with a single peak
105
Definition: model variance
Error from sensitivity to small fluctuations in training data
106
Definition: model bias
Error from overly simplistic assumptions (e.g., data is linear when it's not, omitted variable bias)
107
What will be the result of a model with low bias and high variance?
Overfitting: model will be too sensitive to noise and random fluctuations in the data, failing to generalize well to new data
108
What will be the result of a model with high bias and low variance?
Underfitting: model will miss important relationships in the data
109
What are the types of errors in hypothesis testing?
* Type I error
* Type II error
110
# Type I error
Types of errors in hypothesis testing | 4 bullet points
* False positive
* Null hypothesis is true but is rejected
* Probability denoted by the Greek letter α
* Usually set at a level of 0.05, meaning there is a 5% chance of making a Type I error when the null hypothesis is true
111
# Type II error
Types of errors in hypothesis testing | 4 bullet points
* False negative
* Null hypothesis is false but is not rejected
* Probability denoted by the Greek letter β
* Related to the power of the test, 1 - β: the probability of correctly rejecting the null hypothesis when it is false
112
Definition: confidence interval
Range of values expected to contain the true population parameter with a specific level of confidence
113
What is the most common confidence interval?
95%
114
What is the primary difference between correlation and covariance?
* Correlation is the normalized version of covariance, meaning correlation adjusts for the scales of the variables
115
Definition: correlation
Strength and direction of a linear relationship between two variables
116
Equation: correlation
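No equation accompanies this card in the text; as an assumption, the standard Pearson correlation coefficient for paired observations $(x_i, y_i)$ is:

$$
r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$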
117
What is the range and meaning of different values of correlation?
Between -1 and 1
* +1: Perfect positive linear relationship
* -1: Perfect negative linear relationship
* 0: No linear relationship
118
What are the units of correlation?
Unitless
119
Definition: covariance
Measures the degree to which two random variables change together. Indicates the direction of the linear relationship between variables
120
Equation: covariance
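No equation accompanies this card in the text; as an assumption, the standard sample covariance is shown below (the population version divides by $n$ instead of $n - 1$):

$$
\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$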
121
What is the range and meaning of different values of covariance?
Any value: positive, negative, or zero
* Positive: When X increases, Y tends to increase
* Negative: When X increases, Y tends to decrease
* 0: No linear relationship
122
What are the units of covariance?
Product of the units of the two variables
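The scale difference between covariance and correlation can be sketched by hand on made-up data. The helper names are hypothetical, and sample (n - 1) formulas are assumed: rescaling one variable multiplies the covariance but leaves the correlation unchanged.

```python
from math import sqrt

def sample_cov(xs, ys):
    """Sample covariance: average product of deviations, with n - 1 in the denominator."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    return sample_cov(xs, ys) / sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
y_scaled = [v * 100 for v in y]      # same data in different units

cov_xy = sample_cov(x, y)
cov_scaled = sample_cov(x, y_scaled)
corr_xy = correlation(x, y)
corr_scaled = correlation(x, y_scaled)
```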
123
Definition: hypothesis test
Statistical method to determine whether there is enough evidence in a sample of data to support or reject a stated assumption (hypothesis) about a population
124
What are some of the key reasons why hypothesis testing is useful for data science? | 3 points
* Can make decisions based on statistical evidence, rather than relying on assumptions or opinions.
* Formal, standardized approach, making results interpretable and reproducible.
* Allows for clear and credible communication of findings.
125
What are some of the key use cases for hypothesis testing? | 3 points
* A/B Testing: Evaluate if a new feature, design, or change has a significant impact
* Feature Selection: Test the significance of variables in statistical or ML models
* Model Significance: Assess the significance of a predictive model
126
Definition: Contingency table
Tabular format used to display the frequencies (counts) of data points across two or more categorical variables
127
What is the chi-square test of independence?
Statistical test used to determine whether there is a significant association between two categorical variables in a contingency table
128
What are the null and alternative hypotheses in the chi-square test of independence?
* Null Hypothesis (*H_0*): The two variables are independent (no association).
* Alternative Hypothesis (*H_a*): The two variables are not independent (there is an association).
129
Equation: chi-square statistic
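No equation accompanies this card in the text; as an assumption, the standard Pearson chi-square statistic, with $O_{ij}$ the observed and $E_{ij}$ the expected frequency in row $i$, column $j$ of the contingency table, is:

$$
\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$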
130
Equation: 𝐸 (Expected frequency, calculated under the assumption of independence)
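No equation accompanies this card in the text; as an assumption, the standard expected frequency under independence, with $N$ the total number of observations, is:

$$
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}
$$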
131
Equation: Degrees of freedom (df) in a chi-square test of independence
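No equation accompanies this card in the text; as an assumption, the standard formula for a contingency table with $r$ rows and $c$ columns is:

$$
df = (r - 1)(c - 1)
$$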
132
How do you run the chi-square test of independence? | 3 steps
1. Calculate the chi-square statistic
2. Compare the computed *Χ^2* statistic to a critical value from the chi-square distribution table (based on df and significance level, e.g., 0.05)
3. If *Χ^2* exceeds the critical value, reject the null hypothesis
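The three steps can be sketched by hand on a made-up 2x2 contingency table. The observed counts are invented for illustration, and the critical value 3.841 is the standard table entry for df = 1 at a 0.05 significance level.

```python
observed = [[30, 10],
            [20, 40]]                      # hypothetical counts

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under independence: row total * column total / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Step 1: chi-square statistic
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 2: degrees of freedom and critical value
df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_VALUE = 3.841                     # chi-square table, df = 1, alpha = 0.05

# Step 3: decision
reject_null = chi2 > CRITICAL_VALUE
```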
133
Definition: p-value
Probability, assuming the null hypothesis is true, of obtaining a test statistic as extreme as or more extreme than the one observed
134
Definition: alpha (hypothesis testing)
Significance level, or a predetermined threshold representing the maximum acceptable probability of making a Type I error (i.e., rejecting a true null hypothesis). Criterion against which the p-value is compared to decide whether to reject the null hypothesis
135
What are the most common types of sampling techniques? | 4 main points
1. Simple random sampling
2. Stratified random sampling
3. Cluster sampling
4. Systematic sampling
136
# Simple random sampling
Common types of sampling techniques
Each member of the population has an equal chance of being selected for the sample
137
# Stratified random sampling
Common types of sampling techniques
Involves dividing the population into subgroups (or strata) based on certain characteristics and selecting a random sample from each stratum
138
# Cluster sampling
Common types of sampling techniques
Involves dividing the population into smaller groups (or clusters) and then selecting a random sample of clusters
139
# Systematic sampling
Common types of sampling techniques
Involves selecting every kth member of the population to be included in the sample
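The four techniques can be sketched with the standard `random` module on a made-up population of 100 IDs. The strata, cluster size, and sample sizes are arbitrary choices for the example.

```python
import random

random.seed(42)                                   # fixed seed so the run is repeatable
population = list(range(1, 101))                  # hypothetical member IDs 1..100

# Simple random sampling: every member has an equal chance.
simple = random.sample(population, 10)

# Systematic sampling: every kth member after a random start.
k = 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified random sampling: random sample from each stratum.
strata = {"low": population[:50], "high": population[50:]}
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

# Cluster sampling: randomly pick whole clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [x for cluster in chosen_clusters for x in cluster]
```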
140
Equation: Bayes' theorem
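No equation accompanies this card in the text; as an assumption, the standard statement for events $A$ and $B$ with $P(B) > 0$ is:

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
$$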
141