Intermediate Flashcards

(38 cards)

1
Q

What is a good use case for implementing neural networks for supervised learning?

A

Neural networks are a good fit when traditional ML algorithms struggle, especially with unstructured data such as images, text, or audio. They are also preferred when the dataset is large, since more data lets them learn richer patterns.

2
Q

Should neural networks be applied in healthcare for predicting diagnosis and medication?

A

Generally no. Neural networks are hard to interpret, so they are a poor fit where interpretability is crucial, as it is in healthcare. Decision trees or random forests are preferred for their transparency.

3
Q

Why do we mostly use the ReLU activation function in CNNs?

A

ReLU is cheap to compute and mitigates the vanishing gradient problem: its gradient is 1 for positive inputs, so gradients do not shrink as they pass back through many layers. This makes it effective for deep CNNs.
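
A minimal sketch of why the gradient argument holds, assuming a hypothetical 10-layer stack and ignoring the weight matrices that also enter the chain rule. It compares the product of activation derivatives for sigmoid (at its best case, x = 0) and for active ReLU units.

```python
import math

def relu(x):
    """ReLU: pass positive inputs through, clamp negatives to zero."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def sigmoid_grad(x):
    """Derivative of the sigmoid, which never exceeds 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

depth = 10  # hypothetical number of stacked layers

# Backpropagation multiplies one activation derivative per layer.
sigmoid_chain = sigmoid_grad(0.0) ** depth  # best case for sigmoid: 0.25 ** 10
relu_chain = relu_grad(1.0) ** depth        # active ReLU units: 1.0 ** 10

print(f"sigmoid: {sigmoid_chain:.2e}, relu: {relu_chain:.1f}")
```

The sigmoid product is already below one in a million after ten layers, while the ReLU product stays at 1 for units that are active.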

4
Q

What is the role of fully connected layers in a CNN?

A

Fully connected layers learn non-linear combinations of the high-level features extracted by the convolutional layers and map them to the final output, such as class scores.

5
Q

What are some drawbacks of using CNNs on image datasets?

A

CNNs require a lot of labeled data and can be susceptible to spurious patterns. Transfer learning and data augmentation can address these issues.

6
Q

What text pre-processing steps are required for sentiment analysis of Twitter posts?

A

1) Lowercasing 2) Stemming 3) Lemmatization 4) Stop words removal 5) Noise removal 6) Remove emoticons.
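
A rough sketch of three of these steps (lowercasing, noise and emoticon removal, stop-word removal) using only the standard library. The stop-word list and the `clean_tweet` helper are illustrative assumptions; stemming and lemmatization are left out because they normally rely on an NLP library.

```python
import re

# Tiny illustrative stop-word list; real projects use a much longer one.
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "it", "this"}

def clean_tweet(text):
    """Lowercase, strip URLs/mentions/symbols, and drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits, punctuation, emoticons
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = clean_tweet("Loving the new phone :) thanks @Acme!! https://example.com #happy")
print(tokens)  # ['loving', 'new', 'phone', 'thanks', 'happy']
```

Here the `#` is stripped but the hashtag word is kept, on the assumption that hashtag text often carries sentiment.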

7
Q

Which evaluation metric is suitable for sentiment analysis?

A

Precision, Recall, F-score, and Accuracy are suitable metrics. The F1 score is often preferred because sentiment datasets tend to be class-imbalanced, which makes accuracy misleading.
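
As an illustration of how accuracy and F1 can diverge under imbalance, here is a small sketch with made-up labels (2 positives out of 10). The `classification_metrics` helper is a hypothetical name, not a library function.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up imbalanced data: the model finds only one of the two positives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy comes out at 0.9 even though half of the positive class is missed; the F1 score of about 0.67 reflects that.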

8
Q

Which neural network configuration is likely to perform better on complex tasks?

A

The network with four hidden layers of four nodes each is likely to perform better due to its hierarchical learning capability.

9
Q

What are the pros and cons of Batch Gradient Descent vs Stochastic Gradient Descent?

A

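Batch Gradient Descent computes each update from the full dataset, so convergence is smooth and stable, but every step is computationally expensive on large datasets. Stochastic Gradient Descent updates after each sample, so each step is much cheaper and training is often faster, but the updates are noisy and less stable.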
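
The two update rules can be sketched on a toy problem: fitting the slope w in y = w * x. The data, learning rate, and epoch count below are assumptions chosen so that both variants converge; real data would be noisy and SGD would jitter around the optimum.

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]  # noise-free toy data, true slope is 2
lr = 0.05                   # assumed learning rate
epochs = 20

def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

# Batch gradient descent: one update per epoch, averaged over all samples.
w_batch = 0.0
for _ in range(epochs):
    g = sum(grad(w_batch, x, y) for x, y in zip(xs, ys)) / len(xs)
    w_batch -= lr * g

# Stochastic gradient descent: one update per sample, in shuffled order.
rng = random.Random(0)
data = list(zip(xs, ys))
w_sgd = 0.0
for _ in range(epochs):
    rng.shuffle(data)
    for x, y in data:
        w_sgd -= lr * grad(w_sgd, x, y)

print(w_batch, w_sgd)
```

Batch descent makes 20 updates that each read all four samples; SGD makes 80 updates that each read one.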

10
Q

How is AUC different from ROC?

A

ROC is a curve that plots a model’s true positive rate against its false positive rate at different classification thresholds, while AUC is a single number: the area under the ROC curve.
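
To make the distinction concrete, here is a small pure-Python sketch that builds the ROC points for made-up scores and then integrates them with the trapezoidal rule. The helper names `roc_points` and `auc` are illustrative; in practice a library routine would normally be used.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up model scores

curve = roc_points(y_true, scores)  # the ROC curve: a list of points
area = auc(curve)                   # the AUC: one number
print(curve, area)
```

The curve is the full list of threshold trade-offs; the AUC (0.75 here) summarizes it in one value.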

11
Q

How do k-means and hierarchical clustering differ?

A

K-Means is centroid-based and requires a predefined number of clusters, while hierarchical clustering is connectivity-based and does not need the number of clusters up front: it builds a dendrogram that can be cut at any level.

12
Q

Why are decision trees prone to overfitting?

A

Decision trees aim for homogeneity in leaf nodes, leading to capturing noise and complex patterns in training data, which results in overfitting.

13
Q

How does multicollinearity affect linear regression models?

A

Multicollinearity makes coefficient estimates unstable and inflates their variance, which affects the interpretation of the model: it becomes difficult to distinguish the individual effects of correlated independent variables.

14
Q

Which evaluation metric should be used for linear regression with outliers?

A

MAE is preferred as it is more robust to outliers than MSE or RMSE, which square the errors and therefore amplify large ones.
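
A quick numeric sketch of that sensitivity, using made-up values in which every prediction is off by 1 and then one target is replaced by an outlier.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_pred = [11, 11, 12, 12, 13]
y_clean = [10, 12, 11, 13, 12]    # every error has magnitude 1
y_outlier = [10, 12, 11, 13, 52]  # last target is an outlier (error of 39)

mae_clean, rmse_clean = mae(y_clean, y_pred), rmse(y_clean, y_pred)
mae_outlier, rmse_outlier = mae(y_outlier, y_pred), rmse(y_outlier, y_pred)

print(mae_clean, rmse_clean)      # both 1.0
print(mae_outlier, rmse_outlier)  # MAE rises to 8.6, RMSE to about 17.5
```

One outlier moves both metrics, but RMSE roughly twice as far as MAE.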

15
Q

What is the difference between r-squared and adjusted r-squared?

A

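R-squared measures the proportion of variance in the dependent variable that the independent variables explain, and it never decreases when a variable is added. Adjusted r-squared accounts for the number of variables and penalizes those that do not improve the fit.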
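
The penalty can be seen in the standard formula, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The numbers below are made up for illustration.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

r2 = 0.80  # assumed plain R-squared, held fixed for comparison
n = 50     # assumed number of observations

adj_few = adjusted_r2(r2, n, n_predictors=5)    # about 0.777
adj_many = adjusted_r2(r2, n, n_predictors=20)  # about 0.662
print(adj_few, adj_many)
```

With the same R², the model that needed 20 predictors is scored lower than the one that needed 5.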

16
Q

How do we measure performance of a supervised learning model?

A

Performance metrics include R², adjusted R², RMSE, and MAE for regression, and Accuracy, Precision, Recall, and F1-Score for classification.

17
Q

How do we measure performance of an unsupervised learning model?

A

Performance metrics include the silhouette score (for any clustering) and the cophenetic correlation (for hierarchical clustering).

18
Q

Can you plot 3D plots using matplotlib?

A

Yes, by creating axes with a 3D projection: import numpy as np; import matplotlib.pyplot as plt; fig = plt.figure(); ax = plt.axes(projection='3d').

19
Q

How is get_dummies() different from one hot encoder?

A

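pandas.get_dummies converts string columns of a DataFrame into one-hot columns in a single call, but it does not remember the categories it saw. OneHotEncoder from sklearn is fitted once and then reused, so it applies the same encoding to new data and fits into a Pipeline. Older scikit-learn versions (before 0.20) could not process strings directly and required mapping to integers first; current versions accept strings.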
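
A short side-by-side sketch on a made-up `color` column, assuming pandas and a reasonably recent scikit-learn are installed. `handle_unknown="ignore"` is used here to show how a fitted encoder treats a category it never saw.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# pandas: one call, columns named after the categories (sorted).
dummies = pd.get_dummies(df["color"]).astype(int)

# scikit-learn: fit once, then reuse the same mapping on new data.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["color"]]).toarray()

# A category not seen during fit becomes an all-zero row.
unseen = encoder.transform(pd.DataFrame({"color": ["purple"]})).toarray()

print(dummies)
print(encoder.categories_)
print(unseen)
```

Both produce the same matrix for the training data; the difference shows up when the encoding has to be repeated on new rows.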

20
Q

Name a tool to convert categorical columns into numeric columns.

A

LabelEncoder (intended for target labels) and OneHotEncoder (for input features) from sklearn are popular tools for this purpose.

21
Q

How will you remove duplicate data from a dataframe?

A

Use the drop_duplicates() method, which returns a DataFrame with duplicate rows removed.
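
For instance, on a small made-up DataFrame (column names are illustrative), duplicates can be judged on whole rows or on a subset of columns:

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["ann", "bob", "ann", "ann"],
    "city": ["Oslo", "Rome", "Oslo", "Paris"],
})

# Whole-row duplicates: row 2 repeats row 0 and is dropped.
unique_rows = df.drop_duplicates()

# Duplicates judged on one column only, keeping the first occurrence.
unique_users = df.drop_duplicates(subset="user", keep="first")

print(unique_rows)
print(unique_users)
```

The original `df` is left unchanged; assign the result or pass `inplace=True` to keep the change.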

22
Q

How do you select a sample from a dataframe?

A

Use df.sample() for a single random row, df.sample(n=3) for three random rows, df.sample(n=3, replace=True) to sample with replacement (allowing duplicates), or df.sample(frac=0.50) for a random 50% of the rows.
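
A brief sketch of these variants on a made-up ten-row DataFrame. `random_state` is added here only to make the draws repeatable.

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})

three_rows = df.sample(n=3, random_state=0)                    # 3 distinct rows
with_repeats = df.sample(n=15, replace=True, random_state=0)   # n may exceed len(df)
half = df.sample(frac=0.5, random_state=0)                     # 50% of the rows

print(len(three_rows), len(with_repeats), len(half))
```

Asking for more rows than the DataFrame holds only works with `replace=True`.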

23
Q

How does the groupby function work in Python?

A

The pandas groupby() method splits a DataFrame into groups based on one or more keys, applies a function such as an aggregation to each group, and combines the results (split-apply-combine).
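
A minimal example with made-up `team` and `score` columns:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["a", "b", "a", "b", "a"],
    "score": [10, 20, 30, 40, 20],
})

# Split by team, apply mean to each group, combine into one Series.
mean_score = df.groupby("team")["score"].mean()

# Several aggregations at once.
summary = df.groupby("team")["score"].agg(["count", "sum"])

print(mean_score)
print(summary)
```

The group keys become the index of the result unless `as_index=False` is passed.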

24
Q

What are the parameters for the groupby function in Python?

A

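Parameters of pandas DataFrame.groupby include: by, level, as_index, sort, group_keys, observed, and dropna. Older versions also accepted squeeze (removed in pandas 2.0) and axis (deprecated in later releases).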

25
How do you check the distribution of data in Python?
A simple and commonly used plot to quickly check the distribution of a sample of data is the histogram. Example: from matplotlib import pyplot; pyplot.hist(data); pyplot.show()
26
Which libraries in SciPy have you worked with in your project?
SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, etc. Subpackages include: scipy.cluster, scipy.constants, scipy.fft (scipy.fftpack is the legacy version), scipy.integrate, scipy.interpolate, scipy.linalg, scipy.io, scipy.ndimage, scipy.odr, scipy.optimize, scipy.signal, scipy.sparse, scipy.spatial, scipy.special, and scipy.stats. The old scipy.weave subpackage has been removed from SciPy.
27
How is the Python series different from a single column dataframe?
A pandas Series is the data structure for a single column of a DataFrame. It is a one-dimensional labeled array that can hold any data type and has an index and an optional name, whereas a single-column DataFrame is two-dimensional and keeps its column label.
28
What does the function zip() do?
The zip() function takes iterables and returns an iterator of tuples, where the i-th tuple holds the i-th element from each iterable. It stops at the shortest iterable. Syntax: zip(*iterables)
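A short illustration with made-up lists, showing that the result is a lazy iterator, that it stops at the shortest input, and that `zip(*...)` reverses the pairing:

```python
names = ["ann", "bob", "cy"]
ages = [31, 25, 40]

pairs = zip(names, ages)   # lazy iterator, nothing computed yet
pair_list = list(pairs)    # [('ann', 31), ('bob', 25), ('cy', 40)]
exhausted = list(pairs)    # [] because the iterator is already consumed

shortest = list(zip([1, 2, 3], "ab"))  # [(1, 'a'), (2, 'b')]

# Unzip: split the pairs back into two tuples.
unzipped_names, unzipped_ages = zip(*pair_list)

print(pair_list, shortest, unzipped_names, unzipped_ages)
```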
29
Can lambda function be used within a user-defined function?
Yes, a lambda function can be used as an anonymous function within another function.
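Two common patterns, sketched with hypothetical helper names: returning a lambda from a function, and passing one as a sort key inside a function.

```python
def make_scaler(factor):
    """Return an anonymous function that multiplies its input by factor."""
    return lambda x: x * factor

def sort_by_length(words):
    """Sort words by length, using a lambda as the sort key."""
    return sorted(words, key=lambda w: len(w))

double = make_scaler(2)
doubled = double(21)                                 # 42
ordered = sort_by_length(["pear", "fig", "banana"])  # ['fig', 'pear', 'banana']

print(doubled, ordered)
```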
30
What does [::-1] do in Python?
[::] produces a copy of all the elements in order, while [::-1] produces a copy of all the elements in reverse order.
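For example, on a list and a string (slicing always returns a new object and leaves the original untouched):

```python
values = [1, 2, 3, 4]

forward_copy = values[::]     # [1, 2, 3, 4], a new list
reversed_copy = values[::-1]  # [4, 3, 2, 1]
every_other_back = values[::-2]  # [4, 2], step of -2 from the end
reversed_text = "abc"[::-1]   # 'cba'

print(forward_copy, reversed_copy, every_other_back, reversed_text)
```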
31
How do you check missing values in a dataframe using Python?
The pandas isnull() method (an alias of isna()) detects missing values, returning a Boolean object of the same shape that marks which values are NA. Chaining .sum(), as in df.isnull().sum(), counts the missing values per column.
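A small sketch on a made-up DataFrame with one missing value per column:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31],
    "city": ["Oslo", "Rome", None],
})

mask = df.isnull()                                # Boolean DataFrame, same shape as df
missing_per_column = df.isnull().sum()            # count of NA values per column
rows_with_missing = df[df.isnull().any(axis=1)]   # rows containing at least one NA

print(missing_per_column)
print(rows_with_missing)
```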
32
Explain a scenario where negative indices are used in Python.
Negative indexing allows access to elements from the end of a sequence such as a list, tuple, or string, without knowing its length. For example, -1 gives the last element, and -2 gives the second last element.
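Two typical scenarios, with made-up data: reading the most recent values of a series, and taking the last part of a split string.

```python
readings = [12.1, 12.4, 12.9, 13.3, 13.8]

latest = readings[-1]        # 13.8
previous = readings[-2]      # 13.3
last_three = readings[-3:]   # [12.9, 13.3, 13.8]

filename = "report.final.csv"
extension = filename.split(".")[-1]  # 'csv'

print(latest, previous, last_three, extension)
```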
33
What different methods can be used to standardize the data using Python?
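Scikit-learn offers MinMaxScaler, StandardScaler, MaxAbsScaler, RobustScaler, QuantileTransformer, PowerTransformer, and Normalizer (unit-vector scaling). Strictly, only StandardScaler standardizes to zero mean and unit variance; the others rescale or transform the data in other ways.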
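The arithmetic behind the two most common choices can be sketched in plain Python. The helper names are illustrative, and the population standard deviation is used here, which is what a zero-mean, unit-variance scaler typically applies.

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0]  # made-up feature values

def min_max_scale(xs):
    """Rescale to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(xs)
    std = statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

scaled = min_max_scale(values)      # first value 0.0, last value 1.0
standardized = standardize(values)  # mean 0, standard deviation 1

print(scaled)
print(standardized)
```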
34
How would you define a block in Python?
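A block is a group of one or more statements that run as a unit, such as the body of a function, loop, or if statement. Python delimits blocks by indentation: consecutive statements at the same indentation level belong to the same block.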
35
How do you do up-sampling of data? Name a Python function or explain the code.
Up-sampling is the process of randomly duplicating observations from the minority class to balance the classes. A common way is to resample with replacement. Module for resampling: from sklearn.utils import resample
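A minimal sketch of that approach on a made-up imbalanced DataFrame (8 rows of class 0, 2 rows of class 1), assuming pandas and scikit-learn are installed. Column names are illustrative.

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Draw minority rows with replacement until they match the majority count.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=42,  # fixed seed so the draw is repeatable
)

balanced = pd.concat([majority, minority_upsampled])
counts = balanced["label"].value_counts()
print(counts)
```

Up-sampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.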
36
What is machine learning?
Machine learning is a branch of AI that focuses on using data and algorithms to mimic human learning, improving by learning from past events.
37
What is the mathematical outcome of any machine learning model building exercise?
Machine learning models summarize patterns in data by establishing relationships between predictors and predicted values, often represented as equations.
38
How do you address a performance gap in a machine learning model?
To address a performance gap between training and test data, which signals overfitting, apply regularization techniques such as ridge regression and lasso regression, or pruning for decision trees.