Random Question Flashcards

1
Q

What are the commonly used programming languages in data science?

A

Python, R, and SQL.

2
Q

Fill in the blank: A __________ is a combination of data, algorithms, and machine learning techniques used to make predictions.

A

predictive model

3
Q

What is overfitting in machine learning?

A

When a model learns the training data too well, capturing noise instead of the underlying pattern.

4
Q

Which of the following is a common metric for evaluating classification models? A) Mean Absolute Error B) Accuracy C) R-squared

A

B) Accuracy

5
Q

What is the purpose of cross-validation?

A

To assess how the results of a statistical analysis will generalize to an independent data set.
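As a rough illustration of the idea, k-fold cross-validation can be sketched in plain Python: the sample indices are split into k folds and each fold serves once as the held-out set. The function name `k_fold_indices` and the way leftover samples are spread over the first folds are assumptions made for this sketch.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread samples as evenly as possible: the first (n_samples % k) folds get one extra.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Every sample appears in exactly one test fold.
folds = list(k_fold_indices(10, 3))
```

A model would be trained on each `train` part and scored on the matching `test` part; averaging the k scores gives the generalization estimate.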

6
Q

What does ETL stand for in data processing?

A

Extract, Transform, Load.

7
Q

True or False: Feature engineering is the process of selecting, modifying, or creating features to improve model performance.

A

True

8
Q

What is the difference between supervised and unsupervised learning?
Give examples of supervised and unsupervised algorithms.

A

Supervised learning uses labeled data to train models, while unsupervised learning uses unlabeled data.

Supervised learning has a feedback mechanism: predictions are compared with the known labels.
Supervised: decision trees, SVM.
Unsupervised: k-means, hierarchical clustering.

9
Q

What is a confusion matrix?

A

A table used to evaluate the performance of a classification model by comparing predicted and actual outcomes.
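A minimal sketch of how the four cells of a binary confusion matrix could be counted from two label lists. The function name `confusion_counts` and the example labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p == positive:
            fp += 1   # predicted positive, actually negative
        elif a == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Hypothetical labels, for illustration only.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(actual, predicted)
```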

10
Q

Fill in the blank: The __________ is a statistical measure that represents the likelihood of an event occurring.

A

probability

11
Q

What is the purpose of a data pipeline?

A

To automate and streamline the process of data collection, transformation, and storage.

12
Q

Which algorithm is commonly used for regression tasks?

A

Linear regression.

13
Q

What is the significance of p-values in hypothesis testing?

A

P-values indicate the probability of observing the data, or something more extreme, under the null hypothesis.

14
Q

True or False: Data visualization is an important part of data analysis in data science.

A

True

15
Q

What is the purpose of the ‘train-test split’ in machine learning?

A

To evaluate the performance of a model on unseen data.

16
Q

Name one common library used for data manipulation in Python.

A

Pandas.

17
Q

What does the term ‘big data’ refer to?

A

Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations.

18
Q

Fill in the blank: __________ learning is a subset of artificial intelligence focused on teaching computers to learn from data without being explicitly programmed.

A

Machine

19
Q

What is the purpose of regularization in machine learning?

A

To prevent overfitting by adding a penalty for larger coefficients.

20
Q

What is a common use case for clustering algorithms?

A

Market segmentation.

21
Q

What does ‘data wrangling’ involve?

A

Cleaning and transforming raw data into a usable format.

22
Q

Which of the following is a regression algorithm? A) K-means B) Decision Trees C) Naive Bayes

A

B) Decision Trees

23
Q

True or False: Dimensionality reduction techniques are used to reduce the number of features in a dataset.

A

True

24
Q

What is the role of a data scientist?

A

To analyze and interpret complex data to help organizations make informed decisions.

25
What is the main advantage of using ensemble methods in machine learning?
They combine multiple models to improve predictive performance.
26
Fill in the blank: A __________ is a graphical representation of the distribution of numerical data.
histogram
27
What are outliers?
Data points that differ significantly from other observations in a dataset.
28
What is the purpose of exploratory data analysis (EDA)?
To summarize the main characteristics of a dataset, often using visual methods.
29
What is the difference between batch and online learning?
Batch learning trains on the whole dataset at once, while online learning updates the model incrementally, one instance (or small batch) at a time.
30
How does logistic regression work?
Data -> linear model -> sigmoid function -> probability between 0 and 1 -> threshold classifier
31
Formula of the sigmoid function
P = 1 / (1 + exp(-y)), where y = ax + b
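The sigmoid formula and the thresholding step can be sketched as follows for a single feature. The coefficient values, the 0.5 threshold, and the helper name `predict` are illustrative assumptions, not fitted values.

```python
import math

def sigmoid(y):
    """Map any real score y to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def predict(x, a, b, threshold=0.5):
    """Linear score -> sigmoid -> probability -> class label."""
    y = a * x + b                      # linear model
    p = sigmoid(y)                     # probability of the positive class
    return p, int(p >= threshold)      # threshold classifier

# Hypothetical coefficients, chosen only for illustration.
p, label = predict(x=2.0, a=1.5, b=-2.0)
```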
32
What are the steps to build a decision tree?
1) Calculate the entropy of the target and of each predictor attribute. 2) Calculate the information gain of each attribute. 3) Make the feature with the highest information gain the root. 4) Repeat for each branch.
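The entropy and information-gain steps can be sketched on a toy dataset. The feature names and values below are invented for the example, and the functions are a minimal illustration rather than a full tree builder.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the target minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Toy data (hypothetical): "outlook" separates the labels perfectly, "windy" not at all.
labels = ["yes", "yes", "no", "no"]
features = {
    "outlook": ["sunny", "sunny", "rain", "rain"],
    "windy":   ["t", "f", "t", "f"],
}
gains = {name: information_gain(values, labels) for name, values in features.items()}
root = max(gains, key=gains.get)  # feature with the highest information gain
```

The same selection would then be repeated on each subset of rows below the root.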
33
How do you build a random forest?
Randomly select k features out of m (k < m). Calculate the node d using the best split among them. Repeat for the daughter nodes. Repeat with another random selection of k to build the next tree.
34
How do you avoid overfitting?
1) Keep the model simple. 2) Detect it via cross-validation. 3) Use regularization. 4) Add more data or apply feature selection. 5) Use early stopping.
35
What are the feature selection methods?
Filter methods: LDA, chi-square, ANOVA. Wrapper methods: forward feature selection, backward feature elimination, recursive feature elimination (the other two consider one feature at a time).
36
Why use dimensionality reduction?
Less storage space, less computational power, and removal of redundant features.
37
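Calculate the eigenvalues and eigenvectors of the 3x3 matrix with rows (-2, -4, 2), (-2, 1, 2), (4, 2, 5).
Characteristic equation: λ^3 - 4λ^2 - 27λ + 90 = 0, which factors as (λ - 3)(λ + 5)(λ - 6), so the eigenvalues are 3, -5, and 6.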
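A quick way to sanity-check the characteristic polynomial is to evaluate it, and det(A - λI), at candidate roots. The sketch assumes the nine numbers are read row by row as a 3x3 matrix. The helper names and the eigenvector tried for λ = 3 are worked out for this example and are not part of the card.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

A = [[-2, -4, 2],
     [-2,  1, 2],
     [ 4,  2, 5]]

def char_poly(lam):
    """The card's polynomial: lambda^3 - 4*lambda^2 - 27*lambda + 90."""
    return lam**3 - 4 * lam**2 - 27 * lam + 90

def det_shifted(lam):
    """det(A - lam * I); zero exactly when lam is an eigenvalue."""
    shifted = [[A[r][c] - (lam if r == c else 0) for c in range(3)] for r in range(3)]
    return det3(shifted)

eigenvalues = [lam for lam in range(-10, 11) if char_poly(lam) == 0]

# Check one eigenvector by hand: A v should equal 3 v for v = (2, -3, -1).
v = [2, -3, -1]
Av = [sum(A[r][c] * v[c] for c in range(3)) for r in range(3)]
```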
38
What are the main types of recommender systems?
Collaborative filtering and content-based filtering.
39
How do you select k for k-means?
Calculate the WSS (within-cluster sum of squares), the sum of squared distances between each cluster's centroid and its members, for several values of k, and look for the elbow in the curve (elbow method).
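To make the WSS quantity concrete, here is a small sketch for one-dimensional points. The cluster assignments are hand-picked for illustration instead of being produced by an actual k-means run, and the function name `wss` is an assumption.

```python
def wss(clusters):
    """Within-cluster sum of squares for a list of clusters of 1-D points."""
    total = 0.0
    for points in clusters:
        centroid = sum(points) / len(points)
        total += sum((x - centroid) ** 2 for x in points)
    return total

# Hypothetical data with two obvious groups.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
wss_by_k = {
    1: wss([data]),
    2: wss([[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]),
    3: wss([[1.0, 2.0, 3.0], [10.0, 11.0], [12.0]]),
}
```

The drop from k = 1 to k = 2 is large and the drop from k = 2 to k = 3 is small, which is the "elbow" that suggests k = 2 for this toy data.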
40
How do you treat outliers?
Remove them if they are garbage. Normalize the data. Use another model. Use an algorithm that is robust to outliers, such as random forest.
41
Precision
TP / (TP + FP)
42
Recall
TP / (TP + FN)
43
Entropy formula (impurity level)
-sum(p * log2(p))
44
TPR (true positive rate)
TP / (TP + FN)
45
FPR (false positive rate)
FP / (FP + TN)
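The ratios on the preceding cards (precision, recall, TPR, FPR) can be computed from the four confusion-matrix counts, as in this sketch. The counts are invented for illustration and the function name `rates` is an assumption.

```python
def rates(tp, fp, fn, tn):
    """Compute precision, recall/TPR and FPR from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # recall and TPR are the same ratio
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Hypothetical counts.
r = rates(tp=40, fp=10, fn=20, tn=30)
```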
46
Difference between logistic and linear regression
The output is categorical vs. continuous.
47
Bagging vs. boosting
Bagging aims to reduce variance in a noisy dataset: split the data into bootstrap samples, train a model on each, and average the results. Boosting is ensemble learning that strengthens weak models by learning from the errors of the previous ones, e.g. gradient boosting (at the risk of overfitting).
48
F1 score
2 x precision x recall / (precision + recall)
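A short sketch of the F1 formula as the harmonic mean of precision and recall. Returning 0.0 when both inputs are zero is a convention chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision and recall values.
score = f1_score(0.8, 0.6)
```

Because it is a harmonic mean, F1 sits closer to the smaller of the two inputs than a plain average would.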
49
Assumptions of linear regression
A linear relationship between the features and y; independence of the observations.
50
What is logistic regression?
A predictive analysis that finds the relationship between a binary dependent variable and independent features using the logistic regression equation.
51
What is a decision tree?
A tool to classify data and determine the probabilities of the outcomes of a system. The base is a root node, which branches into decision nodes and ends in leaf nodes.
52
Pruning the decision tree
Eliminate leaves to avoid overfitting, using the Gini index.
53
Error vs. residual
Error: observed value - true value. Residual: observed value - estimated value.
54
Ensemble learning
Multiple models are used to improve predictive performance.
55
Naive Bayes
A classification algorithm that assumes the features are independent of each other given the class.
56
SVM
A prediction and classification method that uses hyperplanes to separate two classes.
57
Law of large numbers
To get the expected result, one should run the experiment a large number of times: the average of the outcomes approaches the expected value.
58
Confounding variable
A variable that affects both the supposed cause and the effect.
59
Does gradient descent always converge to the same point?
No. There can be several local optima, so the result depends on the starting point.
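A small sketch of why the answer is no: on a function with two local minima, plain gradient descent ends up in a different one depending on where it starts. The function f(x) = (x^2 - 1)^2 + 0.3x, the learning rate, and the step count are choices made for this illustration.

```python
def grad(x):
    """Derivative of f(x) = (x**2 - 1)**2 + 0.3 * x, which has two local minima."""
    return 4 * x * (x**2 - 1) + 0.3

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = gradient_descent(-2.0)   # settles in the minimum near x = -1
right = gradient_descent(2.0)   # settles in the minimum near x = +1
```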
60
Binomial formula
P(X = x) = n! / ((n - x)! x!) * p^x * q^(n - x), where q = 1 - p
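The binomial formula can be evaluated with the standard library's `math.comb`, which computes n! / ((n - x)! x!). The function name `binomial_pmf` and the coin-flip numbers are illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

# Example: exactly 2 heads in 4 fair coin flips.
prob = binomial_pmf(2, 4, 0.5)
```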
61
Type I error
False positive
62
Type II error
False negative
63
L1 vs. L2 regularization
L1: penalty of lambda * the absolute values of the weights; leads to sparse models with weights at or near zero; good for high-dimensional data with irrelevant features. L2: penalty of lambda * the squared weights; prevents overfitting without eliminating features; works well with correlated features.
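The two penalty terms can be written out directly. This sketch only computes the penalties added to a loss, not a full training loop. The weight vector and the lambda value are made up for illustration.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge-style penalty: lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

# Hypothetical weights and regularization strength.
weights = [0.5, -2.0, 0.0, 1.5]
l1 = l1_penalty(weights, lam=0.1)
l2 = l2_penalty(weights, lam=0.1)
```

The squared term punishes the large weight (-2.0) far more than the small one (0.5), while the absolute-value term grows at the same rate for every weight, which is one way to see why L1 tends to push small weights all the way to zero.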
64
Feature scaling
Min-max scaling: (x - min) / (max - min). Z-score transformation: (x - mean) / std. Log transformation. Robust scaling.
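Min-max scaling and the z-score transformation, sketched with the standard library. Using the population standard deviation (`pstdev`) is an assumption of this example; some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_score(values):
    """Center values on 0 with unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

# Hypothetical feature values.
data = [2.0, 4.0, 6.0, 8.0]
scaled = min_max_scale(data)
standardized = z_score(data)
```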
65
How do you deal with outliers?
Detect them by visualization or statistical methods (z-score, IQR), then: 1) removal, 2) transformation, 3) capping, 4) investigation.
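One of the statistical methods named here, the IQR rule, could be sketched like this. The 1.5 multiplier is the usual convention, and the data and the function name `iqr_outliers` are invented for the example.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(data)
```

What happens to the flagged values (removal, transformation, capping, or investigation) remains a separate decision.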
66
One-hot encoding
Transforms categories into binary columns.
67
How do you deal with categorical values?
One-hot encoding (0, 1). Label encoding (1, 2, 3, ...). Target encoding, which uses the mean of the target. Frequency encoding, which replaces the category with its frequency. Domain-specific encoding, i.e. encoded based on, for example, the distance from a central point.
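Three of these encodings, sketched in plain Python on a toy column. Ordering categories alphabetically and starting label codes at 1 are choices made for this example, and the function names are assumptions.

```python
from collections import Counter

def one_hot(values):
    """One binary column per category (categories in sorted order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def label_encode(values):
    """Map each category to an integer code 1, 2, 3, ..."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values]

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    return [counts[v] / len(values) for v in values]

# Hypothetical categorical column.
colors = ["red", "green", "red", "blue"]
encoded_one_hot = one_hot(colors)
encoded_labels = label_encode(colors)
encoded_freq = frequency_encode(colors)
```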
68
Bias-variance tradeoff
Bias: error introduced by oversimplification, which underfits the data. Variance: error from model sensitivity, which overfits the data.