Data Scientist Interview Flashcards
What is variance?
Variance measures data spread by averaging squared differences from the mean. Higher variance means more spread.
Its square root is the standard deviation, which is in the same units as the data. Sample variance divides by n - 1 instead of n to correct for bias.
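A minimal sketch of the calculation, using a small made-up dataset. It computes the population variance (divide by n) and the sample variance (divide by n - 1) by hand, then checks them against the standard library's `statistics` module.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only

mean = sum(data) / len(data)
squared_diffs = [(x - mean) ** 2 for x in data]

# Population variance: average of the squared differences.
population_variance = sum(squared_diffs) / len(data)

# Sample variance: divide by n - 1 instead of n.
sample_variance = sum(squared_diffs) / (len(data) - 1)

print(population_variance, statistics.pvariance(data))
print(sample_variance, statistics.variance(data))
```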
Difference between mean and median?
Mean is the average. Median is the middle value. Median is less affected by outliers.
The mean can be skewed by extreme values, while the median provides a better measure of central tendency in such cases.
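A quick illustration with made-up salary figures (in thousands): one extreme value pulls the mean far from the typical value, while the median barely moves.

```python
import statistics

salaries = [40, 42, 45, 47, 50]          # illustrative values only
with_outlier = salaries + [1000]         # add one extreme salary

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # 44.8 45
print(mean_after, median_after)    # 204 46.0
```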
How do you detect outliers?
Use Z-scores (flag |z| > 3), the IQR rule (flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR), or visualize with boxplots and histograms.
Outlier detection is crucial for data cleaning and ensuring the accuracy of statistical analyses.
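Both numeric rules can be sketched with the standard library. The data is made up, the function names are hypothetical, and the quartile method ("inclusive") is one of several conventions, so other tools may report slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]  # illustrative values only

print(iqr_outliers(data))     # [95]
print(zscore_outliers(data))  # [95]
```

On very small samples the z-score rule can miss outliers, because the extreme value itself inflates the standard deviation; the IQR rule is less sensitive to that.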
Explain Bayes’ Theorem
It updates probabilities based on new evidence: P(A|B) = P(B|A) * P(A) / P(B).
Bayes’ Theorem is fundamental in statistics and machine learning for making inferences based on prior knowledge.
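A worked example of the formula, with made-up numbers: a condition with 1% prevalence, and a test with 95% sensitivity and a 5% false positive rate. The function name and parameters are hypothetical. P(B) is expanded with the law of total probability.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B).

    prior               = P(A)
    likelihood          = P(B|A)
    false_positive_rate = P(B|not A)
    """
    # P(B) = P(B|A) * P(A) + P(B|not A) * P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p_condition_given_positive = posterior(
    prior=0.01, likelihood=0.95, false_positive_rate=0.05
)
print(round(p_condition_given_positive, 3))  # 0.161
```

Even with an accurate test, a positive result here implies only about a 16% chance of having the condition, because the prior is so low.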
When to use T-test vs. Chi-square vs. ANOVA?
T-test: 2 means, Chi-square: categorical independence, ANOVA: 3+ means.
These tests are used to compare means or distributions in various scenarios, depending on the number of groups and data types.
What’s a p-value?
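The probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true. p < 0.05 is a common (but arbitrary) threshold for statistical significance.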
P-values are used in hypothesis testing to determine the strength of the evidence against the null hypothesis.
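A small sketch of what "at least as extreme" means, assuming a made-up experiment: 9 heads in 10 coin flips, with the null hypothesis that the coin is fair. The function name is hypothetical, and the p-value is one-sided (9 or more heads).

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

p_value = binomial_p_value(successes=9, trials=10)
print(p_value)         # 11/1024, about 0.0107
print(p_value < 0.05)  # True
```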
What is Multicollinearity, and how to detect it?
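When two or more features are highly correlated with each other. Detect it with the variance inflation factor (VIF above 5, or above 10 by a looser rule of thumb) or with a correlation matrix.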
Multicollinearity can affect the stability of regression coefficients and complicate interpretations.
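A sketch of VIF for the simplest case of exactly two features, where VIF = 1 / (1 - r^2) and r is their Pearson correlation. With more features, each VIF instead uses the R^2 from regressing that feature on all the others, which this sketch does not cover. The data and function names are made up.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def vif_two_features(x1, x2):
    """VIF when the model has exactly two features (undefined if |r| == 1)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r**2)

area_m2 = [50, 60, 80, 100, 120]   # illustrative values only
rooms = [2, 2, 3, 4, 4]            # strongly related to area
noise = [2, 1, 3, 1, 2]            # unrelated to position 1..5

vif_correlated = vif_two_features(area_m2, rooms)
vif_uncorrelated = vif_two_features([1, 2, 3, 4, 5], noise)
print(round(vif_correlated, 2))    # well above 5
print(round(vif_uncorrelated, 2))  # 1.0
```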
How do you handle imbalanced datasets?
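Resample (oversample the minority class or undersample the majority class), apply class weights, and evaluate with metrics such as F1-score instead of accuracy. Tree-based models are also a common choice.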
Addressing imbalance is essential for improving model performance and ensuring fair predictions.
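To see why accuracy misleads on imbalanced data, consider a made-up dataset with 95 negatives and 5 positives, and a "model" that always predicts negative. The helper name is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)

y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100

accuracy, f1 = accuracy_and_f1(y_true, always_negative)
print(accuracy, f1)  # 0.95 0.0
```

The classifier scores 95% accuracy while never finding a single positive case, which the F1-score of 0 makes obvious.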
What is a window function?
A function that computes a value for each row over a set of related rows (the window, defined with OVER), without collapsing them into one row as GROUP BY does. Examples: RANK, LEAD, LAG.
Difference between RANK, DENSE_RANK, and ROW_NUMBER?
RANK skips ranks on ties, DENSE_RANK does not, ROW_NUMBER gives unique numbers.
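The three ranking functions can be compared side by side on a tie. This sketch runs the SQL through Python's built-in `sqlite3` module and assumes the bundled SQLite is version 3.25 or newer (the first with window functions). The table and names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("Ana", 90), ("Ben", 90), ("Cal", 80), ("Dee", 70)],
)

rows = conn.execute(
    """
    SELECT name,
           score,
           RANK()       OVER (ORDER BY score DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY score DESC)       AS dense_rnk,
           -- name is added as a tie-breaker so the numbering is deterministic
           ROW_NUMBER() OVER (ORDER BY score DESC, name) AS row_num
    FROM scores
    ORDER BY score DESC, name
    """
).fetchall()

for row in rows:
    print(row)
# ('Ana', 90, 1, 1, 1)
# ('Ben', 90, 1, 1, 2)
# ('Cal', 80, 3, 2, 3)
# ('Dee', 70, 4, 3, 4)
```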
What are CTEs and subqueries?
CTEs (Common Table Expressions) are named temporary result sets defined with WITH; they improve readability and can be recursive. Subqueries are queries nested inside another query.
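The sketch below writes the same query once as a CTE and once as a subquery, then shows a recursive CTE. It uses Python's built-in `sqlite3` module; the table, data, and names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Ana", 80), ("Ana", 40), ("Ben", 30), ("Cal", 150)],
)

# CTE: the intermediate result gets a name and is defined up front.
cte_rows = conn.execute(
    """
    WITH totals AS (
        SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer
    )
    SELECT customer FROM totals WHERE total > 100 ORDER BY customer
    """
).fetchall()

# Subquery: the same logic nested inside the FROM clause.
subquery_rows = conn.execute(
    """
    SELECT customer
    FROM (SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer)
    WHERE total > 100
    ORDER BY customer
    """
).fetchall()

# Recursive CTE: generates the numbers 1 through 5.
counter_rows = conn.execute(
    """
    WITH RECURSIVE counter(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM counter WHERE n < 5
    )
    SELECT n FROM counter
    """
).fetchall()

print(cte_rows, subquery_rows, counter_rows)
```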
Difference between INNER JOIN and LEFT JOIN?
INNER JOIN keeps only rows that match in both tables. LEFT JOIN keeps all rows from the left table and fills the right table's columns with NULL where there is no match.
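A short comparison using Python's built-in `sqlite3` module, with made-up tables in which one customer has no orders. SQL NULL shows up as `None` in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cal');
    INSERT INTO orders VALUES (1, 'book'), (3, 'lamp');
    """
)

inner_rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()

left_rows = conn.execute(
    """
    SELECT c.name, o.item
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id
    """
).fetchall()

print(inner_rows)  # [('Ana', 'book'), ('Cal', 'lamp')]
print(left_rows)   # [('Ana', 'book'), ('Ben', None), ('Cal', 'lamp')]
```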
What is vectorization in Python?
Using NumPy or pandas array operations instead of explicit Python loops, so the work runs in optimized compiled code and executes faster.
What is the difference between a list and a tuple?
Lists are mutable. Tuples are immutable, slightly lighter in memory, and hashable when their elements are, so they can serve as dictionary keys.
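A brief demonstration of the practical consequences of mutability. The variable names are made up for illustration.

```python
point_list = [1, 2]
point_tuple = (1, 2)

# Lists can be changed in place.
point_list.append(3)

# Tuples reject item assignment.
try:
    point_tuple[0] = 99
    tuple_was_mutated = True
except TypeError:
    tuple_was_mutated = False

# A tuple of hashable items can be a dictionary key; a list cannot.
labels = {point_tuple: "start"}
try:
    hash(point_list)
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

print(point_list, tuple_was_mutated, list_is_hashable)
# [1, 2, 3] False False
```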
What is a hash table?
A data structure that maps keys to values using a hash function for fast lookup.
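A toy sketch of the idea, assuming separate chaining to resolve collisions and a fixed number of buckets. Real implementations, including Python's own `dict`, use different strategies and resize as they grow. The class and method names are hypothetical.

```python
class HashTable:
    """Minimal hash table with separate chaining (no resizing)."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # The hash function picks which bucket the key belongs to.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        return default

table = HashTable()
table.put("apple", 3)
table.put("pear", 5)
table.put("apple", 4)  # replaces the earlier value

print(table.get("apple"), table.get("pear"), table.get("kiwi"))  # 4 5 None
```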
Difference between BFS and DFS?
BFS (Breadth First Search) explores level by level, DFS (Depth First Search) goes deep first.
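The two traversal orders are easiest to compare on a small tree. This sketch uses a made-up adjacency-list graph, a queue for BFS, and recursion for DFS.

```python
from collections import deque

# Illustrative graph: A has children B and C; B has D and E; C has F.
graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}

def bfs(graph, start):
    """Visit nodes level by level using a FIFO queue."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order

def dfs(graph, start, order=None):
    """Follow each branch to the end before backtracking."""
    if order is None:
        order = []
    order.append(start)
    for neighbor in graph[start]:
        if neighbor not in order:
            dfs(graph, neighbor, order)
    return order

print(bfs(graph, "A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(dfs(graph, "A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
```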
What is overfitting?
When a model fits the noise in the training data along with the underlying pattern, so it performs well on training data but poorly on new data.
Difference between supervised and unsupervised learning?
Supervised uses labeled data, unsupervised finds patterns in unlabeled data.
What is the difference between bagging and boosting?
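Bagging trains models independently on bootstrap samples and combines them, which mainly reduces variance (e.g., Random Forest). Boosting trains models sequentially, each one correcting the errors of the previous ones, which mainly reduces bias (e.g., XGBoost).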
Difference between logistic regression and SVM?
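Logistic regression fits a linear decision boundary and outputs class probabilities. SVM maximizes the margin between classes and, with kernels, can fit non-linear decision boundaries.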
Why use ReLU in neural networks?
It mitigates vanishing gradients, because its gradient is 1 for positive inputs, and it is cheap to compute, which speeds up training.
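A deliberately simplified sketch of the vanishing-gradient effect: it multiplies only the activation derivatives along a 10-layer chain and ignores the weights, assuming the same made-up pre-activation value of 2.0 at every layer.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid; never larger than 0.25."""
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

depth = 10
pre_activation = 2.0  # assumed identical at every layer, for illustration

sigmoid_chain = sigmoid_grad(pre_activation) ** depth
relu_chain = relu_grad(pre_activation) ** depth

print(sigmoid_chain)  # on the order of 1e-10
print(relu_chain)     # 1.0
```

The flip side is that ReLU's gradient is exactly 0 for negative inputs, so a unit that only ever receives negative inputs stops learning.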
What is backpropagation?
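An algorithm that uses the chain rule to compute the gradient of the loss with respect to every weight, working backward from the output layer. An optimizer such as gradient descent then uses those gradients to update the weights.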
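A hand-worked sketch on a tiny made-up network: one input, one ReLU hidden unit, one linear output, and a squared-error loss. The gradients come from the chain rule, are checked against a numerical estimate, and then drive a single gradient descent step. All values and names are illustrative.

```python
def forward(w1, w2, x):
    z = w1 * x               # hidden pre-activation
    h = max(0.0, z)          # ReLU
    y_hat = w2 * h           # linear output
    return z, h, y_hat

def loss_fn(w1, w2, x, y):
    return (forward(w1, w2, x)[2] - y) ** 2

def backward(w1, w2, x, y):
    """Chain rule, applied from the loss back to each weight."""
    z, h, y_hat = forward(w1, w2, x)
    d_y_hat = 2 * (y_hat - y)              # dL/dy_hat
    d_w2 = d_y_hat * h                     # dL/dw2
    d_h = d_y_hat * w2                     # dL/dh
    d_z = d_h * (1.0 if z > 0 else 0.0)    # through the ReLU
    d_w1 = d_z * x                         # dL/dw1
    return d_w1, d_w2

x, y = 1.5, 2.0
w1, w2 = 0.8, -0.5

grad_w1, grad_w2 = backward(w1, w2, x, y)

# Numerical check with central differences.
eps = 1e-6
num_w1 = (loss_fn(w1 + eps, w2, x, y) - loss_fn(w1 - eps, w2, x, y)) / (2 * eps)
num_w2 = (loss_fn(w1, w2 + eps, x, y) - loss_fn(w1, w2 - eps, x, y)) / (2 * eps)

# One gradient descent step.
learning_rate = 0.05
loss_before = loss_fn(w1, w2, x, y)
w1, w2 = w1 - learning_rate * grad_w1, w2 - learning_rate * grad_w2
loss_after = loss_fn(w1, w2, x, y)

print(grad_w1, grad_w2)         # about 3.9 and -6.24
print(loss_before, loss_after)  # the loss decreases after the step
```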
Difference between CNNs and RNNs?
CNNs use convolutional filters suited to grid-like data such as images. RNNs carry a hidden state from step to step, which suits sequential data like text or time series.
What is dropout in neural networks?
Randomly deactivates neurons during training to prevent overfitting.
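A minimal sketch of the mechanism, assuming the common "inverted dropout" variant, in which surviving activations are scaled by 1 / (1 - rate) during training so that nothing needs rescaling at inference time. The function name and parameters are hypothetical.

```python
import random

def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is switched off at inference
    keep = 1.0 - rate
    # Survivors are scaled up so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded only to make the demo repeatable
train_out = dropout([1.0] * 1000, rate=0.5, rng=rng)
eval_out = dropout([1.0, 2.0, 3.0], rate=0.5, training=False)

print(sorted(set(train_out)))             # [0.0, 2.0]
print(sum(train_out) / len(train_out))    # close to 1.0
print(eval_out)                           # [1.0, 2.0, 3.0]
```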