Data Scientist Interview Flashcards

Question

Difference between batch normalization and layer normalization?

Answer 1

Batch norm normalizes each feature across the samples in a batch (better for CNNs and dense networks). Layer norm normalizes each sample across its own features (better for RNNs, transformers, and small batch sizes). Both stabilize and speed up training in deep neural networks by normalizing activations.
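A minimal sketch of the difference, using plain Python lists instead of a deep learning framework. The helper names (`normalize`, `batch_norm`, `layer_norm`) and the toy batch are illustrative assumptions, and the learnable scale and shift parameters that real layers include are left out.

```python
from statistics import fmean


def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (roughly) unit variance."""
    mean = fmean(values)
    var = fmean([(v - mean) ** 2 for v in values])
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def batch_norm(batch):
    # Normalize each feature (column) across all samples in the batch.
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch):
    # Normalize each sample (row) across its own features.
    return [normalize(row) for row in batch]


# Hypothetical batch: 3 samples (rows) x 3 features (columns).
batch = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]
bn = batch_norm(batch)  # every column now has mean ~0
ln = layer_norm(batch)  # every row now has mean ~0
```

Because layer norm only looks at one sample at a time, its output does not depend on the batch size, which is why it suits small batches and sequence models.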

Answer 2

A framework for processing big data by splitting tasks into map and reduce steps.

Map Phase: Convert words into key-value pairs.
[('apple', 1), ('banana', 1), ('apple', 1), ('banana', 1), ('orange', 1), ('apple', 1)]

Shuffle & Sort: Group by word.
{'apple': [1, 1, 1], 'banana': [1, 1], 'orange': [1]}

Reduce Phase: Sum the values for each word.
{'apple': 3, 'banana': 2, 'orange': 1}
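The three phases of the word-count example can be sketched in a single process as follows. This is only an illustration of the data flow; a real MapReduce framework distributes each phase across machines, and the function names here are assumptions.

```python
from collections import defaultdict


def map_phase(words):
    # Emit a (word, 1) pair for every word.
    return [(word, 1) for word in words]


def shuffle_sort(pairs):
    # Group all values by key, with keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    # Sum the values collected for each key.
    return {key: sum(values) for key, values in groups.items()}


words = ['apple', 'banana', 'apple', 'banana', 'orange', 'apple']
pairs = map_phase(words)
groups = shuffle_sort(pairs)
counts = reduce_phase(groups)
```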

Answer 3

A sequence of data processing steps, often automated.

Answer 4

ETL transforms data before loading, ELT loads first, then transforms.

Answer 5

A non-relational database like MongoDB that stores data in flexible formats.

Answer 6

Retention, churn, conversion rate, and user engagement.

Answer 7

Use collaborative filtering, content-based filtering, or hybrid approaches.

Answer 8

DAU: daily active users, MAU: monthly active users, retention: returning users.

Answer 9

A low p-value (commonly below 0.05) together with a significant lift in a key metric (e.g., conversion rate, revenue, engagement). Lift is the improvement in the test group (B, the new version with the changes being tested) relative to the control group (A, the original version).
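As a rough illustration, lift and a p-value for a conversion-rate test can be computed with a two-proportion z-test. This is one common choice and not the only valid test; the function name and the visitor and conversion counts below are made up for the example.

```python
from math import erfc, sqrt


def ab_test(conv_a, n_a, conv_b, n_b):
    """Return (relative lift, z statistic, two-sided p-value)."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    lift = (rate_b - rate_a) / rate_a  # relative improvement of B over A
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return lift, z, p_value


# Hypothetical data: 200/2000 conversions in control, 260/2000 in test.
lift, z, p_value = ab_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
```

With these made-up numbers the relative lift is 30% and the p-value falls below 0.05, so the result would usually be called statistically significant.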

Answer 10

`SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);`
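One way to check this query is to run it against an in-memory SQLite database from Python. The table contents are hypothetical; note that the tie at the top salary is handled correctly, because the subquery excludes every row equal to the maximum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 90), ("Bob", 120), ("Cy", 120), ("Di", 100)],  # made-up rows
)

# Second-highest distinct salary: the largest value below the overall maximum.
second_highest = conn.execute(
    "SELECT MAX(salary) FROM employees "
    "WHERE salary < (SELECT MAX(salary) FROM employees);"
).fetchone()[0]
```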

Answer 11

`SELECT country, COUNT(*) FROM customers GROUP BY country;`

Answer 12

`SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(order_id) > 2;`

Answer 13

`SELECT month, sales, AVG(sales) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) FROM sales_data;`

Answer 14

`SELECT employee_name, department, salary, rank_salary FROM (SELECT employee_name, department, salary, RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_salary FROM employees) AS ranked WHERE rank_salary <= 3;`
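A window-function alias cannot be filtered in the `WHERE` clause of the same query level, because `WHERE` is evaluated before the window function. The usual workaround is to rank in a subquery (or CTE) and filter outside it. The sketch below runs that pattern on made-up rows and assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("A", "Eng", 100), ("B", "Eng", 90), ("C", "Eng", 90),
        ("D", "Eng", 80), ("E", "Eng", 70),
        ("F", "Sales", 50), ("G", "Sales", 40),
    ],  # hypothetical data
)

query = """
SELECT employee_name, department, salary, rank_salary
FROM (
    SELECT employee_name, department, salary,
           RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_salary
    FROM employees
) AS ranked
WHERE rank_salary <= 3
ORDER BY department, rank_salary, employee_name;
"""
top3 = conn.execute(query).fetchall()
names = [row[0] for row in top3]
```

Because `RANK()` leaves gaps after ties, employee D gets rank 4 and is excluded; `DENSE_RANK()` would give D rank 3 instead.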

Answer 15

`df.isnull().sum()`

Answer 16

`df.drop_duplicates(inplace=True)`

Answer 17

`df['category'].value_counts().head(5)`

Answer 18

`df.groupby('category')['sales'].mean()`

Answer 19

`df['clean_text'] = df['text'].str.lower().str.strip()`

Answer 20

`merged_df = pd.merge(df1, df2, on='customer_id', how='left')`

Answer 21

Variance measures how spread out data is from the mean.
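A small worked example, assuming a made-up data set: variance is the average squared deviation from the mean, divided by n for a population or by n - 1 for a sample.

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mu = mean(data)
# Population variance by hand: mean of squared deviations.
manual_pop_var = sum((x - mu) ** 2 for x in data) / len(data)

pop_var = pvariance(data)    # divides by n
sample_var = variance(data)  # divides by n - 1
```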

Answer 22

Mean is the average, median is the middle value.

Answer 23

Using IQR, Z-scores, or visualization (boxplot, histogram).
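The two numeric rules can be sketched as below. The 1.5 x IQR fence and the z-score threshold of 3 are common conventions and not fixed rules, and the data set is invented for the example.

```python
from statistics import mean, pstdev, quantiles


def iqr_outliers(data, k=1.5):
    # Flag points more than k * IQR outside the first or third quartile.
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]


def zscore_outliers(data, threshold=3.0):
    # Flag points more than `threshold` standard deviations from the mean.
    mu, sigma = mean(data), pstdev(data)
    return [x for x in data if abs(x - mu) / sigma > threshold]


data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]  # hypothetical values
by_iqr = iqr_outliers(data)
by_z = zscore_outliers(data)
```

The z-score rule is less robust on small samples, because a single extreme value inflates the standard deviation it is measured against.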

Answer 24

It calculates conditional probability: P(A|B) = P(B|A) * P(A) / P(B).
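A worked example with made-up numbers: a condition with 1% prevalence and a test with 95% sensitivity and a 5% false-positive rate. Here A is "has the condition" and B is "tests positive".

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(A|B) = P(B|A) * P(A) / P(B), with P(B) from the law of total probability."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs: P(A) = 0.01, P(B|A) = 0.95, P(B|not A) = 0.05.
posterior = bayes_posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
```

Even with an accurate test, the posterior is only about 16%, because the condition is rare.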

Answer 25

T-test compares two means, Chi-square tests categorical data, ANOVA compares multiple means.

Answer 26

The probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true.

Answer 27

High correlation between features, detected using VIF scores.
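For feature j, VIF is 1 / (1 - R²), where R² comes from regressing feature j on the other features. With only two features, that R² equals the squared Pearson correlation, which allows a dependency-free sketch. The function names and data are illustrative assumptions; with more features a regression routine is needed.

```python
from math import sqrt


def pearson_r(x, y):
    # Pearson correlation coefficient between two equal-length sequences.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_two_features(x1, x2):
    # Special case of VIF = 1 / (1 - R^2) when there is exactly one other feature.
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical, nearly collinear features.
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
vif = vif_two_features(x1, x2)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity.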

Answer 28

Use oversampling, undersampling, SMOTE, or adjust class weights.
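As one example of adjusting class weights, an inverse-frequency heuristic assigns each class the weight n_samples / (n_classes * class_count), which is the formula scikit-learn documents for its "balanced" mode. The helper name and labels below are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    # Rare classes get proportionally larger weights.
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10  # hypothetical 90/10 imbalance
weights = balanced_class_weights(labels)
```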

Answer 29

Linear regression predicts continuous values; logistic regression predicts class probabilities and is used for classification.

Answer 30

Overfitting happens when a model learns noise; prevent it using regularization, dropout, or more data.

Answer 31

Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 balances both.
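The formulas can be checked on a small set of made-up binary labels. F1 is the harmonic mean of precision and recall; the function name is an assumption.

```python
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```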

Answer 32

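L1 (lasso) can shrink some weights exactly to zero, which yields sparse models; L2 (ridge) penalizes large weights and shrinks them toward zero without eliminating them.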

Answer 33

Helps detect overfitting by evaluating the model on different held-out subsets of the data, which gives a more reliable estimate of how well it generalizes.
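A minimal sketch of how k-fold splitting works, without shuffling or stratification (both are common in practice). The generator name is an assumption; each sample lands in the held-out fold exactly once.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

The model is trained on each `train_idx` and scored on the matching `test_idx`; the k scores are then averaged.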

Answer 34

Ensures equal weight for all features, improves gradient descent.
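Two common scaling schemes, sketched on a single made-up feature: min-max scaling maps values to [0, 1], while z-score standardization gives zero mean and unit variance. The function names are assumptions.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    # Map the smallest value to 0 and the largest to 1.
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def z_score_scale(values):
    # Center on the mean and divide by the population standard deviation.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


values = [10, 20, 30, 40, 50]  # hypothetical feature column
scaled_01 = min_max_scale(values)
standardized = z_score_scale(values)
```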

Answer 35

It projects data onto new axes to capture maximum variance.

Answer 36

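Random forest (bagging) reduces variance by averaging many independently trained trees; boosting trains models sequentially, with each one correcting the errors of the previous ones.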

Answer 37

Use NLP, sentiment analysis, and user behavior patterns.

Answer 38

Hybrid filtering, better embeddings, A/B testing models.

Answer 39

Check retention, session data, A/B tests, and feature changes.

Answer 40

Use classification models, analyze key churn indicators, improve retention strategies.

Answer 41

Cleaning Messy Data:
1. Remove duplicate rows
2. Handle missing values
3. Convert data types (e.g., put dates in datetime format)
4. Standardize text data (`.str.lower().str.strip()`)
5. Handle outliers
6. Normalize or scale features (min-max scaling or z-score normalization)
7. Fix inconsistent categories (e.g., some labels recorded as NY and others as New York)
8. Convert categories to numbers (label encoding or one-hot encoding)
9. Create derived variables (new variables based on the original ones that help model performance, e.g., parts of a date, or age from birthdate)
10. Verify data consistency (ensure all data follows the same rules: consistent formats and currencies, no impossible ages or dates, realistic value ranges)
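Several of these steps can be sketched on a list of records using only the standard library. The column names, the category mapping, the 0 to 120 age range, and the median fill are all assumptions chosen for illustration. Duplicates are removed after standardization here so that rows differing only in formatting are caught.

```python
from datetime import datetime
from statistics import median

CATEGORY_MAP = {"ny": "new york"}  # hypothetical mapping for inconsistent labels


def clean(rows):
    cleaned, seen = [], set()
    for row in rows:
        city = row["city"].lower().strip()           # standardize text
        city = CATEGORY_MAP.get(city, city)          # fix inconsistent categories
        signup = datetime.strptime(row["signup"], "%Y-%m-%d").date()  # convert type
        age = row["age"]
        if age is not None and not 0 <= age <= 120:  # impossible value -> missing
            age = None
        key = (city, signup, age)
        if key in seen:                              # remove duplicate rows
            continue
        seen.add(key)
        cleaned.append({"city": city, "signup": signup, "age": age,
                        "signup_month": signup.month})  # derived variable
    # Handle missing values: fill with the median of the known ages.
    fill = median(r["age"] for r in cleaned if r["age"] is not None)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = fill
    return cleaned


rows = [  # made-up records
    {"city": " NY ", "signup": "2024-01-15", "age": 34},
    {"city": "New York", "signup": "2024-01-15", "age": 34},
    {"city": "Boston", "signup": "2024-02-01", "age": None},
    {"city": "boston", "signup": "2024-03-10", "age": 250},
    {"city": "Chicago", "signup": "2024-03-12", "age": 40},
]
cleaned = clean(rows)
```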

Answer 42

Discuss the problem, your approach, and impact.

Answer 43

Use visuals, storytelling, and focus on business impact.

Answer 44

Situation, Task, Action, Result

(68 cards)