DS Interview 3 Flashcards
(100 cards)
“You are given an unsorted list of integers. The list is too large to fit into memory all at once, but you need to sort it as efficiently as possible. What sorting algorithm would you choose, and why?”
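External Merge Sort
Because the list does not fit into memory, an in-memory algorithm such as Quick Sort cannot be applied to it directly. External Merge Sort splits the data into chunks that do fit into memory, sorts each chunk (Quick Sort works well here: O(n log n) on average, in-place, with only O(log n) additional space for the recursion stack), writes each sorted chunk to disk as a run, and then merges the runs using sequential reads.
You can mention that the total work is still O(n log n), that a k-way merge with a min-heap reduces the number of passes over the data, and that a randomized pivot guards against Quick Sort’s O(n^2) worst case when sorting the individual chunks.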
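For data that does not fit into memory, one common approach is an external merge sort. The following is a minimal sketch, assuming the integers are stored one per line in a text file with no blank lines; the function name `external_sort`, the `chunk_size` parameter, and the file layout are illustrative assumptions rather than part of the card.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(input_path, output_path, chunk_size=100_000):
    """Sort a file of integers (one per line) by sorting chunks and merging runs."""
    run_paths = []
    with open(input_path) as src:
        while True:
            # Read at most chunk_size values, a chunk assumed to fit into memory.
            chunk = [int(line) for line in islice(src, chunk_size)]
            if not chunk:
                break
            chunk.sort()  # in-memory sort of a single chunk
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{value}\n" for value in chunk)
            run_paths.append(path)

    # k-way merge: heapq.merge pulls lazily from each sorted run,
    # so only one pending value per run is held in memory.
    run_files = [open(path) for path in run_paths]
    try:
        streams = [(int(line) for line in f) for f in run_files]
        with open(output_path, "w") as out:
            out.writelines(f"{value}\n" for value in heapq.merge(*streams))
    finally:
        for f in run_files:
            f.close()
        for path in run_paths:
            os.remove(path)
```

Calling `external_sort("numbers.txt", "sorted.txt", chunk_size=1_000_000)` (hypothetical file names) would keep at most one million values in memory at a time during the chunking phase.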
What algorithm is recommended for merging two sorted lists of integers and why?
The merge step of Merge Sort is recommended: walk both lists with one pointer each and repeatedly take the smaller front element, which merges lists of sizes n and m in O(n + m) time. Merge Sort, which is built on this step:
* Guarantees a time complexity of O(n log n) in the worst case
* Is more predictable than Quick Sort
* Is useful for large datasets that don’t fit into memory
* Works efficiently with external storage
Its trade-offs are:
* Higher space complexity of O(n) due to the auxiliary buffer used while merging
* Less space-efficient than in-place algorithms like Quick Sort
Merge Sort is particularly effective in scenarios where data is too large to handle in memory, making it suitable for external sorting tasks.
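The merge of two sorted lists can be sketched as a two-pointer pass. This is a minimal illustration; the function name `merge_sorted` is an assumption made for the example.

```python
def merge_sorted(a, b):
    """Merge two already-sorted lists into one sorted list in O(len(a) + len(b))."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Take the smaller front element; <= keeps equal elements in input order.
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of the two lists still has elements left.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


print(merge_sorted([1, 4, 9], [2, 3, 10, 11]))  # [1, 2, 3, 4, 9, 10, 11]
```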
What is regularization, in one sentence?
Regularization is a technique used to prevent overfitting in machine learning models by adding a penalty to the loss function, discouraging overly complex models.
Lasso (L1): adds λ Σ|w_i| to the loss, encouraging some weights to become exactly zero
Ridge (L2): adds λ Σ w_i^2 to the loss, encouraging smaller weights without eliminating them
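The two penalty terms can be computed directly. In this sketch the regularization strength `lam` and the example weights are assumptions chosen for illustration; in practice the penalty is added to the data loss before optimization.

```python
def l1_penalty(weights, lam):
    """Lasso term: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """Ridge term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights, lam=0.5))  # 1.25
print(l2_penalty(weights, lam=0.5))  # 2.125
```

Note that the large weight (-2.0) dominates the L2 term far more than the L1 term, which is one way to see why Ridge shrinks large weights strongly.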
Machine Learning Model Deployment Steps
- Understand problem definition and expected output
- Data preparation (handle missing values, duplicates, formatting)
- Feature engineering (create new features, transform raw data into useful formats)
- Select model
- Train model
- Evaluate model (accuracy, precision, recall, etc.)
- Save model (Pickle or TensorFlow SavedModel)
- Develop API for inference (Flask, FastAPI)
- Containerize model (using Docker for portability)
- Deploy model (cloud / on-premises / edge)
- Monitor performance (check for model drift)
- Scale and load balancing (using Kubernetes, cloud scaling)
- Model retraining (automate with CI/CD pipelines)
- Security & Governance (authentication, encryption, compliance)
What is problem definition in machine learning, and why is it important?
Problem definition is understanding the business or research question and what the model is expected to predict or classify. It helps in aligning the model’s objective with real-world goals, guiding data collection and model selection.
What is data preparation in machine learning?
Data preparation involves cleaning the data, handling missing values, removing duplicates, and formatting the data correctly. This step ensures the model can learn effectively and the results are reliable.
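A small sketch of these cleaning steps in plain Python. The record layout (`age`, `city`), the choice of mean imputation, and the function name `prepare` are assumptions for illustration only.

```python
def prepare(rows):
    """Drop exact duplicate rows, fill missing ages with the mean, tidy city names."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    known_ages = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)
    for r in unique:
        if r["age"] is None:
            r["age"] = mean_age  # handle the missing value
        r["city"] = r["city"].strip().title()  # consistent formatting
    return unique


rows = [
    {"age": 30, "city": " paris "},
    {"age": 30, "city": " paris "},  # duplicate
    {"age": None, "city": "LYON"},   # missing value
    {"age": 40, "city": "nice"},
]
print(prepare(rows))
```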
What is feature engineering, and why is it important?
Feature engineering is the process of creating new features from raw data or transforming existing features into more meaningful formats. It helps improve model performance by ensuring the model can capture the right patterns in the data.
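As an illustration, a raw timestamp string is rarely useful to a model as-is, but features derived from it can be. The record fields and the function name `add_features` below are hypothetical.

```python
from datetime import datetime


def add_features(record):
    """Derive hour, day of week, and a weekend flag from an ISO timestamp."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        **record,
        "hour": ts.hour,
        "day_of_week": ts.weekday(),      # Monday is 0, Sunday is 6
        "is_weekend": ts.weekday() >= 5,
    }


print(add_features({"timestamp": "2024-03-09T14:30:00", "amount": 120.0}))
```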
What is model selection in machine learning?
Model selection is the process of choosing an appropriate algorithm for the task at hand, whether it’s regression, classification, or clustering. It depends on the problem, data, and desired output.
What is model training in machine learning?
Model training involves feeding the preprocessed data into the chosen algorithm so it can learn the underlying patterns. This step adjusts model parameters to minimize error and improve predictions.
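To make "adjusts model parameters to minimize error" concrete, here is a minimal sketch of gradient descent fitting a one-feature linear model to minimize mean squared error. The learning rate, epoch count, and toy data are assumptions chosen so the example converges; real training would normally use a library.

```python
def train_linear(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # generated from y = 2x + 1
w, b = train_linear(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```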
How do you evaluate a machine learning model?
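Model evaluation involves measuring the model’s performance on data held out from training, using metrics such as accuracy, precision, recall, and F1-score for classification, or RMSE for regression. Computed on held-out data, these metrics indicate how well the model generalizes to new data.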
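For a regression model, RMSE (root mean squared error) can be computed as follows. This is a small sketch with made-up target values and predictions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 1, 0, and -2, so RMSE = sqrt((1 + 0 + 4) / 3).
print(rmse([3, 5, 7], [2, 5, 9]))
```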
What does saving a model in machine learning mean?
Saving a model involves serializing it to a file so it can be loaded later for inference without retraining. Common formats are Python’s pickle format for general Python objects and TensorFlow’s SavedModel for TensorFlow models.
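A minimal sketch of saving and restoring with Python's `pickle` module. The dictionary below is only a stand-in for a trained model object, and the file name is an assumption. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model (hypothetical parameters).
model = {"weights": [0.5, -2.0], "bias": 1.0}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Serialize the model to disk.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another process: load it back for inference.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```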
What is an API, and what are some tools to develop it?
An API (Application Programming Interface) allows external systems to interact with a model. Tools like Flask and FastAPI are used to build APIs that serve machine learning models for real-time predictions.
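To show the shape of an inference API without depending on Flask or FastAPI, here is a sketch that uses only the standard library's WSGI support instead. The `/predict` route, the JSON payload layout, and the fixed linear `predict` function are all assumptions for illustration; a Flask or FastAPI service would follow the same request-in, prediction-out pattern.

```python
import io
import json
from wsgiref.simple_server import make_server


def predict(features):
    """Hypothetical stand-in model: a fixed linear scorer."""
    weights = [0.5, -2.0]
    return sum(w * x for w, x in zip(weights, features)) + 1.0


def app(environ, start_response):
    """WSGI app: POST /predict with {"features": [...]} returns {"prediction": ...}."""
    if environ["REQUEST_METHOD"] != "POST" or environ.get("PATH_INFO") != "/predict":
        start_response("404 Not Found", [("Content-Type", "application/json")])
        return [b'{"error": "not found"}']
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length))
    body = json.dumps({"prediction": predict(payload["features"])}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]


def call_locally(features):
    """Invoke the app in-process (no socket), handy for a quick check."""
    raw = json.dumps({"features": features}).encode()
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": "/predict",
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    statuses = []
    chunks = app(environ, lambda status, headers: statuses.append(status))
    return statuses[0], json.loads(b"".join(chunks))


def serve(host="127.0.0.1", port=8000):
    """Serve the app over HTTP until interrupted."""
    with make_server(host, port, app) as server:
        server.serve_forever()


print(call_locally([2.0, 0.5]))  # ('200 OK', {'prediction': 1.0})
```

Calling `serve()` would expose the same handler over HTTP on localhost; it is left uncalled here so the example terminates.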
What does containerizing a machine learning model mean?
Containerizing a model involves packaging it and its dependencies into a portable container using tools like Docker. This ensures the model can run consistently across different environments.
What does it mean to deploy a model, and where can you deploy it?
Deploying a model means making it accessible for use in production, either on cloud platforms (AWS, GCP, Azure), on-premises servers, or on edge devices like mobile phones or IoT devices.
What is model monitoring in deployment?
Model monitoring tracks the model’s performance over time to verify that it continues to perform well. It involves checking for issues like model drift, where prediction quality degrades because the patterns in production data have changed since training.
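One very simple drift check compares a feature's recent mean against its training-time mean. This sketch is only one of many possible checks; the threshold of 2 standard deviations, the function names, and the sample values are assumptions.

```python
from statistics import mean, stdev


def mean_shift(reference, recent):
    """Shift of the recent mean, in units of the reference standard deviation."""
    return abs(mean(recent) - mean(reference)) / stdev(reference)


def has_drifted(reference, recent, threshold=2.0):
    """Flag drift when the mean has moved more than `threshold` standard deviations."""
    return mean_shift(reference, recent) > threshold


reference = [10, 11, 9, 10, 12, 8, 10, 11]    # feature values seen during training
print(has_drifted(reference, [10, 9, 11, 10]))   # False
print(has_drifted(reference, [15, 16, 14, 15]))  # True
```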
What is scaling and load balancing in machine learning deployment?
Scaling ensures that the deployed model can handle varying amounts of traffic by increasing or decreasing resources. Load balancing distributes the traffic evenly across multiple instances to ensure smooth performance.
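Both ideas can be sketched in a few lines. The round-robin strategy, the class and function names, and the capacity figures are assumptions for illustration; in production this is usually handled by Kubernetes or a cloud load balancer rather than application code.

```python
import math
from itertools import cycle


class RoundRobinBalancer:
    """Hand out model instances in turn so traffic is spread evenly."""

    def __init__(self, instances):
        self._instances = cycle(instances)

    def route(self):
        return next(self._instances)


def replicas_needed(requests_per_second, capacity_per_replica):
    """Scaling rule: enough replicas to cover the current load, at least one."""
    return max(1, math.ceil(requests_per_second / capacity_per_replica))


balancer = RoundRobinBalancer(["model-a", "model-b", "model-c"])
print([balancer.route() for _ in range(4)])  # ['model-a', 'model-b', 'model-c', 'model-a']
print(replicas_needed(250, 100))  # 3
```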
What is model retraining in machine learning?
Model retraining involves periodically updating the model with new data to maintain its accuracy. This can be automated using CI/CD pipelines, ensuring the model adapts to changing data.
What are security and governance in machine learning deployment?
Security ensures that only authorized users can access the model, using methods like authentication and encryption. Governance involves ensuring compliance with data protection regulations and managing the model lifecycle.
What is precision in machine learning evaluation?
Precision is the ratio of true positive predictions to the total predicted positives. It measures how many of the positive predictions were actually correct. Formula: Precision = TP / (TP + FP)
TP = True Positives, FP = False Positives
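A worked instance of the formula, with made-up counts. Returning 0.0 when there are no predicted positives is a convention assumed here, not something the formula itself defines.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 by convention if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# 10 positive predictions, 8 of them correct.
print(precision(tp=8, fp=2))  # 0.8
```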
What is recall in machine learning evaluation?
Recall is the ratio of true positive predictions to the total actual positives. It measures how many of the actual positives were correctly identified. Formula: Recall = TP / (TP + FN)
FN = False Negatives
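A worked instance of the formula, with made-up counts. Returning 0.0 when there are no actual positives is a convention assumed here.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN); 0.0 by convention if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# 8 actual positives, 6 of them found by the model.
print(recall(tp=6, fn=2))  # 0.75
```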
What is a confusion matrix?
A confusion matrix is a table that shows the performance of a classification model. It compares the predicted labels with the true labels, showing the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
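The four counts can be tallied directly from label lists. This sketch assumes binary labels with 1 as the positive class; the function name and the sample labels are illustrative.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, and FN for a binary classifier."""
    # Key: (actual is positive, predicted is positive)
    names = {
        (True, True): "TP",
        (False, True): "FP",
        (False, False): "TN",
        (True, False): "FN",
    }
    counts = Counter(
        names[(t == positive, p == positive)] for t, p in zip(y_true, y_pred)
    )
    return {name: counts[name] for name in ("TP", "FP", "TN", "FN")}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```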
What is accuracy in machine learning evaluation?
Accuracy is the ratio of correct predictions (both true positives and true negatives) to the total predictions. It is the most common metric but can be misleading on imbalanced datasets, where a model that always predicts the majority class still scores high. Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
TN = True Negatives
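A made-up imbalanced case shows why accuracy alone can mislead: a model that always predicts "negative" on a dataset with 95 negatives and 5 positives.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# The always-negative model finds no positives at all.
tp, tn, fp, fn = 0, 95, 0, 5
print(accuracy(tp, tn, fp, fn))  # 0.95
print(tp / (tp + fn))            # recall: 0.0
```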
What is the F1 score in machine learning evaluation?
The F1 score is the harmonic mean of precision and recall. It balances the two metrics and is especially useful when the class distribution is imbalanced. Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall).
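Because it is a harmonic mean, F1 is pulled toward the lower of the two values. In the sketch below, the input values are made up, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(0.5, 0.5))            # 0.5
print(round(f1_score(0.9, 0.1), 2))  # 0.18, far below the arithmetic mean of 0.5
```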
What is specificity in machine learning evaluation?
Specificity (or True Negative Rate) measures the proportion of actual negatives that were correctly identified. Formula: Specificity = TN / (TN + FP).
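A worked instance of the formula, with made-up counts. Returning 0.0 when there are no actual negatives is a convention assumed here.

```python
def specificity(tn, fp):
    """Specificity = TN / (TN + FP); 0.0 by convention if there are no actual negatives."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# 100 actual negatives, 90 of them correctly rejected.
print(specificity(tn=90, fp=10))  # 0.9
```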