Capstone Project Flashcards

(62 cards)

1
Q

What was the main goal of your project?

A

The main goal was to develop a machine learning model that predicts whether diabetic patients will be readmitted to the hospital within 30 days of discharge. Early readmissions are a critical issue in healthcare, and this model aims to help clinicians identify high-risk patients in advance so they can take proactive actions to reduce preventable readmissions.

2
Q

Why is this problem important to solve?

A

Hospital readmissions among diabetic patients cost healthcare systems billions of dollars annually. These readmissions often reflect gaps in post-discharge care and can be avoided through better planning and follow-up. By predicting which patients are at high risk, we can improve care quality while also reducing healthcare costs.

3
Q

What real benefits can your model bring to hospitals or healthcare providers?

A

There are several concrete benefits:

Better Patient Outcomes:
The model helps providers identify high-risk patients before discharge, so they can schedule early follow-ups, adjust medications, or provide extra guidance — all of which improve patient stability and reduce the risk of complications.

Cost Savings:
Readmissions are very expensive. Hospitals can save money and avoid penalties by focusing care on those patients most likely to return.

Smarter Resource Allocation:
Staff and resources like nurse visits or home care services can be allocated based on actual risk, making operations more efficient.

Clinical Decision Support:
The model is explainable using SHAP, which means doctors can understand the reasons behind each prediction. This increases trust and makes the model practical for use in real clinical settings.

Scalability and Reusability:
The pipeline is open-source, modular, and cloud-deployable. With minor adjustments, the same system can be reused to predict readmissions for other chronic conditions, like heart failure or COPD.

4
Q

How does your model promote trust and explainability in clinical practice?

A

We used SHAP (SHapley Additive exPlanations), a well-known explainability tool in AI. SHAP shows how each feature (e.g., number of inpatient visits, abnormal lab results) contributes to the prediction, both globally and for each individual patient. This level of transparency aligns with ethical AI principles in healthcare and helps clinicians make informed decisions.

5
Q

Can your solution be deployed in a real hospital environment?

A

Yes. The model is saved using joblib, and the full pipeline can be integrated into hospital systems via APIs, or used with a user-friendly dashboard built in Streamlit or Flask. It’s also scalable and can be hosted on cloud platforms like AWS or Azure for broader use.

6
Q

What success metric was used? Why?

A

We prioritized recall and ROC-AUC, as missing high-risk patients (false negatives) is more harmful in healthcare.

7
Q

What motivated the project?

A

High readmission rates among diabetics lead to financial strain and poor health outcomes. Predictive modeling can help healthcare providers proactively manage these patients.

8
Q

Who are the main users or beneficiaries of your model, and how do they benefit from it?

A

Hospitals, clinicians, care coordinators, and ultimately the diabetic patients who receive better follow-up care.

Hospitals
Reduce readmission-related costs and penalties, improve efficiency in resource allocation, and enhance overall care quality.

Clinicians
Get decision support through explainable risk scores (via SHAP), enabling smarter discharge planning and better prioritization of care.

Care Coordinators
Can target high-risk patients for follow-up, improving care efficiency and outcomes with fewer wasted efforts.

Patients
Receive more personalized and timely post-discharge care, reducing the chances of complications and improving recovery.

9
Q

Where did the data come from? Isn’t that too old to be relevant for today’s healthcare?

A

From the Diabetes 130-US Hospitals dataset (UCI Repository), covering 100,000+ encounters from 1999 to 2008.

While the dataset is historical, it still holds strong relevance for a few key reasons:

Core healthcare patterns remain consistent
The risk factors for diabetic readmissions — like frequent inpatient visits, medication complexity, and poor glycemic control — are still major clinical issues today. These variables are still tracked in modern hospital systems and are highly predictive regardless of the year.

Focus is on methodology, not direct deployment
The primary goal of the project was to explore how machine learning can be applied to hospital readmission problems. The focus was on creating a reusable and interpretable pipeline — not on immediate production deployment. The methods and insights can easily be applied to more recent datasets in the future.

Compliance and de-identification
Since the dataset is fully de-identified under HIPAA and GDPR standards, it was ethically safe and legally appropriate to use for academic research — something more recent clinical data may not allow due to privacy concerns.

10
Q

What type of data was available?

A

Demographic (age, race, gender), clinical (diagnoses, lab results, meds), and administrative (admission type, discharge disposition).

11
Q

Were there missing values? Why?

A

Yes. Columns like weight, payer_code, and medical_specialty had significant missingness (up to 96.8%) and were dropped. Placeholders like ? were converted to NaN and cleaned appropriately.

12
Q

How would you adapt your solution if you had access to more recent data?

A

If I had access to modern data, I would:

Update the model using recent patient profiles and treatment protocols.

Reassess which features are still predictive — for example, newer medications or care pathways may be available.

Re-tune hyperparameters and re-evaluate model calibration to account for shifts in care delivery and population demographics.

Compare performance to see if model generalization still holds over time.

This would help assess model drift and ensure the tool remains accurate and clinically useful.

13
Q

What steps did you take to clean the dataset?

A

1. I replaced all placeholder missing values (“?”) with NaN so they could be properly detected and handled using pandas tools.
2. I dropped columns with excessive missingness, like weight, payer_code, and medical_specialty, because they were missing in anywhere from roughly 40% to over 90% of rows and provided little useful signal.
3. I removed records with invalid values, such as “Unknown/Invalid” gender, or missing race data.
4. I excluded patients who were discharged to hospice or who died, since they cannot be readmitted and would distort the target variable.
5. I dropped identifiers like patient_nbr and encounter_id to avoid data leakage.
6. I grouped ICD-9 diagnosis codes into broader categories for interpretability and dimensionality reduction.
7. I used helper functions to simplify discharge/admission source fields into clinically relevant categories.
8. I encoded the target variable y as binary: 1 for readmitted in <30 days, and 0 for all other outcomes.
9. I created new features like med_count, service_count, and flags for medication change (chg_flag) and diabetes medication use.
These steps ensured that the data was clean, medically interpretable, and ready for training robust ML models.
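
A minimal pandas sketch of these cleaning steps (column names follow the UCI Diabetes 130-US Hospitals dataset; the project's own helper functions are not reproduced here, and the file path is illustrative):

import numpy as np
import pandas as pd

df = pd.read_csv("diabetic_data.csv")

# Step 1: treat the "?" placeholder as a real missing value
df = df.replace("?", np.nan)

# Steps 2 and 5: drop high-missingness columns and identifiers
df = df.drop(columns=["weight", "payer_code", "medical_specialty",
                      "encounter_id", "patient_nbr"])

# Step 3: remove invalid or missing demographic records
df = df[df["gender"] != "Unknown/Invalid"].dropna(subset=["race"])

# Step 8: binary target, 1 if readmitted within 30 days, 0 otherwise
df["target"] = (df["readmitted"] == "<30").astype(int)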

14
Q

What feature engineering techniques did you use?

A

Feature engineering is the process of creating, transforming, or selecting features in a dataset to make them more informative and useful for machine learning models. The goal is to help the model capture the true patterns and relationships in the data more effectively.

In my project, I used feature engineering to extract clinically meaningful signals and improve model performance. Here are a few key examples:

🔹 1. Created Aggregated Service Features

df_feat = add_service_and_med_features(df_clean)

I engineered features like:

service_count: total number of services (e.g., labs, procedures, meds)

med_count: number of unique medications

med_change_count: number of changes in medication during the stay
These features reflect the intensity and complexity of a patient’s treatment, which are strong indicators of readmission risk.

🔹 2. Grouped Diagnosis Codes

df = categorize_diagnoses(df)

I grouped detailed ICD-9 diagnosis codes into broader categories like:

Circulatory system issues

Diabetes-related conditions

Respiratory diseases
This reduced dimensionality and captured clinical meaning, making it easier for the model to learn patterns.

🔹 3. Simplified Administrative Codes

df = simplify_admin_fields(df)

I mapped administrative IDs (like admission_type_id) into grouped categories such as emergency, referral, transfer, etc. This helped reduce noise and improved interpretability.

🔹 4. Converted Age Brackets to Numeric

df['age_mid'] = df['age'].map({…})

The age column had values like [70-80). I converted these into numeric midpoints (e.g., 75) so they could be used in models as continuous variables.

🔹 5. Created Binary Flags

df['chg_flag'] = df['change'].map({'Ch': 1, 'No': 0})

I created binary indicators for whether medications were changed (chg_flag) and whether the patient was on diabetes medication (diab_med_flag). These are simple but powerful signals of instability or active treatment.
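
For illustration, a hedged sketch of the simpler mappings above (the full age-midpoint dictionary and the column values are assumptions based on the UCI dataset; helper functions like add_service_and_med_features are project-specific and not shown):

# Age brackets such as "[70-80)" mapped to numeric midpoints (assumed full mapping)
age_map = {"[0-10)": 5, "[10-20)": 15, "[20-30)": 25, "[30-40)": 35, "[40-50)": 45,
           "[50-60)": 55, "[60-70)": 65, "[70-80)": 75, "[80-90)": 85, "[90-100)": 95}
df["age_mid"] = df["age"].map(age_map)

# Binary flags for medication change and diabetes medication use
df["chg_flag"] = df["change"].map({"Ch": 1, "No": 0})
df["diab_med_flag"] = df["diabetesMed"].map({"Yes": 1, "No": 0})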

15
Q

How did you handle categorical variables?

A

We used a mix of label encoding and custom mappings:

For variables like admission_type and discharge_disposition, we used medically-informed mappings to preserve the meaning of each category.

Binary variables like change and diabetesMed were encoded as 0 and 1.

race and gender were label-encoded after cleaning invalid entries.

This transformation ensured compatibility with the ML models while keeping semantic meaning intact.
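
A short sketch of this encoding step (the admission-type grouping shown is an illustrative assumption, not the project's exact medically informed mapping):

from sklearn.preprocessing import LabelEncoder

# Illustrative grouping of admission_type_id values into broader categories
admission_groups = {1: "emergency", 2: "emergency", 3: "elective", 7: "emergency"}
df["admission_group"] = df["admission_type_id"].map(admission_groups).fillna("other")

# Label-encode cleaned categorical columns
for col in ["race", "gender", "admission_group"]:
    df[col] = LabelEncoder().fit_transform(df[col])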

16
Q

Did you normalize or standardize? Why?

A

Yes. We applied log transformation to reduce skew in features like number_inpatient and number_medications, and then used StandardScaler to standardize numerical features.

Standardization (mean = 0, std = 1) puts all features on a comparable scale. This mainly helps scale-sensitive algorithms such as Logistic Regression converge faster and weigh features fairly; tree-based boosting models are largely insensitive to feature scale, but a common scale keeps the preprocessing pipeline consistent across all the models we compared.
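
A minimal sketch of this step (the exact list of skewed columns is an assumption; in practice the scaler should be fit on the training split only):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Log-transform right-skewed count features (log1p handles zeros)
for col in ["number_inpatient", "number_emergency", "number_outpatient", "num_medications"]:
    df[col] = np.log1p(df[col])

# Standardize numeric features (excluding the target) to mean 0, std 1
num_cols = [c for c in df.select_dtypes(include="number").columns if c != "target"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])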

17
Q

The dataset had class imbalance. How did you address it?

A

The dataset was highly imbalanced — most patients were not readmitted within 30 days. To address this, we used SMOTE (Synthetic Minority Over-sampling Technique), applied only to the training set to prevent data leakage.

SMOTE creates synthetic examples of the minority class rather than duplicating rows, which helps reduce overfitting.

After SMOTE, the training set had balanced classes, which improved the model’s recall — a critical metric in healthcare where missing high-risk patients can be dangerous.
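
A sketch of this resampling step (assumes the imbalanced-learn package and an existing stratified split into X_train / y_train):

from imblearn.over_sampling import SMOTE

# Oversample only the training data; the test set stays untouched
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)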

18
Q

Why did you choose SMOTE over other techniques like undersampling?

A

We chose SMOTE because:

Undersampling would remove a large portion of valuable majority-class data, which could weaken the model.

SMOTE allows us to keep all original data and enhance minority class representation by synthesizing new, realistic samples.

It improves model sensitivity (recall) without the risk of overfitting that comes from simply duplicating rows.

19
Q

Which models were tested?

A

We tested four supervised classification models:

Logistic Regression

Random Forest

XGBoost

LightGBM

These were selected for their strong performance on structured tabular data, and because they offer a balance between predictive power, interpretability, and scalability.

20
Q

Why did you choose those specific models?

A

Logistic Regression is simple and interpretable, useful as a baseline.

Random Forest handles non-linear relationships well and provides feature importance.

XGBoost is highly efficient and accurate with built-in regularization.

LightGBM is even faster than XGBoost, uses less memory, supports categorical variables natively, and integrates well with SHAP for explainability.

21
Q

What criteria did you use to select the final model?

A

We prioritized:

High recall, to avoid missing high-risk patients

High ROC-AUC, to ensure strong overall discriminative ability

Interpretability, using SHAP

Training efficiency, for scalability
LightGBM was selected because it offered the best combination of these factors.

22
Q

How did you handle hyperparameter tuning?

A

We used two search strategies:

HalvingRandomSearchCV for Random Forest and XGBoost

RandomizedSearchCV for LightGBM
These techniques efficiently explored a range of hyperparameters while reducing computation time. Tuning significantly improved metrics, especially AUC and recall.
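
A condensed sketch of the two search strategies (the parameter grids and scoring choice are illustrative, not the project's exact search spaces; assumes the SMOTE-resampled training data from the earlier sketch):

from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

# Successive-halving search for Random Forest (similarly for XGBoost)
rf_search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [200, 400, 800],
                         "max_depth": [6, 10, None]},
    scoring="roc_auc", cv=5, random_state=42)
rf_search.fit(X_train_res, y_train_res)

# Randomized search for LightGBM
lgbm_search = RandomizedSearchCV(
    LGBMClassifier(random_state=42),
    param_distributions={"num_leaves": [31, 63, 127],
                         "learning_rate": [0.01, 0.05, 0.1]},
    n_iter=10, scoring="roc_auc", cv=5, random_state=42)
lgbm_search.fit(X_train_res, y_train_res)

best_lgbm = lgbm_search.best_estimator_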

23
Q

What is SMOTE, and why was it used?

A

SMOTE (Synthetic Minority Oversampling Technique) creates new, synthetic examples of the minority class to balance the dataset, improving recall and reducing model bias.

24
Q

Standardization vs Normalization:

A

I applied standardization, which transforms the features so that their mean is 0 and their standard deviation is 1. This helps ensure all features are on the same scale.

Standardization (used): Rescales data to mean=0, std=1 (StandardScaler)

Normalization: Rescales data to range [0,1] (MinMaxScaler)

25
Q

What is overfitting, and how do you avoid it?

A

Overfitting: the model performs well on the training set but poorly on the test set.

Avoided using:
Cross-validation (5-fold)
Early stopping
Simpler models as baselines

26
Q

How do the models work (at a high level)?

A

Logistic Regression: linear classifier, interpretable.
Random Forest: ensemble of decision trees (bagging).
XGBoost / LightGBM: boosted decision trees that learn sequentially and correct previous mistakes. LightGBM is faster and more efficient.

27
Q

How did you split the dataset for training and testing?

A

We used an 80/20 train-test split with stratification to preserve class balance. SMOTE was applied only to the training set to avoid data leakage and ensure fair model evaluation on the untouched test set.

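A sketch of this split (shown with train_test_split and stratify=y; the project also mentions StratifiedShuffleSplit, which achieves the same stratified 80/20 split):

from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

# 80/20 split, preserving the readmission class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE is then applied to X_train / y_train only; X_test stays untouched
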
28
Q

What is stratification, and why did you use it in your train-test split and cross-validation?

A

Stratification is a technique used during data splitting to ensure that the proportion of each class (e.g., readmitted vs. not readmitted) is preserved in both the training and testing sets — or across all folds in cross-validation.

Since our dataset is imbalanced — meaning there are many more patients not readmitted than those who are — stratification ensures that:

Both the training and test sets reflect the true class distribution.

The model learns and is evaluated fairly on both positive and negative cases.

It prevents misleading metrics that might happen if one class is underrepresented in the test set.

We used StratifiedShuffleSplit for the 80/20 split and StratifiedKFold for cross-validation. This helped maintain consistency and improve model robustness, especially for recall and AUC.

29
Q

Did you use any form of cross-validation?

A

Yes. We used 5-fold stratified cross-validation during training to ensure model stability and avoid overfitting. This helped us assess how well the model would generalize to unseen data.

30
Q

What is cross-validation, and why did you use it?

A

Cross-validation is a technique used to evaluate a machine learning model's ability to generalize to new, unseen data. It helps detect overfitting and ensures the model's performance isn't just good on one specific train-test split.

In our project, we used 5-fold stratified cross-validation, which works as follows:

The training data is split into 5 equal-sized subsets (called folds).

The model is trained on 4 folds and validated on the remaining 1 fold.

This process repeats 5 times, each time with a different validation fold.

The final performance is the average of all 5 evaluations.

We also used stratification during this process to ensure that each fold maintained the original class distribution (readmitted vs. not readmitted).

Why we used it:

To get a more reliable estimate of how the model will perform on unseen data.

To avoid overfitting, since the model is validated on multiple subsets.

To ensure robust and stable metrics, especially important in healthcare where reliability is critical.

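A minimal sketch of 5-fold stratified cross-validation (the estimator and scoring metric are illustrative; assumes the resampled training data from the SMOTE sketch):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LGBMClassifier(random_state=42),
                         X_train_res, y_train_res,
                         cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
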
31
Q

What metrics did you use to evaluate model performance?

A

Accuracy
Precision
Recall (most important in this context)
F1 Score
ROC-AUC

32
Q

What is accuracy, and what does it measure?

A

Accuracy is the percentage of correct predictions made by the model out of all predictions. It tells us how often the model is correct overall.

However, in imbalanced datasets like ours, accuracy can be misleading. For example, if 90% of patients are not readmitted, a model that predicts "no" for everyone will still be 90% accurate — but useless for identifying high-risk patients.

33
Q

What is precision?

A

Precision = TP / (TP + FP)

It shows the proportion of positive predictions that were actually correct. Precision measures how many of the patients predicted to be high-risk (positive) were actually readmitted. It answers the question: "When the model says a patient will be readmitted, how often is it right?"

High precision means fewer false alarms — important when you want to avoid overwhelming the care team with unnecessary follow-ups.

34
Q

What is recall, and why is it the most important metric in your project?

A

Recall = TP / (TP + FN)

It shows the proportion of actual positives the model was able to capture. Recall (also called sensitivity) measures how many of the actual readmitted patients were correctly identified by the model. It answers: "Of all patients who were truly readmitted, how many did the model catch?"

We prioritized recall because in healthcare, missing a high-risk patient (false negative) can be dangerous. A missed prediction might result in no follow-up, leading to complications or even death. So catching as many true positives as possible is critical.

35
Q

What is the F1 Score?

A

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

The F1 Score is the harmonic mean of precision and recall, combining both metrics into a single number. It's especially useful when you need a balance between avoiding false positives and catching all true positives, as in healthcare, and it's a more reliable performance summary than accuracy on imbalanced datasets.

36
Q

What is ROC-AUC and what does it tell you?

A

ROC-AUC stands for Receiver Operating Characteristic – Area Under the Curve. It measures how well the model can distinguish between classes across all decision thresholds.

The ROC curve plots the True Positive Rate vs. the False Positive Rate, and the AUC is the area under that curve, ranging from 0 to 1:
AUC = 1.0 ⇒ perfect classifier
AUC = 0.5 ⇒ random guessing

A high AUC (like 0.95 in our case) means the model does a very good job separating readmitted from non-readmitted patients, regardless of the threshold chosen.

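For reference, a sketch of how the five evaluation metrics are computed on the held-out test set (assumes a fitted classifier named model and the test split from earlier sketches):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
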
37
Q

What results did each model achieve?

A

Logistic Regression: good baseline, but limited in capturing complex relationships.

Random Forest: solid performance and interpretability, but slower than boosting methods.

XGBoost: high AUC and good precision, but longer training time.

LightGBM: best overall — 93.36% accuracy, 99.68% precision, 87% recall, and 95.82% AUC — fast, accurate, and interpretable.

38
Q

Why did you finally choose LightGBM?

A

Although LightGBM did not have the absolute highest ROC-AUC, it offered the best overall trade-off across all key metrics:

Precision: 99.68% - fewer false alarms
Recall: 87% - critical for identifying high-risk patients in healthcare
F1 Score: 92.92% - handled both false positives and false negatives well, making it the most practical and trustworthy for real-world deployment
Accuracy: 93.36%

It also had several practical advantages:

Fast training speed
Native handling of categorical variables
Seamless integration with SHAP for model explainability
Stable performance across all cross-validation folds

These factors made LightGBM the most clinically useful, efficient, and interpretable model for deployment, even if another model had a slightly higher AUC.

39
Q

How did you evaluate the models' performance?

A

We used five main metrics:

Accuracy: overall correctness
Precision: how many positive predictions were correct
Recall: how many actual positives were captured (most important for us)
F1 Score: balance between precision and recall
ROC-AUC: overall ability to distinguish between classes across thresholds

These metrics were computed on the test set, after training and tuning the models using cross-validation.

40
Q

What is overfitting, and how did you prevent it in your project?

A

Overfitting happens when a model learns the training data too well, including its noise, outliers, or random patterns, instead of just the underlying relationships. As a result, the model performs very well on training data but fails to generalize to new, unseen data — leading to poor test performance. It's like memorizing the answers for one exam instead of learning the actual subject — it doesn't help when the questions change.

How we avoided overfitting in this project:

Cross-validation: we used 5-fold stratified cross-validation to ensure the model performed consistently across different subsets of the training data.

Early stopping: for boosting models like LightGBM and XGBoost, we applied early stopping — which halts training if the model stops improving on validation data — preventing it from fitting noise.

Regularization & tuning: we carefully tuned hyperparameters using RandomizedSearchCV and HalvingRandomSearchCV, applying built-in regularization (like max_depth, min_child_weight, etc.) to control complexity.

Data split discipline: we applied SMOTE only on the training set (never on test data) to avoid data leakage and preserve honest evaluation.

41
Q

What insights did the confusion matrix give you?

A

The confusion matrix helped us visualize:

True positives: correctly predicted readmissions
False negatives: high-risk patients missed (we aimed to reduce these)
False positives: patients flagged as high-risk who weren't readmitted

Both LightGBM and XGBoost showed very few false negatives, which is crucial in healthcare — we don't want to miss patients who are actually at risk.

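A short sketch of reading those counts from the test-set confusion matrix (assumes y_test and y_pred from the evaluation step):

from sklearn.metrics import confusion_matrix

# For binary labels 0/1, ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
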
42
Q

Did you use any probability calibration methods?

A

Yes. We applied sigmoid calibration using CalibratedClassifierCV to improve the reliability of the predicted probabilities. This ensures that a prediction like "0.80 probability of readmission" actually corresponds to an 80% chance, making the model's output more actionable for clinicians.

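A sketch of this calibration step (best_lgbm stands in for the tuned model from the earlier tuning sketch; the variable names are assumptions):

from sklearn.calibration import CalibratedClassifierCV

# Platt (sigmoid) calibration wrapped around the tuned classifier
calibrated = CalibratedClassifierCV(best_lgbm, method="sigmoid", cv=5)
calibrated.fit(X_train_res, y_train_res)

y_prob = calibrated.predict_proba(X_test)[:, 1]   # calibrated readmission probabilities
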
43
Q

What did the ROC and Precision-Recall curves tell you?

A

The ROC curve showed how well the model distinguishes between classes across thresholds. LightGBM and XGBoost both had high AUCs, confirming strong performance.

The Precision-Recall curve focused on our main trade-off: catching as many true positives as possible (recall) without generating too many false positives (precision). LightGBM provided the best balance in this curve, making it ideal for real-world use where both metrics matter.

44
Q

How did you ensure your results were interpretable for clinical use?

A

We used SHAP (SHapley Additive exPlanations) to:

Identify the top features influencing predictions (like number of prior inpatient visits, medication count, A1C results)
Generate global summary plots to explain the model's logic
Create individual-level waterfall plots to show why a specific patient was flagged

This allowed doctors to see not just the prediction — but the "why" behind it.

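A hedged sketch of that SHAP workflow (assumes the shap package and a fitted LightGBM model named best_lgbm; exact plot calls may vary slightly between shap versions):

import shap

explainer = shap.TreeExplainer(best_lgbm)
shap_values = explainer(X_test)

shap.plots.beeswarm(shap_values)       # global summary of feature effects
shap.plots.waterfall(shap_values[0])   # why one specific patient was flagged
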
45
Q

How did you prepare your model for deployment?

A

We prepared the model using a modular and reproducible deployment pipeline. Key steps included:

Serializing the trained LightGBM model using joblib.dump() so it can be loaded later for inference without retraining.

Saving preprocessing steps (e.g., encoders and scalers) to ensure consistency between training and prediction time.

Organizing outputs into an artifacts/ folder structure (models, data, plots, SHAP outputs), making it easier to manage and track versions.

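A sketch of the serialization step (the file paths mirror the artifacts/ layout described above but are illustrative, and the variable names come from earlier sketches):

import joblib

joblib.dump(calibrated, "artifacts/models/lgbm_readmission.joblib")
joblib.dump(scaler, "artifacts/models/scaler.joblib")

# Later, at inference time
model = joblib.load("artifacts/models/lgbm_readmission.joblib")
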
46
Q

What technologies or tools would you use to deploy this model in real-world healthcare systems?

A

The model can be deployed using:

Flask or Streamlit to create a simple dashboard or user interface for hospital staff.

REST APIs that integrate with Electronic Health Record (EHR) systems, allowing the model to receive patient data and return predictions automatically.

Cloud platforms like AWS, Azure, or GCP for hosting the model securely and at scale (e.g., using AWS Lambda or EC2).

This setup makes the model flexible, lightweight, and easy to integrate into different clinical environments.

47
Q

How would your model be used in practice by clinicians or hospital staff?

A

The model could be integrated at the point of discharge, where it receives patient data and outputs:

A readmission risk score
A binary classification (high-risk or low-risk)
A SHAP explanation showing the top factors influencing that prediction

Clinicians or care coordinators can then decide whether to:

Schedule early follow-up visits
Provide additional education
Trigger transitional care plans

This supports proactive care, instead of waiting for the patient to return.

48
Q

How do you ensure predictions are explainable for clinical decision-making?

A

We use SHAP (SHapley Additive exPlanations) to explain model outputs:

Global SHAP plots show which features are most important overall.
Local SHAP plots (like waterfall plots) explain why an individual patient was classified as high-risk.

This transparency builds clinical trust and aligns with ethical AI practices — doctors don't just see a score, they see the reasoning behind it.

49
Q

What ethical and legal considerations did you account for in deployment?

A

We addressed multiple aspects:

HIPAA & GDPR compliance: the dataset was fully de-identified, and no personally identifiable information is used in the model.

Explainability: SHAP ensures decisions can be interrogated, satisfying expectations of the EU AI Act and the WHO's trustworthy AI guidelines.

Human oversight: the model is intended as a decision-support tool, not a replacement for medical judgment.

Bias auditing: we used SMOTE to balance the training data and monitored fairness across subgroups during model evaluation.

50
Q

What risks are involved in deploying a model like this, and how can they be mitigated?

A

Risks include:

Model drift: as healthcare practices evolve, the model might become outdated.
➤ Mitigation: set up monitoring and retraining pipelines.

Over-reliance on automation: staff might blindly trust predictions.
➤ Mitigation: always pair predictions with SHAP explanations and enforce human oversight.

Data privacy concerns: sensitive patient data may be at risk.
➤ Mitigation: use secure servers, encrypted data transmission, and strict access controls.

51
Q

Why did you apply early stopping during model training?

A

Early stopping halts training when the model's performance on validation data stops improving after a set number of iterations. It prevents overfitting, especially in boosting models like LightGBM and XGBoost. It also reduces training time without compromising performance.

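A sketch of early stopping with LightGBM's scikit-learn API (the validation split and round count are illustrative; older LightGBM versions use an early_stopping_rounds argument in fit() instead of the callback):

import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Hold out part of the training data as a validation set for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_res, y_train_res, test_size=0.1, stratify=y_train_res, random_state=42)

model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05, random_state=42)
model.fit(X_tr, y_tr,
          eval_set=[(X_val, y_val)],
          eval_metric="auc",
          callbacks=[lgb.early_stopping(stopping_rounds=50)])
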
52
Q

What is the difference between RandomizedSearchCV and HalvingRandomSearchCV? Why did you use both?

A

RandomizedSearchCV randomly samples hyperparameter combinations from a defined grid — it's faster than GridSearchCV and works well when the search space is large.

HalvingRandomSearchCV starts with many combinations but evaluates them with fewer resources (e.g., fewer trees), narrowing down to the best ones iteratively.

We used both to efficiently explore parameters without wasting resources.

53
Q

How does LightGBM handle missing values and categorical variables?

A

LightGBM handles missing values natively — it learns the best direction to send them at each split during training. It also supports categorical features directly, without manual one-hot encoding. In addition, its efficiency techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), improve both training speed and memory use.

54
Q

Why did you group diagnosis codes using ICD-9 categories?

A

Raw ICD-9 codes are too granular and sparse, which can hurt model performance. We grouped them into broader clinical categories (e.g., circulatory, diabetes-related, respiratory) to:

Reduce dimensionality
Increase interpretability
Improve signal strength for modeling

55
Q

What's the difference between standardization and normalization? Which one did you use?

A

Normalization scales values to the range [0, 1] (MinMaxScaler).
Standardization scales to zero mean and unit variance (StandardScaler).

We used StandardScaler after log-transforming skewed features. This helped scale-sensitive models like Logistic Regression converge and compare features fairly; tree-based boosting models are largely insensitive to scaling, but a common scale keeps the pipeline consistent across all models.

56
Q

Why is recall more important than precision in your project?

A

In a clinical setting, missing a high-risk patient (false negative) is more dangerous than flagging a low-risk one. High recall ensures we catch as many actual readmissions as possible, even if it means occasionally flagging someone who won't return. It supports proactive care and avoids preventable complications.

57
Q

Why is ROC-AUC useful, and how does it differ from precision-recall curves?

A

ROC-AUC measures the model's ability to separate classes across all thresholds. Precision-Recall curves focus more on positive-class performance, which is better suited to imbalanced datasets like ours.

We used both: ROC-AUC for overall performance, and the Precision-Recall curve to tune thresholds and balance recall (sensitivity) against precision.

58
Q

How did you use SHAP in your project?

A

SHAP (SHapley Additive exPlanations) helped us:

Understand global feature importance
Explain individual predictions
Visualize how features like med_count, A1Cresult, and number_inpatient influenced risk

This supported ethical AI, improved clinical trust, and aligned with GDPR and EU AI Act requirements.

59
Q

How did you address bias in the model?

A

We tackled bias by:

Using SMOTE to balance the training data and prevent the model from favoring the majority class
Monitoring performance across subgroups (e.g., by age and race) using SHAP
Emphasizing explainability and clinician oversight in deployment, avoiding black-box decisions

60
Q

What steps would you take before putting this model into production?

A

Conduct external validation with recent patient data
Collaborate with clinicians to test workflows and build trust
Set up performance monitoring and retraining schedules to detect model drift
Secure the system for data privacy and access control
Document processes for compliance with HIPAA, GDPR, and the EU AI Act

61
Q

Can you explain what a threshold is in your classification model and how it affects predictions?

A

Yes, of course. In my model, the threshold is the cutoff point that decides whether a predicted probability should be classified as a positive case or a negative one. For example, my model outputs probabilities between 0 and 1 — and by default, if the probability is 0.5 or higher, it predicts the patient will be readmitted; otherwise, it predicts they won't.

But this threshold isn't fixed. I can change it depending on the clinical goal. If I lower the threshold, the model becomes more sensitive — meaning it will catch more true positives, which is good for recall, but it might also increase false positives. On the other hand, raising the threshold would give me higher precision, but I would risk missing more actual readmissions.

In my project, since the cost of missing a high-risk patient is high, I focused on tuning the threshold to prioritize recall, while still keeping precision at a useful level. I used the Precision-Recall curve to visually explore this trade-off and find a threshold that balances both, depending on clinical needs.

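A sketch of choosing such a threshold from the Precision-Recall curve (assumes test labels y_test and predicted probabilities y_prob from earlier sketches; the target recall value is illustrative):

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Pick the highest threshold that still reaches the desired recall
target_recall = 0.85
mask = recall[:-1] >= target_recall          # recall[:-1] aligns with thresholds
threshold = thresholds[mask].max() if mask.any() else 0.5

y_pred_custom = (y_prob >= threshold).astype(int)
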
62
Q

What is random_state and why did you use it?

A

random_state is a parameter that controls the randomness in operations like train-test splitting, SMOTE, or model initialization. It's basically like setting a seed for the random number generator.

By setting a fixed random_state — for example, random_state=42 — I make sure that every time I run the code, I get the same results. The data gets split the same way, SMOTE generates the same synthetic samples, and the models behave consistently. This is really important for reproducibility — especially in research or when sharing my project with others. It ensures that the results can be replicated exactly, which is a key principle in both machine learning and scientific work.

I use it in:

Train-test split: this ensures that the split between training and test sets remains the same every time I run the notebook, and the class distribution is preserved through stratify=y.

SMOTE (Synthetic Minority Oversampling Technique): here, random_state ensures that SMOTE generates the same synthetic examples during resampling, which keeps training behavior consistent across experiments.

Model initialization: this sets the internal randomness of model components like tree splits or sampling, so training results don't vary unpredictably.

By using random_state consistently, I ensured that the results are deterministic, traceable, and easy to debug or replicate — which is essential in both academic research and real-world deployment scenarios.