Capstone Project Flashcards

(62 cards)

1
Q

What was the main goal of your project?

A

The main goal was to develop a machine learning model that predicts whether diabetic patients will be readmitted to the hospital within 30 days of discharge. Early readmissions are a critical issue in healthcare, and this model aims to help clinicians identify high-risk patients in advance so they can take proactive actions to reduce preventable readmissions.

2
Q

Why is this problem important to solve?

A

Hospital readmissions among diabetic patients cost healthcare systems billions of dollars annually. These readmissions often reflect gaps in post-discharge care and can be avoided through better planning and follow-up. By predicting which patients are at high risk, we can improve care quality while also reducing healthcare costs.

3
Q

What real benefits can your model bring to hospitals or healthcare providers?

A

There are several concrete benefits:

Better Patient Outcomes:
The model helps providers identify high-risk patients before discharge, so they can schedule early follow-ups, adjust medications, or provide extra guidance — all of which improve patient stability and reduce the risk of complications.

Cost Savings:
Readmissions are very expensive. Hospitals can save money and avoid penalties by focusing care on those patients most likely to return.

Smarter Resource Allocation:
Staff and resources like nurse visits or home care services can be allocated based on actual risk, making operations more efficient.

Clinical Decision Support:
The model is explainable using SHAP, which means doctors can understand the reasons behind each prediction. This increases trust and makes the model practical for use in real clinical settings.

Scalability and Reusability:
The pipeline is open-source, modular, and cloud-deployable. With minor adjustments, the same system can be reused to predict readmissions for other chronic conditions, like heart failure or COPD.

4
Q

How does your model promote trust and explainability in clinical practice?

A

We used SHAP (SHapley Additive exPlanations), a well-known explainability tool in AI. SHAP shows how each feature (e.g., number of inpatient visits, abnormal lab results) contributes to the prediction, both globally and for each individual patient. This level of transparency aligns with ethical AI principles in healthcare and helps clinicians make informed decisions.

5
Q

Can your solution be deployed in a real hospital environment?

A

Yes. The model is saved using joblib, and the full pipeline can be integrated into hospital systems via APIs, or used with a user-friendly dashboard built in Streamlit or Flask. It’s also scalable and can be hosted on cloud platforms like AWS or Azure for broader use.

6
Q

What success metric was used? Why?

A

We prioritized recall and ROC-AUC, as missing high-risk patients (false negatives) is more harmful in healthcare.

7
Q

What motivated the project?

A

High readmission rates among diabetics lead to financial strain and poor health outcomes. Predictive modeling can help healthcare providers proactively manage these patients.

8
Q

Who are the main users or beneficiaries of your model, and how do they benefit from it?

A

Hospitals, clinicians, care coordinators, and ultimately the diabetic patients who receive better follow-up care.

Hospitals
Reduce readmission-related costs and penalties, improve efficiency in resource allocation, and enhance overall care quality.

Clinicians
Get decision support through explainable risk scores (via SHAP), enabling smarter discharge planning and better prioritization of care.

Care Coordinators
Can target high-risk patients for follow-up, improving care efficiency and outcomes with fewer wasted efforts.

Patients
Receive more personalized and timely post-discharge care, reducing the chances of complications and improving recovery.

9
Q

Where did the data come from? Isn’t that too old to be relevant for today’s healthcare?

A

From the Diabetes 130-US Hospitals dataset (UCI Repository), covering 100,000+ encounters from 1999 to 2008.

While the dataset is historical, it still holds strong relevance for a few key reasons:

Core healthcare patterns remain consistent
The risk factors for diabetic readmissions — like frequent inpatient visits, medication complexity, and poor glycemic control — are still major clinical issues today. These variables are still tracked in modern hospital systems and are highly predictive regardless of the year.

Focus is on methodology, not direct deployment
The primary goal of the project was to explore how machine learning can be applied to hospital readmission problems. The focus was on creating a reusable and interpretable pipeline — not on immediate production deployment. The methods and insights can easily be applied to more recent datasets in the future.

Compliance and de-identification
Since the dataset is fully de-identified under HIPAA and GDPR standards, it was ethically safe and legally appropriate to use for academic research — something more recent clinical data may not allow due to privacy concerns.

10
Q

What type of data was available?

A

Demographic (age, race, gender), clinical (diagnoses, lab results, meds), and administrative (admission type, discharge disposition).

11
Q

Were there missing values? Why?

A

Yes. Columns like weight, payer_code, and medical_specialty had significant missingness (up to 96.8%) and were dropped. Placeholders like ? were converted to NaN and cleaned appropriately.

12
Q

How would you adapt your solution if you had access to more recent data?

A

If I had access to modern data, I would:

Update the model using recent patient profiles and treatment protocols.

Reassess which features are still predictive — for example, newer medications or care pathways may be available.

Re-tune hyperparameters and re-evaluate model calibration to account for shifts in care delivery and population demographics.

Compare performance to see if model generalization still holds over time.

This would help assess model drift and ensure the tool remains accurate and clinically useful.

13
Q

What steps did you take to clean the dataset?

A

1. I replaced all placeholder missing values (“?”) with NaN so they could be properly detected and handled using pandas tools.
2. I dropped columns with excessive missingness, like weight, payer_code, and medical_specialty, because they were missing in anywhere from roughly 40% to over 90% of rows and provided little useful signal.
3. I removed records with invalid values, such as “Unknown/Invalid” gender, or missing race data.
4. I excluded patients who were discharged to hospice or who died, since they cannot be readmitted and would distort the target variable.
5. I dropped identifiers like patient_nbr and encounter_id to avoid data leakage.
6. I grouped ICD-9 diagnosis codes into broader categories for interpretability and dimensionality reduction.
7. I used helper functions to simplify discharge/admission source fields into clinically relevant categories.
8. I encoded the target variable y as binary: 1 for readmitted in <30 days, and 0 for all other outcomes.
9. I created new features like med_count, service_count, and flags for medication change (chg_flag) and diabetes medication use.
These steps ensured that the data was clean, medically interpretable, and ready for training robust ML models.
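
A minimal pandas sketch of these cleaning steps (column names follow the UCI Diabetes 130-US Hospitals dataset; the project's own helper functions are not reproduced here, and the file path is illustrative):

import numpy as np
import pandas as pd

df = pd.read_csv("diabetic_data.csv")

# Step 1: treat the "?" placeholder as a real missing value
df = df.replace("?", np.nan)

# Steps 2 and 5: drop high-missingness columns and identifiers
df = df.drop(columns=["weight", "payer_code", "medical_specialty",
                      "encounter_id", "patient_nbr"])

# Step 3: remove invalid or missing demographic records
df = df[df["gender"] != "Unknown/Invalid"].dropna(subset=["race"])

# Step 8: binary target, 1 if readmitted within 30 days, 0 otherwise
df["target"] = (df["readmitted"] == "<30").astype(int)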

14
Q

What feature engineering techniques did you use?

A

Feature engineering is the process of creating, transforming, or selecting features in a dataset to make them more informative and useful for machine learning models. The goal is to help the model capture the true patterns and relationships in the data more effectively.

In my project, I used feature engineering to extract clinically meaningful signals and improve model performance. Here are a few key examples:

🔹 1. Created Aggregated Service Features

df_feat = add_service_and_med_features(df_clean)

I engineered features like:

service_count: total number of services (e.g., labs, procedures, meds)

med_count: number of unique medications

med_change_count: number of changes in medication during the stay
These features reflect the intensity and complexity of a patient’s treatment, which are strong indicators of readmission risk.

🔹 2. Grouped Diagnosis Codes

df = categorize_diagnoses(df)

I grouped detailed ICD-9 diagnosis codes into broader categories like:

Circulatory system issues

Diabetes-related conditions

Respiratory diseases
This reduced dimensionality and captured clinical meaning, making it easier for the model to learn patterns.

🔹 3. Simplified Administrative Codes

df = simplify_admin_fields(df)

I mapped administrative IDs (like admission_type_id) into grouped categories such as emergency, referral, transfer, etc. This helped reduce noise and improved interpretability.

🔹 4. Converted Age Brackets to Numeric

df['age_mid'] = df['age'].map({…})

The age column had values like [70-80). I converted these into numeric midpoints (e.g., 75) so they could be used in models as continuous variables.

🔹 5. Created Binary Flags

df['chg_flag'] = df['change'].map({'Ch': 1, 'No': 0})

I created binary indicators for whether medications were changed (chg_flag) and whether the patient was on diabetes medication (diab_med_flag). These are simple but powerful signals of instability or active treatment.
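
For illustration, a hedged sketch of the simpler mappings above (the full age-midpoint dictionary and the column values are assumptions based on the UCI dataset; helper functions like add_service_and_med_features are project-specific and not shown):

# Age brackets such as "[70-80)" mapped to numeric midpoints (assumed full mapping)
age_map = {"[0-10)": 5, "[10-20)": 15, "[20-30)": 25, "[30-40)": 35, "[40-50)": 45,
           "[50-60)": 55, "[60-70)": 65, "[70-80)": 75, "[80-90)": 85, "[90-100)": 95}
df["age_mid"] = df["age"].map(age_map)

# Binary flags for medication change and diabetes medication use
df["chg_flag"] = df["change"].map({"Ch": 1, "No": 0})
df["diab_med_flag"] = df["diabetesMed"].map({"Yes": 1, "No": 0})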

15
Q

How did you handle categorical variables?

A

We used a mix of label encoding and custom mappings:

For variables like admission_type and discharge_disposition, we used medically-informed mappings to preserve the meaning of each category.

Binary variables like change and diabetesMed were encoded as 0 and 1.

race and gender were label-encoded after cleaning invalid entries.

This transformation ensured compatibility with the ML models while keeping semantic meaning intact.
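
A short sketch of this encoding step (the admission-type grouping shown is an illustrative assumption, not the project's exact medically informed mapping):

from sklearn.preprocessing import LabelEncoder

# Illustrative grouping of admission_type_id values into broader categories
admission_groups = {1: "emergency", 2: "emergency", 3: "elective", 7: "emergency"}
df["admission_group"] = df["admission_type_id"].map(admission_groups).fillna("other")

# Label-encode cleaned categorical columns
for col in ["race", "gender", "admission_group"]:
    df[col] = LabelEncoder().fit_transform(df[col])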

16
Q

Did you normalize or standardize? Why?

A

Yes. We applied log transformation to reduce skew in features like number_inpatient and number_medications, and then used StandardScaler to standardize numerical features.

Standardization (mean = 0, std = 1) puts all features on a comparable scale. This mainly helps scale-sensitive algorithms such as Logistic Regression converge faster and weigh features fairly; tree-based boosting models are largely insensitive to feature scale, but a common scale keeps the preprocessing pipeline consistent across all the models we compared.
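
A minimal sketch of this step (the exact list of skewed columns is an assumption; in practice the scaler should be fit on the training split only):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Log-transform right-skewed count features (log1p handles zeros)
for col in ["number_inpatient", "number_emergency", "number_outpatient", "num_medications"]:
    df[col] = np.log1p(df[col])

# Standardize numeric features (excluding the target) to mean 0, std 1
num_cols = [c for c in df.select_dtypes(include="number").columns if c != "target"]
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])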

17
Q

The dataset had class imbalance. How did you address it?

A

The dataset was highly imbalanced — most patients were not readmitted within 30 days. To address this, we used SMOTE (Synthetic Minority Over-sampling Technique), applied only to the training set to prevent data leakage.

SMOTE creates synthetic examples of the minority class rather than duplicating rows, which helps reduce overfitting.

After SMOTE, the training set had balanced classes, which improved the model’s recall — a critical metric in healthcare where missing high-risk patients can be dangerous.
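
A sketch of this resampling step (assumes the imbalanced-learn package and an existing stratified split into X_train / y_train):

from imblearn.over_sampling import SMOTE

# Oversample only the training data; the test set stays untouched
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)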

18
Q

Why did you choose SMOTE over other techniques like undersampling?

A

We chose SMOTE because:

Undersampling would remove a large portion of valuable majority-class data, which could weaken the model.

SMOTE allows us to keep all original data and enhance minority class representation by synthesizing new, realistic samples.

It improves model sensitivity (recall) without the risk of overfitting that comes from simply duplicating rows.

19
Q

Which models were tested?

A

We tested four supervised classification models:

Logistic Regression

Random Forest

XGBoost

LightGBM

These were selected for their strong performance on structured tabular data, and because they offer a balance between predictive power, interpretability, and scalability.

20
Q

Why did you choose those specific models?

A

Logistic Regression is simple and interpretable, useful as a baseline.

Random Forest handles non-linear relationships well and provides feature importance.

XGBoost is highly efficient and accurate with built-in regularization.

LightGBM is even faster than XGBoost, uses less memory, supports categorical variables natively, and integrates well with SHAP for explainability.

21
Q

What criteria did you use to select the final model?

A

We prioritized:

High recall, to avoid missing high-risk patients

High ROC-AUC, to ensure strong overall discriminative ability

Interpretability, using SHAP

Training efficiency, for scalability
LightGBM was selected because it offered the best combination of these factors.

22
Q

How did you handle hyperparameter tuning?

A

We used two search strategies:

HalvingRandomSearchCV for Random Forest and XGBoost

RandomizedSearchCV for LightGBM
These techniques efficiently explored a range of hyperparameters while reducing computation time. Tuning significantly improved metrics, especially AUC and recall.
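
A condensed sketch of the two search strategies (the parameter grids and scoring choice are illustrative, not the project's exact search spaces; assumes the SMOTE-resampled training data from the earlier sketch):

from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

# Successive-halving search for Random Forest (similarly for XGBoost)
rf_search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [200, 400, 800],
                         "max_depth": [6, 10, None]},
    scoring="roc_auc", cv=5, random_state=42)
rf_search.fit(X_train_res, y_train_res)

# Randomized search for LightGBM
lgbm_search = RandomizedSearchCV(
    LGBMClassifier(random_state=42),
    param_distributions={"num_leaves": [31, 63, 127],
                         "learning_rate": [0.01, 0.05, 0.1]},
    n_iter=10, scoring="roc_auc", cv=5, random_state=42)
lgbm_search.fit(X_train_res, y_train_res)

best_lgbm = lgbm_search.best_estimator_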

23
Q

What is SMOTE, and why was it used?

A

SMOTE (Synthetic Minority Oversampling Technique) creates new, synthetic examples of the minority class to balance the dataset, improving recall and reducing model bias.

24
Q

Standardization vs Normalization:

A

I applied standardization, which transforms the features so that their mean is 0 and their standard deviation is 1. This helps ensure all features are on the same scale.

Standardization (used): Rescales data to mean=0, std=1 (StandardScaler)

Normalization: Rescales data to range [0,1] (MinMaxScaler)

25
Q

What is overfitting, and how do you avoid it?

A

Overfitting: the model performs well on the training set but poorly on the test set.

Avoided using:
Cross-validation (5-fold)
Early stopping
Simpler models as baselines

26
Q

How do the models work (at a high level)?

A

Logistic Regression: linear classifier, interpretable.
Random Forest: ensemble of decision trees (bagging).
XGBoost / LightGBM: boosted decision trees that learn sequentially and correct previous mistakes. LightGBM is faster and more efficient.

27
Q

How did you split the dataset for training and testing?

A

We used an 80/20 train-test split with stratification to preserve class balance. SMOTE was applied only to the training set to avoid data leakage and ensure fair model evaluation on the untouched test set.

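A sketch of this split (shown with train_test_split and stratify=y; the project also mentions StratifiedShuffleSplit, which achieves the same stratified 80/20 split):

from sklearn.model_selection import train_test_split

X = df.drop(columns=["target"])
y = df["target"]

# 80/20 split, preserving the readmission class ratio in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# SMOTE is then applied to X_train / y_train only; X_test stays untouched
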
28
Q

What is stratification, and why did you use it in your train-test split and cross-validation?

A

Stratification is a technique used during data splitting to ensure that the proportion of each class (e.g., readmitted vs. not readmitted) is preserved in both the training and testing sets — or across all folds in cross-validation.

Since our dataset is imbalanced — meaning there are many more patients not readmitted than those who are — stratification ensures that:

Both the training and test sets reflect the true class distribution.

The model learns and is evaluated fairly on both positive and negative cases.

It prevents misleading metrics that might happen if one class is underrepresented in the test set.

We used StratifiedShuffleSplit for the 80/20 split and StratifiedKFold for cross-validation. This helped maintain consistency and improve model robustness, especially for recall and AUC.

29
Q

Did you use any form of cross-validation?

A

Yes. We used 5-fold stratified cross-validation during training to ensure model stability and avoid overfitting. This helped us assess how well the model would generalize to unseen data.

30
Q

What is cross-validation, and why did you use it?

A

Cross-validation is a technique used to evaluate a machine learning model's ability to generalize to new, unseen data. It helps detect overfitting and ensures the model's performance isn't just good on one specific train-test split.

In our project, we used 5-fold stratified cross-validation, which works as follows:

The training data is split into 5 equal-sized subsets (called folds).

The model is trained on 4 folds and validated on the remaining 1 fold.

This process repeats 5 times, each time with a different validation fold.

The final performance is the average of all 5 evaluations.

We also used stratification during this process to ensure that each fold maintained the original class distribution (readmitted vs. not readmitted).

Why we used it:

To get a more reliable estimate of how the model will perform on unseen data.

To avoid overfitting, since the model is validated on multiple subsets.

To ensure robust and stable metrics, especially important in healthcare where reliability is critical.

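A minimal sketch of 5-fold stratified cross-validation (the estimator and scoring metric are illustrative; assumes the resampled training data from the SMOTE sketch):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LGBMClassifier(random_state=42),
                         X_train_res, y_train_res,
                         cv=cv, scoring="roc_auc")
print(f"ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
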
31
Q

What metrics did you use to evaluate model performance?

A

Accuracy
Precision
Recall (most important in this context)
F1 Score
ROC-AUC

32
Q

What is accuracy, and what does it measure?

A

Accuracy is the percentage of correct predictions made by the model out of all predictions. It tells us how often the model is correct overall.

However, in imbalanced datasets like ours, accuracy can be misleading. For example, if 90% of patients are not readmitted, a model that predicts "no" for everyone will still be 90% accurate — but useless for identifying high-risk patients.

33
Q

What is precision?

A

Precision = TP / (TP + FP)

It shows the proportion of positive predictions that were actually correct. Precision measures how many of the patients predicted to be high-risk (positive) were actually readmitted. It answers the question: "When the model says a patient will be readmitted, how often is it right?"

High precision means fewer false alarms — important when you want to avoid overwhelming the care team with unnecessary follow-ups.

34
Q

What is recall, and why is it the most important metric in your project?

A

Recall = TP / (TP + FN)

It shows the proportion of actual positives the model was able to capture. Recall (also called sensitivity) measures how many of the actual readmitted patients were correctly identified by the model. It answers: "Of all patients who were truly readmitted, how many did the model catch?"

We prioritized recall because in healthcare, missing a high-risk patient (false negative) can be dangerous. A missed prediction might result in no follow-up, leading to complications or even death. So catching as many true positives as possible is critical.

35
Q

What is the F1 Score?

A

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

The F1 Score is the harmonic mean of precision and recall, combining both metrics into a single number. It's especially useful when you need a balance between avoiding false positives and catching all true positives, as in healthcare, and it's a more reliable performance summary than accuracy on imbalanced datasets.

36
Q

What is ROC-AUC and what does it tell you?

A

ROC-AUC stands for Receiver Operating Characteristic – Area Under the Curve. It measures how well the model can distinguish between classes across all decision thresholds.

The ROC curve plots the True Positive Rate vs. the False Positive Rate, and the AUC is the area under that curve, ranging from 0 to 1:
AUC = 1.0 ⇒ perfect classifier
AUC = 0.5 ⇒ random guessing

A high AUC (like 0.95 in our case) means the model does a very good job separating readmitted from non-readmitted patients, regardless of the threshold chosen.

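For reference, a sketch of how the five evaluation metrics are computed on the held-out test set (assumes a fitted classifier named model and the test split from earlier sketches):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_prob))
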
37
Q

What results did each model achieve?

A

Logistic Regression: good baseline, but limited in capturing complex relationships.

Random Forest: solid performance and interpretability, but slower than boosting methods.

XGBoost: high AUC and good precision, but longer training time.

LightGBM: best overall — 93.36% accuracy, 99.68% precision, 87% recall, and 95.82% AUC — fast, accurate, and interpretable.

38
Q

Why did you finally choose LightGBM?

A

Although LightGBM did not have the absolute highest ROC-AUC, it offered the best overall trade-off across all key metrics:

Precision: 99.68% - fewer false alarms
Recall: 87% - critical for identifying high-risk patients in healthcare
F1 Score: 92.92% - handled both false positives and false negatives well, making it the most practical and trustworthy for real-world deployment
Accuracy: 93.36%

It also had several practical advantages:

Fast training speed
Native handling of categorical variables
Seamless integration with SHAP for model explainability
Stable performance across all cross-validation folds

These factors made LightGBM the most clinically useful, efficient, and interpretable model for deployment, even if another model had a slightly higher AUC.

39
Q

How did you evaluate the models' performance?

A

We used five main metrics:

Accuracy: overall correctness
Precision: how many positive predictions were correct
Recall: how many actual positives were captured (most important for us)
F1 Score: balance between precision and recall
ROC-AUC: overall ability to distinguish between classes across thresholds

These metrics were computed on the test set, after training and tuning the models using cross-validation.

40
Q

What is overfitting, and how did you prevent it in your project?

A

Overfitting happens when a model learns the training data too well, including its noise, outliers, or random patterns, instead of just the underlying relationships. As a result, the model performs very well on training data but fails to generalize to new, unseen data — leading to poor test performance. It's like memorizing the answers for one exam instead of learning the actual subject — it doesn't help when the questions change.

How we avoided overfitting in this project:

Cross-validation: we used 5-fold stratified cross-validation to ensure the model performed consistently across different subsets of the training data.

Early stopping: for boosting models like LightGBM and XGBoost, we applied early stopping — which halts training if the model stops improving on validation data — preventing it from fitting noise.

Regularization & tuning: we carefully tuned hyperparameters using RandomizedSearchCV and HalvingRandomSearchCV, applying built-in regularization (like max_depth, min_child_weight, etc.) to control complexity.

Data split discipline: we applied SMOTE only on the training set (never on test data) to avoid data leakage and preserve honest evaluation.

41
Q

What insights did the confusion matrix give you?

A

The confusion matrix helped us visualize:

True positives: correctly predicted readmissions
False negatives: high-risk patients missed (we aimed to reduce these)
False positives: patients flagged as high-risk who weren't readmitted

Both LightGBM and XGBoost showed very few false negatives, which is crucial in healthcare — we don't want to miss patients who are actually at risk.

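A short sketch of reading those counts from the test-set confusion matrix (assumes y_test and y_pred from the evaluation step):

from sklearn.metrics import confusion_matrix

# For binary labels 0/1, ravel() returns TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
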
42
Q

Did you use any probability calibration methods?

A

Yes. We applied sigmoid calibration using CalibratedClassifierCV to improve the reliability of the predicted probabilities. This ensures that a prediction like "0.80 probability of readmission" actually corresponds to an 80% chance, making the model's output more actionable for clinicians.

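A sketch of this calibration step (best_lgbm stands in for the tuned model from the earlier tuning sketch; the variable names are assumptions):

from sklearn.calibration import CalibratedClassifierCV

# Platt (sigmoid) calibration wrapped around the tuned classifier
calibrated = CalibratedClassifierCV(best_lgbm, method="sigmoid", cv=5)
calibrated.fit(X_train_res, y_train_res)

y_prob = calibrated.predict_proba(X_test)[:, 1]   # calibrated readmission probabilities
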
43
Q

What did the ROC and Precision-Recall curves tell you?

A

The ROC curve showed how well the model distinguishes between classes across thresholds. LightGBM and XGBoost both had high AUCs, confirming strong performance.

The Precision-Recall curve focused on our main trade-off: catching as many true positives as possible (recall) without generating too many false positives (precision). LightGBM provided the best balance in this curve, making it ideal for real-world use where both metrics matter.

44
Q

How did you ensure your results were interpretable for clinical use?

A

We used SHAP (SHapley Additive exPlanations) to:

Identify the top features influencing predictions (like number of prior inpatient visits, medication count, A1C results)
Generate global summary plots to explain the model's logic
Create individual-level waterfall plots to show why a specific patient was flagged

This allowed doctors to see not just the prediction — but the "why" behind it.

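A hedged sketch of that SHAP workflow (assumes the shap package and a fitted LightGBM model named best_lgbm; exact plot calls may vary slightly between shap versions):

import shap

explainer = shap.TreeExplainer(best_lgbm)
shap_values = explainer(X_test)

shap.plots.beeswarm(shap_values)       # global summary of feature effects
shap.plots.waterfall(shap_values[0])   # why one specific patient was flagged
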
45
Q

How did you prepare your model for deployment?

A

We prepared the model using a modular and reproducible deployment pipeline. Key steps included:

Serializing the trained LightGBM model using joblib.dump() so it can be loaded later for inference without retraining.

Saving preprocessing steps (e.g., encoders and scalers) to ensure consistency between training and prediction time.

Organizing outputs into an artifacts/ folder structure (models, data, plots, SHAP outputs), making it easier to manage and track versions.

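A sketch of the serialization step (the file paths mirror the artifacts/ layout described above but are illustrative, and the variable names come from earlier sketches):

import joblib

joblib.dump(calibrated, "artifacts/models/lgbm_readmission.joblib")
joblib.dump(scaler, "artifacts/models/scaler.joblib")

# Later, at inference time
model = joblib.load("artifacts/models/lgbm_readmission.joblib")
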
46
Q

What technologies or tools would you use to deploy this model in real-world healthcare systems?

A

The model can be deployed using:

Flask or Streamlit to create a simple dashboard or user interface for hospital staff.

REST APIs that integrate with Electronic Health Record (EHR) systems, allowing the model to receive patient data and return predictions automatically.

Cloud platforms like AWS, Azure, or GCP for hosting the model securely and at scale (e.g., using AWS Lambda or EC2).

This setup makes the model flexible, lightweight, and easy to integrate into different clinical environments.

47
Q

How would your model be used in practice by clinicians or hospital staff?

A

The model could be integrated at the point of discharge, where it receives patient data and outputs:

A readmission risk score
A binary classification (high-risk or low-risk)
A SHAP explanation showing the top factors influencing that prediction

Clinicians or care coordinators can then decide whether to:

Schedule early follow-up visits
Provide additional education
Trigger transitional care plans

This supports proactive care, instead of waiting for the patient to return.

48
Q

How do you ensure predictions are explainable for clinical decision-making?

A

We use SHAP (SHapley Additive exPlanations) to explain model outputs:

Global SHAP plots show which features are most important overall.
Local SHAP plots (like waterfall plots) explain why an individual patient was classified as high-risk.

This transparency builds clinical trust and aligns with ethical AI practices — doctors don't just see a score, they see the reasoning behind it.

49
Q

What ethical and legal considerations did you account for in deployment?

A

We addressed multiple aspects:

HIPAA & GDPR compliance: the dataset was fully de-identified, and no personally identifiable information is used in the model.

Explainability: SHAP ensures decisions can be interrogated, satisfying expectations of the EU AI Act and the WHO's trustworthy AI guidelines.

Human oversight: the model is intended as a decision-support tool, not a replacement for medical judgment.

Bias auditing: we used SMOTE to balance the training data and monitored fairness across subgroups during model evaluation.

50
Q

What risks are involved in deploying a model like this, and how can they be mitigated?

A

Risks include:

Model drift: as healthcare practices evolve, the model might become outdated.
➤ Mitigation: set up monitoring and retraining pipelines.

Over-reliance on automation: staff might blindly trust predictions.
➤ Mitigation: always pair predictions with SHAP explanations and enforce human oversight.

Data privacy concerns: sensitive patient data may be at risk.
➤ Mitigation: use secure servers, encrypted data transmission, and strict access controls.

51
Q

Why did you apply early stopping during model training?

A

Early stopping halts training when the model's performance on validation data stops improving after a set number of iterations. It prevents overfitting, especially in boosting models like LightGBM and XGBoost. It also reduces training time without compromising performance.

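A sketch of early stopping with LightGBM's scikit-learn API (the validation split and round count are illustrative; older LightGBM versions use an early_stopping_rounds argument in fit() instead of the callback):

import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Hold out part of the training data as a validation set for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_res, y_train_res, test_size=0.1, stratify=y_train_res, random_state=42)

model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05, random_state=42)
model.fit(X_tr, y_tr,
          eval_set=[(X_val, y_val)],
          eval_metric="auc",
          callbacks=[lgb.early_stopping(stopping_rounds=50)])
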
52
Q

What is the difference between RandomizedSearchCV and HalvingRandomSearchCV? Why did you use both?

A

RandomizedSearchCV randomly samples hyperparameter combinations from a defined grid — it's faster than GridSearchCV and works well when the search space is large.

HalvingRandomSearchCV starts with many combinations but evaluates them with fewer resources (e.g., fewer trees), narrowing down to the best ones iteratively.

We used both to efficiently explore parameters without wasting resources.

53
Q

How does LightGBM handle missing values and categorical variables?

A

LightGBM handles missing values natively — it learns the best direction to send them at each split during training. It also supports categorical features directly, without manual one-hot encoding. In addition, its efficiency techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), improve both training speed and memory use.

54
Q

Why did you group diagnosis codes using ICD-9 categories?

A

Raw ICD-9 codes are too granular and sparse, which can hurt model performance. We grouped them into broader clinical categories (e.g., circulatory, diabetes-related, respiratory) to:

Reduce dimensionality
Increase interpretability
Improve signal strength for modeling

55
Q

What's the difference between standardization and normalization? Which one did you use?

A

Normalization scales values to the range [0, 1] (MinMaxScaler).
Standardization scales to zero mean and unit variance (StandardScaler).

We used StandardScaler after log-transforming skewed features. This helped scale-sensitive models like Logistic Regression converge and compare features fairly; tree-based boosting models are largely insensitive to scaling, but a common scale keeps the pipeline consistent across all models.

56
Q

Why is recall more important than precision in your project?

A

In a clinical setting, missing a high-risk patient (false negative) is more dangerous than flagging a low-risk one. High recall ensures we catch as many actual readmissions as possible, even if it means occasionally flagging someone who won't return. It supports proactive care and avoids preventable complications.

57
Q

Why is ROC-AUC useful, and how does it differ from precision-recall curves?

A

ROC-AUC measures the model's ability to separate classes across all thresholds. Precision-Recall curves focus more on positive-class performance, which is better suited to imbalanced datasets like ours.

We used both: ROC-AUC for overall performance, and the Precision-Recall curve to tune thresholds and balance recall (sensitivity) against precision.

58
Q

How did you use SHAP in your project?

A

SHAP (SHapley Additive exPlanations) helped us:

Understand global feature importance
Explain individual predictions
Visualize how features like med_count, A1Cresult, and number_inpatient influenced risk

This supported ethical AI, improved clinical trust, and aligned with GDPR and EU AI Act requirements.

59
Q

How did you address bias in the model?

A

We tackled bias by:

Using SMOTE to balance the training data and prevent the model from favoring the majority class
Monitoring performance across subgroups (e.g., by age and race) using SHAP
Emphasizing explainability and clinician oversight in deployment, avoiding black-box decisions

60
Q

What steps would you take before putting this model into production?

A

Conduct external validation with recent patient data
Collaborate with clinicians to test workflows and build trust
Set up performance monitoring and retraining schedules to detect model drift
Secure the system for data privacy and access control
Document processes for compliance with HIPAA, GDPR, and the EU AI Act

61
Q

Can you explain what a threshold is in your classification model and how it affects predictions?

A

Yes, of course. In my model, the threshold is the cutoff point that decides whether a predicted probability should be classified as a positive case or a negative one. For example, my model outputs probabilities between 0 and 1 — and by default, if the probability is 0.5 or higher, it predicts the patient will be readmitted; otherwise, it predicts they won't.

But this threshold isn't fixed. I can change it depending on the clinical goal. If I lower the threshold, the model becomes more sensitive — meaning it will catch more true positives, which is good for recall, but it might also increase false positives. On the other hand, raising the threshold would give me higher precision, but I would risk missing more actual readmissions.

In my project, since the cost of missing a high-risk patient is high, I focused on tuning the threshold to prioritize recall, while still keeping precision at a useful level. I used the Precision-Recall curve to visually explore this trade-off and find a threshold that balances both, depending on clinical needs.

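A sketch of choosing such a threshold from the Precision-Recall curve (assumes test labels y_test and predicted probabilities y_prob from earlier sketches; the target recall value is illustrative):

import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_prob)

# Pick the highest threshold that still reaches the desired recall
target_recall = 0.85
mask = recall[:-1] >= target_recall          # recall[:-1] aligns with thresholds
threshold = thresholds[mask].max() if mask.any() else 0.5

y_pred_custom = (y_prob >= threshold).astype(int)
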
62
Q

What is random_state and why did you use it?

A

random_state is a parameter that controls the randomness in operations like train-test splitting, SMOTE, or model initialization. It's basically like setting a seed for the random number generator.

By setting a fixed random_state — for example, random_state=42 — I make sure that every time I run the code, I get the same results. The data gets split the same way, SMOTE generates the same synthetic samples, and the models behave consistently. This is really important for reproducibility — especially in research or when sharing my project with others. It ensures that the results can be replicated exactly, which is a key principle in both machine learning and scientific work.

I use it in:

Train-test split: this ensures that the split between training and test sets remains the same every time I run the notebook, and the class distribution is preserved through stratify=y.

SMOTE (Synthetic Minority Oversampling Technique): here, random_state ensures that SMOTE generates the same synthetic examples during resampling, which keeps training behavior consistent across experiments.

Model initialization: this sets the internal randomness of model components like tree splits or sampling, so training results don't vary unpredictably.

By using random_state consistently, I ensured that the results are deterministic, traceable, and easy to debug or replicate — which is essential in both academic research and real-world deployment scenarios.