Capstone Project Flashcards
What was the main goal of your project?
The main goal was to develop a machine learning model that predicts whether diabetic patients will be readmitted to the hospital within 30 days of discharge. Early readmissions are a critical issue in healthcare, and this model aims to help clinicians identify high-risk patients in advance so they can take proactive actions to reduce preventable readmissions.
Why is this problem important to solve?
Hospital readmissions among diabetic patients cost the healthcare system billions of dollars annually. These readmissions often reflect gaps in post-discharge care and can be avoided through better planning and follow-up. By predicting which patients are at high risk, we can improve care quality while also reducing healthcare costs.
What real benefits can your model bring to hospitals or healthcare providers?
There are several concrete benefits:
Better Patient Outcomes:
The model helps providers identify high-risk patients before discharge, so they can schedule early follow-ups, adjust medications, or provide extra guidance — all of which improve patient stability and reduce the risk of complications.
Cost Savings:
Readmissions are very expensive. Hospitals can save money and avoid penalties by focusing care on those patients most likely to return.
Smarter Resource Allocation:
Staff and resources like nurse visits or home care services can be allocated based on actual risk, making operations more efficient.
Clinical Decision Support:
The model is explainable using SHAP, which means doctors can understand the reasons behind each prediction. This increases trust and makes the model practical for use in real clinical settings.
Scalability and Reusability:
The pipeline is open-source, modular, and cloud-deployable. With minor adjustments, the same system can be reused to predict readmissions for other chronic conditions, like heart failure or COPD.
How does your model promote trust and explainability in clinical practice?
We used SHAP (SHapley Additive exPlanations), a well-known explainability tool in AI. SHAP shows how each feature (e.g., number of inpatient visits, abnormal lab results) contributes to the prediction, both globally and for each individual patient. This level of transparency aligns with ethical AI principles in healthcare and helps clinicians make informed decisions.
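As a minimal sketch (not the project's exact code), assuming a fitted tree-based classifier model and a feature DataFrame X_test:

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Some SHAP versions return one array per class for binary classifiers;
# keep the values for the positive ("readmitted <30 days") class.
if isinstance(shap_values, list):
    shap_values = shap_values[1]

# Global view: which features push readmission risk up or down across patients.
shap.summary_plot(shap_values, X_test)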
Can your solution be deployed in a real hospital environment?
Yes. The model is saved using joblib, and the full pipeline can be integrated into hospital systems via APIs, or used with a user-friendly dashboard built in Streamlit or Flask. It’s also scalable and can be hosted on cloud platforms like AWS or Azure for broader use.
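A minimal Flask sketch of what such an integration could look like (the file name readmission_model.pkl and the single /predict endpoint are illustrative assumptions, not the project's actual deployment code):

import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("readmission_model.pkl")  # hypothetical file name

@app.route("/predict", methods=["POST"])
def predict():
    # One patient encounter per request, sent as a JSON object of engineered features.
    features = pd.DataFrame([request.get_json()])
    risk = model.predict_proba(features)[0, 1]
    return jsonify({"readmission_risk": float(risk)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)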
What success metric was used? Why?
We prioritized recall and ROC-AUC, as missing high-risk patients (false negatives) is more harmful in healthcare.
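As a quick sketch, assuming y_test holds the true labels, y_pred the hard predictions, and y_proba the predicted probability of readmission:

from sklearn.metrics import recall_score, roc_auc_score

print("Recall :", recall_score(y_test, y_pred))    # share of actual readmissions caught
print("ROC-AUC:", roc_auc_score(y_test, y_proba))  # overall ranking quality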
What motivated the project?
High readmission rates among diabetics lead to financial strain and poor health outcomes. Predictive modeling can help healthcare providers proactively manage these patients.
Who are the main users or beneficiaries of your model, and how do they benefit from it?
Hospitals, clinicians, care coordinators, and ultimately the diabetic patients who receive better follow-up care.
Hospitals
Reduce readmission-related costs and penalties, improve efficiency in resource allocation, and enhance overall care quality.
Clinicians
Get decision support through explainable risk scores (via SHAP), enabling smarter discharge planning and better prioritization of care.
Care Coordinators
Can target high-risk patients for follow-up, improving care efficiency and outcomes with fewer wasted efforts.
Patients
Receive more personalized and timely post-discharge care, reducing the chances of complications and improving recovery.
Where did the data come from? Isn’t that too old to be relevant for today’s healthcare?
From the Diabetes 130-US Hospitals dataset (UCI Repository), covering 100,000+ encounters from 1999 to 2008.
While the dataset is historical, it still holds strong relevance for a few key reasons:
Core healthcare patterns remain consistent
The risk factors for diabetic readmissions — like frequent inpatient visits, medication complexity, and poor glycemic control — are still major clinical issues today. These variables are still tracked in modern hospital systems and are highly predictive regardless of the year.
Focus is on methodology, not direct deployment
The primary goal of the project was to explore how machine learning can be applied to hospital readmission problems. The focus was on creating a reusable and interpretable pipeline — not on immediate production deployment. The methods and insights can easily be applied to more recent datasets in the future.
Compliance and de-identification
Since the dataset is fully de-identified under HIPAA and GDPR standards, it was ethically safe and legally appropriate to use for academic research — something more recent clinical data may not allow due to privacy concerns.
What type of data was available?
Demographic (age, race, gender), clinical (diagnoses, lab results, meds), and administrative (admission type, discharge disposition).
Were there missing values? Why?
Yes. Columns like weight, payer_code, and medical_specialty had significant missingness (up to 96.8%) and were dropped. Placeholders like ? were converted to NaN and cleaned appropriately.
How would you adapt your solution if you had access to more recent data?
If I had access to modern data, I would:
Update the model using recent patient profiles and treatment protocols.
Reassess which features are still predictive — for example, newer medications or care pathways may be available.
Re-tune hyperparameters and re-evaluate model calibration to account for shifts in care delivery and population demographics.
Compare performance to see if model generalization still holds over time.
This would help assess model drift and ensure the tool remains accurate and clinically useful.
What steps did you take to clean the dataset?
1. I replaced all placeholder missing values (“?”) with NaN so they could be properly detected and handled using pandas tools.
2. I dropped columns with excessive missingness, like weight, payer_code, and medical_specialty, because 40% to over 90% of their values were missing and they provided little useful signal.
3. I removed records with invalid values, such as “Unknown/Invalid” gender, or missing race data.
4. I excluded patients who were discharged to hospice or who died, since they cannot be readmitted and would distort the target variable.
5. I dropped identifiers like patient_nbr and encounter_id to avoid data leakage.
6. I grouped ICD-9 diagnosis codes into broader categories for interpretability and dimensionality reduction.
7. I used helper functions to simplify discharge/admission source fields into clinically relevant categories.
8. I encoded the target variable y as binary: 1 for readmitted in <30 days, and 0 for all other outcomes.
9. I created new features like med_count, service_count, and flags for medication change (chg_flag) and diabetes medication use.
These steps ensured that the data was clean, medically interpretable, and ready for training robust ML models.
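A condensed, illustrative sketch of these steps (column names follow the UCI dataset, the file name is hypothetical, and the discharge-disposition codes used to drop expired/hospice patients are an assumption based on the dataset's ID mapping; the real project code lives in helper functions):

import numpy as np
import pandas as pd

df = pd.read_csv("diabetic_data.csv").replace("?", np.nan)

# Drop near-empty columns and identifiers that could leak information.
df = df.drop(columns=["weight", "payer_code", "medical_specialty",
                      "patient_nbr", "encounter_id"])

# Remove invalid records and out-of-scope patients (expired or discharged to hospice).
df = df[df["gender"] != "Unknown/Invalid"].dropna(subset=["race"])
df = df[~df["discharge_disposition_id"].isin([11, 13, 14, 19, 20, 21])]

# Binary target: 1 if readmitted within 30 days, 0 otherwise.
df["target"] = (df["readmitted"] == "<30").astype(int)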
What feature engineering techniques did you use?
Feature engineering is the process of creating, transforming, or selecting features in a dataset to make them more informative and useful for machine learning models. The goal is to help the model capture the true patterns and relationships in the data more effectively.
In my project, I used feature engineering to extract clinically meaningful signals and improve model performance. Here are a few key examples (an illustrative sketch of the helpers follows the list):
🔹 1. Created Aggregated Service Features
df_feat = add_service_and_med_features(df_clean)
I engineered features like:
service_count: total number of services (e.g., labs, procedures, meds)
med_count: number of unique medications
med_change_count: number of changes in medication during the stay
These features reflect the intensity and complexity of a patient’s treatment, which are strong indicators of readmission risk.
🔹 2. Grouped Diagnosis Codes
df = categorize_diagnoses(df)
I grouped detailed ICD-9 diagnosis codes into broader categories like:
Circulatory system issues
Diabetes-related conditions
Respiratory diseases
This reduced dimensionality and captured clinical meaning, making it easier for the model to learn patterns.
🔹 3. Simplified Administrative Codes
df = simplify_admin_fields(df)
I mapped administrative IDs (like admission_type_id) into grouped categories such as emergency, referral, transfer, etc. This helped reduce noise and improved interpretability.
🔹 4. Converted Age Brackets to Numeric
df['age_mid'] = df['age'].map({…})
The age column had values like [70-80). I converted these into numeric midpoints (e.g., 75) so they could be used in models as continuous variables.
🔹 5. Created Binary Flags
df['chg_flag'] = df['change'].map({'Ch': 1, 'No': 0})
I created binary indicators for whether medications were changed (chg_flag) and whether the patient was on diabetes medication (diab_med_flag). These are simple but powerful signals of instability or active treatment.
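A rough, hypothetical sketch of what these helpers could look like. The real add_service_and_med_features and categorize_diagnoses may differ; the drug columns shown are only a subset of the dataset's 23 medication columns, and the ICD-9 ranges follow the grouping commonly used with this dataset:

import pandas as pd

MED_COLS = ["metformin", "insulin", "glipizide", "glyburide", "pioglitazone"]  # subset of the drug columns

def add_service_and_med_features(df):
    df = df.copy()
    # Treatment intensity: labs + procedures + medications during the stay.
    df["service_count"] = df["num_lab_procedures"] + df["num_procedures"] + df["num_medications"]
    # Drugs the patient actually received, and how many were adjusted up or down.
    df["med_count"] = (df[MED_COLS] != "No").sum(axis=1)
    df["med_change_count"] = df[MED_COLS].isin(["Up", "Down"]).sum(axis=1)
    return df

def icd9_to_category(code):
    if pd.isna(code) or str(code).startswith(("E", "V")):
        return "Other"
    value = float(code)
    if 390 <= value <= 459 or value == 785:
        return "Circulatory"
    if 460 <= value <= 519 or value == 786:
        return "Respiratory"
    if int(value) == 250:
        return "Diabetes"
    return "Other"

for col in ["diag_1", "diag_2", "diag_3"]:
    df[col] = df[col].map(icd9_to_category)

# Age brackets to numeric midpoints, plus the two binary flags.
age_midpoints = {"[0-10)": 5, "[10-20)": 15, "[20-30)": 25, "[30-40)": 35, "[40-50)": 45,
                 "[50-60)": 55, "[60-70)": 65, "[70-80)": 75, "[80-90)": 85, "[90-100)": 95}
df["age_mid"] = df["age"].map(age_midpoints)
df["chg_flag"] = df["change"].map({"Ch": 1, "No": 0})
df["diab_med_flag"] = df["diabetesMed"].map({"Yes": 1, "No": 0})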
How did you handle categorical variables?
We used a mix of label encoding and custom mappings:
For variables like admission_type and discharge_disposition, we used medically-informed mappings to preserve the meaning of each category.
Binary variables like change and diabetesMed were encoded as 0 and 1.
race and gender were label-encoded after cleaning invalid entries.
This transformation ensured compatibility with the ML models while keeping semantic meaning intact.
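An illustrative sketch, assuming the cleaned DataFrame df; the admission-type grouping shown is an assumption about the mapping, not the project's exact one:

from sklearn.preprocessing import LabelEncoder

# Medically-informed grouping of an administrative ID column.
admission_groups = {1: "emergency", 2: "urgent", 3: "elective", 4: "newborn", 7: "trauma"}
df["admission_type"] = df["admission_type_id"].map(admission_groups).fillna("other")

# Simple label encoding for nominal columns after invalid entries were removed.
for col in ["race", "gender", "admission_type"]:
    df[col] = LabelEncoder().fit_transform(df[col])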
Did you normalize or standardize? Why?
Yes. We applied log transformation to reduce skew in features like number_inpatient and number_medications, and then used StandardScaler to standardize numerical features.
Standardization (mean = 0, std = 1) helps scale-sensitive algorithms, such as Logistic Regression, converge faster and perform more consistently; tree-based models like XGBoost and LightGBM are largely insensitive to feature scale, but standardizing keeps the preprocessing consistent across all the models we compared.
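Roughly, assuming a prior train/test split into the DataFrames X_train and X_test (the skewed column names shown follow the UCI dataset):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Reduce right skew in count features, then standardize.
# Fit the scaler on the training split only to avoid leakage.
skewed = ["number_inpatient", "num_medications"]
X_train[skewed] = np.log1p(X_train[skewed])
X_test[skewed] = np.log1p(X_test[skewed])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)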
The dataset had class imbalance. How did you address it?
The dataset was highly imbalanced: most patients were not readmitted within 30 days. To address this, we used SMOTE (Synthetic Minority Over-sampling Technique), applied only to the training set to prevent data leakage.
SMOTE creates synthetic examples of the minority class rather than duplicating rows, which helps reduce overfitting.
After SMOTE, the training set had balanced classes, which improved the model’s recall — a critical metric in healthcare where missing high-risk patients can be dangerous.
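In code, a minimal sketch assuming a feature matrix X and binary target y:

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first, then oversample only the training fold so the test set
# keeps the real-world class distribution (no leakage).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)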
Why did you choose SMOTE over other techniques like undersampling?
We chose SMOTE because:
Undersampling would remove a large portion of valuable majority-class data, which could weaken the model.
SMOTE allows us to keep all original data and enhance minority class representation by synthesizing new, realistic samples.
It improves model sensitivity (recall) without the risk of overfitting that comes from simply duplicating rows.
Which models were tested?
We tested four supervised classification models:
Logistic Regression
Random Forest
XGBoost
LightGBM
These were selected for their strong performance on structured tabular data, and because they offer a balance between predictive power, interpretability, and scalability.
Why did you choose those specific models?
Logistic Regression is simple and interpretable, useful as a baseline.
Random Forest handles non-linear relationships well and provides feature importance.
XGBoost is highly efficient and accurate with built-in regularization.
LightGBM is even faster than XGBoost, uses less memory, supports categorical variables natively, and integrates well with SHAP for explainability.
What criteria did you use to select the final model?
We prioritized:
High recall, to avoid missing high-risk patients
High ROC-AUC, to ensure strong overall discriminative ability
Interpretability, using SHAP
Training efficiency, for scalability
LightGBM was selected because it offered the best combination of these factors.
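An illustrative comparison loop with default hyperparameters and 5-fold cross-validated ROC-AUC, assuming the SMOTE-balanced training data X_train_bal and y_train_bal; the project's actual evaluation also tracked recall on a held-out test set:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(random_state=42),
    "LightGBM": LGBMClassifier(random_state=42),
}
for name, model in models.items():
    auc = cross_val_score(model, X_train_bal, y_train_bal, cv=5, scoring="roc_auc").mean()
    print(f"{name}: ROC-AUC = {auc:.3f}")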
How did you handle hyperparameter tuning?
HalvingRandomSearchCV for Random Forest and XGBoost
RandomizedSearchCV for LightGBM
These techniques efficiently explored a range of hyperparameters while reducing computation time. Tuning significantly improved metrics, especially AUC and recall.
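A hedged sketch of how these searches can be set up; the parameter grids shown are assumptions, not the project's actual search spaces, and X_train_bal / y_train_bal are the balanced training data from earlier:

from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

rf_search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [200, 400, 800],
                         "max_depth": [None, 10, 20],
                         "min_samples_leaf": [1, 5, 10]},
    scoring="roc_auc", random_state=42)

lgbm_search = RandomizedSearchCV(
    LGBMClassifier(random_state=42),
    param_distributions={"num_leaves": [31, 63, 127],
                         "learning_rate": [0.01, 0.05, 0.1],
                         "n_estimators": [200, 500, 1000]},
    n_iter=20, scoring="roc_auc", random_state=42)

rf_search.fit(X_train_bal, y_train_bal)
lgbm_search.fit(X_train_bal, y_train_bal)
print(rf_search.best_params_, lgbm_search.best_params_)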
What is SMOTE and why used?
SMOTE (Synthetic Minority Oversampling Technique) creates new, synthetic examples of the minority class to balance the dataset, improving recall and reducing model bias.
Standardization vs Normalization:
I applied standardization, which transforms the features so that their mean is 0 and their standard deviation is 1. This helps ensure all features are on the same scale.
Standardization (used): Rescales data to mean=0, std=1 (StandardScaler)
Normalization: Rescales data to range [0,1] (MinMaxScaler)
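For contrast, a tiny sketch scaling the same column both ways (num_medications is used only as an example, and X_train is assumed to be the training-feature DataFrame):

from sklearn.preprocessing import StandardScaler, MinMaxScaler

std = StandardScaler().fit_transform(X_train[["num_medications"]])  # mean 0, std 1
mm = MinMaxScaler().fit_transform(X_train[["num_medications"]])     # values in [0, 1]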