Report questions Flashcards
(20 cards)
Your research question is predictive. Why did you choose this over, say, a causal or descriptive formulation (cf. Leek & Peng, 2015; Lecture 3)?
Answer:
We chose a predictive formulation—“To what extent can factors predict slow and fast picking times?”—because the primary business objective is operational decision support rather than hypothesis testing or explanation. As Leek & Peng (2015) argue, one of the most common mistakes in data analysis is misidentifying the type of question being asked. Our problem required categorizing and anticipating performance outcomes (i.e., picking speed), not explaining them in a causal framework or simply describing them. A predictive model provides actionable outputs for logistical planning (e.g., anticipating bottlenecks), which a descriptive or causal model would not directly support.
What we could have done better:
While a predictive question is relevant, we could have strengthened the research design by explicitly contrasting it with alternative question types. For instance, an explanatory (causal) question like “What factors cause increases in picking time?” could have led us to explore intervention-based insights. Even if a causal analysis wasn’t feasible due to data limitations, acknowledging this would have shown greater methodological maturity (Lecture 3; Leek & Peng, 2015).
We also could have justified our predictive approach more explicitly using business needs (e.g., real-time forecasting requirements), and discussed trade-offs such as reduced interpretability and generalizability compared to causal models.
How does your research question align with Salling’s operational or strategic business goals?
Answer:
Salling’s logistics strategy emphasizes quality, efficiency, and optimization in warehouse performance. Our question directly supports these goals by targeting one of the key cost drivers in warehouse operations—picking time. By predicting slow and fast pickings, the model can be embedded into real-time decision systems (e.g., dynamic picker assignments or restocking alerts). This creates a direct line from data insights to business action, fulfilling the CRISP-DM requirement to align modeling tasks with business objectives (Lecture 3; Domingos, 2012; Course Plan Week 3).
What we could have done better:
Although our question supports operational efficiency, we could have tightened the alignment by engaging more directly with Salling’s strategic KPIs (e.g., delivery time targets, warehouse cost per unit picked). For example, we might have translated our prediction outputs into potential cost savings or labor allocation impacts, providing more tangible value.
Moreover, we could have interviewed warehouse managers or used internal documentation to better understand how picking speed is actually monitored or acted upon in daily operations. This would have made our model outputs more relevant and actionable.
Veeramachaneni (2016) stresses the disconnect between business value and data science outputs. How did you mitigate this risk in your approach?
Answer:
We explicitly mitigated this risk by:
Involving domain knowledge early: Assumptions about fixed routes, uniform product handling, and layout were validated with Salling (Section 2.1.2).
Simplifying modeling: We prioritized interpretable models (logistic regression, random forest) instead of more complex black-box models.
Using business-aligned categories: The transformation of picking time into normal_fast and normal_slow makes the output more actionable for warehouse managers.
Providing concrete recommendations (Section 4.3): We interpreted results in light of potential interventions like layout changes or picker training.
These strategies align with Veeramachaneni’s advice to avoid technical isolation and to optimize for business value, not model complexity (Books and Paper Summaries, p. 1).
What we could have done better:
While we did implement interpretable models and used feature engineering aligned with domain knowledge, we fell short in demonstrating measurable business impact. For example:
We could have run a simulation or scenario analysis to estimate how much time or cost could be saved by acting on our model’s insights.
We might have created dashboards or reporting templates that visualize slow-picking predictions in a way warehouse supervisors could use in real-time.
Furthermore, we should have included explicit discussions with business users (even hypothetically, if real stakeholders were unavailable) to validate whether our model outputs and classifications made sense for their decision workflows.
Why is CRISP-DM appropriate in this case, rather than another data science process model (e.g., KDD or SEMMA)?
Answer:
CRISP-DM is well-suited because:
It emphasizes business understanding and iterative refinement, both of which were crucial in navigating the assumptions, feature engineering, and modeling steps in our project.
Unlike SEMMA, which is more tool-specific and oriented toward data mining software (e.g., SAS), CRISP-DM offers flexibility and is method-neutral, allowing for integration of both traditional EDA and machine learning (Lecture 3; Domingos, 2012).
CRISP-DM provided a clear structure for documenting our workflow: from identifying the business goal, through modeling, to recommendations for deployment (Section 2).
The process enhanced transparency and communication—two challenges highlighted by Berinato (2019) in bridging the gap between data teams and business stakeholders.
What we could have done better:
We referenced CRISP-DM, but we could have used the framework more thoroughly to reflect on our process iteratively:
We might have looped back from modeling to data understanding after encountering assumption violations or poor performance.
In the deployment phase, we could have discussed how Salling might integrate our findings into their operational systems, or what challenges they might face in doing so.
Additionally, comparing CRISP-DM more explicitly to other frameworks like KDD or SEMMA would have shown a deeper conceptual engagement with methodology (Session 3; Course Plan Week 3).
You state “no such thing as raw data” (Session 2). How did this philosophy inform your assumptions and cleaning choices?
Answer:
We were guided by the foundational insight from Rosenberg (2013) and Session 2 that “raw data is an oxymoron.” This means that data is never neutral—it is always shaped by the systems and contexts in which it is produced. Consequently, we treated the dataset from Salling not as a perfect reflection of warehouse operations, but as a constructed representation.
This influenced our decisions to:
Investigate the origin and format of variables like picking_time, which was derived from system timestamps.
Make transparent assumptions about missing values and potential logging errors (e.g., cases with zero or negative durations).
Recognize that standardized warehouse procedures (e.g., fixed routing) introduced structured behavior into the data, which affects variance and the meaning of performance.
By foregrounding these considerations, we maintained a critical stance toward our data and grounded our preprocessing decisions in both technical and epistemological awareness (Rosenberg, 2013; Bit by Bit, Chapter 2; Session 2).
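A minimal sketch of how this provenance awareness can be made explicit in code, under stated assumptions: the file name and timestamp format are hypothetical, while start_time, finish_time, and picking_time follow the naming used in the report.

```python
import pandas as pd

# Hypothetical file name and timestamp columns; start_time / finish_time / picking_time
# follow the naming used in the report.
df = pd.read_csv("pickings.csv", parse_dates=["start_time", "finish_time"])

# Derive the duration the way the report describes: a difference of system timestamps,
# i.e. a constructed quantity rather than a directly observed one.
df["picking_time"] = (df["finish_time"] - df["start_time"]).dt.total_seconds()

# Flag rather than silently drop implausible durations, so the cleaning decision stays visible.
invalid = df["picking_time"] <= 0
print(f"{invalid.sum()} of {len(df)} rows have zero or negative durations")
df_clean = df.loc[~invalid].copy()
```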
What we could have done better:
We referenced assumptions, but we did not explicitly reflect on the constructed nature of our dataset, as highlighted in “Raw data is an oxymoron” (Rosenberg, 2013; Session 2). We treated picking time as an objective measure, yet it is the result of several layers of prior system design choices (e.g., scanner logs, time rounding, picker assignments).
We could have:
Critically questioned the provenance of each variable (e.g., how and why timestamps are recorded), especially given the reliance on warehouse management systems.
Applied the “data vs evidence” distinction (Session 2) to be more transparent about which data actually support our interpretations.
Discussed algorithmic confounding or drift (Bit by Bit) if pickers or policies had changed during the data collection period.
This would have enhanced the epistemological depth of our assumptions and data ethics.
How would Rosenberg (2013) critique your operationalization of “picking time” as a proxy for efficiency?
Answer:
Rosenberg (2013), in “Raw Data” Is an Oxymoron, would likely critique our operationalization of “picking time” as a proxy for efficiency on several grounds. His central argument is that data are not neutral observations of the world, but rather constructed representations shaped by decisions about what to measure, how to record it, and why it matters.
Applying this lens, several issues arise:
Reductionism: Our report equates shorter picking time with higher efficiency, assuming all seconds are equal. Rosenberg would challenge this simplification, asking “Efficiency for whom?” and “Under what assumptions?” For example, fast picking could also result from skipping quality checks or informal rule-breaking—undermining the intended business objective of high service quality.
Contextual Blindness: By abstracting “picking time” from the broader work context (e.g., product fragility, workload fluctuations, worker fatigue), we risk misrepresenting the labor process. Rosenberg emphasizes that data cannot be divorced from the social, technical, and institutional systems in which they are produced.
Obscured Infrastructure: The start and finish times used to calculate picking time are generated by warehouse information systems, not directly observed. Rosenberg would highlight how these timestamps reflect system design decisions (e.g., scanning rules, idle-time logging) that embed assumptions and blind spots into the data.
In short, Rosenberg would critique our use of picking time as epistemologically naive unless we critically examine its provenance, embedded assumptions, and institutional framing.
What we could have done better:
To address Rosenberg’s critique and strengthen the epistemic rigor of our analysis, we could have:
Reflected on Data Construction: Explicitly discussed how picking time is generated—not just calculated—and identified any sociotechnical biases in that process (e.g., whether breaks or delays unrelated to the task are recorded).
Triangulated with Qualitative Data: Supplemented quantitative timestamps with qualitative context, such as interviews, internal documentation, or worker feedback to validate what picking time really captures.
Qualified Our Efficiency Proxy: Framed picking time not as a direct measure of efficiency but as a proxy with known limitations, including potential trade-offs with accuracy, compliance, or safety.
Explored Alternative Metrics: Investigated or proposed composite efficiency indicators, such as items picked per minute adjusted for volume or route complexity, which may better reflect operational realities.
By embedding Rosenberg’s critical perspective into our methodology, we would not only improve the validity of our conclusions but also demonstrate a deeper understanding of data epistemology, as emphasized in Session 2 of the course.
What risks are introduced by using a single warehouse’s data during a specific three-week window (including Easter)? How did you address nonrepresentativeness (Bit by Bit; Lecture 2)?
Answer:
The limited temporal and spatial scope of our dataset presents a risk of nonrepresentativeness, one of the problematic characteristics of big data identified in Bit by Bit (Salganik, 2018) and Session 2. Our data was drawn from only one warehouse and spanned a three-week period that included Easter, a known seasonal event with atypical consumer demand and staffing dynamics.
To address this, we:
Conducted exploratory plots of volume and picking activity by calendar week to detect any strong Easter effects (Appendix 9).
Included weekday and hour-of-day variables to control for time-related operational shifts.
Emphasized in our evaluation and discussion sections that our findings are valid only within this specific temporal context, and should not be extrapolated to other time periods or warehouses without further testing.
These mitigations reflect our effort to maintain internal validity, even if external generalizability is limited.
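A minimal sketch of the week/weekday/hour breakdown described above, reusing the df_clean frame from the earlier sketch; the aggregation choices are illustrative.

```python
# Reuses df_clean from the earlier sketch; aggregation choices are illustrative.
df_clean["iso_week"] = df_clean["start_time"].dt.isocalendar().week
df_clean["weekday"] = df_clean["start_time"].dt.day_name()
df_clean["hour"] = df_clean["start_time"].dt.hour

# Picking counts and median duration per calendar week: a clear jump in the Easter
# week would signal a seasonal effect to model explicitly or to exclude.
print(df_clean.groupby("iso_week")["picking_time"].agg(n="size", median_s="median"))

# The same idea by weekday and hour feeds the time-of-day control variables.
print(df_clean.groupby(["weekday", "hour"])["picking_time"].median().unstack().round(1))
```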
What we could have done better:
We acknowledged this limitation, but we could have quantified the risk of nonrepresentativeness using:
Comparisons across calendar weeks to see if Easter significantly skewed behavior.
Sensitivity analyses, e.g., removing holiday-adjacent days and rerunning model training to check for performance stability.
Further, we could have evaluated seasonal or temporal confounds by plotting performance or volume across days and hours (Session 3.1), then modeled time effects explicitly (e.g., using interaction terms or time-based dummy variables).
Finally, we could have discussed this limitation using sampling theory and external validity frameworks, as encouraged in Week 3–4 lectures.
Your feature picking_speed is a categorical variable based on IQR thresholds. What are the trade-offs of this discretization in terms of information loss and model interpretability?
Answer:
The creation of the picking_speed variable was intended to enhance interpretability and enable classification models to predict “normal_fast” vs “normal_slow” pickings within an operationally meaningful range (i.e., ≤250 seconds). This aligns with business needs that often require actionable categories rather than precise duration estimates.
However, this discretization introduces trade-offs:
Information loss: Subtle variations within the continuous picking_time variable are no longer captured. This can reduce model sensitivity.
Boundary bias: Observations just below or above the threshold (e.g., 14.9s vs. 15.1s) are treated as qualitatively different, which may not reflect reality.
Statistical power: Categorizing a continuous variable discards variance and can reduce statistical power, depending on how distinct the resulting classes are.
To counteract this, we anchored the categorization in interquartile thresholds, ensuring data-driven class boundaries, and included EDA to validate that these bins reflected meaningful operational distinctions.
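A sketch of the IQR-anchored labelling, under stated assumptions: the cutoffs mentioned in the report (roughly 15 s and 28 s within the ≤250 s range) are data-dependent, so quartiles are recomputed here, and the handling of mid-range observations is purely illustrative.

```python
import numpy as np

# Restrict to the operationally meaningful range and recompute the quartile anchors.
in_range = df_clean[df_clean["picking_time"] <= 250].copy()
q1, q3 = in_range["picking_time"].quantile([0.25, 0.75])

# Label construction: below Q1 -> normal_fast, above Q3 -> normal_slow. How the report
# treats observations between the quartiles is not shown here; dropping them (as below)
# is itself an information-losing choice that should be documented.
in_range["picking_speed"] = np.select(
    [in_range["picking_time"] <= q1, in_range["picking_time"] >= q3],
    ["normal_fast", "normal_slow"],
    default="mid",
)
labelled = in_range[in_range["picking_speed"] != "mid"].copy()
print(labelled["picking_speed"].value_counts())
```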
What we could have done better:
While the fast/slow categorization was intuitively understandable, we failed to analyze how much information was lost by transforming a continuous variable into a binary outcome. This could have affected:
Model sensitivity (reduced ability to differentiate near-boundary cases),
Potential biases introduced by arbitrary thresholds (e.g., 15s and 28s),
Over-simplification of business complexity.
We could have:
Performed a threshold sensitivity analysis to explore alternative cutoffs (see the sketch after this list).
Compared results with a model trained on the continuous picking_time variable using regression, and reported the trade-off in accuracy vs. interpretability.
Aligned our thresholding method more closely with business thresholds (e.g., expected service levels or time budgets).
This would have grounded our transformation in both technical and business logic.
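A sketch of the threshold sensitivity analysis suggested above; the feature list is hypothetical, and a single “slow” cutoff is varied for simplicity instead of the report's two thresholds.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

features = ["volume", "hour", "weekday_num"]            # assumed engineered predictors
for cutoff in [20, 28, 35, 45]:                         # candidate "slow" cutoffs in seconds
    y_cut = (in_range["picking_time"] >= cutoff).astype(int)
    auc = cross_val_score(LogisticRegression(max_iter=1000),
                          in_range[features], y_cut,
                          cv=5, scoring="roc_auc").mean()
    print(f"cutoff={cutoff:>3}s  mean CV AUC={auc:.3f}")
```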
You opted to remove observations >1500s and high-volume pickings >0.60 m². How do you justify these thresholds in terms of bias vs. variance trade-off (Lecture 7)?
Answer:
We chose these thresholds to improve the bias-variance trade-off, a central concept in supervised machine learning (Lecture 7). Observations exceeding 1500 seconds were rare, often associated with anomalous or erroneous system behavior, and added high variance to the target variable.
Our thresholding strategy was guided by:
Distributional analysis (Appendix 3, 10, and 11), which showed heavy right skew.
IQR-based outlier detection, consistent with Hellerstein (2008) and data cleaning practices discussed in Lecture 3.
A conservative approach to high-volume cutoffs: instead of removing all values above the IQR threshold, we retained more observations by setting the threshold at 0.60 m².
By reducing the influence of these extreme values, we improved model stability, especially for algorithms sensitive to scale like logistic regression. At the same time, we were transparent about the limitations this introduced for detecting rare but operationally significant events.
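A minimal sketch of the two cleaning rules and the IQR reference point; the volume column name is an assumption about the dataset.

```python
# Apply the two cutoffs described above; "volume" is an assumed column name.
before = len(df_clean)
df_trim = df_clean[(df_clean["picking_time"] <= 1500) &
                   (df_clean["volume"] <= 0.60)].copy()
print(f"removed {before - len(df_trim)} of {before} rows")

# The classic IQR fence (Hellerstein, 2008) as a point of comparison for the chosen cutoff.
q1, q3 = df_clean["picking_time"].quantile([0.25, 0.75])
print(f"IQR upper fence: {q3 + 1.5 * (q3 - q1):.0f} s vs. chosen cutoff: 1500 s")
```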
What we could have done better:
We discussed right-skew and used IQR-based rules, but we did not rigorously justify our threshold selection or test its impact on model generalization.
To improve:
We should have explored model performance before and after outlier removal to empirically assess the variance reduction.
Considered modeling the long-tail separately (e.g., using a two-part model or including a “rare event” classifier).
Investigated heteroscedasticity or mixture distributions in picking times, especially if slow-pick events had distinct causes (e.g., damaged items, restocking interruptions).
Additionally, we should have reflected on whether outliers represented true variation or system noise (cf. Lecture 3, Week 3 on assumption testing and model diagnostics).
Why did you choose logistic regression and random forest specifically? How do their assumptions differ, and how do these assumptions relate to the nature of your data?
Answer:
We chose logistic regression and random forest because they represent two well-established, complementary approaches to classification. Logistic regression is a parametric model that offers high interpretability and clear assumptions, while random forest is a non-parametric ensemble model capable of capturing complex non-linear patterns.
Logistic regression assumes:
Linearity in the log-odds
No multicollinearity
Independent observations
Sufficiently large sample size for stable coefficient estimates
These assumptions partially align with our dataset. However, the potential for non-linearity and the inclusion of engineered categorical predictors (e.g., product_freq_tag, picking_speed) may violate linearity assumptions, which justified the use of random forest.
Random forest does not rely on strict assumptions about feature distributions or functional form. It is robust to multicollinearity, skewed distributions, and outliers, making it suitable for our feature space that includes time-based, volume-based, and categorical engineered variables.
Using both models allowed us to compare performance vs. interpretability, and triangulate model findings against business relevance.
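A minimal sketch of the two-model setup, building on the labelled frame and feature list from the earlier sketches; the hyperparameters are illustrative rather than the report's tuned values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Features and label from the earlier sketches; 1 = normal_slow.
X = labelled[features]
y = (labelled["picking_speed"] == "normal_slow").astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=300, max_depth=8, random_state=42).fit(X_train, y_train)

for name, model in [("logistic regression", logit), ("random forest", forest)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```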
What we could have done better:
We did not test model assumptions (e.g., VIF for multicollinearity in logistic regression, linearity of log-odds); a minimal VIF check is sketched after this list.
We could have justified the choice of models in relation to the business goal more directly. For example, “We chose logistic regression because managers at Salling require interpretable guidance on operational bottlenecks.”
We might have experimented with alternative models (e.g., gradient boosting or naïve Bayes) and documented why they were excluded.
Better practice: Include a model selection matrix based on trade-offs: interpretability, performance, complexity, and business alignment.
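A minimal sketch of the VIF check mentioned above, using the training features from the previous sketch; it assumes those features are numeric.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF on the logistic-regression design matrix (X_train from the previous sketch).
X_vif = sm.add_constant(X_train)
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.round(2))  # rule of thumb: values well above ~5-10 signal problematic collinearity
```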
Against which baseline did you compare model performance, and why? How does this baseline reflect business value rather than just statistical performance?
Answer:
Our baseline was a majority class classifier that always predicts the most frequent class (normal_fast). This reflects a realistic operational default: if no model is used, a business might assume normal operations and only escalate slow pickings when problems arise.
This baseline is statistically simple but business-relevant, as it mimics a “do nothing” strategy. It helps determine whether our models provide incremental value over naive heuristics.
Additionally, we could have used heuristic rules (e.g., “flag all pickings > volume threshold”) as an operational baseline, which would further align with decision-support workflows in warehouse environments.
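A minimal sketch contrasting the majority-class baseline with such a heuristic; the 0.30 volume cutoff is an illustrative number, and X_train, X_test, y_train, y_test carry over from the modelling sketch above.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# "Do nothing" baseline: always predict the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(f"majority-class accuracy: {accuracy_score(y_test, majority.predict(X_test)):.3f}")

# Operational heuristic baseline: flag high-volume pickings as slow (cutoff is illustrative).
heuristic_pred = (X_test["volume"] > 0.30).astype(int)
print(f"heuristic accuracy: {accuracy_score(y_test, heuristic_pred):.3f}")
print(f"heuristic recall on slow pickings: {recall_score(y_test, heuristic_pred):.3f}")
```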
What we could have done better:
We only used a majority-class baseline. This is standard, but not context-aware.
We missed the opportunity to define a business-relevant operational rule baseline, e.g., flagging all pickings over a certain volume or in specific warehouse zones.
Better practice: Compare models not just to statistical baselines, but to realistic heuristics that domain experts already use.
How did you assess overfitting and underfitting (Lecture 7)? Were your train-test splits aligned with temporal patterns in the data (e.g., early vs. late pickings)?
Answer:
We assessed model fit using:
Train-test split (typically 70/30)
Cross-validation
Performance metrics: Accuracy, precision, recall, and AUC-ROC
Overfitting was checked by comparing train vs. test performance, particularly drops in AUC. Logistic regression was more stable; random forest had better test-set performance but required tuning (e.g., max_depth, n_estimators) to avoid overfitting.
However, we did not align our train-test split with temporal ordering, meaning that data from different time periods may be mixed across folds. This introduces a risk of data leakage in time-dependent processes (e.g., pickers getting faster over time), which could inflate reported performance.
What we could have done better:
We did not implement time-aware splitting, which is essential for temporal data to avoid look-ahead bias (Lecture 7; Session 5).
Cross-validation was mentioned but not deeply analyzed—no fold performance plots or variance reporting.
We could have applied learning curves to visualize how performance evolved with more data.
Better practice:
Use rolling-window validation for time-based data (see the sketch after this list).
Explicitly report overfit indicators (e.g., train vs. test AUC gaps, precision-recall tradeoffs).
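A rolling-window sketch of the time-aware evaluation referenced above; it reuses the labelled frame and feature list from the earlier sketches (which still carry start_time), and the fold count is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit

# Sort chronologically so every fold trains on earlier pickings and tests on later ones.
df_time = labelled.sort_values("start_time")
X_t = df_time[features]
y_t = (df_time["picking_speed"] == "normal_slow").astype(int)

for fold, (tr, te) in enumerate(TimeSeriesSplit(n_splits=4).split(X_t)):
    model = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_t.iloc[tr], y_t.iloc[tr])
    auc_tr = roc_auc_score(y_t.iloc[tr], model.predict_proba(X_t.iloc[tr])[:, 1])
    auc_te = roc_auc_score(y_t.iloc[te], model.predict_proba(X_t.iloc[te])[:, 1])
    print(f"fold {fold}: train AUC {auc_tr:.3f} | test AUC {auc_te:.3f} | gap {auc_tr - auc_te:.3f}")
```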
Random forests typically outperform logistic regression but are harder to interpret. How did you balance predictive power with the ability to provide actionable business insights?
Answer:
We prioritized model interpretability alongside accuracy. Logistic regression provided clear coefficient estimates, allowing us to explain the direction and magnitude of predictor effects (e.g., time-of-day effects on picking performance). These insights were useful for communicating findings to stakeholders.
Random forest offered higher predictive performance, but to enhance interpretability we:
Reported feature importance plots
Used partial dependence plots to explain key relationships (e.g., volume vs. predicted slowdown)
This dual-model approach balanced actionability and predictive strength, in line with course recommendations (Veeramachaneni, 2016; Lecture 6–7).
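A minimal sketch of the feature-importance and partial-dependence outputs described above; forest, X_train, X_test, and the volume feature come from the earlier assumed setup.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.inspection import PartialDependenceDisplay

# Rank features by impurity-based importance from the fitted forest.
importances = pd.Series(forest.feature_importances_, index=X_train.columns).sort_values()
importances.plot.barh(title="Random forest feature importance")

# Partial dependence of the predicted slow-picking probability on volume.
PartialDependenceDisplay.from_estimator(forest, X_test, features=["volume"])
plt.show()
```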
What we could have done better:
We discussed feature importance, but we didn’t present example decision rules from the random forest (e.g., representative paths through trees).
Partial dependence or SHAP plots could have been used to visualize non-linear effects more intuitively.
Better practice:
Include model explainability techniques beyond feature ranking.
Conduct stakeholder usability tests to check which outputs are actually helpful to warehouse managers.
You mention “removal of fast pickers.” Could this result in biased insights if these individuals represent under-recognized patterns (e.g., due to technology use or training)? Discuss in light of Session 9 and the “responsibility” slide.
Answer:
Yes, removing fast pickers could introduce bias and ethical blind spots in the analysis. While the exclusion was technically motivated—to eliminate statistical outliers and prevent skewed model training—this decision carries significant ethical and epistemological implications.
From an ethical standpoint (Session 9), the responsibility for using data fairly includes considering what is being removed and why. If fast pickers had valid strategies (e.g., using shortcuts, personalized routines, or superior training), their exclusion might:
Mask innovation or best practices, which could have been instructive for organizational learning.
Reinforce bias by excluding a group that may differ in gender, contract status, or training history (even if those variables weren’t directly observed).
Violate the principle of procedural fairness, by filtering out legitimate diversity in operational styles.
According to the ethics lecture and Kosinski et al. (2013) examples, powerful algorithms can reflect and reinforce systemic inequalities when poorly scoped or curated. Our decision to exclude outliers should therefore be scrutinized not just statistically, but morally.
What we could have done better:
Assumption Testing: Rather than removing fast pickers, test their impact by building models with and without them and comparing performance or explanations.
Alternative Labeling: Investigate whether fast picking correlates with higher error rates, retrials, or special tasks (e.g., urgent orders).
Inclusion Criteria Transparency: Clearly state in the methodology how outlier removal might affect equity and downstream policy use.
Better practice: Reflect on outlier removal as a value-laden decision, not a neutral technical step.
Your report implies that insights from one warehouse could be cautiously applied to others. What fairness and ethical concerns arise from this, particularly in ML deployment across heterogeneous contexts?
Answer:
Deploying an ML model trained in one warehouse across multiple sites raises important concerns regarding fairness, representativeness, and unintended consequences, especially when:
Operational practices differ across warehouses (e.g., layout, pacing, shift patterns).
Picker populations vary, including temporary vs. permanent staff, or differences in gender, age, or ability.
Cultural or regional norms influence how tasks are approached.
Without validation in new contexts, models may perform poorly or unfairly penalize certain groups. For example, a picker flagged as “slow” in Warehouse A might be average in Warehouse B due to different physical layouts or product mixes.
From a data ethics perspective (Session 9, slide: “Homogenization of workforce”), such generalization risks standardizing behavior in ways that ignore local knowledge and suppress diversity. It may also result in loss of trust or resistance from affected employees and unions.
What we could have done better:
Conduct External Validity Tests: Use simulated or historical data from a second warehouse (if available) to evaluate generalizability.
Fairness Metrics: Even if demographic data wasn’t available, surrogate fairness checks (e.g., by shift, team, or day) could help assess whether model performance varies across groups (a minimal example follows this list).
Adaptability Discussion: Offer guidance on how to adapt the model to new environments (e.g., through retraining, reweighting features).
Better practice: Incorporate a “responsible AI” section outlining deployment conditions and fairness safeguards.
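A minimal sketch of the surrogate fairness check referenced above; the shift column and the test_meta frame holding it are hypothetical, and the fitted forest comes from the earlier sketches.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Compare model behaviour across a proxy group (here: shift) instead of demographics.
eval_df = X_test.copy()
eval_df["y_true"] = y_test.values
eval_df["y_pred"] = forest.predict(X_test)
eval_df["shift"] = test_meta["shift"].values            # hypothetical metadata aligned with X_test

by_group = eval_df.groupby("shift").apply(
    lambda g: pd.Series({
        "n": len(g),
        "recall_slow": recall_score(g["y_true"], g["y_pred"]),
        "flag_rate": g["y_pred"].mean(),
    })
)
print(by_group.round(3))  # large gaps across shifts warrant closer review before deployment
```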
How would you communicate the results to non-technical stakeholders, ensuring they make informed decisions without oversimplifying uncertainty (Berinato, 2019)?
Answer:
To bridge the last-mile problem (Berinato, 2019; Session 1), we would:
Use visual storytelling (e.g., dashboards with filters for time-of-day or volume) to surface key insights while maintaining access to the underlying data.
Emphasize confidence intervals and uncertainty bands to avoid overconfidence in predictions. For example: “We are 80% confident that slow pickings increase during night shifts, but results vary by picker experience.”
Frame insights using counterfactuals and scenarios, such as: “If high-volume pickings are relocated to row 3, our model suggests a potential 12% reduction in slow events.”
Avoid technical jargon (e.g., ROC, AUC) and instead present model performance in business terms, such as cost reduction potential, improvement in on-time deliveries, or fewer delayed shifts.
We would also provide a summary of limitations, clearly identifying where the model might fail or require retraining—particularly if applied in different warehouses.
What we could have done better:
Prototype a Dashboard: Include mock-ups or example visualizations to illustrate how managers could interact with model outputs.
Scenario-Based Reporting: Introduce “if-then” tables showing how different variables affect predicted picking speed in business terms.
Use Communication Roles: Align with Berinato’s (2019) model by showing how a communicator, analyst, and business lead might each contribute to final deployment.
Better practice: Embed communication plans into the deployment phase using clear roles and tools.
Looking back, how might the final models or data handling have differed if your question were causal or explanatory rather than predictive?
Answer:
Since we framed our research question as predictive—“To what extent can factors predict slow and fast picking times?”—we focused on classification accuracy and variable importance. As a result, our data preparation, feature engineering, and evaluation prioritized predictive power, using models like logistic regression and random forest.
Had our question been causal or explanatory (e.g., What causes slower picking times?), our approach would have required:
Careful causal diagramming (e.g., DAGs) to identify confounders.
Attention to endogeneity, selection bias, and reverse causality.
Different models, such as linear regression with controls, instrumental variables, or propensity score matching, to infer effects rather than just predict outcomes.
What we could have done better:
We could have explicitly articulated the implications of using a predictive vs. causal framework in Section 1.
We could have acknowledged the limits of prediction in explaining “why” something occurs—especially when recommending actions based on variable importance.
Even within a predictive framework, we might have applied causal sensitivity analysis (e.g., simulated interventions on warehouse layout or picker shift allocation) to enhance the decision-making value of the models.
What steps did you take to prevent insights from “dying” before reaching decision-makers? (Session 1; Berinato, 2019)
Answer:
To address the last-mile problem—the gap between analysis and implementation—we took several steps to make the insights understandable and relevant for non-technical stakeholders:
We used categorical labels like normal_fast and normal_slow to simplify communication of results.
We included variable importance plots and a discussion of feature effects in plain language.
We translated technical performance (e.g., AUC) into business implications, like potential gains in warehouse efficiency or picker training.
Berinato (2019) emphasizes that insights need narratives, not just numbers. Our report included a discussion section (4.3) with practical suggestions for Salling based on model findings, which aimed to bridge that gap.
What we could have done better:
We did not build or mock up a decision-support tool or dashboard that visualizes model outputs in an operational context.
We could have tested our insights with non-technical readers (e.g., through informal usability testing or stakeholder feedback loops).
Including storytelling structures or annotated examples of how predictions could drive specific warehouse decisions (e.g., reassigning pickers, reorganizing stock) would have increased uptake potential.
If this project were scaled across 600 stores, what parts of your pipeline would need to be automated (Veeramachaneni, 2016)? What new risks might emerge?
Answer:
Scaling to Salling’s full warehouse network would demand automation in several areas of the pipeline:
Data ingestion and cleaning: Automating ETL (Extract, Transform, Load) processes for handling varied data formats, warehouse layouts, and timestamp structures.
Model retraining and evaluation: Incorporating pipelines for regular model updates, potentially using MLOps platforms.
Feature engineering: Automating variable creation based on warehouse-specific layouts, volume metrics, and picker identifiers.
According to Veeramachaneni (2016), scaling requires a shift from manual analysis to repeatable workflows. However, automation introduces new risks:
Loss of context sensitivity (e.g., warehouse-specific nuances may be ignored).
Model drift if operational patterns change over time.
Ethical opacity—automated decision-making systems risk amplifying bias if unchecked (e.g., penalizing certain picker groups across sites).
What we could have done better:
We could have mapped out the automation architecture explicitly: which tools (e.g., Airflow, R scripts, Docker) could handle which steps.
We didn’t explore monitoring frameworks for detecting drift or fairness violations over time.
We should have flagged scalability trade-offs, particularly around maintaining transparency, accountability, and stakeholder trust when decisions are automated.
You removed observations with missing time fields. Would imputation have been more appropriate in this business context? What are the causal assumptions behind each choice (Session 3/4)?
Answer:
We removed rows with missing start_time or finish_time, assuming that such entries were missing at random (MAR) and rare enough to not meaningfully bias the results. This approach simplified our analysis and preserved data integrity for duration calculations.
However, in business contexts, removal can obscure systematic patterns, such as:
Errors in specific picker shifts or system outages.
Missing data correlated with performance (e.g., faster pickers forgetting to log timestamps).
Imputation could have been appropriate if:
The missingness was believed to be MAR or MCAR (missing completely at random).
Sufficient contextual features were available for predictive imputation (e.g., shift, volume, product type).
What we could have done better:
We didn’t perform missingness analysis (e.g., missingness heatmaps, patterns by picker or shift).
We could have compared results with and without imputation to test sensitivity.
We missed the chance to evaluate causal assumptions: if missingness depended on unobserved efficiency, then deletion may have biased the sample.
Better practice: Use multiple imputation with diagnostics, or model missingness directly when it may carry business significance.
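A minimal sketch of the missingness diagnostics and the imputation sensitivity check suggested here; df_raw, picker_id, and the context feature list are assumptions about the pre-cleaning dataset.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Share of missing timestamps overall and by picker: systematic patterns here would
# undermine the missing-at-random assumption behind simple row deletion.
miss = df_raw[["start_time", "finish_time"]].isna()
print(miss.mean().round(4))
print(df_raw.assign(missing=miss.any(axis=1))
            .groupby("picker_id")["missing"].mean()
            .sort_values(ascending=False).head())

# Sensitivity check: impute durations from context features (assumed columns) and
# compare downstream results against the complete-case analysis used in the report.
context = ["volume", "hour", "weekday_num", "picking_time"]
imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df_raw[context]),
                       columns=context, index=df_raw.index)
```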