3 - Data Science Foundations Flashcards

Question

What approach did Maya recommend for clients with a missing employer field?

Answer 1

Exclusion from the data set.

Answer 2

Identify whether they are unemployed, self-employed, or retired.

Answer 3

Investigate the impact of the omission on the total reimbursement.

Answer 4

The median ## Footnote The median represents the middle value when all data points are sorted from lowest to highest.

Answer 5

Calculate both the mean and median and see if they are different. ## Footnote Substantial differences may indicate a non-symmetric distribution or the presence of outliers.
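
The check described in this card can be sketched in a few lines of Python. The claim counts below are hypothetical values chosen for illustration, with one deliberate outlier; they are not taken from the case study.

```python
from statistics import mean, median

# Hypothetical yearly claim counts; the last value is a deliberate outlier.
claims = [2, 3, 3, 4, 5, 6, 40]

claims_mean = mean(claims)      # pulled upward by the outlier
claims_median = median(claims)  # middle value of the sorted data

print(f"mean={claims_mean}, median={claims_median}")
```

A mean well above the median, as here, hints at a right-skewed distribution or an outlier worth inspecting.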

Answer 6

The mean and median are usually close to each other. ## Footnote A bell curve is an example of a symmetric distribution.

Answer 7

In right-skewed distributions, mode < median < mean; in left-skewed distributions, mode > median > mean. ## Footnote Skewness affects the relationship between these measures of central tendency.

Answer 8

Range, percentiles, interquartile range, standard deviation, confidence intervals, variance. ## Footnote Each measure is useful in different scenarios.

Answer 9

The difference between the 75th and 25th percentiles. ## Footnote It corresponds to the 'middle half' of a data set and is less affected by outliers.
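
As a rough illustration of why the interquartile range is less affected by outliers than the range, here is a minimal sketch using Python's standard library. The data values are made up, and the `inclusive` quantile method is one of several common conventions, chosen here as an assumption.

```python
from statistics import quantiles

def iqr(values):
    """Return the 75th percentile minus the 25th percentile."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical data: the second list replaces one value with an extreme outlier.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 900]

for data in (clean, with_outlier):
    print("range:", max(data) - min(data), "IQR:", iqr(data))
```

The range jumps when the outlier is introduced, while the IQR stays the same because it depends only on the middle half of the data.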

Answer 10

A moderately positive relationship. ## Footnote This suggests that the average number of claims increases with increasing age.

Answer 11

A quantity that can take on many possible values across a continuum. ## Footnote Examples include money, time, age, height, and weight.

Answer 12

A perfect positive or negative relationship, respectively. ## Footnote For example, age has a perfect positive correlation with the date.

Answer 13

There may be no clear linear pattern between the variables. ## Footnote However, weak correlations can still be useful for developing predictive models.

Answer 14

It assumes a linear relationship between the variables. ## Footnote Other coefficients, like Spearman, are more robust to non-linear relationships.

Answer 15

To quickly identify outliers and their role in estimating correlation. ## Footnote This helps avoid misleading conclusions from data errors.

Answer 16

It generates questions and hypotheses for further investigation. ## Footnote It is not designed to yield definitive findings.

Answer 17

Under 40, 40–65, and 65+. ## Footnote This categorization helps distinguish between Medicare-age patients, middle-aged patients, and young patients.

Answer 18

Having more conditions doesn't necessarily mean a person is sicker. ## Footnote For instance, someone with pancreatic cancer is sicker than someone with minor allergies, despite having fewer conditions.

Answer 19

The proposed metric has limitations due to the lack of other data on health status that could measure how sick patients are. ## Footnote This highlights the importance of being transparent about definitions and interpretations in metrics.

Answer 20

Older patients may utilize routine screening procedures and preventive health care more than younger patients. ## Footnote Examples include colonoscopies, mammograms, and routine PCP checkups.

Answer 21

By counting procedure codes for relevant services. ## Footnote This includes screening procedures and preventive services.

Answer 22

Are older patients sicker than younger patients? ## Footnote This involves evaluating health status and health care utilization.

Answer 23

A reference value in a hypothesis test that represents no effect or no difference. ## Footnote It serves as the baseline against which the quantity of interest is tested.

Answer 24

1. Effect size
2. p-value

## Footnote These outputs help evaluate the significance of the findings.

Answer 25

The quantity being tested in a hypothesis test, indicating the magnitude of the difference or association. ## Footnote For example, the difference in average number of claims per year between older and younger patients.

Answer 26

The probability of correctly rejecting a false null hypothesis when the alternative hypothesis is true. ## Footnote It indicates the test's ability to detect an effect when there is one.

Answer 27

* Number of data points
* Magnitude of the effect size
* Significance level or p-value threshold

## Footnote More data generally increases power, as does a larger effect size.
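
To make the three factors concrete, the sketch below computes exact power for a one-sided coin-flip test (null hypothesis: the coin is fair). The coin scenario, the function names `upper_tail` and `power`, and all numbers are illustrative assumptions, not values from the case study.

```python
from math import comb

def upper_tail(n, threshold, p):
    """P(X >= threshold) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(threshold, n + 1))

def power(n, true_p, alpha=0.05):
    """Power of a one-sided test of H0: p = 0.5 against p > 0.5."""
    # Smallest head count whose tail probability under H0 is at most alpha.
    critical = next(c for c in range(n + 2) if upper_tail(n, c, 0.5) <= alpha)
    # Probability of reaching that count when the coin's true bias is true_p.
    return upper_tail(n, critical, true_p)

print("n=20,  p=0.6:", round(power(20, 0.6), 3))
print("n=100, p=0.6:", round(power(100, 0.6), 3))   # more data
print("n=100, p=0.7:", round(power(100, 0.7), 3))   # larger effect size
print("n=100, p=0.6, alpha=0.01:", round(power(100, 0.6, alpha=0.01), 3))
```

In this sketch, power rises with more flips and with a larger true bias, and falls when the significance threshold is made stricter.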

Answer 28

The probability of observing data at least as extreme as the observed data, assuming the null hypothesis is true. ## Footnote It helps assess the evidence against the null hypothesis.

Answer 29

Strong evidence against the null hypothesis, suggesting a significant effect exists. ## Footnote A common threshold for statistical significance is 0.05.

Answer 30

Why do older patients tend to have more insurance claims? ## Footnote 'Why' questions do not lend themselves to hypothesis testing.

Answer 31

More data increases confidence in the estimated probability of outcomes. ## Footnote This reduces the influence of random chance on effect estimates.

Answer 32

Sample size calculations determine the necessary amount of data to achieve a desired power level. ## Footnote This ensures sufficient data is collected before analysis.

Answer 33

It indicates that there is strong enough evidence to suggest that the null hypothesis is not true.

Answer 34

It is an arbitrary threshold.

Answer 35

Particle physics.

Answer 36

The effect estimate and the p-value.

Answer 37

* Large effect size and high p-value (not statistically significant)
* Large effect size and low p-value (statistically significant)
* Small effect size and high p-value
* Small effect size and low p-value

Answer 38

Low confidence in the effect estimate due to insufficient data.

Answer 39

High confidence that the effect is real.

Answer 40

Data must meet certain assumptions for the tests to be valid.

Answer 41

The number of claims submitted by each patient.

Answer 42

Relationships between variables.

Answer 43

To visualize the relationship between two continuous variables.

Answer 44

The change in Y for every one unit increase of X.

Answer 45

Y = mX + b.
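
A minimal sketch of how m and b can be estimated by ordinary least squares is shown below. The function name `fit_line` and the data points are hypothetical, and the data were constructed to lie exactly on a line so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of Y = mX + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data lying exactly on Y = 2X + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = fit_line(xs, ys)
print(f"Y = {m}X + {b}")
```

Here the slope m is the regression coefficient: Y changes by m units for every one-unit increase in X.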

Answer 46

The slope of the resulting line in linear regression.

Answer 47

* Inference: Understanding associations among variables
* Prediction: Estimating the outcome for an individual

Answer 48

Whether the variable is meaningful or a proxy for the true variable of interest.

Answer 49

It can lead to uninterpretable results and poor decision-making.

Answer 50

Combining ranges of continuous values to form groups.
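
As an illustration, the age groups used elsewhere in this deck (Under 40, 40–65, and 65+) could be assigned as sketched below. The deck's ranges overlap at exactly 65, so placing age 65 in the 65+ group is an assumption made here, as is the function name `age_group`.

```python
LABELS = ["Under 40", "40–65", "65+"]

def age_group(age):
    """Map a continuous age to one of three categories."""
    if age < 40:
        return LABELS[0]
    # Assumption: exactly 65 is treated as part of the 65+ group.
    if age < 65:
        return LABELS[1]
    return LABELS[2]

ages = [23, 39, 40, 64, 65, 81]
print([age_group(a) for a in ages])
```

Ages 40 and 64 end up with the same label, which is the loss of information that categorization trades for interpretability.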

Answer 51

It leads to loss of information.

Answer 52

To account for the specific population of Medicare-eligible patients.

Answer 53

Claims, categorized as fewer than fifty claims or fifty or more claims submitted per year ## Footnote This categorization lacks a clear rationale or interpretability.

Answer 54

Exposures, features, inputs, or predictors ## Footnote These are the variables quantifying the association with the outcome.

Answer 55

Which independent variables should be included in the model?

Answer 56

Include all accessible variables and let the model determine meaningful associations ## Footnote This approach requires less thought and domain knowledge.

Answer 57

Possibility of false positives and results that are difficult to interpret

Answer 58

The risk of finding spurious associations

Answer 59

Models are typically more interpretable ## Footnote It may have reduced predictive power due to fewer predictor variables.

Answer 60

List of variables believed to be associated with the outcome

Answer 61

* Diagnoses of patients
* Duration of conditions
* Chronic vs. acute conditions
* Specific insurance plan
* Demographics such as place of residence and employment status

Answer 62

Patients with chronic diseases file more claims, with older patients filing significantly more than younger patients

Answer 63

4.34 for younger patients and 15.17 for older patients

Answer 64

Why do older patients’ chronic conditions lead to more claims than younger patients’ chronic conditions?

Answer 65

* Are there any outliers in the data?
* What assumptions does the hypothesis test make?
* Is the data normally distributed?
* Do we have adequate sample size?

Answer 66

* Making unrealistic assumptions
* Excluding data arbitrarily
* Picking the wrong null hypothesis
* Ignoring effect size or p-value
* Assuming correlation is causation

Answer 67

When a model captures noise along with the underlying pattern, performing poorly on unseen data

Answer 68

The practice of performing many statistical tests until a significant result is found ## Footnote This practice is misleading and unethical.
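
A short calculation shows why repeated testing is misleading. The sketch assumes every test is independent, every null hypothesis is true, and the threshold is 0.05; the function name is hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    falls below alpha when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20):
    print(k, "tests:", round(chance_of_false_positive(k), 3))
```

Under these assumptions, running 20 tests gives roughly a 64 percent chance of at least one "significant" result arising from chance alone.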

Answer 69

To ensure that the company’s patients receive the most cost-effective care possible.

Answer 70

Desirable outcomes at a reasonable cost.

Answer 71

A small percentage of causes often leads to a large percentage of results.

Answer 72

80 percent of reimbursements go to 20 percent of patients.
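
One way to check a claim like this against data is to compute the share of the total that goes to the top fifth of patients. The reimbursement amounts below are invented so that the split comes out to exactly 80/20, and `top_share` is a hypothetical helper.

```python
def top_share(amounts, top_fraction=0.2):
    """Fraction of the total that goes to the highest-valued top_fraction of entries."""
    ordered = sorted(amounts, reverse=True)
    top_n = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / sum(ordered)

# Hypothetical per-patient reimbursements for ten patients.
reimbursements = [4000, 4000, 250, 250, 250, 250, 250, 250, 250, 250]
print(f"Top 20% of patients receive {top_share(reimbursements):.0%} of reimbursements")
```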

Answer 73

Patients who have received surgery.

Answer 74

To generate key insights into the data set, including limitations and relationships between variables.

Answer 75

The procedures table.

Answer 76

Replacing a missing value with a nonmissing value.

Answer 77

They had missing values in the 'Employer' field.

Answer 78

It could exclude unemployed, self-employed, or retired patients who are important for analysis.

Answer 79

From 2015 to 2023.

Answer 80

Major changes to reimbursement rates.

Answer 81

Exclusion or imputation.
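
The two options can be sketched side by side. The records and the `reimbursement` field are hypothetical, and median imputation is only one of many possible imputation strategies, used here as an assumption.

```python
from statistics import median

# Hypothetical records; patient B has a missing reimbursement value.
records = [
    {"patient": "A", "reimbursement": 120.0},
    {"patient": "B", "reimbursement": None},
    {"patient": "C", "reimbursement": 80.0},
    {"patient": "D", "reimbursement": 100.0},
]

# Option 1, exclusion: drop records with a missing value.
excluded = [r for r in records if r["reimbursement"] is not None]

# Option 2, imputation: replace the missing value with a nonmissing one,
# here the median of the observed values.
fill = median(r["reimbursement"] for r in excluded)
imputed = [
    {**r, "reimbursement": fill if r["reimbursement"] is None else r["reimbursement"]}
    for r in records
]

print(len(excluded), "records after exclusion;", len(imputed), "after imputation")
```

Exclusion shrinks the data set, while imputation keeps every record at the cost of introducing an estimated value, so either choice should be justified and documented.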

Answer 82

A positive slope in the scatterplot indicates a relationship.

Answer 83

The average (mean).

Answer 84

The median.

Answer 85

A high-security data lake.

Answer 86

* Procedure and diagnosis codes
* Date of the claim
* Patient encounter date
* Unique identifier for the provider organization
* Reimbursement amount

Answer 87

Increased by 20 percent.

Answer 88

Identifying high-cost patients and their demographics.

Answer 89

The median ## Footnote The median is the middle value when all data points are sorted from lowest to highest.

Answer 90

The distribution of the variable is not symmetric or there are outliers ## Footnote This can also suggest that the variable of interest may have extreme values or long tails.

Answer 91

A symmetric distribution ## Footnote In symmetric distributions, the mean and median are usually close to each other.

Answer 92

Mode < Median < Mean ## Footnote In right-skewed distributions, the mean is pulled towards the tail.

Answer 93

* Range
* Percentiles
* Interquartile range
* Standard deviation
* Confidence intervals
* Variance

## Footnote Each measure is useful in different scenarios.

Answer 94

The nth percentile value means that n percent of all data points have a value less than or equal to that value ## Footnote For example, the 50th percentile is the median.

Answer 95

The difference between the 75th and 25th percentiles ## Footnote It corresponds to the middle half of a data set and is less sensitive to outliers.

Answer 96

A moderately positive relationship between two variables ## Footnote This means the average number of claims increases with increasing claimant age.

Answer 97

Between -1 and 1 ## Footnote A coefficient of 1 or -1 indicates a perfect positive or negative relationship, respectively.

Answer 98

Continuous variables ## Footnote Continuous variables can take on many possible values across a continuum.

Answer 99

Continuous variables can take on many values; categorical variables can take on only a few values ## Footnote An example of a categorical variable is a patient's race.

Answer 100

Linear relationships ## Footnote Other types of correlations, like the Spearman coefficient, look at rank order and are more robust to other relationships.

Answer 101

It may be misleading ## Footnote For example, a U-shaped relationship could yield a weak linear correlation despite having a strong general relationship.
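
The U-shaped case, and the contrast with the rank-based Spearman coefficient mentioned in the previous card, can be sketched as follows. The data are synthetic, and the simple `spearman` helper assumes there are no tied values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman(xs, ys):
    """Pearson correlation of the ranks (this sketch assumes no tied values)."""
    def ranks(values):
        order = sorted(values)
        return [order.index(v) + 1 for v in values]
    return pearson(ranks(xs), ranks(ys))

xs = [-3, -2, -1, 0, 1, 2, 3]
u_shaped = [x ** 2 for x in xs]  # strong relationship, but not linear
curved = [x ** 3 for x in xs]    # always increasing, but not a straight line

print("Pearson, U-shaped:", pearson(xs, u_shaped))
print("Pearson, curved:  ", round(pearson(xs, curved), 3))
print("Spearman, curved: ", spearman(xs, curved))
```

For the U-shaped data the Pearson coefficient is zero even though Y is completely determined by X. For the curved but always-increasing data, Spearman reports a perfect rank relationship while Pearson falls short of 1.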

Answer 102

To quickly identify outliers and their role in estimating correlation ## Footnote Visualizations can help prevent misleading conclusions from erroneous data points.

Answer 103

Generate questions and hypotheses for further exploration ## Footnote They are not intended to yield definitive findings.

Answer 104

* Are older patients sicker than younger patients?
* Do older patients use health care services more than younger patients?
* Is the relationship between age and number of claims driven by recently added employers?

## Footnote These hypotheses help focus further analyses.

Answer 105

Under 40, 40–65, and 65+ ## Footnote This categorization helps distinguish among different patient age groups.

Answer 106

Looking at preexisting conditions ## Footnote Diagnosis codes can also provide insight into health status.

Answer 107

Having more diagnoses does not necessarily mean being sicker ## Footnote For example, someone with pancreatic cancer is likely sicker than someone with benign conditions.

Answer 108

The proposed definition had limitations due to a lack of captured data on health status that could measure how sick patients are. ## Footnote Kamala emphasized the importance of clearly stating definitions, limitations, and interpretations in metrics.

Answer 109

Stick with the proposed metric while clearly stating how it is measured and interpreted. If results do not make sense, consider importing a comorbidity index from another source. ## Footnote This approach allows for the use of an imperfect metric in the absence of better data.

Answer 110

Older patients may use routine screening procedures and preventive health care more than younger patients, independent of how sick they are. ## Footnote Examples include colonoscopies, mammograms, and routine PCP checkups.

Answer 111

Utilization of routine screening and preventive health care services, and possibly other services not indicative of the severity of a condition. ## Footnote This metric is distinct from overall health care claims.

Answer 112

Using procedure codes for any service that falls into the category of routine screening and preventive services. ## Footnote Kamala agreed to work with her team to compile a list of relevant procedure codes.

Answer 113

The specific age groups weren't detailed in the text, but they were identified as part of the analysis plan. ## Footnote The age groups are essential for stratifying the data in the analysis.

Answer 114

By counting the number of preexisting conditions and diagnosis codes while noting the limitations of this definition. ## Footnote This approach acknowledges that the health status metric may not be completely accurate.

Answer 115

They combined clinical and analytical expertise to create a plan that is computationally feasible and clinically relevant. ## Footnote This cross-functional collaboration enhanced the analysis's overall quality.

Answer 116

False. ## Footnote They acknowledged the imperfections in their metrics and planned accordingly.

Answer 117

Procedure codes. ## Footnote This method provides a structured way to define the services of interest.

Answer 118

A statistical hypothesis test evaluates whether a quantity of interest is meaningfully different from a reference value. ## Footnote The reference value is often referred to as the null hypothesis.

Answer 119

The null hypothesis is a difference in height of 0 inches between men and women.

Answer 120

The p-value represents the probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true.
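
For a coin-flip test, this definition can be computed exactly. The sketch below assumes a null hypothesis of a fair coin and uses hypothetical flip counts; it does not reproduce the specific numbers from the course's coin example.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """Exact two-sided p-value under the null hypothesis of a fair coin."""
    probs = [comb(flips, k) / 2 ** flips for k in range(flips + 1)]
    observed = probs[heads]
    # Add up every outcome that is at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

print("5 heads in 10 flips: ", two_sided_p_value(5, 10))
print("8 heads in 10 flips: ", two_sided_p_value(8, 10))
print("10 heads in 10 flips:", two_sided_p_value(10, 10))
```

The more lopsided the result, the smaller the p-value, meaning such data would be increasingly unlikely if the coin were truly fair.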

Answer 121

The effect size indicates the magnitude of the difference or association being tested in the hypothesis test.

Answer 122

Increasing the number of data points increases the power of a test.

Answer 123

The probability of correctly rejecting a false null hypothesis when the alternative hypothesis is true.

Answer 124

The p-value decreases, indicating stronger evidence against the null hypothesis.

Answer 125

Averages can be misleading, as they may not represent the underlying population accurately.

Answer 126

A small p-value suggests that the null hypothesis may not be true.

Answer 127

The p-value was 0.5.

Answer 128

It indicates low confidence in the effect estimate due to insufficient data.

Answer 129

The power of a test is influenced by sample size, the magnitude of the effect size, and the significance level.

Answer 130

It means that lower p-values indicate stronger evidence against the null hypothesis, rather than a strict cutoff.

Answer 131

The necessary sample size for a hypothesis test to achieve a predetermined power.

Answer 132

It suggests strong evidence that the coin is rigged.

Answer 133

Both the effect estimate and the p-value should be considered together.

Answer 134

There is insufficient evidence to determine the effect's significance.

Answer 135

50 percent ## Footnote The p-value in this scenario is 0.75, indicating insufficient evidence about the effect estimate's accuracy.

Answer 136

0.00003 ## Footnote This p-value suggests that the coin is likely not completely fair.

Answer 137

Data follows a bell curve or normal distribution ## Footnote Other assumptions may include equal standard deviations between groups being compared.

Answer 138

To visualize the relationship between two continuous variables ## Footnote Scatterplots display the spread of points that can indicate correlation.

Answer 139

The change in Y for every one unit increase of X ## Footnote It quantifies the relationship between two variables.

Answer 140

An M unit increase in the Y variable for every unit increase in the X variable ## Footnote The regression coefficient is derived from the slope of the line of best fit.

Answer 141

* Inference
* Prediction

## Footnote Inference involves understanding associations, while prediction focuses on forecasting individual outcomes.

Answer 142

Whether the variable is meaningful ## Footnote Ensuring the variable accurately represents the phenomenon of interest is crucial.

Answer 143

Loss of information ## Footnote Categorization should only be done when it offers substantial interpretability gains.

Answer 144

Variables for which we are quantifying the association with the outcome ## Footnote They can also be referred to as exposures, features, inputs, or predictors.

Answer 145

* Data-driven: Includes all available variables and lets the model find associations
* Hypothesis-driven: Selects variables based on prior beliefs about their associations

## Footnote The data-driven approach may lead to false positives, while the hypothesis-driven approach is typically more interpretable.

Answer 146

Possibility of false positives ## Footnote Certain variables may appear associated due to random chance rather than a true relationship.

Answer 147

A list of variables believed to be associated with the outcome ## Footnote This reflects the hypothesis-driven approach.

Answer 148

Diagnoses patients have ## Footnote Certain conditions may correlate with more health system encounters.

Answer 149

People with chronic conditions may have consistently more claims than people without them.

Answer 150

Diagnoses, specific insurance plan, demographics (place of residence, employment status, number of dependents).

Answer 151

The presence of chronic diseases.

Answer 152

In younger patients, the coefficient was 4.34; in older patients, it was 15.17.

Answer 153

Why do older patients’ chronic conditions lead to more claims than younger patients’ chronic conditions?

Answer 154

Outliers can affect the results.

Answer 155

Are the assumptions valid?

Answer 156

Is the data normally distributed?

Answer 157

Could lead to a nonsignificant p-value or underpowered analysis.

Answer 158

Univariable analysis examines one variable; multivariable analysis examines multiple variables.

Answer 159

Interpreting only the p-value and not the effect size.

Answer 160

Confounder.

Answer 161

When a model captures noise along with the underlying pattern in the data.

Answer 162

Performing many statistical tests until a significant result is found.

Answer 163

The exclusion needs to be justified and understood.

Answer 164

The results are statistically significant.

Answer 165

Leads to a biased representation of the analysis.

Answer 166

It gives an idea of the magnitude or importance of the effect.

Answer 167

How was missing data dealt with? Does imputation/exclusion make sense?

(199 cards)