3 - Data Science Foundations Flashcards

1
Q

What is the main focus of Kamala’s role at Stardust Health Insurance?

A

Ensuring that patients receive the most cost-effective care possible.

2
Q

What does cost-effective medical care produce?

A

Desirable outcomes at a reasonable cost.

3
Q

What is the Pareto principle?

A

A relationship where a small percentage of causes produce a large majority of results.

4
Q

What key demographic information does Kamala consider about the patient population?

A

Patients’ age and gender distributions, common conditions, and medications.

5
Q

What percentage of reimbursements went to 20 percent of patients at Stardust?

A

80 percent.

6
Q

Why is the health care reimbursement a significant expenditure for Stardust?

A

It represents the company’s largest source of expenditure.

7
Q

What is Kamala’s target population for her analysis?

A

Patients who have received surgery.

8
Q

What is exploratory data analysis?

A

An analysis that generates insights into the data set, including limitations, summary statistics, and relationships between variables.

9
Q

What is included in the claims database used for analysis?

A

Claims made by all covered patients from 2015 to 2023.

10
Q

What age group is excluded from the claims data due to privacy concerns?

A

Clients below age 18.

11
Q

What is the main table in the claims database called?

A

The procedures table.

12
Q

What is imputation in data analysis?

A

Replacing a missing value with a nonmissing value.

13
Q

True or False: The date of patient encounter is always recorded in the claims database.

A

False.

14
Q

What percentage of the time is the ‘Date of patient encounter’ field empty?

A

17 percent.

15
Q

What is a common issue with using averages in data analysis?

A

Averages can be misleading if extreme values are present.

16
Q

What measure of central tendency is less sensitive to extreme values?

A

Median.

17
Q

What was the average age of claimants in 2018?

A

46 years.

18
Q

What was the average age of claimants in 2019?

A

50 years.

19
Q

What is one potential question Kamala could answer with the claims data?

A

What diagnosis and procedure codes are most costly for the company?

20
Q

What should Kamala consider regarding missing data in her analysis?

A

The implications of exclusion or imputation on data quality.

21
Q

Fill in the blank: The data set covers claims from ______ to 2023.

A

2015

22
Q

What is one limitation of the claims data set according to Kamala?

A

It does not include data on family history.

23
Q

What is the significance of the 80/20 rule in Kamala’s analysis?

A

It highlights that a small percentage of patients contribute to a large portion of costs.

24
Q

What type of data is stored in a separate table for patient identification?

A

Name, date of birth, social security number, address, and employer information.

25
What approach did Maya recommend for clients with a missing employer field?
Exclusion from the data set.
26
What alternative did Kamala suggest instead of excluding patients with missing employer data?
Identify whether they are unemployed, self-employed, or retired.
27
What analysis did Kamala ask the data science team to perform regarding missing dependents' claims?
Investigate the impact of the omission on the total reimbursement.
28
What is a more appropriate measure of central tendency when extreme values are present?
The median ## Footnote The median represents the middle value when all data points are sorted from lowest to highest.
29
How can you determine whether to use the mean or median?
Calculate both the mean and median and see if they are different. ## Footnote Substantial differences may indicate a non-symmetric distribution or the presence of outliers.
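A minimal sketch of that check, using made-up yearly claim counts (the numbers are illustrative assumptions, not Stardust data). One extreme value pulls the mean far above the median:

```python
from statistics import mean, median

# Hypothetical yearly claim counts; the last patient is an extreme value.
claims = [2, 3, 3, 4, 5, 6, 120]

claims_mean = mean(claims)      # pulled upward by the single outlier
claims_median = median(claims)  # middle value of the sorted data

print(f"mean = {claims_mean:.1f}, median = {claims_median}")
```

A gap this large between the two suggests a skewed distribution or outliers, so the median is likely the safer summary here.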
30
What is a characteristic of symmetric distributions regarding the mean and median?
The mean and median are usually close to each other. ## Footnote A bell curve is an example of a symmetric distribution.
31
In skewed distributions, how do the mean, median, and mode relate?
In right-skewed distributions, mode < median < mean; in left-skewed distributions, mode > median > mean. ## Footnote Skewness affects the relationship between these measures of central tendency.
32
What are some typical measures of variation?
Range, percentiles, interquartile range, standard deviation, confidence intervals, variance. ## Footnote Each measure is useful in different scenarios.
33
What does the interquartile range represent?
The difference between the 75th and 25th percentiles. ## Footnote It corresponds to the 'middle half' of a data set and is less affected by outliers.
34
What does a correlation coefficient of 0.4 indicate?
A moderately positive relationship. ## Footnote This suggests that the average number of claims increases with increasing age.
35
What is a continuous variable?
A quantity that can take on many possible values across a continuum. ## Footnote Examples include money, time, age, height, and weight.
36
What does a correlation coefficient of 1 or -1 indicate?
A perfect positive or negative relationship, respectively. ## Footnote For example, age has a perfect positive correlation with the date.
37
What does a weak correlation (–0.2 to 0.2) imply?
There may be no clear linear pattern between the variables. ## Footnote However, weak correlations can still be useful for developing predictive models.
38
What is the Pearson correlation coefficient used for?
To measure the strength of a linear relationship between two variables. ## Footnote It assumes the relationship is linear. Other coefficients, like Spearman, use rank order and are more robust to non-linear relationships.
39
Why is it important to visualize data in scatterplots?
To quickly identify outliers and their role in estimating correlation. ## Footnote This helps avoid misleading conclusions from data errors.
40
What is the key takeaway from exploratory data analysis?
It generates questions and hypotheses for further investigation. ## Footnote It is not designed to yield definitive findings.
41
What age groups did Kamala propose for analyzing patient claims?
Under 40, 40–65, and 65+. ## Footnote This categorization helps distinguish between Medicare-age patients, middle-aged patients, and young patients.
42
What is one limitation of using the number of preexisting conditions to measure how sick someone is?
Having more conditions doesn't necessarily mean a person is sicker. ## Footnote For instance, someone with pancreatic cancer is sicker than someone with minor allergies, despite having fewer conditions.
43
What limitations did Kamala identify regarding the proposed metric for measuring sickness?
The proposed metric has limitations due to the lack of other data on health status that could measure how sick patients are. ## Footnote This highlights the importance of being transparent about definitions and interpretations in metrics.
44
What does Kamala suggest about the use of health care services?
Older patients may utilize routine screening procedures and preventive health care more than younger patients. ## Footnote Examples include colonoscopies, mammograms, and routine PCP checkups.
45
How will health care utilization be measured in the analysis plan?
By counting procedure codes for relevant services. ## Footnote This includes screening procedures and preventive services.
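As a rough illustration, counting could look like the sketch below. The code names, the patient IDs, and the `PREVENTIVE_CODES` set are hypothetical placeholders; the real list of relevant procedure codes would come from Kamala's team.

```python
from collections import Counter

# Hypothetical placeholder codes, not real procedure codes.
PREVENTIVE_CODES = {"SCREEN-COLONOSCOPY", "SCREEN-MAMMOGRAM", "PCP-CHECKUP"}

# Hypothetical claims as (patient_id, procedure_code) pairs.
claims = [
    ("p1", "SCREEN-MAMMOGRAM"),
    ("p1", "PCP-CHECKUP"),
    ("p1", "SURGERY-KNEE"),   # not a screening or preventive service
    ("p2", "PCP-CHECKUP"),
]

# Utilization = number of claims per patient whose code is on the list.
utilization = Counter(
    patient_id for patient_id, code in claims if code in PREVENTIVE_CODES
)

print(dict(utilization))
```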
46
What key hypotheses did Kamala and the data science team aim to test?
Are older patients sicker than younger patients? ## Footnote This involves evaluating health status and health care utilization.
47
What is a null hypothesis?
A reference value in a hypothesis test that represents no effect or no difference. ## Footnote It serves as the baseline against which the quantity of interest is tested.
48
What two outputs do hypothesis tests typically have?
1. Effect size
2. p-value
## Footnote These outputs help evaluate the significance of the findings.
49
What is an effect size?
The quantity being tested in a hypothesis test, indicating the magnitude of the difference or association. ## Footnote For example, the difference in average number of claims per year between older and younger patients.
50
How is statistical power defined?
The probability of correctly rejecting a false null hypothesis when the alternative hypothesis is true. ## Footnote It indicates the test's ability to detect an effect when there is one.
51
What factors influence the power of a hypothesis test?
* Number of data points
* Magnitude of the effect size
* Significance level or p-value threshold
## Footnote More data generally increases power, as does a larger effect size.
52
What does a p-value represent?
The probability of observing data at least as extreme as the observed data, assuming the null hypothesis is true. ## Footnote It helps assess the evidence against the null hypothesis.
53
What does a low p-value indicate?
Strong evidence against the null hypothesis, suggesting a significant effect exists. ## Footnote A common threshold for statistical significance is 0.05.
54
What is an example of a question that cannot be answered by a hypothesis test?
Why do older patients tend to have more insurance claims? ## Footnote 'Why' questions do not lend themselves to hypothesis testing.
55
In the context of the coin flipping example, what is the significance of needing more data?
More data increases confidence in the estimated probability of outcomes. ## Footnote This reduces the influence of random chance on effect estimates.
56
What is the relationship between sample size and power calculations?
Sample size calculations determine the necessary amount of data to achieve a desired power level. ## Footnote This ensures sufficient data is collected before analysis.
57
What does a p-value less than 0.05 indicate?
It indicates that there is strong enough evidence to suggest that the null hypothesis is not true.
58
What is a drawback of using a p-value threshold of 0.05?
It is an arbitrary threshold.
59
In which field might the p-value threshold be p < 0.000001?
Particle physics.
60
What should be considered together when interpreting results of statistical hypothesis tests?
The effect estimate and the p-value.
61
What are the four cases for interpreting p-values and effect sizes?
* Large effect size and high p-value (not statistically significant)
* Large effect size and low p-value (statistically significant)
* Small effect size and high p-value
* Small effect size and low p-value
62
What does a high p-value with a large effect size indicate?
Low confidence in the effect estimate due to insufficient data.
63
What does a small p-value with a large effect size indicate?
High confidence that the effect is real.
64
What is the importance of data assumptions in statistical hypothesis tests?
Data must meet certain assumptions for the tests to be valid.
65
What is the outcome of interest for Kamala in her analysis?
The number of claims submitted by each patient.
66
What does modeling aim to find?
Relationships between variables.
67
What is a scatterplot used for?
To visualize the relationship between two continuous variables.
68
What does the slope of the line of best fit represent?
The change in Y for every one unit increase of X.
69
What is the equation of the line used to quantify relationships in linear regression?
Y = mX + b.
70
What is the regression coefficient?
The slope of the resulting line in linear regression.
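A small sketch of how the slope m and intercept b in Y = mX + b can be estimated by ordinary least squares. The ages and claim counts are invented for illustration, and `fit_line` is a hypothetical helper name:

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line Y = m*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The fitted line always passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: claimant age and claims filed per year.
ages = [25, 35, 45, 55, 65]
claims = [4, 7, 8, 9, 12]

m, b = fit_line(ages, claims)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
```

With these made-up numbers the regression coefficient is 0.18, read as roughly 0.18 more claims per year for each additional year of age.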
71
What is the difference between inference and prediction in modeling?
* Inference: Understanding associations among variables
* Prediction: Estimating the outcome for an individual
72
What is a key consideration when defining an outcome variable?
Whether the variable is meaningful or a proxy for the true variable of interest.
73
What can happen if the outcome variable is misspecified?
It can lead to uninterpretable results and poor decision-making.
74
What is categorization in the context of outcome variables?
Combining ranges of continuous values to form groups.
75
What is a potential downside of categorizing a variable?
It leads to loss of information.
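The sketch below shows categorization and the information it discards, using the deck's three age groups. Where exactly age 65 falls is an assumption here (it is placed in the 65+ group), and `age_group` is a hypothetical helper:

```python
def age_group(age):
    """Map a continuous age onto one of three groups."""
    if age < 40:
        return "under 40"
    if age < 65:
        return "40 to 64"
    return "65+"  # assumption: age 65 itself counts as Medicare-age


ages = [23, 41, 64, 65, 80]
groups = [age_group(a) for a in ages]
print(list(zip(ages, groups)))
```

Ages 41 and 64 end up in the same group, so a model that sees only the group can no longer tell them apart.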
76
Why might Kamala categorize the age variable?
To account for the specific population of Medicare-eligible patients.
77
True or False: Categorizing the outcome variable (claims) as fewer than fifty or fifty or more claims submitted per year is justified.
False.
78
What is the outcome variable in the context of claims analysis?
Claims, categorized as fewer than fifty claims or fifty or more claims submitted per year ## Footnote This categorization lacks a clear rationale or interpretability.
79
What are independent variables also known as?
Exposures, features, inputs, or predictors ## Footnote These are the variables quantifying the association with the outcome.
80
What is the main question in model building regarding independent variables?
Which independent variables should be included in the model?
81
What is the data-driven approach to variable selection?
Include all accessible variables and let the model determine meaningful associations ## Footnote This approach requires less thought and domain knowledge.
82
What is a disadvantage of the data-driven approach?
Possibility of false positives and results that are difficult to interpret
83
What does the hypothesis-driven approach minimize?
The risk of finding spurious associations
84
What is a key benefit of the hypothesis-driven approach?
Models are typically more interpretable ## Footnote It may have reduced predictive power due to fewer predictor variables.
85
What is a good starting point for selecting variables in a model?
List of variables believed to be associated with the outcome
86
What are some variables Kamala suggested to analyze claims?
* Diagnoses of patients
* Duration of conditions
* Chronic vs. acute conditions
* Specific insurance plan
* Demographics such as place of residence and employment status
87
What was the main finding regarding chronic diseases and claims?
Patients with chronic diseases file more claims, with older patients filing significantly more than younger patients
88
What were the coefficients for chronic diseases in younger and older patients?
4.34 for younger patients and 15.17 for older patients
89
What is a critical question to explore regarding older patients' chronic conditions?
Why do older patients’ chronic conditions lead to more claims than younger patients’ chronic conditions?
90
What are some important questions to consider in data analysis?
* Are there any outliers in the data?
* What assumptions does the hypothesis test make?
* Is the data normally distributed?
* Do we have adequate sample size?
91
What are common mistakes in statistical modeling?
* Making unrealistic assumptions
* Excluding data arbitrarily
* Picking the wrong null hypothesis
* Ignoring effect size or p-value
* Assuming correlation is causation
92
What is overfitting in the context of statistical modeling?
When a model captures noise along with the underlying pattern, performing poorly on unseen data
93
What is data dredging or p-hacking?
The practice of performing many statistical tests until a significant result is found ## Footnote This practice is misleading and unethical.
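A quick calculation shows why running many tests inflates false positives. It assumes every null hypothesis is true and the tests are independent, which is a simplification; `chance_of_false_positive` is a hypothetical helper:

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    has p < alpha when every null hypothesis is actually true."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 20):
    chance = chance_of_false_positive(num_tests)
    print(f"{num_tests:>2} tests: {chance:.0%} chance of a false positive")
```

Under these assumptions, 20 tests at the 0.05 threshold give roughly a 64 percent chance of at least one spurious "significant" result.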
94
What is the primary goal of Kamala at Stardust Health Insurance?
To ensure that the company’s patients receive the most cost-effective care possible.
95
What does cost-effective medical care produce?
Desirable outcomes at a reasonable cost.
96
What does the Pareto principle state?
A small percentage of causes often leads to a large percentage of results.
97
What percentage of reimbursements at Stardust goes to which percentage of patients?
80 percent of reimbursements go to 20 percent of patients.
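One way to check an 80/20 pattern in a data set is to rank patients by total reimbursement and compute the share going to the top 20 percent. The amounts below are invented so that the split comes out at exactly 80 percent, and `top_share` is a hypothetical helper:

```python
def top_share(amounts, top_fraction=0.20):
    """Share of the total that goes to the highest-cost fraction of patients."""
    ranked = sorted(amounts, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))  # number of top patients
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical total reimbursement per patient (10 patients).
reimbursements = [4000, 4000, 250, 250, 250, 250, 250, 250, 250, 250]

share = top_share(reimbursements)
print(f"Top 20% of patients receive {share:.0%} of reimbursements")
```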
98
What type of patients does Kamala focus on for her cost analysis?
Patients who have received surgery.
99
What is exploratory data analysis used for?
To generate key insights into the data set, including limitations and relationships between variables.
100
What is the main table in the claims database called?
The procedures table.
101
What is imputation in data analysis?
Replacing a missing value with a nonmissing value.
102
Why were certain patients excluded from the data set?
They had missing values in the 'Employer' field.
103
What potential issue did Kamala identify with excluding patients with missing employer data?
It could exclude unemployed, self-employed, or retired patients who are important for analysis.
104
What period does the claims data cover?
From 2015 to 2023.
105
What significant change occurred in 2016 that affects Kamala's analysis?
Major changes to reimbursement rates.
106
What is a common approach to handle missing data?
Exclusion or imputation.
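The two approaches can be compared on a toy example. The field and its values are hypothetical, and filling with the median is just one of several possible imputation choices, not a method taken from the source:

```python
from statistics import median

# Hypothetical ages with missing values recorded as None.
ages = [34, None, 51, 47, None, 62, 29]

# Exclusion: drop records with a missing value (the data set shrinks).
excluded = [a for a in ages if a is not None]

# Imputation: replace each missing value with a nonmissing one,
# here the median of the observed values.
fill_value = median(excluded)
imputed = [fill_value if a is None else a for a in ages]

print(f"exclusion keeps {len(excluded)} of {len(ages)} records")
print(f"imputation fills gaps with {fill_value}: {imputed}")
```

Either choice changes the data: exclusion can drop a meaningful subgroup, while imputation adds values that were never observed.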
107
What relationship did the data science team find between the age of claimants and the number of claims filed?
A positive relationship: the scatterplot shows a positive slope, so the number of claims tends to rise with claimant age.
108
What measure of central tendency can be misleading due to extreme values?
The average (mean).
109
What is a more appropriate measure of central tendency when dealing with extreme values?
The median.
110
What was the average age of claimants in 2018?
46 years.
111
What was the average age of claimants in 2019?
50 years.
112
Fill in the blank: The database does not include claims for dependents for certain employers whose contracts stipulate that data on dependents must be stored in a _______.
high-security data lake.
113
True or False: Kamala can use the claims database to analyze family history of chronic illness.
False.
114
What kind of data does the procedures table include?
* Procedure and diagnosis codes
* Date of the claim
* Patient encounter date
* Unique identifier for the provider organization
* Reimbursement amount
115
What significant increase did the data science team observe in the number of claims per claimant from 2018 to 2019?
It increased by 20 percent.
116
What is one of the important considerations when analyzing the data set for Kamala's project?
Identifying high-cost patients and their demographics.
117
What is a more appropriate measure of central tendency when dealing with extreme values?
The median ## Footnote The median is the middle value when all data points are sorted from lowest to highest.
118
What does it indicate if the mean and median are substantially different?
The distribution of the variable is not symmetric or there are outliers ## Footnote This can also suggest that the variable of interest may have extreme values or long tails.
119
What is a bell curve an example of?
A symmetric distribution ## Footnote In symmetric distributions, the mean and median are usually close to each other.
120
In a right-skewed distribution, how do the mean, median, and mode compare?
Mode < Median < Mean ## Footnote In right-skewed distributions, the mean is pulled towards the tail.
121
What are some typical measures of variation?
* Range
* Percentiles
* Interquartile range
* Standard deviation
* Confidence intervals
* Variance
## Footnote Each measure is useful in different scenarios.
122
What is the definition of percentiles?
The nth percentile value means that n percent of all data points have a value less than or equal to that value ## Footnote For example, the 50th percentile is the median.
123
What is the interquartile range?
The difference between the 75th and 25th percentiles ## Footnote It corresponds to the middle half of a data set and helps exclude outliers.
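A short sketch of the calculation with Python's standard library. The ages are hypothetical, and `statistics.quantiles` uses one of several interpolation conventions, so other tools may report slightly different quartiles:

```python
from statistics import quantiles

# Hypothetical claimant ages, including one unusually old claimant.
ages = [22, 25, 31, 38, 44, 47, 52, 58, 63, 70, 89]

# n=4 returns the three cut points: 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1

print(f"25th percentile = {q1}, median = {q2}, 75th percentile = {q3}")
print(f"interquartile range = {iqr}")
```

The range here is 67 years, driven by the oldest claimant, while the interquartile range describes only the middle half of the data.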
124
What does a correlation coefficient of 0.4 indicate?
A moderately positive relationship between two variables ## Footnote This means the average number of claims increases with increasing claimant age.
125
What is the range of values for a correlation coefficient?
Between -1 and 1 ## Footnote A coefficient of 1 or -1 indicates a perfect positive or negative relationship, respectively.
126
What type of variables are money, time, and age considered?
Continuous variables ## Footnote Continuous variables can take on many possible values across a continuum.
127
What is the difference between continuous and categorical variables?
Continuous variables can take on many values; categorical variables can take on only a few values ## Footnote An example of a categorical variable is a patient's race.
128
What does a Pearson correlation coefficient assume?
Linear relationships ## Footnote Other types of correlations, like the Spearman coefficient, look at rank order and are more robust to other relationships.
129
What is a potential issue when calculating a correlation coefficient for non-linear relationships?
It may be misleading ## Footnote For example, a U-shaped relationship could yield a weak linear correlation despite having a strong general relationship.
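The U-shaped case can be reproduced with a few lines of Python. The helper functions below are a minimal sketch written for this example (the rank helper assumes no tied values), and the data is synthetic:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank 1 for the smallest value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank orders."""
    return pearson(ranks(xs), ranks(ys))


xs = list(range(-5, 6))
u_shaped = [x ** 2 for x in xs]  # strong relationship, but not linear
cubic = [x ** 3 for x in xs]     # always increasing, but curved

print(f"U-shaped: Pearson = {pearson(xs, u_shaped):.2f}")
print(f"Cubic:    Pearson = {pearson(xs, cubic):.2f}, "
      f"Spearman = {spearman(xs, cubic):.2f}")
```

The U-shaped data yields a Pearson coefficient of zero even though Y is fully determined by X, which is why a scatterplot is worth checking alongside the coefficient.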
130
Why is it important to visualize data when analyzing relationships between variables?
To quickly identify outliers and their role in estimating correlation ## Footnote Visualizations can help prevent misleading conclusions from erroneous data points.
131
What are exploratory data analyses designed to do?
Generate questions and hypotheses for further exploration ## Footnote They are not intended to yield definitive findings.
132
What are the three working hypotheses Kamala wants to test regarding older patients?
* Are older patients sicker than younger patients?
* Do older patients use health care services more than younger patients?
* Is the relationship between age and number of claims driven by recently added employers?
## Footnote These hypotheses help focus further analyses.
133
How does Kamala define 'older' and 'younger' patients?
By splitting patients into three age groups: under 40, 40–65, and 65+ ## Footnote This categorization helps distinguish among different patient age groups.
134
What is one way to measure how sick someone is?
Looking at preexisting conditions ## Footnote Diagnosis codes can also provide insight into health status.
135
What limitation does Kamala identify regarding the proposed metric for measuring sickness?
Having more diagnoses does not necessarily mean being sicker ## Footnote For example, someone with pancreatic cancer is likely sicker than someone with benign conditions.
136
What limitation did Kamala identify regarding the proposed definition of 'sickness'?
The proposed definition had limitations due to a lack of captured data on health status that could measure how sick patients are. ## Footnote Kamala emphasized the importance of clearly stating definitions, limitations, and interpretations in metrics.
137
What solution did Maya propose to address the limitations of the health status metric?
Stick with the proposed metric while clearly stating how it is measured and interpreted. If results do not make sense, consider importing a comorbidity index from another source. ## Footnote This approach allows for the use of an imperfect metric in the absence of better data.
138
How did Kamala clarify the use of health care services in relation to older patients?
Older patients may use routine screening procedures and preventive health care more than younger patients, independent of how sick they are. ## Footnote Examples include colonoscopies, mammograms, and routine PCP checkups.
139
What types of health care utilization were Kamala and Maya interested in measuring?
Utilization of routine screening and preventive health care services, and possibly other services not indicative of the severity of a condition. ## Footnote This metric is distinct from overall health care claims.
140
What method did Kamala propose for defining the relevant services in health care utilization?
Using procedure codes for any service that falls into the category of routine screening and preventive services. ## Footnote Kamala agreed to work with her team to compile a list of relevant procedure codes.
141
What three age groups did Kamala propose for breaking up patients in the analysis?
Under 40, 40–65, and 65+. ## Footnote The age groups are essential for stratifying the data in the analysis.
142
How will health status be measured in the proposed analysis plan?
By counting the number of preexisting conditions and diagnosis codes while noting the limitations of this definition. ## Footnote This approach acknowledges that the health status metric may not be completely accurate.
143
What collaborative approach did Kamala and the data science team take in crafting their analysis plan?
They combined clinical and analytical expertise to create a plan that is computationally feasible and clinically relevant. ## Footnote This cross-functional collaboration enhanced the analysis's overall quality.
144
True or False: The data science team decided to disregard the limitations of the metrics they were using.
False. ## Footnote They acknowledged the imperfections in their metrics and planned accordingly.
145
Fill in the blank: The team decided to measure health care utilization by counting _____ for relevant services.
procedure codes. ## Footnote This method provides a structured way to define the services of interest.
146
What is a statistical hypothesis test?
A statistical hypothesis test evaluates whether a quantity of interest is meaningfully different from a reference value. ## Footnote The reference value is often referred to as the null hypothesis.
147
What is the null hypothesis in the context of height comparison?
The null hypothesis is a difference in height of 0 inches between men and women.
148
What does the p-value represent in hypothesis testing?
The p-value represents the probability of observing data at least as extreme as what was observed, assuming the null hypothesis is true.
149
What does an effect size indicate?
The effect size indicates the magnitude of the difference or association being tested in the hypothesis test.
150
True or False: A p-value of 0.003 indicates strong evidence against the null hypothesis.
True
151
What is the relationship between sample size and the power of a hypothesis test?
Increasing the number of data points increases the power of a test.
152
What is statistical power?
The probability of correctly rejecting a false null hypothesis when the alternative hypothesis is true.
153
Fill in the blank: The threshold for labeling results as statistically significant is typically set at _______.
0.05
154
What happens to the p-value when more extreme results are observed?
The p-value decreases, indicating stronger evidence against the null hypothesis.
155
What is a potential issue with relying solely on averages in effect sizes?
Averages can be misleading, as they may not represent the underlying population accurately.
156
What does a small p-value indicate about the null hypothesis?
A small p-value suggests that the null hypothesis may not be true.
157
In the coin flipping example, what was the p-value after flipping heads once?
The p-value was 0.5.
158
What is the implication of a high effect size with a high p-value?
It indicates low confidence in the effect estimate due to insufficient data.
159
How can the power of a hypothesis test be influenced?
Power is influenced by the sample size, the magnitude of the effect size, and the significance level.
160
What is meant by 'statistical significance is a spectrum'?
It means that lower p-values indicate stronger evidence against the null hypothesis, rather than a strict cutoff.
161
What does the power calculation help determine?
The necessary sample size for a hypothesis test to achieve a predetermined power.
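As an illustration, the sketch below uses a common textbook approximation for comparing two group means with a two-sided z-test and equal group sizes. The formula choice, the function name, and the inputs are assumptions made for this example, not the method from the source:

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect, std_dev, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect a difference in
    means of size `effect` (two-sided z-test, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    n = 2 * ((z_alpha + z_power) * std_dev / effect) ** 2
    return ceil(n)


# Hypothetical inputs: a difference of half a standard deviation.
n_needed = sample_size_per_group(effect=0.5, std_dev=1.0)
print(f"about {n_needed} patients per group")
```

Halving the effect size roughly quadruples the required sample, which matches the idea that smaller effects need more data to detect.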
162
What does a p-value of 0.001 suggest about the fairness of the coin?
It suggests strong evidence that the coin is rigged.
163
What should be considered when interpreting the results of hypothesis tests?
Both the effect estimate and the p-value should be considered together.
164
What does a small effect size with a high p-value imply?
There is insufficient evidence to determine the effect's significance.
165
What is the effect estimate when flipping a coin twice and getting one heads and one tails?
50 percent ## Footnote The p-value in this scenario is 0.75, indicating insufficient evidence about the effect estimate's accuracy.
166
What is the p-value when flipping a coin 10,000 times and getting 5,200 heads and 4,800 tails?
0.00003 ## Footnote This p-value suggests that the coin is likely not completely fair.
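A p-value of this size can be reproduced with a normal approximation to the binomial distribution. The sketch assumes a one-sided test (the chance of at least this many heads from a fair coin); how the source computed its figure is not stated:

```python
from math import sqrt
from statistics import NormalDist

flips = 10_000
heads = 5_200

# Null hypothesis: the coin is fair, so heads ~ Binomial(flips, 0.5).
expected_heads = flips * 0.5
std_dev = sqrt(flips * 0.5 * 0.5)
z = (heads - expected_heads) / std_dev  # standard deviations above expected

# One-sided p-value: probability of at least this many heads from a fair coin.
p_one_sided = 1 - NormalDist().cdf(z)
# Two-sided p-value: a deviation this large in either direction.
p_two_sided = 2 * p_one_sided

print(f"z = {z:.1f}, one-sided p = {p_one_sided:.5f}, "
      f"two-sided p = {p_two_sided:.5f}")
```

The effect estimate (52 percent heads) is small, but with 10,000 flips it is four standard deviations from what a fair coin would produce.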
167
What statistical assumption may some hypothesis tests require?
Data follows a bell curve or normal distribution ## Footnote Other assumptions may include equal standard deviations between groups being compared.
168
What is a scatterplot used for in statistics?
To visualize the relationship between two continuous variables ## Footnote Scatterplots display the spread of points that can indicate correlation.
169
What does the slope of a line of best fit represent?
The change in Y for every one unit increase of X ## Footnote It quantifies the relationship between two variables.
170
What does the regression coefficient indicate?
A change of m units in the Y variable for every one-unit increase in the X variable ## Footnote The regression coefficient is the slope m of the line of best fit, Y = mX + b.
171
What are the two fundamental goals of modeling?
* Inference
* Prediction
## Footnote Inference involves understanding associations, while prediction focuses on forecasting individual outcomes.
172
What is a key consideration when defining an outcome variable in statistical modeling?
Whether the variable is meaningful ## Footnote Ensuring the variable accurately represents the phenomenon of interest is crucial.
173
What is the risk of categorizing a continuous variable?
Loss of information ## Footnote Categorization should only be done when it offers substantial interpretability gains.
174
What are independent variables in a statistical model?
Variables for which we are quantifying the association with the outcome ## Footnote They can also be referred to as exposures, features, inputs, or predictors.
175
What is the difference between a data-driven approach and a hypothesis-driven approach in model building?
* Data-driven: Includes all available variables and lets the model find associations
* Hypothesis-driven: Selects variables based on prior beliefs about their associations
## Footnote The data-driven approach may lead to false positives, while the hypothesis-driven approach is typically more interpretable.
176
What is a potential disadvantage of the data-driven approach?
Possibility of false positives ## Footnote Certain variables may appear associated due to random chance rather than a true relationship.
177
What should be a starting point for variable selection in model building?
A list of variables believed to be associated with the outcome ## Footnote This reflects the hypothesis-driven approach.
178
What example did Kamala provide for a variable that may influence the number of claims filed?
Diagnoses patients have ## Footnote Certain conditions may correlate with more health system encounters.
179
What is the main hypothesis regarding chronic conditions and claims?
People with chronic conditions may have consistently more claims than people without them.
180
What variables did Kamala suggest including in the initial model?
Diagnoses, specific insurance plan, demographics (place of residence, employment status, number of dependents).
181
What was the biggest contributor to the number of claims filed per year?
The presence of chronic diseases.
182
How do chronic diseases affect claims differently in younger versus older patients?
In younger patients, the coefficient was 4.34; in older patients, it was 15.17.
183
True or False: Older patients file fewer claims compared to younger patients with chronic diseases.
False.
184
What key question arises about the difference in claims between younger and older patients?
Why do older patients’ chronic conditions lead to more claims than younger patients’ chronic conditions?
185
What is a key consideration when assessing outliers in data?
Outliers can affect the results.
186
What assumption should be checked regarding the hypothesis test?
Whether the assumptions the test makes are valid for the data.
187
What is necessary to determine if the data is suitable for analysis?
Whether the data is normally distributed.
188
What is a potential issue if the sample size is inadequate?
It could lead to a nonsignificant p-value or an underpowered analysis.
189
What is the difference between univariable and multivariable analysis?
Univariable analysis examines one variable; multivariable analysis examines multiple variables.
190
What is a common mistake when interpreting statistical results?
Interpreting only the p-value and not the effect size.
191
Fill in the blank: A __________ is a variable that influences both the dependent variable and independent variable.
confounder
192
What does overfitting refer to in statistical modeling?
When a model captures noise along with the underlying pattern in the data.
193
What is data dredging or p-hacking?
Performing many statistical tests until a significant result is found.
194
What should be justified when excluding data from analysis?
The reason for the exclusion. It needs to be justified and understood, not arbitrary.
195
What does a small p-value indicate?
The results are statistically significant.
196
What is the risk of cherry-picking results in analysis?
Leads to a biased representation of the analysis.
197
What is the importance of effect size in statistical analysis?
It gives an idea of the magnitude or importance of the effect.
198
True or False: Correlation implies causation.
False.
199
What should be considered regarding missing data?
How was missing data dealt with? Does imputation/exclusion make sense?