Hospitalization Hypothesis Testing - Data Project

ankitmorajkar
Nov 19, 2025

In the data-driven world of modern healthcare, efficiency is not just about operational speed. It is about accurately understanding risk. For organizations like Apollo Hospitals, the ability to predict hospitalization charges based on patient demographics and clinical metrics is a critical component of financial stability and resource allocation.

I am currently working with a dataset from StrataScratch to simulate this exact business scenario. My objective is clear: I need to extract meaningful insights from 1,338 anonymized patient records to help the hospital improve its pricing models and understand the spread of health risks.

Rather than relying on intuition or simple averages, I apply rigorous inferential statistics to validate my findings. I use Microsoft Excel to perform T-Tests, ANOVA, and Chi-Square analysis to determine exactly which variables drive costs and which ones are merely noise.

The Data Landscape

The dataset provides a snapshot of patient profiles. I have access to variables including Age, Sex, Region, Smoking Status, Viral Load, Severity Level, and Hospitalization Charges.

My initial descriptive analysis reveals a right-skewed distribution for charges. Most patients incur standard costs, but a significant "long tail" of patients faces massive bills. My goal is to understand what puts a patient in that high-cost category.
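As a rough illustration of how right skew can be checked numerically, here is a minimal Python sketch. The original analysis was done in Excel, so this is not the method used above; the `charges` list is made-up toy data and `sample_skewness` is a helper name introduced only for this example.

```python
from statistics import mean, median

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy charges (made-up numbers, NOT the real dataset)
charges = [12000, 15000, 18000, 20000, 22000, 25000, 30000, 95000, 140000]

skew = sample_skewness(charges)
# A mean well above the median and a positive skewness both point to a long right tail
print(f"mean={mean(charges):.0f} median={median(charges):.0f} skewness={skew:.2f}")
```

A positive skewness value, together with a mean that sits above the median, is the numeric signature of the "long tail" described above.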

1. The Financial Impact of Smoking

My most immediate observation is the disparity in costs between smokers and non-smokers. When I look at the raw averages, the difference seems massive. However, in data science, "seems" is not enough. I need to prove that this difference is statistically significant and not a result of random chance.

I notice that the variance in charges for smokers is roughly double that of non-smokers. Because of this unequal spread, a standard Student’s T-Test, which pools the two variances, is not appropriate. Instead, I use Welch’s T-Test, which does not assume equal variances.

The results are definitive. The P-value is effectively zero. This confirms that the difference in means is highly significant.
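For readers who want to see what Welch's test computes, here is a minimal Python sketch of the test statistic and the Welch-Satterthwaite degrees of freedom. It is an illustration under assumptions, not the Excel workbook used above: the two lists are made-up toy data and `welch_t` is a name introduced for this example. In Excel, the equivalent is `T.TEST` with the type argument set to 3 (two-sample, unequal variance).

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard error of each group mean (sample variance / n)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Toy charges (made-up numbers, NOT the real dataset)
smokers = [60000, 85000, 72000, 110000, 95000, 58000]
non_smokers = [18000, 22000, 25000, 19000, 21000, 24000, 20000]

t_stat, dof = welch_t(smokers, non_smokers)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The P-value is then read from a t distribution with `dof` degrees of freedom. Because the variances are not pooled, `dof` is usually fractional and smaller than it would be under Student's test.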

The business implications are substantial. I find that an average smoker costs the hospital system approximately 80,125 rupees, compared to just 21,085 rupees for a non-smoker. This roughly 3.8x multiplier makes smoking status the strongest predictor of cost among the variables I test.

2. The Myth of Viral Load

Next, I turn my attention to clinical metrics. It is easy to assume that a higher "Viral Load" (the amount of virus in the blood) correlates directly with a more severe, and therefore more expensive, hospital stay.

To test this, I run a Pearson Correlation analysis. I am looking for a strong linear relationship between viral load and total charges.

The resulting correlation coefficient is 0.198.

This is a crucial finding because it challenges a common assumption. The trend is positive but weak: squaring the coefficient gives roughly 0.04, so viral load accounts for only about 4% of the variance in charges under a linear fit. Viral load alone is a poor predictor of costs. If the hospital plans to build a predictive pricing model, I recommend they deprioritize this variable in favor of lifestyle factors like smoking or BMI.
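To make the correlation figure concrete, here is a small Python sketch of the Pearson coefficient and what a value of 0.198 implies once squared. The paired lists are made-up toy data and `pearson_r` is a helper name introduced for this example; only the 0.198 figure comes from the analysis above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy pairs (made-up numbers, NOT the real dataset)
viral_load = [8.5, 9.1, 10.2, 11.0, 12.4, 9.8, 10.7]
charges = [21000, 64000, 18000, 25000, 30000, 72000, 19000]
r_toy = pearson_r(viral_load, charges)

# The coefficient reported in the article, squared to get the share of
# variance in charges that a straight-line fit on viral load would explain
r_article = 0.198
r_squared = r_article ** 2
print(f"toy r = {r_toy:.3f}, article r^2 = {r_squared:.3f}")
```

Squaring the coefficient is a quick way to translate "weak correlation" into business terms: a linear model on viral load alone would leave most of the variation in charges unexplained.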

3. The Severity Level Paradox

The dataset categorizes patients into severity levels ranging from 0 to 5. Logic dictates that costs should rise in steps as severity increases. To validate this across multiple groups simultaneously, I employ an ANOVA (Analysis of Variance) test.

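The ANOVA returns a P-value of 0.0058, which is below the 0.05 threshold, so I reject the null hypothesis that all severity levels share the same mean charge. Statistically, average cost differs across at least some severity levels, although ANOVA alone does not say which ones.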
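The F statistic behind a one-way ANOVA can be sketched in a few lines of Python. This is an illustration only: the three groups are made-up toy data (the real dataset has six severity levels), and `one_way_anova` is a name introduced for this example rather than the Excel tool used above.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F statistic, df_between, df_within) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy charges per severity level (made-up numbers, NOT the real dataset)
groups = [
    [20000, 24000, 22000, 26000],
    [30000, 34000, 28000, 36000],
    [45000, 52000, 41000, 48000],
]
f_stat, df_b, df_w = one_way_anova(groups)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

The P-value comes from comparing `f_stat` to an F distribution with `df_b` and `df_w` degrees of freedom. A large F means the group means differ by more than the within-group noise would suggest.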

However, when I dig deeper into the group means, I find that the pattern is not monotonic. Patients at Severity Levels 4 and 5 actually show lower average costs than those at Level 3.

This is where domain knowledge and critical thinking come into play. I check the sample sizes and see that Level 4 and Level 5 have only 25 and 18 patients respectively. With groups this small, the group means are noisy, so the cost drop is more likely a data artifact or survivorship bias than a true operational insight. I note this as a limitation and a specific area where more data is needed before making policy changes.
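A sample-size check like this one can be sketched as a per-group summary that reports the count, the mean, and the standard error of the mean, and flags thin groups. Everything here is illustrative: the records are made-up toy data, and `MIN_GROUP_SIZE` is a hypothetical threshold chosen small enough to suit the toy data, not a value used in the analysis.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev

MIN_GROUP_SIZE = 5  # hypothetical cut-off, chosen only for this toy example

# (severity level, charge) pairs; made-up numbers, NOT the real dataset
records = [
    (3, 52000), (3, 61000), (3, 48000), (3, 57000), (3, 66000), (3, 50000),
    (4, 41000), (4, 73000), (4, 30000),
    (5, 38000), (5, 69000),
]

by_level = defaultdict(list)
for level, charge in records:
    by_level[level].append(charge)

summary = {}
for level, values in sorted(by_level.items()):
    n = len(values)
    std_err = stdev(values) / sqrt(n)  # standard error of the group mean
    summary[level] = (n, mean(values), std_err, n < MIN_GROUP_SIZE)

for level, (n, avg, std_err, too_small) in summary.items():
    flag = "  <- small group, treat mean with caution" if too_small else ""
    print(f"level {level}: n={n} mean={avg:.0f} se={std_err:.0f}{flag}")
```

The standard error shrinks with the square root of the group size, which is why means computed from a couple of dozen patients can swing widely.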

4. Regional Demographics and Risk

Finally, I investigate the geographic spread of risk. Apollo Hospitals wants to know if specific regions, such as the Southwest or Northeast, carry different risk profiles regarding smoking.

I construct a contingency table and perform a Chi-Square Test of Independence. I am testing to see if smoking status depends on the region.

The P-value comes back at 0.057.

This is a borderline result. Since it is slightly above the standard 0.05 significance threshold, I fail to reject the null hypothesis. This does not prove that smoking habits are identical across the four regions; it means the data do not show a statistically significant regional difference at the 5% level. From a strategy perspective, this suggests that region-specific anti-smoking campaigns are not clearly justified by this data, and a unified national strategy is a reasonable default.
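The mechanics of the Chi-Square Test of Independence can be sketched in plain Python. This is a sketch under assumptions, not the Excel workflow used above: the contingency table is made-up toy data, and both function names are introduced for this example. The P-value helper uses a closed form that is valid only for 3 degrees of freedom, which is what a table of 2 smoking statuses by 4 regions produces.

```python
from math import erfc, exp, pi, sqrt

def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

def chi2_sf_df3(x):
    """Upper-tail P-value for a chi-square variable with exactly 3 df."""
    return erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)

# Toy counts (made-up, NOT the real dataset)
# rows: smokers, non-smokers; columns: four regions
table = [
    [30, 25, 40, 28],
    [120, 130, 115, 125],
]
chi2, dof = chi_square_independence(table)
p_value = chi2_sf_df3(chi2)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.3f}")
```

For 3 degrees of freedom the 5% critical value is about 7.815, so a statistic just below that figure yields a P-value just above 0.05, which is the kind of borderline result described above.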

Conclusion and Strategic Recommendations

Through this analysis, I move beyond simple reporting and provide actionable intelligence. Based on the data, I propose the following strategies for Apollo Hospitals:

Risk-Adjusted Pricing: The hospital should implement a tiered premium model. The roughly 3.8x gap in average cost between smokers and non-smokers (about 280% higher) justifies a significant financial adjustment.
Model Optimization: Predictive models for revenue should weigh lifestyle choices heavily while treating viral load as a secondary or tertiary feature.
Data Quality Initiatives: We need to actively gather more data on high-severity cases to correct the potential bias observed in the ANOVA results.

This project reinforces the value of statistical rigor. By looking past the averages and testing our assumptions, we transform raw healthcare data into a roadmap for financial efficiency.

GitHub