Sales Anomaly Analysis - Data Project

  • ankitmorajkar
  • Nov 20
  • 3 min read

Data rarely tells you the truth at first glance. It whispers, it hints, and sometimes, if you’re lucky, it screams.


Recently, I tackled a data project from Stratascratch with a dataset involving 50 weeks of timestamped sales data (a recruitment case study from 23andMe). On the surface, the task was simple: "Analyze the sales." But as I dug into the files, I found a story about statistical anomalies, false leads, and the importance of rigorous testing.


Here is how I went from raw data to actionable business insights using Python and Pandas, and, remarkably, without writing a single line of code myself, thanks to the new Google Antigravity workflow.


1. The Discovery: The Step-Change


I started by merging 50 separate weekly CSV files into a single pandas DataFrame to visualize the entire year’s performance. Usually, sales data is noisy; it goes up and down with the weather or the weekend.


But when I plotted the daily sales, I didn't just see noise. I saw a cliff.
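The merge-and-plot step can be sketched as below. The file pattern and the column names (`timestamp`, `sales`) are my assumptions, not the actual dataset schema:

```python
import glob
import pandas as pd

def load_weekly_sales(pattern: str = "data/week_*.csv") -> pd.DataFrame:
    # Read each weekly CSV and stack them into one DataFrame.
    frames = [
        pd.read_csv(path, parse_dates=["timestamp"])
        for path in sorted(glob.glob(pattern))
    ]
    return pd.concat(frames, ignore_index=True)

def daily_totals(df: pd.DataFrame) -> pd.Series:
    # Collapse timestamped transactions into one total per calendar day,
    # which is the series that gets plotted to reveal the step-change.
    return df.set_index("timestamp")["sales"].resample("D").sum()
```

Calling `daily_totals(load_weekly_sales()).plot()` then produces the full-year line chart.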

On April 29, 2013, the average daily sales didn't just grow; they exploded. The mean jumped from ~504 units per day to ~703 units per day overnight.

In the business world, graphs like this are suspicious. Was it a data error? A holiday? Or a genuine shift in business strategy? I needed to prove this wasn't just a fluke.


2. The "Sanity Check" (Statistical Significance)


Visuals are great, but they can be deceiving. To be certain this was a fundamental shift in the business and not just a lucky week, I ran an A/B test (specifically, an independent two-sample t-test).

I split the data into two groups:

  • Group A: All days before April 29.

  • Group B: All days after April 29.


The Null Hypothesis (H_0) was that the sales volume for both periods was essentially the same.
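As a sketch, the split and the test look like this with SciPy. The variable names, and the choice to put April 29 itself in the "after" group, are my assumptions:

```python
import pandas as pd
from scipy import stats

def step_change_test(daily: pd.Series, change_date: str = "2013-04-29"):
    """Independent two-sample t-test around a candidate change point.

    `daily` is a Series of daily sales totals indexed by date.
    Returns the t-statistic and two-sided p-value.
    """
    cutoff = pd.Timestamp(change_date)
    before = daily[daily.index < cutoff]   # Group A
    after = daily[daily.index >= cutoff]   # Group B
    return stats.ttest_ind(before, after)
```

Note that `ttest_ind` assumes equal variances by default; passing `equal_var=False` would give the more conservative Welch variant.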


The Result?

The T-statistic was massive (-45.94), and the P-value was 3.49e-138.


In plain English: The probability that this jump happened by random chance is effectively zero. This gave me the mathematical confidence to tell stakeholders:

"Something changed on April 29, and it’s here to stay."


3. Chasing False Leads: The Gender Hypothesis


Once you know what happened, the next question is why.

My dataset included purchaser gender. My initial theory was that the company might have launched a marketing campaign targeting a specific demographic, driving the new volume.

I analyzed the ratio of Female vs. Male buyers over time.

  • Observation: The percentage of female buyers did drop significantly (from ~65% to ~40%) over the year.

  • The Verdict: However, when I overlaid this trend with the sales spike, they didn't match. The demographic shift was slow and gradual, whereas the sales spike was instant.


Correlation does not equal causation. By plotting these side-by-side, I was able to rule out demographic shifts as the primary driver of the April 29th surge.
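A weekly ratio is enough to make that comparison. A minimal sketch, assuming a `timestamp` column and a `gender` column coded `'F'`/`'M'` (both column names and codes are guesses on my part):

```python
import pandas as pd

def female_share_by_week(df: pd.DataFrame) -> pd.Series:
    # Weekly fraction of purchases made by female buyers. Overlaying this
    # series with daily sales shows whether the two trends move together.
    gender = df.set_index("timestamp")["gender"]
    return gender.resample("W").apply(lambda g: float((g == "F").mean()))
```

Plotting this series next to the daily totals makes the mismatch obvious: one curve drifts over months, the other jumps in a single day.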


4. Operational Insights: When do we staff?


Moving away from the anomaly, I wanted to provide operational value. If sales are up, when do we need staff on the floor?

I broke the day down into four "dayparts" (Morning, Afternoon, Evening, Night). The data revealed a clear hierarchy:

  • Afternoon (12 PM - 6 PM): 39.4% of total sales.

  • Morning: 30.8%

  • Evening: 20.9%

  • Night: the remaining ~8.9%


This simple aggregation is gold for a store manager. Nearly 70% of all business happens before 6 PM. If you are stocking shelves or running breaks during the afternoon rush, you are losing money.
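The daypart aggregation can be sketched as follows. The exact hour boundaries are my assumption (6-hour bins, with Afternoon matching the 12 PM - 6 PM window above):

```python
import pandas as pd

def daypart_share(df: pd.DataFrame) -> pd.Series:
    # Bucket each transaction by hour of day, then express each daypart's
    # sales as a percentage of the overall total.
    bins = [0, 6, 12, 18, 24]  # [0,6), [6,12), [12,18), [18,24)
    labels = ["Night", "Morning", "Afternoon", "Evening"]
    parts = pd.cut(df["timestamp"].dt.hour, bins=bins, labels=labels, right=False)
    totals = df.groupby(parts, observed=False)["sales"].sum()
    return 100 * totals / totals.sum()
```

The same pattern extends to any staffing question: swap the bins for shift boundaries and the percentages tell you where the demand is.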


5. The Tech Stack (Powered by Antigravity)


For this project, I used an Agentic Workflow. Instead of manually writing every line of code, I utilized Google Antigravity to orchestrate the environment.

  • Core: Python 3.12, Pandas, scikit-learn.

  • Stats: SciPy (for the t-test).

  • Workflow: I acted as the architect, defining the questions ("Check the p-value," "Plot the distribution"), while the AI agent handled the syntax and library management. This allowed me to focus 100% on the analysis rather than debugging imports.


Conclusion


This project reinforced a key lesson in data science: Always validate the visual. A line going up looks good on a slide, but a p-value on the order of 1e-138 gives you the confidence to make business decisions based on it.


You can check out the full code and the charts on my GitHub: https://github.com/ankit-morajkar/stratascratch-23andme-sales-data-analysis
