I once encountered this in the real world as a data analyst a long time ago. I was working at an e-commerce company called The Hut Group, and all year our marketing team had been saying that our marketing cost of goods sold (the percentage of our revenue we needed to spend on marketing) had been declining across every product category. But at year end, the execs were shocked to realize that our marketing cost of goods sold had almost doubled, from 10% to nearly 20%.
The finance team had asked me to double-check the marketing team's numbers, to see if there'd been some funny math in the reporting. But the marketing team were totally right: marketing spend across the three main categories - games, beauty, and nutrition - had fallen in every one (~15% to ~10%, ~30% to ~25%, and ~50% to ~30% respectively). However, the mix of these product categories had shifted massively, with nutrition growing from roughly 10% of our total sales to nearly 50%.
On net, that meant that while the marketing team had become more cost-efficient at selling every individual product category, growth in nutrition had vastly outstripped growth in every other category, and since nutrition was the most expensive category to market, the aggregate marketing cost percentage had gone up even though every category had improved. I then had the fun job of explaining the Yule-Simpson paradox to a bunch of accountants.
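A toy version of the arithmetic shows the mechanism. The per-category rates below roughly match the story, but the revenue shares (other than nutrition's 10% -> 50% shift) are made up, so the blended numbers won't reproduce the actual 10% -> 20% swing - only the direction:

    # Toy illustration: every category's marketing-cost ratio falls, yet the
    # blended ratio rises because the sales mix shifts toward the most
    # expensive category. Revenue shares are assumptions, not THG's figures.

    def blended_rate(mix, rates):
        """Weighted average of per-category rates, weighted by revenue share."""
        return sum(mix[c] * rates[c] for c in mix)

    rates_before = {"games": 0.15, "beauty": 0.30, "nutrition": 0.50}
    rates_after  = {"games": 0.10, "beauty": 0.25, "nutrition": 0.30}  # all improved

    mix_before = {"games": 0.70, "beauty": 0.20, "nutrition": 0.10}  # nutrition ~10% of sales
    mix_after  = {"games": 0.30, "beauty": 0.20, "nutrition": 0.50}  # nutrition ~50% of sales

    print(f"before: {blended_rate(mix_before, rates_before):.1%}")  # 21.5%
    print(f"after:  {blended_rate(mix_after,  rates_after):.1%}")   # 23.0%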
Pretty much every dataset I work with as an SRE is full of these paradoxes. One classic published example comes from Google:
A network engineer took a trip to Indonesia or something (can't find the citation to confirm the exact tale), noticed the service was slow, and when asking around everyone said "that's how it's always been." Basically the local cellular networks are slow and the off-island fiber connections are saturated. Back at the office, they decide to attack the problem by optimizing payload sizes, do the work (cutting download sizes in half), and ship it. Latency metrics? Average and p95 latency actually increased after shipping the work to production.
How does an objectively good change make things worse? Well, the service had improved so much for those customers that they used it a lot more. Even with the lighter demand on bandwidth, network latency to the datacenter was still worse than for typical US customers, so as more of these people realized the service sucked way less, they used it more and drove the aggregate numbers up.
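A sketch with illustrative numbers (not the real traffic data) of how every region can get faster while the global average gets slower, purely because the traffic mix shifts:

    # Latency improves in every region, but usage surges in the high-latency
    # region, so the weighted global average gets worse. Numbers are made up.

    def weighted_avg_latency(traffic_share, latency_ms):
        return sum(traffic_share[r] * latency_ms[r] for r in traffic_share)

    latency_before = {"low_latency_regions": 200, "high_latency_regions": 3000}
    latency_after  = {"low_latency_regions": 150, "high_latency_regions": 2000}  # both improved

    traffic_before = {"low_latency_regions": 0.95, "high_latency_regions": 0.05}
    traffic_after  = {"low_latency_regions": 0.70, "high_latency_regions": 0.30}  # usage surged

    print(weighted_avg_latency(traffic_before, latency_before))  # 340 ms
    print(weighted_avg_latency(traffic_after,  latency_after))   # 705 ms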
I have tons of these examples where a data team looks at a particular slice of request telemetry and comes to the wrong conclusion because they didn't model enough of the system, or controlled for the wrong (or too many) variables. The worst ones are the cyclic finger-pointing situations that Simpson's paradox can produce: app developers blaming a regression on the server-side component while the server team blames the app team, often because the server and app release schedules accidentally aligned too well. In this case we have canary data to exonerate our side of the equation, but sometimes the problem lies somewhere even deeper, like an update to an entirely different app.
But your example isn't a case of Simpson's Paradox (which is purely statistical), but Jevons Paradox (which is about human behaviour and economics).
If I recall the youtube slow-internet optimisation case correctly, I think it is an example of Simpson's paradox. They made it faster for countries with fast internet, and faster for countries with slow internet, and yet the average performance across all users/countries got slower, because now the countries with slow internet used youtube much more than before.
But the improvement induced the demand, which to my mind makes this different from Simpson's Paradox.
Doesn't matter. That is not relevant to the paradox.
How does "Average and p95 latency actually increased after shipping the work to production. How does an objectively good change make things worse?" relate to Simpson's paradox again?
That's exactly it. After "shipping the work to production" (making it faster for everybody), the overall average and p95 got worse. Each sub-population experienced an improvement: countries with fast internet got faster youtube, and countries with slow internet got faster youtube. But the overall average and p95 got worse, because now more users from the second sub-population bring the overall average speed down (and the latency up). That's Simpson's paradox.
Ah, you may be right. It's not clear from the story whether "Average and p95 latency actually increased after shipping the work to production." means the average across Indonesia and everywhere else, or just the Indonesian average.
I would say the improvement allowed the demand to be met: everybody wanted to use youtube, but few could.
Just like many people may want to eat a wide range of expensive tasty food, but have to make do with junk because it's what they can afford.
It would be Simpson's Paradox if Google services in Indonesia were initially slow because Indonesians tend to use YouTube more often than lighter services.
There wasn't an error in the conclusions of the initial measurement. It was the solution that had problems.
Good point! I'm just a humble Linux sysadmin dubbed "SRE" who slept through Stats for Engineers and now pays the price every week dealing with SWEs eager to blame me for their mistakes.
You were right; that was a case of Simpson's paradox. Every category saw a latency improvement, but the overall statistic worsened. Jevons paradox is what caused the induced demand, but once the new usage data was gathered, the initial review became an example of Simpson's paradox.
Effect of the change -> Jevons paradox.
Measurement of the Jevons effect -> Simpson's paradox (in this case; that isn't a general rule).
The fact that the two are easily linked is one of the reasons the statistical paradox is so common in practice.
Latency improved for everyone, but overall average latency increased because usage increased faster in high latency areas. That's Simpson's Paradox. Simpson's Paradox doesn't care where the subpopulations you're measuring came from.
Isn't that the "One More Lane, I Promise!" meme?
It is, but usually the meme misrepresents induced demand. While I don't like cars and we should focus on other infrastructure, adding a lane does help.
It does not reduce congestion, but it does serve more people at the same congestion level. And those people have come from somewhere. Sometimes from public transport, which isn't great, but sometimes from some backwater road.
The bigger problem with induced demand is that it's often poor ROI to add that lane where the demand is highest.
That is, imagine you have a big city. You can add capacity for 1m extra people to travel to the city centre, where there's lots of congestion. Or you can find ways to induce demand around the outer limits of town, even though current demand is low there.
Odds are you'll pick the first, because it's "obvious" and doesn't require much thinking to see that it'd help. But we really ought to look at the cost-benefit of the second option too, because repeatedly inducing demand in the centre keeps driving up the incremental cost of further improvements, along with plenty of other undesirable second-order effects.
Adding lanes is like getting a bigger cache with the same throughput.
It's obvious at the supermarket: what goes faster, a single cashier processing four short lanes of 10 people with round robin, or two cashiers processing a single lane with 40 people?
Is the city center able to process 1m extra people? If not, it doesn't matter how many lanes you build.
Well, you often can make it able to "process" 1m extra people: you can build overpasses, and tunnels, and taller buildings. But the cost per extra person will tend to go up accordingly, to the point where you'd be better off spending that money attracting people out of the centre instead.
E.g. London's "Crossrail" / Elizabeth line cost $24 billion. Granted, it also allows some people to travel through London faster, but I can't help but wonder what that money could've done if applied to attracting businesses out of the centre instead: upgrading links between towns on the outskirts, upgrading town centres, and generally trying to make it more attractive for businesses to locate further out.
Given the extraordinary costs it takes to do large infrastructure projects in London, I'd be very surprised if you couldn't get a higher return on investment that way, or by investing similar sums elsewhere in the UK entirely.
Until more people choose to live further away because the commute is now tolerable with the extra lane (and it's cheaper), and then you're back to square one.
This reminds me of a similar story with YouTube [1] where reducing the page weight made the metrics worse, because more people with lower-end connections could finally access the page.
Metrics interpretation is as important as the metrics themselves!
[1]: https://blog.chriszacharias.com/page-weight-matters
That may be exactly the story I was thinking of, or perhaps the original of a story I encountered in a GCP blog post or something.
every time I hear about examples of Simpson's paradox in practice, I don't get what lesson to learn
marketing team overoptimized, so non-nutrition demand fell?
drop nutrition from the product line, so that you're efficient both in the products you do sell and overall?
these metrics are insufficient and it's better to look at gross change rather than ratios?
I have no idea
I think the last one is closest. I’d go with: “finance team should look at the gross change”, if that’s what matters for them.
Maybe the lesson is to analyze different business units (product categories?) independently first, then the whole.
IME, the "problem" (to the extent there is one) is almost always that the naïvely-chosen KPI metric wasn't specific enough.
Here's a recent example from a friend. You're a SaaS company, and your home page's load time is reported as slow. You set your KPI for the quarter to be "reduce p99 load time of the home page by 50%".
The load time is a function of customer size, so bigger customers = slower home page. It's actually a quadratic function. So the p99 of small customers is like the p50 of large customers. You have 20 small customers and 20 big customers.
That quarter, the sales team onboards 10 new tiny customers, and 10 big customers churn. It's the holiday season in your big customers' geo, so mostly small customers are using the platform. It's the busiest time of year for the small customers, so they're over-using the platform.
All these factors lead to p99 latency dropping by 60%, smashing the KPI goal. Bonuses all around, pats on the back. And no code changes needed, besides!
The solution is: choose a KPI that is tightly coupled to your problem, and not confounded with other variables.
In the above case, a better KPI would have been "p99 latency for large customers", because it is robust to the distribution of customer sizes across current users, churned users, and seasonal differences in usage.
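A hypothetical sketch of the pitfall above. None of these numbers come from the original story; the assumed latency model (small customers load in ~1s, big ones in ~4s) and usage figures are just there to show how a pooled p99 can "improve" from mix shift alone:

    # The per-segment latencies never change, only the customer mix and usage
    # do, yet the pooled p99 drops sharply with zero code changes.
    import numpy as np

    rng = np.random.default_rng(0)

    def pooled_p99(n_small, n_big, reqs_small, reqs_big):
        # Assumed latency model (illustrative only): small ~1s, big ~4s.
        small = rng.normal(1.0, 0.2, size=n_small * reqs_small)
        big   = rng.normal(4.0, 0.8, size=n_big * reqs_big)
        return np.percentile(np.concatenate([small, big]), 99)

    # Start of quarter: 20 small + 20 big customers, similar usage.
    start = pooled_p99(n_small=20, n_big=20, reqs_small=500, reqs_big=500)

    # End of quarter: 10 big customers churned, 10 tiny ones onboarded,
    # big customers on holiday (light usage), small customers in peak season.
    end = pooled_p99(n_small=30, n_big=10, reqs_small=1500, reqs_big=20)

    print(f"pooled home-page p99: {start:.2f}s -> {end:.2f}s")  # ~5.6s -> ~1.5s

Computing the same p99 over large customers only would stay roughly flat across both scenarios, which is why segmenting the KPI makes it robust to shifts in customer mix and seasonal usage.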
The article suggests an answer to your question; see the last sentence of the introduction:
"its lesson "isn't really to tell us which viewpoint to take but to insist that we keep both the parts and the whole in mind at once."
In the case above, they failed at "keeping the parts in mind", as clearly the shifting ratios between the different products were crucial.
It’s actually surprisingly common. You can even find it in “classical” toy datasets like Iris: https://github.com/DataForScience/Causality/blob/master/1.2%...
Covid vaccination rates and deaths were rather famously subject to it. E.g. some combination of stats like "most covid deaths were vaccinated individuals", "vaccination reduces death rate", and "the population segment with the lowest vaccination rates has the lowest covid death rates" were all true at the same time.
Those aren't examples of Simpson's even taken together, but there was a famous (by which I mean it got a lot of press, including being written up in the Times and Post when it came out) study that showed that although every subgroup in the Italian demographic data had a lower CFR than its Chinese counterpart, the Chinese group had a lower CFR when taken as a whole:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8791436/
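A toy set of numbers (not the actual figures from the linked study) shows how that can happen when one country's cases skew toward older patients:

    # Illustrative only: each age group has a lower CFR in country A than in
    # country B, yet country A's overall CFR is higher because its cases skew
    # toward the high-fatality age group.

    cases_a  = {"under_60": 2_000, "over_60": 8_000}
    deaths_a = {"under_60": 20,    "over_60": 1_200}   # CFRs: 1.0%, 15.0%

    cases_b  = {"under_60": 8_000, "over_60": 2_000}
    deaths_b = {"under_60": 120,   "over_60": 340}     # CFRs: 1.5%, 17.0%

    for name, cases, deaths in [("A", cases_a, deaths_a), ("B", cases_b, deaths_b)]:
        per_group = {g: f"{deaths[g] / cases[g]:.1%}" for g in cases}
        overall = sum(deaths.values()) / sum(cases.values())
        print(f"country {name}: per-group CFRs {per_group}, overall CFR {overall:.1%}")
        # country A: lower in each group, but overall ~12.2% vs ~4.6% for B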
it's shocking that product mix wasn't a slide in the reporting
but marketing selects for positivity, not objectivity
the facts and only the facts that support what they do
I thought it was pretty common to apply mixed / hierarchical linear models? I didn't study statistics, but in our field, where many problems involve modelling biological effects, we would do that.
E.g. https://www.pymc.io/projects/examples/en/latest/generalized_...