The Problem With Raw Reliability Rankings
If you look at raw MOT pass rates for, say, the Ford Focus by manufacture year, older years score badly. A 2003 Focus averages about a 68% pass rate. A 2018 Focus averages about 83%. That 15-point gap looks like a manufacturing quality story — and partly it is. But it's also a mileage story. Every 2003 Focus has had 20+ years to accumulate tests across its entire life, including the high-mileage years when everything starts to fail. The 2018 Focus has mostly been tested while it was still relatively new.
You're not comparing like with like. You're comparing a 20-year-old car's full MOT history against a 6-year-old car's early MOT history.
The question used-car buyers actually want answered is: "If I buy a 2003 Focus with 80,000 miles on the clock, is that more or less reliable than a 2018 Focus with 80,000 miles?" Raw annual pass rates don't answer that.
The gap shrinks from 15 points to 11 — not eliminated, but you're finally comparing equivalent cars at equivalent points in their lives.
To answer it properly, you need to compare each manufacture year at equivalent points in its life — at the same mileage. That's mileage normalisation.
What We Built
CarHunch now has standalone best year to buy pages for around 7,300 car models — every model with enough data to make meaningful comparisons. Each page shows:
- An overall year ranking with MOT pass rates, verdict badges (Recommended / Caution) relative to that model's own median — not an arbitrary absolute threshold
- A mileage-normalised chart: one line per manufacture year, pass rate on the Y axis, mileage band on the X axis — so you can see how different vintages perform at equivalent points in their lives
- Severity weighting — accounting for whether the failures recorded were minor advisories or serious defects
- FAQPage structured data so Google can surface direct answers to "what is the best year for a Volvo V70?"
The hub pages at /vehicle/{make}/{model}/ already had a "most reliable year" chart (Phase 1 of this work). Phase 2 added the mileage-normalised layer, which required building a new precomputed derived table from the raw data.
The Data
The source is the DVLA's national MOT dataset — around 500 million test records across approximately 50 million unique vehicles, stored as Parquet files and queried via DuckDB. We also maintain a derived table, mot_history_by_registration, which pre-aggregates per-vehicle MOT statistics (avg mileage, pass rate, test count) and is much smaller at ~14GB vs 56GB for the canonical tests table.
The defect severity layer comes from our own defect_classifications table — 1.82 billion rows mapping each MOT failure item to a severity multiplier via a taxonomy we built. This is the layer that distinguishes "wiper blade advisory" from "brake failure".
The Naive Approach (And Why It Explodes)
The conceptually simple query would be:
SELECT
slug_make,
slug_model,
manufacture_year,
CASE WHEN odometer_value < 30000 THEN '0-30k'
WHEN odometer_value < 60000 THEN '30-60k'
...
END AS mileage_band,
AVG(test_result = 'PASSED') AS pass_rate,
AVG(severity_multiplier) AS avg_severity
FROM mot_tests t
JOIN vehicles v ON t.registration = v.registration
JOIN defect_classifications d ON t.test_id = d.test_id
GROUP BY 1, 2, 3, 4
This fails in two ways on our hardware:
1. Canonical scan OOM. The canonical mot_tests table is 56GB. Joining it against vehicles (another large table) requires materialising a large intermediate result. On our single-server setup with shared RAM between the API, Redis, and the worker process, this OOM-killed the build process almost immediately.
2. The 494M-row hash join. The severity join is particularly brutal. defect_classifications has one row per defect item per test — about 10 per test on average, so 494M rows for 50M tests. Building a hash table of 494M rows to join against 500M test rows is a ~20GB in-memory operation. Even if the data fit in RAM, it would thrash swap and slow the server to a crawl for 20+ minutes.
The Solution: Staged Pipeline With Derived Tables
The key insight is that you don't need to do the severity join at the test level. Severity is a property of a vehicle's defect history — not of individual test moments. If you pre-aggregate severity to the registration level first (mean severity multiplier across all defects ever recorded for that vehicle), you can join at registration granularity against mot_history_by_registration, which is orders of magnitude smaller.
This was the turning point. Everything else in the pipeline flows from this decision: the two-pass architecture, the stage checkpointing, the choice to use Polars streaming for the aggregation step. Once you stop trying to join at test granularity, the whole problem becomes tractable on a shared server with 16GB RAM.
We restructured the build into four cached stages, each writing a checkpoint parquet file so the pipeline can be killed and resumed at any point:
Stage A — Adaptive Band Boundaries
For each make/model, compute the p20, p40, p60, and p80 percentiles of average odometer readings across all registrations. Round to the nearest 1,000 miles. Enforce strict monotonicity (each percentile must be higher than the last — rare edge cases where two percentiles collapse can otherwise produce degenerate bands).
This is the "adaptive" part. Fixed bands (0–30k, 30–60k, etc.) are wrong for models with very different typical mileage profiles. A London black taxi covering 100,000 miles per year has a completely different distribution than a weekend sports car covering 3,000. Fixed bands would leave the taxi's entire fleet life in the "120k+" category. Adaptive bands always give you five roughly equal slices of actual usage.
Output: ~200k rows, one per make/model/band. Takes ~20 seconds. Stored as band_boundaries.parquet.
Stage B — Registration-Level Severity
Separately (no join to tests yet), aggregate defect_classifications to registration level:
SELECT registration, AVG(severity_multiplier) AS avg_severity
FROM defect_classifications
GROUP BY registration
This reduces 1.82 billion rows to ~70 million (one per vehicle that has ever had a recordable defect). Polars streaming mode handles this in about 50 seconds without hitting peak RAM limits — it processes in chunks rather than materialising the full result set.
We used Polars rather than DuckDB for this stage specifically because of .collect(streaming=True). DuckDB handles the base data aggregations well and is our default for everything query-shaped, but the severity aggregation is a 1.82B-row group-by with no predicate to push down — the entire table has to be consumed. Polars' streaming mode was critical here: it processes in fixed-size batches that cap peak RSS regardless of input size, where DuckDB's execution model would have materialised more intermediate state than our RAM budget allowed.
Output: ~70M rows. Stored as stage_b_severity.parquet.
Stage C1 — Pass Rates (No Severity)
Scan mot_history_by_registration (14GB, pre-aggregated per vehicle) and vehicle_basic_by_registration (1.2GB), join to the band boundaries from Stage A, and compute pass rate per (make, model, year, band).
Critically, this stage does not touch the severity data. The 14GB derived table contains avg_mileage and pass_rate already computed — we just need to bin each vehicle into its mileage band using the Stage A boundaries.
Output: ~176k rows (make/model/year/band aggregates). Takes ~150 seconds. Stored as stage_c1_pass_rates.parquet.
Stage C2 — Severity By Band
Join the Stage B severity data (registration → avg_severity) against vehicle_basic_by_registration (registration → make/model/year) and the Stage A band boundaries (to assign each vehicle to a mileage band based on its avg_mileage). Group by (make, model, year, band) to get average severity per band.
The critical difference from the naive approach: the join is registration-level (70M rows × 1.2GB), not test-level (494M rows × 56GB). Peak RAM stays manageable.
Output: ~176k rows. Takes ~55 seconds. Stored as stage_c2_severity.parquet.
Stage D — Join and Write
Simple join of C1 and C2, compute human-readable band labels (e.g. "55k–95k mi"), write final output atomically (write to temp file, rename into place so the API never reads a partial file).
Total build time: ~5 minutes wall-clock. Total peak RAM: under 4GB. On a 16GB server shared with the live API and Redis, this is safe.
Server Safety Engineering
The pipeline has a few explicit "be kind to the server" mechanisms:
Thread limits: We set POLARS_MAX_THREADS=2 (default 2, configurable with --threads N) before importing Polars. Without this, Polars defaults to the number of CPU cores, which on a 16-core server means 16 parallel threads competing with the live API for CPU time. Two threads still parallelise IO and computation but leave the server responsive.
nice/ionice: The build script is designed to be launched with nice -n 19 ionice -c 3 — lowest CPU priority, idle-class IO. If anything else on the system needs CPU or disk, the build yields immediately.
Inter-stage pauses: A 10-second sleep between stages (configurable with --pause-secs N) lets the server breathe and allows Redis and the API to catch up on any requests that queued during peak build load.
Stage caching: If the build is killed (OOM, user interrupt, anything), re-running it picks up from the last completed stage checkpoint. No work is lost.
Polars streaming: The large aggregations use Polars' .collect(streaming=True) where available — this processes data in chunks that don't require materialising the full result in RAM at once.
Why Adaptive Bands Beat Fixed Bands
Here's the problem with fixed 0–30k/30–60k/60–90k/90–120k/120k+ bands for a concrete model: the Volvo V70.
The V70 is a high-mileage car. A significant fraction of the V70 fleet has covered 150,000+ miles. Fixed bands would put most of the interesting high-mileage data into a single "120k+" bucket, which tells you almost nothing — it's just "really high mileage V70s eventually fail more often." You already knew that.
Adaptive bands computed from the actual V70 mileage distribution might give you bands like 0–45k, 45–75k, 75–105k, 105–145k, 145k+. Each band contains roughly the same number of test records. The 145k+ band for a V70 is genuinely informative — it contains many vehicles, and you can see meaningful differences between 2003 and 2010 model years within it.
For a low-mileage model (say, a Porsche Boxster with an average 6,000 miles per year), the bands might be 0–25k, 25–40k, 40–55k, 55–75k, 75k+. You'd never see meaningful data in a "120k+" bucket because most Boxsters never get there.
The Residual Age Effect
We did verify the mileage normalisation actually helps. Testing with Ford Focus at 50–70k miles (a comparable mileage slice across all years):
- 1999 manufacture year: 72.2% pass rate
- 2018 manufacture year: 83.1% pass rate
That's an 11-point gap, compared to 25+ points in raw lifetime averages. So mileage normalisation substantially closes the gap — but doesn't eliminate it entirely.
The residual is real. A low-mileage 20-year-old Focus has still had 20 years of weather, UV, road salt, perishing rubber seals, and ageing electronics. Age-related deterioration is not mileage-dependent — it's time-dependent. We acknowledge this explicitly in the methodology section on the page rather than pretending mileage bands are the complete answer.
The second contributor is genuine manufacturing improvement. A 2018 Focus genuinely uses better materials, has tighter factory tolerances, and benefits from engineering lessons learned over 15 years of production. Some of the gap between old and new is just: the new car is better made.
Relative Verdicts, Not Absolute Scores
The "Recommended" and "Caution" badges use the model's own median pass rate as the baseline. A year is "Recommended" if its pass rate is 5+ percentage points above the model's median, with at least 100 vehicles in the sample.
This matters because a "good" year for a Land Rover Defender is very different from a "good" year for a Toyota Yaris. Defenders fail MOTs at much higher rates in general — that's expected, accepted, and baked into the price. Ranking a Defender "avoid" because it has a 65% pass rate (when the median Defender is 62%) would be misleading. The relevant question is: is this year better or worse than other years of the same model?
The verdict thresholds are deliberately conservative. Years very close to the median don't get a verdict. We'd rather say nothing than mislead someone into thinking a 73.1% vs 72.8% difference is meaningful signal.
API and Frontend
The precomputed data is served via a new endpoint:
GET /vehicles/vintage/{make_slug}/{model_slug}/mileage-bands
Response shape (flat list, pivoted to a chart in PHP/JS):
{
"slug_make": "volvo",
"slug_model": "v70",
"bands": [
{"year": 2007, "band_rank": 1, "mileage_band": "0–45k mi",
"total_tests": 1240, "pass_rate": 91.2, "avg_severity_score": 1.1},
{"year": 2007, "band_rank": 2, "mileage_band": "45–75k mi",
"total_tests": 1890, "pass_rate": 84.1, "avg_severity_score": 1.3},
...
]
}
Cached in Redis at a 24-hour TTL. The data changes slowly (weekly rebuild schedule) so a day-long cache is appropriate.
On the frontend, the PHP template fetches the hub stats endpoint and the mileage-bands endpoint in parallel via curl_multi. The band data arrives as a flat list and gets pivoted in PHP to the Chart.js dataset format — one dataset (line) per manufacture year, with band labels on the X axis.
Years are sampled to a maximum of 12 for readability (evenly spread, keeping oldest and newest). Colour encodes age: blue for the newest years, amber for the oldest, using HSL interpolation across a 210→35 hue range. spanGaps: true connects across mileage bands where a year has no data (not every year will have data in every band).
SEO Surface: 7,300 New Pages
Each best-year page is independently addressable at /vehicle/{make}/{model}/best-year/. We generate these for models with at least 500 total vehicles and at least 3 manufacture years each with 100+ vehicles — enough data to produce meaningful comparisons rather than noisy single-digit sample sizes.
That qualification threshold produces about 7,300 pages, added to the sitemap at priority 0.7 with monthly changefreq.
Each page carries a FAQPage JSON-LD schema block generated from the actual data:
{
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is the best year for a Volvo V70?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Based on MOT data from 4,823 Volvo V70s, the most reliable
manufacture year is 2009 with a 79.4% MOT pass rate across
312 vehicles tested. Years consistently above average: 2007–2010."
}
}
]
}
The answers are generated programmatically from the live data — no templated copy-paste. If the data for a model changes (as it does with daily delta updates), the answer will update on the next page load.
As far as we're aware, no one else publishes mileage-normalised, severity-weighted reliability analysis from the full DVLA dataset at this scale. Survey-based reliability guides — which most published "best year to buy" content relies on — typically cover a few dozen responses per model per year and can't distinguish at-equivalent-mileage performance. We have tens of thousands of test records per cohort for the major models.
Discovery and Entry Points
The best-year pages are connected to the rest of the site through several entry points:
- Model hub pages (
/vehicle/volvo/v70/) — a "Which year to buy?" button below the year chart links to the best-year page. The chart bars themselves are clickable, linking directly to the group page for that manufacture year. - Group pages (
/vehicle/volvo/v70/2007/) — a green "Which year to buy?" pill in the other-years navigation section invites users to explore the full model comparison. - Sitemap — all 7,300 pages in the sitemap, submitted to Google Search Console.
What We Learned
Aggregation level matters enormously for large joins. Moving the severity join from test-level (494M rows) to registration-level (70M rows) was the single biggest engineering decision. The 7× row count reduction is actually more like a 50× computation reduction because the join key becomes much cheaper to hash.
Adaptive bands are worth the complexity. The extra stage to compute per-model percentile boundaries adds maybe 20 seconds to the build but makes the output genuinely useful rather than arbitrarily bucketed. Fixed bands would have produced misleading comparisons for models with unusual mileage distributions.
Stage caching is not premature optimisation. We hit OOM kills twice during development when exploring different approaches. Having checkpoint files meant we could resume from Stage B or C rather than starting over from a 50-second Stage A each time. At production scale, where builds run weekly on a shared server, this is just table stakes.
The residual age effect is real and should be disclosed. It would have been tempting to present the mileage-normalised chart as the complete picture. It isn't — age-related deterioration is a real thing that the chart doesn't capture. We say so explicitly on the page. Users who understand the methodology trust the numbers more.
What's Next
The mileage-band stats table is now part of the weekly rebuild schedule. Future work could include:
- Fuel-type splits on the best-year page for models where petrol/diesel/hybrid performance diverges significantly across years
- Defect category breakdowns per band — not just "pass rate" but "likely to need suspension work at 80k miles"
- Comparison mode: "Show me how the 2007 Focus compares to the 2007 Astra at the same mileage"
The data is all there. It's a matter of deciding which analyses are useful enough to surface to users.
The goal isn't just to rank cars — it's to answer the question buyers actually have:
"What happens if I own this car at 80,000 miles?"
That's the gap we're trying to close.
Try it: Volvo V70 best year to buy · Ford Focus best year to buy · VW Golf best year to buy