The Problem

The CarHunch Score ranks every UK vehicle against others of the same make, model, and year: a score of 76 means this vehicle performed better than 76% of similar vehicles in its MOT history.

That "similar vehicles" comparison is the interesting constraint. Percentile rankings are relative — they're computed within cohorts. When one vehicle in a cohort gets a new MOT result, the ranking for every other vehicle in that cohort potentially shifts. There is no correct incremental update. To produce accurate scores, you have to recompute the whole thing.

The second constraint is cohort size. A 2018 Ford Fiesta cohort has over 100,000 vehicles. A 2006 Porsche Boxster cohort might have 8. A percentile ranking in a cohort of 8 is mathematically valid but statistically thin. Any scoring system has to handle both correctly and communicate the difference to the user.

The dataset: 128,688,833 UK vehicles with MOT test history, sourced from DVLA open data, stored as Parquet files and queried via DuckDB.

Why Full Rebuild Wins

When recompute time is short, simplicity wins.

Full rebuild:

  • Is correct by construction — no stale rankings, no delta logic, no state to track
  • Is easy to verify — same input data always produces the same scores
  • Fails cleanly — a failed run leaves the previous file untouched

The question is whether it's fast enough. At 128 million vehicles, it takes 77 seconds.

"Correctness is easier when a full rebuild takes 77 seconds."

The Two Hard Problems

Getting the pipeline to run is the easy part. Two harder problems sit inside it.

Thin data and confidence dampening

Without dampening, a car with one clean MOT scores in the top 5% of its cohort — it has a 100% pass rate. But a single test tells you almost nothing. Two identical cars where one has one clean test and the other has six clean tests should not score the same.

The fix is a dampening factor:

# Polars — pull low-data vehicles toward the cohort median
dampening = (pl.col("total_tests") / 4).clip(upper_bound=1.0)

df = df.with_columns(
    (pl.col("risk_score_raw") * dampening).alias("risk_score_dampened")
)

With this, a 1-test vehicle's signal is pulled 75% toward zero before ranking. A 4-test vehicle gets the full signal.

The effect shows up dramatically in the score distribution.

[Figure: score distribution histogram]
Distribution of CarHunch Scores across 128.7M vehicles. Score 50 contains 52 million vehicles (40% of the dataset) — the direct result of confidence dampening pulling single-test vehicles toward the cohort median. The remaining scores are remarkably flat, with slightly elevated tails at 5 and 95.

Look at that spike. Score 50 contains 52 million vehicles — 40% of the entire dataset. That isn't a bug or a rounding artefact. It's the dampening working exactly as intended.

Around 48.7 million of those vehicles have only one recorded MOT test. With a dampening factor of 0.25 (one test out of a required four), their reliability signal is compressed to a quarter of its face value before ranking. Once you rank near-zero signals within a cohort, they all land in the middle. The banding step rounds them to 50.
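The banding rule itself isn't shown here; a plausible sketch, assuming scores are rounded to the nearest multiple of 5 and clamped to the 5–95 range seen in the distribution:

```python
def band(percentile: float) -> int:
    """Hypothetical banding: nearest multiple of 5, clamped to 5-95."""
    return int(min(95, max(5, round(percentile / 5) * 5)))

# band(50.3) -> 50, band(0.4) -> 5, band(99.7) -> 95
```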

What this means in practice: a car with a single clean MOT doesn't score 95. It scores 50 — "we don't know yet." A car needs at least three or four tests before its score starts to carry meaningful weight. This is by design. Showing a confident-looking score based on one data point would be misleading; showing 50 with a "provisional" label is honest.

The confidence level is also surfaced directly in the UI: vehicles with 4+ tests get a High confidence badge, 2–3 tests get Medium, and 1 test (or none) gets Low. This means a user looking at a score of 50 on a 1-test car can immediately see why — the score reflects limited data, not a bad vehicle.
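Those thresholds are small enough to capture in a helper (a sketch; the function name is hypothetical):

```python
def confidence_badge(total_tests: int) -> str:
    """Map MOT test count to the UI confidence badge."""
    if total_tests >= 4:
        return "High"
    if total_tests >= 2:
        return "Medium"
    return "Low"  # 0 or 1 tests
```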

The rest of the distribution — the 76 million vehicles with more test history — spreads out across the full 5–95 range fairly evenly, with slight elevations at the tails: score 5 (vehicles that consistently underperform their cohort) and score 95 (vehicles with consistently clean records).

Severity weighting vs raw defect counts

Early versions used raw defect counts. A high-mileage car that accumulates advisory notes — worn tyres, minor corrosion, surface rust — would score terribly, even though advisories are by definition not serious enough to fail. This created a systematic bias against older, well-maintained vehicles.

The fix: weight defects by severity before dividing by test count:

defect_severity = (
    pl.col("dangerous_defects") * 4.0
    + pl.col("major_defects")   * 2.0
    + pl.col("advisory_defects") * 0.5
    + pl.col("minor_defects")   * 0.25
) / pl.col("total_tests")

Now a car with eight advisories and zero failures scores very differently from a car with two dangerous defects. The severity-weighted approach distinguishes between "lots of minor wear" and "repeated serious faults" — which is exactly the distinction that matters to someone buying a used car.

The Stack

Polars for batch computation

DuckDB is used throughout the rest of this pipeline — for ingesting DVLA delta files, building derived tables, and serving the API. But for the scoring step, Polars turned out to be the right tool.

The batch job reads two pre-aggregated derived tables (~130M rows combined), joins them, computes cohort baselines, ranks within groups, and writes ~128M rows back to Parquet. Polars' lazy execution model plans the entire pipeline before running it, fuses operations where possible, and parallelises group-by and window operations across all CPU cores.

The key operation is a window rank — computing each vehicle's percentile position within its cohort:

df = df.with_columns(
    pl.col("risk_score_dampened")
    .rank("average")
    .over("cohort_key")    # cohort_key = make + model + year band
    .alias("cohort_rank")
)

Note: for grouped aggregations elsewhere in the pipeline, modern Polars (≥ 0.19) spells it .group_by(); the older .groupby() is deprecated. Window expressions like .over() are unaffected.

DuckDB for instant lookups

The 128M-row scores file is served via a single-row DuckDB query at request time — no server process, no connection overhead:

SELECT
    carhunch_score, confidence, pass_rate, baseline_fail_rate,
    defect_severity, baseline_defect_severity,
    cohort_size, total_tests, make, model, manufacture_year
FROM read_parquet('/data/carhunch_scores/data.parquet')
WHERE registration = 'KY17CYS'
LIMIT 1

Response time: under 50ms cold, under 5ms on a Redis cache hit. DuckDB's Parquet reader uses predicate pushdown — it doesn't scan 1.7GB to find one row.

Atomic file swap

During a rebuild, the API is still serving from the previous scores file. If the rebuild wrote directly to the live path and crashed halfway through, the API would return corrupted data.

The fix is a standard atomic rename:

import tempfile, shutil

# Create a temp file in the same directory (same filesystem → rename is atomic)
tmp = tempfile.NamedTemporaryFile(
    dir=OUTPUT_DIR, suffix=".parquet", delete=False
)
tmp.close()  # we only need the reserved path, not the open handle
df.write_parquet(tmp.name)

# Atomic rename — on Linux this is a single rename() syscall
shutil.move(tmp.name, str(OUTPUT_DIR / "data.parquet"))

On Linux, shutil.move() within the same filesystem resolves to rename() — atomic. The API always reads either the complete old file or the complete new file, never a partial one.

Two Real Vehicles

Registration numbers are examples from publicly available DVLA MOT data.

2017 Mazda MX-5 (KY17CYS)

KY17CYS — Score: 95

A 2017 Mazda MX-5 with six recorded MOT tests. Pass rate: 100%. Defect severity score: 0.0 — no defects recorded across all tests. Its cohort of 4,564 similar vehicles has an average pass rate of 93% and a defect severity of 0.54. It ranks in the top 5% of its cohort on both measures.

2016 Volkswagen Passat (GD16ZXO)

GD16ZXO — Score: 45

A 2016 Volkswagen Passat with seven recorded tests. Pass rate: 86%. Its cohort of 17,426 vehicles has a pass rate of 84% — so this vehicle is actually slightly above average on pass rate. But its defect severity score is 1.07 against a cohort average of 0.83: a steady accumulation of advisory notes over its test history. The severity weighting drags the score below 50 despite the above-average pass rate. This is the system working correctly: a vehicle that passes but accumulates issues should score differently from one that passes cleanly.

Performance

Here is how long each step of the nightly pipeline takes, from the first production run.

Pipeline timing, per step:

  • Delta download: 24s
  • Convert parquet: 55s
  • Derived tables: 70s
  • Score rebuild: 77s
  • Guardrails: 1s
End-to-end nightly pipeline: 3 minutes 47 seconds total. The score rebuild adds 77 seconds.

Hardware: Dell Precision 5820, Xeon W-class CPU, 128GB RAM.

From the first production run (2026-02-25):

  • 128,688,833 vehicles scored
  • Score computation (Polars): 31.8 seconds
  • Parquet write (1.7 GB): ~45 seconds
  • Total score rebuild step: 77 seconds
  • End-to-end nightly pipeline: 3 minutes 47 seconds

Memory: Polars holds the working dataset in RAM during the pipeline. This is manageable because the inputs are pre-aggregated derived tables, not raw MOT records — the join inputs are much smaller than the 128M output rows suggest.

Honest limitation: this works because the full working set fits comfortably in 128GB. If the dataset grew 10× or if the pipeline moved to raw records as inputs, the architecture would need to evolve — probably toward out-of-core Polars or chunked processing. For the current scale, the simple approach is the right one.

Architecture

Data flow:

DVLA delta (daily, ~100MB)
  → convert to parquet
  → build derived tables (vehicle_mot_profile, vehicle_basic)
  → Polars: join → score → rank → band (77s)
  → carhunch_scores/data.parquet (atomic swap)
  → DuckDB lookup → FastAPI /vehicles/{reg}/score → Redis (24h TTL) → PHP frontend
The scoring step sits inside the existing nightly delta pipeline. No new cron jobs, no new services.

What You Don't Need

The full scoring system for 128 million vehicles runs inside a single cron job on a workstation. It requires:

  • No data warehouse
  • No always-on database service for the scores
  • No external analytics service
  • No Spark cluster or distributed compute
  • No infra team to maintain any of it

Columnar analytics libraries — Polars for batch transforms, DuckDB for query-time reads — have become good enough that this is now the straightforward approach, not the clever one.

"The interesting problems were about correctness, not scale."