We Publish Our Accuracy Numbers. Here's Why.
Ask a construction AI vendor how accurate their software is. You'll get one of three answers: a vague "very high," a cherry-picked number from a controlled demo, or silence. Nobody publishes their real numbers.
We do. Our benchmark page publishes our production workload — 200k+ AI executions this quarter across 12 production prompts — on real construction documents. Per-task accuracy scores publish next quarter on the same page. Not lab results. Not demo data. Real documents from real projects processed by real users.
This post explains how we think about measurement, where we're strong, where we're still pushing, and why publishing it should be table stakes for any company selling AI to the construction industry.
The trust gap in construction AI
Construction professionals are skeptical of new software. They have every reason to be. The industry has a long history of tools that demo well and fail on the jobsite — or in this case, fail on the actual spec book sitting on your desk.
When a vendor tells you their AI "reads documents," your next question should be: how well? What's the error rate? On which document types? At what volume? And if they can't answer those questions with specifics, you're looking at a black box.
The problem is worse in procurement. A missed piece of equipment on a $50M project isn't a rounding error — it's a change order. A misread spec isn't a minor bug — it's a wrong product quoted, a submittal rejected, a schedule slipping. Accuracy isn't a nice-to-have feature. It's the entire product.
So we made a decision early on: if accuracy is the product, then accuracy data should be public.
What we measure
Here's the top-level summary as of this writing. Current workload counts are on the live benchmark; per-task accuracy scores return there next quarter.
Production shape: 200k+ AI executions this quarter across 12 production prompts — the same grouping we report on the benchmark page. Underneath that are the workflow types you care about in procurement: document classification, equipment extraction, spec parsing, schedule reading, and more. Each execution is a discrete task — classifying a single document, extracting equipment from a single schedule, parsing a single spec section.
Blended performance varies by document type. Some workflows have clear structure and predictable formats; others involve ambiguous formatting, inconsistent engineering conventions, or information spread across multiple documents. The headline picture hides important variation — which is exactly why we break it down.
High-confidence workflows
Some procurement tasks have clear structure, predictable formats, and well-defined outputs. On these, quality is high enough that humans review exceptions, not every result.
- Component spec parsing — Given a mechanical or electrical spec section, extract the specified manufacturer, model, and performance requirements. Spec books follow CSI format with predictable structure.
- Document classification — Sort a bid package into mechanical schedules, plumbing risers, cover sheets, spec sections. Document types have strong visual and textual signatures.
- Equipment extraction — Pull equipment from schedules and specs: quantities, sizes, capacities, tags. This is the core of what we do, refined on real project data.
These workflows account for the majority of our execution volume. They're the foundation — the automated reading layer that turns a 300-page PDF into structured procurement data.
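To make "structured procurement data" concrete, here is an illustrative sketch of what one extracted schedule row might look like once the automated reading layer has done its pass. The field names and values are hypothetical, not our production schema; the point is that a row in a PDF table becomes a record a reviewer can check field by field.

```python
# Illustrative only: hypothetical field names and values, not our production schema.
extracted_row = {
    "source": {"document": "M-601.pdf", "page": 3, "table": "AIR HANDLING UNIT SCHEDULE"},
    "tag": "AHU-2",
    "equipment_type": "air_handling_unit",
    "quantity": 1,
    "capacity": {"supply_cfm": 12000, "external_static_in_wg": 2.5},
    "basis_of_design": {"manufacturer": "Trane", "model": "IntelliPak"},
    "confidence": 0.93,  # low-confidence fields are the ones flagged for human review
}
```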
Harder document types
Other tasks are harder: alternates in footnotes, secondary columns, or reference notes; mechanical schedules with merged cells, nested sub-tables, or continuation pages. We're transparent about where extraction is still hardest — and where we're investing engineering effort.
- Table alternates — Parsing relationships between basis-of-design products and acceptable alternates when formatting is inconsistent across firms.
- Complex mechanical schedules — Layouts that challenge row-column relationships across multi-page tables.
On the hardest layouts, a meaningful share of rows still needs human correction. That's not full hands-off automation yet — but it still compresses a long manual read into a shorter review-and-correct workflow. We're pushing these areas every month.
Why these specific workflows matter
If you're a GC procurement manager or a manufacturer rep, you already know why. But for context:
Document classification is the first step. When a 400-page bid package lands in your inbox, someone has to figure out what's in it. Which sections are mechanical specs? Where are the schedules? Is there an equipment list on drawing M-601 or is it on M-602? Our system sorts the documents so you don't have to page through the PDF yourself.
Equipment extraction is the core workflow. Once you know where the schedules are, you need to pull out every piece of equipment — every air handling unit, chiller, boiler, pump, fan coil, VAV box — with its tag, size, capacity, and specified manufacturer. This is the work that takes a senior estimator hours per project. The system does the first pass and flags anything uncertain for human review.
Spec parsing determines what's actually specified. The equipment schedule tells you there's an AHU. The spec section tells you it needs to be a Trane IntelliPak with specific CFM, static pressure, and efficiency ratings. Matching schedules to specs is how you build a complete equipment list.
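As a rough sketch of what that matching step looks like in data terms, the schedule contributes tags and capacities, the spec contributes the basis of design and performance requirements, and joining the two is what produces a quotable equipment list. The record shapes and section numbers below are hypothetical, not our pipeline; real matching also has to handle aliases, ranges, and "or approved equal" language.

```python
# Hypothetical record shapes, shown only to illustrate the schedule-to-spec join.
schedule_rows = [
    {"tag": "AHU-1", "equipment_type": "air_handling_unit", "supply_cfm": 12000},
    {"tag": "P-3", "equipment_type": "pump", "gpm": 450},
]

spec_requirements = {
    "air_handling_unit": {"section": "23 73 13", "basis_of_design": "Trane IntelliPak"},
    "pump": {"section": "23 21 23", "basis_of_design": "Bell & Gossett Series e-1510"},
}

# The complete equipment list pairs each scheduled item with what the spec requires of it.
equipment_list = [
    {**row, "spec": spec_requirements.get(row["equipment_type"])}
    for row in schedule_rows
]
```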
Together, these workflows replace the most time-consuming part of procurement: reading. Not deciding, not negotiating, not relationship-building — reading. The hours spent turning PDFs into structured data before any actual procurement work begins.
How we measure accuracy
A number is only useful if you know how it was produced. Here's our methodology:
Ground truth comes from human reviewers. Every AI extraction is compared against a human-verified result. When a user corrects an AI output — fixing an equipment tag, adding a missed item, changing a classification — that correction becomes ground truth data. We're measuring against real-world expert judgment, not synthetic test sets.
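As a sketch of what that feedback loop can look like, the original AI output and the user's verified edit are stored side by side, and the diff between them is what gets scored. Field names here are illustrative, not our production schema.

```python
# Illustrative only: hypothetical field names, not our production schema.
ai_output = {"tag": "AHU-2", "supply_cfm": 1200, "manufacturer": "Trane"}
user_verified = {"tag": "AHU-2", "supply_cfm": 12000, "manufacturer": "Trane"}  # user fixed a misread digit

ground_truth_record = {
    "task": "equipment_extraction",
    "document_id": "bid-package-0417/M-601",
    "ai_output": ai_output,
    "verified_output": user_verified,
    # The fields the human changed are what count against accuracy.
    "corrected_fields": [k for k in user_verified if user_verified[k] != ai_output.get(k)],
}
```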
We measure at the field level, not the document level. If the AI extracts 20 pieces of equipment from a schedule and gets 19 right but misses 1, we score that as 19 of 20 correct — not as a zero because one row in the document was wrong. Field-level measurement gives you a realistic picture of how much human review is needed.
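A toy version of that scoring difference, using the 19-of-20 example above:

```python
# Toy numbers from the example above: 20 equipment rows extracted, 19 correct.
correct, total = 19, 20

field_level = correct / total                      # 0.95: tells you how much review is left
document_level = 1.0 if correct == total else 0.0  # 0.0: all-or-nothing, hides the useful signal
```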
Volume matters. A thin sample is noise. At 200k+ executions in a quarter — and a long runway of production runs before that — measurement is statistically meaningful. Our sample sizes are large enough that confidence intervals are tight.
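To put rough numbers on that, interval width shrinks with the square root of sample size. The sketch below uses the standard Wilson score interval, not our internal tooling, and the accuracy figures are illustrative, not measured results.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion (textbook formula)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Illustrative: a task that looks 95% accurate, measured at two sample sizes.
print(wilson_interval(95, 100))           # ~(0.89, 0.98): several points of noise either way
print(wilson_interval(190_000, 200_000))  # ~(0.949, 0.951): about a tenth of a point
```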
We don't exclude hard cases. Some vendors test on clean, well-formatted documents and report the results as representative. We include everything — the 1997 scan of a hand-drawn mechanical schedule, the PDF where someone filled in an Excel table with merged cells and printed it sideways, the spec book where the MEP sections start on page 247. Our reporting reflects the actual document quality you encounter on real projects.
Why publishing creates accountability
The moment you publish a measurement, you own it. When extraction quality shifts, it shows up in internal monitoring — and the per-task breakdown will be visible on the benchmark page again next quarter. Our users see it. Our prospects see it. Our competitors see it.
That kind of visibility changes how you build product. It forces you to instrument everything, monitor continuously, and treat accuracy regression like a production outage. You can't hide behind vague claims when the data is public.
It also changes the sales conversation. Instead of "trust us, our AI is great," the conversation becomes "here are the numbers, here's the methodology, here's where we're strong, here's where we're still working." That's a better conversation for everyone involved.
We think every construction AI vendor should publish how they measure and what they run in production. Not because we're confident we'll always be ahead on every dimension — but because the industry deserves to make informed decisions. If you're asking contractors to trust AI with their procurement, show them the scorecard.
Where this is going
Our target is simple: raise quality on every workflow. The high-confidence workflows prove what's possible. The harder document types tell us where to focus.
Complex mechanical schedules and table alternates need better handling of merged cells, continuation tables, footnotes, and non-standard layouts. These are specific engineering problems — not hand-waving about "improving AI."
Every month, we process more documents, collect more corrections, and retrain on a larger dataset of real construction documents. The accuracy curves are moving in the right direction. Watch workload counts on the benchmark page; per-task scores land there next quarter.
If you're evaluating construction AI tools, ask for the numbers. If they won't share them, ask why.
See the live numbers
Production workload is public and updated quarterly. Per-task accuracy publishes next quarter on the same page.
View Benchmark