We Publish Our AI Accuracy Numbers. Here's Why.
Ask a construction AI vendor how accurate their software is. You'll get one of three answers: a vague "very high," a cherry-picked number from a controlled demo, or silence. Nobody publishes their real numbers.
We do. Our accuracy benchmark is live, updated continuously, and based on 91,000+ AI executions across 12 production procurement workflows. Not lab results. Not demo data. Real documents from real projects processed by real users.
This post explains what those numbers mean, where we're strong, where we're still improving, and why we think publishing accuracy data should be table stakes for any company selling AI to the construction industry.
The trust gap in construction AI
Construction professionals are skeptical of new software. They have every reason to be. The industry has a long history of tools that demo well and fail on the jobsite — or in this case, fail on the actual spec book sitting on your desk.
When a vendor tells you their AI "reads documents," your next question should be: how well? What's the error rate? On which document types? At what volume? And if they can't answer those questions with specifics, you're looking at a black box.
The problem is worse in procurement. A missed piece of equipment on a $50M project isn't a rounding error — it's a change order. A misread spec isn't a minor bug — it's a wrong product quoted, a submittal rejected, a schedule slipping. Accuracy isn't a nice-to-have feature. It's the entire product.
So we made a decision early on: if accuracy is the product, then accuracy data should be public.
What the numbers actually mean
Here's the top-level summary as of this writing. You can check the live benchmark for current figures.
91,000+ total AI executions across 12 distinct procurement workflows. These include document classification, equipment extraction, spec parsing, schedule reading, and more. Each execution is a discrete task — classifying a single document, extracting equipment from a single schedule, parsing a single spec section.
89% overall accuracy across all workflows, weighted by volume. That's a blended number. Some workflows are effectively solved; others are genuinely hard and still improving. The overall number is honest, but it hides important variation, which is exactly why we break it down.
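If you want to see exactly what "weighted by volume" means, here's a minimal sketch of the arithmetic. The per-workflow accuracies echo the tiers described below, but the execution counts are made-up placeholders and only five of the twelve workflows appear, so the output won't reproduce the real 89% figure; it just shows how the blend is computed.

```python
# Minimal sketch of a volume-weighted accuracy blend.
# Accuracies mirror the tiers discussed in this post; execution counts
# are illustrative placeholders, not our real per-workflow volumes.
workflows = {
    "component_spec_parsing":       {"accuracy": 0.99,  "executions": 20_000},
    "document_classification":      {"accuracy": 0.97,  "executions": 35_000},
    "equipment_extraction":         {"accuracy": 0.95,  "executions": 25_000},
    "table_alternates_extraction":  {"accuracy": 0.831, "executions": 6_000},
    "complex_mechanical_schedules": {"accuracy": 0.81,  "executions": 5_000},
}

total_executions = sum(w["executions"] for w in workflows.values())
weighted_accuracy = sum(
    w["accuracy"] * w["executions"] for w in workflows.values()
) / total_executions

print(f"{total_executions:,} executions, {weighted_accuracy:.1%} blended accuracy")
```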
The "effectively solved" tier
Some procurement tasks have clear structure, predictable formats, and well-defined outputs. Our accuracy on these is high enough that they run autonomously — humans review exceptions, not every result.
- Component spec parsing: 99%+ — Given a mechanical or electrical spec section, extract the specified manufacturer, model, and performance requirements. Spec books follow CSI format with predictable structure. The AI reads them reliably.
- Document classification: 97%+ — Is this a mechanical schedule, a plumbing riser diagram, a cover sheet, a spec section? This workflow sorts a 400-page bid package into its component document types. Accuracy is high because document types have strong visual and textual signatures.
- Equipment extraction: 95%+ — Pull the individual pieces of equipment from a schedule or spec, including quantities, sizes, capacities, and tags. This is the core of what we do, and the accuracy reflects years of iteration on real project data.
These three workflows account for the majority of our execution volume. They're the foundation of the platform — the automated reading layer that turns a 300-page PDF into structured procurement data.
The "still improving" tier
Other tasks are harder. They involve ambiguous formatting, inconsistent engineering conventions, or information spread across multiple documents. We're transparent about where accuracy drops:
- Table alternates extraction: 83.1% — Mechanical schedules sometimes list alternate equipment in footnotes, secondary columns, or reference notes. Parsing these reliably is harder than reading the primary equipment. The alternates are formatted inconsistently across engineering firms, and the AI has to infer relationships that aren't always explicit.
- Complex mechanical schedules: 81% — Some mechanical schedules use merged cells, rotated headers, nested sub-tables, or span multiple pages with continuation markers. These layouts challenge the AI's ability to maintain row-column relationships. We're improving, but we're honest about where we are.
An 81% accuracy on complex mechanical schedules means roughly 1 in 5 extractions needs human correction. That's not good enough for full automation. But it's a meaningful starting point — it turns a 45-minute manual extraction into a 10-minute review-and-correct workflow. We're pushing these numbers up every month.
Why these specific workflows matter
If you're a GC procurement manager or a manufacturer rep, you already know why. But for context:
Document classification is the first step. When a 400-page bid package lands in your inbox, someone has to figure out what's in it. Which sections are mechanical specs? Where are the schedules? Is there an equipment list on drawing M-601 or is it on M-602? At 97%+ accuracy, our system sorts the documents so you don't have to page through the PDF yourself.
Equipment extraction is the core workflow. Once you know where the schedules are, you need to pull out every piece of equipment — every air handling unit, chiller, boiler, pump, fan coil, VAV box — with its tag, size, capacity, and specified manufacturer. This is the work that takes a senior estimator hours per project. At 95%+ accuracy, the AI does the first pass and flags anything uncertain for human review.
Spec parsing determines what's actually specified. The equipment schedule tells you there's an AHU. The spec section tells you it needs to be a Trane IntelliPak with specific CFM, static pressure, and efficiency ratings. Matching schedules to specs is how you build a complete equipment list. At 99%+, this mapping runs reliably.
Together, these workflows replace the most time-consuming part of procurement: reading. Not deciding, not negotiating, not relationship-building — reading. The hours spent turning PDFs into structured data before any actual procurement work begins.
How we measure accuracy
A number is only useful if you know how it was produced. Here's our methodology:
Ground truth comes from human reviewers. Every AI extraction is compared against a human-verified result. When a user corrects an AI output — fixing an equipment tag, adding a missed item, changing a classification — that correction becomes ground truth data. We're measuring against real-world expert judgment, not synthetic test sets.
We measure at the field level, not the document level. If the AI extracts 20 pieces of equipment from a schedule and gets 19 right but misses 1, that's 95% accuracy — not 0% because the document had an error. Field-level measurement gives you a realistic picture of how much human review is needed.
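Here's a small sketch of what field-level scoring looks like in practice. The record layout (equipment rows keyed by tag, with a handful of fields each) is a hypothetical simplification for illustration, not our internal schema.

```python
# Field-level accuracy: score every ground-truth field, so one missed
# item lowers the score proportionally instead of failing the document.
# The tag-keyed dict layout here is a hypothetical simplification.

def field_accuracy(extracted: dict[str, dict], ground_truth: dict[str, dict]) -> float:
    correct = total = 0
    for tag, true_fields in ground_truth.items():
        got = extracted.get(tag, {})  # a missed item scores zero for all of its fields
        for field, true_value in true_fields.items():
            total += 1
            correct += (got.get(field) == true_value)
    return correct / total if total else 1.0

# A 20-row schedule with 4 fields per row: 19 rows fully correct and
# 1 row missed scores 76/80 = 95%, not 0% for "the document had an error".
```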
Volume matters. A 95% accuracy claim on 50 documents is noise. On 91,000+ executions, it's a statistically meaningful measurement. Our sample sizes are large enough that the confidence intervals are tight. The numbers mean what they say.
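For the statistically inclined, here's a back-of-the-envelope check on why sample size matters, using a standard Wilson score interval for a binomial proportion. This is generic statistics, not our internal tooling, and the counts are illustrative.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# ~95% observed accuracy on 50 documents vs. the same rate on 91,000 executions:
print(wilson_interval(48, 50))          # roughly (0.87, 0.99): a band about 12 points wide
print(wilson_interval(86_450, 91_000))  # roughly (0.949, 0.951): well under half a point
```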
We don't exclude hard cases. Some vendors test on clean, well-formatted documents and report the results as representative. We include everything — the 1997 scan of a hand-drawn mechanical schedule, the PDF where someone filled in an Excel table with merged cells and printed it sideways, the spec book where the MEP sections start on page 247. Our numbers reflect the actual document quality you encounter on real projects.
Why publishing creates accountability
The moment you publish a number, you own it. If our equipment extraction accuracy drops from 95% to 93% next month, that shows up on the benchmark page. Our users see it. Our prospects see it. Our competitors see it.
That kind of visibility changes how you build product. It forces you to instrument everything, monitor continuously, and treat accuracy regression like a production outage. You can't hide behind vague claims when the data is public.
It also changes the sales conversation. Instead of "trust us, our AI is great," the conversation becomes "here are the numbers, here's the methodology, here's where we're strong, here's where we're still working." That's a better conversation for everyone involved.
We think every construction AI vendor should publish their accuracy data. Not because we're confident we'll always have the best numbers — but because the industry deserves to make informed decisions. If you're asking contractors to trust AI with their procurement, show them the scorecard.
Where this is going
Our target is simple: push every workflow above 95%. The "effectively solved" tier proves it's possible. The "still improving" tier tells us where to focus.
Complex mechanical schedules at 81% means we need better handling of merged cells, continuation tables, and non-standard layouts. Table alternates at 83.1% means we need better inference for footnote references and secondary equipment listings. These are specific, measurable problems — not hand-waving about "improving AI."
Every month, we process more documents, collect more corrections, and retrain on a larger dataset of real construction documents. The accuracy curves are moving in the right direction. You can watch them on the benchmark page.
If you're evaluating construction AI tools, ask for the numbers. If they won't share them, ask why.
See the live numbers
Our accuracy benchmark is public and updated continuously. Check the data yourself.
View Benchmark