What 91,000 Procurement Workflows Taught Us About Construction Documents

By BuildVision Team · March 2026

We've now run over 91,000 AI workflow executions across 12 distinct procurement workflows, all on real commercial construction documents. Not demo data. Not cherry-picked PDFs. Real mechanical schedules, electrical panel listings, plumbing fixture tables, spec books, addenda, and the full mess of documents that show up when a GC starts buying equipment for a project.

This post is a look at what we've learned — not about AI in the abstract, but about the specific, frustrating, deeply idiosyncratic nature of construction documents. What makes them hard. Where accuracy matters most. And what happens when you get it wrong.

The document landscape

When a general contractor or procurement team starts equipment buyout on a commercial project, the document set typically spans mechanical schedules, electrical panel listings, plumbing fixture tables, spec books, addenda, and whatever markups have accumulated along the way.

Across 91,000+ executions, we've touched every combination of these document types. The patterns that emerged surprised us.

What makes construction documents hard

If you've never tried to automate construction document processing, you might assume the hard part is reading text from a PDF. It isn't. OCR is a solved problem for clean, native PDFs. The hard part is everything else.

Inconsistent formatting

There is no standard for how a mechanical schedule should be laid out. Every engineering firm has its own template. Some use landscape tables with equipment tags as row headers. Others use portrait layouts with tags as column headers. Some embed schedules in plan sheets. Others put them on dedicated schedule pages. Some split a single schedule across two or three sheets. We've seen mechanical schedules formatted as tables, as lists, as notes on a drawing, and — in one memorable case — as a series of callouts scattered across a floor plan.

Hand-marked PDFs

Cloud marks, redline markups, yellow highlighting, sticky notes, and handwritten annotations are common. On about 15% of the documents we process, there's some kind of hand markup that partially obscures or modifies the printed content. A crossed-out model number with a handwritten replacement. A capacity value circled with a question mark. An addendum note scribbled in the margin. These aren't noise — they're often the most important information on the page.

Schedules split across addenda

The original bid set says the project has a 200-ton chiller, Model XYZ. Addendum 2 revises it to a 250-ton unit. Addendum 4 changes the model number but not the capacity. Addendum 6 adds two more chillers that weren't in the original set. Reconciling these changes across multiple documents is something experienced PMs do instinctively — but it's one of the hardest tasks to automate reliably.

Conflicts between schedule and spec

The mechanical schedule says the air handling unit is 10,000 CFM. The spec section says 12,000 CFM. The equipment plan shows a unit that physically won't fit in the mechanical room at either capacity. This happens more often than engineers would like to admit. In our data, roughly 8-12% of projects have at least one material conflict between schedule and specification, and the schedule is right about 70% of the time.
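A minimal version of the conflict check looks like the sketch below. The 2% tolerance is an assumption for illustration, not our production threshold, and real conflict detection also has to normalize units before comparing.

```python
# Sketch: flagging a material conflict between a schedule value and a spec
# value. The tolerance is an illustrative assumption, not a real threshold.
def is_material_conflict(schedule_value: float, spec_value: float,
                         tolerance: float = 0.02) -> bool:
    """Treat disagreement beyond the relative tolerance as material."""
    if schedule_value == spec_value:
        return False
    baseline = max(abs(schedule_value), abs(spec_value))
    return abs(schedule_value - spec_value) / baseline > tolerance
```

The 10,000 vs. 12,000 CFM example above trips this check immediately; a 10,000 vs. 10,100 CFM discrepancy would not.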

Scanned vs. native PDFs

About 25% of the documents we process are scanned — either from physical prints or from a print-to-scan workflow that destroys the underlying text layer. Scanned documents are harder across the board: lower extraction accuracy, more ambiguous table structures, and a higher rate of misread values. A scanned "8" can look like a "6" or a "0" depending on print quality.
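A common first-pass heuristic for telling the two apart is to check how much selectable text a PDF library recovers per page: a scanned page with no text layer yields almost nothing. The sketch below assumes you have already extracted per-page character counts; the threshold is a guess for illustration, and real classification needs more signals than this.

```python
# Heuristic sketch: classify pages as scanned vs. native from the number of
# characters a PDF text extractor recovers on each page. The 50-character
# threshold is an illustrative assumption.
def classify_pages(chars_per_page: list[int], min_chars: int = 50) -> list[str]:
    """A page whose text layer yields almost no characters is likely a scan."""
    return ["scanned" if n < min_chars else "native" for n in chars_per_page]
```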

The accuracy breakdown

We publish our accuracy numbers on our live benchmark, broken down by document type and workflow. Here's the headline figure from 91,000+ executions:

Overall accuracy across all 12 workflows: 89%. We publish this number because it's honest. Some workflows are effectively solved. Others need work. The distinction matters.

How accuracy compounds

Here's a number that doesn't show up in benchmarks but matters enormously in practice: how accuracy scales with project size.

A 95% equipment extraction rate sounds high. And for a small project with 20 pieces of equipment, it is — that's one missed item, which a PM will almost certainly catch during review. But commercial construction projects aren't small.

A mid-size commercial project might have 200-400 pieces of equipment across mechanical, electrical, and plumbing. At 95% accuracy, a 400-item project means 20 items that need human attention. That's manageable, especially when the system flags low-confidence extractions. But drop to 90%, and you're looking at 40 items. At 85%, it's 60.
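The arithmetic above is just an expected-value estimate, which can be written as a one-liner (a sketch, not our actual flagging logic):

```python
# Expected number of items needing human review on a project of a given
# size at a given extraction accuracy. A simple expected-value estimate.
def expected_review_items(item_count: int, accuracy: float) -> int:
    return round(item_count * (1.0 - accuracy))
```

At 400 items, 95% accuracy leaves 20 items for review, 90% leaves 40, and 85% leaves 60, which is exactly why the same headline accuracy feels very different at different project sizes.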

The relationship isn't just mathematical — it's psychological. At 95%, a PM can review exceptions and trust the rest. At 85%, they start re-checking everything, which defeats the purpose of automation. Somewhere around 92-94%, trust breaks down, and the tool shifts from "helpful" to "another thing I have to double-check."

This is why we obsess over those last few percentage points. The difference between 90% and 95% isn't 5% — it's the difference between a tool that saves hours and a tool that creates work.

Patterns we didn't expect

Across 91,000 executions, a few patterns emerged that we didn't anticipate.

Electrical is more standardized than mechanical

Electrical schedules follow more predictable formats than mechanical schedules. Panelboard schedules, for example, are almost always structured the same way: circuit number, load description, breaker size, voltage. Switchgear one-line diagrams follow IEEE conventions that are remarkably consistent across firms. This means electrical extraction accuracy runs 3-5 percentage points higher than mechanical, even though the underlying documents are similarly complex.
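As an illustration of that consistency, the panelboard row described above maps cleanly onto a small record type. The field and function names here are ours, not an industry standard, and real extracted cells are messier than this:

```python
# Sketch: the near-universal panelboard schedule row described above.
# Field names are illustrative, not a standard.
from dataclasses import dataclass

@dataclass
class PanelboardCircuit:
    circuit: int        # circuit number
    load: str           # load description, e.g. "RTU-1"
    breaker_amps: int   # breaker size
    voltage: int        # e.g. 120, 208, 277, 480

def parse_row(cells: list[str]) -> PanelboardCircuit:
    """Parse one extracted table row in (circuit, load, breaker, voltage) order."""
    return PanelboardCircuit(int(cells[0]), cells[1].strip(),
                             int(cells[2].rstrip("A")), int(cells[3].rstrip("V")))
```

When the row structure is this predictable, extraction errors are mostly confined to OCR misreads rather than layout ambiguity, which is where the 3-5 point accuracy gap over mechanical comes from.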

Plumbing is the most chaotic

Plumbing schedules have the widest variation in format. Some engineers produce beautiful, standardized fixture schedules. Others embed plumbing equipment data in notes on plan sheets, in spec sections, or in general notes that aren't associated with any specific schedule. Plumbing also has the highest rate of missing data — a water heater schedule that lists the model but not the capacity, or a pump schedule with flow rate but no head pressure.

Firm size correlates with document quality

Documents from large national engineering firms (the top 50 by revenue) are dramatically more consistent than documents from smaller regional firms. This isn't about competence — it's about templates. Large firms invest in standardized drawing templates, QA processes, and document management systems. The result is documents that are more predictable for both humans and machines. Our accuracy on documents from top-50 firms runs about 4 percentage points higher than on documents from firms outside the top 200.

Addenda accuracy depends on format, not content

The hardest addenda to process aren't the ones with the most changes — they're the ones that reference changes by description rather than by explicit replacement. "Revise AHU-1 capacity from 10,000 to 12,000 CFM" is easy to parse. "See revised mechanical schedule" with an attached PDF that may or may not contain the complete updated schedule is much harder. About 40% of addenda fall into the second category.

Why "good enough" isn't

In construction procurement, every missed equipment item is a potential problem. It's not like e-commerce, where a 2% error rate means a small percentage of orders ship wrong. In construction, a missed piece of equipment cascades into delays, and ultimately into change orders.

On a commercial project, a single change order for missed MEP equipment can easily run $50,000-$200,000. That's not a rounding error — that's margin. When your construction fee on a $100M project is 2%, your entire profit is $2M. One bad equipment miss doesn't just reduce margin. It can eliminate it.

This is why we don't treat accuracy as a nice-to-have metric. It's the core product question. A procurement tool that extracts 85% of equipment correctly isn't 85% useful — it's a liability, because the 15% it misses creates work and risk that didn't exist before.

The feedback loop

One thing that 91,000 executions gives you that 100 executions doesn't: a feedback loop. Every time a user corrects an extraction — changes a model number, adds a missed piece of equipment, fixes a quantity — that correction becomes training data.

We track correction rates by document type, engineering firm, equipment category, and document format. This lets us identify specific problem areas and prioritize improvements. When our complex mechanical schedule accuracy went from 74% to 81% over the last two quarters, it wasn't because we changed the underlying model. It was because we had enough correction data from that specific document type to identify the patterns we were missing — merged cells in multi-page schedules, footnotes that override table values, and capacity values expressed in units we hadn't encountered before.
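The core of that tracking is a tally over correction events, grouped by whatever dimension you want to investigate. A minimal sketch (the field names and function are illustrative, not our schema):

```python
# Sketch: ranking correction hotspots by any dimension (document type,
# engineering firm, equipment category, ...). Field names are illustrative.
from collections import Counter

def correction_hotspots(corrections: list[dict], key: str, top: int = 3):
    """Rank values of `key` by how many user corrections they attracted."""
    return Counter(c[key] for c in corrections).most_common(top)
```

Sorting the same events by document type, then by firm, then by equipment category is how a pattern like "merged cells in multi-page mechanical schedules" surfaces out of thousands of individual fixes.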

The feedback loop also works at the firm level. The more projects we see from a given engineering firm, the better we get at parsing their specific template and formatting conventions. This isn't overfitting — it's pattern recognition that matches what experienced PMs do. A PM who's seen twenty document sets from a particular engineer knows where to look for the quirks. We're building that same institutional knowledge, but at scale.

What comes next

The 89% overall accuracy number will change. Some workflows will improve significantly — complex mechanical schedules and table alternates are where we're investing the most engineering effort. Others are close enough to ceiling that improvements will be incremental.

We publish these numbers on our live benchmark because transparency matters. If a workflow gets worse after a model update, you'll see it. If a new workflow launches with low accuracy, you'll see that too. Construction professionals make real decisions based on this data, and they deserve to know exactly how much they can trust it.

91,000 executions is a lot. But the construction document universe is enormous — thousands of engineering firms, dozens of document formats, hundreds of equipment categories, and an infinite supply of hand-marked PDFs with coffee stains and sticky notes. We're not done learning. We're just past the point where the patterns start to become clear.

See the live benchmark

91,000+ executions. 12 workflows. Updated quarterly. No cherry-picking.

View Benchmark