What 100,000 Procurement Workflows Taught Us About Construction Documents

By BuildVision Team · March 2026

We've now run over 100,000 AI workflow executions across 12 distinct procurement workflows, all on real commercial construction documents. Not demo data. Not cherry-picked PDFs. Real mechanical schedules, electrical panel listings, plumbing fixture tables, spec books, addenda, and the full mess of documents that show up when a GC starts buying equipment for a project.

This post is a look at what we've learned, not about AI in the abstract, but about the specific, frustrating, deeply idiosyncratic nature of construction documents. What makes them hard. Where accuracy matters most. And what happens when you get it wrong.

The document set

When a general contractor or procurement team starts equipment buyout on a commercial project, the document set typically includes mechanical equipment schedules, electrical panel and switchgear schedules, plumbing fixture schedules, specification books, equipment plans, and a running stack of addenda that revise all of the above.

Across 100,000+ executions, we've touched every combination of these document types. The patterns that emerged surprised us.

What makes construction documents hard

If you've never tried to automate construction document processing, you might assume the hard part is reading text from a PDF. It isn't. OCR is a solved problem for clean, native PDFs. The hard part is everything else.

Inconsistent formatting

There is no standard for how a mechanical schedule should be laid out. Every engineering firm has its own template. Some use landscape tables with equipment tags as row headers. Others use portrait layouts with tags as column headers. Some embed schedules in plan sheets. Others put them on dedicated schedule pages. Some split a single schedule across two or three sheets. We've seen mechanical schedules formatted as tables, as lists, as notes on a drawing, and — in one memorable case — as a series of callouts scattered across a floor plan.
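To make the layout problem concrete, here's a minimal sketch of one normalization step: guessing whether equipment tags run down the rows or across the columns, and transposing when they don't. The tag regex and the `tag_axis` heuristic are illustrative assumptions, not our production parser.

```python
import re

# Equipment tags like "AHU-1", "P-3", "CH-2" follow a letters-dash-number pattern.
# (Illustrative; real tag conventions vary by firm.)
TAG_RE = re.compile(r"^[A-Z]{1,4}-\d+$")

def tag_axis(table: list[list[str]]) -> str:
    """Guess whether equipment tags label the rows or the columns.

    `table` is a grid of cell strings. Returns "rows" if the first column
    looks like tags, "cols" if the first row does, "unknown" otherwise.
    """
    first_col = [row[0].strip() for row in table[1:] if row]
    first_row = [cell.strip() for cell in table[0][1:]]
    col_hits = sum(bool(TAG_RE.match(c)) for c in first_col)
    row_hits = sum(bool(TAG_RE.match(c)) for c in first_row)
    if col_hits > row_hits:
        return "rows"
    if row_hits > col_hits:
        return "cols"
    return "unknown"

def normalize(table: list[list[str]]) -> list[list[str]]:
    """Transpose the grid if tags run across the columns, so downstream
    code can always assume one equipment item per row."""
    if tag_axis(table) == "cols":
        return [list(col) for col in zip(*table)]
    return table
```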

Hand-marked PDFs

Cloud marks, redline markups, yellow highlighting, sticky notes, and handwritten annotations are common. On many document sets we process, there's some kind of hand markup that partially obscures or modifies the printed content. A crossed-out model number with a handwritten replacement. A capacity value circled with a question mark. An addendum note scribbled in the margin. These aren't noise. They're often the most important information on the page.

Schedules split across addenda

The original bid set says the project has a 200-ton chiller, Model XYZ. Addendum 2 revises it to a 250-ton unit. Addendum 4 changes the model number but not the capacity. Addendum 6 adds two more chillers that weren't in the original set. Reconciling these changes across multiple documents is something experienced PMs do instinctively, but it's one of the hardest tasks to automate reliably.
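The bookkeeping half of that task is straightforward once revisions exist as structured records; the hard part is extracting them from the documents in the first place. A minimal sketch of the bookkeeping, with hypothetical record shapes of our own invention:

```python
from dataclasses import dataclass, field

@dataclass
class Equipment:
    tag: str
    attrs: dict = field(default_factory=dict)  # e.g. {"tons": 200, "model": "XYZ"}

def reconcile(bid_set: dict[str, Equipment],
              addenda: list[list[dict]]) -> dict[str, Equipment]:
    """Apply addenda in issue order. Each revision is a dict like
    {"tag": "CH-1", "tons": 250}; later addenda win. New tags
    (equipment added by an addendum) are inserted, not rejected.
    """
    items = {tag: Equipment(tag, dict(eq.attrs)) for tag, eq in bid_set.items()}
    for addendum in addenda:          # addenda must be sorted by issue order
        for rev in addendum:
            item = items.setdefault(rev["tag"], Equipment(tag=rev["tag"]))
            for key, value in rev.items():
                if key != "tag":
                    item.attrs[key] = value  # partial updates: Addendum 4 can change
                                             # the model without touching capacity
    return items

# Mirroring the chiller story above:
# bid = {"CH-1": Equipment("CH-1", {"tons": 200, "model": "XYZ"})}
# addenda = [
#     [{"tag": "CH-1", "tons": 250}],        # Addendum 2: capacity only
#     [{"tag": "CH-1", "model": "ABC"}],     # Addendum 4: model only
#     [{"tag": "CH-2"}, {"tag": "CH-3"}],    # Addendum 6: new units
# ]
# reconcile(bid, addenda)["CH-1"].attrs  ->  {"tons": 250, "model": "ABC"}
```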

Conflicts between schedule and spec

The mechanical schedule says the air handling unit is 10,000 CFM. The spec section says 12,000 CFM. The equipment plan shows a unit that physically won't fit in the mechanical room at either capacity. This happens more often than engineers would like to admit. In our data, material conflicts between schedule and specification show up often enough that reconciliation can't be an afterthought — and the authoritative source isn't always obvious.
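Because the authoritative source varies project to project, the safe automation move is to surface disagreements rather than resolve them. A sketch of that idea, with hypothetical data shapes (nothing here is our production schema):

```python
def find_conflicts(schedule: dict[str, dict], spec: dict[str, dict],
                   tolerance: float = 0.0) -> list[tuple[str, str, object, object]]:
    """Compare values for the same tag/field across two sources and return
    (tag, field, schedule_value, spec_value) for every material disagreement.
    Conflicts go to a human reviewer instead of an auto-picked winner,
    because neither document is reliably authoritative.
    """
    conflicts = []
    for tag, sched_fields in schedule.items():
        spec_fields = spec.get(tag, {})
        for f, sv in sched_fields.items():
            if f in spec_fields and spec_fields[f] != sv:
                pv = spec_fields[f]
                if isinstance(sv, (int, float)) and isinstance(pv, (int, float)):
                    if abs(sv - pv) <= tolerance * max(abs(sv), abs(pv)):
                        continue  # within tolerance: not material
                conflicts.append((tag, f, sv, pv))
    return conflicts

# find_conflicts({"AHU-1": {"cfm": 10_000}}, {"AHU-1": {"cfm": 12_000}})
# -> [("AHU-1", "cfm", 10000, 12000)]
```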

Scanned vs. native PDFs

A meaningful share of the documents we process are scanned, either from physical prints or from a print-to-scan workflow that destroys the underlying text layer. Scanned documents are harder across the board: more ambiguous table structures and a higher rate of misread values. A scanned "8" can look like a "6" or a "0" depending on print quality.

Where extraction is easier — and harder

We publish production workload on our live benchmark (200k+ executions this quarter, 12 production prompts; per-task scores next quarter). Here's what crossing 100,000+ executions taught us about where quality varies by document type and workflow.

Some workflows run at very high quality; structured electrical schedules are the clearest example. Others still need work, complex mechanical schedules and description-only addenda chief among them. The distinction matters, and per-task scores will return to the benchmark page when we publish them next quarter.

How quality compounds with project size

Here's something a single headline number hides: how error rates compound when the equipment list is large.

On a small project with twenty pieces of equipment, a few exceptions are easy to catch in review. Commercial projects aren't small.

A mid-size commercial project might have hundreds of pieces of equipment across mechanical, electrical, and plumbing. Small per-row error rates stop feeling small when multiplied across a long bill of materials — which is why flagging low-confidence rows matters more than bragging about a single top-line figure.
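The arithmetic is worth spelling out. Under an illustrative independence assumption, and with a hypothetical 1% per-row error rate (not a measured figure), a sketch:

```python
# Expected exceptions on a bill of materials, assuming (illustratively)
# an independent per-row error probability p across n rows.
def expected_errors(n: int, p: float) -> float:
    return n * p

def p_at_least_one(n: int, p: float) -> float:
    return 1 - (1 - p) ** n

# A 1% per-row rate feels small on 20 rows and very different on 500:
# expected_errors(20, 0.01)   -> 0.2 exceptions, p_at_least_one ~ 18%
# expected_errors(500, 0.01)  -> 5.0 exceptions, p_at_least_one ~ 99%
```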

The relationship isn't just mathematical; it's psychological. When exceptions are rare and well-flagged, a PM can review them and move on. When noise is high, they start re-checking everything — which defeats the purpose of automation. The product goal is to keep the exception set small and obvious.
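One way to keep that exception set explicit is a plain confidence threshold over extracted rows. A sketch, assuming each row carries a per-row confidence score (the field name is hypothetical):

```python
def triage(rows: list[dict], threshold: float = 0.9) -> tuple[list[dict], list[dict]]:
    """Split extracted rows into auto-accepted rows and a review queue.
    Each row carries a model score in row["confidence"]. The threshold is
    a product decision, not a fixed constant: too low and real errors slip
    through, too high and the PM ends up re-checking everything anyway.
    """
    accepted = [r for r in rows if r.get("confidence", 0.0) >= threshold]
    review = [r for r in rows if r.get("confidence", 0.0) < threshold]
    return accepted, review
```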

That's why we treat measurement as ongoing work, not a one-time marketing claim.

Patterns we didn't expect

Across 100,000 executions, a few patterns emerged that we didn't anticipate.

Electrical is more standardized than mechanical

Electrical schedules follow more predictable formats than mechanical schedules. Panelboard schedules, for example, are almost always structured the same way: circuit number, load description, breaker size, voltage. Switchgear one-line diagrams follow IEEE conventions that are remarkably consistent across firms. In our data, electrical extraction tends to run tighter than mechanical, even when the underlying documents are similarly complex.
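That stability means the target schema for a panelboard row can be small and rigid. A sketch; the field names are ours, not an industry standard:

```python
from dataclasses import dataclass

@dataclass
class PanelCircuit:
    """One row of a panelboard schedule, mirroring the near-universal
    column layout described above."""
    circuit: int       # circuit number
    description: str   # load description, e.g. "RTU-1" or "Receptacles Rm 204"
    breaker_amps: int  # breaker size
    voltage: int       # e.g. 120, 208, 277, 480
```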

Plumbing is the most chaotic

Plumbing schedules have the widest variation in format. Some engineers produce beautiful, standardized fixture schedules. Others embed plumbing equipment data in notes on plan sheets, in spec sections, or in general notes that aren't associated with any specific schedule. Plumbing also has the highest rate of missing data: a water heater schedule that lists the model but not the capacity, or a pump schedule with flow rate but no head pressure.

Firm size correlates with document quality

Documents from large national engineering firms (the top 50 by revenue) are dramatically more consistent than documents from smaller regional firms. This isn't about competence; it's about templates. Large firms invest in standardized drawing templates, QA processes, and document management systems. The result is documents that are more predictable for both humans and machines.

Addenda accuracy depends on format, not content

The hardest addenda to process aren't the ones with the most changes; they're the ones that reference changes by description rather than by explicit replacement. "Revise AHU-1 capacity from 10,000 to 12,000 CFM" is easy to parse. "See revised mechanical schedule" with an attached PDF that may or may not contain the complete updated schedule is much harder. A large share of addenda fall into the second category.
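The easy category really is mechanical to handle. A sketch of parsing explicit-replacement language, with an illustrative pattern that makes no attempt at the "see revised schedule" category:

```python
import re

# Matches explicit-replacement language like
# "Revise AHU-1 capacity from 10,000 to 12,000 CFM".
# Illustrative only; real addenda phrasing varies widely.
REVISION_RE = re.compile(
    r"Revise\s+(?P<tag>[A-Z]{1,4}-\d+)\s+(?P<field>\w+)\s+"
    r"from\s+(?P<old>[\d,\.]+)\s+to\s+(?P<new>[\d,\.]+)\s*(?P<unit>[A-Za-z]*)",
    re.IGNORECASE,
)

def parse_explicit_revision(text: str) -> dict | None:
    m = REVISION_RE.search(text)
    if not m:
        return None  # "See revised mechanical schedule" lands here;
                     # that's the hard category, and no regex will save you
    return {
        "tag": m.group("tag"),
        "field": m.group("field").lower(),
        "old": float(m.group("old").replace(",", "")),
        "new": float(m.group("new").replace(",", "")),
        "unit": m.group("unit").upper(),
    }
```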

Why "good enough" isn't

In construction procurement, every missed equipment item is a potential problem. It's not like e-commerce, where you can absorb a thin defect rate across millions of small transactions. In construction, a missed piece of equipment can mean a change order, a scramble to buy the item late at whatever price and lead time the market offers, or a schedule delay that ripples into other trades.

On a commercial project, a single change order for missed MEP equipment can easily run $50,000-$200,000. That's not a rounding error — that's margin. When your construction fee on a $100M project is 2%, your entire profit is $2M. One bad equipment miss doesn't just reduce margin. It can eliminate it.

This is why we don't treat extraction quality as a nice-to-have metric. It's the core product question. A procurement tool that misses meaningful slices of the equipment list creates work and risk that didn't exist before.

The feedback loop

One thing that 100,000 executions gives you that 100 executions doesn't: a feedback loop. Every time a user corrects an extraction — changes a model number, adds a missed piece of equipment, fixes a quantity — that correction becomes training data.

We track correction rates by document type, engineering firm, equipment category, and document format. This lets us identify specific problem areas and prioritize improvements. On complex mechanical schedules, correction data at scale is what surfaced the long tail of failure modes: merged cells in multi-page schedules, footnotes that override table values, and capacity values expressed in units we hadn't encountered before.
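Mechanically, this kind of tracking is just grouped counting over correction events. A sketch with hypothetical event fields and grouping keys:

```python
from collections import Counter

def correction_rates(events: list[dict]) -> dict[tuple, float]:
    """events are per-row outcomes like
    {"doc_type": "mechanical_schedule", "firm": "Acme Eng",
     "category": "AHU", "corrected": True}.
    Returns correction rate keyed by (doc_type, category); the grouping
    keys here are examples, and we slice by firm and format the same way.
    """
    seen, fixed = Counter(), Counter()
    for e in events:
        key = (e["doc_type"], e["category"])
        seen[key] += 1
        fixed[key] += bool(e["corrected"])
    return {k: fixed[k] / seen[k] for k in seen}
```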

The feedback loop also works at the firm level. The more projects we see from a given engineering firm, the better we get at parsing their specific template and formatting conventions. This isn't overfitting. It's pattern recognition that matches what experienced PMs do. A PM who's seen twenty document sets from a particular engineer knows where to look for the quirks. We're building that same institutional knowledge, but at scale.

What comes next

Overall blended performance will change as we ship. Some workflows will improve significantly: complex mechanical schedules and table alternates are where we're investing the most engineering effort. Others are close enough to ceiling that improvements will be incremental.

We publish workload counts on our live benchmark because transparency matters. Per-task accuracy scores will return to that page next quarter; when methodology or measurement shifts, you'll see it reflected there. Construction professionals make real decisions based on this data, and they deserve to know exactly how much they can trust it.

100,000 executions is a lot. But the construction document universe is enormous — thousands of engineering firms, dozens of document formats, hundreds of equipment categories, and an infinite supply of hand-marked PDFs with coffee stains and sticky notes. We're not done learning. We're just past the point where the patterns start to become clear.

See the live benchmark

200k+ executions this quarter. 12 production prompts. Workload updated quarterly. No cherry-picking.

View Benchmark