What 100,000 Procurement Workflows Taught Us About Construction Documents
We've now run over 100,000 AI workflow executions across 12 distinct procurement workflows, all on real commercial construction documents. Not demo data. Not cherry-picked PDFs. Real mechanical schedules, electrical panel listings, plumbing fixture tables, spec books, addenda, and the full mess of documents that show up when a GC starts buying equipment for a project.
This post is a look at what we've learned, not about AI in the abstract, but about the specific, frustrating, deeply idiosyncratic nature of construction documents. What makes them hard. Where accuracy matters most. And what happens when you get it wrong.
The document set
When a general contractor or procurement team starts equipment buyout on a commercial project, the document set typically includes:
- Mechanical schedules — air handling units, chillers, boilers, pumps, VFDs, cooling towers, fan coil units. These are the backbone of HVAC procurement and often the most complex documents in the set.
- Electrical schedules — switchgear, transformers, panelboards, generators, transfer switches, UPS systems. Typically more standardized than mechanical, but with their own formatting quirks.
- Plumbing schedules — water heaters, pumps, fixtures, grease interceptors, backflow preventers. Usually shorter, but with more variation in how engineers present the data.
- Equipment plans — floor plans and details that show equipment locations but not always complete specs. These cross-reference schedules and can contain information not found anywhere else in the set.
- Specification sections — Division 23 (HVAC), Division 26 (Electrical), Division 22 (Plumbing). The spec often conflicts with the schedule. More on that below.
- Addenda — the documents issued between bid day and the final document set that modify, replace, or contradict anything above.
Across 100,000+ executions, we've touched every combination of these document types. The patterns that emerged surprised us.
What makes construction documents hard
If you've never tried to automate construction document processing, you might assume the hard part is reading text from a PDF. It isn't. Native PDFs carry a text layer you can read directly, and OCR is a largely solved problem for clean scans. The hard part is everything else.
Inconsistent formatting
There is no standard for how a mechanical schedule should be laid out. Every engineering firm has its own template. Some use landscape tables with equipment tags as row headers. Others use portrait layouts with tags as column headers. Some embed schedules in plan sheets. Others put them on dedicated schedule pages. Some split a single schedule across two or three sheets. We've seen mechanical schedules formatted as tables, as lists, as notes on a drawing, and — in one memorable case — as a series of callouts scattered across a floor plan.
Hand-marked PDFs
Cloud marks, redline markups, yellow highlighting, sticky notes, and handwritten annotations are common. On many document sets we process, there's some kind of hand markup that partially obscures or modifies the printed content. A crossed-out model number with a handwritten replacement. A capacity value circled with a question mark. An addendum note scribbled in the margin. These aren't noise. They're often the most important information on the page.
Schedules split across addenda
The original bid set says the project has a 200-ton chiller, Model XYZ. Addendum 2 revises it to a 250-ton unit. Addendum 4 changes the model number but not the capacity. Addendum 6 adds two more chillers that weren't in the original set. Reconciling these changes across multiple documents is something experienced PMs do instinctively, but it's one of the hardest tasks to automate reliably.
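The reconciliation described above can be sketched as an ordered merge: apply each addendum's partial updates in issue order so that later addenda win. This is a minimal illustration, not our production pipeline; the `reconcile` helper and field names are hypothetical.

```python
def reconcile(base: dict, addenda: list[list[dict]]) -> dict:
    """base maps tag -> spec fields; each addendum is a list of change records.
    Applying addenda in issue order means later changes win, matching how a
    PM would reconcile the set by hand."""
    equipment = {tag: dict(fields) for tag, fields in base.items()}
    for addendum in addenda:
        for change in addendum:
            tag = change["tag"]
            # An addendum may add new equipment or revise existing fields.
            equipment.setdefault(tag, {}).update(
                {k: v for k, v in change.items() if k != "tag"}
            )
    return equipment

base = {"CH-1": {"capacity_tons": 200, "model": "XYZ"}}
addenda = [
    [{"tag": "CH-1", "capacity_tons": 250}],  # Addendum 2: revise capacity
    [{"tag": "CH-1", "model": "XYZ-HE"}],     # Addendum 4: revise model only
    [{"tag": "CH-2", "capacity_tons": 250, "model": "XYZ-HE"},
     {"tag": "CH-3", "capacity_tons": 250, "model": "XYZ-HE"}],  # Addendum 6: add two units
]
final = reconcile(base, addenda)
# CH-1 ends at 250 tons, model XYZ-HE; CH-2 and CH-3 are new.
```

The hard part in practice isn't the merge itself; it's extracting reliable change records from the addenda in the first place.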
Conflicts between schedule and spec
The mechanical schedule says the air handling unit is 10,000 CFM. The spec section says 12,000 CFM. The equipment plan shows a unit that physically won't fit in the mechanical room at either capacity. This happens more often than engineers would like to admit. In our data, material conflicts between schedule and specification show up often enough that reconciliation can't be an afterthought, and the authoritative source isn't always obvious.
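Because neither source can be assumed authoritative, the useful behavior is to flag mismatches for human review rather than silently pick a winner. A minimal sketch, with a hypothetical `find_conflicts` helper and field names:

```python
def find_conflicts(schedule: dict, spec: dict, tolerance: float = 0.0) -> list[str]:
    """Compare overlapping numeric fields from two sources; report mismatches.
    Neither source is assumed authoritative: conflicts are flagged for review."""
    conflicts = []
    for tag, sched_fields in schedule.items():
        spec_fields = spec.get(tag, {})
        for field_name, sched_val in sched_fields.items():
            spec_val = spec_fields.get(field_name)
            if spec_val is None:
                continue  # field appears in only one source; nothing to compare
            if abs(sched_val - spec_val) > tolerance * max(abs(sched_val), abs(spec_val)):
                conflicts.append(f"{tag}.{field_name}: schedule={sched_val}, spec={spec_val}")
    return conflicts

issues = find_conflicts(
    schedule={"AHU-1": {"cfm": 10_000}},
    spec={"AHU-1": {"cfm": 12_000}},
)
# issues == ["AHU-1.cfm: schedule=10000, spec=12000"]
```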
Scanned vs. native PDFs
A meaningful share of the documents we process are scanned, either from physical prints or from a print-to-scan workflow that destroys the underlying text layer. Scanned documents are harder across the board: more ambiguous table structures and a higher rate of misread values. A scanned "8" can look like a "6" or a "0" depending on print quality.
Where extraction is easier — and harder
We publish production workload on our live benchmark (200k+ executions this quarter across 12 production prompts; per-task scores return next quarter). Here's what crossing 100,000 executions taught us about how quality varies by document type and workflow:
- Component spec parsing. The most structured workflow. Spec sections follow relatively predictable patterns: a section number, a product description, acceptable manufacturers, and performance criteria.
- Document classification. Given a PDF page, is it a mechanical schedule, an electrical single-line diagram, a plumbing riser, or a cover sheet? Visual signatures of each document type are distinctive.
- Equipment extraction. Pulling structured equipment data (tag, description, manufacturer, model, capacity, quantities) from schedules. Formatting variation has the biggest impact; clean native PDFs from large engineering firms behave differently than scanned sets with non-standard layouts.
- Equipment quantity detection. Quantities are expressed differently depending on the document — duplicate rows, plan vs. schedule mismatches, implied multiples. Context matters enormously.
- Table alternates. When a schedule lists a basis-of-design product and acceptable alternates, parsing the relationship is harder than parsing the products. Formatting is inconsistent; relationships are often implied.
- Complex mechanical schedules. Multi-page schedules with merged cells, nested sub-tables, footnotes that modify values, and references to other sheets. This is where we invest the most effort.
Some workflows run at very high quality. Others still need work. The distinction matters — and per-task scores will return to the benchmark page when we publish them next quarter.
How quality compounds with project size
Here's something that shows up in practice but never in a single headline number: how errors compound when the equipment list is large.
On a small project with twenty pieces of equipment, a few exceptions are easy to catch in review. Commercial projects aren't small.
A mid-size commercial project might have hundreds of pieces of equipment across mechanical, electrical, and plumbing. Small per-row error rates stop feeling small when multiplied across a long bill of materials — which is why flagging low-confidence rows matters more than bragging about a single top-line figure.
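The arithmetic behind this is simple. Assuming independent per-row errors (a simplification, since real errors cluster by document), the chance of at least one mistake on a bill of materials grows fast with row count:

```python
def p_at_least_one_error(per_row_error: float, rows: int) -> float:
    """Probability that at least one row is wrong, assuming independent errors."""
    return 1 - (1 - per_row_error) ** rows

# A 1% per-row error rate feels small on 20 rows but not on 500.
small = p_at_least_one_error(0.01, 20)    # ~0.18
large = p_at_least_one_error(0.01, 500)   # ~0.99
```

At hundreds of rows, "was there an error somewhere?" is nearly certain to be yes, which is why the useful question becomes "which rows need a second look?"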
The relationship isn't just mathematical; it's psychological. When exceptions are rare and well-flagged, a PM can review them and move on. When noise is high, they start re-checking everything — which defeats the purpose of automation. The product goal is to keep the exception set small and obvious.
That's why we treat measurement as ongoing work, not a one-time marketing claim.
Patterns we didn't expect
Across 100,000 executions, a few patterns emerged that we didn't anticipate.
Electrical is more standardized than mechanical
Electrical schedules follow more predictable formats than mechanical schedules. Panelboard schedules, for example, are almost always structured the same way: circuit number, load description, breaker size, voltage. Switchgear one-line diagrams follow IEEE conventions that are remarkably consistent across firms. In our data, electrical extraction tends to run tighter than mechanical, even when the underlying documents are similarly complex.
Plumbing is the most chaotic
Plumbing schedules have the widest variation in format. Some engineers produce beautiful, standardized fixture schedules. Others embed plumbing equipment data in notes on plan sheets, in spec sections, or in general notes that aren't associated with any specific schedule. Plumbing also has the highest rate of missing data: a water heater schedule that lists the model but not the capacity, or a pump schedule with flow rate but no head pressure.
Firm size correlates with document quality
Documents from large national engineering firms (the top 50 by revenue) are dramatically more consistent than documents from smaller regional firms. This isn't about competence; it's about templates. Large firms invest in standardized drawing templates, QA processes, and document management systems. The result is documents that are more predictable for both humans and machines.
Addenda accuracy depends on format, not content
The hardest addenda to process aren't the ones with the most changes; they're the ones that reference changes by description rather than by explicit replacement. "Revise AHU-1 capacity from 10,000 to 12,000 CFM" is easy to parse. "See revised mechanical schedule" with an attached PDF that may or may not contain the complete updated schedule is much harder. A large share of addenda fall into the second category.
Why "good enough" isn't
In construction procurement, every missed equipment item is a potential problem. It's not like e-commerce, where you can absorb a thin defect rate across millions of small transactions. In construction, a missed piece of equipment can mean:
- A change order when the missing item is discovered during installation
- A schedule delay when the equipment wasn't ordered in time
- A price increase when the buyout window closes and you're buying at spot pricing
- An incorrect substitution because the alternate wasn't properly evaluated against the spec
On a commercial project, a single change order for missed MEP equipment can easily run $50,000-$200,000. That's not a rounding error — that's margin. When your construction fee on a $100M project is 2%, your entire profit is $2M. One bad equipment miss doesn't just reduce margin. It can eliminate it.
This is why we don't treat extraction quality as a nice-to-have metric. It's the core product question. A procurement tool that misses meaningful slices of the equipment list creates work and risk that didn't exist before.
The feedback loop
One thing that 100,000 executions gives you that 100 executions doesn't: a feedback loop. Every time a user corrects an extraction — changes a model number, adds a missed piece of equipment, fixes a quantity — that correction becomes training data.
We track correction rates by document type, engineering firm, equipment category, and document format. This lets us identify specific problem areas and prioritize improvements. On complex mechanical schedules, gains came from correction data at scale: merged cells in multi-page schedules, footnotes that override table values, and capacity values expressed in units we hadn't encountered before.
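The bookkeeping here is straightforward aggregation. A minimal sketch, assuming a stream of per-row correction events grouped by document type (the event shape and `correction_rates` helper are hypothetical):

```python
from collections import defaultdict

def correction_rates(events: list[dict]) -> dict[str, float]:
    """Corrections per extracted row, grouped by document type.
    The same aggregation works keyed by firm, equipment category, or format."""
    totals: dict[str, int] = defaultdict(int)
    corrected: dict[str, int] = defaultdict(int)
    for e in events:
        totals[e["doc_type"]] += 1
        corrected[e["doc_type"]] += int(e["was_corrected"])
    return {k: corrected[k] / totals[k] for k in totals}

events = [
    {"doc_type": "mechanical", "was_corrected": True},
    {"doc_type": "mechanical", "was_corrected": False},
    {"doc_type": "electrical", "was_corrected": False},
]
rates = correction_rates(events)
# {'mechanical': 0.5, 'electrical': 0.0}
```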
The feedback loop also works at the firm level. The more projects we see from a given engineering firm, the better we get at parsing their specific template and formatting conventions. This isn't overfitting. It's pattern recognition that matches what experienced PMs do. A PM who's seen twenty document sets from a particular engineer knows where to look for the quirks. We're building that same institutional knowledge, but at scale.
What comes next
Overall blended performance will change as we ship. Some workflows will improve significantly: complex mechanical schedules and table alternates are where we're investing the most engineering effort. Others are close enough to the ceiling that improvements will be incremental.
We publish workload counts on our live benchmark because transparency matters. Per-task accuracy scores will return to that page next quarter; when methodology or measurement shifts, you'll see it reflected there. Construction professionals make real decisions based on this data, and they deserve to know exactly how much they can trust it.
100,000 executions is a lot. But the construction document universe is enormous — thousands of engineering firms, dozens of document formats, hundreds of equipment categories, and an infinite supply of hand-marked PDFs with coffee stains and sticky notes. We're not done learning. We're just past the point where the patterns start to become clear.
See the live benchmark
200k+ executions this quarter. 12 production prompts. Workload updated quarterly. No cherry-picking.
View Benchmark