
No Blind Spots: How Our Shadow Pipeline Hardens AI Agents at Adaptive

A production‑shadow lab where we safely A/B test prompts, retrieval strategies, and LLMs—confirming impact before rollout.

Mike Radzewicz
8 min read
September 11, 2025

Everyone is shipping AI; very few teams can prove their AI is correct.

At Adaptive, AI and LLMs are core to how we deliver construction accounting at scale—extracting fields from invoices, matching vendors and customers, assigning cost codes, and flagging anomalies. If even one model drifts, financial workflows can mis-post spend, mis-categorize bills, and frustrate accountants.

For more context on why we’re bullish on AI in this space, see: AI in Construction Accounting—3 Real Use Cases Builders Should Know.

Our North Star is simple to state and hard to hit: 95%+ field‑level accuracy on real production‑grade inputs—measured in the wild, before releasing to our customers.

This post walks through how we:

  • Fork our production data to create safe, representative inputs.
  • Architect a shadow system that mirrors production without side effects.
  • Evaluate the shadow system with metrics and traces that domain experts trust.
  • Open experimentation to the whole team, turning every PR into a testable hypothesis.

Along the way we’ll share the early wins and what we’ve learned so far.

What we mean by “field‑specific prompts”:
Our AI pipeline is a set of smaller, focused agents—each tuned for a specific field (e.g. cost code, project, vendor name, invoice date, line‑item totals). Narrow prompts and guardrails beat one giant “do everything” prompt for accuracy and debuggability.
What we mean by “shadow system”:
A shadow system is a production‑like copy of your pipeline that consumes the same inputs as production, runs candidate models/logic, and records outputs without writing side effects to customer systems. It lets us quantify progress and catch regressions before rollout.
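
To make those two definitions concrete, here is a minimal sketch of the pattern (the names, helpers, and stubbed model call are hypothetical, not our actual code): one narrow agent per field, plus a shadow flag that records outputs instead of writing anything customer-facing.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class FieldAgent:
        field: str                     # e.g. "vendor_name", "cost_code", "bill_date"
        prompt: str                    # narrow, field-specific instructions
        extract: Callable[[str], str]  # wraps the model call for this one field

    def record_shadow_result(results: dict) -> None:
        # Shadow path: persist to the shadow database only.
        print("shadow run recorded:", results)

    def post_to_customer_systems(results: dict) -> None:
        # Production path (e.g. QuickBooks sync); never reached by shadow apps.
        print("posted to customer systems:", results)

    def run_pipeline(document_text: str, agents: list[FieldAgent], shadow: bool) -> dict:
        results = {agent.field: agent.extract(document_text) for agent in agents}
        (record_shadow_result if shadow else post_to_customer_systems)(results)
        return results

    # Toy usage with a stubbed "model" so the sketch runs on its own.
    agents = [FieldAgent("vendor_name", "Extract the vendor name.", lambda doc: "Acme Concrete")]
    run_pipeline("INVOICE #1042 from Acme Concrete ...", agents, shadow=True)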

Building a Reliable Production Snapshot in Heroku

Our experiments rely on up-to-date customer configurations—vendors, cost codes, customers, and more.

We run on Heroku, and surprisingly, it’s scaled well for us as a post-Series A team. The most reliable way we’ve found to maintain a clean, representative test dataset is to destroy the forked database and re-fork from production each night.

How we snapshot on Heroku

  1. Nightly fork at 03:00 UTC. We destroy the previous shadow database and fork from production (see the sketch after this list). The snapshot is at most ~24 hours old.
  2. Clean slate, every day. Any rows created by shadow runs never linger; the next fork wipes them, keeping experiments independent and reproducible.
  3. Read‑only / no side effects. Shadow apps use scoped credentials with no writes to external integrations (e.g. QuickBooks) and no outbound notifications.
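
Operationally, the nightly job boils down to a few Heroku CLI calls. The sketch below is a rough outline only: app and attachment names are placeholders, and the exact add-on flags (for example the cross-app --fork syntax) are assumptions worth checking against the Heroku Postgres docs.

    import subprocess

    PROD_APP = "adaptive-production"    # placeholder app name
    SHADOW_APP = "adaptive-shadow"      # placeholder app name

    def run(cmd: list[str]) -> None:
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def refork_shadow_db() -> None:
        # 1. Drop yesterday's fork so rows created by shadow runs never linger.
        #    ("DATABASE" stands in for the real attachment/add-on name.)
        run(["heroku", "addons:destroy", "DATABASE", "--app", SHADOW_APP,
             "--confirm", SHADOW_APP])
        # 2. Create a fresh fork of production attached to the shadow app.
        run(["heroku", "addons:create", "heroku-postgresql:standard-0",
             "--fork", f"{PROD_APP}::DATABASE_URL", "--app", SHADOW_APP])
        # 3. Block until the new database is ready before shadow traffic resumes.
        run(["heroku", "pg:wait", "--app", SHADOW_APP])

    if __name__ == "__main__":
        refork_shadow_db()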

Why this approach (and not real‑time replication)

Stability beats cleverness here. Heroku’s fork is simple to operate, has been rock‑solid for us, and avoids tricky partial‑replication or corruption scenarios. The ~24‑hour lag is an acceptable trade‑off because we’re measuring accuracy, not simulating time‑critical writes. If we ever need lower latency, we can move to streaming.

Architecting a Shadow System on That Snapshot

The shadow system mirrors the production path end‑to‑end, swapping in candidate agents and prompts where we want to test changes. Here’s how it works:

Stage 1: Email → Webhook

A SendGrid webhook delivers the same email payload to production and the shadow environment.

Why it matters: Identical inputs remove a whole class of “it works on my machine” debates.
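
One hypothetical way to guarantee byte-identical inputs is a tiny relay that accepts the SendGrid Inbound Parse POST and re-posts it to every environment; in practice you could also register a separate inbound route per environment. The endpoint URLs below are placeholders.

    from flask import Flask, request
    import requests

    app = Flask(__name__)

    ENDPOINTS = [
        "https://prod.example.com/webhooks/email",    # production (placeholder)
        "https://shadow.example.com/webhooks/email",  # shadow app (placeholder)
    ]

    @app.post("/inbound-email")
    def fan_out():
        payload = request.form.to_dict()
        files = {name: (f.filename, f.read()) for name, f in request.files.items()}
        for url in ENDPOINTS:
            # Both environments see exactly the same email payload and attachments.
            requests.post(url, data=payload, files=files, timeout=30)
        return "", 200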

Stage 2: Multi-Agent Match

In the shadow system we run the newest prompts, retrieval strategies, and model variants for each field-specific agent.

Why it matters: We can trial improvements safely and in context, against the exact documents our customers send us.

Stage 3: Prefect Cloud ETL (hourly)

We copy both configuration and results from the shadow database into Snowflake via Prefect Cloud, so results are preserved before the nightly database fork.

Why it matters: A clean database keeps runs independent and reproducible; Snowflake becomes the single source of truth for evaluation data.
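
Schematically, the hourly job looks something like the Prefect flow below. Table names, schemas, and credentials are placeholders, and the real flow also copies configuration tables, not just results.

    import os
    import pandas as pd
    from prefect import flow, task
    from sqlalchemy import create_engine
    from snowflake.connector import connect
    from snowflake.connector.pandas_tools import write_pandas

    @task
    def extract_shadow_results() -> pd.DataFrame:
        engine = create_engine(os.environ["SHADOW_DATABASE_URL"])
        return pd.read_sql("SELECT * FROM shadow_field_predictions", engine)

    @task
    def load_to_snowflake(df: pd.DataFrame) -> None:
        conn = connect(
            account=os.environ["SNOWFLAKE_ACCOUNT"],
            user=os.environ["SNOWFLAKE_USER"],
            password=os.environ["SNOWFLAKE_PASSWORD"],
            database="ANALYTICS",
            schema="SHADOW",
        )
        # Results land in the warehouse before the nightly fork wipes the shadow DB.
        write_pandas(conn, df, table_name="SHADOW_FIELD_PREDICTIONS", auto_create_table=True)

    @flow
    def hourly_shadow_etl():
        load_to_snowflake(extract_shadow_results())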

Stage 4: Snowflake (central warehouse)

All shadow outputs land alongside production metadata in one warehouse.

Why it matters: Centralized, queryable data makes shadow vs. production comparisons easy, scales analytics, and keeps access control consistent.

Stage 5: dbt Models → KPIs

We transform log lines into evaluation metrics: precision/recall per field, drift %, and coverage.

Why it matters: Versioned, consistent definitions turn raw logs into apples-to-apples metrics over time and feed dashboards/alerts reliably.
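
The real definitions live in version-controlled dbt SQL; this pandas sketch, with made-up column names, just spells out roughly what we mean by each metric.

    import pandas as pd

    def field_metrics(preds: pd.DataFrame) -> pd.DataFrame:
        # Expected columns: field, shadow_value, production_value (the accepted value).
        preds = preds.assign(
            predicted=preds["shadow_value"].notna(),
            labeled=preds["production_value"].notna(),
            match=preds["shadow_value"].eq(preds["production_value"]),
        )
        out = {}
        for field, g in preds.groupby("field"):
            out[field] = {
                "coverage": g["predicted"].mean(),                   # how often we predict at all
                "precision": g.loc[g["predicted"], "match"].mean(),  # predictions matching production
                "recall": g.loc[g["labeled"], "match"].mean(),       # labeled rows we got right
                "drift_pct": 100 * (1 - g["match"].mean()),          # disagreement with production
            }
        return pd.DataFrame(out).T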

Stage 6: Omni Dashboard (read-only)

We surface mismatches (shadow system vs. production truth) to domain experts for adjudication and root-cause analysis.

Why it matters: Tight feedback loops turn model guesses into verified improvements.

Evaluating the System (Omni + LangSmith)

We pair structured metrics with rich traces:

  • Omni dashboards highlight where candidates disagree with current production data—by field and by client—so experts can spot patterns fast.
  • LangSmith traces show exactly what each agent saw and decided at each step. That makes it easy to debug prompts, retrieval, and tool calls.

A quick example workflow:

  1. A candidate agent increases vendor name recall but shows a higher drift rate on bill dates.
  2. We open the trace in LangSmith to inspect retrieval context and prompt formatting.
  3. We adjust the vendor matching agent (e.g., a narrower prompt, a stricter tool output schema).
  4. We re‑run the same documents; Omni shows the side‑by‑side metric deltas.
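
For step 2, the relevant traces can also be pulled programmatically. A rough sketch with the LangSmith SDK follows; the project name is a placeholder and the exact filters depend on how your runs are tagged.

    from langsmith import Client

    client = Client()  # picks up the LangSmith API key from the environment

    # Pull recent traces for the agent under suspicion (project name is a placeholder).
    runs = client.list_runs(project_name="shadow-vendor-matching", run_type="chain")

    for run in runs:
        print(run.name, run.start_time)
        print("  inputs:", run.inputs)    # retrieval context and formatted prompt
        print("  outputs:", run.outputs)  # what the agent decided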

Opening Experimentation to the Whole Team

Most teams stop at two environments running side‑by‑side.

We went further: every pull request (PR) becomes its own shadow lab so anyone can test prompts, retrieval strategies, and LLMs against the same data—before release.

Why this matters:

  • Continuous learning. Every change is evaluated against the same documents and forked database, so wins and regressions are obvious.
  • Shared ownership. Product, Support, and Ops can propose and validate changes—not just engineers.
  • Lower risk. No writes to customer systems; nothing ships until it proves itself.

How it works:

Step 1 — Create Review Apps for each PR

Heroku spins up an isolated Review App for every pull request, pulling its code and configs.

Step 2 — Shadowize the Review App

Our CI runs a single command to convert the Review App into a shadow app (sketched after this list):

  • Point the app’s database URL to the nightly-forked secondary database.
  • Register a unique inbound email route in SendGrid.
  • Scale web/worker dynos to safely mirror production traffic.
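
Condensed into one script, the shadowize step might look roughly like this. App names, hostnames, environment variables, and dyno counts are placeholders, and the SendGrid Inbound Parse call is an assumption about that API rather than a drop-in snippet.

    import os
    import subprocess
    import requests

    def shadowize(review_app: str, pr_number: int) -> None:
        # 1. Point the Review App at the nightly-forked shadow database.
        subprocess.run(["heroku", "config:set",
                        f"DATABASE_URL={os.environ['SHADOW_DATABASE_URL']}",
                        "--app", review_app], check=True)

        # 2. Register a unique inbound email route so this PR gets its own address.
        requests.post(
            "https://api.sendgrid.com/v3/user/webhooks/parse/settings",
            headers={"Authorization": f"Bearer {os.environ['SENDGRID_API_KEY']}"},
            json={"hostname": f"pr-{pr_number}.inbound.example.com",
                  "url": f"https://{review_app}.herokuapp.com/webhooks/email"},
            timeout=30,
        ).raise_for_status()

        # 3. Scale dynos so the replay finishes in a reasonable time.
        subprocess.run(["heroku", "ps:scale", "web=2", "worker=4",
                        "--app", review_app], check=True)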

Step 3 — Replay a fixed sample set

We push a curated bundle of ~4,000 bills through the stack (runtime ~8 hours) and do an apples‑to‑apples comparison across PRs—same docs, same configs.
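
The replay itself can be as simple as posting the stored sample payloads at the PR's shadow endpoint, something like the sketch below (the fixture path and endpoint are placeholders).

    import json
    import pathlib
    import requests

    SAMPLE_DIR = pathlib.Path("fixtures/sample_bills")  # placeholder fixture path
    SHADOW_WEBHOOK = "https://pr-123-review-app.herokuapp.com/webhooks/email"

    def replay_samples() -> None:
        for path in sorted(SAMPLE_DIR.glob("*.json")):
            payload = json.loads(path.read_text())
            resp = requests.post(SHADOW_WEBHOOK, data=payload, timeout=60)
            resp.raise_for_status()

    if __name__ == "__main__":
        replay_samples()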

Step 4 — Launch from Slack (no code required)

Non‑engineers trigger runs via Cursor in Slack:

@Cursor Please improve the vendor matching agent prompt by adding a condition: if multiple vendors match, give preference to the vendor with the highest usage according to the get vendor tool

The Cursor background agent applies the change and opens a PR. Our CI then runs the same command end‑to‑end—shadowizes the app and replays the sample set—and posts the metric deltas back into the thread. No local setup needed.
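
The last CI step, posting deltas back to the thread, is a small Slack API call. The channel and thread IDs below are placeholders, and the delta numbers are invented for illustration.

    import os
    from slack_sdk import WebClient

    def post_deltas(channel: str, thread_ts: str, deltas: dict[str, float]) -> None:
        client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
        lines = [f"{field}: {delta:+.1%} vs. production" for field, delta in deltas.items()]
        client.chat_postMessage(channel=channel, thread_ts=thread_ts,
                                text="Shadow run finished:\n" + "\n".join(lines))

    post_deltas("C0123456789", "1726056000.000100",
                {"vendor_name": 0.031, "cost_code": 0.012, "bill_date": -0.004})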

Step 5 — Like‑for‑like comparisons

Because experiments share the forked database and fixed sample set, our Omni dashboard compares results side‑by‑side—total amount, vendor, date, customer, and cost code hit rates—so improvements pop at a glance.

Results & What We’ve Learned

In the first month of using the shadow system, we more than tripled cost code accuracy while maintaining precision. That gave us the confidence to roll it out to 100 early customers.

What we’ve learned:

  • Small, field‑specific agents are easier to improve than one giant prompt.
  • Nightly forks and fixed samples strike a great balance between reliability and control.
  • Pairing metrics (dbt/Snowflake) with traces (LangSmith) turns “why did this fail?” into an answerable question.
  • Opening the loop to non‑engineers increases the rate of good ideas.

Early access customer reactions:

  • Muhammad from Apparatus, an accounting firm working with Adaptive: “AI now does the heavy lifting on our bills and receipts. We still review everything, but 95–98% of the AI assignments are accurate right out of the gate. That level of efficiency and accuracy just wasn’t possible when we were doing it all manually.”
  • Matt Gummersbach, Product Manager at Adaptive: “The experiments’ framework let me improve our agent performance on my own. I didn’t have to wait for engineering bandwidth — I could design, run, and measure experiments myself.”

Thanks for reading. If you’re building AI pipelines that ingest invoices or other documents—OCR and multi-agent stacks—we’d love to compare notes!
