A shift from reconstruction to observation.
Here's the workflow most teams follow: an error fires in production, someone opens the stack trace, and the guessing begins. What were the inputs? What was the database state? Which API responses came back? You try to recreate the conditions locally, seed a database with plausible data, mock the external calls, run the code, and hope the same failure shows up. More often than not, it doesn't.
You're reconstructing a causal chain from incomplete information. The reproduction rate is low because you're working backwards from an effect and trying to guess the cause. Most teams spend more time trying to reproduce bugs than they spend actually fixing them. The fix itself is usually a few lines. The reproduction is the bottleneck.
This is the default approach because, historically, there was no alternative. You didn't have the failure state captured. You only had the aftermath. But that constraint doesn't hold anymore.
If your observability layer captures the IO timeline, local variables, request context, state reads, and environment metadata at failure time, you're not guessing anymore. You have a snapshot of the application's state at the exact moment it broke. The request that triggered the error, the database queries that ran and the rows they returned, the HTTP calls that went out and the responses that came back. All of it, frozen in time.
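To make that concrete, here is one possible shape for such a captured "error package." Every field name here is an illustrative assumption, not a real product schema:

```python
# Hypothetical shape of a captured error package. Field names are
# illustrative placeholders, not any vendor's actual schema.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class IOEvent:
    kind: str                  # "sql", "http", "redis", "s3", ...
    request: Any               # query text, outgoing HTTP request, command
    response: Any              # rows returned, response body, reply
    latency_ms: float = 0.0    # observed latency, replayed by stubs

@dataclass
class ErrorPackage:
    fingerprint: str                      # hash of stack trace + context
    stack_trace: str
    request_context: dict[str, Any]       # route, headers, user id, ...
    local_variables: dict[str, Any]       # locals at the throw site
    io_timeline: list[IOEvent] = field(default_factory=list)
    environment: dict[str, str] = field(default_factory=dict)
```

Everything downstream (fixtures, stubs, triage) can be derived from one object like this, which is why capturing it at throw time pays off.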
The question changes shape entirely. It's no longer "what happened?" because you already know what happened. It's "how do I put another environment into this exact state and watch it break again?" That's a different problem. The first is forensic guesswork. The second is engineering.
Most modern observability stacks already collect pieces of this data. Structured logs capture request context. Distributed traces capture call graphs. The missing piece is usually the local variable state and the full IO timeline at the throw site. Once you capture those, you have everything you need to skip the guessing phase entirely.
Reproduction becomes a deterministic operation. Take the captured data, use it to set up an environment that matches the failure state, run the code path, confirm the failure. You're not replaying inputs and hoping for the same output. You're directly forcing the state and letting the code execute against it.
The difference is mechanical. Traditional reproduction is probabilistic: you approximate the conditions and see if the bug shows up. Observation-based reproduction is deterministic: you set the exact conditions and run the exact code path. The failure either reproduces or it doesn't. No guessing involved.
This also means reproduction attempts are repeatable. If the first attempt confirms the failure, you've got your repro case. If it doesn't, the captured data was incomplete or the root cause lies elsewhere. Either way, you know within seconds instead of hours.
You need four things from the reproduction environment. Isolation, because the forced state might be destructive and you don't want it touching anything real. Speed, because waiting minutes for a VM to boot defeats the purpose. Network access, because the application still makes real network calls; they just land on stubbed endpoints. And disposability, because the VM should be thrown away after confirmation.
Firecracker microVMs give you all four. Firecracker is a lightweight virtual machine monitor built by AWS for Lambda and Fargate. It's open source, and it's designed for density and speed: launch rates of up to 150 microVMs per second per host, each VM with its own kernel, its own filesystem, and its own network namespace. Boot to ready in under 150ms.
The lifecycle is simple: receive an error package, spin up a Firecracker VM with the right rootfs, inject the fixture data and stub configuration, start the application, trigger the code path, observe the result, destroy the VM. The whole thing takes seconds. The compute cost is negligible.
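The lifecycle above can be sketched as an orchestration loop. The `VMDriver` interface is hypothetical; a real implementation would talk to the Firecracker API socket (machine config, rootfs drive, `InstanceStart`), but the control flow is the same:

```python
# Sketch of the reproduction lifecycle. VMDriver is a hypothetical
# interface; the package dict mirrors the captured error package.
from typing import Protocol

class VMDriver(Protocol):
    def boot(self, rootfs: str) -> str: ...                              # returns vm_id
    def inject(self, vm_id: str, fixtures: dict, stubs: dict) -> None: ...
    def run_code_path(self, vm_id: str, entrypoint: str) -> dict: ...
    def destroy(self, vm_id: str) -> None: ...

def reproduce(driver: VMDriver, package: dict) -> str:
    vm_id = driver.boot(package["rootfs"])
    try:
        driver.inject(vm_id, package["fixtures"], package["stubs"])
        result = driver.run_code_path(vm_id, package["entrypoint"])
        # Confirmed only if the same error fingerprint fires again.
        if result.get("fingerprint") == package["fingerprint"]:
            return "confirmed"
        return "inconclusive"  # capture was incomplete, or cause lies elsewhere
    finally:
        driver.destroy(vm_id)  # VMs are disposable: always torn down
```

Note the `finally`: the VM is destroyed whether or not the failure reproduces, which is what keeps the environment disposable and the compute cost bounded.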
You don't need a full copy of production data. The error package's IO timeline shows the exact queries that ran and what they returned. A structured fixture DSL takes those query traces and creates minimal data. If the query was `SELECT * FROM users WHERE id = 47` and it returned `{ id: 47, role: 'admin', suspended: true }`, the fixture creates exactly that row. Nothing more.
This approach is fast. Creating three or four rows in a SQLite database takes microseconds. It's deterministic because the data comes directly from the captured IO, not from a snapshot that might have drifted. And it doesn't need access to production databases, which matters for security and compliance.
The fixture DSL is declarative. It reads the IO timeline, extracts every database interaction, and generates the minimal set of tables and rows needed to satisfy those queries. If the application code runs a join across two tables, the fixture creates both tables with exactly the rows needed for that join to produce the same result as production.
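A minimal sketch of that fixture step against SQLite, using the example query from above. The trace format (`table`, `rows` alongside each query) is an assumption about what a capture layer would record:

```python
# Minimal-fixture sketch: create only the rows the captured queries
# returned, then verify the query reproduces the same result.
import sqlite3

trace = [{
    "query": "SELECT * FROM users WHERE id = 47",
    "table": "users",
    "rows": [{"id": 47, "role": "admin", "suspended": True}],
}]

def build_fixture(conn: sqlite3.Connection, trace: list) -> None:
    for entry in trace:
        cols = list(entry["rows"][0].keys())
        # SQLite accepts typeless columns, which keeps the fixture generic.
        conn.execute(f"CREATE TABLE IF NOT EXISTS {entry['table']} ({', '.join(cols)})")
        for row in entry["rows"]:
            placeholders = ", ".join("?" for _ in cols)
            conn.execute(
                f"INSERT INTO {entry['table']} VALUES ({placeholders})",
                [row[c] for c in cols],
            )

conn = sqlite3.connect(":memory:")
build_fixture(conn, trace)
# Re-running the captured query yields the captured row
# (SQLite stores booleans as integers, so True comes back as 1).
assert conn.execute(trace[0]["query"]).fetchone() == (47, "admin", 1)
```

The fixture contains exactly one row, yet the query behaves as it did in production, which is the whole point: minimal data, identical result.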
If the IO timeline shows an outgoing HTTP call to a payment API that returned a 503, the environment stubs that endpoint to return 503. If a Redis call timed out after 3 seconds, the stub simulates a 3-second timeout. If an S3 GetObject returned a specific byte payload, the stub returns that exact payload. Every external dependency is replaced with a stub that replays the exact behavior observed during the failure.
The application doesn't know the difference. From its perspective, the payment API is slow, Redis is timing out, and S3 is returning the expected data. The stubs are configured entirely from the IO timeline. No manual mock setup, no guessing at API contracts, no maintaining a separate test fixture for each service.
This is where the captured data pays for itself. Traditional reproduction requires someone to look up what the payment API returns on a 503, figure out the response body format, and write a mock. With captured IO, the stub is auto-generated. The response headers, body, status code, and latency are all recorded. The stub replays them verbatim.
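A replay stub generated from the timeline can be as simple as a lookup table. The recording format and the `(method, url)` matching key are assumptions; real traffic matching would also consider bodies and query strings:

```python
# Replay-stub sketch: external calls are answered verbatim from the
# recorded IO timeline. Field names here are illustrative.
import time

recorded = {
    ("POST", "https://payments.example.com/charge"): {
        "status": 503,
        "headers": {"Retry-After": "30"},
        "body": '{"error": "service unavailable"}',
        "latency_ms": 1200,
    },
}

class ReplayStub:
    def __init__(self, recordings: dict, simulate_latency: bool = False):
        self.recordings = recordings
        self.simulate_latency = simulate_latency

    def handle(self, method: str, url: str) -> dict:
        # An unrecorded call raises KeyError: a gap in the capture,
        # which itself is useful triage information.
        rec = self.recordings[(method, url)]
        if self.simulate_latency:
            time.sleep(rec["latency_ms"] / 1000)  # replay observed slowness
        return rec

stub = ReplayStub(recorded)
resp = stub.handle("POST", "https://payments.example.com/charge")
assert resp["status"] == 503  # exactly what production observed
```

Latency replay is opt-in here because a confirmation run usually only needs the recorded responses, not the recorded waiting.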
Two agents, asymmetric by design. The master agent handles triage: it reads the error package, classifies causality, decides whether reproduction is warranted, and orchestrates the VM lifecycle. It reasons infrequently but with high fidelity.
The slave agent operates inside the VM. It applies fixture rules, configures stubs, starts the application, triggers the failing code path, and reports the outcome. Mechanical execution, no reasoning required.
The cost structure follows the split. Intelligence at the decision layer, efficiency at the execution layer.
Not every error warrants a VM. The triage pipeline applies four progressive filters.
Tier 1: Fingerprint deduplication. Identical stack trace and context as an existing issue. Increment, link, move on.
Tier 2: Root cause classification from captured data. The IO timeline makes causality self-evident. A connection timeout preceding the throw, a missing row after a query. No execution needed.
Tier 3: Novel error with unclear causality. The sequence of events is visible but the causal link is not. Possible race conditions, accumulated state corruption. Spin up a VM, reproduce, confirm.
Tier 4: Escalation. Reproduction failed or the failure depends on conditions outside the IO timeline: hardware timing, kernel behavior, intermittent network partitions. Route to a developer.
Most errors resolve at Tier 1 or 2. Compute is spent only where causality demands it.
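The four tiers reduce to a small classifier. The flags (`cause_evident_from_timeline`, `depends_on_uncaptured_conditions`) are hypothetical names for signals the master agent would derive from the error package:

```python
# Triage sketch mapping an error package to a tier (1-4).
# The heuristics and field names are illustrative placeholders.
def triage(pkg: dict, known_fingerprints: set) -> int:
    if pkg["fingerprint"] in known_fingerprints:
        return 1  # duplicate: increment the existing issue, move on
    if pkg.get("cause_evident_from_timeline"):
        return 2  # IO timeline already explains the failure
    if pkg.get("depends_on_uncaptured_conditions"):
        return 4  # hardware timing, kernel behavior: escalate to a developer
    return 3      # novel, unclear causality: spin up a VM and reproduce
```

A Tier 3 attempt that fails to reproduce is then re-routed to Tier 4, so escalation happens both before and after execution.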
The majority of production errors -- null returns, downstream timeouts, expired tokens, malformed input -- are fully explained by the captured IO timeline. Root cause is immediate. No execution required.
Reproduction targets the remaining 10-20%: errors where the captured sequence of events does not self-evidently explain the failure. Reproduction is a tool for establishing causality, not a default ritual.
The master agent activates only for Tier 3+ errors. Per-call cost is bounded by the structured nature of error packages. The slave agent handles mechanical execution at marginal cost. Firecracker VMs are ephemeral -- no idle compute, no standing infrastructure.
A single prevented manual reproduction session (typically 1-2 hours of developer time) offsets thousands of automated VM spin-ups. The cost difference is two to three orders of magnitude.
The debugging workflow inverts. The question shifts from "how do I reproduce this?" to "do I need to?" For the ~80% resolved from captured data alone, a developer goes straight to writing the patch. For the ~20% requiring execution, a confirmed repro case lands in the queue within seconds, full context attached.
Developer time moves from reconstruction to resolution. The work that remains is the work that matters: understanding the bug, writing the fix, verifying it, and shipping it.