Multi-Agent Implementation Experience

A multi-agent development system was built where agents drive issues from a backlog to merged pull requests with minimal human intervention. In the three months since 2026-02-09, the plugin repository accumulated 545 commits, shipped 4 minor/patch releases (v1.47.2 through v1.51.1), and processed issues from COR-1091 through COR-1592 — a working scope of more than 500 tracked items. The system today is composed of 5 orchestrator skills, 31 utility skills, and 18 agents.

The pipeline at a glance

The core loop is driven by an orchestrator skill (the pipeline skill) that never decides the next step from agent narrative. Instead, it defers to a deterministic state probe that inspects ground truth in the issue tracker, the git server (Gitea), and the PR state. The probe returns one of five verdict types (RESOLVER, REVIEWER, RESOLVER_FIX, DONE, or a member of the ABORT_* family), and the orchestrator routes accordingly. The loop is capped at 5 iterations per issue.

flowchart LR
    Issue[Issue in tracker]
    Orchestrator[Orchestrator skill]
    Probe[State probe]
    Specialist[Specialist agent\nfrontend / backend / infra / docs]
    Reviewer[Reviewer agent]
    Resolver[Change-request resolver]
    Merged([PR merged])

    Issue --> Orchestrator
    Orchestrator --> Probe
    Probe -->|RESOLVER| Specialist
    Probe -->|REVIEWER| Reviewer
    Probe -->|RESOLVER_FIX| Resolver
    Specialist --> Probe
    Reviewer --> Probe
    Resolver --> Probe
    Probe -->|DONE| Merged
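
The routing loop described above can be sketched in a few lines. This is a minimal illustration, not the plugin's actual code: the names `Verdict`, `run_pipeline`, `probe`, and `agents` are hypothetical, the single `ABORT` member stands in for the whole ABORT_* family, and the final probe after the last dispatch is an assumption about how the cap is counted.

```python
from enum import Enum
from typing import Callable, Dict


class Verdict(Enum):
    RESOLVER = "RESOLVER"
    REVIEWER = "REVIEWER"
    RESOLVER_FIX = "RESOLVER_FIX"
    DONE = "DONE"
    ABORT = "ABORT"  # stands in for the whole ABORT_* family


MAX_ITERATIONS = 5  # per-issue cap stated in the text

Probe = Callable[[str], Verdict]
Agent = Callable[[str], object]


def run_pipeline(issue_id: str, probe: Probe, agents: Dict[Verdict, Agent]) -> str:
    """Route on probe verdicts only; agent return values are never consulted."""
    for _ in range(MAX_ITERATIONS):
        verdict = probe(issue_id)
        if verdict is Verdict.DONE:
            return "merged"
        if verdict is Verdict.ABORT:
            return "aborted"
        agents[verdict](issue_id)  # narrative output is deliberately discarded
    # Assumption: one last probe so a merge on the final dispatch is not misreported.
    return "merged" if probe(issue_id) is Verdict.DONE else "iteration_cap"
```

The point of the shape is that whatever an agent says about its own progress has no path into the routing decision; only the next probe result does.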

Throughput

The table below covers the one fully measured case study from the measurement window. The data comes from project memory (project_meeting_epic_impl_stats.md).

Metric	Value
Scope	One epic, fully implemented end-to-end
Issues resolved	11
Pull requests merged	13
Wall-clock duration	3 hours 28 minutes
Lines of code produced	10,224
Throughput	~49 lines / minute
Comparison vs. estimated solo human effort	~1,600× faster
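
As a sanity check, the derived figures in the table can be recomputed from the raw ones. The snippet below uses only the numbers in the table. The implied human baseline is simply what the stated ~1,600× ratio works out to, not an independently measured value.

```python
# Raw figures taken from the table above.
duration_minutes = 3 * 60 + 28      # 3 hours 28 minutes
lines_of_code = 10_224
stated_speedup = 1_600              # rough ratio quoted in the table

# Throughput in lines per wall-clock minute.
throughput = lines_of_code / duration_minutes

# Solo-human effort implied by the stated ratio, in hours.
implied_baseline_hours = duration_minutes * stated_speedup / 60

print(f"{duration_minutes} minutes")                        # 208 minutes
print(f"{throughput:.1f} lines/minute")                     # 49.2 lines/minute
print(f"{implied_baseline_hours:,.0f} hours implied baseline")  # 5,547 hours implied baseline
```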

Caveats:

This is a single observed run, not a cross-project average.
Wall-clock duration is mostly agent thinking and tool calls, not human review time.
"1,600×" is a rough comparison against an estimated solo-human baseline for the same scope.

System shape

The system is layered into three zones: the main session, sub-agent processes, and external systems. A skill is a lightweight inline procedure that runs in the main session context. An agent is an isolated sub-process with its own context window. State and audit trail live in the external systems — the issue tracker (YouTrack), the git server (Gitea), and the team chat (Discord) — not in the agents themselves.

flowchart TB
    subgraph Main[Main session]
        Orch[Orchestrator skills]
        Probe[State probe]
        Conv[Shared conventions]
    end
    subgraph Sub[Sub-agent processes]
        Coders[Specialist coders]
        Review[Reviewer]
        Resolve[Resolver]
    end
    subgraph Ext[External systems]
        Track[Issue tracker]
        Git[Git server]
        Chat[Chat / notifications]
    end
    Main --> Sub
    Main --> Ext
    Sub --> Ext

This layering keeps the main session thin and the sub-agents stateless between runs. When a sub-agent fails mid-run, the issue tracker retains the last known good state, and the orchestrator can re-dispatch without reconstructing intermediate state.
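
Re-dispatch works without reconstructing state because the probe can be a pure function of what the external systems report. The sketch below is one hypothetical way to express that. The `Snapshot` fields, the PR state strings, the mapping from states to verdicts, and the `ABORT_UNKNOWN_STATE` name are all assumptions made for illustration; the source only names the verdicts themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Ground truth read from the tracker and git server (hypothetical fields)."""
    issue_open: bool
    pr_state: Optional[str]  # None, "open", "changes_requested", or "merged"


def probe(snapshot: Snapshot) -> str:
    """Derive the verdict from external state alone; no agent memory is consulted."""
    if snapshot.pr_state == "merged":
        return "DONE"
    if snapshot.pr_state == "changes_requested":
        return "RESOLVER_FIX"
    if snapshot.pr_state == "open":
        return "REVIEWER"
    if snapshot.pr_state is None and snapshot.issue_open:
        return "RESOLVER"
    return "ABORT_UNKNOWN_STATE"  # hypothetical member of the ABORT_* family
```

Because the function depends only on the snapshot, a sub-agent that crashes mid-run leaves nothing to recover: the next probe call sees whatever the tracker and git server actually recorded.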

Five recurring failure shapes

1. Sub-procedure auto-promoted to sub-agent (commit `f685c9a5`)

A reusable routing procedure was packaged as a callable unit. The framework auto-registered it as an agent, so the orchestrator dispatched it as an isolated sub-agent each time, costing approximately 9% of token budget per run. After the refactor, the routing procedure was moved into a shared convention document and inlined; that 9% disappeared.

2. Sub-modules' intermediate output mistaken for terminal output (commits `5c86501`, `f2bde768`)

Reviewer agents finished their checks and returned narrative summaries without performing the actual merge or change-request action. Pull requests sat in the Ready state, the orchestrator re-dispatched the reviewer, and runs hit the 5-iteration cap. After adding an explicit "terminal-action exit gate" that requires proof of side-effect emission before return, this class of run-to-cap aborts dropped to near zero.
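
A terminal-action exit gate of this kind might look like the following sketch. The `ReviewerResult` fields, the action names, and the idea of using a URL as proof are hypothetical; the source only says that proof of the side effect is required before the agent may return.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names for the two side effects a reviewer may end on.
TERMINAL_ACTIONS = frozenset({"merged", "changes_requested"})


class MissingTerminalAction(Exception):
    """Raised when a reviewer tries to return with only a narrative."""


@dataclass
class ReviewerResult:
    summary: str
    terminal_action: Optional[str] = None
    evidence_url: Optional[str] = None  # e.g. link to the merge or review comment


def exit_gate(result: ReviewerResult) -> ReviewerResult:
    """Let the result through only if it names a terminal action and its proof."""
    if result.terminal_action not in TERMINAL_ACTIONS:
        raise MissingTerminalAction("no merge or change request was performed")
    if not result.evidence_url:
        raise MissingTerminalAction("terminal action claimed without evidence")
    return result
```

A summary such as "all checks look good" fails the gate, so the agent has to perform the merge or file the change request before it can finish.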

3. Soft validation gates were always overridable (commit `a10dd51`)

A judgment-based check on the PR base branch was being satisfied by narrative justification in the PR body. After replacing it with a hard validation on the (base branch, branch name, title prefix) tuple, no further base-branch violations were merged.
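
The difference between a soft and a hard gate shows up in what the check accepts as input. In the sketch below, the allowed combinations are invented for illustration (the source does not list the real ones), and `validate_pr` is a hypothetical name. The PR body is not a parameter at all, so no justification written there can affect the outcome.

```python
from typing import NamedTuple


class PrShape(NamedTuple):
    base_branch: str
    branch_prefix: str
    title_prefix: str


# Hypothetical allow-list; the real rules are not given in the source.
ALLOWED_SHAPES = frozenset({
    PrShape("develop", "feature/", "feat:"),
    PrShape("develop", "fix/", "fix:"),
    PrShape("main", "hotfix/", "hotfix:"),
})


def validate_pr(base_branch: str, branch_name: str, title: str) -> bool:
    """Accept only if the whole tuple matches one allowed shape."""
    return any(
        base_branch == shape.base_branch
        and branch_name.startswith(shape.branch_prefix)
        and title.startswith(shape.title_prefix)
        for shape in ALLOWED_SHAPES
    )
```

For example, `validate_pr("main", "feature/login", "feat: add login")` is rejected under these assumed rules, however the PR body argues for it.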

4. A test-coverage gate could not distinguish markup from logic (commit `0b68cccc`)

The gate flagged any view-template file with added lines as needing tests. Cosmetic UI changes were repeatedly blocked. After teaching the gate to inspect inside code blocks rather than count added lines, false-positive blocks ceased. A complementary upstream change pre-marks markup-only work at planning time.
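
One way to separate markup from logic is to track which lines of a template fall inside code blocks and test only those against the diff. The sketch below assumes, purely for illustration, that logic lives in `<script>` blocks. The source does not say which template language or block delimiters the real gate handles, and `logic_lines` and `needs_tests` are hypothetical names.

```python
import re
from typing import Iterable, Set

# Assumed delimiters: logic lives between <script ...> and </script>.
BLOCK_OPEN = re.compile(r"<script\b[^>]*>", re.IGNORECASE)
BLOCK_CLOSE = re.compile(r"</script>", re.IGNORECASE)


def logic_lines(template_text: str) -> Set[int]:
    """Return the 1-based line numbers that sit inside a code block."""
    inside = False
    result: Set[int] = set()
    for number, line in enumerate(template_text.splitlines(), start=1):
        if not inside and BLOCK_OPEN.search(line):
            inside = True
        if inside:
            result.add(number)  # delimiter lines count as logic, to stay conservative
            if BLOCK_CLOSE.search(line):
                inside = False
    return result


def needs_tests(template_text: str, added_lines: Iterable[int]) -> bool:
    """Flag the file only when an added line lands inside a code block."""
    return bool(logic_lines(template_text) & set(added_lines))
```

With this shape, a change that only touches a heading or a CSS class passes the gate, while a change to a function body still requires tests.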

5. Specialists had no tool to verify external specifications (commit `c79fa56`)

Code targeting infrastructure schemas (Kubernetes, Helm, cloud APIs) sometimes referenced fields that did not exist. After granting web-fetch tools to specialist agents and requiring source URLs in commit and PR bodies for external-spec changes, this class of error became mechanically reviewable instead of depending on a human to catch it.
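
The mechanical part of that review can be as small as the check below. The path globs that mark a change as "external-spec" and the function name `missing_citation` are assumptions for illustration; the source states only that such changes must carry a source URL in the commit and PR bodies.

```python
import re
from fnmatch import fnmatchcase
from typing import Iterable

# Hypothetical markers for files that encode an external schema.
EXTERNAL_SPEC_GLOBS = ("helm/*", "k8s/*", "*.tf")
URL_PATTERN = re.compile(r"https?://\S+")


def missing_citation(changed_files: Iterable[str], body: str) -> bool:
    """True when an external-spec change carries no source URL in its body."""
    touches_spec = any(
        fnmatchcase(path, glob)
        for path in changed_files
        for glob in EXTERNAL_SPEC_GLOBS
    )
    return touches_spec and URL_PATTERN.search(body) is None
```

A reviewer agent can run this before any deeper inspection and request changes immediately when it returns `True`. Checking that the cited page actually supports the field in question is a separate step.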

What practices became hard requirements

Practice	Was	Now (in agent-driven workflow)
Single responsibility per module	Code-quality habit	Determines whether a task can be cleanly routed to one specialist
Clear naming of callable units	Readability	Names directly drive dispatch — collisions cause unintended sub-agent spawns
Test coverage	Quality safeguard	Binary gate; absence blocks merge automatically
External-spec citations	Polite documentation	Audit trail; reviewer agents check for them mechanically

Closing

The throughput numbers above are real, but they apply only to the kind of work the system is designed for. The recurring failure shapes are not bugs in any one component; they are the cost of running a chain of probabilistic agents over deterministic infrastructure. Most of the infrastructure-level work in the last three months has gone into making invariants explicit enough that no narrative can override them.
