Multi-Agent Implementation Experience

A multi-agent development system was built where agents drive issues from a backlog to merged pull requests with minimal human intervention. In the three months since 2026-02-09, the plugin repository accumulated 545 commits, shipped 4 minor/patch releases (v1.47.2 through v1.51.1), and processed issues from COR-1091 through COR-1592 — a working scope of more than 500 tracked items. The system today is composed of 5 orchestrator skills, 31 utility skills, and 18 agents.

The pipeline at a glance

The core loop is driven by an orchestrator skill (the pipeline skill) that never decides the next step from agent narrative. Instead, it defers to a deterministic state probe that inspects ground-truth from the issue tracker, the git server (Gitea), and the PR state. The probe returns one of five verdicts (RESOLVER, REVIEWER, RESOLVER_FIX, DONE, ABORT_*) and the orchestrator routes accordingly. The loop is capped at 5 iterations per issue.

flowchart LR
    Issue[Issue in tracker]
    Orchestrator[Orchestrator skill]
    Probe[State probe]
    Specialist[Specialist agent\nfrontend / backend / infra / docs]
    Reviewer[Reviewer agent]
    Resolver[Change-request resolver]
    Merged([PR merged])

    Issue --> Orchestrator
    Orchestrator --> Probe
    Probe -->|RESOLVER| Specialist
    Probe -->|REVIEWER| Reviewer
    Probe -->|RESOLVER_FIX| Resolver
    Specialist --> Probe
    Reviewer --> Probe
    Resolver --> Probe
    Probe -->|DONE| Merged
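
The routing described above can be sketched in a few lines. This is a minimal illustration, not the project's actual orchestrator: the `Verdict` enum, the `run_pipeline` function, the return strings, and the callable signatures are all assumptions, and the single `ABORT` member stands in for the whole ABORT_* family. The point it shows is that the next step comes only from the probe, never from what an agent says it did.

```python
from enum import Enum
from typing import Callable, Dict

MAX_ITERATIONS = 5  # per-issue cap stated in the text


class Verdict(Enum):
    RESOLVER = "RESOLVER"
    REVIEWER = "REVIEWER"
    RESOLVER_FIX = "RESOLVER_FIX"
    DONE = "DONE"
    ABORT = "ABORT"  # stands in for the ABORT_* family (assumption)


def run_pipeline(
    issue_id: str,
    probe: Callable[[str], Verdict],
    agents: Dict[Verdict, Callable[[str], None]],
) -> str:
    """Route on the probe's verdict only; agent return values are ignored."""
    for _ in range(MAX_ITERATIONS):
        verdict = probe(issue_id)  # re-read ground truth on every iteration
        if verdict is Verdict.DONE:
            return "merged"
        if verdict is Verdict.ABORT:
            return "aborted"
        agents[verdict](issue_id)  # side effects land in the external systems
    # One last probe so work finished by the final dispatch is not misreported.
    return "merged" if probe(issue_id) is Verdict.DONE else "cap_reached"
```

Because the agent callables return nothing the loop reads, a sub-agent that crashes or returns a misleading summary changes the outcome only through what the probe observes next.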

Throughput

The table below covers the one fully measured case study from the measurement window. The data comes from project memory (project_meeting_epic_impl_stats.md).

| Metric | Value |
|---|---|
| Scope | One epic, fully implemented end-to-end |
| Issues resolved | 11 |
| Pull requests merged | 13 |
| Wall-clock duration | 3 hours 28 minutes |
| Lines of code produced | 10,224 |
| Throughput | ~49 lines / minute |
| Comparison vs. estimated solo human effort | ~1,600× faster |

Caveats:

  • This is a single observed run, not a cross-project average.
  • Wall-clock duration is mostly agent thinking and tool calls, not human review time.
  • "1,600×" is a rough comparison against an estimated solo-human baseline for the same scope.
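
The throughput figure can be reproduced from the reported duration and line count. The sketch below also works out the solo-human baseline that the multiplier would imply if it is applied to wall-clock time; that reading is an assumption, and the resulting number is derived, not measured.

```python
duration_min = 3 * 60 + 28            # 3 h 28 min -> 208 minutes
lines_of_code = 10_224
lines_per_minute = lines_of_code / duration_min

speedup = 1_600                       # rough multiplier quoted in the table
# Assumption: the multiplier compares wall-clock time against human hours.
implied_human_hours = duration_min * speedup / 60

print(round(lines_per_minute, 1))     # 49.2
print(round(implied_human_hours))     # 5547
```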

System shape

The system is layered into three zones: the main session, sub-agent processes, and external systems. A skill is a lightweight inline procedure that runs in the main session context. An agent is an isolated sub-process with its own context window. State and audit trail live in the external systems — the issue tracker (YouTrack), the git server (Gitea), and the team chat (Discord) — not in the agents themselves.

flowchart TB
    subgraph Main[Main session]
        Orch[Orchestrator skills]
        Probe[State probe]
        Conv[Shared conventions]
    end
    subgraph Sub[Sub-agent processes]
        Coders[Specialist coders]
        Review[Reviewer]
        Resolve[Resolver]
    end
    subgraph Ext[External systems]
        Track[Issue tracker]
        Git[Git server]
        Chat[Chat / notifications]
    end
    Main --> Sub
    Main --> Ext
    Sub --> Ext

This layering keeps the main session thin and the sub-agents stateless between runs. When a sub-agent fails mid-run, the issue tracker retains the last known good state, and the orchestrator can re-dispatch without reconstructing intermediate state.

Five recurring failure shapes

1. Sub-procedure auto-promoted to sub-agent (commit f685c9a5)

A reusable routing procedure was packaged as a callable unit. The framework auto-registered it as an agent, so the orchestrator dispatched it as an isolated sub-agent each time, costing approximately 9% of the token budget per run. Once the routing procedure was moved into a shared convention document and inlined, that 9% overhead disappeared.

2. Sub-modules' intermediate output mistaken for terminal output (commits 5c86501, f2bde768)

Reviewer agents finished their checks and returned narrative summaries without performing the actual merge or change-request action. Pull requests sat in Ready state, the orchestrator re-dispatched the reviewer, and runs hit the 5-iteration cap. After adding an explicit "terminal-action exit gate" requiring proof of side-effect emission before return, this class of run-to-cap aborts dropped to near-zero.
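
One way to picture such an exit gate is shown below. It is a sketch under stated assumptions rather than the gate added in those commits: `ReviewerResult`, its fields, the `TERMINAL_ACTIONS` set, and the `pr_state` callable are hypothetical names. The gate refuses a return that carries only a summary, and it confirms the claimed action against the pull request's observed state.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical terminal actions for a reviewer; the real set is not given in the text.
TERMINAL_ACTIONS = {"merged", "changes_requested"}


@dataclass
class ReviewerResult:
    summary: str                       # narrative; never used for routing
    action: Optional[str] = None       # terminal action the agent claims it performed
    evidence_id: Optional[str] = None  # e.g. a merge commit or review id (hypothetical)


class ExitGateError(Exception):
    """Raised when an agent tries to return without a proven terminal action."""


def exit_gate(result: ReviewerResult, pr_state: Callable[[], str]) -> ReviewerResult:
    if result.action not in TERMINAL_ACTIONS or not result.evidence_id:
        raise ExitGateError("no terminal action claimed; a summary alone is not enough")
    observed = pr_state()  # re-read ground truth instead of trusting the claim
    if observed != result.action:
        raise ExitGateError(f"claimed {result.action!r} but PR is {observed!r}")
    return result
```

A reviewer that returns `ReviewerResult("looks fine")` while the pull request is still in Ready state fails at the first check, so the failure surfaces in that run instead of consuming the remaining iterations.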

3. Soft validation gates were always overridable (commit a10dd51)

A judgment-based check on the PR base branch was being satisfied by narrative justification in the PR body. After replacing it with a hard validation on the (base branch, branch name, title prefix) tuple, no further base-branch violations were merged.
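
A hard check on that tuple might look like the following. The rule table is entirely hypothetical (the branch names, patterns, and title prefixes are placeholders, not the project's conventions); what matters is that the PR body is not an input, so no justification written there can change the result.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    base: str
    branch_pattern: str
    title_prefixes: Tuple[str, ...]


# Hypothetical rule table; the real allowed combinations are not given in the text.
RULES = (
    Rule("develop", r"feature/COR-\d+.*", ("feat:", "fix:", "docs:")),
    Rule("main", r"hotfix/COR-\d+.*", ("fix:",)),
)


def validate_pr(base: str, branch: str, title: str) -> Optional[str]:
    """Return None if the tuple is allowed, otherwise a rejection message."""
    for rule in RULES:
        if (
            base == rule.base
            and re.fullmatch(rule.branch_pattern, branch)
            and title.startswith(rule.title_prefixes)
        ):
            return None
    return f"rejected: ({base!r}, {branch!r}, {title!r}) matches no allowed tuple"
```

For example, `validate_pr("main", "feature/COR-1200-login", "feat: add login")` returns a rejection message under these placeholder rules, whatever the PR description says.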

4. A test-coverage gate could not distinguish markup from logic (commit 0b68cccc)

The gate flagged any view-template file with added lines as needing tests. Cosmetic UI changes were repeatedly blocked. After teaching the gate to inspect inside code blocks rather than count added lines, false-positive blocks ceased. A complementary upstream change pre-marks markup-only work at planning time.
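
The difference between counting added lines and inspecting code blocks can be illustrated with a small sketch. The template technology is not named in the text, so the `<script>` delimiters here are an assumption, as are the function name and the 1-indexed `added` set; the sketch also assumes each delimiter sits on its own line.

```python
from typing import Iterable, Set

# Assumption: logic lives between these delimiters, each on its own line.
BLOCK_OPEN, BLOCK_CLOSE = "<script", "</script>"


def needs_tests(template_lines: Iterable[str], added: Set[int]) -> bool:
    """True if any added (1-indexed) line lies inside a code block."""
    inside = False
    for number, line in enumerate(template_lines, start=1):
        stripped = line.strip()
        if stripped.startswith(BLOCK_OPEN):
            inside = True
            continue  # the delimiter line itself counts as markup
        if stripped.startswith(BLOCK_CLOSE):
            inside = False
            continue
        if inside and stripped and number in added:
            return True
    return False
```

With this shape, a change that only touches heading or container markup returns `False`, while a single added line inside a code block returns `True`.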

5. Specialists had no tool to verify external specifications (commit c79fa56)

Code targeting infrastructure schemas (Kubernetes, Helm, cloud APIs) sometimes referenced fields that did not exist. After specialist agents were granted web-fetch tools and required to include source URLs in commit and PR bodies for external-spec changes, this class of error could be reviewed mechanically instead of depending on a human to catch it.
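
The mechanical part of that review can be as simple as the check below. It is an illustrative sketch: the path patterns that mark a change as external-spec work and the function name are hypothetical, and it only verifies that a URL is present, not that the URL supports the change.

```python
import re
from fnmatch import fnmatch
from typing import Iterable

# Hypothetical path patterns treated as external-spec changes.
SPEC_PATHS = ("charts/*", "k8s/*", "*.tf")
URL = re.compile(r"https?://\S+")


def missing_citation(changed_paths: Iterable[str], pr_body: str) -> bool:
    """True if the change touches external-spec files but cites no source URL."""
    touches_spec = any(
        fnmatch(path, pattern) for path in changed_paths for pattern in SPEC_PATHS
    )
    return touches_spec and URL.search(pr_body) is None
```

A reviewer agent could run this before reading the diff and request changes immediately when it returns `True`.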

What practices became hard requirements

| Practice | Was | Now (in agent-driven workflow) |
|---|---|---|
| Single responsibility per module | Code-quality habit | Determines whether a task can be cleanly routed to one specialist |
| Clear naming of callable units | Readability | Names directly drive dispatch; collisions cause unintended sub-agent spawns |
| Test coverage | Quality safeguard | Binary gate; absence blocks merge automatically |
| External-spec citations | Polite documentation | Audit trail; reviewer agents check for them mechanically |

Closing

The throughput numbers above are real, but they apply only to the kind of in-scope work the system was built for. The recurring failure shapes are not bugs in any one component; they are the cost of running a chain of probabilistic agents over deterministic infrastructure. Most of the infrastructure-level work in the last 3 months has gone into making invariants explicit enough that no narrative can override them.