Multi-Agent Implementation Experience
A multi-agent development system was built in which agents drive issues from a backlog to merged pull requests with minimal human intervention. In the three months since 2026-02-09, the plugin repository accumulated 545 commits, shipped 4 minor/patch releases (v1.47.2 through v1.51.1), and processed issues from COR-1091 through COR-1592, a working scope of more than 500 tracked items. The system today consists of 5 orchestrator skills, 31 utility skills, and 18 agents.
The pipeline at a glance
The core loop is driven by an orchestrator skill (the pipeline skill) that never decides the next step from agent narrative. Instead, it defers to a deterministic state probe that reads ground truth from the issue tracker, the git server (Gitea), and the PR state. The probe returns one of five verdicts (RESOLVER, REVIEWER, RESOLVER_FIX, DONE, or a member of the ABORT_* family), and the orchestrator routes accordingly. The loop is capped at 5 iterations per issue.
flowchart LR
Issue[Issue in tracker]
Orchestrator[Orchestrator skill]
Probe[State probe]
Specialist[Specialist agent\nfrontend / backend / infra / docs]
Reviewer[Reviewer agent]
Resolver[Change-request resolver]
Merged([PR merged])
Issue --> Orchestrator
Orchestrator --> Probe
Probe -->|RESOLVER| Specialist
Probe -->|REVIEWER| Reviewer
Probe -->|RESOLVER_FIX| Resolver
Specialist --> Probe
Reviewer --> Probe
Resolver --> Probe
Probe -->|DONE| Merged
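The routing rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline skill: `run_pipeline`, the `Verdict` enum, and the returned status strings are hypothetical names, the single `ABORT` member stands in for the whole `ABORT_*` family, and the way the 5-iteration cap is counted (one probe call per iteration) is an assumption.

```python
from enum import Enum
from typing import Callable, Dict

MAX_ITERATIONS = 5  # per-issue cap stated in the text


class Verdict(Enum):
    RESOLVER = "RESOLVER"
    REVIEWER = "REVIEWER"
    RESOLVER_FIX = "RESOLVER_FIX"
    DONE = "DONE"
    ABORT = "ABORT"  # assumption: stands in for the ABORT_* family


def run_pipeline(
    issue_id: str,
    probe: Callable[[str], Verdict],
    agents: Dict[Verdict, Callable[[str], str]],
) -> str:
    """Drive one issue until the probe says DONE or ABORT, or the cap is hit."""
    for _ in range(MAX_ITERATIONS):
        # Ground truth is re-read on every iteration.
        verdict = probe(issue_id)
        if verdict is Verdict.DONE:
            return "merged"
        if verdict is Verdict.ABORT:
            return "aborted"
        # The agent's narrative return value is deliberately discarded;
        # only the next probe call decides what happens next.
        agents[verdict](issue_id)
    return "iteration-cap"
```

The design point is that the agent's return value never reaches a branch condition. Even a confident "PR merged" narrative changes nothing unless the probe observes a merged PR.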
Throughput
The table below covers the one fully measured case study from the measurement window. The data comes from project memory (project_meeting_epic_impl_stats.md).
| Metric | Value |
|---|---|
| Scope | One epic, fully implemented end-to-end |
| Issues resolved | 11 |
| Pull requests merged | 13 |
| Wall-clock duration | 3 hours 28 minutes |
| Lines of code produced | 10,224 |
| Throughput | ~49 lines / minute |
| Comparison vs. estimated solo human effort | ~1,600× faster |
Caveats:
- This is a single observed run, not a cross-project average.
- Wall-clock duration is mostly agent thinking and tool calls, not human review time.
- "1,600×" is a rough comparison against an estimated solo-human baseline for the same scope.
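The throughput figure can be re-derived from the table's own numbers. The snippet below is only an arithmetic check. The last line reads "1,600×" as a wall-clock ratio, which is an assumption, since the source does not say how that figure was derived.

```python
# Values copied from the table above.
lines_of_code = 10_224
minutes = 3 * 60 + 28  # 3 hours 28 minutes

loc_per_minute = lines_of_code / minutes

# Assumption: "1,600x faster" is a ratio of wall-clock durations.
implied_human_hours = minutes * 1_600 / 60

print(f"{minutes} min, {loc_per_minute:.1f} lines/min")
print(f"implied solo-human baseline: about {implied_human_hours:,.0f} hours")
```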
System shape
The system is layered into three zones: the main session, sub-agent processes, and external systems. A skill is a lightweight inline procedure that runs in the main session context. An agent is an isolated sub-process with its own context window. State and the audit trail live in the external systems (the issue tracker YouTrack, the git server Gitea, and the team chat Discord), not in the agents themselves.
flowchart TB
subgraph Main[Main session]
Orch[Orchestrator skills]
Probe[State probe]
Conv[Shared conventions]
end
subgraph Sub[Sub-agent processes]
Coders[Specialist coders]
Review[Reviewer]
Resolve[Resolver]
end
subgraph Ext[External systems]
Track[Issue tracker]
Git[Git server]
Chat[Chat / notifications]
end
Main --> Sub
Main --> Ext
Sub --> Ext
This layering keeps the main session thin and the sub-agents stateless between runs. When a sub-agent fails mid-run, the issue tracker retains the last known good state, and the orchestrator can re-dispatch without reconstructing intermediate state.
Five recurring failure shapes
1. Sub-procedure auto-promoted to sub-agent (commit f685c9a5)
A reusable routing procedure was packaged as a callable unit. The framework auto-registered it as an agent, so the orchestrator dispatched it as an isolated sub-agent on every run, at a cost of approximately 9% of the token budget per run. The refactor moved the routing procedure into a shared convention document and inlined it, which removed that 9% overhead.
2. Sub-modules' intermediate output mistaken for terminal output (commits 5c86501, f2bde768)
Reviewer agents finished their checks and returned narrative summaries without performing the actual merge or change-request action. Pull requests sat in the Ready state, the orchestrator re-dispatched the reviewer, and runs hit the 5-iteration cap. After an explicit "terminal-action exit gate" was added, requiring proof that the side effect had been emitted before the agent returns, this class of run-to-cap aborts dropped to near zero.
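One way to picture such an exit gate is a check that refuses a reviewer result unless it names a terminal action and carries an identifier for the side effect. The sketch below is hypothetical throughout: `ReviewerResult`, `exit_gate`, the action names, and the `evidence_id` field are illustrative, not the system's real interface. It checks only that evidence is present. A real gate would presumably confirm the identifier against the git server.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names for the two terminal reviewer actions.
TERMINAL_ACTIONS = {"merge", "request_changes"}


@dataclass(frozen=True)
class ReviewerResult:
    narrative: str
    action: Optional[str] = None       # what the agent says it did
    evidence_id: Optional[str] = None  # e.g. a merge commit SHA or review ID


class MissingTerminalAction(Exception):
    """Raised when a reviewer tries to return without a proven side effect."""


def exit_gate(result: ReviewerResult) -> ReviewerResult:
    """Let a result through only if it proves a terminal action happened."""
    if result.action not in TERMINAL_ACTIONS:
        raise MissingTerminalAction("no terminal action was performed")
    if not result.evidence_id:
        raise MissingTerminalAction(
            f"action {result.action!r} has no evidence of a side effect"
        )
    return result
```

A narrative-only result such as `ReviewerResult("Looks good to me")` fails the gate, so the agent cannot return until it has either merged or requested changes.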
3. Soft validation gates were always overridable (commit a10dd51)
A judgment-based check on the PR base branch was being satisfied by narrative justification in the PR body. After replacing it with a hard validation on the (base branch, branch name, title prefix) tuple, no further base-branch violations were merged.
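A hard check on the (base branch, branch name, title prefix) tuple might look like the following sketch. The rule table is invented for illustration: the branch patterns, title prefixes, and base-branch names are assumptions, not the project's actual conventions. Note that the PR body is never read, so no justification written there can influence the outcome.

```python
import re
from typing import List, NamedTuple


class PullRequest(NamedTuple):
    base_branch: str
    head_branch: str
    title: str
    body: str = ""  # intentionally ignored by the validator


# Hypothetical rules: (head-branch pattern, required title prefix, required base).
RULES = [
    (re.compile(r"^feature/"), "feat:", "develop"),
    (re.compile(r"^hotfix/"), "fix:", "main"),
]


def validate_pr(pr: PullRequest) -> List[str]:
    """Return a list of violations; an empty list means the PR passes."""
    for pattern, prefix, base in RULES:
        if pattern.match(pr.head_branch):
            errors = []
            if not pr.title.startswith(prefix):
                errors.append(f"title must start with {prefix!r}")
            if pr.base_branch != base:
                errors.append(f"base branch must be {base!r}, got {pr.base_branch!r}")
            return errors
    return [f"branch name {pr.head_branch!r} matches no known rule"]
```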
4. A test-coverage gate could not distinguish markup from logic (commit 0b68cccc)
The gate flagged any view-template file with added lines as needing tests, so cosmetic UI changes were repeatedly blocked. After the gate was taught to look for added lines inside code blocks rather than counting every added line, the false-positive rejections ceased. A complementary upstream change pre-marks markup-only work at planning time.
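The difference between counting added lines and inspecting code blocks can be illustrated with a small sketch. The template syntax here is an assumption: logic is taken to live between `<script ...>` and `</script>` lines, each on its own line, which may not match the project's real view templates. The function names are hypothetical.

```python
from typing import List, Sequence, Set


def added_lines_in_code(
    lines: Sequence[str],
    added: Set[int],
    open_tag: str = "<script",
    close_tag: str = "</script>",
) -> List[int]:
    """Return the 1-based numbers of added, non-blank lines inside code blocks."""
    inside = False
    hits = []
    for number, text in enumerate(lines, start=1):
        stripped = text.strip()
        if not inside and stripped.startswith(open_tag):
            inside = True   # the delimiter line itself is not counted
            continue
        if inside and stripped.startswith(close_tag):
            inside = False
            continue
        if inside and stripped and number in added:
            hits.append(number)
    return hits


def needs_tests(lines: Sequence[str], added: Set[int]) -> bool:
    """Markup-only additions pass; additions inside a code block require tests."""
    return bool(added_lines_in_code(lines, added))
```

For a template whose only added line is a `<div>` in the markup section, `needs_tests` returns `False`. An added line between the script delimiters returns `True`.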
5. Specialists had no tool to verify external specifications (commit c79fa56)
Code targeting infrastructure schemas (Kubernetes, Helm, cloud APIs) sometimes referenced fields that did not exist. After specialist agents were granted web-fetch tools and required to cite source URLs in commit and PR bodies for external-spec changes, this class of error became mechanically reviewable rather than dependent on a human catching it.
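The mechanical part of that review could be as simple as the check sketched below. It is illustrative only: the path prefixes that mark a change as "external-spec" and the function names are assumptions. It verifies only that a URL is present, not that the URL supports the change.

```python
import re
from typing import List, Sequence

URL_RE = re.compile(r"https?://[^\s)>\]]+")

# Hypothetical path prefixes that mark a change as touching an external spec.
EXTERNAL_SPEC_PREFIXES = ("charts/", "k8s/", "deploy/")


def touches_external_spec(changed_paths: Sequence[str]) -> bool:
    return any(path.startswith(EXTERNAL_SPEC_PREFIXES) for path in changed_paths)


def citation_errors(
    changed_paths: Sequence[str], commit_message: str, pr_body: str
) -> List[str]:
    """Require a source URL in both texts when external-spec files change."""
    if not touches_external_spec(changed_paths):
        return []
    errors = []
    if not URL_RE.search(commit_message):
        errors.append("commit message cites no source URL")
    if not URL_RE.search(pr_body):
        errors.append("PR body cites no source URL")
    return errors
```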
Which practices became hard requirements
| Practice | Was | Now (in agent-driven workflow) |
|---|---|---|
| Single responsibility per module | Code-quality habit | Determines whether a task can be cleanly routed to one specialist |
| Clear naming of callable units | Readability | Names directly drive dispatch — collisions cause unintended sub-agent spawns |
| Test coverage | Quality safeguard | Binary gate; absence blocks merge automatically |
| External-spec citations | Polite documentation | Audit trail; reviewer agents check for them mechanically |
Closing
The throughput numbers above are real, but they apply only to the in-scope cases the system is shaped for. The recurring failure shapes are not bugs in any one component; they are the cost of running a chain of probabilistic agents over deterministic infrastructure. Most of the infrastructure-level work in the last three months has gone into making invariants explicit enough that no narrative can override them.