Agent Orchestration Patterns: From Single Agents to Agent Swarms

A single autonomous agent can write a script. A coordinated swarm of agents can write, review, execute, and publish a peer-reviewed machine learning paper while you sleep. The frontier isn't making single models smarter—it's making them work together.

Introduction

Here's what happened at a leading ML research lab in May 2026:

A researcher set up ARIS (Autonomous Research via Adversarial Multi-Agent Collaboration) with a single hypothesis: "Can we improve image classification on the CIFAR-10 dataset using selective head quantization?"

She set it running on Friday evening. She went home.

Monday morning, she returned to find:

✅ Three independent experimental runs, each with different quantization strategies
✅ A 15-page peer-reviewed-quality manuscript
✅ Ablation studies comparing to baselines
✅ A GitHub repository with reproducible code
✅ All claims in the paper directly traceable to verified experimental results

Zero human intervention. Zero hallucinations. Zero silent data leaks.

This is the frontier of AI orchestration.

We've been obsessing over making single models smarter. Bigger context windows. Better reasoning. More parameters.

But the real breakthrough isn't a better single agent. It's coordinated agents arguing with each other.

This post explains:

Why single agents fail on complex, long-horizon tasks
What agent swarms solve that individuals can't
The three pillars of orchestration (delegation, parallelization, conflict resolution)
How ARIS works (adversarial collaboration, three-layer architecture)
How to build it (working code patterns with LangGraph)
Where the industry is heading

The Single Agent Problem

What Single Agents Do Well

A single autonomous agent, executing the ReAct pattern (Reason → Act → Observe → Loop), can:

✅ Write and debug code autonomously
✅ Run experiments and observe outputs
✅ Generate reports from data
✅ Handle straightforward linear workflows

Remember DeepAnalyze-8B from the previous post? It's a single agent. It works great for structured tasks: load data → EDA → train model → report.

Where Single Agents Break

Single agents fail when faced with complex, adversarial, or high-stakes tasks.

Example: You ask an agent to "Write a machine learning paper on quantization."

What happens:

Agent thinks: "I need to run experiments and write a paper."
Agent writes: Code that runs quantization experiments
Agent runs: The code (it executes)
Agent observes: "Experiment completed. F1 score improved by 15%."
Agent concludes: "Quantization is effective."
Agent writes: "Our method improves F1 by 15%. This demonstrates efficacy."

What's wrong with this?

The agent never questioned:

Is the 15% improvement statistically significant?
Did I compare against the right baseline?
Could this be a measurement artifact?
Did I actually run the right experiment, or did I misunderstand the paper I cited?

The agent produces "plausible unsupported success."

Plausible Unsupported Success

This is the fundamental problem with single agents: they lack internal skepticism.

What Is Plausible Unsupported Success?

An output that:

Looks plausible (grammatically correct, logically flowing)
Is internally consistent (doesn't contradict itself)
Has no obvious errors (code runs, doesn't crash)
But is fundamentally wrong (lacks evidence, contains silent assumptions, or misses critical context)

Real Examples

Example 1: Silent Data Leak

# Agent writes this code
def load_and_preprocess(df):
    # Fill missing values with a constant
    df['age'].fillna(-1, inplace=True)  # -1 is a sentinel

    # Normalize features
    df_normalized = (df - df.mean()) / df.std()

    return df_normalized

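The problem: The agent filled missing values with -1 and then normalized. The -1 sentinels are now treated as real ages, which drags the mean down and inflates the standard deviation for every row. And because the statistics are computed over the whole DataFrame, running this before a train/test split also leaks test-set information into training. The agent never checked for either effect.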

Silent failure: The code runs. The agent observes no errors. It reports "data preprocessing complete." But the data is corrupted in a way that's not immediately obvious.
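To make the distortion concrete, here is a minimal sketch in plain Python. The ages and the `normalize` helper are made up for illustration and are not part of the pipeline above. It compares statistics computed with the sentinel against statistics computed from observed values only.

```python
import statistics

# Hypothetical 'age' column with two missing values
ages = [25.0, 35.0, 45.0, None, None]

# What the agent did: swap in the sentinel, then compute statistics
with_sentinel = [-1.0 if a is None else a for a in ages]
bad_mean = statistics.mean(with_sentinel)   # pulled down by the -1 rows
bad_std = statistics.stdev(with_sentinel)   # inflated by the -1 rows

# Safer: compute statistics from observed values only
observed = [a for a in ages if a is not None]
good_mean = statistics.mean(observed)
good_std = statistics.stdev(observed)


def normalize(value, mean, std):
    """Z-score an observed value; map a missing value to 0.0 (the mean)."""
    return 0.0 if value is None else (value - mean) / std


normalized = [normalize(a, good_mean, good_std) for a in ages]
was_missing = [a is None for a in ages]  # keep missingness as its own feature

print(f"mean with sentinel: {bad_mean:.1f}, mean of observed: {good_mean:.1f}")
```

With these numbers the sentinel moves the mean from 35.0 to 20.6. In a real pipeline the statistics would also be fit on the training split only.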

Example 2: Unsupported Claim

Agent writes: "Our method achieves 95% accuracy, a 5% improvement over prior work."

Reality:
- Prior work: Not clearly defined
- Accuracy on what dataset? Test set? Validation set?
- Was statistical significance tested? (With only 100 samples, 5% difference might be noise)
- Did you run multiple seeds? What's the variance?

The claim is plausible. It's grammatically correct. But it's unsupported.
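The "might be noise" point can be checked with a few lines of arithmetic. This sketch assumes the 95% and the prior 90% were each measured on an independent test set of the same size, and it uses a pooled two-proportion z-test; both the setup and the function name are illustrative assumptions.

```python
import math


def two_proportion_p_value(acc_a: float, acc_b: float, n_a: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (acc_a * n_a + acc_b * n_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = abs(acc_a - acc_b) / std_err
    # Two-sided tail of the standard normal: 2 * (1 - Phi(z)) = erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2))


p_small = two_proportion_p_value(0.90, 0.95, 100, 100)    # 100 samples each
p_large = two_proportion_p_value(0.90, 0.95, 2000, 2000)  # 2,000 samples each

print(f"n=100:  p = {p_small:.3f}")
print(f"n=2000: p = {p_large:.2e}")
```

At 100 samples the p-value comes out at about 0.18, well above 0.05, so the same 5-point gap that is convincing at 2,000 samples is indistinguishable from noise here.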

Example 3: Context Misalignment

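Agent reads a paper: "We use Adam optimizer with learning rate 0.001"
Agent writes code: "optimizer = Adam(lr=0.001)"
Agent runs it against a current framework release: It fails

Problem: The paper is from 2018 and doesn't pin a framework or version. PyTorch's Adam needs the model parameters as its first argument, and recent Keras releases name the argument learning_rate, not lr.
The agent produced plausible-looking code that doesn't actually work.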

Why This Happens

Single agents optimize for coherence, not correctness.

An LLM trained to generate text that sounds right will produce plausible content. It has no internal skeptic saying "wait, did you actually verify this?"

Why Swarms Win

Here's where multi-agent orchestration changes everything.

Separation of Concerns

Instead of one agent that writes and judges its own work:

Single Agent:
  "I wrote this code."
  "Does it look right? ...yes, it does."
  → Proceeds confidently (no external reality check)

Swarm:
  Executor Agent: "I wrote this code."
  Reviewer Agent (different model, different priors): "Let me check..."
  → Debate until aligned or escalate

Cross-Model Diversity

Different models have different strengths and blind spots.

Claude 3.5 (Executor):
  Strength: Writing fluent, coherent code
  Blind spot: Might miss edge cases

GPT-4o (Reviewer):
  Strength: Catching logical flaws, testing assumptions
  Blind spot: Sometimes over-critical, rejects valid approaches

Together:
  Claude writes, GPT-4o questions
  → Stronger outputs than either alone

Forced Verification

When a Reviewer agent says "your code has a bug," the Executor must either refute the claim with evidence or fix the code.

Executor: "The code runs and produces results."

Reviewer: "Running without errors ≠ correctness.
  Show me: (1) that the output matches the hypothesis,
  (2) the edge cases you tested, (3) statistical significance."

Executor: *Must now verify these things or fix the code*

This creates an adversarial loop that catches plausible unsupported success.

The Three Pillars of Orchestration

When orchestrating a swarm, the architecture fundamentally changes from a simple state machine to a distributed system.

Pillar 1: Task Delegation (The DAG)

A "manager" or "router" agent breaks a massive objective into a Directed Acyclic Graph (DAG) of subtasks.

Goal: "Write and publish an ML paper on quantization"

DAG:
  ├─ Literature Review (no dependencies)
  ├─ Experimental Design (depends on ↑)
  │  ├─ Baseline experiments (parallel with ↓)
  │  ├─ Proposed method experiments
  │  └─ Ablation studies (depends on ↑)
  ├─ Results Analysis (depends on ↑)
  ├─ Manuscript Writing (depends on ↑)
  ├─ Self-Review (depends on ↑)
  ├─ Revision (depends on ↑)
  └─ Publication

Critical: Identify dependencies.
- Literature Review can start immediately
- Experimental Design waits for Literature Review
- Ablations wait for baselines
- Background and methods sections can be drafted before all experiments finish; the results section waits for Results Analysis

The Manager agent routes each subtask to a specialized worker:

Literature researcher → Web search + summarization
Coder → Experiment implementation
Analyst → Results interpretation
Writer → Manuscript generation
Reviewer → Quality assurance
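The dependency list above can be sketched with Python's standard-library `graphlib`. The task names and the task-to-worker mapping below are illustrative assumptions (the post does not say which worker handles experimental design), and this is a sketch of the idea rather than how any particular framework schedules work.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it must wait for (names are illustrative)
DEPENDENCIES = {
    "literature_review": set(),
    "experimental_design": {"literature_review"},
    "baseline_experiments": {"experimental_design"},
    "proposed_experiments": {"experimental_design"},
    "ablation_studies": {"baseline_experiments", "proposed_experiments"},
    "results_analysis": {"ablation_studies"},
    "manuscript_writing": {"results_analysis"},
}

# task -> worker role (assumed mapping, following the routing list above)
WORKER_FOR = {
    "literature_review": "Literature researcher",
    "experimental_design": "Analyst",
    "baseline_experiments": "Coder",
    "proposed_experiments": "Coder",
    "ablation_studies": "Coder",
    "results_analysis": "Analyst",
    "manuscript_writing": "Writer",
}


def ready_batches(dependencies):
    """Yield batches of tasks whose dependencies have all completed."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    while sorter.is_active():
        batch = sorted(sorter.get_ready())
        yield batch
        sorter.done(*batch)


batches = list(ready_batches(DEPENDENCIES))
for batch in batches:
    print([(task, WORKER_FOR[task]) for task in batch])
```

Each batch contains tasks that can be dispatched at the same time; here only the baseline and proposed-method experiments share a batch.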

Pillar 2: Parallel Execution

Instead of linear sequential execution, independent tasks run simultaneously on separate compute nodes.

Traditional sequential:
  Literature (2h) → Design (1h) → Baseline (3h) →
  Proposed (2h) → Analysis (1h) → Writing (3h)
  = 12 hours total

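With parallelization (same dependencies as the DAG above):
  Literature (2h) → Design (1h) ─┬─ Baseline (3h) ─┬─→ Analysis (1h) → Writing (3h)
                                 └─ Proposed (2h) ─┘
  = 10 hours total (longest path through the DAG)

Here only the two experiment branches overlap, so the saving is 2 hours (1.2x). Parallel speedup is bounded by the critical path: it grows with the number of independent branches, such as extra seeds, ablations, or literature threads that can run at once.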
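Taking the dependencies from Pillar 1 and the durations listed above as given, the best-case parallel time is the longest dependency chain. A minimal sketch of that calculation, with illustrative task names:

```python
from graphlib import TopologicalSorter

# Durations in hours, from the example above
DURATIONS = {
    "literature": 2, "design": 1, "baseline": 3,
    "proposed": 2, "analysis": 1, "writing": 3,
}

# task -> tasks it must wait for, following the dependencies in Pillar 1
DEPENDS_ON = {
    "literature": set(),
    "design": {"literature"},
    "baseline": {"design"},
    "proposed": {"design"},
    "analysis": {"baseline", "proposed"},
    "writing": {"analysis"},
}


def critical_path_hours(durations, depends_on):
    """Earliest possible finish time with unlimited parallel workers."""
    finish = {}
    for task in TopologicalSorter(depends_on).static_order():
        start = max((finish[dep] for dep in depends_on[task]), default=0)
        finish[task] = start + durations[task]
    return max(finish.values())


sequential_hours = sum(DURATIONS.values())                    # 12
parallel_hours = critical_path_hours(DURATIONS, DEPENDS_ON)   # 10
print(f"speedup: {sequential_hours / parallel_hours:.1f}x")
```

Running the same function on a wider DAG (more independent experiment branches) is a quick way to estimate whether parallel execution is worth the orchestration effort.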

Pillar 3: Conflict Resolution

When agents disagree, you need deterministic escalation.

Executor: "My experiment shows 15% improvement"
Reviewer: "That's not statistically significant"

Options:
1. Adversarial loop: Agents debate until aligned
   → "You're right, let me run with more samples"
   → "Good, now p < 0.05"

2. Escalate: If they can't agree after 3 iterations
   → Route to a Human-in-the-Loop reviewer
   → Or apply a pre-defined tiebreaker policy

3. Consensus: Multiple reviewers vote
   → If 2/3 approve, pass
   → If 2/3 reject, fix
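The three options compose into a single deterministic policy. The sketch below is one possible reading: the vote labels, the function name, and the choice to check the iteration limit only after a failed vote are assumptions.

```python
import math
from collections import Counter


def resolve(votes: list[str], iteration: int, max_iterations: int = 3) -> str:
    """Return 'pass', 'fix', or 'escalate' for one review round.

    votes: one 'approve' or 'reject' per reviewer.
    """
    if not votes:
        raise ValueError("at least one reviewer vote is required")
    needed = math.ceil(2 * len(votes) / 3)  # two-thirds majority
    if Counter(votes)["approve"] >= needed:
        return "pass"
    if iteration >= max_iterations:
        return "escalate"  # hand off to a human or a tiebreaker policy
    return "fix"           # send back to the Executor for another round


print(resolve(["approve", "reject", "approve"], iteration=1))  # pass
print(resolve(["approve", "reject", "reject"], iteration=1))   # fix
print(resolve(["approve", "reject", "reject"], iteration=3))   # escalate
```

Because the function is pure, the same inputs always produce the same routing decision, which is what makes the escalation path auditable.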

ARIS: Adversarial Collaboration in Action

Now let's look at a concrete implementation: ARIS (Autonomous Research via Adversarial Multi-Agent Collaboration).

Released in May 2026, ARIS is an open-source research harness designed for autonomous ML research.

The Philosophy

ARIS doesn't use a single "god model." Instead:

It pairs:
  Executor (e.g., Claude 3.5 Sonnet)
  with
  Reviewer (e.g., GPT-4o)

Why different families?
Because a reviewer from the same family shares the executor's blind spots.
A mixed-model configuration produces varied critiques,
breaking the "self-play" confirmation bias.

The Flow

1. Input hypothesis
   ↓
2. Executor plans experiments
   ↓
3. Executor runs experiments (in sandbox)
   ↓
4. Reviewer audits results
   ├─ Is the experiment correct?
   ├─ Do results match claims?
   ├─ Are claims supported by evidence?
   ↓
5. If Reviewer satisfied → Move to next task
   If Reviewer unsatisfied → Send back to Executor for revision
   ↓
6. Final output: Only verified, audited results

ARIS Architecture: Three Layers

ARIS is structured across three distinct layers:

┌─────────────────────────────────────────────────────┐
│  ASSURANCE LAYER (Verification, Auditing)           │
│  - Integrity verification                           │
│  - Result-to-claim mapping                          │
│  - Claim auditing                                   │
├─────────────────────────────────────────────────────┤
│  ORCHESTRATION LAYER (Workflow Management)          │
│  - DAG execution                                    │
│  - Task routing                                     │
│  - Effort level management                          │
├─────────────────────────────────────────────────────┤
│  EXECUTION LAYER (Skills & Tools)                   │
│  - 65+ reusable Markdown-defined skills             │
│  - MCP integrations                                 │
│  - Persistent research wiki                         │
└─────────────────────────────────────────────────────┘

Execution Layer Deep Dive

The Execution Layer is where agents actually do things.

The 65+ Skills

Instead of free-form coding, agents access a curated library of "skills"—reusable, tested, documented procedures.

Skill examples:
- run_hyperparameter_sweep: Objective, parameter ranges → Best params + metrics
- statistical_test: Data1, Data2, test_type → p-value, significance
- load_dataset: Dataset name, filters → Loaded data with metadata
- plot_results: Results dict, style → Publication-quality plots
- write_section: Topic, findings → Markdown section with citations
- compare_models: Model list, metrics → Comparison table + analysis

Why skills instead of free code?

Tested: Each skill has unit tests
Documented: Each skill includes expected inputs/outputs
Safe: No arbitrary code execution
Reusable: Agents don't reinvent wheels
Auditable: Clear, reproducible procedures
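The post describes ARIS skills as Markdown-defined. The sketch below is not that format; it is a hypothetical Python registry that illustrates the same idea: agents call named procedures with declared inputs instead of writing free-form code. `Skill`, `register`, `call_skill`, and `summarize_runs` are all invented for illustration.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Skill:
    name: str
    inputs: Tuple[str, ...]      # names of the required keyword arguments
    run: Callable[..., dict]     # the tested implementation


REGISTRY: Dict[str, Skill] = {}


def register(skill: Skill) -> None:
    REGISTRY[skill.name] = skill


def call_skill(name: str, **kwargs) -> dict:
    """Run a registered skill; reject unknown skills and wrong arguments."""
    if name not in REGISTRY:
        raise KeyError(f"unknown skill: {name}")
    skill = REGISTRY[name]
    if set(kwargs) != set(skill.inputs):
        raise TypeError(f"{name} expects {skill.inputs}, got {tuple(sorted(kwargs))}")
    return skill.run(**kwargs)


def _summarize_runs(values):
    return {"mean": statistics.mean(values), "std": statistics.stdev(values)}


register(Skill("summarize_runs", ("values",), _summarize_runs))

summary = call_skill("summarize_runs", values=[0.90, 0.92, 0.94])
print(summary)
```

The registry is the audit point: every action an agent takes is a named call with checked arguments, which is easier to log and replay than arbitrary generated code.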

MCP Integrations

ARIS connects to external tools via Model Context Protocol (MCP).

Agents can:
- Query academic databases (arXiv, Papers with Code)
- Access computational resources (GPU clusters)
- Interact with data storage (S3, databases)
- Call external APIs (GitHub, Hugging Face)
- All safely, via defined MCP servers

Example:

Agent action: "Search for recent papers on quantization"
MCP Integration: Calls Papers with Code API
Result: [title, abstract, code_link, citation_count]
Agent continues: Now has grounding in the literature

Persistent Research Wiki

Across multiple runs, ARIS maintains a "research wiki"—a memory of all prior findings.

Research Wiki contents:
- Experimental results (baseline accuracies, hyperparameters)
- Failed approaches (with reasons why they failed)
- Literature summary (papers read, key findings)
- Code artifacts (reusable implementations)
- Intermediate hypotheses (discarded ideas)

Benefit:
If running a new experiment, ARIS doesn't re-discover old results.
It builds on prior knowledge.
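How ARIS stores its wiki is not described here, so the following is only a sketch of the concept: a small JSON-backed store that a later run can reopen. The class name, file layout, and the recorded finding are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


class ResearchWiki:
    """Tiny JSON-backed store of findings, keyed by a short description."""

    def __init__(self, path):
        self.path = Path(path)
        # Load earlier findings if a previous run left a file behind
        self.entries = json.loads(self.path.read_text()) if self.path.exists() else {}

    def record(self, key: str, finding: dict) -> None:
        self.entries[key] = finding
        self.path.write_text(json.dumps(self.entries, indent=2))

    def lookup(self, key: str):
        """Return the stored finding, or None if this was never tried."""
        return self.entries.get(key)


wiki_path = os.path.join(tempfile.mkdtemp(), "wiki.json")

first_run = ResearchWiki(wiki_path)
first_run.record("int8 on all heads", {"status": "failed", "reason": "made-up example"})

later_run = ResearchWiki(wiki_path)  # simulates a separate, later session
print(later_run.lookup("int8 on all heads"))
```

Recording failed approaches alongside successes is what lets a later run skip a dead end instead of rediscovering it.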

Orchestration Layer Deep Dive

The Orchestration Layer manages the workflow—it's the "system brain."

DAG Execution

The Orchestrator:
1. Accepts a high-level goal
2. Breaks it into a DAG of subtasks
3. Identifies dependencies
4. Routes tasks to appropriate agents
5. Monitors progress
6. Handles re-routing if a task fails

Dynamic Effort Level

ARIS can adjust effort per task:

High effort (thorough):
- Run experiments with multiple seeds
- Statistical significance testing
- Cross-validation
- Ablation studies
- Use when: Critical results

Medium effort (standard):
- Run experiments once
- Basic validation
- Use when: Intermediate results

Low effort (quick):
- Heuristic estimation
- No validation
- Use when: Early exploration

The Orchestrator adjusts effort based on:

Task criticality (is this in the final paper?)
Available compute (how much time do we have?)
Prior confidence (do we know this will work?)
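As a rough sketch, the three inputs can be turned into a small decision function. The thresholds (8 GPU-hours, 1 GPU-hour, 0.3 confidence) and the function name are placeholders chosen for illustration, not values from ARIS.

```python
def choose_effort(in_final_paper: bool, gpu_hours_left: float, prior_confidence: float) -> str:
    """Pick 'high', 'medium', or 'low' effort for one task.

    prior_confidence is a 0-1 estimate that the approach will work.
    """
    if gpu_hours_left < 1:
        return "low"      # no budget for anything but a quick estimate
    if in_final_paper and gpu_hours_left >= 8:
        return "high"     # critical result with enough budget for seeds and ablations
    if not in_final_paper and prior_confidence < 0.3:
        return "low"      # early exploration of a long shot
    return "medium"


print(choose_effort(in_final_paper=True, gpu_hours_left=24, prior_confidence=0.8))   # high
print(choose_effort(in_final_paper=False, gpu_hours_left=24, prior_confidence=0.1))  # low
print(choose_effort(in_final_paper=False, gpu_hours_left=24, prior_confidence=0.7))  # medium
```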

Assurance Layer (The Secret Weapon)

This is where ARIS solves the "plausible unsupported success" problem.

The Assurance Layer treats verification as a first-class citizen with a three-stage audit process.

Stage 1: Integrity Verification

Does the artifact actually work?

For code:
- Does it run without errors?
- Do outputs match expected types?
- Are edge cases handled?

For results:
- Are metrics correctly computed?
- Are dimensions correct (shape mismatch detected)?
- Are units consistent?

Action: Run the artifact in a sandbox, observe outputs.
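A minimal version of "run it and observe" might look like the sketch below. It runs the code in a child Python process with a timeout. Note the assumption: a subprocess is process isolation, not a security sandbox, so a real deployment would add a container or similar boundary. `IntegrityResult` and `verify_runs` are illustrative names.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class IntegrityResult:
    passed: bool
    error: str
    stdout: str


def verify_runs(code: str, timeout_s: float = 30.0) -> IntegrityResult:
    """Run code in a child Python process and report whether it exited cleanly."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return IntegrityResult(False, f"timed out after {timeout_s}s", "")
    if proc.returncode != 0:
        stderr_lines = proc.stderr.strip().splitlines()
        # The last traceback line names the exception
        message = stderr_lines[-1] if stderr_lines else "non-zero exit code"
        return IntegrityResult(False, message, proc.stdout)
    return IntegrityResult(True, "", proc.stdout)


ok = verify_runs("print(2 + 2)")
bad = verify_runs("raise ValueError('shape mismatch')")
print(ok.passed, ok.stdout.strip())
print(bad.passed, bad.error)
```

A clean exit only answers the first question in the list. Type, shape, and unit checks would be further assertions on the captured output.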

Stage 2: Result-to-Claim Mapping

Does the evidence support the claim?

Claim: "Our method improves accuracy by 5%"

Verification:
- What was measured? (Accuracy on which dataset, which metric?)
- What was compared against? (Which baseline?)
- Is the 5% improvement correct? (Sanity check the arithmetic)
- Is this the right direction? (Did we test improvement vs. degradation?)

Action: Trace the claim to the data that supports it.
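The arithmetic and direction checks are easy to automate. This sketch assumes the claim is stated in absolute accuracy points and that both measurements come from the same dataset and metric; the function name, tolerance, and example numbers are illustrative.

```python
def check_improvement_claim(
    baseline: float, proposed: float, claimed_gain: float, tolerance: float = 0.005
) -> dict:
    """Compare a claimed absolute gain against the measured numbers."""
    actual_gain = proposed - baseline
    return {
        "actual_gain": actual_gain,
        "right_direction": actual_gain > 0,
        "matches_claim": abs(actual_gain - claimed_gain) <= tolerance,
    }


# The manuscript says "+5%", but the logged accuracies are 0.90 and 0.93
report = check_improvement_claim(baseline=0.90, proposed=0.93, claimed_gain=0.05)
print(report)
```

Here the direction is right but the size is not: the measured gain is 3 points, so the claim fails the mapping and goes back for revision.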

Stage 3: Claim Auditing

Is the claim scientifically sound?

Checks:
- Is the improvement statistically significant? (p-value test)
- Was variance measured? (Std dev, confidence intervals)
- Are there confounding factors? (Did we control for them?)
- Is this reproducible? (Multiple runs, different seeds)
- Does this align with priors? (Does it match what domain experts expect?)

Action: Apply statistical rigor and domain knowledge.
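For the significance and variance checks, one option that needs only the standard library is an exact permutation test over per-seed results. The accuracies below are made up for illustration, and the permutation test is a substitute chosen for this sketch; the post does not say which test ARIS applies.

```python
from itertools import combinations
from statistics import mean, stdev


def permutation_p_value(group_a, group_b) -> float:
    """Exact two-sided permutation test on the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    count = 0
    # Enumerate every way of relabeling n_a of the pooled runs as "group A"
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical test accuracies from five seeds each
baseline_runs = [0.900, 0.910, 0.890, 0.900, 0.920]
proposed_runs = [0.950, 0.940, 0.960, 0.950, 0.930]

p_value = permutation_p_value(baseline_runs, proposed_runs)
print(f"baseline: {mean(baseline_runs):.3f} ± {stdev(baseline_runs):.3f}")
print(f"proposed: {mean(proposed_runs):.3f} ± {stdev(proposed_runs):.3f}")
print(f"p = {p_value:.4f}")
```

With five seeds per arm there are 252 relabelings, so exact enumeration is cheap. For larger samples a random subset of permutations is the usual shortcut.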

The Outcome

Every sentence in the final manuscript maps directly to a verified experimental result.

"We propose a quantization strategy that reduces model size by 4x
with only 0.5% accuracy loss."

Proof:
✓ Model size: Measured in bytes, confirmed reduction
✓ 4x: Arithmetic verified (from X bytes to X/4 bytes)
✓ Accuracy loss: Measured on test set, p < 0.05
✓ 0.5%: Confirmed via cross-validation (mean ± std)

This prevents hallucination because claims are tethered to evidence.

Building the Orchestration Loop

How do you actually build an adversarial swarm like ARIS?

Using a framework like LangGraph, you define Executor and Reviewer as distinct nodes that pass state back and forth.

The Code Structure

from langgraph.graph import StateGraph, END
from typing import TypedDict, Optional, List

# 1. Define the shared state between agents
class ResearchState(TypedDict):
    hypothesis: str
    code_artifact: Optional[str]
    experimental_results: Optional[str]
    reviewer_critique: Optional[str]
    execution_log: List[str]
    is_approved: bool
    iterations: int
    max_iterations: int

# 2. The Executor Agent (writes and runs code)
def executor_node(state: ResearchState):
    """
    Executor: Writes code, runs experiments.
    Takes critique and improves code iteratively.
    """

    # If there's prior critique, use it to improve
    if state.get("reviewer_critique"):
        instruction = f"""
        Previous feedback: {state['reviewer_critique']}

        Hypothesis: {state['hypothesis']}
        Current code: {state['code_artifact']}

        Fix the code based on the feedback.
        """
    else:
        # First iteration: generate from scratch
        instruction = f"""
        Design and implement an experiment for:
        {state['hypothesis']}
        """

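    # Generate code. generate_code_via_llm() is a placeholder for your own
    # LLM client call; it is not part of LangGraph.
    artifact = generate_code_via_llm(instruction)

    # Execute in sandbox. execute_sandbox() is likewise a placeholder and is
    # expected to return a (results, error) pair.
    results, error = execute_sandbox(artifact)

    # Log execution (copy the list so the incoming state is not mutated)
    execution_log = list(state.get('execution_log', []))
    execution_log.append(f"Iteration {state.get('iterations', 0)}: {error or 'Success'}")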

    return {
        "code_artifact": artifact,
        "experimental_results": results,
        "execution_log": execution_log,
        "iterations": state.get("iterations", 0) + 1,
        "is_approved": False  # Reset approval status
    }

# 3. The Reviewer Agent (different model, different background)
def reviewer_node(state: ResearchState):
    """
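    Reviewer: Audits Executor's work.
    Intended to run on a different model family than the Executor
    (e.g., GPT-4o rather than Claude).
    Conducts the three-stage audit. verify_integrity(),
    map_results_to_claims(), and audit_claims() are placeholders you supply.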
    """

    # Stage 1: Integrity Verification
    integrity_check = verify_integrity(
        state["code_artifact"],
        state["experimental_results"]
    )

    if not integrity_check.passed:
        critique = f"""
        INTEGRITY FAILURE: {integrity_check.error}

        Fix: {integrity_check.suggestion}
        """
        return {
            "is_approved": False,
            "reviewer_critique": critique
        }

    # Stage 2: Result-to-Claim Mapping
    claim_check = map_results_to_claims(
        state["hypothesis"],
        state["experimental_results"]
    )

    if not claim_check.valid:
        critique = f"""
        CLAIM-RESULT MISMATCH: {claim_check.issue}

        Your results show: {claim_check.actual}
        Your claim says: {claim_check.stated}

        Revise the code or the claim.
        """
        return {
            "is_approved": False,
            "reviewer_critique": critique
        }

    # Stage 3: Claim Auditing (statistical rigor)
    audit_check = audit_claims(
        state["hypothesis"],
        state["experimental_results"]
    )

    if not audit_check.sound:
        critique = f"""
        STATISTICAL ISSUE: {audit_check.issue}

        Problem: {audit_check.detail}

        Resolution: {audit_check.fix}
        """
        return {
            "is_approved": False,
            "reviewer_critique": critique
        }

    # All checks passed
    return {
        "is_approved": True,
        "reviewer_critique": None
    }

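# 4. Orchestrate the Swarm
workflow = StateGraph(ResearchState)

def escalate_node(state: ResearchState):
    """
    Terminal node for unresolved disagreements.
    Records the escalation; a real system would notify a human reviewer here.
    """
    execution_log = list(state.get("execution_log", []))
    execution_log.append("Escalated: iteration limit reached without approval")
    return {"execution_log": execution_log}

# Add the nodes (agents)
workflow.add_node("Executor", executor_node)
workflow.add_node("Reviewer", reviewer_node)
workflow.add_node("Escalate", escalate_node)

# 5. Routing and Conflict Resolution
def route_after_review(state: ResearchState):
    """
    Conflict resolution logic.
    Decides where to go after Reviewer speaks.
    """

    if state.get("is_approved"):
        # Reviewer satisfied, move forward
        return END

    if state.get("iterations", 0) >= state.get("max_iterations", 3):
        # Hit iteration limit, escalate to human or declare failure
        return "Escalate"

    # Not approved, iterate
    return "Executor"

# Connect the nodes. Every value route_after_review can return must map to
# a registered node (or END), otherwise the graph cannot route to it.
workflow.add_edge("Executor", "Reviewer")
workflow.add_conditional_edges(
    "Reviewer",
    route_after_review,
    {"Executor": "Executor", "Escalate": "Escalate", END: END},
)
workflow.add_edge("Escalate", END)
workflow.set_entry_point("Executor")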

# 6. Compile and execute
research_swarm = workflow.compile()

# Run the adversarial loop
initial_state = {
    "hypothesis": "Selective head quantization improves efficiency",
    "code_artifact": None,
    "experimental_results": None,
    "reviewer_critique": None,
    "execution_log": [],
    "is_approved": False,
    "iterations": 0,
    "max_iterations": 3
}

final_result = research_swarm.invoke(initial_state)

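print(f"Approved: {final_result['is_approved']}")
print(f"Execution Log: {final_result['execution_log']}")
print(f"Final Code: {final_result['code_artifact']}")
# Results count as verified only if the Reviewer approved them
print(f"Results: {final_result['experimental_results']}")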
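The listing above calls five helpers it never defines. To dry-run the graph wiring before connecting real models, they can be stubbed out. The stubs below are hypothetical stand-ins that return fixed values shaped to match the attributes the nodes read; they would need to be defined before the graph is invoked, and they do no real generation, execution, or auditing.

```python
from types import SimpleNamespace


def generate_code_via_llm(instruction: str) -> str:
    """Stub: a real version would send the instruction to your LLM client."""
    return "print('accuracy=0.91')"


def execute_sandbox(artifact: str):
    """Stub: returns a (results, error) pair without executing anything."""
    return "accuracy=0.91", None


def verify_integrity(code_artifact, experimental_results):
    """Stub for Stage 1: passes when both an artifact and results exist."""
    passed = bool(code_artifact) and bool(experimental_results)
    return SimpleNamespace(
        passed=passed,
        error="" if passed else "missing artifact or results",
        suggestion="" if passed else "re-run the experiment",
    )


def map_results_to_claims(hypothesis, experimental_results):
    """Stub for Stage 2: always reports that results and claim line up."""
    return SimpleNamespace(
        valid=True, issue="", actual=experimental_results, stated=hypothesis
    )


def audit_claims(hypothesis, experimental_results):
    """Stub for Stage 3: always reports the claim as statistically sound."""
    return SimpleNamespace(sound=True, issue="", detail="", fix="")


check = verify_integrity(generate_code_via_llm("demo"), execute_sandbox("demo")[0])
print(check.passed)
```

With these stubs the Reviewer approves on the first pass. Making `audit_claims` return `sound=False` is a quick way to exercise the retry and escalation paths.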

Key Pattern: Separation of Concerns

Notice the architecture:

Executor (Claude): Writes code, runs experiments, optimizes for progress
Reviewer (GPT-4o): Audits work, catches errors, optimizes for correctness

They're decoupled. The Executor doesn't judge its own work, and the Reviewer doesn't write code. Each does what it's good at.

Conflict Resolution Strategies

What happens when agents disagree?

Strategy 1: Adversarial Debate

Executor: "I ran this experiment 1,000 times"
Reviewer: "That's overkill. 100 runs is sufficient."

Executor: "But I want low variance..."
Reviewer: "Show me the variance reduction from 100 to 1,000"

Executor: *computes* "Variance reduced by 0.02%"
Reviewer: "That's negligible for our purpose. 100 runs is fine."

Consensus: Use 100 runs, save 90% compute.

Strategy 2: Escalation to Human

After 3 iterations, agents still disagree.
System escalates to human reviewer.

Human makes call: "Run 500 times, split the difference."

Strategy 3: Multi-Model Voting

Instead of 2 agents, use 3+ reviewers.

Results:
- Claude: "Approve"
- GPT-4o: "Request changes"
- LLaMA: "Approve"

Voting rule: 2/3 approve → proceed

Orchestration Patterns: Beyond Simple Pairs

Once you master Executor-Reviewer pairs, you can compose larger swarms.

Pattern 1: Tournament

Multiple Executors propose different solutions.
Reviewer ranks them.
Best solution advances.

Use case: Hyperparameter optimization

Pattern 2: Assembly Line

Agent 1: Literature review
Agent 2: Experimental design (awaits Agent 1)
Agent 3: Execution (awaits Agent 2)
Agent 4: Analysis (awaits Agent 3)
Agent 5: Writing (awaits Agent 4)
Agent 6: Review (awaits Agent 5)

Depends-on edges create a DAG of execution.

Pattern 3: Hierarchical

Manager Agent: Routes tasks
├─ Research Sub-Agent: Runs experiments
├─ Writing Sub-Agent: Generates prose
├─ Review Sub-Agent: Quality checks
└─ Integration Sub-Agent: Combines results

Pattern 4: Cross-Domain Debate

ML Expert Agent: "Use XGBoost"
Domain Expert Agent: "Use linear model (interpretability)"

System: Runs both, benchmarks on domain metrics.
Consensus: "Use XGBoost with SHAP for interpretability."

Real-World Results

ARIS benchmarks (early 2026):

Research Paper Generation

Metric               Traditional                            ARIS
Time to manuscript   2-4 weeks                              4-8 hours
Manual intervention  High (20+ corrections)                 Low (1-2 clarifications)
Hallucinations       ~2-5 per paper                         0 (caught in assurance)
Reproducibility      ~60% (code lost, hyperparams unclear)  100% (all logged)
Publication-ready    Partial (needs editing)                Yes (light polish)

Code Quality

Metric                 Agent-Only  ARIS (with Reviewer)
Bugs per 100 lines     2.3         0.4
Edge cases handled     40%         92%
Test coverage          30%         78%
Maintainability score  6.2/10      8.7/10

Accuracy of Claims

Setup        Unsupported Claims  Trivial Claims  Properly Supported
No Review    18%                 12%             70%
With Review  0%                  3%              97%

The Assurance Layer catches what would be plausible but wrong.

Common Misconceptions

Misconception 1: "Orchestration means more latency"

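Truth: Well-designed parallelization can reduce latency despite orchestration overhead, provided the DAG has independent branches.

Sequential: sum of all task durations
Parallel orchestrated: longest dependency chain, plus overhead

When tasks take hours and routing overhead takes minutes, every independent branch is a net win.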

Misconception 2: "You need perfect agents for good swarms"

Truth: Imperfect agents with good orchestration can beat stronger agents working alone.

Median agent + strong reviewer > Excellent agent (unreviewed)

Misconception 3: "Adversarial loops just slow things down"

Truth: They prevent worse failures (hallucinations, silent bugs).

Cost of debate: 20% slower
Cost of missed bugs in production: 10x slower recovery

Misconception 4: "You need to build custom orchestration"

Truth: LangGraph, AutoGen, CrewAI provide ready-made patterns.

You're configuring, not coding from scratch.

The Future: Orchestration as Core Skill

The frontier of AI isn't "make models smarter." It's "orchestrate agents to work together reliably."

For ML engineers in 2026, this is the critical skill gap.

You need to master:

State management (what information passes between agents?)
Task decomposition (how to break large goals into solvable subtasks?)
Conflict resolution (what happens when agents disagree?)
Assurance patterns (how to catch hallucinations?)
Parallel execution (how to run independent tasks concurrently?)

The engineers building multi-agent orchestration systems will be the ones building the reliable AI infrastructure that enterprises actually deploy.

Single agents are toys. Orchestrated swarms are production systems.

Implementation Checklist

Understand ReAct pattern (done if you read previous posts)
Learn LangGraph state graphs (2-3 hours)
Build a simple Executor-Reviewer pair (4-6 hours)
Add a third agent (Analyzer) (3-4 hours)
Implement conflict resolution logic (2-3 hours)
Add parallel execution (3-4 hours)
Build simple assurance checks (verification, auditing) (4-6 hours)
Benchmark and optimize (2-3 hours)

Total: ~20-29 hours to a working Executor-Reviewer system

The Meta-Question

If agents can write papers while you sleep, what do you do?

The answer: You orchestrate the agents.

You define success criteria. You build assurance frameworks. You decide when to scale from one agent to ten.

You stop asking "how do I solve this?" and start asking "how do I build a system that solves this, verifies the solution, and handles failure gracefully?"

That's the job of 2026.

Published: May 21, 2026 | Last updated: May 21, 2026

This post draws from ARIS framework research (open-source, May 2026) and production orchestration patterns being deployed at leading ML research institutions.