Long-Horizon Agents

A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives.

A single model call answers one prompt. An agent loop extends this into a process:

\text{observe} \rightarrow \text{plan} \rightarrow \text{act} \rightarrow \text{observe} \rightarrow \cdots

The word “long-horizon” means the task cannot be solved reliably in one step. The agent must preserve intent across time.

Examples include:

| Task | Why it is long-horizon |
| --- | --- |
| Building a software feature | Requires reading code, editing files, testing, debugging |
| Researching a topic | Requires search, source selection, synthesis, citation |
| Planning a trip | Requires constraints, availability, routes, booking options |
| Running an experiment | Requires setup, execution, measurement, analysis |
| Robot manipulation | Requires perception, motion, feedback, correction |

The central problem is control. The agent must decide what to do next, not merely predict the next token.

Agent State

An agent maintains state across steps. State includes the task objective, observations, tool results, partial outputs, memory, and constraints.

A minimal agent state can be written as:

s_t = (g, o_{\leq t}, a_{<t}, m_t),

where:

| Symbol | Meaning |
| --- | --- |
| $g$ | Goal |
| $o_{\leq t}$ | Observations up to time $t$ |
| $a_{<t}$ | Previous actions |
| $m_t$ | Memory at time $t$ |

The next action is selected from this state:

a_t \sim \pi_\theta(a \mid s_t).

Here $\pi_\theta$ is the model’s policy. In an LLM agent, the policy is usually implemented by a language model prompted with the current state and available tools.
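The state tuple above can be sketched as a small data structure. This is a hypothetical sketch; the class and field names are illustrative, not a standard API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str                                         # g: the task objective
    observations: list = field(default_factory=list)  # o_{<=t}
    actions: list = field(default_factory=list)       # a_{<t}
    memory: dict = field(default_factory=dict)        # m_t

    def update(self, action, observation):
        # Record the step so the policy can condition on the full history.
        self.actions.append(action)
        self.observations.append(observation)
```

An LLM-based policy would serialize a state like this into the prompt before each model call.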

Agent Loop

A basic agent loop has four stages.

| Stage | Role |
| --- | --- |
| Observe | Read current environment state |
| Decide | Select next action |
| Act | Execute tool call or produce output |
| Update | Store result and revise state |

In pseudocode:

state = init_state(goal)

for step in range(max_steps):
    action = policy(state)

    if action.type == "final":
        return action.output

    observation = environment.step(action)
    state = update_state(state, action, observation)

This loop is simple. Real systems add validation, tool schemas, error handling, budgets, retries, and safety checks.
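The loop above can be made concrete with a toy environment. In this sketch the counting task, the rule-based policy, and the helper names are all invented for illustration; in a real agent the policy would be a model call:

```python
from dataclasses import dataclass

@dataclass
class Action:
    type: str
    output: object = None

def policy(state):
    # Hypothetical rule-based policy: stop once the goal count is reached.
    if state["count"] >= state["goal"]:
        return Action("final", output=state["count"])
    return Action("increment")

def environment_step(state, action):
    return state["count"] + 1  # the observation is the new count

def run_agent(goal, max_steps=10):
    state = {"goal": goal, "count": 0}
    for _ in range(max_steps):
        action = policy(state)
        if action.type == "final":
            return action.output
        observation = environment_step(state, action)
        state["count"] = observation
    return None  # budget exhausted without reaching the goal
```

Even this toy version exhibits the key property: the final output depends on state accumulated across steps, not on any single call.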

Planning

Planning decomposes a goal into steps.

For example, a coding task may become:

  1. Inspect repository structure.
  2. Locate relevant files.
  3. Read interfaces.
  4. Edit implementation.
  5. Run tests.
  6. Fix failures.
  7. Summarize changes.

The plan gives structure, but it should not be rigid. Tool results may reveal that the original plan was wrong.

A good agent uses a plan as a working hypothesis.

Planning may be explicit, stored as text, or implicit, represented inside the hidden state of the model. Explicit plans are easier to inspect and revise.

Replanning

Long-horizon tasks rarely follow the first plan exactly.

Replanning occurs when:

| Trigger | Example |
| --- | --- |
| Observation contradicts assumption | File name differs from expected |
| Tool call fails | API returns an error |
| Task becomes underspecified | Multiple valid interpretations appear |
| New constraint appears | User adds a deadline |
| Partial result is poor | Test failure exposes a bug |

Replanning can be represented as:

\pi_\theta(a_t \mid s_t)

where the current state $s_t$ includes the latest observations. The agent does not choose actions from the initial prompt alone.

Tools

Tools allow an agent to affect the world or inspect external state.

Common tool types include:

| Tool | Purpose |
| --- | --- |
| Search | Retrieve external information |
| Code execution | Run programs and tests |
| File editing | Modify project state |
| Database query | Read structured records |
| Browser | Inspect web pages |
| Calendar or email | Operate personal workflows |
| Robot controller | Move in physical space |

A tool call usually has a schema:

{
    "name": "search",
    "arguments": {
        "query": "PyTorch distributed data parallel tutorial"
    }
}

Schemas constrain actions. They make tool use easier to validate and safer to execute.
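Validating a call against its schema can be as simple as checking the tool name and required arguments. The registry format below is an assumption for illustration; production systems typically use JSON Schema or similar:

```python
# Hypothetical tool registry: name -> set of required argument keys.
TOOL_SCHEMAS = {
    "search": {"query"},
    "run_code": {"source"},
}

def validate_tool_call(call):
    """Return (ok, reason) for a tool-call dict like the example above."""
    name = call.get("name")
    if name not in TOOL_SCHEMAS:
        return False, f"unknown tool: {name}"
    missing = TOOL_SCHEMAS[name] - set(call.get("arguments", {}))
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"
```

Rejecting a malformed call before execution is cheaper and safer than recovering from its side effects afterwards.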

Tool Selection

Tool selection is a decision problem. The agent must choose whether to answer directly, retrieve information, run code, ask for clarification, or stop.

A weak agent may overuse tools or underuse them; both errors matter.

| Error | Consequence |
| --- | --- |
| Tool overuse | Slow, noisy, expensive |
| Tool underuse | Stale or unsupported answers |
| Wrong tool | Irrelevant observation |
| Wrong arguments | Failed or misleading result |

Tool selection improves when the system has clear action descriptions, examples, and feedback from tool results.

Memory

Long-horizon agents need memory because the full history may exceed the context window.

Memory can be divided into several kinds.

| Memory type | Description |
| --- | --- |
| Working memory | Current task state |
| Episodic memory | Previous events and actions |
| Semantic memory | Stable facts and knowledge |
| Procedural memory | Reusable methods and policies |
| External memory | Documents, databases, vector stores |

Working memory is usually included directly in the prompt. External memory is retrieved when relevant.

A memory write should be selective. Storing everything creates noise. Storing too little causes forgetting.
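One way to make writes selective is to score each entry's importance and keep only the top entries under a capacity cap. The class below is an illustrative sketch; the scoring values would come from the model or a heuristic:

```python
import heapq

class SelectiveMemory:
    """Keep only the `capacity` highest-importance entries."""

    def __init__(self, capacity=3):
        self.capacity = capacity
        self._heap = []    # min-heap of (importance, counter, text)
        self._counter = 0  # tie-breaker so texts are never compared

    def write(self, text, importance):
        item = (importance, self._counter, text)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif importance > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)  # evict least important

    def read(self):
        return [text for _, _, text in sorted(self._heap, reverse=True)]
```

The capacity cap directly encodes the trade-off in the text: too large and retrieval becomes noisy, too small and the agent forgets constraints.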

Reflection and Self-Evaluation

Many agent systems include a reflection step. The agent reviews its own intermediate result and decides whether it is sufficient.

Example checks:

| Task | Self-evaluation question |
| --- | --- |
| Coding | Did tests pass? |
| Research | Are claims supported by sources? |
| Planning | Are constraints satisfied? |
| Math | Does substitution verify the answer? |
| Writing | Does the output match the requested style? |

A reflection step may produce:

The current answer lacks source citations. Retrieve primary sources before finalizing.

Reflection is useful only when connected to action. A critique that does not change behavior adds cost without improving the result.

Verification

Verification checks whether the agent’s output satisfies external criteria.

| Domain | Verification method |
| --- | --- |
| Code | Unit tests, type checks, linters |
| Math | Proof checking, substitution |
| Retrieval | Source citation and quote matching |
| Data analysis | Recomputed statistics |
| Tool use | API response validation |

Verification separates plausible generation from reliable execution.

For code agents, the test suite is often more valuable than model self-confidence. For research agents, citations and source inspection are more valuable than fluent prose.
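A verifier for code output can simply run the candidate against input/expected pairs instead of trusting the model's own confidence. The candidate function and checks below are invented for illustration:

```python
def verify(candidate_fn, checks):
    """Run (args, expected) checks; return the list of failures."""
    failures = []
    for args, expected in checks:
        try:
            result = candidate_fn(*args)
        except Exception as exc:  # a crash also counts as a failure
            failures.append((args, repr(exc)))
            continue
        if result != expected:
            failures.append((args, result))
    return failures

# Illustrative "agent output": a sorting function with a subtle bug.
def buggy_sort(xs):
    return sorted(xs)[:-1] if len(xs) > 3 else sorted(xs)

checks = [(([3, 1, 2],), [1, 2, 3]), (([4, 3, 2, 1],), [1, 2, 3, 4])]
```

Here the second check exposes the bug that a plausible-looking answer would have hidden, which is exactly the separation between plausible generation and reliable execution.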

Credit Assignment

Long-horizon tasks make learning difficult because success or failure may depend on actions taken many steps earlier.

If an agent fails at step 40, the cause may be:

  • a bad assumption at step 3
  • poor retrieval at step 9
  • a wrong edit at step 18
  • missing validation at step 30

This is a credit assignment problem.

In reinforcement learning terms, the return depends on a trajectory:

\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T).

The objective is to improve the policy:

\max_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)].

Long trajectories make this objective high variance. Practical agent training often uses shorter supervised traces, preference data, tool-use demonstrations, or process rewards.
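To see why variance grows with horizon length, note that this objective is usually optimized with the standard score-function (REINFORCE) gradient identity:

\nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right].

The per-step term $\pi_\theta(a_t \mid s_t)$ is the same policy defined earlier. Because a single trajectory-level return $R(\tau)$ multiplies the gradient of every step, the estimator sums $T$ noisy terms, and a 40-step failure gives no direct signal about which step caused it.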

Process Supervision

Outcome supervision judges only the final answer. Process supervision gives feedback on intermediate steps.

For long-horizon agents, process supervision is often more informative.

| Supervision type | Signal |
| --- | --- |
| Outcome supervision | Final task success |
| Process supervision | Quality of steps |
| Tool supervision | Correct tool call |
| Verification supervision | Test or checker result |

A coding agent can learn from traces where each step is labeled as useful or harmful. A research agent can learn whether a citation actually supports a claim.

Hierarchical Agents

A long task can be decomposed hierarchically.

A high-level planner chooses subgoals. Low-level workers execute them.

\text{Goal} \rightarrow \text{Subgoal} \rightarrow \text{Action}.

For example:

| Level | Coding agent example |
| --- | --- |
| High-level | Implement authentication |
| Mid-level | Add middleware |
| Low-level | Edit auth.py |

Hierarchical control reduces complexity. Each layer operates at a different temporal scale.

Multi-Agent Systems

Some systems use multiple agents with distinct roles.

| Agent role | Function |
| --- | --- |
| Planner | Break down the task |
| Researcher | Gather evidence |
| Coder | Modify implementation |
| Critic | Review output |
| Executor | Run tests or tools |

Multi-agent systems can improve coverage, but they introduce coordination costs. Agents may duplicate work, disagree, or amplify errors.

A multi-agent design should have clear responsibilities and a final arbitration mechanism.
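A final arbitration mechanism can be as simple as a critic that scores every candidate and picks one winner. The scoring rule below is purely illustrative; a real critic would itself be a model call:

```python
def arbitrate(candidates, score_fn):
    """Pick the candidate output with the highest critic score."""
    return max(candidates, key=score_fn)

# Illustrative critic: prefer answers that carry source markers.
def critic_score(answer):
    return answer.count("[source]")

candidates = [
    "The fix is in auth.py.",
    "The fix is in auth.py. [source] [source]",
]
```

The important design property is that exactly one agent (or function) has the authority to produce the final output, so disagreements terminate instead of looping.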

Agent Environments

An agent environment defines what actions are possible and what observations are returned.

Examples:

| Environment | Actions |
| --- | --- |
| Shell | Run commands |
| Browser | Open, click, search |
| Codebase | Read, patch, test |
| Game | Move, inspect, interact |
| Robot | Sense, move, grasp |

The environment determines the agent’s effective capabilities.

A language model without tools can describe actions. A language model with tools can execute actions.

Safety Constraints

Long-horizon agents require stricter safety controls than single-turn systems because they can act repeatedly.

Important constraints include:

| Constraint | Purpose |
| --- | --- |
| Permission boundaries | Prevent unauthorized actions |
| Tool allowlists | Limit available operations |
| Budget limits | Bound cost and time |
| Human approval | Gate sensitive actions |
| Sandboxing | Contain execution |
| Logging | Support auditability |

The longer the horizon, the more opportunities exist for compounding errors.
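Budget limits are typically enforced by the loop itself, failing closed when either the step or cost allowance is exhausted. The `Budget` class and the per-step costs below are illustrative assumptions:

```python
class BudgetExceeded(Exception):
    pass

class Budget:
    """Track spend across steps and fail closed when exhausted."""

    def __init__(self, max_steps, max_cost):
        self.steps_left = max_steps
        self.cost_left = max_cost

    def charge(self, cost):
        self.steps_left -= 1
        self.cost_left -= cost
        if self.steps_left < 0 or self.cost_left < 0:
            raise BudgetExceeded("agent budget exhausted")

budget = Budget(max_steps=3, max_cost=10.0)
spent = []
try:
    for step_cost in [2.0, 3.0, 6.0]:  # the third step overshoots the cap
        budget.charge(step_cost)
        spent.append(step_cost)
except BudgetExceeded:
    pass
```

Raising an exception rather than returning a flag makes it harder for a buggy loop to ignore an exhausted budget.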

Failure Modes

Long-horizon agents fail in characteristic ways.

| Failure mode | Description |
| --- | --- |
| Goal drift | Agent gradually departs from the user’s objective |
| Looping | Agent repeats similar actions |
| Premature stopping | Agent finishes before verification |
| Tool hallucination | Agent assumes tool results that did not occur |
| Context loss | Important constraints disappear |
| Overplanning | Agent spends effort planning instead of acting |
| Error accumulation | Small mistakes compound |

The best practical defense is state discipline: preserve constraints, record observations, verify outputs, and stop when the objective is satisfied.
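Some of these failure modes can be detected mechanically from the action log. The looping check below is a minimal sketch; the window size and threshold are invented defaults:

```python
from collections import Counter

def is_looping(actions, window=6, threshold=3):
    """Flag when one action repeats `threshold` times in the recent window."""
    recent = actions[-window:]
    if not recent:
        return False
    _, count = Counter(recent).most_common(1)[0]
    return count >= threshold
```

A loop that trips this check might escalate to a human, replan, or stop, rather than spend the remaining budget repeating itself.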

PyTorch View of Agents

An agent is not usually a single PyTorch module. It is a system around a model.

Still, the policy model can be represented abstractly:

class AgentPolicy(torch.nn.Module):
    def __init__(self, backbone, hidden_dim, num_actions):
        super().__init__()
        self.backbone = backbone  # e.g. a transformer encoder
        self.action_head = torch.nn.Linear(hidden_dim, num_actions)

    def forward(self, state_tokens):
        hidden = self.backbone(state_tokens)  # (batch, seq, hidden)
        action_logits = self.action_head(hidden[:, -1])  # last-token logits
        return action_logits

A tool-using system wraps this model in an execution loop:

state = encode_task(user_goal)

for _ in range(max_steps):
    action = decode_action(policy(state))

    if action.is_final:
        break

    observation = run_tool(action)
    state = append_observation(state, action, observation)

In practice, modern agents often use pretrained foundation models rather than training an agent policy from scratch. The important concept is the separation between model prediction and environment interaction.

Evaluation

Long-horizon agents are evaluated by task success rather than next-token accuracy.

Useful metrics include:

| Metric | Meaning |
| --- | --- |
| Success rate | Fraction of completed tasks |
| Step count | Efficiency |
| Tool error rate | Quality of tool use |
| Verification pass rate | Objective correctness |
| Cost | Tokens, compute, API calls |
| Human intervention rate | Need for assistance |

A good benchmark should include tasks with hidden tests or independent verification. Otherwise, the agent may produce plausible but incorrect outputs.
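The metrics above can be aggregated from per-episode logs. The episode record format here is an assumption; real harnesses log richer traces:

```python
def summarize(episodes):
    """Aggregate per-episode logs into success, efficiency, and tool metrics."""
    n = len(episodes)
    tool_calls = sum(e["tool_calls"] for e in episodes)
    tool_errors = sum(e["tool_errors"] for e in episodes)
    return {
        "success_rate": sum(e["success"] for e in episodes) / n,
        "mean_steps": sum(e["steps"] for e in episodes) / n,
        "tool_error_rate": tool_errors / max(tool_calls, 1),
    }

episodes = [
    {"success": True, "steps": 12, "tool_calls": 8, "tool_errors": 1},
    {"success": False, "steps": 20, "tool_calls": 12, "tool_errors": 3},
]
```

Note that tool error rate is computed over total tool calls, not over episodes, so one tool-heavy episode cannot dominate the metric.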

Summary

Long-horizon agents extend foundation models into goal-directed systems. They maintain state, plan, use tools, update memory, verify results, and revise behavior over many steps.

The main theoretical ideas are policy learning, state representation, planning, tool use, memory, credit assignment, and process supervision. The main engineering problems are context management, tool reliability, verification, safety boundaries, and cost control.

In PyTorch terms, the neural model supplies a policy over actions. The full agent is the loop that connects this policy to tools, memory, observations, and external verification.