A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives.
A single model call answers one prompt. An agent loop extends this into a process: observe, decide, act, and update, repeated until the goal is met.
The word “long-horizon” means the task cannot be solved reliably in one step. The agent must preserve intent across time.
Examples include:
| Task | Why it is long-horizon |
|---|---|
| Building a software feature | Requires reading code, editing files, testing, debugging |
| Researching a topic | Requires search, source selection, synthesis, citation |
| Planning a trip | Requires constraints, availability, routes, booking options |
| Running an experiment | Requires setup, execution, measurement, analysis |
| Robot manipulation | Requires perception, motion, feedback, correction |
The central problem is control. The agent must decide what to do next, not merely predict the next token.
Agent State
An agent maintains state across steps. State includes the task objective, observations, tool results, partial outputs, memory, and constraints.
A minimal agent state can be written as:

$$s_t = (g,\; o_{1:t},\; a_{1:t-1},\; m_t)$$

where:

| Symbol | Meaning |
|---|---|
| $g$ | Goal |
| $o_{1:t}$ | Observations up to time $t$ |
| $a_{1:t-1}$ | Previous actions |
| $m_t$ | Memory at time $t$ |

The next action is selected from this state:

$$a_t \sim \pi_\theta(\cdot \mid s_t)$$

Here $\pi_\theta$ is the model's policy. In an LLM agent, the policy is usually implemented by a language model prompted with the current state and available tools.
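Concretely, the state tuple can be carried in a small data structure and serialized into the prompt the policy conditions on. A minimal sketch; the field names and the `render_prompt` helper are illustrative, not a standard API:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str                                               # g: the task objective
    observations: list[str] = field(default_factory=list)  # o_1 .. o_t
    actions: list[str] = field(default_factory=list)        # a_1 .. a_{t-1}
    memory: dict[str, str] = field(default_factory=dict)    # m_t

def render_prompt(state: AgentState) -> str:
    """Serialize the state into the prompt an LLM policy conditions on."""
    history = "\n".join(f"{a} -> {o}" for a, o in zip(state.actions, state.observations))
    return f"Goal: {state.goal}\nHistory:\n{history}\nNext action:"
```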
Agent Loop
A basic agent loop has four stages.
| Stage | Role |
|---|---|
| Observe | Read current environment state |
| Decide | Select next action |
| Act | Execute tool call or produce output |
| Update | Store result and revise state |
In pseudocode:

```python
state = init_state(goal)
for step in range(max_steps):
    action = policy(state)
    if action.type == "final":
        return action.output
    observation = environment.step(action)
    state = update_state(state, action, observation)
```

This loop is simple. Real systems add validation, tool schemas, error handling, budgets, retries, and safety checks.
Planning
Planning decomposes a goal into steps.
For example, a coding task may become:
- Inspect repository structure.
- Locate relevant files.
- Read interfaces.
- Edit implementation.
- Run tests.
- Fix failures.
- Summarize changes.
The plan gives structure, but it should not be rigid. Tool results may reveal that the original plan was wrong.
A good agent uses a plan as a working hypothesis.
Planning may be explicit, stored as text, or implicit, represented inside the hidden state of the model. Explicit plans are easier to inspect and revise.
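Because explicit plans are easier to inspect, one common representation is a plain list of pending steps that the agent rewrites as observations arrive. A minimal sketch; the revision rule here is a hypothetical placeholder:

```python
plan = [
    "Inspect repository structure",
    "Locate relevant files",
    "Read interfaces",
    "Edit implementation",
    "Run tests",
]

def revise_plan(plan: list[str], observation: str) -> list[str]:
    """Treat the plan as a working hypothesis: rewrite it when evidence demands."""
    # Hypothetical rule: a failing test prepends a debugging step.
    if "test failed" in observation.lower():
        return ["Diagnose failing test"] + plan
    return plan
```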
Replanning
Long-horizon tasks rarely follow the first plan exactly.
Replanning occurs when:
| Trigger | Example |
|---|---|
| Observation contradicts assumption | File name differs from expected |
| Tool call fails | API returns an error |
| Task becomes underspecified | Multiple valid interpretations appear |
| New constraint appears | User adds a deadline |
| Partial result is poor | Test failure exposes a bug |
Replanning can be represented as:

$$a_t \sim \pi_\theta(\cdot \mid s_t)$$

where the current state $s_t$ includes the latest observations. The agent does not choose actions from the initial prompt alone.
Tools
Tools allow an agent to affect the world or inspect external state.
Common tool types include:
| Tool | Purpose |
|---|---|
| Search | Retrieve external information |
| Code execution | Run programs and tests |
| File editing | Modify project state |
| Database query | Read structured records |
| Browser | Inspect web pages |
| Calendar or email | Operate personal workflows |
| Robot controller | Move in physical space |
A tool call usually has a schema:

```json
{
  "name": "search",
  "arguments": {
    "query": "PyTorch distributed data parallel tutorial"
  }
}
```

Schemas constrain actions. They make tool use easier to validate and safer to execute.
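One way to enforce such a schema before execution, sketched here with plain dictionary checks rather than any particular validation library:

```python
SEARCH_SCHEMA = {"name": "search", "required_args": {"query": str}}

def validate_call(call: dict, schema: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the call is acceptable."""
    errors = []
    if call.get("name") != schema["name"]:
        errors.append(f"unknown tool: {call.get('name')!r}")
    args = call.get("arguments", {})
    for arg, expected_type in schema["required_args"].items():
        if arg not in args:
            errors.append(f"missing argument: {arg}")
        elif not isinstance(args[arg], expected_type):
            errors.append(f"argument {arg} must be {expected_type.__name__}")
    return errors

# Example: validate the call shown above.
call = {"name": "search", "arguments": {"query": "PyTorch distributed data parallel tutorial"}}
assert validate_call(call, SEARCH_SCHEMA) == []
```

In a real system, the same checks run before every tool invocation, so malformed calls are rejected rather than executed.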
Tool Selection
Tool selection is a decision problem. The agent must choose whether to answer directly, retrieve information, run code, ask for clarification, or stop.
A weak agent may either overuse tools or underuse them. Both errors matter.
| Error | Consequence |
|---|---|
| Tool overuse | Slow, noisy, expensive |
| Tool underuse | Stale or unsupported answers |
| Wrong tool | Irrelevant observation |
| Wrong arguments | Failed or misleading result |
Tool selection improves when the system has clear action descriptions, examples, and feedback from tool results.
Memory
Long-horizon agents need memory because the full history may exceed the context window.
Memory can be divided into several kinds.
| Memory type | Description |
|---|---|
| Working memory | Current task state |
| Episodic memory | Previous events and actions |
| Semantic memory | Stable facts and knowledge |
| Procedural memory | Reusable methods and policies |
| External memory | Documents, databases, vector stores |
Working memory is usually included directly in the prompt. External memory is retrieved when relevant.
A memory write should be selective. Storing everything creates noise. Storing too little causes forgetting.
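A selective write policy can be as simple as a gate that scores each candidate entry before storing it. The scoring heuristic below is a hypothetical placeholder; real systems might use embedding similarity or an auxiliary model:

```python
def score_relevance(entry: str, goal: str) -> float:
    """Hypothetical heuristic: fraction of goal words that appear in the entry."""
    goal_words = set(goal.lower().split())
    entry_words = set(entry.lower().split())
    return len(goal_words & entry_words) / max(len(goal_words), 1)

def maybe_store(memory: list[str], entry: str, goal: str, threshold: float = 0.2) -> None:
    """Write only entries that clear a relevance threshold, to avoid noise."""
    if score_relevance(entry, goal) >= threshold:
        memory.append(entry)
```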
Reflection and Self-Evaluation
Many agent systems include a reflection step. The agent reviews its own intermediate result and decides whether it is sufficient.
Example checks:
| Task | Self-evaluation question |
|---|---|
| Coding | Did tests pass? |
| Research | Are claims supported by sources? |
| Planning | Are constraints satisfied? |
| Math | Does substitution verify the answer? |
| Writing | Does the output match the requested style? |
A reflection step may produce:

> The current answer lacks source citations. Retrieve primary sources before finalizing.

Reflection is useful only when connected to action. A critique that does not change behavior adds cost without improving the result.
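One way to keep reflection connected to action is to map each critique to a concrete next step rather than merely logging it. A minimal sketch with hypothetical action names:

```python
def next_action_from_critique(critique: str) -> str:
    """Map a reflection critique to a concrete next action instead of discarding it."""
    if "lacks source citations" in critique.lower():
        return "retrieve_sources"
    if "tests failed" in critique.lower():
        return "debug"
    return "finalize"  # no blocking issue found
```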
Verification
Verification checks whether the agent’s output satisfies external criteria.
| Domain | Verification method |
|---|---|
| Code | Unit tests, type checks, linters |
| Math | Proof checking, substitution |
| Retrieval | Source citation and quote matching |
| Data analysis | Recomputed statistics |
| Tool use | API response validation |
Verification separates plausible generation from reliable execution.
For code agents, the test suite is often more valuable than model self-confidence. For research agents, citations and source inspection are more valuable than fluent prose.
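For a code agent, the simplest concrete verifier is the project's own test suite, run in a subprocess and reduced to a pass/fail signal. A minimal sketch, assuming `pytest` is installed and the project directory is the test root:

```python
import subprocess

def tests_pass(project_dir: str) -> bool:
    """Run the test suite; a zero exit code means all tests passed."""
    result = subprocess.run(
        ["pytest", "-q"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

The agent treats a nonzero exit code as failed verification and returns to debugging instead of finalizing.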
Credit Assignment
Long-horizon tasks make learning difficult because success or failure may depend on actions taken many steps earlier.
If an agent fails at step 40, the cause may be:
- a bad assumption at step 3
- poor retrieval at step 9
- a wrong edit at step 18
- missing validation at step 30
This is a credit assignment problem.
In reinforcement learning terms, the return depends on a trajectory:

$$R(\tau) = \sum_{t=1}^{T} r_t, \qquad \tau = (s_1, a_1, \ldots, s_T, a_T)$$

The objective is to improve the policy:

$$\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$
Long trajectories make this objective high variance. Practical agent training often uses shorter supervised traces, preference data, tool-use demonstrations, or process rewards.
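The variance problem is visible in the plain REINFORCE estimator for this objective: every action's log-probability is scaled by the return of the entire trajectory, whether or not that action helped. A minimal PyTorch sketch:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """log_probs: (T,) log pi(a_t | s_t); rewards: (T,) per-step rewards."""
    trajectory_return = rewards.sum()
    # Every step is credited with the full return, helpful or not:
    # this is the credit assignment problem, and the source of high variance.
    return -(log_probs * trajectory_return).sum()
```

A baseline or per-step reward reduces this variance, which is one motivation for the process supervision discussed next.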
Process Supervision
Outcome supervision judges only the final answer. Process supervision gives feedback on intermediate steps.
For long-horizon agents, process supervision is often more informative.
| Supervision type | Signal |
|---|---|
| Outcome supervision | Final task success |
| Process supervision | Quality of steps |
| Tool supervision | Correct tool call |
| Verification supervision | Test or checker result |
A coding agent can learn from traces where each step is labeled as useful or harmful. A research agent can learn whether a citation actually supports a claim.
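With per-step labels, the training signal becomes a classification loss over individual steps rather than a single trajectory-level reward. A minimal sketch, assuming each step is labeled useful (1) or harmful (0):

```python
import torch
import torch.nn.functional as F

def process_supervision_loss(step_scores: torch.Tensor, step_labels: torch.Tensor) -> torch.Tensor:
    """step_scores: (T,) model scores per step; step_labels: (T,) 1=useful, 0=harmful."""
    # Binary cross-entropy gives feedback on every intermediate step,
    # not just the final outcome.
    return F.binary_cross_entropy_with_logits(step_scores, step_labels.float())
```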
Hierarchical Agents
A long task can be decomposed hierarchically.
A high-level planner chooses subgoals. Low-level workers execute them.
For example:
| Level | Coding agent example |
|---|---|
| High-level | Implement authentication |
| Mid-level | Add middleware |
| Low-level | Edit auth.py |
Hierarchical control reduces complexity. Each layer operates at a different temporal scale.
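The two timescales can be written as nested loops: the planner proposes subgoals infrequently, and a worker takes many primitive actions per subgoal. A sketch with hypothetical `planner` and `worker` interfaces:

```python
def run_hierarchical(goal: str, planner, worker, max_subgoals: int = 10) -> None:
    for _ in range(max_subgoals):
        subgoal = planner.next_subgoal(goal)   # slow timescale: choose a subgoal
        if subgoal is None:
            break                              # planner decides the goal is met
        worker.execute(subgoal)                # fast timescale: many low-level steps
```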
Multi-Agent Systems
Some systems use multiple agents with distinct roles.
| Agent role | Function |
|---|---|
| Planner | Break down the task |
| Researcher | Gather evidence |
| Coder | Modify implementation |
| Critic | Review output |
| Executor | Run tests or tools |
Multi-agent systems can improve coverage, but they introduce coordination costs. Agents may duplicate work, disagree, or amplify errors.
A multi-agent design should have clear responsibilities and a final arbitration mechanism.
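A minimal arbitration mechanism scores the candidates produced by different agents and selects one final output. A sketch, assuming a caller-supplied `score` function:

```python
def arbitrate(candidates: dict[str, str], score) -> str:
    """Pick one final output from competing agents using a single scoring rule."""
    best_role = max(candidates, key=lambda role: score(candidates[role]))
    return candidates[best_role]
```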
Agent Environments
An agent environment defines what actions are possible and what observations are returned.
Examples:
| Environment | Actions |
|---|---|
| Shell | Run commands |
| Browser | Open, click, search |
| Codebase | Read, patch, test |
| Game | Move, inspect, interact |
| Robot | Sense, move, grasp |
The environment determines the agent’s effective capabilities.
A language model without tools can describe actions. A language model with tools can execute actions.
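The interface can be captured in a few lines: the environment accepts an action and returns an observation. A sketch using a structural `Protocol`, with a shell environment as one hypothetical implementation:

```python
import subprocess
from typing import Protocol

class Environment(Protocol):
    def step(self, action: str) -> str:
        """Execute the action and return the resulting observation."""
        ...

class ShellEnvironment:
    """One concrete environment: actions are shell commands, observations their output."""
    def step(self, action: str) -> str:
        result = subprocess.run(action, shell=True, capture_output=True, text=True)
        return result.stdout + result.stderr
```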
Safety Constraints
Long-horizon agents require stricter safety controls than single-turn systems because they can act repeatedly.
Important constraints include:
| Constraint | Purpose |
|---|---|
| Permission boundaries | Prevent unauthorized actions |
| Tool allowlists | Limit available operations |
| Budget limits | Bound cost and time |
| Human approval | Gate sensitive actions |
| Sandboxing | Contain execution |
| Logging | Support auditability |
The longer the horizon, the more opportunities exist for compounding errors.
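Allowlists and budgets compose naturally as a pre-execution check in the agent loop. A minimal sketch with hypothetical tool names and limits:

```python
ALLOWED_TOOLS = {"search", "read_file", "run_tests"}   # tool allowlist
MAX_TOOL_CALLS = 50                                    # budget limit

def guard(tool_name: str, calls_so_far: int) -> None:
    """Raise before execution if a safety constraint would be violated."""
    if tool_name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool_name!r} is not on the allowlist")
    if calls_so_far >= MAX_TOOL_CALLS:
        raise RuntimeError("tool-call budget exhausted; stopping the loop")
```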
Failure Modes
Long-horizon agents fail in characteristic ways.
| Failure mode | Description |
|---|---|
| Goal drift | Agent gradually departs from the user’s objective |
| Looping | Agent repeats similar actions |
| Premature stopping | Agent finishes before verification |
| Tool hallucination | Agent assumes tool results that did not occur |
| Context loss | Important constraints disappear |
| Overplanning | Agent spends effort planning instead of acting |
| Error accumulation | Small mistakes compound |
The best practical defense is state discipline: preserve constraints, record observations, verify outputs, and stop when the objective is satisfied.
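Looping in particular can be detected mechanically: if the last few actions are identical, the agent should be forced to replan or stop. A minimal sketch:

```python
def is_looping(actions: list[str], window: int = 3) -> bool:
    """Detect an agent repeating the same action across a sliding window."""
    recent = actions[-window:]
    return len(recent) == window and len(set(recent)) == 1
```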
PyTorch View of Agents
An agent is not usually a single PyTorch module. It is a system around a model.
Still, the policy model can be represented abstractly:
```python
import torch

class AgentPolicy(torch.nn.Module):
    def __init__(self, backbone, hidden_dim, num_actions):
        super().__init__()
        self.backbone = backbone  # e.g. a transformer over the serialized state
        self.action_head = torch.nn.Linear(hidden_dim, num_actions)

    def forward(self, state_tokens):
        hidden = self.backbone(state_tokens)             # (batch, seq, hidden_dim)
        action_logits = self.action_head(hidden[:, -1])  # last position -> action logits
        return action_logits
```

A tool-using system wraps this model in an execution loop:
```python
state = encode_task(user_goal)
for _ in range(max_steps):
    action = decode_action(policy(state))
    if action.is_final:
        break  # stop before invoking tools on a final answer
    observation = run_tool(action)
    state = append_observation(state, action, observation)
```

In practice, modern agents often use pretrained foundation models rather than training an agent policy from scratch. The important concept is the separation between model prediction and environment interaction.
Evaluation
Long-horizon agents are evaluated by task success rather than next-token accuracy.
Useful metrics include:
| Metric | Meaning |
|---|---|
| Success rate | Fraction of completed tasks |
| Step count | Efficiency |
| Tool error rate | Quality of tool use |
| Verification pass rate | Objective correctness |
| Cost | Tokens, compute, API calls |
| Human intervention rate | Need for assistance |
A good benchmark should include tasks with hidden tests or independent verification. Otherwise, the agent may produce plausible but incorrect outputs.
Summary
Long-horizon agents extend foundation models into goal-directed systems. They maintain state, plan, use tools, update memory, verify results, and revise behavior over many steps.
The main theoretical ideas are policy learning, state representation, planning, tool use, memory, credit assignment, and process supervision. The main engineering problems are context management, tool reliability, verification, safety boundaries, and cost control.
In PyTorch terms, the neural model supplies a policy over actions. The full agent is the loop that connects this policy to tools, memory, observations, and external verification.