Conversational Systems

A conversational system processes dialogue between users and machines. The system receives one or more conversational turns and generates a response. Unlike single-turn NLP tasks, dialogue systems must maintain context across multiple exchanges.

Example:

| Speaker | Text |
|---|---|
| User | What is PyTorch? |
| Assistant | PyTorch is a deep learning framework developed by Meta. |
| User | Does it support GPUs? |
| Assistant | Yes. PyTorch supports CUDA and other accelerator backends. |

The second user message contains the pronoun *it*. The system must remember that *it* refers to PyTorch. Dialogue therefore requires contextual reasoning across turns.

Modern conversational systems are usually based on transformers and large language models. Earlier systems relied heavily on rules, templates, and manually designed state machines.

Components of a Conversational System

A conversational system often contains several modules.

| Component | Purpose |
|---|---|
| Tokenizer | Converts text into tokens |
| Dialogue state tracker | Maintains conversational context |
| Retriever | Retrieves relevant knowledge or memory |
| Response generator | Produces the next response |
| Safety and filtering | Blocks harmful or invalid outputs |
| Tool interface | Calls external systems or APIs |

Some systems integrate all behavior into one large model. Others use pipelines with separate modules.

A simple chatbot pipeline may look like:

user message
-> tokenize
-> retrieve context or memory
-> language model
-> decode response
-> safety filtering
-> output
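
A toy version of this pipeline makes the flow concrete. Every component below is a deliberate stand-in (a lowercasing tokenizer, a keyword retriever, an echo "model"), not a real library API:

```python
# Toy chatbot pipeline: tokenize -> retrieve -> generate -> filter.
# All components are illustrative stand-ins, not production code.

def tokenize(text):
    return text.lower().replace("?", "").replace(".", "").split()

def retrieve_context(query, memory):
    # Keep memory entries that share at least one word with the query.
    return [m for m in memory if set(tokenize(m)) & set(tokenize(query))]

def generate_response(context_docs, tokens):
    # Placeholder for the language-model and decoding steps.
    return f"I found {len(context_docs)} relevant notes about: " + " ".join(tokens)

def safety_filter(text, blocklist=("forbidden",)):
    return all(word not in text for word in blocklist)

def respond(user_message, memory):
    tokens = tokenize(user_message)                    # tokenize
    context = retrieve_context(user_message, memory)   # retrieve
    response = generate_response(context, tokens)      # model + decode
    return response if safety_filter(response) else "[filtered]"  # safety

print(respond("What is PyTorch?", ["PyTorch supports GPUs."]))
```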

Dialogue as Sequence Modeling

A conversation can be represented as a sequence of tokens across multiple turns.

Suppose a dialogue contains turns:

User: What is PyTorch?
Assistant: A deep learning framework.
User: Who created it?

The model may receive the full dialogue history:

<User> What is PyTorch?
<Assistant> A deep learning framework.
<User> Who created it?

The task is to predict the next assistant response:

<Assistant> Meta developed PyTorch.

The probability of the response is modeled autoregressively:

$$
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x),
$$

where:

| Symbol | Meaning |
|---|---|
| $x$ | Dialogue history |
| $y$ | Response tokens |
| $y_t$ | Token at position $t$ |

The model predicts one token at a time conditioned on the previous dialogue context.
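
A minimal PyTorch sketch of this factorization scores a response by summing per-token log-probabilities. Random logits stand in for real transformer outputs:

```python
import torch
import torch.nn.functional as F

# Score a response autoregressively: sum log P(y_t | y_<t, x) over the
# response tokens. Random logits stand in for a transformer applied to
# [history + response].
vocab_size, hist_len, resp_len = 100, 8, 4
logits = torch.randn(hist_len + resp_len, vocab_size)   # one row per position
response_ids = torch.randint(0, vocab_size, (resp_len,))

# The logits at position t-1 predict the token at position t, so response
# tokens are scored by positions hist_len-1 .. hist_len+resp_len-2.
log_probs = F.log_softmax(logits[hist_len - 1 : hist_len + resp_len - 1], dim=-1)
log_p_response = log_probs[torch.arange(resp_len), response_ids].sum()
print(float(log_p_response))
```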

Dialogue History Representation

Dialogue models must represent conversation history in a structured way.

A common approach uses special role tokens:

<User>
<Assistant>
<System>

Example:

<System> You are a helpful assistant.
<User> Explain gradient descent.
<Assistant> Gradient descent updates parameters using gradients.
<User> Why does the learning rate matter?

These role markers help the model distinguish instructions, user inputs, and assistant responses.

The tokenized sequence is then passed into the transformer.
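
A small sketch of this flattening step, assuming the `<User>`/`<Assistant>`/`<System>` markers from the example above (real chat templates differ between models):

```python
# Flatten a message list into one role-tagged string before tokenization.
def render_dialogue(messages):
    tags = {"system": "<System>", "user": "<User>", "assistant": "<Assistant>"}
    return "\n".join(f"{tags[m['role']]} {m['content']}" for m in messages)

dialogue = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain gradient descent."},
]
print(render_dialogue(dialogue))
```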

Context Windows

Transformers process only a limited number of tokens. This limit is called the context window.

If the context length is $L$, then the total conversation history must fit inside $L$ tokens.

Long conversations therefore require truncation, summarization, retrieval, or external memory systems.

Common strategies:

| Strategy | Description |
|---|---|
| Truncation | Keep only recent turns |
| Summarization | Compress earlier dialogue |
| Retrieval | Retrieve relevant past turns |
| External memory | Store conversation state separately |

A simple truncation strategy keeps the most recent tokens:

keep last N tokens
discard earlier history

This is computationally simple but may forget important information.
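
In code, truncation is a single slice over the token ids:

```python
def truncate_history(token_ids, max_tokens):
    # Keep only the most recent max_tokens tokens; earlier history is dropped.
    return token_ids[-max_tokens:]

history = list(range(10))             # stand-in for real token ids
print(truncate_history(history, 4))   # [6, 7, 8, 9]
```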

Retrieval-Augmented Dialogue

Conversational systems often need external knowledge.

Example:

User: What is the latest version of PyTorch?

A frozen language model may not know current information. A retriever can fetch documents from a database or search engine.

The system becomes:

user query
-> retrieve documents
-> append retrieved text to prompt
-> generate response

This approach is called retrieval-augmented generation.

The retrieved context may include:

| Source | Example |
|---|---|
| Search engine | Web pages |
| Documentation | API references |
| Vector database | Similar embeddings |
| Conversation memory | Earlier turns |
| Knowledge base | Structured facts |

The language model conditions on retrieved evidence when generating the response.
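
A toy sketch of this flow, with a keyword-overlap retriever standing in for a real search engine or vector database:

```python
# Toy retrieval-augmented generation: retrieve evidence, then prepend it
# to the prompt. Real systems use embeddings and a vector database.

DOCS = [
    "PyTorch 2.x introduced torch.compile.",
    "Rust is a systems programming language.",
]

def retrieve(query, docs, k=1):
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(docs, key=overlap, reverse=True)[:k]

def build_prompt(query, docs):
    evidence = "\n".join(retrieve(query, docs))
    return f"Context:\n{evidence}\n\n<User> {query}\n<Assistant>"

print(build_prompt("What is the latest version of PyTorch?", DOCS))
```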

Dialogue State Tracking

Task-oriented systems often maintain structured dialogue state.

Example restaurant-booking dialogue:

| Slot | Value |
|---|---|
| Cuisine | Italian |
| City | Paris |
| Party size | 4 |
| Time | 7 PM |

The user may provide information incrementally:

User: Find an Italian restaurant.
User: In Paris.
User: For four people.

The dialogue state tracker accumulates constraints over turns.

Older dialogue systems used explicit symbolic state tracking. Modern systems often encode dialogue state implicitly inside transformer hidden representations, though structured tracking remains useful for reliability.
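
A toy tracker for the example above, with hard-coded extraction rules standing in for a trained model:

```python
# Toy dialogue state tracker: accumulate slot values across turns.
def update_state(state, utterance):
    text = utterance.lower()
    if "italian" in text:
        state["cuisine"] = "Italian"
    if "paris" in text:
        state["city"] = "Paris"
    if "four" in text:
        state["party_size"] = 4
    return state

state = {}
for turn in ["Find an Italian restaurant.", "In Paris.", "For four people."]:
    state = update_state(state, turn)
print(state)  # {'cuisine': 'Italian', 'city': 'Paris', 'party_size': 4}
```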

Intent Detection and Slot Filling

Task-oriented dialogue systems commonly separate two subtasks:

| Task | Goal |
|---|---|
| Intent detection | Identify user goal |
| Slot filling | Extract parameter values |

Example:

Book a flight to Tokyo tomorrow morning.

Intent:

BOOK_FLIGHT

Slots:

| Slot | Value |
|---|---|
| Destination | Tokyo |
| Date | tomorrow |
| Time | morning |

Intent detection is usually sentence classification. Slot filling is usually sequence labeling similar to named entity recognition.

Modern large language models often unify both tasks within generative prompting.
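
A toy illustration of both subtasks, using regular expressions where real systems use a classifier and a sequence-labeling model:

```python
import re

# Toy intent detection and slot filling. The patterns are illustrative only.
def detect_intent(text):
    return "BOOK_FLIGHT" if re.search(r"\bbook\b.*\bflight\b", text.lower()) else "UNKNOWN"

def fill_slots(text):
    slots = {}
    m = re.search(r"\bto (\w+)", text)      # crude destination extraction
    if m:
        slots["destination"] = m.group(1)
    if "tomorrow" in text.lower():
        slots["date"] = "tomorrow"
    if "morning" in text.lower():
        slots["time"] = "morning"
    return slots

utterance = "Book a flight to Tokyo tomorrow morning."
print(detect_intent(utterance), fill_slots(utterance))
# BOOK_FLIGHT {'destination': 'Tokyo', 'date': 'tomorrow', 'time': 'morning'}
```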

Response Generation

There are two major approaches to dialogue response generation.

| Approach | Description |
|---|---|
| Retrieval-based | Select existing response |
| Generative | Produce new response tokens |

Retrieval-based systems choose responses from a database or candidate set. They are easier to control and safer but less flexible.

Generative systems synthesize responses token by token. They are more flexible but can hallucinate or generate unsafe outputs.

Modern conversational AI is mostly generative.

Decoder-Only Dialogue Models

Many modern chat systems use decoder-only transformers.

The full conversation history is concatenated into one token sequence:

<System> ...
<User> ...
<Assistant> ...
<User> ...

The model predicts the next assistant tokens autoregressively.

If the input has shape:

[B, T]

the transformer produces hidden states:

[B, T, D]

The output projection maps hidden states to vocabulary logits:

[B, T, V]

During generation, the model repeatedly predicts:

$$
P(x_t \mid x_{<t}).
$$

The generated tokens are appended to the sequence and fed back into the model.
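
The loop itself is short. The sketch below uses a random embedding lookup as a stand-in model so the code runs end to end:

```python
import torch

# Autoregressive generation loop. `model` is any callable mapping token
# ids [B, T] to logits [B, T, V]; a toy embedding lookup stands in here.
vocab_size = 50
embedding = torch.randn(vocab_size, vocab_size)

def model(ids):                      # [B, T] -> [B, T, V]
    return embedding[ids]            # toy "model": lookup as logits

ids = torch.tensor([[3, 17, 42]])    # [B=1, T=3] conversation so far
for _ in range(5):
    logits = model(ids)                                   # [B, T, V]
    next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick
    ids = torch.cat([ids, next_id], dim=1)                # append, feed back
print(ids)
```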

Sampling Strategies

Dialogue systems rarely use greedy decoding because it often produces repetitive or generic outputs.

Common decoding strategies include:

| Method | Description |
|---|---|
| Greedy decoding | Choose highest-probability token |
| Beam search | Track several candidate sequences |
| Temperature sampling | Adjust probability sharpness |
| Top-k sampling | Sample from top k tokens |
| Top-p sampling | Sample from smallest high-probability set |

Temperature scaling modifies logits:

$$
p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}.
$$

Lower temperature produces more deterministic outputs. Higher temperature increases diversity.

Top-k sampling restricts generation to the $k$ highest-probability tokens.

Top-p sampling, also called nucleus sampling, samples from the smallest token set whose cumulative probability exceeds a threshold $p$.

These methods balance fluency, diversity, and stability.
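
The following sketch applies temperature, top-k, and top-p filtering to a single logits vector; batched production code follows the same logic:

```python
import torch

def sample(logits, temperature=1.0, top_k=0, top_p=1.0):
    logits = logits / temperature                    # sharpen or flatten
    if top_k > 0:                                    # keep the k best tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    if top_p < 1.0:                                  # nucleus filtering
        sorted_logits, order = torch.sort(logits, descending=True)
        cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cut = cum > top_p
        cut[1:] = cut[:-1].clone()                   # keep first token over p
        cut[0] = False
        logits[order[cut]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()        # draw one token id

print(sample(torch.randn(50), temperature=0.8, top_k=10, top_p=0.9))
```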

Repetition Problems

Autoregressive dialogue models may repeat phrases or fall into loops.

Example:

The answer is very important because it is very important because it is very important...

Several factors contribute:

| Cause | Description |
|---|---|
| Exposure bias | Model conditions on its own outputs |
| Probability collapse | Repeated tokens dominate probabilities |
| Weak decoding constraints | Decoder allows loops |

Common mitigation methods:

| Method | Idea |
|---|---|
| Repetition penalties | Reduce probability of repeated tokens |
| N-gram blocking | Prevent repeated phrases |
| Temperature adjustment | Increase diversity |
| Better training data | Reduce repetitive patterns |

A repetition penalty modifies logits for previously generated tokens:

```python
def apply_repetition_penalty(logits, previous_ids, repetition_penalty=1.2):
    for token_id in set(previous_ids):
        score = logits[token_id]
        # Divide positive logits and multiply negative ones, so the
        # penalty always lowers the token's probability.
        logits[token_id] = score / repetition_penalty if score > 0 else score * repetition_penalty
    return logits
```

Instruction Tuning

Base language models predict text continuations. Conversational systems require instruction-following behavior.

Instruction tuning trains the model on examples such as:

Instruction: Summarize this paragraph.
Response: ...

or:

User: Explain backpropagation.
Assistant: ...

The model learns conversational formatting, helpfulness, and task-following behavior.

Instruction tuning datasets often contain:

| Example type | Description |
|---|---|
| Question answering | Direct factual answers |
| Summarization | Condensed explanations |
| Coding | Program synthesis |
| Reasoning | Step-by-step solutions |
| Dialogue | Multi-turn conversations |

Instruction tuning is one reason modern LLMs behave differently from older pretrained language models.

Reinforcement Learning from Human Feedback

Many conversational systems use reinforcement learning from human feedback, abbreviated RLHF.

The process usually has three stages:

| Stage | Purpose |
|---|---|
| Supervised fine-tuning | Learn dialogue behavior |
| Reward modeling | Learn preference scores |
| RL optimization | Optimize responses for reward |

Human annotators compare model responses:

| Prompt | Preference |
|---|---|
| Explain transformers. | Response A preferred over Response B |

A reward model predicts human preference scores. Reinforcement learning then updates the dialogue model to maximize reward.
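
A sketch of the reward-modeling objective: the standard pairwise Bradley-Terry loss on preferred versus rejected responses, with random features standing in for real (prompt, response) encodings:

```python
import torch
import torch.nn.functional as F

# Pairwise preference loss: the reward of the preferred response should
# exceed the reward of the rejected one. The linear layer is a stand-in
# for a real reward model over (prompt, response) text.
reward_model = torch.nn.Linear(16, 1)
chosen_features = torch.randn(4, 16)    # stand-ins for preferred responses
rejected_features = torch.randn(4, 16)  # stand-ins for rejected responses

r_chosen = reward_model(chosen_features)
r_rejected = reward_model(rejected_features)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()  # Bradley-Terry objective
loss.backward()
print(float(loss))
```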

RLHF improves helpfulness, harmlessness, and instruction following, though it can also create overrefusal, verbosity, or reward hacking behaviors.

Tool Use and Function Calling

Modern conversational systems increasingly interact with external tools.

Example:

User: What is the weather in Hanoi?

The system may:

-> detect weather intent
-> call weather API
-> format result
-> generate response

Tool use extends model capability beyond static parametric memory.

Common tools include:

| Tool type | Example |
|---|---|
| Search | Web retrieval |
| Calculator | Arithmetic |
| Code execution | Python runtime |
| Database query | SQL |
| Calendar | Scheduling |
| Email | Messaging |
| File retrieval | Document search |

The language model acts as a controller that decides when and how to use tools.
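
A toy controller loop illustrates this role. Here the model's tool decision is faked with a keyword rule, where a real system would parse a structured tool call emitted by the LLM:

```python
# Toy function-calling loop: decide, call the tool, format the result.
def get_weather(city):
    return {"city": city, "forecast": "sunny"}   # stand-in for a weather API

TOOLS = {"get_weather": get_weather}

def model_decide(message):
    # Pretend the model emitted this structured tool call.
    if "weather" in message.lower():
        return {"tool": "get_weather", "args": {"city": "Hanoi"}}
    return None

def respond(message):
    call = model_decide(message)
    if call is None:
        return "I can answer that directly."
    result = TOOLS[call["tool"]](**call["args"])   # execute the tool
    return f"The weather in {result['city']} is {result['forecast']}."

print(respond("What is the weather in Hanoi?"))
```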

Safety and Moderation

Conversational systems must manage unsafe or harmful outputs.

Safety systems may address:

| Risk | Example |
|---|---|
| Toxicity | Harassment |
| Self-harm advice | Dangerous instructions |
| Misinformation | False claims |
| Privacy leakage | Personal information |
| Jailbreaking | Prompt attacks |
| Hallucinations | Unsupported facts |

Moderation systems may use:

| Method | Description |
|---|---|
| Rule filters | Keyword blocking |
| Classifiers | Toxicity prediction |
| Constitutional prompts | Instruction constraints |
| RLHF | Human preference optimization |
| Retrieval verification | Evidence grounding |

Safety systems must balance protection against overblocking useful responses.
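
The simplest layer in the table above is a rule filter. A toy version, with an illustrative blocklist:

```python
# Minimal keyword rule filter. The blocklist is illustrative; real
# moderation combines rules with learned classifiers.
BLOCKLIST = {"badword1", "badword2"}

def rule_filter(text):
    words = set(text.lower().split())
    return not (words & BLOCKLIST)   # True if the text passes

print(rule_filter("Hello there"))    # True
print(rule_filter("badword1 here"))  # False
```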

Evaluation of Conversational Systems

Dialogue evaluation is difficult because many valid responses may exist.

Example:

User: How are you?

Possible valid responses:

I'm doing well.
I'm fine, thank you.
Doing great today.

Automatic metrics such as BLEU correlate weakly with conversational quality.

Modern evaluation often considers:

| Criterion | Meaning |
|---|---|
| Helpfulness | Solves the user's problem |
| Relevance | Matches conversation context |
| Faithfulness | Grounded in evidence |
| Coherence | Logically consistent |
| Safety | Avoids harmful outputs |
| Latency | Responds quickly |
| User satisfaction | Human preference |

Human evaluation remains important.

Memory Systems

Long-term conversational systems may require persistent memory.

Example:

User: My favorite language is Rust.

A later conversation:

User: Recommend a systems programming book.

The assistant may use stored memory to personalize recommendations.

Memory systems may store:

| Memory type | Example |
|---|---|
| User preferences | Favorite languages |
| Past conversations | Dialogue history |
| Documents | Uploaded files |
| Tool outputs | Previous searches |

Memory retrieval must balance usefulness, privacy, and correctness.
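
A toy memory store with keyword-overlap recall (real systems use embeddings and must enforce privacy constraints):

```python
# Toy persistent memory: store user facts, recall by keyword overlap.
memory = []

def remember(fact):
    memory.append(fact)

def recall(query):
    q = set(query.lower().split())
    return [m for m in memory if q & set(m.lower().split())]

remember("favorite language is Rust")
print(recall("Recommend a systems programming language book"))
# ['favorite language is Rust']
```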

Hallucinations in Dialogue

Generative dialogue systems may produce fluent but false statements.

Example:

PyTorch was released in 2008.

The statement is grammatically correct but factually wrong.

Hallucinations arise because language models optimize next-token prediction, not truth verification.

Methods to reduce hallucinations include:

| Method | Description |
|---|---|
| Retrieval augmentation | Use external evidence |
| Citation generation | Provide sources |
| Verification models | Check factual consistency |
| Tool use | Query trusted systems |
| Abstention | Refuse uncertain answers |

No current method eliminates hallucinations completely.

Multi-Agent Systems

Some conversational architectures use multiple interacting agents.

Example pipeline:

planner agent
-> retrieval agent
-> coding agent
-> critic agent
-> final response

Different agents specialize in different tasks.

Examples:

| Agent | Role |
|---|---|
| Planner | Break tasks into steps |
| Retriever | Gather information |
| Coder | Write programs |
| Critic | Evaluate outputs |
| Memory manager | Retrieve relevant context |

Multi-agent systems may improve decomposition and reliability but increase orchestration complexity.

Practical Chat Data Format

Conversational datasets are often represented as message lists.

Example:

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is PyTorch?"},
    {"role": "assistant", "content": "PyTorch is a deep learning framework."},
]
```

Tokenization converts these messages into one flattened token sequence with role markers.

Training usually masks loss over user tokens and computes loss only on assistant outputs.
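
A sketch of that masking with `-100` labels, which PyTorch's cross-entropy loss ignores by default (`ignore_index=-100`):

```python
# Assistant-only loss masking: copy labels from the input ids, then set
# every non-assistant position to -100 so cross-entropy skips it.
input_ids = [101, 7, 8, 9, 102, 21, 22, 23]   # toy token ids
roles     = ["user"] * 5 + ["assistant"] * 3  # one role per token

labels = [tok if role == "assistant" else -100
          for tok, role in zip(input_ids, roles)]
print(labels)  # [-100, -100, -100, -100, -100, 21, 22, 23]
```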

Summary

Conversational systems model multi-turn dialogue between users and machines. Modern systems are usually transformer-based autoregressive language models conditioned on dialogue history.

Important components include dialogue context management, retrieval, response generation, safety systems, memory, and tool use. Instruction tuning and RLHF help models follow user intent and conversational conventions.

Conversational AI extends beyond pure language modeling into retrieval systems, reasoning systems, planning systems, and external tool orchestration.