A Summary of Mainstream Technical Approaches to Long-Context LLM Question Answering


Published: July 21, 2025

This post explores four key techniques for enhancing large language models (LLMs) in long-context scenarios. It begins with Retrieval-Augmented Generation (RAG), which retrieves relevant knowledge snippets to serve as context. It then discusses sparse attention mechanisms, such as BigBird and Longformer, which improve efficiency by connecting only selected tokens. The post also introduces context compression methods like MemoryBank, enabling LLMs to retain essential user information across dialogues. Finally, it highlights MemAgent, a system that recursively summarizes long inputs and leverages memory for reasoning, reinforced using GRPO.

1. RAG

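Retrieval-Augmented Generation (RAG) needs little introduction, since most readers already know it. In a question-answering system, a whole knowledge base is available when the user submits a query, but feeding all of it to the LLM as context is not feasible, so the goal is to find the parts most relevant to the question and use those as context. RAG does this by encoding the question as a vector, querying the pre-encoded knowledge base with it, computing similarity scores, and selecting the highest-scoring snippets as context.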
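The retrieval step can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, and the names `retrieve`, `knowledge_base`, and `kb_vectors` are made up for this example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real encoder)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, chunk_vectors, top_k=2):
    """Return the top_k chunks whose vectors are most similar to the question."""
    q = embed(question)
    scored = sorted(
        zip(chunks, chunk_vectors),
        key=lambda pair: cosine(q, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:top_k]]


knowledge_base = [
    "BigBird combines local, global, and random attention.",
    "MemoryBank stores user events and personality across dialogues.",
    "Longformer uses a sliding window plus global tokens.",
]
# Encode the knowledge base once, ahead of time.
kb_vectors = [embed(chunk) for chunk in knowledge_base]

context = retrieve("Which model uses a sliding window?", knowledge_base, kb_vectors, top_k=1)
```

Only the selected `context` would then be placed in the LLM prompt, rather than the whole knowledge base.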

2. Sparse attention mechanisms

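The intuition: when the sequence is too long, computing attention over the full context becomes unaffordable, so each token attends to only a subset of the other tokens. That subset can be the few tokens just before the current one (local attention), a handful of important tokens (global attention), tokens picked at fixed intervals (strided attention), randomly chosen tokens, and so on.

In other words, this line of work strengthens the LLM's handling of long sequences at the level of model structure: it modifies the Transformer or adopts a new architecture so that the model can process very long context sequences directly.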

Some examples:

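BigBird (Google), built from three components:
- Local attention
  - Each token connects only to the w tokens around it
- Global attention (global tokens)
  - Certain tokens (such as [CLS] or a summary head) can "see" every token and are seen by every token
- Random attention (random connections)
  - Each token also connects to a few randomly chosen non-adjacent tokens (to keep the attention graph from becoming disconnected)
  - For example, token 5 is randomly connected to tokens 17, 26, and 49

Longformer: The Long-Document Transformer
- Sliding window
  - Each token attends only to a fixed range before and after it (for example, ±k tokens)
- Global token mechanism (global attention)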
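To make the patterns concrete, here is a small sketch that builds the attention mask they imply. It only illustrates which token pairs are connected. It is not the actual BigBird or Longformer implementation (both use block-wise kernels for efficiency), and the function name and parameters are assumptions made for this example.

```python
import random


def sparse_attention_mask(n, window=1, global_tokens=(0,), num_random=0, seed=0):
    """Build an n x n boolean mask; mask[i][j] is True if token i attends to token j."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Local attention: each token sees neighbours within +/- window (and itself).
        for j in range(max(0, i - window), min(n, i + window + 1)):
            mask[i][j] = True
        # Random attention: a few extra links to tokens outside the local window.
        far = [j for j in range(n) if abs(i - j) > window]
        for j in rng.sample(far, min(num_random, len(far))):
            mask[i][j] = True
    for g in global_tokens:
        for j in range(n):
            mask[g][j] = True  # the global token sees every token
            mask[j][g] = True  # and every token sees the global token
    return mask


# Local + global + random links, in the spirit of BigBird.
bigbird_like = sparse_attention_mask(16, window=1, global_tokens=(0,), num_random=2)
# Sliding window + global token only, in the spirit of Longformer.
longformer_like = sparse_attention_mask(16, window=2, global_tokens=(0,), num_random=0)

dense_links = 16 * 16
sparse_links = sum(sum(row) for row in longformer_like)
```

Comparing `sparse_links` with `dense_links` shows the saving: the number of connections grows roughly linearly with sequence length instead of quadratically.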

3. Compressing/condensing the context

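MemoryBank, a paper I read recently, is an example. It targets the AI-assistant scenario: the LLM analyzes earlier conversations to extract the user's events and personality, and this information is then fed in as context in later conversations.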
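A rough sketch of the idea, under loud assumptions: this is not MemoryBank's actual code or prompt wording. `UserMemory`, `update_memory`, `build_prompt`, and the `llm` callable are hypothetical names for this illustration, and `stand_in_llm` just returns canned text in place of a real model call.

```python
from dataclasses import dataclass, field


@dataclass
class UserMemory:
    events: list = field(default_factory=list)       # things that happened to the user
    personality: list = field(default_factory=list)  # traits inferred from past dialogues


def update_memory(memory, dialogue, llm):
    """Ask the (hypothetical) llm callable to distil one past dialogue into the memory."""
    memory.events.append(llm("Summarize the key events about the user:\n" + dialogue))
    memory.personality.append(llm("Describe the user's personality traits:\n" + dialogue))
    return memory


def build_prompt(memory, user_message):
    """Prepend the compact memory, instead of the full history, to a new message."""
    return (
        f"Known events: {'; '.join(memory.events)}\n"
        f"Personality: {'; '.join(memory.personality)}\n"
        f"User: {user_message}"
    )


def stand_in_llm(prompt):
    """Placeholder for a real model call; returns canned text for the demo."""
    return "enjoys hiking" if "personality" in prompt else "moved to Berlin last month"


memory = update_memory(UserMemory(), "User: I just moved to Berlin and went hiking.", stand_in_llm)
prompt = build_prompt(memory, "Any weekend ideas?")
```

The point is that `prompt` carries a few distilled lines about the user rather than the full conversation history.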

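4. Within a single answer, split the long text into segments, summarize recursively over multiple passes while maintaining a memory, and finally answer from the memory and the question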

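This is MemAgent, and reading this paper is what prompted me to write this summary. The work also uses GRPO to train the model with reinforcement learning.

Figure: MemAgent diagram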
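The inference loop described in this section's heading can be sketched as follows. This is a simplified illustration and not the MemAgent authors' implementation: the prompts, the word-based chunking, and the `llm` callable are assumptions, and the GRPO training step is not shown. `echo_llm` is a stand-in that only records calls so the control flow is visible.

```python
def split_into_chunks(text, chunk_size):
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def answer_long_document(document, question, llm, chunk_size=200):
    """Read the document chunk by chunk, rewriting a memory, then answer from it."""
    memory = ""
    for chunk in split_into_chunks(document, chunk_size):
        # Each step sees only the question, the current memory, and one new chunk.
        memory = llm(
            f"Question: {question}\nCurrent memory: {memory}\nNew text: {chunk}\n"
            "Rewrite the memory, keeping only what helps answer the question."
        )
    # The final answer is produced from the memory alone, not the full document.
    return llm(f"Question: {question}\nMemory: {memory}\nAnswer the question.")


calls = []


def echo_llm(prompt):
    """Stand-in for a real model: records the prompt and returns a marker string."""
    calls.append(prompt)
    return f"memory after call {len(calls)}"


result = answer_long_document("a b c d e", "What is e?", echo_llm, chunk_size=2)
```

Because each call sees a fixed-size chunk plus the memory, the per-step context stays bounded however long the document is.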

Renjie Gu