Long context optimizations for LLMs#
Using LLMs with very long contexts and prompts can be particularly challenging. The key goals are maximum throughput, minimal latency, and reasonable memory consumption. Such workloads are common in applications using RAG chains, document summarization, question answering, and many more. The optimizations below can significantly boost performance:

- Prefix caching
- KV cache compression
Prefix caching: An optimization technique in large language models (LLMs) used to improve performance when processing repeated or static parts of input prompts. Instead of recomputing the KV values for the same prefix (e.g., a fixed instruction or shared context), they are computed once and kept in the cache after the request is processed and the response is returned. When the same prefix is encountered again, the cached KV values are reused, skipping redundant computation. This reduces latency and computational overhead, especially in scenarios like chatbots or applications with repetitive prompts.
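The idea can be sketched in a few lines of Python. This is a toy illustration of the concept only, not how OpenVINO Model Server implements it (real servers cache KV data per block of tokens):

```python
# Toy illustration of prefix caching: KV data computed for a prompt prefix is
# stored and reused when a later prompt starts with the same tokens.
computed_tokens = 0
kv_cache = {}  # maps a token prefix (tuple) -> its "KV" data


def compute_kv(tokens):
    # Stand-in for the expensive attention key/value computation.
    global computed_tokens
    computed_tokens += len(tokens)
    return [f"kv({t})" for t in tokens]


def prefill(tokens):
    prefix = tuple(tokens)
    # Reuse the longest prefix of this prompt that is already cached.
    end = max((n for n in range(1, len(prefix) + 1) if prefix[:n] in kv_cache), default=0)
    kv = list(kv_cache.get(prefix[:end], []))
    for i in range(end, len(prefix)):
        kv = kv + compute_kv([tokens[i]])
        kv_cache[prefix[:i + 1]] = kv  # cache every prefix (real servers cache per block)
    return kv


shared = ["<system prompt>", "doc chunk 1", "doc chunk 2"]
prefill(shared + ["question A"])
print(computed_tokens)  # 4 -> the whole first prompt had to be processed
prefill(shared + ["question B"])
print(computed_tokens)  # 5 -> only one new token was processed for the second prompt
```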
KV cache compression: The KV cache stores the intermediate key and value tensors generated by the model's attention layers for each token in the input sequence. It lets the model avoid recomputing attention for previous tokens when generating new ones, greatly speeding up inference for long contexts. For very long contexts or high concurrency, however, the KV cache can consume a large amount of memory (RAM or VRAM). Compression reduces this memory usage, enabling longer prompts or more parallel requests without running out of memory.
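To get a feel for the numbers, a rough sizing formula helps: for every token, the cache holds a key and a value vector per layer and per KV head. The sketch below assumes Qwen2.5-7B-style grouped-query attention with 28 layers, 4 KV heads, and head size 128 (verify these against the model's config.json); real usage is somewhat higher due to block allocation overhead:

```python
# Back-of-the-envelope KV cache sizing for a grouped-query attention model.
LAYERS, KV_HEADS, HEAD_DIM = 28, 4, 128  # assumed Qwen2.5-7B values; check config.json


def kv_cache_gib(context_tokens: int, bytes_per_element: int) -> float:
    # 2x for keys and values, per layer, per KV head, per head dimension, per token.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * context_tokens * bytes_per_element / 1024**3


for precision, nbytes in [("bf16", 2), ("u8", 1)]:
    print(f"{precision}: 50k tokens ~ {kv_cache_gib(50_000, nbytes):.1f} GiB, "
          f"200k tokens ~ {kv_cache_gib(200_000, nbytes):.1f} GiB")
```

Switching the cache from bf16 to u8 roughly halves its footprint, which is consistent with the measurements later in this demo.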
Deployment#
Let's demonstrate all the optimizations combined and test them in a realistic scenario: sending multiple different questions about the same context. It illustrates the gain from prefix caching on first-token latency, improved second-token latency thanks to prompt lookup, and moderate memory consumption despite very long prompts and parallel execution.
Export the model Qwen/Qwen2.5-7B-Instruct-1M, which has a maximum context length of 1 million tokens:
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/common/export_models/requirements.txt
mkdir models
python export_model.py text_generation --source_model Qwen/Qwen2.5-7B-Instruct-1M --weight-format int4 --config_file_path models/config.json --model_repository_path models
Start OVMS:
```bash
docker run -it --rm -u $(id -u) -p 8000:8000 -v $(pwd)/models/:/models:rw openvino/model_server:latest --rest_port 8000 --source_model Qwen/Qwen2.5-7B-Instruct-1M --model_repository_path /models --task text_generation --enable_prefix_caching true --kv_cache_precision u8 --target_device CPU
```
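Optionally, before running the benchmark, a quick request can confirm that the server is up and the model is loaded. The snippet below assumes the default port from the command above and uses the OpenAI-compatible REST API exposed by the model server:

```python
# Optional sanity check against the OpenAI-compatible REST API served by OVMS.
import requests  # pip install requests

# List the served models and their status.
print(requests.get("http://localhost:8000/v1/config").json())

# Send a minimal chat completion request to the exported model.
response = requests.post(
    "http://localhost:8000/v3/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct-1M",
        "messages": [{"role": "user", "content": "Reply with one word: ready?"}],
        "max_tokens": 10,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```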
Dataset for experiments#
To test the performance using the vLLM benchmarking script, let's create a custom dataset with a long shared context and a different question in each request. That is a common scenario for RAG applications, which generate responses based on a complete knowledge base. To make the experiment close to real life, the context is not synthetic but built from the text of the Don Quixote story, with 10 different questions related to the story. Because the context is reused across requests, it is a perfect case for benefiting from prefix caching.
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/2/demos/continuous_batching/long_context/custom_dataset.py -o custom_dataset.py
pip install requests transformers
python custom_dataset.py --limit_context_tokens 50000
It will create a file called `dataset.jsonl` with 10 requests sharing the same context body, limited to 50,000 tokens.
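For reference, the gist of what the script does can be sketched as follows. This is a simplified illustration only: the exact JSON schema and prompt formatting are defined by custom_dataset.py and the vLLM custom dataset loader, and the source text file name below is hypothetical:

```python
# Simplified sketch of building a long shared-context dataset in JSON Lines format.
import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct-1M")

context = open("don_quixote.txt", encoding="utf-8").read()  # hypothetical source text
context = tokenizer.decode(tokenizer.encode(context)[:50_000])  # trim to ~50k tokens

questions = [
    "Who is Don Quixote's squire?",
    "What does Don Quixote mistake for giants?",
    # ... 8 more questions about the story
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for question in questions:
        # Every request shares the same long prefix and differs only in the question.
        f.write(json.dumps({"prompt": f"{context}\n\nQuestion: {question}"}) + "\n")
```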
Testing performance#
Let's check the performance:
```bash
git clone --branch v0.9.1 --depth 1 https://github.com/vllm-project/vllm
cd vllm
pip3 install -r requirements/cpu.txt . --extra-index-url https://download.pytorch.org/whl/cpu
python benchmarks/benchmark_serving.py --host localhost --port 8000 --endpoint /v3/chat/completions --backend openai-chat --model Qwen/Qwen2.5-7B-Instruct-1M --dataset-name custom --dataset-path ../dataset.jsonl --num-prompts 10 --max-concurrency 1 --custom-output-len 50
```
```
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 31.44
Total input tokens: 500414
Total generated tokens: 500
Request throughput (req/s): 0.32
Output token throughput (tok/s): 15.91
Total Token throughput (tok/s): 15934.81
---------------Time to First Token----------------
Mean TTFT (ms): 1551.46
Median TTFT (ms): 518.46
P99 TTFT (ms): 3260.48
```
The results above show that, despite the very long context, TTFT is much lower with prefix caching: the first request has to populate the cache, while the following requests reuse it, which is why the median TTFT is far below the mean. As long as the beginning of the request prompt is reused, the KV cache can also be reused to speed up prompt processing.
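The same effect can be reproduced without the benchmarking harness by streaming two requests that share the prefix and timing the first received chunk. This is a minimal sketch that assumes the server started above and a local file with the shared context (the file name is hypothetical):

```python
# Measure time to first token (TTFT) for two requests sharing a long prefix.
import time

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")
context = open("don_quixote.txt", encoding="utf-8").read()  # hypothetical shared context


def ttft(question: str) -> float:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct-1M",
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
        max_tokens=30,
        stream=True,
    )
    for _ in stream:  # the first streamed chunk marks the first generated token
        return time.perf_counter() - start


print("first request (cold cache):", ttft("Who is Don Quixote's squire?"))
print("second request (prefix cached):", ttft("What does Don Quixote mistake for giants?"))
```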
Performance Comparison Table#
| Context Length (tokens) | TTFT No Caching (ms) | TTFT Prefix Caching (ms) | KV Cache Usage (GB) |
|---|---|---|---|
| 1,000 | 785 | 141 | 0.1 |
| 5,000 | 4160 | 172 | 0.2 |
| 10,000 | 9570 | 217 | 0.4 |
| 50,000 | 152,589 | 795 | 1.5 |
| 100,000 | 624,713 | 1097 | 3.1 |
| 200,000 | n/a | 5406 | 6.2 |
The results show that KV cache usage grows linearly with the context length, while time to first token without prefix caching grows steeply with the prompt size. Prefix caching is very effective at reducing the time to first token, making long-context calls practical even on slower hardware.
Testing accuracy#
Testing accuracy for long-context use cases can be done with the lm-evaluation-harness tool (`lm-eval`). The only difference from standard accuracy testing is that the configured task should use a dataset with long contexts.
For example:
```bash
lm-eval --model local-chat-completions --tasks longbench_gov_report --model_args model=Qwen/Qwen2.5-7B-Instruct-1M,base_url=http://localhost:8000/v3/chat/completions,num_concurrent=10,tokenized_requests=False,timeout=3000 --verbosity DEBUG --seed 1 --apply_chat_template
```
Such an experiment can confirm the impact of model quantization and KV cache compression on accuracy.
Cache Precision Comparison#
| Cache Precision | Plugin Config | Accuracy (longbench_gov_report, concurrency 50) | Max Cache Usage (GB) | Duration (100 requests) |
|---|---|---|---|---|
| INT8 | `"KV_CACHE_PRECISION": "u8"` | 0.3374 | 11 | 41m6.993s |
| BF16 | `"KV_CACHE_PRECISION": "bf16"` | 0.3297 | 20 | 40m15.359s |
| FP32 | `"KV_CACHE_PRECISION": "FP32", "EXECUTION_MODE_HINT": "ACCURACY"` | 0.331 | 37 | 105m15.876s |
The results, captured in an experiment on a 4th Gen Xeon server, show that KV cache compression has minimal impact on accuracy while significantly reducing memory consumption. The slower execution with FP32 precision is a result of disabled AMX acceleration.
Recommendations#
- Enable the prefix caching feature with the `--enable_prefix_caching` parameter when you expect parts of the context to be reused. That is typically the case for RAG, chat, and agentic applications.
- Use INT8 KV cache compression, which is the default setting.
- Set the KV cache size via the `--cache_size` parameter based on the available memory, expected concurrency, and context length. It will improve the performance.

Note: You can limit the concurrency on the server with the `--rest_workers` parameter, which by default allows as many connections as there are CPU cores. Alternatively, the limit can be set at the model level with `--max_num_seqs`.