Prompt Caching

About Prompt Caching

Prompt caching, also referred to as context caching, is a technique used to optimize performance and reduce the operational costs of interacting with large language models (LLMs).

By storing and reusing previously processed data, prompt caching eliminates the need to reprocess the same information in subsequent requests. This approach is particularly beneficial in scenarios such as multi-turn conversations, complex workflows, or applications requiring repetitive queries, where parts of the request (e.g., system instructions, user prompts, or tool definitions) remain unchanged.

By leveraging cached data, systems can significantly reduce computation time, improve response efficiency, and lower token usage costs, making it an essential feature for scalable and cost-effective AI implementations.

Prompt Caching in LLM

Large Language Models (LLMs) process sequences of input tokens and generate sequences of output tokens. Each token consumed modifies the internal state of the LLM in a predictable manner. Once all input tokens are processed, the model begins generating output tokens one by one.

Prompt caching leverages the predictable state changes in LLMs to optimize processing for input sequences that share the same prefix. For example:

input sequence 1: [tokenA, tokenB, tokenC, tokenD, ...]
input sequence 2: [tokenA, tokenB, tokenC, tokenE, ...]

To optimize these sequences:

  • Run the LLM over the first sequence until the shared prefix ends (e.g., [tokenA, tokenB, tokenC]).
  • Save the model's state after the shared prefix has been processed.
  • Process the second sequence starting from this precomputed state, skipping the redundant computation for the shared prefix (illustrated in the sketch below).
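
The sketch below illustrates this idea in Python. It is purely conceptual: the model object and its initial_state and process_token methods are hypothetical stand-ins for an LLM runtime's internal state handling (e.g., its KV cache), not a real API.

# Conceptual sketch only: reuse a saved model state for a shared prefix.
# `model`, `initial_state` and `process_token` are hypothetical stand-ins.

def shared_prefix_length(seq_a, seq_b):
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for a, b in zip(seq_a, seq_b):
        if a != b:
            break
        n += 1
    return n

def process_with_prefix_reuse(model, seq_a, seq_b):
    # Process the first sequence, snapshotting the state after each token.
    state = model.initial_state()
    snapshots = [state]
    for token in seq_a:
        state = model.process_token(state, token)
        snapshots.append(state)

    # For the second sequence, resume from the state saved after the shared
    # prefix instead of recomputing those tokens.
    k = shared_prefix_length(seq_a, seq_b)
    state_b = snapshots[k]
    for token in seq_b[k:]:
        state_b = model.process_token(state_b, token)
    return state_b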

End users of LLMs rarely deal with streams of input tokens directly. Instead, they more commonly work with the higher-level concept of a chat request. The idea of prompt caching translates from tokens to chat requests fairly easily: adjacent parts of a chat request (such as tool definitions and messages) eventually map onto adjacent blocks of input tokens fed into the LLM. Therefore, prompt caching can be related to the sequence of tool definitions and chat messages in chat completion requests: requests starting with the same sequence of tools/messages can be computed more efficiently thanks to prompt caching.

Leading model providers such as OpenAI, Google, and AWS support prompt caching. However, all of them limit the caching scope to a single endpoint: to benefit from caching, requests sharing the same prefix must be sent to the same model deployment.

DIAL Core can automate this process, ensuring that requests with shared prefixes are routed to the same model deployment.

Prompt Caching in DIAL

DIAL Core uses hashing to redirect chat completion requests with the same prefix to the same upstream endpoint:

  • Unique hashes are computed for each prefix of an incoming chat completion request.
  • The mapping from these hashes to an upstream endpoint and an expiration time is saved to the Redis cache.
  • When the next request comes in, DIAL Core computes the hashes again and looks them up in the mapping. On a match, the request is sent to the corresponding upstream endpoint (a simplified sketch of this flow follows).
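
The Python sketch below is a simplified illustration of this routing flow, not DIAL Core's actual implementation: an in-process dict with expiration timestamps stands in for the Redis cache, SHA-256 stands in for whatever hash function DIAL Core uses, and the fallback choice on a cache miss is a placeholder.

# Simplified illustration of prefix-based routing; not DIAL Core's implementation.
import hashlib
import json
import time

prefix_to_upstream = {}  # hash -> (upstream endpoint, expiration timestamp)

def prefix_hashes(request):
    """Compute a hash for every prefix of the request's tools + messages."""
    items = request.get("tools", []) + request.get("messages", [])
    hashes = []
    h = hashlib.sha256()
    for item in items:
        h.update(json.dumps(item, sort_keys=True).encode())
        hashes.append(h.hexdigest())
    return hashes

def pick_upstream(request, upstreams, ttl_seconds=300):
    hashes = prefix_hashes(request)
    now = time.time()
    # Look for the longest cached, non-expired prefix.
    for h in reversed(hashes):
        entry = prefix_to_upstream.get(h)
        if entry and entry[1] > now:
            upstream = entry[0]   # cache hit: reuse the same upstream
            break
    else:
        upstream = upstreams[0]   # cache miss: placeholder for any load-balancing policy
    # Remember the mapping for the next request sharing these prefixes.
    for h in hashes:
        prefix_to_upstream[h] = (upstream, now + ttl_seconds)
    return upstream

Two consecutive requests that share the same leading tools/messages then resolve to the same upstream, while a request with a new prefix falls through to the regular load-balancing choice.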

Cache-Availability Priority Policy

When DIAL Core finds a matching hash in the Redis cache and routes the request to the corresponding upstream, it may encounter an error in the response. This raises the question:

  • Should DIAL Core retry the request to the same upstream (potentially resulting in a cache hit)?
  • Or should it route the request to another upstream (guaranteeing a cache miss but ensuring service availability)?

By default, DIAL Core prioritizes service availability. If the upstream associated with the cache is unavailable or returns an error, the request is routed to another upstream, even though this guarantees a cache miss.

This behavior can be customized using the X-CACHE-POLICY header in the chat completion request. The two available policies are listed below, followed by an example request:

  • availability-priority (default):
    • Prioritizes service availability over cache hits.
    • If the cache upstream is unavailable or returns an error, the request is routed to another upstream.
    • Ensures the request succeeds but sacrifices the efficiency gains of caching.
  • cache-priority:
    • Prioritizes cache hits over availability.
    • Retries the request on the same upstream, even if the upstream initially returned an error.
    • Maximizes the chances of leveraging cached data but risks delayed responses if the upstream remains unavailable.
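
For illustration, here is a minimal Python sketch of sending a chat completion request with the cache-priority policy using the requests library. The host name, deployment name, endpoint path, and Api-Key header are assumptions about a particular DIAL installation, not values taken from this page; adjust them to your setup.

# Minimal sketch: set X-CACHE-POLICY on a chat completion request.
# Host, deployment name, path, and auth header are assumed values.
import requests

response = requests.post(
    "https://dial.example.com/openai/deployments/my-model/chat/completions",
    headers={
        "Api-Key": "<your DIAL API key>",
        "X-CACHE-POLICY": "cache-priority",  # or "availability-priority" (default)
    },
    json={
        "messages": [
            {"role": "system", "content": "(long instructions)"},
            {"role": "user", "content": "(query)"},
        ]
    },
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])

With availability-priority, the default policy, the header can be omitted entirely.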

Types of Prompt Caching

DIAL supports two types of prompt caching: automatic and manual. The following table summarizes pros and cons of these types of caching.

|                        | Automatic Caching                                                      | Manual Caching                                       |
|------------------------|------------------------------------------------------------------------|------------------------------------------------------|
| User Effort            | Minimal (handled by the system)                                        | High (users must explicitly mark cache breakpoints)  |
| Control                | Limited (system decides breakpoints)                                   | Full (users decide breakpoints)                      |
| Use Case               | Multi-turn conversations, general efficiency                           | Custom workflows, advanced use cases                 |
| Dependency on Provider | Depends on automatic caching support from the provider (e.g., OpenAI)  | Works as long as the provider supports caching       |
| Flexibility            | Low (system-driven)                                                    | High (user-driven)                                   |

Automatic

Automatic caching happens without explicit user input. The system determines when and where to create cache entries, making it seamless for the user. This is ideal for multi-turn conversations where only new content needs to be processed, while cached content is reused.

Deployment configuration

Enable automatic caching by setting the deployment feature flag autoCachingSupported to true.

IMPORTANT: Ensure the language model supports automatic prompt caching before enabling it. Not all models support this feature.
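
As an illustration only, the fragment below shows where the autoCachingSupported flag might sit in a DIAL Core deployment configuration, written as a Python dict that mirrors the JSON config file. The surrounding structure (the models section, deployment name, endpoint, and features block) is an assumption about a typical setup and may differ from your configuration.

# Sketch of a deployment entry (structure assumed) with automatic caching enabled.
deployment_config = {
    "models": {
        "my-chat-model": {  # hypothetical deployment name
            "type": "chat",
            "endpoint": "http://upstream.example.com/v1/chat/completions",
            "features": {
                "autoCachingSupported": True  # enable automatic prompt caching
            },
        }
    }
}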

Manual

Manual caching is ideal for scenarios that require precise control over which parts of a request are cached, especially in complex workflows.

You have full control over where cache breakpoints are placed, making this approach suitable for highly customized workflows. For example, when working with a complex prompt (e.g., a mix of tool definitions, system instructions, and user messages), you can explicitly mark parts of the request for caching to optimize performance.

Deployment configuration

Enable manual caching by setting the deployment feature flag cacheSupported to true.

IMPORTANT: Ensure the language model supports prompt caching before enabling it. Not all models support this feature.
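
Analogous to the automatic-caching example above, a minimal sketch of the assumed features block of a deployment entry, with the manual-caching flag instead:

# Assumed "features" section of a deployment entry; the rest of the entry is
# as in the automatic-caching sketch above.
features = {
    "cacheSupported": True  # enable manual prompt caching (cache breakpoints)
}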

Cache breakpoint

In this type of prompt caching, you manually mark parts of a chat completion request with cache breakpoints.

DIAL Core computes hashes only for request prefixes ending with a cache breakpoint.

Tool definitions and chat messages can be marked with cache breakpoints using the custom_fields.cache_breakpoint field. Breakpoints can include an expire_at field to set a time-to-live (TTL) for the corresponding cache entry in DIAL Core and on the LLM side.

Example of cache breakpoints in a chat completion request:

{
  "tools": [
    {
      "type": "function",
      "name": "query_db",
      "parameters": {},
      "custom_fields": {
        "cache_breakpoint": {}
      }
    }
  ],
  "messages": [
    {
      "role": "system",
      "content": "(long instructions)",
      "custom_fields": {
        "cache_breakpoint": {
          "expire_at": "2014-10-02T15:01:23Z"
        }
      }
    },
    {
      "role": "user",
      "content": "(query)"
    }
  ]
}