Load Balancer
Introduction
Load balancing is crucial for distributing requests to LLMs across various resources, preventing bottlenecks, enhancing fault tolerance and keeping costs under control.
Using the DIAL Load Balancer, you can flexibly distribute load across model deployments, regions, and cloud subscriptions, prioritizing PTU deployments before using per-token-cost options.
You have the option to set up a simple strategy by directing requests to specific endpoints, or you can choose a more sophisticated approach by customizing a combination of parameters.
Configuration
You can configure load balancing in DIAL Core dynamic settings by defining upstream parameters for models. These parameters can be adjusted dynamically without the need to redeploy DIAL Core.
public class Upstream {
    private String endpoint; // required
    private String key;      // optional
    private int tier = 0;    // default tier if not specified
    private int weight = 1;  // default weight if not specified
}
Note: in a cluster, DIAL Core nodes do not share the state of endpoints, which may cause some divergence between nodes when routing requests.
Configuration example:
"models": {
"chat-gpt-35-turbo": {
"upstreams": [
{
"endpoint": "http://localhost:7001",
"key": "modelKey",
"weight": 1,
"tier": 0
},
],
}
}
where:
- endpoint - the model endpoint; the only required parameter
- key - the model API key
- weight - the endpoint capacity as a share of the total traffic
- tier - specifies an endpoint group
In the following sections, you can get familiar with specific combinations of these parameters for configuring load balancer algorithms.
Standard Load Balancer
The standard load balancer follows the round-robin approach. The incoming traffic is distributed evenly across all resources. Each new request is sent to the next endpoint in line, and the process repeats cyclically. Once the end of the list is reached, it starts over from the first endpoint. This can be efficient in scenarios where all resources have similar capabilities and the tasks they handle are roughly equivalent in terms of resource consumption.
Example of configuration of DIAL Core:
"models": {
"chat-gpt-35-turbo": {
"upstreams": [
{
"endpoint": "https://hostname1/openai/deployments/gpt-4-32k-0613/chat/completions"
},
{
"endpoint": "https://hostname2/openai/deployments/gpt-4-32k-0613/chat/completions"
},
{
"endpoint": "https://hostname3/openai/deployments/gpt-4-32k-0613/chat/completions"
},
]
}
}
In this scenario, requests are distributed between endpoints in the order specified in the configuration. By default, the weight is set to 1 and the tier to 0 for all of them.
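For illustration only, a minimal sketch of how round-robin selection over the configured endpoints might look is shown below. The RoundRobinSelector class is hypothetical and not part of DIAL Core; it only models the cyclic behavior described above.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of round-robin selection; not the actual DIAL Core implementation.
public class RoundRobinSelector {

    private final List<String> endpoints;               // endpoint URLs from the "upstreams" config
    private final AtomicInteger counter = new AtomicInteger();

    public RoundRobinSelector(List<String> endpoints) {
        this.endpoints = endpoints;
    }

    // Each call returns the next endpoint in line, wrapping around to the first one
    // once the end of the list is reached.
    public String next() {
        int index = Math.floorMod(counter.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }
}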
Advanced Load Balancer
You can use a combination of the weight and tier parameters to define a more complex load balancing algorithm.
Tiers
Use tier to aggregate endpoints into groups. In a regular scenario, all requests are routed to the endpoints with the lowest tier, but in case of an outage or hitting the limits, the next tier in line helps to handle the load. For example, requests will be routed to a group of endpoints with "tier": 0 before they go to a group with "tier": 1. The default value is 0, which signifies the highest-priority tier.
Example of configuration of DIAL Core:
"models": {
"chat-gpt-35-turbo": {
"upstreams": [
{
"endpoint": "https://hostname1/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 0
},
{
"endpoint": "https://hostname2/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 0
},
{
"endpoint": "https://hostname3/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 1
},
]
}
}
In this scenario, requests are distributed between groups of endpoints in the order specified in the configuration. By default, the weight is set to 1 for all of them.
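As a rough illustration of the tier semantics, the hypothetical sketch below groups upstreams by tier and always serves traffic from the lowest tier that still has an available endpoint. The TierRouter class and the availability check are assumptions made for this example, not DIAL Core APIs.

import java.util.List;
import java.util.TreeMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration of tier-based routing; lower tier values are tried first.
public class TierRouter {

    public record Upstream(String endpoint, int tier) {}

    // Returns the upstreams of the lowest tier that still has at least one available endpoint.
    public static List<Upstream> selectGroup(List<Upstream> upstreams, Predicate<Upstream> isAvailable) {
        TreeMap<Integer, List<Upstream>> byTier = upstreams.stream()
                .collect(Collectors.groupingBy(Upstream::tier, TreeMap::new, Collectors.toList()));
        for (List<Upstream> group : byTier.values()) {
            List<Upstream> available = group.stream().filter(isAvailable).toList();
            if (!available.isEmpty()) {
                return available; // e.g. all "tier": 0 endpoints while they are healthy
            }
        }
        return List.of(); // every tier is exhausted
    }
}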
Weights
Use weight inside groups of endpoints to allocate a share of the total traffic to specific endpoints. Endpoints receive requests with probability proportional to their weights.
Note: weight can be used only in combination with tier.
The default value is 1. A positive value represents the endpoint capacity; a zero or negative value excludes the endpoint from routing.
Example of configuration of DIAL Core:
"models": {
"chat-gpt-35-turbo": {
"upstreams": [
{
"endpoint": "https://hostname1/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 0,
"weight": 3
},
{
"endpoint": "https://hostname2/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 0,
"weight": 2
},
{
"endpoint": "https://hostname3/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 1,
"weight": 1
},
]
}
}
In this configuration, requests are initially directed to tier 0. Within this tier, requests are distributed among hostname1 and hostname2 with probability proportional to their weights. If tier 0 is unavailable or reaches its capacity limits, the remaining traffic is then routed to tier 1.
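Within a single tier, the weight-proportional distribution can be modeled as a weighted random pick, as in the hypothetical sketch below (an illustration of the behavior described above, not the actual routing code). With weights 3 and 2, hostname1 would receive roughly 60% of the tier's traffic and hostname2 roughly 40%.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration of weight-proportional selection within one tier.
public class WeightedSelector {

    public record Upstream(String endpoint, int weight) {}

    // Picks an endpoint with probability proportional to its weight.
    // Upstreams with a zero or negative weight are excluded from routing.
    public static String pick(List<Upstream> tierGroup) {
        List<Upstream> active = tierGroup.stream().filter(u -> u.weight() > 0).toList();
        if (active.isEmpty()) {
            throw new IllegalStateException("no active upstreams in this tier");
        }
        int total = active.stream().mapToInt(Upstream::weight).sum();
        int roll = ThreadLocalRandom.current().nextInt(total); // 0 <= roll < total
        for (Upstream upstream : active) {
            roll -= upstream.weight();
            if (roll < 0) {
                return upstream.endpoint();
            }
        }
        throw new IllegalStateException("unreachable: roll always lands inside some weight interval");
    }
}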
Fallbacks
Fallback logic is designed to ensure system resilience and optimize request handling when upstream endpoints fail or become unavailable. It helps prevent repeated retries to the same endpoint and provides a structured approach to redirect requests to alternative endpoints.
- Endpoint Selection: If endpoints in the lower tier become unavailable, the traffic goes to the group of endpoints that have a higher tier.
- Retry Limitation: To prevent excessive retries to a specific deployment, the maxRetryAttempts parameter can be configured in DIAL Core dynamic settings. The default value is 5 for language models and 1 for applications.
- Fallback: If all upstream endpoints are unavailable, the fallback algorithm will attempt to send the request to an endpoint that previously returned a 500 (Internal Server Error) or 429 (Too Many Requests) error code. This fallback is conditional and depends on the endpoint meeting predefined criteria, ensuring that even temporarily failing endpoints can be used if necessary. A rough sketch of this flow follows the list.
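The sketch below is a rough, hypothetical illustration of the retry and fallback flow: attempts are capped by a maxRetryAttempts limit, candidates are assumed to be ordered by tier, and a 429 or 500 response triggers a retry against the next candidate. The class and method names are assumptions made for illustration and are not DIAL Core APIs.

import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical illustration of the retry/fallback flow; not the actual DIAL Core implementation.
public class FallbackRouter {

    private final int maxRetryAttempts; // configurable in DIAL Core dynamic settings (default 5 for models)

    public FallbackRouter(int maxRetryAttempts) {
        this.maxRetryAttempts = maxRetryAttempts;
    }

    // Tries candidates in order (assumed to be sorted by tier) and returns the first
    // non-retryable HTTP status; 429 and 500 responses move on to the next candidate.
    public int route(List<String> candidatesByTier, ToIntFunction<String> sendRequest) {
        int lastStatus = 503;
        int attempts = Math.min(maxRetryAttempts, candidatesByTier.size());
        for (int i = 0; i < attempts; i++) {
            lastStatus = sendRequest.applyAsInt(candidatesByTier.get(i));
            if (lastStatus != 429 && lastStatus != 500) {
                return lastStatus;
            }
        }
        return lastStatus; // all attempts exhausted: surface the last upstream error
    }
}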