Load Balancer

Introduction

Load balancing is crucial for distributing requests to LLMs across various resources, preventing bottlenecks, enhancing fault tolerance and keeping costs under control.

Using the DIAL Load Balancer, you can flexibly distribute load across model deployments, regions, and cloud subscriptions, prioritizing PTU deployments before using per-token-cost options.

You can set up a simple strategy by directing requests to specific endpoints, or a more sophisticated one by customizing a combination of parameters.

Configuration

You can configure load balancing in DIAL Core dynamic settings by defining upstream parameters for models. These parameters can be dynamically adjusted without the need to redeploy the DIAL Core.

public class Upstream {
    private String endpoint; // required
    private String key;      // optional
    private int tier = 0;    // default tier if not specified
    private int weight = 1;  // default weight if not specified
}

Note: in a cluster, DIAL Core nodes do not share the state of endpoints, which may cause a certain asynchronization in how they route requests.

Configuration example:

    "models": {
"chat-gpt-35-turbo": {
"upstreams": [
{
"endpoint": "http://localhost:7001",
"key": "modelKey",
"weight": 1,
"tier": 0
},
],
}
}

where:

  • endpoint - the model endpoint; the only required parameter
  • key - the model API key
  • weight - the endpoint capacity as a share of the total traffic within its tier
  • tier - specifies an endpoint group; groups with lower tier values are used first

In the following sections, you can get familiar with specific combinations of these parameters for configuring load balancer algorithms.

Standard Load Balancer

The standard load balancer follows the round-robin approach: incoming traffic is distributed evenly across all resources. Each new request is sent to the next endpoint in line, and once the end of the list is reached, the rotation starts over from the first endpoint. This is efficient in scenarios where all resources have similar capabilities and the tasks they handle consume roughly equal resources.

Example of configuration of DIAL Core:

"models": {
"chat-gpt-35-turbo": {
"upstreams": [
{
"endpoint": "https://hostname1/openai/deployments/gpt-4-32k-0613/chat/completions"
},
{
"endpoint": "https://hostname2/openai/deployments/gpt-4-32k-0613/chat/completions"
},
{
"endpoint": "https://hostname3/openai/deployments/gpt-4-32k-0613/chat/completions"
},
]
}
}

In this scenario, requests are distributed between endpoints in the order specified in the configuration. By default, the weight is set to 1 and tier to 0 for all of them.
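
To illustrate the rotation, below is a minimal Java sketch of round-robin selection over a list of endpoints. It is not DIAL Core source code: the RoundRobinBalancer class and its cursor counter are assumptions made for illustration only.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of round-robin selection; not DIAL Core source code.
public class RoundRobinBalancer {

    private final List<String> endpoints; // upstreams in configuration order
    private final AtomicInteger cursor = new AtomicInteger();

    public RoundRobinBalancer(List<String> endpoints) {
        this.endpoints = endpoints;
    }

    // Each call returns the next endpoint in line; after the last
    // endpoint the rotation wraps around to the first one.
    public String next() {
        int index = Math.floorMod(cursor.getAndIncrement(), endpoints.size());
        return endpoints.get(index);
    }
}

With the three endpoints above, successive calls would return hostname1, hostname2, hostname3, then hostname1 again, and so on.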

Advanced Load Balancer

You can use a combination of weight and tier parameters to define a more complex load balancing algorithm.

Tiers

Use tier to aggregate endpoints into groups. In a regular scenario, all requests are routed to the endpoints with the lowest tier, but in case of an outage or hitting the limits, the next tier in line helps to handle the load.

For example, requests will be routed to a group of endpoints with "tier": 0 before they go to a group with "tier": 1.

The default value is 0, which signifies the highest-priority tier.

Example of configuration of DIAL Core:

"models": {
"chat-gpt-35-turbo": {
"upstreams": [
{
"endpoint": "https://hostname1/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 0
},
{
"endpoint": "https://hostname2/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 0
},
{
"endpoint": "https://hostname3/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 1
},
]
}
}

In this scenario, requests are distributed between groups of endpoints in the order specified in the configuration. By default, the weight is set to 1 for all of them.
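
To make the tier-first order concrete, below is a hedged Java sketch. The TierRouter class, the Upstream record, and the isAvailable check are illustrative assumptions, not DIAL Core internals.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of tier-based failover; not DIAL Core source code.
public class TierRouter {

    record Upstream(String endpoint, int tier) {}

    // Stand-in for the real health and rate-limit tracking.
    private boolean isAvailable(Upstream upstream) {
        return true; // assume healthy in this sketch
    }

    // Picks from the lowest tier value that still has an available endpoint;
    // tier 1 is considered only when every tier 0 endpoint is unavailable.
    public Optional<Upstream> select(List<Upstream> upstreams) {
        return upstreams.stream()
                .filter(this::isAvailable)
                .min(Comparator.comparingInt(Upstream::tier));
    }
}

Within the chosen tier, requests are then spread by round-robin or, as described next, by weights.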

Weights

Use weight inside groups of endpoints to allocate a share of the total traffic for specific endpoints. Endpoints get a share of requests with probability proportional to their weights.

Note: weight can be used only in combination with tier.

The default value is 1.

A positive value represents the endpoint capacity; a zero or negative value excludes the endpoint from routing.

Example of configuration of DIAL Core:

"models": {
"chat-gpt-35-turbo": {
"upstreams": [
{
"endpoint": "https://hostname1/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 0,
"weight": 3
},
{
"endpoint": "https://hostname2/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 0,
"weight": 2
},
{
"endpoint": "https://hostname3/openai/deployments/gpt-4-32k-0613/chat/completions",
"tier": 1,
"weight": 1
},
]
}
}

In this configuration, requests are initially directed to tier 0. Within this tier, requests are distributed among hostname1 and hostname2 with probability proportional to their weights. If tier 0 is unavailable or reaches its capacity limits, the remaining traffic is then routed to tier 1.
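
The weight-proportional draw can be sketched as follows. The WeightedChoice class and Candidate record are hypothetical, and the sketch assumes that a plain weighted random pick matches the probability-proportional behavior described above.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch of weighted random selection; not DIAL Core source code.
public class WeightedChoice {

    record Candidate(String endpoint, int weight) {}

    // Draws an endpoint with probability weight / sum(weights).
    // Zero and negative weights are skipped, mirroring the disable semantics.
    public static String pick(List<Candidate> tier) {
        List<Candidate> eligible = tier.stream().filter(c -> c.weight() > 0).toList();
        if (eligible.isEmpty()) {
            throw new IllegalStateException("no eligible endpoints");
        }
        int total = eligible.stream().mapToInt(Candidate::weight).sum();
        int roll = ThreadLocalRandom.current().nextInt(total); // 0 .. total - 1
        for (Candidate candidate : eligible) {
            roll -= candidate.weight();
            if (roll < 0) {
                return candidate.endpoint();
            }
        }
        throw new AssertionError("unreachable");
    }
}

In the tier 0 group above, hostname1 (weight 3) would receive roughly 60% of the traffic and hostname2 (weight 2) roughly 40%.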

Fallbacks

Fallback logic is designed to ensure system resilience and optimize request handling when upstream endpoints fail or become unavailable. It helps prevent repeated retries to the same endpoint and provides a structured approach to redirect requests to alternative endpoints.

  • Endpoint Selection: If endpoints in the lower (higher-priority) tier become unavailable, the traffic goes to the group of endpoints with the next tier value.
  • Retry Limitation: To prevent excessive retries to a specific deployment, the maxRetryAttempts parameter can be configured in DIAL Core dynamic settings. The default value is 5 for language models and 1 for applications.
  • Fallback: If all upstream endpoints are unavailable, the fallback algorithm will attempt to send the request to an endpoint that previously returned a 500 (Internal Server Error) or 429 (Too Many Requests) error code. This fallback is conditional and depends on the endpoint meeting predefined criteria, ensuring that even temporarily failing endpoints can be used if necessary. A rough sketch of this flow follows the list.
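
Below is a rough Java sketch of this flow, assuming hypothetical nextAvailable, lastFailedWith429Or500, and tryRequest helpers; the actual retry criteria live inside DIAL Core.

import java.util.Optional;

// Hypothetical sketch of the retry and fallback flow; not DIAL Core source code.
public class FallbackFlow {

    interface Balancer {
        Optional<String> nextAvailable();          // healthy endpoint, lowest tier first
        Optional<String> lastFailedWith429Or500(); // endpoint that may have recovered
        boolean tryRequest(String endpoint);       // true on success
    }

    // Tries up to maxRetryAttempts endpoints, preferring available ones and
    // conditionally falling back to one that failed with 429 or 500.
    public static boolean handle(Balancer balancer, int maxRetryAttempts) {
        for (int attempt = 0; attempt < maxRetryAttempts; attempt++) {
            Optional<String> target = balancer.nextAvailable()
                    .or(balancer::lastFailedWith429Or500);
            if (target.isEmpty()) {
                return false; // nothing left to try
            }
            if (balancer.tryRequest(target.get())) {
                return true;
            }
        }
        return false; // retry budget exhausted
    }
}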