Token Limits and Cost Control
Introduction
DIAL provides mechanisms for controlling token usage and managing the costs associated with AI models. Usage limits can be configured for individual users, user groups, and API keys. These controls help organizations manage consumption, prevent overuse, and allocate AI capabilities effectively across users and applications.
Refer to the tutorials to learn how to create and configure token rate limits and cost limits for API keys and JWTs.
Token Rate Limiting
Token rate limiting controls the volume of tokens processed by AI models within specific timeframes. The system tracks both input and output tokens and enforces limits over per-minute, daily, weekly, or monthly intervals. When a limit is reached, additional requests are rejected until the time window resets.
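The windowed accounting described above can be illustrated with a minimal Python sketch. It is not DIAL's implementation: the class name `TokenRateLimiter`, the fixed-window strategy, the window lengths (a 30-day month in particular), and the choice to check limits before a request and record usage afterwards are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Window lengths in seconds. The week and month lengths are simplifying assumptions.
WINDOW_SECONDS = {"minute": 60, "day": 86_400, "week": 604_800, "month": 2_592_000}


@dataclass
class TokenRateLimiter:
    """Hypothetical fixed-window limiter; not DIAL's actual implementation."""

    limits: dict  # window name -> allowed tokens, e.g. {"minute": 1000, "day": 50000}
    usage: dict = field(default_factory=dict)  # window name -> (window index, tokens used)

    def _used(self, window, now):
        index = int(now // WINDOW_SECONDS[window])
        stored_index, tokens = self.usage.get(window, (index, 0))
        # Usage recorded in an earlier window no longer counts.
        return index, (tokens if stored_index == index else 0)

    def exceeded(self, now):
        """Return the name of the first exhausted window, or None if all have room."""
        for window, limit in self.limits.items():
            _, used = self._used(window, now)
            if used >= limit:
                return window
        return None

    def record(self, input_tokens, output_tokens, now):
        """Count both input and output tokens against every configured window."""
        for window in self.limits:
            index, used = self._used(window, now)
            self.usage[window] = (index, used + input_tokens + output_tokens)
```

In this sketch a caller would invoke `exceeded(now)` before forwarding a request and `record(...)` once the model has responded, because the number of output tokens is only known after the response.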
Token limits primarily serve to distribute resources fairly and to establish service tiers. Administrators can configure different token allowances to suit different operational requirements.
Cost-Based Rate Limiting
Cost-based rate limiting addresses financial governance by setting monetary consumption limits. Unlike token limits, whose financial impact varies with the model used, cost limits directly control actual spending across all models available to a specific JWT or API key through its roles.
The system calculates the financial impact of each model interaction in real-time, tracking currency-based consumption across specified time intervals. This approach automatically accounts for the varying pricing of different models, allowing organizations to implement consistent budget controls regardless of which models are being utilized.
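To show why a cost limit behaves consistently across models while a token limit does not, the following sketch prices the same token counts on two models. The model names, the per-token prices, and the `request_cost` helper are hypothetical and are not taken from DIAL; real prices would come from the deployment's own configuration.

```python
from decimal import Decimal

# Hypothetical per-token prices in a single currency (illustrative values only).
PRICING = {
    "model-a": {"prompt": Decimal("0.000001"), "completion": Decimal("0.000002")},
    "model-b": {"prompt": Decimal("0.00001"), "completion": Decimal("0.00003")},
}


def request_cost(model, prompt_tokens, completion_tokens):
    """Return the cost of one interaction, priced separately for input and output tokens."""
    price = PRICING[model]
    return prompt_tokens * price["prompt"] + completion_tokens * price["completion"]


# The same 1,500 tokens cost very different amounts depending on the model.
cheap = request_cost("model-a", 1000, 500)
expensive = request_cost("model-b", 1000, 500)
```

Using `Decimal` rather than floating-point numbers avoids rounding drift when many small per-request costs are accumulated into a running total.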
Enforcement Mechanism
DIAL enforces these limits through a validation layer that examines the token and cost implications of each request. Both control mechanisms operate in parallel, and a request must satisfy all applicable limits. When a request is rejected, the user receives a notification explaining which specific constraint was triggered.
This dual-control approach ensures that both technical capacity and financial parameters are respected during model usage. The system maintains records of these limit checks for auditing and monitoring purposes.
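The dual-control check can be sketched as a single validation step that evaluates every applicable limit and reports the one that blocked the request. The `Limit` type, the `validate` function, the limit labels, and the message format are illustrative assumptions, not DIAL's actual API or error text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Limit:
    name: str       # hypothetical label, e.g. "token/minute" or "cost/day"
    used: float     # consumption recorded in the current window
    allowed: float  # configured ceiling for that window


def validate(limits: List[Limit]) -> Optional[str]:
    """Return None if the request may proceed, else a message naming the triggered limit."""
    for limit in limits:
        if limit.used >= limit.allowed:
            return f"Limit exceeded: {limit.name} ({limit.used} of {limit.allowed} used)"
    return None


# A request passes only when every token and cost limit still has headroom.
rejection = validate([
    Limit("token/minute", used=10, allowed=100),
    Limit("cost/day", used=5.0, allowed=5.0),
])
```

Here the token limit still has room, but the request is rejected because the daily cost limit is exhausted, and the returned message identifies that constraint.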
Monitoring and Analytics
DIAL's Analytics Realtime Service tracks consumption patterns and limit enforcement. The data flows into time-series databases, enabling visualization through tools like Grafana and PowerBI. This visibility helps administrators optimize limit configurations based on actual usage patterns and business requirements.