
Managing LLM API Rate Limits in Distributed Environments - Introducing LLMRateLimiter

December 6, 2025 · 5 min read

A Python library for managing LLM API rate limits in distributed environments (K8s, Lambda, etc.) using Redis to share state across multiple processes.


Introduction

"

TL;DR

  • A library for managing LLM API rate limits in distributed environments (K8s, Lambda, etc.)
  • Uses Redis to share state across multiple processes
  • Simply pip install llmratelimiter and use await limiter.acquire(tokens=5000)

When using LLM (Large Language Model) APIs, rate limiting is unavoidable. Major providers, including OpenAI, Anthropic, and Google Vertex AI, all impose TPM (Tokens Per Minute) and RPM (Requests Per Minute) limits.

This becomes particularly problematic in scenarios like:

  • Multiple worker processes calling the API simultaneously
  • Sharing APIs across microservices
  • Sending large volumes of requests in batch processing
  • Running in serverless environments like AWS Lambda or Cloud Functions

In serverless environments, each function instance runs independently, making in-memory rate limiting ineffective. You need distributed rate limiting with external shared storage (Redis).

This article introduces LLMRateLimiter, a library developed to solve these problems.

Development Background

This library was born from a project that simultaneously processed tens of thousands of LLM API calls using Celery in a Kubernetes environment.


The Challenge:

  • Dozens of Celery Workers calling LLM APIs in parallel
  • Each Worker has no knowledge of other Workers' state
  • Simply sending requests immediately hits rate limits
  • A storm of 429 Too Many Requests errors

The Solution:

  • Distributed rate limiting using Redis to coordinate across all Workers
  • FIFO queue for fair ordering
  • Atomic operations via Lua scripts to prevent race conditions

What is LLMRateLimiter?

LLMRateLimiter is a distributed rate limiting library using Redis. Its FIFO queue-based algorithm controls API requests fairly and efficiently.

Key Features

Feature              | Description
FIFO Queue           | First-come-first-served processing prevents thundering herd
Distributed Support  | Redis backend shares state across processes/servers
Flexible Limits      | Supports Combined TPM, Split TPM, and Mixed modes
Auto Retry           | Exponential backoff with jitter for Redis failures
Graceful Degradation | Allows requests to pass when Redis is unavailable

Installation

Bash
pip install llmratelimiter

Or using uv:

Bash
uv add llmratelimiter

Basic Usage

Simple Example

Python
from llmratelimiter import RateLimiter

# Just specify Redis URL and limits
limiter = RateLimiter(
    "redis://localhost:6379",
    "gpt-4",
    tpm=100_000,  # 100K tokens/minute
    rpm=100       # 100 requests/minute
)

# Acquire capacity before calling the API
await limiter.acquire(tokens=5000)
response = await openai.chat.completions.create(...)

Using an Existing Redis Client

Python
from redis.asyncio import Redis
from llmratelimiter import RateLimiter

redis = Redis(host="localhost", port=6379)
limiter = RateLimiter(redis=redis, model="gpt-4", tpm=100_000, rpm=100)

SSL Connection

Python
# Use rediss:// scheme for SSL connection
limiter = RateLimiter("rediss://localhost:6379", "gpt-4", tpm=100_000)

Rate Limiting Modes

LLMRateLimiter supports three modes:

Mode     | Use Case          | Configuration
Combined | OpenAI, Anthropic | tpm=100000
Split    | GCP Vertex AI     | input_tpm=..., output_tpm=...
Mixed    | Custom            | Both settings

Combined Mode

For providers like OpenAI and Anthropic with a single TPM limit combining input and output:

Python
limiter = RateLimiter(
    "redis://localhost:6379",
    "gpt-4",
    tpm=100_000,  # Combined TPM limit
    rpm=100
)

await limiter.acquire(tokens=5000)

Split Mode

For providers like GCP Vertex AI with separate limits for input and output:

Python
limiter = RateLimiter(
    "redis://localhost:6379",
    "gemini-1.5-pro",
    input_tpm=4_000_000,  # Input: 4M tokens/minute
    output_tpm=128_000,   # Output: 128K tokens/minute
    rpm=360
)

# Specify input and output token counts
await limiter.acquire(input_tokens=5000, output_tokens=2048)

Tip

The output token count must be estimated in advance; using the value of the max_tokens parameter is common practice.

Mixed Mode

Use both combined and split limits simultaneously:

Python
limiter = RateLimiter(
    "redis://localhost:6379",
    "custom-model",
    tpm=500_000,          # Combined limit
    input_tpm=4_000_000,  # Input limit
    output_tpm=128_000,   # Output limit
    rpm=360
)

Architecture

Overall Structure


Multiple application instances share state through Redis to collectively respect rate limits.
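
For example, several independent workers pointing at the same Redis instance and the same model key are throttled as a group. The sketch below assumes only the RateLimiter constructor and acquire() signature shown earlier; each coroutine stands in for what would normally be a separate process, pod, or Lambda instance.

Python
import asyncio

from llmratelimiter import RateLimiter

async def worker(worker_id: int) -> None:
    # Each worker builds its own limiter, but all of them coordinate
    # through the shared Redis state for the "gpt-4" key.
    limiter = RateLimiter("redis://localhost:6379", "gpt-4", tpm=100_000, rpm=100)
    await limiter.acquire(tokens=5_000)
    print(f"worker {worker_id} acquired capacity")

async def main() -> None:
    # Ten concurrent workers collectively stay within 100K TPM / 100 RPM.
    await asyncio.gather(*(worker(i) for i in range(10)))

asyncio.run(main())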

Request Flow

When acquire() is called, the following operations execute atomically on Redis:


Key Points of Atomic Operations via Lua Scripts:

  1. Delete Expired Records: Clean up old records outside the window
  2. Aggregation: Calculate token and request counts within the current window
  3. Limit Check: Compare against TPM/RPM limits
  4. Slot Allocation: Grant a slot immediately if within limits; otherwise, calculate when capacity becomes available
  5. Record Storage: Save consumption record to Redis Sorted Set
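
As a rough, non-atomic sketch of these steps (simplified to the TPM check only; the library itself runs the equivalent logic inside a single Lua script, as the next section explains), the flow maps onto a Redis sorted set roughly like this:

Python
import time

from redis.asyncio import Redis

async def acquire_conceptual(redis: Redis, key: str, tokens: int, tpm: int,
                             window: float = 60.0) -> float:
    """Walks through the five steps above; NOT atomic, unlike the real Lua script."""
    now = time.time()
    # 1. Delete expired records that have fallen out of the sliding window
    await redis.zremrangebyscore(key, 0, now - window)
    # 2. Aggregate the tokens consumed within the current window
    records = await redis.zrange(key, 0, -1, withscores=True)
    used = sum(int(member.split(b":")[0]) for member, _ in records)
    # 3. Compare against the TPM limit
    if used + tokens <= tpm:
        wait = 0.0  # 4. Within limits: the slot is available immediately
    else:
        # 4. Over the limit: capacity frees up when the oldest record expires
        #    (simplified; freeing enough capacity may take several expiries)
        oldest = records[0][1] if records else now
        wait = max(0.0, oldest + window - now)
    # 5. Store the consumption record in the sorted set (member encodes the token count)
    await redis.zadd(key, {f"{tokens}:{now}": now})
    return wait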

Why Lua Scripts Are Necessary

The biggest challenge when implementing rate limiting in distributed environments is race conditions.

The Race Condition Problem

A naive Python implementation creates race conditions like this:

Python
# This is dangerous: a race condition occurs between read and write
async def acquire_naive(redis, tokens, limit=100_000):
    current = int(await redis.get("token_count") or 0)        # (1) Read
    if current + tokens <= limit:                              # (2) Check
        await redis.set("token_count", current + tokens)       # (3) Write
        return True
    return False

When Workers 1 and 2 issue their GET simultaneously, both see 2,000 tokens of remaining capacity and each tries to consume 3,000 tokens. The result: the limit is exceeded.

Solution with Lua Scripts

Lua scripts execute atomically on the Redis server. Other commands are blocked during script execution, preventing race conditions.

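As a minimal sketch of the idea (not the library's actual script, which also handles RPM, split input/output limits, wait-time calculation, and FIFO ordering), a sliding-window TPM check can be written as a Lua script and invoked atomically from Python with redis-py's register_script:

Python
import time

from redis.asyncio import Redis

# Simplified sliding-window TPM check; the whole script runs atomically on the server.
SLIDING_WINDOW_LUA = """
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local tokens = tonumber(ARGV[3])
local limit  = tonumber(ARGV[4])

-- drop records outside the window, then sum what remains
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local used = 0
for _, member in ipairs(redis.call('ZRANGE', key, 0, -1)) do
    used = used + tonumber(string.match(member, '^(%d+)'))
end

-- admit the request only if it fits, recording its consumption
if used + tokens <= limit then
    redis.call('ZADD', key, now, tokens .. ':' .. now)
    return 1
end
return 0
"""

async def try_acquire(redis: Redis, key: str, tokens: int, tpm: int) -> bool:
    # register_script caches the script and runs it via EVALSHA under the hood
    script = redis.register_script(SLIDING_WINDOW_LUA)
    allowed = await script(keys=[key], args=[time.time(), 60, tokens, tpm])
    return bool(allowed)

Because Redis executes the script as one unit, no other client can touch the sorted set between the cleanup and the final write, which is exactly the guarantee the naive implementation above lacks.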

Sliding Window

Rate limiting is managed with a 60-second (configurable) sliding window:


Connection Management and Retry

RedisConnectionManager

For production environments, use RedisConnectionManager with connection pooling and retry functionality:

Python
from llmratelimiter import RateLimiter, RedisConnectionManager, RetryConfig

manager = RedisConnectionManager(
    "redis://localhost:6379",
    max_connections=20,
    retry_config=RetryConfig(
        max_retries=5,
        base_delay=0.1,
        max_delay=10.0,
        exponential_base=2.0,
        jitter=0.1
    )
)

limiter = RateLimiter(manager, "gpt-4", tpm=100_000, rpm=100)

Exponential Backoff

When Redis operations fail, retry with exponential backoff:

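As a sketch of the general scheme (using the same parameter names as the RetryConfig shown above; the library's exact formula may differ), each attempt's delay grows exponentially from base_delay up to max_delay, with a small random jitter so that many failing workers do not retry in lockstep:

Python
import random

def backoff_delay(attempt: int, base_delay: float = 0.1, exponential_base: float = 2.0,
                  max_delay: float = 10.0, jitter: float = 0.1) -> float:
    # attempt 0 -> ~0.1 s, 1 -> ~0.2 s, 2 -> ~0.4 s, ... capped at max_delay
    delay = min(base_delay * exponential_base ** attempt, max_delay)
    # spread retries by up to +/-10% so workers do not synchronize
    return delay * (1 + random.uniform(-jitter, jitter))

for attempt in range(5):
    print(f"attempt {attempt}: sleeping {backoff_delay(attempt):.2f}s before retrying")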

Graceful Degradation

When Redis is unavailable, requests pass through instead of being blocked:

Python
# Even if Redis is down, this returns immediately
result = await limiter.acquire(tokens=5000)
# If wait_time=0 and queue_position=0, the request
# may have passed without rate limiting

Monitoring

Use get_status() to check current usage:

Python
status = await limiter.get_status()

print(f"Model: {status.model}")
print(f"Token Usage: {status.tokens_used}/{status.tokens_limit}")
print(f"Input Tokens: {status.input_tokens_used}/{status.input_tokens_limit}")
print(f"Output Tokens: {status.output_tokens_used}/{status.output_tokens_limit}")
print(f"Requests: {status.requests_used}/{status.requests_limit}")
print(f"Queue Depth: {status.queue_depth}")
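
A usage-level sketch (assuming only the status fields printed above): the same call can drive simple alerting before a batch job saturates the limit. The thresholds here are arbitrary examples.

Python
import logging

logger = logging.getLogger(__name__)

status = await limiter.get_status()
tpm_utilization = status.tokens_used / status.tokens_limit

# Warn before the window saturates or the FIFO queue gets deep
if tpm_utilization > 0.9 or status.queue_depth > 50:
    logger.warning(
        "%s rate limit nearly saturated: %.0f%% of TPM used, queue depth %d",
        status.model, tpm_utilization * 100, status.queue_depth,
    )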

Best Practices

1. Accurate Token Estimation

When using Split Mode, estimating output token count is important:

Python
# Use max_tokens parameter for estimation
estimated_output = min(request_max_tokens, 4096)

result = await limiter.acquire(
    input_tokens=len(prompt_tokens),
    output_tokens=estimated_output
)
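
For the input side, an exact count beats a guess. The sketch below uses tiktoken for OpenAI models (tiktoken is not a dependency of LLMRateLimiter and is shown only as one way to obtain the count; other providers ship their own token-counting APIs):

Python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

def count_tokens(prompt: str) -> int:
    # Exact input-token count for OpenAI models
    return len(encoding.encode(prompt))

estimated_output = min(request_max_tokens, 4096)
result = await limiter.acquire(
    input_tokens=count_tokens(prompt),
    output_tokens=estimated_output,
)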

2. Use Context Managers

Ensure connections are properly cleaned up:

Python
async with RedisConnectionManager("redis://localhost:6379") as manager:
    limiter = RateLimiter(manager, "gpt-4", tpm=100_000, rpm=100)
    await limiter.acquire(tokens=5000)
# Connection pool automatically closed

3. Use Separate Limiters Per Model

Python
gpt4_limiter = RateLimiter(manager, "gpt-4", tpm=100_000, rpm=100)
gpt35_limiter = RateLimiter(manager, "gpt-3.5-turbo", tpm=1_000_000, rpm=500)

4. Utilize Burst Allowance

When tolerating temporary spikes:

Python
limiter = RateLimiter(
    "redis://localhost:6379",
    "gpt-4",
    tpm=100_000,
    rpm=100,
    burst_multiplier=1.5  # Allow 50% burst
)

Summary

LLMRateLimiter is a library for efficiently managing LLM API rate limits in distributed environments.

Key Benefits:

  • Simple API: Just pass a Redis URL to get started
  • Distributed Support: Share state across multiple processes/servers
  • Flexibility: Supports various LLM provider limit models
  • Reliability: Resilient to failures with retry and graceful degradation


Give it a try! Feedback and contributions are welcome.

Written by

Ameyanagi

Principal Scientist at the intersection of chemistry and computational science. Passionate about XAFS analysis, heterogeneous catalysis, and Rust programming.