Introduction
As an AWS partner, our expertise allows us to tackle complex challenges for our clients. Recently, we had
the opportunity to implement LiteLLM Proxy in a self-hosted environment on AWS ECS for one of our
clients. The main objectives of this project were to address quota limitations, ensure high availability, and
balance the load across multiple language model providers.
Overview of LiteLLM Proxy
LiteLLM is a powerful tool that serves as a unified interface for accessing more than 100 large language models (LLMs). It offers two primary usage modes:
- As an SDK for interacting with models via code.
- As a proxy server that abstracts multiple services behind a single OpenAI-compatible API.
For this project, we opted for the second approach, which provided numerous benefits, including:
- Centralised management of API calls to different LLM providers.
- Enhanced flexibility to switch between models.
- Advanced features for quota and security management.
Overall Architecture of the Solution
Our solution is built on a robust architecture deployed on AWS, comprising the following components:
- Amazon ECS (Elastic Container Service) for hosting and managing LiteLLM Proxy containers.
- Cross-account IAM roles to optimize Amazon Bedrock quota usage.
- Amazon Bedrock for accessing a catalog of AI models.

Deploying LiteLLM Proxy on ECS involved several key steps:
1. Creating an S3 bucket to store LiteLLM Proxy configuration files (YAML format).
2. In an existing ECS cluster:
- Creating a task definition to initialize the configuration from S3 and run LiteLLM Proxy containers.
- Configuring ECS services to ensure continuous availability.
3. Setting up cross-account IAM roles:
- With custom IAM policies (configurable per model, for example).
- Assumable by ECS tasks for optimal quota management across multiple AWS accounts.
4. Enabling various AI models in AWS accounts across different regions.
To automate these steps, we developed Terraform modules.
How to Fetch the LiteLLM Proxy Configuration from S3 in ECS?
To ensure that LiteLLM Proxy loads its configuration dynamically, we set up an init container within the ECS
task definition. This container retrieves the configuration file from an S3 bucket before the main application
starts. Below is a Terraform snippet illustrating this process:
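The version shown here is a simplified sketch rather than our production module: the bucket name, image tags, and IAM role references are placeholders, and networking, logging, and secrets configuration are omitted for brevity.

```hcl
# Sketch of an ECS task definition in which an init container copies the
# LiteLLM configuration from S3 into a shared volume before the proxy starts.
resource "aws_ecs_task_definition" "litellm" {
  family                   = "litellm-proxy"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024
  memory                   = 2048
  execution_role_arn       = aws_iam_role.ecs_execution.arn # placeholder
  task_role_arn            = aws_iam_role.litellm_task.arn  # placeholder (needs s3:GetObject)

  volume {
    name = "litellm-config"
  }

  container_definitions = jsonencode([
    {
      name        = "config-init"
      image       = "public.ecr.aws/aws-cli/aws-cli:latest"
      essential   = false
      command     = ["s3", "cp", "s3://my-litellm-config-bucket/config.yaml", "/config/config.yaml"]
      mountPoints = [{ sourceVolume = "litellm-config", containerPath = "/config" }]
    },
    {
      name         = "litellm-proxy"
      image        = "ghcr.io/berriai/litellm:main-latest"
      essential    = true
      command      = ["--config", "/config/config.yaml", "--port", "4000"]
      portMappings = [{ containerPort = 4000, protocol = "tcp" }]
      mountPoints  = [{ sourceVolume = "litellm-config", containerPath = "/config" }]
      # The proxy only starts once the configuration has been fetched successfully.
      dependsOn    = [{ containerName = "config-init", condition = "SUCCESS" }]
    }
  ])
}
```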

How to Make Requests to LiteLLM Proxy?
LiteLLM Proxy is fully compatible with OpenAI's API format, making it easy to use with standard tools.
Below are examples of how to make requests to LiteLLM Proxy using curl and Python.
Using curl
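A request of the following shape can be sent to the proxy; the endpoint URL, virtual key, and model alias are placeholders, and the alias must match a model_name declared in the LiteLLM configuration.

```bash
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-litellm-key" \
  -d '{
        "model": "claude-3-sonnet",
        "messages": [{"role": "user", "content": "Summarise the benefits of an LLM proxy."}]
      }'
```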

Using Python (or any OpenAI compatible SDK)
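With the official OpenAI Python SDK, the same call might look like this (base URL and key are again placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM Proxy endpoint.
client = OpenAI(
    base_url="http://localhost:4000/v1",  # placeholder proxy URL
    api_key="sk-your-litellm-key",        # virtual key issued by the proxy
)

response = client.chat.completions.create(
    model="claude-3-sonnet",  # model alias defined in the LiteLLM configuration
    messages=[{"role": "user", "content": "Summarise the benefits of an LLM proxy."}],
)
print(response.choices[0].message.content)
```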

Quota Management with LiteLLM Proxy
One of the major challenges our client faced was managing quotas imposed by LLM providers. To overcome
this limitation, we implemented an innovative strategy:
- Using cross-account IAM roles to access quotas from multiple AWS accounts.
- Configuring LiteLLM Proxy to intelligently distribute API calls across these accounts.
This approach allowed the client to work around per-account quota limits and maintain service continuity even during peak usage.
A key security advantage of this architecture is that using IAM roles to access AI models in Amazon Bedrock eliminates the need to store or transmit static API keys and other long-lived credentials.
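As an illustration, the LiteLLM configuration can declare the same model alias several times, with each deployment assuming a role in a different AWS account. The sketch below uses placeholder account IDs, regions, and role names; aws_role_name and aws_session_name are the parameters LiteLLM documents for STS-based role assumption with Bedrock.

```yaml
model_list:
  - model_name: claude-3-sonnet            # single alias exposed to clients
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_region_name: us-east-1
      aws_role_name: arn:aws:iam::111111111111:role/litellm-bedrock-access
      aws_session_name: litellm-proxy
  - model_name: claude-3-sonnet            # same alias, quota drawn from a second account
    litellm_params:
      model: bedrock/anthropic.claude-3-sonnet-20240229-v1:0
      aws_region_name: eu-west-1
      aws_role_name: arn:aws:iam::222222222222:role/litellm-bedrock-access
      aws_session_name: litellm-proxy
```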
Ensuring High Availability
To guarantee maximum service availability, we implemented several measures:
- Deploying multiple instances of the task within the ECS LiteLLM Proxy service.
- Implementing automatic scaling strategies for ECS tasks, dynamically adjusting capacity based on demand (a Terraform sketch follows this list).
- Distributing ECS instances across multiple AWS Availability Zones for increased resilience.
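For the auto-scaling part, a target-tracking policy on the ECS service is one straightforward option. The sketch below assumes a Fargate service referenced as aws_ecs_service.litellm in a cluster aws_ecs_cluster.main; the capacity bounds and CPU target are illustrative.

```hcl
# Sketch: scale the LiteLLM ECS service on average CPU utilisation.
resource "aws_appautoscaling_target" "litellm" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.litellm.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2  # keep at least two tasks spread across Availability Zones
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "litellm_cpu" {
  name               = "litellm-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.litellm.service_namespace
  resource_id        = aws_appautoscaling_target.litellm.resource_id
  scalable_dimension = aws_appautoscaling_target.litellm.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 60 # aim for roughly 60% average CPU before adding tasks
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```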
Load Balancing Between Providers
LiteLLM Proxy offers various routing strategies to optimise the use of different LLM providers:
- Rate-Limit Aware v2 (Asynchronous): Takes rate limits into account to prevent overloading services.
- Latency-Based: Directs calls to deployments with the lowest latency.
- Weighted Selection (Asynchronous) (Default): Assigns weights to models and distributes calls accordingly.
- Least Busy: Redirects calls to the least utilised models.
- Custom Routing Strategy: Allows defining specific routing rules based on business needs.
- Lowest Cost Routing (Asynchronous): Directs calls to the least expensive deployments.
For this implementation, we configured intelligent routing based on the "Least Busy" strategy. By steering traffic away from saturated deployments, this approach keeps response times consistent and the user experience high-quality.
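In the LiteLLM configuration this comes down to a single router setting; a minimal sketch, using the strategy value documented by LiteLLM, looks like this:

```yaml
router_settings:
  routing_strategy: least-busy  # send each request to the deployment with the fewest in-flight calls
```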
Monitoring and Logging
To ensure optimal tracking of the solution, we configured and utilised:
- A CloudWatch log group for easy troubleshooting and analysis (including request logs and cost data).
- CloudWatch dashboards to visualize key metrics for ECS and LiteLLM Proxy.
- Automated alerts in case of anomalies or predefined threshold exceedances.
LiteLLM Proxy also includes features such as alerts (Slack, Discord, Microsoft Teams, webhooks) for
notifications regarding:
- LLM performance.
- Budgets and spending.
- System status (e.g., database connection issues).
- Daily usage reports.
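Enabling these alerts is a configuration-level change. The sketch below follows LiteLLM's documented Slack alerting settings as we understand them (the Slack webhook URL is supplied via the SLACK_WEBHOOK_URL environment variable, and the threshold value is illustrative):

```yaml
general_settings:
  alerting: ["slack"]      # other channels are available via webhooks
  alerting_threshold: 300  # seconds before a hanging or slow request triggers an alert
```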
For more advanced monitoring, we recommend specialized tools such as AgentOps and LangTrace.
Advanced Features and Future Enhancements
Although our current implementation meets the client's immediate needs, LiteLLM Proxy offers many
additional advanced features:
- Virtual API keys for fine-grained access management for users or applications (see the example after this list).
- IP-based filtering to enhance security.
- Access control per model.
- Advanced budget and rate limit management.
- Tagging system for better organisation.
- Fallback mechanisms for increased resilience.
- Prometheus integration for advanced monitoring.
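As one example, virtual API keys are issued through the proxy's /key/generate endpoint, assuming the proxy is configured with a master key and a database for key management; the key, model alias, and metadata below are illustrative.

```bash
curl http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-master-key" \
  -H "Content-Type: application/json" \
  -d '{"models": ["claude-3-sonnet"], "duration": "30d", "metadata": {"team": "web-app"}}'
```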
Looking ahead, we plan to explore additional features to further optimise the solution, including:
- Implementing a caching system to reduce latency and costs.
- Deploying more sophisticated fallback mechanisms.
- Utilising advanced metrics for continuous performance and cost optimisation.
Conclusion
The implementation of LiteLLM Proxy on AWS ECS demonstrates how a well-architected cloud solution can
address complex challenges related to intensive LLM usage. By leveraging AWS ECS, cross-account IAM
roles, and LiteLLM Proxy, we successfully built a robust, scalable, and cost-effective solution.
This approach enabled our client to:
- Overcome LLM access quota limitations.
- Ensure high availability for their AI-powered services.
- Balance load across multiple LLM providers and AWS accounts.
The lessons learned from this project highlight the importance of careful planning, flexible architecture, and
continuous monitoring to succeed in the evolving fields of AI and cloud computing.
As we continue exploring the possibilities offered by LiteLLM Proxy and AWS, we are confident that this
solution will evolve to meet the growing needs of innovative companies in web development and AI.