- September 30, 2024
- Prasad Kanigicherla
- 0
Leveraging Burstable Instances on AWS for Cost-Effective AI/ML and LLM Workloads
1. Introduction
Running AI/ML workloads or Large Language Models (LLMs) on cloud platforms can be expensive due to their high computational demands. AWS burstable instances offer a way to balance performance against cost. This paper explores how to leverage AWS burstable instances (T-series) for AI/ML and LLM workloads, keeping operational costs down without compromising performance when workload demand is variable.
2. What Are Burstable Instances?
Burstable instances on AWS, particularly the T-series (T3, T3a, and T4g), are designed for workloads with variable CPU utilization. These instances operate on a credit-based system: they accumulate CPU credits during periods of low activity and spend them during CPU-intensive tasks, allowing short bursts of high performance. (A launch sketch follows the feature list below.)
Key Features:
- CPU Credits: Credits accumulate while CPU usage stays below the instance's baseline and can be spent to burst above the baseline when needed.
- Cost Efficiency: Lower hourly costs compared to on-demand compute-optimized instances.
- Flexibility: Suitable for workloads with intermittent spikes in resource usage.
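As a concrete illustration of how the credit behavior is configured at launch, here is a minimal boto3 sketch that starts a T3 instance with an explicit credit specification. The AMI and subnet IDs are placeholders, not real values:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch a t3.medium in "standard" credit mode; the AMI and subnet IDs
# below are hypothetical placeholders -- substitute your own values.
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # placeholder AMI ID
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",  # placeholder subnet ID
    # "standard" caps bursting at earned credits; "unlimited" allows
    # sustained bursting for an additional per-vCPU-hour charge.
    CreditSpecification={"CpuCredits": "standard"},
)
print(response["Instances"][0]["InstanceId"])
```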
3. How Burstable Instances Benefit AI/ML and LLM Workloads
3.1 Handling Inference Tasks Efficiently
Inference tasks, where AI/ML models or LLMs are used to generate predictions, often experience variable demand. Burstable instances are ideal for such workloads due to their ability to handle intermittent spikes in CPU usage:
- Low Latency Requirements: During periods of heavy inference requests, burstable instances can spend accumulated CPU credits to absorb the increased load without a spike in latency.
- Cost Savings: When traffic is low, such as during off-peak hours, the instance can run at a reduced CPU capacity, keeping costs down.
Example: Running an LLM-based chatbot on a T3 instance. During peak hours, the instance bursts to handle user queries efficiently; during off-peak hours, it runs at low CPU utilization and accrues credits for the next peak.
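A minimal sketch of such an endpoint is below, assuming a small Hugging Face model (distilgpt2, chosen purely for illustration) that fits comfortably in a T3 instance's memory:

```python
# Minimal inference endpoint sketch for a burstable instance.
# Assumes: pip install fastapi uvicorn transformers torch
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# A small model keeps memory within T3 limits; swap in your own model.
generator = pipeline("text-generation", model="distilgpt2")

@app.get("/chat")
def chat(prompt: str):
    # CPU-bound generation: bursts consume credits during peak traffic,
    # while idle gaps between requests let credits accrue.
    output = generator(prompt, max_new_tokens=50)
    return {"reply": output[0]["generated_text"]}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```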
3.2 Training Small to Medium-Sized Models
While burstable instances may not be suitable for training large-scale models, they are efficient for training smaller models or fine-tuning pre-trained models.
- Short Bursts of Activity: Model training often involves periods of intense computation followed by less intensive data processing. Burstable instances can handle these peaks efficiently.
- Batch Processing: Training jobs can be broken into smaller batches, utilizing the burst capability for computation-heavy phases.
Example: Fine-tuning a BERT-based model on a domain-specific dataset using a T3a instance can be more cost-efficient than using a compute-optimized instance for the entire process.
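One way to structure such a job is a resumable loop that checkpoints between batches, so the compute-heavy phases burst while the job can pause and resume around credit availability. The sketch below shows only the loop structure; model and data loading are assumed:

```python
# Sketch: batch-wise fine-tuning with periodic checkpoints, so a job can
# stop and resume without losing progress. Model and DataLoader setup
# are assumed to exist elsewhere.
import torch

def fine_tune(model, data_loader, epochs=1, ckpt_path="ckpt.pt"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    for epoch in range(epochs):
        for step, (inputs, labels) in enumerate(data_loader):
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(inputs), labels)
            loss.backward()
            optimizer.step()
            # Checkpoint every 100 steps: computation-heavy phases burst,
            # and an interrupted job resumes from the last save.
            if step % 100 == 0:
                torch.save({"model": model.state_dict(),
                            "optimizer": optimizer.state_dict(),
                            "epoch": epoch, "step": step}, ckpt_path)
```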
3.3 Supporting Real-Time LLM Applications with Cache or Retrieval-Augmented Generation (RAG)
For LLM applications incorporating RAG, where the model frequently consults a knowledge base or cache, burstable instances handle the variable load well: retrieval itself is light CPU work, so the instance only needs to burst when the model generates a response.
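The toy sketch below illustrates why the load is variable: the retrieval step is cheap, and only the final generation step bursts. The small sentence-transformers encoder and the two-document in-memory store are illustrative assumptions:

```python
# Toy RAG retrieval sketch: cheap vector lookup, bursty generation.
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly
docs = ["Burstable instances earn CPU credits at low load.",
        "Unlimited mode allows sustained bursting for a fee."]
doc_vecs = encoder.encode(docs)  # computed once, kept in memory

def retrieve(query: str) -> str:
    # Cosine similarity against the small knowledge base -- light CPU work.
    q = encoder.encode([query])[0]
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
                             * np.linalg.norm(q))
    return docs[int(np.argmax(scores))]

context = retrieve("How do CPU credits accumulate?")
# The retrieved context would then be passed to the LLM -- the only step
# that actually bursts CPU usage.
print(context)
```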
4. Cost-Benefit Analysis
| Cost Factor | Burstable Instances (T-Series) | Compute-Optimized Instances (C-Series) |
| --- | --- | --- |
| Hourly Pricing | Significantly lower | Higher |
| CPU Utilization | Priced for baseline CPU, with credit-funded bursts | Priced for full compute capacity at all times |
| Best For | Variable, intermittent workloads | Consistently high CPU utilization |
| Savings Potential | Up to 50-60% for bursty workloads | Expensive when capacity sits idle |
Key Cost Considerations:
- Burstable instances allow you to pay for what you need, making them cost-efficient for workloads with fluctuating demand.
- Compute-optimized instances, while more powerful, often result in unnecessary costs if not fully utilized.
5. Best Practices for Using Burstable Instances with AI/ML and LLM Workloads
5.1 Monitor CPU Credit Usage
- Regularly monitor CPU credit usage with Amazon CloudWatch (metric CPUCreditBalance) to ensure instances retain sufficient credits for peak periods; see the sketch after this list.
- Enable unlimited mode (the default on T3, T3a, and T4g) to maintain performance even after credits run out, for a small additional per-vCPU-hour charge.
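A minimal sketch of such a check, assuming a placeholder instance ID, reads the last hour of CPUCreditBalance from CloudWatch:

```python
# Sketch: read an instance's recent CPUCreditBalance from CloudWatch.
# The instance ID is a placeholder assumption.
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch", region_name="us-east-1")

resp = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,          # 5-minute resolution
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```

A low or steadily declining balance ahead of a known peak is the signal to enable unlimited mode or move the workload to a larger instance.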
5.2 Combine with Spot Instances
- Pair AWS Spot Instances with burstable instances for training tasks. Spot pricing can cut compute costs significantly, provided training jobs checkpoint regularly so they can tolerate interruptions (see the sketch below).
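As a rough sketch (placeholder AMI ID, and a compute-optimized type chosen here as an assumption for the heavy training phase), a one-time Spot request via boto3 might look like this:

```python
# Sketch: request a Spot-priced instance for a training job via boto3.
# The AMI ID is a placeholder assumption.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
    InstanceType="c6i.xlarge",        # compute-optimized for the heavy phase
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            # One-time request; the training code should checkpoint so an
            # interruption only loses work since the last save.
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```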
5.3 Optimize Inference Loads
- For LLM applications, use burstable instances for inference tasks, especially if traffic is unpredictable or varies based on time.
- Implement caching or retrieval-augmented generation to reduce the computational demand placed on the LLM itself, as in the memoization sketch below.
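For example, a simple memoization layer means repeated prompts consume almost no CPU credits; generate_reply below is a stand-in for whatever inference call the application actually makes:

```python
# Sketch: memoize LLM responses so repeated prompts cost ~zero CPU credits.
from functools import lru_cache
import time

def generate_reply(prompt: str) -> str:
    # Stand-in for an expensive LLM inference call (assumption).
    time.sleep(1)  # simulate CPU-heavy generation
    return f"reply to: {prompt}"

@lru_cache(maxsize=1024)
def cached_reply(prompt: str) -> str:
    return generate_reply(prompt)

# First call pays the full cost; repeats are served from memory,
# flattening CPU usage and preserving credits for new queries.
print(cached_reply("What is a CPU credit?"))
print(cached_reply("What is a CPU credit?"))  # cache hit, near-instant
```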
5.4 Right-Sizing Instances
- Evaluate the instance size (e.g., t3.medium vs. t3.large) to ensure you’re not over-provisioning resources. Use AWS Compute Optimizer to analyze your workload and recommend an optimal instance type; a sample query is sketched below.
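A sketch of querying Compute Optimizer via boto3 follows; the instance ARN is a placeholder, and the service must already be opted in for the account:

```python
# Sketch: pull right-sizing recommendations from AWS Compute Optimizer.
# The instance ARN below is a placeholder assumption.
import boto3

co = boto3.client("compute-optimizer", region_name="us-east-1")

resp = co.get_ec2_instance_recommendations(
    instanceArns=[
        "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"
    ]
)
for rec in resp["instanceRecommendations"]:
    print("Current:", rec["currentInstanceType"], "| Finding:", rec["finding"])
    for option in rec["recommendationOptions"]:
        print("  Suggested:", option["instanceType"])
```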
6. Limitations and Considerations
- Limited for High-Throughput Training: Burstable instances are not suited to large-scale model training that requires sustained high CPU utilization, and the T-series offers no GPUs. Compute-optimized or GPU instances are better suited for such workloads.
- Dependency on CPU Credits: If an instance exhausts its CPU credits, it could lead to reduced performance, so monitoring is essential.
- Memory Constraints: Burstable instances have limited memory compared to compute-optimized instances, which might be a bottleneck for larger AI/ML models.
7. Conclusion
AWS burstable instances offer a cost-effective solution for running AI/ML and LLM workloads with variable computational demands. By leveraging the burst capability, you can maintain performance during peak periods while reducing costs during idle times. This makes burstable instances particularly suited for inference tasks, small-scale training, and applications with fluctuating traffic patterns.