Using KV Cache
This guide shows how to configure and use KV cache with your inference services.
Quick Start
The simplest way to enable KV cache is to let the AIMService create one automatically:
```yaml
apiVersion: aim.silogen.ai/v1alpha1
kind: AIMService
metadata:
  name: llama-chat
  namespace: ml-team
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    type: redis # Automatically creates 'kvcache-llama-chat'
```
This creates an AIMKVCache resource with default settings (1Gi storage, default storage class, Redis backend).
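To try it, apply the manifest and confirm that the cache was created. This is a minimal sketch; the filename is hypothetical, and the `aimkvcache` resource name is an assumption based on the CRD kind (the same convention `kubectl get aimservice` uses later in this guide):

```bash
kubectl apply -f llama-chat.yaml

# The auto-created cache follows the 'kvcache-<service name>' pattern
kubectl get aimkvcache kvcache-llama-chat -n ml-team
```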
Configuration Options
Custom Storage Size
Specify storage size based on your model and workload:
```yaml
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-70b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 100Gi # Larger model needs more cache storage
```
Custom Storage Class
Use a specific storage class for better performance:
```yaml
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 50Gi
      storageClassName: fast-ssd # Use your high-performance storage class
```
Custom Access Modes
Specify persistent volume access modes (defaults to ReadWriteOnce):
```yaml
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 50Gi
      accessModes:
        - ReadWriteOnce
```
Sharing KV Cache
Multiple services can share a single KV cache for better resource utilization.
Step 1: Create a Standalone KV Cache
```yaml
apiVersion: aim.silogen.ai/v1alpha1
kind: AIMKVCache
metadata:
  name: shared-cache
  namespace: ml-team
spec:
  kvCacheType: redis
  storage:
    size: 200Gi
    storageClassName: fast-ssd
```
Step 2: Reference from Multiple Services
```yaml
---
apiVersion: aim.silogen.ai/v1alpha1
kind: AIMService
metadata:
  name: llama-chat
  namespace: ml-team
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    name: shared-cache # References existing cache
---
apiVersion: aim.silogen.ai/v1alpha1
kind: AIMService
metadata:
  name: llama-completion
  namespace: ml-team
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    name: shared-cache # Same cache, shared across services
```
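After applying both services, they should come up against the same cache rather than creating new ones. A quick check, assuming the `aimkvcache` resource name:

```bash
# Expect a single shared cache in the namespace, referenced by both services
kubectl get aimkvcache -n ml-team
```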
Advanced Configuration
Custom LMCache Configuration
By default, Kaiwo generates a standard LMCache configuration file for your service. For advanced use cases, you can provide a custom LMCache configuration by specifying the lmCacheConfig field. This allows you to fine-tune caching behavior, serialization methods, and other LMCache-specific settings.
Using the {SERVICE_URL} Placeholder
When specifying a custom configuration, you should use the {SERVICE_URL} placeholder for the remote_url field instead of hardcoding the cache endpoint. Kaiwo will automatically replace this placeholder with the actual KV cache service URL at runtime.
Example with custom configuration:
```yaml
apiVersion: aim.silogen.ai/v1alpha1
kind: AIMService
metadata:
  name: llama-chat
  namespace: ml-team
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 50Gi
    lmCacheConfig: |
      local_cpu: true
      chunk_size: 256
      max_local_cpu_size: 2.0
      remote_url: "{SERVICE_URL}"
      remote_serde: "cachegen"
```
The {SERVICE_URL} placeholder will be automatically replaced with the actual Redis service URL (e.g., redis://kvcache-llama-chat-redis-svc:6379).
Configuration Options
Common LMCache configuration options include:
| Field | Type | Description | Default |
|---|---|---|---|
| `local_cpu` | boolean | Enable local CPU RAM cache for fast access | `true` |
| `chunk_size` | integer | Size of cache chunks in tokens | `50` |
| `max_local_cpu_size` | float | Maximum size (GB) for local CPU cache | `1.0` |
| `remote_url` | string | URL of the remote cache backend (use `{SERVICE_URL}`) | - |
| `remote_serde` | string | Serialization method: `"naive"` or `"cachegen"` | `"naive"` |
| `pipelined_backend` | boolean | Enable pipelined backend for better performance | `false` |
| `save_decode_cache` | boolean | Whether to cache decode-phase KV pairs | `false` |
Default Configuration
When lmCacheConfig is not specified, the following default configuration is used:
```yaml
local_cpu: true
chunk_size: 50
max_local_cpu_size: 1.0
remote_url: "{SERVICE_URL}" # Automatically filled in
remote_serde: "naive"
```
Optimized Configuration Example
For high-throughput workloads, you might want to increase chunk size and local cache:
```yaml
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-70b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 100Gi
    lmCacheConfig: |
      local_cpu: true
      chunk_size: 256
      max_local_cpu_size: 5.0
      remote_url: "{SERVICE_URL}"
      remote_serde: "cachegen"
```
Key tuning considerations:
- `chunk_size`: Larger chunks (256) can improve cache hit rates for longer prompts but use more memory
- `max_local_cpu_size`: Increase for better local cache hit rates (monitor actual usage)
- `remote_serde`: Use `"cachegen"` for better compression and network efficiency
Complete Working Example
For a complete working example with test assertions, see the custom-lmcache-config test which demonstrates:
- Using the `{SERVICE_URL}` placeholder in a custom configuration
- Verification that the placeholder is correctly replaced with the actual Redis service URL
- Custom LMCache settings including `chunk_size` and `remote_serde`
Storage Sizing Guide
Choose storage size based on your model and expected usage:
Small Models (< 7B parameters)
Use cases: Chat, simple completion, low-to-medium traffic
Medium Models (7B - 70B parameters)
Use cases: Production chat, RAG applications, moderate traffic
Large Models (> 70B parameters)
Use cases: High-volume production, long contexts, batch processing
Calculating Size
For more precise sizing, estimate the per-request KV cache footprint for your model (it grows with model depth, hidden dimension, context length, and batch size), then multiply by an overhead and growth buffer that covers the number of requests you expect to keep cached.

Example for Llama 70B with 4K context and batch size 8: starting from a base estimate of about 3 GB and adding an overhead and growth buffer of 20x gives 3 GB × 20 = 60 GB recommended.
Monitoring and Management
Check KV Cache Status
List the KV caches in the namespace to see their status at a glance:
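A minimal sketch, assuming the CRD is exposed under the `aimkvcache` resource name:

```bash
kubectl get aimkvcache -n ml-team
```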
For more details including endpoint information:
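For example (assuming wide output exposes the endpoint printer column):

```bash
kubectl get aimkvcache -n ml-team -o wide
```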
Output:
```
NAME            TYPE    STATUS   READY   ENDPOINT                         AGE
kvcache-llama   redis   Ready    1       redis://kvcache-llama-svc:6379   5m
shared-cache    redis   Ready    1       redis://shared-cache-svc:6379    10m
```
View Detailed Status
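A describe call surfaces the full status; a sketch, assuming the `aimkvcache` resource name and the cache from the earlier examples:

```bash
kubectl describe aimkvcache kvcache-llama -n ml-team
```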
This shows additional information including:

- Ready replicas (e.g., "1/1")
- Storage size allocated
- Connection endpoint
- Recent conditions and events
- Last error (if any)
Check Storage Usage
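One way to check usage, assuming the Redis image ships `redis-cli` and using the backend pod name shown in the logs example below:

```bash
# PVCs provisioned for the cache backend
kubectl get pvc -n ml-team

# Redis memory consumption inside the backend pod
kubectl exec -n ml-team kvcache-llama-statefulset-0 -- redis-cli INFO memory
```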
View Backend Logs
```bash
# Get the StatefulSet name from the AIMKVCache status
kubectl logs -n ml-team kvcache-llama-statefulset-0
```
Troubleshooting
KV Cache Stuck in Progressing
Check if the StatefulSet pod is running:
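A sketch, filtering on the pod naming used elsewhere in this guide:

```bash
kubectl get pods -n ml-team | grep kvcache-llama
```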
Check for storage issues:
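For example:

```bash
# Look for Pending PVCs or provisioning errors
kubectl get pvc -n ml-team
kubectl get events -n ml-team --field-selector involvedObject.kind=PersistentVolumeClaim
```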
Common causes:

- No default storage class configured
- Insufficient storage quota
- Storage class not available
KV Cache Status Failed
View the conditions for error details:
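A jsonpath sketch, assuming the `aimkvcache` resource name:

```bash
kubectl get aimkvcache kvcache-llama -n ml-team -o jsonpath='{.status.conditions}'
```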
Check StatefulSet events:
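For example, assuming the StatefulSet follows the naming shown in the logs example above:

```bash
kubectl describe statefulset kvcache-llama-statefulset -n ml-team
```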
Service Can't Connect to KV Cache
Verify the service endpoint:
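A sketch; the Service name follows the endpoint reported in the cache status (e.g., kvcache-llama-chat-redis-svc), and the throwaway pod is only a connectivity test:

```bash
# Confirm the Redis Service exists
kubectl get svc -n ml-team | grep redis

# Optional: test connectivity from inside the cluster
kubectl run redis-ping --rm -it --restart=Never --image=redis:7 -n ml-team \
  -- redis-cli -h kvcache-llama-chat-redis-svc ping
```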
Check AIMService status for KVCache readiness:
```bash
kubectl get aimservice llama-chat -n ml-team -o jsonpath='{.status.conditions[?(@.type=="KVCacheReady")]}'
```
Best Practices
1. Use Shared Caches for Related Services
Share KV cache across services that use the same model or have overlapping prompt patterns:
```yaml
# Good: Multiple chat services using the same model share a cache
kvCache:
  name: llama-8b-shared-cache
```
2. Size Conservatively, Then Scale
Start with recommended sizes and monitor actual usage:
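A sketch of a conservative starting point (the size is illustrative, not a Kaiwo recommendation):

```yaml
kvCache:
  type: redis
  storage:
    size: 20Gi # start small; expand later if the storage class supports volume expansion
```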
Then scale up if needed based on monitoring.
3. Use Fast Storage Classes
KV cache performance depends on storage speed:
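For example, reusing the fast-ssd class from the earlier examples (the class name is cluster-specific):

```yaml
kvCache:
  type: redis
  storage:
    size: 50Gi
    storageClassName: fast-ssd # NVMe- or SSD-backed storage class
```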
4. Monitor Storage Capacity
Set up alerts before storage fills up:
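One possibility is a Prometheus alert on kubelet volume metrics; this is a sketch assuming a Prometheus stack is installed, not something Kaiwo provides:

```yaml
# PrometheusRule fragment: fire when the cache PVC is more than 80% full
- alert: KVCacheStorageNearlyFull
  expr: |
    kubelet_volume_stats_used_bytes{namespace="ml-team"}
      / kubelet_volume_stats_capacity_bytes{namespace="ml-team"} > 0.8
  for: 15m
```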
5. Plan for Failover
Create multiple KV caches for critical workloads:
```yaml
# Primary service
kvCache:
  name: primary-cache
```

```yaml
# Failover service (optional)
kvCache:
  name: secondary-cache
```
Default Behavior
When configuration is omitted, the following defaults apply:
| Field | Default Value | Notes |
|---|---|---|
| `kvCacheType` | `redis` | Currently only Redis is supported |
| `storage.size` | `1Gi` | Minimum recommended for Redis |
| `storage.storageClassName` | `nil` | Uses the cluster default storage class |
| `storage.accessModes` | `[ReadWriteOnce]` | Standard for single-node access |
See Also
- KV Cache Concepts - Architecture and design
- Deploying Inference Services - AIMService configuration
- Models - Model configuration and optimization