Using KV Cache

This guide shows how to configure and use KV cache with your inference services.

Quick Start

The simplest way to enable KV cache is to let the AIMService create one automatically:

apiVersion: aim.silogen.ai/v1alpha1
kind: AIMService
metadata:
  name: llama-chat
  namespace: ml-team
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    type: redis  # Automatically creates 'kvcache-llama-chat'

This creates an AIMKVCache resource with default settings (1Gi storage, default storage class, Redis backend).
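
The generated resource is roughly equivalent to the following standalone manifest (a sketch for illustration; the exact defaults are listed under Default Behavior below):

apiVersion: aim.silogen.ai/v1alpha1
kind: AIMKVCache
metadata:
  name: kvcache-llama-chat  # Derived from the AIMService name
  namespace: ml-team
spec:
  kvCacheType: redis
  storage:
    size: 1Gi  # Default size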

Configuration Options

Custom Storage Size

Specify storage size based on your model and workload:

spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-70b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 100Gi  # Larger model needs more cache storage

Custom Storage Class

Use a specific storage class for better performance:

spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 50Gi
      storageClassName: fast-ssd  # Use your high-performance storage class

Custom Access Modes

Specify persistent volume access modes (defaults to ReadWriteOnce):

spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 50Gi
      accessModes:
        - ReadWriteOnce

Sharing KV Cache

Multiple services can share a single KV cache for better resource utilization.

Step 1: Create a Standalone KV Cache

apiVersion: aim.silogen.ai/v1alpha1
kind: AIMKVCache
metadata:
  name: shared-cache
  namespace: ml-team
spec:
  kvCacheType: redis
  storage:
    size: 200Gi
    storageClassName: fast-ssd

Step 2: Reference from Multiple Services

---
apiVersion: aim.silogen.ai/v1alpha1
kind: AIMService
metadata:
  name: llama-chat
  namespace: ml-team
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    name: shared-cache  # References existing cache
---
apiVersion: aim.silogen.ai/v1alpha1
kind: AIMService
metadata:
  name: llama-completion
  namespace: ml-team
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    name: shared-cache  # Same cache, shared across services
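
To verify the setup, confirm that the shared cache is Ready and that both services appear in the namespace (the same commands used under Monitoring and Troubleshooting below):

kubectl get aimkvcache shared-cache -n ml-team
kubectl get aimservice -n ml-team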

Advanced Configuration

Custom LMCache Configuration

By default, Kaiwo generates a standard LMCache configuration file for your service. For advanced use cases, you can provide a custom LMCache configuration by specifying the lmCacheConfig field. This allows you to fine-tune caching behavior, serialization methods, and other LMCache-specific settings.

Using the {SERVICE_URL} Placeholder

When specifying a custom configuration, you should use the {SERVICE_URL} placeholder for the remote_url field instead of hardcoding the cache endpoint. Kaiwo will automatically replace this placeholder with the actual KV cache service URL at runtime.

Example with custom configuration:

apiVersion: aim.silogen.ai/v1alpha1
kind: AIMService
metadata:
  name: llama-chat
  namespace: ml-team
spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-8b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 50Gi
    lmCacheConfig: |
      local_cpu: true
      chunk_size: 256
      max_local_cpu_size: 2.0
      remote_url: "{SERVICE_URL}"
      remote_serde: "cachegen"

The {SERVICE_URL} placeholder will be automatically replaced with the actual Redis service URL (e.g., redis://kvcache-llama-chat-redis-svc:6379).

Configuration Options

Common LMCache configuration options include:

Field                Type     Description                                            Default
local_cpu            boolean  Enable local CPU RAM cache for fast access             true
chunk_size           integer  Size of cache chunks in tokens                         50
max_local_cpu_size   float    Maximum size (GB) for local CPU cache                  1.0
remote_url           string   URL of the remote cache backend (use {SERVICE_URL})    -
remote_serde         string   Serialization method: "naive" or "cachegen"            "naive"
pipelined_backend    boolean  Enable pipelined backend for better performance        false
save_decode_cache    boolean  Whether to cache decode phase KV pairs                 false
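
For example, a configuration that also enables the pipelined backend and decode-phase caching (a sketch combining fields from the table above; tune the values for your workload) could be passed via lmCacheConfig:

local_cpu: true
chunk_size: 256
max_local_cpu_size: 2.0
remote_url: "{SERVICE_URL}"
remote_serde: "cachegen"
pipelined_backend: true
save_decode_cache: true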

Default Configuration

When lmCacheConfig is not specified, the following default configuration is used:

local_cpu: true
chunk_size: 50
max_local_cpu_size: 1.0
remote_url: "{SERVICE_URL}"  # Automatically filled in
remote_serde: "naive"

Optimized Configuration Example

For high-throughput workloads, you might want to increase the chunk size and the local cache size:

spec:
  model:
    image: ghcr.io/silogen/aim-meta-llama-llama-3-1-70b-instruct:0.7.0
  kvCache:
    type: redis
    storage:
      size: 100Gi
    lmCacheConfig: |
      local_cpu: true
      chunk_size: 256
      max_local_cpu_size: 5.0
      remote_url: "{SERVICE_URL}"
      remote_serde: "cachegen"

Key tuning considerations:

  • chunk_size: Larger chunks (256) can improve cache hit rates for longer prompts but use more memory
  • max_local_cpu_size: Increase for better local cache hit rates (monitor actual usage)
  • remote_serde: Use "cachegen" for better compression and network efficiency

Complete Working Example

For a complete working example with test assertions, see the custom-lmcache-config test, which demonstrates:

  • Using the {SERVICE_URL} placeholder in a custom configuration
  • Verification that the placeholder is correctly replaced with the actual Redis service URL
  • Custom LMCache settings, including chunk_size and remote_serde

Storage Sizing Guide

Choose storage size based on your model and expected usage:

Small Models (< 7B parameters)

kvCache:
  type: redis
  storage:
    size: 10Gi  # Sufficient for most workloads

Use cases: Chat, simple completion, low-to-medium traffic

Medium Models (7B - 70B parameters)

kvCache:
  type: redis
  storage:
    size: 50Gi  # Balanced for typical usage

Use cases: Production chat, RAG applications, moderate traffic

Large Models (> 70B parameters)

kvCache:
  type: redis
  storage:
    size: 100Gi  # Start here and scale up as needed

Use cases: High-volume production, long contexts, batch processing

Calculating Size

For more precise sizing, use this formula:

Storage (MB) ≈ Model_Size (B) × Context_Length (tokens) × Batch_Size × 0.001

Example for Llama 70B with 4K context and batch size 8:

70 × 4000 × 8 × 0.001 = 2,240 MB ≈ 3 GB (minimum)

Add overhead and a growth buffer (roughly 20×): 3 GB × 20 = 60 GB recommended
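
The same arithmetic can be scripted as a quick estimator (a hypothetical helper, not part of the operator; the 20× buffer mirrors the example above and should be adjusted to your own growth expectations):

# estimate-kvcache.sh <model_size_in_B_params> <context_tokens> <batch_size>
MODEL_B=${1:-70}; CTX=${2:-4000}; BATCH=${3:-8}
BASE_MB=$(awk "BEGIN {print $MODEL_B * $CTX * $BATCH * 0.001}")
BASE_GB=$(awk "BEGIN {print int(($BASE_MB + 999) / 1000)}")
echo "Base estimate: ${BASE_MB} MB (~${BASE_GB} GB); with ~20x growth buffer: $((BASE_GB * 20)) GB"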

Monitoring and Management

Check KV Cache Status

kubectl get aimkvcache -n ml-team

Output:

NAME              TYPE    STATUS   READY   AGE
kvcache-llama     redis   Ready    1       5m
shared-cache      redis   Ready    1       10m

For more details including endpoint information:

kubectl get aimkvcache -n ml-team -o wide

Output:

NAME              TYPE    STATUS   READY   ENDPOINT                         AGE
kvcache-llama     redis   Ready    1       redis://kvcache-llama-svc:6379   5m
shared-cache      redis   Ready    1       redis://shared-cache-svc:6379    10m

View Detailed Status

kubectl describe aimkvcache kvcache-llama -n ml-team

This shows additional information including:

  • Ready replicas (e.g., "1/1")
  • Storage size allocated
  • Connection endpoint
  • Recent conditions and events
  • Last error (if any)

Check Storage Usage

kubectl get pvc -n ml-team -l app.kubernetes.io/managed-by=aimkvcache-controller

View Backend Logs

# Get the StatefulSet name from the AIMKVCache status
kubectl logs -n ml-team kvcache-llama-statefulset-0
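
If you don't know the StatefulSet name, list the StatefulSets in the namespace first:

kubectl get statefulset -n ml-team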

Troubleshooting

KV Cache Stuck in Progressing

Check if the StatefulSet pod is running:

kubectl get pods -n ml-team -l app.kubernetes.io/name=aimkvcache

Check for storage issues:

kubectl describe pvc -n ml-team

Common causes:

  • No default storage class configured
  • Insufficient storage quota
  • Storage class not available
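
To check whether the cluster has a default storage class (marked "(default)" in the output) and whether the requested class exists:

kubectl get storageclass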

KV Cache Status Failed

View the conditions for error details:

kubectl get aimkvcache kvcache-llama -n ml-team -o jsonpath='{.status.conditions}'

Check StatefulSet events:

kubectl describe statefulset -n ml-team kvcache-llama-statefulset

Service Can't Connect to KV Cache

Verify the service endpoint:

kubectl get svc -n ml-team -l app.kubernetes.io/name=aimkvcache

Check AIMService status for KVCache readiness:

kubectl get aimservice llama-chat -n ml-team -o jsonpath='{.status.conditions[?(@.type=="KVCacheReady")]}'
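
To test connectivity to the cache backend directly, you can run redis-cli from a temporary pod (a sketch; substitute the host with the endpoint shown by kubectl get aimkvcache -o wide, and expect a PONG reply on success):

kubectl run redis-check -n ml-team --rm -it --restart=Never --image=redis:7 -- \
  redis-cli -h kvcache-llama-chat-redis-svc -p 6379 ping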

Best Practices

1. Share Caches Across Services

Share a KV cache across services that use the same model or have overlapping prompt patterns:

# Good: Multiple chat services using the same model share a cache
kvCache:
  name: llama-8b-shared-cache

2. Size Conservatively, Then Scale

Start with recommended sizes and monitor actual usage:

# Start with 50Gi for a 70B model
storage:
  size: 50Gi

Then scale up if needed based on monitoring.

3. Use Fast Storage Classes

KV cache performance depends on storage speed:

storage:
  size: 100Gi
  storageClassName: premium-ssd  # Use SSD-backed storage

4. Monitor Storage Capacity

Set up alerts before storage fills up:

# Check PVC usage regularly
kubectl get pvc -n ml-team
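
Note that kubectl get pvc reports the provisioned capacity, not how much of it is in use. To see actual usage, check disk usage inside the backend pod (pod name as under View Backend Logs above):

kubectl exec -n ml-team kvcache-llama-statefulset-0 -- df -h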

5. Plan for Failover

Create multiple KV caches for critical workloads:

# Primary service
kvCache:
  name: primary-cache

# Failover service (optional)
kvCache:
  name: secondary-cache

Default Behavior

When configuration is omitted, the following defaults apply:

Field                      Default Value      Notes
kvCacheType                redis              Currently only Redis is supported
storage.size               1Gi                Minimum recommended for Redis
storage.storageClassName   nil                Uses cluster default storage class
storage.accessModes        [ReadWriteOnce]    Standard for single-node access

See Also