
KV Cache

KV Cache (Key-Value Cache) is a performance optimization technique for Large Language Model (LLM) inference that significantly improves throughput and reduces latency by caching intermediate computation results.

Overview

During LLM inference, the model processes input tokens and generates attention key-value pairs. These pairs can be cached and reused across requests that share a common prompt prefix, eliminating redundant computation and dramatically improving performance for use cases such as:

  • Chat applications with system prompts
  • RAG (Retrieval Augmented Generation) with shared context
  • Code completion with common boilerplate
  • Batch processing with template prefixes

Architecture

The KV cache implementation in Kaiwo consists of the following components:

┌─────────────────┐
│   AIMService    │  References or creates
│                 │ ────────────────────┐
└─────────────────┘                     │
                              ┌───────────────────┐
                              │  AIMKVCache       │
                              │  (Custom Resource)│
                              └─────────┬─────────┘
                                        │ Creates & manages
                              ┌───────────────────┐
                              │   StatefulSet     │
                              │   + Service       │
                              │                   │
                              │   Redis           │
                              │   Backend         │
                              └───────────────────┘

Components

AIMService

  • Specifies KV cache requirements via spec.kvCache
  • Can create a new KV cache or reference an existing one
  • Receives KV cache endpoint configuration automatically

AIMKVCache (Custom Resource)

  • Manages the lifecycle of a KV cache backend
  • Creates and maintains a StatefulSet with persistent storage
  • Provides a stable Service endpoint for cache access
  • Supports Redis backends

Backend (StatefulSet)

  • Runs the actual KV cache storage (Redis)
  • Uses persistent volumes for durability
  • Provides a network endpoint for cache operations
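
The exact CRD schema is not shown in this document; the following AIMKVCache sketch is illustrative only, and every field name except the Redis backend type and the storage concepts discussed below is an assumption:

    apiVersion: kaiwo.example.com/v1alpha1  # placeholder; use the API group from your installation
    kind: AIMKVCache
    metadata:
      name: shared-kvcache
      namespace: inference
    spec:
      backend: redis                        # Redis is the currently supported backend
      storageSize: 50Gi                     # assumed field name; allocated persistent storage
      storageClassName: fast-ssd            # assumed field name; omit to use the cluster default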

Lifecycle Management

Creation Patterns

Pattern 1: AIMService Creates KV Cache

When an AIMService specifies kvCache.type without a name, a new AIMKVCache resource is automatically created with the name kvcache-{namespace}.
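
A minimal sketch of this pattern (the API group/version and surrounding fields are assumptions; only spec.kvCache.type comes from the text above):

    apiVersion: kaiwo.example.com/v1alpha1  # placeholder; assumed
    kind: AIMService
    metadata:
      name: chat-service
      namespace: inference
    spec:
      kvCache:
        type: redis                         # no name given, so an AIMKVCache called
                                            # kvcache-inference is created automatically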

Pattern 2: Shared KV Cache

Multiple AIMService resources can reference the same AIMKVCache by specifying kvCache.name. This enables cache sharing across multiple inference endpoints.
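
A corresponding sketch for the shared pattern (same assumptions as above):

    apiVersion: kaiwo.example.com/v1alpha1  # placeholder; assumed
    kind: AIMService
    metadata:
      name: rag-service
      namespace: inference
    spec:
      kvCache:
        name: shared-kvcache                # references the existing AIMKVCache; the
                                            # cache's lifecycle stays independent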

Ownership

  • When an AIMService creates a KV cache (Pattern 1), it owns the cache resource (see the sketch after this list)
  • The KV cache's lifecycle is tied to the owning service
  • When referencing an existing cache (Pattern 2), the cache is independent and can outlive the service
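
In Kubernetes terms, Pattern 1 ownership would normally be expressed through an ownerReferences entry on the created cache; assuming Kaiwo follows this standard mechanism, the auto-created resource would carry metadata along these lines:

    # Illustrative only; assumes the standard Kubernetes ownerReferences mechanism
    metadata:
      name: kvcache-inference               # auto-generated name: kvcache-{namespace}
      namespace: inference
      ownerReferences:
        - apiVersion: kaiwo.example.com/v1alpha1  # placeholder; assumed
          kind: AIMService
          name: chat-service
          controller: true                  # garbage collection removes the cache with its owner
                                            # (the uid field, required in practice, is omitted here)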

States

An AIMKVCache progresses through the following states:

  1. Pending - Resource created, StatefulSet creation queued
  2. Progressing - StatefulSet and Service being deployed, waiting for pods to be ready
  3. Ready - Backend is running and available for use
  4. Failed - Deployment encountered an error (check conditions for details)

Status Information

The AIMKVCache status provides comprehensive information about the cache state (a combined example follows the field lists):

Basic Information:

  • status - Current state (Pending, Progressing, Ready, Failed)
  • statefulSetName - Name of the managed StatefulSet
  • serviceName - Name of the Kubernetes Service providing network access

Operational Metrics:

  • endpoint - Connection string for accessing the cache (e.g., redis://service-name:6379)
  • replicas - Total number of replicas configured
  • readyReplicas - Number of replicas currently ready and serving
  • storageSize - Allocated storage capacity (e.g., 50Gi)

Error Tracking:

  • lastError - Most recent error message (cleared when resolved)
  • conditions - Detailed condition history for troubleshooting
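
Putting these fields together, the status of a healthy cache might look like this sketch (all values illustrative; the condition shape assumes standard Kubernetes conditions):

    status:
      status: Ready
      statefulSetName: shared-kvcache
      serviceName: shared-kvcache
      endpoint: redis://shared-kvcache.inference.svc:6379
      replicas: 1
      readyReplicas: 1
      storageSize: 50Gi
      conditions:
        - type: Ready
          status: "True"
          reason: BackendAvailable          # illustrative reason string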

Storage Considerations

Sizing

The storage size for a KV cache depends on several factors (a rough worked example follows the list):

  • Model size: Larger models have bigger key-value tensors
  • Context length: Longer contexts require more cache storage
  • Batch size: Higher batch sizes increase cache requirements
  • Expected request volume: More concurrent requests need more cache space
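
As a rough, model-dependent illustration (not a Kaiwo-specific formula): for a model without grouped-query attention, each cached token needs about 2 × num_layers × hidden_size × bytes_per_value of space for its keys and values. For a hypothetical 32-layer model with hidden size 4096 in fp16 (2 bytes), that is 2 × 32 × 4096 × 2 bytes = 512 KiB per token, so a single 4096-token context occupies about 2 GiB before batching or concurrency is considered.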

Monitor actual usage and adjust accordingly.

Storage Classes

The KV cache uses Kubernetes PersistentVolumeClaims for durable storage. If no storageClassName is specified, the cluster's default storage class is used.
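
To pin the cache to a specific class, storageClassName would be set on the cache resource; assuming the field placement from the sketch in Components:

    spec:
      storageSize: 50Gi                     # assumed field name
      storageClassName: fast-ssd            # omit to fall back to the cluster default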

Recommendations:

  • Use SSD-backed storage for better performance
  • Ensure the storage class supports ReadWriteOnce access mode
  • Verify sufficient storage quota in your namespace

Backend Types

Redis

Redis is the default backend and the only one currently supported. It provides:

  • High-performance in-memory caching with disk persistence
  • Mature, battle-tested reliability
  • Straightforward configuration

Best Practices

  1. Size appropriately: Start with recommended sizes and monitor actual usage
  2. Share when possible: Use shared caches for services with overlapping use cases
  3. Monitor storage: Set up alerts for storage capacity
  4. Use fast storage: SSD-backed storage classes provide best performance
  5. Plan for growth: KV cache storage needs grow with traffic volume

See Also