KV Cache
KV Cache (Key-Value Cache) is a performance optimization technique for Large Language Model (LLM) inference. It improves throughput and reduces latency by caching the attention key-value results of earlier computation so they do not have to be recomputed.
Overview
During LLM inference, the model computes attention key-value pairs for every input token. Requests that share a common prompt prefix can reuse the cached pairs for that prefix instead of recomputing them, which helps most in use cases such as:
- Chat applications with system prompts
- RAG (Retrieval Augmented Generation) with shared context
- Code completion with common boilerplate
- Batch processing with template prefixes
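To make prefix reuse concrete, here is a minimal sketch of how cache keys can be derived so that requests with a shared prefix map to the same entries. It is an illustration only, not Kaiwo's actual keying scheme: the block size, the hashing approach, and the function names `prefix_block_keys` and `shared_blocks` are all assumptions.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (assumed; real systems use larger blocks)


def prefix_block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one key per full block; each key depends on every token up to that block."""
    keys = []
    hasher = hashlib.sha256()
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        # Feed the block into a running hash so the key covers the whole prefix.
        hasher.update(",".join(map(str, block)).encode() + b";")
        keys.append(hasher.copy().hexdigest())
    return keys


def shared_blocks(tokens_a, tokens_b):
    """Count the leading blocks whose keys match, i.e. the reusable cache entries."""
    count = 0
    for key_a, key_b in zip(prefix_block_keys(tokens_a), prefix_block_keys(tokens_b)):
        if key_a != key_b:
            break
        count += 1
    return count


# Two requests with the same 8-token system prompt and different user turns.
system_prompt = [101, 7, 7, 42, 9, 13, 55, 8]
request_a = system_prompt + [300, 301, 302, 303]
request_b = system_prompt + [400, 401, 402, 403]
print(shared_blocks(request_a, request_b))  # 2
```

The two blocks covering the system prompt produce identical keys, so the second request could reuse them and compute only its final block.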
Architecture
The KV cache implementation in Kaiwo consists of three components:
    ┌─────────────────┐
    │   AIMService    │  References or creates
    │                 │ ────────────────────┐
    └─────────────────┘                     │
                                            ▼
                                  ┌──────────────────┐
                                  │    AIMKVCache    │
                                  │ (Custom Resource)│
                                  └─────────┬────────┘
                                            │ Creates & manages
                                            ▼
                                  ┌──────────────────┐
                                  │   StatefulSet    │
                                  │    + Service     │
                                  │                  │
                                  │      Redis       │
                                  │     Backend      │
                                  └──────────────────┘
Components
AIMService
- Specifies KV cache requirements via spec.kvCache
- Can create a new KV cache or reference an existing one
- Receives KV cache endpoint configuration automatically
AIMKVCache (Custom Resource)
- Manages the lifecycle of a KV cache backend
- Creates and maintains a StatefulSet with persistent storage
- Provides a stable Service endpoint for cache access
- Supports Redis backends
Backend (StatefulSet)
- Runs the actual KV cache storage (Redis)
- Uses persistent volumes for durability
- Provides network endpoint for cache operations
Lifecycle Management
Creation Patterns
Pattern 1: AIMService Creates KV Cache
When an AIMService specifies kvCache.type without a name, a new AIMKVCache resource is automatically created with the name kvcache-{namespace}.
Pattern 2: Shared KV Cache
Multiple AIMService resources can reference the same AIMKVCache by specifying kvCache.name. This enables cache sharing across multiple inference endpoints.
Ownership
- When an AIMService creates a KV cache (Pattern 1), it owns the cache resource
- The KV cache's lifecycle is tied to the owning service
- When referencing an existing cache (Pattern 2), the cache is independent and can outlive the service
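The two creation patterns and the ownership rule can be sketched as a small resolution function. This is a hypothetical illustration, not the controller's code: `resolve_kv_cache` and `CacheRef` are invented names, the `kv_cache` dict stands in for `spec.kvCache`, and giving `name` precedence when both fields are set is an assumption.

```python
from typing import NamedTuple, Optional


class CacheRef(NamedTuple):
    name: str
    owned_by_service: bool  # True when the service created the cache (Pattern 1)


def resolve_kv_cache(kv_cache: Optional[dict], namespace: str) -> Optional[CacheRef]:
    """Decide which AIMKVCache a service uses, based on its kvCache settings."""
    if not kv_cache:
        return None  # no KV cache requested
    name = kv_cache.get("name")
    if name:
        # Pattern 2: reference an existing, independently managed cache.
        return CacheRef(name, owned_by_service=False)
    if kv_cache.get("type"):
        # Pattern 1: a cache named kvcache-{namespace} is created and owned.
        return CacheRef(f"kvcache-{namespace}", owned_by_service=True)
    return None


print(resolve_kv_cache({"type": "redis"}, "team-a"))
print(resolve_kv_cache({"name": "shared-cache"}, "team-a"))
```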
States
An AIMKVCache progresses through the following states:
- Pending - Resource created, StatefulSet creation queued
- Progressing - StatefulSet and Service being deployed, waiting for pods to be ready
- Ready - Backend is running and available for use
- Failed - Deployment encountered an error (check conditions for details)
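One way to read these states is as a function of what the controller can observe. The sketch below assumes the inputs shown (whether the StatefulSet exists, replica counts, and an error message) and the function name `derive_state`; the actual controller may use different signals or ordering.

```python
from typing import Optional


def derive_state(statefulset_exists: bool, replicas: int, ready_replicas: int,
                 last_error: Optional[str] = None) -> str:
    """Map observed conditions to one of the four AIMKVCache states."""
    if last_error:
        return "Failed"          # deployment hit an error
    if not statefulset_exists:
        return "Pending"         # resource created, StatefulSet not yet created
    if replicas > 0 and ready_replicas >= replicas:
        return "Ready"           # backend running and available
    return "Progressing"         # deployed, still waiting for pods


print(derive_state(False, 1, 0))                   # Pending
print(derive_state(True, 1, 0))                    # Progressing
print(derive_state(True, 1, 1))                    # Ready
print(derive_state(True, 1, 0, "PVC not bound"))   # Failed
```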
Status Information
The AIMKVCache status provides comprehensive information about the cache state:
Basic Information:
- status - Current state (Pending, Progressing, Ready, Failed)
- statefulSetName - Name of the managed StatefulSet
- serviceName - Name of the Kubernetes Service providing network access
Operational Metrics:
- endpoint - Connection string for accessing the cache (e.g., redis://service-name:6379)
- replicas - Total number of replicas configured
- readyReplicas - Number of replicas currently ready and serving
- storageSize - Allocated storage capacity (e.g., 50Gi)
Error Tracking:
- lastError - Most recent error message (cleared when resolved)
- conditions - Detailed condition history for troubleshooting
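A client might consume these fields as in the sketch below, which waits for the Ready state before extracting the host and port from `endpoint`. The status dict is a hand-written example using the field names listed above; the helper name `cache_address` and the sample values are hypothetical.

```python
from urllib.parse import urlparse


def cache_address(status: dict):
    """Return (host, port) when the cache is Ready, otherwise None."""
    if status.get("status") != "Ready":
        return None
    parsed = urlparse(status["endpoint"])
    if parsed.scheme != "redis" or parsed.hostname is None:
        raise ValueError(f"unexpected endpoint: {status['endpoint']!r}")
    # Fall back to the standard Redis port if the endpoint omits one.
    return parsed.hostname, parsed.port or 6379


example_status = {
    "status": "Ready",
    "endpoint": "redis://service-name:6379",
    "replicas": 1,
    "readyReplicas": 1,
    "storageSize": "50Gi",
}
print(cache_address(example_status))  # ('service-name', 6379)
```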
Storage Considerations
Sizing
The storage size for a KV cache depends on several factors:
- Model size: Larger models have bigger key-value tensors
- Context length: Longer contexts require more cache storage
- Batch size: Higher batch sizes increase cache requirements
- Expected request volume: More concurrent requests need more cache space
These factors are hard to predict up front, so monitor actual usage after deployment and adjust the storage size accordingly.
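For a rough starting point, the raw size of the key-value tensors can be estimated from the model architecture. The sketch below uses the standard per-token formula for transformer attention (one key and one value tensor per layer); the model dimensions are example values, and the estimate ignores backend overhead, serialization format, and eviction, so treat it as a lower-bound guide rather than a sizing rule.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Raw KV size for one token: a key and a value tensor in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def estimate_cache_bytes(bytes_per_token, context_length, cached_sequences):
    """Raw KV size for a number of cached sequences at full context length."""
    return bytes_per_token * context_length * cached_sequences


# Example values (assumed): 32 layers, 8 KV heads, head dimension 128, fp16.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
total = estimate_cache_bytes(per_token, context_length=8192, cached_sequences=64)

print(per_token // 1024, "KiB per token")  # 128 KiB per token
print(total / 2**30, "GiB total")          # 64.0 GiB total
```

Doubling the context length or the number of cached sequences doubles the estimate, which is why longer contexts and higher request volume drive storage needs.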
Storage Classes
The KV cache uses Kubernetes PersistentVolumeClaims for durable storage. If no storageClassName is specified, the cluster's default storage class is used.
Recommendations:
- Use SSD-backed storage for better performance
- Ensure the storage class supports ReadWriteOnce access mode
- Verify sufficient storage quota in your namespace
Backend Types
Redis
Redis is the default and currently the only supported backend. It provides:
- High-performance in-memory caching with disk persistence
- Mature, battle-tested reliability
- Straightforward configuration
Best Practices
- Size appropriately: Start with recommended sizes and monitor actual usage
- Share when possible: Use shared caches for services with overlapping use cases
- Monitor storage: Set up alerts for storage capacity
- Use fast storage: SSD-backed storage classes provide best performance
- Plan for growth: KV cache storage needs grow with traffic volume
See Also
- KV Cache Usage Guide - Practical examples and configuration
- Deploying Inference Services - AIMService configuration
- Runtime Configuration - Additional service configuration