ADR 0041: Multi-Level Caching Strategy¶
Status: Accepted
Date: 2025-11-24
Deciders: Development Team
Related ADRs: ADR 0016, ADR 0018, ADR 0021, ADR 0039
Context¶
The concept-rag system performs computationally expensive operations that benefit from caching:
Performance Bottlenecks¶
- Embedding Generation: Simple hash-based embeddings take ~1-5ms per text
  - Repeated searches generate the same embeddings multiple times
  - User queries often contain similar or identical terms
  - Concept names are embedded repeatedly during hybrid search
- Vector Search: LanceDB vector search is fast but not instantaneous
  - Typical search: 50-200ms depending on dataset size
  - Repeated identical queries waste resources
  - Search results are stable for short time periods
- Hybrid Search: Combines multiple operations
  - Embedding generation (1-5ms)
  - Vector search (50-200ms)
  - BM25 keyword search (10-50ms)
  - Result merging and ranking (5-10ms)
  - Total: 66-265ms per search
Real-World Usage Patterns¶
Query Repetition:
- Users refine searches iteratively
- Common concepts searched frequently
- Dashboard/UI polls for same data
- Testing/debugging repeats identical queries

Embedding Reuse:
- Same query text embedded multiple times
- Common domain terms (e.g., "microservices", "ddd") used repeatedly
- Concept names embedded during every search involving them

Cache Hit Rate Potential:
- Estimated 60-70% cache hit rate for searches
- Estimated 70-80% cache hit rate for embeddings
- Significant performance improvement possible
Existing State¶
No Caching ❌:
- Every search executes the full pipeline
- Every embedding computed from scratch
- No memory of previous operations
- Repeated work wastes resources

Memory Caches Exist ✅:
- ConceptIdCache: concept name → ID mapping (653 entries)
- CategoryIdCache: category lookups (~50 entries)
- Both loaded at startup, never expire

Need Multi-Level Strategy:
- Search result caching with TTL
- Embedding caching (no TTL, embeddings immutable)
- Bounded memory usage (LRU eviction)
- Configurable cache sizes and TTLs
Decision¶
Implement a multi-level caching strategy with LRU (Least Recently Used) eviction and TTL (Time-to-Live) support:
1. Core LRU Cache Infrastructure¶
Generic LRU Cache: Foundation for all caching
interface CacheMetrics {
  hits: number;
  misses: number;
  evictions: number;
  size: number;
  hitRate: number;
}

class LRUCache<K, V> {
  constructor(options: {
    maxSize: number;      // Maximum entries
    defaultTTL?: number;  // Default TTL in milliseconds
    onEvict?: (key: K, value: V) => void;
  });

  get(key: K): V | undefined;
  set(key: K, value: V, ttl?: number): void;
  has(key: K): boolean;
  delete(key: K): boolean;
  clear(): void;

  // Metrics
  getMetrics(): CacheMetrics;
  resetMetrics(): void;

  // Inspection
  keys(): K[];
  size(): number;
}
Features:
- O(1) get/set operations using JavaScript Map
- Automatic LRU eviction when maxSize is reached
- Per-entry TTL support with lazy expiration checking
- Built-in metrics (hits, misses, evictions, hit rate)
- Optional eviction callback for cleanup
Implementation Details:
interface CacheEntry<V> {
  value: V;
  expiresAt?: number; // absolute expiry timestamp (ms); undefined = no TTL
}

class LRUCache<K, V> {
  // Fields initialized in the constructor (omitted in this excerpt)
  private cache: Map<K, CacheEntry<V>>;
  private maxSize: number;
  private defaultTTL?: number;
  private metrics: CacheMetrics;

  get(key: K): V | undefined {
    const entry = this.cache.get(key);
    if (!entry) {
      this.metrics.misses++;
      return undefined;
    }
    // Check TTL expiration (lazy: only checked on access)
    if (entry.expiresAt && Date.now() > entry.expiresAt) {
      this.cache.delete(key);
      this.metrics.misses++;
      return undefined;
    }
    // Move to end (most recently used)
    this.cache.delete(key);
    this.cache.set(key, entry);
    this.metrics.hits++;
    return entry.value;
  }

  set(key: K, value: V, ttl?: number): void {
    // Evict the oldest (least recently used) entry when inserting a new key at capacity
    if (this.cache.size >= this.maxSize && !this.cache.has(key)) {
      const firstKey = this.cache.keys().next().value;
      if (firstKey !== undefined) {
        this.cache.delete(firstKey);
        this.metrics.evictions++;
      }
    }
    // Re-inserting an existing key refreshes its recency position
    if (this.cache.has(key)) {
      this.cache.delete(key);
    }
    // Fall back to the default TTL when no per-entry TTL is given
    const effectiveTTL = ttl ?? this.defaultTTL;
    const expiresAt = effectiveTTL ? Date.now() + effectiveTTL : undefined;
    this.cache.set(key, { value, expiresAt });
  }
}
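For orientation, a minimal usage sketch of the cache above (the keys, values, and the 200ms TTL are illustrative only):

const cache = new LRUCache<string, number>({ maxSize: 3 });

cache.set('a', 1);
cache.set('b', 2);
cache.set('c', 3, 200); // per-entry TTL of 200ms

cache.get('a');    // touching 'a' makes it the most recently used entry
cache.set('d', 4); // at capacity: the least recently used key ('b') is evicted

console.log(cache.has('b'));               // false (evicted)
console.log(cache.getMetrics().evictions); // 1

setTimeout(() => {
  console.log(cache.get('c')); // undefined once the 200ms TTL has elapsed
}, 250);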
2. Search Result Cache¶
Purpose: Cache search results with TTL to avoid redundant searches
interface SearchResultCacheOptions {
maxSize?: number; // Default: 1000
defaultTTL?: number; // Default: 5 minutes (300000ms)
}
class SearchResultCache {
constructor(options?: SearchResultCacheOptions);
// Cache search results
get<T>(query: string, options?: Record<string, unknown>): T | undefined;
set<T>(query: string, value: T, options?: Record<string, unknown>): void;
// Invalidation
invalidate(query: string): void;
invalidatePattern(pattern: RegExp): number;
clear(): void;
// Metrics
getMetrics(): CacheMetrics;
}
Cache Key Generation: Deterministic SHA-256 hash (using createHash from node:crypto)
private generateCacheKey(
  query: string,
  options?: Record<string, unknown>
): string {
  // Normalize options (sort keys for consistency)
  const normalized = options
    ? Object.keys(options)
        .sort()
        .reduce((acc, key) => ({ ...acc, [key]: options[key] }), {})
    : {};
  const input = `${query}:${JSON.stringify(normalized)}`;
  return createHash('sha256').update(input).digest('hex');
}
Key Features:
- Query + options included in the cache key
- Option order normalized for consistency (illustrated below)
- Default 5-minute TTL (configurable)
- Query invalidation support
- Pattern-based invalidation (regex)
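As a small illustration of the order normalization, the same query with identical options supplied in a different key order resolves to the same entry (the option names and stored value here are placeholders):

const cache = new SearchResultCache();
const results = [{ id: 'doc-01', score: 0.92 }]; // placeholder search results

cache.set('microservices', results, { limit: 10, collection: 'catalog' });

// Same query and options, keys in a different order: same cache key, so this is a hit
const hit = cache.get('microservices', { collection: 'catalog', limit: 10 });
console.log(hit !== undefined); // true (until the entry's TTL expires)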
Example Usage:
const cache = new SearchResultCache({ maxSize: 1000, defaultTTL: 300000 });

// Check cache first
const cached = cache.get('microservices', { limit: 10 });
if (cached) {
  return cached;
}

// Execute search
const results = await hybridSearch.search('microservices', { limit: 10 });

// Store in cache
cache.set('microservices', results, { limit: 10 });
3. Embedding Cache¶
Purpose: Cache embedding vectors (no TTL, embeddings are immutable)
interface EmbeddingCacheOptions {
  maxSize?: number; // Default: 10000
}

class EmbeddingCache {
  constructor(options?: EmbeddingCacheOptions);

  // Cache embeddings
  get(text: string, model: string): number[] | undefined;
  set(text: string, model: string, embedding: number[]): void;

  // Management
  clear(): void;

  // Metrics
  getMetrics(): CacheMetrics;
  estimateMemoryUsage(): number; // Bytes
}
Cache Key Generation: Model + text hash
private generateCacheKey(text: string, model: string): string {
  // Hash long texts for efficient cache keys
  const textHash = createHash('sha256').update(text).digest('hex');
  return `${model}:${textHash}`;
}
Key Features:
- Model-specific caching (different models = different embeddings)
- No TTL (embeddings are deterministic and immutable)
- Text hashing for efficient cache keys
- Memory usage estimation (~3KB per embedding)
- Large capacity (10K embeddings by default)
Memory Usage Estimation:
estimateMemoryUsage(): number {
  const avgDimension = 384; // Typical embedding dimension
  const bytesPerFloat = 8;  // Float64
  const avgEmbeddingSize = avgDimension * bytesPerFloat; // ~3KB
  return this.cache.size() * avgEmbeddingSize;
}
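A short usage sketch of the embedding cache (the model name matches the one used by SimpleEmbeddingService below; the vector values are placeholders):

const embeddingCache = new EmbeddingCache({ maxSize: 10000 });
const model = 'simple-hash-v1';
const text = 'microservices';

let vector = embeddingCache.get(text, model);
if (!vector) {
  vector = [0.12, -0.05, 0.33]; // placeholder: normally produced by the embedding service
  embeddingCache.set(text, model, vector);
}

console.log(embeddingCache.getMetrics().hitRate);
console.log(`${(embeddingCache.estimateMemoryUsage() / 1024).toFixed(1)} KB cached`);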
4. Service Integration¶
SimpleEmbeddingService with EmbeddingCache:
class SimpleEmbeddingService implements EmbeddingService {
  constructor(
    private readonly embeddingCache?: EmbeddingCache
  ) {}

  async generateEmbedding(text: string): Promise<number[]> {
    const model = 'simple-hash-v1';

    // Check cache
    if (this.embeddingCache) {
      const cached = this.embeddingCache.get(text, model);
      if (cached) {
        return cached;
      }
    }

    // Generate embedding
    const embedding = this.computeEmbedding(text);

    // Store in cache
    if (this.embeddingCache) {
      this.embeddingCache.set(text, model, embedding);
    }

    return embedding;
  }
}
ConceptualHybridSearchService with SearchResultCache:
class ConceptualHybridSearchService {
  constructor(
    private readonly searchCache?: SearchResultCache,
    // ... other dependencies
  ) {}

  async search(
    query: string,
    options: SearchOptions,
    debug?: boolean
  ): Promise<SearchResult[]> {
    // Disable caching in debug mode
    if (debug) {
      return this.executeSearch(query, options);
    }

    // Cache key includes collection name
    const cacheKey = `${options.collection}:${query}`;

    // Check cache
    if (this.searchCache) {
      const cached = this.searchCache.get<SearchResult[]>(cacheKey, options);
      if (cached) {
        return cached;
      }
    }

    // Execute search
    const results = await this.executeSearch(query, options);

    // Store in cache
    if (this.searchCache) {
      this.searchCache.set(cacheKey, results, options);
    }

    return results;
  }
}
5. Container Configuration¶
Application Container initializes caches:
class ApplicationContainer {
  private embeddingCache?: EmbeddingCache;
  private searchCache?: SearchResultCache;

  async initialize(databaseUrl: string): Promise<void> {
    // Initialize caches
    this.embeddingCache = new EmbeddingCache({
      maxSize: 10000 // 10K embeddings (~30MB max)
    });
    this.searchCache = new SearchResultCache({
      maxSize: 1000,     // 1K searches
      defaultTTL: 300000 // 5 minutes
    });

    // Inject into services
    this.embeddingService = new SimpleEmbeddingService(
      this.embeddingCache
    );
    this.hybridSearchService = new ConceptualHybridSearchService(
      this.searchCache,
      // ... other dependencies
    );
  }

  async shutdown(): Promise<void> {
    // Clear caches
    this.embeddingCache?.clear();
    this.searchCache?.clear();
    // ... other cleanup
  }
}
6. Cache Configuration¶
Default Sizes:

| Cache | Max Size | TTL | Max Memory |
|-------|----------|-----|------------|
| Embedding | 10,000 | None | ~30MB |
| Search Result | 1,000 | 5 min | ~100MB |
| Total | - | - | ~130MB |
Memory Bounds:
- LRU eviction ensures bounded memory
- Configurable max sizes
- Memory usage grows to the limit, then stabilizes
- Old entries automatically evicted

TTL Strategy:
- Embeddings: no TTL (deterministic, immutable)
- Search Results: 5-minute TTL (balance freshness vs. performance)
- Configurable: can be adjusted per use case (example below)
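Both sizes and TTLs are set where the container constructs the caches; the values below are illustrative overrides, not recommendations:

// Example: a smaller search cache with a 1-minute TTL for fast-changing data
const searchCache = new SearchResultCache({
  maxSize: 500,
  defaultTTL: 60_000 // 1 minute
});

// Example: a larger embedding cache when memory is plentiful (~150MB at ~3KB each)
const embeddingCache = new EmbeddingCache({ maxSize: 50_000 });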
Consequences¶
Positive¶
- Significant Performance Improvement
  - Expected 40-60% latency reduction for cached searches
  - Expected 70-80% reduction in embedding computations
  - Instant response for cache hits (<1ms)
- Reduced Resource Usage
  - Fewer CPU cycles for embedding generation
  - Fewer vector search operations
  - Lower database load
- Better User Experience
  - Faster search responses for repeated queries
  - Snappier dashboard/UI interactions
  - Improved perceived performance
- Bounded Memory Usage
  - LRU eviction prevents unlimited growth
  - Configurable max sizes
  - Predictable memory footprint (~130MB max)
- Built-in Metrics
  - Cache hit/miss tracking
  - Hit rate calculation
  - Eviction monitoring
  - Performance analysis capabilities
- Backward Compatible
  - Caches are optional (injected via constructor)
  - Services work without caching
  - All existing tests pass (346 tests)
  - Zero breaking changes
- Type-Safe
  - Full TypeScript type annotations
  - Generic type parameters
  - Compile-time guarantees
- Testable
  - 79 unit tests for caching (100% coverage)
  - Mock caches for testing services
  - Integration tests with real services
Negative¶
- Memory Overhead
  - ~130MB for caches at capacity
  - Mitigation: bounded by LRU eviction
  - Acceptable for the performance gain
- Cache Invalidation Complexity
  - Stale data if documents updated
  - Mitigation: 5-minute TTL for searches, clear on document updates
  - Manual invalidation available if needed
- Cache Warming Overhead
  - Cold cache on startup
  - Mitigation: caches warm naturally through usage
  - Future: pre-populate common queries
- Additional Code Complexity
  - Cache management code
  - Mitigation: well-tested, isolated infrastructure layer
  - Clear separation of concerns
- No Persistence
  - Cache lost on restart
  - Mitigation: acceptable for current scale
  - Future: Redis for persistent caching
Trade-offs¶
| Trade-off | Choice | Rationale |
|---|---|---|
| Memory vs Speed | Accept 130MB | Significant performance gain worth cost |
| TTL vs Freshness | 5-minute TTL | Balance between cache hits and staleness |
| Persistent vs In-Memory | In-memory | Simpler, sufficient for current scale |
| LRU vs LFU | LRU | Simpler, works well for temporal locality |
| Cache Everything vs Selective | Selective | Cache expensive operations only |
Implementation¶
Date: 2025-11-24
Duration: ~2.5 hours of agentic development
Branch: feat/add-advanced-caching
Commit: 4ecdcd1
Files Created¶
Cache Infrastructure:
src/infrastructure/cache/
├── index.ts (exports)
├── lru-cache.ts (core LRU implementation)
├── search-result-cache.ts (search result specialization)
├── embedding-cache.ts (embedding specialization)
└── __tests__/
├── lru-cache.test.ts (25 tests)
├── search-result-cache.test.ts (24 tests)
└── embedding-cache.test.ts (30 tests)
Total:
- Production code: ~800 lines
- Test code: ~840 lines
- Test coverage: 79 tests, 100% coverage
Files Modified¶
Service Integration:
- src/application/container.ts: Initialize and inject caches
- src/infrastructure/embeddings/simple-embedding-service.ts: Use EmbeddingCache
- src/infrastructure/search/conceptual-hybrid-search-service.ts: Use SearchResultCache
Changes: 10 files changed, 1,642 insertions, 13 deletions
Test Results¶
Unit Tests: 79 tests, all passing ✅
LRU Cache: 25 tests (286ms)
Search Result Cache: 24 tests (1258ms, includes TTL tests)
Embedding Cache: 30 tests (108ms)
Integration Tests: 21 tests, all passing ✅
Application Container: 21 tests (7652ms)
- All tools work with caching enabled
- Cache cleanup works correctly
- Backward compatibility maintained
Full Infrastructure Suite: 346 tests, all passing ✅
Performance Impact¶
Expected Performance Improvements¶
Based on cache hit rate projections and typical operation costs:
Search Performance:
Without Cache:
- Embedding generation: 2ms
- Vector search: 100ms
- BM25 search: 20ms
- Result merging: 5ms
- Total: 127ms
With Cache (60% hit rate):
- Cache hit (60%): <1ms
- Cache miss (40%): 127ms
- Average: 0.6ms * 0.6 + 127ms * 0.4 = 51ms
- Improvement: 60% latency reduction
Embedding Performance:
Without Cache:
- Embedding generation: 2ms per text
- 10 searches/sec = 20ms embedding overhead
With Cache (70% hit rate):
- Cache hit (70%): <1ms
- Cache miss (30%): 2ms
- Average: 0.7ms * 0.7 + 2ms * 0.3 = 1.09ms
- Improvement: 45% latency reduction
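The averages above are plain weighted means. As a sketch, the same arithmetic as a small helper (the millisecond figures are the assumptions used in this section):

// expected latency = hitRate * hitLatency + (1 - hitRate) * missLatency
function expectedLatencyMs(hitRate: number, hitMs: number, missMs: number): number {
  return hitRate * hitMs + (1 - hitRate) * missMs;
}

console.log(expectedLatencyMs(0.6, 0.6, 127)); // ≈ 51ms (search estimate above)
console.log(expectedLatencyMs(0.7, 0.7, 2));   // ≈ 1.1ms (embedding estimate above)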
Memory Usage:
Embedding Cache:
- 10K embeddings * 3KB = 30MB max
Search Result Cache:
- 1K results * ~100KB = 100MB max
Total: ~130MB
Alternatives Considered¶
Alternative 1: Redis Cache¶
Pros:
- Persistent across restarts
- Shared cache across multiple instances
- Rich data structures and features
- Battle-tested at scale

Cons:
- External dependency (Redis server)
- Network latency for cache operations
- Added deployment complexity
- Overkill for a single-instance MCP server
Decision: In-memory caching sufficient for current scale
Alternative 2: Simple Object Cache¶
Pros:
- Simplest possible implementation
- No dependencies
- Minimal code

Cons:
- No bounded memory (potential leaks)
- No TTL support
- No eviction strategy
- No metrics
Decision: LRU cache provides necessary features
Alternative 3: LFU (Least Frequently Used) Cache¶
Pros:
- Better for stable access patterns
- Keeps the most popular items

Cons:
- More complex to implement
- Doesn't handle temporal locality well
- Search patterns have temporal locality
Decision: LRU better for search workloads
Alternative 4: No Caching (Keep Current State)¶
Pros:
- No memory overhead
- No complexity
- Always fresh data

Cons:
- Repeated expensive operations
- Poor performance for common queries
- Wasteful resource usage
Decision: Caching provides significant value
Testing Strategy¶
Unit Tests (79 tests, 100% coverage)¶
LRU Cache Tests (25 tests):
- Basic get/set operations
- LRU eviction behavior
- TTL expiration (a representative test is sketched after this list)
- Metrics tracking
- Edge cases (empty values, special characters, unicode)

Search Result Cache Tests (24 tests):
- Cache key generation
- Option normalization
- Query invalidation
- Pattern-based invalidation
- TTL behavior
- Metrics accuracy

Embedding Cache Tests (30 tests):
- Model-specific caching
- Text hashing
- Memory estimation
- No-TTL behavior
- Large embeddings
- Edge cases
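A representative TTL test, sketched in Vitest style (the listed test files are the real ones; this particular test body and import path are illustrative):

import { describe, expect, it, vi } from 'vitest';
import { LRUCache } from '../lru-cache';

describe('LRUCache TTL', () => {
  it('expires entries after their TTL', () => {
    // Freeze time so the lazy expiry check in get() is deterministic
    const nowSpy = vi.spyOn(Date, 'now').mockReturnValue(0);
    const cache = new LRUCache<string, string>({ maxSize: 10 });

    cache.set('key', 'value', 1000); // per-entry TTL of 1 second
    expect(cache.get('key')).toBe('value');

    nowSpy.mockReturnValue(1500); // jump past the TTL
    expect(cache.get('key')).toBeUndefined();

    nowSpy.mockRestore();
  });
});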
Integration Tests¶
Container Integration:
- Cache initialization
- Service integration
- Cache cleanup on shutdown
- Backward compatibility (services work without cache)

Service Integration:
- Caching in SimpleEmbeddingService
- Caching in ConceptualHybridSearchService
- Debug mode disables caching
- All existing tools work with caching
Performance Tests¶
Future performance benchmarks to validate cache effectiveness (a micro-benchmark sketch follows):
- Cache hit rate monitoring
- Latency distribution (cached vs. uncached)
- Memory usage over time
- Eviction rate analysis
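A rough shape such a benchmark could take, timing the same query cold and warm (a sketch only; the tool and arguments mirror the usage examples below):

// Compare a cold (uncached) and a warm (cached) execution of the same query
async function benchmarkSearch(tool: { execute(args: unknown): Promise<unknown> }) {
  const args = { text: 'microservices', limit: 5 };

  const coldStart = performance.now();
  await tool.execute(args); // cache miss: full search pipeline
  const coldMs = performance.now() - coldStart;

  const warmStart = performance.now();
  await tool.execute(args); // cache hit (within the TTL window)
  const warmMs = performance.now() - warmStart;

  console.log(`cold: ${coldMs.toFixed(1)}ms, warm: ${warmMs.toFixed(1)}ms`);
}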
Usage Examples¶
Search Result Caching (Transparent)¶
// Usage is transparent to callers
const container = new ApplicationContainer();
await container.initialize('~/.concept_rag');
const tool = container.getTool('catalog_search');
// First search: cache miss, executes full search
const result1 = await tool.execute({ text: 'microservices', limit: 5 });
// Second search (within 5min): cache hit, instant return
const result2 = await tool.execute({ text: 'microservices', limit: 5 });
// Different limit: cache miss (different cache key)
const result3 = await tool.execute({ text: 'microservices', limit: 10 });
Embedding Caching (Transparent)¶
// Embedding caching is completely transparent
const embedding1 = await embeddingService.generateEmbedding('microservices');
// Computed from scratch
const embedding2 = await embeddingService.generateEmbedding('microservices');
// Retrieved from cache (instant)
Cache Metrics¶
// Access cache metrics for monitoring
const searchMetrics = searchCache.getMetrics();
console.log(`Hit rate: ${searchMetrics.hitRate.toFixed(2)}%`);
console.log(`Hits: ${searchMetrics.hits}, Misses: ${searchMetrics.misses}`);
console.log(`Evictions: ${searchMetrics.evictions}`);
const embeddingMetrics = embeddingCache.getMetrics();
console.log(`Embedding cache hit rate: ${embeddingMetrics.hitRate.toFixed(2)}%`);
console.log(`Memory usage: ${embeddingCache.estimateMemoryUsage() / 1024 / 1024} MB`);
Manual Cache Invalidation¶
// Invalidate specific query
searchCache.invalidate('microservices');
// Invalidate pattern
const invalidated = searchCache.invalidatePattern(/^micro/);
console.log(`Invalidated ${invalidated} entries`);
// Clear all caches
searchCache.clear();
embeddingCache.clear();
Future Enhancements¶
When System Scales¶
- Redis Integration
  - Persistent caching across restarts
  - Shared cache across multiple instances
  - Distributed cache invalidation
- Cache Warming
  - Pre-populate common queries on startup
  - Background cache refresh
  - Warm cache with popular concepts
- Adaptive TTL
  - Adjust TTL based on query frequency
  - Longer TTL for stable results
  - Shorter TTL for volatile data
- Smart Invalidation
  - Invalidate affected caches on document updates
  - Dependency tracking for invalidation
  - Partial cache invalidation
- Metrics Integration
  - Expose cache metrics via observability endpoints
  - Dashboard for cache performance
  - Alerting on low hit rates
- Multi-Tier Caching (a rough lookup sketch follows this list)
  - L1: In-memory (fast, limited capacity)
  - L2: Redis (persistent, larger capacity)
  - L3: Disk (very large, slower)
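If a second tier is added later, the lookup path could keep the same cache-aside shape, consulting L1 before L2. The sketch below assumes only a generic async interface for the L2 tier, not any particular Redis client API:

// Generic async cache interface an L2 tier (e.g. Redis) could implement
interface AsyncCache<K, V> {
  get(key: K): Promise<V | undefined>;
  set(key: K, value: V): Promise<void>;
}

async function tieredGet<K, V>(
  l1: LRUCache<K, V>,
  l2: AsyncCache<K, V>,
  key: K,
  compute: () => Promise<V>
): Promise<V> {
  const fromL1 = l1.get(key);
  if (fromL1 !== undefined) return fromL1; // fastest path

  const fromL2 = await l2.get(key);
  if (fromL2 !== undefined) {
    l1.set(key, fromL2); // promote into the in-memory tier
    return fromL2;
  }

  const value = await compute(); // miss in both tiers: do the expensive work
  l1.set(key, value);
  await l2.set(key, value);
  return value;
}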
References¶
Documentation¶
- Planning: 05-caching-strategy-plan.md
- Completion: CACHING-IMPLEMENTATION-COMPLETE.md
Related Code¶
- src/infrastructure/cache/lru-cache.ts
- src/infrastructure/cache/search-result-cache.ts
- src/infrastructure/cache/embedding-cache.ts
- src/application/container.ts
Related ADRs¶
- ADR0016: Layered architecture foundation
- ADR0018: Dependency injection pattern
- ADR0021: Performance optimization strategy
- ADR0039: Observability for metrics
Concepts Applied¶
From the knowledge base (Caching & Performance):
1. LRU eviction - Cache replacement policy
2. TTL (Time-to-Live) - Automatic expiration
3. Cache-aside pattern - Check cache, populate on miss (sketched below)
4. Bounded memory - Prevent unlimited growth
5. Hit rate metrics - Cache effectiveness measurement
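The cache-aside pattern, as it appears throughout the integration code above, reduced to a minimal generic form (the function and parameter names are illustrative):

// Cache-aside: consult the cache first, compute and populate it on a miss
async function cacheAside<T>(
  cache: SearchResultCache,
  query: string,
  options: Record<string, unknown>,
  compute: () => Promise<T>
): Promise<T> {
  const cached = cache.get<T>(query, options);
  if (cached !== undefined) {
    return cached; // cache hit
  }
  const value = await compute(); // cache miss: run the expensive operation
  cache.set(query, value, options);
  return value;
}

// e.g. const results = await cacheAside(searchCache, 'microservices', { limit: 10 },
//   () => hybridSearch.search('microservices', { limit: 10 }));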
Conclusion¶
The multi-level caching strategy significantly improves system performance while maintaining bounded memory usage and backward compatibility.
Key Achievements:
- ✅ Generic LRU cache with TTL support
- ✅ Specialized caches for searches and embeddings
- ✅ Transparent integration into services
- ✅ Bounded memory (~130MB max)
- ✅ 79 unit tests, 100% coverage
- ✅ Zero breaking changes
- ✅ Production-ready implementation
Expected Impact:
- 40-60% latency reduction for cached searches
- 70-80% reduction in embedding computations
- Improved user experience for repeated queries
- Reduced resource usage (CPU, database)
Status: Production-ready, deployed to feat/add-advanced-caching branch, commit 4ecdcd1
The caching infrastructure is now in place to deliver significant performance improvements while maintaining code quality and system reliability. Future enhancements (Redis, cache warming, metrics) can be added incrementally as system scale increases.