
ADR 0041: Multi-Level Caching Strategy

Status: Accepted
Date: 2025-11-24
Deciders: Development Team
Related ADRs: adr0016, adr0018, adr0021, adr0039

Context

The concept-rag system performs computationally expensive operations that benefit from caching:

Performance Bottlenecks

  1. Embedding Generation: Simple hash-based embeddings take ~1-5ms per text
     - Repeated searches generate the same embeddings multiple times
     - User queries often contain similar or identical terms
     - Concept names embedded repeatedly during hybrid search

  2. Vector Search: LanceDB vector search is fast but not instantaneous
     - Typical search: 50-200ms depending on dataset size
     - Repeated identical queries waste resources
     - Search results stable for short time periods

  3. Hybrid Search: Combines multiple operations
     - Embedding generation (1-5ms)
     - Vector search (50-200ms)
     - BM25 keyword search (10-50ms)
     - Result merging and ranking (5-10ms)
     - Total: 66-265ms per search

Real-World Usage Patterns

Query Repetition:
- Users refine searches iteratively
- Common concepts searched frequently
- Dashboard/UI polls for same data
- Testing/debugging repeats identical queries

Embedding Reuse:
- Same query text embedded multiple times
- Common domain terms (e.g., "microservices", "ddd") used repeatedly
- Concept names embedded during every search involving them

Cache Hit Rate Potential:
- Estimated 60-70% cache hit rate for searches
- Estimated 70-80% cache hit rate for embeddings
- Significant performance improvement possible

Existing State

No Caching ❌:
- Every search executes full pipeline
- Every embedding computed from scratch
- No memory of previous operations
- Repeated work wastes resources

Memory Caches Exist ✅:
- ConceptIdCache: Concept name → ID mapping (653 entries)
- CategoryIdCache: Category lookups (~50 entries)
- Both loaded at startup, never expire

Need Multi-Level Strategy:
- Search result caching with TTL
- Embedding caching (no TTL, embeddings immutable)
- Bounded memory usage (LRU eviction)
- Configurable cache sizes and TTLs

Decision

Implement a multi-level caching strategy with LRU (Least Recently Used) eviction and TTL (Time-to-Live) support:

1. Core LRU Cache Infrastructure

Generic LRU Cache: Foundation for all caching

interface CacheMetrics {
  hits: number;
  misses: number;
  evictions: number;
  size: number;
  hitRate: number;
}

class LRUCache<K, V> {
  constructor(options: {
    maxSize: number;          // Maximum entries
    defaultTTL?: number;      // Default TTL in milliseconds
    onEvict?: (key: K, value: V) => void;
  });

  get(key: K): V | undefined;
  set(key: K, value: V, ttl?: number): void;
  has(key: K): boolean;
  delete(key: K): boolean;
  clear(): void;

  // Metrics
  getMetrics(): CacheMetrics;
  resetMetrics(): void;

  // Inspection
  keys(): K[];
  size(): number;
}

Features:
- O(1) get/set operations using JavaScript Map
- Automatic LRU eviction when maxSize reached
- Per-entry TTL support with lazy expiration checking
- Built-in metrics (hits, misses, evictions, hit rate)
- Optional eviction callback for cleanup

Implementation Details:

interface CacheEntry<V> {
  value: V;
  expiresAt?: number;   // Absolute timestamp in ms; undefined = no expiry
}

class LRUCache<K, V> {
  private cache = new Map<K, CacheEntry<V>>();
  private maxSize: number;
  private defaultTTL?: number;
  private onEvict?: (key: K, value: V) => void;
  private metrics = { hits: 0, misses: 0, evictions: 0 };

  constructor(options: {
    maxSize: number;
    defaultTTL?: number;
    onEvict?: (key: K, value: V) => void;
  }) {
    this.maxSize = options.maxSize;
    this.defaultTTL = options.defaultTTL;
    this.onEvict = options.onEvict;
  }

  get(key: K): V | undefined {
    const entry = this.cache.get(key);

    if (!entry) {
      this.metrics.misses++;
      return undefined;
    }

    // Lazy TTL check: expired entries are removed when accessed
    if (entry.expiresAt && Date.now() > entry.expiresAt) {
      this.cache.delete(key);
      this.metrics.misses++;
      return undefined;
    }

    // Move to end of the Map's insertion order (most recently used)
    this.cache.delete(key);
    this.cache.set(key, entry);
    this.metrics.hits++;

    return entry.value;
  }

  set(key: K, value: V, ttl?: number): void {
    // Re-setting an existing key refreshes its recency
    if (this.cache.has(key)) {
      this.cache.delete(key);
    } else if (this.cache.size >= this.maxSize) {
      // Evict the least recently used entry (first key in insertion order)
      const lruKey = this.cache.keys().next().value as K;
      const evicted = this.cache.get(lruKey);
      this.cache.delete(lruKey);
      this.metrics.evictions++;
      if (evicted && this.onEvict) {
        this.onEvict(lruKey, evicted.value);
      }
    }

    const effectiveTTL = ttl ?? this.defaultTTL;
    const expiresAt = effectiveTTL ? Date.now() + effectiveTTL : undefined;
    this.cache.set(key, { value, expiresAt });
  }

  // has(), delete(), clear(), getMetrics(), keys(), size() omitted for brevity
}
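A minimal usage sketch (values illustrative) showing how eviction and lazy TTL expiration behave with the interface above:

// Illustrative: a 2-entry cache with a 1-second default TTL
const cache = new LRUCache<string, number>({ maxSize: 2, defaultTTL: 1000 });

cache.set('a', 1);
cache.set('b', 2);

cache.get('a');      // Hit: 'a' becomes most recently used
cache.set('c', 3);   // At capacity: 'b' (least recently used) is evicted

cache.has('b');      // false (evicted)
cache.has('a');      // true

setTimeout(() => {
  cache.get('a');    // undefined: expired after defaultTTL, recorded as a miss
  console.log(cache.getMetrics()); // { hits, misses, evictions, size, hitRate }
}, 1500);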

2. Search Result Cache

Purpose: Cache search results with TTL to avoid redundant searches

interface SearchResultCacheOptions {
  maxSize?: number;         // Default: 1000
  defaultTTL?: number;      // Default: 5 minutes (300000ms)
}

class SearchResultCache {
  constructor(options?: SearchResultCacheOptions);

  // Cache search results
  get<T>(query: string, options?: Record<string, unknown>): T | undefined;
  set<T>(query: string, value: T, options?: Record<string, unknown>): void;

  // Invalidation
  invalidate(query: string): void;
  invalidatePattern(pattern: RegExp): number;
  clear(): void;

  // Metrics
  getMetrics(): CacheMetrics;
}

Cache Key Generation: Deterministic SHA-256 hash

// createHash is imported from Node's built-in crypto module
private generateCacheKey(
  query: string,
  options?: Record<string, unknown>
): string {
  // Normalize options (sort keys for consistency)
  const normalized = options 
    ? Object.keys(options)
        .sort()
        .reduce((acc, key) => ({ ...acc, [key]: options[key] }), {})
    : {};

  const input = `${query}:${JSON.stringify(normalized)}`;
  return createHash('sha256').update(input).digest('hex');
}

Key Features:
- Query + options included in cache key
- Option order normalized for consistency
- Default 5-minute TTL (configurable)
- Query invalidation support
- Pattern-based invalidation (regex)

Example Usage:

const cache = new SearchResultCache({ maxSize: 1000, defaultTTL: 300000 });

// Check cache first
const cached = cache.get('microservices', { limit: 10 });
if (cached) {
  return cached;
}

// Execute search
const results = await hybridSearch.search('microservices', { limit: 10 });

// Store in cache
cache.set('microservices', results, { limit: 10 });
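Because option keys are sorted before hashing, the same query and options produce the same cache key regardless of key order. A short sketch (the includeChunks option name is illustrative only):

// Same query, same options, different key order -> same cache key
cache.set('microservices', results, { limit: 10, includeChunks: true });

const hit = cache.get('microservices', { includeChunks: true, limit: 10 });
// hit is defined: option normalization makes both keys identical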

3. Embedding Cache

Purpose: Cache embedding vectors (no TTL, embeddings are immutable)

interface EmbeddingCacheOptions {
  maxSize?: number;         // Default: 10000
}

class EmbeddingCache {
  constructor(options?: EmbeddingCacheOptions);

  // Cache embeddings
  get(text: string, model: string): number[] | undefined;
  set(text: string, model: string, embedding: number[]): void;

  // Management
  clear(): void;

  // Metrics
  getMetrics(): CacheMetrics;
  estimateMemoryUsage(): number; // Bytes
}

Cache Key Generation: Model + text hash

private generateCacheKey(text: string, model: string): string {
  // Hash long texts for efficient cache keys
  const textHash = createHash('sha256').update(text).digest('hex');
  return `${model}:${textHash}`;
}

Key Features:
- Model-specific caching (different models = different embeddings)
- No TTL (embeddings are deterministic and immutable)
- Text hashing for efficient cache keys
- Memory usage estimation (~3KB per embedding)
- Large capacity (10K embeddings by default)

Memory Usage Estimation:

estimateMemoryUsage(): number {
  const avgDimension = 384;     // Typical embedding dimension
  const bytesPerFloat = 8;      // Float64
  const avgEmbeddingSize = avgDimension * bytesPerFloat; // ~3KB
  return this.cache.size() * avgEmbeddingSize;
}
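A standalone usage sketch for the embedding cache (assuming an async context; the cache-aside flow mirrors the service integration shown in the next section):

const embeddingCache = new EmbeddingCache({ maxSize: 10000 });
const model = 'simple-hash-v1';

// Check the cache first; compute and store on a miss
let vector = embeddingCache.get('microservices', model);
if (!vector) {
  vector = await embeddingService.generateEmbedding('microservices');
  embeddingCache.set('microservices', model, vector);
}

// Subsequent lookups for the same text + model hit the cache (no TTL)
const cached = embeddingCache.get('microservices', model);

console.log(`~${(embeddingCache.estimateMemoryUsage() / 1024).toFixed(1)} KB cached`);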

4. Service Integration

SimpleEmbeddingService with EmbeddingCache:

class SimpleEmbeddingService implements EmbeddingService {
  constructor(
    private readonly embeddingCache?: EmbeddingCache
  ) {}

  async generateEmbedding(text: string): Promise<number[]> {
    const model = 'simple-hash-v1';

    // Check cache
    if (this.embeddingCache) {
      const cached = this.embeddingCache.get(text, model);
      if (cached) {
        return cached;
      }
    }

    // Generate embedding
    const embedding = this.computeEmbedding(text);

    // Store in cache
    if (this.embeddingCache) {
      this.embeddingCache.set(text, model, embedding);
    }

    return embedding;
  }
}

ConceptualHybridSearchService with SearchResultCache:

class ConceptualHybridSearchService {
  constructor(
    private readonly searchCache?: SearchResultCache,
    // ... other dependencies
  ) {}

  async search(
    query: string,
    options: SearchOptions,
    debug?: boolean
  ): Promise<SearchResult[]> {
    // Disable caching in debug mode
    if (debug) {
      return this.executeSearch(query, options);
    }

    // Cache key includes collection name
    const cacheKey = `${options.collection}:${query}`;

    // Check cache
    if (this.searchCache) {
      const cached = this.searchCache.get(cacheKey, options);
      if (cached) {
        return cached;
      }
    }

    // Execute search
    const results = await this.executeSearch(query, options);

    // Store in cache
    if (this.searchCache) {
      this.searchCache.set(cacheKey, results, options);
    }

    return results;
  }
}

5. Container Configuration

Application Container initializes caches:

class ApplicationContainer {
  private embeddingCache?: EmbeddingCache;
  private searchCache?: SearchResultCache;

  async initialize(databaseUrl: string): Promise<void> {
    // Initialize caches
    this.embeddingCache = new EmbeddingCache({
      maxSize: 10000  // 10K embeddings (~30MB max)
    });

    this.searchCache = new SearchResultCache({
      maxSize: 1000,           // 1K searches
      defaultTTL: 300000       // 5 minutes
    });

    // Inject into services
    this.embeddingService = new SimpleEmbeddingService(
      this.embeddingCache
    );

    this.hybridSearchService = new ConceptualHybridSearchService(
      this.searchCache,
      // ... other dependencies
    );
  }

  async shutdown(): Promise<void> {
    // Clear caches
    this.embeddingCache?.clear();
    this.searchCache?.clear();

    // ... other cleanup
  }
}

6. Cache Configuration

Default Sizes:

| Cache         | Max Size | TTL   | Max Memory |
|---------------|----------|-------|------------|
| Embedding     | 10,000   | None  | ~30MB      |
| Search Result | 1,000    | 5 min | ~100MB     |
| Total         | -        | -     | ~130MB     |

Memory Bounds:
- LRU eviction ensures bounded memory
- Configurable max sizes
- Memory usage grows to limit, then stabilizes
- Old entries automatically evicted

TTL Strategy:
- Embeddings: No TTL (deterministic, immutable)
- Search Results: 5-minute TTL (balance freshness vs performance)
- Configurable: Can adjust per use case
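To make the sizes and TTLs above adjustable per use case, configuration could be read from the environment before the container constructs the caches. A sketch only — the environment variable names are hypothetical, not part of the current implementation:

// Hypothetical configuration: read cache settings from the environment,
// falling back to the defaults documented above.
const cacheConfig = {
  embeddingCacheSize: Number(process.env.EMBEDDING_CACHE_SIZE ?? 10000),
  searchCacheSize: Number(process.env.SEARCH_CACHE_SIZE ?? 1000),
  searchCacheTTL: Number(process.env.SEARCH_CACHE_TTL_MS ?? 300000), // 5 minutes
};

const embeddingCache = new EmbeddingCache({ maxSize: cacheConfig.embeddingCacheSize });
const searchCache = new SearchResultCache({
  maxSize: cacheConfig.searchCacheSize,
  defaultTTL: cacheConfig.searchCacheTTL,
});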

Consequences

Positive

  1. Significant Performance Improvement
     - Expected 40-60% latency reduction for cached searches
     - Expected 70-80% reduction in embedding computations
     - Instant response for cache hits (<1ms)

  2. Reduced Resource Usage
     - Fewer CPU cycles for embedding generation
     - Fewer vector search operations
     - Lower database load

  3. Better User Experience
     - Faster search responses for repeated queries
     - Snappier dashboard/UI interactions
     - Improved perceived performance

  4. Bounded Memory Usage
     - LRU eviction prevents unlimited growth
     - Configurable max sizes
     - Predictable memory footprint (~130MB max)

  5. Built-in Metrics
     - Cache hit/miss tracking
     - Hit rate calculation
     - Eviction monitoring
     - Performance analysis capabilities

  6. Backward Compatible
     - Caches are optional (injected via constructor)
     - Services work without caching
     - All existing tests pass (346 tests)
     - Zero breaking changes

  7. Type-Safe
     - Full TypeScript type annotations
     - Generic type parameters
     - Compile-time guarantees

  8. Testable
     - 79 unit tests for caching (100% coverage)
     - Mock caches for testing services
     - Integration tests with real services

Negative

  1. Memory Overhead
     - ~130MB for caches at capacity
     - Mitigation: Bounded by LRU eviction
     - Acceptable for performance gain

  2. Cache Invalidation Complexity
     - Stale data if documents updated
     - Mitigation: 5-minute TTL for searches, clear on document updates
     - Manual invalidation available if needed

  3. Cache Warming Overhead
     - Cold cache on startup
     - Mitigation: Caches warm naturally through usage
     - Future: Pre-populate common queries

  4. Additional Code Complexity
     - Cache management code
     - Mitigation: Well-tested, isolated infrastructure layer
     - Clear separation of concerns

  5. No Persistence
     - Cache lost on restart
     - Mitigation: Acceptable for current scale
     - Future: Redis for persistent caching

Trade-offs

| Trade-off                     | Choice        | Rationale                                 |
|-------------------------------|---------------|-------------------------------------------|
| Memory vs Speed               | Accept 130MB  | Significant performance gain worth cost   |
| TTL vs Freshness              | 5-minute TTL  | Balance between cache hits and staleness  |
| Persistent vs In-Memory       | In-memory     | Simpler, sufficient for current scale     |
| LRU vs LFU                    | LRU           | Simpler, works well for temporal locality |
| Cache Everything vs Selective | Selective     | Cache expensive operations only           |

Implementation

Date: 2025-11-24
Duration: ~2.5 hours of agentic development
Branch: feat/add-advanced-caching
Commit: 4ecdcd1

Files Created

Cache Infrastructure:

src/infrastructure/cache/
├── index.ts (exports)
├── lru-cache.ts (core LRU implementation)
├── search-result-cache.ts (search result specialization)
├── embedding-cache.ts (embedding specialization)
└── __tests__/
    ├── lru-cache.test.ts (25 tests)
    ├── search-result-cache.test.ts (24 tests)
    └── embedding-cache.test.ts (30 tests)

Total:
- Production code: ~800 lines
- Test code: ~840 lines
- Test coverage: 79 tests, 100% coverage

Files Modified

Service Integration:
- src/application/container.ts: Initialize and inject caches
- src/infrastructure/embeddings/simple-embedding-service.ts: Use EmbeddingCache
- src/infrastructure/search/conceptual-hybrid-search-service.ts: Use SearchResultCache

Changes: 10 files changed, 1,642 insertions, 13 deletions

Test Results

Unit Tests: 79 tests, all passing ✅

LRU Cache:              25 tests (286ms)
Search Result Cache:    24 tests (1258ms, includes TTL tests)
Embedding Cache:        30 tests (108ms)

Integration Tests: 21 tests, all passing ✅

Application Container:  21 tests (7652ms)
- All tools work with caching enabled
- Cache cleanup works correctly
- Backward compatibility maintained

Full Infrastructure Suite: 346 tests, all passing ✅

- All infrastructure tests pass with new caching layer
- No regressions introduced

Performance Impact

Expected Performance Improvements

Based on cache hit rate projections and typical operation costs:

Search Performance:

Without Cache:
- Embedding generation: 2ms
- Vector search: 100ms
- BM25 search: 20ms
- Result merging: 5ms
- Total: 127ms

With Cache (60% hit rate):
- Cache hit (60%): <1ms (~0.6ms assumed)
- Cache miss (40%): 127ms
- Average: 0.6ms * 0.6 + 127ms * 0.4 ≈ 51ms
- Improvement: ~60% latency reduction

Embedding Performance:

Without Cache:
- Embedding generation: 2ms per text
- 10 searches/sec = 20ms embedding overhead

With Cache (70% hit rate):
- Cache hit (70%): <1ms (~0.7ms assumed)
- Cache miss (30%): 2ms
- Average: 0.7ms * 0.7 + 2ms * 0.3 ≈ 1.1ms
- Improvement: ~45% latency reduction
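The averages above follow the standard expected-latency formula. A small helper makes the calculation explicit (inputs are the estimates above, not measurements):

// Expected latency = hitRate * hitLatency + (1 - hitRate) * missLatency
function expectedLatencyMs(hitRate: number, hitMs: number, missMs: number): number {
  return hitRate * hitMs + (1 - hitRate) * missMs;
}

expectedLatencyMs(0.6, 0.6, 127); // ≈ 51ms per search
expectedLatencyMs(0.7, 0.7, 2);   // ≈ 1.1ms per embedding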

Memory Usage:

Embedding Cache:
- 10K embeddings * 3KB = 30MB max

Search Result Cache:
- 1K results * ~100KB = 100MB max

Total: ~130MB

Alternatives Considered

Alternative 1: Redis Cache

Pros:
- Persistent across restarts
- Shared cache across multiple instances
- Rich data structures and features
- Battle-tested at scale

Cons:
- External dependency (Redis server)
- Network latency for cache operations
- Added deployment complexity
- Overkill for single-instance MCP server

Decision: In-memory caching sufficient for current scale

Alternative 2: Simple Object Cache

Pros:
- Simplest possible implementation
- No dependencies
- Minimal code

Cons:
- No bounded memory (potential leaks)
- No TTL support
- No eviction strategy
- No metrics

Decision: LRU cache provides necessary features

Alternative 3: LFU (Least Frequently Used) Cache

Pros:
- Better for stable access patterns
- Keeps most popular items

Cons:
- More complex to implement
- Doesn't handle temporal locality well
- Search patterns have temporal locality

Decision: LRU better for search workloads

Alternative 4: No Caching (Keep Current State)

Pros:
- No memory overhead
- No complexity
- Always fresh data

Cons:
- Repeated expensive operations
- Poor performance for common queries
- Wasteful resource usage

Decision: Caching provides significant value

Testing Strategy

Unit Tests (79 tests, 100% coverage)

LRU Cache Tests (25 tests):
- Basic get/set operations
- LRU eviction behavior
- TTL expiration
- Metrics tracking
- Edge cases (empty values, special characters, unicode)

Search Result Cache Tests (24 tests):
- Cache key generation
- Option normalization
- Query invalidation
- Pattern-based invalidation
- TTL behavior
- Metrics accuracy

Embedding Cache Tests (30 tests):
- Model-specific caching
- Text hashing
- Memory estimation
- No TTL behavior
- Large embeddings
- Edge cases

Integration Tests

Container Integration:
- Cache initialization
- Service integration
- Cache cleanup on shutdown
- Backward compatibility (services work without cache)

Service Integration:
- Caching in SimpleEmbeddingService
- Caching in ConceptualHybridSearchService
- Debug mode disables caching
- All existing tools work with caching

Performance Tests

Future performance benchmarks to validate cache effectiveness:
- Cache hit rate monitoring
- Latency distribution (cached vs uncached)
- Memory usage over time
- Eviction rate analysis
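One possible shape for such a benchmark, using only the APIs described in this ADR (the query and timing details are illustrative):

// Illustrative micro-benchmark: run the same query twice and compare latency,
// then inspect the search cache's hit-rate metrics.
async function benchmarkSearch(query: string): Promise<void> {
  const tool = container.getTool('catalog_search');

  const t0 = performance.now();
  await tool.execute({ text: query, limit: 10 });   // expected cache miss
  const coldMs = performance.now() - t0;

  const t1 = performance.now();
  await tool.execute({ text: query, limit: 10 });   // expected cache hit
  const warmMs = performance.now() - t1;

  console.log(`cold: ${coldMs.toFixed(1)}ms, warm: ${warmMs.toFixed(1)}ms`);
  console.log('search cache metrics:', searchCache.getMetrics());
}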

Usage Examples

Search Result Caching (Transparent)

// Usage is transparent to callers
const container = new ApplicationContainer();
await container.initialize('~/.concept_rag');

const tool = container.getTool('catalog_search');

// First search: cache miss, executes full search
const result1 = await tool.execute({ text: 'microservices', limit: 5 });

// Second search (within 5min): cache hit, instant return
const result2 = await tool.execute({ text: 'microservices', limit: 5 });

// Different limit: cache miss (different cache key)
const result3 = await tool.execute({ text: 'microservices', limit: 10 });

Embedding Caching (Transparent)

// Embedding caching is completely transparent
const embedding1 = await embeddingService.generateEmbedding('microservices');
// Computed from scratch

const embedding2 = await embeddingService.generateEmbedding('microservices');
// Retrieved from cache (instant)

Cache Metrics

// Access cache metrics for monitoring
const searchMetrics = searchCache.getMetrics();
console.log(`Hit rate: ${searchMetrics.hitRate.toFixed(2)}%`);
console.log(`Hits: ${searchMetrics.hits}, Misses: ${searchMetrics.misses}`);
console.log(`Evictions: ${searchMetrics.evictions}`);

const embeddingMetrics = embeddingCache.getMetrics();
console.log(`Embedding cache hit rate: ${embeddingMetrics.hitRate.toFixed(2)}%`);
console.log(`Memory usage: ${embeddingCache.estimateMemoryUsage() / 1024 / 1024} MB`);

Manual Cache Invalidation

// Invalidate specific query
searchCache.invalidate('microservices');

// Invalidate pattern
const invalidated = searchCache.invalidatePattern(/^micro/);
console.log(`Invalidated ${invalidated} entries`);

// Clear all caches
searchCache.clear();
embeddingCache.clear();

Future Enhancements

When System Scales

  1. Redis Integration
     - Persistent caching across restarts
     - Shared cache across multiple instances
     - Distributed cache invalidation

  2. Cache Warming
     - Pre-populate common queries on startup
     - Background cache refresh
     - Warm cache with popular concepts

  3. Adaptive TTL
     - Adjust TTL based on query frequency
     - Longer TTL for stable results
     - Shorter TTL for volatile data

  4. Smart Invalidation
     - Invalidate affected caches on document updates
     - Dependency tracking for invalidation
     - Partial cache invalidation

  5. Metrics Integration
     - Expose cache metrics via observability endpoints
     - Dashboard for cache performance
     - Alerting on low hit rates

  6. Multi-Tier Caching
     - L1: In-memory (fast, limited capacity)
     - L2: Redis (persistent, larger capacity)
     - L3: Disk (very large, slower)

References

Documentation

  • src/infrastructure/cache/lru-cache.ts
  • src/infrastructure/cache/search-result-cache.ts
  • src/infrastructure/cache/embedding-cache.ts
  • src/application/container.ts
  • ADR0016: Layered architecture foundation
  • ADR0018: Dependency injection pattern
  • ADR0021: Performance optimization strategy
  • ADR0039: Observability for metrics

Concepts Applied

From knowledge base (Caching & Performance):
1. LRU eviction - Cache replacement policy
2. TTL (Time-to-Live) - Automatic expiration
3. Cache-aside pattern - Check cache, populate on miss
4. Bounded memory - Prevent unlimited growth
5. Hit rate metrics - Cache effectiveness measurement
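The cache-aside pattern listed above is the shape used by both service integrations; as a generic sketch (the helper name is illustrative):

// Generic cache-aside helper: check the cache, compute on a miss, populate, return.
async function cacheAside<K, V>(
  cache: LRUCache<K, V>,
  key: K,
  compute: () => Promise<V>,
  ttl?: number
): Promise<V> {
  const cached = cache.get(key);
  if (cached !== undefined) {
    return cached;               // cache hit
  }
  const value = await compute(); // cache miss: do the expensive work
  cache.set(key, value, ttl);    // populate for subsequent callers
  return value;
}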

Conclusion

The multi-level caching strategy significantly improves system performance while maintaining bounded memory usage and backward compatibility.

Key Achievements:
- ✅ Generic LRU cache with TTL support
- ✅ Specialized caches for searches and embeddings
- ✅ Transparent integration into services
- ✅ Bounded memory (~130MB max)
- ✅ 79 unit tests, 100% coverage
- ✅ Zero breaking changes
- ✅ Production-ready implementation

Expected Impact:
- 40-60% latency reduction for cached searches
- 70-80% reduction in embedding computations
- Improved user experience for repeated queries
- Reduced resource usage (CPU, database)

Status: Production-ready, deployed to feat/add-advanced-caching branch, commit 4ecdcd1

The caching infrastructure is now in place to deliver significant performance improvements while maintaining code quality and system reliability. Future enhancements (Redis, cache warming, metrics) can be added incrementally as system scale increases.