ADR-0049: Incremental Category Summary Generation¶

Status¶

Accepted

Context¶

During document seeding, the system generates LLM-powered summaries for each category extracted from documents. These summaries provide concise descriptions that help users understand what each category covers.

Technical Forces: - The createCategoriesTable() function currently regenerates summaries for ALL categories on every seeding run - For a typical library with ~700 categories, this requires ~24 LLM API calls (30 categories per batch) - Each batch is rate-limited to 1 second minimum, adding ~24+ seconds to every incremental run - The existing categories table is dropped before querying, losing all previously generated summaries

Business Forces: - LLM API calls incur cost (OpenRouter billing) - Incremental seeding should be fast when adding only a few documents - User experience suffers when adding 5 documents takes as long as initial seeding

Operational Forces: - Most incremental runs add 0-5 new categories out of hundreds existing - Regenerating identical summaries wastes resources without providing value

Decision Drivers¶

Efficiency - Avoid redundant LLM calls for unchanged data
Cost reduction - Minimize API usage when existing data is valid
Speed - Incremental runs should be proportional to new content
Simplicity - Solution should be straightforward to implement and maintain
Reliability - Must handle edge cases (first run, empty categories)

Considered Options¶

Option A: Cache Summaries Before Table Drop (Selected)¶

Query the existing categories table before dropping it to extract all category→summary mappings. Generate LLM summaries only for categories not found in the cache.

Pros: - Simple implementation (~30 lines of code) - Maintains existing table recreation pattern - No schema changes required - 90%+ reduction in LLM calls for incremental runs

Cons: - Requires one additional database query per run - Summaries are only cached in memory during the run

Option B: Update Table In-Place¶

Modify existing records and only insert new categories without dropping the table.

Pros: - No data loss during operation - Could preserve additional metadata

Cons: - Complex delta handling logic - Must handle category deletions - Risk of orphaned records - Significant code changes

Option C: External Summary Cache File¶

Persist summaries to a JSON file outside the database.

Pros: - Survives database issues - Could be version controlled

Cons: - File synchronization complexity - Additional I/O operations - Cache invalidation challenges - Maintenance overhead

Decision¶

Implement Option A: Cache Summaries Before Table Drop.

The implementation will:

Query existing table - At the start of createCategoriesTable(), attempt to query the existing categories table and build a Map<string, string> of category name to summary
Handle first run - If the table doesn't exist, proceed with an empty cache (all categories are new)
Identify new categories - After extracting categories from documents, compute which are genuinely new (not in the cache)
Generate selectively - Call generateCategorySummaries() only for new categories
Merge summaries - Combine cached summaries with newly generated ones
Build records - Use the merged map when creating category records

Consequences¶

Positive¶

90%+ reduction in LLM calls for typical incremental runs
Faster incremental seeding - Proportional to actual new content
Cost savings - Fewer API calls to OpenRouter
Minimal code changes - Localized to one function
Backward compatible - No schema or API changes

Negative¶

One additional DB query per run (negligible performance impact)
Memory usage - All existing summaries held in memory during run (acceptable for ~1000 categories)

Neutral¶

First run behavior unchanged (all categories are new)
Summary quality unchanged (same LLM, same prompts)

Confirmation¶

The optimization will be validated by:

Running seeder with existing database containing categories
Adding a few new documents with 0-2 new categories
Observing log output to confirm only new categories trigger LLM calls
Verifying existing summaries are preserved in the rebuilt table

Implementation¶

Files to modify: - hybrid_fast_seed.ts - Modify createCategoriesTable() function

Changes: 1. Add query for existing category summaries before table drop 2. Filter categories to identify new ones 3. Generate summaries only for new categories 4. Merge cached and new summaries

References¶

src/concepts/summary_generator.ts - Summary generation implementation
ADR-0030: Auto-Extracted Categories - Category extraction design