ADR-0049: Incremental Category Summary Generation
Status
Accepted
Context
During document seeding, the system generates LLM-powered summaries for each category extracted from documents. These summaries provide concise descriptions that help users understand what each category covers.
Technical Forces:
- The createCategoriesTable() function currently regenerates summaries for ALL categories on every seeding run
- For a typical library with ~700 categories, this requires ~24 LLM API calls (30 categories per batch)
- Each batch is rate-limited to 1 second minimum, adding ~24+ seconds to every incremental run
- The existing categories table is dropped before querying, losing all previously generated summaries
Business Forces:
- LLM API calls incur cost (OpenRouter billing)
- Incremental seeding should be fast when adding only a few documents
- User experience suffers when adding 5 documents takes as long as initial seeding
Operational Forces:
- Most incremental runs add 0-5 new categories out of hundreds existing
- Regenerating identical summaries wastes resources without providing value
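The call counts quoted above follow from simple arithmetic on the batch size and the rate limit. The sketch below reproduces them; `estimateSummaryCost` is a hypothetical helper written for illustration, not a function in the seeder, and the delay it returns is only the rate-limit lower bound (LLM response time comes on top).

```typescript
// Hypothetical helper: estimates how many LLM batch calls a run needs and the
// minimum delay imposed by the per-batch rate limit.
function estimateSummaryCost(
  categoryCount: number,
  batchSize = 30, // batch size stated in this ADR
  minSecondsPerBatch = 1, // rate limit stated in this ADR
): { batches: number; minDelaySeconds: number } {
  const batches = Math.ceil(categoryCount / batchSize);
  return { batches, minDelaySeconds: batches * minSecondsPerBatch };
}

// Regenerating everything for ~700 categories: ceil(700 / 30) = 24 batches.
const fullRun = estimateSummaryCost(700);

// Summarizing only 5 new categories fits in a single batch.
const incrementalRun = estimateSummaryCost(5);
```

Under these assumptions, an incremental run that adds five categories needs 1 batch instead of 24.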
Decision Drivers
- Efficiency - Avoid redundant LLM calls for unchanged data
- Cost reduction - Minimize API usage when existing data is valid
- Speed - Incremental runs should be proportional to new content
- Simplicity - Solution should be straightforward to implement and maintain
- Reliability - Must handle edge cases (first run, empty categories)
Considered Options
Option A: Cache Summaries Before Table Drop (Selected)
Query the existing categories table before dropping it to extract all category→summary mappings. Generate LLM summaries only for categories not found in the cache.
Pros:
- Simple implementation (~30 lines of code)
- Maintains existing table recreation pattern
- No schema changes required
- 90%+ reduction in LLM calls for incremental runs
Cons:
- Requires one additional database query per run
- Summaries are only cached in memory during the run
Option B: Update Table In-Place
Modify existing records and only insert new categories without dropping the table.
Pros:
- No data loss during operation
- Could preserve additional metadata
Cons:
- Complex delta handling logic
- Must handle category deletions
- Risk of orphaned records
- Significant code changes
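To illustrate the delta handling that Option B would require, the sketch below classifies categories into three groups instead of the single "new" group that Option A needs. `computeCategoryDelta` and `CategoryDelta` are hypothetical names used only for this illustration; the actual write operations against the table are left out because they depend on the database client.

```typescript
// Hypothetical result shape for an in-place update.
interface CategoryDelta {
  toInsert: string[]; // extracted from documents, not yet in the table
  toDelete: string[]; // in the table, no longer present in any document
  unchanged: string[]; // present in both
}

function computeCategoryDelta(
  existing: string[],
  extracted: string[],
): CategoryDelta {
  const existingSet = new Set(existing);
  const extractedSet = new Set(extracted);
  return {
    toInsert: Array.from(extractedSet).filter((n) => !existingSet.has(n)),
    toDelete: Array.from(existingSet).filter((n) => !extractedSet.has(n)),
    unchanged: Array.from(extractedSet).filter((n) => existingSet.has(n)),
  };
}

const delta = computeCategoryDelta(
  ["physics", "history"],
  ["physics", "biology"],
);
```

Here `delta.toDelete` contains `"history"`, the kind of row that becomes an orphaned record if the deletion step is skipped or fails.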
Option C: External Summary Cache File
Persist summaries to a JSON file outside the database.
Pros:
- Survives database issues
- Could be version controlled
Cons:
- File synchronization complexity
- Additional I/O operations
- Cache invalidation challenges
- Maintenance overhead
Decision
Implement Option A: Cache Summaries Before Table Drop.
The implementation will:
- Query existing table - At the start of createCategoriesTable(), attempt to query the existing categories table and build a Map<string, string> of category name to summary
- Handle first run - If the table doesn't exist, proceed with an empty cache (all categories are new)
- Identify new categories - After extracting categories from documents, compute which are genuinely new (not in the cache)
- Generate selectively - Call generateCategorySummaries() only for new categories
- Merge summaries - Combine cached summaries with newly generated ones
- Build records - Use the merged map when creating category records
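The steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the code in hybrid_fast_seed.ts: `resolveSummaries`, `SummaryDeps`, `CategoryRow`, and `loadExistingRows` are hypothetical names, and the signature given to `generateCategorySummaries` here is assumed. The database query and the LLM call are injected so the sketch does not depend on a particular client.

```typescript
type SummaryMap = Map<string, string>;

// Assumed row shape; the real table columns may differ.
interface CategoryRow {
  category: string;
  summary: string;
}

interface SummaryDeps {
  // Hypothetical: reads the existing categories table, throws if it is missing.
  loadExistingRows: () => Promise<CategoryRow[]>;
  // Stand-in for generateCategorySummaries(); the real signature may differ.
  generateCategorySummaries: (names: string[]) => Promise<SummaryMap>;
}

async function resolveSummaries(
  extracted: string[],
  deps: SummaryDeps,
): Promise<{ summaries: SummaryMap; generated: string[] }> {
  // Query existing table / handle first run: any failure leaves the cache empty.
  // A real implementation may want to distinguish "table missing" from other errors.
  const cache: SummaryMap = new Map();
  try {
    for (const row of await deps.loadExistingRows()) {
      // Assumption: rows with an empty summary are treated as uncached.
      if (row.summary) cache.set(row.category, row.summary);
    }
  } catch {
    // First run: every category is treated as new.
  }

  // Identify new categories: those extracted now but absent from the cache.
  const unique = Array.from(new Set(extracted));
  const newNames = unique.filter((name) => !cache.has(name));

  // Generate selectively: skip the LLM entirely when nothing is new.
  const fresh: SummaryMap =
    newNames.length > 0
      ? await deps.generateCategorySummaries(newNames)
      : new Map<string, string>();

  // Merge summaries, keeping only categories that still exist in the documents.
  const summaries: SummaryMap = new Map();
  for (const name of unique) {
    const summary = cache.get(name) ?? fresh.get(name);
    if (summary !== undefined) summaries.set(name, summary);
  }
  return { summaries, generated: newNames };
}
```

The merged map is what the record-building step would read from. Because the merge iterates over the extracted categories rather than the cache, summaries for categories that no longer appear in any document are dropped, which matches the drop-and-recreate behavior of the table.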
Consequences
Positive
- 90%+ reduction in LLM calls for typical incremental runs
- Faster incremental seeding - Proportional to actual new content
- Cost savings - Fewer API calls to OpenRouter
- Minimal code changes - Localized to one function
- Backward compatible - No schema or API changes
Negative
- One additional DB query per run (negligible performance impact)
- Memory usage - All existing summaries held in memory during run (acceptable for ~1000 categories)
Neutral
- First run behavior unchanged (all categories are new)
- Summary quality unchanged (same LLM, same prompts)
Confirmation
The optimization will be validated by:
- Running seeder with existing database containing categories
- Adding a few new documents with 0-2 new categories
- Observing log output to confirm only new categories trigger LLM calls
- Verifying existing summaries are preserved in the rebuilt table
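The last check, that existing summaries survive the rebuild, could be automated by comparing snapshots of the category→summary mapping taken before and after a seeding run. The helper below is a hypothetical sketch; how the snapshots are read from the database is left out.

```typescript
// Hypothetical check: lists categories whose summary differs between the
// snapshot taken before the run and the one taken after it.
function findUnpreservedSummaries(
  before: Map<string, string>,
  after: Map<string, string>,
): string[] {
  const changed: string[] = [];
  before.forEach((summary, name) => {
    // Categories removed from the library are not expected in the rebuilt table.
    if (after.has(name) && after.get(name) !== summary) changed.push(name);
  });
  return changed;
}

const beforeRun = new Map([["physics", "Covers physics topics."]]);
const afterRun = new Map([
  ["physics", "Covers physics topics."],
  ["biology", "Covers biology topics."],
]);

// An empty result means every surviving category kept its cached summary.
const unpreserved = findUnpreservedSummaries(beforeRun, afterRun);
```

A non-empty result would indicate that a cached summary was regenerated or overwritten, which the seeding log should also show as an unexpected LLM call.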
Implementation
Files to modify:
- hybrid_fast_seed.ts - Modify createCategoriesTable() function
Changes:
1. Add query for existing category summaries before table drop
2. Filter categories to identify new ones
3. Generate summaries only for new categories
4. Merge cached and new summaries
References
- src/concepts/summary_generator.ts - Summary generation implementation
- ADR-0030: Auto-Extracted Categories - Category extraction design