30. 46 Auto-Extracted Categories¶

Date: 2025-11-19
Status: Accepted
Deciders: concept-rag Engineering Team
Technical Story: Category Search Feature (November 19, 2025)

Sources:
- Planning: 2025-11-19-category-search-feature
- Git Commits: 55ccee3c07e9a72c36a7b9330e3d899c426b6804, 449e52bb75cdbc8f65d381bc8e3bf7d6745169da (November 19, 2025)

Context and Problem Statement¶

The system had rich concept metadata but documents weren't organized into domains/categories [Gap: no domain organization]. Manual category assignment would be time-consuming for 165 documents [Problem: manual work], and categories needed to reflect actual corpus content [Requirement: corpus-driven].

The Core Problem: How to organize 165 documents into meaningful categories that reflect the corpus's actual domains? [Planning: Category extraction strategy]

Decision Drivers:
* 165 documents need categorization [Scope: full corpus]
* Manual categorization impractical [Constraint: time]
* Categories should reflect actual content [Requirement: accurate]
* Concept extraction already provides domain metadata [Opportunity: existing data]
* Need stable, meaningful taxonomy [Requirement: quality]

Alternative Options¶

Option 1: Auto-Extract from Concept Metadata - Use existing concept extraction
Option 2: Manual Curation - Humans assign categories
Option 3: Filename-Based - Parse categories from file paths
Option 4: LLM Classification - Separate LLM call per document
Option 5: Predefined Taxonomy - Force documents into fixed categories

Decision Outcome¶

Chosen option: "Auto-Extract from Concept Metadata (Option 1)", because concept extraction already identifies document domains [ADR-0007], reusing it adds no extra cost, and it yields 46 meaningful categories discovered from actual corpus content.

Category Discovery Process¶

Extraction: [Source: Concept extraction includes categories]

// Concept extraction (Claude Sonnet 4.5) already returns:
{
  primary_concepts: string[],
  technical_terms: string[],
  categories: string[],  // ← Domain categories identified
  related_concepts: string[]
}

[Existing: Part of concept extraction since October 13]

Aggregation: [Source: Category extraction script]

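// Normalize category names the same way wherever they are compared
const normalize = (cat: string): string => cat.toLowerCase().trim();

// 1. Extract all category mentions from concept metadata
const allCategories = new Set<string>();
for (const doc of documents) {
  // Some documents may lack concept metadata, so guard the access
  for (const cat of doc.concept_metadata?.categories ?? []) {
    allCategories.add(normalize(cat));
  }
}

// 2. Count documents per category (compare normalized names, otherwise
//    "Software Engineering" would never match "software engineering")
const categoryStats: Record<string, number> = {};
for (const category of allCategories) {
  categoryStats[category] = documents.filter(doc =>
    (doc.concept_metadata?.categories ?? []).some(cat => normalize(cat) === category)
  ).length;
}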

// 3. Result: 46 unique categories with document counts

[Source: scripts/extract_categories.ts]
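The aggregation can also be done in a single pass over the documents, which avoids re-scanning the corpus once per category. The following is a minimal sketch, not the contents of scripts/extract_categories.ts: the `CatalogDoc` shape, the function names, and the sample data are all assumptions made for illustration.

```typescript
// Hypothetical minimal document shape; real catalog rows carry more fields.
interface CatalogDoc {
  source: string;
  concept_metadata?: { categories?: string[] };
}

// Same normalization as the extraction step: lowercase and trim.
const normalizeCategory = (cat: string): string => cat.toLowerCase().trim();

// Count how many documents mention each category, in one pass.
function countDocumentsPerCategory(docs: CatalogDoc[]): Map<string, number> {
  const counts = new Map<string, number>();
  docs.forEach(doc => {
    // De-duplicate within a document so case variants count only once.
    const perDoc = new Set<string>(
      (doc.concept_metadata?.categories ?? []).map(normalizeCategory)
    );
    perDoc.forEach(cat => counts.set(cat, (counts.get(cat) ?? 0) + 1));
  });
  return counts;
}

// Sort by document count (descending), then by name for a stable listing.
function topCategories(counts: Map<string, number>, limit: number): [string, number][] {
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}

// Illustrative data only, not taken from the real corpus.
const sampleDocs: CatalogDoc[] = [
  { source: "a.pdf", concept_metadata: { categories: ["Distributed Systems", "software engineering "] } },
  { source: "b.pdf", concept_metadata: { categories: ["distributed systems", "Distributed Systems"] } },
  { source: "c.pdf" }, // no concept metadata yet
];

const sampleCounts = countDocumentsPerCategory(sampleDocs);
console.log(topCategories(sampleCounts, 5));
```

De-duplicating per document keeps the statistic a document count rather than a mention count, which matches the "Documents" column in the table of top categories.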

46 Categories Discovered¶

Top Categories: [Source: IMPLEMENTATION-COMPLETE.md, lines 94-107]

Category	Documents	Chunks
embedded systems engineering	5	4,921
software engineering	5	4,074
real-time systems	4	5,007
computer architecture	3	2,660
distributed systems	3	4,561
systems engineering	2	3,766
Plus 40 more categories...

Long Tail: [Source: line 107]
- 36 categories with 1-3 documents each
- Includes: numerical analysis, blockchain technology, mathematical physics, control theory, etc.

Total: 46 unique categories [Source: line 90]

Category Quality¶

Examples: [Real categories from corpus]
- ✅ "software engineering" (broad domain)
- ✅ "distributed systems" (specific subdomain)
- ✅ "embedded systems engineering" (specialized field)
- ✅ "blockchain technology" (emerging tech)
- ✅ "mathematical physics" (interdisciplinary)

Characteristics:
- Domain-appropriate granularity
- Reflect actual corpus content
- Technically meaningful
- Not overly broad or narrow

Consequences¶

Positive:
* Zero additional cost: Categories from existing extraction [Benefit: free]
* 46 categories: Meaningful taxonomy discovered [Result: validated count]
* Corpus-driven: Reflects actual content [Quality: accurate]
* Auto-generated: No manual categorization needed [Efficiency: automatic]
* Statistics available: Document/chunk/concept counts per category [Feature: analytics]
* Hierarchical potential: Can add parent categories later [Extensibility: growth]
* Browsable: Via category_search tool [Feature: accessible] [ADR-0029]

Negative:
* No hierarchy: Flat list (no parent/child initially) [Limitation: flat taxonomy]
* LLM-dependent: Quality depends on Claude's categorization [Dependency: LLM]
* Granularity variance: Some categories very specific, others broad [Inconsistency: varied granularity]
* No consolidation: "software engineering" vs. "software architecture" (separate) [Issue: similar categories]

Neutral:
* 46 is manageable: Not too many, not too few [Size: appropriate]
* Can evolve: Categories can be merged/split later [Process: iterative]

Confirmation¶

Production Statistics: [Source: IMPLEMENTATION-COMPLETE.md, lines 84-107]
- Categories table: 46 unique categories
- Top category: embedded systems engineering (5 documents, 4,921 chunks)
- Corpus ratio: ~3.6 documents per category (165 docs / 46 categories); this is a corpus-wide ratio rather than a mean of the per-category document counts
- Long tail: Most categories have 1-3 documents (specialized domains)

Validation:
- All categories have descriptions
- All have document counts
- All have embeddings for semantic similarity
- Hash-based IDs generated (stable)
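The validation points above could be checked mechanically along the following lines. This is a sketch under stated assumptions: the `CategoryRecord` interface is a reduced, hypothetical version of the categories table row, the function names are invented for illustration, and the 384-dimension check is taken from the schema comment in this ADR.

```typescript
// Reduced, hypothetical view of a categories-table row.
interface CategoryRecord {
  id: number;
  category: string;
  description: string;
  document_count: number;
  embeddings: Float32Array;
}

const EMBEDDING_DIM = 384; // per the schema comment in this ADR

// Return a list of human-readable problems; an empty list means the record passes.
function validateCategory(rec: CategoryRecord): string[] {
  const problems: string[] = [];
  if (rec.description.trim().length === 0) {
    problems.push("missing description");
  }
  if (!Number.isInteger(rec.document_count) || rec.document_count < 1) {
    problems.push("invalid document_count");
  }
  if (rec.embeddings.length !== EMBEDDING_DIM) {
    problems.push("embedding has wrong dimension");
  }
  return problems;
}

// Hash-based IDs should not collide: report any id that appears more than once.
function findDuplicateIds(records: CategoryRecord[]): number[] {
  const seen = new Set<number>();
  const duplicates = new Set<number>();
  records.forEach(rec => {
    if (seen.has(rec.id)) duplicates.add(rec.id);
    seen.add(rec.id);
  });
  return Array.from(duplicates);
}
```

Returning a list of problems instead of throwing on the first one makes it easier to report every defect across all 46 records in a single run.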

Pros and Cons of the Options¶

Option 1: Auto-Extract from Metadata - Chosen¶

Pros:
* Zero additional cost (existing extraction)
* 46 meaningful categories [Validated]
* Corpus-driven (accurate)
* No manual work
* Statistics available
* Production validated

Cons:
* No hierarchy (flat)
* LLM-dependent quality
* Granularity variance
* Some similar categories

Option 2: Manual Curation¶

Human assigns categories to each document.

Pros:
* Perfect accuracy (human judgment)
* Consistent granularity
* Can create hierarchy
* Domain expert knowledge

Cons:
* Time-intensive: Hours for 165 documents [Effort: impractical]
* Subjective: Different people categorize differently [Consistency: variance]
* Not scalable: New documents require manual work [Maintenance: burden]
* Why automate?: Against automation goal [Philosophy: manual work]

Option 3: Filename-Based¶

Parse categories from file paths or naming conventions.

Pros:
* Simple extraction (regex)
* Fast (no LLM)
* Stable (filename-based)

Cons:
* Filename inconsistency: Not all files follow conventions [Problem: unreliable]
* Limited information: Filename doesn't capture full domain [Limitation: incomplete]
* User-dependent: Relies on user organization [Problem: variable quality]
* Shallow: Can't infer categories from content [Gap: surface-level]

Option 4: LLM Classification¶

Separate LLM call specifically for categorization.

Pros:
* Dedicated classification prompt
* Could use hierarchical taxonomy
* Fine-tuned categorization

Cons:
* Additional cost: $0.041 × 165 docs ≈ $6.77 [Cost: unnecessary]
* Already done: Concept extraction includes categories [Redundancy: duplicate]
* Same LLM: Claude already doing this [Duplication: same model]
* Why twice?: Concept metadata sufficient [Logic: redundant]

Option 5: Predefined Taxonomy¶

Force documents into ACM/Dewey/custom fixed taxonomy.

Pros:
* Consistent granularity
* Hierarchical structure
* Standard classification

Cons:
* Mismatch: Fixed taxonomy may not fit corpus [Problem: Procrustean bed]
* Forces categorization: Documents forced into wrong categories [Quality: inaccurate]
* Not corpus-driven: Ignores actual content [Problem: prescriptive]
* Corpus-driven better: Let content determine categories [Philosophy: bottom-up]

Implementation Notes¶

Category Extraction Script¶

Script: scripts/extract_categories.ts [Source: Implementation]

// Load all catalog entries
const docs = await catalogTable.toArray();

// Extract categories from concept metadata
const categories = new Set();
docs.forEach(doc => {
  if (doc.concept_metadata?.categories) {
    doc.concept_metadata.categories.forEach(cat => {
      categories.add(cat.toLowerCase().trim());
    });
  }
});

// Count documents per category
// Generate category descriptions
// Create categories table

Category Table Creation¶

Script: scripts/create_categories_table.ts [Source: Implementation]
- Generates 46 category records
- Hash-based IDs (FNV-1a)
- Embeddings for each category
- Statistics (document_count, chunk_count, concept_count)
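To illustrate why hash-based IDs are stable, here is a minimal sketch of a 32-bit FNV-1a hash applied to a normalized category name. The FNV-1a constants are the standard 32-bit offset basis and prime; whether the project uses the 32-bit variant, hashes UTF-8 bytes, or normalizes the name before hashing is an assumption here (ADR-0027 is the place to confirm), and `fnv1a32` and `categoryId` are hypothetical names.

```typescript
// Standard 32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis (32-bit)
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes (assumed encoding)
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit wrapping multiply
  }
  return hash >>> 0; // reinterpret as an unsigned 32-bit integer
}

// Hypothetical helper: derive the ID from the normalized name so that
// spelling variants of the same category map to the same ID.
function categoryId(name: string): number {
  return fnv1a32(name.toLowerCase().trim());
}

console.log(categoryId("software engineering"));
console.log(categoryId("  Software Engineering ")); // same ID as the line above
```

Because the ID depends only on the category name, rebuilding the table produces the same IDs without any counter or lookup state. The trade-off is that renaming a category changes its ID.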

Categories Table Schema¶

Structure: [Source: FINAL-DESIGN-SUMMARY.md, lines 53-57]

{
  id: number,                // Hash-based (FNV-1a)
  category: string,          // "software engineering"
  description: string,       // Auto-generated description
  parentCategoryId?: number, // For future hierarchy
  aliases: string[],         // Alternative names
  relatedCategories: string[],
  document_count: number,    // 1-5 documents
  chunk_count: number,       // 1K-5K chunks
  concept_count: number,     // Unique concepts in category
  embeddings: Float32Array   // 384-dim category vector
}

Future Enhancements¶

Potential Improvements:
- Merge similar categories ("software engineering" + "software architecture")
- Add hierarchy (parent/child relationships)
- Category descriptions (currently basic)
- Category embeddings for similarity
- Cross-category relationships
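One way the merge step might be approached is to compare category embeddings pairwise and flag pairs above a similarity threshold for human review. The sketch below is an assumption-laden illustration, not a planned implementation: `NamedEmbedding`, `cosineSimilarity`, `mergeCandidates`, and any particular threshold value are hypothetical, and a real threshold would need tuning against the actual 384-dimensional vectors.

```typescript
// Hypothetical slice of a category record: just the name and its vector.
interface NamedEmbedding {
  category: string;
  embeddings: Float32Array;
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / Math.sqrt(normA * normB);
}

// Return every unordered pair whose similarity meets the threshold,
// most similar first. With 46 categories this is only 1,035 comparisons.
function mergeCandidates(
  cats: NamedEmbedding[],
  threshold: number
): [string, string, number][] {
  const pairs: [string, string, number][] = [];
  for (let i = 0; i < cats.length; i++) {
    for (let j = i + 1; j < cats.length; j++) {
      const sim = cosineSimilarity(cats[i].embeddings, cats[j].embeddings);
      if (sim >= threshold) pairs.push([cats[i].category, cats[j].category, sim]);
    }
  }
  return pairs.sort((x, y) => y[2] - x[2]);
}
```

Flagging candidates rather than merging automatically keeps a human in the loop, which seems prudent given that near-identical names can still denote distinct domains.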

ADR-0007: Concept Extraction - Extraction includes categories
ADR-0027: Hash-Based IDs - IDs for categories
ADR-0028: Category Storage - Storage on documents
ADR-0029: Category Search Tools - Tools to browse categories

References¶

Confidence Level: HIGH
Attribution:
- Planning docs: November 19, 2025
- Git commits: 55ccee3c, 449e52bb
- Documented in: IMPLEMENTATION-COMPLETE.md lines 84-107

Traceability: 2025-11-19-category-search-feature