Skip to content

44. Seeding Script Modularization

Date: 2025-12-03
Status: Accepted

Sources: - Work Package Plan: 2025-12-03-modularize-hybrid-fast-seed

Context and Problem Statement

The hybrid_fast_seed.ts script had grown to 2,665 lines, containing document seeding logic, OCR processing, LanceDB table creation, checkpoint management, and various utility functions. This monolithic structure caused several issues:

  • Maintainability: Large file size made navigation and understanding difficult
  • Testability: Functions were tightly coupled and not exported for testing
  • Code duplication: Embedding generation duplicated SimpleEmbeddingService
  • Reusability: OCR and seeding utilities could not be used by other scripts

Decision Drivers: * Single Responsibility Principle - each module should have one reason to change * Eliminate code duplication to reduce maintenance burden * Enable unit testing of extracted utilities * Create reusable modules for other scripts (e.g., seed_specific.ts)

Alternative Options

  • Option 1: Incremental extraction - Extract cohesive functional groups one at a time, maintaining backward compatibility
  • Option 2: Complete rewrite - Redesign the entire seeding pipeline with a new architecture
  • Option 3: Keep monolithic - Leave the file as-is, accept technical debt

Decision Outcome

Chosen option: "Option 1: Incremental extraction", because it provides low risk through testable steps, maintains backward compatibility (same CLI interface), and delivers immediate value while allowing future iterations.

Module Structure

src/infrastructure/
├── ocr/                              # NEW
│   ├── __tests__/tesseract-ocr.test.ts
│   ├── index.ts
│   └── tesseract-ocr.ts              # processWithTesseract(), drawOcrProgressBar()
├── lancedb/
│   ├── seeding/                      # NEW
│   │   ├── __tests__/seeding-utils.test.ts
│   │   ├── index.ts
│   │   ├── index-utils.ts            # calculatePartitions(), createOptimizedIndex()
│   │   └── category-utils.ts         # buildCategoryIdMap(), buildCategoryStats()
│   └── utils/
│       └── field-parsers.ts          # ENHANCED: parseArrayField() with Arrow Vector support
└── seeding/                          # NEW
    ├── __tests__/seeding.test.ts
    ├── index.ts
    ├── document-completeness.ts      # checkDocumentCompleteness(), catalogRecordExists()
    ├── file-discovery.ts             # findDocumentFilesRecursively(), getDatabaseSize()
    └── string-utils.ts               # truncateFilePath(), formatHashDisplay()

Consequences

Positive: * 24% reduction in main script size (2,665 → 2,027 lines) * 55 new unit tests for extracted functionality * Eliminated ~50 lines of duplicate embedding code * OCR, seeding, and LanceDB utilities now independently usable * Enhanced parseArrayField handles Apache Arrow Vectors * CLI interface unchanged (backward compatible)

Negative: * Additional modules to maintain (4 new module directories) * Import statements increased in main script * Some functions remain in main script due to complex dependencies

Confirmation

  • Tests: 55/55 new tests passing
  • Build: Zero errors
  • Line reduction: 638 lines removed (24%)

Files Created: 1. src/infrastructure/ocr/tesseract-ocr.ts - OCR processing service 2. src/infrastructure/ocr/index.ts - Module exports 3. src/infrastructure/lancedb/seeding/index-utils.ts - Index optimization utilities 4. src/infrastructure/lancedb/seeding/category-utils.ts - Category ID mapping 5. src/infrastructure/lancedb/seeding/index.ts - Module exports 6. src/infrastructure/seeding/document-completeness.ts - Document state checking 7. src/infrastructure/seeding/file-discovery.ts - File system utilities 8. src/infrastructure/seeding/string-utils.ts - String formatting utilities 9. src/infrastructure/seeding/index.ts - Module exports

Files Modified: 1. hybrid_fast_seed.ts - Main seeding script (reduced from 2,665 to 2,027 lines) 2. src/infrastructure/lancedb/utils/field-parsers.ts - Enhanced with Arrow Vector support

Pros and Cons of the Options

Option 1: Incremental extraction - Chosen

Pros: * Low risk - each extraction is independently testable * Backward compatible - CLI interface unchanged * Immediate value - duplicate code eliminated first * Enables future iterations - more can be extracted later

Cons: * Multiple small modules to maintain * Some tight coupling remains in main script * Not a complete architectural redesign

Option 2: Complete rewrite

Pros: * Could achieve cleaner architecture * Opportunity to redesign entire pipeline

Cons: * High risk of regressions * Significant time investment * No immediate value until complete

Option 3: Keep monolithic

Pros: * No effort required * No risk of introducing bugs

Cons: * Technical debt continues to grow * Testing remains difficult * Code duplication persists

Future Considerations

The following functions remain in hybrid_fast_seed.ts and could be extracted in future iterations:

  • loadDocumentsWithErrorHandling (~300 lines) - Complex document loading with fallbacks
  • processDocuments / processDocumentsParallel (~200 lines) - Document processing orchestration
  • createCategoriesTable (~120 lines) - Category table creation
  • rebuildConceptIndexFromExistingData (~140 lines) - Concept index rebuilding

These were not extracted in this iteration due to: 1. Complex dependencies on script-level state (checkpoint, config variables) 2. Tight coupling with LLM API calls and progress display 3. Diminishing returns - main reduction goal achieved