44. Seeding Script Modularization¶
Date: 2025-12-03
Status: Accepted
Sources: - Work Package Plan: 2025-12-03-modularize-hybrid-fast-seed
Context and Problem Statement¶
The hybrid_fast_seed.ts script had grown to 2,665 lines, containing document seeding logic, OCR processing, LanceDB table creation, checkpoint management, and various utility functions. This monolithic structure caused several issues:
- Maintainability: Large file size made navigation and understanding difficult
- Testability: Functions were tightly coupled and not exported for testing
- Code duplication: Embedding generation duplicated
SimpleEmbeddingService - Reusability: OCR and seeding utilities could not be used by other scripts
Decision Drivers:
* Single Responsibility Principle - each module should have one reason to change
* Eliminate code duplication to reduce maintenance burden
* Enable unit testing of extracted utilities
* Create reusable modules for other scripts (e.g., seed_specific.ts)
Alternative Options¶
- Option 1: Incremental extraction - Extract cohesive functional groups one at a time, maintaining backward compatibility
- Option 2: Complete rewrite - Redesign the entire seeding pipeline with a new architecture
- Option 3: Keep monolithic - Leave the file as-is, accept technical debt
Decision Outcome¶
Chosen option: "Option 1: Incremental extraction", because it provides low risk through testable steps, maintains backward compatibility (same CLI interface), and delivers immediate value while allowing future iterations.
Module Structure¶
src/infrastructure/
├── ocr/ # NEW
│ ├── __tests__/tesseract-ocr.test.ts
│ ├── index.ts
│ └── tesseract-ocr.ts # processWithTesseract(), drawOcrProgressBar()
├── lancedb/
│ ├── seeding/ # NEW
│ │ ├── __tests__/seeding-utils.test.ts
│ │ ├── index.ts
│ │ ├── index-utils.ts # calculatePartitions(), createOptimizedIndex()
│ │ └── category-utils.ts # buildCategoryIdMap(), buildCategoryStats()
│ └── utils/
│ └── field-parsers.ts # ENHANCED: parseArrayField() with Arrow Vector support
└── seeding/ # NEW
├── __tests__/seeding.test.ts
├── index.ts
├── document-completeness.ts # checkDocumentCompleteness(), catalogRecordExists()
├── file-discovery.ts # findDocumentFilesRecursively(), getDatabaseSize()
└── string-utils.ts # truncateFilePath(), formatHashDisplay()
Consequences¶
Positive:
* 24% reduction in main script size (2,665 → 2,027 lines)
* 55 new unit tests for extracted functionality
* Eliminated ~50 lines of duplicate embedding code
* OCR, seeding, and LanceDB utilities now independently usable
* Enhanced parseArrayField handles Apache Arrow Vectors
* CLI interface unchanged (backward compatible)
Negative: * Additional modules to maintain (4 new module directories) * Import statements increased in main script * Some functions remain in main script due to complex dependencies
Confirmation¶
- ✅ Tests: 55/55 new tests passing
- ✅ Build: Zero errors
- ✅ Line reduction: 638 lines removed (24%)
Files Created:
1. src/infrastructure/ocr/tesseract-ocr.ts - OCR processing service
2. src/infrastructure/ocr/index.ts - Module exports
3. src/infrastructure/lancedb/seeding/index-utils.ts - Index optimization utilities
4. src/infrastructure/lancedb/seeding/category-utils.ts - Category ID mapping
5. src/infrastructure/lancedb/seeding/index.ts - Module exports
6. src/infrastructure/seeding/document-completeness.ts - Document state checking
7. src/infrastructure/seeding/file-discovery.ts - File system utilities
8. src/infrastructure/seeding/string-utils.ts - String formatting utilities
9. src/infrastructure/seeding/index.ts - Module exports
Files Modified:
1. hybrid_fast_seed.ts - Main seeding script (reduced from 2,665 to 2,027 lines)
2. src/infrastructure/lancedb/utils/field-parsers.ts - Enhanced with Arrow Vector support
Pros and Cons of the Options¶
Option 1: Incremental extraction - Chosen¶
Pros: * Low risk - each extraction is independently testable * Backward compatible - CLI interface unchanged * Immediate value - duplicate code eliminated first * Enables future iterations - more can be extracted later
Cons: * Multiple small modules to maintain * Some tight coupling remains in main script * Not a complete architectural redesign
Option 2: Complete rewrite¶
Pros: * Could achieve cleaner architecture * Opportunity to redesign entire pipeline
Cons: * High risk of regressions * Significant time investment * No immediate value until complete
Option 3: Keep monolithic¶
Pros: * No effort required * No risk of introducing bugs
Cons: * Technical debt continues to grow * Testing remains difficult * Code duplication persists
Future Considerations¶
The following functions remain in hybrid_fast_seed.ts and could be extracted in future iterations:
loadDocumentsWithErrorHandling(~300 lines) - Complex document loading with fallbacksprocessDocuments/processDocumentsParallel(~200 lines) - Document processing orchestrationcreateCategoriesTable(~120 lines) - Category table creationrebuildConceptIndexFromExistingData(~140 lines) - Concept index rebuilding
These were not extracted in this iteration due to: 1. Complex dependencies on script-level state (checkpoint, config variables) 2. Tight coupling with LLM API calls and progress display 3. Diminishing returns - main reduction goal achieved