ADR-0053: Meta Content Detection and Filtering¶
Status¶
Accepted
Date¶
2024-12-26
Context¶
Search results are polluted with non-content text such as document headers, table of contents entries, page numbers, and other meta-content. This reduces retrieval quality by returning chunks that contain query terms but provide no substantive information.
Problem Examples¶
Table of Contents Pollution:
Chunk: "Chapter 5: Dependency Injection............................45"
Score: 0.82 (high BM25 match on exact terms)
Front Matter:
Chunk: "For my wife, who understood when I said
'I need to finish the chapter on dependency injection'"
Score: 0.78
Current State¶
The project already handles some content classification:
- ✅ is_reference - Filters out bibliography/reference sections
- ✅ has_extraction_issues - Filters out chunks with garbled math/OCR issues
- ✅ has_math - Identifies mathematical content
Missing: - ❌ Table of Contents (ToC) detection and filtering - ❌ Front matter (title pages, copyright, dedications) - ❌ Back matter (index, glossary, appendices)
Decision¶
1. New Classification Fields¶
Add boolean fields to the chunks schema:
interface ChunkClassificationFields {
// Existing
is_reference: boolean;
has_extraction_issues: boolean;
has_math: boolean;
// New
is_toc: boolean; // Table of contents entry
is_front_matter: boolean; // Title page, copyright, dedication, preface
is_back_matter: boolean; // Index, glossary, appendices
is_meta_content: boolean; // Aggregate: is_toc OR is_front_matter OR is_back_matter
}
2. MetaContentDetector Class¶
Implement pattern-based detection following the established ReferencesDetector pattern:
export class MetaContentDetector {
// ToC patterns
private readonly TOC_PATTERNS = [
/^(Chapter|Section|\d+\.)\s+.+?\.{3,}\s*\d+$/m, // "Chapter 1...........5"
/^Contents$/im,
/^Table of Contents$/im,
/^\d+\.\d+\s+.+?\s+\d+$/m // "1.2 Topic Name 42"
];
// Front matter patterns
private readonly FRONT_MATTER_PATTERNS = [
/^(Dedication|Preface|Foreword|Acknowledgments)$/im,
/^Copyright\s+©/im,
/^ISBN\s+[\d-]+/im,
/^All rights reserved/im,
/^First (published|edition)/im
];
// Back matter patterns
private readonly BACK_MATTER_PATTERNS = [
/^(Index|Glossary|Appendix\s+[A-Z]?)$/im,
/^About the Author/im
];
analyze(text: string, pageNumber: number, totalPages: number): MetaContentAnalysis;
}
3. Detection Heuristics¶
| Content Type | Detection Method |
|---|---|
| ToC | Pattern: Title.....Page#, "Contents" header, high density of ToC-like lines |
| Front Matter | Keywords + page position (first 10-15% of document) |
| Back Matter | Keywords + page position (last 10-15% of document) |
4. Search Filter Extension¶
Add excludeMetaContent option to search:
interface HybridSearchOptions {
excludeReferences?: boolean; // Existing
excludeExtractionIssues?: boolean; // Existing
excludeMetaContent?: boolean; // NEW - excludes ToC, front/back matter
}
Default behavior: excludeMetaContent: true for chunk searches.
5. Migration Script¶
Create scripts/populate-meta-content.ts to classify existing chunks, following the pattern from populate-concept-density.ts.
Consequences¶
Positive¶
- Improved Retrieval Quality: Search results no longer polluted with ToC entries
- Consistent Pattern: Follows established
ReferencesDetectorapproach - Simple Heuristics: Fast pattern matching, no ML dependencies
- Backward Compatible: New fields default to
false - Single Filter Option:
excludeMetaContentprovides simple UX
Negative¶
- Schema Expansion: Four new boolean fields per chunk
- False Positives Risk: Aggressive patterns could mark content as ToC
- Migration Required: Existing databases need script to classify chunks
Neutral¶
- Running Headers Deferred: Cross-page analysis complexity deferred to future work
- Storage Impact: Minimal - 4 booleans per chunk
Implementation¶
Files Changed¶
| File | Change |
|---|---|
src/infrastructure/document-loaders/meta-content-detector.ts |
NEW - Detection class |
hybrid_fast_seed.ts |
Integrate detection into pipeline |
src/domain/interfaces/services/hybrid-search-service.ts |
Add excludeMetaContent |
src/infrastructure/search/conceptual-hybrid-search-service.ts |
Implement filter |
src/domain/models/search-result.ts |
Add to SearchQuery |
scripts/populate-meta-content.ts |
NEW - Migration script |
Test Coverage¶
- Unit tests for pattern matching accuracy
- Integration tests for filter functionality
- Manual validation on diverse document corpus
Alternatives Considered¶
1. ML-Based Classification¶
Train a model to classify chunks as content vs meta-content.
Rejected: Adds latency, complexity, and dependencies. Heuristic patterns are sufficient and deterministic.
2. Delete Meta Content Chunks¶
Remove ToC/front matter chunks entirely rather than marking them.
Rejected: Irreversible data loss. Marking preserves information for users who may want it.
3. Running Header Detection (Phase 1)¶
Detect running headers/footers by analyzing repeated text across pages.
Deferred: Requires cross-page analysis which adds complexity. Can be added in future phase.