15. Formal Concept Model Definition¶
Date: 2025-11-13
Status: Accepted
Deciders: Engineering Team
Technical Story: Concept Extraction Enhancement (November 13, 2025)
Sources: - Planning: 2025-11-13-concept-extraction-enhancement
Context and Problem Statement¶
While concept extraction was working (since October 13) [ADR-0007], there was no formal definition of what constitutes a "concept" in the system [Gap: definition missing]. This led to potential inconsistencies in extraction quality, unclear inclusion/exclusion criteria, and difficulty evaluating whether extracted concepts met quality standards [Problem: no evaluation criteria].
The Core Problem: What exactly IS a concept in the concept-RAG system, and how do we ensure all LLM agents extract concepts consistently? [Planning: FORMAL_CONCEPT_DEFINITION.md]
Decision Drivers: * Need consistent extraction across different LLM calls [Requirement: consistency] * Evaluation criteria for concept quality [Requirement: quality assessment] * Clear guidelines for inclusion/exclusion [Requirement: explicit criteria] * Training reference for future LLM agents [Requirement: documentation] * Foundation for system improvements [Architecture: formal specification]
Alternative Options¶
- Option 1: Formal Definition with Inclusion/Exclusion Rules - Explicit specification
- Option 2: Example-Based - Show examples of good/bad concepts
- Option 3: No Formal Definition - Let LLM decide implicitly
- Option 4: Schema-Based - Define via type structure only
- Option 5: Taxonomy-Based - Define by category membership
Decision Outcome¶
Chosen option: "Formal Definition with Inclusion/Exclusion Rules (Option 1)", because it provides the clearest guidance for LLM extraction, enables quality evaluation, and serves as authoritative documentation for the entire system.
The Formal Definition¶
Canonical Text: [Source: FORMAL_CONCEPT_DEFINITION.md, line 9; implemented in AGENTS.md]
A concept is a uniquely identified, abstract idea packaged with its names, definition, distinguishing features, relations, and detection cues, enabling semantic matching and disambiguated retrieval across texts.
Key Components¶
9 Essential Elements: [Source: AGENTS.md created November 13, 2025]
- Uniquely Identified - Distinct from other concepts
- Abstract Idea - Not concrete instances or examples
- Names - Multiple names/terms for the concept
- Definition - Clear meaning and scope
- Distinguishing Features - What makes it unique
- Relations - Connections to other concepts
- Detection Cues - How to recognize in text
- Semantic Matching - Enables search and retrieval
- Disambiguated - Clear boundaries vs. related concepts
Inclusion Criteria¶
✅ INCLUDE: [Source: README.md, line 137; AGENTS.md]
- Domain terms (e.g., "consensus algorithm")
- Theories (e.g., "Elliott Wave Theory")
- Methodologies (e.g., "test-driven development")
- Multi-word conceptual phrases (e.g., "separation of concerns")
- Phenomena (e.g., "race condition")
- Abstract principles (e.g., "single responsibility")
Exclusion Criteria¶
❌ EXCLUDE: [Source: README.md, line 138; AGENTS.md]
- Temporal descriptions ("in 2020", "during the war")
- Action phrases ("should implement", "must configure")
- Suppositions ("might be", "could potentially")
- Proper names (people, places, organizations)
- Dates and numbers
- Generic words without conceptual meaning
Implementation¶
Files Created/Updated: [Source: FORMAL_CONCEPT_DEFINITION.md, lines 13-38]
AGENTS.md(Project Root) [Source:FORMAL_CONCEPT_DEFINITION.md, lines 13-24]- Formal definition
- Key components breakdown
- Extraction guidelines
-
Integration instructions
-
src/concepts/concept_extractor.ts[Source:FORMAL_CONCEPT_DEFINITION.md, lines 26-38] - Line 102: Multi-pass extraction prompt updated
- Line 197: Single-pass extraction prompt updated
- Both include formal definition at beginning
Prompt Integration: [Source: concept_extractor.ts, lines 102 and 197]
const prompt = `
FORMAL DEFINITION:
A concept is a uniquely identified, abstract idea packaged with its names, definition, distinguishing features, relations, and detection cues, enabling semantic matching and disambiguated retrieval across texts.
[... rest of extraction prompt ...]
`;
Consequences¶
Positive:
* Consistency: All LLM agents receive same definition [Benefit: FORMAL_CONCEPT_DEFINITION.md, line 42-43]
* Quality: Clear guidelines improve extraction quality [Benefit: lines 48-54]
* Evaluation: Can assess if extractions meet criteria [Benefit: quality metrics possible]
* Documentation: Authoritative reference for developers [Benefit: lines 57-62]
* Training: Future agents can reference formal model [Benefit: onboarding]
* Disambiguation: Explicit guidance on concept boundaries [Benefit: clarity]
* System alignment: All components reference same model [Benefit: lines 64-70]
Negative: * Rigidity: Definition may need evolution as system grows [Risk: future constraints] * Interpretation: LLMs may still interpret definition differently [Limitation: LLM variance] * Verification: Hard to programmatically verify compliance [Challenge: quality assurance] * Maintenance: Definition must be kept synchronized across prompts [Burden: consistency maintenance]
Neutral: * Backward compatibility: Doesn't invalidate existing concepts [Impact: historical data] * Iterative refinement: Definition can be improved over time [Process: evolutionary]
Confirmation¶
Validation: [Source: FORMAL_CONCEPT_DEFINITION.md, lines 82-88]
- ✅ AGENTS.md created with formal definition
- ✅ Chunk extraction prompt updated (line 102)
- ✅ Single-pass extraction prompt updated (line 197)
- ✅ TypeScript compiled successfully
- ✅ No linting errors
- ✅ Planning documentation created
Production Impact: - Clearer concept extraction from Nov 13 onwards - Consistent quality across documents indexed after formalization - Reference point for evaluating existing concepts
Pros and Cons of the Options¶
Option 1: Formal Definition with Rules - Chosen¶
Pros: * Crystal clear guidelines [Source: definition text] * Explicit inclusion/exclusion rules [Source: AGENTS.md] * Evaluatable quality * Authoritative documentation * Improves extraction consistency [Validated: FORMAL_CONCEPT_DEFINITION.md]
Cons: * May need evolution * LLM interpretation variance * Hard to verify programmatically * Maintenance burden
Option 2: Example-Based¶
Provide good/bad examples instead of formal definition.
Pros: * Concrete and easy to understand * Shows real cases * Less abstract
Cons: * Less precise: Examples don't cover all cases [Limitation: incomplete] * Ambiguity: What about edge cases not in examples? * Maintenance: Must update examples as system evolves * Inconsistent: Different agents may generalize differently from examples
Option 3: No Formal Definition¶
Let LLM decide what concepts are based on its training.
Pros: * Zero effort (current state before Nov 13) * LLM uses built-in understanding * Flexible
Cons: * Inconsistent: Different calls may extract differently [Problem: variance] * No quality criteria: Can't evaluate extraction quality [Gap: no standards] * Unclear expectations: Developers don't know what to expect [Documentation: lacking] * This was the problem: Why formalization was needed [History: pre-Nov-13 state]
Option 4: Schema-Based¶
Define via TypeScript interfaces only.
Pros: * Type-safe * Programmatically enforced * IDE support
Cons: * Doesn't guide extraction: LLM doesn't see TypeScript types [Gap: not in prompts] * Structure without semantics: Types show "what" not "why" * Incomplete: Doesn't specify inclusion/exclusion criteria
Option 5: Taxonomy-Based¶
Define concepts by their position in taxonomy.
Pros: * Hierarchical organization * Clear categories * Navigable structure
Cons: * Doesn't define concept itself: What makes something a concept? [Gap: fundamental question] * Circular: Need to define concepts before building taxonomy * Rigid: Forces concepts into predetermined categories * Over-constrains: Some concepts span categories
Implementation Notes¶
AGENTS.md Structure¶
File Location: Project root [Source: FORMAL_CONCEPT_DEFINITION.md, line 15]
Sections: [Source: lines 15-24] 1. Formal definition 2. Key components breakdown (9 elements) 3. Document parsing guidelines 4. Concept extraction process 5. Integration guidelines 6. Usage instructions for system components
Integration in Extraction¶
Both Prompts Updated: [Source: lines 30-38] - Multi-pass extraction (line 102 of concept_extractor.ts) - Single-pass extraction (line 197 of concept_extractor.ts)
Format:
Future Enhancements¶
Optional Improvements: [Source: FORMAL_CONCEPT_DEFINITION.md, lines 72-80]
1. Structured concept schema (capture relations explicitly)
2. Concept validation logic
3. Relation extraction (hierarchical, associative, causal)
4. Detection cue database
5. Quality metrics
Related Decisions¶
- ADR-0007: Concept Extraction - Extraction process formalized
- ADR-0009: Three-Table Architecture - Concept storage
- ADR-0008: WordNet Integration - Semantic relationships
References¶
Related Decisions¶
Confidence Level: HIGH Attribution: - Planning docs: November 13, 2024 - Documented in: FORMAL_CONCEPT_DEFINITION.md
Traceability: 2025-11-13-concept-extraction-enhancement