
Commit 19041a6

Dragon-AI Agent and claude committed

Add LinkML validator documentation for hallucination guardrails

Extends existing hallucination prevention documentation to cover:

- Distinction between term validation (ID + label) and reference validation (quote + citation)
- Core concepts and principles for both validation approaches
- When to use each type of validation
- Practical examples of text excerpt validation
- Implementation details for linkml-term-validator and linkml-reference-validator
- Integration guidance for using both tools together

Focuses on concepts with links to implementation-specific documentation per feedback from @cmungall in #41. Addresses #51

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent 11266f0 commit 19041a6

File tree

1 file changed: +145 -4 lines changed

docs/how-tos/make-ids-hallucination-resistant.md

Lines changed: 145 additions & 4 deletions
@@ -76,10 +76,38 @@ publication:

# Would catch mismatches like:
publication:
  pmid: 10802651
  title: "Some other paper title"  # Wrong title for this PMID
```

### Text Excerpts from Publications

When your curation includes quoted text or supporting evidence from papers, you can validate that the text actually appears in the cited source:

```yaml
# This would pass validation
annotation:
  term_id: GO:0005634
  supporting_text: "The protein localizes to the nucleus during cell division"
  reference: PMID:12345678
# Validation checks that this exact text appears in PMID:12345678

# This would fail - text doesn't appear in the paper
annotation:
  term_id: GO:0005634
  supporting_text: "Made-up description that sounds plausible"
  reference: PMID:12345678
```

The validator supports standard editorial conventions:

```yaml
# These are valid - bracketed clarifications and ellipses are allowed
annotation:
  supporting_text: "The protein [localizes] to the nucleus...during cell division"
# Matches: "The protein to the nucleus early during cell division"
```
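
To make the matching rules concrete, here is a minimal sketch of the idea in plain Python. It is not the validator's implementation - the normalization details (case, whitespace) are assumptions - but it shows how bracketed clarifications and ellipses can be handled with deterministic substring checks rather than fuzzy matching:

```python
import re

def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase (an illustrative choice, not the tool's)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def excerpt_matches(supporting_text: str, source_text: str) -> bool:
    """Deterministically check that a quoted excerpt appears in the source.

    Editorial conventions:
    - [bracketed] words are editorial additions, so they are dropped before matching
    - "..." marks omitted source text, so each remaining fragment must appear in order
    """
    cleaned = re.sub(r"\[[^\]]*\]", " ", supporting_text)
    source = _normalize(source_text)
    position = 0
    for fragment in re.split(r"\.\.\.|…", cleaned):
        fragment = _normalize(fragment)
        if not fragment:
            continue
        found = source.find(fragment, position)
        if found == -1:
            return False  # fragment not found - treat as a potential hallucination
        position = found + len(fragment)
    return True

# The editorial-conventions example from above
print(excerpt_matches(
    "The protein [localizes] to the nucleus...during cell division",
    "The protein to the nucleus early during cell division",
))  # True
```

The important property is that a fabricated quote fails outright; there is no similarity threshold for an AI-generated approximation to sneak under.
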
## You Need Tooling for This

This pattern only works if you have validation tools that can actually check the identifiers against authoritative sources. You need:
@@ -89,19 +117,66 @@ This pattern only works if you have validation tools that can actually check the
3. **Label matching**: Compare provided labels against canonical ones
4. **Consistency checking**: Make sure everything matches up

## Validation Concepts

There are two complementary approaches to preventing hallucinations in AI-assisted curation:

### 1. Term Validation (ID + Label Checking)

This is the approach we've been discussing: validating that identifiers and their labels are consistent with authoritative ontology sources. The key concept is **dual verification** - requiring both the ID and its canonical label makes it exponentially harder for an AI to accidentally fabricate a valid combination.

**Core principles:**
- Validate term IDs against ontology sources to ensure they exist
- Verify labels match the canonical labels from those sources
- Check consistency between related terms in your data
- Support dynamic enum validation for flexible controlled vocabularies

**When to use this:**
- You're working with ontology terms (GO, HP, MONDO, etc.)
- You're handling gene identifiers, chemical compounds, or other standardized entities
- You need to validate that AI-generated annotations use real, correctly-labeled terms
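
As a rough illustration of dual verification in code, here is a minimal sketch using OAK (the Ontology Access Kit, covered under "Useful APIs" below). The adapter selector and the case-insensitive comparison are illustrative assumptions, not a prescription:

```python
from oaklib import get_adapter

# One adapter per ontology; "sqlite:obo:go" fetches a local copy of GO on first use
go_adapter = get_adapter("sqlite:obo:go")

def check_term(adapter, curie: str, claimed_label: str) -> bool:
    """Dual verification: the ID must exist AND its canonical label must match."""
    canonical = adapter.label(curie)  # None if the identifier is unknown
    if canonical is None:
        print(f"{curie}: unknown identifier (possible hallucination)")
        return False
    if canonical.strip().lower() != claimed_label.strip().lower():
        print(f"{curie}: label mismatch - canonical is {canonical!r}, got {claimed_label!r}")
        return False
    return True

check_term(go_adapter, "GO:0005634", "nucleus")        # True
check_term(go_adapter, "GO:0005634", "mitochondrion")  # False: real ID, wrong label
```

Because the AI has to produce both halves and each is checked independently against the source ontology, a fabricated ID and a mislabeled real ID both fail.
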
### 2. Reference Validation (Quote + Citation Checking)

A complementary approach validates that text excerpts or quotes in your data actually appear in their cited sources. This prevents AI from inventing supporting text or misattributing quotes to publications.

**Core principles:**
- Fetch the actual publication content from authoritative sources
- Perform deterministic substring matching (not fuzzy matching)
- Support legitimate editorial conventions (bracketed clarifications, ellipses)
- Reject any text that can't be verified in the source

**When to use this:**
- Your curation workflow includes extracting text from publications
- You're building datasets with quoted material and citations
- AI systems are summarizing or extracting information from papers
- You need to verify that supporting text for annotations comes from real sources
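
Here is a minimal sketch of the fetch-and-check step, using NCBI's E-utilities to pull an abstract as plain text. Treat it as an illustration rather than a hardened client: real workflows also need rate limiting, caching, and a full-text source such as PMC for quotes that aren't in the abstract:

```python
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_abstract(pmid: str) -> str:
    """Fetch the plain-text abstract for a PMID from NCBI E-utilities."""
    params = urllib.parse.urlencode(
        {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    )
    with urllib.request.urlopen(f"{EFETCH}?{params}") as response:
        return response.read().decode("utf-8")

def excerpt_in_abstract(supporting_text: str, pmid: str) -> bool:
    """Deterministic check: the quote must appear verbatim (modulo whitespace) in the source."""
    source = " ".join(fetch_abstract(pmid).split()).lower()
    excerpt = " ".join(supporting_text.split()).lower()
    return excerpt in source

# A fabricated quote simply is not found in the paper it cites
print(excerpt_in_abstract("Made-up description that sounds plausible", "12345678"))
```
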
### Why Both Matter

These validation approaches protect against different types of hallucinations:
- **Term validation** prevents fabricated identifiers and misapplied terms
- **Reference validation** prevents fabricated quotes and misattributed text

For comprehensive AI guardrails, you often need both. For example, when curating gene-disease associations, you might validate:
1. That the gene IDs and disease term IDs are real and correctly labeled
2. That the supporting text cited from a paper actually appears in that paper

### Useful APIs for Validation

- **OLS (Ontology Lookup Service)**: EBI's comprehensive API for biomedical ontologies
- **OAK (Ontology Access Kit)**: Python library that can work with multiple ontology sources
- **PubMed APIs**: For validating PMIDs and retrieving titles
- **PMC (PubMed Central)**: For accessing full-text content to validate excerpts
- **Individual ontology APIs**: Many ontologies have their own REST APIs

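For example, the wrong-title check from the earlier YAML snippet can be backed by PubMed's esummary endpoint; a minimal sketch, with no retries or API-key handling:

```python
import json
import urllib.parse
import urllib.request

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def pmid_title(pmid: str):
    """Return the canonical title for a PMID, or None if it doesn't resolve."""
    params = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "json"})
    with urllib.request.urlopen(f"{ESUMMARY}?{params}") as response:
        payload = json.load(response)
    record = payload.get("result", {}).get(pmid, {})
    return record.get("title") or None

# Compare the AI-supplied title against the canonical one before accepting the record
print(pmid_title("10802651"))
```
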
### Implementation Notes

- **Cache responses** to avoid hitting APIs repeatedly for the same IDs or references
- **Handle network failures** gracefully - you don't want validation failures to break your workflow
- **Consider performance** - real-time validation can slow things down, so you might need to batch or background the checks
- **Plan for errors** - decide how to handle cases where validation fails (reject, flag for review, etc.)
- **Use deterministic validation** - avoid fuzzy matching that might accept AI-generated approximations
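
A sketch of how the first two notes fit together - cache lookups so repeated identifiers cost one API call, and treat "couldn't validate" differently from "validation failed". The function and enum names are hypothetical, not from any particular library:

```python
from enum import Enum
from functools import lru_cache

class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"            # definitely wrong: reject or fix
    UNVERIFIABLE = "unverifiable"  # lookup failed: flag for review, don't silently pass

@lru_cache(maxsize=10_000)
def lookup_label(curie: str):
    """Stand-in for a real ontology lookup (OLS, OAK, a local database, ...).
    lru_cache means each distinct ID costs at most one network call per run."""
    return {"GO:0005634": "nucleus"}.get(curie)  # placeholder data for the sketch

def validate_term(curie: str, claimed_label: str) -> Verdict:
    try:
        canonical = lookup_label(curie)
    except OSError:
        # A network failure is not evidence of hallucination - surface it separately
        return Verdict.UNVERIFIABLE
    if canonical is None or canonical.lower() != claimed_label.strip().lower():
        return Verdict.INVALID
    return Verdict.VALID
```

Batching and background execution follow the same shape: collect verdicts rather than raising, and decide centrally how INVALID and UNVERIFIABLE records are handled.
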

## Beyond Basic Ontologies

@@ -145,9 +220,75 @@ But for most scientific curation workflows involving ontologies, genes, and publ
## Getting Started

1. **Pick one identifier type** that's important for your workflow
2. **Find the authoritative API** for that type
3. **Modify your prompts** to require both ID and label
4. **Build simple validation** that checks both pieces
5. **Expand gradually** to other identifier types

The key is to start simple and build up. You don't need a comprehensive system from day one - even basic validation on your most critical identifier types can make a big difference.

## Implementation Tools

The concepts described in this guide are implemented in several practical tools:

### LinkML Validator Plugins

The [LinkML](https://linkml.io) ecosystem provides validator plugins specifically designed for these hallucination prevention patterns:

#### linkml-term-validator

Validates ontology terms in LinkML schemas and datasets using the dual verification approach (ID + label checking):

- **Schema validation**: Verifies `meaning` fields in enum definitions reference real ontology terms
- **Dynamic enum validation**: Checks data against constraints like `reachable_from`, `matches`, and `concepts`
- **Binding validation**: Enforces constraints on nested object fields
- **Multi-level caching**: Speeds up validation with in-memory and file-based caching
- **Ontology Access Kit integration**: Works with multiple ontology sources through OAK adapters

**Learn more:** [linkml-term-validator documentation](https://linkml.io/linkml-term-validator/)

#### linkml-reference-validator

Validates that text excerpts match their source publications using deterministic verification:

- **Deterministic substring matching**: No fuzzy matching or AI approximations
- **Editorial convention support**: Handles bracketed clarifications and ellipses
- **PubMed/PMC integration**: Fetches actual publication content for verification
- **Smart caching**: Minimizes API requests with local caching
- **Multiple interfaces**: Command-line tool, Python API, and LinkML schema integration
- **OBO format support**: Can validate supporting text annotations in OBO ontology files

**Learn more:** [linkml-reference-validator documentation](https://linkml.io/linkml-reference-validator/)

### Using These Tools Together

For comprehensive AI guardrails, you can combine both validators in your workflow:

1. Use **linkml-term-validator** to ensure all ontology terms, gene IDs, and other identifiers are real and correctly labeled
2. Use **linkml-reference-validator** to verify that supporting text and quotes actually appear in their cited sources
3. Integrate both into your CI/CD pipeline to catch hallucinations before they enter your knowledge base

### Example: Validating OBO Ontology Files

If you're working with OBO format ontologies that include supporting text annotations, you can use regex-based validation:

```bash
linkml-reference-validator validate text-file my-ontology.obo \
  --regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
  --cache-dir ./cache
```

This validates that supporting text annotations actually appear in their referenced publications.

**Learn more:** [Validating OBO files guide](https://linkml.io/linkml-reference-validator/how-to/validate-obo-files/)

### Getting Started with LinkML Validators

Both tools are available as Python packages and can be installed via pip:

```bash
pip install linkml-term-validator
pip install linkml-reference-validator
```

They work as both command-line tools and Python libraries, so you can integrate them into your existing workflows however makes sense for your use case.
