
Commit 19041a6

Dragon-AI Agent and claude committed

Add LinkML validator documentation for hallucination guardrails

Extends existing hallucination prevention documentation to cover:

- Distinction between term validation (ID + label) and reference validation (quote + citation)
- Core concepts and principles for both validation approaches
- When to use each type of validation
- Practical examples of text excerpt validation
- Implementation details for linkml-term-validator and linkml-reference-validator
- Integration guidance for using both tools together

Focuses on concepts with links to implementation-specific documentation per feedback from @cmungall in #41. Addresses #51

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1 parent 11266f0 commit 19041a6

File tree

1 file changed: +145 -4 lines changed

docs/how-tos/make-ids-hallucination-resistant.md

Lines changed: 145 additions & 4 deletions
@@ -76,10 +76,38 @@ publication:

# Would catch mismatches like:
publication:
  pmid: 10802651
  title: "Some other paper title"  # Wrong title for this PMID
```

### Text Excerpts from Publications

When your curation includes quoted text or supporting evidence from papers, you can validate that the text actually appears in the cited source:

```yaml
# This would pass validation
annotation:
  term_id: GO:0005634
  supporting_text: "The protein localizes to the nucleus during cell division"
  reference: PMID:12345678
# Validation checks that this exact text appears in PMID:12345678

# This would fail - text doesn't appear in the paper
annotation:
  term_id: GO:0005634
  supporting_text: "Made-up description that sounds plausible"
  reference: PMID:12345678
```

The validator supports standard editorial conventions:

```yaml
# These are valid - bracketed clarifications and ellipses are allowed
annotation:
  supporting_text: "The protein [localizes] to the nucleus...during cell division"
# Matches: "The protein to the nucleus early during cell division"
```
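
To make the matching rules concrete, here is a minimal sketch of the idea in plain Python. It is not the validator's implementation - the normalization details (case, whitespace) are assumptions - but it shows how bracketed clarifications and ellipses can be handled with deterministic substring checks rather than fuzzy matching:

```python
import re

def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase (an illustrative choice, not the tool's)."""
    return re.sub(r"\s+", " ", text).strip().lower()

def excerpt_matches(supporting_text: str, source_text: str) -> bool:
    """Deterministically check that a quoted excerpt appears in the source.

    Editorial conventions:
    - [bracketed] words are editorial additions, so they are dropped before matching
    - "..." marks omitted source text, so each remaining fragment must appear in order
    """
    cleaned = re.sub(r"\[[^\]]*\]", " ", supporting_text)
    source = _normalize(source_text)
    position = 0
    for fragment in re.split(r"\.\.\.|…", cleaned):
        fragment = _normalize(fragment)
        if not fragment:
            continue
        found = source.find(fragment, position)
        if found == -1:
            return False  # fragment not found - treat as a potential hallucination
        position = found + len(fragment)
    return True

# The editorial-conventions example from above
print(excerpt_matches(
    "The protein [localizes] to the nucleus...during cell division",
    "The protein to the nucleus early during cell division",
))  # True
```

The important property is that a fabricated quote fails outright; there is no similarity threshold for an AI-generated approximation to sneak under.
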
## You Need Tooling for This

This pattern only works if you have validation tools that can actually check the identifiers against authoritative sources. You need:
@@ -89,19 +117,66 @@ This pattern only works if you have validation tools that can actually check the
3. **Label matching**: Compare provided labels against canonical ones
4. **Consistency checking**: Make sure everything matches up

## Validation Concepts

There are two complementary approaches to preventing hallucinations in AI-assisted curation:

### 1. Term Validation (ID + Label Checking)

This is the approach we've been discussing: validating that identifiers and their labels are consistent with authoritative ontology sources. The key concept is **dual verification** - requiring both the ID and its canonical label makes it exponentially harder for an AI to accidentally fabricate a valid combination.

**Core principles:**
- Validate term IDs against ontology sources to ensure they exist
- Verify labels match the canonical labels from those sources
- Check consistency between related terms in your data
- Support dynamic enum validation for flexible controlled vocabularies

**When to use this:**
- You're working with ontology terms (GO, HP, MONDO, etc.)
- You're handling gene identifiers, chemical compounds, or other standardized entities
- You need to validate that AI-generated annotations use real, correctly-labeled terms
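
As a rough illustration of dual verification in code, here is a minimal sketch using OAK (the Ontology Access Kit, covered under "Useful APIs" below). The adapter selector and the case-insensitive comparison are illustrative assumptions, not a prescription:

```python
from oaklib import get_adapter

# One adapter per ontology; "sqlite:obo:go" fetches a local copy of GO on first use
go_adapter = get_adapter("sqlite:obo:go")

def check_term(adapter, curie: str, claimed_label: str) -> bool:
    """Dual verification: the ID must exist AND its canonical label must match."""
    canonical = adapter.label(curie)  # None if the identifier is unknown
    if canonical is None:
        print(f"{curie}: unknown identifier (possible hallucination)")
        return False
    if canonical.strip().lower() != claimed_label.strip().lower():
        print(f"{curie}: label mismatch - canonical is {canonical!r}, got {claimed_label!r}")
        return False
    return True

check_term(go_adapter, "GO:0005634", "nucleus")        # True
check_term(go_adapter, "GO:0005634", "mitochondrion")  # False: real ID, wrong label
```

Because the AI has to produce both halves and each is checked independently against the source ontology, a fabricated ID and a mislabeled real ID both fail.
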
### 2. Reference Validation (Quote + Citation Checking)

A complementary approach validates that text excerpts or quotes in your data actually appear in their cited sources. This prevents AI from inventing supporting text or misattributing quotes to publications.

**Core principles:**
- Fetch the actual publication content from authoritative sources
- Perform deterministic substring matching (not fuzzy matching)
- Support legitimate editorial conventions (bracketed clarifications, ellipses)
- Reject any text that can't be verified in the source

**When to use this:**
- Your curation workflow includes extracting text from publications
- You're building datasets with quoted material and citations
- AI systems are summarizing or extracting information from papers
- You need to verify that supporting text for annotations comes from real sources
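
Here is a minimal sketch of the fetch-and-check step, using NCBI's E-utilities to pull an abstract as plain text. Treat it as an illustration rather than a hardened client: real workflows also need rate limiting, caching, and a full-text source such as PMC for quotes that aren't in the abstract:

```python
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_abstract(pmid: str) -> str:
    """Fetch the plain-text abstract for a PMID from NCBI E-utilities."""
    params = urllib.parse.urlencode(
        {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    )
    with urllib.request.urlopen(f"{EFETCH}?{params}") as response:
        return response.read().decode("utf-8")

def excerpt_in_abstract(supporting_text: str, pmid: str) -> bool:
    """Deterministic check: the quote must appear verbatim (modulo whitespace) in the source."""
    source = " ".join(fetch_abstract(pmid).split()).lower()
    excerpt = " ".join(supporting_text.split()).lower()
    return excerpt in source

# A fabricated quote simply is not found in the paper it cites
print(excerpt_in_abstract("Made-up description that sounds plausible", "12345678"))
```
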
### Why Both Matter

These validation approaches protect against different types of hallucinations:
- **Term validation** prevents fabricated identifiers and misapplied terms
- **Reference validation** prevents fabricated quotes and misattributed text

For comprehensive AI guardrails, you often need both. For example, when curating gene-disease associations, you might validate:
1. That the gene IDs and disease term IDs are real and correctly labeled
2. That the supporting text cited from a paper actually appears in that paper

### Useful APIs for Validation

- **OLS (Ontology Lookup Service)**: EBI's comprehensive API for biomedical ontologies
- **OAK (Ontology Access Kit)**: Python library that can work with multiple ontology sources
- **PubMed APIs**: For validating PMIDs and retrieving titles
- **PMC (PubMed Central)**: For accessing full-text content to validate excerpts
- **Individual ontology APIs**: Many ontologies have their own REST APIs

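For example, the wrong-title check from the earlier YAML snippet can be backed by PubMed's esummary endpoint; a minimal sketch, with no retries or API-key handling:

```python
import json
import urllib.parse
import urllib.request

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"

def pmid_title(pmid: str):
    """Return the canonical title for a PMID, or None if it doesn't resolve."""
    params = urllib.parse.urlencode({"db": "pubmed", "id": pmid, "retmode": "json"})
    with urllib.request.urlopen(f"{ESUMMARY}?{params}") as response:
        payload = json.load(response)
    record = payload.get("result", {}).get(pmid, {})
    return record.get("title") or None

# Compare the AI-supplied title against the canonical one before accepting the record
print(pmid_title("10802651"))
```
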
### Implementation Notes

- **Cache responses** to avoid hitting APIs repeatedly for the same IDs or references
- **Handle network failures** gracefully - you don't want validation failures to break your workflow
- **Consider performance** - real-time validation can slow things down, so you might need to batch or background the checks
- **Plan for errors** - decide how to handle cases where validation fails (reject, flag for review, etc.)
- **Use deterministic validation** - avoid fuzzy matching that might accept AI-generated approximations
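
A sketch of how the first two notes fit together - cache lookups so repeated identifiers cost one API call, and treat "couldn't validate" differently from "validation failed". The function and enum names are hypothetical, not from any particular library:

```python
from enum import Enum
from functools import lru_cache

class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"            # definitely wrong: reject or fix
    UNVERIFIABLE = "unverifiable"  # lookup failed: flag for review, don't silently pass

@lru_cache(maxsize=10_000)
def lookup_label(curie: str):
    """Stand-in for a real ontology lookup (OLS, OAK, a local database, ...).
    lru_cache means each distinct ID costs at most one network call per run."""
    return {"GO:0005634": "nucleus"}.get(curie)  # placeholder data for the sketch

def validate_term(curie: str, claimed_label: str) -> Verdict:
    try:
        canonical = lookup_label(curie)
    except OSError:
        # A network failure is not evidence of hallucination - surface it separately
        return Verdict.UNVERIFIABLE
    if canonical is None or canonical.lower() != claimed_label.strip().lower():
        return Verdict.INVALID
    return Verdict.VALID
```

Batching and background execution follow the same shape: collect verdicts rather than raising, and decide centrally how INVALID and UNVERIFIABLE records are handled.
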

## Beyond Basic Ontologies

@@ -145,9 +220,75 @@ But for most scientific curation workflows involving ontologies, genes, and publ
## Getting Started

1. **Pick one identifier type** that's important for your workflow
2. **Find the authoritative API** for that type
3. **Modify your prompts** to require both ID and label
4. **Build simple validation** that checks both pieces
5. **Expand gradually** to other identifier types

The key is to start simple and build up. You don't need a comprehensive system from day one - even basic validation on your most critical identifier types can make a big difference.

## Implementation Tools

The concepts described in this guide are implemented in several practical tools:

### LinkML Validator Plugins

The [LinkML](https://linkml.io) ecosystem provides validator plugins specifically designed for these hallucination prevention patterns:

#### linkml-term-validator

Validates ontology terms in LinkML schemas and datasets using the dual verification approach (ID + label checking):

- **Schema validation**: Verifies `meaning` fields in enum definitions reference real ontology terms
- **Dynamic enum validation**: Checks data against constraints like `reachable_from`, `matches`, and `concepts`
- **Binding validation**: Enforces constraints on nested object fields
- **Multi-level caching**: Speeds up validation with in-memory and file-based caching
- **Ontology Access Kit integration**: Works with multiple ontology sources through OAK adapters

**Learn more:** [linkml-term-validator documentation](https://linkml.io/linkml-term-validator/)

#### linkml-reference-validator

Validates that text excerpts match their source publications using deterministic verification:

- **Deterministic substring matching**: No fuzzy matching or AI approximations
- **Editorial convention support**: Handles bracketed clarifications and ellipses
- **PubMed/PMC integration**: Fetches actual publication content for verification
- **Smart caching**: Minimizes API requests with local caching
- **Multiple interfaces**: Command-line tool, Python API, and LinkML schema integration
- **OBO format support**: Can validate supporting text annotations in OBO ontology files

**Learn more:** [linkml-reference-validator documentation](https://linkml.io/linkml-reference-validator/)

### Using These Tools Together

For comprehensive AI guardrails, you can combine both validators in your workflow:

1. Use **linkml-term-validator** to ensure all ontology terms, gene IDs, and other identifiers are real and correctly labeled
2. Use **linkml-reference-validator** to verify that supporting text and quotes actually appear in their cited sources
3. Integrate both into your CI/CD pipeline to catch hallucinations before they enter your knowledge base

### Example: Validating OBO Ontology Files

If you're working with OBO format ontologies that include supporting text annotations, you can use regex-based validation:

```bash
linkml-reference-validator validate text-file my-ontology.obo \
  --regex 'ex:supporting_text="([^"]*)\[(\S+:\S+)\]"' \
  --cache-dir ./cache
```

This validates that supporting text annotations actually appear in their referenced publications.

**Learn more:** [Validating OBO files guide](https://linkml.io/linkml-reference-validator/how-to/validate-obo-files/)

### Getting Started with LinkML Validators

Both tools are available as Python packages and can be installed via pip:

```bash
pip install linkml-term-validator
pip install linkml-reference-validator
```

They work as both command-line tools and Python libraries, so you can integrate them into your existing workflows however makes sense for your use case.
