Add LinkML validator documentation for hallucination guardrails
Extends existing hallucination prevention documentation to cover:
- Distinction between term validation (ID + label) and reference validation (quote + citation)
- Core concepts and principles for both validation approaches
- When to use each type of validation
- Practical examples of text excerpt validation
- Implementation details for linkml-term-validator and linkml-reference-validator
- Integration guidance for using both tools together
Focuses on concepts with links to implementation-specific documentation
per feedback from @cmungall in #41.
Addresses #51
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
File changed: docs/how-tos/make-ids-hallucination-resistant.md (145 additions, 4 deletions)
@@ -76,10 +76,38 @@ publication:

```yaml
# Would catch mismatches like:
publication:
  pmid: 10802651
  title: "Some other paper title"  # Wrong title for this PMID
```

### Text Excerpts from Publications

When your curation includes quoted text or supporting evidence from papers, you can validate that the text actually appears in the cited source:

```yaml
# This would pass validation
annotation:
  term_id: GO:0005634
  supporting_text: "The protein localizes to the nucleus during cell division"
  reference: PMID:12345678
# Validation checks that this exact text appears in PMID:12345678

# This would fail - the text doesn't appear in the paper
annotation:
  term_id: GO:0005634
  supporting_text: "Made-up description that sounds plausible"
  reference: PMID:12345678
```

The validator supports standard editorial conventions:

```yaml
# These are valid - bracketed clarifications and ellipses are allowed
annotation:
  supporting_text: "The protein [localizes] to the nucleus...during cell division"
  # Matches: "The protein to the nucleus early during cell division"
```
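As a rough illustration of how such matching can work - this is a minimal sketch, not the actual linkml-reference-validator logic - bracketed insertions can be stripped and each ellipsis treated as a gap whose surrounding fragments must appear in order:

```python
import re


def excerpt_matches(excerpt: str, source_text: str) -> bool:
    """Sketch of excerpt matching with editorial conventions:
    [bracketed] insertions are ignored, and an ellipsis (...) may
    stand for omitted text between fragments."""

    def norm(s: str) -> str:
        # Collapse runs of whitespace so line breaks don't break matches.
        return re.sub(r"\s+", " ", s).strip()

    # Drop bracketed editorial insertions, e.g. "[localizes]".
    cleaned = re.sub(r"\[[^\]]*\]", "", excerpt)
    # Each fragment between ellipses must appear, in order.
    fragments = [norm(f) for f in cleaned.split("...") if norm(f)]
    haystack = norm(source_text)
    pos = 0
    for frag in fragments:
        idx = haystack.find(frag, pos)
        if idx == -1:
            return False
        pos = idx + len(frag)
    return True
```

With the example above, `excerpt_matches("The protein [localizes] to the nucleus...during cell division", "The protein to the nucleus early during cell division")` returns `True`.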
## You Need Tooling for This

This pattern only works if you have validation tools that can actually check the identifiers against authoritative sources. You need:
@@ -89,19 +117,66 @@ This pattern only works if you have validation tools that can actually check the

3. **Label matching**: Compare provided labels against canonical ones
4. **Consistency checking**: Make sure everything matches up
## Validation Concepts

There are two complementary approaches to preventing hallucinations in AI-assisted curation:

### 1. Term Validation (ID + Label Checking)

This is the approach we've been discussing: validating that identifiers and their labels are consistent with authoritative ontology sources. The key concept is **dual verification** - requiring both the ID and its canonical label makes it exponentially harder for an AI to accidentally fabricate a valid combination.

**Core principles:**

- Validate term IDs against ontology sources to ensure they exist
- Verify labels match the canonical labels from those sources
- Check consistency between related terms in your data
- Support dynamic enum validation for flexible controlled vocabularies

**When to use this:**

- You're working with ontology terms (GO, HP, MONDO, etc.)
- You're handling gene identifiers, chemical compounds, or other standardized entities
- You need to validate that AI-generated annotations use real, correctly-labeled terms
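The dual-verification idea can be sketched in a few lines. This is not the linkml-term-validator implementation; the `GO_LABELS` dict here is a hypothetical stand-in for a real ontology lookup (for example an OAK adapter or an OLS query):

```python
def validate_term(
    term_id: str, label: str, canonical_labels: dict[str, str]
) -> list[str]:
    """Dual verification: the ID must exist AND the label must match
    the canonical label. Returns a list of error messages (empty = valid)."""
    errors = []
    if term_id not in canonical_labels:
        errors.append(f"{term_id}: unknown identifier")
    elif label.lower() != canonical_labels[term_id].lower():
        errors.append(
            f"{term_id}: label {label!r} does not match canonical "
            f"{canonical_labels[term_id]!r}"
        )
    return errors


# Hypothetical mini-ontology for illustration only; a real validator
# would query an authoritative source instead.
GO_LABELS = {
    "GO:0005634": "nucleus",
    "GO:0005737": "cytoplasm",
}
```

A fabricated ID fails the first check; a real ID with a fabricated label fails the second - which is exactly why requiring both pieces is so effective.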
### 2. Reference Validation (Quote + Citation Checking)

A complementary approach validates that text excerpts or quotes in your data actually appear in their cited sources. This prevents AI from inventing supporting text or misattributing quotes to publications.

**Core principles:**

- Fetch the actual publication content from authoritative sources
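For PubMed references, the fetch-and-check step can be sketched with NCBI E-utilities. The exact-substring helper below is a simplification (real validators also handle the editorial conventions shown earlier and cache fetched content), and the fetch itself needs network access:

```python
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fetch_pubmed_abstract(pmid: str) -> str:
    """Fetch the plain-text abstract for a PMID via NCBI E-utilities."""
    params = urllib.parse.urlencode(
        {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    )
    with urllib.request.urlopen(f"{EUTILS}?{params}") as resp:
        return resp.read().decode("utf-8")


def supporting_text_found(supporting_text: str, source_text: str) -> bool:
    """Whitespace-normalized exact substring check."""

    def squash(s: str) -> str:
        return " ".join(s.split())

    return squash(supporting_text) in squash(source_text)
```

Validation then reduces to `supporting_text_found(excerpt, fetch_pubmed_abstract(pmid))` for each annotation.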
@@ -145,9 +220,75 @@ But for most scientific curation workflows involving ontologies, genes, and publ

## Getting Started

1. **Pick one identifier type** that's important for your workflow
2. **Find the authoritative API** for that type
3. **Modify your prompts** to require both ID and label
4. **Build simple validation** that checks both pieces
5. **Expand gradually** to other identifier types

The key is to start simple and build up. You don't need a comprehensive system from day one - even basic validation on your most critical identifier types can make a big difference.
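Steps 3 and 4 can work together: instruct the model to always emit label + CURIE pairs, then extract and check each pair. The prompt wording and the single-word-label regex below are illustrative assumptions, not a prescribed format:

```python
import re

# Step 3 (hypothetical prompt rule): require both ID and label.
PROMPT_RULE = (
    "When citing an ontology term, always give both the CURIE and its "
    "exact canonical label, e.g. 'nucleus (GO:0005634)'."
)

# Step 4: pull (label, CURIE) pairs out of model output for validation.
# Simplification: matches only a single-word label directly before the CURIE.
PAIR_RE = re.compile(r"(\w+)\s*\((GO|HP|MONDO):(\d{7})\)")


def extract_pairs(text: str) -> list[tuple[str, str]]:
    """Return (label, CURIE) pairs found in model output."""
    return [
        (m.group(1), f"{m.group(2)}:{m.group(3)}")
        for m in PAIR_RE.finditer(text)
    ]
```

Each extracted pair can then be passed to whatever dual-verification check you build in step 4.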
## Implementation Tools

The concepts described in this guide are implemented in several practical tools:

### LinkML Validator Plugins

The [LinkML](https://linkml.io) ecosystem provides validator plugins specifically designed for these hallucination prevention patterns:

#### linkml-term-validator

Validates ontology terms in LinkML schemas and datasets using the dual verification approach (ID + label checking):

- **Schema validation**: Verifies `meaning` fields in enum definitions reference real ontology terms
- **Dynamic enum validation**: Checks data against constraints like `reachable_from`, `matches`, and `concepts`
- **Binding validation**: Enforces constraints on nested object fields
- **Multi-level caching**: Speeds up validation with in-memory and file-based caching
- **Ontology Access Kit integration**: Works with multiple ontology sources through OAK adapters
Both tools are available as Python packages and can be installed via pip:

```bash
pip install linkml-term-validator
pip install linkml-reference-validator
```

They work as both command-line tools and Python libraries, so you can integrate them into your existing workflows however makes sense for your use case.