Vocabulary Design
Rules for designing controlled vocabularies, referencing standards, and validating through system mappings.
Rules for controlled vocabularies. Builds on Schema Design.
1. Universal by default, domain-scoped by exception
A vocabulary lives at the root unless the same codes would carry different meanings across domains.
Examples:
payment-statusis universal. "Paid" means the same thing in social protection, health, and education.targeting-approachis domain-specific. "Proxy means testing" is social protection methodology with no equivalent in other domains.severityis ambiguous. In emergency alerting it means hazard impact; in health it means disease progression. Disambiguate by renaming (event-severity) or scoping to a domain.
Definitions must match scope. A universal vocabulary's definition must not reference a specific domain. Write "the lifecycle states of an enrollment in a program," not "...in a social protection program."
Process and mechanism vocabularies
Vocabularies that describe processes (enrollment lifecycles, payment processing, complaint handling) and delivery mechanisms (channels, modalities) are universal by default, even if the first concept using them is domain-specific. The test: would the same codes carry different meanings in another domain? If "active" enrollment means the same thing in social protection and education, the vocabulary is universal.
Individual values that only apply to one domain use the domain annotation on the value (see section 4), not on the vocabulary itself.
2. One concept per vocabulary
Each vocabulary answers one question. Do not combine orthogonal concerns.
- Status and outcome are different questions. "Where is this process?" vs. "What was the result?"
- Channel and modality are different questions. "How does the benefit reach the person?" vs. "What form does the benefit take?"
Do not split prematurely. If a single vocabulary cleanly captures the lifecycle and all mapped systems treat it as one field, splitting adds complexity without proven benefit.
3. Reference existing standards
If a formal code system exists, reference it. Three levels:
| Level | When to use | YAML field |
|---|---|---|
| Sync | Machine-readable standard defines the authoritative value set. | standard + sync |
| Reference | Standard exists but we don't sync (prose-only, partial overlap, deliberate simplification). | references |
| None | No relevant standard exists. | Neither field |
Synced standard vocabularies
When a vocabulary is synced from an international standard and systems use that standard's codes directly, same_standard_systems is sufficient in place of value-level system_mappings. This applies to large code lists (countries, currencies, languages) where the standard is the de facto interoperability layer.
A vocabulary with no standard, no references, and no system_mappings is unvalidated. Acceptable at draft maturity; must be resolved before trial use.
Do not adopt a standard's codes when they don't serve the audience. ISO 20022 codes (RCVD, ACTC, ACSP) are for interbank messaging. Use readable codes; map to the standard.
Prefer machine-readable standards over prose-only ones. Machine-readable standards can be synced automatically; prose standards drift.
4. Domain annotations on individual values
When a universal vocabulary contains values that only apply to one domain, annotate the value rather than splitting the vocabulary:
- code: graduated
domain: sp
label:
en: Graduated
definition:
en: The beneficiary has exited through program-defined graduation criteria.
Use sparingly. If more than a third of values carry domain annotations, the vocabulary should move to a domain namespace.
5. Disambiguate collision-prone names
Before naming a vocabulary, ask: could another domain define a vocabulary with this name but different values?
severity->event-severitycertainty->event-certainty- Never use bare
status; always qualify (enrollment-status,payment-status)
6. The other code and system mappings
other is acceptable at draft maturity. Track what maps to it in real deployments. When the same unmapped value appears in 2+ systems, promote it to a named code.
System mappings are the primary validation mechanism:
- 3+ systems mapping to
other= vocabulary gap. Promote to a named code. - 4+ system codes mapping to one canonical value = vocabulary may be too coarse.
- System code mapping to
nullwhen a broader canonical code exists = mapping bug.
A vocabulary with zero system mappings is unvalidated. Add mappings before advancing maturity.