Imagine a French cardiologist and a Japanese oncologist walk into a conference room. Both are brilliant, having spent decades studying disease. More importantly, both are holding datasets that, if combined, could accelerate a treatment breakthrough. However, one speaks only French and the other speaks only Japanese. Without a translator in the room, the datasets stay separate, not because the science is wrong, but because the languages are incompatible.
It is a simplified analogy, but it illustrates why ontology matters.
Now scale that problem to hundreds of clinical systems, thousands of studies, and millions of data points, each captured in a different "language." One system calls it "myocardial infarction," another says "MI," a third records "heart attack." All of them mean the same thing. However, without something in the middle to bridge the meaning, the systems can't talk to each other, analysts can't merge the data, and AI can't reason across it.
That "something in the middle" is the ontology.
It acts as the semantic translator between the fragmented vocabularies of modern clinical data. As the life sciences industry accelerates toward AI-driven drug development and orchestrated clinical operations, understanding ontologies has become foundational to success.
What is an ontology?
In philosophy, "ontology" refers to the study of existence (what kinds of things are real and how they relate). In computer science and life sciences, the term has a more practical meaning: a formal, structured vocabulary that defines concepts within a domain and the relationships between them.
Think of it as the difference between a dictionary and a knowledge graph. A dictionary tells you what a word means in natural language. An ontology tells a machine what a concept means: its class, its parent and child concepts, its relationships to other concepts, in a way that can be reasoned over programmatically. In other words, it adds the layers of context a machine needs.
A working definition: an ontology is an explicit, machine-readable specification of a domain's conceptualization, including the classes of things that exist, the relations between them, and the axioms that constrain their meaning.
The key ingredients are four interlocking features, and every major biomedical ontology provides all of them:
01 Standard identifiers
Unique, stable IDs for every concept, enabling the same entity to be referenced consistently across disconnected databases and systems.
02 Domain vocabulary
Human-readable labels and synonyms associated with each class: the terms domain experts actually use in text and user interfaces.
03 Textual definitions
Precise natural language descriptions that remove ambiguity so users and systems apply concepts consistently.
04 Formal axioms
Machine-readable logical rules and definitions enabling automated reasoning, consistency checking, and inference.
This last feature is what separates a proper ontology from a glossary or a controlled vocabulary. When concepts are defined with formal logic, using languages like OWL (Web Ontology Language), software can reason over the ontology. It does this by checking for contradictions, inferring implicit relationships, and answering complex queries that no human programmed explicitly.
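To make those four ingredients and the reasoning claim concrete, here is a minimal sketch in Python. The concept IDs, labels, and hierarchy below are invented for illustration; a real ontology such as SNOMED CT or GO expresses far richer axioms in OWL, but the mechanics of inferring an unstated relationship look like this:

```python
from dataclasses import dataclass, field

# Illustrative only: the IDs, terms, and hierarchy are made up for this sketch,
# not taken from SNOMED CT or any real ontology.

@dataclass
class Concept:
    concept_id: str                                     # 01 standard identifier
    label: str                                          # 02 domain vocabulary (preferred term)
    synonyms: list[str]                                 # 02 domain vocabulary (alternate terms)
    definition: str                                     # 03 textual definition
    parents: list[str] = field(default_factory=list)   # 04 formal "is-a" axiom

ONTOLOGY = {
    "EX:001": Concept("EX:001", "Disease of the cardiovascular system",
                      ["cardiovascular disease"],
                      "Any disorder of the heart or blood vessels."),
    "EX:002": Concept("EX:002", "Myocardial infarction",
                      ["MI", "heart attack"],
                      "Necrosis of heart muscle due to loss of blood supply.",
                      parents=["EX:001"]),
    "EX:003": Concept("EX:003", "Acute myocardial infarction",
                      ["acute MI"],
                      "A myocardial infarction with sudden onset.",
                      parents=["EX:002"]),
}

def is_a(concept_id: str, ancestor_id: str) -> bool:
    """Infer subsumption by walking the is-a hierarchy transitively."""
    if concept_id == ancestor_id:
        return True
    return any(is_a(p, ancestor_id) for p in ONTOLOGY[concept_id].parents)

# Never stated anywhere directly, but inferred from the axioms:
assert is_a("EX:003", "EX:001")  # an acute MI is a cardiovascular disease
```

The final assertion appears nowhere in the data; it follows from the is-a axioms, which is exactly the kind of inference an OWL reasoner performs at scale, alongside consistency checking that a toy example like this leaves out.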
How clinical trials came to use ontologies
The life sciences community didn't adopt ontologies out of philosophical interest. They arrived out of necessity, driven by one of the most acute data heterogeneity problems in any industry.
By the late 1990s, the genomics revolution was generating vast datasets. However, the data was being described with inconsistent terminology across labs, institutions, and species. Two researchers studying the same biological process in different organisms used completely different terms. Databases couldn't be merged. Analyses couldn't be reproduced. The signal was being lost in the noise of semantic inconsistency.
The response was the Gene Ontology (GO), launched in 1998, which created a shared vocabulary for annotating gene products across species. It was a foundational moment. By the mid-2000s, there was enough activity to merit coordinated international efforts, including the Open Biomedical Ontologies (OBO) Foundry, a governance body that established principles for how ontologies in the life sciences should be designed, maintained, and made interoperable with each other.
Today, repositories like BioPortal host over 400 ontologies with more than six million classes covering everything from molecular biology and pharmacology to clinical phenotypes and research methodology.
The core ontologies powering clinical operations today
If you're curious to see what ontologies exist today in healthcare and beyond, the list below covers the major ones. Together, they provide the standardization modern healthcare depends on.
SNOMED CT
Scope: Clinical concepts — diseases, findings, procedures, anatomy
Where you see it: EHR systems, clinical data capture, safety coding
MedDRA
Scope: Medical terminology for adverse events and regulatory reporting
Where you see it: Safety databases, pharmacovigilance, FDA submissions
CDISC / SDTM
Scope: Clinical trial data standards and tabulation models
Where you see it: Regulatory submissions, EDC systems, data warehouses
LOINC
Scope: Lab tests, clinical observations, and measurements
Where you see it: Lab systems, HL7 interfaces, real-world data
Gene Ontology (GO)
Scope: Molecular function, biological process, cellular component
Where you see it: Genomics databases, target identification, biomarker work
NCI Thesaurus
Scope: Cancer, drugs, anatomy, clinical trials terminology
Where you see it: Oncology trials, NCI submissions, ClinicalTrials.gov
OBI
Scope: Biomedical investigations — study design, protocols, assays
Where you see it: Study metadata, protocol ontologies, data provenance
The challenge, and the opportunity, is that these ontologies were built separately, for different purposes, by different communities. No single system in a clinical development organization touches just one of them. Getting these to talk to each other is where ontology-based data integration becomes genuinely hard.
How ontologies enable data integration and AI
The practical power of ontologies shows up most clearly when you try to integrate data from multiple sources. In clinical development, this happens constantly: data from an EDC, a lab system, a safety database, a real-world data vendor, and a legacy clinical database all need to be merged for analysis. Without a shared semantic layer, this is a manual, error-prone mapping exercise done by data engineers who may or may not know the clinical domain.
With ontologies, integration can be driven by shared identifiers. If both systems annotate their data using SNOMED CT concept IDs, a query across both systems can resolve the semantics automatically, without humans manually reconciling "MI" versus "myocardial infarction" versus "heart attack" every time.
To illustrate the semantic gap: without an ontology, one system might label a field "participant ID" while another calls the same field "patient ID." Nothing in either system records that the two refer to the same thing, so the data can't be joined until a human decides the fields are equivalent.
With an ontology, a single concept ID for "Myocardial infarction" carries all its synonyms, its place in the disease hierarchy, and its relationships to related concepts. Any system using that ID gets all of that for free, and the ontology consortium, not the data engineer, is responsible for keeping it current.
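As a hedged sketch of what that buys in practice (the field names, records, and synonym index below are hypothetical, and the concept ID is used only for illustration), here is how records from two systems that word the same diagnosis differently can be merged once both resolve their terms to a shared identifier:

```python
# Hypothetical records from two systems that describe the same condition differently.
safety_db = [{"case": "A-101", "event_term": "MI"},
             {"case": "A-102", "event_term": "Heart attack"}]
edc = [{"subject": "S-77", "diagnosis": "Myocardial infarction"}]

# A small synonym index of the kind an ontology publishes and maintains.
# The concept ID is shown for illustration; verify codes against SNOMED CT itself.
SYNONYM_TO_CONCEPT = {
    "mi": "SCT:22298006",
    "heart attack": "SCT:22298006",
    "myocardial infarction": "SCT:22298006",
}

def to_concept_id(term: str):
    """Resolve a free-text term to its shared concept identifier, if known."""
    return SYNONYM_TO_CONCEPT.get(term.strip().lower())

# Annotate each record with the shared identifier, then query across both systems.
for record in safety_db:
    record["concept_id"] = to_concept_id(record["event_term"])
for record in edc:
    record["concept_id"] = to_concept_id(record["diagnosis"])

# Both sources now agree on one ID without anyone hand-reconciling strings.
merged = [r for r in safety_db + edc if r["concept_id"] == "SCT:22298006"]
print(len(merged))  # 3
```

The design point worth noticing is that the synonym index is not a data engineer's private mapping spreadsheet: it is published, versioned, and maintained by the ontology's governing body, so every downstream system inherits corrections for free.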
This becomes critical for AI. Large language models and AI agents operating on life sciences data need to understand that concepts have hierarchical relationships: that "adverse event" encompasses "serious adverse event," which encompasses "fatal adverse event," each with different regulatory implications. They need to know that the same event can be described at different levels of specificity. They need provenance, specifically why a concept was applied to a data point and with what level of evidence.
Ontologies, when properly implemented, give AI agents this background knowledge. It is pre-baked, validated by domain experts, and maintained over time. Without it, every AI initiative in clinical development is essentially starting from scratch on the semantic understanding problem.
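To illustrate why the hierarchy matters to an agent (the terms and structure below are simplified placeholders, not MedDRA's or any standard's actual organization), a query for "adverse event" should also return everything the ontology classifies beneath it:

```python
# A toy severity hierarchy: each term maps to its broader parent term.
# Simplified placeholder structure, invented for this sketch.
PARENT = {
    "fatal adverse event": "serious adverse event",
    "serious adverse event": "adverse event",
    "adverse event": None,
}

def descendants(term: str) -> set[str]:
    """All terms the hierarchy classifies under `term`, including the term itself."""
    found = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found

records = [
    {"id": 1, "classification": "serious adverse event"},
    {"id": 2, "classification": "fatal adverse event"},
    {"id": 3, "classification": "protocol deviation"},
]

# An agent asked for "all adverse events" expands the query through the hierarchy,
# so the serious and fatal cases are not silently dropped.
wanted = descendants("adverse event")
print([r["id"] for r in records if r["classification"] in wanted])  # [1, 2]
```

Without that expansion, a naive string filter would silently drop the serious and fatal cases, which is precisely the failure mode the hierarchy exists to prevent.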
Why the old approach broke down
For years, integrating clinical data meant one thing: rebuilding. Data engineers had to reverse-engineer proprietary schemas, map fields by hand, negotiate with IT teams at every system, and produce brittle pipelines that broke whenever a vendor pushed an update. A single data harmonization project could consume six to twelve months before any AI model saw its first training example. As a result, many AI initiatives failed not because the models were bad, but because the data never arrived in a usable form.
The 2026 model: Agents + MCPs + Ontology
That rebuild-everything approach is no longer the only way forward. The combination of AI agents, MCP connectors, and a shared ontology layer has changed the calculus entirely.
MCPs (Model Context Protocol connectors, described in the stack section below) give agents live, permissioned access to the systems of record (e.g., Veeva, Medidata, safety databases, lab systems) without requiring those systems to be rebuilt or re-exported. Agents can probe data structures in real time, surface schema inconsistencies, and propose mappings that would have taken a data engineering team weeks to produce manually.
The ontology layer then does something the agent alone cannot: it provides validated, expert-maintained semantic grounding. When an agent discovers that one system uses "MI" and another uses "myocardial infarction," it doesn't have to guess or hallucinate a relationship. The ontology confirms it, defines the hierarchy, and makes that resolution reusable across every future query.
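A minimal sketch of that division of labor, with hypothetical data and function names rather than any particular platform's API: the agent proposes that two field labels mean the same thing, and the ontology either confirms the proposal, rejects it, or routes it to a human reviewer instead of letting the agent guess.

```python
# Hypothetical synonym-to-concept index supplied by the semantic layer.
# Concept IDs are illustrative; verify codes against the ontology itself.
CONCEPT_OF = {
    "mi": "SCT:22298006",
    "myocardial infarction": "SCT:22298006",
    "hypertension": "SCT:38341003",
}

def review_proposed_mapping(term_a: str, term_b: str) -> dict:
    """Check an agent-proposed equivalence against the ontology rather than trusting it."""
    a = CONCEPT_OF.get(term_a.strip().lower())
    b = CONCEPT_OF.get(term_b.strip().lower())
    if a is None or b is None:
        return {"status": "needs_human_review", "reason": "term not found in ontology"}
    if a == b:
        return {"status": "confirmed", "concept_id": a}
    return {"status": "rejected",
            "reason": f"{term_a!r} and {term_b!r} map to different concepts"}

# The agent noticed that two systems use different labels; the ontology settles it.
print(review_proposed_mapping("MI", "Myocardial infarction"))
# {'status': 'confirmed', 'concept_id': 'SCT:22298006'}
```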
The result is a compounding flywheel. Agents accelerate the discovery of data relationships. MCPs make live integration possible without rearchitecting. Ontologies make the semantic decisions durable and trustworthy. Together, the combination routinely collapses timelines that used to stretch across quarters into days or weeks.
This is the new baseline for 2026. Organizations still treating data harmonization as a prerequisite waterfall phase (build the schema, clean the data, then start the AI work) are operating on a model that has been superseded. The modern approach runs these in parallel, with agents doing the heavy lifting of exploration and the ontology layer providing the semantic guardrails that keep the outputs reliable.
The formula is straightforward: Agent + MCP + Ontology = 10x. Not because any one of the three is revolutionary on its own, but because each removes a different bottleneck that previously made the others ineffective.
The stack: where ontologies sit in modern clinical infrastructure
In a modern clinical development data architecture, the ontology layer does not replace your systems of record. It sits between them, making them coherent.
At the top sit the AI agents and clinical intelligence tools: signal detection, narrative generation, protocol intelligence, all consuming semantically enriched data. It's important to remember that the agents take action and that communication is a two-way street. For example, an AI agent creates a note, a human-in-the-loop CRA reviews it, and the note is sent to the site and to the TMF to be filed in the correct place.
Directly beneath them is the semantic layer, where ontologies live. This is where concept mappings and annotation services operate, making everything else interoperable.
Below that is the orchestration layer: integration middleware, API gateways, and data pipelines that route data between systems using shared terminology. These are commonly called MCPs or connectors.
Then come the systems of record: Veeva Vault, Medidata Rave, safety databases, EDC, and EHR, the authoritative sources that already exist and that organizations have invested years building.
At the base sit the raw data sources: real-world data, lab systems, genomics platforms, and biomarker data, all heterogeneous, messy, and valuable.
Here, the semantic layer is not a replacement for existing systems. Instead, it is what lets those systems, including Veeva, Medidata, and your internal AI tools, actually understand each other.
Organizations that try to replace their stack wholesale face enormous disruption with uncertain payoff. Those that add an ontology-driven semantic layer on top of existing infrastructure can augment what they have rather than starting over.
What this means for AI in clinical development
The intersection of ontologies and AI in life sciences is where the real leverage lives right now. Clinical AI agents, including systems that can reason over protocol data, monitor safety signals, support site feasibility decisions, or draft regulatory narratives, are only as good as their underlying semantic understanding of the clinical domain.
The organizations building durable AI capabilities in clinical development are not doing it by fine-tuning models on proprietary data in isolation. They are investing in the semantic layer: consistent ontology-based annotation of their clinical data assets, standardized concept mappings across their systems of record, and AI architectures that can leverage that structure.
The practical implication is straightforward. When evaluating AI platforms and clinical agents for your organization, the right question is not just "what does this agent do?" It is "does this agent understand my data the way my domain experts understand it, and does it fit into the semantic infrastructure I already have, or does it require me to rebuild?"
Ontologies are not glamorous. They do not make headlines. But they are the hidden translator that determines whether your AI investments compound over time or stay siloed. In life sciences, where the data spans molecules to patients, genomics to regulatory submissions, and basic research to post-market surveillance, that layer matters more than almost anything else you can build.