Discovery health information model

From Discovery Data Service

The Discovery health information model is a semantically interpreted model of the data held in Discovery. The model covers three aspects:

  1. Model of the data stored
  2. Model of the means of access to retrieve data both from the model itself and the data stored, i.e. various forms of query
  3. Model of the means of adding data to the store and to the model.

The model is designed both for human visualisation and for computers to use. More precisely, the model can be considered as a set of modular models, each depending on the business purpose, with a 'common model' encompassing data from all the modular models.

This article describes the meta data model of the information model (and does not include the content of any particular model). The article refers to the languages that may be used to access the model, using either interoperability standards or a pragmatic approach. The language itself is described in the article introducing the health modelling language.


Types of data as a graph

The data stored in a health record can be visualised as a graph, and a model of that data can be visualised as a graph of types, as on the right hand side. In this type of approach an entry in a record is represented as a node, with the type of the node as a label on the node.

Attributes of each record are either relationships with other records (e.g. a person's address) or data properties. Properties of records that are themselves concepts would be data nodes in a pure graph, and in a property graph may be data properties with references to the concept operating as a particular node.
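The graph view described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Discovery implementation: the `Node` class, the labels, the property names and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A record entry; the label carries the type of the node."""
    label: str
    properties: dict = field(default_factory=dict)      # data properties
    relationships: list = field(default_factory=list)   # (predicate, Node) pairs

    def relate(self, predicate: str, target: "Node") -> None:
        """Add a relationship from this record to another record."""
        self.relationships.append((predicate, target))


# Hypothetical record content, for illustration only.
patient = Node("Patient", {"dateOfBirth": "1970-01-01"})
address = Node("Address", {"postcode": "AB1 2CD"})
patient.relate("hasAddress", address)

# A relationship leads to another record; a data property is a plain value.
linked = [(predicate, node.label) for predicate, node in patient.relationships]
print(linked)  # [('hasAddress', 'Address')]
```

A property graph database would hold the same information natively; the sketch only shows the distinction between relationships and data properties.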

In order to model data there is a need to model the thing that models the health data, i.e. the model's meta model. This article is about the meta model.

IM main structural types as classes
Predicates as relationships

Like the health data itself, the model can likewise be modelled as a set of classes with prescribed properties using a conventional UML approach. This enables a conventional object model to be created, thus supporting OO languages such as Java or C#. See left hand diagram.

Alternatively the meta model can be accessed as a graph or set of triples, with predicates that may have some form of semantically defined specialised purpose (such as a subclass or intersection or value set member), or with relationships that are part of the definitions of concepts (such as causative agent). (See right hand side node model)
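As a rough illustration of the triple view, the sketch below stores a handful of triples and matches them by pattern. The identifiers (`Asthma`, `memberOf`, `ChronicDiseaseValueSet` and so on) are invented for the example and are not taken from the Discovery model.

```python
# Each triple is (subject, predicate, object). All identifiers are illustrative.
triples = [
    ("Asthma", "subClassOf", "RespiratoryDisorder"),         # semantic predicate
    ("RespiratoryDisorder", "subClassOf", "Disorder"),       # semantic predicate
    ("Asthma", "memberOf", "ChronicDiseaseValueSet"),        # value set membership
    ("BacterialPneumonia", "causativeAgent", "Bacterium"),   # definitional relationship
]


def match(s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


print(match(p="causativeAgent"))
# [('BacterialPneumonia', 'causativeAgent', 'Bacterium')]
```

The same pattern-matching idea underlies query languages over triples, although a real store would index the triples rather than scan a list.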

Information model APIs and languages

For an information model to be useable, it has to be accessible in some way. The means of accessing an information model is via the use of a language i.e. an information modelling language and this is described in a separate article. The language assumes a graph representation of the model and uses RDF concepts as its basis.

IM Service architecture

For an information model to be useful, it has to have at least one information model service, i.e. an operational service that provides access to one or more information models. A service must provide a set of APIs as well as provide instances of the model for implementations to use directly should they wish to.

The diagram on the right shows a tiered architecture for such a service. Information model APIs are described in a separate article.

All implementation code, including the evolving service, APIs, language grammars and object models, is also available on GitHub in the following repositories:

Information model purposes and functions

Information models have 4 core functional requirements internal to the model: description of the model, validation of model content, population of the model, and query of the model. In support of query there is also the need to support inference, which generates new insights that were not necessarily authored.

In addition the information model must support the same 4 core functional requirements on actual health data that is modelled.

Systems that use the models can use any or all of three approaches:

  1. Direct use of the model data content as a database (or set of files that can populate a database via script)
  2. Use via a set of APIs (both local and remote) designed to provide access to the data within the model, or to trigger outputs of the model for 1)
  3. Use of the information model technologies themselves via the use of the published open source code

The main functional purposes of an information model are described further below:

  • Description of the model. There is little point in having a model unless it can be described and understood. Knowing what is in a model is a pre-requisite to using it. For example, there is no point in trying to find out if a patient record indicates whether or not they have diabetes if the model doesn't include the ability to record it. In order to understand a model, two techniques are required: diagrammatic representation and human readable text representation. A model must support both.
  • Data validation is essential for consistent business operations. Data models, user input forms, and data set specifications are designed to enable data collections to be validated, and maintaining a standard for data collection is essential. For example, if you have a patient record in front of you, you will likely need to know the patient's approximate age. To work this out, the date of birth must be recorded, so validating that the date of birth can be and has been recorded is important. However, if more than one date of birth were recorded for the same patient, the record would be less valuable. Thus a modelling language must include the ability to constrain data models to suit particular business needs as part of validation, even when the underlying data model allows more than one value.
  • Population of the model. It is impractical to build model content from scratch and likewise virtually impossible to populate instances with existing data without some manipulation. An information model must contain the ability to model mappings between currently held data and model conformant data.
  • Enquiry (or query) is necessary to generate information from data. There is little point in recording data unless it can be interrogated and the results of the interrogation acted upon. Thus a modelling language must include the ability to query the data as defined or described, including the use of inference rules to find data that was recorded in one context for use in another.
  • Inference is pivotal to decision making. For example, if you are about to prescribe a drug containing methicillin to a patient, and the patient has previously stated that they are allergic to penicillin, it is reasonable to infer that an allergic reaction might ensue if they take the drug, and thus another drug is prescribed. Thus a modelling language must include the ability to infer things and classify things so that safe decisions can be made.
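The prescribing example in the last bullet can be sketched as a subsumption test. This is only an illustration: the hierarchy, the concept names and the `allergy_alert` function are hypothetical, and a real implementation would rely on a reasoner over the ontology rather than a hand-written dictionary.

```python
# Hypothetical subclass hierarchy: child -> parent.
SUBCLASS_OF = {
    "Methicillin": "Penicillin",
    "Penicillin": "BetaLactamAntibiotic",
}


def is_a(concept, ancestor):
    """True if the concept is the ancestor itself or one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def allergy_alert(drug_ingredient, recorded_allergies):
    """Infer a possible reaction when the ingredient is subsumed by a recorded allergen."""
    return any(is_a(drug_ingredient, allergen) for allergen in recorded_allergies)


print(allergy_alert("Methicillin", ["Penicillin"]))  # True
print(allergy_alert("Paracetamol", ["Penicillin"]))  # False
```

The allergy was recorded against penicillin, not methicillin, so the alert depends entirely on the inferred relationship between the two concepts.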

Model structure

A model must be built from some structure, using some tools or processes to build it. This section describes the nature of the structure that makes up the information model. The tools used to build the model include an information modelling language, which is described separately.

A model must have a model i.e. a meta model that models the model i.e. the types of things a model is made up of. The types of things are defined, not so much from the perspective of structure, but from the perspective of purpose. The structure is a simple graph of nodes and relationships. However, the relationships have inbuilt semantics which means that the nodes connected to one type of relationship have a different purpose to the nodes connected to another.

For example, one of the categories of relationship (role) types is the ontological axiom. This means that the model can support reasoning using the open world assumption (as required by the semantic web). Other types of relationship, such as role expressions, can be used to represent known data relationships that exist in Discovery or to support data validation, and these use the closed world assumption, as required by the business of healthcare.

The main types are illustrated in the image above.

Interaction between the model and the external world is undertaken via either the Discovery information model language or a set of W3C recommended languages. These are described separately but consist of a language built from RDF triples, applying where necessary the W3C language grammars and vocabularies of OWL2, SHACL, ShEx and SPARQL, with support for GraphQL and ECL.

The following sections briefly describe the various model classes illustrated above. The IM language specification provides more insight into the details of the classes.


Concept

All things that can be referenced via an identifier can be thought of as a concept. Even the classes and structures of the information model themselves are concepts.

A concept is defined as an ‘abstract idea’ or ‘general understanding of something’ and this meaning is preserved in the modelling language. It is one of the few abstract classes in the information model. This means that there is no actual object of 'type concept' unless it is also a type of some subtype of a concept.

Types of concept include: Class, Property, Shape, Value Set, Data type, Query, Collection, Term, and Annotation. Each of these specialises in its function and properties, inheriting the core properties of a concept and specialising by extension.

Use of aliases. Aliases enable properties and classes to be used in their alias form. See the language specification for how context is used to provide aliases, so that key terms can be used in business processes without the inconvenience of using IRIs. These sections therefore use aliases by convention, the aliases themselves being defined as aliases to concepts.

In this section, aliases that are instantiated as one of a number of alternatives are enclosed in { }, e.g. {axiom} in a class refers to subclass, equivalent, or disjoint.

A concept also comes with a fixed set of annotation properties that can be relied on either to be present or to have null values.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| iri | 1 | IRI | An international identifier, in the format described in the language specification |
| status | 1 | Status type | A status concept representing the activity status of the concept, e.g. active or inactive |
| name | 1 | String | The full name of the concept (or the preferred term in Snomed-CT). In OWL2 this is a label annotation |
| description | 1 | String | A plain language meaning of the concept, and how it may be used |
| version | 1 | Integer | The version in which this concept was first created |
| code | 0..1 | String | If the concept has a code, the code assigned to this concept by the original creator, e.g. a Snomed-CT, READ2, ICD10, OPCS, local or auto generated code |
| scheme | 0..1 | IRI | If the concept has a code, the code scheme assigned to this code, the scheme itself being an IRI |
| termKey | 0..* | String | A number of keys used to link to the concept. Not to be confused with a term concept, which is an alternative term linked to a concept |
| annotation | 0..* | Annotation | Additional informative simple string properties used for a variety of business purposes |
| alias | 0..* | String | Aliases for this concept, i.e. reserved terms within the context of a particular application that implements the information model and wishes to use aliases rather than IRIs |
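One way to read the table above is as a class definition in which the cardinalities become required, optional and list-valued fields. The sketch below is an assumption about how that might look in Python; the field names are adapted from the aliases, the example IRI is hypothetical, and the rule tying `scheme` to `code` is inferred from the table wording. In the model itself a concept is abstract, so a real implementation would only instantiate subtypes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    # Cardinality 1: always present.
    iri: str
    status: str
    name: str
    description: str
    version: int
    # Cardinality 0..1: optional single values.
    code: Optional[str] = None
    scheme: Optional[str] = None
    # Cardinality 0..*: zero or more values.
    term_keys: List[str] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)
    aliases: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Assumed rule: a scheme is only meaningful alongside a code.
        if self.scheme is not None and self.code is None:
            raise ValueError("scheme given without a code")


# Example instance; the IRI is a made-up placeholder.
diabetes = Concept(
    iri="http://example.org/im#DiabetesMellitus",
    status="Active",
    name="Diabetes mellitus",
    description="A disorder of glucose regulation",
    version=1,
    code="73211009",
    scheme="http://snomed.info/sct",
)
print(diabetes.name, diabetes.aliases)  # Diabetes mellitus []
```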

Ontological Class

An ontological class is an extension of the concept class and is used as the main means for defining semantic concepts that are classes of objects for use in healthcare records.

The difference between an ontological class (often referred to as an OWL class) and a simple concept is that it can be semantically defined by the use of class axioms. Class axioms such as subclass or equivalent class are used for reasoning (inferencing and classification) and enable the information model to be queried using subsumption queries.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| type | 1 | Class | A type of concept that is a class for the purposes of ontological definition |
| {axiom} | 0..* | Class axiom | An axiom normally used to define a class, e.g. subclass, equivalent class, disjoint classes |

Ontological property

An ontological property is an extension of the concept class and is used as the main means for defining semantic concepts that are used as properties or predicates. The difference between this and a class is that properties themselves cannot have properties. Nevertheless, the use of property axioms to define properties makes them very powerful. Sub properties are included in subsumption tests on classes, and property axioms can also link properties that operate in reverse directions in a graph.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| type | 1 | Property | A type of concept that is a property or predicate, and used throughout the model. This includes most of the reserved tokens used in the IM classes and the IM language itself |
| {axiom} | 1..* | Property axiom | An axiom normally used to define a property, including domain, range, sub property, and whether it is transitive, the inverse of another property, etc. |

Value set

This is a specialised class that defines and holds a collection of concepts, those concepts not necessarily being related by subclass relationships.

A value set member is a definition of a concept which is defined by a simple form of query called an expression constraint, which is a definition of a collection of classes as described below. A value set without members can be used as a means of inferencing subclass value sets.

A value set like any other class can be ontologically defined e.g. as a subclass of another value set and thus if a value set has members defined then the subclass members would be subsets of the superclass members. Conversely, when selecting a value set that has no members in a query, and that value set has subclasses, then the inference engine would include all the members of all of the subclasses.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| type | 1 | "ValueSet" | A type of concept that is used as a value set, a specialised class for defining concepts in a query |
| subClassOf | 0..* | Class expression | A value set may be a subclass of another value set |
| member | 0..1 | Expression constraint | A specialised form of query that defines a collection of classes that would be subsumed when the query is run |
| expansion | 0..* | IRI | A list of concept identifiers produced by inference from the member definition or, in the absence of a member definition, simply a list of concepts |
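The behaviour described above, where selecting a value set without members pulls in the members of its subclass value sets, can be sketched as follows. The value set names and member codes are hypothetical, and for simplicity the members are given as ready-made lists rather than as expression constraints.

```python
# Hypothetical value sets, each with optional parents and optional direct members.
VALUE_SETS = {
    "DiabetesCodes": {"subClassOf": [], "members": []},
    "Type1DiabetesCodes": {
        "subClassOf": ["DiabetesCodes"],
        "members": ["T1DM", "T1DMWithKetoacidosis"],
    },
    "Type2DiabetesCodes": {"subClassOf": ["DiabetesCodes"], "members": ["T2DM"]},
}


def expand(name):
    """Direct members of a value set plus the members of all its subclass value sets."""
    result = set(VALUE_SETS[name]["members"])
    for other, definition in VALUE_SETS.items():
        if name in definition["subClassOf"]:
            result |= expand(other)
    return result


print(sorted(expand("DiabetesCodes")))
# ['T1DM', 'T1DMWithKetoacidosis', 'T2DM']
```

`DiabetesCodes` has no members of its own, so its expansion is made up entirely of what its subclasses contribute.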

Expression constraint

An expression constraint is a specialised query describing a set of concepts using class expressions and Boolean logic, i.e. it describes the attributes that a concept must have to be included in or excluded from the set. Because ontological classes are defined as being things with certain properties and values or value types (attribute value pairs), the definition can include simple constraints such as something being a subclass of another concept using inference.

An expression constraint is a string written in one of the IM supported language grammars, e.g. Discovery expression constraint, SPARQL fragment, or Snomed-CT Expression Constraint Language.
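To make the combination of subsumption and Boolean logic concrete, the sketch below evaluates a constraint held as nested tuples. This is not any of the supported grammars; the tuple form, the operator names and the small hierarchy are assumptions chosen to keep the example self-contained.

```python
# Hypothetical hierarchy: child -> parent (None marks the root).
SUBCLASS_OF = {
    "Type1Diabetes": "Diabetes",
    "Type2Diabetes": "Diabetes",
    "GestationalDiabetes": "Diabetes",
    "Diabetes": "Disorder",
    "Asthma": "Disorder",
    "Disorder": None,
}


def is_a(concept, ancestor):
    """True if the concept is the ancestor itself or one of its descendants."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def evaluate(constraint, concept):
    """Evaluate a nested (operator, operand, ...) constraint against one concept."""
    operator, *operands = constraint
    if operator == "isA":
        return is_a(concept, operands[0])
    if operator == "and":
        return all(evaluate(c, concept) for c in operands)
    if operator == "or":
        return any(evaluate(c, concept) for c in operands)
    if operator == "not":
        return not evaluate(operands[0], concept)
    raise ValueError(f"unknown operator: {operator}")


# All kinds of diabetes except gestational diabetes.
constraint = ("and", ("isA", "Diabetes"), ("not", ("isA", "GestationalDiabetes")))
selected = sorted(c for c in SUBCLASS_OF if evaluate(constraint, c))
print(selected)  # ['Diabetes', 'Type1Diabetes', 'Type2Diabetes']
```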


Collection

A collection is a constraint of a concept in that the concept type is one of the collection subtypes.

Collection subtypes are either lists or sets, and lists may be ordered or unordered. Lists such as folders are used to initiate user navigation of the model. Collection contents have no inherent relationship with a collection concept. N.B. Collections in this context should not be confused with the collection construct used in the language.

A collection is defined by its type, e.g. a folder.

Record type (data model)

A record type extends a concept and is the mainstay of the data modelling section of the information model.

The Discovery data model is not a pre-authored model. Instead it is a model that is responsive to the data that is stored in Discovery. When responding to new data, the first step is to semantically translate the data in a way that makes access to the same concept easier. For example, a PAT.DOB and a PATIENT.DATE_OF_BIRTH might both map to a record type of 'Patient' with a property of 'Date of birth' which has a data type of 'Date Time'.

A record type is a fairly simple graph structure, and thus has a simple vocabulary and avoids the use of business level rules to constrain the model. For example, a business rule would dictate that a patient record will have one and only one date of birth and might have zero or one date of death. The record type model does not assume this, however. Constraints are delegated to the implementation model, with the use of constraints based on other meta data. For example, in one implementation it is possible that a person may have several dates of birth (recorded perhaps by different people). Shape expressions, based on the record type or independent of it, may be used for this purpose.
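The semantic translation step can be illustrated with a small mapping table. The sketch below is an assumption about how such a mapping could be applied; `FIELD_MAP` and `translate` are invented names, and values are held as lists because the record type itself places no limit on how many values a property may have.

```python
# Hypothetical mapping: (source table, source field) -> (record type, property).
FIELD_MAP = {
    ("PAT", "DOB"): ("Patient", "Date of birth"),
    ("PATIENT", "DATE_OF_BIRTH"): ("Patient", "Date of birth"),
}


def translate(table, row):
    """Translate one source row into record type properties."""
    record = {}
    for field_name, value in row.items():
        target = FIELD_MAP.get((table, field_name))
        if target is None:
            continue  # unmapped fields are ignored in this sketch
        record_type, prop = target
        # No cardinality limit here; constraints belong to shapes, not record types.
        record.setdefault(record_type, {}).setdefault(prop, []).append(value)
    return record


print(translate("PAT", {"DOB": "1970-01-01"}))
# {'Patient': {'Date of birth': ['1970-01-01']}}
```

Both source layouts end up as the same record type and property, which is what makes later access to the concept uniform.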

Shapes (validation and implementation mapping)

A shape dictates the properties and values used in a set of business oriented data stores, i.e. it defines and constrains the properties for particular purposes. A shape seems on the surface to be similar to a semantic class, i.e. the properties described in a shape are all properties that one would expect to be properties of a class (e.g. date of birth as a property of a person). However, a shape is designed to be more prescriptive and "closed world". Consequently a shape can be used to define a database schema, to define a message schema, and to validate data content.

To illustrate the difference between a shape and a standard class, take the modelling of a human being or person.

From a data model perspective we may wish to model a common record type of a person, where one of the properties of the common person record is a date of birth, which is a date time of a kind that allows approximate, year only dates. When modelled as a shape for a particular business purpose, the person record is referred to as the "target class" of the shape. The shape may be constrained in that it might allow only ONE date of birth and may allow only precise dates.

The binding of the common model to an implementation model may be based on a business specific model. For example, a standard FHIR patient resource will map directly to the equivalent data model shape, which is itself constrained from the common patient record model.

The effect of this two step separation is to allow a clear understanding of the type of data held in the data model, from a logical business specific model through to implementation models for messaging or databases.

A common approach to modelling and the use of a standard approach to ontology, together with modularisation, means that any sending or receiving machine which uses concepts from the semantic ontology can adopt full semantic interoperability. If both machines use the same data model for the same business, the data may be presented in the same relationship. If the two machines use different data models for different businesses, they may present the data in different ways, but without any loss of meaning or query capability.

The integration between a common record model, specific data model shapes and ontological concepts makes the information model the single most important contributor to semantic interoperability.

The shape constraint language is the major part of the modelling language and is based on the W3C SHACL language.

A shape is a shape of something, i.e. it has a target class or a target property, and thus a shape is by default a class shape or a property shape. The connection between a shape and a corresponding class (e.g. shape of a person) brings together the semantic ontology and the data modelling. Class shapes contain property shapes, which may be embedded in line rather than defined separately.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| type | 1 | Shape | This is a shape class |
| targetClass | 0..1 | Class | What the shape is a shape of. In the health information model the alias 'record of' is likely to be used |
| property | 0..* | Property constraint | A restriction or constraint on a property, e.g. cardinality, value type or date range |

Property Shape

A property shape (normally used within a shape to constrain its properties) consists of a property path, which might be a single property, a sequence of properties, or alternative properties, and a set of constraints such as minimum and maximum cardinality or data range or, in advanced modelling, a query or function.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| path | 1 | Property path | One of the property path patterns supported by TURTLE / SPARQL |
| {constraint} | 0..* | Property constraint | The constraints on this property, with predicates taken from the SHACL property constraint vocabulary |
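The difference between an unconstrained record type and a shape can be sketched with a small validator. The classes below are hypothetical and cover only minimum and maximum cardinality (which SHACL expresses as `sh:minCount` and `sh:maxCount`); they are not a SHACL implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PropertyShape:
    path: str                        # a single property in this sketch
    min_count: int = 0
    max_count: Optional[int] = None  # None means unbounded


@dataclass
class Shape:
    target_class: str
    properties: List[PropertyShape]

    def validate(self, record: dict) -> List[str]:
        """Return a list of cardinality violations (empty when the record conforms)."""
        errors = []
        for shape in self.properties:
            count = len(record.get(shape.path, []))
            if count < shape.min_count:
                errors.append(f"{shape.path}: expected at least {shape.min_count}, found {count}")
            if shape.max_count is not None and count > shape.max_count:
                errors.append(f"{shape.path}: expected at most {shape.max_count}, found {count}")
        return errors


# A business specific shape allowing exactly one date of birth.
person_shape = Shape("Person", [PropertyShape("dateOfBirth", min_count=1, max_count=1)])

print(person_shape.validate({"dateOfBirth": ["1970-01-01"]}))
# []
print(person_shape.validate({"dateOfBirth": ["1970-01-01", "1971-01-01"]}))
# ['dateOfBirth: expected at most 1, found 2']
```

The record with two dates of birth is perfectly legal as a record type instance; it only fails against this particular shape.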


Classification

The models include modular classifications of concepts. The classification modules are either generated from the ontology via classifiers (which are functions of reasoners) or have been incorporated as handcrafted classifications. Examples of ontology generated classifications are the Snomed-CT "ISA" hierarchy and the Discovery health classification. Examples of handcrafted classification modules are ICD10 and Read. The main difference between the two is that concepts in classifications generated from an ontology subsume their descendants as proper subtypes, whereas handcrafted classifications may include subcategories that are inconsistent.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| parent | 1 | IRI | References the concept of the parent entry in the classification |
| child | 1 | IRI | References the child concept in the classification |
| relationship | 1 | IRI | References the relationship type between the child and parent, e.g. 'is a' or 'has parent' |

To illustrate the difference between an ontology and a classification, let us say that we state in an ontology that "ALL THINGS THAT BARK ARE DOGS".

Let us then say we go to the beach at Ravenscar on the North East coast of England and hear a bark. We see the animal at a distance and ask the computer what it is. The computer, using the generated classification, would automatically classify that animal as a DOG (because it barks).

However, as we get closer we see that it is something else. The ontology is clearly incorrect. Consequently we amend the ontology to state that there is such a thing as "AN ANIMAL THAT BARKS" and that "A DOG IS A SUBCLASS OF AN ANIMAL THAT BARKS". We also state that there is such a thing as an "ANIMAL" and that "AN ANIMAL THAT BARKS IS A SUBCLASS OF AN ANIMAL". Now, when asked what the animal is, the computer knows only that it is an animal that barks, but does not know what kind of animal it is.

We then amend the ontology to state that there is such a thing as a SEAL and that a "SEAL BARKS". We also author the ontology to say that "AN ANIMAL THAT BARKS IS EQUIVALENT TO A THING THAT BARKS", i.e. by definition, if it barks it is an animal that barks. Now the computer automatically classifies the seal, stating that "A SEAL IS AN ANIMAL THAT BARKS" (because it is a thing that barks and must therefore be an animal that barks). The seal will then be found when searching for types of animals that bark, things that bark, and things that are animals, i.e. the seal MUST be an animal because all things that bark are animals that bark, and an animal that barks is a subclass of an animal. The human has not needed to find the category. The reasoner does it for you: it has automatically created the classification from the properties of the thing on the beach.
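The final state of the barking example can be sketched as a toy classifier. This is a deliberately small illustration with invented data structures; a real reasoner handles far richer axioms than a single set of defining properties.

```python
# Asserted subclass axioms from the final version of the example.
SUBCLASS_OF = {
    "Dog": {"AnimalThatBarks"},
    "AnimalThatBarks": {"Animal"},
    "Seal": set(),    # nothing is asserted about what a seal is
    "Animal": set(),
}
# Asserted properties: "SEAL BARKS".
PROPERTIES = {"Seal": {"barks"}}
# Equivalence axiom: anything with these properties belongs to the class.
EQUIVALENT_TO = {"AnimalThatBarks": {"barks"}}


def classify():
    """Add inferred subclass links for concepts that satisfy an equivalence axiom."""
    inferred = {concept: set(parents) for concept, parents in SUBCLASS_OF.items()}
    for concept, props in PROPERTIES.items():
        for defined_class, required in EQUIVALENT_TO.items():
            if concept != defined_class and required <= props:
                inferred[concept].add(defined_class)
    return inferred


def ancestors(concept, hierarchy):
    """All classes reachable by following subclass links upwards."""
    found = set()
    stack = list(hierarchy[concept])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(hierarchy[parent])
    return found


print(sorted(ancestors("Seal", SUBCLASS_OF)))  # []
print(sorted(ancestors("Seal", classify())))   # ['Animal', 'AnimalThatBarks']
```

Before classification the seal has no known parents; afterwards it is found under both "animal that barks" and "animal" without anyone having authored those links.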


Data type

A data type extends a concept by defining a new simple data type via a constraint on another data type using the data definition grammar from the IM language.

Examples include formats of codes or numbers using "Regex" patterns or perhaps ranges of values or dates when a data type is used in multiple places in the model.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| type | 1 | Concept.DataType | The concept is a data type |
| definition | 1 | Data type definition | The definition of the data type using IM grammar |
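A pattern-constrained data type of the kind mentioned above could be sketched as follows. The `DataType` class and the ten digit identifier are hypothetical; the actual definition grammar is specified in the IM language, not here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DataType:
    name: str
    base: str      # the data type being constrained
    pattern: str   # regular expression the whole value must match

    def is_valid(self, value: str) -> bool:
        """True when the entire value matches the pattern."""
        return re.fullmatch(self.pattern, value) is not None


# Hypothetical data type: an identifier of exactly ten digits.
ten_digit_id = DataType("TenDigitIdentifier", "string", r"[0-9]{10}")

print(ten_digit_id.is_valid("1234567890"))  # True
print(ten_digit_id.is_valid("12345"))       # False
```

Defining the constraint once as a data type means every property that uses it is validated in the same way.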


Query

A query concept is the most sophisticated of the IM classes, as it incorporates a query definition written in the query definition language, which is a SPARQL fragment or an expression constraint.

| Property alias | Cardinality | Type | Description |
|---|---|---|---|
| query | 1 | Query definition | A query definition using a profile of SPARQL |

Information model storage architecture

Figure: IM logical object model

An information model is an abstract representation of data, but an information model must have content and that content must be stored.

Data cannot be stored conceptually, only physically, and thus there must be a relationship between the abstract model and a physical store.

In the information model services, the abstract model is instantiated as a set of objects of classes, the data elements of those classes holding the subject, predicate and object structures. In reality those objects, together with translation and data access methods, are instantiated in some form of language, e.g. Java.

The physical store is currently held in a triple-like relational database accessed by a relational database engine, but could easily be stored as a native graph.
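To show what a triple-like relational store might look like, the sketch below uses Python's built-in `sqlite3` module with a single three-column table. The table name, column names and sample rows are assumptions for illustration and do not describe the actual Discovery schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triple ("
    "subject TEXT NOT NULL, predicate TEXT NOT NULL, object TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [
        ("Seal", "subClassOf", "AnimalThatBarks"),
        ("AnimalThatBarks", "subClassOf", "Animal"),
        ("Seal", "name", "Seal"),
    ],
)


def subclasses_of(connection, ancestor):
    """Return all descendants of a class via a recursive query over subClassOf rows."""
    rows = connection.execute(
        """
        WITH RECURSIVE descendant(iri) AS (
            SELECT subject FROM triple
            WHERE predicate = 'subClassOf' AND object = ?
            UNION
            SELECT t.subject FROM triple t
            JOIN descendant d ON t.object = d.iri
            WHERE t.predicate = 'subClassOf'
        )
        SELECT iri FROM descendant ORDER BY iri
        """,
        (ancestor,),
    ).fetchall()
    return [row[0] for row in rows]


print(subclasses_of(conn, "Animal"))  # ['AnimalThatBarks', 'Seal']
```

The recursive query does in SQL what a native graph store would do by traversing edges, which is why the same content could be moved to a graph database without changing the model.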

The model can then be used as the source and target of the exchange of data, the latter using a language interoperating via a set of APIs.

This can be visualised as in the diagram on the right. It can be seen that the inner physical store, is accessed by an object model layer, which is itself accessed by APIs using modelling language grammar and syntax. The diagram shows the main grammars supported by the Discovery information model, including the Discovery information modelling language grammar itself.

Support for the main languages means that a Discovery information model instance has 2 levels of separation of concerns between the languages used to exchange data and the underlying model store. There is thus no need to buy into the Discovery language in order to use the information model.

Likewise, an implementation of objects that hold data in a form that is compatible with a particular data model and ontology module, can be accessed using the same language.

This makes the language just as useful for exchanging query definitions and value sets as it is for actual query of health record stores via interpreters.

The remainder of this article describes the language itself, starting with some high level sections on the components, and eventually providing a specification of the language and links to technical implementations, all of which are open source.

Ontologies and modules

The Discovery common information model can be thought of as an ontology of ontologies. More precisely though it should be considered as an ontology consisting of a set of ontology modules with each module defined according to business needs. The principle of concept sharing, whereby one concept is identified once across the entire set of domains, suggests that there is a single ontology. However a data model that is specified for a particular business purpose may have different class structures from another business purpose even though they share the same semantic definitions.

For example, take the idea of recording information about a blood pressure. This is an example of a component in a data model. In general practice, it would be common to record a systolic and a diastolic blood pressure, and thus the component would consist of 3 classes. However, in a specialist research study involving different interpretations of blood pressures, including perhaps the size or nature of the cuff, or the exact position of the patient, this component may be more complex.

This is addressed by modularisation where the axioms that define the classes belong to a particular model, even though the property domains and their ranges are shared across the ontology. This is analogous to the idea of templates derived from subsets of archetypes. The difference is that there is no "super-archetype" requiring international agreement on the items in the archetype, but instead there is a demand that the same identifier of the diastolic blood pressure record class is used throughout, even though the class definition is business specific.