Data Storage Architecture

From Discovery Data Service

This article describes the Discovery data storage architecture. The model presented here is a high-level overview and avoids drilling down into technical elements of the storage software, which, as one would expect, is constantly evolving.

The objective of this architecture is to align the physical aspects of data storage with governance and ownership arrangements in a way that enables an à la carte approach to some elements of the service. This allows a range of organisational groupings to use the service, from a single small organisation such as a rural GP practice through to a large-scale integrated care system.

Data storage tiers

The data storage architecture consists of four types of zones, or "tiers", each with a distinct purpose.

Data is deliberately duplicated across the tiers, in line with the modern approach to cloud data storage. From a logical perspective, each element of data is likely to be present in at least three, if not all four, data zones.

Tier 1 - Core Data zone

The core data zone is a multi-tenanted single cloud instance with horizontally scalable data processing and storage technologies.

It is one of the two "publisher zones", in which all data remains under the control of the provider's data controller and cannot move out without explicit permission from that controller. Those permissions are configured through machine-readable data sharing agreements and project-level configurations.
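The permission rule above can be illustrated with a minimal sketch. The article does not describe the actual agreement format, so the `SharingAgreement` structure, the `may_release` function and all organisation and project names below are hypothetical; the only point shown is that release is denied unless an agreement explicitly allows it.

```python
from dataclasses import dataclass


# Hypothetical structure: the real machine-readable agreement format
# is not described in this article.
@dataclass(frozen=True)
class SharingAgreement:
    publisher: str       # organisation acting as data controller
    subscriber: str      # organisation requesting the data
    projects: frozenset  # projects the controller has approved


def may_release(agreements, publisher, subscriber, project):
    """Return True only if some agreement explicitly permits the release."""
    return any(
        a.publisher == publisher
        and a.subscriber == subscriber
        and project in a.projects
        for a in agreements
    )


agreements = [
    SharingAgreement(
        "gp-practice-a", "regional-analytics", frozenset({"diabetes-register"})
    ),
]

print(may_release(agreements, "gp-practice-a", "regional-analytics", "diabetes-register"))
print(may_release(agreements, "gp-practice-a", "regional-analytics", "frailty-study"))
```

With no matching agreement, `may_release` returns `False`, which mirrors the default position that data stays under the publisher's control.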

In this zone, data is received from publishers in a huge variety of formats and with varying latency from the time of entry into provider systems. The data is processed and citizen-related data is linked by a common identifier. The data is then transformed into a common model and stored for onward distribution to a number of subscribers. The common model consists of a super-ontology of ontologies built around a core ontology for clinical concepts (based on SNOMED CT), and a data model based on an extensible model of archetypes, with FHIR entities as the baseline specification of structures.
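As a rough illustration of the receive, link and transform steps, the sketch below maps records from two publishers onto one shared shape and groups them by a common person identifier. The field names, source names and codes are all invented for the example; the real common model (ontology and FHIR-based archetypes) is far richer than this.

```python
from collections import defaultdict

# Hypothetical inbound records from two publishers, each in its own local format.
inbound = [
    {"source": "gp-system", "person": "P001", "code": "LOCAL-A1", "term": "Diabetes type 2"},
    {"source": "hospital", "patient_ref": "P001", "diag_code": "LOCAL-B7", "diag_text": "T2DM"},
    {"source": "gp-system", "person": "P002", "code": "LOCAL-A9", "term": "Asthma"},
]


def to_common(record):
    """Map a publisher-specific record onto one shared shape, keeping the original code and text."""
    if record["source"] == "gp-system":
        person, code, text = record["person"], record["code"], record["term"]
    elif record["source"] == "hospital":
        person, code, text = record["patient_ref"], record["diag_code"], record["diag_text"]
    else:
        raise ValueError(f"unknown source: {record['source']}")
    return {
        "person_id": person,
        "original_code": code,
        "original_text": text,
        "publisher": record["source"],
    }


def link_by_person(records):
    """Group common-model records under the shared person identifier."""
    linked = defaultdict(list)
    for record in records:
        linked[record["person_id"]].append(record)
    return dict(linked)


linked = link_by_person(to_common(r) for r in inbound)
print(sorted(linked))
```

Here the two records for `P001` end up under one key even though they arrived from different publishers in different layouts.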

Beyond the above, the core data zone has only one additional purpose: to serve data to tier 2 as soon as possible after receipt, i.e. it is in effect a transit store.

Tier 2 - Regional data zone

A region is defined as an organisation, or group of organisations, that collectively provide a governance structure for determining data access and use across a population. There is no lower or upper limit to the size of a region; typically a region may be an integrated care system or an integrated care provider. In London, a region is the size of a large CCG or STP.

Tier 2 is the second of the two publisher zones. The data in a regional zone remains under the control of the publisher acting as data controller.

The data in the regional zone is stored using technologies that are optimised for data extraction and processing (i.e. query), at the level of either a person or a population. Within Discovery, each person's record is stored in ONE region only, normally the region that covers the person's main residence and, for an NHS patient, their main General Practice. Thus data published through the core data zone by a provider outside the region is 'repatriated' to the regional store.
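The "one region per person" rule and the repatriation of out-of-region data can be sketched as a simple routing step. The `HOME_REGION` lookup, the region names and the event fields are assumptions made for illustration; how Discovery actually determines a person's home region is not specified here beyond residence and registered practice.

```python
# Hypothetical lookup from person to home region
# (derived in practice from main residence and registered General Practice).
HOME_REGION = {"P001": "region-north", "P002": "region-south"}


def route_to_region(event, home_region=HOME_REGION):
    """Return the single regional store that should hold this event."""
    return home_region[event["person_id"]]


def file_events(events, home_region=HOME_REGION):
    """File each event in the person's home region, wherever it was published."""
    stores = {}
    for event in events:
        stores.setdefault(route_to_region(event, home_region), []).append(event)
    return stores


events = [
    # Published by a provider in the south, but P001 lives in the north.
    {"person_id": "P001", "published_in": "region-south", "concept": "fracture"},
    {"person_id": "P001", "published_in": "region-north", "concept": "diabetes"},
    {"person_id": "P002", "published_in": "region-south", "concept": "asthma"},
]

stores = file_events(events)
print({region: len(items) for region, items in stores.items()})
```

Both of `P001`'s events land in `region-north`, including the one published by a provider in `region-south`.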

The main purpose of the regional separation is to support regional governance and service ownership arrangements. For example, one region may have different approaches and priorities from another and may wish to adopt different technologies.

Whilst the data is separate, it should be noted that data can be obtained from across any of the regional stores, i.e. from the perspective of a subscriber requiring access to data, the data is projected as a single store with the core record locators and person index operating across the regional stores.
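A minimal sketch of this "single projected store" idea follows. The in-memory `regional_stores`, the derived `record_locator` and both query functions are hypothetical stand-ins for the core record locator and person index, which the article names but does not specify.

```python
# Hypothetical regional stores: region -> person -> list of events.
regional_stores = {
    "region-north": {"P001": [{"concept": "diabetes"}]},
    "region-south": {
        "P002": [{"concept": "asthma"}],
        "P003": [{"concept": "diabetes"}],
    },
}

# A simple record locator: which region holds each person's record.
record_locator = {
    person: region
    for region, store in regional_stores.items()
    for person in store
}


def get_record(person_id):
    """Fetch one person's record without the caller knowing the region."""
    region = record_locator.get(person_id)
    return [] if region is None else regional_stores[region][person_id]


def find_people(concept):
    """Run the same population query across every regional store and merge the results."""
    return sorted(
        person
        for store in regional_stores.values()
        for person, events in store.items()
        if any(event["concept"] == concept for event in events)
    )


print(get_record("P002"))
print(find_people("diabetes"))
```

The caller never names a region: person-level access goes through the locator, and population-level queries fan out over all stores.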

Tier 3 - Regional analytics

Data in these zones has been made available to subscribers as a result of sharing agreements, i.e. the data controller role for the data has been transferred by consent.

The term "region" is used to indicate that the subscriber, or subscriber group, is likely to be part of the same sort of organisation as tier 2. Regional analytic stores represent one or more logical representations of data derived from any of the regions (but usually by default their own region). For example, an analytics service with permission to hold a register of diabetics for provision of services across ICSs may obtain a diabetic data set from some or all parts of the country.

Tier 4 - Local analytics

Data held in this tier is managed and controlled outside the Discovery environment. The role of the Discovery data service is to transact data from the relevant regional stores to the local analytics environment and (optionally) provide and manage the filing of that data.

Implementation options

A number of data store configuration options are available to subscribers or regions when determining whether to use part or all of the Discovery Data Service or information service facilities.

Fundamental to decision making is the separation of the core data service from an information service. The data service delivers data in a common structural form, using a common ontology, but in a relatively raw state that mirrors the original event-level entries. An information service would provide various visualisations of the data, e.g. as reports, dashboards or record viewers. In the context of the above architecture, tier 1 and tier 2 data stores can be considered part of the data service, whereas tier 3 and tier 4 are data stores that can be used as part of information services.

The following options reflect the different patterns of use of Discovery:

Option 1 - Tier 1 only

In this option only tier 1 is used. The regional organisation already has a fully operational hosted environment and wishes to look after its own data, including access to individual care records and subsets of the records, and to maintain its own population of patients. Discovery acts as an "Extract and Transform" service and transacts data out at the same rate as it arrives. No normalisation or query is applied. All additional work is delegated to the regional or ICS infrastructure.

Option 2 - Tier 1 and Tier 2 Data Service

In this option Discovery is used as a data aggregator, as in tier 1, and as a source for delivering either individual structured care records (e.g. via the Care Connect API) or data sets for use in local or regional analytics databases. The data is published either as FHIR or CSV, i.e. in a single common form via the single common model, preserving the original codes or text in the output. The service holds the provenance of each data item for audit purposes.
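To make the CSV case concrete, the sketch below writes common-model rows to CSV while carrying the original code, original text and publisher alongside the common code. The column names and values are invented for the example and are not the actual Discovery output schema.

```python
import csv
import io

# Hypothetical output columns: a common code plus the preserved
# original code/text and the publisher, kept for provenance.
FIELDS = ["person_id", "common_code", "original_code", "original_text", "publisher"]

rows = [
    {
        "person_id": "P001",
        "common_code": "C-DIABETES",
        "original_code": "LOCAL-A1",
        "original_text": "Diabetes, type 2",
        "publisher": "gp-practice-a",
    },
]


def to_csv(rows):
    """Serialise rows to CSV text with a fixed header order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


output = to_csv(rows)
print(output)
```

Reading the output back with `csv.DictReader` returns the same original code and text, so nothing from the source entry is lost in the common form.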

Option 3 - Tier 1, 2 and 3 Data service and analytics

In this option, Discovery also hosts and maintains databases for the purposes of analytics including individual or population level decision support, dashboards and other visualisations, together with integrated record views, providing drill down to care record concepts and interpretations. In effect, this is using the Discovery technologies and hosting environment as both a data service and an information service.

Option 3 can be selected as well as, or instead of, option 4.

Option 4 - Tier 1, 2 and tier 4 local analytics

In this option subscribers wish to obtain data for use in their own locally hosted analytics databases. They control the environment themselves, with Discovery providing the data as transactional outputs from tier 2 and providing data filers into pre-agreed subscriber database schemas.

This option can be selected as well as, or instead of, option 3.
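The relationship between the four options and the tiers they use can be summarised in a small lookup. The mapping is taken from the option headings above; the `OPTION_TIERS` name and the `tiers_in_use` helper are purely illustrative.

```python
# Tiers used by each implementation option, as listed in the option headings.
OPTION_TIERS = {
    1: {1},
    2: {1, 2},
    3: {1, 2, 3},
    4: {1, 2, 4},
}


def tiers_in_use(selected_options):
    """Return the sorted union of tiers needed for the selected options."""
    tiers = set()
    for option in selected_options:
        tiers |= OPTION_TIERS[option]
    return sorted(tiers)


print(tiers_in_use([2]))
print(tiers_in_use([3, 4]))
```

Selecting options 3 and 4 together brings all four tiers into use, which matches the note that the two options can be combined.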