Data Storage Architecture

From Discovery Data Service

This article describes the Discovery data storage architecture. The model presented here is a high-level overview and avoids drilling down into the technical elements of the storage software. The purpose of the model is to inform potential users of the Discovery services about the optional configurations available.

The objective of the storage architecture is to align the physical aspects of data storage to the governance and ownership arrangements in a way that enables an à la carte approach to some elements of the service. The effect of this is to enable a range of organisation groupings to use the service, varying from a single small organisation like a rural GP practice, through to a large-scale integrated care system at the scale of one or more STPs.

Data storage tiers

The data storage architecture consists of four types of zones, or "tiers", each with a distinct purpose.

Data is deliberately duplicated across the tiers, in line with the modern approach to cloud data storage. From a logical perspective, many elements of data are likely to be present in at least three, if not all four, data zones. Physically, some subsets of the data may be duplicated 20 times or more within each zone for resilience and performance optimisation.

The following diagram highlights the four tiers:

Tier 1 - Raw data and health events

The core data zone is the tier through which all inbound data flows.

It is one of the two "publisher zones": all data in this tier remains under the control of the provider's data controller and cannot move out without explicit permission from that controller. Those permissions are configured through machine-readable data sharing agreements and project-level configurations.

In this zone, data is received from publishers in a huge variety of formats and with varying latency from the time of entry into provider systems. The data is stored as raw files and then processed, and citizen-related data is linked by a common identifier. The data is then held as health events, each event reflecting a data transaction such as an observation, an encounter, or the registration of a patient. At this point the data is transformed into a standardised format and stored for onward distribution to tier 2. Because every addition, edit and deletion generates an event, each of these events is stored.
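
The event-based storage described above can be illustrated with a minimal sketch. It is not the Discovery implementation: the class names, field names and identifiers are hypothetical, and an in-memory list stands in for the real store. It shows only the idea that additions, edits and deletions are each kept as separate events, from which a current view can be derived.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Change(Enum):
    ADD = "add"
    EDIT = "edit"
    DELETE = "delete"


@dataclass(frozen=True)
class HealthEvent:
    person_id: str            # common identifier linking citizen data (hypothetical field)
    entry_id: str             # identifier of the source entry, e.g. one observation
    change: Change            # what happened to the entry
    payload: Optional[dict]   # standardised content; None for a deletion


class EventLog:
    """Append-only log: stored events are never updated or removed."""

    def __init__(self) -> None:
        self._events: List[HealthEvent] = []

    def append(self, event: HealthEvent) -> None:
        self._events.append(event)

    def history(self, entry_id: str) -> List[HealthEvent]:
        """Return every event recorded for one entry, in arrival order."""
        return [e for e in self._events if e.entry_id == entry_id]

    def current_state(self, person_id: str) -> Dict[str, dict]:
        """Replay a person's events in order to derive their latest entries."""
        state: Dict[str, dict] = {}
        for e in self._events:
            if e.person_id != person_id:
                continue
            if e.change is Change.DELETE:
                state.pop(e.entry_id, None)
            else:
                state[e.entry_id] = e.payload or {}
        return state
```

In this sketch a deleted entry disappears from `current_state` but its full history, including the deletion itself, remains available through `history`.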

The common model consists of a super-ontology of ontologies based around a core ontology for clinical concepts (based on SNOMED CT), and a data model based on an extensible model of archetypes, with FHIR entities as the baseline specification of structures.

Beyond the above, the core data zone has only one additional purpose: to serve data to tier 2 as soon as possible after receipt. It is in effect a transit store that also holds the provenance of each byte of data that is received, stored and passed on.

Tier 2 - Regional data zone

A region is defined as an organisation, or group of organisations, that collectively provide a governance structure for determining data access and use across a population. There is no lower or upper limit to the size of a region; typically it may be an integrated care system or an integrated care provider. For example, in London a region is the size of a large CCG or STP.

Tier 2 is the second of the two publisher zones. The data in a regional zone remains under the control of the publisher acting as data controller.

The data in the regional zone is stored using technologies that are optimised for data extract and processing (i.e. query), at the level of either a person or a population.

The main purpose of the regional separation is to support regional governance and service ownership arrangements. For example, one region may have different approaches and development priorities and may wish to adopt different technologies from another region.

Whilst the data is separate from region to region, it should be noted that data can technically be obtained from across any of the regional stores. From the perspective of a subscriber requiring access to data, the data can be projected as a single cloud store, with the core record locators and person index operating across the regional stores. Access to personal data and organisation data is controlled at a granular level by the data sharing agreements.
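
The projection of separate regional stores as a single store can be sketched as follows. This is an illustrative assumption rather than the actual Discovery design: `RegionalStore`, `FederatedView` and the `permitted_regions` argument (standing in for the outcome of data sharing agreements) are hypothetical names.

```python
from typing import Dict, Iterable, List, Set


class RegionalStore:
    """One region's tier 2 store, keyed by person identifier."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._records: Dict[str, List[dict]] = {}

    def add(self, person_id: str, record: dict) -> None:
        self._records.setdefault(person_id, []).append(record)

    def fetch(self, person_id: str) -> List[dict]:
        return list(self._records.get(person_id, []))


class FederatedView:
    """Presents several regional stores to a subscriber as one store."""

    def __init__(self, stores: Iterable[RegionalStore]) -> None:
        self._stores = {store.name: store for store in stores}
        # Record locator: person identifier -> regions holding their data.
        self._locator: Dict[str, Set[str]] = {}

    def add(self, region: str, person_id: str, record: dict) -> None:
        self._stores[region].add(person_id, record)
        self._locator.setdefault(person_id, set()).add(region)

    def fetch(self, person_id: str, permitted_regions: Set[str]) -> List[dict]:
        """Collect a person's records from every region the caller may read."""
        regions = self._locator.get(person_id, set()) & permitted_regions
        records: List[dict] = []
        for region in sorted(regions):
            records.extend(self._stores[region].fetch(person_id))
        return records
```

The subscriber calls a single `fetch`; which regional stores are consulted is decided by the locator and by the permissions passed in, not by the subscriber.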

Tier 3 - Regional analytics

Data in these zones has been made available to subscribers as a result of sharing agreements, i.e. no data leaves tier 2 without explicit, granular, machine-readable data sharing agreements that contain item-level rules.
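
As an illustration of item-level rules, the sketch below filters records against an agreement before release. The agreement structure, field names and deny-by-default behaviour are assumptions made for the example; the real format of Discovery data sharing agreements is not described in this article.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class SharingAgreement:
    publisher: str
    subscriber: str
    # Item-level rules: record type -> fields the subscriber may receive.
    allowed_items: Dict[str, FrozenSet[str]]


def release(records: List[dict], agreement: SharingAgreement) -> List[dict]:
    """Return only the record types and fields that the agreement permits."""
    released: List[dict] = []
    for record in records:
        allowed_fields = agreement.allowed_items.get(record["type"])
        if allowed_fields is None:
            continue  # record type not covered by the agreement: withhold it
        released.append(
            {key: value for key, value in record.items()
             if key == "type" or key in allowed_fields}
        )
    return released
```

Anything not explicitly listed in the agreement is withheld, which mirrors the statement that no data leaves tier 2 without an explicit agreement.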

The data for the tier 3 stores is transacted from tier 2 in as near real time as is practical (for example, daily), but is scheduled so as not to negatively impact queries operating in this tier.

The term "region" is used to indicate that the subscriber, or subscriber group, is likely to be part of the same sort of organisation as tier 2. Regional analytic stores represent one or more logical representations of data derived from any of the regions (but usually by default their own region). For example, an analytics service with permission to hold a register of diabetics for provision of services across ICSs may obtain a diabetic data set from some or all parts of the country.

The analytics stores in the regional tier tend to be quite extensive, holding billions of entries of data formatted in a way that optimises population-level query. Analytic stores are the source of data for producing dimension tables for report visualisations, or data sets for statistics.

Tier 4 - Local infrastructure-based analytics

Data held in this tier is managed and controlled outside the Discovery environment. It is conceptually the same as tier 3, i.e. it may consist of small project-level databases or a large regional database, but the data hosting is not managed in the same infrastructure. The role of the core Discovery data service is to transact data from the relevant tier 2 regional stores to the tier 4 analytics environment and (optionally) to provide and manage the filing of that data.

Implementation options

There are a number of data store configuration options available to subscribers or regions when determining whether to use part or all of the Discovery Data Service or information service facilities.

Fundamental to decision making is the separation of the core data service from information services, and the further separation of population-level information services from individual record-based services. The data service delivers data in a common structural form, using a common ontology, but in a relatively raw state that mirrors the original event-level entries. An information service would provide various visualisations of the data, e.g. as reports, dashboards or record viewers. In the context of the above architecture, tier 1 data stores can be considered part of the data service. Tier 2 is part of the data service, a source of data for individual record-based information services, and a source of data for tiers 3 and 4. Tier 3 and tier 4 are data stores that can be used as part of analytics services.

The following options reflect the different patterns of use of Discovery:

Option 1 - Tier 1 only

In this option only tier 1 is used. The regional organisation already has a fully operational hosted environment and wishes to look after its own data, including access to individual care records and subsets of the records, and to maintain its own population of patients. Discovery acts as an "Extract and Transform" service and transacts data out at the same rate as it arrives. Although it produces data conformant to the common model, no de-duplication or subset query is applied. All additional work is delegated to the regional or ICS infrastructure.

Option 2 - Tier 1 and Tier 2 Data Service

In this option Discovery is used as a data aggregator, as in tier 1, and as a source for either delivering individual structured care records (e.g. via the Care Connect API) or passing on data sets from tier 2 for use in local or regional analytics databases. The data is published from tier 2 as either FHIR or CSV, i.e. in a single common form using the single common model, while preserving the original codes or text in the output.
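
A CSV output that carries both the common-model code and the original code or text might look like the sketch below. The column names and the sample values are placeholders invented for the example, not the actual Discovery CSV schema.

```python
import csv
import io
from typing import Dict, List

# Hypothetical column layout: the common-model code sits alongside
# the code and text exactly as the publisher originally recorded them.
COLUMNS = ["person_id", "event_type", "common_code", "original_code", "original_text"]


def to_csv(events: List[Dict[str, str]]) -> str:
    """Serialise events to CSV text, leaving missing columns empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    for event in events:
        writer.writerow({column: event.get(column, "") for column in COLUMNS})
    return buffer.getvalue()


sample = [
    {
        "person_id": "p1",
        "event_type": "observation",
        "common_code": "COMMON-001",   # placeholder, not a real concept code
        "original_code": "LOCAL-A1",   # placeholder for the publisher's code
        "original_text": "text as entered at source",
    }
]
csv_text = to_csv(sample)
```

Keeping the original columns next to the mapped code lets a subscriber audit the mapping or fall back to the source value where needed.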

Tier 2 is separate from tier 1 because it is optimised for large-scale extract as well as single record extract, but the main reason for its existence is the regional governance and control that it enables.

Option 3 - Tier 1, 2 and 3 Data service and analytics

In this option, Discovery also hosts and maintains databases for the purposes of analytics including individual or population level decision support, dashboards and other visualisations, together with integrated record views, providing drill down to care record concepts and interpretations. In effect, this is using the Discovery technologies and hosting environment as both a data service and a set of information services.

This moves Discovery into the solution space and is likely to be selected by organisations that wish to take an open source community-oriented approach to evolving information services, whilst maintaining secure hosting of data.

Option 3 can be selected as well as, or instead of, option 4.

Option 4 - Tier 1, 2 and 4 Local analytics

In this option subscribers wish to obtain data for use in their own locally hosted analytics databases. This can vary from a small project specific analytic database to a full set of regional data. They control their environment themselves, with Discovery providing the data as transactional outputs from tier 2 and providing data filers into pre-agreed subscriber database schemas.

This option can be selected as well as, or instead of, option 3.
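
The relationship between the four options and the tiers they use can be summarised in a short sketch. The mapping follows the option headings above; the function and constant names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List

# Tiers used by each implementation option, as listed in the option headings.
OPTION_TIERS: Dict[int, FrozenSet[int]] = {
    1: frozenset({1}),
    2: frozenset({1, 2}),
    3: frozenset({1, 2, 3}),
    4: frozenset({1, 2, 4}),
}


def tiers_for(options: Iterable[int]) -> List[int]:
    """Return the sorted tiers needed for one option or a combination of options."""
    chosen = set(options)
    if not chosen:
        raise ValueError("at least one option must be selected")
    unknown = chosen - set(OPTION_TIERS)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    tiers = set()
    for option in chosen:
        tiers |= OPTION_TIERS[option]
    return sorted(tiers)
```

For example, selecting options 3 and 4 together yields all four tiers, reflecting a region that uses Discovery-hosted analytics alongside its own local analytics environment.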