Data Storage Architecture

From Discovery Data Service

Revision as of 09:02, 6 July 2020

This article describes the Discovery data storage architecture. The model presented here is a high-level overview and avoids drilling down into the technical elements of the storage software, which, as one would expect, is constantly evolving.

The objective of this architecture is to align the physical aspects of data storage with governance and ownership arrangements in a way that enables an à la carte approach to some elements of the service. The effect is to enable a range of organisation groupings to use the service, from a single small organisation through to a large-scale integrated care system.

Data storage tiers

The data storage architecture consists of four zones or "tiers", each with a distinct purpose.

Data is deliberately duplicated across the tiers, in line with the modern approach to cloud data storage: because storage is around 1000 times less expensive than in the 20th century, the focus is on reducing the cost of processing the data rather than the cost of storing it. From a logical perspective, each element of data is likely to be present in at least three, if not four, data zones.

[Figure: Discovery 4 tier data zones]

Tier 1 - Core data zone

The core data zone is a multi-tenanted, single cloud instance with horizontally scalable storage technologies.

It is one of the two "publisher zones", in which all data remains under the control of the provider's data controller and cannot move out without explicit permission from the controller, subject to data sharing agreements and project agreements.

In this zone, data is received from publishers in a wide variety of formats and at different latencies from the time of entry into provider systems. The data is processed: citizen-related data is linked by a common identifier, transformed into a common model, and stored for onward distribution to a number of subscribers. The common model consists of a super-ontology of ontologies based around a core ontology for clinical concepts (itself including SNOMED CT), and a data model based on an extensible model of archetypes, with FHIR entities as the baseline.

Beyond the above, the core data zone has only one additional purpose: to pass data on to tier 2 on receipt. It is, in effect, a transit store.
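The linking and transformation step can be pictured with a minimal Python sketch. It is illustrative only and is not the Discovery implementation: the `CommonRecord` type, the `PERSON_INDEX` and `CONCEPT_MAP` lookups, and all identifiers and codes are hypothetical stand-ins for the common identifier, the core ontology and the common model described above.

```python
from dataclasses import dataclass, field


@dataclass
class CommonRecord:
    """A record in a (hypothetical) common model."""
    person_id: str   # common identifier shared across publishers
    concept: str     # concept from the core ontology (made-up value)
    publisher: str   # publishing organisation, which remains data controller
    payload: dict = field(default_factory=dict)


# Hypothetical index: (publisher, local patient id) -> common identifier.
PERSON_INDEX = {
    ("gp_practice_a", "A-001"): "person-1",
    ("hospital_b", "H-778"): "person-1",
}

# Hypothetical map from publisher-local codes to core ontology concepts.
CONCEPT_MAP = {
    ("gp_practice_a", "local-dm"): "diabetes-mellitus",
    ("hospital_b", "hosp-42"): "diabetes-mellitus",
}


def to_common_model(publisher: str, message: dict) -> CommonRecord:
    """Link a publisher message to a person and map it into the common model."""
    person_id = PERSON_INDEX[(publisher, message["local_id"])]
    concept = CONCEPT_MAP[(publisher, message["code"])]
    return CommonRecord(person_id, concept, publisher, {"date": message["date"]})


# Two publishers, two local formats, one person and one concept after processing.
records = [
    to_common_model("gp_practice_a", {"local_id": "A-001", "code": "local-dm", "date": "2020-01-05"}),
    to_common_model("hospital_b", {"local_id": "H-778", "code": "hosp-42", "date": "2020-03-17"}),
]
```

Both messages end up attached to the same `person_id` and the same concept, while the `publisher` field preserves which data controller the record came from.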

Tier 2 - Regional data zone

A region is defined as an organisation or group of organisations that collectively provide a governance structure for determining data access and use across a population. There is no lower or upper limit to the size of a region; typically it may be an integrated care system or an integrated care provider. In London, a region is the size of a large Clinical Commissioning Group (CCG) or Sustainability and Transformation Partnership (STP).

Tier 2 is the second of the two publisher zones. The data in a regional zone remains under the control of the publisher acting as data controller.

The data in the regional zone is stored using technologies that are optimised for data extract and processing (i.e. query), at the level of either a person or a population. Within Discovery, each person's record is stored in one region only, normally the region that covers the person's main residence and, for an NHS patient, their main General Practice. Thus data published through the core data zone by a provider outside the region is "repatriated" to the regional store.

The main purpose of the regional separation is to support regional governance and service ownership arrangements. For example, one region may have different approaches and priorities from another and may wish to adopt different technologies.

Whilst the data is stored separately, it can be obtained from across any of the regional stores. From the perspective of a subscriber requiring access, the data is projected as a single store, with the core record locators and person index operating across the regional stores.
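The "one region per person" rule amounts to a routing decision, sketched below in Python. This is a hedged illustration rather than the actual service logic: the `HOME_REGION` registry, the region names and the `repatriate` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical registry: each person maps to exactly one home region.
HOME_REGION = {"person-1": "region_north", "person-2": "region_south"}

# One store per region: region -> person -> list of records.
regional_stores = defaultdict(lambda: defaultdict(list))


def repatriate(record: dict) -> str:
    """File a record in the person's home region, wherever it was published."""
    region = HOME_REGION[record["person_id"]]
    regional_stores[region][record["person_id"]].append(record)
    return region


# A provider in the south publishes an event for a person who lives in the north.
filed_in = repatriate(
    {"person_id": "person-1", "publisher_region": "region_south", "event": "emergency attendance"}
)
```

The record is filed under `region_north` even though it was published from `region_south`, so the person's record stays in a single regional store.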
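One way to picture the single-store projection is a thin facade over the regional stores, sketched here in Python. The `FederatedStore` class, its methods and the sample data are assumptions made for illustration; the real record locators and person index are not described in this article.

```python
class FederatedStore:
    """Presents several regional stores as one logical store (hypothetical sketch)."""

    def __init__(self, stores: dict, person_index: dict):
        self.stores = stores              # region -> {person_id: [records]}
        self.person_index = person_index  # person_id -> home region (record locator)

    def records_for(self, person_id: str) -> list:
        """Person-level access: the index locates the one region holding the record."""
        region = self.person_index.get(person_id)
        if region is None:
            return []
        return list(self.stores[region].get(person_id, []))

    def query(self, predicate) -> list:
        """Population-level access: run the same predicate across every region."""
        return [
            record
            for store in self.stores.values()
            for person_records in store.values()
            for record in person_records
            if predicate(record)
        ]


stores = {
    "region_north": {"person-1": [{"person_id": "person-1", "concept": "diabetes-mellitus"}]},
    "region_south": {"person-2": [{"person_id": "person-2", "concept": "asthma"}]},
}
federated = FederatedStore(stores, {"person-1": "region_north", "person-2": "region_south"})

one_person = federated.records_for("person-2")
diabetes = federated.query(lambda r: r["concept"] == "diabetes-mellitus")
```

The subscriber calls `records_for` or `query` without needing to know which region holds the data; each region remains free to use its own storage technology behind the facade.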

Tier 3 - Regional analytics

Data in these zones has been made available to subscribers as a result of sharing agreements, i.e. the role of data controller has been transferred by consent.

The term "region" is used to indicate that the subscriber, or subscriber group, is likely to be the same sort of organisation as a tier 2 region. Regional analytics stores hold one or more logical representations of data derived from any of the regions (by default, usually the subscriber's own region). For example, an analytics service with permission to hold a register of diabetics for the provision of services across integrated care systems (ICSs) may obtain a diabetic data set from some or all parts of the country.
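The dependency of an analytics data set on sharing agreements can be sketched in Python as a filter applied before data leaves the regional stores. The `SharingAgreement` type, the `extract_for` function and every name in the sample data are hypothetical; actual agreements will carry far more detail than a list of concepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingAgreement:
    """Hypothetical agreement: a publisher lets a subscriber receive some concepts."""
    subscriber: str
    publisher: str
    concepts: frozenset


def extract_for(subscriber: str, records: list, agreements: list) -> list:
    """Return only the records that some agreement permits the subscriber to hold."""
    allowed = {}  # publisher -> set of permitted concepts
    for agreement in agreements:
        if agreement.subscriber == subscriber:
            allowed.setdefault(agreement.publisher, set()).update(agreement.concepts)
    return [r for r in records if r["concept"] in allowed.get(r["publisher"], set())]


agreements = [
    SharingAgreement("diabetes_service", "gp_practice_a", frozenset({"diabetes-mellitus"})),
]
regional_records = [
    {"person_id": "person-1", "publisher": "gp_practice_a", "concept": "diabetes-mellitus"},
    {"person_id": "person-1", "publisher": "gp_practice_a", "concept": "asthma"},
    {"person_id": "person-2", "publisher": "hospital_b", "concept": "diabetes-mellitus"},
]

register = extract_for("diabetes_service", regional_records, agreements)
```

Only the first record reaches the register: the asthma record is outside the agreed concepts, and `hospital_b` has no agreement with this subscriber.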

Tier 4 - Local analytics

Data held in this tier is managed and controlled outside the Discovery environment. The role of the Discovery Data Service is to transfer data from the relevant regional stores to the local analytics environment and, optionally, to provide and manage the filing of that data.