Remote subscriber data definition

From Discovery Data Service
Jump to navigation Jump to search

This article describes how the Discovery Data Service makes data available for Remote Subscriber Databases (RSDs) and how the DDS Remote Filer application interacts with this to update an RSD. Although it is recommended that the Remote Filer application be used for updating RSDs, any other technical solution can be substituted provided it is able to match the Remote Filer behaviour as far as the points of interaction are concerned.

Subscriber database schema

DDS currently provides SQL scripts for creating subscriber databases, one for each of the two support database engines; the database scripts are the same no matter what configuration options are selected when setting up the feed from DDS (e.g. PI versus de-identified).

Please note the following:
  • The above are links to the public GitHub repository which was in active development until early 2021. Development has moved to private GitHub repositories since this date, so the SQL schemas linked to above should only be taken as illustrations of the schema and not the latest version (which can be provided on request).
  • The Remote Filer application currently only supports loading data into these two database engines, and part of the data feed (for reference data) is sent as raw SQL and only these two formats are supported.
  • DDS still supports data feeds to an older version of the subscriber database known as Compass v1 (the current version being v2). Although v1 is still supported, new instances of this will not be deployed and this article specifically addresses the v2 standard. Future developments/improvements to the DDS subscriber database will be iterative upgrades to v2.

Subscriber feeds

There are two separate feeds of data that DDS sends to each RSD:

  • Published Data Feed – this includes all patient data, plus some supporting data (clinicians and organisations for example) that is sent into DDS by external publishers.
  • Reference Data Feed – this includes lookups and mappings for clinical codes (Read2 to SNOMED for example) that is not directly published into DDS but is updated in subscriber databases.

Data for each feed is staged in a separate directory on the DDS SFTP server for each subscriber. The DDS Remote Filer application runs supports running in two different modes, one to download and process the Published Data Feed and the other to download and process the Reference Data Feed.

If you replace the Remote Filer application with an alternative solution it must support both feeds.

DDS SFTP server

The DDS SFTP server is used to stage all data for all RSDs.

A user is created on this server for each DDS subscriber, with their own username and SSH certificate, to allow them to securely access the data intended for them.

If a DDS subscriber has multiple RSDs (for example, one for GP data and one for acute data), the same SFTP user is used for both RSDs.

For each RSD, the following three directories are created under the SFTP user home directory to:

  1. stage data for download for the Published Data Feed.
  2. upload feedback files related to the Published Data Feed.
  3. stage data for download for the Reference Data Feed.

Example SFTP user home directory structure

Example SFTP user home directory structure

Published data feed staging directory

This directory is used by DDS to stage the published data intended for the RSD, and includes all the patient and clinical data. When data from the DDS is available, for a subscriber, it is placed in this directory.

The files placed in this directory are always named in the following format:

<YYYYMMDDHHMMSS>_Subscriber_Data.zip

Where YYYYMMDDHHMMSS is the date and time the data is staged for collection. When a Remote Filer connects and downloads the files, they should be sorted by file name, so they are in date order, and applied in that order.

DDS SFTP server zip files.png

Example date format

Zip files

The zip files are PGP encrypted, using a pre-agreed public key. The Remote Filer has the private key, allowing it to decrypt the files.

If a zip file exceeds 10MB, it is fragmented into a multi-part zip.

Within each zip file is one or more 'inner' zip files that generally represent updates to individual patient records (although this is not always the case). These zip files are named with the pattern:

<YYYYMMDD>_<Ordinal>_<UUID>.zip

Where:

  • <YYYYMMDD> is the date the file was staged.
  • <Ordinal> is the number that indicates that files should be processed by the subscriber.
    There may gaps in the ordinals due to the incrementing integer being used across all DDS subscribers. For example, file 63579 should be processed before 63581, but the fact that file 63580 was not found is not an error.
  • <UUID> is a unique identifier generated by DDS for each inner zip file.

Example zip files showing the naming convention and required order

Files should be sorted by ordinal number and applied in that order.

Each inner zip file contains one or more CSV files, which contain the actual data to be applied to the Remote Subscriber Database, and a single JSON file.

Zip file content.png

Example inner zip file contents showing the CSV and JSON files

CSV files

Each CSV file relates to a table in the Remote Subscriber Database, with the same name; for example, patient.csv relates to the patient table.

Within each CSV file, the first column is always called is_delete. This is either:

  • FALSE - indicates that the record is an upsert (update or insert), in which case the remaining columns directly map to the columns on the table, giving the new values. If any cell is empty (as 0 is in the below image) then this should be treated as a null value rather than an empty String.
  • TRUE - indicates that the record should be deleted.

The second column is always an id column, giving the unique identifier for the record.

CSV example.png

Example CSV file showing is_delete and id columns

JSON files

The JSON file contains metadata about the CSV files and source data types in DDS (which map to the data types used in the Remote Subscriber database). For example, the below image shows the JSON file definition of the patient.csv file, stating that the title field was generated from a String value in DDS (which would indicate that the equivalent field in the RSD is a varchar), and that the date_of_birth field is a date.

JSON example.png

Example JSON file showing details of a patient.csv file

Download an example of a published data feed zip file (without the encryption) here.

Published data feed feedback directory

The Published Data Feedback directory is used by the Remote Filer to upload success/failure responses about the Published Data files being applied to the Remote Subscriber Database.

After downloading and applying data to the RSD, a zip file for each downloaded file must be uploaded to this directory. The files uploaded to this directory are always named in the following format:

<YYYYMMDDHHMMSS>_Subscriber_Results.zip

Where YYYYMMDDHHMMSS is the same date and time as from the data file originally downloaded.

DDS SFTP server feedback directory.png

Example Published data feedback directory showing required zip files

The zip file are not encrypted.

The zip file must include at least one of the following files:

If there are no failures, a failure.txt file does not need to be created.

Similarly, if there are no successes, a success.txt file does not need to be created.

Download a sample of the feedback file here.

success.txt

This file must list the UUIDs from the file names of all inner zip files that were applied to the RSD.

Each UUID should be on a single line, with no other characters.

Each line should be separated by a newline (/n) or a newline and carriage return (/r/n).

Example success.txt file

failure.txt

This file must list the UUIDs, if any, that failed when applied to the RSD, along with whatever error message was raised.

Each line should contain the UUID from the inner zip file, followed by a comma and then the error message.

Each line should be separated by a newline (/n) or a newline and carriage return (/r/n).

Failure.txt.png

Example failure.txt file

Processing feedback

The DDS processes the feedback files to ensure that data intended for RSDs is being collected and applied successfully.

Any feedback of failures is detected by the DDS operational monitoring team and resolved accordingly. Once any error is resolved the data will be re-staged by DDS for collection again; it is not necessary for the Remote Filer to store any data that failed to be applied because the DDS will make it available again once the issue is fixed.

Reference data feed staging directory

When DDS has new reference data to sent to an RSD, a file called “concepts.zip” is put in this directory. This zip file is PGP encrypted in the same way as the Published Data Feed files. As with the Published Data Feed staging directory, if this file is over 10MB it will be split into a multi-part zip file.

DDS SFTP server concepts zip.png

Example Reference data feed staging directory showing Concepts.zip file

The zip file contains files that must be executed in the following order:

  1. One or more SQL files - containing literal SQL statements that should be executed on the RSD to update the relevant reference data; the order is not important.
    The DDS can generate files that are formatted for MySQL or SQL Server (and other compatible) databases.
  2. Setup extension - SQL that defines one or more stored procedures that use the new/updated reference data to update other tables as required.
  3. Execute extension - SQL commands to run the procedures created by the proceeding step.

Setup execute.png

Example Concepts.zip file showing SQL files, and setup and execute extensions

Download an example concepts.zip file (without the encryption) here.

Remote Filer overview

The DDS Remote Filer is a Java application that is provided as part of the DDS to bridge the gap between the compressed, encrypted data staged on the DDS SFTP server and a Remote Subscriber Database. The application can be run in two modes, one for each of the data feeds, and performs the following tasks:

  1. Connects to the DDS SFTP server (either over HSCN or internet – this is configured subscriber by subscriber and doesn’t affect how the Remote Filer operates).
  2. Downloads any new data.
  3. Decrypts and decompresses the data.
  4. Applies the data to the Remote Subscriber Database.
  5. Generates and uploads feedback files to report success/failure, and then uploads them to the SFTP server.
    This is done for the Published Data Feed only

Since it is Java-based, it can be run on a Windows or Linux server.

The server (and infrastructure hosting that server) must allow:

  • outbound connectivity to the DDS SFTP server to download new data and upload feedback files.
  • connectivity to the Remote Subscriber Database itself.

Typically, a relevant scheduling agent, such as Cron on Linux or Windows Scheduler, would be used to run the Remote Filer once daily for each feed. For example, once for the Published Data Feed and once for the Reference Data Feed.

The following sections describe the details of the Remote Filer’s operation for both feeds:

Remote Filer: Published Data Feed

Newly published data, is processed and the the subscriber output is staged in a nominated directory on an SFTP server. The Remote Filer then connects and downloads these files to process them.

Process steps

The following diagram shows the steps the Remote Filer (shown in green) performs for the Published Data feed. This entire end to end process would normally be performed once daily, typically scheduled using a suitable scheduling tool such as cron on Ubuntu or Windows Scheduler.

Process steps.png

The steps in detail:

  1. The Remote Filer creates a connection to the DDS SFTP server, using the pre-configured address, port, username and SSH certificate.
    See also: Required parameters.
  2. The Remote Filer downloads all zip files found in the pre-configured source incoming directory and writes them to temporary local storage.
  3. The Remote Filer deletes downloaded the zip files from the SFTP server.
  4. Each downloaded zip file is PGP encrypted, so is decrypted using the pre-configured PGP private key.
  5. Some of the zip files may be multi-part zips, so these are then combined into standalone zip files.
  6. The zip files are sorted by name. The file names will be of the format <YYYYMMDDHHMMSS>_Subscriber_Data.zip (e.g. 20210318100810_Subscriber_Data.zip) so the sorting will place them into date order.
  7. Then, for each zip file:
  • Unzip the file to a temporary storage location. This will generate further “inner” zip files. These zip files are named <YYYYMMD>_<ordinal>_<uuid>.zip (e.g. 20210318_000000001_ f5900473-9884-11eb-b6fa-00ff053700b6.zip).
  • These inner zip files are sorted by name. Due to the file naming convention, this will place in the correct order to be applied to the database.
  • For each inner zip file:
  1. Unzip the file to a temporary storage location. This will generate one or more CSV files, containing the published data for the subscriber database, along with a JSON file. These files and how they relate to the RSD are explained earlier in this document.
  2. All data upserts contained in the CSV file(s) are applied to the subscriber database. Connection details for the subscriber database (connection string, user name and password) are included in the pre-configured configuration details.
  3. All data deletes contained in the CSV file(s) are applied to the subscriber database.
  • The UUIDs from each zip file that was successfully applied to the database are written to a file called “success.txt”, with a UUID per line.
  • If there were any errors applying the data to the database, these are written to a file called “failure.txt”, with each line being a UUID, a comma, then an error message or similar (the error message itself should not contain any newline or other special characters and should not exceed 65534 characters).
  • The success.txt and failure.txt file (if it exists) are both compressed into a new “results” zip file with the same name as the one being processed, except with “Data” replaced with “Results” (e.g. if processing 20210318100810_Subscriber_Data.zip, then this new file will be called 20210318100810_Subscriber_Results.zip). This results file allows DDS to know that data is being applied to the subscriber or if there are any errors.
  • The results zip file is then uploaded to the SFTP server to another pre-configured directory.

Error handling

Required parameters

Remote Filer: Reference Data Feed