Graph or relational databases

From Discovery Data Service
Jump to navigation Jump to search

A question arises as to whether the Discovery Data Service should use relational, property graph, document, or key value based database management systems  in its main population patient record data stores.

This article considers the differences between graph and relational from the perspective of the known information requirements of health query and goes on to consider other factors involved in deciding which to go for.  It reaches a temporary conclusion as a result.

The starting point is to examine a few key logical differences in structure then move on to some technical differences. Following this the two are compared and contrasted from the perspective of the type of queries needed. Subsequently other approaches are considered, as well as a list of the additional factors that should be taken into account when making the decision.


 


Graph and relational relationships

Taking some very simple entity relationships from a single patient health record, the following illustrates the different logical approach.  Graph is on the left, relational is on the right.

Graph versus relational.jpeg

The logical difference is that a graph is explicit about a type of relationship between entities, whereas a relational model uses foreign keys to implicitly relate entities  i.e... on the right the field on the encounter "patient id" points to the patient entity on the left, but does not state the semantic reason for the link. On the left, the graph states that a patient is a subject of an encounter, and could conversely indicate that an encounter 'has a subject' that is a patient.

The Graph approach is much clearer and much more intuitive. It is great for explicit relationships. A graph seems to be more relational than a relational database. Conclusion : Graph wins for intuitiveness.

The following diagram illustrates the processes involved in a graph traversing of a kind that uses "index adjacency" from a patient to an observation via an encounter, versus a join between a patient and an observation, via an encounter,  in both case using the encounter as an intermediate entity (accepting that in reality there may be  more direct relationship).

Graph vs relational traverse.jpeg

It can be seen that in Graph, the relationship indexes are immediately available to the connecting nodes and therefore 4 direct file address pointers are used,  whereas in the relational model the foreign key of the observation table is searched for in the encounter primary root and leaf index pages, back to the encounter, and up again making a total of 5 searches i.e... each index block must be searched to find the destination of the record in each table.  Conclusion, Graph wins for multi-hop traversals over relational joins. (which is of course what graph databases are for).

It should be noted that pure Graph databases would index the relationships rather than the nodes and therefore graph traversal may not include the above.

Health query patterns

The above section illustrates the power of graph over relational. The question is whether this is helpful in health query at population level, with a supplementary issue around patient level.

Population query

The main characteristics of health query over standard business query is the idea of subsumption testing. Subsumption testing involves the examination of concepts that are recorded in health events to see if they are subtypes of a high level type. For example, a query such as "find me the incidence of all infectious diseases in the population of London in the last 10 years" involves (in theory) the examination of billions of observations in patient records, determine that the dates are in the last 10 years and then for each concept recorded in each observation, determine whether it is an infectious Disease. 

To do this, the general approach is to start with a set of infectious diseases. This usually involves using an ontology that has concepts with definitions, which include the fact that something is a disease, and has a pathological process which is infectious i.e. would use inference to generate the initial list of things to look for; a value set.   This would result in a list of around 10,000 concepts i.e.. lots of infectious diseases.

Rather than examine the billions of observations in total, it would be beneficial to use some form of index.  

In a relational query this would mean joining the 10,000 disease concepts (likely to be in a table prepared for the query) against the observations  via an index on the observation instance of the disease.  In a graph query this would involve linking 10,000 disease nodes with the disease nodes that are indicators of disease via the property that links the disease record to the disease. The difference can be illustrated as follows:

 

Property approach to diseases.jpeg

                                                                               Graph                                                                                                       Relational

 

The logic of each is somewhat similar. In the above example, an additional traverse is needed by the graph in order to match the patient. On the other hand. if you need something in relation to the patient e.g. an NHS number, then it is accessible. In the relational approach, because the value of the patient is implicit in the foreign key in the observation, no additional join is needed. Furthermore, it is likely that both the patient and concept would be included in a 'covering index' in a relational model, (as well as an index on the diseases) whereas in graph, the patient would need to be assigned in the observation node to achieve the same.

Thus in this particular case the relational model appears to outperform the graph model. But not by much.

So it seems that there is little to choose in this type of health query between the logic of a graph query and the logic of a table query. In the end it may boil down to other factors

Single patient query

Graph appears to be more logically simpler and more efficient at patient level if the retrieval pattern is at the full record level, as it takes account of the relationship optimisation. 

However, many queries at patient level operate on subsets of records e.g. observation or medications and the partitioning by table provides an advantage over node filtering by the use of c clustered BTree on the table.

At the level of as single patient, whether graph or relational is probably secondary to where the the record query business logic operates. if for example the business logic occurs in the application tier then a simpler document style structured such as JSON would probably win over both.

Relational as graph

A third option is to consider a model whereby the model of the data is conceptualised as a graph (or perhaps an OWLY graph to accommodate subsumption), but is implemented using a relational database. In this approach, relational table design falls into one of two patterns:

1. Conventional pattern. Fields are fixed and explicit, and where their values are foreign keys, the entity they point to is implicit in the field name, perhaps further constrained by a foreign key constraint.

2. Triple tables. These are link tables that have 3 main fields. The first field is a "source or owner" and points to a main entity record (the source table). The second field  is the "relationship or property" field, and whose value is the type of relationship or property.  A third field "value or target" ,whose value depends on the type of Relationship (e.g.. points to another table, an object value, or a data value. 

This approach enables a designer to think graph and implement relational.

The user might have a conceptual graph 'Chrisp Street health centre' -- 'is commissioned by' ---> 'Tower Hamlets CCG' and the question might be"which organisations commission Chrisp Street health centre?

implementation might be

Select name from organisation B

     join organisationExtension E on E.target= B.id  join organisation A on E.Source= A.id

Where A.name="Chrisp street health centre' and E.relationship= 'is commissioned by'

Factors to be considered

Query performance and intuitiveness are just two factors. There are many other factors to take into account, some technical and some non technical. These include:

1. Data storage costs. Although far less important than in the past, storage remains a good proxy for overall cost and performance

2. Data write performance. More relationships and more indexes = greater cost of writes

3. Data access Language, including functionality, ease of use, reliability.

4. Database management system performance, reliability and scalability

5. Infrastructure service support, including whether a technology is well supported by the infrastructure

6. Human resources, including whether their are a pool of resources with the right level of skills to choose from.

7. Community support. Whether a particular solution is supported by a broad community that offers problem solving

Conclusions

  • The decision to select graph versus relational should not be down to the theory of graph versus relational.
  • Health query is predominantly focused on management of sets and less focused on relationships, although both are important in different places.
  • It is possible to design in graph and implement in relational but not vice versa.
  • The overall cost and efficiency are the main considerations. Query performance is just one.

In the Discovery data service, as of May 2020, the conclusion has been drawn that:

"We should design and implement the data model in graph and enable implementation in relational"


"The health record store should be implemented in relational, until such time as the evidence of native graph databases providing an improve cost/ benefit ratio is provided"


"It is very likely that an relational model will be replaced by graph."