Graph or relational databases

From Discovery Data Service
Jump to navigation Jump to search

A question arises as to whether the Discovery Data Service should use relational, property graph, document, or key value based database management systems  in its main patient record data stores.

This article considers the differences between graph and relational from the perspective of the known information requirements of health query. It reaches a conclusion as a result.

The starting point is to examine a few key logical differences then move on to some technical differences. Following this they are compared and contrasted from the perspective of the type of queries needed, in order to conclude which one to go for.

Firstly, its worth noting that a pure Graph model is not included for consideration. There are known problems at scale with pure graphs. If one were to adopt Graph then a property graph is the most likely variant. A property graph has properties on the nodes, and in the case of NEO4J, also has properties on the edges (which are called relationships).

Graph and relational relationships

Taking some very simple entity relationships from a health record, the following illustrates the different logical approach.  Graph is on the left, relational is on the right.

Graph versus relational.jpeg

The logical difference is that graph's can be explicit about a type of relationship between entities, whereas a relational model uses foreign keys to implicitly relate entities  i.e.. the field on the right "patient id" points to the patient entity on the left, but does not state the semantic reason for the link. On the left the graph states that a patient is a subject of an encounter, and could conversely indicate that an encounter has a subject that is a patient.

The Graph approach is much clearer and much more intuitive. A graph seems to be more relational than a relational database. Conclusion : Graph wins for intuitiveness.

The following diagram illustrates the processes involved in traversing from a patient to an observation (graph) via an encounter, versus a join between a patient and an observation, via an encounter,  in both case using the encounter as an intermediate entity (accepting that in reality there may be  more direct relationship).

Graph vs relational traverse.jpeg

It can be seen that in Graph, the relationship indexes are adjacent to the connecting nodes and therefore 4 direct file address pointers are needed whereas in the relational model the foreign key of the observation table is searched for in the encounter primary route index and back to the encounter and up again making a total of 5 searches i.e.. each index block must be searched to find the destination of the record in each table.  Conclusion, Graph wins for multi-hop traversals over relational joins. (which is of course what graph databases are for are what they are)

Health query patterns

The above section illustrates the power of graph over relational. The question is whether this is helpful in health query.

The main characteristics of health query over standard business query is the idea of subsumption testing. Subsumption testing involves the examination of concepts that are recorded in health events to see if they are subtypes of a high level type. For example, a query such as "find me the incidence of all infectious diseases in the population of London in the last 10 years" involves (in theory) the examination of billions of observations in patient records, determine that the dates are in the last 2 weeks and then for each concept recorded in each observation, determine whether it is an infectious Disease. 

To do this, the general approach is to start with a set of infectious diseases. This involves using an ontology that has concepts with definitions which include the fact that something is a disease, and has a pathological process which is infectious. This would result in a list of around 10,000 concepts.

Rather than examine the billions of observations in total, it would be beneficial to use some form index.  

In a relational query this would mean joining the 10,000 disease concepts (likely to be in a table prepared for the query) against the observations  via an index on observation instance of the disease.  In a graph query this would involve matching 10,000 disease nodes with observation nodes that are indicators of disease. The difference can be illustrated as follows:

 

Property approach to diseases.jpeg

                                                                               Graph                                                                                                       Relational

 

The logic of each is somewhat similar. In the above example, an additional traverse is needed by the graph in order to match the patient. In the relational approach, because the value of the patient is implicit in the foreign key in the observation, no additional join is needed. Furthermore, it is likely that both the patient and concept would be included in a 'covering index' in a relational model, whereas in graph, the patient would need to be assigned in the observation node to achieve the same.

So it seems that there is little to choose in this type of health query between the logic of a graph query and the logic of a table query. In the end it may boild down to other factors

Relational as graph

A third option is to consider a model whereby the model of the data is conceptualised as a graph (or perhaps an OWLY graph to accomodate subsumption), but is implemented using a relational database. In this approach, reelational table design falls into one of two patterns:

1. Conventional pattern. Fields are fixed and explicit, and where their values are foreign keys, the entity they point to is implicit in the field name, perhaps further constrained by a foreign key constraint.

2. Triple tables. These are link tables that have 3 main fields. The fist field is a "source or owner" and points to a main entity record (the source table). The second field  is the "relationship or property" field, and whose value is the type of relationship or property.  A third field "value or target" ,whose value depends on the type of relationsip (e.g. points to another table, an object value, or a data value. 

This approach enables a designer to think graph and implement relational.

Factors to be considered

Query performance and intutiveness are just two factors. There are many other factors to take into account, some tecnical and some non technical. These include

1. Data storage costs. Although far less important than previously, storage remains a good proxy for overall cost and performance

2. Data write performance. More relationships and more indexes = greater cost of writes

3. Data access Language, including functionality, ease of use, reliability.

4. Database managem system performance, including reliability and scalability

5. Infrastucture service support, including whether a technology is well supported by the infrastructure

6. Human resources, including whether their are a pool of resources with the right level of skills to choose from.

7. Community suppport. Whether a particular solution is supported by a broad community that offers problem solving

Conclusions

  • The decision to select graph versus relational should not be down to the theory of graph versus relational.
  • Health query is predominantly focused on management of sets and less focused on relationships, although both are important in different places.
  • It is possible to design in graph and implement in relational but not vice versa.
  • The overall cost and efficiency are the main considerations. Query performance is just one.

In the Discovery data service, as of May 2020, the conclusion has been drawn that:

We should design in graph and implement in relational, until such time as the evidence of native graph databases providing an improve cost/ benefit ratio is present.