Graph or relational databases: Difference between revisions

From Discovery Data Service
Jump to navigation Jump to search
No edit summary
No edit summary
Line 12: Line 12:
Taking some very simple entity relationships from a health record, the following illustrates the different logical approach.  Graph is on the left, relational is on the right.
Taking some very simple entity relationships from a health record, the following illustrates the different logical approach.  Graph is on the left, relational is on the right.


[[File:Graph versus relational.jpeg|center|800x300px|Graph versus relational.jpeg]]The logical difference is that graph's can be&nbsp;explicit about a type of relationship between entities&nbsp;whereas a relational model uses foreign keys to link entities together i.e. the field on the right "patient id" points to the patient entity on the left but does not state the semantic reason for the link. On the left the graph states that a patient is a subject of an encounter, and could conversely indicated that an encounter has a subject that is a patient. The Graph approach is much clearer and much more intuitive. A graph is therefore more relational than a relational database. Conclustion&nbsp;: <span style="color:#27ae60;">Graph wins for intuitiveness.</span>
[[File:Graph versus relational.jpeg|center|800x300px|Graph versus relational.jpeg]]The logical difference is that graph's can be&nbsp;explicit about a type of relationship between entities, whereas a relational model uses foreign keys to implicitly relate&nbsp;entities &nbsp;i.e. the field on the right "patient id" points to the patient entity on the left, but does not state the semantic reason for the link. On the left the graph states that a patient is a subject of an encounter, and could conversely indicate&nbsp;that an encounter has a subject that is a patient.


The following diagram illustrates the processes involved in traversing from a patient to an observation (graph) versus a join between a patient and an observation, in both case using the encounter as an intermediate entity (accepting that in reality there may be&nbsp; more direct relationship.
The Graph approach is much clearer and much more intuitive. A graph seems to be more relational than a relational database. Conclustion&nbsp;: <span style="color:#27ae60;">Graph wins for intuitiveness.</span>
 
The following diagram illustrates the processes involved in traversing from a patient to an observation (graph) via an encounter, versus a join between a patient and an observation, via an encounter,&nbsp; in both case using the encounter as an intermediate entity (accepting that in reality there may be&nbsp; more direct relationship).


[[File:Graph vs relational traverse.jpeg|800x400px|Graph vs relational traverse.jpeg]]
[[File:Graph vs relational traverse.jpeg|800x400px|Graph vs relational traverse.jpeg]]


It can be seen that in Graph, the relationship indexes are adjacent to the connecting nodes and therefore 4 direct memory adress pointers are needed whereas in the relational model the foregn key of the observation table is searched for in the encounter primary route index and back to the encounter and up again making a total of 5 searches i.e. each index block must be searched to find the destination of the record in each table.&nbsp; Conclusion, <span style="color:#27ae60;">Graph wins for multi-hop traversals over relational joins</span>. (which is of course what graph databases are for&nbsp;are what they are)
It can be seen that in Graph, the relationship indexes are adjacent to the connecting nodes and therefore 4 direct file&nbsp;address pointers are needed whereas in the relational model the foreign key of the observation table is searched for in the encounter primary route index and back to the encounter and up again making a total of 5 searches i.e. each index block must be searched to find the destination of the record in each table.&nbsp; Conclusion, <span style="color:#27ae60;">Graph wins for multi-hop traversals over relational joins</span>. (which is of course what graph databases are for&nbsp;are what they are)


= Health query patterns =
= Health query patterns =
Line 44: Line 46:
{| border="1" cellpadding="1" cellspacing="1" style="width: 1069px;"
{| border="1" cellpadding="1" cellspacing="1" style="width: 1069px;"
|-
|-
| style="width: 269px;" | [[File:Property approach to diseases.jpeg|350x100px]]
| style="width: 269px;" | [[File:Property approach to diseases.jpeg|350x100px|Property approach to diseases.jpeg]]
| style="width: 390px;" | [[File:Instance model.jpeg|300x200px]]
| style="width: 390px;" | [[File:Instance model.jpeg|300x200px|Instance model.jpeg]]
| style="width: 391px;" | [[File:Concept link.jpeg|300x200px]]
| style="width: 391px;" | [[File:Concept link.jpeg|300x200px|Concept link.jpeg]]
|}
|}


Line 53: Line 55:
= Conclusion =
= Conclusion =


The decision to select graph versus relational should not be down to the theory of graph versus relational. Instead the decision should be down to consideration of practical issues such as:
The decision to select graph versus relational should not be down to the theory of graph versus relational. Health query is predominantly focused on management of sets and less focused on relationships, ahtough both are important in different places. Instead the decision should be down to consideration of practical issues such as:


*The performance and cost of a particular database management engine and how efficient they are at managing large set intersections and unions.  
*The performance and cost of a particular database management engine and how efficient they are at managing large set intersections and unions.  
*The value to the enterprise of the use of one query syntax over another e.g. its level of sophistication, flexibility, ease of learning and community support for problem solving.  
*The value to the enterprise of the use of one query syntax over another e.g. its level of sophistication, flexibility, ease of learning and community support for problem solving.  
*The scale of database implementation  
*The scale of database implementation  
 
*The cost of support of one technology over another.
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
Property approach
 
&nbsp;


&nbsp;
&nbsp;

Revision as of 12:49, 3 May 2020

A question arises as to whether the Discovery Data Service should use relational, property graph, document, or key value based database management systems  in its main patient record data stores.

This article considers the differences between graph and relational from the perspective of the known information requirements of health query. It reaches a conclusion as a result.

The starting point is to examine a few key logical differences then move on to some technical differences. Following this they are compared and contrasted from the perspective of the type of queries needed, in order to conclude which one to go for.

Firstly, its worth noting that a pure Graph model is not included for consideration. There are known problems at scale with pure graphs. If one were to adopt Graph then a property graph is the most likely variant. A property graph has properties on the nodes, and in the case of NEO4J, also has properties on the edges (which are called relationships).

Graph and relational relationships

Taking some very simple entity relationships from a health record, the following illustrates the different logical approach.  Graph is on the left, relational is on the right.

Graph versus relational.jpeg

The logical difference is that graph's can be explicit about a type of relationship between entities, whereas a relational model uses foreign keys to implicitly relate entities  i.e. the field on the right "patient id" points to the patient entity on the left, but does not state the semantic reason for the link. On the left the graph states that a patient is a subject of an encounter, and could conversely indicate that an encounter has a subject that is a patient.

The Graph approach is much clearer and much more intuitive. A graph seems to be more relational than a relational database. Conclustion : Graph wins for intuitiveness.

The following diagram illustrates the processes involved in traversing from a patient to an observation (graph) via an encounter, versus a join between a patient and an observation, via an encounter,  in both case using the encounter as an intermediate entity (accepting that in reality there may be  more direct relationship).

Graph vs relational traverse.jpeg

It can be seen that in Graph, the relationship indexes are adjacent to the connecting nodes and therefore 4 direct file address pointers are needed whereas in the relational model the foreign key of the observation table is searched for in the encounter primary route index and back to the encounter and up again making a total of 5 searches i.e. each index block must be searched to find the destination of the record in each table.  Conclusion, Graph wins for multi-hop traversals over relational joins. (which is of course what graph databases are for are what they are)

Health query patterns

The above section illustrates the power of graph over relational. The question is whether this is helpful in health query.

The main characteristics of health query over standard business query is the idea of subsumption testing. Subsumption testing involves the examination of concepts that are recorded in health events to see if they are subtypes of a high level type. For example, a query such as "find me the incidence of all infectious diseases in the population of London in the last 10 years" involves (in theory) the examination of billions of observations in patient records, determine that the dates are in the last 2 weeks and then for each concept recorded in each observation, determine whether it is an infectious deisease. 

To do this, the general approach is to start with a list of infectious diseases. This involves using an ontology that has concepts with definitions which include the fact that it is a disease, and has a patholigical process which is infectious. This would result in a list of around 10,000 concepts.

Rather than examine the billions of observations in total, it may be beneficial to use some form index or relationship on the observation.  

In a relational query this would mean joining the 10,000 disease concepts (likely to be in a table prepared for the query) against the observations  via an index on observation instance of the disease

In a graph query there are two options depending on how concepts are related to the observations:

a) If the disease instance is a property value on an observation node, then the list of 10,000 concepts would be matched against a concept index on the observation nodes

b) If the instance disease is created as a node linked to the observation then each of the 10,000 concepts would be linked via the node type of that particular disease and the link to the observation would then be used:

c) If a disease concept is linked via a relationship to the observations i.e. the disease class is linked to the observation via an instance relationship then the relationships between each of the 10,000 concepts and their linked observations would be used directly.

These three options can be illustrated as follows:

Property approach to diseases.jpeg Instance model.jpeg Concept link.jpeg

In the left hand example, a node is in effect the same as a relational record. In the second each observation is linked to one additional node, and therefore the index operates on the node index and not the property index. In the third there is one concept of each disease and all observations are linked to it.

Conclusion

The decision to select graph versus relational should not be down to the theory of graph versus relational. Health query is predominantly focused on management of sets and less focused on relationships, ahtough both are important in different places. Instead the decision should be down to consideration of practical issues such as:

  • The performance and cost of a particular database management engine and how efficient they are at managing large set intersections and unions.
  • The value to the enterprise of the use of one query syntax over another e.g. its level of sophistication, flexibility, ease of learning and community support for problem solving.
  • The scale of database implementation
  • The cost of support of one technology over another.