Of graphs and relationships - An evening at the JUG Karlsruhe
The Java User Group in Karlsruhe (JUG Karlsruhe) arranges free-for-all presentations about topics around, but not limited to, Java. For the first time in the new location, now at the Karlsruhe 'Hochschule für Technik&Wirtschaft' . The talk last Wednesday (September 4th 2013) was dedicated to the beautiful world of graphs. The presentation was given by Peter Neubauer, a graph enthusiast. Christian Mennerich and I don't miss the chance to see this event.
Peter, one of the founder of Neo Technology and the graph database Neo4j, was presenting 'Neo4j und die wunderbare Welt der Graphen'. In a talk that stretched to two and a half hours, without getting boring, Peter presented some graph database basics, the Neo4j graph database and lots of applications and code examples. At first, Peter talked a bit about the relational database world and the NoSQL movement, but he skipped most of this part. He introduced the property graph model and how it is reflected in Neo4j. Graphs can be the data structure of choice for data mining in highly related data sets. But there are lots of other interesting applications for the Neo4j graph database, ranging from social network applications and recommendation systems to routing problems.
Neo4j and graph-databases
Neo4j is a graph database, what means that the basic data structure is a graph, in opposition to classic relational databases that deal with relations only. Neo4j is based on so called *property graphs*. A property graph consists of *nodes* that are linked to other nodes by *relationships*. Here, as in real live relationships, the direction matters.
*Properties* can be assigned to both, the nodes and the relationships, and more than one relationship between nodes is possible. So, a property graph is what you actually call a directed labeled multi graph. Properties are 'flat data', no links to real objects is directly supported (but nobody stops you from putting a serialized object as a property...).
Neo4j is implemented in Java. Nodes are stored as doubly linked lists, making graph traversals possible in the direction of ingoing and outgoing relationships. This implementation makes operations concerning neighbored nodes (i.e. nodes linked by a relationship) possible in constant time. Thus, graphs provide what you call 'index-free adjacency'. At the moment, more than 17 Billion nodes and relationships can be managed by a single instance of Neo4j (enough to represent a continents data points in Open Street Map). To traverse a path in a graph, you need an entry node in the graph.
Neo4j comes along with full ACID support. The ACID properties, atomicity, consistency, isolation and durability, are well known from relational databases. In short, ACID means that a transaction manipulating data in the databse is executed as if it had exclusive access, and either all its operations take effect, or none. In the sense of the CAP theorem, Neo4j is on the CA side of the CAP triangle, in opposition to many other NoSQL databases that 'sacrifice' A or C for P. (CAP stands for consistency, availability, partition tolerance.)
Thus, horizontal scale-out is not an easy problem in Neo4j. To gain high availability, Neo4j offers a master/slave replication strategy, where read performance is ensured by replication of the data. Write operations here become a bottleneck, as a dedicated master node is responsible for the write processes. If the master fails (e.g. due to network problems), a new master is automatically elected by a quorum. Sharding in graphs seems to be complicated, as it involves the problem of partitioning a graph.
In addition to the Neo4j database a REST-server and a web front-end are available to maintain the data in the database. In the web-frontend, the graph itself can be visualized in a node-and-arrow manner, starting from a selected node. Neighboring nodes and relationships can be folded and unfolded to explore the data. Algorithms to arrange the graph clearly on the screen are used. Depending on the size of the graph and the degree of the nodes (i.e. the number of ingoing and outgoing relationships) the visual exploration can soon lose its fun.
Neo4j comes in of three editions: The Community Edition is for first steps with the graph database, the Advanced and Enterprise Edition are for more sophisticated and production set-ups. The community edition is free and might be used in a similar way as MySQL might be used. For the usage of the Advanced and Enterprise Edition, licensing is needed. For Open-Source projects, there is the AGPL, to include Neo4j in closed-source business application, the Neo Technology Commercial License (NTCL) is available. Actual terms of the license that fits your purposes and business models can be discussed with Neotechnology directly. But the good thing is: Neo4j is free in all three versions for testing and evaluating purposes!
Working with graphs
Neo4j offers different possibilities to interact and query the graph. Among others, there are: A REST-interface, interfaces to the query languages Cypher, and the usage of Spring Data Neo4j. The REST interface accepts JSON and can be queried using.e.g. 'curl'.
Cypher is an easy-to-learn query language, inspired by ASCII-arts: You kind of 'draw' your query, using symbols for nodes, and the directed relationships ( (...) and -[...]-> ). With that you can ask for the friends of Tobis friends:
Hopefully, Tobi will be in the result set.
A different way to work with graphs is using the Spring Data Neo4j API. The object-graph mapping works similar to the objects-relational mapping in Spring Data for relational databases. Using annotations for nodes and relationships, objects are linked to nodes and relations in the database. It is possible to include Cypher queries in annotations, what makes the API flexible to use.
Compared to relational database systems, queries become incredibly fast due the locality and topology of the nodes. As the friends-of-friend query shown above will be answered in a time, almost independent of the number of persons stored in the database, as only neighboring nodes are searched what keeps the data local. (Actually, the time for the query scales with the number of friends a person has.) In a relational database system, the query time strongly depends on the number of persons in the system, so that exceeding a certain number of persons, in a relational database system the query becomes unfeasible in an acceptable time.
Thinking in graphs
Thinking in graphs needs getting used to. Queries have to be formulated as graph traversals or graph-related problems. But once the thought is accepted and the classical thinking in relations is abandoned, the advantages become clear. Many scenarios can be described in the graph world. Social networks and product recommendation systems are an obvious example. But also the problem of dynamically assigning roles and rights in a content management system (CMS), a hard problem, can be formulated and solved as a shortest path problem in a graph. Graphs are often easy to understand, as they are 'white board friendly': Once painted to a white board, the data model is there and ready to be implemented. No matter if you draw in an abstract manner, or as a concrete example.
The concept of a graph is powerful, and as general as the concept of relations in relational databases. They are fast to traverse when the data is highly connected. And giving the easy data structure of a graph, queries are often easy to formulate. On the other side, sharding can be a real problem in big data scenarios (but it also is in relational databases), and the shift towards 'thinking in graphs' might take some time to adjust to. Give it a try, it is fun! (But don't make it another all-dominant paradigm!)
Peter showed lots of interesting application of models of real world scenarios with graphs, covering CMS applications, strutr (structr.org), routing (www.transportdublin.ie), gene sequencing and more. On his laptop, he also could show coding examples of programming against the Spring Data Neo4j API, proving the elegance of this approach. In a live-experiment, Peter showed exemplary queries against a database he had populated with data from from MusicBrainz and Last.fm, about 15 GBs in size! With queries like 'What music does A like that B likes' Peter showed how easy it is to create a music recommendations in this huge amount of data by exploiting the locality of the data. The Cypher expression he used all just span a few lines.
Graphs are great, we all knew that before. But there are more applications to graphs than one might have thought of. Neo4j comes with solutions for many often occurring problems, such as finding the shortest path between two nodes. After discussing the importance of self relationships of nodes in a graph database, Peter proudly presented his T-shirt, so we now know: (peter)-[FRIEND]->(peter). And, after paying the beer, he surely has a lot more friends now. But seriously: Thank you Peter for the presentation, the discussion and the interesting insides into the applications of graphs and graph databases! It was a long, informative and fun evening.