What Graph Database is best to use with Spark GraphX?

Spark provides a lot of powerful capabilities for working with graph data structures. Which graph-oriented database is best to use in combination with Spark GraphX, and why?

1 ACCEPTED SOLUTION

Vadim, check out Neo4j; it has a connector for Spark out of the box: http://neo4j.com/developer/apache-spark/

I have used Neo4j in a previous life, and it is a very popular graph database.

You can integrate Neo4j with Spark in a variety of ways. One is to pre-process (aggregate, filter, convert) your raw data before importing it into Neo4j. Another is to use Spark as an external graph compute solution: you export selected subgraphs to Spark, compute the analytics there, and write the results back to Neo4j for use in your Neo4j operations and Cypher queries. A well-known example of this approach is the Neo4j-Mazerunner project.
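
To make the export, compute, write-back loop a bit more concrete, here is a minimal sketch in plain Python. It is not the Neo4j connector or the Mazerunner code: the edge list, the `in_degree` property name, and the `id` node property are all made up for illustration, and the in-degree count stands in for whatever analytic you would really run in Spark. The Cypher statement is only built as a string here; you would hand it, with the parameters, to whichever Neo4j client you use.

```python
from collections import Counter

# Hypothetical exported subgraph: (source id, destination id) pairs.
exported_edges = [("alice", "bob"), ("carol", "bob"), ("bob", "dave")]


def in_degrees(edges):
    """Stand-in analytic: count incoming edges per node."""
    return Counter(dst for _, dst in edges)


def write_back_payload(scores, property_name="in_degree"):
    """Build one parameterised Cypher statement plus its parameters.

    Assumes every node carries a unique `id` property (an assumption
    made for this sketch, not something Neo4j requires).
    """
    statement = (
        "UNWIND $rows AS row "
        "MATCH (n {id: row.id}) "
        f"SET n.{property_name} = row.value"
    )
    rows = [
        {"id": node_id, "value": value}
        for node_id, value in sorted(scores.items())
    ]
    return statement, {"rows": rows}


statement, params = write_back_payload(in_degrees(exported_edges))
```

Batching the updates through a single `UNWIND` keeps the write-back to one round trip instead of one query per node.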

2 REPLIES

GraphX works by loading an entire graph into a combination of VertexRDDs and EdgeRDDs, so the underlying database's capabilities are not really relevant to the graph computation, since GraphX won't touch it beyond initial load.

On that basis you can use anything that will efficiently store and scan a list of paired tuples (the edges) and a list of ids and other properties (the vertices). From this perspective HBase or Accumulo would seem like a good bet to attach Spark to, but of course any file in HDFS would do.
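
As a rough illustration of how little structure the storage layer has to provide, the sketch below keeps a graph as exactly those two flat lists and runs a simple PageRank over them in plain Python. This is not the GraphX API and not its PageRank implementation; the sample data, the function name, and the damping and iteration values are assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical flat storage: a list of (id, properties) and a list of
# (source id, destination id) pairs. Any store that can scan these works.
vertices = [(1, {"name": "a"}), (2, {"name": "b"}), (3, {"name": "c"})]
edges = [(1, 2), (2, 3), (3, 1), (1, 3)]


def pagerank(vertices, edges, damping=0.85, iterations=20):
    """Iterative PageRank over flat vertex and edge lists."""
    ids = [vid for vid, _ in vertices]
    out_neighbours = defaultdict(list)
    for src, dst in edges:
        out_neighbours[src].append(dst)

    rank = {vid: 1.0 / len(ids) for vid in ids}
    for _ in range(iterations):
        new_rank = {vid: (1.0 - damping) / len(ids) for vid in ids}
        for src in ids:
            targets = out_neighbours[src]
            if targets:
                # Split this vertex's rank evenly across its out-edges.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # A vertex with no out-edges spreads its rank over everyone.
                share = damping * rank[src] / len(ids)
                for vid in ids:
                    new_rank[vid] += share
        rank = new_rank
    return rank


ranks = pagerank(vertices, edges)
```

Nothing here depends on where the two lists came from, which is the point: once the load is done, the computation never goes back to the database.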

For the ability to modify a graph prior to analysing it in GraphX, it's more useful to pick a 'proper' graph database. For this it's worth looking at something like Accumulo Graph, which provides a graph database hosted on Accumulo, or possibly another very exciting new project, Gaffer (https://github.com/GovernmentCommunicationsHeadquarters/Gaffer), also hosted on Accumulo. TinkerPop (https://tinkerpop.apache.org/) provides some other options focussed around the common Gremlin APIs, which are generally well understood in the graph world. Other options might include something like TitanDB hosted on HBase. These will provide you with an interface API for modifying graphs effectively.
