Store graph in MongoDB - mongodb

What is the most efficient way to store a directed graph with vertices and edges in mongodb?
I have stored it as a collection node and a collection edge where each edge has source and target both pointing at the node collection.
But is this the most efficient way to do it if I want to traverse the graph and retrieve successors and predecessors?
Edit
Each node and edge won't have much other data (maybe 2 other fields) and each node will not have many edges (between 1-5).

MongoDb is not suitable for graphs. There are better alternatives than mongoDB as graph database. You can explore Orientdb, it's open sourced. If you are looking for more professional and performance optimized graph database, then go for Neo4j.

Related

Multimodel database vs multiple individual databases?

I am working on application which requires features offered by both graph database(to store raw data) and document database(extracted reports from raw data). I planned to use neo4j and mongodb. I am having second thoughts about and looking at orientDB. is it better to have a single multimodel database than two separate databases? The reason I leaned towards neo4j is its native graph storage which might come in handy for memory locality for large graphs. OrientDB doesn't store graph natively. or does it?
OrientDB stores graph natively. Its engine is 100% a Graph Database like Neo4j. Actually OrientDB and Neo4j are the only Graph Databases with index-free adjacency. Some other Graph Database acts as a layer on top of an existent model (RDBMS, Column or Document stores).
So there is nothing you can do with Neo4j that you can't do with OrientDB. But OrientDB allows to model more complex data, like Document DBMS (MongoDB) can do. For example each vertices and edges in OrientDB is a document (json), so you can store in the vertex and edge complex types like embedded properties, list, sets, date, decimal, etc.
Don't be dazzled by terminology. "Index-free adjacency" is a term that simply means graph vertices are stored "with" their edges. Each database does this in a slightly different way. Neo4J stores them on disk in a linked list. If you have them in memory, and there's not too many of them, they're fast. If you have to hit them on disk, then you may need an index. Titan stores them as columns in a wide-column database such as Cassandra. If they're in memory, they're fast. If you have to hit them on disk, the underlying database's range queries make them fast to load in bulk, and extra indexing can decrease the cost of searching large edge lists.
This discussion is fairly valuable: How does Titan achieve constant time lookup using HBase / Cassandra?
Whether you're using OrientDB or any other database, your efficiency at graph queries will rely in large part on the indexing you put in place so that you start your graph queries on, and traverse through, a relatively small set of nodes. Be sure to model some of the queries you're doing to make sure that whatever database you choose will support the right indexes, whether they're across the whole graph, or local to each vertex.

OrientDB is a graph database?

I am trying to understand how OrientDB can be a graph and a document database at the same time.
How a document is persisted? It appears to be different compared with ArangoDB, when everything is a document.
OrientDB is a graph database that supports documents? The documents are graph nodes?
Thanks.
Orientdb graphs work on top of the documents.
http://orientdb.com/docs/2.0/orientdb.wiki/Choosing-between-Graph-or-Document-API.html
(So one would say orientdb is a document db which supports graphs.)
Therefore documents are much faster when accessing.
With graphs you can have edges (bi directional) while in documents you only have links.

How to compute connected components in OrientDB

Does OrientDB's support efficient computations of connected components?
I am not experienced with graph databases. My naiive intuition is that this operation should be quite efficient.
If it is efficiently supported, how would a query look like to find all connected components?
I had your same issue but I finally ended up writing an OSQL query to compute connected components in a graph, here is my solution
Below is an excerpt from the OrientDB website. I've highlighted a few relevant portions.
OrientDB can embed documents like any other document database, but
also supports relationships. It doesn’t use the costly JOIN. Instead,
OrientDB uses super-fast, persistent pointers between records, taken
from the graph database world. You can traverse parts of or entire
trees and graphs of records in just a few milliseconds.
This illustration shows how the original document has been
split into two documents linked using the Customer’s Record ID #8:124
to connect the Order to the Customer document. Links can be thought of
as in-memory pointers, but persistent on disk.
[snip]
Equipped With document and relational DBMS, the more data you
have, the slower the database will be. Joins have a heavy runtime
cost. Instead, OrientDB handles relationships as physical links to the
records, assigned only once, when the edge is created O(1). Compare
this to an RDBMS that “computes“ the relationship every single time
you query a database O(LogN). With OrientDB, traversing speed is not
affected by the database size. It is always constant, whether for one
record or 100 billion records. This is critical in the age of Big
Data!
And here is an example query taken from the tutorial document, which will get all the friends of the person called Luca.
SELECT EXPAND( BOTH( 'Friend' ) ) FROM Person WHERE name = 'Luca'

neo4j vs mongodb for spatial search

I'm getting ready to start a project where I will be building a recommendation engine for restaurants. I have been waffling between neo4j (graph db) and mongodb (document db). my nodes/documents will be things like restaurant and person. i know i will want some edges, something like person->likes->restaurant, or person->ate_at->restaurant. my main query, however, will be to find restaurants within X miles of location Y.
if i have 20 restaurant's within X miles of Y, but not connected by any edges, how will neo4j be able to handle the spatial query? i know with mongodb i can index on lat/long and query all restaurant types. does neo4j offer the same functionality in a disconnected graph?
when it comes to answering questions like, 'which restaurants do my friends eat at most often?', is neo4j (graph db) the way to go? or will mongodb (document db) provide me similar functionality?
Neo4j Spatial introduces a Spatial RTree (or other means) index that is part of the graph itself. That means, even disconnected domain entities will be found via the spatial search, if you index them (that is relationships will connect the Spatial index to the Restaurants). Also, this is flexible enough that you can combine the Raw BBox search in the RTree with other things like check on the restaurants categories in the same go, since you can hop out and in the different parts of the graph.
This way, neo4j Spatial is supporting the full range of search capabilities that you would expect form a full Topology, like combined searches and searches on polygons with holes etc.
Be aware that Neo4j Spatial is in 0.7, so be gentle and ask on http://groups.google.com/group/neo4j/about :)
I'm not that familiar with Neo4J Spatial but it would seem that MongoDB is at the very least a good fit since it's the database Foursquare uses with exactly the purpose you describe. MongoDB geo indexing is extremely fast and scales up nicely.
Another possible solution is to use CouchBase. It uses a document model as well - though you need to be much more comfortable with MapReduce for queries. It has better spatial capabilities right now thank MongoDB but that may change over time.
Suggestion aside, I agree that of the two choices you have given Mongo will suit your needs fine and probably more appropriate for your spatial queries.
Neo4j geospatial doesn't scale up that good. I created a geospatial layer in neo4j and added nodes to this layer. Beyond 10,000 nodes the addition of nodes to the layer becomes very slow even when using neo4j2.0
On the other hand, mongodb geo-location works comparatively much faster and is more scalable.

Storing a graph in mongodb

I have an undirected graph where each node contains an array. Data can be added/deleted from the array. What's the best way to store this in Mongodb and be able to do this query effectively: given node A, select all the data contained in the adjacent nodes of A.
In relational DB, you can create a table representing the edges and another table for storing the data in each node this so.
table 1
NodeA, NodeB
NodeA, NodeC
table 2
NodeA, item1
NodeA, item2
NodeB, item3
And then you join the tables when you query for the data in adjacent nodes. But join is not possible in MongoDB, so what's the best way to setup this database and efficiently query for data in adjacent nodes (favoring performance slightly over space).
Specialized Distributed Graph Databases
I know this is sounds a little far afield from the OPs question about Mongo, but these days there are more specialized graph databases that excel at this kind of work and may be much easier for you to use, especially on large graphs.
There is a comparison of 7 such offerings here: https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE#gid=0
Of the three most significant open source offerings (Titan, OrientDB, and Neo4J), all of them support the Tinkerpop Blueprints interface. So for a graph that looks like this...
... a query for "all the people that Juno greatly admires who she has known since the year 2011" would look like this:
Iterable<Vertex> results = juno.query().labels("knows").has("since",2011).has("stars",5).vertices()
This, of course, is just the tip of the iceberg. Pretty powerful stuff!
If you have to stay with Mongo
Think of Tinkerpop Blueprints as the "JDBC of storing graph structures" in various databases. The Tinkerpop Blueprints API has a specific MongoDB implementation that would work for you I'm sure. Then using Tinkerpop Gremlin, you have all sorts of advanced traversal and search methods at your disposal.
I'm picking up mongo, looking into this sort of schema as well (undirected graphs, querying for information from neighbors) I think the way that I favor so far looks something like this:
Each node contains an array of neighbor keys, like so.
{
nodeIndex: 4
myData: "data"
neighbors: [8,15,16,23,42]
}
To find data from neighbors, use the $in "operator":
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}});
You can use field selection to limit results to the relevant data.
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}}, {myData:1});
See http://www.mongodb.org/display/DOCS/Trees+in+MongoDB for inspiration.
MongoDB will introduce native graph capabilities in version 3.4 and it could be used to store graph stuctures and do analytics on them although performance might not be that good compared to native graph databases like Neo4j depending on the cases but it is too early to judge.
Check those links for more information:
$graphLookup (aggregation)
MongoDB 3.4 Accelerates Digital Transformation for the Modern Enterprise
MongoDB can simulate a graph using a flexible tree hierarchy. You may want to consider neo4j for strict graphing needs.