Storing a graph in mongodb - mongodb

I have an undirected graph where each node contains an array. Data can be added/deleted from the array. What's the best way to store this in Mongodb and be able to do this query effectively: given node A, select all the data contained in the adjacent nodes of A.
In relational DB, you can create a table representing the edges and another table for storing the data in each node this so.
table 1
NodeA, NodeB
NodeA, NodeC
table 2
NodeA, item1
NodeA, item2
NodeB, item3
And then you join the tables when you query for the data in adjacent nodes. But join is not possible in MongoDB, so what's the best way to setup this database and efficiently query for data in adjacent nodes (favoring performance slightly over space).

Specialized Distributed Graph Databases
I know this is sounds a little far afield from the OPs question about Mongo, but these days there are more specialized graph databases that excel at this kind of work and may be much easier for you to use, especially on large graphs.
There is a comparison of 7 such offerings here: https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE#gid=0
Of the three most significant open source offerings (Titan, OrientDB, and Neo4J), all of them support the Tinkerpop Blueprints interface. So for a graph that looks like this...
... a query for "all the people that Juno greatly admires who she has known since the year 2011" would look like this:
Iterable<Vertex> results = juno.query().labels("knows").has("since",2011).has("stars",5).vertices()
This, of course, is just the tip of the iceberg. Pretty powerful stuff!
If you have to stay with Mongo
Think of Tinkerpop Blueprints as the "JDBC of storing graph structures" in various databases. The Tinkerpop Blueprints API has a specific MongoDB implementation that would work for you I'm sure. Then using Tinkerpop Gremlin, you have all sorts of advanced traversal and search methods at your disposal.

I'm picking up mongo, looking into this sort of schema as well (undirected graphs, querying for information from neighbors) I think the way that I favor so far looks something like this:
Each node contains an array of neighbor keys, like so.
{
nodeIndex: 4
myData: "data"
neighbors: [8,15,16,23,42]
}
To find data from neighbors, use the $in "operator":
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}});
You can use field selection to limit results to the relevant data.
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}}, {myData:1});

See http://www.mongodb.org/display/DOCS/Trees+in+MongoDB for inspiration.

MongoDB will introduce native graph capabilities in version 3.4 and it could be used to store graph stuctures and do analytics on them although performance might not be that good compared to native graph databases like Neo4j depending on the cases but it is too early to judge.
Check those links for more information:
$graphLookup (aggregation)
MongoDB 3.4 Accelerates Digital Transformation for the Modern Enterprise

MongoDB can simulate a graph using a flexible tree hierarchy. You may want to consider neo4j for strict graphing needs.

Related

Database design for time and geolocation data

I have successfully implemented the Time Series Use case like shown in the documentation. The data(Event class) pointed by the smallest time unit is indexed with a lucene spatial index.
I have two types of events : private or public.
How should I design my database and clusters for my use case ?
Would it be better to use an Edge or a linklist to make the link from Min to Event ?
I am worried that my lucene spatial index will get too big in the future.
By reading at the documentation, it looks like having clusters for the geolocation data would be a great strategy.
It is possible to use the index only on the subquery:
select
from
(select
expand(month["12"].day["25"].hour["17"].min["07"].events)
from
Year
where
year = 2015)
where
[lat,lng,$spatial]
NEAR
[66,66,{"maxDistance":1}]
The documentation on indexes tells me it is possible to use indexes on an edge properties. The bad side is that it takes more storage then linked list I tested it and it works :
select
expand(inV())
from
(select
expand(month["12"].day["25"].hour["17"].min["07"].outE('MinPublicEvent'))
from
Year
where
year = 2015)
where
[lat,lng,$spatial]
NEAR
[66,66,{"maxDistance":1}]
In regard of edge vs link, taken from OrientDB doc: lightweight edges, the first difference is that edges can store properties and links don't.
These are the PROS and CONS of Lightweight Edges vs Regular Edges:
PROS:
faster in creation and traversing, because don't need an additional
document to keep the relationships between 2 vertices
CONS:
cannot store properties harder working with Lightweight edges from
SQL, because there is no a regular document under the edge
Since you already mentioned using properties on edges, which makes sense to me, as you can use these properties in the edges to transverse the graph, this means that you can't use a link to store that relationship.
In the case you want to embed these properties on the Event vertex, that is also fine, and you'd be able to use links, loosing the hability of using the properties in the edge to transverse the graph in favour of improved performance.
The edge approach is more expressive, but when performance really matters, and there is risk of a bottleneck, you should monitor the metrics and the performance, and refactor to the embed + link approach in case there is an issue with performance.
Update:
Clusters are basically a mechanism to split data in OrientDB (Clusters tutorial), which works for both, edge and vertex.
You may also find it beneficial to locate different clusters on
different servers, physically separating where you store records in
your database. The advantages of this include:
Optimization: Faster query execution against clusters, given that you
need only search a subset of the clusters in a class.
Indexes: With good partitioning, you can reduce or remove the use of > indexes.
Parallel Queries: Queries can be run in parallel when made to data on
multiple disks.
Sharding: You can shard large data-sets across
multiple instances.
Unless you can identify clearly a good way to partition your data, and can distribute your database between different servers, i suggest you start with the default, as OrientDB already creates 1 cluster for each class in the schema, and add more clusters as your database grow.
When to add more clusters? Metrics, metrics and metrics. Keep track of how your application access your database, what kind of queries, amount of queries, etc.

How to compute connected components in OrientDB

Does OrientDB's support efficient computations of connected components?
I am not experienced with graph databases. My naiive intuition is that this operation should be quite efficient.
If it is efficiently supported, how would a query look like to find all connected components?
I had your same issue but I finally ended up writing an OSQL query to compute connected components in a graph, here is my solution
Below is an excerpt from the OrientDB website. I've highlighted a few relevant portions.
OrientDB can embed documents like any other document database, but
also supports relationships. It doesn’t use the costly JOIN. Instead,
OrientDB uses super-fast, persistent pointers between records, taken
from the graph database world. You can traverse parts of or entire
trees and graphs of records in just a few milliseconds.
This illustration shows how the original document has been
split into two documents linked using the Customer’s Record ID #8:124
to connect the Order to the Customer document. Links can be thought of
as in-memory pointers, but persistent on disk.
[snip]
Equipped With document and relational DBMS, the more data you
have, the slower the database will be. Joins have a heavy runtime
cost. Instead, OrientDB handles relationships as physical links to the
records, assigned only once, when the edge is created O(1). Compare
this to an RDBMS that “computes“ the relationship every single time
you query a database O(LogN). With OrientDB, traversing speed is not
affected by the database size. It is always constant, whether for one
record or 100 billion records. This is critical in the age of Big
Data!
And here is an example query taken from the tutorial document, which will get all the friends of the person called Luca.
SELECT EXPAND( BOTH( 'Friend' ) ) FROM Person WHERE name = 'Luca'

Best NoSQL for filtering on multiple indexes/fields

Because of the size of the data that needs to be queried and ability to scale as needed on multiple nodes, I am considering using some type of NoSQL db.
I have been researching numerous NoSQL offerings but can't yet decide on what would be the best option which would provide best performance, scalability and features for our data structure.
Data structure model is of a product catalog where each document/set contains certain properties and descriptions for the that individual product. Properties would vary from product to product which is why schema-less offering would work the best.
Sample structure would be like
[
{"name": "item name",
"cost": 563.34,
"category": "computer",
"manufacturer: "sony",
.
.
.
}
]
So requirement is that I need to be able to filter/query on many different data set fields/indexes in the record set, where I could filter on and exclude multiple indexes/fields in the same query. Queries will be mostly reads and there would not be much of a need for any joins or relationship type of linking.
I have looked into: Elastic Search, mongodb, OrientDB, Couchbase and Aerospike.
Elastic Search seems like an obvious choice, but I was wondering on the performance and it's stability?
Aerospike seems like it would be really fast since it does it all mostly in memory but it's filtering and searching capability didn't seem that capable
What do you think best option would be for my use case? or if there any other recommended DBs that I should look into.
I know that best way is to test the performance with the actual real life use case, but I am hoping to first narrow it down little bit.
Thank you
This is a variant on the popular question "what is the best product" :)
As always: this depends on your specific use case and goals. Database products (like all products) are always the result of trade-offs. So there does NOT exist a single product offering best performance, scalability and features. However there are many very good products for your use case.
Because your question is about Product Data and I am working with Product Data for more than 15 years, it will try to answer your question.
A document model is a perfect fit for Product Data. So for all use cases other than simple look up I would recommend a Document Store
If your use case concerns a single application and you are using the Java platform. I would recommend to use an embedded database. This makes things simpler and has a big performance advantage
If you need faceted search or other advance product search, i recommend you to use SOLR or Elastic Search
If you need a distributed system I recommend Elastic Search over SOLR
If you need Product recommendations based on reviews or other graph oriented algorithms, I recommend to use OrientDB or ArangoDB (or Neo4J, but in this case this would be my second choice)
Products we are using in Production or evaluated in depth for the use case you describe are
SOLR and ES. Both extremely well engineered products. Both (also ES) mature and stable products
Neo4J. Most mature graph database. Big advantage IMO is the awesome query language they use. Integrated Lucene engine. Very mature and well engineered product. Disadvantage is the fact that it is not a Document Graph but Property (key-value) Graph. Also it can be expensive
MongoDB. Our first experience with Document store. Very good product. Big advantage: excellent documentation, (by far) most popular NoSQL database
OrientDB and ArangoDB. Both support the Graph/Document paradigm. This are less known products, but very powerful. Because we are a Java based shop, our preference goes to OrientDB. OrientDB has a Lucene engine integrated (although the implementation is quite simple). ArangoDB on the other hand has very good documentation and a very smart and efficient storage format and finally the AQL is also very nice!
Performance: (tested with 11.43 mio Articles and 2.3 mio products). All products are very fast, especially SOLR and ES in this use case. Embedded OrientDB is also mind blowing fast for import and simple queries. For faceted search only the Search Servers provide real fast performance!
Bottom line: I would go for a Graph/Document store and/or Search Server (SOLR or ES). Because you mentioned "filtering" (I assume faceted search). The Search Server is the obvious first choice
OrientDB supports composite indexes on multiple fields. Example:
CREATE INDEX Product_idx ON Product (name, category, manufacturer) unique
SELECT FROM INDEX:Product_idx WHERE key = ["Donald Knuth", "computer"]
You could also create a FULL-TEXT index by using all the power of Lucene as engine.
Aerospike is a key-value store, not an document database. A document database would do such field-level indexing and deeper searching into a nested object better. The secondary indexes in Aerospike currently (version 3.4.x) work on string and integer 'bins' (a concept similar to a document's field or a SQL table's column).
That said, the list and map complex types of Aerospike are being augmented with those capabilities, in work being done in this quarter. Keep an eye out for those changes in the upcoming releases. You'll be able to index and query on bins of type list and map.

neo4j vs mongodb for spatial search

I'm getting ready to start a project where I will be building a recommendation engine for restaurants. I have been waffling between neo4j (graph db) and mongodb (document db). my nodes/documents will be things like restaurant and person. i know i will want some edges, something like person->likes->restaurant, or person->ate_at->restaurant. my main query, however, will be to find restaurants within X miles of location Y.
if i have 20 restaurant's within X miles of Y, but not connected by any edges, how will neo4j be able to handle the spatial query? i know with mongodb i can index on lat/long and query all restaurant types. does neo4j offer the same functionality in a disconnected graph?
when it comes to answering questions like, 'which restaurants do my friends eat at most often?', is neo4j (graph db) the way to go? or will mongodb (document db) provide me similar functionality?
Neo4j Spatial introduces a Spatial RTree (or other means) index that is part of the graph itself. That means, even disconnected domain entities will be found via the spatial search, if you index them (that is relationships will connect the Spatial index to the Restaurants). Also, this is flexible enough that you can combine the Raw BBox search in the RTree with other things like check on the restaurants categories in the same go, since you can hop out and in the different parts of the graph.
This way, neo4j Spatial is supporting the full range of search capabilities that you would expect form a full Topology, like combined searches and searches on polygons with holes etc.
Be aware that Neo4j Spatial is in 0.7, so be gentle and ask on http://groups.google.com/group/neo4j/about :)
I'm not that familiar with Neo4J Spatial but it would seem that MongoDB is at the very least a good fit since it's the database Foursquare uses with exactly the purpose you describe. MongoDB geo indexing is extremely fast and scales up nicely.
Another possible solution is to use CouchBase. It uses a document model as well - though you need to be much more comfortable with MapReduce for queries. It has better spatial capabilities right now thank MongoDB but that may change over time.
Suggestion aside, I agree that of the two choices you have given Mongo will suit your needs fine and probably more appropriate for your spatial queries.
Neo4j geospatial doesn't scale up that good. I created a geospatial layer in neo4j and added nodes to this layer. Beyond 10,000 nodes the addition of nodes to the layer becomes very slow even when using neo4j2.0
On the other hand, mongodb geo-location works comparatively much faster and is more scalable.

Extract to MongoDB for analysis

I have a relational database with about 300M customers and their attributes from several perspectives (360).
To perform some analytics I intent to make an extract to a MongoDB in order to have a 'flat' representation that is more suited to apply data mining techniques.
Would that make sense? Why?
Thanks!
No.
Its not storage that would be the concern here, its your flattening strategy.
How and where you store the flattened data is a secondary concern, note MongoDB is a document database and not inherently flat anyway.
Once you have your data in the shape that is suitable for your analytics, then, look at storage strategies, MongoDB might be suitable or you might find that something that allows easy Map Reduce type functionality would be better for analysis... (HBase for example)
It may make sense. One thing you can do is setup MongoDB in a horizontal scale-out setup. Then with the right data structures, you can run queries in parallel across the shards (which it can do for you automatically):
http://www.mongodb.org/display/DOCS/Sharding
This could make real-time analysis possible when it otherwise wouldn't have been.
If you choose your data models right, you can speed up your queries by avoiding any sorts of joins (again good across horizontal scale).
Finally, there is plenty you can do with map/reduce on your data too.
http://www.mongodb.org/display/DOCS/MapReduce
One caveat to be aware of is there is nothing like SQL Reporting Services for MongoDB AFAIK.
I find MongoDB's mapreduce to be slow (however they are working on improving it, see here: http://www.dbms2.com/2011/04/04/the-mongodb-story/ ).
Maybe you can use Infobright's community edition for analytics? See here: http://www.infobright.com/Community/
A relational db like Postgresql can do analytics too (afaik MySQL can't do a hash join but other relational db's can).