neo4j vs mongodb for spatial search - mongodb

I'm getting ready to start a project where I will be building a recommendation engine for restaurants. I have been waffling between neo4j (graph db) and mongodb (document db). my nodes/documents will be things like restaurant and person. i know i will want some edges, something like person->likes->restaurant, or person->ate_at->restaurant. my main query, however, will be to find restaurants within X miles of location Y.
if i have 20 restaurant's within X miles of Y, but not connected by any edges, how will neo4j be able to handle the spatial query? i know with mongodb i can index on lat/long and query all restaurant types. does neo4j offer the same functionality in a disconnected graph?
when it comes to answering questions like, 'which restaurants do my friends eat at most often?', is neo4j (graph db) the way to go? or will mongodb (document db) provide me similar functionality?

Neo4j Spatial introduces a Spatial RTree (or other means) index that is part of the graph itself. That means, even disconnected domain entities will be found via the spatial search, if you index them (that is relationships will connect the Spatial index to the Restaurants). Also, this is flexible enough that you can combine the Raw BBox search in the RTree with other things like check on the restaurants categories in the same go, since you can hop out and in the different parts of the graph.
This way, neo4j Spatial is supporting the full range of search capabilities that you would expect form a full Topology, like combined searches and searches on polygons with holes etc.
Be aware that Neo4j Spatial is in 0.7, so be gentle and ask on http://groups.google.com/group/neo4j/about :)

I'm not that familiar with Neo4J Spatial but it would seem that MongoDB is at the very least a good fit since it's the database Foursquare uses with exactly the purpose you describe. MongoDB geo indexing is extremely fast and scales up nicely.

Another possible solution is to use CouchBase. It uses a document model as well - though you need to be much more comfortable with MapReduce for queries. It has better spatial capabilities right now thank MongoDB but that may change over time.
Suggestion aside, I agree that of the two choices you have given Mongo will suit your needs fine and probably more appropriate for your spatial queries.

Neo4j geospatial doesn't scale up that good. I created a geospatial layer in neo4j and added nodes to this layer. Beyond 10,000 nodes the addition of nodes to the layer becomes very slow even when using neo4j2.0
On the other hand, mongodb geo-location works comparatively much faster and is more scalable.

Related

Multimodel database vs multiple individual databases?

I am working on application which requires features offered by both graph database(to store raw data) and document database(extracted reports from raw data). I planned to use neo4j and mongodb. I am having second thoughts about and looking at orientDB. is it better to have a single multimodel database than two separate databases? The reason I leaned towards neo4j is its native graph storage which might come in handy for memory locality for large graphs. OrientDB doesn't store graph natively. or does it?
OrientDB stores graph natively. Its engine is 100% a Graph Database like Neo4j. Actually OrientDB and Neo4j are the only Graph Databases with index-free adjacency. Some other Graph Database acts as a layer on top of an existent model (RDBMS, Column or Document stores).
So there is nothing you can do with Neo4j that you can't do with OrientDB. But OrientDB allows to model more complex data, like Document DBMS (MongoDB) can do. For example each vertices and edges in OrientDB is a document (json), so you can store in the vertex and edge complex types like embedded properties, list, sets, date, decimal, etc.
Don't be dazzled by terminology. "Index-free adjacency" is a term that simply means graph vertices are stored "with" their edges. Each database does this in a slightly different way. Neo4J stores them on disk in a linked list. If you have them in memory, and there's not too many of them, they're fast. If you have to hit them on disk, then you may need an index. Titan stores them as columns in a wide-column database such as Cassandra. If they're in memory, they're fast. If you have to hit them on disk, the underlying database's range queries make them fast to load in bulk, and extra indexing can decrease the cost of searching large edge lists.
This discussion is fairly valuable: How does Titan achieve constant time lookup using HBase / Cassandra?
Whether you're using OrientDB or any other database, your efficiency at graph queries will rely in large part on the indexing you put in place so that you start your graph queries on, and traverse through, a relatively small set of nodes. Be sure to model some of the queries you're doing to make sure that whatever database you choose will support the right indexes, whether they're across the whole graph, or local to each vertex.

Best NoSQL for filtering on multiple indexes/fields

Because of the size of the data that needs to be queried and ability to scale as needed on multiple nodes, I am considering using some type of NoSQL db.
I have been researching numerous NoSQL offerings but can't yet decide on what would be the best option which would provide best performance, scalability and features for our data structure.
Data structure model is of a product catalog where each document/set contains certain properties and descriptions for the that individual product. Properties would vary from product to product which is why schema-less offering would work the best.
Sample structure would be like
[
{"name": "item name",
"cost": 563.34,
"category": "computer",
"manufacturer: "sony",
.
.
.
}
]
So requirement is that I need to be able to filter/query on many different data set fields/indexes in the record set, where I could filter on and exclude multiple indexes/fields in the same query. Queries will be mostly reads and there would not be much of a need for any joins or relationship type of linking.
I have looked into: Elastic Search, mongodb, OrientDB, Couchbase and Aerospike.
Elastic Search seems like an obvious choice, but I was wondering on the performance and it's stability?
Aerospike seems like it would be really fast since it does it all mostly in memory but it's filtering and searching capability didn't seem that capable
What do you think best option would be for my use case? or if there any other recommended DBs that I should look into.
I know that best way is to test the performance with the actual real life use case, but I am hoping to first narrow it down little bit.
Thank you
This is a variant on the popular question "what is the best product" :)
As always: this depends on your specific use case and goals. Database products (like all products) are always the result of trade-offs. So there does NOT exist a single product offering best performance, scalability and features. However there are many very good products for your use case.
Because your question is about Product Data and I am working with Product Data for more than 15 years, it will try to answer your question.
A document model is a perfect fit for Product Data. So for all use cases other than simple look up I would recommend a Document Store
If your use case concerns a single application and you are using the Java platform. I would recommend to use an embedded database. This makes things simpler and has a big performance advantage
If you need faceted search or other advance product search, i recommend you to use SOLR or Elastic Search
If you need a distributed system I recommend Elastic Search over SOLR
If you need Product recommendations based on reviews or other graph oriented algorithms, I recommend to use OrientDB or ArangoDB (or Neo4J, but in this case this would be my second choice)
Products we are using in Production or evaluated in depth for the use case you describe are
SOLR and ES. Both extremely well engineered products. Both (also ES) mature and stable products
Neo4J. Most mature graph database. Big advantage IMO is the awesome query language they use. Integrated Lucene engine. Very mature and well engineered product. Disadvantage is the fact that it is not a Document Graph but Property (key-value) Graph. Also it can be expensive
MongoDB. Our first experience with Document store. Very good product. Big advantage: excellent documentation, (by far) most popular NoSQL database
OrientDB and ArangoDB. Both support the Graph/Document paradigm. This are less known products, but very powerful. Because we are a Java based shop, our preference goes to OrientDB. OrientDB has a Lucene engine integrated (although the implementation is quite simple). ArangoDB on the other hand has very good documentation and a very smart and efficient storage format and finally the AQL is also very nice!
Performance: (tested with 11.43 mio Articles and 2.3 mio products). All products are very fast, especially SOLR and ES in this use case. Embedded OrientDB is also mind blowing fast for import and simple queries. For faceted search only the Search Servers provide real fast performance!
Bottom line: I would go for a Graph/Document store and/or Search Server (SOLR or ES). Because you mentioned "filtering" (I assume faceted search). The Search Server is the obvious first choice
OrientDB supports composite indexes on multiple fields. Example:
CREATE INDEX Product_idx ON Product (name, category, manufacturer) unique
SELECT FROM INDEX:Product_idx WHERE key = ["Donald Knuth", "computer"]
You could also create a FULL-TEXT index by using all the power of Lucene as engine.
Aerospike is a key-value store, not an document database. A document database would do such field-level indexing and deeper searching into a nested object better. The secondary indexes in Aerospike currently (version 3.4.x) work on string and integer 'bins' (a concept similar to a document's field or a SQL table's column).
That said, the list and map complex types of Aerospike are being augmented with those capabilities, in work being done in this quarter. Keep an eye out for those changes in the upcoming releases. You'll be able to index and query on bins of type list and map.

Relational or Mondo Db in a network of location application?

I'm trying to start a project where a user can add a location and each location can have a relation to each other. It will be a web application and I'm planning to use javaee7 + jpa + wildfly.
Basically a location will have at least a name, text, longitude and latitude fields. Some other entities will be linked to this location. There will also be query like, search he nearest x location to y.
I've encountered mongodb several times before but, I'm always on the other part of the system, so I'm not really sure if there's a big benefit in using it in this kind of system. The application is not a store so there's no checkout process and therefore no transaction.
If ever I decided to use mongodb I'm looking at:
http://hibernate.org/ogm/ and
mongo db geospatial queries
I would like to ask for opinions. Thanks.
From your description, I think MongoDB is a pretty good choice:
Your data model is fairly simple. You might even think of embedding the other entities in the location documents, reducing the number of queries needed to retrieve these entities which are near to a specific point to exactly 1 by doing
db.collection.ensureIndex({fieldHoldingGeoJSON:"2dsphere"})
db.collection.find({
$near:{
$geometry:{
type:"Point",
coordinates: [latitude, longitude]
},
$maxDistance: distanceInMeters
}
})
Keep in mind though that the maximum size of a BSON document is limited to 16MB. So if there are a lot of related entities, it might be worth to actually embed the locations into the entities eliminating the need of a dedicated locations collection, though those still can be created on demand using the aggregation framework. With embedding the locations into the entities, you can scale basically indefinitely and you have all properties of said entities stored where they belong to (very NoSQLish).
Although the second approach imposes redundancy, we are only talking a a few bytes, and even with 100M records, that would be less than 1Gb, with disk space being cheap.
Also keep in mind that MongoDB comes with powerful features such as ( relatively ) easy scaling and an awesome aggregation framework, to name a few.
MongoDB supports queries where you can ask for "nearest" within a radius -- very useful, if thats what you are looking for.
Actually, PostgeSQL has the extension PostGIS that implements spacial and geographical objects that can be queried using SQL. An example extracted from their site is the following:
SELECT superhero.name
FROM city, superhero
WHERE ST_Contains(city.geom, superhero.geom)
AND city.name = 'Gotham';

Storing a graph in mongodb

I have an undirected graph where each node contains an array. Data can be added/deleted from the array. What's the best way to store this in Mongodb and be able to do this query effectively: given node A, select all the data contained in the adjacent nodes of A.
In relational DB, you can create a table representing the edges and another table for storing the data in each node this so.
table 1
NodeA, NodeB
NodeA, NodeC
table 2
NodeA, item1
NodeA, item2
NodeB, item3
And then you join the tables when you query for the data in adjacent nodes. But join is not possible in MongoDB, so what's the best way to setup this database and efficiently query for data in adjacent nodes (favoring performance slightly over space).
Specialized Distributed Graph Databases
I know this is sounds a little far afield from the OPs question about Mongo, but these days there are more specialized graph databases that excel at this kind of work and may be much easier for you to use, especially on large graphs.
There is a comparison of 7 such offerings here: https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE#gid=0
Of the three most significant open source offerings (Titan, OrientDB, and Neo4J), all of them support the Tinkerpop Blueprints interface. So for a graph that looks like this...
... a query for "all the people that Juno greatly admires who she has known since the year 2011" would look like this:
Iterable<Vertex> results = juno.query().labels("knows").has("since",2011).has("stars",5).vertices()
This, of course, is just the tip of the iceberg. Pretty powerful stuff!
If you have to stay with Mongo
Think of Tinkerpop Blueprints as the "JDBC of storing graph structures" in various databases. The Tinkerpop Blueprints API has a specific MongoDB implementation that would work for you I'm sure. Then using Tinkerpop Gremlin, you have all sorts of advanced traversal and search methods at your disposal.
I'm picking up mongo, looking into this sort of schema as well (undirected graphs, querying for information from neighbors) I think the way that I favor so far looks something like this:
Each node contains an array of neighbor keys, like so.
{
nodeIndex: 4
myData: "data"
neighbors: [8,15,16,23,42]
}
To find data from neighbors, use the $in "operator":
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}});
You can use field selection to limit results to the relevant data.
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}}, {myData:1});
See http://www.mongodb.org/display/DOCS/Trees+in+MongoDB for inspiration.
MongoDB will introduce native graph capabilities in version 3.4 and it could be used to store graph stuctures and do analytics on them although performance might not be that good compared to native graph databases like Neo4j depending on the cases but it is too early to judge.
Check those links for more information:
$graphLookup (aggregation)
MongoDB 3.4 Accelerates Digital Transformation for the Modern Enterprise
MongoDB can simulate a graph using a flexible tree hierarchy. You may want to consider neo4j for strict graphing needs.

Cassandra Or MongoDB For Our Location Based Application

We are looking at using a NoSQL database system for a large project. Currently, we have read a bit about MongoDB and Cassandra, though we have absolutely no experience with either. We are very proficient with traditional relational databases like MySQL and Microsoft SQL, but the NoSQL (key/value store) is a new paradigm for us.
So basically, which NoSQL database do you guys recommend for our use?
We do both heavy writes and reads. Basically we have tens of thousands of devices that are reporting:
device_id (int), latitude (decimal), longitude (decimal), date/time (datetime), heading char(2), speed (int)
Every minute. So, at peak times we need to be able to process hundreds of writes a second.
Then, we also have users, that are querying this information in the form of, give me all messages from device_id 1234 for the last day, or last week. Also, users do other querying like, give me all messages from device_1234 where speed is greater than 50 and date is today.
So, our initial thoughts are that MongoDB or Cassandra are going to allow us to scale this much easier then using a traditional database.
A document or value in MongoDB or Cassandra for us, might look like:
{
device_id: 1234,
location: [-118.12719739973545, 33.859012351859946],
datetime: 1282274060,
heading: "N",
speed: 34
}
Which system do you guys recommend? Thanks greatly.
MongoDB has built-in support for geospatial indexes: http://www.mongodb.org/display/DOCS/Geospatial+Indexing
As an example to find the 10 closest devices to that location you can just do
db.devices.find({location: {$near: [-118.12719739973545, 33.859012351859946]}}).limit(10)
I have post on a location based app using MongoDB, just like the one you described. MongoDB, with it's strong query and index support, might make it a better choice for you. Just like Cassandra, MongoDB has partitioning and replication, for scaling read and writes. Their underlying architecture is very different.
Although you have not mentioned any location based queries, if you are interested in queries like "give me all the devices within the radius r of location l and between time t1 and t2", you will find MongoDB's geospatial query and indexing extremely useful.
I have done some work with mongodb and geospatial data, but not on the scale mentioned above. The geospatial searches are very fast, much more so than mysql.
I suggest looking into mongodb's sharding, replication, and clustering functionality to deal with the volume of writes. Sharding across device identifier may be a good way to deal with the write volume. If you're interested in proximity of events then sharding across lat/lng may be more appropriate.
jack
Go with mongodb for geo-location search. Release 2.4 improves on core geo features. Lot's of big sites use it for geolocation search.
You might consider using ElasticSearch. ES keeps the JSON of the original document stored, along with all the indexed fields. JSON can be instantiated into any modern languages variables/arguments. In Java, one could even disable that, and store native Java persistence data in a field. After search retrieval, just loop through and instantiate a collection of the original object types.
Using Elastics Search gives you Trie Indexes for high speed numberic range indexes, obviously you get full text searches of every flavor, and geographic bounding box queries, all in AND or OR filtering. Date searches are also native (although Java's handing of dates sucks so I switched to BIG INT representations of timestamps to represent dates)
UNLIKE some past and maybe present NoSQL solutions, the geographic indexing and querying is PART of any query and no extra steps are required. I.E., one MongoDB solution in the recent past required a geospatial search to collect conforming document IDs, then you used those IDs inside another query and searched within those for your other criteria. In reality, that's what happens in all solutions anyways, but it's much faster and cached in ElasticSearch.