Is there a graph-document database that supports geospatial queries?

Here's the rundown of what I need:
A graph database
Each node is a document; there will be hundreds of types of nodes; each of these several hundred types will have its own consistent schema.
Can scale to billions of nodes
Each node also has a (lat,lng) coordinate in addition to the edges between nodes
I want to use (lat,lng) as a shard key so this can be scaled to a large sharded, replicated cluster. ~95% of edge traversals will occur between nearby (lat,lng) locations.
I want to be able to issue geo+document queries. For example "Show me all the graph nodes/documents matching this query { ... } ordered by distance from (lat_0, lng_0)"
I want something that's well-documented, has an active developer community, is recommended for production use, and likely to be around for years.
Here are problems with existing databases:
MongoDB: no graph support, no joins
Neo4j: no sharding
OrientDB: no geospatial indexing
ArangoDB: can do WITHIN queries but cannot have additional query clauses (e.g. MongoDB's geoNear has a query parameter)
Is there anything that fits my use case?
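Edit: for concreteness, here's roughly the shape of query I want, sketched in Python against MongoDB's $geoNear aggregation stage (which, as noted above, takes a document filter via its query parameter). Collection and field names are made up; Mongo can do this part, it just lacks the graph side.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    nodes = client.graphdb.nodes  # assumes a 2dsphere index on the "loc" field

    lat_0, lng_0 = 40.7128, -74.0060
    cursor = nodes.aggregate([
        {"$geoNear": {                        # must be the first pipeline stage
            "near": {"type": "Point", "coordinates": [lng_0, lat_0]},
            "distanceField": "dist",          # computed distance lands here
            "query": {"kind": "poi", "status": "active"},  # the document filter
            "spherical": True,
        }},
        {"$limit": 100},                      # nearest 100 matches, by distance
    ])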

Would you like a unicorn and a machine that prints an unlimited number of $100 bills to go along with that? Har har har....
OK but seriously, you've got a tall order there. You're going to need a custom system that blends a few of those things together. For one, as you observe, there's really no such thing as a "graph/document" database.
As a general area of systems research, many people are looking into hybrid systems. An example would be that you maintain your graph structure in neo4j, and that the IDs of nodes in neo4j point to identifiers for documents in MongoDB. In this way, you'd have a graph/document database, but it would really be two databases. Such hybrid systems are rife with tradeoffs. For one, writing a query across both systems will be extremely difficult. For two, you'll introduce data dependencies across them, such that it might not be easy to update your graph structure without changing your documents, or vice versa.
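To sketch that hybrid concretely (a rough Python illustration only; the URIs, node labels, and field names are all invented):

    from neo4j import GraphDatabase
    from pymongo import MongoClient

    graph = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
    docs = MongoClient("mongodb://localhost:27017").appdb.documents

    with graph.session() as session:
        # Step 1: traverse the graph in Neo4j, collecting MongoDB document ids
        record = session.run(
            "MATCH (a:Node {doc_id: $id})-[:LINKS_TO*1..2]->(b:Node) "
            "RETURN collect(DISTINCT b.doc_id) AS ids",
            id="doc-123",  # hypothetical starting document
        ).single()

    # Step 2: a second round-trip fetches the documents from MongoDB
    neighbour_docs = list(docs.find({"_id": {"$in": record["ids"]}}))

Notice the two round-trips and the duplicated identifiers; that's exactly where the tradeoffs bite.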
For really intense performance requirements, hybrid systems are sometimes the only way to go. But just as a rule of thumb, for every 100 times you see someone say they need such a solution, probably 80 times they're better off with picking just one database and then living with the pros and the cons that it provides to them. Technology is ultimately about choices, pros, and cons, and learning to live with what you've picked. :)
To give you a succinct answer to the question you've asked, no there's nothing that does all of that. I'd recommend you work with an architect or consultant who can explore your requirements in depth, and make a recommendation on what architecture best suits most of your needs, balancing simplicity and cost. That's as much an art as a science.

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: in this specific use case, which option would perform better, multiple collections or a single collection plus sharding?
Any help would be appreciated.
As you mentioned, you need to experiment with your data and use case; then you will have a better picture.
Some decisions are required first:
Estimate the number of documents you will have in the near future. If you expect 1M documents within a year, then test with at least 3M.
Decide how many indexes you will need.
Estimate the number of writes and reads per second.
Estimate the document size per category.
Understand your query patterns.
Some inputs based on those requirements:
If you have heavy writes combined with many indexes, a single monolithic collection will be slower, since every write must update multiple indexes.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections (see the sketch after this list). But do check its performance; it depends entirely on the decisions above. Note this open issue also.
If you decide to go with the monolithic collection, defer the sharding. Implement it once you find that queries are getting slower.
If you have many writes to the same document, they will be executed sequentially, which will slow down your reads as well.
Think about reclaiming disk space when a lot of data is cleared from the collections. Multiple collections do well here.
The point that pushes me toward a monolithic collection is your need to query all the categories at the same time. You may need to add more categories, and combining all of them into a single response from separate collections would not perform well.
As you don't really have a join use case like in an RDBMS (I doubt you even have a join key), you can go with a single monolithic collection from a modelling point of view.
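For example, here is a minimal Python sketch of $unionWith (MongoDB 4.4+) searching across hypothetical per-category collections:

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").shop
    match = {"$match": {"name": {"$regex": "sony", "$options": "i"}}}

    # Search one category's collection, then union in the others
    all_hits = db.products_computers.aggregate([
        match,
        {"$unionWith": {"coll": "products_phones", "pipeline": [match]}},
        {"$unionWith": {"coll": "products_cameras", "pipeline": [match]}},
    ])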
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing so. I quite like NoSQL, but some data is definitely a better fit for that model than other data.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from the article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only OK if updates are rare or documents are few; it breaks down when you have many of both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
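A minimal sketch of the Mongo code-side variant (collection and field names are hypothetical):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").shop

    # Call 1: load the product, which stores a reference to the shared data
    product = db.products.find_one({"sku": "ABC-123"})
    # Call 2: resolve that reference application-side
    product["common"] = db.common_fields.find_one({"_id": product["common_id"]})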
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

MongoDB : where is the limit between "few" and "many"?

I am coming from the relational database world (Rails / PostgreSQL) and transitioning to the NoSQL world (Meteor / MongoDB), so I am learning about denormalization, embedding and true links.
It seems that, in many cases, choosing between various database schemas comes down to the number of documents that will be "related" to each other.
In this video series, the author distinguishes:
one-to-many relationships from one-to-few relationships
many-to-many relationships from few-to-few relationships
So, I am wondering: where is the limit between few and many?
I guess there may not be a hard number, but are we in the dozens, the hundreds, the thousands or the millions?
It's all relative and is really kind of a dangerous question to make assumptions about when you're designing an architecture. It's worth investing time to make the right choices for your schema and your setup. I would advise a few steps:
Do the math. Multiply your relationships out based on what you expect your application to need to do. If you have a few nested arrays or embedded documents, a couple of "one-to-few" relations can expand out to many documents pretty easily once you start $unwinding them (see the sketch after these steps).
Write a prototype. Do some basic testing on your expected hardware/environment to see if it can easily handle that load when you do queries for all the data.
Based on your testing, create the limitations. This is where you need to draw the line on how many relations you can create per document, for each relationship type, before the system breaks down.
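To illustrate the first step, here's a small sketch (made-up fields) of how two modest one-to-few arrays multiply under $unwind:

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").test

    # One document with two "one-to-few" arrays of 10 elements each...
    db.items.insert_one({
        "name": "widget",
        "tags": [f"tag{i}" for i in range(10)],
        "reviews": [{"stars": i % 5} for i in range(10)],
    })

    # ...expands to the cross product once both arrays are unwound
    docs = list(db.items.aggregate([
        {"$unwind": "$tags"},
        {"$unwind": "$reviews"},
    ]))
    print(len(docs))  # 10 x 10 = 100 documents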
If it were me, I would say one-to-few is less than a dozen, and one-to-many is theoretically unlimited, but practically in the millions. Maybe there should be a middle ground of "one-to-some" to indicate possibly hundreds.
Taken from 6 rules of thumb for MongoDB schema design:
one-to-few: two up to a few hundred
one-to-many: a couple of hundred up to a few thousand
one-to-squillions: thousands and up
I totally agree with #womp about the need to choose the right scheme for your use case. The article I posted above has some good guidelines and examples of which schema design to use.

Best NoSQL for filtering on multiple indexes/fields

Because of the size of the data that needs to be queried and ability to scale as needed on multiple nodes, I am considering using some type of NoSQL db.
I have been researching numerous NoSQL offerings but can't yet decide on what would be the best option which would provide best performance, scalability and features for our data structure.
The data model is a product catalog where each document/set contains certain properties and descriptions for that individual product. Properties vary from product to product, which is why a schema-less offering would work best.
Sample structure would be like
[
    {
        "name": "item name",
        "cost": 563.34,
        "category": "computer",
        "manufacturer": "sony",
        ...
    }
]
So the requirement is that I need to be able to filter/query on many different fields/indexes in the record set, where I can filter on and exclude multiple fields in the same query. Queries will be mostly reads, and there would not be much need for joins or relationship-type linking.
I have looked into: Elasticsearch, MongoDB, OrientDB, Couchbase and Aerospike.
Elasticsearch seems like the obvious choice, but I was wondering about its performance and stability.
Aerospike seems like it would be really fast since it does it all mostly in memory, but its filtering and searching capabilities didn't seem that strong.
What do you think the best option would be for my use case? Or are there other recommended DBs that I should look into?
I know the best way is to test the performance with the actual real-life use case, but I am hoping to first narrow it down a little bit.
Thank you
This is a variant on the popular question "what is the best product" :)
As always: this depends on your specific use case and goals. Database products (like all products) are always the result of trade-offs. So there does NOT exist a single product offering best performance, scalability and features. However there are many very good products for your use case.
Because your question is about Product Data, and I have been working with Product Data for more than 15 years, I will try to answer it.
A document model is a perfect fit for Product Data, so for all use cases other than simple lookups I would recommend a Document Store.
If your use case concerns a single application and you are using the Java platform, I would recommend an embedded database. This makes things simpler and has a big performance advantage.
If you need faceted search or other advanced product search, I recommend SOLR or Elasticsearch.
If you need a distributed system, I recommend Elasticsearch over SOLR.
If you need product recommendations based on reviews or other graph-oriented algorithms, I recommend OrientDB or ArangoDB (or Neo4j, but in this case that would be my second choice).
Products we are using in production, or have evaluated in depth, for the use case you describe:
SOLR and ES. Both extremely well-engineered products. Both (ES included) are mature and stable.
Neo4j. The most mature graph database. A big advantage IMO is its awesome query language, plus the integrated Lucene engine. Very mature and well-engineered. The disadvantage is that it is not a document graph but a property (key-value) graph. It can also be expensive.
MongoDB. Our first experience with a document store. Very good product. Big advantages: excellent documentation and (by far) the most popular NoSQL database.
OrientDB and ArangoDB. Both support the graph/document paradigm. These are less-known products, but very powerful. Because we are a Java-based shop, our preference goes to OrientDB. OrientDB has a Lucene engine integrated (although the implementation is quite simple). ArangoDB, on the other hand, has very good documentation and a very smart and efficient storage format, and finally AQL is also very nice!
Performance: (tested with 11.43 million articles and 2.3 million products) all of these products are very fast, especially SOLR and ES in this use case. Embedded OrientDB is also mind-blowingly fast for imports and simple queries. For faceted search, only the search servers provide really fast performance!
Bottom line: I would go for a graph/document store and/or a search server (SOLR or ES). Since you mentioned "filtering" (I assume faceted search), the search server is the obvious first choice.
OrientDB supports composite indexes on multiple fields. Example:
CREATE INDEX Product_idx ON Product (name, category, manufacturer) UNIQUE
SELECT FROM INDEX:Product_idx WHERE key = ["Donald Knuth", "computer"]
You could also create a FULL-TEXT index, using all the power of Lucene as the engine.
Aerospike is a key-value store, not a document database. A document database would do such field-level indexing and deeper searching into a nested object better. The secondary indexes in Aerospike currently (version 3.4.x) work on string and integer 'bins' (a concept similar to a document's field or a SQL table's column).
That said, the list and map complex types of Aerospike are being augmented with those capabilities, in work being done in this quarter. Keep an eye out for those changes in the upcoming releases. You'll be able to index and query on bins of type list and map.

NoSQL vs. Relational Databases vs. Possible Hybrid

I'm hearing more about NoSQL, but have yet had someone give me a clear explanation of how it is to be used instead of relational databases.
I've read that it can't do left joins, so I was trying to figure out how you'd be able to use such a data store. From reading Preserve Joins by code in MongoDB, it seems the suggestion is to just make one large table, as if you had already done the joins on it.
If the above is true, then I can see how it can be used. However, I'm curious how you'd handle repeated data, since the concept of normalizing helps you remove redundancy and ensure consistency in the data (e.g. slight variations like capitalization, whitespace, etc.)...
Are we simply sacrificing the consistency of the data for scalable speed, or am I missing something?
Edit
I've been doing some more digging and found the answers the following questions useful for clarifying my understanding:
Why Google's BigTable referred as a NoSQL database?
How do you track record relations in NoSQL?
My understanding of consistency seems to be correct based on those answers. And it looks like NoSQL is supposed to be used for specific problem types, and that if you need relations, you should use a relational database.
But this raises more questions like:
It makes me wonder about real-life examples of when to use NoSQL versus when not to.
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules one can use to help denormalize the data for a NoSQL solution?
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
MongoDB has the ability to have documents which include arrays of other documents. This solves many cases where you would have relations in relational databases.
When an invoice has multiple positions, you wouldn't put these positions into a separate collection. You would embed them as an array.
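A minimal sketch of that embedding (invented field names):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").billing

    # The invoice positions live inside the invoice document itself
    db.invoices.insert_one({
        "number": "2016-0042",
        "customer": "ACME Corp",
        "positions": [
            {"description": "Consulting", "hours": 8, "rate": 120.0},
            {"description": "Travel", "hours": 2, "rate": 60.0},
        ],
    })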
It makes me wonder about real-life examples of when to use NoSQL versus when not to.
There are many different NoSQL databases, each one designed with different use-cases in mind. But you tagged this question as MongoDB, so I assume that you mean MongoDB in particular.
MongoDB has two main advantages over relational databases.
First, it scales well.
When the database is too slow or too big, you can easily add more servers by creating a sharded cluster of replica sets. This doesn't work nearly as well with most relational databases.
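For instance, a sketch of sharding a collection from Python (assumes a mongos in front of a running sharded cluster; the database, collection, and shard key are hypothetical):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")

    client.admin.command("enableSharding", "appdb")
    client.admin.command(
        "shardCollection", "appdb.events",
        key={"user_id": "hashed"},  # a hashed shard key spreads writes evenly
    )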
Second, it allows heterogeneous data.
Imagine, for example, the product database of a computer hardware store. What properties do products have? All products have a price and a vendor. But CPUs have a clock rate, hard drives and RAM chips have a capacity (and these capacities aren't comparable), monitors have a resolution and so on. How would you design this in a relational database? You would either create a very long productID-property-value table or you would create a very wide and sparse product table with every property you can imagine, but most of them being NULL for most products. Both solutions aren't really elegant. But MongoDB can solve this much better because it allows each document in a collection to have a different set of properties.
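A quick sketch of that heterogeneity (all fields invented):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").shop

    # Each document carries only the properties that apply to that product
    db.products.insert_many([
        {"vendor": "Intel", "price": 299, "clock_rate_ghz": 3.6},     # CPU
        {"vendor": "WD", "price": 89, "capacity_tb": 4},              # hard drive
        {"vendor": "Dell", "price": 199, "resolution": "2560x1440"},  # monitor
    ])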
What can't it do?
As a rather new technology, there isn't that much literature about it. The software ecosystem around it isn't as well developed either. The tools you can get for relational databases are often much more polished.
There are also some use-cases MongoDB isn't well-suited for.
MongoDB doesn't do JOINs. When your data is very relational and denormalizing it would be counter-productive, it might be a poor choice for your product. But you might want to take a look at graph databases like Neo4j, which focus even more on relations than relational databases. Update 2016: MongoDB 3.2 now has rudimentary JOIN support with the $lookup aggregation stage, but it's still very limited in functionality compared to relational and graph databases.
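A minimal sketch of that $lookup stage (hypothetical collections and fields):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").shop

    orders_with_products = db.orders.aggregate([
        {"$lookup": {
            "from": "products",          # collection to join against
            "localField": "product_id",  # field on the orders side
            "foreignField": "_id",       # field on the products side
            "as": "product",             # joined documents land in this array
        }},
    ])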
MongoDB doesn't do transactions. At least not complex transactions. Certain actions which only affect a single document are guaranteed to be atomic, but as soon as you affect more than one document, you can't guarantee that no other query will happen in-between and find an inconsistent state.
MongoDB is bad for ad-hoc reporting. Its options for data-mining are severely limited. The rather new aggregation framework helps, and MapReduce can also solve some surprisingly complex problems once you learn to use it smartly, but SQL usually has better tools for things like that.
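For a taste of the aggregation framework, a small sketch (invented collection and fields):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").shop

    # Revenue per category, highest first
    revenue_by_category = db.orders.aggregate([
        {"$group": {"_id": "$category", "revenue": {"$sum": "$total"}}},
        {"$sort": {"revenue": -1}},
    ])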
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules one can use to help denormalize the data for a NoSQL solution?
Relational databases are around for about 40 years. Their theory is a well-researched topic in computer science. There are whole libraries of books written about the theory behind them. There is a by-the-book solution for every imaginable corner-case by now.
But NoSQL databases, on the other hand, are a rather new technology. We are still figuring out the best practices. The most frequent advice is: "Use your own head. Think about what queries are performed most often, and optimize your data schema for them."
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
When possible I would advise against using two different database technologies in the same product:
Anyone who maintains and supports the product must be familiar with both technologies
Troubleshooting gets a lot harder
The sysadmins need to keep an additional database running and updated
You have an additional point of failure which can lead to downtime
I would only recommend mixing database technologies when fulfilling your requirements with a single one isn't just hard but outright impossible. Otherwise, make your pick and stay with it.

Which NoSQL solution for a dating search site?

I am highly interested in new NoSQL solutions to implement a search engine for a dating site. However, because there are so many possibilities, I am a little bit confused. My requirements:
1) 10 million people
2) More than 8 indexes (gender, online, city, name, etc...)
3) Scalability
Thanks
You wanna go for either MongoDB or CouchDB.
CouchDB scales a little better, while MongoDB's syntax is a little more familiar.
It also depends on what framework/language you use to create the dating site.
I personally would choose CouchDB (you should know JavaScript... a lot).
Apache Solr is a data store and fulltext search engine that might be useful to you. Solr is rarely mentioned as a NoSQL technology, but it shares many characteristics with document-oriented databases.
Keep in mind that you have to know what type of queries you're going to run before you can choose a NoSQL solution or design your database.
That's in contrast to a relational database, where you can design a general-purpose database based on the data relationships.
With that large a dataset, you would probably be well advised to treat search as separate from the data store. As someone suggested, SOLR will index your data for you to search, independently of your database. You have two problems: data store and search.
ElasticSearch (http://www.elasticsearch.org/overview/) can handle age difference, geographic location, tastes and dislikes, etc., or a leaderboard system that depends on many variables.
You'd want something that has sophisticated search and aggregation support.
Elasticsearch is a good candidate. In addition to its ability to perform fuzzy, proximity searches (which is something you'd likely want), you'd also want to integrate some machine learning pipeline to constantly improve your matching 'accuracy'.
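For illustration, a sketch of such a multi-filter profile search using the official Elasticsearch Python client (8.x-style API; the index and fields are invented):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    hits = es.search(index="profiles", query={
        "bool": {
            "filter": [                      # exact, cacheable filters
                {"term": {"gender": "f"}},
                {"term": {"online": True}},
                {"term": {"city": "amsterdam"}},
            ],
            "must": [                        # scored full-text part
                {"match": {"bio": "hiking"}},
            ],
        },
    })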