I'm evaluating MongoDB, coming from Membase/memcached, because I want more flexibility.
Of course, Membase is excellent at fast (multi-)key lookups.
I like the additional options that MongoDB gives me, but is it also fast at multi-key lookups? I've seen the $or and $in operators and I'm sure I can model it with those. I just want to know whether it's as performant as (in the same league as) Membase.
Use case: Lucene/Solr returns, e.g., 20 product-ids. Look up these product-ids in Couchdb to return the docs/appropriate fields.
Thanks,
Geert-Jan
For your use case, I'd say it is, in my experience: I hacked some analytics into a database of mine that ran a lot of $in queries with thousands of ids. To my surprise, it worked rather well, in the low-millisecond range.
Of course, it's hard to compare this, and, as usual, theory is a bad companion when it comes to performance. I guess the best way to figure it out is to migrate some test data and send some queries to the system.
Use MongoDB's excellent built-in profiler, use $explain, keep the one-index-per-query rule in mind, take a look at the logs, keep an eye on mongostat, and do some benchmarks. This shouldn't take too long and should give you a definite answer. If your queries turn out slow, people here and on the news group probably have some ideas on how to improve the exact query or the indexing.
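As a starting point for such a test, here is a minimal sketch of the multi-key lookup from the question. It is plain JavaScript, so it can be pasted into the mongo shell. The collection name products, the helper buildMultiKeyQuery, the projected fields and the sample ids are all assumptions made up for illustration.

```javascript
// Hypothetical helper: builds the filter for a multi-key lookup by _id.
function buildMultiKeyQuery(ids) {
  // $in matches any document whose _id equals one of the listed values.
  return { _id: { $in: ids } };
}

// Made-up ids, standing in for what a Lucene/Solr search might return.
const productIds = [101, 205, 309];
const multiKeyQuery = buildMultiKeyQuery(productIds);

// In the mongo shell, against an assumed "products" collection:
//   db.products.find(multiKeyQuery, { name: 1, price: 1 })  // only some fields
//   db.products.find(multiKeyQuery).explain()               // inspect the plan
```

Running the explain() variant against real test data is what tells you whether the lookup is served from the _id index.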
One index per query. It's sometimes thought that queries on multiple
keys can use multiple indexes; this is not the case with MongoDB. If
you have a query that selects on multiple keys, and you want that
query to use an index efficiently, then a compound-key index is
necessary.
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Oneindexperquery.
There's more information on that page as well with regard to Indexes.
The bottom line is that Mongo will be great if your indexes fit in memory and you are indexing the fields you want to query, using compound indexes where a query selects on several keys. If you have poor indexing, your performance will suffer as a result. This is pretty much in line with most systems.
Related
Does my query performance get affected if I use
db.collection.find({field:{$in:[true,false]}})
I mean, I know it's the same thing as
db.collection.find({})
But I may have some cases where it could be like the first query. So, is it going to affect my query performance?
Thanks in advance!
Definitely, the simple find will be faster, because:
It doesn't need to check any index.
It doesn't need to do any filtering.
Whereas for the other query:
db.collection.find({field:{$in:[true,false]}})
If you have an index on field, the performance will be better, but it depends on the total amount of data, cluster resources, schema, etc.
If you do not have an index, it will be slower than the simple find, as it has to do a COLLSCAN and filter every document.
Again, these are all theories. You should try it out with different amounts of data in your cluster and benchmark the results as recommended in the documentation.
When do you need db.collection.find({field:{$in:[true,false]}}) in your case?
It only makes sense if you have documents where field is neither true nor false (or is missing). Otherwise, you can use the simple find.
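To make that difference concrete, here is a small sketch in plain JavaScript that imitates the two queries on made-up documents. It is only a stand-in for the matching rules on scalar values (it ignores array-valued fields), not MongoDB itself.

```javascript
// Made-up sample documents; "field" is the name used in the question.
const sampleDocs = [
  { _id: 1, field: true },
  { _id: 2, field: false },
  { _id: 3, field: null },
  { _id: 4 },                // field missing entirely
  { _id: 5, field: "yes" }   // non-boolean value
];

// Stand-in for find({field: {$in: [true, false]}}):
// keep only documents whose field is exactly true or false.
const matchedByIn = sampleDocs.filter(
  d => d.field === true || d.field === false
);

// Stand-in for find({}): every document matches.
const matchedByEmptyFilter = sampleDocs.slice();

console.log(matchedByIn.map(d => d._id));   // [ 1, 2 ]
console.log(matchedByEmptyFilter.length);   // 5
```

So the two queries only return the same result when every document has a boolean value in field.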
I really like MongoDB, I use it at work and home, and not once yet have I hit a performance, complexity, or limitation issue with it. But I've been thinking about indexes a lot and I had a question I've not found an adequate answer to.
One of the big issues with SQL databases at scale is the relative complexity of queries. Specifically, MySQL uses b-trees for most of its indexes, so a lookup takes O(log(n)): better than linear, but it still means things take longer the more data you have.
A big attraction of NoSQL databases is the removal/mitigation of this scaling issue, often by relying instead on hash-style indexes, which have O(1) lookup time, so having more data doesn't slow down your app. This is where my question comes in:
According to the official MongoDB documentation, all indexes in Mongo use b-trees. Although Mongo does have a hashed index, as far as I can tell these are still stored in b-trees, and the same goes for the index on the _id field. I couldn't find anything about constant time anywhere in Mongo's documentation!
So my question is this: are, in fact, all indexes (including _id and hashed) in Mongo stored in b-trees? Does this mean querying for keys (even by _id) in fact takes O(log(n)) time?
Addendum: As a point of note, it'd be great if the Mongo documentation provided some complexity formulas with example queries. My favorite example of this is the Redis documentation.
Also: This is related. But I have the added specific questions regarding the hashed indexes and (more importantly) the _id index.
If you look at the indexing code in mongodb (here), you can easily see that it uses a b-tree. So the order of the algorithm is O(log n), but the base of the logarithm is not 2 but 8192, which is defined here in the code.
So for a million records we only have two levels (assuming the tree is balanced), and that is why it can find a record so fast. Overall, it's true that the order is logarithmic, but since the base of the logarithm is so large, it grows slowly.
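The arithmetic behind that claim can be checked with a few lines of JavaScript. This is only a sketch: it takes the figure of 8192 from the answer above as the branching factor and assumes a perfectly balanced tree. The function name btreeLevels is made up.

```javascript
// Levels needed to reach n keys when every node has `fanout` children:
// the smallest d with fanout^d >= n, i.e. ceil(log_fanout(n)).
function btreeLevels(n, fanout) {
  let levels = 0;
  let capacity = 1;
  while (capacity < n) {
    capacity *= fanout;
    levels += 1;
  }
  return levels;
}

const levelsFor1M = btreeLevels(1e6, 8192);  // 8192^2 is about 6.7e7 >= 1e6
const levelsFor1B = btreeLevels(1e9, 8192);  // 8192^3 is about 5.5e11 >= 1e9
const binaryLevels = btreeLevels(1e6, 2);    // base 2, for comparison

console.log(levelsFor1M, levelsFor1B, binaryLevels);  // 2 3 20
```

Going from a million to a billion keys adds a single level under these assumptions, which is why the logarithmic growth is hardly noticeable in practice.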
I am building a web-based system for my organization using MongoDB. I have gone through the documentation provided by MongoDB and came to the following conclusions:
find: Cannot pull data from a sub-array.
group: Cannot work in a sharded environment.
aggregate: Best for sub-arrays, but has performance issues when the data set is large.
Map Reduce: Too risky to write the map and reduce functions.
So, can someone help me out with the best approach to working with sub-array documents in a production environment with a sharded cluster?
Example:
{"testdata":{"studdet":[{"id":...,"name":"xxxx","marks":80}, ...]}}
Now, my "studdet" array is huge, with more than 1000 entries in each document.
So suppose my query is:
"Find all the "name" values from "studdet" where marks is greater than 80"
It's definitely going to be an aggregate query, because "find" cannot do this and "group" will not work in a sharded environment. So is it feasible to go with aggregate in this case, and what will the performance impact be? I need to run this query most of the time.
Please have a look at:
http://docs.mongodb.org/manual/core/data-modeling/
and
http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
These documents describe the decisions in creating a good document schema in MongoDB. That is one of the hardest things to do in MongoDB, and one of the most important. It will affect your performance etc.
In your case running a database that has a student collection with an array of grades looks to be the best bet.
{_id:, …., grades:[{type:"test", grade:80},….]}
In general, and given your sample data set, the aggregation framework is the best choice. The aggregation framework is faster than map reduce in most cases (certainly in execution speed: it is C++ vs. JavaScript for map reduce).
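For the query in the question ("names from studdet where marks is greater than 80"), a pipeline could look roughly like the sketch below. It uses the document shape from the question rather than the grades schema suggested above. The collection name students, the sample documents and the helper namesAbove are assumptions for illustration.

```javascript
// Pipeline as it could be passed to db.students.aggregate(studentPipeline)
// in the mongo shell ("students" is an assumed collection name).
const studentPipeline = [
  // Pre-filter: skip documents with no mark above 80 at all.
  { $match: { "testdata.studdet.marks": { $gt: 80 } } },
  // Emit one document per array element.
  { $unwind: "$testdata.studdet" },
  // Keep only the elements that satisfy the condition themselves.
  { $match: { "testdata.studdet.marks": { $gt: 80 } } },
  // Return just the name.
  { $project: { _id: 0, name: "$testdata.studdet.name" } }
];

// Plain-JavaScript equivalent, only to show what the pipeline is meant
// to return on made-up data; it does not talk to MongoDB.
function namesAbove(docs, threshold) {
  const names = [];
  docs.forEach(doc => {
    doc.testdata.studdet.forEach(s => {
      if (s.marks > threshold) names.push(s.name);
    });
  });
  return names;
}

const sampleStudents = [
  { testdata: { studdet: [{ name: "a", marks: 80 }, { name: "b", marks: 91 }] } },
  { testdata: { studdet: [{ name: "c", marks: 85 }] } }
];
console.log(namesAbove(sampleStudents, 80));  // [ 'b', 'c' ]
```

The first $match is there so that an index on testdata.studdet.marks, if you create one, can be used before the array is unwound. Whether that is fast enough for your data is something to benchmark.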
If your data's working set becomes so large that you have to shard, then aggregation, and everything else, will be slower. Not, however, slower than putting everything on a single machine that has a lot of page faults. Generally, you need a working set larger than the RAM available on a modern computer for sharding to be the correct way to go, so that you can keep everything in RAM. (At this point, a commercial support contract for Mongo is going to cost less than the hardware, and that includes extensive help with schema design.)
If you need anything else please don’t hesitate to ask.
Best,
Charlie
The question is a very simple one: can you have more than one index in a collection? I suppose you can, but every time I search for multiple indexes I get explanations of compound indexes, and that is not what I'm looking for.
All I want to do is make sure that I can have two simple separate indexes.
(I'm using PHP, I'll use php code formatting, but I understand
db.posts.ensureIndex({ my_id1: 1 }, {unique: true, background: true});
db.posts.ensureIndex({ my_id2: 1 }, {background: true});
I'll only search for one index at a time.
Compound indexes are not what I'm looking for because:
one index is unique and the other is not.
I think it's not going to be the fastest option. (open the link to understand the reason I think it's going to be slower. link)
I just want to make sure that the indexes will work.
You sure can have indexes defined the way you have it. From MongoDB documentation:
How many indexes? Indexes make retrieval by a key, including ordered sequential retrieval, very fast. Updates by key are faster too as MongoDB can find the document to update very quickly. However, keep in mind that each index created adds a certain amount of overhead for inserts and deletes. In addition to writing data to the base collection, keys must then be added to the B-Tree indexes. Thus, indexes are best for collections where the number of reads is much greater than the number of writes. For collections which are write-intensive, indexes, in some cases, may be counterproductive. Most collections are read-intensive, so indexes are a good thing in most situations.
I also recommend you look at how Mongo will decide what index to use when it comes to running a query that goes by both fields.
Also take a look at their Indexing Advice and FAQ page. It will explain things like only one index per query, selectivity, etc.
p.s. This slideshare deck from 10gen suggests there's a limit of 40 indexes per collection.
I am inserting my data into MongoDB and have 240 such files. Instead of inserting everything into one big collection, I was thinking of inserting each file as a collection by itself. Is this a good idea if I do a lot of queries on a commonly indexed column?
If so, how can I initiate a query to query all the collections in my database?
Using an application server such as Solr can help you achieve what you want, also with the addition of fuzzy matching, synonyms, phonetic matching, misspellings, etc.
Solr is built on top of Lucene. Its docs are here:
http://lucene.apache.org/solr/
The learning curve is a little bit steep, but you can get pretty good searchability using much of its defaults, leaving you to build a schema and index your data to get started.
I think the answer you're looking for is really here on your other question: Is there any multicore exploiting NoSQL system?
There is no way to query across all collections in Mongo. It wouldn't make a lot of sense to do so. MongoDB's strength is focused on tactically denormalizing data into collections. Providing operations to query across all collections runs exactly counter to the concept of tactical denormalization.
In theory, you could just run 240 queries. But more practically you'll probably end up "partitioning" your data so that you only need to query some of the collections. At this point you end up back at the link I provided above, which suggests that sharding your data is probably the answer here.
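For completeness, the "just run 240 queries" option could be sketched like this. The helper findInAllCollections is made up. It only relies on getCollectionNames(), getCollection() and cursor.toArray(), which the mongo shell's db object provides.

```javascript
// Hypothetical helper: run the same filter against every collection of `db`
// and concatenate the results. This issues one query per collection, so
// with 240 collections it means 240 round trips.
function findInAllCollections(db, filter) {
  let results = [];
  db.getCollectionNames().forEach(function (name) {
    const docs = db.getCollection(name).find(filter).toArray();
    results = results.concat(docs);
  });
  return results;
}

// In the mongo shell (the field name and value are made up):
//   findInAllCollections(db, { commonField: "some value" })
```

Each collection needs its own index on the queried field, which is another reason a single indexed collection, sharded if necessary, is usually the simpler design.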