What operations are cheap/expensive in MongoDB?

I'm reading up on MongoDB, and trying to get a sense of where it's best used. One question that I don't see a clear answer to is which operations are cheap or expensive, and under what conditions.
Can you help clarify?
Thanks.

It is often claimed that MongoDB has insanely fast writes. While writes are indeed not slow, this is quite an overstatement. Write throughput in MongoDB is limited by a global write lock. Yes, you read that right: there can be only ONE* write operation happening on the server at any given moment.
Also, I suggest you take advantage of the schemaless nature of MongoDB and store your data denormalized. Often it is possible to fetch all required data with a single disk seek (because it is all in the same document). Fewer disk seeks mean faster queries.
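As a rough sketch (the collection and field names here are hypothetical, not from the question), a denormalized order that embeds its line items can be fetched with a single read:
// hypothetical denormalized document: the order and its items live together,
// so one query (and ideally one disk seek) returns everything
db.orders.insert({
    _id: 1001,
    customer: "alice",
    items: [
        { sku: "A-1", qty: 2, price: 9.99 },
        { sku: "B-7", qty: 1, price: 24.50 }
    ],
    total: 44.48
})
db.orders.findOne({ _id: 1001 })   // no join, everything comes back in one document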
If the data sits in RAM, no disk seeks are required at all; it is served straight from memory. So make sure you have enough RAM.
Map/Reduce, group, and $where queries are slow.
It is not fast to keep writing to one big document (using $push, for example). The document will outgrow its allocated space on disk and will have to be copied to another place, which involves more disk operations.
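For illustration, assuming a hypothetical "logs" collection, this is the kind of ever-growing update the warning is about; capping the array with $slice (available in later versions) is one way to keep the document from growing without bound:
// each call appends to the same document, which eventually outgrows its
// allocated space under the old MMAPv1 engine and has to be moved on disk
db.logs.update(
    { _id: "app-server-1" },
    { $push: { events: { ts: new Date(), msg: "request handled" } } },
    { upsert: true }
)
// bounded variant: $each + $slice keeps only the most recent 1000 entries
db.logs.update(
    { _id: "app-server-1" },
    { $push: { events: { $each: [{ ts: new Date(), msg: "request handled" }], $slice: -1000 } } }
)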
And I agree with @AurelienB that some basic principles are universal across all databases.
Update
* Since 2011, several major versions of MongoDB have been released, improving the locking situation (from server-wide, to database-level, to collection-level). A new storage engine, WiredTiger, was introduced, which uses document-level locks. All in all, writes should be significantly faster now, in 2018.

From my practice, one thing that should be mentioned is that MongoDB is not a very good fit for reporting, because reports usually need data from different collections (a 'join'), and MongoDB does not provide a good way to aggregate data across multiple collections (nor is it supposed to). For some reports map/reduce or incremental map/reduce can work well, but those are rare situations.
For reporting, some people suggest migrating the data into a relational database, which has a lot of tooling for reports.

This is not very different from other database systems.
Queries on indexed data are fast. Queries over a lot of unindexed data are... slow.
Thanks to denormalization, writes that don't have to maintain an index are fast, which is why logging is a classic use case.
At the opposite end, reading data that lives on disk (not in RAM) without an index can be very slow when you have billions of documents.
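A minimal illustration in the mongo shell (the collection and field names are hypothetical):
// without the index this query scans every document on disk;
// with it, only the matching range needs to be read
db.pageviews.ensureIndex({ ts: 1 })
db.pageviews.find({ ts: { $gt: ISODate("2018-01-01T00:00:00Z") } })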


Is document-level transaction support enough? (in MongoDB)

The MongoDB documentation and blog describe its transaction capabilities like this:
MongoDB write operations are ACID-compliant at the document level, including the ability to update embedded arrays and sub-documents atomically.
Now I'm wondering: is this "document-level transaction support" enough?
By "enough" I mean: can it be as good as the transaction support in old-fashioned RDBMSs?
About the possible duplicate: what I had in mind was a general question, namely whether this is enough for a developer or not.
I'm going to agree with Joshua on this and add my two cents. In the RDBMS world a transaction is very frequently updating multiple normalized data-bearing structures. A robust level of atomicity is required to ensure that changes are committed to all of those structures as a unit, or rolled back as a unit. In MongoDB you would ideally be designing your schema to keep data that logically belongs together housed together in the same document. This makes document-level atomicity perfectly sufficient for your typical document schema.
I'll also agree that neither RDBMS nor MongoDB transaction handling should be your only line of defense against errors and data corruption. For critical data changes that must be atomic you should always check consistency at the code level post-update.
One final thought: In most RDBMS systems, transaction handling does not always map one-to-one to concurrency. Frequently a large transaction can lock an entire table or tables and cause backlogs in response. In MongoDB, document-level ACID compliance in transaction handling pairs well with document-level concurrency available to those using the WiredTiger storage engine. If designed with both in mind your application can be highly concurrent and completely ACID compliant at the document level, giving you a high level of performance and throughput for transactional workloads.
Cheers,
Bill Finch
Answering this question involves an understanding of schema design in the NoSQL world. If you approach your schema design like you would in an RDBMS, then you will have a very bad time, and not just because of transactions.
If you design your documents properly, however, document level ACID-compliance should be just fine for 99% of use cases. I would even argue that outside that 99% and in that 1% of use cases, you shouldn't be relying on your database for transactions anyways. This would be a really complicated case where you were changing two completely separate things in parallel. Even in an RDBMS if you were doing this, you would always write a verification in code.
One example might be a bulk update for a banking customer that involved them changing their name and doing an address change at the same time. In an RDBMS these are likely to be separate tables. In MongoDB these will both be in the same document. So this fits in the 99%.
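As a sketch of that 99% case (collection and field names are hypothetical), both changes land in one document and therefore in one atomic update:
// the name change and the new address are applied atomically,
// because they touch a single document
db.customers.update(
    { _id: 42 },
    {
        $set: { "name.last": "Smith" },
        $push: { addresses: { street: "1 Main St", city: "Springfield" } }
    }
)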
A debit to one account and a credit to another would be an example that fits into the 1%. You can wrap that in a transaction in SQL, but if you don't write code to verify the writes afterward, you are going to lose your job. You would never rely on the database alone for that. The same goes for MongoDB, where these would be two different documents.
Document-level transactions are good, but not enough for real-world applications. In general, you have to think a bit differently than in the RDBMS world and use sub-documents; that way you can handle many situations without collection-wide transactions, but there are enough use cases where you definitely need collection-wide transactions.
The debit/credit situation of an accounting system is one example... or if you implement a battle game where two players fight each other and the winner takes "resources" from the loser: you have to update the resource state of both players together, or roll both updates back if something failed. MongoDB does not handle this transactionally the way RDBMS systems do.
Once again, as others have said already: you have to think in objects/document structures; then you can handle many situations where document-level transactions are more than enough...
But collection-wide transactions are on the MongoDB roadmap ;-)
If you are able to include all your logical data in one document, MongoDB is going to be faster and higher-performance than a relational database. You can be sure that either all of your data is written, or none of it is (ACID-compliant at the document level).
If you are not in a hurry, MongoDB is working hard on transactions across collections!
Regards,
Juan
Starting from version 4.0 MongoDB will add support for multi-document transactions. So you will have the power of the document model with ACID guarantees in MongoDB.
For details visit this link: https://www.mongodb.com/blog/post/multi-document-transactions-in-mongodb?jmp=community
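A minimal sketch of what such a multi-document transaction could look like in the 4.0 mongo shell (the "bank" database, "accounts" collection and amounts are hypothetical):
// debit one account and credit another; both succeed or both roll back
var session = db.getMongo().startSession()
session.startTransaction()
var accounts = session.getDatabase("bank").accounts
try {
    accounts.updateOne({ _id: "A" }, { $inc: { balance: -100 } })
    accounts.updateOne({ _id: "B" }, { $inc: { balance: 100 } })
    session.commitTransaction()
} catch (e) {
    session.abortTransaction()
    throw e
}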

Comment system: RDBMS vs NoSQL

What is the best way to implement a commenting system (with a huge volume of writes)?
1) Use an RDBMS such as MySQL, with 2 tables: one for the topics and one for the comments.
Pros: inserting a new comment is fast, simple and efficient, and indexing is efficient. Cons: scaling out (horizontal scaling) is hard.
2) Use a NoSQL database such as CouchDB or MongoDB. Pros: scaling out (horizontal scaling) is easy, it supports huge write volumes, and it is schemaless. Cons: I think that inserting new data is not as fast and efficient as in an RDBMS.
For example, to update a CouchDB document you need to fetch the whole document, update it locally, then submit it again, and the document will be huge, so this consumes bandwidth.
Also, I think that CouchDB does not do in-place updates, and MongoDB updates would be slow and not as efficient as in an RDBMS.
Also, when you want to get the comments of each user across various topics, I think the search would be faster in an RDBMS than in the NoSQL system.
Here is a sample CouchDB document (one document per topic):
{
    "_id": "doc id",
    "_rev": "45521231465421",
    "topic_title": "the title of the topic",
    "topic_body": "the body of the topic",
    "comments": [
        {"date": "mm/dd/yy hh:mm:ss", "comment": "bla1", "user": "user1"},
        {"date": "mm/dd/yy hh:mm:ss", "comment": "bla2", "user": "user2"},
        {"date": "mm/dd/yy hh:mm:ss", "comment": "bla3", "user": "user3"},
        {"date": "mm/dd/yy hh:mm:ss", "comment": "bla4", "user": "user4"},
        {"date": "mm/dd/yy hh:mm:ss", "comment": "bla5", "user": "user5"},
        {"date": "mm/dd/yy hh:mm:ss", "comment": "bla6", "user": "user6"}
    ]
}
I think that inserting new data is not as fast and efficient as in an RDBMS
You have hit on something there. The insertion speed of NoSQL databases depends on your scenario. I cannot make that clear enough: many people expect MongoDB to simply perform magically faster than SQL and are sorely disappointed when it does not; in fact, the mongodb-user Google group has been filled with such people.
For example, to update a CouchDB document
Not only that, but CouchDB also uses versioning and JSON, which is not as efficient as SQL's storage format and will consume more space per record.
MongoDB updates would be slow and not as efficient as in an RDBMS
Schema, Queries, Schema, Queries...
That is what it comes down to. Ask yourself one question.
Will I be expecting a lot of comments per post?
If so, the in-memory (yes, in-memory) $push, $pull and other subdocument operators may get slow on a large subdocument (let's be honest: they will).
Not only that, but constantly growing documents can be a problem and can cause heavy fragmentation and space usage, creating a "swiss cheese" effect that slows your system down massively (bringing it to a grinding halt). This presentation should help you understand more about how storage really works: http://www.10gen.com/presentations/storage-engine-internals
So you already know that, used wrongly, subdocuments can be a bad idea. That being said, you could partially remedy the problem with power-of-2 size allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes but if you are getting far too many comment insertions it won't help much.
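The command from that linked page looks roughly like this (the collection name here is hypothetical):
// pad records to power-of-2 sizes so grown documents are less likely to be moved
db.runCommand({ collMod: "topics", usePowerOf2Sizes: true })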
I personally would not embed this relationship.
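A rough sketch of the non-embedded layout (collection, field and index names are hypothetical):
// one document per comment, referencing its topic instead of being embedded in it
var topicId = ObjectId()   // stands in for an existing topic's _id
db.comments.insert({ topic_id: topicId, user: "user1", date: new Date(), comment: "bla1" })
// indexes so comments can be fetched per topic or per user without a full scan
db.comments.ensureIndex({ topic_id: 1, date: 1 })
db.comments.ensureIndex({ user: 1 })
db.comments.find({ topic_id: topicId }).sort({ date: 1 })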
So I would go for the same set-up as an RDBMS, and now you start to see the problem. Insertions would probably be about the same speed if it weren't for MongoDB's fsync queue, unlike SQL which writes straight to disk. You can set up MongoDB with journalled writes, but then you will probably end up with the same performance metrics as SQL at the end of the day.
As for querying, this is where MongoDB can still come out on top, provided your working set fits into RAM. I cannot stress that last bit enough!!
Unlike SQL, MongoDB maps everything (your entire data set) into virtual memory, not RAM, and virtual memory is definitely not to be confused with RAM. This makes it faster for larger lookups; for smaller lookups the speed will be about the same, because both will be serving from an in-memory cache.
Also, when you want to get the comments of each user across various topics, I think the search would be faster in an RDBMS than in the NoSQL system.
If the topic id is in the comment document it would definitely be faster in MongoDB, providing your working set is ready in RAM.
What is meant by the working set? Here is a good answer: What does it mean to fit "working set" into RAM for MongoDB?
Hope this helps,
I can speak only about MongoDB, and you are indeed wrong about inserts. Here is a nice comparison of Mongo with MSSQL in which Mongo performs 100x better than MSSQL. So it's quite suitable for large-scale data processing.
Searching is also much faster (what would be the whole point of NoSQL if inserting and searching weren't faster?), but with one caveat: you can't perform joins in queries, so you have to join the data manually in your application (though there is a recommended workaround: nested documents).

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary key index would take forever but as far as I'm aware mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert size is; this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
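A minimal sketch of such an experiment in the mongo shell (the collection name, document shape and batch sizes are all hypothetical; rerun it with a few different sizes and compare the timings):
// time batched inserts; try batch sizes of 1k, 10k, 50k, ... and keep the fastest
var batchSize = 10000, batch = [], start = new Date()
for (var i = 0; i < 1000000; i++) {
    batch.push({ w1: "word" + i, w2: "word" + (i + 1), count: 1 })
    if (batch.length == batchSize) { db.bigrams.insert(batch); batch = [] }
}
if (batch.length > 0) db.bigrams.insert(batch)
print("elapsed ms: " + (new Date() - start))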
Mongo can easily handle billions of documents and can have billions of documents in one collection, but remember that the maximum document size is 16 MB. There are many folks with billions of documents in MongoDB, and there are lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have as well, which is probably not what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, there are other users who run multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see an improvement with sharding on a single server. As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency via per-database locking, I suspect there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer; it would be counter-productive to be moving data around just to keep things balanced, which means you need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
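A rough sketch of that preparation in the mongo shell (the database, collection, shard key and split points are hypothetical):
// enable sharding, stop the balancer during the load, and pre-split on the chosen key
sh.enableSharding("ngrams")
sh.shardCollection("ngrams.bigrams", { bigram: 1 })
sh.stopBalancer()
sh.splitAt("ngrams.bigrams", { bigram: "m" })   // repeat for each pre-chosen split point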
Here are some good links:
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions it across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.

MongoDB: Sharding on a single machine. Does it make sense?

I created a collection in MongoDB consisting of 11446615 documents.
Each document has the following form:
{
    "_id" : ObjectId("4e03dec7c3c365f574820835"),
    "httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
    "words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
    "howMany" : 3
}
httpReferer: just a URL
words: words parsed from the URL above; the size of the list is between 15 and 90.
I am planning to use this database to obtain a list of webpages which have similar content.
I'll be querying this collection on the words field, so I created (or rather started creating) an index on that field:
db.my_coll.ensureIndex({words: 1})
Creating this collection takes a very long time. I tried two approaches (the tests below were done on my laptop):
1) Inserting and then indexing: inserting took 5.5 hours, mainly due to CPU-intensive preprocessing of the data; indexing took another 30 hours.
2) Indexing before inserting: it would take a few days to insert all the data into the collection.
My main focus is to decrease the time it takes to generate the collection. I don't need replication (at least for now). Querying also doesn't have to be lightning-fast.
Now, time for a question:
I have only one machine with one disk where I can run my app. Does it make sense to run more than one instance of the database and split my data between them?
Yes, it does make sense to shard on a single server.
At this time, MongoDB still uses a global lock per MongoDB server. Creating multiple servers frees each server from the others' locks. If you run a multi-core machine with separate NUMA nodes, this can also increase performance.
If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.
Machines vary. I suggest writing your own bulk-insertion benchmark program and spinning up various numbers of MongoDB server shards. I have a 16-core RAIDed machine and I've found that 3-4 shards seem to be ideal for my write-heavy database. I'm finding that my two NUMA nodes are my bottleneck.
In the modern day (2015), with MongoDB v3.0.x, there is collection-level locking with MMAPv1, which increases write throughput slightly (assuming you are writing to multiple collections), but if you use the WiredTiger engine there is document-level locking, which gives much higher write throughput. This removes the need for sharding across a single machine. You can technically still increase the performance of mapReduce by sharding across a single machine, but in that case you'd be better off just using the aggregation framework, which can exploit multiple cores. If you rely heavily on map/reduce algorithms, it might make more sense to just use something like Hadoop.
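For example, using the my_coll collection and words field from the question, a word count that might otherwise be a map/reduce job can be written as an aggregation pipeline (the pipeline itself is just an illustrative sketch):
// count how many documents contain each word, entirely server-side
db.my_coll.aggregate([
    { $unwind: "$words" },
    { $group: { _id: "$words", pages: { $sum: 1 } } },
    { $sort: { pages: -1 } }
])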
The only reason to shard MongoDB is to scale horizontally. So in the event that a single machine cannot house enough disk space, memory, or CPU power (rare), sharding becomes beneficial. I think it's really quite seldom that someone has enough data to need sharding, even at a large business, especially since WiredTiger added compression support that can reduce disk usage by over 80%. It's also infrequent that someone uses MongoDB to perform really CPU-heavy queries at large scale, because there are much better technologies for that. In most cases IO is the most important factor in performance; not many queries are CPU-intensive, unless you're running a lot of complex aggregations, and even geospatial data is indexed on insertion.
The most likely reason you'd need to shard is having a lot of indexes that consume a large amount of RAM; WiredTiger reduces this, but it's still the most common reason to shard. Sharding across a single machine, on the other hand, is likely just going to add undesired overhead with very little or possibly no benefit.
This doesn't have to be a mongo question, it's a general operating system question. There are three possible bottlenecks for your database use.
network (i.e. you're on a gigabit line, you're using most of it at peak times, but your database isn't really loaded down)
CPU (your CPU is near 100% but disk and network are barely ticking over)
disk
In the case of network, rewrite your network protocol if possible; otherwise shard to other machines. In the case of CPU, if you're at 100% on a few cores but others are free, sharding on the same machine will improve performance. If the disk is fully utilized, add more disks and shard across them; that is way cheaper than adding more machines.
No, it does not make sense to shard on a single server.
There are a few exceptional cases, but they mostly come down to concurrency issues related to things like running map/reduce or JavaScript.
This is answered in the first paragraph of the Replica set tutorial
http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial

What is pre-distilled data or data aggregated at runtime, and why is MongoDB not good at it?

What is an example of data that is "predistilled or aggregated in runtime"? (And why isn't MongoDB very good with it?)
This is a quote from the MongoDB docs:
Traditional Business Intelligence. Data warehouses are more suited to new, problem-specific BI databases. However note that MongoDB can work very well for several reporting and analytics problems where data is pre-distilled or aggregated in runtime -- but classic, nightly batch load business intelligence, while possible, is not necessarily a sweet spot.
Let's take something simple like counting clicks. There are a few ways to report on clicks.
1) Store the clicks in a single place (a file, database table, or collection). When somebody wants stats, you run a query on that table and aggregate the results. Of course, this doesn't scale very well, so typically you use...
2) Batch jobs. Store your clicks as in #1, but summarize them only every 5 minutes or so. When people want stats, they query the summary table. Note that "clicks" may have millions of rows while "summary" may only have a few thousand rows, so it's much quicker to query.
3) Count the clicks in real time. Every time there's a click you increment a counter somewhere. Typically this means incrementing the "summary" table(s).
Now most big systems use #2. There are several systems that are very good for this specifically (see Hadoop).
#3 is difficult to do with SQL databases (like MySQL), because there's a lot of disk locking happening. However, MongoDB isn't constantly locking the disk and tends to have much better write throughput.
So MongoDB ends up being very good for such "real-time counters". This is what they mean by predistilled or aggregated in runtime.
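A minimal sketch of such a real-time counter (the collection and field names are hypothetical):
// every click bumps a pre-aggregated per-page, per-hour counter in place
db.clicks_summary.update(
    { page: "/pricing", hour: ISODate("2018-01-01T10:00:00Z") },
    { $inc: { clicks: 1 } },
    { upsert: true }
)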
But if MongoDB has great write throughput, shouldn't it be good at doing batch jobs?
In theory, this may be true and MongoDB does support Map/Reduce. However, MongoDB's Map/Reduce is currently quite slow and not on par with other Map/Reduce engines like Hadoop. On top of that, the Business Intelligence (BI) field is filled with many other tools that are very specific and likely better-suited than MongoDB.
What is an example of data that is "predistilled or aggregated in runtime"?
An example of this can be any report that requires data from multiple collections.
And why isn't MongoDB very good with it?
In a document database you can't do a join, and because of this it is hard to build reports; a report is usually data aggregated from many tables/collections.
And since MongoDB (and document databases in general) are a good fit for data distribution and denormalization, it is better to prebuild reports whenever possible and just display the data from that collection at runtime.
For some tasks/reports it is not possible to prebuild the data; in those cases MongoDB gives you map/reduce, grouping, etc.