How to throttle MongoDB write operations on one shard

We have a sharded MongoDB Cluster. One of our biggest collections is sharded based on the document id.
On some occasions, the interactions on a single document increase a lot, and because of this we get a MongoDBLockQueueBacklogging error which brings the whole cluster down.
We don't really need those interactions to be persisted in real time; we could throttle them for a few milliseconds.
I was reading about the Kafka-MongoDB connector, and it sounds like a very good idea. The problem is that we cannot use update operators ($inc, $currentDate, etc.) with it.
I have a feeling that something already exists for this kind of problem, but I cannot find it (or I am not searching for the right terms).
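One common pattern for this, absent a built-in throttle, is to buffer the hot counters in the application and flush them to MongoDB periodically with a bulk write, so a burst of interactions on one document collapses into a single $inc. Below is a minimal sketch of that idea using pymongo; the database/collection names, field name and flush interval are made up for illustration:

    import threading
    import time
    from collections import defaultdict
    from pymongo import MongoClient, UpdateOne

    client = MongoClient("mongodb://localhost:27017")
    coll = client["mydb"]["interactions"]  # hypothetical database/collection names

    pending = defaultdict(int)   # doc _id -> accumulated increment
    lock = threading.Lock()

    def record_interaction(doc_id, amount=1):
        # Buffer the increment in memory instead of writing to MongoDB immediately.
        with lock:
            pending[doc_id] += amount

    def flusher(interval=0.2):
        # Every `interval` seconds, collapse the buffered increments into one bulk write.
        while True:
            time.sleep(interval)
            with lock:
                batch = list(pending.items())
                pending.clear()
            if batch:
                coll.bulk_write(
                    [UpdateOne({"_id": doc_id}, {"$inc": {"interactions": n}})
                     for doc_id, n in batch],
                    ordered=False,
                )

    threading.Thread(target=flusher, daemon=True).start()

With something like this, a hot document receives at most one write per flush interval, no matter how many interactions arrive in a burst.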

Related

MongoDB - Can I shard all new db's (Created by application) automatically?

My team will deploy a new version of our app (it captures social media posts, hashtags, etc.). It creates a different DB for each user, and we may have thousands of collections in each DB. I read all the MongoDB sharding documentation and saw that I can only shard one collection or one DB at a time. Am I missing something?
We will start this new version fresh, without any databases, and grow from 0 again (for now we have 23k users), but this number will grow really quickly (100,000+ by the end of the year).
My question is: do I really need a sharded cluster? (My test setup has 3 shards with 3 microshards, 3 config servers and 2 mongos.) For now, in production, I have one large server doing all the hard work, but I don't want to keep scaling up; horizontal scaling is the better choice, I think.
Can I shard all my databases automatically, or do I really need to do it one by one, going through the shard-key procedure for each?
Thanks in advance
You are reading correctly. What you intend to do is so far away from what any sensible person would do that MongoDB doesn't offer any tools to support it. If you really want to go with this WTF solution, your application will be responsible for setting up sharding for each collection it creates. This forces you to give administrative permissions to the application (despite what any security guide recommends).
"Do you really need a sharded cluster?" - that depends on how much data you will have and how often you query it, with what kind of query. But it is unlikely to work anyway, because your sharded cluster would have to manage 100,000 databases * 1,000 collections = one hundred million collections. MongoDB is not designed to scale in that direction. The cluster will likely be so busy with bookkeeping that you won't see any notable performance gain.
It is also questionable whether sharding would even theoretically make sense. Sharding is usually only useful when you have very large collections. But in your scenario, where your data is so heavily fragmented into millions of collections, each individual collection is unlikely to be very large.
If you really want to go this route, it might in fact be a better solution to separate the databases physically by assigning each user to a database server.
Or you could just build a database architecture like a normal team would, with one database for all users and one collection per type of document. You would then speed up lookups by creating a compound index on the user and whatever criteria you previously used to decide which database a document belonged to. This index might also make a good shard key.
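As a rough illustration of that last suggestion, here is what the conventional layout could look like with pymongo; the database name, collection name and field names are hypothetical, and the sharding commands must be run through a mongos:

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://mongos-host:27017")
    db = client["social"]          # one database for all users (name is made up)
    posts = db["posts"]            # one collection per document type

    # Compound index: user first, then whatever you previously split databases by.
    posts.create_index([("user_id", ASCENDING), ("captured_at", ASCENDING)])

    # The same compound key can serve as the shard key.
    client.admin.command("enableSharding", "social")
    client.admin.command("shardCollection", "social.posts",
                         key={"user_id": 1, "captured_at": 1})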

MongoDB fast deletion best approach

My application currently uses MySQL. In order to support very fast deletion, I organize my data in partitions according to timestamp. Then, when data becomes obsolete, I just drop the whole partition.
It works great, and cleaning up my DB doesn't harm my application performance.
I would like to replace MySQL with MongoDB, and I'm wondering if there's something similar in MongoDB, or whether I would just need to delete the records one by one (which, I'm afraid, will be really slow, keep my DB busy, and slow down query response times).
In MongoDB, if your requirement is to delete data to limit the collection size, you should use a capped collection.
On the other hand, if your requirement is to delete data based on a timestamp, then a TTL index might be exactly what you're looking for.
From the official docs regarding capped collections:
Capped collections automatically remove the oldest documents in the collection without requiring scripts or explicit remove operations.
And regarding TTL indexes:
Implemented as a special index type, TTL collections make it possible to store data in MongoDB and have the mongod automatically remove data after a specified period of time.
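For concreteness, here is roughly how both options look from a driver; this is a sketch with made-up collection names, size and expiry values (using pymongo):

    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    db = client["mydb"]

    # Capped collection: fixed size, oldest documents are removed automatically.
    db.create_collection("events_capped", capped=True, size=512 * 1024 * 1024)  # 512 MB cap

    # TTL index: documents expire a fixed time after their `created_at` timestamp.
    db["events"].create_index([("created_at", ASCENDING)],
                              expireAfterSeconds=7 * 24 * 3600)  # keep one week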
I thought, even though I am late and an answer has already been accepted, I would add a little more.
The problem with capped collections is that they normally reside on a single shard in a cluster. Even though, in later versions of MongoDB, capped collections are supposedly shardable, they normally are not (see the edit below). On top of that, a capped collection MUST be allocated up front, so if you wish to keep a long history before clearing the data out, you might find your collection uses significantly more space than it should.
A TTL index is a good answer, however it is not as fast as drop(). TTL is basically MongoDB doing the same thing server-side that you would otherwise do in your application: judging when a document is historical and deleting it. If done excessively it will have a detrimental effect on performance. Not only that, but it isn't good at returning space to your $freelists, which is key to limiting fragmentation in MongoDB.
drop()ing a collection will literally just "drop" the collection on the spot, instantly and gracefully giving that space back to MongoDB (not the OS), with absolutely no fragmentation whatsoever. Not only that, but the operation is, 90% of the time, a lot faster than most other alternatives.
So I would stick by my comment:
You could factor the data into time series collections based on how long it takes for data to become historical, then just drop() the collection
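As a sketch of what that comment means in practice (the collection naming scheme and retention window are made up), writes are routed to per-day collections and cleanup becomes a drop():

    from datetime import datetime, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["metrics"]

    def collection_for(ts):
        # Route each write to a per-day collection, e.g. events_20140115.
        return db["events_" + ts.strftime("%Y%m%d")]

    def insert_event(doc):
        collection_for(datetime.now(timezone.utc)).insert_one(doc)

    def drop_older_than(cutoff):
        # Deleting a whole day of history is just a collection drop.
        for name in db.list_collection_names():
            if name.startswith("events_") and name[len("events_"):] < cutoff.strftime("%Y%m%d"):
                db.drop_collection(name)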
Edit
As @Zaid pointed out, even with the _id field, capped collections are not shardable.
One solution to this is using TokuMX which supports partitioning:
https://www.percona.com/blog/2014/05/29/introducing-partitioned-collections-for-mongodb-applications/
Advantages over capped collections: capped collections use a fixed amount of space (even when you don't have that much data), and they can't be resized on the fly. Partitioned collections' space usage depends on the data; you can add and remove partitions (for newly inserted data) as you see fit.
Advantages over TTL: TTL is slow; it just takes care of removing old data automatically. Partitions are fast - removing data is basically just a file removal.
HOWEVER: after Tokutek was acquired by Percona, development of TokuMX appears to have stopped (I would love to be corrected on this point). Unfortunately MongoDB itself doesn't support this functionality, and with TokuMX on its way out it looks like we will be left without a proper solution.

comment system rdbms vs nosql

What is the best way to implement a commenting system (with a huge volume of writes)?
1) Use an RDBMS such as MySQL, with 2 tables: one for the topics and one for the comments.
Pros: inserting a new comment is fast, efficient and simple, and indexing is efficient. Cons: scaling out (horizontal scaling) is hard.
2) Use a NoSQL database such as CouchDB or MongoDB. Pros: scaling out (horizontal scaling) is easy, it supports huge write volumes, and it is schemaless. Cons: I think that inserting new data is not as fast and efficient as in an RDBMS.
For example, to update a CouchDB document you need to fetch the whole document, update it locally, then submit it again; the document will be huge, so this consumes bandwidth.
Also, I don't think CouchDB does in-place updates, and MongoDB updates would be slow and not as efficient as in an RDBMS.
Also, when you want to get the comments of each user across various topics, I think the search would be faster in an RDBMS than in a NoSQL system.
Here is a sample CouchDB document (one document per topic):
{"_id":"doc id",
"_rev":"45521231465421"
"topic_title":"the title of the topic"
"topic_body":"the body of the topic"
"comments":[
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla1"}, {"user":"user1"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla2"}, {"user":"user2"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla3"}, {"user":"user3"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla4"}, {"user":"user4"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla5"}, {"user":"user5"}
{"date":"mm/dd/yy hh:mm:ss"}, {"commment":"bla6"}, {"user":"user6"}
]
}
I think that inserting new data is not as fast and efficient as in an RDBMS
You have hit on something there. The insertion speed of NoSQL databases depends upon your scenario. I cannot make that clear enough; so many people expect MongoDB to just perform magically faster than SQL and are sorely disappointed when it does not for them - in fact, the mongodb-user Google group has been filled with such people before now.
For example, to update a CouchDB document
Not only that, but CouchDB also uses versioning and JSON, which is not as efficient as storing it in SQL and will consume more space per record.
MongoDB updates would be slow and not as efficient as in an RDBMS
Schema, Queries, Schema, Queries...
That is what it comes down to. Ask yourself one question.
Will I be expecting a lot of comments per post?
If so, the in-memory (yes, in-memory) $push, $pull and other subdocument operators may get slow on a large subdocument (let's be honest, they will).
Not only that, but constantly growing documents can be a problem and can cause heavy fragmentation and space usage, creating a "swiss cheese" effect that slows your system down massively (bringing it to a grinding halt). This presentation should help you understand more about how storage really works: http://www.10gen.com/presentations/storage-engine-internals
So you already know that, used wrongly, subdocuments can be a bad idea. That being said, you could partially remedy it with power-of-2 sizes allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes but if you are getting far too many comment insertions then it won't help much.
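For reference, enabling that allocation strategy is a single collMod command; the database and collection names here are hypothetical, and this only matters on the MMAPv1-era servers this answer is about:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    # usePowerOf2Sizes only applies to MMAPv1-era servers;
    # later releases made power-of-2 allocation the default.
    client["mydb"].command("collMod", "comments", usePowerOf2Sizes=True)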
I personally would not embed this relationship.
So I would go for the same set-up as an RDBMS, and now you start to see the problem. Insertions would probably be about the same speed if it weren't for MongoDB's fsync queue, unlike SQL which writes straight to disk. You can set up MongoDB with journaled writes, but then you will probably get the same performance metrics as SQL at the end of the day.
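The non-embedded layout being suggested would look roughly like this (database, collection and field names are invented for the example):

    from datetime import datetime, timezone
    from pymongo import MongoClient, ASCENDING, DESCENDING

    client = MongoClient("mongodb://localhost:27017")
    db = client["forum"]

    # Comments live in their own collection and reference the topic,
    # instead of growing an embedded array inside the topic document.
    db["comments"].create_index([("topic_id", ASCENDING), ("date", DESCENDING)])

    def add_comment(topic_id, user, text):
        db["comments"].insert_one({
            "topic_id": topic_id,
            "user": user,
            "comment": text,
            "date": datetime.now(timezone.utc),
        })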
As for querying, this is where MongoDB can still come out on top, provided your working set fits into RAM. I cannot stress that last bit enough!!
Unlike SQL, MongoDB maps everything (your entire data) to virtual memory, not RAM, and definitely not to be confused with RAM. This does make it faster for larger lookups; for smaller lookups the speed will be about the same because both will be serving from an in-memory cache.
Also, when you want to get the comments of each user across various topics, I think the search would be faster in an RDBMS than in a NoSQL system.
If the topic id is in the comment document, it would definitely be faster in MongoDB, provided your working set fits in RAM.
What is meant by the working set? Here is a good answer: What does it mean to fit "working set" into RAM for MongoDB?
Hope this helps,
I can speak only about MongoDB, and you are indeed wrong about inserts. Here is a nice comparison of Mongo with MSSQL, and Mongo performs 100x better than MSSQL. So it's quite suitable for large-scale data processing.
Searching is also much faster (what would be the point of NoSQL if inserting and searching weren't faster?) - but with one caveat: you can't perform joins in queries, you have to join collections manually in your application (though the recommended workaround is nested documents).
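A manual "join" in application code, as mentioned above, is typically just two queries; a minimal sketch with invented collection and field names:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["forum"]

    def topic_with_comments(topic_id):
        # Two queries instead of one SQL JOIN: fetch the topic, then its comments.
        topic = db["topics"].find_one({"_id": topic_id})
        if topic is not None:
            topic["comments"] = list(
                db["comments"].find({"topic_id": topic_id}).sort("date", 1)
            )
        return topic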

Configure a Mongo replica set to only replicate certain collections

I have a ~3GB mongo database with several dozen collections. Three of these collections handle ~300 queries per second, while the rest sustain a much lower volume. I expect the traffic to continue to grow quickly.
I'd like to set up a replica set to handle the high-traffic collections. It isn't necessary for this new instance to replicate the rest of the database. Is this possible?
This doesn't seem possible at the moment with MongoDB's built-in features; the only way to do it is to come up with your own manual replication mechanism or to use tools written by third parties.
The https://github.com/wordnik/wordnik-oss project might help you achieve this, according to the following post:
https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/Ap9V4ArGuFo
Describes workaround to filter documents in replication.
Replicate only documents where {'public':true} in MongoDB
Or just replicate the data yourself manually, which might be worth trying.
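One way to do that manual replication is to tail the primary's oplog and copy over only the namespaces you care about. This is just a sketch of the idea (host names and collection names are invented, and it only handles plain inserts; updates and deletes need version-specific handling of the oplog entry format):

    from pymongo import MongoClient
    from pymongo.cursor import CursorType

    source = MongoClient("mongodb://primary:27017")    # replica-set member to copy from
    target = MongoClient("mongodb://archive:27017")    # standalone server to copy to

    oplog = source.local["oplog.rs"]
    cursor = oplog.find(
        {"ns": {"$in": ["mydb.hot1", "mydb.hot2", "mydb.hot3"]}},
        cursor_type=CursorType.TAILABLE_AWAIT,
    )

    for entry in cursor:
        ns_db, ns_coll = entry["ns"].split(".", 1)
        if entry["op"] == "i":  # plain inserts are the easy case
            target[ns_db][ns_coll].replace_one(
                {"_id": entry["o"]["_id"]}, entry["o"], upsert=True
            )
        # Updates ("u") and deletes ("d") require interpreting entry["o"]/entry["o2"].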
Good luck.
No, that isn't possible right now. What you could do is move those other collections into a separate, unreplicated database. But this will cause headaches once those collections see higher traffic too, because you would then need to move them into your "replicated" db as well.
But in general, replication isn't the way to go if you need to scale; it is meant more for DR/failover. Replica-set secondaries can only (optionally) answer read queries, not write queries - something you should keep in mind. So if you have a high write load, this may not cure your problem.
Once you allow your application to read from secondaries, you need to live with eventual consistency, meaning that your application isn't guaranteed to always see the latest data. This is caused by the asynchronous replication to the secondaries.
You can indeed cure this problem by configuring your write concern so that a write needs to succeed on all replicas before it is considered written and your driver returns. But this may slow down your write operations significantly.
So for scaling query execution capacity, I would go with sharding. This is possible at a per-collection level; all unsharded collections remain on the database's primary shard.
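For illustration, sharding only the three high-traffic collections might look like this (run against a mongos; the database/collection names and the hashed _id key are just placeholders, you would pick a key that fits your access pattern):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")  # must connect to a mongos

    client.admin.command("enableSharding", "mydb")
    # Shard only the high-traffic collections; everything else stays on the primary shard.
    for name in ("hot1", "hot2", "hot3"):
        client.admin.command("shardCollection", "mydb." + name, key={"_id": "hashed"})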
It's not possible, but given that the data size is so small and those collections aren't heavily updated, the only overhead of having them replicated is the small amount of storage on the secondary. That is a relatively small price to pay compared with writing your own replication logic, especially since the collections won't grow much in size.
Instead of that, archive the data: keep only the latest data set on the production server, and archive the rest of the data on the new server.

MongoDB: Billions of documents in a collection

I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary-key index would take forever, but as far as I'm aware Mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert size is -- this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few batch-size ranges and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
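To experiment with batch sizes, a crude timing harness along these lines (pymongo, with made-up collection and field names) is usually enough to find a sweet spot:

    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client["corpus"]["bigrams"]

    def timed_load(docs, batch_size):
        # Insert in batches of `batch_size` and report throughput for comparison.
        start = time.monotonic()
        for i in range(0, len(docs), batch_size):
            coll.insert_many(docs[i:i + batch_size], ordered=False)
        return len(docs) / (time.monotonic() - start)

    sample = [{"w1": "foo", "w2": "bar%d" % i, "count": i} for i in range(100000)]
    for size in (100, 1000, 10000):
        coll.drop()  # start each run from an empty collection
        print(size, int(timed_load(sample, size)), "docs/sec")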
Mongo can easily handle billions of documents and can have billions of documents in one collection, but remember that the maximum document size is 16 MB. There are many folks with billions of documents in MongoDB, and there are lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have as well, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, there are other users who use multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency via db-level locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced, which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
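A rough sketch of the pre-split-and-disable-balancer step, assuming an already sharded collection and a hypothetical {"bigram": 1} shard key (balancerStop exists as an admin command on newer servers; on older versions you would flip the balancer flag in config.settings instead):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")
    admin = client.admin

    # Turn the balancer off before the bulk load so chunks aren't migrated mid-import.
    admin.command("balancerStop")

    # Pre-split the collection at chosen shard-key boundaries, e.g. one chunk
    # per leading letter for the assumed {"bigram": 1} shard key.
    for letter in "abcdefghijklmnopqrstuvwxyz":
        admin.command("split", "corpus.bigrams", middle={"bigram": letter})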
Here's some good links -
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions it across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.