MongoDB cursor and write operations

I am using MongoDB to save data about products. After writing the initial large data set (24 million items) I would like to change all the items in the collection.
To do this I use a cursor to iterate over the whole collection and add a "row", or field, to every item. With large data sets this is not working: only 180,000 items were updated. On a small scale it works. Is that normal behavior?
Is MongoDB not supposed to support writes while iterating with a cursor over the whole collection?
What would be a good practice to do that instead?

For larger collections, you might run into snapshotting problems. When you add data to a document and save it, it will grow, forcing MongoDB to move the document on disk. The cursor may then encounter the same document twice.
You can either use $snapshot in your query, or use a stable order such as sort({"_id":1}). Note that you can't use both.
Also make sure to use at least acknowledged write concern.
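For illustration, a minimal sketch of the sorted-iteration approach using the newer pymongo API (the collection and field names here are placeholders):

from pymongo import MongoClient, ASCENDING
from pymongo.write_concern import WriteConcern

client = MongoClient()
# "products" and "new_field" are placeholder names; w=1 means acknowledged writes
coll = client.mydb.get_collection("products", write_concern=WriteConcern(w=1))

cursor = coll.find(no_cursor_timeout=True).sort("_id", ASCENDING)
try:
    # a stable _id order means a moved document is never visited twice
    for doc in cursor:
        coll.update_one({"_id": doc["_id"]}, {"$set": {"new_field": 1}})
finally:
    cursor.close()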

When we had a similar problem, we fetched the data in chunks of 100k documents (a size we settled on after some testing). It's a quick and simple solution.
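A rough sketch of that chunked approach, paginating on _id rather than skip() so each chunk starts where the last one ended (collection and field names are placeholders):

from pymongo import MongoClient, ASCENDING

coll = MongoClient().mydb.products  # placeholder collection name
CHUNK = 100000

last_id = None
while True:
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    batch = list(coll.find(query).sort("_id", ASCENDING).limit(CHUNK))
    if not batch:
        break
    for doc in batch:
        coll.update_one({"_id": doc["_id"]}, {"$set": {"new_field": 1}})
    last_id = batch[-1]["_id"]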

Related

Performance loss with large collections

I have a collection named "test" with 132K documents in it. Fetching the first document of the collection takes 2-5ms, but it's not the same for the last one: that takes 100-200ms to pull.
So I've decided to ask the community.
My questions:
What is the best number of documents per collection for performance?
Why does it take so long to get the last document from the collection? (I admittedly only partly understand how MongoDB works internally.)
What should I do about this issue and similar problems in the future?
After some research into how MongoDB works, I found the solution. I hadn't created any indexes on my collection, so every query scanned each document in turn. After creating indexes suited to my queries, it is much faster than before, around 1ms.
Conclusion
Create indexes that match your collection and your query patterns; used appropriately, they make both read and write operations more efficient. Also read up on index build options such as background, which prevents the build from blocking other operations.
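For example, a minimal pymongo sketch (the field name created_at is an assumption):

from pymongo import MongoClient, DESCENDING

coll = MongoClient().mydb.test
# background=True builds the index without blocking other operations
coll.create_index([("created_at", DESCENDING)], background=True)

# With the index, fetching the "last" document no longer scans the collection:
last_doc = coll.find_one(sort=[("created_at", DESCENDING)])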

Faster way to remove 2TB of data from single collection (without sharding)

We collect a lot of data and have decided to migrate it from MongoDB into a data lake, keeping in Mongo only the newest, most relevant portion as our operational database. We have a replica set, but we don't use sharding. I suspect that with a sharded cluster we could achieve the necessary results much more simply, but this is a one-time operation, so setting up a cluster just for it looks like a very complex solution (I also suspect that converting such a collection into a sharded collection would be a very long-running operation, but I can be completely wrong here).
One of our collections is about 2TB in size right now. We want to remove old data from the original database as fast as possible, but the standard "remove" operation is very slow, even when we use unorderedBulkOperation.
I found a few suggestions to copy the data we want to keep into another collection and then simply drop the original one, instead of trying to remove data (i.e., migrate the data we want to keep rather than delete the data we don't). There are a few different ways I found to copy a portion of the data from the original collection to another collection:
1. Extract the data and insert it into the other collection one by one, or extract a portion of the data and insert it in bulk using insertMany(). This looks faster than removing data, but still not fast enough.
2. Use the $out operator with the aggregation framework to extract portions of data. It's very fast! But it writes every portion into a separate collection and has no ability to append data in the current MongoDB version, so all exported portions then have to be combined into one final collection, which is slow again. I see that $out will be able to append data in the next release of Mongo (https://jira.mongodb.org/browse/SERVER-12280), but we need a solution now, and unfortunately we won't be able to do a quick upgrade of our Mongo version anyway.
3. mongoexport / mongoimport - export a portion of the data into a JSON file and append it to another collection using import. It's quite fast too, so it looks like a good option.
Currently the best choice to speed up the migration looks like a combination of the $out and mongoexport/mongoimport approaches, plus multithreading to run several of the described operations at once.
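For reference, roughly what the $out step of that combination might look like in pymongo (the collection names and the cutoff are placeholders):

from datetime import datetime
from pymongo import MongoClient

db = MongoClient().mydb
cutoff = datetime(2019, 1, 1)  # assumed boundary between "keep" and "archive"

# Server-side copy of the documents worth keeping into a staging collection.
# $out overwrites "staging_keep"; collection and field names are examples.
db.big_collection.aggregate([
    {"$match": {"created_at": {"$gte": cutoff}}},
    {"$out": "staging_keep"},
], allowDiskUse=True)

# Each staging collection can then be appended to the final one with
# mongoexport / mongoimport, e.g.:
#   mongoexport --db mydb --collection staging_keep --out keep.json
#   mongoimport --db mydb --collection final_keep --file keep.json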
But is there an even faster option that I might have missed?

Get changed data after executing a bulk operation [duplicate]

I'm using an update operation with upsert. I want to retrieve all documents that were modified by the update.
for key in categories_links:
    # upsert=True: insert the document when no match for "name" exists yet
    collection.update({"name": key}, {"name": key, "url": categories_links[key]}, upsert=True)
You should use a timestamp field in your documents if you ever need to find which ones were updated and when. There is a BSON date type for that.
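For instance, a sketch of that idea using $currentDate (the field name last_modified is an example, and the newer update_one API is assumed):

import datetime

start = datetime.datetime.utcnow()
for key in categories_links:
    collection.update_one(
        {"name": key},
        {"$set": {"url": categories_links[key]},
         "$currentDate": {"last_modified": True}},  # server sets a BSON Date
        upsert=True)

# Documents touched by this run (assumes client/server clocks are close):
changed = list(collection.find({"last_modified": {"$gte": start}}))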
To my knowledge, pymongo will not return a list of all of the records which have been modified by an update.
However, if you are using a replicaset, you might be able to accomplish this by looking at the oplog.
According to the documentation:
The oplog must translate multi-updates into individual operations in order to maintain idempotency. This can use a great deal of oplog space without a corresponding increase in data size or disk use.
If you want to keep track of each element being updated, you could instead do a find() and then loop through the results, issuing an individual update() for each. Obviously this would be much slower, but it may be an acceptable tradeoff for your specific use case.
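A rough sketch of that approach, reusing the variables from the question: capture each matching _id before its individual update so you know exactly which documents changed.

modified_ids = []
for key in categories_links:
    existing = collection.find_one({"name": key})  # None if this will be an insert
    collection.update_one({"name": key},
                          {"$set": {"url": categories_links[key]}},
                          upsert=True)
    if existing is not None:
        modified_ids.append(existing["_id"])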

MongoDB space usage inefficiencies when using $push

Let's say that I have two collections, A and B. Among other things, one of them (collection A) has an array whose cells contain subdocuments with a handful of keys.
I also have a script that will go through a queue (external to MongoDB), insert its items on collection B, and push any relevant info from these items into subdocuments in an array in collection A, using $push. As the script runs, the size of the documents in collection A grows significantly.
The problem seems to be that, whenever a document does not fit its allocated size, MongoDB will move it internally, but it won't release the space it occupied previously: new MongoDB documents won't use that space unless I run a compact or repairDatabase command.
In my case, the script seems to scorch through my disk space quickly. It inserts a couple of items into collection B, then tries to insert into a document in collection A, and (I'm guessing) relocates said document without reusing its old spot. Perhaps this does not happen every time, thanks to padding, but when these documents are about 10MB in size, every relocation burns through a significant chunk of the DB even though the actual data size remains small. The process eats up my (fairly small, admittedly) DB in minutes.
Requiring a compact or repairDatabase command every time this happens is clumsy: there is space on disk, and I would like MongoDB to use it without requesting it explicitly. The alternative of having a separate collection for the subdocuments in the array would fix this issue, and is probably a better design anyway, but one that will require me to make joins that I wanted to avoid, this being one of the advantages of NoSQL.
So, first, does MongoDB actually use space the way I described above? Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it? And third, are there other, more fitting, design approaches I'm missing?
Most of what you're asking you could have found out already (a Google search would have brought up hundreds of links, including critical blog posts on the matter). That said, this presentation should answer about 90% of your questions: http://www.mongodb.com/presentations/storage-engine-internals
As for solving the problem through settings etc.: not really possible here; powerOf2Sizes allocation won't help for an array which grows like this. So to answer:
Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
I would say no.
And third, are there other, more fitting, design approaches I'm missing?
For something like this I would recommend using a separate collection to store each of the array elements as a new row independent of the parent document.
Sammaye's recommendation was correct, but I needed to do some more digging to understand the cause of this issue. Here's what I found.
So, first, does MongoDB actually use space the way I described above?
Yes, but that's not as intended. See bug SERVER-8078, and its (non-obvious) duplicate, SERVER-2958. Frequent $push operations cause MongoDB to shuffle documents around, and their old spots are not (yet!) reused without a compact or repairDatabase command.
Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
For some usages of $push, the usePowerOf2Sizes option initially consumes more memory, but stabilizes better (see the discussion on SERVER-8078). It may not work well with arrays that consistently tend to grow, which are a bad idea anyway because document sizes are capped (at 16MB).
And third, are there other, more fitting, design approaches I'm missing?
If an array is going to have hundreds or thousands of items, or if its length is arbitrary but likely large, it's better to move its cells to a different collection, despite the need for additional database calls.
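A minimal sketch of that design (all names here are illustrative):

from pymongo import MongoClient

db = MongoClient().mydb
parent_id = db.collection_a.insert_one({"name": "parent"}).inserted_id

# Each former array cell becomes its own document in a child collection,
# so the parent document never grows and never needs to be moved:
db.a_items.insert_one({"parent_id": parent_id, "payload": {"k": "v"}})
db.a_items.create_index("parent_id")  # makes the manual "join" cheap

# The "join" is then done manually when reading:
items = list(db.a_items.find({"parent_id": parent_id}))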

Is there any way to register a callback for deletions in a capped collection in Mongo?

I want to use a capped collection in Mongo, but I don't want my documents to die when the collection loops around. Instead, I want Mongo to notice that I'm running out of space and move the old documents into another, permanent collection for archival purposes.
Is there a way to have Mongo do this automatically, or can I register a callback that would perform this action?
You shouldn't be using a capped collection for this. I'm assuming you're doing so because you want to keep the amount of "hot" data relatively small and move stale data to a permanent collection. However, this is effectively what happens anyway when you use MongoDB. Data that's accessed often will be in memory and data that is used less often will not be. Same goes for your indexes if they remain right-balanced. I would think you're doing a bit of premature optimization or at least have a suboptimal schema or index strategy for your problem. If you post exactly what you're trying to achieve and where your performance takes a dive I can have a look.
To answer your actual question; MongoDB does not have callbacks or triggers. There are some open feature requests for them though.
EDIT (Small elaboration on technical implementation): MongoDB is built on top of memory-mapped files for its storage engine. It basically means it's an LRU-based cache of "hot" data, where data in this case can be both actual data and index data. As a result, the data and associated index data you access often (in your case the data you'd typically have in your capped collection) will be in memory and thus very fast to query. In typical use cases the performance difference between having an "active" collection plus an "archive" collection versus just one big collection should be small. As you can imagine, having more memory available to the mongod process means more data can stay in memory, and as a result performance will improve. There are some nice presentations from 10gen available on mongodb.org that go into more detail and also explain how to keep indexes right-balanced, etc.
At the moment, MongoDB does not support triggers at all. If you want to move documents away before they reach the end of the "cap" then you need to monitor the data usage yourself.
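An illustrative sketch of such monitoring, assuming a capped collection named events_capped and a 90% fill threshold (both assumptions):

from pymongo import MongoClient, ASCENDING

db = MongoClient().mydb
stats = db.command("collstats", "events_capped")

if stats["size"] > 0.9 * stats["maxSize"]:
    # $natural order in a capped collection is insertion order, oldest first.
    # Copy rather than move: capped collections don't allow individual removes.
    for doc in db.events_capped.find().sort("$natural", ASCENDING).limit(1000):
        db.events_archive.replace_one({"_id": doc["_id"]}, doc, upsert=True)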
However, I don't see why you would want a capped collection and also still want to move your items away. If you clarify that in your question, I'll update the answer.