Rebuild collection in pymongo - mongodb

I'd like to "rebuild" my collection atomically, that is, delete all existing documents and repopulate it from scratch.
The problem is that, since transactions are not supported, there is a short window during which the collection is empty, and that is what I want to avoid.
Is there a way to perform this atomically, so that the collection is never empty at any point?

You can build the new data in a collection with a different name, then use the rename command to rename it over the existing collection, dropping the old one in the same step (using the dropTarget=True option).
There are several caveats though:
* The command will invalidate open cursors, which interrupts queries that are currently returning data.
* renameCollection blocks all database activity for the duration of the operation.
* renameCollection is not compatible with sharded collections.
* If the renameCollection operation does not complete, the target collection and indexes will not be usable and will require manual intervention to clean up.
You can find more info in the official docs.
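A minimal pymongo sketch of this approach, assuming `db` is a pymongo `Database` handle. The function name `rebuild_collection` and the `_staging` suffix are made up for illustration; `Collection.rename` passes `dropTarget=True` through to the renameCollection command.

```python
def rebuild_collection(db, name, documents, staging_suffix="_staging"):
    """Fill a staging collection, then swap it over `name` with one rename.

    `db` is assumed to be a pymongo Database; `staging_suffix` is an
    arbitrary, hypothetical naming choice.
    """
    docs = list(documents)
    if not docs:
        # MongoDB creates a collection on first insert, so with no documents
        # there would be nothing to rename.
        raise ValueError("refusing to rebuild from an empty document list")

    staging = db[name + staging_suffix]
    staging.drop()             # clear leftovers from an earlier failed run
    staging.insert_many(docs)  # recreate any needed indexes on `staging` here

    # Replaces the existing collection with the staging one via
    # renameCollection; the caveats listed above still apply.
    staging.rename(name, dropTarget=True)
    return len(docs)
```

With a connected client this could be called as `rebuild_collection(client["mydb"], "items", new_docs)`, where the database and collection names are placeholders. Readers keep seeing the old documents until the rename happens, but any cursor that is open at that moment may be killed.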

How to manually create empty MongoDB index on a new field?

I have a huge collection, more than 2 TiB of data. When releasing a new feature I add an index on a new field that I am 100% sure doesn't exist in any document. MongoDB will still perform a full scan for this field, which may run for a long time.
Is there any hack to manually create an empty index file with a valid structure and notify the MongoDB node about it, so that it loads the index into memory and does everything else MongoDB does when an index is created?
Unlike a relational DBMS, MongoDB also creates indexes on non-existing fields, i.e. it scans the entire collection.
Index creation runs in the background, so it should not do much harm.
See createIndexes
Changed in version 4.2.
For feature compatibility version (fcv) "4.2", all index builds use an optimized build process that holds the exclusive lock only at the beginning and end of the build process. The rest of the build process yields to interleaving read and write operations. MongoDB ignores the background option if specified.
If you run a MongoDB version earlier than 4.2, you can specify the option { background: true }.
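As a rough pymongo sketch of that version rule: the helper below passes `background=True` only for servers older than 4.2. The helper name `create_index_compat` and the `server_version` tuple are assumptions for illustration; `collection` is assumed to be a pymongo `Collection`.

```python
def create_index_compat(collection, field, server_version):
    """Create a single-field index, adding background=True only before 4.2.

    `collection` is assumed to be a pymongo Collection and `server_version`
    a tuple such as (4, 0); both names are illustrative.
    """
    options = {}
    if tuple(server_version[:2]) < (4, 2):
        # Older servers build in the foreground unless told otherwise.
        options["background"] = True
    return collection.create_index(field, **options)
```

The version tuple could be taken from `client.server_info()["versionArray"]`. Either way the server still scans the whole collection; the option only changes how much the build blocks other operations.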

how to convert mongoDB Oplog file into actual query

I want to convert the MongoDB local oplog into actual queries, so that I can execute them and get an exact copy of the database.
Is there any package, built-in tool, or script for this?
It's not possible to get the exact query from the oplog entry because MongoDB doesn't save the query.
The oplog has an entry for each atomic modification performed. Multi-inserts/updates/deletes performed on the mongo instance using a single query are converted to multiple entries and written to the oplog collection. For example, if we insert 10,000 documents using Bulk.insert(), 10,000 new entries will be created in the oplog collection. Now the same can also be done by firing 10,000 Collection.insertOne() queries. The oplog entries would look identical! There is no way to tell which one actually happened.
Sorry, but that is impossible.
The reason is that the oplog doesn't contain queries. It records only changes to data (inserts, updates, deletes), and it exists for replication and redo.
Getting an exact copy of a database is called "replication", and that is of course supported by the system.
To "replicate" changes to, for example, a single database or collection, you can use change streams: https://www.mongodb.com/docs/manual/changeStreams/.
You can reconstruct equivalent operations from the oplog, even though the original query is not stored. The oplog defines several op types: "i", "u" and "d" stand for insert, update and delete. For these types, check the "o" and "o2" fields, which hold the corresponding data and filters.
Then, based on the op type, call the corresponding driver API, such as db.collection.insertOne(), updateOne() or deleteOne().
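A small sketch of that mapping, working on oplog entries as plain dicts. The function name `oplog_entry_to_operation` and the returned tuple shape are invented here; the method names mirror pymongo's `Collection` API. It assumes update entries carry either update operators such as `$set` or a full replacement document. Newer servers may record updates in an internal delta format (`"$v": 2` with a `diff` field), which this sketch refuses rather than guesses at.

```python
def oplog_entry_to_operation(entry):
    """Map one oplog entry to a (method, namespace, *args) tuple.

    Covers only "i", "u" and "d"; returns None for anything else
    (no-ops, commands, ...).
    """
    op, ns = entry.get("op"), entry.get("ns")
    if op == "i":
        return ("insert_one", ns, entry["o"])   # "o" is the inserted document
    if op == "d":
        return ("delete_one", ns, entry["o"])   # "o" is the _id filter
    if op == "u":
        change = entry["o"]                     # "o2" is the _id filter
        if change.get("$v") == 2 and "diff" in change:
            raise NotImplementedError("delta-format update entries are not handled")
        if any(key.startswith("$") and key != "$v" for key in change):
            # Drop the "$v" version marker; it is not an update operator.
            update = {k: v for k, v in change.items() if k != "$v"}
            return ("update_one", ns, entry["o2"], update)
        return ("replace_one", ns, entry["o2"], change)
    return None
```

Replaying the result against a target would then be something like `getattr(collection, method)(*args)`. As the other answers point out, this reproduces the per-document changes, not the query that originally caused them.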

What would happen if I send a query while renaming MongoDB collection?

Let's say I have MongoDB collection A and B, and they are in the same database.
I'm renaming B to A with dropTarget set, so the existing A is dropped.
I know renaming takes a really short time.
But what if I send a query to A while mongo is still renaming B to A?
----------|--------------------------------|---------------------------|-------------------
          rename B to A begins             send query to A             rename B to A done
Will I get the result right away, or will the query wait until the rename is done?
Per the documentation
The db.collection.renameCollection() method and renameCollection command will invalidate open cursors which interrupts queries that are currently returning data.
So you might get dead or killed cursor errors while the collection is being renamed.
Since MongoDB 4.2, operations on the affected collections wait for the rename to finish:
renameCollection() obtains an exclusive lock on the source and target collections for the duration of the operation. All subsequent operations on the collections must wait until renameCollection() completes. Prior to MongoDB 4.2, renaming a collection within the same database with renameCollection required obtaining an exclusive database lock.
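If a query does get interrupted by the rename, one option is simply to run it again. Below is a generic retry sketch; the name `query_with_retry` and its parameters are hypothetical. With pymongo, `retry_on` would presumably be `pymongo.errors.CursorNotFound`, but treat that choice as an assumption and check which exception your driver version actually raises.

```python
import time

def query_with_retry(run_query, retry_on, attempts=3, delay=0.1):
    """Run `run_query()` and retry if it raises one of `retry_on`.

    `run_query` should execute the query and fully consume the cursor,
    e.g. `lambda: list(db.A.find(some_filter))`.
    """
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except retry_on:
            if attempt == attempts:
                raise          # give up after the last attempt
            time.sleep(delay)  # give the rename time to finish
```

Consuming the whole cursor inside `run_query` matters: a cursor killed halfway through iteration has to be restarted from the beginning, and this sketch only retries what happens inside the callable.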

MongoDB multiple update isolation

I'm confused about how MongoDB updates work.
The docs at https://docs.mongodb.com/manual/core/write-operations-atomicity/ say:
In MongoDB, a write operation is atomic on the level of a single
document, even if the operation modifies multiple embedded documents
within a single document.
When a single write operation modifies multiple documents, the
modification of each document is atomic, but the operation as a whole
is not atomic and other operations may interleave.
I take this to mean that if I update all fields of a document, nobody will be able to see a partial update:
* If I read the document before the update, I will see it without any change.
* If I read the document after the update, I will see it with all the changes.
For an update touching multiple documents, the same holds for each document. I guess we could say there is a transaction per document update instead of one big transaction for all of them.
But let's say the multi-document update covers a lot of documents and takes a while to get through all of them. What happens to queries from other threads during the update?
Will they see the old version, or will they be blocked until the update finishes?
Are other updates to the same documents possible during this big update? If so, could such an intermediate update exclude some documents from the big update?
Will they see the old version, or will they be blocked until the update finishes?
I guess other threads may see either the old or the new version of a document, depending on whether they query it before or after its update has finished, but they will never see a partial update of a document (i.e. one field changed and another not).
Are other updates to the same documents possible during this big update? If so, could such an intermediate update exclude some documents from the big update?
Instead of big or small updates, think of 2 threads doing an update on the same document. Thread 1 sets fields {a:1, b:2} and thread 2 sets {b:3, c:4}. If the original document is {a:0, b:0, c:0} then we can have two scenarios:
Update 1 is executed before update 2:
The document will finally be {a:1, b:3, c:4}.
Update 2 is executed before update 1:
The document will finally be {a:1, b:2, c:4}.
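The two orderings can be checked with plain dictionaries. This is only a simulation of `$set` semantics in Python (the helper `apply_set` is made up for this example), not a call to MongoDB:

```python
def apply_set(document, fields):
    """Return a copy of `document` with `fields` applied, like a $set update."""
    updated = dict(document)
    updated.update(fields)
    return updated

original = {"a": 0, "b": 0, "c": 0}
update_1 = {"a": 1, "b": 2}   # thread 1
update_2 = {"b": 3, "c": 4}   # thread 2

one_then_two = apply_set(apply_set(original, update_1), update_2)
two_then_one = apply_set(apply_set(original, update_2), update_1)

print(one_then_two)  # {'a': 1, 'b': 3, 'c': 4}
print(two_then_one)  # {'a': 1, 'b': 2, 'c': 4}
```

Only the field both threads write (`b`) depends on the order; each individual update is still applied to the document as a whole.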

How does mongo rename collection work?

I am confused about how mongo renames collections and how long it will take to rename a very large collection.
Here is the scenario: I have a mongo collection with too much data (588 million documents), which slows down finds and inserts, so I am creating an archive collection to keep all this data.
For this I am thinking of renaming the old collection to oldcollectionname_archive and starting fresh with a new collection named oldcollectionname.
I am planning to do this with the following command:
db.oldCollectionName.renameCollection("oldCollectionName_archive")
But I am not sure how long it will take.
I have read the mongo docs and many Stack Overflow answers about collection renaming, but I could not find any data on whether the size of the collection affects the time required to rename it.
Please help if you have any knowledge of this or similar experience.
Note: I have already read about the other issues that can occur during renaming, in the mongo documentation and other SO answers.
From the mongodb documentation (https://docs.mongodb.com/manual/reference/command/renameCollection/)
renameCollection has different performance implications depending on the target namespace.
If the target database is the same as the source database, renameCollection simply changes the namespace. This is a quick operation.
If the target database differs from the source database, renameCollection copies all documents from the source collection to the target collection. Depending on the size of the collection, this may take longer to complete. Other operations which require exclusive access to the affected databases will be blocked until the rename completes. See What locks are taken by some common client operations? for operations which require exclusive access to all databases.
Note that:
* renameCollection is not compatible with sharded collections.
* renameCollection fails if target is the name of an existing collection and you do not specify dropTarget: true.
I have renamed multiple collections with around 500M documents. It completes in ~0 time.
This is true for MongoDB 3.2 and 3.4, and I would guess also for older versions.
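To measure this on your own data, a pymongo sketch like the following could time the same-database rename. `archive_collection` and the `_archive` suffix are illustrative names, and `db` is assumed to be a pymongo `Database`.

```python
import time

def archive_collection(db, name, suffix="_archive"):
    """Rename `name` to `name + suffix` and return the elapsed seconds."""
    start = time.perf_counter()
    # Same database, so per the docs quoted above only the namespace changes.
    db[name].rename(name + suffix)
    return time.perf_counter() - start
```

After the rename, the first insert into the old name creates a fresh, empty collection. Indexes stay with the archived collection, so they would need to be created again on the new one.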