Microsoft Cosmos DB for MongoDB: merge an unsharded collection into a sharded one

I have 2 collections of similar documents (i.e. same structure, different values). One collection (X) is unsharded and in database A; the other collection (Y) is sharded and in database B. When I try to copy collection X into database B, I get an error saying "Shared throughput collection should have a partition key". I also tried copying the data with a forEach insert, but it takes too long.
So my question is, how can I append the data from collection X to collection Y in an efficient way?
The MongoDB version on Cosmos DB is 3.4.6.

You can run an aggregation and add the $merge operator as its last stage (sketched below the comparison table).
| $merge | $out |
| --- | --- |
| Can output to a sharded collection. | Cannot output to a sharded collection. |
| Input collection can also be sharded. | Input collection, however, can be sharded. |
https://docs.mongodb.com/manual/reference/operator/aggregation/merge/#comparison-with-out
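A minimal sketch in the mongo shell, assuming the shell is connected to the cluster hosting both databases (the database and collection names are taken from the question; note that $merge requires MongoDB server 4.2 or newer):
db.getSiblingDB("A").X.aggregate([
    { $merge: { into: { db: "B", coll: "Y" } } }  // append X's documents into the sharded collection Y
])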

So my question is, how can I append the data from collection X to collection Y in an efficient way?
The server tools mongodump and mongorestore can be used: export the source collection's data into BSON dump files and import it into the target collection. These processes are very fast, because the data in the database is already in BSON format.
Data can be exported from a non-sharded collection to a sharded collection using these tools. In this case, the source collection must have the shard-key field (or fields) populated with values. Note that the indexes from the source collection are also exported and imported by these tools.
Here is an example of the scenario in question:
mongodump --db=srcedb --collection=srcecoll --out="C:\mongo\dumps"
This creates a dump directory named after the database. It will contain a "srcecoll.bson" file, which is used for importing.
mongorestore --port 26xxxx --db=trgtdb --collection=trgtcoll --dir="C:\mongo\dumps\srcecoll.bson"
The host/port connects to the mongos of the sharded cluster. Note that the BSON file name needs to be specified in the --dir option.
The import adds data and indexes to the existing sharded collection. The process only inserts data; existing documents cannot be updated. If an _id value from the source collection already exists in the target collection, the process will not overwrite the existing document; the incoming document is simply not imported, and this is not treated as an error.
There are some useful options for mongorestore, such as --noIndexRestore and --dryRun.
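For example, to rehearse the restore without writing any data or rebuilding indexes, the invocation above might become (a hypothetical combination of the flags just mentioned):
mongorestore --port 26xxxx --db=trgtdb --collection=trgtcoll --dryRun --noIndexRestore --dir="C:\mongo\dumps\srcecoll.bson"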

Because the MongoDB version on Cosmos DB is currently 3.4.6, it doesn't support $merge and a number of other commands, such as collection.copyTo(). Using Studio 3T's import feature didn't help either.
The solution I use is to download the target collection to my local MongoDB, clean it, and then write Java code that reads the clean data from the local database and appends it to the target collection with insertMany() (or bulkWrite()).
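The answer uses Java, but the same batched-insert idea can be sketched in the mongo shell; this is a minimal sketch, assuming the shell is connected to the Cosmos DB target and the cleaned data lives in a local database (all hosts and names here are illustrative):
// read cleaned documents from the local MongoDB and append them to the target in batches
var src = new Mongo("localhost:27017").getDB("localdb").getCollection("cleaned");
var tgt = db.getSiblingDB("B").getCollection("Y"); // current connection: the Cosmos DB endpoint
var batch = [];
src.find().forEach(function (doc) {
    batch.push(doc);
    if (batch.length === 1000) {   // flush every 1000 documents
        tgt.insertMany(batch);
        batch = [];
    }
});
if (batch.length > 0) {
    tgt.insertMany(batch);         // flush the remainder
}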
The speed I measured was about 2 hours for 1 million documents (~750 MB); of course, these numbers may vary depending on factors such as network and document size.

Related

How to convert the MongoDB oplog file into an actual query

I want to convert the MongoDB local oplog file into actual, real queries, so I can execute those queries and get an exact copy of the database.
Is there any package, built-in tool, or script for it?
It's not possible to get the exact query from the oplog entry because MongoDB doesn't save the query.
The oplog has an entry for each atomic modification performed. Multi-inserts/updates/deletes performed on the mongo instance using a single query are converted to multiple entries and written to the oplog collection. For example, if we insert 10,000 documents using Bulk.insert(), 10,000 new entries will be created in the oplog collection. Now the same can also be done by firing 10,000 Collection.insertOne() queries. The oplog entries would look identical! There is no way to tell which one actually happened.
Sorry, but that is impossible.
The reason is that the oplog doesn't contain queries. The oplog only records changes (inserts, updates, deletes) to data, and it exists for replication and redo.
Getting an exact copy of a database is called "replication", and that is of course supported by the system.
To "replicate" changes to, for example, a single database or collection, you can use change streams: https://www.mongodb.com/docs/manual/changeStreams/.
You can reconstruct the operations from the oplog. The oplog defines multiple op types, for instance op: "i", "u", "d" for insert, update, and delete. For these types, check the "o"/"o2" fields, which hold the corresponding data and filters.
Then, based on the op type, call the corresponding driver APIs: db.collection.insert()/update()/delete().
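A minimal sketch of that replay loop in the mongo shell, assuming a replica set (where the oplog lives in local.oplog.rs) and older-style update entries where "o" is a full update document rather than a v2 delta:
// replay insert/update/delete oplog entries against their original namespaces
db.getSiblingDB("local").getCollection("oplog.rs").find({ op: { $in: ["i", "u", "d"] } }).forEach(function (entry) {
    var parts = entry.ns.split(".");                       // ns is "<db>.<collection>"
    var coll = db.getSiblingDB(parts[0]).getCollection(parts.slice(1).join("."));
    if (entry.op === "i") {
        coll.insertOne(entry.o);                           // "o" holds the inserted document
    } else if (entry.op === "u") {
        coll.updateOne(entry.o2, entry.o);                 // "o2" is the filter, "o" the update
    } else if (entry.op === "d") {
        coll.deleteOne(entry.o);                           // "o" is the filter for the delete
    }
});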

MongoDB: How to exclude a collection from generating oplog entries

I am processing the MongoDB oplog, and I create a collection in MongoDB to hold the processed data. I don't want this collection to generate oplog entries in turn.
I want all other collections to generate oplog entries, but I need to exclude this one collection. How can I achieve this? Is there any setting to tell MongoDB not to generate oplog entries for a collection?
Any help is appreciated.
Collections in the local database are not part of replication. So, if you create your collection in the local database and insert records into it, no oplog entries are created.
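A minimal sketch in the mongo shell (the collection name is illustrative); writes to the local database are excluded from replication, so they produce no oplog entries:
// store processed results in the non-replicated local database
var processed = db.getSiblingDB("local").getCollection("processedOplog");
processed.insertOne({ ts: new Date(), note: "this write is not recorded in the oplog" });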

Best way to remove all elements of a large mongo collection without lock and performance impact? [duplicate]

How do I truncate a collection in MongoDB or is there such a thing?
Right now I have to delete 6 large collections all at once, and I'm stopping the server, deleting the database files, and then recreating the database and the collections in it. Is there a way to delete the data and leave the collection as it is? The delete operation takes a very long time. I have millions of entries in the collections.
To truncate a collection and keep the indexes, use
db.<collection>.remove({})
You can efficiently drop all data and indexes for a collection with db.collection.drop(). Dropping a collection with a large number of documents and/or indexes will be significantly more efficient than deleting all documents using db.collection.remove({}). The remove() method does the extra housekeeping of updating indexes as documents are deleted, and would be even slower in a replica set environment where the oplog would include entries for each document removed rather than a single collection drop command.
Example using the mongo shell:
var dbName = 'nukeme';
db.getSiblingDB(dbName).getCollectionNames().forEach(function(collName) {
    // Drop all collections except system ones (indexes/profile)
    if (!collName.startsWith("system.")) {
        // Safety hat
        print("WARNING: going to drop ["+dbName+"."+collName+"] in 5s .. hit Ctrl-C if you've changed your mind!");
        sleep(5000);
        // Drop from the named database rather than whatever db the shell is currently using
        db.getSiblingDB(dbName)[collName].drop();
    }
})
It is worth noting that dropping a collection has different outcomes on storage usage depending on the configured storage engine:
WiredTiger (default storage engine in MongoDB 3.2 or newer) will free the space used by a dropped collection (and any associated indexes) once the drop completes.
MMAPv1 (default storage engine in MongoDB 3.0 and older) will not free up preallocated disk space. This may be fine for your use case; the free space is available for reuse when new data is inserted.
If you are instead dropping the database, you generally don't need to explicitly create the collections as they will be created as documents are inserted.
However, here is an example of dropping and recreating the database with the same collection names in the mongo shell:
var dbName = 'nukeme';
// Save the old collection names before dropping the DB
var oldNames = db.getSiblingDB(dbName).getCollectionNames();
// Safety hat
print("WARNING: going to drop ["+dbName+"] in 5s .. hit Ctrl-C if you've changed your mind!")
sleep(5000)
db.getSiblingDB(dbName).dropDatabase();
// Recreate database with the same collection names
oldNames.forEach(function(collName) {
    db.getSiblingDB(dbName).createCollection(collName);
})
The query below will delete all records in a collection and keep the collection as is:
db.collectionname.remove({})
remove() is deprecated in MongoDB 4.
You need to use deleteMany() or other functions instead:
db.<collection>.deleteMany({})
There is no equivalent of a "truncate" operation in MongoDB. You can either remove all documents, which has a complexity of O(n), or drop the collection, which is O(1) but loses your indexes.
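If you want drop-like speed without permanently losing the indexes, one option is to capture the index specifications first and recreate them after the drop; a minimal sketch in the mongo shell (the collection name is illustrative):
// capture index specs, drop the collection, then rebuild the non-default indexes
var specs = db.mycoll.getIndexes();
db.mycoll.drop();
specs.forEach(function (spec) {
    if (spec.name !== "_id_") {   // the _id index is recreated automatically
        db.mycoll.createIndex(spec.key, { name: spec.name });
    }
});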
Create the database and the collections, and then back up the database to BSON files using mongodump:
mongodump --db database-to-use
Then, when you need to drop the database and recreate the previous environment, just use mongorestore:
mongorestore --drop
When you use mongodump, the backup is saved in the current working directory, in a folder named dump.
The db.collection.drop() method obtains a write lock on the affected database and will block other operations until it has completed.
I think using the db.collection.remove({}) method is better than db.collection.drop().

MongoDB: strategy for importing a relational DB CSV dump into highly denormalised MongoDB documents

We want to migrate data to MongoDB using CSV file dumps created from Teradata. The data needs to be refreshed in MongoDB every night from a fresh Teradata CSV dump.
The approach we are thinking of is:
Export the CSV files from the relational DB. They will closely mirror the table structure of the relational DB.
Import the CSV files into MongoDB staging collections that mirror the relational DB structure in terms of normalisation. This could be done with, say, mongoimport in overnight batches (see the sketch after this list). It will result in many collections, as we plan to import each 'type' of CSV into its own collection; e.g. Customers.csv and Accounts.csv will result in two collections of the same names.
Create denormalised collections from the staging collections, ready to be exposed to the UI. Run a schema migration script that queries the staging collections and produces fewer, more denormalised collections ready for use in the application UI. E.g. after running the migration script, the Customers and Accounts collections should result in a third collection, say AccountCustomers, where each account document has an embedded customers array (denormalised, so no joins are needed when the UI requests data).
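For reference, the nightly staging import in step 2 might look like this (a hypothetical invocation; the file and collection names are illustrative, and --drop replaces the previous night's data):
mongoimport --db stagingdb --collection customers --type csv --headerline --drop --file Customers.csv
mongoimport --db stagingdb --collection accounts --type csv --headerline --drop --file Accounts.csv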
Question: Is there a better strategy, given that all the above steps are expected to complete nightly, every night?
Question: Is mongoimport OK to use for importing the CSV files in nightly batches?
Question: What is the best way to migrate (denormalise) collections within the same MongoDB instance? E.g. we have stagingdb with collections customers and accounts, and we want to reach a state where proddb has a collection accountcustomers.
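The denormalisation step described above can be sketched with a $lookup/$out aggregation in the mongo shell; this is a minimal sketch, assuming a shared accountId join key (the field names are illustrative, and plain $out writes into the same database, so the result would still need to be copied or renamed into proddb):
db.getSiblingDB("stagingdb").accounts.aggregate([
    { $lookup: {
        from: "customers",            // join each account to its customers
        localField: "accountId",      // hypothetical join key on accounts
        foreignField: "accountId",    // hypothetical join key on customers
        as: "customers"               // embed the matches as an array
    } },
    { $out: "accountcustomers" }      // materialise the denormalised collection
])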
