Taking hours to iterate over 300 million MongoDB records - mongodb

I'm iterating over all the documents in a collection on a MongoDB slave using the MongoDB Java API.
Mongo server: 2.4.10
Number of records on the slave: 300 million.
I have one Mongo master and one Mongo slave.
(No sharding done)
The slave receives replicated writes at a very high frequency: about 2,000 insertions and deletions every 10 seconds.
The iteration is taking more than 10 hours.
My goal is to fetch each record in the collection, write it to a CSV, and load it into Redshift.
MongoClient mongo = new MongoClient(mongoHost);
mongo.slaveOk(); // allow reads from the slave
DB db = mongo.getDB(dbName);
DBCollection dbCollection = db.getCollection(dbCollectionName);
Map<String, String> resultMap = new HashMap<>();
DBCursor cursor = dbCollection.find();
while (cursor.hasNext()) {
    DBObject resultObject = cursor.next();
    String uid = (String) ((Map) resultObject.get("user")).get("uid");
    String category = (String) resultObject.get("category");
    resultMap.put(uid, category);
    if (resultMap.size() >= csvUpdateBatchSize) {
        // store to a CSV - append to an existing CSV
    }
}
Is there a way to bring the iteration time down to below 1 hour?
Infrastructure changes can be made too, like adding shards.
Please suggest.

Have you considered performing a parallel mongoexport on your collection?
If you can partition your data with a query (something like a modulo over an id or another indexed field), you can run several mongoexport processes in parallel and pipe their output to your program as standard input.
Your program then handles each document as a JSON line, which you can load into an object representing the document structure with GSON or a similar library,
and eventually run your logic on that object.
Using mongoexport and adding parallelism can improve your performance greatly.
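To make the consumer side concrete, here is a minimal sketch (not from the original post) that reads mongoexport's one-JSON-document-per-line output from standard input, extracts the user.uid and category fields from the question, and appends them to a CSV. The class name, the example partition query in the comment, and the CSV layout are illustrative assumptions.

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStreamReader;

// Reads mongoexport's JSON-per-line output from stdin and appends "uid,category" rows to a CSV.
// Run one instance per partition, e.g. (the partition query here is a placeholder):
//   mongoexport --host <slave> --db <db> --collection <coll> --query '{ "someIndexedField": { "$mod": [4, 0] } }' | java CsvExporter part0.csv
public class CsvExporter {
    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
             BufferedWriter out = new BufferedWriter(new FileWriter(args[0], true))) {
            String line;
            while ((line = in.readLine()) != null) {
                JsonObject doc = gson.fromJson(line, JsonObject.class);
                String uid = doc.getAsJsonObject("user").get("uid").getAsString();
                String category = doc.get("category").getAsString();
                out.write(uid + "," + category);
                out.newLine();
            }
        }
    }
}

Running one mongoexport per partition and one consumer per export spreads the work across cores and connections instead of walking a single cursor for 10+ hours.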

Related

MongoDB takes an hour to delete_many() 1GB of data

We have a 3GB collection in MongoDB 4.2 and this Python 3.7, PyMongo 3.12 function that deletes rows from the collection:
def delete_from_mongo_collection(table_name):
    # connect to the mongo cluster
    cluster = MongoClient(MONGO_URI)
    db = cluster["cbbap"]
    # remove matching rows and return
    query = { 'competitionId': { '$in': [30629, 30630] } }
    db[table_name].delete_many(query)
    return
Here is the relevant info on this collection, note that it has 360MB worth of indexes which are set to speed up retrievals of data from this collection by our Node API, although they may be the problem here.
The delete_many() is part of a pattern where we (a) remove stale data and (b) upload fresh data, each day. However, given that it is taking over an hour to remove the rows matching the query { 'competitionId': { '$in': [30629, 30630] } }, we'd be better off just dropping and re-inserting the entire table. What's frustrating is that competitionId is indexed, and since it is the first field in our compound indexes, I thought it should be very fast to drop rows using it. I wonder if having 360MB of indexes is responsible for the slow deletes?
We cannot use the hint parameter as we have mongoDB 4.2, not 4.4, and we do not want to upgrade to 4.4 yet as we are worried about major breaking changes that may occur in our pipelines and our node API.
What else can be done here to improve the performance of delete_many()?

Performance degradation with Mongo when using bulkWrite with upsert

I am using Mongo Java driver 3.11.1 and Mongo version 4.2.0 for my development. I am still learning Mongo. My application receives data and either has to insert it or replace the existing document, i.e. do an upsert.
Each document is 780-1000 bytes as of now, and each collection can have more than 3 million records.
Approach 1: I tried using findOneAndReplace for each document, and it was taking more than 15 minutes to save the data.
Approach 2: I changed it to bulkWrite using the code below, which took ~6-7 minutes to save 20,000 records.
List<Data> dataList; // incoming records
List<ReplaceOneModel<Document>> updates = new ArrayList<>();
dataList.forEach(data -> {
    Document updatedDocument = new Document(data.getFields());
    updates.add(new ReplaceOneModel<>(eq("DataId", data.getId()), updatedDocument, updateOptions));
});
final BulkWriteResult bulkWriteResult = mongoCollection.bulkWrite(updates);
Approach 3: I tried using collection.insertMany, which takes 2 seconds to store the data.
As per the driver code, insertMany also internally uses MixedBulkWriteOperation for inserting the data, similar to bulkWrite.
My questions are:
a) I have to do an upsert operation. Please let me know if I am making any mistakes.
- I created an index on the DataId field, but it made less than a 2 millisecond difference in performance.
- I tried using a writeConcern of W1, but performance is still the same.
b) Why is insertMany faster than bulkWrite? I could understand a difference of a few seconds, but I cannot figure out why insertMany takes 2-3 seconds while bulkWrite takes 5-7 minutes.
c) Are there any approaches that can be used to solve this situation?
This problem was solved to a great extent by adding an index on the DataId field. Previously I had created an index on DataId, but forgot to create it again after recreating the collection.
This link, How to improve MongoDB insert performance, helped in resolving the problem.
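For reference, a minimal sketch of the upsert path with the index in place, assuming the question's Data class and DataId field; the method shape and the unordered option are illustrative choices, not from the original post.

import static com.mongodb.client.model.Filters.eq;
import com.mongodb.bulk.BulkWriteResult;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.BulkWriteOptions;
import com.mongodb.client.model.Indexes;
import com.mongodb.client.model.ReplaceOneModel;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

static BulkWriteResult upsertBatch(MongoCollection<Document> collection, List<Data> dataList) {
    // Ensure the filter field is indexed so each ReplaceOneModel matches via an index lookup
    // instead of a collection scan (the root cause of the slow bulkWrite above).
    collection.createIndex(Indexes.ascending("DataId"));
    List<ReplaceOneModel<Document>> updates = new ArrayList<>();
    for (Data data : dataList) {
        updates.add(new ReplaceOneModel<>(
                eq("DataId", data.getId()),
                new Document(data.getFields()),
                new ReplaceOptions().upsert(true)));
    }
    // Unordered bulk writes let the server continue past an individual failure.
    return collection.bulkWrite(updates, new BulkWriteOptions().ordered(false));
}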

Getting output of 2 queries in one mongodb call from java morphia

I'm fairly new to MongoDB, but I was wondering if there's a way to get 2 different results from the same MongoDB collection in one database call using the Mongo Java driver with Morphia.
I have a collection accounts and I'm fetching data based on a key accountId. I need the below two results/outputs from this collection in one query:
count of all the documents where accountID is 'xyz'
ResultList of the first N documents where accountID is 'xyz', with the result set sorted by a timestamp field.
To resolve the second scenario I'm using:
..Query....limit(N).order("TimeField").field("TimeField").filter("accountID =", "xyz").asList();
This is working fine as expected, but getting the total count (scenario 1) of all documents with accountId = 'xyz' needs another MongoDB call, which I want to avoid.
MongoDB doesn't support such batching on queries, unfortunately. You'll have to execute two separate calls.
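For illustration, with a recent plain Java driver (not Morphia) the two separate calls might look like the sketch below; the collection name, accountID value, and TimeField follow the question, while the method shape is an assumption.

import static com.mongodb.client.model.Filters.eq;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

static void countAndFirstN(MongoDatabase db, int n) {
    MongoCollection<Document> accounts = db.getCollection("accounts");
    // Call 1: total number of matching documents
    long total = accounts.countDocuments(eq("accountID", "xyz"));
    // Call 2: first n matching documents, sorted by the timestamp field
    List<Document> firstN = accounts.find(eq("accountID", "xyz"))
            .sort(Sorts.ascending("TimeField"))
            .limit(n)
            .into(new ArrayList<>());
    System.out.println(total + " total documents, returned " + firstN.size());
}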

mongodb 3.6 speed up query last records

I have a 30GB MongoDB 3.6 collection with 500k documents. The main _id field is a float timestamp (I didn't manually define an index, but inserted on the _id field, assuming from the documentation that _id is used as an index and automatically maintained).
Now if I query the latest data, I do this in Python 3:
cursor = cry.find().sort('_id', pymongo.DESCENDING).limit(600)
df = list(cursor)
However, just querying the last 600 records takes about 1 minute. How can this be if the index is maintained? Is there a faster way to query (like by natural order), or do I need to re-index, although the documentation says it's done automatically?
I also tried
cursor=cry.find().skip(cry.count() - 1000)
df = list(cursor)
but this is just as slow.

MongoDB: range queries on insertion time with _id and ObjectID

I am trying to use MongoDB's ObjectID to do a range query on the insertion time of a given collection. I can't really find any documentation that this is possible, except for this blog entry: http://mongotips.com/b/a-few-objectid-tricks/
I want to fetch all documents created after a given timestamp. Using the nodejs driver, this is what I have:
var timeId = ObjectId.createFromTime(timestamp);
var query = {
    localUser: userId,
    _id: {$gte: timeId}
};
var cursor = collection.find(query).sort({_id: 1});
I always get the same number of records (19 out of a collection of 27), independent of the timestamp. I noticed that createFromTime only fills the bytes in the ObjectId related to time; the others are left at 0 (like this: 4f6198be0000000000000000).
The reason I'm trying to use an ObjectId for this is that I need the timestamp of when the document is inserted on the MongoDB server, not when the document is passed to the MongoDB driver in Node.
Does anyone know how to make this work, or have another idea for generating and querying insertion times that are set on the MongoDB server?
I'm not sure about the Node.js driver; in Ruby, you can simply apply range queries like this:
jan_id = BSON::ObjectId.from_time(Time.utc(2012, 1, 1))
feb_id = BSON::ObjectId.from_time(Time.utc(2012, 2, 1))
#users.find({'_id' => {'$gte' => jan_id, '$lt' => feb_id}})
Make sure that
var timeId = ObjectId.createFromTime(timestamp) is actually creating an ObjectId.
Also try the query without localUser.
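If it helps, here is a minimal Java-driver sketch of the same trick, building a boundary ObjectId whose non-time bytes are zero (the 4f6198be0000000000000000 pattern from the question); the method and collection handling are assumptions for illustration.

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;
import org.bson.types.ObjectId;

// Returns all documents whose _id timestamp is at or after the given time.
static FindIterable<Document> insertedSince(MongoCollection<Document> collection, long timestampMillis) {
    // The first 4 bytes of an ObjectId carry the seconds-since-epoch; the remaining bytes
    // are zeroed here so the boundary sorts before every real ObjectId from that second.
    String hex = String.format("%08x", timestampMillis / 1000L) + "0000000000000000";
    ObjectId timeId = new ObjectId(hex);
    return collection.find(Filters.gte("_id", timeId)).sort(Sorts.ascending("_id"));
}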