Getting BSONObj size error even with allowDiskUse true option - mongodb

I have a collection with 300 million documents, and each doc has a user_id field like the following:
{
"user_id": "1234567",
// and other fields
}
I want to get a list of the unique user_ids in the collection, but the following mongo shell command results in an error.
db.collection.aggregate([
{ $group: { _id: null, user_ids: { $addToSet: "$user_id" } } }
], { allowDiskUse: true });
2021-11-23T14:50:28.163+0900 E QUERY [js] uncaught exception: Error: command failed: {
"ok" : 0,
"errmsg" : "Error on remote shard <host>:<port> :: caused by :: BSONObj size: 46032166 (0x2BE6526) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: null",
"code" : 10334,
"codeName" : "BSONObjectTooLarge",
"operationTime" : Timestamp(1637646628, 64),
...
} : aggregate failed :
Why does the error occur even with the allowDiskUse: true option?
The db version is 4.2.16.

You are trying to collect all unique user_ids into a single result document, but apparently the size of that document becomes greater than 16 MB, which causes the issue. allowDiskUse only lets pipeline stages spill to disk; it does not lift the 16 MB limit on any single document the pipeline produces.

distinct may be more useful
db.collection.distinct( "user_id" )
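Note that distinct also returns all of its values inside a single reply document, so with a very large number of unique user_ids it can run into the same 16 MB limit. In that case, a minimal sketch of an alternative that keeps each unique value in its own small result document, so no single document approaches the limit:
db.collection.aggregate([
    { $group: { _id: "$user_id" } }   // one output document per distinct user_id
], { allowDiskUse: true }).forEach(function(doc) {
    print(doc._id);
});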

Related

MongoDB hint() fails - not sure if it is because index is still indexing

In SSH session 1, I ran an operation to create a partial index in MongoDB as follows:
db.scores.createIndex(
... { event_time: 1, "writes.k": 1 },
... { background: true,
... partialFilterExpression: {
... "writes.l_at": null,
... "writes.d_at": null
... }});
The index is quite large and the build takes 30+ minutes. While it was still running, I started SSH session 2.
In SSH session 2 to the cluster, I listed the indexes on my collection scores, and it looks like the index is already there...
db.scores.getIndexes()
[
...,
{
"v" : 1,
"key" : {
"event_time" : 1,
"writes.k" : 1
},
"name" : "event_time_1_writes.k_1",
"ns" : "leaderboard.scores",
"background" : true,
"partialFilterExpression" : {
"writes.l_at" : null,
"writes.d_at" : null
}
}
]
When trying to count with a hint on this index, I get the error below:
db.scores.find().hint('event_time_1_writes.k_1').count()
2019-02-06T22:35:38.857+0000 E QUERY [thread1] Error: count failed: {
"ok" : 0,
"errmsg" : "error processing query: ns=leaderboard.scoresTree: $and\nSort: {}\nProj: {}\n planner returned error: bad hint",
"code" : 2,
"codeName" : "BadValue"
} : _getErrorWithCode#src/mongo/shell/utils.js:25:13
DBQuery.prototype.count#src/mongo/shell/query.js:383:11
#(shell):1:1
I have never seen this error before; can anyone confirm whether it is failing because the index build is still running?
Thanks!
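For what it's worth, one way to check whether the build is still in progress is to look at db.currentOp(). This is only a rough sketch; it assumes the server reports the build with an "Index Build" progress message, which is how builds typically show up in the 3.x/4.x shells:
// list in-progress operations whose message mentions an index build
db.currentOp(true).inprog.forEach(function(op) {
    if (op.msg && op.msg.indexOf("Index Build") !== -1) {
        printjson({ opid: op.opid, msg: op.msg });
    }
});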

Solving "BSONObj size: 17582686 (0x10C4A5E) is invalid" when doing aggregation in MongoDB?

I'm trying to remove duplicate documents in MongoDB in a large collection according to the approach described here:
db.events.aggregate([
{ "$group": {
"_id": { "firstId": "$firstId", "secondId": "$secondId" },
"dups": { "$push": "$_id" },
"count": { "$sum": 1 }
}},
{ "$match": { "count": { "$gt": 1 } }}
], {allowDiskUse:true, cursor:{ batchSize:100 } }).forEach(function(doc) {
doc.dups.shift();
db.events.remove({ "_id": {"$in": doc.dups }});
});
I.e. I want to remove events that have the same "firstId - secondId" combination. However, after a while MongoDB responds with this error:
2016-11-30T14:13:57.403+0000 E QUERY [thread1] Error: getMore command failed: {
"ok" : 0,
"errmsg" : "BSONObj size: 17582686 (0x10C4A5E) is invalid. Size must be between 0 and 16793600(16MB)",
"code" : 10334
}
Is there any way to get around this? I'm using MongoDB 3.2.6.
The error message indicates that some part of the process is attempting to create a document that is larger than the 16 MB document size limit in MongoDB.
Without knowing your data set, I would guess that the size of the collection is sufficiently large that the number of unique firstId / secondId combinations is growing the result set past the document size limit.
If the size of the collection prevents finding all duplicate values in one operation, you may want to try breaking it up, iterating through the collection, and querying to find duplicate values:
db.events.find({}, { "_id" : 0, "firstId" : 1, "secondId" : 1 }).forEach(function(doc) {
    var cnt = db.events.find(
        { "firstId" : doc.firstId, "secondId" : doc.secondId },
        { "_id" : 0, "firstId" : 1, "secondId" : 1 } // explicitly select only the key fields so the index can cover the query
    ).count();
    if (cnt > 1)
        print('Dupe Keys: firstId: ' + doc.firstId + ', secondId: ' + doc.secondId);
});
It's probably not the most efficient implementation, but you get the idea.
Note that this approach relies heavily on the existence of the index { 'firstId' : 1, 'secondId' : 1 }.
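If that index does not exist yet, a minimal sketch of creating it (field names taken from the query above):
db.events.createIndex({ "firstId": 1, "secondId": 1 })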

Retrieve all documents where _ids are in another collection

I want to return all documents from the "usersessions" collection where _ids are in my "users" collection.
I tried the following:
db.usersessions.find( { "userId": { $in: (db.getCollection('users').find({},{"_id":1})) } } )
which returns an error:
Error: error: { "waitedMS" : NumberLong(0), "ok" : 0, "errmsg" : "$in needs an array", "code" : 2 }
As mentioned in the error message, $in needs an array. You can use distinct to return an array of _id values from the "users" collection; this works because _id values are unique within a collection.
var ids = db.getCollection('users').distinct('_id');
db.usersessions.find( { "userId": { "$in": ids } })
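If you would rather keep the matching on the server in a single round trip, here is a $lookup-based sketch. It assumes MongoDB 3.2+ and that usersessions.userId stores the _id of the corresponding user:
db.usersessions.aggregate([
    { $lookup: {
        from: "users",           // collection to join against
        localField: "userId",    // field in usersessions
        foreignField: "_id",     // field in users
        as: "matchedUsers"
    }},
    { $match: { matchedUsers: { $ne: [] } } },  // keep only sessions whose userId exists in users
    { $project: { matchedUsers: 0 } }           // drop the join output, keep the session fields
])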

MongoDB aggregation query

I am using MongoDB 2.6.4 and am still getting an error:
uncaught exception: aggregate failed: {
"errmsg" : "exception: aggregation result exceeds maximum document size (16MB)",
"code" : 16389,
"ok" : 0,
"$gleStats" : {
"lastOpTime" : Timestamp(1422033698000, 105),
"electionId" : ObjectId("542c2900de1d817b13c8d339")
}
}
Reading different advice, I came across the idea of saving the result to another collection using $out. My query now looks like this:
db.audit.aggregate([
    { $match: { "date": { $gte: ISODate("2015-01-22T00:00:00.000Z"),
                          $lt: ISODate("2015-01-23T00:00:00.000Z") } } },
    { $unwind: "$data.items" },
    { $out: "tmp" }
])
But I am getting different error:
uncaught exception: aggregate failed:
{"errmsg" : "exception: insert for $out failed: { lastOp: Timestamp 1422034172000|25, connectionId: 625789, err: \"insertDocument :: caused by :: 11000 E11000 duplicate key error index: duties_and_taxes.tmp.agg_out.5.$_id_ dup key: { : ObjectId('54c12d784c1b2a767b...\", code: 11000, n: 0, ok: 1.0, $gleStats: { lastOpTime: Timestamp 1422034172000|25, electionId: ObjectId('542c2900de1d817b13c8d339') } }",
"code" : 16996,
"ok" : 0,
"$gleStats" : {
"lastOpTime" : Timestamp(1422034172000, 26),
"electionId" : ObjectId("542c2900de1d817b13c8d339")
}
}
Does anyone have a solution?
The error is due to the $unwind step in your pipeline.
When you unwind on an array field with n elements, n copies of the same document are produced, all with the same _id; each copy holds one of the elements from the array that was unwound. See the demonstration below of the records after an unwind operation.
Sample demo:
> db.t.insert({"a":[1,2,3,4]})
WriteResult({ "nInserted" : 1 })
> db.t.aggregate([{$unwind:"$a"}])
{ "_id" : ObjectId("54c28dbe8bc2dadf41e56011"), "a" : 1 }
{ "_id" : ObjectId("54c28dbe8bc2dadf41e56011"), "a" : 2 }
{ "_id" : ObjectId("54c28dbe8bc2dadf41e56011"), "a" : 3 }
{ "_id" : ObjectId("54c28dbe8bc2dadf41e56011"), "a" : 4 }
>
Since all these documents have the same _id, you get a duplicate key exception (caused by the identical _id values across the unwound documents) when they are inserted into the new collection named tmp.
The pipeline will fail to complete if the documents produced by the
pipeline would violate any unique indexes, including the index on the
_id field of the original output collection.
To solve your original problem, you could set the allowDiskUse option to true. It allows the aggregation to use disk space whenever it needs to.
Optional. Enables writing to temporary files. When set to true,
aggregation operations can write data to the _tmp subdirectory in the
dbPath directory. See Perform Large Sort Operation with External Sort
for an example.
as in:
db.audit.aggregate([
    { $match: { "date": { $gte: ISODate("2015-01-22T00:00:00.000Z"),
                          $lt: ISODate("2015-01-23T00:00:00.000Z") } } },
    { $unwind: "$data.items" }  // note, the pipeline (the first argument) ends here
], {
    allowDiskUse: true
});
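If the duplicate _id values are the only thing blocking the $out approach, another possible workaround is to drop _id before the $out stage so the server assigns fresh _id values on insert. This is just a sketch; it assumes an exclusion-only $project on _id behaves this way on your server version:
db.audit.aggregate([
    { $match: { "date": { $gte: ISODate("2015-01-22T00:00:00.000Z"),
                          $lt: ISODate("2015-01-23T00:00:00.000Z") } } },
    { $unwind: "$data.items" },
    { $project: { "_id": 0 } },  // remove the duplicated _id so each unwound document gets a new one
    { $out: "tmp" }
], { allowDiskUse: true });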

Mongodb find near maxdistance

I am firing the following query in MongoDB:
db.acollection.find({
"field.location": {
"$near": [19.0723058, 73.00067739999997]
},
$maxDistance : 100000
}).count()
and getting the following error -
uncaught exception: count failed: {
"shards" : {
},
"cause" : {
"errmsg" : "exception: unknown top level operator: $maxDistance",
"code" : 2,
"ok" : 0
},
"code" : 2,
"ok" : 0,
"errmsg" : "failed on : Shard ShardA"
}
You did it wrong. The $maxDistance argument is a "child" of the $near operator:
db.acollection.find({
"field.location": {
"$near": [19.0723058, 73.00067739999997],
"$maxDistance": 100000
}
}).count()
Has to be within the same expression.
Also look at GeoJSON when you are building a new application; it is the format you should be storing locations in going forward.
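A rough sketch of what that could look like, assuming a 2dsphere index on field.location. Note that GeoJSON coordinates must be ordered [longitude, latitude], so the pair below is swapped on the assumption that 19.07 is the latitude; with $geometry, $maxDistance is measured in metres:
db.acollection.createIndex({ "field.location": "2dsphere" })

db.acollection.find({
    "field.location": {
        "$near": {
            "$geometry": { type: "Point", coordinates: [73.00067739999997, 19.0723058] },
            "$maxDistance": 100000  // metres
        }
    }
}).count()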