Get the size of all the documents in a query - mongodb

Is there a way to get the size of all the documents that meets a certain query in the MongoDB shell?
I'm creating a tool that will use mongodump (see here) with the query option to dump specific data on an external media device. However, I would like to see if all the documents will fit in the external media device before starting the dump. That's why I would like to get the size of all the documents that meet the query.
I am aware of the Object.bsonsize method described here, but it seems that it only returns the size of one document.

Here's the answer that I've found:
var cursor = db.collection.find(...); // Add your query here.
var size = 0;
cursor.forEach(function (doc) {
    size += Object.bsonsize(doc);
});
print(size);
This should output the total size of the documents in bytes fairly accurately.
I ran the command twice. The first time, there were 141,215 documents which, once dumped, took up about 108 MB; the difference between the command's output and the size on disk was 787 bytes.
The second time, there were 35,914,179 documents which, once dumped, took up about 57.8 GB; this time the command's output and the size on disk matched exactly.

Starting in Mongo 4.4, $bsonSize returns the size in bytes of a given document when encoded as BSON.
Thus, in order to sum the bson size of all documents matching your query:
// { d: [1, 2, 3, 4, 5] }
// { a: 1, b: "hello" }
// { c: 1000, a: "world" }
db.collection.aggregate([
{ $group: {
_id: null,
size: { $sum: { $bsonSize: "$$ROOT" } }
}}
])
// { "_id" : null, "size" : 177 }
This $groups all matching documents together and $sums their $bsonSize.
$$ROOT represents the current document whose BSON size is measured.
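If you only need the size of the documents matching a specific query, you can prepend a $match stage (a minimal sketch; the filter is a placeholder for your own query):
db.collection.aggregate([
  { $match: { /* your query here */ } },
  { $group: { _id: null, size: { $sum: { $bsonSize: "$$ROOT" } } } }
])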

Related

Most efficient MongoDB find() query for very large data

I'm trying to implement a search system for a scientific study on 2D astronomical coordinate systems. On one hand, I have a process that generates a lot of geospatial data, which can be thought of as documents holding two coordinates and a string value. On the other hand, I have a small set of data with only one coordinate, which must match at least one of the two coordinates of the generated data.
Simplifying, I have organized data in two collections:
collA avg. size: ~2GB (constant over time)
collB more than 50GB (continuously increasing)
Where:
The schema of a document in collA is:
{
terrainType: 'myType00001',
lat: '000000123',
lon: '987000000'
},
{
terrainType: 'myType00002',
lat: '000000124',
lon: '987000000'
},
{
terrainType: 'myType00003',
lat: '000000124',
lon: '997000000'
}
Please note that, first of all, we created indexes to avoid a COLLSCAN. There are two indexes on collA: __lat_idx (unique) and __long_idx (unique). I can guarantee that the generation process never produces duplicates in the lat and lon columns (as seen above, lat and lon have nine digits, but that is only for simplicity... in the real case these values are extremely large).
The schema of a document in collB is:
{
latOrLon: '0045600'
},
{
latOrLon: '0045622'
},
{
latOrLon: '1145600'
}
I tried some different query strategies.
Strategy A
let cursor = collB.find() // Does this load the entire collection into memory?
cursor.forEach(c => {
    collA.find({ lat: c.latOrLon })
    collA.find({ lon: c.latOrLon })
})
This takes two mongo calls for each document in collB and is extremely slow.
Strategy B
let cursor = collB.find() // Does this load the entire collection into memory?
cursor.forEach(c => {
    collA.find({ $expr: { $or: [{ $eq: ["$lat", c.latOrLon] }, { $eq: ["$lon", c.latOrLon] }] } })
})
This takes one mongo call for each document in collB; faster than A but still slow.
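As an aside (not part of the original question), a plain $or with simple equality clauses should let MongoDB use a separate index for each clause (the existing __lat_idx and __long_idx), which is usually easier for the query planner to optimize than the $expr form:
collA.find({ $or: [{ lat: c.latOrLon }, { lon: c.latOrLon }] })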
Strategy C
CHUNK_SIZE = 5000
batch_cursor = collB.find({}, {'_id': 0}, batch_size=CHUNK_SIZE)
chunks = yield_rows(batch_cursor, CHUNK_SIZE)  # yield_rows defined below

for chunk in chunks:
    docs = [doc['latOrLon'] for doc in chunk]
    res = collA.find({
        '$or': [
            {'lon': {'$in': docs}},
            {'lat': {'$in': docs}}
        ]
    })
This first takes 5000 documents from collB, collects their values into an array and sends a single find() query. Good efficiency: one mongo call for every 5000 documents in collB.
This solution is very fast, but I noticed that it becomes slower as collB grows in size. Why? Since I am using indexes, looking up an indexed value should cost O(1) in terms of computation time... for example, when collB was around 25GB, a full find() took roughly 30 minutes. Now the collection is 50GB and it is taking more than 2 hours.
We expect the DB to reach at least 5TB next month, and this will be a problem.
I am asking the university about the possibility of parallelizing this job using MongoDB sharding, but that will not happen immediately. I am looking for a temporary solution until we can parallelize the job.
Please note that we tried more than these three strategies, mixing Python 3 and NodeJS.
Definition of yield_rows:
def yield_rows(batch_cursor, chunk_size):
    """
    Generator that yields lists of at most chunk_size documents from a cursor.
    :param batch_cursor: pymongo cursor to read from
    :param chunk_size: maximum number of documents per yielded chunk
    :return: yields lists of documents
    """
    chunk = []
    for i, row in enumerate(batch_cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]  # reuse the same list for the next chunk
        chunk.append(row)
    yield chunk
You can simply index the field you are querying on (it doesn't have to be unique); that way only the matching data is returned and MongoDB won't scan the entire collection.
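A minimal sketch of how to verify that the indexes are actually being used (mongo shell; field names and the sample value are taken from the question):
db.collA.find({ $or: [{ lat: '000000123' }, { lon: '000000123' }] }).explain('executionStats')
// In the output, each $or branch should show an IXSCAN stage rather than a COLLSCAN.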

Can somebody please explain the size limit of a document in MongoDB? It says there is a 16MB limit per document but what does it mean?

I have about 3 collections and none of them holds much data. Each has about 50 objects/items that each contain around 200 characters, yet a single collection is taking up ≈270KB of space (which seems like a lot). I don't understand why.
So going back to the question: do those collections each have a limit of 16 MB, or does the limit apply to the entire database? Please help. Thank you.
Here are some examples:
Object.bsonsize({ a: null }) => 8
Object.bsonsize({}) => 5
Object.bsonsize({ _id: ObjectId() }) => 22
Object.bsonsize({a: "fo"}) => 15
Object.bsonsize({a: "foo"}) => 16
Object.bsonsize({ab: "fo"}) => 16
5 bytes seems to be the smallest possible size.
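The arithmetic follows from the BSON format: a document is a 4-byte length prefix, the encoded elements, and a 1-byte terminator; a string element is a 1-byte type tag, the key as a null-terminated string, a 4-byte string length, and the string bytes plus a trailing null. For {a: "fo"} that gives 4 + (1 + 2 + 4 + 3) + 1 = 15 bytes, and adding one character to either the key or the value adds exactly one byte, matching the examples above.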
You can retrieve the BSON size of your documents with this aggregation pipeline:
db.collection.aggregate([{ $project: { size: { $bsonSize: "$$ROOT" } } }])
The maximum size of a document is 16 MiB, which is a hard-coded limit. Each stored document has an _id field, typically holding an ObjectId value, so the minimum size of a stored document is 22 bytes.
According to MongoDB: The Definitive Guide the entire text of "War and Peace" is just 3.14MB, so 16MiB is quite a lot.
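Regarding the ≈270KB figure: what a collection occupies includes index size and allocated storage, not just the BSON size of the documents themselves. A quick way to inspect the breakdown in the mongo shell (the collection name is a placeholder):
db.myCollection.stats()
// Relevant fields: count, size (total BSON size of the documents),
// avgObjSize, storageSize (space allocated on disk) and totalIndexSize.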

Aggregation using $sample

With an aggregation using { $sample: { size: 3 } }, I'll get 3 random documents returned.
How can I use a percentage of all documents instead?
Something that'd look like { $sample: { size: 50% } }?
You cannot do that, as the size argument to $sample must be a positive integer.
If you still need to use $sample, you can get the total count of documents in the collection, take half of it, and then run $sample:
1) Count the documents in the collection and take half (mongo shell):
var halfDocumentsCount = Math.floor(db.yourCollectionName.count() / 2)
print(halfDocumentsCount) // Replace with console.log() in application code
2) $sample for random documents:
db.yourCollectionName.aggregate([{ $sample: { size: halfDocumentsCount } }])
Note:
If you want to get half of the documents in the collection (i.e. 50% of them), $sample might not be a good option - it can become an inefficient query. Also, the result of $sample can contain the same document more than once, so you might not actually get 50% distinct documents. Read more about it here: $sample
If someone is looking for this solution in PHP, just add the following to your aggregation pipeline where needed (i.e. before the projection) and avoid using limit and sort:
[
'$sample' => [
'size' => 30
]
]
Starting in Mongo 4.4, you can use the $sampleRate operator:
// { x: 1 }
// { x: 2 }
// { x: 3 }
// { x: 4 }
// { x: 5 }
// { x: 6 }
db.collection.aggregate([ { $match: { $sampleRate: 0.33 } } ])
// { x: 3 }
// { x: 5 }
This matches a random selection of input documents (roughly 33%). The number of documents selected approximates the sample rate expressed as a percentage of the total number of documents.
Note that this is equivalent to drawing a random number between 0 and 1 for each document and keeping the document if that value is below 0.33. As a consequence, you may get more or fewer documents in the output, and running this several times won't necessarily give the same result.
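For illustration, here is roughly what that equivalent looks like spelled out with the $rand operator (a sketch; $rand also requires a recent MongoDB version, 4.4.x):
db.collection.aggregate([
  { $match: { $expr: { $lt: [{ $rand: {} }, 0.33] } } }
])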

Replace part of an array in a mongo db document

With a document structure like:
{
_id:"1234",
values : [
1,23,... (~ 2000 elements)
]
}
where values represent some time series
I need to update some elements in the values array and I'm looking for an efficient way to do it. The number of elements and the positions to update vary.
I would prefer not to pull the whole array back to the client (application layer), so I'm doing something like:
db.coll.find({ "_id": "1234" })
db.coll.update(
    { "_id": "1234" },
    { $set: {
        "values.100": 123,
        "values.200": 124
    }})
To be more precise, I'm using pymongo and bulk operations:
dc = dict()
dc["values.100"] = 102
dc["values.200"] = 103
bulk = db.coll.initialize_ordered_bulk_op()
bulk.find({"_id": "1234"}).update_one({"$set": dc})
....
bulk.execute()
Would you know of a better way to do it?
Would it be possible to indicate a range in the array, like values from 100 to 110?
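If the positions form a contiguous range (say 100 to 110), one option is to build the $set document programmatically instead of listing each key by hand. A sketch in the mongo shell, where start and newValues are hypothetical inputs:
var start = 100;
var newValues = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]; // replacement values for positions 100..110
var setDoc = {};
newValues.forEach(function (v, i) {
    setDoc["values." + (start + i)] = v;
});
db.coll.updateOne({ _id: "1234" }, { $set: setDoc });
The same pattern works in pymongo by building the dict before passing it to $set, which is essentially what the bulk example above already does.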

Mongo $geoNear query - incorrect nscanned number and incorrect results

I have a collection with around 6k documents with 2dsphere index on location field, example below:
"location" : {
"type" : "Point",
"coordinates" : [
138.576187,
-35.010441
]
}
When using the query below I only get around 450 docs returned, with nscanned around 3k. Every document has a location, and many locations are duplicated. Distances returned for GeoJSON points are in meters, and a distance multiplier of 0.000625 will convert the distances to miles. As a test, I'm expecting a max distance of 32180000000000 to return every document on the planet, i.e. all 6000.
db.x.aggregate([
{"$geoNear":{
"near":{
"type":"Point",
"coordinates":[-0.3658702,51.45686]
},
"distanceField":"distance",
"limit":100000,
"distanceMultiplier":0.000625,
"maxDistance":32180000000000,
"spherical":true,
}}
])
Why don't I get 6000 documents returned? I'm unable to find the logic behind this behaviour in Mongo. I've found this on the mongo forums:
"geoNear's major limitation is that as a command it can return a result set up to the maximum document size as all of the matched documents are returned in a single result document."
I'm pretty sure that MongoDB has a limit of 16 MB on the results of $geoNear. In https://github.com/mongodb/mongo/blob/master/src/mongo/db/commands/geo_near_cmd.cpp you can see that while the results of the geoNear are being built, there's this condition:
// Don't make a too-big result object.
if (resultBuilder.len() + resObj.objsize() > BSONObjMaxUserSize) {
warning() << "Too many geoNear results for query " << rewritten.toString()
<< ", truncating output.";
break;
}
And in https://github.com/mongodb/mongo/blob/master/src/mongo/bson/util/builder.h you'll see it's limited to 16 MB:
const int BSONObjMaxUserSize = 16 * 1024 * 1024;
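As a rough sanity check that this limit is what you are hitting (a sketch; the numbers depend on your data), compare the average document size against the 16 MB cap:
var stats = db.x.stats();
// Approximate number of results that fit in one 16 MB geoNear reply,
// ignoring the small per-result overhead added by the distance field.
var maxResults = Math.floor((16 * 1024 * 1024) / stats.avgObjSize);
print(stats.avgObjSize, maxResults);
If maxResults comes out near the ~450 documents you are seeing, the 16 MB result limit is the likely explanation.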