Write efficiency difference between Update, Set, and Set with merge - google-cloud-firestore

Cloud Firestore has this limit:
"Maximum writes per second per database of 10,000 (up to 10 MiB per second)"
https://firebase.google.com/docs/firestore/quotas
I was wondering how the 10 MiB per second limit works.
Let's say I have a document doc that is 100 KB, and I run a transaction in which I only want to update 1 KB of it.
1. Using update (passing the entire object)
doc.foo.bar = calculateNewBar(doc.foo.bar)
t.update(docRef, doc);
2. Using update (targeting the specific field)
t.update(docRef, {
  "foo.bar": calculateNewBar(doc.foo.bar),
});
3. Using set (passing the entire object)
doc.foo.bar = calculateNewBar(doc.foo.bar)
t.set(docRef, doc);
4. Using set (with merge)
t.set(docRef, {
  foo: { bar: calculateNewBar(doc.foo.bar) },
}, {
  merge: true,
});
Which way would be the most efficient?
Does Cloud Firestore smartly calculate the diff in all four cases above and only write 1 KB?
Or would options 1 and 3 write the entire 100 KB, while options 2 and 4 write only 1 KB?
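For reference, a minimal sketch of the transaction wrapper assumed around the snippets above, using the Node.js @google-cloud/firestore client; the collection/document path is hypothetical, and calculateNewBar is the question's own placeholder:
const { Firestore } = require('@google-cloud/firestore');
const firestore = new Firestore();
const docRef = firestore.collection('docs').doc('doc-id'); // hypothetical path

async function updateBar() {
  await firestore.runTransaction(async (t) => {
    const snapshot = await t.get(docRef); // read the ~100 KB document
    const doc = snapshot.data();
    // ...one of the four write variants above goes here, e.g. option 2:
    t.update(docRef, { 'foo.bar': calculateNewBar(doc.foo.bar) });
  });
}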

Related

Most efficient MongoDB find() query for very large data

I am trying to implement a search system for a scientific study based on 2D astronomical coordinate systems. On one hand, I have a process that generates a lot of geospatial data, which we can think of as organized into documents of two coordinates and a string value. On the other hand, I have a small set of data with only one coordinate, which must match at least one of the two coordinates of the generated data.
Simplifying, I have organized the data into two collections:
collA: more than 50GB (continuously increasing)
collB: avg. size ~2GB (constant over time)
Where:
The schema of a document in collA is:
{
  terrainType: 'myType00001',
  lat: '000000123',
  lon: '987000000'
},
{
  terrainType: 'myType00002',
  lat: '000000124',
  lon: '987000000'
},
{
  terrainType: 'myType00003',
  lat: '000000124',
  lon: '997000000'
}
Please note that, first of all, we created indexes to avoid a COLLSCAN. There are two indexes on collA: __lat_idx (unique) and __long_idx (unique). I can guarantee that the generation process cannot generate duplicates in the lat and lon columns (as seen above, lat and lon have nine digits, but that is only for simplicity... in the real case these values are extremely large).
The schema of a document in collB is:
{
  latOrLon: '0045600'
},
{
  latOrLon: '0045622'
},
{
  latOrLon: '1145600'
}
I tried some different query strategies.
Strategy A
let cursor = collB.find() // Should load the entire 2GB into memory?
cursor.forEach(c => {
  collA.find({lat: c.latOrLon})
  collA.find({lon: c.latOrLon})
})
This takes two mongo calls for each document in collB: it is extremely slow.
Strategy B
let cursor = collB.find() // Should load the entire 2GB into memory?
cursor.forEach(c => {
  collA.find({ $expr: { $or: [{ $eq: ["$lat", c.latOrLon] }, { $eq: ["$lon", c.latOrLon] }] } })
})
This takes one mongo call for each document in collB; faster than A, but still slow.
Strategy C
CHUNK_SIZE = 5000
batch_cursor = collB.find({}, {'_id': 0}, batch_size=CHUNK_SIZE)
chunks = yield_rows(batch_cursor, CHUNK_SIZE)  # yield_rows defined below

for chunk in chunks:
    docs = []
    for doc in chunk:
        docs.append(doc['latOrLon'])  # collect the chunk's latOrLon values
    res = collA.find({
        '$or': [
            {
                'lon': {
                    '$in': docs
                }
            }, {
                'lat': {
                    '$in': docs
                }
            }
        ]
    })
This first takes 5000 documents from collB, puts their values into an array, and sends a single find() query. Good efficiency: one mongo call for every 5000 documents in collB.
This solution is very fast, but I noticed that it becomes slower as collA increases in size. Why? Since I am using indexes, the search for an indexed value should cost O(1) in terms of computation time... for example, when collA was around 25GB in size, it took roughly 30 minutes to perform a full find(). Now this collection is 50GB in size and it is taking more than 2 hours.
We expect the DB to reach at least 5TB next month, and this will be a problem.
I am asking the university about the possibility of parallelizing this job using MongoDB sharding, but that will not be immediate. I am looking for a temporary solution until we can parallelize this job.
Please note that we tried more than these three strategies, and we mixed Python 3 and NodeJS.
Definition of yield_rows:
def yield_rows(batch_cursor, chunk_size):
    """
    Generator to yield chunks from cursor
    :param batch_cursor:
    :param chunk_size:
    :return:
    """
    chunk = []
    for i, row in enumerate(batch_cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk
You can just index the field through which you're querying (it doesn't have to be unique); that way only the matching data will be returned, and MongoDB won't go through the entire collection's data.
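A sketch of what this looks like in the mongo shell, using the collection and field names from the question (plain, non-unique indexes on the fields the strategies match against in collA):
// Non-unique single-field indexes on the queried fields.
db.collA.createIndex({ lat: 1 })
db.collA.createIndex({ lon: 1 })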

Bucketing and counting for histogram in MongoDB

I want to implement a histogram based on the data stored in MongoDB, getting counts based on bucketing. I have to create the buckets based on only one input value: the number of groups, for example groups = 4.
Consider that there are multiple transactions running and we store the time required as one of the fields. I want to calculate the count of transactions based on the time required to finish the transaction.
How can I use the aggregation framework or map-reduce to create such a bucketing?
Sample data:
{
  "transactions": {
    "149823": {
      "timerequired": 5
    },
    "168243": {
      "timerequired": 4
    },
    "168244": {
      "timerequired": 10
    },
    "168257": {
      "timerequired": 15
    },
    "168258": {
      "timerequired": 8
    },
    "timerequired": 18
  }
}
In the output I want to print the bucket range and the count of transactions that fall into that bucket.
Bucket   Count
0-5      2
5-10     2
10-15    1
15-20    1
Starting with MongoDB version 3.4, the aggregation stages $bucket and $bucketAuto are available. They can easily solve your request:
db.transactions.aggregate([
  {
    $bucketAuto: {
      groupBy: "$timerequired",
      buckets: 4
    }
  }
])
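If you want the exact 0-5 / 5-10 / 10-15 / 15-20 buckets from the expected output rather than automatically chosen ones, $bucket with explicit boundaries is a closer fit. A sketch, assuming (like the answer above) that each transaction is stored as its own document with a timerequired field:
db.transactions.aggregate([
  {
    $bucket: {
      groupBy: "$timerequired",        // field to bucket on
      boundaries: [0, 5, 10, 15, 20],  // buckets [0,5), [5,10), [10,15), [15,20)
      default: "other",                // catch-all for values outside the boundaries
      output: { count: { $sum: 1 } }
    }
  }
])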

What are the sizes returned by `show collections`?

Edit: This question is not about vanilla MongoDB's show collections but about mongo-hacker. See accepted answer and comments.
Using Mongo DB 3.2 + WiredTiger, show collections displays two sizes: s1 / s2.
show collections
coll_1 → 10.361MB / 1.289MB
coll_2 → 0.000MB / 0.004MB
coll_3 → 0.000MB / 0.016MB
coll_4 → 0.001MB / 0.031MB
My guess is these are:
s1: total size of the documents in the database
s2: size of the database on disk (documents + indexes) after compression
Is this correct? I couldn't find any reference in the docs.
You can use the query below to get the size of each collection:
var collectionNames = db.getCollectionNames(),
    stats = [];
collectionNames.forEach(function (n) {
  stats.push(db[n].stats());
});
for (var c in stats) {
  // skip views
  if (!stats[c]["ns"]) continue;
  print(stats[c]["ns"].padEnd(40) + ": " + ('' + stats[c]["size"]).padEnd(12) + " (" + (stats[c]["storageSize"] / 1073741824).toFixed(3).padStart(8) + "GB)");
}
Example output:
cod-prod-db.orders: 35179407 (0.012GB)
cod-prod-db.system.profile: 4323 (0.000GB)
cod-prod-db.users: 21044037 (0.015GB)
Are you using mongo-hacker? By default in MongoDB 3.2.11, show collections doesn't show any size information at all.
The size information provided by mongo-hacker is obtained from the output of db.collection.stats().size which shows you the total uncompressed size of the collection (without indexes), and db.collection.stats().storageSize which shows you the physical storage size. If you enable compression in WiredTiger, the storageSize will typically be smaller than size.
You can find the relevant source code in here: https://github.com/TylerBrock/mongo-hacker/blob/0.0.13/hacks/show.js#L57-L72
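To see for yourself the two raw numbers mongo-hacker combines, you can read them straight from db.collection.stats(); a small sketch using coll_1 from the question's output:
// Sketch: the raw values mongo-hacker formats, for the question's coll_1 collection.
var s = db.coll_1.stats();
print("size (uncompressed data, no indexes): " + s.size);
print("storageSize (physical size on disk):  " + s.storageSize);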
Very similar to #kalaivani's answer; I just refactored it for easier (for me) understanding, and it also prints the sizes in GB:
// Get collection names
var collectionNames = db.getCollectionNames()
var col_stats = [];

// Get stats for every collection
collectionNames.forEach(function (n) {
  col_stats.push(db.getCollection(n).stats());
});

// Print name, size and storageSize (raw bytes and GB)
for (var item of col_stats) {
  print(`${item['ns']} | size: ${item['size']} (${(item['size']/1073741824).toFixed(2)} GB) | storageSize: ${item['storageSize']} (${(item['storageSize']/1073741824).toFixed(2)} GB)`);
}

Get the size of all the documents in a query

Is there a way to get the size of all the documents that meets a certain query in the MongoDB shell?
I'm creating a tool that will use mongodump (see here) with the query option to dump specific data on an external media device. However, I would like to see if all the documents will fit in the external media device before starting the dump. That's why I would like to get the size of all the documents that meet the query.
I am aware of the Object.bsonsize method described here, but it seems that it only returns the size of one document.
Here's the answer that I've found:
var cursor = db.collection.find(...); // Add your query here.
var size = 0;
cursor.forEach(
  function(doc) {
    size += Object.bsonsize(doc)
  }
);
print(size);
This should output the size in bytes of the documents pretty accurately.
I ran the command twice. The first time, there were 141,215 documents which, once dumped, totaled about 108 MB. The difference between the output of the command and the size on disk was 787 bytes.
The second time I ran the command, there were 35,914,179 documents which, once dumped, totaled about 57.8 GB. This time, the size reported by the command exactly matched the real size on disk.
Starting in Mongo 4.4, $bsonSize returns the size in bytes of a given document when encoded as BSON.
Thus, in order to sum the bson size of all documents matching your query:
// { d: [1, 2, 3, 4, 5] }
// { a: 1, b: "hello" }
// { c: 1000, a: "world" }
db.collection.aggregate([
  { $group: {
    _id: null,
    size: { $sum: { $bsonSize: "$$ROOT" } }
  }}
])
// { "_id" : null, "size" : 177 }
This $groups all matching items together and $sums the grouped documents' $bsonSize.
$$ROOT represents the current document from which we get the bsonsize.
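To restrict the sum to the documents matching your mongodump query, a $match stage can go in front of the $group; a sketch with a hypothetical status filter standing in for your actual query:
db.collection.aggregate([
  { $match: { status: "toDump" } },  // hypothetical filter: replace with your mongodump query
  { $group: {
    _id: null,
    size: { $sum: { $bsonSize: "$$ROOT" } }
  }}
])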

How to Cap a Collection based on Size - MongoDB

I'm new to MongoDB and I'm exploring options for capped collections.
I created a capped collection with size: 10, which I assume is in bytes.
Then I inserted a document of size 52 (referring to the size field of db.collection.stats()).
Shouldn't this document be rejected, since its size is greater than 10 bytes?
As the documentation for MongoDB 2.6 says, "If the size field is less than or equal to 4096, then the collection will have a cap of 4096 bytes. Otherwise, MongoDB will raise the provided size to make it an integer multiple of 256." You can see the size MongoDB actually chose by querying system.namespaces:
> // Create collection of size 10.
> db.createCollection('my_collection', {capped: true, size: 10})
{ "ok" : 1 }
> // MongoDB actually sets the size to 4096.
> db.system.namespaces.findOne({name: 'test.my_collection'}).options
{ "capped" : true, "size" : 4096 }