Unexpectedly High no. of Reads - google-cloud-firestore

I am using Google Cloud Functions, and all read and write operations on Firestore are performed through these Cloud Functions. We are seeing an unexpectedly high number of read operations on Firestore, and I am unable to figure out their source.
No more than 20K documents are generated on a daily basis, yet the daily read count is usually above 25,000,000.
What I am looking for are ways to identify the root cause of this high number of reads in the Cloud Functions.
To start, I captured the size of the results of all the Firestore get() calls in the Cloud Functions. But the sum of all those sizes is far lower than the read count mentioned above.
I need suggestions on ways/practices to identify the source of these high reads.

As a workaround, you can use a snapshot listener, which lets you listen for changes in real time.
If the listener is disconnected for more than 30 minutes, you will be charged for reads as if you had sent a new query. In the worst-case scenario, if the listener is disconnected every 31 minutes, you will be charged 50 reads each time.
As a result, this technique is only practical when the listener is not frequently disconnected.
According to the documentation, you can also reduce the number of reads when using get().
Add a new property named lastModified of type Date (timestamp) to each document in the collection. When you create a new document or edit an existing one, set or update that field's value with FieldValue.serverTimestamp(). Then, instead of fetching the whole collection, query only for documents whose lastModified is newer than your last sync, so unchanged documents are not billed again.
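A minimal, framework-free sketch of this delta-sync pattern (the DeltaSync class and its field names are hypothetical, and the Firestore query mentioned in the comment is only illustrative):

```python
from datetime import datetime, timezone

class DeltaSync:
    """Tracks the newest lastModified seen, so each later fetch can be
    limited to documents changed since the previous sync."""

    def __init__(self):
        # Start at the epoch floor: the very first sync fetches everything.
        self.last_sync = datetime.min.replace(tzinfo=timezone.utc)
        self.cache = {}

    def apply(self, changed_docs):
        """changed_docs: id -> document mapping for documents whose
        lastModified is newer than last_sync, e.g. the result of a query
        like collection.where('lastModified', '>', last_sync).get().
        Returns the number of documents read (billed) in this cycle."""
        for doc_id, doc in changed_docs.items():
            self.cache[doc_id] = doc
            if doc['lastModified'] > self.last_sync:
                self.last_sync = doc['lastModified']
        return len(changed_docs)
```

Only the changed documents count as reads on each cycle; everything else is served from the local cache.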
Reviewing the Firebase documentation, I found that high read or write rates to lexicographically close documents should be avoided to prevent contention errors in your application. This problem is called hotspotting, and it can occur if your application does any of the following:
- Creates new documents at a very high rate and assigns its own monotonically increasing IDs. (Cloud Firestore allocates document IDs using a scatter algorithm; if you create new documents with automatic document IDs, you should not see hotspotting on writes.)
- Creates new documents at a high rate in a collection with few documents.
- Creates new documents at a high rate with a monotonically increasing field, such as a timestamp.
- Deletes documents from a collection at a high rate.
- Writes to the database at a very high rate without gradually increasing traffic.
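The scatter-algorithm point can be illustrated with a sketch of why random auto-IDs avoid monotonic keys; the 20-character alphanumeric format mirrors Firestore's auto-IDs, but auto_id here is a hypothetical stand-in, not the real allocator:

```python
import secrets
import string

# 62-character alphabet, similar to what Firestore auto-IDs draw from.
ALPHABET = string.ascii_letters + string.digits

def auto_id(length=20):
    """Generate a random ID. Successive IDs land in unrelated parts of the
    keyspace, so writes are scattered instead of clustering on one
    lexicographic range (the hotspot described above)."""
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))
```

Contrast this with IDs like `msg-000001`, `msg-000002`, ..., where every new write lands immediately next to the previous one.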

Related

Does Firestore's 500 writes/second limit apply to updates to non-sequential fields in documents with indexed sequential fields?

Let's say we have the following database structure:
[collection]
<documentId>
- indexedSequentialField
- indexedNonSequentialField
- nonIndexedSequentialField
Firestore's 500 writes/second limit will apply to the creation of new documents if indexedSequentialField is present at creation time. Similarly, Firestore's 500 writes/second limit should also apply to any updates that change indexedSequentialField, because those involve rewriting the index entry. This part is clear.
My understanding is that this limit comes from writing the index entries and not to the collection itself.
If that's true, would it be correct to say that making more than 500 updates per second to the documents that change only the indexedNonSequentialField or nonIndexedSequentialField is fine as long as the indexedSequentialField is not changed, even if the indexedSequentialField is already present in the documents and the index entries since their creation?
For the sake of this question, please assume that there are no composite indices present that end up being sequential in nature.
Firestore's hotspots on writes occur when it needs to write data from multiple write operations close to each other on disk, as it must synchronize those writes across multiple data centers while also isolating each write from the others (since its write operations are immediately consistent).
If your collection or collection group has an index with sequential fields, that will trigger a hotspot indeed. Note that the limit of 500 writes per second is a soft limit, and you may well be able to write much more than that before hitting a hotspot. Nowadays I recommend using the Key Visualizer to analyze the performance of your writes.

Limit frequency with which firestore retrieves data

I am using Swift and Firestore, and in my application I have a snapshot listener which retrieves data every time certain documents change. As I expect this to happen many times a second, I would like to limit the snapshot listener to retrieve data only once every 2 seconds, say. Is this possible? I have looked everywhere but could not find anything.
Cloud Firestore stores your data in multiple data centers, and only confirms the write operations once it's written to all of those. For this reason the maximum update frequency of a single document in Cloud Firestore is roughly once per second. So if your plan is to update a document many times per second, that won't work anyway.
There is no way to set a limit on how frequently Firestore broadcasts out updates to the underlying data. If the data gets updated, it is broadcast out to all active listeners.
The typical solution is to limit how frequently you update the data. If nobody is going to see a significant chunk of the updates, you might as well not write them to the database. This sort of logic is often accomplished with a client-side throttle/debounce.
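As a sketch of such a client-side throttle (the Throttle class is hypothetical; the injectable clock exists only to make the logic testable):

```python
import time

class Throttle:
    """Accepts at most one event per `interval` seconds; events arriving
    sooner than that after the last accepted one are dropped."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock              # injectable for testing
        self.last = float('-inf')      # time of the last accepted event

    def should_fire(self):
        """Return True if enough time has passed to accept this event."""
        now = self.clock()
        if now - self.last >= self.interval:
            self.last = now
            return True
        return False
```

In the original scenario you would gate the write (not the listener) with `should_fire()`, so the database only sees one update per interval.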

Firestore: Does reading data via references increase the number of requests?

When a document is read from Firestore, Firestore does not return the data of any references it contains, so I am currently requesting the referenced data from Firestore separately using the reference path. Does this increase the number of requests to the server, and eventually decrease performance and increase pricing? How is storing references helpful in terms of requesting data from the server?
Reading a document that has a reference counts as a read of that document. Reading the referenced document counts as a read of another document. So in total that is two reads.
There is no hidden cost-inflation here: if the server were to automatically follow the reference, it would also have to read both documents.
If you're looking to minimize the number of documents you read, you can consider adding the minimum data you need from the referenced document into the document containing the reference. For example, if you have a chat app:
- you might want to include the display name of each user posting the message in the message itself, so that you don't have to read the user's profile document.
- if you do so, you'll have to consider what to do if the user updates their display name. See my answer here for some options: How to write denormalized data in Firebase
- the number of users is likely smaller than the number of chat messages (and rather limited in a specific time frame), making the number of reads of linked documents lower than the number of messages.
- by duplicating the data, you may be inflating the bandwidth usage, especially if the number of users is much lower than the number of messages.
What this boils down to is: you're likely optimizing prematurely, but even if not: there's no one-size-fits-all approach. NoSQL data modeling depends on the use-cases of your app, and Firestore is no different.
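To make the trade-off concrete, here is a minimal sketch of the denormalization approach for the chat example above (all function and field names are hypothetical):

```python
def build_message(user_profile, text):
    """Denormalize: copy the display name into the message, so rendering
    the chat never needs an extra read of the user's profile document."""
    return {
        'text': text,
        'senderId': user_profile['id'],
        'senderName': user_profile['displayName'],  # duplicated on purpose
    }

def rename_user(messages, user_id, new_name):
    """The cost of denormalizing: when a user changes their display name,
    every message carrying the old copy must be updated (fan-out write)."""
    for m in messages:
        if m['senderId'] == user_id:
            m['senderName'] = new_name
```

The read savings on every message render are paid for with occasional fan-out writes on rename, which is usually the right trade when reads vastly outnumber renames.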

Monitoring usage of capped collections

I'm using MongoDB's awesome capped collections + tailable cursors for message-passing between the different processes in my system. I have many such collections, for the different types of messages, to which documents are written at variable rates and sizes. Per collection, writing rates can vary a lot, but it should be easy to derive a typical/conservative upper bound on document sizes and rates from past/ongoing action.
In addition, I have a periodic job (once an hour) which queries the messages and archives them. It is important that all messages are archived, i.e. must not be dropped before the job gets a chance to archive them. (Archived messages are written to files.)
What I would like to do is some kind of size/rate monitoring, which would allow figuring out an upper bound on message sizes and rates, based on which I can decide on a good size for my capped collections.
I plan to set up a monitoring tool, run it for a while to gather information, then analyse it and decide on good sizes for my capped collections. The goal is, of course, to keep them small enough not to take too much memory, but big enough to make dropped messages improbable.
This is the information which I think can help:
number of messages and total size written in the last hour (average, over time)
how long it takes to complete a "full cycle" (on average, over time)
is the collection bound by the max-bytes or the max-documents limit
What is the best way to find this information, and is there any other stat which seems relevant?
Tips about how to integrate that with Graphite/Carbon would also be great!
Set up the StatsD-Graphite stack and begin sending metrics to it.
The information you want to send can be sent from any language that can send a message over UDP.
There are language bindings for all common languages (PHP, Python, Ruby, C++, Java, etc.), so that shouldn't be a problem.
Once you have this working from a technical standpoint, you can focus on the other things you'd like to measure.
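A minimal sketch of sending a StatsD counter over UDP, using the plain-text counter format `<metric>:<value>|c` and StatsD's conventional default port 8125 (the function name is hypothetical):

```python
import socket

def send_counter(name, value=1, host='127.0.0.1', port=8125):
    """Fire-and-forget a StatsD counter increment over UDP."""
    payload = f'{name}:{value}|c'.encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload  # returned only so callers/tests can inspect it
```

Because it is UDP, the send never blocks the message-passing hot path even if the StatsD daemon is down.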
Failing to find an out-of-the-box solution, and getting no answers here, here's what I ended up doing.
I set up a process which:
1. registers to all of the message-passing capped collections in my MongoDB (using a tailable-cursor query), with a thread per collection.
2. keeps message counters per collection per time unit (the time unit is 10 minutes, i.e. every 10 minutes I start a new counter, while keeping all the old ones in memory).
3. periodically queries the stats of the capped collections (size, number of documents, and the limits), and also keeps all this data in memory.
Then I let it run for a week and checked its state. This way, I got a very good picture of the activity during the week.
For 1., I used projection to keep it as lightweight as possible, only retrieving the ID and extracting the timestamp from it.
The data collected in 3. was used to figure out whether the collections are bound by the size limit or the number-of-documents limit.
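Step 2 above (counters per collection per 10-minute time unit) could be sketched like this; RateMonitor and its method names are hypothetical:

```python
from collections import defaultdict

class RateMonitor:
    """Buckets message counts and byte totals per collection per fixed
    time unit, so a conservative peak rate can be read off afterwards."""

    UNIT = 600  # bucket width in seconds (10 minutes)

    def __init__(self):
        # (collection, bucket_index) -> [message_count, total_bytes]
        self.buckets = defaultdict(lambda: [0, 0])

    def record(self, collection, timestamp, size_bytes):
        """Account one message (timestamp in seconds) into its bucket."""
        bucket = int(timestamp // self.UNIT)
        entry = self.buckets[(collection, bucket)]
        entry[0] += 1
        entry[1] += size_bytes

    def peak_rate(self, collection):
        """Max messages observed in any single time unit: the upper bound
        used to size the capped collection."""
        counts = [c for (name, _), (c, _) in self.buckets.items()
                  if name == collection]
        return max(counts, default=0)
```

Multiplying the peak per-unit count (or byte total) by the archiving period gives a lower bound on a safe capped-collection size.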

Mongodb: keeping a frequently written collection in RAM

I am collecting data from a streaming API and I want to create a real-time analytics dashboard. This dashboard will display a simple timeseries plotting the number of documents per hour. I am wondering if my current approach is optimal.
In the following example, on_data is fired for each new document in the stream.
# Mongo collections.
records = db.records
stats = db.records.statistics
def on_data(self, data):
    # Create a JSON document from data.
    document = simplejson.loads(data)
    # Insert the new document into records.
    records.insert(document)
    # Update a counter in records.statistics for the hour this document belongs to.
    stats.update({'hour': document['hour']}, {'$inc': {document['hour']: 1}}, upsert=True)
The above works. I get a beautiful graph which plots the number of documents per hour. My question is about whether this approach is optimal or not. I am making two Mongo requests per document. The first inserts the document, the second updates a counter. The stream sends approximately 10 new documents a second.
Is there for example anyway to tell Mongo to keep the db.records.statistics in RAM? I imagine this would greatly reduce disk access on my server.
MongoDB uses memory-mapped files to handle file I/O, so it essentially treats all data as if it were already in RAM and lets the OS figure out the details. In short, you cannot force your collection to be in memory, but if the operating system handles things well, the stuff that matters will be. Check out this link to the docs for more info on Mongo's memory model and how to optimize your OS configuration for your use case: http://docs.mongodb.org/manual/faq/storage/
But to answer your issue specifically: you should be fine. Your 10 or 20 writes per second should not be a disk bottleneck in any case (assuming you are running on not-ancient hardware). The one thing I would suggest is to build an index over "hour" in stats, if you are not already doing that, to make your updates find documents much faster.
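If the two-requests-per-document pattern ever does become a bottleneck, one option is to buffer incoming documents and flush them with a single insert_many. A minimal sketch (BatchedInserter is hypothetical; `collection` can be any object exposing insert_many, such as a PyMongo collection):

```python
class BatchedInserter:
    """Buffers documents and writes them in one insert_many per batch,
    cutting the per-document round trips to the server."""

    def __init__(self, collection, batch_size=50):
        self.collection = collection
        self.batch_size = batch_size
        self.buffer = []

    def add(self, document):
        self.buffer.append(document)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write any buffered documents; call this on shutdown too."""
        if self.buffer:
            self.collection.insert_many(self.buffer)
            self.buffer = []
```

The trade-off is durability: documents sitting in the buffer are lost if the process dies before a flush, which may or may not matter for dashboard statistics.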