cloud_firestore package: different behaviour, equivalent queries - flutter

I am running a Flutter mobile app that queries data points from Firestore. Until very recently, I have been running the following query:
return firestore
    .collection('organisations/$organisationId/alerts/$alertId/deviceTrails/$deviceTrailId/markers')
    .where('deviceCreatedUtc', isGreaterThanOrEqualTo: timestamp)
    .snapshots().handleError(handleFirestoreError);
I found, while running this query, that it would work well and provide a certain number of snapshots, but that after a period of a few minutes it would stop generating snapshots without throwing any errors. Changing the query to the following seemed to resolve the issue (snapshots became more reliable):
return firestore
    .collection('organisations/$organisationId/alerts/$alertId/deviceTrails/$deviceTrailId/markers')
    .orderBy('deviceCreatedUtc')
    .startAt([timestamp])
    .snapshots().handleError(handleFirestoreError);
Other than the ordering (which is not strictly necessary in my case, since I am adding the points to my on-device database instead of using them directly), there does not appear to be much in the way of functional difference between these queries. But the former fails silently, while the latter is more reliable.
Is there any reason why this would happen? And is one of the queries intrinsically more efficient than the other?

As mentioned by Frank van Puffelen, the two snippets should do exactly the same thing (apart from the ordering, as you said). You can file a bug on the repo here:
For more information, you can refer to the documentation on listening to a document with the onSnapshot() method, and the documentation on working with lists of data in Flutter with Firebase.
index.js
const query = db.collection('cities').where('state', '==', 'CA');
const observer = query.onSnapshot(querySnapshot => {
  console.log(`Received query snapshot of size ${querySnapshot.size}`);
  // ...
}, err => {
  console.log(`Encountered error: ${err}`);
});

Related

Comparing find results in an active environment with transactions

I'm looking to compare two find results from MongoDB, since we are evaluating a different query structure that should yield the same results.
I want to execute the queries to run in parallel with the same start time, and then compare that the results are an exact match.
The system is very active and documents are updated and deleted very often.
Does it make sense to execute the two finds in a transaction ?
Yes, it makes sense. In your case, you need to prevent phantom reads, which are guaranteed not to occur in a MongoDB transaction, whose isolation level is snapshot. Here is a test I ran in Node.js, which you can find here:
test('Newly inserted rows are invisible, no phantom reads', async () => {
  await transactionWithCustomOptions({}, async (client, session, db) => {
    const beforeInsertUsers = await db.collection(ACCOUNT_COLL).countDocuments({}, { session });
    await accountRepository.insertAccount("test3", 2000);
    const afterInsertUsers = await db.collection(ACCOUNT_COLL).countDocuments({}, { session });
    expect(beforeInsertUsers).toEqual(afterInsertUsers);
  });
});
This test asserts that transactional operations are completely isolated from outside operations. For the theoretical part, I wrote a report about MongoDB transactions: you might be interested in the paragraphs WiredTiger cache and Snapshot isolation.

Contention-friendly database architecture for large documents and inner arrays

Context
I have a database with a collection of documents using this schema (shortened schema because some data is irrelevant to my problem):
{
  title: string;
  order: number;
  ...
  ...
  ...
  modificationsHistory: HistoryEntry[];
  items: ListRow[];
  finalItems: ListRow[];
  ...
  ...
  ...
}
These documents can easily reach 100 or 200 kB, depending on the amount of items and finalItems that they hold. It's also very important that they are updated as fast as possible, with the smallest bandwidth usage possible.
This is inside a web application context, using Angular 9 and @angular/fire 6.0.0.
Problems
When the end user edits one item inside the object's items array (e.g. editing just a single property), reflecting that change in the database requires me to send the entire object, because Firestore's update method doesn't support array indexes inside the field path; the only operations that can be done on arrays are adding or removing an element, as described in the documentation.
However, updating an element of the items array by sending the entire document results in poor performance for anyone without a good connection, which is the case for a lot of my users.
The second issue is that having everything in real time inside one document makes collaboration hard in my case, because some of these elements can be edited by multiple users at the same time, which creates two problems:
Some write operations may fail due to too much contention on the document if two updates are made within the same second.
The updates are not atomic, since we send the entire document at once and don't use transactions, in order to avoid using even more bandwidth.
Solutions I already tried
Subcollections
Description
This was a very simple solution: create a subcollection for the items, finalItems and modificationsHistory arrays, making them easy to edit since each element now has its own ID, which makes it easy to reach and update.
Why it didn't work
Having a list with 10 finalItems, 30 items and 50 entries inside modificationsHistory means that I need a total of 4 listeners open for one element to be listened to entirely. Considering that a user can have many of these elements open at once, listening to several dozen documents creates an equally bad performance situation, probably even worse with a full user load.
It also means that if I want to update a big element with 100 items and I want to update half of them, it will cost me one write operation per item, not to mention the read operations needed to check permissions, etc., probably 3 per write, so 150 reads + 50 writes just to update 50 items in an array.
Cloud Function to update the document
// Setup assumed for this snippet (not shown in the original post):
// firebase-functions v1, firebase-admin, and example runtime options.
const functions = require('firebase-functions');
const admin = require('firebase-admin');
const { applyPatch } = require('fast-json-patch');

admin.initializeApp();
const firestore = admin.firestore();
const runtimeOpts = { timeoutSeconds: 60, memory: '256MB' }; // example values

// Applies "custom" diff entries that carry numeric offsets instead of absolute values.
function applyOffsets(data, entries) {
  entries.forEach(customEntry => {
    const explodedPath = customEntry.path.split('/');
    explodedPath.shift();
    let pointer = data;
    for (const fragment of explodedPath.slice(0, -1)) {
      pointer = pointer[fragment];
    }
    pointer[explodedPath[explodedPath.length - 1]] += customEntry.offset;
  });
  return data;
}

exports.updateList = functions.runWith(runtimeOpts).https.onCall((data, context) => {
  const listRef = firestore.collection('lists').doc(data.uid);
  return firestore.runTransaction(transaction => {
    return transaction.get(listRef).then(listDoc => {
      const list = listDoc.data();
      try {
        // Split the received JSON Patch diff into standard and custom (offset) entries.
        const [standard, custom] = JSON.parse(data.diff).reduce((acc, entry) => {
          if (entry.custom) {
            acc[1].push(entry);
          } else {
            acc[0].push(entry);
          }
          return acc;
        }, [[], []]);
        applyPatch(list, standard);
        applyOffsets(list, custom);
        transaction.set(listRef, list);
      } catch (e) {
        console.log(data.diff);
      }
    });
  });
});
Description
Using a diff library, I was making a diff between the previous document and the new updated one, and sending that diff to a GCF that performed the update using the transaction API.
The benefit of this approach is that, since the transaction happens inside the GCF, it's fast and doesn't consume too much bandwidth, and the update only requires a diff to be sent, not the entire document anymore.
Why it didn't work
In reality, the cloud function was really slow, and some updates took over 2 seconds to complete. They could also fail due to contention without the Firestore connector knowing it, so there was no way to ensure data integrity in this case.
This section will be edited accordingly to add more solutions if I find other things to try.
Question
I feel like I'm missing something, as if Firestore had a feature I just don't know about that could solve my use case, but I can't figure out what it is; maybe my previously tested solutions were badly implemented, or I missed something important. What did I miss? Is it even possible to achieve what I want to do? I am open to data remodeling, query changes, anything, as it's mostly for learning purposes.
You should be able to reduce the bandwidth required to update your documents by using Maps instead of Arrays to store your data. This would allow you to send only the item that is being updated using its key.
I don't know how involved this would be for you to change, but it sounds like less work than the other options.
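For illustration, here is a minimal sketch of what such a map-based update could look like with the web SDK; the names listId and itemId and the layout of the items map are assumptions for the example, not your actual schema:

// Hypothetical sketch: items stored as a map keyed by item id rather than an array.
// With a map, Firestore's dot notation in the field path lets you update a single
// nested field, so only that field is sent, not the whole document.
const listRef = firebase.firestore().collection('lists').doc(listId);

listRef.update({
  [`items.${itemId}.title`]: 'New title',   // updates only this item's title
  [`items.${itemId}.quantity`]: 3           // and this item's quantity
});

Note that this still counts as a write to the whole document as far as contention is concerned; it only reduces the payload sent over the wire.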
You said that it's not impossible for your documents to reach 200 kB individually. It would be good to keep in mind that Firestore limits document size to 1 MB. If you plan on supporting documents beyond that, you will need to find a way to fragment the data.
Regarding your contention issues... You might consider a system that "locks" the document and prevents it from receiving updates while another user is attempting to save. You could use a simple message system built with websockets or Firebase FCM to do this. A client would subscribe to the document's channel, and publish when they are attempting an update. Other clients would then receive a notice that the document is being updated and have to wait before they can save their own changes.
Also, I don't know what the contents of modificationsHistory look like, but that sounds to me like the type of data that you might keep in a subcollection instead.
Of the solutions you tried, the subcollection seems like the most scalable to me. You could look into the possibility of not using onSnapshot listeners and instead create your own event system to notify clients of changes. I suppose it could work similar to the "locking" system I mentioned above. A client sends an event when it updates an item belonging to a document. Other clients subscribed to that document's channel will know to check the database for the newest version.
Your diff-approach appeared mostly sensible, details aside.
You should store items inline, but defer modificationsHistory into a subcollection. In the root document, record which elements of modificationsHistory have already been merged (a timestamp should suffice); all elements not yet merged have to be re-applied individually on each client, queried with the aforementioned timestamp.
Each entry in modificationsHistory should not describe a single diff, but whenever possible a set of diffs.
Apply changes from the modificationsHistory collection onto items in batch, deferred via a GCF (see the sketch below). You may defer this arbitrarily far, and you may want to exclude modifications performed in the last few seconds, to account for consistency that has not yet been established in Firestore. There is no risk of contention that way.
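A rough sketch of what that deferred batch merge could look like, assuming a scheduled GCF, a lists collection with a modificationsHistory subcollection, a numeric createdAt timestamp (milliseconds) on each patch entry, and a mergedUpTo marker on the root document; all of these names are illustrative, not taken from the question:

const functions = require('firebase-functions');
const admin = require('firebase-admin');
const { applyPatch } = require('fast-json-patch');

admin.initializeApp();
const firestore = admin.firestore();

exports.mergePatches = functions.pubsub.schedule('every 5 minutes').onRun(async () => {
  const lists = await firestore.collection('lists').get();
  for (const listDoc of lists.docs) {
    await firestore.runTransaction(async (tx) => {
      const fresh = await tx.get(listDoc.ref);
      const list = fresh.data();
      // Skip patches written in the last few seconds so very recent writes settle first.
      const cutoff = Date.now() - 10 * 1000;
      const patches = await tx.get(
        listDoc.ref
          .collection('modificationsHistory')
          .where('createdAt', '>', list.mergedUpTo || 0)
          .where('createdAt', '<', cutoff)
          .orderBy('createdAt')
      );
      if (patches.empty) return;
      // Apply all pending diffs in order, then advance the merge marker.
      patches.docs.forEach((p) => applyPatch(list, JSON.parse(p.data().diff)));
      list.mergedUpTo = patches.docs[patches.docs.length - 1].data().createdAt;
      tx.set(listDoc.ref, list);
    });
  }
});

Since only this function ever rewrites the root document, and clients only insert into modificationsHistory, the root document has a single writer and contention stays out of the hot path.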
Cleanup of the modificationsHistory collection has to be deferred even further, until you can be sure that no client still has access to an older revision of the root document, especially if you consider that the client is not strictly required to update the root document when the listener is triggered.
You may need to reconstruct the patch stack on the client side if modificationsHistory changes in unexpected ways due to eventual-consistency constraints. E.g. if you have a total order in the set of patches, you need to re-apply the patch stack from the base image if the collection suddenly contains "older" patches unknown to the client before.
All in all, you should be able to avoid frequent updates altogether, and limit writes solely to inserts into the modificationsHistory subcollection, with bandwidth requirements not exceeding the cost of fetching the entire document once, plus streaming the collection of not-yet-applied patches. No contention expected.
You can tweak how long clients may ignore hard updates to the root document, and how many changes they may batch client-side before submitting a new diff. The latter is also a trade-off with regard to how many documents another client has to fetch initially, given max-documents-per-query limits.
If you require other information that is likely to suffer from contention, like the list of users currently having a specific document open, that should go into subcollections as well.
Should the latency for seeing changes by other users eventually turn out to be unacceptable, you may opt for an additional, real-time-capable data channel for distributing patches on a specific document: ActiveMQ or some other message broker operated on dedicated resources, running independently from Firestore.

Implement a Firestore infinite scrolling list which updates on collection changes

What am I trying to accomplish?
I am currently facing a bunch of problems implementing a real time updated infinite scrolling list with the firestore backend.
In my application I want to display comments (like e.g. YouTube or other social media sites do) to the user. Since the number of comments in a collection might be quite big, I see the need to paginate the collection while still receiving real-time updates based on snapshots. So I initially load x comments, with the option to load up to x more items whenever the user presses a button. In my example, x = 3.
The standard solution
Based on other SO questions I figured out that one is supposed to use the .limit() and the .startAfter() methods to implement such behaviour.
So the first page is loaded as:
query = this
    .collection
    .orderBy('date', descending: true)
    .limit(pageSize);

query.snapshots().map((QuerySnapshot snap) {
  lastVisible = snap.documents.last;
  // convert the DocumentSnapshot into model object
});
All additional pages are loaded with the following code:
query = this.collection
    .orderBy('date', descending: true)
    .startAfterDocument(lastVisible)
    .limit(pageSize);
Furthermore, I'd like to add that this code is located in a repository class which is used with the BLoC pattern similar to the code shown in Felix Angelov's Flutter Todos Tutorial.
While Felix uses a simple flutter list to show the items, I have a list of pages showing comments based on the data provided by their BLoCs. Note that each BLoC accesses a shared repository (parts of the repository code is shown below).
The Problem with the standard solution
With the code shown above I see multiple problems:
1. If a comment is inserted in the middle of the ordered collection (how is not important), the added comment is shown because of the stream provided by the snapshot. However, another comment that already existed is no longer shown because of the .limit() operator in the query. One could increase the limit by one, but I'm not sure how to edit a snapshot query. If editing a snapshot query is not possible, one could create a new (and bigger) query, but that would cost additional reads.
2. Similarly to 1., if a comment in the middle is deleted, the snapshot will return a list which no longer contains the deleted comment, but another comment (which is already covered by a different page) appears. E.g., in the scenario above, 5 comments are loaded; assuming that comment 3 is deleted, comment 2 will show twice.
Improving the standard solution
Based on these two problems discussed above, I decided that the solution is not sufficient and I implemented a solution which first loads x items by obtaining two "interval" documents. Then a query which fetches the required items in an interval using .startAtDocument() and .endAtDocument() is created, which eliminates the .limit() operator.
DocumentSnapshot pageStartDocument;
DocumentSnapshot pageEndDocument;

Future<Stream<List<Comment>>> comments() async {
  // This fetches the first and last document of the page as initialization
  // (maybe should be implemented in the constructor)
  if (pageStartDocument == null) {
    Query query = collection
        .orderBy('date', descending: true)
        .limit(pageSize);
    QuerySnapshot snap = await query.getDocuments();
    pageStartDocument = snap.documents.first;
    pageEndDocument = snap.documents.last;
  } else {
    Query query = collection
        .orderBy('date', descending: true)
        .startAfterDocument(pageEndDocument)
        .limit(pageSize);
    QuerySnapshot snap = await query.getDocuments();
    pageStartDocument = snap.documents.first;
    pageEndDocument = snap.documents.last;
  }

  // This fetches a subset of elements from the collection
  // with the tradeoff of doubling the reads
  Query query = this
      .collection
      .orderBy('date', descending: true)
      .startAtDocument(pageStartDocument)
      .endAtDocument(pageEndDocument);

  return query.snapshots().asyncMap((QuerySnapshot snap) async {
    // convert the QuerySnapshot into model objects
  });
}
As commented in the code, this solution has the following drawback:
Since a query is required to obtain the pageStartDocument and pageEndDocument, the number of reads is doubled, because all the data is read again when the second query is created. The performance impact might be negligible because I believe the data is cached, but having 2x the database read cost can be significant.
Question:
Since I am not only implementing pagination but also real-time updates (with collection insertions), the .limit() operator does not seem to work in my case.
How does one implement a pagination with real time updates (without double reads)?
Side Notes:
I watched how Todd Kerpelman devours a massive gummy bear while explaining pagination, but in that video it does not seem so trivial (and a point was made that a trade-off might be necessary).
If further code from my side is required please say so in the comments.
For the scenario of comments it does not really make sense for an item to be inserted into the middle of the (sorted) collection. However, I would like to understand how it should be implemented if the scenario requires such a feature.
This may come as a very late answer, and the OP probably won't need help anymore, but for anyone who stumbles on this, I wrote a tutorial with a solution that partly solves this:
The BLoC keeps a list of stream subscriptions to keep track of real-time updates to the list.
However, concerning the insertion problem: since you have paginated streams based on a document cursor, upon insertion or deletion you necessarily need to reset your pagination stream subscriptions, unless it is the last page.
Hence my solution was to update the list when modifications occur, but reset it when insertions or deletions occur.
Here is the link to the tutorial :
https://link.medium.com/2SPf2Qsbsgb

Handling DB Failure during projection in cqrs

We are building a system using CQRS. Our projections are in MongoDB. We are facing a case like the following: we have an event, say OrderCreated, and we need to produce a sequential order_no, for example #3, #4, etc. We could use a projection, keep a sequence in a table, call an upsert method to get a new number, and then post a new command: GenerateOrderNumber. Now suppose a hardware failure occurs before this command is accepted. If we retry, we will get another number, which is not good. How can we solve such a use case in CQRS?
Our projections are in MongoDB <...>
a hardware failure occurs before this command is accepted
Most likely the described issue is not about CQRS or Event Sourcing itself, but about the projection storage, which is MongoDB in the question above.
You are trying to perform a potentially atomic operation without transaction guarantees. Since a hardware failure can occur at any time, the database should provide the ability to roll back failed atomic operations within the current transaction.
The best choice is native MongoDB transactions, which are available since version 4.0 - https://docs.mongodb.com/manual/core/transactions/ - and your code would look like this:
session.startTransaction( /* transaction options */ );
try {
  // Pass the session so both operations run inside the transaction.
  const lastNo = await eventsCollection.findOne({ /* ... */ }, { session });
  await eventsCollection.insertOne({ /* ..., next number = lastNo + 1 */ }, { session });
  await session.commitTransaction();
} catch (error) {
  await session.abortTransaction();
}
If you have to use older MongoDB versions, transactions can still be used, but instead of relying on the built-in operators you would manually write a transaction log and, after reconnecting to the database, scan for broken transactions and revert them manually via the log.
You should do all actions via events, even generating the sequence number.
In your case I suggest using a saga:
build a projection for generating order_no
fire a new event OrderCreated (after this point you have an Order aggregate with some unique id)
a saga, listening to this event, fires the event GenerateOrderNo (getting the next free number from the projection)
In that case, any time you ask for a new order_no after a failure, it will be the same.
Please correct me if I understood you wrong.
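To illustrate that idempotency with a small, hedged sketch (collection and field names are made up, and it assumes the MongoDB Node.js driver 4.x, where findOneAndUpdate returns { value }): the projection records which order already received which number, keyed by the order's aggregate id, so a retried GenerateOrderNo returns the same number instead of consuming a new one.

// Returns the same order_no for the same orderId, even if the command is retried.
async function assignOrderNo(db, orderId) {
  // If a number was already assigned to this order, reuse it.
  const existing = await db.collection('orderNumbers').findOne({ _id: orderId });
  if (existing) return existing.orderNo;

  // Otherwise atomically claim the next number from a counter document...
  const counter = await db.collection('counters').findOneAndUpdate(
    { _id: 'order_no' },
    { $inc: { seq: 1 } },
    { upsert: true, returnDocument: 'after' }
  );
  const orderNo = counter.value.seq;

  // ...and record the assignment; the _id guarantees at most one number per order.
  await db.collection('orderNumbers').insertOne({ _id: orderId, orderNo });
  return orderNo;
}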

When does node-mongodb-native hits the database?

I have trouble understanding when exactly the database is hit when using node-mongodb-native; I couldn't find any reference on that. As everything is callback-based, it gives me the feeling that every single call hits the database... For example, are these two snippets any different in terms of how many times the database is hit:
// ---- 1
db.collection('bla', function(err, coll) {
  coll.findOne({'blo': 'bli'}, function(err, doc) {
    coll.count(function(err, count) {
      console.log(doc, count);
    });
  });
});

// ---- 2
db.collection('bla', function(err, coll) {
  coll.findOne({'blo': 'bli'}, function(err, doc) {
    db.collection('bla', function(err, coll) {
      coll.count(function(err, count) {
        console.log(doc, count);
      });
    });
  });
});
I was basically wondering whether I can cache instances of collections and cursors. For example, why not fetch the collections I need only once, at server start, and reuse the same instances indefinitely ?
I'd really like to understand how the whole thing work, so I'd really appreciate a good link explaining stuff in details.
Looking at the source code for the node.js driver for collection it seems it will not ping MongoDB upon creation of the collection unless you have strict mode on: https://github.com/mongodb/node-mongodb-native/blob/master/Readme.md#strict-mode
The source code I looked at ( https://github.com/mongodb/node-mongodb-native/blob/master/lib/mongodb/db.js#L446 ) reinforced the idea that if strict was not on then it would just try and create a new node.js collection object and run the callback.
However, findOne and count will break the "lazy" querying of the driver and will force it to query the database in order to get results.
Note: count on the collection won't perform a "true" count of all items in the collection. Instead it will gather this information from the collection metadata.
So for the first snippet you should see two queries run: one for the findOne and one for the count. The same goes for the second snippet, since creating the collection after the findOne should not trigger a query to MongoDB.
After some googling, I found this link about best practices for node-mongodb-native. It is answered by Christian Kvalheim, who seems to be the maintainer of the library. He says:
"You can safely store the collection objects if you wish and reuse them"
So even if the call to collection might hit the database when made in strict mode, the actual client-side collection instance can be reused.
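To illustrate that reuse with the modern promise-based driver (the question's snippets use the older callback API; the database name here is made up, the collection name follows the snippets above):

const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
let bla; // cached collection handle, created once at startup

async function init() {
  await client.connect();
  // Getting a collection handle is a purely local operation; it does not hit
  // MongoDB, so it is safe to create it once and reuse it everywhere.
  bla = client.db('test').collection('bla');
}

async function example() {
  // These calls do hit the database.
  const doc = await bla.findOne({ blo: 'bli' });
  const count = await bla.countDocuments();
  console.log(doc, count);
}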