How to subscribe to an entire collection and a document within it simultaneously without duplicating reads? - flutter

For learning purposes, I am writing a cross-platform to-do app with Flutter and Firestore. Currently, I have the following design, and I would like to know if there are better alternatives.
One of the main screens of the app shows a list of all tasks. It does this by subscribing to the corresponding Firestore collection, which we'll say is /tasks for simplicity.
FirebaseFirestore.instance.collection("tasks").snapshots()
Each tile in the ListView of tasks can be clicked. Clicking a tile opens a new screen (with Navigator.push) showing details about that specific task.
Importantly, this screen also needs to update in real-time, so it is not enough to just pass it the (local, immutable) task object from the main screen. Instead, this screen subscribes to the individual Firestore document corresponding to that task.
FirebaseFirestore.instance.collection("tasks").doc(taskId).snapshots()
This makes sense to me logically: the details page only needs to know about that specific document, so it only subscribes to it to avoid receiving unnecessary updates.
The problem is that, since the collection-wide subscription for the main screen is still alive while the details screen is open, both listeners will trigger if the document /tasks/{taskId} gets updated. According to the answers in this, this and this question, this means I will get charged for two (duplicate) reads for any single update to that document.
Furthermore, each task can have subtasks. This is reflected in Firestore as a tasks subcollection for each task; for example, a nested task could have the path /tasks/abc123/tasks/efg875/tasks/aay789. The main page could show all tasks regardless of nesting by using a collection group query on "tasks". The aforementioned details page also shows the task's subtasks by listening to the subcollection. This makes it possible to run complex queries on subtasks (filtering, ordering, etc.), but again the disadvantage is getting duplicate reads for every update to a subtask.
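For reference, the collection group query for the main page would be along these lines (assuming every level of nesting uses a subcollection literally named "tasks", as in the example path above):
FirebaseFirestore.instance.collectionGroup("tasks").snapshots()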
The alternative designs that occur to me are:
Only keep a single app-wide subscription to the entire set of tasks (be it a flat collection or a collection group query) and do any and all selection, filtering, etc. on the client (roughly sketched after this list). For example, the details page of a task would use the same collection-wide subscription and select the appropriate task out of the set every time. Any filtering and ordering of tasks/subtasks would be done on the client.
Advantages: no duplicate reads, minimizes the Firestore cost.
Disadvantages: might be more battery intensive for the client, and code would become more complex as I'd have to select the appropriate data out of the entire set of tasks in every situation.
Cancel the collection-wide subscription when opening the details page and re-start it when going back to the main screen. This means when the details page is open, only updates to that specific task will be received, and without being duplicated as two reads.
Advantages: no duplicate reads.
Disadvantages: re-starting the subscription when going back to the main screen means reading all of the documents in the first snapshot, i.e. one read per task, which might actually make the problem worse. Also, it could be quite complicated to code.
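To make the first alternative concrete, this is roughly what I have in mind (a minimal sketch; allTasks and watchTask are just placeholder names, and a real app would share the stream through some service, provider, or other state management solution):
import 'package:cloud_firestore/cloud_firestore.dart';

// One app-wide listener over every task (flat collection or collection group).
// asBroadcastStream lets the list screen and any details screens share it
// without opening a second Firestore listener.
final Stream<QuerySnapshot<Map<String, dynamic>>> allTasks =
    FirebaseFirestore.instance.collectionGroup("tasks").snapshots().asBroadcastStream();

// The details page selects its own document out of each snapshot on the client.
Stream<Map<String, dynamic>?> watchTask(String taskId) {
  return allTasks.map((snapshot) {
    for (final doc in snapshot.docs) {
      // Compare doc.reference.path instead if ids could repeat across nesting levels.
      if (doc.id == taskId) return doc.data();
    }
    return null; // the task was deleted or has not arrived yet
  });
}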
Do any of these designs seem the best? Is there another better alternative I'm missing?

Create a TaskService or something similar in your app that handles listening to the FirebaseFirestore.instance.collection("tasks").snapshots() call, then subscribe to updates from that service rather than from Firebase itself (you can create two Stream objects, one for global updates and one for updates to a specific task).
That way you have only one subscription running against your Firestore collection; everything else is handled app-side.
Pseudo-code (Task and its fromDoc factory are assumed to be your own model class):
import 'dart:async';
import 'package:cloud_firestore/cloud_firestore.dart';

class TaskService {
  final List<Task> _tasks = [];
  final StreamController<List<Task>> _signalOnTasks = StreamController.broadcast();
  final StreamController<Task> _signalOnTask = StreamController.broadcast();

  List<Task> get allTasks => _tasks;
  Stream<List<Task>> get onTasks => _signalOnTasks.stream;
  Stream<Task> get onTask => _signalOnTask.stream;

  void init() {
    FirebaseFirestore.instance.collection("tasks").snapshots().listen(_onData);
  }

  void _onData(QuerySnapshot<Map<String, dynamic>> snapshot) {
    // Rebuild the local cache from the latest snapshot (this is also where you
    // could de-duplicate or merge instead of replacing wholesale).
    final tasks = snapshot.docs.map((doc) => Task.fromDoc(doc)).toList();
    _tasks
      ..clear()
      ..addAll(tasks);
    // Dispatch our signal streams.
    _signalOnTasks.add(tasks);
    for (final task in tasks) {
      _signalOnTask.add(task);
    }
  }
}
You can expose TaskService through an InheritedWidget to get access to it wherever you need it (or use the provider package), then add your listeners to whichever stream you're interested in. You just need to check in your onTask listener that it's the correct task before doing anything with it. A rough usage sketch follows.
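For illustration only, a details screen could consume the shared stream like this (a rough sketch; Task, its id and title fields, and how you obtain the TaskService instance are all assumptions):
import 'package:flutter/material.dart';

class TaskDetailsScreen extends StatelessWidget {
  const TaskDetailsScreen({super.key, required this.taskId, required this.service});

  final String taskId;
  final TaskService service; // obtained via your InheritedWidget/provider of choice

  @override
  Widget build(BuildContext context) {
    return StreamBuilder<Task>(
      // Reuse the single app-wide subscription; keep only events for this task.
      stream: service.onTask.where((task) => task.id == taskId),
      builder: (context, snapshot) {
        if (!snapshot.hasData) {
          return const Center(child: CircularProgressIndicator());
        }
        return Text(snapshot.data!.title); // render the real details here
      },
    );
  }
}
One caveat with broadcast streams: a listener only sees events emitted after it subscribes, so you may want to seed the screen from service.allTasks (or have the service replay its cached list on subscription) to avoid showing a spinner until the next Firestore update arrives.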

Related

How does Trello store actions generated from updates to other documents (boards, cards) in MongoDB without atomic transactions?

I'm developing a single page web app that will use a NoSQL Document Database (like MongoDB) and I want to generate events when I make a change to my entities.
Since most of these databases support transactions only on a document level (MongoDB just added ACID transaction support), there is no good way to store changes in one document and then store events from those changes in other documents.
Let's say for example that I have a collection 'Events' and a collection 'Cards' like Trello does. When I make a change to the description of a card from the 'Cards' collection, an event 'CardDescriptionChanged' should be generated.
The problem is that if there is a crash or some error between saving the changes to the 'Cards' collection and adding the event in the 'Events' collection this event will not be persisted and I don't want that.
I've done some research on this issue and most people would suggest that one of several approaches can be used:
Do not use MongoDB, use SQL database instead (I don't want that)
Use Event Sourcing. (This introduces complexity, and I want to clear older events at some point, so I don't want to keep all events stored. I know that I can use snapshots and delete older events from the snapshot point, but there is a complexity in this solution.)
Since errors of this nature probably won't happen too often, just ignore them and risk having events that won't be saved (I don't want that either)
Use an event/command/action processor. Store commands/actions like 'ChangeCardDescription' and use a Processor that will process them and update the entities.
I have considered option 4, but a couple of questions come up:
How do I manage concurrency?
I can queue all commands for the same entity (like a card or a board) and make sure that they are processed sequentially, while commands for different entities (different cards) can be processed in parallel. Then I can use the processed commands as events. One problem here is that changes to an entity may generate several events that do not correspond to a single command. I would have to break down all user actions into very fine-grained commands so I can then translate them to events.
Error reporting and error handling.
If this process is asynchronous, I have to manage error reporting to the client. And also I have to remove or mark commands that failed.
I still have the problem with marking the commands as processed, as there are no transactions. I know I have to make processing of commands idempotent to resolve this problem.
Since Trello uses MongoDB and generates actions ('DeleteCardAction', 'CreateCardAction') along with changes to entities (Cards, Boards, ...), I was wondering how they solve this problem.
Create a new collection called FutureUpdates. Write planned updates to the FutureUpdates collection with a single document defining the changes you plan to make to cards and the events you plan to generate. This insert will be atomic.
Now open a change stream on the FutureUpdates collection; this will give you the stream of updates you need to make. Take each doc from the change stream and apply the updates. Finally, update the doc in FutureUpdates to mark it as complete. Again, this update will be atomic.
When you apply the updates to Events and Cards make sure to include the objectID of the doc used to create the update in FutureUpdates.
Now if the program crashes after inserting the update in FutureUpdates you can check the Events and Cards collections for the existence of records containing the objectID of the update. If they are not present then you can reapply the missing updates.
If the updates have been applied but the FutureUpdate doc is not marked as complete we can update that during recovery to complete the process.
Effectively you are continuously atomically updating a doc for each change in FutureUpdates to track progress. Once an update is complete you can archive the old docs or just delete them.
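If it helps to see the shape of this, here is a minimal sketch, written in Dart purely for illustration; FutureUpdateDoc and UpdateStore are made-up stand-ins for your driver calls, and each UpdateStore method is assumed to map onto a single-document (and therefore atomic) MongoDB operation:
// Intent document written to FutureUpdates in one atomic insert.
class FutureUpdateDoc {
  FutureUpdateDoc(this.id, this.cardChanges, this.events);
  final String id; // the ObjectId stamped onto Cards/Events for idempotency
  final Map<String, Object?> cardChanges;
  final List<String> events;
}

// Hypothetical data-access seam; each call is one atomic document operation.
abstract class UpdateStore {
  Stream<FutureUpdateDoc> watchFutureUpdates(); // change stream on FutureUpdates
  Future<bool> cardHasUpdate(String updateId); // Cards already stamped with updateId?
  Future<bool> eventsHaveUpdate(String updateId); // Events already stamped with updateId?
  Future<void> applyCardChanges(FutureUpdateDoc doc);
  Future<void> appendEvents(FutureUpdateDoc doc);
  Future<void> markComplete(String updateId); // set complete = true on the intent doc
}

// Every step checks for its own marker, so re-running after a crash (or during
// recovery) simply fills in whatever is missing - the whole flow is idempotent.
Future<void> applyUpdate(UpdateStore store, FutureUpdateDoc doc) async {
  if (!await store.cardHasUpdate(doc.id)) {
    await store.applyCardChanges(doc);
  }
  if (!await store.eventsHaveUpdate(doc.id)) {
    await store.appendEvents(doc);
  }
  await store.markComplete(doc.id);
}

Future<void> processFutureUpdates(UpdateStore store) async {
  await for (final doc in store.watchFutureUpdates()) {
    await applyUpdate(store, doc);
  }
}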

CQRS and Passing Data

Suppose I have an aggregate containing some data, and when it reaches a certain state I'd like to take all that state and pass it to some outside service. For argument and simplicity's sake, let's just say it is an aggregate that has a list, and when all items in that list are checked off, I'd like to send the entire state to some outside service. Now, when I'm handling the command for checking off the last item in the list, I'll know that I'm at the end, but it doesn't seem correct to send the state to the outside system from the processing of the command. So, given this scenario, what is the recommended approach if the outside system requires all of the state of the aggregate? Should the outside system build its own copy of the data based on the aggregate events, or is there some better approach?
Should the outside system build its own copy of the data based on the aggregate events.
Probably not -- it's almost never a good idea to share the responsibility of rehydrating an aggregate from its history. The service that owns the object should be responsible for rehydration.
The first key idea to understand is when in the flow the call to the outside service should happen.
First, the domain model processes the command arguments, computing the update to the event history, including the ChecklistCompleted event.
The application takes that history and saves it to the book of record.
The transaction completes successfully.
At this point, the application knows that the operation was successful, but the caller doesn't. So the usual answer is to be thinking of an asynchronous operation that will do the rest of the work.
Possibility one: the application takes the history that it just saved, and uses that history to schedule a task to rehydrate a read-only copy of the aggregate state and then send that state to the external service.
Possibility two: you ditch the copy of the history that you have now, and fire off an asynchronous task that has enough information to load its own copy of the history from the book of record.
There are at least three ways that you might do this. First, you could have the command schedule the task as before.
Second, you could have an event handler listening for ChecklistCompleted events in the book of record, and have that handler schedule the task (roughly sketched below).
Third, you could read the ChecklistCompleted event from the book of record, and publish a representation of that event to a shared bus, and let the handler in the external service call you back for a copy of the state.
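A rough sketch of that second option, with everything hypothetical (ChecklistCompleted, BookOfRecord, ExternalService and the rehydrate function are stand-ins for whatever your book of record and messaging infrastructure actually look like; Dart is used only for illustration):
class ChecklistCompleted {
  ChecklistCompleted(this.checklistId);
  final String checklistId;
}

// Stand-ins for the book of record and the downstream service.
abstract class BookOfRecord {
  Stream<Object> events(); // persisted domain events, in order
  Future<List<Object>> history(String aggregateId); // full history of one aggregate
}

abstract class ExternalService {
  Future<void> send(Map<String, Object?> aggregateState);
}

void wireUp(
  BookOfRecord book,
  ExternalService external,
  Map<String, Object?> Function(List<Object> history) rehydrate, // owned by this service
) {
  book.events().listen((event) async {
    if (event is! ChecklistCompleted) return;
    // The handler schedules the follow-up work asynchronously; rehydration stays
    // inside the service that owns the aggregate, and the write path never waits
    // on the external call.
    final history = await book.history(event.checklistId);
    await external.send(rehydrate(history));
  });
}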
I was under the impression that one bounded context should not reach out to get state from another bounded context but rather keep local copies of the data it needed.
From my experience, the key idea is that the services shouldn't block each other -- or more specifically, a call to service B should not block when service A is unavailable. Responding to events is fundamentally non blocking; does it really matter that we respond to an asynchronously delivered event by making an asynchronous blocking call?
What this buys you, however, is independent evolution of the two services - A broadcasts an event, B reacts to the event by calling A and asking for a representation of the aggregate that B understands, A -- being backwards compatible -- delivers the requested representation.
Compare this with requiring a new release of B every time the rehydration logic in A changes.
Udi Dahan raised a challenging idea - the notion that each piece of data belongs to a single technical authority. "Raw business data" should not be replicated between services.
A service is the technical authority for a specific business capability.
Any piece of data or rule must be owned by only one service.
So in Udi's approach, you'd start to investigate why B has any responsibility for data owned by A, and from there determine how to align that responsibility and the data into a single service. (Part of the trick: the physical view of a service can span process boundaries; in other words, a process may be composed from components that belong to more than one service).
Jeppe Cramon's series on microservices is nicely sourced and touches on many of the points above.
You should never externalise your state. Reporting on that state is a function of the read side, as it produces reports and you'll need that data to call the service. The structure of your state is plastic, and you shouldn't have an external service that relies upon that structure; otherwise you'll have to update both in lockstep, which is a bad thing.
There is a blog that puts forward a strong argument that the process manager is the correct place to put this type of feature (calling an external service), because that's the appropriate place for orchestrating events.

Spring-batch reader for frequently modified source

I'm using Spring Batch and I want to write a job where I have a JPA reader that selects paginated sets of products from the database. Then I have a processor that performs some operation on every single product (let's say on product A), but while performing this operation on product A the item processor will also process some other products (like product B, product C, etc.). Then the processor will come to product B, because it's next in line and is given by the reader, but it has already been processed, so it's actually a waste of time/resources to process it again. How should one tackle this - is there a modification-aware item reader in Spring Batch? One solution would be to check in the item processor whether the product has already been processed, and only process it if it hasn't been. However, checking whether the product has been processed is itself very resource-consuming.
There are two approaches here that I'd consider:
Adjust what you call an "item" - An item is what is returned from the reader. Depending on the design of things, you may want to build a more complex reader that can include the dependent items and therefore only loop through them once. Obviously this is very dependent upon your specific use case.
Use the Process Indicator pattern - The process indicator pattern is what this is for. As you process items, set a flag in the db indicating that they have been processed. Your reader's query is then configured to only read the items that have not yet been processed (filtering out the ones that were updated during the process phase).

Data store in an actor system

I'm working on an event processing pipeline based on Akka actors. I have three actors, one for each step of the pipeline: FilterWorker, EnrichWorker and ProcessWorker; plus a supervisor actor that makes sure the events are sent from one step of the pipeline to the next.
The enrich step might need to query some external database for extra data or even create new data that I'll want to persist. For example, the enrich step of a web analytics system might want to enrich a click event with the user that made the click and store that user information in a database.
Keeping in mind that example, I see the following options:
1. Use a singleton, e.g. a UserStore that keeps in memory all the users gathered so far and saves them to the database once in a while, and that has all the logic to fetch users that are not yet in memory. Using a singleton in an actor system doesn't seem like a good idea, however (?).
2. Use a store actor. Use tell to add a new user and ask to fetch it.
Is there a better pattern for this?
Thanks!
In order to not leave this unanswered, I went with my second option and johanandren's suggestion of having an Actor fill the data store role. Works pretty well!

Getting past Salesforce trigger governors

I'm trying to write an "after update" trigger that does a batch update on all child records of the record that has just been updated. This needs to be able to handle 15k+ child records at a time. Unfortunately, the limit appears to be 100, which is so far below my needs it's not even close to acceptable. I haven't tried splitting the records into batches of 100 each, since this will still put me at a cap of 10k updates per trigger execution. (Maybe I could just daisy-chain triggers together? ugh.)
Does anyone know what series of hoops I can jump through to overcome this limitation?
Edit: I tried calling the following @future function in my trigger, but it never updates the child records:
global class ParentChildBulkUpdater
{
    // Runs asynchronously, outside the trigger's governor limits.
    @future
    public static void UpdateChildDistributors(String parentId) {
        Account[] children = [SELECT Id FROM Account WHERE ParentId = :parentId];
        for (Account child : children) {
            child.Site = 'Bulk Updater Fired';
        }
        update children;
    }
}
The best (and easiest) route to take with this problem is to use Batch Apex. You can create a batch class and fire it from the trigger. Like @future, it runs in a separate thread, but it can process up to 50,000,000 records!
You'll need to pass some information to your batch class before using database.executeBatch so that it has the list of parent IDs to work with, or you could just get all of the accounts of course ;)
I've only just noticed how old this question is but hopefully this answer will help others.
It's worse than that: you're not even going to be able to get those 15k records in the first place, because there is a 1,000-row query limit within a trigger (this scales with the number of rows the trigger is being called for, but that probably doesn't help).
I guess your only way to do it is with the @future annotation - read up on that in the docs. It gives you much higher limits. Although you can only call so many of those in a day, so you may need to somehow keep track of which parent objects have had their children updated, and then process the rest offline.
A final option may be to use the API via some external tool. But you'll still have to make sure everything in your code is batched up.
I thought these limits were draconian at first, but actually you can do a hell of a lot within them if you batch things correctly; we regularly update thousands of rows from triggers. And from an architectural point of view, much more than that and you're really talking batch processing anyway, which isn't normally activated by a trigger. One thing's for sure - they make you jump through hoops to do it.
I think Codek is right, going the API / external tool route is a good way to go. The governor limits still apply, but are much less strict with API calls. Salesforce recently revamped their DataLoader tool, so that might be something to look into.
Another thing you could try is using a Workflow rule with an Outbound Message to call a web service on your end. Just send over the parent object and let a process on your end handle the child record updates via the API. One thing to be aware of with outbound messages: it is best to queue up the process on your end somehow and respond to Salesforce immediately, otherwise Salesforce will resend the message.
@future doesn't work (does not update records at all)? Weird. Did you try using your function in an automated test? It should work, and the annotation should be ignored (during the test it will be executed instantly; test methods have higher limits). I suggest you investigate this a bit more, it seems like the best solution to what you want to accomplish.
Also - maybe try to call it from your class, not the trigger?
Daisy-chaining triggers together will not work, I've tried it in the past.
Your last option might be Batch Apex (available from the Winter '10 release, so all organisations should have it by now). It's meant for mass data update/validation jobs, things you typically run overnight in normal databases (it can be scheduled). See http://www.salesforce.com/community/winter10/custom-cloud/program-cloud-logic/batch-code.jsp and the release notes PDF.
I believe in version 18 of the API the 1,000-row limit has been removed (so the documentation says, but in some cases I still hit a limit).
So you may be able to use Batch Apex with a single Apex update statement.
Something like:
List<childObject__c> children = new List<childObject__c>();
for (childObject__c c : [SELECT Id FROM childObject__c /* ... your filter ... */]) {
    c.foo__c = 'bar';
    children.add(c);
}
update children;
Be sure you bulkify your trigger; also see http://sfdc.arrowpointe.com/2008/09/13/bulkifying-a-trigger-an-example/
Maybe a change to your data model is the better option here. Think about creating a formula field on the child object that accesses the data from the parent. That would probably be far more efficient.