I have a method that executes a paginated call to a database to collect data and return an observable.
public Observable search(string query, int limit, int offset)
I want to build a method that executes all paginated search calls to load all pages for my query and return a Completable. In this method I would like to concatenate all pages in one stream, executes reactive transformations and actions and return a completable.
However, as I have to deal with a huge quantity of data, I do not want to load all pages at the same time , put a huge quantity of data in memory and flood my reactive stream with a lot of data because my subscriber processes the data slower than the publisher.
So, I would like to process the first page of data, then load another page through a call to search method, process it and continue until I process the last page.
As I am in a reactive flux, I am not sure that it's a good idea to write a loop to : fetch the data, subscribe to the stream that will transform and process data.
My approach is to concatenate all the pages in one stream (using a method such as Observable.concat) and then process the stream. However I do not want to load a huge quantity of data in memory and get an outOfMemoryException.
Should I use a Flowable and use backpressure in a such situation?
What is the best way to do this with rxjava2?
Related
Imagine a very simple application with two pages, PostList and PostDetail. On the former page, we show a list of posts, and on the latter, the details of a single post.
Now consider the following scenario. Our user clicks on the first PostItem and navigates to the PostDetail page. We fetch the full post data from the server. The likes_count of this post gets increased by one. Now if our user navigates back to the first page, the PostItem should be updated and show the new likes_count as well.
One possible solution to handle this is to create a pool of posts. Now when we fetch some new data from the server, instead of creating a new post object, we can update our corresponding pool instance object. For example, if we navigate to post with id=3, we can do something like this:
Post oldPost = PostPool.get(3)
oldPost.updateFromJson(servers_new_json_for_post_3)
Since the same object is used on the PostDetail page, our PostItem on the PostList page will be updated as well.
Other approaches that do not use a unique "single instance" of our Post objects, across the application, would not be clean to implement and requires tricks to keep the UI sync.
But the ObjectPool approach also has its own problems and leads to memory leaks since the size of the pool gets bigger and bigger over time. For this problem to get solved we need to manually count the number of references for each pool object instance and discard them when this count is equal to zero. This manual calculation is full of bugs and I was wondering if there are any better ways to achieve this.
You can also solve this by using streams and StreamBuilders. First you create the stream and populates it with the initial data fetched from the API:
I like to use BehaviorSubject from rxdart but you can use normal streams too.
final BehaviorSubject<List<Post>> postsStream = BehaviorSubject<List<Post>>()..startWith(<Post>[]);
On the constructor body or initState function you would fetch the data and add it to the stream.
PostPage() {
_fetchPosts().then((posts) => postsStream.add(posts));
}
You can now subscribe to changes on this postsStream in both pages with StreamBuilder. Any update you need to do you would emit a new (updated) List<Post> to the stream triggering a rebuild on any StreamBuilder subscribed to the stream with the new and updated values.
You can latter dispose the StreamController.
Say I have a static reference to a list/seq of a collection of users:
val users = List(User(..), User(...))
In my controllers, depending on the querystring filters passed in I will filter the users collection.
/users/find?locationId=1&age=30
The action would look something like:
def findUsers(...) = Action {
val filteredUsers = users.filter(.....)
Ok(filteredUsers)
}
So if this endpoint is getting 10K requests per second, the fact that the users reference is a val and I a am simply filtering the results in a read-only manner, this endpoint should be blazing fast correct?
The second part to this question is, since I cannot hard code 10K users in a collection, what would be the best way to mimic this behaviour or am I forced to make this a var in this case if I load the data from a db?
var users = userService.getAll()
I would need to reload this users periodically, like maybe every 3-4 hours.
So if this endpoint is getting 10K requests per second, the fact that
the users reference is a val and I a am simply filtering the results
in a read-only manner, this endpoint should be blazing fast correct?
Yes, no concerns with thread safety here. If you use something that refreshes this list you might get varying responses if 2 clients hit the same url when cache is being refreshed. It's possible to remediate this if that's an issue. In most cases it's not a problem.
You could use a var if you want to implement refresh yourself. There are other ways like using an actor which will hold this state. However, the best option I think is already provided by Play framework: ScalaCache
https://www.playframework.com/documentation/2.8.x/ScalaCache
It has cache refresh and expiry.
If you want further speedups you can cache results of your filtering if it makes sense for you. So it could be double cache for all results and filtered results or just for filtered results. Depends on your needs.
Trying to implement Event Sourcing and CQRS for the first time, but got stuck when it came to persisting the aggregates.
This is where I'm at now
I've setup "EventStore" an a stream, "foos"
Connected to it from node-eventstore-client
I subscribe to events with catchup
This is all working fine.
With the help of the eventAppeared event handler function I can build the aggregate, whenever events occur. This is great, but what do I do with it?
Let's say I build and aggregate that is a list of Foos
[
{
id: 'some aggregate uuidv5 made from barId and bazId',
barId: 'qwe',
bazId: 'rty',
isActive: true,
history: [
{
id: 'some event uuid',
data: {
isActive: true,
},
timestamp: 123456788,
eventType: 'IsActiveUpdated'
}
{
id: 'some event uuid',
data: {
barId: 'qwe',
bazId: 'rty',
},
timestamp: 123456789,
eventType: 'FooCreated'
}
]
}
]
To follow CQRS I will build the above aggregate within a Read Model, right? But how do I store this aggregate in a database?
I guess just a nosql database should be fine for this, but I definitely need a db since I will put a gRPC APi in front of this and other read models / aggreates.
But what do I actually go from when I have built the aggregate, to when to persist it in the db?
I once tried following this tutorial https://blog.insiderattack.net/implementing-event-sourcing-and-cqrs-pattern-with-mongodb-66991e7b72be which was super simple, since you'd use mongodb both as the event store and just create a view for the aggregate and update that one when new events are incoming. It had it's flaws and limitations (the aggregation pipeline) which is why I now turned to "EventStore" for the event store part.
But how to persist the aggregate, which is currently just built and stored in code/memory from events in "EventStore"...?
I feel this may be a silly question but do I have to loop over each item in the array and insert each item in the db table/collection or do you somehow have a way to dump the whole array/aggregate there at once?
What happens after? Do you create a materialized view per aggregate and query against that?
I'm open to picking the best db for this, whether that is postgres/other rdbms, mongodb, cassandra, redis, table storage etc.
Last question. For now I'm just using a single stream "foos", but at this level I expect new events to happen quite frequently (every couple of seconds or so) but as I understand it you'd still persist it and update it using materialized views right?
So given that barId and bazId in combination can be used for grouping events, instead of a single stream I'd think more specialized streams such as foos-barId-bazId would be the way to go, to try and reduce the frequency of incoming new events to a point where recreating materialized views will make sense.
Is there a general rule of thumb saying not to recreate/update/refresh materialized views if the update frequency gets below a certain limit? Then the only other a lternative would be querying from a normal table/collection?
Edit:
In the end I'm trying to make a gRPC api that has just 2 rpcs - one for getting a single foo by id and one for getting all foos (with optional field for filtering by status - but that is not so important). The simplified proto would look something like this:
rpc GetFoo(FooRequest) returns (Foo)
rpc GetFoos(FoosRequest) returns (FooResponse)
message FooRequest {
string id = 1; // uuid
}
// If the optional status field is not specified, return all foos
message FoosRequest {
// If this field is specified only return the Foos that has isActive true or false
FooStatus status = 1;
enum FooStatus {
UNKNOWN = 0;
ACTIVE = 1;
INACTIVE = 2;
}
}
message FoosResponse {
repeated Foo foos;
}
message Foo {
string id = 1; // uuid
string bar_id = 2 // uuid
string baz_id = 3 // uuid
boolean is_active = 4;
repeated Event history = 5;
google.protobuf.Timestamp last_updated = 6;
}
message Event {
string id = 1; // uuid
google.protobuf.Any data = 2;
google.protobuf.Timestamp timestamp = 3;
string eventType = 4;
}
The incoming events would look something like this:
{
id: 'some event uuid',
barId: 'qwe',
bazId: 'rty',
timestamp: 123456789,
eventType: 'FooCreated'
}
{
id: 'some event uuid',
isActive: true,
timestamp: 123456788,
eventType: 'IsActiveUpdated'
}
As you can see there is no uuid to make it possible to GetFoo(uuid) in the gRPC API, which is why I'll generate a uuidv5 with the barId and bazId, which will combined, be a valid uuid. I'm making that in the projection / aggregate you see above.
Also the GetFoos rpc will either return all foos (if status field is left undefined), or alternatively it'll return the foo's that has isActive that matches the status field (if specified).
Yet I can't figure out how to continue from the catchup subscription handler.
I have the events stored in "EventStore" (https://eventstore.com/), using a subscription with catchup, I have built an aggregate/projection with an array of Foo's in the form that I want them, but to be able to get a single Foo by id from a gRPC API of mine, I guess I'll need to store this entire aggregate/projection in a database of some sort, so I can connect and fetch the data from the gRPC API? And every time a new event comes in I'll need to add that event to the database also or how is this working?
I think I've read every resource I can possibly find on the internet, but still I'm missing some key pieces of information to figure this out.
The gRPC is not so important. It could be REST I guess, but my big question is how to make the aggregated/projected data available to the API service (possible more API's will need it as well)? I guess I will need to store the aggregated/projected data with the generated uuid and history fields in a database to be able to fetch it by uuid from the API service, but what database and how is this storing process done, from the catchup event handler where I build the aggregate?
I know exactly how you feel! This is basically what happened to me when I first tried to do CQRS and ES.
I think you have a couple of gaps in your knowledge which I'm sure you will rapidly plug. You hydrate an aggregate from the event stream as you are doing. That IS your aggregate persisted. The read model is something different. Let me explain...
Your read model is the thing you use to run queries against and to provide data for display to a UI for example. Your aggregates are not (directly) involved in that. In fact they should be encapsulated. Meaning that you can't 'see' their state from the outside. i.e. no getter and setters with the exception of the aggregate ID which would have a getter.
This article gives you a helpful overview of how it all fits together: CQRS + Event Sourcing – Step by Step
The idea is that when an aggregate changes state it can only do so via an event it generates. You store that event in the event store. That event is also published so that read models can be updated.
Also looking at your aggregate it looks more like a typical read model object or DTO. An aggregate is interested in functionality, not properties. So you would expect to see void public functions for issuing commands to the aggregate. But not public properties like isActive or history.
I hope that makes sense.
EDIT:
Here are some more practical suggestions.
"To follow CQRS I will build the above aggregate within a Read Model, right? "
You do not build aggregates in the read model. They are separate things on separate sides of the CQRS side of the equation. Aggregates are on the command side. Queries are done against read models which are different from aggregates.
Aggregates have public void functions and no getter or setters (with the exception of the aggregate id). They are encapsulated. They generate events when their state changes as a result of a command being issued. These events are stored in an event store and are used to recover the state of an aggregate. In other words, that is how an aggregate is stored.
The events go on to be published so the event handlers and other processes can react to them and update the read model and or trigger new cascading commands.
"Last question. For now I'm just using a single stream "foos", but at this level I expect new events to happen quite frequently (every couple of seconds or so) but as I understand it you'd still persist it and update it using materialized views right?"
Every couple of seconds is very likely to be fine. I'm more concerned at the persist and update using materialised views. I don't know what you mean by that but it doesn't sound like you have the right idea. Views should be very simple read models. No need to complex relations like you find in an RDMS. And is therefore highly optimised fast for reading.
There can be a lot of confusion on all the terminologies and jargon used in DDD and CQRS and ES. I think in this case, the confusion lies in what you think an aggregate is. You mention that you would like to persist your aggregate as a read model. As #Codescribler mentioned, at the sink end of your event stream, there isn't a concept of an aggregate. Concretely, in ES, commands are applied onto aggregates in your domain by loading previous events pertaining to that aggregate, rehydrating the aggregate by folding each previous event onto the aggregate and then applying the command, which generates more events to be persisted in the event store.
Down stream, a subscribing process receives all the events in order and builds a read model based on the events and data contained within. The confusion here is that this read model, at this end, is not an aggregate per se. It might very well look exactly like your aggregate at the domain end or it could be only creating a read model that doesn't use all the events and or the event data.
For example, you may choose to use every bit of information and build a read model that looks exactly like the aggregate hydrated up to the newest event(likely your source of confusion). You may instead have another process that builds a read model that only tallies a specific type of event. You might even subscribe to multiple streams and "join" them into a big read model.
As for how to store it, this is really up to you. It seems to me like you are taking the events and rebuilding your aggregate plus a history of events in a memory structure. This, of course, doesn't scale, which is why you want to store it at rest in a database. I wouldn't use the memory structure, since you would need to do a lot of state diffing when you flush to the database. You should be modify the database directly in response to each individual event. Ideally, you also transactionally store the stream count with said modification so you don't process the same event again in the case of a failure.
Hope this helps a bit.
I have two operation:
Load item from server(the item only contains certain fileds)
Load item from local database(the item in db may contain other
fileds)
Combine the item as one update the UI
update item in db
I know how to use rx for each single operation, but once call all of them I only think about nest observable inside other, this will result in callback hell.
What's the right way to complete these jobs?
According to your description, Server and local DB query should happen in parallel, after we have both data we combined them, so you need to use zip operator.
zip will subscribe to both server and local DB observable, when both Observable emitted values, you'll will get on next with both server and DB data, then combine them in the zip operator func, and you'll get Observable that emits the combined data.
With each emission of combined data (doOnNext), start a save operation in the background, and in the subscriber update the Ui according to the combined data.
Observable<ServerData> getServerData = ...;
Observable<LocalDbData> getLocalDbData = ...;
Observable
.zip(getServerData, getLocalDbData,
(serverData, localDbData) -> combinedData(serverData, localDbData))
.doOnNext(combinedData -> updateDataInDb())
.subscribe(combinedData -> updateUi(combinedData));
function _allUsers(callback){
var db = connect.get();
db.collection("users").find({}).toArray(function(err,data){
if(err){
callback(err);
}else{
callback(null,data);
}
});
}
I am trying to understand this code, I have been looking around the web but I find the explanations kinda defficult to understand ( I am new at Mean stack), so my questions are:
What does the Collection method do? I am not sure but the string "users" is it just the name of our collection with all users?
Why do we have to use a callback in this situation? (I find callbacks very confusing).
And why do we have to give toArray function, an annonymous function?
Instead of toArray could I use pretty method() without any annonymous function as a parameter?
MEAN Stack is a software bundle of software programs supporting applications written in all javascript. This means you can use javascript from your database, to your back-end and front-end.
MEAN actually stands for the first characters of each software program included in the stack. MongoDB, Expressjs, AngularJS and NodeJS.
1
MongoDB is a NoSQL database which uses BSON (similar to JSON) to store so called documents. Look at a document as if it is a single entity or row in a traditional database. These entities (or rows) are stored in collections (a collection of documents) which can be compared to tables.
So the answer to your 1st question is opens up the users collection, which grants access to all the user documents.
2
NodeJS is asynchronous by design. This allows NodeJS to perform a lot of operations while running on a single thread*. Because NodeJS is single-threaded we need a way to write our code non-blocking meaning we can start an operation, proceed with executing other code and come back whenever that operation is finished.
In your case we request access to the users collection, this takes some time. In order to allow other parts of our application to continue processing we use a callback. When we have access to our collection, our callback is executed and we can perform whatever operation we wanted to do when we first requested access.
*NodeJS actually runs on multiple threads but a developer never has to worry about multithreading, NodeJS does that for us.'
3
This is exactly what the previous point is about.
The .toArray() method returns an array that contains all the documents from a cursor. The method iterates completely the cursor, loading all the documents into RAM and exhausting the cursor. Source
.toArray() is a computionally intensive operation. Since we do not want to wait untill .toArray() is finished but proceed processing the rest of our code, we give it a callback so that we can come back to our collection processing whenever it's ready.
4
From what I can read from the docs I guess you could indeed write blocking code and do it this way:
var users = db.collection("users").find({}).toArray();
This however will block your code entirely. There is never a good reason to do this.
Disclaimer: I left out or oversimplified details in this explanation for ease of understanding.
db.collection('users') this will return the users collection instance
we are using callback for asynchronous
the annonymous function in toArray is its callback
this is dependent on the library in use..
without any annonymous function as a parameter
expressjs is an asynchronous programming, we need callback || Promises
You can think of the collection as of table in MySQL. A collection consists of documents (rows/items/records in MySQL). Your example calls the Users collection and finds all documents (records) in it.
About the callbacks - NodeJS/Express are commonly callbacks-oriented. This is the pattern they use and most of the code is using it, because it is asynchronous. If you need to be sure that some snippet is executed right after some other snippet, you have to use callback (or promise).
Calling toArray() depends on what your callback expects. You can skip calling this method if the callback expects the Query object returned by the find() method. All that depends on your callback.
You can use non-anonymous function, too, but you have to have in mind the asynchronous logic and continue using callbacks/promises. You can read more about callbacks and promises in this Quora's article.
Here you can find more about the find() method.