CQ query re-trigger post server restart in Apache Geode

Let's assume we have a CQ query set up with a server. We are not interested in executeWithInitialResults, hence we start it via "CqQuery.execute".
For example, the CQ being "SELECT * FROM /Trade WHERE price > 100".
When we start the CQ, we see this query executed on the server side, and all data satisfying the condition is loaded into memory.
Is this the expected behaviour from Geode?
We were under the impression that, as we are not interested in the initial data set, this query would not be executed and only new events would be delivered to the listener. Are we wrong in this assumption?
Also, we have made our subscription durable for 10 hours.
What we notice is that if the server restarts (with the client still active), the CQ query is executed again on the server side after the restart.
This re-populates all the data in memory.
Again, is this the expected way for CQs to work in Geode?
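
For reference, this is roughly how we register the CQ (a minimal Scala sketch against the Geode client API; the locator address and the durable-client settings are placeholder values, not our real configuration):

```scala
import org.apache.geode.cache.client.ClientCacheFactory
import org.apache.geode.cache.query.{CqAttributesFactory, CqEvent, CqListener}

object TradeCqClient extends App {
  val cache = new ClientCacheFactory()
    .addPoolLocator("localhost", 10334)          // assumed locator address
    .setPoolSubscriptionEnabled(true)            // required for CQ event delivery
    .set("durable-client-id", "trade-client-1")  // durable subscription
    .set("durable-client-timeout", "36000")      // 10 hours, in seconds
    .create()

  val listener = new CqListener {
    override def onEvent(e: CqEvent): Unit = println(s"CQ event: ${e.getNewValue}")
    override def onError(e: CqEvent): Unit = println(s"CQ error: ${e.getThrowable}")
    override def close(): Unit = ()
  }

  val attrs = {
    val f = new CqAttributesFactory()
    f.addCqListener(listener)
    f.create()
  }

  // execute() registers the CQ without returning the initial result set to the
  // client; executeWithInitialResults() would additionally ship the matching rows.
  val cq = cache.getQueryService
    .newCq("TradeCq", "SELECT * FROM /Trade WHERE price > 100", attrs, true /* durable */)
  cq.execute()

  cache.readyForEvents()                         // tell the server to start delivery
}
```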

Architecture for ML jobs platform

I'm building a platform to run ML jobs.
Jobs will be started from an interface.
I'm making a service for each type of job. Sometimes, a service S1 might need to first make a request to another service S2 and get its output before running its own job.
Each service is split into two Kubernetes deployments:
one that pulls the message from a topic, checks it and persists it to a database (D1)
one that reads requests from the database, runs the actual job, updates the request's state in the database and then answers the client (D2)
Here is the flow:
the interface generates a PubSub message to a topic T1
D1 pulls the message from T1 and persists a request to the database
D2 sees the new request in the database, runs it, updates its state in the database and answers the client
To answer the client, D2 has two options:
push a message to a PubSub topic T2 that is continuously checked by the client; an ID is passed in both the request and the response so that the client pulls only its own response from the topic
use a callback provided by the client to make a POST request
What do you think about this architecture? Does the usage of PubSub make sense? Also, does it make sense to split each service into two deployments (one that deals with the request, one that runs the actual job)?
> the interface generates a PubSub message to a topic T1; D1 pulls the message from T1 and persists a request to the database
If there's only one database, I'm not sure I see much advantage in using a topic (implying pub/sub). Another approach would be to use a queue: the interface creates jobs in the queue, and then you can have any number of workers processing them. Depending on the situation, you may not even need the database at all, if all the data needed can travel in the queue message; see the sketch below.
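
A rough Scala sketch of that alternative, using an in-process queue as a stand-in for a managed, pull-based one (MlJob and the processing logic are hypothetical placeholders):

```scala
import java.util.concurrent.LinkedBlockingQueue

final case class MlJob(id: String, payload: String) // all data the worker needs

object QueueSketch extends App {
  val jobs = new LinkedBlockingQueue[MlJob]()

  // Worker side: any number of these can run in parallel; each job is
  // delivered to exactly one of them, so no database handoff is required.
  val worker = new Thread(() => {
    while (true) {
      val job = jobs.take()                 // blocks until a job arrives
      println(s"running ${job.id}: ${job.payload}")
      // ... run the ML job, then answer the client (callback POST, etc.)
    }
  })
  worker.setDaemon(true)
  worker.start()

  // Interface side: enqueue the job instead of publishing to a topic.
  jobs.put(MlJob("job-1", "train model X"))
  Thread.sleep(500)                         // give the worker time to run
}
```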
> use a callback provided by the client to make a POST request
That's better if you can do it, on the assumption that there's only one consumer for the event; pub/sub is more for broadcasting out to multiple consumers. Polling works but is really inefficient and has limits on how much it can scale.
> Also, does it make sense to split each service into two deployments (one that deals with the request, one that runs the actual job)?
Having separate deployables makes sense if they are built by different teams and have a different release cadence, or if you need to scale them out independently; otherwise it may not be necessary.

Scala and Play Framework shared cache between nodes

I have a complex problem and I can't figure out the best solution to it.
This is the scenario:
I have N servers under a single load balancer and a Database.
All the servers connect to the database
All the servers run the same identical application
I want to implement a cache in order to decrease the response time and reduce the Server -> Database calls to a minimum.
I implemented it and it works like a charm on a single server... but I need to find a mechanism to update all the other caches on the other servers when the data is no longer valid.
Example:
I have server A and server B, both with their own cache.
On the first request from the outside, for example to get user information, server A replies.
Its cache is empty, so it needs to get the information from the database.
The second request goes to B; here too the cache is empty, so it needs to get the information from the database.
On the third request, again on server A, the data is now in the cache, so it replies immediately without a database request.
The fourth request, on server B, is a write request (for example, changing the user's name); server B can make the change in the database and update its own cache, invalidating the old user.
But server A still has the old, invalid user.
So I need a mechanism for server B to tell server A (or the N other servers) to invalidate/update the data in their caches.
What is the best way to do this with Scala and the Play Framework?
Also, consider that in the future the servers may be geo-redundant, i.e. in different geographical locations, on different networks, served by different ISPs.
It would also be great to update all the other caches when a user is loaded (one server requests it from the database and updates all the servers' caches), so that all the servers are ready for future requests.
Hope I have been clear.
Thanks
Since you're using Play, which already uses Akka under the hood, I suggest using Akka Cluster Sharding. With this, the instances of your Play service form a cluster (including failure detection, etc.) at startup and organize among themselves which instance owns a particular user's information.
So, proceeding through your requests: the first request to GET /userinfo/:uid hits server A. The request handler hashes uid (e.g. with murmur3: consistent hashing is important) and resolves it to, e.g., shard 27. This is the first request involving a user in shard 27 since the instances started, so shard 27 is created and, let's say, gets owned by server A. We send a message (e.g. GetUserInfoFor(uid)) to a new UserInfoActor, which loads the required data from the DB, stores it in its state, and replies. The Play API handler receives the reply and generates a response to the HTTP request.
The second request is for the same uid but hits server B. The handler resolves it to shard 27, and cluster sharding on B knows that A owns that shard, so it sends a message to the UserInfoActor for that uid on A, which has the data in memory. That actor replies with the info, and the Play API handler generates a response to the HTTP request from the reply.
In this way, all subsequent requests (e.g. the third, the same GET hitting server A) for the user info will not touch the DB, no matter which server they hit.
For the fourth request, which let's say is POST /userinfo/:uid and hits server B, the request handler again hashes the uid to shard 27 but this time, we send, e.g., an UpdateUserInfoFor(uid, newInfo) message to that UserInfoActor on server A. The actor receives the message, updates the DB, updates its in-memory user info and replies (either something simple like Done or the new info). The request handler generates a response from that reply.
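A minimal sketch of that entity, using Akka Typed Cluster Sharding (UserInfo and the two DB helpers are hypothetical placeholders for your own model and persistence code):

```scala
import akka.actor.typed.{ActorRef, ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

final case class UserInfo(uid: String, name: String)

object UserInfoActor {
  sealed trait Command
  final case class GetUserInfoFor(replyTo: ActorRef[UserInfo]) extends Command
  final case class UpdateUserInfoFor(newInfo: UserInfo, replyTo: ActorRef[UserInfo]) extends Command

  val TypeKey: EntityTypeKey[Command] = EntityTypeKey[Command]("UserInfo")

  // Register the entity type on every node at startup; sharding decides which
  // node owns which entity, based on a consistent hash of the entity id.
  def initSharding(system: ActorSystem[_]): Unit =
    ClusterSharding(system).init(Entity(TypeKey)(ctx => UserInfoActor(ctx.entityId)))

  def apply(uid: String): Behavior[Command] = Behaviors.setup { _ =>
    def running(info: UserInfo): Behavior[Command] =
      Behaviors.receiveMessage {
        case GetUserInfoFor(replyTo) =>
          replyTo ! info                     // served from memory, no DB hit
          Behaviors.same
        case UpdateUserInfoFor(newInfo, replyTo) =>
          writeToDb(newInfo)                 // hypothetical DB write
          replyTo ! newInfo
          running(newInfo)                   // the in-memory copy stays current
      }
    running(loadFromDb(uid))                 // first message: load once from the DB
  }

  private def loadFromDb(uid: String): UserInfo = UserInfo(uid, "loaded-from-db")
  private def writeToDb(info: UserInfo): Unit   = ()
}
```

A Play request handler then just resolves ClusterSharding(system).entityRefFor(UserInfoActor.TypeKey, uid) and asks it; the message is routed to whichever node currently owns the shard.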
This works really well: I've personally seen systems using cluster sharding keep terabytes in memory and operate with consistent single-digit millisecond latency for streaming analytics with interactive queries. Servers crash, and the actors running on the servers get rebalanced to surviving instances.
It's important to note that anything matching your requirements is a distributed system and you're requiring strong consistency, i.e. you're requiring that it be unavailable under a network partition (if B is unable to communicate an update to A, it has no choice but to fail the request). Once you start talking about geo-redundancy and multiple ISPs, you're going to see partitions pretty regularly. The only way to get availability under a network partition is to relax the consistency demand and accept that sometimes the GET will not incorporate the latest PUT/POST/DELETE.
This is probably not something that you want to build yourself. But there are plenty of distributed caches out there that you can use, such as Ehcache or Infinispan. I suggest you look into one of those two.

How to perform multiple HTTP DELETE operations on the same resource with different IDs in JMeter?

I have a question regarding **writing a test for the HTTP DELETE method in JMeter using the Concurrency Thread Group**. I want to measure **how many DELETEs** it can perform in a certain amount of time for a certain number of users (i.e. threads) who are sending concurrent HTTP DELETE requests.
Concurrency Thread Group parameters are:
Target Concurrency: 50 (Threads)
RampUp Time: 10 secs
RampUp Steps Count: 5
Hold Target Rate Time (sec): 5 secs
Threads Iterations Limit: infinite
The thing is that HTTP DELETE is an idempotent operation, i.e. invoking it repeatedly on the same resource (i.e. the same record in the database) doesn't make much sense. How can I achieve deletion of multiple EXISTING records in the database by passing the entity's ID in the URL? E.g.:
http://localhost:8080/api/authors/{id}
...where the ID is incremented for each user (i.e. thread)?
My question is how I can automate the deletion of multiple EXISTING rows in the database (Postgres 11.8)... should I write some sort of script, or is there an easier way to achieve that?
But again, I guess that would probably perform the same operation multiple times on the same resource ID (e.g. HTTP DELETE invoked more than once on http://localhost:8080/api/authors/5).
Any help/advice is greatly appreciated.
P.S. I'm doing this to performance test my SpringBoot, Vert.X and Dropwizard RESTful Web service apps.
UPDATE 1:
Sorry, I didn't fully specify the reason for writing these test use cases for my web service apps, which communicate with a Postgres DB. The MAIN reason I'm doing this testing is to compare the PERFORMANCE of blocking and NON-blocking web server implementations for the mentioned frameworks (SpringBoot, Dropwizard and Vert.X). The web servers are:
Blocking implementations:
1.1. Apache Tomcat (SpringBoot)
1.2. Jetty (Dropwizard)
Non-blocking: Vert.X (uses its own implementation based on Netty)
If I use JMeter's JDBC Request in my Test Plan, won't that actually slow down test execution?
The easiest way is to use either the Counter config element or the __counter() function to generate an incrementing number on each API hit:
More information: How to Use a Counter in a JMeter Test
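For example (a sketch using the endpoint from the question; the FALSE argument makes the counter global, i.e. shared across all threads, so no two threads delete the same ID):

```
Method: DELETE
Path:   http://localhost:8080/api/authors/${__counter(FALSE,)}
```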
Also, the list of IDs can be obtained from the Postgres database via the JDBC Request sampler and iterated with the ForEach Controller.
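Roughly like this (a sketch of the test plan; the authors table and the variable names are assumptions):

```
JDBC Request        Query:          SELECT id FROM authors ORDER BY id
                    Variable Names: authorId   -> creates authorId_1, authorId_2, ...
ForEach Controller  Input variable prefix: authorId
                    Output variable name:  currentId
  HTTP Request      DELETE http://localhost:8080/api/authors/${currentId}
```

This way each iteration deletes a row that actually exists in the database, and no ID is deleted twice.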

Kafka: consume messages in reverse order

I use Kafka 0.10. I have a topic logs where my IoT devices post their logs; the key of my messages is the device-id, so all the logs of the same device are in the same partition.
I have an API /devices/{id}/tail-logs that needs to display the last N logs of one device at the moment the call is made.
Currently I have it implemented in a very inefficient (but working) way, as I start from the beginning (i.e. the oldest logs) of the partition containing the device's logs and read until I reach the current timestamp.
A more efficient way would be if I could get the current latest offset and then consume the messages backwards (I would need to filter out some messages to keep only those of the device I'm looking for).
Is it possible to do this with Kafka? If not, how can one solve this problem? (A heavier solution would be to have Kafka Connect feed an Elasticsearch instance and then query Elasticsearch, but adding two more components just for this seems a bit overkill...)
As you are on 0.10.2, I would recommend writing a Kafka Streams application. The application will be stateful, and the state will hold the last N records/logs per device-id -- if new data is written to the input topic, the Kafka Streams application just updates its state (without the need to re-read the whole topic).
Furthermore, the application can also serve your requests (the /devices/{id}/tail-logs API) using the Interactive Queries feature.
Thus, I would not build a stateless application that has to recompute the answer for each request, but a stateful application that eagerly computes the result (and updates it automatically all the time) for all possible requests (i.e., for all device-ids) and just returns the already computed result when a request comes in.
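A minimal sketch of such a topology (Scala DSL from a Kafka Streams version newer than the 0.10.x in the question; the topic and store names are assumptions; the history is newline-joined so plain String serdes suffice):

```scala
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.kstream.Materialized
import org.apache.kafka.streams.scala.serialization.Serdes._

object TailLogsTopology {
  val N = 100 // how many recent log lines to retain per device

  def build(): org.apache.kafka.streams.Topology = {
    val builder = new StreamsBuilder()
    builder
      .stream[String, String]("logs")             // key = device-id, value = log line
      .groupByKey
      .aggregate("") { (_, log, lastN) =>
        // prepend the new line and cap the per-device history at N entries
        (log +: lastN.split('\n').toVector.filter(_.nonEmpty)).take(N).mkString("\n")
      }(Materialized.as("tail-logs-store"))
    builder.build()
  }

  // At request time the API layer reads the store via Interactive Queries, e.g.
  //   streams.store(StoreQueryParameters.fromNameAndType("tail-logs-store",
  //     QueryableStoreTypes.keyValueStore[String, String])).get(deviceId)
}
```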

Tracking what data has changed since a given time

In our application we have a central database and many disconnected client applications with their own local databases. A client connects to the central server, and the server should send it the data that has changed since the client's last connection.
Because there are too many clients, and some of them might cease to exist without notifying the server, it is not practical to keep the pending changes on the server per client.
That is why every relevant table has a column update_date, which is set to current_timestamp on every insert and every update. Deletes are handled in a similar way, with an auxiliary table for every synchronized table, in which we store the primary key of the synchronized table and the delete_date.
When a client connects to the server, it sends the server its last synchronization timestamp; the server sends back all changes where update_date > last_sync, plus the current_timestamp of the transaction, which the client stores as its new last_sync.
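Concretely, one round of this exchange looks roughly like the following (a plain JDBC sketch; the users table and its id column are assumptions, update_date and last_sync are the ones described above):

```scala
import java.sql.{Connection, Timestamp}

object SyncSketch {
  // Returns the ids changed since lastSync, plus the timestamp the client
  // should store as its new last_sync.
  def changesSince(conn: Connection, lastSync: Timestamp): (Vector[String], Timestamp) = {
    val ps = conn.prepareStatement("SELECT id FROM users WHERE update_date > ?")
    ps.setTimestamp(1, lastSync)
    val rs = ps.executeQuery()
    val changed = Vector.newBuilder[String]
    while (rs.next()) changed += rs.getString("id")

    // The same transaction's current_timestamp becomes the new last_sync;
    // rows from transactions still in flight carry an *older* update_date,
    // which is exactly the race described below.
    val ts = conn.createStatement().executeQuery("SELECT current_timestamp")
    ts.next()
    (changed.result(), ts.getTimestamp(1))
  }
}
```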
The problem with this approach: suppose there is a running transaction T1 with current_timestamp = 1000, and the client connects in a transaction T2 with current_timestamp = 2000. Since T2 does not see the not-yet-committed changes made in T1, they are not sent to the client. The next time the client connects, the changes from T1 are already committed, but they are marked with update_date = 1000, so they will not be sent to a client requesting the changes made after 2000.
Any suggestions on how to make sure that the clients get all the changed records? It is acceptable for a client to get the same changes multiple times.
Personally, I would solve this with an audit trigger, which is described here: https://wiki.postgresql.org/wiki/Audit_trigger
After that you can choose how to apply the updates (or ignore the ones that aren't relevant).
Alternatively, you could try one of the standard replication modules; some of the asynchronous ones should do the trick: https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling#Comparison_matrix
Bucardo, for example, was specifically designed for cases like these.