I have the following code.
My goal is to group messages by a given key and a 10-second window, and to count the total amount accumulated for each key in each window.
I read that I need caching enabled and a cache size configured. I am also advancing the wall clock so that the windowing kicks in and the elements are grouped into two separate windows. You can see my expectations for the given code in the two assertions.
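The topology is roughly of this shape (a simplified sketch, not my exact code; the topic names and serdes here are illustrative):
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

StreamsBuilder builder = new StreamsBuilder();
builder.stream("input", Consumed.with(Serdes.String(), Serdes.Long()))
    .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
    .windowedBy(TimeWindows.of(Duration.ofSeconds(10)))   // 10-second tumbling windows
    .reduce(Long::sum)                                    // total per key per window
    .toStream((windowedKey, total) -> windowedKey.key())  // drop the window from the key
    .to("output", Produced.with(Serdes.String(), Serdes.Long()));
// cache.max.bytes.buffering is set in the config, and the test advances
// wall-clock time, which (as I understood it) should flush one result per window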
Unfortunately this code fails them, and it does so in two ways:
it sends a result of the reduce operation every time it executes, instead of using the cache on the store and sending a single total value
windows are not respected, as can be seen in the output
Can you please explain to me how I am misunderstanding the mechanics of Kafka Streams in this case?
I have a system that saves (X,Y) coordinates to a SQL table. I also have an endpoint that, when called, returns the (X,Y) coordinates.
However, my system takes up to 30 minutes to process and store an (X,Y) coordinate in the SQL table, so I am using KSQL to get at that data faster.
I have added the KSQL call to the backend endpoint I mentioned. The problem is that this call adds 6 extra seconds to my request.
My endpoint includes a query that looks like this:
SELECT feature_a,feature_b FROM ksql_table;
The ksql_table has already been pre-processed by two previous streams. In my understanding, this query should be pretty straightforward and cheap to compute, but it is taking 6 seconds to process.
When a KSQL query runs, it instantiates a Kafka Streams application that builds the requested table state. This has a "spin-up" time, which doesn't matter when it's the stream processing application itself (since once it's running, it stays running). However, if you're repeatedly calling it via the REST API as part of your application's request flow, then you are going to see this delay every time.
I think a more optimal way to work with the stream of data in Kafka would be to use Kafka Streams to build and persist the required state in a KTable, and then serve it through Interactive Queries and a custom API that your nodejs application can interface with, as described here. Further examples are here and here.
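As a rough sketch of that approach (the store and topic names here are illustrative, not from your setup): materialize the pre-processed topic into a named state store, then query it locally from your API handler, with no per-request spin-up:
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

StreamsBuilder builder = new StreamsBuilder();
// materialize the pre-processed topic into a queryable state store
builder.table("coordinates-topic", Materialized.as("coordinates-store"));

KafkaStreams streams = new KafkaStreams(builder.build(), props); // props: your usual streams config
streams.start();

// inside the API handler: read directly from local state
ReadOnlyKeyValueStore<String, String> store = streams.store(
    StoreQueryParameters.fromNameAndType("coordinates-store",
        QueryableStoreTypes.keyValueStore()));
String coordinates = store.get("some-key");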
There is also a nodejs Kafka Streams library, which I have not used but might be worth checking out.
My scenario is something like this:
1. I have a vector of a large number of reports that need to be sent using a REST API call.
2. I am using Future.traverse over the vector mentioned in 1.
3. Since the vector is huge, this fails with "max open requests exceeded".
One initial solution I could think of is to increase the max-open-requests setting. But the problem here is that I don't know beforehand how many reports need to be sent.
Can someone please suggest an alternative, like limiting the parallelism of the Future.traverse call?
Since you tagged this question with akka, I'm assuming that you are using akka-http for the calls. You could use akka-streams to make the requests in batches, so as to avoid overflowing your connections, with something like:
Source(reportsVector)                                   // known, finite source of reports
  .grouped(safeValue)                                   // batches of at most safeValue reports
  .mapAsync(1)(reps => Future.traverse(reps)(x => ...)) // do your stuff, one batch in flight at a time
  .mapConcat(identity)                                  // flatten the per-batch results
  .runWith(Sink.seq)                                    // collect all results when the stream completes
The example will execute safeValue concurrent calls at a time and collect all the results into a collection that is returned when the entire stream is done. You can also play with other operators, like sliding and splitWhen, to better fit your use case, and you can tune safeValue and the mapAsync concurrency as well. Notice that the source of this stream is a known vector (reportsVector), but it could just as well be an unknown, finite stream of reports.
The naive approach to the use case of enriching an incoming stream of events stored in Kafka with reference data is to call, from a map() operator, an external service's REST API that provides this reference data, for each incoming event.
eventStream.map((key, event) -> /* query the external service here, then return the enriched event */)
Another approach is to have a second event stream carrying the reference data and to store it in a KTable, which acts as a lightweight embedded "database", then join the main event stream with it.
KStream<String, Object> eventStream = builder.stream(..., "event-topic");
KTable<String, Object> referenceDataTable = builder.table(..., "reference-data-topic");

// a KStream-KTable join yields a KStream, and to() is a terminal operation
KStream<String, Object> enrichedEventStream = eventStream
    .leftJoin(referenceDataTable, (event, referenceData) -> /* return the enriched event */)
    .map((key, enrichedEvent) -> new KeyValue<>(/* new key */, enrichedEvent));

enrichedEventStream.to("enriched-event-topic", ...);
Can the "naive" approach be considered an anti-pattern? Can the "KTable" approach be recommended as the preferred one?
Kafka can easily manage millions of messages per minute. A service called from the map() operator needs to be capable of handling that load too, and to be highly available. These are extra requirements for the service implementation. But if the service satisfies these criteria, can the "naive" approach be used?
Yes, it is ok to do RPC inside Kafka Streams operations such as map(). You just need to be aware of the pros and cons of doing so, see below. Also, you should do any such RPC calls synchronously from within your operations (I won't go into the details of why here; if needed, I'd suggest creating a new question).
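For illustration only, a synchronous call inside mapValues() looks something like this (the service URL is made up, and the stream is assumed to carry plain strings):
HttpClient client = HttpClient.newHttpClient(); // java.net.http, JDK 11+

KStream<String, String> enriched = eventStream.mapValues(event -> {
    try {
        HttpRequest request = HttpRequest.newBuilder(
            URI.create("http://reference-service/lookup?event=" + event)).build();
        // blocking call: the stream task waits for the response before
        // processing the next record
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        return event + "|" + response.body(); // toy "enrichment"
    } catch (IOException | InterruptedException e) {
        throw new RuntimeException("lookup failed", e);
    }
});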
Pros of doing RPC calls from within Kafka Streams operations:
Your application will fit more easily into an existing architecture, e.g. one where the use of REST APIs and request/response paradigms is commonplace. This means that you can make quick progress on a first proof-of-concept or MVP.
The approach is, in my experience, easier to understand for many developers (particularly those who are just starting out with Kafka) because they are familiar with doing RPC calls in this manner from their past projects. Think: it helps to move gradually from request-response architectures to event-driven architectures (powered by Kafka).
Nothing prevents you from starting with RPC calls and request-response, and then later migrating to a more Kafka-idiomatic approach.
Cons:
You are coupling the availability, scalability, and latency/throughput of your Kafka Streams powered application to the availability, scalability, and latency/throughput of the RPC service(s) you are calling. This is relevant also for thinking about SLAs.
Related to the previous point, Kafka and Kafka Streams scale very well. If you are running at large scale, your Kafka Streams application might end up DDoS'ing your RPC service(s) because the latter probably can't scale as much as Kafka. You should be able to judge pretty easily whether or not this is a problem for you in practice.
An RPC call (like from within map()) is a side-effect and thus a black box for Kafka Streams. The processing guarantees of Kafka Streams do not extend to such side effects.
Example: Kafka Streams (by default) processes data based on event-time (= based on when an event happened in the real world), so you can easily re-process old data and still get back the same results as when the old data was still new. But the RPC service you are calling during such reprocessing might return a different response than "back then". Ensuring the latter is your responsibility.
Example: In the case of failures, Kafka Streams will retry operations, and it will guarantee exactly-once processing (if enabled) even in such situations. But it can't guarantee, by itself, that an RPC call you are doing from within map() will be idempotent. Ensuring the latter is your responsibility.
Alternatives
In case you are wondering what other alternatives you have: if, for example, you are doing RPC calls to look up data (e.g. to enrich an incoming stream of events with side/context information), you can address the downsides above by making the lookup data available in Kafka directly. If the lookup data lives in MySQL, you can set up a Kafka connector to continuously ingest the MySQL data into a Kafka topic (think: CDC). In Kafka Streams, you can then read the lookup data into a KTable and perform the enrichment of your input stream via a stream-table join.
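A minimal sketch of that setup (the topic names and the two helper functions are illustrative); using a GlobalKTable for the lookup side is a variant that also spares you from having to co-partition the event stream with the reference data:
// "mysql.reference-data" would be the topic kept up to date by the CDC connector
GlobalKTable<String, String> referenceData = builder.globalTable("mysql.reference-data");

KStream<String, String> enriched = builder.<String, String>stream("event-topic")
    .leftJoin(referenceData,
        (eventKey, event) -> extractLookupKey(event),    // hypothetical helper
        (event, reference) -> enrich(event, reference)); // hypothetical helper
enriched.to("enriched-event-topic");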
I suspect most of the advice you hear on the internet is along the lines of, "OMG, if this REST call takes 200ms, how will I ever process 100,000 Kafka messages per second to keep up with my demand?"
Which is technically true: even if you scale up the servers for your REST service, responses from that app may routinely take 200ms - because it talks to a server 70ms away (the speed of light is kinda slow if that server is across the continent from you) and the calling microservice adds 130ms even when measured right at the source - and at that rate you will never keep up.
With Kafka Streams the problem may be worse than it appears. Maybe you get 100,000 messages a second coming into your stream pipeline, but some operator flatMaps, and that operation in your app creates 2 messages for every incoming one... so now you really have 200,000 messages a second crashing through your REST server.
BUT maybe you're using Kafka Streams in an app that sees 100 messages a second, or you can partition your data so that you get one message per partition, maybe even just one a second. In that case, you might be fine.
Maybe your Kafka data just needs to go somewhere else, i.e. the end of the stream feeds back into a good ol' RDBMS. In that case, yes, there's some careful balancing to do on the best way to deal with potentially "slow" systems, while making sure you don't DDoS yourself and that you can work your way out of a backlog.
So is it an anti-pattern? Eh, probably, if your Kafka cluster is LinkedIn-sized. Does it matter for you? That depends on how many messages per second you need to drive, how fast your REST service really is, and how efficiently it can scale (e.g. when your new Kafka Streams pipeline suddenly delivers 5x the normal traffic to it).
Does ZK need some sort of delay after a successful write before the data can be read back, in the case of a standalone instance (no peers)?
I have a test in which I create 1000 records using 1000 REST calls to a service, and then try to read them all back together. It works most of the time, but intermittently I only get back a partial result (say, 995 records).
My understanding was that a sync() is only required for consistency in the case of multiple peers; is this not so? Is there some specific delay beyond that, and if so, is there an upper limit to it?
P.S.: Since I'm using REST calls, perhaps the ZK clients are different instances; is this the problem? Would a sync() (or a fixed sleep interval) fix it if so?
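For reference, the sync-before-read I have in mind looks roughly like this with the ZooKeeper Java client (the path is illustrative):
// zk is an already-connected org.apache.zookeeper.ZooKeeper instance
CountDownLatch synced = new CountDownLatch(1);
// ask the server this session is connected to to catch up with the leader
zk.sync("/records", (rc, path, ctx) -> synced.countDown(), null);
synced.await();
List<String> recordIds = zk.getChildren("/records", false); // read after sync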