Redis / Kafka - Why stream consumers are blocked? - apache-kafka

Are Kafka stream/redis stream good for reactive architecture?
I am asking this mainly because both redis and kafka seem to be blocking threads when consuming the messages.
Are there any reasons behind this? I was hoping that I could read messages with some callback - so execute when the message was delivered like pub/sub in a reactive manner. Not by blocking thread.

Kafka client is relatively low level, what is "good" as in: it provides you with much flexibility when (and in which thread) you'd do the record processing. In the end, to receive a record, someone needs to block (as actual reading is sending fetch requests over and over). Whether the thread being blocked is a "main-business" thread or some side-i/o dedicated one will depend on developer's choice.
You might take a look at higher-level offerings such as Spring-Kafka or Kafka Stream API / Kafka Connect frameworks that provide "fatter" inversion-of-control containers, effectively answering the above concern.

Related

Are Kafka and Kafka Streams right tools for our case?

I'm new to Kafka and will be grateful for any advice
We are updating a legacy application together with moving it from IBM MQ to something different.
Application currently does the following:
Reads batch XML messages (up to 5 MB)
Parses it to something meaningful
Processes data parallelizing this procedure somehow manually for parts of the batch. Involves some external legacy API calls resulting in DB changes
Sends several kinds of email notifications
Sends some reply to some other queue
input messages are profiled to disk
We are considering using Kafka with Kafka Streams as it is nice to
Scale processing easily
Have messages persistently stored out of the box
Built-in partitioning, replication, and fault-tolerance
Confluent Schema Registry to let us move to schema-on-write
Can be used for service-to-service communication for other applications as well
But I have some concerns.
We are thinking about splitting those huge messages logically and putting them to Kafka this way, as from how I understand it - Kafka is not a huge fan of big messages. Also it will let us parallelize processing on partition basis.
After that use Kafka Streams for actual processing and further on for aggregating some batch responses back using state store. Also to push some messages to some other topics (e.g. for sending emails)
But I wonder if it is a good idea to do actual processing in Kafka Streams at all, as it involves some external API calls?
Also I'm not sure what is the best way to handle the cases when this external API is down for any reason. It means temporary failure for current and all the subsequent messages. Is there any way to stop Kafka Stream processing for some time? I can see that there are Pause and Resume methods on the Consumer API, can they be utilized somehow in Streams?
Is it better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages together? Sounds like an overcomplication
Is Kafka a good tool for these purposes at all?
Overall I think you would be fine using Kafka and probably Kafka Streams as well. I would recommend using streams for any logic you need to do i.e. filtering or mapping that you have todo. Where you would want to write with a connector or a standard producer.
While it is ideal to have smaller messages I have seen streams users have messages in the GBs.
You can make remote calls, to send and email, from a Kafka Streams Processor but that is not recommend. It would probably be better to write the event to send an email to an output topic and use a normal consumer to read and send the messages. This would also take care of your concern about the API being down as you can always remember the last offset in case and restart from there. Or use the Pause and Resume methods.

Why is there no Async / Non-Blocking Support in Kafka Streams API?

I am wondering why there is no non-blocking support via simple callbacks or Java's CompletableFuture or Scala Futures in the Kafka Stream API.
I do understand that ordering in a partition needs to be maintained, but across partitions I do not see the reason to achieve ordering by blocking an expensive resource: a thread.
i.e. when I let my Kafka Streams app with a call to an external service, e.g. in mapValues run on 1 server and I have more than thousands of partitions, I will probably lock up the machine because all threads are blocked. Having some API method like mapValuesAsync() would be nice here, wouldn't it?
Also just imagine on Kafka Stream App with doing several blocking operations in it's flow one would need way less partitions per each topic to run into the problem. Wasting threads doesn't look like a nice API design here.
Is there any support planned for this? Or do I oversee something here?
Async processing is generally hard in stream processing. It's not just about ordering, but also about fault-tolerance, tracking progress etc.
It's not impossible to support though and in fact there is already a design proposal for it: https://cwiki.apache.org/confluence/display/KAFKA/KIP-408%3A+Add+Asynchronous+Processing+To+Kafka+Streams
Feel free to help building this feature!

How to implement communication between consumer and producer with fast and slow workers?

I'm looking for a pattern and existing implementations for my situation:
I have synchronous soa which uses REST APIs, in fact I implement remote procedure call with REST. And I have got some slow workers which process requests for substantial time (around 30 seconds) and due to some license constraints for some requests I can process them only sequentially (see system setup).
What are recommended ways to implement communication for such case?
How can I mix synchronous and asynchronous communication in the situation when the consumer is behind firewall and I cannot send him notification about completed tasks easily and I might not be able to let consumer use my message broker if I have one?
Workers are implemented in Python using Flask and gunicorn. At the moment I'm using synchronous REST interfaces and allow for the delay as I only had fast workers. I looked at Kafka and RabbitMq and they would fit for backend side
communication, however how does producer communicate with the consumer?
If the consumer fires API request my producer can return code 202, then how shall producer notify consumer that result is available? Will consumer have to poll the producer for results?
Also if I use message brokers and my gateway acts on behalf of consumer, it should have a registry of requests (I already have GUID for every request now) and results, which approach would you recommend for implementing it?
Producer- agent which produces the message
Consumer - agent which can handle the message and implement the logic for processing the message

can vert.x event bus replace the need for Kafka?

I am evaluating the vert.x framework to see if I can reduce the Kafka based communications between my microservices developed using spring boot.
The question is:
Can I replace
Kafka with vert.x event bus and
spring boot microservices with vert.x based verticles
To answer quickly, I would say it depends on your needs.
Yes, the eventbus can be a good way to handle natively communication between microservices verticles using an asynchronous and non-blocking paradigm.
But in some cases you could need:
to handle some common enterprises patterns like replay mechanisms, persistence of messages, transactional reading
to be able to process some kind of messages in a chronological order
to handle communication between multiples kind of microservices that aren't all written with the same framework/toolkit or even programming language
to handle reliability, resilience and
failure recovery when all your consumers/microservices/verticles are died
to handle dynamic horizontal scalability and monitoring of your consumers/microservices/verticles
to be able to work with a single cluster deployed in multi-datacenters and multi-regions
In those cases I'd prefer to choose Apache Kafka over the native eventbus or an old fascioned JMS compliant system.
It's not forbidden to use both eventbus and kafka in the same microservices architecture according to your real needs. For example, you could have one kafka consumers group reading a kafka topic to handle scaling, monitoring, failure recovery and reply mechanism and then handle communication between your sub-verticles through the eventbus.
I'll clarify a little bit for the scalability and monitoring part and explain why I think it's more simple to handle that with Kafka over the native eventbus and cluster mode with vert.x : Kafka allow us to know in real time (through JMX metrics and the describe command):
the "lag" of a topic which corresponds to
the number of unread messages
the number of consumers of each group that are listening a topic
the number of partitions of a topic affected of each consumers
i/o metrics
So it's possible to use an ElasticStack or Prometheus+Grafana solution to monitor those metrics and use them to handle a dynamic scalability (when you know that there's a need to increase temporarily the number of consumers for example according to the lag metric and the number of partitions and the cpu/ram/swap metrics of your hosts).
To answer the second question vert.x or SpringBoot my answer will be not very objective but I'd vote for vert.x for its performances on the JVM and especially for its simplicity. I'm a little tired of the Spring factory and its big layers of abstraction that hides a lot of issues under a mountain of annotations triggering a mountain of AOP.
Moreover, In the Java world of microservices, there's other alternatives to SpringBoot like the different implementations of Microprofile (thorntail project for example).
The event-bus is not persistent. You should use it for fast verticle-to-verticle communications, and more generally to dispatch events where you know that you can loose them if you have some crash.
Kafka streams are persistent, and you should send events there because either you want other (possibly non-Vert.x) applications to consume them, and/or because you want to ensure that these events are not being lost in case of failure.
A reactive (read "scalable and fault-tolerant") Vert.x application typically uses a combination of both the event-bus and some replicable messaging systems like AMQP / Kafka / etc.
On the question:
Can I replace spring boot microservices with vert.x based verticles?
Yes, definitely, although the 2 have different programming models.
If you want a more progressive approach and use Spring for structuring your application while using Vert.x for resource efficiency over your I/O and event processing then you can mix them, see https://github.com/vert-x3/vertx-examples/tree/master/spring-examples for examples.
Take a look at the Quarkus framework: in the workshop section you'll find Vert.x and Apache Kafka combined!

Push styled storm spout

Am a newbie in Storm and have been exploring its features to match our CEP requirements. Different examples which I have stumbled implements spouts as a polling service from a message broker, database. How to implement a push based spout i.e. Thrift server running inside a spout? How shall I make my clients aware of where my spouts are running, so that they can push data on it?
Spouts are designed and intended to poll, so you can't push to them. However, what many people do is use things like Redis, Thrift, or Kafka as services that you can push messages to and then your spout can poll them.
The control you have on where and when a spout runs is limited, so it's a bit of hassle to have external processes communicate directly with spouts. It certainly is possible, but it's not the simplest solution.
The standard solution is to push messages to some external message queue and let your spouts poll this message queue.
There are implementations of spouts that do exactly this for commonly used message queue services, such as Kafka, Kestrel and JMS, in storm-contrib
I don't have a whole lot of experience with either Storm or Kafka/Kestrel or CEP, in general but I am looking for a similar solution - push to a Storm spout. How about using a load-balancer between event source and the Storm cluster? For my use case of pushing Syslog messages from rsyslog to Storm, a load-balancer can keep track of what Storm nodes are running a listening spout and which ones are down and also distribute incoming load based on different parameters. I am less inclined to introduce another layer like a message bus between the source and spout.
Edit: I read your blog and to summarize, if the only problem with a listening spout is how would a source find it then a message bus might be the wrong answer. There are simpler/better solutions to direct network traffic at a receiver based on simple network status or higher app level logic. But yes, if you want to use all the additional message bus features then obviously Kafka/Kestrel would be good options.
It's not a typical usage of Storm, obviously you can't bind multiple instances of the spout on the same machine to the same port. In distributed setup it would be good idea to store API's current IP address and port e.g. to ZooKeeper and then balancer which would forward requests to your API.
Here's a project with simple REST API on Storm:
https://github.com/timjstewart/restexpress-storm