I am using Kafka for message processing in a Spring Boot application. My application produces messages, on an event basis, to three different topics. There is a separate Spring Boot application that will be used by a data analysis team to analyze the data. This application is a simple report-style application with only one filter: the topic.
Now I have to implement this, but I am a little confused about how to show the data on the UI. I have written listeners (consumers) that consume the messages, but how do I show the data on the UI in real time? Should I store it in a database like Redis and then serve it to the UI from there? Is this the correct way to deal with consumers in Kafka? Won't it be slow, since the number of messages can grow drastically over time?
In a nutshell, I want to know how to show messages on a UI efficiently and in real time.
Thanks
You can write a consumer that forwards messages to a WebSocket (see the sketch below).
Or you can use Kafka Connect to write to a database, then expose the data through a REST API.
Or use the Kafka Streams Interactive Queries feature and add an RPC layer on top for JavaScript to call.
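For the first option, a minimal Spring Kafka sketch might look like the following. It assumes spring-kafka and STOMP-over-WebSocket messaging are already configured in the report application, and the topic name "report-topic", the group id "dashboard-ui" and the STOMP destination "/topic/report" are placeholders:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Component;

@Component
public class TopicForwarder {

    private final SimpMessagingTemplate websocket;

    public TopicForwarder(SimpMessagingTemplate websocket) {
        this.websocket = websocket;
    }

    // Consume records as they arrive and push them straight to the browser over STOMP
    @KafkaListener(topics = "report-topic", groupId = "dashboard-ui")
    public void forward(String message) {
        websocket.convertAndSend("/topic/report", message);
    }
}
```

A JavaScript client subscribed to /topic/report then renders each message as it arrives, so nothing has to be stored just to get data onto the screen; a database (Redis or otherwise) is only needed if the UI must also show history after a page reload.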
I would like to build a web dashboard system and I am facing a problem. I need to get information that is held in the cache of one of the instances of my program. For this I had thought of doing pub/sub with Kafka, but I don't know how to publish a request and get a response back from one of my subscribers. Do you know a pattern that allows this, and a service that lets me do it?
EDIT: I would like to design an infrastructure that follows this pattern:
The attached diagram shows a simple request->response flow. Kafka is designed for a different type of architecture, so IMHO you should not focus on Kafka in this case.
However, if you still want to use Kafka for other reasons, I can suggest two options:
Stick with the request->response flow and use ReplyingKafkaTemplate or AggregatingKafkaTemplate to handle it; the second is an extension of the first that adds the ability to handle more than one response. You send a request to a Kafka topic from the Dashboard application, one of the Bot instances polls the message and sends a reply to a reply topic, and the Dashboard application then processes the reply (a sketch of this flow follows after this list).
Use Kafka to implement the Event-Carried State Transfer pattern: move the state (mutual guilds data) from the Bot instances directly to the Dashboard application via a Kafka topic. You can use several tools to implement this:
Bot applications send events to a Kafka topic via a simple KafkaProducer or KafkaTemplate; then use one of the Kafka Connect sink connectors to save the data in the Dashboard's database.
Bot applications send events to a Kafka topic via a simple KafkaProducer or KafkaTemplate. Run a Kafka Streams thread in the Dashboard application and build the state using Kafka Streams functionality (grouping, aggregating, etc.), then read the state directly from the Kafka Streams internal RocksDB state store.
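As a rough illustration of the first option, the Dashboard side could look like the sketch below. It assumes a ReplyingKafkaTemplate bean is already wired with a reply listener container, the request topic "bot-requests" is a placeholder, and the Bot side answers via a @KafkaListener method annotated with @SendTo:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.kafka.requestreply.ReplyingKafkaTemplate;
import org.springframework.kafka.requestreply.RequestReplyFuture;
import org.springframework.stereotype.Service;

@Service
public class GuildQueryService {

    private final ReplyingKafkaTemplate<String, String, String> replyingTemplate;

    public GuildQueryService(ReplyingKafkaTemplate<String, String, String> replyingTemplate) {
        this.replyingTemplate = replyingTemplate;
    }

    // Dashboard side: publish a request and block until one Bot instance replies
    public String queryBots(String request) throws Exception {
        ProducerRecord<String, String> record = new ProducerRecord<>("bot-requests", request);
        RequestReplyFuture<String, String, String> future = replyingTemplate.sendAndReceive(record);
        ConsumerRecord<String, String> reply = future.get(); // correlation is handled by the template
        return reply.value();
    }
}
```

The template attaches a correlation-id header to the request and matches it against the reply, so the blocking get() returns the response for this particular request.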
I'm new to Kafka and would be grateful for any advice.
We are updating a legacy application and, at the same time, moving it from IBM MQ to something different.
The application currently does the following:
Reads batch XML messages (up to 5 MB)
Parses them into something meaningful
Processes the data, manually parallelizing this step for parts of the batch; this involves some external legacy API calls that result in DB changes
Sends several kinds of email notifications
Sends a reply to another queue
Input messages are profiled to disk
We are considering using Kafka with Kafka Streams, as it would be nice to:
Scale processing easily
Have messages persistently stored out of the box
Get built-in partitioning, replication, and fault tolerance
Use Confluent Schema Registry to let us move to schema-on-write
Use it for service-to-service communication with other applications as well
But I have some concerns.
We are thinking about splitting those huge messages logically and putting them into Kafka that way, since, as I understand it, Kafka is not a huge fan of big messages. It would also let us parallelize processing on a per-partition basis.
After that, we would use Kafka Streams for the actual processing and, further on, for aggregating batch responses back together using a state store, as well as for pushing some messages to other topics (e.g. for sending emails).
But I wonder whether it is a good idea to do the actual processing in Kafka Streams at all, as it involves external API calls.
Also, I'm not sure about the best way to handle the case where this external API is down for any reason. That means a temporary failure for the current message and all subsequent ones. Is there any way to stop Kafka Streams processing for some time? I can see that there are pause and resume methods on the Consumer API; can they somehow be used in Streams?
Is it better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages back together? That sounds like an overcomplication.
Is Kafka a good tool for these purposes at all?
Overall I think you would be fine using Kafka, and probably Kafka Streams as well. I would recommend using Streams for any logic you need to do, i.e. the filtering or mapping you have to do. For writing out to external systems you would want to use a connector or a standard producer.
While it is ideal to have smaller messages, I have seen Streams users with messages in the GBs.
You can make remote calls, e.g. to send an email, from a Kafka Streams Processor, but that is not recommended. It would probably be better to write the "send an email" event to an output topic and use a normal consumer to read and send the messages. This would also take care of your concern about the API being down, as you can always remember the last committed offset and restart from there, or use the pause and resume methods (see the sketch below).
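As a rough sketch of that last point, a plain consumer for the email events can pause itself while the mail API is unavailable and resume afterwards. The topic and group names are placeholders, and mailApiIsHealthy()/sendEmail() stand in for whatever health check and mail client you actually use:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class EmailEventConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "email-sender");               // placeholder group id
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("email-events"));      // placeholder topic name

            while (true) {
                // While the mail API is down, stop fetching but keep calling poll()
                // so the consumer stays in the group and no rebalance is triggered.
                if (mailApiIsHealthy()) {
                    consumer.resume(consumer.assignment());
                } else {
                    consumer.pause(consumer.assignment());
                }

                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    sendEmail(record.value());                // external mail API call
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();                    // commit only after the whole batch was sent
                }
            }
        }
    }

    private static boolean mailApiIsHealthy() { return true; } // placeholder health check
    private static void sendEmail(String payload) { }          // placeholder mail client call
}
```

Because offsets are committed only after a batch has been sent, a crash mid-batch just means those events are re-read on restart (at-least-once delivery), which matches the "remember the last offset and restart from there" idea.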
I want to stream log data from Kafka into PySpark and save it to Parquet files, but I don't know how to get the data into Spark. Please help me, thanks.
My answer is at a high level. You need to use Spark Streaming and you need some basic understanding of messaging systems like Kafka.
The application that sends data into Kafka (or any messaging system) is called a "producer", and the application that receives data from Kafka is called a "consumer". When a producer sends data, it sends it to a specific "topic". Multiple producers can send data to the Kafka layer under different topics.
You basically need to create a consumer application. To do that, first you need to identify the topic you are going to consume data from.
You can find many sample programs online. The following page can help you build your first application, and a minimal sketch also follows after the link:
https://www.rittmanmead.com/blog/2017/01/getting-started-with-spark-streaming-with-python-and-kafka/
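At the code level, the usual approach today is Spark Structured Streaming with the Kafka source and a Parquet sink. The sketch below uses Spark's Java API (the PySpark DataFrame calls are analogous) and assumes the spark-sql-kafka package is on the classpath, a local broker, a topic named "logs", and placeholder output and checkpoint paths:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToParquet {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-logs-to-parquet")
                .getOrCreate();

        // Read the log topic as a streaming DataFrame; the value arrives as bytes, so cast it to a string
        Dataset<Row> logs = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "logs")
                .load()
                .selectExpr("CAST(value AS STRING) AS line", "timestamp");

        // Write each micro-batch to Parquet; the checkpoint location tracks consumed offsets
        StreamingQuery query = logs.writeStream()
                .format("parquet")
                .option("path", "/data/logs-parquet")
                .option("checkpointLocation", "/data/checkpoints/logs-parquet")
                .start();

        query.awaitTermination();
    }
}
```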
Say I am using Kafka as the event-driven backbone for all my microservices in my system design. Many microservices use the event data to populate their internal databases.
Now there is a requirement where I need to create a new service that uses some of the event data. The service will only be able to consume events from the time it goes live and hence will be missing the data published before that. I want a strategy such that I don't have to backfill its internal database by writing scripts.
What are some good strategies that do not create a huge load on Kafka and do not require a lot of scripting to backfill data in any new service I ever create?
There are a few strategies you can use here, depending on how you publish data to a Kafka topic. Here are a few ideas:
First, you can set the retention of a Kafka topic to forever, meaning that it will store all the data. This is OK, as Kafka is built for this purpose as well. See this. By doing this, any new service that comes alive can start consuming data from the start.
Second, if you are using Kafka to publish the latest state of a given entity/aggregate, you can also consider configuring the topic to be compacted. This will keep at least the latest state of each entity/aggregate on the topic, and new consumers that start listening on the topic will have less data to consume. However, your consumers still need to know how to process multiple messages per entity/aggregate, as you cannot guarantee there will be exactly one message per key in the topic.
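For illustration, both options come down to topic configuration. A sketch using the Java AdminClient, with placeholder topic names, partition counts and replication factor:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class TopicSetup {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Option 1: keep every event forever on a regular topic (retention.ms = -1)
            NewTopic fullHistory = new NewTopic("order-events", 6, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "-1"));

            // Option 2: compacted topic that keeps at least the latest record per entity key
            NewTopic latestState = new NewTopic("order-state", 6, (short) 3)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));

            admin.createTopics(Set.of(fullHistory, latestState)).all().get();
        }
    }
}
```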
I am new to Kafka and data streaming and need some advice for the following requirement.
Our system is expecting close to 1 million incoming messages per day. Each message carries a project identifier and should be pushed only to users of that project. For our case, let's say we have projects A, B and C. Users who open project A's dashboard only see/receive messages for project A.
This is my idea so far for implementing a solution to the requirement:
The messages should be pushed to a Kafka topic as they arrive; let's call this the Root Topic. Once in the Root Topic, the messages can be read by a Kafka consumer/listener which, based on the project identifier in the message, pushes each message to a project-specific topic. So any message can end up in topic A, B or C. I am thinking of using WebSockets to update the project users' dashboards as the messages arrive. There will be N consumers/listeners for the N project topics, and these consumers will push the project-specific messages to the project-specific WebSocket endpoints.
Please advise if I can make any improvements to the above design.
I chose Kafka as the messaging system here as it is highly scalable and fault tolerant.
There is no complex transformation or data enrichment before the data gets sent to the client. Does it make sense to use Apache Flink or Hazelcast Jet for the streaming, or is Kafka Streams good enough for this simple requirement?
Also, when should I consider using Hazelcast Jet or Apache Flink in my project?
Should I use Flink, say, when I have to update a few properties in the message based on a web service call or database lookup before sending it to the users?
Should I use Hazelcast Jet only when I need the entire dataset in memory to arrive at a property value, or will using Jet bring some benefit even for my simple use case specified above? Please advise.
Kafka Streams is a great tool for converting one Kafka topic into another Kafka topic (a routing sketch follows at the end of this answer).
What you need is a tool to move data from a Kafka topic to another system via WebSockets.
A stream processor gives you convenient tooling to build this data pipeline (among other things, connectors to Kafka and WebSockets and a scalable, fault-tolerant execution environment). So you might want to use a stream processor even if you don't transform the data.
The benefit of Hazelcast Jet is its embedded, scalable caching layer. You might want to cache your database/web service calls so that the enrichment is performed locally, reducing remote service calls.
See how to use Jet to read from Kafka and how to write data to a TCP socket (not a WebSocket).
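For the topic-to-topic part of the original design (fanning the Root Topic out into per-project topics), a minimal Kafka Streams sketch could look like this. The topic names, the "project-" prefix and extractProjectId() are placeholders, and the per-project topics must already exist because Kafka Streams does not create external output topics:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class ProjectRouter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "project-router");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Route each record from the root topic to a per-project topic chosen from its project id
        builder.<String, String>stream("root-topic")
               .to((key, value, recordContext) -> "project-" + extractProjectId(value));

        new KafkaStreams(builder.build(), props).start();
    }

    // Hypothetical helper: pull the project identifier out of the message payload
    private static String extractProjectId(String message) {
        return message.split(",")[0];
    }
}
```

The WebSocket push out of the per-project topics would then be a separate consumer (or a stream processor connector, as described above) rather than part of this Streams topology.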
I would like to give you another option. I'm not a Spark/Jet expert at all, but I've been studying them for a few weeks.
I would use Pentaho Data Integration (Kettle) to consume from Kafka, and I would write a Kettle step (or a User Defined Java Class step) to write the messages to a Hazelcast IMap.
Then I would use this approach http://www.c2b2.co.uk/middleware-blog/hazelcast-websockets.php to provide the WebSockets for the end users.
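The core of that "write to Hazelcast" step, reduced to a plain consumer loop, might look like the sketch below. The map name "messages", the topic "dashboard-events" and the fallback key scheme are assumptions, and the imports target Hazelcast 5.x:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class KafkaToHazelcast {

    public static void main(String[] args) {
        // Start (or join) a Hazelcast member and grab the map the WebSocket layer reads from
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> messages = hz.getMap("messages");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "dashboard-cache-loader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("dashboard-events"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // IMap keys must not be null, so fall back to a synthetic key for unkeyed records
                    String key = record.key() != null
                            ? record.key()
                            : record.topic() + "-" + record.partition() + "-" + record.offset();
                    messages.put(key, record.value());
                }
            }
        }
    }
}
```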