Execute a calculation on each incoming data record in Kafka - apache-kafka

I am consuming a Kafka topic and want to execute a calculation whenever a new data record arrives. The calculation should use the data of the incoming record and the two previous ones (as shown in the picture linked here). Is it possible to somehow buffer the last two records so that I can operate on them together with the new record?

As mentioned by mike, this is a very broad question. From what you wrote, this looks like something you could do quite well with Kafka Streams. You might want to have a look at this intro.
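Independent of any particular framework, the buffering idea itself is simple; a minimal sketch in Python (the record values and the calculation are hypothetical placeholders):

```python
from collections import deque

def sliding_calculation(records, window_size=3):
    """Yield one result per incoming record once the buffer holds
    the new record plus the two previous ones."""
    buffer = deque(maxlen=window_size)  # keeps only the last 3 records
    for record in records:
        buffer.append(record)
        if len(buffer) == window_size:
            # hypothetical calculation: sum over the current window
            yield sum(buffer)

# records arrive one by one; each result uses 3 consecutive records
results = list(sliding_calculation([1, 2, 3, 4, 5]))
# results == [6, 9, 12]
```

In Kafka Streams the equivalent state would typically live in a state store attached to your processor, so the buffer survives restarts rather than sitting in plain memory.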

You can also achieve this simply by using KSQL. KSQL is a SQL streaming engine for Apache Kafka. It provides an easy-to-use, interactive, SQL-like interface for stream processing on Kafka, without the need to write code in a programming language like Java or Python.
See the tutorial: https://docs.confluent.io/current/ksqldb/tutorials/index.html

Related

Filtering in Kafka and other streaming technologies

I am currently doing some research about which stream processing technology to use. So far I have looked at message queueing technologies and streaming frameworks. I am now leaning towards Apache Kafka or Google Pub/Sub.
The requirements I have:
Deliver, read and process messages/events in real time.
Persistence in the messages/events.
Ability to filter messages/events in real time without having to read the entire topic. For example: if I have a topic called ‘details’, I want to be able to filter out the messages/events in that topic where an attribute of an event equals a certain value.
Ability to see if the producer to a certain topic or queue is finished.
Ability to delete messages/events in a topic based on an attribute within an event equaling a certain value.
Ordering in messages/events.
My question is: what is the best framework/technology for these use cases? From what I have read so far, Kafka doesn’t provide out-of-the-box filtering for messages/events in topics, while Google Pub/Sub does.
Any suggestions and experience would be welcome.
As per the requirements you mentioned, Kafka seems a nice fit. Using Kafka Streams or KSQL you can perform filtering in real time; here is an example: https://kafka-tutorials.confluent.io/filter-a-stream-of-events/confluent.html
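The filtering logic itself is just a predicate over each event; a minimal sketch (the event schema and attribute names are hypothetical):

```python
def filter_events(events, attribute, value):
    """Keep only the events whose given attribute equals the given value."""
    return [e for e in events if e.get(attribute) == value]

events = [
    {"id": 1, "status": "shipped"},
    {"id": 2, "status": "pending"},
    {"id": 3, "status": "shipped"},
]
shipped = filter_events(events, "status", "shipped")
# shipped contains the events with id 1 and 3
```

In a streaming job this predicate would run continuously over records as they arrive, rather than over a finished list.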
What you need is more than just integration and data transfer; you need something similar to what is known as an ETL tool. Here you can find more about ETL and ETL tools in GCP: https://cloud.google.com/learn/what-is-etl

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides the data for Kafka. The PostgreSQL database is not empty, and I want to catch any CDC change in it. At the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can get analysis of what is happening at the same time the CDC event happens.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the data source for this case is an API to be monitored, and there are few examples of database-to-database stream processing. I have done the process before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent, and rely on the generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Spark Structured Streaming with Kafka, in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records within it, so I'm not sure what the issue is.
If you want to process per record, then use the Spring Boot Kafka setup for consuming Kafka messages; it can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
This article discusses the single-message processing approach: https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b
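To make the per-record vs. micro-batch distinction concrete in framework-neutral terms (the record contents and handler are hypothetical), compare the two dispatch shapes:

```python
def process_per_record(records, handler):
    """Invoke the handler once per record, as a listener-style consumer would."""
    return [handler(r) for r in records]

def process_micro_batch(batches, handler):
    """Invoke the handler once per batch, as Structured Streaming does;
    every record is still seen, just grouped into batches."""
    return [[handler(r) for r in batch] for batch in batches]

records = [{"op": "INSERT"}, {"op": "UPDATE"}, {"op": "DELETE"}]
handle = lambda r: r["op"].lower()

per_record = process_per_record(records, handle)      # one call per record
micro_batch = process_micro_batch([records], handle)  # one call per batch
```

The point is that a micro-batch does not lose records; it only changes the granularity at which your handler is invoked.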

What is the gain of using kafka-connect over traditional approach?

I have a use case where I need to send the data changes in relational database into a kafka-topic.
I'm able to write a simple JDBC program which executes a set of queries for the changes in a certain time period and writes the data into a Kafka topic using KafkaTemplate (a wrapper provided by the Spring framework).
If I do the same using Kafka Connect, i.e. by writing a source connector, what benefits or overheads (if any) will I get?
The first thing is that you have "... to write a simple JDBC program ..." and take care of the logic of writing to both the database and the Kafka topic.
Kafka Connect does that for you, and your business application has to write to the database only. With Kafka Connect you get more than that, like failover handling, parallelism, scaling, ... it's all out of the box for you, whereas you would have to take care of those yourself when, for example, you write to the database but something fails and you are not able to write to the Kafka topic, and so on.
Today you want to ingest from one database into a Kafka topic using a set of queries, and write some bespoke code to do that.
Tomorrow you want to use a second database, or you want to change the serialisation format of your data in Kafka, or you want to scale out your ingest or you want to have high availability. Or you want to add in the ability to stream data from Kafka to another target, to ingest data also from other places. And, manage it all centrally using a standardised configuration pattern expressed just in JSON. Oh, and you want it to be easily maintainable by someone else who doesn't have to read through code but can just use a common API of Apache Kafka (which is what Kafka Connect is).
If you manage to do all of this yourself—you've just reinvented Kafka Connect :)
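For a sense of what "configuration expressed just in JSON" looks like, here is a sketch of a JDBC source connector configuration (the connection details, table name, and topic prefix are placeholder assumptions):

```json
{
  "name": "jdbc-source-orders",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://localhost:5432/mydb",
    "connection.user": "user",
    "connection.password": "password",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "jdbc-"
  }
}
```

Everything your bespoke JDBC program would hard-code (polling mode, change detection columns, target topic) becomes declarative configuration that Kafka Connect runs, scales, and recovers for you.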
I talk extensively about this in my Kafka Summit session: "From Zero to Hero with Kafka Connect" which you can find online here

How to join multiple Kafka topics?

So I have...
1st topic that has general application logs (log4j). Stores things like HTTP API requests/responses and warnings, exceptions etc... There can be multiple logs associated to one logical business request. (These logs happen within seconds of each other)
2nd topic contains commands from the above business request which other services take action on. (The commands also happen within seconds of each other, but maybe couple minutes from the original request)
3rd topic contains events generated from actions of those other services. (Most events complete within seconds, but some can take up to 3-5 days to be received)
So a single logical business request can have multiple logs, commands and events associated to it by a uuid which the microservices pass to each other.
So what are some of the technologies/patterns that can be used to read the 3 topics, join them all together as a single JSON document, and then dump them to, let's say, Elasticsearch?
Streaming?
You can use Kafka Streams, or KSQL, to achieve this. Which one depends on your preference/experience with Java, and also the specifics of the joins you want to do.
KSQL is the SQL streaming engine for Apache Kafka, and with SQL alone you can declare stream processing applications against Kafka topics. You can filter, enrich, and aggregate topics. Currently only stream-table joins are supported. You can see an example in this article here
The Kafka Streams API is part of Apache Kafka, and a Java library that you can use to do stream processing of data in Apache Kafka. It is actually what KSQL is built on, and supports greater flexibility of processing, including stream-stream joins.
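Framework aside, the join logic amounts to accumulating records from each topic under the shared uuid until the document is complete; a minimal sketch (the record shapes are hypothetical):

```python
from collections import defaultdict

def join_by_uuid(logs, commands, events):
    """Merge records from three topics into one document per uuid."""
    docs = defaultdict(lambda: {"logs": [], "commands": [], "events": []})
    for r in logs:
        docs[r["uuid"]]["logs"].append(r)
    for r in commands:
        docs[r["uuid"]]["commands"].append(r)
    for r in events:
        docs[r["uuid"]]["events"].append(r)
    return dict(docs)

logs = [{"uuid": "a1", "msg": "request received"}]
commands = [{"uuid": "a1", "cmd": "create-order"}]
events = [{"uuid": "a1", "event": "order-created"}]
docs = join_by_uuid(logs, commands, events)
# docs["a1"] holds all three record lists, ready to index into Elasticsearch
```

Since some events arrive 3-5 days late, whatever framework you pick needs to keep this per-uuid state durably (e.g. a state store or an external store), not just in memory.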
You can use KSQL to join the streams.
There are two constructs in KSQL: Table and Stream.
Currently, joins are supported between a stream and a table, so you need to identify which is a good fit for each of your topics.
You don't need windowing for joins.
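A sketch of what such a stream-table join could look like in KSQL (the stream, table, and column names are hypothetical):

```sql
-- hypothetical: enrich a stream of commands with request metadata from a table
CREATE STREAM enriched_commands AS
  SELECT c.uuid, c.cmd, r.source, r.received_at
  FROM commands_stream c
  LEFT JOIN requests_table r ON c.uuid = r.uuid;
```

The table side holds the latest state per uuid, while each command flowing through the stream is joined against it as it arrives.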
Benefits of using KSQL:
KSQL is easy to set up.
KSQL is a SQL-like language which helps you query your data quickly.
Drawbacks:
It's not production-ready yet, but the release is coming up in April 2018.
It's a little buggy right now but will certainly improve in the coming months.
Please have a look:
https://github.com/confluentinc/ksql
Same as the question "Is it possible to use multiple left joins in a Confluent KSQL query?": I tried to join a stream with more than one table; if that's not possible, what's the solution?
It seems you cannot have multiple join keys within the same query.

Query Kafka topic for specific record

Is there an elegant way to query a Kafka topic for a specific record? The REST API that I'm building gets an ID and needs to look up records associated with that ID in a Kafka topic. One approach is to check every record in the topic via a custom consumer and look for a match, but I'd like to avoid the overhead of reading a bunch of records. Does Kafka have a fast, built in filtering capability?
The only fast way to search for a record in Kafka (to oversimplify) is by partition and offset. The new producer class can return, via futures, the partition and offset into which a message was written. You can use these two values to very quickly retrieve the message.
So if you make the ID out of the partition and offset then you can implement your fast query. Otherwise, not so much. This means that the ID for an object isn't part of your data model, but rather is generated by the Kafka-knowledgable code.
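A sketch of the idea (the ID scheme is an assumption of your application, not a Kafka feature): encode the partition and offset into the ID at produce time, then decode them at query time to seek directly to the record.

```python
def make_id(partition, offset):
    """Encode a (partition, offset) pair into an opaque record ID."""
    return f"{partition}:{offset}"

def parse_id(record_id):
    """Decode the ID back into the partition and offset to seek to."""
    partition, offset = record_id.split(":")
    return int(partition), int(offset)

# at produce time, the send's returned metadata gives partition and offset
record_id = make_id(3, 1042)             # "3:1042"
# at query time, assign the consumer to that partition and seek to the offset
partition, offset = parse_id(record_id)  # (3, 1042)
```

With these two values a consumer can assign itself the partition, seek to the offset, and read exactly one record instead of scanning the topic.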
Maybe that works for you, maybe it doesn't.
This might be late for you, but it will help others who see this question: there is now KSQL, an open-source streaming SQL engine for Kafka.
https://github.com/confluentinc/ksql/