How to join multiple Kafka topics? - apache-kafka

So I have...
1st topic has general application logs (log4j). It stores things like HTTP API requests/responses, warnings, exceptions, etc. There can be multiple logs associated with one logical business request. (These logs happen within seconds of each other.)
2nd topic contains commands from the above business request which other services take action on. (The commands also happen within seconds of each other, but maybe a couple of minutes after the original request.)
3rd topic contains events generated from actions of those other services. (Most events complete within seconds, but some can take up to 3-5 days to be received)
So a single logical business request can have multiple logs, commands and events associated with it by a uuid which the microservices pass to each other.
So what are some of the technologies/patterns that can be used to read the 3 topics and join them all together as a single JSON document and then dump it to, let's say, Elasticsearch?
Streaming?

You can use Kafka Streams, or KSQL, to achieve this. Which one depends on your preference/experience with Java, and also the specifics of the joins you want to do.
KSQL is the SQL streaming engine for Apache Kafka, and with SQL alone you can declare stream processing applications against Kafka topics. You can filter, enrich, and aggregate topics. Currently, only stream-table joins are supported. You can see an example in this article here.
The Kafka Streams API is part of Apache Kafka; it is a Java library that you can use to do stream processing of data in Kafka. It is actually what KSQL is built on, and it supports greater flexibility of processing, including stream-stream joins.
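To make that concrete, here is a minimal Kafka Streams sketch of one way to do it, assuming the three topics are keyed by the uuid and carry JSON string values (the topic names, serdes, and the naive string concatenation in place of a real JSON library are all assumptions for illustration). It merges the three topics and collects everything seen for a uuid into one JSON document, which a Kafka Connect Elasticsearch sink could then index. Since some events arrive days later, an unwindowed per-uuid aggregation like this is often easier to reason about than a windowed join; just keep topic retention and state-store size in mind.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class BusinessRequestMerger {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        Consumed<String, String> asString = Consumed.with(Serdes.String(), Serdes.String());

        // The three source topics, each keyed by the business-request uuid (hypothetical names).
        KStream<String, String> logs     = builder.stream("app-logs", asString);
        KStream<String, String> commands = builder.stream("commands", asString);
        KStream<String, String> events   = builder.stream("events", asString);

        // Tag each record with its origin, merge the three streams, and
        // collect everything seen for a uuid into one JSON array.
        KTable<String, String> documents = logs
            .mapValues(v -> "{\"type\":\"log\",\"payload\":" + v + "}")
            .merge(commands.mapValues(v -> "{\"type\":\"command\",\"payload\":" + v + "}"))
            .merge(events.mapValues(v -> "{\"type\":\"event\",\"payload\":" + v + "}"))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .aggregate(
                () -> "[]",
                (uuid, record, doc) -> doc.equals("[]")
                        ? "[" + record + "]"
                        : doc.substring(0, doc.length() - 1) + "," + record + "]",
                Materialized.with(Serdes.String(), Serdes.String()));

        // Every update to a document is emitted downstream; a Kafka Connect
        // Elasticsearch sink connector can then upsert it by uuid.
        documents.toStream().to("business-request-documents",
                Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "business-request-merger");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}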

You can use KSQL to join the streams.
There are two constructs in KSQL: Stream and Table.
Currently, joins are supported only between a Stream and a Table, so you need to identify which construct is the right fit for each of your topics.
You don't need windowing for stream-table joins.
Benefits of using KSQL:
KSQL is easy to set up.
KSQL is a SQL-like language, which helps you query your data quickly.
Drawbacks:
It's not production-ready yet, but a production release is expected in April 2018.
It's a little buggy right now, but it will certainly improve over the next few months.
Please have a look.
https://github.com/confluentinc/ksql

Same as the question "Is it possible to use multiple left join in Confluent KSQL query?": I tried to join a stream with more than one table; if that's not possible, what's the solution?
And it seems like you cannot have multiple join keys within the same query.

Related

Filtering in Kafka and other streaming technologies

I am currently doing some research about which stream processing technology to use. So far I have looked at message queueing technologies and streaming frameworks. I am now leaning towards Apache Kafka or Google Pub/Sub.
The requirements I have:
Deliver, read and process messages/events in real time.
Persistence in the messages/events.
Ability to filter messages/events in real time without having to read the entire topic. For example: if I have a topic called ‘details’, I want to be able to filter the messages/events out of that topic where an attribute of an event equals a certain value.
Ability to see if the producer to a certain topic or queue is finished.
Ability to delete messages/events in a topic based on an attribute within an event equaling a certain value.
Ordering in messages/events.
My question is: what is the best framework/technology for these use cases? From what I have read so far, Kafka doesn't provide an out-of-the-box filtering approach for messages/events in topics, while Google Pub/Sub does have a filtering approach.
Any suggestions and experience would be welcome.
As per the requirements you mentioned, Kafka seems a nice fit. Using Kafka Streams or KSQL you can perform filtering in real time; here is an example: https://kafka-tutorials.confluent.io/filter-a-stream-of-events/confluent.html
What you need is more than just integration and data transfer; you need something similar to what is known as an ETL tool. Here you can find more about ETL and ETL tools in GCP: https://cloud.google.com/learn/what-is-etl
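For illustration, here is a minimal Kafka Streams sketch of that kind of filter, assuming the 'details' topic carries JSON string values and a hypothetical status attribute (a real application would use a proper JSON serde rather than a substring check). Note that this writes matching events to a new topic; it does not delete anything from the source topic.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class DetailsFilter {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Read the 'details' topic and keep only events whose attribute matches
        // the desired value; the substring check stands in for real JSON parsing.
        builder.stream("details", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null && value.contains("\"status\":\"ACTIVE\""))
               .to("details-filtered", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "details-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}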

What is stored in Kafka KSQL DB?

What kind of data is stored in ksqlDB? Only metadata about KStreams and KTables?
From what I understand, I can create KStreams and KTables, and operations between them, using either the Java API or the KSQL language. But what is the ksqlDB server actually used for, besides metadata about these objects, when I create them using the KSQL client?
I also assume that ksqlDB actually runs the stream processors, so it is also an execution engine. Does it scale automatically? Can there be multiple instances of the ksqlDB server component that communicate with each other? Is it intended for scenarios that need massive scaling, or is it just some syntactic sugar suitable for people who don't like to write Java code?
EDITED
It is explained in this video: https://www.youtube.com/embed/f3wV8W_zjwE
It does not scale automatically, but you can manually deploy multiple instances of ksqlDB server and make them join the same ksqlDB cluster, identified by ksql.service.id.
What kind of data is stored in ksqlDB?
Nothing is really stored "in" it. All the metadata about query history and running processors is stored in Kafka topics and in on-disk KTables (RocksDB state stores).
I also assume that ksqlDB actually runs the stream processors, so it is also an execution engine. Does it scale automatically? Can there be multiple instances of the ksqlDB server component that communicate with each other? Is it intended for scenarios that need massive scaling, or is it just some syntactic sugar suitable for people who don't like to write Java code?
All Yes.

Kafka Streams DSL over Kafka Consumer API

Recently, in an interview, I was asked a question about Kafka Streams; more specifically, the interviewer wanted to know why/when you would use the Kafka Streams DSL over the plain Kafka Consumer API to read and process streams of messages. I could not provide a convincing answer and am wondering if others who have used these two styles of stream processing can share their thoughts/opinions. Thanks.
As usual, it depends on the use case whether to use the Kafka Streams API or the plain KafkaProducer/KafkaConsumer clients. I would not dare to pick one over the other in general terms.
First of all, Kafka Streams is built on top of Kafka producers/consumers, so everything that is possible with Kafka Streams is also possible with plain consumers/producers.
I would say the Kafka Streams API is less complex but also less flexible compared to the plain consumers/producers. Now we could start long discussions on what "less" means.
When developing with the Kafka Streams API, you can jump directly into your business logic, applying methods like filter, map, join, or aggregate, because all the consuming and producing is abstracted away behind the scenes.
When you are developing applications with plain consumers/producers, you need to think about how you build your clients at the level of subscribe, poll, send, flush, etc.
If you want even less complexity (but also less flexibility), ksqlDB is another option you can choose to build your Kafka applications.
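To make that contrast concrete, here is a rough sketch of the same simple filtering job written both ways; the topic names and the substring predicate are made up for illustration, and the plain-client version still leaves out commits, rebalancing, error handling and shutdown.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class DslVersusClients {

    // Kafka Streams DSL: declare what should happen, then start the returned instance.
    // streamsProps must contain application.id and bootstrap.servers.
    static KafkaStreams streamsVersion(Properties streamsProps) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-events", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((k, v) -> v != null && v.contains("\"level\":\"ERROR\""))
               .to("error-events", Produced.with(Serdes.String(), Serdes.String()));
        return new KafkaStreams(builder.build(), streamsProps);
    }

    // Plain clients: subscribe, poll, decide, send, all managed by hand.
    // consumerProps must contain bootstrap.servers and group.id.
    static void clientsVersion(Properties consumerProps, Properties producerProps) {
        try (KafkaConsumer<String, String> consumer =
                     new KafkaConsumer<>(consumerProps, new StringDeserializer(), new StringDeserializer());
             KafkaProducer<String, String> producer =
                     new KafkaProducer<>(producerProps, new StringSerializer(), new StringSerializer())) {
            consumer.subscribe(List.of("input-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() != null && record.value().contains("\"level\":\"ERROR\"")) {
                        producer.send(new ProducerRecord<>("error-events", record.key(), record.value()));
                    }
                }
            }
        }
    }
}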
Here are some of the scenarios where you might prefer the Kafka Streams API over the core Producer / Consumer API:
It allows you to build a complex processing pipeline with ease. So, let's assume (a contrived example) you have a topic containing customer orders and you want to filter the orders based on a delivery city and save them into a DB table for persistence and an Elasticsearch index for a quick search experience. In such a scenario, you'd consume the messages from the source topic, filter out the unnecessary orders based on city using the Streams DSL filter function, store the filtered data in a separate Kafka topic (using KStream.to() or KTable.toStream().to()), and finally, using Kafka Connect, the messages would be stored in the database table and Elasticsearch. You can do the same thing using the core Producer / Consumer API as well, but it would require much more coding.
In a data processing pipeline, you can do the consume-process-produce cycle in the same transaction. So, in the above example, Kafka will ensure exactly-once semantics and transactions from the source topic up to the DB and Elasticsearch. There won't be any duplicate messages introduced due to network glitches and retries. This feature is especially useful when you are doing aggregates such as the count of orders per individual product. In such scenarios duplicates will always give you a wrong result.
You can also enrich your incoming data with very low latency. Let's assume in the above example that you want to enrich the order data with the customer's email address from your stored customer data. In the absence of Kafka Streams, what would you do? You'd probably invoke a REST API for each incoming order over the network, which would definitely be an expensive operation impacting your throughput. In such a case, you might want to store the required customer data in a compacted Kafka topic and load it in the streaming application using a KTable or GlobalKTable. Now all you need to do is a simple local lookup in the KTable for the customer's email address. Note that the KTable data will be stored in the embedded RocksDB which ships with Kafka Streams, and since the KTable is backed by a Kafka topic, your data in the streaming application will be continuously updated in real time. In other words, there won't be stale data. This is essentially an example of the materialized view pattern.
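As a rough sketch of that lookup pattern, assuming hypothetical 'orders' and compacted 'customers-compacted' topics with JSON string values (the key extraction and string concatenation stand in for real JSON handling):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class OrderEnricher {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Customer data from a compacted topic, fully replicated to every instance.
        GlobalKTable<String, String> customers =
                builder.globalTable("customers-compacted", Consumed.with(Serdes.String(), Serdes.String()));

        // Orders keyed by order id, with the customer id somewhere in the JSON value.
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

        orders.join(customers,
                    (orderId, order) -> extractCustomerId(order),             // which customer to look up
                    (order, customer) -> order + " enrichedWith " + customer) // local RocksDB lookup, no REST call
              .to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }

    // Stand-in for real JSON parsing of the order payload.
    private static String extractCustomerId(String orderJson) {
        return orderJson.replaceAll(".*\"customerId\":\"([^\"]*)\".*", "$1");
    }
}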
Let's say you want to join two different streams of data. So, in the above example, you want to process only the orders that have successful payments, and the payment data is coming through another Kafka topic. Now, it may happen that the payment gets delayed, or the payment event comes before the order event. In such a case, you may want to do a one-hour windowed join, so that if the order and the corresponding payment events arrive within a one-hour window, the order is allowed to proceed down the pipeline for further processing. As you can see, you need to store the state for a one-hour window, and that state will be stored in the RocksDB of Kafka Streams.
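And a minimal sketch of that one-hour windowed join, again with hypothetical topic names, plain string values, and both topics keyed by order id:

import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class PaidOrdersJoin {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        Consumed<String, String> asString = Consumed.with(Serdes.String(), Serdes.String());

        KStream<String, String> orders   = builder.stream("orders", asString);
        KStream<String, String> payments = builder.stream("payments", asString);

        // Inner join within a one-hour window: an order only proceeds if a matching
        // payment arrives within an hour of it (in either direction). The window
        // state lives in the local RocksDB stores backing the join.
        orders.join(payments,
                    (order, payment) -> order + " paidBy " + payment,
                    JoinWindows.of(Duration.ofHours(1)),
                    StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
              .to("paid-orders", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "paid-orders-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}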

Merge multiple events from RDBMS in Kafka

We are capturing change data capture (CDC) events from different tables of an RDBMS database. Each individual change is treated as an event. All the events are published to a single Kafka topic. Every event (message) has the table name as a header. We need to cater to certain use cases where we need to merge multiple events and populate the output.
The entire thing happens in real time.
We are using Apache Kafka.
Not sure what you mean exactly by merging events, but this seems to be in the Kafka Streams domain.
You can model each of your event types as streams or KTables and apply a Kafka Streams topology to them (joining streams of events and applying some business logic, for instance).
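If it helps, here is a rough sketch of that idea, assuming the table name can be read from the message value (reading it from the record header instead would need processValues()/transformValues()) and that the events you want to merge share the same key; all topic and table names below are made up.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class CdcMerger {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // All CDC events arrive on one topic.
        KStream<String, String> cdc =
                builder.stream("cdc-events", Consumed.with(Serdes.String(), Serdes.String()));

        // Keep the latest state per key for each table of interest
        // (the substring checks stand in for real header/JSON handling).
        KTable<String, String> customers = cdc
                .filter((k, v) -> v != null && v.contains("\"table\":\"CUSTOMERS\""))
                .toTable(Materialized.with(Serdes.String(), Serdes.String()));
        KTable<String, String> addresses = cdc
                .filter((k, v) -> v != null && v.contains("\"table\":\"ADDRESSES\""))
                .toTable(Materialized.with(Serdes.String(), Serdes.String()));

        // Merge the two tables on their shared key and publish the combined view.
        customers.join(addresses, (customer, address) -> customer + " withAddress " + address)
                 .toStream()
                 .to("customer-merged", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cdc-merger");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}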
But do you need more technical suggestions?
Yannick

Can Kafka streams deal with joining streams efficiently?

I'm new to Kafka and I'd like to know if what I'm planning is possible and reasonable to implement.
Suppose we have two sources, s1 and s2 that emit some messages to topics t1 and t2 respectively. Now, I'd like to have a sink which listens to both topics and I'd like it to process tuples of messages <m1, m2> where m1.key == m2.key.
If m1.key was never found in some message of s2, then the sink completely ignores m1.key (will never process it).
In summary, the sink will work only on keys that s1 and s2 worked on.
A traditional and maybe naive solution would be to have some sort of cache or storage and to work on an item only when both of the messages are in the cache.
I'd like to know if Kafka offers a solution to this problem.
Most modern stream processing engines, such as Apache Flink, Kafka Streams or Spark Streaming, can solve this problem for you. All three have battle-tested Kafka consumers built for use cases like this.
Even within those frameworks, there are multiple different ways to achieve a streaming join like the above.
In Flink for example, one could use the Table API which has a SQL-like syntax.
What I have used in the past looks a bit like the example in this SO answer (you can just replace fromElements with a Kafka Source).
One thing to keep in mind when working with streams is that you do NOT have any ordering guarantees when consuming data from two Kafka topics t1 and t2. Your code needs to account for messages arriving in any order.
Edit: I just realised your question was probably about how to implement the join using Kafka Streams, as opposed to a stream of data from Kafka. In that case you will probably find relevant info here.
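For what it's worth, here is a minimal Kafka Streams sketch of that behaviour: modelling t1 and t2 as KTables and doing an inner join means output is produced only for keys that have appeared on both sides, which matches your requirement. It assumes string values and that keeping only the latest message per key is acceptable; if every <m1, m2> pair matters, you would use a windowed KStream-KStream join instead.

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class TwoTopicInnerJoin {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // The latest message per key from each topic, backed by local state stores.
        KTable<String, String> t1 = builder.table("t1", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> t2 = builder.table("t2", Consumed.with(Serdes.String(), Serdes.String()));

        // Inner join: a result is emitted only for keys present in BOTH topics,
        // which is exactly the "ignore m1.key unless s2 saw it too" behaviour.
        t1.join(t2, (m1, m2) -> "{\"m1\":" + m1 + ",\"m2\":" + m2 + "}")
          .toStream()
          .to("t1-t2-joined", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "t1-t2-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}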