Using Kafka KSQL to select all events of a topic from a specific partition with given offset

Problem: I have a table in an external database containing the Kafka events I polled from the Kafka bus the last time. The table stores, for every event, the composite primary key PK(topic, partition, offset), so for every topic and partition I can easily determine the latest event I have already processed.
Now I would like to run a select like this:
SELECT event
FROM topic
WHERE event.partition = partition0 AND event.offset > partition0.offset
OR event.partition = partition1 AND event.offset > partition1.offset
...
And of course I would like the statement to return immediately with all events currently in the topic, writing the result to an HDFS file.
How would I do that with KSQL?
N.B.: Ideally I would put all partitions with their corresponding offsets as pairs into an array and use that array in the WHERE clause; that would be the premium solution.
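For illustration only (not a confirmed solution): recent ksqlDB versions expose the ROWPARTITION and ROWOFFSET pseudocolumns, so a query along these lines might come close, assuming a stream has already been declared over the topic (the stream name events_stream and the offset literals below are hypothetical placeholders for the values read from the external table):
SELECT *
FROM events_stream
WHERE (ROWPARTITION = 0 AND ROWOFFSET > 1042)
   OR (ROWPARTITION = 1 AND ROWOFFSET > 998)
EMIT CHANGES;
-- In recent ksqlDB versions, omitting EMIT CHANGES turns this into a pull query
-- that scans the existing records and terminates.
-- Note: ksqlDB does not write to HDFS itself; landing the result in an HDFS file
-- would typically be handled downstream, e.g. by an HDFS sink connector.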

Related

Can't join between stream and stream in ksqldb

I would like to ask about a problem where a stream-stream join in ksqlDB returns no results.
Situation and problem
I have created two ksqlDB streams from topics containing events from different databases (PostgreSQL, MSSQL) and am joining them on a specific column in each stream.
For clarity, I will call the two streams stream1 and stream2, and the join column in each stream target_col.
type                name       join target column
stream in ksqldb    stream1    target_col
stream in ksqldb    stream2    target_col
The problem is that these two streams produce no join results with the query below.
select * from stream1 join stream2 within 1 minutes on stream1.target_col=stream2.target_col emit changes
1. Meeting the join requirements
According to the official ksqlDB documentation, co-partitioning, which is required for joins, comes down to the following three conditions, and I have confirmed that both streams satisfy them.
Co-partitioning requirements (source):
1. The input records for the join must have the same key schema.
-> The DESCRIBE stream1 and DESCRIBE stream2 commands confirmed that the join key schema of stream1 and stream2 is the same (STRING).
2. The input records must have the same number of partitions on both sides.
-> Both streams were declared with the same number of partitions in the CREATE STREAM ~ WITH (PARTITIONS=1, ~) statement. The source topics that the streams are subscribed to also have one partition each.
3. Both sides of the join must have the same partitioning strategy.
-> The source topics that the streams are subscribed to have only one partition, so all records end up in the same partition regardless of the partitioning strategy; differing strategies should therefore not matter here.
2. Time difference between records.
The timestamp and partition number of the records were verified through the pseudocolumns. The queries used are as follows:
select target_col, rowtime, rowpartition from stream1 emit changes
select target_col, rowtime, rowpartition from stream2 emit changes
When the join key column has the same value, the partition number is the same (e.g. 0), and the record timestamps are no more than 2 seconds apart.
Therefore, I don't think the 1-minute join window (WITHIN 1 MINUTES) of the query in question is the problem.
3. Data Acquisition
Here's how the data reaches the topics that the streams are subscribed to:
postgresql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream1
mssql --(kafka connect/ confluent jdbc source connector)--> kafka topic --> stream2
Because I use data from different databases, I used the appropriate JDBC driver jar (mssql-jdbc-7.2.1.jre8.jar, postgresql-42.3.1.jar) for each database on the same connector.
I built the Kafka ecosystem using the official Confluent Docker images (zookeeper, broker, connect, ksqldb-server, ksqldb-cli).
In this situation, please advise whether there is any way to solve the join problem.
Thank you.
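As a sketch of one more thing worth checking (not a confirmed fix): ksqlDB joins are performed on the record key, and records written by the JDBC source connector frequently have no key at all, so explicitly rekeying both streams on the join column before joining is a common troubleshooting step (the *_rekeyed stream names below are hypothetical):
-- Rekey both streams so target_col becomes the record key.
CREATE STREAM stream1_rekeyed AS
    SELECT * FROM stream1
    PARTITION BY target_col
    EMIT CHANGES;
CREATE STREAM stream2_rekeyed AS
    SELECT * FROM stream2
    PARTITION BY target_col
    EMIT CHANGES;
-- Then run the windowed join against the rekeyed streams.
SELECT *
FROM stream1_rekeyed s1
JOIN stream2_rekeyed s2 WITHIN 1 MINUTES
    ON s1.target_col = s2.target_col
EMIT CHANGES;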

Kafka - how to send messages to a specific partition based on a table's field value via Debezium configuration

Is it possible to send messages to a specific partition based on a table field value? For example, I have a column called customer, which has 4 values, say customer1, customer2, customer3, customer4. I want to send each to its corresponding partition.
Is it possible to achieve this in the Debezium configuration?
By default, Debezium will write Kafka records into partitions based on the record key, e.g. the database row's id. There's no guarantee that "customer1" goes to "partition 1", and two customers may end up in the same partition (e.g. you may have more customers than partitions).
To explicitly map the data to numbered partitions, you'll need to implement your own Partitioner (Kafka's Java interface), add it to the Connect worker classpath, and set producer.override.partitioner.class in the Debezium connector config.
Or you can just let the producer partition based on the key of the records, as is expected.
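For reference, a minimal sketch of that override in the connector configuration (the partitioner class name is hypothetical; the class has to be on the Connect worker classpath, and the worker's connector.client.config.override.policy must permit producer overrides):
# Hypothetical fragment of the Debezium connector configuration
producer.override.partitioner.class=com.example.CustomerFieldPartitioner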

Use Session window in Kafka stream to order records and insert into MySQL database

As per the ksqlDB documentation, a session window can be used to order records by timestamp and do aggregation.
I have a use case where I want to insert records into MySQL in sequence.
I have a timestamp field in my record that I used as ROWTIME, then tried a session window over it and inserted the result into an output stream that pushes to a topic and then to RDS. But in the output stream I was not able to reorder the messages by timestamp.
Example -
There are two records, Record 1 at 11:00 AM and Record 2 at 11:01 AM, and both have the same primary key. These two records are ingested into Kafka in the order Record 2, Record 1. But in MySQL I need Record 1 and then Record 2, as Record 1 has the lower timestamp. I tried a session window of 5 minutes in the stream, but in the output stream the order is always Record 2, Record 1.
Is this scenario possible inside Kafka? Can I reorder the records inside Kafka and then push them into a stream using an INSERT INTO statement?
Currently I am trying to do this using KSQL queries, as I am using Confluent Kafka.
Session windows do not change the order of records; they GROUP together records that have the same key and are within some time period of each other.
Hence session windows are not going to allow you to reorder messages.
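For illustration, a session window in ksqlDB is an aggregation: rows that share a key and fall within the inactivity gap are collapsed into one session result, not re-emitted individually in timestamp order. A minimal sketch with hypothetical stream and column names:
-- Rows per key are aggregated into sessions; individual rows are not reordered.
SELECT record_key,
       COUNT(*) AS events_in_session,
       WINDOWSTART AS session_start
FROM input_stream
WINDOW SESSION (5 MINUTES)
GROUP BY record_key
EMIT CHANGES;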
Reordering messages is not a use-case ksqlDB is suited for at present. You may have better luck if you tried to write a Kafka Streams based application.
Kafka Streams would allow you to use a state-store to buffer input messages for some time to allow for out-of-order messages. You should be able to use punctuation to trigger outputting of cached messages after some time period. You will need to choose how long you are willing to buffer the input to allow for out-of-order messages.

Can't consume messages from topic partitions in ClickHouse

I am new to Kafka and I want to know how I can consume messages from the partitions of a topic into ClickHouse tables, like this:
When I have 3 topics, it is easy to connect a table to each topic:
ENGINE = Kafka SETTINGS
kafka_broker_list = 'broker:9092',
kafka_topic_list = 'topic1',
kafka_group_name = 'kafka_group',
kafka_format = 'JSONEachRow'
But I don't know how to consume messages from the partitions of one topic into separate tables. Please help.
There are multiple ways you can do that:
1. Keep an identifier in your message, like below. In your consumer you can read the table attribute and decide which table the data should be saved to.
{
  "table": "Table1"
}
2. Though Kafka doesn't provide a direct way to produce to a specific partition, you can use the record key for that. Say the key takes one of three values: 1, 2, 3. When a message is produced for Table1, use key 1. That way the message will go to only one partition, and the consumer for that partition can save the data in Table1.
Personally I'd prefer method 1, as it doesn't couple Kafka processing with your business logic.
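If the destination is ClickHouse, a minimal sketch of method 1 could combine one Kafka engine table for the whole topic with one materialized view per destination table. All table and column names below are hypothetical, the destination tables Table1/Table2 are assumed to already exist with a matching payload column, and the messages are assumed to carry the "table" field shown above:
-- One Kafka engine table consumes the whole topic.
CREATE TABLE kafka_queue
(
    `table` String,
    `payload` String
)
ENGINE = Kafka SETTINGS
    kafka_broker_list = 'broker:9092',
    kafka_topic_list = 'topic1',
    kafka_group_name = 'kafka_group',
    kafka_format = 'JSONEachRow';
-- One materialized view per destination table filters on the identifier.
CREATE MATERIALIZED VIEW mv_table1 TO Table1 AS
    SELECT payload FROM kafka_queue WHERE `table` = 'Table1';
CREATE MATERIALIZED VIEW mv_table2 TO Table2 AS
    SELECT payload FROM kafka_queue WHERE `table` = 'Table2';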

How many records are stored in each offset of kafka partition?

I came across the following statement in the official Kafka documentation:
For each topic, the Kafka cluster maintains a partitioned log
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
So, let's say we have a Kafka topic called "emprecords" and assume for now that it has only one partition, and in that partition we have 10 offsets, from 0 to 9.
My question is
Does each offset have the ability to store only one record?
Or
Does each offset have the ability to store more than one record?
For each partition, each offset can only be assigned to one record.