Can't consume messages from topic partitions in ClickHouse - apache-kafka

I am new to Kafka, and I want to know how I can consume messages from the partitions of a topic into ClickHouse tables.
When I had 3 topics, it was easy to attach a table to each topic like this:
ENGINE = Kafka SETTINGS
kafka_broker_list = 'broker:9092',
kafka_topic_list = 'topic1',
kafka_group_name = 'kafka_group',
kafka_format = 'JSONEachRow'
But I don't know how to consume messages from the partitions of a single topic into separate tables. Please help.

There are multiple ways you can do that.
Keep a table identifier in your message, like below. In your consumer you can read the table attribute and decide which table the data has to be saved to.
{
  "table": "Table1"
}
Though Kafka doesn't provide a direct way to produce to a specific partition, you can use the message key for that. Let's say the key takes one of three values: 1, 2 or 3. When a message is produced for Table1, use key 1. That way the message will go to only one partition, and the consumer for that partition can save the data to Table1 (see the producer sketch below).
Personally, I'd prefer method 1, as it doesn't couple Kafka partitioning with your business logic.
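For illustration, here is a minimal producer sketch of both methods, assuming the broker address and topic name from the question; the JSON payloads and the table-id key are just placeholders, not a fixed contract:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TableKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // taken from the question's settings
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Method 1: the target table travels inside the message itself.
            producer.send(new ProducerRecord<>("topic1", "{\"table\":\"Table1\",\"data\":\"...\"}"));

            // Method 2: key the message by a table id ("1" for Table1) so that,
            // with the default partitioner, all Table1 messages hash to the same partition.
            producer.send(new ProducerRecord<>("topic1", "1", "{\"data\":\"...\"}"));
        }
    }
}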

Related

Kafka consumer-group

I am a newbie to Kafka and learning Kafka internals. Please feel free to correct my understanding as required.
Here is my real-time scenario; I appreciate all responses:
I have a real-time FTP server which receives data files, let's say claims files.
I will publish this data into a topic; let's call the topic claims_topic (2 partitions).
I need to subscribe to claims_topic, read the messages and write them to an Oracle table and a Postgres table. Let's call the Oracle table Otable and the Postgres table Ptable.
I need to capture every message on the topic and write it to both Otable and Ptable. Basically, Otable and Ptable have to be in sync.
Assume that I will write two consumers, one for Oracle and the other for Postgres.
Question 1: Should the two consumers be in the same consumer group? I believe no, as that would lead to each consumer getting messages only from a particular partition.
Question 2: If my belief in Question 1 is correct, then please enlighten me: in what case are multiple consumers grouped under the same consumer group? A real-world scenario would be much appreciated.
A consumer group is a logical name that groups an application's consumers together; they work together towards processing the data in a topic. Each partition can be handled by only one consumer of a consumer group, making the partition count the maximum limit of parallel consumption/processing power for a topic. Each consumer in a consumer group handles one or more partitions: if you have one consumer on a topic with many partitions, it will handle all the partitions by itself; if you add more consumers to the same consumer group, they will divide ("rebalance") the topic's partitions among themselves. Hope it clears things up.
When setting up a consumer you configure its group id; this is the consumer group. Two separate consumers with the same group id become members of the same consumer group.
In cases where there is high produce throughput and one consumer cannot handle the pressure, you can scale out by running more consumers with the same consumer group to work together on processing the topic; each instance will take ownership of different partitions.
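A minimal sketch of that configuration, using the claims_topic from the question; the broker address and the group name claims-processor are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ClaimsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");  // placeholder broker address
        props.put("group.id", "claims-processor");      // the consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        // Every instance started with group.id=claims-processor joins the same
        // consumer group; the 2 partitions of claims_topic are divided among the
        // instances, so at most 2 of them consume in parallel.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("claims_topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println(record.value());
                }
            }
        }
    }
}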
For your use case, complete sync of Postgres and Oracle won't be easily achievable. You could use Kafka Connect to read data from your topic into your targets with the relevant sink connectors, but then again they will only be "eventually consistent", as they do not share an atomic transaction.
I would explore the Spring Data transactional layer:
Spring @Transactional with a transaction across multiple data sources
No, both consumers should not be in the same consumer group, because each of them needs to consume all of the topic's data separately and write it to Otable and Ptable respectively.
If both consumers are in one consumer group, then Otable gets data from one partition and Ptable gets data from the other partition (because you have 2 partitions).
In my opinion, use two consumers with two consumer groups; then, if there is high traffic on your topic, you can scale the number of consumers separately for Otable and Ptable.
If you need two consumers to write to Ptable, use the same group id for those consumers; the topic's traffic will then be shared among that number of consumers (in your case, the maximum number of consumers for one group should be 2, because you have only 2 partitions in your topic). If you need this for Otable, follow the same approach.
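A minimal sketch of that layout, with hypothetical group names and a placeholder broker address; each sink gets its own consumer group, so each group independently receives every message from claims_topic:

import java.util.Properties;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ClaimsConsumerGroups {

    static Properties configFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");  // placeholder broker address
        props.put("group.id", groupId);                 // the consumer group for this sink
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical group names: run up to 2 instances per group (one per partition).
        Properties oracleWriterConfig = configFor("claims-oracle-writer");
        Properties postgresWriterConfig = configFor("claims-postgres-writer");
        // ... create a KafkaConsumer with each config and subscribe it to "claims_topic"
    }
}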

KSQL says it expects existing topic to have 2 partitions (topic has 1)

I am facing the below error when trying to create a stream in KSQL from an existing Kafka topic:
io.confluent.ksql.exception.KafkaTopicExistsException: A Kafka topic with the name 'test-data' already exists, with different partition/replica configuration than required. KSQL expects 2 partitions (topic has 1), and 1 replication factor (topic has 1).
Is it mandatory to have 2 partitions for creating a stream in KSQL?
I'm guessing you're running a command such as:
CREATE TABLE FOO (<some column defs>)
  WITH (
    partitions=<some-value>,   -- explicitly setting partition count
    kafka_topic='test-data',
    value_format='<something>'
  );
Specifically, are you explicitly setting the partition count in the WITH clause?
Neither the PARTITIONS nor the REPLICAS properties need to be set in the WITH clause when creating a TABLE or STREAM over an existing topic. These only need to be set when you wish ksqlDB to create a new topic for your data. If they are set, they must match any existing topic.
These doc pages provide more information on this subject:
https://docs.ksqldb.io/en/latest/developer-guide/create-a-stream/
https://docs.ksqldb.io/en/latest/developer-guide/create-a-table/
If this explanation does not cover your error, then please provide more information. E.g. ksql version, statements used, details of existing topics in Kafka etc.

Join data from 4 topics in broker using Kafka Streams when updates are not same in each of the topics

I am working on a requirement to process data ingested from a SQL Data store to Kafka Broker in 4 different topics corresponding to 4 different tables in the SQL Data Store. I am using Kafka Connect to ingest the data into the topics.
I now want to join the data from these topics and aggregate them and write them back to another topic. This topic will in turn be subscribed by a consumer to populate a NOSQL Data store which will be used to render the UI.
I know Kafka Streams can be used to join topics.
My query is: the data being ingested from the SQL data store tables may not always have data for all 4 tables. Only 2 of the tables will have regular updates; one will be updated, but not at the same frequency as the other 2, and the remaining one is static (a sort of master table).
So I am not sure how we can actually join them with Kafka Streams when the record counts mismatch across the topics.
Has anyone faced a similar issue? If so, can you please share your thoughts/code snippets on the same?
The number of rows doesn't matter at all... Why should it have any impact on the join result?
You can just read all 4 topics as a KTable each, and do the join. Finally, you apply an aggregation to the join-result KTable and write the final result to a topic. Something like this:
// read each input topic as a changelog (KTable)
KTable t1 = builder.table("topic1");
KTable t2 = builder.table("topic2");
KTable t3 = builder.table("topic3");
KTable t4 = builder.table("topic4");
// join all four tables; an update on any side produces an updated join result
KTable joinResult = t1.join(t2, ...).join(t3, ...).join(t4, ...);
// aggregate the join result and write it to the output topic
joinResult.groupBy(...).aggregate(...).toStream().to("result-topic");

Kafka Streams state store backup topic partitioning strategy

Kafka guarantees that messages with the same key will always go to the same partition.
For instance, I have a message with the string key 2329, and two topics, t1 and t2. When I write this message, it goes into partition 1 in both topics, as expected.
Now the problem itself: I'm using a Kafka Streams 0.10.2.0 persistent state store, which automatically creates a backup topic. In the case of this backup topic, the message with the key 2329 goes into another partition (partition 0), which is strange to me.
Has anyone encountered this issue?
I've found what the issue was.
To increase performance, I had skipped repartitioning before writing data to the state store, and used another column from the value as the state store key. That worked until an additional topic with enrichment information was added. So I had simply forgotten to perform repartitioning.
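For anyone hitting the same thing, here is a minimal sketch of the fix, written against the current Kafka Streams DSL (not the 0.10.2 API) and using hypothetical topic, store and key-extraction details: re-key the stream by the field you actually want as the state store key and let Kafka Streams repartition before the stateful operation, so the store's changelog ("backup") topic is partitioned by that key.

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;

public class RepartitionSketch {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topic with String values of the form "customerId|payload".
        KStream<String, String> orders = builder.stream("orders");

        orders
            // Re-key by the field you intend to use as the state store key.
            .selectKey((oldKey, value) -> value.split("\\|")[0])
            // Grouping after selectKey makes Kafka Streams insert a repartition topic,
            // so records with the same new key land in the same partition of the store
            // and of its changelog ("backup") topic.
            .groupByKey()
            .count(Materialized.as("orders-per-customer-store"));

        // builder.build() would then be passed to new KafkaStreams(...).
    }
}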

Kafka used as Delivery Mechanism in News Feed

Can I create topics called update_i for different kinds of updates and partition them using user_id in a Kafka MQ? I've been through this post by confluent.io: https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ . I also know that I cannot create a topic with a dynamic number of partitions. Given these two facts (the post and the static number of Kafka partitions), what's the alternative delivery mechanism?
Can I create topics called update_i for different kinds of updates and partition them using user_id in a Kafka MQ ?
If I understand you correctly, the answer is Yes.
What you would need to do in a nutshell:
Topic configuration: Determine the required number of partitions for your topic(s). Usually, the number of partitions is determined based on (1) anticipated scale/volume of the incoming data, i.e. the Write-side of scaling, and/or (2) the required parallelism when consuming the messages for processing, i.e. the Read-side of scaling. See https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/ for details.
Writing messages to these Kafka topics (aka the side of the "Kafka producer"): In Kafka, messages are key-value pairs. In your case, you would set the message key to be the user_id. Then, when using Kafka's default "partitioner", messages for the same message key (here: user_id) would automatically be sent to the same partition -- which is what you want to achieve.
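A minimal producer-side sketch of that, assuming a hypothetical topic name update_1 and a placeholder broker address and payload:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedUpdateProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");  // placeholder broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String userId = "42";                              // the message key is the user_id
            String update = "{\"type\":\"like\",\"post\":7}";  // placeholder payload
            // With the default partitioner, every message keyed by the same user_id
            // is hashed to the same partition of the topic.
            producer.send(new ProducerRecord<>("update_1", userId, update));
        }
    }
}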
As a possible solution, I would suggest creating a number of partitions and then setting up producers to select the partition using the following rule:
user_id mod <number_of_partitions>
That will allow you to keep the order of messages for a particular user_id.
Then, if you need a consumer that processes only messages for a particular user_id, you can write a (low-level) consumer that reads a particular partition, processes only the messages sent for that particular user, and ignores all other messages.
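A minimal sketch of that rule, with a hypothetical topic name and partition count: the producer computes the partition explicitly as user_id mod <number_of_partitions>, and a low-level consumer assigns itself a single partition instead of joining a consumer group.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class UserIdPartitioningSketch {

    static final String TOPIC = "news_feed_updates";  // hypothetical topic name
    static final int NUM_PARTITIONS = 8;              // hypothetical partition count

    // Producer side: pick the partition explicitly as user_id mod <number_of_partitions>.
    static void send(KafkaProducer<String, String> producer, long userId, String update) {
        int partition = (int) (userId % NUM_PARTITIONS);
        producer.send(new ProducerRecord<>(TOPIC, partition, String.valueOf(userId), update));
    }

    // Consumer side: a "low-level" consumer that assigns itself one partition and
    // therefore only ever sees the user_ids that map to that partition.
    static void readPartition(int partition) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");  // placeholder broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(new TopicPartition(TOPIC, partition)));
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                // Filter further by key (the user_id) if only one user is of interest.
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}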