Kafka streams co-partitioning vs interactive query - apache-kafka

I have the following topology:
topology.addSource(WS_CONNECTION_SOURCE, new StringDeserializer(), new WebSocketConnectionEventDeserializer()
, utilService.getTopicByType(TopicType.CONNECTION_EVENTS_TOPIC))
.addProcessor(SESSION_PROCESSOR, WSUserSessionProcessor::new, WS_CONNECTION_SOURCE)
.addStateStore(sessionStoreBuilder, SESSION_PROCESSOR)
.addSink(WS_STATUS_SINK, utilService.getTopicByType(TopicType.ONLINE_STATUS_TOPIC),
stringSerializer, stringSerializer
, SESSION_PROCESSOR)
//WS session routing
.addSource(WS_NOTIFICATIONS_SOURCE, new StringDeserializer(), new StringDeserializer(),
utilService.getTopicByType(TopicType.NOTIFICATION_TOPIC))
.addProcessor(WS_NOTIFICATIONS_ROUTE_PROCESSOR, SessionRoutingEventGenerator::new,
WS_NOTIFICATIONS_SOURCE)
.addSink(WS_NOTIFICATIONS_DELIVERY_SINK, new NodeTopicNameExtractor(), WS_NOTIFICATIONS_ROUTE_PROCESSOR)
.addStateStore(userConnectedNodesStoreBuilder, WS_NOTIFICATIONS_ROUTE_PROCESSOR, SESSION_PROCESSOR);
As you can see there are 2 source topics. State store is built from the first topic and the second flow reads the state store. When I start the topology, I see those stream threads are assigned the same partitions (co-partitioning) of both source topics. I assume this is because the state store is accessed by the second topic flow.
This is functionally working fine. But there is a performance problem. When there is a surge in the volume of input data to the first source topic, which updates state-store, second topic processing is delayed.
For me, the second topic should be processed as fast as possible. Delay in processing the first topic is fine.
I am thinking of the following strategy:
Current configuration:
WS_CONNECTION_SOURCE - 30 partitions
WS_NOTIFICATIONS_SOURCE - 30 partitions
streamThreads: 10
appInstances: 3
New configuration:
WS_CONNECTION_SOURCE - 15 partitions
WS_NOTIFICATIONS_SOURCE - 30 partitions
streamThreads: 10
appInstances: 3
Since there is no co-partitioning, tasks has to use interactive query to access store
The idea is out of 10 threads, 5 threads will only process the second topic which can alleviate the current problem when there is a surge in the first topic.
Here are my questions:
1. Is this strategy correct? To avoid co-partitioning and use interactive query
2. Is there a chance that Kafka will assign 10 partitions of WS_CONNECTION_SOURCE
to one instance since there are 10 threads and one instance won't get any?
3. Is there any better approach to solve the performance problem?

State store and Interactive Query are Kafka Streams abstraction.
To use Interactive Query you have to define state store (using Kafka Streams API) and that enforce you to have same number of partitions, for inputs topics.
I think your solution will not work. Interactive query are for exposing ability to query state store outside the Kafka Streams (not for access within Processor API)
Maybe you can review your SESSION_PROCESSOR source code and extract more work to Process from the other topology and publish result to intermediate topic and then based on that build that state store.
Additionally:
Currently Kafka Streams doesn't support prioritization for input topics. There is KIP about priorities for Source topic: KIP-349. Unfortunately linked Jira ticket was closed as Won't FIX (https://issues.apache.org/jira/browse/KAFKA-6690)

Related

Streaming from sub-set of partitions within a topic (Kafka Streams)

I've got a question about design.
We are working on building Kafka dedupe processing.
As far as I understand leveraging Kafka Stream is the possible API candidate.
Our topics are divided into 100 partitions. To scale the process we are examining an option,
a single Docker container (K8S pod) as a processing task per partition or group of partitions.
My question is if Kafka Stream API allows creating a stream from a specific topic.
Since our time window is very long (2 weeks) then the stateful process store grows to a large size.
Will appreciate any idea, what is the most proper way how to tackle it.

Get latest values from a topic on consumer start, then continue normally

We have a Kafka producer that produces keyed messages in a very high frequency to topics whose retention time = 10 hours. These messages are real-time updates and the used key is the ID of the element whose value has changed. So the topic is acting as a changelog and will have many duplicate keys.
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal, keeping the minimum load on Kafka server and letting the consumer do most of the job. We tried many ways and none of them seems the best.
What we tried:
1 changelog topic + 1 compact topic:
The producer sends the same message to both topics wrapped in a transaction to assure successful send.
Consumer launches and requests the latest offset of the changelog topic.
Consumes the compacted topic from beginning to construct the table.
Continues consuming the changelog since the requested offset.
Cons:
Having duplicates in compacted topic is a very high possibility even with setting the log compaction frequency the highest possible.
x2 number of topics on Kakfa server.
KSQL:
With KSQL we either have to rewrite a KTable as a topic so that consumer can see it (Extra topics), or we will need consumers to execute KSQL SELECT using to KSQL Rest Server and query the table (Not as fast and performant as Kafka APIs).
Kafka Consumer API:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Kafka Streams:
By using KTables as following:
KTable<Integer, MarketData> tableFromTopic = streamsBuilder.table("topic_name", Consumed.with(Serdes.Integer(), customSerde));
KTable<Integer, MarketData> filteredTable = tableFromTopic.filter((key, value) -> keys.contains(value.getRiskFactorId()));
Kafka Streams will create 1 topic on Kafka server per KTable (named {consumer_app_id}-{topic_name}-STATE-STORE-0000000000-changelog), which will result in a huge number of topics since we a big number of consumers.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Thanks in advance.
By using KTables, Kafka Streams will create 1 topic on Kafka server per KTable, which will result in a huge number of topics since we a big number of consumers.
If you are just reading an existing topic into a KTable (via StreamsBuilder#table()), then no extra topics are being created by Kafka Streams. Same for KSQL.
It would help if you could clarify what exactly you want to do with the KTable(s). Apparently you are doing something that does result in additional topics being created?
1 changelog topic + 1 compact topic:
Why were you thinking about having two separate topics? Normally, changelog topics should always be compacted. And given your use case description, I don't see a reason why it should not be:
Now, what we're trying to achieve is that when a Kafka consumer launches, regardless of the last known state (new consumer, crashed, restart, etc..), it will somehow construct a table with the latest values of all the keys in a topic, and then keeps listening for new updates as normal [...]
Hence compaction would be very useful for your use case. It would also prevent this problem you described:
Consumer starts and consumes the topic from beginning. This worked perfectly, but the consumer has to consume the 10 hours change log to construct the last values table.
Note that, to reconstruct the latest table values, all three of Kafka Streams, KSQL, and the Kafka Consumer must read the table's underlying topic completely (from beginning to end). If that topic is NOT compacted, this might indeed take a long time depending on the data volume, topic retention settings, etc.
From what we have tried, it looks like we need to either increase the server load, or the consumer launch time. Isn't there a "perfect" way to achieve what we're trying to do?
Without knowing more about your use case, particularly what you want to do with the KTable(s) once they are populated, my answer would be:
Make sure the "changelog topic" is also compacted.
Try KSQL first. If this doesn't satisfy your needs, try Kafka Streams. If this doesn't satisfy your needs, try the Kafka Consumer.
For example, I wouldn't use the Kafka Consumer if it is supposed to do any stateful processing with the "table" data, because the Kafka Consumer lacks built-in functionality for fault-tolerant stateful processing.
Consumer starts and consumes the topic from beginning. This worked
perfectly, but the consumer has to consume the 10 hours change log to
construct the last values table.
During the first time your application starts up, what you said is correct.
To avoid this during every restart, store the key-value data in a file.
For example, you might want to use a persistent map (like MapDB).
Since you give the consumer group.id and you commit the offset either periodically or after each record is stored in the map, the next time your application restarts it will read it from the last comitted offset for that group.id.
So the problem of taking a lot of time occurs only initially (during first time). So long as you have the file, you don't need to consume from beginning.
In case, if the file is not there or is deleted, just seekToBeginning in the KafkaConsumer and build it again.
Somewhere, you need to store this key-values for retrieval and why cannot it be a persistent store?
In case if you want to use Kafka streams for whatever reason, then an alternative (not as simple as the above) is to use a persistent backed store.
For example, a persistent global store.
streamsBuilder.addGlobalStore(Stores.keyValueStoreBuilder(Stores.persistentKeyValueStore(topic), keySerde, valueSerde), topic, Consumed.with(keySerde, valueSerde), this::updateValue);
P.S: There will be a file called .checkpoint in the directory which stores the offsets. In case if the topic is deleted in the middle you get OffsetOutOfRangeException. You may want to avoid this, perhaps by using UncaughtExceptionHandler
Refer to https://stackoverflow.com/a/57301986/2534090 for more.
Finally,
It is better to use Consumer with persistent file rather than Streams for this, because of simplicity it offers.

Dynamically create and change Kafka topics with Flink

I'm using Flink to read and write data from different Kafka topics.
Specifically, I'm using the FlinkKafkaConsumer and FlinkKafkaProducer.
I'd like to know if it is possible to change the Kafka topics I'm reading from and writing to 'on the fly' based on either logic within my program, or the contents of the records themselves.
For example, if a record with a new field is read, I'd like to create a new topic and start diverting records with that field to the new topic.
Thanks.
If you have your topics following a generic naming pattern, for example, "topic-n*", your Flink Kafka consumer can automatically reads from "topic-n1", "topic-n2", ... and so on as they are added to Kafka.
Flink 1.5 (FlinkKafkaConsumer09) added support for dynamic partition discovery & topic discovery based on regex. This means that the Flink-Kafka consumer can pick up new Kafka partitions without needing to restart the job and while maintaining exactly-once guarantees.
Consumer constructor that accepts subscriptionPattern: link.
Thinking more about the requirement,
1st step is - You will start from one topic (for simplicity) and will spawn more topic during runtime based on the data provided and direct respective messages to these topics. It's entirely possible and will not be a complicated code. Use ZkClient API to check if topic-name exists, if does not exist create a model topic with new name and start pushing messages into it through a new producer tied to this new topic. You don't need to restart job to produce messages to a specific topic.
Your initial consumer become producer(for new topics) + consumer(old topic)
2nd step is - You want to consume messages for new topic. One way could be to spawn a new job entirely. You can do this be creating a thread pool initially and supplying arguments to them.
Again be more careful with this, more automation can lead to overload of cluster in case of a looping bug. Think about the possibility of too many topics created after some time if input data is not controlled or is simply dirty. There could be better architectural approaches as mentioned above in comments.

How does Kinesis achieve Kafka style Consumer Groups?

In Kafka, I can split my topic into many partitions. I cannot have more consumers than partitions in Kafka, because the partition is used as a way to scale out a topic. If I have more load, I can increase the number of partitions, which will allow me to increase the number of consumers, which will allow me to have more threads / processes processing on a given topic.
In Kafka, there is a concept of a Consumer Group. If we have 10 consumer groups on a single topic, each consumer group will have the opportunity to process every message in a topic. The consumer group still takes advantage of the scalability from the partitions (i.e. Each consumer group can have up to 'n' consumers, where 'n' is the number of partitions on a topic). This is the beauty of kafka, scalability and multi-channel reading are two separate concepts with two separate knobs to turn.
In Kinesis, we are told that, if you use the Kinesis Library Client you can get the same functionality as consumer groups by defining different Kinesis Applications. In other words, we can have different Kinesis Applications independently streaming all records from the same stream and different times.
We are also told that "Amazon Kinesis Client Library (KCL) automatically creates an Amazon DynamoDB table for each Amazon Kinesis Application to track and maintain state information such as resharding events and sequence number checkpoints."
OK, So I'm getting ready to start reading through the KCL code here, but I'm hoping someone can answer these questions to save me some time.
How does the KCL actually do this?
Are there diagrams somewhere explaining the process?
If I started a new Kinesis Application (MyKinesisApp1) after a record was already produced and consumed by all prior Kinesis Applications, will the new Kinesis Application (MyKinesisApp1) still have an opportunity to consume that record? In other words, does Kinesis remove the record from its stream after it has been processed, or does it leave it there for the 7 days no matter what?
I have seen this question here but it doesn't answer my question. Especially my third question! Also, this question does a direct comparison between two similar technologies. It will help people that know Kafka, learn Kinesis more quickly.
In the KCL configuration, there is a section "appName" which corresponds to "Application Name" and that is the same as "consumer group" in Kafka. For each consumer group (ie. Kinesis Streams Consumer Application) there is a DynamoDB table. You can see an example DynamoDB here (the KCL appName is 'quickstats-development'): AWS Kinesis leaseOwner confusion
No, as far as I know, there is not. "Kinesis Streams" is similar to Kafka, but other than that, not much graphical representation.
Yes. Each Kafka Consumer-Group is represented as a different DynamoDB table in Kinesis. That way, different Kinesis Consumer Applications can consume same record independently. The checkpoint in Kinesis is the Offset value of Kafka. And a checkpoint in DynamoDB is the cursor of reading point in a Kinesis shard. Read this answer for a similar example: https://stackoverflow.com/a/42833193/1622134

Data Modeling with Kafka? Topics and Partitions

One of the first things I think about when using a new service (such as a non-RDBMS data store or a message queue) is: "How should I structure my data?".
I've read and watched some introductory materials. In particular, take, for example, Kafka: a Distributed Messaging System for Log Processing, which writes:
"a Topic is the container with which messages are associated"
"the smallest unit of parallelism is the partition of a topic. This implies that all messages that ... belong to a particular partition of a topic will be consumed by a consumer in a consumer group."
Knowing this, what would be a good example that illustrates how to use topics and partitions? When should something be a topic? When should something be a partition?
As an example, let's say my (Clojure) data looks like:
{:user-id 101 :viewed "/page1.html" :at #inst "2013-04-12T23:20:50.22Z"}
{:user-id 102 :viewed "/page2.html" :at #inst "2013-04-12T23:20:55.50Z"}
Should the topic be based on user-id? viewed? at? What about the partition?
How do I decide?
When structuring your data for Kafka it really depends on how it´s meant to be consumed.
In my mind, a topic is a grouping of messages of a similar type that will be consumed by the same type of consumer so in the example above, I would just have a single topic and if you´ll decide to push some other kind of data through Kafka, you can add a new topic for that later.
Topics are registered in ZooKeeper which means that you might run into issues if trying to add too many of them, e.g. the case where you have a million users and have decided to create a topic per user.
Partitions on the other hand is a way to parallelize the consumption of the messages. The total number of partitions in a broker cluster need to be at least the same as the number of consumers in a consumer group to make sense of the partitioning feature. Consumers in a consumer group will split the burden of processing the topic between themselves according to the partitioning so that one consumer will only be concerned with messages in the partition itself is "assigned to".
Partitioning can either be explicitly set using a partition key on the producer side or if not provided, a random partition will be selected for every message.
Once you know how to partition your event stream, the topic name will be easy, so let's answer that question first.
#Ludd is correct - the partition structure you choose will depend largely on how you want to process the event stream. Ideally you want a partition key which means that your event processing is partition-local.
For example:
If you care about users' average time-on-site, then you should partition by :user-id. That way, all the events related to a single user's site activity will be available within the same partition. This means that a stream processing engine such as Apache Samza can calculate average time-on-site for a given user just by looking at the events in a single partition. This avoids having to perform any kind of costly partition-global processing
If you care about the most popular pages on your website, you should partition by the :viewed page. Again, Samza will be able to keep a count of a given page's views just by looking at the events in a single partition
Generally, we are trying to avoid having to rely on global state (such as keeping counts in a remote database like DynamoDB or Cassandra), and instead be able to work using partition-local state. This is because local state is a fundamental primitive in stream processing.
If you need both of the above use-cases, then a common pattern with Kafka is to first partition by say :user-id, and then to re-partition by :viewed ready for the next phase of processing.
On topic names - an obvious one here would be events or user-events. To be more specific you could go with with events-by-user-id and/or events-by-viewed.
This is not exactly related to the question, but in case you already have decided upon the logical segregation of records based on topics, and want to optimize the topic/partition count in Kafka, this blog post might come handy.
Key takeaways in a nutshell:
In general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve. Let the max throughout achievable on a single partition for production be p and consumption be c. Let’s say your target throughput is t. Then you need to have at least max(t/p, t/c) partitions.
Currently, in Kafka, each broker opens a file handle of both the index and the data file of every log segment. So, the more partitions, the higher that one needs to configure the open file handle limit in the underlying operating system. E.g. in our production system, we once saw an error saying too many files are open, while we had around 3600 topic partitions.
When a broker is shut down uncleanly (e.g., kill -9), the observed unavailability could be proportional to the number of partitions.
The end-to-end latency in Kafka is defined by the time from when a message is published by the producer to when the message is read by the consumer. As a rule of thumb, if you care about latency, it’s probably a good idea to limit the number of partitions per broker to 100 x b x r, where b is the number of brokers in a Kafka cluster and r is the replication factor.
I think topic name is a conclusion of a kind of messages, and producer publish message to the topic and consumer subscribe message through subscribe topic.
A topic could have many partitions. partition is good for parallelism. partition is also the unit of replication,so in Kafka, leader and follower is also said at the level of partition. Actually a partition is an ordered queue which the order is the message arrived order. And the topic is composed by one or more queue in a simple word. This is useful for us to model our structure.
Kafka is developed by LinkedIn for log aggregation and delivery. this scene is very good as a example.
The user's events on your web or app can be logged by your Web sever and then sent to Kafka broker through the producer. In producer, you could specific the partition method, for example : event type (different event is saved in different partition) or event time (partition a day into different period according your app logic) or user type or just no logic and balance all logs into many partitions.
About your case in question, you can create one topic called "page-view-event", and create N partitions through hash keys to distribute the logs into all partitions evenly. Or you could choose a partition logic to make log distributing by your spirit.