Using Kafka Streams to count a total sum? - apache-kafka

A topic named "addcash" has 3 partitions (the Kafka cluster also has 3 machines), and a lot of user recharge messages flow into it. I want to compute the total amount of money every day.
I learned from some articles about Kafka Streams that it runs the topology as tasks, that the number of tasks depends on the number of the topic's partitions, and that every task has its own state store.
So when I compute the total amount with a state store, will I get three values instead of one total? What is the right way to do it?
Thanks!

That is correct.
You have two ways to do this:
Compute the partial sums first, then follow up with KTable.groupBy(...).reduce(...), setting a single global key in groupBy() to bring all partial aggregates together.
Alternatively, create an additional single-partition topic, write the partial results into this topic, read the data back with Kafka Streams, and do a second aggregation that adds those partial numbers together. You can express this in a single program by using through("my-single-partition-topic") to connect the first and second part of the aggregation. For this solution, you would need to use transform() rather than the DSL for the second aggregation step.
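The two-stage idea can be sketched without any Kafka dependency. The following is only a plain-Java model of it, not Kafka Streams code: the class TwoStageSum, the Recharge type and the key "total" are hypothetical names chosen for illustration. Stage 1 plays the role of the per-task state stores, stage 2 plays the role of re-keying to one global key and reducing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TwoStageSum {
    /** One recharge record, tagged with the partition it was written to. */
    static final class Recharge {
        final int partition;
        final long amount;
        Recharge(int partition, long amount) {
            this.partition = partition;
            this.amount = amount;
        }
    }

    /** Stage 1: each task sums only the records of its own partition. */
    static Map<Integer, Long> partialSums(List<Recharge> records) {
        Map<Integer, Long> perPartition = new HashMap<>();
        for (Recharge r : records) {
            perPartition.merge(r.partition, r.amount, Long::sum);
        }
        return perPartition;
    }

    /** Stage 2: re-key every partial sum to one global key and reduce. */
    static long total(Map<Integer, Long> partials) {
        Map<String, Long> byGlobalKey = new HashMap<>();
        for (long partial : partials.values()) {
            byGlobalKey.merge("total", partial, Long::sum);
        }
        return byGlobalKey.getOrDefault("total", 0L);
    }

    public static void main(String[] args) {
        List<Recharge> records = Arrays.asList(
                new Recharge(0, 10), new Recharge(1, 20),
                new Recharge(2, 30), new Recharge(0, 5));
        Map<Integer, Long> partials = partialSums(records);
        System.out.println("partition 0 partial sum: " + partials.get(0)); // 15
        System.out.println("total: " + total(partials));                   // 65
    }
}
```

Because all partial sums end up under the same key, the second stage is effectively single-threaded. That is acceptable here since it only processes one small value per partition, not every recharge message.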

Related

Grouping messages and processing a bunch of messages at once with Kafka [duplicate]

I need to make Kafka consumers process all the messages with the same ID in each partition at once. For example, consider one topic containing all orders of different types, with multiple consumer instances subscribing to this topic. How can I run the consumers so that they process all the messages with the same ID in each partition together? When orders are produced with that ID, Kafka guarantees that all records with the same ID go to the same partition, but each partition may contain different orders. I need to process all the similar orders in each partition at once (not one by one) and only once in a while (not as soon as a new message arrives).
As the comments say, you'll need to manually batch your data into "bins per ID", then process those on your own. For example, write each record to a database, group by ID, then iterate/process each batch.
As far as Kafka is concerned, you're required to look at each event "one by one", but this does not require you to "handle them" in that order, unless you care about sequential processing, at-least-once processing, and in-order offset commits.
There's also no way to get "all unique IDs" in any partition without consuming the whole partition end-to-end. As another solution, you could use the Kafka Streams aggregate function to help with this, and a punctuation to periodically handle all IDs gathered up to a certain point.
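A minimal sketch of the "bins per ID" idea, in plain Java with no Kafka client involved. The class IdBatcher and its methods are hypothetical names: add() would be called from the poll loop for every record, and flush() would be called from a timer or a punctuation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class IdBatcher {
    // One bin per ID, kept in first-seen order
    private final Map<String, List<String>> bins = new LinkedHashMap<>();

    /** Called for every consumed record: only buffers, no processing yet. */
    void add(String id, String payload) {
        bins.computeIfAbsent(id, k -> new ArrayList<>()).add(payload);
    }

    /** Called once in a while: hands out all bins and starts over. */
    Map<String, List<String>> flush() {
        Map<String, List<String>> batch = new LinkedHashMap<>(bins);
        bins.clear();
        return batch;
    }

    public static void main(String[] args) {
        IdBatcher batcher = new IdBatcher();
        batcher.add("order-1", "created");
        batcher.add("order-2", "created");
        batcher.add("order-1", "paid");
        System.out.println(batcher.flush()); // {order-1=[created, paid], order-2=[created]}
        System.out.println(batcher.flush()); // {}
    }
}
```

If the bins live only in memory, offsets should be committed only after a flushed batch has been processed. Otherwise a crash between commit and flush loses the buffered records.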

Combining data coming from multiple Kafka topics into a single Kafka topic

I have N Kafka topics containing data with a timestamp. I need to combine them into a single topic in sorted timestamp order, where the data is sorted inside the partition. I found one way to do that:
Combine all the Kafka topic data in Cassandra (because of its fast writes) with clustering order DESCENDING. This combines them all, but the limitation is that if data arrives late, after the time window in which data is accumulated, it won't be sorted.
Is there any other appropriate way to do this? If not, is there any way to improve my solution?
Thanks
It is not clear why you need Kafka to sort on timestamps. Typically this is done only at consumption time for each batch of messages.
For example, create a Kafka Streams process that reads from all topics. Create a Global KTable and enable Interactive Querying.
When you query, you sort the data on the client side, regardless of how it is ordered in the topic.
This way, you are not limited to a single, ordered partition.
Alternatively, I would write to something other than Cassandra (due to my lack of deep knowledge of it). For example, Couchbase or CockroachDB.
Then when you query those later, run an ORDER BY on the timestamp.

Timeout for aggregated records in Kafka table?

I use Kafka for processing messages. A message can be split into several parts (it is a composite message). So in the stream I can have, for example, one composite message that is split into three parts. In other words, there will be three records in the Kafka stream, but they form one big message. I want to use a Kafka table to merge the parts of a composite message into one Kafka record. After the merge, the message will be inserted into a database (Postgres). Every part carries its part number and the total number of parts. For example, if one message has three parts (three Kafka records) in the stream, every part has a "total number of parts" field with value 3.
As I understand it, the task is simple in the positive scenario: aggregate the parts in a table, create a stream from the table, filter for records where the number of aggregated parts equals the total number of parts, map each filtered record to one merged message, and insert it into the database (Postgres).
But a negative scenario is also possible. In rare cases, one of the parts may never be written to Kafka at all (or it may be written much later, after a timeout). So, for example, only two of the three parts of a composite message will be present in the stream. In this case I must insert the incomplete message into the database (Postgres); it will consist of only two parts, not three. How can I implement this negative scenario in Kafka?
I would recommend checking out punctuations: https://docs.confluent.io/current/streams/developer-guide/processor-api.html#defining-a-stream-processor
Also note that you can mix and match the Processor API and the DSL: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#applying-processors-and-transformers-processor-api-integration
If you provide a store name for the KTable aggregation, you can connect the store to a custom processor that registers a punctuation. Overall, it might be better to use Processor API for the whole application instead of the DSL though.
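The logic such a punctuation would run can be sketched in plain Java, independent of the Kafka Streams API. Everything here is a hypothetical model: PartAssembler stands in for the custom processor, its HashMap for the state store, add() for the per-record callback and expire() for the punctuation callback. Timestamps are passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartAssembler {
    /** Parts collected so far for one composite message. */
    static final class Pending {
        final long firstSeenMs;
        final int totalParts;
        final TreeMap<Integer, String> parts = new TreeMap<>(); // sorted by part number
        Pending(long firstSeenMs, int totalParts) {
            this.firstSeenMs = firstSeenMs;
            this.totalParts = totalParts;
        }
        String merged() {
            return String.join("", parts.values());
        }
    }

    private final Map<String, Pending> store = new HashMap<>();
    private final long timeoutMs;

    PartAssembler(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Per-record step: returns the merged message once all parts arrived, else null. */
    String add(String messageId, int partNo, int totalParts, String payload, long nowMs) {
        Pending p = store.computeIfAbsent(messageId, k -> new Pending(nowMs, totalParts));
        p.parts.put(partNo, payload);
        if (p.parts.size() == p.totalParts) {
            store.remove(messageId);
            return p.merged();
        }
        return null;
    }

    /** Periodic step: emit and drop every incomplete message older than the timeout. */
    List<String> expire(long nowMs) {
        List<String> incomplete = new ArrayList<>();
        Iterator<Pending> it = store.values().iterator();
        while (it.hasNext()) {
            Pending p = it.next();
            if (nowMs - p.firstSeenMs >= timeoutMs) {
                incomplete.add(p.merged());
                it.remove();
            }
        }
        return incomplete;
    }

    public static void main(String[] args) {
        PartAssembler assembler = new PartAssembler(1000);
        assembler.add("m1", 1, 3, "A", 0);
        assembler.add("m1", 2, 3, "B", 100);                      // part 3 never arrives
        assembler.add("m2", 1, 2, "X", 200);
        System.out.println(assembler.add("m2", 2, 2, "Y", 300));  // XY
        System.out.println(assembler.expire(500));                // []
        System.out.println(assembler.expire(1000));               // [AB]
    }
}
```

A part that shows up after its message has already been expired would start a new pending entry in this sketch. Whether to drop such late parts or write them to the database separately is a decision the original question leaves open.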

How can I consume data sequentially (in order of timestamp) from a multi-partitioned Kafka topic?

I know that Kafka cannot guarantee ordering of data when a topic has multiple partitions. But my problem is this: I need multiple partitions for an event topic (user activities generating events), since I want multiple consumer groups to consume the data from the topic.
But there are times when I need to bootstrap the entire data set, i.e. read the complete data from the beginning to the end and rebuild my graph of events from the historical messages in Kafka. Then I lose the ordering, which creates a problem.
One approach might be to process it in a Map-Reduce paradigm, where I map the data based on time, order it, and consume it.
Has anybody faced a similar situation or problem and would like to help me out with the right approach or solution?
Thanks in advance.
As per the Kafka documentation, global ordering across partitions is not guaranteed, so you can create N partitions with N consumers. Partition by type of data, i.e. all data of category A should go to one partition. Since the order of messages is maintained within a partition, you can consume those messages in a separate consumer and process the data.
I went through some blogs that suggest buffering the messages and applying sorting logic to them, but this does not seem to be good practice: one of the partitions may be slow, a message may arrive late in some cases, and you would need to re-sort your messages every time a new message arrives.

Processing records in order in Storm

I'm new to Storm and I'm having trouble figuring out how to process records in order.
I have a dataset which contains records with the following fields:
user_id, location_id, time_of_checking
Now, I would like to identify users who have completed a path I specify (for example, users who went from location A to location B to location C).
I'm using a Kafka producer that reads these records from a file to simulate live data. The data is sorted by date.
So, to check whether my pattern is fulfilled, I need to process records in order. The thing is, due to parallelization (bolt replication), I don't get a user's check-ins in order. Because of that, the patterns won't work.
How to overcome this problem? How to process records in order?
There is no general system support for ordered processing in Storm. Either you use a different system that supports ordered stream processing, like Apache Flink (disclaimer: I am a committer at Flink), or you need to take care of it yourself in your bolt code.
The only support Storm delivers is via Trident. You can put the tuples of a certain time period (for example one minute) into a single batch. Thus, you can process all tuples within a minute at once. However, this only works if your use case allows for it, because you cannot relate tuples from different batches to each other. In your case, this would only work if you know that there are points in time at which all users have reached their destination (and no other user has started a new interaction); i.e., you need points in time at which no two users overlap. (It seems to me that your use case cannot fulfill this requirement.)
For a non-system solution, i.e. one based on custom user code, there are two approaches:
You could, for example, buffer tuples and sort them by timestamp within a bolt before processing. To make this work properly, you need to inject punctuations/watermarks which guarantee that no tuple with a smaller timestamp than the punctuation arrives after the punctuation. Once you have received a punctuation from each parallel input substream, you can safely trigger sorting and processing.
Another way would be to buffer tuples per incoming substream in distinct buffers (within a substream, order is preserved) and merge the tuples from the buffers in order. This has the advantage that sorting is avoided. However, you need to ensure that each operator emits its tuples in order. Furthermore, to avoid blocking (i.e., if no input is available for a substream), punctuations might be needed, too. (I implemented this approach. Feel free to use the code or adapt it to your needs: https://github.com/mjsax/aeolus/blob/master/queries/utils/src/main/java/de/hub/cs/dbis/aeolus/utils/TimestampMerger.java)
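The second approach (one buffer per substream, merged in timestamp order, with punctuations to avoid blocking) can be illustrated with a small plain-Java model. This is not the linked TimestampMerger and uses no Storm API: OrderedMerge and its methods are hypothetical, and tuples are reduced to their timestamps for brevity. It assumes each substream delivers non-decreasing timestamps.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class OrderedMerge {
    private final List<Deque<Long>> buffers = new ArrayList<>();
    private final long[] lastSeen; // highest timestamp announced per substream

    OrderedMerge(int substreams) {
        lastSeen = new long[substreams];
        for (int i = 0; i < substreams; i++) {
            buffers.add(new ArrayDeque<>());
            lastSeen[i] = Long.MIN_VALUE;
        }
    }

    /** Buffer a tuple; each substream must deliver non-decreasing timestamps. */
    void push(int substream, long timestamp) {
        buffers.get(substream).addLast(timestamp);
        lastSeen[substream] = timestamp;
    }

    /** Punctuation: the substream promises that no later tuple is below this timestamp. */
    void punctuate(int substream, long timestamp) {
        lastSeen[substream] = Math.max(lastSeen[substream], timestamp);
    }

    /** Returns the next timestamp in global order, or null if emitting is not yet safe. */
    Long poll() {
        int best = -1;
        for (int i = 0; i < buffers.size(); i++) {
            Long head = buffers.get(i).peekFirst();
            if (head != null && (best < 0 || head < buffers.get(best).peekFirst())) {
                best = i;
            }
        }
        if (best < 0) {
            return null; // nothing buffered at all
        }
        long candidate = buffers.get(best).peekFirst();
        for (int i = 0; i < buffers.size(); i++) {
            if (buffers.get(i).isEmpty() && lastSeen[i] < candidate) {
                return null; // substream i might still deliver something earlier
            }
        }
        return buffers.get(best).pollFirst();
    }

    public static void main(String[] args) {
        OrderedMerge merge = new OrderedMerge(2);
        merge.push(0, 1);
        merge.push(0, 4);
        merge.push(1, 2);
        System.out.println(merge.poll()); // 1
        System.out.println(merge.poll()); // 2
        System.out.println(merge.poll()); // null: substream 1 is empty and only known up to 2
        merge.punctuate(1, 5);
        System.out.println(merge.poll()); // 4
    }
}
```

The third poll() shows the blocking problem mentioned above: without the punctuation, timestamp 4 could never be released while substream 1 stays silent.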
Storm supports this use case. For this, you just have to ensure that order is maintained throughout your flow in all the involved components. So as a first step, in the Kafka producer, all the messages for a particular user id should go to the same partition in Kafka. For this you can implement a custom Partitioner in your KafkaProducer. Please refer to the link here for implementation details.
Since a partition in Kafka can be read by one and only one KafkaSpout instance in Storm, the messages in that partition arrive in order at the spout instance. This ensures that all the messages of the same user id arrive at the same spout.
Now comes the tricky part: to maintain order in the bolt, you want to use fields grouping on the bolt, based on the "user_id" field emitted from the Kafka spout. The provided KafkaSpout does not split the message into fields, so you would have to override the KafkaSpout to read the message and emit a "user_id" field from the spout. Another way of doing so is to have an intermediate bolt which reads the message from the KafkaSpout and emits a stream with a "user_id" field.
When you finally specify a bolt with fields grouping on "user_id", all messages with a particular user_id value go to the same instance of the bolt, whatever the degree of parallelism of the bolt.
A sample topology which works for your case could look as follows:
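TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("KafkaSpout", kafkaSpout);
// Intermediate bolt: parses each message and emits a "user_id" field
builder.setBolt("FieldsEmitterBolt", fieldsEmitterBolt).shuffleGrouping("KafkaSpout");
// Fields grouping on the user_id field emitted by FieldsEmitterBolt
builder.setBolt("CalculatorBolt", calculatorBolt).fieldsGrouping("FieldsEmitterBolt", new Fields("user_id"));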
Beware: if you have a limited number of user_ids, all the user_id values may end up at the same CalculatorBolt instance. This in turn would decrease the effective parallelism!
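The effect of fields grouping, and the parallelism caveat, can be illustrated with a simplified model. The routing rule below (hash of the grouping field modulo the number of bolt instances) is an assumption made for illustration and not Storm's actual implementation. FieldsGroupingModel and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FieldsGroupingModel {
    /** Assumed routing rule: hash of the grouping field modulo the bolt parallelism. */
    static int targetInstance(String userId, int parallelism) {
        return Math.floorMod(userId.hashCode(), parallelism);
    }

    /** How many bolt instances actually receive tuples for the given user ids. */
    static int instancesUsed(List<String> userIds, int parallelism) {
        Set<Integer> used = new HashSet<>();
        for (String id : userIds) {
            used.add(targetInstance(id, parallelism));
        }
        return used.size();
    }

    public static void main(String[] args) {
        int parallelism = 8;
        List<String> checkIns = Arrays.asList("u1", "u2", "u1", "u2", "u1");
        for (String id : checkIns) {
            // The same user_id always maps to the same instance
            System.out.println(id + " -> instance " + targetInstance(id, parallelism));
        }
        System.out.println("instances used: " + instancesUsed(checkIns, parallelism)
                + " of " + parallelism);
    }
}
```

With only two distinct user ids, at most two of the eight instances ever receive tuples, regardless of how many check-ins arrive.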