Aggregate to compacted topic with unlimited retention - apache-kafka

I'm setting up a Kafka Streams application that consumes from a topic (retention: 14 days, cleanup.policy: delete, partitions: 1).
I want to consume the messages and output them into another topic (retention: -1, cleanup.policy: compact, partitions: 3).
The grouping is by key on the input topic.
So:
Input-topic:
Key: A Value: { SomeJson }
Key: A Value: { Other Json}
Key: B Value: { TestJson }
Output:
Key: A Value: {[ { SomeJson }, { Other Json } ]}
Key: B Value: {[ { TestJson } ]}
It's important that the content on the output topic is never lost, so the producer uses acks=all and the topic has 3 replicas.
Each key in the compacted topic will have around 100 JSON records, estimated at less than 20 KB per key.
I was also hoping that the output topic could serve as the state/changelog topic, so that Kafka Streams wouldn't have to create another topic containing the same information.
Anyone know how to do this? Most of the examples I find relate to windowing: https://github.com/confluentinc/kafka-streams-examples/tree/5.3.1-post/src/main/java/io/confluent/examples/streams
Current code:
val mapper = new ObjectMapper();

builder.stream(properties.getInputTopic(), Consumed.with(Serdes.String(), Serdes.String()))
    .groupByKey()
    .aggregate(
        () -> new GroupedIdenthendelser(Collections.emptyList()),
        (key, value, currentAggregate) -> {
            val items = new ArrayList<>(currentAggregate.getIdenthendelser());
            items.add(value);
            return new GroupedIdenthendelser(items);
        },
        Materialized.with(Serdes.String(), new JsonSerde<>(GroupedIdenthendelser.class, mapper)))
    .toStream()
    .to(properties.getOutputTopic(), Produced.with(Serdes.String(), new JsonSerde<>(mapper)));
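For reference, a sketch of how the durability-related settings mentioned above (acks=all, internal topics with 3 replicas) can be expressed in the Streams configuration; the application id and broker addresses are placeholders, and the output topic's own replication factor is set when that topic is created:

// Sketch only: application id and bootstrap servers are placeholders.
Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, "identhendelse-aggregator");
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
// Internal topics that Streams creates (changelog/repartition) get 3 replicas.
streamsConfig.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
// The embedded producer waits for all in-sync replicas before acknowledging writes.
streamsConfig.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");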
If someone has other tips to give, please do tell, since this data is customer information; if there are configs I should tweak, let me know. Blog posts or examples are also appreciated.
Edit: The code example above seems to work, but it creates its own state topic, which isn't needed since the output topic will always contain the same state. There will also only ever be one instance of this application running, since the input topic has 1 partition. As it relates to people in a fairly fixed-size population (roughly 10,000,000), the data won't grow above 20 KB per person either. Events per second are estimated at around 1/s, so the load isn't much either.
The Topology:
Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [input-topic])
      --> KSTREAM-AGGREGATE-0000000002
    Processor: KSTREAM-AGGREGATE-0000000002 (stores: [KSTREAM-AGGREGATE-STATE-STORE-0000000001])
      --> KTABLE-TOSTREAM-0000000003
      <-- KSTREAM-SOURCE-0000000000
    Processor: KTABLE-TOSTREAM-0000000003 (stores: [])
      --> KSTREAM-SINK-0000000004
      <-- KSTREAM-AGGREGATE-0000000002
    Sink: KSTREAM-SINK-0000000004 (topic: output-topic)
      <-- KTABLE-TOSTREAM-0000000003
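Regarding the edit above: if the output topic is treated as the source of truth and the KSTREAM-AGGREGATE-STATE-STORE changelog shown in the topology is unwanted, one option (a sketch with a made-up store name, not a drop-in recommendation) is to name the store and disable changelog logging for it. The trade-off to verify against the durability requirements is that without a changelog, a lost local store can only be rebuilt by reprocessing the input topic, which here retains only 14 days of data:

// Sketch: "grouped-identhendelser-store" is an illustrative store name.
// Uses org.apache.kafka.streams.kstream.Materialized, org.apache.kafka.streams.state.KeyValueStore and
// org.apache.kafka.common.utils.Bytes; GroupedIdenthendelser, JsonSerde and mapper come from the code above.
Materialized<String, GroupedIdenthendelser, KeyValueStore<Bytes, byte[]>> store =
    Materialized.<String, GroupedIdenthendelser, KeyValueStore<Bytes, byte[]>>as("grouped-identhendelser-store")
        .withKeySerde(Serdes.String())
        .withValueSerde(new JsonSerde<>(GroupedIdenthendelser.class, mapper))
        .withLoggingDisabled();
// Passed to .aggregate(...) in place of the Materialized.with(...) used above.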

Looking at your example dataset, I guess what you might need is real-time (non-windowed) aggregation. Please take a look at this Confluent blog post as a starting point.

Related

Advanced counting in Kafka Streams

I'd like to determine the number of companies assigned to a category. The (simplified) data structure of the input topic looks something like this:
Key: 3298440
Value: {"company_id": 5678, "category_id": 9876}
Key: 4367848
Value: {"company_id": 35383, "category_id": 9876}
[...]
So, like in this example, I'd like to count the different companies for the category 9876.
My idea was to group by category_id (the new key) and increase or decrease the count (via reduce()) as the value.
So in principle I'm only interested in inserts and deletes (tombstones) for altering the count. My problem is that updates also produce a record and invalidate the count.
I believe I'm on the wrong track here. Is there any way to do it?
This is my incorrect Java code, which also counts updates:
companyCategoriesKTable
    .groupBy(KeyValue::pair, Grouped.with(Serdes.String(), companyCategoriesSerde))
    .aggregate(
        () -> new CompanyCount(0L, 0L), /* category_id, count */
        (key, newValue, aggValue) -> {
            aggValue.setCategoryId(newValue.getBrId());
            aggValue.setCount(aggValue.getCount() + 1);
            return aggValue;
        },
        (key, oldValue, aggValue) -> {
            aggValue.setCategoryId(oldValue.getBrId());
            aggValue.setCount(aggValue.getCount() - 1);
            return aggValue;
        },
        Materialized.<String, CompanyCount, KeyValueStore<Bytes, byte[]>>as("CompanyCountStore")
            .withKeySerde(Serdes.String())
            .withValueSerde(new JsonSerde<>(CompanyCount.class))
    )
    .toStream()
    .map((k, v) -> new KeyValue<>(v.getCategoryId(), v.getCount() > 0 ? 1L : -1L))
    .groupByKey(Grouped.with(Serdes.Long(), Serdes.Long()))
    .reduce(Long::sum)
    .toStream()
    .to("output-topic", Produced.with(Serdes.Long(), Serdes.Long()));

Spring kafkaListener Consumer GroupID for a Partition

Currently, we create multiple Kafka consumers for a given topic, and all the messages in the topic are processed collectively by the consumers in the same consumer group.
We want to enhance this now. Instead of creating multiple Kafka consumers/listeners that read from the whole topic, we want each consumer in the same consumer group to read from a specified partition of the topic.
Is it possible? Is there any reference showing how to specify both the consumer group id and the partitions together in each of our Kafka listeners?
If I understand your requirements, you have to use manual partition assignment. See https://docs.spring.io/spring-kafka/docs/current/reference/html/#manual-assignment
You can also configure POJO listeners with explicit topics and partitions (and, optionally, their initial offsets). The following example shows how to do so:
@KafkaListener(groupId = "thing2", topicPartitions =
        { @TopicPartition(topic = "topic1", partitions = { "0", "1" }),
          @TopicPartition(topic = "topic2", partitions = { "0", "1" })
        })

@KafkaListener(groupId = "thing2", topicPartitions =
        { @TopicPartition(topic = "topic1", partitions = { "2", "3" }),
          @TopicPartition(topic = "topic2", partitions = { "2", "3" })
        })
etc.
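A minimal sketch of how those annotations sit on actual listener methods (class and method names are illustrative; only topic1 is shown): both listeners share the group id thing2 but each is statically assigned its own partitions, so the group id is still used for offset commits while partition assignment is manual.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.TopicPartition;
import org.springframework.stereotype.Component;

// Illustrative sketch: class and method names are made up; only "topic1" is shown.
@Component
public class PartitionedListeners {

    @KafkaListener(groupId = "thing2", topicPartitions =
            @TopicPartition(topic = "topic1", partitions = { "0", "1" }))
    public void listenPartitionsZeroAndOne(ConsumerRecord<String, String> record) {
        // handle records from partitions 0 and 1
    }

    @KafkaListener(groupId = "thing2", topicPartitions =
            @TopicPartition(topic = "topic1", partitions = { "2", "3" }))
    public void listenPartitionsTwoAndThree(ConsumerRecord<String, String> record) {
        // handle records from partitions 2 and 3
    }
}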

Kafka Streams - disappearing values from KTable's

Issue background
Currently we are using the Kafka Streams API (version 1.1.0) to process messages from a Kafka cluster (3 brokers, 3 partitions per topic, replication factor 2). The installed Kafka version is 1.1.1.
End users report a problem with disappearing data. They report that suddenly they can't see any data (e.g. yesterday they could see n records in the UI, and the next morning the table was empty). We checked the changelog topic for these particular users and it looked strange: after a few days of inactivity (a given key-value pair might be unchanged for days), the aggregate value in the changelog topic was missing.
Code
KTable assembly (messages are grouped by the 'username' from the event):
@Bean
public KTable<UsernameVO, UserItems> itemsOfTheUser() {
    return streamsBuilder.stream("application-user-UserItems", Consumed.with(Serdes.String(), serdes.forA(UserItems.class)))
        .groupBy((key, event) -> event.getUsername(),
                 Serialized.with(serdes.forA(UsernameVO.class), serdes.forA(UserItems.class)))
        .aggregate(
            UserItems::none,
            (key, event, userItems) -> userItems.after(event),
            Materialized
                .<UsernameVO, UserItems>as(persistentKeyValueStore("application-user-UserItems"))
                .withKeySerde(serdes.forA(UsernameVO.class))
                .withValueSerde(serdes.forA(UserItems.class)));
}
Aggregate object (KTable value):
public class UserItems {

    private final Map<String, Item> items;

    public static UserItems none() {
        return new UserItems();
    }

    private UserItems() {
        this(emptyMap());
    }

    @JsonCreator
    private UserItems(Map<String, Item> items) {
        this.items = items;
    }

    @JsonValue
    @SuppressWarnings("unused")
    Map<String, Item> getUserItems() {
        return Collections.unmodifiableMap(items);
    }

    ...

    public UserItems after(ItemAddedEvent itemEvent) {
        Item item = Item.from(itemEvent);
        Map<String, Item> newItems = new HashMap<>(items);
        newItems.put(itemEvent.getItemName(), item);
        return new UserItems(newItems);
    }
}
Kafka topics
application-user-UserItems
There is no problem with this source topic. It has retention set to the maximum, and all messages are present all the time.
application-user-UserItems-store-changelog (compacted; otherwise default configuration - no retention changes or anything else)
Here is the strange part. We can observe in the changelog that for some of the users the values are getting lost:
Offset | Partition | Key | Value
...........................................
...
320 0 "User1" : {"ItemName1":{"param":"foo"}}
325 0 "User1" : {"ItemName1":{"param":"foo"},"ItemName2":{"param":"bar"}}
1056 0 "User1" : {"ItemName3":{"param":"zyx"}}
...
We can see above that at first messages are aggregated correctly: Item1 was processed, then Item2 was applied to the aggregate. But after some period of time - it might be a few days - when another event is processed, the value previously aggregated under the "User1" key seems to be missing, and only Item3 is present.
In the application the user has no way to remove all items and add another in one action - items can only be added or removed one by one. So if they remove ItemName1 and ItemName2 and then add ItemName3, we expect something like this in the changelog:
Offset | Partition | Key | Value
..............................................
...
320 0 "User1" : {"ItemName1":{"param":"foo"}}
325 0 "User1" : {"ItemName1":{"param":"foo"},"ItemName2":{"param":"bar"}}
1054 0 "User1" : {"ItemName2":{"param":"bar"}}
1055 0 "User1" : {}
1056 0 "User1" : {"ItemName3":{"param":"zyx"}}
Conclusion
At first we thought it was related to the changelog topic's retention, but we checked it and it only has compaction enabled:
application-user-UserItems-store-changelog PartitionCount:3 ReplicationFactor:1 Configs:cleanup.policy=compact,max.message.bytes=104857600
Topic: application-user-UserItems-store-changelog Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Topic: application-user-UserItems-store-changelog Partition: 1 Leader: 2 Replicas: 2 Isr: 2
Topic: application-user-UserItems-store-changelog Partition: 2 Leader: 1 Replicas: 1 Isr:
Any ideas or hints would be appreciated. Cheers
I have experienced the same problem you describe, and it seems the problem is related to your Kafka Streams configuration.
You have mentioned that you have the following configuration for your "source" topic:
3 brokers, 3 partitions per topic, with replication factor 2
Make sure you set the Kafka Streams property replication.factor (StreamsConfig.REPLICATION_FACTOR_CONFIG) to at least 2; it defaults to 1.
That corresponds to what you have posted as well: the replication factor for the changelog topic is 1.
application-user-UserItems-store-changelog PartitionCount:3 ReplicationFactor:1 Configs:cleanup.policy=compact,max.message.bytes=104857600
So my assumption is that you are losing data due to a broker outage (the data should still be preserved in your source topic thanks to its replication factor of 2, so you could reprocess it and repopulate the changelog topic).
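A minimal sketch of where that property goes (application id and bootstrap servers are placeholders; note that Streams only applies replication.factor when it creates internal topics, so the already-existing changelog topic would need its replication factor increased separately, e.g. via a partition reassignment):

// Sketch only: application id and bootstrap servers are placeholders.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "application-user");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
// Internal topics (changelogs, repartition topics) created by Streams get 2 replicas.
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);

KafkaStreams streams = new KafkaStreams(streamsBuilder.build(), props);
streams.start();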

Spark streaming mapWithState mem increasing

So I have this code
KafkaUtils.
  createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, fromOffsets).
  map(event => (event.requestId, event.toString)).
  mapWithState(StateSpec.function(StateFunc.func _).numPartitions(200).timeout(Minutes(5))).
  foreachRDD((rdd: RDD[String]) => {
    process(rdd.filter(_ != null), sparkSession, topic, partitionsNum)
  })
All that the StateFunc.func is doing, is to update the state like so:
def func(batchTime: Time, key: String, event: Option[String], state: State[String]): Option[String] = {
  if (state.exists) {
    if (!state.isTimingOut()) {
      // only refresh the state while it is not timing out
      event.foreach(state.update)
    }
  } else {
    event.foreach(state.update)
  }
  event
}
Edit: all that is done in the state function is updating the state.
But it should stop being updated after about 2-3 times (no more events with the same key will arrive), so eventually the entries should time out and be deleted by Spark.
All that the process function does is write the RDDs to S3 as JSON files.
After 3 batches with 3,600,000 records (from the Spark streaming UI) the output size was about ~2 GB,
but the mapWithState state was ~30 GB (it should be roughly the same size as the output) and my cluster is only 40 GB,
so after some time Spark fails and starts over again.
Can someone explain why the mapWithState size is 30+ GB,
and why the cluster sometimes fails with OutOfMemory?
Properties:
streamInterval is 50 sec
maxRatePerPartition is 3000 records
kafka partitions is 8
backPressure is true
each record is about 1500 bytes
Update:
So I tried running it again with a small batch of ~170,000 records (each record is a string 264 bytes long), so the total should be about 45 MB.
As shown in the screenshot, the mapWithState storage is ~350 MB: all of the 199 stored RDDs are 2.5 KB each, and just one RDD is 350 MB. What the hell?!
[screenshot: mapWithState storage info]

Mongoose: how to avoid duplicating documents?

I have designed a collection for some reviews.
The review collection schema contains both posts and topics.
Posts properties (key, phone, date, contents, author.name) are direct children of review collection.
Topics properties (key, title) are children of topic object, which is a child of review collection.
Each post belongs to a topic.
Each post key is unique, each topic key is unique.
If many posts belong to a topic, topic data is repeated for each review (NoSQL is not ACID, right? :-)
The question is: is the decision to duplicate topic properties correct, or should I use different collections for posts and topics?
This is my model:
var reviewSchema = new mongoose.Schema({
  key: String,
  phone: String,
  date: Date,
  contents: String,
  author: {
    name: String,
  },
  topic: {
    key: String,
    title: String,
  },
});

reviewSchema.index({ 'key': 1 }, { unique: true });
reviewSchema.index({ 'phone': 1 }, { unique: false });
reviewSchema.index({ 'topic.key': 1 }, { unique: false });
If you wish to avoid duplication, create a separate schema for a Topic, then reference it in your Reviews:
var TopicSchema = new mongoose.Schema({
  key: String,
  title: String
});

var ReviewSchema = new mongoose.Schema({
  key: String,
  phone: String,
  ...
  topic: { type: mongoose.Schema.Types.ObjectId, ref: 'Topic' }
});
var Topic = mongoose.model('Topic', TopicSchema);
var Review = mongoose.model('Review', ReviewSchema);
From here, use the populate() method when you want to fetch a Review with its Topic filled in as a subdocument. Since you are also storing author as its own nested object, you may consider following that same pattern there.
I'm curious about your use of key. MongoDB creates a unique _id by default on top level documents, a kind of primary key. If that was your intention for key, you should probably let MongoDB handle it.
But at the end of the day, there is no "correct" solution to your problem, only a comparison of tradeoffs. An advantage of MongoDB is the ability to "store what you query for" with ease, and since Topics are quite small, it may be worth the duplication if you want topics every time you fetch a review. MongoDB is not ACID within a collection (I can't speak for other NoSQL options), so with this method, updating all the embedded topics at once could cause brief discrepancies for users.
// Get an entire review in one go, including subdocuments!
Review.findOne( { "key": "myReview" }, (err, doc) => { /* do things */ } );

// On bulk topic updates, not all topics change at once (not ACIDic)
Review.update(
  { 'topic.title': 'foo' },
  { $set: { 'topic.title': 'bar' } },
  { multi: true },
  (err, doc) => { /* callback */ }
);
If you're coming from a SQL background, the populate() paradigm described above will feel far more comfortable. And since MongoDB is ACIDic per document, updating a topic once will suffice for all other documents that reference it. Behind the scenes, this will require Mongoose to make at least two queries: once for the Review, then again for the referenced Topic.
// To replace refs with documents: two queries behind the scenes
Review.findOne( { key: 'myReview' } )
  .populate('topic')
  .exec( (err, review) => { /* do things */ } );

// But updating a single topic is ACIDic, since reviews only contain references
Topic.update( { key: 'foo' }, { title: 'sci-fi' }, (err, res) => { /* more stuff */ } );
In my experience, unless you're pushing your query pipeline to the limit and want to cut response times at all costs, separate schemas with populate() is worth the tradeoff of extra queries.