Embedded Kafka: KTable+KTable leftJoin produces duplicate records - apache-kafka

I come seeking knowledge of the arcane.
First, I have two pairs of topics, with the first topic in each pair feeding into the second. Two KTables are formed from the latter topics and used in a KTable+KTable leftJoin. The problem is that the leftJoin produces THREE records when I produce a single record to each input topic. I would expect two records of the form (A-null, A-B), but instead I get (A-null, A-B, A-null). I have confirmed that each KTable receives a single record.
I have fiddled with the CACHE_MAX_BYTES_BUFFERING_CONFIG to enable/disable state store caching. The behavior above is with CACHE_MAX_BYTES_BUFFERING_CONFIG set to 0. When I use the default value for CACHE_MAX_BYTES_BUFFERING_CONFIG I see the following records output from the join: (A-B, A-B, A-null)
Here are the configurations for streams, consumers, producers:
properties.put(StreamsConfig.APPLICATION_ID_CONFIG, appName);
properties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapUrls);
properties.put(StreamsConfig.STATE_DIR_CONFIG, String.format("/tmp/kafka-streams/%s/%s",
properties.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0); // fiddled with
properties.put(StreamsConfig.CLIENT_ID_CONFIG, appName);
properties.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);
properties.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 1);
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(ConsumerConfig.GROUP_ID_CONFIG, appName);
properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
The Kafka Streams code (sanitized) that exhibits this behavior is below; notice the topics are paired [A1, A2] and [B1, B2]:
KTable<Long, Value> kTableA =
    kstreamBuilder.table(longSerde, valueSerde, topicA2);
kstreamBuilder.stream(keySerde, envelopeSerde, topicA1)
    .to(longSerde, valueSerde, topicA2);

kstreamBuilder.stream(keySerde, envelopeSerde, topicB1)
    .to(longSerde, valueSerde, topicB2.topicName);
KTable<Long, Value> kTableB =
    kstreamBuilder.table(longSerde, valueSerde, topicB2.topicName);

KTable<Long, Result> joinTable = kTableA.leftJoin(kTableB, (a, b) -> {
    // value joiner called three times with only a single record input
    // into topicA1 and topicB1
});

joinTable.groupBy(...)
    .aggregate(...)
    .to(longSerde, aggregateSerde, outputTopic);
Thanks in advance for any and all help, oh benevolent ones.
Update:
I was running with one Kafka server and 1 partition per topic and experienced this behavior. When I increased the number of servers to 2 and the number of partitions to 3, my output became (A-null).
It seems I need to spend some more time with the Kafka manual...

Related

Kafka consumer with manual assignment to a specific partition returns nothing on poll

I have a Kafka topic with 3 partitions.
I am trying to create a test consumer to fetch the last N messages from each partition.
For that, I manually assign each partition to the consumer, shift the offset, and poll like the following:
val topicPartition = TopicPartition(topic, 1) // where 1 is the number of a partition
consumer.assign(listOf(topicPartition))
consumer.seekToEnd(listOf(topicPartition))
val lastOffset = consumer.position(topicPartition)
consumer.seek(topicPartition, lastOffset - N) // lastOffset is known to be > N
val consumerRecords = consumer.poll(Duration.ofMillis(10000))
I repeat this for all 3 partitions.
This works fine for 2 of the 3 partitions.
Surprisingly, it never works for one (always the same) partition: poll() always waits for the given timeout and returns nothing.
Notes:
It does not depend on the partitions polling order.
I tried polling more than once as suggested here - no luck.
The consumer has its own unique group_id.
Consumer has the following props:
consumerProps["auto.offset.reset"] = "earliest"
consumerProps["max.poll.records"] = 500
consumerProps["fetch.max.bytes"] = 50000000
consumerProps["max.partition.fetch.bytes"] = 50000000
What could be the reason for this behavior?

Kafka Streams Hopping window top N by dimension

I have a Kafka stream, and I need a processor which does the following:
Uses a 45-second hopping window with 5-second advances to compute the top 5 count based on one dimension of the domain object. For example, if the stream contained clickstream data, I would need the top 5 URLs viewed by domain name, but also windowed in a hopping window.
I've seen examples to do window counting, for example:
KStream<String, GenericRecord> pageViews = ...;
// Count page views per window, per user, with hopping windows of size 5 minutes that advance every 1 minute
KTable<Windowed<String>, Long> windowedPageViewCounts = pageViews
    .groupByKey(Grouped.with(Serdes.String(), genericAvroSerde))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).advanceBy(Duration.ofMinutes(1)))
    .count();
And the top-N aggregation from the MusicExample, for example:
songPlayCounts.groupBy((song, plays) ->
        KeyValue.pair(TOP_FIVE_KEY,
            new SongPlayCount(song.getId(), plays)),
        Grouped.with(Serdes.String(), songPlayCountSerde))
    .aggregate(TopFiveSongs::new,
        (aggKey, value, aggregate) -> {
            aggregate.add(value);
            return aggregate;
        },
        (aggKey, value, aggregate) -> {
            aggregate.remove(value);
            return aggregate;
        },
        Materialized.<String, TopFiveSongs, KeyValueStore<Bytes, byte[]>>as(TOP_FIVE_SONGS_STORE)
            .withKeySerde(Serdes.String())
            .withValueSerde(topFiveSerde)
    );
I just can't seem to combine the two so that I get both windowing and top-N aggregation. Any thoughts?
In general, yes. However, for a non-windowed top-N aggregation the algorithm will always be an approximation (it's not possible to get an exact result, because one would need to buffer everything, which is not possible for unbounded input). For a hopping window, however, you would do an exact computation.
For the windowed case, the actual aggregation step could just accumulate all input records per window (e.g., return a List<V> or some other collection). On this result KTable you apply a mapValues() function that gets the List<V> of input records per window (and key) and can compute the actual top-N result you are looking for, as sketched below.
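To make that concrete, here is a rough sketch of the two-step approach in Java. It assumes a KStream<String, String> named viewsByDomain whose key is the domain name and whose value is the viewed URL, plus a Serde<List<String>> named urlListSerde; these names are illustrative, not part of the original code:
KTable<Windowed<String>, List<String>> urlsPerWindow = viewsByDomain
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .windowedBy(TimeWindows.of(Duration.ofSeconds(45)).advanceBy(Duration.ofSeconds(5)))
    .aggregate(
        ArrayList::new,                                          // start with an empty list per window
        (domain, url, urls) -> { urls.add(url); return urls; },  // collect every URL seen in the window
        Materialized.with(Serdes.String(), urlListSerde));

// Exact top 5 per (domain, window): count the collected URLs and keep the 5 most frequent.
KTable<Windowed<String>, List<String>> topFivePerWindow = urlsPerWindow
    .mapValues(urls -> urls.stream()
        .collect(Collectors.groupingBy(u -> u, Collectors.counting()))
        .entrySet().stream()
        .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
        .limit(5)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList()));
Since the whole window's input is buffered in the List<V>, the top-N result per window is exact, at the cost of storing every record of the window in the aggregate.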

Kafka streams dropping messages during windowing and on restart

I am working on a Kafka Streams application with the following topology:
private final Initializer<Set<String>> eventInitializer = () -> new HashSet<>();

final StreamsBuilder streamBuilder = new StreamsBuilder();

final KStream<String, AggQuantityByPrimeValue> eventStreams = streamBuilder.stream("testTopic",
    Consumed.with(Serdes.String(), valueSerde));

final KStream<String, Value> filteredStreams = eventStreams
    .filter((key, clientRecord) -> recordValidator.isAllowedByRules(clientRecord));

final KGroupedStream<Integer, Value> groupedStreams = filteredStreams.groupBy(
    (key, transactionEntry) -> transactionEntry.getNodeid(),
    Serialized.with(Serdes.Integer(), valueSerde));

/* Hopping window */
final TimeWindowedKStream<Integer, Value> windowedGroupStreams = groupedStreams
    .windowedBy(TimeWindows.of(Duration.ofSeconds(30)).advanceBy(Duration.ofSeconds(25))
        .grace(Duration.ofSeconds(0)));

/* Aggregating the events */
final KStream<Windowed<Integer>, Set<String>> suppressedStreams = windowedGroupStreams
    .aggregate(eventInitializer, countAggregator, Materialized.as("counts-aggregate"))
    .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded())
        .withName("suppress-window"))
    .toStream();

suppressedStreams.foreach((windowed, value) -> eventProcessor.publish(windowed.key(), value));

return new KafkaStreams(streamBuilder.build(), config.getKafkaConfigForStreams());
I am observing that, intermittently, a few events are getting dropped during/after windowing.
For example:
All records can be seen/printed in the isAllowedByRules() method; they are valid (allowed by the filters) and consumed by the stream.
But when printing the events in countAggregator, I can see that a few events do not come through it.
Current configurations for streams:
Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, "kafka-app-id");
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, <bootstraps-server>);
streamsConfig.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30000);
streamsConfig.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 5);
streamsConfig.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 10000);
streamsConfig.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 30000);
streamsConfig.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 10485760);
streamsConfig.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10485760);
streamsConfig.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 10485760);
/*For window buffering across all threads*/
streamsConfig.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 52428800);
streamsConfig.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.Integer().getClass().getName());
streamsConfig.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, customSerdesForSet);
Initially, I was using a tumbling window, but I found that a few events were getting lost, mostly at the end of a window, so I changed to a hopping window (better to duplicate than to lose). Then the dropped events went to zero. But today, after almost 4 days, I again saw a few dropped events, and there is one pattern among them: they are late by almost a minute compared to other events that were produced together. My expectation is that these late events should come in one of the future windows, but that didn't happen. Correct me here if my understanding is not right.
Also, as I have mentioned in the title, on a (graceful) restart of the streams application I could see a few events getting lost again at the aggregation step, even though they were processed by the isAllowedByRules() method.
I have searched a lot on Stack Overflow and other sites but couldn't find the root cause of this behaviour. Is it related to some configuration that I am missing or not setting correctly, or could it be due to some other reason?
From my understanding, you have an empty grace period:
/* Hopping window */
...
.grace(Duration.ofSeconds(0))
So your window is closed without permitting any late arrivals.
Then, regarding your sub-question:
My expectation is that these late events should come in one of the future windows, but that didn't happen. Correct me here if my understanding is not right.
Maybe you're mixing up event time and processing time.
Your record will be categorized as 'late' if the timestamp of the record (added by the producer at produce time, or by the brokers when it arrives in the cluster if not set by the producer) is outside your current window.
Here is an example with 2 records '*'.
Their event times (et1 and et2) fit in the window:

|-------- window --------|
t1                       t2
|    *            *      |
    et1          et2

But the processing time of et2 (pt2) is in fact as follows:

|-------- window --------|
t1                       t2
|    *                   |      *
    pt1                        pt2

Here the window is a slice of time between t1 and t2 (processing time).
et1 and et2 are respectively the event times of the 2 records '*'; they are timestamps set in the records themselves.
In this example, et1 and et2 are both between t1 and t2, but et2 was received after the window closed; since your grace period is 0, it is skipped.
This might be an explanation.
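If late arrivals should still be added to their (event-time) window, a non-zero grace period would allow that. A minimal sketch based on the topology above, with a purely illustrative 60-second value:
final TimeWindowedKStream<Integer, Value> windowedGroupStreams = groupedStreams
    .windowedBy(TimeWindows.of(Duration.ofSeconds(30))
        .advanceBy(Duration.ofSeconds(25))
        .grace(Duration.ofSeconds(60)));  // records up to 60s late (by event time) are still aggregated
Note that with suppress(Suppressed.untilWindowCloses(...)) downstream, the final result for a window is then only emitted after the grace period has also elapsed.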

Combining Results from different Flink Jobs

I have 3 jobs that read from the same input stream.
Each gives a different output.
How do I combine the results from the different jobs and create a single JSON string?
Example: {"key":"input_msg", "result_1":"job1_result",...}
I am hoping to avoid querying a DB, since if I scale my jobs to a huge number, that will have a negative impact.
Yes, that is possible:
val available_topics = List("topic_1", "topic_2")
var streams = collection.mutable.Map[String, DataStream[String]]()
for (a <- 0 until available_topics.size) {
  streams += (available_topics(a) -> env.addSource(new FlinkKafkaConsumer09(available_topics(a), new SimpleStringSchema(), properties))
    .map(x => someFunctionThatS(x)))
}
You could combine all three jobs into one and then join the results of the three parts to form the joined JSON result.
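As a minimal sketch of that idea in Java, assuming the three jobs can be expressed as per-message functions job1Result(), job2Result(), and job3Result() (the topic name and JSON layout below are also just illustrative):
// Read the shared input once...
DataStream<String> input = env.addSource(
    new FlinkKafkaConsumer09<>("input_topic", new SimpleStringSchema(), properties));

// ...run all three computations in the same job and emit one JSON string per message.
DataStream<String> combined = input.map(msg -> String.format(
    "{\"key\":\"%s\", \"result_1\":\"%s\", \"result_2\":\"%s\", \"result_3\":\"%s\"}",
    msg, job1Result(msg), job2Result(msg), job3Result(msg)));
If the three parts have to remain separate operators, you could instead key their outputs by the original message key and join them (for example with a window join or a CoProcessFunction) before building the JSON.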

How to make spark partition sticky, i.e. stay with node?

I am trying to use Spark Streaming 1.2.0. At some point, I grouped the streaming data by key and then applied some operations to it.
The following is a segment of the test code:
...
JavaPairDStream<Integer, Iterable<Integer>> grouped = mapped.groupByKey();
JavaPairDStream<Integer, Integer> results = grouped.mapToPair(
    new PairFunction<Tuple2<Integer, Iterable<Integer>>, Integer, Integer>() {
        @Override
        public Tuple2<Integer, Integer> call(Tuple2<Integer, Iterable<Integer>> tp) throws Exception {
            TaskContext tc = TaskContext.get();
            String ip = InetAddress.getLocalHost().getHostAddress();
            int key = tp._1();
            System.out.println(ip + ": Partition: " + tc.partitionId() + "\tKey: " + key);
            return new Tuple2<>(key, 1);
        }
    });
results.print();
mapped is a JavaPairDStream wrapping a dummy receiver that stores an array of integers every second.
I ran this app on a cluster with two slaves, each with 2 cores.
When I checked the printout, it seems that partitions were not assigned to nodes permanently (or in a "sticky" fashion). They moved between the two nodes frequently. This creates a problem for me.
In my real application, I need to load a fairly large amount of geo data per partition. This geo data will be used to process the data in the streams. I can only afford to load part of the geo data set per partition. If a partition moves between nodes, I will have to move the geo data too, which can be very expensive.
Is there a way to make the partitions sticky, i.e. partition 0,1,2,3 stay with node 0, partition 4,5,6,7 stay with node 1?
I have tried setting spark.locality.wait to a large number, say 1000000, and it did not work.
Thanks.
I found a workaround.
I can make my auxiliary data an RDD, partition it, and cache it.
Later, I can cogroup it with other RDDs, and Spark will try to keep the cached RDD partitions where they are and not shuffle them. E.g.:
...
JavaPairRDD<Integer, GeoData> geoRDD =
    geoRDD1.partitionBy(new HashPartitioner(num)).cache();
Later, do this:
JavaPairRDD<Integer, Integer> someOtherRDD = ...
JavaPairRDD<Integer, Tuple2<Iterable<GeoData>, Iterable<Integer>>> grp =
    geoRDD.cogroup(someOtherRDD);
Then, you can use foreach on the cogrouped RDD to process the input data together with the geo data, as sketched below.
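For that foreach step, a rough sketch over the grp RDD from above (processRecord() is a hypothetical placeholder for the actual per-record geo processing):
grp.foreach(entry -> {
    Integer key = entry._1();
    Iterable<GeoData> geoData = entry._2()._1();     // the cached, co-located geo data for this key
    Iterable<Integer> inputValues = entry._2()._2(); // the input values grouped with it
    for (Integer value : inputValues) {
        processRecord(key, value, geoData);          // hypothetical processing of one input value
    }
});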