I am trying to test a topology that has a KTable as its last node. My test uses a full-blown Kafka cluster (through Confluent's Docker images), so I am not using the TopologyTestDriver.
My topology has input of key-value types String -> Customer and output of String -> CustomerMapped. The serdes, schemas and integration with Schema Registry all work as expected.
I am using Scala, Kafka 2.2.0, Confluent Platform 5.2.1 and kafka-streams-scala. My topology, as simplified as possible, looks something like this:
val otherBuilder = new StreamsBuilder()
otherBuilder
.table[String,Customer](source)
.mapValues(c => CustomerMapped(c.surname, c.age))
.toStream.to(target)
(all implicit serdes, Produced, Consumed, etc are the default ones and are found correctly)
My test consists of sending a few records (data) to the source topic synchronously and without pause, then reading back from the target topic and comparing the results with the expected ones:
val data: Seq[(String, Customer)] = Vector(
"key1" -> Customer(0, "Obsolete", "To be overridden", 0),
"key1" -> Customer(0, "Obsolete2", "To be overridden2", 0),
"key1" -> Customer(1, "Billy", "The Man", 32),
"key2" -> Customer(2, "Tommy", "The Guy", 31),
"key3" -> Customer(3, "Jenny", "The Lady", 40)
)
val expected = Vector(
"key1" -> CustomerMapped("The Man", 32),
"key2" -> CustomerMapped("The Guy", 31),
"key3" -> CustomerMapped("The Lady", 40)
)
I build the Kafka Streams application, setting, among other settings, the following two:
p.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, "5000")
val s: Long = 50L * 1024 * 1024
p.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, s.toString)
So I expect the KTable to use caching, having an interval of 5 seconds between commits and a cache size of 50MB (more than enough for my scenario).
My problem is that the results I read back from the target topic always contain multiple entries for key1. I would have expected no event to be emitted for the records with Obsolete and Obsolete2. The actual output is:
Vector(
"key1" -> CustomerMapped("To be overridden", 0),
"key1" -> CustomerMapped("To be overridden2", 0),
"key1" -> CustomerMapped("The Man", 32),
"key2" -> CustomerMapped("The Guy", 31),
"key3" -> CustomerMapped("The Lady", 40)
)
One final thing to mention: this test used to work as expected until I updated Kafka from 2.1.0 to 2.2.0. I verified this by downgrading my application again.
I am quite confused. Can anyone point out whether something changed in the behaviour of KTables in the 2.2.x versions? Or are there new settings I have to set to control the emission of events?
In Kafka 2.2, an optimization was introduced to reduce the resource footprint of Kafka Streams. A KTable is not necessarily materialized if it's not required for the computation. This holds for your case, because mapValues() can be computed on-the-fly. Because the KTable is not materialized, there is no cache and thus each input record produces one output record.
Compare: https://issues.apache.org/jira/browse/KAFKA-6036
If you want to enforce KTable materialization, you can pass Materialized.as("someStoreName") into the StreamsBuilder#table() method.
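For example, a minimal sketch in the Scala DSL, assuming the same implicit serdes as in the question are in scope (the store name "customer-store" is arbitrary):
import org.apache.kafka.streams.scala.{ByteArrayKeyValueStore, StreamsBuilder}
import org.apache.kafka.streams.scala.kstream.Materialized

val otherBuilder = new StreamsBuilder()
otherBuilder
  // forcing materialization brings back the state store and its cache,
  // so intermediate updates per key are suppressed between commits again
  .table[String, Customer](source, Materialized.as[String, Customer, ByteArrayKeyValueStore]("customer-store"))
  .mapValues(c => CustomerMapped(c.surname, c.age))
  .toStream.to(target)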
I need to merge the payloads of messages whose reference number (refNo) is the same. My limitations are that I can only use a KTable, and if the key is an even number I do not need to merge the payload. Additionally, the order of incoming messages should not change the result.
For example, if we have an empty topic and incoming messages are:
1: { key: "1", value: {refNo:1, payload:{data1}} }
2: { key: "2", value: {refNo:1, payload:{data2}} }
3: { key: "3", value: {refNo:2, payload:{data3}} } // this one should not be affected and left as is
Expected result:
1: { key: "1", value: {refNo:1, payload:{data1, data2}} }
2: { key: "2", value: {refNo:1, payload:{data2}} }
3: { key: "3", value: {refNo:2, payload:{data3}} }
The only way I can think of to do this is to use .groupBy twice and then join everything with the original topic again.
First, change the key to refNo, save the original key in the value itself, and merge the payloads during aggregation.
Second, .groupBy again to revert the key to its initial state.
The last step joins everything back with the original topic, because I lose one message during the grouping.
I'm pretty sure there's an easier way to do this. What is the most optimized and elegant way to solve this issue?
Edit: it's downstream and there is an output topic; the original topic is not edited.
Aggregating within KSQL could probably accomplish exactly this. You can use any of the aggregate functions like COLLECT_LIST(col1) => ARRAY
The only issue is how large of a window you would need. How frequently do you need to concatenate the data?
Also, it feels like a big "no" to write back to the original topic.
You're changing the message format slightly; additional downstream consumers could be expecting a specific message format.
Writing to a new topic seems like a better route to go; it also decreases the number of messages additional consumers need to consume.
At the moment I'm going with this solution. It works, but I have no idea how it will perform, whether it can be optimized further, or if there is a better way to solve my issue.
// Even-keyed records bypass the aggregation and are merged back in at the end.
KStream<String, Value> even = inputTopicStream.filter((key, value) -> value.isEven());

inputTopicStream
    .toTable(Materialized.with(Serdes.String(), Value.serde))
    .groupBy(
        // re-key by refNo and keep the original key inside the value
        (key, value) -> KeyValue.pair(new Key(value.getRefNo()), addKeyToValue(key, value)),
        // Key.serde is assumed to exist alongside Value.serde for the re-keyed records
        Grouped.with("aggregation-internal", Key.serde, Value.serde))
    .aggregate(
        Value::new,
        (key, value, agg) -> mergePayload(key, value, agg), // adder: ensure that the key is uneven after merge
        (key, value, agg) -> handleSplit(key, value, agg))  // subtractor
    .toStream()
    .selectKey((key, value) -> value.getKey()) // restore the original key stored in the value
    .merge(even) // merge the even-keyed stream back, because it was left out of the aggregation
    .to(OUTPUT_TOPIC, Produced.with(Serdes.String(), Value.serde));
I am new to structured streaming. I have a use case and I want to know the best approach to achieve this.
I have data coming in as a stream from Kafka like below:
{id: "abc", class: "x", student: "1" } -> at 4:55,
{id: "abc", class: "A", student: "1" } -> at 5:00,
{id: "abc", class: "A", student: "2" } -> at 5:05,
{id: "abc", class: "B", student: "1" } -> at 5:10,
{id: "abc", class: "A", student: "1" } -> at 5:15,
{id: "abc", class: "C", student: "1" } -> at 5:20
Now I want to group the records by class and send the results to a Kafka topic; this is read by another microservice which computes some metrics.
To achieve this
I can think of the sliding window concept: say every 5 minutes I group the last 15 minutes of data and send it to Kafka (a rough sketch of this windowed grouping is shown after the second option below). The issue here is that there will be multiple groups: one from 4:55 to 5:10, a 2nd one from 5:00 to 5:15 and a 3rd one from 5:05 to 5:20. The 2nd one is the group I need, because it is complete and has all the records for class "A" and can be used to compute the metrics for class A. If I send all 3 groups to Kafka, as in a regular streaming application, then the metrics computed from the 2nd group will be overwritten by the 3rd, which is not expected. So to overcome this I can do the below:
I can think of creating a SQL table in Spark, storing the records from Kafka for some fixed amount of time, say 1 hour, and every x minutes reading the records from the table that are older than the threshold (1 hour) and sending them to Kafka. But this feels like exactly what Spark provided the window concept for, so I am not sure this is right.
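For reference, this is roughly the windowed grouping I have in mind for the first option (events is a placeholder for the streaming DataFrame read from Kafka, with columns id, class, student and an event-time column ts; none of these names are actual code I have):
import org.apache.spark.sql.functions.{col, collect_list, window}

// group by class over a 15-minute window that slides every 5 minutes
val grouped = events
  .withWatermark("ts", "15 minutes")
  .groupBy(window(col("ts"), "15 minutes", "5 minutes"), col("class"))
  .agg(collect_list(col("student")).as("students"))
// each trigger would write these grouped rows to Kafka, which is where the
// overlapping-window problem described above shows up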
Is there any better way to achieve this? Please provide me some suggestions.
Suppose I have a bucket from which I need to fetch documents that have a date older than now.
This document looks like this:
{
id: "1",
date: "Some date",
otherObjectKEY: "key1"
}
For each document, I need to fetch another document using its otherObjectKEY, send the latter to a kafka topic, then delete the original document.
Using the reactive Java driver 3.0, I was able to do it with something like this:
public void batch(){
streamOriginalObjects()
.flatMap(originalObject -> fetchOtherObjectUsingItsKEY(originalObject)
.flatMap(otherObject -> sendToKafkaAndDeleteOriginalObject(originalObject))
)
.subscribe();
}
streamOriginalObjects():
public Flux<OriginalObject> streamOriginalObjects(){
return client.query("select ... and date <= '"+ LocalDateTime.now().toString() +"'")
.flux()
.flatMap(result -> result.rowsAs(OriginalObject.class));
}
It works like expected, but I'm wondering if there is a better approach (especially in terms of performance) than streaming and processing element by element.
Doing a N1QL query, and then fanning-out key-value operations from that, is a useful and common pattern. This should make the fan-out happen in parallel:
streamOriginalObjects()
// Split into numberOfThreads 'rails'
.parallel(numberOfThreads)
// Run on an unlimited thread pool
.runOn(Schedulers.elastic())
.concatMap(originalObject -> fetchOtherObjectUsingItsKEY(originalObject)
.concatMap(otherObject -> sendToKafkaAndDeleteOriginalObject(originalObject))
)
// Back out of parallel mode
.sequential()
.subscribe();
I have a job that runs on a daily basis. The purpose of this job is to correlate HTTP requests with their corresponding HTTP replies. This can be achieved because all HTTP requests & HTTP replies have a GUID that uniquely binds them.
So the job deals with two DataFrames: one containing the requests, and one containing the replies. So to correlate the requests with their replies, I am obviously doing an inner join based on that GUID.
The problem that I am running into is that a request that was captured on day X at 23:59:59 might see its reply captured on day X+1 at 00:00:01 (or vice-versa) which means that they will never get correlated together, neither on day X nor on day X+1.
Here is example code that illustrates what I mean:
val day1_requests = """[ { "id1": "guid_a", "val" : 1 }, { "id1": "guid_b", "val" : 3 }, { "id1": "guid_c", "val" : 5 }, { "id1": "guid_d", "val" : 7 } ]"""
val day1_replies = """[ { "id2": "guid_a", "val" : 2 }, { "id2": "guid_b", "val" : 4 }, { "id2": "guid_c", "val" : 6 }, { "id2": "guid_e", "val" : 10 } ]"""
val day2_requests = """[ { "id1": "guid_e", "val" : 9 }, { "id1": "guid_f", "val" : 11 }, { "id1": "guid_g", "val" : 13 }, { "id1": "guid_h", "val" : 15 } ]"""
val day2_replies = """[ { "id2": "guid_d", "val" : 8 }, { "id2": "guid_f", "val" : 12 }, { "id2": "guid_g", "val" : 14 }, { "id2": "guid_h", "val" : 16 } ]"""
val day1_df_requests = spark.read.json(spark.sparkContext.makeRDD(day1_requests :: Nil))
val day1_df_replies = spark.read.json(spark.sparkContext.makeRDD(day1_replies :: Nil))
val day2_df_requests = spark.read.json(spark.sparkContext.makeRDD(day2_requests :: Nil))
val day2_df_replies = spark.read.json(spark.sparkContext.makeRDD(day2_replies :: Nil))
day1_df_requests.show()
day1_df_replies.show()
day2_df_requests.show()
day2_df_replies.show()
day1_df_requests.join(day1_df_replies, day1_df_requests("id1") === day1_df_replies("id2")).show()
// guid_d from request stream is left over, as well as guid_e from reply stream.
//
// The following 'join' is done on the following day.
// I would like to carry 'guid_d' into day2_df_requests and 'guid_e' into day2_df_replies.
day2_df_requests.join(day2_df_replies, day2_df_requests("id1") === day2_df_replies("id2")).show()
I can see 2 solutions.
Solution#1 - custom carry over
In this solution, on day X, I would do a "full_outer" join instead of an inner-join, and I would persist into some storage the results that are missing one side or the other. On the next day X+1, I would load this extra data along with my "regular data" when doing my join.
An additional implementation detail is that my custom carry over would have to discard "old carry overs", otherwise they could pile up; i.e. it is possible that an HTTP request or HTTP reply from 10 days ago never sees its counterpart (maybe the app crashed, for instance, so an HTTP request was emitted but never a reply).
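A minimal sketch of this carry-over, reusing the day-1 DataFrames from my example above (the Parquet paths and the choice of Parquet are just for illustration):
import org.apache.spark.sql.functions.col

val joined = day1_df_requests.join(
  day1_df_replies,
  day1_df_requests("id1") === day1_df_replies("id2"),
  "full_outer")

// fully correlated pairs: the normal output for day X
val correlated = joined.filter(col("id1").isNotNull && col("id2").isNotNull)

// one-sided rows: persist them, then union them with day X+1's data before that day's join
joined.filter(col("id2").isNull)
  .select(day1_df_requests("id1"), day1_df_requests("val"))
  .write.mode("overwrite").parquet("/carryover/requests")
joined.filter(col("id1").isNull)
  .select(day1_df_replies("id2"), day1_df_replies("val"))
  .write.mode("overwrite").parquet("/carryover/replies")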
Solution#2 - guid folding
In this solution, I would make the assumption that my requests and replies are within a certain amount of time of one another (e.g. 5 minutes). Thus on day X+1, I would also load the last 5 minutes of data from day X and include that as part of my join. This way, I don't need to use extra storage like in solution#1. However, the disadvantage is that this solution requires that the target storage can deal with duplicate entries (for instance, if the target storage is a SQL table, the PK would have to be this GUID and do an upsert instead of an insert).
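A sketch of what I mean, with hypothetical dayX_requests/dayX_replies DataFrames that, unlike my toy JSON above, carry a capture_ts timestamp column:
import org.apache.spark.sql.functions.col

// cutoff = end of day X minus 5 minutes (illustrative literal)
val cutoff = "2019-01-01 23:55:00"
val requests = dayXplus1_requests.union(dayX_requests.filter(col("capture_ts") >= cutoff))
val replies  = dayXplus1_replies.union(dayX_replies.filter(col("capture_ts") >= cutoff))

requests.join(replies, requests("id1") === replies("id2")).show()
// the duplicates this can produce are why the target storage needs to upsert on the GUID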
Question
So my question is whether Spark provides functionality to automatically deal with a situation like that, thus not requiring any of my two solutions and by the same fact making things easier and more elegant?
Bonus question
Let's assume that I need to do the same type of correlation but with a stream of data (i.e. instead of a daily batch job that runs on a static set of data, I use Spark Streaming and data is processed on live streams of requests & replies).
In this scenario, a "full_outer" join is obviously inappropriate (https://dzone.com/articles/spark-structured-streaming-joins) and actually unnecessary since Spark Streaming takes care of that for us by having a sliding window for doing the join.
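For context, this is the kind of watermarked stream-stream join I have in mind (requestStream, replyStream and the time columns are placeholders, and the 5-minute bound is just my assumption from solution #2):
import org.apache.spark.sql.functions.expr

val correlated = requestStream.withWatermark("requestTime", "5 minutes")
  .join(
    replyStream.withWatermark("replyTime", "5 minutes"),
    expr("id1 = id2 AND replyTime BETWEEN requestTime AND requestTime + interval 5 minutes"))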
However, I am curious to know what happens if the job is stopped (or if it crashes) and then resumed. Similarly to the batch mode example that I gave above, what if the job was interrupted after a request was consumed (and acknowledged) from the stream/queue but before its related reply was? Does Spark Streaming keep the state of its sliding window, so that resuming the job will be able to correlate as if the stream was never interrupted?
P.S. backing up your answer with hyperlinks to reputable docs (like Apache's own) would be much appreciated.
I've got an event source that emits a Snapshot followed by Deltas; let's say it looks a bit like
Flowable.just("A", "1", "2", "3")
where "A" is the snapshot and "1", "2", "3" would be the updates.
I'd like the first subscriber to retrieve
"A", "1", "2", "3"
And the second subscriber (assuming it subscribes between "2" and "3") to receive
"A".apply("1").apply("2"), "3"
So what I'm looking for is an operator that multicasts the stream but, for every subsequent subscriber, first emits the value conflated from everything seen so far and only then continues with the live emissions.
Can someone point me into the right direction? Started writing a custom operator, but I feel like there's something easy I'm missing.
I worked on it a bit more with my colleague and we've come up with the following:
// the "tuple" is just (current snapshot, latest delta); conflation here is plain string concatenation
Flowable<String> conflated = Flowable.just("A", "1", "2", "3")
    .scan(new AbstractMap.SimpleEntry<String, String>("", ""),
          (prev, update) -> new AbstractMap.SimpleEntry<>(prev.getKey() + update, update))
    .skip(1)   // drop the seed pair
    .replay(1) // replay the latest (snapshot, delta) pair to late subscribers
    .refCount()
    // a subscriber's first element yields the full snapshot, every later one only the delta
    .scan("", (prev, update) -> prev.isEmpty() ? update.getKey() : update.getValue())
    .skip(1);  // drop the seed value
Works beautifully, and hopefully it will be of use to the next person trying the same thing.