Intermittent incorrect count from TimeWindowedKStream in Kafka Streams - apache-kafka

My intention with this topology is to window incoming messages, count them, and then send the count to another topic.
When I test this by sending a single key with one or more values to the input topic, I get inconsistent results. Sometimes the count is correct. Sometimes I send a single message, see that message at the first peek, and then, instead of a count of 1, get some other value at the second peek and in the output topic. When I send multiple messages, the count is usually right, but sometimes off. I'm careful to send the messages inside the time window, so I don't think they're being split across two windows.
Is there a flaw in my topology?
public static final String INPUT_TOPIC = "test-topic";
public static final String OUTPUT_TOPIC = "test-output-topic";
public static void buildTopo(StreamsBuilder builder) {
WindowBytesStoreSupplier store = Stores.persistentTimestampedWindowStore(
"my-state-store",
Duration.ofDays(1),
Duration.ofMinutes(1),
false);
Materialized<String, Long, WindowStore<Bytes, byte[]>> materialized = Materialized
.<String, Long>as(store)
.withKeySerde(Serdes.String());
Suppressed<Windowed> suppression = Suppressed
.untilWindowCloses(Suppressed.BufferConfig.unbounded());
TimeWindows window = TimeWindows
.of(Duration.ofMinutes(1))
.grace(Duration.ofSeconds(0));
// the windowed key in the second peek below contains the original String key plus the Kafka time window
builder.stream(INPUT_TOPIC, Consumed.with(Serdes.String(), Serdes.String()))
.peek((key, value) -> System.out.println("****key = " + key + " value= " + value))
.groupByKey()
.windowedBy(window)
.count(materialized)
.suppress(suppression)
.toStream()
.peek((key, value) -> System.out.println("key = " + key + " value= " + value))
.map((key, value) -> new KeyValue<>(key.key(), value))
.to(OUTPUT_TOPIC, Produced.with(Serdes.String(), Serdes.Long()));
}
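For reproducibility, here is one way such a topology could be driven deterministically with TopologyTestDriver (a sketch, assuming kafka-streams-test-utils is on the classpath; the application id, timestamps, and keys below are illustrative):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "window-count-test"); // illustrative
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");     // never contacted by the test driver
StreamsBuilder builder = new StreamsBuilder();
buildTopo(builder);
try (TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {
    TestInputTopic<String, String> input = driver.createInputTopic(
            INPUT_TOPIC, new StringSerializer(), new StringSerializer());
    TestOutputTopic<String, Long> output = driver.createOutputTopic(
            OUTPUT_TOPIC, new StringDeserializer(), new LongDeserializer());
    // both records fall into the one-minute window [00:00, 01:00)
    input.pipeInput("key", "a", Instant.ofEpochMilli(0));
    input.pipeInput("key", "b", Instant.ofEpochMilli(10_000));
    // a later record on another key advances stream time past the window end (grace is zero),
    // which lets suppress() emit the final count for "key"
    input.pipeInput("other", "c", Instant.ofEpochMilli(120_000));
    System.out.println(output.readKeyValuesToMap()); // expected: {key=2}
}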

Related

Kafka streams join: How to wait a duration of time before emitting records?

We currently have 2 Kafka topics with records coming in continuously. We're looking into joining the 2 streams on a key after waiting for a window of 5 minutes, but with my current code I see records being emitted immediately, without "waiting" to see whether a matching record arrives in the other stream. My current implementation:
KStream<String, String> streamA =
builder.stream(topicA, Consumed.with(Serdes.String(), Serdes.String()))
.peek((key, value) -> System.out.println("Stream A incoming record key " + key + " value " + value));
KStream<String, String> streamB =
builder.stream(topicB, Consumed.with(Serdes.String(), Serdes.String()))
.peek((key, value) -> System.out.println("Stream B incoming record key " + key + " value " + value));
ValueJoiner<String, String, String > recordJoiner =
(recordA, recordB) -> {
if(recordA != null) {
return recordA;
} else {
return recordB;
}
};
KStream<String, String > combinedStream =
streamA.join(
streamB,
recordJoiner,
JoinWindows
.of(Duration.ofMinutes(5)),
StreamJoined.with(
Serdes.String(),
Serdes.String(),
Serdes.String()))
.peek((key, value) -> System.out.println("Stream-Stream Join record key " + key + " value " + value));
combinedStream.to("test-topic",
Produced.with(
Serdes.String(),
Serdes.String()));
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), streamsConfiguration);
kafkaStreams.start();
Although I have the JoinWindows.of(Duration.ofMinutes(5)), I see some records being emitted immediately. How do I ensure they are not?
Additionally, is this the most efficient way of joining 2 Kafka streams, or would it be better to write our own consumer implementation that reads from the 2 topics?

Not able to query local key-value store (Kafka Streams)

I am working on a use case where I need to query a KTable (using the local key-value store approach). My sample data, which is present in the topic:
A,Blue
A,Blue
A,Yellow
A,Red
A,Yellow
A,Yellow
B,Blue
C,Red
C,Red
B,Blue
Based on the input, I want to generate the following output and store it in a topic:
A Blue:2,Yellow:3,Red:1
B Blue:2
C Red:2
Approach:
1) I first performed a count operation by reading the topic data into a KStream.
//set the properties for interactive queries
props.put(StreamsConfig.APPLICATION_SERVER_CONFIG,"localhost:9092" );
props.put(StreamsConfig.STATE_DIR_CONFIG, "D:\\Kafka_data\\Local_store");
//read the user input from Kafka topic: data
final KStream<String,String> userDataSource = builder.stream("data");
final KGroupedStream<String,String> inputData = userDataSource.
map((key, value) -> new KeyValue<>(value.split(",")[0].toString() + "_"+ value.split(",")[1].toString() , value.split(",")[1].toString()) )
.selectKey((s, s2) -> s)
.groupByKey(Grouped.with(Serdes.String(),Serdes.String()));
final KTable<String,Long> inputAggregationResult = inputData.count();
Result of the above code:
A_Blue 1
A_Yellow 1
A_Red 1
A_Yellow 2
A_Yellow 3
B_Blue 1
C_Red 1
C_Red 2
B_Blue 2
2) Then I store the result in a topic:
inputAggregationResult.toStream().to("input-data-aggregation", Produced.with(Serdes.String(), Serdes.Long()));
3) Now I read the data from the topic (input-data-aggregation) as a KTable so that I can query it.
final StreamsBuilder builder = new StreamsBuilder();
KTable<String, Object> ktableInformation = builder.table("input-data-aggregation", Materialized.<String, Object, KeyValueStore<Bytes, byte[]>>as("CountsValueStore"));
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.cleanUp();
streams.start();
ReadOnlyKeyValueStore<String, Object> keyValueStore;
Map<String,Object> information = new LinkedHashMap<String,Object>();
while (true) {
try {
// Get the key-value store CountsKeyValueStore
keyValueStore =
streams.store(ktableInformation.queryableStoreName(), QueryableStoreTypes.keyValueStore());
//read the value
KeyValueIterator<String, Object> range = keyValueStore.range("all", "streams");
while (range.hasNext()) {
KeyValue<String, Object> next = range.next();
information.put(next.key,next.value);
System.out.println("count for " + next.key + ": " + next.value);
}
// close the iterator to release resources
range.close();
} catch (InvalidStateStoreException ignored) {
ignored.printStackTrace();
}
}
4) When I try to query the data, it returns nothing (no output is printed).
Can someone tell me whether I missed a step in querying local key-value stores, or suggest another way to achieve the target output? I have verified that Kafka is writing the local key-value store data on my local instance, but reading (querying) it gives an empty result.
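One possible alternative sketch for producing the target output in a single step (here mapSerde stands for a hypothetical Serde<Map<String, Long>>, e.g. one built with a JSON library; it is not part of the original code):
// group by the part before the comma and fold the colour counts into one map per key,
// so the store ends up holding e.g. A -> {Blue=2, Yellow=3, Red=1}
KTable<String, Map<String, Long>> colourCounts = builder
        .stream("data", Consumed.with(Serdes.String(), Serdes.String()))
        .map((key, value) -> KeyValue.pair(value.split(",")[0], value.split(",")[1]))
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        .aggregate(
                HashMap::new,
                (key, colour, counts) -> { counts.merge(colour, 1L, Long::sum); return counts; },
                Materialized.<String, Map<String, Long>, KeyValueStore<Bytes, byte[]>>as("colour-counts")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(mapSerde)); // hypothetical map serde, see note above
Such a store could then be queried with QueryableStoreTypes.keyValueStore() and a plain get("A"), rather than a range scan.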

Incorrect result from KStream-KStream join with asymmetric time window

I have 2 streams named "alarm" and "intervention" that contain JSON. If an alarm and an intervention are connected, they have the same key. I want to join them to detect all alarms that did not have an intervention in the 24 hours before.
But this program doesn't work: it returns all the alarms, as if no intervention had been done in the 24 hours before any of them.
I rechecked my dataset 5 times and there are alarms that have interventions done less than 24 hours before the date of the alarm.
This picture explains the situation (image not reproduced here):
So I need to know whether there is an intervention before an alarm.
The code of the program:
final KStream<String, JsonNode> alarm = ...;
final KStream<String, JsonNode> intervention = ...;
final JoinWindows jw = JoinWindows.of(TimeUnit.HOURS.toMillis(24)).before(TimeUnit.HOURS.toMillis(24)).after(0);
final KStream<String, JsonNode> joinedAI = alarm.filter((String key, JsonNode value) -> {
return value != null;
}).leftJoin(intervention, (JsonNode leftValue, JsonNode rightValue) -> {
ObjectMapper mapper = new ObjectMapper();
JsonNode actualObj = null;
if (rightValue == null) {//No intervention before
try {
actualObj = mapper.readTree("{\"date\":\"" + leftValue.get("date").asText() + "\","
+ "\"alarm\":" + leftValue.toString()
+ "}");
} catch (IOException ex) {
Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
return actualObj;
} else {
return null;
}
}, jw, Joined.with(Serdes.String(), jsonSerde, jsonSerde));
final KStream<String, JsonNode> fraude = joinedAI.filter((String key, JsonNode value) -> {
return value != null;
});
fraude.foreach((key, value) -> {
rl.println("Fraude=" + key + " => " + value);
System.out.println("Fraude=" + key + " => " + value);
});
final KafkaStreams streams = new KafkaStreams(builder.build(), streamingConfig);
streams.cleanUp();
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
@Override
public void run() {
streams.close();
rl.close();
el.close();
nfl.close();
}
}));
To sum up, I want to detect the pattern shown in the red rectangle (image not reproduced here).
P.S.: I make sure that the intervention records are sent before the alarm records.
M.Djx,
I don't think there's a perfect solution for this use case in Kafka Streams right now, but I have a few thoughts to get you closer. I'm preparing to submit a KIP to address exactly this kind of use case in the near future.
One point: unlike a KTable, KStreams are not changelogs, so newer events don't overwrite older events with the same key; they just all coexist in the same stream. I think that's why your foreach makes it look like all the alerts have no intervention; you're seeing the intermediate join events from before the interventions.
For example:
LEFT    RIGHT    JOIN
a:1              a:(1,null)
        a:X      a:(1,X)
foreach will be invoked on both join results, making it look like the right value is missing, when it's actually just a little late.
If you apply a time window to the result stream, you will get a changelog: newer values will overwrite older ones. Something like:
joinedAI
.groupByKey()
.windowedBy(
TimeWindows
.of(1000 * 60 * 60 * 24) // the window will be 24 hours in size
.until(1000 * 60 * 60 * 48) // and we'll keep it in the state store for at least 48 hours
).reduce(
new Reducer<JsonNode>() {
@Override
public JsonNode apply(final JsonNode value1, final JsonNode value2) {
return value2;
}
},
Materialized.<String, JsonNode, WindowStore<Bytes, byte[]>>as("alerts-without-interventions")
);
The bummer is that this will produce a changelog stream with the right semantics, but you'll still see the intermediate values, so you wouldn't want to trigger any actions directly from this stream either (like the foreach).
One thing you could do is schedule a job, once a day, to scan "alerts-without-interventions" for windows from yesterday. Any result you get from the window store will be the most recent value from that key.
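A rough sketch of that daily scan, assuming a Kafka Streams version that has ReadOnlyWindowStore#fetchAll and that streams is the running KafkaStreams instance:
ReadOnlyWindowStore<String, JsonNode> store =
        streams.store("alerts-without-interventions", QueryableStoreTypes.windowStore());
long now = System.currentTimeMillis();
long from = now - TimeUnit.HOURS.toMillis(48);
long to = now - TimeUnit.HOURS.toMillis(24);
// iterate over every key's windows whose start time falls in the [48h ago, 24h ago] range
try (KeyValueIterator<Windowed<String>, JsonNode> iter = store.fetchAll(from, to)) {
    while (iter.hasNext()) {
        KeyValue<Windowed<String>, JsonNode> entry = iter.next();
        // the value is the most recent join result for that key in that window
        System.out.println(entry.key.key() + " -> " + entry.value);
    }
}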
The KIP I'm preparing will propose a way to let you filter out the intermediate results from the window, which would let you attach a foreach to the changelog and have it trigger only on the final result of the window.
Alternatively, if the data for your app isn't too big, and if you're not too worried about edge cases, you could consider implementing the "window final events" semantics yourself with a LinkedHashMap or a Guava cache.
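A rough sketch of that idea (in-memory only, so not fault tolerant; the one-hour punctuation interval and the 24-hour hold time are illustrative, and imports are omitted as elsewhere in this post):
joinedAI
    .transform(() -> new Transformer<String, JsonNode, KeyValue<String, JsonNode>>() {
        // newest value per key, together with the record timestamp that produced it
        private final Map<String, KeyValue<Long, JsonNode>> buffer = new LinkedHashMap<>();
        private ProcessorContext context;

        @Override
        public void init(final ProcessorContext context) {
            this.context = context;
            // once per hour of stream time, flush keys whose 24-hour "window" has passed
            context.schedule(TimeUnit.HOURS.toMillis(1), PunctuationType.STREAM_TIME, timestamp -> {
                final Iterator<Map.Entry<String, KeyValue<Long, JsonNode>>> it = buffer.entrySet().iterator();
                while (it.hasNext()) {
                    final Map.Entry<String, KeyValue<Long, JsonNode>> entry = it.next();
                    if (entry.getValue().key + TimeUnit.HOURS.toMillis(24) <= timestamp) {
                        context.forward(entry.getKey(), entry.getValue().value);
                        it.remove();
                    }
                }
            });
        }

        @Override
        public KeyValue<String, JsonNode> transform(final String key, final JsonNode value) {
            buffer.put(key, KeyValue.pair(context.timestamp(), value)); // newest value wins
            return null; // only the punctuator emits
        }

        @Override
        public void close() { }
    })
    .foreach((key, value) -> System.out.println("Fraude=" + key + " => " + value));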
I hope this helps.

Kafka KStream to GlobalKTable join does not work with same key used

I have a very frustrating problem trying to join a KStream, which is populated by a Java driver program using KafkaProducer, to a GlobalKTable, which is populated from a topic that is in turn populated by the JDBC connector pulling data from a MySQL table. No matter what I try, the join between the KStream and the GlobalKTable, both of which are keyed on the same value, will not work; the ValueJoiner is never called. I'll try to explain by showing the relevant config and code below. I appreciate any help.
I am using the latest version of the confluent platform.
The topic that the GlobalKTable is populated from is pulled from a single MySQL table:
Column Name/Type:
pk/bigint(20)
org_name/varchar(255)
orgId/varchar(10)
The JDBCConnector configuration for this is:
name=my-demo
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/reporting?user=root&password=XXX
table.whitelist=organisation
mode=incrementing
incrementing.column.name=pk
topic.prefix=my-
transforms=keyaddition
transforms.keyaddition.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.keyaddition.fields=orgId
I am running the JDBC connector using the command line:
connect-standalone /home/jim/platform/confluent/etc/schema-registry/connect-avro-standalone.properties /home/jim/prg/kafka/config/my.mysql.properties
This gives me a topic called my-organisation that is keyed on orgId... so far so good!
(Note: the namespace does not seem to be set by the JDBC connector. I don't think this is an issue, but I don't know for sure.)
Now, the code. Here is how I initialise and create the GlobalKTable (relevant code shown):
final Map<String, String> serdeConfig =
Collections.singletonMap(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG,
schemaRegistryUrl);
final StreamsBuilder builder = new StreamsBuilder();
final SpecificAvroSerde<Organisation> orgSerde = new SpecificAvroSerde<>();
orgSerde.configure(serdeConfig, false);
// Create the GlobalKTable from the topic that was populated using the connect-standalone command line
final GlobalKTable<String, Organisation>
orgs =
builder.globalTable(ORG_TOPIC, Materialized.<String, Organisation, KeyValueStore<Bytes, byte[]>>as(ORG_STORE)
.withKeySerde(Serdes.String())
.withValueSerde(orgSerde));
The avro schema, from where the Organisaton class is generated is defined as:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"Organisation",
"fields":[
{"name": "pk", "type":"long"},
{"name": "org_name", "type":"string"},
{"name": "orgId", "type":"string"}
]
}
Note: as described above the orgId is set as the key on the topic using the single message transform (SMT) operation.
So, that is the GlobalKTable setup.
Now for the KStream setup (the left-hand side of the join). This has the same key (orgId) as the GlobalKTable. I use a simple driver program for this:
(The use case is that this topic will contain events associated with each organisation)
public class UploadGenerator {
public static void main(String[] args){
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put("schema.registry.url", "http://localhost:8081");
KafkaProducer producer = new KafkaProducer(props);
// This schema is also used in the consumer application or more specifically a class generated from it.
String mySchema = "{\"namespace\": \"io.confluent.examples.streams.avro\"," +
"\"type\":\"record\"," +
"\"name\":\"DocumentUpload\"," +
"\"fields\":[{\"name\":\"orgId\",\"type\":\"string\"}," +
"{\"name\":\"date\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}]}";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(mySchema);
// Just using three fictional organisations with the following orgIds/keys
String[] ORG_ARRAY = {"002", "003", "004"};
long count = 0;
String key = ""; // key is the realm
while(true) {
count++;
try {
TimeUnit.SECONDS.sleep(5);
} catch (InterruptedException e) {
}
GenericRecord avroRecord = new GenericData.Record(schema);
int orgId = ThreadLocalRandom.current().nextInt(0, 2 + 1);
avroRecord.put("orgId",ORG_ARRAY[orgId]);
avroRecord.put("date",new Date().getTime());
key = ORG_ARRAY[orgId];
ProducerRecord<Object, Object> record = new ProducerRecord<>("topic_uploads", key, avroRecord);
try {
producer.send(record);
producer.flush();
} catch(SerializationException e) {
System.out.println("Exccccception was generated! + " + e.getMessage());
} catch(Exception el) {
System.out.println("Exception: " + el.getMessage());
}
}
}
}
So, this generates a new event representing an upload for an organisation, identified by the orgId, which is also explicitly set as the key in the ProducerRecord.
Here is the code that sets up the KStream for these events:
final SpecificAvroSerde<DocumentUpload> uploadSerde = new SpecificAvroSerde<>();
uploadSerde.configure(serdeConfig, false);
// Get the stream of uploads
final KStream<String, DocumentUpload> uploadStream = builder.stream(UPLOADS_TOPIC, Consumed.with(Serdes.String(), uploadSerde));
// Debug output to see the contents of the stream
uploadStream.foreach((k, v) -> System.out.println("uploadStream: Key: " + k + ", Value: " + v));
// Note, I tried to re-key the stream with the orgId field (even though it was set as the key in the driver but same problem)
final KStream<String, DocumentUpload> keyedUploadStream = uploadStream.selectKey((key, value) -> value.getOrgId());
keyedUploadStream.foreach((k, v) -> System.out.println("keyedUploadStream: Key: " + k + ", Value: " + v));
// Java 7 form used as it was easier to put in debug statements
// OrgPK is just a helper class defined in the same class
KStream<String, OrgPk> joined = keyedUploadStream.leftJoin(orgs,
new KeyValueMapper<String, DocumentUpload, String>() { /* derive a (potentially) new key by which to lookup against the table */
@Override
public String apply(String key, DocumentUpload value) {
System.out.println("1. The key passed in is: " + key);
System.out.println("2. The upload realm passed in is: " + value.getOrgId());
return value.getOrgId();
}
},
// THIS IS NEVER CALLED WITH A join() AND WHEN CALLED WITH A leftJoin() HAS A NULL ORGANISATION
new ValueJoiner<DocumentUpload, Organisation, OrgPk>() {
@Override
public OrgPk apply(DocumentUpload leftValue, Organisation rightValue) {
System.out.println("3. Value joiner has been called...");
if( null == rightValue ) {
// THIS IS ALWAYS CALLED, SO THERE IS NEVER A "MATCH"
System.out.println(" 3.1. Orgnisation is NULL");
return new OrgPk(leftValue.getRealm(), 1L);
}
System.out.println(" 3.1. Org is OK");
// Never reaches here - this is the issue i.e. there is never a match
return new OrgPk(leftValue.getOrgId(), rightValue.getPk());
}
});
So, the above join (or leftJoin) never matches, even though the two keys are the same! This is the main issue.
Finally, the avro schema for the DocumentUpload is:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"DocumentUpload",
"fields":[
{"name": "orgId", "type":"string"},
{"name":"date", "type":"long", "logicalType":"timestamp-millis"}
]
}
So, in summary:
I have a KStream on a topic with a String key of OrgId
I have a GlobalKTable on a topic with a String key of OrgId also.
The join never works, even though the keys are in the GlobalKTable (at least they are in the topic underlying the GlobalKTable)
Can someone help me? I am pulling my hair out trying to figure this out.
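One thing that might be worth double-checking is whether the raw key bytes on the two topics really are identical, since the connector writes Avro-encoded keys (key.converter is the AvroConverter) while the driver uses the plain Avro/String key serializers. A sketch of such a check (group id and broker address are illustrative):
Properties check = new Properties();
check.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
check.put(ConsumerConfig.GROUP_ID_CONFIG, "key-byte-check"); // illustrative
check.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
try (KafkaConsumer<byte[], byte[]> consumer =
        new KafkaConsumer<>(check, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {
    consumer.subscribe(Arrays.asList("my-organisation", "topic_uploads"));
    // print the key bytes from a few records of each topic; the lookup can only hit
    // if the stream's serialized key matches the bytes the table's store was built from
    for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(5))) {
        System.out.println(record.topic() + " key bytes: " + Arrays.toString(record.key()));
    }
}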
I was able to solve this issue on Windows/IntelliJ by providing a state directory config:
StreamsConfig.STATE_DIR_CONFIG
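For example (a sketch; the path is only an illustration, any directory the process can write to will do):
Properties streamsConfiguration = new Properties();
// the default state directory can be problematic on Windows, so point it at an explicit, writable location
streamsConfiguration.put(StreamsConfig.STATE_DIR_CONFIG, "C:/kafka-streams/state"); // illustrative path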

Kafka streams - set a different time window depending on the message group

I wonder whether, given a KStream, it is possible to set a different time window depending on the message group; for example, 5 seconds for group "A", 10 seconds for group "B", and so on.
KStream<String, Msg> stream = builder.stream(stringSerde, msgSerde, input);
stream.groupBy((key, msg) -> msg.getPool())
.aggregate(init, agg, TimeWindows.of(wndLength).advanceBy(wndLength), msgSerde)
...
The simplest way that comes to mind is to .filter() or .branch() before you .groupBy()/.aggregate(), like:
KStream<String, Msg> stream = builder.stream(stringSerde, msgSerde, input);
stream.filter((key, msg) -> msg.getPool().equals("A"))
.groupBy((key, msg) -> msg.getPool())
.aggregate(init, agg, TimeWindows.of(wndLength).advanceBy(wndLength), msgSerde)
...
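For completeness, here is a sketch of the .branch() variant with a different window per branch (the 5- and 10-second sizes follow the example in the question; the predicates are illustrative and the old aggregate(init, agg, windows, serde) signature from the snippet above is assumed):
KStream<String, Msg>[] branches = stream.branch(
        (key, msg) -> msg.getPool().equals("A"),
        (key, msg) -> msg.getPool().equals("B"));
branches[0].groupBy((key, msg) -> msg.getPool())
        .aggregate(init, agg, TimeWindows.of(TimeUnit.SECONDS.toMillis(5)), msgSerde);
branches[1].groupBy((key, msg) -> msg.getPool())
        .aggregate(init, agg, TimeWindows.of(TimeUnit.SECONDS.toMillis(10)), msgSerde);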