Not able to query local key-value store (Kafka Streams) - apache-kafka

I am working on a use case where I need to query a KTable (using the local key-value stores approach). My sample data inside the input topic:
A,Blue
A,Blue
A,Yellow
A,Red
A,Yellow
A,Yellow
B,Blue
C,Red
C,Red
B,Blue
Based on this input I want to generate the following output and store it in a topic:
A Blue:2,Yellow:3,Red:1
B Blue:2
C Red:2
Approach:
1) I first performed a count operation by reading the topic data into a KStream.
//set the properties for interactive queries
props.put(StreamsConfig.APPLICATION_SERVER_CONFIG,"localhost:9092" );
props.put(StreamsConfig.STATE_DIR_CONFIG, "D:\\Kafka_data\\Local_store");
//read the user input from Kafka topic: data
final KStream<String,String> userDataSource = builder.stream("data");
final KGroupedStream<String,String> inputData = userDataSource.
map((key, value) -> new KeyValue<>(value.split(",")[0].toString() + "_"+ value.split(",")[1].toString() , value.split(",")[1].toString()) )
.selectKey((s, s2) -> s)
.groupByKey(Grouped.with(Serdes.String(),Serdes.String()));
final KTable<String,Long> inputAggregationResult = inputData.count();
Result of the above code:
A_Blue 1
A_Yellow 1
A_Red 1
A_Yellow 2
A_Yellow 3
B_Blue 1
C_Red 1
C_Red 2
B_Blue 2
2) Then I store the result in a topic:
inputAggregationResult.toStream().to("input-data-aggregation", Produced.with(Serdes.String(), Serdes.Long()));
3) Now I read the data from the topic (input-data-aggregation) as a KTable so that I can query it.
final StreamsBuilder builder = new StreamsBuilder();
KTable<String, Object> ktableInformation = builder.table("input-data-aggregation", Materialized.<String, Object, KeyValueStore<Bytes, byte[]>>as("CountsValueStore"));
final KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.cleanUp();
streams.start();
ReadOnlyKeyValueStore<String, Object> keyValueStore;
Map<String,Object> information = new LinkedHashMap<String,Object>();
while (true) {
try {
// Get the key-value store CountsKeyValueStore
keyValueStore =
streams.store(ktableInformation.queryableStoreName(), QueryableStoreTypes.keyValueStore());
//read the value
KeyValueIterator<String, Object> range = keyValueStore.range("all", "streams");
while (range.hasNext()) {
KeyValue<String, Object> next = range.next();
information.put(next.key,next.value);
System.out.println("count for " + next.key + ": " + next.value);
}
// close the iterator to release resources
range.close();
} catch (InvalidStateStoreException ignored) {
ignored.printStackTrace();
}
}
4) When I try to query the data it returns an empty result (no output gets printed).
Can someone tell me whether I missed a step when querying local key-value stores, or suggest another way to achieve the target output? I have verified that Kafka Streams writes the local key-value store data on my machine, but reading (querying) it gives an empty result.
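For reference, here is a minimal sketch of how step 3 could be set up and queried, assuming the aggregation topic really carries String keys and Long values (the store and topic names are taken from the question). Two things worth checking: the store only becomes queryable once the instance has reached the RUNNING state, and a lexicographic range("all", "streams") excludes keys such as "A_Blue" (uppercase letters sort before lowercase in the serialized bytes), so iterating with all() avoids that pitfall:

final StreamsBuilder builder = new StreamsBuilder();
// Read the aggregation topic with explicit serdes so the store actually holds Long values.
final KTable<String, Long> counts = builder.table(
        "input-data-aggregation",
        Consumed.with(Serdes.String(), Serdes.Long()),
        Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("CountsValueStore"));

final KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

// Retry until the store has been created and assigned to this instance.
ReadOnlyKeyValueStore<String, Long> keyValueStore = null;
while (keyValueStore == null) {
    try {
        keyValueStore = streams.store("CountsValueStore", QueryableStoreTypes.keyValueStore());
    } catch (InvalidStateStoreException notReadyYet) {
        try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

// Iterate over all entries instead of an arbitrary key range.
try (KeyValueIterator<String, Long> all = keyValueStore.all()) {
    while (all.hasNext()) {
        KeyValue<String, Long> next = all.next();
        System.out.println("count for " + next.key + ": " + next.value);
    }
}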

Related

I'm trying a windowed word count application with Kafka Streams; in the consumer console I get some unreadable characters along with the count

The application (.java) file is given below:
public class WordCountFinal {
public static void main(String[] args) {
StringSerializer stringSerializer = new StringSerializer();
StringDeserializer stringDeserializer = new StringDeserializer();
TimeWindowedSerializer<String> windowedSerializer = new TimeWindowedSerializer<>(stringSerializer);
TimeWindowedDeserializer<String> windowedDeserializer = new TimeWindowedDeserializer<>(stringDeserializer);
Serde<Windowed<String>> windowedSerde = Serdes.serdeFrom(windowedSerializer, windowedDeserializer);
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "rogue");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "ssc-vm-r.com:9092, ssc-vmr:9092, ssc-vm:9092");
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Long().getClass());
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> wordcountinput = builder.stream("TextLinesTopic", Consumed.with(Serdes.String(), Serdes.String()));
KGroupedStream<String, String> groupedStream = wordcountinput
.flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
.map((key, word) -> new KeyValue<>(word, word))
.groupByKey(Grouped.with(Serdes.String(), Serdes.String()));
KTable<Windowed<String>, Long> aggregatedStream = groupedStream
.windowedBy(TimeWindows.of(Duration.ofMinutes(2)))
.count();
aggregatedStream.toStream().to("tuesdaystopic", Produced.with(windowedSerde, Serdes.Long()));
KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
}
The input to the producer console is sentences or words. The output should be like a normal word count app, but reset every 2 minutes: suppose the count for 'qwerty' is currently 5; if I enter 'qwerty' again in the producer console after two minutes, the output count should be 1.
qwerty 3
qwerty 4
qwerty 5
abcd 1
after 2 minutes, entering qwerty in the producer console:
qwerty 1
Note that the type of the key of the result is Windowed<String> -- that's also why you use a TimeWindowedSerializer when writing the result stream to a topic via to() (you don't use a StringSerializer).
When you read the data with the console consumer, however, you specify a StringDeserializer for the key; the bytes of the key are not of type String, so the types don't match and you get those unreadable characters.
You can either specify a different deserializer (i.e., TimeWindowedDeserializer) when using the console consumer, or modify the key to type String before writing the result into the output topic. For example, you could use:
aggregatedStream.toStream()
// `k` is of type Windowed<String>
// you can get the plain String key via `key()`
.selectKey((k,v) -> k.key())
.to(....)
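If you also want the window boundaries to stay visible while still writing plain String keys, one option (just a sketch, with the topic name taken from the question) is to fold the window start timestamp into the key:

aggregatedStream.toStream()
        // e.g. the key becomes "qwerty@1589820000000" (word + window start in epoch millis)
        .map((windowedKey, count) -> KeyValue.pair(
                windowedKey.key() + "@" + windowedKey.window().start(),
                count))
        .to("tuesdaystopic", Produced.with(Serdes.String(), Serdes.Long()));

With a plain String key the console consumer output is readable as-is; the Long value still needs --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer.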

How to make Serdes work with multi-step kafka streams

I am new to Kafka and I'm building a starter project using the Twitter API as a data source. I have created a Producer which queries the Twitter API and sends the data to my Kafka topic with a string serializer for both key and value. My Kafka Streams application reads this data and does a word count, but also groups by the date of the tweet. This part is done through a KTable called wordCounts to make use of its upsert functionality. The structure of this KTable is:
Key: {word: exampleWord, date: exampleDate}, Value: numberOfOccurences
I then attempt to restructure the data in the KTable stream to a flat structure so I can later send it to a database. You can see this in the wordCountsStructured KStream object. This restructures the data to look like the structure below. The value is initially a JsonObject, but I convert it to a string to match the Serdes which I set.
Key: null, Value: {word: exampleWord, date: exampleDate, Counts: numberOfOccurences}
However, when I try to send this to my second kafka topic, I get the error below.
A serializer (key:
org.apache.kafka.common.serialization.StringSerializer / value:
org.apache.kafka.common.serialization.StringSerializer) is not
compatible to the actual key or value type (key type:
com.google.gson.JsonObject / value type: com.google.gson.JsonObject).
Change the default Serdes in StreamConfig or provide correct Serdes
via method parameters.
I'm confused by this since the KStream I am sending to the topic is of type <String, String>. Does anyone know how I might fix this?
public class TwitterWordCounter {
private final JsonParser jsonParser = new JsonParser();
public Topology createTopology(){
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("test-topic2");
KTable<JsonObject, Long> wordCounts = textLines
//parse each tweet as a tweet object
.mapValues(tweetString -> new Gson().fromJson(jsonParser.parse(tweetString).getAsJsonObject(), Tweet.class))
//map each tweet object to a list of json objects, each of which containing a word from the tweet and the date of the tweet
.flatMapValues(TwitterWordCounter::tweetWordDateMapper)
//update the key so it matches the word-date combination so we can do a groupBy and count instances
.selectKey((key, wordDate) -> wordDate)
.groupByKey()
.count(Materialized.as("Counts"));
/*
In order to structure the data so that it can be ingested into SQL, the value of each item in the stream must be straightforward: property, value
so we have to:
1. take the columns which include the dimensional data and put this into the value of the stream.
2. label the count with 'count' as the column name
*/
KStream<String, String> wordCountsStructured = wordCounts.toStream()
.map((key, value) -> new KeyValue<>(null, MapValuesToIncludeColumnData(key, value).toString()));
KStream<String, String> wordCountsPeek = wordCountsStructured.peek(
(key, value) -> System.out.println("key: " + key + "value:" + value)
);
wordCountsStructured.to("test-output2", Produced.with(Serdes.String(), Serdes.String()));
return builder.build();
}
public static void main(String[] args) {
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-application1111");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "myIPAddress");
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
TwitterWordCounter wordCountApp = new TwitterWordCounter();
KafkaStreams streams = new KafkaStreams(wordCountApp.createTopology(), config);
streams.start();
// shutdown hook to correctly close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
//this method is used for taking a tweet and transforming it to a representation of the words in it plus the date
public static List<JsonObject> tweetWordDateMapper(Tweet tweet) {
try{
List<String> words = Arrays.asList(tweet.tweetText.split("\\W+"));
List<JsonObject> tweetsJson = new ArrayList<JsonObject>();
for(String word: words) {
JsonObject tweetJson = new JsonObject();
tweetJson.add("date", new JsonPrimitive(tweet.formattedDate().toString()));
tweetJson.add("word", new JsonPrimitive(word));
tweetsJson.add(tweetJson);
}
return tweetsJson;
}
catch (Exception e) {
System.out.println(e);
System.out.println(tweet.serialize().toString());
return new ArrayList<JsonObject>();
}
}
public JsonObject MapValuesToIncludeColumnData(JsonObject key, Long countOfWord) {
key.addProperty("count", countOfWord); //new JsonPrimitive(count));
return key;
}
Because you are performing a key-changing operation (selectKey()) before the grouping, Kafka Streams will create a repartition topic, and for that topic it will rely on the default key and value serdes, which you have set to the String serde.
You can pass explicit serdes to the grouping call, i.e. groupByKey(Grouped.with(...)) with serdes matching the JsonObject key and value at that point, and this should help.
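As a rough sketch of what that could look like: the Serde<JsonObject> below is an assumed helper (it is not part of the Kafka or Gson APIs) that simply round-trips the JSON text, and the lambdas assume a kafka-clients version where Serializer/Deserializer have default configure()/close() methods:

// Assumed helper: a Serde<JsonObject> that serializes via the JSON text representation.
Serializer<JsonObject> jsonSerializer = (topic, json) ->
        json == null ? null : json.toString().getBytes(StandardCharsets.UTF_8);
Deserializer<JsonObject> jsonDeserializer = (topic, bytes) ->
        bytes == null ? null
                : new JsonParser().parse(new String(bytes, StandardCharsets.UTF_8)).getAsJsonObject();
Serde<JsonObject> jsonObjectSerde = Serdes.serdeFrom(jsonSerializer, jsonDeserializer);

KTable<JsonObject, Long> wordCounts = textLines
        .mapValues(tweetString -> new Gson().fromJson(jsonParser.parse(tweetString).getAsJsonObject(), Tweet.class))
        .flatMapValues(TwitterWordCounter::tweetWordDateMapper)
        .selectKey((key, wordDate) -> wordDate)
        // key and value are both JsonObject here, so the repartition topic needs matching serdes
        .groupByKey(Grouped.with(jsonObjectSerde, jsonObjectSerde))
        // the count store and its changelog also need a JsonObject key serde
        .count(Materialized.<JsonObject, Long, KeyValueStore<Bytes, byte[]>>as("Counts")
                .withKeySerde(jsonObjectSerde)
                .withValueSerde(Serdes.Long()));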

Is there a retention policy for custom state store (RocksDb) with Kafka streams?

I am setting up a new Kafka Streams application and want to use a custom state store backed by RocksDB. Putting data into the state store, getting a queryable state store from it, and iterating over the data all work fine. However, after about 72 hours I observe data missing from the store. Is there a default retention time on state store data in Kafka Streams or in RocksDB?
We are using a custom state store with RocksDB so that we can utilize the column family feature, which we can't use with the embedded RocksDB implementation in Kafka Streams. I have implemented the custom store using the KeyValueStore interface, and have my own StoreSupplier, StoreBuilder, StoreType and StoreWrapper as well.
A changelog topic is created for the application but no data is going to it yet (haven't looked into that problem yet).
Putting data into this custom state store and getting a queryable state store from it works fine. However, I am seeing that data goes missing from the store after about 72 hours. I checked by looking at the size of the state store directory as well as by exporting the data into files and counting the entries.
Using SNAPPY compression and UNIVERSAL compaction
Simple topology:
final StreamsBuilder builder = new StreamsBuilder();
String storeName = "store-name";
List<String> cfNames = new ArrayList<>();
// Hybrid custom store
final StoreBuilder customStore = new RocksDBColumnFamilyStoreBuilder(storeName, cfNames);
builder.addStateStore(customStore);
KStream<String, String> inputstream = builder.stream(inputTopicName,
Consumed.with(Serdes.String(), Serdes.String()));
inputstream
.transform(() -> new CurrentTransformer(storeName), storeName);
Topology tp = builder.build();
Snippet from custom store implementation:
RocksDBColumnFamilyStore(final String name, final String parentDir, List<String> columnFamilyNames) {
.....
......
final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
.setBlockCache(cache)
.setBlockSize(BLOCK_SIZE)
.setCacheIndexAndFilterBlocks(true)
.setPinL0FilterAndIndexBlocksInCache(true)
.setFilterPolicy(filter)
.setCacheIndexAndFilterBlocksWithHighPriority(true)
.setPinTopLevelIndexAndFilter(true)
;
cfOptions = new ColumnFamilyOptions()
.setCompressionType(CompressionType.SNAPPY_COMPRESSION)
.setCompactionStyle(CompactionStyle.UNIVERSAL)
.setMaxWriteBufferNumber(MAX_WRITE_BUFFERS)
.setOptimizeFiltersForHits(true)
.setLevelCompactionDynamicLevelBytes(true)
.setTableFormatConfig(tableConfig);
columnFamilyDescriptors.add(new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOptions));
columnFamilyNames.stream().forEach((cfName) -> columnFamilyDescriptors.add(new ColumnFamilyDescriptor(cfName.getBytes(), cfOptions)));
}
@SuppressWarnings("unchecked")
public void openDB(final ProcessorContext context) {
Options opts = new Options()
.prepareForBulkLoad();
options = new DBOptions(opts)
.setCreateIfMissing(true)
.setErrorIfExists(false)
.setInfoLogLevel(InfoLogLevel.INFO_LEVEL)
.setMaxOpenFiles(-1)
.setWriteBufferManager(writeBufferManager)
.setIncreaseParallelism(Math.max(Runtime.getRuntime().availableProcessors(), 2))
.setCreateMissingColumnFamilies(true);
fOptions = new FlushOptions();
fOptions.setWaitForFlush(true);
dbDir = new File(new File(context.stateDir(), parentDir), name);
try {
Files.createDirectories(dbDir.getParentFile().toPath());
db = RocksDB.open(options, dbDir.getAbsolutePath(), columnFamilyDescriptors, columnFamilyHandles);
columnFamilyHandles.stream().forEach((handle) -> {
try {
columnFamilyMap.put(new String(handle.getName()), handle);
} catch (RocksDBException e) {
throw new ProcessorStateException("Error opening store " + name + " at location " + dbDir.toString(), e);
}
});
} catch (RocksDBException e) {
throw new ProcessorStateException("Error opening store " + name + " at location " + dbDir.toString(), e);
}
open = true;
}
The expectation is that the state store (RocksDB) will retain the data indefinitely until it is manually deleted or the storage disk goes down. I am not aware of Kafka Streams having introduced a TTL for state stores yet.
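For comparison, the built-in stores only expose a retention concept for windowed (and session) stores; a plain persistent key-value store has no retention or TTL. A small sketch using the standard Stores factory (store names and durations are just illustrative):

// Built-in key-value store: no retention/TTL, data stays until deleted or overwritten.
StoreBuilder<KeyValueStore<String, String>> kvStoreBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("plain-kv-store"),
                Serdes.String(), Serdes.String());

// Built-in window store: retention is explicit; old windows are dropped once it elapses.
StoreBuilder<WindowStore<String, String>> windowStoreBuilder =
        Stores.windowStoreBuilder(
                Stores.persistentWindowStore("windowed-store",
                        Duration.ofDays(3),    // retention
                        Duration.ofMinutes(5), // window size
                        false),                // retain duplicates
                Serdes.String(), Serdes.String());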

Kafka KStream to GlobalKTable join does not work with same key used

I have a very frustrating problem trying to join a KStream, which is populated by a Java driver program using KafkaProducer, to a GlobalKTable that is populated from a topic which, in turn, is populated using the JDBC connector pulling data from a MySQL table. No matter what I try, the join between the KStream and the GlobalKTable, both of which are keyed on the same value, will not work. What I mean is that the ValueJoiner is never called. I'll try to explain by showing the relevant config and code below. I appreciate any help.
I am using the latest version of the confluent platform.
The topic that the GlobalKTable is populated from is pulled from a single MySQL table:
Column Name/Type:
pk/bigint(20)
org_name/varchar(255)
orgId/varchar(10)
The JDBCConnector configuration for this is:
name=my-demo
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/reporting?user=root&password=XXX
table.whitelist=organisation
mode=incrementing
incrementing.column.name=pk
topic.prefix=my-
transforms=keyaddition
transforms.keyaddition.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.keyaddition.fields=orgId
I am running the JDBC connector using the command line:
connect-standalone /home/jim/platform/confluent/etc/schema-registry/connect-avro-standalone.properties /home/jim/prg/kafka/config/my.mysql.properties
This gives me a topic called my-organisation, that is keyed on orgId ..... so far so good!
(Note: the namespace does not seem to be set by the JDBC connector. I don't think this is an issue, but I don't know for sure.)
Now, the code. Here is how I initialise and create the GlobalKTable (relevant code shown):
final Map<String, String> serdeConfig =
Collections.singletonMap(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG,
schemaRegistryUrl);
final StreamsBuilder builder = new StreamsBuilder();
final SpecificAvroSerde<Organisation> orgSerde = new SpecificAvroSerde<>();
orgSerde.configure(serdeConfig, false);
// Create the GlobalKTable from the topic that was populated using the connect-standalone command line
final GlobalKTable<String, Organisation>
orgs =
builder.globalTable(ORG_TOPIC, Materialized.<String, Organisation, KeyValueStore<Bytes, byte[]>>as(ORG_STORE)
.withKeySerde(Serdes.String())
.withValueSerde(orgSerde));
The avro schema, from where the Organisaton class is generated is defined as:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"Organisation",
"fields":[
{"name": "pk", "type":"long"},
{"name": "org_name", "type":"string"},
{"name": "orgId", "type":"string"}
]
}
Note: as described above the orgId is set as the key on the topic using the single message transform (SMT) operation.
So, that is the GlobalKTable setup.
Now for the KStream setup (the stream side of the join). This has the same key (orgId) as the GlobalKTable. I use a simple driver program for this:
(The use case is that this topic will contain events associated with each organisation)
public class UploadGenerator {
public static void main(String[] args){
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put("schema.registry.url", "http://localhost:8081");
KafkaProducer producer = new KafkaProducer(props);
// This schema is also used in the consumer application or more specifically a class generated from it.
String mySchema = "{\"namespace\": \"io.confluent.examples.streams.avro\"," +
"\"type\":\"record\"," +
"\"name\":\"DocumentUpload\"," +
"\"fields\":[{\"name\":\"orgId\",\"type\":\"string\"}," +
"{\"name\":\"date\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}]}";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(mySchema);
// Just using three fictional organisations with the following orgIds/keys
String[] ORG_ARRAY = {"002", "003", "004"};
long count = 0;
String key = ""; // key is the realm
while(true) {
count++;
try {
TimeUnit.SECONDS.sleep(5);
} catch (InterruptedException e) {
}
GenericRecord avroRecord = new GenericData.Record(schema);
int orgId = ThreadLocalRandom.current().nextInt(0, 2 + 1);
avroRecord.put("orgId",ORG_ARRAY[orgId]);
avroRecord.put("date",new Date().getTime());
key = ORG_ARRAY[orgId];
ProducerRecord<Object, Object> record = new ProducerRecord<>("topic_uploads", key, avroRecord);
try {
producer.send(record);
producer.flush();
} catch(SerializationException e) {
System.out.println("Exccccception was generated! + " + e.getMessage());
} catch(Exception el) {
System.out.println("Exception: " + el.getMessage());
}
}
}
}
So, this generates a new event representing an upload for an organisation identified by the orgId, which is also explicitly set as the key in the ProducerRecord.
Here is the code that sets up the KStream for these events:
final SpecificAvroSerde<DocumentUpload> uploadSerde = new SpecificAvroSerde<>();
uploadSerde.configure(serdeConfig, false);
// Get the stream of uploads
final KStream<String, DocumentUpload> uploadStream = builder.stream(UPLOADS_TOPIC, Consumed.with(Serdes.String(), uploadSerde));
// Debug output to see the contents of the stream
uploadStream.foreach((k, v) -> System.out.println("uploadStream: Key: " + k + ", Value: " + v));
// Note, I tried to re-key the stream with the orgId field (even though it was set as the key in the driver but same problem)
final KStream<String, DocumentUpload> keyedUploadStream = uploadStream.selectKey((key, value) -> value.getOrgId());
keyedUploadStream.foreach((k, v) -> System.out.println("keyedUploadStream: Key: " + k + ", Value: " + v));
// Java 7 form used as it was easier to put in debug statements
// OrgPK is just a helper class defined in the same class
KStream<String, OrgPk> joined = keyedUploadStream.leftJoin(orgs,
new KeyValueMapper<String, DocumentUpload, String>() { /* derive a (potentially) new key by which to lookup against the table */
@Override
public String apply(String key, DocumentUpload value) {
System.out.println("1. The key passed in is: " + key);
System.out.println("2. The upload realm passed in is: " + value.getOrgId());
return value.getOrgId();
}
},
// THIS IS NEVER CALLED WITH A join() AND WHEN CALLED WITH A leftJoin() HAS A NULL ORGANISATION
new ValueJoiner<DocumentUpload, Organisation, OrgPk>() {
@Override
public OrgPk apply(DocumentUpload leftValue, Organisation rightValue) {
System.out.println("3. Value joiner has been called...");
if( null == rightValue ) {
// THIS IS ALWAYS CALLED, SO THERE IS NEVER A "MATCH"
System.out.println(" 3.1. Orgnisation is NULL");
return new OrgPk(leftValue.getRealm(), 1L);
}
System.out.println(" 3.1. Org is OK");
// Never reaches here - this is the issue i.e. there is never a match
return new OrgPk(leftValue.getOrgId(), rightValue.getPk());
}
});
So, the above join (or leftJoin) never matches, even though the two keys are the same! This is the main issue.
Finally, the avro schema for the DocumentUpload is:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"DocumentUpload",
"fields":[
{"name": "orgId", "type":"string"},
{"name":"date", "type":"long", "logicalType":"timestamp-millis"}
]
}
So, in summary:
I have a KStream on a topic with a String key of OrgId
I have a GlobalKTable on a topic with a String key of OrgId also.
The join never works, even though the keys are in the GlobalKTable (at least they are in the topic underlying the GlobalKTable)
Can someone help me? I am pulling my hair out trying to figure this out.
I was able to solve this issue on Windows/IntelliJ by providing a state dir config:
StreamsConfig.STATE_DIR_CONFIG
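For example, something like this in the Streams configuration (the path is just an illustration; any directory the process can write to should do):

Properties props = new Properties();
// ... other Streams settings ...
// Point Kafka Streams at a writable state directory (important on Windows).
props.put(StreamsConfig.STATE_DIR_CONFIG, "C:/kafka-streams-state");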

Kafka Stream giving weird output

I'm playing around with Kafka Streams trying to do basic aggregations (for the purpose of this question, just incrementing by 1 on each message). On the output topic that receives the changes done to the KTable, I get really weird output:
#B�
#C
#C�
#D
#D�
#E
#E�
#F
#F�
I recognize that the "�" means that it's printing out some kind of character that doesn't exist in the character set, but I'm not sure why. Here's my code for reference:
public class KafkaMetricsAggregator {
public static void main(final String[] args) throws Exception {
final String bootstrapServers = args.length > 0 ? args[0] : "my-kafka-ip:9092";
final Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-aggregator");
// Where to find Kafka broker(s).
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
// Specify default (de)serializers for record keys and for record values.
streamsConfig.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfig.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
// Records should be flushed every 10 seconds. This is less than the default
// in order to keep this example interactive.
streamsConfig.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
// For illustrative purposes we disable record caches
streamsConfig.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
// Class to extract the timestamp from the event object
streamsConfig.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, "my.package.EventTimestampExtractor");
// Set up serializers and deserializers, which we will use for overriding the default serdes
// specified above.
final Serde<JsonNode> jsonSerde = Serdes.serdeFrom(new JsonSerializer(), new JsonDeserializer());
final Serde<String> stringSerde = Serdes.String();
final Serde<Double> doubleSerde = Serdes.Double();
final KStreamBuilder builder = new KStreamBuilder();
final KTable<String, Double> aggregatedMetrics = builder.stream(jsonSerde, jsonSerde, "test2")
.groupBy(KafkaMetricsAggregator::generateKey, stringSerde, jsonSerde)
.aggregate(
() -> 0d,
(key, value, agg) -> agg + 1,
doubleSerde,
"metrics-table2");
aggregatedMetrics.to(stringSerde, doubleSerde, "metrics");
final KafkaStreams streams = new KafkaStreams(builder, streamsConfig);
// Only clean up in development
streams.cleanUp();
streams.start();
// Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
}
}
EDIT: Using aggregatedMetrics.print(); does print out the correct output to the console:
[KSTREAM-AGGREGATE-0000000002]: my-generated-key , (43.0<-null)
Any ideas about what's going on?
You're using Serdes.Double() for your values; it uses an efficient binary encoding [1] for the serialised values, and that's what you're seeing on your topic. To get human-readable numbers on the console, you'd need to instruct the consumer to use the DoubleDeserializer too.
[1] https://github.com/apache/kafka/blob/e31c0c9bdbad432bc21b583bd3c084f05323f642/clients/src/main/java/org/apache/kafka/common/serialization/DoubleSerializer.java#L29-L44
Specify DoubleDeserializer as the value deserializer on the consumer command line, as shown below:
--property value.deserializer=org.apache.kafka.common.serialization.DoubleDeserializer
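Alternatively, if you want the output topic itself to contain human-readable text, you could convert the value to a String before writing it; a sketch against the same older DSL used in the question:

// Write the counts as text so a plain console consumer can read them without extra properties.
aggregatedMetrics
        .mapValues(count -> Double.toString(count))
        .to(stringSerde, stringSerde, "metrics");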