KTable to KGroupedTable - Schema Not Available (State Store Changelog Schema Not Registered) - apache-kafka

I have a Kafka topic - let's say activity-daily-aggregate - and I want to aggregate (add/sub) using a KGroupedTable. So, as step 1, I read the topic using:
final KTable<String, GenericRecord> inputKTable =
    builder.table("activity-daily-aggregate", Consumed.with(new StringSerde(), getConsumerSerde()));
Note: getConsumerSerde() returns new GenericAvroSerde(mockSchemaRegistryClient)
2. Next, I group the table:
inputKTable.groupBy(
    (key, value) -> KeyValue.pair(KeyMapper.generateGroupKey(value), new JsonValueMapper().apply(value)),
    Grouped.with(AppSerdes.String(), AppSerdes.jsonNode())
);
Before steps 1 and 2 I have configured the MockSchemaRegistryClient with:
mockSchemaRegistryClient.register("activity-daily-aggregate-key",
    Schema.parse(AppUtils.class.getResourceAsStream("/avro/key.avsc")));
mockSchemaRegistryClient.register("activity-daily-aggregate-value",
    Schema.parse(AppUtils.class.getResourceAsStream("/avro/daily-activity-aggregate.avsc")));
When I run the topology using test cases, I get an error at step 2:
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000011, topic=activity-daily-aggregate, partition=0, offset=0, stacktrace=org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema: {"type":"record","name":"FactActivity","namespace":"com.ascendlearning.avro","fields":.....}
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema Not Found; error code: 404001
The error goes away when I also register the schema with mockSchemaRegistryClient under the state store changelog subjects:
stream-app-id-activity-daily-aggregate-STATE-STORE-0000000010-changelog-key
stream-app-id-activity-daily-aggregate-STATE-STORE-0000000010-changelog-value
=> /avro/daily-activity-aggregate.avsc
Do we need to do this step? I thought it would be handled automatically by the topology.

From the blog,
https://blog.jdriven.com/2019/12/kafka-streams-topologytestdriver-with-avro/
When you configure the same mock:// URL in both the Properties passed into TopologyTestDriver, as well as for the (de)serializer instances passed into createInputTopic and createOutputTopic, all (de)serializers will use the same MockSchemaRegistryClient, with a single in-memory schema store.
// Configure Serdes to use the same mock schema registry URL
Map<String, String> config = Map.of(
    AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, MOCK_SCHEMA_REGISTRY_URL);
avroUserSerde.configure(config, false);
avroColorSerde.configure(config, false);

// Define input and output topics to use in tests
usersTopic = testDriver.createInputTopic(
    "users-topic",
    stringSerde.serializer(),
    avroUserSerde.serializer());
colorsTopic = testDriver.createOutputTopic(
    "colors-topic",
    stringSerde.deserializer(),
    avroColorSerde.deserializer());
It turned out I was not passing the mock schema registry URL in the Serdes used for the input/output topics.
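For completeness, a minimal sketch of that fix (the mock:// scope name and helper shape below are assumptions, not from my original code): configure the Serdes used inside the topology with the same mock:// URL instead of constructing them with the MockSchemaRegistryClient directly, so the state store changelog subjects are registered in the same in-memory schema store automatically.
private static final String MOCK_SCHEMA_REGISTRY_URL = "mock://test-scope";

private Serde<GenericRecord> getConsumerSerde() {
    final Serde<GenericRecord> serde = new GenericAvroSerde();
    // Same mock:// scope as the TopologyTestDriver properties and the test topics.
    serde.configure(
        Map.of(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, MOCK_SCHEMA_REGISTRY_URL),
        false); // isKey = false
    return serde;
}

// The Streams properties passed to TopologyTestDriver point at the same URL:
// props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, MOCK_SCHEMA_REGISTRY_URL);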

Related

Getting 409 from the schema registry while saving records to a local state store where multiple state stores are associated with a single processor

Long story short: I am in the middle of implementing a processor topology. The processor stores the received records into corresponding local state stores and does event-based processing as each record arrives. The related code looks like this:
@Override
public void configureBuilder(StreamsBuilder builder) {
    final Map<String, String> serdeConfig =
        Collections.singletonMap("schema.registry.url", processorConfig.getSchemaRegistryUrl());

    final Serde<GenericRecord> valueSerde = new GenericAvroSerde();
    valueSerde.configure(serdeConfig, false); // `false` for record values

    final Serde<EventKey> keySerde = new SpecificAvroSerde<>();
    keySerde.configure(serdeConfig, true); // `true` for record keys

    Map<String, String> stateStoreConfigMap = new HashMap<>();
    //stateStoreConfigMap.put(KafkaAvroSerializerConfig.VALUE_SUBJECT_NAME_STRATEGY, RecordNameStrategy.class.getName());

    StoreBuilder<KeyValueStore<EventKey, GenericRecord>> aggSequenceStateStoreBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(processStateStore), keySerde, valueSerde)
            .withLoggingEnabled(stateStoreConfigMap)
            .withCachingEnabled();

    final Serde<EnrichedSmcHeatData> enrichedSmcHeatDataSerde = new SpecificAvroSerde<>();
    enrichedSmcHeatDataSerde.configure(serdeConfig, false); // `false` for record values

    StoreBuilder<KeyValueStore<EventKey, EnrichedSmcHeatData>> enrichedSmcHeatStateStoreBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("enriched-smc-heat-state-store"), keySerde, enrichedSmcHeatDataSerde)
            .withLoggingEnabled(stateStoreConfigMap)
            .withCachingEnabled();

    Topology topology = builder.build();
    topology
        .addSource(
            PROCESS_EVENTS_SOURCE,
            keySerde.deserializer(),
            valueSerde.deserializer(),
            processorConfig.getInputCcmProcessEvents())
        .addSource(
            SCHEDULED_SEQUENCES_SOURCE,
            keySerde.deserializer(),
            valueSerde.deserializer(),
            processorConfig.getScheduledCastSequences())
        .addSource(
            SMC_HEAT_EVENTS_SOURCE,
            keySerde.deserializer(),
            valueSerde.deserializer(),
            processorConfig.getInputSmcHeatEvents())
        .addProcessor(
            PROCESS_STATE_AGGREGATOR,
            () -> new ProcessStateProcessor(processStateStore, processorConfig),
            PROCESS_EVENTS_SOURCE,
            SCHEDULED_SEQUENCES_SOURCE,
            SMC_HEAT_EVENTS_SOURCE)
        .addStateStore(aggSequenceStateStoreBuilder, PROCESS_STATE_AGGREGATOR)
        .addStateStore(enrichedSmcHeatStateStoreBuilder, PROCESS_STATE_AGGREGATOR);
}
If there are updates for the store created by aggSequenceStateStoreBuilder, the record values are saved to the store without problems. However, if an update comes in for the second store, the following error is thrown:
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException:
Schema being registered is incompatible with an earlier schema for subject
"ccm-process-events-processor-ccm-process-state-store-changelog-value"; error code: 409
My use case: the state processor accepts inbound records from multiple source topics and does event handling (including storing the modified values to the corresponding stores) when a record arrives from any of the source topics.
It appears that only one schema can be registered with the schema registry for the same processor. Is that by design, am I missing something, or what alternative options do I have?
Thanks in advance!
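One direction worth trying, hinted at by the commented-out line in the snippet above (a hedged sketch, not a confirmed fix): set the value subject name strategy to RecordNameStrategy on the Serde config, so each Avro record type written to the shared changelog topic is registered under its own subject instead of all competing under the topic-name subject.
// Sketch only: the subject name strategy belongs in the Serde config, so the serializer
// backing the store's changelog registers each record type under its own subject.
final Map<String, String> serdeConfig = new HashMap<>();
serdeConfig.put("schema.registry.url", processorConfig.getSchemaRegistryUrl());
serdeConfig.put(
    KafkaAvroSerializerConfig.VALUE_SUBJECT_NAME_STRATEGY,
    RecordNameStrategy.class.getName());

final Serde<GenericRecord> valueSerde = new GenericAvroSerde();
valueSerde.configure(serdeConfig, false); // record values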

Apache Storm JoinBolt

I am trying to join two Kafka data streams (using Kafka spouts) into one using JoinBolt, with the following code snippet (http://storm.apache.org/releases/1.1.2/Joins.html).
The documentation says that each of JoinBolt's incoming data streams must be fields-grouped on a single field, and that a stream should only be joined with the other streams using the field on which it has been fields-grouped.
Code snippet:
KafkaSpout kafka_spout_1 = SpoutBuilder.buildSpout("127.0.0.1:2181", "test-topic-1", "/spout-1", "spout-1"); // String zkHosts, String topic, String zkRoot, String spoutId
KafkaSpout kafka_spout_2 = SpoutBuilder.buildSpout("127.0.0.1:2181", "test-topic-2", "/spout-2", "spout-2"); // String zkHosts, String topic, String zkRoot, String spoutId

topologyBuilder.setSpout("kafka-spout-1", kafka_spout_1, 1);
topologyBuilder.setSpout("kafka-spout-2", kafka_spout_2, 1);

JoinBolt joinBolt = new JoinBolt("kafka-spout-1", "id")
    .join("kafka-spout-2", "deptId", "kafka-spout-1")
    .select("id,deptId,firstName,deptName")
    .withTumblingWindow(new Duration(10, TimeUnit.SECONDS));

topologyBuilder.setBolt("joiner", joinBolt, 1)
    .fieldsGrouping("spout-1", new Fields("id"))
    .fieldsGrouping("spout-2", new Fields("deptId"));
kafka-spout-1 sample record --> {"id" : 1 ,"firstName" : "Alyssa" , "lastName" : "Parker"}
kafka-spout-2 sample record --> {"deptId" : 1 ,"deptName" : "Engineering"}
I got the following exception while deploying the topology using the above code snippet:
[main] WARN o.a.s.StormSubmitter - Topology submission exception: Component: [joiner] subscribes from stream: [default] of component [kafka-spout-2] with non-existent fields: #{"deptId"}
java.lang.RuntimeException: InvalidTopologyException(msg:Component: [joiner] subscribes from stream: [default] of component [kafka-spout-2] with non-existent fields: #{"deptId"})
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:273)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
at BuildTopology.runTopology(BuildTopology.java:71)
at Main.main(Main.java:6)
Caused by: InvalidTopologyException(msg:Component: [joiner] subscribes from stream: [default] of component [kafka-spout-2] with non-existent fields: #{"deptId"})
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8070)
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8047)
at org.apache.storm.generated.Nimbus$submitTopology_result.read(Nimbus.java:7981)
at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.storm.generated.Nimbus$Client.recv_submitTopology(Nimbus.java:306)
at org.apache.storm.generated.Nimbus$Client.submitTopology(Nimbus.java:290)
at org.apache.storm.StormSubmitter.submitTopologyInDistributeMode(StormSubmitter.java:326)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:260)
... 4 more
How can I solve this issue?
Thank you, any help will be appreciated.
Consider using storm-kafka-client instead of storm-kafka if you're doing new development. Storm-kafka is deprecated.
Does the spout actually emit a field called "deptId"?
Your configuration snippet doesn't mention that you set the SpoutConfig.scheme, and your example records seem to imply that you're emitting JSON documents containing a "deptId" field.
Storm doesn't know anything about JSON or the contents of the strings coming out of the spout. You need to define a scheme that makes the spout emit the "deptId" field separately from the rest of the record.
Here's the relevant snippet from one of the built-in schemes that emits the message, partition and offset in separate fields:
@Override
public List<Object> deserializeMessageWithMetadata(ByteBuffer message, Partition partition, long offset) {
    String stringMessage = StringScheme.deserializeString(message);
    return new Values(stringMessage, partition.partition, offset);
}

@Override
public Fields getOutputFields() {
    return new Fields(STRING_SCHEME_KEY, STRING_SCHEME_PARTITION_KEY, STRING_SCHEME_OFFSET);
}
See https://github.com/apache/storm/blob/v1.2.2/external/storm-kafka/src/jvm/org/apache/storm/kafka/StringMessageAndMetadataScheme.java for reference.
An alternative to doing this with a scheme is to put a bolt in between the spout and the JoinBolt that extracts the "deptId" from the record and emits it as a field alongside the record, as sketched below.
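A rough sketch of such an intermediate bolt (class name, field names, and the JSON handling are illustrative assumptions):
import java.io.IOException;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class DeptIdExtractorBolt extends BaseBasicBolt {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // With the default StringScheme the whole JSON record arrives as a single string field.
        String json = input.getString(0);
        try {
            JsonNode node = MAPPER.readTree(json);
            JsonNode deptId = node.get("deptId");
            if (deptId != null) {
                // Emit deptId as its own field so downstream bolts can fields-group on it.
                collector.emit(new Values(deptId.asInt(), json));
            }
        } catch (IOException e) {
            // Malformed record: drop it (or fail the tuple, depending on your needs).
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("deptId", "record"));
    }
}
You would then wire this bolt between the spout and the joiner, and fields-group the joiner on "deptId" coming from this bolt rather than from the spout.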

How to Stream to a Global Kafka Table

I have a Kafka Streams application that needs to join an incoming stream against a global table, then after some processing, write out the result of an aggregate back to that table:
KeyValueBytesStoreSupplier supplier = Stores.persistentKeyValueStore(storeName);
Materialized<String, String, KeyValueStore<Bytes, byte[]>> m = Materialized.as(supplier);
GlobalKTable<String, String> table = builder.globalTable(
    topic,
    m.withKeySerde(Serdes.String()).withValueSerde(Serdes.String()));

stream.leftJoin(
    table
    ...
).groupByKey().aggregate(
    ...
).toStream().through(
    topic, Produced.with(Serdes.String(), Serdes.String())
);
However, when I try to stream into the KTable changelog, I get the following error: Invalid topology: Topic 'topic' has already been registered by another source.
If I try to aggregate to the store itself, I get the following error: InvalidStateStoreException: Store 'store' is currently closed.
How can I both join against the table and write back to its changelog?
If this isn't possible, a solution that involves filtering incoming logs against the store would also work.
Calling through() is a shortcut for
stream.to("topic");
KStream stream2 = builder.stream("topic");
Because the topic is already registered as a source (in your case via builder.globalTable(topic)), you get Invalid topology: Topic 'topic' has already been registered by another source, because each topic can only be consumed once. If you want to feed the data of a stream/topic into different parts of the topology, you need to reuse the original KStream you created for this topic:
KStream stream = builder.stream("topic");
// this won't work
KStream stream2 = stream.through("topic");
// rewrite to
stream.to("topic");
KStream stream2 = stream; // or just omit `stream2` and reuse `stream`
Not sure what you mean by "If I try to aggregate to the store itself".
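Putting that together for the original snippet, a minimal sketch (topic names, the joiner, and the aggregation below are placeholders; String serdes assumed) that consumes the input once, joins against the GlobalKTable, and writes the aggregate back with to() instead of through():
StreamsBuilder builder = new StreamsBuilder();

// Global table materialized from the "table-topic" changelog.
GlobalKTable<String, String> table = builder.globalTable(
    "table-topic",
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("store")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.String()));

KStream<String, String> stream =
    builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()));

stream
    .leftJoin(table,
        (key, value) -> key,                         // map the stream key to the table key
        (value, tableValue) -> value + "|" + tableValue)
    .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
    .reduce((oldValue, newValue) -> newValue)        // stand-in for the real aggregation
    .toStream()
    // to() only adds a sink, so it does not conflict with the global table's source.
    .to("table-topic", Produced.with(Serdes.String(), Serdes.String()));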

KStream: Error Reading and Writing Avro records

I'm trying to write an Avro record that I read from one topic into another topic; the intention is to augment it with a transformation once this routing is working. I have used the KStream-with-Avro code from one of the examples, with some modifications to connect to the Schema Registry for retrieving the Avro schema.
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "mysql-stream-processing");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
streamsConfiguration.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);
final Serde<GenericRecord> keySerde = new GenericAvroSerde(
    new CachedSchemaRegistryClient(schemaRegistryUrl, 100),
    Collections.singletonMap(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl));
final Serde<GenericRecord> valueSerde = new GenericAvroSerde(
    new CachedSchemaRegistryClient(schemaRegistryUrl, 100),
    Collections.singletonMap(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl));
streamsConfiguration.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
final KStreamBuilder builder = new KStreamBuilder();
final KStream<GenericRecord, GenericRecord> record = builder.stream("dbserver1.employees.employees");
record.print(keySerde, valueSerde);
record.to(keySerde, valueSerde, "newtopic");
record.foreach((key, val) -> System.out.println(key.toString()+" "+val.toString()));
final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
streams.cleanUp();
streams.start();
When run, print() works and I can see the records in the console, but I'm unable to get the records written to "newtopic"; it fails with the error below:
Exception in thread "StreamThread-1" org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=dbserver1.employees.employees, partition=0, offset=0
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:217)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:627)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (key: io.confluent.examples.streams.utils.GenericAvroSerializer / value: io.confluent.examples.streams.utils.GenericAvroSerializer) is not compatible to the actual key or value type (key type: [B / value type: [B). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:81)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:70)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:198)
... 2 more
Caused by: java.lang.ClassCastException: [B cannot be cast to org.apache.avro.generic.GenericRecord
at io.confluent.examples.streams.utils.GenericAvroSerializer.serialize(GenericAvroSerializer.java:25)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:77)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:79)
... 5 more
I guess you need to configure the correct Serdes:
https://docs.confluent.io/current/streams/developer-guide/config-streams.html#default-key-serde
https://docs.confluent.io/current/streams/developer-guide/dsl-api.html
Either set correct global Serdes, or specify Serdes for each operator. If an operator needs a Serde, it has a corresponding overload taking Serdes as parameters.
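As a hedged sketch against the code in the question (same deprecated KStreamBuilder API, reusing the keySerde/valueSerde variables defined above), the per-operator variant would pass the Avro Serdes when reading and writing, so the source deserializes to GenericRecord instead of leaving raw byte arrays:
// Read with explicit Serdes so keys and values arrive as GenericRecord rather than byte[].
final KStream<GenericRecord, GenericRecord> record =
    builder.stream(keySerde, valueSerde, "dbserver1.employees.employees");

// Write with the same Serdes.
record.to(keySerde, valueSerde, "newtopic");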

Null Pointer Exception / Not Found Exception when I tried to process & sink data in Avro schema

I am using a processor to consume byte array data (with byte array Serdes) from a topic, process it into a GenericRecord (based on a schema I got from my HTTP GET request), and send it to another topic in Avro format via the Schema Registry.
I had no problem retrieving the schema from the HTTP GET request and mapping my data according to it to generate a GenericRecord that follows the schema. However, when I try to sink it to the topic I get a NullPointerException:
org.apache.kafka.common.errors.SerializationException: Error serializing Avro message
Caused by: java.lang.NullPointerException
at io.confluent.kafka.serializers.AbstractKafkaAvroSerializer.serializeImpl(AbstractKafkaAvroSerializer.java:72)
at io.confluent.kafka.serializers.KafkaAvroSerializer.serialize(KafkaAvroSerializer.java:54)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:78)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:79)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at streamProcessor.XXXXprocessor.process(XXXXprocessor.java:80)
at streamProcessor.XXXXprocessor.process(XXXXprocessor.java:1)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:48)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:134)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:111)
at streamProcessor.SelectorProcessor.process(SelectorProcessor.java:33)
at streamProcessor.SelectorProcessor.process(SelectorProcessor.java:1)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:48)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:134)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:70)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:197)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:627)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
This is my topology code:
// Stream properties
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "processor-kafka-streams234");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "xxxxxxxxxxxxxxxxxxxxxx:xxxx");
config.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
config.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass().getName());
config.put(StreamsConfig.TIMESTAMP_EXTRACTOR_CLASS_CONFIG, WallclockTimestampExtractor.class);

// Build topology
TopologyBuilder builder = new TopologyBuilder();
builder.addSource("messages-source", "mytest2");
builder.addProcessor("selector-processor", () -> new SelectorProcessor(), "messages-source");
builder.addProcessor("XXXX-processor", () -> new XXXXprocessor(), "selector-processor");
builder.addSink("XXXX-sink", "XXXXavrotest", new KafkaAvroSerializer(), new KafkaAvroSerializer(), "XXXX-processor");

// Start streaming
KafkaStreams streaming = new KafkaStreams(builder, config);
streaming.start();
System.out.println("processor streaming...");
After some reading on the issues forum I discovered that I might need to inject a client when creating the KafkaAvroSerializers, so I changed that line to:
SchemaRegistryClient client =
    new CachedSchemaRegistryClient("xxxxxxxxxxxxxxxxxxxxxx:xxxx/subjects/xxxxschemas/versions", 1000);
builder.addSink("XXXX-sink", "XXXXavrotest", new KafkaAvroSerializer(client), new KafkaAvroSerializer(client), "XXXX-processor");
This resulted in an HTTP 404 Not Found exception ...
I had my URL wrong :P
Also, the key had to be initialized to something other than null because of the cleanup.policy setting on my topic.
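For reference, a hedged sketch of what the corrected sink wiring might look like (host and port are placeholders): point CachedSchemaRegistryClient at the registry's base URL rather than a /subjects/... path, and hand the serializers the same URL through configure().
// Base URL only; no "/subjects/.../versions" path.
String schemaRegistryUrl = "http://xxxxxxxxxxxxxxxxxxxxxx:xxxx";
SchemaRegistryClient client = new CachedSchemaRegistryClient(schemaRegistryUrl, 1000);

Map<String, String> serializerConfig = Collections.singletonMap(
    AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);

KafkaAvroSerializer keySerializer = new KafkaAvroSerializer(client);
keySerializer.configure(serializerConfig, true);    // isKey = true
KafkaAvroSerializer valueSerializer = new KafkaAvroSerializer(client);
valueSerializer.configure(serializerConfig, false); // isKey = false

builder.addSink("XXXX-sink", "XXXXavrotest", keySerializer, valueSerializer, "XXXX-processor");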