Getting 409 from the schema registry while saving records to a local state store where multiple state stores are associated with a single processor - apache-kafka

Long story short: I am in the middle of implementing a processor topology: the processor is to store the received records into corresponding local state stores and do event-based processing as a record arrives. And the related code looks like this:
#Override
public void configureBuilder(StreamsBuilder builder) {
final Map<String, String> serdeConfig =
Collections.singletonMap("schema.registry.url", processorConfig.getSchemaRegistryUrl());
final Serde<GenericRecord> valueSerde = new GenericAvroSerde();
valueSerde.configure(serdeConfig, false); // `true` for record keys
final Serde<EventKey> keySerde = new SpecificAvroSerde();
keySerde.configure(serdeConfig, true); // `true` for record keys
Map<String, String> stateStoreConfigMap = new HashMap<>();
//stateStoreConfigMap.put(KafkaAvroSerializerConfig.VALUE_SUBJECT_NAME_STRATEGY, RecordNameStrategy.class.getName());
StoreBuilder<KeyValueStore<EventKey, GenericRecord>> aggSequenceStateStoreBuilder =
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore(processStateStore), keySerde, valueSerde)
.withLoggingEnabled(stateStoreConfigMap)
.withCachingEnabled();
final Serde<EnrichedSmcHeatData> enrichedSmcHeatDataSerde = new SpecificAvroSerde<>();
enrichedSmcHeatDataSerde.configure(serdeConfig, false); // `true` for record keys
StoreBuilder<KeyValueStore<EventKey, EnrichedSmcHeatData>> enrichedSmcHeatStateStoreBuilder =
Stores.keyValueStoreBuilder(
Stores.persistentKeyValueStore("enriched-smc-heat-state-store"), keySerde, enrichedSmcHeatDataSerde)
.withLoggingEnabled(stateStoreConfigMap)
.withCachingEnabled();
Topology topology = builder.build();
topology
.addSource(
PROCESS_EVENTS_SOURCE,
keySerde.deserializer(),
valueSerde.deserializer(),
processorConfig.getInputCcmProcessEvents())
.addSource(
SCHEDULED_SEQUENCES_SOURCE,
keySerde.deserializer(),
valueSerde.deserializer(),
processorConfig.getScheduledCastSequences())
.addSource(
SMC_HEAT_EVENTS_SOURCE,
keySerde.deserializer(),
valueSerde.deserializer(),
processorConfig.getInputSmcHeatEvents())
.addProcessor(
PROCESS_STATE_AGGREGATOR,
() -> new ProcessStateProcessor(processStateStore, processorConfig),
PROCESS_EVENTS_SOURCE,
SCHEDULED_SEQUENCES_SOURCE,
SMC_HEAT_EVENTS_SOURCE)
.addStateStore(aggSequenceStateStoreBuilder, PROCESS_STATE_AGGREGATOR)
.addStateStore(enrichedSmcHeatStateStoreBuilder, PROCESS_STATE_AGGREGATOR);
If there are updates for the store created by aggSequenceStateStoreBuilder, the record values could be saved to the store without problems. However, if updates came for the second store, the following error was getting thrown:
Caused by:
io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException:
Schema being registered is incompatible with an earlier schema for
subject
"ccm-process-events-processor-ccm-process-state-store-changelog-value";
error code: 409
My use case: the state processor accepts inbound records from multiple source topics and do event-handling (including storing the modified values to the corresponding stores) when a record arrives from any of the source topics.
It appears that there can only be one schema registered with the schema registry for the same processor. Is that by design, or am I missing anything, or what alternative options do I have instead?
Thanks in advance!

Related

Ktable to KGroupTable - Schema Not available (State Store ChangeLog Schema not registered)

I have a Kafka topic - let's activity-daily-aggregate,
and I want to do aggregate (add/sub) using KGroupTable. So I read the topic using the
final KTable<String, GenericRecord> inputKTable =
builder.table("activity-daily-aggregate",Consumed.with(new StringSerde(), getConsumerSerde());
Note: getConsumerSerde - returns >> new GenericAvroSerde(mockSchemaRegistryClient)
2.Next Step,
inputKTable.groupBy(
(key,value)->KeyValue.pair(KeyMapper.generateGroupKey(value), new JsonValueMapper().apply(value)),
Grouped.with(AppSerdes.String(), AppSerdes.jsonNode())
);
Before Step 1 and 2 I have configured MockSchemaRegistryClient with
mockSchemaRegistryClient.register("activity-daily-aggregate-key",
Schema.parse(AppUtils.class.getResourceAsStream("/avro/key.avsc")));
mockSchemaRegistryClient.register("activity-daily-aggregate-value",
Schema.parse(AppUtils.class.getResourceAsStream("/avro/daily-activity-aggregate.avsc")))
While I run the topology - using test cases, I get an error at Step 2.
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000011, topic=activity-daily-aggregate, partition=0, offset=0, stacktrace=org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema: {"type":"record","name":"FactActivity","namespace":"com.ascendlearning.avro","fields":.....}
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Schema Not Found; error code: 404001
The Error goes off when i register the schema with mockSchemaRegistryClient,
stream-app-id-activity-daily-aggregate-STATE-STORE-0000000010-changelog-key
stream-app-id-activity-daily-aggregate-STATE-STORE-0000000010-changelog-value
=> /avro/daily-activity-aggregate.avsc
Do we need to do this step? I thought it might be handled automatically by the topology
From the blog,
https://blog.jdriven.com/2019/12/kafka-streams-topologytestdriver-with-avro/
When you configure the same mock:// URL in both the Properties passed into TopologyTestDriver, as well as for the (de)serializer instances passed into createInputTopic and createOutputTopic, all (de)serializers will use the same MockSchemaRegistryClient, with a single in-memory schema store.
// Configure Serdes to use the same mock schema registry URL
Map<String, String> config = Map.of(
AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, MOCK_SCHEMA_REGISTRY_URL);
avroUserSerde.configure(config, false);
avroColorSerde.configure(config, false);
// Define input and output topics to use in tests
usersTopic = testDriver.createInputTopic(
"users-topic",
stringSerde.serializer(),
avroUserSerde.serializer());
colorsTopic = testDriver.createOutputTopic(
"colors-topic",
stringSerde.deserializer(),
avroColorSerde.deserializer());
I was not passing the mock registry client - schema URL in the serdes passed to input /output topic.

How to access a KStreams Materialized State Store from another Stream Processor

I need to be able to remove a record from a Ktable from a separate Stream Processor. Today I'm using aggregate() and passing a materialized state store. In a separate processor that reads from a "termination" topic, I'd like to query that materialized state store either in a .transform() or a different .aggregate() and 'remove' that key/value. Every time I try to access the materialized state from a separate stream processor, it keeps telling me either the store isn't added to the topology, so then I add it and run it again, then it tells me it's already be registered and errors out.
builder.stream("topic1").map().groupByKey().aggregate(() -> null,
(aggKey, newValue, aggValue) -> {
//add to the Ktable
return newValue;
},
stateStoreMaterialized);
and in a separate stream I want to delete a key from that stateStoreMaterialized
builder.stream("topic2")
.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
stateStoreDeleteTransformer will query the key and delete it.
//in ctor
KeyValueBytesStoreSupplier stateStoreSupplier = Stores.persistentKeyValueStore("store1");
stateStoreMaterialized = Materialized.<String, MyObj>as(stateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(mySerDe);
I don't have a terminal flag on my topic1 stream object value that can trigger a deletion. It has to come from another stream/topic.
When I try to use the same Materialized Store on two separate stream processors I get..
Invalid topology: Topic STATE_STORE-repartition has already been registered by another source.
at org.springframework.kafka.config.StreamsBuilderFactoryBean.start(StreamsBuilderFactoryBean.java:268)
Edit:
This is the 1st error I receive.
Caused by: org.apache.kafka.streams.errors.StreamsException: Processor KSTREAM-TRANSFORMVALUES-0000000012 has no access to StateStore store1 as the store is not connected to the processor. If you add stores manually via '.addStateStore()' make sure to connect the added store to the processor by providing the processor name to '.addStateStore()' or connect them via '.connectProcessorAndStateStores()'. DSL users need to provide the store name to '.process()', '.transform()', or '.transformValues()' to connect the store to the corresponding operator. If you do not add stores manually, please file a bug report at https://issues.apache.org/jira/projects/KAFKA.
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.getStateStore(ProcessorContextImpl.java:104)
at org.apache.kafka.streams.processor.internals.ForwardingDisabledProcessorContext.getStateStore(ForwardingDisabledProcessorContext.java:85)
So then I do this:
stateStoreSupplier = Stores.persistentKeyValueStore(STATE_STORE_NAME);
storeStoreBuilder = Stores.keyValueStoreBuilder(stateStoreSupplier, Serdes.String(), jsonSerDe);
stateStoreMaterialized = Materialized.as(stateStoreSupplier);
Then I get this error:
Caused by: org.apache.kafka.streams.errors.TopologyException: Invalid topology: StateStore 'state-store' is already added.
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:520)
at org.apache.kafka.streams.processor.internals.InternalTopologyBuilder.addStateStore(InternalTopologyBuilder.java:512)
Here's the code that fixed my issue. As it turns out, order matters when building the streams. Had to set the materialized store first and then in subsequent lines of code, setup the transformer.
/**
* Create the streams using the KStreams DSL - a method to configure the stream and add any state stores.
*/
#Bean
public KafkaStreamsConfig setup() {
final JsonSerDe<Bus> ltaSerde = new JsonSerDe<>(Bus.class);
final StudentSerde<Student> StudentSerde = new StudentSerde<>();
//start lta stream
KStream<String, Bus> ltaStream = builder
.stream(ltaInputTopic, Consumed.with(Serdes.String(), ltaSerde));
final KStream<String, Student> statusStream = this.builder
.stream(this.locoStatusInputTopic,
Consumed.with(Serdes.String(),
StudentSerde));
//create lta store
KeyValueBytesStoreSupplier ltaStateStoreSupplier = Stores.persistentKeyValueStore(LTA_STATE_STORE_NAME);
final Materialized<String, Bus, KeyValueStore<Bytes, byte[]>> ltaStateStoreMaterialized =
Materialized.
<String, Bus>as(ltaStateStoreSupplier)
.withKeySerde(Serdes.String())
.withValueSerde(ltaSerde);
KTable<String, Bus> ltaStateProcessor = ltaStream
//map and convert lta stream into Loco / LTA key value pairs
.groupByKey(Grouped.with(Serdes.String(), ltaSerde))
.aggregate(
//The 'aggregate' and 'reduce' functions ignore messages with null values FYI.
// so if the value after the groupbykey produces a null value, it won't be removed from the state store.
//which is why it's very important to send a message with some terminal flag indicating this value should be removed from the store.
() -> null, /* initializer */
(aggKey, newValue, aggValue) -> {
if (null != newValue.getAssociationEndTime()) { //if there is an endTime associated to this train/loco then remove it from the ktable
logger.trace("removing LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return null; //Returning null removes the record from the state store as well as its changelog topic. re: https://objectpartners.com/2019/07/31/slimming-down-your-kafka-streams-data/
}
logger.trace("adding LTA: {} loco from {} train", newValue.getLocoId(), newValue.getTrainAuthorization());
return newValue;
}, /* adder */
ltaStateStoreMaterialized
);
// don't need builder.addStateStore(keyValueStoreStoreBuilder); and CANT use it
// because the ltaStateStoreMaterialized will already be added to the topology in the KTable aggregate method above.
// The below transformer can use the state store because it's already added (apparently) by the aggregate method.
// Add the KTable processors first, then if there are any transformers that need to use the store, add them after the KTable aggregate method.
statusStream.map((k, v) -> new KeyValue<>(v.getLocoId(), v))
.transform(locoStatusTransformerSupplier, ltaStateStoreSupplier.name())
.to("testing.outputtopic", Produced.with(Serdes.String(), StudentSerde));
return this; //can return anything except for void.
}
is stateStoreMaterialized and stateStoreSupplier.name() has the same name?
Use have a error in your topology
KStream.transform(stateStoreDeleteTransformer, stateStoreSupplier.name())
You have to supply new instant of StateStoreDeleteTransformer per ProcessContext in TransformerSupplier, like this:
KStream.transform(StateStoreDeleteTransformer::new, stateStoreSupplier.name())
or
KStream.transform(() -> StateStoreDeleteTransformerSupplier.get(), stateStoreSupplier.name())//StateStoreDeleteTransformerSupplier return new instant of StateStoreDeleteTransformer
in stateStoreDeleteTransformer how do you intent on using stateStoreMaterialized inside transformer directly?
I have the similar use case and I using a KeyValueStore<String, MyObj>
public void init(ProcessorContext context) {
kvStore = (KeyValueStore<String, MyObj>) context.getStateStore("store1");
}

Is there a retention policy for custom state store (RocksDb) with Kafka streams?

I am setting up a new Kafka streams application, and want to use custom state store using RocksDb. This is working fine for putting data in state store and getting a queryable state store from it and iterating over the data, However, after about 72 hours I observe data to be missing from the store. Is there a default retention time on data for state store in Kafka streams or in RocksDb?
I are using custom state store using RocksDb so that we can utilize the column family feature, that we can't use with the embedded RocksDb implementation with KStreams. I have implementated custom store using KeyValueStore interface. And have my own StoreSupplier, StoreBuilder, StoreType and StoreWrapper as well.
A changelog topic is created for the application but no data is going to it yet (haven't looked into that problem yet).
Putting data into this custom state store and getting queryable state store from it is working fine. However, I am seeing that data is missing after about 72 hours from the store. I checked by getting the size of the state store directory as well as by exporting the data into files and checking the number of entries.
Using SNAPPY compression and UNIVERSAL compaction
Simple topology:
final StreamsBuilder builder = new StreamsBuilder();
String storeName = "store-name"
List<String> cfNames = new ArrayList<>();
// Hybrid custom store
final StoreBuilder customStore = new RocksDBColumnFamilyStoreBuilder(storeName, cfNames);
builder.addStateStore(customStore);
KStream<String, String> inputstream = builder.stream(
inputTopicName,
Consumed.with(Serdes.String(), Serdes.String()
));
inputstream
.transform(() -> new CurrentTransformer(storeName), storeName);
Topology tp = builder.build();
Snippet from custom store implementation:
RocksDBColumnFamilyStore(final String name, final String parentDir, List<String> columnFamilyNames) {
.....
......
final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
.setBlockCache(cache)
.setBlockSize(BLOCK_SIZE)
.setCacheIndexAndFilterBlocks(true)
.setPinL0FilterAndIndexBlocksInCache(true)
.setFilterPolicy(filter)
.setCacheIndexAndFilterBlocksWithHighPriority(true)
.setPinTopLevelIndexAndFilter(true)
;
cfOptions = new ColumnFamilyOptions()
.setCompressionType(CompressionType.SNAPPY_COMPRESSION)
.setCompactionStyle(CompactionStyle.UNIVERSAL)
.setMaxWriteBufferNumber(MAX_WRITE_BUFFERS)
.setOptimizeFiltersForHits(true)
.setLevelCompactionDynamicLevelBytes(true)
.setTableFormatConfig(tableConfig);
columnFamilyDescriptors.add(new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOptions));
columnFamilyNames.stream().forEach((cfName) -> columnFamilyDescriptors.add(new ColumnFamilyDescriptor(cfName.getBytes(), cfOptions)));
}
#SuppressWarnings("unchecked")
public void openDB(final ProcessorContext context) {
Options opts = new Options()
.prepareForBulkLoad();
options = new DBOptions(opts)
.setCreateIfMissing(true)
.setErrorIfExists(false)
.setInfoLogLevel(InfoLogLevel.INFO_LEVEL)
.setMaxOpenFiles(-1)
.setWriteBufferManager(writeBufferManager)
.setIncreaseParallelism(Math.max(Runtime.getRuntime().availableProcessors(), 2))
.setCreateMissingColumnFamilies(true);
fOptions = new FlushOptions();
fOptions.setWaitForFlush(true);
dbDir = new File(new File(context.stateDir(), parentDir), name);
try {
Files.createDirectories(dbDir.getParentFile().toPath());
db = RocksDB.open(options, dbDir.getAbsolutePath(), columnFamilyDescriptors, columnFamilyHandles);
columnFamilyHandles.stream().forEach((handle) -> {
try {
columnFamilyMap.put(new String(handle.getName()), handle);
} catch (RocksDBException e) {
throw new ProcessorStateException("Error opening store " + name + " at location " + dbDir.toString(), e);
}
});
} catch (RocksDBException e) {
throw new ProcessorStateException("Error opening store " + name + " at location " + dbDir.toString(), e);
}
open = true;
}
The expectation is that the state store (RocksDb) will retain the data indefinitely until manually deleted or until the storage disk goes down. I am not aware that Kafka streams has introduced having TTl with state stores yet.

How to Stream to a Global Kafka Table

I have a Kafka Streams application that needs to join an incoming stream against a global table, then after some processing, write out the result of an aggregate back to that table:
KeyValueBytesStoreSupplier supplier = Stores.persistentKeyValueStore(
storeName
);
Materialized<String, String, KeyValueStore<Bytes, byte[]>> m = Materialized.as(
supplier
);
GlobalKTable<String, String> table = builder.globalTable(
topic, m.withKeySerde(
Serdes.String()
).withValueSerde(
Serdes.String()
)
);
stream.leftJoin(
table
...
).groupByKey().aggregate(
...
).toStream().through(
topic, Produced.with(Serdes.String(), Serdes.String())
);
However, when I try to stream into the KTable changelog, I get the following error: Invalid topology: Topic 'topic' has already been registered by another source.
If I try to aggregate to the store itself, I get the following error: InvalidStateStoreException: Store 'store' is currently closed.
How can both join against the table and write back to its changelog?
If this isn't possible, a solution that involves filtering incoming logs against the store would also work.
Calling through() is a shortcut for
stream.to("topic");
KStream stream2 = builder.stream("topic");
Because you use builder.stream("topic") already, you get Invalid topology: Topic 'topic' has already been registered by another source. because each topic can only be consumed once. If you want to feed the data of a stream/topic into different part, you need to reuse the original KStream you created for this topic:
KStream stream = builder.stream("topic");
// this won't work
KStream stream2 = stream.through("topic");
// rewrite to
stream.to("topic");
KStream stream2 = stream; // or just omit `stream2` and reuse `stream`
Not sure what you mean by
If I try to aggregate to the store itself

How to write the ValueJoiner when joining two Kafka Streams defined using Avro Schemas?

I am building an ecommerce application, where I am currently dealing with two data feeds: order executions, and broken sales. A broken sale would be an invalid execution, for a variety of reasons. A broken sale would have the same order ref number as the order, so the join is on order ref # and line item #.
Currently, I have two topics - orders, and broken. Both have been defined using Avro Schemas, and built using SpecificRecord. The key is OrderReferenceNumber.
Fields for orders: OrderReferenceNumber, Timestamp, OrderLine, ItemNumber, Quantity
Fields for broken: OrderReferenceNumber, OrderLine, Timestamp
Corresponding Java classes were generated by running
mvn clean package
I need to left-join orders with broken and include the following fields in the output: OrderReferenceNumber, Timestamp, BrokenSaleTimestamp, OrderLine, ItemNumber, Quantity
Here is my code:
public static void main(String[] args) {
// Declare variables
final Map<String, String> avroSerdeConfig = Collections.singletonMap(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
// Add Kafka Streams Properties
Properties streamsProperties = new Properties();
streamsProperties.put(StreamsConfig.APPLICATION_ID_CONFIG, "orderProcessor");
streamsProperties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsProperties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
streamsProperties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
streamsProperties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, SpecificAvroSerde.class);
streamsProperties.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "localhost:8081");
// Specify Kafka Topic Names
String orderTopic = "com.ecomapp.input.OrderExecuted";
String brokenTopic = "com.ecomapp.input.BrokenSale";
// Specify Serializer-Deserializer or Serdes for each Message Type
Serdes.StringSerde stringSerde = new Serdes.StringSerde();
Serdes.LongSerde longSerde = new Serdes.LongSerde();
// For the Order Executed Message
SpecificAvroSerde<OrderExecuted> ordersSpecificAvroSerde = new SpecificAvroSerde<OrderExecuted>();
ordersSpecificAvroSerde.configure(avroSerdeConfig, false);
// For the Broken Sale Message
SpecificAvroSerde<BrokenSale> brokenSpecificAvroSerde = new SpecificAvroSerde<BrokenSale>();
brokenSpecificAvroSerde.configure(avroSerdeConfig, false);
StreamsBuilder streamBuilder = new StreamsBuilder();
KStream<String, OrderExecuted> orders = streamBuilder
.stream(orderTopic, Consumed.with(stringSerde, ordersSpecificAvroSerde))
.selectKey((key, orderExec) -> orderExec.getMatchNumber().toString());
KStream<String, BrokenSale> broken = streamBuilder
.stream(brokenTopic, Consumed.with(stringSerde, brokenSpecificAvroSerde))
.selectKey((key, brokenS) -> brokenS.getMatchNumber().toString());
KStream<String, JoinOrdersExecutedNonBroken> joinOrdersNonBroken = orders
.leftJoin(broken,
(orderExec, brokenS) -> JoinOrdersExecutedNonBroken.newBuilder()
.setOrderReferenceNumber((Long) orderExec.get("OrderReferenceNumber"))
.setTimestamp((Long) orderExec.get("Timestamp"))
.setBrokenSaleTimestamp((Long) brokenS.get("Timestamp"))
.setOrderLine((Long) orderExec.get("OrderLine"))
.setItemNumber((String) orderExec.get("ItemNumber"))
.setQuantity((Long) orderExec.get("Quantity"))
.build(),
JoinWindows.of(TimeUnit.MILLISECONDS.toMillis(1))
Joined.with(stringSerde, ordersSpecificAvroSerde, brokenSpecificAvroSerde))
.peek((key, value) -> System.out.println("key = " + key + ", value = " + value));
KafkaStreams orderStreams = new KafkaStreams(streamBuilder.build(), streamsProperties);
orderStreams.start();
// print the topology
System.out.println(orderStreams.localThreadsMetadata());
// shutdown hook to correctly close the streams application
Runtime.getRuntime().addShutdownHook(new Thread(orderStreams::close));
}
When I run this, I get the following maven compile error:
[ERROR] /Tech/Projects/jCom/src/main/java/com/ecomapp/kafka/orderProcessor.java:[96,26] incompatible types: cannot infer type-variable(s) VO,VR,K,V,VO
(argument mismatch; org.apache.kafka.streams.kstream.Joined<K,V,com.ecomapp.input.BrokenSale> cannot be converted to org.apache.kafka.streams.kstream.Joined<java.lang.String,com.ecomapp.OrderExecuted,com.ecomapp.input.BrokenSale>)
The issue really is in defining my ValueJoiner. The Confluent documentation is not very clear on how to do this when Avro schemas are involved (I can't find examples either). What is the right way to define this?
Not sure why Java cannot resolve the type.
Try:
Joined.<String,OrderExecuted,BrokenSale>with(stringSerde, ordersSpecificAvroSerde, brokenSpecificAvroSerde))
To specify the types explicitly.