Access kafka producer factory from spring-cloud-stream - apache-kafka

I have a project using spring-boot-cloud and apache-kafka, with a set of integration tests covering the topology logic, thanks to EmbeddedBroker.
I recently discovered that there is a lot of noise in the log when running these tests,
e.g. [Producer clientId=producer-2] Connection to node 0 (localhost/127.0.0.1:63267) could not be established. Broker may not be available.
After some trial and error, it appears that these messages come from the producers created by the spring-cloud-stream bindings. Somehow, even with @DirtiesContext(classMode = DirtiesContext.ClassMode.AFTER_EACH_TEST_METHOD) at the class level, they do not appear to be cleaned up after each test.
So I figured that if I could get access to the producer factory, I could clean them up manually in the @AfterEach method of my test class. I've tried autowiring KafkaTemplate, but it didn't help. I don't know how to access the producer factory, since it's created implicitly by the framework.
Please note that these messages do not appear to affect the test results, since they only show up at the end of the test phase.
Thanks in advance!

You can add a ProducerMessageHandlerCustomizer bean and get a reference to the producer factory that way.
@Bean
ProducerMessageHandlerCustomizer<KafkaProducerMessageHandler> cust() {
    // pfMap is a map field in the test class, keyed by destination name
    return (handler, dest) -> {
        this.pfMap.put(dest, handler.getKafkaTemplate().getProducerFactory());
    };
}
Store the PF in a map in the test case, then reset() it when you want to close the producer(s).

Related

Spring Batch partitioned job JMS acknowledgement

Let's say I have a Spring Batch remote partitioned job, i.e. I have a manager application instance which starts the job and partitions the work and I have multiple workers who are executing individual partitions.
The message channel where the partitions are sent to the workers is an ActiveMQ queue and the Spring Integration configuration is based on JMS.
Suppose I want to make sure that if a worker crashes in the middle of a partition execution, another worker will pick up the same partition.
I think this is where acknowledging JMS messages would come in handy: a message would only be acknowledged once a worker has fully completed its work on a particular partition. However, it seems that the message is acknowledged as soon as a worker receives it, so if the Spring Batch step fails in the worker, the message won't reappear (obviously).
Is this even possible with Spring Batch? I've tried transacted sessions too but it doesn't really work either.
I know how to achieve this with JMS API. The difficulty comes from the fact that there is a lot of abstraction with Spring Batch in terms of messaging, and I'm unable to figure it out.
In this case, I think the best way to answer this question is to remove all these abstractions coming from Spring Batch (as well as Spring Integration), and try to see where the acknowledgment can be configured.
In a remote partitioning setup, workers are listeners on a queue in which messages coming from the manager are of type StepExecutionRequest. The most basic code form of a worker in this setup is something like the following (simplified version of StepExecutionRequestHandler, which is configured as a Spring Integration service activator when using the RemotePartitioningWorkerStepBuilder):
@Component
public class BatchWorkerStep {
    @Autowired
    private JobRepository jobRepository;
    @Autowired
    private StepLocator stepLocator;

    // Message is org.springframework.messaging.Message, so the request is read with getPayload()
    @JmsListener(destination = "requests")
    public void receiveMessage(final Message<StepExecutionRequest> message) {
        StepExecutionRequest request = message.getPayload();
        Long jobExecutionId = request.getJobExecutionId();
        Long stepExecutionId = request.getStepExecutionId();
        String stepName = request.getStepName();
        StepExecution stepExecution = jobRepository.getStepExecution(jobExecutionId, stepExecutionId);
        Step step = stepLocator.getStep(stepName);
        try {
            step.execute(stepExecution);
            stepExecution.setStatus(BatchStatus.COMPLETED);
        } catch (Throwable e) {
            stepExecution.addFailureException(e);
            stepExecution.setStatus(BatchStatus.FAILED);
        } finally {
            jobRepository.update(stepExecution); // this is needed in a setup where the manager polls the job repository
        }
    }
}
As you can see, the JMS message acknowledgment cannot be configured on the worker side (there is no way to do it with attributes of @JmsListener), so it has to be done somewhere else: at the message listener container level, with DefaultJmsListenerContainerFactory#setSessionAcknowledgeMode.
Now, if you are using Spring Integration to configure the messaging middleware, you can configure the acknowledgment mode in Spring Integration.

Kafka Transaction in case multi Threading

I am trying to use a Kafka producer in a transaction, i.e. I want to write a group of messages, and if any one of them fails I want to roll back all of them.
kafkaProducer.beginTransaction();
try {
    // code to produce to kafka topic
    // commit only if every send succeeded
    kafkaProducer.commitTransaction();
} catch (Exception e) {
    // roll back every message sent in this transaction
    kafkaProducer.abortTransaction();
}
The problem is that this works fine for a single thread, but when multiple threads write, it throws an exception:
Invalid transaction attempted from state IN_TRANSITION to IN_TRANSITION
While debugging I found that if thread1's transaction is in progress and thread2 also calls beginTransaction, it throws this exception. What I can't find is how to solve this issue. One possibility I found is creating a pool of producers.
Is there an existing API for a Kafka producer pool, or will I have to create my own?
Below is the improvement JIRA already reported for this:
https://issues.apache.org/jira/browse/KAFKA-6278
Any other suggestions would be really helpful.
You can only have a single transaction in progress at a time with a producer instance.
If you have multiple threads doing separate processing and they all need exactly once semantics, you should have a producer instance per thread.
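The one-producer-per-thread advice can be sketched with nothing but the JDK. In the sketch below, PerThreadProducers, idPrefix and the factory function are illustrative names, not part of any Kafka API, and the type parameter P stands in for KafkaProducer<K, V>. It only shows the shape of the idea: each thread lazily gets its own instance, built with its own transactional id.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hands each thread its own producer instance, created lazily on first use.
 * P is a stand-in for KafkaProducer<K, V>; this class is hypothetical.
 */
class PerThreadProducers<P> {
    // Counter used to give every new producer a distinct transactional id
    private final AtomicInteger counter = new AtomicInteger();
    private final ThreadLocal<P> holder;

    /**
     * @param idPrefix prefix of the generated transactional ids (assumed naming scheme)
     * @param factory  builds a producer from a transactional id; with the real client
     *                 it would set the "transactional.id" property to that id
     */
    PerThreadProducers(String idPrefix, Function<String, P> factory) {
        this.holder = ThreadLocal.withInitial(
                () -> factory.apply(idPrefix + "-" + counter.getAndIncrement()));
    }

    /** Returns the calling thread's producer, creating it if this is the first call. */
    P get() {
        return holder.get();
    }
}
```

With the real client, the factory function would build a KafkaProducer whose transactional.id is the supplied id and call initTransactions() on it before returning it. Each thread would then call get() and run its own beginTransaction/commitTransaction cycle on the result.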
Not sure if this was resolved.
You can use Apache Commons Pool 2 to create a pool of producer instances.
In the create() method of the factory implementation, you can generate and assign a unique transactional.id to each producer to avoid conflicts (ProducerFencedException).
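A minimal sketch of such a pool, using a plain BlockingQueue from the JDK in place of Apache Commons Pool 2 so that it stays dependency-free. ProducerPool, borrow and release are hypothetical names, and the type parameter P stands in for KafkaProducer<K, V>. The point it illustrates is the one made above: every pooled instance is created with its own transactional id.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/**
 * Fixed-size pool of producers, each created with a unique transactional id.
 * P is a stand-in for KafkaProducer<K, V>; this class is hypothetical.
 */
class ProducerPool<P> {
    private final BlockingQueue<P> idle = new LinkedBlockingQueue<>();

    /**
     * @param size     number of producers to create up front
     * @param idPrefix prefix of the generated transactional ids (assumed naming scheme)
     * @param factory  builds a producer from a transactional id
     */
    ProducerPool(int size, String idPrefix, Function<String, P> factory) {
        for (int i = 0; i < size; i++) {
            idle.add(factory.apply(idPrefix + "-" + i));
        }
    }

    /** Blocks until a producer is free, then hands it to the caller exclusively. */
    P borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a producer", e);
        }
    }

    /** Returns a producer to the pool once its transaction has been committed or aborted. */
    void release(P producer) {
        idle.add(producer);
    }
}
```

A thread would call borrow(), run one transaction on the producer it receives, and call release() in a finally block, so that no two threads ever share a producer while a transaction is open.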

Produce called with an IAsyncSerializer value serializer configured but an ISerializer is required when using Avro Serializer

I am working with Kafka cluster and using Transactional Producer for atomic streaming (read-process-write).
// Init Transactions
_transactionalProducer.InitTransactions(DefaultTimeout);
// Begin the transaction
_transactionalProducer.BeginTransaction();
// produce message to one or many topics
var topic = Topics.MyTopic;
_transactionalProducer.Produce(topic, consumeResult.Message);
I am using AvroSerializer since I publish messages with Schema.
Produce throws an exception:
"System.InvalidOperationException: Produce called with an IAsyncSerializer value serializer configured but an ISerializer is required.\r\n at Confluent.Kafka.Producer`2.Produce(TopicPartition topicPartition, Message`2 message, Action`1 deliveryHandler)"
All the examples I've seen for a transactional producer use the Produce method rather than ProduceAsync, so I'm not sure I can simply switch to ProduceAsync and assume that transactional produce will function correctly. Correct me if I'm wrong, or help me find documentation.
Otherwise, I am not able to find an AvroSerializer that is not async, i.e. one that implements ISerializer.
public class AvroSerializer<T> : IAsyncSerializer<T>
I didn't realize that there is an AsSyncOverAsync method, which I can use when creating the serializer. It exists because the Kafka consumer is also still sync and not async.
For example:
new AvroSerializer<TValue>(schemaRegistryClient, serializerConfig).AsSyncOverAsync();
Here is Confluent documentation of that method.
//
// Summary:
// Create a sync serializer by wrapping an async one. For more information on the
// potential pitfalls in doing this, refer to Confluent.Kafka.SyncOverAsync.SyncOverAsyncSerializer`1.
public static ISerializer<T> AsSyncOverAsync<T>(this IAsyncSerializer<T> asyncSerializer);

On Partitions Assignment and ChainedKafkaTransactionManager at startup with JPA

I have many transactional consumers with a ChainedKafkaTransactionManager based on a JpaTransactionManager and a KafkaTransactionManager (all @KafkaListener's).
The JPA one needs a ThreadLocal variable to be set so that it knows which DB to connect to (it is the tenant id).
When starting the application, in the onPartitionsAssigned listener, spring-kafka tries to create a chained txn, hence tries to create a JPA txn, but there's no tenant set, so it fails.
That tenant is set through a http filter and/or kafka interceptors (through event headers).
I tried using the auto-wired KafkaListenerEndpointRegistry with setAutoStartup(false), but I see that the consumers don't receive any events, probably because they aren't initialized yet (I thought they were initialized on-demand).
If I set a mock tenant id and call registry.start() when the application is ready, the initializations seem to be done in other threads (probably because I'm using a ConcurrentKafkaListenerContainerFactory), so it doesn't work.
Is there a way to avoid the JPA transaction on that initial onPartitionsAssigned listener, that is part of the consumer initialization?
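The observation that a mock tenant set before registry.start() is not seen by the consumers is what one would expect if the tenant holder is a plain ThreadLocal, which is an assumption here since the question does not show it. A minimal JDK-only sketch, where TenantContext and TenantVisibilityDemo are hypothetical names:

```java
/** Hypothetical tenant holder backed by a plain ThreadLocal. */
class TenantContext {
    private static final ThreadLocal<String> TENANT = new ThreadLocal<>();

    static void set(String tenantId) {
        TENANT.set(tenantId);
    }

    /** Returns the tenant bound to the calling thread, or null if none was set on it. */
    static String get() {
        return TENANT.get();
    }

    static void clear() {
        TENANT.remove();
    }
}

class TenantVisibilityDemo {
    /**
     * Reads the tenant from a freshly started thread, the way a listener
     * container's consumer thread would, and returns what that thread saw.
     */
    static String readFromNewThread() {
        String[] seen = new String[1];
        Thread worker = new Thread(() -> seen[0] = TenantContext.get());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Calling TenantContext.set("mock-tenant") on the startup thread and then TenantVisibilityDemo.readFromNewThread() returns null: a value stored in a ThreadLocal is only visible to the thread that stored it, so it would have to be set on the consumer threads themselves.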
If your chained TM has the KafkaTM first, followed by JPA TM (which would be the normal case), you can achieve similar functionality by just injecting the Kafka TM into the container and using @Transactional (with just the JPA TM on the listener) to start the JPA transaction when the listener is called.
The time between the transaction commits will be marginally increased but it would provide similar functionality.
If that won't work for you, open a GitHub issue; we can either disable the initial commit on assignment, or do it without a transaction at all (optionally).

Multiple StreamListeners with Spring Cloud Stream connected to Kafka

In a Spring Boot app using Spring Cloud Stream connecting to Kafka, I'm trying to set up two separate stream listener methods:
One reads from topics "t1" and "t2" as KTables, re-partitioning using a different key in one, then joining to data from the other
The other reads from an unrelated topic, "t3", as a KStream.
Because the first listener does some joining and aggregating, some topics are created automatically, e.g. "test-1-KTABLE-AGGREGATE-STATE-STORE-0000000007-repartition-0". (Not sure if this is related to the problem or not.)
When I set up the code by having two separate methods annotated with @StreamListener, I get the error below when the Spring Boot app starts:
Exception in thread "test-d44cb424-7575-4f5f-b148-afad034c93f4-StreamThread-2" java.lang.IllegalArgumentException: Assigned partition t1-0 for non-subscribed topic regex pattern; subscription pattern is t3
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignFromSubscribed(SubscriptionState.java:195)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:225)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:367)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:316)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:295)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1146)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1111)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:848)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:805)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:771)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:741)
I think the important part is: "Assigned partition t1-0 for non-subscribed topic regex pattern; subscription pattern is t3". These are the two unrelated topics, so as far as I can see nothing related to t3 should be subscribing to anything related to t1. The exact topic which causes the problem also changes intermittently: sometimes it's one of the automatically generated topics which is mentioned, rather than t1 itself.
Here is how the two stream listeners are set up (in Kotlin):
@StreamListener
fun listenerForT1AndT2(
        @Input("t1") t1KTable: KTable<String, T1Obj>,
        @Input("t2") t2KTable: KTable<String, T2Obj>) {
    t2KTable
        .groupBy(...)
        .aggregate(
            { ... },
            { ... },
            { ... },
            Materialized.with(Serdes.String(), JsonSerde(SomeObj::class.java)))
        .join(t1KTable,
            { ... },
            Materialized.`as`<String, SomeObj, KeyValueStore<Bytes, ByteArray>>("test")
                .withKeySerde(Serdes.String())
                .withValueSerde(JsonSerde(SomeObj::class.java)))
}
@StreamListener
fun listenerForT3(@Input("t3") t3KStream: KStream<String, T3Obj>) {
    t3KStream.map { ... }
}
However, when I set up my code with just one method annotated with @StreamListener that takes parameters for all three topics, everything works fine, e.g.
@StreamListener
fun compositeListener(
        @Input("t1") t1KTable: KTable<String, T1Obj>,
        @Input("t2") t2KTable: KTable<String, T2Obj>,
        @Input("t3") t3KStream: KStream<String, T3Obj>) {
    ...
}
But I don't think it's right that I can only have one @StreamListener method.
I know that there is content-based routing for adding conditions to the StreamListener annotation, but if the methods define the input channels then I'm not sure whether I need to use it here. I'd have thought the @Input annotations on the method parameters would be enough to tell the system which channels (and therefore which Kafka topics) to bind to. If I do need to use content-based routing, how can I apply it here so that each method receives only the items from the relevant topic(s)?
I've also tried separating the two listener methods into two separate classes, each of which has @EnableBinding for only the interface it's interested in (i.e. one interface for t1 and t2, and a separate interface for t3), but that doesn't help.
Everything else I've found related to this error message, e.g. here, is about having multiple app instances, but in my case there's only one Spring Boot app instance.
You need a separate application id for each StreamListener method. Here is an example:
spring.cloud.stream.kafka.streams.bindings.t1.consumer.application-id=processor1-application-id
spring.cloud.stream.kafka.streams.bindings.t2.consumer.application-id=processor1-application-id
spring.cloud.stream.kafka.streams.bindings.t3.consumer.application-id=processor2-application-id
You probably want to test with the latest snapshot (2.1.0) as there were some recent changes with the way application id is processed by the binder.
Please see the update here for more details.
Here is a working sample of multiple StreamListener methods which are Kafka Streams processors.