Stop processing Kafka messages if something goes wrong during processing - apache-kafka

In my Processor API implementation I store the messages in a key-value store, and every 100 messages I make a POST request. If something fails while trying to send the messages (the API is not responding, etc.) I want to stop processing messages until there is evidence that the API calls work again.
Here is my code:
public class BulkProcessor implements Processor<byte[], UserEvent> {

    private static final Logger logger = LoggerFactory.getLogger(BulkProcessor.class);

    private KeyValueStore<Integer, ArrayList<UserEvent>> keyValueStore;
    private BulkAPIClient bulkClient;
    private String storeName;
    private ProcessorContext context;
    private int count;

    @Autowired
    public BulkProcessor(String storeName, BulkAPIClient bulkClient) {
        this.storeName = storeName;
        this.bulkClient = bulkClient;
    }

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        keyValueStore = (KeyValueStore<Integer, ArrayList<UserEvent>>) context.getStateStore(storeName);
        count = 0;
        // check every 15 minutes whether there are any leftovers in the store that have not been sent yet
        this.context.schedule(Duration.ofMinutes(15), PunctuationType.WALL_CLOCK_TIME, (timestamp) -> {
            if (count > 0) {
                sendEntriesFromStore();
            }
        });
    }

    @Override
    public void process(byte[] key, UserEvent value) {
        int userGroupId = Integer.valueOf(value.getUserGroupId());
        ArrayList<UserEvent> userEventArrayList = keyValueStore.get(userGroupId);
        if (userEventArrayList == null) {
            userEventArrayList = new ArrayList<>();
        }
        userEventArrayList.add(value);
        keyValueStore.put(userGroupId, userEventArrayList);
        count++;
        if (count == 100) {
            sendEntriesFromStore();
        }
    }

    private void sendEntriesFromStore() {
        KeyValueIterator<Integer, ArrayList<UserEvent>> iterator = keyValueStore.all();
        while (iterator.hasNext()) {
            KeyValue<Integer, ArrayList<UserEvent>> entry = iterator.next();
            BulkRequest bulkRequest = new BulkRequest(entry.key, entry.value);
            if (bulkRequest.getLocation() != null) {
                URI url = bulkClient.buildURIPath(bulkRequest);
                try {
                    bulkClient.postRequestBulkApi(url, bulkRequest);
                    keyValueStore.delete(entry.key);
                } catch (BulkApiException e) {
                    logger.warn(e.getMessage(), e.fillInStackTrace());
                }
            }
        }
        iterator.close();
        count = 0;
    }

    @Override
    public void close() {
    }
}
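(For context, a minimal sketch of how a processor like this is typically wired into a Processor API topology together with its state store; this is not part of the original post, and the store name, input topic and userEventListSerde serde are assumptions.)

StoreBuilder<KeyValueStore<Integer, ArrayList<UserEvent>>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("bulk-store"),   // assumed store name
                Serdes.Integer(),
                userEventListSerde);                            // assumed serde for ArrayList<UserEvent>

Topology topology = new Topology();
topology.addSource("source", "user-events");                    // assumed input topic
topology.addProcessor("bulk-processor", () -> new BulkProcessor("bulk-store", bulkClient), "source");
topology.addStateStore(storeBuilder, "bulk-processor");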
Currently, if a call to the API fails, my code will keep consuming the next 100 messages (and this keeps happening as long as the calls fail) and adding them to the keyValueStore. I don't want this to happen. Instead I would prefer to stop the stream and only continue once the keyValueStore is emptied. Is that possible?
Could I throw a StreamsException?
try {
    bulkClient.postRequestBulkApi(url, bulkRequest);
    keyValueStore.delete(entry.key);
} catch (BulkApiException e) {
    throw new StreamsException(e);
}
Would that kill my streams app and make the process die?

You should only delete the record from the state store after you have made sure the record was successfully processed by the API, so remove the first keyValueStore.delete(entry.key); and keep the second one. If not, you can potentially lose some messages when the keyValueStore.delete is committed to the underlying changelog topic while your messages have not been successfully processed yet, so you would only get an at-most-once guarantee.
Just wrap the API-calling code in an infinite loop and keep trying until the record is successfully processed; your processor will not consume new messages from the upstream processor node because it runs on the same StreamThread:
private void sendEntriesFromStore() {
    KeyValueIterator<Integer, ArrayList<UserEvent>> iterator = keyValueStore.all();
    while (iterator.hasNext()) {
        KeyValue<Integer, ArrayList<UserEvent>> entry = iterator.next();
        // remove the first state store delete here: keyValueStore.delete(entry.key);
        BulkRequest bulkRequest = new BulkRequest(entry.key, entry.value);
        if (bulkRequest.getLocation() != null) {
            URI url = bulkClient.buildURIPath(bulkRequest);
            while (true) {
                try {
                    bulkClient.postRequestBulkApi(url, bulkRequest);
                    keyValueStore.delete(entry.key); // only delete after the message has been successfully processed, to get an at-least-once guarantee
                    break;
                } catch (BulkApiException e) {
                    logger.warn(e.getMessage(), e.fillInStackTrace());
                }
            }
        }
    }
    iterator.close();
    count = 0;
}
Yes, you could throw a StreamsException; the StreamTask will be migrated to another StreamThread during rebalancing, possibly on the same application instance. If the API keeps throwing exceptions until all StreamThreads have died, your application will not exit automatically and you will see the exception below. You should add a custom handler that exits your app once all stream threads have died, either via KafkaStreams#setUncaughtExceptionHandler or by listening for the stream state change (to the ERROR state):
All stream threads have died. The instance will be in error state and should be closed.
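A minimal sketch of both options (my own illustration, not code from the answer; the topology, props and logger variables are assumed):

KafkaStreams streams = new KafkaStreams(topology, props);

// Option 1: log (and decide whether to shut down) when a stream thread dies with an uncaught exception
streams.setUncaughtExceptionHandler((thread, throwable) ->
        logger.error("Stream thread " + thread.getName() + " died", throwable));

// Option 2: exit the JVM once the client transitions into the ERROR state,
// so an orchestrator such as k8s can restart the instance
streams.setStateListener((newState, oldState) -> {
    if (newState == KafkaStreams.State.ERROR) {
        logger.error("All stream threads have died, exiting");
        System.exit(1);
    }
});

streams.start();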

In the end I used a plain KafkaConsumer instead of Kafka Streams, but the bottom line is that I changed BulkApiException to extend RuntimeException, and I rethrow it after logging it. So now it looks as follows:
} catch (BulkApiException bae) {
    logger.error(bae.getMessage(), bae.fillInStackTrace());
    throw bae;
} finally {
    consumer.close();
    int exitCode = SpringApplication.exit(ctx, () -> 1);
    System.exit(exitCode);
}
This way the application exits and k8s restarts the pod. The reasoning is that if the API I'm forwarding the requests to is down, there is no point in continuing to read messages. So until the other API is back up, k8s will keep restarting the pod.
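For completeness, a rough sketch of the poll loop this catch/finally could sit in (my reconstruction; the topic name and the process() helper are assumptions, not the original code):

try {
    consumer.subscribe(Collections.singletonList("user-events"));   // assumed topic
    while (true) {
        ConsumerRecords<byte[], UserEvent> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<byte[], UserEvent> record : records) {
            process(record.value());   // hypothetical helper: buffers events and POSTs in bulk, throws BulkApiException on failure
        }
    }
} catch (BulkApiException bae) {
    logger.error(bae.getMessage(), bae);
    throw bae;
} finally {
    consumer.close();
    int exitCode = SpringApplication.exit(ctx, () -> 1);
    System.exit(exitCode);
}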

Related

How can I reconnect a Photon Bolt client after it disconnects?

I'm trying to make a Photon Bolt game that connects two devices. The problem is that the Client tends to get disconnected a lot, and it doesn't reconnect automatically. I've tried using methods like ReconnectAndRejoin, but it seems like it only works in PUN. Right now I'm using this custom solution, without success:
[BoltGlobalBehaviour(BoltNetworkModes.Client)]
public class InitialiseGameClient : Photon.Bolt.GlobalEventListener
{
private bool disconnected;
public void Update(){
if(disconnected){
Reconnect();
}
}
public override void Disconnected(BoltConnection connection)
{
disconnected = true;
}
public void Reconnect(){
BoltLauncher.StartClient();
PlayerPrefs.DeleteAll();
if (BoltNetwork.IsRunning && BoltNetwork.IsClient)
{
foreach (var session in BoltNetwork.SessionList)
{
UdpSession udpSession = session.Value as UdpSession;
if (udpSession.Source != UdpSessionSource.Photon)
continue;
PhotonSession photonSession = udpSession as PhotonSession;
string sessionDescription = String.Format("{0} / {1} ({2})",
photonSession.Source, photonSession.HostName, photonSession.Id);
RoomProtocolToken token = photonSession.GetProtocolToken() as RoomProtocolToken;
if (token != null)
{
sessionDescription += String.Format(" :: {0}", token.ArbitraryData);
}
else
{
object value_t = -1;
object value_m = -1;
if (photonSession.Properties.ContainsKey("t"))
{
value_t = photonSession.Properties["t"];
}
if (photonSession.Properties.ContainsKey("m"))
{
value_m = photonSession.Properties["m"];
}
sessionDescription += String.Format(" :: {0}/{1}", value_t, value_m);
}
ServerConnectToken connectToken = new ServerConnectToken
{
data = "ConnectTokenData"
};
Debug.Log((int)photonSession.Properties["t"]);
var propertyID = PlayerPrefs.GetInt("PropertyID", 2);
if((int)photonSession.Properties["t"] == propertyID){
BoltMatchmaking.JoinSession(photonSession, connectToken);
disconnected = false;
}
}
}
}
}
With this method I'm trying to reuse the same code used to connect the client for the first time inside the Reconnect function, and keep trying until the client manages to connect. However, the code never seems to execute, even though the Disconnected callback gets triggered (the reconnect doesn't). Is there any Bolt built-in function that helps with reconnecting? Thanks in advance.
You need to shut Bolt down first, then try reconnecting. Even if you don't get the exception below (it's just an example), you should call BoltLauncher.Shutdown() and then BoltLauncher.StartClient() etc.
BoltException: Bolt is already running, you must call BoltLauncher.Shutdown() before starting a new instance of Bolt.

Two concurrent requests were able to lock the same row in PostgreSQL

When two concurrent requests were made to the code below, both requests were able to acquire the lock simultaneously, and hence both were able to execute the block of code.
Sample code (running in production) for reference:
//Starting point for the request
@Override
public void receiveTransferItems(String argument1, String referenceId, List<Item> items, long messageId)
throws Exception {
ParentDTO parent = DAO.lockByReferenceId(referenceId);
if (parent == null) {
throw new Exception(referenceId + " does not exist");
}
updateData(parent);
for (Item item : items) {
receiveItem(parent, parent.getWarehouseId(), item.getItemSKU(), item.getItemStatus(), item.getQtyReceived(), messageId);
}
}
private void updateData(ParentDTO td) throws DropShipException {
//perform some logical processing and then execute update
DAO.update(td);
}
private void receiveItem(ParentDTO td, String warehouseId, String asin, String itemStatus, int quantity, long messageId)
throws Exception {
/**
* perform some logical processing
*
**/
//call is being made to another class to do the rest of the processing
service.receive(td, asin, quantity, condition, container, messageId);
}
@Override
public void receive(
ParentDTO parentDTO,
String asin,
int quantity,
Condition condition,
Container container,
long messageId,
DataAccessor accessor) throws Exception {
List<ChildDTO> childDTOs =
DAO.lockChildDTOItems(parentDTO.getReferenceId(), asin, condition,
CostInfoSource.MANIFEST);
List<ChildDTO> filterItems = DAO
.loadChildDTOItems(parentDTO.getReferenceId(), asin, condition.name());
long totalExpectedQuantity = getTotalExpectedQuantity(filterItems);
long totalReceivedQuantity = getTotalReceivedQuantity(filterItems);
int quantityNormalReceived = 0;
for (ChildDTO tdi : childDTOs) {
int quantityReceived = 0;
if (asinDropShipMsgAction != null) {
quantity -= asinDropShipMsgAction.getInitialQuantity();
quantityNormalReceived += asinDropShipMsgAction.getInitialQuantity();
} else {
quantityReceived = new DBOperationRunner<Integer>(accessor.getSessionManager()) {
@Override
protected Integer doWorkAndReturn() throws Exception {
return normalReceive(tdi, quantityLeft, container, MessageActionType.TS_IN, messageId);
}
}.execute();
}
}
}
private int normalReceive(final ChildDTO childDTO,
int quantity,
final Container container,
final MessageActionType type,
long messageId)
throws Exception {
/**
* perform some business logic
*
* */
DAO.update(childDTO);
return someQuantity;
}
Implementation of the lockByReferenceId function:
@Override
public ParentDTO lockByReferenceId(String referenceId) {
Criteria criteria = getCurrentSession().createCriteria(ParentDTO.class)
.add(Restrictions.eq("referenceId", referenceId)).setLockMode(LockMode.UPGRADE_NOWAIT);
return (ParentDTO) criteria.uniqueResult();
}
Implementation of the DBOperationRunner class:
public T execute() throws Exception {
T t = null;
Session originalSession = (Session) ThreadLocalContext.get(ThreadLocalContext.CURRENT_SESSION);
try {
ThreadLocalContext.put(ThreadLocalContext.CURRENT_SESSION, sessionManager.getCurrentSession());
sessionManager.beginTransaction();
t = doWorkAndReturn();
sessionManager.commit();
} catch (Exception e) {
try {
sessionManager.rollback();
} catch (Throwable t1) {
logger.error("failed to rollback", t1);
}
throw e;
} finally {
ThreadLocalContext.put(ThreadLocalContext.CURRENT_SESSION, originalSession);
}
return t;
}
Recently I observed an issue in the production code in which two or more simultaneous requests were able to acquire a lock on the same data at the same time.
I am using Hibernate with the Criteria API as the DB framework, c3p0 as the connection pooling framework, and Postgres as the DB.
Note: this issue is intermittent and only observed for some random concurrent requests, which makes it hard to debug.
I am unable to understand how two concurrent requests were able to lock the same rows simultaneously. Can you please help me identify what is going wrong in this case?
Thanks in advance!!!!

@KafkaListener: behavior and tracking processing of events

We are using spring-kafka 2.3.0 in our app. We have observed some processing glitches in the scenarios below with the following setup:
@Service
@EnableScheduling
public class KafkaService {
public void sendToKafkaProducer(String data) {
kafkaTemplate.send(configuration.getProducer().getTopicName(), data);
}
@KafkaListener(id = "consumer_grpA_id",
topics = "#{__listener.getEnvironmentConfiguration().getConsumer().getTopicName()}", groupId = "consumer_grpA", autoStartup = "false")
public void onMessage(ConsumerRecord<String, String> data) throws Exception {
passA(data);
}
private void passB(String message) {
//counter to keep track of retry attempts
if (counter.containsKey(message.getEventID())) {
//RETRY_COUNT = 5
if (counter.get(message.getEventID()) < RETRY_COUNT) {
retryAgain(message);
}
} else {
firstRetryPass(message);
}
}
private void retryAgain(String message) {
counter.put(message.getEventID(), counter.get(message.getEventID()) + 1);
try {
registry.stop(); //pause the listener
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void firstRetryPass(String message) {
// First Time Entry for count and time
counter.put(message.getEventID(), 1);
try {
registry.stop();//pause the listener
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private void passA(String message) {
try {
passToTarget(message); //Call target processor
LOGGER.info("Message Processed Successfully to the target");
} catch (Exception e) {
targetUnavailable= true;
passB(message);
}
}
private void passToTarget(String message){
//processor logic, if the target is not available, retry after 15 mins, call passB func
}
@Scheduled(cron = "0 0/15 * 1/1 * ?")
public void scheduledMethod() {
try {
if (targetUnavailable) {
registry.start();
firstTimeStart = false;
}
LOGGER.info(">>>Scheduler Running ?>>>" + registry.isRunning());
} catch (Exception e) {
LOGGER.error(e.getMessage());
}
}
}
After a gap in processing, the consumer doesn't pick up the first message it receives; the subsequent messages are processed.
As we don't have direct access to the Kafka topics, we aren't able to identify which events didn't get picked up by the consumer.
How do we track the events that are not picked up, and why does this happen?
We also configured a scheduler whose job is to keep the Kafka listener registry running. So is this scheduler required when we already have a listener configured?
What are the memory and CPU utilization implications if we keep the listener running? That was one of the reasons we used the Kafka registry to stop the listener explicitly whenever the target is down, so we need to validate whether this approach is sustainable. My hunch is that this goes against the basic working of the listener, as its main job is to keep listening for new events irrespective of the target's status.
You shouldn't stop the registry on the listener thread unless you use stop(Runnable) - otherwise there will be a deadlock and a delay since the container waits for the listener to exit.
Stopping the container (via the registry) won't actually take effect until any remaining records fetched by the last poll have been processed (unless you set max.poll.records=1).
When the listener exits normally, the record's offset will be committed so that record will not be redelivered on the next start.
You can use the ContainerStoppingErrorHandler for this use case (see the spring-kafka reference documentation).
Throw an exception and the error handler will stop the container for you.
But that will stop the container on the first try.
If you want retries, use a SeekToCurrentErrorHandler and call the ContainerStoppingErrorHandler from the recoverer after retries are exhausted.
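As a rough illustration (my own sketch, not code from the answer), wiring the ContainerStoppingErrorHandler into the listener container factory could look like this; the factory bean and consumerFactory are assumptions:

@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
        ConsumerFactory<String, String> consumerFactory) {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(consumerFactory);
    // any exception thrown from the @KafkaListener method will now stop the container
    factory.setErrorHandler(new ContainerStoppingErrorHandler());
    return factory;
}

For the retry variant, a SeekToCurrentErrorHandler (with a back-off and a recoverer) would be configured on the factory instead, as described above.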

Netty issue when writeAndFlush is called from a different ChannelInboundHandlerAdapter.channelRead

I've got an issue, for which I am unable to post full code (sorry), due to security reasons. The gist of my issue is that I have a ServerBootstrap, created as follows:
bossGroup = new NioEventLoopGroup();
workerGroup = new NioEventLoopGroup();
final ServerBootstrap b = new ServerBootstrap();
b.group(bossGroup, workerGroup)
.channel(NioServerSocketChannel.class)
.childHandler(new ChannelInitializer<SocketChannel>() {
@Override
public void initChannel(SocketChannel ch) throws Exception {
ch.pipeline().addFirst("idleStateHandler", new IdleStateHandler(0, 0, 3000));
//Adds the MQTT encoder and decoder
ch.pipeline().addLast("decoder", new MyMessageDecoder());
ch.pipeline().addLast("encoder", new MyMessageEncoder());
ch.pipeline().addLast(createMyHandler());
}
}).option(ChannelOption.SO_BACKLOG, 128).option(ChannelOption.SO_REUSEADDR, true)
.option(ChannelOption.TCP_NODELAY, true)
.childOption(ChannelOption.SO_KEEPALIVE, true);
// Bind and start to accept incoming connections.
channelFuture = b.bind(listenAddress, listenPort);
Here createMyHandler() basically returns an extended implementation of ChannelInboundHandlerAdapter.
I also have a "client" listener, that listens for incoming connection requests, and is loaded as follows:
final String host = getHost();
final int port = getPort();
nioEventLoopGroup = new NioEventLoopGroup();
bootStrap = new Bootstrap();
bootStrap.group(nioEventLoopGroup);
bootStrap.channel(NioSocketChannel.class);
bootStrap.option(ChannelOption.SO_KEEPALIVE, true);
bootStrap.handler(new ChannelInitializer<SocketChannel>() {
@Override
public void initChannel(SocketChannel ch) throws Exception {
ch.pipeline().addFirst("idleStateHandler", new IdleStateHandler(0, 0, getKeepAliveInterval()));
ch.pipeline().addAfter("idleStateHandler", "idleEventHandler", new MoquetteIdleTimeoutHandler());
ch.pipeline().addLast("decoder", new MyMessageDecoder());
ch.pipeline().addLast("encoder", new MyMessageEncoder());
ch.pipeline().addLast(MyClientHandler.this);
}
})
.option(ChannelOption.SO_REUSEADDR, true)
.option(ChannelOption.TCP_NODELAY, true);
// Start the client.
try {
channelFuture = bootStrap.connect(host, port).sync();
} catch (InterruptedException e) {
throw new MyException("Exception", e);
}
Where MyClientHandler is again a subclassed instance of ChannelInboundHandlerAdapter. Everything works fine: I get messages coming in from the "server" adapter, I process them, and send them back on the same context. And vice-versa for the "client" handler.
The problem happens when I have to (for some messages) proxy them from the server or client handler to the other connection. Again, I am very sorry for not being able to post much code, but the gist of it is that I'm calling this from the server handler:
serverHandler.channelRead(ChannelHandlerContext ctx, Object msg) {
if (msg instanceof myProxyingMessage) {
if (ctx.channel().isActive()) {
ctx.channel().writeAndFlush(someOtherMessage);
**getClientHandler().writeAndFlush(myProxyingMessage);**
}
}
}
Now here's the problem: the bolded (client) writeAndFlush never actually writes the message bytes, and it doesn't throw any errors. The ChannelFuture returns all false (success, cancelled, done), and if I sync on it, it eventually times out for other reasons (a connection timeout set within my code).
I know I haven't posted all of my code, but I'm hoping that someone has some tips and/or pointers for how to isolate WHY it is not writing to the client context. I'm not a Netty expert by any stretch, and most of this code was written by someone else. Both handlers subclass ChannelInboundHandlerAdapter.
Feel free to ask any questions if you have any.
*****EDIT*********
I tried to proxy the request back to a DIFFERENT context/channel (i.e., the client channel) using the following test code:
public void proxyPubRec(int messageId) throws MQTTException {
logger.log(logLevel, "proxying PUBREC to context: " + debugContext());
PubRecMessage pubRecMessage = new PubRecMessage();
pubRecMessage.setMessageID(messageId);
pubRecMessage.setRemainingLength(2);
logger.log(logLevel, "pipeline writable flag: " + ctx.pipeline().channel().isWritable());
MyMQTTEncoder encoder = new MyMQTTEncoder();
ByteBuf buff = null;
try {
buff = encoder.encode(pubRecMessage);
ctx.channel().writeAndFlush(buff);
} catch (Throwable t) {
logger.log(Level.SEVERE, "unable to encode PUBREC");
} finally {
if (buff != null) {
buff.release();
}
}
}
public class MyMQTTEncoder extends MQTTEncoder {
public ByteBuf encode(AbstractMessage msg) {
PooledByteBufAllocator allocator = new PooledByteBufAllocator();
ByteBuf buf = allocator.buffer();
try {
super.encode(ctx, msg, buf);
} catch (Throwable t) {
logger.log(Level.SEVERE, "unable to encode PUBREC, " + t.getMessage());
}
return buf;
}
}
But the line above, ctx.channel().writeAndFlush(buff), is NOT writing to the other channel. Any tips/tricks for debugging this sort of issue?
someOtherMessage has to be a ByteBuf.
So, take this:
serverHandler.channelRead(ChannelHandlerContext ctx, Object msg) {
if (msg instanceof myProxyingMessage) {
if (ctx.channel().isActive()) {
ctx.channel().writeAndFlush(someOtherMessage);
**getClientHandler().writeAndFlush(myProxyingMessage);**
}
}
}
... and replace it with this:
serverHandler.channelRead(ChannelHandlerContext ctx, Object msg) {
if (msg instanceof myProxyingMessage) {
if (ctx.channel().isActive()) {
ctx.channel().writeAndFlush(ByteBuf);
**getClientHandler().writeAndFlush(myProxyingMessage);**
}
}
}
Actually, this turned out to be a threading issue. One of my threads was blocked/waiting while other threads were writing to the context and because of this, the writes were buffered and not sent, even with a flush. Problem solved!
Essentially, I put the first message's code in a Runnable/Executor thread, which allowed it to run separately so that the second write/response was able to write to the context. There are still potentially some issues with this (in terms of message ordering), but that is off topic for the original question. Thanks for all your help!
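A minimal sketch of that workaround (my reconstruction, not the asker's actual code; the executor field and the handler/message names are assumptions):

private final ExecutorService proxyExecutor = Executors.newSingleThreadExecutor();

@Override
public void channelRead(ChannelHandlerContext ctx, Object msg) {
    if (msg instanceof MyProxyingMessage && ctx.channel().isActive()) {
        // hand the first write off to a separate thread so this handler thread is not blocked
        proxyExecutor.execute(() -> ctx.channel().writeAndFlush(someOtherMessage));
        // the proxied write to the other (client) channel can now proceed
        getClientHandler().writeAndFlush((MyProxyingMessage) msg);
    }
}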

rxjava: queue scheduler with default idle job

I have a client-server application and I'm using rxjava to do server requests from the client. The client should only do one request at a time, so I intend to use a thread-queue scheduler similar to the trampoline scheduler.
Now I'm trying to implement a mechanism to watch for changes on the server. Therefore I send a long-living request that blocks until the server has some changes and sends back the result (long poll).
This long-poll request should only run when the job queue is idle. I'm looking for a way to automatically stop the watch request when a regular request is scheduled, and start it again when the queue becomes empty. I thought about modifying the trampoline scheduler to get this behavior, but I have the feeling that this is a common problem and there might be an easier solution?
You can hold onto the Subscription returned by scheduling the long poll task, unsubscribe it if the queue becomes non-empty and re-schedule if the queue becomes empty.
Edit: here is an example with the basic ExecutorScheduler:
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;
public class IdleScheduling {
static final class TaskQueue {
final ExecutorService executor;
final AtomicReference<Future<?>> idleFuture;
final Runnable idleRunnable;
final AtomicInteger wip;
public TaskQueue(Runnable idleRunnable) {
this.executor = Executors.newFixedThreadPool(1);
this.idleRunnable = idleRunnable;
this.idleFuture = new AtomicReference<>();
this.wip = new AtomicInteger();
this.idleFuture.set(executor.submit(idleRunnable));
}
public void shutdownNow() {
executor.shutdownNow();
}
public Future<?> enqueue(Runnable task) {
if (wip.getAndIncrement() == 0) {
idleFuture.get().cancel(true);
}
return executor.submit(() -> {
task.run();
if (wip.decrementAndGet() == 0) {
startIdle();
}
});
}
void startIdle() {
idleFuture.set(executor.submit(idleRunnable));
}
}
public static void main(String[] args) throws Exception {
TaskQueue tq = new TaskQueue(() -> {
while (!Thread.currentThread().isInterrupted()) {
try {
Thread.sleep(1000);
} catch (InterruptedException ex) {
System.out.println("Idle interrupted...");
return;
}
System.out.println("Idle...");
}
});
try {
Thread.sleep(1500);
tq.enqueue(() -> System.out.println("Work 1"));
Thread.sleep(500);
tq.enqueue(() -> {
System.out.println("Work 2");
try {
Thread.sleep(500);
} catch (InterruptedException ex) {
}
});
tq.enqueue(() -> System.out.println("Work 3"));
Thread.sleep(1500);
} finally {
tq.shutdownNow();
}
}
}