Spring websockets across Wildfly 10.1 cluster - wildfly

Using Wildfly 10.1 standalone-full-ha.xml with the included ActiveMQ Artemis 1.1.0. Enabled STOMP by adding this to the activemq config:
<acceptor name="stomp-acceptor" factory-class="org.apache.activemq.artemis.core.remoting.impl.netty.NettyAcceptorFactory">
<param name="protocols" value="STOMP"/>
<param name="port" value="${stomp.port:61613}"/>
</acceptor>
Deployed a sample spring websockets war based on https://spring.io/guides/gs/messaging-stomp-websocket/
I have this running on 2 separate servers which are able to form a cluster. I am able to connect to both servers and send messages across, however when I disconnect one of the websockets, the following exception is thrown on the other server:
16:49:02,377 ERROR [org.apache.activemq.artemis.core.server] (Thread-7 (ActiveMQ-client-global-threads-926891052)) AMQ224037: cluster connection Failed to handle message: java.lang.IllegalStateException: Cannot find binding for ffaa538e-77c3-11e7-ba4b-7b9cb7ee40e2b7e6493a-77c3-11e7-ba4b-7b9cb7ee40e2
at org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.doConsumerClosed(ClusterConnectionImpl.java:1319)
at org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.handleNotificationMessage(ClusterConnectionImpl.java:1005)
at org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.onMessage(ClusterConnectionImpl.java:974)
at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1018)
at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:48)
at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1145)
at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:103)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I have also tested this with Wildfly 11 Alpha and the included ActiveMQ Artemis 1.5.3 but I get the same error every time I disconnect a websocket.
I also get the following errors when I shutdown a server, but I am less concerned about these since they only happen during shutdown:
17:11:03,747 ERROR [org.apache.activemq.artemis.core.server] (Thread-27 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$2#5669232b-14602972)) AMQ224051: Failed to call notification listener: java.lang.IllegalStateException: No queue 0110e86e-77c4-11e7-a745-252881f696b4
at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.onNotification(PostOfficeImpl.java:387)
at org.apache.activemq.artemis.core.server.management.impl.ManagementServiceImpl.sendNotification(ManagementServiceImpl.java:580)
at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.close(ServerConsumerImpl.java:437)
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doClose(ServerSessionImpl.java:354)
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl$1.done(ServerSessionImpl.java:1191)
at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl.executeOnCompletion(OperationContextImpl.java:161)
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.close(ServerSessionImpl.java:1185)
at org.apache.activemq.artemis.core.protocol.stomp.StompProtocolManager$1.run(StompProtocolManager.java:276)
at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:103)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17:11:03,784 ERROR [io.netty.util.concurrent.DefaultPromise.rejectedExecution] (globalEventExecutor-1-2) Failed to submit a listener notification task. Event loop shut down?: java.util.concurrent.RejectedExecutionException: event executor terminated
at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821)
at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327)
at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320)
at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746)
at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:760)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:428)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1058)
at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:686)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$700(AbstractChannel.java:419)
at io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:646)
at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:233)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
17:11:03,797 WARN [io.netty.channel.AbstractChannel] (globalEventExecutor-1-2) Can't invoke task later as EventLoop rejected it: java.util.concurrent.RejectedExecutionException: event executor terminated
at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821)
at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327)
at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320)
at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746)
at io.netty.channel.AbstractChannel$AbstractUnsafe.invokeLater(AbstractChannel.java:931)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$900(AbstractChannel.java:419)
at io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:649)
at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:233)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Is it an issue with the spring websocket example?
Do I need to change something in the default standalone-full-ha.xml to get this to work properly across a cluster, without throwing exceptions on client disconnect?

The version of Artemis in Wildfly is a bit behind the latest upstream (1.5.3 vs 2.2.0). Have you tried this with a cluster of standalone Artemis 2.2.0 nodes?

Related

Why is Cassandra crashing whenever I try to run DataStax Kafka Connector?

Goal: My goal is to use Kafka to send messages to a Cassandra sink using Kafka Connect.
I've deployed Kafka and Cassandra and I am able to work with each of them individually - I have no problem sending data to Kafka, using producers to pass messages, and using consumers to consume them. I have no problem using cqlsh to create tables and insert data into them. However, whenever I try to deploy the DataStax Apache Kafka Connector, Cassandra seems to crash.
I am trying to learn how to use Kafka Connect using just one Kafka producer, broker, and one Cassandra keyspace using the standalone mode. I've configured both connect-standalone.properties and the cassandra-sink-standalone.properties following the instructions shown on DataStax: https://docs.datastax.com/en/kafka/doc/kafka/kafkaStringJson.html
connect-standalone.properties
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
plugin.path= *install_location*/kafka-connect-cassandra-sink-1.4.0.jar
cassandra-sink-standalone.properties
name=stocks-sink
connector.class=com.datastax.kafkaconnector.DseSinkConnector
tasks.max=1
topics=stocks_topic
topic.stocks_topic.stocks_keyspace.stocks_table.mapping = symbol=value.symbol, ts=value.ts, exchange=value.exchange, industry=value.industry, name=key, value=value.value
Then, the Kafka Connector is started using bin/connect-standalone.sh connect-standalone.properties cassandra-sink-standalone.properties.
About 95% of the time I attempt to launch Kafka Connector, Cassandra crashes. Running bin/nodetool status shows the message:
nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'
In the system.log and debug.log logs, there is no indication that Cassandra has even crashed. The last line just remains as:
INFO [main] 2023-01-31 00:00:00,143 StorageService.java:2806 - Node localhost/127.0.0.1:7000 state jump to NORMAL
And in the Kafka Connect logs, the error messages states:
[2023-01-31 15:24:47,803] INFO [plc-sink|task-0] DataStax Java driver for Apache Cassandra(R) (com.datastax.oss:java-driver-core) version 4.6.0 (com.datastax.oss.driver.internal.core.DefaultMavenCoordinates:37)
[2023-01-31 15:24:47,947] INFO [plc-sink|task-0] Could not register Graph extensions; this is normal if Tinkerpop was explicitly excluded from classpath (com.datastax.oss.driver.internal.core.context.InternalDriverContext:540)
[2023-01-31 15:24:47,948] INFO [plc-sink|task-0] Could not register Reactive extensions; this is normal if Reactive Streams was explicitly excluded from classpath (com.datastax.oss.driver.internal.core.context.InternalDriverContext:559)
[2023-01-31 15:24:47,997] INFO [plc-sink|task-0] Using native clock for microsecond precision (com.datastax.oss.driver.internal.core.time.Clock:40)
[2023-01-31 15:24:47,999] INFO [plc-sink|task-0] [s0] No contact points provided, defaulting to /127.0.0.1:9042 (com.datastax.oss.driver.internal.core.metadata.MetadataManager:134)
[2023-01-31 15:24:48,190] WARN [plc-sink|task-0] [s0] Error connecting to Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=3247c5e4), trying next node (ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)) (com.datastax.oss.driver.internal.core.control.ControlConnection:34)
[2023-01-31 15:24:48,200] ERROR [plc-sink|task-0] WorkerSinkTask{id=plc-sink-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:196)
com.datastax.oss.driver.api.core.AllNodesFailedException: Could not reach any contact point, make sure you've provided valid addresses (showing first 1 nodes, use getAllErrors() for more): Node(endPoint=/127.0.0.1:9042, hostId=null, hashCode=3247c5e4): [com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)]
at com.datastax.oss.driver.api.core.AllNodesFailedException.copy(AllNodesFailedException.java:141)
at com.datastax.oss.driver.internal.core.util.concurrent.CompletableFutures.getUninterruptibly(CompletableFutures.java:149)
at com.datastax.oss.driver.api.core.session.SessionBuilder.build(SessionBuilder.java:612)
at com.datastax.oss.kafka.sink.state.LifeCycleManager.buildCqlSession(LifeCycleManager.java:518)
at com.datastax.oss.kafka.sink.state.LifeCycleManager.lambda$startTask$0(LifeCycleManager.java:113)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at com.datastax.oss.kafka.sink.state.LifeCycleManager.startTask(LifeCycleManager.java:109)
at com.datastax.oss.kafka.sink.CassandraSinkTask.start(CassandraSinkTask.java:83)
at org.apache.kafka.connect.runtime.WorkerSinkTask.initializeAndStart(WorkerSinkTask.java:312)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:187)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:244)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Suppressed: com.datastax.oss.driver.api.core.connection.ConnectionInitException: [s0|control|connecting...] Protocol initialization request, step 1 (OPTIONS): failed to send request (java.nio.channels.ClosedChannelException)
at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler$InitRequest.fail(ProtocolInitHandler.java:342)
at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.writeListener(ChannelHandlerRequest.java:87)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.addListener(DefaultPromise.java:183)
at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:95)
at io.netty.channel.DefaultChannelPromise.addListener(DefaultChannelPromise.java:30)
at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.send(ChannelHandlerRequest.java:76)
at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler$InitRequest.send(ProtocolInitHandler.java:183)
at com.datastax.oss.driver.internal.core.channel.ProtocolInitHandler.onRealConnect(ProtocolInitHandler.java:118)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.lambda$connect$0(ConnectInitHandler.java:57)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9042
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:921)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:354)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:897)
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1372)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:748)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:740)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:726)
at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:748)
at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:763)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:788)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:756)
at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:806)
at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1025)
at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:294)
at com.datastax.oss.driver.internal.core.channel.ChannelHandlerRequest.send(ChannelHandlerRequest.java:75)
... 20 more
In the 5% of the time that Cassandra doesn't actually crash, the following message shows up in Kafka Connect's logs:
[2023-01-31 15:41:32,839] INFO [plc-sink|task-0] DataStax Java driver for Apache Cassandra(R) (com.datastax.oss:java-driver-core) version 4.6.0 (com.datastax.oss.driver.internal.core.DefaultMavenCoordinates:37)
[2023-01-31 15:41:32,981] INFO [plc-sink|task-0] Could not register Graph extensions; this is normal if Tinkerpop was explicitly excluded from classpath (com.datastax.oss.driver.internal.core.context.InternalDriverContext:540)
[2023-01-31 15:41:32,982] INFO [plc-sink|task-0] Could not register Reactive extensions; this is normal if Reactive Streams was explicitly excluded from classpath (com.datastax.oss.driver.internal.core.context.InternalDriverContext:559)
[2023-01-31 15:41:33,037] INFO [plc-sink|task-0] Using native clock for microsecond precision (com.datastax.oss.driver.internal.core.time.Clock:40)
[2023-01-31 15:41:33,040] INFO [plc-sink|task-0] [s0] No contact points provided, defaulting to /127.0.0.1:9042 (com.datastax.oss.driver.internal.core.metadata.MetadataManager:134)
[2023-01-31 15:41:33,254] INFO [plc-sink|task-0] [s0] Failed to connect with protocol DSE_V2, retrying with DSE_V1 (com.datastax.oss.driver.internal.core.channel.ChannelFactory:224)
[2023-01-31 15:41:33,263] INFO [plc-sink|task-0] [s0] Failed to connect with protocol DSE_V1, retrying with V4 (com.datastax.oss.driver.internal.core.channel.ChannelFactory:224)
[2023-01-31 15:41:34,091] INFO [plc-sink|task-0] WorkerSinkTask{id=plc-sink-0} Sink task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSinkTask:313)
[2023-01-31 15:41:34,092] INFO [plc-sink|task-0] WorkerSinkTask{id=plc-sink-0} Executing sink task (org.apache.kafka.connect.runtime.WorkerSinkTask:198)
...
Versions:
Apache Cassandra 4.0.7
Apache Kafka 3.3.1
DataStax Apache Kafka Connector 1.4.0
I am currently using WSL2 Ubuntu 20.04.5 on Windows 11, with the following specs:
CPU: 4 Cores
Memory: 8GB RAM
Disk (SSD): 250 GB
Seeing that it actually works 5% of the time, I suspect that it's an OOM problem as outlined in https://community.datastax.com/questions/6947/index.html (and I sometimes just happen to have enough memory?). I've tried the solution in this article but it didn't help. How can I configure Cassandra / Kafka Connect to avoid this problem? Is this just a matter of needing a computer with more memory?
I think you're on the right track when you suggested that memory is an issue.
I have a "small" Windows Surface Pro I use to replicate issues like yours. I'm also running Ubuntu 20.04 with WSL2 on this laptop.
By default, Windows allocates half of system RAM to WSL2 so on my sub-8GB RAM installation, my Ubuntu installation takes up 3.7GB of memory. A vanilla installation of Cassandra (out-of-the-box zero configuration), starts with 1.3GB of memory allocated to it so there's only 2.4GB left for everything else.
What I suspect is happening is that when you start Kafka on the same node, Ubuntu runs out of memory and it triggers the Linux oom-killer. Although the end result is similar, the trigger is slightly different to what I described in the post you linked so my recommendation to explicitly set disk_access_mode doesn't help in this situation.
As a workaround, configure Cassandra to only allocate 1GB of memory by setting the MAX_HEAP_SIZE in conf/cassandra-env.sh:
MAX_HEAP_SIZE="1G"
Kafka is configured with 1GB by default but if it isn't, set the following in bin/kafka-server-start.sh:
export KAFKA_HEAP_OPTS="-Xmx1G -Xms1G"
By setting both, there should be over 1GB left for Ubuntu and hopefully allow you to run your tests. Cheers!

Exception when trying to upgrade to flink 1.3.1

I tried to upgrade my flink version in my cluster to 1.3.1 (and 1.3.2 as well) and I got the following exception in my task managers:
2018-02-28 12:57:27,120 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during disposal of stream operator.
org.apache.kafka.common.KafkaException: java.lang.InterruptedException
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:424)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.close(FlinkKafkaProducerBase.java:317)
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:126)
at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:429)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:334)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:422)
... 7 more
The job manager showed that it failed to connect with the task managers.
I am using FlinkKafkaProducer08.
Any ideas?
First of all, from the stack trace above: it was thrown during operator cleanup of a non-graceful termination (otherwise this code is not executed). It looks as if it should be followed by the real exception that caused the initial problem. Can you provide some more parts of the log?
If the JobManager failed to connect to any TaskManager that should run your job, the whole job will be cancelled (and retried based on your retry policy). The same may happen on your TaskManager side. That may be the root cause and needs further investigation.

Confluent Start -> Schema Registry Failed to Start

When I start Confluent, Schema-registry fails, preventing the process from completing successfully. This is the response I get:
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
Schema Registry failed to start
schema-registry is [DOWN]
Starting kafka-rest
Kafka Rest failed to start
kafka-rest is [DOWN]
Starting connect
connect is [UP]
When I tried to run the processes individually, zookeeper ran without problems. However, when I launched kafka, zookeeper displayed the following error:
Error Path:/brokers Error:KeeperErrorCode = NodeExists for /brokers (org.apache.zookeeper.server.PrepRequestProcessor)
Then, when I attempted to run Schema registry, I was hit with a massive list of errors. I'm sure the errors all point to one small thing. Here are some of the errors (many repeat in the same long message):
1.
WARNING: HK2 service reification failed for [org.glassfish.jersey.message.internal.DataSourceProvider] with an exception:
MultiException stack 1 of 2
java.lang.NoClassDefFoundError: javax/activation/DataSource
2.
MultiException stack 2 of 2
java.lang.IllegalArgumentException: Errors were discovered while reifying SystemDescriptor
3.
java.lang.IllegalArgumentException: While attempting to resolve the dependencies of org.glassfish.jersey.server.validation.internal.ValidationBinder$ConfiguredValidatorProvider errors were found
4.
java.lang.NoClassDefFoundError: javax/xml/bind/ValidationException
Some of the errors vary slightly based on location, but for the most part, these 4 errors are printed out dozens of times.
I did my best to make sure no ports were being used by other processes. I also stopped and destroyed all instances of confluent that I've created before. I've played around with Kafka on this computer before, so I theorize that that could have something to do with it, but I've made sure to close all past zookeeper and kafka instances.
I've tried to run confluent on a different computer and didn't run into any issues. Does anyone know what could be the problem? I can send the entire error message and provide any additional details.
Thanks in advance!
Remove Java 9.
I had both Java 9 and Java 8 on my computer. Turns out, Confluent was attempting to use Java 9, which isn't compatible with Confluent. When I deleted everything related to Java 9, Confluent started using Java 8, which solved the problem.
As BluePhantom pointed out, using Java 7 will also do the trick.

Kafka mirrormaker do not start

Im in the process of upgrading our cluster. However Im having issues trying to make the mirrormakers run.
So this machines have kafka-brokers and kafka-mirrormakers running. They have separate init scripts.
The brokers currently are using version 10.1.1.1 and mirrormakers are using version 0.8.2-beta.
Both of them have their own config files and locations
for example brokers are installed in /server/kafka/
mirrormakers are installed under /opt/kafka_mirrormaker.
Here the config lines for brokers following what the upgrade process explained:
inter.broker.protocol.version=0.10.1
log.message.format.version=0.8.2
and for mirrormakers:
inter.broker.protocol.version=0.8.2
log.message.format.version=0.8.2
So I was testing to upgrade this to 10.2.1 I tried the upgrade in one host.
Broker is running fine after I applied the upgrade version 10.2.1 however the mirrormaker dies right away when I tried to start it.
I see this exception on the logs
Exception in thread "main" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:309)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
Exception in thread "MirrorMakerShutdownHook" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.cleanShutdown(MirrorMaker.scala:399)
at kafka.tools.MirrorMaker$$anon$2.run(MirrorMaker.scala:222)
tail: kafka-mirrormaker-repl-sjc2-to-hkg1.out: file truncated
Exception in thread "main" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:309)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
Exception in thread "MirrorMakerShutdownHook" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.cleanShutdown(MirrorMaker.scala:399)
at kafka.tools.MirrorMaker$$anon$2.run(MirrorMaker.scala:222)
and this one
[2017-05-18 17:02:27,936] ERROR Exception when starting mirror maker. (kafka.tools.MirrorMaker$)
org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value.
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:436)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:56)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:63)
at org.apache.kafka.clients.producer.ProducerConfig.<init>(ProducerConfig.java:340)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:191)
at kafka.tools.MirrorMaker$MirrorMakerProducer.<init>(MirrorMaker.scala:694)
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:236)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
This bootstrap error is kind of weird due to this already config. The server.properties has localhost:9292 configured as the bootstrap.server
To upgrade this I did broker and mirrormaker at the same time. Im not sure if I should first upgrade all the brokers first and then the mirrormakers.
Any suggestions. Should I follow the same procedure, upgrade first all brokers and then all mirrormakers. once they are upgrade bump the protocols in server.properties. Even though when it seems that the documentation kind of doesnt imply that: http://kafka.apache.org/documentation.html#upgrade
This has being solved.
The reason why they were not starting was due to the options on the configuration files change or were not properly configure

Kafka Connection error in contoller.logs

I am using single node Kafka(v 0.10.2) and single node zookeeper (v 3.4.8) and my controller.log file is filled with this exception
java.io.IOException: Connection to 1 was disconnected before the response was read
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3(NetworkClientBlockingOps.scala:114)
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$3$adapted(NetworkClientBlockingOps.scala:112)
at scala.Option.foreach(Option.scala:257)
at kafka.utils.NetworkClientBlockingOps$.$anonfun$blockingSendAndReceive$1(NetworkClientBlockingOps.scala:112)
at kafka.utils.NetworkClientBlockingOps$.recursivePoll$1(NetworkClientBlockingOps.scala:136)
at kafka.utils.NetworkClientBlockingOps$.pollContinuously$extension(NetworkClientBlockingOps.scala:142)
at kafka.utils.NetworkClientBlockingOps$.blockingSendAndReceive$extension(NetworkClientBlockingOps.scala:108)
at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:192)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:184)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
I googled this exception but was not able to find the root cause for this exception. Can someone suggest me why this error is happening and how to prevent it?
I also encounter same issue in multi-node cluster scenario. It was because of connection shutdown between kafka-node and zookeeper. I would suggest to restart zookeeper server then kafka-node to re-establish the connection therefore broker should be handle pub/sub message transition.
Hope it would raise you from this.