Exception when trying to upgrade to flink 1.3.1 - apache-kafka

I tried to upgrade the Flink version in my cluster to 1.3.1 (and 1.3.2 as well) and got the following exception in my task managers:
2018-02-28 12:57:27,120 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask - Error during disposal of stream operator.
org.apache.kafka.common.KafkaException: java.lang.InterruptedException
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:424)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase.close(FlinkKafkaProducerBase.java:317)
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:126)
at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:429)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:334)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:422)
... 7 more
The job manager showed that it failed to connect with the task managers.
I am using FlinkKafkaProducer08.
Any ideas?

First of all, judging from the stack trace above: the exception was thrown during operator cleanup after a non-graceful termination (this code path is not executed otherwise). It should be accompanied in the log by the real exception that caused the initial failure. Can you provide more of the log?
If the JobManager fails to connect to any TaskManager that should run your job, the whole job is cancelled (and retried according to your restart strategy). The same can happen on the TaskManager side. That may be the root cause and needs further investigation.
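For reference, that retry behaviour is controlled by the job's restart strategy. Below is a minimal sketch of setting one explicitly via the 1.3.x API; the job name and the omitted pipeline are placeholders, not taken from your job:
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Retry a failed/cancelled job up to 3 times, waiting 10 seconds between attempts.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 10_000L));
        // ... build the Kafka source/sink pipeline here ...
        env.execute("kafka-job"); // placeholder job name
    }
}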

Related

IOError(Stalefile) exception being thrown by Kafka Streams RocksDB

When running my stateful Kafka Streams applications I keep coming across RocksDB Disk I/O Stalefile exceptions. The exception only occurs when I have at least one KTable in the topology, and it happens at varying times. I've tried countless times to reproduce it but haven't been able to.
App/Environment details:
Runtime: Java
Kafka library: org.apache.kafka:kafka-streams:2.5.1
Deployment: OpenShift
Volume type: NFS
RAM: 2000 - 8000 MiB
CPU: 200 Millicores to 2 Cores
Threads: 1
Partitions: 1 - many
Exceptions encountered:
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while getting value for key from at org.apache.kafka.streams.state.internals.RocksDBStore.get(RocksDbStore.java:301)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error restoring batch to store at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(RocksDbStore.java:636)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while range compacting during restoring at org.apache.kafka.streams.state.internals.RocksDBStore$SingleColumnFamilyAccessor.toggleDbForBulkLoading(RocksDbStore.java:616)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while executing flush from store at org.apache.kafka.streams.state.internals.RocksDBStore.flush(RocksDbStore.java:616)
Apologies for not being able to post the entire stack trace, but all of the above exceptions seem to reference the org.rocksdb.RocksDBException: IOError(Stalefile) exception.
Additional info:
Using a persisted state directory
Kafka topic settings are created with defaults
Running a single instance on a single thread
Exception is raised during gets and writes
Exception is raised when consuming valid data
Exception also occurs on internal repartition topics
I'd really appreciate any help and please let me know if I can provide any further information.
If you are using a POSIX file system, this error means the file system returned ESTALE (stale file handle); see the description of that error code at https://man7.org/linux/man-pages/man3/errno.3.html. ESTALE is commonly seen with NFS, which matches the volume type in your setup.
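If moving the state off NFS is an option, here is a minimal sketch of pointing the Streams state directory at node-local storage instead; the application id, bootstrap servers and path below are made-up examples, not from your deployment:
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class LocalStateDirConfig {
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-ktable-app");   // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // hypothetical brokers
        // Keep RocksDB state on node-local storage (e.g. an emptyDir volume in OpenShift)
        // rather than on the NFS mount that returns ESTALE.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams-state");
        return props;
    }
}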

Is IllegalStateException: No entry found for connection a recoverable exception for KafkaConsumer.poll?

I have a KafkaConsumer where the following error is logged:
Heartbeat thread failed due to unexpected error java.lang.IllegalStateException: No entry found for connection 1
followed by KafkaConsumer.poll() throwing the IllegalStateException.
In this case can I continue to call KafkaConsumer.poll() with the expectation that it will reconnect and continue to function?
Or is this an unrecoverable error that means I can no longer use this consumer?
The underlying cause of this issue was KAFKA-7974, so it isn't an exception you're expected to recover from yourself; the consumer internals should handle this case.
It will be resolved in Kafka 2.1.2, 2.2.1 and 2.3.0.
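Until you can move to one of those versions, one possible stop-gap (not from the original answer; the helper and properties below are hypothetical) is to treat the consumer as broken when poll() throws this, close it, and build a fresh one:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollWithRecreate {
    // Hypothetical helper: builds a fresh consumer from the application's properties.
    static KafkaConsumer<String, String> newConsumer(Properties props, String topic) {
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList(topic));
        return consumer;
    }

    static void run(Properties props, String topic) {
        KafkaConsumer<String, String> consumer = newConsumer(props, topic);
        while (true) {
            try {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                // ... process records ...
            } catch (IllegalStateException e) {
                // KAFKA-7974 leaves the consumer's internal connection bookkeeping broken,
                // so discard this instance and start over with a new one.
                consumer.close();
                consumer = newConsumer(props, topic);
            }
        }
    }
}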

Interrupted while joining ioThread / Error during disposal of stream operator in flink application

I have a Flink-based streaming application that uses Apache Kafka sources and sinks. For a few days now I have been getting exceptions at random times during development, and I have no clue where they're coming from.
I am running the app within IntelliJ using the mainRunner class, and I am feeding it messages via Kafka. Sometimes the first message triggers the errors, sometimes it happens only after a few messages.
This is how it looks:
16:31:01.935 ERROR o.a.k.c.producer.KafkaProducer - Interrupted while joining ioThread
java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[na:1.8.0_51]
at java.lang.Thread.join(Thread.java:1253) [na:1.8.0_51]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1031) [kafka-clients-0.11.0.2.jar:na]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1010) [kafka-clients-0.11.0.2.jar:na]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:989) [kafka-clients-0.11.0.2.jar:na]
at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.close(FlinkKafkaProducer.java:168) [flink-connector-kafka-0.11_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.close(FlinkKafkaProducer011.java:662) [flink-connector-kafka-0.11_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43) [flink-core-1.6.1.jar:1.6.1]
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117) [flink-streaming-java_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:477) [flink-streaming-java_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:378) [flink-streaming-java_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) [flink-runtime_2.11-1.6.1.jar:1.6.1]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
16:31:01.936 ERROR o.a.f.s.runtime.tasks.StreamTask - Error during disposal of stream operator.
org.apache.kafka.common.KafkaException: Failed to close kafka producer
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1062) ~[kafka-clients-0.11.0.2.jar:na]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1010) ~[kafka-clients-0.11.0.2.jar:na]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:989) ~[kafka-clients-0.11.0.2.jar:na]
at org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaProducer.close(FlinkKafkaProducer.java:168) ~[flink-connector-kafka-0.11_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.close(FlinkKafkaProducer011.java:662) ~[flink-connector-kafka-0.11_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:43) ~[flink-core-1.6.1.jar:1.6.1]
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117) ~[flink-streaming-java_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:477) [flink-streaming-java_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:378) [flink-streaming-java_2.11-1.6.1.jar:1.6.1]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711) [flink-runtime_2.11-1.6.1.jar:1.6.1]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_51]
Caused by: java.lang.InterruptedException: null
at java.lang.Object.wait(Native Method) ~[na:1.8.0_51]
at java.lang.Thread.join(Thread.java:1253) [na:1.8.0_51]
at org.apache.kafka.clients.producer.KafkaProducer.close(KafkaProducer.java:1031) ~[kafka-clients-0.11.0.2.jar:na]
... 10 common frames omitted
16:31:01.938 ERROR o.a.k.c.producer.KafkaProducer - Interrupted while joining ioThread
I get around 10-20 of those, and then Flink seems to recover the app; it becomes usable again and I can successfully process messages.
What could possibly cause this? Or how can I analyze this further to track it down?
I am using Flink 1.6.1 with Scala 2.11 on a Mac, with IntelliJ version 2018.3.2.
I was able to resolve it. It turned out that one of my stream operators (a map function) was throwing an exception because of an invalid array index.
This was not visible in the logs; only when I tore the application down into smaller pieces, step by step, did the actual exception finally show up. After fixing the obvious bug in the array access, the above-mentioned exceptions (java.lang.InterruptedException and org.apache.kafka.common.KafkaException) went away.
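For anyone hitting the same wall, here is a sketch of the kind of instrumentation that can surface such a root cause without tearing the job apart; the splitting logic below is a made-up stand-in for the buggy array access, not the actual operator:
import org.apache.flink.api.common.functions.MapFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class InstrumentedMap implements MapFunction<String, String> {
    private static final Logger LOG = LoggerFactory.getLogger(InstrumentedMap.class);

    @Override
    public String map(String value) throws Exception {
        try {
            String[] parts = value.split(",");
            return parts[2];   // stand-in for the faulty array access
        } catch (ArrayIndexOutOfBoundsException e) {
            // Log the offending element before rethrowing, so the real failure
            // is not drowned out by the producer-close InterruptedExceptions.
            LOG.error("Map function failed for element: {}", value, e);
            throw e;
        }
    }
}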

Spring websockets across Wildfly 10.1 cluster

Using Wildfly 10.1 standalone-full-ha.xml with the included ActiveMQ Artemis 1.1.0. Enabled STOMP by adding this to the ActiveMQ config:
<acceptor name="stomp-acceptor" factory-class="org.apache.activemq.artemis.core.remoting.impl.netty.NettyAcceptorFactory">
<param name="protocols" value="STOMP"/>
<param name="port" value="${stomp.port:61613}"/>
</acceptor>
Deployed a sample spring websockets war based on https://spring.io/guides/gs/messaging-stomp-websocket/
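For context, the relay configuration in such a deployment usually looks roughly like the following; this is a sketch based on the Spring guide, and the endpoint path, destinations and relay host are assumptions rather than the actual war's code:
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.simp.config.MessageBrokerRegistry;
import org.springframework.web.socket.config.annotation.AbstractWebSocketMessageBrokerConfigurer;
import org.springframework.web.socket.config.annotation.EnableWebSocketMessageBroker;
import org.springframework.web.socket.config.annotation.StompEndpointRegistry;

@Configuration
@EnableWebSocketMessageBroker
public class WebSocketConfig extends AbstractWebSocketMessageBrokerConfigurer {

    @Override
    public void configureMessageBroker(MessageBrokerRegistry registry) {
        // Relay STOMP frames to the Artemis stomp-acceptor defined above.
        registry.enableStompBrokerRelay("/topic", "/queue")
                .setRelayHost("localhost")
                .setRelayPort(61613);
        registry.setApplicationDestinationPrefixes("/app");
    }

    @Override
    public void registerStompEndpoints(StompEndpointRegistry registry) {
        registry.addEndpoint("/gs-guide-websocket").withSockJS();
    }
}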
I have this running on two separate servers which are able to form a cluster. I am able to connect to both servers and send messages across; however, when I disconnect one of the websockets, the following exception is thrown on the other server:
16:49:02,377 ERROR [org.apache.activemq.artemis.core.server] (Thread-7 (ActiveMQ-client-global-threads-926891052)) AMQ224037: cluster connection Failed to handle message: java.lang.IllegalStateException: Cannot find binding for ffaa538e-77c3-11e7-ba4b-7b9cb7ee40e2b7e6493a-77c3-11e7-ba4b-7b9cb7ee40e2
at org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.doConsumerClosed(ClusterConnectionImpl.java:1319)
at org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.handleNotificationMessage(ClusterConnectionImpl.java:1005)
at org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionImpl$MessageFlowRecordImpl.onMessage(ClusterConnectionImpl.java:974)
at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1018)
at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:48)
at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1145)
at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:103)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I have also tested this with Wildfly 11 Alpha and the included ActiveMQ Artemis 1.5.3, but I get the same error every time I disconnect a websocket.
I also get the following errors when I shut down a server, but I am less concerned about these since they only happen during shutdown:
17:11:03,747 ERROR [org.apache.activemq.artemis.core.server] (Thread-27 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$2#5669232b-14602972)) AMQ224051: Failed to call notification listener: java.lang.IllegalStateException: No queue 0110e86e-77c4-11e7-a745-252881f696b4
at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.onNotification(PostOfficeImpl.java:387)
at org.apache.activemq.artemis.core.server.management.impl.ManagementServiceImpl.sendNotification(ManagementServiceImpl.java:580)
at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.close(ServerConsumerImpl.java:437)
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doClose(ServerSessionImpl.java:354)
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl$1.done(ServerSessionImpl.java:1191)
at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl.executeOnCompletion(OperationContextImpl.java:161)
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.close(ServerSessionImpl.java:1185)
at org.apache.activemq.artemis.core.protocol.stomp.StompProtocolManager$1.run(StompProtocolManager.java:276)
at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:103)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17:11:03,784 ERROR [io.netty.util.concurrent.DefaultPromise.rejectedExecution] (globalEventExecutor-1-2) Failed to submit a listener notification task. Event loop shut down?: java.util.concurrent.RejectedExecutionException: event executor terminated
at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821)
at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327)
at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320)
at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746)
at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:760)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:428)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1058)
at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:686)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$700(AbstractChannel.java:419)
at io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:646)
at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:233)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
17:11:03,797 WARN [io.netty.channel.AbstractChannel] (globalEventExecutor-1-2) Can't invoke task later as EventLoop rejected it: java.util.concurrent.RejectedExecutionException: event executor terminated
at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:821)
at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:327)
at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:320)
at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:746)
at io.netty.channel.AbstractChannel$AbstractUnsafe.invokeLater(AbstractChannel.java:931)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$900(AbstractChannel.java:419)
at io.netty.channel.AbstractChannel$AbstractUnsafe$5.run(AbstractChannel.java:649)
at io.netty.util.concurrent.GlobalEventExecutor$TaskRunner.run(GlobalEventExecutor.java:233)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Is it an issue with the Spring websocket example?
Do I need to change something in the default standalone-full-ha.xml to get this to work properly across a cluster, without throwing exceptions on client disconnect?
The version of Artemis in Wildfly is a bit behind the latest upstream (1.5.3 vs 2.2.0). Have you tried this with a cluster of standalone Artemis 2.2.0 nodes?

Kafka MirrorMaker does not start

I'm in the process of upgrading our cluster, but I'm having issues getting the MirrorMakers to run.
These machines run both Kafka brokers and Kafka MirrorMakers, with separate init scripts. The brokers are currently on version 0.10.1.1 and the MirrorMakers on version 0.8.2-beta.
Both have their own config files and locations: the brokers are installed in /server/kafka/ and the MirrorMakers under /opt/kafka_mirrormaker.
Here are the config lines for the brokers, following what the upgrade documentation describes:
inter.broker.protocol.version=0.10.1
log.message.format.version=0.8.2
and for the MirrorMakers:
inter.broker.protocol.version=0.8.2
log.message.format.version=0.8.2
I was testing an upgrade to 0.10.2.1 and tried it on one host.
The broker runs fine after the upgrade to 0.10.2.1, but the MirrorMaker dies right away when I try to start it.
I see this exception in the logs:
Exception in thread "main" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:309)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
Exception in thread "MirrorMakerShutdownHook" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.cleanShutdown(MirrorMaker.scala:399)
at kafka.tools.MirrorMaker$$anon$2.run(MirrorMaker.scala:222)
tail: kafka-mirrormaker-repl-sjc2-to-hkg1.out: file truncated
Exception in thread "main" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:309)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
Exception in thread "MirrorMakerShutdownHook" java.lang.NullPointerException
at kafka.tools.MirrorMaker$.cleanShutdown(MirrorMaker.scala:399)
at kafka.tools.MirrorMaker$$anon$2.run(MirrorMaker.scala:222)
and this one
[2017-05-18 17:02:27,936] ERROR Exception when starting mirror maker. (kafka.tools.MirrorMaker$)
org.apache.kafka.common.config.ConfigException: Missing required configuration "bootstrap.servers" which has no default value.
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:436)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:56)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:63)
at org.apache.kafka.clients.producer.ProducerConfig.<init>(ProducerConfig.java:340)
at org.apache.kafka.clients.producer.KafkaProducer.<init>(KafkaProducer.java:191)
at kafka.tools.MirrorMaker$MirrorMakerProducer.<init>(MirrorMaker.scala:694)
at kafka.tools.MirrorMaker$.main(MirrorMaker.scala:236)
at kafka.tools.MirrorMaker.main(MirrorMaker.scala)
This bootstrap error is odd, because that is already configured: server.properties has localhost:9292 set as the bootstrap server.
For this upgrade I did the broker and the MirrorMaker at the same time; I'm not sure whether I should first upgrade all the brokers and then the MirrorMakers.
Any suggestions? Should I follow the same procedure: upgrade all the brokers first, then all the MirrorMakers, and only once they are upgraded bump the protocol versions in server.properties? The documentation doesn't quite seem to imply that: http://kafka.apache.org/documentation.html#upgrade
This has been solved.
The reason they were not starting was that options in the configuration files had changed between versions or were not properly configured.
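For anyone hitting the same ConfigException: with the 0.10.x MirrorMaker, bootstrap.servers is read from the properties files passed to the tool itself, not from the broker's server.properties. A rough sketch of what the files and invocation typically look like (host names, group id and whitelist are made-up examples):
# consumer.properties (source cluster)
bootstrap.servers=source-kafka:9092
group.id=mirrormaker-repl

# producer.properties (target cluster)
bootstrap.servers=target-kafka:9092

# invocation
bin/kafka-mirror-maker.sh \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist ".*"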