Cygnus does not reconnect to Kafka broker - fiware-cygnus

I am using the cygnus-kafka connector. When the connection between Cygnus and ZooKeeper is lost, Cygnus cannot reconnect to ZooKeeper once the connection comes back; I need to restart it so it can reconnect.
Any idea why Cygnus is not able to reconnect to the Kafka broker once the connection has been lost?
This is the error that I get:
time=2016-11-30T11:29:26.254Z | lvl=WARN | corr=2a924ba4-b6f0-11e6-8836-fa163e68f7a2 | trans=ce766745-ae85-415a-a6f3-0bed9f121e79 | srv=service| subsrv=/servicepath | function=run | comp=cygnusagent | msg=org.apache.zookeeper.ClientCnxn$SendThread[1185] : Session 0x0 for server kafkaServerIp/kafkaServerIp:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
time=2016-11-30T11:29:28.211Z | lvl=WARN | corr=2a924ba4-b6f0-11e6-8836-fa163e68f7a2 | trans=ce766745-ae85-415a-a6f3-0bed9f121e79 | srv=service| subsrv=/servicepath | function=processNewBatches | comp=cygnusagent | msg=com.telefonica.iot.cygnus.sinks.NGSISink[439] : Unable to connect to zookeeper server within timeout: 10000
Thanks!

The problem is that the connection from Cygnus to Kafka is permanent, for efficiency reasons. However, a check for a connection reset by the peer is missing in the code. I'll fix it ASAP so it is ready for the next version release (1.7.0) by the end of January (of course, it will be available on the master branch once fixed, much sooner).

Related

How can Datadog get the JMX metrics from Strimzi Kafka pods on AKS?

I have already read a lot of the documentation from Datadog and Strimzi about JMX autodiscovery and JMX configuration, but I am missing something; at least it's not working (Datadog doesn't get the metrics).
I'm using kubectl against an AKS cluster. I installed Strimzi to run Kafka on AKS:
helm install strimzi-kafka-release strimzi/strimzi-kafka-operator
and set up the Kafka and ZooKeeper pods with kafka-single.yaml:
kubectl apply -f kafka-single.yaml -n aks
then installed the Datadog agent with the datadog-values.yaml file:
helm install datadog-agent -f datadog-values.yaml --set datadog.site='datadoghq.com' --set datadog.apiKey='$DD-KEY' datadog/datadog
and I can even see the JMX options showing as available in the process inspection view in Datadog.
I'm pretty sure I have something misplaced or miscalled, but I'm a little frustrated right now and can't pin down what is preventing the metrics from being discoverable by Datadog.
I tried to edit the confd option in datadog-values.yaml, but it creates the files in /etc/datadog-agent/conf.d instead of /etc/datadog-agent/conf.d/kafka.d/, which is where the conf file is actually recognized and at least tries to do something (I guess, since it at least fails when I change the host).
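What I put under confd was roughly this (a sketch from memory, so the keys and values may not be exactly what is in my file):
datadog:
  confd:
    kafka.yaml: |-
      init_config:
        is_jmx: true
        collect_default_metrics: true
      instances:
        - host: "%%host%%"
          port: 9999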
So now I'm editing kafka-conf.yaml and copying it directly to the pod:
kubectl cp kafka-conf.yaml datadog-agent-pod:/etc/datadog-agent/conf.d/kafka.d/conf.yaml
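That kafka-conf.yaml looks roughly like this (the host line is the part I keep changing; this is a sketch of my file, not guaranteed verbatim):
init_config:
  is_jmx: true
  collect_default_metrics: true
instances:
  - host: "%%host%%"   # also tried localhost and the pod IP directly
    port: 9999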
and then I try the command
kubectl exec -it datadog-agent-pod -- agent jmx list matching
which fails if I put localhost or anything other than %%host%%.
(This is the failure message from when I tried directly with an IP:)
Loading configs...
Config kafka was loaded.
2022-02-03 18:49:23 GMT | JMX | INFO | App | JMX Fetch 0.44.6 has started
2022-02-03 18:49:23 GMT | JMX | INFO | App | Found 0 config files
2022-02-03 18:49:24 GMT | JMX | INFO | App | update is in order - updating timestamp: 1643914164
2022-02-03 18:49:24 GMT | JMX | INFO | App | Cleaning up instances...
2022-02-03 18:49:24 GMT | JMX | INFO | App | Dealing with YAML config instances...
2022-02-03 18:49:24 GMT | JMX | INFO | App | Dealing with Auto-Config instances collected...
2022-02-03 18:49:24 GMT | JMX | INFO | App | Instantiating instance for: kafka
2022-02-03 18:49:24 GMT | JMX | INFO | App | Started instance initialization...
2022-02-03 18:49:24 GMT | JMX | INFO | Instance | Trying to connect to JMX Server at 10.244.0.66:9999
2022-02-03 18:49:24 GMT | JMX | INFO | Instance | Connection closed or does not exist. Attempting to create a new connection...
2022-02-03 18:49:24 GMT | JMX | INFO | ConnectionFactory | Connecting using JMX Remote
2022-02-03 18:49:24 GMT | JMX | INFO | Connection | Connecting to: service:jmx:rmi:///jndi/rmi://10.244.0.66:9999/jmxrmi
2022-02-03 18:49:27 GMT | JMX | INFO | App | Completed instance initialization...
2022-02-03 18:49:27 GMT | JMX | WARN | App | Could not initialize instance: kafka-10.244.0.66-9999:
java.util.concurrent.ExecutionException: java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: Exception creating connection to: 10.244.0.66; nested exception is:
java.net.NoRouteToHostException: No route to host (Host unreachable)]
at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
at org.datadog.jmxfetch.App.processRecoveryResults(App.java:1001)
at org.datadog.jmxfetch.App$6.invoke(App.java:977)
at org.datadog.jmxfetch.tasks.TaskProcessor.processTasks(TaskProcessor.java:63)
at org.datadog.jmxfetch.App.init(App.java:969)
at org.datadog.jmxfetch.App.run(App.java:205)
at org.datadog.jmxfetch.App.main(App.java:153)
Caused by: java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: Exception creating connection to: 10.244.0.66; nested exception is:
java.net.NoRouteToHostException: No route to host (Host unreachable)]
at java.management.rmi/javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:370)
at java.management/javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:270)
at org.datadog.jmxfetch.Connection.createConnection(Connection.java:64)
at org.datadog.jmxfetch.RemoteConnection.<init>(RemoteConnection.java:101)
at org.datadog.jmxfetch.ConnectionFactory.createConnection(ConnectionFactory.java:38)
at org.datadog.jmxfetch.Instance.getConnection(Instance.java:403)
at org.datadog.jmxfetch.Instance.init(Instance.java:416)
at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:15)
at org.datadog.jmxfetch.InstanceInitializingTask.call(InstanceInitializingTask.java:3)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: Exception creating connection to: 10.244.0.66; nested exception is:
java.net.NoRouteToHostException: No route to host (Host unreachable)]
at jdk.naming.rmi/com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:137)
at java.naming/com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:207)
at java.naming/javax.naming.InitialContext.lookup(InitialContext.java:409)
at java.management.rmi/javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1839)
at java.management.rmi/javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1813)
at java.management.rmi/javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:302)
... 12 more
Caused by: java.rmi.ConnectIOException: Exception creating connection to: 10.244.0.66; nested exception is:
java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.rmi/sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:635)
at java.rmi/sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:209)
at java.rmi/sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:196)
at java.rmi/sun.rmi.server.UnicastRef.newCall(UnicastRef.java:343)
at java.rmi/sun.rmi.registry.RegistryImpl_Stub.lookup(RegistryImpl_Stub.java:116)
at jdk.naming.rmi/com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:133)
... 17 more
Caused by: java.net.NoRouteToHostException: No route to host (Host unreachable)
at org.datadog.jmxfetch.util.JmxfetchRmiClientSocketFactory.getSocketFromFactory(JmxfetchRmiClientSocketFactory.java:67)
at org.datadog.jmxfetch.util.JmxfetchRmiClientSocketFactory.createSocket(JmxfetchRmiClientSocketFactory.java:40)
at java.rmi/sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:617)
... 22 more
But when the host uses the %%host%% template variable, there is no error, yet it gets nothing from the Kafka pods.
What am I doing wrong? Or what is wrong with this setup?
I have checked other answers and questions and a lot of docs these last days just to get the Kafka metrics, and apparently one does not simply configure Datadog for JMX autodiscovery in AKS with Strimzi/Kafka... I just need the topic metrics.
I know that Strimzi is geared towards Prometheus metrics, but I need Datadog, and I already got scolded for trying the Prometheus option (because I couldn't enable it and get the metrics from there into Datadog).
I feel like it has to be something with the annotations, but to be honest I don't know.
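For reference, my understanding from the Strimzi and Datadog docs is that the Kafka resource in kafka-single.yaml would need something roughly like the following, so maybe part of this is what I'm missing (the container name "kafka", port 9999 and the annotation keys are assumptions from the docs, not something I have verified working):
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    jmxOptions: {}   # supposed to expose an unauthenticated JMX port (9999) on the brokers
    template:
      pod:
        metadata:
          annotations:
            ad.datadoghq.com/kafka.check_names: '["kafka"]'
            ad.datadoghq.com/kafka.init_configs: '[{"is_jmx": true, "collect_default_metrics": true}]'
            ad.datadoghq.com/kafka.instances: '[{"host": "%%host%%", "port": "9999"}]'
    # ...rest of the Kafka spec unchanged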
Please help, I can't be the only one with this problem.

MongoDB: MongoClient unable to connect to secondary replica set members when the primary is down

In my scenario I expect the MongoClient (Meteor app) to connect to the secondary members of the MongoDB replica set when the primary is down. I have a 5-node cluster set up where, when 3 of the MongoDB nodes are down, the remaining 2 stay up as secondaries; during this time the MongoClient should keep connecting to those secondaries and operate in read-only mode.
Has anyone tried this?
I get various timeouts:
Cluster created with settings {hosts=[10.1.1.10:4000], mode=MULTIPLE, requiredClusterType=REPLICA_SET, serverSelectionTimeout='5 ms', maxWaitQueueSize=2000, requiredReplicaSetName='rs0'}
INFO | 2020-08-19 08:29:02,880 | 10.2.2.11 | org.mongodb.driver.cluster | Adding discovered server 10.1.1.10:4000 to client view of cluster
INFO | 2020-08-19 08:29:02,999 | 10.2.2.11 | org.mongodb.morphia.logging.MorphiaLoggerFactory | LoggerImplFactory set to org.mongodb.morphia.logging.jdk.JDKLoggerFactory
INFO | 2020-08-19 08:29:03,036 | 10.2.2.11 | org.mongodb.driver.cluster | Exception in monitor thread while connecting to server 10.1.1.10:4000
No server chosen by ReadPreferenceServerSelector{readPreference=primaryPreferred} from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=10.1.1.10:4000, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.SocketTimeoutException: connect timed out}}]}. Waiting for 5 ms before timing out
Has anyone seen this error?
This is how Meteor connects to mongod:
mongodb://username:password@host-01:4000,host-02:4000,host-03:4000,host-04:4000,host-05:4000/mydb?replicaSet=rs0
The readPreference is primaryPreferred.
Any driver should allow you to issue queries with a secondary/secondary preferred/primary preferred read preference when it is aware of at least one secondary.
You either do not have any functional secondaries in the deployment or you have some sort of connectivity issue between your application and your database, which causes the servers to be unknown rather than secondaries.
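As a sketch (reusing the hosts and replica set name from your URI), a read preference can also be pinned directly in the connection string, for example:
mongodb://username:password@host-01:4000,host-02:4000,host-03:4000,host-04:4000,host-05:4000/mydb?replicaSet=rs0&readPreference=secondaryPreferred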

Kafka consumer not connecting to remote host

In the past few days I've been learning about Kafka and doing small tests.
I could already consume messages successfully on localhost, even from another PC within the same network. But now that I'm trying to connect to a remote server (it's actually the same PC, same broker and same topic; I just have two internet service providers, so I switched to the other one in order to try with the public IP), I don't receive any messages. Also, I get this in the console:
[main] DEBUG org.apache.kafka.common.network.Selector - [Consumer clientId=consumer-1, groupId=0a396775-94e2-46a0-a6bf-08f0d848ffc9] Connection with /xxx.xx.xxx.xxx disconnected
java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at org.apache.kafka.common.network.PlaintextTransportLayer.finishConnect(PlaintextTransportLayer.java:50)
at org.apache.kafka.common.network.KafkaChannel.finishConnect(KafkaChannel.java:216)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:531)
at org.apache.kafka.common.network.Selector.poll(Selector.java:483)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:539)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:212)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureCoordinatorReady(AbstractCoordinator.java:249)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:326)
at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1251)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1220)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1159)
at com.okta.javakafka.kafkajava.SimpleConsumer.main(SimpleConsumer.java:39)
I tried a couple of ping tests and the connection was successful. So maybe I'm missing something that is different in the case of a remote Kafka connection?
If someone could help, I'd appreciate it a lot.
EDIT:
Listeners in my server.properties:
listeners=PLAINTEXT://192.168.1.101:9092 (local ip)
advertised.listeners=PLAINTEXT://200.x.xxx.xxx:9092 (public ip)
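From what I've read, the usual pattern for reaching a broker remotely is to bind on all interfaces and only advertise the public address, roughly like this (a sketch with my real IPs masked; I'm not sure whether this is the actual problem):
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://200.x.xxx.xxx:9092
# the router also has to forward TCP port 9092 to 192.168.1.101 for the public IP to be reachable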

Zookeeper unable to talk to new Kafka broker

In an attempt to reduce the storage on my AWS instance, I decided to launch a new, smaller instance and set up Kafka again from scratch using the Ansible playbook we had from before. I then terminated the old, larger instance and assigned the IP address that it and the other brokers had been using to my new instance.
When tailing my ZooKeeper logs, however, I'm receiving this error:
2018-04-13 14:17:34,884 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker#810] - Connection broken for id 1, my id = 2, error =
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:153)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.net.SocketInputStream.read(SocketInputStream.java:211)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)
2018-04-13 14:17:34,885 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker#813] - Interrupting SendWorker
2018-04-13 14:17:34,884 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker#727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
I double-checked that all 3 Kafka broker IP addresses are correctly listed in these locations (a sketch of the relevant entries follows the list), and I restarted all of their services to be safe:
/etc/hosts
/etc/kafka/config/server.properties
/etc/zookeeper/conf/zoo.cfg
/etc/filebeat/filebeat.yml
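For reference, the relevant entries look roughly like this (the IPs here are placeholders, not the real ones):
# /etc/zookeeper/conf/zoo.cfg -- the quorum members
server.1=10.0.0.11:2888:3888
server.2=10.0.0.12:2888:3888
server.3=10.0.0.13:2888:3888
# /etc/kafka/config/server.properties -- pointing the broker at the ensemble
zookeeper.connect=10.0.0.11:2181,10.0.0.12:2181,10.0.0.13:2181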

Zookeeper: Connection request from old client will be dropped if server is in r-o mode

Storm version: 0.8.2
ZooKeeper version: 3.4.5
We have a small Storm cluster (1 Nimbus and 3 supervisors), so we are using just 1 ZooKeeper instance, co-located with the Storm Nimbus.
Occasionally we start getting the following errors in the ZooKeeper logs, and our Storm cluster comes to a standstill:
2014-04-05 13:27:32,885 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#197] - Accepted socket connection from /10.0.1.183:56121
2014-04-05 13:27:32,886 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#793] - Connection request from old client /10.0.1.183:56121; will be dropped if server is in r-o mode
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#832] - Client attempting to renew session 0x1452dd02834002e at /10.0.1.183:56121
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#595] - Established session 0x1452dd02834002e with negotiated timeout 40000 for client /10.0.1.183:56121
On the Storm end we start seeing the following in the supervisor and worker logs:
2014-04-05 11:37:29 ConnectionStateManager [WARN] There are no ConnectionStateListeners registered.
2014-04-05 11:37:29 cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2014-04-05 11:37:31 ClientCnxn [WARN] Session 0x1452dd028340015 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-04-05 11:37:42 CuratorFrameworkImpl [ERROR] Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
Do we need to downgrade ZooKeeper to 3.3.3, or is there a known issue/config that we're missing?
We also experienced several issues with Storm 0.9 and ZooKeeper 3.4.x, though not exactly the one you describe.
The Storm mailing list is also reporting such incompatibility issues:
https://mail.google.com/mail/u/0/#search/label%3Astorm+zookeeper+3.4/144313a45ba069b5
https://mail.google.com/mail/u/0/#search/label%3Astorm+zookeeper+3.4/1447d95d10ce7582
This latter one points us to this Storm pull request, which should hopefully let us use ZK 3.4.x with future versions of Storm once it is released:
https://github.com/apache/incubator-storm/pull/29
Until then, I would recommend downgrading ZK to 3.3.6 (you can install a separate ZK instance just for Storm if you absolutely need ZK 3.4.x for another system). You could also clone the Storm code and merge that pull request locally, or compile the latest version of the trunk, but that's a bit adventurous and more tiresome than just waiting for those nice folks to deliver a new release for us :)
A workaround for this situation is to clear Storm's data directory (configured in storm.yaml => storm.local.dir) and then restart the supervisor. I did that in my test environment by clearing Storm's data directory and restarting the Nimbus and the supervisor.
I think it's caused by a previous crash of the Storm cluster, from which the supervisor cannot recover on its own.
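Roughly, the steps were (a sketch; /var/storm is just an assumed path, use whatever storm.local.dir points to in your storm.yaml):
# 1. stop the storm supervisor (and nimbus, if you are clearing its node too)
# 2. clear the local data directory configured as storm.local.dir
rm -rf /var/storm/*
# 3. start the supervisor (and nimbus) again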