RabbitMQ Shovel plugin: creating duplicate data in case of node failure - Kubernetes

I am setting up the Shovel plugin in RabbitMQ. It works fine with one pod. However, we run on a Kubernetes cluster with multiple pods, and when a pod restarts, a separate instance of the shovel is created on each pod independently, which causes duplicate message replication to the destination.
The detailed steps are below.
We deploy RabbitMQ on a Kubernetes cluster using a Helm chart.
After that, we create the shovel using the RabbitMQ Management UI. Once created from the UI, the shovel works fine and does not replicate data multiple times to the destination.
When any pod gets restarted, it creates a separate shovel instance. The different shovel instances then replicate duplicate messages to the destination.
When we looked at the shovel status in the RabbitMQ UI, we found multiple instances of the same shovel, one running on each pod.
When we restart the shovel manually from the RabbitMQ UI, the issue is resolved and only one instance is visible in the UI.
So we concluded that, after a pod failure/restart, the shovel does not sync with the other nodes/pods to check whether the same shovel is already running elsewhere. We can work around this by restarting the shovel from the UI, but that is not a valid approach for production.
We do not see this issue with queues and exchanges.
Can anyone help us resolve this issue?
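The check done by eye in the UI (the same shovel listed on more than one pod) could also be automated. The following is only a minimal sketch: it assumes the shovel status list has already been fetched as JSON, for example from the management plugin's GET /api/shovels endpoint, and that each entry carries "name", "vhost", "node" and "state" fields. The sample data and node names are made up for illustration.

```python
from collections import defaultdict


def find_duplicate_shovels(statuses):
    """Return {(vhost, name): [nodes]} for shovels running on more than one node."""
    nodes_by_shovel = defaultdict(set)
    for status in statuses:
        # Only count instances that are actually moving messages.
        if status.get("state") != "running":
            continue
        key = (status.get("vhost"), status["name"])
        nodes_by_shovel[key].add(status["node"])
    return {
        key: sorted(nodes)
        for key, nodes in nodes_by_shovel.items()
        if len(nodes) > 1
    }


# Hypothetical status entries (assumed field names, invented values).
sample = [
    {"name": "my-shovel", "vhost": "/", "node": "rabbit@rabbitmq-0", "state": "running"},
    {"name": "my-shovel", "vhost": "/", "node": "rabbit@rabbitmq-1", "state": "running"},
    {"name": "other-shovel", "vhost": "/", "node": "rabbit@rabbitmq-0", "state": "running"},
]

duplicates = find_duplicate_shovels(sample)
print(duplicates)
```

A non-empty result would mean that at least one shovel is running on several nodes at once, which is the situation described above.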

We have recently seen similar problems. This seems to have been an issue since some 3.8.x version: https://github.com/rabbitmq/rabbitmq-server/discussions/3154
As far as I understand, it should be fixed from version 3.8.20 on. See
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.8.19
and
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.8.20
and
https://github.com/rabbitmq/rabbitmq-server/releases/tag/v3.9.2
I haven't had time yet to check whether it is really fixed in those versions.
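To see quickly whether a given broker is at or past the releases mentioned above, a small version comparison may help. This is only a sketch: the thresholds (3.8.20 on the 3.8 line, 3.9.2 otherwise) are taken from this answer, which itself has not verified the fix, and the helper names are made up. It only handles plain numeric versions such as "3.8.20".

```python
def parse_version(text):
    """Turn '3.8.20' into (3, 8, 20). Only plain numeric versions are handled."""
    return tuple(int(part) for part in text.split(".")[:3])


def likely_has_shovel_fix(version):
    """Check a version against the releases named in the answer (unverified)."""
    parsed = parse_version(version)
    if parsed[:2] == (3, 8):
        # 3.8 line: the answer expects the fix from 3.8.20 on.
        return parsed >= (3, 8, 20)
    # Other lines: the answer points at 3.9.2.
    return parsed >= (3, 9, 2)


for candidate in ("3.8.19", "3.8.20", "3.9.1", "3.9.2"):
    print(candidate, likely_has_shovel_fix(candidate))
```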

Related

How to avoid congestion when using Kubernetes pods as Jenkins slaves

Our use case is pretty simple; however, I haven't found a solution for it yet.
In the organization I'm working at, we decided to move to Kubernetes as our container manager in order to spin up slaves.
Until we moved to this kind of environment, we had dedicated slaves for each team. Each slave got the resources it needed, and that worked.
However, when we moved to Kubernetes, it started to cause issues, as all teams share the same pool of resources, which can lead to congestion or job failures.
The suggested solution was to create a Kubernetes cluster per team; however, maintaining multiple clusters would burn out the teams involved.
Searching online, I didn't find any available solution, hence I'm asking here: what is the best way to approach this? I understand that we might need to implement a dispatcher, but currently that's not possible given the way the Kubernetes plugin is developed.
Thanks,

Cassandra Kubernetes Statefulset NoHostAvailableException

I have an application deployed in Kubernetes. It consists of Cassandra, a Go client, and a Java client (and other things, but they are not relevant for this discussion).
We used Helm to do our deployment.
We are using a StatefulSet and a headless service for Cassandra.
We have configured the clients to use the headless service DNS name as a contact point for cluster creation.
Everything works great.
That is, until all of the nodes go down, or some other nefarious combination of nodes goes down. I am simulating this by deleting all of the Cassandra pods in succession using kubectl delete.
When I do this, the clients throw NoHostAvailableException.
In Java it's
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
which eventually becomes
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
In Go it's
"gocql: no hosts available in the pool"
I can query Cassandra using cqlsh, the node seems fine according to nodetool status, and all of the new IPs are there.
The image I am using doesn't have netstat, so I have not yet confirmed it's listening on the expected port.
By running bash on the two client pods, I can see that the DNS makes sense using nslookup, but...
netstat does not show any established connections to Cassandra (they are present before I take the nodes down).
If I restart my clients, everything works fine.
I have googled a lot (I mean a lot). Most of what I have found is related to never having a working connection, and the most relevant things seem very old (like 2014, 2016).
A node going down is very basic, and I would expect everything to work: the Cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, etc.
If I take all of my Cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct node, but at least it works).
So, is there a point where this behaviour is expected? I.e. I have taken everything down, and no replacement node was up and running before the last node of the original cluster was taken down. Is this behaviour expected?
To me it seems like it should be an easy issue to resolve, but I'm not sure what's missing or incorrect. I am surprised that both clients show the same symptoms, which makes me think something is not happening with our StatefulSet and service.
I think the problem might lie in the headless DNS service. If all of the nodes go down completely and there are no nodes at all available via the service until pods are replaced, it could cause the driver to hang.
I've noted that you've used Helm for your deployments but you may be interested in this document for connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.
I'm going to contact some of the authors and get them to respond here. Cheers!
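Since restarting the clients fixes the problem, one possible workaround until the root cause is known is to do the equivalent in-process: throw away the session and build a new one (which resolves the headless service name again) when every known host is gone. The sketch below only shows that pattern. It does not use any real driver API: connect and is_no_host_error are hypothetical callables you would supply for your own driver.

```python
class RebuildingClient:
    """Wraps a session and rebuilds it once when all known hosts are gone."""

    def __init__(self, connect, is_no_host_error):
        # connect: hypothetical factory that returns a fresh session and
        # resolves the contact point again each time it is called.
        self._connect = connect
        # is_no_host_error: hypothetical predicate that recognises the
        # driver's "no host available" exception.
        self._is_no_host_error = is_no_host_error
        self._session = connect()

    def execute(self, query):
        try:
            return self._session.execute(query)
        except Exception as exc:
            if not self._is_no_host_error(exc):
                raise
            # The session only knows stale pod IPs: rebuild it and retry once.
            self._session = self._connect()
            return self._session.execute(query)
```

The retry is deliberately limited to a single attempt so that a cluster which is really down still surfaces the error to the caller.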

Two versions of fluentd fighting over port in my cluster

Somehow, I have 2 versions of fluentd running in my cluster:
They end up fighting over the same port. They just keep cranking away, trying to start up on that port, and it saturates all the CPU in the cluster.
unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24231
/opt/google-fluentd/embedded/lib/ruby/2.6.0/socket.rb:201:in 'bind'
I've tried deleting the daemon sets and deployments, but they just keep coming back. I also tried ssh'ing into the machines and killing the process on that port. Nothing seems to work.
Obviously, I only want one version of fluentd to run (and I'm not even sure which one).
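To confirm that something is already bound to the port from the error (24231) before digging further, a quick bind probe can be run on the node, or inside the pod's network namespace. This is just a sketch: the function name is made up, and it only tells you that the port is taken, not which process holds it.

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if binding host:port fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            # Any other failure (e.g. permissions) is not what we are probing for.
            raise
    return False
```

Calling port_in_use(24231) on an affected machine would be expected to return True while one of the fluentd instances is holding the port.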
I seem to have fixed it. I went to the cluster edit page in the GCP dashboard, and the Kubernetes Engine Monitoring dropdown was blank. It seems not even the dropdown could decide what to display here.
It seems the automated agent, or whatever it is, seriously messed up here and had 2 versions of the logging and monitoring system running, fighting over a port and crushing the CPU on every machine in the cluster. On top of that, I couldn't delete the daemon sets, pods, or deployments. It seems Google treats these as special somehow, maybe with some kind of automated agent; I don't know.
From the dropdown, I just selected System and workload logging and monitoring, saved, and it applied the changes.
Everything is looking good so far, but this whole event has me worried. I didn't do anything. This just... happened.
This is a dev cluster, but if it was a production cluster...

How to overcome the IllegalAccessError during startup of a connector in Kafka

I am writing a connector for Kafka Connect. The error I see during startup of the connector is
java.lang.IllegalAccessError: tried to access field org.apache.kafka.common.config.ConfigTransformer.DEFAULT_PATTERN from class org.apache.kafka.connect.runtime.AbstractHerder
The error seems to happen at https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractHerder.java#L449
Do I need to set this DEFAULT_PATTERN manually? Is it not set by default?
I am using the docker image confluentinc/cp-kafka:5.0.1. The version of connect-api I am using in my connector app is org.apache.kafka:connect-api:2.0.0. I am running my set up inside Kubernetes.
The issue was resolved when I changed the image to confluentinc/cp-kafka:5.0.0-2.
I already tried this option before posting the question, but the pod was in a Pending state and was refusing to start. I thought that it could have been an issue with the image. Upon doing some more research later, I came to know that sometimes Kubernetes is unable to allot enough resources and hence pods can stay in Pending state.
I tried the image confluentinc/cp-kafka:5.0.0-2 and it works fine.

Service Fabric stateful service no longer replicates

FURTHER UPDATE: this error has not occurred since the November update.
EDIT: You may want to read this if your stateful service stops working for no apparent reason. A typical sign, using a WordCount-like app for example, is that the service deployment reports that one partition is remaining and gives up after 5 tries. The stateless service starts OK. The diagnostics report multiple "Constructed instance of type WordCountService" messages. If you see this, then you may have the same problem I have. No amount of uninstalling the VS/SF/Azure SDKs helps. I now use a VM template with VS/Azure/SF installed and just delete and recreate it each time this error occurs (it is rare but has happened several times). I assume MSFT is aware and fixing this for beta.
ORIGINAL:
Summary question: Is there a way to reset Service Fabric completely?
Background: I have a stateful/stateless app service based on the WordCount example. All of a sudden, after deployment, the app no longer replicates the stateful service (1 instance, 2 replicas). The stateless service is deployed OK (one instance, no replicas).
The partition status of the primary partition is reporting "Partition is below target replica or instance count". The replica status is "InBuild" for the replicas; the primary is OK.
On the primary node, there is a warning "Replica had multiple failures during open. Error = -2147024894."
I have tried cleaning the cluster, uninstalling/reinstalling service fabric, deleting the SfDevCluster directory entirely etc.
If I copy the exact code to another computer with service fabric installed, it works (and I mean copy/paste the whole solution directory).
I had a similar problem last week, but it caused the host service not to start. I tried uninstalling/reinstalling/cleaning/removing SDKs, removing Visual Studio, etc. The only thing that fixed it was a reinstall of Windows.