We are using Celery with RabbitMQ as the broker and Redis as the result backend. We are running into a problem where we see growth in the number of queues inside RabbitMQ with names of the form:
000d7a5b-f554-3817-bff4-1397407ae08a.reply.celery.pidbox
We are not able to understand why these queues are getting created, or why they are never removed.
Any help would be appreciated.
This GitHub issue gives some insight: https://github.com/celery/kombu/issues/294. I have seen the same issue on our Redis boxes. A comment on the issue recommends disabling gossip and mingle to overcome this.
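As a minimal sketch of what that recommendation looks like in practice (the broker/backend URLs and the add task below are placeholders, not your actual code), the workers can simply be started with the --without-gossip and --without-mingle flags:

    # Minimal sketch matching the setup in the question: RabbitMQ broker,
    # Redis result backend. Host names and credentials are placeholders.
    from celery import Celery

    app = Celery(
        "tasks",
        broker="amqp://guest:guest@rabbitmq-host:5672//",
        backend="redis://redis-host:6379/0",
    )

    @app.task
    def add(x, y):
        return x + y

    # Per the recommendation above, start workers with gossip and mingle
    # disabled so they stop exchanging broadcast traffic over the
    # <uuid>.reply.celery.pidbox queues:
    #
    #   celery -A tasks worker --without-gossip --without-mingle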
Hope this is a bit helpful.
Our use case is pretty simple; however, I haven't found a solution for it yet.
In the organization I work at, we decided to move to Kubernetes as our container manager in order to spin up slaves.
Until we moved to this kind of environment, we had dedicated slaves for each team. Each team got the resources it needed, and on that basis everything worked.
However, once we moved to Kubernetes, issues started to appear because every team shares the same pool of resources, which can lead to congestion or job failures.
The suggested solution was to create a Kubernetes cluster per team; however, that would burn out the teams involved in maintaining multiple clusters.
Searching online, I didn't find any available solution, so I'm asking here: what is the best way to approach this? I understand that we might need to implement a dispatcher, but that currently isn't possible with the way the Kubernetes plugin is built.
Thanks,
I want to take a host (mesos-slave) out of the Mesos cluster in a clean manner by draining the executors it is running. Is it possible for the mesos-master to stop giving further work to a mesos-slave while still receiving updates for the currently running executors? If that's possible, I can have the mesos-master stop assigning work to this slave and, once the slave is done with its currently running executors, take it out. Please feel free to suggest a better way of achieving the same thing.
I think you are looking for maintenance primitives, which were recently added to Mesos. A user doc is here.
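As a rough sketch of how those primitives are used (the master URL, hostname, and IP below are made up, and the payload layout follows the maintenance user doc, so double-check it against your Mesos version), you can POST a maintenance schedule to the master's operator endpoint:

    # Sketch: schedule a maintenance window for one agent via the Mesos
    # master's /maintenance/schedule endpoint. Master URL, hostname, and IP
    # are placeholders; verify the payload against your Mesos version.
    import time
    import requests

    MASTER = "http://mesos-master.example.com:5050"

    schedule = {
        "windows": [
            {
                "machine_ids": [
                    {"hostname": "slave1.example.com", "ip": "10.0.0.11"}
                ],
                "unavailability": {
                    # Start one hour from now, for a two-hour window
                    # (times are expressed in nanoseconds).
                    "start": {"nanoseconds": int((time.time() + 3600) * 1e9)},
                    "duration": {"nanoseconds": int(2 * 3600 * 1e9)},
                },
            }
        ]
    }

    resp = requests.post(MASTER + "/maintenance/schedule", json=schedule, timeout=10)
    resp.raise_for_status()

If I read the same doc correctly, once the executors have drained you can then take the machine down (and later bring it back) via the master's /machine/down and /machine/up endpoints.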
I recently ran across this Netflix blog article: http://techblog.netflix.com/2013/08/deploying-netflix-api.html
They talk about red/black deployment, where they run the old and new code side by side and direct production traffic to both. If something goes wrong, they roll back.
How does directing the traffic work, and is it possible to adapt this strategy with, e.g., two Docker containers?
One way of directing traffic is weighted routing, as you can do with AWS Route 53.
Initially you have 100% of traffic going to the server(s) running the old code. Then you gradually shift some of the traffic to the server(s) running the new code.
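As a hedged sketch of what that shift could look like through the Route 53 API via boto3 (the hosted zone ID, record name, IP addresses, and the 90/10 split are all made up for illustration):

    # Sketch: weighted DNS records in Route 53 via boto3, splitting traffic
    # 90/10 between the old and new stacks. Zone ID, record name, and IPs
    # are placeholders.
    import boto3

    route53 = boto3.client("route53")

    def upsert_weighted_record(identifier, ip, weight):
        route53.change_resource_record_sets(
            HostedZoneId="Z1EXAMPLE",
            ChangeBatch={
                "Changes": [
                    {
                        "Action": "UPSERT",
                        "ResourceRecordSet": {
                            "Name": "api.example.com.",
                            "Type": "A",
                            "SetIdentifier": identifier,
                            "Weight": weight,
                            "TTL": 60,
                            "ResourceRecords": [{"Value": ip}],
                        },
                    }
                ]
            },
        )

    # 90% of lookups resolve to the old stack, 10% to the new one. Shift
    # the weights over time to complete the cut-over, or back to 100/0 to
    # roll back.
    upsert_weighted_record("old-stack", "203.0.113.10", 90)
    upsert_weighted_record("new-stack", "203.0.113.20", 10)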
Also, as you can read in this blog, you can use Docker to achieve it:
Even with the best testing, things can go wrong after deployment and a rollback may be required. Containers make this easy and we've brought similar tools to the operating system with Project Atomic. Red/Black deployments can be done throughout the entire stack with Atomic and Docker.
I think they use Spinnaker to implement the red/black strategy: https://spinnaker.io/docs/concepts/
I suddenly became the admin of the cluster in my lab and I'm lost.
I have experience managing Linux servers, but not clusters.
A cluster seems to be quite different.
I figured out that the cluster is running CentOS and ROCKS.
I'm not sure what SGE is, or whether it is used in the cluster.
Would you point me to an overview or documentation on how the cluster is configured and how to manage it? I googled, but there seem to be lots of ways to build a cluster, and it is confusing where to start.
I too suddenly became a Rocks Clusters admin. While your CentOS knowledge will be handy, there is a 'Rocks' way of doing things which you need to read up on. Most of it starts with the rocks list and rocks set CLI commands, and they are very nice to work with once you learn them.
You should probably start by reading the documentation (the link below is for the newest version; you can find yours with 'rocks report version'):
http://central6.rocksclusters.org/roll-documentation/base/6.1/
You can read up on the SGE part at
http://central6.rocksclusters.org/roll-documentation/sge/6.1/
I would recommend signing up for the Rocks Clusters discussion mailing list at:
https://lists.sdsc.edu/mailman/listinfo/npaci-rocks-discussion
The list is very friendly.
Does anyone have any insight into how GitHub deals with the potential failure or temporary unavailability of a Redis server when using Resque?
Some seem to have put together semi-complicated solutions, as a holdover until redis-cluster, using ZooKeeper (see https://github.com/ryanlecompte/redis_failover and Solutions for resque failover redis). Others seem to have a 'poor man's failover' that promotes the slave to master at the first sign of connectivity issues, without coordination between Redis clients (but this seems problematic in the temporary-unavailability scenario).
The question: Has Defunkt ever talked about how GitHub handles Redis failure? Is there a best practice for failover that doesn't involve ZooKeeper?
The original post on Resque states that part of the rationale for selecting Redis was its master-slave capability, but the post doesn't describe how GitHub leverages this, since all workers need both read and write access to Redis (see https://github.com/blog/542-introducing-resque).
The base Resque library does not handle failures. If a box dies immediately after popping a message off the queue, the message is gone forever. You'll have to write your own code to handle failures, which is quite tricky.
https://github.com/resque/resque/issues/93
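For what it's worth, one common pattern for not losing a popped job is a "reliable queue" built on RPOPLPUSH. This is not what Resque itself does; the sketch below just illustrates the idea with redis-py, and the queue names and handle_job function are made up:

    # Sketch of a reliable queue with Redis: RPOPLPUSH atomically moves the
    # job onto a processing list, so a crash between pop and completion
    # leaves the job recoverable instead of lost. Queue names and
    # handle_job() are placeholders; redis-py 3.x API assumed.
    import redis

    r = redis.Redis(host="redis-host", port=6379, db=0)

    QUEUE = "queue:default"
    PROCESSING = "queue:default:processing"

    def handle_job(payload):
        print("working on", payload)  # stand-in for real work

    def work_once():
        # Atomically pop from the queue and push onto the processing list.
        job = r.rpoplpush(QUEUE, PROCESSING)
        if job is None:
            return False
        handle_job(job)
        # Acknowledge only after the job succeeded; if the worker crashed
        # above, the job is still sitting in PROCESSING, and a recovery
        # pass could push it back onto QUEUE.
        r.lrem(PROCESSING, 1, job)
        return True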