Notification when buildbot worker is down - buildbot

It happens quite often that a Buildbot worker is down, and it usually takes a while until someone notices.
Is there a way in Buildbot to receive a notification when a worker is down?

Depending on the version of Buildbot you're using, the notify_on_missing parameter is what you're looking for. (https://docs.buildbot.net/1.8.0/manual/configuration/workers.html#when-workers-go-missing)
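A minimal master.cfg sketch of that parameter, assuming a Buildbot 1.x style configuration. The worker name, password, e-mail addresses, and the 300-second timeout are placeholders chosen for illustration, not values from the question. The linked documentation notes that the notification relies on a working MailNotifier; one with builders=[] is used here so that only worker-missing mails are sent.

```python
# master.cfg fragment (sketch): e-mail an admin when a worker stays disconnected.
from buildbot.plugins import reporters, worker

c = BuildmasterConfig = {}

c['workers'] = [
    worker.Worker(
        "worker1",                # placeholder worker name
        "worker1-password",       # placeholder password
        # Address (or list of addresses) to notify when the worker goes missing.
        notify_on_missing="admin@example.com",
        # Seconds the worker must be disconnected before the mail is sent
        # (placeholder value; the default is one hour).
        missing_timeout=300,
    ),
]

c['services'] = [
    # builders=[] means no regular build mails, only worker-missing mails.
    reporters.MailNotifier(fromaddr="buildbot@example.com", builders=[]),
]
```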

Related

What happens to the CronWorkflows in Cadence if a cluster is down when a workflow should have started?

What is the behaviour of CronWorkflows in the event that the Cadence cluster is down while a workflow should have started? When the cluster comes back, would we expect the workflow to still be started?
Yes, all workflows that didn't start yet are going to start right away. If multiple starts were missed only one will be executed.

Rundeck: 1 job, multiple nodes involved?

I have a question about Rundeck features. Is it possible to include conditions within job execution? As it is quite difficult to explain, here is an example:
You have 2 redundant firewalls in your network. You implement a job 'job1' whose aim is to update your firewalls' configuration. If the master is down, you do not want to update the slave: doing so would force the slave to restart, and there would be no firewall running for a short time. So, before running the update, I want to test that none of my firewalls is out of service. If the master is down, then do not update the slave.
So, is it possible to involve multiple nodes within one job?
Thanks for helping!
Create a job that pings both firewalls. If both are up, this job succeeds. Then create another job whose workflow runs this check job as a step before the update step, and make it proceed only if the check succeeds. That should solve your problem.
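The check job described above could be a small script whose exit code tells Rundeck whether to continue. The following is only a sketch under assumptions: the function names are hypothetical, the ping flags are the Linux ones (-c for count, -W for timeout in seconds), and the hosts are passed as command-line arguments.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(host: str) -> bool:
    """Return True if one ICMP echo to host succeeds (Linux-style ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def all_up(hosts: Iterable[str], probe: Callable[[str], bool] = ping_once) -> bool:
    """Return True only if every host answers the probe."""
    return all(probe(host) for host in hosts)


def main(argv: List[str]) -> int:
    """Exit status for the check step: 0 if all hosts are up, 1 if any is down, 2 on bad usage."""
    hosts = argv[1:]
    if not hosts:
        print("usage: check_firewalls.py HOST [HOST ...]")
        return 2
    return 0 if all_up(hosts) else 1


# In the script used as the check step, finish with:
#     import sys
#     sys.exit(main(sys.argv))
```

A non-zero exit status makes the check step fail, so the update step that follows it in the workflow is not run.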

How to debug celery delays and errors?

I am continuing a Django project from someone who used Celery along with Mandrill. Daily reports are supposed to be sent to customers, but for some reason no mail is sent for three days; the reports accumulate and are then all sent together. Since I am new to Celery, I want to know how to debug Celery delays and errors: which commands are commonly used, and what execution path should I follow?
Short tips:
Turn on debug logging for Celery (for example, start the worker with --loglevel=DEBUG); the log then shows when each task was received and how long it took to execute.
Install Flower, a popular tool for monitoring tasks.
Use Sentry for handy error tracking and aggregation.
Happy debugging ;)

How to flush out jobs on a Mesos slave?

I want to take a host (mesos-slave) out of the Mesos cluster in a clean manner by draining the executors it's running. Is it possible for mesos-master to stop giving further work to a mesos-slave but still receive updates for the currently running executors? If that's possible, I can make mesos-master stop giving any more work to this slave, and once the slave is done with its currently running executors, I can take it out. Please feel free to suggest a better way of achieving the same thing.
I think you are looking for maintenance primitives, which have recently been added to Mesos. A user doc is here.

Chronos + Mesosphere. How to execute tasks in parallel?

Good day everyone.
I have a single server for Chronos, Mesos and ZooKeeper, and I want to use Chronos to run my scripts daily. Some scripts today, some tomorrow, and so on.
The problem is that when I try to launch tasks one after another, only the first one executes correctly; the other one is lost somewhere. If I launch the first, then pause for 3-4 seconds and launch another, both are launched, but sequentially.
And I need to run them in parallel.
Can someone provide a hint on this? Maybe there are some settings that I must change?
You should schedule both tasks for the same start time, given in UTC, with a repeat period of 24 hours. In that case there is no reason why your tasks should not execute in parallel. Check the Chronos logs and the task logs in the Mesos sandbox for errors.
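As a rough sketch of that advice, the snippet below builds two Chronos job definitions that share one ISO 8601 repeating schedule (repeat forever, start at a UTC time, period of 24 hours). The job names, commands, owner address, and epsilon value are hypothetical placeholders; the resulting JSON would then be submitted to Chronos through its REST API or web UI.

```python
import json
from datetime import datetime, timezone


def daily_schedule(start: datetime) -> str:
    """Build an ISO 8601 repeating interval: repeat forever, every 24 hours, from a UTC start."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    return "R/{}/PT24H".format(start_utc.strftime("%Y-%m-%dT%H:%M:%SZ"))


def make_job(name: str, command: str, schedule: str) -> dict:
    """Return a Chronos-style job definition (placeholder owner and epsilon)."""
    return {
        "name": name,
        "command": command,
        "schedule": schedule,
        "epsilon": "PT30M",          # how late the job may still be started
        "owner": "ops@example.com",  # placeholder address
    }


# Both jobs share the same start time and period, so they become due together.
schedule = daily_schedule(datetime(2015, 1, 1, 2, 0, tzinfo=timezone.utc))
jobs = [
    make_job("report-a", "/opt/scripts/report_a.sh", schedule),
    make_job("report-b", "/opt/scripts/report_b.sh", schedule),
]

for job in jobs:
    print(json.dumps(job, indent=2))
```

Whether the two jobs actually run side by side still depends on the slave having enough free CPU and memory for both at the same time.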
You can certainly run all of these components (Chronos, master, slave, and ZK) on the same machine, although ZK really becomes valuable once you have HA with multiple masters.
As user4103259 suggested, check the master and slave logs for that LOST/failed taskId to see what exactly happened to it. A task could go LOST/failed for numerous reasons, anywhere along the task launch/running/completing process.