Airflow web server starts without Gunicorn and is not accessible - webserver

I'm using Airflow 1.9 and it was working fine for over 2 months but somehow now I am not able to start airflow webserver on Gunicorn.
nohup airflow webserver $* > webserver_new.logs &
just starts the web server process but log does not contain any mention of Gunicorn. The UI is not accessible. I have checked that the environment variable $AIRFLOW_HOME points to the correct path.
Also when the web server is being started it doesn't create a webserver-pid file in $AIRFLOW_HOME.
When I uninstall Gunicorn and start the Airflow web server I do not get any error but without Gunicorn the UI is not accessible. Basically it behaves the same whether gunicorn is present or not.
Environment
I use a Python 2.7 virtualenv on a CentOS box. Few other developers updated some Python packages like pyhive, thrift and six. I have uninstalled all those and uninstalled Airflow using pip (and installed back again).
Log contents
The web server logs do not contain any mention of Gunicorn and the do not contain any other error when started from the command line. The DAGs are running but the UI was still down.
[2018-02-21 14:13:36,082] {default_celery.py:41} WARNING - Celery Executor will run without SSL
Additional observation
After a manual start of Gunicorn I found that the workers are getting timed out as soon as they are created.

I found out that the problem was a dag which had a for loop to generate dynamic tasks(all tasks were dyanmic) but the task ids were same for each iteration, I removed that dag and the webserver came back like charm.

Related

How to wait for full cloud-initialization before VM is marked as running

I am currently configuring a virtual machine to work as an agent within Azure (with Ubuntu as image). In which the additional configuration is running through a cloud init file.
In which, among others, I have the below 'fix' within bootcmd and multiple steps within runcmd.
However the machine already gives the state running within the azure portal, while still running the cloud configuration phase (cloud_config_modules). This has as a result pipelines see the machine as ready for usage while not everything is installed/configured yet and breaks.
I tried a couple of things which did not result in the desired effect. After which I stumbled on the following article/bug;
The proposed solution worked, however I switched to a rhel image and it stopped working.
I noticed this image is not using walinuxagent as the solution states but waagent, so I tried to replacing that like the example below without any success.
bootcmd:
- mkdir -p /etc/systemd/system/waagent.service.d
- echo "[Unit]\nAfter=cloud-final.service" > /etc/systemd/system/waagent.service.d/override.conf
- sed "s/After=multi-user.target//g" /lib/systemd/system/cloud-final.service > /etc/systemd/system/cloud-final.service
- systemctl daemon-reload
After this, also tried to set the runcmd steps to the bootcmd steps. This resulted in a boot which took ages and eventually froze.
Since I am not that familiar with rhel and Linux overall, I wanted to ask help if anyone might have some suggestions which I can additionally try.
(Apply some other configuration to ensure await on the cloud-final.service within a waagent?)
However the machine already had the state running, while still running the cloud configuration phase (cloud_config_modules).
Could you please be more specific? Where did you read the machine state?
The reason I ask is that cloud-init status will report status: running until cloud-init is done running, at which point it will report status: done
I what is the purpose of waiting until cloud-init is done? I'm not sure exactly what you are expecting to happen, but here are a couple of things that might help.
If you want to execute a script "at the end" of cloud-init initialization, you could put the script directly in runcmd, and if you want to wait for cloud-init in an external script you could do cloud-init status --wait, which will print a visual indicator and eventually return once cloud-init is complete.
On not too old Azure Linux VM images, cloud-init rather than WALinuxAgent acts as the VM provisioner. The VM is marked provisioned by the Azure cloud-init datasource module very early during cloud-init processing (source), before any cloud-init modules configurable with user data. WALinuxAgent is only responsible for provisioning Azure VM extensions. It does not appear to be possible to delay sending the 'VM ready' signal to Azure without modifying the VM image and patching the source code of cloud-init Azure datasource.

Marathon (Mesos) - Stuck in "Loading applications"

I am building a mesos cluster from scratch (using Vagrant, which is not relevant for this issue).
OS: Ubuntu 16.04 (trusty)
Setup:
Master -> Runs ZooKeeper, Mesos-master, Marathon and Chronos
Slave -> Runs Mesos-slave
This is my provisioning script for the master node https://github.com/zeitgeist2018/infrastructure/blob/fix-marathon/provision/scripts/install-master.sh.
I have managed to register de slave into Mesos, install Marathon and Chronos frameworks, and run scheduled jobs in Chronos (both with docker and shell commands), but I can't get Marathon to work properly. The UI gets stuck in "Loading applications" as soon as I open it, and when I try to call the API, the request hangs forever with no response. In the API I tried to get simple marathon information and do deployments, both with the same hanging result.
I've been checking Marathon logs but I don't see anything error there. Just a couple of logs that may (or not) be a hint:
[2020-03-08 10:33:21,819] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-6)
[2020-03-08 10:33:21,822] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-87)
[2020-03-08 10:33:25,957] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-7)
Installing jdk11 and choosing it as default fixed this issue for me without downgrading the Marathon to any other version.
in ubuntu 20.04:
sudo apt install openjdk-11-jre-headless
update-alternatives --config java
I increased the number of cpus, virtual machine in which the marathon was installed to 3 and the problem was solved.
I have managed to make it work. It was as simple as downgrading Marathon to v1.7.189. After that, it starts properly, and the API responds to requests.

Airflow: what do `airflow webserver`, `airflow scheduler` and `airflow worker` exactly do?

I've been working with Airflow for a while now, which was set up by a colleague. Lately I run into several errors, which require me to more in dept know how to fix certain things within Airflow.
I do understand what the 3 processes are, I just don't understand the underlying things that happen when I run them. What exactly happens when I run one of the commands? Can I somewhere see afterwards that they are running? And if I run one of these commands, does this overwrite older webservers/schedulers/workers or add a new one?
Moreover, if I for example run airflow webserver, the screen shows some of the things that are happening. Can I simply get out of this by pressing CTRL + C? Because when I do this, it says things like Worker exiting and Shutting down: Master. Does this mean I'm shutting everything down? How else should I get out of the webserver screen then?
Each process does what they are built to do while they are running (webserver provides a UI, scheduler determines when things need to be run, and workers actually run the tasks).
I think your confusion is that you may be seeing them as commands that tell some sort of "Airflow service" to do something, but they are each standalone commands that start the processes to do stuff. ie. Starting from nothing, you run airflow scheduler: now you have a scheduler running. Run airflow webserver: now you have a webserver running. When you run airflow webserver, it is starting a python flask app. While that process is running, the webserver is running, if you kill command, is goes down.
All three have to be running for airflow as a whole to work (assuming you are using an executor that needs workers). You should only ever had one scheduler running, but if you were to run two processes of airflow webserver (ignoring port conflicts, you would then have two separate http servers running using the same metadata database. Workers are a little different in that you may want multiple worker processes running so you can execute more tasks concurrently. So if you create multiple airflow worker processes, you'll end up with multiple processes taking jobs from the queue, executing them, and updating the task instance with the status of the task.
When you run any of these commands you'll see the stdout and stderr output in console. If you are running them as a daemon or background process, you can check what processes are running on the server.
If you ctrl+c you are sending a signal to kill the process. Ideally for a production airflow cluster, you should have some supervisor monitoring the processes and ensuring that they are always running. Locally you can either run the commands in the foreground of separate shells, minimize them and just keep them running when you need them. Or run them in as a background daemon with the -D argument. ie airflow webserver -D.

How to use sidekiq on Bluemix

I'm trying to use sidekiq on Bluemix. I think that I'm on the right track, but it's not working completely.
I have an app with Sinatra that uses sidekiq jobs to make many actions. I set the following line in my manifest.yml file:
command: bundle exec rackup config.ru -p $PORT && bundle exec sidekiq -r ./server.rb -c 3
I thought that with this command sidekiq will run, However, when I call the endpoint that creates a job, it's still on the "Queue" section on the Sidekiq panel.
What actions do I need to take to get sidekiq to process the job?
PS: I'm beginner on Bluemix. I'm trying to migrate my app from Heroku to Bluemix.
Straightforward answer to this question "as asked":
Your start-up command does not evaluate a second part, the one after '&&'. If you try starting that in your local environment, the result'll be the same. The server will start up and the console will simply tail the server logs, not technically evaluating to true until you send it a kill signal (so the part after '&&' will never run at the same time).
Subbing that with just '&' sort-of-kinda fixes that, since both will run at the same time.
command: bundle exec rackup config.ru -p $PORT & bundle exec sidekiq
What is not ideal with that solution? Eh, probably a lot of stuff. The biggest offender though: having two processes active at the same time, only one of them expected and observed ( the second one ).
Sending '(bluemix) cf stop' to the application instance created by the manifest with this command stops only the observed one before decommissioning the instance - in that case we can not be sure that the first process freed up external resources by properly sending notifications or closing the connections, or whatever.
What you probably could consider instead:
1. Point one.
Bluemix is a CF implementation, and with a quick manifest.yml deploy, there is nothing preventing you from having the app server and sidekiq workers run on separate instances.
2. Better shell.
command: sh -c 'command1 & command2 & wait'
**3. TBD, probably a lot of options, but I am a beginner as well. **
Separate app instances on CloudFoundry for your rack-based application and your workers would be preferable because you can then:
Scale web / workers independently (more traffic? Just scale the web application)
Deploy each component independently, if needed
Make sure each process is health-checked
The downside of using & to join commands, as suggested in the other answer, is that the first process will launch in the background. This means you won't have reliable monitoring and automatic restarts if the first process crashes.
There's a slightly out of date example on the CloudFoundry website which demos using two application manifests (one for web, one for workers) to deploy each part independently.

Jenkins kills JBoss server when job finishes

I use Ant to start/shutdown JBoss 5 server through Jenkins. Ant java spawn and fork are set to "true", so command is executed in the background.
Jenkins successfully starts up the server, waits two minutes (a "sleep" command in Jenkins), then after the sleep it for some strange reason shuts down the server. The sleep command is the last step in the build job. The shutdown says:
2013-01-29 17:03:39,332 INFO [org.jboss.bootstrap.microcontainer.ServerImpl] Runtime shutdown hook called, forceHalt: true
I googled it and tried the suggested -Xrs command, but it didn't help. What is happening here?
Jenkins have something called the process tree killer that will kill all processes created by the job (even those started with spawn and fork set to true).
There are some workarounds to this behavior.
disabling process tree killer
-Dhudson.util.ProcessTreeKiller.disable=true
or
set the env. var BUILD_ID=dontKillMe in the JBOSS process.
export BUILD_ID=dontKillMe
You can browse the ProcessTreeKill wiki article or jenkins JIRA to find various workarounds for this issue.
This source (comments) suggest other environment variables, apparently for older versions of Jenkins. For me it didn't work before I started using JENKINS(_SERVER)_COOKIE.