Airflow scheduler fails to start tasks - postgresql

My problem:
Airflow scheduler is not assigning tasks.
Background:
I have Airflow running successfully on my local machine with sqlitedb. The sample dags as well as my custom DAGs ran without any issues.
When I try to migrate from sqlite database to Postgres (using this guide), the scheduler no longer seems to be assigning tasks. The DAG get stuck on "running" state but no task in any DAGs ever gets assigned a state.
Troubleshooting steps I've taken
The web server and the scheduler are running
The DAG is set to "ON".
After running airflow initdb, the public schema is populated with all of the airflow tables.
The user in my connection string owns the database as well as every table in the public schema.
Scheduler Log
The scheduler log keeps posting out this WARNING but I have not been able to use it to find any useful information aside form this other post with no responses.
[2020-04-08 09:39:17,907] {dag_processing.py:556} INFO - Launched DagFileProcessorManager with pid: 44144
[2020-04-08 09:39:17,916] {settings.py:54} INFO - Configured default timezone <Timezone [UTC]>
[2020-04-08 09:39:17,927] {settings.py:253} INFO - settings.configure_orm(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=44144
[2020-04-08 09:39:19,914] {dag_processing.py:663} WARNING - DagFileProcessorManager (PID=44144) exited with exit code -11 - re-launching
Environment
PostgreSQL version 12.1
Airflow v1.10.9
This is all running on a MacBook Pro (Catalina) in a conda virtual environment.

Postgres was installed using postgresapp. Updated postgresapp to version 2.3.3e. PostgresSQL is still version 12.1 but by updating the postgresapp, the issue was solved.

Related

Jobs remain executing after Rundeck crashes/restarts

Details:
App Version: Rundeck Community - v4.2.1 (though the issue below has
been around for some time)
Database: Postgresql 9.6
I'm wondering if anyone else has experienced this?
In the event of a rundeck crash or server crash jobs which were executing remain in a running state. Navigating to the execution shows the following message:
"Workflow State and Log Output is not available."
This causes issues since a running execution will block subsequent executions and are not alerted on etc.
Is there a config setting that can be used to force any jobs in a RUNNING state to fail in the event of a rundeck service crash/restart?
regards
Right now the way is to use Missed Job Fires feature with Job Resume (both features only available on PagerDuty Process Automation On Prem, formerly "Rundeck Enterprise") to resume missed jobs/steps after a Rundeck server failure.

Marathon (Mesos) - Stuck in "Loading applications"

I am building a mesos cluster from scratch (using Vagrant, which is not relevant for this issue).
OS: Ubuntu 16.04 (trusty)
Setup:
Master -> Runs ZooKeeper, Mesos-master, Marathon and Chronos
Slave -> Runs Mesos-slave
This is my provisioning script for the master node https://github.com/zeitgeist2018/infrastructure/blob/fix-marathon/provision/scripts/install-master.sh.
I have managed to register de slave into Mesos, install Marathon and Chronos frameworks, and run scheduled jobs in Chronos (both with docker and shell commands), but I can't get Marathon to work properly. The UI gets stuck in "Loading applications" as soon as I open it, and when I try to call the API, the request hangs forever with no response. In the API I tried to get simple marathon information and do deployments, both with the same hanging result.
I've been checking Marathon logs but I don't see anything error there. Just a couple of logs that may (or not) be a hint:
[2020-03-08 10:33:21,819] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-6)
[2020-03-08 10:33:21,822] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-87)
[2020-03-08 10:33:25,957] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-7)
Installing jdk11 and choosing it as default fixed this issue for me without downgrading the Marathon to any other version.
in ubuntu 20.04:
sudo apt install openjdk-11-jre-headless
update-alternatives --config java
I increased the number of cpus, virtual machine in which the marathon was installed to 3 and the problem was solved.
I have managed to make it work. It was as simple as downgrading Marathon to v1.7.189. After that, it starts properly, and the API responds to requests.

Airflow 1.9 - Tasks stuck in queue

Latest Apache-Airflow install from PyPy (1.9.0)
Set up includes:
Apache-Airflow
Apache-Airflow[celery]
RabbitMQ 3.7.5
Celery 4.1.1
Postgres
I have the installation across 3 hosts.
Host #1
Airflow Webserver
Airflow Scheduler
RabbitMQ Server
Postgres Server
Host #2
Airflow Worker
Host #3
Airflow Worker
I have a simple DAG that executes a BashOperator Task that runs every 1 minute. I can see the scheduler "queue" the job however, it nevers gets added to a Celery/RabbitMQ queue and gets picked up by the workers. I have a custom RabbitMQ user, authentication seems fine. Flower, however, doesn't show any of the queues populating with data. It does see the two worker machines listening on their respective queues.
Things I've checked:
Airflow Pool configuration
Airflow environment variables
Upgrade/Downgrade Celery and RabbitMQ
Postgres permissions
RabbitMQ Permissions
DEBUG level airflow logs
I read the documentation section about jobs not running. My "start_date" variable is a static date that exists before the current date.
OS: Centos 7
I was able to figure it out but I'm not sure why this is the answer.
Changing the "broker_url" variable to use "pyamqp" instead of "amqp" was the fix.

Airflow web server starts without Gunicorn and is not accessible

I'm using Airflow 1.9 and it was working fine for over 2 months but somehow now I am not able to start airflow webserver on Gunicorn.
nohup airflow webserver $* > webserver_new.logs &
just starts the web server process but log does not contain any mention of Gunicorn. The UI is not accessible. I have checked that the environment variable $AIRFLOW_HOME points to the correct path.
Also when the web server is being started it doesn't create a webserver-pid file in $AIRFLOW_HOME.
When I uninstall Gunicorn and start the Airflow web server I do not get any error but without Gunicorn the UI is not accessible. Basically it behaves the same whether gunicorn is present or not.
Environment
I use a Python 2.7 virtualenv on a CentOS box. Few other developers updated some Python packages like pyhive, thrift and six. I have uninstalled all those and uninstalled Airflow using pip (and installed back again).
Log contents
The web server logs do not contain any mention of Gunicorn and the do not contain any other error when started from the command line. The DAGs are running but the UI was still down.
[2018-02-21 14:13:36,082] {default_celery.py:41} WARNING - Celery Executor will run without SSL
Additional observation
After a manual start of Gunicorn I found that the workers are getting timed out as soon as they are created.
I found out that the problem was a dag which had a for loop to generate dynamic tasks(all tasks were dyanmic) but the task ids were same for each iteration, I removed that dag and the webserver came back like charm.

Airflow Tasks are not getting triggered

I am scheduling the dag and it shows in the running state but tasks are not getting triggered.Airflow scheduler and web server are up and running. I toggled the dag as ON on the UI. Still i cant able to fix the issue.I am using CeleryExecutor tried changing to the SequentialExecutor but no luck.
If you are using the CeleryExecutor you have to start the airflow workers too.
cmd: airflow worker
You need the following commands:
airflow worker
airflow scheduler
airflow webserver
If it still doesn't probably you have set start_date: datetime.today().