How do I get all pending tasks to show in the Celery Flower task view? - celery

I'm currently running Celery 5 with Flower 1.0.0, with Redis as both broker and backend.
Only tasks that are in their final result/tombstone state are visible in Flower. Is there an additional configuration change that I need to enable?
celery -A main worker --hostname=annotate_#%h --pool=solo --task-events
flower -> flower:1.0.0 tornado:6.1 humanize:3.10.0
software -> celery:5.1.0 (sun-harmonics) kombu:5.1.0 py:3.10.0b1
billiard:3.6.4.0 py-amqp:5.0.6
platform -> system:Linux arch:64bit, ELF
kernel version:5.4.0-1051-azure imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:amqp results:disabled
deprecated_settings: None
Worker configuration:
app = Celery(
    broker='redacted',
    backend='redis://redacted'
)
app.autodiscover_tasks()
app.conf.task_default_queue = 'redacted'
app.conf.task_track_started = True
app.conf.task_send_sent_event = True
app.conf.worker_send_task_events = True
app.conf.worker_hijack_root_logger = False
app.conf.worker_prefetch_multiplier = 1
app.conf.worker_enable_remote_control = True
app.conf.result_expires = 60 * 60 * 2 # 2 hour TTL in seconds
The snapshot I took (not reproduced here) shows only a few finished tasks, even though the queue itself has 10+ tasks that are pending or in progress.
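As a cross-check (not from the original post), the worker itself can be asked for tasks that have not reached a final state; if these calls return the queued and in-progress tasks but Flower still shows only finished ones, the problem is on the event/Flower side rather than in the worker configuration. The module name main below is an assumption, mirroring the -A main argument above.

# Sketch: query the worker directly for non-finished tasks via the remote-control API.
from main import app  # assumes the Celery app behind `celery -A main worker ...`

insp = app.control.inspect()
print("active:   ", insp.active())     # tasks currently being executed
print("reserved: ", insp.reserved())   # tasks prefetched by the worker, not yet started
print("scheduled:", insp.scheduled())  # tasks with an ETA/countdown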

Related

Dynamically change celery beat schedule params

I get schedule values from a .env file, and sometimes the parameters in the .env file change.
Is it possible to change the schedule values of already running celery beat tasks?
My celery.py:
import os
from celery import Celery
from celery.schedules import crontab
from dotenv import load_dotenv
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproj.settings')
app = Celery('myproj')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
load_dotenv()
orders_update_time = float(os.getenv("ORDERS_UPDATE_TIME"))
if not orders_update_time:
    orders_update_time = 60.0
remains_send_time = float(os.getenv("REMAINS_SEND_TIME"))
if not remains_send_time:
    remains_send_time = 60.0
app.conf.beat_schedule = {
    'wb_orders_autosaver': {
        'task': 'myapp.tasks.orders_autosave',
        'schedule': orders_update_time,
    },
    'wb_remains_autosender': {
        'task': 'myapp.tasks.remains_autosend',
        'schedule': remains_send_time,
    },
}
Yes, use django-celery-beat. That will allow you to save your schedule to the database, and you can use the Django admin UI to modify the schedule.
From the Django shell (or shell_plus), you can run the following commands to create your schedule:
from django_celery_beat.models import CrontabSchedule, PeriodicTask, PeriodicTasks  # auto-imported under shell_plus

schedule = CrontabSchedule(minute='0', hour='10')
schedule.save()
PeriodicTask.objects.create(
    crontab=schedule,
    task='myapp.tasks.orders_autosave',
    name='autosave orders',
)
schedule = CrontabSchedule(minute='15', hour='10')
schedule.save()
PeriodicTask.objects.create(
    crontab=schedule,
    task='myapp.tasks.remains_autosend',
    name='autosend remains',
)
PeriodicTasks.changed()
Or you can use the UI in the Django admin panel:
Select "Add periodic task"
Enter the information about your task and select "Save"
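One assumption worth stating explicitly (it is not mentioned in the answer above): beat has to run with django-celery-beat's database scheduler, otherwise it keeps using the beat_schedule defined in code. A minimal sketch, using the same CELERY settings namespace as the question's celery.py:

# settings.py -- picked up via config_from_object('django.conf:settings', namespace='CELERY')
CELERY_BEAT_SCHEDULER = "django_celery_beat.schedulers:DatabaseScheduler"

# Alternatively, pass the scheduler on the command line when starting beat:
#   celery -A myproj beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler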

airflow dag - task is immediately put into 'up_for_retry' state ('start_date' is 1 day ago)

I do not know whether I lack Airflow scheduler knowledge or whether this is a potential bug in Airflow.
The situation is like this:
My DAG's start date is set to "start_date": airflow.utils.dates.days_ago(1).
I uploaded the DAG to the folder that Airflow scans for DAGs.
I then turned the DAG on (it was 'off' by default).
The tasks in the pipeline immediately go into 'up_for_retry', and you do not really see what had been tried before.
Airflow version info: 1.10.14, running on Kubernetes in Azure, using the Celery executor with Redis.
The task instance details are listed below:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency: Task Instance State
Reason: Task is in the 'up_for_retry' state which is not a valid state for execution. The task must be cleared in order to be run.
Dependency: Not In Retry Period
Reason: Task is not ready for retry yet but will be retried automatically. Current date is 2021-05-17T09:06:57.239015+00:00 and task will be retried at 2021-05-17T09:09:50.662150+00:00.
Am I missing something that would let me judge whether this is a bug or expected behaviour?
In addition, below is the DAG definition, as requested.
import airflow
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from airflow.models import Variable
dag_args = {
    "owner": "our_project_team_name",
    "retries": 1,
    "email": ["ouremail_address_replaced_by_this_string"],
    "email_on_failure": True,
    "email_on_retry": True,
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
}

# Implement cluster reuse on Databricks, pick from light, medium, heavy cluster type based on workloads
clusters = Variable.get("our_project_team_namejob_cluster_config", deserialize_json=True)

databricks_connection = "our_company_databricks"
adl_connection = "our_company_wasb"

pipeline_name = "process_our_data_from_boomi"
dag = DAG(dag_id=pipeline_name, default_args=dag_args, schedule_interval="0 3 * * *")

notebook_dir = "/Shared/our_data_name"
lib_path_sub = ""
lib_name_dev_plus_branch = ""
atlas_library = {
    "whl": f"dbfs:/python-wheels/atlas{lib_path_sub}/atlas_library-0{lib_name_dev_plus_branch}-py3-none-any.whl"
}

create_our_data_name_source_data_from_boomi_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_source_data_from_boomi",
        "base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
    },
}

create_our_data_name_standardized_table_from_source_xml_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_standardized_table_from_source_xml",
        "base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
    },
}

create_our_data_name_enriched_table_from_standardized_notebook_params = {
    "existing_cluster_id": clusters["our_cluster_name"],
    "notebook_task": {
        "notebook_path": f"{notebook_dir}/create_our_data_name_enriched",
        "base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
    },
}

layer_1_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_source",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_source_data_from_boomi_notebook_params,
    libraries=[atlas_library],
)

layer_2_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_standardized",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_standardized_table_from_source_xml_notebook_params,
    libraries=[
        {"maven": {"coordinates": "com.databricks:spark-xml_2.11:0.5.0"}},
        {"pypi": {"package": "inflection"}},
        atlas_library,
    ],
)

layer_3_task = DatabricksSubmitRunOperator(
    task_id="Load_our_data_name_to_enriched",
    databricks_conn_id=databricks_connection,
    dag=dag,
    json=create_our_data_name_enriched_table_from_standardized_notebook_params,
    libraries=[atlas_library],
)
layer_1_task >> layer_2_task >> layer_3_task
After getting some help from @AnandVidvat about trying a retries=0 experiment, and help from a friend to swap the operator for either a DummyOperator or a PythonOperator, I can confirm that the issue has nothing to do with the DatabricksSubmitRunOperator or Airflow version 1.10.x, i.e. it is not an Airflow bug.
So, in summary: when a DAG has a meaningful operator, my setup fails on the first execution without producing any task log, and the retry then works OK (the task log hides the fact that there was a retry, because the failed attempt left no logs).
In order to reduce the total run time, the workaround/patch until the real cause is found is to set retry_delay to 10 seconds (the default is 5 minutes, which makes the DAG run unnecessarily long).
The next step is to figure out what causes this first failure, by checking the logs on the scheduler or worker pods in our current setup (Azure K8s, PostgreSQL, Redis, Celery executor).
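For reference, a minimal sketch of that retry_delay workaround (not part of the original answer; it goes into the same default_args dict used by the DAGs here):

from datetime import timedelta

dag_args = {
    # ...the other default_args from the DAG above stay unchanged...
    "retries": 1,
    "retry_delay": timedelta(seconds=10),  # Airflow's default is 5 minutes
}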
P.S. I used the DAG below for testing and to reach this conclusion.
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import time
from pprint import pprint
dag_args = {
    "owner": "min_test",
    "retries": 1,
    "email": ["c243d70b.domain.onmicrosoft.com@emea.teams.ms"],
    "email_on_failure": True,
    "email_on_retry": True,
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
}

pipeline_name = "min_test_debug_airflow_baseline_PythonOperator_1_retry"
dag = DAG(
    dag_id=pipeline_name,
    default_args=dag_args,
    schedule_interval="0 3 * * *",
    tags=["min_test_airflow"],
)


def my_sleeping_function(random_base):
    """This is a function that will run within the DAG execution"""
    time.sleep(random_base)


def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return "Whatever you return gets printed in the logs"


run_this = PythonOperator(
    task_id="print_the_context",
    provide_context=True,
    python_callable=print_context,
    dag=dag,
)

# Generate 3 sleeping tasks, sleeping 0.0, 0.1 and 0.2 seconds respectively
for i in range(3):
    task = PythonOperator(
        task_id="sleep_for_" + str(i),
        python_callable=my_sleeping_function,
        op_kwargs={"random_base": float(i) / 10},
        dag=dag,
    )
    task.set_upstream(run_this)

using supervisord to run lsyncd script

I'm trying to run my lsyncd script with supervisord in order to have it always running.
I've written this conf for my supervisor:
[program:autostart_lsyncd]
command=bash -c "lsyncd /home/sync/lsyncd_script.lua"
autostart=true
autorestart=unexpected
numprocs=1
startsecs = 0
stderr_logfile=/var/log/autostart_sync.err.log
stdout_logfile=/var/log/autostart_sync.out.log
The script runs OK at startup, but it always exits:
2018-04-09 09:48:49,638 INFO success: autostart_lsyncd entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-04-09 09:48:49,639 INFO exited: autostart_lsyncd (exit status 0; expected)
I can't tell whether this is the correct way to keep an lsyncd script alive or not.
Suggestions?
I'm using this configuration for supervisord in the file /etc/supervisor/conf.d/lsyncd.conf:
[program:lsyncd]
command=/usr/bin/lsyncd -nodaemon /etc/lsyncd/lsyncd.conf.lua
autostart=true
autorestart=unexpected
startretries=3
The important difference from the question's setup is the -nodaemon flag: lsyncd normally daemonizes, so the process supervisord launched exits almost immediately with status 0, which is exactly what the log above shows. And this is the configuration for lsyncd (/etc/lsyncd/lsyncd.conf.lua):
settings {
    logfile = "/var/log/lsyncd/lsyncd.log",
    statusFile = "/var/log/lsyncd/lsyncd.status"
}

sync {
    default.rsync,
    source="/var/www/html/sites/default/files",
    target="root@cdn:/var/www/html/sites/default/files",
    exclude = {"*.php", "*.po", "\.ht*"},
    rsync = {
        archive = false,
        acls = false,
        compress = true,
        links = false,
        owner = false,
        perms = false,
        verbose = true,
        rsh = "/usr/bin/ssh -p 22 -o StrictHostKeyChecking=no"
    }
}
I also configured SSH keys and installed rsync on the servers.

getting timeout when submitting fat jar to spark-jobserver (akka.pattern.AskTimeoutException)

I have built my job jar using sbt assembly so that all dependencies are in a single jar. When I try to submit my binary to spark-jobserver, I get an akka.pattern.AskTimeoutException.
I already modified my configuration to be able to submit large jars (I added parsing.max-content-length = 300m), and I also increased some of the timeouts in the configuration, but nothing helped.
After I run:
curl --data-binary @matching-ml-assembly-1.0.jar localhost:8090/jars/matching-ml
I am getting:
{
  "status": "ERROR",
  "result": {
    "message": "Ask timed out on [Actor[akka://JobServer/user/binary-manager#1785133213]] after [3000 ms]. Sender[null] sent message of type \"spark.jobserver.StoreBinary\".",
    "errorClass": "akka.pattern.AskTimeoutException",
    "stack": ["akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)", "akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)", "scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)", "scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)", "scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)", "akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:331)", "akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:282)", "akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:286)", "akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:238)", "java.lang.Thread.run(Thread.java:745)"]
  }
}
My configuration:
# Template for a Spark Job Server configuration file
# When deployed these settings are loaded when job server starts
#
# Spark Cluster / Job Server configuration
spark {
  # spark.master will be passed to each job's JobContext
  master = "local[4]"
  # master = "mesos://vm28-hulk-pub:5050"
  # master = "yarn-client"

  # Default # of CPUs for jobs to use for Spark standalone cluster
  job-number-cpus = 4

  jobserver {
    port = 8090
    context-per-jvm = false
    # Note: JobFileDAO is deprecated from v0.7.0 because of issues in
    # production and will be removed in future, now defaults to H2 file.
    jobdao = spark.jobserver.io.JobSqlDAO

    filedao {
      rootdir = /tmp/spark-jobserver/filedao/data
    }

    datadao {
      # storage directory for files that are uploaded to the server
      # via POST/data commands
      rootdir = /tmp/spark-jobserver/upload
    }

    sqldao {
      # Slick database driver, full classpath
      slick-driver = slick.driver.H2Driver

      # JDBC driver, full classpath
      jdbc-driver = org.h2.Driver

      # Directory where default H2 driver stores its data. Only needed for H2.
      rootdir = /tmp/spark-jobserver/sqldao/data

      # Full JDBC URL / init string, along with username and password. Sorry, needs to match above.
      # Substitutions may be used to launch job-server, but leave it out here in the default or tests won't pass
      jdbc {
        url = "jdbc:h2:file:/tmp/spark-jobserver/sqldao/data/h2-db"
        user = ""
        password = ""
      }

      # DB connection pool settings
      dbcp {
        enabled = false
        maxactive = 20
        maxidle = 10
        initialsize = 10
      }
    }

    # When using chunked transfer encoding with scala Stream job results, this is the size of each chunk
    result-chunk-size = 1m
  }

  # Predefined Spark contexts
  # contexts {
  #   my-low-latency-context {
  #     num-cpu-cores = 1         # Number of cores to allocate. Required.
  #     memory-per-node = 512m    # Executor memory per node, -Xmx style eg 512m, 1G, etc.
  #   }
  #   # define additional contexts here
  # }

  # Universal context configuration. These settings can be overridden, see README.md
  context-settings {
    num-cpu-cores = 2       # Number of cores to allocate. Required.
    memory-per-node = 2G    # Executor memory per node, -Xmx style eg 512m, #1G, etc.

    # In case spark distribution should be accessed from HDFS (as opposed to being installed on every Mesos slave)
    # spark.executor.uri = "hdfs://namenode:8020/apps/spark/spark.tgz"

    # URIs of Jars to be loaded into the classpath for this context.
    # Uris is a string list, or a string separated by commas ','
    # dependent-jar-uris = ["file:///some/path/present/in/each/mesos/slave/somepackage.jar"]

    # Add settings you wish to pass directly to the sparkConf as-is such as Hadoop connection
    # settings that don't use the "spark." prefix
    passthrough {
      #es.nodes = "192.1.1.1"
    }
  }

  # This needs to match SPARK_HOME for cluster SparkContexts to be created successfully
  # home = "/home/spark/spark"
}
# Note that you can use this file to define settings not only for job server,
# but for your Spark jobs as well. Spark job configuration merges with this configuration file as defaults.
spray.can.server {
  # uncomment the next lines for making this an HTTPS example
  # ssl-encryption = on
  # path to keystore
  #keystore = "/some/path/sjs.jks"
  #keystorePW = "changeit"

  # see http://docs.oracle.com/javase/7/docs/technotes/guides/security/StandardNames.html#SSLContext for more examples
  # typical are either SSL or TLS
  encryptionType = "SSL"
  keystoreType = "JKS"
  # key manager factory provider
  provider = "SunX509"
  # ssl engine provider protocols
  enabledProtocols = ["SSLv3", "TLSv1"]

  idle-timeout = 60 s
  request-timeout = 20 s
  connecting-timeout = 5s
  pipelining-limit = 2 # for maximum performance (prevents StopReading / ResumeReading messages to the IOBridge)

  # Needed for HTTP/1.0 requests with missing Host headers
  default-host-header = "spray.io:8765"

  # Increase this in order to upload bigger job jars
  parsing.max-content-length = 300m
}

akka {
  remote.netty.tcp {
    # This controls the maximum message size, including job results, that can be sent
    # maximum-frame-size = 10 MiB
  }
}
I ran into a similar issue. The way to solve it is a bit tricky. First you need to add spark.jobserver.short-timeout to your configuration. Just modify your configuration like this:
jobserver {
  port = 8090
  context-per-jvm = false
  short-timeout = 60s
  ...
}
The second (tricky) part is that you can't fix it without modifying the code of the spark-jobserver application. The attribute that causes the timeout is in the BinaryManager class:
implicit val daoAskTimeout = Timeout(3 seconds)
The default is set to 3 seconds, which apparently is not enough for a big jar. You can increase it to, for example, 60 seconds, which solved the problem for me:
implicit val daoAskTimeout = Timeout(60 seconds)
Actually, you can easily bring down the size of the jar. Some of the dependent jars can also be passed using dependent-jar-uris instead of assembling everything into one big fat jar.

How to use IPython 2.3.1 with StarCluster instead of 0.13.1?

StarCluster seems to use IPython 0.13.1 by default. Is there a way to upgrade this to IPython 2.3.1? Can it be done via the config file? Or manually after the cluster is started?
Here is my config, with only minor security changes:
[global]
DEFAULT_TEMPLATE=iptemplate
REFRESH_INTERVAL=5
[aws info]
aws_access_key_id = XXXXXXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
aws_region_name = us-west-2
aws_region_host = ec2.us-west-2.amazonaws.com
[keypair starcluster]
key_location = starcluster.pem
[plugin ipcluster]
SETUP_CLASS = starcluster.plugins.ipcluster.IPCluster
ENABLE_NOTEBOOK = True
NOTEBOOK_PASSWD = XXXX
[plugin ipclusterstop]
SETUP_CLASS = starcluster.plugins.ipcluster.IPClusterStop
[plugin ipclusterrestart]
SETUP_CLASS = starcluster.plugins.ipcluster.IPClusterRestartEngines
[plugin pypackages]
setup_class = starcluster.plugins.pypkginstaller.PyPkgInstaller
packages = scikit-learn, psutil, pandas
# Base configuration for IPython.parallel cluster
[cluster iptemplate]
KEYNAME = starcluster
CLUSTER_SIZE = 1
CLUSTER_USER = ipuser
CLUSTER_SHELL = bash
#REGION = us-east-1
NODE_IMAGE_ID = ami-706afe40 # REGION and NODE_IMAGE_ID go in pair
NODE_INSTANCE_TYPE = c1.xlarge # 8 CPUs
DISABLE_QUEUE = True # We don't need SGE, faster cluster startup
PLUGINS = pypackages, ipcluster
You can do it by updating setup.py. Add "ipython==2.3.1" to install_requires and rerun the setup command. It will update ipython to the version specified.
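For illustration, a minimal sketch of what that change looks like (the actual install_requires list in StarCluster's setup.py is longer; only the pinned IPython entry is the point here):

# setup.py (abridged, illustrative) -- pin IPython so StarCluster installs 2.3.1
from setuptools import setup, find_packages

setup(
    name="StarCluster",
    packages=find_packages(),
    install_requires=[
        # ...StarCluster's existing requirements stay here...
        "ipython==2.3.1",  # replaces the default 0.13.1 pin
    ],
)

After editing, rerun the setup command (for example python setup.py install) so the new pin takes effect.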