Airflow DAG that manipulates DB2 data raises a jaydebeapi.Error - db2

I followed the official Airflow documentation to build a DAG that connects to DB2. When I run a DAG that inserts or updates data, it raises a jaydebeapi.Error. Even though Airflow raises this error, the data is still inserted/updated in DB2 successfully.
The DAG is marked FAILED in the Airflow UI, and I don't know what step I'm missing.
My DAG code snippet:
with DAG("my_dag1", default_args=default_args,
schedule_interval="#daily", catchup=False) as dag:
cerating_table = JdbcOperator(
task_id='creating_table',
jdbc_conn_id='db2',
sql=r"""
insert into DB2ECIF.T2(C1,C1_DATE) VALUES('TEST',CURRENT DATE);
""",
autocommit=True,
dag=dag
)
DAG log:
[2022-06-20 02:16:03,743] {base.py:68} INFO - Using connection ID 'db2' for task execution.
[2022-06-20 02:16:04,785] {dbapi.py:213} INFO - Running statement:
insert into DB2ECIF.T2(C1,C1_DATE) VALUES('TEST',CURRENT DATE);
, parameters: None
[2022-06-20 02:16:04,842] {dbapi.py:221} INFO - Rows affected: 1
[2022-06-20 02:16:04,844] {taskinstance.py:1889} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/jdbc/operators/jdbc.py", line 76, in execute
return hook.run(self.sql, self.autocommit, parameters=self.parameters, handler=fetch_all_handler)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/hooks/dbapi.py", line 195, in run
result = handler(cur)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/jdbc/operators/jdbc.py", line 30, in fetch_all_handler
return cursor.fetchall()
File "/home/airflow/.local/lib/python3.7/site-packages/jaydebeapi/__init__.py", line 596, in fetchall
row = self.fetchone()
File "/home/airflow/.local/lib/python3.7/site-packages/jaydebeapi/__init__.py", line 561, in fetchone
raise Error()
jaydebeapi.Error
[2022-06-20 02:16:04,847] {taskinstance.py:1400} INFO - Marking task as FAILED. dag_id=my_dag1, task_id=creating_table, execution_date=20210101T000000, start_date=, end_date=20220620T021604
I have installed the required Python packages for Airflow, listed below:
Package (system) name / version
Airflow/2.3.2
IBM DB2/11.5.7
OpenJDK/15.0.2
JayDeBeApi/1.2.0
JPype1/0.7.2
apache-airflow-providers-jdbc/3.0.0
I have tried the latest versions of JayDeBeApi (1.2.3) and JPype1 (1.4.0); it still doesn't work.
I have also downgraded Airflow to 2.2.3 and 2.2.5 with the same result.
How can I solve this problem?

The error doesn't happen in the original insert query itself but in a fetchall introduced by this PR - https://github.com/apache/airflow/pull/23817
Downgrading to apache-airflow-providers-jdbc/2.1.3 might be an easy workaround.
To get to the root cause, set the DEBUG logging level in Airflow and see why the fetchall causes the error; having the full traceback will help.
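If downgrading the provider isn't an option, here is a minimal sketch of an alternative (my own suggestion, not part of the answer above): run the INSERT through JdbcHook inside a PythonOperator, so no result handler ever calls fetchall() on a statement that returns no rows. The connection id and table come from the question; the DAG id, task id, function name and start date below are made up for illustration.
```
import pendulum
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.jdbc.hooks.jdbc import JdbcHook


def _insert_row():
    hook = JdbcHook(jdbc_conn_id="db2")
    # No handler is passed to run(), so nothing tries to fetch rows
    # from an INSERT that produces no result set.
    hook.run(
        "insert into DB2ECIF.T2(C1, C1_DATE) VALUES('TEST', CURRENT DATE)",
        autocommit=True,
    )


with DAG(
    "my_dag1_jdbc_hook",                      # hypothetical DAG id
    start_date=pendulum.datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    insert_row = PythonOperator(task_id="insert_row", python_callable=_insert_row)
```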

Related

Airflow scheduler is throwing out an error - 'DisabledBackend' object has no attribute '_get_task_meta_for'

I am trying to install Airflow (distributed mode) in WSL. I have set up the Airflow webserver, Airflow scheduler, Airflow worker, Celery (3.1) and RabbitMQ.
While running the Airflow scheduler it throws the error below, even though the result backend is set up.
ERROR
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/airflow/executors/celery_executor.py", line 92, in sync
state = task.state
File "/usr/local/lib/python3.6/dist-packages/celery/result.py", line 398, in state
return self._get_task_meta()['status']
File "/usr/local/lib/python3.6/dist-packages/celery/result.py", line 341, in _get_task_meta
return self._maybe_set_cache(self.backend.get_task_meta(self.id))
File "/usr/local/lib/python3.6/dist-packages/celery/backends/base.py", line 288, in get_task_meta
meta = self._get_task_meta_for(task_id)
AttributeError: 'DisabledBackend' object has no attribute '_get_task_meta_for'
https://issues.apache.org/jira/browse/AIRFLOW-1840
This is the exact error I am getting, but I couldn't find a solution.
Result backend:
result_backend = db+postgresql://postgres:****@localhost:5432/postgres
broker_url = amqp://rabbitmq_user_name:rabbitmq_password@localhost/rabbitmq_virtual_host_name
Please help; I have gone through almost all the documentation but couldn't find a solution.
I was facing the same issue on Celery 3.1.26.post2 (with RabbitMQ, PostgreSQL and Airflow). The reason is that the settings dictionary built in Celery's base.py (lib/python3.5/site-packages/celery/app/base.py) does not expose the result backend under the key CELERY_RESULT_BACKEND; it only captures it under the key result_backend.
So the solution is to go to the _get_config function in that base.py file and, at the end of the function just before the dictionary s is returned, add the line below:
s['CELERY_RESULT_BACKEND'] = s['result_backend']  # code to be added
return s
This solved the problem.
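If editing site-packages is not an option, below is a minimal sketch of the same idea applied as a monkeypatch at startup. This is my own variation, not part of the answer above, and it assumes Celery 3.1.x really does expose a _get_config method on the app class in celery/app/base.py as the answer describes:
```
# Hypothetical patch module, imported once before the Airflow scheduler/worker starts.
from celery.app.base import Celery

# Assumption: this private method exists on the app class, per the answer above.
_original_get_config = Celery._get_config


def _get_config_with_backend(self):
    s = _original_get_config(self)
    # Mirror the new-style key onto the old-style key this Celery version reads.
    s.setdefault('CELERY_RESULT_BACKEND', s.get('result_backend'))
    return s


Celery._get_config = _get_config_with_backend
```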

PyPI error when trying to install gcsfs in Google Composer (Airflow)

I use google composer-1.0.0-airflow-1.9.0. I used dask in one of my DAGs and wanted to set up Composer to use dask. One of the required packages for this DAG is gcsfs. When I tried to install it via the web UI I got the error below:
Composer Backend timed out. Currently running tasks are [stage: CP_COMPOSER_AGENT_RUNNING description: "Composer Agent Running. Latest Agent Stage: stage: DEPLOYMENTS_UPDATED\n ." response_timestamp { seconds: 1540331648 nanos: 860000000 } ].
Update:
The error comes from this line of code when dask tries to read a file from a GCS bucket: dd.read_csv(bucket)
log:
[2018-10-24 22:25:12,729] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 350, in get_fs_token_paths
[2018-10-24 22:25:12,733] {base_task_runner.py:98} INFO - Subtask: fs, fs_token = get_fs(protocol, options)
[2018-10-24 22:25:12,735] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 473, in get_fs
[2018-10-24 22:25:12,740] {base_task_runner.py:98} INFO - Subtask: "Need to install `gcsfs` library for Google Cloud Storage support\n"
[2018-10-24 22:25:12,741] {base_task_runner.py:98} INFO - Subtask: File "/usr/local/lib/python2.7/site-packages/dask/utils.py", line 94, in import_required
[2018-10-24 22:25:12,748] {base_task_runner.py:98} INFO - Subtask: raise RuntimeError(error_msg)
[2018-10-24 22:25:12,751] {base_task_runner.py:98} INFO - Subtask: RuntimeError: Need to install `gcsfs` library for Google Cloud Storage support
[2018-10-24 22:25:12,756] {base_task_runner.py:98} INFO - Subtask: conda install gcsfs -c conda-forge
[2018-10-24 22:25:12,758] {base_task_runner.py:98} INFO - Subtask: or
[2018-10-24 22:25:12,762] {base_task_runner.py:98} INFO - Subtask: pip install gcsfs
When I tried to install gcsfs in the Google Composer UI using PyPI, I got the error below:
{
insertId: "17ks763f726w1i"
logName: "projects/xxxxxxxxx/logs/airflow-worker"
receiveTimestamp: "2018-10-25T15:42:24.935880717Z"
resource: {…}
severity: "ERROR"
textPayload: "Traceback (most recent call last):
File "/usr/local/bin/gcsfuse", line 7, in <module>
from gcsfs.cli.gcsfuse import main
File "/usr/local/lib/python2.7/site-
packages/gcsfs/cli/gcsfuse.py", line 3, in <module>
fuse import FUSE
ImportError: No module named fuse
"
timestamp: "2018-10-25T15:41:53Z"
}
Unfortunately, your error message doesn't mean much to me.
gcsfs is pure python code, so it is very unlikely that anything is going wrong with installing it - as is done very commonly with pip or conda. The dependency libraries are a bunch of google ones, some of which may require compilation (I don't know), so I would suggest trying to find out from logs which one is stalling and taking it up with them. On the other hand, this kind of thing can often be a network/intermittent problem, so waiting may also fix things.
For the future, I recommend basing installations around conda, which never needs to compile anything and is generally better at dependency tracking.
This has to do with the fact that Composer and Airflow have silent dependencies and they are not synced, so if the gcsfs installation conflicts with an Airflow dependency, we get this error. More details here. The only workarounds (other than updating to the Nov 28 release of Composer) are:
Source: thanks to Jake Biesinger (jake.biesinger@infusionsoft.com)
Use a separate Kubernetes Pod for running various jobs, but it's a large change and requires infra we're not very familiar with (GKE).
This particular issue can also be solved by installing dbt in a PythonVirtualenvOperator, then having the python_callable re-use the virtualenv's bin dir, something like:
```
import os, subprocess, sys

def _run_cmd_in_virtual_env(cmd):
    # sys.argv[0] is the interpreter inside the temporary virtualenv,
    # so its directory also holds the console scripts installed there.
    subprocess.check_call(os.path.join(os.path.split(sys.argv[0])[0], cmd))

task = PythonVirtualenvOperator(python_callable=_run_cmd_in_virtual_env,
                                op_args=('dbt',))
# This calls the temporarily-installed dbt binary, e.g. /tmp/virtualenv-asdasd/bin/dbt.
```
I haven't tried this, but this might help you out.
In general, installing arbitrary system packages (like fuse, or whatever the dependencies of what you are trying to install happen to be) is not supported by Google Composer, as discussed here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!searchin/cloud-composer-discuss/sugimiyanto%7Csort:date/cloud-composer-discuss/jpxAGCPFkZo/mCx_P1LPCQAJ
However, you may be able to work around this by uploading the package folder you have installed locally (i.e. fuse) into your Google Cloud Storage bucket, for example gs://<your_bucket_name>/libs, so that it becomes a shared library location.
Then you can set the LD_LIBRARY_PATH environment variable in Google Composer to /home/airflow/gcs/libs, so that the dynamic loader looks for shared libraries in that directory.
Then try to reinstall gcsfs using PyPI in Google Composer.

Apache Airflow celery executor is not getting result backend

I am running Apache Airflow version 1.9.0, and when I try to run a task from the UI, I get the following error in the Airflow scheduler console:
[2018-05-08 12:09:06,737] {jobs.py:1077} INFO - No tasks to consider for execution.
[2018-05-08 12:09:06,738] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-08 12:09:06,738] {celery_executor.py:101} ERROR - Error syncing the celery executor, ignoring it:
[2018-05-08 12:09:06,738] {celery_executor.py:102} ERROR - No result backend configured. Please see the documentation for more information.
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/airflow/executors/celery_executor.py", line 83, in sync
state = async.state
File "/usr/local/lib/python2.7/dist-packages/celery/result.py", line 329, in state
return self.backend.get_status(self.id)
File "/usr/local/lib/python2.7/dist-packages/celery/backends/base.py", line 547, in _is_disabled
'No result backend configured. '
NotImplementedError: No result backend configured. Please see the documentation for more information.
In my airflow.cfg, I have the following variables in the [celery] section:
celery_app_name = airflow.executors.celery_executor
celeryd_concurrency = 16
worker_log_server_port = 8795
broker_url = amqp://guest:guest@localhost:5672//
celery_result_backend = amqp://guest:guest@localhost:5672//
flower_host = 0.0.0.0
flower_port = 5555
default_queue = default
What am I doing wrong here?
You should not point celery_result_backend to a RabbitMQ instance, since the purpose of this backend is to store information about the status of tasks, and RabbitMQ is not the right tool for that (please correct me if I'm mistaken).
You can use Redis if you want to keep using the same instance as both broker and backend, or alternatively you can use Postgres as the backend, which I recommend. A sample configuration for Postgres would be the following:
celery_result_backend = db+postgresql://airflow:****@postgres/airflow
More info is in the official docs.
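As a quick sanity check (my own suggestion, not part of the answer above), you can confirm from a Python shell that the Celery app built by Airflow actually picks up the new backend. This assumes Airflow 1.9's module layout, where the Celery app is exposed as airflow.executors.celery_executor.app:
```
# Hedged sketch: inspect the backend of the Celery app used by the CeleryExecutor.
# After fixing the config it should be a database/Redis backend,
# not celery.backends.base.DisabledBackend.
from airflow.executors.celery_executor import app

print(type(app.backend))
```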

Saltstack - Schedule to ensure service is running not working

I'm trying to set up a Saltstack schedule that will check to ensure that a service is running on the minion. However, it doesn't seem like service.running is working as a function on the schedule.
Here's my run.sls file:
test-service-sched:
  schedule.present:
    - name: test-service-sched
    - function: service.running
    - seconds: 60
    - job_kwargs:
        name: test-service
    - persist: True
    - enabled: True
    - run_on_start: True
And I execute the following: salt 'service*' state.apply run
This ends up with the following error on the minion:
2017-03-28 02:47:11,493 [salt.utils.schedule ][ERROR ][6172] Unhandled exception running service.running
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/salt/utils/schedule.py", line 826, in handle_func
message=self.functions.missing_fun_string(func))
File "/usr/lib/python2.6/site-packages/salt/utils/error.py", line 36, in raise_error
raise ex(message)
Exception: 'service.running' is not available.
I haven't seen anything in the documentation that says I can't run service.running from a schedule. Is it a known limitation of Salt? Or am I just doing it wrong?
I can use cmd.run, but it ends up spamming the logs with errors if the service is already running.
So, I was pointed in the right direction on the Salt Google Group. There's a difference between execution modules and state modules: service.running is a state module function, while the scheduler can only call execution module functions (which is why it reports 'service.running' is not available), so I had to reference it indirectly through state.apply. I used 2 files:
schedule.sls:
service_schedule:
  schedule.present:
    - function: state.apply
    - minutes: 1
    - job_args:
      - running
running.sls:
service_running:
  service.running:
    - name: test_service
Now, running salt 'service*' state.apply schedule did exactly what I wanted it to.

Celery: Be sure to commit the transaction for each poll iteration

I'm using django-celery and have set things up so I can call a task from the interactive shell; the task completes (as evidenced by the celery log) and I see the result in the celeryd output.
However, I seem unable to ever get the result of the task in the shell where I start the task:
>>> from mymodule.tasks import testTask
>>> res = testTask.delay()
>>> res.ready()
False

@task
def testTask():
    logger.info('LOGGER: start task')
    time.sleep(10)
    logger.info('LOGGER: stop task')
    return 5
I'm assuming this is due to the following error which I sometimes get:
TxIsolationWarning: Polling results with transaction isolation level repeatable-read within the same transaction may give outdated results. Be sure to commit the transaction for each poll iteration.
My question: how do I commit the transaction, and where is this done? Also, what is the issue here? Is Celery trying to access the info from MySQL while Django has locked the table?
Thanks in advance,
Check the transaction isolation level if you use MySQL as the broker:
http://docs.celeryproject.org/en/latest/faq.html#mysql-is-throwing-deadlock-errors-what-can-i-do
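Not part of the answer above, but a quick way to see which isolation level your Django connection is actually using (the TxIsolationWarning in the question mentions repeatable-read) is to query it from a Django shell. This sketch assumes MySQL 5.x, where the variable is still called @@tx_isolation:
```
# Hedged sketch: print the isolation level of the current Django/MySQL connection.
from django.db import connection

cur = connection.cursor()
cur.execute("SELECT @@tx_isolation")  # @@transaction_isolation on newer MySQL versions
print(cur.fetchone())
```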