"Dag Seems to be missing" error in a Cloud Composer Airflow Dynamic DAG - google-cloud-storage

I have a dynamic Airflow DAG in Google Cloud Composer gets created, listed in the web-server and ran (backfill) without error.
However, there are issues:
When clicking on the DAG in web url, it says "DAG seems to be
missing"
Can't see Graph view/Tree view as showing the error above
Can't manually trigger the DAG as showing the error above
Trying to fix this for couple days...any hint will be helpful. Thank you!
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from google.cloud import storage
from airflow.models import Variable
import json
args = {
'owner': 'xxx',
'start_date':'2020-11-5',
'provide_context': True
}
dag = DAG(
dag_id='dynamic',
default_args=args
)
def return_bucket_files(bucket_name='xxxxx', **kwargs):
client = storage.Client()
bucket = client.get_bucket(bucket_name)
blobs = bucket.list_blobs()
file_list = [blob.name for blob in blobs]
return file_list
def dynamic_gcs_to_gbq_etl(file, **kwargs):
mapping = json.loads(Variable.get("xxxxx"))
database = mapping[0][file]
table = mapping[1][file]
task=GoogleCloudStorageToBigQueryOperator(
task_id= f'gcs_load_{file}_to_gbq',
bucket='xxxxxxx',
source_objects=[f'{file}'],
destination_project_dataset_table=f'xxx.{database}.{table}',
write_disposition="WRITE_TRUNCATE",
autodetect=True,
skip_leading_rows=1,
source_format='CSV',
dag=dag)
return task
start_task = DummyOperator(
task_id='start',
dag=dag
)
end_task = DummyOperator(
task_id='end',
dag=dag)
push_bucket_files = PythonOperator(
task_id="return_bucket_files",
provide_context=True,
python_callable=return_bucket_files,
dag=dag)
for file in return_bucket_files():
gcs_load_task = dynamic_gcs_to_gbq_etl(file)
start_task >> push_bucket_files >> gcs_load_task >> end_task

This issue means that the Web Server is failing to fill in the DAG bag on its side - this problem is most likely not with your DAG specifically.
My suggestion would be right now to try and restart the web server (via the installation of some dummy package).
Similar issues reported in this post as well here.

Related

Getting a ModuleNotFoundError while running an celery app on a poetry environment

So I have a simple celery app code(app.py) which looks something like this:
``from celery import Celery
from functools import lru_cache`
#lru_cache def get_celery() -> Celery:
celery_app = Celery(
'worker',
broker='redis://XXXXXXXXX/4',
backend='redis://XXXXXXXX/5',
)
celery_app.autodiscover_tasks(
['api.v1.worker.worker'],
force=True
)
return celery_app`
Now when I run this app.py file through the VSCode terminal I get a error like this:
ModuleNotFoundError: No module named 'api'
My directory looks something like this:
app
|__api
| |__v1
| |__endpoints
| |__utils
| |__worker
| |__worker.py
|__celery_app
|__app.py
The API and celery_app folders are on the same level. Can anyone please help me understand and fix this issue.
I'm running this entire project on a poetry shell and my pyproject.toml file looks something like this:
[tool.poetry.dependencies]
python = "3.10.9" requests = "^2.28.2" numpy = "^1.24.1" pandas = "^1.5.3" fastapi = "^0.89.1" datetime = "^5.0" mimir = "^0.6.3" jsonify = "^0.5" uvicorn = {extras = ["standard"], version = "^0.20.0"} appdirs = "^1.4.4" black = "^22.12.0" click = "^8.1.3" pydantic = "^1.10.4" pymongo = "^4.3.3" regex = "^2022.10.31" motor = "^3.1.1" json-utils = "^0.2" celery = "^5.2.7" redis = "^4.4.2"
To try resolving this issue I commented out this part of the code:
celery_app.autodiscover_tasks( ['api.v1.worker.worker'], force=True )
and the celery app worked fine. But i want the code to work along with the api.v1.worker.worker statement.

programmatically export grafana dashboard data

I have a visual in grafana. I can manually go to the menu click export and export the time series data in json. This works great. Is there a way I can script that in python?. Is there some api I can hit that will return the json of a visual?
I was googling around and it looks like I can use the api to create dashboards/visuals and administer them but not sure where how to use the api to export the data.
Here's a Python script to export then dashboard json, not the presented data. Tested on Python 2.7:
#!/usr/bin/env python
"""Grafana dashboard exporter"""
import json
import os
import requests
HOST = 'http://localhost:3000'
API_KEY = os.environ["grafana_api_key"]
DIR = 'exported-dashboards/'
def main():
headers = {'Authorization': 'Bearer %s' % (API_KEY,)}
response = requests.get('%s/api/search?query=&' % (HOST,), headers=headers)
response.raise_for_status()
dashboards = response.json()
if not os.path.exists(DIR):
os.makedirs(DIR)
for d in dashboards:
print ("Saving: " + d['title'])
response = requests.get('%s/api/dashboards/%s' % (HOST, d['uri']), headers=headers)
data = response.json()['dashboard']
dash = json.dumps(data, sort_keys=True, indent=4, separators=(',', ': '))
name = data['title'].replace(' ', '_').replace('/', '_').replace(':', '').replace('[', '').replace(']', '')
tmp = open(DIR + name + '.json', 'w')
tmp.write(dash)
tmp.write('\n')
tmp.close()
if __name__ == '__main__':
main()
Usage:
You should first create an API key in Grafana and then run:
grafana_api_key=my-key python export-dash.py
Credit: This is a simplified version of https://github.com/percona/grafana-dashboards/blob/master/misc/export-dash.py
http://docs.grafana.org/http_api/data_source/#data-source-proxy-calls.
Visit your browser console (network tab) and you will see how it works there.
You could also use this Go client https://github.com/netsage-project/grafana-dashboard-manager
Its purpose is not what you are looking for, but it is possible to reuse that code.

No FileSystem for scheme: cos

I'm trying to connect to IBM Cloud Object Storage from IBM Data Science Experience:
access_key = 'XXX'
secret_key = 'XXX'
bucket = 'mybucket'
host = 'lon.ibmselect.objstor.com'
service = 'mycos'
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.myCos.access.key', access_key)
hconf.set('fs.cos.myCos.endpoint', 'http://' + host)
hconf.set('fs.cose.myCos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
This returns:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: cos
I'm guessing I need to use the 'cos' scheme based on the stocator docs. However, the error suggests stocator isn't available or is an old version?
Any ideas?
Update 1:
I have also tried the following:
sqlCxt = SQLContext(sc)
hconf = sc._jsc.hadoopConfiguration()
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
service = 'mycos'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('cos://{0}.{1}/{2}'.format(bucket, service, obj))
print(rdd.count())
However, this time the response was:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No object store for: cos
at com.ibm.stocator.fs.ObjectStoreVisitor.getStoreClient(ObjectStoreVisitor.java:121)
...
Caused by: java.lang.ClassNotFoundException: com.ibm.stocator.fs.cos.COSAPIClient
The latest version of Stocator (v1.0.9) that supports fs.cos scheme is not yet deployed on Spark aaService (It will be soon). Please use the stocator scheme "fs.s3d" to connect to your COS.
Example:
endpoint = 'endpointXXX'
access_key = 'XXX'
secret_key = 'XXX'
prefix = "fs.s3d.service"
hconf = sc._jsc.hadoopConfiguration()
hconf.set(prefix + ".endpoint", endpoint)
hconf.set(prefix + ".access.key", access_key)
hconf.set(prefix + ".secret.key", secret_key)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile('s3d://{0}.service/{1}'.format(bucket, obj))
rdd.count()
Alternatively, you can use ibmos2spark. The lib is already installed on our service. Example:
import ibmos2spark
credentials = {
'endpoint': 'endpointXXXX',
'access_key': 'XXXX',
'secret_key': 'XXXX'
}
configuration_name = 'os_configs' # any string you want
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name)
bucket = 'mybucket'
obj = 'mydata.tsv.gz'
rdd = sc.textFile(cos.url(obj, bucket))
rdd.count()
Stocator is on the classpath for Spark 2.0 and 2.1 kernels, but the cos scheme is not configured. You can access the config by executing the following in a Python notebook:
!cat $SPARK_CONF_DIR/core-site.xml
Look for the property fs.stocator.scheme.list. What I currently see is:
<property>
<name>fs.stocator.scheme.list</name>
<value>swift2d,swift,s3d</value>
</property>
I recommend that you raise a feature request against DSX to support the cos scheme.
It looks like cos driver is not properly initialized. Try this configuration:
hconf.set('fs.cos.impl', 'com.ibm.stocator.fs.ObjectStoreFileSystem')
hconf.set('fs.stocator.scheme.list', 'cos')
hconf.set('fs.stocator.cos.impl', 'com.ibm.stocator.fs.cos.COSAPIClient')
hconf.set('fs.stocator.cos.scheme', 'cos')
hconf.set('fs.cos.mycos.access.key', access_key)
hconf.set('fs.cos.mycos.endpoint', 'http://' + host)
hconf.set('fs.cos.mycos.secret.key', secret_key)
hconf.set('fs.cos.service.v2.signer.type', 'false')
UPDATE 1:
You also need to ensure stocator classes are on the classpath. You can use packages system by exceuting pyspark in the following way:
./bin/pyspark --packages com.ibm.stocator:stocator:1.0.24
This works with swift2d and cos scheme.
UPDATE 2:
Just follow Stocator documentation (https://github.com/CODAIT/stocator). It contains all details how to install it, what branch to use, etc.
I found the same issue, and to solve it I just changed environment:
Within IBM Watson Studio, if you start a a Jupyter notebook in an environment without a pre-configured spark cluster, than you get that error. Installing PySpark is not enough.
Instead, if you start a notebook with the Spark cluster available, you will be just fine.
You have to set .config("spark.hadoop.fs.stocator.scheme.list", "cos") along with some others fs.cos... configurations.
Here's an end-to-end snippet code example that works (tested with pyspark==2.3.2 and Python 3.7.3):
from pyspark.sql import SparkSession
stocator_jar = '/path/to/stocator-1.1.2-SNAPSHOT-IBM-SDK.jar'
cos_instance_name = '<myCosIntanceName>'
bucket_name = '<bucketName>'
s3_region = '<region>'
cos_iam_api_key = '*******'
iam_servicce_id = 'crn:v1:bluemix:public:iam-identity::<****************>'
spark_builder = (
SparkSession
.builder
.appName('test_app'))
spark_builder.config('spark.driver.extraClassPath', stocator_jar)
spark_builder.config('spark.executor.extraClassPath', stocator_jar)
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.api.key", cos_iam_api_key)
spark_builder.config(f"fs.cos.{cos_instance_name}.endpoint", f"s3.{s3_region}.cloud-object-storage.appdomain.cloud")
spark_builder.config(f"fs.cos.{cos_instance_name}.iam.service.id", iam_servicce_id)
spark_builder.config("spark.hadoop.fs.stocator.scheme.list", "cos")
spark_builder.config("spark.hadoop.fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
spark_builder.config("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
spark_builder.config("fs.stocator.cos.scheme", "cos")
spark_sess = spark_builder.getOrCreate()
dataset = spark_sess.range(1, 10)
dataset = dataset.withColumnRenamed('id', 'user_idx')
dataset.repartition(1).write.csv(
f'cos://{bucket_name}.{cos_instance_name}/test.csv',
mode='overwrite',
header=True)
spark_sess.stop()
print('done!')

Apache Airflow - trigger/schedule DAG rerun on completion (File Sensor)

Good Morning.
I'm trying to setup a DAG too
Watch/sense for a file to hit a network folder
Process the file
Archive the file
Using the tutorials online and stackoverflow I have been able to come up with the following DAG and Operator that successfully achieves the objectives, however I would like the DAG to be rescheduled or rerun on completion so it starts watching/sensing for another file.
I attempted to set a variable max_active_runs:1 and then a schedule_interval: timedelta(seconds=5) this yes reschedules the DAG but starts queuing task and locks the file.
Any ideas welcome on how I could rerun the DAG after the archive_task?
Thanks
DAG CODE
from airflow import DAG
from airflow.operators import PythonOperator, OmegaFileSensor, ArchiveFileOperator
from datetime import datetime, timedelta
from airflow.models import Variable
default_args = {
'owner': 'glsam',
'depends_on_past': False,
'start_date': datetime.now(),
'provide_context': True,
'retries': 100,
'retry_delay': timedelta(seconds=30),
'max_active_runs': 1,
'schedule_interval': timedelta(seconds=5),
}
dag = DAG('test_sensing_for_a_file', default_args=default_args)
filepath = Variable.get("soucePath_Test")
filepattern = Variable.get("filePattern_Test")
archivepath = Variable.get("archivePath_Test")
sensor_task = OmegaFileSensor(
task_id='file_sensor_task',
filepath=filepath,
filepattern=filepattern,
poke_interval=3,
dag=dag)
def process_file(**context):
file_to_process = context['task_instance'].xcom_pull(
key='file_name', task_ids='file_sensor_task')
file = open(filepath + file_to_process, 'w')
file.write('This is a test\n')
file.write('of processing the file')
file.close()
proccess_task = PythonOperator(
task_id='process_the_file',
python_callable=process_file,
provide_context=True,
dag=dag
)
archive_task = ArchiveFileOperator(
task_id='archive_file',
filepath=filepath,
archivepath=archivepath,
dag=dag)
sensor_task >> proccess_task >> archive_task
FILE SENSOR OPERATOR
import os
import re
from datetime import datetime
from airflow.models import BaseOperator
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from airflow.operators.sensors import BaseSensorOperator
class ArchiveFileOperator(BaseOperator):
#apply_defaults
def __init__(self, filepath, archivepath, *args, **kwargs):
super(ArchiveFileOperator, self).__init__(*args, **kwargs)
self.filepath = filepath
self.archivepath = archivepath
def execute(self, context):
file_name = context['task_instance'].xcom_pull(
'file_sensor_task', key='file_name')
os.rename(self.filepath + file_name, self.archivepath + file_name)
class OmegaFileSensor(BaseSensorOperator):
#apply_defaults
def __init__(self, filepath, filepattern, *args, **kwargs):
super(OmegaFileSensor, self).__init__(*args, **kwargs)
self.filepath = filepath
self.filepattern = filepattern
def poke(self, context):
full_path = self.filepath
file_pattern = re.compile(self.filepattern)
directory = os.listdir(full_path)
for files in directory:
if re.match(file_pattern, files):
context['task_instance'].xcom_push('file_name', files)
return True
return False
class OmegaPlugin(AirflowPlugin):
name = "omega_plugin"
operators = [OmegaFileSensor, ArchiveFileOperator]
Dmitris method worked perfectly.
I also found in my reading setting schedule_interval=None and then using the TriggerDagRunOperator worked equally as well
trigger = TriggerDagRunOperator(
task_id='trigger_dag_RBCPV99_rerun',
trigger_dag_id="RBCPV99_v2",
dag=dag)
sensor_task >> proccess_task >> archive_task >> trigger
Set schedule_interval=None and use airflow trigger_dag command from BashOperator to launch next execution at the completion of the previous one.
trigger_next = BashOperator(task_id="trigger_next",
bash_command="airflow trigger_dag 'your_dag_id'", dag=dag)
sensor_task >> proccess_task >> archive_task >> trigger_next
You can start your first run manually with the same airflow trigger_dag command and then trigger_next task will automatically trigger the next one. We use this in production for many months now and and it runs perfectly.

Flask, blueprints uses celery task and got cycle import

I have an application with Blueprints and Celery
the code is here:
config.py
import os
from celery.schedules import crontab
basedir = os.path.abspath(os.path.dirname(__file__))
class Config:
SECRET_KEY = os.environ.get('SECRET_KEY') or ''
SQLALCHEMY_COMMIT_ON_TEARDOWN = True
RECORDS_PER_PAGE = 40
SQLALCHEMY_DATABASE_URI = ''
CELERY_BROKER_URL = ''
CELERY_RESULT_BACKEND = ''
CELERY_RESULT_DBURI = ''
CELERY_TIMEZONE = 'Europe/Kiev'
CELERY_ENABLE_UTC = False
CELERYBEAT_SCHEDULE = {}
#staticmethod
def init_app(app):
pass
class DevelopmentConfig(Config):
DEBUG = True
WTF_CSRF_ENABLED = True
APP_HOME = ''
SQLALCHEMY_DATABASE_URI = 'mysql+mysqldb://...'
CELERY_BROKER_URL = 'sqla+mysql://...'
CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = 'mysql://...'
CELERY_TIMEZONE = 'Europe/Kiev'
CELERY_ENABLE_UTC = False
CELERYBEAT_SCHEDULE = {
'send-email-every-morning': {
'task': 'app.workers.tasks.send_email_task',
'schedule': crontab(hour=6, minute=15),
},
}
class TestConfig(Config):
DEBUG = True
WTF_CSRF_ENABLED = False
TESTING = True
SQLALCHEMY_DATABASE_URI = 'mysql+mysqldb://...'
class ProdConfig(Config):
DEBUG = False
WTF_CSRF_ENABLED = True
SQLALCHEMY_DATABASE_URI = 'mysql+mysqldb://...'
CELERY_BROKER_URL = 'sqla+mysql://...celery'
CELERY_RESULT_BACKEND = "database"
CELERY_RESULT_DBURI = 'mysql://.../celery'
CELERY_TIMEZONE = 'Europe/Kiev'
CELERY_ENABLE_UTC = False
CELERYBEAT_SCHEDULE = {
'send-email-every-morning': {
'task': 'app.workers.tasks.send_email_task',
'schedule': crontab(hour=6, minute=15),
},
}
config = {
'development': DevelopmentConfig,
'default': ProdConfig,
'production': ProdConfig,
'testing': TestConfig,
}
class AppConf:
"""
Class to store current config even out of context
"""
def __init__(self):
self.app = None
self.config = {}
def init_app(self, app):
if hasattr(app, 'config'):
self.app = app
self.config = app.config.copy()
else:
raise TypeError
init.py:
import os
from flask import Flask
from celery import Celery
from config import config, AppConf
def create_app(config_name):
app = Flask(__name__)
app.config.from_object(config[config_name])
config[config_name].init_app(app)
app_conf.init_app(app)
# Connect to Staging view
from staging.views import staging as staging_blueprint
app.register_blueprint(staging_blueprint)
return app
def make_celery(app=None):
app = app or create_app(os.getenv('FLASK_CONFIG') or 'default')
celery = Celery(__name__, broker=app.config.CELERY_BROKER_URL)
celery.conf.update(app.conf)
TaskBase = celery.Task
class ContextTask(TaskBase):
abstract = True
def __call__(self, *args, **kwargs):
with app.app_context():
return TaskBase.__call__(self, *args, **kwargs)
celery.Task = ContextTask
return celery
tasks.py:
from app import make_celery, app_conf
cel = make_celery(app_conf.app)
#cel.task
def send_realm_to_fabricdb(realm, form):
some actions...
and here is the problem:
The Blueprint "staging" uses task send_realm_to_fabricdb, so it makes: from tasks import send_realm_to_fabricdb
than, when I just run application, everything goes ok
BUT, when I'm trying to run celery: celery -A app.tasks worker -l info --beat, it goes to cel = make_celery(app_conf.app) in tasks.py, got app=None and trying to create application again: registering a blueprint... so I've got cycle import here.
Could you tell me how to break this cycle?
Thanks in advance.
I don't have the code to try this out, but I think things would work better if you move the creation of the Celery instance out of tasks.py and into the create_app function, so that it happens at the same time the app instance is created.
The argument you give to the Celery worker in the -A option does not need to have the tasks, Celery just needs the celery object, so for example, you could create a separate starter script, say celery_worker.py that calls create_app to create app and cel and then give it to the worker as -A celery_worker.cel, without involving the blueprint at all.
Hope this helps.
What I do to solve this error is that I create two Flask instance which one is for Web app, and another is for initial Celery instance.
Like #Miguel said, I have
celery_app.py for celery instance
manager.py for Flask instance
And in these two files, each module has it's own Flask instance.
So I can use celery.task in Views. And I can start celery worker separately.
Thanks Bob Jordan, you can find the answer from https://stackoverflow.com/a/50665633/2794539,
Key points:
1. make_celery do two things at the same time: create celery app and run celery with flask content, so you can create two functions to do make_celery job
2. celery app must init before blueprint register
Having the same problem, I ended up solving it very easily using shared_task (docs), keeping a single app.py file and not having to instantiate the flask app multiple times.
The original situation that led to the circular import:
from src.app import celery # src.app is ALSO importing the blueprints which are importing this file which causes the circular import.
#celery.task(bind=True)
def celery_test(self):
sleep(5)
logger.info("Task processed by Celery.")
The current code that works fine and avoids the circular import:
# from src.app import celery <- not needed anymore!
#shared_task(bind=True)
def celery_test(self):
sleep(5)
logger.info("Task processed by Celery.")
Please mind that I'm pretty new to Celery so I might be overseeing important stuff, it would be great if someone more experienced can give their opinion.