Push variable from Spark to Airflow - pyspark

I have a variable whose value I'd like to push to Airflow so I can use it as an input for the next task. I know that I must use XComs, but I haven't figured out how to push the value from the Spark task to Airflow.
def c_count():
    return spark_task(
        name='c_count',
        script='c_count.py',
        dag=dag,
        table=None,
        host=Variable.get('host'),
        trigger_rule="all_done",
        provide_context=True,
        xcom_push=True
    )

def c_int():
    return spark_task(
        name='c_in',
        script='another_test.py',
        dag=dag,
        table=None,
        host=Variable.get('host'),
        trigger_rule="all_done",
        counts="{{ task_instance.xcom_pull(task_ids='c_count') }}"
    )
EDIT:
The spark task is the following:
def spark_task_sapbw(name, script, dag, table, host, **kwargs):
    spark_cmd = 'spark-submit'
    if Variable.get('spark_master_uri', None):
        spark_cmd += ' --master {}'.format(Variable.get('spark_master_uri'))
    .
    .
    .
    task = BashOperator(
        task_id=name,
        bash_command=spark_cmd,
        dag=dag,
        **kwargs
    )
    return task
The problem is that what I get back is the last print statement from the Airflow log. Is there any way I can get a specific value back from the Spark script? Thank you!

You cannot make Spark and Airflow communicate directly. You have to use Python in between: collect the values you need and push them to Airflow with XComs.
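Concretely, since spark_task wraps a BashOperator with xcom_push=True, Airflow pushes the last line the bash command writes to stdout as the XCom value. So one option is to have the Spark driver script print the value you need as its very last output line. A minimal sketch (this assumes the driver runs where spark-submit runs, i.e. client deploy mode; the table name is only an illustration):

# c_count.py -- minimal sketch of the Spark driver script
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("c_count").getOrCreate()
count = spark.table("some_table").count()   # hypothetical table, replace with your own logic
spark.stop()

# With xcom_push=True on the BashOperator, the LAST line written to stdout
# becomes the XCom value, so print the count last.
print(count)

The downstream task then reads it with {{ task_instance.xcom_pull(task_ids='c_count') }}, exactly as in your c_int task.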

Related

"Dag Seems to be missing" error in a Cloud Composer Airflow Dynamic DAG

I have a dynamic Airflow DAG in Google Cloud Composer that gets created, is listed in the web server, and runs (via backfill) without error.
However, there are issues:
When clicking on the DAG in the web UI, it says "DAG seems to be missing"
Can't see the Graph view/Tree view because of the error above
Can't manually trigger the DAG because of the error above
I've been trying to fix this for a couple of days... any hint will be helpful. Thank you!
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from google.cloud import storage
from airflow.models import Variable
import json

args = {
    'owner': 'xxx',
    'start_date': '2020-11-5',
    'provide_context': True
}

dag = DAG(
    dag_id='dynamic',
    default_args=args
)

def return_bucket_files(bucket_name='xxxxx', **kwargs):
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blobs = bucket.list_blobs()
    file_list = [blob.name for blob in blobs]
    return file_list

def dynamic_gcs_to_gbq_etl(file, **kwargs):
    mapping = json.loads(Variable.get("xxxxx"))
    database = mapping[0][file]
    table = mapping[1][file]
    task = GoogleCloudStorageToBigQueryOperator(
        task_id=f'gcs_load_{file}_to_gbq',
        bucket='xxxxxxx',
        source_objects=[f'{file}'],
        destination_project_dataset_table=f'xxx.{database}.{table}',
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
        skip_leading_rows=1,
        source_format='CSV',
        dag=dag)
    return task

start_task = DummyOperator(
    task_id='start',
    dag=dag
)

end_task = DummyOperator(
    task_id='end',
    dag=dag)

push_bucket_files = PythonOperator(
    task_id="return_bucket_files",
    provide_context=True,
    python_callable=return_bucket_files,
    dag=dag)

for file in return_bucket_files():
    gcs_load_task = dynamic_gcs_to_gbq_etl(file)
    start_task >> push_bucket_files >> gcs_load_task >> end_task
This issue means that the web server is failing to fill in its DAG bag on its side; the problem is most likely not with your DAG specifically.
My suggestion right now would be to try restarting the web server (via the installation of some dummy package).
Similar issues are reported in this post, as well as here.
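For reference, installing or bumping any PyPI package triggers a Composer web server restart. A rough sketch of the command (the environment name, location, and package are placeholders, and the exact flag may differ between gcloud versions, so double-check gcloud composer environments update --help):
gcloud composer environments update my-environment --location us-central1 --update-pypi-package dummy-package==0.1.0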

Extract Embedded AWS Glue Connection Credentials Using Scala

I have a Glue job that reads directly from Redshift, and to do that, one has to provide connection credentials. I have created an embedded Glue connection and can extract the credentials with the following PySpark code. Is there a way to do this in Scala?
glue = boto3.client('glue', region_name='us-east-1')

response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)

table = (
    spark.read.format('com.databricks.spark.redshift')
    .option('url', 'jdbc:redshift://prod.us-east-1.redshift.amazonaws.com:5439/db')
    .option('user', response['Connection']['ConnectionProperties']['USERNAME'])
    .option('password', response['Connection']['ConnectionProperties']['PASSWORD'])
    .option('dbtable', 'db.table')
    .option('tempdir', 's3://config/glue/temp/redshift/')
    .option('forward_spark_s3_credentials', 'true')
    .load()
)
There is no Scala equivalent from AWS to issue this API call, but you can use Java SDK code inside Scala, as mentioned in this answer.
This is the Java SDK call for getConnection. If you don't want to do that, you can follow the approach below:
Create an AWS Glue Python shell job and retrieve the connection information.
Once you have the values, call the other Scala Glue job with them as arguments inside your Python shell job, as shown below:
glue = boto3.client('glue', region_name='us-east-1')

response = glue.get_connection(
    Name='name-of-embedded-connection',
    HidePassword=False
)

response = glue.start_job_run(
    JobName='my_scala_Job',
    Arguments={
        '--username': response['Connection']['ConnectionProperties']['USERNAME'],
        '--password': response['Connection']['ConnectionProperties']['PASSWORD']
    }
)
Then access these parameters inside your Scala job using getResolvedOptions, as shown below:
import com.amazonaws.services.glue.util.GlueArgParser

val args = GlueArgParser.getResolvedOptions(
    sysArgs, Array(
        "username",
        "password")
)
val user = args("username")
val pwd = args("password")

Using input function with remote files in snakemake

I want to use a function to read input file paths from a dataframe and send them to my snakemake rule. I also have a helper function to select the remote provider from which to pull the files.
from snakemake.remote.GS import RemoteProvider as GSRemoteProvider
from snakemake.remote.SFTP import RemoteProvider as SFTPRemoteProvider
from os.path import join
import pandas as pd

configfile: "config.yaml"

units = pd.read_csv(config["units"]).set_index(["library", "unit"], drop=False)
TMP = join('data', 'tmp')

def access_remote(local_path):
    """Connects to remote as defined in config file"""
    provider = config['provider']
    if provider == 'GS':
        GS = GSRemoteProvider()
        remote_path = GS.remote(join("gs://" + config['bucket'], local_path))
    elif provider == 'SFTP':
        SFTP = SFTPRemoteProvider(
            username=config['user'],
            private_key=config['ssh_key']
        )
        remote_path = SFTP.remote(
            config['host'] + ":22" + join(base_path, local_path)
        )
    else:
        remote_path = local_path
    return remote_path

def get_fastqs(wc):
    """
    Get fastq files (units) of a particular library - sample
    combination from the unit sheet.
    """
    fqs = units.loc[
        (units.library == wc.library) &
        (units.libtype == wc.libtype),
        "fq1"
    ]
    return {
        "r1": list(map(access_remote, fqs.fq1.values)),
    }

# Combine all fastq files from the same sample / library type combination
rule combine_units:
    input: unpack(get_fastqs)
    output:
        r1 = join(TMP, "reads", "{library}_{libtype}.end1.fq.gz")
    threads: 12
    run:
        shell("cat {i1} > {o1}".format(i1=input['r1'], o1=output['r1']))
My config file contains the bucket name and provider, which are passed to the function. This works as expected when simply running snakemake.
However, I would like to use the Kubernetes integration, which requires passing the provider and bucket name on the command line. But when I run:
snakemake -n --kubernetes --default-remote-provider GS --default-remote-prefix bucket-name
I get this error:
ERROR :: MissingInputException in line 19 of Snakefile:
Missing input files for rule combine_units:
bucket-name/['bucket-name/lib1-unit1.end1.fastq.gz', 'bucket-name/lib1-unit2.end1.fastq.gz', 'bucket-name/lib1-unit3.end1.fastq.gz']
The bucket prefix is applied twice: once mapped correctly to each element, and once before the whole list (which gets converted to a string). Did I miss something? Is there a good way to work around this?

How to get applicationId of Spark application deployed to YARN in Scala?

I'm using the following Scala code (as a custom spark-submit wrapper) to submit a Spark application to a YARN cluster:
val result = Seq(spark_submit_script_here).!!
All I have at the time of submission is spark-submit and the Spark application's jar (no SparkContext). I'd like to capture the applicationId from result, but it's empty.
I can see the applicationId and the rest of the YARN messages in my command line output:
INFO yarn.Client: Application report for application_1450268755662_0110
How can I read it within the code and get the applicationId?
As stated in Spark issue 5439, you could either use SparkContext.applicationId or parse the stderr output. Since you are wrapping the spark-submit command with your own script/object, I would say you need to read the stderr and extract the application id.
If you are submitting the job via Python, then this is how you can get the YARN application id:
import re
import subprocess

cmd_list = [{
    'cmd': '/usr/bin/spark-submit --name %s --master yarn --deploy-mode cluster '
           '--executor-memory %s --executor-cores %s --num-executors %s '
           '--class %s %s %s'
           % (
               app_name,
               config.SJ_EXECUTOR_MEMORY,
               config.SJ_EXECUTOR_CORES,
               config.SJ_NUM_OF_EXECUTORS,
               config.PRODUCT_SNAPSHOT_SKU_PRESTO_CLASS,
               config.SPARK_JAR_LOCATION,
               config.SPARK_LOGGING_ENABLED
           ),
    'cwd': config.WORK_DIR
}]

for cmd_obj in cmd_list:
    # spark-submit writes the application report to stderr, so capture it
    cmd_output = subprocess.run(cmd_obj['cmd'], shell=True, check=True,
                                cwd=cmd_obj['cwd'], stderr=subprocess.PIPE)
    cmd_output = cmd_output.stderr.decode("utf-8")
    yarn_application_ids = re.findall(r"application_\d{13}_\d{4}", cmd_output)
    if len(yarn_application_ids):
        yarn_application_id = yarn_application_ids[0]
        yarn_command = "yarn logs -applicationId " + yarn_application_id
Use the Spark context to get the application info:
sc.getConf.getAppId
res7: String = application_1532296406128_16555
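If you are on PySpark rather than Scala, the same id is exposed on the SparkContext (a sketch assuming an active SparkSession named spark):

# PySpark: read the YARN application id from the running context
app_id = spark.sparkContext.applicationId
print(app_id)   # e.g. application_1532296406128_16555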
As in Rajiv's answer, the regex application_\d{13}_\d{4} is not correct: the job id will actually grow beyond 9999, so the regex application_\d{13}_\d{4,} is the one that works.
And the Java code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static final Pattern APPLICATION_REGEX = Pattern.compile("application_\\d+_\\d{4,}");

/**
 * get yarn application id list
 * @param log log content
 * @return app id list
 */
public static List<String> getAppIds(String log) {
    List<String> appIds = new ArrayList<>();
    Matcher matcher = APPLICATION_REGEX.matcher(log);
    while (matcher.find()) {
        String appId = matcher.group();
        if (!appIds.contains(appId)) {
            appIds.add(appId);
        }
    }
    return appIds;
}

Celery broadcast task not working

I've tried to make a broadcast task, but only one of my workers receives it per call. Would you please help me? (I'm using RabbitMQ and node-celery.)
default_exchange = Exchange('celery', type='direct')

celery.conf.update(
    CELERY_RESULT_BACKEND="amqp",
    CELERY_RESULT_SERIALIZER='json',
    CELERY_QUEUES=(
        Queue('celery', default_exchange, routing_key='celery'),
        Broadcast('broadcast_tasks'),
    ),
    CELERY_ROUTES=(
        {'my_tasks.sample_broadcast_task': {
            'queue': 'broadcast_tasks',
        }},
        {'my_tasks.sample_normal_task': {
            'queue': 'celery',
            'exchange': 'celery',
            'exchange_type': 'direct',
            'routing_key': 'celery',
        }}
    ),
)
I've also tested the following configuration, but it's not working.
celery.conf.update(
    CELERY_RESULT_BACKEND="amqp",
    CELERY_RESULT_SERIALIZER='json',
    CELERY_QUEUES=(
        Queue('celery', Exchange('celery'), routing_key='celery'),
        Broadcast('broadcast'),
    ),
)

@celery.task(ignore_result=True, queue='broadcast',
             options=dict(queue='broadcast'))
def sample_broadcast_task():
    print "test"
EDIT:
After changing how the worker is run (by adding -Q broadcast), I now face this error:
PreconditionFailed: Exchange.declare: (406) PRECONDITION_FAILED - inequivalent arg 'type' for exchange 'broadcast' in vhost '/': received 'direct' but current is 'fanout'
After trying many, many things, I finally found a solution. This works for me.
(celery 3.1.24 (Cipater) and Python 2.7.12)
WORKER - tasks.py :
from celery import Celery
import celery_config
from kombu.common import Broadcast, Queue, Exchange

app = Celery()
app.config_from_object(celery_config)

@app.task
def print_prout(x):
    print x
    return x
WORKER - celery_config.py :
# coding=utf-8
from kombu.common import Broadcast, Queue, Exchange
BROKER_URL = 'amqp://login:pass@172.17.0.1//'
CELERY_RESULT_BACKEND = 'redis://:login@172.17.0.1'
CELERY_TIMEZONE = 'Europe/Paris'
CELERY_ENABLE_UTC = True
CELERY_TASK_SERIALIZER = 'pickle'
CELERY_RESULT_SERIALIZER = 'pickle'
CELERY_ACCEPT_CONTENT = ['pickle', 'json', 'msgpack', 'yaml']
CELERY_DISABLE_RATE_LIMITS = True
CELERY_ALWAYS_EAGER = False
CELERY_QUEUES = (Broadcast('broadcast_tasks'), )
worker launched with:
celery -A celery_worker.tasks worker --loglevel=info --concurrency=1 -n worker_name_1
On the client (another docker container for me).
from celery import Celery
from celery_worker import tasks
result = tasks.print_prout.apply_async(['prout'], queue='broadcast_tasks')
print result.get()
The next step for me is how to retrieve and display the results returned by all the workers. print result.get() seems to return only the result of the last worker.
It does not seem obvious (see Have Celery broadcast return results from all workers).
According to your description:
I've tried to make a broadcast task but only one of my workers receives it per call
you may be using a direct type exchange. Try this:
from celery import Celery
from kombu.common import Broadcast

BROKER_URL = 'amqp://guest:guest@localhost:5672//'

class CeleryConf:
    # List of modules to import when celery starts.
    CELERY_ACCEPT_CONTENT = ['json']
    CELERY_IMPORTS = ('main.tasks')
    CELERY_QUEUES = (Broadcast('q1'),)
    CELERY_ROUTES = {
        'tasks.sampletask': {'queue': 'q1'}
    }

celeryapp = Celery('celeryapp', broker=BROKER_URL)
celeryapp.config_from_object(CeleryConf())

@celeryapp.task
def sampletask(form):
    print form
To send the message, do:
d = sampletask.apply_async(['4c5b678350fc643'], serializer="json", queue='q1')