How to get the result of a background job?

My background job returns a table. How do I view this table?
For example:
def func1(x){
    t = table(take(1..1000,x) as Id,rand(100.00,x) as qyt)
    return t
}
submitJob("getTable","getTable",func1,10) 

Call getJobReturn(job_id) to retrieve the result of a background job, where job_id is the ID of the job. In your example, you could use the following script:
result = getJobReturn("getTable")
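Note that submitJob returns the job ID it actually assigns, which can differ from the one you requested (for example, when duplicate job IDs exist), so it is safer to capture the return value. A small sketch, assuming the job has already finished (getJobReturn raises an error for a still-running job):
id = submitJob("getTable", "getTable", func1, 10)
// fetch the return value once the job has finished
result = getJobReturn(id)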


How do I implement a locustfile where each locust takes a unique value from a CSV file for its task?

from locust import HttpLocust, TaskSet, task

class ExampleTask(TaskSet):
    # class-level attributes: the file is read once and shared as the initial data set
    csvfile = open('failed.csv', 'r')
    data = csvfile.readlines()
    bakdata = list(data)

    @task
    def fun(self):
        try:
            value = self.data.pop().split(',')
            print('------This is the value {}'.format(value[0]))
        except IndexError:
            self.data = list(self.bakdata)

class ExampleUser(HttpLocust):
    host = 'https://www.google.com'
    task_set = ExampleTask
My CSV file follows:
516,True,success
517,True,success
518,True,success
519,True,success
520,True,success
521,True,success
522,True,success
523,True,success
524,True,success
525,True,success
526,True,success
527,True,success
528,True,success
529,True,success
530,True,success
531,True,success
532,True,success
533,True,success
534,True,success
535,True,success
536,True,success
537,True,success
538,True,success
539,True,success
540,True,success
541,True,success
542,True,success
543,True,success
544,True,success
545,True,success
546,True,success
547,True,success
548,True,success
549,True,success
550,True,success
551,True,success
552,True,success
553,True,success
554,True,success
555,True,success
556,True,success
557,True,success
558,True,success
559,True,success
After the CSV file is exhausted, locust no longer takes a unique value; it takes the same value for all the simulated users.
I'm not 100% sure, but I think your problem is this line:
self.data = list(self.bakdata)
This will give each User instance a different copy of the list.
It should work if you change it to:
ExampleTask.data = list(self.bakdata)
Or you can use locust-plugins' CSVReader; see the example here:
https://github.com/SvenskaSpel/locust-plugins/blob/master/examples/csvreader_ex.py
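For reference, a minimal sketch along the lines of that example, adapted to this CSV (it assumes locust-plugins' CSVReader and the newer HttpUser/@task API from Locust 1.x, where HttpLocust was renamed):
from locust import HttpUser, task
from locust_plugins.csvreader import CSVReader

# CSVReader is an iterator shared by all simulated users; it wraps
# around to the start of the file when the rows run out
reader = CSVReader('failed.csv')

class ExampleUser(HttpUser):
    host = 'https://www.google.com'

    @task
    def fun(self):
        row = next(reader)
        print('------This is the value {}'.format(row[0]))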

PySpark - update list in UDF

Is there any possibility to update a list/variable inside a UDF?
Let's consider this scenario:
studentsWithNewId = []

def changeStudentId(studentId):
    if condition:
        newStudentId = computeNewStudentId()  # this function is based on studentsWithNewId list contents
        studentsWithNewId.append(newStudentId)
        return newStudentId
    return studentId

udfChangeStudentId = udf(changeStudentId, IntegerType())

studentsDF.select(udfChangeStudentId(studentId))
Is this possible and safe in a cluster environment?
The code above is just an example, so perhaps it can be rewritten in some other, better way.
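In short, no: the list lives on the driver, while the UDF runs on executors, each of which receives a serialized copy of the closure, so the appends never reach the driver (and executors never see each other's updates). A minimal sketch demonstrating this, as my own illustration (it assumes a local SparkSession; the names are made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[2]").getOrCreate()

seen = []  # lives in the driver process; the UDF below captures a copy

def tag(student_id):
    seen.append(student_id)  # mutates the worker's copy only
    return student_id

tag_udf = udf(tag, IntegerType())

df = spark.createDataFrame([(i,) for i in range(5)], ["studentId"])
df.select(tag_udf("studentId")).collect()
print(seen)  # [] -- the driver's list was never updated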

Airflow TriggerDagRunOperator: how to change the execution date

I noticed that for scheduled tasks the execution date is set in the past, according to:
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.
However, when a DAG triggers another DAG, the execution time is set to now().
Is there a way to give the triggered DAGs the same execution time as the triggering DAG? Of course, I could rewrite the template and use yesterday_ds; however, that is a tricky solution.
The following class extends TriggerDagRunOperator to allow passing the execution date as a string that then gets converted back into a datetime. It's a bit hacky, but it is the only way I found to get the job done.
from datetime import datetime
import logging

from airflow import settings
from airflow.utils.state import State
from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator, DagRunOrder


class MMTTriggerDagRunOperator(TriggerDagRunOperator):
    """
    MMT-patched for passing explicit execution date
    (otherwise it's hard to hook the datetime.now() date).
    Use when you want to explicitly set the execution date on the target DAG
    from the controller DAG.

    Adapted from Paul Elliot's solution on airflow-dev mailing list archives:
    http://mail-archives.apache.org/mod_mbox/airflow-dev/201711.mbox/%3cCAJuWvXgLfipPmMhkbf63puPGfi_ezj8vHYWoSHpBXysXhF_oZQ#mail.gmail.com%3e

    Parameters
    ------------------
    execution_date: str
        the custom execution date (jinja'd)

    Usage Example:
    -------------------
    my_dag_trigger_operator = MMTTriggerDagRunOperator(
        execution_date="{{ execution_date }}",
        task_id='my_dag_trigger_operator',
        trigger_dag_id='my_target_dag_id',
        python_callable=lambda context, dro: dro if random.getrandbits(1) else None,
        params={},
        dag=my_controller_dag
    )
    """
    template_fields = ('execution_date',)

    def __init__(self, trigger_dag_id, python_callable, execution_date,
                 *args, **kwargs):
        self.execution_date = execution_date
        super(MMTTriggerDagRunOperator, self).__init__(
            trigger_dag_id=trigger_dag_id, python_callable=python_callable,
            *args, **kwargs
        )

    def execute(self, context):
        run_id_dt = datetime.strptime(self.execution_date, '%Y-%m-%d %H:%M:%S')
        dro = DagRunOrder(run_id='trig__' + run_id_dt.isoformat())
        dro = self.python_callable(context, dro)
        if dro:
            session = settings.Session()
            dbag = DagBag(settings.DAGS_FOLDER)
            trigger_dag = dbag.get_dag(self.trigger_dag_id)
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                execution_date=self.execution_date,
                conf=dro.payload,
                external_trigger=True)
            logging.info("Creating DagRun {}".format(dr))
            session.add(dr)
            session.commit()
            session.close()
        else:
            logging.info("Criteria not met, moving on")
There is an issue you may run into when using this without setting execution_date=now(): your operator will throw a MySQL error if you try to start a DAG with an identical execution_date twice. This is because execution_date and dag_id are used to create the row index, and rows with identical indexes cannot be inserted.
I can't think of a reason you would ever want to run two identical DAGs with the same execution_date in production anyway, but it is something I ran into while testing, and you should not be alarmed by it. Simply clear the old job or use a different datetime.
The TriggerDagRunOperator now has an execution_date parameter to set the execution date of the triggered run.
Unfortunately, the parameter is not in the template fields.
If it gets added to the template fields (or if you override the operator and change the template_fields value, as sketched after the links below), it will be possible to use it like this:
my_trigger_task = TriggerDagRunOperator(task_id='my_trigger_task',
                                        trigger_dag_id="triggered_dag_id",
                                        python_callable=conditionally_trigger,
                                        execution_date='{{ execution_date }}',
                                        dag=dag)
It has not been released yet, but you can see the source here:
https://github.com/apache/incubator-airflow/blob/master/airflow/operators/dagrun_operator.py
The commit that made the change:
https://github.com/apache/incubator-airflow/commit/089c996fbd9ecb0014dbefedff232e8699ce6283#diff-41f9029188bd5e500dec9804fed26fb4
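For example, a hypothetical override along those lines (untested; it assumes the version from the commit above, and that the rendered string is acceptable wherever execution_date is consumed):
from airflow.operators.dagrun_operator import TriggerDagRunOperator

class TemplatedTriggerDagRunOperator(TriggerDagRunOperator):
    # expose execution_date to jinja so '{{ execution_date }}' gets rendered
    template_fields = ('execution_date',)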
I improved the MMTTriggerDagRunOperator a bit. The execute function checks whether the dag_run already exists and, if found, restarts the DAG using Airflow's clear function. This allows us to create dependencies between DAGs, because the possibility of moving the execution date into the triggered DAG opens a whole universe of amazing possibilities. I wonder why this is not the default behavior in Airflow.
def execute(self, context):
    run_id_dt = datetime.strptime(self.execution_date, '%Y-%m-%d %H:%M:%S')
    dro = DagRunOrder(run_id='trig__' + run_id_dt.isoformat())
    dro = self.python_callable(context, dro)
    if dro:
        session = settings.Session()
        dbag = DagBag(settings.DAGS_FOLDER)
        trigger_dag = dbag.get_dag(self.trigger_dag_id)
        if not trigger_dag.get_dagrun(self.execution_date):
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                execution_date=self.execution_date,
                conf=dro.payload,
                external_trigger=True
            )
            logging.info("Creating DagRun {}".format(dr))
            session.add(dr)
            session.commit()
        else:
            trigger_dag.clear(
                start_date=self.execution_date,
                end_date=self.execution_date,
                only_failed=False,
                only_running=False,
                confirm_prompt=False,
                reset_dag_runs=True,
                include_subdags=False,
                dry_run=False
            )
            logging.info("Cleared DagRun {}".format(trigger_dag))
        session.close()
    else:
        logging.info("Criteria not met, moving on")
There is a function available in the experimental API section of Airflow that allows you to trigger a DAG with a specific execution date: https://github.com/apache/incubator-airflow/blob/master/airflow/api/common/experimental/trigger_dag.py
You can call this function as part of a PythonOperator and achieve the objective.
So it will look like:
from airflow.api.common.experimental.trigger_dag import trigger_dag

trigger_operator = PythonOperator(task_id='YOUR_TASK_ID',
                                  python_callable=trigger_dag,
                                  op_args=['dag_id'],
                                  op_kwargs={'execution_date': datetime.now()})

Schedule Celery task to run after other task(s) complete

I want to accomplish something like this:
results = []
for i in range(N):
    data = generate_data_slowly()
    res = tasks.process_data.apply_async(data)
    results.append(res)

celery.collect(results).then(tasks.combine_processed_data())
i.e. launch asynchronous tasks over a long period of time, then schedule a dependent task that will only be executed once all the earlier tasks are complete.
I've looked at things like chain and chord, but it seems like they only work if you can construct your task graph completely upfront.
For anyone interested, I ended up using this snippet:
@app.task(bind=True, max_retries=None)
def wait_for(self, task_id_or_ids):
    try:
        ready = app.AsyncResult(task_id_or_ids).ready()
    except TypeError:
        ready = all(app.AsyncResult(task_id).ready()
                    for task_id in task_id_or_ids)
    if not ready:
        self.retry(countdown=2 ** self.request.retries)
And writing the workflow something like this:
task_ids = []
for i in range(N):
task = (generate_data_slowly.si(i) |
process_data.si(i)
)
task_id = task.delay().task_id
task_ids.append(task_id)
final_task = (wait_for(task_ids) |
combine_processed_data.si()
)
final_task.delay()
That way the final task effectively waits for all the earlier tasks to complete before running.
The solution depends entirely on how and where data are collected. Roughly, given that generate_data_slowly and tasks.process_data run synchronously one after the other, a better approach would be to join both in one task (or a chain) and to group those tasks.
chord will then allow you to add a callback to that group.
The simplest example would be:
from celery import chord

@app.task
def getnprocess_data():
    data = generate_data_slowly()
    return whatever_process_data_does(data)

header = [getnprocess_data.s() for i in range(N)]
callback = combine_processed_data.s()

chord(header)(callback).get()
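Note that .get() blocks the calling process until the callback has finished. As a side note beyond the original answer: if you only want to fire off the workflow, drop it:
result = chord(header)(callback)  # returns an AsyncResult immediately
# result.get() would block until combine_processed_data has finished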

Trigger works but test doesn't cover 75% of the code

I have a trigger which works in the sandbox. It checks a field at the campaign level and compares it with a custom setting; if it matches, it writes the target to the DS Multiplier field. The trigger looks as follows:
trigger PopulateTarget on Campaign (before insert, before update)
{
    for (Campaign campaign : Trigger.new)
    {
        if (String.isNotBlank(campaign.Apex_Calculator__c))
        {
            DSTargets__c targetInstance = DSTargets__c.getInstance(campaign.Apex_Calculator__c);
            String target = targetInstance.Target__c;
            campaign.DS_Target_Multiplier__c = target;
        }
    }
}
However, I had problems writing a proper test for this and asked for help online. I received this test:
@isTest
private class testPopulateTarget {
    static testMethod void testMethod1() {
        // Load the custom setting
        DSTargets__c testSetting = new DSTargets__c(
            Name = 'Africa - 10 Weeks; CW 10',
            Target__c = '0.1538',
            SetupOwnerId = apexCalculatorUserId);
        insert testSetting;

        // Create a Campaign. Since it executes the trigger, wrap it in start/stop tests
        Test.startTest();
        Campaign testCamp = new Campaign();
        // populate all required fields
        testCamp.Name = 'test DS campaign';
        testCamp.RecordTypeId = '012200000001b3v';
        testCamp.Started_Campaign_weeks_before_Event__c = '12 Weeks';
        testCamp.ParentId = '701g0000000EZRk';
        insert testCamp;
        Test.stopTest();

        testCamp = [SELECT Id, Apex_Calculator__c, DS_Target_Multiplier__c
                    FROM Campaign WHERE Id = :testCamp.Id];
        // assert that the target is populated correctly
        System.assertEquals(testCamp.DS_Target_Multiplier__c, testSetting.Target__c);
    }
}
This test returns the error "Compile Error: Variable does not exist: apexCalculatorUserId at line 6 column 122". If I remove the apexCalculatorUserId part and the System.assertEquals, the test passes; however, it then covers only 4/6 of the code (66%).
Could anyone help me amend the code to reach 75% coverage?
Yes, apexCalculatorUserId has not been defined; the code you were given appears to be incomplete. You'll need to look at the constructor of DSTargets__c and see what kind of ID it is expecting there.
At a guess, you could try UserInfo.getUserId() to get the ID of the current user, but that may not be the ID the constructor expects. It is worth trying to see if the test coverage improves.
1) Replace apexCalculatorUserId with UserInfo.getUserId().
2) I'm not sure what kind of field Apex_Calculator__c is on Campaign. If it's not a formula, you want to insert a new line before "insert testCamp", something like:
testCamp.Apex_Calculator__c = UserInfo.getUserId();
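Putting both suggestions together, the relevant parts of the test might look like this (a hedged sketch; it assumes Apex_Calculator__c is a writable text field holding the same user ID that the custom setting is keyed on):
// Key the custom setting to the running user (an assumption)
DSTargets__c testSetting = new DSTargets__c(
    Name = 'Africa - 10 Weeks; CW 10',
    Target__c = '0.1538',
    SetupOwnerId = UserInfo.getUserId());
insert testSetting;

// ... set up testCamp as before, then point it at the same key
testCamp.Apex_Calculator__c = UserInfo.getUserId();
insert testCamp;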