Databricks monitoring a column value in a spark table - pyspark

I have a complex workflow (partly inherited) in Databricks that runs multiple child notebooks. Child notebooks should not run if the current run has been set to terminated by an external process. Currently I am checking the state of this column before each run; the pseudo code below tries to give an idea. My question is: is there a way to make this read better and easier to maintain? For example, a more global check or wait that dynamically detects that the job is "terminated" and moves the code straight on to the tidy-up part.
audit = spark.createDataFrame(
    [('122', 'complete', None),
     ('123', 'complete', None),
     ('124', 'Processing', 'Terminated'),
     ('125', 'Processing', None)],
    "job_id string, job_status string, external_status string")
audit.write.mode("overwrite").saveAsTable("audit")

def is_terminated(run_id):
    # the audit table stores the run id in the job_id column
    current_log = spark.table("audit").filter(f"job_id = '{run_id}'")
    job_terminated = current_log.select('external_status').collect()[0][0]
    return job_terminated

run_id = '125'  # in live code there is a job that gets the current run_id
terminated = is_terminated(run_id)
if terminated != "Terminated":
    dbutils.notebook.run("./notebooks/job1", 0)

terminated = is_terminated(run_id)
otherstuff = otherstuffjob(run_id)
if terminated != "Terminated" and otherstuff == True:
    dbutils.notebook.run("./notebooks/job2", 0)

terminated = is_terminated(run_id)
yetotherstuff = yetotherstuffjob(run_id)
if terminated != "Terminated" and yetotherstuff == True:
    dbutils.notebook.run("./notebooks/job3", 0)

terminated = is_terminated(run_id)
finalstuff = finalstuffjob(run_id)
if terminated != "Terminated" and finalstuff == True:
    dbutils.notebook.run("./notebooks/job4", 0)

dbutils.notebook.run("./notebooks/finaltidyupjob", 0)  # this runs whatever happens

Related

How to repeat each test with a delay if a particular Exception happens (pytest)

I have a load of tests which I want to rerun if there is a particular exception. The reason for this is that I am running real API calls to a server and sometimes I hit the rate limit for the API, in which case I want to wait and try again.
However, I am also using a pytest fixture to make each test run several times, because I am sending requests to different servers (the actual use case is different cryptocurrency exchanges).
Using pytest-rerunfailures comes very close to what I need, apart from the fact that I can't see how to look at the exception of the last test run in the condition.
Below is some code which shows what I am trying to achieve, but obviously I don't want to write code like this for every test.
@pytest_asyncio.fixture(
    params=EXCHANGE_NAMES,
)
async def client(request):
    exchange_name = request.param
    exchange_client = get_exchange_client(exchange_name)
    return exchange_client

def test_something(client):
    test_something.count += 1
    ### This block is the code I want to avoid writing in every test
    try:
        result = client.do_something()
    except RateLimitException:
        if test_something.count <= 3:
            sleep_duration = get_sleep_duration(client)
            time.sleep(sleep_duration)
            # run the same test again
            test_something(client)
        else:
            raise
    expected = [1, 2, 3]
    assert result == expected
You can use the retry library to wrap your actual code in:
from retry import retry

@pytest_asyncio.fixture(
    params=EXCHANGE_NAMES,
    autouse=True,
)
async def client(request):
    exchange_name = request.param
    exchange_client = get_exchange_client(exchange_name)
    return exchange_client

def test_something(client):
    actual_test_something(client)

@retry(RateLimitException, tries=3, delay=2)
def actual_test_something(client):
    '''Retry on RateLimitException, raise error after 3 attempts, sleep 2 seconds between attempts.'''
    result = client.do_something()
    expected = [1, 2, 3]
    assert result == expected
The code looks much cleaner this way.

Fork() in XV6, does the process child execute in kernel or user mode?

In XV6, when a fork() is called, does the child execute in kernel mode or user mode?
This is the fork code in XV6:
// Create a new process copying p as the parent.
// Sets up stack to return as if from system call.
// Caller must set state of returned proc to RUNNABLE.
int fork(void){
  int i, pid;
  struct proc *np;
  struct proc *curproc = myproc();

  // Allocate process.
  if((np = allocproc()) == 0){
    return -1;
  }

  // Copy process state from proc.
  if((np->pgdir = copyuvm(curproc->pgdir, curproc->sz)) == 0){
    kfree(np->kstack);
    np->kstack = 0;
    np->state = UNUSED;
    return -1;
  }
  np->sz = curproc->sz;
  np->parent = curproc;
  *np->tf = *curproc->tf;

  // Clear %eax so that fork returns 0 in the child.
  np->tf->eax = 0;

  for(i = 0; i < NOFILE; i++)
    if(curproc->ofile[i])
      np->ofile[i] = filedup(curproc->ofile[i]);
  np->cwd = idup(curproc->cwd);

  safestrcpy(np->name, curproc->name, sizeof(curproc->name));

  pid = np->pid;

  acquire(&ptable.lock);
  np->state = RUNNABLE;
  release(&ptable.lock);

  return pid;
}
I did some research, but even from the code I can't understand how it works. Understanding how it works in UNIX would also help.
It is almost an exact copy of the parent process, except for the value of the eax register and the parent-process information, so it will execute in whichever context the parent process is in.
The fork() function here creates a new process structure by calling allocproc() and fills it with the values of the original process and maps the same page tables.
Finally, it sets the process state to RUNNABLE which allows the scheduler to run the new process along with the parent.
That means actual running is performed by the scheduler, not the fork code here.
What Sedat has written is entirely correct. The forked process, i.e. the child, will run in the same context its parent was in, i.e. either kernel or user mode.
In addition to that, I feel what confused you were the calls made by allocproc(), such as kalloc(), and attributes like kstack. These deal with setting up the new process in the system with regard to the page tables and memory.
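As an aside, the same user-visible behaviour can be seen on an ordinary UNIX system, which the question also asked about. A minimal sketch (plain Python, not xv6 code; os.fork() follows the same convention as the xv6 fork() above, where clearing the child's return register makes fork return 0 in the child and the child's pid in the parent):

import os

pid = os.fork()
if pid == 0:
    # Child: fork() returned 0 (the child's copy of the trap frame had its
    # return register cleared, which is what np->tf->eax = 0 does in xv6).
    print("child: fork returned 0, my pid is", os.getpid())
    os._exit(0)
else:
    # Parent: fork() returned the new child's pid.
    print("parent: fork returned child pid", pid)
    os.waitpid(pid, 0)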

Schedule Celery task to run after other task(s) complete

I want to accomplish something like this:
results = []
for i in range(N):
    data = generate_data_slowly()
    res = tasks.process_data.apply_async(data)
    results.append(res)

celery.collect(results).then(tasks.combine_processed_data())
i.e. launch asynchronous tasks over a long period of time, then schedule a dependent task that will only be executed once all the earlier tasks are complete.
I've looked at things like chain and chord, but it seems like they only work if you can construct your task graph completely upfront.
For anyone interested, I ended up using this snippet:
@app.task(bind=True, max_retries=None)
def wait_for(self, task_id_or_ids):
    try:
        ready = app.AsyncResult(task_id_or_ids).ready()
    except TypeError:
        ready = all(app.AsyncResult(task_id).ready()
                    for task_id in task_id_or_ids)
    if not ready:
        self.retry(countdown=2**self.request.retries)
And writing the workflow something like this:
task_ids = []
for i in range(N):
    task = (generate_data_slowly.si(i) |
            process_data.si(i))
    task_id = task.delay().task_id
    task_ids.append(task_id)

final_task = (wait_for.si(task_ids) |  # use a signature so the chain is built rather than called here
              combine_processed_data.si())
final_task.delay()
That way you would be running your tasks synchronously.
The solution depends entirely on how and where data are collected. Roughly, given that generate_data_slowly and tasks.process_data are synchronized, a better approach would be to join both in one task (or a chain) and to group them.
chord will allow you to add a callback to that group.
The simplest example would be:
from celery import chord

@app.task
def getnprocess_data():
    data = generate_data_slowly()
    return whatever_process_data_does(data)

header = [getnprocess_data.s() for i in range(N)]
callback = combine_processed_data.s()

chord(header)(callback).get()

Avoiding duplicate tasks in celery broker

I want to create the following flow using the celery configuration/API:
Send TaskA(argB) only if the celery queue has no TaskA(argB) already pending.
Is it possible? How?
You can make your job aware of other tasks by some sort of memoization. If you use a cache control key (redis, memcached, /tmp, whatever is handy), you can make execution depend on that key. I'm using redis as an example.
from redis import Redis

@app.task
def run_only_one_instance(params):
    try:
        sentinel = Redis().incr("run_only_one_instance_sentinel")
        if sentinel == 1:
            # I am the legitimate running task
            perform_task()
        else:
            # Do you want to do something else on task duplicate?
            pass
        Redis().decr("run_only_one_instance_sentinel")
    except Exception as e:
        Redis().decr("run_only_one_instance_sentinel")
        # potentially log error with Sentry?
        # decrement the counter to ensure tasks can run
        # or: raise e
I cannot think of a way but to:
1. Retrieve all executing and scheduled tasks via celery inspect.
2. Iterate through them to see if your task is there.
Check this SO question to see how the first point is done; a rough sketch is also shown below.
Good luck.
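For reference, a rough sketch of that first point (this is an assumption about usage, not code from the linked question; app is your Celery application instance and task_name is the fully qualified task name):

def task_already_pending(app, task_name):
    inspector = app.control.inspect()
    names = set()
    # active() returns {worker_name: [task_info, ...]}, or None if no worker replies
    for worker_tasks in (inspector.active() or {}).values():
        names.update(t.get('name') for t in worker_tasks)
    # scheduled() wraps each task's info in a 'request' dict
    for worker_tasks in (inspector.scheduled() or {}).values():
        names.update(t.get('request', {}).get('name') for t in worker_tasks)
    return task_name in names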
I don't know if it will help you more than the other answers, but here is my approach, following the same idea given by srj. I needed a way to block my server from launching tasks with the same id onto the queue, so I made a general function to help me.
def is_task_active_or_registered(app, task_id):
    i = app.control.inspect()
    active_dict = i.active()
    scheduled_dict = i.scheduled()
    keys_set = set(active_dict.keys()) | set(scheduled_dict.keys())
    tasks_ids_set = set()
    for _dict in [active_dict, scheduled_dict]:
        for k in keys_set:
            for task in _dict.get(k, []):
                tasks_ids_set.add(task['id'])
    if task_id in tasks_ids_set:
        return True
    else:
        return False
So, I use it like this:
In the context where my celery-app object is available, I define:
def check_task_can_not_run(task_id):
    return is_task_active_or_registered(app=celery, task_id=task_id)
And so, from my client request, I call check_task_can_not_run(...) and block the task from being launched if it returns True.
I was facing a similar problem: Beat was creating duplicates in my queue. I wanted to use expires, but this feature isn't working properly (https://github.com/celery/celery/issues/4300).
So here is a scheduler which checks whether the task has already been enqueued (based on the task name).
# -*- coding: UTF-8 -*-
from __future__ import unicode_literals

import json
from heapq import heappop, heappush

from celery.beat import event_t
from celery.schedules import schedstate
from django_celery_beat.schedulers import DatabaseScheduler
from typing import List, Optional
from typing import TYPE_CHECKING

from your_project import celery_app

if TYPE_CHECKING:
    from celery.beat import ScheduleEntry


def is_task_in_queue(task, queue_name=None):
    # type: (str, Optional[str]) -> bool
    queues = [queue_name] if queue_name else celery_app.amqp.queues.keys()

    for queue in queues:
        if task in get_celery_queue_tasks(queue):
            return True
    return False


def get_celery_queue_tasks(queue_name):
    # type: (str) -> List[str]
    with celery_app.pool.acquire(block=True) as conn:
        tasks = conn.default_channel.client.lrange(queue_name, 0, -1)

    decoded_tasks = []
    for task in tasks:
        j = json.loads(task)
        task = j['headers']['task']
        if task not in decoded_tasks:
            decoded_tasks.append(task)
    return decoded_tasks


class SmartScheduler(DatabaseScheduler):
    """
    Smart means that it prevents duplicating tasks in queues.
    """
    def is_due(self, entry):
        # type: (ScheduleEntry) -> schedstate
        is_due, next_time_to_run = entry.is_due()

        if (
            not is_due or  # duplicate wouldn't be created
            not is_task_in_queue(entry.task)  # not in queue so let it run
        ):
            return schedstate(is_due, next_time_to_run)

        # Task should be run (is_due) and it is present in queue (is_task_in_queue)
        H = self._heap
        if not H:
            return schedstate(False, self.max_interval)

        event = H[0]
        verify = heappop(H)
        if verify is event:
            next_entry = self.reserve(entry)
            heappush(H, event_t(self._when(next_entry, next_time_to_run), event[1], next_entry))
        else:
            heappush(H, verify)
            next_time_to_run = min(verify[0], next_time_to_run)

        return schedstate(False, min(next_time_to_run, self.max_interval))

How to ensure only one job fires at a time in Quartz.NET?

I have a Windows Service that uses Quartz.NET to execute jobs that are scheduled. I only want it to pick up a single job at a time. However, occasionally I am seeing behavior that indicates that it has picked up two jobs at once.
There are two log files (the regular one and one automatically generated when the regular one is in use) with jobs that start at the exact same time. I can see both jobs executing in the QRTZ_FIRED_TRIGGERS table, but only one has the correct instance ID, which is odd.
I have configured Quartz to use only a single thread. Is this not how you tell it to only pick up a single job at a time?
Here is my quartz.config file with sensitive values hashed out:
quartz.scheduler.instanceName = DefaultQuartzJobScheduler
quartz.scheduler.instanceId = ######################
quartz.jobstore.clustered = true
quartz.jobstore.clusterCheckinInterval = 15000
quartz.threadPool.type = Quartz.Simpl.SimpleThreadPool, Quartz
quartz.jobStore.useProperties = false
quartz.jobStore.type = Quartz.Impl.AdoJobStore.JobStoreTX, Quartz
quartz.jobStore.driverDelegateType = Quartz.Impl.AdoJobStore.OracleDelegate, Quartz
quartz.jobStore.tablePrefix = QRTZ_
quartz.jobStore.lockHandler.type = Quartz.Impl.AdoJobStore.UpdateLockRowSemaphore, Quartz
quartz.jobStore.misfireThreshold = 60000
quartz.jobStore.dataSource = default
quartz.dataSource.default.connectionString = ######################
quartz.dataSource.default.provider = OracleClient-20
# Customizable values per Node
quartz.threadPool.threadCount = 1
quartz.threadPool.threadPriority = Normal
Make the threadcount = 1.
<add key="quartz.threadPool.threadCount" value="1"/>
<add key="quartz.threadPool.threadPriority" value="Normal"/>
(as you have done)
Make each of your jobs "Stateful"
[PersistJobDataAfterExecution]
[DisallowConcurrentExecution]
public class StatefulDoesNotRunConcurrentlyJob : IJob /* : IStatefulJob */ /* Error 43: 'Quartz.IStatefulJob' is obsolete: 'Use DisallowConcurrentExecutionAttribute and/or PersistJobDataAfterExecutionAttribute annotations instead.' */
{
    public void Execute(IJobExecutionContext context)
    {
        // the job's actual work goes here
    }
}
I've left in the name of the older version of how to do this (namely, IStatefulJob) and the error message that is generated when you code against the outdated IStatefulJob interface, because the error message gives the hint.
Basically, if you have 1 thread AND every job is marked with "DisallowConcurrentExecution", it should result in 1 job at any given time, running in "serial mode".