Running concurrent mongoengine queries with asyncio - mongodb

I have 4 functions that basically build queries and execute them. I want to make them run simultaneously using asyncio. My implementation of asyncio seems correct, since non-MongoDB tasks run as they should (for example asyncio.sleep()). Here is the code:
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
tasks = [
    service.async_get_associate_opportunity_count_by_user(me, criteria),
    service.get_new_associate_opportunity_count_by_user(me, criteria),
    service.async_get_associate_favorites_count(me, criteria=dict()),
    service.get_group_matched_opportunities_count_by_user(me, criteria)
]
available, new, favorites, group_matched = loop.run_until_complete(asyncio.gather(*tasks))
stats['opportunities']['available'] = available
stats['opportunities']['new'] = new
stats['opportunities']['favorites'] = favorites
stats['opportunities']['group_matched'] = group_matched
loop.close()
# functions written in other file
@asyncio.coroutine
def async_get_associate_opportunity_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = **query that gets built from some other functions**
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of available: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities

@asyncio.coroutine
def get_new_associate_opportunity_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = **query that gets built from some other functions**
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of new: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities

@asyncio.coroutine
def async_get_associate_favorites_count(self, user, criteria={}, **kwargs):
    start_time = time.time()
    query = **query that gets built from some other functions**
    favorites = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of favorites: {}".format(run_time))
    yield from asyncio.sleep(2)
    return favorites

@asyncio.coroutine
def get_group_matched_opportunities_count_by_user(self, user, criteria=None, **kwargs):
    start_time = time.time()
    query = **query that gets built from some other functions**
    opportunities = Opportunity.objects(query).count()
    run_time = time.time() - start_time
    print("runtime of group matched: {}".format(run_time))
    yield from asyncio.sleep(2)
    return opportunities
The yield from asyncio.sleep(2) is just to show that the functions run asynchronously. Here is the output on the terminal:
runtime of group matched: 0.11431598663330078
runtime of favorites: 0.0029871463775634766
Timestamp function run time: 0.0004897117614746094
runtime of new: 0.15225648880004883
runtime of available: 0.13006806373596191
total run time: 2403.2700061798096
From my understanding, apart from the 2000 ms that the sleep function adds to the total run time, the total shouldn't be more than 155-160 ms, since that is the maximum run time among all the functions.
I'm currently looking into motorengine (a port of mongoengine 0.9.0) that apparently enables asynchronous MongoDB queries, but I think I won't be able to use it since my models have been defined using mongoengine. Is there a workaround to this problem?

The reason your queries aren't running in parallel is because whenever you run Opportunity.objects(query).count() in your coroutines, the entire event loop blocks because those methods are doing blocking IO.
So you need a MongoDB driver which can do async/non-blocking IO. You are on the correct path with trying to use motorengine, but as far as I can tell it's written for the Tornado asynchronous framework. To get it to work with asyncio you would have to hook up Tornado and asyncio. See http://tornado.readthedocs.org/en/latest/asyncio.html for how to do that.
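For illustration, a minimal sketch of that hookup (this applies to older Tornado versions; Tornado 5+ already runs on asyncio by default):
# Sketch: make Tornado's IOLoop run on top of the asyncio event loop so that
# Tornado-based libraries (such as motorengine) and asyncio coroutines share
# one loop. Assumes a Tornado version that ships tornado.platform.asyncio.
import asyncio
from tornado.platform.asyncio import AsyncIOMainLoop

AsyncIOMainLoop().install()       # asyncio becomes the backing loop for Tornado
loop = asyncio.get_event_loop()
loop.run_forever()                # Tornado callbacks and asyncio tasks now coexist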
Another option is to use asyncio-mongo, but it doesn't have a mongoengine-compatible ORM, so you might have to rewrite most of your code.
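If rewriting isn't an option, another workaround (a sketch only, not from the libraries above) is to keep the blocking mongoengine calls but push them onto a thread pool with loop.run_in_executor, so the event loop itself never blocks. The query variables below are hypothetical stand-ins for the queries built in the question:
# Sketch: offload blocking mongoengine .count() calls to a thread pool so the
# counts can overlap instead of running back to back.
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

def _count(query):
    # Runs in a worker thread; blocking IO is fine here.
    return Opportunity.objects(query).count()

@asyncio.coroutine
def async_count(loop, query):
    # Schedule the blocking call on the executor and wait for its result.
    return (yield from loop.run_in_executor(executor, _count, query))

loop = asyncio.get_event_loop()
available, new, favorites = loop.run_until_complete(asyncio.gather(
    async_count(loop, available_query),   # hypothetical pre-built queries
    async_count(loop, new_query),
    async_count(loop, favorites_query),
))
This keeps the existing mongoengine models; truly non-blocking IO would still require an async driver such as motor or motorengine.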

Related

Dash app connections to AWS postgres DB VERY SLOW

I've created a live-updating Dash app connected to a public-facing AWS Postgres database. I've put the db connection within my callback so it updates, but I find that it takes a very long time to retrieve data and create the graph, so much so that if the interval time is reduced to 10 seconds or less, no graph loads at all. I've tried to store the data in dcc.Store but the initial load still takes a very long time. My abbreviated code is written below. I'm assuming the lag comes from the engine connecting to the database, because I am only reading a few rows and columns. Is there any way to speed this up?
import plotly.graph_objs as go
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output, State
from plotly.subplots import make_subplots
from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import declarative_base
from sqlalchemy import Column, Integer, String, func, Date, ARRAY
from sqlalchemy.orm import sessionmaker
app = dash.Dash(__name__, external_stylesheets=[BS], suppress_callback_exceptions=True, update_title=None)
server = app.server

app.layout = html.Div([
    dcc.Store(id='time', storage_type='session'),
    dcc.Store(id='blood_pressure', storage_type='session'),
    html.Div(dcc.Graph(id='live-graph', animate=False), className='w-100'),
    html.Div(id="testing"),
    dcc.Interval(
        id='graph-update-BP',
        interval=30000,
        n_intervals=0
    )]), width={"size": 10, "offset": 0.5}),
@app.callback(
    dash.dependencies.Output('live-graph', 'figure'),
    dash.dependencies.Output('blood_pressure', 'data'),
    dash.dependencies.Output('time', 'data'),
    [dash.dependencies.Input('graph-update-BP', 'n_intervals')],
    Input('live-graph', 'relayoutData'),
)
def update_graph_scatter_1(n):
    trace = []
    blood_pressure = []
    time = []
    engine = create_engine("postgresql://username:password@address:5432/xxxxx", echo=True, future=True)
    Session = sessionmaker(bind=engine)
    session = Session()
    Base = automap_base()
    Base.prepare(engine, reflect=True)
    User = Base.classes.users
    Datex = Base.classes.data
    for instance in session.query(Datex).filter(Datex.user_id == 3).filter(Datex.date_time == 'Monday,Apr:26'):
        blood_pressure.append([instance.systolic, instance.mean, instance.diastolic])
        time.append(instance.time)
    for i in range(0, len(blood_pressure)):
        trace.append(go.Box(y=blood_pressure[i],
                            x=time[i],
                            line=dict(color='#6a92ff'),
                            hoverinfo='all'))
    fig = make_subplots(rows=1, cols=1)

    def append_trace():
        for i in range(0, len(trace)):
            fig.append_trace(trace[i], 1, 1)

    append_trace()
    return fig, blood_pressure, hr,
You can increase performance in your app in the following ways:
Non-programming methods:
If your app is deployed on AWS, ensure your app is connecting to your database over private IP. This reduces the number of networks your data has to traverse and will result in significantly lower latency.
Ensure your virtual machine has enough RAM. (If you're loading 2GB of data to a machine with 1GB available RAM, you're going to see the IO hit disk before loading to your program.)
Programming methods:
Modularize connecting to your database, and only do it once. This reduces the overhead of reserving resources and authenticating against the database.
import os

from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import sessionmaker

class DbConnection:
    """Use this class to connect to your database within a dash app"""
    def __init__(self, **kwargs):
        self.DB_URI = os.environ.get('DB_URI', kwargs.get('DB_URI'))
        self.echo = kwargs.get('echo', True)
        self.future = kwargs.get('future', True)
        # Now create the engine
        self.engine = create_engine(self.DB_URI, echo=self.echo, future=self.future)
        # Make the session maker
        self.session_maker = sessionmaker(bind=self.engine)

    @property
    def session(self):
        """Return a session as a property"""
        return self.session_maker()

# -------------------------------------------
# In your app, instantiate the database connection
# and map your base
my_db_connection = DbConnection()  # provide kwargs as needed
session = my_db_connection.session  # necessary to assign the property to a variable

# Map the classes
Base = automap_base()
Base.prepare(my_db_connection.engine, reflect=True)
User = Base.classes.users
Datex = Base.classes.data
Cache frequently queried data. Unless your data is massive and varies dramatically, you should expect better performance loading the data from disk (or RAM) on your machine than over the network from your database.
from functools import lru_cache

@lru_cache()
def get_blood_pressure(session, user_id, date):
    """Returns blood pressure for a given user for a given date"""
    blood_pressure, time = [], []
    query = session.query(Datex)\
        .filter(Datex.user_id == user_id)\
        .filter(Datex.date_time == date)
    # I like short variable names when interacting with db results
    for rec in query:
        time.append(rec.time)
        blood_pressure.append([rec.systolic, rec.mean, rec.diastolic])
    # finally
    return blood_pressure, time
Putting them all together, your callback should be a lot quicker:
def update_graph_scatter_1(n):
    # I'm not sure how these variables will be assigned
    # but you'll figure it out
    blood_pressure, time = get_blood_pressure(session=session, user_id=user_id, date='Monday,Apr:26')
    # Create new traces
    trace = []
    for i in range(0, len(blood_pressure)):
        trace.append(go.Box(
            y=blood_pressure[i],
            x=time[i],
            line=dict(color='#6a92ff'),
            hoverinfo='all'
        ))
    # Add to subplots
    fig = make_subplots(rows=1, cols=1)
    for i in range(0, len(trace)):
        fig.append_trace(trace[i], 1, 1)
    return fig, blood_pressure, time
Lastly, it looks like you're recreating your graph objects on each update. This is a heavy operation. I'd recommend updating the graph's data instead. I know this is possible, since I've done it in the past, but the solution is non-trivial, unfortunately. Perhaps an item for a later response or a follow-up question.
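As a rough illustration of that idea (a sketch only, reusing the component ids and imports from the question), dcc.Graph exposes an extendData property that appends points to existing traces instead of rebuilding the whole figure:
# Sketch: append new points to trace 0 of 'live-graph' on each interval tick,
# keeping at most 500 points, instead of recreating the figure from scratch.
@app.callback(
    Output('live-graph', 'extendData'),
    Input('graph-update-BP', 'n_intervals'),
)
def extend_live_graph(n):
    # new_x / new_y would come from the (cached) database query above
    new_x, new_y = get_latest_points()  # hypothetical helper
    return dict(x=[new_x], y=[new_y]), [0], 500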
Further reading: https://dash.plotly.com/performance

I use a celery beat worker to create a new process that uses billiard with daemonic set, but daemonic is not working

billiard version 3.5.0.5
celery 4.0.2
Steps to reproduce
I want to control long-running beat tasks: if a task has not finished, do not start a new run.
def lock_beat_task(gap_time, flush_time):
    def decorate(func):
        """
        Celery timed task: solves the problem of the previous task being executed again before it has finished.
        problem: Celery beat is similar to crontab; it schedules tasks at fixed times regardless of whether the previous run has finished.
        approach: With the help of the cache, the key is the function name. Each time the task is scheduled we check whether it is already running, and the key is deleted after execution.
        :param func:
        :return:
        """
        @wraps(func)
        def wrapper(*args, **kwargs):
            key_name = func.__name__
            logger.info('++++++++{} enter +++++++.'.format(key_name))
            monitor = BeatWokerMonitor(key_name, gap_time, flush_time)
            mo_process = monitor.monitor()
            if mo_process:
                try:
                    logger.info('++++++++{} is running.++++++++'.format(key_name))
                    func(*args, **kwargs)
                    logger.info('{} is graceful end.'.format(key_name))
                except KeyboardInterrupt as e:
                    monitor.logger.info('{} KeyboardInterrupt reset succ'.format(key_name))
                finally:
                    monitor.reset()
                    # mo_process.join()
            else:
                logger.info('{} is running or gap time is not over.'.format(key_name))
            logger.info('{} is end!---.'.format(key_name))
        return wrapper
    return decorate
class BeatWokerMonitor(object):
    """
    Used to maintain/monitor the health status of the beat worker. Call beat_worker_should_run:
    if there is no key_name in redis, or the time difference between the current time and the value
    stored at key_name is greater than gap_time, create a monitor daemon.
    This daemon is responsible for refreshing the time in key_name at a fixed interval.
    beat_worker_should_run returns True (should run); otherwise it returns None.
    """
    def __init__(self, key_name, gap_time, flush_time):
        """
        Times are in seconds.
        :param key_name:
        :param gap_time:
        :param flush_time:
        """
        self.key_name = key_name
        self.gap_time = gap_time
        self.flush_time = flush_time
        self.db_redis = get_redis_conn(11)
self.db_redis = get_redis_conn(11)
def start_monitor(self):
flush_key_process = Process(target=self.flush_redis_key_gap_time, name='{}_monitor'.format(self.key_name), daemon=True)
flush_key_process.start()
return flush_key_process
def monitor(self):
rd_key_value = self.get_float_rd_key_value()
if not rd_key_value:
v = time.time()
self.db_redis.set(self.key_name, v)
return self.start_monitor()
if time.time() - rd_key_value > self.gap_time:
return self.start_monitor()
def get_float_rd_key_value(self):
value = self.db_redis.get(self.key_name)
if value:
return float(value)
else:
return 0
def flush_redis_key_gap_time(self):
old_time = self.get_float_rd_key_value()
self.logger.info('{} start flush, now is {}'.format(self.key_name, old_time))
while 1:
if time.time() - old_time > self.flush_time:
now = time.time()
self.db_redis.set(self.key_name, now)
old_time = now
self.logger.info('{} monitor flush time {}'.format(self.key_name, now))
else:
self.logger.info('{} not flush {} , {}'.format(self.key_name, time.time() - old_time, self.flush_time))
def reset(self):
self.db_redis.delete(self.key_name)
And the task code; you can write some short-running task to test with.
@app.task
@lock_beat_task(5*60, 10*3)
@send_update_msg("update")
def beat_update_all():
    """
    :return:
    """
    from crontab.update import EAllTicket
    eall = EAllTicket()
    send_task_nums = eall.run()
    return send_task_nums
Expected behavior
I want this to run in the background, so I cannot use multiprocessing to create the child process and use billiard instead.
I want the monitor process to kill itself once beat_update_all() has finished.
Actual behavior
beat_update_all() finishes, but the monitor process is still running.
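For illustration only, a minimal sketch of one possible direction (hypothetical helper, not from the original post): a daemonic billiard child, like a multiprocessing one, is only cleaned up when its parent process exits, so the monitor can instead be terminated explicitly when the wrapped task finishes:
# Sketch: stop the monitor child explicitly rather than relying on daemon=True,
# which only takes effect when the parent worker process itself exits.
from billiard import Process

def run_with_monitor(func, monitor_target, *args, **kwargs):
    """Run func while a billiard child refreshes the lock; kill the child afterwards."""
    mo_process = Process(target=monitor_target, daemon=True)
    mo_process.start()
    try:
        return func(*args, **kwargs)
    finally:
        if mo_process.is_alive():
            mo_process.terminate()  # billiard.Process mirrors multiprocessing's API
            mo_process.join()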

airflow TriggerDagRunOperator how to change the execution date

I noticed that for scheduled tasks the execution date is set in the past, according to:
Airflow was developed as a solution for ETL needs. In the ETL world,
you typically summarize data. So, if I want to summarize data for
2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be
right after all data for 2016-02-19 becomes available.
However, when a dag triggers another dag the execution time is set to now().
Is there a way to have the triggered dags use the same execution time as the triggering dag? Of course, I can rewrite the template and use yesterday_ds; however, this is a tricky solution.
The following class expands on TriggerDagRunOperator to allow passing the execution date as a string that then gets converted back into a datetime. It's a bit hacky but it is the only way I found to get the job done.
from datetime import datetime
import logging

from airflow import settings
from airflow.utils.state import State
from airflow.models import DagBag
from airflow.operators.dagrun_operator import TriggerDagRunOperator, DagRunOrder

class MMTTriggerDagRunOperator(TriggerDagRunOperator):
    """
    MMT-patched for passing an explicit execution date
    (otherwise it's hard to hook the datetime.now() date).
    Use when you want to explicitly set the execution date on the target DAG
    from the controller DAG.

    Adapted from Paul Elliot's solution on the airflow-dev mailing list archives:
    http://mail-archives.apache.org/mod_mbox/airflow-dev/201711.mbox/%3cCAJuWvXgLfipPmMhkbf63puPGfi_ezj8vHYWoSHpBXysXhF_oZQ#mail.gmail.com%3e

    Parameters
    ------------------
    execution_date: str
        the custom execution date (jinja'd)

    Usage Example:
    -------------------
    my_dag_trigger_operator = MMTTriggerDagRunOperator(
        execution_date="{{execution_date}}",
        task_id='my_dag_trigger_operator',
        trigger_dag_id='my_target_dag_id',
        python_callable=lambda: random.getrandbits(1),
        params={},
        dag=my_controller_dag
    )
    """
    template_fields = ('execution_date',)

    def __init__(
            self, trigger_dag_id, python_callable, execution_date,
            *args, **kwargs
    ):
        self.execution_date = execution_date
        super(MMTTriggerDagRunOperator, self).__init__(
            trigger_dag_id=trigger_dag_id, python_callable=python_callable,
            *args, **kwargs
        )

    def execute(self, context):
        run_id_dt = datetime.strptime(self.execution_date, '%Y-%m-%d %H:%M:%S')
        dro = DagRunOrder(run_id='trig__' + run_id_dt.isoformat())
        dro = self.python_callable(context, dro)
        if dro:
            session = settings.Session()
            dbag = DagBag(settings.DAGS_FOLDER)
            trigger_dag = dbag.get_dag(self.trigger_dag_id)
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                execution_date=self.execution_date,
                conf=dro.payload,
                external_trigger=True)
            logging.info("Creating DagRun {}".format(dr))
            session.add(dr)
            session.commit()
            session.close()
        else:
            logging.info("Criteria not met, moving on")
There is an issue you may run into when using this and not setting execution_date=now(): your operator will throw a MySQL error if you try to start a dag with an identical execution_date twice. This is because execution_date and dag_id are used to create the row index, and rows with identical indexes cannot be inserted.
I can't think of a reason you would ever want to run two identical dags with the same execution_date in production anyway, but it is something I ran into while testing, and you should not be alarmed by it. Simply clear the old job or use a different datetime.
The TriggerDagRunOperator now has an execution_date parameter to set the execution date of the triggered run.
Unfortunately the parameter is not in the template fields.
If it gets added to the template fields (or if you override the operator and change the template_fields value) it will be possible to use it like this:
my_trigger_task = TriggerDagRunOperator(task_id='my_trigger_task',
                                        trigger_dag_id="triggered_dag_id",
                                        python_callable=conditionally_trigger,
                                        execution_date='{{execution_date}}',
                                        dag=dag)
It has not been released yet but you can see the sources here:
https://github.com/apache/incubator-airflow/blob/master/airflow/operators/dagrun_operator.py
The commit that did the change was:
https://github.com/apache/incubator-airflow/commit/089c996fbd9ecb0014dbefedff232e8699ce6283#diff-41f9029188bd5e500dec9804fed26fb4
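For illustration, a minimal sketch of the override mentioned above (the class name is hypothetical; it assumes the installed TriggerDagRunOperator already accepts an execution_date argument, as in the linked commit):
# Sketch: subclass the operator only to expose execution_date to Jinja templating.
from airflow.operators.dagrun_operator import TriggerDagRunOperator

class TemplatedTriggerDagRunOperator(TriggerDagRunOperator):
    # Listing the field here makes '{{ execution_date }}' render before execute() runs
    template_fields = ('execution_date',)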
I improved the MMTTriggerDagRunOperator a bit. The execute method now checks whether the dag_run already exists and, if it does, restarts the dag using Airflow's clear function. This lets us create dependencies between dags, because being able to pass the execution date to the triggered dag opens up a whole universe of amazing possibilities. I wonder why this is not the default behavior in Airflow.
def execute(self, context):
    run_id_dt = datetime.strptime(self.execution_date, '%Y-%m-%d %H:%M:%S')
    dro = DagRunOrder(run_id='trig__' + run_id_dt.isoformat())
    dro = self.python_callable(context, dro)
    if dro:
        session = settings.Session()
        dbag = DagBag(settings.DAGS_FOLDER)
        trigger_dag = dbag.get_dag(self.trigger_dag_id)
        if not trigger_dag.get_dagrun(self.execution_date):
            dr = trigger_dag.create_dagrun(
                run_id=dro.run_id,
                state=State.RUNNING,
                execution_date=self.execution_date,
                conf=dro.payload,
                external_trigger=True
            )
            logging.info("Creating DagRun {}".format(dr))
            session.add(dr)
            session.commit()
        else:
            trigger_dag.clear(
                start_date=self.execution_date,
                end_date=self.execution_date,
                only_failed=False,
                only_running=False,
                confirm_prompt=False,
                reset_dag_runs=True,
                include_subdags=False,
                dry_run=False
            )
            logging.info("Cleared DagRun {}".format(trigger_dag))
        session.close()
    else:
        logging.info("Criteria not met, moving on")
There is a function available in the experimental API section of airflow that allows you to trigger a dag with a specific execution date. https://github.com/apache/incubator-airflow/blob/master/airflow/api/common/experimental/trigger_dag.py
You can call this function as part of a PythonOperator and achieve the objective.
So it will look like:
from datetime import datetime

from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.operators.python_operator import PythonOperator

trigger_operator = PythonOperator(task_id='YOUR_TASK_ID',
                                  python_callable=trigger_dag,
                                  op_args=['dag_id'],
                                  op_kwargs={'execution_date': datetime.now()})

Schedule Celery task to run after other task(s) complete

I want to accomplish something like this:
results = []
for i in range(N):
    data = generate_data_slowly()
    res = tasks.process_data.apply_async(data)
    results.append(res)

celery.collect(results).then(tasks.combine_processed_data())
i.e. launch asynchronous tasks over a long period of time, then schedule a dependent task that will only be executed once all the earlier tasks are complete.
I've looked at things like chain and chord, but it seems they only work if you can construct your task graph completely upfront.
For anyone interested, I ended up using this snippet:
@app.task(bind=True, max_retries=None)
def wait_for(self, task_id_or_ids):
    try:
        ready = app.AsyncResult(task_id_or_ids).ready()
    except TypeError:
        ready = all(app.AsyncResult(task_id).ready()
                    for task_id in task_id_or_ids)
    if not ready:
        self.retry(countdown=2**self.request.retries)
And writing the workflow something like this:
task_ids = []
for i in range(N):
task = (generate_data_slowly.si(i) |
process_data.si(i)
)
task_id = task.delay().task_id
task_ids.append(task_id)
final_task = (wait_for(task_ids) |
combine_processed_data.si()
)
final_task.delay()
That way you would be running your tasks synchronously.
The solution depends entirely on how and where the data are collected. Roughly, given that generate_data_slowly and tasks.process_data run one right after the other, a better approach would be to join both in one task (or a chain) and to group them.
chord will allow you to add a callback to that group.
The simplest example would be:
from celery import chord

@app.task
def getnprocess_data():
    data = generate_data_slowly()
    return whatever_process_data_does(data)

header = [getnprocess_data.s() for i in range(N)]
callback = combine_processed_data.s()

chord(header)(callback).get()

Replacing Celerybeat with Chronos

How mature is Chronos? Is it a viable alternative to scheduler like celery-beat?
Right now our scheduling implements a periodic "heartbeat" task that checks for "outstanding" events and fires them if they are overdue. We are using python-dateutil's rrule for defining this.
We are looking at alternatives to this approach, and Chronos seems a very attractive alternative: 1) it would remove the need for a heartbeat scheduling task, 2) it supports RESTful submission of events in ISO8601 format, 3) it has a useful interface for management, and 4) it scales.
The crucial requirement is that scheduling needs to be configurable on the fly from the web interface. This is why we can't use celerybeat's built-in scheduling out of the box.
Are we going to shoot ourselves in the foot by switching over to Chronos?
This SO question has solutions to your dynamic periodic task problem. It's not the accepted answer at the moment:
from datetime import datetime

from django.db import models
from djcelery.models import PeriodicTask, IntervalSchedule

class TaskScheduler(models.Model):
    periodic_task = models.ForeignKey(PeriodicTask)

    @staticmethod
    def schedule_every(task_name, period, every, args=None, kwargs=None):
        """ Schedules a task by name every "every" "period". So an example call would be:
            TaskScheduler.schedule_every('mycustomtask', 'seconds', 30, [1, 2, 3])
            that would schedule your custom task to run every 30 seconds with the arguments 1, 2 and 3 passed to the actual task.
        """
        permissible_periods = ['days', 'hours', 'minutes', 'seconds']
        if period not in permissible_periods:
            raise Exception('Invalid period specified')
        # create the periodic task and the interval
        ptask_name = "%s_%s" % (task_name, datetime.now())  # create some name for the periodic task
        interval_schedules = IntervalSchedule.objects.filter(period=period, every=every)
        if interval_schedules:  # just check if interval schedules like that already exist and reuse them
            interval_schedule = interval_schedules[0]
        else:  # create a brand new interval schedule
            interval_schedule = IntervalSchedule()
            interval_schedule.every = every  # should check to make sure this is a positive int
            interval_schedule.period = period
            interval_schedule.save()
        ptask = PeriodicTask(name=ptask_name, task=task_name, interval=interval_schedule)
        if args:
            ptask.args = args
        if kwargs:
            ptask.kwargs = kwargs
        ptask.save()
        return TaskScheduler.objects.create(periodic_task=ptask)

    def stop(self):
        """pauses the task"""
        ptask = self.periodic_task
        ptask.enabled = False
        ptask.save()

    def start(self):
        """starts the task"""
        ptask = self.periodic_task
        ptask.enabled = True
        ptask.save()

    def terminate(self):
        self.stop()
        ptask = self.periodic_task
        self.delete()
        ptask.delete()
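For illustration, a hypothetical usage of the helper above (the task path and interval are made up):
# Sketch: register a periodic task from application code, then manage it later.
scheduler = TaskScheduler.schedule_every('myapp.tasks.send_report', 'minutes', 30)
scheduler.stop()       # pause the periodic task
scheduler.start()      # resume it
scheduler.terminate()  # remove it entirely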
I haven't used djcelery yet, but it supposedly has an admin interface for dynamic periodic tasks.