I have a MongoDB collection with tasks. Each task has an interval in seconds, a task identifier, and a payload that should be sent via HTTP POST to gather results, which are then stored in another collection.
There may be thousands of tasks with different intervals, and I cannot figure out how to schedule them.
Currently I'm polling by last execution time every 10 ms, but it puts a heavy load on the DB.
It looks like this:
mongo = require('mongodb')
request = require('request')

mongo.MongoClient.connect(MONGO_URL, (err, db) ->
  handle_error(err)

  schedule = (collection) ->
    # find every enabled task whose interval has elapsed since its last run
    collection.find({isEnabled: true, '$where': '((new Date()).getTime() - this.timestamp) > (this.checkInterval * 60 * 1000)'}).toArray((err, docs) ->
      handle_error(err)
      for doc in docs
        # mark the task as executed now, then post its payload and store the result
        collection.update({_id: doc._id}, {'$set': {timestamp: (new Date()).getTime()}}, {w: 1})
        task = prepare(doc)
        request.post({url: url, formData: {task: JSON.stringify(task)}}, (err, httpResponse, body) ->
          result = JSON.parse(body)
          console.log(result)
          db.collection(MONGO_COLLECTION_RESULTS).save({
            task: result.id,
            type: result.type,
            data: result
          })
        )
      # poll again in 10 ms
      setTimeout((() -> schedule(collection)), 10)
    )

  setTimeout((() -> schedule(db.collection(MONGO_COLLECTION_TASKS))), 10)
)
Tasks can be added, updated, and deleted, and I have to handle that as well.
What about using Redis? But I have no clue how to sync the data from Mongo to Redis when some tasks are waiting for results, an interval has changed, etc.
Please advise on the best strategy for this.
I don't think this is the right way to solve your use case.
I would suggest not storing the tasks in a database at all, but scheduling them directly when they come in and saving the result, with or without the original task information.
Why not use Quartz to schedule the tasks?
If you know the tasks to be run in advance, you can schedule them with the Unix crontab, which runs a script that connects to the DB or sends HTTP requests.
If each task is unique and you cannot pre-schedule them that way, perhaps you can keep your current DB collections, but poll the DB less often.
If it is not critical that the tasks are executed exactly on time, I would do a DB lookup maybe once every 10 seconds to see which tasks should have been executed since the last lookup.
One way of reducing the DB load would be to query for tasks ordered by when they should be executed, fetching everything that is due within the next minute or so. Then you (hopefully) have a small number of tasks in memory and can set a JavaScript timeout for when each should run. If too many tasks are due at the same time, fetching them all from the DB at once could be problematic.
The essence is to batch several tasks from the DB into memory and handle some of the scheduling there; a sketch of the idea follows.
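A minimal sketch of that batching idea with the Node.js MongoDB driver (TypeScript). The nextRunAt field, the database/collection names, and the runTask callback are assumptions for illustration; your schema would derive the due time from timestamp + checkInterval instead.

import { MongoClient, Collection, Document } from "mongodb";

const LOOKAHEAD_MS = 60_000; // fetch everything due within the next minute

// Assumed task shape: { _id, isEnabled, nextRunAt: epoch ms, payload }
async function scheduleBatch(tasks: Collection<Document>, runTask: (task: Document) => void) {
  const now = Date.now();
  const due = await tasks
    .find({ isEnabled: true, nextRunAt: { $lte: now + LOOKAHEAD_MS } })
    .sort({ nextRunAt: 1 })
    .toArray();

  for (const task of due) {
    // Hold the small batch in memory and let the event loop fire each task on time.
    setTimeout(() => runTask(task), Math.max(0, task.nextRunAt - now));
  }

  // Query for the next batch once this window has elapsed.
  setTimeout(() => void scheduleBatch(tasks, runTask), LOOKAHEAD_MS);
}

async function main() {
  const client = await MongoClient.connect("mongodb://localhost:27017"); // assumed URL
  const tasks = client.db("scheduler").collection("tasks");              // assumed names
  await scheduleBatch(tasks, (task) => console.log("run", task._id));
}

main().catch(console.error);

The worker that actually runs a task should also advance its nextRunAt (for example with findOneAndUpdate) so the next batch query doesn't pick it up again.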
I have to post thousands of records to an endpoint that receives them and inserts them into a table. The back-end, made in Elixir (with PostgreSQL), knows beforehand how many records will arrive in total. The front-end, on the other hand, sends the records simultaneously and in parts. For example, I have to send 100 records in chunks of 10, so 10 POST requests are made to this endpoint. This works fine, but the problem is that when the last chunk is inserted, I want to spawn a process that makes some calculations on the records, and I want this process to spawn only once. How can I achieve this?
def bulk_create(conn, %{"records" => records_params}) do
  with {rows_affected, nil} <- Api.create_records(records_params) do
    group = Api.get_group!(records_params |> first() |> Map.get("group_id"))
    total_records_saved = Api.count_records_by_group_id(group.id)

    finished = group.total_records == total_records_saved

    pid =
      if finished do
        {:ok, pid} =
          Processor.start_link(%{group_id: group.id, name: String.to_atom("PID.#{group.id}")})

        :erlang.pid_to_list(pid) |> List.delete_at(0) |> List.delete_at(-1) |> List.to_string()
      else
        nil
      end

    conn
    |> put_status(:created)
    |> render("bulk_create.json",
      rows_affected: rows_affected,
      finished: finished,
      pid: pid
    )
  end
end
This seems to work fine, but if for any reason two requests insert at the same time, they could both get true on this line
finished = group.total_records == total_records_saved
and both would spawn the process.
You need state, and once you need state in OTP, you launch a process.
It might be a simple Agent, or you might use the built-in :counters or :persistent_term.
The idea is that you update the counter upon each DB insert, which is an atomic operation, instead of relying on querying the DB, which is vulnerable to race conditions.
Another way around it would be to spawn a “throttler” which would queue DB requests, execute them, and ensure the calculation and cleanup start before the N+1th record arrives.
16:37:21.945 [Workflow Executor taskList="PullFulfillmentsTaskList", domain="test-domain": 3] WARN com.uber.cadence.internal.common.Retryer - Retrying after failure
org.apache.thrift.transport.TTransportException: Request timeout after 1993ms
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:546)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:519)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:962)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$11(WorkflowServiceTChannel.java:951)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:569)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:949)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:301)
at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:301)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:229)
at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Our parent workflow code is basically like this (JSONObject is from org.json):
JSONObject[] array = restActivities.getArrayWithHugeJSONItems();
for (JSONObject hugeJSON : array) {
    ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
    child.run(hugeJSON);
}
What we found is that most of the time, the parent workflow worker fails to start the child workflow and throws the timeout exception above. It retries like crazy but never succeeds, printing the timeout exception over and over again. Sometimes we get lucky and it works, and sometimes it fails even earlier, at the activity worker, with the same exception. We believe this is because the data is too big (about 5 MB) and cannot be sent within the timeout (judging from the log, we guess it is set to 2 s). If we call child.run with small fake data, it works 100% of the time.
The reason we use child workflows is that we want to use Async.function to run them in parallel. So how can we solve this problem? Is there a Thrift timeout config we should increase, or can we somehow avoid passing huge data around?
Thank you in advance!
---Update after Maxim's answer---
Thank you. I read the example, but I still have some questions for my use case. Let's say I get an array of 100 huge JSON objects in my RestActivitiesWorker. If I should not return the huge array to the workflow, I need to make 100 calls to the database to create 100 rows of records, put the 100 ids in an array, and pass that back to the workflow. The workflow then creates one child workflow per id. Each child workflow then calls another activity with the id to load the data from the DB. But that activity has to pass the huge JSON to the child workflow; is this OK? And if the RestActivitiesWorker makes 100 inserts into the DB, what happens if it fails in the middle?
I guess it boils down to the fact that our workflow is trying to work directly with huge JSON. We are trying to load huge JSON (5-30 MB, not that huge) from an external system into our system. We break the JSON down a little, manipulate a few values, use values from a few fields for some branching logic, and finally save it in our DB. How should we do this with Temporal?
Temporal/Cadence doesn't support passing large blobs as inputs and outputs, as it uses a DB as the underlying storage. So you want to change the architecture of your application to avoid this.
The standard workarounds are:
Use an external blob store to save the large data and pass a reference to it as a parameter (a sketch of this follows below).
Cache the data in a worker process, or even on the host disk, and route activities that operate on this data to that process or host. See the fileprocessing sample for this approach.
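A rough sketch of the first workaround in plain TypeScript (not the Cadence/Temporal SDK; the BlobStore interface and key format are made up for illustration): the activity that fetches the data writes each item to a blob store and returns only small keys, and whatever runs later loads the blob by key.

import { randomUUID } from "crypto";

// Hypothetical blob-store interface; swap in S3, GCS, or a shared filesystem.
interface BlobStore {
  put(key: string, data: Buffer): Promise<void>;
  get(key: string): Promise<Buffer>;
}

// "Activity" that fetches the huge JSON items: store each payload, return only small references.
async function stashHugeItems(store: BlobStore, items: object[]): Promise<string[]> {
  const keys: string[] = [];
  for (const item of items) {
    const key = `payloads/${randomUUID()}.json`;
    await store.put(key, Buffer.from(JSON.stringify(item)));
    keys.push(key); // only the key travels through workflow history
  }
  return keys;
}

// "Activity" run by each child workflow: load the blob by reference and process it.
async function processByReference(store: BlobStore, key: string): Promise<void> {
  const payload = JSON.parse((await store.get(key)).toString());
  // ... manipulate a few fields and save the result to your own DB ...
  console.log("processed", key, Object.keys(payload).length, "fields");
}

In the real workflow only the keys go through Cadence/Temporal history, which keeps each payload out of the size limits entirely.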
Our Java app saves its configurations in a MongoDB collection. When the app starts, it reads all the configurations from MongoDB and caches them in Maps. We would also like to use the change stream API to watch for updates to the configuration collection.
So, upon app startup, we would first like to get all configurations, and from then on watch for any further changes.
Is there an easy way to execute the following atomically:
1. A find() that retrieves all configurations (documents)
2. Start a watch() that will send all further updates
By atomically I mean without potentially missing any update (between 1 and 2, someone could update the collection with a new configuration).
To make sure I lose no update notifications, I found that I can use watch().startAtOperationTime(serverTime) (for MongoDB 4.0 or later), as follows:
1. Query the MongoDB server for its current time, using a command such as Document hostInfoDoc = mongoTemplate.executeCommand(new Document("hostInfo", 1))
2. Query for all interesting documents: List<C> configList = mongoTemplate.findAll(clazz);
3. Extract the server time from hostInfoDoc: BsonTimestamp serverTime = (BsonTimestamp) hostInfoDoc.get("operationTime");
4. Start the change stream configured with the saved server time: ChangeStreamIterable<Document> changes = eventCollection.watch().startAtOperationTime(serverTime);
Since 1 ends before 2 starts, we know that the documents returned by 2 are at least as fresh as the ones at that server time, and any updates that happen on or after that server time will be sent to us by the change stream. (I don't mind processing redundant updates, because I use the Map as a cache, so an extra add/remove makes no difference, as long as the last action arrives.)
I think I could also use watch().resumeAfter(_idOfLastAddedDoc) (I didn't try it). I did not use this approach because of the following scenario: the collection is empty, and the first document is added after getting all (i.e. none of the) documents and before starting the watch(). In that scenario I have no previous document _id to use as a resume token.
Update
Instead of using "hostInfo" to get the server time, which couldn't be used in our production environment, I ended up using "dbStats" like this:
Document dbStats = mongoOperations.executeCommand(new Document("dbStats", 1));
BsonTimestamp serverTime = (BsonTimestamp) dbStats.get("operationTime");
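For comparison, here is roughly the same find-then-watch sequence with the plain Node.js driver (a sketch only; the database/collection names are assumptions, and operationTime appears in command replies on replica sets, which change streams require anyway):

import { MongoClient, Timestamp } from "mongodb";

async function loadThenWatch() {
  const client = await MongoClient.connect("mongodb://localhost:27017"); // assumed URL
  const db = client.db("app");                // assumed database name
  const configs = db.collection("configs");   // assumed collection name

  // Step 1/3: ask the server for its current operation time via any command.
  const stats = await db.command({ dbStats: 1 });
  const serverTime = stats.operationTime as Timestamp;

  // Step 2: load the initial snapshot into the in-memory cache.
  const cache = new Map<string, unknown>();
  for (const doc of await configs.find({}).toArray()) {
    cache.set(String(doc._id), doc);
  }

  // Step 4: watch from the saved server time, so nothing between the find and the watch is missed.
  const stream = configs.watch([], {
    startAtOperationTime: serverTime,
    fullDocument: "updateLookup",
  });
  for await (const change of stream) {
    // Apply the change to the cache here; re-applying an update that is already
    // reflected in the snapshot is harmless for a map-based cache.
    console.log(change.operationType);
  }
}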
I am looking for an efficient solution for a case where I have to handle thousands of records and save them to a MongoDB database.
I can't lose data, but a delay in saving is acceptable.
I'm searching for the right solution. From what I've read, I'm thinking about:
1. Using Redis for caching the records and bulk inserting them with a cron job.
2. Using a task queue for buffering the records and bulk inserting with a cron job (see the sketch after the schema below).
For example, RabbitMQ or Google's task queue: is there a way to set a consume condition on buffer length x, or on time passed since the last delivery (like a cron job)?
I am using Node.js and MongoDB.
Tracker schema:
let trackSchema = {
  threadId: string,
  updated: number,  // (timestamp)
  created: number,  // (timestamp)
  count: number,    // number of opens
  trace: [
    {
      created: number,  // (timestamp)
      userAgent: string
    }
  ]
}
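Here is a minimal sketch of the in-process buffering idea in TypeScript (essentially option 2 without an external broker); the flush thresholds, connection URL, and database/collection names are arbitrary:

import { MongoClient, Collection, Document } from "mongodb";

const FLUSH_SIZE = 500;         // flush when this many records are buffered (arbitrary)
const FLUSH_INTERVAL_MS = 5000; // ...or when this much time has passed (arbitrary)

class TrackBuffer {
  private buffer: Document[] = [];

  constructor(private collection: Collection) {
    // Time-based flush, so records are never delayed for more than FLUSH_INTERVAL_MS.
    setInterval(() => void this.flush(), FLUSH_INTERVAL_MS).unref();
  }

  async add(record: Document): Promise<void> {
    this.buffer.push(record);
    if (this.buffer.length >= FLUSH_SIZE) {
      await this.flush(); // size-based flush
    }
  }

  async flush(): Promise<void> {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    try {
      await this.collection.insertMany(batch, { ordered: false });
    } catch (err) {
      // On failure, put the batch back so it is retried on the next flush.
      this.buffer = batch.concat(this.buffer);
      console.error("bulk insert failed, will retry", err);
    }
  }
}

async function main() {
  const client = await MongoClient.connect("mongodb://localhost:27017"); // assumed URL
  const buffer = new TrackBuffer(client.db("tracker").collection("tracks"));
  await buffer.add({ threadId: "abc", created: Date.now(), count: 1, trace: [] });
}

main().catch(console.error);

The caveat is that records buffered in memory are lost if the process crashes before a flush, which is exactly where Redis or a durable queue such as RabbitMQ earns its keep if you truly cannot lose data.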
I have a server running an IPython controller and 12 IPython engines. I connect to the controller from my laptop using SSH. I submitted some jobs to the controller using the load-balanced view interface (in non-blocking mode) and stored the message IDs from the AsyncResult object returned by the apply_async() method.
I accidentally lost the message IDs for the jobs and wanted to know if there's a way to retrieve the job IDs (or the results) from the Hub database. I use a SQLite database for the Hub, and I can get the rc.db_query() method to work, but I don't know what to look for.
Does anyone know how to query the Hub database only for message IDs of the jobs I submitted? What's the easiest way of retrieving the job results from the Hub, if I don't have access to the AsyncHubResult object (or their message IDs)?
Thanks!
Without the message IDs, you might have a pretty hard time finding the right tasks, unless there haven't been many tasks submitted.
The query syntax is based on MongoDB's (it's a passthrough when you use MongoDB, and a subset of simple operators is implemented for SQLite).
Quick summary: a query is a dict. If you use literal values, they are equality tests, but you can use dict values for comparison operators.
You can search by date for any of the timestamps:
submitted: arrived at the controller
started: arrived on an engine
completed: finished on the engine
For instance, to find tasks submitted yesterday:
from datetime import date, time, timedelta, datetime
# round to midnight
today = datetime.combine(date.today(), time())
yesterday = today - timedelta(days=1)
rc.db_query({'submitted': {
'$lt': today, # less than midnight last night
'$gt': yesterday, # greater than midnight the night before
}})
or all tasks submitted 1-4 hours ago:
found = rc.db_query({'submitted': {
'$lt': datetime.now() - timedelta(hours=1),
'$gt': datetime.now() - timedelta(hours=4),
}})
With the results of that, you can look at keys like client_uuid to retrieve all messages submitted by a given client instance (e.g. a single notebook or script):
client_uuid = found[0]['client_uuid']
all_from_client = rc.db_query({'client_uuid': client_uuid})
Since you are only interested in results at this point, you can specify keys=['msg_id'] to only retrieve the message IDs. We can then use these msg_ids to get all the results produced by a single client session:
# construct list of msg_ids
msg_ids = [ r['msg_id'] for r in rc.db_query({'client_uuid': client_uuid}, keys=['msg_id']) ]
# use client.get_result to retrieve the actual results:
results = rc.get_result(msg_ids)
At this point, you have all of the results, but you have lost the association of which results came from which execution. There isn't a lot of info to help you out there, but you might be able to tell by type, timestamps, or perhaps select the 9 final items from a given session.