How to Implement Queue Based Workflow System? - workflow

I'm working on a document management system. An example workflow would be something like this:
A document is emailed to the system
The system does a number of preparatory actions to the document
Document is presented to a user for further processing
Afterwards, document is sent to Quality Assurance
Afterwards, the system does a number of post-processing actions to the document
Document is considered completely processed and disseminated (e.g. emailed back to whoever emailed the document to the system, etc.)
Since the volume of my input will vary (but will usually be high), I am very concerned about scalability.
For example, say the system has already downloaded the email attachments. If the attachments are PDF documents, the system needs to split the PDF into individual pages, then convert each page into multiple size thumbnails, etc. I plan to have a cron job check (say, every minute) to see if there are any PDF documents that need to be processed. Using a flagging system (e.g. "PDF Document Ready to be Processed"), I can check the database for all PDF documents that are flagged to be processed. Once the PDF processing is done, the flag can be updated to say "PDF Processing Done."
However, since the processing of each PDF document is very time consuming, I am concerned that when the next cron job is executed, that cron job will also try to process the PDFs that the previous cron job is still processing.
A possible solution is to immediately flag the PDF documents with "PDF Document Currently Being Processed." That way, when the next cron job is executed, it will exclude the ones already being processed.
Thus, each step in the workflow will probably have 3 flags:
PDF Document Ready to be Processed
PDF Document Currently Being Processed
PDF Processing Done
Same for QA:
Document Ready for QA
Document Currently Being QAd
Document QA Done
Is this a good approach? Is there a better approach? Would I have these flags as a single column of the "PDF Document" table in the database? Or should the flags be in their own table (e.g. especially if a document can have multiple flags set)?
I'd like to solicit suggestions on how to implement such a system.

To solve your concern about concurrent processing on the same document, you can use many scheduler packages to help you manage this aspect. http://www.quartz-scheduler.org/ is one I've used with great success.
To address your problem, I'd have the 3 states, received, queued, processed (similar to what you suggest).
I'd have a scheduled recurring job which polls the database, looking for received pdfs, and for each, queue a job to process and mark the pdf as queued. If you ensure this happens in the same transaction, and utilize optimistic locking, there is no risk another job could come along and re-read this as received.
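The claim step described above can be sketched as follows. This is a minimal illustration in Python with SQLite rather than Quartz, under stated assumptions: the table name pdf_document, the version column, and the helper names make_demo_db and claim_received are all hypothetical. The conditional UPDATE succeeds only if nobody else changed the row since it was read, which is the optimistic-locking idea.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database with a hypothetical pdf_document table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pdf_document ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " version INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO pdf_document (status) VALUES (?)",
        [("received",), ("received",), ("processed",)],
    )
    conn.commit()
    return conn


def claim_received(conn):
    """Mark 'received' documents as 'queued'; return the ids this caller claimed."""
    candidates = conn.execute(
        "SELECT id, version FROM pdf_document "
        "WHERE status = 'received' ORDER BY id"
    ).fetchall()
    claimed = []
    for doc_id, version in candidates:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE pdf_document "
                "SET status = 'queued', version = version + 1 "
                "WHERE id = ? AND status = 'received' AND version = ?",
                (doc_id, version),
            )
        # rowcount is 0 if another poller changed the row after we read it
        if cur.rowcount == 1:
            claimed.append(doc_id)
    return claimed
```

A poller that loses the race simply sees rowcount 0 and skips that document, so no separate "currently being processed" check is needed before the update.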
Quartz uses a thread pool, with many configuration options, and is great for deferred, resource-intensive processing (I use it for image thumbnailing in a server setting).
To take a step back, there are some great workflow packages in the java world which can handle most of what you want to do, including the deferred pdf processing. Take a look at jbpm or drools flow, these are two great, if complex, packages.
UPDATE: Drools Flow has been merged into JBPM. For this particular problem it may be a bit of a "killing a mosquito with a bazooka" situation, but it's a great workflow package.

The solution kind of depends on what technologies you are using to implement this system. Is the pre/post-processing done by the same software/language as the emailing software? Additionally, are they running in separate processes?
If you have distributed components you could do much worse than investigating an AMQP solution like RabbitMQ, as this takes care of putting each job into a queue, and making sure that only one of your consumers takes each job. (we'd model each thumbnailing job as individual tasks).
If, however, the entire system is implemented in one language and inside a single process, there are some simpler systems you can use:
Resque is a good solution for Ruby
In Java, a LinkedBlockingQueue would work well
Uh, I'm sure C# will have some way of creating a queue of jobs (disclaimer: I know nothing of C#)
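For the single-process case, the same idea can be sketched in Python, whose standard queue.Queue plays a role similar to Java's LinkedBlockingQueue. This is only an illustration: the function name run_jobs and its parameters are made up for the example, and a real thumbnailing job would replace the handler.

```python
import queue
import threading


def run_jobs(jobs, handler, workers=4):
    """Run handler(job) for each job on a small pool of worker threads.

    Returns a list of (job, outcome) pairs in completion order. If the
    handler raises, the exception object is stored as the outcome.
    """
    pending = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = pending.get()  # blocks until a job is available
            if job is None:  # sentinel: no more work for this worker
                pending.task_done()
                return
            try:
                outcome = handler(job)
            except Exception as exc:  # keep the worker alive, record the failure
                outcome = exc
            with results_lock:
                results.append((job, outcome))
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for job in jobs:
        pending.put(job)
    for _ in threads:
        pending.put(None)  # one sentinel per worker
    pending.join()
    for t in threads:
        t.join()
    return results
```

Because each get() hands a job to exactly one worker, no job is processed twice, which is the property the question is after.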

Related

MarkLogic DLS Document Checkout Timeout

Does MarkLogic DLS offer a similar file versioning experience to subversion?
Under Subversion, once a file (document) has been locked, others cannot update it until it has been committed (checked in) or the lock has been released.
However in MarkLogic Library Services (DLS), once the document has been checked out, others could still call dls:document-checkout-update-checkin to update and release the lock. Does it mean it is the developer who should use those dls functions to implement the file lock and unlock mechanism?
I tried to use the timeout parameter in dls:document-checkout. However, it seems the document remains in the checked-out status forever, although I do see that parameter when I call 'dls:document-checkout-status'.
Does it mean that it is the developer who should check the server timestamp together with the initial checkout timestamp and timeout duration to determine whether the file is still in lock status?
If so, I will need to write some XQuery programs and set up a scheduled task in ML to clean up the file checkout daily. Is my above understanding correct?
Per https://docs.marklogic.com/guide/app-dev/dls#id_56448, I believe the timeout is not enforced automatically - i.e. there's no background process in MarkLogic that is periodically inspecting documents to see if they should be automatically checked back in or un-checked out. The timeout appears to be meant to be used by a developer to apply their own logic with it - e.g. allowing a UI to state that "Jane checked this document out and only intended to keep it for 10 minutes, but that was 2 hours ago - would you like to break her checkout?"
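The developer-side check described above comes down to comparing timestamps. The sketch below shows that arithmetic in Python rather than XQuery; the function name checkout_expired and the convention that None means "no timeout" are assumptions made for the example, not part of the DLS API.

```python
from datetime import datetime, timedelta, timezone


def checkout_expired(checked_out_at, timeout_seconds, now=None):
    """Return True if a checkout has outlived its intended duration.

    checked_out_at:  timezone-aware datetime of the checkout
    timeout_seconds: intended duration; None is treated as "no timeout"
                     (an assumption made for this sketch)
    now:             current time, injectable for testing
    """
    if timeout_seconds is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= checked_out_at + timedelta(seconds=timeout_seconds)
```

A scheduled cleanup task, or a UI prompt like the one in the example above, would call a check of this kind against the stored checkout timestamp and timeout before deciding whether to break the checkout.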

Workflow platform for managing the processing of incoming files

In general, I have a single workflow that I want to be able to monitor. The workflow should start whenever new files arrive or alternatively at certain scheduled times, i.e. I want to be able to insert new "jobs" to the workflow as they come, and process the files by going through multiple different tasks and steps. I want to be able to monitor each file going through the tasks.
The queues and distributing the load for each task might be managed by Celery, but it's not decided yet either.
I've looked at Apache Airflow and, as far as I understand at the moment, it is geared more towards monitoring many different workflows, where each workflow mostly runs from start to end, rather than adding new files to the beginning of the flow before the previous run has ended.
Cadence workflow seems like it can do what I need, but it also seems to be a bit of an overkill.
I'm not expecting a specific final solution here, but I would appreciate suggestions to more such solutions that I can look into and can fit the above.
Luigi - https://luigi.readthedocs.io/en/stable/
Extremely light-weight and fast compared to Airflow.

Stuck with understanding how to build a scalable system

I need some guidance on how to properly build out a system that will be able to scale. I will give you some information about what I am trying to do and then ask my specific question.
I have a site where I want visitors to send some data to be processed. They input the data into a textarea or upload it in a file. Simple. The data is somewhat preprocessed on the client side before a POST request is made to a REST endpoint.
What I am stuck on is finding a good way to take this posted data, store it, and associate it with an id that references the user, since I cannot process the data fast enough to return it to the user in a reasonable amount of time.
This question is a bit vague and open to opinion, I admit it. I just need a push in the right direction to keep moving. What I have been considering is throwing the data into a message queue and then having some workers process the data elsewhere and when the data is processed alert the user as to where to find it with some sort of link to an S3 bucket or just a URL to a file. The other idea was to just run the request for each item to be processed against another end-point that already processes individual records in some sort of loop client side. The problem is as follows with this idea:
To process the data it may take somewhere from 30 minutes to 2 hours depending upon the amount that they want processed. It's not ideal for them to just sit there and wait for that to finish depending on the amount of records they need processed, so I have ruled this out mostly.
Any guidance would be very much appreciated as I don't have any coworkers to bounce things off of, nor do I know many people with the domain knowledge that I could freely ask. If this isn't the right place to ask this, could you point me in the right direction as to where it should be asked?
Chris
If I've got you right, your pipeline is:
Accept item from user
Possibly preprocess/validate it (?)
Put into some queue
Process data
Return result.
You may use one or several queues at stage (3). An entity from the user gets added to one of the queues. If it's big enough, it could be stored in S3 or similar storage, with only info about it put into the queue: link, add date, user id (or email or the like). Processors can pull items from the queue and give feedback to users.
If you have no strict requirements on order, things get much simpler: you don't need any sync between them. Treat all the components (upload acceptors, queues, storages and processors) as independent pools of processes. Monitor each pool separately. If there are bottlenecks, add machines to that pool.
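The "store the payload, queue only a reference" idea can be sketched as below. Everything here is a stand-in chosen for illustration: BLOB_STORE is a plain dict playing the role of S3, JOBS is an in-process queue playing the role of a message broker, and the names Job, accept_upload and process_next are hypothetical.

```python
import queue
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone

BLOB_STORE = {}       # stand-in for S3 or similar storage
JOBS = queue.Queue()  # stand-in for a real message queue


@dataclass(frozen=True)
class Job:
    blob_key: str      # link to the stored payload
    user_id: str       # who to notify when done
    added_at: datetime


def accept_upload(user_id, payload):
    """Store the payload and enqueue only a small reference to it."""
    key = f"uploads/{uuid.uuid4()}"
    BLOB_STORE[key] = payload
    job = Job(blob_key=key, user_id=user_id, added_at=datetime.now(timezone.utc))
    JOBS.put(job)
    return job


def process_next(transform):
    """Take one job, process its payload, and return (user_id, result_key)."""
    job = JOBS.get()
    result_key = job.blob_key.replace("uploads/", "results/", 1)
    BLOB_STORE[result_key] = transform(BLOB_STORE[job.blob_key])
    JOBS.task_done()
    return job.user_id, result_key
```

The returned user id and result key are what a processor would use to tell the user where to find the output, for example by emailing a link.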

Recommender API - Upload Usage Event

The documentation of this API is a little hard to understand in functional terms.
https://westus.dev.cognitive.microsoft.com/docs/services/Recommendations.V4.0/operations/577d91f77270320f24da2592
Upload a usage event to a model. If buildId is set to "-1", the event
is ingested against the Active Build of the model. Set the buildId is
set to null or 0, the events are ingested against the Active build, if
Active build doesn't exist, the events are not associated with any
build.
"is ingested against the Active Build of the model"
What does this mean?
What happens when you associate events to a build?
I have been sending events using the Upload usage event API, but I don't see any changes on the active build on the Data Statistics tab.
Any help to understand this would be appreciated.
I'm building a batch process to send new usage events, and right now my approach is this:
Upload New Usage File
Delete Old Usage file
Create New Build
Change Active Build
Delete Old Build
I was hoping that the other API, which just sends user events, would work, but since I can't make it work as expected, I changed to this approach.
Is this a good approach, or should I be doing this in a different way?
The upload usage file is a better approach than the upload usage event.
Reasons:
You get to send the events as one file thus decreasing your api usage count
You can always review and correct your usage files in case something is wrong. I do not see an api command to view/edit/delete uploaded events
You can reuse your usage files to recreate the model in case of an issue with the current one
Here is my own process during midnight:
Upload new usage file based on today's events
Create new build
Update my system to use new build number (since I have different build types in the same model)
Why this process?
Apparently, we will need to create a new build anyway for new usage data to be considered.
Per another post (answered by an authority on the subject)
After uploading a usage event you need to create a new build in
that model for the usage event to be considered as part of the
recommendations request.
You can check the whole post here
Also, as mentioned in the linked post, a few usage events may not be enough to change the recommendations if done real time / frequently thus wasting effort. So a batch process, using usage files and done once per day is the more pragmatic approach.

How to handle large amounts of scheduled tasks on a web server?

I'm developing a website (using a LAMP stack) which must handle many user-made scheduling tasks. It works as follows: a user creates an event and sets a date, and other users (as many as 63) may join. A few hours before the set date, the system must email each user subscribed to that event. And that's it.
However, I have never handled scheduling, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that gets all subscribers emails and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?
Why a separate cron job for each event? I've done something similar for a newsletter, with a cron job just running once per hour: if there are any newsletters to be sent, it just handles them. In your case you'd have a script that runs once every hour and gets a list of users for events that happen in the desired time interval.
It will work. As far as scalability goes, at the minimum make sure that the script runs in its own process so it doesn't bog down the server unnecessarily.
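A single hourly job of that kind might look roughly like this. It is sketched in Python with SQLite rather than PHP and MySQL, and the schema (event, subscriber, a notified flag), the three-hour lead time, and the function name due_reminders are all assumptions made for the example. The notified flag keeps a later run from mailing the same event again.

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical schema; starts_at is stored as 'YYYY-MM-DD HH:MM:SS' text.
SCHEMA = """
CREATE TABLE event (
    id INTEGER PRIMARY KEY,
    starts_at TEXT NOT NULL,
    notified INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE subscriber (
    event_id INTEGER NOT NULL REFERENCES event(id),
    email TEXT NOT NULL
);
"""


def due_reminders(conn, now, lead=timedelta(hours=3)):
    """Return (event_id, email) pairs for un-notified events starting within
    `lead` of `now`, and mark those events as notified in the same transaction."""
    start = now.isoformat(sep=" ", timespec="seconds")
    cutoff = (now + lead).isoformat(sep=" ", timespec="seconds")
    with conn:  # commit both the read and the flag update together
        rows = conn.execute(
            "SELECT e.id, s.email FROM event e "
            "JOIN subscriber s ON s.event_id = e.id "
            "WHERE e.notified = 0 AND e.starts_at > ? AND e.starts_at <= ? "
            "ORDER BY e.id, s.email",
            (start, cutoff),
        ).fetchall()
        conn.executemany(
            "UPDATE event SET notified = 1 WHERE id = ?",
            {(event_id,) for event_id, _ in rows},
        )
    return rows
```

The cron entry would run a script that calls this once per hour and sends one email per returned pair, so the number of scheduled jobs stays at one no matter how many events exist.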
Create a php-cli script perhaps?
I'm doing most of my work in Rails nowadays, and there's a wealth of background processing libraries. One of them is Resque, which uses a Redis server to keep track of the jobs.
I found a PHP clone https://github.com/chrisboulton/php-resque
Might be overkill for your use case, but give it a shot perhaps
If you would consider a proper framework that uses an application server (and not a simple webserver), Spring has a task scheduling layer that's simple to use. Scheduling jobs on the server really requires more than what a simple LAMP install can do, but I haven't used PHP in a while so maybe there's an equivalent.
Here's an article that compares some of your options.