Celery periodic task workflow

I want to have a periodic Celery task that does the following:
1. Get the list of receipt items not yet scraped from the database.
2. Call a task to scrape a specific website for every item on the previous list. Each task will return a list of products.
3. Save the products to the database and update the receipts so they won't be used on the next call of the periodic task.
I think I could accomplish that by using chain, chunks and group, but I don't know how to do it. My idea was to chain the list from item 1 as input to item 2, but I would like every element of the list to become a separate task. It seems that "chunks" is a good candidate, but how can I pipe the list from item 1 to it? And how can I use the length of this list to define the number of tasks? Is it possible?
Then the returned values of each of those tasks would be piped to a "group" composed of the tasks that save the products to the DB and update the receipts.
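For reference, here is a minimal sketch of one way this could be wired together with a dispatcher task; all task names, the broker URL and the placeholder bodies are hypothetical. Because the dispatcher fans out one chain per element, the size of the group comes directly from the length of the list produced in item 1, so chunks is not strictly needed:

from celery import Celery, group

app = Celery("scraper", broker="redis://localhost:6379/0")

@app.task
def fetch_unscraped():
    # Item 1: return the receipt items not yet scraped
    # (placeholder for a real database query).
    return ["receipt-1", "receipt-2"]

@app.task
def scrape_site(item):
    # Item 2: scrape the website for one item; return its products.
    return [item + "-product"]

@app.task
def save_products(products, item):
    # Item 3: persist the products and mark the receipt item as
    # scraped so the next periodic run skips it.
    pass

@app.task
def dispatch(items):
    # Fan out one scrape-then-save chain per item; a chain passes the
    # previous task's result as the first argument of the next task.
    group(scrape_site.s(item) | save_products.s(item) for item in items).apply_async()

# The periodic task itself just chains item 1 into the dispatcher.
workflow = fetch_unscraped.s() | dispatch.s()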

Related

How do I allocate a fixed number of users per user class in Locust

Suppose I have 3 separate user classes. I want to allocate a fixed number of users to each class. My code is as below.

from locust import HttpUser, TaskSet

class User_1(TaskSet):
    # I need 3 users to execute the tasks within this user class
    pass

class User_2(TaskSet):
    # I need only 1 user to execute the tasks within this user class
    pass

class User_3(TaskSet):
    # I need only 1 user to execute the tasks within this user class
    pass

class API_User_Test(HttpUser):
    # I already tried weighting the classes as below.
    tasks = {User_1: 3, User_2: 1, User_3: 1}

I've already tried weighting the classes as shown in the code above, but it doesn't work: sometimes more than 1 user is allocated to class User_2 or class User_3. Can someone tell me how to fix this issue?
A weight in Locust is just a statistical weight and is not a guarantee. The weights determine how many times a task/user is put into a list to be selected from. When a new task/user is spawned, Locust randomly selects one from the list. Given your weights:
tasks = {User_1: 3, User_2: 1, User_3: 1}
Statistically speaking, spawning 5 users with weights 3/1/1 should get you 3/1/1, but it may not be that precise every time. While less likely, it's possible you could get 4/0/1 or 3/2/0 or 5/0/0.
From the Locust docs:
If the tasks attribute is specified as a list, each time a task is to be performed, it will be randomly chosen from the tasks attribute. If however, tasks is a dict - with callables as keys and ints as values - the task that is to be executed will be chosen at random but with the int as ratio. So with a task that looks like this:
{my_task: 3, another_task: 1}
my_task would be 3 times more likely to be executed than another_task.
Internally the above dict will actually be expanded into a list (and the tasks attribute is updated) that looks like this:
[my_task, my_task, my_task, another_task]
and then Python’s random.choice() is used to pick tasks from the list.
If you absolutely have to have full control over exactly what users are running, I'd recommend having a single Locust user with a single task that contains your own logic about what to run. Create your own list of functions to call and iterate through it each time a new user is created. The list might have to live outside the user class, as a global or similar. The idea is that you manage the logic yourself rather than leaving it to Locust.
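As an illustration of that approach, here is a rough sketch; the flow functions, endpoints and host are hypothetical placeholders:

import itertools

from locust import HttpUser, task

def site_user_1_flow(user):
    user.client.get("/endpoint-1")  # placeholder request

def user_2_flow(user):
    user.client.get("/endpoint-2")

def user_3_flow(user):
    user.client.get("/endpoint-3")

# Exact 3/1/1 allocation for the first five users spawned; the cycle
# then repeats the same split for each further batch of five.
assignments = itertools.cycle([site_user_1_flow] * 3 + [user_2_flow, user_3_flow])

class ControlledUser(HttpUser):
    host = "http://localhost"

    def on_start(self):
        # Each newly spawned user takes the next flow from the shared list.
        self.flow = next(assignments)

    @task
    def run_flow(self):
        self.flow(self)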
Edit:
Using the single-user method to control what's running won't work well if you run on multiple workers, as the workers don't communicate with each other. You may consider doing something more advanced, like sending messages between master and workers to coordinate, or using an external source such as a database or another service the workers consult to know what they should run.
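A rough sketch of that master/worker coordination using Locust's custom messages (available in Locust 2.x); the message name and payload shape are made up for illustration:

from locust import events
from locust.runners import MasterRunner, WorkerRunner

def on_assign_flows(environment, msg, **kwargs):
    # Worker side: remember which slice of the work this worker owns.
    environment.assigned_flows = msg.data

@events.init.add_listener
def on_locust_init(environment, **kwargs):
    if isinstance(environment.runner, WorkerRunner):
        environment.runner.register_message("assign_flows", on_assign_flows)

@events.test_start.add_listener
def on_test_start(environment, **kwargs):
    if isinstance(environment.runner, MasterRunner):
        # Master side: hand each connected worker its share of the split.
        for index, client_id in enumerate(environment.runner.clients):
            environment.runner.send_message(
                "assign_flows", {"worker_index": index}, client_id
            )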

How to identify the multi-instance sub-process and differentiate it from the main process in jBPM?

I have used a multi-instance sub-process which includes a workflow with a human task. When executing, it creates as many human tasks as there are elements in the collection object. But all the tasks have the same process instance id. How does the relation between the parent process and the multi-instance sub-process work?
If there are multiple elements in the collection list, it will create that many tasks inside the multi-instance sub-process. As all the tasks have the same process instance id, how do I identify the respective process variable values for each task and keep each flow distinct afterwards? And is there a way to make it create a different instance id for each task of the multi-instance sub-process?
I did not get all of the question, but I will try to answer what I understood:
Human tasks have their own task instance id.
What is the collection object? If you mean the tasks in the BPMN model, then it works as expected: the process instance flow starts after the start node, and when it reaches a human task it creates a task instance with its own id. You can see it in the tasks UI, and with the API you can claim it, work on it, complete it, populate data, etc.
It is wise to have a separate/different variable for every task that can execute in parallel. The input is then kept in distinct data placeholders and you can use each one accordingly.
You can create a different instance (task instance) for each task, or have repeatable tasks.
Well, the answer was to put the multi-instance activity into a sub-process; this allows me to have a separate process instance id for each element of my list (the input of the multi-instance).

How to pass output from a DataStage parallel job as input to another job?

My requirement is:
Parallel Job1 extracts data from a table, and the row count may be more than 0.
Parallel Job2 should be triggered in the sequencer only when the row count from the source query in Job1 is greater than 0.
I want to achieve this without creating any intermediate file in Job1.
So basically what you want to do is take information from a data stream (of your Job1) and use it in the sequence "above" it as a parameter.
In your case, you want to decide at sequence level whether to run the subsequent jobs (only if more than 0 rows were returned).
Two options for that:
Job1 writes the information to a file which serves as a value file of a parameter set. These files are stored in a fixed directory. The parameter from the value file can then be used in your sequence to decide on further processing. Details on parameter sets can be found in the IBM documentation.
You could use a server job for Job1 and set a user status (BASIC function DSSetUserStatus) in a transformer. This is passed back to the sequence and can be referenced in subsequent stages of the sequence. See the documentation; you will find plenty of other information on this topic online as well.
There are more solutions to this problem - or let us call it a challenge. Another way may be a script called at sequence level which queries the database itself and avoids Job1 entirely...

Spring Batch reader for a frequently modified source

I'm using Spring Batch and I want to write a job with a JPA reader that selects paginated sets of products from the database. Then I have a processor that performs some operation on every single product (let's say on product A), but in performing this operation on product A the item processor will also process some other products (product B, product C, etc.). Then the reader will hand the processor product B, because it's next in line, but it has already been processed, so it's actually a waste of time/resources to process it again. How should one tackle this - is there a modification-aware item reader in Spring Batch? One solution would be for the item processor to check whether the product has already been processed, and only process it if it hasn't. However, checking whether the product has been processed is itself very resource-consuming.
There are two approaches here that I'd consider:
Adjust what you call an "item" - An item is what is returned from the reader. Depending on the design of things, you may want to build a more complex reader that can include the dependent items and therefore loop through them only once. Obviously this is very dependent upon your specific use case.
Use the process indicator pattern - The process indicator pattern is what this is for; see the sketch after this list. As you process items, set a flag in the db indicating that they have been processed. Your reader's query is then configured to read only those that have not yet been processed (filtering out the ones that were updated via the process phase).
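Spring Batch would express the second option as a reader query plus a flag update during processing/writing; as a framework-agnostic sketch of the idea (table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect("products.db")  # hypothetical database

def read_page(page_size=100):
    # Reader: select only rows whose process indicator is still unset.
    return conn.execute(
        "SELECT id FROM products WHERE processed = 0 LIMIT ?",
        (page_size,),
    ).fetchall()

def process(product_id, related_ids):
    # Processor: handle the product plus the dependents it drags along,
    # then flag them all so the next page of reading skips them.
    for pid in [product_id, *related_ids]:
        conn.execute("UPDATE products SET processed = 1 WHERE id = ?", (pid,))
    conn.commit()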

How does Activiti's dynamic assignment of candidate users work?

There is a way to pass candidate users dynamically to an Activiti workflow, as described in "How do I pass a list of candidate users to an activiti workflow task in alfresco?".
When candidateUser/candidateGroup is set for a UserTask using a variable, when is the expression evaluated? Is the task id -> user/group mapping persisted in the database for fast queries such as listing all the tasks a particular user can claim? What table is it stored in?
When human tasks are created, two distinct events fire.
Create: when the task itself is created and most of the task metadata is associated with the task.
Assign: when the task assignment is evaluated and the task is assigned to either an assignee or a candidate group.
As such, the candidateGroup expression is evaluated during the assign phase.
This means we can easily manipulate the list of candidates based on a rule, a database result or some other business logic before the task is actually assigned, by using a task listener that fires on the create phase.
Hope this helps,
G
Concerning the "What table is it stored in?" part of your question:
Candidate groups/users for a given task (and candidate starters for a process) are stored in the ACT_RU_IDENTITYLINK table.
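For illustration, this is the sort of query that table makes possible; sqlite3 is only a stand-in driver here (Activiti normally runs on a full RDBMS), and the column names follow the standard Activiti schema:

import sqlite3

conn = sqlite3.connect("activiti.db")  # hypothetical connection
# List the ids of all tasks a given user could claim as a direct candidate.
rows = conn.execute(
    "SELECT TASK_ID_ FROM ACT_RU_IDENTITYLINK "
    "WHERE TYPE_ = 'candidate' AND USER_ID_ = ?",
    ("kermit",),
).fetchall()
print([task_id for (task_id,) in rows])

In application code you would normally go through the TaskService query API (e.g. taskService.createTaskQuery().taskCandidateUser(...)) rather than hitting the table directly.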