How to optimise Drools execution performance?

We have 1000 rules under a single rule flow group.
We have a severe performance issue when executing them (around 10-20 seconds).
We thought that splitting the rules from a single rule flow group into multiple agenda groups might improve performance.
Or would creating multiple entry points increase performance?
Has anyone come across this problem?
Any links/documentation are also welcome.

There was a similar issue several months ago on the Drools user list, and it was resolved successfully by a different approach, following my proposal. It may be applicable here, too.
Let's say there are some risk factors that influence the premium for a car insurance. Attributes are: age, previous incidents, amount of damage in previous incidents, gender, medical classification.
Each of these values influences the premium by a few credits.
You can write tons of rules like
rule "Premium for parameter set 421"
when
    Application( age >= 32 && <= 35, previous == 1, damage <= 1000,
                 gender == 'F', medical == 0.25 )
then
    setPremium( 421 );
end
The proposed solution was to insert (constant) facts for each such parameter set, and to have a single rule that determines the matching parameter set and sets the premium from a field in that parameter set.
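The shape of that idea can be sketched in plain Java (the class and method names are illustrative, not Drools API; only the 421 bracket comes from the rule above, the second bracket is made up):

```java
import java.util.List;

public class PremiumLookup {
    // One immutable "parameter set" fact per premium bracket.
    record ParameterSet(int minAge, int maxAge, int previous, int maxDamage,
                        char gender, double medical, int premium) {
        boolean matches(int age, int prev, int damage, char g, double med) {
            return age > minAge && age <= maxAge && prev == previous
                    && damage <= maxDamage && g == gender && med == medical;
        }
    }

    static final List<ParameterSet> FACTS = List.of(
            new ParameterSet(32, 35, 1, 1000, 'F', 0.25, 421),
            new ParameterSet(35, 40, 1, 1000, 'F', 0.25, 459)); // second bracket invented

    // The single generic "rule": find the matching parameter set and read its premium.
    static int findPremium(int age, int prev, int damage, char g, double med) {
        return FACTS.stream()
                .filter(p -> p.matches(age, prev, damage, g, med))
                .findFirst()
                .map(ParameterSet::premium)
                .orElseThrow();
    }
}
```

In Drools itself, the ParameterSet instances would be inserted as facts, and the single generic rule would join Application against ParameterSet on the same conditions.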

Related

Is a rule engine suitable for validating data against a set of rules?

I am trying to design an application that allows users to create subscriptions based on different configurations, expressing their interest in receiving alerts when those conditions are met.
While evaluating options, I considered using a generic rule engine such as Drools, which seemed a natural fit at a high level. But after digging deeper and giving it more thought, I doubt that a business rule engine is the right tool.
I see a rule engine as something that selects a rule based on a predefined condition and applies that rule to data to produce an outcome. My requirement is the reverse: start with the data (the generated event) and, based on the rules (subscriptions) configured by users, identify all the rules (subscriptions) that the event satisfies, so that alerts can be sent to all those subscribers.
To give an example, a hypothetical subscription from a user could be: alert me when a product on Amazon drops below $10 in the next 7 days. Another user might create a subscription to be notified when a product on Amazon drops below $15 within the next 30 days and also offers free one-day shipping for Prime members.
After a bit of thought, I have settled on storing the rules/subscriptions in a relational DB and identifying which subscriptions should fire an alert for an event by querying the DB.
My main reason for choosing this approach is volume: the number of rules/subscriptions I begin with will be about 1000 complex rules, and it will grow exponentially as more users are added to the system. With the query approach I can issue a single query that validates all rules in one go, versus the rule engine approach, which would require multiple evaluations depending on the number of rules configured.
While I know my DB approach would work (maybe not efficiently), I just wanted to understand whether a rule engine can be used for such purposes and scale well as the number of rules increases. (Performance is of utmost importance, as the number of events to be processed per minute will be 1000+.)
If a rule engine is not the right approach, what other options are there to explore, short of writing my own implementation?
You are getting it wrong: a standard rule engine selects which rules to execute based on the data. A rule's constraints are evaluated against the data you insert into the rule engine, and if all the constraints of a rule match the data, the rule is executed. I would suggest you try Drools.
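That data-driven direction is exactly the asker's use case. Stripped of any engine, it can be sketched like this (Event and the two sample subscriptions are illustrative, modeled on the Amazon examples above):

```java
import java.util.List;
import java.util.function.Predicate;

public class SubscriptionMatch {
    record Event(String product, double price, boolean primeShipping) {}

    // Hypothetical subscriptions expressed as predicates over the event.
    static final List<Predicate<Event>> SUBSCRIPTIONS = List.of(
            e -> e.price() < 10,                          // user A: below $10
            e -> e.price() < 15 && e.primeShipping());    // user B: below $15 + Prime shipping

    // Count how many subscriptions the incoming event satisfies.
    static long matchingCount(Event e) {
        return SUBSCRIPTIONS.stream().filter(p -> p.test(e)).count();
    }
}
```

A rule engine does the same selection, but with an indexed network (RETE) instead of a linear scan, which is what lets it scale to many rules.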

Bipartite graph distributed processing with dynamic programming

I am trying to figure out an efficient algorithm for processing Documents in a distributed (FaaS, to be more precise) environment.
A brute-force approach would be O(D * F * R), where:
D is the number of Documents to process
F is the number of Filters
R is the largest number of Rules in a single Filter
I can assume that:
a single Filter has no more than 10 Rules
some Filters may share Rules (so it's an N-to-N relation)
Rules are boolean functions (predicates), so I can try to take advantage of early cutting: if I have f() && g() && h() and f() evaluates to false, then I do not have to evaluate g() and h() at all and can return false immediately
in a single Document the number of Fields is always the same (about 5-10)
Filters, Rules and Documents are already in the database
every Filter has at least one Rule
Using sharing (the second assumption), my first idea was to evaluate the Document against every Rule and then (after finishing) compute each Filter's result from the already-computed Rules. This way, a shared Rule is computed only once. However, this doesn't take advantage of early cutting (the third assumption).
A second idea is to use early cutting as a slightly optimized brute force, but then it won't use Rule sharing.
Rule sharing looks like subproblem sharing, so memoization and dynamic programming will probably be helpful.
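The two ideas can in fact be combined: memoize shared Rule results while still short-circuiting conjunctive Filters. A minimal sketch (class and method names are mine; the cache here is per-Document, so only Rules some Filter actually reached get computed and shared):

```java
import java.util.Map;
import java.util.function.Predicate;

public class MemoizedRules {
    // A Rule identified by id; the same Rule instance may be shared between Filters.
    record Rule(String id, Predicate<Map<String, String>> predicate) {}

    // Evaluate a rule against a document, caching the result by rule id.
    static boolean eval(Rule rule, Map<String, String> doc, Map<String, Boolean> cache) {
        return cache.computeIfAbsent(rule.id(), k -> rule.predicate().test(doc));
    }

    // Conjunctive filter: short-circuits (early cutting) but reuses cached results.
    static boolean evalFilter(Iterable<Rule> rules, Map<String, String> doc,
                              Map<String, Boolean> cache) {
        for (Rule r : rules) {
            if (!eval(r, doc, cache)) return false; // early cut
        }
        return true;
    }
}
```

Evaluating a second Filter that shares a Rule with the first then hits the cache instead of recomputing the predicate.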
I have noticed that the Filter-Rule relation is a bipartite graph. Not quite sure whether that can help me, though. I have also noticed that I could use reverse sets and store the corresponding Set in every Rule. This would, however, create a circular dependency and may cause desynchronization problems in the database.
The default idea is that Documents are streamed, and every single one of them is an event that spins up a FaaS instance to process it. However, this would probably force every FaaS instance to query for all Filters, which leaves me at O(F * D) queries because of the shared-nothing architecture.
Sample Filter:
{
  'normalForm': 'CONJUNCTIVE',
  'rules': [
    {
      'isNegated': true,
      'field': 'X',
      'relation': 'STARTS_WITH',
      'value': 'G',
    },
    {
      'isNegated': false,
      'field': 'Y',
      'relation': 'CONTAINS',
      'value': 'KEY',
    },
  ],
}
or in a more condensed form:
document -> !document.x.startsWith("G") && document.y.contains("KEY")
for Document:
{
  'x': 'CAR',
  'y': 'KEYBOARD',
  'z': 'PAPER',
}
evaluates to true.
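Hand-compiled into code, evaluating that sample Filter against that sample Document looks like this (FilterEval is an illustrative name):

```java
import java.util.Map;

public class FilterEval {
    // The sample conjunctive filter from above:
    // document -> !document.x.startsWith("G") && document.y.contains("KEY")
    static boolean matches(Map<String, String> doc) {
        boolean rule1 = !doc.get("x").startsWith("G"); // isNegated: true, STARTS_WITH 'G'
        boolean rule2 = doc.get("y").contains("KEY");  // isNegated: false, CONTAINS 'KEY'
        return rule1 && rule2;                         // CONJUNCTIVE normal form
    }
}
```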
I can slightly change the data model, stream something else instead of Documents (e.g. Filters), and use any NoSQL database and tools to help. Apache Flink (event processing) and MongoDB (a single query to retrieve a Filter with its Rules), or maybe Neo4j (as the model looks like a bipartite graph), look like they could help, but I am not sure about it.
Can it be processed efficiently (with regard to - probably - database queries)? What tools would be appropriate?
I have also been wondering whether I am trying to solve a special case of some more general (math) problem that may have useful theorems and algorithms.
EDIT: My newest idea: gather all Documents in a cache like Redis. Then a single event starts up and publishes N functions (as in Function-as-a-Service), and every function selects F/N Filters (the number of Filters divided by the number of instances, i.e. evenly distributing Filters across instances). This way every Filter is fetched from the database only once.
Now every instance streams all Documents from the cache (one Document should be less than 1 MB, and at the same time I should have 1-10k of them, so they should fit in the cache). This way every Document is selected from the database only once (into the cache).
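The F/N distribution itself is simple; a round-robin sketch (illustrative, using integer filter ids):

```java
import java.util.ArrayList;
import java.util.List;

public class Partition {
    // Evenly distribute F filter ids across n instances, round-robin.
    static List<List<Integer>> split(List<Integer> filterIds, int n) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) parts.add(new ArrayList<>());
        for (int i = 0; i < filterIds.size(); i++) {
            parts.get(i % n).add(filterIds.get(i)); // instance i % n owns this filter
        }
        return parts;
    }
}
```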
I have reduced database read operations (some Rules are still selected multiple times), but I am still not taking advantage of Rule sharing across Filters. I could intentionally ignore it by using a document database: by selecting a Filter I would also get its Rules. Still, I would have to recalculate their values.
I guess that's what I get for using a shared-nothing scalable architecture?
I realized that although my graph is indeed (in theory) bipartite, in practice it's going to be a set of disjoint bipartite graphs (as not all Rules are going to be shared). This means that I can process those disjoint parts independently on different FaaS instances without recalculating the same Rules.
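Finding those disjoint parts is a connected-components computation; a union-find sketch over rule ids (Filters that share any Rule end up in the same component; names are mine):

```java
import java.util.HashMap;
import java.util.Map;

public class Components {
    // Union-find over rule ids with path compression.
    private final Map<String, String> parent = new HashMap<>();

    String find(String x) {
        parent.putIfAbsent(x, x);
        String p = parent.get(x);
        if (!p.equals(x)) {
            p = find(p);
            parent.put(x, p); // path compression
        }
        return p;
    }

    void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    // Each filter is a set of rule ids: union them into one component.
    void addFilter(Iterable<String> ruleIds) {
        String first = null;
        for (String r : ruleIds) {
            if (first == null) first = r;
            else union(first, r);
        }
    }
}
```

Rules with the same representative (find result) belong to one component, which can then be shipped to one FaaS instance.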
This reduces my problem to processing a single connected bipartite graph. Now I can use the benefits of dynamic programming and share the result of a Rule computation only if memory is shared, so I cannot divide (and distribute) this problem further without sacrificing that benefit. So I thought: if I have already decided that I will have to recompute some Rules, then let their number be low compared to the disjoint parts I will get.
This is actually the minimum cut problem, which (fortunately) has a known polynomial-complexity algorithm.
However, this may not be ideal in my case, because I don't want to cut off just any part of the graph: I would like to cut the graph ideally in half (a divide-and-conquer strategy that could be reapplied recursively until the graph is small enough to be processed within the time bound of a FaaS instance).
This means that I am looking for a cut that creates two disjoint bipartite graphs, each with roughly the same number of vertices (or at least similar).
This is the sparsest cut problem, which is NP-hard but has an O(sqrt(log N))-approximation algorithm that also favors fewer cut edges.
Currently this does look like a solution to my problem, but I would love to hear any suggestions, improvements and other answers.
Maybe it can be done better with another data model or algorithm? Maybe I can reduce it further with some theorem? Maybe I could transform it into another (simpler) problem, or at least one that is easier to divide and distribute across nodes?
This idea and analysis strongly suggest using a graph database.

Rewards instead of penalty in optaplanner

So I have lectures and time periods, and some lectures need to be taught in a specific time period. How do I do that?
Does scoreHolder.addHardConstraintMatch(kcontext, 10); solve this as a hard constraint? Does the positive value of 10 enforce the constraint that a course is in a specific time period?
I'm aware of the penalty pattern, but I don't want to create a lot of CoursePeriodPenalty objects. Ideally, I'd like to have only one CoursePeriodReward object to say that CS101 should be in the 9:00-10:00 time period.
Locking them in place as immovable planning entities won't work, as I suspect you still want OptaPlanner to decide the room for you - and currently OptaPlanner only supports a MovableSelectionFilter per entity, not per variable (vote for the open JIRA issue for that).
A positive hard constraint would definitely work. Your score will be harder for your users to interpret, though: for example, a solution with a hard score of 0 won't be feasible (either it didn't get those +10 hard points, or it lost 10 hard points somewhere else).
Or you could add a new negative hard constraint type that says: if != desiredTimeslot, then lose 10 points.
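The trade-off between the two formulations can be seen with two toy scoring functions (plain Java, not OptaPlanner API): both rank solutions identically, but only the penalty form keeps "hard score 0" meaning feasible.

```java
public class HardScore {
    // Reward formulation: +10 hard points when the lecture is in its desired slot.
    static int rewardScore(int actualSlot, int desiredSlot) {
        return actualSlot == desiredSlot ? 10 : 0;
    }

    // Penalty formulation: -10 hard points when it is not.
    static int penaltyScore(int actualSlot, int desiredSlot) {
        return actualSlot == desiredSlot ? 0 : -10;
    }
}
```

Under the reward form a fully satisfied solution scores +10, so "0" is ambiguous; under the penalty form any violation drops below 0 and feasibility stays readable at a glance.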

Roster/Timetable generation

I'm working on a tool to generate a timetable for employees for up to a month, taking into account commercial and labor-law constraints. A few challenges and differences from the similar problem:
The shift concept contains breaks, split down to half-hour granularity.
There is no concept of full 8-hour shifts as in the referenced similar problem. E.g. there is a need to have 2 resources at 8 AM and 2.5 resources at 3 PM (e.g. to give a half-hour break).
And regular constraints like hours per day, hours before a break, break time...
A possible solution is to rely on a solver such as OR-Tools or OptaPlanner. Any hints?
If you go with OptaPlanner and don't want to follow the Employee Rostering design of assigning 8-hour Shifts (planning entities) to Employees (planning value) because of your 2nd constraint,
then you could try to follow the Cheap Time Example design, something like this:
@PlanningEntity
public class WorkAssignment {
    Employee employee;
    @PlanningVariable PotentialShiftStartTime startTime;
    @PlanningVariable int durationInHalfHours;
}
PotentialShiftStartTime is basically any time a shift can validly start, so Mon 8:00, Mon 8:30, Mon 9:00, etc.
The search space will be huge in this free-form way, but there are tricks to improve scalability (Nearby Selection, Pick Early for the Construction Heuristic, Limited Selection for the Construction Heuristic, ...).
To get out of the free-form way (= to reduce the search space), you might be able to combine startTime and durationInHalfHours into a PotentialShift, if for example it's not possible to start an 8-hour shift at 16:00 in the afternoon. But make sure the gain is huge before introducing that complexity.
In any case, the trouble with this design is determining how many WorkAssignment instances to create. So you'll probably want to create the maximum number possible per employee and work with nullable=true to ignore unused assignments.
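Sizing that pool can be as simple as an upper bound of the planning window divided by the shortest legal shift (a hypothetical sizing rule; both parameters are assumptions, not OptaPlanner API):

```java
public class AssignmentPool {
    // Upper bound on assignments one employee could plausibly need in the window.
    // E.g. a 24-hour day is 48 half-hours; a 4-hour minimum shift is 8 half-hours.
    static int maxAssignmentsPerEmployee(int windowHalfHours, int minShiftHalfHours) {
        return windowHalfHours / minShiftHalfHours;
    }
}
```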

Is it possible to answer multiple questions using the same rule?

I am evaluating business rules engines. I played a little bit with Drools, but it seems I am rather looking for a query-driven, backward-chaining system.
So, to be more specific, let's look at a simple business rule like this:
when
(amount > 1000 AND amount < 2000 AND currency == USD)
OR
(amount > 750 AND amount < 1500 AND currency == EUR)
then
approve loan
Is it possible to use only this rule and "ask" Drools to answer these questions:
What are the required conditions to get a loan approved, if the currency is USD?
I would expect a result something like this: (amount > 1000) AND (amount < 2000)
Is it possible to get a 2000 EUR loan? (expected answer: false)
If not, what were the key reasons for rejection? (expected answer: amount >= 1500)
Is Drools capable of answering such kind of questions using only one rule?
In theory, that information is all stored in the rule, but I don't know how to "extract" it.
If Drools is not the best rule engine for this scenario, are there any engines that provide this kind of functionality?
Drools, like many other rule engines, uses the RETE algorithm to decide which consequences to run; the entire when part of your rule must match for the then part to run.
For your questions 1 and 2:
If you want multiple "answers", you can either use multiple rules or use the accumulate function.
For 3:
You cannot know why a rule was not invoked; you need to express the rejection case as a separate rule that fires when the specific condition for the rejection case is met.
In a RETE engine, a production contains multiple conditions, so basically we can define your rule as:
P1 = C1 ^ C2 ^ C3, where C1 = amount > 1000; C2 = amount < 2000; C3 = currency == USD
I also recommend splitting your OR condition into two different productions:
P2 = C4 ^ C5 ^ C6,
where C4 = amount > 750; C5 = amount < 1500; C6 = currency == EUR
(Note that in production systems you don't really have OR, and some of them do not even let you write OR conditions, because of traceability requirements.)
And to answer your questions:
You can simply query all the productions that share C3 as part of their conditions.
Execute your rule with your input and you get the result.
Query the beta memories that do not have a token and that have the discrimination condition for C5 passed in the corresponding alpha memory of the network.
Not all rule engines will let you query against beta memories and tokens. Check whether Drools allows you to retrieve alpha memories, beta memories and stored tokens. Basically you need to traverse the internal RETE graph.
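To make the split concrete, the two productions can be written out as plain Java predicates (a stand-in for the two DRL rules; method names are illustrative). For question 2, a 2000 EUR loan matches neither production:

```java
public class LoanRules {
    // P1 = C1 ^ C2 ^ C3 (USD branch).
    static boolean p1(double amount, String currency) {
        return amount > 1000 && amount < 2000 && currency.equals("USD");
    }

    // P2 = C4 ^ C5 ^ C6 (EUR branch).
    static boolean p2(double amount, String currency) {
        return amount > 750 && amount < 1500 && currency.equals("EUR");
    }

    // The original rule is the disjunction of the two productions.
    static boolean approve(double amount, String currency) {
        return p1(amount, currency) || p2(amount, currency);
    }
}
```

A 2000 EUR loan fails C5 (amount < 1500), which is exactly the rejection reason the asker wants to surface.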