Can we reuse Seed and RutaBasic annotations for multiple scripts executed sequentially on the same CAS object? - uima

We have multiple Ruta scripts which are set up to run sequentially on incoming emails. Is it a good idea to create the seed and RutaBasic annotations once, reuse them while executing the Ruta scripts one by one, and empty the CAS once all the scripts have been executed?
CAS cas = jCas.getCas();
// initialize the seed and RutaBasic annotations once
for (String rutaScript : rutaScripts) {
    // execute the Ruta scripts one by one on the same CAS
}
// clear the CAS once all scripts have run
cas.reset();

The TokenSeed annotations are commonly created only once, as they should represent some simple, static layer. The RutaTokenSeedAnnotator, for example, creates new annotations only if there are no TokenSeed annotations yet. They can be shared like any other annotation.
The RutaBasic annotations store additional information about the annotations. They need to be updated for each addition or removal of any annotation, i.e., the internal maps need to be up to date at all times or the rules will be executed incorrectly. The RutaBasic annotations can be shared across different analysis engines processing the same CAS, and the RutaEngine provides parameters configuring the internal update strategy. These parameters are named PARAM_INDEX_* and PARAM_REINDEX_*.
If there are only two consecutive RutaEngines in your pipeline, then you can set PARAM_REINDEX_UPDATE_MODE to NONE, as no other analysis engine modifies the indexes in between.
If runtime is not an issue, then you can set PARAM_REINDEX_UPDATE_MODE to COMPLETE and the RutaEngine will update everything.
If you know that the analysis engines between two RutaEngines do not remove any annotations, then you can set PARAM_REINDEX_UPDATE_MODE to ADDITIVE; the internal update is faster in this mode.
The other parameters can be used to optimize different aspects of the Ruta indexing and reindexing, improving both speed and memory consumption.
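For context, a minimal uimaFIT sketch of that setup. This is a sketch only: the script name is hypothetical, exception handling is omitted, and the exact parameter constants and values depend on your Ruta version.

import java.util.ArrayList;
import java.util.List;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.ruta.engine.RutaEngine;

void runRutaScripts(JCas jCas, List<String> rutaScripts) throws Exception {
    // one engine per script; REINDEX_UPDATE_MODE "NONE" assumes that no other
    // analysis engine modifies the indexes between the Ruta engines (see above)
    List<AnalysisEngine> engines = new ArrayList<>();
    for (String rutaScript : rutaScripts) { // e.g. "my.package.EmailRules" (hypothetical)
        engines.add(AnalysisEngineFactory.createEngine(RutaEngine.class,
                RutaEngine.PARAM_MAIN_SCRIPT, rutaScript,
                RutaEngine.PARAM_REINDEX_UPDATE_MODE, "NONE"));
    }
    CAS cas = jCas.getCas();
    for (AnalysisEngine engine : engines) {
        engine.process(cas); // seed/RutaBasic annotations created by the first engine are reused
    }
    cas.reset(); // empty the CAS once all scripts have run
}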

Related

When to use multiple KieBases vs multiple KieSessions?

I know that one can utilize multiple KieBases and multiple KieSessions, but I don't understand under what scenarios one would use one approach vs the other (I am having some trouble in general understanding the definitions and relationships between KieContainer, KieBase, KieModule, and KieSession). Can someone clarify this?
You use multiple KieBases when you have multiple sets of rules doing different things.
KieSessions are the actual session for rule execution -- that is, they hold your data and some metadata and are what actually executes the rules.
Let's say I have an application for a school. One part of my application monitors students' attendance. The other part of my application tracks their grades. I have a set of rules which decides if students are truant and we need to talk to their parents. I have a completely unrelated set of rules which determines whether a student is having trouble academically and needs to be put on probation/a performance plan.
These rules have nothing to do with one another. They have completely separate concerns, different rule inputs, and are triggered in different parts of the application. The part of the application that is tracking attendance doesn't need to trigger the rules that monitor student performance.
For this application, I would have two different KieBases: one for attendance, and one for academics. When I need to fire the rules, I fire one or the other -- there is no use case for firing both at the same time.
The KieSession is the runtime for when we fire those rules. We add to it the data we need to trigger the rules, and it also tracks some other metadata that's really not relevant to this discussion. When firing the academics rules, I would be adding to it the student's grades, their classes, and maybe some information about the student (e.g. their grade level, whether they're an "honors" student, etc.). For the attendance rules, we would need the student information, plus historical tardiness/absence records. Those distinct pieces of data get added to the sessions.
When we decide to fire rules, we first get the appropriate KieBase -- academics or attendance. Then we get a session for that rule set, populate the data, and fire it. We technically "execute" the session, not the rules (and definitely not the rule base.) The rule base is just the collection of the rules; the session is how we actually execute it.
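As a rough sketch of that flow (the KieBase name "academicsKB" and the fact objects are made up; the real names would come from your kmodule.xml and domain model):

import org.kie.api.KieBase;
import org.kie.api.KieServices;
import org.kie.api.runtime.KieContainer;
import org.kie.api.runtime.KieSession;

KieServices ks = KieServices.Factory.get();
KieContainer kContainer = ks.getKieClasspathContainer();

// one KieBase per concern; the name is defined in kmodule.xml
KieBase academicsKieBase = kContainer.getKieBase("academicsKB");

KieSession session = academicsKieBase.newKieSession();
try {
    session.insert(student);   // hypothetical facts: the data the academics rules need
    session.insert(grades);
    session.fireAllRules();    // we "execute" the session, not the rule base
} finally {
    session.dispose();
}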
There are two kinds of sessions -- stateful and stateless. As their names imply, they differ with how data is stored and tracked. In most cases, people use stateful sessions because they want their rules to do iterative work on the inputs. You can read more about the specific differences in the documentation.
For low-volume applications, there's generally little need to reuse your KieSessions. Create, use, and dispose of them as needed. There is, however, some inherent overhead in this process, so there comes a point in which reuse does become something that you should consider. The documentation discusses the solution provided out-of-the box for Drools, which is session pooling.
(When trying to wrap your head around this, I like to use an analogy of databases. A session is like a JDBC connection: for small applications you can create them, use them, then close them as you need them. But as you scale you'll quickly find that you need to look into connection pooling to minimize this overhead. In this particular analogy, the rule base would be the database that the rules are executing against -- not the tables!)

Drools global variable initialization and scaling for performance

Thanks in advance. We are trying to adopt Drools as the rules engine in our enterprise. After evaluating basic functionality in POC mode, we are exploring further. We have the following challenges, and I am trying to validate some of the options we are considering. Any help is greatly appreciated.
Scenario-1: Say you get a US state (TX, CA, CO, etc.) in a fact's field. Now you want the rule to check whether the state value on the fact exists in a predetermined static list of state values (say the list contains the three values TX, TN, MN).
Possible solution to Scenario-1: 'static list of state values' can be set as a global variable and the rule can access the global variable while performing the check.
Questions on Scenario-1:
Is the 'possible solution to Scenario-1' the standard practice? If so, is it possible to load the value of this global variable from a database during rule engine (KIE Server) startup? If yes, could you let me know which Drools feature enables us to load global variables from a database? Should a client application (one that calls kie-server) initialize the global variables instead?
Scenario-2: We want to horizontally scale the rule execution server. Say we have one rule engine server (kie-server) exposing a REST API. Can we have multiple instances running behind a load balancer to scale it horizontally? Is there any other way of achieving the scalability?
Q1: It depends. The usual solution for a small, rarely (if ever) changing set that is used just in a single rule is to put it into the rule, using the in operator. If you think you might have to change it or use it frequently, a global would be one way of achieving that but you must make sure that the global is initialized before any facts are inserted.
There is nothing out-of-the-box for accessing a DB.
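As a rough sketch of the client side: the calling application sets the global (e.g. from values it loaded from a DB itself) before inserting any facts. The fact and global names below are made up, and the DRL conditions are shown only as comments.

import java.util.Arrays;
import java.util.List;
import org.kie.api.runtime.KieSession;

// DRL, literal list:   Address( state in ("TX", "TN", "MN") )
// DRL, global:         global java.util.List stateList
//                      Address( state memberOf stateList )

List<String> stateList = Arrays.asList("TX", "TN", "MN"); // the caller could load this from a DB instead
KieSession ksession = kContainer.newKieSession();         // kContainer as in the earlier example
ksession.setGlobal("stateList", stateList);                // set the global BEFORE inserting any facts
ksession.insert(addressFact);                              // hypothetical fact with a "state" field
ksession.fireAllRules();
ksession.dispose();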
Q2: A server running a Drools session is just another Java server program, so any load balancing applicable to this class of programs should apply to a Drools app as well. What are you fearing?

Can optaplanner solve partly pre-assigned planning entities using Drools rules?

We are using the time dependent vehicle routing problem example of Optaplanner 6.2.
In our case the domain model consists of activities (corresponding to customers) and technicians (corresponding to vehicles).
Is it possible to initialize an optimization with some activities pre-assigned to certain technicians, while the rest of the activities are left unassigned?
This would correspond to the case in the OptaPlanner CVRPTW example where we stop the solving (or wait for the solution) and then append unassigned activities to the end of the solved XML file.
This file would then be used as the input for further optimization. Here it is mandatory to lock the already assigned activities.
Can such a starting state (initially locked, consecutive chain parts of pre-defined activities at the chain start, which must not be rearranged, while the remaining activities are optimized and appended after the pre-assigned activities in the existing chains) be handled with incremental constraint rules (with Drools)?
Read over the Immovable Planning Entities and Nonvolatile Replanning sections of the Repeated Planning chapter of the manual. It's fairly well explained.
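If it helps, a rough sketch of the immovable-entities approach as I recall it for OptaPlanner 6.x; the class, field, and package names are hypothetical or version-dependent, so verify the exact SelectionFilter signature against your version and the chapter above.

import org.optaplanner.core.api.domain.entity.PlanningEntity;
import org.optaplanner.core.impl.heuristic.selector.common.decorator.SelectionFilter;
import org.optaplanner.core.impl.score.director.ScoreDirector;

// Only activities that are NOT locked may be moved by the solver;
// pre-assigned activities keep their technician and chain position.
public class MovableActivityFilter implements SelectionFilter<Activity> {
    @Override
    public boolean accept(ScoreDirector scoreDirector, Activity activity) {
        return !activity.isLocked(); // isLocked() is a hypothetical flag on your entity
    }
}

@PlanningEntity(movableEntitySelectionFilter = MovableActivityFilter.class)
class Activity {
    // planning variables (previous standstill, technician, ...) plus the isLocked() flag
}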

quick loading of a drools knowledge base

I'm trying to use Drools as the rule engine for a grammar-relations-to-semantics mapping framework. The rule base exceeds 5000 rules even now and will be extended. Currently, reading the DRL file containing the rules and creating the knowledge base takes a lot of time each time the program is run. Is there a way to create the knowledge base once and save it in some persistent format that can be quickly loaded, with the option to regenerate the knowledge base only when a change is made?
Yes, Drools can serialise a knowledge base out to external storage and then load this serialised knowledge base back in again.
So, you need a cycle that loads from DRL, compiles, and serialises out, then a second cycle that uses the serialised version.
I've used this with some success, reducing a 1 minute 30 loading time down to about 15-20 seconds. Also, it reduces your heap/perm gen requirements as well.
Check the API for the exact methods.
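For example, since the KnowledgeBase object is Serializable, plain Java serialization is one way to do that cycle. A sketch only: the file name is arbitrary and error handling is omitted.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.drools.KnowledgeBase;

// First cycle: kbase is the KnowledgeBase you compiled from DRL (not shown); serialise it out.
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("kbase.bin"))) {
    out.writeObject(kbase);
}

// Second cycle: load the serialised knowledge base instead of re-parsing the rules.
KnowledgeBase loadedKbase;
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("kbase.bin"))) {
    loadedKbase = (KnowledgeBase) in.readObject();
}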
My first thought is to keep the knowledge base around as long as possible. Unless you are creating multiple knowledge bases from different sets of rules, and there are too many possible combinations, hang onto those knowledge bases. In one application I work on, one knowledge base has all the rules so we treat it like a singleton.
However, if that's not possible or your application is not that long-running, I don't know that Drools itself provides any ways of speeding that up. Running a Drools 5.0 project through the debugger, I see that the KnowledgeBase Drools gives me is Serializable. I imagine it would be quicker to deserialize a KnowledgeBase than to re-parse the rules. But be careful designing your application around this! You use interfaces for a reason and the implementation could change without warning.
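A trivial lazily-initialized holder along these lines keeps one knowledge base alive for the life of the application; buildFromDrl() is just a placeholder for your own compilation code.

import org.drools.KnowledgeBase;

// Build the knowledge base once and reuse it: sessions are cheap to create, the kbase is not.
public final class RuleBaseHolder {
    private static volatile KnowledgeBase kbase;

    public static KnowledgeBase get() {
        if (kbase == null) {
            synchronized (RuleBaseHolder.class) {
                if (kbase == null) {
                    kbase = buildFromDrl(); // placeholder: parse and compile the DRL resources once
                }
            }
        }
        return kbase;
    }

    private static KnowledgeBase buildFromDrl() {
        throw new UnsupportedOperationException("compile the DRL resources here");
    }

    private RuleBaseHolder() {}
}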

Rules Based Database Engine

I would like to design a rules based database engine within Oracle for PeopleSoft Time entry application. How do I do this?
A rules-based system needs several key components:
- A set of rules defined as data
- A set of uniform inputs on which to operate
- A rules executor
- A supervisor hierarchy
1. Write out a series of use-cases - what might someone be trying to accomplish using the system?
2. Decide on what things your rules can take as inputs, and what as outputs.
3. Describe the rules from your use-cases as a series of data, and thus determine your rule format. Expand 2 as necessary for this.
4. Create the basic rule executor, and test that it will take the rule data and process it correctly.
5. Extend the above to deal with multiple rules with different priorities.
6. Learn enough rule engine theory and graph theory to understand common rule-based problems - circularity, conflicting rules etc. - and how to use (node) graphs to find cases of them.
7. Write a supervisor hierarchy that is capable of managing the ruleset and taking decisions based on the possible problems above. This part is important, because it is your protection against foolishness on the part of the rule creators causing runtime failure of the entire system.
8. Profit!
Broadly, rules engines are an exercise in managing complexity. If you don't manage it, you can easily end up with rules that cascade from each other causing circular loops, race-conditions and other issues. It's very easy to construct these accidentally: consider an email program which you have told to move mail from folder A to B if it contains the magic word 'beta', and from B to A if it contains the word 'alpha'. An email with both would be shuttled back and forward until something broke, preventing all other rules from being processed.
I have assumed here that you want to learn about the theory and build the engine yourself. alphazero raises the important suggestion of using an existing rules engine library, which is wise - this is the kind of subject that benefits from academic theory.
I haven't tried this myself, but an obvious approach is to use Java procedures in the Oracle database, and use a Java rules engine library in that code.
Try:
http://www.oracle.com/technology/tech/java/jsp/index.html
http://www.oracle.com/technology/tech/java/java_db/pdf/TWP_AppDev_Java_DB_Reduce_your_Costs_and%20_Extend_your_Database_10gR1_1113.PDF
and
http://www.jboss.org/drools/
or
http://www.jessrules.com/
--
Basically, you'll need to capture data events (inserts, updates, deletes), map them to your rulespace's events, and apply rules.
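I haven't tried this against a real PeopleSoft schema, but the shape could be something like a static method published as a Java stored procedure and invoked from triggers on the time-entry tables. Every name below is hypothetical, and Drools is used only as one example library.

import org.kie.api.KieServices;
import org.kie.api.runtime.KieSession;

public class TimeEntryRuleBridge {

    // Hypothetical fact class mapping a data event (insert/update/delete) into the rule space.
    public static class TimeEntryEvent {
        public final String eventType;
        public final String employeeId;
        public final double hours;
        public TimeEntryEvent(String eventType, String employeeId, double hours) {
            this.eventType = eventType;
            this.employeeId = employeeId;
            this.hours = hours;
        }
    }

    // Published as a Java stored procedure and called from a trigger on the time-entry table.
    public static void onTimeEntryEvent(String eventType, String employeeId, double hours) {
        KieSession session = KieServices.Factory.get()
                .getKieClasspathContainer()
                .newKieSession();
        try {
            session.insert(new TimeEntryEvent(eventType, employeeId, hours));
            session.fireAllRules();
        } finally {
            session.dispose();
        }
    }
}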