Drools is very slow in processing big data - drools

We have integrated Drools with Talend ETL. Drools takes lot of time to process records counting upto half a million or more. How can we increase the processing speed of drools. I am familiar with drools coding but i am not aware how drools internally works. please help me with this issue. It would be really greatful. I am not sure whether I have given right tags i.e whether they have the right answer. But please do help me on this as it is needed.

The typical problems involve:
Not using == constraints, to allow for indexing.
Make sure you have the field on the left, and the variable on the right.
Not having your most restrictive patterns and constraints first
Not ensuring your rules are written to avoid large cross products
Use of multiple accumulates per rule, or sub networks.
The last issue is improved in Drools 6.0.

Related

Minizinc, Gecode, how to get an identical solutions across distributed servers, with multi-solution model?

I am using minizinc and gecode to solve a minimization problem in a distributed fashion. I have multiple distributed servers that solve the same model with identical input and I want all the servers to get the same solution.
The problem is that model has multiple solutions, which periodically causes servers to come up with different solutions independently. It is not significant which solution will be chosen, as long as it is identical among all servers. I am also using "-p" arguments with gecode to use multiple threads (if it is relevant).
Is there a way that I could address this issue?
For example, I was thinking about outputting all the solutions and then sort them alphanumerically on each server.
Thanks!
If the search strategy in the model does not contain randomisation, then, assuming all versioning is the same, a single thread executing of Gecode should always return the same answer for the same model and instance data. It does not matter if it's on a different node. Using single threaded execution is the easiest way of ensuring that the same solution is found on all nodes.
If you are however want to use multiple threads, no such guarantee can be made. Due to the concurrency of the program, the execution path can be different every run and a different solution might be found each time.
Your suggestion of sorting the solution is possible, but will come at a price. There are two ways of doing this. You can either find all solutions, using the -a flag, and sort them afterwards or you can change your model to force the solution to be the first solution if you would sort them. This second option can be achieved by changing the search strategy. Both these solutions can be very costly and might (more than) exponentially increase the runtime.
If you are concerned about runtime at all, then I suggest you take Patrick Trentin's advice and run the model on a master node and distribute the solution. This will be the most efficient in computational time and most likely as efficient in runtime.

Drools for rating telco records

Has anyone successfully used Drools as a kind of "rating engine" before? What are your experiences?
I'm trying to process a couple of millions of records (of slightly different types) and apply rating/pricing to these records.
Rating would be based of tables or database lookups as well as chains of if/then/else/else/else/else conditions using the lookup data.
Traditional rating engines don't employ rule mechanisms in ways that I'm comfortable with...
thanks for your help
To provide a slightly more informative response (although your question can't be answered based on the very vague description you've given), your "rating" is just one of the many names for what I use to call "classification problem". It has been solved many times using Drools.
However, this doesn't mean to say that your problem, with its particular environmental flavour and expected performance (how fast do you want to have the 2M records processed?) can be solved best using Drools - especially when the measure for deciding the quality isn't settled. (For instance: Is ease of maintenance more important than top efficiency?)
Go ahead and rig up a prototype and run a test to see how it goes. That will give you a more reliable answer than anything else. If someone says that something similar couldn't be done, it could be due to bad rule coding. If someone says that something similar was done successfully, it may not have had one of the quirks of your setup. And so on.

When are TSQL Cursors the best or only option?

I'm having this argument about using Cursors in TSQL recently...
First of all, I'm not a cheerleader in the debate. But every time someone says cursor, there's always some knucklehead (or 50) who pounce with the obligatory 'cursors are evil' mantra. I know SQL-Server was optimized for set-based operations, and maybe cursors truly ARE evil incarnate, but if I wanted to put some objective thought behind that...
Here's where my mind is going:
Is the only difference between cursors and set operations one of performance?
Edit: There's been a good case made for it not being simply a matter of performance -- such as running a single batch over-and-over for a list of id's, or alternatively, executing actual SQL text stored in a table field row-by-row.
Follow-up: do cursors always perform worse?
EDIT: #Martin shows a good case where Cursors out-perform set-based operations fairly dramatically. I suspect that this wouldn't be the kind of thing you'd do too often (before you resorted to some kind of OLAP / Data Warehouse kind of solution), but nonetheless, seems like a case where you really couldn't live without a cursor.
reference to TPC benchmarks suggesting cursors may be more competitive than folks generally believe.
reference to memory-usage optimizations for cursors since Sql-Server 2005
Are there any problems you can think of, that cursors are better suited to solve than set-based operations?
EDIT: Set-based operations literally cannot Execute stored procedures, etc. (see edit for item 1 above).
EDIT: Set-based operations are exponentially slower than row-by-row when it comes to aggregating over large data sets.
Article from MSDN explaining their perspective
of the most common problems people resort to cursors for (and some
explanation of set-based techniques that would work better.)
Microsoft says (vaguely) in the 2008 Transact SQL Reference on MSDN: "...there are times when the results are best processed one row at a time", but the don't give any examples as to what cases they're referring to.
Mostly, I'm of a mind to convert cursors to set-based operations in my old code if/as I do any significant upgrades to various applications, as long as there's something to be gained from it. (I tend toward laziness over purity a lot of the time -- i.e., if it ain't broke, don't fix it.)
To answer your question directly:
I have yet to encounter a situation where set operations could not do what might otherwise be done with cursors. However, there are situations where using cursors to break a large set problem down into more manageable chunks proves a better solution for purposes of code maintainability, logging, transaction control, and the like. But I doubt there are any hard-and-fast rules to tell you what types of requirements would lead to one solution or the other -- individual databases and needs are simply far too variant.
That said, I fully concur with your "if it ain't broke, don't fix it" approach. There is little to be gained by refactoring procedural code to set operations for a procedure that is working just fine. However, it is a good rule of thumb to seek first for a set-based solution and only drop into procedural code when you must. Gut feel? If you're using cursors more than 20% of the time, you're doing something wrong.
And for what I really want to say:
When I interview programmers, I always throw them a couple of moderately complex SQL questions and ask them to explain how they'd solve them. These are problems that I know can be solved with set operations, and I'm specifically looking for candidates who are able to solve them without procedural approaches (i.e., cursors).
This is not because I believe there is anything inherently good or more performant in either approach -- different situations yield different results. Rather it's because, in my experience, programmers either get the concept of set-based operations or they do not. If they do not, they will spend too much time developing complex procedural solutions for problems that can be solved far more quickly and simply with set-based operations.
Conversely, a programmer who gets set-based operations almost never has problems implementing a procedural solution when, indeed, it's absolutely necessary.
Running Totals is the classic case where as the number of rows gets larger cursors can out perform set based operations as despite the higher fixed cost of the cursor the work required grows linearly rather than exponentially as with the set based "triangular join" approach.
Itzik Ben Gan does some comparisons here.
Denali has more complete support for the OVER clause however that should make this use redundant.
Since I've seen people manage to re-implement cursors (in all there varied forms) using other TSQL constructs (usually involving at least one while loop), there's nothing that cursors can achieve that can't be done using other constructs.
That's not to say that the re-implementations aren't equally as inefficient as the cursors that were avoided by not including the word "cursor" in that solution. Some people seem to purely hate the word, not the mechanics.
One place I've successfully argued to keep cursors was for a data transfer/transform between two different databases (we were dealing with clients here). Whilst we could have implemented this transfer in a set based manner (indeed, we previously had), there was problematic data that could cause issues for a few clients. In a set based solution, we had either to:
Continue the transfer, excluding failed client data at each table, leaving those clients partially transferred, or,
abort the entire batch
Whereas, by making the unit of transfer the individual client (using a cursor to select each client), we could make each client's transfer between the systems either work fully or be entirely rolled back (i.e. place each transfer in its own transaction)
I can't think of any situations where I've wanted to use a cursor below the "top level" of such transfers though (e.g. selecting which client to transfer next)
Often when you build dynamic sql, you have to use cursors. Imagine a script that search through all tabels in the database for same value in different fields. Best solution will be a cursor. Question where the problem was raised is here How to use EXEC or sp_executeSQL without looping in this case? I will be really impressed if anyone can solve that better without a cursor.

Do software metrics work both ways

I just started working for a large company. in a recent internal audit, measuring metrics such as Cyclomatic complexity and file sizes it turned out that several modules including the one owned by my team have a very high index. so in the last week we have been all concentrating on lowering these indexes for our code. by removing decision points and splitting files.
maybe I am missing something being the new guy but, how will this make our software better?, I know that software metrics can measure how good your code is, but dose it work the other way around? will our code become better just because for example we are making a 10000 lines file into 4 2500 lines files?
The purpose of metrics is to have more control over your project. They are not a goal on their own, but can help to increase the overall quality and/or to spot design disharmonies. Cyclomatic complexity is just one of them.
Test coverage is another one. It is however well-known that you can get high test coverage and still have a poor test suite, or the opposite, a great test suite that focus on one part of the code. The same happens for cyclomatic complexity. Consider the context of each metrics, and whether there is something to improve.
You should try to avoid accidental complexity, but if the processing has essential complexity, you code will anyway be more complicated. Try then to write mainteanble code with a fair balance between the number of methods and their size.
A great book to look at is "Object-oriented metrics in practice".
It depends how you define "better". Smaller files and less cyclomatic complexity generally makes it easier to maintain. Of course the code itself could still be wrong, and unit tests and other test methods will help with that. It's just a part of making code more maintainable.
Code is easier to understand and manage in smaller chunks.
It is a good idea to group related bits of code in their own functional areas for improved readability and cohesiveness.
Having a whole large program all in a single file will make your project very difficult to debug, extend, and maintain. I think this is quite obvious.
The particular metric is really only a rule of thumb and should not be followed religiously, but it may indicate something is not as nice as it could be.
Whether legacy working code should be touched and refactored is something that needs to be evaluated. If you decide to do so, you should consider writing tests for it first, that way you'll quickly know whether your changes broke any required behavior.
Never ever opened one of your own projects after several months again? The larger and more complex the single components are the more one asks oneself, what genious wrote that code and why the heck he wrote it that way.
And, there's never too much or even enough documentation. So if the components themself are lesser complex and smaller, its easier to re-understand 'em
This is bit Subjective. The idea of assigning a maximim Cyclomatic complexity index is to improve the maintainability and the readability of the code.
As an example in the perspective of the unit testing, it is really convenient to have smaller "units". And avoiding the long codes will help the reader to understand the code. You cannot ensure that the original developer works on the code forever so in the company's perspective it is fair to assign such a criteria to keep the code "simple"
It is easy to write a code that can undertand by a computer. It is more harder to write a code that can understood by a human.
how will this make our software better?
Excerpt from the articles Fighting Fabricated Complexity related to the tool for .NET developers NDepend. NDepend is good at helping team to manage large and complex code base. The idea is that code metrics are good are reducing fabricated complexity in the code implementation:
During my interview on Code Metrics by Scott Hanselman’s on Software Metrics, Scott had a particularly relevant remark.
Basically, while I was explaining that long and complex methods are killing quality and should be split into smaller methods, Scott asked me:
looking at this big too complicated
method and I break it up into smaller
methods, the complexity of the
business problem is still there,
looking at my application I can say,
this is no longer complex from the
method perspective, but the software
itself, the way it is coupled with
other bits of code, may indicate other
problem…
Software complexity is a subjective measure relative to the human cognition capacity. Something is complex when it requires effort to be understood by a human. The fact is that software complexity is a 2 dimensional measure. To understand a piece of code one must understand both:
what this piece of code is supposed to do at run-time, the behavior of the code, this is the business problem complexity
how the actual implementation does achieve the business problem, what was the developer mental state while she wrote the code, this is the implementation complexity.
Business problem complexity lies into the specification of the program and reducing it means working on the behavior of the code itself. On the other hand, we are talking of fabricated complexity when it comes to the complexity of the implementation: it is fabricated in the sense that it can be reduced without altering the behavior of the code.
how will this make our software better?
It can be a trigger for a refactoring, but following one metric doesn't guarantee that all other quality metrics stay the same. And tools are only able to follow very few metrics. You can't measure to which degree code is understandable.
Will our code become better just
because for example we are making a
10 000 lines file into 4 2500 lines
files?
Not necessarily. Sometimes the larger one can be more understandable, better structured and have lesser bugs.
Most design patterns for example "improve" your code by making it more general and maintenable, but often with the cost of added source lines.

Using Drools in a heavy batch process

We used Drools as part of a solution to act as a sort of filter in a very intense processing application, maybe running up to 100 rules on 500,000 + working memory objects.
turns out that it is extremely slow.
anybody else have any experience using Drools in a batch type processing application?
Kind of depends on your rules - 500K objects is reasonable given enough memory (it has to populate a RETE network in memory, so memory usage is a multiple of 500K objects - ie space for objects + space for network structure, indexes etc) - its possible you are paging to disk which would be really slow.
Of course, if you have rules that match combinations of the same type of fact, that can cause an explosion of combinations to try, which even if you have 1 rule will be really really slow.
If you had any more information on the analysis you are doing that would probably help with possible solutions.
I've used a Drools with a stateful working memory containing over 1M facts. With some tuning of both your rules and the underlying JVM, performance can be quite good after a few minutes for initial start-up. Let me know if you want more details.
I haven't worked with the latest version of Drools (last time I used it was about a year ago), but back then our high-load benchmarks proved it to be utterly slow. A huge disappointment after having based much of our architecture on it.
At least something good I remember about drools is that their dev team was available on IRC and very helpful, you might give them a try, they're the experts after all: irc.codehaus.org #drools
I'm just learning drools myself, so maybe I'm missing something, but why is the whole batch of five hundred thousand objects added to working memory at once? The only reason I can think of is that there are rules that kick in only when two or more items in the batch are related.
If that isn't the case, then perhaps you could use a stateless session and assert one object at a time. I assume rules will run 500k times faster in that case.
Even if it is the case, do all your rules need access to all 500k objects? Could you speed things up by applying per-item rules one at a time, and then in a second phase of processing apply batch level rules using a different rulebase and working memory? This would not change the volume of data, but the RETE network would be smaller because the simple rules would have been removed.
An alternative approach would be to try and identify the related groups of objects and assert the objects in groups during the second phase, further reducing the volume of data in working memory as well as splitting up the RETE network.
Drools is not really designed to be run on a huge number of objects. It's optimized for running complex rules on a few objects.
The working memory initialization for each additional object is too slow and the caching strategies are designed to work per working memory object.
Use a stateless session and add the objects one at a time ?
I had problems with OutOfMemory errors after parsing a few thousand objects. Setting a different default optimizer solved the problem.
OptimizerFactory.setDefaultOptimizer(OptimizerFactory.SAFE_REFLECTIVE);
We were looking at drools as well, but for us the number of objects is low so this isn't an issue. I do remember reading that there are alternate versions of the same algorithm that take memory usage more into account, and are optimized for speed while still being based on the same algorithm. Not sure if any of them have made it into a real usable library though.
this optimizer can also be set by using parameter
-Dmvel2.disable.jit=true