Quick loading of a Drools knowledge base

I'm trying to use Drools as the rule engine for a grammar-relations-to-semantics mapping framework. The rule base is already in excess of 5,000 rules and will be extended. Currently, reading the DRL file containing the rules and creating the knowledge base takes a long time each time the program is run. Is there a way to create the knowledge base once and save it in some persistent format that can be loaded quickly, with the option to regenerate the knowledge base only when a change is made?

Yes, Drools can serialise a knowledge base out to external storage and then load the serialised knowledge base back in again.
So you need one cycle that loads the DRL, compiles it, and serialises the result, and a second cycle that uses the serialised version.
I've used this with some success, reducing a 1 minute 30 load time to about 15-20 seconds. It also reduces your heap/perm-gen requirements.
Check the API for the exact methods.
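For illustration, here's a minimal sketch of both cycles against the Drools 5 knowledge API (the file names and the timestamp-based regeneration check are assumptions of mine, and the exact classes can vary between Drools versions):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.drools.KnowledgeBase;
import org.drools.KnowledgeBaseFactory;
import org.drools.builder.KnowledgeBuilder;
import org.drools.builder.KnowledgeBuilderFactory;
import org.drools.builder.ResourceType;
import org.drools.io.ResourceFactory;

public class KnowledgeBaseCache {

    // Cycle 1: run only when the DRL has changed - compile and serialise.
    static void compileAndSave(String drlPath, String cachePath) throws Exception {
        KnowledgeBuilder kbuilder = KnowledgeBuilderFactory.newKnowledgeBuilder();
        kbuilder.add(ResourceFactory.newFileResource(drlPath), ResourceType.DRL);
        if (kbuilder.hasErrors()) {
            throw new IllegalStateException(kbuilder.getErrors().toString());
        }
        KnowledgeBase kbase = KnowledgeBaseFactory.newKnowledgeBase();
        kbase.addKnowledgePackages(kbuilder.getKnowledgePackages());
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(cachePath))) {
            out.writeObject(kbase);
        }
    }

    // Cycle 2: normal startup - deserialise instead of re-parsing the DRL.
    static KnowledgeBase load(String cachePath) throws Exception {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(cachePath))) {
            return (KnowledgeBase) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File drl = new File("rules.drl");
        File cache = new File("rules.kbase");
        // Regenerate only when the DRL is newer than the cached knowledge base.
        if (!cache.exists() || drl.lastModified() > cache.lastModified()) {
            compileAndSave(drl.getPath(), cache.getPath());
        }
        KnowledgeBase kbase = load(cache.getPath());
        // ... create stateful/stateless sessions from kbase as usual
    }
}

The lastModified comparison gives you the "regenerate only when a change is made" behaviour asked about in the question.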

My first thought is to keep the knowledge base around as long as possible. Unless you are creating multiple knowledge bases from different sets of rules, and there are too many possible combinations, hang onto those knowledge bases. In one application I work on, one knowledge base has all the rules so we treat it like a singleton.
However, if that's not possible or your application is not that long-running, I don't know that Drools itself provides any ways of speeding that up. Running a Drools 5.0 project through the debugger, I see that the KnowledgeBase Drools gives me is Serializable. I imagine it would be quicker to deserialize a KnowledgeBase than to re-parse the rules. But be careful designing your application around this! You use interfaces for a reason and the implementation could change without warning.


Caching constructor selection instead of lambda conversion

We have a very large number of Autofac resolutions occurring in our application, and processing time belonging to the Autofac resolution stage has always been long (about 50% of our web request time).
We recently realised how important the guide's advice to Register Frequently-Used Components with Lambdas is, and our testing shows it could bring substantial gains for us.
Convert from this
builder.RegisterType<Component>();
to
builder.Register(c => new Component());
It seems the expensive part is the procedure that reflects to find the largest constructor, scans its parameters, determines whether Autofac can resolve each one, and if not moves on to the next smallest, and so on. So specifying the constructor directly gives us massive improvements.
My question is: could there be, or should there be, some sort of caching for the found constructor? Since the containers are immutable you can't add or remove registrations, so nothing should require a different constructor to be chosen later on.
I just wanted to check before we start working on switching a lot of registrations over to lambdas; we have 1,550 and will try to find the core ones.
Autofac does cache a lot of what it can, but even with container immutability, there are several dynamic factors:
- Registration sources: dynamic suppliers of registrations (that's how open generics get handled, for example).
- Child lifetime scopes: the root container might not have the current HttpRequestMessage for the active WebAPI request, but you can add and override registrations in child scopes so the request lifetime scope does have it.
- Parameters on resolve: you can pass parameters to a resolve operation that will get factored into that operation.
Given stuff like that, caching is a little more complicated than "find the right constructor once, cache it, never look again."
Lambdas will help, as you saw, so look into that.
I'd also look at lifetime scopes for components. If constructor lookup is happening a lot, it means you're constructing a lot, which means you're allocating a lot, which is going to cost you other time. Can more components be singletons or per-lifetime-scope? Using an existing instance will be cheaper still.
You also mentioned you have 1550 things registered. Do all of those actually get used and resolved? Don't register stuff that doesn't need to be. I've seen a lot of times where folks do assembly scanning and try to just register every type in every assembly in the whole app. I'm not saying you're doing this, but if you are, don't. Just register the stuff you need.
Finally, think about how many constructors you have. If there's only one (recommended) it'll be faster than trying many different ones. You can also specify the constructor with UsingConstructor during registration to force a choice and bypass searching.

EF - multiple includes to eager load hierarchical data. Bad practice?

I need to eager-load a hierarchical structure so that I can recursively iterate through it. The eager loading is necessary to prevent multiple DB queries while traversing the tree. It seems the consensus is that you can't eager-load an unlimited number of levels of the tree, so I did something like
var item= db.ItemHierarchies
.Include("Children.Children.Children.Children.Children")
.Where(x => x.condition == condition)
to load 5 levels of children. This seems to get the job done. I'm wondering: what is the drawback to doing this? If there is none, then theoretically could I add 50 levels of includes here without slowing things down?
I recommend taking a look at the SQL that is generated as you add eager loading to your query.
var item= db.ItemHierarchies
.Include("Children")
.Include("Children.Children")
.Include("Children.Children.Children")
.Include("Children.Children.Children.Children")
.Include("Children.Children.Children.Children.Children")
var sql = ((System.Data.Objects.ObjectQuery) item).ToTraceString()
// http://visualstudiomagazine.com/blogs/tool-tracker/2011/11/seeing-the-sql.aspx
You'll see that the SQL quickly gets very big and complicated and can potentially have serious performance implications. You'd do well to limit your eager loading to data that you are certain you will need and to consider using explicit loading for some of the related entities - especially if you're working with connected entities in which case you can explicitly load collection properties when they're needed.
Also note that you may not need multiple separate Includes. For example, the following needs to be separate Includes because they're addressing separate properties (Widgets and Spanners) of the root.
var item= db.ItemHierarchies
.Include("Widgets")
.Include("Spanners.Flanges")
But the following isn't necessary:
var item= db.ItemHierarchies
.Include("Widgets") //This isn't necessary.
.Include("Widgets.Flanges") //This loads both Widges and Flanges.
Well, honestly, it's an extremely bad practice.
Let's assume you had 50 objects in your root, and 50 per level.
You may end up retrieving 312,500,000 "capsules" of information.
Now, one might ask: "So what is wrong with that?!"
I mean, if that is what is required, then why not do it?
Rule #1: we develop software that should be used by human beings.
And the fact is that no human is capable of glancing at 312,500,000 items of information at once and learning or concluding something beneficial from it (except, perhaps, that it does not help to look at it).
Rule #2: UI should be based on what is needed, not on what is possible.
And since we already established that showing 312,500,000 capsules of data is not needed, there is no reason to bring it all at once.
And now you might come forward and say: "But I don't care about the UI, really! All I need is to iterate over that data in order to process some information!"
In that case you would probably want to save your results somewhere for future reference, which means it's a batch job, so why not apply batch-job rules to it, like processing it item by item, which may also give you the benefit of splitting the work across more machines if needed.
So you see, no matter which path you choose, there should be no reason to do it.
(Which is the definition of a bad practice.)
Update:
After reading interesting concerns in the comments, I would like to update this answer with more analysis:
Deciding what is a bad practice must always be done in reference to what is to be achieved, or to the role of each part in the system. In the current situation (after reading the comments) it has been stated or implied that the data storage is actually a persistence medium for objects, as opposed to a different concept where the data is the 'heart' of the application.
We can distinguish two types of data usage:
1) Data-centric, as used in data-centric applications such as banks, CRM, ERP, websites or other service-based solutions.
VS.
2) Data-persistence medium, where the data is saved for when the application is not active, for example: a simple app's save file, a game's save file, etc.
The main difference is that a data-persistence medium is accessed by only a single instance of the app at a single point in time, meaning the data is not designed to be shared by many instances. If the data is to be shared, we are dealing with a data-centric application.
If your app just needs a data-persistence medium, loading all the information cannot be considered a bad practice, but you still need to make sure you are not blowing up the memory. In that frame of work, SQL Server might not be what you need or the best tool to use.
In the other case of a data-centric application, my original answer stands: it is a bad practice to bring all the information into each instance of the application.

Should I reuse GWT instances (registry objects, loaders) or obtain a fresh copy every time?

I'm a newbie in GWT, so sorry for this simple question.
I can call Registry.get("id") every time, or I can cache the returned value in a field. Which is better (and how fast or slow is Registry.get("id"))?
The same question applies to RpcProxy instances and the various loader instances.
That's actually a good question. You should always try to reuse your instances instead of creating new ones, especially with GWT client-side Java code, which becomes JavaScript at runtime. The overhead of instantiating objects in JS (even with all the optimisations you get from GWT) can quickly become unwieldy if you're not careful. Try it for yourself: have a list of 200 GWT Labels of which you only display 10 at a time, versus instantiating only 10 and reusing them each time the values change, and you'll see the difference in the time your browser takes to render.
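As a rough sketch of that reuse pattern (the class and field names below are made up for illustration): build a fixed set of Labels once and update their text when the values change, instead of instantiating a new Label per value.

import java.util.ArrayList;
import java.util.List;
import com.google.gwt.user.client.ui.Label;
import com.google.gwt.user.client.ui.VerticalPanel;

public class PagedList {
    private static final int PAGE_SIZE = 10;
    private final VerticalPanel panel = new VerticalPanel();
    private final List<Label> rows = new ArrayList<Label>();

    public PagedList() {
        for (int i = 0; i < PAGE_SIZE; i++) {
            Label row = new Label();
            rows.add(row);
            panel.add(row);   // created and attached exactly once
        }
    }

    // On a page change only the text is updated; no new widgets are created.
    public void showPage(List<String> values) {
        for (int i = 0; i < PAGE_SIZE; i++) {
            rows.get(i).setText(i < values.size() ? values.get(i) : "");
        }
    }

    public VerticalPanel asWidget() {
        return panel;
    }
}

The same idea applies to the Registry.get(...), RpcProxy and loader instances from the question: fetch the reference once, keep it in a field, and reuse it.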

Rules Based Database Engine

I would like to design a rules-based database engine within Oracle for a PeopleSoft time entry application. How do I do this?
A rules-based system needs several key components:
- A set of rules defined as data
- A set of uniform inputs on which to operate
- A rules executor
- A supervisor hierarchy
1. Write out a series of use cases - what might someone be trying to accomplish using the system?
2. Decide on what things your rules can take as inputs, and what as outputs.
3. Describe the rules from your use cases as a series of data, and thus determine your rule format. Expand step 2 as necessary for this.
4. Create the basic rule executor, and test that it will take the rule data and process it correctly (a minimal sketch follows this list).
5. Extend the above to deal with multiple rules with different priorities.
6. Learn enough rule-engine theory and graph theory to understand common rule-based problems - circularity, conflicting rules etc. - and how to use (node) graphs to find cases of them.
7. Write a supervisor hierarchy that is capable of managing the ruleset and taking decisions based on the possible problems above. This part is important, because it is your protection against foolishness on the part of the rule creators causing runtime failure of the entire system.
8. Profit!
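To make the executor and priority steps concrete, here is a minimal illustrative sketch in Java (all names here are hypothetical, and a real engine such as Drools or Jess does far more): rules are plain data - name, priority, condition, action - evaluated against a map of input facts in descending priority order.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

class Rule {
    final String name;
    final int priority;                               // higher runs first
    final Predicate<Map<String, Object>> condition;   // "when" part
    final Consumer<Map<String, Object>> action;       // "then" part

    Rule(String name, int priority,
         Predicate<Map<String, Object>> condition,
         Consumer<Map<String, Object>> action) {
        this.name = name;
        this.priority = priority;
        this.condition = condition;
        this.action = action;
    }
}

class RuleExecutor {
    void run(List<Rule> rules, Map<String, Object> facts) {
        List<Rule> ordered = new ArrayList<>(rules);
        ordered.sort(Comparator.comparingInt((Rule r) -> r.priority).reversed());
        for (Rule r : ordered) {
            if (r.condition.test(facts)) {
                r.action.accept(facts);   // a real engine would re-evaluate here
            }                             // and guard against circular updates
        }
    }
}

In a database-resident design, the rule rows would live in Oracle tables and the executor would interpret condition/action expressions read from those rows, rather than using compiled Java lambdas as in this sketch.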
Broadly, rules engines are an exercise in managing complexity. If you don't manage it, you can easily end up with rules that cascade from each other causing circular loops, race-conditions and other issues. It's very easy to construct these accidentally: consider an email program which you have told to move mail from folder A to B if it contains the magic word 'beta', and from B to A if it contains the word 'alpha'. An email with both would be shuttled back and forward until something broke, preventing all other rules from being processed.
I have assumed here that you want to learn about the theory and build the engine yourself. alphazero raises the important suggestion of using an existing rules engine library, which is wise - this is the kind of subject that benefits from academic theory.
I haven't tried this myself, but an obvious approach is to use Java procedures in the Oracle database, and use a Java rules engine library in that code.
Try:
http://www.oracle.com/technology/tech/java/jsp/index.html
http://www.oracle.com/technology/tech/java/java_db/pdf/TWP_AppDev_Java_DB_Reduce_your_Costs_and%20_Extend_your_Database_10gR1_1113.PDF
and
http://www.jboss.org/drools/
or
http://www.jessrules.com/
--
Basically you'll need to capture data events (inserts, updates, deletes), map them to your rulespace's events, and apply rules.

Do you create your own code generators?

The Pragmatic Programmer advocates the use of code generators.
Do you create code generators on your projects? If yes, what do you use them for?
In "Pragmatic Programmer" Hunt and Thomas distinguish between Passive and Active code generators.
Passive generators are run-once, after which you edit the result.
Active generators are run as often as desired, and you should never edit the result because it will be replaced.
IMO, the latter are much more valuable because they approach the DRY (don't-repeat-yourself) principle.
If the input information to your program can be split into two parts, the part that changes seldom (A) (like metadata or a DSL), and the part that is different each time the program is run (B)(the live input), you can write a generator program that takes only A as input, and writes out an ad-hoc program that only takes B as input.
(Another name for this is partial evaluation.)
The generator program is simpler because it only has to wade through input A, not A and B. Also, it does not have to be fast because it is not run often, and it doesn't have to care about memory leaks.
The ad-hoc program is faster because it's not having to wade through input that is almost always the same (A). It is simpler because it only has to make decisions about input B, not A and B.
It's a good idea for the generated ad-hoc program to be quite readable, so you can more easily find any errors in it. Once you get the errors removed from the generator, they are gone forever.
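As a toy illustration of the A/B split (the field spec, file names and class names here are all invented for the example): the generator reads only the rarely-changing spec (A) and writes out an ad-hoc Record class whose parse() deals only with the live input (B).

import java.io.PrintWriter;

public class RecordGenerator {
    public static void main(String[] args) throws Exception {
        // Input A: the part that changes seldom (a tiny field spec).
        String[] fields = "id:int,name:String,score:double".split(",");
        try (PrintWriter out = new PrintWriter("Record.java")) {
            out.println("// GENERATED from the field spec. DO NOT EDIT.");
            out.println("public class Record {");
            for (String f : fields) {
                String[] nt = f.split(":");
                out.printf("    public %s %s;%n", nt[1], nt[0]);
            }
            // The generated parse() handles only input B: one live CSV line.
            out.println("    public static Record parse(String csvLine) {");
            out.println("        String[] v = csvLine.split(\",\");");
            out.println("        Record r = new Record();");
            for (int i = 0; i < fields.length; i++) {
                String[] nt = fields[i].split(":");
                String conv = nt[1].equals("int")    ? "Integer.parseInt(v[" + i + "])"
                            : nt[1].equals("double") ? "Double.parseDouble(v[" + i + "])"
                            : "v[" + i + "]";
                out.printf("        r.%s = %s;%n", nt[0], conv);
            }
            out.println("        return r;");
            out.println("    }");
            out.println("}");
        }
    }
}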
In one project I worked on, a team designed a complex database application with a design spec two inches thick and a lengthy implementation schedule, fraught with concerns about performance. By writing a code generator, two people did the job in three months, and the source code listings (in C) were about a half-inch thick, and the generated code was so fast as to not be an issue. The ad-hoc program was regenerated weekly, at trivial cost.
So active code generation, when you can use it, is a win-win. And, I think it's no accident that this is exactly what compilers do.
Code generators, if used widely without good justification, make code less understandable and decrease maintainability (the same goes for dynamic SQL, by the way). Personally I use them with some ORM tools, because their usage there is mostly obvious, and sometimes for things like searcher/parser algorithms and grammar analyzers which are not designed to be maintained by hand. Cheers.
In hardware design, it's fairly common practice to do this at several levels of the 'stack'. For instance, I wrote a code generator to emit Verilog for various widths, topologies, and structures of DMA engines and crossbar switches, because the constructs needed to express this parameterization weren't yet mature in the synthesis and simulation tool flows.
It's also routine to emit logical models all the way down to layout data for very regular things that can be expressed and generated algorithmically, like SRAM, cache, and register file structures.
I also spent a fair bit of time writing, essentially, a code generator that would take an XML description of all the registers on a System-on-Chip, and emit HTML (yes, yes, I know about XSLT, I just found emitting it programmatically to be more time-effective), Verilog, SystemVerilog, C, Assembly etc. "views" of that data for different teams (front-end and back-end ASIC design, firmware, documentation, etc.) to use (and keep them consistent by virtue of this single XML "codebase"). Does that count?
People also like to write code generators for e.g. taking terse descriptions of very common things, like finite state machines, and mechanically outputting more verbose imperative language code to implement them efficiently (e.g. transition tables and traversal code).
We use code generators for generating data entity classes, database objects (like triggers and stored procs), service proxies etc. Anywhere you see a lot of repetitive code following a pattern and a lot of manual work involved, code generators can help. But you should not use them so much that maintainability becomes a pain. Some issues also arise if you want to regenerate the code.
Tools like Visual Studio and CodeSmith have their own templates for most of the common tasks and make this process easier. But it is also easy to roll your own.
It is often useful to create a code generator that generates code from a specification - usually one that has regular tabular rules. It reduces the chance of introducing an error via a typo or omission.
Yes, I developed my own code generator for the AAA protocol Diameter (RFC 3588).
It could generate structures and APIs for Diameter messages, reading from an XML file that described the Diameter application's grammar.
That greatly reduced the time to develop a complete Diameter interface (such as SH/CX/RO etc.).
In my opinion, a good programming language would not need code generators, because introspection and runtime code generation would be part of the language, e.g. Python's metaclasses, the new module, etc.
Code generators usually produce code that becomes unmanageable with long-term use.
However, if it is absolutely imperative to use a code generator (Eclipse VE for Swing development is what I use at times), then make sure you know what code is being generated. Believe me, you don't want code in your application that you are not familiar with.
Writing your own generator for a project is not efficient. Instead, use a generator such as T4, CodeSmith or Zontroy.
T4 is more complex and you need to know a .NET programming language. You have to write your template line by line, and you have to handle the data-relational operations on your own. You can use it from within Visual Studio.
CodeSmith is a functional tool and there are plenty of templates ready to use. It is based on T4, and writing your own template takes as much time as it does in T4. There is a trial and a commercial version.
Zontroy is a new tool with a user-friendly interface. It has its own template language and is easy to learn. There is an online template market and it is growing. You can even publish templates and sell them online through the market.
It has a free and a commercial version. Even the free version is enough to complete a medium-scale project.
There might be a lot of code generators out there; however, I always create my own to make the code more understandable and to suit the frameworks and guidelines we are using.
We use a generator for all new code to help ensure that coding standards are followed.
We recently replaced our in-house C++ generator with CodeSmith. We still have to create the templates for the tool, but it seems ideal to not have to maintain the tool ourselves.
My most recent need for a generator was a project that read data from hardware and ultimately posted it to a 'dashboard' UI. In-between were models, properties, presenters, events, interfaces, flags, etc. for several data points. I worked up the framework for a couple data points until I was satisfied that I could live with the design. Then, with the help of some carefully placed comments, I put the "generation" in a visual studio macro, tweaked and cleaned the macro, added the datapoints to a function in the macro to call the generation - and saved several tedious hours (days?) in the end.
Don't underestimate the power of macros :)
I am also now trying to get my head around CodeRush customization capabilities to help me with some more local generation requirements. There is powerful stuff in there if you need on-the-fly decision making when generating a code block.
I have my own code generator that I run against SQL tables. It generates the SQL procedures to access the data, the data access layer and the business logic. It has done wonders in standardising my code and naming conventions. Because it expects certain fields in the database tables (such as an id column and updated datetime column) it has also helped standardise my data design.
How many are you looking for? I've created two major ones and numerous minor ones. The first of the major ones allowed me to generate 1500-line programs (give or take) that had a strong family resemblance but were attuned to the different tables in a database - and to do that fast, and reliably.
The downside of a code generator is that if there's a bug in the code generated (because the template contains a bug), then there's a lot of fixing to do.
However, for languages or systems where there is a lot of near-repetitious coding to be done, a good (enough) code generator is a boon (and more of a boon than a 'doggle').
In embedded systems, sometimes you need a big block of binary data in the flash. For example, I have one that takes a text file containing bitmap font glyphs and turns it into a .cc/.h file pair declaring interesting constants (such as first character, last character, character width and height) and then the actual data as a large static const uint8_t[].
Trying to do such a thing in C++ itself, so the font data would auto-generate on compilation without a first pass, would be a pain and most likely illegible. Writing a .o file by hand is out of the question. So is breaking out graph paper, hand encoding to binary, and typing all that in.
IMHO, this kind of thing is what code generators are for. Never forget that the computer works for you, not the other way around.
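A sketch of that kind of generator (the glyph text format, file names and array name are invented here, and the real one could be written in any language): read rows of '.'/'#' pixels, pack each 8-pixel-wide row into a byte, and emit a header declaring the data as a static const uint8_t array.

import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class FontGen {
    public static void main(String[] args) throws Exception {
        List<String> rows = Files.readAllLines(Paths.get("Font_foo.txt"));
        try (PrintWriter out = new PrintWriter("Font_foo.h")) {
            out.println("// Generated from Font_foo.txt by FontGen.");
            out.println("static const uint8_t kFontData[] = {");
            for (String row : rows) {
                int packed = 0;
                for (int bit = 0; bit < 8 && bit < row.length(); bit++) {
                    if (row.charAt(bit) == '#') {
                        packed |= 1 << (7 - bit);   // MSB = leftmost pixel
                    }
                }
                out.printf("    0x%02X,%n", packed);
            }
            out.println("};");
        }
    }
}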
BTW, if you use a generator, always always always include some lines such as this at both the start and end of each generated file:
// This code was automatically generated from Font_foo.txt. DO NOT EDIT THIS FILE.
// If there's a bug, fix the font text file or the generator program, not this file.
Yes I've had to maintain a few. CORBA or some other object communication style of interface is probably the general thing that I think of first. You have object definitions that are provided to you by the interface you are going to talk over but you still have to build those objects up in code. Building and running a code generator is a fairly routine way of doing that. This can become a fairly lengthy compile just to support some legacy communication channel, and since there is a large tendency to put wrappers around CORBA to make it simpler, well things just get worse.
In general, if you have a large number of structures, or just rapidly changing structures that you need to use, but you can't take the performance hit of building objects through metadata, then you're into writing a code generator.
I can't think of any projects where we needed to create our own code generators from scratch but there are several where we used preexisting generators. (I have used both Antlr and the Eclipse Modeling Framework for building parsers and models in java for enterprise software.) The beauty of using a code generator that someone else has written is that the authors tend to be experts in that area and have solved problems that I didn't even know existed yet. This saves me time and frustration.
So even though I might be able to write code that solves the problem at hand, I can generate the code a lot faster and there is a good chance that it will be less buggy than anything I write.
If you're not going to write the code, are you going to be comfortable with someone else's generated code?
Is it cheaper in both time and $$$ in the long run to write your own code or code generator?
I wrote a code generator that would build hundreds of classes (Java) that would output XML data from a database in a DTD- or schema-compliant manner. The code generation was generally a one-time thing, and the code would then be smartened up with various business rules etc. The output was for a rather pedantic bank.
Code generators are a workaround for programming language limitations. I personally prefer reflection instead of code generators, but I agree that code generators are more flexible, and the resulting code is obviously faster at runtime. I hope future versions of C# will include some kind of DSL environment.
The only code generators that I use are webservice parsers. I personally stay away from code generators because of the maintenance problems for new employees or a separate team after hand off.
I write my own code generators, mainly in T-SQL, which are called during the build process.
Based on meta-model data, they generate triggers, logging, C# const declarations, INSERT/UPDATE statements, data model information to check whether the app is running on the expected database schema.
I still need to write a forms generator for increased productivity, more specs and less coding ;)
I've created a few code generators. I had a passive code generator for SQL stored procedures which used templates. This generated 90% of our stored procedures.
Since we made the switch to Entity Framework I've created an active code generator using T4 (Text Template Transformation Toolkit) inside Visual Studio. I've used it to create basic repository partial classes for our entities. It works very nicely and saves a bunch of coding. I also use T4 for decorating the entity classes with certain attributes.
I use code generation features provided by EMF - Eclipse Modeling Framework.
Code generators are really useful in many cases, especially when mapping from one format to another. I've done code generators for IDL to C++, database tables to OO types, and marshalling code just to name a few.
I think the point the authors are trying to make is that if you're a developer you should be able to make the computer work for you. Generating code is just one obvious task to automate.
I once worked with a guy who insisted that he would do our IDL to C++ mapping manually. In the beginning of the project he was able to keep up, because the rest of us were trying to figure out what to do, but eventually he became a bottleneck. I did a code generator in Perl and then we could pretty much do his "work" in a few minutes.
See our "universal" code generator based on program transformations.
I'm the architect and a key implementer.
It is worth noting that a significant fraction of this generator, is generated using this generator.
We use the Telosys code generator in our projects: http://www.telosys.org/
We created it to reduce development time for recurrent tasks like CRUD screens, documentation, etc.
For us the most important thing is to be able to customize the generator's templates, in order to create new generation targets if necessary and to customize existing templates. That's why we have also created a template editor (for Velocity .vm files).
It works fine as a Java/Spring/AngularJS code generator and can be adapted to other targets (PHP, C#, Python, etc.).