Storing parameters for rules - drools

I am using Red Hat Decision Manager 7.1 (Drools) to create a rule for assigning a case to a department. The rule itself is quite simple, but it requires quite a lot of parameters (~12), such as the agent type, working area, case type, customer seniority and more. The resulting "action" is the department to which the case is assigned.
I tried to place the parameters in a decision table, but the table quickly bloated to over 15,000 rows and will probably get even larger than that. I did, however, notice that in many cases the difference between two rows is only one or two parameters (e.g. the same row where the only difference is agent type "Local" vs. "Regional"), resulting in a different assignment.
I am thinking of replacing the table with something else, like a tree structure, so I can group similar rows under the same node and then navigate over the tree to make the decision. To do this I plan to prioritize the parameters and give parameters with higher priority a higher place in the tree.
Does anyone have experience with such a problem? I looked at decision trees, but they focus more on ML and probabilities, so I'm not sure that's what I need.
Is there any other method to deal with bloated tables that become unmanageable? I cannot go to our customer and ask them to maintain a 15,000-row Excel sheet. They'll shoot me there and then.
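To make the direction concrete, here's a hypothetical sketch (not my actual schema) of how near-duplicate rows could collapse if a blank (NULL) parameter meant "any value" and the most specific match won:

    -- Hypothetical illustration: NULL in a parameter column means "any
    -- value", so rows differing in one parameter collapse into one row.
    CREATE TABLE assignment_rules (
        agent_type   TEXT,              -- NULL = any agent type
        working_area TEXT,              -- NULL = any area
        case_type    TEXT,              -- ... and so on for ~12 parameters
        priority     INT  NOT NULL,     -- more specific rules rank higher
        department   TEXT NOT NULL
    );

    -- Pick the most specific rule matching the case at hand
    -- (:agent_type etc. are bind-parameter placeholders).
    SELECT department
    FROM   assignment_rules
    WHERE  (agent_type   IS NULL OR agent_type   = :agent_type)
    AND    (working_area IS NULL OR working_area = :working_area)
    AND    (case_type    IS NULL OR case_type    = :case_type)
    ORDER  BY priority DESC
    LIMIT  1;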
Thanks
Alon.

Related

Restricting access to some columns in a table, based on another column in the same row

This feels like it should be a common requirement, but I'm not sure how best to implement it.
I'm going to make up a really simple example. It's similar enough to what I actually need, without getting over-complicated.
Imagine we have a table called transport and it has the following columns:
type
model_name
size
number_of_wheels
fuel
maximum_passenger_count
We want to store all sorts of different types of transportation in this table, but not every type will have values in every column. Imagine the commonality is a lot higher than here, as this is a bit of a fake example.
Here's a few examples of how this might work in practice:
type = cycle, we ban fuel, as it's not relevant for a cycle
type = bus, all columns are valid
type = sledge, we ban number_of_wheels, as sledges don't have wheels, we also ban fuel
type = car, all columns are valid
I want my UI to show a grid with a list of all the rows in the transport table. Users can edit the data directly in the grid and add new rows. When they add a new row, they're forced to pick the transport type in a dropdown before it appears in the grid. They then complete the details. All the values are optional, apart from the ones we explicitly don't want to record a value for, where we expect to not see anything at all.
I can see a number of ways to implement this, but none of them seems like a complete solution:
I could put this logic into the UI, enabling/disabling grid cells based on type. But there's nothing to stop someone directly inserting data into the "wrong" columns in the backend of the database or via the API, which would then come through into the UI unless I also added a rule to mask out values in disabled cells. Making changes to which columns are restricted per transport type would be very difficult.
I could put this logic into the API, raising an error if someone enters data into a cell that should be disallowed. This closes one gap for insertion into the database via the API, but SQL scripts would still allow entry into the "wrong" columns. Also, the user experience would suck, as users would have to guess which columns to complete and which to leave blank. It would still be difficult to make changes to which columns are allowed/restricted.
I could add a trigger to the database, maybe to set the values to NULL if they shouldn't be allowed, but this seems clunky and users would not understand what was happening (for a declarative variant of this database-level idea, see the sketch after this list)
I could add a generated column, but this doesn't help if I sometimes need to set a value and sometimes don't
I could just allow the unnecessary data to be stored in the database, then hide it by using a view to report it back. It doesn't seem great, as users would still see data disappearing from the UI with no explanation
I could add a second table, storing a matrix of which values are allowed and which are restricted by type. The API, UI and database could all implement this list using different mechanisms - this comes with the advantage of having a single place to make changes that will immediately be reflected across the entire system, but it's a lot of work and I have lots of these tables that work the same way
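To make the database-level option concrete, here's a hypothetical sketch using CHECK constraints instead of a trigger (not something I've tried; transport is the example table from above):

    -- Hypothetical sketch: one CHECK constraint per transport type,
    -- declaring which columns must stay NULL for that type.
    CREATE TABLE transport (
        id                      SERIAL PRIMARY KEY,
        type                    TEXT NOT NULL,
        model_name              TEXT,
        size                    TEXT,
        number_of_wheels        INT,
        fuel                    TEXT,
        maximum_passenger_count INT,
        CHECK (type <> 'cycle'  OR fuel IS NULL),
        CHECK (type <> 'sledge' OR (fuel IS NULL AND number_of_wheels IS NULL))
    );

It would close the SQL-script gap, but it shares the matrix table's drawback that the UI and API would still need their own copies of the same logic to give users a decent experience.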

Comparison of db strategies for polymorphic many to many: RDB vs. NoSQL

I'm choosing a database approach, and I'm still at the stage where the decision is at a very high level; I'm trying as hard as possible to fit my problem into Postgres, for its enormous maturity and feature set, but I'm not at all familiar with this level of SQL complexity, and I wonder if the core relationship I have in mind might just be better expressed in a different model. (I've only found other answers that ask about this within a given framework.)
The driving issue (within a larger setup) is the ability to associate Features (e.g. large, weight, type, but a VARYING set of these), that have been assigned by a specific Classifier (model), with Things. These don't just show that a Thing IS 'large,' but that it was ASSESSED for 'largeness,' by a particular Classifier. And, most difficult within an RDBMS, the VALUE of 'large' might be binary, while 'weight' is an integer, 'type' is a category, etc. In other words,
Thing1 --> large=true    (says model 1)
       |-> weight=3      (says model 2)
       |-> type='great'  (says model 3)
Thing2 --> large=false   (says model 1)
       (not assessed for 'weight')
       |-> type='lousy'  (says model 4)
In a nutshell, storing (thing, feature, value, classifier) tuples, where Things and Features are many to many, and values will be of different types, for different features.
Approach 1: Relational
There is a table of Things. There is also a PARENT class, Features, which (or maybe its children?) has a many-to-many relationship with Things. Each particular feature (Large, Weight, Type) is a CHILD class of Feature, and each Child >-< Things junction table also holds the (consistently typed) values of the Things for that particular feature. In other words,
ThingTypeValues
-------------------
thing.id | type.id | 'great'
thing.id | type.id | 'lousy'
...
ThingLargeValues
---------------------
thing.id | large.id | true
thing.id | large.id | false
...
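In SQL terms (a hypothetical sketch with made-up names, adding the Classifier dimension), I picture Approach 1 as something like:

    -- Hypothetical sketch of Approach 1: one consistently typed junction
    -- table per Feature, each row recording which Classifier assessed it.
    CREATE TABLE thing      (id SERIAL PRIMARY KEY, name TEXT);
    CREATE TABLE classifier (id SERIAL PRIMARY KEY, name TEXT);

    CREATE TABLE thing_large_values (
        thing_id      INT REFERENCES thing(id),
        classifier_id INT REFERENCES classifier(id),
        value         BOOLEAN NOT NULL,           -- 'large' is binary
        PRIMARY KEY (thing_id, classifier_id)
    );

    CREATE TABLE thing_weight_values (
        thing_id      INT REFERENCES thing(id),
        classifier_id INT REFERENCES classifier(id),
        value         INTEGER NOT NULL,           -- 'weight' is an integer
        PRIMARY KEY (thing_id, classifier_id)
    );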
This allows me to get all of a thing's features, because they share Feature as a parent, and still query the full (mixed-type) description of the Thing while keeping types consistent within each table. It also helps me avoid the problem of accidentally having a 'great', a 'grreat' and a 'Great': each Feature-child is its own table (instead of just a string within an object, or an ever-being-updated ENUM), so I can easily check what options already exist for labels, by treating each Feature as a well-respected, separate-identity thing.
As a last point: if there were only one Classifier (the thing APPLYING Features), then there could be a single table, with every column being a Feature and every cell holding a Value, or NULL to indicate that the Thing wasn't assessed for that Feature. But because there will actually be a very large number of Models, each giving their own opinion, it sounds like a pretty ugly strategy for each one to have its own enormous, mostly empty, and always growing (as we add more Features) table.
HOWEVER, using SQLAlchemy for example (I'm by far most familiar with Python), I now have to use association objects and AbstractConcreteBase, and it all looks quite nasty to my untrained eye.
Approach 2: NoSQL
Suddenly mixed types, which appear to be the biggest problem, are no longer a problem! I can have a set of features, each with a type and a value, and associate each of them with an answer. If I want to be careful not to fat-finger my categories, I can have a function that validates them, or a process that checks against existing Features before adding a new one. These sound more error-prone, certainly, but I'm asking because I don't know how to evaluate the tradeoff, how to approach the best solution in EITHER framework, or whether something entirely different is a much better idea anyway.
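Since I'd like to stay in Postgres anyway, I suppose the document-style shape could also be approximated there with a JSONB value column (again a hypothetical sketch):

    -- Hypothetical sketch: one row per assessment; the value is stored
    -- untyped as JSONB (true, 3, "great", ...), so mixed types coexist.
    CREATE TABLE assessment (
        thing_id      INT   NOT NULL,
        classifier_id INT   NOT NULL,
        feature       TEXT  NOT NULL,
        value         JSONB NOT NULL,
        PRIMARY KEY (thing_id, classifier_id, feature)
    );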
Opinions, suggestions, technologies, philosophies, and donations, all welcome.

Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here is.
I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type:
aple, apples, appls, ornge, fruits, orange, orange z, pear,
cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables.
What I would like to be able to do is to group them together into e.g. fruits, vegetables, etc.
Put another way I have multiple spellings of various permutations of a parent level variable (fruits or vegetables in this example) and I need to be able to group them as best I can.
The only other potentially relevant feature of the data is the team that entered it, assuming some consistency in the way each team enters their data.
So, I have several million records of multiple spellings and short spellings (e.g. apple, appls) and I want to group them together in some way. In this example by fruits and vegetables.
Clustering would be challenging, since each entry is most often one or two words, making it tricky to calculate a distance between terms.
Short of creating a massive lookup table by hand (not likely with millions of rows), is there any approach I can take to this problem?
You will need to solve the spelling problem first, unless you have Google-scale data that could allow you to learn spelling correction with Google-scale statistics.
Then you will still have the problem that "Apple" could be a fruit or a computer, and "Apple" and "Granny Smith" will be completely different. Your best guess at this second stage is something like word2vec trained on massive data. Then you get high-dimensional word vectors, and can finally try to solve the clustering challenge, if you ever get that far with decent results. Good luck.
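If the true vocabulary is small, a trigram-similarity pass can be a cheap first cut at the spelling step. A hypothetical sketch in Postgres (canonical_type and raw_data are made-up names, and it assumes the pg_trgm extension):

    -- Map each raw "type" to its nearest canonical term by trigram
    -- similarity; only the small canonical list is maintained by hand.
    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    CREATE TABLE canonical_type (term TEXT PRIMARY KEY, parent TEXT);
    INSERT INTO canonical_type VALUES
        ('apple', 'fruits'), ('orange', 'fruits'), ('pear', 'fruits'),
        ('cauliflower', 'vegetables'), ('broccoli', 'vegetables'),
        ('leek', 'vegetables');

    SELECT r.raw_type, c.term, c.parent,
           similarity(r.raw_type, c.term) AS score
    FROM   raw_data r
    JOIN   LATERAL (
             SELECT term, parent
             FROM   canonical_type
             ORDER  BY similarity(r.raw_type, term) DESC
             LIMIT  1
           ) c ON true
    WHERE  similarity(r.raw_type, c.term) > 0.3;  -- tune the cutoff

It won't touch the Apple-the-company ambiguity, but it can shrink millions of raw strings down to a reviewable set of matches per canonical term.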

SQL Database Design - Flag or New Table?

Some of the Users in my database will also be Practitioners.
This could be represented by either:
an is_practitioner flag in the User table
a separate Practitioner table with a user_id column
It isn't clear to me which approach is better.
Advantages of flag:
fewer tables
only one id per user (hence no possibility of confusion, and also no confusion in which id to use in other tables)
flexibility (I don't have to decide whether fields are Practitioner-only or not)
possible speed advantage for finding User-level information for a practitioner (e.g. e-mail address)
Advantages of new table:
no nulls in the User table
clearer as to what information pertains to practitioners only
speed advantage for finding practitioners
In my case specifically, at the moment, practitioner-related information is generally one-to-many (such as the locations they can work in, or the shifts they can work, etc.). I would not be at all surprised if it turns out I need to store simple attributes for practitioners too (i.e., one-to-one).
Questions
Are there any other considerations?
Is either approach superior?
You might want to consider the fact that someone who is a practitioner today may be something else tomorrow (and by that I don't mean simply no longer being a practitioner). Say a consultant, an author, or whatever the variants are in your subject domain; you might want to keep track of their latest status in the Users table. So it might make sense to have a ProfType field (type of professional practice) or equivalent. This way you have all the advantages of having a flag: you could keep it as a string field and leave it as a blank string, or fill it with other professional-type codes as your requirements grow.
You mention that having a new table has a speed advantage for finding practitioners. No, you are better off with a WHERE clause on the Users table for that.
Your last paragraph (one-to-many), however, may tilt the whole choice in favour of a separate table. You might also want to consider the likely number of records, likely growth, the criticality of complicated queries, etc.
I tried to draw two scenarios, with some notes inside the image. It's really only a draft, just to help you "see" the various entities. Maybe you have already done something like this; in that case, please do not consider my answer. As Whirl stated in his last paragraph, you should consider other things too.
Personally I would go for a separate table, as long as you can already identify some extra data that makes sense only for a Practitioner (e.g. full professional title, or the University, Hospital or any other Entity the Practitioner is associated with).
That way, if in the future you discover more data that makes sense only for the Practitioner, and/or identify another distinct "subtype" of User (e.g. Intern), you can just add fields to the Practitioner subtable, or add a new table for the Intern.
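For illustration, a minimal sketch of that layout (the column names are made up):

    -- Practitioner-only data lives in a 1:1 subtable keyed by the user id;
    -- one-to-many practitioner data then hangs off the subtable.
    CREATE TABLE users (
        id    SERIAL PRIMARY KEY,
        email TEXT NOT NULL
    );

    CREATE TABLE practitioner (
        user_id            INT PRIMARY KEY REFERENCES users(id),
        professional_title TEXT
    );

    CREATE TABLE practitioner_location (
        user_id  INT REFERENCES practitioner(user_id),
        location TEXT NOT NULL
    );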
It might be advantageous to use a User Type field as suggested by @Whirl Mind above.
I think that this is just one example of having to identify different type of Objects in your DB, and for that I refer to one of my previous answers here: Designing SQL database to represent OO class hierarchy

Which of Context and Alias is more useful?

When linking tables we can create loops, which result in duplicate records in a report; to overcome this we use Contexts and Aliases.
As far as I know, both serve the same purpose, but what is the difference between the two, and which one is more effective?
One thing I am aware of is that an alias creates more tables, but all of those tables are logical, so is an alias more useful than a context?
This is kind of like asking, what's the more useful tool: a wrench or a screwdriver? It depends on the task at hand.
You are correct that aliases create additional logical tables. Sometimes that's the desired approach, but not always.
One way I approach the question is to first determine whether there are multiple logical dimensions for a single physical dimension.
For example, consider a fact table with two date keys: transaction_dt_key, completed_dt_key. Both of these are associated with a date_key field in a date_dim table. You would, of course, create a loop if you were to join both fact fields to the date dim table. In this case, an alias is appropriate -- you would alias the dim table, join the fact keys to the original and alias table, then create a new object associated with the alias table.
Another way to look at this example is that the Transaction Date and Completed Date are two different things. Therefore, it is appropriate to have them represented by two different objects, and it follows that this would be accomplished by an alias.
In this respect, the design in the universe will more closely match the logical design of the data mart rather than its physical design.
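To make that concrete, the universe alias effectively becomes a second SQL alias over the same physical table (a rough sketch; fact_table, sale_amount and full_date are made-up names):

    -- The one physical date_dim plays two logical roles via two aliases.
    SELECT f.sale_amount,
           txn.full_date AS transaction_date,
           cmp.full_date AS completed_date
    FROM   fact_table f
    JOIN   date_dim txn ON f.transaction_dt_key = txn.date_key
    JOIN   date_dim cmp ON f.completed_dt_key   = cmp.date_key;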
Contexts, on the other hand, are useful when the same dimension table is associated with multiple fact tables.
Consider this example: the model has
customer_dim
store_dim
sales_fact
returns_fact
Both fact tables have a customer_id and store_id field. Joining all keys would create a loop. In this case, a context would be appropriate -- one context to include sales_fact and the two dims, and the other context to include returns_fact and the two dims.
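In SQL terms, the two contexts effectively resolve to two separate statements whose results are merged afterwards, instead of one looping join (a rough sketch; the measure and name columns are made up):

    -- Sales context:
    SELECT c.customer_name, s.store_name, SUM(f.sales_amount) AS sales_total
    FROM   sales_fact f
    JOIN   customer_dim c ON f.customer_id = c.customer_id
    JOIN   store_dim    s ON f.store_id    = s.store_id
    GROUP  BY c.customer_name, s.store_name;

    -- Returns context:
    SELECT c.customer_name, s.store_name, SUM(f.returns_amount) AS returns_total
    FROM   returns_fact f
    JOIN   customer_dim c ON f.customer_id = c.customer_id
    JOIN   store_dim    s ON f.store_id    = s.store_id
    GROUP  BY c.customer_name, s.store_name;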
They both serve a general purpose of controlling loops in a universe.
Personally, I've used them both in the same universe. They can be complementary.
I totally agree with Joe's explanation and examples.
Since aliases can be physically seen on the model, maintaining them can be less challenging than maintaining contexts.