Does having multiple labels for a node in Neo4j make any sense? - nosql

Following this post from Neo4j's Google group, I have to say that I don't see any benefit in using multiple labels; on the contrary, IMHO it just complicates what a uniqueness constraint means. It could also tempt users to introduce inheritance into the data model, which would cause frustration since that's not possible at all...

Labels are not meant to represent just a type; rather, they are roles that apply in different contexts.
So in one role, certain attributes and relationships of a node might matter and in another role (label) a different set (that might intersect with the first one).
We stayed away from inheritance as it opens a new can of worms, and we favor composition. So you'd rather compose a node as the sum of its parts. You can also mimic inheritance by attaching the "super"-types as labels to the child elements in your hierarchy.
Node labels can also be used to separate subgraphs in a larger graph, e.g. label the proteins that are active in human pathways and phylo pathways with those labels. So you can quickly select a part of the graph that you're interested in.
Those separate subgraphs can also come from different domains, like geo, social, catalogue, supplier, that are combined in a single graph.
And multiple labels also make sense to separate "technical" namespaces of your graph that are used to represent "in-graph-indexes" from your "domain"-labels.
Regarding uniqueness - all uniqueness constraints for the existing labels and properties on your nodes are enforced at the same time. If they cannot be resolved on insert or update the operation will fail.
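To make the last point concrete, here is a minimal in-memory sketch of how several per-label uniqueness constraints can apply to one multi-label node at once. It is a toy model, not Neo4j's implementation or API: `ToyGraph`, `ConstraintViolation`, and the label and property names are all made up for illustration.

```python
class ConstraintViolation(Exception):
    """Raised when a write would break a uniqueness constraint."""


class ToyGraph:
    """Toy model: nodes carry a set of labels plus a property dict."""

    def __init__(self):
        self.nodes = []      # list of (frozenset_of_labels, properties)
        self.unique = set()  # set of (label, property_key) constraints

    def add_unique_constraint(self, label, key):
        self.unique.add((label, key))

    def create_node(self, labels, **props):
        labels = frozenset(labels)
        # Every constraint whose label is on the new node is checked
        # before anything is written; one failure rejects the whole node.
        for label, key in self.unique:
            if label not in labels or key not in props:
                continue
            for other_labels, other_props in self.nodes:
                if label in other_labels and other_props.get(key) == props[key]:
                    raise ConstraintViolation(
                        f"{label}.{key}={props[key]!r} already exists")
        self.nodes.append((labels, props))

    def match(self, label):
        # Selecting by a single label picks out one "role" or subgraph.
        return [props for labels, props in self.nodes if label in labels]
```

With constraints on a hypothetical `Person.email` and `Employee.badge`, a node labelled both `Person` and `Employee` has to satisfy both; creating a second `Person` with the same email raises `ConstraintViolation` regardless of which other labels it carries.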

When to use multiple KieBases vs multiple KieSessions?

I know that one can utilize multiple KieBases and multiple KieSessions, but I don't understand under what scenarios one would use one approach vs the other (I am having some trouble in general understanding the definitions and relationships between KieContainer, KieBase, KieModule, and KieSession). Can someone clarify this?
You use multiple KieBases when you have multiple sets of rules doing different things.
KieSessions are the actual session for rule execution -- that is, they hold your data and some metadata and are what actually executes the rules.
Let's say I have an application for a school. One part of my application monitors students' attendance. The other part of my application tracks their grades. I have a set of rules which decides if students are truant and we need to talk to their parents. I have a completely unrelated set of rules which determines whether a student is having trouble academically and needs to be put on probation/a performance plan.
These rules have nothing to do with one another. They have completely separate concerns, different rule inputs, and are triggered in different parts of the application. The part of the application that is tracking attendance doesn't need to trigger the rules that monitor student performance.
For this application, I would have two different KieBases: one for attendance, and one for academics. When I need to fire the rules, I fire one or the other -- there is no use case for firing both at the same time.
The KieSession is the runtime for when we fire those rules. We add to it the data we need to trigger the rules, and it also tracks some other metadata that's really not relevant to this discussion. When firing the academics rules, I would be adding to it the student's grades, their classes, and maybe some information about the student (e.g. the grade level, whether they're an "honors" student, etc.). For the attendance rules, we would need the student information, plus historical tardiness/absence records. Those distinct pieces of data get added to the sessions.
When we decide to fire rules, we first get the appropriate KieBase -- academics or attendance. Then we get a session for that rule set, populate the data, and fire it. We technically "execute" the session, not the rules (and definitely not the rule base.) The rule base is just the collection of the rules; the session is how we actually execute it.
There are two kinds of sessions -- stateful and stateless. As their names imply, they differ with how data is stored and tracked. In most cases, people use stateful sessions because they want their rules to do iterative work on the inputs. You can read more about the specific differences in the documentation.
For low-volume applications, there's generally little need to reuse your KieSessions. Create, use, and dispose of them as needed. There is, however, some inherent overhead in this process, so there comes a point at which reuse becomes something you should consider. The documentation discusses the solution Drools provides out of the box, which is session pooling.
(When trying to wrap your head around this, I like to use an analogy of databases. A session is like a JDBC connection: for small applications you can create them, use them, then close them as you need them. But as you scale you'll quickly find that you need to look into connection pooling to minimize this overhead. In this particular analogy, the rule base would be the database that the rules are executing against -- not the tables!)

Storing parameters for rules

I am using Red Hat Decision Manager 7.1 (Drools) to create a rule for assigning a case to a department. The rule itself is quite simple; however, it requires quite a lot of parameters (~12), like the agent type, working area, case type, customer seniority and more. The resulting "action" is the department to which the case is assigned.
I tried to place the parameters in a decision table, but the table quickly bloated to over 15,000 rows and will probably get even larger than that. I did, however, notice that in many cases the difference between two rows is one or two parameters (e.g. the same row where the only difference is agent type "Local" vs. "Regional"), resulting in a different assignment.
I am thinking of replacing the table with something else, like a tree structure, so I can group similar rows under the same node and then navigate over the tree to make the decision. To do this I plan to prioritize the parameters and give parameters with higher priority a higher place in the tree.
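A rough sketch of that tree idea, to show what it might look like: rows are inserted along a fixed parameter priority, and a parameter a row does not care about is stored under a wildcard key, so near-duplicate rows collapse into one branch. Everything here is an assumption for illustration (the `WILDCARD` marker, the function names, and the parameter and department names); it is plain Python, not a Drools feature.

```python
WILDCARD = "*"  # hypothetical marker for "this row does not care"


def build_tree(rows, priority):
    """rows: list of (conditions_dict, department); priority: ordered names."""
    tree = {}
    for conditions, department in rows:
        node = tree
        for name in priority[:-1]:
            node = node.setdefault(conditions.get(name, WILDCARD), {})
        node[conditions.get(priority[-1], WILDCARD)] = department
    return tree


def assign(tree, case, priority, depth=0):
    """Walk the tree, preferring an exact match over the wildcard branch."""
    if not isinstance(tree, dict):
        return tree  # reached a leaf: the department name
    value = case.get(priority[depth])
    for key in (value, WILDCARD):
        if key in tree:
            result = assign(tree[key], case, priority, depth + 1)
            if result is not None:
                return result  # otherwise backtrack to the wildcard branch
    return None


priority = ["case_type", "seniority", "agent_type"]
rows = [
    ({"case_type": "billing"}, "Finance"),
    ({"case_type": "billing", "agent_type": "Regional"}, "Regional Finance"),
    ({"case_type": "technical", "seniority": "gold"}, "Priority Support"),
    ({}, "General"),
]
tree = build_tree(rows, priority)
```

Here the two billing rows differ only in agent type, so they share the same branch down to the last level, and a case that matches no specific row falls through to the all-wildcard "General" row.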
Does anyone have experience with such a problem? I looked at decision trees, but they focus more on ML and probabilities, so I'm not sure this is what I need.
Is there any other method to deal with bloated tables that become unmanageable? I cannot go to our customer and ask them to maintain a 15,000-row Excel sheet. They'll shoot me there and then.
Thanks
Alon.

Database schema for a tinder like app

I have a database of millions of objects. Every day I will present 3 selected objects to each of my users and, like with Tinder, they can swipe left to say they don't like an object or swipe right to say they like it.
I select the objects based on their location (those closest to the user are selected first) and also based on a few user settings.
I'm using MongoDB.
Now the problem: how do I design the database so that it can quickly provide a daily selection of objects to show to the end user (skipping all the objects they have already swiped)?
Well, considering you have made your choice of using MongoDB, you will have to maintain multiple collections. One is your main collection, and you will also have to maintain user-specific collections which hold user data, say the document IDs the user has swiped. Then, when you want to fetch data, you might want to do a $setDifference aggregation. $setDifference does this:
Takes two sets and returns an array containing the elements that only
exist in the first set; i.e. performs a relative complement of the
second set relative to the first.
Now how performant this is would depend on the size of your sets and the overall scale.
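For illustration, the relative complement described in the quote can be sketched in plain Python. This is not a MongoDB query; `unseen_objects` is a hypothetical helper, and it assumes the candidate IDs arrive already ordered by distance to the user.

```python
def unseen_objects(candidate_ids, swiped_ids, limit=3):
    """Return the first `limit` candidates the user has not swiped yet.

    candidate_ids: IDs ordered nearest-first (an assumption of this sketch).
    swiped_ids: IDs the user has already swiped left or right on.
    """
    swiped = set(swiped_ids)  # set lookup keeps the membership test O(1)
    return [oid for oid in candidate_ids if oid not in swiped][:limit]
```

The cost grows with the number of candidates scanned and the size of the swiped set, which is the scaling concern mentioned above.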
EDIT
I agree with your comment that this is not a scalable solution.
Solution 2:
One solution I could think of is to use a graph-based solution, like Neo4j. You could represent all your 1M objects and all your user objects as nodes, and have relationships between users and the objects they have swiped. Your query would be to return a list of all objects the user is not connected to.
You cannot shard a graph, which brings up scaling challenges. Graph based solutions require that the entire graph be in memory. So the feasibility of this solution depends on you.
Solution 3:
Use MySQL. Have 2 tables, one being the objects table and the other being a (uid, viewed_object) mapping. A join would solve your problem. Joins work well for the longest time, until you hit a certain scale. So I don't think it is a bad starting point.
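A minimal sketch of that join, using Python's built-in sqlite3 module instead of MySQL so that it runs standalone. The table and column names (`objects`, `viewed`, `uid`, `object_id`) are assumptions, and the query is an anti-join: a LEFT JOIN that keeps only the objects with no matching row for the given user.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE viewed (
            uid INTEGER,
            object_id INTEGER,
            PRIMARY KEY (uid, object_id)
        );
    """)
    conn.executemany("INSERT INTO objects VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
    conn.executemany("INSERT INTO viewed VALUES (?, ?)",
                     [(7, 1), (7, 3), (8, 2)])
    return conn


# Objects for which this user has no row in `viewed`.
UNSEEN_SQL = """
    SELECT o.id
    FROM objects AS o
    LEFT JOIN viewed AS v
           ON v.object_id = o.id AND v.uid = ?
    WHERE v.object_id IS NULL
    ORDER BY o.id
    LIMIT ?
"""


def unseen_for(conn, uid, limit=3):
    return [row[0] for row in conn.execute(UNSEEN_SQL, (uid, limit))]
```

In a real schema the ORDER BY would be driven by distance to the user rather than by ID; the composite primary key on `viewed` is what keeps the per-user lookup indexed.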
Solution 4:
Use Bloom filters. Your problem eventually boils down to a set membership problem: given a set of IDs, check whether each one is part of another set. A Bloom filter is a probabilistic data structure which answers set membership queries. Bloom filters are very small and very efficient. But since they are probabilistic, there is a trade-off: false negatives will never happen, but false positives can. Check out this for how it's used: http://blog.vawter.com/2016/03/17/Using-Bloomfilters-to-Avoid-Repetition/
I'll update the answer if I can think of something else.
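A minimal Bloom filter sketch to make the trade-off concrete. The sizes (1024 bits, 4 hash functions) and the seeded SHA-256 hashing are arbitrary choices for illustration, not a tuned design and not necessarily what the linked post uses.

```python
import hashlib


class BloomFilter:
    """Set-membership filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

One filter per user would hold the IDs already swiped. A candidate that tests as "not in the filter" has definitely not been shown yet, while a rare false positive only means an unseen object gets skipped.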

Do all fields of a list belong to one CFS?

Do generatable array fields always belong to the same CFS?
If one of the list fields has a constraint, another field has a different constraint, and the two are not connected, will both fields belong to the same CFS?
If the fields aren't connected, then each field will be in a different CFS.
The question is not totally clear, but here is an attempt to answer:
If this is a list of structs with several fields, the different fields will belong to the same CFS only if they are connected (e.g. l[0].x and l[0].y will belong to the same CFS only if there is a constraint connecting them).
Assuming the question refers to different indices of the same list path (e.g. l[0].x and l[1].x, or m[0] and m[1]), we need to differentiate between the static and runtime considerations:
Statically, both paths are considered to belong to the same CFS. For example, ICFS analysis assumes that, so "keep x < f(m[0]); keep m[1] < g(x)" will create an ICFS.
At runtime, IntelliGen tries to solve the list items one by one (as if they were in different CFSs), for performance reasons. However, when the list items are connected (either directly or through different variables), which in IntelliGen's terms is called 'lace', they are indeed solved as one CFS, which can be very large.
For more details, I suggest reading section 4.3 ("Avoid Dependencies Between List Elements") in the IntelliGen user guide.

Overpass relation railway segments

I want to query the Overpass API to find out the distance covered by particular relations (railways). The request works fine and returns all the relation, way and node objects I'm interested in. Example for Hamburg:
[out:json];(rel(53.55224324163863, 10.006766589408304, 53.55314275836137, 10.008081410591696)["route"="train"];>;);out body;
In Overpass, each relation object has members defining the relation. For way objects you can resolve the lat/lon of their nodes and calculate the distance for that way. If you sum up all the way distances, the result seems reasonable.
However, the relation also has members of type node (most of the time with a role of "stop"), which seem to represent the correct order of stops for that relation. But instead of being in between the way members, they are roughly at the end.
If I try to look the stops up inside the ways, they are not all present. How am I supposed to calculate the distance between two particular stops?
There seems to be a misconception about relations. Relation members don't necessarily have to be sorted. Consequently you might have to sort the members yourself, if necessary at all.
You can take a look at JOSM, which has a neat sorting algorithm for various types of relations. But I don't think it is able to place members with the role stop at the correct position. This isn't always possible, because a way doesn't necessarily have to be split at each node with the stop role. This also means a single way can contain more than one node with a stop role, making it impossible to sort the relation members correctly unless you do some pre-processing to split each way accordingly.
For calculating the distance between stops, it seems unnecessary to sort the elements. Just follow the way by iterating over all of its nodes, and check for each node whether it has a stop role in the corresponding relation. When you reach the end of the way, continue with the way that shares the same node at its start or end and is also a member of the relation.
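The walk described above can be sketched as follows. This is a simplified illustration under several assumptions: the ways are already extracted from the Overpass response as lists of node IDs, the first way in the list is the start of the line and points in the direction of travel, the line does not branch, and every stop is a node on one of the ways (stops that are not on any way are simply missing from the result). The function names are made up.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def chain_ways(ways):
    """Join ways end to end, reversing a way when it touches with its last node."""
    remaining = [list(w) for w in ways]
    path = remaining.pop(0)
    while remaining:
        for i, way in enumerate(remaining):
            if way[0] == path[-1]:
                path += way[1:]
            elif way[-1] == path[-1]:
                path += way[-2::-1]  # reversed, without the shared node
            else:
                continue
            remaining.pop(i)
            break
        else:
            raise ValueError("ways do not form one connected line")
    return path


def stop_distances(path, coords, stop_ids):
    """Map each stop node on the path to its distance from the path start."""
    distances = {}
    travelled = 0.0
    for i, node in enumerate(path):
        if i > 0:
            travelled += haversine_m(coords[path[i - 1]], coords[node])
        if node in stop_ids:
            distances.setdefault(node, travelled)
    return distances
```

The distance between two stops is then the difference of their values in the returned dictionary, for example `abs(d[stop_b] - d[stop_a])`.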