Graph DB and Relations from Edges - orientdb

I'm going to design a simple knowledge base with Objects, Relations between them, and Questions about Objects and Relations. I understand well how to do this in an RDBMS, but now I want to study graph DB capabilities, in particular OrientDB.
Visual representation of what I want:
Vertex: Moon --> "What is the mass of the M?", etc
|
|
o---> This Edge must also have 'one to many' (edge to questions) relation, e.g
| "How far is the M. from the E?" etc.
|
| <-- Edge: Has a satellite (or stands nearby)
|
Vertex: Earth --> "What is the age of the E?"
I found the Link list and Link set data types in the manual, but I'm still not sure (a) whether they are really what I need and (b) how I should properly represent Questions.
Need your help.
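To make the intended shape concrete, here is a minimal in-memory sketch in plain Python (not OrientDB code). It assumes that an edge is a record in its own right, so it can hold a one-to-many list of Questions just like a vertex. The class names Question, Vertex and Edge and the helper questions_about are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str


@dataclass
class Vertex:
    name: str
    # One-to-many: a vertex links to any number of questions.
    questions: list = field(default_factory=list)


@dataclass
class Edge:
    label: str
    source: Vertex
    target: Vertex
    # The edge is a record too, so it carries its own list of questions.
    questions: list = field(default_factory=list)


earth = Vertex("Earth", [Question("What is the age of the E?")])
moon = Vertex("Moon", [Question("What is the mass of the M?")])
has_satellite = Edge(
    "Has a satellite", earth, moon,
    [Question("How far is the M. from the E?")],
)


def questions_about(record):
    """Return the question texts attached to a vertex or an edge."""
    return [q.text for q in record.questions]
```

In this picture, a list-of-links property on the vertex or edge record plays the role of the questions list.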

Related

OrientDB traverse (children) while connected to a vertex and get an other vertex

I'm not sure the title is the best way to phrase it, here's the structure:
(Image: Structure)
Here's the db json backup if you want to import it to test it: http://pastebin.com/iw2d3uuy
I'd like to get the Dishes eaten by the Humans living in Continent 1 until a _Parent Human moved to Continent 2.
Which means the target is Dish 1 & 2.
If a parent moved to another Continent, I don't want their dish nor the dishes of their children, even if they move back to Continent 1.
I don't know if it matters, but a Human can have multiple children.
If there wasn't the condition about the children of a Human who has moved from the Continent, this query would have worked:
SELECT expand(in('_Is_in').in('_Lives').in('_Eaten_by'))
FROM Continent WHERE continent_id = 1
But I guess here we're forced to use (among other things)
TRAVERSE out('_Parent') FROM Human WHILE
I've tried to use the WHILE of TRAVERSE with a subquery to get all the Humans I'm interested in, before trying to get the Dishes, but I'm not even sure we can use WHILE with a subquery.
I hope the structure will help other users to quickly find out if this query is useful to them. If anyone is wondering, I used the Graph tab of OrientDB Studio to make it, along with GIMP.
As a bonus, if anyone knows the Gremlin syntax, it would also be useful to learn it.
Please feel free to edit this post as you see fit and contribute your thoughts :)
SELECT expand(in('_Eaten_by'))
FROM (TRAVERSE out('_Parent')
FROM (SELECT FROM Human WHERE in('_Parent').size() = 0)
WHILE out('_Lives').out('_Is_in').continent_id = 1)
Explanation:
TRAVERSE out('_Parent')
FROM (SELECT FROM Human WHERE in('_Parent').size() = 0)
WHILE out('_Lives').out('_Is_in').continent_id = 1
returns Human 1 and 2.
That query traverses Human, starting from Human 1 while the Human is connected to Continent 1.
It starts from the Humans matching in('_Parent').size() = 0, i.e. the Humans without any _Parent (there's only Human 1 in this case). size() is the size of the collection of vertices coming in through _Parent.
And SELECT expand(in('_Eaten_by')) FROM
gets the Dishes, starting from the Humans we got from the traversal and going through the edge _Eaten_by.
Note: be sure to always put single quotes (') around vertex and edge names, otherwise the names don't seem to be taken into account.
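The logic of the query can also be sketched outside OrientDB. The Python below is only an illustration: the data is a made-up chain of four Humans (it is not the pastebin export), and traverse_while is a hypothetical helper that imitates TRAVERSE ... WHILE by neither returning nor expanding a record that fails the condition.

```python
# Hypothetical data: Human 3 moved to Continent 2, Human 4 moved back.
children = {"Human 1": ["Human 2"], "Human 2": ["Human 3"], "Human 3": ["Human 4"]}
continent = {"Human 1": 1, "Human 2": 1, "Human 3": 2, "Human 4": 1}
eaten_by = {"Dish 1": "Human 1", "Dish 2": "Human 2",
            "Dish 3": "Human 3", "Dish 4": "Human 4"}


def traverse_while(roots, children, keep):
    """Depth-first walk that skips, and does not expand, records failing keep()."""
    seen, result, stack = set(), [], list(reversed(roots))
    while stack:
        node = stack.pop()
        if node in seen or not keep(node):
            continue  # the branch stops here, so descendants are never reached
        seen.add(node)
        result.append(node)
        stack.extend(reversed(children.get(node, [])))
    return result


# Equivalent of: SELECT FROM Human WHERE in('_Parent').size() = 0
all_children = {c for kids in children.values() for c in kids}
roots = [h for h in continent if h not in all_children]

# Equivalent of the TRAVERSE ... WHILE continent_id = 1
humans = traverse_while(roots, children, lambda h: continent[h] == 1)

# Equivalent of: SELECT expand(in('_Eaten_by')) FROM (...)
dishes = [dish for dish, human in eaten_by.items() if human in humans]
```

Human 4 lives in Continent 1 again, but is never visited because the walk stops at Human 3.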

Controlling edge multiplicity and direction in OrientDB, can it be improved?

It seems there is a way to control multiplicity of edges in ODB through edge constraints.
http://orientdb.com/docs/last/Tutorial-Using-schema-with-graphs.html (towards the bottom)
The edge constraints are one of those usability things that make ODB less user friendly IMHO.
Here are some multiplicity types taken from Titan's manual (because the way they are explained is simple to understand):
MULTI: Allows multiple edges of the same label between any pair of vertices. In other words, the graph is a multi graph with respect to such edge label. There is no constraint on edge multiplicity.
SIMPLE: Allows at most one edge of such label between any pair of vertices. In other words, the graph is a simple graph with respect to the label. Ensures that edges are unique for a given label and pairs of vertices.
MANY2ONE: Allows at most one outgoing edge of such label on any vertex in the graph but places no constraint on incoming edges. The edge label mother is an example with MANY2ONE multiplicity since each person has at most one mother but mothers can have multiple children.
ONE2MANY: Allows at most one incoming edge of such label on any vertex in the graph but places no constraint on outgoing edges. The edge label winnerOf is an example with ONE2MANY multiplicity since each contest is won by at most one person but a person can win multiple contests.
ONE2ONE: Allows at most one incoming and one outgoing edge of such label on any vertex in the graph. The edge label marriedTo is an example with ONE2ONE multiplicity since a person is married to exactly one other person.
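To pin down what each rule means, here is a small in-memory sketch in Python that checks the five multiplicities when an edge is added. It is neither Titan nor ODB code; Graph, MultiplicityError and try_add are hypothetical names used only for this illustration.

```python
class MultiplicityError(ValueError):
    """Raised when a new edge would break the rule set for its label."""


class Graph:
    def __init__(self, rules):
        self.rules = rules  # label -> rule name; unlisted labels are MULTI
        self.edges = []     # list of (label, source, target) tuples

    def add_edge(self, label, src, dst):
        rule = self.rules.get(label, "MULTI")
        same_label = [(s, d) for (l, s, d) in self.edges if l == label]
        has_out = any(s == src for s, _ in same_label)
        has_in = any(d == dst for _, d in same_label)
        if rule == "SIMPLE" and (src, dst) in same_label:
            raise MultiplicityError("pair already connected by this label")
        if rule in ("MANY2ONE", "ONE2ONE") and has_out:
            raise MultiplicityError("source already has an outgoing edge")
        if rule in ("ONE2MANY", "ONE2ONE") and has_in:
            raise MultiplicityError("target already has an incoming edge")
        self.edges.append((label, src, dst))


def try_add(graph, label, src, dst):
    """Return True if the edge was added, False if a rule rejected it."""
    try:
        graph.add_edge(label, src, dst)
        return True
    except MultiplicityError:
        return False
```

With the rule mother = MANY2ONE, for example, two different people can point to the same mother, but a second mother edge leaving the same person is rejected.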
Needless to say, because of this kind of abstraction, implementing such multiplicity while creating edges in Titan is fairly straightforward and simple. In ODB, well, it isn't quite so transparent or simple. This is along the lines of the requested usability abstraction discussed in this issue on GitHub.
Let's go through the possibilities and see how ODB does them (and according to my understanding of ODB, which admittedly isn't great, so I could be wrong. Please correct me if I am!).
MULTI
This is a standard heavy weight edge in ODB. So, creating a heavy weight edge follows the MULTI multiplicity rule automatically.
SIMPLE
This seems like it would fall under the "UNIQUE" indexing method of an edge. However, I don't think that is completely right, because the UNIQUE index enforces a single in and out between two vertices. So using a UNIQUE index is more like the ONE2ONE. I believe this might be the equivalent to a light weight edge, but with an added unique index.(?)
MANY2ONE
I believe this can be done with the in and out constraints.
ONE2MANY
Same as above.
ONE2ONE
Available through the UNIQUE constraint.
So, through this exercise, I think I have learned that ODB can cover all multiplicity scenarios, though, I am absolutely not certain. Why must I be uncertain? This whole concept could be simplified by using the same terms Titan uses. It seems the abstraction is necessary. I believe it would make ODB easier to understand.
Maybe these suggestions could be worth thinking about. Starting from the Cars database example taken from the docs.
Creating the edge classes stays the same.
orientdb> CREATE CLASS Owns EXTENDS E
orientdb> CREATE CLASS Lives EXTENDS E
SIMPLE (a standard light weight edge, but automatically allows multiple light weight edges, which seems a step above Titan!)
orientdb> CREATE EDGE Owns FROM ( SELECT FROM Person ) TO ( SELECT FROM Car )
MULTI (an edge with properties is created, i.e. a heavy weight edge, with in and out mandatory)
orientdb> CREATE MULTI EDGE Owns FROM ( SELECT FROM Person ) TO ( SELECT FROM Car )
MANY2ONE (not sure what needs to happen with in and out here)
orientdb> CREATE MANY2ONE EDGE Lives FROM ( SELECT FROM Country ) TO ( SELECT FROM Person )
ONE2MANY (same as above, not sure about what needs to happen with in and out)
orientdb> CREATE ONE2MANY EDGE Owns FROM ( SELECT FROM Person ) TO ( SELECT FROM Car )
ONE2ONE (this is a heavy weight edge, with an automatic UNIQUE constraint)
orientdb> CREATE ONE2ONE EDGE Owns FROM ( SELECT FROM Person ) TO ( SELECT FROM Car )
UNIQUE (an additional constraint, only for light weight edges)
orientdb> CREATE UNIQUE EDGE Owns FROM ( SELECT FROM Person ) TO ( SELECT FROM Car )
To be honest, I am really not sure this covers all needed or wanted possibilities when it comes to edge direction constraints or multiplicity. However, I know for a fact that the SQL suggested above is a lot easier for me to understand, which is the goal of a declarative language like SQL. As I see it, we are abstracting out three things: edge type creation (light or heavy weight), edge direction, and multiplicity.
As I look back at what I just wrote, I guess what I am uncertain about is how to actually create many-to-one and one-to-many edges in ODB.
Any other thoughts on this and corrections to my thinking would be greatly appreciated.
Scott
I've got a bite in the ODB Google Group.
https://groups.google.com/forum/#!topic/orient-database/sZ4GjSvEKtI
Scott

Getting Streets of a specific postcode using Open Street Maps

I want to write code that takes a country code and a postcode as input and outputs the streets in the given postcode, using some APIs that are based on OSM.
My tactic is as follows:
1. Get the relation ID of the district. For example, 1991416 is the relation ID for the third district in Vienna, Austria. It's provided by the Nominatim API: http://nominatim.openstreetmap.org/details.php?place_id=158947085
2. Put the ID into this API URL: http://polygons.openstreetmap.fr/get_wkt.py?id=1991416&params=0
3. After downloading the polygon, put it into this query on the Overpass API:
(
way
(poly: "polygone data")
["highway"~"^(primary|secondary|tertiary|residential)$"]
["name"];
);
out geom;
This gives me the streets of the searched district. My two problems with this solution are:
1. It takes quite a while, because querying three different APIs per request isn't easy on resources.
2. I don't know how to get the relation ID from step one automatically. When I enter a Nominatim query like http://nominatim.openstreetmap.org/search?format=json&country=austria&postalcode=1030 I just get various points in the district, but not the relation ID of the searched district that I need to get the desired polygon.
So my question is whether someone can tell me how to get the relation ID for the workflow above, or whether there is another, maybe better, way to solve this.
Thank you for your help!
Best Regards
Daniel
You can simplify your approach quite a bit, down to a single Overpass API call, assuming you define some relevant tags to match the relation in question. In particular, you don't have to resort to using poly at all, i.e. there's no need to convert a relation to a list of lat/lon pairs. Nowadays the concept of an area can be used instead to query for certain objects in a polygon defined by a way or relation. Please check out the documentation for more details on areas.
To get the matching area for relation 1991416, I have used postal_code=1030 and boundary=administrative as filter criteria. Using that area you can then search for ways in this specific polygon:
//uncomment the following line, if you need csv output
//[out:csv(::id, ::type, name)];
//adjust area to your needs, filter criteria are the same as for relations
area[postal_code=1030][boundary=administrative]->.a;
// Alternative: {{geocodeArea:name}} -> see
// http://wiki.openstreetmap.org/wiki/Overpass_turbo/Extended_Overpass_Queries
way(area.a)["highway"~"^(primary|secondary|tertiary|residential)$"]["name"];
(._;>;);out meta;
// just for checking if we're looking at the right area
rel(pivot.a);out geom;
Try it on overpass turbo link: http://overpass-turbo.eu/s/6uN
Note: not all ways/relations have a corresponding area, i.e. some area generation rules apply (see wiki page above). For your particular use case you should be ok, however.
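If you want to run this from code, a rough Python sketch could look like the following. It builds the same area-based query for a given postcode and extracts the unique street names from an Overpass JSON response. The function names are made up for this example, the endpoint URL is an assumption (use whichever Overpass instance you prefer), and the country code is not handled here.

```python
import json
import urllib.parse
import urllib.request

# Assumed public Overpass instance; replace with your own if needed.
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def streets_query(postal_code):
    """Build the Overpass QL query for named streets inside the postcode area."""
    return (
        "[out:json];\n"
        f'area["postal_code"="{postal_code}"]["boundary"="administrative"]->.a;\n'
        'way(area.a)["highway"~"^(primary|secondary|tertiary|residential)$"]["name"];\n'
        "out tags;"
    )


def street_names(response):
    """Return the sorted, de-duplicated street names from a parsed JSON response."""
    return sorted({
        element["tags"]["name"]
        for element in response.get("elements", [])
        if "name" in element.get("tags", {})
    })


def fetch_streets(postal_code):
    """Send the query as a POST request (needs network access)."""
    body = urllib.parse.urlencode({"data": streets_query(postal_code)}).encode()
    with urllib.request.urlopen(OVERPASS_URL, data=body, timeout=60) as resp:
        return street_names(json.load(resp))
```

A street is usually split into several ways with the same name, which is why the names are collected into a set before sorting.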

cassandra schema data design for many-to-many array relationship

So I need a DB that can store info for about 300 million users. Each user will have two vectors: their 5 favorite items, and their 5 most similar users (these users are also contained in the user set).
ex:
preferences          users
user  | item         user  | user
--------------       --------------
user1 | item1        user1 | user2
user1 | item2        user1 | user4
user1 | item3        user2 | user8
user2 | item3        . . .
user2 | item4
. . .
So basically I need two tables, both many-many relationships, and both relatively big.
I've been exploring Cassandra (but I'm open to other solutions) and I was wondering how I would define the schema, and what type of indexing I need for this to be optimized and working properly.
I will need to query in two ways:
1. By user, of course, and
2. By whatever item is in their list (so I can get a list of users with the same favorite item).
I've already set up Cassandra and started experimenting with it, but I can't even get lists to work because I need 'composite' primary keys? I don't understand why.
Any help/a push in the right direction is greatly appreciated.
Thanks!
I am not sure you've adequately described your use case. It is the access patterns that first and foremost define your key design, which is ultimately what defines your workload characteristics with NoSQL databases. For example, are you going to have to search for users based on a certain geography or something along those lines, or is this just a simple case of grabbing one user and his favorite items and/or his similar users?
Based on what you've described, you should probably just create a keyspace for user_ids and then your value can be the denormalized copies of "favorite items" and a list of "similar user id's". Assuming your next action is to do something with those similar users, you can quickly get them from the list of id's.
The important point is how big your key is (I mean in characters/bytes) and whether you will be able to fit the keys into memory so you get really fast performance. If your machines have limited memory for your key size, then you need to plan for a number of nodes which can accommodate a given number of keys and let those nodes run on separate servers. At least that is the most important part for Oracle NoSQL Database (ONDB); I am part of that team. The good news is that 300M is still very small.
Hope it helps,
-Robert
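To illustrate the "access patterns define the keys" point for the two lookups in the question (by user and by item), here is a plain Python sketch with two denormalized maps that are written together. It is not Cassandra code; items_by_user, users_by_item and the helper names are assumptions made for this example.

```python
from collections import defaultdict

# One structure per query, each keyed by what that query looks up.
items_by_user = defaultdict(list)   # answers "favorite items of user X"
users_by_item = defaultdict(set)    # answers "users who favor item Y"
similar_by_user = defaultdict(list) # answers "users similar to user X"


def add_favorite(user, item):
    """Write the same fact to both maps so each query is a single key lookup."""
    if item not in items_by_user[user]:
        items_by_user[user].append(item)
    users_by_item[item].add(user)


def add_similar(user, other):
    if other not in similar_by_user[user]:
        similar_by_user[user].append(other)


for user, item in [("user1", "item1"), ("user1", "item2"), ("user1", "item3"),
                   ("user2", "item3"), ("user2", "item4")]:
    add_favorite(user, item)
for user, other in [("user1", "user2"), ("user1", "user4"), ("user2", "user8")]:
    add_similar(user, other)
```

The price of the second map is a duplicate write per favorite, in exchange for never scanning all users to find who likes an item.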

Tag hierarchies and handling of

This is a real issue that applies on tagging items in general (and yes, this applies to StackOverflow too, and no, it is not a question about StackOverflow).
Tagging helps cluster similar items, whatever items they may be (jokes, blog posts, SO questions etc). However, there usually (but not strictly) is a hierarchy of tags, meaning that some tags imply other tags too. To use a familiar example, the "c#" SO tag also implies ".net"; as another example, in a jokes database, a "blondes" tag implies the "derisive" tag, similarly to "irish" or "belge" or "canadian" etc, depending on the joke's country of origin.
How have you handled this, if you have, in your projects? I will supply an answer describing two different methods I have used in two separate cases (actually, the same mechanism but implemented in two different environments), but I am interested not only in similar mechanisms, but also in your opinion on the hierarchy issue.
This is a tough question. The two extremes are an ontology (everything is hierarchical) and a folksonomy (tags have no hierarchy). I have answered this on WikiAnswers, with a reference to Clay Shirky's "Ontology is Overrated" article which claims you should set no hierarchy.
Actually I would say that it is not so much a hierarchical system as a semantic net with perceived distances between tag meanings. What do I mean? Mathematics is closer to experimental physics than to gardening.
One possibility to build such a net: build pairs of tags and let people judge the perceived distance (using a measure like 1-10, meaning something like [synonyms, alike, ..., antonyms]), and when searching, search for all tags within a certain distance.
Does the distance have to be the same when coming from the opposite direction ([a,b] close -> [b,a] close)? Or is proximity transitive ([a,b] close and [b,c] close -> [a,c] close)?
Maybe the first word will by default trigger another semantic field? If you start at "social worker", "analyst" is near. If you start at "programmer", "analyst" is near as well. But starting at either of these points, you probably would not count the other as near ("social worker" is by no means close to "programmer").
You therefore would have only pairs judged and judged in both directions (in random order).
[TagRelations]
tagId integer
closeTagId integer
proximity integer
Example for selection of similar tags:
select closeTagId from TagRelations where tagId = :tagID and proximity < 3
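The same selection can be sketched in Python, which also makes the point about direction visible: each (tag, close tag) pair is stored and judged separately, and nothing is inferred transitively. The proximity numbers below are invented for the example and close_tags is a hypothetical helper.

```python
# (tagId, closeTagId) -> proximity; lower means closer. Values are made up.
tag_relations = {
    ("social worker", "analyst"): 2,
    ("analyst", "social worker"): 4,
    ("programmer", "analyst"): 2,
    ("analyst", "programmer"): 2,
    ("social worker", "programmer"): 8,
    ("programmer", "social worker"): 8,
}


def close_tags(tag, max_proximity=3):
    """Mirror of: select closeTagId ... where tagId = :tagID and proximity < 3."""
    return sorted(
        close
        for (start, close), proximity in tag_relations.items()
        if start == tag and proximity < max_proximity
    )
```

Starting from "social worker" or from "programmer" both lead to "analyst", but neither leads to the other, because that pair was judged as distant in its own right.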
The mechanism I implemented does not use the given tags directly, but goes through an indirect lookup table (not strictly in the DBMS sense) which links a tag to many implied tags (obviously, a tag is also linked with itself for this to work).
In a Python project, the lookup table is a dictionary keyed on tags, whose values are sets of tags (where tags are plain strings).
In a database project (it doesn't matter which RDBMS engine it was), there were the following tables:
[Tags]
tagID integer primary key
tagName text
[TagRelations]
tagID integer # first part of two-field key
tagID_parent integer # second part of key
trlValue float
where trlValue was a value in the interval (0, 1], used to give a weight to each linked tag; a self-to-self tag relation always carries 1.0 in trlValue, while the rest are algorithmically calculated (it's not important how exactly). Think of the example jokes database I gave; a ['blonde', 'derisive', 0.5] record would correlate to a ['pondian', 'derisive', 0.5] record and therefore, given one derisive joke, suggest all the others.
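For the Python variant, a minimal sketch of such a lookup table could look like this. The tag names come from the examples in the question; the helper names expand_tags and find_items and the sample items are made up for the illustration, and the weights (trlValue) are left out to keep it short.

```python
# Each tag maps to the set of tags it implies, itself included.
implied = {
    "c#": {"c#", ".net"},
    ".net": {".net"},
    "blondes": {"blondes", "derisive"},
    "irish": {"irish", "derisive"},
    "derisive": {"derisive"},
}


def expand_tags(tags):
    """Replace the given tags by the union of everything they imply."""
    expanded = set()
    for tag in tags:
        # Unknown tags imply only themselves.
        expanded |= implied.get(tag, {tag})
    return expanded


def find_items(items, wanted):
    """Return the names of items whose expanded tags contain the wanted tag."""
    return sorted(name for name, tags in items.items()
                  if wanted in expand_tags(tags))


items = {"joke1": ["blondes"], "joke2": ["irish"], "post1": ["c#"]}
```

Searching for "derisive" then finds both jokes even though neither was tagged with it directly.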