OrientDB - Find related items of a vertex - orientdb

We are building a system that will audit searches of a user (as well as other actions by the user). We will be tracking data such as User1 searched for Term1. There is an edge called 'searched' between a User and a Term vertex. What we are trying to find is some related information like users that searched for Term1 also searched for these terms (Term2, Term3, etc) and possibly some other related information between users and terms like "you may know these users". I am guessing a traversal is needed, however what I'm wondering is if depth of a traversal matters and will tell us the data we want. If we go the traversal route how deep do we set before we lose actual relevance?
So far this is what we pieced together but we aren't entirely sure if it is the correct approach.
SELECT $depth, * FROM (TRAVERSE * FROM (SELECT FROM Searched where q = 'al') STRATEGY BREADTH_FIRST ) WHERE #class in ['Term']
Update: what I ended up going with so far was the following query. It will tell me what other users searched for and a count of how many times. So I sort by how close a vertex is and how many times it was searched for. I feel this should hopefully provide a fairly good sample of what other users are searching for that also searched for that term.
SELECT $depth, q, in().size() AS count FROM (TRAVERSE * FROM (select from Term where q.toLowerCase() = 'aluminum') STRATEGY BREADTH_FIRST) WHERE #class = 'Term' AND $depth <> 0 ORDER BY $depth ASC, count DESC

Related

How to find most connected child nodes in OrientDB

So I'm new to OrientDB, and while I'm pretty good at SQL the syntax to get what I want in OrientDB is escaping me.
I know I can do something like select *, in().size() as size from Users order by size desc to find the most connected node of a certain class (Users in this case), but how do I find the most connected children a couple levels down?
I.e., let's say I have Organizations --> PROMOTES (edge) --> Platform --> MANAGES (edge) --> Suggestion
How do I find the most connected Suggestions at the Organization level? I.e., I know I can easily find the most connected suggestions one level out using the query I shared, but what about the most connected another level beyond that?
I'd ultimately like a result which lists each Suggestion along with how indirectly connected (number of edges) it is to Organizations.
Thank you!
Use TRAVERSE
You should use the command TRAVERSE to do that
SELECT out(PROMOTES).size() AS connectedOrg,*
FROM (
TRAVERSE out(MANAGES)
FROM Suggestion WHILE $depth < 2
)
WHERE $depth > 0
Result will be the Platforms linked to a Suggestion. Records can be duplicated.
Along with each Platform, you get the number of Organisation called connectedOrg.
About traverse :
Traverse follow the record ids present in a record and aggregates them in the results. With WHILE $depth < 2 you can limit search to only one level and with WHERE $depth > 0 you can remove the original record from the result. More info here.
Use OrientDB Functions
If you need to know how many Organization is linked to each Suggestion (through Plateform), use this syntax.
SELECT *, set(out(MANAGES).out(PROMOTES), null).size() FROM Suggestion
Note : , null allows to switch from AGGREGATE to INLINE. It prevents from having all sizes aggregated in a single record. See the doc for more info.

OrientDB traverse (children) while connected to a vertex and get an other vertex

I'm not sure the title is the best way to phrase it, here's the structure:
Structure
Here's the db json backup if you want to import it to test it: http://pastebin.com/iw2d3uuy
I'd like to get the Dishes eaten by the Humans living in Continent 1 until a _Parent Human moved to Continent 2.
Which means the target is Dish 1 & 2.
If a parent moved to another Continent, I don't want their dish nor the dishes of their children, even if they move back to Continent 1.
I don't know if it matters, but a Human can have multiple children.
If there wasn't the condition about the children of a Human who has moved from the Continent, this query would have worked:
SELECT expand(in('_Is_in').in('_Lives').in('_Eaten_by'))
FROM Continent WHERE continent_id = 1
But I guess here we're forced to use (among other things)
TRAVERSE out('_Parent') FROM Human WHILE
I've tried to use the while of traverse with a subquery to get all the Humans I'm interested in, before to try to get the Dishes, but I'm not even sure we can use while with a subquery.
I hope the structure will help other users to quickly find out if this query is useful to them. If anyone is wondering, I used the Graph tab of OrientDB Studio to make it, along with GIMP.
As a bonus, if anyone knows the Gremlin syntax, it would also be useful to learn it.
Please feel free to edit this post as you see fit and contribute your thoughts :)
SELECT expand(in('_Eaten_by'))
FROM (TRAVERSE out('_Parent')
FROM (SELECT from Human WHERE in('_Parent').size() = 0)
WHILE out('_Lives').out('_Is_in').continent_id = 1)
Explanation:
TRAVERSE out('_Parent')
FROM (SELECT FROM Human WHERE in('_Parent').size() = 0)
WHILE out('_Lives').out('_Is_in').continent_id = 1
returns Human 1 and 2.
That query traverses Human, starting from Human 1 while the Human is connected to Continent 1.
It starts from in('_Parent').size() = 0 which are the Humans without any _Parent (there's only Human 1 in this case) (size() is the size of the collection of vertices coming in from _Parent).
And SELECT expand(in('_Eaten_by')) FROM
gets the Dishes, starting from the Humans we got from the traversal and going through the edge _Eaten_by.
Note: be sure to always use ' around the vertices and edges names, otherwise the names don't seem to be taken in account.

Visualizing graph with OrientDB Studio

I'm working with OrientDB (2.2.10) and occasionaly I would like to visually inspect my dataset to make sure I'm doing things correctly. On this page of OrientDB http://orientdb.com/orientdb/ you see a nice visualization of a large graph with the following query:
select * from V limit -1;
So I tried the same query on my dataset but the result is so extremely sluggish that I can't work with it. My dataset is not extremely large (few hundred vertices, couple thousand edges) but still the result is unworkable. I tried all major browsers but with all I have the same result. Also my computer is not underpowered, I have a quad-core i7 with 16GB RAM.
As a very simple example I have the following graph:
BAR --WITHIN---> CITY --LOCATED_IN--> COUNTRY
Here: Find "friends of friends" with OrientDB SQL I was able to get at least an example of how to do this type of query on a graph. I managed to get a subset of my graph for example as follows:
select expand(
bothE('WITHIN').bothV()
) from Bar where barName='Foo' limit -1
This get's me the graph of 1 Bar vertex, the edge WITHIN and the City vertex. But if I now want to go one step further by also fetching the country which the city is located in I cannot get this style of query to work for me. I tried this:
select expand(
bothE('WITHIN').bothV()
.bothE('LOCATED_IN').bothV()
) from Bar where barName='Foo' limit -1
This results in the same subset being shown. However, if I first run the first query and then without clearing the canvas run the second query I do get the 3 vertices. So it seems I'm close but I would like to get all 3 vertices and it's edges in one query, not having to run first the one and then the other. Could someone point me in the right direction?
If you want to get all three vertices, it would be much easier start from the middle (city) and than get in and out to get bar and contry. I've tried with a similar little structure:
To get city, bar name and country you can try a query like this:
select name, in("WITHIN").name as barName,out("LOCATED_IN").name as barCountry from (select from City where name='Milan') unwind barName, barCountry
And the output will be:
Hope it helps.
If it is not suitable for your case, let me know.
You could use
traverse * from (select from bar where barName='Foo') while $depth <= 4
Example: I tried with this little graph
and I got
Hope it helps.

Find most common shared vertices in OrientDB

I'm currently evaluating OrientDB (2.1.16) as a possible solution to building a similarity recommender. To that end, I'd love some help writing an initial query that accomplishes the following:
Vertex:Maker -(Edge:Produced)-> Vertex:Item -(Edge:TaggedBy)-> Vertex:Tag
I'd like to select a particular Item (V1) and get a list back of other Items (Vn) ordered by the number of Tags shared in common with V1;
By extension, I'd like to take a selected Maker (V2) and traverse through Items to get an ordered list of Makers (and the traversed Items, if possible) who share Tags.
There isn't an awful lot of detailed documentation on the application of intersect in this way. No unusual constraints in particular. There would be thousands of Items and Makers and probably 10x that many Tags.
I tried with this little graph example
I used this query
select item.name, count(tag)from (
select from (
MATCH {
CLASS:Item, AS:item, WHERE: (name<>'v1')
}
.out("TaggedBy"){AS:tag}
return item, tag
) where tag in (
select expand(tag) from (
MATCH {
CLASS:Item, AS:item, WHERE: (name='v1')
}.out("TaggedBy"){AS:tag}
return tag
)
)
) group by item order by count desc
and I got this result
Hope it helps.

Sphinx: Show all results order by previous searches

I use SphinxQL for searching and filtering in product database and I store last x search phrases of each user. I wonder if is it possible to show all products (all rows) to every user but with relevance on previous search.
Let's say one user sought for mobile phones (iphone, galaxy s7...), ie. electronics category. I want to show him all products randomly, but products from category electronics more often and products with those searched keywords even more often.
Is it even possible with Sphinx?
Thanks and sorry for english.
An alternative, would be perhaps to create random numbers attached to each result. A high and a low number, with an overlapping range.
sql_query = SELECT id, RAND()*100 AS rand_low, (RAND()*100)+50 AS rand_high, ...
sql_attr_uint = rand_low
sql_attr_uint = rand_high
Can then arrange the ranking expression to pick either of these numbers depending on if matches or not, and sort by the result.
SELECT id FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
OPTION ranker=expr('IF(doc_word_count>1,rand_high,rand_low)');
Will be mixed up. But results that match one of the words, have a greater chance of showing up first (because use the weighted random number) - its still only a chance, because a rand_high CAN still be smaller than rand_low.
... can change the size of the number 'overlap' to tweak the mix of matching/non matching results.
(added as a new answer as its a quite differnt idea, although uses the same 'all' keyword)
Sphinx doesn't have a 'mode' to just do that. But can get very close...
Can use MAYBE operator
MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
The complication is need a way to match all products. Depending on your data you may already have a word can use (eg word like 'the' in every single product), or add the word to every document, during indexing.
... using MAYBE allows the matching results to have a higher weight.
But you dont want to sort strictly by weight. So need a different alogithm, something to shuffle the results a bit (as you not really wanting 'random'!)
SELECT id, IDIV(id/10000) AS int,WEIGHT() AS w
FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
ORDER BY int DESC, w DESC;
This creates banding by ID, as in theory results can be spread over all the id-space will mix them up a bit. But the category results will still tend to be shown first within each band.
If you have one a different attribute other than ID might be better, something more spread out. Or can add a deliberate random attribute to results)
... there are all sort so variations, your imagination is the only limitation, this basic techqiue can be used to mix things up quote a bit.
(There are other possiblities, Sphinxes little known GROUP N BY function, can be used to produce a sampling search result. This is isnt random, but it might give the similar enough result - ie just mixing up results)