OrientDB - Compute User Similarity

I am trying to solve a simple problem: computing user-to-user similarity based on the Euclidean distance between the ratings that two users gave to the same products.
I am using the following query:
SELECT U1.UserId, U2.UserId
FROM (
MATCH
{class:User, as: U1, where: (UserId=12345) } -rate-> {class:Product, as:P},
{class:User, as: U2, where: (UserId<>12345)} -rate-> {as:OP},
RETURN U1, U2, P, OP
)
I would now like to compute, for each pair (user U1, user U2), a value that represents the distance between their ratings of common products.
Example of common products for two users:
User, Product, Rating
1, xxx, 5
2, xxx, 2
1, yyy, 10
2, yyy, 8
So I would compute Sqrt((5-2)^2 + (10-8)^2) as the distance.
Is this possible with a single query in OrientDB? Neo4j provides the WITH clause to chain consecutive steps in a Cypher query.
Many thanks in advance for any help you can provide.
Thx
Roberto

First of all, I'd rewrite the MATCH statement to return the squared difference of the ratings for two users and a product:
MATCH
{class:User, as: U1, where: (UserId=12345) }.outE("rate"){as:r1}.inV(){class:Product, as:P},
{class:User, as: U2, where: (UserId<>12345)}.outE("rate"){as:r2}.inV(){as:P}
RETURN U1, U2, (r1.rating - r2.rating) * (r1.rating - r2.rating) as squareDistance, P
Then you can use outer SELECTs to do the calculation, summing over the common products of each pair of users:
SELECT U1, U2, sqrt(squareSum) as distance from (
SELECT U1, U2, sum(squareDistance) as squareSum from (
MATCH...
) GROUP BY U1, U2
)
The only problem here is that OrientDB does not have a built-in sqrt() function, so you have to write your own sqrt() in JavaScript. That is quite easy: inside JS functions you can use Java classes, so the function body will just be
return java.lang.Math.sqrt(x);
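To sanity-check what the nested query is meant to produce, here is a minimal Python sketch of the same calculation on the example ratings from the question. The dictionary layout and the function name are assumptions made for illustration only; nothing here is OrientDB API.

```python
from math import sqrt

# Hypothetical ratings, taken from the example in the question:
# (user, product) -> rating
ratings = {
    (1, "xxx"): 5,
    (2, "xxx"): 2,
    (1, "yyy"): 10,
    (2, "yyy"): 8,
}

def euclidean_distance(ratings, u1, u2):
    """Distance between two users over the products both have rated."""
    products_u1 = {p for (u, p) in ratings if u == u1}
    products_u2 = {p for (u, p) in ratings if u == u2}
    common = products_u1 & products_u2
    # Per-product squared difference (what the MATCH returns),
    # summed up (the inner SELECT), then square-rooted (the outer SELECT).
    square_sum = sum((ratings[(u1, p)] - ratings[(u2, p)]) ** 2 for p in common)
    return sqrt(square_sum)

distance = euclidean_distance(ratings, 1, 2)
print(distance)  # sqrt(3**2 + 2**2) = sqrt(13)
```

Products rated by only one of the two users are ignored here, which matches binding both patterns to the same P in the MATCH.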

Related

Why does LATERAL not work with values?

This does not make sense to me: isn't a literal a valid column?
SELECT x, y FROM (select 1 as x) t, LATERAL CAST(2 AS FLOAT) AS y; -- fine
SELECT x, y FROM (select 1 as x) t, LATERAL 2.0 AS y; -- SYNTAX ERROR!
The same happens if you use a CASE clause, an x+1 expression, or (x+1)... it seems to be an ERROR for any non-function.
The Pg Guide, about LATERAL expressions (not LATERAL subqueries), says:
LATERAL is primarily useful when the cross-referenced column is necessary for computing the row(s) to be joined (...)
NOTES
The question is about LATERAL 1_column_expression not LATERAL multicolumn_subquery. Example:
SELECT x, y, exp, z
FROM (select 3) t(x), -- subquery
LATERAL round(x*0.2+1.2) as exp, -- expression!
LATERAL (SELECT exp+2.0 AS y, x||'foo' as z) t2 --subquery
;
... After @klin's comment showing that the Guide says "only functions" at another point, the question Why? must be expressed in a more specific way, changing the scope of the question a little bit:
"Only functions" does not make sense: the syntax (x) or (x+1), encapsulating the expression in parentheses, is fine, is it not? Why only functions?
PS: perhaps there is a future plan, or perhaps a real problem with parsing generic expressions... As users we must show the PostgreSQL developers what makes sense and what we need.

It'll all work fine if you wrap it in its own subquery
SELECT x, y FROM (select 1 as x) t, LATERAL (SELECT 2.0 AS y) z;

A literal is a valid value for a column, but as the docs you quoted say, LATERAL syntax is used
for computing the row(s) to be joined
A relation, such as a FROM or JOIN or LATERAL subquery clause, always computes tuples of (a single or multiple) columns. The alias you're assigning is not for an individual row, but for the whole tuple.

Answering "Why only functions?" by intuition.
Or "Why does the PostgreSQL spec use only functions?". Of course, it's not a question about the parser, because it complies with the specification.
The SELECT syntax Guide shows the only occasions when we can use LATERAL:
[ LATERAL ] ( select ) [ AS ] alias [ ( column_alias [, ...] ) ] ...
[ LATERAL ] function_name ( [ argument [, ...] ] ) ...
[ LATERAL ] ROWS FROM( function_name ( [ argument [, ...] ] ) ...
So there would be no conflict in adding
[ LATERAL ] (single_expression) [ AS ] alias
@Bergi's guess is that a literal expression like LATERAL 2.0 AS y could be interpreted as LATERAL "2"."0", that is, table "0" in schema "2". But, as we saw above, it makes no sense to expect a table name after the LATERAL keyword, so in fact there is no ambiguity.
Conclusion: it looks like the specification of LATERAL could grow to allow the use of expressions. This is the great advantage of being able to discuss and participate in open community software!
Why LATERAL single_expression AS alias? Rationale:
To be orthogonal: any new PostgreSQL user who sees that SELECT a, x, x+b AS y FROM t, LATERAL f(a) AS x is valid will naturally also try expressions instead of functions. That is expected in an "orthogonal system" and is intuitive for any programmer.
To reuse expressions: we use "chains of dependent expressions" in any language, things like a=b+c; x=a+y; z=a/2; .... It is ugly to write "SELECT(SELECT(SELECT))" in SQL only to reuse expressions. A "chain of LATERALs" is more elegant and human-readable, and is perhaps also better for query optimization.

How to optimize an OrientDB MATCH over large data that filters on a relation twice?

I have recently started using OrientDB. I loaded my database's data into OrientDB and created the relations between the different data types, and I am now trying to find data with OrientDB's MATCH command. However, some MATCH queries are very slow, because I need to filter on a relation twice. I would like some guidance on how to write MATCH queries like this.
Details are below:
I have node types: U, A, I
The relation between the nodes is:
U -R_UA_REL-> A -R_HAS-> I
U has property of {user_id: xxx}
A has property of {create_time:xxx}
I has properties of {id:xxx, value:xxx}
The search order is:
find U by its user_id property, as u;
find A through u's R_UA_REL relation, get a1;
filter a1 by its R_HAS relation to an I with a given id and value, get a2;
filter a2 by its R_HAS relation to an I with a given id and value, get a3;
return a3 as the result.
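The steps above can be sketched in plain Python to pin down the intended semantics: each step narrows the same set of A nodes. This is only a minimal illustration; the dictionaries, ids and values below are hypothetical stand-ins for the real graph.

```python
# Hypothetical in-memory stand-in for the graph; all names and values here
# are illustrative, not taken from the real database.
ua_rel = {"u1": ["a1", "a2", "a3"]}            # U -R_UA_REL-> A
has = {                                        # A -R_HAS-> I, as (id, value)
    "a1": [("item_x", "taxi"), ("item_y", 50)],
    "a2": [("item_x", "taxi"), ("item_y", 500)],
    "a3": [("item_y", 50)],
}

def has_item(a, predicate):
    """True if node a has at least one R_HAS target satisfying predicate."""
    return any(predicate(item_id, value) for item_id, value in has.get(a, []))

def search(user):
    # Step 2: all A nodes reachable from the user.
    a1 = ua_rel.get(user, [])
    # Step 3: keep the A nodes that have an I matching the first condition.
    a2 = [a for a in a1 if has_item(a, lambda i, v: i == "item_x" and v == "taxi")]
    # Step 4: of those, keep the A nodes that also match the second condition.
    a3 = [a for a in a2 if has_item(a, lambda i, v: i == "item_y" and 1 < v < 130)]
    return a3

result = search("u1")
print(result)
```

In this reading a2 and a3 are always subsets of a1, so the filters never need to leave the user's own A nodes.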
My match query command is:
match
{class:U, as: u, where:(user_id = 60000021380)}
.out('R_UA_REL')
{class:A, as: a1, where:(create_time > 1509033600 and create_time < 1509206400000)}
.out('R_HAS')
{class:I, as: i1, where:(id = "5d6fc56bf2d34bd09d394b1ce3d357e1" and value = "交通费")}
.in('R_HAS')
{class:A, as: a2}
.out('R_HAS')
{class:I, as: i2, where:(id = "c59994b93c22488fa6f3cd7563715923" and value > 1 and value < 130)}
.in('R_HAS')
{class:A, as: a3}
return a3
limit 10000;
The data scale at each step of the query is:
(U:u)-R_UA_REL->(A:a1)-R_HAS->(I:i1)<-R_HAS-(A:a2)-R_HAS->(I:i2)<-R_HAS-(A:a3)
u: 1
a1: ~1000K
i1: ~10000K
a2: ~290K
i2: ~2900K
a3: ~20K
The query is very slow: it takes about 11 s per run.
The OrientDB version is orientdb-community-2.2.28.
My question is: how can I optimize this query?
Any advice is appreciated.
Thanks.

OrientDb Vertices with most Mutual Friends

I'm trying to find the pairs of vertices that have the greatest number of common vertices between them. It is very similar to the 'number of mutual friends' example used in many graph database demos. I can determine the number of mutual vertices between a pair of known vertices using this:
SELECT Expand($query) LET
$query1 = (SELECT Expand(outE().in) FROM #1:2),
$query2 = (SELECT Expand(outE().in) FROM #1:3),
$query = Intersect($query1,$query2);
The Count() of the above query's result is the number of common vertices.
However, I can't figure out how to aggregate that query across my entire data set. My best solution has been a brute force, where I iterate through each vertex and run the above query against all other vertices (technically, I do all the vertices 'after' that vertex).
My solution is inefficient and had to be coded in C# rather than done entirely in SQL. How can this be done using OrientDb's SQL?

You can use a SELECT with a MATCH:
SELECT FROM (
SELECT a, b, count(friend) as nFriends from (
MATCH
{class:Person, as:a} -FriendOf- {as:friend} -FriendOf-{as:b, where:($matched.a != $currentMatch)}
RETURN a, b, friend
)
) ORDER BY nFriends

A slight modification to @Luigi's answer:
SELECT a, b, Count(friend) AS nFriends FROM (
MATCH
{class:Person, as:a} -E- {as:friend} -E- {class:Person, as:b, where:($matched.a != $currentMatch)}
RETURN a, b, friend
) GROUP BY a, b ORDER BY nFriends DESC
I needed the GROUP BY; without it I just got one big count.
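For comparison, here is a minimal Python sketch of the same idea outside the database: every vertex is treated as the shared "friend" of each pair of its own neighbours, and the pairs are then counted. The graph data and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical undirected friendship graph: person -> set of friends.
friends = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"ann", "cat"},
}

def mutual_friend_counts(friends):
    """Count common neighbours for every unordered pair of people."""
    counts = Counter()
    # Each person acts as the shared "friend" for every pair of their own
    # neighbours, mirroring the a -E- friend -E- b pattern.
    for friend, neighbours in friends.items():
        for a, b in combinations(sorted(neighbours), 2):
            counts[(a, b)] += 1   # the GROUP BY a, b step
    return counts

counts = mutual_friend_counts(friends)
for pair, n in counts.most_common():   # ORDER BY nFriends DESC
    print(pair, n)
```

Unlike the MATCH query, which returns both (a, b) and (b, a), this sketch counts each unordered pair once.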

slow aggregate on multiple columns

This kdb query, which aggregates over multiple columns, takes approximately 31 seconds, compared to 3 seconds in J.
Is there a faster way to do the sum in kdb?
Ultimately this will be running against a partitioned database on the 32-bit version.
/test 1 - using symbols
n: 13000000;
cust: n?`8;
prod: n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
/query 1 - using simple by
q)\t select sum(v) by cust, prod from a
31058
/query 2 - grouping manually
\t {sum each x[`v][group[flip (x[`cust]; x[`prod])]]}(select v, cust, prod from a)
12887
/query 3 - simpler method of accessing
\t {sum each a.v[group x]} (flip (a.cust;a.prod))
11576
/test 2 - using strings, very slow
n: 13000000;
cust: string n?`8;
prod: string n?`8;
v: n?100
a:([]cust:cust; prod:prod ;v:v)
q)\t select sum(v) by cust, prod from a
116745
comparison J code
n=:13000000
cust=: _8[\ a. {~ (65+?(8*n)#26)
prod=: _8[\ a. {~ (65+?(8*n)#26)
v=: ?.n#100
agg=: 3 : 0
keys=:i.~ |: i.~ every (cust;prod)
c=.((~.keys) { cust)
p=.((~.keys) { prod)
s=.keys +//. v
c;p;s
)
NB. 3.57 seconds
6!:2 'r=.agg 0'
3.57139
({.#$) every r
13000000 13000000 13000000
Update:
With help from the kdbplus forums, we can get the speed difference down to about 2x:
q)\t r:(`cust`prod xkey a inds) + select sum v by cust,prod from a til[count a] except inds:(select cust,prod from a) ? d:distinct select cust,prod from a
6809
Update 2: added another dataset per @user3576050
This dataset has the same overall number of rows, but is distributed as 4 instances per group:
n: 2500000
g: 4
v: (g*n)?100
cust: (g*n)#(n?`8)
prod: (g*n)#(n?`8)
b:([]cust:cust; prod:prod ;v:v)
q)\ts select sum v by cust, prod from b
9737 838861968
The previous query runs poorly on the new dataset:
q)\ts r:(`cust`prod xkey b inds) + select sum v by cust,prod from a til[count b] except inds:(select cust,prod from b) ? d:distinct select cust,prod from b
17181 671090384

If you update this data less frequently than you query it, how about pre-computing a group index? It’s about the same cost to create as a single query, and it allows querying at ~30x the speed.
q)\ts select sum v by cust,prod from b
14014 838861360
q)\ts update g:`g#{(key group x)?x}flip(cust;prod)from`b
14934 1058198384
q)\ts select first cust,first prod,sum v by g from b
473 201327488
q)
The results match up to row order and schema details:
q)(select sum v by cust,prod from b)~`cust`prod xasc 2!delete g from select first cust,first prod,sum v by g from b
1b
q)
(BTW, I know essentially nothing about J, but my guess would be that it’s computing a similar multi-column group index. q’s g index is unfortunately (currently?) limited to plain vector data—if it were possible to somehow apply it to the combination of cust and prod, I expect we’d see results like mine from the simple query.)
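The pre-computed group index idea can be illustrated with a minimal Python sketch: map each composite (cust, prod) key to a small integer once, then aggregate on that integer alone. The column data and helper names are hypothetical, and the sketch says nothing about how q implements its g attribute internally.

```python
from collections import defaultdict

# Hypothetical small table in column form, standing in for table b.
cust = ["c1", "c2", "c1", "c2", "c1"]
prod = ["p1", "p1", "p1", "p2", "p1"]
v = [10, 20, 30, 40, 50]

def build_group_index(*columns):
    """Map each row's composite key to a small integer group id (done once)."""
    ids = {}
    return [ids.setdefault(key, len(ids)) for key in zip(*columns)], ids

def sum_by_group(group_ids, values):
    """Aggregate using only the precomputed integer ids (done per query)."""
    totals = defaultdict(int)
    for g, x in zip(group_ids, values):
        totals[g] += x
    return dict(totals)

g, ids = build_group_index(cust, prod)
totals = sum_by_group(g, v)
# Translate group ids back to their composite keys for display.
result = {key: totals[gid] for key, gid in ids.items()}
print(result)
```

The expensive part, hashing the composite keys, happens only in build_group_index; each later query touches plain integers.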

You are using a pathological dataset: a set of random symbols of length 8 will have few duplicates, making the grouping redundant.
q)n:13000000; (count distinct n?`8)%n
0.9984848
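A ratio this close to 1 is roughly what a quick estimate predicts. The Python sketch below computes the expected share of distinct values among n uniform random strings. It assumes that n?`8 draws 8-character symbols uniformly from a 16-letter alphabet; that is an assumption about q's generator, not something stated in this thread.

```python
from math import expm1, log1p

def expected_distinct_fraction(n, alphabet_size, length):
    """Expected share of distinct values among n uniform random strings."""
    possible = alphabet_size ** length
    # Each possible string is missed by all n draws with probability
    # (1 - 1/possible)**n; computed via log1p/expm1 for numerical stability.
    p_seen = -expm1(n * log1p(-1.0 / possible))
    return possible * p_seen / n

# Assumption: n?`8 draws 8-character symbols from a 16-letter alphabet.
fraction = expected_distinct_fraction(13_000_000, 16, 8)
print(round(fraction, 4))
```

With so many possible symbols relative to the number of rows, almost every row ends up in a group of its own.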
The p#/g# attributes (mentioned in the comments above) will have no impact on performance, for the same reason.
You will see better performance with more appropriate data:
q)n:1000000
q)
q)a:([]cust:n?`8; prod:n?`8; v:n?100)
q)b:([]cust:n?`3; prod:n?`3; v:n?100)
q)
q)\ts select sum v by cust, prod from a
3779 92275568
q)
q)\ts select sum v by cust, prod from b
762 58786352

Greatest N per group in Open SQL

Selecting the rows from a table by (partial) key with the maximum value in a particular column is a common task in SQL. This question has some excellent answers that cover a variety of approaches to it. Unfortunately I'm struggling to replicate this in my ABAP program.
None of the commonly used approaches seem to be supported:
Joining on a subquery is not supported in syntax: SELECT * FROM X as x INNER JOIN ( SELECT ... ) AS y
Using IN for a composite key is not supported in syntax as far as I know: SELECT * FROM X WHERE (key1, key2) IN ( SELECT key1 key2 FROM ... )
Left join to itself with smaller-than comparison is not supported, outer joins only support EQ comparisons: SELECT * FROM X AS x LEFT JOIN X as xmax ON x-key1 = xmax-key1 AND x-key2 < xmax-key2 WHERE xmax-key IS INITIAL
After trying each of these solutions in turn only to discover that ABAP doesn't seem to support them and being unable to find any equivalents I'm starting to think that I'll have no choice but to dump the data of the subquery to an itab.
What is the best practice for this common programming requirement in ABAP development?

First of all, a more specific requirement would get you a better answer. As it happens, I bumped into this question while working on a program that uses 3 distinct methods of pseudo-grouping (I was looking for alternatives), and ALL 3 can be used to answer your question, depending on what exactly you need to do. I'm sure there are more ways to do it.
For instance, you can pull maximum values within a group by simply selecting max( your_field ) and grouping by some fields, if that's all you need:
select bname, nation, max( date_from ) from adrp group by bname, nation. "selects highest "from" date for each bname
If you need to use that max value as a filter condition within a query, you can do it by pseudo-grouping with a sub-query that uses max, like this (notice how I move the BNAME check into the sub-query, which means I don't have to check both fields using the in (subquery) addition):
select ... from adrp as b_adrp "Pulls the latest person info for a user (some conditions are missing, but this is a part of an actual query)
where b_adrp~date_from in (
select max( date_from ) "Highest date_from where both dates are valid
from adrp where persnumber = b_adrp~persnumber and nation = b_adrp~nation and date_from <= @sy-datum )
The query above allows you to select all user info from the base query (whereas the first one only lets you take aggregated and grouped data).
Finally, if you need to check based on a composite key and compare it to multiple aggregate function results, the implementation will depend heavily on the specifics of your requirement (and since your question has none, I'll provide a generic one). The easiest option is to use exists / not exists instead of in (subquery), in exactly the same way, and form the subquery to check for the existence of a specific key or condition rather than pull a list (you can nest subqueries if you have to):
select * from bkpf where exists ( select 1 from bkpf as b where belnr = bkpf~belnr and gjahr = bkpf~gjahr group by belnr, gjahr having max( budat ) = bkpf~budat ) "Took an available example, that I had in testing program.
All 3 queries will get you the max value of a column within a group and, in fact, all 3 can use joins to achieve identical results.
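The correlated-subquery and exists-style patterns can be tried out on any SQL engine. Here is a minimal Python sketch using the standard-library sqlite3 module; the docs table, its columns and its rows are hypothetical, and SQLite syntax stands in for Open SQL, so treat it as an illustration of the pattern rather than ABAP code.

```python
import sqlite3

# Hypothetical table and data; SQLite stands in for the database here,
# so this shows the SQL pattern, not Open SQL syntax.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (grp TEXT, doc TEXT, posted TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [
        ("A", "d1", "2020-01-01"),
        ("A", "d2", "2020-03-01"),
        ("B", "d3", "2020-02-01"),
        ("B", "d4", "2020-01-15"),
    ],
)

# Correlated subquery: keep a row only if its date is the maximum of its group.
latest = conn.execute(
    """
    SELECT grp, doc, posted
    FROM docs AS d
    WHERE posted IN (SELECT MAX(posted) FROM docs WHERE grp = d.grp)
    ORDER BY grp
    """
).fetchall()

# NOT EXISTS variant: keep a row if no row in the same group is newer.
latest_ne = conn.execute(
    """
    SELECT grp, doc, posted
    FROM docs AS d
    WHERE NOT EXISTS (
        SELECT 1 FROM docs AS n WHERE n.grp = d.grp AND n.posted > d.posted
    )
    ORDER BY grp
    """
).fetchall()

print(latest)
print(latest_ne)
```

Both variants return whole rows rather than only the grouped columns, which is the difference from a plain max() with group by.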

Please find my answers below your questions.
Joining on a subquery is not supported in syntax: SELECT * FROM X as x INNER JOIN ( SELECT ... ) AS y
Putting the subquery in your where condition should do the work SELECT * FROM X AS x INNER JOIN Y AS y ON x~a = y~b WHERE ( SELECT * FROM y WHERE ... )
Using IN for a composite key is not supported in syntax as far as I know: SELECT * FROM X WHERE (key1, key2) IN ( SELECT key1 key2 FROM ... )
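You have to split your WHERE clause: SELECT * FROM X WHERE key1 IN ( SELECT key1 FROM y ) AND key2 IN ( SELECT key2 FROM y ). Note that this checks key1 and key2 independently, so it can also return rows whose (key1, key2) combination does not occur in y.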
Left join to itself with smaller-than comparison is not supported, outer joins only support EQ comparisons.
Yes, that's right at the moment.

Left join to itself with smaller-than comparison is not supported, outer joins only support EQ comparisons:
SELECT * FROM X AS x LEFT JOIN X as xmax ON x-key1 = xmax-key1 AND x-key2 < xmax-key2 WHERE xmax-key IS INITIAL
This is not true. This SELECT is perfectly valid:
SELECT b1~budat
INTO TABLE lt_bkpf
FROM bkpf AS b1
LEFT JOIN bkpf AS b2
ON b2~belnr < b1~belnr
WHERE b1~bukrs <> ''.
It has been valid since at least 7.40 SP08 (July 2013), so it was valid at the time you asked this question as well.
