OrientDB Traverse Sum and Group By Top-Most Record

We have Orders that include "caused_order" edges from Order to Order because friends can refer other friends to make purchases. We know from the links we generate for the friends that Order ID 42 caused Order ID 47, so we create a "caused_order" edge between the two Order vertices.
We're looking to identify the people that are generating the most referral business. Right now we just loop through in C# and figure it out because our datasets are relatively small. But I'd like to figure out if there's a way to use the Traverse SQL to accomplish this instead.
The problem I'm running into is getting an accurate count/sum for each original Order ID.
Consider the following scenario:
Order 42 caused four other Orders, including Order 47. Order 47 caused 2 additional Orders. And Order 51, unrelated to 42 or 47, caused 3 Orders.
I can run the following SQL to get the best referrers for this specific {ProductId}:
select in_caused_order[0].id as OrderID, count(*) as ReferCount, sum(amount) as ReferSum
from ( traverse out('caused_order') from Order )
where out_includes.id = '{ProductId}' and $depth >= 1
group by in_caused_order[0].id
EDIT: the schema is a bit more complex than this, I was just including the out_includes WHERE clause to show that there's a bit of filtering of the Orders. But it's a bit like:
Product(V) <-- includes(E) <-- Order(V) --> caused_order(E) --> Order(V)
(the Order vertex has "amount" as a property, which stores the money spent and is being SUM'd in the SELECT, along with a few fields like date which aren't important)
But that will result in something like:
OrderID | ReferCount | ReferSum
42 | 4 | 525
47 | 2 | 130
51 | 3 | 250
Except that's not quite right, is it? Because Order 42 also technically caused 47's two orders. So we'd want to see something like:
OrderID | ReferCount | ReferSum | ExtendedCount | ExtendedSum
42 | 4 | 525 | 2 | 130
47 | 2 | 130 | 0 | 0
51 | 3 | 250 | 0 | 0
I recognize that the two "Extended" count/sum columns might be tricky. We might have to run the query twice, once with $depth = 1, and again with $depth > 1, and then assemble the results of those two queries in C#, which is fine.
But I can't even figure out how to get the overall total calculated correctly. The first step would even be to see something like:
OrderID | ReferCount | ReferSum
42 | 6 | 635 <-- includes its 4 orders + 47's 2 orders
47 | 2 | 130
51 | 3 | 250
And since this can be n-levels deep, it's not like I can somehow just do in_caused_order.in_caused_order.in_caused_order in the SQL, I don't know how many deep that will go. Order 83 could be caused by Order 47, and Order 105 could be caused by Order 83, and so on.
Any help would be much appreciated. Or maybe the answer is, Traverse can't handle this, and we'll have to figure something else out entirely.

I tried your use case. Here is my test data:
create class caused_order extends e
create class Order extends v
create property Order.id integer
create property Order.amount integer
begin
create vertex Order set id=1 ,amount=1
create vertex Order set id=2 ,amount=5
create vertex Order set id=3 ,amount=11
create vertex Order set id=4 ,amount=23
create vertex Order set id=5 ,amount=31
create vertex Order set id=6 ,amount=49
create vertex Order set id=7 ,amount=4
create vertex Order set id=8 ,amount=74
create vertex Order set id=9 ,amount=87
create edge caused_order from (select from Order where id=1) to (select from Order where id=2)
create edge caused_order from (select from Order where id=1) to (select from Order where id=3)
create edge caused_order from (select from Order where id=2) to (select from Order where id=4)
create edge caused_order from (select from Order where id=2) to (select from Order where id=5)
create edge caused_order from (select from Order where id=6) to (select from Order where id=7)
create edge caused_order from (select from Order where id=6) to (select from Order where id=8)
commit retry 20
Then I wrote these two queries to show each order with its ReferSum and ReferCount.
The first one includes the head order in the count:
select id as OrderID, $a[0].Amount as ReferSum, $a[0].Count as ReferCount from Order
let $a=(select sum(amount) as Amount, count(*) as Count from (traverse out('caused_order') from $parent.$current) group by Amount)
The second one excludes the head:
select id as OrderID, $a[0].Amount as ReferSum, $a[0].Count as ReferCount from Order
let $a=(select sum(amount) as Amount, count(*) as Count from (select from (traverse out('caused_order') from $parent.$current) where $depth>=1) group by Amount)
EDIT
I've added this to my data:
create class includes extends E
create class Product extends V
create property Product.id Integer
create vertex Product set id = 101
create vertex Product set id = 102
create vertex Product set id = 103
create vertex Product set id = 104
create edge includes from (select from Order where id=1) to (select from Product where id=101)
create edge includes from (select from Order where id=2) to (select from Product where id=102)
create edge includes from (select from Order where id=3) to (select from Product where id=103)
create edge includes from (select from Order where id=4) to (select from Product where id=104)
create edge includes from (select from Order where id=5) to (select from Product where id=101)
create edge includes from (select from Order where id=6) to (select from Product where id=102)
create edge includes from (select from Order where id=7) to (select from Product where id=103)
create edge includes from (select from Order where id=8) to (select from Product where id=104)
create edge includes from (select from Order where id=9) to (select from Product where id=101)
create edge includes from (select from Order where id=1) to (select from Product where id=102)
create edge includes from (select from Order where id=1) to (select from Product where id=103)
create edge includes from (select from Order where id=2) to (select from Product where id=104)
And these are the modified queries (added "while out('includes').id contains {prodID_number}" in the traverse and "where out('includes').id contains {prodID_number}" on the outer select):
select id as OrderID, $a[0].Amount as ReferSum, $a[0].Count as ReferCount from Order
let $a=(select sum(amount) as Amount, count(*) as Count from (traverse out('caused_order') from $parent.$current while out('includes').id contains 102) group by Amount)
where out('includes').id contains 102
select id as OrderID, $a[0].Amount as ReferSum, $a[0].Count as ReferCount from Order
let $a=(select sum(amount) as Amount, count(*) as Count from (traverse out('caused_order') from $parent.$current while out('includes').id contains 102) where $depth >= 1 group by Amount)
where out('includes').id contains 102
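For comparison, the client-side loop the question mentions can be sketched in a few lines. This is only a rough illustration in Python (not the asker's C# code), using the test data from this answer. The function name referral_totals and the tuple layout are made up for the example, and it assumes every order is caused by at most one other order.

```python
from collections import defaultdict

# Test data from the answer above: order id -> amount, plus caused_order edges.
amounts = {1: 1, 2: 5, 3: 11, 4: 23, 5: 31, 6: 49, 7: 4, 8: 74, 9: 87}
caused_order = [(1, 2), (1, 3), (2, 4), (2, 5), (6, 7), (6, 8)]


def referral_totals(amounts, edges):
    """Return {order_id: (direct_count, direct_sum, extended_count, extended_sum)}.

    Hypothetical helper: "direct" means depth 1, "extended" means depth > 1.
    Orders that caused nothing are left out of the result.
    """
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)

    totals = {}
    for order_id in amounts:
        direct_count = direct_sum = extended_count = extended_sum = 0
        # Iterative depth-first walk over everything this order caused.
        stack = [(child, 1) for child in children[order_id]]
        seen = set()
        while stack:
            node, depth = stack.pop()
            if node in seen:  # guard against cycles or duplicate edges
                continue
            seen.add(node)
            if depth == 1:
                direct_count += 1
                direct_sum += amounts[node]
            else:
                extended_count += 1
                extended_sum += amounts[node]
            stack.extend((child, depth + 1) for child in children[node])
        if seen:
            totals[order_id] = (direct_count, direct_sum, extended_count, extended_sum)
    return totals


totals = referral_totals(amounts, caused_order)
```

With this data, order 1 gets 2 direct referrals (sum 16) and 2 extended ones (sum 54), which matches the ReferCount/ExtendedCount split the question asks for.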

Related

Postgres - ranking with additional conditions, then pulling the top 3 results

I want to combine two conditions and get the top 3 ranked results.
My dataset has two components: major groups, and sub-groups that map to each major group.
For example, here is part of the dataset for Group A:
Group | Sub-group | Revenue | Source
---------------------------------------
A | A-1 | 50 | Y
A | A-2 | 40 | Y
A | A-3 | 60 | Y
A | A-4 | 80 | Y
A | A-5 | 100 | Y
A | A-6 | 140 | X
A | A-7 | 20 | X
A | A-8 | 300 | X
Revenue comes from different sources, such as source X and source Y. I am only interested in source Y.
So I want to pull a list of eligible results and rank them to get the top 3 by revenue. A separate step pulls the data of Group A for sub-groups that have revenue from source Y. My steps:
WITH
sum_revenue AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_rev
FROM table t
GROUP BY 1, 2
),
subgroup_rev AS (
SELECT DISTINCT
group,
total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY total_rev DESC
)
AS row
FROM sum_revenue
GROUP BY 1, 2
),
source_rev AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_sourceY_rev
FROM table t
WHERE
subgroup = 'Y'
GROUP BY 1, 2
),
eligible AS (
SELECT DISTINCT
group,
subgroup,
total_revenue,
FROM source_rev
WHERE total_sourceY_rev < 1 --identify top 3 revenue with source Y rev < $1
),
agg_list AS (
SELECT DISTINCT
group,
STRING_AGG(DISTINCT eligible.subgroup) AS eligible_list
INNER JOIN eligible USING (group)
WHERE subgroup_rev.row <= 3
GROUP BY 1
)
SELECT DISTINCT
group,
eligible_list
FROM agg_list
WHERE group IN ('A')
I'm expecting to get an aggregated list of the top 3 subgroups by revenue that fulfil the condition of source Y revenue < $1.
But I am getting the full list, or a partially aggregated list that is not the top 3; my result gave me 7. I tried running without the aggregate and it returned 7 individual rows as well.
What could have gone wrong? I thought I had filtered with row <= 3, and I also tried INNER JOIN in case I had introduced redundant subgroups by using LEFT JOIN.
It seems to be an issue with joining the eligible alias, because when I break it down and pull directly from subgroup_rev without joining eligible, I get the top 3 revenue subgroups.
However, I need the subgroups that have source Y revenue and are among the highest 3, so ideally I shouldn't hardcode row as 3 (because in some cases some of the top 3 might not have source Y revenue).
[EDIT] based on feedback
Result:
Group | eligible_list
---------------------------
A | A-1,A-2,A-3,A-4,A-5
Expected result:
Group | eligible_list
---------------------------
A | A-3,A-4,A-5
It may be your subgroup_rev subquery.
Since you are using OVER with PARTITION BY as well as a GROUP BY clause, you may just end up creating groups for each combination of group and subgroup. I believe your subgroup_rev subquery can just be:
subgroup_rev AS (
SELECT DISTINCT
group,
total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY total_rev DESC
)
AS row
FROM sum_revenue)
Like @GMB said, it's hard to follow without seeing how sources play into the revenue column, but too much grouping may ruin your rankings.
You may end up with all groups having only 1 member, so all of their rankings would be 1 and they would all be returned by your final WHERE clause.
I restructured my funnel of data and finally got it to work:
WITH
source_rev AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_sourceY_rev
FROM table t
WHERE
subgroup = 'Y'
GROUP BY 1, 2
),
eligible AS (
SELECT DISTINCT
group,
subgroup,
total_revenue,
FROM source_rev
WHERE total_sourceY_rev < 1 --identify top 3 revenue with source Y rev < $1
),
sum_revenue AS (
SELECT DISTINCT
group,
subgroup,
SUM(revenue) AS total_rev
FROM table t
INNER JOIN eligible
USING (group)
GROUP BY 1, 2
),
subgroup_rev AS (
SELECT DISTINCT
group,
subgroup,
sum_revenue.total_rev,
ROW_NUMBER() OVER (
PARTITION BY group ORDER BY sum_revenue.total_rev DESC
)
AS row
FROM eligible
INNER JOIN sum_revenue
USING (group)
GROUP BY 1, 2,3
),
agg_list AS (
SELECT DISTINCT
group,
STRING_AGG(DISTINCT eligible.subgroup) AS eligible_list
WHERE subgroup_rev.row <= 3
GROUP BY 1
)
SELECT DISTINCT
group,
eligible_list
FROM agg_list
WHERE group IN ('A')
The difference is that:
1. I first narrow the data to revenue from source Y,
2. find the eligible data with a conditional select,
3. then use an alias that pulls the total revenue of all sources and INNER JOIN them together,
4. do the ranking, so that it is only performed on the eligible data (i.e. source Y revenue < $1), and finally
5. aggregate them.
Result with eligible list with source Y revenue < $1:
Group | eligible_list
---------------------------
A | A-3,A-4,A-5
I note that some of the feedback mentioned having too many subgroups. I'm happy to get any other feedback if there's a more efficient way to do this. Thanks!
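The "filter first, then rank" order described above can be sketched outside SQL as well. This is a rough Python illustration over the sample rows for Group A, not the actual query. It assumes "eligible" simply means "has revenue from source Y" (the `< $1` threshold in the SQL is left out because the sample data does not show it), and the name top_eligible is invented for the example.

```python
from collections import defaultdict

# Sample rows from the question: (group, subgroup, revenue, source).
rows = [
    ("A", "A-1", 50, "Y"), ("A", "A-2", 40, "Y"), ("A", "A-3", 60, "Y"),
    ("A", "A-4", 80, "Y"), ("A", "A-5", 100, "Y"), ("A", "A-6", 140, "X"),
    ("A", "A-7", 20, "X"), ("A", "A-8", 300, "X"),
]


def top_eligible(rows, source="Y", limit=3):
    """Hypothetical helper: keep only rows from `source`, then rank per group."""
    # Step 1: filter to the source of interest and total revenue per subgroup.
    totals = defaultdict(int)
    for group, subgroup, revenue, src in rows:
        if src == source:
            totals[(group, subgroup)] += revenue

    # Step 2: rank only the eligible subgroups inside each group.
    by_group = defaultdict(list)
    for (group, subgroup), total in totals.items():
        by_group[group].append((total, subgroup))

    # Step 3: keep the top `limit` and aggregate the names into one string.
    result = {}
    for group, items in by_group.items():
        top = sorted(items, reverse=True)[:limit]
        result[group] = ",".join(sorted(sub for _, sub in top))
    return result


eligible_list = top_eligible(rows)
```

Because the ranking only ever sees eligible subgroups, the A-6 and A-8 rows from source X cannot take up one of the three slots.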

Count With Conditional on PostgreSQL

I have a table with people and another with visits. I want to count all visits but if the person signed up with 'emp' or 'oth' on ref_signup then remove the first visit. Example:
These are my tables:
PEOPLE:
id | ref_signup
---------------------
20 | emp
30 | oth
23 | fri
VISITS
id | date
-------------------------
20 | 10-01-2019
20 | 10-05-2019
23 | 10-09-2019
23 | 10-10-2019
30 | 09-10-2019
30 | 10-07-2019
In this example the visit count should be 4, because the people with ids 20 and 30 have ref_signup set to emp or oth, so their first visit should be excluded and only the second and later visits counted.
This is what I have as a query:
SELECT COUNT(*) as visit_count FROM visits
LEFT JOIN people ON people.id = visits.people_id
WHERE visits.group_id = 1
Would using a CASE in the count help here? I just want to remove one visit, not all of the visits from the person.
Subtract from COUNT(*) the distinct number of person.ids with person.ref_signup IN ('emp', 'oth'):
SELECT
COUNT(*) -
COUNT(DISTINCT CASE WHEN p.ref_signup IN ('emp', 'oth') THEN p.id END) as visit_count
FROM visits v LEFT JOIN people p
ON p.id = v.id
See the demo.
Result:
| visit_count |
| ----------- |
| 4 |
Note: this code and demo fiddle use the column names of your sample data.
The premise: select the count of visits for each person, along with a synthetic column that contains 1 if the referral was from emp or oth and 0 otherwise. Then select the sum of the counts minus the sum of that column.
SELECT SUM(count) - SUM(ignore_first) AS visit_count FROM (SELECT COUNT(*) AS count, CASE WHEN people.ref_signup IN ('emp', 'oth') THEN 1 ELSE 0 END AS ignore_first FROM visits
LEFT JOIN people ON people.id = visits.people_id
WHERE visits.group_id = 1 GROUP BY people.id, people.ref_signup) a
Where is "people_id" in your example?
SELECT COUNT(*) as visit_count
FROM visits v
JOIN people p ON p.id = v.people_id
WHERE p.ref_signup IN ('emp','oth');
Then remove the first visit.
You cannot select the count and delete the first visit at the same time.
DELETE FROM visits
WHERE id IN (
SELECT id
FROM visits v
JOIN people p ON p.id = v.people_id
WHERE p.ref_signup IN ('emp','oth')
ORDER BY v.id
LIMIT 1
);
edit: typos
First, I create the tables
create table people (id int primary key, ref_signup varchar(3));
insert into people (id, ref_signup) values (20, 'emp'), (30, 'oth'), (23, 'fri');
create table visits (people_id int not null, visit_date date not null);
insert into visits (people_id, visit_date) values (20, '10-01-2019'), (20, '10-05-2019'), (23, '10-09-2019'), (23, '10-10-2019'), (30, '09-10-2019'), (30, '10-07-2019');
You can use the row_number() window function to mark which visit is "visit number one":
select
*,
row_number() over (partition by people_id order by visit_date) as visit_num
from people
join visits
on people.id = visits.people_id
Once you have that, you can do another query on those results, and use the filter clause to count up the correct rows that match the condition where visit_num > 1 or ref_signup = 'fri':
-- wrap the first query in a WITH clause
with joined_visits as (
select
*,
row_number() over (partition by people_id order by visit_date) as visit_num
from people
join visits
on people.id = visits.people_id
)
select count(1) filter (where visit_num > 1 or ref_signup = 'fri')
from joined_visits;
-- First get the corrected counts for all users
WITH grouped_visits AS (
SELECT
COUNT(visits.*) -
CASE WHEN people.ref_signup IN ('emp', 'oth') THEN 1 ELSE 0 END
AS visit_count
FROM visits
INNER JOIN people ON (people.id = visits.id)
GROUP BY people.id, people.ref_signup
)
-- Then sum them
SELECT SUM(visit_count)
FROM grouped_visits;
This should give you the result you're looking for.
On a side note, I can't help but think clever use of a window function could do this in a single shot without the CTE.
EDIT: No, it can't since window functions run after needed WHERE and GROUP BY and HAVING clauses.
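The rule that all of these answers implement (count every visit, minus one for each emp/oth person who has at least one visit) can be checked against the sample data with a small Python sketch. This is only an illustration of the arithmetic, not a replacement for the SQL; the function name count_visits is made up.

```python
from collections import defaultdict

# Sample data from the question.
people = {20: "emp", 30: "oth", 23: "fri"}
visits = [
    (20, "10-01-2019"), (20, "10-05-2019"),
    (23, "10-09-2019"), (23, "10-10-2019"),
    (30, "09-10-2019"), (30, "10-07-2019"),
]


def count_visits(people, visits, skip_first_for=("emp", "oth")):
    """Hypothetical helper: total visits, dropping one visit per matching person."""
    per_person = defaultdict(int)
    for person_id, _date in visits:
        per_person[person_id] += 1

    total = 0
    for person_id, n in per_person.items():
        # Only people with at least one visit appear here, so n - 1 >= 0.
        if people.get(person_id) in skip_first_for:
            n -= 1
        total += n
    return total


visit_count = count_visits(people, visits)
```

Since only the count matters, it makes no difference which visit is treated as the "first" one, which is why the subtraction approaches above do not need to order by date.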

T-SQL select all IDs that have value A and B

I'm trying to find all IDs in TableA that are mentioned by a set of records in TableB, where that set is defined in TableC. So far I've come to the point where a set of INNER JOINs gives me the following result:
TableA.ID | TableB.Code
-----------------------
1 | A
1 | B
2 | A
3 | B
I want to select only the ID where in this case there is an entry for both A and B, but where the values A and B are based on another Query.
I figured this should be possible with a GROUP BY TableA.ID and HAVING = ALL(Subquery on table C).
But that is returning no values.
Since you did not post your original query, I will assume it is inside a CTE. Assuming this, the query you want is something along these lines:
SELECT ID
FROM cte
WHERE Code IN ('A', 'B')
GROUP BY ID
HAVING COUNT(DISTINCT Code) = 2;
It's an extremely poor question, but you probably need to compare distinct counts against table C:
SELECT a.ID
FROM TableA a
GROUP BY a.ID
HAVING COUNT(DISTINCT a.Code) = (SELECT COUNT(*) FROM TableC)
We're guessing though.
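Both answers are doing what is sometimes called relational division: keep an ID only if the set of codes attached to it covers the whole required set. A small Python sketch of that check, using the joined rows from the question and assuming the required set from TableC is {A, B} (the question does not show TableC, so that set and the function name are assumptions):

```python
# Joined rows from the question: (TableA.ID, TableB.Code).
pairs = [(1, "A"), (1, "B"), (2, "A"), (3, "B")]

# Assumed content of TableC; the question does not show it.
required_codes = {"A", "B"}


def ids_with_all_codes(pairs, required):
    """Hypothetical helper: IDs whose codes include every required code."""
    required = set(required)
    found = {}
    for id_, code in pairs:
        if code in required:  # same role as WHERE Code IN (...)
            found.setdefault(id_, set()).add(code)
    # Same role as HAVING COUNT(DISTINCT Code) = <number of required codes>.
    return sorted(id_ for id_, codes in found.items() if codes == required)


matching_ids = ids_with_all_codes(pairs, required_codes)
```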

Cascading sum hierarchy using recursive cte

I'm trying to write a recursive CTE in Postgres but I can't wrap my head around it. Performance shouldn't be an issue, since there are only 50 items in TABLE 1.
TABLE 1 (expense):
id | parent_id | name
------------------------------
1 | null | A
2 | null | B
3 | 1 | C
4 | 1 | D
TABLE 2 (expense_amount):
ref_id | amount
-------------------------------
3 | 500
4 | 200
Expected Result:
id, name, amount
-------------------------------
1 | A | 700
2 | B | 0
3 | C | 500
4 | D | 200
Query
WITH RECURSIVE cte AS (
SELECT
expenses.id,
name,
parent_id,
expense_amount.total
FROM expenses
WHERE expenses.parent_id IS NULL
LEFT JOIN expense_amount ON expense_amount.expense_id = expenses.id
UNION ALL
SELECT
expenses.id,
expenses.name,
expenses.parent_id,
expense_amount.total
FROM cte
JOIN expenses ON expenses.parent_id = cte.id
LEFT JOIN expense_amount ON expense_amount.expense_id = expenses.id
)
SELECT
id,
SUM(amount)
FROM cte
GROUP BY 1
ORDER BY 1
Results
id | sum
--------------------
1 | null
2 | null
3 | 500
4 | 200
You can do a conditional sum() for only the root row:
with recursive tree as (
select id, parent_id, name, id as root_id
from expense
where parent_id is null
union all
select c.id, c.parent_id, c.name, p.root_id
from expense c
join tree p on c.parent_id = p.id
)
select e.id,
e.name,
e.root_id,
case
when e.id = e.root_id then sum(ea.amount) over (partition by root_id)
else amount
end as amount
from tree e
left join expense_amount ea on e.id = ea.ref_id
order by id;
I prefer doing the recursive part first, then join the related tables to the result of the recursive query, but you could do the join to the expense_amount also inside the CTE.
Online example: http://rextester.com/TGQUX53703
However, the above only aggregates on the top-level parent, not for any intermediate non-leaf rows.
If you want to see intermediate aggregates as well, this gets a bit more complicated (and is probably not very scalable for large results, but you said your tables aren't that big)
with recursive tree as (
select id, parent_id, name, 1 as level, concat('/', id) as path, null::numeric as amount
from expense
where parent_id is null
union all
select c.id, c.parent_id, c.name, p.level + 1, concat(p.path, '/', c.id), ea.amount
from expense c
join tree p on c.parent_id = p.id
left join expense_amount ea on ea.ref_id = c.id
)
select e.id,
lpad(' ', (e.level - 1) * 2, ' ')||e.name as name,
e.amount as element_amount,
(select sum(amount)
from tree t
where t.path like e.path||'%') as sub_tree_amount,
e.path
from tree e
order by path;
Online example: http://rextester.com/MCE96740
The query builds up a path of all IDs belonging to a (sub)tree and then uses a scalar sub-select to get all child rows belonging to a node. That sub-select is what will make this quite slow as soon as the result of the recursive query can't be kept in memory.
I used the level column to create a "visual" display of the tree structure, which helps me debug the statement and understand the result better. If you need the real name of an element in your program you would obviously use only e.name instead of prepending it with blanks.
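To sanity-check the expected numbers from the question, the same "amount of a node plus everything below it" roll-up can be sketched in Python. This is only an illustration, not a translation of the query above; it assumes the hierarchy has no cycles, and subtree_amounts is a made-up name.

```python
from collections import defaultdict

# Tables from the question.
expense = [(1, None, "A"), (2, None, "B"), (3, 1, "C"), (4, 1, "D")]  # id, parent_id, name
expense_amount = {3: 500, 4: 200}  # ref_id -> amount


def subtree_amounts(expense, expense_amount):
    """Hypothetical helper: {id: own amount + amounts of all descendants}."""
    children = defaultdict(list)
    for id_, parent_id, _name in expense:
        if parent_id is not None:
            children[parent_id].append(id_)

    def total(id_):
        # Missing amounts count as 0, like COALESCE(amount, 0) would in SQL.
        return expense_amount.get(id_, 0) + sum(total(c) for c in children[id_])

    return {id_: total(id_) for id_, _parent, _name in expense}


amounts_by_id = subtree_amounts(expense, expense_amount)
```

Every node, including intermediate ones, gets its own sub-tree total this way, which corresponds to the sub_tree_amount column of the second query.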
I could not get your query to work for some reason. Here's my attempt that works for the particular table you provided (parent-child, no grandchild) without recursion. SQL Fiddle
--- step 1: get parent-child data together
with parent_child as(
select t.*, amount
from
(select e.id, f.name as name,
coalesce(f.name, e.name) as pname
from expense e
left join expense f
on e.parent_id = f.id) t
left join expense_amount ea
on ea.ref_id = t.id
)
--- final step is to group by id, name
select id, pname, sum(amount)
from
(-- step 2: group by parent name and find corresponding amount
-- returns A, B
select e.id, t.pname, t.amount
from expense e
join (select pname, sum(amount) as amount
from parent_child
group by 1) t
on t.pname = e.name
-- step 3: to get C, D we union and get corresponding columns
-- results in all rows and corresponding value
union
select id, name, amount
from expense e
left join expense_amount ea
on e.id = ea.ref_id
) t
group by 1, 2
order by 1;

Finding the first row in a group using Hive

For a student database in the following format:
Roll Number | School Name | Name | Age | Gender | Class | Subject | Marks
How can I find out who got the highest total marks in each class? The query below returns the entire group, but I am interested in finding only the first row in each group.
SELECT school,
class,
roll,
Sum(marks) AS total
FROM students
GROUP BY school,
class,
roll
ORDER BY school,
class,
total DESC;
Another way using row_number()
select * from (
select t.*,
row_number() over (partition by school, class order by total desc) rn
from (select school, class, roll, sum(marks) as total
from students group by school, class, roll) t
) t1 where rn = 1
If you want to return all ties for top marks, then use rank() instead of row_number()
You will have to do one more group by and a join to get the desired results. This should do:
select q1.*, q2.roll from
(
select school, class, max(total) as max from
(
select school,class,roll,sum(marks) as total from students group by school,class,roll order by school, class, total desc
)q3 group by school, class
)q1
LEFT OUTER JOIN
(select school,class,roll,sum(marks) as total from students group by school,class,roll order by school, class, total desc)q2
ON (q1.max = q2.total) AND (q1.school = q2.school) AND (q1.class = q2.class)
We will have to build on the query that you have provided:
The given query will give you the total marks per class per roll. To find the highest total achieved per class, you will have to remove the roll number from the select and then group on this query.
Now we know the school, the class and the highest total per class per school. You just have to find the roll number corresponding to this total. For that, a join will be needed.
The final query will look like this:
select a.school, a.class, b.roll, a.highest_marks from
(select q.school as school, q.class as class, max(q.total) as highest_marks from(select school, class, roll, sum(marks) as total from students group by school, class, roll)q group by school, class)a
join
(select school, class, roll, sum(marks) as total from students group by school, class, roll)b
on (a.school = b.school) and (a.class = b.class) and (a.highest_marks = b.total)
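The two-step logic above (total per roll, then the best total per school and class) can be illustrated with a short Python sketch. The rows below are invented sample data, since the question gives none, and top_per_class is a made-up name. Ties are all kept, which matches what the join (or rank()) would return.

```python
from collections import defaultdict

# Hypothetical sample rows: (school, class, roll, subject, marks).
students = [
    ("S1", "10", 1, "math", 90), ("S1", "10", 1, "sci", 80),
    ("S1", "10", 2, "math", 95), ("S1", "10", 2, "sci", 85),
    ("S1", "11", 3, "math", 70), ("S1", "11", 4, "math", 70),
]


def top_per_class(rows):
    """Hypothetical helper: {(school, class): (highest_total, [rolls with that total])}."""
    # Step 1: total marks per (school, class, roll), like the inner GROUP BY.
    totals = defaultdict(int)
    for school, cls, roll, _subject, marks in rows:
        totals[(school, cls, roll)] += marks

    # Step 2: keep the highest total per (school, class), remembering ties.
    best = {}
    for (school, cls, roll), total in totals.items():
        key = (school, cls)
        if key not in best or total > best[key][0]:
            best[key] = (total, [roll])
        elif total == best[key][0]:
            best[key][1].append(roll)
    return best


toppers = top_per_class(students)
```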