Finding the first row in a group using Hive


For a student database in the following format:
Roll Number | School Name | Name | Age | Gender | Class | Subject | Marks
How can I find out who got the highest total marks in each class? The query below returns the entire group, but I am only interested in the first row of each group.
SELECT school,
class,
roll,
Sum(marks) AS total
FROM students
GROUP BY school,
class,
roll
ORDER BY school,
class,
total DESC;

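Another way is to use row_number(). Sum the marks per student first, then number the students within each school and class:
select * from (
select t.*,
row_number() over (partition by school, class order by total desc) rn
from (
select school, class, roll, sum(marks) as total
from students
group by school, class, roll
) t
) t1 where rn = 1;
If you want to return all ties for top marks, use rank() instead of row_number().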
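A minimal sketch of the row_number() versus rank() difference, run through Python's sqlite3 module as a stand-in for Hive (an assumption; the window syntax used here is the same in both). The sample rows and the helper name top_per_class are hypothetical. Rolls 1 and 2 tie on total marks in class 10.

```python
import sqlite3

# Hypothetical sample data: (school, class, roll, subject, marks).
rows = [
    ("S1", "10", 1, "math", 90), ("S1", "10", 1, "art", 80),  # roll 1 totals 170
    ("S1", "10", 2, "math", 85), ("S1", "10", 2, "art", 85),  # roll 2 totals 170 (tie)
    ("S1", "11", 3, "math", 70), ("S1", "11", 3, "art", 60),  # roll 3 totals 130
    ("S1", "11", 4, "math", 50), ("S1", "11", 4, "art", 40),  # roll 4 totals 90
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students "
    "(school TEXT, class TEXT, roll INTEGER, subject TEXT, marks INTEGER)"
)
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)


def top_per_class(window_fn):
    """Return the top student(s) per school and class.

    window_fn is either "row_number" or "rank".
    """
    sql = f"""
        SELECT school, class, roll, total FROM (
            SELECT t.*, {window_fn}() OVER (
                       PARTITION BY school, class ORDER BY total DESC) AS rn
            FROM (SELECT school, class, roll, SUM(marks) AS total
                  FROM students GROUP BY school, class, roll) t
        ) ranked
        WHERE rn = 1
        ORDER BY school, class, roll
    """
    return conn.execute(sql).fetchall()


first_only = top_per_class("row_number")  # one row per class, ties broken arbitrarily
with_ties = top_per_class("rank")         # every student tied for first place
print(first_only)
print(with_ties)
```

With row_number() the tie in class 10 is broken arbitrarily, so which of the two students is returned is not guaranteed.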

You will have to do one more group by and a join to get the desired results. This should do it:
select q1.*, q2.roll from
(
select school, class, max(total) as max from
(
select school,class,roll,sum(marks) as total from students group by school,class,roll order by school, class, total desc
)q3 group by school, class
)q1
LEFT OUTER JOIN
(select school,class,roll,sum(marks) as total from students group by school,class,roll order by school, class, total desc)q2
ON (q1.max = q2.total) AND (q1.school = q2.school) AND (q1.class = q2.class)

We will have to build on the query that you have provided.
The given query gives you the total marks per class per roll number. To find the highest total achieved per class, remove the roll number from the select and group on that query's result.
Now we know the school, the class, and the highest total per class per school. You just have to find the roll number corresponding to this total, and for that a join is needed.
The final query will look like this:
select a.school, a.class, b.roll, a.highest_marks from
(select q.school as school, q.class as class, max(q.total) as highest_marks from(select school, class, roll, sum(marks) as total from students group by school, class, roll)q group by school, class)a
join
(select school, class, roll, sum(marks) as total from students group by school, class, roll)b
on (a.school = b.school) and (a.class = b.class) and (a.highest_marks = b.total)
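The join-based query can be tried on a small table. The sketch below assumes SQLite (through Python's sqlite3) in place of Hive and uses hypothetical sample rows. Because the join matches on the total, every student tied for the highest total comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (school TEXT, class TEXT, roll INTEGER, marks INTEGER)"
)
# Hypothetical sample data: two marks per student.
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [("S1", "10", 1, 90), ("S1", "10", 1, 80),   # roll 1 totals 170
     ("S1", "10", 2, 85), ("S1", "10", 2, 85),   # roll 2 totals 170 (tie)
     ("S1", "10", 3, 60), ("S1", "10", 3, 50)],  # roll 3 totals 110
)

sql = """
    SELECT a.school, a.class, b.roll, a.highest_marks
    FROM (SELECT school, class, MAX(total) AS highest_marks
          FROM (SELECT school, class, roll, SUM(marks) AS total
                FROM students GROUP BY school, class, roll) q
          GROUP BY school, class) a
    JOIN (SELECT school, class, roll, SUM(marks) AS total
          FROM students GROUP BY school, class, roll) b
      ON a.school = b.school AND a.class = b.class AND a.highest_marks = b.total
    ORDER BY b.roll
"""
top_students = conn.execute(sql).fetchall()
print(top_students)
```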

Related

How to get the MAX(SUM of values) to find the category with the biggest total? PostgreSQL

I have two tables. One is Transactions and the other is Tickets. In Tickets I have the Ticket_Number, the name of the Category (Theater, Cinema, Concert), and the Price of the ticket. In Transactions I also have the Ticket_Number. What I want to do is get a SUM of money for each Category, and then use that data to select the Category with the most money.
I already managed to get the SUM for each category, but I am stuck here:
SELECT category, SUM (Tickets.Price) AS Price
FROM Tickets,Transactions
WHERE Tickets.ticket_num=Transactions.ticket_num
GROUP BY Category
ORDER BY Price DESC;
I know I can add LIMIT 1, but that is not correct because two or more categories can have the same total.

Use ROW_NUMBER to generate a sequence ordered by the sum of the price, then keep only the aggregated row with the highest total price.
WITH cte AS (
SELECT category, SUM(t1.Price) AS Price,
ROW_NUMBER() OVER (ORDER BY SUM(t1.Price) DESC) rn
FROM Tickets t1
INNER JOIN Transactions t2
ON t1.ticket_num = t2.ticket_num
GROUP BY Category
)
SELECT category, Price
FROM cte
WHERE rn = 1
ORDER BY Price DESC;
Note that if you want to capture all categories tied for the highest price, should a tie occur, then replace ROW_NUMBER in the above CTE with RANK, keeping everything else the same.

What you are looking for is the window function DENSE_RANK(), which handles ties properly.
RANK() will also work for your case, but if you later want the top N places with ties (where N > 1), DENSE_RANK() is the way to go.
SELECT Category, Price
FROM (
SELECT
Category,
SUM(ti.Price) AS Price,
DENSE_RANK() OVER (ORDER BY SUM(ti.Price) DESC) AS rnk
FROM Tickets ti
INNER JOIN Transactions tr ON
ti.ticket_num = tr.ticket_num
GROUP BY Category
) t
WHERE rnk = 1
I've also replaced the old-style, no longer recommended comma-separated list of tables in the FROM clause with a proper INNER JOIN clause, and assigned aliases to the tables.
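To illustrate why DENSE_RANK() matters once N > 1, here is a small sketch using Python's sqlite3 (an assumption standing in for PostgreSQL) with hypothetical ticket data in which two categories tie for first place. The helper top_places is made up for this example, and the sum is computed in an inner subquery before ranking, which is equivalent to ranking by SUM(...) directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (ticket_num INTEGER, category TEXT, price INTEGER);
    CREATE TABLE transactions (ticket_num INTEGER);
    -- Hypothetical data: Theater and Cinema both total 100.
    INSERT INTO tickets VALUES
        (1, 'Theater', 60), (2, 'Theater', 40),
        (3, 'Cinema', 100), (4, 'Concert', 80), (5, 'Opera', 50);
    INSERT INTO transactions VALUES (1), (2), (3), (4), (5);
""")


def top_places(rank_fn, n):
    """Return categories in the top n places; rank_fn is "rank" or "dense_rank"."""
    sql = f"""
        SELECT category, price FROM (
            SELECT category, price,
                   {rank_fn}() OVER (ORDER BY price DESC) AS rnk
            FROM (SELECT ti.category AS category, SUM(ti.price) AS price
                  FROM tickets ti
                  INNER JOIN transactions tr ON ti.ticket_num = tr.ticket_num
                  GROUP BY ti.category) sums
        ) t
        WHERE rnk <= ?
        ORDER BY price DESC, category
    """
    return conn.execute(sql, (n,)).fetchall()


top2_rank = top_places("rank", 2)         # ranks are 1, 1, 3, 4: second place is skipped
top2_dense = top_places("dense_rank", 2)  # ranks are 1, 1, 2, 3: Concert is second
print(top2_rank)
print(top2_dense)
```

With n = 1 both functions return the same two tied categories; they only diverge for later places.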

You can use rank() to rank the sums of the prices, highest total first.
SELECT category,
price
FROM (SELECT category,
sum(tickets.price) price,
rank() OVER (ORDER BY sum(tickets.price) DESC) r
FROM tickets
INNER JOIN transactions
ON transactions.ticket_num = tickets.ticket_num
GROUP BY category) x
WHERE r = 1;
I also took the liberty to rewrite your join from the ancient comma style to a modern, clearer version.

OrientDB Traverse Sum and Group By Top-Most Record

We have Orders that include "caused_order" edges from Order to Order because friends can refer other friends to make purchases. We know from the links we generate for the friends that Order ID 42 caused Order ID 47, so we create a "caused_order" edge between the two Order vertices.
We're looking to identify the people that are generating the most referral business. Right now we just loop through in C# and figure it out because our datasets are relatively small. But I'd like to figure out if there's a way to use the Traverse SQL to accomplish this instead.
The problem I'm running into is getting an accurate count/sum for each Original Order ID.
Consider the following scenario:
Order 42 caused four other Orders, including Order 47. Order 47 caused 2 additional Orders. And Order 51, unrelated to 42 or 47, caused 3 Orders.
I can run the following SQL to get the best referrers for this specific {ProductId}:
select in_caused_order[0].id as OrderID, count(*) as ReferCount, sum(amount) as ReferSum
from ( traverse out('caused_order') from Order )
where out_includes.id = '{ProductId}' and $depth >= 1
group by in_caused_order[0].id
EDIT: the schema is a bit more complex than this, I was just including the out_includes WHERE clause to show that there's a bit of filtering of the Orders. But it's a bit like:
Product(V) <-- includes(E) <-- Order(V) --> caused_order(E) --> Order(V)
(the Order vertex has "amount" as a property, which stores the money spent and is being SUM'd in the SELECT, along with a few fields like date which aren't important)
But that will result in something like:
OrderID | ReferCount | ReferSum
42 | 4 | 525
47 | 2 | 130
51 | 3 | 250
Except that's not quite right, is it? Because Order 42 also technically caused 47's two orders. So we'd want to see something like:
OrderID | ReferCount | ReferSum | ExtendedCount | ExtendedSum
42 | 4 | 525 | 2 | 130
47 | 2 | 130 | 0 | 0
51 | 3 | 250 | 0 | 0
I recognize that the two "Extended" count/sum columns might be tricky. We might have to run the query twice, once with $depth = 1, and again with $depth > 1, and then assemble the results of those two queries in C#, which is fine.
But I can't even figure out how to get the overall total calculated correctly. The first step would even be to see something like:
OrderID | ReferCount | ReferSum
42 | 6 | 635 <-- includes its 4 orders + 47's 2 orders
47 | 2 | 130
51 | 3 | 250
And since this can be n levels deep, I can't just chain in_caused_order.in_caused_order.in_caused_order in the SQL, because I don't know how many levels deep it will go. Order 83 could be caused by Order 47, and Order 105 could be caused by Order 83, and so on.
Any help would be much appreciated. Or maybe the answer is, Traverse can't handle this, and we'll have to figure something else out entirely.

I tried your use case. This is my test data:
create class caused_order extends e
create class Order extends v
create property Order.id integer
create property Order.amount integer
begin
create vertex Order set id=1 ,amount=1
create vertex Order set id=2 ,amount=5
create vertex Order set id=3 ,amount=11
create vertex Order set id=4 ,amount=23
create vertex Order set id=5 ,amount=31
create vertex Order set id=6 ,amount=49
create vertex Order set id=7 ,amount=4
create vertex Order set id=8 ,amount=74
create vertex Order set id=9 ,amount=87
create edge caused_order from (select from Order where id=1) to (select from Order where id=2)
create edge caused_order from (select from Order where id=1) to (select from Order where id=3)
create edge caused_order from (select from Order where id=2) to (select from Order where id=4)
create edge caused_order from (select from Order where id=2) to (select from Order where id=5)
create edge caused_order from (select from Order where id=6) to (select from Order where id=7)
create edge caused_order from (select from Order where id=6) to (select from Order where id=8)
commit retry 20
Then I wrote these two queries to show each order with its ReferSum and ReferCount.
The first one includes the head order in the count:
select id as OrderID, $a[0].Amount as ReferSum, $a[0].Count as ReferCount from Order
let $a=(select sum(amount) as Amount, count(*) as Count from (traverse out('caused_order') from $parent.$current) group by Amount)
The second one excludes the head:
select id as OrderID, $a[0].Amount as ReferSum, $a[0].Count as ReferCount from Order
let $a=(select sum(amount) as Amount, count(*) as Count from (select from (traverse out('caused_order') from $parent.$current) where $depth>=1) group by Amount)
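As a cross-check for the second query (head excluded), the same totals can be computed outside the database. This is only a sketch in plain Python, not OrientDB code. It reuses the order ids, amounts and caused_order edges from the test data above, and the function name referral_totals is hypothetical.

```python
from collections import defaultdict

# Test data from the answer above: order id -> amount, and caused_order edges.
amounts = {1: 1, 2: 5, 3: 11, 4: 23, 5: 31, 6: 49, 7: 4, 8: 74, 9: 87}
caused_order = [(1, 2), (1, 3), (2, 4), (2, 5), (6, 7), (6, 8)]

# Adjacency list: order id -> orders it caused directly.
children = defaultdict(list)
for parent, child in caused_order:
    children[parent].append(child)


def referral_totals(order_id):
    """Return (count, sum of amounts) over all orders caused directly or
    indirectly by order_id, excluding order_id itself."""
    count, total = 0, 0
    stack = list(children[order_id])
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:  # guard against cycles or shared descendants
            continue
        seen.add(current)
        count += 1
        total += amounts[current]
        stack.extend(children[current])  # follow referrals to any depth
    return count, total


report = {order_id: referral_totals(order_id) for order_id in amounts}
for order_id, (refer_count, refer_sum) in sorted(report.items()):
    print(order_id, refer_count, refer_sum)
```

Order 1 picks up orders 2 and 3 directly plus 4 and 5 through order 2, which is the n-levels-deep behaviour the question asks for.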
EDIT
I've added this to my data:
create class includes extends E
create class Product extends V
create property Product.id Integer
create vertex Product set id = 101
create vertex Product set id = 102
create vertex Product set id = 103
create vertex Product set id = 104
create edge includes from (select from Order where id=1) to (select from Product where id=101)
create edge includes from (select from Order where id=2) to (select from Product where id=102)
create edge includes from (select from Order where id=3) to (select from Product where id=103)
create edge includes from (select from Order where id=4) to (select from Product where id=104)
create edge includes from (select from Order where id=5) to (select from Product where id=101)
create edge includes from (select from Order where id=6) to (select from Product where id=102)
create edge includes from (select from Order where id=7) to (select from Product where id=103)
create edge includes from (select from Order where id=8) to (select from Product where id=104)
create edge includes from (select from Order where id=9) to (select from Product where id=101)
create edge includes from (select from Order where id=1) to (select from Product where id=102)
create edge includes from (select from Order where id=1) to (select from Product where id=103)
create edge includes from (select from Order where id=2) to (select from Product where id=104)
These are the modified queries. I added while out('includes').id contains {prodID_number} to the traverse and where out('includes').id contains {prodID_number} to the outer query:
select id as OrderID, $a[0].Amount as ReferSum, $a[0].Count as ReferCount from Order
let $a=(select sum(amount) as Amount, count(*) as Count from (traverse out('caused_order') from $parent.$current while out('includes').id contains 102) group by Amount)
where out('includes').id contains 102
select id as OrderID, $a[0].Amount as ReferSum, $a[0].Count as ReferCount from Order
let $a=(select sum(amount) as Amount, count(*) as Count from (traverse out('caused_order') from $parent.$current while out('includes').id contains 102) where $depth >= 1 group by Amount)
where out('includes').id contains 102

psql, display column that is not in the group by clause

I'm having problems with a query. I have two tables, country and city, and I want to display the city with the highest population per country.
Here's the query:
select country.name as coname, city.name as ciname, max(city.population) as pop
from city
join country on city.countrycode=country.code
group by country.name
order by pop;
Error:
column "city.name" must appear in the GROUP BY clause or be used in an aggregate function.
I don't know how to solve this. I tried a subquery, but it didn't work out.
How can I make it work?

You can easily get it using the rank() function:
select * from
(
select country.name as coname,
city.name as ciname,
city.population,
rank() over (partition by country.name order by city.population desc) as ranking
from
city
join
country
on city.countrycode=country.code
) A
where ranking = 1

Group by multiple columns in PostgreSQL

I have two queries:
SELECT city, count(id) as num_of_applicants
FROM(
select distinct(students.id), city
FROM STUDENTS INNER JOIN APPLICATIONS ON STUDENTS.ID = APPLICATIONS.STUDENT_ID
WHERE APPLICATIONS.COLLEGE_ID = '28'
) AS derivedTable
GROUP BY city;
SELECT city, count(id) as num_of_accepted_applicants
FROM
(select applications.id, city FROM
STUDENTS INNER JOIN APPLICATIONS ON STUDENTS.ID = APPLICATIONS.STUDENT_ID
WHERE status = 'Accepted' and college_id = '28') as tbl
GROUP BY city
One gives the number of applicants per city for a college, and the other gives the number of accepted applicants per city. I want to get both results in one query instead, with a result something like:
city | number_of_applicants | number_of_accepted_applicants

You can simplify (FYI: I didn't understand why you used the derived tables, you could have just put the COUNT and GROUP BY on the inner queries) and combine the queries like this:
SELECT city
, COUNT(*) AS num_of_applicants
, SUM( CASE
WHEN status = 'Accepted' THEN 1
ELSE 0
END
) AS num_of_accepted_applicants
FROM STUDENTS
JOIN APPLICATIONS
ON STUDENTS.ID = APPLICATIONS.STUDENT_ID
WHERE college_id='28'
GROUP BY city;
Another way is to continue with the technique of derived tables. Make each of your queries a derived table and JOIN on the city - but that would not perform as well.
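A quick way to see the conditional aggregation at work is to run it on a few rows. The sketch below assumes SQLite via Python's sqlite3 rather than PostgreSQL, and the students and applications rows are hypothetical. Note that COUNT(*) counts applications, so a student with two applications to the same college would be counted twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER, city TEXT);
    CREATE TABLE applications
        (id INTEGER, student_id INTEGER, college_id TEXT, status TEXT);
    -- Hypothetical sample rows.
    INSERT INTO students VALUES (1, 'Austin'), (2, 'Austin'), (3, 'Boston');
    INSERT INTO applications VALUES
        (10, 1, '28', 'Accepted'),
        (11, 2, '28', 'Rejected'),
        (12, 3, '28', 'Accepted'),
        (13, 3, '99', 'Accepted');  -- different college, filtered out
""")

sql = """
    SELECT city,
           COUNT(*) AS num_of_applicants,
           SUM(CASE WHEN status = 'Accepted' THEN 1 ELSE 0 END)
               AS num_of_accepted_applicants
    FROM students
    JOIN applications ON students.id = applications.student_id
    WHERE college_id = '28'
    GROUP BY city
    ORDER BY city
"""
per_city = conn.execute(sql).fetchall()
print(per_city)
```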

t-sql how to select records without a duplicated one column

I want to select rows for all employees without repeating the data in one column.
For example, I have two rows where the salary (before the raise) is displayed. How can I display only the largest figure, without duplication?

You can use the ROW_NUMBER function.
Here is some sample code:
select * from (
select *,
row_number() over (partition by empid, name, department order by salary desc) as rn
from employee
) employee where rn = 1
You can find Row_Number() with Partition By clause sample at http://www.kodyaz.com

If I'm understanding the question correctly, then a simple MAX function and GROUP BY would work.
SELECT EmployeeId, OtherColumns, MAX(Salary)
FROM tblEmployees
GROUP BY EmployeeId, OtherColumns
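The two approaches above can be compared on a small table. This sketch runs them in SQLite through Python's sqlite3 as a stand-in for SQL Server (an assumption), with a hypothetical employee table whose columns follow the ROW_NUMBER answer. Both return one row per employee as long as the other columns are identical across that employee's rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (empid INTEGER, name TEXT, department TEXT, salary INTEGER)"
)
# Hypothetical rows: employee 1 appears twice, before and after a raise.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Ann", "Sales", 50000), (1, "Ann", "Sales", 55000), (2, "Bob", "IT", 60000)],
)

# Approach 1: number the rows per employee, highest salary first, keep the first.
with_row_number = conn.execute("""
    SELECT empid, name, department, salary FROM (
        SELECT *, ROW_NUMBER() OVER (
                   PARTITION BY empid, name, department
                   ORDER BY salary DESC) AS rn
        FROM employee
    ) e
    WHERE rn = 1
    ORDER BY empid
""").fetchall()

# Approach 2: plain MAX with GROUP BY on the remaining columns.
with_group_by = conn.execute("""
    SELECT empid, name, department, MAX(salary)
    FROM employee
    GROUP BY empid, name, department
    ORDER BY empid
""").fetchall()

print(with_row_number)
print(with_group_by)
```

If a grouped column such as department differed between an employee's rows, both approaches would return one row per distinct combination. To get a single row per employee in that case, partition the ROW_NUMBER version by empid alone.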