Hive - top n records within a group - hiveql

I am currently using Hive and I have a table with the fields user_id and value. I want to order the values in descending order within each user_id and then only emit the top 100 records for each user_id. This is the code I am attempting to use:
DROP TABLE IF EXISTS mytable2
CREATE TABLE mytable2 AS
SELECT * FROM
(SELECT *, rank (user_id) as rank
FROM
(SELECT * from mytable
DISTRIBUTE BY user_id
SORT BY user_id, value DESC)a )b
WHERE rank<101
ORDER BY rank;
However when I run this query, I get the following error:
Error while compiling statement: FAILED: SemanticException [Error 10247]: Missing over clause for function : rank [ERROR_STATUS]
FYI - My UserIds are alpha-numeric.
Can anyone help?
Thanks in advance.

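As the error message says, rank is a windowing function and needs an OVER clause. Since you want the ranking to restart for each user_id and follow value in descending order, partition by user_id and order by value inside the window:
....
(SELECT *, rank() over (partition by user_id order by value desc) as rank
....
With the window defined this way, the inner DISTRIBUTE BY / SORT BY subquery should no longer be necessary. For further information on how to use the rank function, refer to the Hive documentation on windowing and analytics functions.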
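To keep only the top N rows per user_id, the ranking has to restart for every user, which is what PARTITION BY does. Below is a minimal sketch of that pattern. It runs against SQLite through Python's sqlite3 module purely so the example is self-contained; the sample rows and N = 2 are made up, and it assumes an SQLite build with window function support (3.25 or later). The OVER clause itself is standard SQL and should carry over to HiveQL, but that is not tested here.

```python
import sqlite3

# Hypothetical sample data standing in for mytable (user_id, value).
ROWS = [
    ("u1", 50), ("u1", 30), ("u1", 70),
    ("u2a", 5), ("u2a", 9),
]

# Rank rows inside each user_id by value (highest first),
# then keep only ranks up to the requested limit.
TOP_N_SQL = """
SELECT user_id, value, rnk
FROM (
    SELECT user_id,
           value,
           RANK() OVER (PARTITION BY user_id ORDER BY value DESC) AS rnk
    FROM mytable
) ranked
WHERE rnk <= ?
ORDER BY user_id, rnk
"""


def top_n_per_user(rows, n):
    """Return the n highest-valued rows for each user_id."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE mytable (user_id TEXT, value INTEGER)")
        conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
        return conn.execute(TOP_N_SQL, (n,)).fetchall()
    finally:
        conn.close()


top2 = top_n_per_user(ROWS, 2)
```

Note that RANK() gives tied values the same rank, so a group can return more than N rows when there are ties at the cutoff; ROW_NUMBER() is the usual choice when exactly N rows per group are required.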

Related

How to reuse Postgres variables declared using 'WITH' operator

I need to delete records from two tables, but I cannot do it consistently: after deleting from the first table, there is no data left to identify what to delete from the second.
I have tried the following:
WITH person_ids AS(select person_id from application_person
where application_id in (select DISTINCT duplicate.id
from application duplicate inner join application application
on duplicate.document_id = application.document_id
where duplicate.modify_date < application.modify_date)),
delete from application_person where application_person.person_id in (select person_id from person_ids);
delete from person where id in (select person_id from person_ids);
The second use of person_ids fails with: Query failed: ERROR: relation "person_ids" does not exist
What am I doing wrong?
Thanks.
You have two separate statements. The person_ids CTE is only in scope within the first one, which lasts until the semicolon.
You'll want to use a single statement, with the first delete inside a CTE that returns the affected person ids:
WITH duplicate_applications AS (
select DISTINCT duplicate.id
from application duplicate
inner join application application using (document_id)
where duplicate.modify_date < application.modify_date
), deleted_persons AS (
delete from application_person
where application_id in (select id from duplicate_applications)
returning person_id
)
delete from person
where id in (select person_id from deleted_persons);
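The scoping rule behind the error can be reproduced in isolation. The sketch below uses SQLite via Python's sqlite3 module as a stand-in (an assumption made only so the example runs anywhere; the tables are reduced to the columns needed, and the sample rows are invented). It shows the rule only: a CTE name is visible to the single statement it is attached to. It does not show the Postgres fix, since a DELETE ... RETURNING inside a CTE is a Postgres feature.

```python
import sqlite3


def cte_scope_demo():
    """Return the error message raised when a second statement reuses a CTE name."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript("""
            CREATE TABLE person (id INTEGER PRIMARY KEY);
            CREATE TABLE application_person (application_id INTEGER, person_id INTEGER);
            INSERT INTO person VALUES (1), (2);
            INSERT INTO application_person VALUES (10, 1), (11, 2);
        """)
        # Statement 1: the CTE is defined and used within one statement, so it works.
        conn.execute("""
            WITH person_ids AS (
                SELECT person_id FROM application_person WHERE application_id = 10
            )
            DELETE FROM application_person
            WHERE person_id IN (SELECT person_id FROM person_ids)
        """)
        # Statement 2: a new statement, so the name person_ids is unknown here.
        try:
            conn.execute(
                "DELETE FROM person WHERE id IN (SELECT person_id FROM person_ids)"
            )
        except sqlite3.OperationalError as exc:
            return str(exc)
        return None
    finally:
        conn.close()


scope_error = cte_scope_demo()
```

Here scope_error holds the database's complaint about the unknown person_ids name, which mirrors the "relation does not exist" error from the question.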

How do I make my RANK () OVER query work in select?

table image
I have this table that I need to sort in the following way:
need to rank Departments by Salary;
need to show a 'No data to be shown' message if Salary is NULL;
need to add the total salary paid to the department;
need to count the people in the department.
SELECT RANK() OVER (
ORDER BY Salary DESC
)
,CASE
WHEN Salary IS NULL
THEN 'NO DATA TO BE SHOWN'
ELSE Salary
,Count(Fname)
,Total(Salary) FROM dbo.Employees
I get an error saying:
Column 'dbo.Employees.Salary' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Why so?
An aggregate function without a GROUP BY returns a single value for the whole table, so selecting a plain column alongside it does not make sense. For example, suppose you apply Sum(marks) to a whole students table and also select studentname in the same query. Which student's name should the database engine return? It is ambiguous, so the query is rejected.
See also: Column "invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause"
I tried this, aggregating per department in an inner query and ranking in the outer one:
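SELECT RANK() OVER (ORDER BY SAL DESC) AS SALARY_RANK,
DEPARTMENT,
HEADCOUNT,
-- both CASE branches must have the same type, so cast the number to text
CASE
WHEN SAL IS NULL THEN 'NO DATA TO BE SHOWN'
ELSE CAST(SAL AS VARCHAR(30))
END AS TOTAL_SALARY
FROM
(SELECT COUNT(FNAME) AS HEADCOUNT, SUM(SALARY) AS SAL, DEPARTMENT
FROM TESTEMPLOYEE
GROUP BY DEPARTMENT) t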
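The aggregate-first, rank-second shape of that query can be checked on a small data set. The following is a minimal sketch, run on SQLite through Python's sqlite3 module rather than SQL Server (an assumption made only to keep the example runnable; it needs SQLite 3.25 or later for window functions). The employees table, its rows, and the column aliases are hypothetical.

```python
import sqlite3

# Hypothetical rows: (fname, department, salary); salary may be NULL.
EMPLOYEES = [
    ("Ann", "Sales", 100), ("Bob", "Sales", 200),
    ("Cid", "IT", 500),
    ("Dee", "HR", None),
]

# The inner query collapses each department to one row;
# the outer query ranks those rows and formats the total.
RANK_SQL = """
SELECT RANK() OVER (ORDER BY dept_total DESC) AS salary_rank,
       department,
       headcount,
       CASE
           WHEN dept_total IS NULL THEN 'NO DATA TO BE SHOWN'
           ELSE CAST(dept_total AS TEXT)
       END AS total_salary
FROM (
    SELECT department,
           COUNT(fname) AS headcount,
           SUM(salary)  AS dept_total
    FROM employees
    GROUP BY department
) per_department
ORDER BY salary_rank
"""


def rank_departments(rows):
    """Return (rank, department, headcount, total-or-message) per department."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE employees (fname TEXT, department TEXT, salary INTEGER)"
        )
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
        return conn.execute(RANK_SQL).fetchall()
    finally:
        conn.close()


ranked = rank_departments(EMPLOYEES)
```

In SQLite a NULL total sorts last under ORDER BY ... DESC, so the department with no salary data ends up with the lowest rank. Other engines may place NULLs differently, so it is worth checking that on the target database.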

how to select multiple column from the table using group by( based on one column) , having and count in hive query

Requirement :
Using group by A and get records having count > 1
eg:
SELECT count(sk), id, sk
FROM table x
GROUP BY id
HAVING COUNT(sk) > 1
But I am not able to select sk in the select statement. Is there any other way to do this? How can I use PARTITION BY on the input and output set attached here?
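You can do something like this: compute the count per id as a window function, so every row keeps its own sk, and filter on it in an outer query.
select * from (
SELECT count(sk) over (partition by id) as cnt, id, sk
FROM table x) a
where a.cnt > 1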
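To see what the windowed count returns compared with a plain GROUP BY, here is a small sketch of the same pattern. It uses SQLite via Python's sqlite3 module only as a convenient way to run it (assuming SQLite 3.25 or later); the table name x and the sample rows are invented.

```python
import sqlite3

# Hypothetical (id, sk) pairs: id 1 and id 3 occur more than once, id 2 only once.
ROWS = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (3, "f")]

# COUNT(...) OVER (PARTITION BY id) attaches the per-id count to every row,
# so sk stays available without being part of a GROUP BY.
DUP_SQL = """
SELECT id, sk, cnt
FROM (
    SELECT id, sk, COUNT(sk) OVER (PARTITION BY id) AS cnt
    FROM x
) a
WHERE a.cnt > 1
ORDER BY id, sk
"""


def rows_with_repeated_id(rows):
    """Return every (id, sk, cnt) row whose id appears more than once."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE x (id INTEGER, sk TEXT)")
        conn.executemany("INSERT INTO x VALUES (?, ?)", rows)
        return conn.execute(DUP_SQL).fetchall()
    finally:
        conn.close()


repeated = rows_with_repeated_id(ROWS)
```

Unlike GROUP BY id, which would return one row per id, this keeps one output row per input row, which is why sk can be selected freely.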

How to use new created column in where column in sql?

Hi, I have a query which looks like the following:
SELECT device_id, tag_id, at, _deleted, data,
row_number() OVER (PARTITION BY device_id ORDER BY at DESC) AS row_num
FROM mdb_history.devices_tags_mapping_history
WHERE at <= '2019-04-01'
AND _deleted = False
AND (tag_id = '275674' or tag_id = '275673')
AND row_num = 1
However, when I run this query, I get the following error:
ERROR: column "row_num" does not exist
Is there any way around this? One thing I tried was to wrap it in a subquery:
SELECT * from (SELECT device_id, tag_id, at, _deleted, data,
row_number() OVER (PARTITION BY device_id ORDER BY at DESC) AS row_num
FROM mdb_history.devices_tags_mapping_history
WHERE at <= '2019-04-01'
AND _deleted = False
AND (tag_id = '275674' or tag_id = '275673')) tag_deleted
WHERE tag_deleted.row_num = 1
But this becomes too complicated when I apply it to my other queries: they have a number of joins, and I have to re-select every column in the outer query, which leads to a lot of nested SELECT statements. Is there a smarter, simpler way of doing this? Thanks.
You can't refer to the row_num alias which you defined in the same level of the select in your query. So, your main option here would be to subquery, where row_num would be available. But, Postgres actually has an option to get what you want in another way. You could use DISTINCT ON here:
SELECT DISTINCT ON (device_id) device_id, tag_id, at, _deleted, data
FROM mdb_history.devices_tags_mapping_history
WHERE
at <= '2019-04-01' AND
_deleted = false AND
tag_id IN ('275674', '275673')
ORDER BY
device_id,
at DESC;
This is too long and too formatted for a comment. There is a reason behind @TimBiegeleisen's statement about an "alias which you defined in the same level of the select". All SQL statements follow the same logical sequence of evaluation. Unfortunately, that sequence does NOT match the order in which the clauses are written. The sequence is, in order:
from
where
group by
having
select
order by
limit
You will notice that the select list is evaluated well after the where clause. Since your alias is defined in the select phase, it does not exist yet during the where phase.
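Given that evaluation order, the filter on row_num has to run one query level above the place where row_num is computed. A CTE is one way to keep that readable when there are many joins, because the inner query is written once and given a name. The sketch below shows the idea on SQLite through Python's sqlite3 module (a stand-in chosen only to make it runnable, assuming SQLite 3.25 or later); the history table, its three columns, and the rows are hypothetical simplifications of the table in the question.

```python
import sqlite3

# Hypothetical (device_id, tag_id, at) rows; dates are ISO strings so they sort correctly.
ROWS = [
    ("d1", "t1", "2019-03-01"),
    ("d1", "t2", "2019-03-15"),
    ("d2", "t1", "2019-02-01"),
    ("d2", "t2", "2019-05-01"),  # after the cutoff, filtered out below
]

# The window function is computed in the CTE (select phase of the inner query);
# the outer query's WHERE can then see row_num as an ordinary column.
LATEST_SQL = """
WITH numbered AS (
    SELECT device_id, tag_id, at,
           ROW_NUMBER() OVER (PARTITION BY device_id ORDER BY at DESC) AS row_num
    FROM history
    WHERE at <= '2019-04-01'
)
SELECT device_id, tag_id, at
FROM numbered
WHERE row_num = 1
ORDER BY device_id
"""


def latest_per_device(rows):
    """Return the most recent row per device_id on or before the cutoff date."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE history (device_id TEXT, tag_id TEXT, at TEXT)")
        conn.executemany("INSERT INTO history VALUES (?, ?, ?)", rows)
        return conn.execute(LATEST_SQL).fetchall()
    finally:
        conn.close()


latest = latest_per_device(ROWS)
```

The inner WHERE still runs before the row numbers are assigned, so the cutoff date restricts which rows compete for row_num = 1.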

Simple SELECT, but adding JOIN returns too many rows

The query below returns 9,817 records. Now, I want to SELECT one more field from another table. See the 2 lines that are commented out, where I've simply selected this additional field and added a JOIN statement to bring in this new column. With these lines added, the query now returns 649,200 records and I can't figure out why! I guess something is wrong with my WHERE criteria in conjunction with the JOIN statement. Please help, thanks.
SELECT DISTINCT dbo.IMPORT_DOCUMENTS.ITEMID, BEGDOC, BATCHID
--, dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.CATEGORY_ID
FROM IMPORT_DOCUMENTS
--JOIN dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS ON dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID = dbo.IMPORT_DOCUMENTS.ITEMID
WHERE (BATCHID LIKE 'IC0%' OR BATCHID LIKE 'LP0%')
AND dbo.IMPORT_DOCUMENTS.ITEMID IN
(SELECT dbo.CATEGORY_COLLECTION_CATEGORY_RESULTS.ITEMID FROM
CATEGORY_COLLECTION_CATEGORY_RESULTS
WHERE SCORE >= .7 AND SCORE <= .75 AND CATEGORY_ID IN(
SELECT CATEGORY_ID FROM CATEGORY_COLLECTION_CATS WHERE COLLECTION_ID IN (11,16))
AND Sample_Id > 0)
AND dbo.IMPORT_DOCUMENTS.ITEMID NOT IN
(SELECT ASSIGNMENT_FOLDER_DOCUMENTS.Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
One possible reason is that one of your tables holds data at a finer grain than your join key. For example, CATEGORY_COLLECTION_CATEGORY_RESULTS may contain multiple records per ITEMID, so each document is repeated once for every matching row. Without knowing the data, I would try the modified query below, which applies the filters inside a derived table before joining. If the output is not what you're looking for, convert it into a SELECT within a SELECT.
Hope this helps.
Try this SQL:
SELECT DISTINCT a.ITEMID, a.BEGDOC, a.BATCHID, b.CATEGORY_ID
FROM IMPORT_DOCUMENTS a
JOIN (SELECT DISTINCT ITEMID, CATEGORY_ID
FROM CATEGORY_COLLECTION_CATEGORY_RESULTS
WHERE SCORE >= .7 AND SCORE <= .75
AND CATEGORY_ID IN (SELECT DISTINCT CATEGORY_ID FROM CATEGORY_COLLECTION_CATS WHERE COLLECTION_ID IN (11,16))
AND Sample_Id > 0) b ON a.ITEMID = b.ITEMID
WHERE (a.BATCHID LIKE 'IC0%' OR a.BATCHID LIKE 'LP0%')
AND a.ITEMID NOT IN (SELECT DISTINCT Item_Id FROM ASSIGNMENT_FOLDER_DOCUMENTS)
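The row multiplication described above is easy to reproduce on a toy data set. The sketch below is only an illustration: it uses SQLite via Python's sqlite3 module, and the tables docs and results with their rows are invented stand-ins for IMPORT_DOCUMENTS and CATEGORY_COLLECTION_CATEGORY_RESULTS.

```python
import sqlite3


def join_counts():
    """Compare row counts before and after joining to a table with repeated keys."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript("""
            CREATE TABLE docs (itemid INTEGER, batchid TEXT);
            CREATE TABLE results (itemid INTEGER, category_id INTEGER);
            INSERT INTO docs VALUES (1, 'IC01'), (2, 'LP01');
            -- item 1 has three result rows, item 2 has one
            INSERT INTO results VALUES (1, 10), (1, 11), (1, 12), (2, 10);
        """)

        def scalar(sql):
            return conn.execute(sql).fetchone()[0]

        join = "FROM docs d JOIN results r ON r.itemid = d.itemid"
        return {
            "docs_only": scalar("SELECT COUNT(*) FROM docs"),
            "after_join": scalar("SELECT COUNT(*) " + join),
            "distinct_items": scalar("SELECT COUNT(DISTINCT d.itemid) " + join),
            # Diagnostic: which join keys occur more than once on the joined side?
            "repeated_keys": conn.execute(
                "SELECT itemid, COUNT(*) FROM results "
                "GROUP BY itemid HAVING COUNT(*) > 1"
            ).fetchall(),
        }
    finally:
        conn.close()


counts = join_counts()
```

The repeated_keys query is a quick diagnostic to run against the real table: any ITEMID it returns will be duplicated by the join. Keep in mind that once CATEGORY_ID is in the select list, DISTINCT no longer collapses those rows, because each item and category pair is a genuinely different row.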