How to get the id of the max-count group in Hive?

I have a table like this:
id , m_id , group_id
1 , a , 0
1 , b , 0
1 , c , 1
1 , d , 1
2 , e , 0
2 , f , 0
2 , g , 0
2 , h , 1
2 , i , 1
For each id, I would like to get the m_id values that belong to the group with the largest number of m_id values. If there is a tie, I will just take one of the tied groups at random. Hence the expected output will be:
id , m_id
1 , a
1 , b
2 , e
2 , f
2 , g
Note: the group_id number only identifies a group within each id, i.e. group_id = 0 does not mean the same thing for id=1 and id=2.
My original idea was to get max(group_id) grouped by (id, m_id), and return the id, m_id rows that have that max(group_id). However, this approach won't help in the tie situation (id = 2 cases).
Really hope someone can help me on this!
Thanks!

Use row_number(), partitioned by id and ordered by the group count, to find the largest group for each id. Then join back to the table on id and group_id to get the m_id values of that group.
CREATE TABLE test
(
id integer , m_id char(1) , group_id integer
);
INSERT INTO test (id,m_id,group_id) VALUES (1,'a',0);
INSERT INTO test (id,m_id,group_id) VALUES (1,'b',0);
INSERT INTO test (id,m_id,group_id) VALUES (1,'c',1);
INSERT INTO test (id,m_id,group_id) VALUES (1,'d',1);
INSERT INTO test (id,m_id,group_id) VALUES (2,'e',0);
INSERT INTO test (id,m_id,group_id) VALUES (2,'f',0);
INSERT INTO test (id,m_id,group_id) VALUES (2,'g',0);
INSERT INTO test (id,m_id,group_id) VALUES (2,'h',1);
INSERT INTO test (id,m_id,group_id) VALUES (2,'i',1);
select b.id,b.group_id,b.m_id
from (
select id,group_id,row_number() over(partition by id order by count(*) desc) as r_no
from test
group by id,group_id
) a
join test b on b.id=a.id and b.group_id=a.group_id
where a.r_no=1

You can use row_number with aggregation to do this.
select t1.id,t1.group_id,t1.m_id
from (select id,group_id,row_number() over(partition by id order by count(*) desc) as rnum
from tbl
group by id,group_id
) t
join tbl t1 on t1.id=t.id and t1.group_id=t.group_id
where t.rnum=1
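To try the idea without a Hive cluster, here is a minimal sketch that runs an equivalent query through Python's built-in sqlite3 module (assuming the bundled SQLite is 3.25 or newer, which is when window functions arrived). The function name and the SAMPLE constant are made up for illustration. The counts are computed in an inner subquery first, and group_id is added as a tie-breaker so the result is deterministic; both are my own choices and not part of the answers above.

```python
import sqlite3

# Sample rows from the question: (id, m_id, group_id)
SAMPLE = [
    (1, "a", 0), (1, "b", 0), (1, "c", 1), (1, "d", 1),
    (2, "e", 0), (2, "f", 0), (2, "g", 0), (2, "h", 1), (2, "i", 1),
]


def max_count_group_members(rows):
    """Return (id, m_id) pairs that belong to the largest group of each id."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tbl (id INTEGER, m_id TEXT, group_id INTEGER)")
    con.executemany("INSERT INTO tbl VALUES (?, ?, ?)", rows)
    query = """
        SELECT t1.id, t1.m_id
        FROM (
            -- rank the groups of each id by size; ties go to the lower group_id
            SELECT id, group_id,
                   ROW_NUMBER() OVER (PARTITION BY id
                                      ORDER BY cnt DESC, group_id) AS rnum
            FROM (SELECT id, group_id, COUNT(*) AS cnt
                  FROM tbl
                  GROUP BY id, group_id) AS counts
        ) AS t
        JOIN tbl AS t1 ON t1.id = t.id AND t1.group_id = t.group_id
        WHERE t.rnum = 1
        ORDER BY t1.id, t1.m_id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in max_count_group_members(SAMPLE):
        print(row)
```

Dropping the group_id tie-breaker gives back the "any one of the tied groups" behaviour asked for in the question.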

Related

Row_number() over partition

I am working on peoplesoft. I have a requirement where I have to update the column value in a sequence ordered based on some ID.
For example:
CA24100001648- 1
CA24100001648- 2
CA24100001664- 1
CA24100001664- 2
CA24100001664- 3
CA24100001664- 4
CA24100001664- 5
CA24100001664- 6
But, I am getting '1' as the value for all the rows on updating.
Here is my query, can anyone please help out on this.
UPDATE PS_UC_CA_CONT_STG C
SET C.CONTRACT_LINE_NUM2 = ( SELECT row_number() over(PARTITION BY D.CONTRACT_NUM
order by D.CONTRACT_NUM)
FROM PS_UC_CA_HDR_STG D
WHERE C.CONTRACT_NUM=D.CONTRACT_NUM );
Thanks
update emp a
set comm =
(with cnt as ( select deptno,empno,row_number() over (partition by deptno order by deptno) rn from emp)
select c.rn from cnt c where c.empno=a.empno)
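The point of the statement above is that the row numbers are computed over the whole table first and only then looked up per row by key, instead of being computed inside a subquery that is already filtered down to the row being updated. A minimal sketch of the same pattern, run through Python's sqlite3 module as a stand-in for Oracle (assuming SQLite 3.25+); the function name and the sample rows are hypothetical, and the ordering by empno inside each department is my assumption to make the numbering deterministic:

```python
import sqlite3

# Hypothetical sample rows: (empno, deptno)
SAMPLE_EMP = [(1, 10), (2, 10), (3, 20), (4, 20), (5, 20)]


def number_rows_per_dept(rows):
    """Write a 1..n sequence per deptno into comm and return all rows."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE emp (empno INTEGER PRIMARY KEY, deptno INTEGER, comm INTEGER)"
    )
    con.executemany("INSERT INTO emp (empno, deptno) VALUES (?, ?)", rows)
    con.execute("""
        UPDATE emp
        SET comm = (
            SELECT c.rn
            FROM (
                -- numbering is computed over the full table, not per updated row
                SELECT empno,
                       ROW_NUMBER() OVER (PARTITION BY deptno
                                          ORDER BY empno) AS rn
                FROM emp
            ) AS c
            WHERE c.empno = emp.empno
        )
    """)
    result = con.execute(
        "SELECT empno, deptno, comm FROM emp ORDER BY empno"
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in number_rows_per_dept(SAMPLE_EMP):
        print(row)
```

This needs a column that identifies each row uniquely (empno here), otherwise the lookup cannot tell the rows of one partition apart.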

TSQL - Get latest rows which their title is not null

I have following table:
========================
Id SubCode Title
========================
1 1 test1
1 2 test2
1 3 NULL
1 4 NULL
2 1 k1
2 2 k2
2 3 k3
2 4 NULL
Now I want to select, for each Id, the latest row whose title is not null. For example, for Id 1 the query must show test2 and for Id 2 it must show k3:
========================
Id SubCode Title
========================
1 2 test2
2 3 k3
I have written this query:
select t.Id, t.SubCode, t.Title from Test t
inner join (
select max(Id) as Id, max(SubCode) as SubCode
from Test
group by Id
) tm on t.Id = tm.Id and t.SubCode = tm.SubCode
But this code gives the wrong result:
========================
Id SubCode Title
========================
1 4 NULL
2 4 NULL
Any idea?
You forgot to exclude NULLs by writing an appropriate WHERE clause (where title is not null).
However such problems (to get a best / last / ... record) are usually best solved with analytic functions (RANK, DENSE_RANK, ROW_NUMBER) anyway, because with them you access the table only once:
select id, subcode, title
from
(
select id, subcode, title, rank() over (partition by id order by subcode desc) as rn
from test
where title is not null
) ranked
where rn = 1;
You need a Title is not null where clause in your inner select:
select t.Id, t.SubCode, t.Title from Test t
inner join (
select max(Id) as Id, max(SubCode) as SubCode
from Test
where Title is not null
group by Id
) tm on t.Id = tm.Id and t.SubCode = tm.SubCode
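For a quick check of the analytic-function version against the sample data, here is a minimal sketch using Python's sqlite3 module as a stand-in for SQL Server (assuming SQLite 3.25+). The function name and the SAMPLE_TEST constant are made up for illustration.

```python
import sqlite3

# Sample rows from the question: (Id, SubCode, Title); None stands for NULL
SAMPLE_TEST = [
    (1, 1, "test1"), (1, 2, "test2"), (1, 3, None), (1, 4, None),
    (2, 1, "k1"), (2, 2, "k2"), (2, 3, "k3"), (2, 4, None),
]


def latest_titled_rows(rows):
    """Return, per Id, the row with the highest SubCode whose Title is not NULL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Test (Id INTEGER, SubCode INTEGER, Title TEXT)")
    con.executemany("INSERT INTO Test VALUES (?, ?, ?)", rows)
    query = """
        SELECT Id, SubCode, Title
        FROM (
            -- NULL titles are filtered out before ranking
            SELECT Id, SubCode, Title,
                   RANK() OVER (PARTITION BY Id ORDER BY SubCode DESC) AS rn
            FROM Test
            WHERE Title IS NOT NULL
        ) AS ranked
        WHERE rn = 1
        ORDER BY Id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in latest_titled_rows(SAMPLE_TEST):
        print(row)
```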

GROUP BY column and clause in postgres

I would like to group the rows of a table by a column value, and also start a new group whenever another condition is met. For example, with the following table:
Events:
id session_id flags created_at ...
--------------------------------------------
1 100 OTHER ...
2 101 OTHER ...
3 101 NEW_SESSION ...
4 101 OTHER ...
5 101 NEW_SESSION ...
6 100 OTHER ...
7 102 OTHER ...
I want the following result:
session_id events_count first_event_id last_event_id
-------------------------------------------------------
100-0 2 1 6
101-0 1 2 2
101-1 2 3 4
101-2 1 5 5
102-0 1 7 7
The basic idea is that I want to extract sessions from events. They are grouped by session_id. I also want a new session whenever I have the flag NEW_SESSION.
The query is something like this:
SELECT ? as session_id
, count(id) as events_count
, MIN(id) as first_event_id
, MAX(id) last_event_id
GROUP BY session_id
-- , and whenever flags is NEW_SESSION
ORDER BY id
But I don't know how to express the group by condition properly. Any ideas?
Update 2
In the comments I've noticed that you want them unique. Then we can use a variable:
SET @inc := 0;
(
SELECT CONCAT(session_id, '-', !ABS(STRCMP(flags, 'NEW_SESSION'))) AS session_id
, COUNT(id) AS events_count
, MIN(id) AS first_event_id
, MAX(id) last_event_id
FROM events
WHERE flags != 'NEW_SESSION'
GROUP BY events.session_id, events.flags
ORDER BY events.id
) UNION (
SELECT CONCAT(session_id, '-', @inc := @inc + 1) AS session_id
, COUNT(id) AS events_count
, MIN(id) AS first_event_id
, MAX(id) last_event_id
FROM events
WHERE flags = 'NEW_SESSION'
GROUP by events.id
ORDER BY events.id
);
Update
The following prevents grouping for the NEW_SESSION rows:
(
SELECT CONCAT(session_id, '-', !ABS(STRCMP(flags, 'NEW_SESSION'))) AS session_id
, COUNT(id) AS events_count
, MIN(id) AS first_event_id
, MAX(id) last_event_id
FROM events
WHERE flags != 'NEW_SESSION'
GROUP BY events.session_id, events.flags
ORDER BY events.id
) UNION (
SELECT CONCAT(session_id, '-1') AS session_id
, COUNT(id) AS events_count
, MIN(id) AS first_event_id
, MAX(id) last_event_id
FROM events
WHERE flags = 'NEW_SESSION'
GROUP BY id
ORDER BY events.id
);
Original answer
As far as I understand, you are trying to group events by the session IDs and
"whether it's a NEW_SESSION" flag. If it's so, then I'd express it as follows:
SELECT CONCAT(session_id, '-', !ABS(STRCMP(flags, 'NEW_SESSION'))) AS session_id
, COUNT(id) AS events_count
, MIN(id) AS first_event_id
, MAX(id) last_event_id
FROM events
GROUP BY events.session_id, events.flags
ORDER BY events.id;
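A different technique from the queries above, which I believe gets closer to the requested output: number the sub-sessions with a running count of NEW_SESSION flags per session_id (a windowed SUM), then group by session_id plus that counter. Below is a minimal sketch using Python's sqlite3 module as a stand-in for Postgres (assuming SQLite 3.25+). The function name, the session_key alias and the SAMPLE_EVENTS constant are made up for illustration, and created_at is left out because the question elides its values.

```python
import sqlite3

# Sample rows from the question: (id, session_id, flags)
SAMPLE_EVENTS = [
    (1, 100, "OTHER"), (2, 101, "OTHER"), (3, 101, "NEW_SESSION"),
    (4, 101, "OTHER"), (5, 101, "NEW_SESSION"), (6, 100, "OTHER"),
    (7, 102, "OTHER"),
]


def extract_sessions(rows):
    """Group events by session_id, opening a new group at each NEW_SESSION."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (id INTEGER, session_id INTEGER, flags TEXT)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    query = """
        SELECT session_id || '-' || seq AS session_key,
               COUNT(*) AS events_count,
               MIN(id)  AS first_event_id,
               MAX(id)  AS last_event_id
        FROM (
            -- seq = number of NEW_SESSION flags seen so far in this session_id
            SELECT id, session_id,
                   SUM(CASE WHEN flags = 'NEW_SESSION' THEN 1 ELSE 0 END)
                       OVER (PARTITION BY session_id ORDER BY id) AS seq
            FROM events
        ) AS numbered
        GROUP BY session_id, seq
        ORDER BY first_event_id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in extract_sessions(SAMPLE_EVENTS):
        print(row)
```

The inner SELECT and the GROUP BY should carry over to Postgres largely unchanged; the string concatenation may need an explicit cast of the integer columns there.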

how to do dead reckoning on column of table, postgresql

I have a table that looks like this:
x y
1 2
2 null
3 null
1 null
11 null
I want to fill the null values by applying the rolling rule
y_{i+1}=y_{i}+x_{i+1}, with SQL that is as simple as possible (in place),
so the expected result is:
x y
1 2
2 4
3 7
1 8
11 19
This should be implemented in PostgreSQL. I could encapsulate it in a custom window function, but implementing custom functions always seems complex.
WITH RECURSIVE t AS (
select x, y, 1 as rank from my_table where y is not null
UNION ALL
SELECT A.x, A.x+ t.y y , t.rank + 1 rank FROM t
inner join
(select row_number() over () rank, x, y from my_table ) A
on t.rank+1 = A.rank
)
SELECT x,y FROM t;
You can iterate over rows using a recursive CTE. But in order to do so, you need a way to jump from row to row. Here's an example using an ID column:
; with recursive cte as
(
select id
, y
from Table1
where id = 1
union all
select cur.id
, prev.y + cur.x
from Table1 cur
join cte prev
on cur.id = prev.id + 1
)
select *
from cte
;
You can see the query at SQL Fiddle. If you don't have an ID column, but you do have another way to order the rows, you can use row_number() to get an ID:
; with recursive sorted as
(
-- Specify your ordering here. This example sorts by the dt column.
select row_number() over (order by dt) as id
, *
from Table1
)
, cte as
(
select id
, y
from sorted
where id = 1
union all
select cur.id
, prev.y + cur.x
from sorted cur
join cte prev
on cur.id = prev.id + 1
)
select *
from cte
;
Here's the SQL Fiddle link.
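Since the rule y_{i+1} = y_{i} + x_{i+1} just accumulates x, the same result can also be had without recursion: take the first row's y minus its x as an offset and add a running SUM(x). This is an alternative to the recursive CTEs above, not a restatement of them. A minimal sketch using Python's sqlite3 module as a stand-in for Postgres (assuming SQLite 3.25+), and assuming the table has an id column that defines the row order and that the first row's y is not null; the function name and SAMPLE_XY are made up for illustration.

```python
import sqlite3

# Sample rows from the question with an assumed ordering column: (id, x, y)
SAMPLE_XY = [(1, 1, 2), (2, 2, None), (3, 3, None), (4, 1, None), (5, 11, None)]


def dead_reckon(rows):
    """Return (x, y) rows with y filled in as first_y + running sum of later x."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE my_table (id INTEGER, x INTEGER, y INTEGER)")
    con.executemany("INSERT INTO my_table VALUES (?, ?, ?)", rows)
    query = """
        SELECT x,
               -- offset from the first row, plus the cumulative sum of x
               (SELECT y - x FROM my_table ORDER BY id LIMIT 1)
                   + SUM(x) OVER (ORDER BY id) AS y
        FROM my_table
        ORDER BY id
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in dead_reckon(SAMPLE_XY):
        print(row)
```

To fill the column in place, the same expression could feed an UPDATE keyed on id.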

TSQL - Mapping one table to another without using cursor

I have tables with following structure
create table Doc(
id int identity(1, 1) primary key,
DocumentStartValue varchar(100)
)
create table Metadata (
DocumentValue varchar(100),
StartDesignation char(1),
PageNumber int
)
GO
Doc contains
id DocumentStartValue
1000 ID-1
1100 ID-5
2000 ID-8
3000 ID-9
Metadata contains
Documentvalue StartDesignation PageNumber
ID-1 D 0
ID-2 NULL 1
ID-3 NULL 2
ID-4 NULL 3
ID-5 D 0
ID-6 NULL 1
ID-7 NULL 2
ID-8 D 0
ID-9 D 0
What I need is to map Metadata.DocumentValue to Doc.id.
So the result I need is something like
id DocumentValue PageNumber
1000 ID-1 0
1000 ID-2 1
1000 ID-3 2
1000 ID-4 3
1100 ID-5 0
1100 ID-6 1
1100 ID-7 2
2000 ID-8 0
3000 ID-9 0
Can it be achieved without the use of cursor?
Something like this (sorry, I can't test it):
;WITH RowList AS
( --assign RowNums to each row...
SELECT
ROW_NUMBER() OVER (ORDER BY id) AS RowNum,
id, DocumentStartValue
FROM
doc
), RowPairs AS
( --pair each row with the next row to create ranges; the last row has no next row, so EndValue is NULL
SELECT
R.DocumentStartValue AS StartValue, R.id,
R1.DocumentStartValue AS EndValue
FROM
RowList R LEFT JOIN RowList R1 ON R.RowNum + 1 = R1.RowNum
)
--use ranges to join back and get the data
SELECT
RP.id, M.DocumentValue, M.PageNumber
FROM
RowPairs RP
JOIN
Metadata M ON RP.StartValue <= M.DocumentValue AND (RP.EndValue IS NULL OR M.DocumentValue < RP.EndValue)
Edit: This assumes that you can rely on the ID-x values matching and being ascending. If so, StartDesignation is superfluous/redundant and may conflict with the Doc table DocumentStartValue
with rm as
(
select DocumentValue
,PageNumber
,case when StartDesignation = 'D' then 1 else 0 end as IsStart
,row_number() over (order by DocumentValue) as RowNumber
from Metadata
)
,gm as
(
select
DocumentValue as DocumentGroup
,DocumentValue
,PageNumber
,RowNumber
from rm
where RowNumber = 1
union all
select
case when rm.IsStart = 1 then rm.DocumentValue else gm.DocumentGroup end
,rm.DocumentValue
,rm.PageNumber
,rm.RowNumber
from gm
inner join rm on rm.RowNumber = (gm.RowNumber + 1)
)
select d.id, gm.DocumentValue, gm.PageNumber
from Doc d
inner join gm on d.DocumentStartValue = gm.DocumentGroup
Try the query above (you may also need to add option (maxrecursion ...)) and add an index on DocumentValue for the Metadata table. Also, if it's possible, it would be better to store the appropriate group on the Metadata rows when they are inserted.
UPD: I've tested it and fixed the errors in my query; now it works and gives the result shown in the initial question.
UPD2: And recommended indexes:
create clustered index IX_Metadata on Metadata (DocumentValue)
create nonclustered index IX_Doc_StartValue on Doc (DocumentStartValue)
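As a non-recursive alternative to the queries above, the groups can also be labelled with a running count of the 'D' markers, taking the first DocumentValue of each group as the key to join on. A minimal sketch using Python's sqlite3 module as a stand-in for SQL Server (assuming SQLite 3.25+). It assumes the rows can be ordered by the number after "ID-" (plain string ordering would put ID-10 before ID-2); the function name, the CTE names and the sample constants are made up for illustration.

```python
import sqlite3

# Sample data from the question
SAMPLE_DOC = [(1000, "ID-1"), (1100, "ID-5"), (2000, "ID-8"), (3000, "ID-9")]
SAMPLE_META = [
    ("ID-1", "D", 0), ("ID-2", None, 1), ("ID-3", None, 2), ("ID-4", None, 3),
    ("ID-5", "D", 0), ("ID-6", None, 1), ("ID-7", None, 2),
    ("ID-8", "D", 0), ("ID-9", "D", 0),
]


def map_metadata_to_doc(docs, metadata):
    """Return (Doc.id, DocumentValue, PageNumber) for every Metadata row."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Doc (id INTEGER, DocumentStartValue TEXT)")
    con.execute(
        "CREATE TABLE Metadata "
        "(DocumentValue TEXT, StartDesignation TEXT, PageNumber INTEGER)"
    )
    con.executemany("INSERT INTO Doc VALUES (?, ?)", docs)
    con.executemany("INSERT INTO Metadata VALUES (?, ?, ?)", metadata)
    query = """
        WITH ordered AS (
            -- n = numeric part of 'ID-<n>', used only for ordering (assumption)
            SELECT DocumentValue, StartDesignation, PageNumber,
                   CAST(SUBSTR(DocumentValue, 4) AS INTEGER) AS n
            FROM Metadata
        ),
        grouped AS (
            -- grp increases by one at every row marked 'D'
            SELECT DocumentValue, PageNumber, n,
                   SUM(CASE WHEN StartDesignation = 'D' THEN 1 ELSE 0 END)
                       OVER (ORDER BY n) AS grp
            FROM ordered
        ),
        labelled AS (
            -- the first row of each group is its 'D' row
            SELECT DocumentValue, PageNumber, n,
                   FIRST_VALUE(DocumentValue)
                       OVER (PARTITION BY grp ORDER BY n) AS DocumentGroup
            FROM grouped
        )
        SELECT d.id, l.DocumentValue, l.PageNumber
        FROM Doc AS d
        JOIN labelled AS l ON d.DocumentStartValue = l.DocumentGroup
        ORDER BY l.n
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in map_metadata_to_doc(SAMPLE_DOC, SAMPLE_META):
        print(row)
```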