Get max id from grouping two columns as pair - group-by

I have searched a lot for a way to get the max id when grouping by two columns as a pair, but none of the queries I have found and tried worked as expected. Below is an example data set:
id      tour_id  p1      stage      rnd  assoc1  p2      assoc2  winner
996057  5277     107028  Main Draw  32   GER     110673  IRI     107028
996101  5277     107028  Main Draw  16   GER     105136  FRA     107028
996126  5277     107028  Main Draw  8    GER     112074  SWE     107028
996133  5277     107028  Main Draw  4    GER     123980  JPN     107028
996139  5277     107028  Main Draw  2    GER     121582  TPE     107028
996037  5277     116620  Main Draw  32   GER     121582  TPE     121582
996037  5277     121582  Main Draw  32   TPE     116620  GER     121582
996097  5277     121582  Main Draw  16   TPE     104314  IND     121582
996121  5277     121582  Main Draw  8    TPE     112092  NGR     121582
996132  5277     121582  Main Draw  4    TPE     112062  FRA     121582
996139  5277     121582  Main Draw  2    TPE     107028  GER     107028
996324  5278     107028  Main Draw  32   GER     100439  EGY     107028
996362  5278     107028  Main Draw  16   GER     104314  IND     107028
996379  5278     107028  Main Draw  8    GER     116853  SWE     107028
996390  5278     107028  Main Draw  4    GER     123980  JPN     123980
996283  5278     116620  Main Draw  64   GER     121514  KOR     121514
996313  5278     121582  Main Draw  32   TPE     106296  POR     121582
996357  5278     121582  Main Draw  16   TPE     102968  AUT     121582
996380  5278     121582  Main Draw  8    TPE     102761  GER     102761
998765  5299     101222  Main Draw  64   GER     118671  DEN     101222
998788  5299     101222  Main Draw  32   GER     102380  ENG     101222
998801  5299     101222  Main Draw  16   GER     116620  GER     101222
998807  5299     101222  Main Draw  8    GER     116853  SWE     101222
998810  5299     101222  Main Draw  4    GER     112074  SWE     101222
998812  5299     101222  Main Draw  2    GER     107028  GER     101222
998773  5299     107028  Main Draw  64   GER     120168  TUR     107028
998797  5299     107028  Main Draw  32   GER     102891  CRO     107028
998805  5299     107028  Main Draw  16   GER     104379  SWE     107028
998809  5299     107028  Main Draw  8    GER     104036  CZE     107028
998811  5299     107028  Main Draw  4    GER     102841  POR     107028
998812  5299     107028  Main Draw  2    GER     101222  GER     101222
998757  5299     116620  Main Draw  64   GER     101192  ITA     116620
998794  5299     116620  Main Draw  32   GER     115449  AUT     116620
998801  5299     116620  Main Draw  16   GER     101222  GER     101222
What I would like to get is the following output, which is basically the row with the max(id) for each combination of tour_id and p1:
id      tour_id  p1      stage      rnd  assoc1  p2      assoc2  winner
996139  5277     107028  Main Draw  2    GER     121582  TPE     107028
996037  5277     116620  Main Draw  32   GER     121582  TPE     121582
996139  5277     121582  Main Draw  2    TPE     107028  GER     107028
996390  5278     107028  Main Draw  4    GER     123980  JPN     123980
996283  5278     116620  Main Draw  64   GER     121514  KOR     121514
996380  5278     121582  Main Draw  8    TPE     102761  GER     102761
998812  5299     101222  Main Draw  2    GER     107028  GER     101222
998812  5299     107028  Main Draw  2    GER     101222  GER     101222
998801  5299     116620  Main Draw  16   GER     101222  GER     101222
Any help is appreciated.

Generally, I would take a simple query to get the max id for the conditions, and then either use it as a subquery or join, depending on the use case. Take a look at this fiddle:
https://dbfiddle.uk/K1wM0gEK
I've inserted your data and then a series of queries. Here's the first one, just to get the maxID for each combination of tour_id and p1:
select tour_id, p1, max(id) as maxID
from t group by tour_id, p1;
which you can then use in a subquery to retrieve any rows that match those IDs like so:
select * from t
where id in (
select max(id)
from t group by tour_id, p1
);
or as a JOIN:
select t.* from t
join (
select max(id) as maxID
from t group by tour_id, p1
) ids on t.id = ids.maxID;
JOINs are usually more performant than IN for larger data sets, but that is not a hard and fast rule and the line really isn't well defined. I've included it here just for reference.
Now, these queries SHOULD return the same results, but they don't, because the id you're taking the max of isn't unique. For example, id 996037 appears for both p1 = 116620 and p1 = 121582 in tour 5277, so matching on id alone also returns rows that are not the max for their own (tour_id, p1) group. Which answer is right really depends on what you are trying to accomplish. Here's one more option using window functions, which are really overkill for this, but let's look:
select tour_id, p1,
first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
from t
Now this returns a LOT more rows, but a lot of them are duplicates, so let's add DISTINCT to just get the uniques:
select DISTINCT tour_id, p1,
first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
from t
and now we're down to something that looks a little more like what you were after. For comparison, I have included the three queries side by side, ordered by id and with the columns all in the same order:
select DISTINCT
first_value(id) OVER (partition by tour_id, p1 order by id desc) as maxID,
tour_id, p1,
first_value(stage) OVER (partition by tour_id, p1 order by id desc) as stage,
first_value(rnd) OVER (partition by tour_id, p1 order by id desc) as rnd,
first_value(assoc1) OVER (partition by tour_id, p1 order by id desc) as assoc1,
first_value(p2) OVER (partition by tour_id, p1 order by id desc) as p2,
first_value(assoc2) OVER (partition by tour_id, p1 order by id desc) as assoc2,
first_value(winner) OVER (partition by tour_id, p1 order by id desc) as winner
from t order by 1;
select * from t
where id in (
select max(id)
from t group by tour_id, p1
) order by id;
select t.* from t
join (
select max(id) as maxID
from t group by tour_id, p1
) ids on t.id = ids.maxID
order by t.id;
The result set using the window functions seems to have the same output as you're looking for, but let me say that it seems like window functions are overkill for a case this simple, so I'm wondering if you need some unique ID instead. If you don't have a unique primary (autoincrementing) ID in your table(s), you should. It will save you a lot of headache at some point down the road. If you do, I wonder why we aren't using that instead of the non-unique one.
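If the id really is shared between rows, one way to keep the simple max(id) approach is to match on the whole (tour_id, p1, id) key instead of on id alone. Below is a minimal sketch of that idea, run through Python's built-in sqlite3 module on a five-row subset of the question's data. The reduced column list and the names max_id, id_only and full_key are assumptions for illustration only; they are not in the fiddle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, illustrative version of the question's table (subset of columns and rows).
conn.execute("CREATE TABLE t (id INTEGER, tour_id INTEGER, p1 INTEGER, rnd INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (996133, 5277, 107028, 4),
        (996139, 5277, 107028, 2),
        (996037, 5277, 116620, 32),
        (996037, 5277, 121582, 32),  # same id as the row above, different p1
        (996139, 5277, 121582, 2),
    ],
)

# Matching on id alone: a shared id drags in a row that is not its group's max.
id_only = conn.execute("""
    SELECT id, p1 FROM t
    WHERE id IN (SELECT MAX(id) FROM t GROUP BY tour_id, p1)
    ORDER BY p1, id
""").fetchall()

# Matching on the full grouping key plus the max id: one row per group.
full_key = conn.execute("""
    SELECT t.id, t.p1
    FROM t
    JOIN (
        SELECT tour_id, p1, MAX(id) AS max_id
        FROM t
        GROUP BY tour_id, p1
    ) m ON m.tour_id = t.tour_id AND m.p1 = t.p1 AND m.max_id = t.id
    ORDER BY t.p1, t.id
""").fetchall()

print(len(id_only), len(full_key))
```

On this subset the id-only version returns four rows (it includes p1 = 121582 with id 996037), while the full-key join returns the three rows the question asks for.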
Let me know if this helps, or if anything is unclear.

Related

Reset Row_Number Window Function

I have data that looks like this in postgres
My row_num column is created like this:
ROW_NUMBER() OVER (PARTITION BY dept_id, name, status ORDER BY dept_id, name, status) as row_num
dept_id name status row_num
1 227 occupied 1
1 227 occupied 2
1 227 vacant 1
1 227 vacant 2
1 227 occupied 3
1 227 occupied 4
1 227 vacant 3
1 227 vacant 4
This is what I want it to look like:
dept_id name status row_num
1 227 occupied 1
1 227 occupied 2
1 227 vacant 1
1 227 vacant 2
1 227 occupied 1
1 227 occupied 2
1 227 vacant 1
1 227 vacant 2
Any suggestions?
You can use a recursive CTE to produce the result you expect.
Data structure and query result: dbfiddle
with recursive
cte_r as (
select dept_id,
name,
status,
row_number() over (partition by dept_id, name) as rn
from test),
cte as (
select dept_id,
name,
status,
rn,
1 as grp
from cte_r
where rn = 1
union all
select cr.dept_id,
cr.name,
cr.status,
cr.rn,
case
when cr.status = c.status then c.grp + 1
else 1
end
from cte c
join cte_r cr
on cr.dept_id = c.dept_id
and cr.name = c.name
and cr.rn = c.rn + 1)
select dept_id,
name,
status,
grp as row_num
from cte;
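Note that the sample data has no column that fixes the row order, so any numbering that restarts on a status change depends on an order the table does not actually store. As a rough alternative sketch, assuming a hypothetical ordering column seq (not in the original table), the same result can be had without recursion using the "gaps and islands" pattern: the difference between two row numbers is constant within each consecutive run of the same status. The example runs on Python's built-in sqlite3 module and needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq is a hypothetical ordering column; the question's table does not have one.
conn.execute("CREATE TABLE test (seq INTEGER, dept_id INTEGER, name TEXT, status TEXT)")
statuses = ["occupied", "occupied", "vacant", "vacant"] * 2
conn.executemany(
    "INSERT INTO test VALUES (?, 1, '227', ?)",
    list(enumerate(statuses, start=1)),
)

result = conn.execute("""
    SELECT status,
           ROW_NUMBER() OVER (
               PARTITION BY dept_id, name, status, island ORDER BY seq
           ) AS row_num
    FROM (
        SELECT seq, dept_id, name, status,
               -- constant within each consecutive run of one status
               ROW_NUMBER() OVER (PARTITION BY dept_id, name ORDER BY seq)
             - ROW_NUMBER() OVER (PARTITION BY dept_id, name, status ORDER BY seq)
               AS island
        FROM test
    ) AS s
    ORDER BY seq
""").fetchall()

row_nums = [row_num for _, row_num in result]
print(row_nums)
```

The numbering restarts at 1 each time the status changes, which matches the desired output in the question.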

Ranking in PostgreSQL

I have a query that looks like this:
select
restaurant_id,
rank() OVER (PARTITION BY restaurant_id order by churn desc) as rank_churn,
churn,
orders,
rank() OVER (PARTITION BY restaurant_id order by orders desc) as rank_orders
from data
I would expect that this ranking function will order my data and provide a column that has 1,2,3,4 according to the values of the column.
However the outcome is always 1 in the ranking.
restaurant_id rank_churn churn orders rank_orders
2217 1 75 182 1
2249 1 398 896 1
2526 1 11 56 1
2596 1 89 139 1
What am I doing wrong?
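One likely explanation, assuming each restaurant_id appears only once in data (as in the output shown): PARTITION BY restaurant_id makes every restaurant its own partition, and the only row in a partition always has rank 1. A minimal sketch of ranking across all restaurants instead, by dropping the PARTITION BY clause, is below. It uses Python's built-in sqlite3 module (SQLite 3.25 or later) with the four rows from the question; the variable name ranked is only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (restaurant_id INTEGER, churn INTEGER, orders INTEGER)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(2217, 75, 182), (2249, 398, 896), (2526, 11, 56), (2596, 89, 139)],
)

# No PARTITION BY: all rows form a single window, so ranks run 1..4.
ranked = conn.execute("""
    SELECT restaurant_id,
           RANK() OVER (ORDER BY churn DESC)  AS rank_churn,
           RANK() OVER (ORDER BY orders DESC) AS rank_orders
    FROM data
    ORDER BY restaurant_id
""").fetchall()

for row in ranked:
    print(row)
```

If the real table has several rows per restaurant and a ranking within each restaurant is wanted, the PARTITION BY would stay and the ranks would then vary inside each partition.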

Group and Stuff multiple rows based on Count condition

I have a script that runs every 10 minutes and returns table with events from past 24 hours (marked by the script run time)
ID Name TimeOfEvent EventCategory TeamColor
1 Verlene Bucy 2015-01-30 09:10:00.000 1 Blue
2 Geneva Rendon 2015-01-30 09:20:00.000 2 Blue
3 Juliane Hartwig 2015-01-30 09:25:00.000 3 Blue
4 Vina Dutton 2015-01-30 12:55:00.000 2 Red
5 Cristin Lewis 2015-01-30 15:50:00.000 2 Red
6 Reiko Cushman 2015-01-30 17:10:00.000 1 Red
7 Mallie Temme 2015-01-30 18:35:00.000 3 Blue
8 Keshia Seip 2015-01-30 19:55:00.000 2 Blue
9 Rosalia Maher 2015-01-30 20:35:00.000 3 Red
10 Keven Gabel 2015-01-30 21:25:00.000 3 Red
Now I'd like to select two groups of Names based on these conditions:
1) Select Names from same EventCategory having 4 or more records in past 24 hours.
2) Select Names from same EventCategory and same TeamColor having 2 or more records in past 1 hour.
So my result would be:
4+per24h: Geneva Rendon, Vina Dutton, Cristin Lewis, Keshia Seip EventCategory = 2
4+per24h: Juliane Hartwig, Mallie Temme, Rosalia Maher, Keven Gabel EventCategory = 3
2+per1h: Rosalia Maher, Keven Gabel EventCategory = 3, TeamColor = Red
For the first one, I have written this:
SELECT mt.EventCategory, MAX(mt.[name]), MAX(mt.TimeOfEvent), MAX(mt.TeamColor)
FROM #mytable mt
GROUP BY mt.EventCategory
HAVING COUNT(mt.EventCategory) >= 4
because I don't care for the actual time as long as it's in the past 24 hours (and it always is), but I have trouble stuffing the names in one row.
The second part, I have no idea how to do. Because the results need to have both same EventCategory and TeamColor and also be limited by the one hour bracket.
This is possible, but you are mixing two separate issues. Here they are combined with UNION:
Just paste this into an empty query window and execute. Adapt to your needs:
DECLARE @tbl TABLE(ID INT,Name VARCHAR(100),TimeOfEvent DATETIME,EventCategory INT,TeamColor VARCHAR(10));
INSERT INTO @tbl VALUES
(1,'Verlene Bucy','2015-01-30T09:10:00.000',1,'Blue')
,(2,'Geneva Rendon','2015-01-30T09:20:00.000',2,'Blue')
,(3,'Juliane Hartwig','2015-01-30T09:25:00.000',3,'Blue')
,(4,'Vina Dutton','2015-01-30T12:55:00.000',2,'Red')
,(5,'Cristin Lewis','2015-01-30T15:50:00.000',2,'Red')
,(6,'Reiko Cushman','2015-01-30T17:10:00.000',1,'Red')
,(7,'Mallie Temme','2015-01-30T18:35:00.000',3,'Blue')
,(8,'Keshia Seip','2015-01-30T19:55:00.000',2,'Blue')
,(9,'Rosalia Maher','2015-01-30T20:35:00.000',3,'Red')
,(10,'Keven Gabel','2015-01-30T21:25:00.000',3,'Red');
WITH Extended AS
(
SELECT *
,DATEDIFF(MINUTE,'2015-01-30T21:26:00.000',TimeOfEvent) AS MinuteDiff --use GETDATE() here...
,COUNT(*) OVER(PARTITION BY EventCategory) AS CountCategory
FROM @tbl AS tbl
)
)
,Filtered24Hours AS
(
SELECT *
FROM Extended
WHERE CountCategory >=4
)
,Filtered60Mins AS
(
SELECT *
FROM Extended
WHERE MinuteDiff >=-60
AND CountCategory >=2
)
SELECT DISTINCT (SELECT COUNT(*) FROM Filtered24Hours AS x WHERE x.EventCategory=outerSource.EventCategory) AS CountNames
,'per24h' AS TimeIntervall
,STUFF((
SELECT ' ,' + innerSource.Name
FROM Filtered24Hours AS innerSource
WHERE innerSource.EventCategory=outerSource.EventCategory
ORDER BY innerSource.TimeOfEvent
FOR XML PATH('')
),1,2,'') AS Names
,EventCategory
,NULL
FROM Filtered24Hours AS outerSource
UNION
SELECT DISTINCT (SELECT COUNT(*) FROM Filtered60Mins AS x WHERE x.EventCategory=outerSource.EventCategory)
,'per1h'
,STUFF((
SELECT ' ,' + innerSource.Name
FROM Filtered60Mins AS innerSource
WHERE innerSource.EventCategory=outerSource.EventCategory
ORDER BY innerSource.TimeOfEvent
FOR XML PATH('')
),1,2,'')
,EventCategory
,TeamColor
FROM Filtered60Mins AS outerSource
The result
Count Interv Names Category Team
4 per24h Geneva Rendon ,Vina Dutton ,Cristin Lewis ,Keshia Seip 2 NULL
4 per24h Juliane Hartwig ,Mallie Temme ,Rosalia Maher ,Keven Gabel 3 NULL
2 per1h Rosalia Maher ,Keven Gabel 3 Red
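The second condition in the question counts records per EventCategory and TeamColor within the last hour only. A rough sketch of that stricter reading is below: filter to the last hour first, then group by both columns and keep groups with at least two rows. It is not T-SQL; it runs on Python's built-in sqlite3 module, so GROUP_CONCAT stands in for the STUFF / FOR XML PATH idiom, and the table name events, the snake_case column names and the fixed reference time are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER, name TEXT, time_of_event TEXT, "
    "event_category INTEGER, team_color TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Verlene Bucy", "2015-01-30 09:10:00", 1, "Blue"),
        (2, "Geneva Rendon", "2015-01-30 09:20:00", 2, "Blue"),
        (3, "Juliane Hartwig", "2015-01-30 09:25:00", 3, "Blue"),
        (4, "Vina Dutton", "2015-01-30 12:55:00", 2, "Red"),
        (5, "Cristin Lewis", "2015-01-30 15:50:00", 2, "Red"),
        (6, "Reiko Cushman", "2015-01-30 17:10:00", 1, "Red"),
        (7, "Mallie Temme", "2015-01-30 18:35:00", 3, "Blue"),
        (8, "Keshia Seip", "2015-01-30 19:55:00", 2, "Blue"),
        (9, "Rosalia Maher", "2015-01-30 20:35:00", 3, "Red"),
        (10, "Keven Gabel", "2015-01-30 21:25:00", 3, "Red"),
    ],
)

# Fixed "script run time", same instant the answer uses in place of GETDATE().
now = "2015-01-30 21:26:00"

# Restrict to the last hour BEFORE counting, and count per category and team.
last_hour = conn.execute("""
    SELECT event_category, team_color, COUNT(*) AS n,
           GROUP_CONCAT(name, ', ') AS names
    FROM events
    WHERE time_of_event >= datetime(?, '-1 hour')
    GROUP BY event_category, team_color
    HAVING COUNT(*) >= 2
""", (now,)).fetchall()

print(last_hour)
```

On the sample data this yields a single group, EventCategory 3 with TeamColor Red, containing Rosalia Maher and Keven Gabel. SQLite does not guarantee the order of names inside GROUP_CONCAT, so the order within the list should not be relied on.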

select two maximum values per person based on a column partition

Hi if I have the following table:
Person------Score-------Score_type
1 30 A
1 35 A
1 15 B
1 16 B
2 74 A
2 68 A
2 40 B
2 39 B
Where for each person and score type I want to pick out the maximum score to obtain a table like:
Person------Score-------Score_type
1 35 A
1 16 B
2 74 A
2 40 B
I can do this using multiple select statements, but this will be cumbersome, especially later on. So I was wondering if there is a function which can help me do this. I have used the partition function before, but only to label sequences in a table.
select person,
score_type,
max(score) as score
from scores
group by person, score_type
order by person, score_type;
With "partition function" I guess you mean window functions. They can indeed be used for this as well:
select person,
score_type,
score
from (
select person,
score_type,
score,
row_number() over (partition by person, score_type order by score desc) as rn
from scores
) t
where rn = 1
order by person, score_type;
Using the max() aggregate function along with the grouping by person and score_type should do the trick.
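As a quick check that the two approaches agree, here is a small sketch that loads the sample table from the question and runs both the GROUP BY query and the row_number() query. It uses Python's built-in sqlite3 module (SQLite 3.25 or later for the window function); the variable names grouped and windowed are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (person INTEGER, score INTEGER, score_type TEXT)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        (1, 30, "A"), (1, 35, "A"), (1, 15, "B"), (1, 16, "B"),
        (2, 74, "A"), (2, 68, "A"), (2, 40, "B"), (2, 39, "B"),
    ],
)

# Plain aggregate: one row per (person, score_type) with the highest score.
grouped = conn.execute("""
    SELECT person, score_type, MAX(score) AS score
    FROM scores
    GROUP BY person, score_type
    ORDER BY person, score_type
""").fetchall()

# Window function: number rows per group by descending score, keep the first.
windowed = conn.execute("""
    SELECT person, score_type, score
    FROM (
        SELECT person, score_type, score,
               ROW_NUMBER() OVER (
                   PARTITION BY person, score_type ORDER BY score DESC
               ) AS rn
        FROM scores
    ) AS ranked
    WHERE rn = 1
    ORDER BY person, score_type
""").fetchall()

print(grouped == windowed)
```

The window version only pays off when other columns from the winning row are needed as well; for the maximum alone, GROUP BY is the simpler choice.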

Extract Unique Time Slices in Oracle

I use Oracle 10g and I have a table that stores a snapshot of data on a person for a given day. Every night an outside process adds new rows to the table for any person who has had any changes to their core data (stored elsewhere). This allows a query to be written using a date to find out what a person 'looked' like on some past day. A new row is added to the table even if only a single aspect of the person has changed, the implication being that many columns have duplicate values from slice to slice since not every detail changed in each snapshot.
Below is a data sample:
SliceID PersonID StartDt Detail1 Detail2 Detail3 Detail4 ...
1 101 08/20/09 Red Vanilla N 23
2 101 08/31/09 Orange Chocolate N 23
3 101 09/15/09 Yellow Chocolate Y 24
4 101 09/16/09 Green Chocolate N 24
5 102 01/10/09 Blue Lemon N 36
6 102 01/11/09 Indigo Lemon N 36
7 102 02/02/09 Violet Lemon Y 36
8 103 07/07/09 Red Orange N 12
9 104 01/31/09 Orange Orange N 12
10 104 10/20/09 Yellow Orange N 13
I need to write a query that pulls out time slices records where some pertinent bits, not the whole record, have changed. So, referring to the above, if I only want to know the slices in which Detail3 has changed from its previous value, then I would expect to only get rows having SliceID 1, 3 and 4 for PersonID 101 and SliceID 5 and 7 for PersonID 102 and SliceID 8 for PersonID 103 and SliceID 9 for PersonID 104.
I'm thinking I should be able to use some sort of Oracle Hierarchical Query (using CONNECT BY [PRIOR]) to get what I want, but I have not figured out how to write it yet. Perhaps YOU can help.
Thank you for your time and consideration.
Here is my take on the LAG() solution, which is basically the same as that of egorius, but I show my workings ;)
SQL> select * from
(
select sliceid
, personid
, startdt
, detail3 as new_detail3
, lag(detail3) over (partition by personid
order by startdt) prev_detail3
from some_table
)
where prev_detail3 is null
or ( prev_detail3 != new_detail3 )
/
SLICEID PERSONID STARTDT N P
---------- ---------- --------- - -
1 101 20-AUG-09 N
3 101 15-SEP-09 Y N
4 101 16-SEP-09 N Y
5 102 10-JAN-09 N
7 102 02-FEB-09 Y N
8 103 07-JUL-09 N
9 104 31-JAN-09 N
7 rows selected.
SQL>
The point about this solution is that it hauls in results for 103 and 104, who don't have slice records where detail3 has changed. If that is a problem we can apply an additional filtration, to return only rows with changes:
SQL> with subq as (
select t.*
, row_number () over (partition by personid
order by sliceid ) rn
from
(
select sliceid
, personid
, startdt
, detail3 as new_detail3
, lag(detail3) over (partition by personid
order by startdt) prev_detail3
from some_table
) t
where t.prev_detail3 is null
or ( t.prev_detail3 != t.new_detail3 )
)
select sliceid
, personid
, startdt
, new_detail3
, prev_detail3
from subq sq
where exists ( select null from subq x
where x.personid = sq.personid
and x.rn > 1 )
order by sliceid
/
SLICEID PERSONID STARTDT N P
---------- ---------- --------- - -
1 101 20-AUG-09 N
3 101 15-SEP-09 Y N
4 101 16-SEP-09 N Y
5 102 10-JAN-09 N
7 102 02-FEB-09 Y N
SQL>
edit
As egorius points out in the comments, the OP does want hits for all users, even if they haven't changed, so the first version of the query is the correct solution.
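For anyone who wants to try the LAG() approach without an Oracle instance, here is a small sketch of the first query against the sample data from the question. It runs on Python's built-in sqlite3 module (SQLite 3.25 or later), not Oracle; the table name slices and the ISO-formatted dates are assumptions made so that ordering by startdt works on text values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE slices (sliceid INTEGER, personid INTEGER, startdt TEXT, detail3 TEXT)"
)
conn.executemany(
    "INSERT INTO slices VALUES (?, ?, ?, ?)",
    [
        (1, 101, "2009-08-20", "N"),
        (2, 101, "2009-08-31", "N"),
        (3, 101, "2009-09-15", "Y"),
        (4, 101, "2009-09-16", "N"),
        (5, 102, "2009-01-10", "N"),
        (6, 102, "2009-01-11", "N"),
        (7, 102, "2009-02-02", "Y"),
        (8, 103, "2009-07-07", "N"),
        (9, 104, "2009-01-31", "N"),
        (10, 104, "2009-10-20", "N"),
    ],
)

# Keep a slice when it is the person's first one (no previous value)
# or when detail3 differs from the previous slice's detail3.
changed = conn.execute("""
    SELECT sliceid
    FROM (
        SELECT sliceid, detail3,
               LAG(detail3) OVER (
                   PARTITION BY personid ORDER BY startdt
               ) AS prev_detail3
        FROM slices
    ) AS s
    WHERE prev_detail3 IS NULL OR prev_detail3 <> detail3
    ORDER BY sliceid
""").fetchall()

slice_ids = [sliceid for (sliceid,) in changed]
print(slice_ids)
```

The result is slices 1, 3, 4, 5, 7, 8 and 9, which is the set the question asks for.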
In addition to OMG Ponies' answer: if you need to query slices for all persons, you'll need partition by:
SELECT s.sliceid
, s.personid
FROM (SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (
PARTITION BY t.personid ORDER BY t.startdt
) prev_val
FROM t) s
WHERE (s.prev_val IS NULL OR s.prev_val != s.detail3)
I think you'll have better luck with the LAG function:
SELECT s.sliceid
FROM (SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (PARTITION BY t.personid ORDER BY t.startdt) prev_val
FROM some_table t) s
WHERE s.personid = 101
AND (s.prev_val IS NULL OR s.prev_val != s.detail3)
Subquery Factoring alternative:
WITH slices AS (
SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (PARTITION BY t.personid ORDER BY t.startdt) prev_val
FROM some_table t)
SELECT s.sliceid
FROM slices s
WHERE s.personid = 101
AND (s.prev_val IS NULL OR s.prev_val != s.detail3)