select two maximum values per person based on a column partition - db2

Hi, if I have the following table:
Person    Score    Score_type
1         30       A
1         35       A
1         15       B
1         16       B
2         74       A
2         68       A
2         40       B
2         39       B
For each person and score type, I want to pick out the maximum score to obtain a table like:
Person    Score    Score_type
1         35       A
1         16       B
2         74       A
2         40       B
I can do this using multiple select statements, but that would be cumbersome, especially later on. So I was wondering if there is a function which can help me do this. I have used the partition function before, but only to label sequences in a table.

select person,
       score_type,
       max(score) as score
from scores
group by person, score_type
order by person, score_type;
With "partition function" I guess you mean window functions. They can indeed be used for this as well:
select person,
       score_type,
       score
from (
    select person,
           score_type,
           score,
           row_number() over (partition by person, score_type order by score desc) as rn
    from scores
) t
where rn = 1
order by person, score_type;

Using the max() aggregate function along with the grouping by person and score_type should do the trick.
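A minimal sketch of that approach, assuming the same scores table and column names as in the question:
select person, score_type, max(score) as score
from scores
group by person, score_type
order by person, score_type;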

Related

Taking N-samples from each group in PostgreSQL

I have a table containing data that has a column named id that looks like below:
id     value 1    value 2    value 3
1      244        550        1000
1      251        551        700
1      540        60         1200
...    ...        ...        ...
2      19         744        2000
2      10         903        100
2      44         231        600
2      120        910        1100
...    ...        ...        ...
I want to take 50 sample rows per id, but if fewer than 50 rows exist for a group, simply take that group's entire set of data points.
For example, I would like a maximum of 50 data points randomly selected from id = 1, id = 2, etc.
I cannot find any previous questions similar to this, but I have taken a stab at logically working through a solution where I iterate over the ids, union all the per-id queries, and limit each to 50:
SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;
But it's obvious that this type of solution won't work, because UNION ALL requires combining the output of one id with the next, and I do not have a list of id values to use in place of X in tbl.id = X.
Is there a way to accomplish this by gathering that list of unique id values and unioning all the results, or is there a more optimal way to do this?
If you want to select a random sample for each id, then you need to randomize the rows somehow. Here is a way to do it:
select *
from (
    select *, row_number() over (partition by id order by random()) as u
    from schema.table
) as a
where u <= 50;
Example (limiting to 3, and including a row number within each id so you can see the selection randomness):
Setup
DROP TABLE IF EXISTS foo;
CREATE TABLE foo
(
id int,
value1 int,
idrow int
);
INSERT INTO foo
select 1 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 2 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 3 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow;
Selection
select *
from (
    select *, row_number() over (partition by id order by random()) as u
    from foo
) as a
where u <= 3;
Output:
id    value1    idrow    u
1     542       6        1
1     24        86       2
1     155       74       3
2     505       95       1
2     100       46       2
2     422       33       3
3     966       88       1
3     747       89       2
3     664       19       3
In case you are looking to get 50 (or fewer) rows from each group of IDs, you can use windowing -
From question - "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."
Query -
with data as (
    select row_number() over (partition by id order by random()) as rn,
           *
    from table_name
)
select * from data where rn <= 50 order by id;
Fiddle.
Your description of trying to get the UNION ALL without specifying all the branches ahead of time is aiming for a LATERAL join. And that is one way to solve the problem. But unless you have a table of all distinct ids, you would have to compute one on the fly. For example (using the same fiddle as Pankaj used):
with uniq as (select distinct id from test)
select foo.*
from uniq
cross join lateral (
    select * from test where test.id = uniq.id order by random() limit 3
) foo
This could be either slower or faster than the Window Function method, depending on your system and your data and your indexes. In my hands, it was quite a bit faster even with the need to dynamically compute the list of distinct ids.
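Adapted to the original question (a sketch only; it assumes the placeholder schema.table and the limit of 50 from the question):
with uniq as (select distinct id from schema.table)
select foo.*
from uniq
cross join lateral (
    -- schema.table and tbl are the placeholders used in the question
    select *
    from schema.table as tbl
    where tbl.id = uniq.id
    order by random()
    limit 50
) foo;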

PostgreSQL select statement to return rows after where condition

I am working on a query to return the next 7 days' worth of data every time an event happens, indicated by "where event = 1". The goal is then to group all the data by the user id and perform aggregate functions on the data after the event happens; the event is encoded as binary [0, 1].
So far, I have been attempting to use nested select statements to structure the data how I would like to have it, but the window functions are starting to restrict me. I am now thinking a self join could be more appropriate, but I need help constructing such a query.
The query currently first creates daily aggregate values grouped by user and date (3rd level nested select). Then, the 2nd level sums "value_x" to obtain an aggregate value grouped by the user. Then, the 1st level nested select statement uses the lead function to grab the next row's value, partitioned by each user, which acts as selecting the next day's value when event = 1. Lastly, the select statement uses an aggregate function to calculate the average "sum_next_day_value_after_event" grouped by user and where event = 1. Put together, where event = 1, the query returns the avg(value_x) of the next row's total value_x.
However, this doesn't follow my time rule: where event = 1, return the next 7 days' worth of data after the event happens, and if there are not 7 days' worth of data, return whatever data falls within <= 7 days. Yes, I currently only have one lead with an offset of 1, but you could add 6 more of these functions to grab the next 6 rows. However, the lead function currently just grabs the next row without regard to date, so theoretically the next row's "value_x" could actually be 15 days after "event = 1". Also, as can be seen in the data table below, a user may have more than one row per day.
Here is the following query I have so far:
select
    f.user_id,
    avg(f.sum_next_day_value_after_event) as sum_next_day_values
from (
    select
        bld.user_id,
        lead(bld.sum_daily_value_x, 1) over (partition by bld.user_id order by bld.daily) as sum_next_day_value_after_event
    from (
        select
            l.user_id,
            l.daily,
            sum(l.value_x) as sum_daily_value_x
        from (
            select
                user_id, value_x, date_part('day', day_ts) as daily
            from table_1
            group by date_part('day', day_ts), user_id, value_x) l
        group by l.user_id, l.daily
        order by l.user_id) bld) f
group by f.user_id
Below is a snippet of the data from table_1:
user_id    day_ts           value_x    event
50         4/2/21 07:37     25         0
50         4/2/21 07:42     45         0
50         4/2/21 09:14     67         1
50         4/5/21 10:09     8          0
50         4/5/21 10:24     75         0
50         4/8/21 11:08     34         0
50         4/15/21 13:09    32         1
50         4/16/21 14:23    12         0
50         4/29/21 14:34    90         0
55         4/4/21 15:31     12         0
55         4/5/21 15:23     34         0
55         4/17/21 18:58    32         1
55         4/17/21 19:00    66         1
55         4/18/21 19:57    54         0
55         4/23/21 20:02    34         0
55         4/29/21 20:39    57         0
55         4/30/21 21:46    43         0
Technical details:
PostgreSQL, supported by EDB, version = 14.1
pgAdmin4, version 5.7
Thanks for the help!
"The query currently first creates daily aggregate values"
I don't see any aggregate function in your first query, so the GROUP BY clause is useless.
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
group by date_part('day', day_ts), user_id, value_x
could be simplified as
select
user_id, value_x, date_part('day', day_ts) as daily
from table_1
which in turn provides no real added value, so this first query could be removed and the second query would become:
select user_id
, date_part('day', day_ts) as daily
, sum(value_x) as sum_daily_value_x
from table_1
group by user_id, date_part('day', day_ts)
The order by user_id clause can also be removed at this step.
Now, if you want to calculate the average value of sum_daily_value_x in the period of 7 days after the event (I'm referring to the avg() function in your top query), you can use avg() as a window function restricted to the period of 7 days after the event:
select f.user_id
     , avg(f.sum_daily_value_x) over (order by f.daily range between current row and '7 days' following) as sum_next_day_values
from (
    select user_id
         , date_part('day', day_ts) as daily
         , sum(value_x) as sum_daily_value_x
    from table_1
    group by user_id, date_part('day', day_ts)
) as f
group by f.user_id
A partition by f.user_id clause in the window function would be useless because the rows have already been grouped by f.user_id before the window function is applied.
You can replace the avg() window function with any other one, for instance sum(), which would better fit the alias sum_next_day_values.
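For illustration only, a self-contained sketch of sum() used as a window function over a 7-day range; the daily_totals table, its columns, and the use of a real date column are assumptions made for this example, not the asker's schema:
-- Hypothetical example: a pre-aggregated per-user, per-day table with a real date column,
-- so the 7-day RANGE frame has a date value to work against.
create temp table daily_totals (user_id int, day date, total numeric);

select user_id,
       day,
       sum(total) over (partition by user_id
                        order by day
                        range between current row and interval '7 days' following) as sum_next_7_days
from daily_totals;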

PostgreSQL difference between rows

My data:
id value
1 10
1 20
1 60
2 10
3 10
3 30
How to compute column 'change'?
id value change | my comment, how to compute
1 10 10 | 20-10
1 20 40 | 60-20
1 60 40 | default_value-60. In this example default_value=100
2 10 90 | default_value-10
3 10 20 | 30-10
3 30 70 | default_value-30
In other words: if the row is the last one for its id, compute 100 - value;
otherwise compute next_value - current_value.
You can access the value of the "next" (or "previous") row using a window function. The concept of a "next" row only makes sense if you have a column to define an order on the rows. You said you have a date column on which you can order the result. I used the column name your_date_column for this. You need to replace that with the actual column name of course.
select id,
       value,
       lead(value, 1, 100) over (partition by id order by your_date_column) - value as change
from the_table
order by id, your_date_column
lead(value, 1, 100) says: take the column value of the "next" row (that's the 1). If there is no such row, use the default value 100 instead.
Join on a subquery and use ROW_NUMBER to find the last value per group
WITH CTE AS (
    SELECT id, value,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) AS rn,
           LEAD(value) OVER (PARTITION BY id ORDER BY date) - value AS change
    FROM t
)
SELECT cte.id, cte.value,
       CASE WHEN cte.change IS NULL THEN 100 - cte.value ELSE cte.change END AS change
FROM cte
LEFT JOIN (SELECT id, MAX(rn) AS mrn FROM cte GROUP BY id) AS x
       ON x.mrn = cte.rn AND cte.id = x.id
FIDDLE

T-SQL: How to use MIN

I have the following simple table, called TableA
ID SomeVal
1 10
1 20
1 30
2 40
2 50
3 60
I want to select only those rows where SomeVal is the smallest value for the same ID value. So my results should look like this:
1 10
2 40
3 60
I think I need Group By in my SQL but am not sure how. Thanks for your help.
SELECT ID, MIN(SomeVal)
FROM [TableName]
GROUP BY ID
Group by will perform the aggregate function (MIN) for every unique value that is grouped by, and return the result.
I think this will do what you need:
Select ID, Min(SomeVal)
From MyTable
Group By ID

Insert rownumber repeatedly in records in t-sql

I want to insert a row number into the records, counting rows within a specific repeating range. Example output:
RowNumber ID Name
1 20 a
2 21 b
3 22 c
1 23 d
2 24 e
3 25 f
1 26 g
2 27 h
3 28 i
1 29 j
2 30 k
I would rather use row_number() over (partition by ... order by column name), but my real records do not contain columns that would produce the 1-3 row count.
I already tried looping over each record to insert a row count of 1-3, but the loop hurts the performance of the query. The query will be used for an RDL report, which is why the query's performance needs to be as good as possible.
Any suggestions are welcome. Thanks.
Have you tried modulo-ing row_number()?
SELECT
    ((ROW_NUMBER() OVER (ORDER BY ID) - 1) % 3) + 1 AS RowNumber
FROM table
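Applied to the sample output above (a sketch; yourtable is a placeholder for the real table containing the ID and Name columns), the other columns can simply be selected alongside the computed row number:
-- Sketch: yourtable is a placeholder; ID and Name are the columns from the example output.
SELECT
    ((ROW_NUMBER() OVER (ORDER BY ID) - 1) % 3) + 1 AS RowNumber,
    ID,
    Name
FROM yourtable
ORDER BY ID;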