Taking N-samples from each group in PostgreSQL - postgresql

I have a table containing data that has a column named id that looks like below:
id
value 1
value 2
value 3
1
244
550
1000
1
251
551
700
1
540
60
1200
...
...
...
...
2
19
744
2000
2
10
903
100
2
44
231
600
2
120
910
1100
...
...
...
...
I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points.
For example I would like a maximum 50 data points randomly selected from id = 1, id = 2 etc...
I cannot find any previous questions similar to this but have tried taking a stab at at least logically working through the solution where I could iterate and union all queries by id and limit to 50:
SELECT * FROM (SELECT * FROM schema.table AS tbl WHERE tbl.id = X LIMIT 50) UNION ALL;
But it's obvious that you cannot use this type of solution because UNION ALL requires aggregating outputs from one id to the next and I do not have a list of id values to use in place of X in tbl.id = X.
Is there a way to accomplish this by gathering that list of unique id values and union all results or is there a more optimal way this could be done?

If you want to select a random sample for each id, then you need to randomize the rows somehow. Here is a way to do it:
select * from (
select *, row_number() over (partition by id order by random()) as u
from schema.table
) as a
where u <= 50;
Example (limiting to 3, and some row number for each id so you can see the selection randomness):
setup
DROP TABLE IF EXISTS foo;
CREATE TABLE foo
(
id int,
value1 int,
idrow int
);
INSERT INTO foo
select 1 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 2 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow
union all
select 3 as id, (1000*random())::int as value1, generate_series(1, 100) as idrow;
Selection
select * from (
select *, row_number() over (partition by id order by random()) as u
from foo
) as a
where u <= 3;
Output:
id
value1
idrow
u
1
542
6
1
1
24
86
2
1
155
74
3
2
505
95
1
2
100
46
2
2
422
33
3
3
966
88
1
3
747
89
2
3
664
19
3

In case you are looking to get 50 (or less) from each group of IDs then you can use windowing -
From question - "I want to take 50 sample rows per id that exists but if less than 50 exist for the group to simply take the entire set of data points."
Query -
with data as (
select row_number() over (partition by id order by random()) rn,
* from table_name)
select * from data where rn<=50 order by id;
Fiddle.

Your description of trying to get the UNION ALL without specifying all the branches ahead of time is aiming for a LATERAL join. And that is one way to solve the problem. But unless you have a table of all distinct ids, you would have to compute one on the fly. For example (using the same fiddle as Pankaj used):
with uniq as (select distinct id from test)
select foo.* from uniq cross join lateral
(select * from test where test.id=uniq.id order by random() limit 3) foo
This could be either slower or faster than the Window Function method, depending on your system and your data and your indexes. In my hands, it was quite a bit faster even with the need to dynamically compute the list of distinct ids.

Related

PostgreSQL non-overlapping ranges

I use PostgreSQL database and have a cards table.
Each record(card) in this table have card_drop_rate integer value.
For example:
id | card_name |card_drop_rate
-------------------------------
1 |card1 |34
2 |card2 |16
3 |card3 |54
max drop rate is 34 + 16 + 54 = 104.
In accordance to my application logic I need to find a random value between 0 and 104 and then retrieve card according to this number, for example:
random value: 71
card1 range: 0 - 34(0 + 34)
card2 range: 34 - 50(34 + 16)
card3 range: 50 - 104(50 + 54)
So, my card is card3 because 71 is placed in the range 50 - 104
What is the proper way to reflect this structure in PostgreSQL ? I'll need to query this data often under so the performance is a criterion number one for this solution.
Following query works fine:
SELECT
b.id,
b.card_drop_rate
FROM (SELECT a.id, sum(a.card_drop_rate) OVER(ORDER BY id) - a.card_drop_rate as rate, card_drop_rate FROM cards as a) b
WHERE b.rate < 299 ORDER BY id DESC LIMIT 1
You can do this using cumulative sums and random. The "+ 1"s might be throwing me off, but it is something like this:
with c as (
select c.*,
sum(card_drop_rate + 1) - card_drop_rate as threshhold
from cards c
),
r as (
select random() * (sum(card_drop_rate) + count(*) - 1) as which_card
from cards c
)
select c.*
from c cross join
r
where which_card >= threshhold
order by threshhold
limit 1;
For performance, I would simply take the cards and generate a new table with 106 slots. Assign the card value to the slots and build an index on the slot number. Then get a value using:
select s.*
from slots s
where s.slotid = floor(random() * 107);

Postgresql difference between rows

My data:
id value
1 10
1 20
1 60
2 10
3 10
3 30
How to compute column 'change'?
id value change | my comment, how to compute
1 10 10 | 20-10
1 20 40 | 60-20
1 60 40 | default_value-60. In this example default_value=100
2 10 90 | default_value-10
3 10 20 | 30-10
3 30 70 | default_value-30
In other words: if row of id is last, then compute 100-value,
else compute next_value-value_now
You can access the value of the "next" (or "previous") row using a window function. The concept of a "next" row only makes sense if you have a column to define an order on the rows. You said you have a date column on which you can order the result. I used the column name your_date_column for this. You need to replace that with the actual column name of course.
select id,
value,
lead(value, 1, 100) over (partition by id order by your_date_column) - value as change
from the_table
order by id, your_date_column
lead(value, 1, 100) says: take the column value of the "next" row (that's the 1). If there is no such row, use the default value 100 instead.
Join on a subquery and use ROW_NUMBER to find the last value per group
WITH CTE AS(
SELECT id,value,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY date) rn,
(LEAD(value) OVER (PARTITION BY id ORDER BY date)-value) change FROM t)
SELECT cte.id,cte.value,
(CASE WHEN cte.change IS NULL THEN 100-cte.value ELSE cte.change END)as change FROM cte LEFT JOIN
(SELECT id,MAX(rn) mrn FROM cte
GROUP BY id) as x
ON x.mrn=cte.rn AND cte.id=x.id
FIDDLE

select two maximum values per person based on a column partition

Hi if I have the following table:
Person------Score-------Score_type
1 30 A
1 35 A
1 15 B
1 16 B
2 74 A
2 68 A
2 40 B
2 39 B
Where for each person and score type I want to pick out the maximum score to obtain a table like:
Person------Score-------Score_type
1 35 A
1 16 B
2 74 A
2 40 B
I can do this using multiple select statements, but this will be cumbersome, especially later on. so I was wondering if there is a function which can help me do this. I have used the parititon function before but only to label sequences in a table....
select person,
score_type,
max(score) as score
from scores
group by person, score_type
order by person, score_type;
With "partition function" I guess you mean window functions. They can indeed be used for this as well:
select person
score_type,
score
from (
select person,
score_type,
score,
row_number() over (partition by person, score_type order by score desc) as rn
from scores
) t
where rn = 1
order by person, score_type;
Using the max() aggregate function along with the grouping by person and score_type should do the trick.

Retrieve information dynamically from multiple CTE

I have multiple CTEs and I want to retrieve some information from a couple of them into next CTE.
So, I have this information from one of the CTEs:
PeriodID StarDate
1 2006-01-01
2 2007-04-25
3 2008-08-16
4 2009-12-08
5 2011-04-017
and this from other:
RecordID Date
100 2007-04-15
101 2008-05-21
102 2008-06-06
103 2008-07-01
104 2009-11-12
And I need to show in next one:
RecordID Date PeriodID
100 2007-04-15 1
101 2008-05-21 2
102 2008-06-06 2
103 2008-07-01 2
104 2009-11-12 3
I can use some case/when statement to define if date of record is in period 1,2,3,4 or 5 but it some situation I can have different numbers of periods return from the first CTE.
Is there a way to do this in the above context?
You can have multiple CTEs defined as follows, and then select from and join them as you would any other table.
with cte1 as (select * ...),
cte2 as (select * ...)
select
cte2.*,
periodid
from cte2
cross apply
(select top 1 * from cte1 where cte2.recorddate> cte1.startdate order by startdate desc) v

T-SQL A problem with SELECT TOP (case [...])

I have query like that:
(as You see I'd like to retrieve 50% of total rows or first 100 rows etc)
//#AllRowsSelectType is INT
SELECT TOP (
case #AllRowsSelectType
when 1 then 100 PERCENT
when 2 then 50 PERCENT
when 3 then 25 PERCENT
when 4 then 33 PERCENT
when 5 then 50
when 6 then 100
when 7 then 200
end
) ROW_NUMBER() OVER(ORDER BY [id]) AS row_num, a,b,c etc
why have I the error : "Incorrect syntax near the keyword 'PERCENT'." on line "when 1 [...]"
The syntax for TOP is:
TOP (expression) [PERCENT]
[ WITH TIES ]
The reserved keyword PERCENT cannot be included in the expression. Instead you can run two different queries: one for when you want PERCENT and another for when you don't.
If you need this to be one query you can run both queries and use UNION ALL to combine the results:
SELECT TOP (
CASE #AllRowsSelectType
WHEN 1 THEN 100
WHEN 2 THEN 50
WHEN 3 THEN 25
WHEN 4 THEN 33
ELSE 0
END) PERCENT
ROW_NUMBER() OVER(ORDER BY [id]) AS row_num, a, b, c, ...
UNION ALL
SELECT TOP (
CASE #AllRowsSelectType
WHEN 5 THEN 50
WHEN 6 THEN 100
WHEN 7 THEN 200
ELSE 0
END)
ROW_NUMBER() OVER(ORDER BY [id]) AS row_num, a, b, c, ...
You're also mixing two different types of use. The other is.
DECLARE #ROW_LIMT int
IF #AllRowsSelectType < 5
SELECT #ROW_LIMIT = COUNT(*)/#AllRowsSelectType FROM myTable -- 100%, 50%, 33%, 25%
ELSE
SELECT #ROW_LIMIT = 50 * POWER(2, #AllRowsSelectType - 5) -- 50, 100, 200...
WITH OrderedMyTable
(
select *, ROW_NUMBER() OVER (ORDER BY id) as rowNum
FROM myTable
)
SELECT * FROM OrderedMyTable
WHERE rowNum <= #ROW_LIMIT
You could do:
select top (CASE #FilterType WHEN 2 THEN 50 WHEN 3 THEN 25 WHEN 4 THEN 33 ELSE 100 END) percent * from
(select top (CASE #FilterType WHEN 5 THEN 50 WHEN 6 THEN 100 WHEN 7 THEN 200 ELSE 2147483647 END) * from
<your query here>
) t
) t
Which may be easier to read.