Top N values in window frame - PostgreSQL

I have a table t with 3 fields of interest:
d (date), pid (int), and score (numeric)
I am trying to calculate a 4th field that is an average of each player's top N (3 or 5) scores for the days before the current row.
I tried the following join on a subquery but it is not producing the results I'm looking for:
SELECT t.d, t.pid, t.score, sq.highscores
FROM t,
     (SELECT *, avg(score) AS highscores
      FROM (SELECT *, row_number() OVER w AS rnum
            FROM t AS t2
            WINDOW w AS (PARTITION BY pid ORDER BY score DESC
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) isq
      WHERE rnum <= 3) sq
WHERE t.d = sq.d AND t.pid = sq.pid
Any suggestions would be greatly appreciated! I'm a hobbyist programmer, and this is a more complex query than I'm used to.

You can't SELECT * and avg(score) in the same (inner) query: which non-aggregated values should be selected for each average? PostgreSQL won't decide this for you.
Because you PARTITION BY pid in the innermost query, you should GROUP BY pid in the aggregating subquery. That way, you can SELECT pid, avg(score) AS highscores:
SELECT pid, avg(score) AS highscores
FROM (SELECT *, row_number() OVER w AS rnum
      FROM t AS t2
      WINDOW w AS (PARTITION BY pid ORDER BY score DESC)) isq
WHERE rnum <= 3
GROUP BY pid
Note: ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING makes no difference for row_number(); ranking functions ignore the window frame.
But if the top N part is fixed (and N is small in your real-world use case too), you can solve this without so many subqueries, using the nth_value() window function:
SELECT d, pid, score,
       (coalesce(nth_value(score, 1) OVER w, 0) +
        coalesce(nth_value(score, 2) OVER w, 0) +
        coalesce(nth_value(score, 3) OVER w, 0)) /
       ((nth_value(score, 1) OVER w IS NOT NULL)::int +
        (nth_value(score, 2) OVER w IS NOT NULL)::int +
        (nth_value(score, 3) OVER w IS NOT NULL)::int) AS highscores
FROM t
WINDOW w AS (PARTITION BY pid ORDER BY score DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
http://rextester.com/GUUPO5148
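Note that the question also asks for the top N scores from the days before the current row, while the nth_value() form above ranks over all of a player's rows. A per-row sketch of that variant with a LATERAL subquery (assuming the same table and column names as in the question):
SELECT t.d, t.pid, t.score, sq.highscores
FROM t
CROSS JOIN LATERAL (
    -- average of this player's 3 best scores on strictly earlier days;
    -- NULL when there are no earlier rows
    SELECT avg(topn.score) AS highscores
    FROM (SELECT t2.score
          FROM t AS t2
          WHERE t2.pid = t.pid
            AND t2.d < t.d
          ORDER BY t2.score DESC
          LIMIT 3) AS topn
) AS sq;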

PostgreSQL many scalar functions on same unnest

I have this query:
select distinct * ,
(select count(A) filter(where A < 360) from unnest(cust_journey_time_series) as A) as count_journey_time_series
, (select avg(A) filter(where A < 360) from unnest(cust_journey_time_series) as A) as avg_journey_time_series
, (select sum(A) filter(where A < 360) from unnest(cust_journey_time_series) as A) as order_journey_time
, (select sum(A) ..
, (select count(distinct A) ..
from (
select a,b,c, max(s) s, max(ev) ev, max(ord) ord,
array_agg(cust_journey_time_seconds order by ev asc) as cust_journey_time_series, min(mi) mi,
array_agg(col order by ev asc) as cus,
min(ra) ra, max(de) de, max(to) to,
max(orde) orde, max(cam) cam,
max(pag) pag,
count(collec) FILTER (WHERE collec <> las ) AS pag_count,
max (fi) as ad,
array_agg(fi) as dd2c,
array_agg(ut) as ut_places,
array_agg(ct) as ct_places,
array_agg(ut) as utaces
from temp_j
group by a,b,c
order by ev
)aaa
My intuitive code would be:
select distinct *,
    (select count(A), sum(A), avg(A) filter (where A < 360)
     from unnest(cust_journey_time_series) as A
    ) as count_journey_time_series, sum_journey_time_series, avg_journey_time_series
but I get this error:
subquery must return only one column
Is there a way to optimize the query, or does PostgreSQL do it under the hood?
I'll try to translate the fragment you provided into something better:
select distinct tab.*,
       count(cjts.A) filter (where cjts.A < 360) as count_journey_time_series,
       avg(cjts.A) filter (where cjts.A < 360) as avg_journey_time_series,
       sum(cjts.A) filter (where cjts.A < 360) as order_journey_time,
       ...
from (select ...) as tab
cross join lateral unnest(tab.cust_journey_time_series) as cjts
group by tab.pkey;
That way, the unnest() call has to be evaluated only once.
The error message of your modified query comes from:
SELECT
(SELECT count(A), sum(A), avg(A) FROM ...)
...
That is not allowed. You can only use a single result column in a subquery in the SELECT list. Writing the query my way avoids the problem.
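A minimal, self-contained sketch of the pattern (hypothetical table and data; the 360 threshold is from the question):
-- one LATERAL unnest, several aggregates over it
CREATE TEMP TABLE demo (pkey int PRIMARY KEY, vals int[]);
INSERT INTO demo VALUES (1, ARRAY[10, 400, 20]), (2, ARRAY[500, 30]);

SELECT d.pkey,
       count(v.a) FILTER (WHERE v.a < 360) AS cnt,      -- 2 and 1
       sum(v.a)   FILTER (WHERE v.a < 360) AS sum_vals, -- 30 and 30
       avg(v.a)   FILTER (WHERE v.a < 360) AS avg_vals  -- 15 and 30
FROM demo AS d
CROSS JOIN LATERAL unnest(d.vals) AS v(a)
GROUP BY d.pkey; -- grouping by the key keeps one output row per input row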

Open, high, low, close aggregation in BigQuery

Based on the BigQuery best practice of using ARRAY_AGG() to obtain the latest record, here's how I'm getting the first, last, minimum, and maximum values of a field for a day. The data is reported approximately hourly.
WITH t AS (
SELECT TIMESTAMP('2021-01-01 01:00:00') as l, 10 as v
UNION ALL
SELECT TIMESTAMP('2021-01-01 02:00:00') as l, 12 as v
UNION ALL
SELECT TIMESTAMP('2021-01-01 03:00:00') as l, 15 as v
UNION ALL
SELECT TIMESTAMP('2021-01-01 04:00:00') as l, 2 as v
UNION ALL
SELECT TIMESTAMP('2021-01-02 01:00:00') as l, 600 as v
UNION ALL
SELECT TIMESTAMP('2021-01-02 02:00:00') as l, 120 as v
UNION ALL
SELECT TIMESTAMP('2021-01-02 03:00:00') as l, 150 as v
UNION ALL
SELECT TIMESTAMP('2021-01-03 04:00:00') as l, 0 as v)
SELECT EXTRACT(DATE FROM l) d,
ARRAY_AGG(t.v ORDER BY t.l ASC LIMIT 1)[OFFSET(0)] first_value,
ARRAY_AGG(t.v ORDER BY t.l DESC LIMIT 1)[OFFSET(0)] last_value,
ARRAY_AGG(t.v ORDER BY t.v DESC LIMIT 1)[OFFSET(0)] max_value,
ARRAY_AGG(t.v ORDER BY t.v ASC LIMIT 1)[OFFSET(0)] min_value,
FROM
t
GROUP BY
d
Output:
Row | d          | max_value | min_value | last_value | first_value
  1 | 2021-01-01 |        15 |         2 |          2 |          10
  2 | 2021-01-02 |       600 |       120 |        150 |         600
  3 | 2021-01-03 |         0 |         0 |          0 |           0
Since there are only six BigQuery questions on Code Review, I thought I'd ask here on the main Stack Overflow. Is this the fastest method? Do I have anything extraneous in my query? (I'm not too sure that [OFFSET(0)] is doing anything.)
I've seen this question asked on Stack Overflow for Oracle, T-SQL and Postgres but I haven't seen anything specific for BigQuery. Thanks!
An obvious improvement is to use simple MIN and MAX for min_value and max_value:
select date(l) d,
array_agg(v order by l asc limit 1)[offset(0)] first_value,
array_agg(v order by l desc limit 1)[offset(0)] last_value,
max(v) max_value,
min(v) min_value
from t
group by d
Other than that, using array_agg is good practice here, and [offset(0)] is important: without it, your outputs would be one-element arrays, whereas you most likely want the element itself.
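To illustrate the difference, here is a sketch against the same sample table t from the question:
select date(l) d,
       -- without [offset(0)]: a one-element array, e.g. [10]
       array_agg(v order by l asc limit 1) first_as_array,
       -- with [offset(0)]: the element itself, e.g. 10
       array_agg(v order by l asc limit 1)[offset(0)] first_as_value
from t
group by d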
One more option, depending on the volume of your data: you can try the approach below, which uses analytic (window) aggregate functions instead of plain aggregate functions.
select distinct * from (
select date(l) d,
first_value(v) over(partition by date(l) order by l asc) first_value,
first_value(v) over(partition by date(l) order by l desc) last_value,
max(v) over(partition by date(l)) max_value,
min(v) over(partition by date(l)) min_value
from t
)
More options to consider: approximate aggregate functions, as in the example below. approx_top_sum(x, weight, 1) returns the value of x with the (approximately) largest total weight, so weighting by unix_seconds(l) picks the latest row and weighting by 1 / unix_seconds(l) picks the earliest.
select extract(date from l) d,
approx_top_sum(v, 1 / unix_seconds(l), 1)[offset(0)].value first_value,
approx_top_sum(v, unix_seconds(l), 1)[offset(0)].value last_value,
max(v) max_value,
min(v) min_value,
from t
group by d

SQL Query to get top 2 records of group

I have a following Input Table
Source EventType
A X
A X
A X
A Y
A Y
A Z
B L
B L
B L
B L
B M
B N
B N
Expected output
Source EventType Frequency
A X 3
A Y 2
B L 4
B N 2
How do I form a SQL query to get the result shown above?
I was able to achieve the result, but only for one source at a time:
select TOP 2 eventtype, count(*) as frequency
from myEventTable
where source = 'A'
group by eventtype
order by count(*) desc
We can use ROW_NUMBER here:
WITH cte AS (
SELECT Source, EventType, COUNT(*) as Frequency,
ROW_NUMBER() OVER (PARTITION BY Source ORDER BY COUNT(*) DESC) rn
FROM myEventTable
GROUP BY Source, Eventtype
)
SELECT Source, EventType, Frequency
FROM cte
WHERE rn <= 2;
Demo
The reason this works is that ROW_NUMBER is applied after the GROUP BY operation completes, i.e. it runs against the groups. We can then easily limit to the top 2 per source, as ordered by frequency descending.
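If tied frequencies should all be kept instead of being cut off arbitrarily, a common variation is RANK (or DENSE_RANK) in place of ROW_NUMBER; a sketch against the same table:
WITH cte AS (
    SELECT Source, EventType, COUNT(*) AS Frequency,
           RANK() OVER (PARTITION BY Source ORDER BY COUNT(*) DESC) rnk
    FROM myEventTable
    GROUP BY Source, EventType
)
SELECT Source, EventType, Frequency
FROM cte
WHERE rnk <= 2; -- ties for second place are all returned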

GROUP BY getting the second highest date

I'm currently doing this group by to retrieve the max date :
SELECT A, MAX(B) FROM X GROUP BY A
This is perfectly working. However, when I try to retrieve the second highest value, I'm totally lost.
If anyone has an idea...
Try this:
SELECT X.A,
       MAX(X.B)
FROM YourTable X
JOIN
(
    SELECT X1.A,
           MAX(X1.B) AS B
    FROM YourTable X1
    GROUP BY X1.A
) X1 ON X1.A = X.A
    AND X.B < X1.B
GROUP BY X.A
Basically this says get the max of all the ones that are less than the max.
You can use the ranking function ROW_NUMBER in a CTE:
WITH CTE AS
(
    SELECT A,
           B,
           RN = ROW_NUMBER() OVER (PARTITION BY A ORDER BY B DESC)
    FROM dbo.X
)
SELECT A, B
FROM CTE
WHERE RN <= 2
This will return the two highest values for each group (if that is what you want).
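If you want only the second highest value per group, filter on the row number directly; a sketch using the same CTE shape:
WITH CTE AS
(
    SELECT A,
           B,
           RN = ROW_NUMBER() OVER (PARTITION BY A ORDER BY B DESC)
    FROM dbo.X
)
SELECT A, B AS SecondHighestB
FROM CTE
WHERE RN = 2;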
Your columns are rather ambiguous, but if A is the date and B is some other value you wish to sort by, then one way to do it could be:
SELECT A FROM X ORDER BY B DESC LIMIT 2
This will give you two rows: the highest first, then the second highest.

PostgreSQL running sum of previous groups?

Given the following data:
sequence | amount
       1 | 100000
       1 |  20000
       2 |  10000
       2 |  10000
I'd like to write a sql query that gives me the sum of the current sequence, plus the sum of the previous sequence. Like so:
sequence | current | previous
       1 |  120000 |        0
       2 |   20000 |   120000
I know the solution likely involves windowing functions but I'm not too sure how to implement it without subqueries.
SQL Fiddle
select
seq,
amount,
lag(amount::int, 1, 0) over(order by seq) as previous
from (
select seq, sum(amount) as amount
from sa
group by seq
) s
order by seq
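Since window functions are evaluated after GROUP BY, the derived table is not strictly needed; a minimal sketch of the same idea, assuming the sa table from the fiddle:
select seq,
       sum(amount) as current,
       -- lag() runs after grouping, so it sees one row per seq
       coalesce(lag(sum(amount)) over (order by seq), 0) as previous
from sa
group by seq
order by seq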
If your sequence values are consecutive, without holes, you can simply do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount) from mytable t2 WHERE t2.sequence = t1.sequence - 1)
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence
Otherwise, instead of t2.sequence = t1.sequence - 1 you could do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount)
from mytable t2
WHERE t2.sequence = (SELECT MAX(t3.sequence)
FROM mytable t3
WHERE t3.sequence < t1.sequence))
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence;
You can see both approaches in this fiddle