Open, high, low, close aggregation in BigQuery - group-by

Based on the BigQuery best practice of using ARRAY_AGG() to obtain the latest record, here's how I'm getting the first, last, minimum, and maximum values of a field for each day. The data is reported approximately hourly.
WITH t AS (
  SELECT TIMESTAMP('2021-01-01 01:00:00') as l, 10 as v
  UNION ALL
  SELECT TIMESTAMP('2021-01-01 02:00:00') as l, 12 as v
  UNION ALL
  SELECT TIMESTAMP('2021-01-01 03:00:00') as l, 15 as v
  UNION ALL
  SELECT TIMESTAMP('2021-01-01 04:00:00') as l, 2 as v
  UNION ALL
  SELECT TIMESTAMP('2021-01-02 01:00:00') as l, 600 as v
  UNION ALL
  SELECT TIMESTAMP('2021-01-02 02:00:00') as l, 120 as v
  UNION ALL
  SELECT TIMESTAMP('2021-01-02 03:00:00') as l, 150 as v
  UNION ALL
  SELECT TIMESTAMP('2021-01-03 04:00:00') as l, 0 as v
)
SELECT
  EXTRACT(DATE FROM l) d,
  ARRAY_AGG(t.v ORDER BY t.l ASC LIMIT 1)[OFFSET(0)] first_value,
  ARRAY_AGG(t.v ORDER BY t.l DESC LIMIT 1)[OFFSET(0)] last_value,
  ARRAY_AGG(t.v ORDER BY t.v DESC LIMIT 1)[OFFSET(0)] max_value,
  ARRAY_AGG(t.v ORDER BY t.v ASC LIMIT 1)[OFFSET(0)] min_value
FROM t
GROUP BY d
Output:
Row  d           max_value  min_value  last_value  first_value
1    2021-01-01  15         2          2           10
2    2021-01-02  600        120        150         600
3    2021-01-03  0          0          0           0
Since there are only six BigQuery questions on Code Review, I thought I'd ask here on the main Stack Overflow. Is this the fastest method? Do I have anything extraneous in my query? (I'm not too sure that [OFFSET(0)] is doing anything.)
I've seen this question asked on Stack Overflow for Oracle, T-SQL and Postgres but I haven't seen anything specific for BigQuery. Thanks!

An obvious improvement is to use plain MIN and MAX for min_value and max_value:
select date(l) d,
array_agg(v order by l asc limit 1)[offset(0)] first_value,
array_agg(v order by l desc limit 1)[offset(0)] last_value,
max(v) max_value,
min(v) min_value
from t
group by d
Other than that, using array_agg is good practice here, and [offset(0)] is important: without it your outputs would be one-element arrays, whereas you most likely want the element itself.
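A minimal sketch of the difference, reusing the sample table t from the question:
-- without [offset(0)]: first_value is an ARRAY<INT64>, e.g. [10]
select extract(date from l) d, array_agg(v order by l asc limit 1) first_value
from t
group by d

-- with [offset(0)]: first_value is a plain INT64, e.g. 10
select extract(date from l) d, array_agg(v order by l asc limit 1)[offset(0)] first_value
from t
group by d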
One more option, depending on the volume of your data: you can try the approach below, which uses analytic (window) aggregation functions instead of plain aggregate functions; the outer select distinct then collapses the per-row window results back to one row per day.
select distinct * from (
select date(l) d,
first_value(v) over(partition by date(l) order by l asc) first_value,
first_value(v) over(partition by date(l) order by l desc) last_value,
max(v) over(partition by date(l)) max_value,
min(v) over(partition by date(l)) min_value
from t
)
More options to consider: approximate aggregate functions, as in the example below. approx_top_sum(v, w, 1) returns the v with the (approximately) largest total weight w, so weighting by unix_seconds(l) favors the row with the latest timestamp and weighting by 1 / unix_seconds(l) favors the earliest.
select extract(date from l) d,
approx_top_sum(v, 1 / unix_seconds(l), 1)[offset(0)].value first_value,
approx_top_sum(v, unix_seconds(l), 1)[offset(0)].value last_value,
max(v) max_value,
min(v) min_value
from t
group by d
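A quick sanity check of the weighting, restricted to the 2021-01-01 rows of the sample data (the expected result is 2, the 04:00 reading):
select approx_top_sum(v, unix_seconds(l), 1)[offset(0)].value last_value
from t
where extract(date from l) = date '2021-01-01'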

Related

Windowing Functions

I have the following events table.
key  time_stamp  geohash
k1   1           thred0y
k2   5           thred0v
k4   7           thre6rd
k3   9           thre6rg
k1   10          thred3t
k1   12          thred3u
k2   14          thred3s
I want to cluster the keys into groups if they fall within a 500 m range of each other inside a 10-minute time interval.
I tried cross joining the table with itself:
select a.key, b.key, a.geohash, b.geohash, a.time_stamp, b.time_stamp,
round(ST_Distance(ST_PointFromGeoHash(a.geohash, 4326), ST_PointFromGeoHash(b.geohash, 4326), true)) distance,
abs(round(extract(EPOCH from a.time_stamp - b.time_stamp)/60))
from t a, t b
where a.key <> b.key
and a.time_stamp between b.time_stamp - interval '10 min' and b.time_stamp + interval '10 min'
and ST_Distance(ST_PointFromGeoHash(a.geohash, 4326), ST_PointFromGeoHash(b.geohash, 4326), true) <= 500
and least(a.key, b.key) = a.key
order by a.time_stamp desc
However, the query only performs well on small data, and it only works when there are exactly two distinct keys, not more.
Any input on how to proceed further would be helpful.
I added some sample data for test, https://pastebin.com/iVD1WU4Y.
I found a solution by clustering keys that are within 60 minutes of each other and roughly 1.2 km apart (via a 6-character geohash prefix).
with x as (
  select key, time_stamp, geo, prev_ts, geo_hash6,
    -- start a new cluster whenever there is no previous row or the gap exceeds the threshold
    count(case when prev_ts is null or prev_ts > 60 then 1 else null end) over(order by time_stamp) cluster_id
  from (
    select key, time_stamp, geo,
      -- gap to the previous event, in seconds
      EXTRACT(EPOCH FROM time_stamp - lag(time_stamp) over(order by time_stamp)) prev_ts,
      -- 6-character geohash prefix, a cell of roughly 1.2 km x 0.6 km
      substring(geo, 1, 6) geo_hash6
    from t
  ) a
  order by cluster_id, geo_hash6, geo, time_stamp
)
select x.cluster_id, x.key, x.geo_hash6, min(time_stamp) first_time, max(time_stamp) last_time
from x, (select cluster_id, geo_hash6, count(distinct key) num_uniques from x group by cluster_id, geo_hash6) y
where x.cluster_id = y.cluster_id and x.geo_hash6 = y.geo_hash6 and y.num_uniques > 2
group by x.cluster_id, x.geo_hash6, x.key
order by x.cluster_id, x.geo_hash6;
Any suggestions for improving the solution are welcome.
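One detail worth calling out: the cluster_id expression is the classic gaps-and-islands sessionization pattern. Counting the rows that start a new session (no previous row, or too long a gap) in a running window gives every row of a session the same id. A minimal standalone sketch of the pattern in PostgreSQL, assuming a hypothetical events table with a time_stamp column:
select time_stamp,
  -- rows close together share a session_id; each large gap starts a new one
  count(*) filter (where starts_new) over (order by time_stamp) session_id
from (
  select time_stamp,
    -- true for the first row and whenever the gap to the previous row exceeds 60 seconds
    coalesce(extract(epoch from time_stamp - lag(time_stamp) over (order by time_stamp)) > 60, true) starts_new
  from events
) g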

SQL Query to get top 2 records of group

I have the following input table:
Source  EventType
A       X
A       X
A       X
A       Y
A       Y
A       Z
B       L
B       L
B       L
B       L
B       M
B       N
B       N
Expected output:
Source  EventType  Frequency
A       X          3
A       Y          2
B       L          4
B       N          2
How do I form a SQL query to get the result shown above?
I was able to achieve this, but only for one source at a time:
select TOP 2 eventtype, count(*) as frequency
from myEventTable
where source = 'A'
group by eventtype
order by count(*) desc
We can use ROW_NUMBER here:
WITH cte AS (
SELECT Source, EventType, COUNT(*) as Frequency,
ROW_NUMBER() OVER (PARTITION BY Source ORDER BY COUNT(*) DESC) rn
FROM myEventTable
GROUP BY Source, EventType
)
SELECT Source, EventType, Frequency
FROM cte
WHERE rn <= 2;
The reason this works is that ROW_NUMBER is applied after the GROUP BY operation completes, i.e. it runs against the groups. We can then easily limit to the top 2 per source, as ordered by frequency descending.
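To make the intermediate step concrete, the CTE alone produces these rows for the sample input above (worked out by hand); the final WHERE rn <= 2 then keeps the first two per source:
Source  EventType  Frequency  rn
A       X          3          1
A       Y          2          2
A       Z          1          3
B       L          4          1
B       N          2          2
B       M          1          3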

Top N values in window frame

I have a table t with 3 fields of interest:
d (date), pid (int), and score (numeric)
I am trying to calculate a 4th field that is an average of each player's top N (3 or 5) scores for the days before the current row.
I tried the following join on a subquery but it is not producing the results I'm looking for:
SELECT t.d, t.pid, t.score, sq.highscores
FROM t, (SELECT *, avg(score) as highscores FROM
(SELECT *, row_number() OVER w AS rnum
FROM t AS t2
WINDOW w AS (PARTITION BY pid ORDER BY score DESC ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)) isq
WHERE rnum <= 3) sq
WHERE t.d = sq.d AND t.pid = sq.pid
Any suggestions would be greatly appreciated! I'm a hobbyist programmer and this is a more complex query than I'm used to.
You can't select * and avg(score) in the same (inner) query: which non-aggregated values should be selected for each average? PostgreSQL won't decide that for you.
Because you PARTITION BY pid in the innermost query, you should GROUP BY pid in the aggregating subquery. That way, you can SELECT pid, avg(score) as highscores:
SELECT pid, avg(score) as highscores
FROM (SELECT *, row_number() OVER w AS rnum
FROM t AS t2
WINDOW w AS (PARTITION BY pid ORDER BY score DESC)) isq
WHERE rnum <= 3
GROUP BY pid
Note: ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING makes no difference for row_number().
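That's because ranking functions ignore the frame clause entirely; a quick way to convince yourself, using the same table t:
-- both columns produce identical numbering, with or without the frame clause
SELECT row_number() OVER (PARTITION BY pid ORDER BY score DESC) rn_plain,
       row_number() OVER (PARTITION BY pid ORDER BY score DESC
                          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) rn_framed
FROM t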
But if the top-N part is fixed (and N is small in your real-world use case too), you can solve this with fewer subqueries, using the nth_value() window function:
SELECT d, pid, score,
       -- sum the (up to) three highest scores; missing ones contribute 0 ...
       (coalesce(nth_value(score, 1) OVER w, 0) +
        coalesce(nth_value(score, 2) OVER w, 0) +
        coalesce(nth_value(score, 3) OVER w, 0)) /
       -- ... then divide by how many of them actually exist, so partitions
       -- with fewer than 3 rows still average correctly
       ((nth_value(score, 1) OVER w IS NOT NULL)::int +
        (nth_value(score, 2) OVER w IS NOT NULL)::int +
        (nth_value(score, 3) OVER w IS NOT NULL)::int) highscores
FROM t
WINDOW w AS (PARTITION BY pid ORDER BY score DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
http://rextester.com/GUUPO5148

GROUP BY getting the second highest date

I'm currently doing this GROUP BY to retrieve the max date:
SELECT A, MAX(B) FROM X GROUP BY A
This works perfectly. However, when I try to retrieve the second highest value, I'm totally lost.
If anyone has an idea...
Try this:
SELECT X.A,
MAX(X.B)
FROM YourTable X
JOIN
(
SELECT
X1.A,
MAX(X1.B) AS MaxB
FROM YourTable X1
GROUP BY X1.A
) X1 ON X1.A = X.A
AND X.B < X1.MaxB
GROUP BY X.A
Basically this says: get the max of all the values that are less than the max.
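As a quick worked example (the table contents here are hypothetical): suppose YourTable holds ('x', 5), ('x', 9), and ('x', 12).
-- inner query yields ('x', 12), the per-group max
-- the join condition X.B < X1.MaxB keeps the rows with B in (5, 9)
-- the outer MAX(X.B) then returns 9, the second highest value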
You can use the ranking function ROW_NUMBER in a CTE:
WITH CTE AS
(
    SELECT A,
           B,
           RN = ROW_NUMBER() OVER (PARTITION BY A ORDER BY B DESC)
    FROM dbo.X
)
SELECT A, B
FROM CTE
WHERE RN <= 2
This will return the two highest values for each group (if that is what you want).
Your columns are rather ambiguous, but if B is the value you wish to sort by, then one way to do it could be:
SELECT A FROM X ORDER BY B DESC LIMIT 2
which will give you two rows, with the second highest displayed last.
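If you only want the single second-highest row, a common variant (in dialects that support OFFSET, such as PostgreSQL and MySQL) is:
SELECT A FROM X ORDER BY B DESC LIMIT 1 OFFSET 1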

Postgresql running sum of previous groups?

Given the following data:
sequence | amount
       1 | 100000
       1 |  20000
       2 |  10000
       2 |  10000
I'd like to write a sql query that gives me the sum of the current sequence, plus the sum of the previous sequence. Like so:
sequence | current | previous
       1 |  120000 |        0
       2 |   20000 |   120000
I know the solution likely involves windowing functions but I'm not too sure how to implement it without subqueries.
Aggregate per sequence first, then pull the previous group's total with lag():
select
  seq,
  amount,
  -- previous group's total; defaults to 0 for the first group
  lag(amount::int, 1, 0) over(order by seq) as previous
from (
  -- collapse each sequence to its total first
  select seq, sum(amount) as amount
  from sa
  group by seq
) s
order by seq
If your sequence is sequential, without holes, you can simply do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount) from mytable t2 WHERE t2.sequence = t1.sequence - 1)
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence
Otherwise, instead of t2.sequence = t1.sequence - 1 you could do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount)
from mytable t2
WHERE t2.sequence = (SELECT MAX(t3.sequence)
FROM mytable t3
WHERE t3.sequence < t1.sequence))
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence;