Postgresql: Help calculating delta value in Postgres while using group by function - postgresql

I am building a stock market database. I have one table with timestamp, symbol, price, and volume columns. The volume is the cumulative volume traded per day, for example:
| timestamp | symbol | price | volume |
|----------------------------|--------|----------|--------|
| 2022-06-11 12:42:04.912+00 | SBIN | 120.0000 | 5 |
| 2022-06-11 12:42:25.806+00 | SBIN | 123.0000 | 6 |
| 2022-06-11 12:42:38.993+00 | SBIN | 123.4500 | 8 |
| 2022-06-11 12:42:42.735+00 | SBIN | 108.0000 | 12 |
| 2022-06-11 12:42:45.801+00 | SBIN | 121.0000 | 14 |
| 2022-06-11 12:43:43.186+00 | SBIN | 122.0000 | 16 |
| 2022-06-11 12:43:45.599+00 | SBIN | 125.0000 | 17 |
| 2022-06-11 12:43:51.655+00 | SBIN | 141.0000 | 20 |
| 2022-06-11 12:43:54.151+00 | SBIN | 111.0000 | 24 |
| 2022-06-11 12:44:01.908+00 | SBIN | 123.0000 | 27 |
I want to query OHLCV (open, high, low, close, and volume) data. The following query returns correct OHLC values, but not the volume. Note that I am using TimescaleDB's time_bucket function, which is similar to date_trunc:
SELECT
time_bucket('1 minute', "timestamp") AS time,
symbol,
max(price) AS high,
first(price, timestamp) AS open,
last(price, timestamp) AS close,
min(price) AS low
FROM candle_ticks
GROUP BY time, symbol
ORDER BY time DESC, symbol;
So for volume, I need the difference between the max/last volume in the current time bucket and the max/last volume in the previous time bucket, to get the following data:
| time | symbol | high | open | close | low | volume |
|------------------------|--------|----------|----------|----------|----------|--------|
| 2022-06-11 12:44:00+00 | SBIN | 123.0000 | 123.0000 | 123.0000 | 123.0000 | 3 |
| 2022-06-11 12:43:00+00 | SBIN | 141.0000 | 122.0000 | 111.0000 | 111.0000 | 10 |
| 2022-06-11 12:42:00+00 | SBIN | 123.4500 | 120.0000 | 121.0000 | 108.0000 | 14 |
What should the SQL look like? I tried to use lag, but lag and GROUP BY do not play well together.

Would it work if you put your query in a CTE?
with ivals as (
SELECT time_bucket('1 minute', "timestamp") AS time,
symbol,
max(price) AS high,
first(price, timestamp) AS open,
last(price, timestamp) AS close,
min(price) AS low,
max(volume) AS close_volume
FROM candle_ticks
GROUP BY time, symbol
)
select i.*,
close_volume - coalesce(
lag(close_volume)
over (partition by symbol, time::date
order by time),
0
) as time_volume
from ivals i
;
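To make the two steps of the query above concrete, here is a minimal Python sketch of the same logic: first take the highest cumulative volume in each one-minute bucket, then subtract the previous bucket's value within the same symbol and day. The names `TICKS` and `minute_volumes` are made up for this illustration, and the timestamps are the question's sample rows with fractional seconds and time zone dropped.

```python
from datetime import datetime

# Sample ticks from the question: (timestamp, symbol, price, cumulative volume).
TICKS = [
    ("2022-06-11 12:42:04", "SBIN", 120.00, 5),
    ("2022-06-11 12:42:25", "SBIN", 123.00, 6),
    ("2022-06-11 12:42:38", "SBIN", 123.45, 8),
    ("2022-06-11 12:42:42", "SBIN", 108.00, 12),
    ("2022-06-11 12:42:45", "SBIN", 121.00, 14),
    ("2022-06-11 12:43:43", "SBIN", 122.00, 16),
    ("2022-06-11 12:43:45", "SBIN", 125.00, 17),
    ("2022-06-11 12:43:51", "SBIN", 141.00, 20),
    ("2022-06-11 12:43:54", "SBIN", 111.00, 24),
    ("2022-06-11 12:44:01", "SBIN", 123.00, 27),
]


def minute_volumes(ticks):
    """Return {(symbol, minute): volume traded in that minute}."""
    # Step 1 (the CTE): highest cumulative volume seen in each one-minute bucket.
    close_volume = {}
    for ts, symbol, _price, volume in ticks:
        minute = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(second=0)
        key = (symbol, minute)
        close_volume[key] = max(close_volume.get(key, 0), volume)

    # Step 2 (lag ... over partition by symbol, day order by time):
    # subtract the previous bucket's closing volume, or 0 for the first
    # bucket of the day.
    result = {}
    previous = {}  # (symbol, date) -> closing volume of the previous bucket
    for symbol, minute in sorted(close_volume):
        partition = (symbol, minute.date())
        current = close_volume[(symbol, minute)]
        result[(symbol, minute)] = current - previous.get(partition, 0)
        previous[partition] = current
    return result


volumes = minute_volumes(TICKS)
for (symbol, minute), traded in sorted(volumes.items()):
    print(symbol, minute, traded)
```

Like lag, this compares each bucket with the previous bucket that actually exists, so a minute without ticks does not break the chain.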

Similar to Mike Organek's answer, you can collect the data into buckets via a CTE and then, in your main query, subtract a minute from the time column to get the time value of the previous bucket. You can use that value to LEFT JOIN the row for the previous time bucket within the same day:
WITH buckets as (
SELECT
time_bucket('1 minute', "timestamp") AS time,
symbol,
max(price) AS high,
first(price, timestamp) AS open,
last(price, timestamp) AS close,
min(price) AS low,
max(volume) AS close_volume
FROM candle_ticks
GROUP BY time, symbol
ORDER BY time DESC, symbol
)
SELECT
b.*,
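b.close_volume - coalesce(b2.close_volume, 0) AS time_volume
FROM
buckets b
-- match the previous minute's bucket for the same symbol on the same day
LEFT JOIN buckets b2 ON (b.symbol = b2.symbol and b.time::date = b2.time::date and b.time - interval '1 minute' = b2.time)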
This method will avoid the restrictions that TimescaleDB places on window functions.

Related

How to get non-aggregated measures?

I calculate my metrics with SQL and publish the resulting table to Tableau Server. Afterward, I use this data source to create charts and dashboards.
For one analysis, I already calculated the measures per day with SQL. When I use the resulting table in Tableau, it aggregates these measures with SUM by default. However, I don't want the SUM or AVG of the averages, or the SUM of the percentiles.
What I want is the result I would get by not selecting the date dimension and not grouping by date in SQL.
Here is the query:
SELECT
-- date,
COUNT(DISTINCT id) AS count_of_id,
AVG(timediff_in_sec) AS avg_timediff,
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY timediff_in_sec) AS percentile_25,
PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY timediff_in_sec) AS percentile_50
FROM
(
--subquery
) AS t1
-- GROUP BY date
Here are the first rows of the resulting table:
+------------+--------------+-------------+---------------+---------------+
| date | avg_timediff | count_of_id | percentile_25 | percentile_50 |
+------------+--------------+-------------+---------------+---------------+
| 10/06/2020 | 61,65186364 | 22 | 8,5765 | 13,3015 |
| 11/06/2020 | 127,2913333 | 3 | 15,6045 | 17,494 |
| 12/06/2020 | 306,0348214 | 28 | 12,2565 | 17,629 |
| 13/06/2020 | 13,2664 | 5 | 11,944 | 13,862 |
| 14/06/2020 | 16,728 | 7 | 14,021 | 17,187 |
| 15/06/2020 | 398,6424595 | 37 | 11,893 | 19,271 |
| 16/06/2020 | 293,6925152 | 33 | 12,527 | 17,134 |
| 17/06/2020 | 155,6554286 | 21 | 13,452 | 16,715 |
| 18/06/2020 | 383,8101429 | 7 | 266,048 | 493,722 |
+------------+--------------+-------------+---------------+---------------+
How can I achieve the desired output above?
Drag them all into the Dimensions list so that they become static dimensions. For your use case, you could also just drag the Date field to Rows: aggregating the single value you have for each date returns the same value whatever the aggregation type.

Computing rolling sums efficiently in PostgreSQL

Supposing I have a set of transactions (purchases) with dates for a set of customers, I want to calculate a rolling x day sum of purchase amount and number of purchases by customer in that same window. I've gotten it to work using a window function, but I have to fill in for dates where the customer did not make any purchases. In so doing, I'm using a Cartesian product. Is there a more efficient approach so that it's more scalable as the number of customers – and time window – increases?
Edit: As noted in the comments, I'm on PostgreSQL v9.3.
Here's sample data (note that some customers may have 0, 1, or multiple purchases on a given date):
| id | cust_id | txn_date | amount |
|----|---------|------------|--------|
| 1 | 123 | 2017-08-17 | 10 |
| 2 | 123 | 2017-08-17 | 5 |
| 3 | 123 | 2017-08-18 | 5 |
| 4 | 123 | 2017-08-20 | 50 |
| 5 | 123 | 2017-08-21 | 100 |
| 6 | 456 | 2017-08-01 | 5 |
| 7 | 456 | 2017-08-01 | 5 |
| 8 | 456 | 2017-08-01 | 5 |
| 9 | 456 | 2017-08-30 | 5 |
| 10 | 456 | 2017-08-01 | 1000 |
| 11 | 789 | 2017-08-15 | 1000 |
| 12 | 789 | 2017-08-30 | 1000 |
And here's the desired output:
| cust_id | txn_date | sum_dly_txns | tot_txns_7d | cnt_txns_7d |
|---------|------------|--------------|-------------|-------------|
| 123 | 2017-08-17 | 15 | 15 | 2 |
| 123 | 2017-08-18 | 5 | 20 | 3 |
| 123 | 2017-08-20 | 50 | 70 | 4 |
| 123 | 2017-08-21 | 100 | 170 | 5 |
| 456 | 2017-08-01 | 1015 | 1015 | 4 |
| 456 | 2017-08-30 | 5 | 5 | 1 |
| 789 | 2017-08-15 | 1000 | 1000 | 1 |
| 789 | 2017-08-30 | 1000 | 1000 | 1 |
Here's SQL that produces the totals as desired:
SELECT *
FROM (
-- One row per day per user
WITH daily_txns AS (
SELECT
t.cust_id
,t.txn_date AS txn_date
,SUM(t.amount) AS sum_dly_txns
,COUNT(t.id) AS cnt_dly_txns
FROM transactions t
GROUP BY t.cust_id, txn_date
),
-- Every possible transaction date for every user
dummydates AS (
SELECT txn_date, uids.cust_id
FROM (
SELECT generate_series(
timestamp '2017-08-01'
,timestamp '2017-08-30'
,interval '1 day')::date
) d(txn_date)
CROSS JOIN (SELECT DISTINCT cust_id FROM daily_txns) uids
),
txns_dummied AS (
SELECT
d.cust_id
,d.txn_date
,COALESCE(sum_dly_txns,0) AS sum_dly_txns
,COALESCE(cnt_dly_txns,0) AS cnt_dly_txns
FROM dummydates d
LEFT JOIN daily_txns dx
ON d.txn_date = dx.txn_date
AND d.cust_id = dx.cust_id
ORDER BY d.txn_date, d.cust_id
)
SELECT
cust_id
,txn_date
,sum_dly_txns
,SUM(COALESCE(sum_dly_txns,0)) OVER w AS tot_txns_7d
,SUM(cnt_dly_txns) OVER w AS cnt_txns_7d
FROM txns_dummied
WINDOW w AS (
PARTITION BY cust_id
ORDER BY txn_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW -- 7d moving window
)
ORDER BY cust_id, txn_date
) xfers
WHERE sum_dly_txns > 0 -- Omit dates with no transactions
;
SQL Fiddle
Instead of ROWS BETWEEN 6 PRECEDING AND CURRENT ROW, did you want to write RANGE '6 days' PRECEDING?
This must be what you are looking for:
SELECT DISTINCT
cust_id
,txn_date
,SUM(amount) OVER (PARTITION BY cust_id, txn_date) AS sum_dly_txns
,SUM(amount) OVER (PARTITION BY cust_id ORDER BY txn_date RANGE interval '6 days' PRECEDING) AS tot_txns_7d
,COUNT(*) OVER (PARTITION BY cust_id ORDER BY txn_date RANGE interval '6 days' PRECEDING) AS cnt_txns_7d
from transactions
ORDER BY cust_id, txn_date
Edit: Since you are using an older version (I tested the query above on PostgreSQL 11, the first release to support RANGE with an offset), the query above will not work for you, so you will need old-fashioned SQL (that is, without window functions).
It is a bit less efficient but does a fair job.
WITH daily_txns AS (
SELECT
t.cust_id
,t.txn_date AS txn_date
,SUM(t.amount) AS sum_dly_txns
,COUNT(t.id) AS cnt_dly_txns
FROM transactions t
GROUP BY t.cust_id, txn_date
)
SELECT t1.cust_id, t1.txn_date, t1.sum_dly_txns, SUM(t2.sum_dly_txns) AS tot_txns_7d, SUM(t2.cnt_dly_txns) AS cnt_txns_7d
from daily_txns t1
-- BETWEEN is inclusive on both ends, so "- 6" gives a 7-day window
join daily_txns t2 ON t1.cust_id = t2.cust_id and t2.txn_date BETWEEN t1.txn_date - 6 and t1.txn_date
group by t1.cust_id, t1.txn_date, t1.sum_dly_txns
order by t1.cust_id, t1.txn_date
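The idea behind both queries, a rolling window defined by dates rather than by row counts, can be sketched outside the database as well. The following Python sketch is only an illustration under that assumption, not a replacement for the SQL: `TRANSACTIONS` holds the question's sample rows and `rolling_7d` is a hypothetical helper. It keeps a sliding window per customer, so no row has to be generated for days without purchases.

```python
from collections import defaultdict, deque
from datetime import date, timedelta

# Sample data from the question: (id, cust_id, txn_date, amount).
TRANSACTIONS = [
    (1, 123, date(2017, 8, 17), 10),
    (2, 123, date(2017, 8, 17), 5),
    (3, 123, date(2017, 8, 18), 5),
    (4, 123, date(2017, 8, 20), 50),
    (5, 123, date(2017, 8, 21), 100),
    (6, 456, date(2017, 8, 1), 5),
    (7, 456, date(2017, 8, 1), 5),
    (8, 456, date(2017, 8, 1), 5),
    (9, 456, date(2017, 8, 30), 5),
    (10, 456, date(2017, 8, 1), 1000),
    (11, 789, date(2017, 8, 15), 1000),
    (12, 789, date(2017, 8, 30), 1000),
]


def rolling_7d(transactions, days=7):
    """Return rows of (cust_id, txn_date, sum_dly_txns, tot_txns_7d, cnt_txns_7d)."""
    # One entry per customer per day that has at least one transaction.
    daily = defaultdict(lambda: [0, 0])  # (cust_id, txn_date) -> [sum, count]
    for _id, cust_id, txn_date, amount in transactions:
        entry = daily[(cust_id, txn_date)]
        entry[0] += amount
        entry[1] += 1

    rows = []
    current_cust = None
    window = deque()  # (txn_date, day_sum, day_cnt) inside the current window
    total = count = 0
    for cust_id, txn_date in sorted(daily):
        if cust_id != current_cust:  # new partition: reset the window
            current_cust, window, total, count = cust_id, deque(), 0, 0
        day_sum, day_cnt = daily[(cust_id, txn_date)]
        window.append((txn_date, day_sum, day_cnt))
        total += day_sum
        count += day_cnt
        # Drop days older than the window (current day plus 6 preceding days).
        cutoff = txn_date - timedelta(days=days - 1)
        while window[0][0] < cutoff:
            _old_date, old_sum, old_cnt = window.popleft()
            total -= old_sum
            count -= old_cnt
        rows.append((cust_id, txn_date, day_sum, total, count))
    return rows


for row in rolling_7d(TRANSACTIONS):
    print(row)
```

Each daily row enters and leaves the window once, so the work grows with the number of days that have transactions rather than with customers times calendar days.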

How to get list day of month data per month in postgresql

I use PostgreSQL 10.5 and I have a table structured like this:
| date | total |
-------------------------
| 01-01-2018 | 50 |
| 05-01-2018 | 90 |
| 30-01-2018 | 20 |
How can I get a recap by month in which every day of the month is listed, including days with no data? I want the data shown like this:
| date | total |
-------------------------
| 01-01-2018 | 50 |
| 02-01-2018 | 0 |
| 03-01-2018 | 0 |
| 04-01-2018 | 0 |
| 05-01-2018 | 90 |
.....
| 29-01-2018 | 0 |
| 30-01-2018 | 20 |
I've tried this query:
SELECT * FROM date
WHERE EXTRACT(month FROM "date") = 1 -- set dynamically
AND EXTRACT(year FROM "date") = 2018 -- set dynamically
but the result is not what I expected. Note that the month and year parameters are set dynamically.
Any help will be appreciated.
Use the function generate_series(start, stop, step interval), e.g.:
select d::date, coalesce(total, 0) as total
from generate_series('2018-01-01', '2018-01-31', '1 day'::interval) d
left join my_table t on d::date = t.date
Working example in rextester.
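The generate_series plus LEFT JOIN pattern amounts to "build every day in the range, then look up each day's total and default to 0". A minimal Python sketch of that idea follows, assuming the totals are already loaded into a dictionary; `TOTALS` and `fill_days` are hypothetical names used only for this illustration.

```python
from datetime import date, timedelta

# The question's sample rows, keyed by date.
TOTALS = {
    date(2018, 1, 1): 50,
    date(2018, 1, 5): 90,
    date(2018, 1, 30): 20,
}


def fill_days(totals, start, stop):
    """Return [(day, total)] for every day from start to stop inclusive."""
    n_days = (stop - start).days + 1
    filled = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        # Plays the role of COALESCE(total, 0) after the LEFT JOIN.
        filled.append((day, totals.get(day, 0)))
    return filled


january = fill_days(TOTALS, date(2018, 1, 1), date(2018, 1, 31))
for day, total in january:
    print(day.strftime("%d-%m-%Y"), total)
```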

postgres lag when data is missing

I have data on baseball players' annual salaries, with some years missing. What I would like to do is calculate the min, max, and average change in salary from the prior year for all players in a year.
For example, the data in the table 'salaries' looks like this:
| playerid | yearid | salary |
| a | 2016 | 10000 |
| b | 2016 | 5000 |
| a | 2015 | 9000 |
| b | 2015 | 3000 |
| a | 2014 | 3000 |
| b | 2014 | 15000 |
| a | 2010 | 1000 |
As you can see, player A has a yearly change of 1k and 6k. player B has a yearly change of 2k and -12k. So I would like a select statement that brings out:
| yearid | min change | max change | avg change |
| 2016 | 1k | 2k | 1.5k |
| 2015 | -12k | 6k | -3k |
Is there a way to do this?
My lag function has unfortunately captured the difference between 2014 and 2010 for playerid a, and that is obviously wrong. I couldn't figure out how to use the lag function only if the previous row's yearid is 1 less than the current row's yearid.
Any suggestions would be greatly appreciated.
Just use the previous year for the filtering:
select yearid, min(salary - prev_salary), max(salary - prev_salary),
avg(salary - prev_salary)
from (select s.*,
lag(s.salary) over (partition by s.playerid order by yearid) as prev_salary,
lag(s.yearid) over (partition by s.playerid order by yearid) as prev_yearid
from salaries s
) s
where prev_yearid = yearid - 1
group by yearid;
Or, you can just use a join:
select s.yearid, . . .
from salaries s join
salaries sp
on sp.playerid = s.playerid and sp.yearid = s.yearid - 1
group by s.yearid;
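As a cross-check of the join variant, here is a small Python sketch of the same idea: look up each player's salary for exactly the previous year and skip the row when that year is missing. `SALARIES` holds the question's sample rows, and `yearly_changes` is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# Sample rows from the question: (playerid, yearid, salary).
SALARIES = [
    ("a", 2016, 10000),
    ("b", 2016, 5000),
    ("a", 2015, 9000),
    ("b", 2015, 3000),
    ("a", 2014, 3000),
    ("b", 2014, 15000),
    ("a", 2010, 1000),
]


def yearly_changes(salaries):
    """Return {yearid: (min_change, max_change, avg_change)}."""
    by_player_year = {(player, year): salary for player, year, salary in salaries}
    changes = defaultdict(list)
    for (player, year), salary in by_player_year.items():
        # Equivalent of joining on sp.yearid = s.yearid - 1: only the
        # immediately preceding year counts, so a's 2010 -> 2014 gap is ignored.
        previous = by_player_year.get((player, year - 1))
        if previous is not None:
            changes[year].append(salary - previous)
    return {
        year: (min(deltas), max(deltas), sum(deltas) / len(deltas))
        for year, deltas in changes.items()
    }


for year, stats in sorted(yearly_changes(SALARIES).items(), reverse=True):
    print(year, stats)
```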

How to query just the last record of every second within a period of time in postgres

I have a 'prices' table with hundreds of millions of records and only four columns: uid, price, unit, dt. dt is a datetime in a standard format like '2017-05-01 00:00:00.585'.
I can quite easily select a period using
SELECT uid, price, unit from prices
WHERE dt > '2017-05-01 00:00:00.000'
AND dt < '2017-05-01 02:59:59.999'
What I can't work out is how to select the price of the last record in each second. (I also need the very first one of each second, but I guess that will be a similar separate query.) There are some similar examples (here), but they generated errors when I tried to adapt them to my needs.
Could someone please help me crack this nut?
Let's say there is a table that has been generated with this command:
CREATE TABLE test AS
SELECT timestamp '2017-09-16 20:00:00' + x * interval '0.1' second As my_timestamp
from generate_series(0,100) x
This table contains an increasing series of timestamps, each timestamp differs by 100 milliseconds (0.1 second) from neighbors, so that there are 10 records within each second.
| my_timestamp |
|------------------------|
| 2017-09-16T20:00:00Z |
| 2017-09-16T20:00:00.1Z |
| 2017-09-16T20:00:00.2Z |
| 2017-09-16T20:00:00.3Z |
| 2017-09-16T20:00:00.4Z |
| 2017-09-16T20:00:00.5Z |
| 2017-09-16T20:00:00.6Z |
| 2017-09-16T20:00:00.7Z |
| 2017-09-16T20:00:00.8Z |
| 2017-09-16T20:00:00.9Z |
| 2017-09-16T20:00:01Z |
| 2017-09-16T20:00:01.1Z |
| 2017-09-16T20:00:01.2Z |
| 2017-09-16T20:00:01.3Z |
.......
The below query determines and prints the first and the last timestamp within each second:
SELECT my_timestamp,
CASE
WHEN rn1 = 1 THEN 'First'
WHEN rn2 = 1 THEN 'Last'
ELSE 'Somewhere in the middle'
END as Which_row_within_a_second
FROM (
select *,
row_number() over( partition by date_trunc('second', my_timestamp)
order by my_timestamp
) rn1,
row_number() over( partition by date_trunc('second', my_timestamp)
order by my_timestamp DESC
) rn2
from test
) xx
WHERE 1 IN (rn1, rn2 )
ORDER BY my_timestamp
;
| my_timestamp | which_row_within_a_second |
|------------------------|---------------------------|
| 2017-09-16T20:00:00Z | First |
| 2017-09-16T20:00:00.9Z | Last |
| 2017-09-16T20:00:01Z | First |
| 2017-09-16T20:00:01.9Z | Last |
| 2017-09-16T20:00:02Z | First |
| 2017-09-16T20:00:02.9Z | Last |
| 2017-09-16T20:00:03Z | First |
| 2017-09-16T20:00:03.9Z | Last |
| 2017-09-16T20:00:04Z | First |
| 2017-09-16T20:00:04.9Z | Last |
| 2017-09-16T20:00:05Z | First |
| 2017-09-16T20:00:05.9Z | Last |
You can find a working demo here.
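The row_number approach picks the rows ranked first in ascending and in descending order within each second. The same selection can be sketched in Python by sorting on the timestamp and grouping on the timestamp truncated to the second. In the sketch below, `first_and_last_per_second` is a hypothetical helper, and the integer stored next to each timestamp is a stand-in for a price, not data from the question.

```python
from datetime import datetime, timedelta
from itertools import groupby

# Same shape as the test table: 101 timestamps spaced 100 ms apart.
START = datetime(2017, 9, 16, 20, 0, 0)
ROWS = [(START + timedelta(milliseconds=100 * i), i) for i in range(101)]


def first_and_last_per_second(rows):
    """Return [(second, first_row, last_row)] for rows of (timestamp, value)."""
    ordered = sorted(rows, key=lambda row: row[0])
    result = []
    # Truncating to the second plays the role of date_trunc('second', ...).
    for second, group in groupby(ordered, key=lambda row: row[0].replace(microsecond=0)):
        members = list(group)
        result.append((second, members[0], members[-1]))
    return result


for second, first, last in first_and_last_per_second(ROWS):
    print(second, "first:", first[0].time(), "last:", last[0].time())
```

When a second contains a single row, that row is both the first and the last, which matches the SQL query, where such a row has rn1 = 1 and rn2 = 1.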