Timescale/Postgres queries very slow/inefficient?

I have a table in TimescaleDB with the following schema:
create table market_quotes
(
instrument varchar(16) not null,
exchange varchar(16) not null,
time timestamp not null,
bid_1_price double precision,
ask_1_price double precision,
bid_1_quantity double precision,
ask_1_quantity double precision,
bid_2_price double precision,
ask_2_price double precision,
bid_2_quantity double precision,
ask_2_quantity double precision,
bid_3_price double precision,
ask_3_price double precision,
bid_3_quantity double precision,
ask_3_quantity double precision,
bid_4_price double precision,
ask_4_price double precision,
bid_4_quantity double precision,
ask_4_quantity double precision,
bid_5_price double precision,
ask_5_price double precision,
bid_5_quantity double precision,
ask_5_quantity double precision
);
and the following composite index:
create index market_quotes_instrument_exchange_time_idx
on market_quotes (instrument asc, exchange asc, time desc);
When I run the query:
EXPLAIN ANALYZE select * from market_quotes where instrument='BTC/USD' and exchange='gdax' and time between '2020-06-02 00:00:00' and '2020-06-03 00:00:00'
It takes almost 2 minutes to return 500k rows:
Index Scan using _hyper_1_1_chunk_market_quotes_instrument_exchange_time_idx on _hyper_1_1_chunk (cost=0.70..1353661.85 rows=1274806 width=183) (actual time=5.165..99990.424 rows=952931 loops=1)
" Index Cond: (((instrument)::text = 'BTC/USD'::text) AND ((exchange)::text = 'gdax'::text) AND (""time"" >= '2020-06-02 00:00:00'::timestamp without time zone) AND (""time"" <= '2020-06-02 01:00:00'::timestamp without time zone))"
Planning Time: 11.389 ms
JIT:
Functions: 2
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 0.404 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.404 ms"
Execution Time: 100121.392 ms
And when I run the query below:
select * from market_quotes where instrument='BTC/USD' and exchange='gdax' and time between '2020-06-02 00:00:00' and '2020-06-03 00:00:00'
This ran for >40 mins and crashed.
What can I do to speed the queries up? I often do 1 day queries - would it help if I added another column corresponding to the day of the week and indexed on that as well?
Should I be querying with a subset of rows each time instead and piecing the information together? (i.e. 10000 rows at a time)

It would be interesting to see the result of EXPLAIN (ANALYZE, BUFFERS) for the query after setting track_io_timing to on.
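That would look something like the following (a sketch; track_io_timing is a superuser-only setting, so it may need to be enabled in postgresql.conf instead):
SET track_io_timing = on;  -- superuser-only; alternatively enable it in postgresql.conf

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM market_quotes
WHERE instrument = 'BTC/USD'
  AND exchange = 'gdax'
  AND time BETWEEN '2020-06-02 00:00:00' AND '2020-06-03 00:00:00';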
But if you are desperate to speed up an index range scan, the best you can do is to cluster the table:
CLUSTER market_quotes USING market_quotes_instrument_exchange_time_idx;
This will rewrite the table and block any concurrent access.
Another approach would be to use a pre-aggregated materialized view, if you can live with slightly stale data.
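A minimal sketch of that, assuming one-minute buckets of the top-of-book columns are enough for your analysis (market_quotes_1min is a made-up name):
CREATE MATERIALIZED VIEW market_quotes_1min AS
SELECT instrument,
       exchange,
       time_bucket('1 minute', time) AS bucket,
       last(bid_1_price, time)       AS bid_1_price,
       last(ask_1_price, time)       AS ask_1_price,
       last(bid_1_quantity, time)    AS bid_1_quantity,
       last(ask_1_quantity, time)    AS ask_1_quantity
FROM market_quotes
GROUP BY instrument, exchange, bucket;

-- A unique index allows REFRESH ... CONCURRENTLY and speeds up range scans:
CREATE UNIQUE INDEX ON market_quotes_1min (instrument, exchange, bucket DESC);

-- Refresh periodically (e.g. from cron) to pick up new data:
REFRESH MATERIALIZED VIEW CONCURRENTLY market_quotes_1min;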

Related

TimescaleDB very slow query after compression

I have a hypertable (around 300 million rows) with the following schema:
CREATE TABLE IF NOT EXISTS candlesticks(
open_time TIMESTAMP NOT NULL,
close_time TIMESTAMP NOT NULL,
open DOUBLE PRECISION,
high DOUBLE PRECISION,
low DOUBLE PRECISION,
close DOUBLE PRECISION,
volume DOUBLE PRECISION,
quote_volume DOUBLE PRECISION,
symbol VARCHAR (20) NOT NULL,
exchange VARCHAR (256),
PRIMARY KEY (symbol, open_time, exchange)
);
After compressing it with this query
ALTER TABLE candlesticks SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'symbol, exchange'
);
The following query takes several minutes (it seems TimescaleDB decompresses every chunk), whereas before it took only 1/2 seconds:
SELECT DISTINCT ON (symbol) * FROM candlesticks
ORDER BY symbol, open_time DESC;
If I add a time condition like open_time >= now() - INTERVAL '5 minutes', it's better (see the sketch below).
I'm not really comfortable with TimescaleDB / SQL performance, so maybe this is normal and I shouldn't use compression for my table?
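For reference, the time-bounded variant mentioned above looks like this; the time predicate should let TimescaleDB exclude (and avoid decompressing) all chunks outside that range:
SELECT DISTINCT ON (symbol) *
FROM candlesticks
WHERE open_time >= now() - INTERVAL '5 minutes'
ORDER BY symbol, open_time DESC;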

High CPU usage during select query

When running a select query at 300+ QPS, query latency climbs to 20k ms with CPU usage at ~100%. I couldn't identify what the issue is here.
System details:
vCPUs: 16
RAM: 64 GiB
Create hypertable:
CREATE TABLE test_table (time TIMESTAMPTZ NOT NULL, val1 TEXT NOT NULL, val2 DOUBLE PRECISION NOT NULL, val3 DOUBLE PRECISION NOT NULL);
SELECT create_hypertable('test_table', 'time');
SELECT add_retention_policy('test_table', INTERVAL '2 day');
Query:
SELECT * from test_table where val1='abc'
Timescale version:
2.5.1
No. of rows in test-table:
6.9M
Latencies:
p50=8493.722203,
p75=8566.074792,
p95=21459.204448,
p98=21459.204448,
p99=21459.204448,
p999=21459.204448
CPU usage of the postgres SELECT queries, as reported by the htop command.
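One detail worth noting: the schema above has no index on val1, so each of these queries has to scan the whole hypertable. A hedged sketch of the kind of secondary index usually added in this situation (the index name and column order are assumptions):
-- Lets lookups by val1 use an index scan instead of scanning every chunk:
CREATE INDEX IF NOT EXISTS test_table_val1_time_idx
    ON test_table (val1, time DESC);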

SQL partition by query optimization

I have the prices table below, and I want to obtain a last-30-days price array and the last-year price from it, as shown below.
CREATE TABLE prices
(
id integer NOT NULL,
"time" timestamp without time zone NOT NULL,
close double precision NOT NULL,
CONSTRAINT prices_pkey PRIMARY KEY (id, "time")
)
select id,time,first_value(close) over (partition by id order by time range between '1 year' preceding and CURRENT ROW) as prev_year_close,
array_agg(p.close) OVER (PARTITION BY id ORDER BY time ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) as prices_30
from prices p
However, I want to place a WHERE clause on the prices table so that I get the last-30-days price array and last-year price only for some specific rows, e.g. WHERE time >= now() - INTERVAL '1 week' (so the query runs only over the last week of values rather than the entire table).
But with such a WHERE clause the window partition only runs over the pre-filtered rows, which produces wrong results like the listing below (the naive query is sketched after the listing):
time, id, last_30_days
-1day, X, [A,B,C,D,E, F,G]
-2day, X, [A,B,C,D,E,F]
-3day, X, [A,B,C,D,E]
-4day, X, [A,B,C,D]
-5day, X, [A,B,C]
-6day, X, [A,B]
-7day, X, [A]
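For illustration, that naive pre-filtered form looks roughly like this:
select id, time,
       first_value(close) over (partition by id order by time
           range between interval '1 year' preceding and current row) as prev_year_close,
       array_agg(close) over (partition by id order by time
           rows between 30 preceding and current row) as prices_30
from prices
where time >= current_timestamp - interval '1 week';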
How do I fix this so that the window always takes 30 values irrespective of the WHERE condition, without having to run the query on the entire table and then select a subset of rows with the WHERE clause? My prices table is huge, and running the query on the entire table is very expensive.
EDIT
CREATE TABLE prices
(
id integer NOT NULL,
"time" timestamp without time zone NOT NULL,
close double precision NOT NULL,
prev_30 double precision[],
prev_year double precision,
CONSTRAINT prices_pkey PRIMARY KEY (id, "time")
)
Use a subquery:
SELECT *
FROM (select id, time,
first_value(close) over (partition by id order by time range between '1 year' preceding and CURRENT ROW) as prev_year_close,
array_agg(p.close) OVER (PARTITION BY id ORDER BY time ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) as prices_30
from prices p
WHERE time >= current_timestamp - INTERVAL '1 year 1 week') AS q
WHERE time >= current_timestamp - INTERVAL '1 week' ;

postgres generate_series performance slower on server than laptop

I have a series of views that build off of each other like this:
rpt_scn_appl_target_v --> rpt_scn_appl_target_unnest_v --> rpt_scn_appl_target_unnest_timeseries_v --> rpt_scn_appl_target_unnest_timeseries_ftprnt_v
In this view, rpt_scn_appl_target_unnest_timeseries_v, I use the generate_series function to generate monthly rows between 1/1/2015 and 12/31/2019.
What I've noticed is this:
this one takes 10secs to run
select * from rpt_scn_appl_target_unnest_timeseries_ftprnt_v where scenario_id = 202
this one takes 9 secs to run:
select * from rpt_scn_appl_target_unnest_timeseries_v where scenario_id = 202
this one takes 219msecs to run:
select * from rpt_scn_appl_target_unnest_v where scenario_id = 202
this one takes <1sec to run:
select * from rpt_scn_appl_target_v where scenario_id = 202
I've noticed that if I comment out the generate_series code in the view, the query runs in under a second, but with it, it takes 10 secs to run.
rpt_scn_appl_target_unnest_timeseries_v View:
CREATE OR REPLACE VIEW public.rpt_scn_appl_target_unnest_timeseries_v AS
SELECT a.scenario_id,
a.scenario_desc,
a.scenario_status,
a.scn_appl_lob_ownr_nm,
a.scn_appl_sub_lob_ownr_nm,
a.scenario_asv_id,
a.appl_ci_id,
a.appl_ci_nm,
a.appl_ci_comm_nm,
a.appl_lob_ownr_nm,
a.appl_sub_lob_ownr_nm,
a.cost,
a.agg_complexity,
a.srvc_lvl,
a.dc_loc,
a.start_dt,
a.end_dt,
a.decomm_dt,
a.asv_target_id,
a.asv_target_desc,
a.asv_target_master,
a.prod_qty_main_cloud,
a.prod_cost_main_cloud,
a.non_prod_qty_main_cloud,
a.non_prod_cost_main_cloud,
a.prod_qty_main_onprem,
a.prod_cost_main_onprem,
a.non_prod_qty_main_onprem,
a.non_prod_cost_main_onprem,
a.prod_qty_target_onprem,
a.prod_cost_target_onprem,
a.non_prod_qty_target_onprem,
a.non_prod_cost_target_onprem,
a.prod_qty_target_cloud,
a.prod_cost_target_cloud,
a.non_prod_qty_target_cloud,
a.non_prod_cost_target_cloud,
a.type,
a.cost_main,
a.qty_main,
a.cost_target,
a.qty_target,
a.dt,
a.mth_dt,
CASE
WHEN a.type ~~ '%onprem%'::text THEN 'On-Prem'::text
ELSE 'Cloud'::text
END AS env_stat,
CASE
WHEN a.type ~~ '%non_prod%'::text THEN 'Non-Prod'::text
ELSE 'Prod'::text
END AS env,
CASE
WHEN a.dt <= a.decomm_dt THEN COALESCE(a.cost_main, 0::double precision)
WHEN a.decomm_dt IS NULL AND a.end_dt IS NULL AND a.start_dt IS NULL THEN a.cost_main
ELSE 0::double precision
END AS cost_curr,
CASE
WHEN a.dt <= a.decomm_dt THEN COALESCE(a.qty_main, 0::bigint)
WHEN a.decomm_dt IS NULL AND a.end_dt IS NULL AND a.start_dt IS NULL THEN a.qty_main
ELSE 0::bigint
END AS qty_curr,
CASE
WHEN a.dt < a.start_dt THEN 0::bigint
WHEN a.dt >= a.start_dt AND a.dt < a.end_dt AND a.type ~~ '%non_prod%'::text THEN COALESCE(a.qty_target, 0::bigint)
WHEN a.dt > a.end_dt THEN COALESCE(a.qty_target, 0::bigint)
ELSE 0::bigint
END AS qty_trgt,
CASE
WHEN a.dt < a.start_dt THEN 0::double precision
WHEN a.dt >= a.start_dt AND a.dt < a.end_dt AND a.type ~~ '%non_prod%'::text THEN COALESCE(a.cost_target, 0::double precision)
WHEN a.dt > a.end_dt THEN COALESCE(a.cost_target, 0::double precision)
ELSE 0::double precision
END AS cost_trgt
FROM ( SELECT t1.scenario_id,
t1.scenario_desc,
t1.scenario_status,
t1.scn_appl_lob_ownr_nm,
t1.scn_appl_sub_lob_ownr_nm,
t1.scenario_asv_id,
t1.appl_ci_id,
t1.appl_ci_nm,
t1.appl_ci_comm_nm,
t1.appl_lob_ownr_nm,
t1.appl_sub_lob_ownr_nm,
t1.cost,
t1.agg_complexity,
t1.srvc_lvl,
t1.dc_loc,
t1.start_dt,
t1.end_dt,
t1.decomm_dt,
t1.asv_target_id,
t1.asv_target_desc,
t1.asv_target_master,
t1.prod_qty_main_cloud,
t1.prod_cost_main_cloud,
t1.non_prod_qty_main_cloud,
t1.non_prod_cost_main_cloud,
t1.prod_qty_main_onprem,
t1.prod_cost_main_onprem,
t1.non_prod_qty_main_onprem,
t1.non_prod_cost_main_onprem,
t1.prod_qty_target_onprem,
t1.prod_cost_target_onprem,
t1.non_prod_qty_target_onprem,
t1.non_prod_cost_target_onprem,
t1.prod_qty_target_cloud,
t1.prod_cost_target_cloud,
t1.non_prod_qty_target_cloud,
t1.non_prod_cost_target_cloud,
t1.type,
t1.cost_main,
t1.qty_main,
t1.cost_target,
t1.qty_target,
generate_series('2015-01-01 00:00:00'::timestamp without time zone, '2019-12-31 00:00:00'::timestamp without time zone, '1 mon'::interval)::date AS dt,
to_char(generate_series('2015-01-01 00:00:00'::timestamp without time zone, '2019-12-31 00:00:00'::timestamp without time zone, '1 mon'::interval)::date::timestamp with time zone, 'YYYY-MM'::text) AS mth_dt
FROM rpt_scn_appl_target_unnest_v t1) a;
I've also noticed that, with the same data, tables, and views, the database on my laptop outperforms the AWS RDS server even though the laptop has much less RAM and CPU. I'm running Postgres 9.6 on both my laptop and AWS RDS. My laptop is a MacBook Pro with 16 GB of RAM and a dual-core i7; for RDS, I'm using an m4.4xlarge, which has 16 cores and 64 GB of RAM.
Here is the AWS explain plan:
https://explain.depesz.com/s/UGF
My laptop explain plan:
https://explain.depesz.com/s/zaWt
So I guess my questions are:
1.) Why is the query taking longer to run on AWS than on my laptop?
2.) Is there anything one can do to speed up the generate_series function? Would creating a separate calendar table and joining to it be faster?
1) Your laptop has fewer rows.
AWS: (cost=17,814.33..7,931,812.83 rows=158,200,000 width=527)
Laptop: (cost=15,238.52..4,002,252.94 rows=79,700,000 width=2,030)
2) If you are going to use the series several times, it is better to create a calendar table: 10 years is only 3,650 rows, and 100 years only 36k rows.
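A minimal sketch of the calendar-table approach at the monthly grain the view uses (calendar_months is a made-up name):
-- Build the month series once and store it:
CREATE TABLE calendar_months AS
SELECT gs::date               AS dt,
       to_char(gs, 'YYYY-MM') AS mth_dt
FROM generate_series('2015-01-01'::timestamp,
                     '2019-12-31'::timestamp,
                     interval '1 month') AS gs;

-- In the view, replace the two generate_series() calls with a cross join:
--   FROM rpt_scn_appl_target_unnest_v t1
--   CROSS JOIN calendar_months c
-- and select c.dt and c.mth_dt instead of the generate_series expressions.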

Find difference between timestamps as a number of custom intervals in PostgreSQL

I would like to find the difference between two timestamps (with time zone) as a number of custom intervals, so the function should look like custom_diff(timestamptz from, timestamptz to, interval custom).
Keep in mind that this is not equivalent to (to - from) / custom: custom_diff('2016-08-01 00:00:00', '2016-09-01 00:00:00', '1 day') is exactly 31, but ('2016-09-01 00:00:00' - '2016-08-01 00:00:00') / '1 day' = '1 month' / '1 day', which is ambiguous.
Also, I understand that in general there is no exact result for such an operation (e.g. custom_diff('2016-08-01 00:00:00', '2016-09-01 00:00:00', '1 month 1 day')), so there could be a family of functions (round-to-nearest, round-to-lower, round-to-upper and truncating, all of which should return an integer).
Is there any standard/common way to do such a calculation in PostgreSQL (PL/pgSQL)? My main interest is the round-to-nearest variant.
The best approach I have come up with is to iteratively add/subtract the interval custom to/from the timestamptz from and compare with the timestamptz to. It can also be optimized by first finding an approximate result (for example, dividing the difference in seconds between the timestamps by an approximation of the interval custom in seconds) to reduce the number of iterations.
UPD 1:
Why
SELECT EXTRACT(EPOCH FROM (timestamp '2016-08-01 10:00'
- timestamp '2016-08-01 00:00'))
/ EXTRACT(EPOCH FROM interval '1 day');
is a wrong solution: let's try it ourselves:
SELECT EXTRACT(EPOCH FROM ( TIMESTAMPTZ '2016-01-01 utc' -
TIMESTAMPTZ '1986-01-01 utc' ))
/ EXTRACT(EPOCH FROM INTERVAL '1 month');
Result is 365.23.... Then check result:
SELECT ( TIMESTAMPTZ '1986-01-01 utc' + 365 * INTERVAL '1 month' )
AT TIME ZONE 'utc';
Result is 2016-06-01 00:00:00.000000. Of course 365 is the wrong result, because the timestamps in this example span exactly 30 years, and every year has exactly 12 months, so the right answer is 12*30=360.
UPD 2:
My solution is
CREATE OR REPLACE FUNCTION custom_diff(
_from TIMESTAMPTZ, _to TIMESTAMPTZ, _custom INTERVAL, OUT amount INTEGER)
RETURNS INTEGER
LANGUAGE plpgsql
AS $function$
DECLARE
max_iterations INTEGER :=10;
t INTEGER;
BEGIN
amount:=0;
WHILE max_iterations > 0 AND NOT (
extract(EPOCH FROM _to) <= ( extract(EPOCH FROM _from) + extract(EPOCH FROM _from + _custom) ) / 2
AND
extract(EPOCH FROM _to) >= ( extract(EPOCH FROM _from) + extract(EPOCH FROM _from - _custom) ) / 2
) LOOP
-- RAISE NOTICE 'iter: %', max_iterations;
t:=EXTRACT(EPOCH FROM ( _to - _from )) / EXTRACT(EPOCH FROM _custom);
_from:=_from + t * _custom;
amount:=amount + t;
max_iterations:=max_iterations - 1;
END LOOP;
RETURN;
END;
$function$
but I am not sure that it is correct, and I am still waiting for suggestions about an existing/common solution.
You can get an exact result after extracting the epoch from both intervals:
SELECT EXTRACT(EPOCH FROM (timestamp '2016-08-01 10:00'
- timestamp '2016-08-01 00:00'))
/ EXTRACT(EPOCH FROM interval '1 day'); -- any given interval
If you want a rounded (truncated) result, a simple option is to cast both to integer. Integer division cuts off the remainder.
SELECT EXTRACT(EPOCH FROM (ts_to - ts_from))::int
/ EXTRACT(EPOCH FROM interval '1 day')::int; -- any given interval
You can easily wrap the logic into an IMMUTABLE SQL function, for instance:
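A minimal sketch of such a wrapper (the name interval_diff is made up; apply round(), floor(), ceil() or trunc() to the result for the rounding variant you want):
CREATE OR REPLACE FUNCTION interval_diff(ts_from timestamptz,
                                         ts_to   timestamptz,
                                         step    interval)
  RETURNS double precision
  LANGUAGE sql IMMUTABLE AS
$func$
SELECT (EXTRACT(EPOCH FROM (ts_to - ts_from))
      / EXTRACT(EPOCH FROM step))::double precision;
$func$;

-- Example: days between two timestamps, rounded to nearest
-- SELECT round(interval_diff('2016-08-01', '2016-09-01', interval '1 day'));  -- 31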
You are drawing the wrong conclusions from what you read in the manual. The result of a timestamp subtraction is an exact interval, storing only days and seconds (not months). So the result is exact. Try my query, it isn't "ambiguous".
You can avoid involving the data type interval:
SELECT (EXTRACT(EPOCH FROM ts_to) - EXTRACT(EPOCH FROM ts_from))
       / 86400; -- = 24*60*60 -- any given interval as number of seconds
But the result is the same.
Aside:
"Exact" is an elusive term when dealing with timestamps. You may have to take DST rules and other corner cases of your time zone into consideration. You might convert to UTC time or use timestamptz before doing the math.