TimescaleDB very slow query after compression - postgresql

I have a hypertable (around 300 million rows) with the following schema:
CREATE TABLE IF NOT EXISTS candlesticks(
open_time TIMESTAMP NOT NULL,
close_time TIMESTAMP NOT NULL,
open DOUBLE PRECISION,
high DOUBLE PRECISION,
low DOUBLE PRECISION,
close DOUBLE PRECISION,
volume DOUBLE PRECISION,
quote_volume DOUBLE PRECISION,
symbol VARCHAR (20) NOT NULL,
exchange VARCHAR (256),
PRIMARY KEY (symbol, open_time, exchange)
);
After compressing it with this query
ALTER TABLE candlesticks SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'symbol, exchange'
);
The following query takes several minutes (it seems TimescaleDB decompresses every chunk), whereas before compression it took only one or two seconds:
SELECT DISTINCT ON (symbol) * FROM candlesticks
ORDER BY symbol, open_time DESC;
If I add a time condition like open_time >= now() - INTERVAL '5 minutes' it's better.
I'm not really comfortable with TimescaleDB / SQL performance tuning, so maybe this is normal and I shouldn't use compression for my table?
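For reference, the time-filtered variant mentioned above looks like this (only a sketch; the 5-minute window is just an example), which is presumably why it helps: the planner can skip, and avoid decompressing, chunks outside the time range:
-- Same query restricted to recent data, so older compressed chunks can be excluded
SELECT DISTINCT ON (symbol) * FROM candlesticks
WHERE open_time >= now() - INTERVAL '5 minutes'
ORDER BY symbol, open_time DESC;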

Related

High CPU usage caused during select query

When running a select query at 300+ QPS, query latency increases to as much as 20,000 ms with CPU usage around 100%. I couldn't identify what the issue is here.
System details:
vCPUs: 16
RAM: 64 GiB
Create hypertable:
CREATE TABLE test_table (time TIMESTAMPTZ NOT NULL, val1 TEXT NOT NULL, val2 DOUBLE PRECISION NOT NULL, val3 DOUBLE PRECISION NOT NULL);
SELECT create_hypertable('test_table', 'time');
SELECT add_retention_policy('test_table', INTERVAL '2 day');
Query:
SELECT * from test_table where val1='abc'
Timescale version:
2.5.1
No. of rows in test_table:
6.9M
Latencies (ms):
p50=8493.722203,
p75=8566.074792,
p95=21459.204448,
p98=21459.204448,
p99=21459.204448,
p999=21459.204448
CPU usage of postgres SELECT query as per the htop command.
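One thing worth noting about the schema above: there is no index on val1 (create_hypertable indexes only time by default), so each query has to scan every chunk for the matching rows. A sketch of a supporting index (the index name is just an example):
-- Hypothetical index to support the val1 filter; TimescaleDB will create
-- a corresponding index on every chunk of the hypertable.
CREATE INDEX IF NOT EXISTS test_table_val1_time_idx
    ON test_table (val1, time DESC);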

Postgres - convert international atomic time to UTC time (for loop with IF inside SQL function body)

I need to store TAI time in a pg database. This requires a custom type,
CREATE TYPE tai AS (
    secs int,
    nanosecs int
);
which maps 1:1 to a GNU C timespec struct, with the TAI epoch of Jan 1 1958 00:00:00 as its origin and a monotonic clock. A table of leap seconds is the auxiliary data required to convert these to UTC timestamps:
DROP TABLE IF EXISTS leapseconds;
CREATE TABLE IF NOT EXISTS leapseconds (
id serial PRIMARY KEY,
moment TIMESTAMP WITHOUT TIME ZONE NOT NULL,
skew int NOT NULL
);
INSERT INTO leapseconds (moment, skew) VALUES -- note: pg assumes 00:00:00 if no hh:mm:ss given
('1972-Jan-01', 10),
('1972-Jun-30', 1),
('1972-Dec-31', 1),
('1973-Dec-31', 1),
('1974-Dec-31', 1),
('1975-Dec-31', 1),
('1976-Dec-31', 1),
('1977-Dec-31', 1),
('1978-Dec-31', 1),
('1979-Dec-31', 1),
('1981-Jun-30', 1),
('1982-Jun-30', 1),
('1983-Jun-30', 1),
('1985-Jun-30', 1),
('1987-Dec-31', 1),
('1989-Dec-31', 1),
('1990-Dec-31', 1),
('1992-Jun-30', 1),
('1993-Jun-30', 1),
('1994-Jun-30', 1),
('1995-Dec-31', 1),
('1997-Jun-30', 1),
('1998-Dec-31', 1),
('2005-Dec-31', 1),
('2008-Dec-31', 1),
('2012-Jun-30', 1),
('2015-Jun-30', 1),
('2016-Dec-31', 1)
;
I need a function to convert these to UTC timestamps. It would be optimal for this to live in Postgres to avoid latency. The SQL/Python pseudocode to do this is:
# SQL
SELECT moment, skew
FROM leapseconds
ORDER BY moment ASC
# -> fetched into `tuples`
# python
def tai_to_utc(tai):
    modtime = to_timestamp(tai.secs)  # to_timestamp from pgsql
    modtime += tai.nanosecs           # timestamp in pg has usec precision, gloss over it
    for moment, skew in tuples:
        if modtime > moment:
            modtime += skew           # type mismatch, gloss over it
    return modtime
I know how to do the typecasting, but I'm struggling to write this for+if in PL/pgSQL. Is the path of least resistance to learn how to write a stored C procedure and do this in the database? I could also have the client provide the UTC timestamps and do this conversion based on a query to the database, but the chatter of pulling data from the database in order to insert into it is going to really hurt ingest speed.
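For what it's worth, a literal PL/pgSQL translation of the pseudocode above could look roughly like this (only a sketch: the function name is made up, it treats leapseconds.moment as UTC, and it applies the skews cumulatively exactly as the pseudocode does):
CREATE FUNCTION tai_to_utc_loop(_tai tai)  -- hypothetical name
RETURNS timestamptz
AS
$$
DECLARE
    _ls leapseconds%ROWTYPE;
    -- TAI epoch plus the seconds/nanoseconds carried in the composite value
    _ts timestamptz := '1958-01-01T00:00:00+00:00'::timestamptz
                       + make_interval(secs => _tai.secs + _tai.nanosecs / 1e9);
BEGIN
    FOR _ls IN SELECT * FROM leapseconds ORDER BY moment LOOP
        IF _ts > _ls.moment AT TIME ZONE 'UTC' THEN  -- assumes moments are stored as UTC
            _ts := _ts + make_interval(secs => _ls.skew);
        END IF;
    END LOOP;
    RETURN _ts;
END;
$$
LANGUAGE plpgsql;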
You basically need the sum() window function to get the cumulative sum of leap seconds over the moments. Add that to the base timestamp, then take the row with the youngest moment that is at or before the base timestamp with the leap seconds of all previous moments added. You can use DISTINCT ON and LIMIT for that.
CREATE FUNCTION tai_to_utc
    (_tai tai)
RETURNS timestamptz
AS
$$
SELECT DISTINCT ON (moment)
       ts
FROM (SELECT moment AT TIME ZONE 'UTC' AS moment,
             skew,
             '1958-01-01T00:00:00+00:00'::timestamptz
                 + (_tai.secs || ' seconds')::interval
                 + (_tai.nanosecs / 1000 || ' microseconds')::interval
                 + (sum(skew) OVER (ORDER BY moment) || ' seconds')::interval AS ts
      FROM (SELECT moment,
                   skew
            FROM leapseconds
            UNION ALL
            SELECT '-infinity',
                   0) AS x) AS y
WHERE moment <= ts - (skew || ' seconds')::interval
ORDER BY moment DESC
LIMIT 1;
$$
LANGUAGE SQL;
However, I'd recommend changing the type of leapseconds.moment to timestamptz (timestamp with time zone) and inserting the moments with an explicit time zone, to ensure things are what they are meant to be. That way the awkward time zone conversion isn't needed in the function.
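That column change could be done in place with something like this (a sketch; the USING clause assumes the existing values are meant to be UTC):
ALTER TABLE leapseconds
    ALTER COLUMN moment TYPE timestamptz
    USING moment AT TIME ZONE 'UTC';  -- reinterpret the stored values as UTC
With the column converted, the function becomes: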
CREATE FUNCTION tai_to_utc
    (_tai tai)
RETURNS timestamptz
AS
$$
SELECT DISTINCT ON (moment)
       ts
FROM (SELECT moment,
             skew,
             '1958-01-01T00:00:00+00:00'::timestamptz
                 + (_tai.secs || ' seconds')::interval
                 + (_tai.nanosecs / 1000 || ' microseconds')::interval
                 + (sum(skew) OVER (ORDER BY moment) || ' seconds')::interval AS ts
      FROM (SELECT moment,
                   skew
            FROM leapseconds
            UNION ALL
            SELECT '-infinity',
                   0) AS x) AS y
WHERE moment <= ts - (skew || ' seconds')::interval
ORDER BY moment DESC
LIMIT 1;
$$
LANGUAGE SQL;
db<>fiddle
An index on leapseconds (moment, skew) might improve performance. Though leapseconds is quite small, so it might not do that much.
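For completeness, the index mentioned above would be:
-- covering index for the (moment, skew) lookup
CREATE INDEX ON leapseconds (moment, skew);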

SQL partition by query optimization

I have the prices table below, and I want to obtain a last-30-days price array and the last-year price from it, as shown below:
CREATE TABLE prices
(
id integer NOT NULL,
"time" timestamp without time zone NOT NULL,
close double precision NOT NULL,
CONSTRAINT prices_pkey PRIMARY KEY (id, "time")
)
select id, time,
       first_value(close) over (partition by id order by time
                                range between '1 year' preceding and CURRENT ROW) as prev_year_close,
       array_agg(p.close) OVER (PARTITION BY id ORDER BY time
                                ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) as prices_30
from prices p
However, I want to place a where clause on the prices table so I get the last-30-days price array and last-year price only for some specific rows, e.g. where time >= current_timestamp - interval '1 week' (so the query runs only over the last week's values as opposed to the entire table).
But a where clause pre-filters the rows, and the window partition then runs only on those filtered rows, which produces wrong results, like:
time, id, last_30_days
-1day, X, [A,B,C,D,E, F,G]
-2day, X, [A,B,C,D,E,F]
-3day, X, [A,B,C,D,E]
-4day, X, [A,B,C,D]
-5day, X, [A,B,C]
-6day, X, [A,B]
-7day, X, [A]
How do I fix this so that the window over each partition always sees the previous 30 values irrespective of the where condition, without having to run the query on the entire table and then select a subset of rows with a where clause? My prices table is huge and running the query on the entire table is very expensive.
EDIT
CREATE TABLE prices
(
id integer NOT NULL,
"time" timestamp without time zone NOT NULL,
close double precision NOT NULL,
prev_30 double precision[],
prev_year double precision,
CONSTRAINT prices_pkey PRIMARY KEY (id, "time")
)
Use a subquery: pre-filter with enough extra history for the window frames to be complete ('1 year 1 week'), then apply the real cut-off ('1 week') in the outer query:
SELECT *
FROM (select id, time,
             first_value(close) over (partition by id order by time
                                      range between '1 year' preceding and CURRENT ROW) as prev_year_close,
             array_agg(p.close) OVER (PARTITION BY id ORDER BY time
                                      ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) as prices_30
      from prices p
      WHERE time >= current_timestamp - INTERVAL '1 year 1 week') AS q
WHERE time >= current_timestamp - INTERVAL '1 week';

Timescale/Postgres queries very slow/inefficient?

I have a table in TimescaleDB with the following schema:
create table market_quotes
(
instrument varchar(16) not null,
exchange varchar(16) not null,
time timestamp not null,
bid_1_price double precision,
ask_1_price double precision,
bid_1_quantity double precision,
ask_1_quantity double precision,
bid_2_price double precision,
ask_2_price double precision,
bid_2_quantity double precision,
ask_2_quantity double precision,
bid_3_price double precision,
ask_3_price double precision,
bid_3_quantity double precision,
ask_3_quantity double precision,
bid_4_price double precision,
ask_4_price double precision,
bid_4_quantity double precision,
ask_4_quantity double precision,
bid_5_price double precision,
ask_5_price double precision,
bid_5_quantity double precision,
ask_5_quantity double precision
);
and the following composite index:
create index market_quotes_instrument_exchange_time_idx
on market_quotes (instrument asc, exchange asc, time desc);
When I run the query:
EXPLAIN ANALYZE select * from market_quotes where instrument='BTC/USD' and exchange='gdax' and time between '2020-06-02 00:00:00' and '2020-06-03 00:00:00'
It takes almost 2 minutes to return 500k rows:
Index Scan using _hyper_1_1_chunk_market_quotes_instrument_exchange_time_idx on _hyper_1_1_chunk (cost=0.70..1353661.85 rows=1274806 width=183) (actual time=5.165..99990.424 rows=952931 loops=1)
" Index Cond: (((instrument)::text = 'BTC/USD'::text) AND ((exchange)::text = 'gdax'::text) AND (""time"" >= '2020-06-02 00:00:00'::timestamp without time zone) AND (""time"" <= '2020-06-02 01:00:00'::timestamp without time zone))"
Planning Time: 11.389 ms
JIT:
Functions: 2
" Options: Inlining true, Optimization true, Expressions true, Deforming true"
" Timing: Generation 0.404 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.404 ms"
Execution Time: 100121.392 ms
And when I run the query below:
select * from market_quotes where instrument='BTC/USD' and exchange='gdax' and time between '2020-06-02 00:00:00' and '2020-06-03 00:00:00'
This ran for >40 mins and crashed.
What can I do to speed the queries up? I often do 1 day queries - would it help if I added another column corresponding to the day of the week and indexed on that as well?
Should I instead query a subset of rows each time and piece the information together (e.g. 10,000 rows at a time)?
It would be interesting to see the result of EXPLAIN (ANALYZE, BUFFERS) for the query after setting track_io_timing to on.
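That would look like this (note that changing track_io_timing requires appropriate privileges):
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM market_quotes
WHERE instrument = 'BTC/USD'
  AND exchange = 'gdax'
  AND time BETWEEN '2020-06-02 00:00:00' AND '2020-06-03 00:00:00';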
But if you are desperate to speed up an index range scan, the best you can do is to cluster the table:
CLUSTER market_quotes USING market_quotes_instrument_exchange_time_idx;
This will rewrite the table and block any concurrent access.
Another approach would be to use a pre-aggregated materialized view, if you can live with slightly stale data.
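With TimescaleDB that could be a continuous aggregate, for instance (just a sketch; the view name, the one-minute bucket and the averaged columns are only examples):
CREATE MATERIALIZED VIEW market_quotes_1min
WITH (timescaledb.continuous) AS
SELECT instrument,
       exchange,
       time_bucket('1 minute', time) AS bucket,
       avg(bid_1_price) AS avg_bid_1_price,
       avg(ask_1_price) AS avg_ask_1_price
FROM market_quotes
GROUP BY instrument, exchange, bucket;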

postgres - estimate index size for timestamp column

Have a postgres table, ENTRIES, with a 'made_at' column of type timestamp without time zone.
That table has a composite btree index on that column together with another column (USER_ID, a foreign key):
btree (user_id, date_trunc('day'::text, made_at))
As you can see, the date is truncated at the 'day'. The total size of the index constructed this way is 130 MB -- there are 4,000,000 rows in the ENTRIES table.
QUESTION: How do I estimate the size of the index if I wanted the time to be precise to the second? Basically, truncate the timestamp at the second rather than the day (which should be easy to do, I hope).
Interesting question! According to my investigation they will be the same size.
My intuition told me that there should be no difference between the size of your two indices, as timestamp types in PostgreSQL are of fixed size (8 bytes), and I supposed the truncate function simply zeroed out the appropriate number of least significant time bits, but I figured I had better support my guess with some facts.
I spun up a free dev database on heroku PostgreSQL and generated a table with 4M random timestamps, truncated to both day and second values as follows:
test_db=> SELECT * INTO ts_test FROM
(SELECT id,
ts,
date_trunc('day', ts) AS trunc_day,
date_trunc('second', ts) AS trunc_s
FROM (select generate_series(1, 4000000) AS id,
now() - '1 year'::interval * round(random() * 1000) AS ts) AS sub)
AS subq;
SELECT 4000000
test_db=> create index ix_day_trunc on ts_test (id, trunc_day);
CREATE INDEX
test_db=> create index ix_second_trunc on ts_test (id, trunc_s);
CREATE INDEX
test_db=> \d ts_test
Table "public.ts_test"
Column | Type | Modifiers
-----------+--------------------------+-----------
id | integer |
ts | timestamp with time zone |
trunc_day | timestamp with time zone |
trunc_s | timestamp with time zone |
Indexes:
"ix_day_trunc" btree (id, trunc_day)
"ix_second_trunc" btree (id, trunc_s)
test_db=> SELECT pg_size_pretty(pg_relation_size('ix_day_trunc'));
pg_size_pretty
----------------
120 MB
(1 row)
test_db=> SELECT pg_size_pretty(pg_relation_size('ix_second_trunc'));
pg_size_pretty
----------------
120 MB
(1 row)
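As a quick sanity check of the fixed-size point above, pg_column_size shows that both truncations occupy the same 8 bytes per value:
-- both columns return 8
SELECT pg_column_size(date_trunc('day', now())) AS day_bytes,
       pg_column_size(date_trunc('second', now())) AS second_bytes;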