High CPU usage during SELECT query - PostgreSQL

When running a SELECT query at 300+ QPS, query latency climbs as high as 20,000 ms and CPU usage reaches ~100%. I couldn't identify what the issue is here.
System details:
vCPUs: 16
RAM: 64 GiB
Create hypertable:
CREATE TABLE test_table (time TIMESTAMPTZ NOT NULL, val1 TEXT NOT NULL, val2 DOUBLE PRECISION NOT NULL, val3 DOUBLE PRECISION NOT NULL);
SELECT create_hypertable('test_table', 'time');
SELECT add_retention_policy('test_table', INTERVAL '2 day');
Query:
SELECT * from test_table where val1='abc'
Timescale version:
2.5.1
No. of rows in test_table:
6.9M
Latencies (ms):
p50=8493.722203,
p75=8566.074792,
p95=21459.204448,
p98=21459.204448,
p99=21459.204448,
p999=21459.204448
CPU usage of the Postgres SELECT query, as shown by the htop command.
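(Not part of the original post; a diagnostic sketch. val1 has no index in the DDL shown, so each of the 300+ queries per second likely sequential-scans every chunk. The plan can be checked, and an index added, roughly as below; the literal 'abc' is illustrative.)
-- inspect the plan for one representative query
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM test_table WHERE val1 = 'abc';
-- if the plan shows sequential scans across all chunks, an index on the filter column
-- (TimescaleDB propagates it to every chunk) avoids rescanning the 6.9M rows per request
CREATE INDEX IF NOT EXISTS test_table_val1_time_idx ON test_table (val1, time DESC);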

Related

TimescaleDB very slow query after compression

I have a hypertable (around 300 million rows) with the following schema:
CREATE TABLE IF NOT EXISTS candlesticks(
    open_time TIMESTAMP NOT NULL,
    close_time TIMESTAMP NOT NULL,
    open DOUBLE PRECISION,
    high DOUBLE PRECISION,
    low DOUBLE PRECISION,
    close DOUBLE PRECISION,
    volume DOUBLE PRECISION,
    quote_volume DOUBLE PRECISION,
    symbol VARCHAR (20) NOT NULL,
    exchange VARCHAR (256),
    PRIMARY KEY (symbol, open_time, exchange)
);
After compressing it with this query:
ALTER TABLE candlesticks SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol, exchange'
);
the following query takes several minutes (it seems TimescaleDB decompresses every chunk), whereas before it took only 1/2 seconds:
SELECT DISTINCT ON (symbol) * FROM candlesticks
ORDER BY symbol, open_time DESC;
If I add a time condition like open_time >= now() - INTERVAL '5 minutes', it's better.
I'm not really comfortable with TimescaleDB / SQL performance, so maybe this is normal and I shouldn't use compression for my table?
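(Not from the original post; a sketch of one knob worth checking. TimescaleDB's compress_orderby controls how rows are ordered inside each compressed segment, and ordering each symbol/exchange segment by open_time DESC matches what the DISTINCT ON (symbol) ... ORDER BY symbol, open_time DESC query asks for.)
ALTER TABLE candlesticks SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'symbol, exchange',
    timescaledb.compress_orderby = 'open_time DESC'
);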

Why is Clickhouse slower than PostgreSQL?

I want to use Clickhouse as an OLAP and PostgreSQL as an OLTP
database.
The problem is that queries to Clickhouse run slower than on Postgres. The query is as below:
select count(id) from {table_name}
Here is my table structure:
CREATE TABLE IF NOT EXISTS {table_name}
(
`id` UInt64,
`label` Nullable(FixedString(50)),
`query` Nullable(text),
`creation_datetime` DateTime,
`offset` UInt64,
`user_is_first_search` UInt8,
`user_date_of_start` Date,
`usage_type` Nullable(FixedString(20)),
`user_ip` Nullable(FixedString(200)),
`who_searched_query` Nullable(FixedString(15)),
`device_type` Nullable(FixedString(20)),
`device_os` Nullable(FixedString(20)),
`tab_type` Nullable(FixedString(20)),
`response_api_type` Nullable(FixedString(20)),
`total_response_time` Float64,
`retrieved_instant_answer` Nullable(FixedString(100)),
`is_relative_instant_answer` UInt8,
`meta_search_instant_answer_type` Nullable(FixedString(50)),
`settings_alignment` Nullable(FixedString(20)),
`settings_safe_search` Nullable(FixedString(30)),
`settings_search_results_number` Nullable(FixedString(30)),
`settings_proxy_image_urls` Nullable(FixedString(30)),
`cache_hit` Nullable(FixedString(20)),
`net_status` Nullable(FixedString(20)),
`is_transitional` UInt8
)
ENGINE = MergeTree() PARTITION BY creation_datetime ORDER BY (id)
I created an index on the datetime field in both databases and then ran an optimize query on both. Can anyone tell me why ClickHouse is slower than Postgres?
There are ways to shoot yourself in the foot with ClickHouse:
create table test ( id Int64, d Date ) Engine=MergeTree Order by id;
insert into test select number, today() from numbers(1e8);
select count() from test;
┌───count()─┐
│ 100000000 │
└───────────┘
1 rows in set. Elapsed: 0.002 sec.
select count(id) from test;
┌─count(id)─┐
│ 100000000 │
└───────────┘
1 rows in set. Elapsed: 0.239 sec. Processed 100.00 million rows, 800.00 MB (418.46 million rows/s., 3.35 GB/s.)
drop table test;
create table test ( id Int64, d Int64 ) Engine=MergeTree partition by (intDiv(d, 10000)) Order by id;
set max_partitions_per_insert_block=0;
insert into test select number, number from numbers(1e8);
select count(id) from test;
┌─count(id)─┐
│ 100000000 │
└───────────┘
1 rows in set. Elapsed: 1.050 sec. Processed 100.00 million rows, 800.00 MB (95.20 million rows/s., 761.61 MB/s.)
select count(d) from test;
┌──count(d)─┐
│ 100000000 │
└───────────┘
1 rows in set. Elapsed: 0.004 sec.
Finally I found what I did wrong: I should not have partitioned by the raw datetime field. I recreated the table without that partition and it got much faster.
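(A sketch of the usual alternative, not from the original thread: partition by a coarse bucket of the datetime, e.g. the month via toYYYYMM(), so a per-second DateTime value does not create a huge number of tiny partitions. The table name and trimmed column list below are placeholders.)
CREATE TABLE search_log
(
    `id` UInt64,
    `creation_datetime` DateTime
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(creation_datetime)
ORDER BY (id);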

Get Data From Postgres Table At every nth interval

Below is my table. I am inserting data into it from my Windows .NET application every 1 second. I want to write a query that fetches data from the table at every nth interval, for example every 5 seconds. Below is the query I am using, but it is not giving the required result. Please help me.
CREATE TABLE table_1
(
timestamp_col timestamp without time zone,
value_1 bigint,
value_2 bigint
)
This is the query I am using:
select timestamp_col,value_1,value_2
from (
select timestamp_col,value_1,value_2,
INTERVAL '5 Seconds' * (row_number() OVER(ORDER BY timestamp_col) - 1 )
+ timestamp_col as r
from table_1
) as dt
Where r = 1
Use the date_part() function with the modulo operator:
select timestamp_col, value_1, value_2
from table_1
where date_part('second', timestamp_col)::int % 5 = 0
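(A variant to consider, not part of the answer above: working off epoch seconds also handles step sizes that do not divide a minute evenly, e.g. every 7 seconds. It still assumes rows land on whole seconds.)
select timestamp_col, value_1, value_2
from table_1
where extract(epoch from timestamp_col)::bigint % 5 = 0;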

generate_series function in Amazon Redshift

I tried the below:
SELECT * FROM generate_series(2,4);
generate_series
-----------------
2
3
4
(3 rows)
SELECT * FROM generate_series(5,1,-2);
generate_series
-----------------
5
3
1
(3 rows)
But when I try,
select * from generate_series('2011-12-31'::timestamp, '2012-12-31'::timestamp, '1 day');
It generated an error:
ERROR: function generate_series(timestamp without time zone, timestamp without time zone, "unknown") does not exist
HINT: No function matches the given name and argument types. You may need to add explicit type casts.
I use PostgreSQL 8.0.2 on Redshift 1.0.757.
Any idea why it happens?
UPDATE:
generate_series is working with Redshift now.
SELECT CURRENT_DATE::TIMESTAMP - (i * interval '1 day') as date_datetime
FROM generate_series(1,31) i
ORDER BY 1
This will generate the dates for the previous 31 days.
The version of generate_series() that supports dates and timestamps was added in Postgres 8.4.
As Redshift is based on Postgres 8.0, you need to use a different way:
select timestamp '2011-12-31 00:00:00' + (i * interval '1 day')
from generate_series(1, (date '2012-12-31' - date '2011-12-31')) i;
If you "only" need dates, this can be abbreviated to:
select date '2011-12-31' + i
from generate_series(1, (date '2012-12-31' - date '2011-12-31')) i;
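The same pattern extends to finer steps, for example an hourly series over the same range (a sketch based on the answer above, not from the original thread):
select timestamp '2011-12-31 00:00:00' + (i * interval '1 hour')
from generate_series(1, 24 * (date '2012-12-31' - date '2011-12-31')) i;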
I found a solution here for my problem of not being able to generate a time dimension table on Redshift using generate_series(). You can generate a temporary sequence by using the following SQL snippet.
with digit as (
    select 0 as d union all
    select 1 union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all
    select 7 union all select 8 union all select 9
),
seq as (
    select a.d + (10 * b.d) + (100 * c.d) + (1000 * d.d) as num
    from digit a
    cross join digit b
    cross join digit c
    cross join digit d
    order by 1
)
select (getdate()::date - seq.num)::date as "Date"
from seq;
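Because this sequence is built from plain UNION ALLs rather than generate_series(), which Redshift evaluates only on the leader node, it can also feed an INSERT into a physical table. A sketch, with dim_time as a hypothetical target:
create table dim_time (calendar_date date);  -- hypothetical target table
insert into dim_time
with digit as (
    select 0 as d union all select 1 union all select 2 union all
    select 3 union all select 4 union all select 5 union all
    select 6 union all select 7 union all select 8 union all select 9
),
seq as (
    select a.d + (10 * b.d) + (100 * c.d) + (1000 * d.d) as num
    from digit a
    cross join digit b
    cross join digit c
    cross join digit d
)
select (getdate()::date - seq.num)::date
from seq;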
The generate_series() function, it seems, is not completely supported on Redshift yet. If I run the SQL mentioned in the answer by DJo, it works, because that SQL runs only on the leader node. If I prepend insert into dim_time to the same SQL, it doesn't work.
There is no generate_series() function in Redshift for date ranges, but you can generate the series with the steps below.
Step 1: Create a table genid and insert the constant value 1 once for every entry you need in the series. If you need the series generated for 12 months, insert 12 rows. It is better to insert more rows than you expect to need, e.g. 100, so that you do not run short.
create table genid(id int);
-- repeat this insert once for each entry needed in the series
insert into genid values(1);
Step 2: Create the table for which you need to generate the series.
create table pat(patid varchar(10),stdt timestamp, enddt timestamp);
insert into pat values('Pat01','2018-03-30 00:00:00.0','2018-04-30 00:00:00.0')
insert into pat values('Pat02','2018-02-28 00:00:00.0','2018-04-30 00:00:00.0')
insert into pat values('Pat03','2017-10-28 00:00:00.0','2018-04-30 00:00:00.0')
Step 3: This query will generate the series for you.
with cte as (
    select max(enddt) as maxdt
    from pat
),
cte2 as (
    select dateadd('month', -1 * row_number() over(order by 1), maxdt::date) as gendt
    from genid, cte
)
select *
from pat, cte2
where gendt between stdt and enddt

postgres - estimate index size for timestamp column

I have a Postgres table, ENTRIES, with a 'made_at' column of type timestamp without time zone.
That table has a composite btree index on that column together with another column (USER_ID, a foreign key):
btree (user_id, date_trunc('day'::text, made_at))
As you can see, the date is truncated at the 'day'. The total size of the index constructed this way is 130 MB -- there are 4,000,000 rows in the ENTRIES table.
QUESTION: How do I estimate the size of the index if I wanted the time to be precise to the second? Basically, truncate the timestamp at the second rather than the day (should be easy to do, I hope).
Interesting question! According to my investigation they will be the same size.
My intuition told me that there should be no difference between the size of your two indices, as timestamp types in PostgreSQL are of fixed size (8 bytes), and I supposed the truncate function simply zeroed out the appropriate number of least significant time bits, but I figured I had better support my guess with some facts.
I spun up a free dev database on heroku PostgreSQL and generated a table with 4M random timestamps, truncated to both day and second values as follows:
test_db=> SELECT * INTO ts_test FROM
(SELECT id,
ts,
date_trunc('day', ts) AS trunc_day,
date_trunc('second', ts) AS trunc_s
FROM (select generate_series(1, 4000000) AS id,
now() - '1 year'::interval * round(random() * 1000) AS ts) AS sub)
AS subq;
SELECT 4000000
test_db=> create index ix_day_trunc on ts_test (id, trunc_day);
CREATE INDEX
test_db=> create index ix_second_trunc on ts_test (id, trunc_s);
CREATE INDEX
test_db=> \d ts_test
Table "public.ts_test"
Column | Type | Modifiers
-----------+--------------------------+-----------
id | integer |
ts | timestamp with time zone |
trunc_day | timestamp with time zone |
trunc_s | timestamp with time zone |
Indexes:
"ix_day_trunc" btree (id, trunc_day)
"ix_second_trunc" btree (id, trunc_s)
test_db=> SELECT pg_size_pretty(pg_relation_size('ix_day_trunc'));
pg_size_pretty
----------------
120 MB
(1 row)
test_db=> SELECT pg_size_pretty(pg_relation_size('ix_second_trunc'));
pg_size_pretty
----------------
120 MB
(1 row)
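As a further sanity check on the fixed-width reasoning (a small sketch, not part of the original experiment), pg_column_size() shows that a truncated timestamp occupies the same 8 bytes as the raw value:
SELECT pg_column_size(now()::timestamp)                    AS raw_ts,
       pg_column_size(date_trunc('day', now()::timestamp)) AS truncated_ts;
-- both columns return 8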