I want to use ClickHouse as an OLAP database and PostgreSQL as an OLTP database.
The problem is that queries to ClickHouse run slower than on Postgres. The query is as below:
select count(id) from {table_name}
Here is my table structure:
CREATE TABLE IF NOT EXISTS {table_name}
(
`id` UInt64,
`label` Nullable(FixedString(50)),
`query` Nullable(text),
`creation_datetime` DateTime,
`offset` UInt64,
`user_is_first_search` UInt8,
`user_date_of_start` Date,
`usage_type` Nullable(FixedString(20)),
`user_ip` Nullable(FixedString(200)),
`who_searched_query` Nullable(FixedString(15)),
`device_type` Nullable(FixedString(20)),
`device_os` Nullable(FixedString(20)),
`tab_type` Nullable(FixedString(20)),
`response_api_type` Nullable(FixedString(20)),
`total_response_time` Float64,
`retrieved_instant_answer` Nullable(FixedString(100)),
`is_relative_instant_answer` UInt8,
`meta_search_instant_answer_type` Nullable(FixedString(50)),
`settings_alignment` Nullable(FixedString(20)),
`settings_safe_search` Nullable(FixedString(30)),
`settings_search_results_number` Nullable(FixedString(30)),
`settings_proxy_image_urls` Nullable(FixedString(30)),
`cache_hit` Nullable(FixedString(20)),
`net_status` Nullable(FixedString(20)),
`is_transitional` UInt8
)
ENGINE = MergeTree() PARTITION BY creation_datetime ORDER BY (id)
I created an index on the datetime field in both databases and then ran OPTIMIZE on both. Can anyone tell me why ClickHouse is slower than Postgres?
There are ways to shoot yourself in the foot with ClickHouse:
create table test ( id Int64, d Date ) Engine=MergeTree Order by id;
insert into test select number, today() from numbers(1e8);
select count() from test;
┌───count()─┐
│ 100000000 │
└───────────┘
1 rows in set. Elapsed: 0.002 sec.
select count(id) from test;
┌─count(id)─┐
│ 100000000 │
└───────────┘
1 rows in set. Elapsed: 0.239 sec. Processed 100.00 million rows, 800.00 MB (418.46 million rows/s., 3.35 GB/s.)
drop table test;
create table test ( id Int64, d Int64 ) Engine=MergeTree partition by (intDiv(d, 10000)) Order by id;
set max_partitions_per_insert_block=0;
insert into test select number, number from numbers(1e8);
select count(id) from test;
┌─count(id)─┐
│ 100000000 │
└───────────┘
1 rows in set. Elapsed: 1.050 sec. Processed 100.00 million rows, 800.00 MB (95.20 million rows/s., 761.61 MB/s.)
select count(d) from test;
┌──count(d)─┐
│ 100000000 │
└───────────┘
1 rows in set. Elapsed: 0.004 sec.
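As a rough check on how badly a partitioning scheme fragments the data, you can count the active parts and partitions the table ended up with (a sketch against the test table above, using the standard system.parts table; the second experiment produces 10,000 partitions):

SELECT count() AS active_parts, uniqExact(partition) AS partitions
FROM system.parts
WHERE active AND database = currentDatabase() AND table = 'test';

Thousands of tiny parts mean thousands of separate column files, which is a big part of why count(id) is so much slower there.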
Finally I found what I did wrong: I should not have partitioned by the datetime field. I recreated the table without the partition and it got much faster.
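A sketch of what the corrected definition could look like (column list abbreviated; partitioning by month via toYYYYMM is the usual alternative if you don't drop PARTITION BY entirely, since partitioning directly by a DateTime creates one partition per distinct timestamp):

CREATE TABLE IF NOT EXISTS {table_name}
(
    `id` UInt64,
    `creation_datetime` DateTime
    -- ... remaining columns as in the original definition ...
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(creation_datetime)  -- or omit PARTITION BY altogether, as described above
ORDER BY (id);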
Related
When running a select query at a QPS of 300+, query latency increases to as much as 20,000 ms with CPU usage at ~100%. I couldn't identify what the issue could be here.
System details:
vCPUs: 16
RAM: 64 GiB
Create hypertable:
CREATE TABLE test_table (time TIMESTAMPTZ NOT NULL, val1 TEXT NOT NULL, val2 DOUBLE PRECISION NOT NULL, val3 DOUBLE PRECISION NOT NULL);
SELECT create_hypertable('test_table', 'time');
SELECT add_retention_policy('test_table', INTERVAL '2 day');
Query:
SELECT * from test_table where val1='abc'
Timescale version:
2.5.1
No. of rows in test_table:
6.9M
Latencies (ms):
p50=8493.722203,
p75=8566.074792,
p95=21459.204448,
p98=21459.204448,
p99=21459.204448,
p999=21459.204448
(Screenshot: CPU usage of the postgres SELECT query, as shown by htop.)
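For what it's worth, one likely factor (assuming there is no index on val1, which the hypertable does not create by default) is that every query has to sequentially scan all chunks; at 300+ QPS that alone can pin the CPUs. A sketch of an index that would let the planner use an index scan instead (the index definition is my own suggestion, not from the question):

CREATE INDEX IF NOT EXISTS test_table_val1_time_idx ON test_table (val1, time DESC);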
I have a table of sensor readings as postgres hstore key-value pairs.
Each record has a timestamp at 1 sec intervals.
Not all sensors record every second.
The table is essentially:
create table readings (
timer timestamp primary key,
readings hstore
);
The hstore comprises (<sensor_id>, <reading>) key/value pairs for readings taken at the time specified in the timestamp.
e.g. "67"=>"-45.67436", "68"=>"176.5424" could be key/value pairs representing latitude & longitude, with a timestamp in the timer column.
There would be several lat/lon hstore pairs in a given minute; the query I want would return the last one in the timeseries for that minute (for each key).
I want to retrieve the latest value for each sensor in the hstore at 1 minute intervals.
Essentially:
Get the set of timestamps & hstores for that minute.
Get the keys from that set.
Get the value for that key from the key-value pair with the latest timestamp in the set
Return the 1 min timestamp & resulting hstore
Repeat for the next minute
I'm happy with a pl/pgsql function if that is the best approach.
For "the query I want would return the last one in the timeseries for that minute (for each key)" you can use window functions; also, I'm guessing you have different sensors:
select * from
(
  select *,
         row_number() over (partition by sensorid, to_char(timer, 'YYYY-MM-DD HH24:MI')
                            order by timer desc) rn
  from readings  -- assumes a sensorid column; with the hstore layout, unnest with each() first
) t
where rn = 1
with t(t, h) as (values
('2000-01-01 01:01:01'::timestamp, 'a=>1,b=>2'::hstore),
('2000-01-01 01:01:02', 'a=>3'),
('2000-01-01 01:02:03', 'b=>4'),
('2000-01-01 01:02:05', 'a=>2,b=>3'))
select
date_trunc('minute', t),
jsonb_object_agg(k, v order by t)
from t, each(h) as h(k,v)
group by date_trunc('minute', t);
┌─────────────────────┬──────────────────────┐
│ date_trunc │ jsonb_object_agg │
├─────────────────────┼──────────────────────┤
│ 2000-01-01 01:01:00 │ {"a": "3", "b": "2"} │
│ 2000-01-01 01:02:00 │ {"a": "2", "b": "3"} │
└─────────────────────┴──────────────────────┘
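The same approach applied to the readings table from the question (a sketch; each() comes from the hstore extension, and because jsonb keeps the last value for a duplicated key, aggregating in ascending timer order leaves the latest reading per sensor within each minute):

SELECT
    date_trunc('minute', r.timer)                 AS minute,
    jsonb_object_agg(kv.k, kv.v ORDER BY r.timer) AS latest_readings
FROM readings r, each(r.readings) AS kv(k, v)
GROUP BY date_trunc('minute', r.timer)
ORDER BY minute;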
I have a table with the following fields:
my_date     my_time   my_time2
2017-04-14  10:00:01  286115
How do I combine these fields into a timestamp column like 2017-04-14 10:00:01.286115?
Unfortunately, select my_date + my_time + my_time2 from my_table only works for the first two fields. Thanks in advance.
select date '2017-04-14' + time '10:00:01' + 286115 * interval '1usec';
┌────────────────────────────┐
│ ?column? │
╞════════════════════════════╡
│ 2017-04-14 10:00:01.286115 │
└────────────────────────────┘
(1 row)
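The same expression applied to the actual columns (a sketch, assuming my_time2 holds an integer number of microseconds):

select my_date + my_time + my_time2 * interval '1usec' as my_timestamp
from my_table;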
I want to select the opening balance of the first month and the closing balance of the last month in PostgreSQL, and also the sum of income as total income and the sum of expenditure as total expenditure, all on the same row. Here is my data below:
ID  OPENING_BAL  INCOME   EXPENDITURE  CLOSING_BAL  COUNCIL_NAME  DATE_COMPILED
21  5000.00      1000.00  2000.00      6000.00      BAKONE        2017-04-28
22  6000.00      1000.00  4000.00      9000.00      BAKONE        2017-05-31
23  9000.00      1500.00  2000.00      9500.00      BAKONE        2017-06-30
You can use the FIRST_VALUE/LAST_VALUE window functions:
CREATE TEMP TABLE e (DATE_COMPILED date, OPENING_BAL int, CLOSING_BAL int);
INSERT INTO e (opening_bal, closing_bal, DATE_COMPILED) VALUES
(5000.00, 6000.00, '2017-04-28'),
(6000.00, 9000.00, '2017-05-31'),
(9000.00, 9500.00, '2017-06-30');
SELECT
FIRST_VALUE(OPENING_BAL) OVER all_dates_asc,
LAST_VALUE(CLOSING_BAL) OVER all_dates_asc
FROM e
WINDOW all_dates_asc AS (
ORDER BY DATE_COMPILED ASC
ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING
)
LIMIT 1;
┌─────────────┬────────────┐
│ first_value │ last_value │
├─────────────┼────────────┤
│ 5000 │ 9500 │
└─────────────┴────────────┘
(1 row)
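If you also need the totals from the question on the same row, a sketch using ordered array_agg against the original table (table and column names are taken from the question's data; adjust if yours differ):

SELECT
    (array_agg(opening_bal ORDER BY date_compiled ASC))[1]  AS opening_bal_first_month,
    (array_agg(closing_bal ORDER BY date_compiled DESC))[1] AS closing_bal_last_month,
    sum(income)      AS total_income,
    sum(expenditure) AS total_expenditure
FROM reconcilation
WHERE council_name = 'BAKONE';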
You can try ordering them by the desired value and then selecting the first result, for example:
(SELECT * FROM reconcilation ORDER BY DATE_COMPILED::date ASC LIMIT 1)
UNION
(SELECT * FROM reconcilation ORDER BY DATE_COMPILED::date DESC LIMIT 1)
If you want just a single column you can also select only the desired columns, but in this case you will lose clarity:
(SELECT opening_balance FROM reconcilation WHERE council_name = 'BAKONE' ORDER BY DATE_COMPILED::date ASC LIMIT 1)
UNION
(SELECT closing_balance FROM reconcilation WHERE council_name = 'BAKONE' ORDER BY DATE_COMPILED::date DESC LIMIT 1)
Since the data is ordered by date, not by a full datetime, the query may return inaccurate results if there are multiple entries on the same day. You can still order by both date and id, or just by id, though I would keep id out of it if possible:
ORDER BY DATE_COMPILED::date, id ASC
I have a postgres table, ENTRIES, with a 'made_at' column of type timestamp without time zone.
That table has a composite btree index on that column together with another column (USER_ID, a foreign key):
btree (user_id, date_trunc('day'::text, made_at))
As you can see, the date is truncated at the 'day'. The total size of the index constructed this way is 130 MB -- there are 4,000,000 rows in the ENTRIES table.
QUESTION: How do I estimate the size of the index if I were to care for time to be up to the second? Basically, truncate timestamp at second rather than day (should be easy to do, I hope).
Interesting question! According to my investigation they will be the same size.
My intuition told me that there should be no difference between the size of your two indices, as timestamp types in PostgreSQL are of fixed size (8 bytes), and I supposed the truncate function simply zeroed out the appropriate number of least significant time bits, but I figured I had better support my guess with some facts.
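A quick way to check the fixed-size claim directly (pg_column_size reports how many bytes a value occupies):

SELECT pg_column_size(now()::timestamp)                    AS full_precision_bytes,
       pg_column_size(date_trunc('day', now())::timestamp) AS day_truncated_bytes;
-- both columns return 8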
I spun up a free dev database on heroku PostgreSQL and generated a table with 4M random timestamps, truncated to both day and second values as follows:
test_db=> SELECT * INTO ts_test FROM
            (SELECT id,
                    ts,
                    date_trunc('day', ts) AS trunc_day,
                    date_trunc('second', ts) AS trunc_s
             FROM (select generate_series(1, 4000000) AS id,
                          now() - '1 year'::interval * round(random() * 1000) AS ts) AS sub)
            AS subq;
SELECT 4000000
test_db=> create index ix_day_trunc on ts_test (id, trunc_day);
CREATE INDEX
test_db=> create index ix_second_trunc on ts_test (id, trunc_s);
CREATE INDEX
test_db=> \d ts_test
Table "public.ts_test"
Column | Type | Modifiers
-----------+--------------------------+-----------
id | integer |
ts | timestamp with time zone |
trunc_day | timestamp with time zone |
trunc_s | timestamp with time zone |
Indexes:
"ix_day_trunc" btree (id, trunc_day)
"ix_second_trunc" btree (id, trunc_s)
test_db=> SELECT pg_size_pretty(pg_relation_size('ix_day_trunc'));
pg_size_pretty
----------------
120 MB
(1 row)
test_db=> SELECT pg_size_pretty(pg_relation_size('ix_second_trunc'));
pg_size_pretty
----------------
120 MB
(1 row)