Update the time index of all rows in a TimescaleDB/PostgreSQL hypertable?

I am using an open-source time-series database named TimescaleDB (based on PostgreSQL).
Assume this table (note that order is a reserved word, so the name has to be quoted):
CREATE TABLE "order" (
    time TIMESTAMPTZ NOT NULL,
    product text,
    price DOUBLE PRECISION,
    qty DOUBLE PRECISION
);
Next, I transform it into a hypertable with:
SELECT create_hypertable('"order"', 'time');
Next, I insert some data (more than 5 million rows):
2020-01-01T12:23:52.1235,product1,10,1
2020-01-01T12:23:53.5496,product2,52,7
2020-01-01T12:23:55.3512,product1,23,5
[...]
I then need to update the data so that every timestamp is shifted back by a 1-hour interval, like this:
2020-01-01T11:23:52.1235,product1,10,1
2020-01-01T11:23:53.5496,product2,52,7
2020-01-01T11:23:55.3512,product1,23,5
[...]
What is the most efficient method (in terms of duration) for altering the time values in this hypertable so that a 1-hour interval is subtracted from every row in the order table?

I am not sure whether declarative partitioning is available in Timescale; that would ease the process by letting you put partitions on a time range or even a date range.
See if that option is available to you - then you could simply drop the partition matching a range, and voilà!
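For reference, a plain UPDATE "order" SET time = time - INTERVAL '1 hour'; is the obvious first attempt, but shifting every timestamp can move rows across chunk boundaries, which a hypertable may not accept and which is slow in any case. One common alternative is to rewrite the data into a fresh hypertable and swap it in - a sketch only, with order_shifted as an illustrative name and assuming the table can briefly be taken offline:
-- New, empty hypertable with the same structure as "order".
CREATE TABLE order_shifted (
    time TIMESTAMPTZ NOT NULL,
    product text,
    price DOUBLE PRECISION,
    qty DOUBLE PRECISION
);
SELECT create_hypertable('order_shifted', 'time');

-- Bulk copy with the shifted timestamps; each row is written exactly once.
INSERT INTO order_shifted (time, product, price, qty)
SELECT time - INTERVAL '1 hour', product, price, qty
FROM "order";

-- Swap the new table into place.
DROP TABLE "order";
ALTER TABLE order_shifted RENAME TO "order";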

Related

Predict partition number for Postgres hash partitioning

I'm writing an app which uses partitions in a Postgres DB. It will be shipped to customers and run on their servers. This implies that I have to be prepared for many different scenarios.
Let's start with a simple table schema:
CREATE TABLE dir (
    id SERIAL,
    volume_id BIGINT,
    path TEXT
);
I want to partition that table by the volume_id column.
What I would like to achieve:
a limited number of partitions (right now it's 500, but I will be tweaking this parameter later)
Do not create all partitions at once - add them only when they are needed
support volume ids up to 100K
[NICE TO HAVE] - being able, as a human, to calculate the partition number from a volume_id
Solution that I have right now:
partition by LIST
each partition handles one remainder class of volume_id % 500, like this:
CREATE TABLE dir_part_1 PARTITION OF dir FOR VALUES IN (1, 501, 1001, 1501, ..., 9501);
This works great because I can create a partition when it's needed, and I know exactly which partition a given volume_id belongs to. But I have to declare the numbers manually, and I cannot support high volume_ids because the speed of insert statements decreases drastically (by more than 2 times).
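For what it's worth, the value list does not have to be typed out by hand; the partition for a given volume_id can be created on demand with a small DO block. A sketch, assuming volume_ids stay below the stated 100K limit and following the dir_part_N naming scheme above:
-- Create the LIST partition that will hold volume_id = 5, i.e. every value
-- v <= 100000 with v % 500 = 5, generating the value list instead of
-- declaring it manually.
DO $$
DECLARE
    part int := 5 % 500;   -- partition number for volume_id = 5
    vals text;
BEGIN
    SELECT string_agg(v::text, ', ')
    INTO vals
    FROM generate_series(part, 100000, 500) AS v;
    EXECUTE format('CREATE TABLE dir_part_%s PARTITION OF dir FOR VALUES IN (%s)', part, vals);
END $$;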
It looks like I could try HASH partitioning, but my biggest concern is that I would have to create all partitions at the very beginning, whereas I would like to be able to create them dynamically when they are needed, because planning time increases significantly - up to 5 seconds for 500 partitions. For example, I know that I will be adding rows with volume_id = 5. How can I tell which partition I should create?
I was able to force Postgres to use a dummy hash function by adding a hash operator class for the partitioned table.
CREATE OR REPLACE FUNCTION partition_custom_bigint_hash(value BIGINT, seed BIGINT)
RETURNS BIGINT AS $$
-- this number is UINT64CONST(0x49a0f4dd15e5a8e3) from
-- https://github.com/postgres/postgres/blob/REL_13_STABLE/src/include/common/hashfn.h#L83
SELECT value - 5305509591434766563;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
CREATE OPERATOR CLASS partition_custom_bigint_hash_op
FOR TYPE int8
USING hash AS
OPERATOR 1 =,
FUNCTION 2 partition_custom_bigint_hash(BIGINT, BIGINT);
Now you can declare a partitioned table like this:
CREATE TABLE some_table (
    id SERIAL,
    partition_id BIGINT,
    value TEXT
) PARTITION BY HASH (partition_id);
CREATE TABLE some_table_part_2 PARTITION OF some_table FOR VALUES WITH (modulus 3, remainder 2);
Now you can safely assume that all rows with partition_id % 3 = 2 will land in the some_table_part_2 partition. So if you are sure which values you will receive in the partition_id column, you can create only the required partitions.
DISCLAIMER 1: Unfortunately this will not work correctly right now (Postgres 13.1) because of bug #16840
DISCLAIMER 2: There is no point in using this technique unless you are planning to create a large number of partitions (I would say 50 or more) and prolonged planning time is an issue.
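As a quick sanity check of the routing claim above (a sketch, assuming the custom operator class is in place and the bug from DISCLAIMER 1 does not affect your version):
-- 5 % 3 = 2, so per the scheme above this row should land in some_table_part_2;
-- if the routing does not behave as claimed, the INSERT fails with
-- "no partition of relation ... found for row", which is just as informative.
INSERT INTO some_table (partition_id, value) VALUES (5, 'test');
-- tableoid::regclass shows which child table actually stores the row.
SELECT tableoid::regclass AS partition_name, * FROM some_table WHERE partition_id = 5;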

When comparing insert performance between Postgres and TimescaleDB, TimescaleDB didn't perform that well?

I tried an insert query performance test. Here are the numbers:
Postgres
Insert: avg execution time for 10 inserts of 1 million rows: 6260 ms
Timescale
Insert: avg execution time for 10 inserts of 1 million rows: 10778 ms
Insert Queries:
-- Testing SQL Queries
-- Join table
CREATE TABLE public.sensors(
    id SERIAL PRIMARY KEY,
    type VARCHAR(50),
    location VARCHAR(50)
);
-- Postgres table
CREATE TABLE sensor_data (
    time TIMESTAMPTZ NOT NULL,
    sensor_id INTEGER,
    temperature DOUBLE PRECISION,
    cpu DOUBLE PRECISION,
    FOREIGN KEY (sensor_id) REFERENCES sensors (id)
);
CREATE INDEX idx_sensor_id
ON sensor_data(sensor_id);
-- TimescaleDB table
CREATE TABLE sensor_data_ts (
    time TIMESTAMPTZ NOT NULL,
    sensor_id INTEGER,
    temperature DOUBLE PRECISION,
    cpu DOUBLE PRECISION,
    FOREIGN KEY (sensor_id) REFERENCES sensors (id)
);
SELECT create_hypertable('sensor_data_ts', 'time');
-- Insert Data
INSERT INTO sensors (type, location) VALUES
('a','floor'),
('a', 'ceiling'),
('b','floor'),
('b', 'ceiling');
-- Postgres
EXPLAIN ANALYSE
INSERT INTO sensor_data (time, sensor_id, cpu, temperature)
SELECT
time,
sensor_id,
random() AS cpu,
random()*100 AS temperature
FROM generate_series(now() - interval '125 week', now(), interval '5 minute') AS g1(time), generate_series(1,4,1) AS g2(sensor_id);
-- TimescaleDB
EXPLAIN ANALYSE
INSERT INTO sensor_data_ts (time, sensor_id, cpu, temperature)
SELECT
time,
sensor_id,
random() AS cpu,
random()*100 AS temperature
FROM generate_series(now() - interval '125 week', now(), interval '5 minute') AS g1(time), generate_series(1,4,1) AS g2(sensor_id);
Am I overlooking any optimizations?
By default, a hypertable creates a chunk per week (that's configurable in the create_hypertable call). So with the above setting, you created 125 chunks for TimescaleDB, each with roughly 8,000 rows. There is overhead to this chunk creation, as well as to the logic handling it. With the dataset being so small, you are seeing the overhead of this chunk creation, which is typically amortized over much larger datasets: in most "natural" settings, we'll typically see on the order of millions (or at least hundreds of thousands) of rows per chunk.
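For reference, the chunk interval is just a parameter, so a test like this can be re-run with far fewer chunks. A sketch; the one-year value is only for illustration, not a tuning recommendation:
-- On a fresh, empty table: larger chunks mean the 125 weeks of generated
-- data land in only a few chunks instead of 125.
SELECT create_hypertable('sensor_data_ts', 'time',
                         chunk_time_interval => INTERVAL '1 year');
-- For an already-created hypertable, set_chunk_time_interval changes the
-- interval used for future chunks.
SELECT set_chunk_time_interval('sensor_data_ts', INTERVAL '1 year');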
The place where you start to see insert-performance differences between a partitioned architecture like TimescaleDB and a single table is when the dataset (and, in particular, the indexes you are currently maintaining) no longer fits naturally in memory.
In the above, 1M rows easily fit in memory, and the only index you have on your vanilla PG table is on sensor_id, so it's pretty tiny. (On the TimescaleDB hypertable, you by default have an index on the timestamp, distinct per chunk, so you actually have 125 indexes, each covering the roughly 8,000 rows of its chunk.)
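You can check the chunk count on your own setup: show_chunks lists the chunks of a hypertable, so counting its rows gives the total.
-- Number of chunks backing the hypertable (expected: ~125 for 125 weeks of data).
SELECT count(*) AS chunk_count FROM show_chunks('sensor_data_ts');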
For a visual, see this older blog post: https://blog.timescale.com/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/#result-15x-improvement-in-insert-rate
Note that inserts into a single PG table are about the same at the beginning, but then fall off as the table gets bigger and data/indexes start swapping to disk.
If you want to do larger performance tests, I might suggest trying out the Time Series Benchmark Suite: https://github.com/timescale/tsbs
With TimescaleDB version 1.7 running on Docker, I was able to insert around 600,000 rows per second on my laptop using https://github.com/vincev/tsdbperf:
$ ./tsdbperf --workers 8 --measurements 200000
[2020-11-03T20:25:48Z INFO tsdbperf] Number of workers: 8
[2020-11-03T20:25:48Z INFO tsdbperf] Devices per worker: 10
[2020-11-03T20:25:48Z INFO tsdbperf] Metrics per device: 10
[2020-11-03T20:25:48Z INFO tsdbperf] Measurements per device: 200000
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 16000000 measurements in 26.55 seconds
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 602750 measurements per second
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 6027500 metrics per second

Calculate the difference between two hours on PostgreSQL 9.4

Given a table like the following:
create table meetings(
    id integer primary key,
    start_time varchar,
    end_time varchar
)
Considering that the strings stored in this table follow the format 'HH:MM' in 24-hour time, is there a command in PostgreSQL 9.4 with which I can cast the fields to time, calculate the difference between them, and return a single result counting the full hours available?
e.g.: start_time: '08:00' - end_time: '12:00'
The result must be 4.
In your particular case, assuming that you are working with clock values (both of them belonging to the same day), I would guess you can do this:
(clock_to::time - clock_from::time) as duration
Allow me to leave you a ready-to-run example:
with cte as (
    select '4:00'::varchar as clock_from, '14:00'::varchar as clock_to
)
select (clock_to::time - clock_from::time) as duration
from cte
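Since the question asks for a count of full hours rather than an interval, the same expression can be wrapped with extract(epoch ...). A sketch using the values from the question:
with cte as (
    select '08:00'::varchar as start_time, '12:00'::varchar as end_time
)
select floor(extract(epoch from (end_time::time - start_time::time)) / 3600)::int as full_hours
from cte;
-- returns 4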

Handling of multiple queries as one result

Let's say I have this table:
CREATE TABLE device_data_by_year (
    year int,
    device_id uuid,
    sensor_id uuid,
    nano_since_epoch bigint,
    unit text,
    value double,
    source text,
    username text,
    PRIMARY KEY (year, device_id, nano_since_epoch, sensor_id)
) WITH CLUSTERING ORDER BY (device_id desc, nano_since_epoch desc);
I need to query data for a particular device and sensor in a period between 2017 and 2018. In this case 2 queries will be issued:
select * from device_data_by_year where year = 2017 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
select * from device_data_by_year where year = 2018 AND device_id = ? AND sensor_id = ? AND nano_since_epoch >= ? AND nano_since_epoch <= ?
Currently I iterate over the result sets and build a List with all the results. I am aware that this could (and will) run into OOM problems some day. Is there a better approach to handling/merging the query results into one set?
Thanks
You can use IN to specify a list of years, but this is not a very optimal solution: because the year field is the partition key, the data will most probably be on different machines, so one of the nodes will act as a "coordinator" and will need to ask another machine for its results and aggregate the data. From a performance point of view, 2 async requests issued in parallel could be faster, with the merge done on the client side.
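For illustration, the single-statement form looks like this (a CQL sketch against the original table; sensor_id is left out of the WHERE clause because it follows the range-restricted nano_since_epoch in the clustering order, so it would have to be filtered on the client or with ALLOW FILTERING):
-- One statement, IN on the partition key: the coordinator fans out to the
-- replicas owning 2017 and 2018 and merges the rows before returning them.
SELECT * FROM device_data_by_year
WHERE year IN (2017, 2018)
  AND device_id = ?
  AND nano_since_epoch >= ? AND nano_since_epoch <= ?;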
P.S. Your data model has quite serious problems - you partition by year, and this means:
Data isn't distributed very well across the cluster - only N=RF machines will hold the data;
These partitions will be huge, even if you have only a hundred devices reporting one measurement per minute;
Only one partition will be "hot" - it will receive all the data during the current year, and the other partitions won't be used very often.
You can use months, or even days, as the partition key to decrease the partition size, but that still won't solve the problem of the "hot" partitions.
If I remember correctly, Data Modelling course at DataStax Academy has an example of data model for sensor network.
I changed the table structure to:
CREATE TABLE device_data (
    week_first_day timestamp,
    device_id uuid,
    sensor_id uuid,
    nano_since_epoch bigint,
    unit text,
    value double,
    source text,
    username text,
    PRIMARY KEY ((week_first_day, device_id), nano_since_epoch, sensor_id)
) WITH CLUSTERING ORDER BY (nano_since_epoch desc, sensor_id desc);
according to @AlexOtt's proposal. Some changes to the application logic are required - for example, findAllByYear now needs to iterate over single weeks.
Coming back to the original question: would you rather send 52 queries (getDataByYear, one query per week), or would you use the IN operator here?

Looping through unique dates in PostgreSQL

In Python (pandas) I read from my database and then use a pivot table to aggregate the data for each day. The raw data I am working on is about 2 million rows per day, recorded per person and per 30 minutes. I am aggregating it to a daily level instead so it is a lot smaller for visualization.
So in pandas, I would read each date into memory, aggregate it, and then load it into a fresh table in Postgres.
How can I do this directly in Postgres? Can I loop through each unique report_date in my table, GROUP BY, and then append the result to another table? I assume doing it in Postgres would be fast compared to reading it over the network in Python, writing a temporary .csv file, and then writing it again over the network.
Here's an example: Suppose that you have a table
CREATE TABLE post (
    posted_at timestamptz not null,
    user_id integer not null,
    score integer not null
);
representing the score various users have earned from posts they made in an SO-like forum. Then the following query
SELECT user_id, posted_at::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, posted_at::date;
will aggregate the scores per user per day.
Note that this considers the day to change at 00:00 UTC (like SO does), provided your session time zone is UTC, since casting a timestamptz to date uses the session's TimeZone setting. If you want a different boundary, say midnight Paris time, then you can do it like so:
SELECT user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, (posted_at AT TIME ZONE 'Europe/Paris')::date;
To get good performance for the above queries, you might want to create an expression index covering user_id and the date expression (for a timestamptz column the expression must be immutable, e.g. (posted_at AT TIME ZONE 'UTC')::date), or similarly for the second case.
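To cover the "append it to another table" part of the question without any looping, the aggregate can be materialized with a single INSERT ... SELECT. A sketch; the post_daily_score table name is made up for illustration:
-- Hypothetical target table for the daily aggregates.
CREATE TABLE post_daily_score (
    user_id integer not null,
    day date not null,
    score bigint not null
);

-- One set-based statement replaces the per-date loop.
INSERT INTO post_daily_score (user_id, day, score)
SELECT user_id, posted_at::date AS day, sum(score) AS score
FROM post
GROUP BY user_id, posted_at::date;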