I created a table with the command:
CREATE TABLE public.g_dl (
id bigint NOT NULL,
id_tram integer,
site_id character varying(40),
user_id character varying(40),
app_code character varying(40),
time TIMESTAMP WITHOUT TIME ZONE NOT NULL
);
SELECT create_hypertable('g_dl','time');
Then I inserted about 34 million records.
Query speed was very good, and the Docker container used about 500 MB-1.2 GB of RAM. But query speed became a problem after I inserted about 1.8 million more records whose time values were out of order compared to the data I had already inserted.
I use DBeaver, and I got the message "You might need to increase max_locks_per_transaction", so I raised that value:
max_locks_per_transaction = 1000
Now query speed is very slow and the Docker container uses 10 GB-12 GB of RAM. What am I doing wrong? Please let me know.
Output of EXPLAIN:
EXPLAIN analyze select * from g_dlquantracts gd where id_tram = 300
Raw JSON explain:
https://drive.google.com/file/d/1bA5EwcWMUn5oTUirHFD25cSU1cYGwGLo/view?usp=sharing
Formatted execution plan: https://explain.depesz.com/s/tM57
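As a point of reference, here is a sketch of how the max_locks_per_transaction change is usually applied, plus a way to count how many chunks the hypertable has (each chunk touched by a statement needs a lock; the informational view used here assumes TimescaleDB 2.x):
-- requires a server restart to take effect
ALTER SYSTEM SET max_locks_per_transaction = 1000;
-- count the chunks of the hypertable (TimescaleDB 2.x view)
SELECT count(*) FROM timescaledb_information.chunks WHERE hypertable_name = 'g_dl';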
I have the following test table created in Postgres with 5 million rows
create table temp_test(
id serial,
latitude float, longitude float,name varchar(32),
questionaire jsonb, city varchar(32), country varchar(32)
);
I have generated random data in this table using the below query
CREATE OR REPLACE FUNCTION
random_in_range(INTEGER, INTEGER) RETURNS INTEGER
AS $$
SELECT floor(($1 + ($2 - $1 + 1) * random()))::INTEGER;
$$ LANGUAGE SQL;
insert into temp_test(latitude,longitude,name,questionaire,city,country)
Select random_in_range(-180,180),random_in_range(-180,180),md5(random()::text),'["autism","efg"]',md5(random()::text)
,md5(random()::text)
from generate_series(1,5000000);
Column 'id' has a btree index
CREATE INDEX id_index ON temp_test (id);
When I query using only 'id' (which has an index) and EXPLAIN ANALYSE the query
explain analyse select id from temp_test where id = 10000;
I get the following result:
The execution time of this query was around 0.049 seconds, and if I rerun the same query (so the database should be caching it) I get results in roughly the same time.
From the results I see that, even though I am querying only on an indexed column and not fetching any other column (I know that is not practical), the heap is still being accessed even though the information exists within the index.
I would expect the query to not touch the heap if I am extracting information from the index.
Is there something I am missing here? Any help would be appreciated. Thanks in advance!
The query is probably hitting the heap to check tuple visibility, because the visibility map is not up to date. Run VACUUM FULL VERBOSE and see what Postgres says; this will mark the rows as visible to all transactions. Sometimes Postgres can't vacuum fully for one reason or another, so the VERBOSE output helps.
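A minimal way to verify this on the table from the question:
VACUUM (VERBOSE) temp_test;
-- after the vacuum, the plan should show an Index Only Scan with (ideally) Heap Fetches: 0
EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM temp_test WHERE id = 10000;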
I found the solution: I ran a plain VACUUM on my table
vacuum temp_test;
instead of running VACUUM FULL VERBOSE.
Although it worked, I wonder what the difference between the two commands is.
I have an AWS RDS PostgreSQL 12.3 (t3.small, 2CPU 2GB RAM). I have this table:
CREATE TABLE public.phones_infos
(
phone_id integer NOT NULL DEFAULT nextval('phones_infos_phone_id_seq'::regclass),
phone character varying(50) COLLATE pg_catalog."default" NOT NULL,
company_id integer,
phone_tested boolean DEFAULT false,
imported_at timestamp with time zone NOT NULL,
CONSTRAINT phones_infos_pkey PRIMARY KEY (phone_id),
CONSTRAINT fk_phones_infos FOREIGN KEY (company_id)
REFERENCES public.companies_infos (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE CASCADE
)
There are exactly 137468 records in this table, using:
SELECT count(1) FROM phones_infos;
The ERROR: out of memory for query result occurs with this simple query when I use pgAdmin 4.6:
SELECT * FROM phones_infos;
I have tables with 5M+ records and never had this problem before.
EXPLAIN SELECT * FROM phones_infos;
Seq Scan on phones_infos (cost=0.00..2546.68 rows=137468 width=33)
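As a rough sanity check of how much raw data the full result set represents on the server side (the client will add its own overhead on top), something like this can be used:
SELECT pg_size_pretty(sum(pg_column_size(p.*))::bigint) AS raw_result_size
FROM phones_infos p;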
I read this article to see if I could find answers, but unfortunately, as the metrics show, there are no old pending connections that could be eating memory.
As suggested, the shared_buffers seems to be correctly sized:
SHOW shared_buffers;
449920kB
What should I try?
The problem must be on the client side. A sequential scan does not require much memory in PostgreSQL.
pgAdmin will cache the complete result set in RAM, which probably explains the out-of-memory condition.
I see two options:
Limit the number of result rows in pgAdmin:
SELECT * FROM phones_infos LIMIT 1000;
Use a different client, for example psql. There you can avoid the problem by setting
\set FETCH_COUNT 1000
so that the result set is fetched in batches.
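A minimal psql session illustrating the second option (host, user, and database name are placeholders):
$ psql -h mydb.example.rds.amazonaws.com -U myuser mydb
mydb=> \set FETCH_COUNT 1000
mydb=> SELECT * FROM phones_infos;
With FETCH_COUNT set, psql retrieves the rows in batches of 1000 through a cursor instead of materializing the whole result set in client memory.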
I ran an insert performance test. Here are the numbers:
Postgres
Insert: avg execution time for 10 inserts of 1 million rows: 6260 ms
Timescale
Insert: avg execution time for 10 inserts of 1 million rows: 10778 ms
Insert Queries:
-- Testing SQL Queries
--Join table
CREATE TABLE public.sensors(
id SERIAL PRIMARY KEY,
type VARCHAR(50),
location VARCHAR(50)
);
-- Postgres table
CREATE TABLE sensor_data (
time TIMESTAMPTZ NOT NULL,
sensor_id INTEGER,
temperature DOUBLE PRECISION,
cpu DOUBLE PRECISION,
FOREIGN KEY (sensor_id) REFERENCES sensors (id)
);
CREATE INDEX idx_sensor_id
ON sensor_data(sensor_id);
-- TimescaleDB table
CREATE TABLE sensor_data_ts (
time TIMESTAMPTZ NOT NULL,
sensor_id INTEGER,
temperature DOUBLE PRECISION,
cpu DOUBLE PRECISION,
FOREIGN KEY (sensor_id) REFERENCES sensors (id)
);
SELECT create_hypertable('sensor_data_ts', 'time');
-- Insert Data
INSERT INTO sensors (type, location) VALUES
('a','floor'),
('a', 'ceiling'),
('b','floor'),
('b', 'ceiling');
-- Postgres
EXPLAIN ANALYSE
INSERT INTO sensor_data (time, sensor_id, cpu, temperature)
SELECT
time,
sensor_id,
random() AS cpu,
random()*100 AS temperature
FROM generate_series(now() - interval '125 week', now(), interval '5 minute') AS g1(time), generate_series(1,4,1) AS g2(sensor_id);
-- TimescaleDB
EXPLAIN ANALYSE
INSERT INTO sensor_data_ts (time, sensor_id, cpu, temperature)
SELECT
time,
sensor_id,
random() AS cpu,
random()*100 AS temperature
FROM generate_series(now() - interval '125 week', now(), interval '5 minute') AS g1(time), generate_series(1,4,1) AS g2(sensor_id);
Am I overlooking any optimizations?
By default, a hypertable creates a chunk per week (that's configurable in the create_hypertable call). So with the above setting, you created 125 chunks for TimescaleDB, each with 8000 rows. There is overhead to this chunk creation, as well as the logic handling this. So with the dataset being so small, you are seeing the overhead of this chunk creation, which typically is amortized over much larger datasets: In most "natural" settings, we'll typically see on the order of millions+ (or at least 100,000s) of rows per chunk.
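For illustration, the chunk interval can be set at creation time or adjusted later (the 4-week value below is arbitrary, just to show the mechanism):
-- larger chunks at creation time
SELECT create_hypertable('sensor_data_ts', 'time', chunk_time_interval => INTERVAL '4 weeks');
-- or change it on an existing hypertable (affects newly created chunks only)
SELECT set_chunk_time_interval('sensor_data_ts', INTERVAL '4 weeks');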
The place you start to see insert performance differences between a partitioned architecture like TimescaleDB and a single table is when the dataset (and, in particular, the indexes you are currently maintaining) no longer fits naturally in memory.
In the above, 1M rows easily fit in memory, and the only index you have on your vanilla PG table is on sensor_id, so it's pretty tiny. (On the TimescaleDB hypertable, you have indexes on the timestamp by default, distinct per chunk, so you actually have 125 indexes, each covering 8000 rows given the distinct timestamps.)
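To see those per-chunk objects directly (assuming a reasonably recent TimescaleDB version):
-- list the chunks backing the hypertable
SELECT show_chunks('sensor_data_ts');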
For a visual, see this older blog post: https://blog.timescale.com/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/#result-15x-improvement-in-insert-rate
Note that inserts into the single PG table are about the same at the beginning, but then fall off as the table gets bigger and data/indexes start spilling to disk.
If you want to do larger performance tests, I'd suggest trying out the Time Series Benchmark Suite: https://github.com/timescale/tsbs
With TimescaleDB version 1.7 running on Docker, I was able to insert around 600,000 rows per second on my laptop using https://github.com/vincev/tsdbperf:
$ ./tsdbperf --workers 8 --measurements 200000
[2020-11-03T20:25:48Z INFO tsdbperf] Number of workers: 8
[2020-11-03T20:25:48Z INFO tsdbperf] Devices per worker: 10
[2020-11-03T20:25:48Z INFO tsdbperf] Metrics per device: 10
[2020-11-03T20:25:48Z INFO tsdbperf] Measurements per device: 200000
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 16000000 measurements in 26.55 seconds
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 602750 measurements per second
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 6027500 metrics per second
I am using an open-source time-series database named TimescaleDB (based on PostgreSQL).
Assuming this table (note that order is a reserved word, so the name has to be quoted):
CREATE TABLE "order" (
time TIMESTAMPTZ NOT NULL,
product text,
price DOUBLE PRECISION,
qty DOUBLE PRECISION
);
Next, I transform it into a hypertable with:
SELECT create_hypertable('order', 'time');
Next, I insert some data (more than 5 million rows):
2020-01-01T12:23:52.1235,product1,10,1
2020-01-01T12:23:53.5496,product2,52,7
2020-01-01T12:23:55.3512,product1,23,5
[...]
I then need to update the data so the time column is shifted back by a 1-hour interval, like this:
2020-01-01T11:23:52.1235,product1,10,1
2020-01-01T11:23:53.5496,product2,52,7
2020-01-01T11:23:55.3512,product1,23,5
[...]
What is the most efficient method (in terms of duration) to alter the time column in this hypertable so as to subtract a 1-hour interval from all data in the order table?
I'm not sure whether partitioning is available in Timescale; that would ease the process by putting partitions based on a time range or even a date range.
See if that is one of the options available; that way you could just drop the partition for a given range and voila!
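For completeness, a hedged sketch of one way to apply the 1-hour shift: copy the shifted rows into a fresh hypertable and swap it in afterwards. An in-place UPDATE of the partitioning column can be problematic when a shifted row would have to move into a different chunk. This assumes the table is quoted as "order", as above; the new table name is only an example.
-- new, empty hypertable with the same columns
CREATE TABLE order_shifted (LIKE "order" INCLUDING DEFAULTS);
SELECT create_hypertable('order_shifted', 'time');
-- copy everything with the timestamp shifted back by one hour
INSERT INTO order_shifted (time, product, price, qty)
SELECT time - INTERVAL '1 hour', product, price, qty
FROM "order";
-- once verified: DROP TABLE "order"; ALTER TABLE order_shifted RENAME TO "order";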
I have a large table "measurement" with 4 columns:
measurement-service=> \d measurement
Table "public.measurement"
Column | Type | Collation | Nullable | Default
-----------------------+-----------------------------+-----------+----------+---------
hour | timestamp without time zone | | not null |
config_id | bigint | | not null |
sensor_id | bigint | | not null |
event_id | uuid | | not null |
Partition key: RANGE (hour)
Indexes:
"hour_config_id_sensor_id_event_id_key" UNIQUE CONSTRAINT, btree (hour, config_id, sensor_id, event_id)
Number of partitions: 137 (Use \d+ to list them.)
An example of a partition name: "measurement_y2019m12d04"
And then I insert a lot of events as CSV via COPY into a temporary table, and from there I copy the table directly into the partition using ON CONFLICT DO NOTHING.
Example:
CREATE TEMPORARY TABLE tmp_measurement_y2019m12d04T02_12345 (
hour timestamp without time zone,
config_id bigint,
sensor_id bigint,
event_id uuid
) ON COMMIT DROP;
[...]
COPY tmp_measurement_y2019m12d04T02_12345 FROM STDIN DELIMITER ',' CSV HEADER;
INSERT INTO measurement_y2019m12d04 (SELECT * FROM tmp_measurement_y2019m12d04T02_12345) ON CONFLICT DO NOTHING;
I think I help Postgres by sending CSVs that contain data for a single hour only. Within that hour, I also remove all duplicates, so each CSV contains only unique rows.
But I send many batches for different hours, in no particular order. It can be an hour from today, yesterday, last week, etc.
My approach has worked all right so far, but I think I have now reached a limit. Insertion speed has become very slow. While the CPU is idle, I see 25% I/O wait. The storage subsystem is a RAID with several TB, using spinning disks rather than SSDs.
maintenance_work_mem = 32GB
max_wal_size = 1GB
fsync = off
max_worker_processes = 256
wal_buffers = -1
shared_buffers = 64GB
temp_buffers = 4GB
effective_io_concurrency = 1000
effective_cache_size = 128GB
Each daily partition is around 20 GB and contains no more than 500M rows. And by maintaining the unique index per partition, I effectively duplicate the data once more.
Lookup speed, on the other hand, is quick.
I think the limit is the maintenance of the B-tree over (hour, config_id, sensor_id, event_id) with its rather random UUIDs: I constantly change it, so it is written out and has to be re-read.
I am wondering whether there is another approach. Basically, I want uniqueness over (hour, config_id, sensor_id, event_id) and then a quick lookup by (hour, config_id, sensor_id).
I am considering removing the unique index and keeping only an index on (hour, config_id, sensor_id), and then enforcing uniqueness on the reader side. But that may slow down reading, as the event_id can no longer be delivered from the index when I look up by (hour, config_id, sensor_id); the actual row has to be accessed to get the event_id.
Or I provide uniqueness via a hash index.
Any other ideas are welcome!
Thank you.
When you do the insert, you should specify an ORDER BY which matches the index of the table being inserted into:
INSERT INTO measurement_y2019m12d04
SELECT * FROM tmp_measurement_y2019m12d04T02_12345
order by hour, config_id, sensor_id, event_id
Only if this fails to give enough improvement would I consider any of the other options you list.
Hash indexes don't provide uniqueness. You can simulate it with an exclusion constraint, but I think they are less efficient. Exclusion constraints do support DO NOTHING, but do not support DO UPDATE. So as long as your use case does not evolve to want DO UPDATE, you would be good on that front, but I still doubt it would actually solve the problem. If your bottleneck is IO from updating the index, hash would only make it worse, as it is designed to scatter your data all over the place rather than concentrate it in a small cacheable area.
You also mention parallel processing. For inserting into the temp table, that might be fine. But I wouldn't do the INSERT...SELECT in parallel. If IO is your bottleneck, that would probably just make it worse. Of course if IO is no longer the bottleneck after my ORDER BY suggestion, then ignore this part.
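One note on the earlier idea of dropping the unique index and indexing only (hour, config_id, sensor_id): on PostgreSQL 11 or later, a covering index can still hand event_id back from an index-only lookup without enforcing uniqueness. A sketch, with a hypothetical index name:
CREATE INDEX measurement_lookup_idx
ON measurement (hour, config_id, sensor_id) INCLUDE (event_id);
Lookups by (hour, config_id, sensor_id) can then return event_id straight from the index, at the cost of still keeping all four columns in it.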