How to speed up large copy into a Postgres table - postgresql

I'm trying to load a large dataset (25 GB) into a Postgres table. The command below works, but it takes 2.5 hours and fully utilizes 3-4 cores on the machine the entire time. Can I speed it up significantly? The problem is that I want to insert another 1700 such 25 GB files into the table too (these would be separate partitions). 2.5 hours per file is too slow - or rather, it's not too slow in itself, but it makes me think subsequent queries against the data will be too slow.
Probably my dataset is too big for Postgres, but the idea of being able to run an optimized query (once the data is all inserted and later indexed and partitioned) and get a result back in < 3 seconds is appealing, hence I wanted to try.
I'm mostly following the guidelines from here: I don't have an index on the table yet, I don't use foreign keys, I'm using a bulk COPY, etc.
-- some of the numeric columns have "nan" values so they need to start as strings in raw form
CREATE TABLE raw_eqtl_allpairs (
    gene_id       varchar,
    variant_id    varchar,
    chr           varchar,
    bp            bigint,
    ref_          varchar,
    alt           varchar,
    tss_distance  bigint,
    ma_samples    int,
    ma_count      int,
    maf           varchar,
    pval_nominal  varchar,
    slope         varchar,
    slope_se      varchar,
    chromosome_   varchar,
    pos           bigint,
    af            varchar
);
COPY raw_eqtl_allpairs(gene_id, variant_id, chr, bp, ref_, alt, tss_distance, ma_samples, ma_count, maf, pval_nominal, slope, slope_se, chromosome_, pos, af)
FROM '/Downloads/genomics_data.tsv'
DELIMITER E'\t'
CSV HEADER;
Edit 1:
I'm running Postgres in Docker on my Mac, which has 4 Intel i7 cores and 16 GB of RAM. I have 500 GB of flash storage.
Edit 2:
I'm using the default /var/lib/postgresql/data/postgresql.conf that comes with the Docker Hub Postgres 14.2 tag. For simplicity, I grepped it for the key settings; it seems most of the important ones are commented out:
#maintenance_work_mem = 64MB # min 1MB
#autovacuum_work_mem = -1 # min 1MB, or -1 to use maintenance_work_mem
max_wal_size = 1GB
#wal_level = replica # minimal, replica, or logical
#checkpoint_timeout = 5min # range 30s-1d
#archive_mode = off # enables archiving; off, on, or always
#max_wal_senders = 10 # max number of walsender processes
Maybe I could speed things up a lot if I changed wal_level to minimal, and changed max_wal_size to something larger, like 4GB?
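For reference, here is a hedged sketch of what those changes might look like for a pure bulk-load phase (values are illustrative, not tuned for this machine). Note that wal_level = minimal also requires max_wal_senders = 0 and that archive_mode stays off, and that COPY only skips WAL when the target table was created or truncated in the same transaction:
# postgresql.conf -- illustrative bulk-load settings, not a recommendation
wal_level = minimal              # lets COPY skip WAL for tables created/truncated in the same transaction
max_wal_senders = 0              # required when wal_level = minimal
archive_mode = off               # must stay off with wal_level = minimal
max_wal_size = 4GB               # fewer forced checkpoints during the load
checkpoint_timeout = 30min       # spread checkpoints out
maintenance_work_mem = 1GB       # speeds up the index builds afterwards
-- per 25 GB file, create its partition and COPY it in one transaction
-- (the partition name below is hypothetical)
BEGIN;
CREATE TABLE raw_eqtl_allpairs_p0001 (LIKE raw_eqtl_allpairs INCLUDING ALL);
COPY raw_eqtl_allpairs_p0001 FROM '/Downloads/genomics_data.tsv' DELIMITER E'\t' CSV HEADER;
COMMIT;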

Related

PostgreSQL update with indexed joins

PostgreSQL 14 for Windows on a medium-sized machine. I'm using the default settings - literally as shipped. New to PostgreSQL, coming from MS SQL Server.
A seemingly simple statement that runs in a minute in MS SQL Server is taking hours in PostgreSQL, and I'm not sure why. I'm busy migrating over, i.e. it is the exact same data on the exact same hardware.
It's an update statement that joins a master table (roughly 1000 records) and a fact table (roughly 8 million records). I've masked the tables and exact application here, but the structure is exactly reflective of the real data.
CREATE TABLE public.tmaster(
masterid SERIAL NOT NULL,
masterfield1 character varying,
PRIMARY KEY(masterid)
);
-- I've read that the primary key tag creates an index on that field automatically - correct?
CREATE TABLE public.tfact(
factid SERIAL NOT NULL,
masterid int not null,
fieldtoupdate character varying NULL,
PRIMARY KEY(factid),
CONSTRAINT fk_public_tfact_tmaster
FOREIGN KEY(masterid)
REFERENCES public.tmaster(masterid)
);
CREATE INDEX idx_public_fact_master on public.tfact(masterid);
The idea is to set public.tfact.fieldtoupdate = public.tmaster.masterfield1
I've tried the following ways (all taking over an hour to complete):
update public.tfact b
set fieldtoupdate = c.masterfield1
from public.tmaster c
where c.masterid = b.masterid;
update public.tfact b
set fieldtoupdate = c.masterfield1
from public.tfact bb
join public.tmaster c
on c.masterid = bb.masterid
where bb.factid = b.factid;
with t as (
select b.factid,
c.masterfield1 AS fieldtoupdate
from public.tfact b
join public.tmaster c
on c.masterid = b.masterid
)
update public.tfact b
set fieldtoupdate = t.fieldtoupdate
from t
where t.factid = b.factid;
What am I missing? This should take no time at all, yet it takes over an hour?
Any help is appreciated...
If the table was tightly packed to start with, there will be no room to use the HOT (heap-only tuple) UPDATE shortcut. Updating 8 million rows will mean inserting 8 million new row versions and doing index maintenance on each one.
If your indexed columns on tfact are not clustered, this can involve large amounts of random IO, and if your RAM is small most of this may be uncached. With slow disk, I can see why this would take a long time, maybe even much longer than an hour.
If you will be doing this on a regular basis, you should change the table's fillfactor to keep it loosely packed.
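For example (a hedged sketch; the fillfactor value is just a starting point, not a tuned recommendation):
-- leave ~30% of each page free so future UPDATEs can stay heap-only (HOT)
ALTER TABLE public.tfact SET (fillfactor = 70);
-- fillfactor only affects newly written pages; a full rewrite (VACUUM FULL or
-- CLUSTER) should repack the existing rows with the new setting
VACUUM FULL public.tfact;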
Note that the default settings are generally suited for a small machine, or at least a machine where running the database is only a minor one of its tasks. The only thing likely to affect you here is work_mem, and even that is probably not really a problem for this task.
If you use psql, the command \d+ tfact will show you what the fillfactor is set to, if it is not the default. But note that fillfactor only applies to newly written tuples, not to existing ones. To see the fill on an existing table, you would want to check the free space map using the pg_freespacemap extension and verify that every block has about half of its space available.
To see if an index is well clustered, you can check the correlation column of pg_stats on the table for the leading column ("attname") of the index.
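Roughly, those two checks might look like this (a sketch; pg_freespacemap needs to be installed, and avail is reported in bytes per 8 kB block):
-- how loosely packed is the existing table?
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
SELECT count(*) AS blocks, round(avg(avail)) AS avg_free_bytes_per_block
FROM pg_freespace('public.tfact');
-- how well clustered is the leading column of idx_public_fact_master?
SELECT attname, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'tfact' AND attname = 'masterid';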

Speed query very slow on TimescaleDB

I created a table with the command:
CREATE TABLE public.g_dl (
id bigint NOT NULL,
id_tram integer,
site_id character varying(40),
user_id character varying(40),
app_code character varying(40),
time TIMESTAMP WITHOUT TIME ZONE NOT NULL
);
SELECT create_hypertable('g_dl','time');
Then I inserted about 34 million records.
Query speed was very good, and the RAM used by the Docker container was about 500 MB-1.2 GB. But query speed became a problem after I inserted a further 1.8 million records whose time values were out of order relative to the data I had already inserted.
I use DBeaver and got the message "You might need to increase max_locks_per_transaction", so I increased that value:
max_locks_per_transaction = 1000
Now query speed is very slow and the amount of RAM used by Docker is 10-12 GB. What am I doing wrong? Please let me know.
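(For reference, a hedged sketch of applying that change; max_locks_per_transaction is a postmaster-level parameter, so it only takes effect after a server restart, and the container name below is hypothetical:)
-- writes the setting to postgresql.auto.conf for the next start
ALTER SYSTEM SET max_locks_per_transaction = 1000;
-- restart the server, e.g. docker restart timescaledb, then verify:
SHOW max_locks_per_transaction;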
Output of EXPLAIN:
EXPLAIN analyze select * from g_dlquantracts gd where id_tram = 300
Raw JSON explain:
https://drive.google.com/file/d/1bA5EwcWMUn5oTUirHFD25cSU1cYGwGLo/view?usp=sharing
Formatted execution plan: https://explain.depesz.com/s/tM57
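For reference, the plan filters on id_tram, which has no index in the table definition above, so every chunk has to be sequentially scanned. A hedged, untested sketch of an index one might try (TimescaleDB creates it on each chunk automatically; the EXPLAIN uses the name g_dlquantracts, presumably the unmasked name of g_dl):
CREATE INDEX IF NOT EXISTS idx_g_dl_id_tram_time ON public.g_dl (id_tram, time DESC);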

When comparing insert performance between postgres and timescaledb, timescaledb didn't perform that well?

I ran an insert performance test. Here are the numbers:
Postgres
Insert : Avg Execution Time For 10 inserts of 1 million rows : 6260 ms
Timescale
Insert : Avg Execution Time For 10 inserts of 1 million rows : 10778 ms
Insert Queries:
-- Testing SQL Queries
--Join table
CREATE TABLE public.sensors(
id SERIAL PRIMARY KEY,
type VARCHAR(50),
location VARCHAR(50)
);
-- Postgres table
CREATE TABLE sensor_data (
time TIMESTAMPTZ NOT NULL,
sensor_id INTEGER,
temperature DOUBLE PRECISION,
cpu DOUBLE PRECISION,
FOREIGN KEY (sensor_id) REFERENCES sensors (id)
);
CREATE INDEX idx_sensor_id
ON sensor_data(sensor_id);
-- TimescaleDB table
CREATE TABLE sensor_data_ts (
time TIMESTAMPTZ NOT NULL,
sensor_id INTEGER,
temperature DOUBLE PRECISION,
cpu DOUBLE PRECISION,
FOREIGN KEY (sensor_id) REFERENCES sensors (id)
);
SELECT create_hypertable('sensor_data_ts', 'time');
-- Insert Data
INSERT INTO sensors (type, location) VALUES
('a','floor'),
('a', 'ceiling'),
('b','floor'),
('b', 'ceiling');
-- Postgres
EXPLAIN ANALYSE
INSERT INTO sensor_data (time, sensor_id, cpu, temperature)
SELECT
time,
sensor_id,
random() AS cpu,
random()*100 AS temperature
FROM generate_series(now() - interval '125 week', now(), interval '5 minute') AS g1(time), generate_series(1,4,1) AS g2(sensor_id);
-- TimescaleDB
EXPLAIN ANALYSE
INSERT INTO sensor_data_ts (time, sensor_id, cpu, temperature)
SELECT
time,
sensor_id,
random() AS cpu,
random()*100 AS temperature
FROM generate_series(now() - interval '125 week', now(), interval '5 minute') AS g1(time), generate_series(1,4,1) AS g2(sensor_id);
Am I overlooking any optimizations?
By default, a hypertable creates one chunk per week (that's configurable in the create_hypertable call, as sketched below). So with the above settings you created 125 chunks for TimescaleDB, each holding about 8,000 rows. There is overhead to creating these chunks and to the logic that routes rows into them, and because the dataset is so small that overhead dominates; it is normally amortized over much larger datasets. In most "natural" settings we typically see on the order of millions (or at least hundreds of thousands) of rows per chunk.
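For example, the chunk interval could be widened so that this test data lands in far fewer chunks (a sketch; the 1-year interval is purely illustrative):
-- widen the chunk interval at creation time...
SELECT create_hypertable('sensor_data_ts', 'time', chunk_time_interval => INTERVAL '1 year');
-- ...or change it for chunks created from now on
SELECT set_chunk_time_interval('sensor_data_ts', INTERVAL '1 year');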
You start to see insert-performance differences between a partitioned architecture like TimescaleDB and a single table when the dataset (and in particular, the indexes you are currently maintaining) no longer fits in memory.
In the test above, 1M rows easily fit in memory, and the only index on your vanilla PG table is on sensor_id, so it's pretty tiny. (On the TimescaleDB hypertable you get an index on the timestamp by default, distinct per chunk, so you actually have 125 indexes, each with about 8,000 entries.)
For a visual, see this older blog post: https://blog.timescale.com/blog/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c/#result-15x-improvement-in-insert-rate
Note that inserts into the single PG table are about the same at the beginning, but then fall off as the table gets bigger and data/indexes start spilling to disk.
If you want to do larger performance tests, I'd suggest trying out the Time Series Benchmark Suite: https://github.com/timescale/tsbs
With TimescaleDB version 1.7 running on Docker, I was able to insert around 600,000 rows per second on my laptop using https://github.com/vincev/tsdbperf:
$ ./tsdbperf --workers 8 --measurements 200000
[2020-11-03T20:25:48Z INFO tsdbperf] Number of workers: 8
[2020-11-03T20:25:48Z INFO tsdbperf] Devices per worker: 10
[2020-11-03T20:25:48Z INFO tsdbperf] Metrics per device: 10
[2020-11-03T20:25:48Z INFO tsdbperf] Measurements per device: 200000
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 16000000 measurements in 26.55 seconds
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 602750 measurements per second
[2020-11-03T20:26:15Z INFO tsdbperf] Wrote 6027500 metrics per second

Postgres: How to efficiently bucketize random event ids below (hour,config_id,sensor_id)

I have a large table "measurement" with 4 columns:
measurement-service=> \d measurement
Table "public.measurement"
Column | Type | Collation | Nullable | Default
-----------------------+-----------------------------+-----------+----------+---------
hour | timestamp without time zone | | not null |
config_id | bigint | | not null |
sensor_id | bigint | | not null |
event_id | uuid | | not null |
Partition key: RANGE (hour)
Indexes:
"hour_config_id_sensor_id_event_id_key" UNIQUE CONSTRAINT, btree (hour, config_id, sensor_id, event_id)
Number of partitions: 137 (Use \d+ to list them.)
An example of a partition name: "measurement_y2019m12d04"
I then insert a lot of events as CSV via COPY into a temporary table, and from there I insert them directly into the partition using ON CONFLICT DO NOTHING.
Example:
CREATE TEMPORARY TABLE tmp_measurement_y2019m12d04T02_12345 (
hour timestamp without time zone,
config_id bigint,
sensor_id bigint,
event_id uuid
) ON COMMIT DROP;
[...]
COPY tmp_measurement_y2019m12d04T02_12345 FROM STDIN DELIMITER ',' CSV HEADER;
INSERT INTO measurement_y2019m12d04 (SELECT * FROM tmp_measurement_y2019m12d04T02_12345) ON CONFLICT DO NOTHING;
I think I help Postgres by sending CSV data for a single hour only. Within that hour I also remove all duplicates, so the CSV contains only unique rows.
But I send many batches for different hours, in no particular order. It can be an hour of today, yesterday, last week, etc.
My approach has worked all right so far, but I think I have reached a limit now. The insertion speed has become very slow. While the CPU is idle, I have 25% I/O wait. The subsystem is a RAID of several TB, using spinning disks, not SSDs.
maintenance_work_mem = 32GB
max_wal_size = 1GB
fsync = off
max_worker_processes = 256
wal_buffers = -1
shared_buffers = 64GB
temp_buffers = 4GB
effective_io_concurrency = 1000
effective_cache_size = 128GB
Each daily partition is around 20 GB and contains no more than 500 million rows. And by maintaining the unique index per partition, I essentially duplicate the data once more.
The lookup speed, on the other hand, is quick.
I think the limit is the maintenance of the btree, given the rather random UUIDs under (hour, config_id, sensor_id). I constantly change it, it gets written out, and it has to be re-read.
I am wondering if there is another approach. Basically I want uniqueness over (hour, config_id, sensor_id, event_id) and then a quick lookup by (hour, config_id, sensor_id).
I am considering dropping the unique index and keeping only an index over (hour, config_id, sensor_id), and then enforcing uniqueness on the reader side. But that may slow down reads, because the event_id can no longer be delivered from the index when I look up by (hour, config_id, sensor_id); it has to access the actual row to get the event_id.
Or I could provide uniqueness via a hash index.
Any other ideas are welcome!
Thank you.
When you do the insert, you should specify an ORDER BY which matches the index of the table being inserted into:
INSERT INTO measurement_y2019m12d04
SELECT * FROM tmp_measurement_y2019m12d04T02_12345
order by hour, config_id, sensor_id, event_id
Only if this fails to give enough improvement would I consider any of the other options you list.
Hash indexes don't provide uniqueness. You can simulate it with an exclusion constraint, but I think they are less efficient. Exclusion constraints do support DO NOTHING, but not DO UPDATE. So as long as your use case does not evolve to want DO UPDATE, you would be fine on that front, but I still doubt it would actually solve the problem. If your bottleneck is IO from updating the index, hash would only make it worse, as it is designed to scatter your data all over the place rather than concentrate it in a small cacheable area.
You also mention parallel processing. For inserting into the temp table, that might be fine. But I wouldn't do the INSERT...SELECT in parallel. If IO is your bottleneck, that would probably just make it worse. Of course if IO is no longer the bottleneck after my ORDER BY suggestion, then ignore this part.

pgAdmin slow when displaying data?

I am seeing a huge degradation in performance after moving some tables from SQL Server 2008 to Postgres, and I'm wondering if I'm missing a configuration step, or it is normal for postgres to behave this way.
The query used is a simple SELECT from the table. No joins, no ordering, nothing.
The table itself has only about 12K rows.
I have tried this on 3 machines:
Machine A hardware: 50GB RAM, SSD disks, CPU: Xeon® E5-2620v3 (OS: Ubuntu Server 16), DBMS: Postgres 9.5
Machine B hardware: 8GB RAM, SATA disks, CPU: Xeon E5-4640 (OS: Ubuntu Server 12), DBMS: Postgres 9.4
Machine C hardware: 4GB RAM, IDE disks, CPU: Xeon E3-1220v2 (OS: Windows Server 2008), DBMS: SQL Server 2008 R2
The performance I am seeing is similar between the 2 Postgres databases, despite the vast difference in hardware and configuration. How can this be?
Machine A query. Notice that I'm excluding the geometry column in order to work with "pure" datatypes:
EXPLAIN ANALYZE VERBOSE SELECT id, "OID", type, name, orofos, xrisi_orofoy, area_sqm,
perimeter_m, syn1_, syn1_id, str_name, str_no, katanomh, linkc,
xrcode, kat, ot, use, syn, notes, "MinX", "MinY", "MaxX", "MaxY"
FROM public."korydallos_Land_Uses";
Results:
"Seq Scan on public."korydallos_Land_Uses" (cost=0.00..872.41 rows=12841 width=209) (actual time=0.025..13.450 rows=12841 loops=1)"
" Output: id, "OID", type, name, orofos, xrisi_orofoy, area_sqm, perimeter_m, syn1_, syn1_id, str_name, str_no, katanomh, linkc, xrcode, kat, ot, use, syn, notes, "MinX", "MinY", "MaxX", "MaxY""
"Planning time: 0.137 ms"
"Execution time: 14.788 ms"
This is 14 seconds for a simple select!! Wtf? Compare this with SQL Server:
Query Profile Statistics
Number of INSERT, DELETE and UPDATE statements 0
Rows affected by INSERT, DELETE, or UPDATE statements 0
Number of SELECT statements 1
Rows returned by SELECT statements 12840
Number of transactions 0
Network Statistics
Number of server roundtrips 1
TDS packets sent from client 1
TDS packets received from server 1040
Bytes sent from client 1010
Bytes received from server 2477997
Time Statistics
Client processing time 985
Total execution time 1022
Wait time on server replies 37
I am at a loss at what could be happening. I also tried:
Checking for dead rows: 0
Vacuuming
Simply querying the primary key (!). This takes 500ms to execute.
With each column I add to the select, around 500 more ms are added to the query.
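(For completeness, a sketch of how the dead-row check and the vacuum might look; these are not necessarily the exact commands used above:)
SELECT n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public' AND relname = 'korydallos_Land_Uses';
VACUUM VERBOSE ANALYZE public."korydallos_Land_Uses";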
Machine A Postgres performance settings:
max_connections = 200
shared_buffers = 12800MB
effective_cache_size = 38400MB
work_mem = 32MB
maintenance_work_mem = 2GB
min_wal_size = 4GB
max_wal_size = 8GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
Machine B Postgres performance settings:
max_connections = 200
shared_buffers = 128MB
#effective_cache_size = 4GB
#work_mem = 4MB
#maintenance_work_mem = 64MB
#min_wal_size = 80MB
#max_wal_size = 1GB
#checkpoint_completion_target = 0.5
#wal_buffers = -1
#default_statistics_target = 100
Table definition in Postgres:
CREATE TABLE public."korydallos_Land_Uses"
(
id integer NOT NULL DEFAULT nextval('"korydallos_Land_Uses_id_seq"'::regclass),
wkb_geometry geometry(Polygon,4326),
"OID" integer,
type character varying(255),
name character varying(255),
orofos character varying(255),
xrisi_orofoy character varying(255),
area_sqm numeric,
perimeter_m numeric,
syn1_ numeric,
syn1_id numeric,
str_name character varying(255),
str_no character varying(255),
katanomh numeric,
linkc numeric,
xrcode character varying(255),
kat numeric,
ot character varying(255),
use character varying(255),
syn numeric,
notes character varying(255),
"MinX" numeric,
"MinY" numeric,
"MaxX" numeric,
"MaxY" numeric,
CONSTRAINT "korydallos_Land_Uses_pkey" PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public."korydallos_Land_Uses"
OWNER TO root;
CREATE INDEX "sidx_korydallos_Land_Uses_wkb_geometry"
ON public."korydallos_Land_Uses"
USING gist
(wkb_geometry);
EDIT: Removed the irrelevant SQL Server definition as suggested in the comments. Keeping the time as I think it's still relevant.
As per the comments, more info using:
explain (analyze, verbose, buffers, timing) SELECT id, "OID", type, name, orofos, xrisi_orofoy, area_sqm,
perimeter_m, syn1_, syn1_id, str_name, str_no, katanomh, linkc,
xrcode, kat, ot, use, syn, notes, "MinX", "MinY", "MaxX", "MaxY"
FROM public."korydallos_Land_Uses"
Results:
"Seq Scan on public."korydallos_Land_Uses" (cost=0.00..872.41 rows=12841 width=209) (actual time=0.019..11.207 rows=12841 loops=1)"
" Output: id, "OID", type, name, orofos, xrisi_orofoy, area_sqm, perimeter_m, syn1_, syn1_id, str_name, str_no, katanomh, linkc, xrcode, kat, ot, use, syn, notes, "MinX", "MinY", "MaxX", "MaxY""
" Buffers: shared hit=744"
"Planning time: 1.073 ms"
"Execution time: 12.269 ms"
PG Admin shows me this in the "Explain tab":
How I measure the 14 seconds:
Status window of PG Admin 3, bottom right corner, when running the query. (It says 14.3 secs for the trolls here).
https://www.postgresql.org/docs/current/static/using-explain.html
Note that the “actual time” values are in milliseconds of real time,
so in your case
actual time=0.019..11.207
means running the query took about 11 milliseconds.
pgadmin "explain tab" says the same... Now if you see 14.3 sec in right bottom corner and the time it took is indeed 14 seconds (measured with watches) I assume it is some awful delay on network level or pgadmin itself. Try running this in psql for instance:
select clock_timestamp();
explain analyze select * FROM public."korydallos_Land_Uses";
select clock_timestamp();
This will show the server-side time intervals plus the time needed to send the command from psql to the server. If it still takes 14 seconds, talk to your network admin; if not, try upgrading pgAdmin.
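Alternatively, psql's \timing switch reports the elapsed time as seen by the client (server execution plus transfer of the result to psql, but not pgAdmin's grid rendering):
\timing on
SELECT * FROM public."korydallos_Land_Uses";
-- the reported "Time: ... ms" includes execution and transfer to psql;
-- if it is far below 14 seconds, the rest is pgAdmin rendering the result grid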