Inserting row to temp table - postgresql

I have a performance problem when creating a temporary table. The following code is part of a plpgsql function:
StartTime := clock_timestamp();
CREATE TEMP TABLE wo_tmp WITH (OIDS) AS
SELECT workorders1_.woid AS w_id, workorders1_.woid4seg AS w_id4seg
FROM common.workorders workorders1_
INNER JOIN common.lines lines2_ ON workorders1_.wolineid=lines2_.lineid
INNER JOIN common.products products2_ ON workorders1_.woprodid=products2_.prodid
INNER JOIN common.depts depts3_ ON lines2_.linedeptid=depts3_.deptid
WHERE workorders1_.wostatus='F'
AND workorders1_.wotypestatus = ANY ('{R,C,I,D}'::text[])
AND (p_deptid = 0 OR (depts3_.deptid = p_deptid AND ((p_deptid = 5 AND workorders1_.wosegid = 1) OR workorders1_.wosegid = 4)))
AND (p_lineid = 0 OR lines2_.lineid = p_lineid)
AND (p_prodid = 0 OR products2_.prodid = p_prodid)
AND (p_nrkokili = 0 OR workorders1_.wonrkokili = p_nrkokili)
AND (p_accepted = TRUE OR workorders1_.worjacceptstatus = 'Y')
AND workorders1_.wodateleaverr BETWEEN p_dfr AND p_dto
AND lines2_.status <> 'D';
CREATE INDEX wo_tmp_w_id_idx
ON wo_tmp USING btree (w_id ASC NULLS LAST);
CREATE INDEX wo_tmp_w_id4seg_idx
ON wo_tmp USING btree (w_id4seg ASC NULLS LAST);
EndTime := clock_timestamp();
Delta := extract('epoch' from EndTime)::bigint - extract('epoch' from StartTime)::bigint;
RAISE NOTICE 'Duration [0] in seconds=%', Delta;
Here's an explain analyze report: http://explain.depesz.com/s/uerF
It's strange, because when I execute this function I get the notice: Duration [0] in seconds=11. When I run the query without creating the temp table, it takes ~300ms.
Is it possible that inserting ~73k records into a temporary table takes 11 seconds? Can I speed it up?

When you fill a temp table inside functions, you can run into more than one issue:
locking issues - every temp table is a table with entries in the system catalog. Intensively creating and dropping these tables causes high overhead with a lot of locking. Sometimes temp tables can be replaced by arrays. That is not your case, because you need indexes.
blind optimization - embedded SQL in PL/pgSQL functions is optimized for the most common values (this mechanism was slightly enhanced in PostgreSQL 9.2, but still with possible performance issues). It is not optimized for the current values, and this can cause performance issues. Then dynamic SQL is necessary. Some links on this issue (one and second)
some hardware or file system issues - I am a little confused that WITHOUT OIDS helped. It looks like your file system is a terrible bottleneck for you. Temp tables are stored in the file system cache, so storing 73K rows there should be fast. Removing four bytes (from 35) per row is not a big change.
postgres=# create table t1 with (oids) as select 1 a,2 b,3 c from generate_series(1,73000);
SELECT 73000
Time: 302.083 ms
postgres=# create table t2 as select 1 a,2 b,3 c from generate_series(1,73000);
SELECT 73000
Time: 267.459 ms
postgres=# create temp table t3 with (oids) as select 1 a,2 b,3 c from generate_series(1,73000);
SELECT 73000
Time: 154.431 ms
postgres=# create temp table t4 as select 1 a,2 b,3 c from generate_series(1,73000);
SELECT 73000
Time: 153.085 ms
postgres=# \dt+ t*
List of relations
Schema | Name | Type | Owner | Size | Description
-----------+------+-------+-------+---------+-------------
pg_temp_2 | t3 | table | pavel | 3720 kB |
pg_temp_2 | t4 | table | pavel | 3160 kB |
public | t1 | table | pavel | 3720 kB |
public | t2 | table | pavel | 3160 kB |
(4 rows)
Writing a 3MB file to the file system should take significantly less than 1 sec, so an 11 sec overhead is strange. P.S. The default temp_buffers is 8MB, so your result should be stored in memory only. That makes the file system hypothesis probably false, and the blind optimization hypothesis more probable.
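To illustrate the dynamic SQL idea from the blind optimization point, here is a minimal client-side sketch in Python. It is an analogue, not the PL/pgSQL EXECUTE form: it builds the WHERE clause from only the filters that are actually set, so the planner never sees the generic "p_x = 0 OR col = p_x" pattern. The function name build_optional_filters is hypothetical, and only three of the question's optional filters are covered.

```python
def build_optional_filters(p_lineid=0, p_prodid=0, p_nrkokili=0):
    """Return (where_sql, params) containing only the filters that are active.

    A value of 0 means "not filtered", mirroring the
    "p_x = 0 OR col = p_x" convention used in the question.
    """
    candidates = [
        ("lines2_.lineid", p_lineid),
        ("products2_.prodid", p_prodid),
        ("workorders1_.wonrkokili", p_nrkokili),
    ]
    clauses, params = [], []
    for column, value in candidates:
        if value != 0:
            # Keep values as bound parameters; only the clause list varies.
            clauses.append(f"{column} = %s")
            params.append(value)
    # "TRUE" keeps the fragment valid when no optional filter is set.
    return " AND ".join(clauses) or "TRUE", params


where_sql, params = build_optional_filters(p_lineid=7)
print(where_sql, params)
```

The returned fragment would be appended to the fixed part of the WHERE clause, with params passed separately to the driver.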

For starters, don't use WITH (OIDS) for temporary tables. Ever. Use of OIDs in regular tables is discouraged. That goes doubly for temp tables. Dropping them also reduces the required RAM / space on disk, which is probably the main bottleneck here. Switch to WITHOUT OIDS.
Next, a likely cause (educated guess) is a lack of temp buffers which forces the temp table to spill to disk. Check the actual size of the temp table with
SELECT pg_size_pretty(pg_relation_size('wo_tmp'));
And set temp_buffers accordingly, possibly for the session only - round up generously, enough to avoid writing to disk.
Details:
How to delete duplicate entries?
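As a rough sketch of "round up generously", the helper below turns a size in bytes (as returned by pg_relation_size) into a SET statement. The function name and the headroom factor of 2 are assumptions for illustration, and the 8MB floor is the default mentioned above. Note that temp_buffers can only be changed before the first use of temporary tables in the session.

```python
import math


def temp_buffers_setting(relation_bytes, headroom=2.0):
    """Return a SET statement sizing temp_buffers to hold the table with headroom.

    headroom is an assumed safety factor (room for indexes and growth);
    the result never goes below the 8MB default.
    """
    megabytes = max(8, math.ceil(relation_bytes * headroom / (1024 * 1024)))
    return f"SET temp_buffers = '{megabytes}MB';"


# A 50 MB temp table with 2x headroom.
print(temp_buffers_setting(50 * 1024 * 1024))
```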

Related

Postgresql + psycopg: Bulk Insert large data with POSTGRESQL function call

I am working with a very large amount of very simple data (point clouds). I want to insert this data into a simple table in a Postgresql database using Python.
An example of the insert statement I need to execute is as follows:
INSERT INTO points_postgis (id_scan, scandist, pt) VALUES (1, 32.656, ST_MakePoint(1.1, 2.2, 3.3));
Note the call to the Postgresql function ST_MakePoint in the INSERT statement.
I must call this billions (yes, billions) of times, so obviously I must insert the data into the Postgresql in a more optimized way. There are many strategies to bulk insert the data as this article presents in a very good and informative way (insertmany, copy, etc).
https://hakibenita.com/fast-load-data-python-postgresql
But no example shows how to do these inserts when you need to call a function on the server-side. My question is: how can I bulk INSERT data when I need to call a function on the server-side of a Postgresql database using psycopg?
Any help is greatly appreciated! Thank you!
Please note that using a CSV doesn't make much sense because my data is huge.
Alternatively, I already tried filling a temp table with simple columns for the 3 inputs of the ST_MakePoint function and then, once all the data is in this temp table, running an INSERT/SELECT. The problem is that this takes a lot of time and the amount of disk space I need for it is nonsensical.
Most importantly, to do this within a reasonable time and with minimum effort, break the task down into component parts so that you can take advantage of different Postgres features separately.
First, create the table without the geometry transformation, such as:
create table temp_table (
id_scan bigint,
scandist numeric,
pt_1 numeric,
pt_2 numeric,
pt_3 numeric
);
Since we do not add any indexes or constraints, this will most likely be the fastest way to get the "raw" data into the RDBMS.
The best way to do this is the COPY method, which you can use either from Postgres directly (if you have sufficient access), or via the Python interface by using https://www.psycopg.org/docs/cursor.html#cursor.copy_expert
Here is example code to achieve this:
import psycopg2

# target_host, target_usr, target_db and target_pw are your own connection settings
iconn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(target_host, target_usr, target_db, target_pw, "require")
iconn = psycopg2.connect(iconn_string)
import_cursor = iconn.cursor()
csv_filename = '/path/to/my_file.csv'
copy_sql = "COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) FROM STDIN WITH CSV HEADER DELIMITER ',' QUOTE '\"' ESCAPE '\\' NULL AS 'null'"
# Stream the file to the server through COPY ... FROM STDIN
with open(csv_filename, mode='r', encoding='utf-8', errors='ignore') as csv_file:
    import_cursor.copy_expert(copy_sql, csv_file)
iconn.commit()
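Since the question says the data is too large for a CSV file on disk, the same COPY call can also be fed from memory in chunks. The following is a minimal sketch under that assumption: rows_to_copy_buffer and copy_in_chunks are hypothetical helper names, the chunk size is arbitrary, and cursor is expected to be a psycopg2 cursor like import_cursor above.

```python
import csv
import io


def rows_to_copy_buffer(rows):
    """Serialize an iterable of row tuples into an in-memory CSV buffer for COPY."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    buffer.seek(0)  # rewind so COPY reads from the start
    return buffer


def copy_in_chunks(cursor, rows, chunk_size=100_000):
    """Send rows to temp_table chunk by chunk, never holding all rows at once."""
    copy_sql = ("COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) "
                "FROM STDIN WITH (FORMAT csv)")
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            cursor.copy_expert(copy_sql, rows_to_copy_buffer(chunk))
            chunk = []
    if chunk:
        # Flush the final, partially filled chunk.
        cursor.copy_expert(copy_sql, rows_to_copy_buffer(chunk))


# One row in the column order (id_scan, scandist, pt_1, pt_2, pt_3).
print(rows_to_copy_buffer([(1, 32.656, 1.1, 2.2, 3.3)]).getvalue())
```

Because rows can be any iterable, a generator that reads the point cloud incrementally keeps memory use bounded by chunk_size.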
The next step is to efficiently create the table you want from the existing raw data. You will then be able to populate your actual target table with a single SQL statement and let the RDBMS do its magic.
Once the data is in the RDBMS, it makes sense to optimize it a little and add an index or two if applicable (preferably a primary or unique index, to speed up the transformation).
This will depend on your data / use case, but something like this should help:
alter table temp_table add primary key (id_scan); --if unique
-- or
create index idx_temp_table_1 on temp_table(id_scan); --if not unique
To move data from raw into your target table:
with temp_t as (
select id_scan, scandist, ST_MakePoint(pt_1, pt_2, pt_3) as pt from temp_table
)
INSERT INTO points_postgis (id_scan, scandist, pt)
SELECT temp_t.id_scan, temp_t.scandist, temp_t.pt
FROM temp_t;
This will in one go select all data from the previous table and transform it.
A second option is similar. You can load all raw data into points_postgis directly, keeping the coordinates in 3 temporary columns. Then add the geometry column with alter table points_postgis add column pt geometry; populate it with update points_postgis set pt = ST_MakePoint(pt_1, pt_2, pt_3); and finally remove the temporary columns with alter table points_postgis drop column pt_1, drop column pt_2, drop column pt_3;
The main takeaway is that the most performant option is not to concentrate on the final table state, but to break the work down into easily achievable chunks. Postgres will easily handle both the import of billions of rows and their transformation afterwards.
Some simple examples using a function that generates a UPC A barcode with check digit:
Using execute_batch. execute_batch has a page_size argument that lets you batch the inserts by sending several statements per round trip. By default this is set to 100, which will insert 100 rows at a time. You can bump this up to make fewer round trips to the server.
Using just execute and selecting the data from another table.
import psycopg2
from psycopg2.extras import execute_batch
con = psycopg2.connect(dbname='test', host='localhost', user='postgres',
port=5432)
cur = con.cursor()
cur.execute('create table import_test(id integer, suffix_val varchar, upca_val varchar)')
con.commit()
# Input data as a list of tuples. This means some data is duplicated.
input_list = [(1, '12345', '12345'), (2, '45278', '45278'), (3, '61289', '61289')]
execute_batch(cur, 'insert into import_test values(%s, %s, upc_check_digit(%s))', input_list)
con.commit()
select * from import_test ;
id | suffix_val | upca_val
----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
# Input data as list of dicts and using named parameters to avoid duplicating data.
input_list_dict = [{'id': 50, 'suffix_val': '12345'}, {'id': 51, 'suffix_val': '45278'}, {'id': 52, 'suffix_val': '61289'}]
execute_batch(cur, 'insert into import_test values(%(id)s, %(suffix_val)s, upc_check_digit(%(suffix_val)s))', input_list_dict)
con.commit()
select * from import_test ;
id | suffix_val | upca_val
----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
50 | 12345 | 744835123458
51 | 45278 | 744835452787
52 | 61289 | 744835612891
# Create a table with values to be used for inserting into final table
cur.execute('create table input_vals (id integer, suffix_val varchar)')
con.commit()
execute_batch(cur, 'insert into input_vals values(%s, %s)', [(100, '76234'),
(101, '92348'), (102, '16235')])
con.commit()
cur.execute('insert into import_test select id, suffix_val, upc_check_digit(suffix_val) from input_vals')
con.commit()
select * from import_test ;
id | suffix_val | upca_val
-----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
50 | 12345 | 744835123458
51 | 45278 | 744835452787
52 | 61289 | 744835612891
100 | 76234 | 744835762343
101 | 92348 | 744835923485
102 | 16235 | 744835162358
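To make the effect of page_size concrete, the sketch below shows how a batch of rows is grouped into pages and how many round trips result. It is an illustration of the grouping that page_size controls, not psycopg2's actual implementation; paginate and round_trips are hypothetical names.

```python
import math


def paginate(rows, page_size=100):
    """Yield consecutive lists of at most page_size rows."""
    page = []
    for row in rows:
        page.append(row)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        # Last page may be smaller than page_size.
        yield page


def round_trips(row_count, page_size=100):
    """Number of server round trips needed for row_count rows."""
    return math.ceil(row_count / page_size)


print([len(page) for page in paginate(range(250), page_size=100)])
print(round_trips(1_000_000, page_size=100), round_trips(1_000_000, page_size=10_000))
```

With billions of rows, the number of round trips shrinks in proportion to page_size, which is why raising it reduces network overhead.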

PostgreSQL Nested Loop Join Performance

I have two tables, exchange_rate (100 thousand rows) and paid_date_t (9 million rows), with the structure below.
"exchange_rate"
Column | Type | Collation | Nullable | Default
-----------------------------+--------------------------+-----------+----------+---------
valid_from | timestamp with time zone | | |
valid_until | timestamp with time zone | | |
currency | text | | |
Indexes:
"exchange_rate_unique_valid_from_currency_key" UNIQUE, btree (valid_from, currency)
"exchange_rate_valid_from_gist_idx" gist (valid_from)
"exchange_rate_valid_from_until_currency_gist_idx" gist (valid_from, valid_until, currency)
"exchange_rate_valid_from_until_gist_idx" gist (valid_from, valid_until)
"exchange_rate_valid_until_gist_idx" gist (valid_until)
"paid_date_t"
Column | Type | Collation | Nullable | Default
-------------------+-----------------------------+-----------+----------+---------
currency | character varying(3) | | |
paid_date | timestamp without time zone | | |
Indexes:
"paid_date_t_paid_date_idx" btree (paid_date)
I am running the select query below, joining these tables on multiple join keys:
SELECT
paid_date
FROM exchange_rate erd
JOIN paid_date_t sspd
ON sspd.paid_date >= erd.valid_from AND sspd.paid_date < erd.valid_until
AND erd.currency = sspd.currency
WHERE sspd.currency != 'USD'
However, the query performs poorly and takes hours to execute. The query plan below shows that it is using a nested loop join.
Nested Loop (cost=0.28..44498192.71 rows=701389198 width=40)
-> Seq Scan on paid_date_t sspd (cost=0.00..183612.84 rows=2557615 width=24)
Filter: ((currency)::text <> 'USD'::text)
-> Index Scan using exchange_rate_valid_from_until_currency_gist_idx on exchange_rate erd (cost=0.28..16.53 rows=80 width=36)
Index Cond: (currency = (sspd.currency)::text)
Filter: ((sspd.paid_date >= valid_from) AND (sspd.paid_date < valid_until))
I have tried different indexing methods but got the same result. I know that the <= and >= operators do not support merge or hash joins.
Any ideas are appreciated.
You should create a smaller table with just a sample of the rows from paid_date_t in it. It is hard to optimize a query if it takes a very long time each time you try to test it.
Your btree index has the column tested for equality as the 2nd column, which is certainly less efficient. The better btree index for this query (as it is currently written) would be something like (currency, valid_from, valid_until).
For a gist index, you really want it to be on the time range, not on the separate end points of the range. You could either convert the table to hold a range type, or build a functional index to convert them on the fly (and then rewrite the query to use the same expression). This is complicated by the fact that your tables have different types due to the different handling of time zones. The index would look like:
create index on exchange_rate using gist (tstzrange(valid_from,valid_until), currency);
and then the ON condition would look like:
ON sspd.paid_date::timestamptz <@ tstzrange(erd.valid_from, erd.valid_until)
AND erd.currency = sspd.currency
It might be faster to reverse the order of the columns in the gist index from what I show; you should try it both ways on your own data and see.
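To show why leading with the equality column helps, here is a small in-memory analogue of a (currency, valid_from, valid_until) btree lookup: jump to the currency, then binary-search on valid_from instead of filtering every rate of that currency. It assumes the periods of one currency do not overlap, which the question does not state; the function names and sample rows are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def build_rate_index(rates):
    """Group (currency, valid_from, valid_until) rows by currency, sorted by valid_from."""
    index = defaultdict(list)
    for currency, valid_from, valid_until in rates:
        index[currency].append((valid_from, valid_until))
    for periods in index.values():
        periods.sort()
    return index


def find_rate_period(index, currency, paid_date):
    """Return the period with valid_from <= paid_date < valid_until, or None."""
    periods = index.get(currency, [])
    # Position of the last period whose valid_from <= paid_date.
    pos = bisect_right(periods, (paid_date, datetime.max)) - 1
    if pos >= 0 and periods[pos][0] <= paid_date < periods[pos][1]:
        return periods[pos]
    return None


# Hypothetical sample data: two adjacent one-day periods for EUR.
SAMPLE_RATES = [
    ("EUR", datetime(2020, 1, 1), datetime(2020, 1, 2)),
    ("EUR", datetime(2020, 1, 2), datetime(2020, 1, 3)),
]
SAMPLE_INDEX = build_rate_index(SAMPLE_RATES)
print(find_rate_period(SAMPLE_INDEX, "EUR", datetime(2020, 1, 1, 12)))
```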

Postgres best way to delete duplicates in large table with no primary key

I have a table that logs scan events, in which I store the first and last event. Each night at midnight all the scan events from the previous day are added to the table, duplicates are dropped, and a query is run to delete anything other than the scan events with the minimum and maximum timestamps.
One of the problems is that the data provider recycles scan IDs every 45 days, so this table does not have a primary key. Here is an example of what the table looks like in its final state:
|scaneventID|scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
But before the cleanup queries are run it can look like this:
|scaneventID|scandatetime |status |scanfacilityzip|
+-----------+-------------------+---------+---------------+
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 13:45:12|Intake |12345 |
|isdijh23452|2020-01-01 19:30:32|Received |12345 |
|isdijh23452|2020-01-02 04:50:22|Confirmed|12345 |
|isdijh23452|2020-01-03 19:32:18|Processed|45867 |
|awgjnh09864|2020-01-01 10:24:16|Intake |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-01 19:30:32|Received |84676 |
|awgjnh09864|2020-01-02 02:15:52|Processed|84676 |
This is because there is sometimes data overlap from the vendor, and there's nothing we can really do about that. I currently run the following query to delete duplicates:
DELETE FROM scans T1
USING scans T2
WHERE EXTRACT(DAY FROM current_timestamp-T1.scandatetime) < 2
AND T1.ctid < T2.ctid
AND T1.scaneventID = T2.scaneventID
AND T1.scandatetime = T2.scandatetime
;
And to retain only the min/max timestamps:
delete from scans
where EXTRACT(DAY FROM current_timestamp-scandatetime) < 2 and
scandatetime <> (select min(tt.scandatetime) from scans tt where tt.scaneventID = scans.scaneventID) and
scandatetime <> (select max(tt.scandatetime) from scans tt where tt.scaneventID = scans.scaneventID)
;
However, the table is quite large (hundreds of millions of scans over multiple years), so these run quite slowly. How can I speed this up?
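For reference, the intended result of the two cleanup queries can be expressed in a single pass over the new rows. The sketch below is only a Python model of that rule (drop exact duplicates, keep the earliest and latest scan per ID), using the example rows from above; it is not a replacement for the SQL, and keep_first_and_last is a hypothetical name.

```python
def keep_first_and_last(rows):
    """Keep only the earliest and latest scan per scaneventID.

    rows: iterable of (scaneventID, scandatetime, status, scanfacilityzip).
    Timestamps are 'YYYY-MM-DD HH:MM:SS' strings, which sort chronologically.
    """
    first, last = {}, {}
    for row in rows:
        event_id, scanned_at = row[0], row[1]
        if event_id not in first or scanned_at < first[event_id][1]:
            first[event_id] = row
        if event_id not in last or scanned_at > last[event_id][1]:
            last[event_id] = row
    result = []
    for event_id in first:
        result.append(first[event_id])
        # Avoid emitting the same row twice when an ID has a single timestamp.
        if last[event_id][1] != first[event_id][1]:
            result.append(last[event_id])
    return result


RAW_SCANS = [
    ("isdijh23452", "2020-01-01 13:45:12", "Intake", "12345"),
    ("isdijh23452", "2020-01-01 13:45:12", "Intake", "12345"),
    ("isdijh23452", "2020-01-01 19:30:32", "Received", "12345"),
    ("isdijh23452", "2020-01-02 04:50:22", "Confirmed", "12345"),
    ("isdijh23452", "2020-01-03 19:32:18", "Processed", "45867"),
    ("awgjnh09864", "2020-01-01 10:24:16", "Intake", "84676"),
    ("awgjnh09864", "2020-01-01 19:30:32", "Received", "84676"),
    ("awgjnh09864", "2020-01-01 19:30:32", "Received", "84676"),
    ("awgjnh09864", "2020-01-02 02:15:52", "Processed", "84676"),
]
for row in keep_first_and_last(RAW_SCANS):
    print(row)
```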

Postgres sequence's last_value field does not work as expected

I have a table in postgres whose primary key is assigned using a sequence (let's call it 'a_seq'). The sequence increments the value, and the current value is inserted as the primary key of the record being inserted.
Code I use for the sequence:
CREATE SEQUENCE public.a_seq
INCREMENT 1
START 1
MINVALUE 1
MAXVALUE 9223372036854775807
CACHE 1;
ALTER SEQUENCE public.a_seq OWNER TO postgres;
I am trying to copy a file from a disk and insert information about the copied file into the table. There are files with the same name on the disk, so I'm retrieving the "last_value" from the sequence with this query:
SELECT last_value FROM a_seq;
and rename the file to "<last_value>_<fileName>", then insert it into the database so that the file name and the primary key (id) of that file are consistent, like:
id | fileName
1 | 1_asd.txt
But when I insert the record, the id is always 1 greater than the "last_value" I get from the query, so the table looks like this:
id | fileName
2 | 1_asd.txt
And I've tried executing the select query above multiple times to check whether it increments the value, but it doesn't.
Any idea how to get the value which will be assigned to the record before the insertion?
NOTE: I use MATLAB and this is the code I use for insertion:
colnames = {'DataType' , ...
'FilePath' , ...
'FileName' , ...
'FileVersion' , ...
'CRC32Q' , ...
'InsertionDateTime', ...
'DataSource' };
data = {FileLine{5} ,... % DataType
tempPath ,... % FilePath
FileLine{1} ,... % FileName
FileLine{2} ,... % FileVersion
FileLine{3} ,... % CRC32Q
FileLine{4} ,... % InsertionDateTime
FileLine{6} }; % DataSource
data_table = cell2table(data, 'VariableNames', colnames);
datainsert(conn , 'CopiedFiles' , colnames , data_table);
updated
What I believe happens for you is: when you select last_value, you get the last used sequence value, and when you insert a row, the default value for id is nextval, which advances the value by one...
previous
I believe you have an extra nextval somewhere in a middle step. If you do it in one statement, it works as you expect, e.g.:
t=# create table so12(s int default nextval('s'), t text);
CREATE TABLE
t=# insert into so12(t) select last_value||'_abc.txt' from s;
INSERT 0 1
t=# select * from so12;
s | t
---+-----------
1 | 1_abc.txt
(1 row)
update2
As Nick Barnes noticed, further iterations (after the initial one) will give wrong results, so you need to use his proposed CASE logic.
This is a quirk in the way Postgres implements sequences; as inherently non-transactional objects in a transactional database, they behave a bit strangely.
The first time you call nextval() on a sequence, it will not affect the number you see in a_seq.last_value. However, it will flip the a_seq.is_called flag:
test=# create sequence a_seq;
test=# select last_value, is_called from a_seq;
last_value | is_called
------------+-----------
1 | f
test=# select nextval('a_seq');
nextval
---------
1
test=# select last_value, is_called from a_seq;
last_value | is_called
------------+-----------
1 | t
So if you need the next value in the sequence, you'd want something like
SELECT
last_value + CASE WHEN is_called THEN 1 ELSE 0 END
FROM a_seq
Note that this is horribly broken if two processes are doing this concurrently, as there's no guarantee you'll actually receive this value from your next nextval() call. In that case, if you really need the filename to match the id, you'd need to either generate it with a trigger, or UPDATE it once you know what the id is.
But in my experience, it's best to avoid any dependencies between your data and your keys. If all you need is a unique filename, I'd just create an independent filename_seq.
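The last_value / is_called behaviour described above can be modelled in a few lines. This is a toy simulation for a sequence with INCREMENT 1 and START 1, not how Postgres implements sequences; SimulatedSequence, peek_next and demo are hypothetical names.

```python
class SimulatedSequence:
    """Toy model of a sequence's last_value / is_called pair."""

    def __init__(self, start=1):
        self.last_value = start
        self.is_called = False

    def nextval(self):
        if self.is_called:
            self.last_value += 1
        else:
            # First call returns the start value and only flips the flag.
            self.is_called = True
        return self.last_value

    def peek_next(self):
        # Same logic as: last_value + CASE WHEN is_called THEN 1 ELSE 0 END
        return self.last_value + (1 if self.is_called else 0)


def demo():
    """Record (last_value, predicted next value, actual nextval) three times."""
    seq = SimulatedSequence()
    observed = []
    for _ in range(3):
        predicted = seq.peek_next()
        observed.append((seq.last_value, predicted, seq.nextval()))
    return observed


print(demo())
```

In each tuple the predicted value matches what nextval then returns, while the bare last_value matches only on the very first call. The concurrency caveat above still applies: in a real database another session may call nextval between the peek and the insert.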
When an INSERT statement is executed without a value for id, Postgres automatically takes it from the sequence using nextval. The list of columns in the variable colnames does not include id, so Postgres takes the next value from the sequence. To solve the problem you can add id to colnames.
To avoid any dependencies between your data and your keys, please try:
CREATE SEQUENCE your_sequence
INCREMENT 1
MINVALUE 1
MAXVALUE 9223372036854775807
START 1
CACHE 1;
ALTER SEQUENCE your_sequence
OWNER TO postgres;

Is there any way to match multiple date ranges for inclusion in other multiple ranges in postgresql

For example, I have allowed ranges in the database - (08:00-12:00), (12:00-15:00) - and a requested range I want to test - (09:00-14:00). Is there any way to determine whether my test range is included in the allowed ranges in the database? It can be split into even more parts; I just want to know if my range fully fits within the list of time ranges in the database.
You don't provide the table structure, so I have no idea of the data type. Let's assume those are texts:
t=# select '(8:00, 12:30)' a,'(12:00, 15:00)' b,'(09:00, 14:00)' c;
a | b | c
---------------+----------------+----------------
(8:00, 12:30) | (12:00, 15:00) | (09:00, 14:00)
(1 row)
Then here is how you can do it:
t=# \x
Expanded display is on.
t=# with d(a,b,c) as (values('(8:00, 12:30)','(12:00, 15:00)','(09:00, 14:00)'))
, w as (select '2017-01-01 ' h)
, timerange as (
select
tsrange(concat(w.h,split_part(substr(a,2),',',1))::timestamp,concat(w.h,split_part(a,',',2))::timestamp) ta
, tsrange(concat(w.h,split_part(substr(b,2),',',1))::timestamp,concat(w.h,split_part(b,',',2))::timestamp) tb
, tsrange(concat(w.h,split_part(substr(c,2),',',1))::timestamp,concat(w.h,split_part(c,',',2))::timestamp) tc
from w
join d on true
)
select *, ta + tb glued, tc <@ ta + tb fits from timerange;
-[ RECORD 1 ]----------------------------------------
ta | ["2017-01-01 08:00:00","2017-01-01 12:30:00")
tb | ["2017-01-01 12:00:00","2017-01-01 15:00:00")
tc | ["2017-01-01 09:00:00","2017-01-01 14:00:00")
glued | ["2017-01-01 08:00:00","2017-01-01 15:00:00")
fits | t
First you need to "cast" your times to timestamps, as there is no built-in time range type in Postgres, so we take the same day for all times (w.h = 2017-01-01) and convert a, b, c to ta, tb, tc with the default bounds (lower inclusive, upper exclusive), which fits our case.
Then use the union operator (+) https://www.postgresql.org/docs/current/static/functions-range.html#RANGE-FUNCTIONS-TABLE to get the "glued" interval.
Lastly, check whether the range is contained by the larger one with the <@ operator.
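The same "glue, then check containment" idea can be sketched for any number of allowed ranges. This is a plain Python model, not the Postgres range operators: ranges are assumed to be half-open (start, end) pairs of datetime.time values, and is_covered is a hypothetical name. Unlike the range union operator, which raises an error for ranges that do not touch, this version simply returns False when there is a gap.

```python
from datetime import time


def is_covered(requested, allowed):
    """True if the half-open range `requested` lies inside the union of `allowed`."""
    start, end = requested
    covered_until = start
    for lo, hi in sorted(allowed):
        if lo > covered_until:
            break  # gap before the next allowed range
        covered_until = max(covered_until, hi)
        if covered_until >= end:
            return True
    return covered_until >= end


allowed = [(time(8), time(12)), (time(12), time(15))]
print(is_covered((time(9), time(14)), allowed))
```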