I am working with a very large amount of very simple data (point clouds). I want to insert this data into a simple table in a PostgreSQL database using Python.
An example of the insert statement I need to execute is as follows:
INSERT INTO points_postgis (id_scan, scandist, pt) VALUES (1, 32.656, ST_MakePoint(1.1, 2.2, 3.3));
Note the call to the PostgreSQL function ST_MakePoint in the INSERT statement.
I must call this billions (yes, billions) of times, so obviously I must get the data into PostgreSQL in a more optimized way. There are many strategies to bulk insert the data, as this article presents in a very good and informative way (executemany, copy, etc.):
https://hakibenita.com/fast-load-data-python-postgresql
But no example shows how to do these inserts when you need to call a function on the server side. My question is: how can I bulk INSERT data when I need to call a function on the server side of a PostgreSQL database using psycopg?
Any help is greatly appreciated! Thank you!
Please note that using a CSV doesn't make much sense because my data is huge.
Alternatively, I already tried filling a temp table with simple columns for the 3 inputs of the ST_MakePoint function and then, after all the data is in this temp table, running an INSERT/SELECT. The problem is that this takes a lot of time, and the amount of disk space I need for it is nonsensical.
The most important thing, in order to do this within a reasonable time and with minimum effort, is to break the task down into component parts so that you can take advantage of different Postgres features separately.
Firstly, you will want to create the table minus the geometry transformation, such as:
create table temp_table (
id_scan bigint,
scandist numeric,
pt_1 numeric,
pt_2 numeric,
pt_3 numeric
);
Since we do not add any indexes or constraints, this will most likely be the fastest way to get the "raw" data into the RDBMS.
The best way to do this is with the COPY method, which you can use either from Postgres directly (if you have sufficient access) or via the Python interface, using https://www.psycopg.org/docs/cursor.html#cursor.copy_expert
Here is example code to achieve this:
import psycopg2

iconn_string = "host={0} user={1} dbname={2} password={3} sslmode={4}".format(target_host, target_usr, target_db, target_pw, "require")

iconn = psycopg2.connect(iconn_string)
import_cursor = iconn.cursor()

csv_filename = '/path/to/my_file.csv'
copy_sql = "COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) FROM STDIN WITH CSV HEADER DELIMITER ',' QUOTE '\"' ESCAPE '\\' NULL AS 'null'"

with open(csv_filename, mode='r', encoding='utf-8', errors='ignore') as csv_file:
    import_cursor.copy_expert(copy_sql, csv_file)

iconn.commit()
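Note that copy_expert accepts any readable file-like object, not just a file on disk, so if (as the question says) writing a huge CSV does not make sense, the rows can be streamed from memory in chunks instead. A minimal sketch, where points_chunk is a hypothetical iterable of (id_scan, scandist, x, y, z) tuples:
import io

# Build an in-memory CSV buffer for one chunk of points and COPY it in.
buf = io.StringIO()
for id_scan, scandist, x, y, z in points_chunk:
    buf.write("{0},{1},{2},{3},{4}\n".format(id_scan, scandist, x, y, z))
buf.seek(0)

import_cursor.copy_expert("COPY temp_table (id_scan, scandist, pt_1, pt_2, pt_3) FROM STDIN WITH CSV", buf)
iconn.commit()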
The next step will be to efficiently create the table you want from the existing raw data. You will then be able to create your actual target table with a single SQL statement, and let the RDBMS do its magic.
Once the data is in the RDBMS, it makes sense to optimize it a little and add an index or two if applicable (preferably a primary key or unique index, to speed up the transformation).
This will be dependent on your data / use case, but something like this should help:
alter table temp_table add primary key (id_scan); --if unique
-- or
create index idx_temp_table_1 on temp_table(id_scan); --if not unique
To move data from raw into your target table:
with temp_t as (
select id_scan, scandist, ST_MakePoint(pt_1, pt_2, pt_3) as pt from temp_table
)
INSERT INTO points_postgis (id_scan, scandist, pt)
SELECT temp_t.id_scan, temp_t.scandist, temp_t.pt
FROM temp_t;
This will, in one go, select all the data from the staging table and transform it.
A second, similar option is to load all the raw data into points_postgis directly, keeping the coordinates in 3 temporary columns, then add and populate the geometry column and drop the temporary columns:
alter table points_postgis add column pt geometry;
update points_postgis set pt = ST_MakePoint(pt_1, pt_2, pt_3);
alter table points_postgis drop column pt_1, drop column pt_2, drop column pt_3;
The main takeaway is that the most performant option is not to concentrate on the final table state, but to break the work down into easily achievable chunks. Postgres will easily handle both the import of billions of rows and their transformation afterwards.
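If you would rather avoid the staging table entirely and call ST_MakePoint straight from psycopg2, the batched-insert helper psycopg2.extras.execute_values (psycopg2 >= 2.7) accepts a custom per-row template. A minimal sketch, assuming con/cur are an open connection and cursor and rows is an iterable of (id_scan, scandist, x, y, z) tuples:
from psycopg2.extras import execute_values

rows = [(1, 32.656, 1.1, 2.2, 3.3)]  # illustrative; in practice, feed chunks of the point cloud

execute_values(
    cur,
    "INSERT INTO points_postgis (id_scan, scandist, pt) VALUES %s",
    rows,
    template="(%s, %s, ST_MakePoint(%s, %s, %s))",  # the server-side function goes in the row template
    page_size=10000,                                # rows per INSERT statement
)
con.commit()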
Some simple examples using a function that generates a UPC A barcode with check digit:
Using execute_batch. execute_batch has a page_size argument that allows you to batch the inserts, sending multiple statements per round trip. By default it is set to 100, which will insert 100 rows at a time. You can bump this up to make fewer round trips to the server (see the short sketch right after this list).
Using just execute and selecting data from another table.
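For instance, a minimal hedged sketch of raising page_size (cur and input_list are the cursor and data defined in the full examples below):
execute_batch(cur, 'insert into import_test values(%s, %s, upc_check_digit(%s))', input_list, page_size=1000)  # 1,000 parameter sets per round trip instead of the default 100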
import psycopg2
from psycopg2.extras import execute_batch

con = psycopg2.connect(dbname='test', host='localhost', user='postgres', port=5432)
cur = con.cursor()

cur.execute('create table import_test(id integer, suffix_val varchar, upca_val varchar)')
con.commit()

# Input data as a list of tuples. Means some data is duplicated.
input_list = [(1, '12345', '12345'), (2, '45278', '45278'), (3, '61289', '61289')]

execute_batch(cur, 'insert into import_test values(%s, %s, upc_check_digit(%s))', input_list)
con.commit()
select * from import_test ;
id | suffix_val | upca_val
----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
# Input data as list of dicts and using named parameters to avoid duplicating data.
input_list_dict = [{'id': 50, 'suffix_val': '12345'}, {'id': 51, 'suffix_val': '45278'}, {'id': 52, 'suffix_val': '61289'}]

execute_batch(cur, 'insert into import_test values(%(id)s, %(suffix_val)s, upc_check_digit(%(suffix_val)s))', input_list_dict)
con.commit()
select * from import_test ;
id | suffix_val | upca_val
----+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
50 | 12345 | 744835123458
51 | 45278 | 744835452787
52 | 61289 | 744835612891
# Create a table with values to be used for inserting into final table
cur.execute('create table input_vals (id integer, suffix_val varchar)')
con.commit()

execute_batch(cur, 'insert into input_vals values(%s, %s)', [(100, '76234'), (101, '92348'), (102, '16235')])
con.commit()

cur.execute('insert into import_test select id, suffix_val, upc_check_digit(suffix_val) from input_vals')
con.commit()
select * from import_test ;
id | suffix_val | upca_val
-------+------------+--------------
1 | 12345 | 744835123458
2 | 45278 | 744835452787
3 | 61289 | 744835612891
50 | 12345 | 744835123458
51 | 45278 | 744835452787
52 | 61289 | 744835612891
100 | 76234 | 744835762343
101 | 92348 | 744835923485
102 | 16235 | 744835162358
In a dataset I have, there is a column that contains numbers like 83.420, 43.317, 149.317, ..., and this column is stored as a string. The dot in the numbers doesn't represent a decimal point, i.e., the number 83.420 is basically 83420, etc.
One way to remove this dot from the numbers in this column is to use the TRANSLATE function as follows:
SELECT translate('83.420', '.', '')
which returns 83420. But how can I apply this function to all the rows in the dataset?
I tried this, however, I failed:
SELECT translate(SELECT num_column FROM my_table, '.', '')
I get the error SQL Error [42601]: ERROR: syntax error at end of input.
Any idea how I can apply the translate function to an entire column? Or is there a better approach than translate?
You can even cast the result to an integer like this:
SELECT translate(num_column, '.', '')::integer from the_table;
-- average:
SELECT avg(translate(num_column, '.', '')::integer) from the_table;
or use replace
SELECT replace(num_column, '.', '')::integer from the_table;
-- average:
SELECT avg(replace(num_column, '.', '')::integer) from the_table;
Please note that storing numbers as formatted text is a (very) bad idea. Use a native numeric type instead.
Two options.
Set up table:
create table string_conv(id integer, num_column varchar);
insert into string_conv values (1, '83.420'), (2, '43.317'), (3, '149.317');
select * from string_conv ;
id | num_column
----+------------
1 | 83.420
2 | 43.317
3 | 149.317
First option leave as string field:
update string_conv set num_column = translate(num_column, '.', '');
select * from string_conv ;
id | num_column
----+------------
1 | 83420
2 | 43317
3 | 149317
The above changes the value format in place. It means, though, that if new data comes in with the old format, 'XX.XXX', then those values would have to be converted.
Second option convert to integer column:
truncate string_conv ;
insert into string_conv values (1, '83.420'), (2, '43.317'), (3, '149.317');
alter table string_conv alter COLUMN num_column type integer using translate(num_column, '.', '')::int;
select * from string_conv ;
id | num_column
----+------------
1 | 83420
2 | 43317
3 | 149317
\d string_conv
Table "public.string_conv"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
id | integer | | |
num_column | integer | | |
This option changes the format of the values and changes the type of the column they are stored in. The issue here is that, from then on, new values would need to be compatible with the new type. This would mean changing the input data from 'XX.XXX' to 'XXXXX'.
I'm trying to write a function that is able to find the shortest way between two points using the pgr_dijkstra function. I'm following this guide. With the data provided in the guide everything works fine. But when I try to apply the same steps (build a topology using pgr_createTopology and then test it with pgr_dijkstra) to another data set, pgr_dijkstra returns an empty result. I've also noticed that the guide's data set has a LineString geometry column, while I have a MultiLineString geometry column. What could be the reason?
My table's structure:
Table "public.roads"
Column | Type | Collation | Nullable | Default
--------+--------------------------------+-----------+----------+------------------------------------
id | integer | | not null | nextval('roads_gid_seq'::regclass)
geom | geometry(MultiLineString,4326) | | |
source | integer | | |
target | integer | | |
Indexes:
"roads_pkey" PRIMARY KEY, btree (id)
"roads_geom_idx" gist (geom)
"roads_source_idx" btree (source)
"roads_target_idx" btree (target)
Topology creation query:
SELECT pgr_createTopology('roads', 0.00001, 'geom', 'id');
Shortest way test:
SELECT seq, node, edge, cost as cost, agg_cost, geom
FROM pgr_dijkstra(
'SELECT id, source, target, st_length(geom, true) AS cost FROM roads',
-- Some random points
1, 200
) AS pt
JOIN roads rd ON pt.edge = rd.id;
The problem was actually related to geometry data types. The function doesn't work properly with MultiLineString, though it doesn't produce any errors. So, I've converted MultiLineString to LineString and now everything seems to be OK.
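For reference, one way to do that conversion in place (a hedged sketch using standard PostGIS functions; it assumes every MultiLineString in roads can be merged into a single connected LineString, otherwise ST_LineMerge still returns a MultiLineString and the cast fails):
ALTER TABLE roads
  ALTER COLUMN geom TYPE geometry(LineString, 4326)
  USING ST_LineMerge(geom);   -- merge single-part / connected multilinestrings into linestrings
-- rebuild the topology on the converted geometries
SELECT pgr_createTopology('roads', 0.00001, 'geom', 'id');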
I have a table in Postgres whose primary key is assigned using a sequence (let's call it 'a_seq'). The sequence is for incrementing the value and inserting the current value as the primary key of the record being inserted.
The code I use for the sequence:
CREATE SEQUENCE public.a_seq
INCREMENT 1
START 1
MINVALUE 1
MAXVALUE 9223372036854775807
CACHE 1;
ALTER SEQUENCE public.a_seq OWNER TO postgres;
I am trying to copy a file from a disk and insert the information about the copied file into a table. There are files with the same name on the disk, so I'm retrieving the "last_value" from the sequence with this query:
SELECT last_value FROM a_seq;
and rename the file to "<last_value>_<name>", then insert it into the database so the file name and the primary key (id) of that file are consistent, like:
id | fileName
1 | 1_asd.txt
But when I insert the record, the id is always 1 greater than the "last_value" I get from the query, so the table looks like this:
id | fileName
2 | 1_asd.txt
And I've tried executing the select query above multiple times to check if it increments the value, but it doesn't.
Any idea how to get the value that will be assigned to the record before the insertion?
NOTE: I use MATLAB and this is the code I use for insertion:
colnames = {'DataType' , ...
'FilePath' , ...
'FileName' , ...
'FileVersion' , ...
'CRC32Q' , ...
'InsertionDateTime', ...
'DataSource' };
data = {FileLine{5} ,... % DataType
tempPath ,... % FilePath
FileLine{1} ,... % FileName
FileLine{2} ,... % FileVersion
FileLine{3} ,... % CRC32Q
FileLine{4} ,... % InsertionDateTime
FileLine{6}       };    % DataSource
data_table = cell2table(data, 'VariableNames', colnames);
datainsert(conn , 'CopiedFiles' , colnames , data_table);
updated
What I believe happens for you is: when you select last_value, you get the last used sequence value, and when you insert a row, the default value for id is nextval, which rolls the value one above it...
previous
I believe you have an extra nextval somewhere in a middle step. If you do it in one statement, it works as you expect, e.g.:
t=# create table so12(s int default nextval('s'), t text);
CREATE TABLE
t=# insert into so12(t) select last_value||'_abc.txt' from s;
INSERT 0 1
t=# select * from so12;
s | t
---+-----------
1 | 1_abc.txt
(1 row)
update2
As Nick Barnes noticed, further iterations (after the initial one) will give wrong results, so you need to use his proposed CASE logic.
This is a quirk in the way Postgres implements sequences; as inherently non-transactional objects in a transactional database, they behave a bit strangely.
The first time you call nextval() on a sequence, it will not affect the number you see in a_seq.last_value. However, it will flip the a_seq.is_called flag:
test=# create sequence a_seq;
test=# select last_value, is_called from a_seq;
last_value | is_called
------------+-----------
1 | f
test=# select nextval('a_seq');
nextval
---------
1
test=# select last_value, is_called from a_seq;
last_value | is_called
------------+-----------
1 | t
So if you need the next value in the sequence, you'd want something like
SELECT
last_value + CASE WHEN is_called THEN 1 ELSE 0 END
FROM a_seq
Note that this is horribly broken if two processes are doing this concurrently, as there's no guarantee you'll actually receive this value from your next nextval() call. In that case, if you really need the filename to match the id, you'd need to either generate it with a trigger, or UPDATE it once you know what the id is.
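For example, a sketch of the UPDATE-once-you-know-the-id approach (table and column names follow the question and are otherwise illustrative): let the sequence assign the id, read it back with RETURNING, rename the file on disk, then store the final name:
INSERT INTO "CopiedFiles" ("FileName", "FilePath")
VALUES ('asd.txt', '/some/path')
RETURNING id;                     -- suppose this returns 2

-- rename the file on disk to 2_asd.txt, then:
UPDATE "CopiedFiles"
SET "FileName" = id || '_asd.txt'
WHERE id = 2;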
But in my experience, it's best to avoid any dependencies between your data and your keys. If all you need is a unique filename, I'd just create an independent filename_seq.
When an INSERT statement is executed without a value for id, Postgres automatically takes it from the sequence using nextval. The list of columns in the variable colnames does not have an id, so PG takes the next value from the sequence. To solve the problem you may add id to colnames.
To avoid any dependencies between your data and your keys, please try:
CREATE SEQUENCE your_sequence
INCREMENT 1
MINVALUE 1
MAXVALUE 9223372036854775807
START 1
CACHE 1;
ALTER SEQUENCE your_sequence
OWNER TO postgres;
My table contains an integer column (gid) which is nullable:
gid | value
-------------
0 | a
| b
1 | c
2 | d
| e
Now I would like to change the gid column into a SERIAL primary key column. That means filling up the empty slots with new integers. The existing integers must remain in place. So the result should look like:
gid | value
-------------
0 | a
3 | b
1 | c
2 | d
4 | e
I just can't figure out the right SQL command for doing the transformation. Code sample would be appreciated...
A serial is "just" a column that takes its default value from a sequence.
Assuming your table is named n1000 then the following will do what you want.
The first thing you need to do is to create that sequence:
create sequence n1000_gid_seq;
Then you need to make that the "default" for the column:
alter table n1000 alter column gid set default nextval('n1000_gid_seq');
To truly create a "serial" you also need to tell the sequence that it is associated with the column:
alter sequence n1000_gid_seq owned by n1000.gid;
Then you need to advance the sequence so that the next value doesn't collide with the existing values:
select setval('n1000_gid_seq', (select max(gid) from n1000), true);
And finally you need to update the missing values in the table:
update n1000
set gid = nextval('n1000_gid_seq')
where gid is null;
Once this is done, you can define the column as the PK:
alter table n1000
add constraint pk_n1000
primary key (gid);
And of course if you have turned off autocommit you need to commit all this.
I have a performance problem when I'm trying to create a temporary table. The following code is part of a plpgsql function:
StartTime := clock_timestamp();
CREATE TEMP TABLE wo_tmp WITH (OIDS) AS
SELECT workorders1_.woid AS w_id, workorders1_.woid4seg AS w_id4seg
FROM common.workorders workorders1_
INNER JOIN common.lines lines2_ ON workorders1_.wolineid=lines2_.lineid
INNER JOIN common.products products2_ ON workorders1_.woprodid=products2_.prodid
INNER JOIN common.depts depts3_ ON lines2_.linedeptid=depts3_.deptid
WHERE workorders1_.wostatus='F'
AND workorders1_.wotypestatus = ANY ('{R,C,I,D}'::text[])
AND (p_deptid = 0 OR (depts3_.deptid = p_deptid AND ((p_deptid = 5 AND workorders1_.wosegid = 1) OR workorders1_.wosegid = 4)))
AND (p_lineid = 0 OR lines2_.lineid = p_lineid)
AND (p_prodid = 0 OR products2_.prodid = p_prodid)
AND (p_nrkokili = 0 OR workorders1_.wonrkokili = p_nrkokili)
AND (p_accepted = TRUE OR workorders1_.worjacceptstatus = 'Y')
AND workorders1_.wodateleaverr BETWEEN p_dfr AND p_dto
AND lines2_.status <> 'D';
CREATE INDEX wo_tmp_w_id_idx
ON wo_tmp USING btree (w_id ASC NULLS LAST);
CREATE INDEX wo_tmp_w_id4seg_idx
ON wo_tmp USING btree (w_id4seg ASC NULLS LAST);
EndTime := clock_timestamp();
Delta := extract('epoch' from EndTime)::bigint - extract('epoch' from StartTime)::bigint;
RAISE NOTICE 'Duration [0] in seconds=%', Delta;
Here's an explain analyze report: http://explain.depesz.com/s/uerF
It's strange, because when I execute this function, I get the notice: Duration [0] in seconds=11. I checked the query without creating the temp table and the run time is ~300ms.
Is it possible that inserting records (~73k) into a temporary table takes 11 seconds? Can I speed it up?
When you fill a temp table inside a function, you can run into more than one issue:
Locking issues - every temp table is a table with some entries in the system catalog. Intensively creating and dropping these tables creates high overhead with a lot of locking. Sometimes temp tables can be replaced by arrays. That is not your case, because you need the indexes.
Blind optimization - embedded SQL in PL/pgSQL functions is planned for the most common values (this mechanism was slightly enhanced in PostgreSQL 9.2, but still with possible performance issues). It is not planned for the current parameter values, and this fact can cause performance issues. Then dynamic SQL is necessary (a minimal EXECUTE sketch follows the timing session below). Some links on this issue (one and second)
Some hw or file system issues - I am a little bit confused that WITHOUT OIDS helps. It looks like your file system is a terrible bottleneck for you. Temp tables are stored in the file system cache - storing 73K rows there should be fast; removing four bytes (from 35) is not too big a change.
postgres=# create table t1 with (oids) as select 1 a,2 b,3 c from generate_series(1,73000);
SELECT 73000
Time: 302.083 ms
postgres=# create table t2 as select 1 a,2 b,3 c from generate_series(1,73000);
SELECT 73000
Time: 267.459 ms
postgres=# create temp table t3 with (oids) as select 1 a,2 b,3 c from generate_series(1,73000);
SELECT 73000
Time: 154.431 ms
postgres=# create temp table t4 as select 1 a,2 b,3 c from generate_series(1,73000);
SELECT 73000
Time: 153.085 ms
postgres=# \dt+ t*
List of relations
Schema | Name | Type | Owner | Size | Description
-----------+------+-------+-------+---------+-------------
pg_temp_2 | t3 | table | pavel | 3720 kB |
pg_temp_2 | t4 | table | pavel | 3160 kB |
public | t1 | table | pavel | 3720 kB |
public | t2 | table | pavel | 3160 kB |
(4 rows)
Writing a 3MB file to the file system should take significantly less than 1 sec, so the 11 sec overhead is strange. P.S. The default temp_buffers is 8MB, so your result should be stored in memory only - so this hypothesis is probably false, and the blind optimization hypothesis is more probable.
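To illustrate the dynamic SQL point above, a minimal sketch (heavily trimmed; only a couple of the original predicates are kept) of replacing the embedded statement with EXECUTE so that it is planned with the actual parameter values. format() with %L is used because CREATE TABLE AS is a utility statement and cannot take EXECUTE ... USING parameters:
EXECUTE format(
    'CREATE TEMP TABLE wo_tmp AS
     SELECT w.woid AS w_id, w.woid4seg AS w_id4seg
     FROM common.workorders w
     WHERE w.wostatus = %L
       AND w.wodateleaverr BETWEEN %L AND %L',
    'F', p_dfr, p_dto);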
For starters, don't use WITH (OIDS) for temporary tables. Ever. Use of OIDs in regular tables is discouraged; that goes doubly for temp tables. It also reduces the required RAM / space on disk, which is probably the main bottleneck here. Switch to WITHOUT OIDS.
Next, a likely cause (educated guess) is a lack of temp buffers, which forces the temp table to spill to disk. Check the actual size of the temp table with
SELECT pg_size_pretty(pg_relation_size('wo_tmp'));
And set temp_buffers accordingly, possibly for the session only - round up generously, enough to avoid writing to disk.
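For example (the value is illustrative; note that temp_buffers only takes effect if it is set before the first use of temporary tables in the session):
SET temp_buffers = '64MB';   -- session-level; run it before touching any temp table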
Details:
How to delete duplicate entries?