fastest way of inserting data into a table - postgresql

I have a Postgres database, and I have inserted some data into the table. Because of issues with the internet connection, some of the data couldn't be written.The file that I am trying to write into the database is large (about 330712484 rows - even the ws -l command takes a while to complete.
Now, the column row_id is the (integer) primary key, and is already indexed. Since some of the rows could not be inserted into the table, I wanted to insert these specific rows into the table. (I estimate only about 1.8% of the data isn't inserted into the table ...) As a beginning, I tried to see of the primary keys were inside the database like so:
conn = psycopg2.connect(connector)
cur = conn.cursor()
with open(fileName) as f:
header = f.readline().strip()
header = list(csv.reader([header]))[0]
print(header)
for i, l in enumerate(f):
if i>10: break
print(l.strip())
row_id = l.split(',')[0]
query = 'select * from raw_data.chartevents where row_id={}'.format(row_id)
cur.execute(query)
print(cur.fetchall())
cur.close()
conn.close()
Even for the first few rows of data, checking to see whether the primary key exists takes a really large amount of time.
What would be the fastest way of doing this?

The fastest way to insert data in PostgreSQL is using the COPY protocol, which is implemented in psycopg2. COPY will not allow you to check if target id already exists, tho. Best option is to COPY your file content's into a temporary table then INSERT or UPDATE from this, as in the Batch Update article I wrote on my http://tapoueh.org blog a while ago.
With a recent enough version of PostgreSQL you may use
INSERT INTO ...
SELECT * FROM copy_target_table
ON CONFICT (pkey_name) DO NOTHING

Can i offer a work around. ?
The index will be checked for each row inserted, also Postgres performs the whole insert in a single transaction so you are effectively storing all this data to disk before its being written.
Could i suggest you drop the indexes to avoid this slow down, then split the file into smaller files using head -n [int] > newfile or something similar. then performing the copy commands separately for each one.

Related

Postgres Upsert vs Truncate and Insert

I have a stream of data that I can replay any time to reload data into a Postgres table. Lets say I have millions of rows in my table and I add a new column. Now I can replay that stream of data to map a key in the data to the column name that I have just added.
The two options I have are:
1) Truncate and then Insert
2) Upsert
Which would be a better option in terms of performance?
The way PostgreSQL does multiversioning, every update creates a new row version. The old row version will have to be reclaimed later.
This means extra work and tables with a lot of empty space in them.
On the other hand, TRUNCATE just throws away the old table, which is very fast.
You can gain extra performance by using COPY instead of INSERT to load bigger amounts of data.

Postgres parallel/efficient load huge amount of data psycopg

I want to load many rows from a CSV file.
The file​s​ contain​ data like these​ "article​_name​,​article_time,​start_time,​end_time"
There is a contraint on the table: for the same article name, i don't insert a new row if the new ​article_time falls in an existing range​ [start_time,​end_time]​ for the same article.
ie: don't insert row y if exists [​start_time_x,​end_time_x] for which time_article_y inside range [​start_time_x,​end_time_x] , with article_​name_​y = article_​name_​x
I tried ​with psycopg by selecting the existing article names ad checking manually if there is an overlap --> too long
I tried again with psycopg, this time by setting a condition 'exclude using...' and tryig to insert with specifying "on conflict do nothing" (so that it does not fail) but still too long
I tried the same thing but this time trying to insert many values at each call of execute (psycopg): it got a little better (1M rows processed in almost 10minutes)​, but still not as fast as it needs to be for the amount of data ​I have (500M+)
I tried to parallelize by calling the same script many time, on different files but the timing didn't get any better, I guess because of the locks on the table each time we want to write something
Is there any way to create a lock only on rows containing the same article_name? (and not a lock on the whole table?)
Could you please help with any idea to make this parallellizable and/or more time efficient?
​Lots of thanks folks​
Your idea with the exclusion constraint and INSERT ... ON CONFLICT is good.
You could improve the speed as follows:
Do it all in a single transaction.
Like Vao Tsun suggested, maybe COPY the data into a staging table first and do it all with a single SQL statement.
Remove all indexes except the exclusion constraint from the table where you modify data and re-create them when you are done.
Speed up insertion by disabling autovacuum and raising max_wal_size (or checkpoint_segments on older PostgreSQL versions) while you load the data.

Efficient way to move large number of rows from one table to another new table using postgres

I am using PostgreSQL database for live project. In which, I have one table with 8 columns.
This table contains millions of rows, so to make search faster from table, I want to delete and store old entries from this table to new another table.
To do so, I know one approach:
first select some rows
create new table
store this rows in that table
than delete from main table.
But it takes too much time and it is not efficient.
So I want to know what is the best possible approach to perform this in postgresql database?
Postgresql version: 9.4.2.
Approx number of rows: 8000000
I want to move rows: 2000000
You can use CTE (common table expressions) to move rows in a single SQL statement (more in the documentation):
with delta as (
delete from one_table where ...
returning *
)
insert into another_table
select * from delta;
But think carefully whether you actually need it. Like a_horse_with_no_name said in the comment, tuning your queries might be enough.
This is a sample code for copying data between two table of same.
Here i used different DB, one is my production DB and other is my testing DB
INSERT INTO "Table2"
select * from dblink('dbname=DB1 dbname=DB2 user=postgres password=root',
'select "col1","Col2" from "Table1"')
as t1(a character varying,b character varying);

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar db? Or it depends if column is nullable / has default value?
I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a feature of the it being a columnar database so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.
I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember you can only add a column to the end (right) of the table, you have to use temporary tables etc if you want to insert a column somewhere in the middle.
Personally I have seen rebuilding the table works best.
I do it in following ways
Create a new table N_OLD_TABLE table
Define the datatype/compression encoding in the new table
Insert data into N_OLD(old_columns) select(old_columns) from old_table Rename OLD_Table to OLD_TABLE_BKP
Rename N_OLD_TABLE to OLD_TABLE
This is a much faster process. Doesn't block any table and you always have a backup of old table incase anything goes wrong

PostgreSQL bulk insert with ActiveRecord

I've a lot of records that are originally from MySQL. I massaged the data so it will be successfully inserted into PostgreSQL using ActiveRecord. This I can easily do with insertions on row basis i.e one row at a time. This is very slow I want to do bulk insert but this fails if any of the rows contains invalid data. Is there anyway I can achieve bulk insert and only the invalid rows failing instead of the whole bulk?
COPY
When using SQL COPY for bulk insert (or its equivalent \copy in the psql client), failure is not an option. COPY cannot skip illegal lines. You have to match your input format to the table you import to.
If data itself (not decorators) is violating your table definition, there are ways to make this a lot more tolerant though. For instance: create a temporary staging table with all columns of type text. COPY to it, then fix offending rows with SQL commands before converting to the actual data type and inserting into the actual target table.
Consider this related answer:
How to bulk insert only new rows in PostreSQL
Or this more advanced case:
"ERROR: extra data after last expected column" when using PostgreSQL COPY
If NULL values are offending, remove the NOT NULL constraint from your target table temporarily. Fix the rows after COPY, then reinstate the constraint. Or take the route with the staging table, if you cannot afford to soften your rules temporarily.
Sample code:
ALTER TABLE tbl ALTER COLUMN col DROP NOT NULL;
COPY ...
-- repair, like ..
-- UPDATE tbl SET col = 0 WHERE col IS NULL;
ALTER TABLE tbl ALTER COLUMN col SET NOT NULL;
Or you just fix the source table. COPY tells you the number of the offending line. Use an editor of your preference and fix it, then retry. I like to use vim for that.
INSERT
For an INSERT (like commented) the check for NULL values is trivial:
To skip a row with a NULL value:
INSERT INTO (col1, ...
SELECT col1, ...
WHERE col1 IS NOT NULL
To insert sth. else instead of a NULL value (empty string in my example):
INSERT INTO (col1, ...
SELECT COALESCE(col1, ''), ...
A common work-around for this is to import the data into a TEMPORARY or UNLOGGED table with no constraints and, where data in the input is sufficiently bogus, text typed columns.
You can then do INSERT INTO ... SELECT queries against the data to populate the real table with a big query that cleans up the data during import. You can use a lot of CASE statements for this. The idea is to transform the data in one pass.
You might be able to do many of the fixes in Ruby as you read the data in, then push the data to PostgreSQL using COPY ... FROM STDIN. This is possible with Ruby's Pg gem, see eg https://bitbucket.org/ged/ruby-pg/src/tip/sample/copyfrom.rb .
For more complicated cases, look at Pentaho Kettle or Talend Studio ETL tools.