What is the fastest way to bulk load a file (created by Postgres) into an in-memory SQLite database using Python 3?

I have a large dataset I want to load into a SQLite in-memory database. I plan on loading the data from a file exported from Postgres. What file format and load mechanism is the fastest?
Currently I'm considering the following two options:
Importing a CSV file (copy). Reference.
Running a SQL file (pg_dump) with INSERT statements using a single transaction. Reference.
Which is faster? Or is there perhaps a third, even faster option?
This will be done as part of a Python 3 script. Does that affect the choice?
If nobody has any experience with this, I'll make sure to post benchmarks as an answer later.
Edit: This question has gotten a downvote. From the comments it seems this is due to the lack of benchmarking. If not, please let me know how to improve this question. I definitely don't expect anybody to perform benchmarking for me. I'm simply hoping that someone has prior experience with bulk loading into SQLite.

Turns out there is no performant way to do this with pg_dump and INSERT statements. With both the CSV and the pg_dump strategies we end up inserting rows from the source file one at a time. We're going with the CSV method, loading 10,000 rows per batch using executemany.
import sqlite3
from datetime import datetime
import csv

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

create_query = """
CREATE VIRTUAL TABLE my_table USING fts4(
    id INTEGER,
    my_field TEXT
);
"""
cur.execute(create_query)

csv.field_size_limit(2147483647)

from_time = datetime.now()
with open('test.csv', 'r', encoding="utf8") as file:
    csv_file = csv.reader(file)
    header = next(csv_file)
    query_template = """
        INSERT INTO my_table (id, my_field)
        VALUES (?, ?);
    """
    for batch in split_iterable_by_size(csv_file, 10000):
        cur.executemany(query_template, batch)
        conn.commit()
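The split_iterable_by_size helper isn't shown above; a minimal sketch of what it could look like, assuming it simply yields lists of up to batch_size rows from any iterable:
from itertools import islice

def split_iterable_by_size(iterable, batch_size):
    # Yield successive lists of up to batch_size items from any iterable.
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch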
On our system and dataset this took 2 hours and 30 minutes. We did not test the alternative.

Related

Bulk Inserts/Updates Are Slow Using On Conflict with JSONB

I'm doing a bulk insert from a giant CSV file that needs to be turned into both a relational row and a JSONB object in the same table at the same time. The problem is that each row needs either an insert or an update; if it's an update, the JSON object has to be appended to the existing row's column. The current setup I have issues individual INSERT/UPDATE calls and, of course, it's horribly slow.
Example Import Command I'm Running:
INSERT INTO "trade" ("id_asset", "trade_data", "year", "month") VALUES ('1925ad09-51e9-4de4-a506-9bccb7361297', '{"28":{"open":2.89,"high":2.89,"low":2.89,"close":2.89}}', 2017, 3) ON CONFLICT ("year", "month", "id_asset") DO
UPDATE SET "trade_data" = "trade"."trade_data" || '{"28":{"open":2.89,"high":2.89,"low":2.89,"close":2.89}}' WHERE "trade"."id_asset" = '1925ad09-51e9-4de4-a506-9bccb7361297' AND "trade"."year" = 2017 AND "trade"."month" = 3;
I've tried wrapping my script in BEGIN and COMMIT, but it didn't improve performance at all. I also tried a few configuration settings, but they didn't seem to help:
\set autocommit off;
set schema 'market';
\set fsync off;
\set full_page_writes off;
SET synchronous_commit TO off;
\i prices.sql
This whole thing is extremely slow, and I'm not sure how to rewrite the query so that my program can efficiently emit a large INSERT/UPDATE command for Postgres to read without loading a huge amount of data into RAM, since the related data could be a million rows or live in another file altogether, and the JSON has to be generated without losing the JSON data that's already in the database.
I simply moved my large SQL file onto the Postgres server with scp and re-ran the commands inside psql, and now the migration is much faster.
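Another way to attack the per-statement overhead is to batch many of these upserts into a single statement from the loading program. A minimal sketch with psycopg2's execute_values, assuming the same trade table and a rows list of (id_asset, trade_data_json, year, month) tuples parsed from the CSV (connection string and variable names are illustrative):
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

# Tuples built from the CSV; only one shown here, taken from the question.
rows = [
    ('1925ad09-51e9-4de4-a506-9bccb7361297',
     '{"28":{"open":2.89,"high":2.89,"low":2.89,"close":2.89}}',
     2017, 3),
]

# EXCLUDED refers to the incoming row, so the same statement works for every batch.
upsert = """
    INSERT INTO trade (id_asset, trade_data, year, month)
    VALUES %s
    ON CONFLICT (year, month, id_asset)
    DO UPDATE SET trade_data = trade.trade_data || EXCLUDED.trade_data
"""
execute_values(cur, upsert, rows, page_size=1000)
conn.commit()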

How to load bulk data to table from table using query as quick as possible? (postgresql)

I have a large table (postgre_a) which has 0.1 billion records and 100 columns. I want to duplicate this data into the same table.
I tried to do this using SQL:
INSERT INTO postgre_a SELECT i1 + 100000000, i2, ... FROM postgre_a;
However, this query has been running for more than 10 hours now, so I want to do this faster. I tried COPY, but I cannot find a way to use COPY FROM with a query.
Is there any other method that can do this faster?
You cannot directly use a query in COPY FROM, but maybe you can use COPY FROM PROGRAM with a query to do what you want:
COPY postgre_a
FROM PROGRAM '/usr/pgsql-10/bin/psql -d test'
' -c ''copy (SELECT i1+ 100000000, i2, ... FROM postgre_a) TO STDOUT''';
(Of course you have to replace the path to psql and the database name with your values.)
I am not sure if that is faster than using INSERT, but it is worth a try.
You should definitely drop all indexes and constraints before the operation and recreate them afterwards.
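If it is convenient, that drop/reload/recreate sequence can also be scripted; a minimal sketch with psycopg2, where the connection string, index name, and column are illustrative and the bulk statement itself is the one from the answer above:
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Illustrative index name; list the real indexes and constraints with \d postgre_a in psql.
cur.execute("DROP INDEX IF EXISTS postgre_a_i1_idx;")

# ... run the bulk INSERT or the COPY ... FROM PROGRAM statement from above here ...

# Recreate the index once the data is loaded.
cur.execute("CREATE INDEX postgre_a_i1_idx ON postgre_a (i1);")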

Better way than HDF5 -> Pandas -> PostgreSQL

I have 51 massive HDF5 tables, each with enough (well behaved) data that I cannot load even one of them completely into memory. To make life easier for the rest of my team I need to transfer this data into a PostgreSQL database (and delete the HDF5 tables). However, this is easier said than done, mainly because of these hurdles:
pandas.read_hdf() still has a wonky chunksize kwarg: SO Question; Open github issue
pandas.DataFrame.to_sql() is monumentally slow and inefficient: Open github issue (see my post at the bottom of the issue page)
PostgreSQL does not have a native or third party data wrapper to deal with HDF5: PostgreSQL wiki article
HDF5 ODBC driver is still nascent: HDF5 ODBC blog
Basically, going from HDF5 -> Pandas -> PostgreSQL will require surmounting hurdles 1 and 2 with extensive monkey patching, and there seems to be no way to go from HDF5 to PostgreSQL directly. Unless I am missing something.
Perhaps one of you fine users can hint at something I am missing, some patchwork you created to surmount a similar issue that would help my cause, or any suggestions or advice...
You could convert to CSV with something like the following:
import csv
import h5py

with h5py.File('input.hdf5') as hdf5file:
    with open('output.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        for row in hdf5file['__data__']['table']:
            writer.writerow(row)
And then import into Postgres with psql:
create table mytable (col1 bigint, col2 float, col3 float);
\copy mytable from 'output.csv' CSV
Depending on the complexity of your data, you could probably do something clever to get the schema out of the hdf5 file and use that to make the CREATE TABLE statement.
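For instance, a minimal sketch of that schema extraction, assuming the same '__data__'/'table' dataset as above with a structured (named-field) dtype; the dtype-to-Postgres type map is illustrative, not exhaustive:
import h5py

# Illustrative mapping from NumPy dtype kinds to Postgres column types.
PG_TYPES = {'i': 'bigint', 'u': 'bigint', 'f': 'double precision', 'S': 'text', 'U': 'text'}

def create_table_sql(path, table_name='mytable'):
    # Read the structured dtype of the dataset and turn each field into a column.
    with h5py.File(path, 'r') as hdf5file:
        dtype = hdf5file['__data__']['table'].dtype
    columns = ', '.join(
        '{} {}'.format(name, PG_TYPES.get(dtype[name].kind, 'text'))
        for name in dtype.names
    )
    return 'CREATE TABLE {} ({});'.format(table_name, columns)

print(create_table_sql('input.hdf5'))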
Alternatively, you could try writing your own INSERT statements in your Python script; this will probably be slower than using COPY but could be a simpler solution:
import h5py
import psycopg2
from itertools import islice

with h5py.File('input.hdf5') as hdf5file:
    with psycopg2.connect("dbname=mydb user=postgres") as conn:
        cur = conn.cursor()
        chunksize = 50
        t = iter(hdf5file['__data__']['table'])
        while True:
            rows = list(islice(t, chunksize))
            if not rows:
                break
            # Build one multi-row INSERT per chunk; the three placeholders match the
            # (col1, col2, col3) table created above, and .item() converts NumPy
            # scalars to plain Python types that psycopg2 can adapt.
            values = ','.join(
                cur.mogrify("(%s,%s,%s)", tuple(v.item() for v in row)).decode()
                for row in rows
            )
            cur.execute("INSERT INTO mytable VALUES " + values)
        conn.commit()

Command to read a file and execute script with psql

I am using PostgreSQL 9.0.3. I have an Excel spreadsheet with lots of data to load into a couple of tables on Windows.
I have written a script to get the data from the input file and insert it into some 15 tables. This can't be done with COPY or import. I named the input file DATALD.
I found the psql option -d to point at the database and -f to run the SQL script, but I need to know how to feed the input file to the script so that the data gets inserted into the tables.
For example, this is what I have done:
begin
    for emp in (select distinct w_name from DATALD where w_name <> 'w_name')
    -- insert in a loop
    INSERT INTO tblemployer(id_employer, employer_name, date_created, created_by)
    VALUES (employer_id, emp.w_name, now(), 'SYSTEM1');
Can someone please help?
For an SQL script you must either
have the data inlined in your script (in the same file), or
utilize COPY to import the data into Postgres.
I suggest you use a temporary staging table, since the format doesn't seem to fit the target tables directly. Code example:
How to bulk insert only new rows in PostreSQL
There are other options like pg_read_file(). But:
Use of these functions is restricted to superusers.
Intended for special purposes.
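To make the staging-table route concrete, here is a minimal sketch driven from Python with psycopg2, assuming the spreadsheet has been exported to DATALD.csv, that id_employer has a default/sequence, and that the staging columns shown beyond w_name are illustrative:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

# 1. Load the raw export into a temporary staging table with COPY.
cur.execute("CREATE TEMP TABLE datald (w_name text, w_address text);")
with open('DATALD.csv', 'r') as f:
    cur.copy_expert("COPY datald FROM STDIN WITH (FORMAT csv, HEADER)", f)

# 2. Fan the staged rows out into the real tables with set-based SQL.
cur.execute("""
    INSERT INTO tblemployer (employer_name, date_created, created_by)
    SELECT DISTINCT w_name, now(), 'SYSTEM1'
    FROM datald;
""")
conn.commit()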

syntax for COPY in postgresql

INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 67544
FROM plain_contacts
Here I want to use the COPY command instead of INSERT to reduce the time it takes to insert values. I fetched the data using a SELECT operation. How can I insert it into a table using the COPY command in PostgreSQL? Could you please give an example? Or any other suggestion for reducing the time it takes to insert the values.
As your rows are already in the database (because you apparently can SELECT them), using COPY will not increase the speed in any way.
To be able to use COPY you would first have to write the values into a text file, which is then read into the database. But if you can SELECT them, writing to a text file is a completely unnecessary step and will slow down your insert, not speed it up.
Your statement is as fast as it gets. The only thing that might speed it up (apart from buying a faster hard disk) is to remove any index on contacts_lists that contains the column contact_id or list_id and re-create the index once the insert is finished.
You can find the syntax described in many places, I'm sure. One of those is this wiki article.
It looks like it would basically be:
COPY (SELECT contact_id, 67544 FROM plain_contacts) TO 'some_file'
And
COPY contacts_lists (contact_id, list_id) FROM 'some_file'
But I'm just reading from the resources that Google turned up. Give it a try and post back if you need help with a specific problem.
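If you do want to try the COPY round trip described just above, here is a minimal sketch from Python with psycopg2's copy_expert, using an in-memory buffer instead of a file on disk (connection details are illustrative):
import io
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")
cur = conn.cursor()

# Export the query result with COPY TO into an in-memory buffer ...
buf = io.StringIO()
cur.copy_expert("COPY (SELECT contact_id, 67544 FROM plain_contacts) TO STDOUT", buf)
buf.seek(0)

# ... and load it straight back with COPY FROM.
cur.copy_expert("COPY contacts_lists (contact_id, list_id) FROM STDIN", buf)
conn.commit()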