I'm doing a bulk insert from a giant CSV file that has to populate both relational columns and a JSONB column in the same table. The problem is that each row needs either an insert or an update, and if it's an update, the JSON object has to be appended to the existing JSONB value in that row. My current setup issues individual INSERT/UPDATE calls and, of course, it's horribly slow.
Example Import Command I'm Running:
INSERT INTO "trade" ("id_asset", "trade_data", "year", "month")
VALUES ('1925ad09-51e9-4de4-a506-9bccb7361297',
        '{"28":{"open":2.89,"high":2.89,"low":2.89,"close":2.89}}',
        2017, 3)
ON CONFLICT ("year", "month", "id_asset") DO UPDATE
SET "trade_data" = "trade"."trade_data" || '{"28":{"open":2.89,"high":2.89,"low":2.89,"close":2.89}}'
WHERE "trade"."id_asset" = '1925ad09-51e9-4de4-a506-9bccb7361297'
  AND "trade"."year" = 2017
  AND "trade"."month" = 3;
I've tried wrapping my script in BEGIN and COMMIT, but that didn't improve performance at all, and I tried a few configuration changes, but they didn't seem to help either:
\set autocommit off;
set schema 'market';
\set fsync off;
\set full_page_writes off;
SET synchronous_commit TO off;
\i prices.sql
This whole thing is extremely slow, and I'm not sure how to rewrite the query without my program loading a huge amount of data into RAM just to emit one large INSERT/UPDATE command for Postgres to read, since the related data could be a million rows away or in another file entirely, and I need to build the JSON without losing the JSON data already in the database.
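One set-based idea I'm considering is to COPY everything into a staging table and run a single upsert instead of one statement per row. This is only an untested sketch: the trade_staging table, the column types, and the prices.csv file name are my guesses, and it assumes each (id_asset, year, month) appears at most once per load (otherwise the per-month objects would have to be pre-merged in the SELECT).

-- Hypothetical staging table mirroring the target columns (types are guesses).
CREATE TEMP TABLE trade_staging (
    id_asset   uuid,
    trade_data jsonb,
    year       int,
    month      int
);

-- Bulk-load the prepared rows in one shot instead of row-by-row statements.
\copy trade_staging FROM 'prices.csv' WITH (FORMAT csv)

-- One set-based upsert; || merges the new JSONB object into the existing one.
INSERT INTO "trade" ("id_asset", "trade_data", "year", "month")
SELECT id_asset, trade_data, year, month
FROM   trade_staging
ON CONFLICT ("year", "month", "id_asset")
DO UPDATE SET "trade_data" = "trade"."trade_data" || EXCLUDED."trade_data";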
Simply moving my large SQL file onto the Postgres server with scp and re-running the commands inside psql there made the migration dramatically faster.
Related
I want to read a number of rows from a CSV file every few minutes. Is there a way to keep track of which row was inserted last time, and start the next insertion from that specific row?
In SQL Server I know this is possible using the BULK INSERT command, but I don't know how I could do it in PostgreSQL.
I tried the COPY command and the timescaledb-parallel-copy command, but with the latter I could only limit the number of rows I want to insert.
I am trying to do this to compare bulk insert performance over time between SQL Server and a time-series database.
timescaledb-parallel-copy --db-name test --table 'test_table' --file weather_big_conditions.csv --connection "host=localhost port=5432 user=postgres password=postgres sslmode=disable" -limit 2000000
Reading the code here, I think it's possible to combine --skip-head=true and --header-line-count=N, where N is a number that works as an offset, skipping the lines you want from the start of the file. You can see that the code is prepared for it.
I haven't tested it, but you can give it a try.
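For example, taking the command from the question and adding those two flags (untested; the flag names are the ones described above, and the line count is only an illustrative offset):

timescaledb-parallel-copy --db-name test --table 'test_table' --file weather_big_conditions.csv --connection "host=localhost port=5432 user=postgres password=postgres sslmode=disable" --skip-head=true --header-line-count=2000000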
I have a large table (postgre_a) which has 0.1 billion records and 100 columns. I want to duplicate this data into the same table.
I tried to do this with SQL:
INSERT INTO postgre_a select i1 + 100000000, i2, ... FROM postgre_a;
However, this query has been running for more than 10 hours now, so I want to do it faster. I tried to do it with COPY, but I cannot find a way to use COPY FROM with a query.
Is there any other method that can do this faster?
You cannot directly use a query in COPY FROM, but maybe you can use COPY FROM PROGRAM with a query to do what you want:
COPY postgre_a
FROM PROGRAM '/usr/pgsql-10/bin/psql -d test'
' -c ''copy (SELECT i1+ 100000000, i2, ... FROM postgre_a) TO STDOUT''';
(Of course you have to replace the path to psql and the database name with your values.)
I am not sure if that is faster than using INSERT, but it is worth a try.
You should definitely drop all indexes and constraints before the operation and recreate them afterwards.
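As a sketch of that last point (the index name and column are hypothetical), the surrounding steps could look like this:

-- Drop secondary indexes before the bulk load (name is hypothetical).
DROP INDEX IF EXISTS postgre_a_i2_idx;

-- ... run the COPY ... FROM PROGRAM (or the INSERT ... SELECT) here ...

-- Recreate the index once the load has finished.
CREATE INDEX postgre_a_i2_idx ON postgre_a (i2);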
I have a large dataset I want to load into a SQLite in-memory database. I plan on loading the data from a file exported from Postgres. What file format and load mechanism is the fastest?
Currently I'm considering the following two options:
Importing a CSV file (copy). Reference.
Running a SQL file (pg_dump) with INSERT statements using a single transaction. Reference.
Which is faster? Is there a third faster option, maybe?
This will be done as part of a Python 3 script. Does that affect the choice?
If nobody has any experience with this, I'll make sure to post benchmarks as an answer later.
Edit: This question has gotten a downvote. From the comments it seems this is due to the lack of benchmarking. If not, please let me know how to improve this question. I definitely don't expect anybody to perform benchmarking for me. I'm simply hoping that someone has prior experience with bulk loading into SQLite.
It turns out there is no great way to do this performantly using pg_dump and INSERT statements. We end up inserting line by line from the source file with both the CSV and the pg_dump strategies. We're going with the CSV method, loading 10,000 rows per batch using executemany.
import csv
import sqlite3
from datetime import datetime


def split_iterable_by_size(iterable, batch_size):
    # Yield lists of at most batch_size rows from the iterable.
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


conn = sqlite3.connect(":memory:")
cur = conn.cursor()

create_query = """
CREATE VIRTUAL TABLE my_table USING fts4(
    id INTEGER,
    my_field TEXT
);
"""
cur.execute(create_query)

csv.field_size_limit(2147483647)
from_time = datetime.now()

with open('test.csv', 'r', encoding="utf8") as file:
    csv_file = csv.reader(file)
    header = next(csv_file)  # skip the header row
    query_template = """
        INSERT INTO my_table (id, my_field)
        VALUES (?, ?);
    """
    for batch in split_iterable_by_size(csv_file, 10000):
        cur.executemany(query_template, batch)
        conn.commit()
On our system and dataset this took 2 hours 30 minutes. We're not testing the alternative.
I am using PostgreSQL 9.0.3 on Windows. I have an Excel spreadsheet with lots of data to load into a couple of tables.
I have written a script that reads the data from the input file and inserts it into some 15 tables; this can't be done with COPY or an import. I named the input file DATALD.
I found the psql options -d to point at the database and -f to run the SQL script, but I need to know how to feed the input file in along with the script so that the data gets inserted into the tables.
For example, this is what I have done:
begin
   for emp in select distinct w_name from DATALD where w_name <> 'w_name' loop
      -- insert in a loop
      INSERT INTO tblemployer (id_employer, employer_name, date_created, created_by)
      VALUES (employer_id, emp.w_name, now(), 'SYSTEM1');
   end loop;
end;
Can someone please help?
For an SQL script you must either have the data inlined in your script (in the same file), or you need to utilize COPY to import the data into Postgres.
I suppose you use a temporary staging table, since the format doesn't seem to fit the target tables. Code example:
How to bulk insert only new rows in PostgreSQL
There are other options like pg_read_file(). But:
Use of these functions is restricted to superusers.
Intended for special purposes.
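A minimal sketch of the staging-table route, assuming the tblemployer target from the question; the staging table, the file name, and the idea that id_employer is filled in by a default or sequence are all assumptions on my part:

-- Staging table that mirrors the flat input file (columns are guesses).
CREATE TEMP TABLE datald_staging (w_name text);

-- Load the exported file into the staging table from the client side.
\copy datald_staging FROM 'DATALD.csv' WITH (FORMAT csv, HEADER)

-- Distribute the staged rows into the target table(s).
INSERT INTO tblemployer (employer_name, date_created, created_by)
SELECT DISTINCT w_name, now(), 'SYSTEM1'
FROM   datald_staging;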
I use both Firebird embedded and Firebird Server, and from time to time I need to reindex the tables using a procedure like the following:
CREATE PROCEDURE MAINTENANCE_SELECTIVITY
AS
DECLARE VARIABLE S VARCHAR(200);
BEGIN
  FOR SELECT RDB$INDEX_NAME FROM RDB$INDICES INTO :S DO
  BEGIN
    S = 'SET statistics INDEX ' || s || ';';
    EXECUTE STATEMENT :s;
  END
  SUSPEND;
END
I guess this is normal using embedded, but is it really needed using a server? Is there a way to configure the server to do it automatically when required or periodically?
First, let me point out that I'm no Firebird expert, so I'm answering on the basis of how SQL Server works.
In that case, the answer is both yes, and no.
The indexes are of course updated on SQL Server, in the sense that if you insert a new row, all indexes for that table will contain that row, so it will be found. So basically, you don't need to keep reindexing the tables for that part to work. That's the "no" part.
The problem, however, is not with the index, but with the statistics. You're saying that you need to reindex the tables, but then you show code that manipulates statistics, and that's why I'm answering.
The short answer is that statistics slowly go out of whack as time goes by. They might not deteriorate to a point where they're unusable, but they will deteriorate from the perfect state they're in when you recreate/recalculate them. That's the "yes" part.
The main problem with stale statistics is that if the distribution of the keys in the indexes changes drastically, the statistics might not pick that up right away, and thus the query optimizer will pick the wrong indexes, based on the old, stale, statistics data it has on hand.
For instance, let's say one of your indexes has statistics that says that the keys are clumped together in one end of the value space (for instance, int-column with lots of 0's and 1's). Then you insert lots and lots of rows with values that make this index contain values spread out over the entire spectrum.
If you now do a query that uses a join from another table, on a column with low selectivity (also lots of 0's and 1's) against the table with this index of yours, the query optimizer might deduce that this index is good, since it will fetch many rows that will be used at the same time (they're on the same data page).
However, since the data has changed, it'll jump all over the index to find the relevant pieces, and thus not be so good after all.
After recalculating the statistics, the query optimizer might see that this index is sub-optimal for this query, and pick another index instead, which is more suited.
Basically, you need to recalculate the statistics periodically if your data is in flux. If your data rarely changes, you probably don't need to do it very often, but I would still add a maintenance job with some regularity that does this.
As for whether or not it is possible to ask Firebird to do it on its own, then again, I'm on thin ice, but I suspect there is a way. In SQL Server you can set up maintenance jobs that do this on a schedule, and at the very least you should be able to kick off a batch file from the Windows scheduler to do something similar.
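In SQL Server terms, the periodic recalculation described above can be as simple as the following (the table name is just a placeholder):

-- Refresh statistics for one table (placeholder name).
UPDATE STATISTICS my_table;

-- Or refresh all out-of-date statistics in the current database.
EXEC sp_updatestats;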
That does not reindex; it recomputes the weights (selectivity) for the indexes, which the optimizer uses to select the most suitable index. You don't need to do that unless the index size changes a lot. If you create the index before you add data, you do need to do the recalculation.
Embedded and Server should have exactly the same functionality apart from the process model.
I wanted to update this answer for newer Firebird; here is the updated DSQL.
SET TERM ^ ;
CREATE OR ALTER PROCEDURE NEW_PROCEDURE
AS
DECLARE VARIABLE S VARCHAR(300);
BEGIN
  FOR SELECT 'SET statistics INDEX ' || RDB$INDEX_NAME || ';'
      FROM RDB$INDICES
      WHERE RDB$INDEX_NAME <> 'PRIMARY'
      INTO :S
  DO BEGIN
    EXECUTE STATEMENT :s;
  END
END^
SET TERM ; ^
GRANT EXECUTE ON PROCEDURE NEW_PROCEDURE TO SYSDBA;