Efficient way to move large number of rows from one table to another new table using postgres - postgresql

I am using a PostgreSQL database for a live project, in which I have one table with 8 columns.
This table contains millions of rows, so to make searches faster, I want to move old entries from this table into another, new table.
To do so, I know one approach:
first select some rows,
create a new table,
store those rows in that table,
then delete them from the main table.
But this takes too much time and is not efficient.
So I want to know: what is the best approach to do this in a PostgreSQL database?
PostgreSQL version: 9.4.2
Approximate number of rows: 8,000,000
Rows I want to move: 2,000,000

You can use a CTE (common table expression) to move rows in a single SQL statement (more in the documentation):
with delta as (
    delete from one_table where ...
    returning *
)
insert into another_table
select * from delta;
But think carefully whether you actually need it. Like a_horse_with_no_name said in the comment, tuning your queries might be enough.
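If a single 2-million-row statement turns out to be too heavy (one long transaction, lots of WAL), the same delete-and-return CTE can be run in batches. Here is a minimal sketch of a batch-statement builder in Python; the primary-key ordering as a proxy for "oldest rows" and the table/column names are assumptions, not from the original answer:

```python
def batched_move_sql(src, dst, pk, batch_size):
    """Build one batch of the CTE move: delete up to batch_size rows from
    src (oldest first, approximated by ascending pk) and insert them into
    dst. Run the statement repeatedly, committing each time, until it
    affects 0 rows."""
    return (
        f"WITH batch AS (\n"
        f"    DELETE FROM {src}\n"
        f"    WHERE {pk} IN (\n"
        f"        SELECT {pk} FROM {src}\n"
        f"        ORDER BY {pk}\n"
        f"        LIMIT {batch_size}\n"
        f"    )\n"
        f"    RETURNING *\n"
        f")\n"
        f"INSERT INTO {dst}\n"
        f"SELECT * FROM batch;"
    )
```

Executing this in a loop with a commit per batch keeps each transaction short, at the cost of more round trips.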

This is sample code for copying data between two tables with the same structure.
Here I used two different DBs: one is my production DB and the other is my testing DB.
INSERT INTO "Table2"
select * from dblink('dbname=DB1 user=postgres password=root',
                     'select "col1","Col2" from "Table1"')
     as t1(a character varying, b character varying);

Cloning a Postgres table, including indexes and data

I am trying to create a clone of a Postgres table using plpgsql.
To date I have been simply truncating table 2 and re-inserting data from table 1.
TRUNCATE TABLE "dbPlan"."tb_plan_next";
INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";
As code this works as expected, however "dbPlan"."tb_plan" contains around 3 million records and therefore it completes in around 20 minutes. This is too long and has knock-on effects on other processes.
It's important that all constraints, indexes and data are copied exactly to table 2.
I had tried dropping the table and re-creating it, however this did not improve speed.
DROP TABLE IF EXISTS "dbPlan"."tb_plan_next";
CREATE TABLE "dbPlan"."tb_plan_next" (LIKE "dbPlan"."tb_plan" INCLUDING ALL);
INSERT INTO "dbPlan"."tb_plan_next" SELECT * FROM "dbPlan"."tb_plan";
Is there a better method for achieving this?
I am considering creating the table and then creating indexes as a second step.
PostgreSQL doesn't provide a very elegant way of doing this. You could use pg_dump with -t and --section= to dump the pre-data and post-data for the table. Then you would replay the pre-data to create the table structure and the check constraints, then load the data from wherever you get it, then replay the post-data to add the indexes and FK constraints.
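The three-step procedure the answer describes can be sketched as a list of commands ready for subprocess. The database name, the output file names, and the choice of psql for the replay are placeholders, not part of the original answer:

```python
def section_dump_plan(db, table):
    """Build the pre-data / load / post-data sequence: dump the table
    structure and check constraints first, replay them, load the data,
    then replay the indexes and FK constraints last."""
    return [
        ["pg_dump", "--dbname", db, "--table", table,
         "--section=pre-data", "--file", "pre.sql"],
        ["pg_dump", "--dbname", db, "--table", table,
         "--section=post-data", "--file", "post.sql"],
        ["psql", "--dbname", db, "--file", "pre.sql"],
        # ... load the data here (COPY, INSERT ... SELECT, etc.) ...
        ["psql", "--dbname", db, "--file", "post.sql"],
    ]
```

Each list can be passed to subprocess.run, or joined with spaces and pasted into a shell. Building the indexes only after the bulk load is what saves the time.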

What is the best way to backfill a partition table using data from a non partitioned table? (postgres 12)

I'm converting a non-partitioned table to a partitioned table in Postgres 12. Assuming I have set up the new partitioned table and have created appropriate triggers for automatic creation of partitions, what is the best way to backfill the currently empty partitioned table?
Is a naive
insert into partitioned_table(a,d,b,c) select a, d, b, c from non_partitioned_table;
delete from non_partitioned_table;
appropriate? We have ~250M rows in the table, so I am a bit concerned about doubling the storage requirements.
Or perhaps
WITH moved_rows AS (
    DELETE FROM non_partitioned_table
    RETURNING *
)
INSERT INTO partitioned_table
SELECT * FROM moved_rows;
or
COPY non_partitioned_table TO '/tmp/non_partitioned_table.csv' DELIMITER ',';
COPY partitioned_table FROM '/tmp/non_partitioned_table.csv' DELIMITER ',';
Either way seems like it could take a while to transfer the data. I'm also concerned that it will kill INSERT performance while data is being migrated so I assume we'd need to block inserts until it's done. Is there any way to estimate how long it will take to copy the data over?
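One way to limit the impact on concurrent INSERTs is to backfill in date windows rather than in one huge statement, committing after each window. A minimal sketch, assuming the column d from the question is the date/partition key and that the window size is tunable (both assumptions):

```python
from datetime import date, timedelta

def backfill_batches(start, end, days=7):
    """Yield one INSERT ... SELECT per date window, so the backfill can be
    committed (and even paused) incrementally instead of running as a
    single 250M-row transaction. Column names are from the question;
    treating d as the date key is an assumption."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield (
            "INSERT INTO partitioned_table (a, d, b, c) "
            "SELECT a, d, b, c FROM non_partitioned_table "
            f"WHERE d >= '{cur}' AND d < '{nxt}';"
        )
        cur = nxt
```

The old table can then be dropped (or its rows deleted per window) once each batch is verified, which also bounds the temporary storage doubling to one window at a time.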

fastest way of inserting data into a table

I have a Postgres database, and I have inserted some data into a table. Because of issues with the internet connection, some of the data couldn't be written. The file that I am trying to write into the database is large (about 330,712,484 rows; even the wc -l command takes a while to complete).
Now, the column row_id is the (integer) primary key, and is already indexed. Since some of the rows could not be inserted into the table, I wanted to insert these specific rows into the table. (I estimate only about 1.8% of the data wasn't inserted ...) As a start, I tried to see if the primary keys were already in the database, like so:
import csv

import psycopg2

conn = psycopg2.connect(connector)
cur = conn.cursor()
with open(fileName) as f:
    header = f.readline().strip()
    header = list(csv.reader([header]))[0]
    print(header)
    for i, l in enumerate(f):
        if i > 10:
            break
        print(l.strip())
        row_id = l.split(',')[0]
        query = 'select * from raw_data.chartevents where row_id={}'.format(row_id)
        cur.execute(query)
        print(cur.fetchall())
cur.close()
conn.close()
Even for the first few rows of data, checking to see whether the primary key exists takes a really large amount of time.
What would be the fastest way of doing this?
The fastest way to insert data in PostgreSQL is using the COPY protocol, which is implemented in psycopg2. COPY will not allow you to check whether the target id already exists, though. The best option is to COPY your file's contents into a temporary table, then INSERT or UPDATE from there, as in the Batch Update article I wrote on my http://tapoueh.org blog a while ago.
With a recent enough version of PostgreSQL (9.5+) you may use
INSERT INTO ...
SELECT * FROM copy_target_table
ON CONFLICT (pkey_name) DO NOTHING;
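Putting the answer's two pieces together, the staging-table pattern is three statements. A sketch of a builder for them; the staging-table name, the CSV path, and the HEADER option are assumptions for illustration:

```python
def staged_upsert_sql(target, staging, pkey, csv_path):
    """The COPY-into-a-staging-table-then-upsert pattern: bulk-load the
    file with COPY (fast, no key checks), then let ON CONFLICT skip the
    rows that already made it into the target table."""
    return [
        f"CREATE TEMP TABLE {staging} (LIKE {target} INCLUDING DEFAULTS);",
        f"COPY {staging} FROM '{csv_path}' WITH (FORMAT csv, HEADER);",
        f"INSERT INTO {target} SELECT * FROM {staging} "
        f"ON CONFLICT ({pkey}) DO NOTHING;",
    ]
```

Run the three statements in order on one connection (TEMP tables are per-session); this avoids the per-row SELECT lookups from the question entirely.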
Can I offer a workaround?
The index will be checked for each row inserted, and Postgres performs the whole insert in a single transaction, so you are effectively storing all of this data to disk before it is written.
Could I suggest you drop the indexes to avoid this slowdown, then split the file into smaller files using head -n [int] > newfile or something similar, and then perform the COPY commands separately for each one.

Redshift query a daily-generated table

I am looking for a way to create a Redshift query that will retrieve data from a table that is generated daily. Tables in our cluster are of the form:
event_table_2016_06_14
event_table_2016_06_13
.. and so on.
I have tried writing a query that appends the current date to the table name, but this does not seem to work correctly (invalid operation):
SELECT * FROM concat('event_table_', to_char(getdate(),'YYYY_MM_DD'))
Any suggestions on how this can be performed are greatly appreciated!
I have tried writing a query that appends the current date to the
table name, but this does not seem to work correctly (invalid
operation):
Redshift does not support that. But you most likely won't need it.
Try the following (expanding on the answer from #ketan):
Create your main table with appropriate (for joins) DIST key, and COMPOUND or simple SORT KEY on timestamp column, and proper compression on columns.
Daily, create a temp table (use CREATE TABLE ... LIKE - this will preserve DIST/SORT keys), load it with daily data, VACUUM SORT.
Copy sorted temp table into main table using ALTER TABLE APPEND - this will copy the data sorted, and will reduce VACUUM on the main table. You may still need VACUUM SORT after that.
After that, query your main table normally, probably giving it a range on the timestamp. Redshift is optimised for these scenarios, and 99% of the time you don't need to optimise table scans yourself - even on tables with billions of rows, scans take milliseconds to a few seconds. You may need to optimise elsewhere, but that's the second step.
To get insight in the performance of scans, use STL_QUERY system table to find your query ID, and then use STL_SCAN (or SVL_QUERY_SUMMARY) table to see how fast the scan was.
Your example is actually the main use case for ALTER TABLE APPEND.
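The daily flow described above can be sketched as a fixed statement sequence. A builder for it, assuming the day's data arrives as a CSV in S3 and that the staging table is dropped afterwards (both assumptions; the credentials string is a placeholder):

```python
def daily_load_sql(main_table, staging, s3_path, creds):
    """Daily Redshift load: create a staging table LIKE the main table
    (preserving DIST/SORT keys), COPY the day's data into it, sort it,
    then move it into the main table with ALTER TABLE APPEND."""
    return [
        f"CREATE TABLE {staging} (LIKE {main_table});",
        f"COPY {staging} FROM '{s3_path}' CREDENTIALS '{creds}' CSV;",
        f"VACUUM SORT ONLY {staging};",
        f"ALTER TABLE {main_table} APPEND FROM {staging};",
        f"DROP TABLE {staging};",
    ]
```

ALTER TABLE APPEND moves the blocks rather than copying rows, which is why it is so much cheaper than INSERT ... SELECT for this case.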
I am assuming that you are creating a new table everyday.
What you can do is:
Create a view on top of event_table_* tables. Query your data using this view.
Whenever you create or drop a table, update the view.
If you want, you can avoid #2: Instead of creating a new table everyday, create empty tables for next 1-2 years. So, no need to update the view every day. However, do remember that there is an upper limit of 9,900 tables in Redshift.
Edit: If you always need to query today's table (instead of all tables, as I assumed originally), I don't think you can do that without updating your view.
However, you can modify your design to have just one table, with date as sort-key. So, whenever your table is queried with some date, all disk blocks that don't have that date will be skipped. That'll be as efficient as having time-series tables.
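The view approach from earlier in this answer has to be regenerated whenever a daily table appears or disappears, so it is worth scripting. A minimal sketch; the view name and the use of SELECT * are assumptions:

```python
def daily_view_sql(view_name, table_dates):
    """Build a UNION ALL view over the per-day event_table_YYYY_MM_DD
    tables. Re-run (CREATE OR REPLACE) whenever a daily table is added
    or dropped, since the view's definition is fixed at creation time."""
    selects = [f"SELECT * FROM event_table_{d}" for d in table_dates]
    return (f"CREATE OR REPLACE VIEW {view_name} AS\n"
            + "\nUNION ALL\n".join(selects) + ";")
```

Queries then target the view and push date predicates down to the member tables.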

Implications of using ADD COLUMN on large dataset

Docs for Redshift say:
ALTER TABLE locks the table for reads and writes until the operation completes.
My question is:
Say I have a table with 500 million rows and I want to add a column. This sounds like a heavy operation that could lock the table for a long time - yes? Or is it actually a quick operation since Redshift is a columnar db? Or it depends if column is nullable / has default value?
I find that adding (and dropping) columns is a very fast operation even on tables with many billions of rows, regardless of whether there is a default value or it's just NULL.
As you suggest, I believe this is a feature of it being a columnar database, so the rest of the table is undisturbed. It simply creates empty (or nearly empty) column blocks for the new column on each node.
I added an integer column with a default to a table of around 65M rows in Redshift recently and it took about a second to process. This was on a dw2.large (SSD type) single node cluster.
Just remember you can only add a column to the end (right) of the table, you have to use temporary tables etc if you want to insert a column somewhere in the middle.
Personally, I have seen that rebuilding the table works best.
I do it in the following way:
Create a new table N_OLD_TABLE
Define the datatype/compression encoding in the new table
Insert data: INSERT INTO N_OLD_TABLE (old_columns) SELECT old_columns FROM OLD_TABLE
Rename OLD_TABLE to OLD_TABLE_BKP
Rename N_OLD_TABLE to OLD_TABLE
This is a much faster process. It doesn't block any table, and you always have a backup of the old table in case anything goes wrong.
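The rebuild-and-swap steps above can be sketched as a statement builder. Note the CREATE TABLE ... LIKE here is a simplification: to actually change compression encodings, as the answer intends, you would spell out the column definitions with ENCODE settings instead (an assumption left as a comment):

```python
def rebuild_sql(old_table, columns):
    """Rebuild-and-swap: copy into a fresh table, then rename so the old
    table survives as a backup. Naming convention follows the answer."""
    new = f"N_{old_table}"
    bkp = f"{old_table}_BKP"
    cols = ", ".join(columns)
    return [
        # In practice, replace LIKE with explicit column defs + ENCODE here.
        f"CREATE TABLE {new} (LIKE {old_table});",
        f"INSERT INTO {new} ({cols}) SELECT {cols} FROM {old_table};",
        f"ALTER TABLE {old_table} RENAME TO {bkp};",
        f"ALTER TABLE {new} RENAME TO {old_table};",
    ]
```

The two renames are metadata-only operations, so the swap itself is near-instant; the cost is in the INSERT ... SELECT copy.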