I'm trying to insert a couple of million rows into a PostgreSQL database, and I'm wondering what the best way to do it is.
CREATE TABLE AS
INSERT INTO
Which one is better, and why? I have read through some blogs but still couldn't come to a conclusion.
I think INSERT INTO is a bulk insert operation; please correct me if I'm wrong. Is CREATE TABLE AS SELECT also a bulk insert operation?
Please advise.
CREATE TABLE AS is a bulk insert operation as well. The main difference is that CREATE TABLE AS is easier for PostgreSQL to optimize: it is clear that no WAL information has to be written (unless WAL-based replication is active, of course). See the wal_level documentation and Disable WAL Archival and Streaming Replication for some other cases where this optimization applies.
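As a rough sketch (the table names here are made up), the two approaches look like this; with wal_level set to minimal and no replication, the first can skip most WAL writes:

-- Creates and fills the target in one statement; with wal_level = minimal
-- and no replication, the data does not have to be WAL-logged.
CREATE TABLE measurements_archive AS
SELECT * FROM measurements;

-- Loads an existing table; rows go through the normal WAL path.
INSERT INTO measurements_archive
SELECT * FROM measurements;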
Related
As part of a daily load in Redshift, I have a couple of tables that I drop and fully reload (the data size is small, less than 1 million rows).
My question is which of the below two strategies is better in terms of CPU utilization and memory in Redshift:
1) Truncate data
2) DROP and Recreate Table.
If I truncate the tables, should I run VACUUM on them every day? I have read that frequently dropping and recreating tables in the database causes page fragmentation.
For one of the tables I would also like to enable compression, so is there any downside to creating the DDL with encoding every day?
Please advise! Thank you!
If you drop the tables, you will lose the permissions assigned to them, and any views that reference them will become invalid.
TRUNCATE is the better option: it does not require VACUUM or ANALYZE, and it is built for use cases like this.
For further information, see the Redshift TRUNCATE documentation.
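As an illustrative sketch (the table and column names are placeholders), the two strategies compare like this; TRUNCATE keeps the table definition, permissions, and dependent views intact:

-- Strategy 1: keep the table; remove all rows.
-- Note that TRUNCATE commits the current transaction in Redshift.
TRUNCATE TABLE daily_sales;

-- Strategy 2: drop and recreate; grants and dependent views have to be restored afterwards.
DROP TABLE daily_sales;
CREATE TABLE daily_sales (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(12,2) ENCODE zstd   -- column compression can be declared in the DDL
);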
If I index a PostgreSQL table and then update it, do I need to re-index the table or is it automatically re-indexed?
Can someone provide a link to PostgreSQL documentation for further reading? I've got this so far:
https://www.postgresql.org/docs/9.1/static/sql-createindex.html
Indexes in PostgreSQL do not need maintenance or tuning.
You do not need to re-index manually.
For more details, please also read
https://www.postgresql.org/docs/current/static/monitoring-stats.html
From further reading in the PostgreSQL documentation:
Once an index is created, no further intervention is required: the system will update the index when the table is modified, and it will use the index in queries when it thinks doing so would be more efficient than a sequential table scan. But you might have to run the ANALYZE command regularly to update statistics to allow the query planner to make educated decisions. See Chapter 14 for information about how to find out whether an index is used and when and why the planner might choose not to use an index.
See:
https://www.postgresql.org/docs/current/static/indexes-intro.html
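For example, refreshing statistics and checking whether an index is actually used looks roughly like this (the table name is a placeholder, and autovacuum normally runs ANALYZE for you):

-- Update planner statistics for one table.
ANALYZE my_table;

-- Check how often each index on the table has been used.
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'my_table';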
I've got PostgreSQL 9.2 and a tiny database with just a bit of seed data for a website that I'm working on.
The following query seems to run forever:
ALTER TABLE diagnose_bodypart ADD description text NOT NULL;
diagnose_bodypart is a table with less than 10 rows. I've let the query run for over a minute with no results. What could be the problem? Any recommendations for debugging this?
Adding a column does not require rewriting a table (unless you specify a DEFAULT). It is a quick operation absent any locks. pg_locks is the place to check, as Craig pointed out.
In general, the most likely cause is long-running transactions. I would look at which workflows are hitting these tables and how long their transactions stay open. Locks of this sort are typically transactional, so committing the open transactions will usually fix the problem.
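To find the blocker, a query along these lines (the system views shown exist in 9.2) lists every session holding or waiting for a lock on the table; sessions sitting in state 'idle in transaction' are the usual culprits:

-- Sessions holding or waiting for locks on diagnose_bodypart.
SELECT l.pid, l.mode, l.granted, a.state, a.xact_start, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'diagnose_bodypart'::regclass;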
I am currently transferring a large number of records from one table to another, summarizing in the process. So I have a SQL statement in this general format:
INSERT INTO TargetTable
(Col1,
Col2,
...
ColX,
Tot
)
SELECT
Col1,
Col2,
...
ColX,
SUM(TOT)
FROM
SourceTable
GROUP BY
Col1,
Col2,
...
ColX
Is there any performance advantage of moving this SQL into an SSIS task when transferring records from one table to another using a SQL SELECT as a source? For example, is logging turned off?
Secondary question: Are there any tactics that I could use to ensure a maximum transfer rate? For example, removing indexes from the Target table before inserting, locking the table, etc?
In my experience (and, bear in mind that it's been a year and change since I've done this), the only advantage you'd get from SSIS is its ability to make use of the bulk insert task. This adds an additional step, requiring you to export your source data to a flat file before you begin the import process.
Alternatively, if you stick with a SQL statement, the section in this article titled Using INSERT INTO…SELECT to Bulk Import Data with Minimal Logging provides the following suggestions:
You can use INSERT INTO SELECT FROM to efficiently transfer a large number of rows from one table, such as a staging table, to another table with minimal logging. Minimal logging can improve the performance of the statement and reduce the possibility of the operation filling the available transaction log space during the transaction.
Minimal logging for this statement has the following requirements:
The recovery model of the database is set to simple or bulk-logged.
The target table is an empty or nonempty heap.
The target table is not used in replication.
The TABLOCK hint is specified for the target table.
I personally dislike SSIS packages for a particular reason: I have never had a DBA who was dedicated to maintaining them. The data import projects I worked on required a lot of fiddling, as the source data wasn't clean (which I assume won't be a problem for you), so I had many packages that worked just fine in a testing environment with a limited data sample that crashed immediately when deployed into production, which made the process a pain in the neck to deal with.
This is just my opinion, but I would say that unless you or someone else you work with focuses on SSIS packages as a part of database maintenance, then it's easier to maintain and document a process that lives inside a stored procedure.
Set the recovery model to simple. Set the log size high enough to handle the insert. Are others on the system? A TABLOCK hint will help the insert: TargetTable WITH (TABLOCK). If you have a clustered index on TargetTable, order the data that way in the SELECT. If you can accept dirty reads, use SourceTable WITH (NOLOCK). If you are inserting more than 100,000 records, you might want to break up the insert using a WHERE clause.
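Putting those pieces together, a minimally logged version of the statement from the question might look like the sketch below (the database name is a placeholder, and switching the recovery model is only an option if your backup strategy allows it):

-- Assumes the database can temporarily use the BULK_LOGGED (or SIMPLE) recovery model.
ALTER DATABASE MyDatabase SET RECOVERY BULK_LOGGED;

INSERT INTO TargetTable WITH (TABLOCK)
    (Col1, Col2, ColX, Tot)
SELECT Col1, Col2, ColX, SUM(TOT)
FROM SourceTable
GROUP BY Col1, Col2, ColX;

ALTER DATABASE MyDatabase SET RECOVERY FULL;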
I am working with a database of approximately a million rows, using Python to parse documents and populate a table with terms. The INSERT statements work fine, but the UPDATE statements become extremely time-consuming as the table grows.
It would be great if someone could explain this phenomenon and also tell me if there is a faster way to do updates.
Thanks,
Arnav
Sounds like you have an indexing problem. Whenever I hear about problems getting worse as table size grows, it makes me wonder if you're doing a table scan whenever you interact with a table.
Check to see if you have a primary key and meaningful indexes on that table. Look at the WHERE clause you have on that UPDATE and make sure there's an index on those columns to make finding that record as fast as possible.
UPDATE: Write a SELECT query using the WHERE clause from your UPDATE statement and ask the database engine for its EXPLAIN plan. If you see a table scan, you'll know what to do.
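For example (the table, column, and index names below are invented, and EXPLAIN syntax varies by engine; the PostgreSQL form is shown):

-- Ask the engine how it locates the rows targeted by the UPDATE.
EXPLAIN SELECT * FROM terms WHERE document_id = 42 AND term = 'example';

-- If the plan shows a sequential (table) scan, index the columns in the WHERE clause.
CREATE INDEX terms_doc_term_idx ON terms (document_id, term);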