SSIS for table-to-table inserts vs. (SQL only) INSERT INTO () SELECT FROM approach - tsql

I am currently transferring a large number of records from one table to another, summarizing in the process. So, I have a SQL statement in this general format:
INSERT INTO TargetTable
(Col1,
Col2,
...
ColX,
Tot)
SELECT
Col1,
Col2,
...
ColX,
SUM(TOT)
FROM
SourceTable
GROUP BY
Col1,
Col2,
...
ColX
Is there any performance advantage to moving this SQL into an SSIS task that transfers records from one table to another using a SQL SELECT as the source? For example, is logging turned off?
Secondary question: are there any tactics I could use to maximize the transfer rate? For example, removing indexes from the target table before inserting, locking the table, etc.?

In my experience (and, bear in mind that it's been a year and change since I've done this), the only advantage you'd get from SSIS is its ability to make use of the bulk insert task. This adds an additional step, requiring you to export your source data to a flat file before you begin the import process.
Alternatively, if you stick with a SQL statement, the section in this article titled Using INSERT INTO…SELECT to Bulk Import Data with Minimal Logging provides the following suggestions:
You can use INSERT INTO … SELECT … FROM to efficiently transfer a large number of rows from one table, such as a staging table, to another table with minimal logging. Minimal logging can improve the performance of the statement and reduce the possibility of the operation filling the available transaction log space during the transaction.
Minimal logging for this statement has the following requirements:
The recovery model of the database is set to simple or bulk-logged.
The target table is an empty or nonempty heap.
The target table is not used in replication.
The TABLOCK hint is specified for the target table.
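Putting those requirements together, a minimally logged version of your statement might look roughly like the sketch below (MyDb is a placeholder, and it assumes TargetTable is a heap you are allowed to load under the BULK_LOGGED recovery model):
-- switch to a minimally logged recovery model for the load (coordinate with whoever owns backups)
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

INSERT INTO TargetTable WITH (TABLOCK)
(Col1, Col2, ColX, Tot)
SELECT Col1, Col2, ColX, SUM(TOT)
FROM SourceTable
GROUP BY Col1, Col2, ColX;

-- restore the original recovery model and take a log (or full) backup afterwards
ALTER DATABASE MyDb SET RECOVERY FULL;
Switching to BULK_LOGGED affects point-in-time recovery for log backups taken during the load, so treat this as something to agree on with the DBA rather than a drop-in script.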
I personally dislike SSIS packages for a particular reason: I have never had a DBA who was dedicated to maintaining them. The data import projects I worked on required a lot of fiddling, as the source data wasn't clean (which I assume won't be a problem for you), so I had many packages that worked just fine in a testing environment with a limited data sample that crashed immediately when deployed into production, which made the process a pain in the neck to deal with.
This is just my opinion, but I would say that unless you or someone else you work with focuses on SSIS packages as a part of database maintenance, then it's easier to maintain and document a process that lives inside a stored procedure.

Set the recovery model to simple. Set the log size high enough to handle the insert. Are others on the system? A TABLOCK will help the insert: TargetTable WITH (TABLOCK). If you have a clustered index on TargetTable, order the data that way in the SELECT. If you can accept dirty reads, use SourceTable WITH (NOLOCK). If you are inserting more than 100,000 records you might want to break up the insert using a WHERE clause.
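As a sketch of those suggestions combined (assuming Col1 is the leading column of the clustered index on TargetTable, so it can double as the batching key; the range boundaries are placeholders):
INSERT INTO TargetTable WITH (TABLOCK)
(Col1, Col2, ColX, Tot)
SELECT Col1, Col2, ColX, SUM(TOT)
FROM SourceTable WITH (NOLOCK)
WHERE Col1 >= 1 AND Col1 < 100000   -- repeat with the next key range for each batch
GROUP BY Col1, Col2, ColX
ORDER BY Col1;                      -- matches the clustered index on TargetTable
Because Col1 is part of the GROUP BY, slicing on Col1 ranges never splits a group, and each batch keeps its own transaction (and log usage) small.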

Related

PostgreSQL performance of INSERT INTO table SELECT vs. COPY

I'm attempting to move medium amounts of data around in PostgreSQL (tens-to-hundreds of millions of rows).
In designing the system, I'm trying to understand: how does the performance of INSERT INTO table(field1, field2) SELECT field1, field2 FROM other_table compare with COPY FROM ... BINARY in PostgreSQL?
I can't find any documentation that directly speaks to that question. Some considerations I can see:
INSERT INTO ... SELECT requires both reads and writes on the same disk
COPY FROM ... BINARY requires either a client that has the data, or doing a round-trip COPY TO ... piped to COPY FROM ...
But I'm sure there are others; I'm hoping there's some form of canonical performance guidance around comparative expectations for these.
Ultimately questions like this can only be answered by tests.
But if you want to copy data from one table to another, INSERT ... SELECT ... should perform better, because it does not require saving the data to an intermediary file or going through a client-server connection.
Tips for speed:
Have no constraints or indexes on the new table when you load the data; add them afterwards.
Make sure max_wal_size is high.
I'd VACUUM (FREEZE) the new table afterwards (which doesn't disturb normal work on the table) to make future anti-wraparound autovacuum runs fast.
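A rough outline of that approach (new_table, other_table, and the 16GB value are just placeholders for illustration):
-- create the target without indexes or constraints
CREATE TABLE new_table (LIKE other_table INCLUDING DEFAULTS);

-- raise max_wal_size for the duration of the load; it can be changed with a reload
ALTER SYSTEM SET max_wal_size = '16GB';
SELECT pg_reload_conf();

INSERT INTO new_table SELECT * FROM other_table;

-- add indexes and constraints only after the data is in place
CREATE INDEX ON new_table (field1);

-- freeze the rows so future anti-wraparound autovacuum runs have little to do
VACUUM (FREEZE) new_table;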

How does postgresql lock tables when inserting and selecting?

I'm migrating data from one table to another in an environment where any long locks or downtime are not acceptable, in total about 80000 rows. Essentially the query boils down to this simple case:
INSERT INTO table_2
SELECT * FROM table_1
JOIN table_3 on table_1.id = table_3.id
All 3 tables are being read from and could have an insert at any time. I want to just run the query above, but I'm not sure how the locking works and whether the tables will be totally inaccessible during the operation. My understanding tells me that only the affected rows (newly inserted) will be locked. Table 1 is just being selected, so no harm, and concurrent inserts are safe so table 2 should be freely accessible.
Is this understanding correct, and can I run this query in a production environment without fear? If it's not safe, what is the standard way to accomplish this?
You're fine.
If you're interested in the details, you can read up on multiversion concurrency control, or on the details of the Postgres MVCC implementation, or how its various locking modes interact, but the implications for your case are nicely summarised in the docs:
reading never blocks writing and writing never blocks reading
In short, every record stored in the database has some version number attached to it, and every query knows which versions to consider and which to ignore.
This means that an INSERT can safely write to a table without locking it, as any concurrent queries will simply ignore the new rows until the inserting transaction decides to commit.
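If you want to see that for yourself, one way (a sketch to run from a second session while the INSERT ... SELECT is executing) is to look at pg_locks:
-- table-level locks held by other sessions; expect RowExclusiveLock on table_2 and
-- AccessShareLock on table_1 and table_3, none of which block ordinary SELECTs or INSERTs
SELECT l.relation::regclass AS relation, l.mode, l.granted
FROM pg_locks l
WHERE l.relation IS NOT NULL
  AND l.pid <> pg_backend_pid();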

How to perform a Bulk Insert in Sybase SQL

I need to insert a big amount of data (some millions of rows) and I need to do it quickly.
I read about bulk insert via ODBC in .NET and Java, but I need to perform it directly on the database.
I also read about batch insert, but what I have tried has not seemed to work:
Batch Insert, Example
I'm executing an INSERT ... SELECT but it's taking something like 0.360 s per row, which is very slow, and I need to make some improvements here.
I would really appreciate some guidance here with examples and documentation if possible.
DATABASE: SYBASE ASE 15.7
Expanding on some of the comments ...
blocking, slow disk IO, and any other 'wait' events (ie, anything other than actual insert/update activity) can be ascertained from the master..monProcessWaits table (where SPID = spid_of_your_insert_update_process) [see the P&T manual for Monitoring Tables (aka MDA tables)]
master..monProcessObject and master..monProcessStatement will show logical/physical IOs for currently running queries [again, see P&T manual for MDA tables]
master..monSysStatement will show logical/physical IOs for recently completed queries [again, see P&T manual for MDA tables]
for UPDATE statements you'll want to take a look at the query plan to see if you're suffering from a poor join order; also of key importance ... direct (fast/good) updates vs deferred (slow/bad) updates; deferred updates can occur for many reasons ... some fixable, some not ... updating indexed columns, poor join order, updates that cause page splits and/or row forwardings
RI (PK/FK) constraints can be viewed with sp_helpconstraint table_name; query plans will also show the under-the-covers joins required when performing RI (PK/FK) validations during inserts/updates/deletes
triggers are a bit harder to locate (an official sp_helptrigger doesn't show up until ASE 16); check the sysobjects.[ins|upd|del]trig where name = your_table - these represent the object id(s) of any insert/update/delete triggers on the table; also check sysobjects records where type = 'TR' and deltrig = object_id(your_table) - provides support for additional insert/update/delete triggers (don't recall at moment if this is just ASE 16+)
if triggers are being fired, need to review the associated query plans to make sure the inserted and deleted tables (if referenced) are driving any queries where these pseudo tables are joined with permanent tables
There are likely some areas I'm forgetting (off the top of my head) ... key take away is that there could be many reasons for 'slow' DML statements.
One (relatively) quick way to find out if RI (PK/FK) constraints or triggers are at play ...
set showplan on
go
insert/update/delete statements
go
Then review the resulting query plan(s); if you see references to any tables other than the ones explicitly listed in the insert/update/delete statements then you're likely dealing with RI constraints and/or triggers.
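For the wait-event check mentioned at the top of this answer, here is a sketch of the sort of MDA query involved (it assumes the MDA tables are enabled and your login has mon_role; the SPID is a placeholder):
-- what the insert/update process has been waiting on, worst offenders first
select w.WaitEventID, i.Description, w.Waits, w.WaitTime
from master..monProcessWaits w
join master..monWaitEventInfo i on i.WaitEventID = w.WaitEventID
where w.SPID = 123   -- spid of your insert/update process
order by w.WaitTime desc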

SQL Transactions - allow read original data before commit (snapshot?)

I am facing an issue that is possibly quite easy to solve; I am just new to advanced transaction settings.
Every 30 minutes I am running an INSERT query that gets the latest data from a linked server to my client's server, into a table we can call ImportTable. For this I have a simple job that looks like this:
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable (columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer
COMMIT
The thing is, each time the job runs the ImportTable is locked for the query run time (2-5 minutes) and nobody can read the records. I wish the table to be read-accessible all the time, with as little downtime as possible.
Now, I read that it is possible to allow SNAPSHOT ISOLATION in the database settings that could probably solve my problem (set to FALSE at the moment), but I have never played with different transaction isolation types and as this is not my DB but my client's, I'd rather not alter any database settings if I am not sure if it can break something.
I know I could have an intermediary table that the records are inserted to and then inserted to the final table and that is certainly a possible solution, I was just hoping for something more sophisticated and learning something new in the process.
PS: My client's server & database is fairly new and barely used, so I expect very little impact if I change some settings, but still, I cannot just randomly change various settings for learning purposes.
Many thanks!
Inserts won't normally block the table unless the lock is escalated to table level. In this case, you are emptying the table first and inserting the data again; why not insert only the changed data? For the query you are running, read committed snapshot isolation (RCSI) or snapshot isolation will help you, but you will have the added impact of row versioning, which means SQL Server will store row versions of changed rows in tempdb.
Please see the MCM isolation videos by Kimberly Tripp for an in-depth understanding, and don't forget to test in a staging environment.
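As a sketch of what turning on row versioning looks like (MyClientDb is a placeholder; both statements change database-wide behaviour, so agree on it with your client first):
-- readers see the last committed version of a row instead of blocking on the writer
ALTER DATABASE MyClientDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- optional: also allow sessions to use SET TRANSACTION ISOLATION LEVEL SNAPSHOT explicitly
ALTER DATABASE MyClientDb SET ALLOW_SNAPSHOT_ISOLATION ON;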
You are making this harder than it needs to be.
The problem is the 2-5 minutes that you let be part of a transaction.
It is only a few thousand rows - that part takes just a few milliseconds.
If you need ImportTable to be available during those few milliseconds then put it in a snapshot.
Delete ImportTableStaging;
INSERT INTO ImportTableStaging(columns)
SELECT (columns)
FROM QueryGettingResultsFromLinkedServer;
BEGIN TRAN
DELETE FROM ImportTable
INSERT INTO ImportTable WITH (TABLOCK) (columns)
SELECT (columns)
FROM ImportTableStaging
COMMIT
If you are worried about concurrent update to ImportTableStaging then use a #temp
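A minimal sketch of the #temp variant, keeping the (columns) placeholder from the answer above:
SELECT (columns)
INTO #ImportTableStaging
FROM QueryGettingResultsFromLinkedServer;

BEGIN TRAN
DELETE FROM ImportTable;
INSERT INTO ImportTable WITH (TABLOCK) (columns)
SELECT (columns)
FROM #ImportTableStaging;
COMMIT

DROP TABLE #ImportTableStaging;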

Which will have better performance in DB2

I need to insert into a table from a master table having 2 billion records. The insert needs to satisfy some conditions, and some of the columns have to be calculated before the rows are inserted.
I have 2 options but I don't know which to follow to improve performance.
Option 1
Create a cursor by filtering the master table with the conditions, fetch the records one by one for the calculation, and then insert them into the child table.
Option 2
Insert first using the conditions, and then do the calculation with an UPDATE statement.
Please assist.
Having a cursor get the data, perform the calculation, and then insert into the database will be time consuming. My guess is that this is because it involves a data connection and I/O for each retrieval and insertion (for both databases).
Databases are usually better with bulk operations, so Option 2 will definitely give you better performance. Option 2 is also better for troubleshooting (as the process is cleanly separated - step 1: load, step 2: calculate) than Option 1, where an error in the middle of the process forces you to redo all the steps again.
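A plain-SQL sketch of Option 2 (the table and column names here are placeholders, not from the question):
-- set-based insert of the filtered rows, leaving the calculated column empty for now
INSERT INTO child_table (id, col_a, col_b, calc_col)
SELECT id, col_a, col_b, NULL
FROM master_table
WHERE some_condition = 'Y';

-- set-based calculation in a single pass
UPDATE child_table
SET calc_col = col_a * col_b;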
Opening a cursor and inserting records one by one might have serious performance issues at volumes on the order of a billion rows, especially if you have a weak network between your database tier and app tier. The fastest way to do this could be to use the Db2 EXPORT utility to download the data, let the program manipulate the data in the file, and later LOAD the file back into the child table. Apart from the file-based option you can also consider the following approaches:
1) Write an SQL stored procedure (no need to ship the data out of the database to make changes)
2) If you are using Java/JDBC, use the batch update feature to update multiple records at the same time
3) If you are using a tool like Informatica, turn on its bulk load feature
Also see the IBM developerWorks article on improving insert performance. The article is a little bit older but the concepts are still valid: http://www.ibm.com/developerworks/data/library/tips/dm-0403wilkins/
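A rough sketch of the EXPORT/LOAD route from the DB2 command line (the file path, table names, columns, and filter are all placeholders):
-- unload the filtered rows to a delimited file
EXPORT TO /tmp/master_extract.del OF DEL
  SELECT id, col_a, col_b FROM master_table WHERE some_condition = 'Y';

-- manipulate the file externally if needed, then bulk-load it into the child table
LOAD FROM /tmp/master_extract.del OF DEL
  INSERT INTO child_table (id, col_a, col_b);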