Which option will perform better in DB2?

I need to populate a table from a master table that has 2 billion records. The insert has to satisfy some conditions, and some columns have to be calculated before the rows are inserted.
I have two options, but I don't know which one to follow to improve performance.
Option 1
Create a cursor that filters the master table with the conditions, fetch the records one by one for the calculation, and then insert each one into the child table.
Option 2
First insert into the child table using the conditions, then do the calculation with an UPDATE statement.
Please assist.

Using a cursor to fetch the data, perform the calculation, and then insert into the database will be time consuming. My guess is that this is because it involves a data connection round trip and I/O for each retrieval and insertion (for both databases).
Databases are usually better with bulk operations, so Option 2 should give you better performance. Option 2 is also better for troubleshooting, as the process is cleanly separated (step 1: download, step 2: calculate). With Option 1, an error in the middle of the process forces you to redo all the steps again.
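Option 2 can be sketched as two set-based statements. This is only an illustration: it uses Python's sqlite3 module as a stand-in for Db2, and the table names, columns, filter and calculation are all made up for the example.

```python
import sqlite3

# In-memory SQLite database standing in for Db2 (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master (id INTEGER PRIMARY KEY, region TEXT, qty INTEGER, price REAL);
    CREATE TABLE child  (id INTEGER PRIMARY KEY, region TEXT, qty INTEGER, price REAL, total REAL);
    INSERT INTO master VALUES (1, 'EU', 2, 10.0), (2, 'US', 5, 3.0), (3, 'EU', 1, 7.5);
""")

with conn:
    # Step 1: one bulk insert that applies the filter conditions.
    conn.execute("""
        INSERT INTO child (id, region, qty, price)
        SELECT id, region, qty, price FROM master WHERE region = 'EU'
    """)
    # Step 2: one bulk update that fills in the calculated column.
    conn.execute("UPDATE child SET total = qty * price")

rows = conn.execute("SELECT id, total FROM child ORDER BY id").fetchall()
print(rows)
```

Neither step fetches rows into the application, which is the point of preferring it over the cursor loop.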

Opening a cursor and inserting records one by one might cause serious performance issues at volumes on the order of a billion rows, especially if you have a weak network between your database tier and app tier. The fastest way to do this could be to use the Db2 export utility to download the data, let the program manipulate the data in the file, and later load the file back into the child table. Apart from the file-based option, you can also consider the following approaches:
1) Write an SQL stored procedure (no need to ship the data out of the database to make changes)
2) If you are using Java/JDBC, use the batch update feature to update multiple records at the same time
3) If you are using a tool like Informatica, turn on the bulk load feature in Informatica
Also see the IBM developerWorks article on improving insert performance. The article is a little old, but the concepts are still valid: http://www.ibm.com/developerworks/data/library/tips/dm-0403wilkins/

Related

DB2 Tables Not Loading when run in Batch

I have been working on a reporting database in DB2 for a month or so, and I have it set up to a pretty decent degree of what I want. I am, however, noticing small inconsistencies that I have not been able to work out.
Less important, but still annoying:
1) Users claim it takes two login attempts to connect, first always fails, second is a success. (Is there a recommendation for what to check for this?)
More importantly:
2) Whenever I want to refresh the data (which will be nightly), I have a script that drops and then recreates all of the tables. There are 66 tables, each ranging from tens of records to just under 100,000 records. The data is not massive, and it takes about 2 minutes to run all 66 tables.
The issue is that once it says it has completed, there are usually at least 3-4 tables that did not load any data. So the table is dropped and then created, but is empty. The log shows that the command completed successfully, and if I run the commands independently the tables populate just fine.
If it helps, 95% of the commands are just CAST functions.
While I am sure I am not doing it the recommended way, is there a reason why a number of my tables are not populating? Are the commands executing too fast? Should I lag the Create after the DROP?
(This is DB2 Express-C 11.1 on Windows 2012 R2, The source DB is remote)
Example of my SQL:
DROP TABLE TEST.TIMESHEET;
CREATE TABLE TEST.TIMESHEET AS (
SELECT NAME00, CAST(TIMESHEET_ID AS INTEGER(34))TIMESHEET_ID ....
.. (for 5-50 more columns)
FROM REMOTE_DB.TIMESHEET
)WITH DATA;
It is possible to configure DB2 to tolerate certain SQL errors in nested table expressions.
https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyfqetnint.html
When the federated server encounters an allowable error, the server allows the error and continues processing the remainder of the query rather than returning an error for the entire query. The result set that the federated server returns can be a partial or an empty result.
However, I assume that your REMOTE_DB.TIMESHEET is simply a nickname, and not a view with nested table expressions, and so any errors when pulling data from the source should be surfaced by DB2. Taking a look at the db2diag.log is likely the way to go - you might even be hitting a Db2 issue.
It might be useful to change your script to TRUNCATE and INSERT into your local tables and see if that helps avoid the issue.
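A minimal sketch of that TRUNCATE-and-INSERT refresh, using Python's sqlite3 module as a stand-in for Db2. SQLite has no TRUNCATE, so DELETE is used in its place. The helper refresh_table, the table names, and the row-count check are assumptions added for illustration; they are not part of the original script.

```python
import sqlite3

def refresh_table(conn, target, source):
    """Empty `target`, reload it from `source`, and fail loudly if nothing arrived.

    Hypothetical helper: table names are trusted identifiers, not user input.
    """
    with conn:  # one transaction: an exception rolls back the delete as well
        conn.execute(f"DELETE FROM {target}")  # stand-in for TRUNCATE
        conn.execute(f"INSERT INTO {target} SELECT * FROM {source}")
        loaded = conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
        if loaded == 0:
            raise RuntimeError(f"{target} is empty after refresh")
    return loaded

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE remote_timesheet (timesheet_id INTEGER, name00 TEXT);
    CREATE TABLE timesheet (timesheet_id INTEGER, name00 TEXT);
    INSERT INTO remote_timesheet VALUES (1, 'a'), (2, 'b');
    INSERT INTO timesheet VALUES (99, 'stale');
""")

loaded = refresh_table(conn, "timesheet", "remote_timesheet")
print(loaded)
```

Because the local table is never dropped, a load that returns nothing leaves the previous night's data in place and raises an error instead of reporting success. In Db2 the transactional behaviour of TRUNCATE differs from this SQLite sketch, so check it before relying on a rollback.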
As you say, you may not be doing things in the most efficient way. You could consider using cache tables to take a periodic copy of your remote data: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_11.5.0/com.ibm.data.fluidquery.doc/topics/iiyvfed_tuning_cachetbls.html

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called
ROW CHANGE TIMESTAMP
This is managed by Db2 and can be defined as IMPLICITLY HIDDEN, so existing SELECT * FROM queries will not retrieve the new column, which would otherwise cause extra cost.
Check out the Db2 CREATE TABLE documentation.
This functionality was originally added for optimistic locking but can be used for situations like this as well.
There is a similar concept in Db2 for z/OS, but you will have to check that out yourself as I have not tried it.
Of course there are other ways to solve this, such as replication.
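Once a change timestamp exists, the periodic export reduces to a watermark query. The sketch below uses Python's sqlite3 module as a stand-in. SQLite has no ROW CHANGE TIMESTAMP, so the changed_at values are set by hand here, whereas Db2 would maintain them itself. The table, columns and the export_changes helper are hypothetical.

```python
import sqlite3

def export_changes(conn, last_exported):
    """Return rows changed after `last_exported`, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, payload, changed_at FROM orders "
        "WHERE changed_at > ? ORDER BY changed_at",
        (last_exported,),
    ).fetchall()
    # Keep the old watermark when nothing changed.
    new_watermark = rows[-1][2] if rows else last_exported
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, changed_at TEXT);
    INSERT INTO orders VALUES
        (1, 'old',     '2024-01-01 08:00:00'),
        (2, 'updated', '2024-01-02 09:30:00'),
        (3, 'new',     '2024-01-03 10:15:00');
""")

changed, watermark = export_changes(conn, "2024-01-01 23:59:59")
print([row[0] for row in changed], watermark)
```

Store the returned watermark after each run and pass it to the next one, so each export picks up only the rows inserted or updated since the previous export.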
That is not possible if you do not have a timestamp column. With a timestamp, you can tell which rows are new or modified.
You can also use the Time Travel feature to get the new values, but that also implies a timestamp column.
Another option is to put the tables in append mode and then fetch the rows after a given one. However, this option is not reliable after a reorg, and it affects performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs with the db2ReadLog API, but that requires development work. Simply applying the archived logs to the new database is also possible; however, that database will remain in rollforward pending state.

An alternative design for insert/update in Talend

I have a requirement in Talend where I have to update/insert rows from a source table into a destination table. The source and destination tables are identical. The source gets refreshed by a business process, and I need to update/insert these results in the destination table.
I designed this with 'insert or update' in tMap and tMysqlOutput. However, the job turns out to be super slow.
As an alternative to the above solution, I am trying to design the insert and the update separately. To do this, I want to hash the source rows, as the number of rows would usually be small.
So, my question: I will hash the input rows, but when I join them with the destination rows in tMap, should I hash the destination rows as well? Or should I use the destination rows as they are and then join them?
Any suggestions on the job design here?
Thanks
Rathi
If you are using the same database, you should not use ETL loading techniques but ELT loading, so that all processing happens in the database. Talend offers a few ELT components which are a bit different to use but very helpful in this case. I've seen jobs speed up by orders of magnitude using only those components.
It is still a good idea to use an indexed hash field in both the source and the target, in the same way as when loading Satellites in the Data Vault 2.0 model.
Alternatively, if you have direct access to the source table database, you could consider adding triggers for C(R)UD scenarios. Doing this, every action on the source database could be reflected in your database immediately. Remember though that you might need to think about a buffer table ("staging") where you could store your changes so that you are able to ingest fast, process later. In this table only the changed rows and the change type (create, update, delete) would be present for you to process. This decouples loading and processing which can be helpful if there will be a problem with loading or processing later on.
Yes, I believe you should use a hash component for the destination table as well,
because then your processing (lookup) will be very fast, as it happens in memory.
If not, the lookup load may take more time.

SSIS or TSQL for SQL/MySQL table comparison

I am new to SSIS and am after some assistance in creating an SSIS package to do a specific task. My data is stored remotely in a MySQL database and is downloaded to a SQL Server 2014 database. What I want to do is create a package where I can enter 2 dates that are compared against the create date/date modified of each record on a number of tables, to give me a snapshot comparing the MySQL data to the SQL Server data, so that I can see whether any rows are missing from my local SQL Server database or need to be updated. Some tables have no dates, so for those I just want a record count of what is missing, if anything, between the 2. If this is better achieved through TSQL, I am happy to hear other suggestions or about sites where something similar has been done.
In relation to your query, Tab:
"Hi Tab, What happens at the moment is our master data is stored in a MySQL database; the data was then downloaded to a SQL Server database as a one-off. I have an SSIS package that uses the MAX ID, which can be found on most of the tables, to work out which records are new, and it just downloads or updates them. What I want to do is run separate checks on the tables to make sure that nothing has been missed during the download and everything is in sync. In an ideal world I would like to pass a date range, say a calendar week, to an SSIS package or TSQL stored procedure, which would then check for any differences between the remote MySQL database tables and the local SQL Server tables. It does not currently have to do anything but identify issues; correcting them may come later, or changes would need to be made to the existing sync package. Hope this makes more sense."
Thanks P
To do this, you need to implement a Type 1 Slowly Changing Dimension type data flow in SSIS. There are a number of ways to do this, including a built in transformation aptly called the Slowly Changing Dimension transformation. Whilst this is easy to set up, it is a pain to maintain and it runs horrendously slowly.
There are numerous ways to set this up using other transformations or even SQL merge statements which are detailed here: https://bennyaustin.wordpress.com/2010/05/29/alternatives-to-ssis-scd-wizard-component/
I would recommend that you use Lookup transformations, as they perform better than the Slowly Changing Dimension transformation and offer better diagnostics and error handling than the better-performing SQL merge statement.
Before you do this you will need to add a Checksum or Hashbytes column to your SQL data for ease of comparison with the incoming MySQL data.
In short, calculate some sort of repeatable checksum as the data is downloaded into your SQL Server, then use this in an SSIS Lookup, matching on the row key, to check for changes. Where the checksum value is different for the same row it needs updating and where there is no matching row key in your SQL Data you need to insert the new row.
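The checksum comparison described above can be sketched outside SSIS in plain Python. This only illustrates the matching logic, not the SSIS Lookup itself. The sample rows, the column names and the helpers row_checksum and classify are hypothetical, and SHA-256 stands in for whatever Checksum or Hashbytes choice you make in SQL Server.

```python
import hashlib

def row_checksum(row, columns):
    """Repeatable checksum over the non-key columns, in a fixed order.

    Note: NULL (None) and an empty string hash the same in this sketch.
    """
    joined = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def classify(source_rows, target_rows, key, columns):
    """Split source rows into inserts and updates by key lookup and checksum."""
    target_sums = {r[key]: row_checksum(r, columns) for r in target_rows}
    inserts, updates = [], []
    for row in source_rows:
        existing = target_sums.get(row[key])
        if existing is None:
            inserts.append(row)          # no matching row key: new row
        elif existing != row_checksum(row, columns):
            updates.append(row)          # same key, different checksum
    return inserts, updates

mysql_rows = [
    {"id": 1, "name": "Ann", "city": "Leeds"},
    {"id": 2, "name": "Bob", "city": "York"},
    {"id": 3, "name": "Cat", "city": "Bath"},
]
sqlserver_rows = [
    {"id": 1, "name": "Ann", "city": "Leeds"},
    {"id": 2, "name": "Bob", "city": "Hull"},
]

inserts, updates = classify(mysql_rows, sqlserver_rows, "id", ["name", "city"])
print([r["id"] for r in inserts], [r["id"] for r in updates])
```

The same comparison also works for a pure "identify issues" check: report the two lists instead of applying them.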

SSIS for table-to-table inserts vs. (SQL only) INSERT INTO () SELECT FROM approach

I am currently transferring a large amount of records from one table to another, summarizing in the process. So, I have a SQL in this general format:
INSERT INTO TargetTable
(Col1,
Col2,
...
ColX,
Tot
)
SELECT
Col1,
Col2,
...
ColX,
SUM(TOT)
FROM
SourceTable
GROUP BY
Col1,
Col2,
...
ColX
Is there any performance advantage of moving this SQL into an SSIS task when transferring records from one table to another using a SQL SELECT as a source? For example, is logging turned off?
Secondary question: Are there any tactics that I could use to ensure a maximum transfer rate? For example, removing indexes from the Target table before inserting, locking the table, etc?
In my experience (and, bear in mind that it's been a year and change since I've done this), the only advantage you'd get from SSIS is its ability to make use of the bulk insert task. This adds an additional step, requiring you to export your source data to a flat file before you begin the import process.
Alternatively, if you stick with a SQL statement, the section in this article titled Using INSERT INTO…SELECT to Bulk Import Data with Minimal Logging provides the following suggestions:
You can use INSERT INTO SELECT FROM to efficiently transfer a large number of rows from one table, such as a staging table, to another table with minimal logging. Minimal logging can improve the performance of the statement and reduce the possibility of the operation filling the available transaction log space during the transaction.
Minimal logging for this statement has the following requirements:
The recovery model of the database is set to simple or bulk-logged.
The target table is an empty or nonempty heap.
The target table is not used in replication.
The TABLOCK hint is specified for the target table.
I personally dislike SSIS packages for a particular reason: I have never had a DBA who was dedicated to maintaining them. The data import projects I worked on required a lot of fiddling, as the source data wasn't clean (which I assume won't be a problem for you), so I had many packages that worked just fine in a testing environment with a limited data sample that crashed immediately when deployed into production, which made the process a pain in the neck to deal with.
This is just my opinion, but I would say that unless you or someone else you work with focuses on SSIS packages as a part of database maintenance, then it's easier to maintain and document a process that lives inside a stored procedure.
Set the recovery model to simple. Set the log size high enough to handle the insert. Are others on the system? A table lock will help the insert: TargetTable with (tablock). If you have a clustered index on TargetTable, order the data that way in the select. If you can accept dirty reads, use SourceTable with (nolock). If you are inserting more than 100,000 records, you might want to break up the insert using a where clause.
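Breaking the insert up with a where clause could look roughly like the sketch below. It uses Python's sqlite3 module as a stand-in for SQL Server, so the tablock and nolock hints do not appear. The tables, the single grouping column col1, the batch size and the helper copy_in_batches are assumptions for illustration. The ranges are cut on the grouping column so that each group falls entirely within one batch.

```python
import sqlite3

def copy_in_batches(conn, batch_size):
    """Summarise source into target one key range at a time, committing per range."""
    low, high = conn.execute("SELECT MIN(col1), MAX(col1) FROM source").fetchone()
    if low is None:  # empty source table
        return 0
    batches = 0
    for start in range(low, high + 1, batch_size):
        with conn:  # each batch is its own transaction
            conn.execute(
                "INSERT INTO target (col1, tot) "
                "SELECT col1, SUM(tot) FROM source "
                "WHERE col1 >= ? AND col1 < ? GROUP BY col1",
                (start, start + batch_size),
            )
        batches += 1
    return batches

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (col1 INTEGER, tot INTEGER);
    CREATE TABLE target (col1 INTEGER PRIMARY KEY, tot INTEGER);
""")
conn.executemany("INSERT INTO source VALUES (?, ?)", [(k % 10, k) for k in range(100)])
conn.commit()

batches = copy_in_batches(conn, batch_size=4)
print(batches)
```

Smaller batches keep each transaction, and therefore the log space it needs, bounded, at the cost of more statements.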