I am trying to fetch around 100,000 records from an Oracle DB server. Because the default fetch size of the ojdbc driver is 10, processing these records takes far too long. I tried to call the setFetchSize API of ResultSet, but Talend Open Studio doesn't let me make this code change. What can I do to solve this problem?
In the advanced settings of the input components there is a "Use cursor" / "Cursor size" option. The name is misleading, but it does exactly what you are looking for: it sets the fetch size. I usually go with 100.
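Under the hood that option amounts to the standard JDBC fetch size. For comparison, a minimal sketch of the same thing in plain JDBC; the connection details, query and table name are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- adjust host, service name and credentials.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCLPDB1", "user", "password");
             PreparedStatement ps = conn.prepareStatement("SELECT id, name FROM big_table")) {

            // The ojdbc default fetch size is 10; raising it cuts the number of round trips.
            ps.setFetchSize(100);

            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // process the row, e.g. rs.getLong("id"), rs.getString("name")
                }
            }
        }
    }
}
```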
I need to process millions of records coming from MongoDB and build an ETL pipeline to insert that data into a PostgreSQL database. However, with every method I've tried I keep getting an out-of-memory (heap space) exception. Here's what I've already tried:
Tried connecting to MongoDB using tMongoDBInput, adding a tMap to process the records, and outputting them through a connection to PostgreSQL. tMap could not handle it.
Tried loading the data into a JSON file first and then reading from the file into PostgreSQL. The data got loaded into the JSON file, but reading it back produced the same memory exception.
Tried increasing the RAM for the job in the settings and then tried the above two methods again; still no change.
I specifically wanted to know if there's any way to stream this data or process it in batches to counter the memory issue.
Also, I know there are some components dealing with BulkDataLoad. Could anyone please confirm whether that would be helpful here, given that I want to process the records before inserting them? If yes, please point me to the right documentation to get that set up.
Thanks in advance!
As you have already tried all the obvious possibilities, the only way I can see to meet this requirement is breaking the job down into multiple sub-jobs, or going with an incremental load based on key columns or date columns, treating this as a one-time activity for now.
Please let me know if it helps.
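If splitting into sub-jobs is not enough, another option is a small hand-rolled transfer (for example in a standalone program or a tJavaFlex) that streams the MongoDB cursor and writes to PostgreSQL in batches, so only one batch is ever held in memory. A rough sketch, assuming the MongoDB sync driver and the PostgreSQL JDBC driver are on the classpath; database, collection, table and column names are placeholders:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MongoToPostgresBatch {
    public static void main(String[] args) throws Exception {
        final int BATCH_SIZE = 1000; // tune to the available heap

        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
             Connection pg = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/target_db", "user", "password")) {

            pg.setAutoCommit(false);
            MongoCollection<Document> source =
                    mongo.getDatabase("source_db").getCollection("source_collection");

            String sql = "INSERT INTO target_table (id, name) VALUES (?, ?)";
            try (PreparedStatement ps = pg.prepareStatement(sql)) {
                int pending = 0;
                // The cursor streams documents; nothing beyond one batch is kept in memory.
                for (Document doc : source.find().batchSize(BATCH_SIZE)) {
                    ps.setString(1, doc.getObjectId("_id").toHexString()); // assumes ObjectId keys
                    ps.setString(2, doc.getString("name"));                // transform/clean here as needed
                    ps.addBatch();
                    if (++pending == BATCH_SIZE) {
                        ps.executeBatch();
                        pg.commit();
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    ps.executeBatch();
                    pg.commit();
                }
            }
        }
    }
}
```

Any per-record transformation you would otherwise do in tMap can go where the comments indicate.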
I have a job in Talend Open Studio which is working fine; it connects a tMSSqlInput to a tMap and then a tMysqlOutput, very straightforward. My problem is that I need this job to run on a daily basis, but it should only process records that have been newly created or modified... any help is highly appreciated!
It seems that you are searching for a Change Data Capture tool for Talend.
Unfortunately it is only available in the licensed product.
There are several ways to implement this. I want to show the most popular ones.
CDC from Talend
As Corentin correctly said, you can use CDC (Change Data Capture) from Talend if you are on the subscription version.
CDC of MSSQL
Alternatively, you can check whether you can activate or use CDC in your MSSQL server. This depends on your license. If it is possible, you can use that feature to identify new and changed rows and process them.
Triggers
You can also create triggers on your database (if you have access to it). For example, a trigger covering INSERT, UPDATE and DELETE would let you capture the deltas; you could then store those records, or just their IDs, in a separate table.
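As an illustration of the trigger approach, here is a hedged sketch that installs an audit trigger via JDBC. The T-SQL, the source table dbo.orders, its key column and the delta table dbo.orders_delta (assumed to already exist) are placeholders, not your actual schema:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateDeltaTrigger {
    public static void main(String[] args) throws Exception {
        // Illustrative T-SQL: every INSERT/UPDATE/DELETE on dbo.orders writes a row into dbo.orders_delta.
        String ddl = """
            CREATE TRIGGER dbo.trg_orders_delta
            ON dbo.orders
            AFTER INSERT, UPDATE, DELETE
            AS
            BEGIN
              SET NOCOUNT ON;
              INSERT INTO dbo.orders_delta (order_id, operation, changed_at)
              SELECT COALESCE(i.order_id, d.order_id),
                     CASE WHEN i.order_id IS NULL THEN 'D'
                          WHEN d.order_id IS NULL THEN 'I'
                          ELSE 'U' END,
                     SYSUTCDATETIME()
              FROM inserted i
              FULL OUTER JOIN deleted d ON i.order_id = d.order_id;
            END
            """;

        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://dbhost:1433;databaseName=source_db", "user", "password");
             Statement st = conn.createStatement()) {
            st.execute(ddl);  // the daily Talend job can then read dbo.orders_delta instead of the full table
        }
    }
}
```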
Software driven / API
If your database is connected to an application and you have developers around, you could ask for a service that identifies records on insert/update/delete and exposes them to you, e.g. via a REST interface.
Delta via ID
If the primary key is an auto-increment ID, you can also check your MySQL target table for the largest value and SELECT from the source only the rows with a bigger ID than you already have. This of course depends on the database layout.
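A minimal sketch of that delta check in plain JDBC, assuming an auto-increment integer key named id; connection details, table and column names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DeltaById {
    public static void main(String[] args) throws Exception {
        try (Connection mysql = DriverManager.getConnection(
                     "jdbc:mysql://targethost:3306/target_db", "user", "password");
             Connection mssql = DriverManager.getConnection(
                     "jdbc:sqlserver://sourcehost:1433;databaseName=source_db", "user", "password")) {

            // 1. Find the highest id already copied to MySQL.
            long maxId = 0;
            try (Statement st = mysql.createStatement();
                 ResultSet rs = st.executeQuery("SELECT COALESCE(MAX(id), 0) FROM target_table")) {
                if (rs.next()) {
                    maxId = rs.getLong(1);
                }
            }

            // 2. Read only the newer rows from the MSSQL source.
            try (PreparedStatement ps = mssql.prepareStatement(
                    "SELECT id, name FROM source_table WHERE id > ? ORDER BY id")) {
                ps.setLong(1, maxId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // hand each row to the existing tMap/tMysqlOutput logic,
                        // or insert it into MySQL here
                    }
                }
            }
        }
    }
}
```

Note that this only picks up new rows; modified rows would additionally need a last-modified timestamp column or one of the CDC options above.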
We have many Talend jobs that transfer data from Oracle (tOracleInput) to Redshift (tRedshiftOutputBulkExec). I would like to store the result information in a DB table. For example:
Job name, start time, running time, rows loaded, successful or failed
I know that if I turn on log4j, most of that information can be derived from the log. However, saving it into a DB table would make it easy to check and report the results.
I'm most interested in the rows loaded. I checked this link http://www.talendbyexample.com/talend-logs-and-errors-component-reference.html and the manual of tRedshiftOutputBulkExec; neither gives me that information.
Does Talend Administration Center provide such a function? What is the best way to implement this?
Thanks,
After looking at the URL you provided, tLogCatcher should provide you with what you need (minus the rows loaded, which you can get with a lookup).
I started with Talend Studio version 6.4.1. There you can enable "Stats & Logs" for a job; it can log to the console, to files, or to a database. When writing to a DB you set the JDBC parameters and the names of three tables:
Stats Table: stores start and end timestamps of the job
Logs Table: stores error messages
Meter Table: stores the count of rows for each monitored flow
They correspond to the components tStatCatcher, tLogCatcher, tFlowMeterCatcher, where you can find the needed table schema.
To monitor a flow, select it, open the "Component" tab and tick the checkbox "Monitor this connection".
To see the logged values you can use the AMC (Activity Monitoring Console) in Studio or TAC.
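If you would rather end up with a single summary row per job run (job name, start time, running time, rows loaded, status) instead of the three AMC tables, here is a minimal sketch of such an audit insert. The job_audit table, its columns and the whole approach are my own assumptions, not a Talend schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class JobAuditLogger {
    // Hypothetical audit table:
    // CREATE TABLE job_audit (job_name VARCHAR(100), start_time TIMESTAMP,
    //                         duration_ms BIGINT, rows_loaded BIGINT, status VARCHAR(20));
    public static void logRun(Connection conn, String jobName, Timestamp startTime,
                              long durationMs, long rowsLoaded, boolean success) throws SQLException {
        String sql = "INSERT INTO job_audit (job_name, start_time, duration_ms, rows_loaded, status) "
                   + "VALUES (?, ?, ?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, jobName);
            ps.setTimestamp(2, startTime);
            ps.setLong(3, durationMs);
            ps.setLong(4, rowsLoaded);
            ps.setString(5, success ? "SUCCESS" : "FAILED");
            ps.executeUpdate();
        }
    }
}
```

In a Talend job you would typically call something like this from a tJava step at the end of the job and feed in the row count reported by the output component; treat that wiring as an assumption about your job layout.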
After going through similar questions on Stack Overflow, I am unable to find a way to export a large result set to a CSV file from a query run in MySQL Workbench (v 5.2).
The query returns about 4 million rows and 8 columns (roughly 300 MB when exported as a CSV file).
Currently I load all the rows (i.e. view them in the GUI) and then use the export option. This crashes my machine most of the time.
My constraints are:
I am not looking for a solution via bash terminal.
I need to export it to the client machine and not the database server.
Is this a drawback of MySQL Workbench?
How do I export all the rows into a single file without viewing them in the GUI?
There is a similar question I found, but the answers don't meet my constraints:
"Exporting query results in MySQL Workbench beyond 1000 records"
Thanks.
In order to export to CSV you first have to load all that data, which is a lot to have in a GUI; many controls are simply not made to carry that much data. So your best bet is to avoid the GUI as much as possible.
One way could be to run your query with the output sent to a text window (see the Query menu). This is not CSV, but it should at least work. You can then try to copy the text into a spreadsheet and convert it to CSV.
If that is too much work, try splitting your rows into ranges, say 1 million each, using the LIMIT clause on your query. Lower the size until you reach one that MySQL Workbench can handle. You will get n CSV files that you have to concatenate afterwards. A small application or (depending on your OS) a system tool should be able to strip the headers and concatenate the files into one.
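For the concatenation step, a small program along these lines keeps the header from the first chunk only and appends the rest into one file; the chunk file names are placeholders:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConcatCsvChunks {
    public static void main(String[] args) throws Exception {
        // Placeholder names: export_1.csv ... export_4.csv produced with LIMIT ranges.
        String[] chunks = {"export_1.csv", "export_2.csv", "export_3.csv", "export_4.csv"};
        Path output = Paths.get("export_full.csv");

        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            boolean firstFile = true;
            for (String chunk : chunks) {
                try (BufferedReader in = Files.newBufferedReader(Paths.get(chunk), StandardCharsets.UTF_8)) {
                    String line = in.readLine();           // header line of this chunk
                    if (firstFile && line != null) {
                        out.write(line);                   // keep the header only once
                        out.newLine();
                    }
                    while ((line = in.readLine()) != null) {
                        out.write(line);
                        out.newLine();
                    }
                }
                firstFile = false;
            }
        }
    }
}
```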
I need to populate a table from a master table that has 2 billion records. The insert needs to satisfy some conditions, and some of the columns have to be calculated before the rows are inserted.
I have two options, but I don't know which one to follow for better performance.
Option 1
Open a cursor that filters the master table with the conditions, fetch the records one by one, do the calculation, and then insert each row into the child table.
Option 2
Insert first, using an INSERT ... SELECT with the conditions, and then do the calculation with an UPDATE statement.
Please assist.
Having a cursor fetch the data, performing the calculation, and then inserting into the database will be time consuming, since it involves a database round trip and I/O for each retrieval and insertion (on both databases).
Databases are usually better at bulk operations, so Option 2 will definitely give you better performance. Option 2 is also better for troubleshooting (as the process is cleanly separated: step 1, insert; step 2, calculate/update) than Option 1, where an error in the middle of the process forces you to redo all the steps again.
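For reference, a sketch of what Option 2 can look like as two set-based statements fired from JDBC; the DB2 URL, table names, filter and calculation are all placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SetBasedLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:db2://dbhost:50000/MYDB", "user", "password");
             Statement st = conn.createStatement()) {

            // Step 1: insert only the rows that satisfy the filter conditions.
            st.executeUpdate(
                "INSERT INTO child_table (id, amount, derived_value) "
              + "SELECT id, amount, NULL FROM master_table WHERE status = 'ACTIVE'");

            // Step 2: fill the calculated column with one set-based UPDATE.
            st.executeUpdate(
                "UPDATE child_table SET derived_value = amount * 1.1 WHERE derived_value IS NULL");
        }
    }
}
```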
Opening a cursor and inserting records one by one can have serious performance issues at volumes on the order of a billion rows, especially if there is a slow network between your database tier and your application tier. The fastest way could be to use the DB2 EXPORT utility to download the data, let a program manipulate the data in the file, and then LOAD the file back into the child table. Apart from the file-based option you can also consider the following approaches:
1) Write an SQL stored procedure (no need to ship the data out of the database to make the changes).
2) If you are using Java/JDBC, use the batch update feature to insert or update multiple records in one round trip (see the sketch after this list).
3) If you are using a tool like Informatica, turn on its bulk load feature.
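A minimal sketch of the JDBC batch approach from item 2, for the case where the calculation has to happen in application code; connection details, table and column names and the calculation itself are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BatchInsertCalculated {
    public static void main(String[] args) throws Exception {
        final int BATCH_SIZE = 5000;

        // Separate read and write connections so committing batches on the target
        // does not close the streaming cursor on the source.
        try (Connection src = DriverManager.getConnection(
                     "jdbc:db2://dbhost:50000/MYDB", "user", "password");
             Connection dst = DriverManager.getConnection(
                     "jdbc:db2://dbhost:50000/MYDB", "user", "password");
             Statement read = src.createStatement();
             PreparedStatement write = dst.prepareStatement(
                     "INSERT INTO child_table (id, derived_value) VALUES (?, ?)")) {

            dst.setAutoCommit(false);
            read.setFetchSize(10000);   // stream source rows instead of buffering them all

            int pending = 0;
            try (ResultSet rs = read.executeQuery(
                    "SELECT id, amount FROM master_table WHERE status = 'ACTIVE'")) {
                while (rs.next()) {
                    write.setLong(1, rs.getLong("id"));
                    write.setDouble(2, rs.getDouble("amount") * 1.1);  // placeholder calculation
                    write.addBatch();
                    if (++pending == BATCH_SIZE) {
                        write.executeBatch();   // one round trip for the whole batch
                        dst.commit();
                        pending = 0;
                    }
                }
            }
            if (pending > 0) {
                write.executeBatch();
                dst.commit();
            }
        }
    }
}
```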
Also see the IBM developerWorks article on improving insert performance. The article is a bit older, but the concepts are still valid: http://www.ibm.com/developerworks/data/library/tips/dm-0403wilkins/