How to get missing data from Db2 in DataStage?

I have a job that runs every day and, based on the column 'modifyts', pulls delta records from Db2 with the condition modifyts > current_date - 1.
Recently we found that some of the data is being missed and not loaded into our Netezza target table.
Is there a way we can recover the missing data?
For now, we are planning to load the past 3 days with the condition modifyts > current_date - 3, but that could cause data errors in other processes. Is there a more efficient way to achieve this?
Any suggestions would greatly help. Thanks in advance.
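For illustration only, a minimal JDBC sketch of the delta pull described above, with the lookback made a variable so the same query covers the normal 1-day delta and the planned 3-day catch-up. The connection details and table name are assumptions; in the actual job this predicate would sit in the DataStage Db2 connector's WHERE clause.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DeltaPull {
    public static void main(String[] args) throws Exception {
        int lookbackDays = 1; // 1 = normal daily delta; 3 = the planned catch-up window
        // Hypothetical Db2 connection and table name, for illustration only.
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://dbhost:50000/SRCDB", "user", "password");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT * FROM source_table WHERE modifyts > current_date - " + lookbackDays)) {
            while (rs.next()) {
                // Each row would be handed on to the Netezza load step.
                System.out.println(rs.getTimestamp("modifyts"));
            }
        }
    }
}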

Related

Can someone explain how to use tDbBulkExec for Db2 in Talend?

I need to bulk load from CSV into a Db2 database. I tried tFileInputDelimited, but it took 7 hours for 199 million rows. Can someone explain how to use it?
Is the template I am using correct?

How to process and insert millions of MongoDB records into Postgres using Talend Open Studio

I need to process millions of records coming from MongoDB and build an ETL pipeline to insert that data into a PostgreSQL database. However, with all the methods I've tried, I keep getting an out-of-memory heap space exception. Here's what I've already tried:
Tried connecting to MongoDB using tMongoDBInput with a tMap to process the records and output them through a PostgreSQL connection. tMap could not handle it.
Tried loading the data into a JSON file and then reading from the file into PostgreSQL. The data got loaded into the JSON file, but reading it back hit the same memory exception.
Tried increasing the RAM for the job in the settings and running the above two methods again; still no change.
I specifically wanted to know if there's any way to stream this data or process it in batches to counter the memory issue.
Also, I know there are some components dealing with BulkDataLoad. Could anyone please confirm whether that would be helpful here, since I want to process the records before inserting, and if so, point me to the right kind of documentation to get that set up.
Thanks in advance!
As you have already tried all the possibilities, the only way I can see to meet this requirement is to break the job down into multiple sub-jobs, or to go with an incremental load based on key columns or date columns, considering this as a one-time activity for now.
Please let me know if it helps.
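Outside Talend, for illustration, a minimal Java sketch of the cursor-plus-batch idea behind that suggestion: read the MongoDB collection through a cursor and flush rows to PostgreSQL in fixed-size JDBC batches, so only one batch is ever held in memory. The connection strings, collection, target table, and column mapping are assumptions.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MongoToPostgresBatch {
    public static void main(String[] args) throws Exception {
        int batchSize = 1000;
        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
             Connection pg = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/dwh", "user", "password")) {
            MongoCollection<Document> source =
                    mongo.getDatabase("sourcedb").getCollection("records");
            pg.setAutoCommit(false);
            try (PreparedStatement insert = pg.prepareStatement(
                         "INSERT INTO target_table (id, payload) VALUES (?, ?)");
                 MongoCursor<Document> cursor =
                         source.find().batchSize(batchSize).iterator()) {
                int pending = 0;
                while (cursor.hasNext()) {
                    Document doc = cursor.next();
                    // "Process" step: transform the document as needed before inserting.
                    insert.setString(1, doc.getObjectId("_id").toHexString()); // assumes an ObjectId _id
                    insert.setString(2, doc.toJson());
                    insert.addBatch();
                    if (++pending == batchSize) {
                        insert.executeBatch(); // flush one batch to PostgreSQL
                        pg.commit();
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    insert.executeBatch();
                    pg.commit();
                }
            }
        }
    }
}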

How to copy a database to SAP HANA from PostgreSQL with Talend?

Well, my problem is: how could I copy a database with Talend from PostgreSQL to SAP HANA without needing to write a job for every table?
The reason is that it could take a long time to prepare all those jobs, considering there are at least 200 tables, each with at least 30 columns.
I tried the tTransferDatabase plugin, but I can't get it to transfer to SAP HANA; it gives me an error that it can't copy the schema (while it worked successfully when copying to another PostgreSQL database), and I am sure the schema names are right.
Here is the error:
Exception in component tTransferDatabase_1
java.lang.NullPointerException
at org.apache.ddlutils.PlatformFactory.createNewPlatformInstance(PlatformFactory.java:86)
at org.apache.ddlutils.PlatformFactory.createNewPlatformInstance(PlatformFactory.java:124)
at com.devjpcb.transferdatabase.TransferDatabase.getPlatformDestine(TransferDatabase.java:179)
at com.devjpcb.transferdatabase.TransferDatabase.copySchemaToDatabase(TransferDatabase.java:249)
at local_project.aaasa_0_1.aaasa.tTransferDatabase_1Process(aaasa.java:836)
at local_project.aaasa_0_1.aaasa.runJobInTOS(aaasa.java:1130)
at local_project.aaasa_0_1.aaasa.main(aaasa.java:951)
Is there maybe a way to do something like: for each table in the connection, guess the table schema, copy the columns from the table to the other side of a tMap, and run?
Any advice would be helpful ;) Thank you!
With some work, you could use the example job created by rbaldwin on Talend Exchange; note that it starts with files, not a database. But you could easily create a job that loops through all your database tables and does an extract to file, to then use as the starting point.
Another option is Bekwam's solution.
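For illustration, a rough Java/JDBC sketch of the "loop through all your database tables and extract to file" idea from the answer above; the resulting files could then feed a single generic load job on the SAP HANA side. The connection details, schema name, and the naive CSV writing (no quoting or escaping) are assumptions.

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ExtractAllTables {
    public static void main(String[] args) throws Exception {
        // Hypothetical PostgreSQL source connection; credentials are placeholders.
        try (Connection pg = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/sourcedb", "user", "password")) {
            DatabaseMetaData meta = pg.getMetaData();
            // Enumerate every table in the schema instead of hand-building 200 jobs.
            try (ResultSet tables = meta.getTables(null, "public", "%", new String[]{"TABLE"})) {
                while (tables.next()) {
                    exportTableToCsv(pg, tables.getString("TABLE_NAME"));
                }
            }
        }
    }

    private static void exportTableToCsv(Connection con, String table) throws Exception {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT * FROM public." + table);
             PrintWriter out = new PrintWriter(table + ".csv")) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int i = 1; i <= cols; i++) {
                    if (i > 1) line.append(',');
                    line.append(rs.getString(i)); // naive CSV: no quoting or escaping
                }
                out.println(line);
            }
        }
    }
}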

BigQuery streaming insert using template tables data availability issue

We have been using BigQuery for over a year now with no issues. We load data as batch jobs every few hours and it usually is instantly available.
We just started experimenting with streaming inserts using template tables. With our first test, we saw no errors and the data showed up instantly. The test created approximately 120 tables. A simple SELECT COUNT (using the web UI) across the tables came up with the right total of ~8,000 rows. After a couple more hours of streaming, the total dropped to ~1,400 rows.
Unsure about what happened, we dropped the dataset, recreated the template table and re-ran the streaming. This time around, the tables showed up right away but the data did not. On our third attempt the tables themselves did not show up for more than a couple of hours. We are on the fourth attempt and this time we only streamed data belonging to one table. The table showed up right away, but it has been over an hour and the data does not show up.
The streaming service uses the latest Java client library, inserts only one record at a time, and logs the response. When there is no exception, the response is always {"kind":"bigquery#tableDataInsertAllResponse"}, with no errors.
Any help trying to understand what is happening would be great. Thanks.
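For reference, a minimal sketch of a single-row streaming insert with a template suffix, shown here with the current google-cloud-bigquery Java client for illustration (which may differ from the client the original setup used); the dataset, template table, suffix, and row fields are assumptions. Note that per-row errors come back inside the response rather than as exceptions, so checking hasErrors() matters even when no exception is thrown.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.Map;

public class StreamOneRow {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Hypothetical dataset and template table; the suffix selects the per-entity table.
        TableId template = TableId.of("my_dataset", "events_template");
        InsertAllRequest request = InsertAllRequest.newBuilder(template)
                .setTemplateSuffix("_entity_042") // the target table is created from the template schema
                .addRow(Map.of("event_id", "abc123", // illustrative row fields
                        "event_ts", "2016-01-01 00:00:00"))
                .build();
        InsertAllResponse response = bigquery.insertAll(request);
        // Even without an exception, per-row errors can be reported in the response body.
        if (response.hasErrors()) {
            response.getInsertErrors().forEach(
                    (row, errors) -> System.err.println("Row " + row + ": " + errors));
        } else {
            System.out.println("Row accepted into the streaming buffer.");
        }
    }
}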
Looks like we've identified the issue. It appears there's a race in the template-tables path only that causes our system to think the first chunk of data was deleted by user action (table truncation, which it obviously wasn't), so that chunk gets dropped. We've identified the fix and will attempt to push it out shortly.
Thanks for letting us know!

How to extract data from a Mongo collection for data warehouse use

My company is starting to use Mongo, and we are beginning to think about the best way to extract data from MongoDB and send it to our data warehouse.
My question focuses on the extract part of the process. As I see it, the best way is to expose an API on the service built on top of Mongo, which the ETL process (invoked by a job from the data warehouse) will call with a specific query, probably filtering on a time window (i.e., a startdate and enddate for every record).
Does that sound right, or am I missing something, or is there a better way than that?
Initially I was thinking about running mongoexport every X interval, but according to the documentation it does not seem great performance-wise.
Thanks in advance!
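For illustration, a minimal MongoDB Java driver sketch of the kind of time-windowed query such an API (or the ETL job calling it) might run; the database, collection, field name, and window bounds are assumptions.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

import java.time.Instant;
import java.util.Date;

public class WindowedExtract {
    public static void main(String[] args) {
        try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> events =
                    mongo.getDatabase("appdb").getCollection("events");

            // The start/end of the extraction window that the warehouse job would pass in.
            Date startdate = Date.from(Instant.parse("2016-01-01T00:00:00Z"));
            Date enddate = Date.from(Instant.parse("2016-01-02T00:00:00Z"));

            // Pull only records whose (hypothetical) modified timestamp falls inside the window.
            Bson window = Filters.and(
                    Filters.gte("modifiedAt", startdate),
                    Filters.lt("modifiedAt", enddate));

            for (Document doc : events.find(window)) {
                // Each document would be handed to the ETL step that loads the warehouse.
                System.out.println(doc.toJson());
            }
        }
    }
}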
Give Pentaho Kettle a try.
https://anonymousbi.wordpress.com/2012/07/25/creating-pentaho-reports-from-mongodb/
I am using Alteryx Designer to extract from MongoDB with the dedicated connector and load my data into Tableau, with optional data prep in between.
Works pretty well!
Alteryx can write to most DBs, though...