Talend: some records are lost and not inserted into the database

I am very new to Talend and have been struggling with this for two weeks now.
I need to diagnose a problem with a job (see picture).
When I execute the job, the screen says 1000 rows were committed to the database, but the database contains fewer records.
Is there a way to tell why fewer records were committed to the database than the screen reported?
tmap:

After talking to a few colleagues, I got some interesting answers. To answer your question: you could add a reject link after tMysqlOutput to see whether any records are being rejected.
For example: tMysqlInput--main--tMap--out1--tMysqlOutput--reject--tLogRow
Hope this helps! T Data.

I think you should check the tMap join type for all lookups. With an inner join, main rows that have no match in the lookup are dropped, so changing the join type may give you the desired output. See the attached image.
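To illustrate why the join type matters, here is a minimal sketch in plain Python (not Talend) of how an inner join against a lookup drops unmatched rows, while a left outer join keeps them. The data, the field names, and the join helper are all hypothetical.

```python
# Hypothetical data: 5 main rows, but the lookup only covers 3 of the keys.
main_rows = [{"id": i, "customer_id": c}
             for i, c in enumerate([10, 11, 12, 13, 14], start=1)]
lookup = {10: "Alice", 11: "Bob", 12: "Carol"}


def join(rows, lookup, inner):
    """Join rows against lookup; with inner=True, unmatched rows are rejected."""
    out, rejects = [], []
    for row in rows:
        name = lookup.get(row["customer_id"])
        if name is None and inner:
            rejects.append(row)  # these rows never reach the output
            continue
        out.append({**row, "name": name})
    return out, rejects


inner_out, inner_rejects = join(main_rows, lookup, inner=True)
outer_out, outer_rejects = join(main_rows, lookup, inner=False)

print(len(inner_out), len(inner_rejects))  # 3 2
print(len(outer_out), len(outer_rejects))  # 5 0
```

If the number of rows read differs from the number that arrives in the database, counting the rejects at each step in this way shows where they went missing.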

Related

How to process and insert millions of MongoDB records into Postgres using Talend Open Studio

I need to process millions of records coming from MongoDB and build an ETL pipeline to insert that data into a PostgreSQL database. However, with every method I've tried, I keep getting an out-of-memory (heap space) exception. Here's what I've already tried:
Tried connecting to MongoDB using tMongoDBInput, processing the records with a tMap, and writing them out through a connection to PostgreSQL. tMap could not handle it.
Tried loading the data into a JSON file and then reading from that file into PostgreSQL. The data was written to the JSON file, but from there on I got the same memory exception.
Tried increasing the RAM for the job in the settings and repeating the two methods above; still no change.
Specifically, I want to know whether there is any way to stream this data or process it in batches to work around the memory issue.
Also, I know there are some components that deal with bulk data loading. Could anyone please confirm whether they would be helpful here, given that I want to process the records before inserting them? If so, please point me to the right documentation to get that set up.
Thanks in advance!
Since you have already tried these options, the only way I can see to meet this requirement is to break the job down into multiple sub-jobs, or to do an incremental load based on key columns or date columns, treating this as a one-time activity for now.
Please let me know if it helps.
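As an illustration of the batching idea (in plain Python, not Talend), here is a minimal sketch that pulls records lazily from a source, transforms them, and hands them on in fixed-size chunks so that only one chunk is in memory at a time. The source generator, the transform function, and the batch size of 1000 are assumptions made for this example; a real job would read from a database cursor and issue one bulk insert and commit per chunk.

```python
from itertools import islice


def batches(records, size):
    """Yield lists of at most `size` records without materialising the input."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def source(n):
    """Hypothetical stand-in for a database cursor that streams documents."""
    for i in range(n):
        yield {"_id": i, "value": i * 2}


def transform(rec):
    """Hypothetical per-record processing step."""
    return (rec["_id"], rec["value"] + 1)


inserted = 0
batch_sizes = []
for chunk in batches(source(2500), 1000):
    rows = [transform(r) for r in chunk]
    # A real job would bulk-insert `rows` and commit here.
    inserted += len(rows)
    batch_sizes.append(len(rows))

print(batch_sizes, inserted)  # [1000, 1000, 500] 2500
```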

Tableau Queries with JOINS and check for NULL are failing in ClickHouse

I am running Tableau connected to ClickHouse via the ODBC driver. At first, almost every report request was failing. I then configured this TDC file https://github.com/yandex/clickhouse-odbc/blob/clickhouse-tbc/clickhouse.tdc and it actually started to work. However, some query requests with JOINs that contain a check for NULL in the ON clause now fail because they use IS NULL instead of isNull(id):
JOIN users ON ((users.user_id = t0.user_id) OR ((users.user_id IS NULL) AND (t0.user_id IS NULL)))
This is the correct way that works:
JOIN users ON ((users.user_id = t0.user_id) OR ((isNull(users.user_id) = 1) AND (isNull(t0.user_id) = 1)))
How can I make the Tableau driver send the right request?
Here are a few suggestions:
This post on the Tableau Community describes symptoms similar to yours. The suggested resolution is to wrap all fields like this: IfNull([Dimension], ""), which apparently reduces the need for ClickHouse to check for nulls.
The TDC file from GitHub looks pretty complete, but the authors might not have taken joins into consideration. The GitHub commit states that the TDC is "untested." I would message the creator of that TDC to see whether they have done any work around joins and have any suggestions.
Here is a list of possible ODBC customizations that can be added to or removed from your TDC file. Finding the right combination may take some experimentation, but they are well worth researching as a possible solution.
Create an extract before performing complex analysis. If you're able to connect initially, then it should be possible to bring all the data from ClickHouse into an extract.
Custom SQL would probably alleviate any join syntax issue because the query and any joins are purely written by you. After making the initial connection to ClickHouse, instead of choosing a table, select "Custom ODBC" and write a query that will return the joined tables of your choosing.
Finally, the Tableau Ideas Forum is a place to ask for and/or vote on upcoming connectors. I can see there is already an idea in place for ClickHouse. Feel free to vote it up.
If you can make sure not to have any NULL values in the data, you can also use this proxy that I wrote for this exact problem.
https://github.com/kfzteile24/clickhouse-proxy
It kinda worked, for most cases, but it's not bullet-proof.
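As a rough illustration of the kind of rewrite discussed in this thread, here is a minimal Python sketch that turns a parenthesised `(column IS NULL)` check into the `(isNull(column) = 1)` form. It is not taken from the proxy linked above; the function name and the regular expression are assumptions, and the sketch ignores string literals, quoted identifiers, and nested expressions.

```python
import re

# Matches "(identifier IS NULL)" where the identifier may be qualified (a.b).
_IS_NULL = re.compile(r"\(\s*([A-Za-z_][\w.]*)\s+IS\s+NULL\s*\)", re.IGNORECASE)


def rewrite_is_null(sql):
    """Rewrite "(col IS NULL)" checks into "(isNull(col) = 1)"."""
    return _IS_NULL.sub(r"(isNull(\1) = 1)", sql)


original = ("JOIN users ON ((users.user_id = t0.user_id) OR "
            "((users.user_id IS NULL) AND (t0.user_id IS NULL)))")
rewritten = rewrite_is_null(original)
print(rewritten)
```

An `IS NOT NULL` check is left unchanged, because the pattern requires `NULL` to follow `IS` directly.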

DB2 updated rows since last check

I want to periodically export data from db2 and load it in another database for analysis.
In order to do this, I would need to know which rows have been inserted/updated since the last time I've exported things from a given table.
A simple solution would probably be to add a timestamp to every table and use that as a reference, but I don't have such a TS at the moment, and I would like to avoid adding it if possible.
Is there any other solution for finding the rows which have been added/updated after a given time (or something else that would solve my issue)?
There is an easy option for a timestamp in Db2 (for LUW) called
ROW CHANGE TIMESTAMP
This is managed by Db2 and can be defined as HIDDEN, so existing SELECT * FROM queries will not retrieve the new column, which would otherwise cause extra cost.
Check out the Db2 CREATE TABLE documentation
This functionality was originally added for optimistic locking but can be used for such situations as well.
There is a similar concept for Db2 z/OS - you have to check that out as I have not tried this one.
Of course, there are other ways to solve it, such as replication.
That is not possible if you do not have a timestamp column. With a timestamp, you can tell which rows are new or modified.
You can also use the Time Travel feature to get the new values, but that too requires a timestamp column.
Another option is to put the tables in append mode and then fetch the rows after a given one. However, this is not reliable after a reorg, and it affects performance and space utilisation.
One possible option is to use SQL replication, but that needs extra tables for staging.
Finally, another option is to read the logs with the db2ReadLog API, but that requires custom development. Alternatively, simply applying the archived logs to the new database is possible; however, that database will remain in roll-forward pending state.
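To make the timestamp-based approach concrete, here is a minimal sketch of a watermark export: each run selects only the rows changed after the last exported change marker and remembers the newest one it saw. It uses Python's built-in sqlite3 as a stand-in for Db2, and the orders table, the changed_at column (set by hand here, as plain integers), and the export_changes helper are all hypothetical. In Db2 the change column would be maintained by the database rather than by the application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER, changed_at INTEGER)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 20, 110), (3, 30, 120)],
)


def export_changes(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, changed_at FROM orders "
        "WHERE changed_at > ? ORDER BY changed_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# First export picks up everything.
first, watermark = export_changes(conn, 0)

# One row is updated and one is inserted after the first export.
conn.execute("UPDATE orders SET amount = 25, changed_at = 130 WHERE id = 2")
conn.execute("INSERT INTO orders VALUES (4, 40, 140)")

# Second export returns only the changed and the new row.
second, watermark = export_changes(conn, watermark)
print(second)  # [(2, 25, 130), (4, 40, 140)]
```

Note that this approach does not detect deleted rows; those would need one of the other options mentioned above.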

BigQuery streaming insert using template tables data availability issue

We have been using BigQuery for over a year now with no issues. We load data as batch jobs every few hours and it usually is instantly available.
We just started experimenting with streaming inserts using template tables. With our first test, we saw no errors and the data showed up instantly. The test created approximately 120 tables. A simple select count (using the web ui) on the tables came up with the right total number of ~8000 rows. After a couple of hours of more streaming, the total dropped to ~1400 rows.
Unsure about what happened, we dropped the dataset, recreated the template table and re-ran the streaming. This time around, the tables showed up right away but the data did not. On our third attempt the tables themselves did not show up for more than a couple of hours. We are on the fourth attempt and this time we only streamed data belonging to one table. The table showed up right away, but it has been over an hour and the data does not show up.
The streaming service uses the latest Java library, inserts only one record at a time and logs the response. The response, without exception, is always {"kind":"bigquery#tableDataInsertAllResponse"} with no errors.
Any help trying to understand what is happening would be great. Thanks.
Looks like we've identified the issue. It appears there's a race, in the template-tables path only, that causes our system to think the first chunk of data was deleted by user action (table truncation -- which it obviously wasn't), so that data is dropped. We've identified the fix and will attempt to push it out shortly.
Thanks for letting us know!
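For readers who want to verify the "no errors" observation from the question on their own logs, here is a minimal Python sketch that inspects a logged insertAll response body for per-row failures, which the API reports under insertErrors. The failed_rows helper and the sample error payload are hypothetical; the first sample is the response quoted in the question.

```python
import json


def failed_rows(response_text):
    """Return (row index, errors) pairs from an insertAll response body."""
    response = json.loads(response_text)
    return [
        (entry.get("index"), entry.get("errors", []))
        for entry in response.get("insertErrors", [])
    ]


ok = '{"kind":"bigquery#tableDataInsertAllResponse"}'
# Hypothetical failure payload, for illustration only.
bad = (
    '{"kind":"bigquery#tableDataInsertAllResponse",'
    '"insertErrors":[{"index":0,"errors":'
    '[{"reason":"invalid","message":"hypothetical message"}]}]}'
)

print(failed_rows(ok))   # []
print(failed_rows(bad))
```

An empty list only means that no row was rejected at insert time; it would not have revealed the server-side problem described in the answer above.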

What is the vcslog table in JediVCS used for?

Just wondering what the vcslog table is used for in JediVCS.
I received some consistency errors on this table (the database backend flagged them during my backup procedure), and there is a chance that some data went missing after the repair.
If the table is just acting as a log then this should be ok.
Some other info:
I was going to ask this question on the JediVCS newsgroup, but it appears to be down.
I do have recent backups that I could restore, but I would rather not, as it means finding and re-committing any intervening work.
I diffed all other tables and their data between the pre-fix and post-fix versions of the VCS, and they all match.
I tried to diff the vcslog table, but the tool I have crashed because the table has millions of records. (I think the tool ran out of memory doing the diff.)
Any info appreciated.
Peter Mayes
I ended up mailing the active JediVCS admin staff directly.
Thanks go to them for their prompt and helpful advice.
Basically the vcslog table holds information relating to actions taken by a user. As such its data is entirely optional, but recommended.
The least useful records are marked as type=g in the database. (This logs a 'get' operation by a user).
After deleting these records directly in the database, the vcslog table shrank by 97%. (I suspect the large number of 'get' logs is due to our nightly autobuilds.)
The database has been stable since the clear. (Some six weeks ago)
Here are some other topics in the JediVCS help manual I was pointed at.
Just in case they prove helpful to someone. These detail the logging behaviour inside the JediVCS client:
"Project history"
"Application server options"
"Write VCS log"