How to log a Talend job result?

We have many Talend jobs that transfer data from Oracle (tOracleInput) to Redshift (tRedshiftOutputBulkExec). I would like to store the result information in a DB table. For example:
Job name, start time, running time, rows loaded, successful or failed
I know that if I turn on log4j, most of that information can be derived from the log. However, saving it in a DB table would make it easier to check and report on the results.
I'm most interested in the number of records loaded. I checked this link http://www.talendbyexample.com/talend-logs-and-errors-component-reference.html and the manual of tRedshiftOutputBulkExec, but neither gives me that information.
Does Talend Administration Center provide such a function? What is the best way to implement this?
Thanks,

After looking at the URL you provided, tLogCatcher should provide you with what you need (minus the rows loaded, which you can get with a lookup).

I started with Talend Studio Version 6.4.1. There you can set "Stats & Logs" for a job. It can log to console, files or database. When writing to a DB you set JDBC parameters and the name for three tables:
Stats Table: stores start and end timestamps of the job
Logs Table: stores error messages
Meter Table: stores the count of rows for each monitored flow
They correspond to the components tStatCatcher, tLogCatcher and tFlowMeterCatcher, where you can find the required table schemas.
To monitor a flow, select it, open the "Component" tab and tick the "Monitor this connection" checkbox.
To see the logged values you can use the AMC (Activity Monitoring Console) in Studio or TAC.
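If you also want the information from the original question (job name, start time, running time, rows loaded, success or failure) as a single report, you can query those tables directly. A minimal sketch, assuming the tables were created as stats_table and meter_table and follow the default tStatCatcher / tFlowMeterCatcher schemas (verify both the names and the columns in your Studio version):

    -- Sketch: per-run job summary from the "Stats & Logs" tables.
    -- stats_table / meter_table and the column semantics are assumptions
    -- based on the default tStatCatcher / tFlowMeterCatcher schemas.
    SELECT s.job                                      AS job_name,
           MIN(s.moment)                              AS start_time,
           MAX(s.duration)                            AS duration_ms,
           SUM(m.count)                               AS rows_loaded,  -- "count" may need quoting in some DBs
           MAX(CASE WHEN s.message = 'failure'
                    THEN 'failed' ELSE 'success' END) AS result
    FROM   stats_table s
    LEFT JOIN meter_table m
           ON m.root_pid = s.root_pid                 -- one job run = one root process id
    GROUP  BY s.job, s.root_pid;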

Related

COPY command runs but no data being copied from Teradata (on-prem)

I am running into an issue with a pipeline I have set up: it gets a list of tables from Teradata using a Lookup activity and then passes those items to a ForEach activity, which copies the data in parallel and saves each table as a gzipped file. The requirement is essentially to archive some tables that are no longer being used.
For this pipeline I am not using any partition options, as most of the tables are small and I wanted to keep it flexible.
[Screenshot: pipeline]
[Screenshot: Copy activity within the ForEach activity]
99% of the tables ran without issues and were copied as gz files into blob storage, but two tables in particular run for a long time (approx. 4 to 6 hours) without any of the data being written to the blob storage account.
Note that the image above says "Cancelled", but that was done by me. Before that the run time was as described above, with still no data being written. This affects only 2 tables.
I checked with our Teradata team and those tables are not being used by anyone (hence they are not locked). I also looked at the query monitor in "Teradata Viewpoint" (an admin tool) and saw that the query was running on Teradata without issues.
Any insight would be greatly appreciated.
Looking at the issue you mention, it looks like the data size of the table is more than a single blob can hold (as you are not using any partition options).
Use a partition option to optimize performance and handle the data.
Link
Just in case someone else comes across this, the way I solved it was to create a new dataset called "TD_Prod_datasetname". The purpose of this dataset is not to point to a specific table, but to accept an "@item().TableName" value.
This dataset contains two main values; the first is @dataset().TeradataName.
[Screenshot: dataset property]
I only came up with that after doing a little bit of digging in Google.
I then created a parameter called "TeradataTable" as String.
I then updated my pipeline. As above, the two main activities remain the same: I have a Lookup and then a ForEach activity (where the ForEach gets the item values):
However, in the Copy activity inside the ForEach I updated the source. Instead of getting "item().Name" I am passing through @item().TableName:
This then enabled me to select the "Table" option, and because I am using Table instead of Query I can use the "Hash" partition. I left it blank because, according to the Microsoft documentation, it will automatically find the Primary Key to be used for this.
The only issue I ran into with this was that if a table does not have a Primary Key, that item will fail and will need to be run through either a different process or manually outside of this job.
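One way to catch those tables up front is to query Teradata's dictionary views in the Lookup step and route primary-key-less tables to the separate process. A rough sketch (DBC.TablesV / DBC.IndicesV are standard views, but the IndexType codes and 'MY_DATABASE' are assumptions to verify against your Teradata release):

    -- Sketch (Teradata): tables with no primary index / primary key,
    -- which would need a different partition column or a separate process.
    SELECT t.DatabaseName,
           t.TableName
    FROM   DBC.TablesV t
    LEFT JOIN DBC.IndicesV i
           ON  i.DatabaseName = t.DatabaseName
           AND i.TableName    = t.TableName
           AND i.IndexType IN ('P', 'Q', 'K')  -- primary index / partitioned PI / primary key
    WHERE  t.TableKind = 'T'
      AND  t.DatabaseName = 'MY_DATABASE'      -- placeholder database name
      AND  i.TableName IS NULL;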
Because of this change, the files that previously just hung there and did not copy now copy successfully into our blob storage account.
Hope this helps someone else that wants to see how to create parallel copies using Teradata as a source and pass through multiple table values.

SonarQube DB lacking values

I connected my SonarQube server to my Postgres DB; however, when I view the "metrics" table, it lacks the actual values of the metrics.
Those are all the columns I get, which are not particularly helpful. How can I get the actual values of the metrics?
My end goal is to obtain metrics such as duplicate code, function size, complexity etc. on my projects. I understand I could also use the REST api to do this however another application I am using will need a db to extract data from.
As far as I know, connecting to the DB just stores the data; it does not display it.
You can check the stored data in SonarQube's GUI:
Click on project
Click on Activity
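If the other application really has to read the values straight from the database, note that the metrics table only holds the metric definitions; the measured values sit in a separate measures table. The schema is not a public API and changes between SonarQube versions (project_measures vs. live_measures, metric_id vs. metric_uuid), so treat the following purely as a sketch:

    -- Sketch: join metric definitions to measured values.
    -- Table/column names are assumptions and vary by SonarQube version.
    SELECT m.name        AS metric_name,
           pm.value      AS metric_value,
           pm.text_value AS metric_text_value
    FROM   project_measures pm
    JOIN   metrics m ON m.id = pm.metric_id
    WHERE  m.name IN ('duplicated_lines_density', 'complexity', 'functions');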

Writing log files to a database using Talend

[Screenshot: the actual job]
I have a job that produces log output, which I am currently showing with tLogRow, but I want to write the logs into a database. Is that possible with Talend? I have already tried tMysqlRow and tMysqlOutput, but those throw some errors. I have searched on Google but there isn't any clear answer for this.
[Screenshot: the actual output when using tLogRow]
You can save the error information directly from tLogCatcher to a database table. Make sure that the table name and column names match the tMysqlOutput schema. If you don't want to save the complete information, you can use a tXMLMap to filter the required columns.
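For reference, here is a sketch of a MySQL table that lines up with the default tLogCatcher schema, so tLogCatcher can feed a tMysqlOutput directly (column names and lengths are assumptions based on the default component schema; verify them in your Studio version):

    -- Sketch (MySQL): target table for tLogCatcher -> tMysqlOutput.
    -- Verify columns against the tLogCatcher schema in your Studio version.
    CREATE TABLE job_logs (
        moment     DATETIME,
        pid        VARCHAR(20),
        root_pid   VARCHAR(20),
        father_pid VARCHAR(20),
        project    VARCHAR(50),
        job        VARCHAR(255),
        context    VARCHAR(50),
        priority   INT,
        type       VARCHAR(255),
        origin     VARCHAR(255),
        message    VARCHAR(255),
        code       INT
    );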

Talend Open Studio: run only created or modified records among 15k

I have a job in Talend Open Studio which is working fine; it connects a tMSSqlInput to a tMap and then a tMysqlOutput, very straightforward. My problem is that I need this job to run on a daily basis, but only process records that have been newly created or modified... any help is highly appreciated!
It seems that you are looking for a Change Data Capture tool for Talend.
Unfortunately it is only available in the licensed product.
There are several ways to implement what you need; I want to show the most popular ones.
CDC from Talend
As Corentin said correctly, you could choose to use CDC (Change Data Capture) from Talend if you use the subscription version.
CDC of MSSQL
Alternatively, you can check whether you can activate or use CDC in your MSSQL server. This depends on your license. If it is possible, you can use this feature to identify new and changed records and process them.
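A minimal sketch of what enabling it looks like, assuming your edition supports CDC and you have the necessary permissions ('MyDatabase' and 'dbo.Customers' are placeholders):

    -- Sketch (SQL Server): enable CDC on the database and on one table.
    USE MyDatabase;
    EXEC sys.sp_cdc_enable_db;

    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Customers',
         @role_name     = NULL;   -- no gating role

    -- Changes can then be read from the generated cdc.* objects,
    -- e.g. cdc.fn_cdc_get_all_changes_dbo_Customers(@from_lsn, @to_lsn, 'all').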
Triggers
You can also create triggers on your database (if you have access to it). For example, creating triggers for INSERT, UPDATE and DELETE would help you get the deltas. You could then store those records, or just their IDs, separately.
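A sketch of that approach in SQL Server syntax (since the source here is MSSQL); my_table, my_table_delta and the id column are placeholder names:

    -- Sketch (SQL Server): log changed row IDs into a delta table via a trigger.
    CREATE TABLE my_table_delta (
        id          INT       NOT NULL,
        change_type CHAR(1)   NOT NULL,              -- 'I', 'U' or 'D'
        changed_at  DATETIME2 NOT NULL DEFAULT SYSDATETIME()
    );
    GO
    CREATE TRIGGER trg_my_table_delta
    ON my_table
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- New and updated rows appear in the "inserted" pseudo-table.
        INSERT INTO my_table_delta (id, change_type)
        SELECT i.id,
               CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
        FROM   inserted i;

        -- Pure deletes have rows only in the "deleted" pseudo-table.
        INSERT INTO my_table_delta (id, change_type)
        SELECT d.id, 'D'
        FROM   deleted d
        WHERE  NOT EXISTS (SELECT 1 FROM inserted);
    END;

The daily Talend job can then read the IDs from my_table_delta and clear that table once they have been processed.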
Software driven / API
If your database is connected to an application and you have developers around, you could ask for a service which identifies records on insert / update / delete and exposes them to you. This could be done e.g. via a REST interface.
Delta via ID
If the primary key is an ID and it is set to auto-increment, you could also check your MySQL table for the biggest value and only SELECT those rows from the source which have a bigger ID than you already have. This of course depends on the database layout.
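A sketch of that pattern with the source and target from this question (table names and the id column are placeholders; in Talend the max id would typically be fetched first and passed on as a context variable):

    -- Step 1 (against the MySQL target, e.g. tMysqlInput): get the high-water mark.
    SELECT COALESCE(MAX(id), 0) AS max_id FROM target_table;

    -- Step 2 (against the MSSQL source, e.g. tMSSqlInput), with the literal 0
    -- replaced by the context variable when the query string is built:
    SELECT *
    FROM   source_table
    WHERE  id > 0;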

Viewing tableau server when one data source is missing

I have a dashboard in Tableau which pulls data from about 10 tables in a SQL database.
These tables are refreshed at various times of day. There are occasions where one of them is not available (or has been deleted and is awaiting a rebuild).
However, when I open my Tableau dashboard on the server, it won't let me see any of it. Not seeing the data from the missing table is fine, but the majority of the data, which does not come from that table, is unavailable too.
I get this error
An unexpected error occurred. If you continue to receive this error please contact your Tableau Server Administrator.
TableauException: [Microsoft][SQL Server Native Client 11.0][SQL Server]Invalid object name 'dbo.survey_order_info_fy16_TV_L'. The table "[dbo].[survey_order_info_fy16_TV_L]" does not exist. Unable to connect to the server "dbedwro.vistaprint.net". Check that the server is running and that you have access privileges to the requested database.
"survey_order_info_fy16_TV_L" being the missing table but not one I'm bothered about right now.
Is there an option that might help me see all the other data?
I am not sure if it's possible to avoid this behavior.
If it isn't, there is a workaround: create extracts of these tables and store them on the Tableau server. You can then use these extracts instead of the DB tables and refresh them either on a schedule (if you know when the tables are available again) or from the SQL server (e.g. with SSIS, by triggering the refresh once the data is available again).
The advantages of that would be:
you can refresh them independently and always have the latest data
it performs better than a live SQL connection
you don't jam your SQL server with connections (in case you have a lot of users accessing it)
you can filter and select fields if you don't want your users to have access to the full dataset
Disadvantages:
you will have to create one extract per table and replace all data sources in the workbooks you already use
It's a matter of creating a workbook, connecting to the source (adding filters or hiding fields) and publishing it to the server. Details of that can be found here:
http://onlinehelp.tableau.com/current/pro/online/mac/en-us/publish_datasources.html