Catch data errors in Talend

I have some data that occasionally has errors that need to be checked/validated rather than being loaded into my target system.
How can I do this? For example:
Input:
Local number
------------
789-523
Output:
LocalNumber
------------
789-523
This kind of value needs validation: there should be another digit at the end. How can I catch this kind of value in my input file and put it in a separate file for validation?

At a high level you'll need to define what you consider data errors using some components and use those to filter the data.
As an example you might have the following data:
.--+-----+-----------+--------------------+--------+-----.
|id|name |phone      |address             |city    |state|
|=-+-----+-----------+--------------------+--------+----=|
|1 |Bob |02071234568|165 Lake Tahoe Blvd.|Richmond|MN |
|2 |Susan|02071234567|345 E Fowler Avenue |Helena |CA |
|3 |Jimmy|foobar |222 Tully Road East |Bismarck|MA |
|4 |Janet|07811111111|230 Camelback Rd |Boise |GB |
'--+-----+-----------+--------------------+--------+-----'
I'm using British phone numbers here for US addresses and states, but that's only because I can't think of any usable US phone numbers :)
Here we want to check whether the phone number is valid and whether the state is valid too. For now we're just going to print the results of everything to the console using a tLogRow, but equally this could be any kind of output, including log files, databases or even the Talend Data Stewardship Console. A quick job might look like this:
To check whether the phone number is valid (and also optionally standardise it to a predefined format) we can use the tStandardizePhoneNumber component:
This then adds some columns to your schema including whether the phone number is valid and also a standardised output.
We then use a tMap to filter on whether the phone number is valid and at the same time replace the number with the provided standardised phone number (in this case an international formatted phone number):
After this we can use a lookup to a list of valid US states and inner join this in a tMap to check if the state is valid. We also use this opportunity to get the full state name:
This general principle applies to how you apply any data validation: use a component or some logic (either in a tMap or something like a tPatternCheck) to determine if the data is valid and then use a filtering component (the tPatternCheck is already a filtering component) to direct your output.
If you're looking to validate more basic things, such as the metadata of the data (e.g. the length of a column or its data type), then you can use a tSchemaComplianceCheck to filter out rows of data that don't match the predefined schema.

Most of the output components have a "Reject" connector; by using it you get the error code, the error message and the rejected record.
Or you can use the tSchemaComplianceCheck component to catch schema-related errors.

Related

Practical storage of many (in single-row) boolean values

I wish to store many (N ~ 150) boolean values for a web app's "environment" variables.
What is the proper way to get them stored?
creating N columns and one (1) row of data,
creating two (2) or three (3) columns (id smallserial, name varchar(255), value boolean) with N rows of data,
by using jsonb data type,
by using an array data type,
by using bit string bit varying(n),
by another way (please advise)
Note: name may be too long.
Tia!
Could you perhaps use a bit string? https://www.postgresql.org/docs/7.3/static/datatype-bit.html. (Set the nth bit to 1 when the nth attribute would have been "true")
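For illustration, a minimal sketch of that idea in PostgreSQL (assuming a reasonably recent version; the app_flags table and column names here are made up):

CREATE TABLE app_flags (
    id    smallserial PRIMARY KEY,
    flags bit varying(150) NOT NULL
);

-- Store a set of flags (only the first three bits shown here).
INSERT INTO app_flags (flags) VALUES (B'101');

-- Read the nth flag; get_bit works on bit strings in recent PostgreSQL
-- versions and numbers the leftmost bit as 0.
SELECT get_bit(flags, 1) = 1 AS second_flag FROM app_flags;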
It depends on how you want to access them in normal usage.
Do you need to access one of these values at a time? In that case JSONB is a really good fit: it is easy and quick to find a record. Or do you need to get all of them in one call? In that case bit string types are the best, but you need to be really careful about ordering and transcription when writing and reading.
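To make the JSONB option concrete, a rough sketch (jsonb needs PostgreSQL 9.4+, jsonb_set needs 9.5+; the app_settings table and key names are invented):

CREATE TABLE app_settings (
    id       smallserial PRIMARY KEY,
    settings jsonb NOT NULL DEFAULT '{}'
);

INSERT INTO app_settings (settings)
VALUES ('{"use.the.force": true, "debug.mode": false}');

-- Look up one value at a time...
SELECT settings ->> 'use.the.force' = 'true' AS use_the_force
FROM app_settings;

-- ...or flip a single flag in place.
UPDATE app_settings
SET settings = jsonb_set(settings, '{debug.mode}', 'true'::jsonb);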
Any of the options will do, depending on your circumstances. There is little need to optimise storage if you have only 150 values. Unless, of course, there can be a very large number of these sets of 150 values, or you are working in a very restricted environment like an embedded system (in which case a full-blown database client is probably not what you're looking for).
There is no definite answer here, but I will give you a few guidelines to consider, from experience:
You don't want to have an anonymous string of values that is interpreted in code. When you change anything later on, your 1101011 or 0x12f08a will become a fascinatingly enigmatic problem.
When the number of your fields starts to grow, you will regret it if they are all stored in a single cell on a single row, because you will either be developing some obscure SQL or transforming a larger-than-needed dataset from the server.
When you feel that boolean values are really not enough, you start to wonder if there is a possibility to store something else too.
Settings and environmental properties are seldom subject to processor or data intensive processing, so follow the easiest path.
As my recommendation based on the given information and some educated guessing, you'll probably want to store your information in a table like
string key     | integer set_idx | string value
---------------+-----------------+---------------
use.the.force  | 1899            | 1
home.directory | 1899            | /home/dvader
use.the.force  | 1900            | 0
home.directory | 1900            | /home/yoda
Converting a 1 to boolean true is cheap, and if you have only one set of values, you can ignore the set index.
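A minimal sketch of that layout in SQL (the environment_settings table name is illustrative):

CREATE TABLE environment_settings (
    key     varchar(255) NOT NULL,
    set_idx integer      NOT NULL,
    value   varchar(255) NOT NULL,
    PRIMARY KEY (key, set_idx)
);

INSERT INTO environment_settings (key, set_idx, value) VALUES
    ('use.the.force',  1899, '1'),
    ('home.directory', 1899, '/home/dvader'),
    ('use.the.force',  1900, '0'),
    ('home.directory', 1900, '/home/yoda');

-- Converting '1' to boolean true is cheap at read time.
SELECT value = '1' AS use_the_force
FROM environment_settings
WHERE key = 'use.the.force' AND set_idx = 1899;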

How to save the predictions of YOLO (You Only Look Once) Object detection in a jsonb field in a database

I want to run Darknet(YOLO) on a number of images and store its predictions in PostgreSQL Database.
This is the structure of my table:
sample=> \d+ prediction2;
                      Table "public.prediction2"
   Column    | Type  | Modifiers | Storage  | Stats target | Description
-------------+-------+-----------+----------+--------------+-------------
 path        | text  | not null  | extended |              |
 pred_result | jsonb |           | extended |              |
Indexes:
    "prediction2_pkey" PRIMARY KEY, btree (path)
Darknet(YOLO)'s source files are written in C.
I have already stored Caffe's predictions in the database as follows. I have listed one of the rows of my database here as an example.
path | pred_result
-------------------------------------------------+------------------------------------------------------------------------------------------------------------------
/home/reena-mary/Pictures/predict/gTe5gy6xc.jpg | {"bow tie": 0.00631, "lab coat": 0.59257, "neck brace": 0.00428, "Windsor tie": 0.01155, "stethoscope": 0.36260}
I want to add YOLO's predictions to the jsonb data in pred_result, i.e. for each image path and Caffe prediction result already stored in the database, I would like to append Darknet's (YOLO's) predictions.
The reason I want to do this is to add search tags to each image. So, by running Caffe and Darknet on images, I want to be able to get enough labels that can help me make my image search better.
Kindly help me with how I should do this in Darknet.
This is an issue I also encountered. Actually YOLO does not provide a JSON output interface, so there is no way to get the same output as from Caffe.
However, there is a pull request that you can merge to get workable output here: https://github.com/pjreddie/darknet/pull/34/files. It outputs CSV data, which you can convert to JSON to store in the database.
You could of course also alter the source code of YOLO to make your own implementation that outputs JSON directly.
If you are able to use a TensorFlow implementation of YOLO, try this: https://github.com/thtrieu/darkflow
You can directly interact with darkflow from another python application and then do with the output data as you please (or get JSON data saved to a file, whichever is easier).
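Once you have YOLO's labels in JSON form (however you produce them), appending them to the existing pred_result can be done on the database side with the jsonb concatenation operator. A sketch, assuming PostgreSQL 9.5+ and made-up label values:

-- Merge the new YOLO labels into the existing Caffe predictions;
-- keys that already exist are overwritten by the right-hand side.
UPDATE prediction2
SET pred_result = pred_result || '{"person": 0.92, "tie": 0.31}'::jsonb
WHERE path = '/home/reena-mary/Pictures/predict/gTe5gy6xc.jpg';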

SQL Server 2014 set based way to create a unique integer for a string combination input

I'm using SQL Server 2014 Developer Edition Service Pack 2 on Windows 7 Enterprise machine.
The question
Is there a set-based way that I can create an integer field based on a string input? It must ensure that the Entity ID field is never duplicated for different inputs.
Hypothetical table structure
|ID|Entity ID|Entity Code|Field One|From Date|To Date |
|1 |1        |CodeOne    |ValueOne |20160731 |20160801|
|2 |1        |CodeOne    |ValueTwo |20160802 |NULL    |
|3 |2        |CodeTwo    |ValueSix |20160630 |NULL    |
Given the above table, I'm trying to find a way to create the Entity ID based on the Entity Code field (it is possible that we would use a combination of fields)
What I've tried so far
Using a sequence object (I don't like this because it is too easy for the sequence to be dropped and the count reset)
Creating a table to track the entities, creating a new Entity ID each time a new entity is discovered (I don't like this because it is not a set-based operation)
Creating a HashBytes value on the Entity Code field and converting it to a BIGINT (I have no proof that this won't work, but it doesn't feel like a robust solution)
Thanks in advance all.
Your concern over HashBytes collisions is understandable, but I think you can put your worries aside. See How many random elements before MD5 produces collisions?
I've used this technique when masking tens of thousands of customer account numbers and have yet to witness a collision.
Select cast(HashBytes('MD5', 'V5H 3K3') as int)
Returns
-381163718
(Note: as illustrated above, you may see negative values. We didn't mind)
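Applied to your hypothetical table, a set-based sketch might look like this (dbo.SomeTable is a placeholder name; the 16-byte MD5 value is truncated to fit the 8-byte BIGINT):

-- Derive an Entity ID for every distinct Entity Code in one pass.
SELECT DISTINCT
       [Entity Code],
       CAST(HashBytes('MD5', [Entity Code]) AS BIGINT) AS [Entity ID]
FROM   dbo.SomeTable;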

GROUP BY CLAUSE using SYNCSORT

I have some content in a file on which I must generate statistics, such as how many records are of type 1, type 2, etc. The number of types can change and is unknown to the code until the file arrives. In a SQL system I could do this using COUNT and GROUP BY. But I am not sure whether I can do this using SYNCSORT or a COBOL program. Would anyone here have an idea on how I can implement a 'GROUP BY'-type query on a file using SYNCSORT?
Sample Data:
TYPE001 SUBTYPE001 TYPE01-DESC
TYPE001 SUBTYPE002 TYPE01-DESC
TYPE001 SUBTYPE003 TYPE01-DESC
TYPE002 SUBTYPE001 TYPE02-DESC
TYPE002 SUBTYPE004 TYPE02-DESC
TYPE002 SUBTYPE008 TYPE02-DESC
I want to get the information such as TYPE001 ==> 3 Records, TYPE002 ==> 3 Records. What the code doesn't know until runtime is the TYPENNN value.
You show data already in sequence, so there is no need to sort the data itself, which makes SUM FIELDS= with SORT a poor solution if anyone suggests it (plus code for the formatting).
MERGE with a single input file and SUM FIELDS= would be better, but still require the code for formatting.
The simplest way to produce output which may suit you is to use OUTFIL reporting functions:
 OPTION COPY
 OUTFIL NODETAIL,
        REMOVECC,
        SECTIONS=(1,7,
                  TRAILER3=(1,7,
                            ' ==> ',
                            COUNT=(M10,LENGTH=3),
                            ' Records'))
The NODETAIL says "remove all the data lines". The REMOVECC says "although it is a report, don't use printer-control characters on position one of the output records". The SECTIONS says "we're going to use control-breaks, and here they (it in this case) are". In this case, your control-field is 1,7. The TRAILER3 defines the output which will be produced at each control-break: COUNT here is the number of records in that particular break. M10 is an editing mask which will change leading zeros to blanks. The LENGTH gives a length to the output of COUNT, three is chosen from your sample data with sub-types being unique and having three digits as the unique part of the data. Change to whatever suits your actual data.
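Run against your sample data, that control card should produce output roughly like this (the count is right-justified with leading blanks because of the M10 mask and LENGTH=3):

TYPE001 ==>   3 Records
TYPE002 ==>   3 Records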
You've not been clear, and perhaps you want the output "floating" (3bb instead of bb3, where b represents a blank)? That would require more code...

How to delete data from an RDBMS using Talend ELT jobs?

What is the best way to delete from a table using Talend?
I'm currently using a tELTJDBCoutput with the action on Delete.
It looks like Talend always generates a DELETE ... WHERE EXISTS (<your generated query>) query.
So I am wondering whether we have to use the field values or can just put a fixed value of 1 (even in only one field) in the tELTMap mapping.
To me, putting real values in looks useless, since in the WHERE EXISTS only the WHERE clause matters.
Is there a better way to delete using ELT components?
My current job is set up like so:
The tELTMAP component with real data values looks like:
But I can also do the same thing with the following configuration:
Am I missing the reason why we should put something in the fields?
The following answer is a demonstration of how to perform deletes using ETL operations, where the data is extracted from the database, read into memory, transformed and then fed back into the database. After clarification, the OP specifically wants information on how this would differ for ELT operations.
If you need to delete certain records from a table then you can use the normal database output components.
In the following example, the use case is to take some updated database, check which records are no longer in the new data set compared to the old data set, and then delete the relevant rows from the old data set. This might be used for refreshing data from one live system to a non-live system, or some other use case where you need to manually move data deltas from one database to another.
We set up our job like so:
Which has two tMySqlConnection components that connect to two different databases (potentially on different hosts), one containing our new data set and one containing our old data set.
We then select the relevant data from the old data set and inner join it using a tMap against the new data set, capturing any rejects from the inner join (rows that exist in the old data set but not in the new data set):
We are only interested in the key for the output as we will delete with a WHERE query on this unique key. Notice as well that the key has been selected for the id field. This needs to be done for updates and deletes.
And then we simply need to tell Talend to delete these rows from the relevant table by configuring our tMySqlOutput component properly:
Alternatively you can simply specify some constraint that would be used to delete the records as if you had built the DELETE statement manually. This can then be fed in as the key via a main link to your tMySqlOutput component.
For instance I might want to read in a CSV with a list of email addresses, first names and last names of people who are opting out of being contacted and then make all of these fields a key and connect this to the tMySqlOutput and Talend will generate a DELETE for every row that matches the email address, first name and last name of the records in the database.
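To illustrate, for a single CSV row the generated statement would then look roughly like the following (the table and column names here are invented for the example):

DELETE FROM contacts
WHERE  email      = 'someone@example.com'
  AND  first_name = 'Jane'
  AND  last_name  = 'Doe';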
In the first example shown in your question:
you are specifically only selecting (for the deletion) products where SOME_TABLE.CODE_COUNTRY is equal to JS_OPP.CODE_COUNTRY and SOME_TABLE.FK_USER is equal to JS_OPP.FK_USER in your WHERE clause, and then the data you send to the delete statement sets CODE_COUNTRY equal to JS_OPP.CODE_COUNTRY and FK_USER equal to JS_OPP.FK_USER.
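Put differently, the ELT components should end up issuing something along these lines (a sketch; the exact SQL Talend generates may differ):

DELETE FROM SOME_TABLE
WHERE EXISTS (
    SELECT 1
    FROM   JS_OPP
    WHERE  SOME_TABLE.CODE_COUNTRY = JS_OPP.CODE_COUNTRY
      AND  SOME_TABLE.FK_USER      = JS_OPP.FK_USER
);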
If you were to put a tLogRow (or some other output) directly after your tELTxMap you would be presented with something that looks like:
.----------+---------.
|      tLogRow_1     |
|=-----------+------=|
|CODE_COUNTRY|FK_USER|
|=-----------+------=|
|GBR         |1      |
|GBR         |2      |
|USA         |3      |
'------------+-------'
In your second example:
You are setting CODE_COUNTRY to an integer of 1 (your database will then translate this to a VARCHAR "1"). This would then mean the output from the component would instead look like:
.------------.
| tLogRow_1  |
|=-----------|
|CODE_COUNTRY|
|=-----------|
|1           |
|1           |
|1           |
'------------'
In your use case this would mean that the deletion should only delete the rows where the CODE_COUNTRY is equal to "1".
You might want to test this a bit further though because the ELT components are sometimes a little less straightforward than they seem to be.