Talend Data Integration: Avoid nulls coming out of tExtractXMLField?

I have this simple flow in Talend DI 6 (simplified for posting on SO):
The last step crashes with a NullPointerException, because missing XML attributes are returned as null.
Is there a way to get empty string values instead of nulls?
For now I'm using a tReplace step to remove nulls as a work-around, but it's tedious and adds to the cost of maintenance by creating one more place where the list of attributes needs to be maintained.

In Talend DI 5.6.2 it is possible to add default values to the schema. The column in the schema editor is called "Default". If you expect strings, you can set an empty string there, which is used whenever the column value is null:
Talend schema view with Default column
This also works for other data types. Talend DI 6 should still be able to do this, although the field might have been renamed.
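If you prefer to handle it in the flow itself rather than in the schema, the same fallback can be written in Java, since Talend expressions are plain Java. A minimal sketch, assuming a hypothetical routine name and column name (not from the original question):

    // Hypothetical Talend routine (Code > Routines > StringHelper): a reusable
    // null-to-empty-string fallback for columns coming out of tExtractXMLField.
    public class StringHelper {

        /** Returns the value itself, or "" when the XML attribute was missing. */
        public static String nullToEmpty(String value) {
            return value == null ? "" : value;
        }
    }

In a tMap output expression you would then write something like StringHelper.nullToEmpty(row1.myAttribute), where row1.myAttribute stands for whichever column tExtractXMLField produces.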

Related

How to Validate Data issue for fixed length file in Azure Data Factory

I am reading a fixed-width file in a Mapping Data Flow and loading it into a table. I want to validate the fields, data types and lengths of the fields that I am extracting in a Derived Column using substring.
How can I achieve this in ADF?
Use a Conditional Split and add a condition for each property of the field that you wish to test. For data type checking, we literally just landed new isInteger(), isString(), ... functions today. The docs are still in the printing press, but you'll find the functions in the expression builder. For length, use length().
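This is outside ADF, but as a rough plain-Java sketch (with made-up offsets and field names) of what the Derived Column substring extraction plus the Conditional Split checks amount to:

    public class FixedWidthCheck {
        public static void main(String[] args) {
            // One fixed-width record; the offsets and lengths below are made up.
            String line = "00123JOHN      20201120";

            // Derived Column style extraction via substring.
            String id   = line.substring(0, 5);
            String name = line.substring(5, 15);
            String date = line.substring(15, 23);

            // Conditional Split style checks: integer-ness, non-empty name, length.
            boolean idIsInteger = id.trim().matches("\\d+");
            boolean nameOk      = !name.trim().isEmpty();
            boolean dateOk      = date.length() == 8 && date.matches("\\d{8}");

            System.out.println((idIsInteger && nameOk && dateOk) ? "valid" : "invalid");
        }
    }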

Mapping Data Flow Common Data Model source connector datetime/timestamp columns nullified?

We are using Azure Data Factory Mapping data flow to read from Common Data Model (model.json).
We use a dynamic pattern where the Entity is parameterised; we do not project any columns and we have selected Allow schema drift.
Problem: We are having an issue with the Source in the mapping data flow (Source Type is Common Data Model). All the datetime/timestamp columns are read as null in the source activity.
We also tried Infer drifted column types in the Projection tab, where we provide a format for timestamp columns. However, it only satisfies certain timestamp columns, since in the source each datetime column has a different timestamp format:
11/20/2020 12:45:01 PM
2020-11-20T03:18:45Z
2018-01-03T07:24:20.0000000+00:00
Question: How can we prevent the datetime columns from becoming null? Ideally, we do not want Mapping Data Flow to typecast any columns. Is there a way to just read all columns as strings?
Some screenshots:
In the Projection tab we do not specify a schema, so that schema drift is allowed and more than one entity can be loaded dynamically.
In the Data Preview tab, ModifiedOn, SinkCreatedOn and SinkModifiedOn are all system columns and will definitely have values in them.
This is now resolved through a separate conversation with the Azure Data Factory team.
Firstly, there is no way to 'stringify' all the columns in the Source, because the CDM connector gets its metadata from model.json (this file can be manipulated if needed, but that is not ideal for my scenario).
To stop the datetime/timestamp columns from becoming null, select Infer drifted column types under the Projection tab and add the multiple time formats that you expect to come from CDM. You can pick them from the dropdown; if your particular datetime format is not listed there (which was my case), you can edit the code behind the data flow (the data flow script) to add your format (see the second screenshot).
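As a side note, mapping data flows run on Spark and accept Java-style date patterns, so (assuming that holds for your runtime) the three sample values above correspond to patterns like the ones below. This plain-Java snippet just verifies that each pattern parses its sample value:

    import java.time.LocalDateTime;
    import java.time.OffsetDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.Locale;

    public class CdmTimestampPatterns {
        public static void main(String[] args) {
            // 11/20/2020 12:45:01 PM  ->  US-style date with an AM/PM marker
            DateTimeFormatter f1 =
                DateTimeFormatter.ofPattern("MM/dd/yyyy hh:mm:ss a", Locale.ENGLISH);
            System.out.println(LocalDateTime.parse("11/20/2020 12:45:01 PM", f1));

            // 2020-11-20T03:18:45Z  ->  ISO-8601 with a 'Z' zone designator
            DateTimeFormatter f2 = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssX");
            System.out.println(OffsetDateTime.parse("2020-11-20T03:18:45Z", f2));

            // 2018-01-03T07:24:20.0000000+00:00  ->  seven fractional digits, numeric offset
            DateTimeFormatter f3 =
                DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX");
            System.out.println(OffsetDateTime.parse("2018-01-03T07:24:20.0000000+00:00", f3));
        }
    }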

Mybatis dynamic select query with prevention of 'no column exists' errors

What's the best approach to have a dynamic query like
select $dynamic_columns from table
But also prevent errors like 'column not found' and return results with the columns that are available, considering that $dynamic_columns is given by end users.
One approach would be to store the schema in a Java object and filter against it. But then, if the schema is updated in the DB, we would need to update the cached schema Java object as well. Is there a better way to handle this?
Be careful with this, as it is more vulnerable to SQL injection. Never let the user type something into a text field; instead, build a list for them to select from.
For building the list, I think the best approach is to use the JDBC method DatabaseMetaData.getColumns(...) to retrieve a list of columns for a table. I don't think there's a need to cache anything.
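For illustration, here is a minimal sketch of that approach in plain JDBC (the class and method names are just for the example): fetch the table's columns once per request via DatabaseMetaData.getColumns and keep only the user-supplied names that actually exist.

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class DynamicColumnFilter {

        /** Columns of the given table, as reported by the JDBC driver's metadata. */
        static Set<String> tableColumns(Connection conn, String table) throws SQLException {
            Set<String> columns = new LinkedHashSet<>();
            DatabaseMetaData meta = conn.getMetaData();
            try (ResultSet rs = meta.getColumns(null, null, table, null)) {
                while (rs.next()) {
                    columns.add(rs.getString("COLUMN_NAME").toLowerCase());
                }
            }
            return columns;
        }

        /** Keeps only the user-requested columns that actually exist in the table. */
        static List<String> safeColumns(Set<String> existing, List<String> requested) {
            List<String> safe = new ArrayList<>();
            for (String col : requested) {
                if (existing.contains(col.toLowerCase())) {
                    safe.add(col);   // join these with ", " to build the SELECT list
                }
            }
            return safe;
        }
    }

The filtered list can then be passed to the MyBatis mapper, knowing that every name in it is a real column of the table.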

How to populate a table via Pentaho Data Integration's Table Output step?

I am performing an ETL job via Pentaho 7.1.
The job is to populate the table 'PRO_T_TICKETS' in PostgreSQL 9.2 via Pentaho jobs and transformations.
I have mapped the table fields with respect to the stream fields
Mapped Fields
My table PRO_T_TICKETS has its schema (column names) in UPPERCASE.
Is this the reason I can't populate the table PRO_T_TICKETS with my ETL job?
I duplicated the step TABLE_OUTPUT to PRO_T_TICKETS and changed the Target table field to 'PRO_T_TICKETS2'. Pentaho created a new table with lowercase schema and populated the data in it.
But I want this data to be uploaded in the table PRO_T_TICKETS only and with the UPPERCASE schema if possible.
I am attaching the whole job here along with the error thrown by Pentaho (see the Pentaho error screenshot). I have also tried the query with double quotes added to the column names, as you can see in the error, but it didn't help.
What do you think I should do?
When you create (or modify) the connection, select Advanced in the left panel and click on Force to upper case or Force to lower case or, even better, Preserve case of reserved words.
To know which option to choose, copy the 4th line of your error log (the line starting with INSERT INTO "public"."PRO_T_TICKETS"("OID"...) into your SQL developer tool and change the connection's advanced parameters until it works.
Also, at debug time, don't use batch updates, don't use lazy conversion on previous steps, and try with one (1) field rather than all (25).
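The reason these options matter is PostgreSQL's identifier case folding: unquoted identifiers are folded to lower case, while double-quoted identifiers keep their exact case. A hedged JDBC sketch of the two INSERT variants you would try in your SQL tool (connection details and column names are placeholders mirroring the question):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class IdentifierCaseCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details; adjust host, database and credentials.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "password");
                 Statement st = conn.createStatement()) {

                // Unquoted identifiers are folded to lower case by PostgreSQL,
                // so this actually targets a table named pro_t_tickets.
                run(st, "INSERT INTO PRO_T_TICKETS (OID) VALUES (1)");

                // Quoted identifiers keep their case, so this targets the
                // upper-case table "PRO_T_TICKETS" and its upper-case "OID" column.
                run(st, "INSERT INTO \"PRO_T_TICKETS\" (\"OID\") VALUES (1)");
            }
        }

        // Usually only one of the two statements succeeds, depending on how the
        // table was created; the one that works tells you which casing to force.
        private static void run(Statement st, String sql) {
            try {
                st.executeUpdate(sql);
                System.out.println("OK:     " + sql);
            } catch (SQLException e) {
                System.out.println("Failed: " + sql + " -> " + e.getMessage());
            }
        }
    }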
Just as a complement: it worked for me following the tips from AlainD and using some specific configurations that I'd like to share with you. I have a transformation streaming data from MySQL to PostgreSQL using a Table Input and a Table Output. In both DBs I have uppercase objects.
I did the following steps to get it working:
In the Table Input (MySQL) the objects are uppercase too, but I typed them in lowercase and it worked; I didn't set any special option in the DB connection.
In the Table Output (PostgreSQL) I typed everything in uppercase (schema, table name and columns) and I also set "Specify database fields" (clicking "Get fields").
In the target DB connection (PostgreSQL) I set the options (in the "Advanced" section) "Quote all in database" and "Preserve case of reserved words".
PS: the last option is because I found out there was one more problem with my fields: there was a column called "Admin" (yes guys, they created a camel-case column using a reserved word!), and for that reason I had to set "Preserve case of reserved words" and type it as Admin (without quotes and in camel case) in the Table Output.

Partial inserts with Cassandra and Phantom DSL

I'm building a simple Scala Play app which stores data in a Cassandra DB using the Phantom DSL driver for Scala. One of the nice features of Cassandra is that you can do partial updates i.e. so long as you provide the key columns, you do not have to provide values for all the other columns in the table. Cassandra will merge the data into your existing record based on the key.
Unfortunately, it seems this doesn't work with Phantom DSL. I have a table with several columns, and I want to be able to do an update, specifying values just for the key and one of the data columns, and let Cassandra merge this into the record as usual, while leaving all the other data columns for that record unchanged.
But Phantom DSL overwrites existing columns with null if you don't specify values in your insert/update statement.
Does anybody know of a work-around for this? I don't want to have to read/write all the data columns every time, as eventually the data columns will be quite large.
FYI I'm using the same approach to my Phantom coding as in these examples:
https://github.com/thiagoandrade6/cassandra-phantom/blob/master/src/main/scala/com/cassandra/phantom/modeling/model/GenericSongsModel.scala
It would be great to see some code, but partial updates are possible with phantom. Phantom is an immutable builder; it will not override anything with null by default. If you don't specify a value, it won't do anything with that column.
database.table.update.where(_.id eqs id).modify(_.bla setTo "newValue")
will produce a query that only sets the values you've explicitly set to something; all other columns are left untouched. Please provide some code examples; your problem seems really strange, as queries don't keep track of table columns to automatically add in what's missing.
Update
If you would like to delete column values, i.e. effectively set them to null inside Cassandra, phantom offers a different syntax which does the same thing:
database.table.delete(_.col1, _.col2).where(_.id eqs id)
Furthermore, you can even delete map entries in the same fashion:
database.table.delete(_.props("test"), _.props("test2")).where(_.id eqs id)
This assumes props is a MapColumn[Table, Record, String, _]; props.apply(key: T) is typesafe, so it will respect the key type you define for the map column.