Creating spectrum table in matillion for csv file with comma inside quotes - amazon-redshift

I need to create a Spectrum table in Redshift using Matillion.
My CSV file data looks like this:
column1,column2,column3
abc,"qwerty,pqr",xyz
But in the Spectrum table the data appears as:
column1 column2 column3
abc qwerty pqr
Matillion is not treating the quoted value as a single field.
Can you please suggest how to achieve this using Matillion's EXTERNAL TABLE component?

Basically you would like to specify a quote parameter for your CSV data.
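To make the problem concrete, here is a minimal Python sketch (not Redshift or Matillion code) that contrasts a plain split on the delimiter with a quote-aware parser. The sample row is modeled on the data in the question; the variable names are only illustrative.

```python
import csv
import io

# Sample data modeled on the question: the second field is quoted
# and contains the delimiter.
raw = 'column1,column2,column3\nabc,"qwerty,pqr",xyz\n'

# Splitting on the delimiter alone ignores the quotes, so the quoted
# field is cut in two and the row ends up with four values.
naive_rows = [line.split(",") for line in raw.splitlines()]

# A quote-aware parser keeps the quoted value together as one field.
parsed_rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))

print(naive_rows[1])   # ['abc', '"qwerty', 'pqr"', 'xyz']
print(parsed_rows[1])  # ['abc', 'qwerty,pqr', 'xyz']
```

A table definition that only knows the field terminator behaves like the first variant, which is why the values shift into the wrong columns.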
Redshift has 2 ways of specifying external tables (see Redshift Docs for reference):
using the default built-in SerDes and properties like ROW FORMAT DELIMITED, FIELDS TERMINATED BY
explicitly specifying a SerDe with ROW FORMAT SERDE, WITH SERDEPROPERTIES
I don't think it's possible to specify a quote parameter using the built-in SerDes.
It is possible to specify them using org.apache.hadoop.hive.serde2.OpenCSVSerde (look here for details on its properties), but beware that there are known problems with it, such as the one described in this SO question.
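As a rough sketch of the second option, the Python helper below only assembles the DDL text for an external table that uses OpenCSVSerde, so you can run it by hand or from a SQL script. The helper name, schema, table, column types and S3 location are all hypothetical placeholders, and I have not verified this through Matillion.

```python
def build_external_table_ddl(schema, table, columns, location):
    """Return CREATE EXTERNAL TABLE text that uses OpenCSVSerde.

    All arguments are placeholders chosen for illustration.
    """
    column_sql = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n  {column_sql}\n)\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'\n"
        "WITH SERDEPROPERTIES (\n"
        "  'separatorChar' = ',',\n"
        "  'quoteChar' = '\"',\n"
        "  'escapeChar' = '\\\\'\n"
        ")\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}'\n"
        # Skips the header row (column1,column2,column3) of the CSV file.
        "TABLE PROPERTIES ('skip.header.line.count' = '1');"
    )


# Hypothetical names; replace them with your own external schema and bucket.
ddl = build_external_table_ddl(
    "spectrum_schema",
    "my_csv_table",
    [("column1", "varchar(100)"), ("column2", "varchar(100)"), ("column3", "varchar(100)")],
    "s3://my-bucket/my-prefix/",
)
print(ddl)
```

The columns are declared as varchar here to stay on the safe side with this SerDe; cast them in a view or downstream query if you need other types.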
Now for Matillion:
I have never used Matillion, but looking at their Redshift External Table documentation page, it looks like it's only possible to specify the FORMAT and the FIELD TERMINATOR, not a SerDe and its properties. Hence it's not possible to specify the quote parameters for the external table, unless there are some undocumented means to specify a custom SerDe.
Personal note:
We have experienced many problems with ingesting data stored as CSV, and we basically try to avoid it. There's no standard for CSV, each tool implements its own version of support for it, and it's very difficult to convince all your tools to see data the same way.

Related

Talend - Output data to Snowflake table with spaces in Field Names

I have a very specific requirement to output data to a Snowflake table but the field names must have spaces in them. Snowflake appears to handle this okay, but I'm unsure how Talend will as I understand Java doesn't allow it. Can anyone help?
Also, are there other tools that won't handle spaces in field names (e.g. R or Python), so that we would be restricting use of the warehouse if we did that?

How to Validate Data issue for fixed length file in Azure Data Factory

I am reading a fixed-width file in a mapping Data Flow and loading it into a table. I want to validate the fields, data types and lengths of the fields that I am extracting in the Derived Column using substring.
How can I achieve this in ADF?
Use a Conditional Split and add a condition for each property of the field that you wish to test for. For data type checking, we literally just landed new isInteger(), isString() ... functions today. The docs are still in the printing press, but you'll find them in the expression builder. For length use length().
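To illustrate the kind of per-field conditions meant here, the sketch below expresses the same checks in plain Python. It is not ADF expression syntax, and the layout (field names, offsets, widths) is a made-up example.

```python
def validate_record(line, layout):
    """Return a list of problems found in one fixed-width line.

    layout is a list of (name, start, width, kind) tuples,
    where kind is "int" or "str". The layout itself is hypothetical.
    """
    problems = []
    expected_len = max(start + width for _, start, width, _ in layout)
    if len(line) != expected_len:
        problems.append(f"line length {len(line)} != {expected_len}")
    for name, start, width, kind in layout:
        value = line[start:start + width].strip()
        if not value:
            problems.append(f"{name}: empty")
        elif kind == "int" and not value.isdigit():
            # isdigit() only accepts unsigned digits; extend if you allow signs.
            problems.append(f"{name}: not an integer")
    return problems


layout = [("id", 0, 5, "int"), ("name", 5, 10, "str"), ("qty", 15, 3, "int")]
good = "00042Widget    007"
bad = "00A42Widget    7x "

print(validate_record(good, layout))  # []
print(validate_record(bad, layout))   # ['id: not an integer', 'qty: not an integer']
```

In the Data Flow, each of these conditions would become one branch of the Conditional Split, with the failing rows routed to a separate stream.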

How to get the servername\hostname in Firebird 2.5.x

I use Firebird 2.5, and I want to retrieve the following values
Username:
I used SELECT rdb$get_context('SYSTEM', 'CURRENT_USER') FROM ...
Database name:
I used SELECT rdb$get_context('SYSTEM', 'DB_NAME') FROM ...
But for the server name, I did not find any client API. Do you know how I can retrieve the server name with a SELECT statement?
There is nothing built-in in Firebird to obtain the server host name (or host names, as a server can have multiple host names) through SQL.
The closest you can get is by requesting the isc_info_firebird_version information item through the isc_database_info API function. This returns version information that - if connecting through TCP/IP - includes a host name of the server.
But as your primary reason for this is to discern between environments in SQL, it might be better to look for a different solution. Some ideas:
Use an external table
You can create an external table to contain the environment information you need.
In this example I just put in a short, fixed width name for the environment types, but you could include other information, just be aware the external table format is a binary format. When using CHAR it will look like a fixed width format, where values shorter than declared need to be padded with spaces.
You can follow these steps:
Configure ExternalFileAccess in firebird.conf (for this example, you'd need to set ExternalFileAccess = Restrict D:\data\DB\exttables).
You can then create a table as
create table environment
external file 'D:\data\DB\exttables\environment.dat' (
environment_type CHAR(3) CHARACTER SET ASCII NOT NULL
)
Next, create the file D:\data\DB\exttables\environment.dat and populate it with exactly three characters (e.g. TST for test, PRO for production, etc.). You can also insert the value instead; the external table file will then be created automatically. Inserting might be simpler if you want more columns, or data with varying length, etc. Just keep in mind that the format is binary, but using CHAR for all columns will make it look like text.
Do this for each environment, and make sure the file is read-only to avoid accidental inserts.
After this is done, you can obtain the environment information using
select environment_type
from environment
You can share the same file for multiple databases on the same host, and external files are - by default - not included in a gbak backup (they are only included if you apply the -convert backup option), so this would allow moving database between environments without dragging this information along.
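If you prefer to generate the file rather than type it by hand, a small script can write the record. This is only a sketch, assuming the single CHAR(3) ASCII column from the example table; the function name is mine, and the demo writes to a temporary directory rather than the real exttables folder.

```python
import os
import tempfile


def write_environment_file(path, environment_type, width=3):
    """Write one fixed-width record matching a single CHAR(width) ASCII column."""
    if len(environment_type) > width:
        raise ValueError(f"value longer than {width} characters")
    # Pad with spaces to the declared width; no newline, as the file is
    # a sequence of fixed-size records, not lines of text.
    record = environment_type.ljust(width).encode("ascii")
    with open(path, "wb") as handle:
        handle.write(record)
    return record


# Demo in a temporary directory; for real use, point the path at the
# directory configured in ExternalFileAccess.
demo_path = os.path.join(tempfile.mkdtemp(), "environment.dat")
record = write_environment_file(demo_path, "TST")
print(record, os.path.getsize(demo_path))
```

Writing in binary mode avoids any line-ending translation, which would otherwise shift the record boundaries.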
Use a UDF or UDR
You can write a UDF (User-Defined Function) or UDR (User-Defined Routine) in a suitable programming language to return the information you want and define this function in your database. Firebird can then call this function from SQL.
UDFs are considered deprecated, and you should use UDRs (introduced in Firebird 3) instead if possible.
I have never written a UDF or UDR myself, so I can't describe it in detail.

Substring of column name in Copy Activity in ADF v2

Is there a way in the V2 Copy Activity to operate upon one of the input columns (of type string) with an expression? Before I load rows to the destination, I need to limit the number of characters in the column.
My hope was to simply switch from something like this:
"ColumnMappings": "inColumn: outColumn"
to something like this:
"ColumnMappings": "#substring(inColumn, 1, 300): outColumn"
If anyone can point me to where I can read-up on where & when string expressions can be used, I could use the guidance.
This is the official documentation on expressions and functions: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-expression-language-functions
And this is the documentation on mappings: https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping
Also remember that if you are using a defined query in the copy activity, you can use SQL functions like CAST([fieldName] as varchar(300)) to limit the number of characters in a particular field.
Hope this helped!
When you don't have a SQL Source, but your destination is a SQL sink, you can use a Stored Procedure to insert your data into the final table. That way, you can define these kinds of transformations in the stored procedure. I don't think the Data Factory can handle these kinds of activities, it is more intended as an orchestrator.
Have a look here:
https://learn.microsoft.com/en-us/azure/data-factory/connector-sql-server#invoke-stored-procedure-from-sql-sink

how to create a PostgreSQL table from a XML file...

I have an XML document file. Part of the file looks like this:
<attr>
  <attrlabl>COUNTY</attrlabl>
  <attrdef>County abbreviation</attrdef>
  <attrtype>Text</attrtype>
  <attwidth>1</attwidth>
  <atnumdec>0</atnumdec>
  <attrdomv>
    <edom>
      <edomv>C</edomv>
      <edomvd>Clackamas County</edomvd>
      <edomvds/>
    </edom>
    <edom>
      <edomv>M</edomv>
      <edomvd>Multnomah County</edomvd>
      <edomvds/>
    </edom>
    <edom>
      <edomv>W</edomv>
      <edomvd>Washington County</edomvd>
      <edomvds/>
    </edom>
  </attrdomv>
</attr>
From this XML file, I want to create a PostgreSQL table with columns of attrlabl, attrdef, attrtype, and attrdomv. I appreciate your suggestions!
While Erwin is right that this can be done with PostgreSQL tools, I would still suggest doing the custom translation yourself, for a few reasons.
The first is determining appropriate XML to PostgreSQL type conversions. You probably want to choose these yourself. But this example highlights a very different problem: what to do with nested data structures. You could, for example, store XML fragments. You could store text, JSON, or the like. You could create other tables and reference them with foreign keys.
In general I have almost always found the best approach is to simply create the tables manually. This substitutes human judgement for automated mappings and allows you to create better matches than a computer will.
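As one possible way to do that translation, here is a short Python sketch that splits the element from the question into a parent row and child rows, which corresponds to the "other tables with foreign keys" option. The two-table layout named in the comments is a hypothetical design choice, not something the question prescribes.

```python
import xml.etree.ElementTree as ET

xml_text = """
<attr>
  <attrlabl>COUNTY</attrlabl>
  <attrdef>County abbreviation</attrdef>
  <attrtype>Text</attrtype>
  <attrdomv>
    <edom><edomv>C</edomv><edomvd>Clackamas County</edomvd></edom>
    <edom><edomv>M</edomv><edomvd>Multnomah County</edomvd></edom>
    <edom><edomv>W</edomv><edomvd>Washington County</edomvd></edom>
  </attrdomv>
</attr>
"""


def attr_to_rows(attr):
    """Split one <attr> element into a parent row and its domain-value rows.

    Hypothetical target tables:
      attribute(attrlabl PRIMARY KEY, attrdef, attrtype)
      attribute_domain(attrlabl REFERENCES attribute, edomv, edomvd)
    """
    label = attr.findtext("attrlabl")
    parent = (label, attr.findtext("attrdef"), attr.findtext("attrtype"))
    children = [
        (label, edom.findtext("edomv"), edom.findtext("edomvd"))
        for edom in attr.iter("edom")
    ]
    return parent, children


attribute_row, domain_rows = attr_to_rows(ET.fromstring(xml_text.strip()))
print(attribute_row)  # ('COUNTY', 'County abbreviation', 'Text')
for row in domain_rows:
    print(row)
```

The resulting tuples can then be inserted with whatever PostgreSQL driver you already use, with the types chosen by hand as described above.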