USQL Escape Quotes - double

I am new to Azure data lake analytics, I am trying to load a csv which is double quoted for sting and there are quotes inside a column on some random rows.
For example
ID, BookName
1, "Life of Pi"
2, "Story about "Mr X""
When I try loading, it fails on second record and throwing an error message.
1, I wonder if there is a way to fix this in csv file, unfortunatly we cannot extract new from source as these are log files?
2, is it possible to let ADLA to ignore the bad rows and proceed with rest of the records?
Execution failed with error '1_SV1_Extract Error :
'{"diagnosticCode":195887146,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_ROW_ERROR","message":"Error
occurred while extracting row after processing 9045 record(s) in the
vertex' input split. Column index: 9, column name:
'instancename'.","description":"","resolution":"","helpLink":"","details":"","internalDiagnostics":"","innerError":{"diagnosticCode":195887144,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD","message":"Invalid
character following the ending quote character in a quoted
field.","description":"Invalid character is detected following the
ending quote character in a quoted field. A column delimiter, row
delimiter or EOF is expected.\nThis error can occur if double-quotes
within the field are not correctly escaped as two
double-quotes.","resolution":"Column should be fully surrounded with
double-quotes and double-quotes within the field escaped as two
double-quotes."

As per the error message, if you are importing a quoted csv, which has quotes within some of the columns, then these need to be escaped as two double-quotes. In your particular example, you second row needs to be:
..."Life after death and ""good death"" models - a qualitative study",...
So one option is to fix up the original file on output. If you are not able to do this, then you can import all the columns as one column, use RegEx to fix up the quotes and output the file again, eg
// Import records as one row then use RegEx to clean columns
#input =
EXTRACT oneCol string
FROM "/input/input132.csv"
USING Extractors.Text( '|', quoting: false );
// Fix up the quotes using RegEx
#output =
SELECT Regex.Replace(oneCol, "([^,])\"([^,])", "$1\"\"$2") AS cleanCol
FROM #input;
OUTPUT #output
TO "/output/output.csv"
USING Outputters.Csv(quoting : false);
The file will now import successfully. My results:

Related

Azure Data Factory - Data Flow - first row with all columns empty

I have been facing an issue that I am not managing to solve.
I am using the data factory's dataflow when I sink my data sometimes it comes with the first row with all columns empty. I realized if I set the "escape character" and "quote character" in the dataset that I am using to sink, it works well. The problem is that my data comes with a lot of quotes (") in the middle, and I want to write it as it is. So I decided to set the "escape character" and "quote character" empty.
The first row as header is false
Example of my data:
EPI|004006922|2021005|K4|0|0000444|000|0001||"|SA|"
Dataset settings:
Columns Delimiter = |
Row Delimiter = Default (\r,\n or \r\n)
encoding = UTF-8
Escape Character = empty
Quote Character = empty
First row as header = false
output csv file:
||||||||||
EPI|004006922|2021005|K4|0|0000444|000|0001||"|SA|"
Since I am using a synapse's external table, I cannot have this first row empty

How to remove a dynamic string from a CSV file using sed?

I added a dummy column at the beginning of my data export to a CSV file to get rid of control characters and some specific string values as mentioned below by using a pipe '|' delimiter. This data is coming from Teradata fast export using utf-8
'''
y^CDUMMYCOLUMN|
<86>^ADUMMYCOLUMN|
<87>^ADUMMYCOLUMN|
<94>^ADUMMYCOLUMN|
{^ADUMMYCOLUMN|
_^ADUMMYCOLUMN|
y^CDUMMYCOLUMN|
[^ADUMMYCOLUMN|
k^ADUMMYCOLUMN|
m^ADUMMYCOLUMN|
<82>^ADUMMYCOLUMN|
c^ADUMMYCOLUMN|
<8e>^ADUMMYCOLUMN|
<85>^ADUMMYCOLUMN|
'''
This is completely random and not every row has these special characters. I'm sure I'm missing something here. I'm using sed to get rid of dummycolumn and control characters.
'''$ sed -e 's/.*DUMMYCOLUMN|//;/^$/d' data.csv > data_output.csv'''
After running this statement, I'm still remaining these below random values.
'''
<86>
<87>
<85>
<94>
<8a>
<85>
<8e>
'''
I could have written a sed statement to remove first three letters from each row but this series is not appearing in every row. At the same time, row count is 400 Million.
Current output.
y^CDUMMYCOLUMN|COLUMN1|COLUMN2|COLUMN3
<86>^ADUMMYCOLUMN|6218915846|36596|12
<87>^ADUMMYCOLUMN|9822354765|35325|33
t^ADUMMYCOLUMN|6788793999|111|12
g^ADUMMYCOLUMN|6090724004|7017|12
_^ADUMMYCOLUMN|IC-21357688806502|111|12
<8e>^ADUMMYCOLUMN|9682027117|35335|33
v^ADUMMYCOLUMN|6406807681|121|12
h^ADUMMYCOLUMN|6346768510|121|12
V^ADUMMYCOLUMN|6130452510|7017|12
Desired Output
COLUMN1|COLUMN2|COLUMN3
6218915846|36596|12
9822354765|35325|33
6788793999|111|12
6090724004|7017|12
IC-21357688806502|111|12
9682027117|35335|33
6406807681|121|12
6346768510|121|12
6130452510|7017|12
Please help.
Thank you.

how to return empty field in Spark

I am trying to check incomplete record and identify the bad record in Spark.
eg. sample test.txt file, it is in record format, columns separated by \t
L1C1 L1C2 L1C3 L1C4
L2C1 L2C2 L2C3
L3C1 L3C2 L3C3 L3C4
scala> sc.textFile("test.txt").filter(_.split("\t").length < 4).collect.foreach(println)
L2C1 L2C2 L2C3
The second line is printing as having less number of columns.
How should i parse without ignoring the empty column after in second line
It is the split string in scala removes trailing empty substrings.
The behavior is similar to Java, to let all the substrings checked we can call as
"L2C1 L2C2 L2C3 ".split("\t",-1)

Vertica COPY + FLEX table

I want to load on a flex table a log in which each record is composed by some fields + a JSON, the format is the following:
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z";"2017-01-31T00:00:04.714Z";"{"requestedFrom":"BUTTON","tripId":{"request":3003926837969,"mac":"v01162450701"}}"
and (after many tries) I'm using the COPY command with a CSV parser in this way:
COPY schema.flex_table from local 'C:\temp/test.log' parser fcsvparser(delimiter=';',header=false, trim=true, type='traditional')
in this way all is loaded correctly except the JSON, that is skipped and left empty.
Is there a way to load also the JSON as a string?
HINT: just for test puposes, I noticed that if in the JSON I put a '\' before every '"' in the log, the loading runs smoothly, but unfortunately I cannot modify the content of the log.
Not without modifying the file beforehand - or writing your own UDParser function.
It clearly is a strange format: CSV (well, semicolon delimited and with double quotes as string enclosers), until the children appear - which are stored with a leading double quote and a trailing double quote; doubly nested with curly braces - JSON type, ok. But you have double quotes (not doubled) within the JSON encoding - any parser would go astray on those.
You'll have to write a program (ideally in C) to remove the curly braces, to remove the column names in the JSON code and leave just a CSV line
So, from the line (the backslash at the end means an escaped newline, meaning that the three lines you see are actually one line, for readability)
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z"; \
"2017-01-31T00:00:04.714Z"; \
"{"requestedFrom":"BUTTON","tripId":{"request":3003926837969,"mac":"v01162450701"}}"
you make (title line with column names, then data line)
col1;col2;col3;timestampz1;\
timestampz2;requestedfrom;tripid_request;tripid_mac
"concorde-fe";"DETAILS.SHOWN";"1bcilyejs6d4w";"2017-01-31T00:00:04.801Z"; \
"2017-01-31T00:00:04.714Z";"BUTTON";3003926837969;"v01162450701"
Finally, you'll be able to load it as a CSV file - and maybe you will have to then normalise everything again: tripIdseems to be a dependent structure ....
Good luck
Marco the Sane

How to import empty strings as null values from CSV file - using pgloader?

I am using pgloader to import from a .csv file which has empty strings in double quotes. A sample line is
12334,0,"MAIL","CA","","Sanfransisco","TX","","",""
After a successful import, the fields that has double quotes ("") are shown as two single quotes('') in postgres database.
Is there a way we can insert a null or even empty string in place of two single quotes('')?
I am using the arguments -
WITH truncate,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by ','
SET client_encoding to 'UTF-8',
work_mem to '12MB',
standard_conforming_strings to 'on'
I tried using 'empty-string-to-null' mentioned in the documentation like this -
CAST column enumerate.fax using empty-string-to-null
But it gives me an error saying -
pgloader nph_opr_addr.test.load An unhandled error condition has been
signalled: At LOAD CSV
^ (Line 1, Column 0, Position 0) Could not parse subexpression ";"
when parsing
Use the field option:
null if blanks
Something like this:
...
having fields foo, bar, mynullcol null if blanks, baz
From the documentation:
null if
This option takes an argument which is either the keyword blanks or a double-quoted string.
When blanks is used and the field value that is read contains only space characters, then it's automatically converted to an SQL NULL value.
When a double-quoted string is used and that string is read as the field value, then the field value is automatically converted to an SQL NULL value