Trying to work around the error DF-CSVWriter-InvalidEscapeSetting - azure-data-factory

So I have a dataset which I want to export to csv with pipe as separator and no escape character.
That dataset contains in fact 4 source columns, 3 regular ones (just text) and one variable one.
That last column holds another subset of values that are also separated with a pipe.
Purpose is that the export looks like this, where the values are coming from my 4th field.
COL1|COL2|COL3|VAL1|VAL2|VAL3|....
The number of values can be different for each record but.
When I set the csv export separator to ";", I get this result which is expected
COL1;COL2;COL3;VAL1|VAL2|VAL3|....
However setting it to "|", it throws the error DF-CSVWriter-InvalidEscapeSetting.
Most likely because it detected the separator character in my 4th field and then enforces that an escape character needs to be set.
Which is a logical thing in most case but in my case I would like him to ignore this and just export as-is.
Any way how I can work around this, perhaps with a different approach or some additional settings?
Split & flatten produces extra rows but that's not what I want.
Regards,
Sven Peeters

As you have the same characters in the column value same as your delimiter character, with no escape character in your dataset will throw an error.
You have to change the delimiter character to a different character or add a Quote character and Escape character to Double quote(").
Downloaded file:

Related

defining escape character for a csv import

I have a source file that has text columns which end with a "\" and I have specified "^" as the column delimiter.
I have the file format for this specified use - ESCAPE = 'NONE', but rows with "\^" are causing premature end-of-line errors - assuming SF is not interpreting the "\^" as a column delimiter - therefore the column count is off.
I have changed the file format to use something else for ESCAPE but get the same message. The offending rows have the right number of columns and a text column containing "\", that is not the last character in the column, imports correctly.
The values are exported from SQL Server.
Is this an escape character problem or am I overlooking something else? I am new to SF.
I was seeing this same issue. Nomatter what I used as an escape character, when it showed up in my file next to a " at the end of a string it started causing trouble.
I switched my delimiter to \u0001 which is a special "start of header" character that very rarely shows up, especially at the end of strings.
I wouldn't say this was an ideal option for us, but it worked and is something you might want to try.

Is there a way for spark to read this odd text format?

The file format I have is sort of like csv and looks like this (abinitio .dat file of some sort):
1,apple,10.00,\n
2,banana,12.35,\n
3,orange,9.23,\n
The commas are actually "Start of Header" 0x01 byte characters, but I will use commas for simplicity. I can easily read the above sample by reading the file as a string RDD with a custom line split ,\n and then passing that into spark.read.csv. I am currently splitting lines by ,\n because there may be newlines in the data and I thought that those two characters were unique for each record. However a problem occurs when there are newline characters at the start of text fields. For example:
1,one \n apple,10.00,\n
2,two banana,12.35,\n
3,\n three orange,9.23,\n
My current code is able to ignore the newline in record 1 but picks up the ,\n after the 3 and splits the 3 lines into 4. How can I reliably read in this format?
My current ideas are:
Check that there are the right number of , column delimiters before allowing a split. I am not sure how to implement this, is it possible to do a regex look-back when spark sees a ,\n and check for the correct number of delimiters?
Try to coerce the file into some other format besides CSV
Make my own InputFormatClass, although I am not sure what this entails.

Spark: Split CSV with newlines in octet-stream field

I am using Scala to parse CSV files. Some of these files have fields which are non-textual data like images or octet-streams. I would like to use Apache Spark's textFile() method to split up the CSV into rows, and
split(",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
to split the row into fields. Unfortunatly this does not work with files that have these mentioned binary fields. There are two problems: 1) The octet-streams can contain newlines which make textFile() split rows which should be one, and 2) The octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
The files are usually big, couple of MBs up to couple of 100MBs. I have to take the CSV's as they are, although I could preprocess them.
All I want to achieve is a working split function so I can ignore the field with the octet-stream. Nevertheless, a great bonus would be to extract the textual information in the octet-stream.
So how would I go forward to solve my problems?
Edit: A typical record obtained with cat, the newlines are from the file, not for cosmetic purposes (shortened):
7,url,user,02/24/2015 02:29:00 AM,03/22/2015 03:12:36 PM,octet-stream,27156,"MSCF^#^#^#^#�,^#^#^#^#^#^#D^#^#^#^#^#^#^#^C^A^A^#^C^#^D^#^#^#^#^#^T^#^#^#^#^#^P^#�,^#^#^X=^#^#^#^#^#^#^#^#^#^#�^#^#^#^E^#^A^#��^A^#^#^#^#^#^#^#WF6�!^#Info.txt^#=^B^#^#��^A^#^#^#WF7�^#^#List.xml^#^�^#^#��^A^#^#^#WF:�^#^#Filename.txt^#��>��
^#�CK�]�r��^Q��T�^O�^#�-�j�]��FI�Ky��Ei�Je^K""!�^Qx #�*^U^?�^_�;��ħ�^LI^#$(�^Q���b��\N����t�����+������ȷgvM�^L̽�LǴL�^L��^ER��w^Ui^M��^X�Kޓ�^QJȧ��^N~��&�x�bB��D]1�^B|^G���g^SyG�����:����^_P�^T�^_�����U�|B�gH=��%Z^NY���,^U�^VI{��^S�^U�!�^Lpw�T���+�a�z�l������b����w^K��or��pH� ��ܞ�l��z�^\i=�z�:^C�^S!_ESCW��ESC""��g^NY2��s�� u���X^?�^R^R+��b^]^Ro�r���^AR�h�^D��^X^M�^]ޫ���ܰ�^]���0^?��^]�92^GhCx�DN^?
mY<{��L^Zk�^\���M�^V^HE���-Ե�$f�f����^D�e�^R:�u����� ^E^A�Ȑ�^B�^E�sZ���Yo��8Eސ�}��&JY���^A9^P������^P����~Jʭy��`�^9«�""�U� �:�}3���6�Hߧ�v���A7^Xi^L^]�sA�^Q�7�5d�^Xo˛�tY
Bp��4�Y���7DkV_���\^_q~�w�|�a�s̆���#�g�ӳu�^�!W}�n��Rgż_2�]�p�2}��b�G9�M^Q
�����:�X����bR[ԳZV!^G����^U�tq�&�Y6b��GR���s#mn6Z=^ZH^]�b��R^G�C�0R��{r1��4�#�
=r/X2�^O�����r^M�Rȕ�goG^X-����}���P+˥Qf�#��^C�Բ�z1�I�j����6�^Np���ܯ^P�[�^Tzԏ���^F2�e��\�E�߻6c�%���$�:E�*�*©t�y�J�,�S�2U�S�^X}ME�]��]�i��G�su�""��!�-��!r'ܷe_et Y^K^?0���l^A��^^�m�1/q����|�_r�5$�%�([x��W^E�G^^y���#����Z2^?ڠ�^_��^AҶ�OO��^]�vq%:j�^?�jX��\�]����^S�^^n�^C��>.^CY^O-� �_�\K����:p�<7Sֺnj���-Yk�r���^Q^M�n�J^B��^Z0^?�(^C��^W³!�g�Z�~R�A^M�^O^^�%;��Ԗ�p^S�w���*m^S���jڒ|�����<�^S�;Z^^Fc�1���^O�G_o����8��CS���w��^?��n�2~��m���G;��rx4�(�]�'��^E���eƧ�x��.�w�9WO�^^�י3��0,�y��H�Y�.H�x�""'���h}灢^T�Gm;^XE�̼�J��c�^^񾠫;�^A�qZ1ׁBZ^Q�^A^FB�^QbQ�_�3|ƺ�EvZ���^S�w���^P���9^MT��ǩY[+�+�9�Ԩ�^O�^Q���Fy(+�9p�^^Mj�2��Y^?��ڞ��^Ķb�^Z�ψMр}�ڣ�^^S�^?��^U�^Wڻ����z�^#��uk��k^^�>^O�^W�ݤO�h�^G�����Kˇ�.�R|�)-��e^G�^]�/J����U�ϴ�a���i5HO�^L�ESCg�R'���.����d���+~�}��ڝ^Y5]l�3jg54M�������2t�5^Y}�q)��^O;�X\�q^Ox~Vۗ�t�^\f� >k;^G�K5��,��X�t/�ǧ^G""5��4^MiΟ�n��^B^]�|�����V��ߌ֗Q~�H���8��t��5��ܗ�
�Z�^c�6N�ESCG����^_��>��t^L^R�^:�x���^]v�{^#+KM��qԎ�.^S�%&��=^W-�=�^S�����^CI���&^]_�s�˞�y�z�Jc^W�kڠ�^\��^]j�����^O��;�oY^^�^V59;�c��^B��T�nb����^C��^N��s�x�<{�9-�F�T�^N�5�^Se-���^T�Y[���`^ZsL��v�բ<C�+�~�^ۚ��""�Yκ2^_�^VxT�>��/ݳ^U�m�^#���3^Ge�n^Vc�V�^#�NVn�,�q��^^^]gy�R�S��Ȃ$���>A�d����xg�^GB3�M�J�^QJ^]�^\�{.�D��碎�^W�8a����qޠl?,'^R�^X�Cgy�P[����mڞ��H�Z�s�SD&蠤�s�E��nu�O#O<��3wj`C-%w�W�J�^WP^T�^]r^NT�TC�Lq�Z�f�!�;�l�Y��Gb��>�ud�hx�Ԭ^N)9�^N!k�҉s�35v������.�""^]��~4������۴�Z^]u�^Ti^^�i:�)K��P᳕!�#�^?�>��EE^VE-u�^SgV^L��<��^D�O<�+�J.�c�Z#>�.l����^S�
ESC��(��E�j�π쬖���2{^U&b\��P^S�`^O^XdL�^ 6bu��FD��^#^#^#^#","field_x, data",field_y,field_z
Expected output would be an array
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","field_x, data",field_y",field_z")
Or, but this is probably another question, such an array (like running strings on the octet-stream field):
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","Info.txt List.xml Filename.txt","field_x, data",field_y",field_z")
Edit 2: Every file that has a binary field also contains a length field for it. So instead of splitting directly I can walk left to right through my record and extract the fields. This is certainly a great improvement of my current situation but problem 1) still persists. How can I split those files reliably?
I took a closer look at the files and a header looks like this:
RecordId, Field_A, Content_Type, Content_Length, Content, Field_B
(Where Content_Type can be "octet-stream", Content_Length the number of bytes in the Content field, and Content obviously the data). And good for me, the value of Field_B is predictable, let's assume for a certain file it's always "Hello World".
So instead of using Spark's default behaviour splitting on newlines, how can I achieve that Spark is only splitting on newlines following "Hello World"? (I also edited the question title since the focus of the question changed)
As answered in Spark: Reading files using different delimiter than new line, I used textinputformat.record.delimiter to split on "Hello World\n" because I am a bit lucky that the last column always contains the same value. After that I simply walk left to right through the record and when I reach the length field I skip the next n bytes. Everything works now. Thanks for pointing me in the right direction.
There are two problems: 1) The octet-streams can contain newlines
which make textFile() split rows which should be one, and 2) The
octet-streams contain commas and/or double quotes which are not
escaped and mess up my schema.
Well, actually that csv file is properly escaped:
the multiline field is enclosed in double quotes: "MSCF^# .. ^#^#" (which also handles possible separators inside the field)
double quotes inside the field are escaped with another double quote as it should be: Je^K""!
Of course a simple split will not work in this case (and should never be used on csv data), but any csv reader able to handle multiline fields should parse that data correctly.
Also keep in mind that the double quotes inside the octet-stream have to be unescaped, or that data won't be valid (another reason not to use split, but a csv reader that handles this).

Postgres import a double quote value

I have a large .csv file with 9 million rows. Some of these columns contain text with quotes or other special characters in them I would like to import from this .csv file into the database. For example I would like to import this row:
ID BH Units Name Type_building Year_cons
1 4 900.00 schoolgebouw "De Bolster Schoolgebouw 2014-01-01
As you can see there is a double quote in the fourth column. None of the values in the .csv file are quoted, but sometimes a double quote or backslash '\' appears in the text. When I try to upload the data using:
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' NULL '\N';
It gives an error message: ERROR value to long for type character varying(25).
Apparently it sees the double quote as the start of a string and it tries to combine everything after it in the .csv file (including the fifth and sixth column) into a single cell (so that cell will contain 'De Bolster Schoolgebouw 2014-01-01'), which doesnt fit because the 'Name' column allows max 25 characters.
I found a similar topic (Is it possible to turn off quote processing in the Postgres COPY command with CSV format?) in which this solution was presented:
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' QUOTE E'\b' NULL '\N';
I think what it does is sets the quote value (default is double quote) to something else, in this case a backspace, so it won't recognize a double quote as a quote anymore. However when I run this I get another error: INVALID input syntax for integer.
What has happened is that every value now is quoted, so ID with value '1' becomes value '"1"' and because ID is defined as an integer it won't accept quotes.
Do you have any idea how to import double quotes and other special characters from a .csv file into a postgres database?
Thanks in advance!!
Based on the error message, I'd be suspicious it has anything to do with double quoting or anything of the sort -- had it been so, it would have been a widely reported bug and fixed ages ago.
When it comes to Postgres, the error messages are almost always correct and helpful. As such, consider the very real possibility that there are more characters than meets the eye.
My own guess is that you've some trailing (or leading) spaces in there somewhere, and as such have pieces of data that look 24 characters long when viewed in a spreadsheet while being, in fact, longer.
If you don't, my second guess would be some kind of bizarro character sets conflicts or effects. Perhaps you've some double byte characters, or two single characters behaving as a single one due to a diacritic in there. These look fine in the viewer you're using for your data; but then when these get interpreted or viewed as utf8 they end up counting as two distinct characters. Unlikely imo, but possible (example).
Lastly and per Frank's suggestion, try removing the length constraint. It is only slowing you down as things stand, because it slows down inserts and is preventing you to move forward. Once done importing, re-add the constraint to the table's definition. You'll then be able to find the offending rows using the likes of:
select name from table where length(name) > 24;
... and upon fixing them, you'll be able to re-add your constraint if it serves any purpose. (Hint: it doesn't, or at the very least shouldn't have. There's a real person out there whose name is: "Kim-Jong Sexy Glorious Beast Divine Dick Father Lovely Iron Man Even Unique Poh Un Winn Charlie Ghora Khaos Mehan Hansa Kimmy Humbero Uno Master Over Dance Shake Bouti Bepop Rocksteady Shredder Kung Ulf Road House Gilgamesh Flap Guy Theo Arse Hole Im Yoda Funky Boy Slam Duck Chuck Jorma Jukka Pekka Ryan Super Air Ooy Rusell Salvador Alfons Molgan Akta Papa Long Nameh Ek.")

How to handle new line characters when using COPY in POSTGRESQL

I have text that has the following form in my csv:
'0001'|'text1'|'\ntext2'|'text3'\n
However when I try to import the data into my postgres instance, it keeps breaking by thinking the first newline character is the start of a new line. Is there an easy way to tell postgres to import the newline character into the field?
If delimiters are explicitly set you avoid the trouble of special characters being interpreted, and instead are taken literally. The same thing can be said about quotes. The parser needs to know how to recognize strings to not interpret \n as a newline.
Here's the documentation:
Backslash characters () can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters.
In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character.
SO, you might have
COPY data FROM STDIN WITH CSV HEADER DELIMITER E'|' QUOTE E'\'';