Importing values containing double quotes into Postgres

I have a large .csv file with 9 million rows. Some of the columns contain text with quotes or other special characters, and I would like to import this file into the database. For example, I would like to import this row:
ID BH Units Name Type_building Year_cons
1 4 900.00 schoolgebouw "De Bolster Schoolgebouw 2014-01-01
As you can see there is a double quote in the fourth column. None of the values in the .csv file are quoted, but sometimes a double quote or backslash '\' appears in the text. When I try to upload the data using:
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' NULL '\N';
It gives an error message: ERROR: value too long for type character varying(25).
Apparently it sees the double quote as the start of a string and tries to combine everything after it in the .csv file (including the fifth and sixth columns) into a single cell (so that cell will contain 'De Bolster Schoolgebouw 2014-01-01'), which doesn't fit because the 'Name' column allows at most 25 characters.
I found a similar topic (Is it possible to turn off quote processing in the Postgres COPY command with CSV format?) in which this solution was presented:
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' QUOTE E'\b' NULL '\N';
I think what it does is set the quote character (the default is a double quote) to something else, in this case a backspace, so a double quote is no longer recognized as a quote. However, when I run this I get another error: invalid input syntax for integer.
What has happened is that every value is now treated as quoted, so the ID value '1' becomes '"1"', and because ID is defined as an integer it won't accept the quotes.
Do you have any idea how to import double quotes and other special characters from a .csv file into a postgres database?
Thanks in advance!!

Based on the error message, I doubt this has anything to do with double quoting or anything of the sort; had it been so, it would have been a widely reported bug and fixed ages ago.
When it comes to Postgres, the error messages are almost always correct and helpful. As such, consider the very real possibility that there are more characters than meet the eye.
My own guess is that you've some trailing (or leading) spaces in there somewhere, and as such have pieces of data that look 24 characters long when viewed in a spreadsheet while being, in fact, longer.
If you don't, my second guess would be some kind of bizarro character-set conflict or effect. Perhaps you have some double-byte characters, or two characters rendering as a single one due to a combining diacritic in there. These look fine in the viewer you're using for your data, but when they get interpreted or viewed as UTF-8 they end up counting as two distinct characters. Unlikely imo, but possible (example).
Lastly, and per Frank's suggestion, try removing the length constraint. It is only slowing you down as things stand: it slows down inserts and it is preventing you from moving forward. Once done importing, you'll be able to find the offending rows using the likes of:
select name from table where length(name) > 25;
... and upon fixing them, you'll be able to re-add your constraint if it serves any purpose. (Hint: it doesn't, or at the very least shouldn't have. There's a real person out there whose name is: "Kim-Jong Sexy Glorious Beast Divine Dick Father Lovely Iron Man Even Unique Poh Un Winn Charlie Ghora Khaos Mehan Hansa Kimmy Humbero Uno Master Over Dance Shake Bouti Bepop Rocksteady Shredder Kung Ulf Road House Gilgamesh Flap Guy Theo Arse Hole Im Yoda Funky Boy Slam Duck Chuck Jorma Jukka Pekka Ryan Super Air Ooy Rusell Salvador Alfons Molgan Akta Papa Long Nameh Ek.")
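Put together, a minimal sketch of that workflow, keeping the placeholder table name and the \COPY invocation from the question (the text column is assumed to be called name, as in the sample row):

ALTER TABLE <tablename> ALTER COLUMN name TYPE text;   -- lift the 25-character limit for the import
\COPY <tablename> FROM <path to file> WITH CSV DELIMITER ';' QUOTE E'\b' NULL '\N';
SELECT name FROM <tablename> WHERE length(name) > 25;  -- rows that would not fit varchar(25)
-- after fixing those rows, optionally restore the original type
-- (this will fail as long as over-length values remain):
ALTER TABLE <tablename> ALTER COLUMN name TYPE varchar(25);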

Related

Trying to work around the error DF-CSVWriter-InvalidEscapeSetting

So I have a dataset which I want to export to csv with pipe as separator and no escape character.
That dataset contains in fact 4 source columns, 3 regular ones (just text) and one variable one.
That last column holds another subset of values that are also separated with a pipe.
Purpose is that the export looks like this, where the values are coming from my 4th field.
COL1|COL2|COL3|VAL1|VAL2|VAL3|....
The number of values can be different for each record.
When I set the csv export separator to ";", I get this result which is expected
COL1;COL2;COL3;VAL1|VAL2|VAL3|....
However setting it to "|", it throws the error DF-CSVWriter-InvalidEscapeSetting.
Most likely because it detected the separator character in my 4th field and then requires that an escape character be set.
Which is a logical thing in most cases, but in my case I would like it to ignore this and just export as-is.
Any way how I can work around this, perhaps with a different approach or some additional settings?
Split & flatten produces extra rows but that's not what I want.
Regards,
Sven Peeters
Because the column value contains the same character as your delimiter and there is no escape character set in your dataset, it will throw an error.
You have to change the delimiter to a different character, or set both the Quote character and the Escape character to a double quote (").
Downloaded file: [screenshot]

Replacing a phrase with a leading space in T-SQL - but it's also replacing the phrases without the leading space

I've run into an interesting problem I'm hoping someone can shed some light on.
I'm trying to pull a unique list of names from an MS SQL database, but the company has been sloppy with their names. For some users they were tacking a code onto the end of the last name. I need to remove that code.
Example:
firstname lastname
John Doe
Mary Smith AST
Mike Jackson AST
Brian Astor
Jackie Masterson
In the example, "AST" is the code they tack on. It's not tacked on to all last names either. I need to get an output of just the last names without the code.
I would have expected this is a simple use of REPLACE. I tried:
select REPLACE(lastname, ' AST', '') from table
Note the leading space in the quotes for the search phrase... this does work to remove the "AST" appended to the last names.
However - my problem is that it will also remove anywhere AST appears at the BEGINNING of the field. So Brian Astor comes out as "Brian or" since the field started with AST. However... it correctly does not remove ast from the middle, so Jackie Masterson is fine.
Any ideas why it is ignoring the leading space in my search phrase for the beginning of the field? I've tried ltrim to eliminate the possibility the field has leading spaces.
Thanks!
REPLACE with an empty string eliminates the search string anywhere in your source string, so the behaviour is as expected.
If you only need to remove ' AST' at the end of the source string, try something like this:
select replace(replace(lastname + '$$$', ' AST$$$', '$$$'), '$$$', '') from table
The inner replace only matches when ' AST' sits immediately before the appended $$$ marker, i.e. at the end of the value; the outer replace then strips the marker again. Of course you need to be sure that the appended $$$ doesn't appear by chance in your source string (lastname), which I guess is not that likely.
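If you'd rather avoid the marker trick altogether, a hedged alternative sketch (same placeholder table and column names as above) is to test the end of the value directly:

select case
         when lastname like '% AST' then left(lastname, len(lastname) - 4)  -- drop the trailing ' AST'
         else lastname
       end as lastname
from table

LIKE '% AST' only matches when the code sits at the very end of the value, so names like Astor and Masterson are left untouched.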

defining escape character for a csv import

I have a source file that has text columns which end with a "\" and I have specified "^" as the column delimiter.
The file format I am using for this specifies ESCAPE = 'NONE', but rows with "\^" are causing premature end-of-line errors; I assume SF is not interpreting the "\^" as a column delimiter, and therefore the column count is off.
I have changed the file format to use something else for ESCAPE but get the same message. The offending rows have the right number of columns, and a text column containing "\" where the backslash is not the last character in the column imports correctly.
The values are exported from SQL Server.
Is this an escape character problem or am I overlooking something else? I am new to SF.
I was seeing this same issue. No matter what I used as an escape character, when it showed up in my file next to a " at the end of a string it started causing trouble.
I switched my delimiter to \u0001 which is a special "start of header" character that very rarely shows up, especially at the end of strings.
I wouldn't say this was an ideal option for us, but it worked and is something you might want to try.

Spark: Split CSV with newlines in octet-stream field

I am using Scala to parse CSV files. Some of these files have fields which are non-textual data like images or octet-streams. I would like to use Apache Spark's textFile() method to split up the CSV into rows, and
split(",[ ]*(?=([^\"]*\"[^\"]*\")*[^\"]*$)")
to split the row into fields. Unfortunately this does not work with files that have these binary fields. There are two problems: 1) The octet-streams can contain newlines which make textFile() split rows which should be one, and 2) The octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
The files are usually big, a couple of MB up to a couple of hundred MB. I have to take the CSVs as they are, although I could preprocess them.
All I want to achieve is a working split function so I can ignore the field with the octet-stream. Nevertheless, a great bonus would be to extract the textual information in the octet-stream.
So how would I go forward to solve my problems?
Edit: A typical record obtained with cat, the newlines are from the file, not for cosmetic purposes (shortened):
7,url,user,02/24/2015 02:29:00 AM,03/22/2015 03:12:36 PM,octet-stream,27156,"MSCF^#^#^#^#�,^#^#^#^#^#^#D^#^#^#^#^#^#^#^C^A^A^#^C^#^D^#^#^#^#^#^T^#^#^#^#^#^P^#�,^#^#^X=^#^#^#^#^#^#^#^#^#^#�^#^#^#^E^#^A^#��^A^#^#^#^#^#^#^#WF6�!^#Info.txt^#=^B^#^#��^A^#^#^#WF7�^#^#List.xml^#^�^#^#��^A^#^#^#WF:�^#^#Filename.txt^#��>��
^#�CK�]�r��^Q��T�^O�^#�-�j�]��FI�Ky��Ei�Je^K""!�^Qx #�*^U^?�^_�;��ħ�^LI^#$(�^Q���b��\N����t�����+������ȷgvM�^L̽�LǴL�^L��^ER��w^Ui^M��^X�Kޓ�^QJȧ��^N~��&�x�bB��D]1�^B|^G���g^SyG�����:����^_P�^T�^_�����U�|B�gH=��%Z^NY���,^U�^VI{��^S�^U�!�^Lpw�T���+�a�z�l������b����w^K��or��pH� ��ܞ�l��z�^\i=�z�:^C�^S!_ESCW��ESC""��g^NY2��s�� u���X^?�^R^R+��b^]^Ro�r���^AR�h�^D��^X^M�^]ޫ���ܰ�^]���0^?��^]�92^GhCx�DN^?
mY<{��L^Zk�^\���M�^V^HE���-Ե�$f�f����^D�e�^R:�u����� ^E^A�Ȑ�^B�^E�sZ���Yo��8Eސ�}��&JY���^A9^P������^P����~Jʭy��`�^9«�""�U� �:�}3���6�Hߧ�v���A7^Xi^L^]�sA�^Q�7�5d�^Xo˛�tY
Bp��4�Y���7DkV_���\^_q~�w�|�a�s̆���#�g�ӳu�^�!W}�n��Rgż_2�]�p�2}��b�G9�M^Q
�����:�X����bR[ԳZV!^G����^U�tq�&�Y6b��GR���s#mn6Z=^ZH^]�b��R^G�C�0R��{r1��4�#�
=r/X2�^O�����r^M�Rȕ�goG^X-����}���P+˥Qf�#��^C�Բ�z1�I�j����6�^Np���ܯ^P�[�^Tzԏ���^F2�e��\�E�߻6c�%���$�:E�*�*©t�y�J�,�S�2U�S�^X}ME�]��]�i��G�su�""��!�-��!r'ܷe_et Y^K^?0���l^A��^^�m�1/q����|�_r�5$�%�([x��W^E�G^^y���#����Z2^?ڠ�^_��^AҶ�OO��^]�vq%:j�^?�jX��\�]����^S�^^n�^C��>.^CY^O-� �_�\K����:p�<7Sֺnj���-Yk�r���^Q^M�n�J^B��^Z0^?�(^C��^W³!�g�Z�~R�A^M�^O^^�%;��Ԗ�p^S�w���*m^S���jڒ|�����<�^S�;Z^^Fc�1���^O�G_o����8��CS���w��^?��n�2~��m���G;��rx4�(�]�'��^E���eƧ�x��.�w�9WO�^^�י3��0,�y��H�Y�.H�x�""'���h}灢^T�Gm;^XE�̼�J��c�^^񾠫;�^A�qZ1ׁBZ^Q�^A^FB�^QbQ�_�3|ƺ�EvZ���^S�w���^P���9^MT��ǩY[+�+�9�Ԩ�^O�^Q���Fy(+�9p�^^Mj�2��Y^?��ڞ��^Ķb�^Z�ψMр}�ڣ�^^S�^?��^U�^Wڻ����z�^#��uk��k^^�>^O�^W�ݤO�h�^G�����Kˇ�.�R|�)-��e^G�^]�/J����U�ϴ�a���i5HO�^L�ESCg�R'���.����d���+~�}��ڝ^Y5]l�3jg54M�������2t�5^Y}�q)��^O;�X\�q^Ox~Vۗ�t�^\f� >k;^G�K5��,��X�t/�ǧ^G""5��4^MiΟ�n��^B^]�|�����V��ߌ֗Q~�H���8��t��5��ܗ�
�Z�^c�6N�ESCG����^_��>��t^L^R�^:�x���^]v�{^#+KM��qԎ�.^S�%&��=^W-�=�^S�����^CI���&^]_�s�˞�y�z�Jc^W�kڠ�^\��^]j�����^O��;�oY^^�^V59;�c��^B��T�nb����^C��^N��s�x�<{�9-�F�T�^N�5�^Se-���^T�Y[���`^ZsL��v�բ<C�+�~�^ۚ��""�Yκ2^_�^VxT�>��/ݳ^U�m�^#���3^Ge�n^Vc�V�^#�NVn�,�q��^^^]gy�R�S��Ȃ$���>A�d����xg�^GB3�M�J�^QJ^]�^\�{.�D��碎�^W�8a����qޠl?,'^R�^X�Cgy�P[����mڞ��H�Z�s�SD&蠤�s�E��nu�O#O<��3wj`C-%w�W�J�^WP^T�^]r^NT�TC�Lq�Z�f�!�;�l�Y��Gb��>�ud�hx�Ԭ^N)9�^N!k�҉s�35v������.�""^]��~4������۴�Z^]u�^Ti^^�i:�)K��P᳕!�#�^?�>��EE^VE-u�^SgV^L��<��^D�O<�+�J.�c�Z#>�.l����^S�
ESC��(��E�j�π쬖���2{^U&b\��P^S�`^O^XdL�^ 6bu��FD��^#^#^#^#","field_x, data",field_y,field_z
Expected output would be an array
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","field_x, data",field_y",field_z")
Or, but this is probably another question, such an array (like running strings on the octet-stream field):
("7","url","user","02/24/2015 02:29:00 AM","03/22/2015 03:12:36 PM","octet-stream","27156","Info.txt List.xml Filename.txt","field_x, data",field_y",field_z")
Edit 2: Every file that has a binary field also contains a length field for it. So instead of splitting directly I can walk left to right through my record and extract the fields. This is certainly a great improvement over my current situation, but problem 1) still persists. How can I split those files reliably?
I took a closer look at the files and a header looks like this:
RecordId, Field_A, Content_Type, Content_Length, Content, Field_B
(Where Content_Type can be "octet-stream", Content_Length is the number of bytes in the Content field, and Content obviously the data.) Luckily for me, the value of Field_B is predictable; let's assume for a certain file it's always "Hello World".
So instead of relying on Spark's default behaviour of splitting on newlines, how can I make Spark split only on newlines that follow "Hello World"? (I also edited the question title since the focus of the question changed.)
As answered in Spark: Reading files using different delimiter than new line, I used textinputformat.record.delimiter to split on "Hello World\n" because I am a bit lucky that the last column always contains the same value. After that I simply walk left to right through the record and when I reach the length field I skip the next n bytes. Everything works now. Thanks for pointing me in the right direction.
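For reference, a minimal Scala sketch of that approach (the file path is a placeholder, the "Hello World\n" delimiter is the assumed last-column value from the question, and sc is the usual SparkContext):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// copy the existing Hadoop configuration and override the record delimiter
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "Hello World\n")

// each element is now one logical record, even if it spans several physical lines
val records = sc
  .newAPIHadoopFile("/path/to/records.csv", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, line) => line.toString }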
There are two problems: 1) The octet-streams can contain newlines which make textFile() split rows which should be one, and 2) The octet-streams contain commas and/or double quotes which are not escaped and mess up my schema.
Well, actually that csv file is properly escaped:
the multiline field is enclosed in double quotes: "MSCF^# .. ^#^#" (which also handles possible separators inside the field)
double quotes inside the field are escaped with another double quote as it should be: Je^K""!
Of course a simple split will not work in this case (and should never be used on csv data), but any csv reader able to handle multiline fields should parse that data correctly.
Also keep in mind that the double quotes inside the octet-stream have to be unescaped, or that data won't be valid (another reason to use a CSV reader that handles this rather than a plain split).
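If a newer Spark version is an option, its built-in CSV reader is one such multiline-aware parser; a hedged sketch (the multiLine option needs Spark 2.2 or later, and the path is again a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("multiline-csv").getOrCreate()

val df = spark.read
  .option("multiLine", true)   // quoted fields may span several lines
  .option("quote", "\"")       // fields are enclosed in double quotes
  .option("escape", "\"")      // "" inside a field is read as an escaped quote
  .csv("/path/to/records.csv")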

Identifying hidden characters in text

I have an ETL process that regularly extracts code from an ODBC data source, manipulates it, and inserts it into my postgres database. One of the columns from this data source regularly has odd characters in it.
For the most part I can catch and convert all of the characters appropriately, but I have one character that exists in the ODBC data source, cannot be brought into postgres (all of the text after that character gets truncated), and I'm having a hard time identifying what the character is.
I can't even insert an example of the character directly into this post because it gets stripped out :/ The closest I can get is a screenshot of the character in TextMate (the only application in which I can actually see the character):
The character is the diamond between the 1 and the 0. When my data comes in, everything after the 0 is truncated.
Is there a good way of identifying what this character is so I can figure out a way of stripping it out?
Per tripleee's comment on the original question post:
To identify the character, I grabbed a hex dump of the text so I could see the hex value of the offending character.
There are a number of ways to do this, but the quickest for me was a utility application I have called HexFiend to dump the text into. Once the text was in and I highlighted the character, it returned the hex value "00".
A bit more investigation pointed towards the hex null value being used as a string terminator in C applications (which makes sense given the context of my project).
I've added a step to my ETL process that switches this null value out for a newline, and now everything is sunshine and daisies.
Thanks again for the help!
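As an aside, for characters that do make it into the database, Postgres itself can show the byte-level hex of a value; a small illustration, not part of the asker's actual workflow:

-- a newline (chr(10)) shows up as 0a in the hex dump
SELECT encode(convert_to('ab' || chr(10) || 'c', 'UTF8'), 'hex');   -- 61620a63
-- chr(0) itself is rejected ("null character not permitted"),
-- which is exactly why the NUL byte has to be stripped out before the insert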