PostgreSQL COPY with text value containing \0 (backslash zero)

Setup: PostgreSQL Server 9.3 - OS: CentOS 6.6
I'm attempting to bulk insert 250 million records into a PostgreSQL 9.3 server using the COPY command. The data is in delimited format with a pipe '|' as the delimiter.
Almost all columns in the table I'm copying to are TEXT datatypes. Unfortunately, out of the 250 million records, there are about 2 million that have legitimate textual values containing a "\0" in the text.
Example entry:
245150963|DATASOURCE|736778|XYZNR-1B5.1|10-DEC-1984 00:00:00|||XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1|pH|Physical|Water|XYZNR|Estuary
As you can see, the 8th column has a legitimate \0 in its value.
XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1
No matter how I escape this, the COPY command either converts this \0 into an actual null byte (0x00) or fails with "ERROR: invalid byte sequence for encoding "UTF8": 0x00".
I have tried using "sed -i" to replace the \0 with:
\\0
\\\0
'\0'
\'\'0
\\\\\0
... and many others I can't remember and none of them work.
What would be the correct escaping of these types of strings?
Thanks!

Per Postgres doc on COPY:
Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters.
In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character.
Try to convert all your backslash characters in that path in the field to \\, not just the \0.
FYI, \b is not a shorthand for backslash here; in COPY's text format it stands for backspace, so only the doubled-backslash form works:
XYZNR-1B5.1\\1984-12-10\\0.5\\1\\ASDF1

The one you needed was the one example you didn't give:
sed -e 's/\\/\\\\/g'
You want this for all occurrences of \, not just for \0.
From the perspective of the file and Postgres, we're trying to convert \ to \\.
In sed, \ is a special character which we need to self escape, so \ becomes \\, and \\ becomes \\\\, hence the above expression.
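Put together, a minimal sketch of the whole fix, assuming the raw data sits in data.psv and the target table is my_table (both names are placeholders, not from the question):
sed -e 's/\\/\\\\/g' data.psv > data_escaped.psv
Then, from psql:
\copy my_table FROM 'data_escaped.psv' WITH (FORMAT text, DELIMITER '|')
The sed step doubles every backslash so that COPY's text format reads each \\ back as a single literal backslash, and the \copy step loads the escaped file using the pipe delimiter from the question.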

Have you confirmed that your sed command is actually giving you \\0?

Related

PostgreSQL COPY failing with value too long for type character varying(255)

I am trying to copy a CSV file with each column enclosed in double quotes. The COPY statement that I am using is:
\COPY schema.table_name FROM 'file_name.csv' WITH (FORMAT CSV, DELIMITER ',', ESCAPE '\', QUOTE '"', FORCE_NULL(col1,col2,col3,col4))
The column 'col3' is defined as VARCHAR(255). One of the rows in the CSV file has a string containing the literal two-character sequence '\n' repeated 4 times, making the string 259 characters long. I was expecting each occurrence of '\n' to be converted into a newline character, making the string 255 characters long. However, the COPY statement is failing with the below error:
ERROR: value too long for type character varying(255)
COPY schema.table_name, line 11859, column col3: "RN recommends 3HX3D (Mon, Wed, Fri) of PCA/ service.\nincontinence supplies\nPERS new\nConsumer stat..."
When I change the column length to 260, the COPY works fine. I cannot change the length of the column as I won't know how many '\n' literals may come in this column. Is this a bug in PostgreSQL? If not, how can this be fixed? I am using PostgreSQL version 12.3.
Thanks in advance!
The answer provided by @wildplasser works fine. I removed the size limit from the VARCHAR column and the data loaded successfully.
CSV files do not use \ escaping (except possibly for the quote character itself). You escape a newline by having it appear literally in the file, but included inside the quotes. So if you make the type long enough to hold the data, it will still be holding a literal sequence of \ and n, not newlines.
So this is not a bug in PostgreSQL, but rather a bug in your file data.
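As an illustration of what that means for the file itself, a field that should contain real newlines simply spans several physical lines inside its quotes, instead of containing the two characters \ and n (the surrounding values here are invented):
"1","a","RN recommends 3HX3D (Mon, Wed, Fri) of PCA/ service.
incontinence supplies
PERS new","z"
Written that way, COPY with FORMAT CSV stores single newline characters, so each line break counts as one character toward the VARCHAR(255) limit rather than two.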

Postgres COPY multibyte delimiter

I have a large body of text and other data that I need to import into Postgres. This text contains all the possible single-byte characters. This means I can't choose ",", ";", "-" or any other single-byte character as a delimiter in my CSV file because it would be confused by the text that contains it.
Is there any way to choose a multibyte character as a delimiter, use multiple characters as a delimiter, or use the COPY command in some other way to solve this?
Command I'm using:
COPY site_articles(id,url,title,content) FROM '/home/sites/site_articles.csv' DELIMITER '^' CSV;
This means I can't choose ",", ";", "-" or any other single-byte character as a delimiter in my CSV file because it would be confused by the text that contains it.
CSV has an escaping mechanism. Use it. Quote strings that contain the delimiter character (,), and if the quoted string contains the quote character, double the quote character.
e.g. if you want to represent the two values Fred "Wiggle" Smith and one, two, you'd do so as:
"Fred ""Wiggle"" Smith","one, two"
At the time of writing (PostgreSQL 9.5), COPY does not support multi-byte characters as delimiters. You can use 3rd-party ETL tools like Pentaho Kettle, though.
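So, a sketch of the same load keeping a plain comma delimiter and relying on quoting instead (the table and file path are taken from the question; it assumes the file quotes any field containing a comma, quote, or newline):
COPY site_articles(id,url,title,content) FROM '/home/sites/site_articles.csv' WITH (FORMAT csv, DELIMITER ',', QUOTE '"');
Any comma, quote, or newline inside content is then handled by the CSV quoting rules above, so no single-byte character needs to be reserved as a delimiter.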

How do I pass non-printable characters to BCP's field/row terminator parameters?

I have a tab delimited file I am trying to import into SQL Server 2012; the row terminator is CRLF. The following is my BCP statement in PowerShell:
bcp database.dbo.table IN C:\filePath.tsv -SserverName -UuserName -Ppassword -c -t\t -r\n
Which reports a
Unexpected EOF encountered
error.
I can't for the life of me figure out why this is not working. An extra eye would be great.
EDIT:
After review, I think the problem is with -r\n...What is the metacharacter for CRLF?
Encode it in hex:
bcp database.dbo.table IN C:\filePath.tsv -SserverName -UuserName -Ppassword -c -t0x9 -r0xa
You can use multiple characters by encoding each in hex and appending them together. For example, we use the record separator character, carriage return, and newline to separate each row, so we pass 0x1e0d0a as the value of the -r parameter.
I use ASCII Table to do quick lookups for this.
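For the CRLF row terminator asked about in the edit, the same approach (carriage return is 0x0d and line feed is 0x0a, appended together) should give something like:
bcp database.dbo.table IN C:\filePath.tsv -SserverName -UuserName -Ppassword -c -t0x9 -r0x0d0a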

Datastage-What is the escape character for # in execute command activity?

What is the escape character for # in the Execute Command activity? I was trying to replace a string in a file with "#", but DataStage treats # as a job parameter and expects a value to be assigned in the DataStage parameters. For that we need an escape character for "#". I tried using \ and / as escape characters but neither of them solved my problem. Thank you.
Just wanted to add that the actual answer to this question is:
\043
The expression mentioned previously (!/^\043/) is related to the first link @Damienknight posted.
Use awk to replace the # sign, and you can use the octal character code for # in the regular expression:
!/^\043/
There is a thread in DSXchange that discusses this.
Here is a guide on escape sequences in AWK expressions.
A table with ASCII codes, including a column for octal values.
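As a minimal sketch of that idea (the file names and the string being replaced are hypothetical, not from the question), the awk call run from the Execute Command activity can write the # via its octal code so the command text never contains a literal #:
awk '{ gsub(/OLD_STRING/, "\043"); print }' input.txt > output.txt
Here \043 is the octal escape for # inside an awk string constant, so there is no # character for DataStage to mistake for a job parameter delimiter.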

How can I get sed to remove `\` followed by anything?

I am trying to write a sed script to convert LaTeX coded tables into tab delimited tables.
To do this I need to convert & into \t and strip out anything that is preceded by \.
This is what I have so far:
s/&/\t/g
s/\*/" "/g
The first line works as intended. In the second line I try to replace \ followed by anything with a space but it doesn't alter the lines with \ in them.
Any suggestions are appreciated. Also, can you briefly explain what suggested scripts "say"? I am new to sed and that really helps with the learning process!
Thanks
Assuming you're running this as a sed script, and not directly on the command line:
s/\\.*/ /g
Explanation:
\\ - a double backslash to match a literal backslash (a single \ is interpreted as "escape the following character"), followed by .* (. - match any single character, * - arbitrarily many times).
You need to escape the backslash as it is a special character.
If you want to denote "any character" you need to use . (a period).
the second expression should be:
s/\\.//g
I hope I understood your intention correctly and you want to strip only the character after the backslash; if you want to delete all the characters in the line after the backslash, add a star (*) after the period.
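Putting both substitutions together as a sed script file, which is how the answer above assumes it is being run (the file names here are made up):
# latex2tsv.sed: turn LaTeX table cells into tab-separated fields
s/&/\t/g
s/\\.*/ /g
Run it with:
sed -f latex2tsv.sed table.tex > table.tsv
Note that \t in the replacement is understood by GNU sed; a strictly POSIX sed may need a literal tab character typed there instead.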