I have a large body of text and other data that I need to import into Postgres. This text contains all the possible single-byte characters. This means I can't choose ",", ";", "-" or any other single-byte character as a delimiter in my CSV file because it would be confused by the text that contains it.
Is there any way to choose a multibyte character as a delimiter, use multiple characters as a delimiter, or use the COPY command in some other way to solve this?
Command I'm using:
COPY site_articles(id,url,title,content) FROM '/home/sites/site_articles.csv' DELIMITER '^' CSV;
This means I can't choose ",", ";", "-" or any other single-byte character as a delimiter in my CSV file because it would be confused by the text that contains it.
CSV has an escaping mechanism. Use it. Quote strings that contain the delimiter character (,), and if the quoted string contains the quote character, double the quote character.
e.g. if you want to represent the two values Fred "Wiggle" Smith and one, two, you'd do so as:
"Fred ""Wiggle"" Smith","one, two"
At the time of writing (9.5), COPY does not support multi-byte characters as delimiters. You can use 3rd party ETL tools like Pentaho Kettle, though.
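The quoting rule described above (wrap any field that contains the delimiter in quotes, and double any embedded quote character) is what most CSV libraries do by default. A minimal sketch using Python's standard csv module; the helper name to_csv_line is hypothetical and only serves the illustration:

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one record; the csv module quotes only fields that need it."""
    buf = io.StringIO()
    # Default quoting is QUOTE_MINIMAL with doubled embedded quotes.
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()


line = to_csv_line(['Fred "Wiggle" Smith', "one, two"])
print(line, end="")

# Reading the line back recovers the original two values.
print(next(csv.reader([line])))
```

A file produced this way can keep "," as the delimiter even when the text contains commas, since the quoting removes the ambiguity.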
I am trying to import a text file into a single-column table, i.e. I don't want a single line of the source file to be split into columns. The file contains many different characters (tabs, commas, spaces) that could be recognized as delimiters. Since bell (CHR(7)) doesn't exist in the data file, I chose it as the delimiter:
COPY data_table(single_column) FROM '/tmp/data' WITH ENCODING 'LATIN1' DELIMITER CHR(7);
Unfortunately, this results in an error:
ERROR: syntax error at or near "chr"
What would be the correct syntax?
You can't use a function there. Use the escape notation.
DELIMITER E'\007'
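Before picking a control character such as bell, it may be worth confirming that it really is absent from the data. A rough sketch, assuming the file fits in memory as bytes; unused_control_bytes and as_escape_literal are hypothetical helper names, and the candidate range (control characters other than newline and carriage return) is an assumption made for this example:

```python
def unused_control_bytes(data: bytes):
    """Return control bytes (1..31, except LF and CR) that never occur in data."""
    present = set(data)
    return [b for b in range(1, 32) if b not in present and b not in (10, 13)]


def as_escape_literal(byte_value: int) -> str:
    """Format a byte as an octal escape-string literal, e.g. 7 becomes E'\\007'."""
    return f"E'\\{byte_value:03o}'"


sample = b"some text\twith tabs, commas; and more\n"
candidates = unused_control_bytes(sample)
print(as_escape_literal(candidates[0]))
```

The resulting literal can then be pasted after DELIMITER in the COPY statement.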
Question (because I can't work it out), should ""hello world"" be a valid field value in a CSV file according to the specification?
i.e should:
1,""hello world"",9.5
be a valid CSV record?
(If so, then the Perl CSV-XS parser I'm using is mildly broken, but if not, then $line =~ s/\342\200\234/""/g; is a really bad idea ;) )
The weird thing is that this code has been running without issue for years, but we've only just hit a record that started with a left double quote and contained no comma (the above is from a CSV pre-parser).
The canonical format definition of CSV is https://www.rfc-editor.org/rfc/rfc4180.txt. It says:
Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
The last rule means your line should have been:
1,"""hello world""",9.5
But not all parsers and generators follow this standard perfectly, so you may need to relax some rules for interoperability. It all depends on how much control you have over the parts that write and parse the CSV.
That depends on the escape character you use. If your escape character is '"' (double quote) then your line should look like
1,"""hello world""",9.5
If your escape character is '\' (backslash) then your line should look like
1,"\"hello world\"",9.5
Check your parser/environment defaults, or explicitly configure your parser with the escape character you need. For example, to use backslash:
my $csv = Text::CSV_XS->new ({ quote_char => '"', escape_char => "\\" });
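The two escaping conventions can be compared side by side with a parser that supports both. The sketch below uses Python's standard csv module rather than Text::CSV_XS, purely as an illustration; the variable names are made up for the example:

```python
import csv

# Convention 1: embedded quotes are doubled (RFC 4180 style).
doubled = '1,"""hello world""",9.5'
# Convention 2: embedded quotes are preceded by a backslash.
backslashed = r'1,"\"hello world\"",9.5'

row_doubled = next(csv.reader([doubled]))
row_backslashed = next(
    csv.reader([backslashed], doublequote=False, escapechar="\\")
)

print(row_doubled)
print(row_backslashed)
```

Both lines decode to the same three fields, with the middle field keeping its surrounding quote characters, as long as the parser is told which convention the file uses.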
Setup: Postgresql Server 9.3 - OS: CentOS 6.6
Attempting to bulk insert 250 million records into a Postgresql 9.3 server using the COPY command. The data is in delimited format using a pipe '|' as the delimiter.
Almost all columns in the table that I'm copying to are TEXT datatypes. Unfortunately, out of the 250 million records, there are about 2 million that have legitimate textual values with a "\0" in the text.
Example entry:
245150963|DATASOURCE|736778|XYZNR-1B5.1|10-DEC-1984 00:00:00|||XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1|pH|Physical|Water|XYZNR|Estuary
As you can see, the 8th column has a legitimate \0 in its value.
XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1
No matter how I escape this, the COPY command will either convert this \0 into an actual "\x0" or the COPY command fails with "ERROR: invalid byte sequence for encoding "UTF8": 0x00".
I have tried replacing the \0 with "sed -i" with:
\\0
\\\0
'\0'
\'\'0
\\\\\0
... and many others I can't remember and none of them work.
What would be the correct escaping of these types of strings?
Thanks!
Per Postgres doc on COPY:
Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters.
In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character.
Try to convert all your backslash characters in that path in the field to \\, not just the \0.
Note that \b is not a shorthand for backslash: in COPY text format it stands for a backspace character (ASCII 8), so it would not preserve the value.
So the path should be written as:
XYZNR-1B5.1\\1984-12-10\\0.5\\1\\ASDF1
The one you needed was the one example you didn't give:
sed -e 's/\\/\\\\/g'
You want this for all occurrences of \, not just for \0.
From the perspective of the file and Postgres, we're trying to convert \ to \\.
In sed, \ is a special character that has to be escaped with itself, so \ becomes \\, and \\ becomes \\\\, hence the above expression.
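The same transformation can be done per field when the export file is generated, which avoids touching the delimiters themselves. A minimal sketch, assuming COPY text format with "|" as the delimiter; escape_copy_text is a hypothetical helper, not part of any Postgres client library:

```python
def escape_copy_text(value: str, delimiter: str = "|") -> str:
    """Escape one column value for COPY text format."""
    # Backslashes must be doubled first, so later escapes are not doubled again.
    out = value.replace("\\", "\\\\")
    # Newline and carriage return become their two-character escapes.
    out = out.replace("\n", "\\n").replace("\r", "\\r")
    # The delimiter itself is protected with a preceding backslash.
    return out.replace(delimiter, "\\" + delimiter)


raw = r"XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1"
print(escape_copy_text(raw))
```

Applied to the eighth column of the example row, this yields the doubled-backslash form shown earlier.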
Have you confirmed that your sed command is actually giving you \\0?
I am importing data into PostgreSQL with this command:
COPY codigos_postales
(CPRO, CMUN, Nombre_Municipio, CP, Municipio_CP, Lugar_CP)
FROM 'path' WITH DELIMITER E'/t';
But I got this error:
ERROR: COPY delimiter must be a single-byte character
If you're trying to specify a tab as the delimiter, you want E'\t' (the escape character is a backslash, not a forward slash) or just a literal tab character between the quotes.
You can see that with:
regress=> SELECT E'\t' AS backslash, E'/t' AS forwardslash;
backslash | forwardslash
-----------+--------------
| /t
(1 row)
If the delimiter is actually the string /t then you won't be able to use COPY, as it only supports single-byte delimiters.
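The single-byte requirement is easy to check before running the import. The sketch below only approximates the server-side rule (one byte in the chosen encoding, and not a line terminator); is_single_byte_delimiter is a hypothetical name, and the real COPY command applies further restrictions that are not modeled here:

```python
def is_single_byte_delimiter(delim: str, encoding: str = "utf-8") -> bool:
    """Rough pre-check: the delimiter must encode to exactly one byte."""
    if delim in ("\n", "\r"):
        return False
    return len(delim.encode(encoding)) == 1


print(is_single_byte_delimiter("\t"))      # a real tab character
print(is_single_byte_delimiter("/t"))      # slash followed by the letter t
print(is_single_byte_delimiter("\u00a6"))  # broken bar, two bytes in UTF-8
```

This makes the difference between E'\t' and E'/t' visible: the first is one byte, the second is two.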
Your delimiter is two characters long, not a single-byte character. Try E'\t' instead.
I'm trying to import a tab-delimited file into my PostgreSQL database. One of the fields in my file is a "title" field, which occasionally contains actual quotation marks. For example, my tsv might look like:
id title
5 Hello/Bleah" Foo
(Yeah, there's just that one quotation mark in the title.)
When I try importing the file into my database:
copy articles from 'articles.tsv' with delimiter E'\t' csv header;
I get this error, referencing that line:
ERROR: unterminated CSV quoted field
How do I fix this? Quotation marks are never used to surround entire fields in the file. I tried copy articles from 'articles.tsv' with delimiter E'\t' escape E'\\' csv header; but I get the same error on the same line.
Assuming the file never actually tries to quote its fields:
The option you want is "with quote", see http://www.postgresql.org/docs/8.2/static/sql-copy.html
Unfortunately, I'm not sure how to turn off quote processing altogether. One kludge would be to specify a quote character that does not appear in your file at all.
Tab-separated is the default format for COPY statements. Treating the file as CSV is just silly. (Do you take this path just to skip the header?)
copy articles from 'articles.tsv';
does exactly what you want.
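If the csv header form is still wanted, for example to skip the header row, another option is to rewrite the file so that stray quotation marks are properly escaped first. A minimal sketch with Python's standard csv module, assuming the input never uses quotes to wrap fields; requote_tsv is a hypothetical helper name:

```python
import csv
import io


def requote_tsv(raw: str) -> str:
    """Re-emit tab-separated text with CSV quoting applied where needed."""
    # QUOTE_NONE makes the reader treat quote characters as ordinary data.
    reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    # The writer quotes fields containing a quote and doubles the embedded quote.
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


raw = 'id\ttitle\n5\tHello/Bleah" Foo\n'
fixed = requote_tsv(raw)
print(fixed, end="")
```

The rewritten title field is wrapped in quotes with its embedded quote doubled, which is the form CSV mode expects.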
I struggled with the same error and a few more. After gathering knowledge from a few SO questions, I came up with the following setup, which makes COPY TO/FROM work even for quite sophisticated JSON columns:
COPY your_schema_name.your_table_name (your, column_names, here)
FROM STDIN WITH CSV DELIMITER E'\t' QUOTE E'\b' ESCAPE E'\\';
--here rows data
\.
The most important parts:
QUOTE E'\b' - quote with backspace (thanks a lot #grautur!)
DELIMITER E'\t' - delimiter with tabs
ESCAPE E'\\' - and escape with a backslash