I've been given some CSV data I need to load into a PostgreSQL table with the following format:
7227, {text with, commas}, 10.0, 3.0, text with no commas
I want my table rows to appear as:
7227 | text with, commas | 10.0 | 3.0 | text with no commas
How can I use COPY and get it to ignore the commas between the braces?
I'm trying to avoid pre-processing the file, although that is an option.
I'm afraid you'll have to edit the file. This format should be ok:
7227, "text with, commas", 10.0, 3.0, text with no commas
Alternatively (this backslash escaping applies to COPY's default text format, not CSV format):
7227, text with\, commas, 10.0, 3.0, text with no commas
according to this principle in the documentation:
Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters.
In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character.
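If editing by hand isn't practical, the pre-processing can be a one-liner. A minimal sketch, assuming the braces only ever delimit that one field and never appear inside it (file and table names are hypothetical):

```shell
# Turn ", {" into ',"' and the closing "}" into '"' so the field is
# properly CSV-quoted, with the quote immediately after the delimiter.
sed 's/, {/,"/; s/}/"/' data.csv > data_quoted.csv
# then: psql -c "\copy mytable FROM 'data_quoted.csv' WITH (FORMAT csv)"
```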
Related
I am trying to copy a csv file with each column enclosed in double quotes. The COPY statement that I am using is:
\COPY schema.table_name FROM 'file_name.csv' WITH (FORMAT CSV, DELIMITER ',', ESCAPE '\', QUOTE '"', FORCE_NULL(col1,col2,col3,col4))
The column 'col3' is defined as VARCHAR(255). One of the rows in the csv file has a string with the literal '\n' repeated 4 times, making the string 259 characters long. I am expecting each occurrence of '\n' to be converted into a newline character, making the string 255 characters long. However, the COPY statement fails with the error below:
ERROR: value too long for type character varying(255)
COPY schema.table_name, line 11859, column col3: "RN recommends 3HX3D (Mon, Wed, Fri) of PCA/ service.\nincontinence supplies\nPERS new\nConsumer stat..."
When I change the column length to 260, the COPY works fine. I cannot change the length of the column, as I won't know how many '\n' literals may come in this column. Is this a bug in PostgreSQL? If not, how can it be fixed? I am using PostgreSQL version 12.3.
Thanks in advance!
The answer provided by @wildplasser works fine. I removed the size limit on the VARCHAR column and the data loaded successfully.
CSV files do not use \ escaping (except possibly for the quote character itself). You escape a newline by having it appear literally in the file, inside the quotes. So even if you make the type long enough to hold the data, it will still hold a literal sequence of \ and n, not newlines.
So this is not a bug in PostgreSQL, but rather a bug in your file data.
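If you do want those literal \n sequences to become real newlines, one option is to rewrite them before loading; an embedded newline is legal CSV as long as the field is quoted. A sketch using GNU sed, assuming every backslash-n in the file really is meant to be a newline:

```shell
# GNU sed: "\\n" in the pattern matches a literal backslash followed by n;
# "\n" in the replacement inserts a real newline inside the quoted field.
sed 's/\\n/\n/g' input.csv > fixed.csv
```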
Question (because I can't work it out), should ""hello world"" be a valid field value in a CSV file according to the specification?
i.e should:
1,""hello world"",9.5
be a valid CSV record?
(If so, then the Perl CSV-XS parser I'm using is mildly broken, but if not, then $line =~ s/\342\200\234/""/g; is a really bad idea ;) )
The weird thing is that this code has been running without issue for years, but we've only just hit a record that started with a left double quote and contained no comma (the above is from a CSV pre-parser).
The canonical format definition of CSV is https://www.rfc-editor.org/rfc/rfc4180.txt. It says:
Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
The last rule means your line should have been:
1,"""hello world""",9.5
But not all parsers/generators follow this standard perfectly, so you might need to relax some rules for interoperability. It all depends on how much you control the CSV-writing and CSV-parsing sides.
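The doubling rule is mechanical enough to script. A small sketch that quotes one field value per RFC 4180 (the variable names are hypothetical):

```shell
field='"hello world"'
# Double every embedded quote, then wrap the whole field in quotes.
escaped=$(printf '%s' "$field" | sed 's/"/""/g')
printf '1,"%s",9.5\n' "$escaped"
# → 1,"""hello world""",9.5
```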
That depends on the escape character you use. If your escape character is '"' (double quote) then your line should look like
1,"""hello world""",9.5
If your escape character is '\' (backslash) then your line should look like
1,"\"hello world\"",9.5
Check your parser/environment defaults or explicitly configure your parser with the escape character you need e.g. to use backslash do:
my $csv = Text::CSV_XS->new ({ quote_char => '"', escape_char => "\\" });
I am trying to import a csv file to MongoDB in cmd using mongoimport.
Some of my csv fields contain a single "double quotes" like so:
Dave, 25, 406-836-3336, "51 Ashleigh St, 20141123
I would like them either to be ignored, or imported as empty string.
I don't care for the address field really. I don't care how it will be imported as no operations will be made on it.
All I really care is that all the rows will be imported.
Replace each double quote with two double quotes:
Dave, 25, 406-836-3336, ""51 Ashleigh St, 20141123
From mongoimport docs:
The csv parser accepts data that complies with RFC 4180. As a result, backslashes are not a valid escape character. If you use double quotes to enclose fields in the CSV data, you must escape internal double-quote marks by prepending another double quote.
Replace with sed (the g flag doubles every quote on a line, not just the first):
sed 's/"/""/g' test.csv > test2.csv
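A quick check against the problem row (assuming no field is already properly quoted, so doubling every quote is safe):

```shell
printf '%s\n' 'Dave, 25, 406-836-3336, "51 Ashleigh St, 20141123' \
  | sed 's/"/""/g'
# → Dave, 25, 406-836-3336, ""51 Ashleigh St, 20141123
```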
I have a large body of text and other data that I need to import into Postgres. This text contains all the possible single-byte characters. This means I can't choose ",", ";", "-" or any other single-byte character as a delimiter in my CSV file because it would be confused by the text that contains it.
Is there any way to choose a multibyte character as a delimiter, use multiple characters as a delimiter, or use the COPY command in some other way to solve this?
Command I'm using:
COPY site_articles(id,url,title,content) FROM '/home/sites/site_articles.csv' DELIMITER '^' CSV;
This means I can't choose ",", ";", "-" or any other single-byte character as a delimiter in my CSV file because it would be confused by the text that contains it.
CSV has an escaping mechanism. Use it. Quote strings that contain the delimiter character (,), and if the quoted string contains the quote character, double the quote character.
e.g. if you want to represent two values Fred "wiggle" Smith and one, two, you'd do so as:
"Fred ""Wiggle"" Smith","one, two"
At the time of writing (PostgreSQL 9.5), COPY does not support multi-byte delimiter characters. You can use third-party ETL tools like Pentaho Kettle, though.
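Another option that avoids ETL tools: rewrite the multi-character delimiter to a rarely-used single-byte character before loading. A sketch assuming a hypothetical source file delimited by ||, using GNU sed's \xHH escape:

```shell
# The ASCII unit separator (0x1F) is a single byte and almost never
# occurs in free text, so it is a safe COPY delimiter.
sed 's/||/\x1f/g' site_articles.src > site_articles.usv
# then in psql:
#   \copy site_articles(id,url,title,content) FROM 'site_articles.usv' WITH (DELIMITER E'\x1f')
```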
Setup: PostgreSQL server 9.3 - OS: CentOS 6.6
Attempting to bulk insert 250 million records into a PostgreSQL 9.3 server using the COPY command. The data is in delimited format, using a pipe '|' as the delimiter.
Almost all columns in the table that I'm copying to are TEXT datatypes. Unfortunately, out of the 250 million records, there's about 2 million that have legitimate textual values with a "\0" in the text.
Example entry:
245150963|DATASOURCE|736778|XYZNR-1B5.1|10-DEC-1984 00:00:00|||XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1|pH|Physical|Water|XYZNR|Estuary
As you can see, the 8th column has a legitimate \0 in its value.
XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1
No matter how I escape this, the COPY command either converts this \0 into an actual 0x00 byte or fails with "ERROR: invalid byte sequence for encoding "UTF8": 0x00".
I have tried replacing the \0 with "sed -i" with:
\\0
\\\0
'\0'
\'\'0
\\\\\0
... and many others I can't remember and none of them work.
What would be the correct escaping of these types of strings?
Thanks!
Per Postgres doc on COPY:
Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters.
In particular, the following characters must be preceded by a
backslash if they appear as part of a column value: backslash itself,
newline, carriage return, and the current delimiter character.
Try converting all the backslash characters in that field to \\, not just the \0.
(Note: in COPY's text format, \b is the escape for backspace, not a shorthand for backslash, so it won't help here.)
So this should work:
XYZNR-1B5.1\\1984-12-10\\0.5\\1\\ASDF1
The one you needed was the one example you didn't give:
sed -e 's/\\/\\\\/g'
You want this for all occurrences of \, not just for \0.
From the perspective of the file and Postgres, we're trying to convert \ to \\.
In sed, \ is a special character which we need to self escape, so \ becomes \\, and \\ becomes \\\\, hence the above expression.
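To see it on the sample value (printf '%s' keeps the shell from interpreting the backslashes):

```shell
printf '%s\n' 'XYZNR-1B5.1\1984-12-10\0.5\1\ASDF1' \
  | sed -e 's/\\/\\\\/g'
# → XYZNR-1B5.1\\1984-12-10\\0.5\\1\\ASDF1
```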
Have you confirmed that your sed command is actually giving you \\0?