CSV specification - double quotes at the start and end of fields - perl

Question (because I can't work it out), should ""hello world"" be a valid field value in a CSV file according to the specification?
i.e., should:
1,""hello world"",9.5
be a valid CSV record?
(If so, then the Perl CSV-XS parser I'm using is mildly broken, but if not, then $line =~ s/\342\200\234/""/g; is a really bad idea ;) )
The weird thing is that this code has been running without issue for years; we've only just hit a record that both starts with a left double quote and contains no comma (the substitution above is from a CSV pre-parser).

The canonical format definition of CSV is https://www.rfc-editor.org/rfc/rfc4180.txt. It says:
Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
The last rule means your line should have been:
1,"""hello world""",9.5
However, not all parsers and generators follow this standard perfectly, so you may need to relax some rules for interoperability. It all depends on how much control you have over the code that writes and parses the CSV.
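As a rough cross-check of the escaping rule, here is a minimal sketch using Python's standard csv module (not the Perl parser from the question). Its default dialect doubles embedded quotes in the way the RFC text above describes. The helper names write_record and read_record are made up for this illustration.

```python
import csv
import io

def write_record(fields):
    """Serialize one record with the csv module's default dialect."""
    buf = io.StringIO()
    csv.writer(buf).writerow(fields)
    # Drop the line terminator so only the record itself is returned.
    return buf.getvalue().rstrip("\r\n")

def read_record(line):
    """Parse one record back into a list of field strings."""
    return next(csv.reader([line]))

# The middle field really contains the two double-quote characters.
line = write_record(["1", '"hello world"', "9.5"])
print(line)
print(read_record(line))
```

The writer emits the triple-quote form shown above, and reading it back restores the field with its surrounding quotes intact.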

That depends on the escape character you use. If your escape character is '"' (double quote) then your line should look like
1,"""hello world""",9.5
If your escape character is '\' (backslash) then your line should look like
1,"\"hello world\"",9.5
Check your parser/environment defaults or explicitly configure your parser with the escape character you need e.g. to use backslash do:
my $csv = Text::CSV_XS->new ({ quote_char => '"', escape_char => "\\" });
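The same distinction can be sketched with Python's standard csv module, purely as an illustration of how the escape character changes what counts as a valid line. This is not a statement about Text::CSV_XS internals, and parse_backslash_escaped is a hypothetical helper name.

```python
import csv

def parse_backslash_escaped(line):
    """Parse one record where a backslash, not a doubled quote, escapes quotes."""
    reader = csv.reader(
        [line],
        quotechar='"',
        escapechar="\\",
        doublequote=False,  # a doubled quote is no longer treated as an escape
    )
    return next(reader)

# Raw string so the backslashes reach the parser unchanged.
print(parse_backslash_escaped(r'1,"\"hello world\"",9.5'))
```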

Related

Papa Parse with backslash escaping

I have input that people will probably say "that's not really CSV", but I still have to parse it. (using Papa Parse)
comma is the delimiter. backslash is the escape. comma, double quote, backslash, r and n (to denote newlines) can all be escaped. There is no "quoting" of strings.
so... I see data like:
this is one\, field,1/2\" bolt,this is text with \\ and a new line \r\n embedded
and I want:
[0] this is one\, field
[1] 1/2\" bolt
[2] this is text with \\ and a new line \r\n embedded
but I'm getting
[0] this is one\
[1] field
[2] 1/2\" bolt
...
I can deal with the other \x things in post processing... I'd just like to get it to handle \, correctly.
I've tried the obvious values of quoteChar and escapeChar with no luck.
oh... and the Donate link is broken on https://www.papaparse.com/ if Matt Holt is listening.
const parsed = window.Papa.parse(csvText, {
escapeChar: '\\',
});
It seems the default escape character is ", but it can be overridden in the parameters.
Update: now that I look at it, this does not seem to work in your case. It only fixed an issue I had where 2.5\","Shell was being treated as a single value because " was interpreted as the escape character for ,.
I'm starting to get the feeling that the only way to escape a comma is to enclose the field in quotes.
Hope someone will post the right answer eventually...
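If the parser cannot be configured for this format, one fallback is to split the line by hand. Below is a minimal sketch in Python (not Papa Parse) of a splitter for the format described in the question: comma-delimited, backslash-escaped, no quoting. The function name split_escaped is made up, and it deliberately leaves the escape sequences in the fields, matching the desired output in the question, so they can be handled in post-processing.

```python
def split_escaped(line, delimiter=",", escape="\\"):
    """Split on unescaped delimiters, keeping escape sequences untouched."""
    fields = []
    current = []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape and i + 1 < len(line):
            # Copy the escape character and the character it protects as a pair.
            current.append(line[i:i + 2])
            i += 2
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields

sample = r'this is one\, field,1/2\" bolt,this is text with \\ and a new line \r\n embedded'
for index, field in enumerate(split_escaped(sample)):
    print(f"[{index}] {field}")
```

Consuming the escape and the following character together is what keeps an escaped backslash followed by a real comma from being misread as an escaped comma.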

YAML, Docker Compose, Spaces & Quotes

Under what circumstances must one use quotes in a YAML file, specifically when using docker-compose.
For instance,
service:
  image: "my-registry/repo:tag1"
  environment:
    ENV1: abc
    ENV2: "abc"
    ENV3: "a b c"
If spaces are required, for example, must one use quotes around the environment variable, as depicted in ENV3?
After some googling I've found a blog post
that touches this problem as I understood it.
I'll cite the most important part here:
plain scalars:
- a string
- a string with a \ backslash that doesn't need to be escaped
- can also use " quotes ' and $ a % lot /&?+ of other {} [] stuff
single quoted:
- '& starts with a special character, needs quotes'
- 'this \ backslash also does not need to be escaped'
- 'just like the " double quote'
- 'to express one single quote, use '' two of them'
double quoted:
- "here we can use predefined escape sequences like \t \n \b"
- "or generic escape sequences \x0b \u0041 \U00000041"
- "the double quote \" needs to be escaped"
- "just like the \\ backslash"
- "the single quote ' and other characters must not be escaped"
literal block scalar: |
  a multiline text
  line 2
  line 3
folded block scalar: >
  a long line split into
  several short
  lines for readability
Also I have not seen such docker-compose syntax to set env variables. Documentation suggests using simple values like
environment:
- ENV1=abc
- "ENV2=abc"
Where quotes " or ' are optional in this particular example according to what I've said earlier.
To see how to include spaces in env variables you can check out this SO answer.
Whether or not you need quotes, depends on the parser. Docker-compose AFAIK is still relying on the PyYAML module and that implements most of YAML 1.1 and has a few quirks of its own.
In general you only need to quote what could otherwise be misinterpreted or clash with some YAML construct that is not a scalar string. You also need (double) quotes for things that cannot be represented in plain scalars, single quoted scalars or block style literal or folded scalars.
Misinterpretation
You need to quote strings that look like some of the other data structures:
booleans: "True", "False", but PyYAML also assumes alternative words like "Yes", "No", "On", "Off" represent boolean values (and the all-lowercase and all-uppercase versions should be considered as well). Please note that the YAML 1.2 standard removed references to these alternatives.
integers: this includes strings consisting of digits only, but also hex (0x123) and octal numbers (0123). Octals in YAML 1.2 are written as 0o123, which PyYAML doesn't support; however, it is best to quote both forms.
A special kind of integer that PyYAML still supports, but which is again not in the YAML 1.2 specification, is the sexagesimal: a base-60 number separated by colons (:). Time indications, but also MAC addresses, can be interpreted as such if the values between/after the colons are in the range 00-59.
floats: strings like 1E3 (with optional sign and mantissa) should be quoted. Of course 3.14 needs to be quoted as well if it is meant to be a string. Sexagesimal floats (with a mantissa after the number following the final colon) should be quoted as well.
timestamps: 2001-12-15T02:59:43.1Z but also iso-8601 like strings should be quoted to prevent them from being interpreted as timestamps
The null value is written as the empty string, as ~ or Null (in all casing types), so any strings matching those need to be quoted.
Quoting in the above can be done with either single or double quotes, or block style literal or folded scalars can be used. Please note that for the block-style you should use |- resp. >- in order not to introduce a trailing newline that is not in the original string.
Clashes
YAML assigns special meaning to certain characters or character combinations. Some of these only have special meaning at the beginning of a string, others only within a string.
characters from the set !&*?{[ normally indicate special YAML constructs. Some of these might be disambiguated depending on the following character, but I would not rely on that.
whitespace followed by # indicates an end-of-line comment
wherever a key is possible (and within block mode that is in many places) the combination of colon + space (: ) indicates that a value will follow. If that combination is part of your scalar string, you have to quote it.
As with the misinterpretation you can use single or double quoting or block-style literal or folding scalars. There can be no end-of-line comments beyond the first line of a block-style scalar.
PyYAML can additionally get confused by any colon + space within a plain scalar (even when this is in a value) so always quote those.
Representing special characters
You can insert special characters or Unicode code points in a YAML file, but if you want these to be clearly visible in all cases, you might want to use escape sequences. In that case you have to use double quotes, as this is the only mode that allows backslash escapes, e.g. \u2029. A full list of such escapes can be taken from the standard, but note that PyYAML doesn't implement e.g. \/ (or at least did not when I forked that library).
One trick to find out what to quote or not is to use the library used to dump the strings that you have. My ruamel.yaml and PyYAML used by docker-compose, when potentially dumping a plain scalar, both try to read back (yes, by parsing the result) the plain scalar representation of a string and if that results in something different than a string, it is clear quotes need to be applied. You can do so too: when in doubt write a small program dumping the list of strings that you have using PyYAML's safe_dump() and apply quotes anywhere that PyYAML does.
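The "small program" suggested above could look roughly like the following sketch. It assumes PyYAML is installed (it is a third-party package, imported as yaml); the helper name dump_strings and the sample strings are made up for illustration.

```python
import yaml  # third-party: PyYAML

def dump_strings(strings):
    """Dump a list of strings in block style so each one lands on its own line."""
    return yaml.safe_dump(list(strings), default_flow_style=False)

# Strings that PyYAML leaves plain need no quotes; quoted ones do.
candidates = ["abc", "yes", "123", "a b c"]
print(dump_strings(candidates))
```

Wherever the dumped line shows quotes around a value, a plain scalar would have been read back as something other than that string, so the same quotes belong in your own file.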

unterminated CSV quoted field in Postgres

I'm trying to insert some data into my table using the copy command :
copy otype_cstore from '/tmp/otype_fdw.csv' delimiter ';' quote '"' csv;
And I get this error:
ERROR: unterminated CSV quoted field
This is the line in my CSV file that causes the problem:
533696;PoG;-251658240;from id GSW C";
This is the only line with a double quote and I can't remove it, so do you have any advice for me?
Thank you in advance
If you have lines like this in your csv:
533696;PoG;-251658240;from id GSW C";
this actually means/shows the fields are not quoted, which is still perfectly valid csv as long as there are no separators inside the fields.
In this case the parser should be told the fields are not quoted.
So, instead of using quote '"' (which is actually telling the parser the fields are quoted and why you get the error), you should use something like quote 'none', or leave the quote parameter out (I don't know Postgres, so I can't give you the exact option to do this).
Ok, I did a quick lookup of the parameters. It looks like there is not really an option to turn quoting off. The only option left would be to provide a quote character that is never used in the data.
quote E'\b' (backspace) seems to work ok.
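To illustrate the idea of telling the parser that fields are not quoted, here is a minimal sketch with Python's standard csv module rather than Postgres. Python does have an explicit "no quoting" mode, so this only demonstrates the concept; it says nothing about the options COPY supports. The name parse_unquoted is hypothetical.

```python
import csv

def parse_unquoted(line, delimiter=";"):
    """Parse one record while treating double quotes as ordinary characters."""
    reader = csv.reader([line], delimiter=delimiter, quoting=csv.QUOTE_NONE)
    return next(reader)

# The trailing delimiter produces a final empty field.
print(parse_unquoted('533696;PoG;-251658240;from id GSW C";'))
```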
A bit late to reply, but I was facing the same issue and found that a double quote in the data needs one more double quote to escape it, plus a prefix and a suffix double quote around that special character. In your case it would be:
input data is : from id GSW C"
change data to : from id GSW C""""
note there are 4 consecutive double quotes.
first and last double quote is prefix and postfix as per documentation.
middle 2 double quotes is data with one escape double quote.
Hope this helps for readers with similar issue going forward.
So for every double quote in data it needs to be escaped with one escape character (default double quote). This is as per documentation.
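If the file is generated by a program, the escaping can be left to a CSV writer instead of being done by hand. The sketch below uses Python's standard csv module and is only an assumption about how such a file might be produced; to_copy_line is a made-up helper. Note that the writer quotes the whole field rather than only the quote character, which is the more conventional encoding of the same value.

```python
import csv
import io

def to_copy_line(fields, delimiter=";"):
    """Serialize one record, quoting and doubling quotes only where needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quotechar='"', lineterminator="\n")
    writer.writerow(fields)
    return buf.getvalue().rstrip("\n")

row = ["533696", "PoG", "-251658240", 'from id GSW C"', ""]
print(to_copy_line(row))
```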

How do I parse CSV in PERL that has quotes in a field?

My data has quotes in it, the names of properties in addresses are quoted. e.g.
"21JAN1984:00:00:00","M",""Millfield""," "
Perl's Text::CSV dies at this point with an error:
CSV_PP ERROR: 2025 - EIQ - Loose unescaped escape
This looks to me like valid CSV, as is a field, "James said "nice". ".
An abbreviated version of the code used is:
my $csv = Text::CSV->new({
binary => 1,
auto_diag => 1,
eol => "\n",
always_quote => 1
}) or die "Cannot use CSV: " . Text::CSV->error_diag();
open my $fh, '<', $ARGV[0] or die $!;
while (my $person = $csv->getline_hr($fh)) {
...
}
Addressing the revised question
"21JAN1984:00:00:00","M",""Millfield""," "
If you want a double quote before Millfield and another after it, the correct CSV format is:
"21JAN1984:00:00:00","M","""Millfield"""," "
As written, the CSV data is broken. Or, at any rate, it is not the 'standard' format. You can find a standard specification for CSV as RFC4180. This is not identical to Microsoft's specification; the RFC itself identifies that Excel doesn't use precisely this format.
Since you're using Perl's Text::CSV module, you should read its specification. Note that the allow_loose_quotes attribute describes input exactly like what you're trying to deal with. It is one of the many attributes you can use to configure the behaviour of Text::CSV in its new method.
Addressing the original question
What was shown in the original version of the question was horribly ill-formed CSV.
21JAN1984:00:00:00","M",""Millfield""," "
The double quote after the 00 has no place in the format. At best, you have to treat it as a regular character at the end of a field delimited by the comma that follows. The "M" is non-controversial. The ""Millfield"" is malformed; if a string starts with a double quote, it ends at the next double quote unless that is itself followed by another double quote, so the second double quote is erroneous. If a field starts with a double quote, it should be enclosed by double quotes. The best you can do is assume that the field is Millfield"" and stops at the comma, but it is erroneous by any normal rules. Under those error-recovery rules, the " " at the end is non-controversial.
To be reasonably well-formed and to contain "Millfield" as a value, you'd need one of these:
"21JAN1984:00:00:00","M","""Millfield"""," "
21JAN1984:00:00:00,"M","""Millfield"""," "
21JAN1984:00:00:00,M,"""Millfield"""," "
21JAN1984:00:00:00,M,"""Millfield""",
The last of those lines has a trailing blank.
Alternatively, if Millfield should not be surrounded by double quotes when extracted, then all the double quotes are superfluous, though any field could be surrounded by a single pair of double quotes.
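The difference between the malformed and the well-formed line can be seen with any CSV parser that has both a lenient and a strict mode. Here is a minimal sketch using Python's standard csv module, chosen only for illustration; Text::CSV's own recovery behaviour with allow_loose_quotes may differ. The helper name parse is hypothetical.

```python
import csv

def parse(line, strict=False):
    """Parse one record; strict=True makes the reader raise on malformed quoting."""
    return next(csv.reader([line], strict=strict))

broken = '"21JAN1984:00:00:00","M",""Millfield""," "'
fixed = '"21JAN1984:00:00:00","M","""Millfield"""," "'

print(parse(fixed))   # third field keeps its surrounding double quotes
print(parse(broken))  # lenient recovery: third field comes out as Millfield""

try:
    parse(broken, strict=True)
except csv.Error as exc:
    print("strict parser rejected the line:", exc)
```

In lenient mode Python happens to recover the third field as Millfield followed by two quotes, which matches the error-recovery reading described above; other parsers may recover differently or not at all.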

escaping single quote sign in PowerShell

I have a replace statement in my code whereby Band's is being replaced by Cig's. However when I put single quote it took the first sentence... Example 'Band'
I tried to use double quote but it does not work. Do you know how to escape the single quote sign?
-replace 'Band's', 'Cig's'
See Escape characters, Delimiters and Quotes and Get-Help about_Quoting_Rules from the built-in help (as pointed out by Nacimota).
To include a ' inside a single-quoted string, simply double it up as ''. (Single-quote literals don't support any of the other escape characters.)
> "Band's Toothpaste" -replace 'Band''s', 'Cig''s'
Or, simply switch to double-quotes. (Double-quote literals are required when wishing to use interpolation or escape characters.)
> "Band's Toothpaste" -replace "Band's", "Cig's"
(Don't forget that -replace uses a regular expression)
Escape a single quote with two single quotes:
"Band's Toothpaste" -replace 'Band''s', 'Cig''s'
Also, this is a duplicate of
Can I use a single quote in a Powershell 'string'?
For trivial cases, you can use embedded escape characters. For more complex cases, you can use here-strings.
$Find = [regex]::Escape(@'
Band's
'@)
$Replace = @'
Cig's
'@
"Band's Toothpaste" -replace $Find,$Replace
Then put the literal text you want to search for and replace in the here-strings.
Normal quoting rules don't apply within the here-string @' - '@ delimiters, so you can put whatever kind of quotes you want, wherever you want them, without needing any escape characters.
The [regex]::Escape() call on $Find will take care of doing the backslash escapes on any regex reserved characters that might be in the search pattern.