When using the ParseKit parser on the iPhone, is it possible to match against a double quote, and against other characters that are part of the special BNF syntax?
(Is it possible to use escape sequences in a defined grammar?)
#start = doublequote+;
doublequote = '"';
Developer of ParseKit here.
By default you can match against quoted strings easily using the built-in QuotedString parser (which will match QuotedString tokens):
#start = quotes;
quotes = QuotedString+;
that would match input like: "foo" 'bar' "baz"
as three quoted strings: "foo", 'bar', "baz"
So this demonstrates that by default the ParseKit tokenizer (the PKTokenizer class) produces QuotedString tokens when encountering a " or '.
For more details on default tokenizer behavior, read the ParseKit tokenization documentation.
However, if you want quote chars (", ') to be recognized as standalone symbols rather than indicating the start or end of a quoted string, you must alter the tokenizer behavior first.
In code, you would alter tokenizer behavior by calling methods on your PKTokenizer object.
In grammars, you alter tokenizer behavior with tokenizer directives.
Tokenizer directives are special rules placed at the top of your grammar which start with a # character. In this case, you want to change which characters are recognized as standalone symbol tokens by the tokenizer. Specifically, you want to add two chars as symbols with the #symbolState tokenizer directive.
You can do that in your grammar by changing it to:
#symbolState = '"' "'"; // a tokenizer directive stating ' and " should be recognized as standalone symbol tokens
// (by default they are start- and end-markers for quoted string tokens)
#start = stuff;
stuff = (Word | Symbol)+;
Given the same input as above, you would match separate quote symbols and words: ", foo, ", ', bar, ', ", baz, "
Given a sequence of Unicode characters, how can I obtain a string of whitespace characters that has the same width (at least in monospace fonts that display each character with single or double width of the characters from Basic Latin)?
Examples
For example, given the string \u0061\u0020\u0062\u0020\u0063, whose five characters look like this:
a b c
('a', space, 'b', space, 'c'), I would like to obtain a string consisting of just five spaces:
\u0020\u0020\u0020\u0020\u0020
and given \u6b22\u8fce\u5149\u4e34 that looks like
欢迎光临
I'd want to obtain a string containing four ideographic spaces: \u3000\u3000\u3000\u3000.
Background
Here is an example where this matters: error reporting in compilers for languages that support Unicode. Suppose that we have some hypothetical programming language PL (could be Python, Java, Scala, Ruby ...) that has string literals and parentheses. Suppose that this is an invalid snippet of PL-code, because it contains an unmatched parenthesis:
"stringLiteral")
If we tried to compile it, the compiler of PL could produce an error message that looks as follows:
:1: error: ';' expected but ')' found.
"stringLiteral")
               ^
Note the "phantom string" followed by '^' in the last line: it points exactly at the unmatched closing parenthesis.
If I try the same with CJK characters, here is what I get:
:1: error: ';' expected but ')' found.
"欢迎光临欢迎光临欢迎光临欢迎光临欢迎光临欢迎")
                        ^
Note that the "phantom string" in the last line now consists of ordinary Latin spaces, so in the console the '^' appears to sit somewhere in the middle of the string of CJK characters instead of under the parenthesis.
If I try the same with Croatian characters:
:1: error: ';' expected but ')' found.
"DŽDždžLJLjljNJNjnj")
           ^
the '^' pointer also ends up at a seemingly completely wrong position, because those special Croatian characters are much wider than ordinary spaces.
All of the examples produce similar results in such languages as Python, Java, Scala, Ruby (just copy-paste " y⃝e҈s҉ ") or "临欢迎光临欢迎") into the interactive shell, and see where the ^ ends up).
Solution attempt
Here is a naïve attempt to generate "phantom"-strings in Scala. There is a method Character.isIdeographic. I can use it to define a phantom method by mapping every ideographic character to \u3000, and all other characters to ' ' (ordinary space).
def phantom(s: String) =
  s.map(c => if (Character.isIdeographic(c)) '\u3000' else ' ')
In simple cases, it works. For example, if I define a string
val s = "foo欢迎光临欢迎bar光临欢baz"
and then print the string followed by a vertical bar |, a line break, and then the phantom(s) followed by vertical bar |,
println(s + "|\n" + phantom(s) + "|")
then I obtain:
foo欢迎光临欢迎bar光临欢baz|
   　　　　　　   　　　   |
and the vertical bars at the end of the strings line up perfectly, because phantom(s) is now
\u0020\u0020\u0020\u3000\u3000\u3000\u3000\u3000\u3000\u0020\u0020\u0020\u3000\u3000\u3000\u0020\u0020\u0020
that is:
three ordinary spaces corresponding to "foo"
six ideographic spaces corresponding to the "欢迎光临欢迎" piece
then again three spaces corresponding to "bar"
...
and so on.
However, if I try the same with Croatian characters, I again get a mess:
DŽDždžLJLjljNJNjnj|
         |
(vertical bars don't line up).
Question
Does Unicode define any properties that would allow me to generate robust "phantom" strings of the same width?
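One property that goes part of the way is East_Asian_Width (Unicode Standard Annex #11), which classifies characters as wide, fullwidth, narrow, and so on. The following is only a minimal sketch in Python, using its standard unicodedata module. The name phantom mirrors the Scala attempt above, and the decision to treat non-spacing marks, enclosing marks, and format characters as taking no column is an assumption of this sketch, not something Unicode guarantees for every terminal or font.

```python
import unicodedata

# Categories assumed to take no column of their own: non-spacing marks,
# enclosing marks, and format characters (an assumption of this sketch).
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}


def phantom(s: str) -> str:
    """Return whitespace intended to occupy the same columns as s."""
    out = []
    for ch in s:
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue  # assumed to be drawn on top of the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            out.append("\u3000")  # ideographic space for wide/fullwidth
        else:
            out.append(" ")  # ordinary space for everything else
    return "".join(out)


s = "foo欢迎光临欢迎bar光临欢baz"
print(s + "|\n" + phantom(s) + "|")
```

This does not settle the Croatian case: the digraph letters are not classed as wide or fullwidth, so each one still maps to a single ordinary space, and whether that lines up depends on how the font draws them.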
Question (because I can't work it out), should ""hello world"" be a valid field value in a CSV file according to the specification?
i.e., should:
1,""hello world"",9.5
be a valid CSV record?
(If so, then the Perl CSV-XS parser I'm using is mildly broken, but if not, then $line =~ s/\342\200\234/""/g; is a really bad idea ;) )
The weird thing is that this code has been running without issue for years, but we've only just hit a record that both started with a left double quote and contained no comma (the above is from a CSV pre-parser).
The canonical format definition of CSV is https://www.rfc-editor.org/rfc/rfc4180.txt. It says:
Each field may or may not be enclosed in double quotes (however
some programs, such as Microsoft Excel, do not use double quotes
at all). If fields are not enclosed with double quotes, then
double quotes may not appear inside the fields. For example:
"aaa","bbb","ccc" CRLF
zzz,yyy,xxx
Fields containing line breaks (CRLF), double quotes, and commas
should be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
The last rule means your line should have been:
1,"""hello world""",9.5
But not all parsers and generators follow this standard perfectly, so you might need to relax some rules for interoperability. It all depends on how much control you have over the code that writes the CSV and the code that parses it.
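To see the doubling rule in action, here is a small sketch using Python's standard csv module rather than the Perl parser from the question. It assumes the module's default dialect, which quotes a field only when needed and escapes an embedded double quote by doubling it.

```python
import csv
import io

# Write one record whose middle field really contains the quote characters.
buf = io.StringIO()
csv.writer(buf).writerow([1, '"hello world"', 9.5])
line = buf.getvalue()
print(line, end="")

# Reading the line back recovers the original field, quotes included.
row = next(csv.reader(io.StringIO(line)))
print(row)
```

The writer encloses the field in double quotes and doubles each embedded quote, which gives the three quotes in a row at both ends of the field.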
That depends on the escape character you use. If your escape character is '"' (double quote) then your line should look like
1,"""hello world""",9.5
If your escape character is '\' (backslash) then your line should look like
1,"\"hello world\"",9.5
Check your parser/environment defaults or explicitly configure your parser with the escape character you need e.g. to use backslash do:
my $csv = Text::CSV_XS->new ({ quote_char => '"', escape_char => "\\" });
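The same idea, sketched with Python's standard csv module instead of Text::CSV_XS (so the option names below are Python's, not Perl's): the reader is told that the escape character is a backslash and that doubled quotes are not used.

```python
import csv
import io

# The raw string keeps the backslashes exactly as they appear in the file.
line = r'1,"\"hello world\"",9.5'

reader = csv.reader(
    io.StringIO(line),
    quotechar='"',
    escapechar="\\",    # backslash escapes the next character
    doublequote=False,  # "" is not treated as an escaped quote
)
row = next(reader)
print(row)
```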
This does not work:
scala> """one\r\ntwo\r\nthree\r\nfour""".replace("\r\n", "\n")
res1: String = one\r\ntwo\r\nthree\r\nfour
How to do that in Scala?
Is there a more idiomatic way of doing that, instead of using replace?
The problem is that triple-quoted (""") strings do not expand escape sequences. Three different approaches:
Use single " quotes in order to treat escape sequences correctly: "one\r\ntwo";
Use the s string interpolator; be careful with this approach, because it could lead to unexpected replacements: s"""one\r\ntwo""";
Call treatEscapes directly to expand escape sequences in your string: StringContext.treatEscapes("""one\r\ntwo""").
Refer also to this earlier question.
Try this:
"""one\r\ntwo\r\nthree\r\nfour""".replace("\\r\\n", "\n")
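Inside an ordinary string literal, \ is the escape character, so you have to write \\ to tell the compiler that you mean a literal backslash rather than the start of an escape sequence. The triple-quoted string contains literal backslashes, so that is what the search string has to contain too.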
I have a database table whose rows look like this:
id | name | code_set_id
I have this particular entry that I need to find:
674272310 | raphodo/qrc_resources.py | 782732
In my rails app (2.3.8), I have a statement that evaluates to this:
SELECT * from fyles WHERE code_set_id = 782732 AND name LIKE 'raphodo/qrc\\_resources.py%';
From reading up on escaping, the above query is correct. This is supposed to correctly double escape the underscore. However this query does not find the record in the database. These queries will:
SELECT * from fyles WHERE code_set_id = 782732 AND name LIKE 'raphodo/qrc\_resources.py%';
SELECT * from fyles WHERE code_set_id = 782732 AND name LIKE 'raphodo/qrc_resources.py%';
Am I missing something here? Why is the first SQL statement not finding the correct entry?
A single backslash in the RHS of a LIKE escapes the following character:
9.7.1. LIKE
[...]
To match a literal underscore or percent sign without matching other characters, the respective character in pattern must be preceded by the escape character. The default escape character is the backslash but a different one can be selected by using the ESCAPE clause. To match the escape character itself, write two escape characters.
So this is a literal underscore in a LIKE pattern:
\_
and this is a single backslash followed by an "any character" pattern:
\\_
You want LIKE to see this:
raphodo/qrc\_resources.py%
PostgreSQL used to interpret C-style backslash escapes in string literals by default, but it no longer does; now you have to use E'...' to get backslash escapes in string literals (unless you've changed the configuration options). The String Constants with C-style Escapes section of the manual covers this, but the simple version is that these two:
name LIKE E'raphodo/qrc\\_resources.py%'
name LIKE 'raphodo/qrc\_resources.py%'
do the same thing as of PostgreSQL 9.1.
Presumably your Rails 2.3.8 app (or whatever is preparing your LIKE patterns) is assuming an older version of PostgreSQL than the one you're actually using. You'll need to adjust things to not double your backslashes (or prefix the pattern string literals with Es).
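The difference between the three patterns can be reproduced in a small sketch. This one uses Python's standard sqlite3 module rather than PostgreSQL, so two things are assumptions of the example: SQLite has no default LIKE escape character, so the backslash is declared with an explicit ESCAPE clause, and the second row and the matches helper are made up for illustration. The patterns are passed as bound parameters, so LIKE sees them exactly as written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fyles (name TEXT)")
conn.executemany(
    "INSERT INTO fyles VALUES (?)",
    [("raphodo/qrc_resources.py",), ("raphodo/qrcXresources.py",)],
)


def matches(pattern):
    """Hypothetical helper: names matching pattern, with \\ as the LIKE escape."""
    sql = "SELECT name FROM fyles WHERE name LIKE ? ESCAPE '\\' ORDER BY name"
    return [name for (name,) in conn.execute(sql, (pattern,))]


# Unescaped underscore: matches any single character, so both rows match.
unescaped = matches("raphodo/qrc_resources.py%")
# One backslash: a literal underscore, so only the real file matches.
single = matches(r"raphodo/qrc\_resources.py%")
# Two backslashes: a literal backslash followed by "any character".
double = matches(r"raphodo/qrc\\_resources.py%")

print(unescaped, single, double)
```

The doubled form finds nothing because neither name contains a backslash, which is the same reason the first query in the question comes back empty.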
Please provide me with a sed one-liner that produces this output:
sdc3 sdc2
for this input:
sdc3[1] sdc2[0]
I mean, remove all the subscript values from the string.
sed 's/\[[^]]*\]//g'
reads: substitute any string with literal "[" followed by zero or more characters that aren't a "]", and then the closing "]", with an empty string.
You need the [^]] bit to prevent greedy matching treating "[1] sdc2[0]" as a single match in your sample string.
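The effect of the negated character class is easy to check outside sed. The sketch below uses Python's re module, whose syntax differs slightly from sed's (the ] inside the class is written \] here), to compare the pattern above with a greedy .* version.

```python
import re

s = "sdc3[1] sdc2[0]"

# Negated class: each [...] group is matched and removed on its own.
negated = re.sub(r"\[[^\]]*\]", "", s)

# Greedy .*: one match runs from the first "[" to the last "]".
greedy = re.sub(r"\[.*\]", "", s)

print(repr(negated))
print(repr(greedy))
```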
As for your comment:
sed 's#\([^[ ]*\)\[[^]]*\]#/dev/\1#g'
I switch the separator from the usual '/' to '#', just to avoid escaping the /dev/ bit you asked for (I won't say "for clarity")
the \(...\) bit matches a subgroup, here sdc2 or whatever, so we can refer to it in the replacement
the subgroup uses a similar character class to the one we used discarding the index: [^[ ] means any character except an "[" (again, to avoid greedily matching the index) or a space (assuming your values are space-delimited as per your post)
the replacement is now the literal "/dev/" followed by the first (and only) subgroup match
the g flag at the end tells it to perform multiple matches per line, instead of stopping at the first one
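For comparison, here is the same substitution as a sketch in Python's re module. The group and backreference work the same way; the main difference is that re.sub replaces every match by default, so count=1 is used to imitate sed without the g flag.

```python
import re

s = "sdc3[1] sdc2[0]"
pattern = r"([^\[ ]*)\[[^\]]*\]"

# Every match replaced, like sed with the g flag.
all_matches = re.sub(pattern, r"/dev/\1", s)

# Only the first match replaced, like sed without the g flag.
first_only = re.sub(pattern, r"/dev/\1", s, count=1)

print(all_matches)
print(first_only)
```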