JFlex and NUL characters

Does anybody know how to handle '\0' in Java JFlex? I tried encoding it as a regular expression to be matched, like
\0 { /* DO nothing */ }
but it did not work. The documentation does not provide any information. The reason I need this is because I am handling some strings coming from a C/C++ source.
Regards.

All of the following worked for me (using trunk JFlex, soon to be released as v1.5):
\0
"\0"
\u0000
"\u0000"
How do you know it did not work? It's possible there was an earlier rule in your grammar that matches the null character, in which case the \0 rule will never match (though if that's true, you should get a warning to this effect when you generate your scanner with JFlex).
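To illustrate the shadowing point, here is a hypothetical JFlex fragment. JFlex picks the longest match, and on equal-length matches it prefers the rule listed first, so a catch-all rule placed earlier will swallow the NUL character:

```
// Hypothetical JFlex rule fragment: the catch-all shadows the \0 rule,
// because on equal-length matches JFlex prefers the rule listed first.
.      { /* also matches '\0', so the next rule never fires */ }
\0     { /* unreachable -- move this rule above the catch-all */ }
```

If your grammar has a `.` or similar catch-all before the `\0` rule, reorder them so the `\0` rule comes first.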

According to the manual it should be '\0'.

Related

Lua Patterns and Unicode

What would be the best way to find a word such as "Hi", or a name like "dön" with that special character in it, using a pattern? The special characters would be optional, so it should presumably use a '?', but I don't know what character class to use to find them.
I basically want to make sure that I am getting words with possible Unicode characters in them, but nothing else. So "dön" would be fine, but no other special characters, numbers, or punctuation such as brackets.
According to the Lua guide on Unicode, "Lua's pattern matching facilities work byte by byte. In general, this will not work for Unicode pattern matching, although some things will work as you want". This means the best option is probably to iterate over each character and work out if it is a valid letter. To loop over each unicode character in a string:
for character in string.gmatch(myString, "([%z\1-\127\194-\244][\128-\191]*)") do
-- Do something with the character
end
Note this method will not work if myString isn't valid UTF-8. To check whether a character is one that you want, it's probably best to simply have a list of all characters you don't want in your strings and then exclude them. Note the plain-text find (the fourth argument true), since many of these characters are magic in Lua patterns:
local notAllowed = ":()[]{}+_-=\\|`~,.<>/?!#$%^&*"
local isValid = true
for character in string.gmatch(myString, "([%z\1-\127\194-\244][\128-\191]*)") do
    if notAllowed:find(character, 1, true) then
        isValid = false
        break
    end
end
Hope this helped.

How do I distinguish between an EOF character, and the actual end of file?

When reading a file, I understand the last character provided is an EOF. Now, what happens, when I have an EOF character in that file?
How do I distinguish between the "real" end of a file, and the EOF character?
I decided to move my comments to an answer.
You can't have an "EOF character" in your file because there is no such thing. The underlying filesystem knows how many bytes are in a file; it doesn't rely on the contents of the file to know where the end is.
The C functions you're using return EOF (-1), but that value wasn't read from the file. It's just the way the function tells you that you've reached the end. And because -1 isn't a valid character in any character set, there's no confusion.
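To make that concrete, here is a small Python sketch (the file name is invented for the demo): a Ctrl-Z byte (0x1A), DOS's old "EOF character", is stored and read back as ordinary data, and the true end of file is signalled out of band by an empty read rather than by any in-band character:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'demo.bin')  # hypothetical demo file

    # Write a Ctrl-Z (0x1A) byte in the middle of the data.
    with open(path, 'wb') as f:
        f.write(b'before\x1aafter')

    with open(path, 'rb') as f:
        data = f.read()   # the 0x1A byte comes back as plain data
        tail = f.read()   # at the real end, read() returns b'' -- no sentinel

print(data)  # b'before\x1aafter'
print(tail)  # b''
```

The same holds in C: fgetc() returns the 0x1A byte as data and only returns the out-of-band int value EOF once the filesystem says there are no bytes left.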
You need some context for this question. On Windows, there's the outdated DOS concept of a real "EOF character" -- Ctrl-Z. It is actually not possible to tell a "real" one from a "fake" one; a file with an embedded Ctrl-Z will contain some trailing hidden data from the perspective of a program which is actually looking for Ctrl-Z as an end of file character. Don't try to write this kind of code anymore -- it's not necessary.
In the portable C API and on UNIX, the int value -1 is used to indicate end of file; it can't be a valid 8- or 16-bit character, so it's easy to tell the difference.
Assuming you're talking about C, EOF is -1, which is not a character (hence there is no confusion).

Interpretation of Greek characters by FOP

Can you please help me interpret the Greek character with HTML entity &#8062; and hex value U+1F7E?
Details of these characters can be found on the below URL
http://www.isthisthingon.org/unicode/index.php?page=01&subpage=F&hilite=01F7E
When I run this character through Apache FOP, it gives me an ArrayIndexOutOfBoundsException:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.fop.text.linebreak.LineBreakUtils.getLineBreakPairProperty(LineBreakUtils.java:668)
at org.apache.fop.text.linebreak.LineBreakStatus.nextChar(LineBreakStatus.java:117)
When I looked into the FOP Code, I was unable to understand the need for lineBreakProperties[][] Array in LineBreakUtils.java.
I also noticed that FOP fails for all the Greek characters mentioned on the above page which are non-displayable with the similar error.
What are these special characters?
Why is there no glyph for these characters? Are they line breaks or tabs?
Has anyone solved a similar issue with FOP?
The U+1F7E code point is part of the Greek Extended Unicode block, but it does not represent any actual character; it is a reserved but unassigned code point. Here is the chart from Unicode 6.0: http://www.unicode.org/charts/PDF/U1F00.pdf.
So the errors you are getting are perhaps not so surprising.
I ran a FO file that included the following <fo:block> through both FOP 0.95 and FOP 1.0:
<fo:block>Unassigned code point: ὾</fo:block>
I did get the same java.lang.ArrayIndexOutOfBoundsException that you are seeing.
When using an adjacent "real" character, there was no error:
<fo:block>Assigned code point: ώ</fo:block>
So it seems like you have to ensure that your datastream does not contain non-characters like U+1F7E.
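One way to enforce that, sketched here in Python (the helper name is invented): the unicodedata module reports unassigned code points as general category 'Cn', so you can strip them out before handing the text to FOP:

```python
import unicodedata

def strip_unassigned(text):
    # Drop code points that Unicode leaves unassigned (category 'Cn'),
    # such as U+1F7E, which trips up FOP's line-breaking tables.
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cn')

cleaned = strip_unassigned('Assigned: \u03ce  Unassigned: \u1f7e')
print(cleaned)
```

Whether silently dropping such code points is acceptable depends on why they are in your data in the first place; logging them before removal may be wiser.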
Answer from Apache
At first glance, this seems like a minor oversight in the implementation of Unicode linebreaking in FOP. It does not take into account the possibility that a given codepoint is not assigned a 'class' in the linebreaking context. (That is, U+1F7E does not appear in the file http://www.unicode.org/Public/UNIDATA/LineBreak.txt, which is used as a basis to generate those arrays in LineBreakUtils.java.)
On the other hand, one could obviously raise the question why you so desperately need to have an unassigned codepoint in your output. Are you absolutely sure you need this? If yes, then can you elaborate on the exact reason? (i.e. What exactly is this unassigned codepoint used for?)
The most straightforward 'fix' seems to be roughly as follows:
Index: src/java/org/apache/fop/text/linebreak/LineBreakStatus.java
--- src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (revision
1054383)
+++ src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (working
copy)
@@ -87,6 +87,7 @@
/* Initial conversions */
switch (currentClass) {
+ case 0: // Unassigned codepoint: consider as AL?
case LineBreakUtils.LINE_BREAK_PROPERTY_AI:
case LineBreakUtils.LINE_BREAK_PROPERTY_SG:
case LineBreakUtils.LINE_BREAK_PROPERTY_XX:
What this does, is assign the class 'AL' or 'Alphabetic' to any codepoint that has not been assigned a class by Unicode. This means it will be treated as a regular letter.
Now, the reason why I am asking the question whether you are sure you know what you're doing, is that this may turn out to be undesirable. Perhaps the character in question needs to be treated as a space rather than a letter.
Unicode does not define U+1F7E other than as a 'reserved' character, so it makes sense that Unicode cannot say what should happen with this character in the context of linebreaking...
That said, it is also wrong of FOP to crash in this case, so the bug is definitely genuine.

ack-grep: chars escaping

My goal is to find all "<?=" occurrences with ack. How can I do that?
ack "<?="
Doesn't work. Please tell me how I can fix the escaping here?
Since ack uses Perl regular expressions, your problem stems from the fact that in the Perl regex language, ? is a special character meaning "the preceding element is optional". So what you are grepping for is = preceded by an optional <.
So you need to escape the ? if it's meant to be a literal character.
To escape, there are two approaches - either <\?= or <[?]=; some people find the second form of escaping (putting a special character into a character class) more readable than backslash-escape.
UPDATE As Josh Kelley graciously added in the comment, a third form of escaping is to use the \Q operator which escapes all the following special characters till \E is encountered, as follows: \Q<?=\E
Rather than trying to remember which characters have to be escaped, you can use -Q to quote everything that needs to be quoted.
ack -Q "<?="
This is the best solution if you just want to search for literal text (i.e. you don't need a regular expression).
ack "<\?="
? is a regex operator, so it needs escaping

How to detect malformed UTF characters

I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?
Consider Python. It allows you to extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.
import codecs
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = 'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print s.encode('utf8')
This prints:
spam eggs bacon
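The snippet above is Python 2. A minimal Python 3 adaptation of the same idea follows; note that the handler substitutes one space per invalid byte, so two adjacent bad bytes become two spaces:

```python
import codecs

# Register an error handler that substitutes a space for each
# undecodable byte and resumes decoding at the following byte.
codecs.register_error('spacer', lambda ex: (' ', ex.start + 1))

s = b'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print(s)  # 'spam  eggs bacon' -- one space per invalid byte
```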
EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)
One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.
RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid sequences, and that the next character boundary is always easy to find: it's a byte < 128, or one of the multi-byte lead bytes, with leading bits of 110, 1110, or 11110.
But BKB is probably correct: the easiest answer is to let Perl do it for you, although I'm not sure what Perl does when it detects incorrect UTF-8 with that filter in effect.
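As a concrete illustration of the boundary-finding logic described above, here is a hedged Python sketch (the function name is invented) that replaces each malformed UTF-8 sequence with a single space and resynchronizes at the next plausible character boundary:

```python
def replace_malformed(data: bytes) -> str:
    # Decode byte by byte; emit a single space for each invalid
    # sequence, then resynchronize at the next ASCII byte (< 0x80)
    # or multi-byte lead byte (leading bits 110, 1110, or 11110).
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 2                      # 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 3                      # 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 4                      # 4-byte sequence
        else:
            need = 0                      # not a legal lead byte
        if need:
            seq = data[i:i + need]
            try:
                out.append(seq.decode('utf-8'))
                i += need
                continue
            except UnicodeDecodeError:
                pass                      # truncated or bad continuation
        out.append(' ')                   # malformed: substitute a space
        i += 1
        while i < n and not (data[i] < 0x80 or 0xC2 <= data[i] <= 0xF4):
            i += 1                        # skip to the next boundary
    return ''.join(out)
```

Because the resync step skips the remainder of a broken sequence, a run of adjacent bad bytes collapses to one space, which is usually what you want when scrubbing data for a loader.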