Why is a flex scanner slow when matching NUL characters? - lex

I have a lexer that someone else generated using flex. It works, but on a sample that contains a string literal with a lot of NUL characters in it, the scanning is very slow.
After some googling I found this paragraph in the flex docs, which states the fact without giving a reason:
A final note: flex is slow when matching NUL's, particularly when a
token contains multiple NUL's. It's best to write rules which match
short amounts of text if it's anticipated that the text will often
include NUL's.
What's flex's problem with NUL characters?

(The question is answered in the comments from @Rhymoid and @Kaz; they are copied here. See also Question with no answers, but issue solved in the comments (or extended in chat).)
Perhaps it uses it as a string termination character (which is normal in C), and needs to escape it in some way. Moreover, what will yytext contain?
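That guess is on the right track. Roughly speaking, a flex-generated scanner uses NUL internally as its end-of-buffer sentinel, so every NUL in real input forces it out of its fast inner matching loop to check whether the buffer actually ended; a token full of NULs hits that slow path over and over. And yytext is a plain NUL-terminated C string, so its apparent length lies for tokens with embedded NULs; flex provides yyleng for the true count. A minimal sketch (our own toy rule, not the original lexer):

%{
#include <string.h>  /* for strlen */
%}
%option noyywrap
%%
\"[^\"]*\"  {
              /* yytext is NUL-terminated, so strlen() under-reports
                 tokens containing embedded NULs; yyleng is correct. */
              printf("strlen(yytext)=%zu  yyleng=%d\n",
                     strlen(yytext), yyleng);
            }
.|\n        ;  /* ignore everything else */
%%
int main(void) { return yylex(); }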

Related

Scala Random.nextString(int) returning question marks

Whenever I use Random.nextString(int), I get a String of question marks (??????). I've tried creating an instance of Random and using a seed, but nothing works. I am using Scala 2.10.5. Does anyone know what the issue is?
In most terminals, when a character is not displayable (there are a lot of existing characters, and you cannot remotely hope to have them all in the font used by your terminal), it will print a question mark instead.
Because the string is random, you are very likely to have the vast majority of them be non displayable (and thus rendered as a sequence of question marks).
So the strings are valid, they are indeed random (not just a series of question marks), and this is all just a rendering issue. You can easily check that their content really is different each time by displaying the character codes (something like println(myString.map(_.toInt)) will do).

Unicode alternative for <wbr> tag

I am looking for a Unicode solution that will be an alternative to the <wbr> tag in the following regards:
Allowing line breaks at given position.
Browser can find string in the page even when 'broken' apart.
When copied from page, these characters will not be transfered.
This is exactly what <wbr> does in modern browsers. &shy;, &#173; and &#8203; do not seem to conform.
Here is a jsfiddle to demonstrate: http://jsfiddle.net/qY6mp/7/
I will be thankful for any pointers.
The Unicode counterpart of <wbr> is U+200B ZERO WIDTH SPACE, representable in HTML as &#8203; if you don't want to or can't use the character as such. It's not clear from the question why you think it does not conform. The main problem with it is questionable support in IE 6, but this is not a conformance issue.
According to Unicode line breaking rules, U+200B “is used to enable additional (invisible) break opportunities wherever SPACE cannot be used”. HTML specifications do not require conformance to the Unicode standard, in this issue or otherwise, but modern browsers generally implement U+200B this way.
What happens when text is copied from an HTML document is outside the scope of specifications. The same applies to requirement 2 in the question. Since copy and paste generally copies characters, including zero-width characters, and search functionality operates on characters, requirements 2 and 3 are really asking for a character that does not behave like a character.
Note that hyphenation is a completely different issue.

How do I mark an empty translation (msgstr) as translated in PO gettext files?

I found that if the translation for a string (msgid) is empty, all gettext tools will consider the string untranslated.
Is there a workaround for this? I do want to have an empty string as the translation for this item.
As this seems to be a big design flaw in the gettext specification, I decided to use:
Unicode Character 'ZERO WIDTH SPACE' (U+200B) inside these fields.
I realize this is an old question, but I wanted to point out an alternate approach:
msgid "This is a string"
msgstr "\0"
Since gettext uses embedded nulls to signal the end of a string, and it properly translates C escape sequences, I would guess that this might work and result in an empty-string translation. It seemed to work in my program (based on GNU libintl), but I can't tell whether this is actually standard / permitted by the system. As I understand it, gettext PO is not formally specified, so there may be no authoritative answer other than looking at the source code...
https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html
Putting embedded nulls in things is often not a nice thing to do to programmers, but it might work in your case. Arguably it's less evil than the zero-width-space trick, since it will actually result in a string whose size is zero.
Edit:
Basically, the worst thing that can happen is that you get a segfault / bad behavior when running msgfmt, if it gets confused about the size of strings it assumes don't contain embedded nulls and overflows a buffer somewhere.
Assuming that msgfmt can tolerate this though, libintl is going to have to do the right thing with it because the only means it has to return strings is char *, so the final application can only see up to the null character no matter what.
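That last point can be shown with a minimal C sketch (the domain "myapp" and catalog path are hypothetical, and the catalog is assumed to map "This is a string" to "\0"):

#include <libintl.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    setlocale(LC_ALL, "");
    bindtextdomain("myapp", "./locale");  /* hypothetical domain and path */
    textdomain("myapp");

    /* Whatever msgfmt stored, the embedded NUL terminates the returned
       char *, so the application sees an empty string. */
    const char *s = gettext("This is a string");
    printf("length seen by the program: %zu\n", strlen(s));  /* 0 if the trick works */
    return 0;
}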
For what it's worth, my po-parser library spirit-po explicitly supports this :)
https://github.com/cbeck88/spirit-po
Edit: In gettext documentation, it appears that they do mention the possibility of embedded nulls in MO files and said "it was strongly debated":
https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html
Nothing prevents a MO file from having embedded NULs in strings. However, the program interface currently used already presumes that strings are NUL terminated, so embedded NULs are somewhat useless. But the MO file format is general enough so other interfaces would be later possible, if for example, we ever want to implement wide characters right in MO files, where NUL bytes may accidentally appear. (No, we don’t want to have wide characters in MO files. They would make the file unnecessarily large, and the ‘wchar_t’ type being platform dependent, MO files would be platform dependent as well.)
This particular issue has been strongly debated in the GNU gettext development forum, and it is expectable that MO file format will evolve or change over time. It is even possible that many formats may later be supported concurrently. But surely, we have to start somewhere, and the MO file format described here is a good start. Nothing is cast in concrete, and the format may later evolve fairly easily, so we should feel comfortable with the current approach.
So, at the least it's not like they're going to say "man, embedded null in message string? We never thought of that!" Most likely it works, if msgfmt doesn't crash then I would assume it's kosher.
I have had the same problem for a long time, and I actually don't think you can at all. My best option was to insert a comment so I could mark it "translated" from there:
# No translation needed / Translated
msgid "This is a string"
msgstr ""
So far, it's been my best workaround :/ If you do end up finding a way, please post!

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE :
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2 :
This site also provides a GB/Unicode table and even a Java program to generate a file with all the GB characters as well as their Unicode equivalents:
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
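A short C sketch of extracting those pairs (this assumes the tab-separated line layout of current Unihan releases, as in the lines quoted above):

#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("Unihan_Variants.txt", "r");
    if (!f) { perror("Unihan_Variants.txt"); return 1; }

    char line[1024];
    while (fgets(line, sizeof line, f)) {
        if (line[0] == '#') continue;              /* skip comment lines */
        /* lines look like: U+6A5F<TAB>kSimplifiedVariant<TAB>U+673A */
        if (strstr(line, "\tkSimplifiedVariant\t"))
            fputs(line, stdout);                   /* this char has a simplified form */
    }
    fclose(f);
    return 0;
}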
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
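A sketch of that extraction in C (cedict_ts.u8 is the usual CC-CEDICT file name, but treat it as an assumption; deduplicating the individual characters of the printed column then needs a UTF-8 decoding step, omitted here):

#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("cedict_ts.u8", "r");
    if (!f) { perror("cedict_ts.u8"); return 1; }

    char line[4096];
    while (fgets(line, sizeof line, f)) {
        if (line[0] == '#') continue;            /* skip header comments */
        /* entry layout: TRADITIONAL SIMPLIFIED [pinyin] /gloss/ */
        char *sp1 = strchr(line, ' ');           /* end of traditional column */
        char *sp2 = sp1 ? strchr(sp1 + 1, ' ') : NULL;
        if (!sp2) continue;
        printf("%.*s\n", (int)(sp2 - sp1 - 1), sp1 + 1);  /* simplified column */
    }
    fclose(f);
    return 0;
}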
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
> string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
> string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stack Overflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, and also that these are UTF-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why; it's not come up once in 9 years), iterate over all the chars in ['一-龥'] and try to build a new list, or run two regexes: one to check that the text is Chinese, and one to check that it is not simplified Chinese.
According to Wikipedia, whether text renders as simplified Chinese, traditional Chinese, kanji, or another form is in many cases left up to the font. So while you could assemble a selection of simplified Chinese code points, the list would not be at all complete, since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF.
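In code, that observation amounts to a range check like this sketch, which can only answer "is this a Han ideograph?", never "is it simplified?":

#include <stdbool.h>
#include <stdint.h>

/* True for the basic CJK Unified Ideographs block (U+4E00..U+9FFF);
   simplified and traditional forms are mixed together in it. */
static bool is_cjk_unified(uint32_t codepoint) {
    return codepoint >= 0x4E00 && codepoint <= 0x9FFF;
}

int main(void) {
    return is_cjk_unified(0x673A) ? 0 : 1;  /* 0x673A is 机, from the answer above */
}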

Japanese COBOL Code: rules for G literals and identifiers?

We are processing IBM Enterprise Japanese COBOL source code.
The rules that describe exactly what is allowed in G-type literals,
and what is allowed in identifiers, are unclear.
The IBM manual indicates that a G'....' literal
must have a SHIFT-OUT as the first character inside the quotes,
and a SHIFT-IN as the last character before the closing quote.
Our COBOL lexer "knows" this, but objects to G literals
found in real code. Conclusion: the IBM manual is wrong,
or we are misreading it. The customer won't let us see the code,
so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended the text below for clarity:
Does anyone know the exact rules of G literal formation,
and how they (don't) match what the IBM reference manuals say?
The ideal answer would be a regular expression for the G literal.
This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where <name> is a macro that is another regular expression. Presumably they
are named well enough so you can guess what they contain.
Here is the IBM Enterprise COBOL Reference.
Chapter 3 "Character Strings", subheading "DBCS literals" page 32 is relevant reading.
I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means
when it says "one or more characters in the range X'00'...X'FF' for either byte".
How can DBCS-characters be anything but pairs of 8-bit character codes?
The existing RE matches 3 types of pairs of characters if you examine it.
One answer below suggests that the <squote><squote> pairing is wrong.
OK, I might believe that, but that means the RE would only reject
literal strings containing single <squote>s. I don't believe that's
the problem we are having as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparently be composed
of DBCS characters. What is allowed in an identifier, exactly?
Again a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE.
We are reading Shift-JIS encoded text. Our reader converts that
text to Unicode as it goes. But DBCS characters are really
not Shift-JIS; rather, they are binary-coded data. Likely
what is happening is that the DBCS data is getting translated
as if it were Shift-JIS, and that would muck up the ability
to recognize "two bytes" as a DBCS element. For instance,
if a DBCS character pair were :81 :1F, a ShiftJIS reader
would convert this pair into a single Unicode character,
and its two-byte nature is then lost. If you can't count pairs,
you can't find the end quote. If you can't find the end quote,
you can't recognize the literal. So the problem would appear
to be that we need to switch input-encoding modes in the middle
of the lexing process. Yuk.
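To make the pair-counting concrete, here is a minimal C sketch that works on the raw, unconverted bytes; the SO/SI byte values are assumptions (EBCDIC X'0E'/X'0F'), and as an answer below notes, they can come out differently on other platforms:

#include <stddef.h>
#include <stdio.h>

#define SHIFT_OUT 0x0E  /* assumed EBCDIC SO */
#define SHIFT_IN  0x0F  /* assumed EBCDIC SI */

/* Given the bytes following G'<SO>, consume DBCS elements strictly in
   pairs and return the offset of the closing SI, or -1 if the literal
   is malformed.  A reader that has already converted Shift-JIS to
   Unicode can no longer step two bytes at a time like this. */
static ptrdiff_t find_closing_shift_in(const unsigned char *p, size_t n) {
    size_t i = 0;
    while (i < n && p[i] != SHIFT_IN) {
        if (i + 1 >= n) return -1;  /* dangling half of a DBCS pair */
        i += 2;                     /* one DBCS element = exactly two bytes */
    }
    return i < n ? (ptrdiff_t)i : -1;
}

int main(void) {
    /* the pair :81 :1F from above, then the closing SI */
    const unsigned char body[] = { 0x81, 0x1F, SHIFT_IN };
    printf("closing SI at offset %td\n", find_closing_shift_in(body, sizeof body));
    return 0;
}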
Try allowing a single quote in your rule to see if it passes, by making this change:
<squote><squote> => <squote>{1,2}
If I remember correctly, one difference between N and G literals is that G allows a single quote. Your regular expression doesn't allow that.
EDIT: I thought you had all the other DBCS literals working and were just having issues with G-strings, so I just pointed out the difference between N and G. Now I've taken a closer look at your RE. It has problems. In the COBOL I used, you can mix ASCII with Japanese, for example:
G"ABC<ヲァィ>" (where < and > are Shift-out/Shift-in)
Your RE assumes DBCS only. I would loosen this restriction and try again.
I don't think it's possible to handle G literals entirely in regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatching tokens manually.
You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16; treating it as ASCII will not work. SO/SI are sometimes converted to 0x1E/0x1F on Windows.
I am just trying to help you shoot in the dark without seeing the actual code :)
Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal closing character sequence >' ...
I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.