"Unrecognized rule" error in lex program - lex

I am writing a lex program. The goal is to enter a string (letters and other characters) and have the program return the length of that string.
Here is the code:
letter ([a-z]|[A-Z])
carac (•|¤|¶|§|à|î|ì|Ä|Å|É|æ|Æ|ô|ö|ò|û|ù|ÿ|Ö|Ü|ø|£|Ø|×|ƒ|á|í|ó|ú|ñ|Ñ|ª|º|¿|®|¬|½|¼|¡|:|;|.|,|/|?|=|-|!|*|£|µ|^|¨|%)
String {letter}({letter}|{carac})*
%%
{String} printf("[%d] : The number of your String \n",yyleng);
.* printf("You have a problem somewhere !");
%%
int yywrap(){return 1;}
main ()
{
yylex ();
}
And the output: flex reports the "unrecognized rule" error from the title.

(The answers are contained in the comments, which I am copying here; see "Question with no answers, but issue solved in the comments (or extended in chat)".)
@Thomas Padron-McCarthy and @David Gorsline are correct:
It is likely that Flex doesn't understand the character encoding of your input file. As far as I know, Flex still only understands single-byte characters.
To amplify Thomas's comment: try a simpler version of the program, where you define carac as carac (:|;|.|,|/|?|=|-|!|^|%).
You may need to quote special characters: carac (\:|\;|\.|\,|\/|\?|\=|\-|\!|\^|\%) Or use the character class notation: carac [-:;.,/?=!^%]
To confirm this I applied these edits and ran it through flex. The following does not give flex errors:
carac (\•|\¤|\¶|\§|\à|\î|\ì|\Ä|\Å|\É|\æ|\Æ|\ô|\ö|\ò|\û|\ù|\ÿ|\Ö|\Ü|\ø|\£|\Ø|\×|\ƒ|\á|\í|\ó|\ú|\ñ|\Ñ|\ª|\º|\¿|\®|\¬|\½|\¼|\¡|\:|\;|\.|\,|\/|\?|\=|\-|\!|\*|\£|\µ|\^|\¨|\%)
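For what it's worth, the encoding issue is easy to see outside of flex. Here is a small illustration (a Python sketch, not part of the lex answer itself): each of the accented or typographic characters in the original carac definition occupies more than one byte in UTF-8, so a byte-oriented scanner generator sees several bytes where the rule intends a single character.

# Each of these characters is 2-3 bytes in UTF-8, so a single-byte
# scanner generator such as flex sees the individual bytes,
# not one character per symbol.
for ch in "é£•¶":
    print(ch, "->", list(ch.encode("utf-8")))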

Related

How to search for any unicode symbol in a character string?

I've got an existing DOORS module which happens to have some rich text entries; these entries have some symbols in them such as 'curly' quotes. I'm trying to upgrade a DXL macro which exports a LaTeX source file, and the problem is that these high-number symbols are not considered "standard UTF-8" by TexMaker's import function (and in any case probably won't be processed by XeLaTeX or other converters). I can't simply use the UnicodeString functions in DXL because those break the rest of the rich text, and apparently the character identifier charOf(decimal_number_code) only works over the basic set of characters, i.e. below some numeric code value. For example, charOf(8217) should create a right curly single quote, but when I tried code along the lines of
if (charOf(8217) == one_char)
I never get a match. I did copy the curly quote from the DOORS module and verified via an online Unicode analyzer that it was definitely Unicode decimal value 8217.
So, what am I missing here? I just want to be able to detect any symbol character, identify it correctly, and then replace it with, e.g., \textquoteright in the output stream.
My overall setup works for lower-count chars, since this works:
(c is a single character pulled from a string)
thedeg = charOf(176)
if( thedeg == c )
{
    temp += "$\\degree$"
}
Got some help from DXL coding experts over at the IBM forums.
Quoting the important stuff (there are some useful code snippets there as well):
Hey, you are right; it seems intOf(char) and charOf(int) both do some modulo-256 arithmetic and therefore cut off anything above that. Try:
int i=8217;
char c = addr_(i);
print c;
Which then allows comparison of c with any input char.
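To make that modulo-256 behaviour concrete, here is a rough sketch of the arithmetic (Python, purely illustrative; DXL itself is not involved):

# charOf/intOf apparently reduce the code point modulo 256, so the right
# single quote (U+2019, decimal 8217) collapses to an unrelated control
# character - which is why the comparison in the question never matches.
code = 8217
print(code % 256)                      # 25, a control character
print(chr(code % 256) == chr(code))    # False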

How to print non-BMP Unicode characters in Tkinter (e.g. 𝄫)

Note: Non-BMP characters can be displayed in IDLE as of Python 3.8 (so it's possible Tkinter might display them now, too, since they both use Tcl), which was released some time after I posted this question. I plan to edit this after I try out Python 3.9 (after I install an updated version of Xubuntu). I also read that editing these characters in IDLE might not be as straightforward as other characters; see the last comment here.
So, today I was making shortcuts for entering certain Unicode characters. All was going well. Then, when I decided to do these characters (in my Tkinter program; they wouldn't even try to go in IDLE), 𝄫 and 𝄪, I got a strange unexpected error and my program started deleting just about everything I had written in the text box. That's not acceptable.
Here's the error:
_tkinter.TclError: character U+1d12b is above the range (U+0000-U+FFFF) allowed by Tcl
I realize most of the Unicode characters I had been using only had four hex digits in their codes. For some reason, it doesn't like five.
So, is there any way to print these characters in a ScrolledText widget (let alone without messing everything else up)?
UTF-8 is my encoding. I'm using Python 3.4 (so UTF-8 is the default).
I can print these characters just fine with the print statement.
Entering the character without just using ScrolledText.insert (e.g. Ctrl-shift-u, or by doing this in the code: b'\xf0\x9d\x84\xab') does actually enter it, without that error, but it still starts deleting stuff crazily, or adding extra spaces (including itself, although it reappears randomly at times).
There is currently no way to display those characters as they are supposed to look in Tkinter in Python 3.4 (although someone mentioned how using surrogate pairs may work [in Python 2.x]). However, you can implement methods to convert the characters into displayable codes and back, and just call them whenever necessary. You have to call them when you print to Text widgets, copy/paste, in file dialogs*, in the tab bar, in the status bar, and other stuff.
*The default Tkinter file dialogs do not allow for much internal engineering of the dialogs. I made my own file dialogs, partly to help with this issue. Let me know if you're interested. Hopefully I'll post the code for them here in the future.
These methods convert out-of-range characters into codes and vice versa. The codes are formatted with ordinal numbers, like this: {119083ū}. The brackets and the ū are just to distinguish this as a code. {119083ū} represents 𝄫. As you can see, I haven’t yet bothered with a way to escape codes, although I did purposefully try to make the codes very unlikely to occur. The same is true for the ᗍ119083ūᗍ used while converting. Anyway, I'm meaning to add escape sequences eventually. These methods are taken from my class (hence the self). (And yes, I know you don’t have to use semi-colons in Python. I just like them and consider that they make the code more readable in some situations.)
import re;

def convert65536(self, s):
    #Converts a string with out-of-range characters in it into a string with codes in it.
    l=list(s);
    i=0;
    while i<len(l):
        o=ord(l[i]);
        if o>65535:
            l[i]="{"+str(o)+"ū}";
        i+=1;
    return "".join(l);

def parse65536(self, match):
    #This is a regular expression method used for substitutions in convert65536back()
    text=int(match.group()[1:-2]);
    if text>65535:
        return chr(text);
    else:
        return "ᗍ"+str(text)+"ūᗍ";

def convert65536back(self, s):
    #Converts a string with codes in it into a string with out-of-range characters in it
    while re.search(r"{\d\d\d\d\d+ū}", s)!=None:
        s=re.sub(r"{\d\d\d\d\d+ū}", self.parse65536, s);
    s=re.sub(r"ᗍ(\d\d\d\d\d+)ūᗍ", r"{\1ū}", s);
    return s;
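Here is a quick round-trip sketch of how these would be used (hypothetical; it assumes the two conversion methods above live on an object called app):

# 𝄫 is U+1D12B, decimal 119083, so it is encoded as {119083ū} and back.
s = "double flat: \U0001D12B"
encoded = app.convert65536(s)
assert encoded == "double flat: {119083ū}"
assert app.convert65536back(encoded) == s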
My answer is based on @Shule's answer but provides more Pythonic and easier-to-read code. It also covers a real case.
This is the method that populates items into a tkinter.Listbox. There is no back-conversion; this solution only takes care of displaying strings that contain characters Tcl does not allow.
class MyListbox (Listbox):
    # ...
    def populate(self):
        """
        """
        def _convert65536(to_convert):
            """Converts a string with out-of-range characters in it into a
            string with codes in it.
            Based on <https://stackoverflow.com/a/28076205/4865723>.
            This is a workaround because Tkinter (Tcl) doesn't allow unicode
            characters outside of a specific range. This could be emoticons
            for example.
            """
            for character in to_convert[:]:
                if ord(character) > 65535:
                    convert_with = '{' + str(ord(character)) + 'ū}'
                    to_convert = to_convert.replace(character, convert_with)
            return to_convert

        # delete all listbox items
        self.delete(0, END)
        # add items to listbox
        for item in mydata_list:
            try:
                self.insert(END, item)
            except TclError as err:
                _log.warning('{} It will be converted.'.format(err))
                self.insert(END, _convert65536(item))
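A hypothetical usage sketch (the name root and the sample data are my own; as the snippet above implies, mydata_list and _log are expected to exist at module level):

import logging
from tkinter import Tk, Listbox, END, TclError

_log = logging.getLogger(__name__)
mydata_list = ["plain text", "double flat \U0001D12B"]  # second item triggers the conversion

root = Tk()
box = MyListbox(root)
box.pack()
box.populate()
root.mainloop()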

Unicode Library - Where to Find One (NEW FOCUS: Unicode in XSL-FO)

EDIT: I am no longer asking for a Unicode library. Not only has one been linked, but the original question was inappropriate to ask, as mentioned below. This question is now focused on how to use Unicode in XSL-FO.
My primary question now is what steps are required to use Unicode. I already have the necessary Unicode character references, but I understand that the proper 'font' needs to be selected as well, and I am led to believe there are other steps that need to be taken in order to use it in my XSL-FO document.
What do you mean by "write foreign characters"? An XSL-FO file is just an XML file, so you can use any Unicode reference to figure out the character number and then an XML numeric character reference to include it.
For example, the Unicode hex for the Euro symbol € is U+20AC, so in XML (XSL-FO) that would be the numeric character reference &#x20AC;
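If it helps, here is a tiny sketch (Python, not part of XSL-FO itself, and the function name is my own) for turning any character into the corresponding XML numeric character reference:

def xml_char_ref(ch):
    # Build an XML numeric character reference, e.g. '€' -> '&#x20AC;'
    return "&#x{:X};".format(ord(ch))

print(xml_char_ref("€"))  # &#x20AC;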
I experienced the same kind of problem. The problem with Unicode characters is hard-coded in FONET.DLL. In the class TrueTypeFont, the method MapCharacter is written as:
public override ushort MapCharacter(char c)
{
    if (c > Byte.MaxValue)
        return (ushort) FirstChar;
    return mapping.MapCharacter(c);
}
So any character with a value greater than 255 will be "ignored". I downloaded the sources (from https://fonet.codeplex.com/) and modified the method to:
public override ushort MapCharacter(char c)
{
    return mapping.MapCharacter(c);
}
Using this library with this new method, the euro symbol magically became visible!

Why flex scanner is slow when matching NUL characters?

I have a lexer written by someone else, who generated it using flex. It works, but on a sample that contains a string literal with a lot of NUL characters in it, scanning is very slow.
After some googling I found this paragraph in the flex docs, which states the fact without giving a reason:
A final note: flex is slow when matching NUL's, particularly when a
token contains multiple NUL's. It's best to write rules which match
short amounts of text if it's anticipated that the text will often
include NUL's.
What's flex's problem with NUL characters?
(The question is answered in the comments from @Rhymoid and @Kaz; they are copied here. See also "Question with no answers, but issue solved in the comments (or extended in chat)".)
Perhaps it uses it as a string termination character (which is normal in C), and needs to escape it in some way. Moreover, what will yytext contain?

Interpretation of Greek characters by FOP

Can you please help me interpret the Greek character whose HTML representation is &#8062; and whose hex value is U+1F7E?
Details of these characters can be found on the below URL
http://www.isthisthingon.org/unicode/index.php?page=01&subpage=F&hilite=01F7E
When I run this character through Apache FOP, it gives me an ArrayIndexOutOfBoundsException:
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.fop.text.linebreak.LineBreakUtils.getLineBreakPairProperty(LineBreakUtils.java:668)
at org.apache.fop.text.linebreak.LineBreakStatus.nextChar(LineBreakStatus.java:117)
When I looked into the FOP Code, I was unable to understand the need for lineBreakProperties[][] Array in LineBreakUtils.java.
I also noticed that FOP fails with a similar error for all of the non-displayable Greek characters mentioned on the above page.
What are these special characters?
Why is there no display for these characters? Are they line breaks or tabs?
Has anyone solved a similar issue with FOP?
The U+1F7E code point is part of the Greek Extended Unicode block. But it does not represent any actual character; it is a reserved but unassigned code point. Here is the chart from Unicode 6.0: http://www.unicode.org/charts/PDF/U1F00.pdf.
So the errors you are getting are perhaps not so surprising.
I ran a FO file that included the following <fo:block> through both FOP 0.95 and FOP 1.0:
<fo:block>Unassigned code point: ὾</fo:block>
I did get the same java.lang.ArrayIndexOutOfBoundsException that you are seeing.
When using an adjacent "real" character, there was no error:
<fo:block>Assigned code point: ώ</fo:block>
So it seems you have to ensure that your datastream does not contain unassigned code points like U+1F7E.
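One way to do that filtering, as a Python sketch (the function name is my own; this is not part of FOP):

import unicodedata

def strip_unassigned(text):
    # Drop code points that Unicode leaves unassigned (general category "Cn"),
    # such as U+1F7E, before handing the text to FOP.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cn")

print(strip_unassigned("Unassigned code point: \u1F7E"))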
Answer from Apache
At first glance, this seems like a minor oversight in the implementation of Unicode linebreaking in FOP, which does not take into account the possibility that a given codepoint is not assigned a 'class' in the linebreaking context (= U+1F7E does not appear in the file http://www.unicode.org/Public/UNIDATA/LineBreak.txt, which is used as the basis for generating those arrays in LineBreakUtils.java).
On the other hand, one could obviously raise the question why you so desperately need to have an unassigned codepoint in your output. Are you absolutely sure you need this? If yes, then can you elaborate on the exact reason? (i.e. What exactly is this unassigned codepoint used for?)
The most straightforward 'fix' seems to be roughly as follows:
Index: src/java/org/apache/fop/text/linebreak/LineBreakStatus.java
--- src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (revision 1054383)
+++ src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (working copy)
@@ -87,6 +87,7 @@
         /* Initial conversions */
         switch (currentClass) {
+            case 0: // Unassigned codepoint: consider as AL?
             case LineBreakUtils.LINE_BREAK_PROPERTY_AI:
             case LineBreakUtils.LINE_BREAK_PROPERTY_SG:
             case LineBreakUtils.LINE_BREAK_PROPERTY_XX:
What this does is assign the class 'AL' ('Alphabetic') to any codepoint that has not been assigned a class by Unicode. This means it will be treated as a regular letter.
Now, the reason why I am asking the question whether you are sure you know what you're doing, is that this may turn out to be undesirable. Perhaps the character in question needs to be treated as a space rather than a letter.
Unicode does not define U+1F7E other than as a 'reserved' character, so it makes sense that Unicode cannot say what should happen with this character in the context of linebreaking...
That said, it is also wrong of FOP to crash in this case, so the bug is definitely genuine.