How to search for any Unicode symbol in a character string?

I've got an existing DOORS module which happens to have some rich-text entries; these entries contain symbols such as 'curly' quotes. I'm trying to upgrade a DXL macro which exports a LaTeX source file, and the problem is that these high-numbered symbols are not considered "standard UTF-8" by Texmaker's import function (and in any case probably won't be processed by XeLaTeX or other converters). I can't simply use the UnicodeString functions in DXL, because those break the rest of the rich text, and apparently the character identifier charOf(decimal_number_code) only works over the basic set of characters, i.e. below some numeric code value. For example, charOf(8217) should create a right curly single quote, but when I tried code along the lines of
if (charOf(8217) == one_char)
I never get a match. I did copy the curly quote from the DOORS module and verified via an online Unicode analyzer that it is definitely Unicode decimal value 8217.
So, what am I missing here? I just want to be able to detect any symbol character, identify it correctly, and then replace it with, e.g., \textquoteright in the output stream.
My overall setup works for lower-numbered characters, since this works
(c is a single character pulled from a string):
char thedeg = charOf(176)  // the degree sign, U+00B0
if (thedeg == c)
{
    temp += "$\\degree$"
}
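(To make the stated goal concrete: here is the detect-and-replace idea sketched in Python, purely for illustration and not in DXL; the mapping entries and the trailing {} are example choices, not a complete symbol table.)

latex_map = {
    u'\u2019': r'\textquoteright{}',  # right single curly quote
    u'\u00b0': r'$\degree$',          # degree sign
}

def to_latex(s):
    # Replace each mapped symbol, pass everything else through.
    return u''.join(latex_map.get(ch, ch) for ch in s)

print(to_latex(u'It\u2019s 30\u00b0 outside'))
# It\textquoteright{}s 30$\degree$ outside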

I got some help from DXL coding experts over at the IBM forums.
Quoting the important part (there are some useful code snippets there as well):
Hey, you are right: it seems intOf(char) and charOf(int) both do some
modulo-256 arithmetic and therefore cut off anything above that. Try:
int i = 8217;
char c = addr_(i);
print c;
Which then allows comparison of c with any input char.
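The modulo-256 truncation also explains why the original comparison never matched: 8217 mod 256 is 25, so the truncated charOf(8217) yields an ASCII control code rather than the curly quote. In Python terms (purely illustrative):

print(8217 % 256)     # 25
print(repr(chr(25)))  # '\x19', a control character, not '\u2019'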

Related

D Unicode string literals: can't print specific Unicode character

I'm just trying to pick up D having come from C++. I'm sure it's something very basic, but I can't find any documentation to help me. I'm trying to print the character à, which is U+00E0. I am trying to assign this character to a variable and then use write() to output it to the console.
I'm told by this website that U+00E0 is encoded as 0xC3 0xA0 in UTF-8, 0x00E0 in UTF-16 and 0x000000E0 in UTF-32.
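(These values are easy to verify yourself; for example in Python, used here only as a neutral way to inspect the bytes:)

c = u'\u00e0'                 # à, U+00E0
print(c.encode('utf-8'))      # b'\xc3\xa0'
print(c.encode('utf-16-be'))  # b'\x00\xe0'
print(c.encode('utf-32-be'))  # b'\x00\x00\x00\xe0'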
Note that for everything I've tried, I've tried replacing string with char[] and wstring with wchar[]. I've also tried with and without the w or d suffixes after wide strings.
These methods return the compiler error, "Invalid trailing code unit":
string str = "à";
wstring str = "à"w;
dstring str = "à"d;
These methods print a totally different character (Ò U+00D2):
string str = "\xE0";
string str = hexString!"E0";
And all these methods print what looks like ˧á (note á ≠ à!), which is UTF-16 0x2E7 0x00E1:
string str = "\xC3\xA0";
wstring str = "\u00E0"w;
dstring str = "\U000000E0"d;
Any ideas?
I confirmed it works on my Windows box, so I'm going to type this up as an answer now.
In the source code, if you copy/paste the characters directly, make sure your editor is saving the file in UTF-8 encoding. The D compiler insists on it, so if it gives a compile error about a UTF thing, that's probably why. I have never used Code::Blocks, but an old answer on the web said Edit -> Encodings...; it is a setting somewhere in the editor regardless.
Or, you can replace the characters in your source code with \uxxxx escapes in the strings. Do NOT use the hexString thing; that is for binary bytes. Your example of "\u00E0" is good, and it will work for any type of string (not just wstring as in your example).
Then, on the output side, it depends on your target, because the program just outputs bytes and it is up to the recipient program to interpret them correctly. Since you said you are on Windows, the key is to set the console code page to UTF-8 so it knows what you are trying to do. Indeed, the same C function can be called from D too, leading to this program:
import core.sys.windows.windows;
import std.stdio;

void main() {
    SetConsoleOutputCP(65001);
    writeln("Hi \u00E0");
}
printing it successfully. On older Windows versions, you might need to change your font to see the character too (as opposed to the generic box it shows because some fonts don't have all the characters), but on my Windows 10 box, it just worked with the default font.
BTW, technically the console code page is a shared setting (after your program runs and exits, you can still hit Properties on your console window and see the change reflected there), so you should perhaps set it back when your program exits. You could get the current value at startup with the get function ( https://learn.microsoft.com/en-us/windows/console/getconsoleoutputcp ), store it in a local variable, and set it back on exit. You could do auto ccp = GetConsoleOutputCP(); SetConsoleOutputCP(65001); scope(exit) SetConsoleOutputCP(ccp); right at startup; the scope(exit) will run when the function exits, so doing it in main would be kinda convenient. Just add some error checking if you want.
The Microsoft docs don't say anything about setting it back, so it probably doesn't actually matter, but I still want to mention it just in case. The knowledge that the setting is shared and persists can also help in debugging: if the program still works after you comment the call out, it isn't because the call is unnecessary; it's just that the code page was set previously and hasn't been unset yet!
Note that running it from an IDE might not be exactly the same, because IDEs often pipe the output instead of sending it straight to the Windows console. If that happens, let me know and we can write up some notes about that for future readers too. But you can also open your own copy of the console (run the program outside the IDE) and it should display correctly for you.
D source code needs to be encoded as UTF-8.
My guess is that you're putting a UTF-16 character into the UTF-8 source file.
E.g.
import std.stdio;

void main() {
    writeln(cast(char)0xC3, cast(char)0xA0);
}
This will output, as UTF-8, the character you seek.
Which you can then hard code like so:
import std.stdio;

void main() {
    string str = "à";
    writeln(str);
}

Does Unicode have a special marker character?

My father created, in the mid-90s, an encoding for his engineering purposes on his company's computers. It was close to ISO 8859-2 (Latin-2), but with some differences.
For example, it added a special "marker character". This character wasn't meant to be a literal, but it wasn't a control character either.
The purpose of this character was to be inserted by a machine wherever text needed to be split into parts. See the following Python parser script:
import re

# `text` holds the document to be parsed.
# Surround the bracket pairs with ~ markers, then split on them.
text = re.sub(r'\{\{', r'~{{', text)
text = re.sub(r'\[\[', r'~[[', text)
text = re.sub(r'\]\]', r']]~', text)
text = re.sub(r'\}\}', r'}}~', text)
parts = text.strip('~').split('~')
inCurly = [False]
inSharp = [False]
whereAmI = ['']
for part in parts:
    if part[:2] == '{{':
        inCurly.append(True)
        whereAmI.append('Curly')
    elif part[:2] == '[[':
        inSharp.append(True)
        whereAmI.append('Sharp')
    if whereAmI[-1] == 'Sharp' and not inCurly[-1]:
        # some advanced magic on the current part,
        # if it is directly surrounded by sharp brackets,
        # but these sharp brackets are not in curly brackets anyhow
        # (not: "{{ (( [[ some text ]] )) }}")
        pass
    # detecting closing brackets and popping inSharp, inCurly, whereAmI
# joining the parts back into text
This is an easy parser for advanced purposes; you can detect more kinds of brackets or quotation marks as you want. But it has one huge flaw: it breaks when a ~ already occurs in the text.
For this purpose, and for similar purposes (but in C, I think), he added that marker character to his encoding/character set.
For years I have used three German sharp-s characters for this purpose, ßßß, because it is almost impossible to see three of them in a row in real text. But this is not an ideal solution.
Yesterday my father told me this story, and I immediately thought: is there an equivalent in the Unicode family? Unicode is a modern, still-developing standard that has spread dramatically all over the world in the past decade. Is there a special character just for this particular purpose, or not?
I don't think there's anything called that specifically, but you might find the zero-width space or the information separator characters, among others, suitable for the purpose. You can also arbitrarily select any character as your marker, and use an escape character in case it occurs within the string (a sketch of this follows below).
In the Control Pictures block, there is a symbol for the group separator, U+241D.
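Combining the two answers above, here is a minimal Python sketch. It assumes two arbitrary choices, the GROUP SEPARATOR control (U+001D) as the marker and ESCAPE (U+001B) as the escape character: any pre-existing occurrences are escaped before the parser inserts real markers, and the split routine restores them.

ESC = '\x1b'      # arbitrary choice: the ESCAPE control character
MARKER = '\x1d'   # arbitrary choice: the GROUP SEPARATOR control character

def protect(text):
    # Neutralize any ESC or MARKER already present in the input.
    return text.replace(ESC, ESC + ESC).replace(MARKER, ESC + MARKER)

def split_on_markers(text):
    # Split on unescaped MARKERs, restoring escaped characters verbatim.
    parts, buf, i = [], [], 0
    while i < len(text):
        c = text[i]
        if c == ESC and i + 1 < len(text):
            buf.append(text[i + 1])      # escaped char: keep it, drop the ESC
            i += 2
        elif c == MARKER:
            parts.append(''.join(buf))   # real marker: split here
            buf = []
            i += 1
        else:
            buf.append(c)
            i += 1
    parts.append(''.join(buf))
    return parts

# Usage: protect() the raw text first, then insert bare MARKERs at the
# split points (as the question's parser does with ~), then split.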

How to print non-BMP Unicode characters in Tkinter (e.g. 𝄫)

Note: Non-BMP characters can be displayed in IDLE as of Python 3.8 (so it's possible Tkinter might display them now too, since they both use Tcl), which was released some time after I posted this question. I plan to edit this after I try out Python 3.9 (after I install an updated version of Xubuntu). I also read that editing these characters in IDLE might not be as straightforward as editing other characters; see the last comment here.
So, today I was making shortcuts for entering certain Unicode characters. All was going well. Then, when I tried to enter the characters 𝄫 and 𝄪 (in my Tkinter program; they wouldn't even try to go into IDLE), I got a strange, unexpected error, and my program started deleting just about everything I had written in the text box. That's not acceptable.
Here's the error:
_tkinter.TclError: character U+1d12b is above the range (U+0000-U+FFFF) allowed by Tcl
I realize most of the Unicode characters I had been using have only four hex digits in their codes. For some reason, it doesn't like five.
So, is there any way to print these characters in a ScrolledText widget (let alone without messing everything else up)?
UTF-8 is my encoding. I'm using Python 3.4 (so UTF-8 is the default).
I can print these characters just fine with the print statement.
Entering the character without using ScrolledText.insert (e.g. via Ctrl-Shift-U, or by doing this in the code: b'\xf0\x9d\x84\xab') does actually enter it, without that error, but it still starts deleting stuff crazily, or adding extra spaces (including around the character itself, although it reappears randomly at times).
There is currently no way to display those characters as they are supposed to look in Tkinter in Python 3.4 (although someone mentioned that using surrogate pairs may work [in Python 2.x]). However, you can implement methods to convert the characters into displayable codes and back, and just call them whenever necessary. You have to call them when you print to Text widgets, copy/paste, in file dialogs*, in the tab bar, in the status bar, and elsewhere.
*The default Tkinter file dialogs do not allow for much internal engineering of the dialogs. I made my own file dialogs, partly to help with this issue. Let me know if you're interested. Hopefully I'll post the code for them here in the future.
These methods convert out-of-range characters into codes and vice versa. The codes are formatted with ordinal numbers, like this: {119083ū}. The brackets and the ū are just to distinguish this as a code. {119083ū} represents 𝄫. As you can see, I haven’t yet bothered with a way to escape codes, although I did purposefully try to make the codes very unlikely to occur. The same is true for the ᗍ119083ūᗍ used while converting. Anyway, I'm meaning to add escape sequences eventually. These methods are taken from my class (hence the self). (And yes, I know you don’t have to use semi-colons in Python. I just like them and consider that they make the code more readable in some situations.)
import re;
def convert65536(self, s):
    #Converts a string with out-of-range characters in it into a string with codes in it.
    l=list(s);
    i=0;
    while i<len(l):
        o=ord(l[i]);
        if o>65535:
            l[i]="{"+str(o)+"ū}";
        i+=1;
    return "".join(l);
def parse65536(self, match):
    #This is a regular expression method used for substitutions in convert65536back()
    text=int(match.group()[1:-2]);
    if text>65535:
        return chr(text);
    else:
        return "ᗍ"+str(text)+"ūᗍ";
def convert65536back(self, s):
    #Converts a string with codes in it into a string with out-of-range characters in it
    while re.search(r"{\d\d\d\d\d+ū}", s)!=None:
        s=re.sub(r"{\d\d\d\d\d+ū}", self.parse65536, s);
    s=re.sub(r"ᗍ(\d\d\d\d\d+)ūᗍ", r"{\1ū}", s);
    return s;
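As a quick usage illustration (Converter is a hypothetical class holding the three methods above; the input string is just an example):

conv = Converter()                       # hypothetical holder class
s = "Double flat: \U0001D12B"            # 𝄫 is U+1D12B, ordinal 119083
safe = conv.convert65536(s)              # -> "Double flat: {119083ū}"
assert conv.convert65536back(safe) == s  # round-trips back to the original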
My answer is based on Shule's answer, but provides more Pythonic and easier-to-read code. It also provides a real-world case.
This is the method populating items into a tkinter.Listbox. There is no back-conversion; this solution only takes care of displaying strings containing characters that Tcl does not allow.
from tkinter import Listbox, END, TclError  # imports assumed by this excerpt

class MyListbox (Listbox):
    # ...
    def populate(self):
        """
        """
        def _convert65536(to_convert):
            """Converts a string with out-of-range characters in it into a
            string with codes in it.
            Based on <https://stackoverflow.com/a/28076205/4865723>.
            This is a workaround because Tkinter (Tcl) doesn't allow unicode
            characters outside of a specific range. This could be emoticons
            for example.
            """
            for character in to_convert[:]:
                if ord(character) > 65535:
                    convert_with = '{' + str(ord(character)) + 'ū}'
                    to_convert = to_convert.replace(character, convert_with)
            return to_convert

        # delete all listbox items
        self.delete(0, END)
        # add items to listbox
        for item in mydata_list:
            try:
                self.insert(END, item)
            except TclError as err:
                _log.warning('{} It will be converted.'.format(err))
                self.insert(END, _convert65536(item))

PyGame: Proper use of Unicode

My goal is to create a program with which the user can learn Bible verses by being shown a problem and solving it through input (e.g. "Quote verse Gen 3:15"). As the Bible translation I have to work with is German, it contains a ton of umlauts, which never display properly.
My PyGame file's header:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Later on, I list the three German umlauts:
u'ö'.encode('utf-8')
u'ä'.encode('utf-8')
u'ü'.encode('utf-8')
The txt file is parsed by this function:
import os, codecs  # imports assumed by this excerpt

def load_list(listname):
    fullname = os.path.join("daten", listname + ".txt")
    with codecs.open(fullname, "r", "utf-8-sig") as name:
        lines = name.readlines()
    for x in range(0, len(lines)):
        lines[x] = lines[x].strip("\n")
        lines[x] = lines[x].strip("\r")
    print lines
I'm aware that I could combine the two strip calls into one line, but that's not the topic here.
How can I get my PyGame program to display the umlauts from the text file correctly, and also display the umlauts from the user's input correctly? I've checked hundreds of suggestions, but I can't get anything really working here.
Any help is highly appreciated, before I lose my sane mind (well, as I'm sitting here coding games, I probably already did anyway :D).
I'll try to summarize:
Printing something other than a str or unicode object triggers that object's __repr__() method. If it is a sequence, this applies to the contained elements as well, causing any non-ASCII character to be escaped with \xXX (or \uXXXX) notation. Note the difference between print 'text' and print ['text']: in the latter case, the string's quotes will be printed as well (besides the brackets, of course). Use str.join() for concatenating lists of strings in order to control the way the output looks (a sketch follows after these points).
It's a good idea to always explicitly decode input (as you do by using codecs) and to encode output (which is not done in the code snippets in your question).
The source file encoding (the # coding: utf8 line in the header) has nothing to do with the encoding of input and output. It only enables you to type non-ASCII characters in string literals (i.e. characters inside quotes in the source file) instead of using \xXX escapes.
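To make the first two points concrete, here is a minimal Python 2 sketch (the string is just an example):

# -*- coding: utf-8 -*-
line = u'Gänsefüßchen'
print line        # decoded text; relies on the terminal's encoding
print [line]      # [u'G\xe4nsef\xfc\xdfchen'] -- repr() escapes non-ASCII
print u', '.join([line]).encode('utf-8')  # join, then encode explicitly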
Hope that makes some things clearer. There's a lot that can go wrong that looks like an encoding error, and it's not always easy to find out what's actually happening.

Japanese COBOL Code: rules for G literals and identifiers?

We are processing IBM Enterprise Japanese COBOL source code.
The rules that describe exactly what is allowed in G-type literals,
and what is allowed for identifiers, are unclear.
The IBM manual indicates that a G'....' literal
must have a SHIFT-OUT as the first character inside the quotes,
and a SHIFT-IN as the last character before the closing quote.
Our COBOL lexer "knows" this, but objects to G literals
found in real code. Conclusion: the IBM manual is wrong,
or we are misreading it. The customer won't let us see the code,
so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended the text below for clarity:
Does anyone know the exact rules of G literal formation,
and how they (don't) match what the IBM reference manuals say?
The ideal answer would be a regular expression for the G literal.
This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where <name> is a macro that is itself another regular expression. Presumably
they are named well enough that you can guess what they contain.
Here is the IBM Enterprise COBOL Reference.
Chapter 3 "Character Strings", subheading "DBCS literals" page 32 is relevant reading.
I'm hoping that by providing the exact reference, an experienced IBMer
can tell us how we misread it :-{ I'm particularly unclear on what the
phrase "DBCS characters" means when the manual says "one or more characters
in the range X'00'...X'FF' for either byte".
How can DBCS characters be anything but pairs of 8-bit character codes?
The existing RE matches three types of character pairs, if you examine it.
One answer below suggests that the <squote><squote> pairing is wrong.
OK, I might believe that, but that would mean the RE only rejects
literal strings containing single <squote>s. I don't believe that's
the problem we are having, as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparently be composed
of DBCS characters. What exactly is allowed for an identifier?
Again, a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE.
We are reading Shift-JIS encoded text. Our reader converts that
text to Unicode as it goes. But DBCS characters are really
not Shift-JIS; rather, they are binary-coded data. Likely
what is happening is that the DBCS data is getting translated
as if it were Shift-JIS, and that would muck up the ability
to recognize "two bytes" as a DBCS element. For instance,
if a DBCS character pair were :81 :1F, a ShiftJIS reader
would convert this pair into a single Unicode character,
and its two-byte nature is then lost. If you can't count pairs,
you can't find the end quote. If you can't find the end quote,
you can't recognize the literal. So the problem would appear
to be that we need to switch input-encoding modes in the middle
of the lexing process. Yuk.
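The collapse described in EDIT2 is easy to demonstrate in a few lines of Python (the byte pair here is an arbitrary valid Shift-JIS element, since :81 :1F would actually be rejected by a strict Shift-JIS decoder, which is yet another way the data can be mangled):

raw = b'\x82\xa0'                  # one two-byte Shift-JIS element
decoded = raw.decode('shift_jis')  # -> u'\u3042' (あ), a single code point
print(len(raw), len(decoded))      # 2 1 -- the two-byte structure is gone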
Try adding a single quote to your rule to see if it passes, by making this change:
<squote><squote> => <squote>{1,2}
If I remember correctly, one difference between N and G literals is that G allows a single quote. Your regular expression doesn't allow that.
EDIT: I thought you had all the other DBCS literals working and were just having issues with the G string, so I just pointed out the difference between N and G. Now I have taken a closer look at your RE, and it has problems. In the COBOL I used, you can mix ASCII with Japanese, for example:
G"ABC<ヲァィ>"   (< and > stand for shift-out/shift-in)
Your RE assumes DBCS only. I would lose this restriction and try again.
I don't think it's possible to handle G literals entirely with a regular expression. There is no way to keep track of matching quotes and SO/SI with a finite-state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatched tokens manually.
You could also be facing encoding issues. The code could be in EBCDIC (Katakana) or UTF-16; treating it as ASCII will not work. SO/SI are sometimes converted to 0x1E/0x1F on Windows.
I am just trying to help you shoot in the dark without seeing the actual code :)
Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal's closing character sequence >' ...
I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.