PyGame: Proper use of Unicode

My goal is to create a program with which the user can learn Bible verses by being shown a prompt and solving it through input (e.g. "Quote verse Gen 3:15"). As the Bible translation I have to work with is German, it contains a ton of umlauts, which never display properly.
My PyGame file's header:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Later on, I list the three German umlauts:
u'ö'.encode('utf-8')
u'ä'.encode('utf-8')
u'ü'.encode('utf-8')
The txt-file is parsed by this function:
import os
import codecs

def load_list(listname):
    fullname = os.path.join("daten", listname + ".txt")
    with codecs.open(fullname, "r", "utf-8-sig") as name:
        lines = name.readlines()
    for x in range(0, len(lines)):
        lines[x] = lines[x].strip("\n")
        lines[x] = lines[x].strip("\r")
    print lines
I'm aware that I could combine the two strip calls, but that's not the topic here.
How can I get PyGame to display the umlauts from the text file correctly, and also display the umlauts in the user's input correctly? I've checked hundreds of suggestions, but I can't get anything really working here.
Any help is highly appreciated, before I lose my mind (well, as I'm sitting here coding games, I probably have already anyway :D).

I'll try to summarize:
Printing something other than a string or unicode object triggers that object's __repr__() method. If it is a sequence, this applies to the contained elements as well, causing any non-ASCII character to be escaped in \xXX (or \uXXXX) notation. Note the difference between print 'text' and print ['text']: in the latter case, the string's quotes will be printed as well (besides the brackets, of course). Use str.join() to concatenate lists of strings so that you control how the output looks.
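For example, a minimal Python 2 sketch of the difference (the sample words are mine):

# -*- coding: utf-8 -*-
words = [u'schön', u'grün']
print words                              # repr() of the elements: [u'sch\xf6n', u'gr\xfcn']
print u', '.join(words).encode('utf-8')  # the characters themselves: schön, grün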
It's a good idea to always explicitly decode input (as you do by using codecs) and encode output (which is not done in the code snippets in your question).
The source file encoding (the # coding: utf8 line in the header) has nothing to do with the encoding of input and output. It only enables you to type non-ASCII characters in string literals (= characters inside quotes in the source file) instead of using \xXX escapes.
Hope that makes some things clearer. There's a lot that can go wrong and look like an encoding error, and it's not always easy to find out what's actually happening.
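For the display side of the question, here is a minimal PyGame sketch: pass a unicode object to font.render and let PyGame draw it. The font name, window size, and colors are my assumptions, not taken from the question.

# -*- coding: utf-8 -*-
import pygame

pygame.init()
screen = pygame.display.set_mode((400, 80))

# Pick a font that actually contains the umlaut glyphs;
# "dejavusans" is just an example name.
font = pygame.font.SysFont("dejavusans", 24)

text = u'Gen 3:15 enthält ä, ö und ü'  # a unicode object, not an encoded str
surface = font.render(text, True, (255, 255, 255))
screen.blit(surface, (10, 20))
pygame.display.flip()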

Related

How to print non-BMP Unicode characters in Tkinter (e.g. 𝄫)

Note: Non-BMP characters can be displayed in IDLE as of Python 3.8 (so it's possible Tkinter might display them now, too, since they both use Tcl), which was released some time after I posted this question. I plan to edit this after I try out Python 3.9 (after I install an updated version of Xubuntu). I also read that editing these characters in IDLE might not be as straightforward as editing other characters; see the last comment here.
So, today I was making shortcuts for entering certain Unicode characters. All was going well. Then, when I decided to do these characters (in my Tkinter program; they wouldn't even try to go in IDLE), 𝄫 and 𝄪, I got a strange unexpected error and my program started deleting just about everything I had written in the text box. That's not acceptable.
Here's the error:
_tkinter.TclError: character U+1d12b is above the range (U+0000-U+FFFF) allowed by Tcl
I realize most of the Unicode characters I had been using only had four hex digits in their codes. For some reason, it doesn't like five.
So, is there any way to print these characters in a ScrolledText widget (let alone without messing everything else up)?
UTF-8 is my encoding. I'm using Python 3.4 (so UTF-8 is the default).
I can print these characters just fine with the print() function.
Entering the character by means other than ScrolledText.insert (e.g. Ctrl-Shift-U, or by doing this in the code: b'\xf0\x9d\x84\xab') does actually enter it, without that error, but it still starts deleting stuff crazily, or adding extra spaces (including the character itself, although it reappears randomly at times).
There is currently no way to display those characters as they are supposed to look in Tkinter in Python 3.4 (although someone mentioned how using surrogate pairs may work [in Python 2.x]). However, you can implement methods to convert the characters into displayable codes and back, and just call them whenever necessary. You have to call them when you print to Text widgets, copy/paste, in file dialogs*, in the tab bar, in the status bar, and other stuff.
*The default Tkinter file dialogs do not allow for much internal engineering of the dialogs. I made my own file dialogs, partly to help with this issue. Let me know if you're interested. Hopefully I'll post the code for them here in the future.
These methods convert out-of-range characters into codes and vice versa. The codes are formatted with ordinal numbers, like this: {119083ū}. The brackets and the ū are just to distinguish this as a code. {119083ū} represents 𝄫. As you can see, I haven’t yet bothered with a way to escape codes, although I did purposefully try to make the codes very unlikely to occur. The same is true for the ᗍ119083ūᗍ used while converting. Anyway, I'm meaning to add escape sequences eventually. These methods are taken from my class (hence the self). (And yes, I know you don’t have to use semi-colons in Python. I just like them and consider that they make the code more readable in some situations.)
import re;
def convert65536(self, s):
    #Converts a string with out-of-range characters in it into a string with codes in it.
    l=list(s);
    i=0;
    while i<len(l):
        o=ord(l[i]);
        if o>65535:
            l[i]="{"+str(o)+"ū}";
        i+=1;
    return "".join(l);
def parse65536(self, match):
    #This is a regular expression method used for substitutions in convert65536back()
    text=int(match.group()[1:-2]);
    if text>65535:
        return chr(text);
    else:
        return "ᗍ"+str(text)+"ūᗍ";
def convert65536back(self, s):
    #Converts a string with codes in it into a string with out-of-range characters in it
    while re.search(r"{\d\d\d\d\d+ū}", s)!=None:
        s=re.sub(r"{\d\d\d\d\d+ū}", self.parse65536, s);
    s=re.sub(r"ᗍ(\d\d\d\d\d+)ūᗍ", r"{\1ū}", s);
    return s;
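A quick round-trip check of these methods (Python 3; I attach them to a throwaway class here, since they expect self, and the class name is mine):

class Converter:
    convert65536 = convert65536
    parse65536 = parse65536
    convert65536back = convert65536back

c = Converter()
encoded = c.convert65536("B double-flat: \U0001d12b")
print(encoded)                       # B double-flat: {119083ū}
print(c.convert65536back(encoded))   # B double-flat: 𝄫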
My answer is based on @Shule's answer but provides more Pythonic and easier-to-read code. It also provides a real-world case.
This is the method that populates items into a tkinter.Listbox. There is no back-conversion; this solution only takes care of displaying strings that contain characters Tcl doesn't allow.
import logging
from tkinter import Listbox, END, TclError

_log = logging.getLogger(__name__)

class MyListbox (Listbox):
    # ...
    def populate(self):
        """Populate the listbox with the items of mydata_list."""
        def _convert65536(to_convert):
            """Converts a string with out-of-range characters in it into a
            string with codes in it.
            Based on <https://stackoverflow.com/a/28076205/4865723>.
            This is a workaround because Tkinter (Tcl) doesn't allow unicode
            characters outside of a specific range. This could be emoticons
            for example.
            """
            for character in to_convert[:]:
                if ord(character) > 65535:
                    convert_with = '{' + str(ord(character)) + 'ū}'
                    to_convert = to_convert.replace(character, convert_with)
            return to_convert
        # delete all listbox items
        self.delete(0, END)
        # add items to listbox (mydata_list is defined elsewhere)
        for item in mydata_list:
            try:
                self.insert(END, item)
            except TclError as err:
                _log.warning('{} It will be converted.'.format(err))
                self.insert(END, _convert65536(item))

JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8; I get question marks instead

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in the POST field of an HTTP request.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the Data:
The CSV file is very simple in structure.
There's only one field / one column, which I name TERM and later use as ${TERM}.
I don't really need full CSV because it's only one string per line.
There are no commas or quotes.
It's UTF-8 and when I run the Unix "file" command on the file, it says UTF-8 text.
I've also verified UTF-8 in command line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True; the result was different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Sharing mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how to debug just after each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
Changed formatting and added bullet points for more clarity.
Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to "?" at DISPLAY TIME only? So even though I've checked in a bunch of places, do they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could have mis-mapped the characters.
Also, perhaps when doing variable expansion inside the text field that I POST, where I reference ${TERM}, it maps to question marks at that point, so the corruption happens at that later stage. If that happened, AND it was mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
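JMeter itself doesn't offer that kind of byte-level pause, but as a sanity check outside JMeter, a few lines of Python can serve as that "data scope" (the filename matches the config above):

with open('japanese-searches.csv', 'rb') as f:
    raw = f.readline()

print(raw)                                           # the raw bytes, e.g. b'Snow Leopard \xe3\x82\x92...'
print([hex(ord(ch)) for ch in raw.decode('utf-8')])  # the Unicode code points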
Found the issue, there was another place the UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you have to also set Content Encoding to UTF-8
Yes, in hindsight this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes. I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise. Maybe I'm just used to XML, but it's probably not a good practice to assume that; maybe HTTP defaults to ISO-Latin-1 or something, and even if there's a spec, maybe folks don't follow it.
3: And if I don't specify it, I'd think the "do no harm" approach would be to pass the characters on and let the receiver on the other end deal with them. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request is a POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so it seemed to me that the data was already corrupted at that point. Wrong again. I suspect that WITHIN the HTTP phase there are TWO character transitions: first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the % signs, and my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting a problem BY default, this started in the Java 1.4x timeframe. In my Java code I prefer to set encoding errors to report as an exception, but again, not the default, and not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of being 2 or 3 times as many. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input characters, then Java only ever saw them as bytes or ints or whatever, and NEVER as valid code points.
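A quick Python illustration of that diagnostic (the sample word is mine):

s = u'インストール'                    # 6 Japanese characters
print(s.encode('ascii', 'replace'))  # b'??????': one ? per character, so the
                                     # text was seen as real Unicode characters
print(len(s.encode('utf-8')))        # 18: the ? count you'd expect if the
                                     # UTF-8 bytes were mangled one by one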
Came across this topic when searching for a solution for using parameters from a CSV file that contained some columns written in Hebrew.
I used Excel 2007 to create 1000 lines of data for user registrations. The first and last names had to be in Hebrew.
I exported the file to "Unicode text" file. It became tab delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++ and could see the Hebrew letters properly. Notepad++ has an "Encoding" menu item, where you can check the encoding or change it. So I changed the Little Endian to UTF-8.
Then I replaced tabs with commas (I just selected a tab and pasted it into the Find box).
The parameters were substituted OK, but after running the script I saw the following:
In the "View Results Tree" listener, I opened the "Result" tab of the "HTTP Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 that, when taken in pairs (%D7%A9), corresponded properly to Hebrew letters.
To my mind, JMeter has a bug and cannot properly display the Unicode chars. But it sends (POSTs) them out OK.
Hope I am right and hope it will help someone.
You can try using "Shift_JIS" as the Content encoding (it's next to the Method selection). Then you should uncheck "Encode?" for the parameter that contains Japanese.
Hope it works for you.

How to use '^#' in Vim scripts?

I'm trying to work around a problem with using ^# (i.e., <ctrl-#>) characters in Vim scripts. I can insert them into a script, but when the script runs it seems the line is truncated at the point where a ^# was located.
My kludgy solution so far is to have a ^# stored in a variable, then reference the variable in the script whenever I would have quoted a literal ^#. Can someone tell me what's going on here? Is there a better way around this problem?
That is one reason why I never use raw special-character values in scripts. While a raw ^# does not work, the string <C-#> in mappings works as expected, so you may use one of:
nnoremap <C-#> {rhs}
nnoremap <Nul> {rhs}
It is strange, but you cannot use <Char-0x0> here. Some notes about the null byte in strings:
Inserting a null byte into a string truncates it: Vim uses old C-style strings that end with a null byte, so a null byte cannot appear in strings. These strings are very inefficient, so if you want to generate a very large text, try accumulating it into a list of lines (using setline is very fast, as a buffer is represented as a list of lines).
Most functions that return lists of strings (like readfile, getline(start, end)) or take lists of strings (like writefile, setline, append) treat \n (NL) as Null. It is also the internal representation of buffer lines; see :h NL-used-for-Nul.
If you try to insert a \n character into the command line, you will see a Null shown (but this is really a newline). If you want to edit a file that has \n in its filename (which is possible on *nix), you will need to prepend the newline with a backslash.
The byte ctrl-# is also known as '\0'. Many languages, programs, etc. use it as an "end of string" marker, so it's not surprising that vim gets confused there. If you must use this byte in the middle of a script string, it sounds like your workaround is a decent one.

NSURL doesn't work any time

I have the following problem: sometimes my openURL dialog works perfectly. Then I looked at the URL variable, and this is its value:
www.brehm-gmbh.de
But at other times there are some strange elements at the end of the variable, like this:
www.adamczyk-fenster.de%E2%80%8E
I get these pages from an .asc file, and both appear normally in that file, without these elements.
What can I do to solve this problem?
Thank you all in advance for your help.
From Wikipedia:
The left-to-right mark (LRM) is a control character or non-printing character, used in the computerized typesetting of bi-directional text, containing mixed left-to-right scripts (such as English and Russian) and right-to-left scripts (such as Arabic and Hebrew). It is used to change the way adjacent characters are grouped with respect to text direction.
You're getting this because (1) you've got non-English URLs, are composing URLs from non-English strings, or have some other non-English elements and the string encoding is attempting to compensate, or (2) it's garbage being interpreted as an encoding (unlikely if it is consistent).
Call -[NSString localizedNameOfStringEncoding] on the string before you use it to see what encoding it is using. You probably need to explicitly establish an encoding when you read in the strings, before you put them in the NSURL.
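For illustration, stripping such invisible format characters before building the URL could look like this (a Python sketch of the idea, not Objective-C; the helper name is mine):

import unicodedata

def clean_url(raw):
    # Drop all Unicode "Cf" (format) characters, which include
    # U+200E LEFT-TO-RIGHT MARK (the %E2%80%8E seen above).
    return ''.join(ch for ch in raw if unicodedata.category(ch) != 'Cf')

print(clean_url(u'www.adamczyk-fenster.de\u200e'))  # www.adamczyk-fenster.de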

How to detect malformed UTF characters

I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?
Consider Python. It allows you to extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.
import codecs
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = 'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print s.encode('utf8')
This prints:
spam eggs bacon
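That snippet is Python 2. The same idea in Python 3, where the bytes/str split is explicit, looks roughly like this:

import codecs

codecs.register_error('spacer', lambda ex: (' ', ex.start + 1))
raw = b'spam\xb0\xc0eggs\xd0bacon'
print(raw.decode('utf-8', 'spacer'))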
EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)
One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.
RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).
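A rough sketch of that boundary scan (my own helper, based only on the lead-byte patterns from RFC 3629, with no full validation):

def next_boundary(data, i):
    """Return the index of the next plausible UTF-8 character boundary
    in the bytes object data, at or after position i."""
    while i < len(data):
        b = data[i]
        if (b < 0x80                     # 0xxxxxxx: ASCII
                or (b & 0xE0) == 0xC0    # 110xxxxx: 2-byte sequence start
                or (b & 0xF0) == 0xE0    # 1110xxxx: 3-byte sequence start
                or (b & 0xF8) == 0xF0):  # 11110xxx: 4-byte sequence start
            return i
        i += 1                           # 10xxxxxx continuation: keep scanning
    return i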
But BKB is probably correct: the easiest answer is to let Perl do it for you, although I'm not sure what Perl does when it detects incorrect UTF-8 with that filter in effect.