Representing \xd9\x88 (a UTF-8 byte literal) as its Arabic character و in Python - unicode

I know that \xd9\x88 is the UTF-8 byte sequence for the letter و in Arabic (you can see this page).
I have a file that contains a list of such UTF-8 sequences. How can I display them as Arabic characters, for example represent \xd9\x88 with و?
On Python 3 if I do:
>>> i = '\xd9\x88'
>>> print(i)
Ù

If you want to print the character, simply use print(); but you have to make sure your terminal supports the encoding and is using a font that has that glyph.
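As a quick sanity check, you can ask Python which encoding it will use for output (the exact value depends on your platform and console settings):
>>> import sys
>>> sys.stdout.encoding
'UTF-8'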
In the Windows command prompt, with the default OEM code page (which doesn't support Arabic), you'll see mojibake instead:
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> i = "\xd9\x88"
>>> print(i)
┘ê
>>>
On Linux, using UTF-8 as the default encoding and using a font that has the Arabic glyphs, you'll see this:
>>> i = "\xd9\x88"
>>> print(i)
و
>>>
Back on Windows, if you use a text editor that supports UTF-8 (in this case, Sublime Text), the character displays correctly.
I am using IDLE with Python 3 on Windows.
Python 3 introduced some major changes to how strings are handled: in Python 3, all strings are Unicode. What you actually have is the UTF-8 byte sequence for the character, stored one code point per byte in a str, so you need to get it back into bytes and decode it properly.
You can do this in two ways. The first is to make sure it is a byte string to start with:
>>> i = b"\xd9\x88"
>>> print(i.decode('utf-8'))
و
Or, you can encode it to latin-1 first, which gives you a byte string because latin-1 maps code points 0-255 to the identical byte values, then decode it:
>>> i = "\xd9\x88"
>>> type(i)
<class 'str'>
>>> type(i.encode('latin-1'))
<class 'bytes'>
>>> print(i.encode('latin-1').decode('utf-8'))
و
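To apply this to a whole file, it depends on what the file actually holds; here is a minimal sketch assuming a hypothetical arabic.txt:
# Case 1: the file holds raw UTF-8 bytes
with open('arabic.txt', encoding='utf-8') as f:
    print(f.read())

# Case 2: the file holds the literal six-character text \xd9\x88
with open('arabic.txt', encoding='ascii') as f:
    for line in f:
        s = line.strip().encode('ascii').decode('unicode_escape')  # '\xd9\x88' -> code points U+00D9, U+0088
        print(s.encode('latin-1').decode('utf-8'))                 # -> 'و'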

Related

Mapping arbitrary unicode alphanumeric characters to their ascii equivalents

When I encounter an arbitrary Unicode string, such as in a hashtag, I would like to express only its alphanumeric components in a string of their ASCII equivalents. For example,
x='€𝙋𝙖𝙩𝙧𝙞𝙤𝙩'
would be rendered as
x='Patriot'
Since I cannot anticipate the unicode that could appear in such strings, I would like the method to be as general as possible. Any suggestions?
The unicodedata.normalize function can translate Unicode code points to a canonical value. Then run the value through an ascii encode, ignoring the non-ASCII values, to get a byte string, and back through an ascii decode to get a Unicode string again:
>>> import unicodedata as ud
>>> x='€𝙋𝙖𝙩𝙧𝙞𝙤𝙩'
>>> ud.normalize('NFKC',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
If you need to remove accents from letters but still keep the base letter, use 'NFKD' instead.
>>> x='€𝙋𝙖𝙩𝙧𝙞ô𝙩'
>>> ud.normalize('NFKD',x).encode('ascii',errors='ignore').decode('ascii')
'Patriot'
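If you do this often, the recipe is easy to wrap in a small helper; a sketch (the function name here is mine, not from the answer):
import unicodedata

def asciify(text, form='NFKD'):
    # Normalize, then drop anything that can't be represented in ASCII
    return unicodedata.normalize(form, text).encode('ascii', errors='ignore').decode('ascii')

print(asciify('€𝙋𝙖𝙩𝙧𝙞ô𝙩'))  # -> Patriot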

I have found two non-printing characters in a database, what do they mean?

It seems a database I am working on had two non-printing characters that were messing something up down the line. After doing some digging, the computer shows them as â, then U+0080, then U+0093.
Any idea what these characters could mean? I suspect it's something from Unicode that wasn't converted correctly, but I don't know how to translate it.
The Unicode code point for â is U+00E2, so your three characters correspond to the bytes E2 80 93, which is the UTF-8 sequence for U+2013 EN DASH.
If UTF-8-encoded data is incorrectly decoded as ISO-8859-1 (also called "latin1") it is displayed as you describe. Here's an example in Python:
>>> print('\u2013') # Displays U+2013 EN DASH
–
>>> '\u2013'.encode('utf8') # byte sequence of UTF-8-encoded EN DASH
b'\xe2\x80\x93'
>>> '\u2013'.encode('utf8').decode('latin1') # decoded incorrectly
'â\x80\x93'
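Running the same round trip in reverse repairs the damage, provided the mis-decoded text survived byte-for-byte (a sketch of the general latin1 trick, not from the original answer):
>>> 'â\x80\x93'.encode('latin1').decode('utf8')  # re-encode the mojibake, then decode correctly
'–'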
I found a website that described it for me:
https://www.compart.com/en/unicode/U+2012#UNC_DB
The numbers matched what appeared in the UTF-8 encoding.

Cyrillic sets CP 1048 conversion

I am using a printer which needs the Cyrillic character set CP 1048 for printing the Kazakh and Russian languages in text mode. How do we convert text to CP 1048? CP 1048 is a combined character set for Kazakh and Russian. These languages come together in text files, and this code page is available as a standard feature on the printer.
You convert text with some kind of text encoding converter. Since no tool was specified, I'll use a Python script. Note this requires Python 3.5 or later: that was the first version to define the kz1048 codec.
Unicode string to KZ-1048 encoding:
>>> 'Россия/Ресей'.encode('kz1048')
b'\xd0\xee\xf1\xf1\xe8\xff/\xd0\xe5\xf1\xe5\xe9'
Note in Python b'\xd0\xee' denotes a byte string containing the byte hexadecimal values D0 and EE, which in KZ-1048 represent Р and о.
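To convert a whole file rather than a single string, here is a minimal sketch (the filenames are hypothetical, and it assumes the source file is UTF-8):
with open('input.txt', encoding='utf-8') as src, \
        open('printer.prn', 'w', encoding='kz1048') as dst:
    dst.write(src.read())
Characters with no KZ-1048 equivalent raise UnicodeEncodeError; pass errors='replace' to the output open() if you would rather substitute than fail.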

Fixing file encoding

Today I ordered a translation for 7 different languages, and 4 of them appear to be great, but when I opened the other 3, namely Greek, Russian, and Korean, the text that was there wasn't related to any language at all. It looked like a bunch of error characters, like the kind you get when you have the wrong encoding on a file.
For instance, here is part of the output of the Korean translation:
½Ì±ÛÇ÷¹À̾î
¸ÖƼÇ÷¹À̾î
¿É¼Ç
I may not even speak a hint of Korean, but I can tell you with all certainty that is not Korean.
I assume this is a file encoding issue, and when I open the file in Notepad, the encoding is listed as ANSI, which is clearly a problem; the same can be said for the other two languages.
Does anyone have any ideas on how to fix the encoding of these 3 files? I requested that the translators reupload in UTF-8, but in the meantime I thought I might try to fix it myself.
If anyone is interested in seeing the actual files, you can get them from my Dropbox.
If you look at the byte stream as pairs of bytes, they look vaguely Korean, but I cannot tell if they are what you would expect or not.
bash$ python3.4
Python 3.4.3 (v3.4.3:b4cbecbc0781, May 30 2015, 15:45:01)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> buf = '½Ì±ÛÇ÷¹À̾î'
>>> [hex(ord(b)) for b in buf]
['0xbd', '0xcc', '0xb1', '0xdb', '0xc7', '0xc3', '0xb7', '0xb9', '0xc0', '0xcc', '0xbe', '0xee']
>>> u'\uBDCC\uB1DB\uC7C3\uB7B9\uC0CC\uBEEE'
'뷌뇛쟃랹샌뻮'
Your best bet is to wait for the translator to upload UTF-8 versions or have them tell you the encoding of the file. I wouldn't make the assumption that the bytes are simply 16-bit characters.
Update
I passed this through the chardet module and it detected the character set as EUC-KR.
>>> import chardet
>>> chardet.detect(b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE')
{'confidence': 0.833333333333334, 'encoding': 'EUC-KR'}
>>> b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE'.decode('EUC-KR')
'싱글플레이어'
According to Google Translate, the first line is "Single Player". Try opening the file with an editor that lets you choose EUC-KR as the encoding, such as Notepad++.
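If you want to repair the files yourself in the meantime, re-encoding them is a short script; a sketch, assuming chardet guessed right and with hypothetical filenames:
with open('korean.txt', encoding='euc-kr') as src, \
        open('korean-utf8.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())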

Show a character's Unicode codepoint value in Eclipse

I have a UTF-8 text file open in Eclipse, and I'd like to find out what a particular Unicode character is. Is there a function to display the Unicode codepoint of the character under the cursor?
I do not think there is yet a plugin doing exactly what you are looking for.
I know about a small plugin able to encode/decode a Unicode sequence:
The sources (there is not even a fully built jar plugin yet) are here, with an associated tarball: you can import it as a PDE plugin project and test it in your Eclipse.
You can also look up a character in the Unicode database using the Character Properties Unicode Utility at http://unicode.org/. I've made a Firefox search engine to search via that utility, so you can just copy-and-paste from your favourite editor into the search box.
See the list of online tools at http://unicode.org/. E.g. it lists Unicode Lookup by Jonathan Hedley.
Here's a Python script to show information about Unicode characters on the Windows clipboard. Just copy the character in your favourite editor, then run this program.
It's not built in to Eclipse, but it's what I'll probably use when I haven't got a better option.
"""
Print information about Unicode characters on the Windows clipboard
Requires Python 2.6 and PyWin32.
For ideas on how to make it work on Linux via GTK, see:
http://mrlauer.wordpress.com/2007/12/31/python-and-the-clipboard/
"""
import win32con
import win32clipboard
import unicodedata
import sys
import codecs
from contextlib import contextmanager
MAX_PRINT_CHARS = 1
# If a character can't be output in the current encoding, output a replacement e.g. '??'
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout, errors='replace')
#contextmanager
def win_clipboard_context():
"""
A context manager for using the Windows clipboard safely.
"""
try:
win32clipboard.OpenClipboard()
yield
finally:
win32clipboard.CloseClipboard()
def get_clipboard_text():
with win_clipboard_context():
clipboard_text = win32clipboard.GetClipboardData(win32con.CF_UNICODETEXT)
return clipboard_text
def print_unicode_info(text):
for char in text[:MAX_PRINT_CHARS]:
print(u"Char: {0}".format(char))
print(u" Code: {0:#x} (hex), {0} (dec)".format(ord(char)))
print(u" Name: {0}".format(unicodedata.name(char, u"Unknown")))
try:
clipboard_text = get_clipboard_text()
except TypeError:
print(u"The clipboard does not contain Unicode text")
else:
print_unicode_info(clipboard_text)
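For example, copying € to the clipboard and running the script should print something like:
Char: €
 Code: 0x20ac (hex), 8364 (dec)
 Name: EURO SIGN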