matplotlib pyplot table with non-ascii data? - encoding

I'm trying to create a table figure as a PNG.
The data in the table contains non-ASCII characters (suppose it's Chinese or something).
I pass Unicode strings (e.g. u'hello') to pyplot and it shows the characters as squares.

It may be (slightly) platform dependent, but I prefer to use Unicode for Chinese and other languages. The other thing you need to make sure of is that matplotlib can find the necessary font. You can use this anywhere you need text, except sometimes not with mathtext.
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

zhfont1 = FontProperties(fname='/Library/Fonts/Kai.dfont')  # I am on OS X.
s = u'\u54c8\u54c8'  # Use the Unicode escape for your Chinese characters.
plt.text(0.5, 0.5, s, fontproperties=zhfont1, size=50)  # example: plt.text()
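For the table itself, which is what the question asks about, the same FontProperties can be applied to every cell. A minimal sketch, assuming the font path above is valid on your machine (the cell contents are just the example string from above):

fig, ax = plt.subplots()
ax.axis('off')
tbl = ax.table(cellText=[[s, s], [s, s]], loc='center')
for cell in tbl.get_celld().values():
    cell.set_text_props(fontproperties=zhfont1, fontsize=20)
fig.savefig('table.png')  # placeholder output file name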

Another method is to modify the matplotlibrc file of your matplotlib installation.
Find this line:
#font.sans-serif
and add your font to that line.
Then you need to copy the font file into matplotlib's font directory and delete the font cache so it gets rebuilt. Here are the two relevant paths (on Windows 7 with Canopy):
~\AppData\Local\Enthought\Canopy\User\Lib\site-packages\matplotlib\mpl-data
$HOME\.matplotlib\fontList.cache
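Equivalently, you can make the same change from Python at runtime instead of editing matplotlibrc. A minimal sketch; the font name 'SimHei' is only an example and must be a font matplotlib can actually find:

import matplotlib
# Put a CJK-capable font first in the sans-serif fallback list (example name).
matplotlib.rcParams['font.sans-serif'] = ['SimHei'] + matplotlib.rcParams['font.sans-serif']
matplotlib.rcParams['axes.unicode_minus'] = False  # avoid broken minus signs with such fonts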

Related

Stata 13: Encoding of German Characters in Windows 8 and Mac OS X

For a current project, I use a number of CSV files that are saved in UTF-8. The motivation for this encoding is that they contain information in German with the special characters ä, ö, ü, ß. My team works with Stata 13 on Mac OS X and Windows 7 (the software is frequently updated).
When we import a CSV file into Stata (choosing Latin-1 on import), the special characters are displayed correctly on both operating systems. However, when we export the dataset to another CSV file on Mac OS X - which we need to do quite often in our setup - the special characters are replaced, e.g. ä -> Š, ü -> Ÿ etc. On Windows, exporting works like a charm and the special characters are not replaced.
Troubleshooting: Stata 13 cannot interpret Unicode. I have tried converting the UTF-8 files to Windows-1252 and Latin-1 (ISO 8859-1) encoding (since, after all, they only contain German characters) using Sublime Text 2 prior to importing them into Stata. However, the same problem remains on Mac OS X.
Yesterday, Stata 14 was announced, which apparently can deal with Unicode. If that is the reason, then it would probably help with my problem; however, we will not be able to upgrade soon. Apart from that, I am wondering why the problem arises on Mac but not on Windows? Can anyone help? Thank you.
[EDIT Start] When I import the exported csv file again using a "Mac Roman" Text encoding (Stata allows to specify that in the importing dialogue), then my german special characters appear again. Apparently I am not the only one encountering this problem by the looks of this thread. However, because I need to work with the exported csv files, I still need a solution to this problem. [EDIT End]
[EDIT2 Start] One example is the word "Bösdorf" that is changed to "Bšsdorf". In the original file the hex code is 42c3 b673 646f 7266, whereas the hex code in the exported file is 42c5 a173 646f 7266. [EDIT2 End]
Until the bug gets fixed, you can work around this with
iconv -f utf-8 -t cp1252 <oldfile.csv | iconv -f mac -t utf-8 >newfile.csv
This undoes an incorrect transcoding which apparently the export function in Stata performs internally.
Based on the characters you showed, cp1252 seems like a good guess, but it could also be cp1254. More examples could help settle the issue if you can't figure it out (common German characters to test with would include ä, the uppercase umlauts, the sharp s ß, etc.).
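As a sanity check, the same round trip can be reproduced in Python with the "Bösdorf" example from the question; this only demonstrates the transcoding logic, it is not Stata-specific:

# Undo the mis-transcoding: write the exported UTF-8 text back out as cp1252
# bytes, then reinterpret those bytes as Mac Roman (Stata 13's internal encoding on OS X).
exported = u'B\u0161sdorf'   # "Bšsdorf", as found in the exported file
recovered = exported.encode('cp1252').decode('mac_roman')
print(recovered)             # -> Bösdorf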
Stata 13 and below use a deprecated locale on Mac OS X, macroman (Mac OS X itself is Unicode). I generally used StatTransfer to convert, for example, from Excel (Unicode) to Stata (Western, macroman; Options -> Encoding options) for Spanish-language data. It was the only way to get á, é, etc. Furthermore, Stata 14 imports Unicode without problems but insists on exporting with es_ES (Spanish, Spain) as the default locale, so you have to add locale UTF-8 at the end of the export command to get a readable Excel file.

Python 2.7: detect emoji from text

I'd like to be able to detect emoji in text and look up their names.
I've had no luck using the unicodedata module and I suspect that I'm not understanding the UTF-8 conventions.
I'd guess that I need to load my document as UTF-8, then break the Unicode string into individual Unicode characters, iterate over these and look them up.
# new example loaded using pandas with encoding UTF-8
test = u'A man tried to get into my car\U0001f648'
type(test)  # -> unicode
import unicodedata as uni
uni.name(test[0])
Out[89]: 'LATIN CAPITAL LETTER A'
uni.name(test[-3])
Out[90]: 'LATIN SMALL LETTER R'
uni.name(test[-1])
ValueError Traceback (most recent call last)
<ipython-input-105-417c561246c2> in <module>()
----> 1 uni.name(test[-1])
ValueError: no such name
# just to be clear
uni.name(u'\U0001f648')
ValueError: no such name
I looked up the Unicode symbol via Google and it's a legitimate symbol.
Perhaps the unicodedata module isn't very comprehensive...?
I'm considering making my own look-up table from here.
I'm interested in other ideas... this one seems doable.
My problem was using Python 2.7's unicodedata module: Python 2.7 ships an older version of the Unicode Character Database, which predates these emoji, so name() has no entry for them.
Using Conda I created a Python 3.3 environment and now unicodedata works as expected, so I've given up on all the weird hacks I was working on.
# using python 3.3
import unicodedata as uni
In [2]: uni.name('\U0001f648')
Out[2]: 'SEE-NO-EVIL MONKEY'
Thanks to Mark Ransom for pointing out that I originally had Mojibake from not
correctly importing my data. Thanks again for your help.
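For completeness, here is a minimal Python 3 sketch of the approach described in the question (iterate over the characters and look each one up); treating everything outside the Basic Multilingual Plane as an emoji candidate is only a crude heuristic, not a real emoji test:

import unicodedata

text = 'A man tried to get into my car\U0001f648'
for ch in text:
    if ord(ch) > 0xFFFF:  # crude heuristic: non-BMP character
        print(hex(ord(ch)), unicodedata.name(ch, 'UNKNOWN'))
# -> 0x1f648 SEE-NO-EVIL MONKEY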
Here's a way to read the link you provided. It's translated from Python 2 so there might be a glitch or two.
import re
from urllib.request import urlopen

rexp = re.compile(r'U\+([0-9A-Za-z]+)[^#]*# [^)]*\) *(.*)')
mapping = {}
for line in urlopen('ftp://ftp.unicode.org/Public/emoji/1.0/emoji-data.txt'):
    line = line.decode('utf-8')
    m = rexp.match(line)
    if m:
        mapping[chr(int(m.group(1), 16))] = m.group(2)
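Hypothetical usage of the resulting dictionary, assuming the download succeeded and the file's lines match the regular expression:

print(mapping.get('\U0001f648', 'not in the emoji data file'))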

Importing German values in openerp 7

I would like to import German language values in Openerp 7. Currently, the import fails due to special characters. Importing English text works perfectly.
Some of the sample values are:
beschränkt
öffentlich nach Einzelgewerken
Do I need to change the language preference to German first, before importing?
Also, do I need to know German to accomplish this task?
Any advice?
Changing the language of your User account to German is only useful if you'd like to load the German version of translatable fields. For example, if you were importing a list of Products as a CSV file, it would allow you to load the German translation of the product names. If you don't, the names will simply be stored as the master translation (English).
However, it is very likely that your import fails due to an encoding issue. Encoding comes into the picture because of German special characters such as umlauts. In that case you basically need to make sure that you import the CSV file using the same encoding setting that was used to export it.
If your CSV was produced on Windows using Excel or something similar, there is a good chance it was produced with Windows-1252 encoding. By default OpenERP will select utf-8, so you will need to change that in the Encoding selection box of the File Format Options that appear after you select the CSV file to import (in OpenERP 7.0).
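If you cannot control the export side, you can also re-encode the CSV to UTF-8 before importing it. A minimal sketch, not OpenERP-specific; the file names are just placeholders:

# Convert a Windows-1252 CSV (as typically produced by Excel) to UTF-8.
with open('products_cp1252.csv', 'rb') as src, open('products_utf8.csv', 'wb') as dst:
    dst.write(src.read().decode('cp1252').encode('utf-8'))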

Which packages for many unicode-characters?

I'm trying to create a LaTeX document with as many Unicode characters as possible. My header is as follows:
\documentclass[12pt,a4paper]{article}
\usepackage[utf8x]{inputenc}
\usepackage{ucs}
\pagestyle{empty}
\usepackage{textcomp}
\usepackage[T1]{fontenc}
\begin{document}
The Unicode characters which follow in the document-body are in the form of
\unichar{xyz}
where xyz stands for an integer, e.g. 97 (for "a").
The problem is: for many integers the document does not compile, and gives an error message such as:
! Package ucs Error: Unknown Unicode character 79297 = U+135C1,
(ucs) possibly declared in uni-309.def.
(ucs) Type H to see if it is available with options.
Which packages should I add to the header to make the file compile with as many characters as possible?
As far as I remember, the utf8x package is merely a hack to allow for some Unicode "support". Basically it is a giant lookup table translating individual character sequences into what LaTeX expects.
You should really use Xe(La)TeX for such things, since it was designed with Unicode in mind (see the minimal example below). The old TeX still suffers from its 1970s heritage in this respect.
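For example, a minimal XeLaTeX document could look like this; the font name is only an illustration and has to be installed on your system:

\documentclass[12pt,a4paper]{article}
\usepackage{fontspec}        % native Unicode font handling under XeLaTeX/LuaLaTeX
\setmainfont{DejaVu Serif}   % example font; pick one that covers your characters
\begin{document}
ā, ö, ß and other Unicode characters can be typed directly.
\end{document}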
ETA: From the package's documentation:
This bundle provides the ucs package, and utf8x.def, together with a large number of support files.
The utf8x.def definition file for use with inputenc covers a wider range of Unicode characters than does utf8.def in the LaTeX distribution. The ucs package provides facilities for efficient use of large sets of Unicode characters.
Since you're already using both packages I guess you're out of luck then with plain LaTeX.
You can manually edit the .def files, similarly to this: https://bugzilla.redhat.com/show_bug.cgi?id=418981
As far as I know, this is for example the only way to enable LaTeX for the Latvian language.
Just find your symbol (on a Mac, use "locate uni-1.def" to find where the package is located; to enable the locate command see http://osxdaily.com/2011/11/02/enable-and-use-the-locate-command-in-the-mac-os-x-terminal/).

Show a character's Unicode codepoint value in Eclipse

I have a UTF-8 text file open in Eclipse, and I'd like to find out what a particular Unicode character is. Is there a function to display the Unicode codepoint of the character under the cursor?
I do not think there is yet a plugin doing exactly what you are looking for.
I know about a small plugin able to encode/decode a unicode sequence:
The sources (there is not even a fully built jar of the plugin yet) are here, with an associated tarball: you can import it as a PDE plugin project and test it in your Eclipse.
You can also look up a character in the Unicode database using the Character Properties Unicode Utility at http://unicode.org/. I've made a Firefox search engine that queries that utility, so just copy and paste from your favourite editor into the search box.
See the list of online tools at http://unicode.org/; e.g. it lists Unicode Lookup by Jonathan Hedley.
Here's a Python script to show information about Unicode characters on the Windows clipboard. Just copy the character in your favourite editor, then run this program.
It's not built into Eclipse, but it's what I'll probably use when I haven't got a better option.
"""
Print information about Unicode characters on the Windows clipboard
Requires Python 2.6 and PyWin32.
For ideas on how to make it work on Linux via GTK, see:
http://mrlauer.wordpress.com/2007/12/31/python-and-the-clipboard/
"""
import win32con
import win32clipboard
import unicodedata
import sys
import codecs
from contextlib import contextmanager
MAX_PRINT_CHARS = 1
# If a character can't be output in the current encoding, output a replacement e.g. '??'
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout, errors='replace')
@contextmanager
def win_clipboard_context():
    """A context manager for using the Windows clipboard safely."""
    try:
        win32clipboard.OpenClipboard()
        yield
    finally:
        win32clipboard.CloseClipboard()

def get_clipboard_text():
    with win_clipboard_context():
        clipboard_text = win32clipboard.GetClipboardData(win32con.CF_UNICODETEXT)
        return clipboard_text

def print_unicode_info(text):
    for char in text[:MAX_PRINT_CHARS]:
        print(u"Char: {0}".format(char))
        print(u"  Code: {0:#x} (hex), {0} (dec)".format(ord(char)))
        print(u"  Name: {0}".format(unicodedata.name(char, u"Unknown")))

try:
    clipboard_text = get_clipboard_text()
except TypeError:
    print(u"The clipboard does not contain Unicode text")
else:
    print_unicode_info(clipboard_text)
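If you don't need the clipboard part, a minimal cross-platform sketch (Python 3) that does the same lookup for characters passed on the command line could look like this:

import sys
import unicodedata

# Print the code point and name of every character in the command-line arguments.
for char in ''.join(sys.argv[1:]):
    print('U+{0:04X} {1}'.format(ord(char), unicodedata.name(char, 'Unknown')))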