Python 2.7: detect emoji from text

Python 2.7: detect emoji from text - unicode

I'd like to be able to detect emoji in text and look up their names.
I've had no luck using unicodedata module and I suspect that I'm not
understanding the UTF-8 conventions.
I'd guess that I need to load my doc as as utf-8, then break the unicode "strings" into unicode symbols. Iterate over these and look them up.
#new example loaded using pandas and encoding UTF-8
'A man tried to get into my car\U0001f648'
type(test) = unicode
import unicodedata as uni
uni.name(test[0])
Out[89]: 'LATIN CAPITAL LETTER A'
uni.name(test[-3])
Out[90]: 'LATIN SMALL LETTER R'
uni.name(test[-1])
ValueError Traceback (most recent call last)
<ipython-input-105-417c561246c2> in <module>()
----> 1 uni.name(test[-1])
ValueError: no such name
# just to be clear
uni.name(u'\U0001f648')
ValueError: no such name
I looked up the unicode symbol via google and it's a legit symbol.
Perhaps the unicodedata module isn't very comprehensive...?
I'm considering making my own look up table from here.
Interested in other ideas...this one seems do-able.

My problem was in using Python2.7 for the unicodedata module.
using Conda I created a python 3.3 environment and now unicodedata works
as expected and I've given up on all weird hacks I was working on.
# using python 3.3
import unicodedata as uni
In [2]: uni.name('\U0001f648')
Out[2]: 'SEE-NO-EVIL MONKEY'
Thanks to Mark Ransom for pointing out that I originally had Mojibake from not
correctly importing my data. Thanks again for your help.

Here's a way to read the link you provided. It's translated from Python 2 so there might be a glitch or two.
import re
import urllib2
rexp = re.compile(r'U\+([0-9A-Za-z]+)[^#]*# [^)]*\) *(.*)')
mapping = {}
for line in urllib2.urlopen('ftp://ftp.unicode.org/Public/emoji/1.0/emoji-data.txt'):
line = line.decode('utf-8')
m = rexp.match(line)
if m:
mapping[chr(int(m.group(1), 16))] = m.group(2)

Related

How to overarchingly apply "utf-8" to opening csv/txt files in pandas dataframe?

I am trying to import data from text files from particular file path, but am getting error 'utf-8' codec can't decode byte 0xa5 in position 18: invalid start byte
My question is there anyway I can apply "utf-8" encoding to all the text files(about 20 others) I will have to open eventually so I can prevent the above error?
Code:
import pandas as pd
filelist = [r'D:/file1',r'D:/file2']
print (len((pd.concat([pd.read_csv(item, names=[item[:-4]]) for item in filelist],axis=1))))
also open to any suggestions if I am doing something wrong.
Thank you in advance.

Not aware of solution to automatically convert encoding to utf-8 in python.
Alternatively, you can find out what the encoding is, and read it accordingly. Then write to file in utf-8.
this solution worked well for my files (credit maxnoe)
import chardet
import pandas as pd
with open('filename.csv', 'rb') as f:
result = chardet.detect(f.read()) # or readline if the file is large
pd.read_csv('filename.csv', encoding=result['encoding'])
don't forget to pip install chardet
if you now write file using pd.to_csv(), pandas default is to encode in utf-8

Fixing file encoding

Today I ordered a translation for 7 different languages, and 4 of them appear to be great, but when I opened the other 3, namely Greek, Russian, and Korean, the text that was there wasn't related to any language at all. It looked like a bunch of error characters, like the kind you get when you have the wrong encoding on a file.
For instance, here is part of the output of the Korean translation:
½Ì±ÛÇÃ·¹ÀÌ¾î
¸ÖÆ¼ÇÃ·¹ÀÌ¾î
¿É¼Ç
I may not even speak a hint of Korean, but I can tell you with all certainty that is not Korean.
I assume this is a file encoding issue, and when I open the file in Notepad, the encoding is listed as ANSI, which is clearly a problem; the same can be said for the other two languages.
Does anyone have any ideas on how to fix the encoding of these 3 files; I requested the translators reupload in UTF-8, but in the meantime, I thought I might try to fix it myself.
If anyone is interested in seeing the actual files, you can get them from my Dropbox.

If you look at the byte stream as pairs of bytes, they look vaguely Korean but I cannot tell of they are what you would expect or not.
bash$ python3.4
Python 3.4.3 (v3.4.3:b4cbecbc0781, May 30 2015, 15:45:01)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> buf = '½Ì±ÛÇÃ·¹ÀÌ¾î'
>>> [hex(ord(b)) for b in buf]
>>> ['0xbd', '0xcc', '0xb1', '0xdb', '0xc7', '0xc3', '0xb7', '0xb9', '0xc0', '0xcc', '0xbe', '0xee']
>>> u'\uBDCC\uB1DB\uC7C3\uB7B9\uC0CC\uBEEE'
'뷌뇛쟃랹샌뻮'
Your best bet is to wait for the translator to upload UTF-8 versions or have them tell you the encoding of the file. I wouldn't make the assumption that they bytes are simply 16 bit characters.
Update
I passed this through the chardet module and it detected the character set as EUC-KR.
>>> import chardet
>>> chardet.detect(b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE')
{'confidence': 0.833333333333334, 'encoding': 'EUC-KR'}
>>> b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE'.decode('EUC-KR')
'싱글플레이어'
According to Google translate, the first line is "Single Player". Try opening it with Notepad and using EUC-KR as the encoding.

matplotlib pyplot table with non-ascii data?

I'm trying to create a table figure in png.
The data in the table contains non-ascii characters.(suppose it's chinese or something)
I pass unicode code-points (u'hello') to pyplot and it shows the characters as squares.

It may be (slightly) platform dependent, but I prefer to use unicode for Chinese and other languages. One other thing is that you need to make sure is matplotlib must get the necessary font. You can do it anywhere you need a text except sometimes not with mathtext.
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import matplotlib
zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/Kai.dfont') #I am on OSX.
s=u'\u54c8\u54c8' #Need the unicode for your Chinese Char.
plt.text(0.5,0.5,s,fontproperties=zhfont1, size=50) #example: plt.text()

another method is to modify the file /matplotlibrc/ on your matplotlib system.
Find this line
#font.sans-serif
and add your font in this line.
And then you need to add the font-file into the font directory of matplotlib.
Here I give this two paths:
~\AppData\Local\Enthought\Canopy\User\Lib\site-packages\matplotlib\mpl-data
$HOME.matplotlib\fontList.cache
PS: Windows 7.

Scrappy' method re() doesn't work with Unicode strings

I'm working in Windows 7 and scrappy interactive console (based on IPython).
I'm doing step Trying Selectors in the Shell in the tutorial
If i grab some site with english letters title, all is okay, like in the tutorial:
In [5]: hxs.select('//title/text()').re('(\w+):')`
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']`
But if i grab site with non-english letters (russian, Unicode), re() method does not return anything:
In [25]: hxs.select('//title/text()').re('(\w+)')
Out[25]: []
There is some text in Title, it is not empty:
In [24]: hxs.select('//title/text()').extract()
Out[24]: [u'\u041b\u043e\u043a\u0430\u0446\u0438\u043e\u043d\u043d\u044b\u0439 \u043f\u043e\u0438\u0441\u043a \u0430\u0431\u043e\u043d\u0435\u043d\u0442\u043e\u0432']
Help me, can i use scrapy' re() with unicode symbols?

Sounds like Scrapy isn't using the re.UNICODE flag for its regexes, so \w isn't including all the Unicode-defined "word" characters.
The docs seem to indicate that Scrapy's .re can take an already-compiled regex, so you could try compiling your regex yourself with the UNICODE flag:
import re
hxs.select('//title/text()').re(re.compile('(\w+)', re.UNICODE))

Show a character's Unicode codepoint value in Eclipse

I have a UTF-8 text file open in Eclipse, and I'd like to find out what a particular Unicode character is. Is there a function to display the Unicode codepoint of the character under the cursor?

I do not think there is yet a plugin doing exactly what you are looking for.
I know about a small plugin able to encode/decode a unicode sequence:
The sources (there is not even a fully built jar plugin yet) are here, with its associated tarball: you can import it as a PDE plugin project a,d test it in your eclipse.

You can also look-up a character in the Unicode database using the Character Properties Unicode Utility at http://unicode.org/. I've made a Firefox Search Engine to search via that utility. So, just copy-and-paste from your favourite editor into the search box.
See the list of online tools at http://unicode.org/. E.g. it lists Unicode Lookup by Jonathan Hedley.

Here's a Python script to show information about Unicode characters on a Windows clipboard. So, just copy the character in your favourite editor, then run this program.
Not built-in to Eclipse, but it's what I'll probably use when I haven't got a better option.
"""
Print information about Unicode characters on the Windows clipboard
Requires Python 2.6 and PyWin32.
For ideas on how to make it work on Linux via GTK, see:
http://mrlauer.wordpress.com/2007/12/31/python-and-the-clipboard/
"""
import win32con
import win32clipboard
import unicodedata
import sys
import codecs
from contextlib import contextmanager
MAX_PRINT_CHARS = 1
# If a character can't be output in the current encoding, output a replacement e.g. '??'
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout, errors='replace')
#contextmanager
def win_clipboard_context():
"""
A context manager for using the Windows clipboard safely.
"""
try:
win32clipboard.OpenClipboard()
yield
finally:
win32clipboard.CloseClipboard()
def get_clipboard_text():
with win_clipboard_context():
clipboard_text = win32clipboard.GetClipboardData(win32con.CF_UNICODETEXT)
return clipboard_text
def print_unicode_info(text):
for char in text[:MAX_PRINT_CHARS]:
print(u"Char: {0}".format(char))
print(u" Code: {0:#x} (hex), {0} (dec)".format(ord(char)))
print(u" Name: {0}".format(unicodedata.name(char, u"Unknown")))
try:
clipboard_text = get_clipboard_text()
except TypeError:
print(u"The clipboard does not contain Unicode text")
else:
print_unicode_info(clipboard_text)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Python 2.7: detect emoji from text - unicode

Related

How to overarchingly apply "utf-8" to opening csv/txt files in pandas dataframe?

Fixing file encoding

matplotlib pyplot table with non-ascii data?

Scrappy' method re() doesn't work with Unicode strings

Show a character's Unicode codepoint value in Eclipse

Categories

Resources