Show a character's Unicode codepoint value in Eclipse

I have a UTF-8 text file open in Eclipse, and I'd like to find out what a particular Unicode character is. Is there a function to display the Unicode codepoint of the character under the cursor?

I do not think there is yet a plugin that does exactly what you are looking for.
I know of a small plugin able to encode/decode a Unicode sequence:
The sources (there is not even a fully built plugin jar yet) are here, with the associated tarball: you can import it as a PDE plugin project and test it in your Eclipse.

You can also look up a character in the Unicode database using the Character Properties Unicode Utility at http://unicode.org/. I've made a Firefox search engine that queries that utility, so you can just copy and paste a character from your favourite editor into the search box.
See the list of online tools at http://unicode.org/; for example, it lists Unicode Lookup by Jonathan Hedley.

Here's a Python script that shows information about the Unicode characters on the Windows clipboard. Just copy the character in your favourite editor, then run this program.
It's not built into Eclipse, but it's what I'll probably use when I haven't got a better option.
"""
Print information about Unicode characters on the Windows clipboard
Requires Python 2.6 and PyWin32.
For ideas on how to make it work on Linux via GTK, see:
http://mrlauer.wordpress.com/2007/12/31/python-and-the-clipboard/
"""
import win32con
import win32clipboard
import unicodedata
import sys
import codecs
from contextlib import contextmanager
MAX_PRINT_CHARS = 1
# If a character can't be output in the current encoding, output a replacement e.g. '??'
sys.stdout = codecs.getwriter(sys.stdout.encoding)(sys.stdout, errors='replace')
#contextmanager
def win_clipboard_context():
"""
A context manager for using the Windows clipboard safely.
"""
try:
win32clipboard.OpenClipboard()
yield
finally:
win32clipboard.CloseClipboard()
def get_clipboard_text():
with win_clipboard_context():
clipboard_text = win32clipboard.GetClipboardData(win32con.CF_UNICODETEXT)
return clipboard_text
def print_unicode_info(text):
for char in text[:MAX_PRINT_CHARS]:
print(u"Char: {0}".format(char))
print(u" Code: {0:#x} (hex), {0} (dec)".format(ord(char)))
print(u" Name: {0}".format(unicodedata.name(char, u"Unknown")))
try:
clipboard_text = get_clipboard_text()
except TypeError:
print(u"The clipboard does not contain Unicode text")
else:
print_unicode_info(clipboard_text)
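For example, with a single euro sign (€, U+20AC) on the clipboard, the script should print something like:
Char: €
 Code: 0x20ac (hex), 8364 (dec)
 Name: EURO SIGN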

Related

I need to remove a specific Unicode character from my existing subtitle text file

I basically work on subtitles, and I have this Arabic file. When I open it in Notepad, right-click and select SHOW UNICODE CONTROL CHARACTERS, it shows some weird characters at the left of every line. I tried many ways to remove them but failed, including the following tools:
Notepad++
Subtitle Edit
Excel
Word
288
00:24:41,960 --> 00:24:43,840
‫أتعلم، قللنا من شأنك فعلاً‬
289
00:24:44,000 --> 00:24:47,120
‫كان علينا تجنيدك لتكون جاسوساً‬
‫مكان (كاي سي)‬
290
00:24:47,280 --> 00:24:51,520
‫لا تعلمون كم أنا سعيد‬
‫لسماع ذلك‬
291
00:24:54,800 --> 00:24:58,160
‫لا تقلق، سيستيقظ نشيطاً غداً‬
292
00:24:58,320 --> 00:25:00,800
‫ولن يتذكر ما حصل‬
‫في الساعات الـ٦‬
The Unicode control characters don't show above; the character is U+202B, which that tool displays as a ¶ sign. After googling it, I think that sign is called a PILCROW.
The issue is that the file doesn't display subtitles correctly in the PS4 app.
I need this PILCROW sign to go away. I can see the issue in the file with this website: https://www.soscisurvey.de/tools/view-chars.php
The PILCROW ¶ is used by various software and publishers to mark the end of a line in a document. The pilcrow character itself does not exist in your file, so it is not what you need to get rid of.
The Unicode characters in these lines are 'RIGHT-TO-LEFT EMBEDDING' (code \u202b) and 'POP DIRECTIONAL FORMATTING' (code \u202c) - these are used in the text to indicate that the enclosed text should be rendered right-to-left instead of the Western left-to-right direction.
Now, these characters are included as hints to the application displaying the text, rather than actually performing the text reversal themselves - so they can likely be removed without compromising how the text is displayed.
Now, this is a programming Q&A site, but you did not indicate any programming language you are familiar with - enough for at least running a program - so it is very hard to know how to give an answer that suits you.
Python can be used to create a small program to filter such characters from a file, but I am not willing to write a full-fledged GUI program or a web app just as an answer here.
A program that works from the command line just to filter out a few characters is another thing, as it is just a few lines of code.
Store the following listing as a file named, say, "fixsubtitles.py", and in a terminal ("cmd" if you are on Windows) type python3 fixsubtitles.py \path\to\subtitlefile.txt and press Enter.
That is, of course, after installing the Python 3 runtime from http://python.org (on Mac or Linux it is already pre-installed).
import sys
from pathlib import Path

encoding = "utf-8"
# map each unwanted character to None so that str.translate removes it
remove_map = str.maketrans("", "", "\u202b\u202c")

if len(sys.argv) < 2:
    print("Usage: python3 fixsubtitles.py [filename]", file=sys.stderr)
    sys.exit(1)

path = Path(sys.argv[1])
data = path.read_text(encoding=encoding)
path.write_text(data.translate(remove_map), encoding=encoding)
print("Done")
You may need to adjust the encoding, as Windows does not always use UTF-8 (the file may be in, for example, "cp1256" - if you get a Unicode error when running the program, try that in place of "utf-8"), and maybe add more characters to the set of characters to be removed - the tool you linked in the question should show you any other such characters. Other than that, the program above should work.
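If more invisible marks show up, a broader (hedged) variation is to strip every character in Unicode category "Cf" (format controls) instead of listing them one by one. Note this also removes legitimate format characters such as zero-width joiners, so inspect the result:
import sys
import unicodedata
from pathlib import Path

path = Path(sys.argv[1])
data = path.read_text(encoding="utf-8")
# drop every character whose Unicode category is "Cf" (format controls,
# which includes U+202B and U+202C)
cleaned = "".join(ch for ch in data if unicodedata.category(ch) != "Cf")
path.write_text(cleaned, encoding="utf-8")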

Python 2.7: detect emoji from text

I'd like to be able to detect emoji in text and look up their names.
I've had no luck using the unicodedata module, and I suspect that I'm not understanding the UTF-8 conventions.
I'd guess that I need to load my doc as UTF-8, then break the Unicode "strings" into Unicode symbols and iterate over these to look them up.
#new example loaded using pandas and encoding UTF-8
'A man tried to get into my car\U0001f648'
type(test) = unicode
import unicodedata as uni
uni.name(test[0])
Out[89]: 'LATIN CAPITAL LETTER A'
uni.name(test[-3])
Out[90]: 'LATIN SMALL LETTER R'
uni.name(test[-1])
ValueError Traceback (most recent call last)
<ipython-input-105-417c561246c2> in <module>()
----> 1 uni.name(test[-1])
ValueError: no such name
# just to be clear
uni.name(u'\U0001f648')
ValueError: no such name
I looked up the Unicode symbol via Google and it's a legitimate symbol.
Perhaps the unicodedata module isn't very comprehensive...?
I'm considering making my own look-up table from here.
Interested in other ideas... this one seems doable.
My problem was using Python 2.7 for the unicodedata module.
Using Conda I created a Python 3.3 environment, and now unicodedata works as expected, so I've given up on all the weird hacks I was working on.
# using python 3.3
import unicodedata as uni
In [2]: uni.name('\U0001f648')
Out[2]: 'SEE-NO-EVIL MONKEY'
Thanks to Mark Ransom for pointing out that I originally had mojibake from not correctly importing my data. Thanks again for your help.
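For anyone stuck on Python 2.7: besides the older Unicode tables in unicodedata, a narrow build stores astral characters such as U+1F648 as surrogate pairs, so indexing the string gives you half a character. A small sketch of the standard UTF-16 recombination:
# Python 2, narrow build: recombine surrogate pairs into real code points
def codepoints(text):
    i = 0
    while i < len(text):
        ch = text[i]
        if u'\ud800' <= ch <= u'\udbff' and i + 1 < len(text):
            # high surrogate + low surrogate -> one astral code point
            low = text[i + 1]
            yield 0x10000 + ((ord(ch) - 0xD800) << 10) + (ord(low) - 0xDC00)
            i += 2
        else:
            yield ord(ch)
            i += 1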
Here's a way to read the emoji list you linked and build a mapping from character to name. It's translated to Python 2, so there might be a glitch or two.
import re
import urllib2

rexp = re.compile(r'U\+([0-9A-Fa-f]+)[^#]*# [^)]*\) *(.*)')
mapping = {}
for line in urllib2.urlopen('ftp://ftp.unicode.org/Public/emoji/1.0/emoji-data.txt'):
    line = line.decode('utf-8')
    m = rexp.match(line)
    if m:
        # use unichr, not chr: emoji are astral code points (Python 2);
        # on a narrow build, unichr may fail for code points above 0xFFFF
        mapping[unichr(int(m.group(1), 16))] = m.group(2)
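Assuming the download works, you can then look up names directly (the exact label depends on the data file):
# hypothetical usage: look up the see-no-evil monkey by code point
print(mapping.get(u'\U0001f648', u'not found'))  # e.g. SEE-NO-EVIL MONKEY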

How to convert escaped Unicode (e.g. \u0432\u0441\u0435) to UTF-8 chars (все) in Notepad++

I have .properties files with a bunch of Unicode-escaped characters. I want to convert them so that the correct characters are displayed.
E.g.:
Currently: \u0432\u0441\u0435 \u0433\u043e\u0442\u043e\u0432\u043e\u005c
Desired result: все готово
Notepad++ is already set to encode in UTF-8 without BOM. Opening the document and converting it (from the Encoding drop-down menu) doesn't do anything.
How do I achieve this with Notepad++?
If not in Notepad++, is there any other way to do this for many files, perhaps by using some script?
You need a plugin named HTML Tag.
Once the plugin is installed, select your text and invoke Plugins > HTML Tag > Decode JS (Ctrl+Shift+J).
I'm not aware of a way to do it natively in Notepad++, but as requested you can script it with Python:
import codecs

# opens a file and converts input to true Unicode
with codecs.open("escaped-unicode.txt", "rb", "unicode_escape") as my_input:
    contents = my_input.read()
    # type(contents) = unicode

# opens a file with UTF-8 encoding
with codecs.open("utf8-out.txt", "wb", "utf8") as my_output:
    my_output.write(contents)
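Since the question mentions many files, here is a hedged extension of the same idea that converts every .properties file in the current directory in place (the glob pattern and the in-place overwrite are assumptions; keep backups):
import codecs
import glob

# convert each escaped .properties file in place
for name in glob.glob("*.properties"):
    with codecs.open(name, "rb", "unicode_escape") as f:
        contents = f.read()
    with codecs.open(name, "wb", "utf8") as f:
        f.write(contents)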

matplotlib pyplot table with non-ascii data?

I'm trying to create a table figure in PNG.
The data in the table contains non-ASCII characters (suppose it's Chinese or something).
I pass Unicode strings (u'hello') to pyplot and it shows the characters as squares.
It may be (slightly) platform dependent, but I prefer to use Unicode strings for Chinese and other languages. The other thing you need to make sure of is that matplotlib can find the necessary font. You can set the font anywhere you need text, except sometimes not with mathtext.
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import matplotlib

# use a font that contains the glyphs (I am on OS X)
zhfont1 = matplotlib.font_manager.FontProperties(fname='/Library/Fonts/Kai.dfont')
s = u'\u54c8\u54c8'  # the Unicode escape for your Chinese characters
plt.text(0.5, 0.5, s, fontproperties=zhfont1, size=50)  # example: plt.text()
Another method is to modify the matplotlibrc file of your matplotlib installation.
Find this line:
#font.sans-serif
and add your font to that line.
Then add the font file to matplotlib's font directory and clear the font cache.
Here are the two relevant paths (on Windows 7):
~\AppData\Local\Enthought\Canopy\User\Lib\site-packages\matplotlib\mpl-data
$HOME.matplotlib\fontList.cache
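A programmatic sketch of the same idea, assuming a CJK-capable font such as 'SimHei' is actually installed (swap in whatever your system has):
import matplotlib
import matplotlib.pyplot as plt

# put a CJK-capable font first in the sans-serif fallback list
# ('SimHei' is an assumption; use a font present on your machine)
matplotlib.rcParams['font.sans-serif'] = ['SimHei'] + matplotlib.rcParams['font.sans-serif']
plt.text(0.5, 0.5, u'\u54c8\u54c8', size=50)
plt.savefig('table.png')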

Which packages for many Unicode characters?

I'm trying to create a LaTeX document with as many Unicode characters as possible. My header is as follows:
\documentclass[12pt,a4paper]{article}
\usepackage[utf8x]{inputenc}
\usepackage{ucs}
\pagestyle{empty}
\usepackage{textcomp}
\usepackage[T1]{fontenc}
\begin{document}
The Unicode characters which follow in the document body are given in the form
\unichar{xyz}
where xyz stands for an integer, e.g. 97 (for "a").
The problem is: for many integers the document does not compile, with an error message such as:
! Package ucs Error: Unknown Unicode character 79297 = U+135C1,
(ucs) possibly declared in uni-309.def.
(ucs) Type H to see if it is available with options.
Which packages should I add to the header to make the file compile with as many characters as possible?
As far as I remember, the utf8x option is merely a hack to allow for some Unicode "support": basically a giant lookup table translating individual character sequences to LaTeX's expectations.
You should really use Xe(La)TeX for such things, as it was designed with Unicode in mind. Old-style TeX still suffers from its 1970s heritage in this respect.
ETA: From the package's documentation:
This bundle provides the ucs package, and utf8x.def, together with a large number of support files.
The utf8x.def definition file for use with inputenc covers a wider range of Unicode characters than does utf8.def in the LaTeX distribution. The ucs package provides facilities for efficient use of large sets of Unicode characters.
Since you're already using both packages, I guess you're out of luck with plain LaTeX.
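For illustration, a minimal XeLaTeX sketch, assuming a font on your system actually contains the glyphs you need (with XeLaTeX, font coverage rather than packages becomes the limiting factor):
% compile with xelatex, not latex/pdflatex
\documentclass[12pt,a4paper]{article}
\usepackage{fontspec}
\setmainfont{DejaVu Serif} % any installed OpenType font with wide coverage
\begin{document}
\char"135C1 % XeTeX accepts Unicode code points directly
\end{document}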
You can manually edit the .def files, similarly to this: https://bugzilla.redhat.com/show_bug.cgi?id=418981
For example, as far as I know this is the only way to enable LaTeX for the Latvian language.
Just find your symbol (on a Mac, use "locate uni-1.def" to find where the package is located - to enable the locate command see http://osxdaily.com/2011/11/02/enable-and-use-the-locate-command-in-the-mac-os-x-terminal/).