Unicode vector over a character string

I'm using Python 3.5 and PyQt5, and I need to display a character with a vector arrow above it.
I know I have to use a Unicode code point, and I tried the following statement:
myLabel = QLabel(b"\U+20D6".encode('utf-16','ignore')
Nothing worked. It fails with every encoding I tried (utf-8, utf-16, etc.).
My goal is to put an arrow above a character; according to a tutorial I found on the web, I have to use the Unicode code point b"\U+20D6".
Do you know the right way to do this?
Thanks in advance.
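One possible approach, as a minimal sketch: in Python 3 a str is already Unicode, so no b prefix and no .encode() call should be needed, and the escape syntax is "\u20D6" rather than "\U+20D6". A combining mark is placed after the character it decorates. The helper name with_vector_arrow is hypothetical, the choice of U+20D7 (the right-pointing arrow) instead of the question's U+20D6 (left-pointing) is an assumption about what is wanted, and whether the arrow renders well depends on the font used by the widget.

```python
import unicodedata

# U+20D7 is COMBINING RIGHT ARROW ABOVE; U+20D6 (used in the question) is the
# left-pointing variant. A combining mark follows the character it decorates.
RIGHT_ARROW_ABOVE = "\u20D7"


def with_vector_arrow(base):
    """Return `base` followed by the combining right arrow above (hypothetical helper)."""
    return base + RIGHT_ARROW_ABOVE


label_text = with_vector_arrow("v")
print(label_text)
# Show which code points the string actually contains.
print([unicodedata.name(ch) for ch in label_text])
```

The resulting str could then be passed directly as the label text, for example QLabel(label_text), assuming the label's font has a glyph for the combining mark.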

Related

Displaying Unicode Characters with Arduino

I am currently using the Keyboard.h library on Arduino.
I would like to type the following characters when a button on my breadboard is pressed: ♥ ♦ ♣ ♠
I don't know much about ASCII, Unicode, or hexadecimal, so I'm having a hard time figuring this out.
Does someone know how to do it?
Thanks.

See my answer at
https://arduino.stackexchange.com/a/91365/70109
for how to convert a Unicode character to octal escapes for output.
The GCC compiler used by Arduino also does not accept all Unicode escape sequences, such as \u0020.
Writing the character's UTF-8 bytes as octal escapes avoids this problem:
Serial.print("\342\204\211");
will output ℉, provided the receiver has a font with that character.
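The octal string above is simply the UTF-8 encoding of the character, one escape per byte. A small Python sketch for producing such strings on a desktop machine follows; the function name to_octal_escapes is hypothetical and not part of any Arduino tooling.

```python
def to_octal_escapes(text):
    """Return the UTF-8 bytes of `text` as C-style three-digit octal escapes."""
    return "".join("\\{:03o}".format(byte) for byte in text.encode("utf-8"))


# U+2109 is DEGREE FAHRENHEIT, the character used in the answer above.
print(to_octal_escapes("\u2109"))
```

Note that this sketch escapes every byte, so plain ASCII letters come out as octal escapes too; the output can be pasted into a C or C++ string literal.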

Try Keyboard.print("\uUNICODE_VALUE");
Unicode values can be found at: http://www.unicode.org/charts/
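If that doesn't work, on Linux you can press Ctrl+Shift+U, type the hexadecimal code point, and press Enter, like this:
#include <Keyboard.h>
// Types one Unicode character through the Linux Ctrl+Shift+U input method.
// val is the code point; time is the delay in milliseconds between steps.
void typeUnicode(int val, int time){
  Keyboard.press(KEY_LEFT_CTRL);
  Keyboard.press(KEY_LEFT_SHIFT);
  Keyboard.press('u');
  delay(time);
  Keyboard.releaseAll();
  delay(time);
  Keyboard.println(String(val, HEX));  // hex digits, then Enter to confirm
  delay(time);
}
On Windows you have "Alt codes", and I'm not sure how they work since I'm a Unix geek.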
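To find the hexadecimal values to pass to a function like typeUnicode, the code points of the four suit symbols from the question can be looked up with a few lines of Python. This is only a lookup sketch run on a desktop machine, not Arduino code, and hex_code_point is a hypothetical helper name.

```python
SUITS = "♥♦♣♠"


def hex_code_point(ch):
    """Return the code point of a single character as uppercase hex digits."""
    return format(ord(ch), "X")


for ch in SUITS:
    print(ch, "U+" + hex_code_point(ch))
```

Each value printed after "U+" is the number that would be written as a hex literal (for example 0x2665 for ♥) in the sketch.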

Are there any character sets that don't respect ASCII?

As far as I understand, a character encoding maps bits to integers and a character set maps integers to characters.
So in the Unicode character set there is a telephone character. It is represented using the integer 9742, more commonly represented using Hexadecimal as 260E. This is then saved to a file using UTF-8 which translates the integer 9742 into 10011000001110. Please correct me if I am wrong.
Yesterday I created a text file that used the Unicode character set and UTF-8 encoding and saved it to my desktop. I then reopened the file in my text editor and started to manually switch the character sets for fun. Unsurprisingly, there were problems and odd characters started displaying! I noticed that only some of the characters were misrepresented, though. This got me thinking: why do only some of the characters break? Why not all?
Someone told me that the characters that break are those outside the original ASCII specification. Upon reflection this seemed to make sense, as it's only non-US characters that break. I was told that because all character sets agree with ASCII for the first 128 characters, those characters remain unbroken, and that it's the characters above 127 that break. Please correct me if I am wrong.
Finally, I got thinking. Are there any character sets that don't respect ASCII? If so, what are they called and what are they used for?
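Since the question asks for corrections, here is a short Python sketch of what UTF-8 does with the telephone character. The bit string 10011000001110 quoted above is the plain binary form of the code point 9742; UTF-8 does not store that directly but spreads those bits over three bytes. The sample text and the choice of Windows-1252 as the "wrong" character set are assumptions made for illustration.

```python
phone = "\u260E"  # BLACK TELEPHONE, code point 9742

# The code point itself, in decimal, hex and binary.
print(ord(phone), hex(ord(phone)), bin(ord(phone)))

# The bytes UTF-8 actually writes to the file, shown bit by bit.
utf8 = phone.encode("utf-8")
print(" ".join(format(b, "08b") for b in utf8))

# Reading UTF-8 bytes with a different ASCII-compatible character set:
# the ASCII letters survive, only the non-ASCII character breaks.
sample = "call \u260E now".encode("utf-8")
print(sample.decode("cp1252"))
```

The last line shows the effect described in the question: bytes below 128 mean the same thing in both encodings, so only the three bytes of the telephone character are shown as odd characters.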

Based on my findings from the comments I am able to answer my own question. Thank you to everyone who commented!
Yes, there are a couple: EBCDIC and Baudot.
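A quick way to see that EBCDIC does not follow ASCII is to compare the bytes produced for plain letters and digits. The sketch below uses Python's cp037 codec, which is one EBCDIC variant; picking that variant is an assumption, and Python ships no Baudot codec, so Baudot is not shown.

```python
text = "A0a"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # cp037 is one EBCDIC code page

# Print both encodings as hex so the difference is visible.
print("ASCII :", ascii_bytes.hex())
print("EBCDIC:", ebcdic_bytes.hex())
```

Even the letter A gets a different byte value, so a file written in EBCDIC looks broken everywhere when read as ASCII, not only in the characters above 127.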

In Corona SDK how to reverse a unicode string?

I know that Lua does not fully support Unicode, but is there a workaround for this problem?
string.reverse does not work with Unicode, so the following example does not work:
print(string.reverse("أحمد"))
Any help with that?

Corona SDK seems to use UTF-8 as its encoding.
If you want to reverse all Unicode code points in a string, instead of all bytes, you can use this code:
function utf8reverse(str)
  -- Reverse the bytes inside each multibyte sequence, then reverse the whole string.
  return str:gsub("([\194-\244][\128-\191]+)", string.reverse):reverse()
end
print(utf8reverse("أحمد"))
The trick is as follows: in UTF-8, a multibyte code point always starts with a byte of the form 11xx xxxx, followed by one or more bytes of the form 10xx xxxx. The first step reverses the bytes within each multibyte code point; the second step reverses all bytes of the string, which restores the byte order inside each code point while reversing the order of the code points.
Note: when a Unicode character is composed of several code points, this simple trick will not work. Full support would require a large Unicode database.
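The same two-step trick can be sketched in Python on raw UTF-8 bytes, which makes it easy to check against Python's own code point reversal. This is an illustration of the technique only, not Corona SDK code; utf8_reverse is a hypothetical name, and the input is assumed to be valid UTF-8.

```python
import re

# A lead byte (0xC2-0xF4) followed by its continuation bytes (0x80-0xBF).
_MULTIBYTE = re.compile(rb"[\xC2-\xF4][\x80-\xBF]+")


def utf8_reverse(text):
    """Reverse the code points of `text` using the byte-level trick."""
    data = text.encode("utf-8")
    # Step 1: reverse the bytes inside each multibyte sequence.
    step1 = _MULTIBYTE.sub(lambda m: m.group(0)[::-1], data)
    # Step 2: reverse every byte; each sequence is now back in order.
    return step1[::-1].decode("utf-8")


word = "أحمد"
print(utf8_reverse(word) == word[::-1])

# The limitation from the note: a combining accent ends up on another letter.
print(utf8_reverse("e\u0301x"))
```

The second print shows the caveat from the note: the combining accent that belonged to "e" now follows "x", because the trick works on code points rather than on user-visible characters.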

What's the ASCII character code for '—'?

I am working on decoding text and trying to find the ASCII character code for the — character (not to be mistaken for -). I have not been able to find it. Does anybody know how to convert it?

Quotation from Wikipedia (Em dash):
When an actual em dash is unavailable—as in the ASCII character set—a double ("--") or triple hyphen-minus ("---") is used. In Unicode, the em dash is U+2014 (decimal 8212).
The em dash is not part of the ASCII character set.

— is known as an em dash. Its code point is U+2014 (\u2014 as an escape sequence). It is not an ASCII character, so you cannot decode it with the ASCII character set because it is not in the ASCII character table. You would probably want to use UTF-8 instead.
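These points can be checked with a few lines of Python, shown here as a minimal sketch: the code point is 8212, encoding to ASCII fails, and UTF-8 represents the character as three bytes that all lie above 127.

```python
em_dash = "\u2014"

# The code point in decimal and in the usual U+hex notation.
print(ord(em_dash), "U+{:04X}".format(ord(em_dash)))

# ASCII has no slot for this character, so encoding raises an error.
try:
    em_dash.encode("ascii")
except UnicodeEncodeError as err:
    print("not ASCII:", err.reason)

# UTF-8 uses three bytes, none of them in the ASCII range 0-127.
utf8 = em_dash.encode("utf-8")
print(utf8, all(b > 127 for b in utf8))
```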

Windows
On a keyboard with a numeric keypad, hold Alt and type 0150 (en dash), 0151 (em dash), or 8722 (minus sign) on the keypad.
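The numbers 0150 and 0151 are not ASCII codes either. Assuming a Western Windows setup, where Alt codes with a leading zero are interpreted in the Windows-1252 code page, they can be mapped back to Unicode as in this sketch. The helper alt_code_to_char is hypothetical, and 8722 is not covered because it does not fit in a single byte.

```python
import unicodedata


def alt_code_to_char(code, codepage="cp1252"):
    """Decode a one-byte Alt code using the given code page (assumed cp1252)."""
    return bytes([code]).decode(codepage)


for code in (150, 151):
    ch = alt_code_to_char(code)
    print("Alt+0{}".format(code), "U+{:04X}".format(ord(ch)), unicodedata.name(ch))
```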

This character does not exist in ASCII, only in Unicode, which is usually encoded as UTF-8.
In UTF-8, non-ASCII characters are encoded as sequences of two to four bytes, none of which is a valid ASCII code: every byte in such a sequence lies outside the ASCII range of 0 through 127.
One suspects that the foregoing only partly answers your question, but if so then this is probably because your question is, inadvertently, only partly asked. For further details, you can extend your question with more specifics.

The character — is not part of the ASCII set.
But if you want to convert it to some other format (like U+hex), you can use this online tool. Put your character into the first green box and click "Convert" (above the box). Further below you'll find a number of different codes, including U+hex:
U+2014
Feel free to edit this answer if the link breaks or leave a comment so I can find a replacement.

Alt + 0151 seems to do the trick—perhaps it doesn't work on all keyboards.

alt-196 - while holding down the 'Alt' key, type 196 on the numeric keypad, then release the 'Alt' key

Working out file encoding: I know the string, know the character, what is the encoding?

I'm adding data from a csv file into a database. If I open the CSV file, some of the entries contain bullet points - I can see them. file says it is encoded as ISO-8859.
$ file data_clean.csv
data_clean.csv: ISO-8859 English text, with very long lines, with CRLF, LF line terminators
I read it in as follows and convert it from ISO-8859-1 to UTF-8, which my database requires.
row = [unicode(x.decode("ISO-8859-1").strip()) for x in row]
print row[4]
description = row[4].encode("UTF-8")
print description
This gives me the following:
'\xa5 Research and insight \n\xa5 Media and communications'
¥ Research and insight
¥ Media and communications
Why is the \xa5 bullet character converting as a yen symbol?
I assume because I'm reading it in as the wrong encoding, but what is the right encoding in this case? It isn't cp1252 either.
More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?

I don't know of any general tool, but this Wikipedia page (linked from the page on codepage 1252) shows that A5 is a bullet point in the Mac OS Roman codepage.
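The difference is easy to confirm in Python 3, as a small sketch using part of the string from the question: the byte 0xA5 decodes to a yen sign under ISO-8859-1 and cp1252, and to a bullet under Mac OS Roman.

```python
raw = b"\xa5 Research and insight"

# The same bytes decoded with three candidate encodings.
for codec in ("latin-1", "cp1252", "mac_roman"):
    print(codec, "->", raw.decode(codec))

# Decoding as Mac OS Roman and re-encoding gives the UTF-8 the database needs.
utf8 = raw.decode("mac_roman").encode("utf-8")
print(utf8)
```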

More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?
You can easily write one in Python.
(Examples use 3.x syntax.)
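import encodings.aliases
# Every codec name Python has an alias for; 'mbcs' and 'tactis' are skipped.
ENCODINGS = set(encodings.aliases.aliases.values()) - {'mbcs', 'tactis'}
def _decode(data, encoding):
    # Return the decoded text, or None if this codec cannot decode the data.
    try:
        return data.decode(encoding)
    except (UnicodeError, LookupError):
        # LookupError: the codec is unavailable or is not a text encoding.
        return None
def possible_encodings(encoded, decoded):
    # Collect every codec that maps the bytes to the expected text.
    return {enc for enc in ENCODINGS if _decode(encoded, enc) == decoded}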
So if you know that your bullet point is U+2022, then
>>> possible_encodings(b'\xA5', '\u2022')
{'mac_iceland', 'mac_roman', 'mac_turkish', 'mac_latin2', 'mac_cyrillic'}

You could try
iconv -f latin1 -t utf8 data_clean.csv
if you know it is indeed ISO Latin-1,
although in ISO Latin-1 \xA5 is indeed ¥.
Edit: Actually, this seems to be a known problem on the Mac when using Word or similar with Arial (?) and printing or converting to PDF; there are some font-related issues. Maybe you need to explicitly massage the file first. Sound familiar?
http://forums.quark.com/p/14849/61253.aspx
http://www.macosxhints.com/article.php?story=2003090403110643
