I'm trying to represent the copyright symbol © in Python.
If I type © into the Python interactive terminal I get '\xc2\xa9'. Those are the bytes 0xC2 (194) and 0xA9 (169).
But if I look up the copyright symbol in the unicode table it's only 169.
Python interactive terminal:
>>> ord(u"©")
169
>>> '\xa9' == "©"
False
>>> '\xc2\xa9' == "©"
True
I don't really get why the two bytes 0xC2 0xA9 together give the copyright symbol instead of just 0xA9.
Your terminal supports UTF-8 encoding, and you are likely using Python 2:
>>> import sys
>>> sys.stdout.encoding
'utf-8'
>>> '©'
'\xc2\xa9'
>>> u'©'
u'\xa9'
Python 2 uses byte strings by default, and characters typed at the prompt arrive encoded in the terminal's encoding. Use a Unicode string (the u'' prefix) to get the Unicode code point.
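In current Python 3 the distinction is explicit: str holds code points and bytes holds the encoded form. A minimal sketch of the same relationship:

```python
# U+00A9 is the code point of the copyright sign.
copyright_sign = "\u00a9"
assert ord(copyright_sign) == 169  # 0xA9

# Encoding to UTF-8 produces the two-byte sequence C2 A9, which is
# exactly what a UTF-8 terminal shows for a Python 2 byte string.
encoded = copyright_sign.encode("utf-8")
assert encoded == b"\xc2\xa9"

# Decoding the bytes recovers the single character.
assert encoded.decode("utf-8") == "©"
```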
I'm trying to migrate data from a Firebird DB to MS SQL Server using fdb (2.0.1) and pyodbc. Since there are blobs in the Firebird database which are over 64K, they are being returned as BlobReader objects, and I would rather not deal with the bytes myself when writing them with pyodbc. The docs say that you can turn off the 64K threshold by passing -1 to cursor.set_stream_blob_treshold. However, that doesn't seem to work, since fdb.fbcore.ProgrammingError is thrown...
https://fdb.readthedocs.io/en/v2.0/reference.html#fdb.Cursor.set_stream_blob_treshold
Here is how I call the function:
import fdb

class Firebird:
    def __init__(self, db_name: str):
        self.__fb_conn = fdb.connect(database=db_name, user='someuser', password='somepass', charset='ISO8859_1')
        self.__fb_cursor = self.__fb_conn.cursor()
        # change the blob safety threshold to unlimited for troubleshooting
        self.__fb_cursor.set_stream_blob_treshold(-1)  # doesn't work :(
Here is a stack trace for the error:
(.venv) >python3.8.exe -i
Python 3.8.5 (tags/v3.8.5:580fbb0, Jul 20 2020, 15:57:54) [MSC v.1924 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from commonlibs import Firebird
>>>
>>> fb = Firebird('somedb.fdb')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\user1\dev\commonlibs\Firebird.py", line 13, in __init__
self.__fb_cursor.set_stream_blob_treshold(int(-1)) #doesn't work :(
File "C:\Users\user1\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\fdb\fbcore.py", line 3930, in set_stream_blob_treshold
raise ProgrammingError
fdb.fbcore.ProgrammingError
Per Mark's comment:
I don't know much about the data source and what sort of blobs it contains. It was one of those situations where the guy from the other team said: "Hey, here is some data from this partner, let's see what's inside."
When I tried passing the obj.read() value to pyodbc for the BlobReader objects, it did insert some of the blobs. However, with a lot of them pyodbc would report this error:
pyodbc.Error: ('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Warning: Partial insert/update. The insert/update of a text or image column(s) did not succeed. (0) (SQLPutData); [HY000] [Microsoft][ODBC SQL Server Driver][SQL Server]The text, ntext, or image pointer value conflicts with the column name specified. (7125)')
I was kind of hoping I could avoid all this pyodbc and .read() stuff by setting that threshold, but I wonder if the pyodbc error would show up regardless...
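While the threshold call fails, one workaround is to normalise each fetched value before handing it to pyodbc: a BlobReader is file-like, so reading it yields plain bytes. A hedged sketch (materialize is a hypothetical helper, not part of fdb or pyodbc, and it relies only on duck typing):

```python
def materialize(value):
    """Return plain bytes for file-like values (e.g. an fdb BlobReader)
    and pass every other value through unchanged."""
    if hasattr(value, "read"):
        return value.read()
    return value

# Usage sketch: normalise a fetched row before inserting with pyodbc.
# row = [materialize(v) for v in fb_cursor.fetchone()]
```

Whether this sidesteps the SQLPutData error depends on the target column types, so it is only a starting point for troubleshooting.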
I'm trying to find a way to learn the encoding of multi-byte characters (mostly Chinese) that I encounter. For example, the code point of '好' ("hǎo", good) is
597d. But entering:
echo 好|od -t x1
in Linux Mint gives a result of:
0000000 e5 a5 bd 0a
0000004
What is the rule for translating "e5 a5 bd 0a" to "597d" ?
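The rule is UTF-8's bit packing: a code point in the range U+0800 to U+FFFF is spread over three bytes of the form 1110xxxx 10xxxxxx 10xxxxxx, and the trailing 0a is just the newline that echo appends. A sketch of both directions in Python:

```python
# e5 a5 bd is the UTF-8 encoding of U+597D; 0a is echo's trailing newline.
data = bytes([0xE5, 0xA5, 0xBD])
assert data.decode("utf-8") == "好"

# A 3-byte UTF-8 sequence packs the code point as
# 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 payload bits.
code_point = ((0xE5 & 0x0F) << 12) | ((0xA5 & 0x3F) << 6) | (0xBD & 0x3F)
assert code_point == 0x597D
assert chr(code_point) == "好"
```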
I'm trying to get an Erlang function to execute a bash command containing unicode characters. For example, I want to execute the equivalent of:
touch /home/jani/ჟანიweł
I put that command in variable D; printing it shows the correct text:
io:fwrite("~ts", [list_to_binary(D)]).
touch /home/jani/ჟანიweł
ok
but after I execute:
os:cmd(D)
I get a file called á??á??á??á??weÅ?. How can I fix it?
os:cmd(binary_to_list(unicode:characters_to_binary("touch /home/jani/编程"))).
Executing this command creates a file named ��, while executing the equivalent touch command directly in a terminal creates the file with the correct name.
It's because Erlang reads your source files as Latin-1 by default, but on newer versions of Erlang you can declare that a file uses Unicode:
%% coding: utf-8
-module(test).
-compile(export_all).

test() ->
    COMMAND = "touch ჟანიweł",
    os:cmd(COMMAND).
and then compiling and executing the module works fine
rorra-air:~ > erl
Erlang/OTP 17 [erts-6.4] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]
Eshell V6.4 (abort with ^G)
1> c(test).
{ok,test}
2> test:test().
[]
and it created the file on my filesystem
rorra-air:~ > ls -lta
total 144
-rw-r--r-- 1 rorra staff 0 Jun 9 15:18 ჟანიweł
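The garbled name in the question (á??á??...) is consistent with the file name's UTF-8 bytes being reinterpreted one byte per character, as Latin-1. A quick illustration of that mechanism in Python (assuming that interpretation of the mojibake):

```python
# 'ჟ' (U+10DF) encodes to three UTF-8 bytes.
utf8_bytes = "ჟ".encode("utf-8")
assert utf8_bytes == b"\xe1\x83\x9f"

# Read back one byte per character (Latin-1): the lead byte 0xE1
# becomes 'á' and the two continuation bytes become junk characters.
mangled = utf8_bytes.decode("latin-1")
assert mangled[0] == "á"
assert len(mangled) == 3  # three "characters" instead of one
```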
I need a good tool to detect the encoding of strings using some kind of mapping or heuristic method.
For example String: áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì
Expected: сохранив много приложений Java, можно занять всю доступную память
The original encoding is "ISO8859-5". When I try to detect it with the libraries below, the result is "UTF-8". It is obvious that the string was saved in UTF-8, but is there any heuristic way, using a symbol mapping, to analyse the characters and match them with the correct encoding?
The usual encoding-detection libraries I have tried:
- enca (aptitude install enca)
- chardet (aptitude install chardet)
- uchardet (aptitude install uchardet)
- http://tika.apache.org/
- http://npmjs.com/package/detect-encoding
- libencode-detect-perl
- http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
- http://jchardet.sourceforge.net/
- http://grepcode.com/snapshot/repo1.maven.org/maven2/com.googlecode.juniversalchardet/juniversalchardet/1.0.3/
- http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/
- http://userguide.icu-project.org/
- http://site.icu-project.org
You need to unwrap the UTF-8 encoding and then pass it to a character-encoding detection library.
If random 8-bit data is encoded into UTF-8 (assuming an identity mapping, i.e. a C4 byte is assumed to represent U+00C4, as is the case with ISO-8859-1 and its superset Windows 1252), you end up with something like
Source: 8F 0A 20 FE 65
Result: C2 8F 0A 20 C3 BE 65
(because the UTF-8 encoding of U+008F is C2 8F, and U+00FE is C3 BE). You need to revert this encoding in order to obtain the source string, so that you can then identify its character encoding.
In Python, something like
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import chardet
mystery = u'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì'
print(chardet.detect(mystery.encode('cp1252')))
Result:
{'confidence': 0.99, 'encoding': 'ISO-8859-5'}
On the Unix command line,
vnix$ echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
> iconv -t cp1252 | chardet
<stdin>: ISO-8859-5 (confidence: 0.99)
or iconv -t cp1252 file | chardet to decode a file and pass it to chardet.
(For this to work successfully at the command line, you need to have your environment properly set up for transparent Unicode handling. I am assuming that your shell, your terminal, and your locale are adequately configured. Try a recent Ubuntu Live CD or something if your regular environment is stuck in the 20th century.)
In the general case, you cannot know that the incorrectly applied encoding is CP 1252 but in practice, I guess it's going to be correct (as in, yield correct results for this scenario) most of the time. In the worst case, you would have to loop over all available legacy 8-bit encodings and try them all, then look at the one(s) with the highest confidence rating from chardet. Then, the example above will be more complex, too -- the mapping from legacy 8-bit data to UTF-8 will no longer be a simple identity mapping, but rather involve a translation table as well (for example, a byte F5 might correspond arbitrarily to U+0092 or whatever).
(Incidentally, iconv -l spits out a long list of aliases, so you will get a lot of fundamentally identical results if you use that as your input. But here is a quick ad-hoc attempt at fixing your slightly weird Perl script.
#!/bin/sh
iconv -l |
grep -F -v -e UTF -e EUC -e 2022 -e ISO646 -e GB2312 -e 5601 |
while read enc; do
echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
iconv -f utf-8 -t "${enc%//}" 2>/dev/null |
chardet | sed "s%^[^:]*%${enc%//}%"
done |
grep -Fwive ascii -e utf -e euc -e 2022 -e None |
sort -k4rn
The output still contains a lot of chaff, but once you remove that, the verdict is straightforward.
It makes no sense to try any multi-byte encodings such as UTF-16, ISO-2022, GB2312, EUC_KR etc in this scenario. If you convert a string into one of these successfully, then the result will most definitely be in that encoding. This is outside the scope of the problem outlined above: a string converted from an 8-bit encoding into UTF-8 using the wrong translation table.
The ones which returned ascii definitely did something wrong; most of them will have received an empty input, because iconv failed with an error. In a Python script, error handling would be more straightforward.)
The string
сохранив много приложений Java, можно занять всю доступную память
is encoded in ISO8859-5 as bytes
E1 DE E5 E0 D0 DD D8 D2 20 DC DD DE D3 DE 20 DF E0 D8 DB DE D6 D5 DD D8 D9 20 4A 61 76 61 2C 20 DC DE D6 DD DE 20 D7 D0 DD EF E2 EC 20 D2 E1 EE 20 D4 DE E1 E2 E3 DF DD E3 EE 20 DF D0 DC EF E2 EC
The string
áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì
is encoded in ISO-8859-1 as bytes
E1 DE E5 E0 D0 DD D8 D2 20 DC DD DE D3 DE 20 DF E0 D8 DB DE D6 D5 DD D8 D9 20 4A 61 76 61 2C 20 DC DE D6 DD DE 20 D7 D0 DD EF E2 EC 20 D2 E1 EE 20 D4 DE E1 E2 E3 DF DD E3 EE 20 DF D0 DC EF E2 EC
Look familiar? They are the same bytes, just interpreted differently by different charsets.
Any tool that would look at these bytes would not be able to tell you the charset automatically, as they are perfectly valid bytes in both charsets. You would have to tell the tool which charset to use when interpreting the bytes.
Any tool that tells you this particular byte sequence is encoded as UTF-8 is wrong. These are NOT valid UTF-8 bytes.
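The byte identity above is easy to verify: re-encoding the mojibake with Latin-1 (which works here because every character maps to a single byte in the 0xD0-0xEF range) recovers the original bytes, which then decode cleanly under ISO-8859-5 but are rejected as UTF-8. A sketch using only the first word:

```python
mojibake = "áÞåàÐÝØÒ"   # first word, as misread through Latin-1
original = "сохранив"    # the intended ISO-8859-5 text

# Latin-1 maps each character back to the single byte it came from.
raw = mojibake.encode("latin-1")
assert raw == bytes([0xE1, 0xDE, 0xE5, 0xE0, 0xD0, 0xDD, 0xD8, 0xD2])

# The same bytes decode correctly under ISO-8859-5...
assert raw.decode("iso-8859-5") == original

# ...but are not valid UTF-8 (0xE1 0xDE is a malformed sequence).
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
assert valid_utf8 is False
```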
I have a simple POD text file:
$ cat test.pod
=encoding UTF-8
Münster
It is encoded in UTF-8, as per this literal hex dump of the file:
00000000 3d 65 6e 63 6f 64 69 6e 67 20 55 54 46 2d 38 0a |=encoding UTF-8.|
00000010 0a 4d c3 bc 6e 73 74 65 72 0a |.M..nster.|
0000001a
The "ü" is being encoded as the two bytes C3 and BC.
But when I run perldoc on the file it is turning my lovely formatted UTF-8 characters into ASCII.
What's more, it is correctly handling the German language convention of representing "ü" as "ue".
$ perldoc test.pod | cat
TEST(1) User Contributed Perl Documentation TEST(1)
Muenster
perl v5.16.3 2014-06-10 TEST(1)
Why is it doing this?
Is there an additional declaration I can put into my file to stop it from happening?
After additional investigation with App::perlbrew, I've found the difference comes from having a particular version of Pod::Perldoc:
perl version   Pod::Perldoc   output
perl-5.10.1    3.14_04        Muenster
perl-5.12.5    3.15_02        Muenster
perl-5.14.4    3.15_04        Muenster
perl-5.16.2    3.17           Münster
perl-5.16.3    3.19           Muenster
perl-5.16.3    3.17           Münster
perl-5.17.3    3.17           Münster
perl-5.18.0    3.19           Muenster
perl-5.18.1    3.23           Münster
However I would still like, if possible, a way to make Pod::Perldoc 3.14, 3.15, and 3.19 behave "correctly".
I found this RT ticket: http://rt.cpan.org/Public/Bug/Display.html?id=39000
This "bug" seems to have been introduced with Perl 5.10, and perhaps was solved in later versions.
Also see: How can I use Unicode characters in Perl POD-derived man pages? and incorrect behaviour of perldoc with UTF-8 texts.
You should add the latest available version of Pod::Perldoc as a dependency.