Today I ordered translations into 7 different languages. 4 of them appear to be great, but when I opened the other 3, namely Greek, Russian, and Korean, the text wasn't recognizable as any language at all. It looked like a bunch of garbage characters, the kind you get when a file is opened with the wrong encoding.
For instance, here is part of the output of the Korean translation:
½Ì±ÛÇ÷¹À̾î
¸ÖƼÇ÷¹À̾î
¿É¼Ç
I don't speak a word of Korean, but I can tell you with certainty that this is not Korean.
I assume this is a file encoding issue, and when I open the file in Notepad, the encoding is listed as ANSI, which is clearly a problem; the same can be said for the other two languages.
Does anyone have any ideas on how to fix the encoding of these 3 files? I have asked the translators to re-upload in UTF-8, but in the meantime I thought I might try to fix it myself.
If anyone is interested in seeing the actual files, you can get them from my Dropbox.
If you look at the byte stream as pairs of bytes, they look vaguely Korean, but I cannot tell if they are what you would expect or not.
bash$ python3.4
Python 3.4.3 (v3.4.3:b4cbecbc0781, May 30 2015, 15:45:01)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> buf = '½Ì±ÛÇ÷¹À̾î'
>>> [hex(ord(b)) for b in buf]
['0xbd', '0xcc', '0xb1', '0xdb', '0xc7', '0xc3', '0xb7', '0xb9', '0xc0', '0xcc', '0xbe', '0xee']
>>> u'\uBDCC\uB1DB\uC7C3\uB7B9\uC0CC\uBEEE'
'뷌뇛쟃랹샌뻮'
Your best bet is to wait for the translators to upload UTF-8 versions, or have them tell you the encoding of the file. I wouldn't assume that the bytes are simply 16-bit characters.
Update
I passed this through the chardet module and it detected the character set as EUC-KR.
>>> import chardet
>>> chardet.detect(b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE')
{'confidence': 0.833333333333334, 'encoding': 'EUC-KR'}
>>> b'\xBD\xCC\xB1\xDB\xC7\xC3\xB7\xB9\xC0\xCC\xBE\xEE'.decode('EUC-KR')
'싱글플레이어'
According to Google translate, the first line is "Single Player". Try opening it with Notepad and using EUC-KR as the encoding.
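If you want to repair the file yourself in the meantime, re-encoding it takes only a few lines of Python 3. This is a minimal sketch, assuming the Korean file really is EUC-KR; the filenames are placeholders, not the actual files, and the Greek and Russian files would each need their own source encoding:
# Sketch: re-encode an EUC-KR file as UTF-8.
# 'korean.txt' / 'korean-utf8.txt' are placeholder names.
with open('korean.txt', 'rb') as f:
    text = f.read().decode('euc-kr')

with open('korean-utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)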
Related
I have this emoji '🍻' (U+1F37B). I have its plaintext when encoded into CP1258, which is "ðŸ»". I found this question about how to change the encoding properly. I tried to follow the same exact procedure for "ðŸ»", but it just crashed the program. Is there any way to reliably do this without just creating a dictionary with "ðŸ»" as the key and '🍻' as the value? Examples in Java, Python, C# or JavaScript appreciated :)
You're wrong, it's not displaying in CP1258, it's CP1252. The bytes you're getting are UTF-8 encoded, and one of them can't be displayed - there are 4 bytes, not 3. Here's some Python:
>>> '\U0001f37b'.encode('utf-8')
b'\xf0\x9f\x8d\xbb'
>>> '\U0001f37b'.encode('utf-8').decode('cp1258','ignore')
'đŸ»'
>>> '\U0001f37b'.encode('utf-8').decode('cp1252','ignore')
'ðŸ»'
Recovering the emoji character is simply a matter of decoding those bytes again:
>>> '\U0001f37b'.encode('utf-8').decode('utf-8','ignore')
'\U0001f37b'
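Recovering it from the displayed mojibake string is trickier: the 0x8D byte has no mapping in CP1252 and gets dropped, so "ðŸ»" alone no longer holds all four bytes. As a hedged sketch, if the bytes had instead been mis-decoded with an encoding that maps all 256 byte values, such as latin-1, the round trip would be lossless:
# Sketch: a lossless mojibake round trip, assuming latin-1 (rather than
# cp1252/cp1258) was the encoding used to mis-read the UTF-8 bytes.
mojibake = '\U0001f37b'.encode('utf-8').decode('latin-1')   # 'ð\x9f\x8d»'
recovered = mojibake.encode('latin-1').decode('utf-8')
assert recovered == '\U0001f37b'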
For a current project, I use a number of CSV files that are saved in UTF-8. The motivation for this encoding is that they contain information in German with the special characters ä, ö, ü, ß. My team is working with Stata 13 on Mac OS X and Windows 7 (the software is frequently updated).
When we import the CSV file in Stata (choosing Latin-1 on import), the special characters are correctly displayed on both operating systems. However, when we export the dataset to another CSV file on Mac OS X - which we need to do quite often in our setup - the special characters are replaced, e.g. ä -> Š, ü -> Ÿ etc. On Windows, exporting works like a charm and special characters are not replaced.
Troubleshooting: Stata 13 cannot interpret Unicode. I have tried converting the UTF-8 files to Windows-1252 and Latin-1 (ISO 8859-1) encoding (since, after all, all they contain are German characters) using Sublime Text 2 prior to importing them into Stata. However, the same problem remains on Mac OS X.
Yesterday, Stata 14 was announced, which apparently can deal with Unicode. If that is the reason, it would probably help with my problem; however, we will not be able to upgrade soon. Apart from that, I am wondering why the problem arises on Mac but not on Windows? Can anyone help? Thank you.
[EDIT Start] When I import the exported CSV file again using a "Mac Roman" text encoding (Stata allows you to specify that in the import dialogue), my German special characters appear again. Apparently I am not the only one encountering this problem, by the looks of this thread. However, because I need to work with the exported CSV files, I still need a solution to this problem. [EDIT End]
[EDIT2 Start] One example is the word "Bösdorf" that is changed to "Bšsdorf". In the original file the hex code is 42c3 b673 646f 7266, whereas the hex code in the exported file is 42c5 a173 646f 7266. [EDIT2 End]
Until the bug gets fixed, you can work around this with
iconv -f utf-8 -t cp1252 <oldfile.csv | iconv -f mac -t utf-8 >newfile.csv
This undoes an incorrect transcoding which apparently the export function in Stata performs internally.
Based on your indicators, cp1252 seems like a good guess, but it could also be cp1254. More examples could help settle the issue if you can't figure it out (common German characters still worth testing include ä, the uppercase umlauts, the sharp s ß, etc.).
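If you want to double-check that diagnosis, a short Python 3 session reproduces the ä -> Š, ü -> Ÿ and ö -> š substitutions from the question, assuming the export wrote Mac Roman bytes that were then read back as Windows-1252 (this is my reading of the bug, not something Stata documents):
# Sanity check of the transcoding the iconv pipeline undoes.
assert 'ö'.encode('mac_roman') == b'\x9a'      # ö written as a Mac Roman byte...
assert b'\x9a'.decode('cp1252') == 'š'         # ...read back as cp1252 gives š
# Undoing it per character, which is what the two iconv calls do file-wide:
assert 'š'.encode('cp1252').decode('mac_roman') == 'ö'
assert 'Š'.encode('cp1252').decode('mac_roman') == 'ä'
assert 'Ÿ'.encode('cp1252').decode('mac_roman') == 'ü'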
Stata 13 and below use a deprecated locale on Mac OS X, macroman (Mac OS X itself is Unicode). I generally used StatTransfer to convert, for example, from Excel (Unicode) to Stata (Western, macroman; Options -> Encoding options) for Spanish-language data. It was the only way to get á, é, etc. Furthermore, Stata 14 imports Unicode without problems but insists on exporting with es_ES (Spanish Spain) as the default locale, so I have to add locale UTF-8 at the end of the export command to get a readable Excel file.
So I am trying to have a browser download a file with a certain name, which is stored in a database. To prevent filename conflicts the file is saved on disk with a GUID, and when it comes time to actually download it, the filename from the database is supplied for the browser. The name is written in Japanese, and when I display it on the page it comes out fine, so it is stored OK in the database. When I try to actually have the browser download it under that name:
return send_from_directory(app.config['FILE_FOLDER'], name_on_disk,
as_attachment=True, attachment_filename = filename)
Flask throws an error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 15-20:
ordinal not in range(128)
The error seems to originate not from my code, but from part of Werkzeug:
/werkzeug/http.py", line 150, in quote_header_value
value = str(value)
Why is this happening? According to their docs, Flask is "100% Unicode"
I actually had this problem before I rewrote my code, and fixed it then by modifying numerous things inside Werkzeug itself, but I really do not want to do that for the deployed app because it is a pain and bad practice.
Python 2.7.6 (default, Nov 26 2013, 12:52:49)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> filename = "[얼티메이트] [131225] TVアニメ「キルラキル」オリジナルサウンドトラック (FLAC).zip"
>>> print repr(filename)
'[\xec\x96\xbc\xed\x8b\xb0\xeb\xa9\x94\xec\x9d\xb4\xed\x8a\xb8] [131225] TV\xe3\x82\xa2\xe3\x83\x8b\xe3\x83\xa1\xe3\x80\x8c\xe3\x82\xad\xe3\x83\xab\xe3\x83\xa9\xe3\x82\xad\xe3\x83\xab\xe3\x80\x8d\xe3\x82\xaa\xe3\x83\xaa\xe3\x82\xb8\xe3\x83\x8a\xe3\x83\xab\xe3\x82\xb5\xe3\x82\xa6\xe3\x83\xb3\xe3\x83\x89\xe3\x83\x88\xe3\x83\xa9\xe3\x83\x83\xe3\x82\xaf (FLAC).zip'
>>>
You should explicitly pass unicode strings (type unicode) when dealing with non-ASCII data. Generally in Flask, bytestrings are assumed to have an ASCII encoding.
I had a similar problem. I originally had this to send the file as attachment:
return send_file(dl_fd,
mimetype='application/pdf',
as_attachment=True,
attachment_filename=filename)
where dl_fd is a file descriptor for my file.
The unicode filename didn't work because the HTTP header doesn't support it. Instead, based on information from this Flask issue and these test cases for RFC 2231, I rewrote the above to encode the filename:
# requires: import urllib; from flask import make_response, send_file
response = make_response(send_file(dl_fd,
                                   mimetype='application/pdf'))
response.headers["Content-Disposition"] = \
    "attachment; " \
    "filename*=UTF-8''{quoted_filename}".format(
        quoted_filename=urllib.quote(filename.encode('utf8'))
    )
return response
Based on the test cases, the above doesn't work with IE8 but works with the other browsers listed. (I personally tested Firefox, Safari and Chrome on Mac)
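A common workaround for older browsers (general RFC 6266 practice, not part of the original answer) is to send a plain ASCII filename as a fallback alongside the encoded one; browsers that understand filename* prefer it, and the rest fall back to filename:
# Hedged variant: ASCII fallback plus the RFC 2231/5987 encoded name.
ascii_name = filename.encode('ascii', 'ignore') or 'download.pdf'
response.headers["Content-Disposition"] = (
    'attachment; filename="{ascii_name}"; '
    "filename*=UTF-8''{quoted}".format(
        ascii_name=ascii_name,
        quoted=urllib.quote(filename.encode('utf8'))))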
You should use something like:
@app.route('/attachment/<int:attachment_id>/<filename>', methods=['GET'])
def attachment(attachment_id, filename):
    attachment_meta = AttachmentMeta(attachment_id, filename)
    if not attachment_meta:
        flask.abort(404)
    return flask.send_from_directory(
        directory=flask.current_app.config['UPLOAD_FOLDER'],
        filename=attachment_meta.filepath,
    )
This way url_for('attachment', attachment_id=1, filename=u'Москва 北京 תֵּל־אָבִיב.pdf') would generate:
/attachment/1/%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D0%B0%20%E5%8C%97%E4%BA%AC%20%D7%AA%D6%B5%D6%BC%D7%9C%D6%BE%D7%90%D6%B8%D7%91%D6%B4%D7%99%D7%91.pdf
Browsers would display or save this file with the correct Unicode name. Don't use as_attachment=True, as that would not work.
I want to know if there is a way to detect mojibake (invalid) characters by their byte range. (For a simple example, detecting valid ASCII characters is just checking whether their byte values are less than 128.) Given the old customized character sets, such as JIS, EUC and of course Unicode, is there a way to do this?
The immediate interest is in a C# project, but I'd like to find a language/platform-independent solution as much as possible, so I could use it in C++, Java, PHP or whatever.
Arigato
Detecting 文字化け (mojibake) by byte range is very difficult.
As you know, most Japanese characters consist of multiple bytes. In the case of Shift-JIS (one of the most popular encodings in Japan), the first-byte range of a Japanese character is 0x81 to 0x9f or 0xe0 to 0xef, and the second byte has a different range. In addition, ASCII characters may be mixed into Shift-JIS text, so it's difficult.
In Java, you can detect invalid characters with java.nio.charset.CharsetDecoder.
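The same idea in Python, as a rough sketch: a strict decode raises on byte sequences that are invalid for the declared encoding, which is one (incomplete) way to flag mojibake. The sample bytes here are the Shift-JIS encoding of モジバケ:
# Sketch: strict decoding as a validity check for a declared encoding.
def is_valid(data, encoding):
    try:
        data.decode(encoding, 'strict')
        return True
    except UnicodeDecodeError:
        return False

print(is_valid(bytes.fromhex('83828357836f8350'), 'shift_jis'))  # True
print(is_valid(bytes.fromhex('83828357836f8350'), 'utf-8'))      # False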
What you're trying to do here is character encoding auto-detection, as performed by Web browsers. So you could use an existing character encoding detection library, like the universalchardet library in Mozilla; it should be straightforward to port it to the platform of your choice.
For example, using Mark Pilgrim's Python 3 port of the universalchardet library:
>>> chardet.detect(bytes.fromhex('83828357836f8350'))
{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
>>> chardet.detect(bytes.fromhex('e383a2e382b8e38390e382b1'))
{'confidence': 0.938125, 'encoding': 'utf-8'}
But it's not 100% reliable!
>>> chardet.detect(bytes.fromhex('916d6f6a6962616b6592'))
{'confidence': 0.6031748712523237, 'encoding': 'ISO-8859-2'}
(Exercise for the reader: what encoding was this really?)
This is not a direct answer to the question, but I've had luck using the ftfy Python package to automatically detect/fix mojibake:
https://github.com/LuminosoInsight/python-ftfy
https://pypi.org/project/ftfy/
https://ftfy.readthedocs.io/en/latest/
>>> import ftfy
>>> print(ftfy.fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง
It works surprisingly well for my purposes.
I don't have the time or priority to follow up on this for the moment, but if you know the source is Unicode, I think some headway can be made using these charts and following on some of the work done here. Likewise, for Shift-JIS, this chart can be helpful.
I'm adding data from a CSV file into a database. If I open the CSV file, some of the entries contain bullet points; I can see them. The file command says it is encoded as ISO-8859.
$ file data_clean.csv
data_clean.csv: ISO-8859 English text, with very long lines, with CRLF, LF line terminators
I read it in as follows and convert it from ISO-8859-1 to UTF-8, which my database requires.
row = [unicode(x.decode("ISO-8859-1").strip()) for x in row]
print row[4]
description = row[4].encode("UTF-8")
print description
This gives me the following:
'\xa5 Research and insight \n\xa5 Media and communications'
¥ Research and insight
¥ Media and communications
Why is the \xa5 bullet character being converted to a yen symbol?
I assume it's because I'm reading it in with the wrong encoding, but what is the right encoding in this case? It isn't cp1252 either.
More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?
I don't know of any general tool, but this Wikipedia page (linked from the page on codepage 1252) shows that A5 is a bullet point in the Mac OS Roman codepage.
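As a hedged follow-up to the question's own snippet: if the file is in fact Mac OS Roman, decoding with that codec turns \xa5 into a bullet (U+2022) instead of a yen sign:
# Sketch (Python 2, mirroring the question's code): decode as Mac Roman
# instead of ISO-8859-1 before re-encoding to UTF-8 for the database.
row = [x.decode("mac_roman").strip() for x in row]
print row[4].encode("UTF-8")   # bullets rather than yen signs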
More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?
You can easily write one in Python.
(Examples use 3.x syntax.)
import encodings

ENCODINGS = set(encodings._aliases.values()) - {'mbcs', 'tactis'}

def _decode(data, encoding):
    try:
        return data.decode(encoding)
    except UnicodeError:
        return None

def possible_encodings(encoded, decoded):
    return {enc for enc in ENCODINGS if _decode(encoded, enc) == decoded}
So if you know that your bullet point is U+2022, then
>>> possible_encodings(b'\xA5', '\u2022')
{'mac_iceland', 'mac_roman', 'mac_turkish', 'mac_latin2', 'mac_cyrillic'}
You could try
iconv -f latin1 -t utf8 data_clean.csv
if you know it is indeed ISO Latin-1. Although in ISO Latin-1, \xA5 is indeed a ¥.
Edit: Actually, this seems to be a problem on Mac when using Word or similar with Arial (?) and printing or converting to PDF - some issues with fonts and whatnot. Maybe you need to explicitly massage the file first. Sounds familiar?
http://forums.quark.com/p/14849/61253.aspx
http://www.macosxhints.com/article.php?story=2003090403110643