Encoding from ANSI when having non-Latin letters

I have a very old program (not a server or anything on the internet) that I think uses the ANSI (Windows-1252) encoding.
The problem is that some inputs to this program are written in Arabic.
However, when I try to read the result, the Arabic words come out as very weird characters. For example, the input "نور" is converted to "äæÑ".
The program output should contain a combination of English words and Arabic words.
E.g. it outputs "Name äæÑ" while the correct output should be something like "Name نور".
In general, the English words are correct and readable in both UTF-8 and ANSI. But the Arabic words are read, for example, as "���" with UTF-8 and as "äæÑ" with ANSI.
I understand that this is because ANSI doesn't support non-Latin letters.
But what should I do now? How can I convert them back to Arabic?
Note: I know the exact input and the exact output that this program should produce.
Note 2: I don't have the source code of this program. I just want to convert the output file of this program so it has the correct words or encoding.

I solved this problem by typing in the terminal:
iconv -f WINDOWS-1256 -t utf8 < my_File.ged > result.ged
I tried to write Java code that does a similar thing, but it didn't give me the result I wanted.
I also tried the same terminal command with WINDOWS-1252 instead of WINDOWS-1256, but that didn't work. So I guess it is good to try different encodings until one works.
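For reference, here is a minimal Java sketch of what that iconv command does, assuming the file really is windows-1256 bytes (the file names are just the ones from the command above):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FixEncoding {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("my_File.ged");
        Path out = Paths.get("result.ged");
        // Decode the raw bytes with the encoding the file was actually
        // written in (windows-1256, Windows Arabic), not the one it was
        // misread as (windows-1252).
        String text = new String(Files.readAllBytes(in), Charset.forName("windows-1256"));
        // Re-encode the same characters as UTF-8.
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }
}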

Related

How to remove accents and keep Chinese characters using a command?

I'm trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters, using a command. Currently, I'm using iconv to remove the accented characters, but it turns out that all the Chinese characters end up as "?????". I can't figure out a way to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only encodes the code points between NUL (0x00) and DEL (0x7F), which basically means the basic control characters plus basic English alphabetics and punctuation. (Look at an ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be fairly easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by "Remove accents from accented characters"; the Chinese characters to test with were gleaned from "What's the complete range for Chinese characters in Unicode?" (the ones on the boundary of the range are not particularly good test cases, so I just guessed a bit).
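For comparison, here is a rough Java sketch of the same technique the Perl one-liner uses, i.e. NFKD decomposition followed by stripping combining marks; the class and method names are my own:

import java.text.Normalizer;

public class StripAccents {
    // Decompose to NFKD, then remove combining marks (Unicode category M).
    // CJK ideographs have no base-plus-mark decomposition, so they survive.
    static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD)
                         .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("仿Café鼀")); // prints 仿Cafe鼀
    }
}

As with the Perl version, this normalizes the rest of the text to its NFKD form as a side effect.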
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.
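The same explicit-list approach, sketched in Java for anyone not working in a shell; the character lists below are simply copied from the sed command:

public class Transliterate {
    // Originals and replacements, in the same order, as with sed's y///.
    static final String FROM = "áàäÁÀÄéèëÉÈË";
    static final String TO   = "aaaAAAeeeEEE";

    static String transliterate(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int idx = FROM.indexOf(c);
            sb.append(idx >= 0 ? TO.charAt(idx) : c); // untouched if not listed
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("CAFÉ 公")); // prints CAFE 公
    }
}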

How should a properly UTF-8 encoded file look in Notepad++

I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv files out of MS SQL exports from a business partner.
I asked him to encode them as UTF-8 (just using the standard, I thought).
Now I can see in his files that a lot of UTF-8 bytes such as "&#233;" can be seen as plain text when I open them in Notepad++ (or also using my "ETL" tool).
Before I ask him to fix it into proper UTF-8, I would like to understand the issue and whether my claim is actually correct.
Shouldn't special characters be shown as special characters when I open the files in Notepad++, and not as plain-text character codes?
Any help is much appreciated :))
Cheers
Martin
&#233; is an HTML entity (it encodes "é"). For some reason the text is HTML-formatted, which I wouldn't count as "plain text"/flat files. The file may or may not be encoded in UTF-8 in addition to that; we don't know from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
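If the partner cannot fix the export, the entities can also be decoded on the receiving side. A minimal Java sketch, under the assumption that only decimal &#NNN; entities occur (named entities like &eacute; are not handled):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DecodeEntities {
    private static final Pattern ENTITY = Pattern.compile("&#(\\d+);");

    // Replace each decimal HTML entity with the character it encodes.
    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String ch = new String(Character.toChars(codePoint));
            m.appendReplacement(sb, Matcher.quoteReplacement(ch));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("Caf&#233;")); // prints Café
    }
}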

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing a .txt file from a remote server and saving it to a database. I use a .Net script for this purpose. I sometimes notice garbled words/characters (Ullerهkersvنgen) inside the files, which cause a problem when saving to the database.
I want to filter out all such characters and convert them to Unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script knows the correct encoding of your text snippet, this should be the regular expression to find all non-ASCII characters:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
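For illustration, a small Java sketch applying that regular expression to the garbled sample from the question:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindNonAscii {
    public static void main(String[] args) {
        Pattern nonAscii = Pattern.compile("[^\\x00-\\x7F]+");
        Matcher m = nonAscii.matcher("Ullerهkersvنgen");
        // Prints each non-ASCII run and its position: "ه" at 5, "ن" at 11.
        while (m.find()) {
            System.out.printf("Found \"%s\" at index %d%n", m.group(), m.start());
        }
    }
}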
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character encoding assumption or munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian text is normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن”, U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)
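A quick Java demonstration of this diagnosis; it assumes your JDK ships the extended windows-125x charsets (standard JDKs do):

import java.nio.charset.Charset;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "Ulleråkersvägen" as windows-1252 bytes; note 0xE5 (å) and 0xE4 (ä).
        byte[] bytes = {
            0x55, 0x6C, 0x6C, 0x65, 0x72, (byte) 0xE5, 0x6B, 0x65, 0x72,
            0x73, 0x76, (byte) 0xE4, 0x67, 0x65, 0x6E
        };
        // Read with the right encoding: Ulleråkersvägen
        System.out.println(new String(bytes, Charset.forName("windows-1252")));
        // Misread as Windows Arabic: Ullerهkersvنgen
        System.out.println(new String(bytes, Charset.forName("windows-1256")));
    }
}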

file encoding and deformed strings

I am working with a text file that contains lots of deformed strings such as:
VyplÅ<88>te prosím pole "jméno
My editor says that the file encoding is latin1. The string is supposed to be a Czech sentence that contains some diacritics, so no wonder it is displayed wrong. I have tried to force utf8 and latin2 encodings in my editor, but that did not help. I have also tried to use iconv to convert the file from latin1 to utf8 or latin2, but neither helped. I quite often encounter issues like this and I don't know any other solution than to manually rewrite the strings. Is there a better way to fix this?
EDIT:
Here is the original sentence:
Vyplňte prosím pole "jméno"
Here is hex dump of the part where the malformed string occurs:
0002640: 6a6d 656e 6f22 5d20 3d20 2744 453a 2056 jmeno"] = 'DE: V
0002650: 7970 6cc5 8874 6520 7072 6f73 c3ad 6d20 ypl..te pros..m
0002660: 706f 6c65 2022 6a6d c3a9 6e6f 222e 273b pole "jm..no".';
EDIT2:
The sentence above really is correct utf8, as deceze said. But I have just found out something strange. If I try to transcode the file from utf8 to utf8 (with iconv), I get an error on the word Postgebühr, at the character ü. If I look at the hex dump, this character is represented as \xfc (252 in decimal), which is the valid latin1 encoding of ü but is completely invalid as a utf8 byte sequence. It seems that part of the file is in latin1 and another part in utf8. Here is a part of the file that is (probably) in latin1:
0000250: 506f 7374 6765 62fc 6872 273b 0a09 0963 Postgeb.hr';...c
0000260: 6f6e 665b 2277 6166 6572 7322 5d20 3d20 onf["wafers"] =
0000270: 2744 453a 206f 706c c3a1 746b 20c3 273b 'DE: opl..tk .';
As I look into this more, this does not even seem to be valid latin1, because even in latin1 it is garbled (DE: oplátk à instead of probably DE: oplatky za). This part of the file seems to contain some damaged text.
I can't understand how the encodings in this file could have got mixed up like that. Any ideas?
If the file is supposed to contain Latin2 encoded text, then trying to convert it from Latin1 or similar is of course messing things up.
The problem is simply that your text editor does not automagically recognize the encoding, since the single-byte Latin* encodings are all indistinguishable at the byte level. If your editor "tells" you the encoding is Latin1, what it means is that it is currently interpreting the file as Latin1. Obviously it has that wrong.
You either need to tell your editor to treat the file as Latin2 (Open As... Latin2, or however your editor gives you this choice) or to convert the file from Latin2 into an encoding your editor handles correctly.
To understand encodings better, I recommend you read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
In response to your posted hex dump: That file is UTF-8 encoded.
iconv is the way to go, but you must know the correct encoding. Latin2 (iso8859-2) is only one of the possibilities, since there were many ASCII extensions in Europe. What language is this supposed to be in?
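Since the file apparently mixes latin1 and utf8 parts, it also helps to locate exactly where strict UTF-8 decoding first fails. A hedged Java sketch; the offset reporting relies on the decoder leaving the buffer position at the start of the malformed sequence, which is my reading of the CharsetDecoder contract:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FindBadUtf8 {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        // Strict decoder: fail instead of substituting replacement chars.
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        try {
            dec.decode(in);
            System.out.println("File is valid UTF-8");
        } catch (CharacterCodingException e) {
            // For the file above this should report the 0xFC of "Postgeb.hr".
            System.out.printf("Invalid UTF-8 near byte offset %d (0x%02X)%n",
                    in.position(), bytes[in.position()] & 0xFF);
        }
    }
}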

Get window title with AppleScript in Unicode

I'm stuck with the following problem:
I have a script which retrieves the title from the Firefox window:
tell application "Firefox"
    if the (count of windows) is not 0 then
        set window_name to name of front window
    end if
end tell
It works well as long as the title contains only English characters, but when the title contains some non-ASCII characters (Cyrillic in my case) it produces some UTF-8 garbage. I've analyzed this garbage a bit and it seems that my Cyrillic characters are converted to UTF-8 without any regard for the code page, i.e. instead of using the Cyrillic code page for the conversion it uses no code page at all, and I get UTF-8 text with characters different from those in the window title.
My question is: how can I retrieve the window title in UTF-8 directly, without any conversion?
I can achieve this goal by using AXAPI, but I want to do it with AppleScript, because AXAPI needs some option turned on in the system.
UPD:
It works fine in the AppleScript Editor. But I'm compiling it from C++ code via OSACompile -> OSAExecute -> OSADisplay.
I don't know the guts of the AppleScript Editor, so maybe it has some inside knowledge about how to encode the characters.
I found the answer while writing this update. Sometimes it is good to ask a question just to understand it better :)
So, for future searchers: if you want the result of the script execution in Unicode, you should pass typeUnicodeText to OSADisplay; then you will get the result as UTF-16LE in the resulting AEDesc.