How to convert ANSI text to Unicode?

I would like to convert RTF text to Unicode. In the RTF font table one can find the name of the font or font face (e.g. Arial Cyr, Courier Greek) and the charset to use with it (0-255). So how do I write a function that converts a character code (0-255) with these settings to Unicode?
As far as I can see, the name postfixes like Greek, Cyr, Tur, etc. affect the glyphs of the displayed characters, and the charset affects them too. So the function could have these input parameters:
fontname postfix, font charset, character code
But what comes next? Or am I on the wrong track?

RTF was invented long before Unicode. It most certainly isn't ANSI text; RTF itself uses only ASCII, with non-ASCII characters encoded as hex escapes that carry a reference to a character set, a rather unholy mix of character sets. The mapping is also not perfect: many Unicode codepoints have no corresponding charset.
You'll spend a lifetime creating your own RTF-to-Unicode converter. Take advantage of an existing solution; almost any platform has one. On Windows that would be the RichEdit control. If you use .NET it is especially simple: use the RichTextBox class, assign its Rtf property, and read back its Text property, which is UTF-16 encoded Unicode.
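If you are in Python rather than .NET, the same shortcut exists as third-party libraries. A minimal sketch, assuming the striprtf package and its rtf_to_text helper are available (this is an analogous route, not the RichEdit one):

# Minimal sketch, assuming the third-party striprtf package is installed
# (pip install striprtf); any maintained RTF parser would do equally well.
from striprtf.striprtf import rtf_to_text

# Word-generated RTF is normally pure ASCII, so a plain text read suffices.
with open('document.rtf', 'r', encoding='ascii', errors='ignore') as fp:
    rtf = fp.read()

text = rtf_to_text(rtf)   # plain text extracted by the library, as a Unicode str
print(text)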

Related

Character Encodings compatibility with ASCII

I'm currently reading mails from a file and processing some of the header information. Non-ASCII characters are encoded according to RFC 2047 in quoted-printable or Base64, so the files contain no non-ASCII characters. If the file is encoded in UTF-8, Windows-1252 or one of the ISO-8859-* character encodings, I won't run into problems, because ASCII is embedded at the same place in all these charsets (so 0x41 is an A in all of them).
But what if the file is encoded using an encoding that does not embed ASCII in that way? Do encodings like this even exist? And if so, is there even a reliable way of detecting them?
There is a charset detector from Mozilla based on this very interesting article. It can detect a very large number of different encodings. There is also a C# port available on GitHub, which I have used before; it turned out to be quite reliable. Of course, when the text contains only ASCII characters it cannot distinguish between the different encodings that encode ASCII the same way, but any encoding that encodes ASCII differently should be detected correctly by this library.
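A minimal sketch of the same idea in Python, assuming the chardet package (a port of that Mozilla detector) is installed; the file name is only a placeholder:

# Sketch: guess the encoding of raw bytes with chardet, then decode.
import chardet

with open('mail.txt', 'rb') as fp:   # hypothetical input file
    raw = fp.read()

guess = chardet.detect(raw)          # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}
text = raw.decode(guess['encoding'] or 'ascii')
print(guess)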

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing a .txt file from a remote server and saving it to a database. I use a .NET script for this purpose. I sometimes notice garbled words/characters (Ullerهkersvنgen) inside the files, which cause a problem while saving to the database.
I want to filter out all such characters and convert them to Unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script does know the correct encoding of your text snippet, this should be the regular expression to find all non-ASCII characters:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
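For example, a minimal Python sketch applying the regular expression above (the file name is only a placeholder):

# Sketch: list every run of non-ASCII characters found in a text file.
import re

with open('mail_headers.txt', encoding='utf-8') as fp:   # hypothetical file
    text = fp.read()

non_ascii_runs = re.findall(r'[^\x00-\x7F]+', text)
print(non_ascii_runs)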
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character-encoding assumption or munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian text is normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)
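A quick way to verify that diagnosis, sketched in Python: re-encode the garbled string with the wrongly assumed code page and decode the resulting bytes with the suspected real one.

# The bytes were windows-1252, but were read as windows-1256 (Arabic);
# reversing that mis-decoding recovers the original Swedish text.
garbled = 'Ullerهkersvنgen'
fixed = garbled.encode('cp1256').decode('cp1252')
print(fixed)   # Ulleråkersvägen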

Convert non english characters into Unicode (UTF-8)

I copied a large amount of text from another system to my PC. When I viewed the text on my PC, it looked weird. So I copied all the fonts from the other PC and installed them on mine too. Now the text looks okay, but it seems that it is actually not Unicode. For example, if I copy the text and paste it into a UTF-8-capable editor such as Notepad++, I get only Latin characters ("bgah;"), as shown below.
How can I convert this whole text into Unicode text, like the one below, so I can copy the text and paste it anywhere else?
பெயர்
The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil
I need this conversion to be done, so I can copy them into database tables.
'Ja-01' is a font with a custom 'visual encoding'.
That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.
This is always to be avoided, because by storing the content as "bgah;" you lose the ability to search and process it as real Tamil, but this approach was common in the pre-Unicode days especially for less-widespread scripts without mature encoding standards. This application probably predates widespread use of TSCII.
Because it is a custom encoding not shared by any other font, it is very unlikely you will be able to find a tool to convert content in this encoding to proper Unicode characters. It does not appear to follow any standard character ordering, so you will have to look at the font (e.g. in charmap.exe), note down every character, find the matching character in Unicode, and map between them.
For example here's a trivial Python script to replace characters in a file:
mapping = {
    u'a': u'\u0BAF',  # Tamil letter Ya
    u'b': u'\u0BAA',  # Tamil letter Pa
    u'g': u'\u0BC6',  # Tamil vowel sign E (combining)
    u'h': u'\u0BB0',  # Tamil letter Ra
    u';': u'\u0BCD',  # Tamil sign virama (combining)
    # fill in the rest of the mapping information here!
}

with open('ja01data.txt', 'rb') as fp:
    data = fp.read().decode('utf-8')

for char in mapping:
    data = data.replace(char, mapping[char])

with open('utf8data.txt', 'wb') as fp:
    fp.write(data.encode('utf-8'))
The font you found is getting you into trouble. The actual cell text is "bgah;"; it gets rendered as பெயர் because you found a font that can work with 8-bit non-Unicode characters. So reading it or pasting it into Notepad++ is going to produce "bgah;", since that is the real text. It can only ever be rendered properly again by forcing the program that displays the string to use that same font.
Ditch the font and enter the text as real Unicode so it looks like பெயர் in any editor.
"bgah" looks like a Baamini based system, which is pre-unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.
As the others mentioned, it looks like a custom visual encoding that mimics the performance of a foreign script while maintaining ASCII encoding.
Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml
Let me know if this works. If not, I can ask around and get something for you.
You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check out whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode” but they say that only TSCII is supported for now.

Scintilla Supports Unicode? What about SCI_GETCHARAT?

Does Scintilla really support Unicode? If so, why does SCI_GETCHARAT return a char value (cast to LRESULT)?
From the SCI_SETCODEPAGE docs...
Code page SC_CP_UTF8 (65001) sets Scintilla into Unicode mode with the document treated as a sequence of characters expressed in UTF-8. The text is converted to the platform's normal Unicode encoding before being drawn by the OS and thus can display Hebrew, Arabic, Cyrillic, and Han characters.
You will have to examine the byte you retrieve with SCI_GETCHARAT(pos) and, depending on the top bits of that, maybe read SCI_GETCHARAT(pos+1) and beyond in order to get the Unicode code point. (See here.)
Edit:
For some C++ code that does this, see below (search for SciMoz::GetWCharAt):
http://vacuproj.googlecode.com/svn/trunk/npscimoz/npscimoz/oldsrc/trunk.nsSciMoz.cxx
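A minimal sketch of that byte-examination logic in Python, with a plain bytes buffer standing in for repeated SCI_GETCHARAT calls:

# Sketch: decode one UTF-8 code point starting at 'pos'. The top bits of the
# lead byte tell you how many continuation bytes to fetch after it.
def code_point_at(buf, pos):
    b = buf[pos]
    if b < 0x80:                 # 0xxxxxxx: single-byte ASCII
        return b
    elif b & 0xE0 == 0xC0:       # 110xxxxx: two-byte sequence
        count, cp = 1, b & 0x1F
    elif b & 0xF0 == 0xE0:       # 1110xxxx: three-byte sequence
        count, cp = 2, b & 0x0F
    else:                        # 11110xxx: four-byte sequence
        count, cp = 3, b & 0x07
    for i in range(1, count + 1):    # each continuation byte adds 6 bits
        cp = (cp << 6) | (buf[pos + i] & 0x3F)
    return cp

print(hex(code_point_at('€'.encode('utf-8'), 0)))   # 0x20ac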
It was a long time ago, but if I remember correctly Scintilla is not a native Unicode application. Still, it has some Unicode support.
First, the function should really be named SCI_GETBYTEAT, because it returns a byte from the internal UTF-8 buffer.
Also, it has Unicode support for keyboard input, so it has at least some Unicode support :)

How to discover what codepage to use when converting RTF hex literals to Unicode

I'm parsing RTF 1.5+ files generated by Word 2003+ that may have content from other languages. This content is usually encoded as hex literals (\'xx). I would like to convert these literals to unicode values.
I know my document's code page by looking for ansicpg (\ansi\ansicpg1252).
When I use the ansicpg codepage to decode to Unicode, many languages (like French) seem to convert to the Unicode char values that I expect.
However, when I see Russian text (like below), code page 1252 decodes the content to gibberish.
\f277\lang1049\langfe1033\langnp1049\insrsid5989826\charrsid6817286
\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff. \'dd\'f2
\'e0 \'f1\'f2\'f0\'e0\'ed\'e8\'f6\'e0 \'ed\'e5 \'e4\'ee\'eb\'e6\'ed\'e0
\'ee\'f2\'ee\'e1\'f0\'e0\'e6\'e0\'f2\'fc\'f1\'ff \'e2 \'f2\'e0\'e1\'eb\'e8\'f6\'e5
\'e2 \'f1\'ee\'e4\'e5\'f0\'e6\'e0\'ed\'e8\'e8.
I assume that lang1049, langfe1033, langnp1049 should provide me clues so I can programmatically choose a different (non-default) code page for the text that they reference? If so, where can I find information that explains how to map a lang* code to a codepage? Or should I be looking for some other RTF command/directive to provide me with the information I'm looking for? (Or must I use \f277 as a font reference and see if it has an associated codepage?)
\lang really only marks up particular stretches of the text as being in a particular language, and shouldn't impact what code page is to be used for the old non-Unicode \' escapes.
Putting an \ansicpg token in the header should perhaps do it, but it seems to be ignored by Word (for both raw bytes and \' escapes).
Or must I use \f277 as a font reference and see if it has an associated codepage?
It looks that way. Changing the \fcharset of the font assigned to a particular stretch of text is the only way I can get Word to change how it treats the bytes, anyway. The codes in this token (see e.g. here for a list) are, aggravatingly, different again from either the language ID or the code page number.
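A minimal sketch of that font-charset approach in Python; the \fcharset-to-code-page table below is an assumption that covers only a few common Windows charset values, so verify it against a full list:

# Sketch: decode RTF \'xx hex literals using the code page implied by the
# font's \fcharsetN value (table entries follow the usual Windows constants).
import re

FCHARSET_TO_CODEPAGE = {
    0: 'cp1252',     # ANSI_CHARSET
    161: 'cp1253',   # GREEK_CHARSET
    162: 'cp1254',   # TURKISH_CHARSET
    204: 'cp1251',   # RUSSIAN_CHARSET
    238: 'cp1250',   # EASTEUROPE_CHARSET
}

def decode_hex_literals(rtf_fragment, fcharset):
    codepage = FCHARSET_TO_CODEPAGE.get(fcharset, 'cp1252')
    raw = bytes(int(h, 16) for h in re.findall(r"\\'([0-9a-fA-F]{2})", rtf_fragment))
    return raw.decode(codepage)

# First word of the Russian sample above, assuming its font uses \fcharset204:
print(decode_hex_literals(r"\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb", 204))   # Страницы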
It is not spelled out very clearly, but according to MSDN you can use the RichEdit control to convert the RTF to UTF-8:
http://msdn.microsoft.com/en-us/library/windows/desktop/bb774304(v=vs.85).aspx
Take a look at SF_USECODEPAGE for the EM_STREAMOUT message.