How to discover what codepage to use when converting RTF hex literals to Unicode - unicode

I'm parsing RTF 1.5+ files generated by Word 2003+ that may have content from other languages. This content is usually encoded as hex literals (\'xx). I would like to convert these literals to unicode values.
I know my document's code page by looking for ansicpg (\ansi\ansicpg1252).
When I use the ansicpg codepage to decode to Unicode, many languages (like French) seem to convert to the Unicode char values that I expect.
However when I see Russian text (like below), codepage 1252 decodes the content to jibberish.
\f277\lang1049\langfe1033\langnp1049\insrsid5989826\charrsid6817286
\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff. \'dd\'f2
\'e0 \'f1\'f2\'f0\'e0\'ed\'e8\'f6\'e0 \'ed\'e5 \'e4\'ee\'eb\'e6\'ed\'e0
\'ee\'f2\'ee\'e1\'f0\'e0\'e6\'e0\'f2\'fc\'f1\'ff \'e2 \'f2\'e0\'e1\'eb\'e8\'f6\'e5
\'e2 \'f1\'ee\'e4\'e5\'f0\'e6\'e0\'ed\'e8\'e8.
I assume that lang1049, langfe1033, langnp1049 should provide me clues so I can programmatically choose a different (non-default) code page for the text that they reference? If so, where can I find information that explains how to map a lang* code to a codepage? Or should I be looking for some other RTF command/directive to provide me with the information I'm looking for? (Or must I use \f277 as a font reference and see if it has an associated codepage?)

\lang really only marks up particular stretches of the text as being in a particular language, and shouldn't impact what code page is to be used for the old non-Unicode \' escapes.
Putting an \ansicpg token in the header should perhaps do it, but seems to be ignored by Word (for both raw bytes and \' escapes.
Or must I use \f277 as a font reference and see if it has an associated codepage?
It looks that way. Changing the \fcharset of the font assigned to a particular stretch of text is the only way I can get Word to change how it treats the bytes, anyway. The codes in this token (see eg here for list) are, aggravatingly, different again from either the language ID or the code page number.

It is not so clear but you can use the RichEdit control in order to convert the RTF to UTF-8 format according to the MSDN:
http://msdn.microsoft.com/en-us/library/windows/desktop/bb774304(v=vs.85).aspx
Take a look to the SF_USECODEPAGE for the EM_STREAMOUT message.

Related

Displaying foreign characters on the website

I have a small list in a foreign language and I am able to display the special foreign characters on the website I am updating. For example to display ü on the website, I write ü in the file. Or to display ö, I write ö in the file. And they are displayed correctly. So far no problems. But now I must also display the character β. Can you just write me the code for it in that same set? Or better yet, tell me where can I find the corresponding character? such as in a list? what is the name of the list I must look at? Again, I want to display character β on a website, by writing the corresponding special character on the source file, just like I am writing ü to display ü.
Mojibake is what's happening, because your text editor use ISO 8859-1 to open and save the files, but your web server serve them to your user with UTF-8. You can confirm it with https://string-functions.com/encodedecode.aspx or other tools using encode set to ISO 8859-1 but the decode set to UTF-8.
The fix is to set your text editor to use UTF-8.

Unicode converted text isn't shown properly in MS-Word

In a mapping editor, the display is correct after the legacy to unicode conversion for DEVANAGARI text shown using a unicode font (Arial Unicode MS). However, in MS-WORD, the display isn't as expected for the same unicode text in the unicode font (Arial Unicode MS) or any other Devanagari unicode fonts. The expected sequence of unicodes are provided as per the documentation. The sequence can be seen on the left-hand side table.
Please let me know where I am going wrong.
Thanks for your help!
Does your map have to insert the zero_width_joiner? The halant (virama) by itself is enough to get the half-consonant (for some combinations) and in particular, it may be that Word is using the presence of the ZWJ to keep them separate.
If getting rid of the ZWJ doesn't help, another possibility is that Word may be treating the individual characters of the text string as individual "runs" of text.
If those first 4 characters are not in a single run, this can happen.
[aside: the way to tell if it's being treated as a single run, is to save the document as an xml file and then open it with something like notepad++ and look at the xml "w:t" element (IIRC) associated with these characters. If they're all in separate w:t elements, it means they're in separate runs. In that case, you might need to copy the text from Word to some other tool (e.g. Notepad++) and then copy it from there and paste it back in Word -- that might cause it to be imported into Word in a single run.

Difference between Unicode format?

I noticed something while uploading some unicode data to the database. When the content is uploaded throught textarea, is gets stored in क format, but when you personally type or paste the unicode and insert it hardcoded in php, then it would store in ठformat. But for both, the unicode character is same क.
Now please tell me the difference between the different formats of unicode characters. And how they affect the development. There has to be some limitations in those formats.
& #2325; is markup used in HTML to represent a Unicode character
If you hard code something in a php source file, Make sure you are opening it with editor that correctly displays text files with unicode characters in it.
http://www.joelonsoftware.com/articles/Unicode.html is good place to know the basics of unicode.
UTF-8 encoding of क has the byte sequence E0 A4
Now if somebody interprets this as 8 bit Latin encoding it will think it is two characters
you will see in the table in the above link E0 is à and A4 is ¤
When the content is uploaded throught textarea, is gets stored in क format,
Forms should not submit content in a character-reference (&#...;) format.
But in reality, they do in most current browsers... but only when they can't submit the character in question in any other way. In this case, you can't tell whether the user originally typed क or क, it is a lossy encoding.
To avoid this, make sure you are serving your page in a charset that supports all possible Unicode characters. In practical terms this means always use UTF-8, and serve your page with the Content-Type: text/html;charset=utf-8 header and/or the <meta http-equiv="Content=Type" content="text/html;charset=utf-8"/> element in the header. You'll then get all characters in simple, uncorrupted UTF-8 format.

UTF8 charset, diacritical elements, conversion problems - and Zend Framework form escaping

I am writing a webapp in ZF and am having serious issues with UTF8. It's using multi lingual content through Zend Form and it seems that ZF heavily escapes all of these characters and basically just won't show a field if there's diacritical elements 'é' and if I use the HTML entity equivalent e.g. é it gets escaped so that the user will see 'é'.
Zend Form allows for having non escaped data, but trying to use this is confusing, and it seems it'd need to be used all over the place.
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
And if the last question is true, then how do I convert the source text to UTF8? I am comfortable setting up apache so that it sends a default UTF8 charset heading, and also adding the charset meta tag to the html, but doing this I am still getting messed up encoding. I have also tried opening the translation csv file in TextWrangler on OSX as UTF8, but it has done nothing.
Thanks!
L
'é' and if I use the HTML entity equivalent e.g. é it gets escaped so that the user will see 'é'.
This I don't understand. Can you show an example of how it is displayed, as opposed to how it should be displayed?
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
Yup. In more detail: If the data you're displaying and the encoding of the HTML page are both UTF-8, the multi-byte special characters will be displayed correctly.
And if the last question is true, then how do I convert the source text to UTF8?
Advanced editors and IDEs enable you to define what encoding the source file is saved in. You would need to open the file in its current encoding (with special characters being displayed correctly) and save it as UTF-8.
If the content is messed up when you have the right content-type header and/or meta tag specified, then the content is not UTF-8 yet. If you don't get it sorted, post an example of what it looks like here.

Codepages and encodings

Before anyone recommends that I do a google search on this, I have. I just need a bit more clarity around what codepages and encodings.
If I use UTF8 encoding, and use an italian code page and then a french code page, does this mean ill get different characters even though the bytes havent changed?
Joel has a nice summary of this:
http://www.joelonsoftware.com/articles/Unicode.html
And no. if I understand your question correctly it doesn't mean that.
When you're converting UTF-8 to a specific code page, it is possible that only some of the characters are going to be converted. What happens to the ones that don't get converted depends on how you call the conversion. A possible result is that the characters which could not be mapped to the code page would be converted to question mark characters.
An encoding is simply a mapping between numerical values and "characters".
US-ASCII maps the number 65 to the letter A, 32 to a space and 49 to the digit "1". (How these things are rendered is another matter.) In fact, UTF-8 does the same! But there are other values which UTF-8 treats differently to ASCII. It is a variable-length encoding, i.e. a character may be encoded with 1, 2, 3, or 4 bytes; common characters generally consume less bytes.
Plain text files, including web pages, are stored and transmitted as sequences of bytes. These bytes are supposed to represent something textual. Software applications (like text editors and web browsers) are responsible for rending the information within these files on the screen. Usually they make use of library or OS functions.
If the software assumes a different encoding to the software that created the file, the wrong characters may be displayed!
Note that it is possible to convert between different encodings; however if you convert to an encoding that does not contain a certain character, the software must make a choice as to what to use instead. This conversion often happens transparently (when you save a file with a certain encoding, whatever you've typed must be changed into that encoding).
UTF-8 includes all characters from your French and Italian code page, but the language specific code pages does not include all of each others characters.
So you can take input from each language and convert it to UTF-8 for storage, but you can not be certain that you will get the right characters if you take Italian input and show it as French.
Use UTF-8 all the way if you can.