Unicode RTF text in RichEdit

I'm having trouble getting a RichEdit control to display unicode RTF text. My application is Unicode, so all strings are wchar_t strings.
If I create the control as "RichEdit20A" I can use e.g. SetWindowText, and the text is displayed with the proper formatting. If I create the control as "RichEdit20W" then SetWindowText shows the text verbatim, i.e. all the RTF code is displayed. The same happens if I use the EM_SETTEXTEX message, specifying codepage 1200, which MSDN tells me indicates Unicode.
I've tried using the StreamIn function, but this only seems to work if I stream in ASCII text. If I stream in widechars then I get empty text in the control. I use the SF_RTF|SF_UNICODE flags, and MSDN hints that this combination may not be allowed.
So what to do? Is there any way to get widechars into a RichEdit without losing RTF interpretation, or do I need to encode it? I've thought about trying UTF-8, or perhaps using the encoding facilities in RTF, but am unsure what the best choice is.

I had to do this recently, and made the same observations you're making.
It seems that, despite what MSDN almost suggests, the RTF parser will only work with 8-bit encodings. So what I ended up doing was using UTF-8, which is an 8-bit encoding that can still represent the full range of Unicode characters. You can get UTF-8 from a PWSTR via WideCharToMultiByte():
PWSTR WideString = /* Some string... */;
DWORD WideLength = (DWORD)wcslen(WideString) + 1;
PSTR Utf8;
DWORD Length;
INT ReturnedLength;

// A UTF-8 representation shouldn't be longer than 4 times the size
// of the UTF-16 one.
Length = WideLength * 4;
Utf8 = malloc(Length);
if (!Utf8) { /* TODO: handle failure */ }

ReturnedLength = WideCharToMultiByte(CP_UTF8,
                                     0,
                                     WideString,
                                     WideLength - 1,
                                     Utf8,
                                     Length - 1,
                                     NULL,
                                     NULL);
if (ReturnedLength)
{
    // Need to zero-terminate the output...
    Utf8[ReturnedLength] = 0;
}
else { /* TODO: handle failure */ }
Once you have it in UTF-8, you can do:
SETTEXTEX TextInfo = {0};
TextInfo.flags = ST_SELECTION;
TextInfo.codepage = CP_UTF8;
SendMessage(hRichText, EM_SETTEXTEX, (WPARAM)&TextInfo, (LPARAM)Utf8);
And of course (I left this out originally, but while I'm being explicit...):
free(Utf8);
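If you're doing the same thing from managed code, a P/Invoke version of this call might look roughly like the sketch below. This is an untested outline: the RichEditUtf8 and SetRtf names are mine, and the message/flag/codepage values are the ones I believe richedit.h defines (EM_SETTEXTEX = WM_USER + 97, ST_DEFAULT = 0, CP_UTF8 = 65001).

using System;
using System.Runtime.InteropServices;
using System.Text;

[StructLayout(LayoutKind.Sequential)]
struct SETTEXTEX
{
    public uint flags;    // ST_DEFAULT = 0, ST_KEEPUNDO = 1, ST_SELECTION = 2
    public uint codepage; // CP_UTF8 = 65001
}

static class RichEditUtf8
{
    const uint WM_USER = 0x0400;
    const uint EM_SETTEXTEX = WM_USER + 97; // from richedit.h

    [DllImport("user32.dll")]
    static extern IntPtr SendMessage(IntPtr hWnd, uint msg,
                                     ref SETTEXTEX wParam, byte[] lParam);

    // Hand the control zero-terminated UTF-8 and let it parse the RTF.
    public static void SetRtf(IntPtr hRichEdit, string rtf)
    {
        var st = new SETTEXTEX { flags = 0, codepage = 65001 };
        byte[] utf8 = Encoding.UTF8.GetBytes(rtf + "\0");
        SendMessage(hRichEdit, EM_SETTEXTEX, ref st, utf8);
    }
}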

RTF is an ASCII format; any character outside ASCII is encoded using an escape sequence.
RTF 1.9.1 specification (March 2008)

Take a look at the \uN control word in the RTF specification: you have to convert your wide string to a string of escaped Unicode characters like \u902?\u300?\u888?
http://www.biblioscape.com/rtf15_spec.htm#Heading9
The numbers represent the character's decimal code, and the question mark is the fallback character that replaces the Unicode character in case the control does not support Unicode (RichEdit v1.0).
For example, for the Unicode string L"TIME" the RTF data will be "\u84?\u73?\u77?\u69?"
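As a sketch, the conversion is a simple loop (the EscapeRtf name is mine; note the spec says N is a signed 16-bit decimal, so UTF-16 code units above 32767 must be emitted as negative numbers, which the cast to short handles):

using System.Text;

static string EscapeRtf(string s)
{
    var sb = new StringBuilder();
    foreach (char c in s)
        // \uN takes a signed 16-bit decimal value, followed by the
        // fallback character for readers without Unicode support.
        sb.Append(@"\u").Append((short)c).Append('?');
    return sb.ToString();
}

// EscapeRtf("TIME") -> @"\u84?\u73?\u77?\u69?"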

Related

Weird Normalization on .net

I am trying to normalize a string (using .NET Standard 2.0) using Form D, and it works perfectly when running on a Windows machine.
[TestMethod]
public void TestChars()
{
    var original = "é";
    var normalized = original.Normalize(NormalizationForm.FormD);

    var originalBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(original));
    Assert.AreEqual("233,0", originalBytesCsv);

    var normalizedBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(normalized));
    Assert.AreEqual("101,0,1,3", normalizedBytesCsv);
}
When I run this on Linux, it returns "253,255" for both strings, before and after normalization. Those two bytes form the 16-bit value 65533, which is the Unicode replacement character, used when something goes wrong with decoding. That's the part where I am lost.
What am I missing here? Can someone point me in the right direction?
It might be related to the encoding of the source file. I'm not sure which encodings .NET on Linux supports, but to be on the safe side, you should use plain ASCII source files and Unicode escapes for non-ASCII characters:
var original = "\u00e9";
There is no text but encoded text.
When communicating text to person or program, both the bytes and the character encoding are essential.
The C# compiler (like all programs that process text, except in special cases like JSON) must know which character encoding the input files use. You must inform it accurately. The default is UTF-8 and that is a fine choice, especially for C# files, which are, lexically, sequences of Unicode codepoints.
If you used your editor or IDE or file transfer without full mindfulness of these requirements, you might have used an unintended character encoding.
For example, "é" saved as Windows-1252 (the single byte 0xE9) but read as UTF-8 (where 0xE9 is a lead byte that should be followed by two continuation bytes) would give � to indicate this mishandling to the reader.
To be on the safe side, use UTF-8 everywhere but do it mindfully.
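You can reproduce the failure mode from that example in a couple of lines (a sketch relying on .NET's default replacement fallback for invalid input):

using System;
using System.Text;

// 0xE9 is "é" in Windows-1252, but a lone UTF-8 lead byte.
// Decoding it as UTF-8 yields U+FFFD, the replacement character.
string s = Encoding.UTF8.GetString(new byte[] { 0xE9 });
Console.WriteLine((int)s[0]);                                      // 65533
Console.WriteLine(string.Join(",", Encoding.Unicode.GetBytes(s))); // 253,255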

use wcstombs() to convert wchar_t* (containing Unicode) to MBCS (char*) dependent on locale

My input is Unicode characters, e.g. "(U+00DB) (U+0081)" (wchar_t*). I use wcstombs to convert this wide-char string into a char* (MBCS). Since Unicode is already encoded in UTF-8, I am expecting it to return a byte-by-byte copy of the sequence, as "DB 81" in the char*. But instead I get "C3 9B". This happens on Linux; on Windows I get "DB 81" only.
I need to open a file whose name is "DB 81" (as shown in a hexdump), but fopen takes a char* filename. Thus I have to convert this wchar_t* to MBCS. Please help!
No, what you want to do is not what you think you should do.
fopen() cannot, under any circumstances, handle all possible filenames on your system, because it lacks Unicode support.
Please refer to http://www.utf8everywhere.org to see how to do it with _wfopen().

Text encoding in ID3v2.3 tags

Thanks to this site and a few others, I've created some simple code to read ID3v2.3 tags from MP3 files. Doing so has been a great learning experience as I previously had no knowledge of hex / byte / binary etc.
I can successfully read data, but have come across an issue that I believe is to do with the encoding used. I've realized that text frames have a byte at the beginning of the 'text' that describes the encoding used, and potentially more information in the next 2 bytes...
Example:
Data from frame TIT2 starts with the byte $03 (hex) before the actual text. This text displays correctly, albeit with an additional character at the beginning, using Encoding.ASCII.GetString
In another MP3, data from TIT2 starts with $01 and is followed by $FF $FE, which I believe is to do with Unicode? The text itself is broken up though; there is a $00 between every text character, and this stops the data from being displayed in Windows Forms (as soon as a 00 is encountered, the text just stops, so I get the first character and that's it). I've tried using Encoding.Unicode.GetString, but that just seems to return gibberish.
Printing this data to a console seems to work, with spaces between each char, so the reading of the data is working properly.
I've been reading the official documentation for ID3v2.3 but I guess I'm just not clued-up enough to understand the text encoding section.
Any replies or links to articles that may be of help would be much appreciated!
Regards
Ross
Just to add one more comment, on the text encoding byte:
00 – ISO-8859-1 (Latin-1).
01 – UCS-2 (UTF-16 encoded Unicode with BOM), in ID3v2.2 and ID3v2.3.
02 – UTF-16BE encoded Unicode without BOM, in ID3v2.4.
03 – UTF-8 encoded Unicode, in ID3v2.4.
from:
http://en.wikipedia.org/wiki/ID3
Data from frame TIT2 starts with the byte $03 (hex) before the actual text. This text displays correctly, albeit with an additional character at the beginning, using Encoding.ASCII.GetString
Encoding 0x03 is UTF-8, so you should use Encoding.UTF8.GetString. The character at the beginning may be U+FEFF Byte Order Mark, which is used to distinguish between UTF-16LE and UTF-16BE... it's no use for UTF-8, but Windows tools love to put it there anyway.
UTF-8 is an ID3v2.4 feature not present in 2.3, which may be why you can't find it in the spec. In the real world you will find all sorts of total nonsense in ID3 tags regardless of version.
data from TIT2 starts $01 and is followed by $FF $FE, which I believe is to do with Unicode? The text itself is broken up though, there are $00 between every text character,
That's UTF-16LE, the text-to-byte encoding that Windows misleadingly calls “Unicode”. It is made up of two-byte code units, so the characters in the range U+0000–U+00FF come out as the low-byte of the same number, followed by a zero high-byte. The 0xFF-0xFE prefix is a Byte Order Mark correctly used. Encoding.Unicode.GetString should return a correct string from this—post some code?
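You can see that byte layout directly (a quick sketch; Encoding.Unicode is .NET's name for UTF-16LE):

using System;
using System.Linq;
using System.Text;

// "é" (U+00E9) in UTF-16LE: the low byte first, then a zero high byte.
byte[] b = Encoding.Unicode.GetBytes("\u00E9");
Console.WriteLine(string.Join(" ", b.Select(x => x.ToString("X2")))); // E9 00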
Printing this data to a console seems to work
Getting non-ASCII characters to print on the Windows console can be a trial, so if you hit problems bear in mind they may be caused by the print operation itself.
For completeness, encoding 0x02 is UTF-16BE without a BOM (there is little reason for this to exist and I have never met this in the wild at all), and encoding 0x00 is supposed to be ISO-8859-1, but in reality could be pretty much any ASCII-superset encoding, more likely a Windows ‘ANSI’ code page like Encoding.GetEncoding(1252) than a standard like 8859-1.
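Pulling the four cases together, here's a rough sketch of picking a decoder from the encoding byte (the DecodeTextFrame name is mine; it assumes a little-endian BOM for encoding 1 and ignores trailing terminators):

using System.Text;

static string DecodeTextFrame(byte[] frame)
{
    // The first byte of the frame body is the encoding marker.
    Encoding enc = frame[0] switch
    {
        0 => Encoding.GetEncoding("ISO-8859-1"),
        1 => Encoding.Unicode,          // UTF-16 with BOM
        2 => Encoding.BigEndianUnicode, // UTF-16BE, no BOM (v2.4)
        3 => Encoding.UTF8,             // v2.4
        _ => Encoding.Default
    };

    string text = enc.GetString(frame, 1, frame.Length - 1);

    // GetString does not strip a BOM; drop a leading U+FEFF if present.
    return text.TrimStart('\uFEFF');
}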
Great, I've gotten some code to read Unicode & ASCII properly (below)!
One question though - I expected Encoding.Unicode.GetString() to handle the BOM, but it doesn't seem to. I take it you have to read those bytes and deal with the data accordingly yourself? I've just stripped out the 2 bytes if it's Unicode, below.
public class Frame
{
    FrameHeader _header;
    public string data;
    public string name;

    public Frame(FrameHeader frm, byte[] bytes)
    {
        _header = frm;
        name = _header._name;
        if (!name.Equals("APIC"))
        {
            int encoding = bytes[0];
            if (encoding == 1)
            {
                // Unicode: skip the encoding byte and the 2-byte BOM.
                byte[] actualdata = new byte[bytes.Length - 3];
                Array.Copy(bytes, 3, actualdata, 0, actualdata.Length);
                data = Encoding.Unicode.GetString(actualdata);
            }
            else
            {
                // Otherwise: skip just the encoding byte.
                byte[] actualdata = new byte[bytes.Length - 1];
                Array.Copy(bytes, 1, actualdata, 0, actualdata.Length);
                data = Encoding.ASCII.GetString(actualdata);
            }
        }
    }
}

How to convert from unicode to ASCII

Is there any way to convert unicode values to ASCII?
To simply strip the accents from unicode characters you can use something like:
string.Concat(input.Normalize(NormalizationForm.FormD).Where(
    c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
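Wrapped up as a self-contained helper (the StripAccents name is mine), that looks like:

using System.Globalization;
using System.Linq;
using System.Text;

static string StripAccents(string input) =>
    // Form D splits each accented letter into a base letter plus
    // combining marks; dropping the NonSpacingMark category removes
    // the accents and keeps the base letters.
    string.Concat(input.Normalize(NormalizationForm.FormD).Where(
        c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));

// StripAccents("crème brûlée") -> "creme brulee"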
You CAN'T convert from Unicode to ASCII. Almost every character in Unicode cannot be expressed in ASCII, and those that can be expressed have exactly the same codepoints in ASCII as in UTF-8, which is probably what you have. Almost the only thing you can do that is even close to the right thing is to discard all characters above codepoint 127, and even that is very likely nowhere near what your requirements say. (The other possibility is to simplify accented or umlauted letters to make more than 128 characters 'nearly' expressible, but that still doesn't even begin to actually cover Unicode.)
Technically, yes, you can, by using Encoding.ASCII.
Example (Unicode string to ASCII bytes and back to a string):
// Get the UTF-16 bytes of the string
byte[] uni = Encoding.Unicode.GetBytes("Whatever unicode string you have");
// Convert them to ASCII bytes; characters outside ASCII become '?'
byte[] ascii = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, uni);
// Build the ASCII string
string Ascii = Encoding.ASCII.GetString(ascii);
Just remember that Unicode is a much larger standard than ASCII, and there will be characters that simply cannot be correctly encoded. Have a look here for tables and a little more information on the two encodings.
This workaround might better suit your needs. It strips the unicode chars from a string and only keeps the ASCII chars.
byte[] bytes = Encoding.ASCII.GetBytes("eéêëèiïaâäàåcç  test");
char[] chars = Encoding.ASCII.GetChars(bytes);
string line = new String(chars);
line = line.Replace("?", "");
//Results in "eiac test"
Please note that the 2nd "space" in the character input string is the char with ASCII value 255
It depends what you mean by "convert".
You can transliterate using the AnyAscii package.
// C#
using AnyAscii;
string s = "άνθρωποι".Transliterate();
// anthropoi
Well, seeing as how there are some 100,000+ Unicode characters and only 128 ASCII characters, a 1:1 mapping is obviously impossible.
You can use the Encoding.ASCII object to get the ASCII byte values from a Unicode string, though.
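For instance (a small sketch; the default ASCII encoder substitutes '?', byte 0x3F, for anything it can't represent):

using System;
using System.Text;

// "naïve": the ï falls outside ASCII and becomes '?' (0x3F).
byte[] bytes = Encoding.ASCII.GetBytes("na\u00EFve");
Console.WriteLine(BitConverter.ToString(bytes)); // 6E-61-3F-76-65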
If your metadata fields only accept ASCII input, Unicode characters can be converted to their TeX equivalents through MathJax. What is MathJax?
MathJax is a JavaScript display engine for rendering TeX- or MathML-coded mathematics in browsers without requiring font installation or browser plug-ins. Any modern browser with JavaScript enabled will be MathJax-ready. For general information about MathJax, visit mathjax.org.

RichTextBox use to retrieve Text property in C++

I am using a hidden RichTextBox to retrieve the Text property from a RichEditCtrl.
rtb->Text; returns the text portion in either English or national languages – just great!
But I need this text in the form \u12232? \u32232? instead of national characters and symbols, to work with my db and RichEditCtrl. Any idea how to get from “пассажирским поездом Невский” to “\u12415?\u12395?\u23554?\u20219?\u30456?\u35527?\u21729?” (where each national character is represented as “\u23232?”)?
If you have, that would be great.
I am using visual studio 2008 C++ combination of MFC and managed code.
Cheers and have a wonderful weekend
If you need a System::String as an output as well, then something like this would do it:
String^ s = rtb->Text;
StringBuilder^ sb = gcnew StringBuilder(s->Length);
for (int i = 0; i < s->Length; ++i) {
    // Note the escaped backslash: "\u" alone is an invalid escape
    // sequence in C++ source.
    sb->AppendFormat("\\u{0:D5}?", (int)s[i]);
}
String^ result = sb->ToString();
By the way, are you sure the format is as described? \u is a traditional escape sequence for a hexadecimal Unicode codepoint, exactly 4 hex digits long, e.g. \u0F3A. It's also not normally followed by ?. If you actually want that, the format specifier {0:X4} should do the trick.
You don't need to use escaping to put formatted Unicode in a RichText control. You can use UTF-8. See my answer here: Unicode RTF text in RichEdit.
I'm not sure what your restrictions are on your database, but maybe you can use UTF-8 there too.