use wcstombs() to convert wchar_t* (containing Unicode) to MBCS (char*), dependent on locale

My input is a string of Unicode characters, e.g. U+00DB U+0081, held in a wchar_t*. I use wcstombs() to convert this wide-character string into a char* (MBCS). Since I thought Unicode was already encoded in UTF-8, I expected a byte-by-byte copy of the code points, i.e. the bytes DB 81 in the char*. Instead I get C3 9B. This happens on Linux; on Windows I do get DB 81.
I need to open a file whose name is the bytes DB 81 (as shown in a hexdump), but fopen() takes a char* filename, so I have to convert this wchar_t* to MBCS. Please help!

No, what you want to do is not what you think you should do.
fopen() cannot, under any circumstances, handle all possible filenames on Windows, because it lacks Unicode support there; the char* it takes is interpreted in the current ANSI code page, not UTF-8.
Please refer to http://www.utf8everywhere.org to see how to do it with _wfopen().
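For reference, a minimal sketch of what that site recommends (the helper name fopen_utf8_read is mine, not a standard function): keep the path as UTF-8 in a char*, and convert to UTF-16 only at the Windows API boundary; on Linux, pass the UTF-8 bytes straight to fopen().

#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

FILE* fopen_utf8_read(const char* path_utf8) {
    // Ask for the required UTF-16 length (in code units), then convert.
    int len = MultiByteToWideChar(CP_UTF8, 0, path_utf8, -1, nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, path_utf8, -1, &wide[0], len);
    return _wfopen(wide.c_str(), L"rb");  // wide-char filename, full Unicode
}
#else
FILE* fopen_utf8_read(const char* path_utf8) {
    // On Linux a filename is just a byte string: pass the UTF-8 through.
    return fopen(path_utf8, "rb");
}
#endif

With this, the rest of the program can treat filenames as plain UTF-8 char*, which is exactly the utf8everywhere recommendation.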

Related

Change of char encoding in Eclipse

I am working on an assignment where I need to XOR the bits of each char of a given text, for example weird characters like '��'.
When trying to save, Eclipse prompts that "Some characters cannot be mapped with Cp1252...", after which I can choose to save as UTF-8.
My knowledge of character encoding is quite fuzzy; wouldn't saving to UTF-8 change the bits? If so, how may I instead work with the original message (original bits) to XOR them and do my assignment?
Thanks!
I am assuming you are using Java in this answer.
The file encoding only changes how the data is represented in the file. When you read the file back (using the correct encoding), it will be converted back to Unicode in your String, so the program will see the same bits.
The encoding Cp1252 can only represent a small number of characters (fewer than 256), compared to the 113,021 characters in Unicode 7, all of which can be encoded with UTF-8.
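To make the round-trip point concrete, here is a small C++ sketch (the byte values are the standard encodings of U+00E9 'é'): the same character is stored as different bytes under Cp1252 and UTF-8, yet decoding each with its own encoding yields the same code point.

#include <cstdio>

int main() {
    unsigned char cp1252[] = { 0xE9 };        // 'é' in Cp1252: one byte
    unsigned char utf8[]   = { 0xC3, 0xA9 };  // 'é' in UTF-8: two bytes

    // Cp1252 agrees with Latin-1 here: the byte value is the code point.
    unsigned a = cp1252[0];
    // Decode the two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
    unsigned b = ((utf8[0] & 0x1Fu) << 6) | (utf8[1] & 0x3Fu);

    printf("U+%04X U+%04X\n", a, b);  // prints: U+00E9 U+00E9
    return 0;
}

Java does the same decoding for you when you read the file with the correct charset.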

NSString UTF8String mangling unicode characters

When I run [NSString UTF8String] on certain unicode characters the resulting const char* representation is mangled both in NSLog and on the device/simulator display. The NSString itself displays fine but I need to convert the NSString to a cStr to use it in CGContextShowTextAtPoint.
It's very easy to reproduce (see code below) but I've searched for similar questions without any luck. Must be something basic I'm missing.
const char *cStr = [@"章" UTF8String];
NSLog(@"%s", cStr);
Thanks!
CGContextShowTextAtPoint is only for ASCII chars.
Check this SO question for answers.
When using the %s format specifier you cannot be guaranteed that the characters of a C string will print correctly if they are not ASCII. A character like the one you've used is expressed in UTF-8 as a multi-byte sequence, but %s interprets the bytes you provide using the system encoding rather than UTF-8 (in this case, inside NSLog). See Apple's documentation:
https://developer.apple.com/library/mac/documentation/cocoa/Conceptual/Strings/Articles/formatSpecifiers.html
%s
Null-terminated array of 8-bit unsigned characters. %s interprets its input in the system encoding rather than, for example, UTF-8.
Coming back to CGContextShowTextAtPoint not working: that API supports only the MacRoman character set, which is not the entire Unicode character set.
You'll need to look into another API for showing Unicode characters. Core Text is probably where you'll want to start.
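For example, a minimal Core Text sketch (Apple platforms only). It assumes you already have a CGContextRef; Helvetica is an arbitrary choice, since Core Text falls back to other fonts for glyphs the chosen font lacks.

#include <CoreText/CoreText.h>

static void DrawUnicodeText(CGContextRef ctx, CGFloat x, CGFloat y) {
    CFStringRef text = CFSTR("章");
    CTFontRef font = CTFontCreateWithName(CFSTR("Helvetica"), 24.0, NULL);

    // Attach the font to the string, build a line, and draw it.
    CFStringRef keys[] = { kCTFontAttributeName };
    CFTypeRef values[] = { font };
    CFDictionaryRef attrs = CFDictionaryCreate(
        kCFAllocatorDefault, (const void**)keys, (const void**)values, 1,
        &kCFTypeDictionaryKeyCallBacks, &kCFTypeDictionaryValueCallBacks);
    CFAttributedStringRef attrString =
        CFAttributedStringCreate(kCFAllocatorDefault, text, attrs);

    CTLineRef line = CTLineCreateWithAttributedString(attrString);
    CGContextSetTextPosition(ctx, x, y);
    CTLineDraw(line, ctx);

    CFRelease(line);
    CFRelease(attrString);
    CFRelease(attrs);
    CFRelease(font);
}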
I've never noticed this issue before, but some quick experimentation shows that using printf instead of NSLog will cause the correct Unicode character to show up.
Try:
printf("%s", cStr);
This gives me the desired output ("章") both in the Xcode console and in Terminal. As nob1984 stated in his answer, the interpretation of the character data is up to the callee.

How did SourceForge maim this Unicode character?

A little encoding puzzle for you.
A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.
In the XML export, however, it shows up as:
&#226;&#8364;&#8221;
Decoding the entities, that results in these code points:
U+00E2 U+20AC U+201D
I.e. the characters â€”. The XML should have been &#8212;, the decimal character reference for 0x2014, so this is probably a bug in the SF.net exporter.
Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?
The XML output has incorrectly been encoded using CP1252. To revert this, convert â€” to bytes using CP1252 encoding and then convert those bytes back to a string/char using UTF-8 encoding.
Java based evidence:
String s = "â€”";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —
Note that this assumes that the stdout console itself uses UTF-8 to display the character.
In .NET, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("â€”")) returns "—".
SourceForge converted it to UTF-8, interpreted each of the resulting bytes as a character in CP1252, then saved those characters as three separate entities using the actual Unicode code points of those characters.
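For completeness, the same reversal in C++ using POSIX iconv. This is a sketch: it assumes your iconv accepts the encoding name "CP1252" (some platforms spell it "WINDOWS-1252"), and error handling is omitted.

#include <iconv.h>
#include <cstdio>
#include <cstring>

int main() {
    char in[] = "\xC3\xA2\xE2\x82\xAC\xE2\x80\x9D";  // UTF-8 for â € ”
    char out[16] = {0};
    char* inp = in;
    char* outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    iconv_t cd = iconv_open("CP1252", "UTF-8");  // to CP1252, from UTF-8
    iconv(cd, &inp, &inleft, &outp, &outleft);   // error handling omitted
    iconv_close(cd);

    // out now holds the bytes E2 80 94, which is the UTF-8 encoding of
    // U+2014 EM DASH, so printing it on a UTF-8 terminal shows "—".
    printf("%s\n", out);
    return 0;
}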

How to convert from unicode to ASCII

Is there any way to convert unicode values to ASCII?
To simply strip the accents from unicode characters you can use something like:
string.Concat(input.Normalize(NormalizationForm.FormD)
    .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark));
You CAN'T convert from Unicode to ASCII. Almost every character in Unicode cannot be expressed in ASCII, and those that can be expressed have exactly the same code points in ASCII as in UTF-8, which is probably what you have. Almost the only thing you can do that is even close to the right thing is to discard all characters above code point 127 (see the sketch below), and even that is very likely nowhere near what your requirements say. (The other possibility is to simplify accented or umlauted letters to make more than 128 characters 'nearly' expressible, but that still doesn't even begin to actually cover Unicode.)
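A sketch of that "discard" option, in C++ for illustration: it works on UTF-8 input because every byte of a multi-byte UTF-8 sequence has its high bit set, so dropping bytes at or above 0x80 drops exactly the non-ASCII characters.

#include <string>

std::string keep_ascii(const std::string& utf8) {
    std::string out;
    for (unsigned char c : utf8)
        if (c < 0x80) out += static_cast<char>(c);  // keep pure ASCII bytes
    return out;
}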
Technically, yes you can, by using Encoding.ASCII, as long as you accept that every character without an ASCII equivalent is replaced by '?'.
Example (Unicode string to ASCII):
// Get the Unicode (UTF-16) bytes of the string
byte[] uni = Encoding.Unicode.GetBytes("Whatever unicode string you have");
// Re-encode those bytes as ASCII; unmappable characters become '?'
byte[] asciiBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, uni);
string Ascii = Encoding.ASCII.GetString(asciiBytes);
Just remember that Unicode is a much larger standard than ASCII and there will be characters that simply cannot be correctly encoded. Have a look here for tables and a little more information on the two encodings.
This workaround might better suit your needs. It strips the unicode chars from a string and only keeps the ASCII chars.
byte[] bytes = Encoding.ASCII.GetBytes("eéêëèiïaâäàåcç  test");
char[] chars = Encoding.ASCII.GetChars(bytes);
string line = new String(chars);
line = line.Replace("?", "");
//Results in "eiac test"
Please note that the 2nd "space" in the character input string is the character with code 255 (not an ASCII value), which Encoding.ASCII also turns into '?'
It depends what you mean by "convert".
You can transliterate using the AnyAscii package.
// C#
using AnyAscii;
string s = "άνθρωποι".Transliterate();
// anthropoi
Well, seeing as how there are some 100,000+ Unicode characters and only 128 ASCII characters, a 1:1 mapping is obviously impossible.
You can use the Encoding.ASCII object to get the ASCII byte values from a Unicode string, though.
If your metadata fields only accept ASCII input, Unicode characters can be converted to their TeX equivalents through MathJax. What is MathJax?
MathJax is a JavaScript display engine for rendering TeX- or MathML-coded mathematics in browsers without requiring font installation or browser plug-ins. Any modern browser with JavaScript enabled will be MathJax-ready. For general information about MathJax, visit mathjax.org.

CMemFile and Unicode

Am I right in thinking that the MFC class CMemFile cannot be used to write Unicode data, because it uses BYTE*, which is defined as unsigned char?
The line that actually writes the data in CMemFile::Write is
memcpy((BYTE*)m_lpBuffer + m_nPosition, (BYTE*)lpBuf, nCount);
If so, can I replace BYTE with wchar_t in my own implementation of CMemFile to get it working with Unicode?
Thank you,
Paul
I don't see why it couldn't be used directly.
The only issue is that when you're doing memory copying, you can't interchange the character count with the byte count.
Files are binary so always read/write bytes and use an encoding layer to convert to/from string unless you are sure the data is in ASCII encoding.
No, you need an encoder/decoder. For Unicode output you typically write a header (the byte order mark) followed by the encoded characters. The exact byte values of the encoded characters differ depending on the Unicode encoding used (UTF-7, UTF-8, UTF-16, UTF-32, etc.).
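For example, a minimal sketch of such an encoding layer for CMemFile (Windows/MFC assumed; WideCharToMultiByte is the Win32 conversion routine, and 0xEF 0xBB 0xBF is the standard UTF-8 signature):

#include <afx.h>   // MFC: CMemFile, CStringW; pulls in the Win32 API too
#include <vector>

void WriteUtf8(CMemFile& file, const CStringW& text, bool writeBom) {
    if (writeBom) {
        const BYTE bom[] = { 0xEF, 0xBB, 0xBF };  // UTF-8 signature
        file.Write(bom, sizeof(bom));
    }
    // First call asks how many bytes the UTF-8 form needs; second converts.
    int bytes = WideCharToMultiByte(CP_UTF8, 0, text, text.GetLength(),
                                    NULL, 0, NULL, NULL);
    std::vector<char> buf(bytes);
    WideCharToMultiByte(CP_UTF8, 0, text, text.GetLength(),
                        buf.data(), bytes, NULL, NULL);
    file.Write(buf.data(), (UINT)bytes);  // byte count, not character count
}

Reading back is the mirror image: skip the BOM if present, then convert the remaining bytes with MultiByteToWideChar.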