CString m_pszData doesn't match converted char* in a UNICODE build

I tested the Unicode conversion with a UNICODE MFC dialog app, where I can input some Chinese in the edit box. After reading in the characters using
DDX_Text(pDX, IDC_EDIT1, m_strUnicode)
UpdateData(TRUE)
the m_pszData of m_strUnicode shows "e0 65 2d 4e 1f 75 09 67". Then I used the following code to convert it to char*:
char *psText = new char[dwMinSize];
WideCharToMultiByte(CP_OEMCP, 0, m_strUnicode, -1, psText,
                    dwMinSize, NULL, NULL);
The psText contains "ce de d6 d0 c9 fa d3 d0", nothing like the m_pszData of m_strUnicode. Would anyone please explain why that is?

"ce de d6 d0 c9 fa d3 d0" is 无中生有 in GBK. Are you sure you're manipulating Unicode?
CP_OEMCP instructs the API to use the currently set default OEM code page, so my guess is that you're on a Chinese PC with GBK as the default code page.
无中生有 in UTF-16LE is "e0 65 2d 4e 1f 75 09 67", so basically you are converting a UTF-16LE string to GBK.
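The round trip can be checked in a couple of lines; here is a minimal sketch in Python (my own verification, not part of the original answer), using the byte sequences quoted above:

```python
s = "无中生有"

# UTF-16LE is what a CString holds in a UNICODE build
assert s.encode("utf-16-le").hex(" ") == "e0 65 2d 4e 1f 75 09 67"

# GBK is what WideCharToMultiByte(CP_OEMCP, ...) produces when the
# default OEM code page is the Chinese one
assert s.encode("gbk").hex(" ") == "ce de d6 d0 c9 fa d3 d0"
```

Both assertions pass, confirming the two dumps are the same four characters in two different encodings.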


Random symbols in Source window instead of Russian characters in RStudio

I have been googling and stackoverflowing (yes, that is a word now) on how to fix a wrong-encoding problem, but I could not find a solution.
I am trying to load an .Rmd file with UTF-8 encoding which has Russian characters in it. They do not display properly; instead, the code lines in the Source window come out garbled.
Initially, I created this .Rmd file long ago on my previous laptop. Now, I am using another one and I cannot spot the issue here.
I have already tried to use some Sys.setlocale() commands with no success whatsoever.
I run RStudio on Windows 10.
Edited
This is the output of readBin('raw[1].Rmd', raw(), 10000), sliced from 2075 to 2211:
[2075] 64 31 32 2c 20 71 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2 80 93 d0 a0 d0 8e d0 a0 d1 99
[2109] d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 32 6d 24 71 68 35 20 3d 20 4e 55 4c 4c 0d 0a 64 31 35 6d
[2143] 20 3d 20 66 69 6c 74 65 72 28 64 31 35 2c 20 74 68 35 20 3d 3d 20 22 d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2
[2177] 80 93 d0 a0 d0 8e d0 a0 d1 99 d0 a0 d1 9b d0 a0 e2 84 a2 22 29 3b 20 64 31 35 6d 24 74 68 35 20 3d 20
Thank you.
Windows doesn't have very good support for UTF-8, so your locale encoding is likely something else.
RStudio normally reads files using the system encoding. If that is wrong, you can use "File | Reopen with encoding..." to re-open the file using a different encoding.
Edited to add:
The first line of the sample output looks like UTF-8 encoding with some Cyrillic letters, but not Russian-language text. I decode it as "d12, qh5 == \"РњРЈР–РЎРљ". Is that what RStudio gave you when you re-opened the file, declaring it as UTF-8?
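The "РњРЈ…" pattern is the classic signature of UTF-8 text that was at some point decoded as CP1251 and re-encoded as UTF-8. A speculative sketch in Python (the CP1251 round trip is my assumption, not something stated in the thread), using the Cyrillic bytes from the dump above:

```python
# The Cyrillic run from the dump (UTF-8 bytes of the garbled text)
moji = bytes.fromhex(
    "d0 a0 d1 9a d0 a0 d0 88 d0 a0 e2 80 93"
    "d0 a0 d0 8e d0 a0 d1 99 d0 a0 d1 9b d0 a0 e2 84 a2"
)
garbled = moji.decode("utf-8")                  # what RStudio displays
fixed = garbled.encode("cp1251").decode("utf-8")  # undo the double encoding
print(fixed)  # МУЖСКОЙ
```

If this reconstruction is right, the original file contained the Russian word МУЖСКОЙ at that offset, and re-saving the file with an explicit UTF-8 encoding declaration should fix the display.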

Mixing UTF-8 with UTF-16

I'm currently working on a Korean program which should be translated into Chinese. What I found strange is that the application seems to mix UTF-8 and UTF-16 characters.
Let's say we have a string that goes:
"게임을 정말로 종료하시겠습니까"
8C AC 84 C7 44 C7 20 00 15 C8 D0 B9 5C B8 20 00
85 C8 CC B8 58 D5 DC C2 A0 AC B5 C2 C8 B2 4C AE 00
But it's stored as
B0 D4 C0 D3 C0 BB 20 C1 A4 B8 BB B7 CE 20 C1 BE
B7 E1 C7 CF BD C3 B0 DA BD C0 B4 CF B1 EE 3F 00
just to prevent zeros. I'd like to know if it's some kind of encryption, or just a normal method used by compilers to prevent an end-of-string byte in the middle of the string? Because the final result is the first string that I've mentioned. Any reading would be strongly appreciated.
A string must be either UTF-8 or UTF-16 (or some other encoding); mixing encodings in one string is an error. However, it is very common to pass strings around as UTF-8 and only convert them to UTF-16 when needed by a Windows function. There are several reasons for this; Basile Starynkevitch has provided a link.
If you need routines to read UTF-8, I've got some here.
https://github.com/MalcolmMcLean/babyx/blob/master/src/common/BBX_Font.c
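For what it's worth, the "stored" byte sequence matches the EUC-KR (CP949) encoding of the same Korean string, so it is neither encryption nor mixed encodings - just the legacy Korean code page. A quick check in Python (my own verification, not part of the original answer):

```python
s = "게임을 정말로 종료하시겠습니까"

# UTF-16LE matches the first dump: 8C AC 84 C7 44 C7 20 00 ...
assert s.encode("utf-16-le")[:8] == bytes.fromhex("8c ac 84 c7 44 c7 20 00")

# EUC-KR matches the "stored" dump: B0 D4 C0 D3 C0 BB 20 ...
assert s.encode("euc-kr")[:7] == bytes.fromhex("b0 d4 c0 d3 c0 bb 20")
```

The program most likely stores strings in the EUC-KR code page and converts them to UTF-16 for display.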

How should the hex data of a .wav file be read?

I was trying to read a .wav file with java and apparently the problem is that I don't understand how the Hex figures are supposed to be read.
Here are the first lines of DATA from a .wav (32 Bits per sample, 2 Channels) on a Hex editor :
64 61 74 61 00 1A 01 00 data.....
1D F6 FB 3D 84 DF FB 3D öû=„ßû=
4B 03 03 3E 4B 03 03 3E K..>K..>
D5 F8 08 3E D5 F8 08 3E Õø.>Õø.>
C6 48 0F 3E C6 48 0F 3E ÆH.>ÆH.>
So here is what I thought: the first value from the first channel should be read as 3D FB F6 1D, which would mean 1039922717.
So I take that number, subtract 2^31, and get -1107560931, and that would be the first value. But when I compare this to the value I get from MATLAB's audioread, the first value is 264200656. Why?
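No answer is quoted in the thread, but one likely explanation: a 32-bits-per-sample WAV usually stores IEEE-754 float samples (WAVE format code 3), so each 4-byte group is a little-endian float, not an unsigned integer, and subtracting 2^31 is not the right transformation. A sketch in Python under that assumption:

```python
import struct

# First 4 bytes of channel 1 from the hex dump above
frame = bytes.fromhex("1d f6 fb 3d")

# Interpreted as a little-endian uint32 (what the asker computed)
(as_int,) = struct.unpack("<I", frame)
assert as_int == 1039922717

# Interpreted as a little-endian IEEE-754 float32
(val,) = struct.unpack("<f", frame)
print(round(val, 4))  # 0.123
```

A value of about 0.123 is a plausible normalized audio sample, which is what audioread returns by default (as doubles in [-1, 1]).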

SonarQube 6.3 LDAP/SSO UTF-8 encoding

We're using LDAP/SSO in my company, which provides the username to SonarQube in UTF-8 format.
However, LDAP/SSO sends the username in UTF-8 while SonarQube requires Latin-1/ISO-8859-1, and there is no way to change the encoding on either LDAP/SSO or SonarQube.
Result wrong umlauts:
Andrü Tingö = Andr«Ã Ting¼Ã OR äëüö = äëüÃ
Is there any workaround?
I wanted to post this as a comment, but I need 50 reputation to write comments.
We are using simplesamlphp for SSO as IdP and SP. The IdP takes cn, givenName and sn from LDAP, which has UTF-8 values. Login names/usernames are US-ASCII only.
When the user gets to Sonar, the non-US-ASCII characters are incorrect - they were converted from ... to UTF-8, even though they already are UTF-8.
If I use the attributes from the IdP in PHP, which serves the page as UTF-8, the characters are correct.
I just ran a test. In our Apache config we set X-Forwarded-Name to the MCAC_ATTR_CN attribute that the SP gets from the IdP. The original configuration is:
RequestHeader set X-Forwarded-Name "expr=%{reqenv:MCAC_ATTR_CN}"
Now I have added fixed string in UTF-8:
RequestHeader set X-Forwarded-Name "expr=%{reqenv:MCAC_ATTR_CN} cäëöüc"
The "c" characters are only separators to see the encoded text better.
The hexdump of this configuration line is:
0000750: 0909 0952 6571 7565 7374 4865 61        ...RequestHea
0000760: 6465 7220 7365 7420 582d 466f 7277 6172 der set X-Forwar
0000770: 6465 642d 4e61 6d65 2022 6578 7072 3d25 ded-Name "expr=%
0000780: 7b72 6571 656e 763a 4d43 4143 5f41 5454 {reqenv:MCAC_ATT
0000790: 525f 434e 7d20 63c3 a4c3 abc3 b6c3 bc63 R_CN} c........c
00007a0: 220a ".
As you can see, there are fixed utf-8 characters "ä" c3a4 "ë" c3ab "ö" c3b6 "ü" c3bc.
From LDAP comes the following name:
xxxxxx xxxxx xxxx äëüö
In the Apache config, " cäëöüc" is appended, so the resulting name should be:
xxxxxx xxxxx xxxx äëüö cäëöüc
But in Sonar, the name is displayed as
xxxxxx xxxxx xxxx äëüö cäëöüc
You get a similar result if you convert the following text:
xxxxxx xxxxx xxxx äëüö cäëöüc
from ISO-8859-2 to UTF-8:
echo "xxxxxx xxxxx xxxx äëüö cäëöüc" | iconv -f iso-8859-2 -t utf-8
xxxxxx xxxxx xxxx äÍßÜ cäÍÜßc
The "¤" character is the UTF-8 sequence c2 a4:
00000000: c2a4 0a ...
I ran tcpdump on loopback to capture the communication from the Apache proxy module to SonarQube, and even there you can see the correct UTF-8 characters c3a4 c3ab c3bc c3b6 coming from the IdP, and then, between the "c"s, c3a4 c3ab c3b6 c3bc coming directly from Apache.
00000000 47 45 54 20 2f 61 63 63 6f 75 6e 74 20 48 54 54 GET /acc ount HTT
...
00000390 58 2d 46 6f 72 77 61 72 64 65 64 2d 4e 61 6d 65 X-Forwar ded-Name
000003A0 3a 20 72 6f 62 65 72 74 20 74 65 73 74 32 20 77 : xxxxxx xxxxx x
000003B0 6f 6c 66 20 c3 a4 c3 ab c3 bc c3 b6 20 63 c3 a4 xxx .... .... c..
000003C0 c3 ab c3 b6 c3 bc 63 0d 0a ......c. .
...
The system has locales set to en_US.UTF-8, if this matters.
So Sonar really does get UTF-8 text from Apache (both from the direct config and from the IdP), but then something apparently treats this UTF-8 text as if it were ISO-8859, converts it to UTF-8 again, and produces nonsense.
Do you have any idea now? Could this be something in Sonar, or in the wrapper, or some option set incorrectly somewhere?
Regards,
Robert.
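The symptom described in this thread - UTF-8 bytes reinterpreted as ISO-8859-1 and encoded to UTF-8 a second time - can be reproduced in a couple of lines. A minimal sketch in Python (an editor's illustration of the failure mode, not from the thread itself):

```python
s = "äëüö"

# Correct UTF-8 bytes on the wire, as seen in the tcpdump: c3a4 c3ab c3bc c3b6
wire = s.encode("utf-8")
assert wire.hex(" ") == "c3 a4 c3 ab c3 bc c3 b6"

# Double encoding: the receiver treats the UTF-8 bytes as ISO-8859-1,
# then the resulting string gets encoded to UTF-8 a second time
garbled = wire.decode("iso-8859-1")
print(garbled)  # Ã¤Ã«Ã¼Ã¶
```

Each original character becomes two characters, which is why the displayed names grow and sprout characters like "Ã" and "¤".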

Matlab reading endian-incorrect binary data input / interpreting as uint32

While writing this post, I attempted b = fread(s, 1, 'uint32').
This would work great, but my poor data is sent LSB first (no, I cannot change this).
Before, I was using b = fread(s, 4)', which gives me a vector similar to [47 54 234 0].
Here is my input stream:
0A
0D 39 EA 00 04 39 EA 00
4B 39 EA 00 D0 38 EA 00
0A
etc...
I can successfully delimit by 0x0A by
while ~isequal(fread(s, 1), 10) end
Basically I need to get the array of uint32s represented by [00EA390D 00EA3904 00EA394B 00EA38D0]
The documentation for swapbytes doesn't help me much and the uint32 operator operates on individual elements!!
The MATLAB fread function directly supports little-endian machine format. Just set the fifth argument of fread to 'l' (or 'ieee-le'):
b = fread(s, 4, 'uint32',0,'l');
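The same LSB-first reassembly can be sanity-checked outside MATLAB; a quick sketch in Python using the sample stream above, where struct's "<" format does the little-endian interpretation (my illustration, not part of the answer):

```python
import struct

# One 16-byte record between the 0x0A delimiters
payload = bytes.fromhex("0d 39 ea 00 04 39 ea 00 4b 39 ea 00 d0 38 ea 00")

# "<4I" = four little-endian (LSB-first) unsigned 32-bit integers
vals = struct.unpack("<4I", payload)
print([f"{v:08X}" for v in vals])  # ['00EA390D', '00EA3904', '00EA394B', '00EA38D0']
```

This matches the target array [00EA390D 00EA3904 00EA394B 00EA38D0], confirming that a plain little-endian uint32 read is all that is needed.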