How can I support UCS-2 for Chinese/Japanese characters in C?

I'm trying to understand how wchar_t works in C by writing a simple console program, and maybe it's not as simple as I first thought.
Here's the problem:
When I try to read a string of characters using wscanf, it works.
// Code 1
wchar_t wstr[10];
wprintf(L"Enter string: ");
wscanf(L"%ls", wstr);
wprintf(L"You have entered: %ls\n", wstr);
Output:
Enter string: 煮
You have entered: 煮
Press any key to continue . . .
Which is to be expected. But if I try to read a single character (wchar_t), it fails:
// Code 2
wchar_t wstr[10];
wprintf(L"Enter character: ");
wscanf(L"%lc", wstr); wstr[1] = L'\0';
wprintf(L"You have entered: %lc\n", wstr[0]);
Output:
Enter character: 煮
You have entered: ・
Press any key to continue . . .
My system locale is currently set to Japanese (Japan). Has anyone experienced this?
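A side note on why this kind of failure is locale-dependent: a CJK character has no one-byte form in any ANSI code page, so the wide-to-multibyte conversion the C runtime performs for console I/O depends entirely on the active locale. The following sketch is Python, not C, and is only meant to illustrate the encoding arithmetic for 煮 in the Japanese code page cp932:

```python
ch = "煮"  # U+716E

# In the Japanese ANSI code page (cp932 / Shift-JIS), kanji take two bytes.
data = ch.encode("cp932")
print(len(data))                   # 2

# The full two-byte sequence round-trips cleanly...
print(data.decode("cp932") == ch)  # True

# ...but the lead byte alone is meaningless, which is why handling the
# character one byte (or one mis-converted unit) at a time yields mojibake.
try:
    data[:1].decode("cp932")
except UnicodeDecodeError:
    print("lead byte alone is not decodable")
```

In the C program, the usual first step is to call setlocale(LC_ALL, "") before any wide I/O; without it, the runtime converts in the minimal "C" locale, which cannot represent this character at all.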

Related

Extract words in Lua split by Unicode spaces and control characters

I'm interested in a pure-Lua (i.e., no external Unicode library) solution to extracting the units of a string between certain Unicode control characters and spaces. The code points I would like to use as delimiters are:
0000-0020
007f-00a0
00ad
1680
2000-200a
2028-2029
202f
205f
3000
I know how to access the code points in a string, for example:
> for i,c in utf8.codes("é$ \tπ😃") do print(c) end
233
36
32
9
960
128515
but I am not sure how to "skip" the spaces and tabs and reconstitute the other codepoints into strings themselves. What I would like to do in the example above, is drop the 32 and 9, then perhaps use utf8.char(233, 36) and utf8.char(960, 128515) to somehow get ["é$", "π😃"].
It seems that putting everything into a table of numbers and painstakingly walking through the table with for-loops and if-statements would work, but is there a better way? I looked into string:gmatch but that seems to require making utf8 sequences out of each of the ranges I want, and it's not clear what that pattern would even look like.
Is there an idiomatic way to extract the strings between the spaces? Or must I manually hack tables of code points? gmatch does not look up to the task; it seems it would require painstakingly generating the utf8 encodings for all code points at each end of the range.
Yes, that is the way. But of course not manually:
local function range(from, to)
assert(utf8.codepoint(from) // 64 == utf8.codepoint(to) // 64)
return from:sub(1,-2).."["..from:sub(-1).."-"..to:sub(-1).."]"
end
local function split_unicode(s)
for w in s
:gsub("[\0-\x1F\x7F]", " ")
:gsub("\u{00a0}", " ")
:gsub("\u{00ad}", " ")
:gsub("\u{1680}", " ")
:gsub(range("\u{2000}", "\u{200a}"), " ")
:gsub(range("\u{2028}", "\u{2029}"), " ")
:gsub("\u{202f}", " ")
:gsub("\u{205f}", " ")
:gsub("\u{3000}", " ")
:gmatch"%S+"
do
print(w)
end
end
Test:
split_unicode("#\0#\t#\x1F#\x7F#\u{00a0}#\u{00ad}#\u{1680}#\u{2000}#\u{2005}#\u{200a}#\u{2028}#\u{2029}#\u{202f}#\u{205f}#\u{3000}#")
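For comparison, here is the same delimiter set expressed as a single character class in Python (a sketch only, assuming exactly the code-point ranges listed in the question):

```python
import re

# One character class covering the delimiter ranges from the question:
# 0000-0020, 007f-00a0, 00ad, 1680, 2000-200a, 2028-2029, 202f, 205f, 3000
DELIMS = re.compile(
    "[\u0000-\u0020\u007f-\u00a0\u00ad\u1680"
    "\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]+"
)

def split_unicode(s):
    # Split on runs of delimiters and drop empty leading/trailing pieces.
    return [w for w in DELIMS.split(s) if w]

print(split_unicode("é$ \tπ😃"))   # ['é$', 'π😃']
```

The same idea (one alternation of delimiter classes, split on runs of it) is what the Lua answer builds by hand with range(), since Lua patterns operate on bytes rather than code points.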

python3.4 under Windows 7 encoding issue - \u2014

I have an em-dash character in my Python code, used to split a line from a certain txt file.
with open(path, 'r') as r:
number = r.readline()
num = number.split(' — ')[1].replace('\n',' — ')
It worked fine under Ubuntu with Python 3.4, but when running the code under Windows 7 (Python 3.4) I get the following error:
num = number.split(' \u2014 ')[1].replace('\n',' \u2014 ') IndexError:
list index out of range
I'm sure that it should work, and it seems that the problem is in encoding.
I will appreciate any help fixing my program. I've tried to set "# -*- coding: utf-8 -*-" without any result.
SOLUTION WAS open(path, mode, encoding='UTF8')
when you do:
num = number.split(' — ')[1].replace('\n',' — ')
you assume that the string number contains a dash and then take the second field ([1]). If number does not contain a dash, then [1] does not exist, only [0] does, and you get the index-out-of-range error.
if ' — ' in number:
num = number.split(' — ')[1].replace('\n',' — ')
else:
num = number.replace('\n',' — ')
Furthermore, as you are now on Windows, you might want to check for '\r\n' as well as '\n', depending on what the file uses as its end-of-line character(s).
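Putting the accepted fix and the guard together, a sketch might look like this (extract_num is an illustrative name, not from the original code; the explicit encoding is what makes Windows decode the em dash the same way Ubuntu does):

```python
def extract_num(line):
    """Return the text after ' — ' if present, rewriting newlines."""
    if ' \u2014 ' in line:                       # U+2014 is the em dash
        return line.split(' \u2014 ')[1].replace('\n', ' \u2014 ')
    return line.replace('\n', ' \u2014 ')

# Open with an explicit encoding so the em dash decodes identically on
# Windows (whose default is a legacy code page like cp1252) and Ubuntu:
# with open(path, 'r', encoding='utf-8') as r:
#     num = extract_num(r.readline())

print(extract_num('12 \u2014 34\n'))   # 34 —
```

Without encoding='utf-8', Python 3 on Windows decodes the file with the locale's ANSI code page, so the three UTF-8 bytes of the em dash never become U+2014 and the split finds nothing.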

How do I parse a pseudo float number in Xtext grammar?

I need to filter a 'reference number' of the form XX.XX, where X is any upper- or lower-case letter or digit (0-9). This is what I have come up with:
SCR_REF:
'Scr_Ref' ':' value=PROFILE
;
terminal PROFILE :
((CHAR|INT)(CHAR|INT)'.'(CHAR|INT)(CHAR|INT))
;
terminal CHAR returns ecore::EString : ('a'..'z'|'A'..'Z');
But this doesn't work in the generated editor. The following test entry:
Scr_Ref: 11.22
throws an error saying:
"no viable alternative at character '.' "
What am I doing wrong?
I think your problem is that you are using the default INT terminal here. Both 11 and 22 are integers by themselves, so the lexer consumes them as INT tokens. You need single digits here, not INT. Below is an example:
grammar org.xtext.example.mydsl.MyDsl with org.eclipse.xtext.common.Terminals
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
import"http://www.eclipse.org/emf/2002/Ecore" as ecore
Model:
greetings+=Greeting*;
Greeting:
'Hello' name=ID '!' 'val=' val=PROFILE;
terminal PROFILE :
((CHAR|DIGIT)(CHAR|DIGIT)'.'(CHAR|DIGIT)(CHAR|DIGIT))
;
terminal DIGIT:
('0'..'9')
;
terminal CHAR returns ecore::EString :
('a'..'z'|'A'..'Z')
;
Hope this helps.
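As a sanity check of the token shape itself (two alphanumerics, a dot, two alphanumerics), the equivalent pattern outside the grammar is a simple regex; sketched here in Python:

```python
import re

# Same shape as the PROFILE terminal:
# (CHAR|DIGIT)(CHAR|DIGIT) '.' (CHAR|DIGIT)(CHAR|DIGIT)
PROFILE = re.compile(r'^[A-Za-z0-9]{2}\.[A-Za-z0-9]{2}$')

print(bool(PROFILE.match('11.22')))   # True
print(bool(PROFILE.match('aB.3x')))   # True
print(bool(PROFILE.match('1.22')))    # False
```

The regex accepts exactly what the corrected terminal should, which makes it a quick way to build test entries for the generated editor.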

Why can't _printf_l print a multibyte string in a Chinese locale?

I am using Windows 7, VS2008 to test following code:
wchar_t *pWCBuffer = L"你好,世界"; // some Chinese character
char *pMBBuffer = (char *)malloc( BUFFER_SIZE );
_locale_t locChinese = _create_locale(LC_CTYPE, "chs");
_wcstombs_l(pMBBuffer, pWCBuffer, BUFFER_SIZE, locChinese );
_printf_l("Multibyte character: %s\n\n", locChinese, pMBBuffer );
I convert a wide string to a multibyte string and then print it out using a Chinese locale, but the printed string is not right; it is something weird like: ─π║├ú¼╩└╜τ
How could I print out the right multi-byte string?
This is not an absolute answer, because Unicode on different platforms can be tricky. But if your Windows 7 is an English version, you might want to try the PowerShell ISE to see the output. I use that to print out Unicode when writing programs in Ruby, too.
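In fact the garbage in the question pins down what happened: the bytes produced for the Chinese locale are correct GBK, but the console renders them in its own OEM code page. Assuming the string uses the fullwidth comma (U+FF0C) and the console uses cp437 (the default on English Windows), this Python sketch reproduces the output byte for byte:

```python
# The multibyte bytes that the wide-to-multibyte conversion produces
# under a Simplified Chinese locale:
data = "你好，世界".encode("gbk")

# An English-Windows console interprets raw bytes in its OEM code page
# (cp437 by default), which yields exactly the garbage from the question:
print(data.decode("cp437"))   # ─π║├ú¼╩└╜τ
```

So the conversion itself succeeds; the fix is to align the console code page with the data (for example, run chcp 936 before the program) or to print the wide string directly with wide-character output.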

How to convert Unicode characters to escape codes

So, I have a bunch of strings like this: {\b\cf12 よろてそ }. I'm thinking I could iterate over each character and replace any Unicode character (Edit: anything where AscW(char) > 127 or < 0) with a Unicode escape code (\u###). However, I'm not sure how to do so programmatically. Any suggestions?
Clarification:
I have a string like {\b\cf12 よろてそ } and I want a string like {\b\cf12 [STUFF]}, where [STUFF] will display as よろてそ when I view the rtf text.
You can simply use the AscW() function to get the correct value:
sRTF = "\u" & CStr(AscW(char))
Note that, unlike other escape schemes for Unicode, RTF uses the decimal signed short int (2-byte) representation of a Unicode character, which makes the conversion in VB6 really quite easy.
Edit
As MarkJ points out in a comment, you would only do this for characters outside 0-127, and you would also need to give some characters inside the 0-127 range special handling as well.
Another, more roundabout way would be to add MSScript.OCX to the project and interface with VBScript's Escape function. For example:
Sub main()
Dim s As String
s = ChrW$(&H3088) & ChrW$(&H308D) & ChrW$(&H3066) & ChrW$(&H305D)
Debug.Print MyEscape(s)
End Sub
Function MyEscape(s As String) As String
Dim scr As Object
Set scr = CreateObject("MSScriptControl.ScriptControl")
scr.Language = "VBScript"
scr.Reset
MyEscape = scr.eval("escape(" & dq(s) & ")")
End Function
Function dq(s)
dq = Chr$(34) & s & Chr$(34)
End Function
The Main routine passes in the original Japanese characters and the debug output says:
%u3088%u308D%u3066%u305D
HTH
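The AscW trick (decimal, signed 16-bit) translates directly to other languages; here is a Python sketch of the same conversion (rtf_escape is an illustrative name, and it assumes BMP characters only, since anything above U+FFFF would need a surrogate pair in RTF):

```python
def rtf_escape(text):
    r"""Escape non-ASCII characters as RTF \uN control words.

    RTF wants the code unit as a *signed* 16-bit decimal, so code
    points at U+8000 and above wrap around to negative numbers.
    """
    out = []
    for ch in text:
        n = ord(ch)
        if 32 <= n <= 126:
            out.append(ch)               # printable ASCII passes through
        else:
            if n > 32767:
                n -= 65536               # emulate VB6's signed AscW()
            out.append("\\u%d?" % n)     # '?' is the fallback character
    return "".join(out)

print(rtf_escape("よろてそ"))   # \u12424?\u12429?\u12390?\u12381?
```

The trailing '?' after each \uN is the substitute that old, non-Unicode RTF readers display; the VB6 snippet above omits it, but most writers include one.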