Special character in Borland C++Builder - Unicode

I just want to use the Delta sign "Δ" in Borland C++Builder 5, for example in a label:
Label1->Caption = "delta sign here?";
Thanks.

C++Builder 5 uses an ANSI-based VCL and ANSI-based Win32 API calls, where the ANSI encoding is dictated by the active user's locale settings in Windows.
If your app is running on a Greek machine that uses a Greek ANSI codepage as its native locale (typically Windows-1253; ISO-8859-7 is codepage 28597), or at least has Greek fonts installed, you should be able to set Label1->Font->Charset to GREEK_CHARSET (161) and Label1->Font->Name to a Greek font, and then assign the Delta character like this:
// using an implicit conversion from Unicode
// to ANSI on a Greek-locale machine...
Label1->Caption = L"Δ";
Label1->Caption = L"\x0394";
Label1->Caption = (wchar_t) 0x0394;
Label1->Caption = (wchar_t) 916;
Or:
// using an explicit Greek ANSI codeunit
// on a Greek font machine...
Label1->Caption = (char) 0xC4;
Label1->Caption = (char) 196;
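For completeness, a minimal sketch of the font setup described above ("Arial" is only an assumed face; any font with Greek glyphs will do):
Label1->Font->Charset = GREEK_CHARSET; // 161
Label1->Font->Name = "Arial";
Label1->Caption = (char) 0xC4; // Δ in the Greek ANSI codepage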
However, if you need to display the Delta character on a non-Greek machine, or at least one that does not have any Greek fonts installed, you will have to use a third-party Unicode-enabled Label component, such as one from the old TntWare component suite, so that you can use the Unicode codepoint U+0394 directly, e.g.:
TntLabel1->Caption = L"Δ";
TntLabel1->Caption = L"\x0394";
TntLabel1->Caption = (wchar_t) 0x0394;
TntLabel1->Caption = (wchar_t) 916;

If you are on Windows:
EDIT: Try ALT + 30. It inserts ▲ (a black up-pointing triangle, which merely resembles Delta and is not the Greek letter): ▲▲▲▲

Related

Filtering out all non-kanji characters in a text with Python 3

I have a text containing Latin letters and Japanese characters (hiragana, katakana & kanji).
I want to filter out all Latin characters, hiragana, and katakana, but I am not sure how to do this in an elegant way.
My direct approach would be to just filter out every single letter of the Latin alphabet in addition to every single hiragana/katakana character, but I am sure there is a better way.
I am guessing that I have to use regex, but I am not quite sure how to go about it. Are characters somehow classified as Roman letters, Japanese, Chinese, etc.?
If yes, could I somehow use this?
Here some sample text:
"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"
The program should return only the kanji (Chinese characters), like this:
`私、人,方,皆`
I found the answer thanks to Olsgaarddk on reddit.
https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py
# -*- coding: utf-8 -*-
import re
''' This is a library of functions and variables that are helpful to have handy
when manipulating Japanese text in python.
This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
All rights reserved.
Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''
## UNICODE BLOCKS ##
# Regular expression unicode blocks collected from
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
kanji = r'[㐀-䶵一-鿋豈-頻]'
radicals = r'[⺀-⿕]'
katakana_half_width = r'[⦅-゚]'
alphanum_full = r'[!-~]'
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'
## FUNCTIONS ##
def extract_unicode_block(unicode_block, string):
    ''' extracts and returns all texts from a unicode block from string argument.
    Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.findall(unicode_block, string)

def remove_unicode_block(unicode_block, string):
    ''' removes all characters from a unicode block and returns all remaining texts from string argument.
    Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.sub(unicode_block, '', string)
## EXAMPLES ##
text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕⦅゚abc!~、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'
print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))

How to get a Unicode char from the IME window in VB6?

I have a special case where the user first types through an IME by pressing an alphabetic key on my Grid UserControl. How do I pick up the Unicode character from the IME window? If the user types in English, it is OK. But if the user types Chinese or Japanese through the IME, the Unicode characters turn into question marks.
Select Case uMsg
    Case WM_IME_SETCONTEXT
        If Not wParam = 0 Then
            Dim flag As Boolean
            flag = ImmAssociateContextEx(lng_hWnd, 0, 16)
            If flag Then
                Dim IntPtr As Long
                IntPtr = ImmGetContext(lng_hWnd)
                flag = ImmSetOpenStatus(IntPtr, True)
            End If
        End If
    Case WM_IME_STARTCOMPOSITION
        Dim hIMC As Long
        hIMC = ImmGetContext(lng_hWnd)
        Dim cf As COMPOSITIONFORM
        cf.dwStyle = 2
        cf.ptCurrentPos.X = UserControl1.ScaleLeft + 3
        cf.ptCurrentPos.Y = UserControl1.ScaleTop + UserControl1.Height - 16
        ImmSetCompositionWindow hIMC, cf
    Case WM_IME_CHAR
        'Send the IME character on to UserControl1.KeyPress
        UserControl1_KeyPress (wParam And &HFFFF&)
        Exit Sub
End Select
After switching to a different subclasser (Krool's), I now get the right Unicode. I am not sure why Paul Caton's and LaVolpe's cSelfSubHookCallBack does not work.
That subclasser may internally convert Unicode to ANSI, or fail to prevent Windows from performing its Unicode-to-ANSI message conversion.
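A plausible (unverified) explanation: a window hooked through the ...A APIs is treated as ANSI, so Windows converts WM_IME_CHAR to the active codepage before the VB6 proc ever sees it. A minimal sketch of installing the hook with the W variants instead, in a standard module (HookUnicode and WndProc are hypothetical names; untested):
Private Declare Function SetWindowLongW Lib "user32" _
    (ByVal hWnd As Long, ByVal nIndex As Long, ByVal dwNewLong As Long) As Long
Private Declare Function CallWindowProcW Lib "user32" _
    (ByVal lpPrevWndFunc As Long, ByVal hWnd As Long, ByVal uMsg As Long, _
    ByVal wParam As Long, ByVal lParam As Long) As Long

Private Const GWL_WNDPROC As Long = -4
Private Const WM_IME_CHAR As Long = &H286
Private m_PrevProc As Long

Public Sub HookUnicode(ByVal lng_hWnd As Long)
    'Subclassing with the W API marks the window proc as Unicode
    m_PrevProc = SetWindowLongW(lng_hWnd, GWL_WNDPROC, AddressOf WndProc)
End Sub

Public Function WndProc(ByVal hWnd As Long, ByVal uMsg As Long, _
    ByVal wParam As Long, ByVal lParam As Long) As Long
    If uMsg = WM_IME_CHAR Then
        'wParam is a UTF-16 code unit here, not an ANSI byte pair
    End If
    WndProc = CallWindowProcW(m_PrevProc, hWnd, uMsg, wParam, lParam)
End Function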

Why _printf_l can't print a multibyte string in a Chinese locale

I am using Windows 7 and VS2008 to test the following code:
wchar_t *pWCBuffer = L"你好,世界"; // some Chinese characters
char *pMBBuffer = (char *)malloc( BUFFER_SIZE );
_locale_t locChinese = _create_locale(LC_CTYPE, "chs");
_wcstombs_l(pMBBuffer, pWCBuffer, BUFFER_SIZE, locChinese );
_printf_l("Multibyte character: %s\n\n", locChinese, pMBBuffer );
I convert a wide string to a multibyte string and then print it out using the Chinese locale, but the printed string is not right; it comes out as something weird like: ─π║├ú¼╩└╜τ
How can I print the multibyte string correctly?
This is not an absolute answer, because Unicode handling differs between platforms. But if your Windows 7 is an English version, you might want to try the PowerShell ISE to see the output. I use that to print Unicode when writing programs in Ruby, too.
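For what it's worth, the conversion itself probably succeeds; the garbage appears because the console renders the CP936 bytes under its own OEM codepage (437 on US systems), which maps them to line-drawing characters. A hedged alternative (assuming the VS2008 CRT) is to skip the multibyte conversion and put stdout into UTF-16 mode, so the wide string reaches the console directly; whether the glyphs render still depends on the console font:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    // Switch stdout to UTF-16 mode; do not mix narrow printf()
    // calls on the same stream afterwards.
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"Wide string: %s\n", L"你好,世界"); // %s is wide in MSVC's wprintf
    return 0;
}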

Should there be something like 'bytelen' (along with 'strlen')?

In my opinion the 'strlen' function should only return the number of characters in a string. Nothing else. And it does, whether it counts ASCII characters or Unicode characters. A character is a character, pointing to a given position on an ASCII table or a UTF-8 table. Nothing more.
If you would like to know, for whatever reason, the byte length of a string, then you should use a different function. I am a newbie in PHP scripting, so I have not found that function yet. (Should it be something like 'bytelen()'?)
mb_strlen() does what you're after.
Yes, that would be the most logical design. However, PHP was not planned to support multibyte charsets from the beginning. Instead, it has been evolving over the years in a somewhat chaotic manner. You've tagged your question as PHP 4, but PHP 5 does not have decent Unicode support yet either (and I don't think that will change in the near future).
There are a few reasons for this anyway:
PHP is not a closed-source commercial product owned by a company, with a centralized design controlled by enterprise rules.
PHP was released in 1995 as a personal project by someone who needed some functionality on his static home page: at that time, it had no need for Unicode support.
If you modify core functions like strlen(), you must do it in a way that doesn't break previous functionality. That's not easy. Writing a new separate function is much easier.
Update
Sorry, I forgot the second part of your question. If you need to handle Unicode strings you have to use a separate set of functions:
http://es.php.net/manual/en/book.mbstring.php
You might also find these chapters interesting:
http://es.php.net/manual/en/book.iconv.php
http://es.php.net/manual/en/book.unicode.php
Please take note of the PHP version required by each function you are planning to use; PHP 4 is pretty old.
If I'm not grossly misunderstanding you, then strlen() is your 'bytelen()', as alluded to in the other responses here.
strlen() itself has no support for utf-8 or other multi-byte character sets; if you want a proper strlen(), you'll need mb_strlen().
Pentium10's function strBytes($str), from glancing over it (not testing) looks like it would be a good alternative if you know your encoding is utf-8 and you're stuck with a super low version of PHP4 for some reason.
(And I do recommend taking a look at Álvaro G. Vicario's post for the reasons behind this behaviour. Proper, native UTF-8 support is due to come with PHP6.)
/**
 * Count the number of bytes of a given string.
 * Input string is expected to be ASCII or UTF-8 encoded.
 * Warning: the function doesn't return the number of chars
 * in the string, but the number of bytes.
 *
 * @param string $str The string to compute number of bytes
 *
 * @return The length in bytes of the given string.
 */
function strBytes($str)
{
    // STRINGS ARE EXPECTED TO BE IN ASCII OR UTF-8 FORMAT
    // Total number of bytes in the string
    $strlen_var = strlen($str);
    // current byte position, which doubles as the byte counter
    $d = 0;
    /*
     * Walk the string one UTF-8 sequence at a time, advancing
     * by the sequence length encoded in each lead byte
     */
    while ($d < $strlen_var) {
        $ord_var_c = ord($str[$d]);
        switch (true) {
            case ($ord_var_c <= 0x7F):
                // characters U-00000000 - U-0000007F (same as ASCII)
                $d++;
                break;
            case (($ord_var_c & 0xE0) == 0xC0):
                // characters U-00000080 - U-000007FF, mask 110XXXXX
                // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                $d += 2;
                break;
            case (($ord_var_c & 0xF0) == 0xE0):
                // characters U-00000800 - U-0000FFFF, mask 1110XXXX
                // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                $d += 3;
                break;
            case (($ord_var_c & 0xF8) == 0xF0):
                // characters U-00010000 - U-001FFFFF, mask 11110XXX
                // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                $d += 4;
                break;
            case (($ord_var_c & 0xFC) == 0xF8):
                // characters U-00200000 - U-03FFFFFF, mask 111110XX
                // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                $d += 5;
                break;
            case (($ord_var_c & 0xFE) == 0xFC):
                // characters U-04000000 - U-7FFFFFFF, mask 1111110X
                // see http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
                $d += 6;
                break;
            default:
                // invalid lead byte: count it as a single byte
                $d++;
        }
    }
    return $d;
}
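A quick sanity check (assuming the script file itself is saved as UTF-8):
$s = "Δ日本"; // one 2-byte plus two 3-byte UTF-8 sequences
echo strBytes($s); // 8
echo mb_strlen($s, 'UTF-8'); // 3 -- characters, not bytes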

How to convert Unicode characters to escape codes

So, I have a bunch of strings like this: {\b\cf12 よろてそ } . I'm thinking I could iterate over each character and replace any Unicode character (Edit: anything where AscW(char) > 127 or < 0) with a Unicode escape code (\u###). However, I'm not sure how to do that programmatically. Any suggestions?
Clarification:
I have a string like {\b\cf12 よろてそ } and I want a string like {\b\cf12 [STUFF]}, where [STUFF] will display as よろてそ when I view the RTF text.
You can simply use the AscW() function to get the correct value:
sRTF = "\u" & CStr(AscW(char))
Note that, unlike other escape formats for Unicode, RTF uses the decimal value of the signed 16-bit representation of a Unicode character. This makes the conversion in VB6 quite easy, because AscW() already returns a signed 16-bit Integer, which is exactly the form that RTF's \uN control word expects.
Edit
As MarkJ points out in a comment, you would only do this for characters outside of 0-127, but you would also need to give some characters inside the 0-127 range (RTF's own special characters, such as \, { and }) special handling as well.
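Putting it together, a minimal sketch of the loop the question describes (EscapeRtf is a hypothetical name; untested against a real RTF reader):
Function EscapeRtf(ByVal s As String) As String
    Dim i As Long, code As Integer, out As String
    For i = 1 To Len(s)
        code = AscW(Mid$(s, i, 1))
        If code < 32 Or code > 126 Then
            '\uN must be followed by an ANSI fallback character;
            '"?" is the usual choice (see the \uc control word)
            out = out & "\u" & CStr(code) & "?"
        ElseIf code = 92 Or code = 123 Or code = 125 Then
            'escape RTF's own specials: \ { }
            out = out & "\" & Chr$(code)
        Else
            out = out & Chr$(code)
        End If
    Next
    EscapeRtf = out
End Function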
Another, more roundabout way would be to add MSScript.OCX to the project and call VBScript's Escape function. For example:
Sub main()
    Dim s As String
    s = ChrW$(&H3088) & ChrW$(&H308D) & ChrW$(&H3066) & ChrW$(&H305D)
    Debug.Print MyEscape(s)
End Sub

Function MyEscape(s As String) As String
    Dim scr As Object
    Set scr = CreateObject("MSScriptControl.ScriptControl")
    scr.Language = "VBScript"
    scr.Reset
    MyEscape = scr.eval("escape(" & dq(s) & ")")
End Function

Function dq(s)
    dq = Chr$(34) & s & Chr$(34)
End Function
The Main routine passes in the original Japanese characters and the debug output says:
%u3088%u308D%u3066%u305D
HTH