In computer terminology, what do native English speakers call the character set that displays tables in the console?

In computer terminology, what do native English speakers call the character set used to draw tables in the console?
Google Translate tells me it should be "Tabs", but that sounds weird to me; is that because I'm not a native English speaker?
I mean characters such as:
┌ ┬ ┐
├ ┼ ┤
└ ┴ ┘
┏ ┳ ┓
┣ ╋ ┫
┗ ┻ ┛
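For what it's worth, Unicode's own names for these characters point at the common English term; a quick Python check (any of the characters above will do):

import unicodedata

# Every name begins with "BOX DRAWINGS", which is why English speakers
# call them box-drawing characters.
for ch in "┌┬┐├┼┤└┴┘":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+250C BOX DRAWINGS LIGHT DOWN AND RIGHT
# U+252C BOX DRAWINGS LIGHT DOWN AND HORIZONTAL
# ...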

Related

Displaying Chinese characters on a form from an INI File

My plugin reads the control caption text from an INI file (an ANSI-typed file whose contents are UTF-8 encoded) in order to display multiple languages. The key point is that it is a plugin: I have no control over, nor ability to change, this INI file's format or file type.
They are currently being read into my plugin with TINIFile.ReadString and stored as a string. I can modify this (data type, read method, etc) as needed.
The main application reads from its own application language files that are UCS-2 Little Endian encoded as a TXT file. These display fine when the language is changed, even when the Windows OS is kept in English (in other words no OS locale changes need to be made for the application to switch display languages).
My plugin's form cannot display Asian characters (Chinese, Japanese, Korean, etc). English language is fine.
I have tried various fonts and various combinations of AnsiString, String, etc. What am I missing to be able to display Asian characters on the form? I have not found an existing question that covers specifically how my language text is being read into the plugin.
If the .INI file reader does not interpret the contents of the values and lets all values through transparently, then you need to convert the strings to Unicode yourself using the correct encoding.
There is a similar question at Delphi 2010: how do I convert a UTF8-encoded PAnsiChar to a UnicodeString? that explains how to do the conversion. You may need to extract the contents into a RawByteString to avoid the implicit conversions.
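The idea generalizes beyond Delphi: bytes that were written as UTF-8 but decoded through a single-byte ANSI code page can be recovered by reversing the wrong decode. A minimal Python sketch of the round trip, assuming the ANSI code page behaves like Latin-1:

# UTF-8 bytes for a Chinese caption, as they would sit in the INI file
raw = "软件".encode("utf-8")

# Reading them back through a single-byte codec yields mojibake
mojibake = raw.decode("latin-1")
print(mojibake)   # 'è½¯ä»¶'

# Reverse the wrong decode to recover the bytes, then decode as UTF-8
fixed = mojibake.encode("latin-1").decode("utf-8")
print(fixed)      # '软件'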

Hindi Unicode supported by Mobile phones

I'm currently doing a project on language translation where I'm converting English text to Hindi. I'm trying to send the converted Hindi text to a mobile phone, but the message cannot be displayed on my phone, as there is no Hindi font. Yet I have seen mobile network operators sending their promos in Hindi, which my mobile reads like a charm. I would like to know if there is any Unicode or other conversion of the text so that the Hindi text will be displayed on my phone.
I also thought of starting such a program. I maintained the Unicode code points of all the Hindi letters in a file:
2305 अँ अं अः
2309 अ आ इ ई उ ऊ ऋ ए ऐ ओ औ
2325 क ख ग घ ङ
2330 च छ ज झ ञ
2335 ट ठ ड ढ ण
2340 त थ द ध न ऩ
2346 प फ ब भ म
2351 य र ऱ ल ळ ऴ व
2358 श ष स ह (up to 2361)
2364 च़ चऽ चा चि ... up to च॔ (2388)
2392 क़ ख़ ग़ ज़ ड़ ढ़ फ़ य़ ॠ ॡ (up to 2401)
2402 चॢ चॣ । ॥ (up to 2405)
2406 ० १ २ ३ ४ ५ ६ ७ ८ ९
I hope you are using 16-bit characters to store the Hindi text.
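As a quick sanity check on the list above, those decimal numbers are simply Devanagari code points, and SMS messages carrying non-Latin scripts are normally sent with the UCS-2 data coding scheme. A small Python illustration (the sample word is my own):

# Each Hindi letter is one code point in the Devanagari block (U+0900-U+097F)
text = "नमस्ते"
for ch in text:
    print(f"{ch}  U+{ord(ch):04X}  decimal {ord(ch)}")

# A UCS-2 SMS payload is, for these characters, the same as UTF-16 big-endian
payload = text.encode("utf-16-be")
print(payload.hex(" "))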

Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s which include a significant amount of Korean text. The HTML lacks character set metadata, so of course none of the Korean text renders properly now. The following examples all use the same excerpt of text.
In text editors such as Coda and Text Wrangler the text displays as
╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф
In the absence of character set metadata in <head>, the browser renders this as:
ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”
Adding euc-kr metadata to <head>:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Yields the following, which is illegible nonsense (verified by a native speaker):
沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛
I have tried this approach with all historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and upgrading to UTF-8, via Beautiful Soup, which also failed.
Viewing the files in Emacs seems promising, as it reveals the text encoding at a lower level. The following is the same sample of text:
\323\313 \274\374\241\357\300\212 \262\351\322\215\202\354\270\346\253\354\261\224 \262\351\322\215\202\354\270\346\253\354\261\224
How can I identify this text encoding and promote it to UTF-8?
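As a sanity check before guessing: the two renderings quoted above are consistent with one and the same byte stream viewed through two different legacy codecs, which you can reproduce in Python (the codec names are my inference):

# The first two bytes Emacs reported, \323\313
raw = bytes([0o323, 0o313])
print(raw.decode("cp437"))    # '╙╦' - the text-editor rendering
print(raw.decode("latin-1"))  # 'ÓË' - the browser rendering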
All of those octal codes that Emacs revealed are less than 254 (\376 in octal), so it looks like one of those old pre-Unicode fonts that just used its own mapping in the 8-bit range. If this is right, you'll just have to figure out what font the text was intended for, find it, and perhaps do the conversion yourself.
It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)
In the end, it is about finding the correct character encoding and using iconv.
iconv --list
displays all available encodings. Grepping for "KR" reveals that my system, at least, can do CSEUCKR, CSISO2022KR, EUC-KR, ISO-2022-KR and ISO646-KR. Korean is also covered by BIG5HKSCS, CSKSC5636 and KSC5636 according to Wikipedia. Try them all until something reasonable pops out.
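The same brute-force idea works in Python if iconv is not at hand; this sketch simply tries each candidate codec on the raw bytes (the file name and codec list are illustrative):

candidates = ["euc-kr", "cp949", "iso2022_kr", "johab", "utf-8"]
raw = open("oldpage.html", "rb").read()  # hypothetical input file

for codec in candidates:
    try:
        # Print a short preview so a Korean reader can spot the legible one
        print(f"{codec}: {raw.decode(codec)[:60]!r}")
    except UnicodeDecodeError as err:
        print(f"{codec}: failed ({err})")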
Even though this thread is old, it's still an issue, and not having found a way to convert the files in bulk (outside of using a Korean version of Windows 7), I'm now using Naver, which has a cloud service like Google Docs: if you upload those weirdly encoded files there, it deals with them very well. I just edit and copy the text, and it's back to being standard when I paste it elsewhere.
Not the kind of solution I like, but it might save a few passers-by.
You can register for the cloud account with an ID even if you do not live in South Korea, by the way; there's some minimal English to get by.

What is the most common encoding of each language?

I am developing a plain-text reader application. Sometimes the app can't automatically determine the encoding of a file, so the user needs to select an encoding from a list. If this list contained all supported encodings it would be too long, so I want to provide a simplified list containing only the most common encodings for each language.
These are the associations I already know:
Traditional Chinese: Big5
Simplified Chinese: GB18030
Japanese: Shift-JIS, EUC-JP
Russian: KOI8-R
If you know any other language's most common encoding, please tell me.
On the web, UTF-8 is by far the most common encoding for all languages.
That being said, here are the Windows XP locales grouped by default character encoding ("Language for non-Unicode programs"):
Big5: zh_HK, zh_MO, zh_TW
GBK (≈GB2312): zh_CN, zh_SG
Windows-31J (≈Shift_JIS): ja_JP
windows-874 (≈TIS-620, ISO-8859-11): th_TH
windows-949 (≈EUC-KR): ko_KR
windows-1250: bs_BA, cs_CZ, hr_BA, hr_HR, hu_HU, pl_PL, ro_RO, sk_SK, sl_SI, sq_AL, sr_BA, sr_SP
windows-1251: az_AZ, be_BY, bg_BG, kk_KZ, ky_KG, mk_MK, mn_MN, ru_RU, sr_BA, sr_SP, tt_RU, uk_UA, uz_UZ
windows-1252 (≈ISO-8859-1): af_ZA, arn_CL, ca_ES, cy_GB, da_DK, de_AT, de_CH, de_DE, de_LI, de_LU, en_AU, en_BZ, en_CA, en_CB, en_GB, en_IE, en_JM, en_NZ, en_PH, en_TT, en_US, en_ZA, en_ZW, es_AR, es_BO, es_CL, es_CO, es_CR, es_DO, es_EC, es_ES, es_GT, es_HN, es_MX, es_NI, es_PA, es_PE, es_PR, es_PY, es_SV, es_UY, es_VE, eu_ES, fi_FI, fil_PH, fo_FO, fr_BE, fr_CA, fr_CH, fr_FR, fr_LU, fr_MC, fy_NL, ga_IE, gl_ES, id_ID, is_IS, it_CH, it_IT, iu_CA, iv_IV, lb_LU, moh_CA, ms_BN, ms_MY, nb_NO, nl_BE, nl_NL, nn_NO, ns_ZA, pt_BR, pt_PT, qu_BO, qu_EC, qu_PE, rm_CH, se_FI, se_NO, se_SE, sv_FI, sv_SE, sw_KE, tn_ZA, xh_ZA, zu_ZA
windows-1253: el_GR
windows-1254 (≈ISO-8859-9): az_AZ, tr_TR, uz_UZ
windows-1255: he_IL
windows-1256: ar_AE, ar_BH, ar_DZ, ar_EG, ar_IQ, ar_JO, ar_KW, ar_LB, ar_LY, ar_MA, ar_OM, ar_QA, ar_SA, ar_SY, ar_TN, ar_YE, fa_IR, ps_AF, ur_PK
windows-1257: et_EE, lt_LT, lv_LV
windows-1258: vi_VN
and the most common encodings overall on the Web as of October 30th 2020:
UTF-8 95.7%
ISO-8859-1 1.8%
Windows-1251 1.0%
Windows-1252 0.4%
GB2312 0.3%
Shift JIS 0.2%
GBK 0.1%
EUC-KR 0.1%
ISO-8859-9 0.1%
Windows-1254 0.1%
EUC-JP 0.1%
Big5 0.1%
The HTML5 draft contains a table of default encodings for languages, reflecting what is regarded as common. However, note that it is supposed to be based on the user locale, i.e. the language of the browser or the operating system, not the language of the document—obviously because the latter is usually unknown, at least before you actually read the document, based on some assumption about the encoding.
I think you could in practice copy the list of encodings from a popular web browser. If it works well there, it probably works reasonably well in your application. Browsers do some clever things with the list and its order, but in practice I think it would suffice to have a short list like UTF-8, UTF-16, windows-1252, and maybe a few others, followed by an option to show the full list. Note that although UTF-16 is practically unused (and useless) for web pages, it is common for plain-text files. It is important to name the encodings well, preferably with a common English (or other-language) name together with the IANA "charset" name in parentheses, much like browsers do.
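A sketch of that short-list idea in Python; the candidate list and its ordering are my assumptions, with the strict multi-byte codecs tried first, because codecs like koi8-r and cp1252 decode almost any byte sequence and would otherwise mask everything behind them:

CANDIDATES = ["utf-8", "shift_jis", "big5", "euc-kr",
              "utf-16", "koi8-r", "cp1252"]

def guess_decode(data: bytes):
    """Try common encodings in order; return (text, encoding) on success."""
    for enc in CANDIDATES:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return None, None  # give up and show the user the full encoding menu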
I would recommend a menu structure like the one used by browsers. For instance, in Firefox: View -> Character Encoding -> More Encodings -> East Asian -> Chinese/Japanese/Korean; and View -> Encoding -> More in IE.
It might seem too deep and clunky, but it is very familiar, and it does not drop useful encodings. (Why only KOI8-R for Russian, for instance? And what happens if a file uses Windows-1251 and that is not in the list?)

VerQueryValue and multi codepage Unicode characters

In our application we use the VerQueryValue() API call to fetch version info such as ProductName. For some applications running on a machine set to Traditional Chinese (code page 950), when the ProductName contains Unicode characters that span multiple code pages, some characters are not translated properly. For instance, in the sequence below,
51 00 51 00 6F 8F F6 4E A1 7B 06 74
some characters are returned as the question mark 0x003F instead of the real character.
In the above sequence, the Unicode character U+8F6F is not picked up and converted properly by the WinAPI call and is just replaced with 0x003F, since U+8F6F is present in code page 936 only (i.e., Simplified Chinese).
The .exe has just one translation table, '\StringFileInfo\080404B0', which refers to a single language ID, 0x0804 (Simplified Chinese).
How should one handle such cases, where the ProductName contains characters from both 936 and 950 even though the translation table has only one entry? Is there any other API call to use?
Also, if I right-click on the exe and view the 'Details' tab, it shows the ProductName correctly! So it appears Microsoft uses a different API call or somehow handles this correctly. I need to know how this is done.
Thanks in advance,
Venkat
It looks somewhat weird to have contents compatible with only one code page in a block marked as another. This is the source of your problem.
The best way to handle multi-code-page issues is obviously to turn your app into a Unicode-aware application. There will be no conversion to any code page anymore, which will make everyone happy.
The LANGID (0804) is only an indication of the language of the contents in the block. If a version info has several blocks, you may program your app to look up the block in the language of your user.
When you call VerQueryValue() from an ANSI application, this LANGID is not taken into account when converting the Unicode contents to ANSI: you're ANSI, so Windows assumes you only understand the machine's default ANSI code page.
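To illustrate the "go Unicode" advice concretely, here is a minimal, Windows-only ctypes sketch that reads the ProductName through the W-suffixed APIs, so no ANSI code-page conversion ever takes place (the path is hypothetical):

import ctypes
from ctypes import wintypes

version = ctypes.WinDLL("version")
path = r"C:\path\to\app.exe"  # hypothetical target executable

# Fetch the whole version resource into a buffer
size = version.GetFileVersionInfoSizeW(path, None)
buf = ctypes.create_string_buffer(size)
version.GetFileVersionInfoW(path, 0, size, buf)

# Query the block named in the question; the value comes back as UTF-16
value = ctypes.c_void_p()
length = wintypes.UINT()
version.VerQueryValueW(buf, r"\StringFileInfo\080404B0\ProductName",
                       ctypes.byref(value), ctypes.byref(length))
print(ctypes.wstring_at(value.value))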
Note about display in console
Beware of the console! It's an old creature that is not totally Unicode-aware; it is based on code pages. Therefore, you should expect display problems that can't be addressed. Even worse: it uses its own code page (called the OEM code page), which may be different from the usual ANSI code page (although for East Asian languages, the OEM code page equals the ANSI code page).
HTH.