Printing Unicode characters in PowerShell via a C++ program

My end goal here is to write some non-latin text output to console in Windows via a C++ program.
cmd.exe gets me nowhere, so I got the latest, shiny version of PowerShell (which supports Unicode). I've verified that I can
type in Unicode characters and
see Unicode console output from Windows commands (like "dir")
For example, I have this file, "가.txt" (가 is the first letter in the Korean alphabet), and I can get output like this:
PS P:\reference\unicode> dir .\가.txt

    Directory: P:\reference\unicode

Mode                LastWriteTime     Length Name
----                -------------     ------ ----
-a---         1/12/2010   8:54 AM          0 가.txt
So far so good. But writing to console using a C++ program doesn't work.
#include <cwchar> // wprintf

int main()
{
    wchar_t text[] = {0xAC00, 0}; // 가 has code point U+AC00 in Unicode
    wprintf(L"%s", text);         // this prints a single question mark: "?"
}
I don't know what I'm missing. The fact that I can type-in and see 가 on the console seems to indicate that I have the three needed pieces (unicode support, font and glyph), but I must be mistaken.
I've also tried "chcp" without any luck. Am I doing something wrong in my C++ program?
Thanks!

From the printf docs:
wprintf and printf behave identically
if the stream is opened in ANSI mode.
Check out this blog post. It has this nice short little listing:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>
int main(void) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x043a\x043e\x0448\x043a\x0430 \x65e5\x672c\x56fd\n");
    return 0;
}
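Applied to the program in the question, a minimal sketch of the same fix (assuming the Microsoft CRT, where %s in wprintf takes a wide string) would be:
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int main(void)
{
    // Switch stdout to UTF-16 text mode so wide output reaches the console intact.
    _setmode(_fileno(stdout), _O_U16TEXT);

    wchar_t text[] = {0xAC00, 0}; // 가 is U+AC00
    wprintf(L"%s\n", text);       // should now print 가 instead of "?"
    return 0;
}
Note that once stdout is in _O_U16TEXT mode, mixing in narrow printf calls on the same stream tends to trip a CRT assertion, so keep the output wide.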

Related

Looking for a C++ equivalent of _wfindfirst for char16_t

I have filenames using char16_t characters:
char16_t Text[2560] = u"ThisIsTheFileName.txt";
char16_t const* Filename = Text;
How can I check if the file exists already? I know that I can do so for wchar_t using _wfindfirst(). But I need char16_t here.
Is there an equivalent function to _wfindfirst() for char16_t?
Background for this is that I need to work with Unicode characters and want my code to work on Linux (32-bit) as well as on other platforms (16-bit).
_findfirst() is the narrow-character counterpart to _wfindfirst().
However, both _findfirst() and _wfindfirst() are specific to Windows. _findfirst() accepts ANSI strings (outdated legacy stuff), while _wfindfirst() accepts UTF-16 in the form of wchar_t (which is not exactly the same thing as char16_t).
ANSI and UTF-16 are generally not used on Linux, and _findfirst()/_wfindfirst() are Microsoft CRT functions that are not available with gcc on Linux.
Linux uses UTF-8 as its Unicode format. You can use access() to check for file existence or permissions, or use opendir()/readdir()/closedir() as the equivalent of _findfirst().
If you have a UTF-16 filename from Windows, you can convert the name to UTF-8, and use the UTF-8 name in Linux. See How to convert UTF-8 std::string to UTF-16 std::wstring?
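On the Linux side, a minimal sketch of that existence check, assuming the name has already been converted to UTF-8 (access() with F_OK tests only for existence):
#include <unistd.h>
#include <cstdio>

int main()
{
    // Assumed to be the UTF-8 version of the original char16_t filename.
    const char* utf8_name = "ThisIsTheFileName.txt";

    if (access(utf8_name, F_OK) == 0)
        std::printf("file exists\n");
    else
        std::printf("file does not exist\n");
    return 0;
}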
Consider using std::filesystem in C++17 or higher.
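A sketch of the std::filesystem route, which sidesteps the conversion entirely because std::filesystem::path can be constructed directly from a char16_t string (C++17):
#include <filesystem>
#include <iostream>

int main()
{
    char16_t const* Filename = u"ThisIsTheFileName.txt";

    // path accepts char16_t sequences and converts them to the native encoding internally.
    std::filesystem::path p(Filename);

    std::error_code ec; // use the non-throwing overload
    if (std::filesystem::exists(p, ec))
        std::cout << "file exists\n";
    else
        std::cout << "file does not exist\n";
    return 0;
}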
Note that a Windows or Linux executable being 32-bit or 64-bit has nothing to do with the character set. Some very old systems are 16-bit, but you are unlikely to come across them.

Check if terminal supports Unicode in Rust

I would like some way of doing essentially the following:
if supports_unicode {
    print!("some unicode");
} else {
    print!("ascii");
}
Is there any way in Rust to check whether the output supports Unicode?
Update
I found a way to check if the device supports unicode, but it doesn't check if the current output is set to the correct encoding, nor does it check if the font supports the full range of unicode characters. If you're curious, it uses the crate locale-codes 0.3.0, and the code is
locale_codes::codeset::all_names().contains(&String::from("UTF-8"))
But, as I said, this doesn't solve my problem
Also, if you want, here is a more specific example of the problem I've been having. In the VSCode integrated terminal (Windows 10 x64, VSCode 1.47), if I run a Rust program that prints the character 𝑥 (U+1D465), I get a variety of results, such as:
it actually prints the correct character
it prints �
it prints nothing at all
it prints 𝐵 (U+1D435)
I hope this example helps.

D Unicode string literals: can't print specific Unicode character

I'm just trying to pick up D having come from C++. I'm sure it's something very basic, but I can't find any documentation to help me. I'm trying to print the character à, which is U+00E0. I am trying to assign this character to a variable and then use write() to output it to the console.
I'm told by this website that U+00E0 is encoded as 0xC3 0xA0 in UTF-8, 0x00E0 in UTF-16 and 0x000000E0 in UTF-32.
Note that for everything I've tried, I've tried replacing string with char[] and wstring with wchar[]. I've also tried with and without the w or d suffixes after wide strings.
These methods return the compiler error, "Invalid trailing code unit":
string str = "à";
wstring str = "à"w;
dstring str = "à"d;
These methods print a totally different character (Ò U+00D2):
string str = "\xE0";
string str = hexString!"E0";
And all these methods print what looks like ˧á (note á ≠ à!), which is UTF-16 0x2E7 0x00E1:
string str = "\xC3\xA0";
wstring str = "\u00E0"w;
dstring str = "\U000000E0"d;
Any ideas?
I confirmed it works on my Windows box, so gonna type this up as an answer now.
In the source code, if you copy/paste the characters directly, make sure your editor is saving the file in UTF-8 encoding. The D compiler insists on it, so if it gives a compile error about a UTF thing, that's probably why. I have never used C::B, but an old answer on the web said Edit -> Encodings...; it is a setting somewhere in the editor regardless.
Or, you can replace the characters in your source code with \uxxxx in the strings. Do NOT use the hexstring thing, that is for binary bytes, but your example of "\u00E0" is good, and will work for any type of string (not just wstring like in your example).
Then, on the output side, it depends on your target because the program just outputs bytes, and it is up to the recipient program to interpret it correctly. Since you said you are on Windows, the key is to set the console code page to utf-8 so it knows what you are trying to do. Indeed, the same C function can be called from D too. Leading to this program:
import core.sys.windows.windows;
import std.stdio;
void main() {
    SetConsoleOutputCP(65001);
    writeln("Hi \u00E0");
}
printing it successfully. On older Windows versions, you might need to change your font to see the character too (as opposed to the generic box it shows because some fonts don't have all the characters), but on my Windows 10 box, it just worked with the default font.
BTW, technically the console code page is a shared setting (after your program runs and exits, you can still hit Properties on your console window and see the change reflected there), so you should perhaps set it back when your program exits. You could get the old value at startup with the get function ( https://learn.microsoft.com/en-us/windows/console/getconsoleoutputcp ), store it in a local variable, and set it back on exit. You could do auto ccp = GetConsoleOutputCP(); SetConsoleOutputCP(65001); scope(exit) SetConsoleOutputCP(ccp); right at startup - the scope(exit) will run when the function exits, so doing it in main would be kinda convenient. Just add some error checking if you want.
The Microsoft docs don't say anything about setting it back, so it probably doesn't actually matter, but I wanted to mention it just in case. Also, the knowledge that it is shared and persists can help in debugging - if it still works after you comment the call out, that isn't because the call is unnecessary, it is just because the code page was set previously and not unset yet!
Note that running it from an IDE might not be exactly the same, because IDEs often pipe the output instead of running it right out to the Windows console. If that happens, lemme know and we can type up some stuff about that for future readers too. But you can also open your own copy of the console (run the program outside the IDE) and it should show correctly for you.
D source code needs to be encoded as UTF-8.
My guess is that you're putting a UTF-16 character into the UTF-8 source file.
E.g.
import std.stdio;
void main() {
    writeln(cast(char)0xC3, cast(char)0xA0);
}
Will output as UTF-8 the character you seek.
Which you can then hard code like so:
import std.stdio;
void main() {
    string str = "à";
    writeln(str);
}

Issue in displaying Unicoded native language characters through MFC in Windows XP

I'm trying to display Unicode Bengali, a native language of India through a MFC application as below:
CFont *m_pFontSmallBN = new CFont();
m_pFontSmallBN->CreateFont(34, 0, 0, 0, 600, 0, 0, 0, ANSI_CHARSET, OUT_DEFAULT_PRECIS,
    CLIP_DEFAULT_PRECIS, DEFAULT_QUALITY, DEFAULT_PITCH | FF_DONTCARE,
    _T("Ekushey Lalsalu")); // "Ekushey Lalsalu" is the Bengali font name here.

CStatic m_msg_bn;
m_msg_bn.SetFont(m_pFontSmallBN, TRUE);
m_msg_bn.SetWindowText(_T("TEXT IN NATIVE LANGUAGE")); // The text is typed with the font.
When I run the app on Windows Vista it displays the text perfectly, but on Windows XP it cannot display the Unicode characters properly: compound letters of the Bengali language (formed from multiple Unicode characters) are displayed as separate characters. I ensured that both Windows Vista and XP have the font installed and that the character set in my MFC project settings is Unicode.
Could anybody please help me to find out the issue in Windows XP environment ?
Choosing a font in Windows is tricky. You'd expect the font name to take precedence over all other font characteristics, but that's not always the case. To be sure you're getting the proper font you should make sure all the parameters to CreateFont match the font you want. This article, though old, details the font mapping process: Windows Font Mapping.
Here's a small program that puts up a font selection dialog and dumps the parameters that you can pass to CreateFont to guarantee that you're getting the font you want.
#include <Windows.h>
#include <stdio.h>

int wmain(int argc, wchar_t* argv[])
{
    LOGFONT lf = {};
    CHOOSEFONT cf = {sizeof(CHOOSEFONT)};
    cf.lpLogFont = &lf;
    cf.Flags = CF_BOTH | CF_FORCEFONTEXIST;
    if (ChooseFont(&cf))
    {
        wprintf(L"%d,%d,%d,%d,%d,", lf.lfHeight, lf.lfWidth, lf.lfEscapement, lf.lfOrientation, lf.lfWeight);
        wprintf(L"%d,%d,%d,%d,%d,", lf.lfItalic, lf.lfUnderline, lf.lfStrikeOut, lf.lfCharSet, lf.lfOutPrecision);
        wprintf(L"%d,%d,%d,", lf.lfClipPrecision, lf.lfQuality, lf.lfPitchAndFamily);
        wprintf(L"_T(\"%s\")\n", lf.lfFaceName);
    }
    return 0;
}
@Mark I could not add my comment using the "add a comment" link, so I'm adding it here. Even in the XP environment the program displays the same values for the font properties. Another thing: using Notepad on the same system, I see the same improper display. It can display the Bengali font, but the display is wrong for compound letters (a consonant conjunct, or a consonant attached to the diacritic form of a vowel) of the Bengali language. This is probably because XP does not have built-in support for complex text in native scripts like Bengali by default. Windows versions from Vista onward have this complex-text support enabled by default, so just installing a native Unicode font lets us view native scripts properly.

How to identify UTF-8 encoded strings

What's the best way to identify whether a string is (or might be) UTF-8 encoded? The Win32 API IsTextUnicode isn't much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And, yes, I know that only characters above the ASCII range are encoded with more than one byte.
chardet: character set detection developed by Mozilla and used in Firefox. Source code
jchardet is a Java port of the source of Mozilla's automatic charset detection algorithm.
NCharDet is a .NET (C#) port of a Java port of the C++ code used in the Mozilla and Firefox browsers.
A Code Project C# sample that uses Microsoft's MLang for character encoding detection.
UTRAC is a command line tool and library written in C++ to detect string encoding.
cpdetector is a Java project used for encoding detection.
chsdet is a Delphi project and a standalone executable module for automatic charset/encoding detection of a given text or file.
Another useful post that points to a lot of libraries to help you determine character encoding http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
You could also take a look at the related question How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?, it has some useful content.
There is no really reliable way, but basically, since a random sequence of bytes (e.g. a string in a standard 8-bit encoding) is very unlikely to be a valid UTF-8 string (if the most significant bit of a byte is set, there are very specific rules as to what kind of bytes can follow it in UTF-8), you can try decoding the string as UTF-8 and consider it UTF-8 if there are no decoding errors.
Determining whether there were decoding errors is another problem altogether; many Unicode libraries simply replace invalid characters with a question mark without indicating whether an error occurred. So you need an explicit way of determining whether an error occurred while decoding.
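For illustration, here is a small standalone validator along those lines in C++ (it rejects bad lead/continuation bytes, truncated sequences, overlong forms, UTF-16 surrogates, and code points above U+10FFFF; treat it as a sketch and prefer a tested library in production):
#include <cstddef>
#include <cstdint>
#include <string>

bool looks_like_utf8(const std::string& s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    std::size_t i = 0, n = s.size();
    while (i < n) {
        unsigned char b = p[i];
        std::size_t len;
        std::uint32_t cp, min;
        if (b < 0x80)                { i += 1; continue; }                    // ASCII byte
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; min = 0x80;    }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800;   }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                                // invalid lead byte
        if (i + len > n) return false;                    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;  // bad continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < min) return false;                       // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                                 // out of range or surrogate
        i += len;
    }
    return true;
}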
This W3C page has a perl regular expression for validating UTF-8
You didn't specify a language, but in PHP you can use mb_check_encoding
if (mb_check_encoding($yourString, 'UTF-8'))
{
    // the string is UTF-8
}
else
{
    // the string is not UTF-8
}
For Win32 you can use the MLang API. It is part of Windows and supported from Windows XP onward; the cool thing about it is that it gives you statistics on how likely the input is to be in a particular encoding:
CComPtr<IMultiLanguage2> lang;
HRESULT hr = lang.CoCreateInstance(CLSID_CMultiLanguage, NULL, CLSCTX_INPROC_SERVER);

char str[] = "\xEF\xBB\xBF" "abc"; // EF BB BF 61 62 63 (UTF-8 BOM followed by "abc")
int size = 6;
DetectEncodingInfo encodings[100];
int encodingsCount = 100;
hr = lang->DetectInputCodepage(MLDETECTCP_NONE, 0, str, &size, encodings, &encodingsCount);
To do character detection in ruby install the 'chardet' gem
sudo gem install chardet
Here's a little ruby script to run chardet over the standard input stream.
require "rubygems"
require 'UniversalDetector' #chardet gem
infile = $stdin.read()
p UniversalDetector::chardet(infile)
Chardet outputs a guess at the character set encoding and also a confidence level (0-1) from its statistical analysis
see also this snippet
C/C++ standalone library based on Mozilla's character set detector
https://github.com/batterseapower/libcharsetdetect
Universal Character Set Detector (UCSD)
A library exposing a C interface and dependency-free interface to the Mozilla C++ UCSD library. This library provides a highly accurate set of heuristics that attempt to determine the character set used to encode some input text. This is extremely useful when your program has to handle an input file which is supplied without any encoding metadata.
On Windows, you can use MultiByteToWideChar() with the CP_UTF8 codepage and the MB_ERR_INVALID_CHARS flag. If the function fails, the string is not valid UTF-8.
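A minimal sketch of that check (passing NULL/0 for the output buffer only asks for the required length, so nothing is actually converted):
#include <windows.h>

// Returns true if the buffer decodes as valid UTF-8.
bool IsValidUtf8(const char* data, int length)
{
    if (length == 0)
        return true; // an empty string is trivially valid UTF-8

    // MB_ERR_INVALID_CHARS makes the call fail on any malformed sequence;
    // on failure it returns 0 and GetLastError() gives ERROR_NO_UNICODE_TRANSLATION.
    int wideLen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      data, length, NULL, 0);
    return wideLen != 0;
}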
As an add-on to the previous answer about the Win32 mlang DetectInputCodepage() API, here's how to call it in C:
#include <Mlang.h>
#include <objbase.h>
#pragma comment(lib, "ole32.lib")

HRESULT hr;
IMultiLanguage2 *pML;
char *pszBuffer;   /* point this at the text to analyze */
int iSize;         /* and set this to its length in bytes */
DetectEncodingInfo lpInfo[10];
int iCount = sizeof(lpInfo) / sizeof(DetectEncodingInfo);

hr = CoInitialize(NULL);
hr = CoCreateInstance(&CLSID_CMultiLanguage, NULL, CLSCTX_INPROC_SERVER, &IID_IMultiLanguage2, (LPVOID *)&pML);
hr = pML->lpVtbl->DetectInputCodepage(pML, 0, 0, pszBuffer, &iSize, lpInfo, &iCount);
pML->lpVtbl->Release(pML);
CoUninitialize();
But the test results are very disappointing:
It can't distinguish between French texts in CP 437 and CP 1252, even though the text is completely unreadable if opened in the wrong code page.
It can detect text encoded in CP 65001 (UTF-8), but not text in UTF-16, which is wrongly reported as CP 1252 with good confidence!