Powershell and Unicode support [duplicate]

Powershell and Unicode support [duplicate] - powershell

What I'm trying to achieve should be rather straightforward although Powershell is trying to make it hard.
I want to display the full path of files, some with Arabic, Chinese, Japanese and Russian characters in their names
I always get some undecipherable output, such as the one shown below
The output seen in console is being consumed as is by another script.
The output contains ? instead of the actual characters.
The command executed is
(Get-ChildItem -Recurse -Path "D:\test" -Include *unicode* | Get-ChildItem -Recurse).FullName
Is there any easy way to launch powershell (via command line or in any fashion that can be written into a script) such that the output is seen correctly.
P.S. I've gone through many similar questions on Stack Overflow but none of them have much input other than calling it a Windows Console Subsystem issue.

Note:
On Windows, with respect to rendering Unicode characters, it is primarily the choice of font / console (terminal) application that matters.
Nowadays, using Windows Terminal, which is distributed and updated via the Microsoft Store since Windows 10, is a good replacement for the legacy console host (console windows provided by conhost.exe), providing superior Unicode character support. In Windows 11 22H2, Windows Terminal even became the default console (terminal).
With respect to programmatically processing Unicode characters when communicating with external programs, $OutputEncoding, [Console]::InputEncoding and [Console]::OutputEncoding matter too - see below.
The PowerShell Core (v6+) perspective (see next section for Windows PowerShell), irrespective of character rendering issues (also covered in the next section), with respect to communicating with external programs:
On Unix-like platforms, PowerShell Core uses UTF-8 by default (typically, these days, given that modern Unix-like platforms use UTF-8-based locales).
On Windows, it is the legacy system locale, via its OEM code page, that determines the default encoding in all consoles, including both Windows PowerShell and PowerShell Core console windows, though recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8); note that the feature is still in beta as of this writing, and using it has far-reaching consequences - see this answer.
If you do use that feature, PowerShell Core console windows will then automatically be UTF-8-aware, though in Windows PowerShell you'll still have to set $OutputEncoding to UTF-8 too (which in Core already defaults to UTF-8), as shown below.
Otherwise - notably on older Windows versions - you can use the same approach as detailed below for Windows PowerShell.
Making your Windows PowerShell console window Unicode (UTF-8) aware:
Pick a TrueType (TT) font that supports the specific scripts (writing systems, alphabets) whose characters you want to display properly in the console:
Important: While all TrueType fonts support Unicode in principle, they usually only support a subset of all Unicode characters, namely those corresponding to specific scripts (writing systems), such as the Latin script, the Cyrillic (Russian) script, ...
In your particular case - if you must support Arabic as well as Chinese, Japanese and Russian characters - your only choice is SimSun-ExtB, which is available on Windows 10 only.
See Wikipedia for a list of what Windows fonts target what scripts (alphabets).
To change the font, click on the icon in the top-left corner of the window and select Properties, then change to the Fonts tab and select the TrueType font of interest.
See this SU answer by not2quibit for how to make additional fonts available.
Additionally, for proper communication with external programs:
The console window's code page must be switched to 65001, the UTF-8 code page (which is usually done with chcp 65001, which, however, cannot be used directly from within a PowerShell session[1], but the PowerShell command below has the same effect).
Windows PowerShell must be instructed to use UTF-8 to communicate with external utilities too, both when sending pipeline input to external programs, via it $OutputEncoding preference variable (on decoding output from external programs, it is the encoding stored in [console]::OutputEncoding that is applied).
The following magic incantation in Windows PowerShell does this (as stated, this implicitly performs chcp 65001):
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding =
New-Object System.Text.UTF8Encoding
To persist these settings, i.e., to make your future interactive PowerShell sessions UTF-8-aware by default, add the command above to your $PROFILE file.
Note: Recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8) (the feature is still in beta as of Window 10 version 1903), which makes all console windows default to UTF-8, including Windows PowerShell's.
If you do use that feature, setting [console]::InputEncoding / [console]::OutputEncoding is then no longer strictly necessary, but you'll still have to set $OutputEncoding (which is not necessary in PowerShell Core, where $OutputEncoding already defaults to UTF-8).
Important:
These settings assume that any external utilities you communicate with expect UTF-8-encoded input and produce UTF-8 output.
CLIs written in Node.js fulfill that criterion, for instance.
Python scripts - if written with UTF-8 support in mind - can handle UTF-8 too.
By contrast, these settings can break (older) utilities that only expect a single-byte encoding as implied by the system's legacy OEM code page.
Up to Windows 8.1, this even included standard Windows utilities such as find.exe and findstr.exe, which have been fixed in Windows 10.
See the bottom of this post for how to bypass this problem by switching to UTF-8 temporarily, on demand for invoking a given utility.
These settings apply to external programs only and are unrelated to the encodings that PowerShell's cmdlets use on output:
See this answer for the default character encodings used by PowerShell cmdlets; in short: If you want cmdlets in Windows PowerShell to default to UTF-8 (which PowerShell [Core] v6+ does anyway), add $PSDefaultParameterValues['*:Encoding'] = 'utf8' to your $PROFILE, but note that this will affect all calls to cmdlets with an -Encoding parameter in your sessions, unless that parameter is used explicitly; also note that in Windows PowerShell you'll invariably get UTF-8 files with BOM; conversely, in PowerShell [Core] v6+, which defaults to BOM-less UTF-8 (both in the absence of -Encoding and with -Encoding utf8, you'd have to use 'utf8BOM'.
Optional background information
Tip of the hat to eryksun for all his input.
While a TrueType font is active, the console-window buffer correctly preserves (non-ASCII) Unicode chars. even if they don't render correctly; that is, even though they may appear generically as ?, so as to indicate lack of support by the current font, you can copy & paste such characters elsewhere without loss of information, as eryksun notes.
PowerShell is capable of outputting Unicode characters to the console even without having switched to code page 65001 first.
However, that by itself does not guarantee that other programs can handle such output correctly - see below.
When it comes to communicating with external programs via stdout (piping), PowersShell uses the character encoding specified in the $OutputEncoding preference variable, which defaults to ASCII(!) in Windows PowerShell, which means that any non-ASCII characters are transliterated to literal ? characters, resulting in information loss. (By contrast, commendably, PowerShell Core (v6+) now uses (BOM-less) UTF-8 as the default encoding, consistently.)
By contrast, however, passing non-ASCII arguments (rather than stdout (piped) output) to external programs seems to require no special configuration (it is unclear to me why that works); e.g., the following Node.js command correctly returns €: 1 even with the default configuration:
node -pe "process.argv[1] + ': ' + process.argv[1].length" €
[Console]::OutputEncoding:
controls what character encoding is assumed when the console translates program output into console display characters.
also tells PowerShell what encoding to assume when capturing output from an external program.
The upshot is that if you need to capture output from an UTF-8-producing program, you need to set [Console]::OutputEncoding to UTF-8 as well; setting $OutputEncoding only covers the input (to the external program) aspect.
[Console]::InputEncoding sets the encoding for keyboard input into a console[2] and also determines how PowerShell's CLI interprets data it receives via stdin (standard input).
If switching the console to UTF-8 for the entire session is not an option, you can do so temporarily, for a given call:
# Save the current settings and temporarily switch to UTF-8.
$oldOutputEncoding = $OutputEncoding; $oldConsoleEncoding = [Console]::OutputEncoding
$OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.Utf8Encoding
# Call the UTF-8 program, using Node.js as an example.
# This should echo '€' (`U+20AC`) as-is and report the length as *1*.
$captured = '€' | node -pe "require('fs').readFileSync(0).toString().trim()"
$captured; $captured.Length
# Restore the previous settings.
$OutputEncoding = $oldOutputEncoding; [Console]::OutputEncoding = $oldConsoleEncoding
Problems on older versions of Windows (pre-W10):
An active chcp value of 65001 breaking the console output of some external programs and even batch files in general in older versions of Windows may ultimately have stemmed from a bug in the WriteFile() Windows API function (as also used by the standard C library), which mistakenly reported the number of characters rather than bytes with code page 65001 in effect, as discussed in this blog post.
The resulting symptoms, according to a comment by bobince on this answer from 2008, are: "My understanding is that calls that return a number-of-bytes (such as fread/fwrite/etc) actually return a number-of-characters. This causes a wide variety of symptoms, such as incomplete input-reading, hangs in fflush, the broken batch files and so on."
Superior alternatives to the native Windows console (terminal), conhost.exe
eryksun suggests two alternatives to the native Windows console windows (conhost.exe), which provider better and faster Unicode character rendering, due to using the modern, GPU-accelerated DirectWrite/DirectX API instead of the "old GDI implementation [that] cannot handle complex scripts, non-BMP characters, or automatic fallback fonts."
Microsoft's own, open-source Windows Terminal, which is distributed and updated via the Microsoft Store since Windows 10 - see here for an introduction.
Long-established third-party alternative ConEmu, which has the advantage of working on older Windows versions too.
[1] Note that running chcp 65001 from inside a PowerShell session is not effective, because .NET caches the console's output encoding on startup and is unaware of later changes made with chcp (only changes made directly via [console]::OutputEncoding] are picked up).
[2] I am unclear on how that manifests in practice; do tell us, if you know.

Elaborated Alexander Martin's answer. For testing purposes, I have created some folders and files with valid names from different Unicode subranges as follows:
For instance, with Courier New console font, replacement symbols are displayed instead of CJK characters in a PowerShell console:
On the other hand, with SimSun console font, (poorly visible) replacement symbols are displayed instead of Arabic and Hebrew characters while CJK chars seem to be displayed correct:
Please note that all replacement symbols are merely displayed whereas real characters are preserved as you can see in the following Copy&Paste from above PowerShell console:
PS D:\PShell> (Get-ChildItem 'D:\bat\UnASCII Names\' -Dir).Name
Arabic (عَرَبِيّ‎)
CJK (中文(繁體))
Czech (Čeština)
Greek (Γρεεκ)
Hebrew (עִבְרִית)
Japanese (日本語)
MathBoldScript (𝓜𝓪𝓽𝓱𝓑𝓸𝓵𝓭𝓢𝓬𝓻𝓲𝓹𝓽)
Russian (русский язык)
Türkçe (Türkiye)
‹angles›
☺☻♥♦
For the sake of completeness, here are appropriate registry values to Enable More Fonts for the Windows Command Prompt (this works for the Windows PowerShell console as well):
(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont' |
Select-Object -Property [0-9]* | Out-String).Split(
[System.Environment]::NewLine,
[System.StringSplitOptions]::RemoveEmptyEntries) |
Sort-Object
Sample output:
0 : Consolas
00 : Source Code Pro
000 : DejaVu Sans Mono
0000 : Courier New
00000 : Simplified Arabic Fixed
000000 : Unifont
0000000 : Lucida Console
932 : *ＭＳ ゴシック
936 : *新宋体

If you install Microsoft's "Windows Terminal" from the Microsoft Store (or the Preview version), it comes pre-configured for full Unicode localization.
You still can't enter commands with special characters... unless you use WSL! 😍

The Powershell ISE is an option for displaying foreign characters: korean.txt is a UTF8 encoded file:
PS C:\Users\js> get-content korean.txt
The Korean language (South Korean: 한국어/韓國語 Hangugeo; North
Korean: 조선말/朝鮮말 Chosŏnmal) is an East Asian language
spoken by about 77 million people.[3]

I was facing a similar challenge, working with the AMAZON translate service. I installed the terminal from Windows store and it works for me now!

Make sure you have a font containing all the problematic characters installed and set as your Win32 Console font. If I remember right, click the PowerShell icon in the top-left corner of the window and pick Properties. The resulting popup dialog should have an option to set the font used. It might have to be a bitmap (.FON or .FNT) font.

Just registered just to clear up the confusion about why "Lucida Console" as font is working in Powershell ISE. Unfortunately I am not able to comment due to missing reputation, so here as Answer:
In normal powershell all characters are displayed in the configured font. Thats why e.g. chinese or cyrillic characters are broken with "Lucida Console" and many other fonts.
For chinese characters Powershell ISE changes the font automatically to "DengXian".
You can find out which alternative Font is used for your special character by copying them to Word or a similar program which is capable of displaying different fonts.

Related

Powershell external command returns confused UTF16? [duplicate]

What I'm trying to achieve should be rather straightforward although Powershell is trying to make it hard.
I want to display the full path of files, some with Arabic, Chinese, Japanese and Russian characters in their names
I always get some undecipherable output, such as the one shown below
The output seen in console is being consumed as is by another script.
The output contains ? instead of the actual characters.
The command executed is
(Get-ChildItem -Recurse -Path "D:\test" -Include *unicode* | Get-ChildItem -Recurse).FullName
Is there any easy way to launch powershell (via command line or in any fashion that can be written into a script) such that the output is seen correctly.
P.S. I've gone through many similar questions on Stack Overflow but none of them have much input other than calling it a Windows Console Subsystem issue.

Note:
On Windows, with respect to rendering Unicode characters, it is primarily the choice of font / console (terminal) application that matters.
Nowadays, using Windows Terminal, which is distributed and updated via the Microsoft Store since Windows 10, is a good replacement for the legacy console host (console windows provided by conhost.exe), providing superior Unicode character support. In Windows 11 22H2, Windows Terminal even became the default console (terminal).
With respect to programmatically processing Unicode characters when communicating with external programs, $OutputEncoding, [Console]::InputEncoding and [Console]::OutputEncoding matter too - see below.
The PowerShell Core (v6+) perspective (see next section for Windows PowerShell), irrespective of character rendering issues (also covered in the next section), with respect to communicating with external programs:
On Unix-like platforms, PowerShell Core uses UTF-8 by default (typically, these days, given that modern Unix-like platforms use UTF-8-based locales).
On Windows, it is the legacy system locale, via its OEM code page, that determines the default encoding in all consoles, including both Windows PowerShell and PowerShell Core console windows, though recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8); note that the feature is still in beta as of this writing, and using it has far-reaching consequences - see this answer.
If you do use that feature, PowerShell Core console windows will then automatically be UTF-8-aware, though in Windows PowerShell you'll still have to set $OutputEncoding to UTF-8 too (which in Core already defaults to UTF-8), as shown below.
Otherwise - notably on older Windows versions - you can use the same approach as detailed below for Windows PowerShell.
Making your Windows PowerShell console window Unicode (UTF-8) aware:
Pick a TrueType (TT) font that supports the specific scripts (writing systems, alphabets) whose characters you want to display properly in the console:
Important: While all TrueType fonts support Unicode in principle, they usually only support a subset of all Unicode characters, namely those corresponding to specific scripts (writing systems), such as the Latin script, the Cyrillic (Russian) script, ...
In your particular case - if you must support Arabic as well as Chinese, Japanese and Russian characters - your only choice is SimSun-ExtB, which is available on Windows 10 only.
See Wikipedia for a list of what Windows fonts target what scripts (alphabets).
To change the font, click on the icon in the top-left corner of the window and select Properties, then change to the Fonts tab and select the TrueType font of interest.
See this SU answer by not2quibit for how to make additional fonts available.
Additionally, for proper communication with external programs:
The console window's code page must be switched to 65001, the UTF-8 code page (which is usually done with chcp 65001, which, however, cannot be used directly from within a PowerShell session[1], but the PowerShell command below has the same effect).
Windows PowerShell must be instructed to use UTF-8 to communicate with external utilities too, both when sending pipeline input to external programs, via it $OutputEncoding preference variable (on decoding output from external programs, it is the encoding stored in [console]::OutputEncoding that is applied).
The following magic incantation in Windows PowerShell does this (as stated, this implicitly performs chcp 65001):
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding =
New-Object System.Text.UTF8Encoding
To persist these settings, i.e., to make your future interactive PowerShell sessions UTF-8-aware by default, add the command above to your $PROFILE file.
Note: Recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8) (the feature is still in beta as of Window 10 version 1903), which makes all console windows default to UTF-8, including Windows PowerShell's.
If you do use that feature, setting [console]::InputEncoding / [console]::OutputEncoding is then no longer strictly necessary, but you'll still have to set $OutputEncoding (which is not necessary in PowerShell Core, where $OutputEncoding already defaults to UTF-8).
Important:
These settings assume that any external utilities you communicate with expect UTF-8-encoded input and produce UTF-8 output.
CLIs written in Node.js fulfill that criterion, for instance.
Python scripts - if written with UTF-8 support in mind - can handle UTF-8 too.
By contrast, these settings can break (older) utilities that only expect a single-byte encoding as implied by the system's legacy OEM code page.
Up to Windows 8.1, this even included standard Windows utilities such as find.exe and findstr.exe, which have been fixed in Windows 10.
See the bottom of this post for how to bypass this problem by switching to UTF-8 temporarily, on demand for invoking a given utility.
These settings apply to external programs only and are unrelated to the encodings that PowerShell's cmdlets use on output:
See this answer for the default character encodings used by PowerShell cmdlets; in short: If you want cmdlets in Windows PowerShell to default to UTF-8 (which PowerShell [Core] v6+ does anyway), add $PSDefaultParameterValues['*:Encoding'] = 'utf8' to your $PROFILE, but note that this will affect all calls to cmdlets with an -Encoding parameter in your sessions, unless that parameter is used explicitly; also note that in Windows PowerShell you'll invariably get UTF-8 files with BOM; conversely, in PowerShell [Core] v6+, which defaults to BOM-less UTF-8 (both in the absence of -Encoding and with -Encoding utf8, you'd have to use 'utf8BOM'.
Optional background information
Tip of the hat to eryksun for all his input.
While a TrueType font is active, the console-window buffer correctly preserves (non-ASCII) Unicode chars. even if they don't render correctly; that is, even though they may appear generically as ?, so as to indicate lack of support by the current font, you can copy & paste such characters elsewhere without loss of information, as eryksun notes.
PowerShell is capable of outputting Unicode characters to the console even without having switched to code page 65001 first.
However, that by itself does not guarantee that other programs can handle such output correctly - see below.
When it comes to communicating with external programs via stdout (piping), PowersShell uses the character encoding specified in the $OutputEncoding preference variable, which defaults to ASCII(!) in Windows PowerShell, which means that any non-ASCII characters are transliterated to literal ? characters, resulting in information loss. (By contrast, commendably, PowerShell Core (v6+) now uses (BOM-less) UTF-8 as the default encoding, consistently.)
By contrast, however, passing non-ASCII arguments (rather than stdout (piped) output) to external programs seems to require no special configuration (it is unclear to me why that works); e.g., the following Node.js command correctly returns €: 1 even with the default configuration:
node -pe "process.argv[1] + ': ' + process.argv[1].length" €
[Console]::OutputEncoding:
controls what character encoding is assumed when the console translates program output into console display characters.
also tells PowerShell what encoding to assume when capturing output from an external program.
The upshot is that if you need to capture output from an UTF-8-producing program, you need to set [Console]::OutputEncoding to UTF-8 as well; setting $OutputEncoding only covers the input (to the external program) aspect.
[Console]::InputEncoding sets the encoding for keyboard input into a console[2] and also determines how PowerShell's CLI interprets data it receives via stdin (standard input).
If switching the console to UTF-8 for the entire session is not an option, you can do so temporarily, for a given call:
# Save the current settings and temporarily switch to UTF-8.
$oldOutputEncoding = $OutputEncoding; $oldConsoleEncoding = [Console]::OutputEncoding
$OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.Utf8Encoding
# Call the UTF-8 program, using Node.js as an example.
# This should echo '€' (`U+20AC`) as-is and report the length as *1*.
$captured = '€' | node -pe "require('fs').readFileSync(0).toString().trim()"
$captured; $captured.Length
# Restore the previous settings.
$OutputEncoding = $oldOutputEncoding; [Console]::OutputEncoding = $oldConsoleEncoding
Problems on older versions of Windows (pre-W10):
An active chcp value of 65001 breaking the console output of some external programs and even batch files in general in older versions of Windows may ultimately have stemmed from a bug in the WriteFile() Windows API function (as also used by the standard C library), which mistakenly reported the number of characters rather than bytes with code page 65001 in effect, as discussed in this blog post.
The resulting symptoms, according to a comment by bobince on this answer from 2008, are: "My understanding is that calls that return a number-of-bytes (such as fread/fwrite/etc) actually return a number-of-characters. This causes a wide variety of symptoms, such as incomplete input-reading, hangs in fflush, the broken batch files and so on."
Superior alternatives to the native Windows console (terminal), conhost.exe
eryksun suggests two alternatives to the native Windows console windows (conhost.exe), which provider better and faster Unicode character rendering, due to using the modern, GPU-accelerated DirectWrite/DirectX API instead of the "old GDI implementation [that] cannot handle complex scripts, non-BMP characters, or automatic fallback fonts."
Microsoft's own, open-source Windows Terminal, which is distributed and updated via the Microsoft Store since Windows 10 - see here for an introduction.
Long-established third-party alternative ConEmu, which has the advantage of working on older Windows versions too.
[1] Note that running chcp 65001 from inside a PowerShell session is not effective, because .NET caches the console's output encoding on startup and is unaware of later changes made with chcp (only changes made directly via [console]::OutputEncoding] are picked up).
[2] I am unclear on how that manifests in practice; do tell us, if you know.

Elaborated Alexander Martin's answer. For testing purposes, I have created some folders and files with valid names from different Unicode subranges as follows:
For instance, with Courier New console font, replacement symbols are displayed instead of CJK characters in a PowerShell console:
On the other hand, with SimSun console font, (poorly visible) replacement symbols are displayed instead of Arabic and Hebrew characters while CJK chars seem to be displayed correct:
Please note that all replacement symbols are merely displayed whereas real characters are preserved as you can see in the following Copy&Paste from above PowerShell console:
PS D:\PShell> (Get-ChildItem 'D:\bat\UnASCII Names\' -Dir).Name
Arabic (عَرَبِيّ‎)
CJK (中文(繁體))
Czech (Čeština)
Greek (Γρεεκ)
Hebrew (עִבְרִית)
Japanese (日本語)
MathBoldScript (𝓜𝓪𝓽𝓱𝓑𝓸𝓵𝓭𝓢𝓬𝓻𝓲𝓹𝓽)
Russian (русский язык)
Türkçe (Türkiye)
‹angles›
☺☻♥♦
For the sake of completeness, here are appropriate registry values to Enable More Fonts for the Windows Command Prompt (this works for the Windows PowerShell console as well):
(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont' |
Select-Object -Property [0-9]* | Out-String).Split(
[System.Environment]::NewLine,
[System.StringSplitOptions]::RemoveEmptyEntries) |
Sort-Object
Sample output:
0 : Consolas
00 : Source Code Pro
000 : DejaVu Sans Mono
0000 : Courier New
00000 : Simplified Arabic Fixed
000000 : Unifont
0000000 : Lucida Console
932 : *ＭＳ ゴシック
936 : *新宋体

If you install Microsoft's "Windows Terminal" from the Microsoft Store (or the Preview version), it comes pre-configured for full Unicode localization.
You still can't enter commands with special characters... unless you use WSL! 😍

The Powershell ISE is an option for displaying foreign characters: korean.txt is a UTF8 encoded file:
PS C:\Users\js> get-content korean.txt
The Korean language (South Korean: 한국어/韓國語 Hangugeo; North
Korean: 조선말/朝鮮말 Chosŏnmal) is an East Asian language
spoken by about 77 million people.[3]

I was facing a similar challenge, working with the AMAZON translate service. I installed the terminal from Windows store and it works for me now!

Make sure you have a font containing all the problematic characters installed and set as your Win32 Console font. If I remember right, click the PowerShell icon in the top-left corner of the window and pick Properties. The resulting popup dialog should have an option to set the font used. It might have to be a bitmap (.FON or .FNT) font.

Just registered just to clear up the confusion about why "Lucida Console" as font is working in Powershell ISE. Unfortunately I am not able to comment due to missing reputation, so here as Answer:
In normal powershell all characters are displayed in the configured font. Thats why e.g. chinese or cyrillic characters are broken with "Lucida Console" and many other fonts.
For chinese characters Powershell ISE changes the font automatically to "DengXian".
You can find out which alternative Font is used for your special character by copying them to Word or a similar program which is capable of displaying different fonts.

Change Unicode to UTF-8 | PowerShell script

When I use Write-Host in my PowerShell script, the output looks like this: ????? ?????.
This happens because I'm entering strings in Arabic with Write-Host, and it seems that PowerShell doesn't support Arabic...
How do I print text using Write-Host, but in unicode UTF-8 (which supports Arabic).
Example: Write-Host "مرحباً بالعالم"
The output in this case will be: ????? ?????
Any solutions?

Fixed
You need to set a font that supports those characters. Like Cascadia Code PL
Note: The non-PL version didn't work, so get the PL one.
You might have to set the console encoding as well. Unless you really need a different encoding, default to utf8 is a good idea.
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
Before
I didn't actually need wt here, I'd suggest it. windowsterminal is a modern term, which is now an inbox app. Meaning it's default on windows going forward.
There's utf8 that doesn't work on the term, that wt supports (both using the cascadia code pl font

Displaying Unicode in Powershell

What I'm trying to achieve should be rather straightforward although Powershell is trying to make it hard.
I want to display the full path of files, some with Arabic, Chinese, Japanese and Russian characters in their names
I always get some undecipherable output, such as the one shown below
The output seen in console is being consumed as is by another script.
The output contains ? instead of the actual characters.
The command executed is
(Get-ChildItem -Recurse -Path "D:\test" -Include *unicode* | Get-ChildItem -Recurse).FullName
Is there any easy way to launch powershell (via command line or in any fashion that can be written into a script) such that the output is seen correctly.
P.S. I've gone through many similar questions on Stack Overflow but none of them have much input other than calling it a Windows Console Subsystem issue.

Note:
On Windows, with respect to rendering Unicode characters, it is primarily the choice of font / console (terminal) application that matters.
Nowadays, using Windows Terminal, which is distributed and updated via the Microsoft Store since Windows 10, is a good replacement for the legacy console host (console windows provided by conhost.exe), providing superior Unicode character support. In Windows 11 22H2, Windows Terminal even became the default console (terminal).
With respect to programmatically processing Unicode characters when communicating with external programs, $OutputEncoding, [Console]::InputEncoding and [Console]::OutputEncoding matter too - see below.
The PowerShell Core (v6+) perspective (see next section for Windows PowerShell), irrespective of character rendering issues (also covered in the next section), with respect to communicating with external programs:
On Unix-like platforms, PowerShell Core uses UTF-8 by default (typically, these days, given that modern Unix-like platforms use UTF-8-based locales).
On Windows, it is the legacy system locale, via its OEM code page, that determines the default encoding in all consoles, including both Windows PowerShell and PowerShell Core console windows, though recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8); note that the feature is still in beta as of this writing, and using it has far-reaching consequences - see this answer.
If you do use that feature, PowerShell Core console windows will then automatically be UTF-8-aware, though in Windows PowerShell you'll still have to set $OutputEncoding to UTF-8 too (which in Core already defaults to UTF-8), as shown below.
Otherwise - notably on older Windows versions - you can use the same approach as detailed below for Windows PowerShell.
Making your Windows PowerShell console window Unicode (UTF-8) aware:
Pick a TrueType (TT) font that supports the specific scripts (writing systems, alphabets) whose characters you want to display properly in the console:
Important: While all TrueType fonts support Unicode in principle, they usually only support a subset of all Unicode characters, namely those corresponding to specific scripts (writing systems), such as the Latin script, the Cyrillic (Russian) script, ...
In your particular case - if you must support Arabic as well as Chinese, Japanese and Russian characters - your only choice is SimSun-ExtB, which is available on Windows 10 only.
See Wikipedia for a list of what Windows fonts target what scripts (alphabets).
To change the font, click on the icon in the top-left corner of the window and select Properties, then change to the Fonts tab and select the TrueType font of interest.
See this SU answer by not2quibit for how to make additional fonts available.
Additionally, for proper communication with external programs:
The console window's code page must be switched to 65001, the UTF-8 code page (which is usually done with chcp 65001, which, however, cannot be used directly from within a PowerShell session[1], but the PowerShell command below has the same effect).
Windows PowerShell must be instructed to use UTF-8 to communicate with external utilities too, both when sending pipeline input to external programs, via it $OutputEncoding preference variable (on decoding output from external programs, it is the encoding stored in [console]::OutputEncoding that is applied).
The following magic incantation in Windows PowerShell does this (as stated, this implicitly performs chcp 65001):
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding =
New-Object System.Text.UTF8Encoding
To persist these settings, i.e., to make your future interactive PowerShell sessions UTF-8-aware by default, add the command above to your $PROFILE file.
Note: Recent versions of Windows 10 now allow setting the system locale to code page 65001 (UTF-8) (the feature is still in beta as of Window 10 version 1903), which makes all console windows default to UTF-8, including Windows PowerShell's.
If you do use that feature, setting [console]::InputEncoding / [console]::OutputEncoding is then no longer strictly necessary, but you'll still have to set $OutputEncoding (which is not necessary in PowerShell Core, where $OutputEncoding already defaults to UTF-8).
Important:
These settings assume that any external utilities you communicate with expect UTF-8-encoded input and produce UTF-8 output.
CLIs written in Node.js fulfill that criterion, for instance.
Python scripts - if written with UTF-8 support in mind - can handle UTF-8 too.
By contrast, these settings can break (older) utilities that only expect a single-byte encoding as implied by the system's legacy OEM code page.
Up to Windows 8.1, this even included standard Windows utilities such as find.exe and findstr.exe, which have been fixed in Windows 10.
See the bottom of this post for how to bypass this problem by switching to UTF-8 temporarily, on demand for invoking a given utility.
These settings apply to external programs only and are unrelated to the encodings that PowerShell's cmdlets use on output:
See this answer for the default character encodings used by PowerShell cmdlets; in short: If you want cmdlets in Windows PowerShell to default to UTF-8 (which PowerShell [Core] v6+ does anyway), add $PSDefaultParameterValues['*:Encoding'] = 'utf8' to your $PROFILE, but note that this will affect all calls to cmdlets with an -Encoding parameter in your sessions, unless that parameter is used explicitly; also note that in Windows PowerShell you'll invariably get UTF-8 files with BOM; conversely, in PowerShell [Core] v6+, which defaults to BOM-less UTF-8 (both in the absence of -Encoding and with -Encoding utf8, you'd have to use 'utf8BOM'.
Optional background information
Tip of the hat to eryksun for all his input.
While a TrueType font is active, the console-window buffer correctly preserves (non-ASCII) Unicode chars. even if they don't render correctly; that is, even though they may appear generically as ?, so as to indicate lack of support by the current font, you can copy & paste such characters elsewhere without loss of information, as eryksun notes.
PowerShell is capable of outputting Unicode characters to the console even without having switched to code page 65001 first.
However, that by itself does not guarantee that other programs can handle such output correctly - see below.
When it comes to communicating with external programs via stdout (piping), PowersShell uses the character encoding specified in the $OutputEncoding preference variable, which defaults to ASCII(!) in Windows PowerShell, which means that any non-ASCII characters are transliterated to literal ? characters, resulting in information loss. (By contrast, commendably, PowerShell Core (v6+) now uses (BOM-less) UTF-8 as the default encoding, consistently.)
By contrast, however, passing non-ASCII arguments (rather than stdout (piped) output) to external programs seems to require no special configuration (it is unclear to me why that works); e.g., the following Node.js command correctly returns €: 1 even with the default configuration:
node -pe "process.argv[1] + ': ' + process.argv[1].length" €
[Console]::OutputEncoding:
controls what character encoding is assumed when the console translates program output into console display characters.
also tells PowerShell what encoding to assume when capturing output from an external program.
The upshot is that if you need to capture output from an UTF-8-producing program, you need to set [Console]::OutputEncoding to UTF-8 as well; setting $OutputEncoding only covers the input (to the external program) aspect.
[Console]::InputEncoding sets the encoding for keyboard input into a console[2] and also determines how PowerShell's CLI interprets data it receives via stdin (standard input).
If switching the console to UTF-8 for the entire session is not an option, you can do so temporarily, for a given call:
# Save the current settings and temporarily switch to UTF-8.
$oldOutputEncoding = $OutputEncoding; $oldConsoleEncoding = [Console]::OutputEncoding
$OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.Utf8Encoding
# Call the UTF-8 program, using Node.js as an example.
# This should echo '€' (`U+20AC`) as-is and report the length as *1*.
$captured = '€' | node -pe "require('fs').readFileSync(0).toString().trim()"
$captured; $captured.Length
# Restore the previous settings.
$OutputEncoding = $oldOutputEncoding; [Console]::OutputEncoding = $oldConsoleEncoding
Problems on older versions of Windows (pre-W10):
An active chcp value of 65001 breaking the console output of some external programs and even batch files in general in older versions of Windows may ultimately have stemmed from a bug in the WriteFile() Windows API function (as also used by the standard C library), which mistakenly reported the number of characters rather than bytes with code page 65001 in effect, as discussed in this blog post.
The resulting symptoms, according to a comment by bobince on this answer from 2008, are: "My understanding is that calls that return a number-of-bytes (such as fread/fwrite/etc) actually return a number-of-characters. This causes a wide variety of symptoms, such as incomplete input-reading, hangs in fflush, the broken batch files and so on."
Superior alternatives to the native Windows console (terminal), conhost.exe
eryksun suggests two alternatives to the native Windows console windows (conhost.exe), which provider better and faster Unicode character rendering, due to using the modern, GPU-accelerated DirectWrite/DirectX API instead of the "old GDI implementation [that] cannot handle complex scripts, non-BMP characters, or automatic fallback fonts."
Microsoft's own, open-source Windows Terminal, which is distributed and updated via the Microsoft Store since Windows 10 - see here for an introduction.
Long-established third-party alternative ConEmu, which has the advantage of working on older Windows versions too.
[1] Note that running chcp 65001 from inside a PowerShell session is not effective, because .NET caches the console's output encoding on startup and is unaware of later changes made with chcp (only changes made directly via [console]::OutputEncoding] are picked up).
[2] I am unclear on how that manifests in practice; do tell us, if you know.

Elaborated Alexander Martin's answer. For testing purposes, I have created some folders and files with valid names from different Unicode subranges as follows:
For instance, with Courier New console font, replacement symbols are displayed instead of CJK characters in a PowerShell console:
On the other hand, with SimSun console font, (poorly visible) replacement symbols are displayed instead of Arabic and Hebrew characters while CJK chars seem to be displayed correct:
Please note that all replacement symbols are merely displayed whereas real characters are preserved as you can see in the following Copy&Paste from above PowerShell console:
PS D:\PShell> (Get-ChildItem 'D:\bat\UnASCII Names\' -Dir).Name
Arabic (عَرَبِيّ‎)
CJK (中文(繁體))
Czech (Čeština)
Greek (Γρεεκ)
Hebrew (עִבְרִית)
Japanese (日本語)
MathBoldScript (𝓜𝓪𝓽𝓱𝓑𝓸𝓵𝓭𝓢𝓬𝓻𝓲𝓹𝓽)
Russian (русский язык)
Türkçe (Türkiye)
‹angles›
☺☻♥♦
For the sake of completeness, here are appropriate registry values to Enable More Fonts for the Windows Command Prompt (this works for the Windows PowerShell console as well):
(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont' |
Select-Object -Property [0-9]* | Out-String).Split(
[System.Environment]::NewLine,
[System.StringSplitOptions]::RemoveEmptyEntries) |
Sort-Object
Sample output:
0 : Consolas
00 : Source Code Pro
000 : DejaVu Sans Mono
0000 : Courier New
00000 : Simplified Arabic Fixed
000000 : Unifont
0000000 : Lucida Console
932 : *ＭＳ ゴシック
936 : *新宋体

If you install Microsoft's "Windows Terminal" from the Microsoft Store (or the Preview version), it comes pre-configured for full Unicode localization.
You still can't enter commands with special characters... unless you use WSL! 😍

The Powershell ISE is an option for displaying foreign characters: korean.txt is a UTF8 encoded file:
PS C:\Users\js> get-content korean.txt
The Korean language (South Korean: 한국어/韓國語 Hangugeo; North
Korean: 조선말/朝鮮말 Chosŏnmal) is an East Asian language
spoken by about 77 million people.[3]

I was facing a similar challenge, working with the AMAZON translate service. I installed the terminal from Windows store and it works for me now!

Make sure you have a font containing all the problematic characters installed and set as your Win32 Console font. If I remember right, click the PowerShell icon in the top-left corner of the window and pick Properties. The resulting popup dialog should have an option to set the font used. It might have to be a bitmap (.FON or .FNT) font.

Just registered just to clear up the confusion about why "Lucida Console" as font is working in Powershell ISE. Unfortunately I am not able to comment due to missing reputation, so here as Answer:
In normal powershell all characters are displayed in the configured font. Thats why e.g. chinese or cyrillic characters are broken with "Lucida Console" and many other fonts.
For chinese characters Powershell ISE changes the font automatically to "DengXian".
You can find out which alternative Font is used for your special character by copying them to Word or a similar program which is capable of displaying different fonts.

How to do proper Unicode and ANSI output redirection on cmd.exe?

If you are doing automation on windows and you are redirecting the output of different commands (internal cmd.exe or external, you'll discover that your log files contains combined Unicode and ANSI output (meaning that they are invalid and will not load well in viewers/editors).
Is it is possible to make cmd.exe work with UTF-8? This question is not about display, s about stdin/stdout/stderr redirection and Unicode.
I am looking for a solution that would allow you to:
redirect the output of the internal commands to a file using UTF-8
redirect output of external commands supporting Unicode to the files but encoded as UTF-8.
If it is impossible to obtain this kind of consistence using batch files, is there another way of solving this problem, like using python scripting for this? In this case, I would like to know if it is possible to do the Unicode detection alone (user using the scripting should not remember if the called tools will output Unicode or not, it will just expect to convert the output to UTF-8.
For simplicity we'll assume that if the tool output is not-Unicode it will be considered as UTF-8 (no codepage conversion).

You can use chcp to change the active code page. This will be used for redirecting text as well:
chcp 65001
Keep in mind, though, that this will have no effect if cmd was started with the /u switch which forces Unicode (UTF-16 in this case) redirection output. If that switch is active then all output will be in UTF-16LE, regardless of the codepage set with chcp.
Also note that the console will be unusable for interactive output when set to Raster Fonts. I'm getting fun error messages in that case:
C:\Users\Johannes Rössel\Documents>x
Active code page: 65001
The system cannot write to the specified device.
So either use a sane setup (TrueType font for the console) or don't pull this stunt when using the console interactively and having a path that contains non-ASCII characters.

binmode(STDOUT, ":unix");
without
use encoding 'utf8';
Helped me. With that i had wide character in print warning.

ja chars in windows batch file

What is the secret to japanese characters in a Windows XP .bat file?
We have a script for open a file off disk in kiosk mode:
#ECHO OFF
"%ProgramFiles%\Internet Explorer\iexplore.exe" –K "%CD%\XYZ.htm"
It works fine when the OS is english, and it works fine for the japanese OS when XYZ is made up of english characters, but when XYZ is made up of japanese characters, they are getting mangled into gibberish by the time IE tries to find the file.
If the batch file is saved as Unicode or Unicode big endian the script wont even run.
I have tried various ways of encoding the japanese characters. ampersand escape does not work (〹)
Percent escape does not work %xx%xx%xx
ABC works, AB%43 becomes AB3 in the error message, so it looks like the percent escape is trying to do parameter substitution. This is confirmed because %043 puts in the name of the script !
One thing that does work is pasting the ja characters into a command prompt.
#ECHO OFF
CD "%ProgramFiles%\Internet Explorer\"
Set /p URL ="file to open: "
start iexplore.exe –K %URL%
This tells me that iexplore.exe will accept and parse the parameter correctly when it has ja characters, but not when they are written into the script.
So it would be nice to know what the secret may be to getting the parameter into IE successfully via the batch file, as opposed to via the clipboard and an environment variable.
Any suggestions greatly appreciated !
best regards
Richard Collins
P.S.
another post has has made this suggestion, which i am yet to follow up:
You might have more luck in cmd.exe if you opened it in UNICODE mode. Use "cmd /U".
Batch renaming of files with international chars on Windows XP
I will need to find out if this can be from inside the script.

For the record, a simple answer has been found for this question.
If the batch file is saved as ANSI - it works !

First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in. UTF-16 is out anyway, since cmd won't even parse that.
I have detailed an option in my answer to the following question:
Batch file encoding
which might be helpful for your needs.
In principle it boils down to the following:
Use an encoding which has single-byte mappings for ASCII
Put a chcp ... at the start of the batch file
Use the set codepage for the rest of the file
You can use codepage 65001, which is UTF-8 but make sure that your file doesn't include the U+FEFF character at the start (used as byte-order mark in UTF-16 and UTF-32 and sometimes used as marker for UTF-8 files as well). Otherwise the first command in the file will produce an error message.
So just use the following:
echo off
chcp 65001
"%ProgramFiles%\Internet Explorer\iexplore.exe" –K "%CD%\XYZ.htm"
and save it as UTF-8 without BOM (Note: Notepad won't allow you to do that) and it should work.
cmd /u won't do anything here, that advice is pretty much bogus. The /U switch only specifies that Unicode will be used for redirection of input and output (and piping). It has nothing to do with the encoding the console uses for output or reading batch files.
URL encoding won't help you either. cmd is hardly a web browser and outside of HTTP and the web URL encoding isn't exactly widespread (hence the name). cmd uses percent signs for environment variables and arguments to batch files and subroutines.
"Ampersand escape" also known as character entities known from HTML and XML, won't work either, because cmd is also not HTML or XML. The ampersand is used to execute multiple commands in a single line.

I too suffered this frustrating problem in batch/cmd files. However, so far as I can see, no one yet has stated the reason why this problem occurs, here or in other, similar posts at StackOverflow. The nearest statement addressing this was:
“First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in.”
Here is the basic problem. Cmd files are the Windows-2000+ successor to MS-DOS and IBM-DOS bat(ch) files. MS and IBM DOS (1984 vintage) were written in the IBM-PC character set (code page 437). There, the 8th-bit codes were assigned (or “clothed” with) characters different from those assigned to the corresponding codes of Windows, ANSI, or Unicode. The presumption of CP437 encoding is unalterable (except, as previously noted, through cmd.exe /u). Where the characters of the IBM-PC set have exact counterparts in the Unicode set, Windows Explorer remaps them to the Unicode counterparts. Alas, even Windows-1252 characters like š and ¾ have no counterpart in code page 437.
Here is another way to see the problem. Try opening your batch/cmd script using the Windows Edit.com program (at C:\Windows\system32\Edit.com). The Windows-1252 character 0145 ‘ (Unicode 8217) instead appears as IBM-PC 145 æ. A batch command to rename Mary'sFile.txt as Mary’sFile.txt fails, as it is interpreted as MaryæsFile.txt.
This problem can be avoided in the case of copying a file named Mary’sFile.txt: cite it as Mary?sFile.txt, e.g.:
xCopy Mary?sFile.txt Mary?sLastFile.txt
You will see a similar treatment (substitution of question marks) in a DIR list of files having Unicode characters.
Obviously, this is useless unless an extant file has the Unicode characters. This solution’s range is paltry and inadequate, but please make what use of it you can.

You can try to use Shift-JIS encoding.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse