UTF8 Script in PowerShell outputs incorrect characters

UTF8 Script in PowerShell outputs incorrect characters - powershell

I've created a UTF8 script for PowerShell with non-ascii characters.
characters.ps1:
Write-Host "ç â ã á à"
When the script is run in PowerShell console, it outputs wrong characters.
However, if I write the chars directly in the console, they are shown as expected:
Does anyone knows what causes that behavior?
The problem arised from a script I wrote who has hardcoded paths which include non-ascii characters. When I try to pass the path as argument to cmdlets (in the case I was gonna robocopy a folder) the command fails because it cannot find the path (which is output wrongly in the screen).

Changing the encoding of the script to UTF-8 with BOM solved the issue.
I was using SublimeText with the EncodingHelper plugin to control the character-set of the script. It was set correctly to UTF8.
I changed the encoding of the script in SublimeText to "UTF-8 with BOM" and the output was shown correctly.
I created the same script with Notepad++, which defaults to "UTF-8 with BOM", and the string was shown correctly in the console.
I changed the encoding of the script in Notepad++ to "UTF-8 without BOM" and it was shown incorrectly.
It seems PowerShell cannot guess correctly the encoding of UTF-8 files with no BOM.

In my case the problem was caused by creating a new PowerShell script with Visual Studio Code which has the default encoding of UTF-8 without BOM. Set the encoding to "Windows 1252" solved the problem.
It seems that PowerShell can't handle UTF-8 without BOM, it needs "Windows 1252" or "UTF8 with BOM" encodings.

try this before invoking your script :
$OutputEncoding = [Console]::OutputEncoding

There is a reliable way to detect utf8nobom (https://unicodebook.readthedocs.io/guess_encoding.html). Like a lot of other little things, this seems to work better in PS 6. Even my beloved emacs 25 for windows gets the encoding wrong.
PS C:\users\admin> pwsh
PowerShell 6.1.0
Copyright (c) Microsoft Corporation. All rights reserved.
https://aka.ms/pscore6-docs
Type 'help' to get help.
PS C:\users\admin> "write-host 'ç â ã á à'" | set-content -Encoding utf8NoBOM accent.ps1
PS C:\users\admin> .\accent
ç â ã á à

Related

How to save to file non-ascii output of program in Powershell?

I want to run program in Powershell and write output to file with UTF-8 encoding.
However I can't write non-ascii characters properly.
I already read many similar questions on Stack overflow, but I still can't find answer.
I tried both PowerShell 5.1.19041.1023 and PowerShell Core 7.1.3, they differently encode output file, but content is broken in the same way.
I tried simple programs in Python and Golang:
(Please assume that I can't change source code of programs)
Python
print('Hello ąćęłńóśźż world')
Results:
python hello.py
Hello ąćęłńóśźż world
python hello.py > file1.txt
Hello ╣Šŕ│˝ˇťč┐ world
python hello.py | out-file -encoding utf8 file2.ext
Hello ╣Šŕ│˝ˇťč┐ world
On cmd:
python hello.py > file3.txt
Hello ����󜟿 world
Golang
package main
import "fmt"
func main() {
fmt.Printf("Hello ąćęłńóśźż world\n")
}
Results:
go run hello.go:
Hello ąćęłńóśźż world
go run hello.go > file4.txt
Hello ─ů─ç─Ö┼é┼ä├│┼Ť┼║┼╝ world
go run hello.go | out-file -encoding utf8 file5.txt
Hello ─ů─ç─Ö┼é┼ä├│┼Ť┼║┼╝ world
On cmd it works ok:
go run hello.go > file6.txt
Hello ąćęłńóśźż world

You should set the OutputEncoding property of the console first.
In PowerShell, enter this line before running your programs:
[Console]::OutputEncoding = [Text.Encoding]::Utf8
You can then use Out-File with your encoding type:
py hello.py | Out-File -Encoding UTF8 file2.ext
go run hello.go | Out-File -Encoding UTF8 file5.txt

Note: These character-encoding problems only plague PowerShell on Windows, in both editions. On Unix-like platforms, UTF-8 is consistently used.[1]
Quicksilver's answer is fundamentally correct:
It is the character encoding stored in [Console]::OutputEncoding that determines how PowerShell decodes text received from external programs[2] - and note that it invariably interprets such output as text (strings).
[Console]::OutputEncoding by default reflects a console's active code page, which itself defaults to the system's active OEM code page, such as 437 (CP437) on US-English systems.
The standard chcp program also reports the active OEM code page, and while it can in principle also be used to change it for the active console (e.g., chcp 65001), this does not work from inside PowerShell, due to .NET caching the encodings.
Therefore, you may have to (temporarily) set [Console]::OutputEncoding to match the actual character encoding used by a given external console program:
While many console programs respect the active console code page (in which case no workarounds are required), some do not, typically in order to provide full Unicode support. Note that you may not notice a problem until you programmatically process such a program's output (meaning: capturing in a variable, sending through the pipeline to another command, redirection to a file), because such a program may detect the case when its stdout is directly connected to the console and may then selectively use full Unicode support for display.
Notable CLIs that do not respect the active console code page:
Python exhibits nonstandard behavior in that it uses the active ANSI code page by default, i.e. the code page normally only used by non-Unicode GUI-subsystem applications.
However, you can use $env:PYTHONUTF8=1 before invoking Python scripts to instruct Python to use UTF-8 instead (which then applies to all Python calls made from the same process); in v3.7+, you can alternatively pass command-line option -X utf8 (case-sensitive) as a per-call opt-in.
Go and also Node.js invariably use UTF-8 encoding.
The following snippet shows how to set [Console]::OutputEncoding temporarily as needed:
# Save the original encoding.
$orig = [Console]::OutputEncoding
# Work with console programs that use UTF-8 encoding,
# such as Go and Node.js
[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
# Piping to Write-Output is a dummy operation that forces
# decoding of the external program's output, so that encoding problems would show.
go run hello.go | Write-Output
# Work with console programs that use ANSI encoding, such as Python.
# As noted, the alternative is to configure Python to use UTF-8.
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP))
python hello.py | Write-Output
# Restore the original encoding.
[Console]::OutputEncoding = $orig
Your own answer provides an effective alternative, but it comes with caveats:
Activating the Use Unicode UTF-8 for worldwide language support feature via Control Panel (or the equivalent registry settings) changes the code pages system-wide, which affects not only all console windows and console applications, but also legacy (non-Unicode) GUI-subsystem applications, given that both the OEM and the ANSI code pages are being set.
Notable side effects include:
Windows PowerShell's default behavior changes, because it uses the ANSI code page both to read source code and as the default encoding for the Get-Content and Set-Content cmdlets.
For instance, existing Windows PowerShell scripts that contain non-ASCII range characters such as é will then misbehave, unless they were saved as UTF-8 with a BOM (or as "Unicode", UTF-16LE, which always has a BOM).
By contrast, PowerShell (Core) v6+ consistently uses (BOM-less) UTF-8 to begin with.
Old console applications may break with 65001 (UTF-8) as the active OEM code page, as they may not be able to handle the variable-length encoding aspect of UTF-8 (a single character can be encoded by up to 4 bytes).
See this answer for more information.
[1] The cross-platform PowerShell (Core) v6+ edition uses (BOM-less) UTF-8 consistently. While it is possible to configure Unix terminals and thereby console (terminal) applications to use a character encoding other than UTF-8, doing so is rare these days - UTF-8 is almost universally used.
[2] By contrast, it is the $OutputEncoding preference variable that determines the encoding used for sending text to external programs, via the pipeline.

Solution is to enable Beta: Use Unicode UTF-8 for worldwide language support as described in What does "Beta: Use Unicode UTF-8 for worldwide language support" actually do?
Note: this solution may cause problems with legacy programs. Please read answer by mklement0 and answer by Quciksilver for details and alternative solutions.
Also I found explanation written by Ghisler helpful (source):
If you check this option, Windows will use codepage 65001 (Unicode
UTF-8) instead of the local codepage like 1252 (Western Latin1) for
all plain text files. The advantage is that text files created in e.g.
Russian locale can also be read in other locale like Western or
Central Europe. The downside is that ANSI-Only programs (most older
programs) will show garbage instead of accented characters.
Also Powershell before version 7.1 has a bug when this option is enabled. If you enable it , you may want to upgrade to version 7.1 or later.
I like this solution because it's enough to set it once and it's working. It brings consistent Unix-like UTF-8 behaviour to Windows. I hope I will not see any issues.
How to enable it:
Win+R → intl.cpl
Administrative tab
Click the Change system locale button
Enable Beta: Use Unicode UTF-8 for worldwide language support
Reboot
or alternatively via reg file:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"ACP"="65001"
"OEMCP"="65001"
"MACCP"="65001"

Is there a way to make VS Code not replace unknown text characters?

I'm currently using VS code to write a PowerShell script. As part of this script REGEX is used to replace/remove an atypical character that ends up in the data fairly often and causes trouble down the line. The character is (U+2019) and when the script is opened in code it is replaced permanently with (U+FFFD)
thus the line:
$user.Name = $user.Name -Replace "'|\’|\(|\)|\s+",""
Permanently becomes: $user.Name = $user.Name -Replace "'|\�|\(|\)|\s+",""
until it is manually changed. Seeing as I can paste the U+2019 character in once the file is open and then run the code, I assume that VS code can interpret it okay and the problem is with loading the file in. Is there some option that I can set to stop this being replaced when I open the file?

In my case, turning on the VS Code setting, "Files: Auto Guess Encoding," has fixed the problem, both for reading and saving.

This looks like it all comes down to encoding. Visual Studio Code by default uses UTF-8 and can in general handle saving/viewing Unicode properly.
If the issue is on Opening the file, then is is a case where Visual Studio Code is misinterpreting the file encoding on Opening the file. You can change the encoding (Configuring VS Code encoding) via settings in VS Code for file specific encoding (e.g. UTF-8, UTF-8BOM, UTF-16LE,etc.) by changing the "files.encoding" setting.
"files.encoding": "utf8bom"
If the issue is on saving the file, then it is being saved as ASCII(aka. Windows-1252) and not as proper UTF-8 or equivalent. On save, the character is replaced with the Replacement Character (U+FFFD) which would be displayed on the next time it is opened.
Note: The default encoding used for Windows PowerShell v5.1 is Windows-1252, and may be why saving the scripts with special characters may not work. PowerShell Core v6+ uses UTF-8 by default.

If I save in Vscode as Windows 1252 encoding, I see the character "’" change to � on next opening. I think the problem is Vscode doesn't recognize Windows 1252. It opens it as UTF8. If you reopen with the Windows 1252 encoding, it displays correctly. The other encodings work fine, even to display the character. This includes utf8 no bom.
Even Powershell 5 doesn't have this problem with Windows 1252, only Vscode. Set-content and get-content in Powershell 5 default to Windows 1252.
"’" | set-content file
get-content file
’
Powershell 7 would actually have the same problem:
get-content file
�

powershell base64 encoding different results

below are the two commands for same purpose of doing a base64 encoding of a credtial.
from windows commandline :
powershell "[convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes(\"ATSxxx0101:urSY13sm\"))"
result QVRTVFNHMDEwMTp1clNZMTNzbQ==
from powershell :
[Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes('ATSxxx0101:urSY13sm'))
result : QQBUAFMAVABTAEcAMAAxADAAMQA6AHUAcgBTAFkAMQAzAHMAbQA=
result from windows command line , is working fine but the result powershell is wrong . but my tool can accept only the powershell command . direct windows command is not working. any idea experts ?

The reason is that [Text.Encoding]::UTF8 is not the same as [System.Text.Encoding]::Unicode. The difference is not missing System but UTF/Unicode.
Encoding.Unicode gets an encoding for the UTF-16 format using the little endian byte order.
Encoding.UTF8 gets an encoding for the UTF-8 format.

Output to text file with cyrillic content

Trying to get an output through cmd with the list of folders and files inside a drive.
Some folders are written in cyrillic alphabet so I only get ??? symbols.
My command:
tree /f /a |clip
or
tree /f /a >output.txt
Result:
\---???????????
\---2017 - ????? ??????? ????
01. ?????.mp3
02. ? ???????.mp3
03. ????.mp3
04. ?????? ? ???.mp3
05. ?????.mp3
06. ???? ?????.mp3
07. ???????? ????.mp3
08. ??? ?? ?????.mp3
Cover.jpg
Any idea?

tree.com uses the native UTF-16 encoding when writing to the console, just like cmd.exe and powershell.exe. So at first you'd expect redirecting the output to a file or pipe to also use Unicode. But tree.com, like most command-line utilities, encodes output to a pipe or disk file using a legacy codepage. (Speaking of legacy, the ".com" in the filename here is historical. In 64-bit Windows it's a regular 64-bit executable, not 16-bit DOS code.)
When writing to a pipe or disk file, some programs hard code the system ANSI codepage (e.g. 1252 in Western Europe) or OEM codepage (e.g. 850 in Western Europe), while some use the console's current output codepage (if attached to a console), which defaults to OEM. The latter would be great because you can change the console's output codepage to UTF-8 via chcp.com 65001. Unfortunately tree.com uses the OEM codepage, with no option to use anything else.
cmd.exe, on the other hand, at least provides a /u option to output its built-in commands as UTF-16. So, if you don't really need tree-formatted output, you could simply use cmd's dir command. For example:
cmd /u /c "dir /s /b" | clip
If you do need tree-formatted output, one workaround would be to read the output from tree.com directly from a console screen buffer, which can be done relatively easily for up to 9,999 lines. But that's not generally practical.
Otherwise PowerShell is probably your best option. For example, you could modify the Show-Tree script to output files in addition to directories.

Powershell and UTF-8

I have an html file test.html created with atom which contains:
Testé encoding utf-8
When I read it with Powershell console (I'm using French Windows)
Get-Content -Raw test.html
I get back this:
TestÃ© encoding utf-8
Why is the accent character not printing correctly?

The Atom editor creates UTF-8 files without a pseudo-BOM by default (which is the right thing to do, from a cross-platform perspective).
Other popular cross-platform editors, such as Visual Studio Code and Sublime Text, behave the same way.
Windows PowerShell[1] only recognizes UTF-8 files with a pseudo-BOM.
In the absence of the pseudo-BOM, PowerShell interprets files as being formatted according to the system's legacy ANSI codepage, such as Windows-1252 on US systems, for instance.
(This is also the default encoding used by Notepad, which it calls "ANSI", not just when reading files, but also when creating them. Ditto for Windows PowerShell's Get-Content / Set-Content (where this encoding is called Default and is the actual default and therefore needn't be specified); by contrast, Out-File / > creates UTF-16LE-encoded files (Unicode) by default.)
Therefore, in order for Get-Content to recognize a BOM-less UTF-8 file correctly in Windows PowerShell, you must use -Encoding utf8.
[1] By contrast, the cross-platform PowerShell Core edition commendably defaults to UTF-8, consistently across cmdlets, both on reading and writing, so it does interpret UTF-8-encoded files correctly even without a BOM and by default also creates files without a BOM.

# Created a UTF-8 Sig File
notepad .\test.html
# Get File contents with/without -raw
cat .\test.html;Get-Content -Raw .\test.html
Testé encoding utf-8
Testé encoding utf-8
# Check Encoding to make sure
Get-FileEncoding .\test.html
utf8
As you can see, it definitely works in PowerShell v5 on Windows 10. I'd double check the file formatting and the contents of the file you created, as there may have been characters introduced which your editor might not pick up.
If you do not have Get-FileEncoding as a cmdlet in your PowerShell, here is an implementation you can run:
function Get-FileEncoding([Parameter(Mandatory=$True)]$Path) {
$bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)
if(!$bytes) { return 'utf8' }
switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
'^efbbbf' {return 'utf8'}
'^2b2f76' {return 'utf7'}
'^fffe' {return 'unicode'}
'^feff' {return 'bigendianunicode'}
'^0000feff' {return 'utf32'}
default {return 'ascii'}
}
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse