I have the following problems with a PowerShell script that runs inside a TFS build. Both problems are unrelated to TFS and can be reproduced in a simple PowerShell command-line window.
1) Completely unrelated to TFS: it seems PowerShell does not like German umlauts when output is sent through a pipe.
1a) This line of code works fine and all umlauts are shown correctly
.\TF.exe hist "$/Test" /recursive /collection:https://TestTFS/tfs/TestCollection /noprompt /version:C1~T
1b) This line messes with umlauts
.\TF.exe hist "$/Test" /recursive /collection:https://TestTFS/tfs/TestCollection /noprompt /version:C1~T | Out-String
Initially I tried Out-File with different encodings, only to find that the umlauts come out wrong with every one of them (UTF8, Unicode, UTF32, ...).
I really do not know how to extract a string from standard output and get the umlauts right.
2) When using Out-File or Out-String, each line in the output gets truncated after 80 characters, which seems to be the default screen buffer setting. How can I change that inside a PowerShell script, and why does it even have an impact when redirecting the output?
Problem number 2 is not a PowerShell problem. The TFS documentation says the following about the default /format parameter (i.e. /format:brief):
Some of the data may be truncated.
/format:detailed does not have that warning, but it returns more information, which you can process with Powershell before doing Out-String or Out-File.
tl;dr
The following should solve both your problems, which stem from tf.exe using ANSI character encoding rather than the expected OEM encoding, and from it truncating output by default:
If you're using Windows PowerShell (the Windows-only legacy edition of PowerShell with versions up to v5.1):
$correctlyCapturedOutput =
& {
$prev = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.Encoding]::Default
# Note the addition of /format:detailed
.\tf.exe hist '$/Test' /recursive /collection:https://TestTFS/tfs/TestCollection /noprompt /format:detailed /version:C1~T
[Console]::OutputEncoding = $prev
}
If you're using the cross-platform, install-on-demand PowerShell (Core) 7+:
Note: [System.Text.Encoding]::Default, which reports the active ANSI code page's encoding in Windows PowerShell, reports (BOM-less) UTF-8 in PowerShell (Core) (reflecting .NET Core's / .NET 5+'s behavior). Therefore, the active ANSI code page must be determined explicitly, which is most robustly done via the registry.
$correctlyCapturedOutput =
& {
$prev = [Console]::OutputEncoding
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(
[int] ((Get-ItemProperty HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP).ACP)
)
# Note the addition of /format:detailed
.\tf.exe hist '$/Test' /recursive /collection:https://TestTFS/tfs/TestCollection /noprompt /format:detailed /version:C1~T
[Console]::OutputEncoding = $prev
}
This Gist contains the helper function Invoke-WithEncoding, which can simplify the above in both PowerShell editions as follows:
$correctlyCapturedOutput =
Invoke-WithEncoding -Encoding Ansi {
.\tf.exe hist '$/Test' /recursive /collection:https://TestTFS/tfs/TestCollection /noprompt /format:detailed /version:C1~T
}
You can directly download and define the function with the following command (while I can personally assure you that doing so is safe, it is advisable to check the source code first):
# Downloads and defines function Invoke-WithEncoding in the current session.
irm https://gist.github.com/mklement0/ef57aea441ea8bd43387a7d7edfc6c19/raw/Invoke-WithEncoding.ps1 | iex
Read on for a detailed discussion.
Re the umlaut (character encoding) problem:
While the output from external programs may print OK to the console, when it comes to capturing the output in a variable or redirecting it - such as sending it through the pipeline to Out-String in your case - PowerShell decodes the output into .NET strings, using the character encoding stored in [Console]::OutputEncoding.
If [Console]::OutputEncoding doesn't match the actual encoding used by the external program, PowerShell will misinterpret the output.
The solution is to (temporarily) set [Console]::OutputEncoding to the actual encoding used by the external program.
While the official tf.exe documentation doesn't discuss character encodings, this comment on GitHub suggests that tf.exe uses the system's active ANSI code page, such as Windows-1252 on US-English or Western European systems.
It should be noted that the use of the ANSI code page is nonstandard behavior for a console application, because console applications are expected to use the system's active OEM code page. As an aside: python too exhibits this nonstandard behavior by default, though its behavior is configurable.
The solutions at the top show how to temporarily switch [Console]::OutputEncoding to the active ANSI code page's encoding in order to ensure that PowerShell correctly decodes tf.exe's output.
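To make the decoding mismatch concrete, here is a minimal Python sketch (not part of the original answer). It assumes a German system whose ANSI code page is Windows-1252 and whose OEM code page is 850. The child process is a hypothetical stand-in for tf.exe that writes ANSI-encoded bytes to stdout.

```python
import subprocess
import sys

# Hypothetical stand-in for tf.exe: writes the umlauts ä, ö, ü as
# Windows-1252 (ANSI) bytes straight to stdout.
CHILD_CODE = r"import sys; sys.stdout.buffer.write('\xe4\xf6\xfc'.encode('cp1252'))"

# Capture the raw bytes, as PowerShell does before decoding them.
raw = subprocess.run(
    [sys.executable, "-c", CHILD_CODE],
    capture_output=True,
    check=True,
).stdout

# Decoding with the OEM code page (assumed: 850) garbles the umlauts ...
as_oem = raw.decode("cp850")
# ... while decoding with the code page the program really used restores them.
as_ansi = raw.decode("cp1252")

# ascii() keeps the printout independent of the terminal's own encoding.
print(ascii(as_oem), ascii(as_ansi))
```

The bytes are identical in both cases; only the code page chosen for decoding differs, which is the role [Console]::OutputEncoding plays in PowerShell.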
Re output-line truncation with Out-String / Out-File (and therefore also > and >>):
As Mustafa Zengin's helpful answer points out, in your particular case - due to use of tf.exe - the truncation happens at the source, i.e. it is tf.exe itself that outputs truncated data per its default formatting (implied /format:brief when /noprompt is also specified).
In general, Out-String and Out-File / > / >> do situationally truncate or line-wrap their output lines based on the console-window width (with a default of 120 chars. in the absence of a console):
Truncation or line-wrapping applies only to output lines stemming from the representations of non-primitive, non-string objects generated by PowerShell's rich output-formatting system:
Strings themselves ([string] input) as well as the string representations of .NET primitive types (plus a few more single-value-only types) are not subject to truncation / line-wrapping.
Since PowerShell only ever interprets output from external programs as text ([string] instances), truncation / line-wrapping do not occur.
It follows that there's usually no reason to use Out-String on external-program output - unless you need to join the stream (array) of output lines to form a single, multiline string for further in-memory processing.
However, note that Out-String invariably adds a trailing newline to the resulting string, which may be undesired; use (...) -join [Environment]::NewLine to avoid that; Out-String's problematic behavior is discussed in GitHub issue #14444.
I want to run a program in PowerShell and write its output to a file with UTF-8 encoding.
However, I can't write non-ASCII characters properly.
I have already read many similar questions on Stack Overflow, but I still can't find an answer.
I tried both PowerShell 5.1.19041.1023 and PowerShell Core 7.1.3, they differently encode output file, but content is broken in the same way.
I tried simple programs in Python and Golang:
(Please assume that I can't change source code of programs)
Python
print('Hello ąćęłńóśźż world')
Results:
python hello.py
Hello ąćęłńóśźż world
python hello.py > file1.txt
Hello ╣Šŕ│˝ˇťč┐ world
python hello.py | out-file -encoding utf8 file2.ext
Hello ╣Šŕ│˝ˇťč┐ world
On cmd:
python hello.py > file3.txt
Hello ���� world
Golang
package main
import "fmt"
func main() {
fmt.Printf("Hello ąćęłńóśźż world\n")
}
Results:
go run hello.go
Hello ąćęłńóśźż world
go run hello.go > file4.txt
Hello ─ů─ç─Ö┼é┼ä├│┼Ť┼║┼╝ world
go run hello.go | out-file -encoding utf8 file5.txt
Hello ─ů─ç─Ö┼é┼ä├│┼Ť┼║┼╝ world
On cmd it works ok:
go run hello.go > file6.txt
Hello ąćęłńóśźż world
You should set the OutputEncoding property of the console first.
In PowerShell, enter this line before running your programs:
[Console]::OutputEncoding = [Text.Encoding]::Utf8
You can then use Out-File with your encoding type:
py hello.py | Out-File -Encoding UTF8 file2.ext
go run hello.go | Out-File -Encoding UTF8 file5.txt
Note: These character-encoding problems only plague PowerShell on Windows, in both editions. On Unix-like platforms, UTF-8 is consistently used.[1]
Quicksilver's answer is fundamentally correct:
It is the character encoding stored in [Console]::OutputEncoding that determines how PowerShell decodes text received from external programs[2] - and note that it invariably interprets such output as text (strings).
[Console]::OutputEncoding by default reflects a console's active code page, which itself defaults to the system's active OEM code page, such as 437 (CP437) on US-English systems.
The standard chcp program also reports the active OEM code page, and while it can in principle also be used to change it for the active console (e.g., chcp 65001), this does not work from inside PowerShell, due to .NET caching the encodings.
Therefore, you may have to (temporarily) set [Console]::OutputEncoding to match the actual character encoding used by a given external console program:
While many console programs respect the active console code page (in which case no workarounds are required), some do not, typically in order to provide full Unicode support. Note that you may not notice a problem until you programmatically process such a program's output (meaning: capturing in a variable, sending through the pipeline to another command, redirection to a file), because such a program may detect the case when its stdout is directly connected to the console and may then selectively use full Unicode support for display.
Notable CLIs that do not respect the active console code page:
Python exhibits nonstandard behavior in that it uses the active ANSI code page by default, i.e. the code page normally only used by non-Unicode GUI-subsystem applications.
However, you can use $env:PYTHONUTF8=1 before invoking Python scripts to instruct Python to use UTF-8 instead (which then applies to all Python calls made from the same process); in v3.7+, you can alternatively pass command-line option -X utf8 (case-sensitive) as a per-call opt-in.
Go and also Node.js invariably use UTF-8 encoding.
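The garbled strings shown in the question can be reproduced at the byte level. The following Python sketch is only an illustration and rests on an assumption the question does not state: a Polish system with OEM code page 852 and ANSI code page 1250. The helper name decode_as_console is made up for this example.

```python
TEXT = "Hello ąćęłńóśźż world"


def decode_as_console(raw: bytes, console_codepage: str = "cp852") -> str:
    """Decode captured bytes the way a console using `console_codepage` would."""
    return raw.decode(console_codepage)


# Go emits UTF-8 bytes; decoded as OEM 852 they become:
# Hello ─ů─ç─Ö┼é┼ä├│┼Ť┼║┼╝ world
go_garbled = decode_as_console(TEXT.encode("utf-8"))

# Python emits ANSI bytes by default when redirected (assumed: cp1250);
# decoded as OEM 852 they become:
# Hello ╣Šŕ│˝ˇťč┐ world
python_garbled = decode_as_console(TEXT.encode("cp1250"))

# Telling the decoder which encoding each program really used restores the text.
go_fixed = decode_as_console(TEXT.encode("utf-8"), "utf-8")
python_fixed = decode_as_console(TEXT.encode("cp1250"), "cp1250")
```

Both garbled variants contain all the original information; they are merely decoded with the wrong code page, which is why changing [Console]::OutputEncoding before the call is sufficient.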
The following snippet shows how to set [Console]::OutputEncoding temporarily as needed:
# Save the original encoding.
$orig = [Console]::OutputEncoding
# Work with console programs that use UTF-8 encoding,
# such as Go and Node.js
[Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
# Piping to Write-Output is a dummy operation that forces
# decoding of the external program's output, so that encoding problems would show.
go run hello.go | Write-Output
# Work with console programs that use ANSI encoding, such as Python.
# As noted, the alternative is to configure Python to use UTF-8.
[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding([int] (Get-ItemPropertyValue HKLM:\SYSTEM\CurrentControlSet\Control\Nls\CodePage ACP))
python hello.py | Write-Output
# Restore the original encoding.
[Console]::OutputEncoding = $orig
Your own answer provides an effective alternative, but it comes with caveats:
Activating the Use Unicode UTF-8 for worldwide language support feature via Control Panel (or the equivalent registry settings) changes the code pages system-wide, which affects not only all console windows and console applications, but also legacy (non-Unicode) GUI-subsystem applications, given that both the OEM and the ANSI code pages are being set.
Notable side effects include:
Windows PowerShell's default behavior changes, because it uses the ANSI code page both to read source code and as the default encoding for the Get-Content and Set-Content cmdlets.
For instance, existing Windows PowerShell scripts that contain non-ASCII range characters such as é will then misbehave, unless they were saved as UTF-8 with a BOM (or as "Unicode", UTF-16LE, which always has a BOM).
By contrast, PowerShell (Core) v6+ consistently uses (BOM-less) UTF-8 to begin with.
Old console applications may break with 65001 (UTF-8) as the active OEM code page, as they may not be able to handle the variable-length encoding aspect of UTF-8 (a single character can be encoded by up to 4 bytes).
See this answer for more information.
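As a small illustration of the variable-length aspect mentioned above, this Python sketch compares how many bytes a few sample characters occupy in UTF-8 and in a single-byte ANSI code page (Windows-1252 is assumed here purely as an example).

```python
# Sample characters from different Unicode ranges.
samples = ["A", "é", "€", "😀"]

# UTF-8 needs between 1 and 4 bytes per character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# A single-byte code page always uses exactly 1 byte, but it cannot
# represent the emoji at all, so only the first three samples are encoded.
ansi_lengths = [len(ch.encode("cp1252")) for ch in samples[:3]]

print(utf8_lengths)  # [1, 2, 3, 4]
print(ansi_lengths)  # [1, 1, 1]
```

A program that assumes one byte per character will miscount or split characters once code page 65001 is active.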
[1] The cross-platform PowerShell (Core) v6+ edition uses (BOM-less) UTF-8 consistently. While it is possible to configure Unix terminals and thereby console (terminal) applications to use a character encoding other than UTF-8, doing so is rare these days - UTF-8 is almost universally used.
[2] By contrast, it is the $OutputEncoding preference variable that determines the encoding used for sending text to external programs, via the pipeline.
The solution is to enable Beta: Use Unicode UTF-8 for worldwide language support as described in What does "Beta: Use Unicode UTF-8 for worldwide language support" actually do?
Note: this solution may cause problems with legacy programs. Please read the answer by mklement0 and the answer by Quicksilver for details and alternative solutions.
I also found the explanation written by Ghisler helpful (source):
If you check this option, Windows will use codepage 65001 (Unicode
UTF-8) instead of the local codepage like 1252 (Western Latin1) for
all plain text files. The advantage is that text files created in e.g.
Russian locale can also be read in other locale like Western or
Central Europe. The downside is that ANSI-Only programs (most older
programs) will show garbage instead of accented characters.
Also, PowerShell before version 7.1 has a bug when this option is enabled. If you enable it, you may want to upgrade to version 7.1 or later.
I like this solution because it's enough to set it once and it's working. It brings consistent Unix-like UTF-8 behaviour to Windows. I hope I will not see any issues.
How to enable it:
Win+R → intl.cpl
Administrative tab
Click the Change system locale button
Enable Beta: Use Unicode UTF-8 for worldwide language support
Reboot
or alternatively via reg file:
Windows Registry Editor Version 5.00
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"ACP"="65001"
"OEMCP"="65001"
"MACCP"="65001"
I am trying to redirect input in PowerShell by:
Get-Content input.txt | my-program args
The problem is the piped UTF-8 text is preceded with a BOM (0xEFBBBF), and my program cannot handle that correctly.
A minimal working example:
// File: Hex.java
import java.io.IOException;
public class Hex {
public static void main(String[] dummy) {
int ch;
try {
while ((ch = System.in.read()) != -1) {
System.out.print(String.format("%02X ", ch));
}
} catch (IOException e) {
}
}
}
Then in PowerShell:
javac Hex.java
Set-Content textfile "ABC" -Encoding Ascii
# Now the content of textfile is 0x41 42 43 0D 0A
Get-Content textfile | java Hex
Or simply
javac Hex.java
Write-Output "ABC" | java Hex
In either case, the output is EF BB BF 41 42 43 0D 0A.
How can I pipe the text into the program without 0xEFBBBF?
Note:
The following contains general information that in a normally functioning PowerShell environment would explain the OP's symptom. That the solution doesn't work in the OP's case is owed to machine-specific causes that are unknown at this point.
This answer is about sending BOM-less UTF-8 to an external program; if you're looking to make your PowerShell console windows use UTF-8 in all respects, see this answer.
To ensure that your Java program receives its input UTF-8-encoded without a BOM, you must set $OutputEncoding to a System.Text.UTF8Encoding instance that does not emit a BOM:
# Assigns UTF-8 encoding *without a BOM*.
# PowerShell uses this encoding to encode data piped to external programs.
# $OutputEncoding defaults to ASCII(!) in Windows PowerShell, and more sensibly
# to BOM-*less* UTF-8 in PowerShell [Core] v6+
$OutputEncoding = [Text.UTF8Encoding]::new($false)
Caveats:
Do NOT use the seemingly equivalent New-Object Text.Utf8Encoding $false, because, due to the bug described in GitHub issue #5763, it won't work if you assign to $OutputEncoding in a non-global scope, such as in a script. In PowerShell v4 and below, use (New-Object Text.Utf8Encoding $false).psobject.BaseObject as a workaround.
Windows 10 version 1903 and up allow you to set BOM-less UTF-8 as the system-wide default encoding (although note that the feature is still classified as beta as of version 20H2) - see this answer; [fixed in PowerShell 7.1] in PowerShell [Core] up to v7.0, with this feature turned on, the above technique is not effective, due to a presumptive .NET Core bug that causes a UTF-8 BOM always to be emitted, irrespective of what encoding you set $OutputEncoding to (the bug is possibly connected to GitHub issue #28929); the only solution is to turn the feature off, as shown in imgx64's answer.
If, by contrast, you use [Text.Encoding]::Utf8, you'll get a System.Text.Encoding.UTF8 instance with BOM - which is what I suspect happened in your case.
Note that this problem is unrelated to the source encoding of any file read by Get-Content, because what is sent through the PowerShell pipeline is never a stream of raw bytes, but .NET objects, which in the case of Get-Content means that .NET strings are sent (System.String, internally a sequence of UTF-16 code units).
Because you're piping to an external program (a Java application, in your case), PowerShell character-encodes the (stringified-on-demand) objects sent to it based on preference variable $OutputEncoding, and the resulting encoding is what the external program receives.
Perhaps surprisingly, even though BOMs are typically only used in files, PowerShell respects the BOM setting of the encoding assigned to $OutputEncoding also in the pipeline, prepending it to the first line sent (only).
See the bottom section of this answer for more information about how PowerShell handles pipeline input for and output from external programs, including how it is [Console]::OutputEncoding that matters when PowerShell interprets data received from external programs.
To illustrate the difference using your sample program (note how using a PowerShell string literal as input is sufficient; no need to read from a file):
# Note the EF BB BF sequence representing the UTF-8 BOM.
# Enclosure in & { ... } ensures that a local, temporary copy of $OutputEncoding
# is used.
PS> & { $OutputEncoding = [Text.Encoding]::Utf8; 'hö' | java Hex }
EF BB BF 68 C3 B6 0D 0A
# Note the absence of EF BB BF, due to using a BOM-less
# UTF-8 encoding.
PS> & { $OutputEncoding = [Text.Utf8Encoding]::new($false); 'hö' | java Hex }
68 C3 B6 0D 0A
In Windows PowerShell, where $OutputEncoding defaults to ASCII(!), you'd see the following with the default in place:
# The default of ASCII(!) results in *lossy* encoding in Windows PowerShell.
PS> 'hö' | java Hex
68 3F 0D 0A
Note that 3F represents the literal ? character, which is what the non-ASCII ö character was transliterated to, given that it has no representation in ASCII; in other words: information was lost.
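The three byte sequences shown above can be reproduced independently of PowerShell and Java. In this Python sketch, the utf-8-sig codec stands in for a BOM-emitting UTF-8 encoder, and hex_dump is a helper named for this example only.

```python
def hex_dump(data: bytes) -> str:
    """Format bytes the same way as the Hex sample program (two hex digits each)."""
    return " ".join(f"{b:02X}" for b in data)


line = "hö\r\n"

# UTF-8 with BOM, comparable to [Text.Encoding]::Utf8.
with_bom = hex_dump(line.encode("utf-8-sig"))
# BOM-less UTF-8, comparable to [Text.UTF8Encoding]::new($false).
without_bom = hex_dump(line.encode("utf-8"))
# ASCII with replacement: ö cannot be represented and becomes "?" (3F).
ascii_lossy = hex_dump(line.encode("ascii", errors="replace"))

print(with_bom)     # EF BB BF 68 C3 B6 0D 0A
print(without_bom)  # 68 C3 B6 0D 0A
print(ascii_lossy)  # 68 3F 0D 0A
```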
PowerShell [Core] v6+ now sensibly defaults to BOM-less UTF-8, so the default behavior there is as expected.
While BOM-less UTF-8 is PowerShell [Core]'s consistent default, also for cmdlets that read from and write to files, on Windows [Console]::OutputEncoding still reflects the active OEM code page by default as of v7.0, so to correctly capture output from UTF-8-emitting external programs, it must be set to [Text.UTF8Encoding]::new($false) as well - see GitHub issue #7233.
You could try setting the OutputEncoding to UTF-8 without BOM:
# Keep the current output encoding in a variable
$oldEncoding = [console]::OutputEncoding
# Set the output encoding to use UTF8 without BOM
[console]::OutputEncoding = New-Object System.Text.UTF8Encoding $false
Get-Content input.txt | my-program args
# Reset the output encoding to the previous
[console]::OutputEncoding = $oldEncoding
If the above has no effect and your program does understand UTF-8, but only expects it to be without the 3-byte BOM, then you can try removing the BOM from the content and piping the result to your program. Note that once the file has been decoded to a string, a BOM is the single character U+FEFF rather than three separate bytes:
(Get-Content 'input.txt' -Raw -Encoding UTF8) -replace '^\uFEFF' | my-program args
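For comparison, here is how a UTF-8 BOM looks before and after decoding, and how a receiving program could strip it defensively. This is a Python sketch with a made-up helper name (strip_utf8_bom), not something the programs in the question are known to do.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


received = codecs.BOM_UTF8 + b"ABC\r\n"

# At the byte level, the BOM is three bytes ...
cleaned = strip_utf8_bom(received)
# ... but decoded as plain UTF-8 it is the single character U+FEFF.
decoded_plain = received.decode("utf-8")
# The utf-8-sig codec drops a leading BOM while decoding.
decoded_sig = received.decode("utf-8-sig")

print(cleaned, ascii(decoded_plain), ascii(decoded_sig))
```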
If ever you have 'hacked' the codepage with chcp 65001, I recommend turning that back to chcp 5129 for English - New Zealand. See here.
Although mklement0's answer worked for me on one PC, it didn't work on another PC.
The reason was that I had the Beta: Use Unicode UTF-8 for worldwide language support checkbox selected in Language → Administrative language settings → Change system locale.
I unchecked it and now $OutputEncoding = [Text.UTF8Encoding]::new($false) works as expected.
It's odd that enabling it forces BOM, but I guess it's beta for a reason.
I'm working on a PowerShell script in which the output of several commands is shown in the window and appended to a file or a variable. It worked correctly until I used the sfc command. When piped or redirected, the output is "broken":
> sfc /?
Vérificateur de ressources Microsoft (R) Windows (R) version 6.0[...]
> sfc /? | Tee-Object -Variable content
V Ú r i f i c a t e u r d e r e s s o u r c e s M i c r o s o f t ( R ) W i n d o w s ( R ) v e r s i o á 6 . 0[...]
Are there other commands like sfc that are formatted in the same way, or that will result in a broken output if redirected?
EDIT
Powershell sample code, using the code from the accepted answer:
# Run a command
function RunCommand([ScriptBlock] $command) {
# Run the command and write the output to the window and to a variable ("SFC" formatting)
$stringcommand = $command.ToString()
if (
$stringcommand -match "^SFC$" -or
$stringcommand -match "^SFC.exe$" -or
$stringcommand -match "^SFC .*$" -or
$stringcommand -match "^SFC.exe .*$"
) {
$oldEncoding = [console]::OutputEncoding
[console]::OutputEncoding = [Text.Encoding]::Unicode
$command = [ScriptBlock]::Create("(" + $stringcommand + ")" + " -join ""`r`n"" -replace ""`r`n`r`n"", ""`r`n""")
& ($command) 2>&1 | Tee-Object -Variable out_content
[console]::OutputEncoding = $oldEncoding
# Run the command and write the output to the window and to a variable (normal formatting)
} else {
& ($command) 2>&1 | Tee-Object -Variable out_content
}
# Manipulate output variable, write it to a file...
# ...
return
}
# Run commands
RunCommand {ping 127.0.0.1}
RunCommand {sfc /?}
[void][System.Console]::ReadKey($true)
exit
CMD sample code, using more to format the sfc output:
@echo off
setlocal enabledelayedexpansion
set "tmpfile=%TEMP%\temp.txt"
set "outputfile=%TEMP%\output.txt"
REM; Run commands
call :RunCommand "ping 127.0.0.1"
call :RunCommand "sfc"
pause
exit /b
REM; Run a command
:RunCommand
REM; Run the command and write the output to the window and to the temp file
set "command=%~1"
(!command! 2>&1) >!tmpfile!
REM; Write the output to the window and to the output file ("SFC" formatting)
set "isSFC=0"
(echo !command!|findstr /I /R /C:"^SFC$" > NUL) && (set "isSFC=1")
(echo !command!|findstr /I /R /C:"^SFC.exe$" > NUL) && (set "isSFC=1")
(echo !command!|findstr /I /R /C:"^SFC .*$" > NUL) && (set "isSFC=1")
(echo !command!|findstr /I /R /C:"^SFC.exe .*$" > NUL) && (set "isSFC=1")
(if !isSFC! equ 1 (
(set \n=^
%=newline=%
)
set "content="
(for /f "usebackq tokens=* delims=" %%a in (`more /p ^<"!tmpfile!"`) do (
set "line=%%a"
set "content=!content!!line!!\n!"
))
echo.!content!
(echo.!content!) >>!outputfile!
REM; Write the output to the window and to the locked output file (normal formatting)
) else (
type "!tmpfile!"
(type "!tmpfile!") >>!outputfile!
))
goto :EOF
As noted in js2010's answer, the sfc.exe utility - surprisingly - outputs text that is UTF-16LE ("Unicode") encoded.
Since PowerShell doesn't expect that, it misinterprets sfc's output.[1]
The solution is to (temporarily) change [console]::OutputEncoding to UTF-16LE, which tells PowerShell / .NET what character encoding to expect from external programs, i.e., how to decode external-program output to .NET strings (which are stored as UTF-16 code units in memory).
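The spaced-out output in the question is what UTF-16LE bytes look like when they are decoded with a single-byte code page. The Python sketch below assumes OEM code page 850 (plausible for a French system, but not confirmed by the question).

```python
# The start of sfc's help text as UTF-16LE: every character takes two bytes,
# and for Latin letters the second byte is NUL.
raw = "Vérificateur".encode("utf-16-le")

# Misread as OEM 850: each NUL byte becomes its own character, and é (E9) turns into Ú.
misread = raw.decode("cp850")

# Read as UTF-16LE: the original text.
correct = raw.decode("utf-16-le")

# Consoles typically render NUL as a space, hence "V Ú r i f ...".
print(ascii(misread.replace("\x00", " ")))
```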
However, there's an additional problem that looks like a bug: bizarrely, sfc.exe uses CRCRLF (`r`r`n) sequences as line breaks rather than the Windows-customary CRLF (`r`n) newlines.
PowerShell, when it captures stdout output from external programs, returns an array of lines rather than a single multi-line string, and it treats the following newline styles interchangeably: CRLF (Windows-style), LF (Unix-style), and CR (obsolete Mac-style - very rare these days).
Therefore, it treats CRCRLF as two newlines, which are reflected in both "teed" and captured-in-a-variable output then containing extra, empty lines.
The solution is therefore to join the array elements with the standard CRLF newline sequences - (sfc /?) -join "`r`n" and then replace 2 consecutive `r`n with just one, to remove the artificially introduced line breaks: -replace "`r`n`r`n", "`r`n".
To put it all together:
# Save the current output encoding and switch to UTF-16LE
$prev = [console]::OutputEncoding
[console]::OutputEncoding = [Text.Encoding]::Unicode
# Invoke sfc.exe, whose output is now correctly interpreted and
# apply the CRCRLF workaround.
# You can also send output to a file, but note that Windows PowerShell's
# > redirection again uses UTF-16LE encoding.
# Best to use ... | Set-Content/Add-Content -Encoding ...
(sfc /?) -join "`r`n" -replace "`r`n`r`n", "`r`n" | Tee-Object -Variable content
# Restore the previous output encoding, which is the system's
# active OEM code page, which should work for other programs such
# as ping.exe
[console]::OutputEncoding = $prev
Note that $content will then contain a single, multi-line string; use $content -split "`r`n" to split into an array of lines.
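The effect of the CRCRLF workaround can be traced step by step in the following Python sketch. Python's str.splitlines() is used as a stand-in for PowerShell's line splitting because it also treats CR, LF, and CRLF as line breaks; the sample text is invented.

```python
# Invented sample using sfc-style CRCRLF line endings, including one
# intentionally blank line between "line 2" and "line 4".
raw_text = "line 1\r\r\nline 2\r\r\n\r\r\nline 4\r\r\n"

# Each CRCRLF is seen as two line breaks, which yields extra empty lines.
lines = raw_text.splitlines()

# Join with CRLF, then collapse each pair of CRLFs into a single one.
repaired = "\r\n".join(lines).replace("\r\n\r\n", "\r\n")

print(lines)
print(ascii(repaired))
```

Because every CRCRLF doubles the line break, collapsing pairs removes only the artificial empty lines; the intentionally blank line in the sample survives.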
As for:
Are there other commands like "sfc" that are formatted in the same way, or that will result in a broken output if redirected?
Not that I'm personally aware of; unconditional UTF-16LE output, as in sfc.exe's case, strikes me as unusual (other programs may offer that on an opt-in basis).
Older console programs with a Windows-only heritage use a (possibly fixed) OEM code page, which is a single-byte 8-bit encoding that is a superset of ASCII.
Increasingly, modern, multi-platform console programs use UTF-8 (e.g., the Node.js CLI), which is variable-width encoding capable of encoding all Unicode characters that is backward-compatible with ASCII (that is, in the 7-bit ASCII range UTF-8 encodes all characters as single, ASCII-compatible bytes).
If you want to make your PowerShell sessions and potentially all console windows fully UTF-8 aware, see this answer. (However, doing so still requires the above workaround for sfc.)
[1] Direct-to-console output:
When sfc output is neither captured by PowerShell nor routed through a cmdlet such as Tee-Object, sfc writes directly to the console, presumably using the Unicode version of the WriteConsole Windows API function, which expects UTF-16LE strings.
Writing to the console this way allows printing all Unicode characters, irrespective of what code page (reflected in chcp / [console]::OutputEncoding) is currently active.
(While the rendering of certain characters may fall short, due to limited font support and lack of support for (the rare) characters outside the BMP (Basic Multilingual Plane), the console buffer correctly preserves all characters, so copying and pasting elsewhere may render correctly there - see the bottom section of this answer.)
Therefore, direct-to-console output is not affected by the misinterpretation and typically prints as expected.
Looks like sfc outputs unicode no bom. Amazing.
cmd /c 'sfc > out'
get-content out -Encoding Unicode | where { $_ } # singlespace
Output:
Microsoft (R) Windows (R) Resource Checker Version 6.0
Copyright (C) Microsoft Corporation. All rights reserved.
Scans the integrity of all protected system files and replaces incorrect versions with
correct Microsoft versions.
SFC [/SCANNOW] [/VERIFYONLY] [/SCANFILE=<file>] [/VERIFYFILE=<file>]
[/OFFWINDIR=<offline windows directory> /OFFBOOTDIR=<offline boot directory>]
/SCANNOW Scans integrity of all protected system files and repairs files with
problems when possible.
/VERIFYONLY Scans integrity of all protected system files. No repair operation is
performed.
/SCANFILE Scans integrity of the referenced file, repairs file if problems are
identified. Specify full path <file>
/VERIFYFILE Verifies the integrity of the file with full path <file>. No repair
operation is performed.
/OFFBOOTDIR For offline repair specify the location of the offline boot directory
/OFFWINDIR For offline repair specify the location of the offline windows directory
e.g.
sfc /SCANNOW
sfc /VERIFYFILE=c:\windows\system32\kernel32.dll
sfc /SCANFILE=d:\windows\system32\kernel32.dll /OFFBOOTDIR=d:\ /OFFWINDIR=d:\windows
sfc /VERIFYONLY
Or delete the nulls and blank lines (windows prints nulls as spaces):
(sfc) -replace "`0" | where {$_}
I have an html file test.html created with atom which contains:
Testé encoding utf-8
When I read it with Powershell console (I'm using French Windows)
Get-Content -Raw test.html
I get back this:
TestÃ© encoding utf-8
Why is the accent character not printing correctly?
The Atom editor creates UTF-8 files without a pseudo-BOM by default (which is the right thing to do, from a cross-platform perspective).
Other popular cross-platform editors, such as Visual Studio Code and Sublime Text, behave the same way.
Windows PowerShell[1] only recognizes UTF-8 files with a pseudo-BOM.
In the absence of the pseudo-BOM, PowerShell interprets files as being formatted according to the system's legacy ANSI codepage, such as Windows-1252 on US systems, for instance.
(This is also the default encoding used by Notepad, which it calls "ANSI", not just when reading files, but also when creating them. Ditto for Windows PowerShell's Get-Content / Set-Content (where this encoding is called Default and is the actual default and therefore needn't be specified); by contrast, Out-File / > creates UTF-16LE-encoded files (Unicode) by default.)
Therefore, in order for Get-Content to recognize a BOM-less UTF-8 file correctly in Windows PowerShell, you must use -Encoding utf8.
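The misreading can be reproduced in a few lines of Python. The sketch assumes Windows-1252 as the ANSI code page, as on the Western European and US systems mentioned above.

```python
import codecs

# The file content as Atom saves it: UTF-8 without a BOM.
raw = "Testé encoding utf-8".encode("utf-8")

# Interpreted as ANSI (Windows-1252), the two bytes of é become "Ã©".
as_ansi = raw.decode("cp1252")

# Interpreted as UTF-8 (what -Encoding utf8 requests), the text is intact.
as_utf8 = raw.decode("utf-8")

# With a BOM in front, the encoding can be detected without being told.
with_bom = codecs.BOM_UTF8 + raw
detected = with_bom.decode("utf-8-sig")

print(ascii(as_ansi), ascii(as_utf8), ascii(detected))
```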
[1] By contrast, the cross-platform PowerShell Core edition commendably defaults to UTF-8, consistently across cmdlets, both on reading and writing, so it does interpret UTF-8-encoded files correctly even without a BOM and by default also creates files without a BOM.
# Created a UTF-8 Sig File
notepad .\test.html
# Get File contents with/without -raw
cat .\test.html;Get-Content -Raw .\test.html
Testé encoding utf-8
Testé encoding utf-8
# Check Encoding to make sure
Get-FileEncoding .\test.html
utf8
As you can see, it definitely works in PowerShell v5 on Windows 10. I'd double check the file formatting and the contents of the file you created, as there may have been characters introduced which your editor might not pick up.
If you do not have Get-FileEncoding as a cmdlet in your PowerShell, here is an implementation you can run:
function Get-FileEncoding([Parameter(Mandatory=$True)]$Path) {
$bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)
if(!$bytes) { return 'utf8' }
switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
'^efbbbf' {return 'utf8'}
'^2b2f76' {return 'utf7'}
'^fffe' {return 'unicode'}
'^feff' {return 'bigendianunicode'}
'^0000feff' {return 'utf32'}
default {return 'ascii'}
}
}
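The same BOM-sniffing idea is sketched below in Python using the standard codecs constants. The function name sniff_bom and the returned labels are choices made for this example. The UTF-32 checks come first, since the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). A file without a BOM yields None rather than a guess, since the absence of a BOM does not prove that the content is ASCII.

```python
import codecs
from typing import Optional

# Ordered so that longer BOMs are tested before their shorter prefixes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes) -> Optional[str]:
    """Return an encoding label based on the BOM in `head`, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


print(sniff_bom(b"\xef\xbb\xbfTest"))  # utf-8-sig
print(sniff_bom(b"Test"))              # None
```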
I figured out that these 2 lines:
echo "hello world" > hi.txt
echo "hello world" | Set-Content hi.txt
aren't doing exactly the same job. I created a simple script that replaces some values in a configuration file and stores the result (using >), but that seems to store the file in some weird format. Standard Windows text editors display the file normally, but the IDE that is supposed to load this file (it's a project configuration file) is unable to read it (I think it uses some extra encoding or whatever).
However, when I replaced it with Set-Content, it worked fine.
What is the default behaviour of these commands, and what does Set-Content do differently so that it works?
The difference is in the default encoding used. According to MSDN, Set-Content defaults to ASCII encoding, which is readable by most programs (but may not work if you're not writing English); in Windows PowerShell the actual default is the system's ANSI code page, which coincides with ASCII for plain English text. The > output redirection operator, on the other hand, works with PowerShell's internal string representation, which is .NET System.String, which is UTF-16 (reference).
As a side-note, you can also use Out-File, which uses unicode encoding.
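The difference becomes visible when you compare the bytes each variant would put into hi.txt. This Python sketch only models the encodings described above for Windows PowerShell (UTF-16LE with a BOM for >, a single-byte encoding for Set-Content); it does not call PowerShell itself.

```python
import codecs

text = "hello world\r\n"

# What ">" writes in Windows PowerShell: a UTF-16LE BOM followed by 2 bytes per character.
redirect_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# What Set-Content writes for plain English text: 1 byte per character, no BOM.
set_content_bytes = text.encode("ascii")

print(len(redirect_bytes), redirect_bytes[:6])
print(len(set_content_bytes), set_content_bytes[:6])
```

A reader that expects a single-byte or UTF-8 file will trip over the BOM and the interleaved NUL bytes of the first variant, which matches the IDE's behaviour described in the question.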
The default encoding of Set-Content is ASCII. This can be confirmed with the following:
Get-Help -Name Set-Content -Parameter Encoding;
The default encoding of the PowerShell redirection operator > is Unicode. This can be confirmed by looking at the help about_Redirection topic in PowerShell.
http://technet.microsoft.com/en-us/library/hh847746.aspx