Why does Powershell file concatenation convert UTF8 to UTF16?

I am running the following PowerShell script to concatenate a series of output files into a single CSV file. The input files are named whidataXX.htm (where XX is a two-digit sequential number), and the number of files created varies from run to run.
$metadataPath = "\\ServerPath\foo"
function concatenateMetadata {
$cFile = $metadataPath + "whiconcat.csv"
Clear-Content $cFile
$metadataFiles = gci $metadataPath
$iterations = $metadataFiles.Count
for ($i=0;$i -le $iterations-1;$i++) {
$iFile = "whidata"+$i+".htm"
$FileExists = (Test-Path $metadataPath$iFile -PathType Leaf)
if (!($FileExists))
{
break
}
elseif ($FileExists)
{
Write-Host "Adding " $metadataPath$iFile
Get-Content $metadataPath$iFile | Out-File $cFile -append
Write-Host "to" $cfile
}
}
}
The whidataXX.htm files are encoded UTF8, but my output file is encoded UTF16. When I view the file in Notepad, it appears correct, but when I view it in a Hex Editor, the Hex value 00 appears between each character, and when I pull the file into a Java program for processing, the file prints to the console with extra spaces between c h a r a c t e r s.
First, is this normal for PowerShell, or is there something in the source files that would cause this?
Second, how would I fix this encoding problem in the code noted above?

The Out-* cmdlets (like Out-File) format the data, and in Windows PowerShell their default output encoding is "Unicode", i.e. UTF-16LE.
You can add an -Encoding parameter to Out-file:
Get-Content $metadataPath$iFile | Out-File $cFile -Encoding UTF8 -append
or switch to Add-Content, which doesn't re-format
Get-Content $metadataPath$iFile | Add-Content $cFile
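The effect of the -Encoding parameter can be checked at the byte level. The following is a minimal sketch, not part of the original answer; the scratch file name is hypothetical, and whether the UTF-8 variant starts with a BOM depends on the PowerShell edition in use.

```powershell
# Write the same three characters with two explicit encodings and compare the raw bytes.
$demo = Join-Path ([IO.Path]::GetTempPath()) 'encoding-demo.txt'   # hypothetical scratch file

'abc' | Out-File -FilePath $demo -Encoding unicode                 # UTF-16LE
$utf16Bytes = [IO.File]::ReadAllBytes($demo)

'abc' | Out-File -FilePath $demo -Encoding utf8
$utf8Bytes = [IO.File]::ReadAllBytes($demo)

# Hex dumps: the UTF-16 file starts with the FF FE byte-order mark and has a 00
# after each ASCII letter; the UTF-8 file contains no 00 bytes at all.
($utf16Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
($utf8Bytes  | ForEach-Object { $_.ToString('X2') }) -join ' '

Remove-Item -LiteralPath $demo
```

The 00 bytes in the first dump are the ones the hex editor shows between the characters in the question.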

First, the fact that you get 2 bytes per character indicates that the file is encoded as UTF-16 (little-endian). For characters in the Basic Multilingual Plane this is identical to the older fixed-width encoding UCS-2, which is why the two names are often used interchangeably. This article explains that file redirection in PowerShell causes the output to be written in this encoding. See http://www.kongsli.net/nblog/2012/04/20/powershell-gotchas-redirect-to-file-encodes-in-unicode/. That same article also provides a fix.

Related

How can I (efficiently) match content (lines) of many small files with content (lines) of a single large file and update/recreate them

I've tried solving the following case:
many small text files (in subfolders) need their content (lines) matched to lines that exist in another (large) text file. The small files then need to be updated or copied with those matching Lines.
I was able to come up with some running code for this, but I need to improve it or use a completely different method because it is extremely slow and would take >40h to get through all files.
One idea I already had was to use a SQL Server to bulk-import all files in a single table with [relative path],[filename],[jap content] and the translation file in a table with [jap content],[eng content] and then join [jap content] and bulk-export the joined table as separate files using [relative path],[filename]. Unfortunately I got stuck right at the beginning due to formatting and encoding issues so I dropped it and started working on a PowerShell script.
Now in detail:
Over 40k txt files spread across multiple subfolders with multiple lines each, every line can exist in multiple files.
Content:
UTF8 encoded Japanese text that also can contain special characters like \\[*+(), each Line ending with a tabulator character. Sounds like csv files but they don't have headers.
One large File with >600k Lines containing the translation to the small files. Every line is unique within this file.
Content:
Again UTF8 encoded Japanese text. Each line formatted like this (without brackets):
[Japanese Text][tabulator][English Text]
Example:
テスト[1] Test [1]
The end result should be a copy or an updated version of all these small files, with their lines replaced by the matching lines of the translation file while maintaining their relative paths.
What I have at the moment:
$translationfile = 'B:\Translation.txt'
$inputpath = 'B:\Working'
$translationarray = [System.Collections.ArrayList]@()
$translationarray = @(Get-Content $translationfile -Encoding UTF8)
Get-Childitem -path $inputpath -Recurse -File -Filter *.txt | ForEach-Object -Parallel {
$_.Name
$filepath = ($_.Directory.FullName).substring(2)
$filearray = [System.Collections.ArrayList]@()
$filearray = @(Get-Content -path $_.FullName -Encoding UTF8)
$filearray = $filearray | ForEach-Object {
$result = $using:translationarray -match ("^$_" -replace '[[+*?()\\.]','\$&')
if ($result) {
$_ = $result
}
$_
}
If(!(test-path B:\output\$filepath)) {New-Item -ItemType Directory -Force -Path B:\output\$filepath}
#$("B:\output\"+$filepath+"\")
$filearray | Out-File -FilePath $("B:\output\" + $filepath + "\" + $_.Name) -Force -Encoding UTF8
} -ThrottleLimit 10
I would appreciate any help and ideas, but please keep in mind that I rarely write scripts, so anything too complex might fly right over my head.
Thanks

As zett42 states, using a hash table is your best option for mapping the Japanese-only phrases to the dual-language lines.
Additionally, use of .NET APIs for file I/O can speed up the operation noticeably.
# Be sure to specify all paths as full paths, not least because .NET's
# current directory usually differs from PowerShell's
$translationfile = 'B:\Translation.txt'
$inPath = 'B:\Working'
$outPath = (New-Item -Type Directory -Force 'B:\Output').FullName
# Build the hashtable mapping the Japanese phrases to the full lines.
# Note that ReadLines() defaults to UTF-8
$ht = @{ }
foreach ($line in [IO.File]::ReadLines($translationfile)) {
$ht[$line.Split("`t")[0] + "`t"] = $line
}
Get-ChildItem $inPath -Recurse -File -Filter *.txt | Foreach-Object -Parallel {
# Translate the lines to the matching lines including the translation
# via the hashtable.
# NOTE: If an input line isn't represented as a key in the hashtable,
# it is passed through as-is.
$lines = foreach ($line in [IO.File]::ReadLines($_.FullName)) {
($using:ht)[$line] ?? $line
}
# Synthesize the output file path, ensuring that the target dir. exists.
$outFilePath = (New-Item -Force -Type Directory ($using:outPath + $_.Directory.FullName.Substring(($using:inPath).Length))).FullName + '/' + $_.Name
# Write to the output file.
# Note: If you want UTF-8 files *with BOM*, use -Encoding utf8bom
Set-Content -Encoding utf8 $outFilePath -Value $lines
} -ThrottleLimit 10
Note: Your use of ForEach-Object -Parallel implies that you're using PowerShell [Core] 7+, where BOM-less UTF-8 is the consistent default encoding (unlike in Windows PowerShell, where default encodings vary wildly).
Therefore, in lieu of the .NET [IO.File]::ReadLines() API in a foreach loop, you could also use the more PowerShell-idiomatic switch statement with the -File parameter for efficient line-by-line text-file processing.
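As a rough sketch of the switch -File alternative mentioned above, the hashtable could be built as follows. The sample file and its ASCII-only contents are hypothetical stand-ins for the real translation file, and reading BOM-less UTF-8 this way assumes PowerShell 7+, as noted above.

```powershell
# Hypothetical two-line translation file: "<source text><TAB><English text>".
$demoFile = Join-Path ([IO.Path]::GetTempPath()) 'translation-demo.txt'
[IO.File]::WriteAllLines($demoFile, [string[]]("abc`tABC", "def`tDEF"))

$ht = @{ }
switch -File $demoFile {
    default {
        # $_ is the current line; the key is the first field plus the trailing tab,
        # matching how the lines of the small files end.
        $ht[$_.Split("`t")[0] + "`t"] = $_
    }
}

$ht["abc`t"]   # the full dual-language line stored for this key

Remove-Item -LiteralPath $demoFile
```

The default branch runs once per line, so no intermediate array of all lines is created.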

Out-File -Encoding problems using PowerShell replace command [duplicate]

Out-File seems to force the BOM when using UTF-8:
$MyFile = Get-Content $MyPath
$MyFile | Out-File -Encoding "UTF8" $MyPath
How can I write a file in UTF-8 with no BOM using PowerShell?
Update 2021
PowerShell has changed a bit since I wrote this question 10 years ago. Check multiple answers below, they have a lot of good information!

Using .NET's UTF8Encoding class and passing $False to the constructor seems to work:
$MyRawString = Get-Content -Raw $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)

The proper way as of now is to use a solution recommended by @Roman Kuzmin in comments to @M. Dudley's answer:
[IO.File]::WriteAllLines($filename, $content)
(I've also shortened it a bit by stripping unnecessary System namespace clarification - it will be substituted automatically by default.)
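Whether the result really lacks a BOM can be verified by inspecting the first three bytes of the file. This is a small sketch with a hypothetical scratch file; it uses a full path because .NET's current directory can differ from PowerShell's.

```powershell
$demo = Join-Path ([IO.Path]::GetTempPath()) 'nobom-demo.txt'   # hypothetical scratch file

# Without an explicit encoding argument, WriteAllLines writes UTF-8 without a BOM.
[IO.File]::WriteAllLines($demo, [string[]]('one', 'two'))

# A UTF-8 BOM would be the byte sequence EF BB BF at the very start of the file.
$bytes  = [IO.File]::ReadAllBytes($demo)
$hasBom = $bytes.Length -ge 3 -and
          $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF
$hasBom

Remove-Item -LiteralPath $demo
```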

I figured this wouldn't be UTF, but I just found a pretty simple solution that seems to work...
Get-Content path/to/file.ext | out-file -encoding ASCII targetFile.ext
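For me this results in a UTF-8 file without a BOM regardless of the source format. Note, however, that this only holds for pure-ASCII content: ASCII is a subset of UTF-8, and any non-ASCII character is replaced with a literal ? on output.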
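A quick way to see what -Encoding ASCII does to text outside the ASCII range is to round-trip a string containing one accented character. This sketch uses a hypothetical scratch file and builds the character from its code point so that the snippet itself stays ASCII-only.

```powershell
$demo = Join-Path ([IO.Path]::GetTempPath()) 'ascii-loss-demo.txt'   # hypothetical scratch file

$original = "caf$([char]0xE9)"          # the word 'cafe' with an accented final letter
$original | Out-File -FilePath $demo -Encoding ascii

# Read the file back and drop the trailing newline that Out-File appended.
$roundTripped = [IO.File]::ReadAllText($demo).TrimEnd()
$roundTripped                           # caf?

Remove-Item -LiteralPath $demo
```

The accented letter comes back as a question mark, so this approach is only safe when the data is known to be plain ASCII.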

Note: This answer applies to Windows PowerShell; by contrast, in the cross-platform PowerShell Core edition (v6+), UTF-8 without BOM is the default encoding, across all cmdlets.
In other words: If you're using PowerShell [Core] version 6 or higher, you get BOM-less UTF-8 files by default (which you can also explicitly request with -Encoding utf8 / -Encoding utf8NoBOM, whereas you get with-BOM encoding with -Encoding utf8BOM).
If you're running Windows 10 and you're willing to switch to BOM-less UTF-8 encoding system-wide - which can have side effects - even Windows PowerShell can be made to use BOM-less UTF-8 consistently - see this answer.
To complement M. Dudley's own simple and pragmatic answer (and ForNeVeR's more concise reformulation):
A simple, (non-streaming) PowerShell-native alternative is to use New-Item, which (curiously) creates BOM-less UTF-8 files by default even in Windows PowerShell:
# Note the use of -Raw to read the file as a whole.
# Unlike with Set-Content / Out-File *no* trailing newline is appended.
$null = New-Item -Force $MyPath -Value (Get-Content -Raw $MyPath)
Note: To save the output from arbitrary commands in the same format as Out-File would, pipe to Out-String first; e.g.:
$null = New-Item -Force Out.txt -Value (Get-ChildItem | Out-String)
For convenience, below is advanced function Out-FileUtf8NoBom, a pipeline-based alternative that mimics Out-File, which means:
you can use it just like Out-File in a pipeline.
input objects that aren't strings are formatted as they would be if you sent them to the console, just like with Out-File.
an additional -UseLF switch allows you to use Unix-format LF-only newlines ("`n") instead of the Windows-format CRLF newlines ("`r`n") you normally get.
Example:
(Get-Content $MyPath) | Out-FileUtf8NoBom $MyPath # Add -UseLF for Unix newlines
Note how (Get-Content $MyPath) is enclosed in (...), which ensures that the entire file is opened, read in full, and closed before sending the result through the pipeline. This is necessary in order to be able to write back to the same file (update it in place).
Generally, though, this technique is not advisable for 2 reasons: (a) the whole file must fit into memory and (b) if the command is interrupted, data will be lost.
A note on memory use:
M. Dudley's own answer and the New-Item alternative above require that the entire file contents be built up in memory first, which can be problematic with large input sets.
The function below does not require this, because it is implemented as a proxy (wrapper) function (for a concise summary of how to define such functions, see this answer).
Source code of function Out-FileUtf8NoBom:
Note: The function is also available as an MIT-licensed Gist, and only it will be maintained going forward.
You can install it directly with the following command (while I can personally assure you that doing so is safe, you should always check the content of a script before directly executing it this way):
# Download and define the function.
irm https://gist.github.com/mklement0/8689b9b5123a9ba11df7214f82a673be/raw/Out-FileUtf8NoBom.ps1 | iex
function Out-FileUtf8NoBom {
<#
.SYNOPSIS
Outputs to a UTF-8-encoded file *without a BOM* (byte-order mark).
.DESCRIPTION
Mimics the most important aspects of Out-File:
* Input objects are sent to Out-String first.
* -Append allows you to append to an existing file, -NoClobber prevents
overwriting of an existing file.
* -Width allows you to specify the line width for the text representations
of input objects that aren't strings.
However, it is not a complete implementation of all Out-File parameters:
* Only a literal output path is supported, and only as a parameter.
* -Force is not supported.
* Conversely, an extra -UseLF switch is supported for using LF-only newlines.
.NOTES
The raison d'être for this advanced function is that Windows PowerShell
lacks the ability to write UTF-8 files without a BOM: using -Encoding UTF8
invariably prepends a BOM.
Copyright (c) 2017, 2022 Michael Klement <mklement0@gmail.com> (http://same2u.net),
released under the [MIT license](https://spdx.org/licenses/MIT#licenseText).
#>
[CmdletBinding(PositionalBinding=$false)]
param(
[Parameter(Mandatory, Position = 0)] [string] $LiteralPath,
[switch] $Append,
[switch] $NoClobber,
[AllowNull()] [int] $Width,
[switch] $UseLF,
[Parameter(ValueFromPipeline)] $InputObject
)
begin {
# Convert the input path to a full one, since .NET's working dir. usually
# differs from PowerShell's.
$dir = Split-Path -LiteralPath $LiteralPath
if ($dir) { $dir = Convert-Path -ErrorAction Stop -LiteralPath $dir } else { $dir = $pwd.ProviderPath }
$LiteralPath = [IO.Path]::Combine($dir, [IO.Path]::GetFileName($LiteralPath))
# If -NoClobber was specified, throw an exception if the target file already
# exists.
if ($NoClobber -and (Test-Path $LiteralPath)) {
Throw [IO.IOException] "The file '$LiteralPath' already exists."
}
# Create a StreamWriter object.
# Note that we take advantage of the fact that the StreamWriter class by default:
# - uses UTF-8 encoding
# - without a BOM.
$sw = New-Object System.IO.StreamWriter $LiteralPath, $Append
$htOutStringArgs = @{}
if ($Width) { $htOutStringArgs += @{ Width = $Width } }
try {
# Create the script block with the command to use in the steppable pipeline.
$scriptCmd = {
& Microsoft.PowerShell.Utility\Out-String -Stream @htOutStringArgs |
. { process { if ($UseLF) { $sw.Write(($_ + "`n")) } else { $sw.WriteLine($_) } } }
}
$steppablePipeline = $scriptCmd.GetSteppablePipeline($myInvocation.CommandOrigin)
$steppablePipeline.Begin($PSCmdlet)
}
catch { throw }
}
process
{
$steppablePipeline.Process($_)
}
end {
$steppablePipeline.End()
$sw.Dispose()
}
}

Starting with version 6, PowerShell supports the UTF8NoBOM encoding for both Set-Content and Out-File, and even uses it as the default encoding.
So in the above example it should simply be like this:
$MyFile | Out-File -Encoding UTF8NoBOM $MyPath

When using Set-Content instead of Out-File, you can specify the encoding Byte, which can be used to write a byte array to a file (this applies to Windows PowerShell; PowerShell 6+ replaced -Encoding Byte with the -AsByteStream switch). This, in combination with a custom UTF8 encoding that does not emit the BOM, gives the desired result:
# This variable can be reused
$utf8 = New-Object System.Text.UTF8Encoding $false
$MyFile = Get-Content $MyPath -Raw
Set-Content -Value $utf8.GetBytes($MyFile) -Encoding Byte -Path $MyPath
The difference to using [IO.File]::WriteAllLines() or similar is that it should work fine with any type of item and path, not only actual file paths.
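The byte-array approach can be written so that it runs in both editions. The sketch below is an illustration rather than the answer's own code: the version check, the scratch file name and the sample string are assumptions.

```powershell
$demo  = Join-Path ([IO.Path]::GetTempPath()) 'bytes-demo.txt'   # hypothetical scratch file
$utf8  = New-Object System.Text.UTF8Encoding $false              # $false = do not emit a BOM
$bytes = $utf8.GetBytes("h$([char]0xE9)llo")                     # one non-ASCII letter, 6 bytes in UTF-8

if ($PSVersionTable.PSVersion.Major -ge 6) {
    # PowerShell 6+: raw bytes are written with the -AsByteStream switch.
    Set-Content -LiteralPath $demo -Value $bytes -AsByteStream
}
else {
    # Windows PowerShell: raw bytes are written with -Encoding Byte.
    Set-Content -LiteralPath $demo -Value $bytes -Encoding Byte
}

# The file holds exactly the encoded bytes: no BOM and no trailing newline.
$written = [IO.File]::ReadAllBytes($demo)
($written | ForEach-Object { $_.ToString('X2') }) -join ' '

Remove-Item -LiteralPath $demo
```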

This script will convert, to UTF-8 without BOM, all .txt files in DIRECTORY1 and output them to DIRECTORY2
foreach ($i in ls -name DIRECTORY1\*.txt)
{
$file_content = Get-Content "DIRECTORY1\$i";
[System.IO.File]::WriteAllLines("DIRECTORY2\$i", $file_content);
}

Important: this only works if an extra space or newline at the start of the file is no problem for your use case
(e.g. if it is an SQL file, a Java file or a human-readable text file).
One could combine creating a file that contains only an ASCII space (ASCII being UTF8-compatible) with appending to it (replace $str with gc $src if the source is a file):
" " | out-file -encoding ASCII -noNewline $dest
$str | out-file -encoding UTF8 -append $dest
as one-liner
replace $dest and $str according to your use case:
$_ofdst = $dest ; " " | out-file -encoding ASCII -noNewline $_ofdst ; $str | out-file -encoding UTF8 -append $_ofdst
as simple function
function Out-File-UTF8-noBOM { param( $str, $dest )
" " | out-file -encoding ASCII -noNewline $dest
$str | out-file -encoding UTF8 -append $dest
}
using it with a source file:
Out-File-UTF8-noBOM (gc $src) $dest
using it with a string:
Out-File-UTF8-noBOM $str $dest
optionally: continue appending with Out-File:
"more foo bar" | Out-File -encoding UTF8 -append $dest

Old question, new answer:
While the "old" powershell writes a BOM, the new platform-agnostic variant does behave differently: The default is "no BOM" and it can be configured via switch:
-Encoding
Specifies the type of encoding for the target file. The default value is utf8NoBOM.
The acceptable values for this parameter are as follows:
ascii: Uses the encoding for the ASCII (7-bit) character set.
bigendianunicode: Encodes in UTF-16 format using the big-endian byte order.
oem: Uses the default encoding for MS-DOS and console programs.
unicode: Encodes in UTF-16 format using the little-endian byte order.
utf7: Encodes in UTF-7 format.
utf8: Encodes in UTF-8 format.
utf8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
utf8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)
utf32: Encodes in UTF-32 format.
Source: https://learn.microsoft.com/de-de/powershell/module/Microsoft.PowerShell.Utility/Out-File?view=powershell-7
Emphasis mine

For PowerShell 5.1, enable this setting:
Control Panel, Region, Administrative, Change system locale, Use Unicode UTF-8 for worldwide language support
Then enter this into PowerShell:
$PSDefaultParameterValues['*:Encoding'] = 'Default'
Alternatively, you can upgrade to PowerShell 6 or higher.
https://github.com/PowerShell/PowerShell

I would say to use just the Set-Content command, nothing else needed.
The powershell version in my system is :-
PS C:\Users\XXXXX> $PSVersionTable.PSVersion | fl
Major : 5
Minor : 1
Build : 19041
Revision : 1682
MajorRevision : 0
MinorRevision : 1682
PS C:\Users\XXXXX>
So you would need something like following.
PS C:\Users\XXXXX> Get-Content .\Downloads\finddate.txt
Thursday, June 23, 2022 5:57:59 PM
PS C:\Users\XXXXX> Get-Content .\Downloads\finddate.txt | Set-Content .\Downloads\anotherfile.txt
PS C:\Users\XXXXX> Get-Content .\Downloads\anotherfile.txt
Thursday, June 23, 2022 5:57:59 PM
PS C:\Users\XXXXX>
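Checking the resulting file (anotherfile.txt) shows it as UTF-8 without a BOM. Note that in Windows PowerShell, Set-Content writes the system's ANSI code page by default, which is byte-identical to UTF-8 only for pure-ASCII content such as this date string.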

Change multiple files by extension to UTF-8 without BOM:
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in ls -recurse -filter "*.java") {
$MyFile = Get-Content $i.fullname
[System.IO.File]::WriteAllLines($i.fullname, $MyFile, $Utf8NoBomEncoding)
}

[System.IO.FileInfo] $file = Get-Item -Path $FilePath
$sequenceBOM = New-Object System.Byte[] 3
$reader = $file.OpenRead()
$bytesRead = $reader.Read($sequenceBOM, 0, 3)
$reader.Dispose()
# A UTF-8 file with a BOM starts with the following three bytes. Hex: 0xEF 0xBB 0xBF, Decimal: 239 187 191
if ($bytesRead -eq 3 -and $sequenceBOM[0] -eq 239 -and $sequenceBOM[1] -eq 187 -and $sequenceBOM[2] -eq 191)
{
$utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
[System.IO.File]::WriteAllLines($FilePath, (Get-Content $FilePath), $utf8NoBomEncoding)
Write-Host "Remove UTF-8 BOM successfully"
}
Else
{
Write-Warning "Not UTF-8 BOM file"
}
Source: How to remove UTF8 Byte Order Mark (BOM) from a file using PowerShell

If you want to use [System.IO.File]::WriteAllLines(), you should cast second parameter to String[] (if the type of $MyFile is Object[]), and also specify absolute path with $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath), like:
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Set-Variable MyFile
[System.IO.File]::WriteAllLines($ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath), [String[]]$MyFile, $Utf8NoBomEncoding)
If you want to use [System.IO.File]::WriteAllText(), you should sometimes pipe the second parameter through Out-String to add CRLFs to the end of each line explicitly (especially when using ConvertTo-Csv):
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Out-String | Set-Variable tmp
[System.IO.File]::WriteAllText("/absolute/path/to/foobar.csv", $tmp, $Utf8NoBomEncoding)
Or you can use [Text.Encoding]::UTF8.GetBytes() with Set-Content -Encoding Byte:
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Out-String | % { [Text.Encoding]::UTF8.GetBytes($_) } | Set-Content -Encoding Byte -Path "/absolute/path/to/foobar.csv"
see: How to write result of ConvertTo-Csv to a file in UTF-8 without BOM

I had the same error in PowerShell and fixed it with this solution:
$PSDefaultParameterValues['*:Encoding'] = 'utf8'
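The wildcard key above changes the default for every cmdlet that has an -Encoding parameter. As a hedged variation that is not taken from the answer, the default can be scoped to a single cmdlet and removed again afterwards; the key 'Out-File:Encoding' and the scratch file are assumptions chosen for illustration. In Windows PowerShell the resulting UTF-8 file still starts with a BOM.

```powershell
$demo = Join-Path ([IO.Path]::GetTempPath()) 'default-encoding-demo.txt'   # hypothetical scratch file

# Assumption for illustration: limit the default to Out-File instead of '*'.
$PSDefaultParameterValues['Out-File:Encoding'] = 'utf8'
try {
    # No -Encoding argument here; the preset default is applied.
    "caf$([char]0xE9)" | Out-File -FilePath $demo
}
finally {
    # Remove the preset so later commands are not affected.
    $PSDefaultParameterValues.Remove('Out-File:Encoding')
}

# In UTF-8 the accented letter is the two-byte sequence C3 A9 (decimal 195, 169).
$bytes = [IO.File]::ReadAllBytes($demo)
$hasUtf8Sequence = ($bytes -join ',') -like '*195,169*'
$hasUtf8Sequence

Remove-Item -LiteralPath $demo
```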

One technique I utilize is to redirect output to an ASCII file using the Out-File cmdlet.
For example, I often run SQL scripts that create another SQL script to execute in Oracle. With simple redirection (">"), the output will be in UTF-16 which is not recognized by SQLPlus. To work around this:
sqlplus -s / as sysdba "@create_sql_script.sql" |
Out-File -FilePath new_script.sql -Encoding ASCII -Force
The generated script can then be executed via another SQLPlus session without any Unicode worries:
sqlplus / as sysdba "@new_script.sql" |
tee new_script.log
Update: As others have pointed out, this will drop non-ASCII characters. Since the user asked for a way to "force" conversion, I assume they do not care about that as perhaps their data does not contain such data.
If you care about the preservation of non-ASCII characters, this is not the answer for you.

Used this method to edit a UTF8-NoBOM file and generated a file with correct encoding-
$fileD = "file.xml"
(Get-Content $fileD) | ForEach-Object { $_ -replace 'replace text',"new text" } | out-file "file.xml" -encoding ASCII
I was skeptical of this method at first, but it surprised me and worked!
Tested with powershell version 5.1

Could use below to get UTF8 without BOM
$MyFile | Out-File -Encoding ASCII

This one works for me (use "Default" instead of "UTF8"):
$MyFile = Get-Content $MyPath
$MyFile | Out-File -Encoding "Default" $MyPath
The result is ASCII without BOM.

Prevent trailing newline in PowerShell Out-File command

How do I prevent PowerShell's Out-File command from appending a newline after the text it outputs?
For example, running the following command produces a file with contents "TestTest\r\n" rather than just "TestTest".
"TestTest" | Out-File -encoding ascii test.txt

In PowerShell 5.0+, you would use:
"TestTest" | Out-File -encoding ascii test.txt -NoNewline
But in earlier versions you simply can't with that cmdlet.
Try this:
[System.IO.File]::WriteAllText($FilePath,"TestTest",[System.Text.Encoding]::ASCII)

To complement briantist's helpful answer re -NoNewline:
The following applies not just to Out-File, but analogously to Set-Content / Add-Content as well; as stated, -NoNewline requires PSv5+.
Note that -NoNewline means that with multiple objects to output, it is not just a trailing newline (line break) that is suppressed, but any newlines.
In other words: The string representations of the input objects are directly concatenated, without a separator (terminator).
Therefore, the following commands result in the same file contents (TestTest without a trailing newline):
# Single input string
"TestTest" | Out-File -encoding ascii test.txt -NoNewline
# Equivalent command: 2-element array of strings that are directly concatenated.
"Test", "Test" | Out-File -encoding ascii test.txt -NoNewline
In order to place newlines only between, but not also after the output objects, you must join the objects with newlines explicitly:
"Test", "Test" -join [Environment]::NewLine |
Out-File -encoding ascii test.txt -NoNewline
[Environment]::NewLine is the platform-appropriate newline sequence (CRLF on Windows, LF on Unix-like platforms); you can also produce either sequence explicitly, if needed, with "`r`n" and "`n"
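The two behaviors described above can be confirmed by reading the files back verbatim. This is a minimal sketch with a hypothetical scratch file; it joins with an explicit "`n" rather than [Environment]::NewLine so that the result is the same on every platform.

```powershell
$demo = Join-Path ([IO.Path]::GetTempPath()) 'nonewline-demo.txt'   # hypothetical scratch file

# Two input strings with -NoNewline: concatenated directly, no separator, no trailing newline.
'Test', 'Test' | Out-File -Encoding ascii -FilePath $demo -NoNewline
$concatenated = [IO.File]::ReadAllText($demo)

# Explicit join: a newline between the strings, but still none at the end.
('Test', 'Test' -join "`n") | Out-File -Encoding ascii -FilePath $demo -NoNewline
$joined = [IO.File]::ReadAllText($demo)

$concatenated.Length   # 8
$joined.Length         # 9

Remove-Item -LiteralPath $demo
```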
Caveat:
The above -join solution implicitly converts the input objects to strings, if they aren't already and does so by calling the .NET .ToString() method on each object. This often yields a different representation than the one that Out-File would directly create, because Out-File uses PowerShell's default output formatter; for instance, compare the outputs of (Get-Date).ToString() and just Get-Date.
If your input comprises only strings and/or non-strings whose .ToString() representation is satisfactory, the above solution works, but note that it is then generally preferable to use the Set-Content cmdlet, which applies the same stringification implicitly.
For a complete discussion of the differences between Out-File and Set-Content, see this answer of mine.
If your input has non-strings that you do want to be formatted as they would print to the console, there is actually no simple solution: while you can use Out-String to create per-object string representations with the default formatter, Out-String's lack of -NoNewline (as of v5.1; this GitHub issue suggests introducing it) would invariably yield trailing newlines.

To complement briantist's and mklement0's helpful answers re -NoNewline:
I created this little function to replace the -NoNewLine parameter of Out-File in previous versions of powershell.
Note: In my case it was for a .csv file with 7 lines (Days of the week and some more values)
## Receives the value we want to add and "yes" or "no", depending on whether we want to
## put the value on a new line or not.
function AddValueToLogFile ($value, $NewLine) {
## If the log file exists:
if (Test-path $Config.LogPath) {
## And we don't want to add a new line, the value is concatenated at the end.
if ($NewLine -eq "no") {
$file = Get-Content -Path $Config.LogPath
## If the file has more than one line
if ($file -is [array]) {
$file[-1]+= ";" + $value
}
## if the file only has one line
else {
$file += ";" + $value
}
$file | Out-File -FilePath $Config.LogPath
}
## If we want to insert a new line the append parameter is used.
elseif ($NewLine -eq "yes") {
$value | Out-File -Append -FilePath $Config.LogPath
}
}
## If the log file does not exist it is passed as a value
elseif (!(Test-path $Config.LogPath)) {
$value | Out-File -FilePath $Config.LogPath
}
}

Read UTF-8 files correctly with PowerShell

Following situation:
A PowerShell script creates a file with UTF-8 encoding
The user may or may not edit the file, possibly losing the BOM, but should keep the encoding as UTF-8, and possibly changing the line separators
The same PowerShell script reads the file, adds some more content and writes it all as UTF-8 back to the same file
This can be iterated many times
With Get-Content and Out-File -Encoding UTF8 I have problems reading it correctly. It's stumbling over the BOM it has written before (putting it in the content, breaking my parsing regex), does not use UTF-8 encoding and even deletes line breaks in the original content part.
I need a function that can read any file with UTF-8 encoding, ignore and delete the BOM and not modify the content. What should I use?
Update
I have added a little test script that shows what I'm trying to do and what happens instead.
# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
$data = Get-Content -Path test.txt
if ($data -match "^[0-9-]{10} - r([0-9]+)")
{
$startRev = [int]$matches[1] + 1
}
}
Write-Host Next revision is $startRev
# Define example data to add
$startRev = $startRev + 10
$newMsgs = "2014-04-01 - r" + $startRev + "`r`n`r`n" + `
"Line 1`r`n" + `
"Line 2`r`n`r`n"
# Write new data back
$data = $newMsgs + $data
$data | Out-File test.txt -Encoding UTF8
After running it a few times, new sections should be added to the beginning of the file, the existing content should not be altered in any way (currently loses line breaks) and no additional new lines should be added at the end of the file (seems to happen sometimes).
Instead, the second run gives me an error.

If the file is supposed to be UTF8 why don't you try to read it decoding UTF8 :
Get-Content -Path test.txt -Encoding UTF8

Really JPBlanc is right. If you want it read as UTF8 then specify that when the file is read.
On a side note, you're losing formatting in here with the [String]+[String] stuff. Not to mention your regex match doesn't work. Check out the regex search changes, and the changes made to the $newMsgs, and the way I'm outputting your data to the file.
# Read data if exists
$data = ""
$startRev = 1;
if (Test-Path test.txt)
{
$data = Get-Content -Path test.txt #-Encoding UTF8
if($data -match "\br([0-9]+)\b"){
$startRev = [int]([regex]::Match($data,"\br([0-9]+)\b")).groups[1].value + 1
}
}
Write-Host Next revision is $startRev
# Define example data to add
$startRev = $startRev + 10
$newMsgs = @"
2014-04-01 - r$startRev`r`n`r`n
Line 1`r`n
Line 2`r`n`r`n
"@
# Write new data back
$newmsgs,$data | Out-File test.txt -Encoding UTF8

Get-Content doesn't seem to handle UTF-8 files without a BOM correctly if you omit the -Encoding parameter. [System.IO.File]::ReadLines seems to be an alternative. Examples:
PS C:\temp\powershellutf8> $a = Get-Content .\utf8wobom.txt
PS C:\temp\powershellutf8> $b = Get-Content .\utf8wbom.txt
PS C:\temp\powershellutf8> $a2 = Get-Content .\utf8wbom.txt -Encoding UTF8
PS C:\temp\powershellutf8> $a
ABCDEFGHIJKLMNOPQRSTUVWXYZÃ…Ã„Ã– <== This doesnt seem to be right at all
PS C:\temp\powershellutf8> $b
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8> $a2
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8>
PS C:\temp\powershellutf8> $c = [IO.File]::ReadLines('.\utf8wbom.txt');
PS C:\temp\powershellutf8> $c
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ
PS C:\temp\powershellutf8> $d = [IO.File]::ReadLines('.\utf8wobom.txt');
PS C:\temp\powershellutf8> $d
ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ <== Works!
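The same comparison can be reproduced without prepared sample files. In this sketch the scratch file name is hypothetical, the non-ASCII letters are built from their code points so the snippet stays ASCII-only, and a full path is used because .NET's current directory can differ from PowerShell's.

```powershell
$demo = Join-Path ([IO.Path]::GetTempPath()) 'utf8-nobom-read-demo.txt'   # hypothetical scratch file

# 'ABC' followed by three accented capital letters (code points C5, C4, D6).
$expected = 'ABC' + [char]0xC5 + [char]0xC4 + [char]0xD6

# WriteAllText without an encoding argument produces UTF-8 without a BOM.
[IO.File]::WriteAllText($demo, $expected)

# Both reads decode the BOM-less file as UTF-8: .NET by default, Get-Content on request.
$viaDotNet = [IO.File]::ReadAllText($demo)
$viaCmdlet = Get-Content -LiteralPath $demo -Encoding UTF8

$viaDotNet -eq $expected
$viaCmdlet -eq $expected

Remove-Item -LiteralPath $demo
```

Passing -Encoding UTF8 explicitly makes the Get-Content call independent of the edition's default encoding.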

Using PowerShell to write a file in UTF-8 without the BOM

Out-File seems to force the BOM when using UTF-8:
$MyFile = Get-Content $MyPath
$MyFile | Out-File -Encoding "UTF8" $MyPath
How can I write a file in UTF-8 with no BOM using PowerShell?
Update 2021
PowerShell has changed a bit since I wrote this question 10 years ago. Check multiple answers below, they have a lot of good information!

Using .NET's UTF8Encoding class and passing $False to the constructor seems to work:
$MyRawString = Get-Content -Raw $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyRawString, $Utf8NoBomEncoding)

The proper way as of now is to use a solution recommended by #Roman Kuzmin in comments to #M. Dudley answer:
[IO.File]::WriteAllLines($filename, $content)
(I've also shortened it a bit by stripping unnecessary System namespace clarification - it will be substituted automatically by default.)

I figured this wouldn't be UTF, but I just found a pretty simple solution that seems to work...
Get-Content path/to/file.ext | out-file -encoding ASCII targetFile.ext
For me this results in a utf-8 without bom file regardless of the source format.

Note: This answer applies to Windows PowerShell; by contrast, in the cross-platform PowerShell Core edition (v6+), UTF-8 without BOM is the default encoding, across all cmdlets.
In other words: If you're using PowerShell [Core] version 6 or higher, you get BOM-less UTF-8 files by default (which you can also explicitly request with -Encoding utf8 / -Encoding utf8NoBOM, whereas you get with-BOM encoding with -utf8BOM).
If you're running Windows 10 and you're willing to switch to BOM-less UTF-8 encoding system-wide - which can have side effects - even Windows PowerShell can be made to use BOM-less UTF-8 consistently - see this answer.
To complement M. Dudley's own simple and pragmatic answer (and ForNeVeR's more concise reformulation):
A simple, (non-streaming) PowerShell-native alternative is to use New-Item, which (curiously) creates BOM-less UTF-8 files by default even in Windows PowerShell:
# Note the use of -Raw to read the file as a whole.
# Unlike with Set-Content / Out-File *no* trailing newline is appended.
$null = New-Item -Force $MyPath -Value (Get-Content -Raw $MyPath)
Note: To save the output from arbitrary commands in the same format as Out-File would, pipe to Out-String first; e.g.:
$null = New-Item -Force Out.txt -Value (Get-ChildItem | Out-String)
For convenience, below is advanced function Out-FileUtf8NoBom, a pipeline-based alternative that mimics Out-File, which means:
you can use it just like Out-File in a pipeline.
input objects that aren't strings are formatted as they would be if you sent them to the console, just like with Out-File.
an additional -UseLF switch allows you to use Unix-format LF-only newlines ("`n") instead of the Windows-format CRLF newlines ("`r`n") you normally get.
Example:
(Get-Content $MyPath) | Out-FileUtf8NoBom $MyPath # Add -UseLF for Unix newlines
Note how (Get-Content $MyPath) is enclosed in (...), which ensures that the entire file is opened, read in full, and closed before sending the result through the pipeline. This is necessary in order to be able to write back to the same file (update it in place).
Generally, though, this technique is not advisable for 2 reasons: (a) the whole file must fit into memory and (b) if the command is interrupted, data will be lost.
A note on memory use:
M. Dudley's own answer and the New-Item alternative above require that the entire file contents be built up in memory first, which can be problematic with large input sets.
The function below does not require this, because it is implemented as a proxy (wrapper) function (for a concise summary of how to define such functions, see this answer).
Source code of function Out-FileUtf8NoBom:
Note: The function is also available as an MIT-licensed Gist, and only it will be maintained going forward.
You can install it directly with the following command (while I can personally assure you that doing so is safe, you should always check the content of a script before directly executing it this way):
# Download and define the function.
irm https://gist.github.com/mklement0/8689b9b5123a9ba11df7214f82a673be/raw/Out-FileUtf8NoBom.ps1 | iex
function Out-FileUtf8NoBom {
  <#
  .SYNOPSIS
    Outputs to a UTF-8-encoded file *without a BOM* (byte-order mark).
  .DESCRIPTION
    Mimics the most important aspects of Out-File:
      * Input objects are sent to Out-String first.
      * -Append allows you to append to an existing file, -NoClobber prevents
        overwriting of an existing file.
      * -Width allows you to specify the line width for the text representations
        of input objects that aren't strings.
    However, it is not a complete implementation of all Out-File parameters:
      * Only a literal output path is supported, and only as a parameter.
      * -Force is not supported.
      * Conversely, an extra -UseLF switch is supported for using LF-only newlines.
  .NOTES
    The raison d'être for this advanced function is that Windows PowerShell
    lacks the ability to write UTF-8 files without a BOM: using -Encoding UTF8
    invariably prepends a BOM.
    Copyright (c) 2017, 2022 Michael Klement <mklement0#gmail.com> (http://same2u.net),
    released under the [MIT license](https://spdx.org/licenses/MIT#licenseText).
  #>
  [CmdletBinding(PositionalBinding=$false)]
  param(
    [Parameter(Mandatory, Position = 0)] [string] $LiteralPath,
    [switch] $Append,
    [switch] $NoClobber,
    [AllowNull()] [int] $Width,
    [switch] $UseLF,
    [Parameter(ValueFromPipeline)] $InputObject
  )
  begin {
    # Convert the input path to a full one, since .NET's working dir. usually
    # differs from PowerShell's.
    $dir = Split-Path -LiteralPath $LiteralPath
    if ($dir) { $dir = Convert-Path -ErrorAction Stop -LiteralPath $dir } else { $dir = $pwd.ProviderPath }
    $LiteralPath = [IO.Path]::Combine($dir, [IO.Path]::GetFileName($LiteralPath))
    # If -NoClobber was specified, throw an exception if the target file already
    # exists.
    if ($NoClobber -and (Test-Path $LiteralPath)) {
      Throw [IO.IOException] "The file '$LiteralPath' already exists."
    }
    # Create a StreamWriter object.
    # Note that we take advantage of the fact that the StreamWriter class by default:
    # - uses UTF-8 encoding
    # - without a BOM.
    $sw = New-Object System.IO.StreamWriter $LiteralPath, $Append
    # Arguments to splat to Out-String; -Width is passed only if specified.
    $htOutStringArgs = @{}
    if ($Width) { $htOutStringArgs += @{ Width = $Width } }
    try {
      # Create the script block with the command to use in the steppable pipeline.
      $scriptCmd = {
        & Microsoft.PowerShell.Utility\Out-String -Stream @htOutStringArgs |
          . { process { if ($UseLF) { $sw.Write(($_ + "`n")) } else { $sw.WriteLine($_) } } }
      }
      $steppablePipeline = $scriptCmd.GetSteppablePipeline($myInvocation.CommandOrigin)
      $steppablePipeline.Begin($PSCmdlet)
    }
    catch { throw }
  }
  process
  {
    $steppablePipeline.Process($_)
  }
  end {
    $steppablePipeline.End()
    $sw.Dispose()
  }
}

Starting with version 6, PowerShell supports the UTF8NoBOM encoding for both Set-Content and Out-File, and even uses it as the default encoding.
So the above example simply becomes:
$MyFile | Out-File -Encoding UTF8NoBOM $MyPath

When using Set-Content instead of Out-File, you can specify the encoding Byte, which can be used to write a byte array to a file. This in combination with a custom UTF8 encoding which does not emit the BOM gives the desired result:
# This variable can be reused
$utf8 = New-Object System.Text.UTF8Encoding $false
$MyFile = Get-Content $MyPath -Raw
Set-Content -Value $utf8.GetBytes($MyFile) -Encoding Byte -Path $MyPath
The difference to using [IO.File]::WriteAllLines() or similar is that it should work fine with any type of item and path, not only actual file paths.
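One caveat with this approach: -Encoding Byte only exists in Windows PowerShell; PowerShell 6 and higher use the -AsByteStream switch instead. A minimal sketch that handles both editions follows. The temp-file path and sample text are made up for illustration:

```powershell
# Hypothetical target file and sample text ("héllo", with é built from its code point).
$path = Join-Path ([IO.Path]::GetTempPath()) 'nobom-demo.txt'
$text = "h$([char]0xE9)llo"

# UTF-8 encoding object that does not emit a BOM.
$utf8 = New-Object System.Text.UTF8Encoding $false
$bytes = $utf8.GetBytes($text)

if ($PSVersionTable.PSVersion.Major -ge 6) {
    # PowerShell 6+: raw bytes are written with -AsByteStream.
    Set-Content -LiteralPath $path -Value $bytes -AsByteStream
} else {
    # Windows PowerShell: raw bytes are written with -Encoding Byte.
    Set-Content -LiteralPath $path -Value $bytes -Encoding Byte
}
```

Since the bytes are written verbatim, no trailing newline is added either; append one to the string first if you need it.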

This script will convert all .txt files in DIRECTORY1 to UTF-8 without BOM and output them to DIRECTORY2:
foreach ($i in ls -name DIRECTORY1\*.txt)
{
    $file_content = Get-Content "DIRECTORY1\$i"
    [System.IO.File]::WriteAllLines("DIRECTORY2\$i", $file_content)
}
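Because .NET methods resolve relative paths against the process's working directory rather than PowerShell's current location, a script like this can write to an unexpected place. Below is a sketch of the same idea with the paths resolved first. The function name Convert-TextFilesToUtf8NoBom and its parameters are my own invention, and it assumes both directories already exist:

```powershell
function Convert-TextFilesToUtf8NoBom {
    param(
        [Parameter(Mandatory)] [string] $SourceDir,
        [Parameter(Mandatory)] [string] $DestDir
    )
    # Resolve to full file-system paths so the .NET call below sees the same
    # location that PowerShell does.
    $src = Convert-Path -LiteralPath $SourceDir
    $dst = Convert-Path -LiteralPath $DestDir
    $utf8NoBom = New-Object System.Text.UTF8Encoding $false

    foreach ($file in Get-ChildItem -LiteralPath $src -Filter *.txt -File) {
        # @(...) guarantees an array even for empty or single-line files.
        [string[]] $lines = @(Get-Content -LiteralPath $file.FullName)
        [IO.File]::WriteAllLines((Join-Path $dst $file.Name), $lines, $utf8NoBom)
    }
}
```

Call it as Convert-TextFilesToUtf8NoBom -SourceDir DIRECTORY1 -DestDir DIRECTORY2. Note that in Windows PowerShell, Get-Content reads BOM-less input files using the ANSI code page unless you pass -Encoding, so source files that are already BOM-less UTF-8 with non-ASCII characters would need -Encoding UTF8 on the read.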

Important: this only works if an extra space or newline at the start of the file is no problem for your use case (e.g. if it is an SQL file, a Java file, or a human-readable text file).
One could first create a file that has no BOM, by writing a single space with ASCII encoding (which is UTF-8-compatible), and then append to it (replace $str with gc $src if the source is a file):
" " | out-file -encoding ASCII -noNewline $dest
$str | out-file -encoding UTF8 -append $dest
As a one-liner (replace $dest and $str according to your use case):
$_ofdst = $dest ; " " | out-file -encoding ASCII -noNewline $_ofdst ; $str | out-file -encoding UTF8 -append $_ofdst
As a simple function:
function Out-File-UTF8-noBOM { param( $str, $dest )
    " " | out-file -encoding ASCII -noNewline $dest
    $str | out-file -encoding UTF8 -append $dest
}
Using it with a source file (note that PowerShell function arguments are separated by spaces, not commas):
Out-File-UTF8-noBOM (gc $src) $dest
Using it with a string:
Out-File-UTF8-noBOM $str $dest
Optionally, continue appending with Out-File:
"more foo bar" | Out-File -encoding UTF8 -append $dest

Old question, new answer:
While the "old" PowerShell writes a BOM, the new cross-platform variant behaves differently: the default is "no BOM", and it can be configured via the -Encoding parameter:
-Encoding
Specifies the type of encoding for the target file. The default value is utf8NoBOM.
The acceptable values for this parameter are as follows:
ascii: Uses the encoding for the ASCII (7-bit) character set.
bigendianunicode: Encodes in UTF-16 format using the big-endian byte order.
oem: Uses the default encoding for MS-DOS and console programs.
unicode: Encodes in UTF-16 format using the little-endian byte order.
utf7: Encodes in UTF-7 format.
utf8: Encodes in UTF-8 format.
utf8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
utf8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)
utf32: Encodes in UTF-32 format.
Source: https://learn.microsoft.com/de-de/powershell/module/Microsoft.PowerShell.Utility/Out-File?view=powershell-7
Emphasis mine

For PowerShell 5.1, enable this setting:
Control Panel, Region, Administrative, Change system locale, Use Unicode UTF-8 for worldwide language support
Then enter this into PowerShell:
$PSDefaultParameterValues['*:Encoding'] = 'Default'
Alternatively, you can upgrade to PowerShell 6 or higher.
https://github.com/PowerShell/PowerShell

I would say to just use the Set-Content command; nothing else is needed.
The PowerShell version on my system is:
PS C:\Users\XXXXX> $PSVersionTable.PSVersion | fl
Major : 5
Minor : 1
Build : 19041
Revision : 1682
MajorRevision : 0
MinorRevision : 1682
PS C:\Users\XXXXX>
So you would need something like following.
PS C:\Users\XXXXX> Get-Content .\Downloads\finddate.txt
Thursday, June 23, 2022 5:57:59 PM
PS C:\Users\XXXXX> Get-Content .\Downloads\finddate.txt | Set-Content .\Downloads\anotherfile.txt
PS C:\Users\XXXXX> Get-Content .\Downloads\anotherfile.txt
Thursday, June 23, 2022 5:57:59 PM
PS C:\Users\XXXXX>
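Now when we check anotherfile.txt, it is reported as UTF-8. (Note that this content is pure ASCII. In Windows PowerShell 5.1, Set-Content writes in the system's ANSI code page by default, so non-ASCII characters would not be saved as UTF-8.)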

Change multiple files by extension to UTF-8 without BOM:
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach ($i in ls -recurse -filter "*.java") {
    $MyFile = Get-Content $i.fullname
    [System.IO.File]::WriteAllLines($i.fullname, $MyFile, $Utf8NoBomEncoding)
}

[System.IO.FileInfo] $file = Get-Item -Path $FilePath
$sequenceBOM = New-Object System.Byte[] 3
$reader = $file.OpenRead()
$bytesRead = $reader.Read($sequenceBOM, 0, 3)
$reader.Dispose()
# A file with a UTF-8 BOM starts with the following three bytes. Hex: 0xEF 0xBB 0xBF, decimal: 239 187 191
if ($bytesRead -eq 3 -and $sequenceBOM[0] -eq 239 -and $sequenceBOM[1] -eq 187 -and $sequenceBOM[2] -eq 191)
{
    $utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
    [System.IO.File]::WriteAllLines($FilePath, (Get-Content $FilePath), $utf8NoBomEncoding)
    Write-Host "Remove UTF-8 BOM successfully"
}
Else
{
    Write-Warning "Not UTF-8 BOM file"
}
Source: How to remove UTF8 Byte Order Mark (BOM) from a file using PowerShell
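If you only want to strip the BOM and leave every other byte untouched (including the existing newline style), a byte-level variant is another option. This is just a sketch; the function name Remove-Utf8Bom is hypothetical, and it loads the whole file into memory:

```powershell
function Remove-Utf8Bom {
    param([Parameter(Mandatory)] [string] $LiteralPath)

    # Resolve to a full path, since .NET's working directory can differ
    # from PowerShell's current location.
    $full = Convert-Path -LiteralPath $LiteralPath
    $bytes = [IO.File]::ReadAllBytes($full)

    # Check for the UTF-8 BOM: EF BB BF.
    if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
        # Copy everything after the first three bytes and write it back.
        $rest = New-Object byte[] ($bytes.Length - 3)
        [Array]::Copy($bytes, 3, $rest, 0, $rest.Length)
        [IO.File]::WriteAllBytes($full, $rest)
        return $true
    }
    # No BOM found; the file is left as it was.
    return $false
}
```

It returns $true if a BOM was removed and $false otherwise, so it can be run safely over files that may or may not have one.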

If you want to use [System.IO.File]::WriteAllLines(), you should cast second parameter to String[] (if the type of $MyFile is Object[]), and also specify absolute path with $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath), like:
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Set-Variable MyFile
[System.IO.File]::WriteAllLines($ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($MyPath), [String[]]$MyFile, $Utf8NoBomEncoding)
If you want to use [System.IO.File]::WriteAllText(), sometimes you should pipe the second parameter into | Out-String | to add CRLFs to the end of each line explicitly (especially when you use them with ConvertTo-Csv):
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Out-String | Set-Variable tmp
[System.IO.File]::WriteAllText("/absolute/path/to/foobar.csv", $tmp, $Utf8NoBomEncoding)
Or you can use [Text.Encoding]::UTF8.GetBytes() with Set-Content -Encoding Byte:
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
Get-ChildItem | ConvertTo-Csv | Out-String | % { [Text.Encoding]::UTF8.GetBytes($_) } | Set-Content -Encoding Byte -Path "/absolute/path/to/foobar.csv"
see: How to write result of ConvertTo-Csv to a file in UTF-8 without BOM

I had the same problem in PowerShell and fixed it with this setting:
$PSDefaultParameterValues['*:Encoding'] = 'utf8'

One technique I utilize is to redirect output to an ASCII file using the Out-File cmdlet.
For example, I often run SQL scripts that create another SQL script to execute in Oracle. With simple redirection (">"), the output will be in UTF-16 which is not recognized by SQLPlus. To work around this:
sqlplus -s / as sysdba "@create_sql_script.sql" |
Out-File -FilePath new_script.sql -Encoding ASCII -Force
The generated script can then be executed via another SQLPlus session without any Unicode worries:
sqlplus / as sysdba "@new_script.sql" |
tee new_script.log
Update: As others have pointed out, this will drop non-ASCII characters. Since the user asked for a way to "force" conversion, I assume they do not care about that, perhaps because their data does not contain such characters.
If you care about the preservation of non-ASCII characters, this is not the answer for you.

I used this method to edit a UTF8-NoBOM file, and it generated a file with the correct encoding:
$fileD = "file.xml"
(Get-Content $fileD) | ForEach-Object { $_ -replace 'replace text',"new text" } | out-file "file.xml" -encoding ASCII
I was skeptical of this method at first, but it surprised me and worked!
Tested with PowerShell version 5.1.

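You could use the following to get UTF-8 without a BOM (as long as the content is pure ASCII, since non-ASCII characters are replaced with ?):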
$MyFile | Out-File -Encoding ASCII

This one works for me (use "Default" instead of "UTF8"):
$MyFile = Get-Content $MyPath
$MyFile | Out-File -Encoding "Default" $MyPath
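The result has no BOM. Note that in Windows PowerShell, "Default" is the system's ANSI code page, which matches ASCII (and UTF-8) only for pure-ASCII content.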
