PowerShell: Set-Content Replace word and Encoding UTF8 without BOM

I'd like to escape \ as \\ in a CSV file before uploading it to Redshift.
The following simple PowerShell script can replace $TargetWord \ with $ReplaceWord \\ as expected, but it exports UTF-8 with a BOM, which sometimes causes a Redshift COPY error.
Any advice on improving it would be appreciated. Thank you in advance.
Exp_Escape.ps1
Param(
[string]$StrExpFile,
[string]$TargetWord,
[string]$ReplaceWord
)
$(Get-Content "$StrExpFile").replace($TargetWord,$ReplaceWord) | Set-Content -Encoding UTF8 "$StrExpFile"

In PowerShell (Core) 7+, you get BOM-less UTF-8 files by default; -Encoding utf8 and -Encoding utf8NoBOM express that default explicitly; to use a BOM, -Encoding utf8BOM is needed.
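For example, in PowerShell (Core) 7+ a version of the question's script works as intended without any workaround (a minimal sketch using the question's parameters):
# PowerShell (Core) 7+ only: -Encoding utf8 is already BOM-less here.
Param(
    [string]$StrExpFile,
    [string]$TargetWord,
    [string]$ReplaceWord
)
(Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) |
    Set-Content -Encoding utf8 $StrExpFile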
In Windows PowerShell, unfortunately, you must use a workaround to get BOM-less UTF-8, because -Encoding utf8 only produces UTF-8 files with BOM (and no other utf8-related values are supported).
The workaround requires combining Out-String with New-Item, which (curiously) creates BOM-less UTF-8 files by default even in Windows PowerShell:
Param(
[string]$StrExpFile,
[string]$TargetWord,
[string]$ReplaceWord
)
$null =
New-Item -Force $StrExpFile -Value (
(Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) | Out-String
)
Note:
$null = is needed to discard the output object that New-Item emits (a file-info object describing the newly created file).
-Force is needed in order to quietly overwrite an existing file by the same name (as Set-Content and Out-File do by default).
The -Value argument must be a single (multi-line) string to write to the file, which is what Out-String ensures.
Caveats:
For non-string input objects, Out-String creates the same rich for-display representations as Out-File and as you would see in the console by default.
New-Item itself does not append a trailing newline when it writes the string to the file, but Out-String curiously does; while this happens to be handy here, it is generally problematic, as discussed in GitHub issue #14444.
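A quick way to see this behavior for yourself (illustrative only):
('foo' | Out-String).EndsWith("`n")  # $true - Out-String appended a newline
('foo' | Out-String).Length          # 5 on Windows: "foo" plus a trailing CRLF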
The alternative to using Out-String is to create the multi-line string manually, which is a bit more cumbersome ("`n" is used to create LF-only newlines, which PowerShell and most programs happily accept even on Windows; for platform-native newlines (CRLF) on Windows, use [Environment]::NewLine instead):
$null =
New-Item -Force $StrExpFile -Value (
((Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) -join "`n") + "`n"
)
Since the entire file content must be passed as an argument,[1] it must fit into memory as a whole; the convenience function discussed next avoids this problem.
For a convenience wrapper function around Out-File for use in Windows PowerShell that creates BOM-less UTF-8 files, see this answer.
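If you only need the basics, a simplified sketch of such a wrapper could look like the following (the function name Write-Utf8NoBomFile is made up here for illustration; the linked answer's implementation is more robust):
# Simplified sketch for Windows PowerShell; stringifies inputs roughly as Set-Content does.
function Write-Utf8NoBomFile {
    param(
        [Parameter(Mandatory)] [string] $LiteralPath,
        [Parameter(ValueFromPipeline)] $InputObject
    )
    begin { $lines = [System.Collections.Generic.List[string]]::new() }
    process { $lines.Add([string] $InputObject) }
    end {
        # .NET needs a full, native path, because its working dir. usually differs from PowerShell's.
        $dir = Split-Path -Parent $LiteralPath
        if ($dir -eq '') { $dir = '.' }
        $fullPath = Join-Path (Convert-Path $dir) (Split-Path -Leaf $LiteralPath)
        [IO.File]::WriteAllLines($fullPath, $lines)  # .NET writes BOM-less UTF-8 by default
    }
}
# Usage:
#   (Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) | Write-Utf8NoBomFile $StrExpFile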
Alternative, with direct use of .NET APIs:
.NET APIs produce BOM-less UTF-8 files by default.
However, because .NET's working directory usually differs from PowerShell's, full file paths must always be used, which requires more effort:
# In order for .NET API calls to work as expected,
# file paths must be expressed as *full, native* paths.
$OutDir = Split-Path -Parent $StrExpFile
if ($OutDir -eq '') { $OutDir = '.' }
$strExpFileFullPath = Join-Path (Convert-Path $OutDir) (Split-Path -Leaf $StrExpFile)
# Note: .NET APIs create BOM-less UTF-8 files *by default*
[IO.File]::WriteAllLines(
$strExpFileFullPath,
(Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord)
)
The above uses the System.IO.File.WriteAllLines method.
[1] Note that while New-Item technically supports receiving the content to write to the file via the pipeline, it unfortunately writes each input object to the target file individually, one after the other, so that only the last one ends up in the file.

Related

How to echo JSON content to an exe file and then save the output to a file?

I am using a script in Git Bash, which performs a few curl calls to HTTP endpoints expecting and producing protobuf.
The curl output is piped to a custom proto2json.exe and the result is finally saved to a JSON file:
#!/bin/bash
SCRIPT_DIR=$(dirname $0)
JSON2PROTO="$SCRIPT_DIR/json2proto.exe"
PROTO2JSON="$SCRIPT_DIR/proto2json.exe"
echo '{"key1":"value1","version":3}' | $JSON2PROTO -v 3 > request.dat
curl --insecure --data-binary @request.dat --output - https://localhost/protobuf | $PROTO2JSON -v 3 > response.json
The script works well and now I am trying to port it to Powershell:
$SCRIPT_DIR = Split-Path -parent $PSCommandPath
$JSON2PROTO = "$SCRIPT_DIR/json2proto.exe"
$PROTO2JSON = "$SCRIPT_DIR/proto2json.exe"
@{
    key1 = 'value1';
    version = 3;
} | ConvertTo-Json | & $JSON2PROTO -v 3 > request.dat
Unfortunately, when I compare the binary files generated in Git Bash and in PowerShell, I see that the latter file has additional zero bytes inserted.
Is the GitHub issue #1908 related to my issue?
It looks like you're ultimately after this:
$SCRIPT_DIR = Split-Path -parent $PSCommandPath
$JSON2PROTO = "$SCRIPT_DIR/json2proto.exe"
$PROTO2JSON = "$SCRIPT_DIR/proto2json.exe"
# Make sure that the output from your $JSON2PROTO executable is correctly decoded
# as UTF-8.
# You may want to restore the original encoding later.
[Console]::OutputEncoding = [System.Text.Utf8Encoding]::new()
# Capture the output lines from calling the $JSON2PROTO executable.
# Note: PowerShell captures a *single* output line as-is, and
# *multiple* ones *as an array*.
[array] $output =
@{
    key1 = 'value1';
    version = 3;
} | ConvertTo-Json | & $JSON2PROTO -v 3
# Filter out empty lines to extract the one line of interest.
[string] $singleOutputLineOfInterest = $output -ne ''
# Write a BOM-less UTF-8 file with the given text as-is,
# without appending a newline.
[System.IO.File]::WriteAllText(
"$PWD/request.dat",
$singleOutputLineOfInterest
)
As for what you tried:
In PowerShell, > is an effective alias of the Out-File cmdlet, whose default output character encoding in Windows PowerShell is "Unicode" (UTF-16LE) - which is what you saw - and, in PowerShell (Core) 7+, BOM-less UTF8. To control the character encoding, call Out-File or, for text input, Set-Content with the -Encoding parameter.
Note that you may also have to ensure that an external program's output is first properly decoded, which happens based on the encoding stored in [Console]::OutputEncoding - see this answer for more information.
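For example, a sketch of saving and restoring the console encoding around the external call (using the thread's $JSON2PROTO as a stand-in for any program that emits UTF-8 output):
$prevEncoding = [Console]::OutputEncoding
try {
    [Console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
    $output = '{"key1":"value1","version":3}' | & $JSON2PROTO -v 3
}
finally {
    [Console]::OutputEncoding = $prevEncoding  # restore the original encoding, whatever happens
}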
Note that you can't avoid these decoding + re-encoding steps in PowerShell as of v7.2.4, because the PowerShell pipeline currently cannot serve as a conduit for raw bytes, as discussed in this answer, which also links to the GitHub issue you mention.
Finally, note that both Out-File and Set-Content by default append a trailing, platform-native newline to the output file. While -NoNewLine suppresses that, it also suppresses newlines between multiple input objects, so you may have to use the -join operator to manually join the inputs with newlines in the desired format, e.g. (1, 2) -join "`n" | Set-Content -NoNewLine out.txt
If, in Windows PowerShell, you want to create UTF-8 files without a BOM, you can't use a file-writing cmdlet and must instead use .NET APIs directly (PowerShell (Core) 7+, by contrast, produces BOM-less UTF-8 files by default, consistently). .NET APIs do and always have created BOM-less UTF-8 files by default; e.g.:
[System.IO.File]::WriteAllLines() writes the elements of an array as lines to an output file, with each line terminated with a platform-native newline, i.e. CRLF (0xD 0xA) on Windows, and LF (0xA) on Unix-like platforms.
[System.IO.File]::WriteAllText() writes a single (potentially multi-line) string as-is to an output file.
Important: Always pass full paths to file-related .NET APIs, because PowerShell's current location (directory) usually differs from .NET's.
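A minimal sketch contrasting the two methods (the file name is illustrative):
$fullPath = Join-Path (Convert-Path .) 'demo.txt'               # full, native path for .NET
[System.IO.File]::WriteAllLines($fullPath, ('line1', 'line2'))  # each line newline-terminated
[System.IO.File]::WriteAllText($fullPath, "line1`nline2")       # written verbatim, no trailing newline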

Issues with specific characters in outfile

I have a script that merges files, and that works fine, but characters like åäö do not look right in the output file.
Here is the complete script:
$startOfToday = (Get-Date).Date
Get-ChildItem "C:\TEST -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday | ForEach-Object {gc $_; ""} |
Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"
In the input files it looks like this, for example:
Order ID 1
Order ID 2
This is för får
In the output, the last row ends up like this:
Order ID 1
Order ID 2
får för fär
Is there a way to make those characters appear in the output file as they appear in the input file?
The implication is that your input files are UTF-8-encoded without a BOM, which in Windows PowerShell are (mis)interpreted to be ANSI-encoded (using the system's active ANSI code page, such as Windows-1252).
The solution is to tell gc (Get-Content) explicitly what encoding to use, via the -Encoding parameter:
Get-ChildItem C:\TEST -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday |
ForEach-Object { Get-Content -Encoding Utf8 $_; ""} |
Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"
Note that PowerShell never preserves the input encoding automatically; therefore, in the absence of -Encoding with Out-File, its default encoding is used, which is "Unicode" (UTF-16LE) in Windows PowerShell.
While PowerShell (Core) 7+ also doesn't preserve input encodings, it consistently defaults to BOM-less UTF-8, so your original code would work as-is there.
For more information about default encodings in Windows PowerShell vs. PowerShell (Core) 7+, see this answer.
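For instance, if you also want the output file itself to be UTF-8 in Windows PowerShell (necessarily with a BOM there, since -Encoding utf8 is the only UTF-8 flavor it offers), add -Encoding to Out-File as well:
Get-ChildItem "C:\TEST" -Include *.* -Recurse |
  Where-Object LastWriteTime -gt $startOfToday |
  ForEach-Object { Get-Content -Encoding Utf8 $_; "" } |
  Out-File -Encoding utf8 "C:\$(Get-Date -Format 'yyyy-MM-dd').txt"  # UTF-8 with BOM in Windows PowerShell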
Note: As AdminOfThings suggests in a comment, simply replacing Out-File with Set-Content in your original code also works in this particular case, because the same misinterpretation of the encoding is then performed on both in- and output, and the data is simply being passed through. This isn't a general solution, however, notably not if you need to process the strings in memory first, before saving them to a file.

Backing up of the "Mail flow - Rules" in the "Exchange admin center"

I need to take a backup of the "Mail flow - Rules" in the "Exchange admin center"
$TransCollect = Export-TransportRuleCollection
$TransCollect1 = [System.Text.Encoding]::Unicode.GetString($TransCollect.FileData)
$TransCollect1 | Set-Content -path c:\temp\2.xml
But I cannot extract anything from the XML file, because there is a special character at the start of the file.
So if I run ...
[XML]$AppConfig = Get-Content -Path "c:\temp\2.xml"
I get several errors.
Is there a problem in the "[System.Text.Encoding]::Unicode.GetString" line itself, or
how do I remove this special character?
See the screenshot for the special character; it shows up at the beginning of the file.
The "special character" you see is the Byte Order Mark for Unicode (aka UTF-16LE) (\xFF\xFE) because you used [System.Text.Encoding]::Unicode.GetString(). To read that file with Get-Content, you need to specify that same encoding using its -Encoding parameter:
[XML]$AppConfig = Get-Content -Path "c:\temp\2.xml" -Encoding Unicode
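To verify which BOM, if any, a file starts with, you can inspect its first bytes (an illustrative check that works in Windows PowerShell and PowerShell 7+ alike):
[System.IO.File]::ReadAllBytes('c:\temp\2.xml')[0..1] |
    ForEach-Object { '0x{0:X2}' -f $_ }  # 0xFF 0xFE = UTF-16LE BOM; 0xEF 0xBB (0xBF) = UTF-8 BOM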
Depending on the PowerShell version you are using, Get-Content by default uses different encodings:
Versions 5.1 and below use the encoding that corresponds to the system's active code page (usually ANSI)
Versions 6 and up use UTF8NoBOM
Is there a specific reason for writing the XML in UTF-16? Usually, UTF-8 is used with XML.
I was able to use this to bypass the special character entirely, as well as get the Exchange rules into an extractable format.
$File_Collect = "c:\temp\1.xml"
$TransCollect = Export-TransportRuleCollection
Set-Content -Path $File_Collect -Value $TransCollect.FileData -Encoding Byte
[XML]$TransXMLCollect = Get-Content -Path $File_Collect
$TransXMLCollect.SelectNodes("//rule") | % { $_.InnerText }
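Note that -Encoding Byte exists only in Windows PowerShell; in PowerShell (Core) 6+, it was replaced by the -AsByteStream switch, so the equivalent there would be:
# PowerShell (Core) 6+ equivalent of -Encoding Byte:
Set-Content -Path $File_Collect -Value $TransCollect.FileData -AsByteStream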


PowerShell Set-Content and Out-File - what is the difference?

In PowerShell, what's the difference between Out-File and Set-Content? Or Add-Content and Out-File -append?
I've found if I use both against the same file, the text is fully mojibaked.
(A minor second question: > is an alias for Out-File, right?)
Here's a summary of what I've deduced, after a few months experience with PowerShell, and some scientific experimentation. I never found any of this in the documentation :(
[Update: Much of this now appears to be better documented.]
Read and write locking
While Out-File is running, another application can read the log file.
While Set-Content is running, other applications cannot read the log file. Thus, never use Set-Content to log long-running commands.
Encoding
Out-File saves in the Unicode (UTF-16LE) encoding by default (though this can be specified), whereas Set-Content defaults to ASCII (US-ASCII) in PowerShell 3+ (this may also be specified). In earlier PowerShell versions, Set-Content wrote content in the Default (ANSI) encoding.
Editor's note: PowerShell as of version 5.1 still defaults to the culture-specific Default ("ANSI") encoding, despite what the documentation claims. If ASCII were the default, non-ASCII characters such as ü would be converted to literal ?, but that is not the case: 'ü' | Set-Content tmp.txt; (Get-Content tmp.txt) -eq '?' yields $False.
PS > $null | out-file outed.txt
PS > $null | set-content set.txt
PS > md5sum *
f3b25701fe362ec84616a93a45ce9998 *outed.txt
d41d8cd98f00b204e9800998ecf8427e *set.txt
This means the defaults of two commands are incompatible, and mixing them will corrupt text, so always specify an encoding.
Formatting
As Bartek explained, Out-File saves the fancy formatting of the output, as seen in the terminal. So in a folder with two files, the command dir | out-file out.txt creates a file with 11 lines.
Whereas Set-Content saves a simpler representation. In that folder with two files, the command dir | set-content sc.txt creates a file with two lines. To emulate the output in the terminal:
PS > dir | ForEach-Object {$_.ToString()}
out.txt
sc.txt
I believe this formatting has a consequence for line breaks, but I can't describe it yet.
File creation
Set-Content doesn't reliably create an empty file when Out-File would:
In an empty folder, the command dir | out-file out.txt creates a file, while dir | set-content sc.txt does not.
Pipeline Variable
Set-Content takes the filename from the pipeline, allowing you to set a number of files' contents to some fixed value.
Out-File takes the data from the pipeline, updating a single file's content.
Parameters
Set-Content includes the following additional parameters:
Exclude
Filter
Include
PassThru
Stream
UseTransaction
Out-File includes the following additional parameters:
Append
NoClobber
Width
For more information about what those parameters are, please refer to help; e.g. get-help out-file -parameter append.
Out-File has the behavior of overwriting the output path unless the -NoClobber and/or the -Append flag is set. Add-Content will append content if the output path already exists by default (if it can). Both will create the file if one doesn't already exist.
Another interesting difference is that Add-Content will create an ASCII encoded file by default and Out-File will create a little endian unicode encoded file by default.
> is, in effect, syntactic sugar for Out-File: it's Out-File with some pre-defined parameter settings.
Well, I would disagree... :)
Out-File has -Append (-NoClobber is there to avoid overwriting), which makes it behave like Add-Content. But it's not the same beast.
command | Add-Content will use the .ToString() method on the input. Out-File will use the default formatting.
so:
ls | Add-Content test.txt
and
ls | Out-File test.txt
will give you totally different results.
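A quick way to see that difference for yourself (illustrative):
Get-Item $PSHOME | Add-Content test1.txt  # one line: the .ToString() value, i.e. the directory path
Get-Item $PSHOME | Out-File test2.txt     # several lines: the formatted, for-display directory listing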
And no, '>' is not an alias; it's a redirection operator (same as in other shells), and it has a very serious limitation... It will cut lines the same way they are displayed. Out-File has a -Width parameter that helps you avoid this. Also, with redirection operators you can't decide what encoding to use.
HTH
Bartek
Set-Content supports -Encoding Byte, while Out-File does not.
So when you want to write binary data, or the result of Text.Encoding#GetBytes(), to a file, you should use Set-Content.
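For example (Windows PowerShell syntax; in PowerShell (Core) 6+, -Encoding Byte was removed in favor of the -AsByteStream switch):
$bytes = [System.Text.Encoding]::UTF8.GetBytes('hällo')
Set-Content -Encoding Byte -Value $bytes bytes.bin    # Windows PowerShell
# Set-Content -AsByteStream -Value $bytes bytes.bin   # PowerShell (Core) 6+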
Wanted to add about difference on encoding:
Windows with PowerShell 5.1:
Out-File - Default encoding is utf-16le
Set-Content - Default encoding is us-ascii
Linux with PowerShell 7.1:
Out-File - Default encoding is us-ascii
Set-Content - Default encoding is us-ascii
Editor's note: PowerShell (Core) 6+ actually defaults to BOM-less UTF-8 for both cmdlets, on all platforms; a file that happens to contain only ASCII characters is byte-identical to US-ASCII, which is presumably what an encoding-detection tool reported here.
Out-File -Append or >> can actually mix two encodings in the same file. Even if the file is originally ASCII or ANSI, it will by default append Unicode (UTF-16LE) at the bottom of it. Add-Content checks the existing encoding and matches it before appending. By the way, Export-Csv defaults to ASCII (no accents), and Set-Content/Add-Content to ANSI (all of this applies to Windows PowerShell).
TL;DR: use Set-Content, as it's more consistent than Out-File.
Set-Content's behavior is the same across PowerShell versions.
Out-File, as JagWireZ says, produces different encodings with the default settings, even on the same OS (Windows); the docs for PowerShell 5.1 and PowerShell 7.3 state that the default encoding changed from unicode to utf8NoBOM.
Some issues, like malformed XML, arise from using Out-File; those could of course be fixed by setting the desired encoding, but it's easy to forget to set the encoding and end up with issues.