I just can't get ASCII encoding to work in PowerShell. I've tried a bunch of different approaches.
Whatever I try, I get a UTF-8 encoded file (that is what Notepad++ tells me):
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding ascii
PSVersion = 5.1.14393.5066
Any hint is welcome!
ASCII is a 7-bit character set and doesn't contain any accented character, so obviously storing öäü in ASCII doesn't work. If you need UTF-8 then you need to specify encoding as utf8
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding utf8
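To see why the accented characters cannot survive an ASCII conversion, here is a minimal sketch that encodes the sample text in memory instead of writing a file. The variable names are illustrative only; the string is built from code points so that the encoding of the script file itself cannot affect the result.

```powershell
# Build "Ein Test" followed by o-, a- and u-umlaut from code points.
$text = 'Ein Test ' + [char]0xF6 + [char]0xE4 + [char]0xFC

# Encode to US-ASCII and decode again; characters outside the 7-bit range are replaced.
$asciiBytes = [Text.Encoding]::ASCII.GetBytes($text)
$roundTrip  = [Text.Encoding]::ASCII.GetString($asciiBytes)

$roundTrip        # Ein Test ???
$asciiBytes[-1]   # 63, the code for '?'
```

The three umlauts come back as literal question marks, which is the only thing a true ASCII file could contain in their place.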
If you need another encoding then specify it accordingly. For example to get the ANSI code page use this
$newLine = "Ein Test öäü"
$newLine | Out-File -FilePath "c:\temp\check.txt" -Encoding default
-Encoding default will save the file in the current ANSI code page and -Encoding oem will use the current OEM code page. Just press Tab after -Encoding and PowerShell will cycle through the list of supported encodings. For encodings not in that list you can trivially deal with them using System.Text.Encoding
Note that "ANSI code page" is a misnomer and the actual encoding changes depending on each environment so it won't be reliable. For example if you change the code page manually then it won't work anymore. For a more reliable behavior you need to explicitly specify the encoding (typically Windows-1252 for Western European languages). In older PowerShell use
[IO.File]::WriteAllLines("c:\temp\check.txt", $newLine, [Text.Encoding]::GetEncoding(1252))
and in PowerShell Core you can use
$newLine | Out-File -FilePath "check2.txt" -Encoding ([Text.Encoding]::GetEncoding(1252))
See How do I change my Powershell script so that it writes out-file in ANSI - Windows-1252 encoding?
Found the solution:
$file = "c:\temp\check-ansi.txt"
$newLine = "Ein Test öÖäÄüÜ"
Remove-Item $file
[IO.File]::WriteAllLines($file, $newLine, [Text.Encoding]::GetEncoding(28591))
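Rather than trusting an editor's guess about the encoding, the bytes on disk can be inspected directly. The following is only a sketch: the temp-file name is made up for illustration, and it uses WriteAllText instead of WriteAllLines so that no line terminator is added to the output.

```powershell
# o-, a- and u-umlaut, built from code points to stay independent of the script's own encoding.
$text = '' + [char]0xF6 + [char]0xE4 + [char]0xFC
$path = Join-Path ([IO.Path]::GetTempPath()) 'check-encoding-demo.txt'   # hypothetical file name

# ISO-8859-1 (code page 28591): one byte per character.
[IO.File]::WriteAllText($path, $text, [Text.Encoding]::GetEncoding(28591))
$latin1Bytes = [IO.File]::ReadAllBytes($path)

# UTF-8 without BOM: two bytes per umlaut.
[IO.File]::WriteAllText($path, $text, (New-Object System.Text.UTF8Encoding $false))
$utf8Bytes = [IO.File]::ReadAllBytes($path)

($latin1Bytes | ForEach-Object { '{0:X2}' -f $_ }) -join ' '   # F6 E4 FC
($utf8Bytes   | ForEach-Object { '{0:X2}' -f $_ }) -join ' '   # C3 B6 C3 A4 C3 BC

Remove-Item $path
```

If the hex dump of your own file shows single bytes such as F6 for ö, the file is in a single-byte code page; pairs such as C3 B6 indicate UTF-8.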
Related
I need to take a backup of the "Mail flow - Rules" in the "Exchange admin center"
$TransCollect = Export-TransportRuleCollection
$TransCollect1 = [System.Text.Encoding]::Unicode.GetString($TransCollect.FileData)
$TransCollect1 | Set-Content -path c:\temp\2.xml
But I cannot extract anything from the XML file because at the start of the XML file is a special character.
So if I run
[XML]$AppConfig = Get-Content -Path "c:\temp\2.xml"
I get several errors.
Is there a problem in the "[System.Text.Encoding]::Unicode.GetString" line itself, or
how do I remove this special character?
See the screenshot for the special character. It shows up at the beginning of the file.
The "special character" you see is the Byte Order Mark for Unicode (aka UTF-16LE) (\xFF\xFE) because you used [System.Text.Encoding]::Unicode.GetString(). To read that file with Get-Content, you need to specify that same encoding using its -Encoding parameter:
[XML]$AppConfig = Get-Content -Path "c:\temp\2.xml" -Encoding Unicode
Depending on the PowerShell version you are using, Get-Content by default uses different encodings:
Versions 5.1 and below use the encoding that corresponds to the system's active code page (usually ANSI)
Versions 6 and up use UTF8NoBOM
Is there a specific reason for writing the XML in UTF-16? Usually, UTF-8 is used with XML.
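Another option, if the decoded string is used directly, is to strip the BOM character before parsing. The sketch below does not call Export-TransportRuleCollection; it assumes the FileData bytes are UTF-16LE with a BOM and fakes them with a stand-in XML string, so the element names are purely illustrative.

```powershell
$enc = [Text.Encoding]::Unicode   # UTF-16LE

# Stand-in for $TransCollect.FileData: BOM (FF FE) followed by the encoded XML.
[byte[]]$bytes = $enc.GetPreamble() + $enc.GetBytes('<rules><rule>demo</rule></rules>')

# GetString() does not remove the BOM; it survives as the character U+FEFF.
$decoded = $enc.GetString($bytes)
[int]$decoded[0]    # 65279 (0xFEFF)

# Strip the leading BOM character, then parse.
$clean = $decoded.TrimStart([char]0xFEFF)
[xml]$doc = $clean
$doc.rules.rule     # demo
```

With the BOM character removed, the XML can be parsed straight from the string without going through a file at all.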
I was able to use this and bypass the special character entirely, as well as getting the Exchange rules into an extractable format.
$File_Collect = "c:\temp\1.xml"
$TransCollect = Export-TransportRuleCollection
Set-Content -Path $File_Collect -Value $TransCollect.FileData -Encoding Byte
[XML]$TransXMLCollect = Get-Content -Path $File_Collect
$TransXMLCollect.SelectNodes("//rule") | % { $_.InnerText }
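The same byte-for-byte idea can be sketched with plain .NET calls, which behave the same in Windows PowerShell and PowerShell 6+. This is an illustration under assumptions: the bytes below are a made-up stand-in for $TransCollect.FileData, and the file name is hypothetical.

```powershell
# Stand-in for the exported byte array (assumption: the real data is an XML document).
$bytes = [Text.Encoding]::UTF8.GetBytes('<rules><rule>first</rule><rule>second</rule></rules>')
$path  = Join-Path ([IO.Path]::GetTempPath()) 'rules-demo.xml'   # hypothetical file name

# Write the bytes untouched, so no text encoding is applied on the way to disk.
[IO.File]::WriteAllBytes($path, $bytes)

# Let the XML parser work out the encoding from the file itself.
$doc = New-Object System.Xml.XmlDocument
$doc.Load($path)
$texts = $doc.SelectNodes('//rule') | ForEach-Object { $_.InnerText }
$texts   # first, second

Remove-Item $path
```

Loading through XmlDocument.Load leaves encoding detection (BOM or XML declaration) to the parser instead of to Get-Content's version-dependent default.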
We receive Cognos reports that are encoded as UTF16LE. I am trying to create a powershell script to convert the UTF16LE files to UTF8. My logic so far does loop through the directory (whichever directory I place the script in as hardcoding the directory names that contain date/numbers caused errors) and save the files as UTF-8; however, the delimiters seem to be removed.
I believe that it may be due to the way that I am reading the data, as I am not specifying UTF16LE; however, I am unsure of any way to do that. My script so far is:
$files = Get-ChildItem
$dt = get-date -Format yyyyMMdd
$extension = "_" + "$dt" + "_utf8.csv"
ForEach ($file in $files) {
$file_name = $file.basename
$new_file = "$file_name" + "$extension"
echo $new_file
#Get-Content $file | Set-Content -Encoding UTF8 $new_file
}
Read-Host -Prompt "Press Enter to Close Window"
Any and all insight into this issue would be greatly appreciated.
PowerShell's Import-CSV and Export-CSV cmdlets support the -Encoding parameter (links to Microsoft Docs), so you could replace your line
Get-Content $file | Set-Content -Encoding UTF8 $new_file
with
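Import-CSV -Path $File -Encoding Unicode | Export-CSV -Path $New_File -Encoding UTF8 -NoTypeInformation
(UTF16LE encoding is what PowerShell calls "Unicode"; UTF16BE is "BigEndianUnicode". In PowerShell 6 and later the default is UTF8NoBOM, UTF-8 without a Byte Order Mark. In Windows PowerShell 5.1, Export-CSV defaults to ASCII and also writes a #TYPE header line unless -NoTypeInformation is given.)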
Since all you want to do is convert the character encoding, reading and writing as a string would be the most straightforward. As always, read a text file with the character encoding it was written with:
Get-Content -Encoding Unicode $file | Set-Content -Encoding UTF8 $new_file
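For a whole-file conversion, the same read-with-the-right-encoding idea can be sketched with .NET file APIs. The file names and sample content below are invented for the demonstration; the point is only that the text, delimiters included, comes through unchanged.

```powershell
$tmp = [IO.Path]::GetTempPath()
$src = Join-Path $tmp 'report_utf16_demo.csv'   # hypothetical input file
$dst = Join-Path $tmp 'report_utf8_demo.csv'    # hypothetical output file

# Create a small UTF-16LE sample with a delimiter and a non-ASCII character (u-umlaut).
$sample = "id;name`r`n1;M" + [char]0xFC + "ller`r`n"
[IO.File]::WriteAllText($src, $sample, [Text.Encoding]::Unicode)

# Read as UTF-16LE, write as UTF-8 without BOM.
$text = [IO.File]::ReadAllText($src, [Text.Encoding]::Unicode)
[IO.File]::WriteAllText($dst, $text, (New-Object System.Text.UTF8Encoding $false))

# Reading the result back as UTF-8 yields the original text, delimiters included.
$converted = [IO.File]::ReadAllText($dst, [Text.Encoding]::UTF8)
$converted -eq $sample   # True

Remove-Item $src, $dst
```

Because the .NET methods resolve relative paths against the process working directory rather than the PowerShell location, the sketch uses absolute paths throughout.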
Encoding "Unicode" for UTF-16 harkens back to the infancy of the Unicode character set when UCS-2 was going to be "it" for many environments. Then the explosion happened and UTF-16 was born from UCS-2. Systems invented since then quite reasonably use UTF16 or similar when they mean UTF-16 and "Unicode" for UTF-16 is esoteric and imponderable.
I used the answer to this question:
Using PowerShell to write a file in UTF-8 without the BOM
to encode a file (UCS-2) to UTF-8. The problem is that if I run the conversion twice (or more times), the Cyrillic text is broken. How can I skip the conversion if the file is already in UTF-8?
The code is:
$MyFile = Get-Content $MyPath
$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding $False
[System.IO.File]::WriteAllLines($MyPath, $MyFile, $Utf8NoBomEncoding)
Use:
$MyFile = Get-Content -Encoding UTF8 $MyPath
Initially, when $MyPath is UTF-16LE-encoded ("Unicode" encoding, which I assume is what you meant), PowerShell will ignore the -Encoding parameter due to the presence of a BOM in the file, which unambiguously identifies the encoding.
If your original file does not have a BOM, more work is needed.
Once you've saved $MyPath as UTF-8 without BOM, you must tell Windows PowerShell[1] that you expect UTF-8 encoding with -Encoding UTF8, as it interprets files as "ANSI"-encoded by default (encoded according to the typically single-byte code page associated with the legacy system locale).
[1] Note that the cross-platform PowerShell Core edition defaults to BOM-less UTF-8.
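To skip files that have already been converted, one possible approach is to look at the first bytes of the file. The helper below, Get-BomName, is a hypothetical name introduced for this sketch, not a built-in cmdlet. It only recognizes the three most common BOMs, and it cannot identify BOM-less UTF-8 positively; it can only report that no BOM is present.

```powershell
function Get-BomName {
    # Hypothetical helper: classifies a byte array by its leading byte order mark.
    param([byte[]]$Bytes)
    if ($Bytes.Length -ge 3 -and $Bytes[0] -eq 0xEF -and $Bytes[1] -eq 0xBB -and $Bytes[2] -eq 0xBF) { return 'UTF-8' }
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFF -and $Bytes[1] -eq 0xFE) { return 'UTF-16LE' }
    if ($Bytes.Length -ge 2 -and $Bytes[0] -eq 0xFE -and $Bytes[1] -eq 0xFF) { return 'UTF-16BE' }
    return 'none'
}

# Demonstration on in-memory data instead of a real file.
[byte[]]$utf16Sample = [Text.Encoding]::Unicode.GetPreamble() + [Text.Encoding]::Unicode.GetBytes('abc')
[byte[]]$plainSample = [Text.Encoding]::ASCII.GetBytes('abc')

Get-BomName -Bytes $utf16Sample   # UTF-16LE
Get-BomName -Bytes $plainSample   # none
```

In a conversion script, the bytes would come from [IO.File]::ReadAllBytes($MyPath), and the conversion would run only when the result is 'UTF-16LE'.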
In PowerShell, what's the difference between Out-File and Set-Content? Or Add-Content and Out-File -append?
I've found if I use both against the same file, the text is fully mojibaked.
(A minor second question: > is an alias for Out-File, right?)
Here's a summary of what I've deduced, after a few months experience with PowerShell, and some scientific experimentation. I never found any of this in the documentation :(
[Update: Much of this now appears to be better documented.]
Read and write locking
While Out-File is running, another application can read the log file.
While Set-Content is running, other applications cannot read the log file. Thus never use Set-Content to log long running commands.
Encoding
Out-File saves in the Unicode (UTF-16LE) encoding by default (though this can be specified), whereas Set-Content defaults to ASCII (US-ASCII) in PowerShell 3+ (this may also be specified). In earlier PowerShell versions, Set-Content wrote content in the Default (ANSI) encoding.
Editor's note: PowerShell as of version 5.1 still defaults to the culture-specific Default ("ANSI") encoding, despite what the documentation claims. If ASCII were the default, non-ASCII characters such as ü would be converted to literal ?, but that is not the case: 'ü' | Set-Content tmp.txt; (Get-Content tmp.txt) -eq '?' yields $False.
PS > $null | out-file outed.txt
PS > $null | set-content set.txt
PS > md5sum *
f3b25701fe362ec84616a93a45ce9998 *outed.txt
d41d8cd98f00b204e9800998ecf8427e *set.txt
This means the defaults of two commands are incompatible, and mixing them will corrupt text, so always specify an encoding.
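As an illustration of "always specify an encoding", here is a small sketch in which both cmdlets write to the same file with the same explicit encoding. The file name is made up, and the non-ASCII character is built from its code point so the script file's own encoding does not matter.

```powershell
$path  = Join-Path ([IO.Path]::GetTempPath()) 'mixed-cmdlets-demo.txt'   # hypothetical file name
$first = 'first ' + [char]0xFC                                           # ends in u-umlaut

# Same explicit encoding for the writer, the appender and the reader.
$first   | Out-File    -FilePath $path -Encoding utf8
'second' | Add-Content -Path     $path -Encoding utf8
$lines = Get-Content -Path $path -Encoding utf8

$lines

Remove-Item $path
```

With matching encodings on every call, the two lines should read back intact regardless of which cmdlet wrote them.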
Formatting
As Bartek explained, Out-File saves the fancy formatting of the output, as seen in the terminal. So in a folder with two files, the command dir | out-file out.txt creates a file with 11 lines.
Whereas Set-Content saves a simpler representation. In that folder with two files, the command dir | set-content sc.txt creates a file with two lines. To emulate the output in the terminal:
PS > dir | ForEach-Object {$_.ToString()}
out.txt
sc.txt
I believe this formatting has a consequence for line breaks, but I can't describe it yet.
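The formatting difference can be observed without writing any files. In this sketch, Out-String stands in for Out-File on the assumption that both go through the same display formatter, and the current process object is used only as a convenient sample input.

```powershell
$proc = Get-Process -Id $PID

# What the formatter produces: a multi-line table with headers.
$formatted = $proc | Out-String

# What .ToString() produces: a single short line.
$plain = $proc.ToString()

$formatted
$plain
```

The first output is the table you would see in the terminal; the second is a one-line description of the object.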
File creation
Set-Content doesn't reliably create an empty file when Out-File would:
In an empty folder, the command dir | out-file out.txt creates a file, while dir | set-content sc.txt does not.
Pipeline Variable
Set-Content takes the filename from the pipeline; allowing you to set a number of files' contents to some fixed value.
Out-File takes the data from the pipeline, updating a single file's content.
Parameters
Set-Content includes the following additional parameters:
Exclude
Filter
Include
PassThru
Stream
UseTransaction
Out-File includes the following additional parameters:
Append
NoClobber
Width
For more information about what those parameters are, please refer to help; e.g. get-help out-file -parameter append.
Out-File has the behavior of overwriting the output path unless the -NoClobber and/or the -Append flag is set. Add-Content will append content if the output path already exists by default (if it can). Both will create the file if one doesn't already exist.
Another interesting difference is that Add-Content will create an ASCII-encoded file by default and Out-File will create a little-endian Unicode (UTF-16LE) encoded file by default.
> is syntactic sugar for Out-File. It's Out-File with some pre-defined parameter settings.
Well, I would disagree... :)
Out-File has -Append (-NoClobber is there to avoid overwriting), which appends like Add-Content does. But this is not the same beast.
command | Add-Content will use .ToString() method on input. Out-File will use default formatting.
so:
ls | Add-Content test.txt
and
ls | Out-File test.txt
will give you totally different results.
And no, '>' is not an alias, it's a redirection operator (same as in other shells). And it has a very serious limitation: it will cut lines the same way they are displayed. Out-File has a -Width parameter that helps you avoid this. Also, with redirection operators you can't decide what encoding to use.
HTH
Bartek
Set-Content supports -Encoding Byte (in Windows PowerShell; PowerShell 6 and later use the -AsByteStream switch instead), while Out-File does not.
So when you want to write binary data or the result of Text.Encoding#GetBytes() to a file, you should use Set-Content.
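A version-aware sketch of writing raw bytes with Set-Content might look like this. The file name is hypothetical; the branch reflects that -Encoding Byte exists only in Windows PowerShell, while PowerShell 6 and later use -AsByteStream.

```powershell
$path  = Join-Path ([IO.Path]::GetTempPath()) 'bytes-demo.bin'   # hypothetical file name
$bytes = [Text.Encoding]::UTF8.GetBytes('demo')

if ($PSVersionTable.PSVersion.Major -ge 6) {
    # PowerShell 6+: byte output is a switch.
    Set-Content -Path $path -Value $bytes -AsByteStream
} else {
    # Windows PowerShell 5.1 and earlier: byte output is an encoding value.
    Set-Content -Path $path -Value $bytes -Encoding Byte
}

$written = [IO.File]::ReadAllBytes($path)
$written.Length   # 4

Remove-Item $path
```

The file contains exactly the four bytes that were passed in, with no BOM and no trailing newline.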
Wanted to add about difference on encoding:
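Windows with PowerShell 5.1:
Out-File - Default encoding is utf-16le
Set-Content - Default encoding is the system's ANSI code page (it looks like us-ascii when the content is pure ASCII)
Linux with PowerShell 7.1:
Out-File - Default encoding is utf-8 without BOM (it looks like us-ascii when the content is pure ASCII)
Set-Content - Default encoding is utf-8 without BOM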
Out-file -append or >> can actually mix two encodings in the same file. Even if the file is originally ASCII or ANSI, it will add Unicode by default to the bottom of it. Add-content will check the encoding and match it before appending. Btw, export-csv defaults to ASCII (no accents), and set-content/add-content to ANSI.
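The mixing effect can be reproduced on any PowerShell version by naming the encodings explicitly, which mimics what the Windows PowerShell 5.1 defaults would do. This is a sketch with a made-up file name; the exact bytes depend on the platform's newline, so no byte-for-byte output is shown.

```powershell
$path = Join-Path ([IO.Path]::GetTempPath()) 'append-demo.txt'   # hypothetical file name

# First line: single-byte ASCII.
'abc' | Set-Content -Path $path -Encoding ascii

# Appended line: UTF-16LE, as Out-File -Append would use by default in Windows PowerShell 5.1.
'def' | Out-File -FilePath $path -Append -Encoding unicode

# Hex dump: the first part is one byte per character, the appended part carries 00 bytes.
$bytes = [IO.File]::ReadAllBytes($path)
($bytes | ForEach-Object { '{0:X2}' -f $_ }) -join ' '

Remove-Item $path
```

A file like this has no single valid encoding, which is why any reader shows part of it as garbage.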
TL;DR, use Set-Content as it's more consistent than Out-File.
Set-Content's default output stays ASCII-compatible across PowerShell versions (ANSI in Windows PowerShell 5.1, UTF-8 without BOM in PowerShell 6 and later).
Out-File, as #JagWireZ says, produces different encodings with the default settings, even on the same OS (Windows): the docs for PowerShell 5.1 and PowerShell 7.3 state that the default encoding changed from unicode to utf8NoBOM.
Some issues, such as malformed XML, arise from using Out-File. They can of course be fixed by setting the desired encoding, but it is easy to forget to do so and end up with problems.