Issues with specific characters in outfile - powershell

I have a script that merges files, and that works fine - but characters like åäö come out garbled in the output file.
Here is the complete script:
$startOfToday = (Get-Date).Date
Get-ChildItem "C:\TEST" -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday | ForEach-Object {gc $_; ""} |
Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"
In the input files it looks like this, for example:
Order ID 1
Order ID 2
This is för får
In the output, the last row comes out like this:
Order ID 1
Order ID 2
får för fär
Is there a way to make those characters appear in the output file as they appear in the input files?

The implication is that your input files are UTF-8-encoded without a BOM, which in Windows PowerShell are (mis)interpreted to be ANSI-encoded (using the system's active ANSI code page, such as Windows-1252).
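To make the misinterpretation concrete, here is a minimal sketch that decodes the same UTF-8 bytes twice: once as Windows-1252 and once as UTF-8. It assumes code page 1252 is the active ANSI code page (yours may differ) and builds the sample word from code points, so that the encoding of the script file itself cannot interfere.

```powershell
# Build 'får' from code points rather than a non-ASCII literal.
$text = 'f' + [char]0xE5 + 'r'

# The bytes a BOM-less UTF-8 file would contain for this word: 66 C3 A5 72
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes($text)

# Decode the same bytes as Windows-1252 (assumed ANSI code page) and as UTF-8.
$ansi    = [System.Text.Encoding]::GetEncoding(1252)
$misread = $ansi.GetString($utf8Bytes)
$correct = [System.Text.Encoding]::UTF8.GetString($utf8Bytes)

# Show the raw bytes and how many characters each decoding produced.
'bytes: ' + (($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' ')
"misread length: $($misread.Length), correct length: $($correct.Length)"
```

The two-byte UTF-8 sequence for å is decoded as two separate ANSI characters, which is why each accented letter turns into a pair of unrelated symbols.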
The solution is to tell gc (Get-Content) explicitly what encoding to use, via the -Encoding parameter:
Get-ChildItem C:\TEST -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday |
ForEach-Object { Get-Content -Encoding Utf8 $_; ""} |
Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"
Note that PowerShell never preserves the input encoding automatically; therefore, if you don't use -Encoding with Out-File, its default encoding is used, which is "Unicode" (UTF-16LE) in Windows PowerShell.
While PowerShell (Core) 7+ also doesn't preserve input encodings, it consistently defaults to BOM-less UTF-8, so your original code would work as-is there.
For more information about default encodings in Windows PowerShell vs. PowerShell (Core) 7+, see this answer.
Note: As AdminOfThings suggests in a comment, simply replacing Out-File with Set-Content in your original code also works in this particular case, because the same misinterpretation of the encoding is then performed on both in- and output, and the data is simply being passed through. This isn't a general solution, however, notably not if you need to process the strings in memory first, before saving them to a file.
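For the case where the strings are processed in memory before being saved, a minimal sketch is to name the encoding on both the read and the write side. The temporary folder, the file names in.txt and out.txt, and the upper-casing step are illustrative assumptions, not part of the original script.

```powershell
# Create a scratch folder with a BOM-less UTF-8 input file (hypothetical names).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.IO.Path]::GetRandomFileName())
New-Item -ItemType Directory -Path $dir | Out-Null
$inFile  = Join-Path $dir 'in.txt'
$outFile = Join-Path $dir 'out.txt'
$sample  = 'f' + [char]0xE5 + 'r'   # 'får', built from code points
$noBom   = New-Object System.Text.UTF8Encoding -ArgumentList $false
[System.IO.File]::WriteAllText($inFile, $sample, $noBom)

# Read as UTF-8, process in memory, write as UTF-8.
Get-Content -Encoding UTF8 -LiteralPath $inFile |
  ForEach-Object { $_.ToUpperInvariant() } |
  Set-Content -Encoding UTF8 -LiteralPath $outFile

# Read the result back with the same explicit encoding.
$roundTrip = Get-Content -Encoding UTF8 -LiteralPath $outFile
$roundTrip

Remove-Item -LiteralPath $dir -Recurse -Force
```

Note that in Windows PowerShell, -Encoding UTF8 on the write side produces a file with a BOM.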

Related

Powershell Get-Content failing spuriously

I have a fairly simple PS script that was working perfectly, and now has suddenly started giving errors. I have narrowed the problem portion to a couple of Get-Content statements. Here's what the affected part of the script looks like:
$pathSource = "D:\FileDirectory"
Set-Location -Path $pathSource
Get-Content -Encoding UTF8 -Path FilesA*.txt | Out-File -Encoding ASCII FilesA_Digest.txt
Get-Content -Encoding UTF8 -Path FilesB*.txt | Out-File -Encoding ASCII FilesB_Digest.txt
This part of the script gathers up a collection of like-named files and concatenates them into a single text file for uploading to an FTP site. The Get-Content/Out-File was needed as the original files are encoded incorrectly for the FTP site. The script was working perfectly, running once each night for several weeks. Now, it gets the following error when the Get-Content statements are reached:
Get-Content : A parameter cannot be found that matches parameter name 'Encoding'.
At D:\FileDirectory\Script.ps1
Environment is Windows Server 2016. I've tried different variations on the Get-Content parameters, but nothing has worked. I know there is a bug that affects network-mapped drives, but that's not the case here -- all files are local.
Any ideas/suggestions?
The only plausible explanation I can think of is that a custom Get-Content command that lacks an -Encoding parameter is shadowing (overriding) the standard Get-Content cmdlet in the PowerShell session that's executing your script.
To demonstrate:
# Define a custom Get-Content command (function) that accepts only
# a (positional) -Path parameter, not also -Encoding.
function Get-Content { [CmdletBinding()] param([string] $Path) }
# Now try to use Get-Content -Encoding
Get-Content -Encoding Utf8 FilesA*.txt
You'll see the same error message as in your question.
Use Get-Command Get-Content -All to see all commands named Get-Content, with the effective command listed first.
Then examine where any custom commands may come from; e.g., your $PROFILE script may contain one.
To rule out $PROFILE as the culprit, start PowerShell without loading the profile script and examine Get-Content then:
powershell -noprofile # Windows PowerShell
pwsh -noprofile # PowerShell Core
A simple way to rule out custom overrides ad hoc is to call a command by its module-qualified name:
Microsoft.Powershell.Management\Get-Content ...
You can determine a built-in cmdlet's module name of origin as follows:
PS> (Get-Command Get-Content -All)[-1].ModuleName
Microsoft.PowerShell.Management
In a pinch you can also infer the originating module name from the URL of the help topic:
Googling Get-Content will take you to https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/get-content - note how the cmdlet's module name, microsoft.powershell.management (case doesn't matter), is the penultimate (next to last) URI component.
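The checks above can be combined into a short diagnostic. This is only a sketch: the variable names are made up for illustration, and it reports whether anything other than a cmdlet is registered under the name Get-Content.

```powershell
# List every command named Get-Content, effective one first.
$commands = @(Get-Command -Name Get-Content -All)
$commands | Select-Object CommandType, Name, ModuleName

# Anything that is not a cmdlet (function, alias, script, ...) is a possible override.
$shadowing = @($commands | Where-Object { $_.CommandType -ne 'Cmdlet' })
if ($shadowing.Count -gt 0) {
  Write-Warning "Get-Content is shadowed by $($shadowing.Count) non-cmdlet command(s)."
}

# Build the module-qualified name of the real cmdlet for unambiguous calls.
$cmdlet = $commands | Where-Object { $_.CommandType -eq 'Cmdlet' } | Select-Object -First 1
$qualifiedName = '{0}\{1}' -f $cmdlet.ModuleName, $cmdlet.Name
$qualifiedName
```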
It seems to be an issue with the Out-File command. Can you please try the code below:
$pathSource = "D:\FileDirectory"
Set-Location -Path $pathSource
Get-Content -Encoding UTF8 -Path FilesA*.txt | Set-Content -Encoding ASCII -path FilesA_Digest.txt
Get-Content -Encoding UTF8 -Path FilesB*.txt | Set-Content -Encoding ASCII -path FilesB_Digest.txt
Well, I don't know why it failed, but I can say that I have completely re-written the script and now it works. I have to note that, given the errors that were occurring, I also don't know why it is now working.
I am using the exact same calls to the Get-Content commandlet, with the -Encoding parameter and the pipe to Out-File with its own -Encoding parameter. I am doing the exact same actions as the previous version of the script. The only part that is significantly different is the portion that performs the FTP transfer of the processed files. I'm now using only PowerShell to perform the transfer rather than CuteFTP and it all seems to be working correctly.
Thanks to everyone who contributed.
Cheers
Norm
Not sure if it helps, but I was running into the same with:
$n = ni '[hi]' -value 'some text'
gc $n -Encoding Byte
$f = ls *hi*
$f.where{$_.name -eq '[hi]'}.Delete()
It also looks like there's already a chain of SO questions about this known bug; see this answer.

Strip lines from text file based on content

I like to use one of the packaged HOSTS (MVPS,) files to protect myself from some of the nastier domains. Unfortunately, sometimes these files are a bit overzealous for me (blocking googleadsservices is a pain sometimes). I want an easy way to strip certain lines out of these files. In Linux I use:
cat hosts |grep -v <pattern> >hosts.new
And the file is rewritten minus the lines referencing the pattern I specified in the grep. So I just set it up to replace hosts with hosts.new on reboot and I'm done.
Is there an easy way to do this in PowerShell?
In PowerShell you'd do
(Get-Content hosts) -notmatch $pattern | Out-File hosts.new
or
(cat hosts) -notmatch $pattern > hosts.new
for short.
Of course, since Out-File (and with it the redirection operator) default to Unicode format, you may actually want to use Set-Content instead of Out-File:
(Get-Content hosts) -notmatch $pattern | Set-Content hosts.new
or
(gc hosts) -notmatch $pattern | sc hosts.new
And since the input file is read in a grouping expression (the parentheses around Get-Content hosts) you could actually write the output back to the source file:
(Get-Content hosts) -notmatch $pattern | Set-Content hosts
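One caveat with this approach, shown here as a small sketch with made-up host entries: -notmatch only acts as a filter when its left-hand side is an array. If the file has exactly one line, Get-Content returns a single string and -notmatch returns a Boolean instead, so wrapping the input in @(...) is a cheap safeguard.

```powershell
$pattern = 'ads\.example\.com'   # hypothetical pattern

# Array input: -notmatch returns the elements that do NOT match.
$many = @('127.0.0.1 localhost', '0.0.0.0 ads.example.com', '0.0.0.0 tracker.example.net')
$kept = @($many -notmatch $pattern)
$kept

# Scalar input: -notmatch returns $true or $false, not the line itself.
$single       = '127.0.0.1 localhost'
$scalarResult = $single -notmatch $pattern

# Forcing an array restores the filtering behavior.
$arrayResult = @($single) -notmatch $pattern
```

Applied to the commands above, that would read @(Get-Content hosts) -notmatch $pattern.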
To complement Ansgar Wiechers' helpful answer (which offers pragmatic and concise solutions based on reading the entire input file into memory up-front):
PowerShell's grep equivalent is the Select-String cmdlet and, just like grep, it directly accepts a filename argument (PSv3+ syntax):
Select-String -NotMatch <pattern> hosts | ForEach-Object Line | Set-Content hosts.new
Select-String -NotMatch <pattern> hosts is short for
Select-String -NotMatch -Pattern <pattern> -LiteralPath hosts and is the virtual equivalent of
grep -v <pattern> hosts
However, Select-String doesn't output strings, it outputs [Microsoft.PowerShell.Commands.MatchInfo] instances that wrap matching lines (stored in property .Line) along with metadata about the match.
ForEach-Object Line extracts just the matching lines (the value of property .Line) from these objects.
Set-Content hosts.new writes the matching lines to file hosts.new, using "ANSI" encoding in Windows PowerShell - i.e., it uses the legacy code page implied by the active system locale, typically a supranational 8-bit superset of ASCII - and UTF-8 encoding (without BOM) in PowerShell Core.
Use the -Encoding parameter to specify a different encoding.
>, by contrast (an effective alias of the Out-File cmdlet), creates:
UTF-16LE ("Unicode") files by default in Windows PowerShell.
UTF-8 files (without BOM) in PowerShell Core - in other words: in PowerShell Core, using > hosts.new in lieu of | Set-Content hosts.new will do.
Note: While both > / Out-File and Set-Content are suitable for sending string inputs to an output file, they are not generally suitable for sending other data types to a file for programmatic processing: > / Out-File output objects the way they would print to the console / terminal, which is a pretty-printed format meant for display, whereas Set-Content stringifies (simply put: calls .ToString() on) the input objects, which often results in loss of information.
For non-string data, consider a (more) structured data format such as XML (Export-CliXml), JSON (ConvertTo-Json) or CSV (Export-Csv).
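As a minimal sketch of the structured route, the following round-trips an object through JSON. The object shape (Address, HostName) and the file name entries-demo.json are assumptions chosen for illustration.

```powershell
# A sample object with two properties (hypothetical shape).
$entry = [pscustomobject]@{ Address = '0.0.0.0'; HostName = 'ads.example.com' }
$path  = Join-Path ([System.IO.Path]::GetTempPath()) 'entries-demo.json'

# Serialize to JSON text first, then write that text with an explicit encoding.
$entry | ConvertTo-Json | Set-Content -LiteralPath $path -Encoding UTF8

# Read the whole file as one string and rebuild the object.
$restored = Get-Content -LiteralPath $path -Raw -Encoding UTF8 | ConvertFrom-Json
$restored.HostName

Remove-Item -LiteralPath $path
```

Because ConvertTo-Json emits strings, Set-Content is merely saving text here, and the property structure survives the round trip.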

ANSI Encoding via PowerShell [duplicate]

In PowerShell, what's the difference between Out-File and Set-Content? Or Add-Content and Out-File -append?
I've found if I use both against the same file, the text is fully mojibaked.
(A minor second question: > is an alias for Out-File, right?)
Here's a summary of what I've deduced, after a few months experience with PowerShell, and some scientific experimentation. I never found any of this in the documentation :(
[Update: Much of this now appears to be better documented.]
Read and write locking
While Out-File is running, another application can read the log file.
While Set-Content is running, other applications cannot read the log file. Thus never use Set-Content to log long running commands.
Encoding
Out-File saves in the Unicode (UTF-16LE) encoding by default (though this can be specified), whereas Set-Content defaults to ASCII (US-ASCII) in PowerShell 3+ (this may also be specified). In earlier PowerShell versions, Set-Content wrote content in the Default (ANSI) encoding.
Editor's note: PowerShell as of version 5.1 still defaults to the culture-specific Default ("ANSI") encoding, despite what the documentation claims. If ASCII were the default, non-ASCII characters such as ü would be converted to literal ?, but that is not the case: 'ü' | Set-Content tmp.txt; (Get-Content tmp.txt) -eq '?' yields $False.
PS > $null | out-file outed.txt
PS > $null | set-content set.txt
PS > md5sum *
f3b25701fe362ec84616a93a45ce9998 *outed.txt
d41d8cd98f00b204e9800998ecf8427e *set.txt
This means the defaults of two commands are incompatible, and mixing them will corrupt text, so always specify an encoding.
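To see what an explicit encoding actually puts on disk, you can dump the raw bytes. The sketch below is illustrative: the file name bom-demo.txt and the helper ConvertTo-HexString are made up, and the bytes are read with [System.IO.File]::ReadAllBytes so that it behaves the same in Windows PowerShell and PowerShell 7+.

```powershell
# Hypothetical helper: render a byte array as space-separated hex.
function ConvertTo-HexString([byte[]] $Bytes) {
  ($Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
}

$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-demo.txt'
$text = [string][char]0xFC   # 'ü', built from its code point

# "Unicode" means UTF-16LE; the file starts with the BOM FF FE.
$text | Out-File -LiteralPath $path -Encoding Unicode
$utf16 = [System.IO.File]::ReadAllBytes($path)

# UTF-8 stores the same character as C3 BC (with or without a BOM, depending on the edition).
$text | Set-Content -LiteralPath $path -Encoding UTF8
$utf8 = [System.IO.File]::ReadAllBytes($path)

'UTF-16LE: ' + (ConvertTo-HexString $utf16)
'UTF-8:    ' + (ConvertTo-HexString $utf8)

Remove-Item -LiteralPath $path
```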
Formatting
As Bartek explained, Out-File saves the fancy formatting of the output, as seen in the terminal. So in a folder with two files, the command dir | out-file out.txt creates a file with 11 lines.
Whereas Set-Content saves a simpler representation. In that folder with two files, the command dir | set-content sc.txt creates a file with two lines. To emulate the output in the terminal:
PS > dir | ForEach-Object {$_.ToString()}
out.txt
sc.txt
I believe this formatting has a consequence for line breaks, but I can't describe it yet.
File creation
Set-Content doesn't reliably create an empty file when Out-File would:
In an empty folder, the command dir | out-file out.txt creates a file, while dir | set-content sc.txt does not.
Pipeline Variable
Set-Content takes the filename from the pipeline; allowing you to set a number of files' contents to some fixed value.
Out-File takes the data from the pipeline; updating a single file's content.
Parameters
Set-Content includes the following additional parameters:
Exclude
Filter
Include
PassThru
Stream
UseTransaction
Out-File includes the following additional parameters:
Append
NoClobber
Width
For more information about what those parameters are, please refer to help; e.g. get-help out-file -parameter append.
Out-File has the behavior of overwriting the output path unless the -NoClobber and/or the -Append flag is set. Add-Content will append content if the output path already exists by default (if it can). Both will create the file if one doesn't already exist.
Another interesting difference is that Add-Content will create an ASCII encoded file by default and Out-File will create a little endian unicode encoded file by default.
> is syntactic sugar for Out-File. It's Out-File with some pre-defined parameter settings.
Well, I would disagree... :)
Out-File has -Append (-NoClobber is there to avoid overwriting), which appends the way Add-Content does. But this is not the same beast.
command | Add-Content will use .ToString() method on input. Out-File will use default formatting.
so:
ls | Add-Content test.txt
and
ls | Out-File test.txt
will give you totally different results.
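A quick way to see the difference without writing any files is to compare the two string forms directly. In this sketch, Out-String is used as a stand-in for the formatted text that Out-File would write (an assumption made for illustration), and the temp folder is just a convenient item to inspect.

```powershell
# Any file system item will do; the temp folder always exists.
$item = Get-Item -LiteralPath ([System.IO.Path]::GetTempPath())

# Formatted, multi-line display text (stand-in for what Out-File writes).
$formatted = $item | Out-String

# Plain .ToString() result (what Add-Content / Set-Content write).
$stringified = $item.ToString()

# Count the non-blank lines of the formatted version.
$formattedLines = @($formatted -split "`r?`n" | Where-Object { $_.Trim() })
"formatted: $($formattedLines.Count) non-blank lines"
"stringified: $stringified"
```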
And no, '>' is not an alias, it's a redirection operator (same as in other shells). And it has a very serious limitation... It will cut lines the same way they are displayed. Out-File has a -Width parameter that helps you avoid this. Also, with redirection operators you can't decide what encoding to use.
HTH
Bartek
Set-Content supports -Encoding Byte, while Out-File does not.
So when you want to write binary data or result of Text.Encoding#GetBytes() to a file, you should use Set-Content.
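A minimal sketch of writing raw bytes follows. Note that -Encoding Byte is specific to Windows PowerShell; PowerShell 6 and later use the -AsByteStream switch instead, so the example branches on the version. The file name bytes-demo.bin is an assumption.

```powershell
# Bytes to write: the result of Encoding.GetBytes(), as mentioned above.
$bytes = [System.Text.Encoding]::UTF8.GetBytes('abc')
$path  = Join-Path ([System.IO.Path]::GetTempPath()) 'bytes-demo.bin'

if ($PSVersionTable.PSVersion.Major -ge 6) {
  # PowerShell 6+: byte output is requested with -AsByteStream.
  Set-Content -LiteralPath $path -Value $bytes -AsByteStream
} else {
  # Windows PowerShell: byte output is requested with -Encoding Byte.
  Set-Content -LiteralPath $path -Value $bytes -Encoding Byte
}

# Verify what ended up on disk.
$written = [System.IO.File]::ReadAllBytes($path)
$written -join ','

Remove-Item -LiteralPath $path
```

If the script has to run unchanged in both editions, [System.IO.File]::WriteAllBytes($path, $bytes) is an alternative that avoids the version check.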
Wanted to add about difference on encoding:
Windows with PowerShell 5.1:
Out-File - Default encoding is utf-16le
Set-Content - Default encoding is us-ascii
Linux with PowerShell 7.1:
Out-File - Default encoding is us-ascii
Set-Content - Default encoding is us-ascii
Out-file -append or >> can actually mix two encodings in the same file. Even if the file is originally ASCII or ANSI, it will add Unicode by default to the bottom of it. Add-content will check the encoding and match it before appending. Btw, export-csv defaults to ASCII (no accents), and set-content/add-content to ANSI.
TL;DR: use Set-Content, as it's more consistent than Out-File.
Set-Content behavior is the same across different PowerShell versions.
Out-File, as #JagWireZ says, produces different encodings with the default settings, even on the same OS (Windows): the docs for PowerShell 5.1 and PowerShell 7.3 state that the default encoding changed from unicode to utf8NoBOM.
Some issues, like malformed XML, arise from using Out-File. That could of course be fixed by setting the desired encoding; however, it's easy to forget to set the encoding and end up with issues.

Powershell mangling ascii text

I'm getting extra characters and lines when trying to modify hosts files. For example, this select string does not take anything out, but the two files are different:
get-content -Encoding ascii C:\Windows\system32\drivers\etc\hosts |
select-string -Encoding ascii -notmatch "thereisnolinelikethis" |
out-file -Encoding ascii c:\temp\testfile
PS C:\temp> (get-filehash C:\windows\system32\drivers\etc\hosts).hash
C54C246D2941F02083B85CE2774D271BD574F905BABE030CC1BB41A479A9420E
PS C:\temp> (Get-FileHash C:\temp\testfile).hash
AC6A1134C0892AD3C5530E58759A09C73D8E0E818EC867C9203B9B54E4B83566
I can confirm that your commands do inexplicably result in extra line breaks in the output file, at the start and at the end. PowerShell also converts the tabs in the original file into four spaces.
While I cannot explain why, these commands do the same thing without these issues:
Get-Content -Path C:\Windows\System32\drivers\etc\hosts -Encoding Ascii |
Where-Object { -not $_.Contains("thereisnolinelikethis") } |
Out-File -FilePath "c:\temp\testfile" -Encoding Ascii
I think this is more of an issue with PowerShell's F&O (formatting & output) engine. Keep in mind that Select-String outputs a rich object called MatchInfo. When that object reaches the end of the output it needs to be rendered to a string. I think it is that rendering/formatting that injects the extra line. One of the properties on MatchInfo is the line that was matched (or notmatched). If you pass just the Line property down the pipe, it seems to work better (hashes match):
Get-Content C:\Windows\system32\drivers\etc\hosts |
Select-String -notmatch "thereisnolinelikethis" |
Foreach {$_.Line} |
Out-File -Encoding ascii c:\temp\testfile
BTW you only need to specify ASCII encoding when outputting back to the file. Everywhere else in PowerShell, just let the string flow as Unicode.
All that said, I would use Where-Object instead of Select-String for this scenario. Where-Object is a filtering command which is what you want. Select-String takes input of one form (string) and converts it to a different object (MatchInfo).
Out-File adds a trailing NewLine ("`r`n") to the testfile file.
C:\Windows\System32\drivers\etc\hosts does not contain a trailing newline out of the box, which is why you get a different FileHash
If you open the files with a StreamReader, you'll see that the underlying stream differs in length (due to the trailing newline in the new file):
PS C:\> $Hosts = [System.IO.StreamReader]"C:\Windows\System32\drivers\etc\hosts"
PS C:\> $Tests = [System.IO.StreamReader]"C:\temp\testfile"
PS C:\> $Hosts.BaseStream.Length
822
PS C:\> $Tests.BaseStream.Length
824
PS C:\> $Tests.BaseStream.Position = 822; $Tests.Read(); $Tests.Read()
13
10
ASCII characters 13 (0x0D) and 10 (0x0A) correspond to [System.Environment]::NewLine or CR+LF
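If the goal is a file without that trailing newline, one option in PowerShell 5.0 and later is the -NoNewline switch. Since it suppresses the newlines between the input items as well, the lines are joined explicitly first. This is a sketch with made-up content and file name:

```powershell
$path  = Join-Path ([System.IO.Path]::GetTempPath()) 'newline-demo.txt'
$lines = 'line 1', 'line 2'
$nl    = [System.Environment]::NewLine

# Default behavior: a newline is written after every line, including the last one.
Set-Content -LiteralPath $path -Value $lines -Encoding Ascii
$withTrailing = (Get-Item -LiteralPath $path).Length

# -NoNewline: join the lines yourself, so that only the separators are written.
Set-Content -LiteralPath $path -Value ($lines -join $nl) -NoNewline -Encoding Ascii
$withoutTrailing = (Get-Item -LiteralPath $path).Length

# The size difference is exactly one newline sequence.
"with trailing newline: $withTrailing bytes; without: $withoutTrailing bytes"

Remove-Item -LiteralPath $path
```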
