How to use PowerShell to modify a file without changing encoding

I've written a script to make lots of regular expression modifications to a file and overwrite the file with the changes. However I have just noticed all the files are being written with UTF-16LE encoding. I realise that I can specify the encoding for the Out-File cmdlet, but I'm not confident that all the input files are the same encoding. Can I get the encoding from Get-Content somehow, or should I be using some other method of I/O? An outline of the script is below
Get-ChildItem *.aspx -Recurse| ForEach-Object {
MakeWritable($_)
(Get-Content -raw $_) | FixUp($_.FullName)
}
Filter FixUp($filename)
{
DoReplacements($_) | Out-File $filename
}

Related

PowerShell : Set-Content Replace word and Encoding UTF8 without BOM

I'd like to escape \ to \\ in a CSV file before uploading it to Redshift.
The following simple PowerShell script replaces $TargetWord (\) with $ReplaceWord (\\) as expected, but it exports UTF-8 with a BOM, which sometimes causes a Redshift COPY error.
Any advice would be appreciated to improve it. Thank you in advance.
Exp_Escape.ps1
Param(
[string]$StrExpFile,
[string]$TargetWord,
[string]$ReplaceWord
)
# $(Get-Content "$StrExpFile").replace($TargetWord,$ReplaceWord) | Set-Content -Encoding UTF8 "$StrExpFile"
In PowerShell (Core) 7+, you would get BOM-less UTF-8 files by default; -Encoding utf8 and -Encoding utf8NoBom express that default explicitly; to use a BOM, -Encoding utf8BOM is needed.
In Windows PowerShell, unfortunately, you must use a workaround to get BOM-less UTF-8, because -Encoding utf8 only produces UTF-8 files with BOM (and no other utf8-related values are supported).
The workaround requires combining Out-String with New-Item, which (curiously) creates BOM-less UTF-8 files by default even in Windows PowerShell:
Param(
[string]$StrExpFile,
[string]$TargetWord,
[string]$ReplaceWord
)
$null =
New-Item -Force $StrExpFile -Value (
(Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) | Out-String
)
Note:
$null = is needed to discard the output object that New-Item emits (which is a file-info object describing the newly created file).
-Force is needed in order to quietly overwrite an existing file by the same name (as Set-Content and Out-File do by default).
The -Value argument must be a single (multi-line) string to write to the file, which is what Out-String ensures.
Caveats:
For non-string input objects, Out-String creates the same rich for-display representations as Out-File and as you would see in the console by default.
New-Item itself does not append a trailing newline when it writes the string to the file, but Out-String curiously does; while this happens to be handy here, it is generally problematic, as discussed in GitHub issue #14444.
The alternative to using Out-String is to create the multi-line string manually, which is a bit more cumbersome ("`n" is used to create LF-only newlines, which PowerShell and most programs happily accept even on Windows; for platform-native newlines (CRLF) on Windows, use [Environment]::NewLine instead):
$null =
New-Item -Force $StrExpFile -Value (
((Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) -join "`n") + "`n"
)
Since the entire file content must be passed as an argument,[1] it must fit into memory as a whole; the convenience function discussed next avoids this problem.
For a convenience wrapper function around Out-File for use in Windows PowerShell that creates BOM-less UTF-8 files, see this answer.
Alternative, with direct use of .NET APIs:
.NET APIs produce BOM-less UTF-8 files by default.
However, because .NET's working directory usually differs from PowerShell's, full file paths must always be used, which requires more effort:
# In order for .NET API calls to work as expected,
# file paths must be expressed as *full, native* paths.
$OutDir = Split-Path -Parent $StrExpFile
if ($OutDir -eq '') { $OutDir = '.' }
$strExpFileFullPath = Join-Path (Convert-Path $OutDir) (Split-Path -Leaf $StrExpFile)
# Note: .NET APIs create BOM-less UTF-8 files *by default*
[IO.File]::WriteAllLines(
$strExpFileFullPath,
(Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord)
)
The above uses the System.IO.File.WriteAllLines method.
[1] Note that while New-Item technically supports receiving the content to write to the file via the pipeline, it unfortunately writes each input object to the target file, successively, with only the last one ending up in the file.
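The memory caveat mentioned above can be sidestepped by streaming lines through a .NET StreamWriter. The following is only a minimal sketch under stated assumptions, not the wrapper function the answer links to: the function name Out-Utf8NoBomFile and its parameter names are made up for illustration, and error handling (such as closing the writer when the pipeline fails midway) is left out.

```powershell
# Hypothetical helper (not from the original answer): streams pipeline input
# to a BOM-less UTF-8 file, one line at a time.
function Out-Utf8NoBomFile {
    param(
        [Parameter(Mandatory, Position = 0)] [string] $LiteralPath,
        [Parameter(ValueFromPipeline)] [string] $InputObject
    )
    begin {
        # Resolve the path relative to PowerShell's current location,
        # because .NET's working directory usually differs.
        $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($LiteralPath)
        # $false = overwrite; UTF8Encoding($false) = do not emit a BOM.
        $writer = [IO.StreamWriter]::new($fullPath, $false, [Text.UTF8Encoding]::new($false))
    }
    process { $writer.WriteLine($InputObject) }
    end { $writer.Dispose() }
}
```

Because the target file is opened before any input arrives, rewriting a file in place should read it fully first, for example $lines = (Get-Content $StrExpFile).Replace($TargetWord, $ReplaceWord) followed by $lines | Out-Utf8NoBomFile $StrExpFile. True streaming only helps when the input and output files differ.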

Issues with specific characters in outfile

I have a script that merges files, and it works fine, but characters like åäö look wrong in the output file.
Here is the complete script:
$startOfToday = (Get-Date).Date
Get-ChildItem "C:\TEST" -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday | ForEach-Object {gc $_; ""} |
Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"
In the files it looks like this, for example:
Order ID 1
Order ID 2
This is för får
In the output it gets like this for the last row
Order ID 1
Order ID 2
This is fÃ¶r fÃ¥r
is there a way to make those characters appear in the output file as they appear in the first file?
The implication is that your input files are UTF-8-encoded without a BOM, which in Windows PowerShell are (mis)interpreted to be ANSI-encoded (using the system's active ANSI code page, such as Windows-1252).
The solution is to tell gc (Get-Content) explicitly what encoding to use, via the -Encoding parameter:
Get-ChildItem C:\TEST -include *.* -Recurse |
Where-Object LastWriteTime -gt $startOfToday |
ForEach-Object { Get-Content -Encoding Utf8 $_; ""} |
Out-File "C:\$(Get-Date -Format 'yyyy/mm/dd').txt"
Note that PowerShell never preserves the input encoding automatically. Therefore, in the absence of -Encoding with Out-File, its default encoding is used, which is "Unicode" (UTF-16LE) in Windows PowerShell.
While PowerShell (Core) 7+ also doesn't preserve input encodings, it consistently defaults to BOM-less UTF-8, so your original code would work as-is there.
For more information about default encodings in Windows PowerShell vs. PowerShell (Core) 7+, see this answer.
Note: As AdminOfThings suggests in a comment, simply replacing Out-File with Set-Content in your original code also works in this particular case, because the same misinterpretation of the encoding is then performed on both in- and output, and the data is simply being passed through. This isn't a general solution, however, notably not if you need to process the strings in memory first, before saving them to a file.
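To see why the misinterpretation described above garbles the text, it can help to decode the same UTF-8 bytes twice, once with a single-byte encoding and once as UTF-8. This is a small illustrative sketch; it assumes ISO-8859-1 as a stand-in for Windows-1252 (chosen because it is available everywhere, and the two agree on the bytes involved here).

```powershell
# Build the word from code points so this script's own encoding doesn't matter.
$original = 'f' + [char]0xF6 + 'r'            # för
$utf8Bytes = [Text.Encoding]::UTF8.GetBytes($original)

# Misread: each UTF-8 byte is treated as one character.
# ISO-8859-1 stands in for Windows-1252 (assumption; both map 0xC3 and 0xB6 alike).
$misread = [Text.Encoding]::GetEncoding('iso-8859-1').GetString($utf8Bytes)

# Correct: the two bytes 0xC3 0xB6 are decoded together as one character.
$correct = [Text.Encoding]::UTF8.GetString($utf8Bytes)

$misread   # fÃ¶r
$correct   # för
```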

PowerShell Get-ChildItem: filtering csv files and -Recurse not working

I created a short PowerShell script to convert csv files from Unicode to UTF-8 encoding. My script outputs new files with the original file name preceded by UTF8. I'm running into two issues:
I'm trying to only run the PowerShell script on csv files. Currently the script runs on every file in the directory, including the PowerShell script itself (it outputs a new file called UTF8pshell_script if the script was called pshell_script, for example). The other methods I've tried for restricting the script to csv files just end up making the script do nothing.
I'm trying to run the script on sub-directories. The first issue is that output files created from csv files in subdirectories have no content inside them whatsoever. If the script is run in the same directory as the csv file, this problem does not arise. This is not crucial, but I am also uncertain how to get output files created from those in subdirectories to be written to the same subdirectories (currently they are written to the main directory where the PowerShell script is).
Get-Content -Encoding Unicode $_ | Out-File -Encoding UTF8
Get-ChildItem -Recurse | ForEach-Object {Get-Content -Encoding Unicode $_ | Out-File -Encoding UTF8 "UTF8$_"}
The desired output is the PowerShell script running on only csv files, and writing the output files to the same subdirectories as the files they were created from.
Get-ChildItem takes a -Filter parameter, which for files is the simple wildcard pattern. This will allow you to restrict your cmdlet to CSV files only:
Get-ChildItem -Filter *.csv
To process subdirectories, you may also use the -Recurse switch
Get-ChildItem -Filter *.csv -Recurse
Now, I'm never quite sure how $_ changes as you pass different objects through the pipe, so I'm probably not doing the next steps the most efficient way - but it will be clear what I'm trying to do:
Each file object that we find needs to be processed as follows:
Dissect it into a path and a filename: $filepath = $_.PSParentPath; $filename = $_.PSChildName
Load up the CSV by its full path: Import-CSV -Path $_.FullName
Output the new CSV with the proper encoding: Export-CSV -Path ("{0}\UTF8{1}" -f $filepath,$filename) -Encoding UTF8
So, we put it all together:
Get-ChildItem -Filter *.csv -Recurse -Exclude UTF8* | ForEach-Object {
$filepath = $_.PSParentPath
$filename = $_.PSChildName
# Use .FullName: in Windows PowerShell, $_ alone can stringify to just the file name
Import-CSV -Path $_.FullName |
Export-CSV -Encoding UTF8 -Path ("{0}\UTF8{1}" -f $filepath,$filename) -NoTypeInformation
}
The -Exclude UTF8* in the Get-ChildItem ensures that when you create a file, it doesn't get picked up later and re-processed. The -NoTypeInformation on the Export-CSV compensates for a stupidity built in to the cmdlet that causes an extra line with a meaningless object type name at the beginning of the file.
Depending on the original encoding (and presence of a BOM) you might have to specify an encoding also on the input side.
ForEach($Csv in (Get-ChildItem -Filter *.csv -Recurse -Exclude UTF8*)){
(Get-Content $Csv.FullName -raw) |
Set-Content -Path {Join-Path $Csv.Directory ("UTF8"+$Csv.Name)} -Encoding UTF8
}
LotPings beat me to this by 10 minutes with a virtually identical answer, but I'm leaving this for the 'passing an empty file to the pipeline' bit that I have. I also realize after the fact that you don't need a pipeline variable for that same reason, as you only need it if you pass things through the pipeline within the loop.
If all you want to do is change the encoding I would use a ForEach($x in $y){} loop, or a ForEach-Object{} loop with a PipelineVariable on the Get-ChildItem. I'll show that since I think pipeline variables are under used. I would also not read the file and pipe it to something, since if the file is empty you won't create a new file as nothing is passed down the pipeline.
Get-ChildItem *.csv -Recurse -PipelineVariable File | ForEach-Object{
Set-Content -Value (Get-Content $File.FullName -Encoding Unicode) -Path (Join-Path $File.DirectoryName "UTF8$($File.Name)") -Encoding UTF8
}
If you specify the file extension at the end of the Get-ChildItem path, it will get only the files with the .csv extension.
By specifying the full file path in Out-File, the output is sent to the specified directory.
Get-ChildItem -Path C:\folder\*.csv -Recurse | ForEach-Object {Get-Content -Encoding Unicode $_.FullName | Out-File -FilePath "C:\Folder\UTF8$($_.Name)" -Encoding UTF8}

Keep Same Encoding With Set-Content Multiple Files in PowerShell

I'm attempting to write a script to be used to migrate an application from server to server and/or from one drive letter to another drive letter. My goal is to copy the directory from one location, move it to another, and then run a script to edit all instances of the old hostname, IP address, and drive letter to reflect the new hostname, IP address, and drive letter on the new server. This appears to do exactly that:
ForEach($File in (Get-ChildItem $path\* -Include *.xml,*.config -Recurse)){
(Get-Content $File.FullName -Raw) -replace [RegEx]::Escape($oldhost),$newhost `
-replace [RegEx]::Escape($oldip),$newip `
-replace "$olddriveletter(?=:\\Application)",$newDriveLetter |
Set-Content $File.FullName -NoNewLine
}
The one problem I am having is that the files all have different types of encoding. Some ANSI, some UTF-8, some Unicode, etc. When I run the script, it saves everything as ANSI and then my application fails to work. I know how to add the encoding parameter, but is there any way to keep the same encoding on each individual file, without writing out a script specifying each individual file in the directory and the encoding that each individual file has?
That would be difficult. It's too bad that Get-Content doesn't pass along an encoding property. Here's a script that tries to detect the encoding if there's a signature (BOM). Maybe you can just run it first and check them all. But some Windows files are Unicode without a BOM. At least XML files can declare their encoding, which you can check with: get-childitem *.xml | select-string encoding
There might be a better way to load XML files; see the bottom answer of "Powershell: Setting Encoding for Get-Content Pipeline".
# encoding.ps1
# https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
param([Parameter(ValueFromPipeline=$True)] $filename)
process {
$reader = [IO.StreamReader]::new($filename, [Text.Encoding]::default,$true)
$peek = $reader.Peek()
$encoding = $reader.currentencoding
$reader.close()
[pscustomobject]@{Name=split-path $filename -leaf
BodyName=$encoding.BodyName
EncodingName=$encoding.EncodingName}
}
# end encoding.ps1
PS C:\users\me> get-childitem chinese16.txt | encoding
Name BodyName EncodingName
---- -------- ------------
chinese16.txt utf-16 Unicode
Something like this will use the encoding indicated in the xml file, even if it didn't truly match beforehand. (This also makes the xml pretty.)
PS C:\users\me> [xml]$xml = get-content file.xml
PS C:\users\me> $xml.save("$pwd\file.xml")
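Building on the BOM-detection idea in the script above, a read-modify-write round trip that reuses the detected encoding could be sketched as follows. This is a hedged illustration, not a complete solution: the function name Edit-FilePreservingEncoding and its parameters are hypothetical, detection relies only on a BOM, and files without one are read and written with the -Fallback encoding, which may not be what the file really uses.

```powershell
# Hypothetical helper: rewrites a file in place, keeping the encoding that
# StreamReader detects from the BOM. BOM-less files use $Fallback instead.
function Edit-FilePreservingEncoding {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [scriptblock] $Transform,
        [Text.Encoding] $Fallback = [Text.Encoding]::Default
    )
    # .NET needs a full, native path.
    $fullPath = Convert-Path -LiteralPath $Path
    # $true = detect the encoding from byte order marks.
    $reader = [IO.StreamReader]::new($fullPath, $Fallback, $true)
    try {
        $text = $reader.ReadToEnd()
        # Only meaningful after the first read.
        $encoding = $reader.CurrentEncoding
    }
    finally {
        $reader.Dispose()
    }
    # The transform receives the whole file text as its first argument.
    [string] $newText = & $Transform $text
    [IO.File]::WriteAllText($fullPath, $newText, $encoding)
}
```

A call might look like Edit-FilePreservingEncoding -Path $File.FullName -Transform { param($text) $text -replace [RegEx]::Escape($oldhost), $newhost }, with the other replacements chained inside the script block.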
Use the file.exe from the git binaries to find out the encoding.
Then, add the encoding parameter to the set-content line with if else statements to meet your needs.
ForEach($File in (Get-ChildItem $path\*)){
    $FullName = $File.FullName
    # Parenthesize Get-Content so that -replace is applied to the file text
    $Content = (Get-Content $FullName -Raw) -replace [RegEx]::Escape($oldhost),$newhost `
        -replace [RegEx]::Escape($oldip),$newip `
        -replace "$olddriveletter(?=:\\Application)",$newDriveLetter
    # -b (brief) prints only the encoding name, e.g. utf-8 or us-ascii
    $Encoding = file -b --mime-encoding $FullName
    Write-Host "$FullName - $Encoding"
    if(-NOT ($Encoding -like "*utf*")){
        Set-Content -Path $FullName -Value $Content -NoNewLine -Encoding UTF8
    }
    else {
        Set-Content -Path $FullName -Value $Content -NoNewLine
    }
}
Reference:
https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/set-content
http://gnuwin32.sourceforge.net/packages/file.htm

PowerShell Set-Content and Out-File - what is the difference?

In PowerShell, what's the difference between Out-File and Set-Content? Or Add-Content and Out-File -append?
I've found if I use both against the same file, the text is fully mojibaked.
(A minor second question: > is an alias for Out-File, right?)
Here's a summary of what I've deduced, after a few months experience with PowerShell, and some scientific experimentation. I never found any of this in the documentation :(
[Update: Much of this now appears to be better documented.]
Read and write locking
While Out-File is running, another application can read the log file.
While Set-Content is running, other applications cannot read the log file. Thus never use Set-Content to log long running commands.
Encoding
Out-File saves in the Unicode (UTF-16LE) encoding by default (though this can be specified), whereas Set-Content defaults to ASCII (US-ASCII) in PowerShell 3+ (this may also be specified). In earlier PowerShell versions, Set-Content wrote content in the Default (ANSI) encoding.
Editor's note: PowerShell as of version 5.1 still defaults to the culture-specific Default ("ANSI") encoding, despite what the documentation claims. If ASCII were the default, non-ASCII characters such as ü would be converted to literal ?, but that is not the case: 'ü' | Set-Content tmp.txt; (Get-Content tmp.txt) -eq '?' yields $False.
PS > $null | out-file outed.txt
PS > $null | set-content set.txt
PS > md5sum *
f3b25701fe362ec84616a93a45ce9998 *outed.txt
d41d8cd98f00b204e9800998ecf8427e *set.txt
This means the defaults of two commands are incompatible, and mixing them will corrupt text, so always specify an encoding.
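One way to check what a given PowerShell version actually wrote is to look at the first few bytes of each file: a leading FF FE is the UTF-16LE byte order mark, and EF BB BF is the UTF-8 one. The helper below is a small sketch; the name Get-LeadingBytes and the temp-file names are made up for illustration, and the results depend on the PowerShell version, so none are shown.

```powershell
# Hypothetical helper: returns the first bytes of a file as a hex string.
function Get-LeadingBytes {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [int] $Count = 4
    )
    $bytes = [IO.File]::ReadAllBytes((Convert-Path -LiteralPath $Path))
    $take = [Math]::Min($Count, $bytes.Length)
    if ($take -eq 0) { return '' }
    ($bytes[0..($take - 1)] | ForEach-Object { $_.ToString('X2') }) -join ' '
}

# Write the same text with both cmdlets, using their defaults, and compare.
$dir = [IO.Path]::GetTempPath()
$outed = Join-Path $dir 'outed.txt'
$set = Join-Path $dir 'set.txt'
'abc' | Out-File -FilePath $outed
'abc' | Set-Content -Path $set
Get-LeadingBytes $outed
Get-LeadingBytes $set
```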
Formatting
As Bartek explained, Out-File saves the fancy formatting of the output, as seen in the terminal. So in a folder with two files, the command dir | out-file out.txt creates a file with 11 lines.
Whereas Set-Content saves a simpler representation. In that folder with two files, the command dir | set-content sc.txt creates a file with two lines. To emulate the output in the terminal:
PS > dir | ForEach-Object {$_.ToString()}
out.txt
sc.txt
I believe this formatting has a consequence for line breaks, but I can't describe it yet.
File creation
Set-Content doesn't reliably create an empty file when Out-File would:
In an empty folder, the command dir | out-file out.txt creates a file, while dir | set-content sc.txt does not.
Pipeline Variable
Set-Content takes the filename from the pipeline; allowing you to set a number of files' contents to some fixed value.
Out-File takes the data from the pipeline, updating a single file's content.
Parameters
Set-Content includes the following additional parameters:
Exclude
Filter
Include
PassThru
Stream
UseTransaction
Out-File includes the following additional parameters:
Append
NoClobber
Width
For more information about what those parameters are, please refer to help; e.g. get-help out-file -parameter append.
Out-File has the behavior of overwriting the output path unless the -NoClobber and/or the -Append flag is set. Add-Content will append content if the output path already exists by default (if it can). Both will create the file if one doesn't already exist.
Another interesting difference is that Add-Content will create an ASCII encoded file by default and Out-File will create a little endian unicode encoded file by default.
> is syntactic sugar for Out-File. It's Out-File with some pre-defined parameter settings.
Well, I would disagree... :)
Out-File has -Append (-NoClobber is there to avoid overwriting), which appends much like Add-Content does. But this is not the same beast.
command | Add-Content will use .ToString() method on input. Out-File will use default formatting.
so:
ls | Add-Content test.txt
and
ls | Out-File test.txt
will give you totally different results.
And no, '>' is not an alias, it's a redirection operator (same as in other shells). And it has a very serious limitation: it will cut lines the same way they are displayed. Out-File has a -Width parameter that helps you avoid this. Also, with redirection operators you can't decide what encoding to use.
HTH
Bartek
Set-Content supports -Encoding Byte, while Out-File does not.
So when you want to write binary data or result of Text.Encoding#GetBytes() to a file, you should use Set-Content.
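As a sketch of the byte-writing case: -Encoding Byte is the Windows PowerShell spelling, while PowerShell 6 and later use the -AsByteStream switch instead, so a script meant to run on both needs a version check. The file name below is a placeholder chosen for illustration.

```powershell
# Bytes to write, here the UTF-8 encoding of a short string.
$bytes = [Text.Encoding]::UTF8.GetBytes('text')
# Placeholder target path for this example.
$target = Join-Path ([IO.Path]::GetTempPath()) 'bytes-demo.bin'

if ($PSVersionTable.PSVersion.Major -ge 6) {
    # PowerShell 6+ replaced -Encoding Byte with -AsByteStream.
    Set-Content -LiteralPath $target -Value $bytes -AsByteStream
}
else {
    # Windows PowerShell 5.1 and earlier.
    Set-Content -LiteralPath $target -Value $bytes -Encoding Byte
}
```

When no cmdlet features are needed, [IO.File]::WriteAllBytes with a full path does the same thing in every version.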
Wanted to add about difference on encoding:
Windows with PowerShell 5.1:
Out-File - Default encoding is utf-16le
Set-Content - Default encoding is us-ascii
Linux with PowerShell 7.1:
Out-File - Default encoding is us-ascii
Set-Content - Default encoding is us-ascii
Out-file -append or >> can actually mix two encodings in the same file. Even if the file is originally ASCII or ANSI, it will add Unicode by default to the bottom of it. Add-content will check the encoding and match it before appending. Btw, export-csv defaults to ASCII (no accents), and set-content/add-content to ANSI.
TL;DR: use Set-Content, as it's more consistent than Out-File.
Set-Content behavior is the same across different PowerShell versions.
Out-File, as @JagWireZ says, produces different encodings with the default settings, even on the same OS (Windows): the docs for PowerShell 5.1 and PowerShell 7.3 state that the default encoding changed from unicode to utf8NoBOM.
Some issues, like malformed XML, arise from using Out-File. They can of course be fixed by setting the desired encoding, but it's easy to forget to do so and end up with issues.