Encoding issue in PowerShell search and replace

I'm running a PowerShell script on XML files recursively to search and replace text. The code is working fine for searching and replacing the text. However, certain files contain text in other languages, like fréquentes, which gets changed to fréquentes after running the script. I've been using UTF-8 encoding in the script. Any pointers on how to retain the encoding?
$content | ForEach-Object { $_ -replace 'test1', 'testing' `
                               -replace 'test2', 'testing' } | Out-File $file.FullName -Encoding utf8

You seem to be ignoring the XML file's encoding, which appears to be Latin-1. XML files specify their encoding at the start (or, if they don't, they will be autodetected as UTF-8, UTF-16, or UTF-32):
<?xml version='1.0' encoding='utf-8'?>
So it seems to me like you read the content with the correct encoding, but write the file in UTF-8 which doesn't match the declared one.
You could use the XML APIs to change the file, which may be preferable, or simply change your Out-File to
Out-File -Encoding Default
However, that can cause the encoding to differ between computers, so be careful with it. I pretty much only use it for files I know are in the system's legacy codepage, or for quick one-off scripts.
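If you go the XML-API route instead, a minimal sketch might look like this (the recursion and the replacement patterns come from the question; replacing only text nodes is an assumption about where the target strings live). XmlDocument.Save honors the encoding declared in the XML prolog:
Get-ChildItem -Recurse -Filter *.xml | ForEach-Object {
    $doc = New-Object System.Xml.XmlDocument
    $doc.PreserveWhitespace = $true
    $doc.Load($_.FullName)
    # Apply the replacements to text nodes only
    foreach ($node in $doc.SelectNodes('//text()')) {
        $node.Value = $node.Value -replace 'test1', 'testing' -replace 'test2', 'testing'
    }
    # Save() writes the file back using the declared encoding
    $doc.Save($_.FullName)
}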

Related

Powershell Out-file special characters

I have a script that processes data from files and writes the result, based on a condition, to a .txt file. The given data are strings with words like "Distribución" or "México". When processed, special characters like "é" and "ó" come out broken (the typical white square or question mark).
How can I encode the output file to make it work with those characters? I tried UTF-8, and UTF-8 without BOM, but it doesn't work. Here is the file-writing line:
...| Out-file -encoding XXX .\result.txt
For XXX I tried ASCII and UTF-8; nothing works :/
Out-File will always add a BOM. It's a particularly annoying "feature" of that cmdlet. Unfortunately, to my knowledge, there is no quick way to save a file as UTF-8 WITHOUT a BOM in PowerShell. You can, however, leverage .NET to do this. This isn't really production-ready, but here's a quick example:
$outputPath = "D:\temp.txt"
$data = "Distribución or México"
[System.IO.File]::WriteAllLines($outputPath, $data)
Wrap it in a function and/or module to make it reusable. Of course, you can take more control over the file encoding with .NET too.
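For instance (an illustrative sketch, not part of the original answer), you can pass an explicit encoding to WriteAllLines; UTF8Encoding($false) means UTF-8 without a BOM:
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)  # $false = suppress the BOM
[System.IO.File]::WriteAllLines($outputPath, $data, $utf8NoBom)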

How to expand file content with powershell

I want to do this :
$content = get-content "test.html"
$template = get-content "template.html"
$template | out-file "out.html"
where template.html contains
<html>
<head>
</head>
<body>
$content
</body>
</html>
and test.html contains:
<h1>Test Expand</h1>
<div>Hello</div>
I get weird characters in the first two characters of out.html:
��
and the content is not expanded.
How can I fix this?
To complement Mathias R. Jessen's helpful answer with a solution that:
is more efficient.
ensures that the input files are read as UTF-8, even if they don't have a (pseudo-)BOM (byte-order mark).
avoids the "weird character" problem altogether by writing a UTF-8-encoded output file without that pseudo-BOM.
# Explicitly read the input files as UTF-8, as a whole.
$content = get-content -raw -encoding utf8 test.html
$template = get-content -raw -encoding utf8 template.html
# Write to output file using UTF-8 encoding *without a BOM*.
[IO.File]::WriteAllText(
"$PWD/out.html",
$ExecutionContext.InvokeCommand.ExpandString($template)
)
get-content -raw (PSv3+) reads the files in as a whole, into a single string (instead of an array of strings, line by line), which, while more memory-intensive, is faster. With HTML files, memory usage shouldn't be a concern.
An additional advantage of reading the files in full is that if the template were to contain multi-line subexpressions ($(...)), the expansion would still function correctly.
get-content -encoding utf8 ensures that the input files are interpreted as using character encoding UTF-8, as is typical in the web world nowadays.
This is crucial, given that UTF-8-encoded HTML files normally do not have the 3-byte pseudo-BOM that PowerShell needs in order to correctly identify a file as UTF-8-encoded (see below).
A single $ExecutionContext.InvokeCommand.ExpandString() call is then sufficient to perform the template expansion.
Out-File -Encoding utf8 would invariably create a file with the pseudo-BOM, which is undesired.
Instead, [IO.File]::WriteAllText() is used, taking advantage of the fact that the .NET Framework by default creates UTF-8-encoded files without the BOM.
Note the use of $PWD/ before out.html, which is needed to ensure that the file gets written in PowerShell's current location (directory); unfortunately, what the .NET Framework considers the current directory is not kept in sync with PowerShell's.
Finally, the obligatory security warning: use this expansion technique only on input that you trust, given that arbitrary embedded commands may get executed.
Optional background information
PowerShell's Out-File, > and >> use UTF-16 LE character encoding with a BOM (byte-order mark) by default (the "weird characters", as mentioned).
While Out-File -Encoding utf8 allows creating UTF-8 output files instead,
PowerShell invariably prepends a 3-byte pseudo-BOM to the output file, which some utilities, notably those with Unix heritage, have problems with - so you would still get "weird characters" (albeit different ones).
If you want a more PowerShell-like way of creating BOM-less UTF-8 files,
see this answer of mine, which defines an Out-FileUtf8NoBom function that otherwise emulates the core functionality of Out-File.
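(A heavily simplified sketch of that idea, not the linked implementation; it assumes a relative $Path and stringifies pipeline input naively rather than reproducing Out-File's formatting:)
function Out-FileUtf8NoBom {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(ValueFromPipeline)] $InputObject
    )
    begin   { $lines = New-Object System.Collections.Generic.List[string] }
    process { $lines.Add([string] $InputObject) }
    end {
        # Anchor the relative path in PowerShell's current location, because
        # .NET's current directory can differ from PowerShell's.
        $fullPath = Join-Path (Get-Location).ProviderPath $Path
        [System.IO.File]::WriteAllLines($fullPath, $lines, (New-Object System.Text.UTF8Encoding $false))
    }
}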
Conversely, on reading files, you must use Get-Content -Encoding utf8 to ensure that BOM-less UTF-8 files are recognized as such.
In the absence of the UTF-8 pseudo-BOM, Get-Content assumes that the file uses the single-byte, extended-ASCII encoding specified by the system's legacy codepage (e.g., Windows-1252 on English-language systems, an encoding that PowerShell calls Default).
Note that while Windows-only editors such as Notepad create UTF-8 files with the pseudo-BOM (if you explicitly choose to save as UTF-8; default is the legacy codepage encoding, "ANSI"), increasingly popular cross-platform editors such as Visual Studio Code, Atom, and Sublime Text by default do not use the pseudo-BOM when they create files.
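To see which BOM, if any, a given file starts with, you can dump its first bytes (a quick diagnostic sketch; works in Windows PowerShell, where Get-Content supports -Encoding Byte):
# EF BB BF = the UTF-8 pseudo-BOM; FF FE = the UTF-16 LE BOM that Out-File writes by default
$bytes = Get-Content -Encoding Byte -TotalCount 3 out.html
'{0:X2} {1:X2} {2:X2}' -f $bytes[0], $bytes[1], $bytes[2]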
For the "weird characters", they're probably BOMs (Byte-order marks). Specify the output encoding explicitly with the -Encoding parameter when using Out-File, for example:
$Template |Out-File out.html -Encoding UTF8
For the string expansion, you need to explicitly tell PowerShell to do so:
$Template = $Template | ForEach-Object {
    $ExecutionContext.InvokeCommand.ExpandString($_)
}
$Template | Out-File out.html -Encoding UTF8

Override Powershell > shortcut

In PowerShell, using > is the same as using | Out-File, so I can write
"something" > file.txt and it will write 'something' into file.txt. This is what I expect of a shell. Unfortunately, PowerShell uses Unicode (UTF-16) when writing file.txt. The only way to change it to UTF-8 is to write the rather long command:
"something" | Out-File file.txt -Encoding UTF8
I want to override the > shortcut, so that it adds the UTF-8 encoding by default. Is there a way to do that?
NOT A DUPLICATE CLARIFICATION:
This is not a duplicate. As is explained clearly here, Out-File has a hard-coded default. I don't want to change Out-File's behavior, I want to change >'s behavior.
No, can't be done
Even the documentation alludes to this.
From the last paragraph of Get-Help about_Redirection:
When you are writing to files, the redirection operators use Unicode encoding. If the file has a different encoding, the output might not be formatted correctly. To redirect content to non-Unicode files, use the Out-File cmdlet with its Encoding parameter.
(emphasis added)
The output encoding can be overridden by changing the $OutputEncoding variable. However, that only works for piping output into executables; it doesn't work for redirection operators. If you need a specific encoding for file output, you must use Out-File or Set-Content with the -Encoding parameter (or a StreamWriter).
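A minimal sketch of the StreamWriter route mentioned above (note that [System.Text.Encoding]::UTF8 emits a BOM; pass New-Object System.Text.UTF8Encoding($false) instead if you want none):
$writer = New-Object System.IO.StreamWriter("$PWD\file.txt", $false, [System.Text.Encoding]::UTF8)
try {
    $writer.WriteLine("something")   # $false above means overwrite, not append
} finally {
    $writer.Dispose()
}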

Prevent extra characters in text file when using Powershell cmdlet export-csv

I'm using the PowerShell cmdlet Export-Csv to export the results of a SQL Server query to a .txt file. The result rows match my expectations when I view them with a text editor (Notepad), like this:
Account|Representative|Note
But, when I open the file using Binary file format (in TextPad), I see that the result rows are interspersed with extra characters (not sure if they are periods or spaces), like this:
A.c.c.o.u.n.t.|.R.e.p.r.e.s.e.n.t.a.t.i.v.e.|.N.o.t.e.
I've tried specifying different values for Export-Csv's -Encoding parameter, including UTF32, UTF8, ASCII, and Unicode, but can't get rid of the extra characters.
I have other methods of generating the file (SSIS package), but would specifically like to get this Powershell option working for the benefit of the group that will support the export.
Thank you in advance for any help you can provide.
Open the file in Notepad and click File → Save As.... The pre-selected value of the Encoding field will tell you which encoding was used (probably Unicode). Export-Csv -Encoding ASCII should take care of that, though. You could also try to re-encode the file like this:
Get-Content generated.txt | Out-File -Encoding ASCII converted.txt
If that doesn't help, please open the file with a hex editor and update your question with the first couple of bytes (the first line should suffice).
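For reference, an explicit-encoding sketch (the query pipeline is an assumption; only the Export-Csv parameters matter here):
# Invoke-Sqlcmd stands in for however the query results are actually produced
Invoke-Sqlcmd -Query $query | Export-Csv .\result.txt -Delimiter '|' -NoTypeInformation -Encoding ASCII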

Does PowerShell string manipulation add junk characters?

When I place my $string variable's contents into a text file, it seems to add some junk characters. For example, I write a utility name into a string and copy that string to a file as follows.
$CodeCount+="ccount.exe"
$CodeCount | Out-file "C:\CodeCount.bat"
When I execute this batch file, it fails, showing some junk characters at the front.
I even tried Trim(), but got the same result.
How can I avoid the junk characters at the front?
Don't use Out-File, at least not in its default form; use Set-Content, or explicitly set the encoding in Out-File, e.g. Out-File -Encoding ascii. Out-File by default uses UTF-16 (UCS-2, actually) encoding, which doesn't get treated as plain text by many applications (including version control systems like Hg and Git) and adds the "funny" characters you mention.
Note that the redirection operator > is the same as Out-File and will give you the same results.
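For example, the Set-Content route (in Windows PowerShell, Set-Content writes using the system's ANSI codepage, with no BOM, so cmd.exe can run the result):
$CodeCount | Set-Content "C:\CodeCount.bat"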
That's because the default encoding used by Out-File is Unicode (UTF-16 LE), which adds BOM bytes at the beginning of the file (you can see these values when you open the file with a hex editor). To avoid these bytes, use -Encoding ASCII:
$CodeCount | Out-file -encoding ascii "C:\CodeCount.bat"
The default encoding of Out-File is little-endian Unicode (UTF-16 LE). cmd.exe doesn't work well with this encoding, so just use ASCII:
$CodeCount | Out-File -Encoding ASCII "C:\CodeCount.bat"