Exporting German umlauts with the PowerShell 3 Out-File cmdlet

I am currently facing a problem with the Out-File cmdlet, which I use for creating a log file. This works fine in general, but unfortunately I can't get PowerShell to export the German umlauts correctly. What I tried was:
saving the script file with UTF-8 encoding (I use Sublime Text as editor)
appending to an existing text file which I had saved with MS Notepad beforehand (the Out-File cmdlet uses its -Append parameter in this case; otherwise a new file is created, which defaults to Unicode encoding if the -Encoding parameter is not set otherwise)
using "My String" | Out-File "xyz.log" -Encoding utf8, where the -Encoding parameter should handle the string export with UTF-8 encoding; specifying the utf8 value of -Encoding with double quotation marks ("My String" | Out-File "xyz.log" -Encoding "utf8") does not work either
Microsoft's Developer Network and other threads on StackOverflow couldn't really solve my problem. Does anyone know a solution or at least a workaround for this issue?

I found a workaround for my issue much faster than I expected. What I did is to replace each umlaut in a logging string with its Unicode value. For that I created an array $umlauts of arrays, each containing an umlaut together with its Unicode value.
You should also make sure to wrap your logging string in single quotation marks ('äöü'), because PowerShell seems to have problems with umlauts in double quotation marks ("äöü").
UPDATE 1: As mentioned by n3wjack, a string with umlauts has to be wrapped in single quotation marks so that PowerShell handles every character "as it is".
Here is my implementation for what I described above:
function Out-LogFile ($str) {
    # Each entry pairs an umlaut with its Unicode code point.
    $umlauts = @(
        @('Ä',[char]0x00C4),
        @('Ö',[char]0x00D6),
        @('Ü',[char]0x00DC),
        @('ä',[char]0x00E4),
        @('ö',[char]0x00F6),
        @('ü',[char]0x00FC)
    )
    foreach ($umlaut in $umlauts) {
        $str = $str -replace $umlaut[0],$umlaut[1]
    }
    $str | Out-File "myfile.txt"
    return $str
}
Out-LogFile 'ÄÖÜäöü'
UPDATE 2: I noticed that n3wjack's tip makes my implementation obsolete. By just wrapping the log text in single quotation marks (like 'äöü' | Out-File "file.log"), all umlauts are exported correctly. Thank you!

Related

Encode File save utf8

I'm breaking my head :D
I am trying to encode a text file so that it is saved the same way Notepad saves it.
It looks exactly the same, but it's not the same: only if I open the file in Notepad and save it again does it work for me. What could be the problem with the encoding? Or how can I solve it? Is there a command that opens Notepad and saves the file again?
This is what I use now:
(Get-Content 000014.log) | Out-FileUtf8NoBom ddppyyyyy.txt
and after this
Get-ChildItem ddppyyyyy.txt | ForEach-Object {
    # get the contents and replace line breaks by U+000A
    $contents = [IO.File]::ReadAllText($_) -replace "`r`n?", "`n"
    # create UTF-8 encoding without signature
    $utf8 = New-Object System.Text.UTF8Encoding $false
    # write the text back
    [IO.File]::WriteAllText($_, $contents, $utf8)
}
When you open a file with notepad.exe it autodetects the encoding (or do you open the file explicitly via File -> Open... as UTF-8?). If your file is actually not UTF-8 but something else, Notepad may be able to work around this and convert it to the required encoding when the file is resaved. So, when you do not specify the correct input encoding in your PoSh script, things will go wrong.
But that's not all; Notepad also drops erroneous characters when the file is saved, to create a regular text file. For instance, your text file might contain a NULL character that only gets removed when you use Notepad. If this is the case, it is highly unlikely that your input file is UTF-8 encoded (unless it is broken). So, it looks like your problem is that your source file is UTF-16 or similar; try to find the right input encoding and rewrite it, e.g. UTF-16 to UTF-8:
Get-Content file.foo -Encoding Unicode | Set-Content -Encoding UTF8 newfile.foo
Try it like this:
Get-ChildItem ddppyyyyy.txt | ForEach-Object {
    # get the contents and replace Windows line breaks by U+000A
    $raw = (Get-Content -Raw $_ -Encoding UTF8) -replace "`r?`n", "`n" -replace "`0", ""
    # create UTF-8 encoding without BOM signature
    $utf8NoBom = New-Object System.Text.UTF8Encoding $false
    # write the text back
    [System.IO.File]::WriteAllLines($_, $raw, $utf8NoBom)
}
If you are struggling with the byte-order mark, it is best to use a hex editor to check the file header manually; checking your file after saving it as shown above, then opening it with Notepad.exe and saving it under a new name, shows no difference anymore.
The hex-dumped beginning of a file with a BOM, by contrast, shows the BOM bytes right at the start of the file.
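If you don't have a hex editor at hand, PowerShell itself can dump the first bytes; a quick sketch (Format-Hex ships with PowerShell 5+; the file name is just the one from the question):
# A UTF-8 BOM starts with EF BB BF, a UTF-16 LE BOM with FF FE.
Format-Hex -Path .\ddppyyyyy.txt | Select-Object -First 1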
Also, as noted, while your regex pattern should work if you want to convert Windows newlines to Unix style, it is more common and safer to make the CR optional: `r?`n
As noted by mklement0, reading the file using the correct encoding is important; if your file is actually in Latin-1 or something similar, you will end up with a broken file if you carelessly convert it to UTF-8 in PoSh.
Thus, I have added the -Encoding UTF8 parameter to the Get-Content cmdlet; adjust as needed.
Update: There is nothing wrong with the code in the question; the true problem was embedded NUL characters in the files, which caused problems in R and which opening and resaving in Notepad implicitly removed, thereby resolving the problem (assuming that simply discarding these NULs works as intended) - see also: wp78de's answer.
Therefore, modifying the $contents = ... line as follows should fix your problem:
$contents = [IO.File]::ReadAllText($_) -replace "`r`n", "`n" -replace "`0"
Note: The code in the question uses the Out-FileUtf8NoBom function from this answer, which allows saving to BOM-less UTF-8 files in Windows PowerShell; it now supports a -UseLF switch, which would simplify the OP's command to (additional problems notwithstanding):
Get-Content 000014.log | Out-FileUtf8NoBom ddppyyyyy.txt -UseLF
There's a conceptual flaw in your regex, though it is benign in this case: instead of "`r`n?" you want "`r?`n" (or, expressed as a pure regex, '\r?\n') in order to match both CRLF ("`r`n") and LF-only ("`n") newlines.
Your regex would instead match CRLF and CR-only(!) newlines; however, as wp78de points out, if your input file contains only the usual CRLF newlines (and not also isolated CR characters), your replacement operation should still work.
In fact, you don't need a regex at all if all you need is to replace CRLF sequences with LF: -replace "`r`n", "`n"
Assuming that your original input files are ANSI-encoded, you can simplify your approach as follows, without the need to call Out-FileUtf8NoBom first (assumes Windows PowerShell):
# NO need for Out-FileUtf8NoBom - process the ANSI-encoded files directly.
Get-ChildItem *SomePattern*.txt | ForEach-Object {
    # Get the contents and make sure newlines are LF-only.
    # [Text.Encoding]::Default is the encoding for the active ANSI code page
    # in Windows PowerShell.
    $contents = [IO.File]::ReadAllText(
        $_.FullName,
        [Text.Encoding]::Default
    ) -replace "`r`n", "`n"
    # Write the text back as BOM-less UTF-8 (the .NET Framework default).
    [IO.File]::WriteAllText($_.FullName, $contents)
}
Note that replacing the content of files in-place bears a risk of data loss, so it's best to create backup copies of the original files first.
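A simple backup pass before the conversion might look like this (just a sketch; the file pattern is the same placeholder used above):
# Keep a .bak copy of each file before rewriting it in place.
Get-ChildItem *SomePattern*.txt | ForEach-Object {
    Copy-Item -LiteralPath $_.FullName -Destination ($_.FullName + '.bak')
}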
Note: If you wanted to perform the same operation in PowerShell [Core] v6+, which is built on .NET Core, the code must be modified slightly, because [Text.Encoding]::Default no longer reflects the active ANSI code page and instead invariably returns a BOM-less UTF-8 encoding.
Therefore, the $contents = ... statement would have to change to (note that this would work in Windows PowerShell too):
$contents = [IO.File]::ReadAllText(
    $_.FullName,
    [Text.Encoding]::GetEncoding(
        [cultureinfo]::CurrentCulture.TextInfo.ANSICodePage
    )
) -replace "`r`n", "`n"

Powershell Out-file special characters

I have a script that processes data from files and writes the result, based on a condition, to a txt file. The given data are strings with words like "Distribución" or "México". When processed, special characters like "é" and "ó" get broken (the typical white square or question mark).
How can I encode the output file to make it work with those characters? I tried UTF-8 encoding and UTF-8 without BOM; it doesn't work. Here is the file writing line:
...| Out-file -encoding XXX .\result.txt
In place of XXX I tried ASCII and UTF8; nothing works :/
Out-File will always add a BOM. It's a particularly annoying "feature" of that cmdlet. Unfortunately - to my knowledge - there is no quick way to save a file using UTF-8 WITHOUT a BOM in PowerShell. You can, however, leverage .NET to do this. This isn't really production ready, but here's a quick example:
$outputPath = "D:\temp.txt"
$data = "Distribución or México"
[System.IO.File]::WriteAllLines($outputPath, $data)
Wrap it in a Cmdlet, function and / or module to make it reusable. Of course you can take more control over the file encoding with .Net too.
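For example, a minimal wrapper around the same WriteAllLines call might look like this (the function name and parameters are made up for illustration):
function Write-Utf8NoBomFile {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory, ValueFromPipeline)] [string] $Line
    )
    begin   { $lines = New-Object System.Collections.Generic.List[string] }
    process { $lines.Add($Line) }
    end {
        # WriteAllLines without an explicit encoding writes BOM-less UTF-8.
        [System.IO.File]::WriteAllLines($Path, $lines)
    }
}
"Distribución or México" | Write-Utf8NoBomFile -Path "D:\temp.txt"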

How to expand file content with powershell

I want to do this :
$content = get-content "test.html"
$template = get-content "template.html"
$template | out-file "out.html"
where template.html contains
<html>
<head>
</head>
<body>
$content
</body>
</html>
and test.html contains:
<h1>Test Expand</h1>
<div>Hello</div>
I get weird characters in the first 2 characters of out.html:
��
and the content is not expanded.
How can I fix this?
To complement Mathias R. Jessen's helpful answer with a solution that:
is more efficient.
ensures that the input files are read as UTF-8, even if they don't have a (pseudo-)BOM (byte-order mark).
avoids the "weird character" problem altogether by writing a UTF-8-encoded output file without that pseudo-BOM.
# Explicitly read the input files as UTF-8, as a whole.
$content = get-content -raw -encoding utf8 test.html
$template = get-content -raw -encoding utf8 template.html
# Write to output file using UTF-8 encoding *without a BOM*.
[IO.File]::WriteAllText(
    "$PWD/out.html",
    $ExecutionContext.InvokeCommand.ExpandString($template)
)
get-content -raw (PSv3+) reads the files in as a whole, into a single string (instead of an array of strings, line by line), which, while more memory-intensive, is faster. With HTML files, memory usage shouldn't be a concern.
An additional advantage of reading the files in full is that if the template were to contain multi-line subexpressions ($(...)), the expansion would still function correctly.
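For illustration, a template fragment along these lines (a made-up example) would still expand correctly when the files are read with -Raw, because the whole $( ... ) block stays inside a single string:
<body>
$(
    foreach ($line in $content -split "`n") {
        "  <p>$line</p>"
    }
)
</body>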
get-content -encoding utf8 ensures that the input files are interpreted as using character encoding UTF-8, as is typical in the web world nowadays.
This is crucial, given that UTF-8-encoded HTML files normally do not have the 3-byte pseudo-BOM that PowerShell needs in order to correctly identify a file as UTF-8-encoded (see below).
A single $ExecutionContext.InvokeCommand.ExpandString() call is then sufficient to perform the template expansion.
Out-File -Encoding utf8 would invariably create a file with the pseudo-BOM, which is undesired.
Instead, [IO.File]::WriteAllText() is used, taking advantage of the fact that the .NET Framework by default creates UTF-8-encoded files without the BOM.
Note the use of $PWD/ before out.html, which is needed to ensure that the file gets written in PowerShell's current location (directory); unfortunately, what the .NET Framework considers the current directory is not in sync with PowerShell.
Finally, the obligatory security warning: use this expansion technique only on input that you trust, given that arbitrary embedded commands may get executed.
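To see why, note that every $(...) subexpression in the template is executed as PowerShell code during expansion; a harmless illustration:
# The embedded command runs when the string is expanded:
$ExecutionContext.InvokeCommand.ExpandString('Generated at $(Get-Date)')
# -> Generated at <current date and time>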
Optional background information
PowerShell's Out-File, > and >> use UTF-16 LE character encoding with a BOM (byte-order mark) by default (the "weird characters", as mentioned).
While Out-File -Encoding utf8 allows creating UTF-8 output files instead,
PowerShell invariably prepends a 3-byte pseudo-BOM to the output file, which some utilities, notably those with Unix heritage, have problems with - so you would still get "weird characters" (albeit different ones).
If you want a more PowerShell-like way of creating BOM-less UTF-8 files,
see this answer of mine, which defines an Out-FileUtf8NoBom function that otherwise emulates the core functionality of Out-File.
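The gist of such a function, heavily simplified (only an illustrative sketch, not the actual Out-FileUtf8NoBom from the linked answer, which also supports parameters such as -Append and -NoClobber):
function Out-FileUtf8NoBomSketch {
    param(
        [Parameter(Mandatory)] [string] $LiteralPath,
        [Parameter(Mandatory, ValueFromPipeline)] $InputObject
    )
    begin   { $lines = New-Object System.Collections.Generic.List[string] }
    process { $lines.AddRange([string[]]($InputObject | Out-String -Stream)) }
    end {
        # Resolve the path the PowerShell way (see the $PWD note above),
        # then write BOM-less UTF-8 via .NET.
        $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($LiteralPath)
        [IO.File]::WriteAllLines($fullPath, $lines, (New-Object System.Text.UTF8Encoding $false))
    }
}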
Conversely, on reading files, you must use Get-Content -Encoding utf8 to ensure that BOM-less UTF-8 files are recognized as such.
In the absence of the UTF-8 pseudo-BOM, Get-Content assumes that the file uses the single-byte, extended-ASCII encoding specified by the system's legacy codepage (e.g., Windows-1252 on English-language systems, an encoding that PowerShell calls Default).
Note that while Windows-only editors such as Notepad create UTF-8 files with the pseudo-BOM (if you explicitly choose to save as UTF-8; default is the legacy codepage encoding, "ANSI"), increasingly popular cross-platform editors such as Visual Studio Code, Atom, and Sublime Text by default do not use the pseudo-BOM when they create files.
For the "weird characters", they're probably BOMs (Byte-order marks). Specify the output encoding explicitly with the -Encoding parameter when using Out-File, for example:
$Template | Out-File out.html -Encoding UTF8
For the string expansion, you need to explicitly tell powershell to do so:
$Template = $Template | ForEach-Object {
    $ExecutionContext.InvokeCommand.ExpandString($_)
}
$Template | Out-File out.html -Encoding UTF8

Out-File -append in Powershell does not produce a new line and breaks string into characters

I'm trying to understand some weird behaviour with this cmdlet.
If I use "Out-File -append Filename.txt" on a text file that I created and entered text into via the windows context menu, the string will append to the last line in that file as a series of space separated characters.
So:
"This is a test" | out-file -append textfile.txt
Will produce:
T h i s i s a t e s t
This won't happen if Out-File creates the file, or if the text file has no text in it prior to appending. Why does this happen?
I will also note that repeating the command will just append in the same way to the same line. I guess it doesn't recognise newline or line break terminator or something due to changed encoding?
Out-File defaults to Unicode encoding, which is why you are seeing this behavior. Use -Encoding Ascii to change it. In your case:
"This is a test" | Out-File -Encoding Ascii -Append textfile.txt
Add-Content uses Ascii and also appends by default:
"This is a test" | Add-Content textfile.txt
As for the lack of newline: You did not send a newline so it will not write one to file.
Add-Content defaults to ASCII and adds a new line; however, Add-Content can also run into locked-file issues.

Does PowerShell string manipulation add junk characters?

When I am placing my $string variable contents into a text file, it seems to add some junk characters. For example, I write a utility name into a string and copy that string to a file as follows.
$CodeCount+="ccount.exe"
$CodeCount | Out-file "C:\CodeCount.bat"
When I execute this batch file, it fails, showing some junk characters in front.
I even tried trim(), but with the same result.
How do I avoid adding junk characters at the front?
Don't use Out-File, at least not in its default form. Use Set-Content, or explicitly set the encoding in Out-File, e.g. Out-File -Encoding ascii. Out-File by default uses UTF-16 (UCS-2, actually) encoding, which doesn't get treated as plain text by many applications (including version control systems like Hg and Git) and adds the "funny" characters you mention.
Note that the redirection operator > is the same as Out-File and will give you the same results.
That's because the standard encoding used by Out-File is Unicode, which adds some BOM values at the beginning of the file (you can see these values when you open your file with a hex editor). To avoid these bytes, use -Encoding ASCII:
$CodeCount | Out-file -encoding ascii "C:\CodeCount.bat"
The default encoding of Out-File is Little Endian Unicode. cmd.exe doesn't work well with this encoding so just use ASCII:
Out-File -Encoding ASCII