> cat .\foo.txt
abc
> cat .\foo.txt | md5sum
c13b6afecf97ea6b38d21a8f5167fa1e *-
> md5sum foo.txt
b79545611b3be30f90a0d21ef69bca82 *foo.txt
cat and md5sum are the Unix ports (from the Windows Git distribution).
This is a toy example; my real use case is piping binary data to a legacy Python script that I can't change. Because the pipe performs encoding, the binary data gets corrupted.
I tried changing $OutputEncoding and [Console]::OutputEncoding, and using chcp; none of it helped (but maybe I wasn't doing it right, this is all very convoluted...).
The utility from "PowerShell's pipe adds linefeed" doesn't work for me because of how it handles the process arguments (I need to pass several arguments to the legacy script, some of which need to be quoted, but the utility accepts all arguments as a single string).
The ideal solution for me would be to somehow tell PowerShell to turn off encoding completely and just behave like Unix shells or cmd.
There is no way around it, except to use cmd to run the commands including the pipe:
cmd /c cat.exe .\foo.txt "|" md5sum
Note that the pipe character is quoted, so it is interpreted by cmd and not by PowerShell.
If you're using the Get-Content cmdlet, then follow the recommendation given at https://technet.microsoft.com/en-us/library/hh847788.aspx for dealing with binary data:
When reading from and writing to binary files, use a value of Byte for the Encoding dynamic parameter and a value of 0 for the ReadCount parameter.
Regardless of whether or not you're using Get-Content, you'll probably want to avoid ever having your data represented as a String. The String type is designed for character data, and doesn't do well for handling binary data.
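For reference, here is a minimal sketch of a byte-level round trip in Windows PowerShell (file names are placeholders; in PowerShell v6+ use -AsByteStream instead of -Encoding Byte):
$bytes = Get-Content .\input.bin -Encoding Byte -ReadCount 0   # read the raw bytes, no string conversion
Set-Content .\copy.bin -Value $bytes -Encoding Byte            # write them back out unchanged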
I'm having a problem moving forward through a path with PowerShell. I am able to move up the directory but not down. Here's the situation:
I open PowerShell and type in the "pwd" command and it shows that I am currently in PS C:\Users\Robert Inspiron14>
I type the command "cd.." and now I move to PS C:\Users>
I then attempt to change directories by typing: "cd C:\Users\Robert Inspiron14" and I am unable to. Unfortunately, I can't post a picture yet due to lack of reputation.
I'm able to perform the change in CMD but not PowerShell. Also, I don't know how to change the User from "Robert Inspiron14" to just "Robert". Any help is appreciated!
Before PowerShell can execute your cd command, it needs to parse it, and PowerShell's parser interprets your command like this:
cd C:\Users\Robert Inspiron14
\/ \_____________/ \________/
Command Name     |         |
             argument 1    |
                      argument 2
In other words, C:\Users\Robert and Inspiron14 are interpreted as separate arguments.
Neither argument is a path to a valid directory, so cd (or rather Set-Location for which cd is an alias) throws an error.
You can force PowerShell to recognize C:\Users\Robert Inspiron14 as a single string argument by qualifying its boundaries using quotation marks (both " and ' will work):
cd 'C:\Users\Robert Inspiron14'
You can read more about how PowerShell parses command expressions in the about_Parsing help topic.
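You can open that topic locally with:
Get-Help about_Parsing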
To complement Mathias R. Jessen's helpful answer with more background information:
Quoting an argument that contains spaces is a general syntactic necessity, in all shells, because unquoted spaces are used to separate multiple arguments.
It isn't only spaces that require quoting, but any of PowerShell's so-called metacharacters (characters that, when used unquoted, have syntactic function); for instance, passing the path to a directory literally named a;b requires quoting as well, given that ; would otherwise be interpreted as a statement separator.
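For example, with a hypothetical directory literally named a;b:
cd 'C:\data\a;b'   # quoted: the whole path is a single argument
cd C:\data\a;b     # unquoted: ';' ends the cd statement, and 'b' is parsed as a separate command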
There are multiple quoting styles:
Since your path value is a literal - it contains no variable references or expressions - a verbatim (single-quoted) string ('...') is the best choice.
cd 'C:\Users\Robert Inspiron14'
If your path contains variables or subexpressions, you must use an expandable (double-quoted) string ("...")[1]
cd "$HOME\Documents"
Another, less common solution is to individually escape the space characters with `, the so-called backtick, PowerShell's escape character:
cd C:\Users\Robert` Inspiron14
Also note:
PowerShell's tab-completion automatically applies quoting as necessary.
cd.. is the name of a built-in function in PowerShell, whose sole purpose is to emulate cmd.exe's (questionably permissive) behavior (see below); the function performs a syntactically correct Set-Location .. call (verify by executing ${function:cd..}), with a space separating the command name from its argument.
Contrast with cmd.exe:
Unfortunately, cmd.exe's built-in cd command decided not to enforce its usual syntax rules, and enabled calls such as cd C:\Program Files.
It should never have done that: while convenient at first glance, it constitutes a problematic exception to the usual rules that invites confusion.
Note that cmd.exe's tab completion properly quotes arguments that contain spaces.
Similarly, cd.. was unfortunately allowed as a syntactically exceptional alternative to the correct cd .. - see the comments on this answer for details.
[1] Note "..."-quoting isn't strictly necessary if you use variable references in a path, as long as any literal components do not require quoting; e.g., $HOME\foo is fine without quoting, whereas the " around "$HOME\foo bar" are required. With subexpressions ($(...)), the rules get more complicated, so the simplest approach is to always use "..."-quoting with them.
I want to use the redirect append >> or write > to write to a txt file, but when I do, I receive a weird format "\x00a\x00p...".
I successfully use Set-Content and Add-Content, why do they function as expected, but not the >> and > redirect operators?
Below is the output using PowerShell's cat (an alias for Get-Content) as well as a simple Python print.
rocket_brain> new-item test.txt
rocket_brain> "appended using add-content" | add-content test.txt
rocket_brain> cat test.txt
appended using add-content
but then if I use redirect append >>
rocket_brain> "appended using redirect" >> test.txt
rocket_brain> cat test.txt
appended using add-content
a p p e n d e d u s i n g r e d i r e c t
Simple Python script: read_test.py
with open("test.txt", "r") as file: # open test.txt in readmode
data = file.readlines() # append each line to the list data
print(data) # output list with each input line as an item
Using read_test.py I see a difference in formatting
rocket_brain> python read_test.py
['appended using add-content\n', 'a\x00p\x00p\x00e\x00n\x00d\x00e\x00d\x00 \x00u\x00s\x00i\x00n\x00g\x00 \x00r\x00e\x00d\x00i\x00r\x00e\x00c\x00t\x00\r\x00\n', '\x00']
NOTE: If I use only the redirect append >> (or write >) without first using Add-Content, the cat output looks normal (instead of spaced out), but I then get the \x00 format for every line when using the Python script (including for any Add-Content command issued after starting the file with the > operator). Opening the file in Notepad (or VS etc.), the text always looks as expected. Using >> or > in cmd (instead of PS) also stores the text in the expected ASCII format.
Related links: cmd redirection operators,
PS redirection operators
Note: The problem is ultimately that in Windows PowerShell different cmdlets / operators use different default encodings. This problem has been resolved in PowerShell (Core) v6+, where BOM-less UTF-8 is consistently used.
>> blindly applies Out-File's default encoding when appending to an existing file (in effect, > behaves like Out-File and >> like Out-File -Append), which in Windows PowerShell is the encoding named Unicode, i.e., UTF-16LE, where most characters are encoded as 2-byte sequences, even those in the ASCII range; the latter have a 0x0 (NUL) as the high byte.
Therefore, unless the target file's existing contents use the same encoding, you'll end up with a mix of different encodings, which is what happened in your case.[1]
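A quick way to see such a mix for yourself (a sketch, assuming PowerShell 5.1, where Format-Hex is available; the file name is a placeholder):
Set-Content mixed.txt 'first line'   # Windows PowerShell's Set-Content default: ANSI
'second line' >> mixed.txt           # >> (Out-File -Append) default: UTF-16LE ('Unicode')
Format-Hex mixed.txt                 # the appended part shows 0x00 high bytes between the characters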
While Add-Content, by contrast, does try to detect a file's existing encoding (thanks again, js2010), you used it on an empty file, in which case Set-Content's default encoding is applied, which in Windows PowerShell is the encoding named Default, referring to your system's active ANSI code page.
Therefore, to match the single-byte ANSI encoding initially created by your Add-Content call when appending further content, use Out-File -Append -Encoding Default instead of >>, or simply keep using Add-Content.
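That is, continuing the question's example (a minimal sketch):
'appended with matching encoding' | Out-File -Append -Encoding Default test.txt
# or, equivalently, keep using Add-Content, which matches the file's existing encoding:
'appended with Add-Content' | Add-Content test.txt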
Alternatively, pick a different encoding with Add-Content -Encoding ... and match it in the Out-File -Append call; UTF-8 is generally the best choice, though note that when you create a UTF-8 file in Windows PowerShell, it will start with a BOM (a pseudo byte-order mark identifying the file as UTF-8, which Unix-like platforms typically do not expect).
In PowerShell v5.1+ you may also change the default encoding globally, including for > and >> (which isn't possible in earlier versions). To change to UTF-8, for instance, use:
$PSDefaultParameterValues['*:Encoding']='UTF8'
Aside from different default encodings (in Windows PowerShell), it is important to note that Set-Content / Add-Content on the one hand and > / >> / Out-File [-Append] on the other behave fundamentally differently with non-string input:
In short: the former apply simple .ToString()-formatting to the input objects, whereas the latter perform the same output formatting you would see in the console - see this answer for details.
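A small illustration of that difference (hypothetical output file names; Get-Item . simply provides a non-string sample object):
Get-Item . | Set-Content item.txt   # writes the object's .ToString() value, i.e. just a path string
Get-Item . | Out-File item2.txt     # writes the formatted table you would see in the console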
[1] Due to the initial content set by Add-Content, Windows PowerShell interprets the file as ANSI-encoded (the default in the absence of a BOM), where each byte is its own character. The UTF-16 content appended later is therefore also interpreted as if it were ANSI, so the 0x0 bytes are treated like characters in their own right, which print to the console like spaces.
>> and > redirect console output, so I assume that can sometimes include unexpected characters as well. >> and > are more closely related to the Out-File cmdlet.
Add-Content does not forward console output to a file; it only writes the values you provide to it (e.g., a variable or pipeline object).
about_redirection
>> or Out-File -Append will append Unicode (UTF-16) text by default, even if the file isn't Unicode in the first place. Add-Content will check the encoding of the file first and match it. Add-Content and Set-Content default to ANSI encoding as well. I would never use >, >>, or Out-File.
Seeing text with spaces in between the letters is a giveaway that it's Unicode (UTF-16). UTF-16 usually has NUL bytes between the letters. If you dump the hex, for example in Emacs with Esc-x hexl-mode, you can see it. BOMs are 2 or 3 bytes at the beginning of a file.
a p p e n d e d u s i n g r e d i r e c t
This is a correctly constructed Unicode (UTF-16LE) text file, copied and pasted from Emacs hexl-mode. fffe is the BOM. After each character is 00. At the end are 0d and 0a, carriage return and linefeed. Stuff like this interests me. Some Windows utilities can create a Unicode text file with no BOM (icacls /save); if you then type such a file, the letters appear to have spaces in between.
00000000: fffe 6100 7000 7000 6500 6e00 6400 6500 ..a.p.p.e.n.d.e.
00000010: 6400 2000 7500 7300 6900 6e00 6700 2000 d. .u.s.i.n.g. .
00000020: 7200 6500 6400 6900 7200 6500 6300 7400 r.e.d.i.r.e.c.t.
00000030: 0d00 0a00 ....
I'm trying to use a Perl one-liner to turn print0 output into quoted shell parameters, kind of like the trick that's something like .. | xargs -0 printf "%q" {}, but I didn't want to require bash (whose printf implements %q). I was kind of amazed to, well, not find an easy way to do this in Perl. For all of Perl's quoting mechanisms, there's no way I saw for producing quoted strings. Surely I just haven't looked hard enough.
Hopefully the answer isn't a regular expression. Quoting an elaborate regular expression to put into a shell command-line is not my idea of fun (if only a simple perl program could quote it for me, oh back to the same problem).
You can roll your own quoting for POSIX-like shells fairly simply - no complicated regexes needed (just straightforward string substitution using literals):
$ echo "I'm \$HOME. 3\" of rain." | perl -lne "s/'/'\\\''/g; print q{'} . \$_ . q{'}"
'I'\''m $HOME. 3" of rain.'
The approach is modeled after AppleScript's quoted form of command:
The input string is broken into substrings by ', each substring is itself '-enclosed, with the original ' chars. spliced between the substrings as \' (an individually quoted ').
When passed to the shell, the shell rebuilds these parts into a single, literal string.
This multi-part string-concatenation approach is necessary, because POSIX-like shells categorically do not allow embedding ' itself inside single-quoted strings (there's not even an escape sequence).
Alternatively, you can install a CPAN module such as String::ShellQuote.
Optional background information
While it would be handy for Perl itself to support such a quoting mechanism for piecing together shell commands stored in a single string to pass to qx// (`...`), such a mechanism would have to operate platform-specifically.
Notably, quoting rules for Windows are very different from rules for Unix platforms, and except for simple cases shell commands as a whole will be incompatible too.
From inside Perl, you may be able to bypass the need for quoting altogether, by using the list forms of system() and open(), which allow you to pass the command arguments individually, as-is, but note that this is only an option if your command doesn't use any shell features; for a "shell-less" qx// (`...`) alternative, see this answer of mine, which also covers shell-quoting on Windows.
I've looked at several similar questions but none of them seem to address this issue, or they use a form of piping that I'm unfamiliar with, or I'm using "piping" in place of the correct word.
First, I'm on Windows 7, and what I'm trying to do is get a Perl script to call another Perl script and feed it input, multiple times.
The way I'm going about doing this is with the system() function.
When put directly into the command line this works, although a little sloppy:
Functionalscript.pl < InputFile > OutputFile
That takes stuff from the input file, performs the function, and writes it to the output file flawlessly. However, when using the system() function in my calling script, the input is not registered, but the output file is created (it's just blank).
The problem is with:
system("Functionalscript.pl < InputFile > OutputFile")
For some reason when that is used the functionalscript does not receive the input as stdin. Is there a way to make this work?
According to perldoc -f system (http://perldoc.perl.org/functions/system.html):
If there is only one scalar argument, the argument is checked for shell metacharacters, and if there are any, the entire argument is passed to the system's command shell for parsing (this is /bin/sh -c on Unix platforms, but varies on other platforms). If there are no shell metacharacters in the argument, it is split into words and passed directly to execvp , which is more efficient.
Which means if your command has > or < in it it should be passed to the shell, and the input and output redirection should work as expected.
system("x:/path/perl.exe Functionalscript.pl InputFile > OutputFile")
Supplied by mpapec; this works. The "x:/path/perl.exe" had to be included.
I've copied certain files from a Windows machine to a Linux machine. So all the Windows encoded (windows-1252) files need to be converted to UTF-8. The files which are already in UTF-8 should not be changed. I'm planning to use the recode utility for that. How can I specify that the recode utility should only convert windows-1252 encoded files and not the UTF-8 files?
Example usage of recode:
recode windows-1252.. myfile.txt
This would convert myfile.txt from windows-1252 to UTF-8. Before doing this, I would like to know that myfile.txt is actually windows-1252 encoded and not UTF-8 encoded. Otherwise, I believe this would corrupt the file.
iconv -f WINDOWS-1252 -t UTF-8 filename.txt
How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.
Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.
One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.
I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.
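For instance, here is a minimal sketch of such a programmatic check, using PowerShell 5+/.NET rather than C# (the file name is a placeholder): a strict UTF-8 decoder throws on invalid byte sequences.
$strictUtf8 = [System.Text.UTF8Encoding]::new($false, $true)   # $true: throw on invalid bytes
try {
    $null = $strictUtf8.GetString([System.IO.File]::ReadAllBytes('myfile.txt'))
    'Decodes cleanly as UTF-8'
} catch [System.Text.DecoderFallbackException] {
    'Not valid UTF-8 - probably Windows-1252'
}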
Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.
Here's a transcription of another answer I gave to a similar question:
If you apply utf8_encode() to an already-UTF-8 string, it will return garbled UTF-8 output.
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a data feed that was all messed up, mixing UTF-8 and Latin1 in the same string.
Usage:
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
Update:
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Update: I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
There's no general way to tell if a file is encoded with a specific encoding. Remember that an encoding is nothing more but an "agreement" how the bits in a file should be mapped to characters.
If you don't know which of your files are actually already encoded in UTF-8 and which ones are encoded in windows-1252, you will have to inspect all files and find out yourself. In the worst case, that could mean you have to open every single one of them with either of the two encodings and see whether they "look" correct, i.e., whether all characters are displayed correctly. Of course, you may use tool support to do that; for instance, if you know for sure that certain characters contained in the files have different mappings in windows-1252 vs. UTF-8, you could grep for them after running the files through iconv, as mentioned by Seva Alekseyev.
Another lucky case would be if you know that the files actually contain only characters that are encoded identically in both UTF-8 and windows-1252. In that case, of course, you're done already.
If you want to convert multiple files in a single command, say all *.txt files, here is the command:
find . -name "*.txt" -exec iconv -f WINDOWS-1252 -t UTF-8 {} -o {}.ren \; -a -exec mv {}.ren {} \;
Use the iconv command.
To make sure a file is in Windows-1252, open it in Notepad (under Windows), then click Save As. Notepad suggests the current encoding as the default; if it's Windows-1252 (or any single-byte code page, for that matter), it will say "ANSI".
You can change the encoding of a file with an editor such as Notepad++. Just go to Encoding and select what you want.
I always prefer Windows-1252.
If you are sure your files are either UTF-8 or Windows 1252 (or Latin1), you can take advantage of the fact that recode will exit with an error if you try to convert an invalid file.
While UTF-8 data is also valid Windows-1252 (nearly every byte sequence can be read as Windows-1252), the reverse is not true: Windows-1252 text containing non-ASCII characters is NOT valid UTF-8. So:
recode utf8..utf16 <unknown.txt >/dev/null || recode cp1252..utf8 <unknown.txt >utf8-2.txt
This will spit out errors for all the cp1252 files, and then proceed to convert them to UTF-8.
I would wrap this into a cleaner bash script, keeping a backup of every converted file.
Before doing the charset conversion, you may wish to first ensure you have consistent line-endings in all files. Otherwise, recode will complain because of that, and may convert files which were already UTF8, but just had the wrong line-endings.
This script worked for me on Windows 10 / PowerShell 5.1 for converting CP1250 to UTF-8:
Get-ChildItem -Include *.php -Recurse | ForEach-Object {
    $file = $_.FullName
    $mustReWrite = $false
    # Try to read as UTF-8 first and throw an exception if
    # invalid-as-UTF-8 bytes are encountered.
    try
    {
        # Discard the result; we only care whether decoding succeeds.
        $null = [IO.File]::ReadAllText($file, [Text.UTF8Encoding]::new($false, $true))
    }
    catch [System.Text.DecoderFallbackException]
    {
        # Fall back to Windows-1250
        $content = [IO.File]::ReadAllText($file, [Text.Encoding]::GetEncoding(1250))
        $mustReWrite = $true
    }
    # Rewrite as UTF-8 without a BOM (the .NET Framework's default)
    if ($mustReWrite)
    {
        Write "Converting from 1250 to UTF-8"
        [IO.File]::WriteAllText($file, $content)
    }
    else
    {
        Write "Already UTF-8-encoded"
    }
}
As said, you can't reliably determine whether a file is Windows-1252 because Windows-1252 maps almost all bytes to a valid code point. However if the files are only in Windows-1252 and UTF-8 and no other encodings then you can try to parse a file in UTF-8 and if it contains invalid bytes then it's a Windows-1252 file
if iconv -f UTF-8 -t UTF-16 "$FILE" 1>/dev/null 2>&1; then
    # Conversion succeeded
    echo "$FILE is in UTF-8"
else
    # iconv returns an error if there are invalid characters in the byte stream
    echo "$FILE is in Windows-1252. Converting to UTF-8"
    iconv -f WINDOWS-1252 -t UTF-8 -o "${FILE}_utf8.txt" "$FILE"
fi
This is similar to many other answers that try to treat the file as UTF-8 and check if there are errors. It works 99% of the time because most Windows-1252 texts will be invalid in UTF-8, but there will still be rare cases when it won't work. It's heuristic after all!
There are also various libraries and tools to detect the character set, such as chardet
$ chardet utf8.txt windows1252.txt iso-8859-1.txt
utf8.txt: utf-8 with confidence 0.99
windows1252.txt: Windows-1252 with confidence 0.73
iso-8859-1.txt: ISO-8859-1 with confidence 0.73
It can't be completely reliable due to the heuristic nature, so it outputs a confidence value for people to judge. The more human text in the file, the more confident it will be. If you have very specific texts, then more training for the library will be needed. For more information, read How do browsers determine the encoding used?
Found this documentation for the TYPE command:
Convert an ASCII (Windows1252) file into a Unicode (UCS-2 le) text file:
For /f "tokens=2 delims=:" %%G in ('CHCP') do Set _codepage=%%G
CHCP 1252 >NUL
CMD.EXE /D /A /C (SET/P=ÿþ)<NUL > unicode.txt 2>NUL
CMD.EXE /D /U /C TYPE ascii_file.txt >> unicode.txt
CHCP %_codepage%
The technique above (based on a script by Carlos M.) first creates a file with a Byte Order Mark (BOM) and then appends the content of the original file. CHCP is used to ensure the session is running with the Windows1252 code page so that the characters 0xFF and 0xFE (ÿþ) are interpreted correctly.
UTF-8 does not need a BOM; a BOM there is superfluous and generally discouraged. Where a BOM is helpful is in UTF-16, which may be byte-swapped, as in Microsoft's case. UTF-16 is for internal representation in a memory buffer; use UTF-8 for interchange. By default, UTF-8, anything else derived from US-ASCII, and UTF-16 use natural/network byte order. Microsoft's UTF-16 requires a BOM because it is byte-swapped.
To convert Windows-1252 to ISO 8859-15, I first convert the ISO 8859-1 codes with similar glyphs to US-ASCII. I then convert Windows-1252 up to ISO 8859-15, mapping the remaining non-ISO 8859-15 glyphs to multiple US-ASCII characters.