Different behaviour and output when piping in CMD and PowerShell - powershell

I am trying to pipe the content of a file to a simple ASCII symmetrical encryption program I made. It's a simple program that reads input from STDIN and adds or subtracts a certain value (224) to/from each byte of the input.
For example: if the first byte is 4 and we want to encrypt, it becomes 228. If the result exceeds 255, the program just wraps around modulo 256.
This is the output I get with cmd (test.txt contains "this is a test"):
type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt
this is a test
It also works the other way around, since it is a symmetrical encryption algorithm:
type .\test.txt | .\Crypt.exe --decrypt | .\Crypt.exe --encrypt
this is a test
But the behaviour in PowerShell is different. When encrypting first, I get:
type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt
this is a test_*
And I get similarly garbled output when decrypting first.
Maybe it is an encoding problem. Thanks in advance.

tl;dr:
Up to at least PowerShell 7.3.1, if you need raw byte handling and/or need to prevent PowerShell from situationally adding a trailing newline to your text data, avoid the PowerShell pipeline altogether.
Future support for passing raw byte data between external programs and to-file redirections is the subject of GitHub issue #1908.
For raw byte handling, shell out to cmd with /c (on Windows; on Unix-like platforms / Unix-like Windows subsystems, use sh or bash with -c):
cmd /c 'type .\test.txt | .\Crypt.exe --encrypt | .\Crypt.exe --decrypt'
Use a similar technique to save raw byte output in a file - do not use PowerShell's > operator:
cmd /c 'someexe > file.bin'
Note that if you want to capture an external program's text output in a PowerShell variable or process it further in a PowerShell pipeline, you need to make sure that [Console]::OutputEncoding matches your program's output character encoding (the active OEM code page, typically), which should be true by default in this case; see the next section for details.
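For a quick sanity check, you can inspect both sides of that equation directly; on a US-English Windows system, for instance, both of the following typically report OEM code page 437:
# The encoding PowerShell uses to *decode* external-program output:
[Console]::OutputEncoding
# The console's active code page, for comparison:
chcp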
Generally, however, byte manipulation of text data is best avoided.
There are two separate problems, only one of which has a simple solution:
Problem 1: There is indeed a character encoding problem, as you suspected:
PowerShell invisibly inserts itself as an intermediary in pipelines, even when sending data to and receiving data from external programs: It converts data from and to .NET strings (System.String), which are sequences of UTF-16 code units.
As an aside: Even when using only PowerShell-native commands, this means that reading input from files and saving them again can result in a different character encoding, because the information about the original character encoding is not preserved once (string) data has been read into memory, and on saving it is the cmdlets' default character encoding that is used; while this default encoding is consistently BOM-less UTF-8 in PowerShell (Core) 6+, it varies by cmdlet in Windows PowerShell - see this answer.
In order to send to and receive data from external programs (such as Crypt.exe in your case), you need to match their character encoding; in your case, with a Windows console application that uses raw byte handling, the implied encoding is the system's active OEM code page.
On sending data, PowerShell uses the encoding of the $OutputEncoding preference variable to encode (what is invariably treated as text) data, which defaults to ASCII(!) in Windows PowerShell, and (BOM-less) UTF-8 in PowerShell (Core).
The receiving end is covered by default: PowerShell uses [Console]::OutputEncoding (which itself reflects the code page reported by chcp) for decoding data received, and on Windows this by default reflects the active OEM code page, both in Windows PowerShell and PowerShell [Core][1].
To fix your primary problem, you therefore need to set $OutputEncoding to the active OEM code page:
# Make sure that PowerShell uses the OEM code page when sending
# data to `.\Crypt.exe`
$OutputEncoding = [Console]::OutputEncoding
Problem 2: PowerShell invariably appends a trailing newline to data that doesn't already have one when piping data to external programs:
That is, "foo" | .\Crypt.exe doesn't send (the $OutputEncoding-encoded bytes representing) "foo" to .\Crypt.exe's stdin, it sends "foo`r`n" on Windows; i.e., a (platform-appropriate) newline sequence (CRLF on Windows) is automatically and invariably appended (unless the string already happens to have a trailing newline).
This problematic behavior is discussed in GitHub issue #5974 and also in this answer.
In your specific case, the implicitly appended "`r`n" is also subject to the byte-value shifting, which means that the 1st Crypt.exe call transforms it to -*, causing another "`r`n" to be appended when the data is sent to the 2nd Crypt.exe call.
The net result is an extra newline that is round-tripped (the intermediate -*), plus an encrypted newline that results in φΩ.
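To make the arithmetic concrete, here is a quick PowerShell sketch of what shifting the CR (13) and LF (10) bytes by 224 modulo 256 produces; the φ/Ω rendering assumes OEM code page 437:
# Shifting CR LF *down* by 224 (mod 256) yields 45 and 42, i.e. '-' and '*':
13, 10 | ForEach-Object { [char](($_ - 224 + 256) % 256) }
# Shifting *up* by 224 (mod 256) yields code points 237 and 234,
# which render as φ and Ω under code page 437:
13, 10 | ForEach-Object { ($_ + 224) % 256 }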
In short: If your input data had no trailing newline, you'll have to cut off the last 4 characters from the result (representing the round-tripped and the inadvertently encrypted newline sequences):
# Ensure that .\Crypt.exe output is correctly decoded.
$OutputEncoding = [Console]::OutputEncoding
# Invoke the command and capture its output in variable $result.
# Note the use of the `Get-Content` cmdlet; in PowerShell, `type`
# is simply a built-in *alias* for it.
$result = Get-Content .\test.txt | .\Crypt.exe --decrypt | .\Crypt.exe --encrypt
# Remove the last 4 chars. and print the result.
$result.Substring(0, $result.Length - 4)
Given that calling cmd /c as shown at the top of the answer works too, that hardly seems worth it.
How PowerShell handles pipeline data with external programs:
Unlike cmd (or POSIX-like shells such as bash):
PowerShell doesn't support raw byte data in pipelines.[2]
When talking to external programs, it only knows text (whereas it passes .NET objects when talking to PowerShell's own commands, which is where much of its power comes from).
Specifically, this works as follows:
When you send data to an external program via the pipeline (to its stdin stream):
It is converted to text (strings) using the character encoding specified in the $OutputEncoding preference variable, which defaults to ASCII(!) in Windows PowerShell, and (BOM-less) UTF-8 in PowerShell (Core).
Caveat: If you assign an encoding with a BOM to $OutputEncoding, PowerShell (as of v7.0) will emit the BOM as part of the first line of output sent to an external program; therefore, for instance, do not use [System.Text.Encoding]::Utf8 (which emits a BOM) in Windows PowerShell, and use [System.Text.Utf8Encoding]::new($false) (which doesn't) instead.
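For example, to make Windows PowerShell send BOM-less UTF-8 to external programs:
# Windows PowerShell: use UTF-8 *without* a BOM when sending data
# to external programs:
$OutputEncoding = [System.Text.UTF8Encoding]::new($false)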
If the data is not captured or redirected by PowerShell, encoding problems may not always become apparent, namely if an external program is implemented in a way that uses the Windows Unicode console API to print to the display.
Something that isn't already text (a string) is stringified using PowerShell's default output formatting (the same format you see when you print to the console), with an important caveat:
If the (last) input object already is a string that doesn't itself have a trailing newline, one is invariably appended (and even an existing trailing newline is replaced with the platform-native one, if different).
This behavior can cause problems, as discussed in GitHub issue #5974 and also in this answer.
When you capture / redirect data from an external program (from its stdout stream), it is invariably decoded as lines of text (strings), based on the encoding specified in [Console]::OutputEncoding, which defaults to the active OEM code page on Windows (surprisingly, in both PowerShell editions, as of v7.0-preview6[1]).
PowerShell-internally, text is represented using the .NET System.String type, which is based on UTF-16 code units (often loosely, but incorrectly, called "Unicode"[3]).
The above also applies:
when piping data between external programs,
when data is redirected to a file; that is, irrespective of the source of the data and its original character encoding, PowerShell uses its default encoding(s) when sending data to files; in Windows PowerShell, > produces UTF-16LE-encoded files (with BOM), whereas PowerShell (Core) sensibly defaults to BOM-less UTF-8 (consistently, across file-writing cmdlets).
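As a quick illustration (with hypothetical file names), redirecting the same string to a file yields different encodings across the two editions:
# Windows PowerShell: creates a UTF-16LE file with BOM (bytes start FF FE ...)
'hü' > out-wp.txt
# PowerShell (Core) 7+: creates a BOM-less UTF-8 file ('ü' encoded as C3 BC)
'hü' > out-core.txt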
[1] In PowerShell (Core), given that $OutputEncoding commendably already defaults to UTF-8, it would make sense to have [Console]::OutputEncoding be the same - i.e., for the active code page to be effectively 65001 on Windows, as suggested in GitHub issue #7233.
[2] With input from a file, the closest you can get to raw byte handling is to read the file as a .NET System.Byte array with Get-Content -AsByteStream (PowerShell (Core)) / Get-Content -Encoding Byte (Windows PowerShell), but the only way you can further process such an array is to pipe it to a PowerShell command that is designed to handle a byte array, or to pass it to a .NET method that expects a byte array. If you tried to send such an array to an external program via the pipeline, each byte would be sent as its decimal string representation on its own line.
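A minimal sketch of such byte-array processing (test.bin is a hypothetical input file):
# PowerShell (Core) 6+; use -Encoding Byte instead in Windows PowerShell.
# -Raw returns the bytes as a single [byte[]] array rather than one by one.
$bytes = Get-Content -AsByteStream -Raw .\test.bin
# Pass the byte array to a .NET method that expects one:
[IO.File]::WriteAllBytes("$PWD\copy.bin", $bytes)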
[3] Unicode is the name of the abstract standard describing a "global alphabet". In concrete use, it has various standard encodings, UTF-8 and UTF-16 being the most widely used.

Related

powershell string urldecoding execution time

Performing URL decoding in a CGI hybrid script with a PowerShell one-liner:
echo %C4%9B%C5%A1%C4%8D%C5%99%C5%BE%C3%BD%C3%A1%C3%AD%C3%A9%C5%AF%C3%BA| powershell.exe "Add-Type -AssemblyName System.Web;[System.Web.HttpUtility]::UrlDecode($Input) | Write-Host"
The execution time of this one-liner is between 2-3 seconds on a virtual machine. Is it because a .NET object is employed? Is there any way to decrease execution time? I also have a lightning-fast urldecode.exe utility written in C, but unfortunately it does not eat STDIN.
A note if you're passing the input data as a string literal, as in the sample call in the question (rather than as output from a command):
If you're calling from an interactive cmd.exe session, %C4%9B%C5%A1%C4%8D%C5%99%C5%BE%C3%BD%C3%A1%C3%AD%C3%A9%C5%AF%C3%BA works as-is - unless any of the tokens between paired % instances happen to be the name of an existing environment variable (which seems unlikely).
From a batch file, you'll have to escape the % chars. by doubling them - see this answer. You can obtain the doubled form by applying the following PowerShell operation to the original string:
'%C4%9B%C5%A1%C4%8D%C5%99%C5%BE%C3%BD%C3%A1%C3%AD%C3%A9%C5%AF%C3%BA' -replace '%', '%%'
Is it because a .NET object is employed?
Yes: powershell.exe, as a .NET-based application, requires starting the .NET runtime (CLR), which is nontrivial in terms of performance.
Additionally, powershell.exe by default loads the initialization files listed in its $PROFILE variable, which can take additional time.
Pass the -NoProfile CLI option to suppress that.
I also have a lightning-fast urldecode.exe utility written in C, but unfortunately it does not eat STDIN.
If so, pass the data as an argument, if feasible; e.g.:
urldecode.exe "%C4%9B%C5%A1%C4%8D%C5%99%C5%BE%C3%BD%C3%A1%C3%AD%C3%A9%C5%AF%C3%BA"
If the data comes from another command's output, you can use for /f to capture it in a variable first, and then pass the latter.
If you do need to call powershell.exe, PowerShell's CLI, after all:
There's not much you can do in terms of optimizing performance:
Add -NoProfile, as suggested.
Pass the input data as an argument.
Avoid unnecessary calls such as Write-Host and rely on PowerShell's implicit output behavior instead.[1]
powershell.exe -NoProfile -c "Add-Type -AssemblyName System.Web;[System.Web.HttpUtility]::UrlDecode('%C4%9B%C5%A1%C4%8D%C5%99%C5%BE%C3%BD%C3%A1%C3%AD%C3%A9%C5%AF%C3%BA')"
[1] Optional reading: How do I output machine-parseable data from a PowerShell CLI call?
Note: The sample commands are assumed to be run from cmd.exe / outside PowerShell.
PowerShell's CLI only supports text as output, not also raw byte data.
In order to output data for later programmatic processing, you may have to explicitly ensure that what is output is machine-parseable rather than something that is meant for display only.
There are two basic choices:
Rely on PowerShell's default output formatting for outputting what are strings (text) to begin with, as well as for numbers - though for fractional and very large or small non-integer numbers additional effort may be required.
Explicitly use a structured, text-based data format, such as CSV or Json, to represent complex objects.
Rely on PowerShell's default output formatting:
If the output data is itself text (strings), no extra effort is needed. This applies to your case, and therefore simply implicitly outputting the string returned from the [System.Web.HttpUtility]::UrlDecode() call is sufficient:
# A simple example that outputs a 512-character string; note that
# printing to the _console_ (terminal) will insert line breaks for
# *display*, but the string data itself does _not_ contain any
# (other than a single _trailing_ one), irrespective of the
# console window width:
powershell -noprofile -c "'x' * 512"
If the output data comprises numbers, you may have to apply explicit, culture-invariant formatting if your code must run with different cultures in effect:
True integer types do not require special handling, as their string representation is in effect culture-neutral.
However, fractional numbers ([double], [decimal]) do use the current culture's decimal mark, which later processing may not expect:
# With, say, culture fr-FR (French) in effect, this outputs
# "1,2", i.e. uses a *comma* as the decimal mark.
powershell -noprofile -c "1.2"
# Simple workaround: Let PowerShell *stringify* the number explicitly
# in an *expandable* (interpolating) string, which uses
# the *invariant culture* for formatting, where the decimal
# mark is *always* "." (dot).
# The following outputs "1.2", irrespective of what culture is in effect.
powershell -noprofile -c " $num=1.2; \"$num\" "
Finally, very large and very small [double] values can result in exponential notation being output (e.g., 5.1E-07 for 0.00000051); to avoid that, explicit number formatting is required, which can be done via the .ToString() method:
# The following outputs "0.000051" in all cultures, as intended.
powershell -noprofile -c "$n=0.000051; $n.ToString('F6', [cultureinfo]::InvariantCulture)"
More work is needed if you want to output representations of complex objects in machine-parseable form, as discussed in the next section.
Relying on PowerShell's default output formatting is not an option in this case, because implicit output (and equivalent explicit Write-Output calls) causes the CLI to apply for-display-only formatting, which is meaningful to the human observer but cannot be robustly parsed.
# Produces output helpful to the *human observer*, but isn't
# designed for *parsing*.
# `Get-ChildItem` outputs [System.IO.FileSystemInfo] objects.
powershell -noprofile -c "Get-ChildItem /"
Note that use of Write-Host is not an alternative: Write-Host fundamentally isn't designed for data output, and the textual representations it creates for complex objects are typically not even meaningful to the human observer - see this answer for more information.
Use a structured, text-based data format, such as CSV or Json:
Note:
Hypothetically, the simplest approach is to use the CLI's -OutputFormat Xml option, which serializes the output using the XML-based CLIXML format PowerShell itself uses for remoting and background jobs - see this answer.
However, this format is only natively understood by PowerShell itself, and for third-party applications to parse it they'd have to be .NET-based and use the PowerShell SDK.
Also, this format is automatically used for both serialization and deserialization if you call another PowerShell instance from PowerShell, with the command specified as a script block ({ ... }) - see this answer. However, there is rarely a need to call the PowerShell CLI from PowerShell itself, and direct invocation of PowerShell code and scripts provides full type fidelity as well as better performance.
Finally, note that all serialization formats, including CSV and JSON discussed below, have limits with respect to faithfully representing all aspects of the data, though -OutputFormat Xml comes closest.
PowerShell comes with cmdlets such as ConvertTo-Csv and ConvertTo-Json, which make it easy to convert output to the structured CSV and JSON formats.
Using a Get-Item call to get information about PowerShell's installation directory ($PSHOME) as an example; Get-Item outputs a System.IO.DirectoryInfo instance in this case:
Use of ConvertTo-Csv:
C:\>powershell -noprofile -c "Get-Item $PSHOME | ConvertTo-Csv -NoTypeInformation"
"PSPath","PSParentPath","PSChildName","PSDrive","PSProvider","PSIsContainer","Mode","BaseName","Target","LinkType","Name","FullName","Parent","Exists","Root","Extension","CreationTime","CreationTimeUtc","LastAccessTime","LastAccessTimeUtc","LastWriteTime","LastWriteTimeUtc","Attributes"
"Microsoft.PowerShell.Core\FileSystem::C:\Windows\System32\WindowsPowerShell\v1.0","Microsoft.PowerShell.Core\FileSystem::C:\Windows\System32\WindowsPowerShell","v1.0","C","Microsoft.PowerShell.Core\FileSystem","True","d-----","v1.0","System.Collections.Generic.List`1[System.String]",,"v1.0","C:\Windows\System32\WindowsPowerShell\v1.0","WindowsPowerShell","True","C:\",".0","12/7/2019 4:14:52 AM","12/7/2019 9:14:52 AM","3/14/2021 10:33:10 AM","3/14/2021 2:33:10 PM","11/6/2020 3:52:41 AM","11/6/2020 8:52:41 AM","Directory"
Note: -NoTypeInformation is no longer needed in PowerShell (Core) 7+
Using ConvertTo-Json:
C:\>powershell -noprofile -c "Get-Item $PSHOME | ConvertTo-Json -Depth 1"
{
    "Name": "v1.0",
    "FullName": "C:\\Windows\\System32\\WindowsPowerShell\\v1.0",
    "Parent": {
        "Name": "WindowsPowerShell",
        "FullName": "C:\\Windows\\System32\\WindowsPowerShell",
        "Parent": "System32",
        "Exists": true,
        "Root": "C:\\",
        "Extension": "",
        "CreationTime": "\/Date(1575710092565)\/",
        "CreationTimeUtc": "\/Date(1575710092565)\/",
        "LastAccessTime": "\/Date(1615733476841)\/",
        "LastAccessTimeUtc": "\/Date(1615733476841)\/",
        "LastWriteTime": "\/Date(1575710092565)\/",
        "LastWriteTimeUtc": "\/Date(1575710092565)\/",
        "Attributes": 16
    },
    "Exists": true
    // ...
}
Since JSON is a hierarchical data format, the serialization depth must be limited with -Depth in order to prevent "runaway" serialization when serializing arbitrary .NET types; this isn't necessary for [pscustomobject] and [hashtable] object graphs composed of primitive .NET types only.
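For instance, a flat [pscustomobject] composed of primitive values only serializes fully without -Depth; this hypothetical example outputs a two-property JSON object:
C:\>powershell -noprofile -c "[pscustomobject] @{ Name = 'v1.0'; Exists = $true } | ConvertTo-Json"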

Powershell Pipeline data to external console application

I have a console application which can take standard input. It buffers up the data until the execute command, at which point it executes it all, and sends the output to standard output.
At the moment, I am running this application from PowerShell, piping commands into it, and then parsing the output. The data piped in is relatively small; however, this application is being called about 1000 times. Each time it is executed, it has to load and create network connections. I am wondering whether it might be more efficient to pipeline all the commands into a single instantiation of the console application.
I have tried this by putting all the PowerShell script that manufactures the standard input into a function, then piping that function's output to the console application. This seems to work at first, but you eventually realise that it buffers up all the data in PowerShell until the function has finished, and only then sends it to the console application's StdIn. You can see this because I have a whole load of Write-Host statements that flash by, and only then do you see the output.
e.g.
Function Run-Command1
{
    Write-Host "Run-Command1"
    "GET nethost xxxx COLS id,name"
    "EXEC"
}
Function Run-Command2
{
    Write-Host "Run-Command2"
    "GET nethost yyyy COLS id,name"
    "GET users yyyy COLS id,name"
    "EXEC"
}
...
Function Run-CommandX
{
    ...
}
Previously, I would use this as:
Run-Command1 | netapp.exe -connect QQQQ -U user -P password
Run-Command2 | netapp.exe -connect QQQQ -U user -P password
...
Run-CommandX | netapp.exe -connect QQQQ -U user -P password
But now I would like to do:
Function Run-Commands
{
    Run-Command1
    Run-Command2
    ...
    Run-CommandX
}
Run-Commands | netapp.exe -connect QQQQ -U user -P password
Ideally, I would like the Powershell pipeline behaviour to be extended to an external application. Is this possible?
I would like the Powershell pipeline behaviour to be extended to an external application.
I have a whole load of Write-Host statements that flash by, and only then do you see the output.
Tip of the hat to marsze.
PowerShell [Core] v6+ performs no buffering at all, and sends (stringified) output as it is being produced by a command to an external program, in the same manner that output is streamed between PowerShell commands.[1]
PowerShell's legacy edition (versions up to 5.1), Windows PowerShell, buffers in that it collects all output from a command first, before sending it (in stringified form) to an external program.
marsze's helpful answer shows a workaround based on direct use of .NET APIs.
However, I think even Windows PowerShell's behavior isn't the problem here: Your Run-Commands function executes very quickly - given that the functions it calls merely output string literals - and the resulting array of lines is then sent all at once to netapp.exe - and further processing, including when to produce output, is then up to netapp.exe. In PowerShell [Core] v6+, with PowerShell-side buffering out of the picture, the individual Run-Command<n> functions' output would be sent to netapp.exe ever so slightly earlier, but I wouldn't expect that to make a difference.
The upshot is that unless netapp.exe offers a way to adjust its input and output buffering, you won't be able to control the timing of its input processing and output production.
How PowerShell sends objects to an external program (native utility) via the pipeline:
It sends a stringified representation of each object:
in PowerShell [Core] v6+: as the object becomes available.
in Windows PowerShell: after having collected all output objects in memory first.
In other words: on the PowerShell side, from v6 onward, there is no buffering.[1]
However, receiving external programs typically do buffer the stdin (standard input) data they receive via the pipeline[2].
Similarly, external programs typically do buffer their stdout (standard output) streams (but PowerShell performs no additional buffering before passing the output on, such as to the terminal (console)).
PowerShell has no control over this behavior; either the external program itself offers an option to adjust buffering or, in limited cases on Linux, you can call the external program via the stdbuf utility.
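For instance, on Linux you could wrap the external program in stdbuf (a GNU coreutils utility; someUtility is a placeholder) to request line-buffered stdout:
# Ask for line-buffered stdout, so that PowerShell receives each line
# as soon as it is produced:
stdbuf -oL someUtility | ForEach-Object { "received: $_" }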
Optional reading: How PowerShell stringifies objects when piping to external programs:
PowerShell, as of v7.1, knows only text when communicating with external programs; that is, data sent to such programs is converted to text, and output from such programs is interpreted as text - even though the underlying system IPC features are simply byte conduits.
The UTF-16-based .NET strings PowerShell uses are converted to byte streams for external programs based on the character encoding specified in the $OutputEncoding preference variable, which, regrettably, defaults to ASCII(!) in Windows PowerShell, and now sensibly to (BOM-less) UTF-8 in PowerShell [Core] v6+.
In other words: The encoding specified via $OutputEncoding must match the character encoding that the external program expects.
Conversely, it is the encoding specified in [Console]::OutputEncoding that determines how PowerShell interprets text received from an external program, i.e. how it converts the bytes received to .NET strings, line by line, with newlines stripped (which, when captured in a variable, amounts to either a single string, if only one line was output, or an array of strings).
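A small illustration of that line-by-line decoding, using cmd /c as a stand-in external program:
# Two output lines are captured as a 2-element array of strings,
# with the trailing newlines stripped:
$captured = cmd /c 'echo one& echo two'
$captured.Count   # -> 2
$captured[1]      # -> 'two'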
The for-display representations you see in the PowerShell console (terminal) are also what is sent to external programs via the pipeline, as lines of text, specifically:
If an object (already) is a string (or [char] instance), PowerShell sends it as-is to the pipe, but with a platform-appropriate newline invariably appended.
That is, a CRLF newline is appended on Windows, and a LF-only newline on Unix-like platforms.
This behavior can be problematic, as there are situations where you do not want that, and there's no way to prevent it - see GitHub issue #5974, GitHub issue #13579, and this answer for a workaround.
If an object is, loosely speaking, a primitive type - something that is conceptually a single value, notably the various number types - it is stringified in a culture-sensitive manner, where available[3], and a platform-appropriate newline is again invariably appended.
E.g., with a French culture in effect (as reflected in Get-Culture), the decimal fraction 1.2 - which PowerShell parses as a [double] value - is sent as 1,2<newline>.
Note that [bool] instances are not culture-sensitive and are always converted to strings True or False.
All other (complex) types are subject to PowerShell's rich for-display output formatting, and whatever you would see in the terminal (console) is also what is sent to external programs - which not only again potentially contains culture-sensitive representations, but is generally problematic in that these representations are designed for the human observer, not for programmatic processing.
The upshot:
Beware encoding problems - make sure $OutputEncoding and [Console]::OutputEncoding are set correctly.
To avoid unexpected culture-sensitivity and unexpected for-display formatting, it is best to deliberately construct the string representation you want to send.
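A sketch of the latter point ($myObjects and its Id / Value properties are hypothetical):
# Build the exact lines to send, using invariant-culture formatting
# for the fractional Value property:
$lines = $myObjects | ForEach-Object {
    '{0},{1}' -f $_.Id, $_.Value.ToString([cultureinfo]::InvariantCulture)
}
$lines | netapp.exe -connect QQQQ -U user -P password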
[1] By default; however, you can explicitly request buffering - expressed as an object count - via the common -OutBuffer parameter.
[2] On recent macOS and Linux platforms, the stdin buffer size is 64KB. On Unix-like platforms, utilities typically switch to line-buffering in interactive invocations, i.e. when the stream in question is connected to a terminal.
[3] The behavior is delegated to the .ToString() method of a type at hand, i.e. whether or not that method outputs a culture-sensitive representation.
EDIT: As @mklement0 pointed out, this is different in PowerShell Core.
In PowerShell 5.1 (and lower), I think you would have to manually write each pipeline item to the external application's input stream.
Here's an attempt to build a function for that:
function Invoke-Pipeline {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory, Position = 0)]
        [string]$FileName,
        [Parameter(Position = 1)]
        [string[]]$ArgumentList,
        [int]$TimeoutMilliseconds = -1,
        [Parameter(ValueFromPipeline)]
        $InputObject
    )
    begin {
        # Start the external program with stdin and stdout redirected.
        $process = [System.Diagnostics.Process]::Start((New-Object System.Diagnostics.ProcessStartInfo -Property @{
            FileName               = $FileName
            Arguments              = $ArgumentList
            UseShellExecute        = $false
            RedirectStandardInput  = $true
            RedirectStandardOutput = $true
        }))
        # Collect stdout lines asynchronously, as the program produces them.
        $output = [System.Collections.Concurrent.ConcurrentQueue[string]]::new()
        $event = Register-ObjectEvent -InputObject $process -EventName 'OutputDataReceived' -Action {
            $Event.MessageData.TryAdd($EventArgs.Data)
        } -MessageData $output
        $process.BeginOutputReadLine()
    }
    process {
        # Send each pipeline item to the program's stdin ...
        $process.StandardInput.WriteLine($InputObject)
        # ... then wait for at least one line of output, and relay
        # everything that has been queued so far.
        [string]$line = ""
        while (-not ($output.TryDequeue([ref]$line))) {
            Start-Sleep -Milliseconds 1
        }
        do {
            $line
        } while ($output.TryDequeue([ref]$line))
    }
    end {
        if ($TimeoutMilliseconds -lt 0) {
            # The parameterless overload returns void and only returns
            # once the process has exited.
            $process.WaitForExit()
            $exited = $true
        }
        else {
            $exited = $process.WaitForExit($TimeoutMilliseconds)
        }
        if ($exited) {
            $process.Close()
        }
        else {
            try { $process.Kill() } catch {}
        }
    }
}
Run-Commands | Invoke-Pipeline netapp.exe "-connect QQQQ -U user -P password"
The problem is that there is no perfect solution, because, by definition, you cannot know when the external program will write something to its output stream, or how much.
Note: This function doesn't redirect the error stream. The approach would be the same though.
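A hedged sketch of that extension (untested): mirror the stdout handling for stderr inside the begin block:
# Additionally set, in the ProcessStartInfo property table:
#   RedirectStandardError = $true
# Then collect stderr lines the same way as stdout lines:
$errOutput = [System.Collections.Concurrent.ConcurrentQueue[string]]::new()
Register-ObjectEvent -InputObject $process -EventName 'ErrorDataReceived' -Action {
    $Event.MessageData.TryAdd($EventArgs.Data)
} -MessageData $errOutput | Out-Null
$process.BeginErrorReadLine()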

How to expand file content with powershell

I want to do this:
$content = get-content "test.html"
$template = get-content "template.html"
$template | out-file "out.html"
where template.html contains
<html>
<head>
</head>
<body>
$content
</body>
</html>
and test.html contains:
<h1>Test Expand</h1>
<div>Hello</div>
I get weird characters as the first 2 characters of out.html:
��
and the content is not expanded.
How can I fix this?
To complement Mathias R. Jessen's helpful answer with a solution that:
is more efficient.
ensures that the input files are read as UTF-8, even if they don't have a (pseudo-)BOM (byte-order mark).
avoids the "weird character" problem altogether by writing a UTF-8-encoded output file without that pseudo-BOM.
# Explicitly read the input files as UTF-8, as a whole.
$content = get-content -raw -encoding utf8 test.html
$template = get-content -raw -encoding utf8 template.html
# Write to output file using UTF-8 encoding *without a BOM*.
[IO.File]::WriteAllText(
    "$PWD/out.html",
    $ExecutionContext.InvokeCommand.ExpandString($template)
)
get-content -raw (PSv3+) reads the files in as a whole, into a single string (instead of an array of strings, line by line), which, while more memory-intensive, is faster. With HTML files, memory usage shouldn't be a concern.
An additional advantage of reading the files in full is that if the template were to contain multi-line subexpressions ($(...)), the expansion would still function correctly.
get-content -encoding utf8 ensures that the input files are interpreted as using character encoding UTF-8, as is typical in the web world nowadays.
This is crucial, given that UTF-8-encoded HTML files normally do not have the 3-byte pseudo-BOM that PowerShell needs in order to correctly identify a file as UTF-8-encoded (see below).
A single $ExecutionContext.InvokeCommand.ExpandString() call is then sufficient to perform the template expansion.
Out-File -Encoding utf8 would invariably create a file with the pseudo-BOM, which is undesired.
Instead, [IO.File]::WriteAllText() is used, taking advantage of the fact that the .NET Framework by default creates UTF-8-encoded files without the BOM.
Note the use of $PWD/ before out.html, which is needed to ensure that the file gets written in PowerShell's current location (directory); unfortunately, what the .NET Framework considers the current directory is not kept in sync with PowerShell's.
Finally, the obligatory security warning: use this expansion technique only on input that you trust, given that arbitrary embedded commands may get executed.
Optional background information
PowerShell's Out-File, > and >> use UTF-16 LE character encoding with a BOM (byte-order mark) by default (the "weird characters", as mentioned).
While Out-File -Encoding utf8 allows creating UTF-8 output files instead,
PowerShell invariably prepends a 3-byte pseudo-BOM to the output file, which some utilities, notably those with Unix heritage, have problems with - so you would still get "weird characters" (albeit different ones).
If you want a more PowerShell-like way of creating BOM-less UTF-8 files,
see this answer of mine, which defines an Out-FileUtf8NoBom function that otherwise emulates the core functionality of Out-File.
Conversely, on reading files, you must use Get-Content -Encoding utf8 to ensure that BOM-less UTF-8 files are recognized as such.
In the absence of the UTF-8 pseudo-BOM, Get-Content assumes that the file uses the single-byte, extended-ASCII encoding specified by the system's legacy codepage (e.g., Windows-1252 on English-language systems, an encoding that PowerShell calls Default).
Note that while Windows-only editors such as Notepad create UTF-8 files with the pseudo-BOM (if you explicitly choose to save as UTF-8; default is the legacy codepage encoding, "ANSI"), increasingly popular cross-platform editors such as Visual Studio Code, Atom, and Sublime Text by default do not use the pseudo-BOM when they create files.
For the "weird characters", they're probably BOMs (Byte-order marks). Specify the output encoding explicitly with the -Encoding parameter when using Out-File, for example:
$Template |Out-File out.html -Encoding UTF8
As for the string expansion: you need to explicitly tell PowerShell to perform it:
$Template = $Template | ForEach-Object {
    $ExecutionContext.InvokeCommand.ExpandString($_)
}
$Template | Out-File out.html -Encoding UTF8