Avoid unexpected characters in PowerShell stream-to-file output

I would expect the following powershell command:
echo 'abc' > /test.txt
to fill the file /test.txt with exactly 3 bytes: 0x61, 0x62, 0x63.
If I inspect the file, I see that its bytes are (all values are hex):
ff fe 61 00 62 00 63 00 0d 00 0a 00
Regardless of what output I try to stream into a file, the following steps apply in the given order, transforming the resulting file contents:
Append \r\n (0d 0a) to the end of the output
Insert a null byte (00) after every character of the output
Prepend ff fe to the entire output
Transformation #1 is relatively tolerable, but the other two transformations are a real pain.
How can I stream content more precisely (without these transformations) using powershell?
Thanks!

Try this:
'abc' | out-file -filepath .\test.txt -encoding ascii -nonewline
That should get you 3 bytes instead of 12:
61 62 63
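If you need byte-for-byte control, another option is to bypass PowerShell's redirection entirely and write through .NET. A minimal sketch (the target path is illustrative; note that .NET resolves relative paths against the process working directory, so a full path is safest):
[System.IO.File]::WriteAllText('C:\test.txt', 'abc', [System.Text.Encoding]::ASCII)  # ASCII has no BOM, and no newline is appended
[System.IO.File]::WriteAllBytes('C:\test.txt', [byte[]](0x61, 0x62, 0x63))           # or specify the exact bytes outright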

Related

PowerShell Format-Hex does not show end of line. Why?

I cannot see any end of line byte
echo "hello" | Format-Hex -Raw -Encoding Ascii
is there a way to show them?
Edit: I also have a file that shows the same behaviour, and this one contains multiple lines, as confirmed by both cat and notepad.
PS C:\dev\cur CMR-27473_AMI_not_stopping_in_ecat_fault 97984 > cat .\x.txt
helo
helo2
PS C:\dev\cur CMR-27473_AMI_not_stopping_in_ecat_fault 97984 > Get-Content .\x.txt | Format-Hex -Raw
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6F helo
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6F 32 helo2
I do see the two records. But I want to see the end of line characters instead, that is, the raw bytes content.
If you mean newline, there isn't one in the source string. Thus, Format-Hex won't show one.
Windows uses the CR LF sequence (0x0D, 0x0A) for newline. To see the control characters, append a newline to the string, like so:
"hello"+[environment]::newline | Format-Hex -Raw -Encoding Ascii
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6C 6F 0D 0A hello..
One can also use Powershell's backtick escape sequence: "hello`r`n" for the same effect as appending [Environment]::NewLine, though only the latter is platform-aware.
Addendum as per the comment and edit:
Powershell's Get-Content is trying to be smart. In most of the use cases[citation needed], data read from text files does not need to include the newline characters. Get-Content will populate an array and each line read from the file will be in its own element. What use would a newline be?
When output is redirected to a file, PowerShell is trying to be smart again. In most of the use cases[citation needed], adding text to a text file means adding new lines of data, not appending to an existing line. There's actually a separate switch for preventing the trailing newline: Add-Content -NoNewLine.
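For example (file name illustrative):
Add-Content -Path .\log.txt -Value 'line1'              # default: appends the value plus a newline
Add-Content -Path .\log.txt -Value 'partial' -NoNewline # appends the value only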
What's more, high-level languages do not need a specific string termination character: when one has a string object, as in modern languages, the length of the string is stored as an attribute of the string object.
In low-level languages, there is no concept of a string; it's just a bunch of characters stuffed together. How, then, would one know where a "string" begins and ends? Pascal's approach is to allocate a byte at the beginning to hold the actual string length. C uses null-terminated strings. In DOS, assembly programs used dollar-terminated strings.
To complement vonPryz's helpful answer:
tl;dr:
Format-Hex .\x.txt
is the only way to inspect a file's raw byte content in PowerShell; i.e., you need to pass the input file path as a direct argument (to the implied -Path parameter).
Once the pipeline is involved, any strings you're dealing with are by definition .NET string objects, which are inherently UTF-16-encoded.
echo "hello", which is really Write-Output "hello", given that echo is a built-in alias for Write-Output, writes a single string object to the pipeline, as-is - and given that it has no embedded newline, Format-Hex doesn't show one.
For more, read on.
Generally, PowerShell has no concept of sending raw data through a pipeline: you're always dealing with instances of .NET types (objects).
Therefore, when Format-Hex receives pipeline input, it never sees raw byte streams; it operates on .NET strings, which are inherently UTF-16 ("Unicode") strings.
It is only then that the -Encoding parameter applies: it re-encodes the .NET strings on output.
By default, the output encoding is ASCII in Windows PowerShell, and UTF-8 in PowerShell Core.
Note: In Windows PowerShell, this means that by default characters outside the 7-bit ASCII range are transcoded in a "lossy" fashion to the literal ? character (whose Unicode code point and byte value is 0x3F).
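For instance, in Windows PowerShell you'd expect something along these lines (é is U+00E9, outside the 7-bit ASCII range):
'é' | Format-Hex                     # default (ASCII) encoding: rendered as 3F, i.e. '?'
'é' | Format-Hex -Encoding Unicode   # UTF-16LE: rendered as E9 00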
The -Raw switch only makes sense in combination with [int] (System.Int32)-typed input in Windows PowerShell v5.1 and is obsolete in PowerShell Core, where it has no effect whatsoever.[1]
echo is a built-in alias for the Write-Output cmdlet, and it accepts objects to write to the pipeline.
In your case, that object is a single-line string (an object of type [string] (System.String)), which, as stated, has no embedded newline sequence.
As an aside: PowerShell implicitly outputs anything that isn't captured (assigned to a variable or redirected elsewhere), so your command can be written more idiomatically as:
"hello" | Format-Hex
Similarly, cat is a built-in alias for the Get-Content cmdlet, which reads a text file's content as an array of lines, i.e., into a string array whose elements do not end in a newline.
It is the array elements that are written to the pipeline, one by one, and Format-Hex renders the bytes of each separately - but, again, without any newlines, because the input objects (array elements representing lines without a trailing newline) do not contain any.
The only way to see newlines is to read the file as a whole, which is what the - somewhat confusingly named - -Raw switch does:
Get-Content -Raw .\x.txt | Format-Hex
While this now does reflect the actual newlines present in the file, note that it is not a raw byte representation of the file, for the reasons mentioned.
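To make the contrast concrete (using the x.txt from the question):
Format-Hex .\x.txt                     # the file's true bytes on disk, including any BOM and 0D 0A sequences
Get-Content -Raw .\x.txt | Format-Hex  # newlines now visible, but the bytes reflect Format-Hex's output encoding, not the file's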
[1] -Raw's purpose up to v5.1 was never documented, but it is now described as obsolete and having no effect.
In short: [int]-typed input was not necessarily represented by the 4 bytes it comprises - single-byte or double-byte sequences were used, if the value was small enough, in favor of more compact output; -Raw would deactivate this and output the faithful 4-byte representation.
In PS Core [v6+], you now always and invariably get the faithful byte representation, and -Raw has no effect; for the full story see this GitHub pull request.
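By way of illustration, per the behavior described above, you'd expect something like this in Windows PowerShell v5.1 (untested sketch):
1 | Format-Hex        # compact representation: a single byte, 01
1 | Format-Hex -Raw   # faithful representation: all 4 bytes of the [int], 01 00 00 00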

Is newline treated as space in Perl?

consider this perl one liner
perl -e "$\=qq{\n};$/=qq{ };while(<>){print;}" "perl.txt" > perlMod.txt
contents of perl.txt are
a b
c
contents of perlMod.txt are
a
b
c
contents of perlMod.txt in hex are
61200D0A620D0A630D0A
Note that I have specified space as the input record separator and "\n" as the output record separator. I am expecting two '0D0A's after b (62 in hex): one 0D0A is the newline after b, and the other belongs to the output record separator. Why is there only one 0D0A?
You seem to think <> will still stop reading at a linefeed even though you changed the input record separator.
Your input contains 61 20 62 0D 0A 63 or a␠b␍␊c.
The first read reads a␠.
You print a␠.
To that, $\ gets added, giving a␠␊.
Then :crlf does its translation, giving a␠␍␊.
There are no other spaces in the file, so your second read reads the rest of the file: b␍␊c.
Then :crlf does its translation, giving b␊c.
You print b␊c.
To that, $\ gets added, giving b␊c␊.
Then :crlf does its translation, giving b␍␊c␍␊.
So, altogether, you get a␠␍␊b␍␊c␍␊, or 61 20 0D 0A 62 0D 0A 63 0D 0A.

Byte Order Mark: confusing the UTF encoding

The Byte Order Mark (BOM) uses the Unicode character U+FEFF to indicate the encoding of a text file according to the following rule:
+-------------+-----------------------+
| Bytes | Encoding Form |
+-------------+-----------------------+
| 00 00 FE FF | UTF-32, big-endian |
| FF FE 00 00 | UTF-32, little-endian |
| FE FF | UTF-16, big-endian |
| FF FE | UTF-16, little-endian |
| EF BB BF | UTF-8 |
+-------------+-----------------------+
My question is: is there any combination of bytes that can make one UTF encoding be confused with another UTF encoding?
For example, if I have a UTF-16 big-endian encoded file without a BOM containing the characters U+EFBB and U+BF40 (bytes EF BB BF 40), can it be confused with a UTF-8 encoded file with a BOM followed by the ASCII character @ (0x40)?
Sure: without knowing the encoding, even a run of 00 bytes is ambiguous, because it decodes to a different number of U+0000 characters in each encoding:
00 00 00 00   UTF-8    U+0000 U+0000 U+0000 U+0000
00 00 00 00   UTF-16   U+0000 U+0000
00 00 00 00   UTF-32   U+0000
BTW: bytes that merely look like a byte order mark cannot be used to reliably determine the encoding of a text file. In general, detecting the encoding is an unsolvable problem, and guessing wrong means data loss.
The BOM is designed to reveal the byte order when the code unit size is already known; that is why there is no U+FFFE code point. There is no further restriction on content, so there can be some overlapping byte sequences (@TomBlodget has an example of a "degenerate" case).
A BOM in UTF-8 is not really needed, but it should be preserved in order to allow a perfect round-trip conversion from other Unicode encodings. Windows started using it to distinguish UTF-8 from other encodings (especially non-Unicode ones), but it is not 100% reliable.
The bytes C0 and C1 are never valid in UTF-8, along with various other sequences (the leading bits of the first byte define the length of the sequence, and exactly that many bytes with the "continuation prefix" 0b10 must follow). So it is usually easy to tell whether a string is UTF-8, unless it is too short or "degenerate".
UTF-32 has valid values only from 0 to U+10FFFF, which can be used to distinguish it from UTF-16 (again, "degenerate" and short strings are not discriminable; on the other hand, 00 00 sequences are very common in UTF-32 and rare in normal UTF-16 text, except possibly at the end).
Control characters and private-use characters should not appear in "public" Unicode text (unless the parties have agreed on a protocol, which does not seem to be the case in this question).
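As a practical illustration of the table above, here is a minimal PowerShell sketch that sniffs a file's leading bytes for a BOM (file name illustrative). The order of tests matters, since FF FE is a prefix of the UTF-32 LE pattern FF FE 00 00; and, as both answers stress, a match is only a hint, not proof:
$b = [System.IO.File]::ReadAllBytes('.\sample.txt')  # fine for a sketch; stream the file if it is large
if     ($b.Length -ge 4 -and $b[0] -eq 0x00 -and $b[1] -eq 0x00 -and $b[2] -eq 0xFE -and $b[3] -eq 0xFF) { 'UTF-32 BE' }
elseif ($b.Length -ge 4 -and $b[0] -eq 0xFF -and $b[1] -eq 0xFE -and $b[2] -eq 0x00 -and $b[3] -eq 0x00) { 'UTF-32 LE' }
elseif ($b.Length -ge 3 -and $b[0] -eq 0xEF -and $b[1] -eq 0xBB -and $b[2] -eq 0xBF) { 'UTF-8' }
elseif ($b.Length -ge 2 -and $b[0] -eq 0xFE -and $b[1] -eq 0xFF) { 'UTF-16 BE' }
elseif ($b.Length -ge 2 -and $b[0] -eq 0xFF -and $b[1] -eq 0xFE) { 'UTF-16 LE' }
else { 'no BOM (encoding undetermined)' }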

How can I detect the codepage of a stream of text, 2 bytes per character? It's Polish

How can I detect the codepage of a stream of text, 2 bytes per character? It's Polish. For normal English characters, a 0x00 byte is simply added to the ANSI code; for the special Polish characters, the two bytes have a special meaning. There is no file header, just a byte stream like this.
Sample here
string: Połączenia
bytes: 50 00/6f 00/42 01/05 01/63 00/7a 00/65 00/6e 00/69 00/61 00
I think it's not Unicode, because 0x4201 in Unicode is a Chinese character, not a Polish one.
Can anyone help me? Thanks very much!
It's UTF-16, little-endian. The byte pair for ł (U+0142) appears in your stream as 42 01, low byte first, which is UTF-16LE:
$ echo -n "Połączenia" | iconv -f UTF8 -t UTF16LE | hexdump -C
00000000  50 00 6f 00 42 01 05 01  63 00 7a 00 65 00 6e 00  |P.o.B...c.z.e.n.|
00000010  69 00 61 00                                       |i.a.|
(The pair 42 01 only looks like the Chinese character U+4201 if you read the two bytes in big-endian order.)

Stream-process UTF-16 file with BOM and Unix line endings in Windows perl

I need to stream-process using perl a 1Gb text file encoded in UTF-16 little-endian with unix-style endings (i.e. 0x000A only without 0x000D in the stream) and LE BOM in the beginning. File is processed on Windows (Unix solutions are needed also). By stream-process I mean using while (<>), line-by-line reading and writing.
Would be nice to have a command line one-liner like:perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt
Hex dump of input for testing (two lines: "a" and "b" letters on each):
FF FE 61 00 0A 00 62 00 0A 00
processing like s/b/c/g should give an output ("b" replaced with "c"):
FF FE 61 00 0A 00 63 00 0A 00
P.S. Right now, with all my trials, either there's a problem with CRLF output (0D 0A bytes are output, producing an incorrect Unicode symbol, and I need only 0A00 without 0D00 to preserve the Unix style), or every new line switches LE/BE, i.e. the same "a" is 6100 on the odd lines and 0061 on the even lines in the output.
The best I've come up with is this:
perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/b/c/g;" <infile.txt >outfile.txt
But note that I had to use <infile.txt instead of infile.txt so that the file would be on STDIN. Theoretically, the open pragma should control the encoding used by the magic ARGV filehandle, but I can't get it to work correctly in this case.
The difference between <infile.txt and infile.txt is in how and when the files are opened. With <infile.txt, the file is connected to standard input, and opened before Perl begins running. When you binmode STDIN in a BEGIN block, the file is already open, and you can change the encoding.
When you use infile.txt, the filename is passed as a command line argument and placed in the @ARGV array. When the BEGIN block executes, the file is not open yet, so you can't set its encoding. Theoretically, you ought to be able to say:
use open qw(:std IO :raw:encoding(UTF-16LE));
and have the magic <ARGV> processing apply the right encoding. But I haven't been able to get that to work right in this case.
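Since the file is processed on Windows, a PowerShell alternative may also help; the .NET stream classes give explicit control over both the encoding and the newline string. A minimal sketch (file names and the b-to-c substitution taken from the question; not a tested drop-in for the one-liner):
# [Text.Encoding]::Unicode is UTF-16LE; StreamWriter emits its BOM automatically
$reader = [System.IO.StreamReader]::new('infile.txt', [System.Text.Encoding]::Unicode)
$writer = [System.IO.StreamWriter]::new('outfile.txt', $false, [System.Text.Encoding]::Unicode)
$writer.NewLine = "`n"   # keep Unix-style 0A 00 line endings instead of the default CR LF
while ($null -ne ($line = $reader.ReadLine())) {
    $writer.WriteLine(($line -replace 'b', 'c'))
}
$reader.Close()
$writer.Close()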