Consider this Perl one-liner:
perl -e "$\=qq{\n};$/=qq{ };while(<>){print;}" "perl.txt" > perlMod.txt
The contents of perl.txt are:
a b
c
The contents of perlMod.txt are:
a
b
c
The contents of perlMod.txt in hex are:
61200D0A620D0A630D0A
Note that I have specified a space as the input record separator and "\n" as the output record separator. I am expecting two '0D0A's after the b (62 in hex): one 0D0A for the newline after b, and the other for the output record separator. Why is there only one 0D0A?
You seem to think <> will still stop reading at a linefeed even though you changed the input record separator.
Your input contains 61 20 62 0D 0A 63 or a␠b␍␊c.
The first read reads a␠.
You print a␠.
To that, $\ gets added, giving a␠␊.
Then :crlf does its translation, giving a␠␍␊.
There are no other spaces in the file, so your second read reads the rest of the file: b␍␊c.
Then :crlf does its translation, giving b␊c.
You print b␊c.
To that, $\ gets added, giving b␊c␊.
Then :crlf does its translation, giving b␍␊c␍␊.
So, altogether, you get a␠␍␊b␍␊c␍␊, or 61 20 0D 0A 62 0D 0A 63 0D 0A.
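If you want to verify that trace, one option (my sketch, not from the original answer) is to read the same perl.txt with the :crlf layer removed and dump each space-separated record in hex; I would expect output roughly like this:
perl -e "binmode STDIN; $/ = qq{ }; while (<STDIN>) { print unpack(q{H*}, $_), qq{\n} }" < perl.txt
6120
620d0a63
The first record is a␠ (61 20) and the second is the untranslated remainder b␍␊c (62 0d 0a 63), which is exactly what the trace above assumes.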
I'm seeing some very bizarre behavior in a script that I wrote and have used for years, but which, for some reason, fails to run on one particular file.
Recognizing that the script is failing to identify a key that should be in a hash, I added some test print statements to read the keys. My normal strategy involves placing asterisks before and after the variable to detect potential hidden characters. Clearly, the keys are corrupt. Relevant code block:
foreach my $fastaRecord (@GenomeList) {
    my ($ID, $Seq) = split(/\n/, $fastaRecord, 2);
    # uncomment next line to strip everything off sequence
    # header except trailing numeric identifiers
    # $ID =~ s/.+?(\d+$)/$1/;
    $Seq =~ s/[^A-Za-z-]//g; # remove any kind of new line characters
    $RefSeqLen = length($Seq);
    $GenomeLenHash{$ID} = $RefSeqLen;
    print "$ID\n";
    print "*$ID**\n";
}
This produces the following output:
supercont3
**upercont3
Mitochondrion
**itochondrion
Chr1
**hr1
Chr2
**hr2
Chr3
**hr3
Chr4
**hr4
Normally, I'd suspect "illegal" newline characters as being involved. However, I manually replaced all newlines in the input file to try to solve the problem. What in the input file could be causing the script to behave this way? I could imagine that, despite my efforts, there is still an illegal newline after the ID variable, but then why are neither the first asterisk nor the newline after the double asterisk printed, and why is the double asterisk printed at the beginning of the line in a way that overwrites the first asterisk as well as the beginning of the variable's value?
When you see these sorts of effects, look at the data in a file or in a hexdump. The terminal is going to hide data if it interprets backspaces, carriage returns, and ANSI escape sequences.
% perl script.pl | hexdump -C
Here's a simple example. I echo a, b, carriage return, then c. My terminal sees the carriage return and moves the cursor to the beginning of the line. After that, the output continues. The c masks the a:
% echo $'ab\rc'
cb
With a hex dump, I can see the 0d that represents the carriage return:
% echo $'ab\rc' | hexdump -C
00000000 61 62 0d 63 0a |ab.c.|
00000005
Also, when you try to remove "any sort of newline" from $Seq, you might just remove vertical whitespace:
$target =~ s/\v//g;
You might also use the generalized newline, \R, to do the same:
$target =~ s/\R//g;
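In this particular case the hidden character is almost certainly a carriage return left on the end of $ID: splitting a CRLF-terminated record on /\n/ leaves the \r attached to the ID, and that \r is what sends the cursor back to the start of the line so the trailing ** overprints the leading *s. A sketch of the cleanup (my code, reusing the asker's variable names):
my ($ID, $Seq) = split(/\n/, $fastaRecord, 2);
$ID  =~ s/\v+//g;          # strip the stray CR (or any other vertical whitespace) from the key
$Seq =~ s/[^A-Za-z-]//g;   # unchanged from the question: keep only letters and gap dashes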
I would expect the following PowerShell command:
echo 'abc' > /test.txt
to fill the file /test.txt with exactly 3 bytes: 0x61, 0x62, 0x63.
If I inspect the file, I see that its bytes are (all values are hex):
ff fe 61 00 62 00 63 00 0d 00 0a 00
Regardless of what output I try to stream into a file, the following steps apply in the given order, transforming the resulting file contents:
Append \r\n (0d 0a) to the end of the output
Insert a null byte (00) after every character of the output
Prepend ff fe to the entire output
Transformation #1 is relatively tolerable, but the other two transformations are a real pain.
How can I stream content more precisely (without these transformations) using powershell?
Thanks!
Try this:
'abc' | Out-File -FilePath .\test.txt -Encoding ascii -NoNewline
It should get you 3 bytes instead of 12:
61 62 63
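To double-check the result, Format-Hex can read the file directly (more on Format-Hex further down); e.g.:
Format-Hex -Path .\test.txt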
I cannot see any end-of-line bytes with:
echo "hello" | Format-Hex -Raw -Encoding Ascii
Is there a way to show them?
Edit: I also have a file that shows the same behaviour, and this one contains multiple lines, as confirmed by both cat and notepad.
PS C:\dev\cur CMR-27473_AMI_not_stopping_in_ecat_fault 97984 > cat .\x.txt
helo
helo2
PS C:\dev\cur CMR-27473_AMI_not_stopping_in_ecat_fault 97984 > Get-Content .\x.txt | Format-Hex -Raw
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6F helo
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6F 32 helo2
I do see the two records, but I want to see the end-of-line characters as well - that is, the raw byte content.
If you mean newline, there isn't one in the source string. Thus, Format-Hex won't show one.
Windows uses the CR LF sequence (0x0D, 0x0A) for newlines. To see the control characters, append a newline to the string, like so:
"hello"+[environment]::newline | Format-Hex -Raw -Encoding Ascii
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000 68 65 6C 6C 6F 0D 0A hello..
One can also use PowerShell's backtick escape sequences - "hello`r`n" - for the same effect as appending [Environment]::NewLine, though only the latter is platform-aware.
Addendum as per the comment and edit:
PowerShell's Get-Content is trying to be smart. In most use cases[citation needed], data read from text files does not need to include the newline characters. Get-Content populates an array, and each line read from the file ends up in its own element. What use would a newline be?
When output is redirected to a file, PowerShell is trying to be smart again. In most use cases[citation needed], adding text to a text file means adding new lines of data, not appending to an existing line. There's actually a separate switch for suppressing the trailing newline: Add-Content -NoNewLine.
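For example (my sketch; the path and text are placeholders):
Add-Content -Path .\test.txt -Value ' more text' -NoNewline -Encoding Ascii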
What's more, high-level languages do not have a specific string-termination character. When one has a string object, as in modern languages, the length of the string is stored as an attribute of the string object.
In low-level languages, there is no concept of a string; it's just a bunch of characters stuffed together. How, then, would one know where a "string" begins and ends? Pascal's approach is to allocate a byte at the beginning to hold the actual length of the string data. C uses null-terminated strings. In DOS, assembly programs used dollar-terminated strings.
To complement vonPryz's helpful answer:
tl;dr:
Format-Hex .\x.txt
is the only way to inspect a file's raw byte content in PowerShell; i.e., you need to pass the input file path as a direct argument (to the implied -Path parameter).
Once the pipeline is involved, any strings you're dealing with are by definition .NET string objects, which are inherently UTF-16-encoded.
echo "hello", which is really Write-Output "hello", given that echo is a built-in alias for Write-Output, writes a single string object to the pipeline, as-is - and given that it has no embedded newline, Format-Hex doesn't show one.
For more, read on.
Generally, PowerShell has no concept of sending raw data through a pipeline: you're always dealing with instances of .NET types (objects).
Therefore, when Format-Hex receives pipeline input, it never sees raw byte streams; it operates on .NET strings, which are inherently UTF-16 ("Unicode") strings.
It is only then that the -Encoding parameter applies: it re-encodes the .NET strings on output.
By default, the output encoding is ASCII in Windows PowerShell, and UTF-8 in PowerShell Core.
Note: In Windows PowerShell, this means that by default characters outside the 7-bit ASCII range are transcoded in a "lossy" fashion to the literal ? character (whose Unicode code point and byte value is 0x3F).
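For instance (my sketch; in Windows PowerShell the é should come back as the byte 3F):
'café' | Format-Hex -Encoding Ascii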
The -Raw switch only makes sense in combination with [int] (System.Int32)-typed input in Windows PowerShell v5.1 and is obsolete in PowerShell Core, where it has no effect whatsoever.[1]
echo is a built-in alias for the Write-Output cmdlet, and it accepts objects to write to the pipeline.
In your case, that object is a single-line string (an object of type [string] (System.String)), which, as stated, has no embedded newline sequence.
As an aside: PowerShell implicitly outputs anything that isn't captured (assigned to a variable or redirected elsewhere), so your command can be written more idiomatically as:
"hello" | Format-Hex
Similarly, cat is a built-in alias for the Get-Content cmdlet, which reads a text file's content as an array of lines, i.e., into a string array whose elements do not end in a newline.
It is the array elements that are written to the pipeline, one by one, and Format-Hex renders the bytes of each separately - but, again, without any newlines, because the input objects (array elements representing lines without a trailing newline) do not contain any.
The only way to see newlines is to read the file as a whole, which is what the - somewhat confusingly named - -Raw switch does:
Get-Content -Raw .\x.txt | Format-Hex
While this now does reflect the actual newlines present in the file, note that it is not a raw byte representation of the file, for the reasons mentioned.
[1] -Raw's purpose up to v5.1 was never documented, but it is now described as obsolete and having no effect.
In short: [int]-typed input was not necessarily represented by the 4 bytes it comprises - single-byte or double-byte sequences were used, if the value was small enough, in favor of more compact output; -Raw would deactivate this and output the faithful 4-byte representation.
In PS Core [v6+], you now always and invariably get the faithful byte representation, and -Raw has no effect; for the full story see this GitHub pull request.
I want to use sed to shorten a log containing lines which begin with a date:
2017-07-26T01:01:01 236
2017-07-27T01:02:01 236
2017-07-27T01:02:51 236
2017-07-27T01:03:01 236
2017-07-28T01:01:01 236
2017-09-07T09:05:18 236
2017-09-07T10:22:10 239
(no, logrotate won't do). I know that I can use
sed -i '0,/^2017-07-27/d' filename
to delete, in place, all lines before the first one containing '2017-07-27', along with that first matching line - but if the file contains no such line (or only one such line), sed deletes everything. I'd like it to do nothing if the pattern is not found.
I also want to delete in place.
How do I prevent sed from deleting all lines if the pattern never matches?
Is there a better way of doing this in place?
Never say never.
We can use the sed hold space to collect the desired output. If we match the first line with the date, we throw away the accumulated contents of the hold space and start over with collecting the lines into the hold space. At the end of the file, we print out the contents of the hold space.
The whole file will potentially end up in the hold space. I suppose we could start printing each line directly instead of accumulating to the hold space once we find a match ... maybe. But if there is no match, we still have to accumulate the whole file in the hold space - as potentially there will be a match on the last line and we'd have to throw out everything. But I'm happy with this as it is.
I changed the example input slightly to make it easier for me to see which lines are which while developing the sed script.
bjb@rhino:~$ cat /tmp/sed.txt
2017-07-26T01:01:01 231
2017-07-27T01:02:01 232
2017-07-27T01:02:51 233
2017-07-27T01:03:01 234
2017-07-28T01:01:01 235
2017-09-07T09:05:18 236
2017-09-07T10:22:10 237
bjb@rhino:~$ sed -n ':begin H;/^2017-07-27/{s/^.*$//;x;s/^.*$//;n;b middle};$ b end;n;b begin;:middle H;$ b end;n;b middle;:end x;p;q' /tmp/sed.txt
2017-07-27T01:02:51 233
2017-07-27T01:03:01 234
2017-07-28T01:01:01 235
2017-09-07T09:05:18 236
2017-09-07T10:22:10 237
bjb@rhino:~$ sed -n ':begin H;/^fluffallaa/{s/^.*$//;x;s/^.*$//;n;b middle};$ b end;n;b begin;:middle H;$ b end;n;b middle;:end x;p;q' /tmp/sed.txt
2017-07-26T01:01:01 231
2017-07-27T01:02:01 232
2017-07-27T01:02:51 233
2017-07-27T01:03:01 234
2017-07-28T01:01:01 235
2017-09-07T09:05:18 236
2017-09-07T10:22:10 237
bjb@rhino:~$
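For readability, here is the same script spread over multiple lines with comments (my annotation of the one-liner above; GNU sed syntax, with comments kept on their own lines so they don't get swallowed by label names):
sed -n '
:begin
  # append the current line to the hold space
  H
  # first line that matches the date:
  /^2017-07-27/ {
    # empty the pattern space
    s/^.*$//
    # swap: the hold space is now empty, the discarded lines sit in the pattern space
    x
    # (try to) clear the discarded lines; the n below throws them away regardless
    s/^.*$//
    # read the next line and switch to the post-match loop
    n
    b middle
  }
  # last line reached without a match: go print everything collected
  $ b end
  n
  b begin
:middle
  # past the match: keep collecting lines in the hold space
  H
  $ b end
  n
  b middle
:end
  # move the collected lines into the pattern space, print them, and quit
  x
  p
  q
' /tmp/sed.txt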
I need to stream-process, using Perl, a 1 GB text file encoded in UTF-16 little-endian, with Unix-style line endings (i.e. 0x000A only, without 0x000D in the stream) and an LE BOM at the beginning. The file is processed on Windows (Unix solutions are needed as well). By stream-process I mean using while (<>), reading and writing line by line.
It would be nice to have a command-line one-liner like:
perl -pe "BEGIN { SOME_PREPARATION }; s/SRC/DST/g;" infile.txt > outfile.txt
Hex dump of input for testing (two lines: "a" and "b" letters on each):
FF FE 61 00 0A 00 62 00 0A 00
processing like s/b/c/g should give an output ("b" replaced with "c"):
FF FE 61 00 0A 00 63 00 0A 00
PS: Right now, with all my attempts, either there is a problem with CRLF output (0D 0A bytes are emitted, producing an incorrect Unicode character; I need only 0A 00, without 0D 00, to preserve the Unix style), or every new line switches LE/BE, i.e. the same "a" is 61 00 on odd lines and 00 61 on even lines of the output.
The best I've come up with is this:
perl -pe "BEGIN { binmode $_, ':raw:encoding(UTF-16LE)' for *STDIN, *STDOUT }; s/b/c/g;" <infile.txt >outfile.txt
But note that I had to use <infile.txt instead of infile.txt so that the file would be on STDIN. Theoretically, the open pragma should control the encoding used by the magic ARGV filehandle, but I can't get it to work correctly in this case.
The difference between <infile.txt and infile.txt is in how and when the files are opened. With <infile.txt, the file is connected to standard input, and opened before Perl begins running. When you binmode STDIN in a BEGIN block, the file is already open, and you can change the encoding.
When you use infile.txt, the filename is passed as a command-line argument and placed in the @ARGV array. When the BEGIN block executes, the file is not open yet, so you can't set its encoding. Theoretically, you ought to be able to say:
use open qw(:std IO :raw:encoding(UTF-16LE));
and have the magic <ARGV> processing apply the right encoding. But I haven't been able to get that to work right in this case.
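For completeness, if the one-liner form isn't essential, a small script with explicit filehandles sidesteps the ARGV timing issue entirely. A sketch (the file names are placeholders, and I haven't run this against a real 1 GB file):
use strict;
use warnings;

# :raw removes the default :crlf layer on Windows, so "\n" stays 0A 00 on output
open my $in,  '<:raw:encoding(UTF-16LE)', 'infile.txt'  or die "infile.txt: $!";
open my $out, '>:raw:encoding(UTF-16LE)', 'outfile.txt' or die "outfile.txt: $!";

while (my $line = <$in>) {    # $/ is "\n", which the encoding layer decodes from 0A 00
    $line =~ s/b/c/g;         # the sample substitution from the question
    print {$out} $line;       # the BOM decoded from the first line is re-encoded as FF FE
}

close $out or die "outfile.txt: $!";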