Suppose you have some text like this:
foobar 42 | ff 00 00 00 00
foobaz 00 | 0a 00 0b 00 00
foobie 00 | 00 00 00 00 00
bar 00 | ab ba 00 cd 00
and you want to wrap every non-00 value on the right-hand side of the | in (), but only on lines where the left-hand side of the | contains 00. The desired result:
foobar 42 | ff 00 00 00 00
foobaz 00 | (0a) 00 (0b) 00 00
foobie 00 | 00 00 00 00 00
bar 00 | (ab) (ba) 00 (cd) 00
Is there a good way of going about this using sed, or am I trying to stretch beyond the capabilities of the language?
Here's my work so far:
s/[^0]\{2\}/(&)/g wraps your RHS values
/[^|]*00[^|]*|/ can be used as an address to a command to operate only on valid lines
The trick now is to formulate a command that executes in a portion of the pattern space.
This really isn't line oriented, which may explain why I'm having trouble getting an expression that works.
$ awk 'BEGIN{ FS=OFS="|" } $1~/ 00 /{gsub(/[^ ][^0 ]|[^0 ][^ ]/,"(&)",$2)} 1' file
foobar 42 | ff 00 00 00 00
foobaz 00 | (0a) 00 (0b) 00 00
foobie 00 | 00 00 00 00 00
bar 00 | (ab) (ba) 00 (cd) 00
In case the string you want to search for ever gets more complicated than 2 0s, here's a more generally extensible approach since it doesn't require you to write an RE that negates the string:
$ awk '
BEGIN{ FS=OFS="|" }
$1 ~ / 00 /{
split($2,a,/ /)
$2=""
for (i=2;i in a;i++)
$2 = $2 " " (a[i] == "00" ? a[i] : "(" a[i] ")")
}
1
' file
foobar 42 | ff 00 00 00 00
foobaz 00 | (0a) 00 (0b) 00 00
foobie 00 | 00 00 00 00 00
bar 00 | (ab) (ba) 00 (cd) 00
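For comparison, the same token-by-token idea can be sketched in Python. This is only an illustration of the approach above, not part of the original answer; the function name wrap_nonzero and the skip parameter are made up for the example.

```python
def wrap_nonzero(line: str, skip: str = "00") -> str:
    """Wrap every RHS token except `skip` in (), if `skip` appears on the LHS."""
    lhs, sep, rhs = line.partition("|")
    # Leave the line alone if there is no pipe or the LHS has no `skip` token.
    if not sep or skip not in lhs.split():
        return line
    tokens = [t if t == skip else f"({t})" for t in rhs.split()]
    return f"{lhs}{sep} " + " ".join(tokens)


for sample in ("foobar 42 | ff 00 00 00 00", "foobaz 00 | 0a 00 0b 00 00"):
    print(wrap_nonzero(sample))
```

Because the comparison is on whole tokens, changing skip to a longer string needs no new regular expression, which is the same benefit the split-based awk version has.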
This might work for you (GNU sed):
sed -r '/^\s*\S+\s*00/!b;s/\b([^0][^0]|0[^0]|[^0]0)\b/(&)/g' file
This skips lines that do not begin with a word followed by 00. On the remaining lines it puts parentheses around every two-character word that is not 00, that is, two non-0 characters or a 0 paired with a non-0.
Well, it seems (though I do it all the time) that piping sed to sed to sed means I didn't do it right the first time. Here's a version in a single sed call:
sed -r '/00.*\|/ { ## match lines with a zero before the pipe
### surround the trailing values with ()
##
s/(\w\w) (\w\w) (\w\w) (\w\w) (\w\w)$/(\1) (\2) (\3) (\4) (\5)/;
### replace the zeroes (00) with 00
##
s/\(00\)/00/g;
}' txt
foobar 42 | ff 00 00 00 00
foobaz 00 | (0a) 00 (0b) 00 00
foobie 00 | 00 00 00 00 00
bar 00 | (ab) (ba) 00 (cd) 00
ok!
I think awk is probably the better tool for this job, but it can be done with sed:
sed '/^[^ ]* *00 *|/{
:a
s/\(|.*[^(]\)\([0-9a-f][1-9a-f]\)/\1(\2)/
t a
:b
s/\(|.*[^(]\)\([1-9a-f][0-9a-f]\)/\1(\2)/
t b
}' data
The script looks for lines with 00 before the pipe and applies the operations only to those lines. There are two substitute commands, each wrapped in a loop. The :a and :b lines are labels. The t a and t b commands jump to the named label if a substitution has succeeded since the last input line was read or the last t command was executed. The two substitutions are almost symmetric: the first handles any number not ending in 0, the second handles any number not starting with 0, and between them they ignore 00.
Each pattern matches a pipe, then any sequence of characters not ending with an open parenthesis (, then the appropriate pair of digits, and it rewrites the match so that the number ends up inside parentheses. The loops are necessary because the g modifier doesn't start from the beginning again, and the patterns work backwards through the numbers.
Given this data file (a slightly extended version of yours):
foobar 42 | ff 00 00 00 00
foobaz 00 | 0a 00 0b 00 00
foobie 00 | 00 00 00 00 00
bar 00 | ab ba 00 cd 00
fizbie 00 | ab ba 00 cd 90
fizzbuzz 00 | ab ba 00 cd 09
the output from the script is:
foobar 42 | ff 00 00 00 00
foobaz 00 | (0a) 00 (0b) 00 00
foobie 00 | 00 00 00 00 00
bar 00 | (ab) (ba) 00 (cd) 00
fizbie 00 | (ab) (ba) 00 (cd) (90)
fizzbuzz 00 | (ab) (ba) 00 (cd) (09)
It is moderately educational to add a p after each of the substitute commands, so you can see how the substitutions work:
foobar 42 | ff 00 00 00 00
foobaz 00 | 0a 00 (0b) 00 00
foobaz 00 | (0a) 00 (0b) 00 00
foobaz 00 | (0a) 00 (0b) 00 00
foobie 00 | 00 00 00 00 00
bar 00 | ab ba 00 (cd) 00
bar 00 | ab (ba) 00 (cd) 00
bar 00 | (ab) (ba) 00 (cd) 00
bar 00 | (ab) (ba) 00 (cd) 00
fizbie 00 | ab ba 00 (cd) 90
fizbie 00 | ab (ba) 00 (cd) 90
fizbie 00 | (ab) (ba) 00 (cd) 90
fizbie 00 | (ab) (ba) 00 (cd) (90)
fizbie 00 | (ab) (ba) 00 (cd) (90)
fizzbuzz 00 | ab ba 00 cd (09)
fizzbuzz 00 | ab ba 00 (cd) (09)
fizzbuzz 00 | ab (ba) 00 (cd) (09)
fizzbuzz 00 | (ab) (ba) 00 (cd) (09)
fizzbuzz 00 | (ab) (ba) 00 (cd) (09)
Ok try this!
$ sed '/00 *|/ { h; s/|.*/|/; x; s/.*|//; s/\(0[1-9a-f]\|[1-9a-f][0-9a-f]\)/(\1)/g; H; x; s/\n//; }' yourfile.txt
the output I get is this:
foobar 42 | ff 00 00 00 00
foobaz 00 | (0a) 00 (0b) 00 00
foobie 00 | 00 00 00 00 00
bar 00 | (ab) (ba) 00 (cd) 00
Edited so that it doesn't touch lines without 00 before the |.
This question already has answers here:
Powershell - Strange WSL output string encoding
To find every line containing "-" in the output of wsl --help, these lines work:
wsl --help | Select-String -Pattern "-"
wsl --help | Select-String "-"
Now I try a more complicated pattern: "--"
wsl --help | Select-String -Pattern "--"
wsl --help | Select-String "--"
Nothing is returned, although there are lines with this pattern. Why?
updated:
wsl --help | Select-String "--" -SimpleMatch
doesn't work either
Yep, wsl outputs UTF-16LE (what Windows calls "Unicode"), so every second byte is null.
wsl --help | select -first 1 | format-hex
Label: String (System.String) <09F5DDB6>
Offset Bytes Ascii
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
------ ----------------------------------------------- -----
0000000000000000 43 00 6F 00 70 00 79 00 72 00 69 00 67 00 68 00 C o p y r i g h
0000000000000010 74 00 20 00 28 00 63 00 29 00 20 00 4D 00 69 00 t ( c ) M i
0000000000000020 63 00 72 00 6F 00 73 00 6F 00 66 00 74 00 20 00 c r o s o f t
0000000000000030 43 00 6F 00 72 00 70 00 6F 00 72 00 61 00 74 00 C o r p o r a t
0000000000000040 69 00 6F 00 6E 00 2E 00 20 00 41 00 6C 00 6C 00 i o n . A l l
0000000000000050 20 00 72 00 69 00 67 00 68 00 74 00 73 00 20 00 r i g h t s
0000000000000060 72 00 65 00 73 00 65 00 72 00 76 00 65 00 64 00 r e s e r v e d
0000000000000070 2E 00 .
"`0" means null. In powershell 7, the matches are highlighted.
wsl --help | Select-String -Pattern "-`0-" | select -first 1
--exec, -e <CommandLine>
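The effect can be reproduced outside PowerShell. The following Python sketch is only an illustration: it assumes the UTF-16LE bytes get read one byte per character, and Latin-1 is used as a stand-in for whatever single-byte decoding the console actually applies. The helper name misread_utf16le is made up.

```python
def misread_utf16le(text: str) -> str:
    """Encode as UTF-16LE, then decode one byte per character (Latin-1 stand-in)."""
    return text.encode("utf-16-le").decode("latin-1")


line = "--exec, -e <CommandLine>"
garbled = misread_utf16le(line)

print("--" in garbled)      # False: a NUL now sits between the two dashes
print("-\x00-" in garbled)  # True: the same pattern as "-`0-" above
```

Decoding the bytes as UTF-16LE instead gives back the original text, in which "--" matches as expected.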
Sample of my df is:
+------------------------------------------------------------------------------------------------------------------+
|id | binary_col |
+------------------------------------------------------------------------------------------------------------------+
| 1 | [08 01 10 0D 00 0E CC 93 01 00 00 00 01 00 00 00 00 00 00 00 80 FF BF 40 00 00 00 00 00 00 F0 3F BE 2B 00 00]|
| 2 | [08 01 10 0D 00 0E CC 93 01 00 00 00 01 00 00 00 00 00 00 00 F0 FF BF 40 00 00 00 00 00 00 F0 3F 57 66 00 00]|
| 3 | [08 01 10 0D 00 0E CC 93 01 00 00 00 01 00 00 00 00 00 00 00 C0 FF BF 40 00 00 00 00 00 00 F0 3F D5 69 00 00]|
| 4 | [08 01 10 0D 00 0E CC 93 01 00 00 00 01 00 00 00 00 00 00 00 80 FF BF 40 00 00 00 00 00 00 F0 3F 5A 60 00 00]|
+------------------------------------------------------------------------------------------------------------------+
with this schema (df.printSchema()):
|-- id: int (nullable = true)
|-- binary_col: binary (nullable = true)
I want to keep only the rows whose binary_col is [08 01 10 0D 00 0E CC 93 01 00 00 00 01 00 00 00 00 00 00 00 80 FF BF 40 00 00 00 00 00 00 F0 3F BE 2B 00 00]. (Filtering on id=1 doesn't work because there are other ids in the df.)
I've tried to cast binary to bigint to filter later like here: Spark: cast bytearray to bigint
by doing df.withColumn('casted_bin', F.conv(F.hex(F.col("binary_col")), 16, 10).cast("bigint")).show(truncate=False) but it didn't work.
How can I filter any kind of binary data type?
Note: I had asked previously here (How to filter Pyspark column with binary data type?) but it was a very simple binary data and the answer generated the binary from a numeric value while now I don't know how to generate the numeric value.
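One possible direction, sketched under assumptions rather than taken from an answer: the sample value is 36 bytes (288 bits), which cannot fit in a 64-bit bigint, so comparing hex text may be easier than comparing numbers. Spark's hex() returns uppercase hex for binary input, so the target bytes can be turned into the same form. TARGET, to_spark_hex and filter_by_bytes are hypothetical names, and the Spark part is untested here.

```python
# Target value copied from row 1 of the sample; bytes.fromhex ignores the spaces.
TARGET = bytes.fromhex(
    "08 01 10 0D 00 0E CC 93 01 00 00 00 01 00 00 00 00 00 00 00 "
    "80 FF BF 40 00 00 00 00 00 00 F0 3F BE 2B 00 00"
)


def to_spark_hex(value: bytes) -> str:
    """Uppercase hex text, the form Spark's hex() produces for binary input."""
    return value.hex().upper()


def filter_by_bytes(df, value: bytes, column: str = "binary_col"):
    """Hypothetical helper: keep rows whose binary column equals `value`."""
    from pyspark.sql import functions as F  # imported lazily; needs PySpark

    return df.filter(F.hex(F.col(column)) == to_spark_hex(value))
```

With a real DataFrame this would be called as filter_by_bytes(df, TARGET).show(truncate=False).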
This is my script I am writing.
#!/usr/bin/perl
use warnings;
open(my $infile, '<', "./file1.bin") or die "Cannot open file1.bin: $!";
binmode($infile);
open(my $outfile, '>', "./extracted data without 00's.bin") or die "Cannot create extracted data without 00's.bin: $!";
binmode($outfile);
local $/; $infile = <STDIN>;
print substr($infile, 0, 0x840, '');
$infile =~ s/\0{16}//;
print $outfile;
I'm loading a binary file in perl.
I have been able to seek and patch at certain offsets, but what I would like to do is, now be able to find any instance of "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00" (16 bytes?) and remove it from the file, but no less than 16 bytes. Anything less than that I would want to leave. In some of the files the offset where the 00's start will be at different offsets, but if I am thinking correctly, if I can just search for 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 and remove any instance of it, then it won't matter what offset the 00's are at. I would extract the data first from specific offsets, then search the file and prune 00's from it. I can already extract the specific offsets I need, I just need to open the extracted file and shave off 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
EF 39 77 5B 14 9D E9 1E 94 A9 97 F2 6D E3 68 05
6F 7B 77 BB C4 99 67 B5 C9 71 12 30 9D ED 31 B6
AB 1F 81 66 E1 DD 29 4E 71 8D 54 F5 6C C8 86 0D
5B 72 AF A8 1F 26 DD 05 AF 78 13 EF A5 E0 76 BB
8A 59 9B 20 C5 58 95 7C E0 DB 44 6A EC 7E D0 10
09 42 B1 12 65 80 B3 EC 58 1A 2F 92 B9 32 D9 07
96 DE 32 51 4B 5F 3B 50 9A D1 09 37 F4 6D 7C 01
01 4A A4 24 04 DC 83 08 17 CB 34 2C E5 87 26 C1
35 38 F4 C4 E4 78 FE FC A2 BE 99 48 C9 CA 69 90
33 87 09 A8 27 BA 91 FC 4B 77 FA AB F5 1E 4E C0 I want to leave everything from
F2 78 6E 31 7D 16 3B 53 04 8A C1 A8 4B 70 39 22 <----- here up
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <----- I want to prune everything
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 from here on
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00<---- this IS the end of the file, and
just need to prune these few rows
of 00's
Say that "F2 78 6E" from the example above is at offset 0x45000, BUT in another file the 00 00's will start at a different offset. How could I code it so the 00 00's would get pruned in any file that I am opening?
If I need to be more specific, just ask.
Seems like I would peek so far into the file until I hit a long 00 00 string, then prune any remaining lines. Does that make sense at all?
All I want to do is search the file for any instances of 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 and delete/prune/truncate it. I want to save everything but the 00's
EDIT #2
this did it:
open($infile, '<', './file1') or die "cannot open file1: $!";
binmode $infile;
open($outfile, '>', './file2') or die "cannot open file2: $!";
binmode $outfile;
local $/; $file = <$infile>;
$file =~ s/\0{16}//g;
print $outfile $file;
close ($infile);
close ($outfile);
Thank you ikegami for all your help and patience :)
There is no such thing as removing from a file. You have to either
1. copy the file without the undesired bits, or
2. read the rest of the file, seek back, print over the undesired bits, then truncate the file.
I went with option 1.
$ perl -e'
binmode STDIN;
binmode STDOUT;
local $/; $file = <STDIN>;
$file =~ s/\0{16}//;
print $file;
' <file.in >file.out
I'm loading the entire file into memory. Either option can be done in chunks, but it complicates things because your NULs could span two chunks.
In a poorly phrased update, you seem to have asked to avoid changes in the first 0x840 bytes. Two solutions:
$ perl -e'
binmode STDIN;
binmode STDOUT;
local $/; $file = <STDIN>;
substr($file, 0x840) =~ s/\0{16}//;
print $file;
' <file.in >file.out
$ perl -e'
binmode STDIN;
binmode STDOUT;
local $/; $file = <STDIN>;
print substr($file, 0, 0x840, '');
$file =~ s/\0{16}//;
print $file;
' <file.in >file.out
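The same copy-without-the-NULs idea can be sketched in Python. This is an illustration, not a translation of any one Perl snippet above: it removes every run of 16 NULs (like the /g version) and can leave a leading region untouched. The name strip_nul_runs and the protect parameter are made up for the example.

```python
import re

NUL_RUN = re.compile(rb"\x00{16}")  # exactly 16 consecutive NUL bytes


def strip_nul_runs(data: bytes, protect: int = 0) -> bytes:
    """Remove each 16-NUL run found after the first `protect` bytes."""
    head, tail = data[:protect], data[protect:]
    return head + NUL_RUN.sub(b"", tail)


payload = b"AB" + b"\x00" * 33 + b"C"
print(strip_nul_runs(payload))  # b'AB\x00C': two runs removed, one NUL left
```

Runs shorter than 16 bytes are kept, as requested in the question. For real files the bytes would come from open(path, "rb").read() and the result would be written with mode "wb"; as with the Perl version, this holds the whole file in memory.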
I'm debugging the output of a program that transmits data via TCP.
For debugging purposes i've replaced the receiving program with netcat and hexdump:
netcat -l -p 1234 | hexdump -C
That outputs all data as a nice hexdump, almost like I want. Now the data is transmitted in fixed blocks whose lengths are not multiples of 16, leading to shifted lines in the output that make spotting differences a bit difficult:
00000000 50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |P...............|
00000010 00 50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |.P..............|
00000020 00 00 50 00 00 00 00 00 00 00 00 00 00 00 00 00 |..P.............|
How do I reformat the output so that after 17 bytes a new line is started?
It should look something like this:
50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |P...............|
00 |. |
50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |P...............|
00 |. |
50 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |P...............|
00 |. |
Using hexdump's -n parameter does not work, since it will exit after reaching that number of bytes (unless there is a way to keep the netcat program running and seamlessly pipe the next bytes to a new instance of hexdump).
Also it would be great if I could use watch -d on the output to get a highlight of changes between lines.
For a hexdump without the character column:
hexdump -e '16/1 "%0.2x " "\n" 1/1 "%0.2x " "\n"'
I use this:
use strict;
use warnings;
use bytes;
my $N = $ARGV[0];
$/ = \$N;
while (<STDIN>) {
my @bytes = unpack("C*", $_);
my $clean = $_;
$clean =~ s/[[:^print:]]/./g;
print join(' ', map {sprintf("%02x", $_)} @bytes),
" |", $clean, "|\n";
}
Run it as perl scriptname.pl N where N is the number of bytes in each chunk you want.
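A variant that also wraps each record at 16 bytes, as in the layout asked for in the question, could look like the Python sketch below. It is an illustration under assumptions, not the script above: dump_records and its parameters are made-up names, and it works on a bytes object rather than a live stream.

```python
def dump_records(data: bytes, record_len: int, width: int = 16) -> list[str]:
    """Hex-dump `data` so that every record of `record_len` bytes starts a new row."""
    lines = []
    for start in range(0, len(data), record_len):
        record = data[start:start + record_len]
        for off in range(0, len(record), width):
            row = record[off:off + width]
            # Pad short rows so the |text| column stays aligned.
            hexpart = " ".join(f"{b:02x}" for b in row).ljust(width * 3 - 1)
            text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
            lines.append(f"{hexpart}  |{text.ljust(width)}|")
    return lines


block = bytes([0x50] + [0] * 16)  # one 17-byte record starting with 'P'
for line in dump_records(block * 2, record_len=17):
    print(line)
```

To use it on the netcat stream, one option is to read record_len bytes at a time from sys.stdin.buffer and print the lines for each record as it arrives.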
You can also use xxd -p to make a plain hexdump.
I'm looking for a way to extract the text characters from a binary file of 4-byte values into an array or a text file.
Let's say my input file is:
00000000 2e 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 |................|
00000010 04 00 00 00 05 00 00 00 06 00 00 00 07 00 00 00 |................|
00000020 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000070 00 00 00 00 00 00 00 00 |........|
00000078
And my desired output is:
46,1,2,3,4,5,6,7,8,9,0,0...
The output can be a TEXT file or an array.
I notice that the pack/unpack functions may help here, but I couldn't figure how to use them properly,
An example would be nice.
Use unpack:
local $/;
@_=unpack("V*", <>);
gets you an array. So as an inefficient (don't try on huge files) example:
perl -e 'local$/;print join(",",map{sprintf("%d",$_)}unpack("V*",<>))' thebinaryfile
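For readers more at home in Python, roughly the same decoding can be sketched with the struct module, where "<I" plays the role of Perl's V (unsigned 32-bit little-endian). The function name le_uint32_values is made up, and dropping a trailing partial value is an assumption made to mirror how unpack("V*") ignores leftover bytes.

```python
import struct


def le_uint32_values(data: bytes) -> list[int]:
    """Decode `data` as consecutive unsigned 32-bit little-endian integers."""
    usable = len(data) - len(data) % 4  # drop a trailing partial value, if any
    return [value for (value,) in struct.iter_unpack("<I", data[:usable])]


sample = bytes.fromhex("2e000000 01000000 02000000 03000000")
print(",".join(map(str, le_uint32_values(sample))))  # prints 46,1,2,3
```

For the real file, data would be open("thebinaryfile", "rb").read().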
The answer is dependent on what you consider an ASCII character. Anything below 128 is technically an ASCII character, but I am assuming you mean characters you normally find in a text file. In that case, try this:
#!/usr/bin/perl
use strict;
use warnings;
use bytes;
$/ = \1024; #read 1k at a time
while (<>) {
for my $char (split //) {
my $ord = ord $char;
if ($ord > 31 and $ord < 127 or $char =~ /[\r\n\t]/) {
print "$ord,"
}
}
}
od -t d4 -v <filename>