simple diff question - diff

file1.txt is
10
20
30
40
50
file2.txt is
10
20
40
60
70
80
I want output.txt to be a union of the numbers in these two files, with no duplicates.
output.txt should be
10
20
30
40
50
60
70
80

If you don't need to preserve the order within the individual files:
cat file1.txt file2.txt | sort | uniq > output.txt
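If sorted output is acceptable, sort -u does the sort and the de-duplication in one step; and if you instead want to keep first-appearance order across both files, the classic awk idiom prints each line only the first time it is seen:

```shell
# Sorted union without duplicates, equivalent to sort | uniq
sort -u file1.txt file2.txt > output.txt

# Order-preserving union: print a line only the first time it appears
awk '!seen[$0]++' file1.txt file2.txt > output.txt
```

For the sample files above both variants produce the same eight numbers.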

Related

Trim the last character of a stream

I use sed -e '$s/.$//' to trim the last character of a stream. Is it the correct way to do so? Are there other better ways to do so with other command line tools?
$ builtin printf 'a\nb\0' | sed -e '$s/.$//' | od -c -t x1 -Ax
000000 a \n b
61 0a 62
000003
EDIT: It seems that this command is not robust. The expected output is a\nb for the following example. Better methods (but not too verbose) are needed.
$ builtin printf 'a\nb\n' | sed -e '$s/.$//' | od -c -t x1 -Ax
000000 a \n \n
61 0a 0a
000003
You may use head -c -1 (note that a negative byte count is a GNU extension):
printf 'a\nb\0' | head -c -1 | od -c -t x1 -Ax
000000 a \n b
61 0a 62
000003
printf 'a\nb\n' | head -c -1 | od -c -t x1 -Ax
000000 a \n b
61 0a 62
000003
It seems you can't rely on any line-oriented tools (like sed) that automatically remove and re-add newlines.
Perl can slurp the whole stream into a string and can remove the last char:
$ printf 'a\nb\0' | perl -0777 -pe chop | od -c -t x1 -Ax
000000 a \n b
61 0a 62
000003
$ printf 'a\nb\n' | perl -0777 -pe chop | od -c -t x1 -Ax
000000 a \n b
61 0a 62
000003
The tradeoff is that you need to hold the entire stream in memory.
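If head -c -1 is unavailable (the negative count is GNU-specific), a portable sketch is to buffer the stream in a temporary file, count the bytes, and re-emit all but the last one (this assumes a non-empty stream):

```shell
# Buffer stdin, then print everything except the final byte
tmp=$(mktemp)
cat > "$tmp"
size=$(wc -c < "$tmp")
head -c "$((size - 1))" "$tmp"
rm -f "$tmp"
```

Like the Perl version, this needs the whole stream before anything can be printed, only on disk instead of in memory.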

Split results of du command by new line

I have a list of the top 20 files/folders that are taking up the most room on my hard drive. I would like to separate them into size path/to/file. Below is what I have done so far.
I am using: var=$(du -a -g /folder/ | sort -n -r | head -n 20). It returns the following:
120 /path/to/file
115 /path/to/another/file
110 /file/path/
etc.
I have tried the following code to split it up into single lines.
for i in $(echo $var | sed "s/\n/ /g")
do
echo "$i"
done
The result I would like is as follows:
120 /path/to/file,
115 /path/to/another/file,
110 /file/path/,
etc.
This however is the result I am getting:
120,
/path/to/file,
115,
/path/to/another/file,
110,
/file/path/,
etc.
I think awk will be easier; it can be combined with a pipe to the original command:
du -a -g /folder/ | sort -n -r | head -n 20 | awk '{ print $1, $2 "," }'
If you cannot create a single pipe and have to use $var:
echo "$var" | awk '{ print $1, $2 "," }'

Extract data from a binary file

I have a binary file. I want to extract all data from each $marker + $step to the next $marker (or the end of the file).
Data example:
23 40 92 34 32 09 84 39 02 89 30 fe 90 38 01 02 03 f1 f2 00 00 00 22 33 44 56 77 22 aa bb cc dd ee ff 00 11 ff dd cc cc cc 22 80 ee 01 02 03 f1 f2 00 00 00 22 33 44 56 23 40 92 34 32 dd cc cc 22 33 44 22 33 44 01 02 03 f1 f2 00 00 00 22 33 44 56 77 22 FF FF FF 52 FF FF 52 00 00 00 00 00 00 00
It contains three blocks. I need:
1
00 00 00 22 33 44 56 77 22 aa bb cc dd ee ff 00 11 ff dd cc cc cc 22 80 ee
2
00 00 00 22 33 44 56 23 40 92 34 32 dd cc cc 22 33 44 22 33 44
3
00 00 00 22 33 44 56 77 22 FF FF FF 52 FF FF 52 00 00 00 00 00 00 00
I have never worked with binary files in Perl.
$filename = $ARGV[0];
$marker = \x01\x02\x03\xf1\xf2;
$step = 3;
$count = 0;
open $file
while <$file> {
seek $marker;
Go to forward +$step bytes;
$count++
print EXTFILE_.$count.'.dat' $_
# Until do not seek new $marker or EOF
}
close file
As a result, I should get three .dat files.
How can I realize this pseudocode? What would a simple example look like?
Perl regular expressions are just as happy with binary data as with readable text, and binary files can be opened with a mode of raw to avoid translating line endings.
Here's a solution that reads the whole file into memory and scans it for the marker string.
use strict;
use warnings;
my $filename = shift;
my $binary = do {
    open my $fh, '<:raw', $filename or die $!;
    local $/;
    <$fh>;
};
my $marker = "\x01\x02\x03\xf1\xf2";
while ( $binary =~ /$marker(.*?)(?=$marker|\z)/sg ) {
    my @hex = map { sprintf '%02X', $_ } unpack 'C*', $1;
    print "@hex\n";
}
Output
00 00 00 22 33 44 56 77 22 AA BB CC DD EE FF 00 11 FF DD CC CC CC 22 80 EE
00 00 00 22 33 44 56 23 40 92 34 32 DD CC CC 22 33 44 22 33 44
00 00 00 22 33 44 56 77 22 FF FF FF 52 FF FF 52 00 00 00 00 00 00 00
If the file is huge, or if you simply prefer the idea, you could set the input record separator to the marker string. Then a readline operation on the file would fetch up to and including the next occurrence of the marker pattern in the file. It means that each record is being read along with the marker from the beginning of the next record, but as it's going to be removed anyway it doesn't matter.
use strict;
use warnings;
my $filename = shift;
my $marker = "\x01\x02\x03\xf1\xf2";
open my $fh, '<:raw', $filename or die $!;
local $/ = $marker;
<$fh>; # Drop the data up to and including the first marker
while (<$fh>) {
    chomp; # Remove the marker string from the end, if any
    my @hex = map { sprintf '%02X', $_ } unpack 'C*';
    print "@hex\n";
}
The output is identical to that of the program above.
That doesn't produce the required output files, though. The program below uses the second technique above, but writes each record to a series of EXTFILE_*.dat files instead of dumping the hex data. Note that an open mode of raw is necessary again.
use strict;
use warnings;
my $filename = shift // 'file.bin';
my $marker = "\x01\x02\x03\xf1\xf2";
open my $fh, '<:raw', $filename or die $!;
local $/ = $marker;
<$fh>; # Drop the data up to and including the first marker
my $count;
while (my $record = <$fh>) {
    chomp $record;
    my $outfile = sprintf 'EXTFILE_%d.dat', ++$count;
    open my $out_fh, '>:raw', $outfile or die $!;
    print $out_fh $record;
    close $out_fh or die $!;
}

grep and delete two lines above a string

I have a file like the one below.
I want to grep for a string, say cde, find the two lines above it, and delete them in the same file (something like perl -i).
abc
abc
cde
fgh
lij
lij
klm
mno
pqr
pqr
I tried
grep -B 2 "cde" a.txt
Output
abc
abc
cde
But now I want to delete the two lines above cde so that my final output is
cde
fgh
lij
lij
klm
mno
pqr
pqr
I have tried
grep -v -B "cde" a.txt
but it doesn't work
As a Perl one-liner:
perl -ne 'push @b, $_; @b = ($_) if /^cde$/; print shift @b if @b == 3; END { print @b }' file.txt
The buffer @b delays printing by two lines; when cde is seen, the two buffered lines above it are discarded and only cde is kept.
Here is an awk solution.
awk 'FNR==NR {if (/cde/) f=NR;next} FNR!=f-1 && FNR!=f-2' file{,} > tmp && mv tmp file
cde
fgh
lij
lij
klm
mno
pqr
pqr
file{,} is brace-expanded to file file, which makes awk read the file twice.
On the first pass, look for cde and store its line number in the variable f.
On the second pass, print every record except the two immediately above the match (FNR equal to f-1 or f-2).
> tmp && mv tmp file stores the output in a temporary file, then moves it back over the original, like -i.
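Yet another route is to reverse the file, which turns the problem into the easier "delete the two lines after a match", then reverse it back (this assumes GNU tac is available):

```shell
# skip counts how many of the following lines to drop after seeing cde
tac a.txt | awk 'skip && skip-- { next } /^cde$/ { print; skip = 2; next } 1' | tac
```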

match pathname within double-zero-byte-separator input file

I am improving a script listing duplicated files that I wrote last year (see the second script if you follow the link).
The record separator of the duplicated.log output is the zero byte instead of the newline \n. Example:
$> tr '\0' '\n' < duplicated.log
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
32 dir6/video.m4v
32 dir7/video.m4v
(in this example, the five files dir1/index.htm, ... and dir5/index.htm have the same md5sum and their size is 12 bytes. The other two files dir6/video.m4v and dir7/video.m4v have the same md5sum and their content size (du) is 32 bytes.)
As each line is ended by a zero byte (\0) instead of a newline (\n), blank lines are represented as two successive zero bytes (\0\0).
I use the zero byte as line separator because file paths may contain newline characters.
But, doing that I am faced to this issue:
How to 'grep' all duplicates of a specified file from duplicated.log?
(e.g. How to retrieve duplicates of dir1/index.htm?)
I need:
$> ./youranswer.sh "dir1/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir4/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir7/video.m4v" < duplicated.log | tr '\0' '\n'
32 dir6/video.m4v
32 dir7/video.m4v
I was thinking about something like:
awk 'BEGIN { RS="\0\0" } #input record separator is double zero byte
/filepath/ { print $0 }' duplicated.log
...but filepath may contain slash symbols (/) and many other symbols (quotes, newlines...).
I may have to use perl to deal with this situation...
I am open to any suggestions, questions, other ideas...
You're almost there: use the matching operator ~:
awk -v RS='\0\0' -v pattern="dir1/index.htm" '$0~pattern' duplicated.log
I have just realized that I could use the md5sum instead of the pathname because in my new version of the script I am keeping the md5sum information.
This is the new format I am currently using:
$> tr '\0' '\n' < duplicated.log
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
32 fc191f86efabfca83a94d33aad2f87b4 dir6/video.m4v
32 fc191f86efabfca83a94d33aad2f87b4 dir7/video.m4v
gawk and nawk give the wanted result:
$> awk 'BEGIN { RS="\0\0" }
/89e8a208e5f06c65e6448ddeb40ad879/ { print $0 }' duplicated.log |
tr '\0' '\n'
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
But I am still open about your answers :-)
(this current answer is just a workaround)
For the curious, here is the new (horrible) script, still under construction...
#!/bin/bash
fifo=$(mktemp -u)
fif2=$(mktemp -u)
dups=$(mktemp -u)
dirs=$(mktemp -u)
menu=$(mktemp -u)
numb=$(mktemp -u)
list=$(mktemp -u)
mkfifo $fifo $fif2
# run processing in background
find . -type f -printf '%11s %P\0' | #print size and filename
tee $fifo | #write in fifo for dialog progressbox
grep -vzZ '^ 0 ' | #ignore empty files
LC_ALL=C sort -z | #sort by size
uniq -Dzw11 | #keep files having same size
while IFS= read -r -d '' line
do #for each file compute md5sum
echo -en "${line:0:11}" "\t" $(md5sum "${line:12}") "\0"
#file size + md5sum + file name + null terminated instead of '\n'
done | #keep the duplicates (same md5sum)
tee $fif2 |
uniq -zs12 -w46 --all-repeated=separate |
tee $dups |
#xargs -d '\n' du -sb 2<&- | #retrieve size of each file
gawk '
function tgmkb(size) {
if(size<1024) return int(size) ; size/=1024;
if(size<1024) return int(size) "K"; size/=1024;
if(size<1024) return int(size) "M"; size/=1024;
if(size<1024) return int(size) "G"; size/=1024;
return int(size) "T"; }
function dirname (path)
{ if(sub(/\/[^\/]*$/, "", path)) return path; else return "."; }
BEGIN { RS=ORS="\0" }
!/^$/ { sz=substr($0,0,11); name=substr($0,48); dir=dirname(name); sizes[dir]+=sz; files[dir]++ }
END { for(dir in sizes) print tgmkb(sizes[dir]) "\t(" files[dir] "\tfiles)\t" dir }' |
LC_ALL=C sort -zrshk1 > $dirs &
pid=$!
tr '\0' '\n' <$fifo |
dialog --title "Collecting files having same size..." --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
tr '\0' '\n' <$fif2 |
dialog --title "Computing MD5 sum" --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
wait $pid
DUPLICATES=$( grep -zac -v '^$' $dups) #total number of files concerned
UNIQUES=$( grep -zac '^$' $dups) #number of files, if all redundant are removed
DIRECTORIES=$(grep -zac . $dirs) #number of directories concerned
lins=$(tput lines)
cols=$(tput cols)
cat > $menu <<EOF
--no-shadow
--no-lines
--hline "After selection of the directory, you will choose the redundant files you want to remove"
--menu "There are $DUPLICATES duplicated files within $DIRECTORIES directories.\nThese duplicated files represent $UNIQUES unique files.\nChoose directory to proceed redundant file removal:"
$lins
$cols
$DIRECTORIES
EOF
tr '\n"' "_'" < $dirs |
gawk 'BEGIN { RS="\0" } { print FNR " \"" $0 "\" " }' >> $menu
dialog --file $menu 2> $numb
[[ $? -eq 1 ]] && exit
set -x
dir=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f4- )
md5=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f2 )
grep -zao "$dir/[^/]*$" "$dups" |
while IFS= read -r -d '' line
do
file="${line:47}"
awk 'BEGIN { RS="\0\0" } '"/$md5/"' { print $0 }' >> $list
done
echo -e "
fifo $fifo \t dups $dups \t menu $menu
fif2 $fif2 \t dirs $dirs \t numb $numb \t list $list"
#rm -f $fifo $fif2 $dups $dirs $menu $numb