match pathname within double-zero-byte-separator input file

I am improving a script that lists duplicated files, which I wrote last year (see the second script if you follow the link).
The record separator of the duplicated.log output is the zero byte instead of the newline \n. Example:
$> tr '\0' '\n' < duplicated.log
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
32 dir6/video.m4v
32 dir7/video.m4v
(In this example, the five files dir1/index.htm, ... and dir5/index.htm have the same md5sum and their size is 12 bytes. The other two files dir6/video.m4v and dir7/video.m4v have the same md5sum and their content size (du) is 32 bytes.)
As each line is terminated by a zero byte (\0) instead of a newline (\n), blank lines are represented as two successive zero bytes (\0\0).
I use the zero byte as the line separator because a file path may contain newlines.
But doing that, I am faced with this issue:
How can I 'grep' all duplicates of a specified file from duplicated.log?
(e.g. How do I retrieve the duplicates of dir1/index.htm?)
I need:
$> ./youranswer.sh "dir1/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir4/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir7/video.m4v" < duplicated.log | tr '\0' '\n'
32 dir6/video.m4v
32 dir7/video.m4v
I was thinking about something like:
awk 'BEGIN { RS="\0\0" } #input record separator is double zero byte
/filepath/ { print $0 }' duplicated.log
...but filepath may contain slash symbols / and many other symbols (quotes, newlines...).
I may have to use perl to deal with this situation...
I am open to any suggestions, questions, other ideas...

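You're almost there: use the matching operator ~:
awk -v RS='\0\0' -v pattern="dir1/index.htm" '$0~pattern' duplicated.log
Note that ~ treats pattern as a regular expression, so characters such as . or [ in the path act as metacharacters. For a literal substring match, use index($0, pattern) instead of $0~pattern.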
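If awk's handling of zero bytes in RS turns out not to be portable, the same lookup can be sketched in Python. This is only a minimal sketch, not the asker's method: it assumes the first log format shown in the question ("size path" records terminated by \0, groups separated by an empty record), and the names find_duplicates and record_path are made up for illustration. It compares whole paths literally, so slashes, quotes and newlines in file names need no escaping.

```python
NUL = b"\0"


def record_path(record: bytes) -> bytes:
    """Return the path part of a 'size path' record (assumed format)."""
    # The size may be left-padded with spaces; the path is everything
    # after the first space that follows the size.
    _size, _sep, path = record.lstrip(b" ").partition(b" ")
    return path


def find_duplicates(data: bytes, target: bytes) -> list:
    """Return all records of the group that contains `target`, or []."""
    group = []
    # An empty record marks the end of a group; the extra empty record
    # appended here flushes the last group if the file lacks one.
    for record in data.split(NUL) + [b""]:
        if record:
            group.append(record)
            continue
        if any(record_path(r) == target for r in group):
            return group
        group = []
    return []
```

A wrapper script could read sys.stdin.buffer, call find_duplicates with the path from sys.argv[1] encoded to bytes, and write the records back out joined by zero bytes, which would match the ./youranswer.sh usage in the question.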

I have just realized that I could use the md5sum instead of the pathname because in my new version of the script I am keeping the md5sum information.
This is the new format I am currently using:
$> tr '\0' '\n' < duplicated.log
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
32 fc191f86efabfca83a94d33aad2f87b4 dir6/video.m4v
32 fc191f86efabfca83a94d33aad2f87b4 dir7/video.m4v
gawk and nawk give the wanted result:
$> awk 'BEGIN { RS="\0\0" }
/89e8a208e5f06c65e6448ddeb40ad879/ { print $0 }' duplicated.log |
tr '\0' '\n'
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
But I am still open to your answers :-)
(this current answer is just a workaround)
For the curious, below is the new (horrible) script under construction...
#!/bin/bash
fifo=$(mktemp -u)
fif2=$(mktemp -u)
dups=$(mktemp -u)
dirs=$(mktemp -u)
menu=$(mktemp -u)
numb=$(mktemp -u)
list=$(mktemp -u)
mkfifo $fifo $fif2
# run processing in background
find . -type f -printf '%11s %P\0' | #print size and filename
    tee $fifo | #write in fifo for dialog progressbox
    grep -vzZ '^ *0 ' | #ignore empty files (the size field is right-aligned)
    LC_ALL=C sort -z | #sort by size
    uniq -Dzw11 | #keep files having same size
    while IFS= read -r -d '' line
    do #for each file compute md5sum
        echo -en "${line:0:11}" "\t" $(md5sum "${line:12}") "\0"
        #file size + md5sum + file name + null terminated instead of '\n'
    done | #keep the duplicates (same md5sum)
    tee $fif2 |
    uniq -zs12 -w46 --all-repeated=separate |
    tee $dups |
    #xargs -d '\n' du -sb 2<&- | #retrieve size of each file
    gawk '
        function tgmkb(size) {
            if(size<1024) return int(size) ; size/=1024;
            if(size<1024) return int(size) "K"; size/=1024;
            if(size<1024) return int(size) "M"; size/=1024;
            if(size<1024) return int(size) "G"; size/=1024;
            return int(size) "T"; }
        function dirname (path)
        { if(sub(/\/[^\/]*$/, "", path)) return path; else return "."; }
        BEGIN { RS=ORS="\0" }
        # substr() is 1-based: starting at 0 would drop the last digit of the size
        !/^$/ { sz=substr($0,1,11); name=substr($0,48); dir=dirname(name); sizes[dir]+=sz; files[dir]++ }
        END { for(dir in sizes) print tgmkb(sizes[dir]) "\t(" files[dir] "\tfiles)\t" dir }' |
    LC_ALL=C sort -zrshk1 > $dirs &
pid=$!
tr '\0' '\n' <$fifo |
dialog --title "Collecting files having same size..." --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
tr '\0' '\n' <$fif2 |
dialog --title "Computing MD5 sum" --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
wait $pid
DUPLICATES=$( grep -zac -v '^$' $dups) #total number of files concerned
UNIQUES=$( grep -zac '^$' $dups) #number of files, if all redundant are removed
DIRECTORIES=$(grep -zac . $dirs) #number of directories concerned
lins=$(tput lines)
cols=$(tput cols)
cat > $menu <<EOF
--no-shadow
--no-lines
--hline "After selection of the directory, you will choose the redundant files you want to remove"
--menu "There are $DUPLICATES duplicated files within $DIRECTORIES directories.\nThese duplicated files represent $UNIQUES unique files.\nChoose directory to proceed redundant file removal:"
$lins
$cols
$DIRECTORIES
EOF
tr '\n"' "_'" < $dirs |
gawk 'BEGIN { RS="\0" } { print FNR " \"" $0 "\" " }' >> $menu
dialog --file $menu 2> $numb
[[ $? -eq 1 ]] && exit
set -x
dir=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f4- )
md5=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f2 )
grep -zao "$dir/[^/]*$" "$dups" |
while IFS= read -r -d '' line
do
    file="${line:47}"
    # read the log explicitly; without a file argument awk would consume the loop's stdin
    awk 'BEGIN { RS="\0\0" } '"/$md5/"' { print $0 }' "$dups" >> $list
done
echo -e "
fifo $fifo \t dups $dups \t menu $menu
fif2 $fif2 \t dirs $dirs \t numb $numb \t list $list"
#rm -f $fifo $fif2 $dups $dirs $menu $numb

Related

how to convert 23/1/17 to 23/01/2017 in a row of csv file with unix?

How can I convert all dates in a row of a CSV file into this format? For example, I want to convert 23/1/17 to 23/01/2017.
I use Unix.
Thank you.
My file looks like this:
23/1/17
17/08/18
1/1/2
5/6/03
18/05/2019
and I want this:
23/01/2017
17/08/2018
01/01/2002
05/06/2003
18/05/2019

I used date_samples.csv as my test data:
23/1/17,17/08/18,1/1/02,5/6/03,18/05/2019
cat date_samples.csv | tr "," "\n" | awk 'BEGIN{FS=OFS="/"}{print $2,$1,$3}' | \
while read CMD; do
date -d $CMD +%d/%m/%Y >> temp
done; cat temp | tr "\n" "," > converted_dates.csv ; rm temp; truncate -s-1 converted_dates.csv
Output:
23/01/2017,17/08/2018,01/01/2002,05/06/2003,18/05/2019
This portion of the code converts each "," to a newline and rearranges your DD/MM/YY input to MM/DD/YY, since the date command does not accept dates in DD/MM/YY order. It then loops through the rearranged dates, converts them to DD/MM/YYYY format, and temporarily stores them in temp.
cat date_samples.csv | tr "," "\n" | awk 'BEGIN{FS=OFS="/"}{print $2,$1,$3}' | \
while read CMD; do
date -d $CMD +%d/%m/%Y >> temp
done;
The line cat temp | tr "\n" "," > converted_dates.csv ; rm temp; truncate -s-1 converted_dates.csv converts the newlines back to ",", writes the output to converted_dates.csv, deletes temp, and removes the trailing comma.

Using awk:
awk -F, '{ for (i=1;i<=NF;i++) { split($i,map,"/");if (length(map[3])==1) { map[3]="0"map[3] } "date -d \""map[2]"/"map[1]"/"map[3]"\" \"+%d/%m/%Y\"" | getline dayte;close("date -d \""map[2]"/"map[1]"/"map[3]"\" \"+%d/%m/%Y\"");$i=dayte }OFS="," }1' file
Explanation:
awk -F, '{
    for (i=1;i<=NF;i++) { # Loop through each comma separated field
        split($i,map,"/"); # Split the field into the array map using "/" as the separator
        if (length(map[3])==1) {
            map[3]="0"map[3] # If the year is just one digit, pad out with prefix 0
        }
        "date -d \""map[2]"/"map[1]"/"map[3]"\" \"+%d/%m/%Y\"" | getline dayte; # Run date command on month, day and year and read result into variable dayte (%Y gives the four digit year)
        close("date -d \""map[2]"/"map[1]"/"map[3]"\" \"+%d/%m/%Y\""); # Close the date execution pipe
        $i=dayte # Replace the field with the dayte variable
    }
    OFS="," # Set the output field separator
}1' file
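For comparison, the same normalization can be sketched without spawning date for every field. This is a hedged Python sketch, not part of either answer: the function names normalize_date and convert_row are hypothetical, and it assumes every year below 100 belongs to the 2000s, which matches the sample data but differs from GNU date (which places 69 to 99 in the 1900s).

```python
from datetime import date


def normalize_date(text: str, century: int = 2000) -> str:
    """Turn D/M/Y with optional padding into DD/MM/YYYY."""
    day, month, year = (int(part) for part in text.strip().split("/"))
    if year < 100:
        # Assumption: short years are in the 2000s (true for the samples).
        year += century
    # Constructing a date raises ValueError for impossible dates like 31/2.
    date(year, month, day)
    return f"{day:02d}/{month:02d}/{year:04d}"


def convert_row(row: str) -> str:
    """Normalize every comma separated date in one CSV row."""
    return ",".join(normalize_date(field) for field in row.split(","))
```

Calling convert_row on the date_samples.csv line from the first answer would be the equivalent of the tr, awk and date pipeline, without the temporary file.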

iterate over stdin fish (context: filter music files by genre grep)

I have this:
for file in **/*.ogg;
    if ffprobe "$file" 2>&1 | sed -E -n 's/^ *GENRE *: (.*)/\1/p' | grep -q "$argv";
        echo "$file"
    else
    end
end
but I would like to turn it into a function which will take a list of filenames as standard-input:
$ find . -maxdepth 1 -not -type d -exec du -h {} + | cut -f2 | filterByGenre Classical

You could do
function filterByGenre
    while read -l line
        # do stuff with $line, for example:
        echo $line
    end
end
or
function filterByGenre
    set listOfLines (cat)
    for line in $listOfLines
        # do stuff with $line, for example:
        echo $line
    end
end

Split results of du command by new line

I have a list of the top 20 files/folders that take up the most space on my hard drive. I would like to separate each entry into size and path/to/file. Below is what I have done so far.
I am using: var=$(du -a -g /folder/ | sort -n -r | head -n 20). It returns the following:
120 /path/to/file
115 /path/to/another/file
110 /file/path/
etc.
I have tried the following code to split it up into single lines.
for i in $(echo $var | sed "s/\n/ /g")
do
echo "$i"
done
The result I would like is as follows:
120 /path/to/file,
115 /path/to/another/file,
110 /file/path/,
etc.
This however is the result I am getting:
120,
/path/to/file,
115,
/path/to/another/file,
110,
/file/path/,
etc.

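I think awk will be easier, and it can be combined with a pipe to the original command:
du -a -g /folder/ | sort -n -r | head -n 20 | awk '{ print $1, $2 "," }'
If you cannot create a single pipe and have to use $var, quote it so that the shell keeps the newlines instead of splitting the value on every space (which is what the for loop in the question does):
echo "$var" | awk '{ print $1, $2 "," }'
Note that $2 holds only the part of the path up to the first space.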
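If paths may contain spaces, splitting each line only once keeps the path intact. The following Python sketch illustrates that idea; it is not from the answer, the name format_entries is hypothetical, and it assumes each du line is a size, some whitespace, then the path.

```python
def format_entries(du_output: str) -> list:
    """Turn 'size<whitespace>path' lines into 'size path,' strings."""
    entries = []
    for line in du_output.splitlines():
        # Split on the first run of whitespace only, so spaces inside
        # the path stay part of the path.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        size, path = parts
        entries.append(f"{size} {path},")
    return entries
```

Feeding it the captured text of $var and printing each returned entry on its own line would give the "size path," layout asked for in the question.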

grep + grep + sed = sed: no input files

Can anybody help me please?
grep " 287 " file.txt | grep "HI" | sed -i 's/HIS/HID/g'
sed: no input files
I also tried xargs:
grep " 287 " file.txt | grep HI | xargs sed -i 's/HIS/HID/g'
sed: invalid option -- '6'
This works fine
grep " 287 " file.txt | grep HI

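sed -i edits the files named on its command line; it cannot edit data arriving on a pipe, which is why sed reports "no input files". With xargs, the matched lines themselves are passed to sed as arguments, so sed tries to parse them as options and file names. If you want to keep your pipeline: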
f=file.txt
tmp=$(mktemp)
grep " 287 " "$f" | grep "HI" | sed 's/HIS/HID/g' > "$tmp" && mv "$tmp" "$f"
Or, simplify:
sed -i -n '/ 287 / {/HI/ s/HIS/HID/p}' file.txt
That keeps only the lines where the substitution was made (lines that contain " 287 " and "HIS") and removes every other line from the file -- is that what you want? I suspect you really want this:
sed -i '/ 287 / {/HI/ s/HIS/HID/}' file.txt
For lines that match / 287 /, execute the commands in braces. In there, for lines that match /HI/, search for the first "HIS" and replace with "HID". sed implicitly prints all lines if -n is not specified.
Other commands that do the same thing:
awk '/ 287 / && /HI/ {sub(/HIS/, "HID")} {print}' file.txt > new.txt
perl -i -pe '/ 287 / and /HI/ and s/HIS/HID/' file.txt
awk does not have an "in-place" option (except gawk -i inplace for recent gawk versions)
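The logic of the second sed command (change only the lines that match both conditions, keep every other line as it is) can be sketched in Python as follows. This is an illustrative sketch; the function name replace_in_matching_lines is hypothetical, and like the sed command without the g flag it replaces only the first "HIS" on each matching line.

```python
def replace_in_matching_lines(lines: list) -> list:
    """Replace the first 'HIS' with 'HID' on lines containing ' 287 ' and 'HI'."""
    result = []
    for line in lines:
        if " 287 " in line and "HI" in line:
            line = line.replace("HIS", "HID", 1)
        # Lines that do not match are kept unchanged, like sed without -n.
        result.append(line)
    return result
```

To edit a file in place with it, one would read all lines, pass them through the function, and write the result to a temporary file that is then renamed over the original, which is the same approach as the mktemp and mv pipeline above.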

Insert comma after certain byte range

I'm trying to turn a big list of data into a CSV. It's basically a giant list with no spaces, and the rows are separated by newlines. I have made a bash script that loops through the document, awks out the line, cuts the byte range, and then adds a comma and appends it to the end of the line. It looks like this:
awk -v n=$x 'NR==n { print;exit}' PROP.txt | cut -c 1-12 | tr -d '\n' >> $x.tmp
awk -v n=$x 'NR==n { print;exit}' PROP.txt | cut -c 13-17 | tr -d '\n' | xargs -I {} sed -i '' -e 's~$~,{}~' $x.tmp
awk -v n=$x 'NR==n { print;exit}' PROP.txt | cut -c 18-22 | tr -d '\n' | xargs -I {} sed -i '' -e 's~$~,{}~' $x.tmp
awk -v n=$x 'NR==n { print;exit}' PROP.txt | cut -c 23-34 | tr -d '\n' | xargs -I {} sed -i '' -e 's~$~,{}~' $x.tmp
The problem is this is EXTREMELY slow, and the data has about 400k rows. I know there must be a better way to accomplish this. Essentially I just need to add a comma after every 12/17/22/34 etc character of a line.
Any help is appreciated, thank you!

There are many many ways to do this with Perl. Here is one way:
perl -pe 's/(.{12})(.{5})(.{5})(.{12})/$1,$2,$3,$4,/' < input-file > output-file
The matching pattern in the substitution captures four groups of text from the beginning of each line with 12, 5, 5, and 12 arbitrary characters. The replacement pattern places a comma after each group.

With GNU awk, you could write
gawk 'BEGIN {FIELDWIDTHS="12 5 5 12"; OFS=","} {$1=$1; print}'
The $1=$1 part is to force awk to rewrite the line, incorporating the output field separator, without changing anything.

This is very much a job for substr.
use strict;
use warnings;
my @widths = (12, 5, 5, 12);
while (my $line = <DATA>) {
    my $offset = 0;    # start again at the beginning of every line
    for my $width (@widths) {
        $offset += $width;
        substr $line, $offset, 0, ',';
        ++$offset;
    }
    print $line;
}
__DATA__
1234567890123456789012345678901234567890
output
123456789012,34567,89012,345678901234,567890
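The fixed-width idea shared by the three answers can also be sketched in Python by slicing instead of inserting. This is a minimal sketch under assumptions: the name insert_commas is hypothetical, the widths are the ones from the question, and each line is assumed to be at least as long as the sum of the widths (a shorter line simply yields empty trailing fields).

```python
def insert_commas(line: str, widths=(12, 5, 5, 12)) -> str:
    """Cut `line` into fixed-width fields and join them with commas."""
    fields = []
    position = 0
    for width in widths:
        fields.append(line[position:position + width])
        position += width
    # Whatever follows the last fixed-width field is kept as a final field.
    fields.append(line[position:])
    return ",".join(fields)
```

Because each line is processed once, in memory, this avoids running awk, cut and sed once per row, which is the cost the question identifies as the bottleneck for 400k rows.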
