I need to reorder the columns of this (tab-separated) data:
1 cat plays
1 dog eats
1 horse runs
1 red dog
1 the cat
1 the cat
so that it prints like:
cat plays 1
dog eats 1
horse runs 1
red dog 1
the cat 2
I have tried:
sort [input] | uniq -c | awk '{print $2 "\t" $3 "\t" $1}' > [output]
and the result is:
1 cat 1
1 dog 1
1 horse 1
1 red 1
2 the 1
Can anyone give me some insight on what is going wrong?
Thank you.
Since the output of cat input | sort | uniq -c is:
1 1 cat plays
1 1 dog eats
1 1 horse runs
1 1 red dog
2 1 the cat
you need something like:
cat input | sort | uniq -c | awk '{print $3 "\t" $4 "\t" $1}'
We can also set the output field separator with OFS in awk:
cat input | sort | uniq -c | awk -v OFS="\t" '{print $3,$4,$1}'
uniq -c adds an extra column. This should give you the output you want:
$ sort file | uniq -c | awk '{print $3 "\t" $4 "\t" $1}'
cat plays 1
dog eats 1
horse runs 1
red dog 1
the cat 2
With awk and sort:
$ awk '{a[$2 OFS $3]++}END{for(k in a)print k,a[k]}' OFS='\t' file | sort -nk3
cat plays 1
dog eats 1
horse runs 1
red dog 1
the cat 2
If you have GNU awk (gawk), you can use it alone, thanks to its built-in function asorti():
#!/usr/bin/env gawk -f
{
a[$2 "\t" $3]++
}
END {
asorti(a, b)
for (i = 1; i in b; ++i) print b[i] "\t" a[b[i]]
}
One line:
gawk '{++a[$2"\t"$3]}END{asorti(a,b);for(i=1;i in b;++i)print b[i]"\t"a[b[i]]}' file
Output:
cat plays 1
dog eats 1
horse runs 1
red dog 1
the cat 2
UPDATE: To preserve the original order without sorting, use:
#!/usr/bin/awk -f
!a[$2 "\t" $3]++ {
b[++i] = $2 "\t" $3
}
END {
for (j = 1; j <= i; ++j) print b[j] "\t" a[b[j]]
}
Or
awk '!a[$2"\t"$3]++{b[++i]=$2"\t"$3}END{for(j=1;j<=i;++j)print b[j]"\t"a[b[j]]}' file
Any awk version is compatible this time.
The output is the same as above, since the input happens to already be in sorted order.
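As a quick sanity check, the one-liner can be run against a small unsorted input (hypothetical rows made up for this demonstration) to confirm that first-seen order is preserved:

```shell
# Hypothetical unsorted sample: "the cat" appears first, and twice.
cat > /tmp/unsorted.txt <<'EOF'
1 the cat
1 dog eats
1 the cat
EOF

# b[] remembers the order in which each $2/$3 pair was first seen;
# a[] counts the occurrences of each pair.
awk '!a[$2"\t"$3]++{b[++i]=$2"\t"$3}END{for(j=1;j<=i;++j)print b[j]"\t"a[b[j]]}' /tmp/unsorted.txt
```

"the cat" stays on the first output line because it was seen first, even though "dog eats" would sort before it.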
Related
I want to reverse the sign of numbers in column x (2) in multiple files. For example:
From
1 | 2.0
2 | -3.0
3 | 1.0
To
1 |-2.0
2 |3.0
3 |-1.0
I am using the command sed '/^-/ {s/.//;b};s/^/-/' file, but it does not work. Any suggestions?
A more "proper" way, using actual math, is easy with awk. For example, if you want to negate columns 2 and 3:
awk '{print $1, -$2, -$3}'
$ cat ip.txt
1 | 2.0
2 | -3.0
3 | 1.0
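If the pipe delimiter should survive (and, as with the perl -$F[1] variant further down, the 2.0 → 2 number formatting is not an issue), a sketch using | as both input and output field separator:

```shell
# Hypothetical copy of the sample input.
cat > /tmp/ip.txt <<'EOF'
1 | 2.0
2 | -3.0
3 | 1.0
EOF

# -F'|' splits on the pipe; OFS='|' re-joins fields with it.
# $2 = -$2 negates the value arithmetically; note that the leading
# space and the trailing ".0" are lost in the numeric conversion.
awk -F'|' -v OFS='|' '{ $2 = -$2 } 1' /tmp/ip.txt
```

This prints 1 |-2, 2 |3, 3 |-1 on three lines, matching the number format of the perl -$F[1] answer.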
Modifying the sed command from the OP (not easily adapted to a different column or delimiter):
$ sed -E '/^(.*\|\s*)-[0-9]/ {s/^(.*\|\s*)-/\1/;b}; s/^(.*\|\s*)/&-/' ip.txt
1 | -2.0
2 | 3.0
3 | -1.0
With perl where it is easier to specify delimiter and modify specific column
$ perl -F'\|' -lane '$F[1] =~ m/-/ ? $F[1] =~ s/-// : $F[1] =~ s/\d/-$&/; print join "|", @F' ip.txt
1 | -2.0
2 | 3.0
3 | -1.0
To modify multiple files in place within a folder, use the -i option:
sed -i -E '/^(.*\|\s*)-[0-9]/ {s/^(.*\|\s*)-/\1/;b}; s/^(.*\|\s*)/&-/' *
and
perl -i -F'\|' -lane '$F[1] =~ m/-/ ? $F[1] =~ s/-// : $F[1] =~ s/\d/-$&/; print join "|", @F' *
If number format is not an issue,
$ perl -F'\|' -lane '$F[1] = -$F[1]; print join "|", @F' ip.txt
1 |-2
2 |3
3 |-1
I have a file with several rows and with each row containing the following data-
name 20150801|1 20150802|4 20150803|6 20150804|7 20150805|7 20150806|8 20150807|11532 20150808|12399 2015089|12619 20150810|12773 20150811|14182 20150812|27856 20150813|81789 20150814|41168 20150815|28982 20150816|24500 20150817|22534 20150818|3 20150819|4 20150820|47773 20150821|33168 20150822|53541 20150823|46371 20150824|34664 20150825|32249 20150826|29181 20150827|38550 20150828|28843 20150829|3 20150830|23543 20150831|6
name2 20150801|1 20150802|4 20150803|6 20150804|7 20150805|7 20150806|8 20150807|11532 20150808|12399 2015089|12619 20150810|12773 20150811|14182 20150812|27856 20150813|81789 20150814|41168 20150815|28982 20150816|24500 20150817|22534 20150818|3 20150819|4 20150820|47773 20150821|33168 20150822|53541 20150823|46371 20150824|34664 20150825|32249 20150826|29181 20150827|38550 20150828|28843 20150829|3 20150830|23543 20150831|6
The pipe separated value indicates the value for each of the dates in the month.
Each row has the same format with same number of columns.
The first column is a unique name for the row. The dates are in yyyymmdd format, e.g. 20150818.
Given a specific date, how do I extract the name of the row that has the largest value on that day?
I think you mean this:
awk -v date=20150823 '{for(f=2;f<=NF;f++){split($f,a,"|");if(a[1]==date&&a[2]>max){max=a[2];name=$1}}}END{print name,max}' YourFile
So, you pass the date you are looking for in as a variable called date. You then iterate through all fields on the line, and split the date and value of each into an array using | as separator - a[1] has the date, a[2] has the value. If the date matches and the value is greater than any previously seen maximum, save this as the new maximum and save the first field from this line for printing at the end.
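The same logic, expanded into a commented script and run against a small hypothetical input (the file name and values here are made up for the demonstration):

```shell
# Two rows with distinct values on 20150801: name has 5, name2 has 7.
cat > /tmp/rows.txt <<'EOF'
name 20150801|5 20150802|9
name2 20150801|7 20150802|3
EOF

awk -v date=20150801 '
{
    for (f = 2; f <= NF; f++) {        # walk the date|value fields
        split($f, a, "|")              # a[1] = date, a[2] = value
        if (a[1] == date && a[2] > max) {
            max  = a[2]                # new maximum for this date
            name = $1                  # remember which row holds it
        }
    }
}
END { print name, max }
' /tmp/rows.txt
```

With this input it prints name2 7, the row holding the largest value on the given date.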
You couldn't have taken 5 seconds to give your sample input different values? Anyway, this may work when run against input that actually has different values for the dates:
$ cat tst.awk
BEGIN { FS="[|[:space:]]+" }
FNR==1 {
for (i=2;i<=NF;i+=2) {
if ( $i==tgt ) {
f = i+1
}
}
max = $f
}
$f >= max { max=$f; name=$1 }
END { print name }
$ awk -v tgt=20150801 -f tst.awk file
name2
As a quick & dirty solution, we can do this with the following Unix commands:
yourdatafile=<yourdatafile>
yourdate=<yourdate>
cat $yourdatafile | sed 's/|/_/g' | awk -F "${yourdate}_" '{print $1" "$2}' | sed 's/[0-9]*_[0-9]*//g' | awk '{print $1" "$2}' |sort -k 2n | tail -n 1
With following sample data:
$ cat $yourdatafile
Alice 20150801|44 20150802|21 20150803|7 20150804|76 20150805|71
Bob 20150801|31 20150802|5 20150803|21 20150804|133 20150805|71
and yourdate=20150803 we get:
$ cat $yourdatafile | sed 's/|/_/g' | awk -F "${yourdate}_" '{print $1" "$2}' | sed 's/[0-9]*_[0-9]*//g' | awk '{print $1" "$2}' |sort -k 2n | tail -n 1
Bob 21
and for yourdate=20150802 we get:
$ cat $yourdatafile | sed 's/|/_/g' | awk -F "${yourdate}_" '{print $2" "$1}' | sed 's/[0-9]*_[0-9]*//g' | awk '{print $2" "$1}' | sort -k 2n | tail -n 1
Alice 21
The drawback is that only one line is printed if the highest value of a day was achieved by more than one name, as can be seen with:
$ yourdate=20150805; cat $yourdatafile | sed 's/|/_/g' | awk -F "${yourdate}_" '{print $2" "$1}' | sed 's/[0-9]*_[0-9]*//g' | awk '{print $2" "$1}' | sort -k 2n | tail -n 1
Bob 71
I hope that helps anyway.
I am improving a script listing duplicated files that I wrote last year (see the second script if you follow the link).
The record separator of the duplicated.log output is the zero byte instead of the newline \n. Example:
$> tr '\0' '\n' < duplicated.log
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
32 dir6/video.m4v
32 dir7/video.m4v
(in this example, the five files dir1/index.htm, ... and dir5/index.htm have the same md5sum and their size is 12 bytes. The other two files dir6/video.m4v and dir7/video.m4v have the same md5sum and their content size (du) is 32 bytes.)
As each line is ended by a zero byte (\0) instead of a newline (\n), blank lines are represented as two successive zero bytes (\0\0).
I use the zero byte as line separator because file paths may contain newline characters.
But, doing that I am faced to this issue:
How to 'grep' all duplicates of a specified file from duplicated.log?
(e.g. How to retrieve duplicates of dir1/index.htm?)
I need:
$> ./youranswer.sh "dir1/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir4/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir7/video.m4v" < duplicated.log | tr '\0' '\n'
32 dir6/video.m4v
32 dir7/video.m4v
I was thinking about some thing like:
awk 'BEGIN { RS="\0\0" } #input record separator is double zero byte
/filepath/ { print $0 }' duplicated.log
...but filepath may contain slash symbols / and many other characters (quotes, newlines...).
I may have to use perl to deal with this situation...
I am open to any suggestions, questions, other ideas...
You're almost there: use the matching operator ~:
awk -v RS='\0\0' -v pattern="dir1/index.htm" '$0~pattern' duplicated.log
I have just realized that I could use the md5sum instead of the pathname because in my new version of the script I am keeping the md5sum information.
This is the new format I am currently using:
$> tr '\0' '\n' < duplicated.log
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
32 fc191f86efabfca83a94d33aad2f87b4 dir6/video.m4v
32 fc191f86efabfca83a94d33aad2f87b4 dir7/video.m4v
gawk and nawk give the wanted result:
$> awk 'BEGIN { RS="\0\0" }
/89e8a208e5f06c65e6448ddeb40ad879/ { print $0 }' duplicated.log |
tr '\0' '\n'
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
But I am still open about your answers :-)
(this current answer is just a workaround)
For the curious, below is the new (horrible) script, still under construction...
#!/bin/bash
fifo=$(mktemp -u)
fif2=$(mktemp -u)
dups=$(mktemp -u)
dirs=$(mktemp -u)
menu=$(mktemp -u)
numb=$(mktemp -u)
list=$(mktemp -u)
mkfifo $fifo $fif2
# run processing in background
find . -type f -printf '%11s %P\0' | #print size and filename
tee $fifo | #write in fifo for dialog progressbox
grep -vzZ '^ 0 ' | #ignore empty files
LC_ALL=C sort -z | #sort by size
uniq -Dzw11 | #keep files having same size
while IFS= read -r -d '' line
do #for each file compute md5sum
echo -en "${line:0:11}" "\t" $(md5sum "${line:12}") "\0"
#file size + md5sum + file name + null terminated instead of '\n'
done | #keep the duplicates (same md5sum)
tee $fif2 |
uniq -zs12 -w46 --all-repeated=separate |
tee $dups |
#xargs -d '\n' du -sb 2<&- | #retrieve size of each file
gawk '
function tgmkb(size) {
if(size<1024) return int(size) ; size/=1024;
if(size<1024) return int(size) "K"; size/=1024;
if(size<1024) return int(size) "M"; size/=1024;
if(size<1024) return int(size) "G"; size/=1024;
return int(size) "T"; }
function dirname (path)
{ if(sub(/\/[^\/]*$/, "", path)) return path; else return "."; }
BEGIN { RS=ORS="\0" }
!/^$/ { sz=substr($0,1,11); name=substr($0,48); dir=dirname(name); sizes[dir]+=sz; files[dir]++ }
END { for(dir in sizes) print tgmkb(sizes[dir]) "\t(" files[dir] "\tfiles)\t" dir }' |
LC_ALL=C sort -zrshk1 > $dirs &
pid=$!
tr '\0' '\n' <$fifo |
dialog --title "Collecting files having same size..." --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
tr '\0' '\n' <$fif2 |
dialog --title "Computing MD5 sum" --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
wait $pid
DUPLICATES=$( grep -zac -v '^$' $dups) #total number of files concerned
UNIQUES=$( grep -zac '^$' $dups) #number of files, if all redundant are removed
DIRECTORIES=$(grep -zac . $dirs) #number of directories concerned
lins=$(tput lines)
cols=$(tput cols)
cat > $menu <<EOF
--no-shadow
--no-lines
--hline "After selection of the directory, you will choose the redundant files you want to remove"
--menu "There are $DUPLICATES duplicated files within $DIRECTORIES directories.\nThese duplicated files represent $UNIQUES unique files.\nChoose directory to proceed redundant file removal:"
$lins
$cols
$DIRECTORIES
EOF
tr '\n"' "_'" < $dirs |
gawk 'BEGIN { RS="\0" } { print FNR " \"" $0 "\" " }' >> $menu
dialog --file $menu 2> $numb
[[ $? -eq 1 ]] && exit
set -x
dir=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f4- )
md5=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f2 )
grep -zao "$dir/[^/]*$" "$dups" |
while IFS= read -r -d '' line
do
file="${line:47}"
awk 'BEGIN { RS="\0\0" } '"/$md5/"' { print $0 }' >> $list
done
echo -e "
fifo $fifo \t dups $dups \t menu $menu
fif2 $fif2 \t dirs $dirs \t numb $numb \t list $list"
#rm -f $fifo $fif2 $dups $dirs $menu $numb
Currently, this shows only numbers:
sed 's/[^0-9]*//g'
How can I tell sed to display ONLY the largest number found, taking into account ONLY the line which contains the word "Page" ?
sed '/Page/!d; s/[^0-9]//g' | sort -n | tail -1
or
awk '/Page/ {gsub(/[^0-9]/,""); if ($0+0 > max) max = $0+0} END {print max}'
grep Page filename | awk '{print $2}' | sort -n | tail -n 1
This assumes the page number is the 2nd word of the line (if not, change the awk command as appropriate)
How can I prepend a string to the output?
For example: we want to join parameter="file/"
remark: file=/dir1/dir2/ (file has a value)
echo aaa bbb | awk '{print $2}' | sed ....
Will print
/dir1/dir2/bbb
Assuming your input is good, this should be enough.
sed "s|\(.*\)|$VARIABLE\1|"
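Since only a prefix is being added, the capture group is not strictly needed; anchoring at the start of the line works too (VARIABLE is the hypothetical shell variable holding the prefix):

```shell
VARIABLE='file/'   # hypothetical prefix value
# Double quotes let the shell expand $VARIABLE before sed sees the script.
echo aaa bbb | awk '{print $2}' | sed "s|^|$VARIABLE|"
```

This prints file/bbb.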
echo aaa bbb | awk '{print "file/"$2}'
How about:
echo aaa bbb | awk '{ print "file/" $2 }'