delete lines from multiple files using gawk / awk / sed - sed

I have two sets of text files. First set is in AA folder. Second set is in BB folder. The content of ff.txt file from first set(AA folder) is shown below.
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
I would like to print the second column(number) from this file if marks>60. The output would be 3,4. Next, read the ff.txt file in the BB folder and delete the lines containing numbers 3,4.
files in the BB folder looks like this. second column is the number.
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
I used the following code.This code is working perfectly for one file.
gawk 'BEGIN {getline} $3>60{print $2}' AA/ff.txt | while read number; do gawk -v number=$number '$2 != number' BB/ff.txt > /tmp/ff.txt; mv /tmp/ff.txt BB/ff.txt; done
But when I run this code with multiple files, I get error.
gawk 'BEGIN {getline} $3>60{print $2}' AA/*.txt | while read number; do gawk -v number=$number '$2 != number' BB/*.txt > /tmp/*.txt; mv /tmp/*.txt BB/*.txt; done
error:-
mv: target `BB/kk.txt' is not a directory
I had asked this question two days ago.Please help me to solve this error.

This creates an index of all files in folder AA and checks against all files in folder BB:
cat AA/*.txt | awk 'FNR==NR { if ($3 > 60) array[$2]; next } !($2 in array)' - BB/*.txt
This compares two individual files, assuming they have the same name in folders AA and BB:
ls AA/*.txt | sed "s%AA/\(.*\)%awk 'FNR==NR { if (\$3 > 60) array[\$2]; next } !(\$2 in array)' & BB/\1 %" | sh
HTH
EDIT
This should help :-)
ls AA/*.txt | sed "s%AA/\(.*\)%awk 'FNR==NR { if (\$3 > 60) array[\$2]; next } !(\$2 in array)' & BB/\1 > \1_tmp \&\& mv \1_tmp BB/\1 %" | sh

> /tmp/*.txt and mv /tmp/*.txt BB/*.txt are wrong.
For single file
awk 'NR>1 && $3>60{print $2}' AA/ff.txt > idx.txt
awk 'NR==FNR{a[$0]; next}; !($2 in a)' idx.txt BB/ff.txt
For multiple files
awk 'FNR>1 && $3>60{print $2}' AA/*.txt >idx.txt
cat BB/*.txt | awk 'NR==FNR{a[$0]; next}; !($2 in a)' idx.txt -

One perl solution:
use warnings;
use strict;
use File::Spec;
## Hash to save data to delete from files of BB folder.
## key -> file name.
## value -> string with numbers of second column. They will be
## joined separated with '-...-', like: -2--3--1-. And it will be easier to
## search for them using a regexp.
my %delete;
## Check arguments:
## 1.- They are two.
## 2.- Both are directories.
## 3.- Both have same number of regular files and with identical names.
die qq[Usage: perl $0 <dir_AA> <dir_BB>\n] if
#ARGV != 2 ||
grep { ! -d } #ARGV;
{
my %h;
for ( glob join q[ ], map { qq[$_/*] } #ARGV ) {
next unless -f;
my $file = ( File::Spec->splitpath( $_ ) )[2] or next;
$h{ $file }++;
}
for ( values %h ) {
if ( $_ != 2 ) {
die qq[Different files in both directories\n];
}
}
}
## Get files from dir 'AA'. Process them, print to output lines which
## matches condition and save the information in the %delete hash.
for my $file ( glob( shift . qq[/*] ) ) {
open my $fh, q[<], $file or do { warn qq[Couldn't open file $file\n]; next };
$file = ( File::Spec->splitpath( $file ) )[2] or do {
warn qq[Couldn't get file name from path\n]; next };
while ( <$fh> ) {
next if $. == 1;
chomp;
my #f = split;
next unless #f >= 3;
if ( $f[ $#f ] > 60 ) {
$delete{ $file } .= qq/-$f[1]-/;
printf qq[%s\n], $_;
}
}
}
## Process files found in dir 'BB'. For each line, print it if not found in
## file from dir 'AA'.
{
#ARGV = glob( shift . qq[/*] );
$^I = q[.bak];
while ( <> ) {
## Sanity check. Shouldn't occur.
my $filename = ( File::Spec->splitpath( $ARGV ) )[2];
if ( ! exists $delete{ $filename } ) {
close ARGV;
next;
}
chomp;
my #f = split;
if ( $delete{ $filename } =~ m/-$f[1]-/ ) {
next;
}
printf qq[%s\n], $_;
}
}
exit 0;
A test:
Assuming next tree of files. Command:
ls -R1
Output:
.:
AA
BB
script.pl
./AA:
ff.txt
gg.txt
./BB:
ff.txt
gg.txt
And next content of files. Command:
head AA/*
Output:
==> AA/ff.txt <==
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
==> AA/gg.txt <==
Name number marks
john 1 70
maria 2 54
samuel 3 42
ben 4 33
Command:
head BB/*
Output:
==> BB/ff.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
==> BB/gg.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 2 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
Run the script like:
perl script.pl AA/ BB
With following ouput to screen:
samuel 3 62
ben 4 63
john 1 70
And files of BB directory modified like:
head BB/*
Output:
==> BB/ff.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
==> BB/gg.txt <==
marks 2 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
So, from ff.txt lines with numbers 3 and 4 have been deleted, and lines with number 1 in gg.txt, which all of them were bigger than 60 in last column. I think this is what you wanted to achieve. I hope it helps, although not awk.

Related

How to repeat a sequence of numbers to the end of a column?

I have a data file that needs a new column of identifiers from 1 to 5. The final purpose is to split the data into five separate files with no leftover file (split leaves a leftover file).
Data:
aa
bb
cc
dd
ff
nn
ww
tt
pp
with identifier column:
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
Not sure if this can be done with seq? Afterwards it will be split with:
awk '$2 == 1 {print $0}'
awk '$2 == 2 {print $0}'
awk '$2 == 3 {print $0}'
awk '$2 == 4 {print $0}'
awk '$2 == 5 {print $0}'
Perl to the rescue:
perl -pe 's/$/" " . $. % 5/e' < input > output
Uses 0 instead of 5.
$. is the line number.
% is the modulo operator.
the /e modifier tells the substitution to evaluate the replacement part as code
i.e. end of line ($) is replaced with a space concatenated (.) with the line number modulo 5.
$ awk '{print $0, ((NR-1)%5)+1}' file
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
No need for that to create 5 separate files of course. All you need is:
awk '{print > ("file_" ((NR-1)%5)+1)}' file
Looks like you're happy with a perl solution that outputs 1-4 then 0 instead of 1-5 so FYI here's the equivalent in awk:
$ awk '{print $0, NR%5}' file
aa 1
bb 2
cc 3
dd 4
ff 0
nn 1
ww 2
tt 3
pp 4
I am going to offer a Perl solution even though it wasn't tagged because Perl is well suited to solve this problem.
If I understand what you want to do, you have a single file that you want to split into 5 separate files based on the position of a line in the data file:
the first line in the data file goes to file 1
the second line in the data file goes to file 2
the third line in the data file goes to file 3
...
since you already have the lines position in the file, you don't really need the identifier column (though you could pursue that solution if you wanted).
Instead you can open 5 filehandles and simply alternate which handle you write to:
use strict;
use warnings;
my $datafilename = shift #ARGV;
# open filehandles and store them in an array
my #fhs;
foreach my $i ( 0 .. 4 ) {
open my $fh, '>', "${datafilename}_$i"
or die "$!";
$fhs[$i] = $fh;
}
# open the datafile
open my $datafile_fh, '<', $datafilename
or die "$!";
my $row_number = 0;
while ( my $datarow = <$datafile_fh> ) {
print { $fhs[$row_number++ % #fhs] } $datarow;
}
# close resources
foreach my $fh ( #fhs ) {
close $fh;
}

Pass "file name" from a text file to a command line where each line of a file is file name

I'm running the following code
git log --pretty=format: --numstat -- SOMEFILENAME |
perl -ane '$i += ($F[0]-$F[1]); END{print "changed: $i\n"}' \
>> random.txt
What this does is it takes a file with a name "SOMEFILENAME" and saves the sum of the total amount of added and removed lines to a textfile called "random.txt"
I need to run this program on every file in repository and there are looots of them. What would be an easy way to do this?
If you want a total per file:
git log --pretty=format: --numstat |
perl -ane'
$c{$F[2]} += $F[0]-$F[1] if $F[2];
END { print "$_\t$c{$_}\n" for sort keys %c }
' >random.txt
If you want a single total:
git log --pretty=format: --numstat |
perl -ane'
$c += $F[0]-$F[1];
END { print "$c\n" }
' >random.txt
Their respective outputs are:
.gitignore 22
Build.PL 48
CHANGES.txt 0
Changes 25
LICENSE 132
LICENSE.txt 0
MANIFEST 18
MANIFEST.SKIP 9
README.txt 67
TODO.txt 1
lib/feature/qw_comments.pm 129
lib/feature/qw_comments.xs 250
t/00_load.t 13
t/01_basic.t 85
t/02_pragma.t 56
t/03_line_numbers.t 37
t/04_errors.t 177
t/05-unicode.t 39
t/devel-pod-coverage.t 26
t/pod.t 17
and
1151
Rather than use find, you can just let git give you all the files by using the name . (representing the current directory). With that, here's a version using awk that prints out stats per file:
git log --pretty=format: --numstat -- . |
awk '
NF == 3 {changed[$3] += $1 - $2}
END { for (name in changed) { printf("%s: %d changed\n", name, changed[name]); } }
'
And an even shorter one that prints a single overall changed line:
git log --pretty=format: --numstat -- . |
awk '
NF == 3 {changed += $1 - $2}
END { printf("%d changed\n", changed); }
'
(The NF == 3 is to account for the fact that git seems to print spurious blank lines in its output. I didn't try to figure out if there's a better git command.)

grep variables and give informative ouput

I want to see how many times specific word was mentioned in the file/lines.
My dummy examples looks like this:
cat words
blue
red
green
yellow
cat text
TEXTTEXTblueTEXTTEXTblue
TEXTTEXTgreenblueTEXTTEXT
TEXTTEXyeowTTEXTTEXTTEXT
I am doing this:
for i in $(cat words); do grep "$i" text | wc >> output; done
cat output
2 2 51
0 0 0
1 1 26
0 0 0
But what I actually want to get is:
1. Word that was used as a variable;
2. In how many lines (additionally to text hits) word was found.
Preferable output looks like this:
blue 3 2
red 0 0
green 1 1
yellow 0 0
$1 - variable that was grep'ed
$2 - how many times variable was found in the text
$3 - in how many lines variable was found
Hope someone could help me doing this with grep, awk, sed as they are fast enough for the large data set, but Perl one liner would help me too.
Edit
Tried this
for i in $(cat words); do grep "$i" text > out_${i}; done && wc out*
and it kinda looks nice, but some of the words are longer than 300 letters so I can't create file named like the word.
You can use the grep option -o which print only the matched parts of a matching line, with each match on a separate output line.
while IFS= read -r line; do
wordcount=$(grep -o "$line" text | wc -l)
linecount=$(grep -c "$line" text)
echo $line $wordcount $linecount
done < words | column -t
You can put it all in one line to make it a one liner.
If column gives the "column too long" error, you can use printf provided you know the maximum number of characters. Use the below instead of echo and remove the pipe to column:
printf "%-20s %-2s %-2s\n" "$line" $wordcount $linecount
Replace the 20 with your max word length and the other numbers as well if you need to.
Here is a similar Perl solution; but rather written as a complete script.
#!/usr/bin/perl
use 5.012;
die "USAGE: $0 wordlist.txt [text-to-search.txt]\n" unless #ARGV;
my $wordsfile = shift #ARGV;
my #wordlist = do {
open my $words_fh, "<", $wordsfile or die "Can't open $wordsfile: $!";
map {chomp; length() ? $_ : ()} <$words_fh>;
};
my %words;
while (<>) {
for my $word (#wordlist) {
my $cnt = 0;
$cnt++ for /\Q$word\E/g;
$words{$word}[0] += $cnt;
$words{$word}[1] += 1&!! $cnt; # trick to force 1 or 0.
}
}
# sorts output after frequency. remove `sort {...}` to get unsorted output.
for my $key (sort {$words{$b}->[0] <=> $words{$a}->[0] or $a cmp $b} keys %words) {
say join "\t", $key, #{ $words{$key} };
}
Example output:
blue 3 2
green 1 1
red 0 0
yellow 0 0
Advantage over bash script: every file is only read once.
This gets pretty ugly as a Perl one-liner (partly because it needs to get data from two files and only one can be sent on stdin, partly because of the requirement to count both the number of lines matched and the total number of matches), but here you go:
perl -E 'undef $|; open $w, "<", "words"; #w=<$w>; chomp #w; $r{$_}=[0,{}] for #w; my $re = join "|", #w; while(<>) { $l++; while (/($re)/g) { $r{$1}[0]++; $r{$1}[1]{$l}++; } }; say "$_\t$r{$_}[0]\t" . scalar keys %{$r{$_}[1]} for #w' < text
This requires perl 5.10 or later, but changing it to support 5.8 and earlier is trivial. (Change the -E to -e, change say to print, and add a \n at the end of each line of output.)
Output:
blue 3 2
red 0 0
green 1 1
yellow 0 0
an awk(gawk) oneliner could save you from grep puzzle:
awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
format the code a bit:
awk 'NR==FNR{n[$0];l[$0];next;}
{for(w in n){ s=$0;
t=gsub(w,"#",s);
n[w]+=t;l[w]+=t>0?1:0;}
}END{for(x in n)print x,n[x],l[x]}' words text
test with your example:
kent$ awk 'NR==FNR{n[$0];l[$0];next;}{for(w in n){ s=$0;t=gsub(w,"#",s); n[w]+=t;l[w]+=t>0?1:0;}}END{for(x in n)print x,n[x],l[x]}' words text
yellow 0 0
red 0 0
green 1 1
blue 3 2
if you want to format your output, you could just pipe the awk output to column -t
so it looks like:
yellow 0 0
red 0 0
green 1 1
blue 3 2
awk '
NR==FNR { words[$0]; next }
{
for (word in words) {
count = gsub(word,word)
if (count) {
counts[word] += count
lines[word]++
}
}
}
END { for (word in words) printf "%s %d %d\n", word, counts[word], lines[word] }
' file

Prevent column shift after character removal

Currently I am using the following oneliner for the removal of special characters:
sed 's/[-$*=+()]//g'
However sometimes it occurs that a column only contains the special character *.
How can I prevent the column from shifting if it only contains *?
Would it be possible to use a placeholder, so that whenever it occurs that the only character(s) in the columns two and/or four are * it gets replaced by N for every *?
From:
6 cc-g*$ 10 cc+c
6 c$c$*g$q 10 ***
6 *c*c$$qq 10 ccc
6 ** 10 c$cc
6 ** 10 *
To possibly:
6 ccg 10 ccc
6 ccgq 10 NNN
6 ccqq 10 ccc
6 NN 10 ccc
6 NN 10 N
Try with in awk,
awk '{ if($2 ~ /^[*]+$/) { gsub ( /[*]/,"N",$2); } if($4 ~ /^[*]+$/ ){ gsub ( /[*]/,"N",$4); } print }' your_file.txt | sed 's/[-$*=+()]//g'
I hope this will help you.
One way using perl. Traverse all fields of each line and substitute special characters unless the field only has * characters. After that print them separated with one space.
perl -ane '
for my $pos ( 0 .. $#F ) {
$F[ $pos ] =~ s/[-\$*=+()]//g unless $F[ $pos ] =~ m/\A\*+\Z/;
}
printf qq|%s\n|, join qq| |, #F;
' infile
Assuming infile has the content of the question, output will be:
6 ccg 10 ccc
6 ccgq 10 ***
6 ccqq 10 ccc
6 ** 10 ccc
6 ** 10 *
This might work for you (GNU sed):
sed 'h;s/\S*\s*\(\S*\).*/\1/;:a;/^\**$/y/*/N/;s/[*$+=-]//g;H;g;/\n.*\n/bb;s/\(\S*\s*\)\{3\}\(\S*\).*/\2/;ba;:b;s/^\(\S*\s*\)\(\S*\)\([^\n]*\)\n\(\S*\)/\1\4\3/;s/\(\S*\)\n\(.*\)/\2/' file

How to extract a particular column of data in Perl?

I have some data from a unix commandline call
1 ab 45 1234
2 abc 5
4 yy 999 2
3 987 11
I'll use the system() function for the call.
How can I extract the second column of data into an array in Perl? Also, the array size has to be dependent on the number of rows that I have (it will not necessarily be 4).
I want the array to have ("ab", "abc", "yy", 987).
use strict;
use warnings;
my $data = "1 ab 45 1234
2 abc 5
2 abc 5
2 abc 5
4 yy 999 2
3 987 11";
my #second_col = map { (split)[1] } split /\n/, $data;
To get unique values, see perlfaq4. Here's part of the answer provided there:
my %seen;
my #unique = grep { ! $seen{ $_ }++ } #second_col;
You can chain a Perl cmd-line call (aka: one-liner) to your unix script:
perl -lane 'print $F[1]' data.dat
instead of data.dat, use a pipe from your command line tool
cat data.dat | perl -lane 'print $F[1]'
Addendum:
The extension for unique-ness of the resulting column is straightforward:
cat data.dat | perl -lane 'print $F[1] unless $seen{$F[1]}++'
or, if you are lazy (employing %_):
cat data.dat | perl -lane 'print unless $_{$_=$F[1]}++'