Perl command not giving expected output

Command:
perl -lpe '1 while (s/(^|\s)(0\d*)(\s|$)/$1"$2"$3/)' test5
Input:
1234 012345 0
0.000 01234 0
01/02/03 5467 0abc
01234 0123
0000 000054
0asdf 0we23-1
Current Output:
perl -lpe '1 while (s/(^|\s)(0\d*)(\s|$)/$1"$2"$3/)' test5
1234 "012345" "0"
0.000 "01234" "0"
01/02/03 5467 "0abc"
"01234" "0123"
"0000" "000054"
0asdf 0we23-1
Expected Output:
1234 "012345" 0
0.000 "01234" 0
01/02/03 5467 "0abc"
"01234" "0123"
"0000" "000054"
"0asdf" "0we23-1"
Conditions to follow in output:
All strings starting with 0 and containing alphanumeric characters (but not / or .) should be double-quoted.
A string consisting of just the single character 0 should not be quoted.
Spacing between strings should be preserved.

This appears to do what you want:
#!/usr/bin/env perl
use strict;
use warnings;
while ( <DATA> ) {
    my @fields = split;
    s/^(0[^\.\/]+)$/"$1"/ for @fields;
    print join( " ", @fields ), "\n";
}
__DATA__
1234 012345 0
0.000 01234 0
01/02/03 5467 0abc
01234 0123
0000 000054
0asdf 0we23-1
Note - it doesn't strictly preserve whitespace as you asked; it just removes it and reinserts a single space. That seems to meet your spec, but you could instead:
my @fields = split /(\s+)/;
as this would capture the spaces too.
join "", #fields;
This is reducible to a one-liner using -a for autosplitting:
perl -lane 's/^(0[^\.\/]+)$/"$1"/ for @F; print join " ", @F'
If you wanted to do the second bit (preserving whitespace strictly) then you'd need to drop the -a and use split yourself.

select specific columns from complex lines

I have a file that contains lines with the following format. I would like to keep only the first column and the column containing the string with the following format NC_XXXX.1
484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
.....
....
...
output:
484-2117 NC_002188.1
787-1087 NC_004068.1
Something like this in perl:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
my ( $id, $nc ) = m/^([\d\-]+).*(NC_[\d\.]+)/;
print "$id $nc\n";
}
__DATA__
484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
Output:
484-2117 NC_002188.1
787-1087 NC_004068.1
Which reduces to a one liner of:
perl -ne 'm/^([\d\-]+).*(NC_[\d\.]+)/ and print "$1 $2\n"' yourfile
Note - this specifically matches a first column made up of number and dash - you could do this with a wider regex match.
awk to the rescue!
$ awk -F' +|\\|' '{for(i=2;i<=NF;i++) if($i ~ /^NC_[0-9.]+$/) {print $1,$i; next}}' file
484-2117 NC_002188.1
787-1087 NC_004068.1
If the separator is a tab character, it needs to be added to the delimiter list:
$ awk -F' +|\\||\t' ...
With perl:
perl -anE'say "$F[0] ",(split /\|/, $F[2])[3]' file
or awk:
awk -F'\\|| +' '{print $1,$6}' file
Using gnu-awk, the following could be a solution:
awk '{printf "%s %s\n",$1,gensub(/.*(NC_.*\.1).*/,"\\1",1,$0)}' file
Output
484-2117 NC_002188.1
787-1087 NC_004068.1
A more restrictive version would be
awk '{printf "%s %s\n",$1,gensub(/.*(NC_[[:digit:]]*\.1).*/,"\\1",1,$0)}' file
awk -F'[ |]' '{print $1,$10}' file
484-2117 NC_002188.1
787-1087 NC_004068.1

need perl one liner to get a specific content out of the line and possibly average it

I have a file with many lines containing "x_y=XXXX", where XXXX can be a number from 0 to some N.
Now,
a) I would like to get only the XXXX part of the line in every such line.
b) I would like to get the average
Possibly both of these in one liners.
I am trying out something like
cat filename.txt | grep x_y | (this needs to be filled)
but I am not sure what to fill in.
In the past I have used commands like
perl -pi -e 's/x_y/m_n/g'
to replace all the instances of x_y.
But now, I would like to match for x_y=XXXX and get the XXXX out and then possibly average it out for the entire file.
Any help on this will be greatly appreciated. I am fairly new to perl and regexes.
Timtowtdi (as usual).
perl -nE '$s+=$1, ++$n if /x_y=(\d+)/; END { say "avg:", $s/$n }' data.txt
The following should do:
... | grep 'x_y=' | perl -ne '$x += (split /=/, $_)[1]; $y++ }{ print $x/$y, "\n"'
The }{ is colloquially referred to as the eskimo operator; it works because of the code which -n places around the -e body (see perldoc perlrun).
Using awk:
/^[^_]+_[^=]+=[0-9]+$/ {sum=sum+$2; cnt++}
END {
print "sum:", sum, "items:", cnt, "avg:", sum/cnt
}
$ awk -F= -f cnt.awk data.txt
sum: 55 items: 10 avg: 5.5
Pure bash-solution:
#!/bin/bash
while IFS='=' read str num
do
if [[ $str == *_* ]]
then
sum=$((sum + num))
cnt=$((cnt + 1))
fi
done < data.txt
echo "scale=4; $sum/$cnt" | bc ;exit
Output:
$ ./cnt.sh
5.5000
As a one-liner, split up with comments.
perl -nlwe '
push @a, /x_y=(\d+)/g   # push all matches onto an array
}{                      # eskimo-operator, is evaluated last
$sum += $_ for @a;      # get the sum
print "Average: ", $sum / @a;   # divide by the size of the array
' input.txt
Will extract multiple matches on a line, if they exist.
Paste version:
perl -nlwe 'push @a, /x_y=(\d+)/g }{ $sum += $_ for @a; print "Average: ", $sum / @a;' input.txt

delete lines from multiple files using gawk / awk / sed

I have two sets of text files. First set is in AA folder. Second set is in BB folder. The content of ff.txt file from first set(AA folder) is shown below.
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
I would like to print the second column(number) from this file if marks>60. The output would be 3,4. Next, read the ff.txt file in the BB folder and delete the lines containing numbers 3,4.
Files in the BB folder look like this; the second column is the number.
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
I used the following code. This code works perfectly for one file.
gawk 'BEGIN {getline} $3>60{print $2}' AA/ff.txt | while read number; do gawk -v number=$number '$2 != number' BB/ff.txt > /tmp/ff.txt; mv /tmp/ff.txt BB/ff.txt; done
But when I run this code with multiple files, I get an error.
gawk 'BEGIN {getline} $3>60{print $2}' AA/*.txt | while read number; do gawk -v number=$number '$2 != number' BB/*.txt > /tmp/*.txt; mv /tmp/*.txt BB/*.txt; done
Error:
mv: target `BB/kk.txt' is not a directory
I asked this question two days ago. Please help me to solve this error.
This creates an index of all files in folder AA and checks against all files in folder BB:
cat AA/*.txt | awk 'FNR==NR { if ($3 > 60) array[$2]; next } !($2 in array)' - BB/*.txt
This compares two individual files, assuming they have the same name in folders AA and BB:
ls AA/*.txt | sed "s%AA/\(.*\)%awk 'FNR==NR { if (\$3 > 60) array[\$2]; next } !(\$2 in array)' & BB/\1 %" | sh
HTH
EDIT
This should help :-)
ls AA/*.txt | sed "s%AA/\(.*\)%awk 'FNR==NR { if (\$3 > 60) array[\$2]; next } !(\$2 in array)' & BB/\1 > \1_tmp \&\& mv \1_tmp BB/\1 %" | sh
> /tmp/*.txt and mv /tmp/*.txt BB/*.txt are wrong: a redirection needs a single file name, not a glob, and mv with multiple sources expects the last argument to be a directory.
For single file
awk 'NR>1 && $3>60{print $2}' AA/ff.txt > idx.txt
awk 'NR==FNR{a[$0]; next}; !($2 in a)' idx.txt BB/ff.txt
For multiple files
awk 'FNR>1 && $3>60{print $2}' AA/*.txt >idx.txt
cat BB/*.txt | awk 'NR==FNR{a[$0]; next}; !($2 in a)' idx.txt -
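A quick end-to-end check of the two-step version, with throwaway files in a scratch directory standing in for the AA and BB folders (the file names here are made up):

```shell
# Reproduce the two-pass approach: first collect the numbers whose
# marks exceed 60, then filter those numbers out of the second file.
tmp=$(mktemp -d)
printf 'Name number marks\njohn 1 60\nsamuel 3 62\nben 4 63\n' > "$tmp/aa.txt"
printf 'marks 1 11.8\nmarks 3 12.1\nmarks 4 10.3\n' > "$tmp/bb.txt"
awk 'FNR>1 && $3>60{print $2}' "$tmp/aa.txt" > "$tmp/idx.txt"
awk 'NR==FNR{a[$0]; next} !($2 in a)' "$tmp/idx.txt" "$tmp/bb.txt"
```

Only the line with number 1 survives, since 3 and 4 came from rows with marks above 60.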
One perl solution:
use warnings;
use strict;
use File::Spec;
## Hash to save data to delete from files of BB folder.
## key -> file name.
## value -> string with the numbers of the second column. They are
## joined with each number wrapped in dashes, like: -2--3--1-, which
## makes it easy to search for them with a regexp.
my %delete;
## Check arguments:
## 1.- They are two.
## 2.- Both are directories.
## 3.- Both have same number of regular files and with identical names.
die qq[Usage: perl $0 <dir_AA> <dir_BB>\n] if
@ARGV != 2 ||
grep { ! -d } @ARGV;
{
my %h;
for ( glob join q[ ], map { qq[$_/*] } @ARGV ) {
next unless -f;
my $file = ( File::Spec->splitpath( $_ ) )[2] or next;
$h{ $file }++;
}
for ( values %h ) {
if ( $_ != 2 ) {
die qq[Different files in both directories\n];
}
}
}
## Get files from dir 'AA'. Process them, print to output lines which
## matches condition and save the information in the %delete hash.
for my $file ( glob( shift . qq[/*] ) ) {
open my $fh, q[<], $file or do { warn qq[Couldn't open file $file\n]; next };
$file = ( File::Spec->splitpath( $file ) )[2] or do {
warn qq[Couldn't get file name from path\n]; next };
while ( <$fh> ) {
next if $. == 1;
chomp;
my @f = split;
next unless @f >= 3;
if ( $f[ $#f ] > 60 ) {
$delete{ $file } .= qq/-$f[1]-/;
printf qq[%s\n], $_;
}
}
}
## Process files found in dir 'BB'. For each line, print it if not found in
## file from dir 'AA'.
{
@ARGV = glob( shift . qq[/*] );
$^I = q[.bak];
while ( <> ) {
## Sanity check. Shouldn't occur.
my $filename = ( File::Spec->splitpath( $ARGV ) )[2];
if ( ! exists $delete{ $filename } ) {
close ARGV;
next;
}
chomp;
my @f = split;
if ( $delete{ $filename } =~ m/-$f[1]-/ ) {
next;
}
printf qq[%s\n], $_;
}
}
exit 0;
A test:
Assuming the following tree of files. Command:
ls -R1
Output:
.:
AA
BB
script.pl
./AA:
ff.txt
gg.txt
./BB:
ff.txt
gg.txt
And the following content of the files. Command:
head AA/*
Output:
==> AA/ff.txt <==
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
==> AA/gg.txt <==
Name number marks
john 1 70
maria 2 54
samuel 3 42
ben 4 33
Command:
head BB/*
Output:
==> BB/ff.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
==> BB/gg.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 2 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
Run the script like:
perl script.pl AA/ BB
With the following output to the screen:
samuel 3 62
ben 4 63
john 1 70
And the files of the BB directory are modified like this:
head BB/*
Output:
==> BB/ff.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
==> BB/gg.txt <==
marks 2 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
So, the lines with numbers 3 and 4 have been deleted from ff.txt, and the lines with number 1 from gg.txt; in each case the matching line in AA had more than 60 in the last column. I think this is what you wanted to achieve. I hope it helps, although it's not awk.

How to extract a particular column of data in Perl?

I have some data from a unix commandline call
1 ab 45 1234
2 abc 5
4 yy 999 2
3 987 11
I'll use the system() function for the call.
How can I extract the second column of data into an array in Perl? Also, the array size has to be dependent on the number of rows that I have (it will not necessarily be 4).
I want the array to have ("ab", "abc", "yy", 987).
use strict;
use warnings;
my $data = "1 ab 45 1234
2 abc 5
2 abc 5
2 abc 5
4 yy 999 2
3 987 11";
my @second_col = map { (split)[1] } split /\n/, $data;
To get unique values, see perlfaq4. Here's part of the answer provided there:
my %seen;
my @unique = grep { ! $seen{ $_ }++ } @second_col;
You can chain a Perl cmd-line call (aka: one-liner) to your unix script:
perl -lane 'print $F[1]' data.dat
instead of data.dat, use a pipe from your command line tool
cat data.dat | perl -lane 'print $F[1]'
Addendum:
The extension for unique-ness of the resulting column is straightforward:
cat data.dat | perl -lane 'print $F[1] unless $seen{$F[1]}++'
or, if you are lazy (employing %_):
cat data.dat | perl -lane 'print unless $_{$_=$F[1]}++'

How do I split up a line and rearrange its elements?

I have some data on a single line like below
abc edf xyz rfg yeg udh
I want to present the data as below
abc
xyz
yeg
edf
rfg
udh
so that alternate fields are printed with newline separated.
Are there any one liners for this?
The following awk script can do it:
> echo 'abc edf xyz rfg yeg udh' | awk '{
for (i = 1;i<=NF;i+=2){print $i}
print "";
for (i = 2;i<=NF;i+=2){print $i}
}'
abc
xyz
yeg
edf
rfg
udh
Python in the same spirit as the above awk (4 lines):
$ echo 'abc edf xyz rfg yeg udh' | python -c 'f=raw_input().split()
> for x in f[::2]: print x
> print
> for x in f[1::2]: print x'
Python 1-liner (omitting the pipe to it which is identical):
$ python -c 'f=raw_input().split(); print "\n".join(f[::2] + [""] + f[1::2])'
Another Perl 5 version:
#!/usr/bin/env perl
use Modern::Perl;
use List::MoreUtils qw(part);
my $line = 'abc edf xyz rfg yeg udh';
my @fields = split /\s+/, $line; # split on whitespace
# Divide into odd and even-indexed elements
my $i = 0;
my ($first, $second) = part { $i++ % 2 } @fields;
# print them out
say for @$first;
say ''; # Newline
say for @$second;
A shame that the previous perl answers are so long. Here are two perl one-liners:
echo 'abc edf xyz rfg yeg udh'|
perl -naE '++$i%2 and say for @F; ++$j%2 and say for "",@F'
On older versions of perl (without "say"), you may use this:
echo 'abc edf xyz rfg yeg udh'|
perl -nae 'push @{$a[++$i%2]},"$_\n" for "",@F; print map{@$_}@a;'
Just for comparison, here's a few Perl scripts to do it (TMTOWTDI, after all). A rather functional style:
#!/usr/bin/perl -p
use strict;
use warnings;
my @a = split;
my @i = map { $_ * 2 } 0 .. $#a / 2;
print join("\n", @a[@i]), "\n\n",
join("\n", @a[map { $_ + 1 } @i]), "\n";
We could also do it closer to the AWK script:
#!/usr/bin/perl -p
use strict;
use warnings;
my @a = split;
my @i = map { $_ * 2 } 0 .. $#a / 2;
print "$a[$_]\n" for @i;
print "\n";
print "$a[$_+1]\n" for @i;
I've run out of ways to do it, so if any other clever Perlers come up with another method, feel free to add it.
Another Perl solution:
use strict;
use warnings;
while (<>)
{
my @a = split;
my @b = map { $a[2 * ($_%(@a/2)) + int($_ / (@a/2))] . "\n" } (0 .. @a-1);
print join("\n", @a[0..((@b/2)-1)], '', @a[(@b/2)..@b-1], '');
}
You could even condense it into a real one-liner:
perl -nwle'my @a = split;my @b = map { $a[2 * ($_%(@a/2)) + int($_ / (@a/2))] . "\n" } (0 .. @a-1);print join("\n", @a[0..((@b/2)-1)], "", @a[(@b/2)..@b-1], "");'
Here's the too-literal, non-scalable, ultra-short awk version:
awk '{printf "%s\n%s\n%s\n\n%s\n%s\n%s\n",$1,$3,$5,$2,$4,$6}'
Slightly longer (two more characters), using nested loops (prints an extra newline at the end):
awk '{for(i=1;i<=2;i++){for(j=i;j<=NF;j+=2)print $j;print ""}}'
Doesn't print an extra newline:
awk '{for(i=1;i<=2;i++){for(j=i;j<=NF;j+=2)print $j;if(i==1)print ""}}'
For comparison, paxdiablo's version with all unnecessary characters removed (1, 9 or 11 more characters):
awk '{for(i=1;i<=NF;i+=2)print $i;print "";for(i=2;i<=NF;i+=2)print $i}'
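For reference, that loop-based awk variant run on the sample input:

```shell
# Two passes over the fields: odd positions, a separator line, then
# even positions.
echo 'abc edf xyz rfg yeg udh' |
  awk '{for(i=1;i<=NF;i+=2)print $i;print "";for(i=2;i<=NF;i+=2)print $i}'
```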
Here's an all-Bash version:
d=(abc edf xyz rfg yeg udh)
i="0 2 4 1 3 5"
for w in $i
do
echo ${d[$w]}
[[ $w == 4 ]]&&echo
done
My attempt in haskell:
Prelude> (\(x,y) -> putStr $ unlines $ map snd (x ++ [(True, "")] ++ y)) $ List.partition fst $ zip (cycle [True, False]) (words "abc edf xyz rfg yeg udh")
abc
xyz
yeg
edf
rfg
udh
Prelude>
you could also just use tr:
echo "abc edf xyz rfg yeg udh" | tr ' ' '\n'
Ruby versions for comparison:
ARGF.each do |line|
groups = line.split
0.step(groups.length-1, 2) { |x| puts groups[x] }
puts
1.step(groups.length-1, 2) { |x| puts groups[x] }
end
ARGF.each do |line|
groups = line.split
puts groups.select { |x| groups.index(x) % 2 == 0 }
puts
puts groups.select { |x| groups.index(x) % 2 != 0 }
end
$ echo 'abc edf xyz rfg yeg udh' |awk -vRS=" " 'NR%2;NR%2==0{_[++d]=$0}END{for(i=1;i<=d;i++)print _[i]}'
abc
xyz
yeg
edf
rfg
udh
For newlines, I leave it to you to do yourself.
Here is yet another way, using Bash, to manually rearrange words in a line, after first converting it to an array:
echo 'abc edf xyz rfg yeg udh' | while read tline; do twrds=($(echo $tline)); echo -e "${twrds[0]} \n${twrds[2]} \n${twrds[4]} \n\n${twrds[1]} \n${twrds[3]} \n${twrds[5]}"; done
Cheers!