Output from calculation is messed up in perl one-liner - perl

I'm trying to do some calculations on the columns of a tab delimited file using this perl one-liner:
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/e && s/$F[3]/$F[3]\/$F[4]/e}' infile
The idea is to divide columns A and B by column C.
infile:
X Y A B C
5001 3 1.03333 0.652549 4215
6001 4 1.2 0.723137 4870
7001 2 1 0.807843 5153
8001 2 1 0.807843 5355
9001 2 1 0.807843 5389
10001 2 1 0.807843 4955
11001 7 1.7671 1.05573 4966
12001 17 8.18802 4.72554 5124
But the output is this:
X Y A B C
5001 3 0.000245155397390273 0.000154815895610913 4215
6001 4 0.000246406570841889 0.000148488090349076 4870
7000.000194061711624297 2 1 0.000156771395303707 5153
8000.000186741363211951 2 1 0.000150857703081232 5355
9000.000185563184264242 2 1 0.000149905919465578 5389
0.0002018163471241170001 2 1 0.000163035923309788 4955
11001 7 0.000355839710028192 0.000212591623036649 4966
12001 17 0.00159797423887588 0.000922236533957845 5124
What is going on in the 3rd to 6th lines? How can I fix this?
Thanks.
EDIT:
I removed the /e option from the substitute command and it seems that the calculation is being performed on the wrong column.
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/ && s/$F[3]/$F[3]\/$F[4]/}' infile
X Y A B C
5001 3 1.03333/4215 0.652549/4215 4215
6001 4 1.2/4870 0.723137/4870 4870
7001/5153 2 1 0.807843/5153 5153
8001/5355 2 1 0.807843/5355 5355
9001/5389 2 1 0.807843/5389 5389
1/49550001 2 1 0.807843/4955 4955
11001 7 1.7671/4966 1.05573/4966 4966
12001 17 8.18802/5124 4.72554/5124 5124
13001 30 13.8763/5138 8.05385/5138 5138

After substitution and evaluation, you have something like s/1/0.000194061711624297/. So the s operator looks for a 1 and finds it as part of the first column. Whoops. If we add some \b word-boundary markers, we can force the match part of the s operators to match a complete column, never just part of a column:
perl -ape 'if (/^\d/) { s/\b$F[2]\b/$F[2]\/$F[4]/e && s/\b$F[3]\b/$F[3]\/$F[4]/e}' infile
But that's still going to run into issues if it's possible for column X to equal column A or B. Better to just do the calculations and then replace the entire line by assigning to $_:
perl -ape 'if (/^\d/) { $F[2] /= $F[4]; $F[3] /= $F[4]; $_ = join(" ", @F) . "\n"; }' infile
Use sprintf instead of join if you want a particular format to the output.

Your basic problem is that you are substituting the values in columns 3 and 4 wherever they appear in the whole line. For row 3, for example, you are doing s/1/1\/5153/e, which affects the first occurrence of the digit 1 in the line, not necessarily the 1 that happens to be in column 3.
Try this:
perl -lane 'if ($F[4] =~ /[1-9]/) { $F[2] /= $F[4]; $F[3] /= $F[4] } print join "\t", @F' infile
If you want to limit the precision, do something like $F[2] = sprintf "%f", $F[2]/$F[4]; ...
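For comparison, the same "split into fields, compute, rejoin" idea can be sketched in Python. This is only an illustration, not part of the original answers: the function name divide_columns is hypothetical, and it assumes whitespace-separated rows with at least five fields and a header row whose first field is not numeric.

```python
def divide_columns(lines):
    """Divide columns A and B (indexes 2 and 3) by column C (index 4)."""
    out = []
    for line in lines:
        fields = line.split()
        # Only touch data rows: numeric first field and a non-zero divisor.
        if fields and fields[0].isdigit() and float(fields[4]) != 0:
            c = float(fields[4])
            fields[2] = str(float(fields[2]) / c)
            fields[3] = str(float(fields[3]) / c)
        out.append("\t".join(fields))
    return out


sample = ["X\tY\tA\tB\tC", "7001\t2\t1\t0.807843\t5153"]
for row in divide_columns(sample):
    print(row)
```

Because the division is applied to a specific field index rather than to a text pattern, a 1 in column A can never be confused with the 1 inside 7001.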

Related

perl6, how to match 1 to 10000 times except prime number of times?

What is the best way to match a string that occurs anywhere from 1 to 10000 times except prime number of times?
say so "xyz" ~~ m/ <[x y z]> ** <[ 1..10000] - [ all prime numbers ]> /
Thanks !!!
Not necessarily the best way (in particular, it will create up to 10_000 submatch objects), but a way:
$ perl6 -e 'say "$_ ", so <x y z>.roll x $_ ~~ /^ (<[xyz]>) ** 1..10_000 <!{$0.elems.is-prime}> $/ for 1..10'
1 True
2 False
3 False
4 True
5 False
6 True
7 False
8 True
9 True
10 True
If the substring of interest has fixed length, you could also capture the repetition as a whole and check its length, avoiding submatch creation.
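The "capture the repetition as a whole and check its length" idea can be sketched in Python as follows. This is an illustration under assumptions, not a Perl 6 solution: the helper names is_prime and non_prime_run are hypothetical, and the repeated unit is assumed to be a single character from x, y, z, so the length of the match equals the repetition count.

```python
import re


def is_prime(n):
    """Trial-division primality test; adequate for n up to 10000."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def non_prime_run(s):
    """True if s is 1..10000 characters from x, y, z and its length is not prime."""
    m = re.fullmatch(r"[xyz]{1,10000}", s)
    return m is not None and not is_prime(len(m.group(0)))


for n in range(1, 11):
    print(n, non_prime_run("x" * n))
```

The regex only enforces the character class and the 1 to 10000 bound; the primality condition is checked in ordinary code afterwards, which avoids creating one submatch per repetition.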

Merging two files with condition on two columns

I have two files of the type:
File1.txt
1 117458 rs184574713 rs184574713
1 119773 rs224578963 rs224500000
1 120000 rs224578874 rs224500045
1 120056 rs60094200 rs60094200
2 120056 rs60094536 rs60094536
File2.txt
10 120200 120400 A 0 189,183,107
1 0 119600 C 0 233,150,122
1 119600 119800 D 0 205,92,92
1 119800 120400 F 0 192,192,192
2 120400 122000 B 0 128,128,128
2 126800 133200 A 0 192,192,192
I want to add the information contained in the second file to the first file. The first column in both files needs to match, while the second column in File1.txt should fall in the interval that is indicated by columns 2 and 3 in File2.txt. So that the output should look like this:
1 117458 rs184574713 rs184574713 C 0 233,150,122
1 119773 rs224578963 rs224500000 D 0 205,92,92
1 120000 rs224578874 rs224500045 F 0 192,192,192
1 120056 rs60094200 rs60094200 F 0 192,192,192
2 120440 rs60094536 rs60094536 B 0 128,128,128
Please help me with awk/perl.. or any other script.
This is how you would do it in bash (with a little help from awk):
join -1 1 -2 1 \
<(sort -k1,1 test1) <(sort -k1,1 test2) | \
awk '$2 >= $5 && $2 <= $6 {print $1, $2, $3, $4, $7, $8, $9}'
Here is a brief explanation.
First, we use join to join lines based on the common key (the first field).
But join expects both input files to be already sorted on that key (hence sort).
Finally, we employ awk to apply the required condition and to project the fields we want.
Try this (assuming there is a typo in the last entry of your expected output: 120056 is not between 120400 and 122000):
$ awk '
NR==FNR {
a[$1,$2,$3]=$4 FS $5 FS $6;
next
}
{
for(x in a) {
split(x,tmp,SUBSEP);
if($1==tmp[1] && $2>=tmp[2] && $2<=tmp[3])
print $0 FS a[x]
}
}' file2 file1
1 117458 rs184574713 rs184574713 C 0 233,150,122
1 119773 rs224578963 rs224500000 D 0 205,92,92
1 120000 rs224578874 rs224500045 F 0 192,192,192
1 120056 rs60094200 rs60094200 F 0 192,192,192
You read through the first file on the command line (file2), creating an array indexed by columns 1, 2 and 3, with columns 4, 5 and 6 as the value.
For the second file (file1), you loop over the array. For every key, you split the key and check that the first column matches and the second column is in range.
If the condition is true, you print the entire line from file1 followed by the value from the array.
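The lookup described above can be sketched in Python as well. This is a minimal illustration, assuming both files are already split into lists of string fields; the function name annotate and the variable names are hypothetical.

```python
def annotate(snps, intervals):
    """Append the interval annotation to every SNP row that falls inside it.

    snps:      rows like [chrom, pos, id1, id2]
    intervals: rows like [chrom, start, end, label, score, rgb]
    """
    out = []
    for row in snps:
        chrom, pos = row[0], int(row[1])
        for iv in intervals:
            # Same chromosome and start <= pos <= end (both ends inclusive).
            if iv[0] == chrom and int(iv[1]) <= pos <= int(iv[2]):
                out.append(row + iv[3:])
    return out


snps = [
    "1 117458 rs184574713 rs184574713".split(),
    "2 120056 rs60094536 rs60094536".split(),
]
intervals = [
    "1 0 119600 C 0 233,150,122".split(),
    "2 120400 122000 B 0 128,128,128".split(),
]
for row in annotate(snps, intervals):
    print(" ".join(row))
```

Like the awk version, this compares every row against every interval, and a position that sits exactly on a shared boundary (for example 119800) would be reported once per matching interval.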

I don't understand this little perl code (if ...)

Can someone explain this short Perl code to me?
$batstr2 = "empty" if( $status2 & 4 );
What does the if statement say?
This has been answered many times, but in case you don't know what bitwise AND is, here is a small example:
perl -e 'print "dec\t bin\t&4\n";printf "%d\t%8b\t%-8b\n", $_, $_, ($_ & 4) for (0..8);'
prints:
dec bin &4
0 0 0
1 1 0
2 10 0
3 11 0
4 100 100
5 101 100
6 110 100
7 111 100
8 1000 0
As you can see, when the 3rd bit from the right is 1, $num & 4 is true.
That's using the if as a statement modifier. It's roughly the same as
if ($status2 & 4) {
    $batstr2 = "empty";
}
and exactly the same as
($status2 & 4) and ($batstr2 = "empty");
A variety of constructs can be used as statement modifiers, including if, unless, while, until, for and when. These modifiers can't be stacked (foo() if $bar for @baz won't work); you are limited to one modifier per simple statement.
That's a bitwise and - http://perldoc.perl.org/perlop.html#Bitwise-And . $status2 is being used as a bit mask and it sets $batstr2 to 'empty' if the bit is set.
It sets $batstr2 to "empty" if the 3rd least significant bit of $status2 is set - it is a bitwise AND mask.
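The same bit test can be sketched in Python, where & is also the bitwise AND operator. The names EMPTY_FLAG and battery_string are hypothetical and only used for illustration; the original code does not say what the other bits of $status2 mean.

```python
EMPTY_FLAG = 0b100  # decimal 4: the third bit from the right


def battery_string(status, current="unknown"):
    """Return "empty" if the EMPTY_FLAG bit is set in status, else current."""
    # status & EMPTY_FLAG is 4 when the bit is set and 0 otherwise,
    # so it can be used directly as a truth value.
    return "empty" if status & EMPTY_FLAG else current


for status in range(9):
    print(status, format(status, "b"), battery_string(status))
```

Only the single bit selected by the mask matters: 4, 5, 6 and 7 all report "empty", while 8 (binary 1000) does not.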

How to sum values in a column grouped by values in the other

I have a large file consisting of data in 2 columns:
100 5
100 10
100 10
101 2
101 4
102 10
102 2
I want to sum the values in 2nd column with matching values in column 1. For this example, the output I'm expecting is
100 25
101 6
102 12
I'm trying to do this with a bash script, preferably. Can someone explain how I can do this?
Using awk:
awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' inputfile
For your input, it'd produce:
100 25
101 6
102 12
In a perl oneliner
perl -lane '$s{$F[0]} += $F[1]; END { print qq{$_ $s{$_}} for keys %s}' file.txt
You can use an associative array. The first column is the index and the second becomes what you add to it.
#!/bin/bash
declare -A columns=()
while read -r -a line ; do
columns[${line[0]}]=$((${columns[${line[0]}]} + ${line[1]}))
done < "${1}"
for idx in "${!columns[@]}" ; do
echo "${idx} ${columns[${idx}]}"
done
Using awk and maintain the order:
awk '!($1 in a){a[$1]=$2; b[++i]=$1;next} {a[$1]+=$2} END{for (k=1; k<=i; k++) print b[k], a[b[k]]}' file
100 25
101 6
102 12
Python is my choice:
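import sys

d = {}
with open(sys.argv[1]) as f:  # input file name given as the first argument
    for line in f:
        key, value = line.split()
        d[key] = d.get(key, 0) + int(value)  # missing keys start at 0
for key, total in d.items():
    print(key, total)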
Why would you want a bash script?

sed remove line containing a string and nothing but; automation using for loop

Q1: Sed specify the whole line and if the line is nothing but the string then delete
I have a file that contains several of the following numbers:
1 1
3 1
12 1
1 12
25 24
23 24
I want to delete numbers that are the same in each line. For that I have either been using:
sed '/1 1/d' < old.file > new.file
OR
sed -n '/1 1/!p' < old.file > new.file
Here is the main problem. If I search for the pattern '1 1', I get rid of '1 12' as well. So I want the pattern to match the whole line and, only if it does, to delete that line.
Q2: Automation of question 1
I am also trying to automate this problem. The range of numbers in the first column and the second column could be from 1 to 25.
So far this is what I got:
for ((i=1;i<26;i++)); do
sed "/'$i' '$i'/d" < oldfile > newfile; mv newfile oldfile;
done
This does nothing to the oldfile in the end. :(
This would be more readable with awk:
awk '$1 == $2 {next} {print}' oldfile > newfile
Update based on comment:
If the requirement is to remove lines where the two values are within 1 of each other:
awk '{d = $1-$2; if (-1 <= d && d <= 1) next; else print}' oldfile
Unfortunately, awk does not have abs() (at least nawk and gawk don't)
Just put the first number in a group (\([0-9]*\)) and then look for it with a backreference (\1). Since the line to delete should contain only the group, repeated, use the ^ to mark the beginning of line and the $ to mark the end of line. For example, for the following file:
$ cat input
1 1
3 1
12 1
1 12
12 12
12 13
13 13
25 24
23 24
...the result is:
$ sed '/^\([0-9]*\) \1$/d' input
3 1
12 1
1 12
12 13
25 24
23 24
You can also do it with grep:
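grep -E -v '^([0-9]+) \1$' testfile
Look for one or more digits at the start of the line and remember them, followed by a single space, followed by exactly the digits you remembered and then the end of the line. (Backreferences with -E work in GNU grep; the anchors keep '1 12' from matching.)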
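The whole-line backreference idea can also be sketched in Python. This is an illustration only; the names SAME_PAIR and drop_equal_pairs are hypothetical, and the input is assumed to be lines of two numbers separated by exactly one space, without trailing newlines.

```python
import re

# One or more digits, a single space, then the same digits again.
SAME_PAIR = re.compile(r"(\d+) \1")


def drop_equal_pairs(lines):
    """Keep only the lines whose two numbers differ."""
    # fullmatch requires the pattern to cover the entire line,
    # which plays the role of the ^ and $ anchors in the sed version.
    return [ln for ln in lines if not SAME_PAIR.fullmatch(ln)]


data = ["1 1", "3 1", "12 1", "1 12", "12 12", "12 13", "13 13", "25 24", "23 24"]
for ln in drop_equal_pairs(data):
    print(ln)
```

Because the match has to span the full line, "1 12" survives even though it starts with "1 1", and no loop over the values 1 to 25 is needed.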