How to subtract a column of values with another column of values? - perl

I have two files with one column of values, like attachments below. I want to subtract the first value of file2 with all values in file1, these subtractions fill the first column in the output file, and then second value of file2 with all values in file1 ...
The output would look like this:
-4(2-6) -5 0 1
0(2-2) -1 4 5
-2(2-4) -3 2 3
-3(2-5) -4 1 2
-6(2-8) -7 -2 -1
Expressions in brackets in first column are only for explanation use and need to be discarded in output.
Also, the number of values in the column can vary.
Many thanks!
file1 file2
6 2
2 1
4 6
5 7
8

If I understand this correctly, then
awk 'NR == FNR { src[FNR] = $1; next } { for(i = 1; i <= length(src); ++i) { printf("%d\t", src[i] - $1); } print ""; }' file2 file1
produces the desired output. This works as follows:
NR == FNR { # while the first file (file2) is read:
src[FNR] = $1 # remember the numbers in an array
next # and we're done
}
{ # after that (when processing file1):
for(i = 1; i <= length(src); ++i) { # loop through the saved numbers
printf("%d\t", src[i] - $1) # subtract the current number from it,
# print result followed by tab
}
print "" # when done, print a newline
}
EDIT: Since the question was edited to use one file with two columns instead of two with one each: The code can be slightly tweaked for that scenario as follows:
awk 'NR == FNR && NF > 1 { src[FNR] = $2 } NR != FNR && $1 != "" { for(i = 1; i <= length(src); ++i) { printf("%d\t", src[i] - $1); } print ""; }' file file
This follows the same basic pattern: Two passes are done over the file, one in which the numbers of the second column are saved and another in which the output is calculated and printed. The main addition is handling for empty fields:
NR == FNR && NF > 1 { # If this is the first pass and there is
src[FNR] = $2 # a second field, remember it
}
NR != FNR && $1 != "" { # If this is the second pass and there is
for(i = 1; i <= length(src); ++i) { # a first field, process it as before.
printf("%d\t", src[i] - $1)
}
print ""
}
Alternatively, it could be done in one pass as follows:
awk '$1 != "" { a[NR] = $1 } NF > 1 { b[NR] = $2 } END { for(i = 1; i <= length(b); ++i) { for(j = 1; j <= length(a); ++j) { printf("%d\t", b[i] - a[j]) } print "" } }' file
That is:
$1 != "" { a[NR] = $1 } # If there is a first field, remember it
NF > 1 { b[NR] = $2 } # If there is a second field, remember it
END { # After reaching the end of the file,
for(i = 1; i <= length(b); ++i) { # process the saved data as before.
for(j = 1; j <= length(a); ++j) {
printf("%d\t", b[i] - a[j])
}
print ""
}
}

Related

Sum of Primes always returns 0

I have a code in Perl which takes in a number and adds up all the prime numbers up to that number. I keep on getting the value 0 which means it is not updating my $sum variable, but I don't know what else to do.
sub checkPrime {
my($numb) = #_;
$primeCheck = "prime\n";
if ($numb == 1) {
$primeCheck = "notPrime\n";
}
for ($i = 2; $i < $numb; $i++) {
$mod = $numb % $i;
if ($mod == 0) {
$primeCheck = "notPrime\n"
}
}
return $primeCheck;
}
sub sumOfPrimes {
my($input) = #_;
$sum = 0;
for ($i = 2; $i <= $input; $i++) {
if (checkPrime($i) eq "prime") {
$sum = $sum + $i;
}
}
return $sum;
}
print sumOfPrimes(10);
You are not comparing the correct string. You include a newline character (\n) when you set the value, but not when you compare it. Change:
if (checkPrime($i) eq "prime")
to:
if (checkPrime($i) eq "prime\n")
That is the simplest change, but you probably don't need to have \n in there at all.
To sum prime numbers you need to identify if the number is a prime number. Let's create a function which returns 1 if the number is prime and 0 otherwise.
sub isPrime {
my $n = shift;
return 0 unless $n > 1;
for( my $i = 2; $i < $n; $i++ ) {
return 0 if $n % $i == 0;
}
return 1;
}
Now go through the list of numbers and sum only those which is prime
$sum += $num if isPrime($num);

Perl: Finding out if a given number is a prime number

I am trying to write a subroutine that determines whether or not the number passed in is prime, and it's not working correctly. The numbers I'm passing in should not be identified as prime. Is there a logic error, or something about Perl that I'm missing?
sub isPrime {
my ( $n ) = #_;
for ( my $i = 3 ; $i < $n ; $i++ ) {
if ( $n % $i == 0 ) {
return 0;
}
else {
return 1;
}
}
}
At the moment your function is checking just if n is not divisible by 3 because it calls return immediately after the fisrt test.
Try to make the function return 0 within the for loop, and return 1 outside it, or set a flag for the number being prime that is initially true and return its value after the loop.
You should also start your for loop at 2, not at 3, otherwise you aren't testing for even numbers.
Here is my code I wrote in about 40 minutes. Don't hate if it is inefficient, I am still learning perl.
print ("This is a prime number checker!\n");
print ("Enter a number below to check it:\n");
$y = 0;
$num = <>;
for ($i = $num; $i > 0; $i--) {
if ($num % $i == 0) {
$y += 1;
}
}
if ($y > 2) {
print ("$num is not a prime!");
} else {
print ("$num is a prime!");
}

Removing Lines and columns with all zeros

How can I delete lines (rows) and columns in a text file that contain all the zeros.
For example, I have a file:
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
I want to delete 2nd and 4th line and also the 2nd column. The output should look like:
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
I can do this using sed and egrep
sed '/0 0 0 0/d' or egrep -v '^(0 0 0 0 )$'
for lines with zeros but that would too inconvenient for files with thousands of columns. I have no idea how can I remove the column with all zeros, 2nd column here.
Perl solution. It keeps all the non-zero lines in memory to be printed at the end, because it cannot tell what columns will be non-zero before it processes the whole file. If you get Out of memory, you may only store the numbers of the lines you want to output, and process the file again while printing the lines.
#!/usr/bin/perl
use warnings;
use strict;
my #nonzero; # What columns where not zero.
my #output; # The whole table for output.
while (<>) {
next unless /1/;
my #col = split;
$col[$_] and $nonzero[$_] ||= 1 for 0 .. $#col;
push #output, \#col;
}
my #columns = grep $nonzero[$_], 0 .. $#nonzero; # What columns to output.
for my $line (#output) {
print "#{$line}[#columns]\n";
}
Rather than storing lines in memory, this version scans the file twice: Once to find the "zero columns", and again to find the "zero rows" and perform the output:
awk '
NR==1 {for (i=1; i<=NF; i++) if ($i == 0) zerocol[i]=1; next}
NR==FNR {for (idx in zerocol) if ($idx) delete zerocol[idx]; next}
{p=0; for (i=1; i<=NF; i++) if ($i) {p++; break}}
p {for (i=1; i<=NF; i++) if (!(i in zerocol)) printf "%s%s", $i, OFS; print ""}
' file file
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
A ruby program: ruby has a nice array method transpose
#!/usr/bin/ruby
def remove_zeros(m)
m.select {|row| row.detect {|elem| elem != 0}}
end
matrix = File.readlines(ARGV[0]).map {|line| line.split.map {|elem| elem.to_i}}
# remove zero rows
matrix = remove_zeros(matrix)
# remove zero rows from the transposed matrix, then re-transpose the result
matrix = remove_zeros(matrix.transpose).transpose
matrix.each {|row| puts row.join(" ")}
Another awk variant:
awk '{show=0; for (i=1; i<=NF; i++) {if ($i!=0) show=1; col[i]+=$i;}} show==1{tr++; for (i=1; i<=NF; i++) vals[tr,i]=$i; tc=NF} END{for(i=1; i<=tr; i++) { for (j=1; j<=tc; j++) { if (col[j]>0) printf("%s%s", vals[i,j], OFS)} print ""; } }' file
Expanded Form:
awk '{
show=0;
for (i=1; i<=NF; i++) {
if ($i != 0)
show=1;
col[i]+=$i;
}
}
show==1 {
tr++;
for (i=1; i<=NF; i++)
vals[tr,i]=$i;
tc=NF
}
END {
for(i=1; i<=tr; i++) {
for (j=1; j<=tc; j++) {
if (col[j]>0)
printf("%s%s", vals[i,j], OFS)
}
print ""
}
}' file
Try this:
perl -n -e '$_ !~ /0 0 0 0/ and print' data.txt
Or simply:
perl -n -e '/1/ and print' data.txt
Where data.txt contains your data.
In Windows, use double quotes:
perl -n -e "/1/ and print" data.txt
All together:
$ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file | awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}} END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j] } printf "\n"}}'
This makes the row checking:
$ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file
1 0 1 1
1 0 1 0
1 0 0 1
It loops through all the fields of the line. If any of them are "true" (meaning not 0), it prints the line (print) and breaks to next line (next).
This makes the column checking:
$ awk '{l=NR; c=NF;
for (i=1; i<=c; i++) {
a[l,i]=$i;
if ($i) e[i]++
}}
END{
for (i=1; i<=l; i++){
for (j=1; j<=c; j++)
{if (e[j]) printf "%d ",a[i,j] }
printf "\n"
}
}'
It basically saves all the data in the a array, l number of lines, c number of columns. e is an array saving if a column has any value different from 0 or not. Then it loops and prints all fields just when e array index is set, meaning if that column has any non-zero value.
Test
$ cat a
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
$ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' a | awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}} END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j] } printf "\n"}}'
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
previous input:
$ cat file
1 0 1 1
0 0 0 0
1 0 1 0
0 0 0 0
1 0 0 1
$ awk '{for (i=1; i<=NF; i++) {if ($i) {print; next}}}' file | awk '{l=NR; c=NF; for (i=1; i<=c; i++) {a[l,i]=$i; if ($i) e[i]++}} END{for (i=1; i<=l; i++) {for (j=1; j<=c; j++) {if (e[j]) printf "%d ",a[i,j] } printf "\n"}}'
1 1 1
1 1 0
1 0 1
The following script also makes two passes. During the first pass, it saves the line numbers of lines to be omitted from the output and the column indexes that should be included in the output. In the second pass, it outputs those lines and columns. I think this should provide close to the smallest possible memory footprint which might matter if you are dealing with large files.
#!/usr/bin/env perl
use strict;
use warnings;
filter_zeros(\*DATA);
sub filter_zeros {
my $fh = shift;
my $pos = tell $fh;
my %nonzero_cols;
my %zero_rows;
while (my $line = <$fh>) {
last unless $line =~ /\S/;
my #row = split ' ', $line;
my #nonzero_idx = grep $row[$_], 0 .. $#row;
unless (#nonzero_idx) {
$zero_rows{$.} = undef;
next;
}
$nonzero_cols{$_} = undef for #nonzero_idx;
}
my #matrix;
{
my #idx = sort {$a <=> $b } keys %nonzero_cols;
seek $fh, $pos, 0;
local $. = 0;
while (my $line = <$fh>) {
last unless $line =~ /\S/;
next if exists $zero_rows{$.};
print join(' ', (split ' ', $line)[#idx]), "\n";
}
}
}
__DATA__
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
Output:
1 0 1 1
1 1 1 1
0 1 1 1
1 1 0 0
0 0 1 1
Little bit unorthodox solution but fast as hell and small memory consumption:
perl -nE's/\s+//g;$m|=$v=pack("b*",$_);push#v,$v if$v!~/\000/}{$m=unpack("b*",$m);#m=split//,$m;#m=grep{$m[$_]eq"1"}0..$#m;say"#{[(split//,unpack(q(b*),$_))[#m]]}"for#v'
This is my awk solution. It would work with variable number of rows and columns.
#!/usr/bin/gawk -f
BEGIN {
FS = " "
}
{
for (c = 1; c <= NF; ++c) {
v = $c
map[c, NR] = v
ctotal[c] += v
rtotal[NR] += v
}
fields[NR] = NF
}
END {
for (r = 1; r <= NR; ++r) {
if (rtotal[r]) {
append = 0
f = fields[r]
for (c = 1; c <= f; ++c) {
if (ctotal[c]) {
if (append) {
printf " " map[c, r]
} else {
printf map[c, r]
append = 1
}
}
}
print ""
}
}
}
Off the top of my head...
The problem is the columns. How do you know if a column is all zeros until you read in the entire file?
I'm thinking you need an array of the columns with each array being the column. You can push in the amounts. An array of arrays.
The trick is to skip the rows that contain all zeros as you read them in:
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use Data::Dumper;
my #array_of_columns;
for my $row ( <DATA> ) {
chomp $row;
next if $row =~ /^(0\s*)+$/; #Skip zero rows;
my #columns = split /\s+/, $row;
for my $index ( (0..$#columns) ) {
push #{ $array_of_columns[$index] }, $columns[$index];
}
}
# Remove the columns that contain nothing but zeros;
for my $column ( (0..$#array_of_columns) ) {
my $index = $#array_of_columns - $column;
my $values = join "", #{ $array_of_columns[$index] };
if ( $values =~ /^0+$/ ) {
splice ( #array_of_columns, $index, 1 );
}
}
say Dumper \#array_of_columns;
__DATA__
1 0 1 0 1
0 0 0 0 0
1 1 1 0 1
0 1 1 0 1
1 1 0 0 0
0 0 0 0 0
0 0 1 0 1
Of course, you could use Array::Transpose which will transpose your array which makes things much easier.
This is a real tricky and challenging question.. so in order to solve we need to be tricky too :) in my version I depend on script learning, every time we read a new line we check for new field possibility to be omitted and if new change detected we start over.
The check and start over process should not be repeated so often as we will have few rounds
until we get a constant number of fields to omit or zero, then we omit each row zero value at specific position.
#! /usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
open my $fh, '<', 'file.txt' or die $!;
##open temp file for output
open my $temp, '>', 'temp.txt' or die $!;
##how many field you have in you data
##you can increase this by one if you have more fields
my #fields_to_remove = (0,1,2,3,4);
my $change = $#fields_to_remove;
while (my $line = <$fh>){
if ($line =~ /1/){
my #new = split /\s+/, $line;
my $i = 0;
for (#new){
unless ($_ == 0){
#fields_to_remove = grep(!/$i/, #fields_to_remove);
}
$i++;
}
foreach my $field (#fields_to_remove){
$new[$field] = 'x';
}
my $new = join ' ', #new;
$new =~ s/(\s+)?x//g;
print $temp $new . "\n";
##if a new change detected start over
## this should repeat for limited time
## as the script keeps learning and eventually stop
if ($#fields_to_remove != $change){
$change = $#fields_to_remove;
seek $fh, 0, 0;
close $temp;
unlink 'temp.txt';
open $temp, '>', 'temp.txt';
}
} else {
##nothing -- removes 0 lines
}
}
### this is just for showing you which fields has been removed
print Dumper \#fields_to_remove;
I have tested with 9 fields 25mb data file and it worked perfectly it wasn't super fast but it didn't consume much memory as well.
My compact and large-file-compatible alternative using grep and cut. Only drawback : lengthy for large files because of the for loop.
# Remove constant lines using grep
$ grep -v "^[0 ]*$\|^[1 ]*$" $fIn > $fTmp
# Remove constant columns using cut and wc
$ nc=`cat $fTmp | head -1 | wc -w`
$ listcol=""
$ for (( i=1 ; i<=$nc ; i++ ))
$ do
$ nitem=`cut -d" " -f$i $fTmp | sort | uniq | wc -l`
$ if [ $nitem -gt 1 ]; then listcol=$listcol","$i ;fi
$ done
$ listcol2=`echo $listcol | sed 's/^,//g'`
$ cut -d" " -f$listcol2 $fTmp | sed 's/ //g' > $fOut
Checking the rows can be done this way: awk '/[^0[:blank:]]/' file
It just states if a line contains any character that is different from 0 or a <blank> character, then print the line.
If you now want to check the columns, then I suggest an adaptation of Glenn Jackman's answer
awk '
NR==1 {for (i=1; i<=NF; i++) if ($i == 0) zerocol[i]=1; next}
NR==FNR {for (idx in zerocol) if ($idx) delete zerocol[idx]; next}
/[^0[:blank:]]/ {for (i=1; i<=NF; i++) if (i in zerocol) $i=""; print}
' file file

Perl: getting all increasing and decreasing Strips in an array (use in Bioinformatics)

I'm new at Perl and im having trouble at designing a certain function in Perl.
The Function should find and return all Increasing and Decreasing Strips.
What does that mean? Two Positions are neighbors if they're neighboring numbers. i.e. (2,3) or (8,7). A Increasing Strip is an increasing Strip of neighbors. i.e. (3,4,5,6). Decreasing Strip is defined similar. At the beginning of every Array a 0 gets added and at the end the length of the array+1. Single Numbers without neighbors are decreasing. 0 and n+1 are increasing.
So if i have the array (0,3,4,5,9,8,6,2,1,7,10) i should get the following results:
Increasing Strips are: (3,4,5) (10) (0)
Decreasing Strips are: (9,8), (6), (2,1) (7)
I tried to reduce the problem to only getting all Decreasing Strips, but this is as far as i get: http://pastebin.com/yStbgNme
Code here:
sub getIncs{
my #$bar = shift;
my %incs;
my $inccount = 0;
my $i=0;
while($i<#bar-1){
for($j=$i; 1; $j++;){
if($bar[$j] == $bar[$j+1]+1){
$incs{$inccount} = ($i,$j);
} else {
$inccount++;
last;
}
}
}
//edit1: I found a Python-Program that contains said function getStrips(), but my python is sporadic at best. http://www.csbio.unc.edu/mcmillan/Media/breakpointReversalSort.txt
//edit2: Every number is exactly one Time in the array So there can be no overlap.
use strict;
my #s = (0,3,4,5,9,8,6,2,1,7,10);
my $i = 0;
my $j = 0; #size of #s
my $inc = "Increasing: ";
my $dec = "Decreasing: ";
# Prepend the beginning with 0, if necessary
if($s[0] != 0 || #s == 0 ) { unshift #s, 0; }
$j = #s;
foreach(#s) {
# Increasing
if( ($s[$i] == 0) || ($i == $j-1) || ($s[$i+1] - $s[$i]) == 1 || ($s[$i] - $s[$i-1] == 1)) {
if($s[$i] - $s[$i-1] != 1) { $inc .= "("; }
$inc .= $s[$i];
if($s[$i+1] - $s[$i] != 1) { $inc .= ")"; }
if($s[$i+1] - $s[$i] == 1) { $inc .= ","; }
}
#Decreasing
if( ($s[$i]-$s[$i-1] != 1) && ($s[$i+1] - $s[$i] != 1) && ($s[$i] != 0) && ($i != $j-1) ) {
if($s[$i-1] - $s[$i] != 1) { $dec .= "("; }
$dec .= $s[$i];
if($s[$i] - $s[$i+1] != 1) { $dec .= ")"; }
if($s[$i] - $s[$i+1] == 1) { $dec .= ","; }
}
$i++;
}
$inc =~ s/\)\(/\),\(/g;
$dec =~ s/\)\(/\),\(/g;
print "$inc\n";
print "$dec\n";
Result:
Increasing: (0),(3,4,5),(10)
Decreasing: (9,8),(6),(2,1),(7)

Generating Synthetic DNA Sequence with Substitution Rate

Given these inputs:
my $init_seq = "AAAAAAAAAA" #length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;
my #dna = qw( A C G T );
I want to generate:
One thousand length-10 tags
Substitution rate for each position in a tag is 0.003
Yielding output like:
AAAAAAAAAA
AATAACAAAA
.....
AAGGAAAAGA # 1000th tags
Is there a compact way to do it in Perl?
I am stuck with the logic of this script as core:
#!/usr/bin/perl
my $init_seq = "AAAAAAAAAA" #length 10 bp
my $sub_rate = 0.003;
my $nof_tags = 1000;
my #dna = qw( A C G T );
$i = 0;
while ($i < length($init_seq)) {
$roll = int(rand 4) + 1; # $roll is now an integer between 1 and 4
if ($roll == 1) {$base = A;}
elsif ($roll == 2) {$base = T;}
elsif ($roll == 3) {$base = C;}
elsif ($roll == 4) {$base = G;};
print $base;
}
continue {
$i++;
}
As a small optimisation, replace:
$roll = int(rand 4) + 1; # $roll is now an integer between 1 and 4
if ($roll == 1) {$base = A;}
elsif ($roll == 2) {$base = T;}
elsif ($roll == 3) {$base = C;}
elsif ($roll == 4) {$base = G;};
with
$base = $dna[int(rand 4)];
EDIT: Assuming substitution rate is in the range 0.001 to 1.000:
As well as $roll, generate another (pseudo)random number in the range [1..1000], if it is less than or equal to (1000 * $sub_rate) then perform the substitution, otherwise do nothing (i.e. output 'A').
Be aware that you may introduce subtle bias unless the properties of your random number generator are known.
Not exactly what you are looking for, but I suggest you take a look at BioPerl's Bio::SeqEvolution::DNAPoint module. It does not take mutation rate as a parameter though. Rather, it asks what the lower bound of sequence identity with the original you want.
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqEvolution::Factory;
my $seq = Bio::Seq->new(-seq => 'AAAAAAAAAA', -alphabet => 'dna');
my $evolve = Bio::SeqEvolution::Factory->new (
-rate => 2, # transition/transversion rate
-seq => $seq
-identity => 50 # At least 50% identity with the original
);
my #mutated;
for (1..1000) { push #mutated, $evolve->next_seq }
All 1000 mutated sequences will be stored in the #mutated array, their sequences can be accessed via the seq method.
In the event of a substitution, you want to exclude the current base from the possibilities:
my #other_bases = grep { $_ ne substr($init_seq, $i, 1) } #dna;
$base = #other_bases[int(rand 3)];
Also please see Mitch Wheat's answer for how to implement the substitution rate.
I don't know if I understand correctly but I'd do something like this (pseudocode):
digits = 'ATCG'
base = 'AAAAAAAAAA'
MAX = 1000
for i = 1 to len(base)
# check if we have to mutate
mutate = 1+rand(MAX) <= rate*MAX
if mutate then
# find current A:0 T:1 C:2 G:3
current = digits.find(base[i])
# get a new position
# but ensure that it is not current
new = (j+1+rand(3)) mod 4
base[i] = digits[new]
end if
end for