I dont understand this little perl code (if ...) - perl

Can someone explain me this short pearl code?
$batstr2 = "empty" if( $status2 & 4 );
What say the if statement ?

Already answered many times, for the case if you don't know what is the Bitwise And, here is a small example:
perl -e 'print "dec\t bin\t&4\n";printf "%d\t%8b\t%-8b\n", $_, $_, ($_ & 4) for (0..8);'
prints:
dec bin &4
0 0 0
1 1 0
2 10 0
3 11 0
4 100 100
5 101 100
6 110 100
7 111 100
8 1000 0
as you can see, when the 3rb bit from right is 1 - the $num & 4 is true.

That's using the if as a statement modifier. It's roughly the same as
if ($status & 4) {
$batstr2 = "empty";
}
and exactly the same as
($status & 4) and ($batstr2 = "empty");
a variety of constructs can be used as statement modifiers, including: if, unless, while, until, for, when. These modifiers can't be stacked (foo() if $bar for #baz won't work), you are limited for one modifer per simple statement.

That's a bitwise and - http://perldoc.perl.org/perlop.html#Bitwise-And . $status2 is being used as a bit mask and it sets $batstr2 to 'empty' if the bit is set.

It sets $batstr2 to "empty" if the 3rd least significant bit of $status2 is set - it is a logical AND mask.

Related

How to count the numbers of elements in parts of a text file using a loop in Perl?

I´m looking for a way to create a script in Perl to count the elements in my text file and do it in parts. For example, my text file has this form:
ID Position Potential Jury agreement NGlyc result
(PART 1)
NP_073551.1_HCoV229Egp2 23 NTSY 0.5990 (8/9) +
NP_073551.1_HCoV229Egp2 62 NTSS 0.7076 (9/9) ++
NP_073551.1_HCoV229Egp2 171 NTTI 0.5743 (5/9) +
...
(PART 2)
QJY77946.1_NA 20 NGTN 0.7514 (9/9) +++
QJY77946.1_NA 23 NTSH 0.5368 (5/9) +
QJY77946.1_NA 51 NFSF 0.7120 (9/9) ++
QJY77946.1_NA 62 NTSS 0.6947 (9/9) ++
...
(PART 3)
QJY77954.1_NA 20 NGTN 0.7694 (9/9) +++
QJY77954.1_NA 23 NTSH 0.5398 (5/9) +
QJY77954.1_NA 51 NFSF 0.7121 (9/9) ++
...
(PART N°...)
Like you can see the ID is the same in each part (one for PART 1, other to PART 2 and then...). The changes only can see in the columns Position//Potential//Jury agreement//NGlyc result Then, my main goal is to count the line with Potential 0,7 >=.
With this in mind, I´m looking for output like this:
Part 1:
1 (one value 0.7 >=)
Part 2:
2 (two values 0.7 >=)
Part 3:
2 (two values 0.7 >=)
Part N°:
X numbers of values 0.7 >=
This output tells me the number of positive values (0.7 >=) for each ID.
The pseudocode I believe would be something like this:
foreach ID in LIST
foreach LINE in FILE
if (ID is in LINE)
... count the line ...
end foreach LINE
end foreach ID
I´m looking for any suggestion (for a package or script idea) or comment to create a better script.
Thanks! Best!
To count the number of lines, for each part, that match some condition on a certain column, you can just loop over the lines, skip the header, parse the part number, and use an array to count the number of lines matching for each part.
After this you can just loop over the counts recorded in the array and print them out in your specific format.
#!/usr/bin/perl
use strict;
use warnings;
my $part = 0;
my #cnt_part;
while(my $line = <STDIN>) {
if($. == 1) {
next;
}elsif($line =~ m{^\(PART (\d+)\)}) {
$part = $1;
}else {
my #cols = split(m{\s+},$line);
if(#cols == 6) {
my $potential = $cols[3];
if(0.7 <= $potential) {
$cnt_part[$part]++;
};
};
};
};
for(my $i=1;$i<=$#cnt_part;$i++){
print "Part $i:\n";
print "$cnt_part[$i] (values 0.7 <=)\n";
};
To run it, just pipe the entire file through the Perl script:
cat in.txt | perl count.pl
and you get an output like this:
Part 1:
1 (values 0.7 <=)
Part 2:
2 (values 0.7 <=)
Part 3:
2 (values 0.7 <=)
If you want to also display the counts into words, you can use Lingua::EN::Numbers (see this program ) and you get an output very similar to the one in your post:
Part 1:
1 (one values 0.7 <=)
Part 2:
2 (two values 0.7 <=)
Part 3:
2 (two values 0.7 <=)
All the code in this post is also available here.

perl6, how to match 1 to 10000 times except prime number of times?

What is the best way to match a string that occurs anywhere from 1 to 10000 times except prime number of times?
say so "xyz" ~~ m/ <[x y z]> ** <[ 1..10000] - [ all prime numbers ]> /
Thanks !!!
Not necessarily the best way (in particular, it will create up to 10_000 submatch objects), but a way:
$ perl6 -e 'say "$_ ", so <x y z>.roll x $_ ~~ /^ (<[xyz]>) ** 1..10_000 <!{$0.elems.is-prime}> $/ for 1..10'
1 True
2 False
3 False
4 True
5 False
6 True
7 False
8 True
9 True
10 True
If the substring of interest has fixed length, you could also capture the repetition as a whole and check its length, avoiding submatch creation.

Output from calculation is messed in perl one-liner

I'm trying to do some calculations on the columns of a tab delimited file using this perl one-liner:
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/e && s/$F[3]/$F[3]\/$F[4]/e}' infile
the idea is to get A and B columns divided by C column
infile:
X Y A B C
5001 3 1.03333 0.652549 4215
6001 4 1.2 0.723137 4870
7001 2 1 0.807843 5153
8001 2 1 0.807843 5355
9001 2 1 0.807843 5389
10001 2 1 0.807843 4955
11001 7 1.7671 1.05573 4966
12001 17 8.18802 4.72554 5124
But the output is this:
X Y A B C
5001 3 0.000245155397390273 0.000154815895610913 4215
6001 4 0.000246406570841889 0.000148488090349076 4870
7000.000194061711624297 2 1 0.000156771395303707 5153
8000.000186741363211951 2 1 0.000150857703081232 5355
9000.000185563184264242 2 1 0.000149905919465578 5389
0.0002018163471241170001 2 1 0.000163035923309788 4955
11001 7 0.000355839710028192 0.000212591623036649 4966
12001 17 0.00159797423887588 0.000922236533957845 5124
What is going on on the 3rd to 6th lines? How can manage to fix this?
Thanks.
EDIT:
I removed the /e option from the substitute command and it seems that the calculation is being performed on the wrong column.
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/ && s/$F[3]/$F[3]\/$F[4]/}' infile
X Y A B C
5001 3 1.03333/4215 0.652549/4215 4215
6001 4 1.2/4870 0.723137/4870 4870
7001/5153 2 1 0.807843/5153 5153
8001/5355 2 1 0.807843/5355 5355
9001/5389 2 1 0.807843/5389 5389
1/49550001 2 1 0.807843/4955 4955
11001 7 1.7671/4966 1.05573/4966 4966
12001 17 8.18802/5124 4.72554/5124 5124
13001 30 13.8763/5138 8.05385/5138 5138
After substitution and evaluation, you have something like s/1/0.000194061711624297/. So the s operator looks for a 1 and finds it as part of the first column. Whoops. If we add some \b word-boundary markers, we can force the match part of the s operators to match a complete column, never just part of a column:
perl -ape 'if (/^\d/) { s/\b$F[2]\b/$F[2]\/$F[4]/e && s/\b$F[3]\b/$F[3]\/$F[4]/e}' infile
But that's still going to run into issues if it's possible for column X to equal column A or B. Better to just do the calculations and then replace the entire line by assigning to $_:
perl -ape 'if (/^\d/) { $F[2] /= $F[4]; $F[3] /= $F[4]; $_ = join(" ", #F); }'
Use sprintf instead of join if you want a particular format to the output.
Your basic problem is that you are substituting the value that is in column 3 and 4 whereever they appear in the whole line. For row 3, for example, you are doing s/1/1\/5153/e which affects the first occurrence of the digit 1 in the line, not necessarily the 1 that happens to be in column 3.
Try this:
perl -lane 'if ($F[4] =~ /[1-9]/) { $F[2] /= $F[4]; $F[3] /= $F[4] } print join "\t", #F' infile
If you want to limit the precision, do something like $F[2] = sprintf "%f", $F[2]/$F[4]; ...

Perl: Devel::Gladiator module and memory management

I have a perl script that needs to run in the background constantly. It consists of several .pm module files and a main .pl file. What the program does is to periodically gather some data, do some computation, and finally update the result recorded in a file.
All the critical data structures are declared in the .pl file with our, and there's no package variable declared in any .pm file.
I used the function arena_table() in the Devel::Gladiator module to produce some information about the arena in the main loop, and found that the SVs of type SCALAR and GLOB are increasing slowly, resulting in a gradual increase in the memory usage.
The output of arena_table (I reformat them, omitting the title. after a long enough period, only the first two number is increasing):
2013-05-17#11:24:34 36235 3924 3661 3642 3376 2401 720 201 27 23 18 13 13 10 2 2 1 1 1 1 1 1 1 1 1 1
After running for some time:
2013-05-17#12:05:10 50702 46169 36910 4151 3995 3924 2401 720 274 201 26 23 18 13 13 10 2 2 1 1 1 1 1 1 1 1 1
The main loop is something like:
our %hash1 = ();
our %hash2 = ();
# and some more package variables ...
# all are hashes
do {
my $nowtime = time();
collect_data($nowtime);
if (calculate() == 1) {
update();
}
sleep 1;
get_mem_objects(); # calls arena_table()
} while (1);
Except get_mem_objects, other functions will operate on the global hashes declared by our. In update, the program will do some log rotation, the code is like:
sub rotate_history() {
my $i = $main::HISTORY{'count'};
if ($i == $main::CONFIG{'times'}{'total'}) {
for ($i--; $i >= 1; $i--) {
$main::HISTORY{'data'}{$i} = dclone($main::HISTORY{'data'}{$i-1});
}
} else {
for (; $i >= 1; $i--) {
$main::HISTORY{'data'}{$i} = dclone($main::HISTORY{'data'}{$i-1});
}
}
$main::HISTORY{'data'}{'0'} = dclone(\%main::COLLECT);
if ($main::HISTORY{'count'} < $main::CONFIG{'times'}{'total'}) {
$main::HISTORY{'count'}++;
}
}
If I comment the calls to this function, in the final report given by Devel::Gladiator, only the SVs of type SCALAR is increasing, the number of GLOBs will finally enter a stable state. I doubt the dclone may cause the problem here.
My questions are,
what exactly does the information given by that module mean? The statements in the perldoc is a little vague for a perl newbie like me.
And, what are the common skills to lower the memory usage of long-running perl scripts?
I know that package variables are stored in the arena, but how about the lexical variables? How are the memory consumed by them managed?

How can I searching for different variants of bioinformatics motifs in string, using Perl?

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?
I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful to potential helpers on SO have to have to play 20 questions in comments to try and understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates a string of 2 character pairs 5,428 pairs long in an array of 1,000 elements long. I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;
my #bps;
push(#bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] }
0..5428)) for(1..1_000);
my $len=length($bps[0]);
my $s_count= scalar #bps;
print "$s_count random strings generated $len characters long\n" ;
my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my #a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example
foreach my $i (0..$#bps) {
my $cand=substr($bps[$i], $offset, $nlen);
my #a1=split(//, $cand);
$diffs=0;
foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
next if $diffs > 3;
push (#{$HoA{$cand}}, $i);
}
foreach my $hit (keys %HoA) {
my #a1=split(//, $hit);
$diffs=0;
my $ds="";
foreach my $j (0..$#a1) {
if($a1[$j] eq $a2[$j]) {
$ds.=" ";
} else {
$diffs++;
$ds.=$a1[$j];
}
}
print "Target: $target\n",
"Candidate: $hit\n",
"Differences: $ds $diffs differences\n",
"Array element: ";
foreach (#{$HoA{$hit}}) {
print "$_ " ;
}
print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.
When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;
$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence,'n'); #Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.
$tandy->SetParams(4,2,3,3,4);
#The parameters are, in order:
# repeat unit size
# min number of repeat units to require a hit
# allowed mistakes per unit (an upper bound for "mistake concentration")
# allowed mistakes per window (a lower bound for "mistake concentration")
# number of units in a "window"
while(#repeat_info = $tandy->FindRepeat())
{print(join("\t",#repeat_info),"\n")}
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note, that it's easy to supply parameters (as you can see from the output above) that will output junk/insignificant "repeats", but if you know how to supply good params, it can find what you set it upon finding.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).