Perl recursion help

I'm trying to write a program that calculates the dividends of stocks. I did this without a subroutine; now I'm trying to modify it so it runs using a recursive routine. Any help with this? I'm not very good at it.
Here's the original script + a pathetic attempt.
print "A stock xyz's price is now $100. It has 3.78% dividend. You have 1000 of it and reinvest the dividend into the stock.\n";
my %hash;
@stocknum = 1000;
@dividend = 6780;
while ($#dividend != 20) {
$a = $dividend[-1];
$stock = $stocknum[-1];
$div_total= $stock*100*0.0678;
$stock_total = $stock + int($a/100);
push (@stocknum, $stock_total);
push (@dividend, $div_total);
if ($#dividend == 20) {
last;
}
}
shift (@dividend);
$stock_num = $stocknum[-1];
$div = $stock_num*100*0.0678;
push (@dividend, $div);
@hash{@stocknum} = @dividend;
foreach $key(sort keys %hash) {
print "Stock number: $key\t"."Dividend: $hash{$key}\n";
}
$dividend=0.0378;

I don't think you want recursion. I think you just want to loop over the number of cycles of payouts that you're after. It looks like you're getting all mixed up with arrays for some reason.
print <<'HERE';
A stock xyz's price is now $100. It has 6.78% dividend.
You have 1000 of it and reinvest the dividend into the stock.
HERE
my $shares = 1000;
my $price = 100;
my $dividend = 6.78 / 100;
my $cycles = $ARGV[0] || 20;
foreach ( 1 .. $cycles ) {
local $cycle = $_;
local $payout = $shares * $dividend * $price;
local $new_shares = $payout / $price;
write();
$shares += $new_shares;
}
format STDOUT =
@### @####.###### @#####.####### @##.###### @####.######
$cycle, $shares, $payout, $new_shares, $shares+$new_shares,
.
format STDOUT_TOP =
@##.####%
$dividend
Cycle Shares Payout New Shares Total Shares
----------------------------------------------------------------
.
This gives me the output:
A stock xyz's price is now $100. It has 6.78% dividend.
You have 1000 of it and reinvest the dividend into the stock.
0.0678%
Cycle Shares Payout New Shares Total Shares
----------------------------------------------------------------
1 1000.000000 6780.0000000 67.800000 1067.800000
2 1067.800000 7239.6840000 72.396840 1140.196840
3 1140.196840 7730.5345752 77.305346 1217.502186
4 1217.502186 8254.6648194 82.546648 1300.048834
5 1300.048834 8814.3310942 88.143311 1388.192145
6 1388.192145 9411.9427423 94.119427 1482.311572
7 1482.311572 10050.0724603 100.500725 1582.812297
8 1582.812297 10731.4673731 107.314674 1690.126971
9 1690.126971 11459.0608610 114.590609 1804.717579
10 1804.717579 12235.9851873 122.359852 1927.077431
11 1927.077431 13065.5849830 130.655850 2057.733281
12 2057.733281 13951.4316449 139.514316 2197.247597
13 2197.247597 14897.3387104 148.973387 2346.220985
14 2346.220985 15907.3782750 159.073783 2505.294767
15 2505.294767 16985.8985220 169.858985 2675.153752
16 2675.153752 18137.5424418 181.375424 2856.529177
17 2856.529177 19367.2678194 193.672678 3050.201855
18 3050.201855 20680.3685775 206.803686 3257.005541
19 3257.005541 22082.4975671 220.824976 3477.830517
20 3477.830517 23579.6909021 235.796909 3713.627426
Don't worry about my use of format; I've had that on the brain this weekend since I rewrote some perlfaq stuff about it and then turned it into "Use formats to create paginated, plaintext reports". You could just as easily have created the output with printf:
print <<'HERE';
A stock xyz's price is now $100. It has 6.78% dividend.
You have 1000 of it and reinvest the dividend into the stock.
Cycle Shares Payout New Shares Total Shares
----------------------------------------------------------------
HERE
my $shares = 1000;
my $price = 100;
my $dividend = 6.78 / 100;
my $cycles = $ARGV[0] || 20;
foreach ( 1 .. $cycles ) {
my $payout = $shares * $dividend * $price;
my $new_shares = $payout / $price;
printf "%4d %12.6f %12.6f %10.6f %12.6f\n",
$_, $shares, $payout, $new_shares, $shares + $new_shares;
$shares += $new_shares;
}
As a side note, you really don't ever want recursion, and especially not in Perl if you can help it. Other languages get away with it because they know how to unroll your recursion to turn it into an iterative process. Perl, being a dynamic language, can't really do that because it doesn't know if the subroutine will have the same definition on the next go around. It's nice as a computer science topic because it makes the programming marginally easier and they know it all works out in the end. I think I talk about this in Mastering Perl somewhere, but Mark Jason Dominus covers it extensively in Higher-Order Perl. Basically, instead of recursion you use a queue, which is a better skill to practice anyway.
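For instance, here is a minimal sketch of the queue technique applied to walking a directory tree, a job people usually reach for recursion to do (the code is my own sketch, not from either book):

use strict;
use warnings;

# A queue of pending directories replaces the call stack that
# recursion would have used.
my @queue = ('.');
while (@queue) {
    my $dir = shift @queue;            # take the oldest entry (FIFO)
    opendir my $dh, $dir or next;
    for my $entry (readdir $dh) {
        next if $entry eq '.' or $entry eq '..';
        my $path = "$dir/$entry";
        if (-d $path) { push @queue, $path }   # defer subdirectories
        else          { print "$path\n" }
    }
    closedir $dh;
}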

Related

How can I improve Perl compare performance

I have an array ref of about 50,000 users. I want to go through all those users and compare each one to all the others in order to build a weighted list of matches (if the name is an exact match it's worth x, a partial match is worth y, etc.).
After going through the list and doing all the checks, I then want to get the 10 highest-weighted matches. Here is sort of an example of what I'm doing, to help explain:
#!/usr/bin/perl
######################################################################
# Libraries
# ---------
use strict;
use warnings;
my $users = [];
$users->[0]{'Name'} = 'xxx';
$users->[0]{'Address'} = 'yyyy';
$users->[0]{'Phone'} = 'xxx';
$users->[1]{'Name'} = 'xxx';
$users->[1]{'Address'} = 'yyyy';
$users->[1]{'Phone'} = 'xxx';
$users->[2]{'Name'} = 'xxx';
$users->[3]{'Address'} = 'yyyy';
$users->[4]{'Phone'} = 'xxx';
foreach my $user_to_check (@$users) {
my $matched_users = [];
foreach my $user (@$users) {
$user_to_check->{'Weight'} = 0;
if (lc($user_to_check->{'Name'}) eq lc($user->{'Name'})) {
$user_to_check->{'Weight'} = ($user_to_check->{'Weight'} + 10);
} elsif ((length($user_to_check->{'Name'}) > 2) && (length($user->{'Name'}) > 2) && ($user_to_check->{'Name'} =~ /\Q$user->{'Name'}\E/i)) {
$user_to_check->{'Weight'} = ($user_to_check->{'Weight'} + 5);
}
if (lc($user_to_check->{'Address'}) eq lc($user->{'Address'})) {
.....
}
if ($user_to_check->{'Weight'} > 0) {
# We have matches, add to matched users
push (@$matched_users, $user);
}
}
# Now we want to get just the top 10 highest matching users
foreach my $m_user (sort { $b->{'Weight'} <=> $a->{'Weight'} } @$matched_users ) {
last if $counter == 10;
.... # Do stuff with the 10 we want
}
}
The problem is, it's sooo slow. It takes more than a day to run (and I've tried it on multiple machines). I know the sort is a killer, but I also tried inserting the results into a temporary MySQL table and doing an ORDER BY select at the end instead of the Perl sort; the difference in time was very minor.
As I'm just going through an existing data structure and comparing it, I'm not sure what I could do (if anything) to speed it up. I'd appreciate any advice.
O(n²)
You compare each element in @$users against every element in there. That is 5E4² = 2.5E9 comparisons. For example, you don't need to compare an element against itself. You also don't need to compare an element against one you have already compared. I.e., in this comparison table
X Y Z
X - + +
Y - - +
Z - - -
only three comparisons have to be made to compare each element against all others. The nine comparisons you are doing are 66% unnecessary (asymptotically: 50% unnecessary).
You can implement this by looping over indices:
for my $i (0 .. $#$users) {
my $userA = $users->[$i];
for my $j ($i+1 .. $#$users) {
my $userB = $users->[$j];
...;
}
}
But this means that upon a match, you have to increment the weight of both matching users.
Do things once, not 100,000 times
You lowercase the name of each user 1E5 times. This is 1E5 - 1 times too many! Just do it once for each element, possibly at data input.
As a side note, you shouldn't perform lowercasing, you should do case folding. This is available since v5.16 via the fc feature. Just lowercasing will be buggy when you have non-English data.
use feature 'fc'; # needs v5.16
$user->[NAME] = fc $name;
or
use Unicode::CaseFold;
$user->[NAME] = fc $name;
When hashes are not fast enough
Hashes are fast, in that a lookup takes constant time. But a single hash lookup is more expensive than an array access. As you only have a small, predefined set of fields, you can use the following trick to use hash-like arrays:
Declare some constants with the names of your fields that map to indices, e.g.
use constant {
WEIGHT => 0,
NAME => 1,
ADDRESS => 2,
...;
};
And then put your data into arrays:
$users->[0][NAME] = $name; ...;
You can access the fields like
$userA->[WEIGHT] += 10;
While this looks like a hash, this is actually a safe method to access only certain fields of an array with minimal overhead.
Regexes are slow
Well, they are quite fast, but there is a better way to determine if a string is a substring of another string: use index. I.e.
$user_to_check->{'Name'} =~ /\Q$user->{'Name'}\E/i
Can be written as
(-1 != index $user_to_check->{Name}, $user->{Name})
assuming both are already case folded.
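If you want to verify that claim on your own data, here is a quick sketch using the core Benchmark module (the strings are made up for the example):

use strict;
use warnings;
use feature 'fc';     # needs v5.16
use Benchmark qw(cmpthese);

my $haystack = fc "Johannes Chrysostomus Wolfgangus Theophilus Mozart";
my $needle   = fc "wolfgang";

# Run each variant for about 3 CPU seconds and compare rates.
cmpthese(-3, {
    regex => sub { my $hit = $haystack =~ /\Q$needle\E/ },
    index => sub { my $hit = -1 != index $haystack, $needle },
});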
Alternative implementation
Edit: this appears to be invalidated by your edit to your question. This assumed you were trying to find some global similarities, not to obtain a set of good matches for each user
Implementing these ideas would make your loops look somewhat like
for my $i (0 .. $#$users) {
my $userA = $users->[$i];
for my $j ($i+1 .. $#$users) {
my $userB = $users->[$j];
if ($userA->[NAME] eq $userB->[NAME]) {
$userA->[WEIGHT] += 10;
$userB->[WEIGHT] += 10;
} elsif ((length($userA->[NAME]) > 2) && (length($userB->[NAME]) > 2)) {
$userA->[WEIGHT] += 5 if -1 != index $userA->[NAME], $userB->[NAME];
$userB->[WEIGHT] += 5 if -1 != index $userB->[NAME], $userA->[NAME];
}
if ($userA->[ADDRESS] eq $userB->[ADDRESS]) {
..... # More checks
}
}
}
my (@top_ten) = (sort { $b->[WEIGHT] <=> $a->[WEIGHT] } @$users)[0 .. 9];
Divide and conquer
The task you show is highly parallelizable. If you have the memory, using threads is easy here:
use threads;
use Thread::Queue;

my $top10 = Thread::Queue->new;
my $users = ...; # each thread gets a copy of this data
my @threads = map threads->create(\&worker, @$_),
    [0, int($#$users/2)], [int($#$users/2)+1, $#$users];
# process output from the threads
while (defined(my $ret = $top10->dequeue)) {
my ($user, @top10) = @$ret;
...;
}
$_->join for @threads;
sub worker {
my ($from, $to) = @_;
for my $i ($from .. $to) {
my $userA = $users->[$i];
for my $userB (@$users) {
...;
}
my @top10 = ...;
$top10->enqueue([ $userA, @top10 ]); # yield data to the main thread
}
}
You should probably return your output via a queue (as shown here), but do as much processing as possible inside the threads. With more advanced partitioning of the workload, you should spawn as many threads as you have processors available.
But if any kind of pipelining, filtering or caching can decrease the number of iterations needed in the nested loops, you should do such optimizations (think map-reduce-style programming).
Edit: Elegantly reducing complexity through hashes for deduplication
What we are essentially doing is calculating a matrix of how good our records match, e.g.
X Y Z
X 9 4 5
Y 3 9 2
Z 5 2 9
If we assume that X is similar to Y implies Y is similar to X, then the matrix is symmetric, and we only need half of it:
X Y Z
X \ 4 5
Y \ 2
Z \
Such a matrix is equivalent to a weighted, undirected graph:
4 X 5 | X – Y: 4
/ \ | X – Z: 5
Y---Z | Y – Z: 2
2 |
Therefore, we can represent it elegantly as a hash of hashes:
my %graph;
$graph{X}{Y} = 4;
$graph{X}{Z} = 5;
$graph{Y}{Z} = 2;
However, such a hash structure implies a direction (from node X to node Y). To make querying the data easier, we might as well include the other direction too (due to the implementation of hashes, this won't lead to a large memory increase).
$graph{$x}{$y} = $graph{$y}{$x} += 2;
Because each node is now only connected to those nodes it is similar to, we don't have to sort through 50,000 records. For the 100th record, we can get the ten most similar nodes like
my $node = 100;
my @top10 = (sort { $graph{$node}{$b} <=> $graph{$node}{$a} } keys %{ $graph{$node} })[0 .. 9];
This would change the implementation to
my %graph;
# build the graph, using the array indices as node ID
for my $i (0 .. $#$users) {
my $userA = $users->[$i];
for my $j ($i+1 .. $#$users) {
my $userB = $users->[$j];
if ($userA->[NAME] eq $userB->[NAME]) {
$graph{$j}{$i} = $graph{$i}{$j} += 10;
} elsif ((length($userA->[NAME]) > 2) && (length($userB->[NAME]) > 2)) {
$graph{$j}{$i} = $graph{$i}{$j} += 5
if -1 != index $userA->[NAME], $userB->[NAME]
or -1 != index $userB->[NAME], $userA->[NAME];
}
if ($userA->[ADDRESS] eq $userB->[ADDRESS]) {
..... # More checks
}
}
}
# the graph is now fully populated.
# do something with each top10
while (my ($node_id, $similar) = each %graph) {
my @most_similar_ids = (sort { $similar->{$b} <=> $similar->{$a} } keys %$similar)[0 .. 9];
my ($user, @top10) = @$users[ $node_id, @most_similar_ids ];
...;
}
Building the graph this way should take half the time of naive iteration, and if the average number of edges for each node is low enough, going through similar nodes should be considerably faster.
Parallelizing this is a bit harder, as the graphs the threads produce have to be combined before the data can be queried. For this, it would be best for each thread to perform the above code, except that the iteration bounds are given as parameters and only one edge should be produced per pair. The pairs of edges will be completed in the combination phase:
THREAD A [0 .. 2/3] partial
\ graph
=====> COMBINE -> full graph -> QUERY
/ partial
THREAD B [2/3 .. 1] graph
# note bounds recognizing the triangular distribution of workload
However, this is only beneficial if there are only very few similar nodes for a given node, as combination is expensive.
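A minimal sketch of that combination phase (assuming each thread returns its partial graph as a hash of hashes; @partial_graphs and the other names are mine):

# Merge the partial graphs into one full graph. Each partial graph
# stores only one direction per edge; the mirror entry is added here.
my %graph;
for my $partial (@partial_graphs) {       # one per thread
    while (my ($from, $edges) = each %$partial) {
        while (my ($to, $weight) = each %$edges) {
            $graph{$from}{$to} += $weight;
            $graph{$to}{$from} = $graph{$from}{$to};
        }
    }
}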

Converting numbers to K/M/G/T and days/hours/min?

Is there an easy way to convert numbers like these
123 -> 123B
1234 -> 1.2K
12345 -> 12.3K
123456 -> 123.4K
1234567 -> 1.2M
12345678 -> 12.3M
123456789 -> 123.4M
...
and ideally also convert large numbers into days/hours/min?
This was discussed on Stack Overflow using Time::Piece. One of the answers comes close to calculating days/hours/minutes. From what I've read about this question before, I think you can easily code it up like this:
sub dhms {
my $seconds = shift;
my $days = int $seconds / 86400;
$seconds %= 86400;
my $hours = int $seconds / 3600;
$seconds %= 3600;
my $mins = int $seconds / 60;
my $secs = $seconds % 60;
return $days, $hours, $mins, $secs;
}
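For example (my own numbers, not from the original discussion):

my ($d, $h, $m, $s) = dhms(200_000);
printf "%dd %dh %dm %ds\n", $d, $h, $m, $s;   # prints "2d 7h 33m 20s"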
Update: daxim's answer using DateTime::Format::Duration does this as well.
Suffixing is quite simple actually, once we understand the relationship between the letters K, M, G, T and the factors they represent:
K = 10^3 = (10^3)^1
M = 10^6 = (10^3)^2
G = 10^9 = (10^3)^3
T = 10^12 = (10^3)^4
The next important thing to realize is that (10^3)^0 = 1, which is what the B suffix stands for.
We want to select the largest suffix whose value is smaller than the value we want to transform. To do that, we put the suffixes into an array:
my @suffixes = qw/ B K M G T /;
so that
$suffixes[$i] == 1000 ** $i # conceptually
Now it's just a matter of looping over the indices (probably in reverse) and stopping as soon as $val >= 1000 ** $i.
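A minimal sketch of that loop (the function name is mine):

use strict;
use warnings;

my @suffixes = qw/ B K M G T /;

sub humanize {
    my $val = shift;
    # walk the suffixes from largest factor to smallest and stop at
    # the first one that fits into the value
    for my $i (reverse 0 .. $#suffixes) {
        my $factor = 1000 ** $i;
        next if $val < $factor;
        return $i == 0 ? "${val}B"
                       : sprintf "%.1f%s", $val / $factor, $suffixes[$i];
    }
    return "${val}B";    # covers 0
}

print humanize($_), "\n" for 123, 1234, 12345678;   # 123B, 1.2K, 12.3M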

Randomly selecting letters by frequency of use

After feeding a few Shakespeare books to my Perl script, I have a hash with the 26 English letters as keys and the number of their occurrences in the texts as values:
%freq = (
a => 24645246,
b => 1409459,
....
z => 807451,
);
and of course the total number of all letters, let's say in the $total variable.
Is there a nice trick to generate a string holding 16 random letters (a letter can occur several times), weighted by their frequency of use?
It is to be used in a word game similar to Ruzzle.
I'm after something elegant, like picking a random line from a file, as suggested by a Perl Cookbook recipe:
rand($.) < 1 && ($line = $_) while <>;
The Perl Cookbook trick for picking a random line (which can also be found in perlfaq5) can be adapted for weighted sampling too:
my $chosen;
my $sum = 0;
foreach my $item (keys %freq) {
$sum += $freq{$item};
$chosen = $item if rand($sum) < $freq{$item};
}
Here, $sum corresponds to the line counter $. and $freq{$item} to the constant 1 in the Cookbook version.
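Wrapped in a function and called 16 times, this yields the string the question asks for (a sketch; pick_weighted is my own name for it):

sub pick_weighted {
    my ($freq) = @_;
    my ($chosen, $sum) = (undef, 0);
    for my $item (keys %$freq) {
        $sum += $freq->{$item};
        $chosen = $item if rand($sum) < $freq->{$item};
    }
    return $chosen;
}

my $string = join '', map { pick_weighted(\%freq) } 1 .. 16;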
If you're going to be picking a lot of weighted random samples, you can speed this up a bit with some preparation (note that this destroys %freq, so make a copy first if you want to keep it):
# first, scale all frequencies so that the average frequency is 1:
my $avg = 0;
$avg += $_ for values %freq;
$avg /= keys %freq;
$_ /= $avg for values %freq;
# now, prepare the array we'll need for fast weighted sampling:
my @lookup;
while (keys %freq) {
my ($lo, $hi) = (sort {$freq{$a} <=> $freq{$b}} keys %freq)[0, -1];
push @lookup, [$lo, $hi, $freq{$lo} + @lookup];
$freq{$hi} -= (1 - $freq{$lo});
delete $freq{$lo};
}
Now, to draw a random weighted sample from the prepared distribution, you just do this:
my $r = rand @lookup;
my ($lo, $hi, $threshold) = @{$lookup[$r]};
my $chosen = ($r < $threshold ? $lo : $hi);
(This is basically the Square Histogram method described in Marsaglia, Tsang & Wang (2004), "Fast Generation of Discrete Random Variables", J. Stat. Soft. 11(3) and originally due to A.J. Walker (1974).)
I have no clue about Perl syntax, so I'll just write pseudo-code. You can do something like this:
sum <= 0
foreach (letter in {a, z})
sum <= sum + freq[letter]
pick r, a random integer in [0, sum[
letter <= 'a' - 1
do
letter <= letter + 1
r <= r - freq(letter)
while r > 0
letter is the resulting value
The idea behind this code is to make a stack of boxes for each letter. The size of each box is the frequency of the letter. Then we choose a random location on this stack and see which letter's box we landed in.
Example :
freq(a) = 5
freq(b) = 3
freq(c) = 3
sum = 11
| a | b | c |
- - - - - - - - - - -
When we choose a 0 <= r < 11, we have the following probabilities
Pick a 'a' = 5 / 11
Pick a 'b' = 3 / 11
Pick a 'c' = 3 / 11
Which is exactly what we want.
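Translated into Perl (my translation, not the answerer's), that pseudo-code becomes:

# pick one letter, weighted by %freq
my $sum = 0;
$sum += $_ for values %freq;

my $r = int rand $sum;      # random integer in [0, $sum)
my $letter;
for (sort keys %freq) {     # any fixed order works
    $letter = $_;
    $r -= $freq{$letter};
    last if $r < 0;         # we landed in this letter's box
}
# $letter now holds the chosen value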
You can first build a table of the running sums of the frequencies. So if you have the following data:
%freq = (
a => 15,
b => 25,
c => 30,
d => 20
);
the running sums would be:
%running_sums = (
a => 0,
b => 15,
c => 40, # 15 + 25
d => 70, # 15 + 25 + 30
);
$max_sum = 90; # 15 + 25 + 30 + 20
To pick a single letter with the weighted frequency, you need to select a number in [0,90), then you can do a linear search on the running_sum table for the range that includes the letter. For example, if your random number is 20, then the appropriate range is 15-40, which is for the letter 'b'. Using linear search gives a total running time of O(m*n), where m is the number of letters we need and n is the size of the alphabet (here m=16, n=26). This is essentially what @default locale does.
Instead of linear search, you can also do a binary search on the running_sum table to get the closest number rounded down. This gives a total running time of O(m*log(n)).
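Here is a sketch of that binary search, built over a cumulative-sum array rather than the hash, since binary search needs an ordered structure (the names are mine):

my @letters = sort keys %freq;
# $cum[$i] is the running sum through letter $i, so letter $i owns
# the half-open range [$cum[$i-1], $cum[$i])
my @cum;
my $sum = 0;
push @cum, $sum += $freq{$_} for @letters;

sub pick_binary {
    my $r = rand $sum;                 # in [0, $sum)
    my ($lo, $hi) = (0, $#cum);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($r < $cum[$mid]) { $hi = $mid } else { $lo = $mid + 1 }
    }
    return $letters[$lo];
}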
For picking m letters, though, there is a faster way than O(m*log(n)), particularly if n < m. First you generate the m random numbers in sorted order (which can be done without sorting, in O(m)), then you do a linear matching between the list of sorted random numbers and the list of running sums. This gives a total runtime of O(m+n). The code in its entirety can be run on Ideone.
use List::Util qw(shuffle);
my %freq = (...);
# list of letters in sorted order, i.e. "a", "b", "c", ..., "x", "y", "z"
# sorting is O(n*log(n)) but it can be avoided if you already have
# a list of letters you're interested in using
my @letters = sort keys %freq;
# compute the running_sums table in O(n)
my $sum = 0;
my %running_sum;
for (@letters) {
$running_sum{$_} = $sum;
$sum += $freq{$_};
}
# generate a string with letters in $freq frequency in O(m)
my $curmax = 1;
my $curletter = $#letters;
my $i = 16; # the number of letters we want to generate
my @result;
while ($i > 0) {
# $curmax generates a uniformly distributed decreasing random number in [0,1)
# see http://repository.cmu.edu/cgi/viewcontent.cgi?article=3483&context=compsci
$curmax = $curmax * (1-rand())**(1. / $i);
# scale the random number $curmax to [0,$sum)
my $num = int ($curmax * $sum);
# find the range that includes $num
while ($num < $running_sum{$letters[$curletter]}) {
$curletter--;
}
push(@result, $letters[$curletter]);
$i--;
}
# since @result is sorted, you may want to shuffle it first
# Fisher-Yates shuffle is O(m)
print join('', shuffle(@result));

Get regions from a file that are part of regions in another file (without loops)

I have two files:
regions.txt: First column is the chromosome name, second and third are start and end position.
1 100 200
1 400 600
2 600 700
coverage.txt: First column is chromosome name, again second and third are start and end positions, and last column is the score.
1 100 101 5
1 101 102 7
1 103 105 8
2 600 601 10
2 601 602 15
This file is huge: about 15 GB, with about 300 million lines.
I basically want to get the mean of all scores in coverage.txt that are in each region in regions.txt.
In other words: start at the first line in regions.txt; if there is a line in coverage.txt which has the same chromosome, whose coverage start is >= the region start, and whose coverage end is <= the region end, then save its score to a new array. After finishing the search through all of coverage.txt, print the region chromosome, start, and end, and the mean of all scores that were found.
Expected output:
1 100 200 14.6 which is (5+7+8)/3
1 400 600 0 no match at coverages.txt
2 600 700 12.5 which is (10+15)/2
I built the following MATLAB script, which takes a very long time since I have to loop over coverage.txt many times. I don't know how to write a fast awk-like script.
My matlab script
fc = fopen('coverage.txt', 'r');
ft = fopen('regions.txt', 'r');
fw = fopen('out.txt', 'w');
while feof(ft) == 0
linet = fgetl(ft);
scant = textscan(linet, '%d%d%d');
tchr = scant{1};
tx = scant{2};
ty = scant{3};
coverages = [];
frewind(fc);
while feof(fc) == 0
linec = fgetl(fc);
scanc = textscan(linec, '%d%d%d%d');
cchr = scanc{1};
cx = scanc{2};
cy = scanc{3};
cov = scanc{4};
if (cchr == tchr) && (cx >= tx) && (cy <= ty)
coverages = cat(2, coverages, cov);
end
end
covmed = median(coverages);
fprintf(fw, '%d\t%d\t%d\t%d\n', tchr, tx, ty, covmed);
end
Any suggestions for an alternative using AWK, Perl, or anything else? I would also be pleased if someone could teach me how to get rid of all the loops in my MATLAB script.
Thanks
Here is a Perl solution. I use hashes (aka dictionaries) to access the various ranges via the chromosome, thus reducing the number of loop iterations.
This is potentially efficient, as I don't do a full loop over regions.txt on every input line. Efficiency could perhaps be increased further when multithreading is used.
#!/usr/bin/perl
my ($rangefile) = @ARGV;
open my $rFH, '<', $rangefile or die "Can't open $rangefile";
# construct the ranges. The chromosome is used as range key.
my %ranges;
while (<$rFH>) {
chomp;
my #field = split /\s+/;
push @{$ranges{$field[0]}}, [@field[1,2], 0, 0];
}
close $rFH;
# iterate over all the input
while (my $line = <STDIN>) {
chomp $line;
my ($chrom, $lower, $upper, $value) = split /\s+/, $line;
# only loop over ranges with matching chromosome
foreach my $range (@{$ranges{$chrom}}) {
if ($$range[0] <= $lower and $upper <= $$range[1]) {
$$range[2]++;
$$range[3] += $value;
last; # break out of foreach early because ranges don't overlap
}
}
}
# create the report
foreach my $chrom (sort {$a <=> $b} keys %ranges) {
foreach my $range (@{$ranges{$chrom}}) {
my $value = $$range[2] ? $$range[3]/$$range[2] : 0;
printf "%d %d %d %.1f\n", $chrom, #$range[0,1], $value;
}
}
Example invocation:
$ perl script.pl regions.txt <coverage.txt >output.txt
Output on the example input:
1 100 200 6.7
1 400 600 0.0
2 600 700 12.5
(because (5+7+8)/3 = 6.66…)
Normally, I would load the files into R and calculate it, but given that one of them is so huge, this would become a problem. Here are some thoughts that might help you solving it.
Consider splitting coverage.txt by chromosomes. This would make the calculations less demanding.
Instead of looping over coverage.txt, you first read regions.txt fully into memory (I assume it is much smaller). For each region, you keep a score and a number.
Process coverage.txt line by line. For each line, you determine the chromosome and the region that this particular stretch belongs to. This will require some footwork, but if regions.txt is not too large, it might be more efficient. Add the score to the score of the region and increment number by one.
An alternative, most efficient way requires both files to be sorted first by chromosome, then by position. Then:
1. Take a line from regions.txt. Record the chromosome and positions. If there is a line left over from the previous loop, go to 3; otherwise go to 2.
2. Take a line from coverage.txt.
3. Check whether it is within the current region.
   - yes: add the score to the region, increment the number. Go to 2.
   - no: divide the score by the number, write the current region to the output, go to 1.
This last method requires some fine tuning, but will be the most efficient: it requires only a single pass through each file and stores almost nothing in memory.
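A sketch of that merge in Perl (assuming both files are sorted as described, regions never overlap, chromosome names compare correctly as strings, and Perl v5.10+ for //=):

use strict;
use warnings;

open my $rFH, '<', 'regions.txt'  or die "regions.txt: $!";
open my $cFH, '<', 'coverage.txt' or die "coverage.txt: $!";

my $cov;                                 # one look-ahead coverage line
while (my $region = <$rFH>) {
    my ($chrom, $start, $end) = split ' ', $region;
    my ($sum, $n) = (0, 0);
    while (defined($cov //= <$cFH>)) {
        my ($cchrom, $cstart, $cend, $score) = split ' ', $cov;
        # stop (and keep the line) once we are past the current region
        last if $cchrom gt $chrom
             or ($cchrom eq $chrom and $cstart >= $end);
        if ($cchrom eq $chrom and $cstart >= $start and $cend <= $end) {
            $sum += $score;
            $n++;
        }
        undef $cov;                      # consumed; fetch a fresh one
    }
    printf "%s %s %s %.1f\n", $chrom, $start, $end, $n ? $sum / $n : 0;
}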
Here's one way using join and awk. Run like:
join regions.txt coverage.txt | awk -f script.awk - regions.txt
Contents of script.awk:
FNR==NR && $4>=$2 && $5<=$3 {
sum[$1 FS $2 FS $3]+=$6
cnt[$1 FS $2 FS $3]++
next
}
{
if ($1 FS $2 FS $3 in sum) {
printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]
}
else if (NF == 3) {
print $0 " 0"
}
}
Results:
1 100 200 6.7
1 400 600 0
2 600 700 12.5
Alternatively, here's the one-liner:
join regions.txt coverage.txt | awk 'FNR==NR && $4>=$2 && $5<=$3 { sum[$1 FS $2 FS $3]+=$6; cnt[$1 FS $2 FS $3]++; next } { if ($1 FS $2 FS $3 in sum) printf "%s %.1f\n", $0, sum[$1 FS $2 FS $3]/cnt[$1 FS $2 FS $3]; else if (NF == 3) print $0 " 0" }' - regions.txt
Here is a simple MATLAB way to bin your coverage into regions:
% extract the regions extents
bins = regions(:,2:3)';
bins = bins(:);
% extract the coverage - only the start is needed
covs = coverage(:,2);
% use histc to place the coverage start into proper regions
% this line counts how many coverages there are in a region
% and assigns them proper region ids.
[h, i]= histc(covs(:), bins(:));
% sum the scores into correct regions (second output of histc gives this)
total = accumarray(i, coverage(:,4), [numel(bins),1]);
% average the score in regions (first output of histc is useful)
avg = total./h;
% remove every second entry - our regions are defined by start/end
avg = avg(1:2:end);
Now this works assuming that the regions are non-overlapping, but I guess that is the case. Also, every entry in the coverage file has to fall into some region.
Also, it is trivial to 'block' this approach over coverages, if you want to avoid reading in the whole file. You only need the bins, your regions file, which presumably is small. You can process the coverages in blocks, incrementally add to total and compute the average in the end.

Perl function for negative integers using the 2's complement

I am trying to convert AD maxPwdAge (a 64-bit integer) into a number of days.
According to Microsoft:
Uses the IADs interface's Get method to retrieve the value of the domain's maxPwdAge attribute (line 5).
Notice we use the Set keyword in VBScript to initialize the variable named objMaxPwdAge—the variable used to store the value returned by Get. Why is that?
When you fetch a 64-bit large integer, ADSI does not return one giant scalar value. Instead, ADSI automatically returns an IADsLargeInteger object. You use the IADsLargeInteger interface's HighPart and LowPart properties to calculate the large integer's value. As you may have guessed, HighPart gets the high order 32 bits, and LowPart gets the low order 32 bits. You use the following formula to convert HighPart and LowPart to the large integer's value.
The existing code in VBScript from the same page:
Const ONE_HUNDRED_NANOSECOND = .000000100 ' .000000100 is equal to 10^-7
Const SECONDS_IN_DAY = 86400
Set objDomain = GetObject("LDAP://DC=fabrikam,DC=com") ' LINE 4
Set objMaxPwdAge = objDomain.Get("maxPwdAge") ' LINE 5
If objMaxPwdAge.LowPart = 0 Then
WScript.Echo "The Maximum Password Age is set to 0 in the " & _
"domain. Therefore, the password does not expire."
WScript.Quit
Else
dblMaxPwdNano = Abs(objMaxPwdAge.HighPart * 2^32 + objMaxPwdAge.LowPart)
dblMaxPwdSecs = dblMaxPwdNano * ONE_HUNDRED_NANOSECOND ' LINE 13
dblMaxPwdDays = Int(dblMaxPwdSecs / SECONDS_IN_DAY) ' LINE 14
WScript.Echo "Maximum password age: " & dblMaxPwdDays & " days"
End If
How can I do this in Perl?
Endianness may come into this, but you may be able to say
#!/usr/bin/perl
use strict;
use warnings;
my $num = -37_108_517_437_440;
my $binary = sprintf "%064b", $num;
my ($high, $low) = $binary =~ /(.{32})(.{32})/;
$high = oct "0b$high";
$low = oct "0b$low";
my $together = unpack "q", pack "LL", $low, $high;
print "num $num, low $low, high $high, together $together\n";
Am I missing something? As far as I can tell from your question, your problem has nothing at all to do with 2's complement. All you need/want to do is
use Math::BigInt;
use constant MAXPWDAGE_UNIT_PER_SEC => (
1000 # milliseconds
* 1000 # microseconds
* 10 # 100 nanoseconds
);
use constant SECS_PER_DAY => (
24 # hours
* 60 # minutes
* 60 # seconds
);
my $maxpwdage_full = ( Math::BigInt->new( $maxpwdage_highpart ) << 32 ) + $maxpwdage_lowpart;
my $days = $maxpwdage_full / MAXPWDAGE_UNIT_PER_SEC / SECS_PER_DAY;
Note that I deliberately use 2 separate constants, and I divide by them in sequence, because that keeps the divisors smaller than the range of a 32-bit integer. If you want to write this another way and you want it to work correctly on 32-bit perls, you’ll have to keep all the precision issues in mind.
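For example, with hypothetical HighPart/LowPart values corresponding to a 42-day maxPwdAge (the attribute is stored as a negative interval of 100-nanosecond units, hence the abs, as in the VBScript):

use Math::BigInt;

# maxPwdAge = -36_288_000_000_000 (42 days in 100-ns units, negated)
my $maxpwdage_highpart = -8449;         # signed high 32 bits
my $maxpwdage_lowpart  = 178_683_904;   # unsigned low 32 bits

my $full = ( Math::BigInt->new($maxpwdage_highpart) << 32 )
         + $maxpwdage_lowpart;
my $days = $full->babs / 10_000_000 / 86_400;
print "Maximum password age: $days days\n";   # 42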