How can I sum data over five minute time intervals in Perl?

I have a file in the format below.
DATE Time, v1,v2,v3
05:33:25,n1,n2,n3
05:34:25,n4,n5,n5
05:35:24,n6,n7,n8
and so on, up to 05:42:25.
I want to sum the values v1, v2 and v3 over every 5-minute interval. I have written the sample code below.
while (<STDIN>) {
my ($dateTime, $v1, $v2, $v3) = split /,/, $_;
my ($date, $time) = split / /, $dateTime;
}
I can read all the values, but I need help summing them over every 5-minute interval. Can anyone suggest code that groups the times and adds up the values for every 5 minutes?
Required output:
05:33 v1(sum 05:33 to 05:37) v2(sum 05:33 to 05:37) v3(sum 05:33 to 05:37)
05:38 v1(sum 05:38 to 05:42) v2(sum 05:38 to 05:42) v3(sum 05:38 to 05:42)
and so on.

The code is a variation of the previous answer by Sinan Ünür below, except:
(1) The timelocal function lets you read in day, month and year as well, so you can sum any five-minute gap.
(2) It should deal with the case where the final time gap is shorter than 5 minutes.
#!/usr/bin/perl -w
use strict;
use warnings;
use Time::Local;
use POSIX qw(strftime);
my ( $start_time, $end_time, $current_time );
my ( $totV1, $totV2, $totV3 ); #totals in time bands
while (<DATA>) {
my ( $hour, $min, $sec, $v1, $v2, $v3 ) =
( $_ =~ /(\d+)\:(\d+)\:(\d+)\,(\d+),(\d+),(\d+)/ );
#convert time to epoch seconds
$current_time =
timelocal( $sec, $min, $hour, (localtime)[ 3, 4, 5 ] ); #sec,min,hr
if ( !$end_time ) {
$start_time = $current_time;
$end_time = $start_time + 5 * 60; #plus 5 min
}
if ( $current_time < $end_time ) { #a reading exactly 5 min after the start opens the next band
$totV1 += $v1;
$totV2 += $v2;
$totV3 += $v3;
}
else {
print strftime( "%H:%M:%S", localtime($start_time) ),
" $totV1,$totV2,$totV3\n";
$start_time = $current_time;
$end_time = $start_time + 5 * 60; #plus 5 min
( $totV1, $totV2, $totV3 ) = ( $v1, $v2, $v3 );
}
}
#Print results of final loop (if required)
if ( $current_time <= $end_time ) {
print strftime( "%H:%M:%S", localtime($start_time) ),
" $totV1,$totV2,$totV3\n";
}
__DATA__
05:33:25,29,74,96
05:34:25,41,69,95
05:35:25,24,38,55
05:36:25,96,63,70
05:37:25,84,65,74
05:38:25,78,58,93
05:39:25,51,38,19
05:40:25,86,40,64
05:41:25,80,68,65
05:42:25,4,93,81
Output:
05:33:25 274,309,390
05:38:25 299,297,322

Obviously, not tested much, for lack of sample data. For parsing the CSV, use either Text::CSV_XS or Text::xSV rather than the naive split below.
Note:
This code does not make sure the output has all consecutive five minute blocks if the input data has gaps.
You will have problems if there are time stamps from multiple days. In fact, if the time stamps are not in 24-hour format, you will have problems even if the data are from a single day.
With those caveats, it should still give you a starting point.
#!/usr/bin/perl
use strict;
use warnings;
my $split_re = qr/ ?, ?/;
my @header = split $split_re, scalar <DATA>;
my @data;
my $time_block = 0;
while ( my $data = <DATA> ) {
last unless $data =~ /\S/;
chomp $data;
my ($ts, @vals) = split $split_re, $data;
my ($hr, $min, $sec) = split /:/, $ts;
my $secs = 3600*$hr + 60*$min + $sec;
if ( $secs > $time_block + 300 ) {
$time_block = $secs;
push @data, [ $time_block ];
}
for my $i (1 .. @vals) {
$data[-1]->[$i] += $vals[$i - 1];
}
}
print join(', ', @header);
for my $row ( @data ) {
my $ts = shift @$row;
print join(', ',
# $ts is seconds since midnight, not an epoch, so format it by hand
# (passing it to localtime would shift it by the local UTC offset)
sprintf('%02d:%02d', int($ts / 3600), int($ts % 3600 / 60))
, @$row
), "\n";
}
__DATA__
DATE Time, v1,v2,v3
05:33:25,1,3,5
05:34:25,2,4,6
05:35:24,7,8,9
05:55:24,7,8,9
05:57:24,7,8,9
Output:
DATE Time, v1, v2, v3
05:33, 10, 15, 20
05:55, 14, 16, 18
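The first caveat above (gaps in the input) could be handled by laying a fixed grid of blocks from the first timestamp and emitting zero rows for empty blocks. The following is only a minimal sketch of that idea, not the answer's method: sum_blocks is a hypothetical helper, rows are assumed to be sorted and already converted to seconds since midnight, and note that the second populated block is labelled by its grid position (05:53) rather than by its first reading (05:55).

```perl
use strict;
use warnings;
use 5.010;    # for the // operator

# Hypothetical helper (not from the answers above). Each row is
# [seconds_since_midnight, v1, v2, ...]; rows are assumed sorted by time.
sub sum_blocks {
    my ($width, @rows) = @_;
    return () unless @rows;

    my $origin = $rows[0][0];          # the grid starts at the first timestamp
    my $ncols  = @{ $rows[0] } - 1;    # number of value columns
    my @sums;

    for my $row (@rows) {
        my ($secs, @vals) = @$row;
        my $idx = int( ($secs - $origin) / $width );
        $sums[$idx][$_] += $vals[$_] for 0 .. $#vals;
    }

    my @out;
    for my $idx (0 .. $#sums) {
        my $start = $origin + $idx * $width;
        my $label = sprintf '%02d:%02d', int($start / 3600), int($start % 3600 / 60);
        # Blocks that received no rows come out as zeros.
        my @totals = map { $sums[$idx][$_] // 0 } 0 .. $ncols - 1;
        push @out, [ $label, @totals ];
    }
    return @out;
}

my @rows = (
    [ 5*3600 + 33*60 + 25, 1, 3, 5 ],
    [ 5*3600 + 34*60 + 25, 2, 4, 6 ],
    [ 5*3600 + 35*60 + 24, 7, 8, 9 ],
    [ 5*3600 + 55*60 + 24, 7, 8, 9 ],
    [ 5*3600 + 57*60 + 24, 7, 8, 9 ],
);
print join(', ', @$_), "\n" for sum_blocks(300, @rows);
```

With the sample data this prints five rows, three of them all zeros for the empty blocks between 05:38 and 05:53.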

This is a good problem for Perl to solve. The hardest part is taking the value from the datetime field and identifying which 5 minute bucket it belongs to. The rest is just hashes.
my (%v1,%v2,%v3);
while (<STDIN>) {
my ($datetime,$v1,$v2,$v3) = split /,/, $_;
my ($date,$time) = split / /, $datetime;
my $bucket = &get_bucket_for($time);
$v1{$bucket} += $v1;
$v2{$bucket} += $v2;
$v3{$bucket} += $v3;
}
foreach my $bucket (sort keys %v1) {
print "$bucket $v1{$bucket} $v2{$bucket} $v3{$bucket}\n";
}
Here's one way you could implement &get_bucket_for:
my $first_hhmm;
sub get_bucket_for {
my ($time) = @_;
my ($hh,$mm) = split /:/, $time; # looks like seconds are not important
# buckets are five minutes apart, but not necessarily at multiples of 5 min
# (i.e., buckets could go 05:33,05:38,... instead of 05:30,05:35,...)
# Use the value from the first time this function is called to decide
# what the starting point of the buckets is.
if (!defined $first_hhmm) {
$first_hhmm = $hh * 60 + $mm;
}
my $bucket_index = int(($hh * 60 + $mm - $first_hhmm) / 5);
my $bucket_start = $first_hhmm + 5 * $bucket_index;
return sprintf "%02d:%02d", $bucket_start / 60, $bucket_start % 60;
}

I'm not sure why you would use the times starting from the first time, instead of round 5 minute intervals (00 - 05, 05 - 10, etc), but this is a quick and dirty way to do it your way:
my %output;
my $last_min = -10; # -10 + 5 is less than any positive int.
while (<STDIN>) {
my ($dt, $v1, $v2, $v3) = split(/,/, $_);
my ($h, $m, $s) = split(/:/, $dt);
my $ts = $m + ($h * 60);
if (($last_min + 5) < $ts) {
$last_min = $ts;
}
$output{$last_min}{1} += $v1;
$output{$last_min}{2} += $v2;
$output{$last_min}{3} += $v3;
}
foreach my $ts (sort {$a <=> $b} keys %output) {
my $hour = int($ts / 60);
my $minute = $ts % 60;
printf("%01d:%02d v1(%i) v2(%i) v3(%i)\n", (
$hour,
$minute,
$output{$ts}{1},
$output{$ts}{2},
$output{$ts}{3},
));
}
Again, I'm not sure why you would do it this way, but there it is in procedural Perl, as an example. If you need more on the printf formatting, see the sprintf documentation (perldoc -f sprintf).
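For comparison, the "round" intervals mentioned at the top of this answer (00-05, 05-10, and so on) need no state at all: the bucket is just the minute count rounded down to a multiple of five. A minimal sketch, assuming 24-hour HH:MM:SS timestamps from a single day; bucket_label is a hypothetical name, and the three input lines are taken from the sample data in the first answer.

```perl
use strict;
use warnings;

# Hypothetical helper: map "HH:MM:SS" to the start of its clock-aligned block.
sub bucket_label {
    my ($time, $width_min) = @_;
    $width_min ||= 5;
    my ($h, $m) = split /:/, $time;          # seconds do not affect the bucket
    my $minutes = $h * 60 + $m;
    my $start   = $minutes - $minutes % $width_min;
    return sprintf '%02d:%02d', int($start / 60), $start % 60;
}

my %sum;
for my $line ('05:33:25,29,74,96', '05:34:25,41,69,95', '05:35:25,24,38,55') {
    my ($time, @vals) = split /,/, $line;
    my $bucket = bucket_label($time);
    $sum{$bucket}[$_] += $vals[$_] for 0 .. $#vals;
}

# Zero-padded HH:MM labels sort correctly as plain strings.
print "$_ @{ $sum{$_} }\n" for sort keys %sum;
# prints:
# 05:30 70 143 191
# 05:35 24 38 55
```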

Related

Subtract months from given date

I have a date in the future, from which I have to subtract 3 months at a time until the current date is reached.
The reached date after subtracting the months must be closest to the current date, but has to be in the future.
The day of the month is always the 23rd
i.e.:
future date = 2015/01/23
current date = 2014/06/05
result = 2014/07/23
I'm running Solaris, so don't have access to GNU date.
I tried to do this in Perl, but unfortunately I can only use the Time::Local module:
#!/bin/ksh
m_date="2019/05/23"
m_year=$(echo $m_date|cut -d/ -f1)
m_month=$(echo $m_date|cut -d/ -f2)
m_day=$(echo $m_date|cut -d/ -f3)
export m_year m_month m_day
perl -MTime::Local -le '
$time = timelocal(localtime);
$i = 3;
while (timelocal(0, 0, 0, $ENV{'m_day'}, $ENV{'m_month'} - $i, $ENV{'m_year'}) > $time) {
print scalar(localtime(timelocal(0, 0, 0, $ENV{'m_day'}, $ENV{'m_month'} - $i, $ENV{'m_year'})));
$i += 3;
}
'
This only works for months within one year. Is there any other way I can do this?
It is simple enough to just split the date strings and do the arithmetic on the fields.
use strict;
use warnings;
use 5.010;
my $future = '2015/01/23';
my $current = do {
my @current = localtime;
$current[4] += 1; # the month (index 4) is zero-based; the day of month (index 3) is already 1-based
$current[5] += 1900;
sprintf '%04d/%02d/%02d', @current[5,4,3];
};
my $result;
for (my $test = $future; $test gt $current; ) {
$result = $test;
my @test = split /\//, $test;
if (($test[1] -= 3) < 1) {
--$test[0];
$test[1] += 12;
}
$test = sprintf '%04d/%02d/%02d', @test;
}
say $result;
output
2014/07/23
Alternatively you could just do the division to calculate how many whole quarters to subtract, like this
use strict;
use warnings;
use 5.010;
my $future = '2015/01/23';
my @current = (localtime)[5,4,3];
$current[1] += 1;
$current[0] += 1900;
my @future = split /\//, $future;
my $months = ($future[0] - $current[0]) * 12 + $future[1] - $current[1];
$months -= 1 if $current[2] >= 23;
my @result = @current;
$result[2] = 23;
$result[1] += $months % 3;
$result[0] += int(($result[1] - 1) / 12);
$result[1] = ($result[1] - 1) % 12 + 1;
my $result = sprintf '%04d/%02d/%02d', @result;
say $result;
The output is identical to that of the previous code
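The same quarter arithmetic can also be written on a single month index (year * 12 + month - 1), which keeps the year rollover out of the way. This is only a sketch under my own assumptions: nearest_quarter_step is a hypothetical helper, dates are passed as numbers rather than strings, and "in the future" is read strictly, so a candidate that falls on today's date is pushed three months further.

```perl
use strict;
use warnings;

# Hypothetical helper: arguments are future (year, month, day) followed by
# current (year, month, day). Returns 'YYYY/MM/DD', or nothing if the
# future date is not actually after the current date.
sub nearest_quarter_step {
    my ($fy, $fm, $fd, $cy, $cm, $cd) = @_;
    my $f = $fy * 12 + $fm - 1;    # month index of the future date
    my $c = $cy * 12 + $cm - 1;    # month index of the current date
    return if $f < $c or ( $f == $c and $fd <= $cd );

    # Smallest month index >= $c that differs from $f by a multiple of 3.
    my $n = $c + ( $f - $c ) % 3;

    # If that lands in the current month but the day has already been
    # reached, the candidate is not in the future: step one more quarter.
    $n += 3 if $n == $c and $fd <= $cd;

    return sprintf '%04d/%02d/%02d', int( $n / 12 ), $n % 12 + 1, $fd;
}

print nearest_quarter_step( 2015, 1, 23, 2014, 6, 5 ), "\n";    # 2014/07/23
```

In a real script the current date would come from (localtime)[5,4,3], with 1900 added to the year and 1 added to the month.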
This is your script changed so it should work across multiple years,
perl -MTime::Local -le'
sub nextm {
$ENV{m_year}--, $ENV{m_month} +=12 if ($ENV{m_month} -= 3) <1;
timelocal(0, 0, 0, $ENV{m_day}, $ENV{m_month}, $ENV{m_year});
}
my $time = timelocal(localtime);
while ((my $c=nextm()) > $time) {
print scalar localtime($c);
}
'
Try something like:
#!/usr/bin/perl -w
# just convert the real date that you have to epoch
my $torig = 1558569600;
my $tnow = time;
# 3 months in seconds to use the epoch everywhere
my $estep = 3 * 30 * 24 * 3600;
while(($torig - $estep) > $tnow){
$torig -= $estep;
}
print $torig,"\n";
print scalar localtime($torig),"\n";
The only problem here is that the month is approximated as 30 days. If you need the very same day of the month, minus 3 months, you could use DateCalc.
I ended up scripting it all in KSH instead of perl, thanks to Borodin's logic.
#!/bin/ksh
set -A c_date $(date '+%Y %m %d')
IFS=/ d="2019/05/23"
set -A m_date $d
[[ ${c_date[2]} -gt ${m_date[2]} ]] && ((c_date[1]+=1))
c_date[2]=${m_date[2]}
c_date[1]=$(( (((${m_date[0]} - ${c_date[0]}) * 12) + (${m_date[1]} - ${c_date[1]})) % 3 + ${c_date[1]} ))
if [[ ${c_date[1]} -gt 12 ]] ; then
((c_date[0]+=1))
((c_date[1]-=12))
fi
echo ${c_date[@]}

Perl time parsing and difference calculations for "plus days:hours:minutes"

I have a string with a time difference like:
12:03:22   <- where
 ^  ^  ^
 |  |  +minutes
 |  +hours
 +days
Only the minutes are mandatory; hours and days can be omitted. A field can also overflow, e.g. 120:30 means 120 hours and 30 minutes.
I need to calculate the date and time for NOW + difference. For example:
when now is "May 20, 13:50" and
the string is "1:1:5",
I want to get the result "2012 05 21 14 55" (May 21, 14:55).
I know DateTime, but what is an easy way to parse the input string? I'm sure there is a better way than:
use _usual_things_;
my ....
if($str =~ m/(.*):(.*):(.*)/) {
$d = $1; $h = $2; $m = $3;
}
elsif( $str =~ m/(.*):(.*)/ ) {
$h = $1; $m = $2;
} elsif ($str =~ m/\d+/ ) {
$m = $1;
}
else {
say "error";
}
And how do I add the parsed days, hours and minutes to the current date?
What about using reverse to avoid checking the format?
my ($m, $h, $d) = reverse split /:/, $str;
To add this to current date, just use DateTime:
print DateTime->now->add(days => $d // 0,
hours => $h // 0,
minutes => $m);
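If DateTime is not installed, the same reverse-split idea can be combined with the core POSIX module. This is a minimal sketch rather than a drop-in replacement: offset_minutes is a hypothetical helper, it rejects anything that is not one to three colon-separated numbers, and it adds elapsed time in seconds, so across a DST change the wall-clock result can differ by an hour from what DateTime's calendar arithmetic gives.

```perl
use strict;
use warnings;
use 5.010;                 # for the // operator
use POSIX qw(strftime);

# Hypothetical helper: turn "[[days:]hours:]minutes" into a number of minutes.
# Returns undef (empty list in list context) for malformed input.
sub offset_minutes {
    my ($str) = @_;
    return unless $str =~ /\A\d+(?::\d+){0,2}\z/;    # one to three numeric fields
    my ( $m, $h, $d ) = reverse split /:/, $str;     # minutes are always last
    return ( ( $d // 0 ) * 24 + ( $h // 0 ) ) * 60 + $m;
}

my $minutes = offset_minutes('1:1:5');               # 1 day, 1 hour, 5 minutes
print strftime( '%Y %m %d %H %M', localtime( time + 60 * $minutes ) ), "\n";
```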
Parsing can be done in a single pass, but branching on the number of tokens can't be avoided. Here is a sample implementation.
$Str = '12:03:22' ;
@Values = ($Str=~/\G(\d\d):?/g) ;
print "error with input" if not @Values;
if( @Values == 3) { print "Have all 3 values\n" }
elsif( @Values == 2) { print "Have 2 values\n" }

Convert Old Unix Date to Perl and compare

Requirement: I have a file named "Rajesh.1202242219". The digits are a timestamp in the format produced by date '+%y''%m''%d''%H''%M'.
I am trying to write a Perl script that extracts the digits from the file name, compares them with the current system date and time, and prints a value based on the result of the comparison.
Approach:
Extract the digits from the file name:
if ($file =~ /Rajesh.(\d+).*/) {
print $1;
}
Convert this time into a readable time in Perl:
my $sec = 0; # not supplied
my $min = 19;
my $hour = 22;
my $day = 24;
my $mon = 02 - 1;
my $year = 2012 - 1900;
my $wday = 0; # not supplied
my $yday = 0; # not supplied
my $unixtime = mktime ($sec, $min, $hour, $day, $mon, $year, $wday, $yday);
print "$unixtime\n";
my $readable_time = localtime($unixtime);
print "$readable_time\n";
Find the current time and compare:
my $CurrentTime = time();
my $Todaydate = localtime($CurrentTime);
The problem is that I don't know how to extract the digits from $1 two at a time and assign them to $min, $hour, and so on. Any help?
Also, if you have a better approach to this problem, please share it with me.
I like to use time objects to simplify the logic. I use Time::Piece here because it is simple and light weight (and part of the core). DateTime can be another choice.
use Time::Piece;
my ( $datetime ) = $file =~ /(\d+)/;
my $t1 = Time::Piece->strptime( $datetime, '%y%m%d%H%M' );
my $t2 = localtime(); # equivalent to Time::Piece->new
# you can do date comparisons on the object
if ($t1 < $t2) {
# do something
print "[$t1] < [$t2]\n";
}
Might as well teach DateTime::Format::Strptime to make the comparison much simpler:
use DateTime qw();
use DateTime::Format::Strptime qw();
if (
DateTime::Format::Strptime
->new(pattern => '%y%m%d%H%M')
->parse_datetime('Rajesh.1202242219')
< DateTime->now
) {
say 'filename timestamp is earlier than now';
} else {
say 'filename timestamp is later than now';
};
my ($year, $month, $day, $hour, $min) = $file =~ /(\d{2})/g;
if ($min) {
$year += 100; # Assuming 2012 and not 1912
$month--;
# Do stuff
}
I think unpack might be a better fit.
if ( my ( $num ) = $file =~ /Rajesh.(\d+).*/ ) {
my ( $year, $mon, $day, $hour, $min ) = unpack( 'A2 A2 A2 A2 A2', $num );
my $ts = POSIX::mktime( 0, $min, $hour, $day, $mon - 1, $year + 100 );
...
}
Using a module that parses dates might be nice. This code will parse the date and return a DateTime object. Refer to the documentation to see the many ways to manipulate this object.
use DateTime::Format::Strptime;
my $date = "1202242219";
my $dt = get_obj($date);
sub get_obj {
my $date = shift;
my $strp = DateTime::Format::Strptime->new(
pattern => '%y%m%d%H%M'
);
return $strp->parse_datetime($date);
}
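Because the stamp is fixed-width and ordered from year down to minute, the comparison can also be done without parsing at all: format the current time the same way and compare the two strings. This is a sketch of a different technique from the answers above, with assumptions of my own: stamp_is_past is a hypothetical helper, the stamp is taken to be local time, and the two-digit year means the string comparison is only meaningful while both times fall in the same century.

```perl
use strict;
use warnings;
use 5.010;                 # for the //= operator
use POSIX qw(strftime);

# Hypothetical helper: returns 1 if the stamp in the file name is earlier
# than $now (an epoch, defaulting to the current time), 0 if it is not,
# and undef if the name does not end in a ten-digit stamp.
sub stamp_is_past {
    my ( $file, $now ) = @_;
    $now //= time;
    my ($stamp) = $file =~ /\.(\d{10})\z/ or return;
    # Same layout as the file name: yymmddHHMM, in local time.
    my $current = strftime( '%y%m%d%H%M', localtime($now) );
    return $stamp lt $current ? 1 : 0;
}

print stamp_is_past('Rajesh.1202242219')
    ? "filename timestamp is earlier than now\n"
    : "filename timestamp is not earlier than now\n";
```

If the actual time difference is needed rather than just the ordering, one of the parsing approaches above is the better fit.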

How do I find out what the date some weeks ago was?

I was trying to determine a good way to calculate a previous date based on how many weeks I would want to go back. Today is 7/19/2011, so if I wanted to go back 5 weeks what would be the best way to determine what that date would be?
DateTime::Duration is your friend there:
use strict;
use warnings;
use 5.010;
use DateTime;
my $now = DateTime->now(time_zone => 'local');
my $five_weeks = DateTime::Duration->new(weeks => 5);
my $five_weeks_ago = $now - $five_weeks;
say "Five weeks ago now it was $five_weeks_ago";
Notice that it lets you specify the duration in the units of the problem.
Perl has this marvelous thing called regexes that can solve almost any problem.
use strict;
use warnings;
my $date = shift || '7/19/2011';
my $days_ago = shift || 7*5;
$date =~ s#^([0-9]+)/([0-9]+)/([0-9]+)\z#@{[sprintf"%.2d",$1]}/@{[sprintf"%.2d",$2]}/$3/$days_ago#;
until ( $date =~ s#^([0-9]+)/([0-9]+)/([0-9]+)/0\z#@{[$1+0]}/@{[$2+0]}/$3# ) {
$date =~ s#([0-9]+)/([0-9]+)/([0-9]+)/([0-9]+)#@{[$2==1?sprintf"%.2d",$1-1||12:$1]}/@{[sprintf"%.2d",$2-1||31]}/@{[$1==1 && $2==1?$3-1:$3]}/@{[$4-1]}#;
$date =~ s#([0-9]+)\z#@{[$1+1]}# unless $date =~ m#^(?:0[1-9]|1[012])/(?:0[1-9]|1[0-9]|2[0-8]|(?<!0[2469]/|11/)31|(?<!02/)30|(?<!02/(?=...(?:..(?:[02468][1235679]|[13579][01345789])|(?:[02468][1235679]|[13579][01345789])00)))29)/#;
}
print $date, "\n";
(Please don't do it this way.)
I like Date::Calc
use strict;
use warnings;
use Date::Calc qw/Add_Delta_Days Today/;
my $offset_weeks = -5;
my $offset_days = $offset_weeks * 7;
# Year, Month, Day
my @delta_date = Add_Delta_Days(
Today( [ localtime ] ),
$offset_days
);
printf "%2d/%2d/%4d\n", @delta_date[1,2,0];
It is designed to catch common gotchas such as leap year.
Best or easiest? I have always found strftime's date normalization to be handy for this sort of thing:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw/strftime/;
my @date = localtime;
print strftime "today is %Y-%m-%d\n", @date;
$date[3] -= 5 * 7;
print strftime "five weeks ago was %Y-%m-%d\n", @date;
Which solution is best depends partly on what you want to do with the date when you are done. Here is a benchmark with implementations of various methods:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark;
use Date::Manip qw/UnixDate/;
use Date::Simple qw/today/;
use Date::Calc qw/Add_Delta_Days Today/;
use DateTime;
use POSIX qw/strftime/;
use Class::Date;
my %subs = (
cd => sub {
(Class::Date::now - [0, 0, 5 * 7])->strftime("%Y-%m-%d");
},
dc => sub {
sprintf "%d-%02s-%02d", Add_Delta_Days Today, -5 * 7;
},
dm => sub {
UnixDate("5 weeks ago", "%Y-%m-%d");
},
ds => sub {
(today() - 5 * 7)->strftime("%Y-%m-%d");
},
dt => sub {
my $now = DateTime->from_epoch(epoch => time, time_zone => "local");
my $five_weeks = DateTime::Duration->new(weeks => 5);
($now - $five_weeks)->ymd('-');
},
p => sub {
my @date = localtime;
$date[3] -= 5 * 7;
strftime "%Y-%m-%d", @date;
},
y => sub {
my ($d, $m, $y) = (localtime)[3..5];
my $date = join "/", $m+1, $d, $y+1900;
my $days_ago = 7*5;
$date =~ s#^([0-9]+)/([0-9]+)/([0-9]+)\z#@{[sprintf"%.2d",$1]}/@{[sprintf"%.2d",$2]}/$3/$days_ago#;
until ( $date =~ s#^([0-9]+)/([0-9]+)/([0-9]+)/0\z#@{[$1+0]}/@{[$2+0]}/$3# ) {
$date =~ s#([0-9]+)/([0-9]+)/([0-9]+)/([0-9]+)#@{[$2==1?sprintf"%.2d",$1-1||12:$1]}/@{[sprintf"%.2d",$2-1||31]}/@{[$1==1 && $2==1?$3-1:$3]}/@{[$4-1]}#;
$date =~ s#([0-9]+)\z#@{[$1+1]}# unless $date =~ m#^(?:0[1-9]|1[012])/(?:0[1-9]|1[0-9]|2[0-8]|(?<!0[2469]/|11/)31|(?<!02/)30|(?<!02/(?=...(?:..(?:[02468][1235679]|[13579][01345789])|(?:[02468][1235679]|[13579][01345789])00)))29)/#;
}
return $date;
},
);
print "$_: ", $subs{$_}(), "\n" for keys %subs;
Benchmark::cmpthese -1, \%subs;
And here are the results. The strftime method seems to be the fastest, but it also has the fewest features.
y: 6/14/2011
dm: 2011-06-14
p: 2011-06-14
dc: 2011-06-14
cd: 2011-06-14
dt: 2011-06-15
ds: 2011-06-14
       Rate    dt    dm     y    ds    cd    dc     p
dt   1345/s    --   -5%  -28%  -77%  -82%  -96%  -98%
dm   1408/s    5%    --  -24%  -75%  -81%  -96%  -98%
y    1862/s   38%   32%    --  -68%  -75%  -95%  -97%
ds   5743/s  327%  308%  208%    --  -24%  -84%  -90%
cd   7529/s  460%  435%  304%   31%    --  -78%  -87%
dc  34909/s 2495% 2378% 1775%  508%  364%    --  -39%
p   56775/s 4121% 3931% 2949%  889%  654%   63%    --
Better than a benchmark is a test of how they handle DST (this test would have caught the error in the assumption about what DateTime->now returns).
#!/usr/bin/perl
use strict;
use warnings;
use Time::Mock;
use Date::Manip qw/UnixDate/;
use Date::Simple qw/today/;
use Date::Calc qw/Add_Delta_Days Today/;
use DateTime;
use POSIX qw/strftime mktime/;
use Class::Date;
sub target {
my @date = localtime;
$date[3] -= 5 * 7;
strftime "%Y-%m-%d", @date;
}
my %subs = (
cd => sub {
(Class::Date::now - [0, 0, 5 * 7])->strftime("%Y-%m-%d");
},
dc => sub { sprintf "%d-%02s-%02d", Add_Delta_Days Today, -5 * 7;
},
dm => sub {
UnixDate("5 weeks ago", "%Y-%m-%d");
},
ds => sub {
(today() - 5 * 7)->strftime("%Y-%m-%d");
},
dt => sub {
my $now = DateTime->from_epoch( epoch => time, time_zone => 'local' );
my $five_weeks = DateTime::Duration->new(weeks => 5);
($now - $five_weeks)->ymd('-');
},
y => sub {
my ($d, $m, $y) = (localtime)[3..5];
my $date = join "/", $m+1, $d, $y+1900;
my $days_ago = 7*5;
$date =~ s#^([0-9]+)/([0-9]+)/([0-9]+)\z#@{[sprintf"%.2d",$1]}/@{[sprintf"%.2d",$2]}/$3/$days_ago#;
until ( $date =~ s#^([0-9]+)/([0-9]+)/([0-9]+)/0\z#@{[$1+0]}/@{[$2+0]}/$3# ) {
$date =~ s#([0-9]+)/([0-9]+)/([0-9]+)/([0-9]+)#@{[$2==1?sprintf"%.2d",$1-1||12:$1]}/@{[sprintf"%.2d",$2-1||31]}/@{[$1==1 && $2==1?$3-1:$3]}/@{[$4-1]}#;
$date =~ s#([0-9]+)\z#@{[$1+1]}# unless $date =~ m#^(?:0[1-9]|1[012])/(?:0[1-9]|1[0-9]|2[0-8]|(?<!0[2469]/|11/)31|(?<!02/)30|(?<!02/(?=...(?:..(?:[02468][1235679]|[13579][01345789])|(?:[02468][1235679]|[13579][01345789])00)))29)/#;
}
return join "-", map { sprintf "%02d", $_ }
(split "/", $date)[2,0,1];
},
);
my $time = mktime 0, 0, 0, 13, 2, 111; #2011-03-13 00:00:00, DST in US
for my $offset (map { $_ * 60 * 60 } 1 .. 24) {
print strftime "%Y-%m-%d %H:%M:%S\n", (localtime $time + $offset);
Time::Mock->set($time + $offset);
my $target = target;
for my $sub (sort keys %subs) {
my $result = $subs{$sub}();
if ($result ne $target) {
print "$sub disagrees: ",
"time $time target $target result $result\n";
}
}
}
Using Time::Piece:
use Time::Piece;
use Time::Seconds qw(ONE_DAY);
my $weeks_back = 5;
my $date_str = '7/19/2011';
my $dt = Time::Piece->strptime($date_str, '%m/%d/%Y');
# Avoid DST issues
$dt -= ONE_DAY() * ( 7 * $weeks_back - 0.5 );
my $past_str = $dt->strftime('%m/%d/%Y');
print "$past_str\n";
Too much code for such a simple question! All you need is two simple lines:
my $five_weeks_ago = time - (5*7)*24*60*60;
print scalar localtime($five_weeks_ago), "\n";
My solution is accurate for both DST and leap years.
Here is the way to get the date of 5 weeks back:
$ uname
HP-UX
$ date
Wed Nov 11 09:42:05 CST 2015
$ perl -e 'my ($d,$m,$y) = (localtime(time-60*60*24*(5*7)))[3,4,5]; printf("%d/%d/%d\n", $m+1, $d, $y+1900);'
10/7/2015
say POSIX::strftime(
'%m/%d/%Y' # format string -> mm/dd/YYYY
, 0 # no seconds
, 0 # no minutes
, 0 # no hours
, 19 - ( 5 * 7 ) # current day - numweeks * 7
, 7 - 1 # month - 1
, 2011 - 1900 # YYYY year - 1900
);
Yes, the day comes out to be 19 - 35 = -16, and yes it works.
If the date is available as a Unix timestamp, it can be done with simple arithmetic:
use POSIX qw/strftime/;
say strftime('%Y-%m-%d', localtime(time - 5 * 7 * 86400));
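Several of the answers above subtract a fixed number of seconds from a local time, which is only exact when every day in the range is 24 hours long. When the starting point is a calendar date rather than "now", one way to sidestep DST entirely is to do the arithmetic in UTC, where days never change length. A minimal sketch using the core Time::Local module; date_weeks_before is a hypothetical helper and the mm/dd/YYYY output format is simply copied from the question.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: takes a calendar date and a number of weeks,
# returns the earlier date as mm/dd/YYYY.
sub date_weeks_before {
    my ( $year, $month, $day, $weeks ) = @_;
    # Midnight UTC on the given date, minus whole 86,400-second days.
    my $epoch = timegm( 0, 0, 0, $day, $month - 1, $year ) - $weeks * 7 * 86_400;
    my ( $d, $m, $y ) = ( gmtime($epoch) )[ 3, 4, 5 ];
    return sprintf '%02d/%02d/%04d', $m + 1, $d, $y + 1900;
}

print date_weeks_before( 2011, 7, 19, 5 ), "\n";    # 06/14/2011
```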

How can I generate a set of ranges from the first letters of a list of words in Perl?

I'm not sure exactly how to explain this, so I'll just start with an example.
Given the following data:
Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry
I want to generate an index based on the first letter of my data, but I want the letters grouped together.
Here is the frequency of the first letters in the above dataset:
2 A
2 B
3 C
1 E
2 G
1 K
1 M
1 N
4 P
2 R
1 S
Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:
A B C D-G H-O P Q-Z
Clicking the "D-G" link would show:
Elderberry
Grapefruit
Grapes
In my range listing above, I am covering the full alphabet. I guess that is not strictly necessary; I would be fine with this output as well:
A B C E-G K-N P R-S
Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.
I am not too worried about lopsided data either, so if 40% of my data starts with an "S", then S will just have its own link. I don't need to break it down by the second letter in the data.
Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.
I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.
Here is what I started with:
#!/usr/bin/perl
use strict;
use warnings;
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};
open( my $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";
while ( my $item = <$DATASET> ) {
chomp($item);
my $first_letter = uc( substr( $item, 0, 1 ) );
$index_frequency->{$first_letter}++;
}
foreach my $letter ( sort keys %{$index_frequency} ) {
if ( $index_frequency->{$letter} ) {
# build $ranges here
}
}
My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.
Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.
Thanks in advance!
Basic approach:
#!/usr/bin/perl -w
use strict;
use autodie;
my $PAGE_SIZE = 3;
my %frequencies;
open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
next unless $l =~ m{\A([a-z])}i;
$frequencies{ uc $1 }++;
}
close $fh;
my $current_sum = 0;
my @letters = ();
my @pages = ();
for my $letter ( "A" .. "Z" ) {
my $letter_weigth = ( $frequencies{ $letter } || 0 );
if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
push @pages, $title;
}
$current_sum = $letter_weigth;
@letters = ( $letter );
next;
}
push @letters, $letter;
$current_sum += $letter_weigth;
}
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar @letters;
push @pages, $title;
}
print "Pages : " . join( " , ", @pages ) . "\n";
The problem with it is that it outputs (from your data):
Pages : A , B , C-D , E-J , K-O , P , Q-Z
But I would argue this is actually a good approach :) And you can always change the for loop into:
for my $letter ( sort keys %frequencies ) {
if you need.
Here's my suggestion:
# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
push @{ $freq{ $count } }, $letter;
}
# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
my @sorted_letters = sort @{ $freq{$_} };
print "$_: @sorted_letters\n";
}
Update: I think that I misunderstood your requirements. The following code block does something more like what you want.
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# get the maximum frequency (numeric sort; the default sort compares strings)
my $max_freq = (sort { $a <=> $b } values %count)[-1];
my $curr_set_count = 0;
my @curr_set = ();
foreach ('A' .. 'Z') {
push @curr_set, $_;
$curr_set_count += $count{$_} || 0;
if ($curr_set_count >= $max_freq) {
# print out the range of the current set, then clear the set
if (@curr_set > 1) {
print "$curr_set[0] - $curr_set[-1]\n";
}
else {
print "$_\n";
}
@curr_set = ();
$curr_set_count = 0;
}
}
# print any trailing letters from the end of the alphabet
if (@curr_set > 1) {
print "$curr_set[0] - $curr_set[-1]\n";
}
elsif (@curr_set == 1) {
print "$curr_set[0]\n";
}
Try something like this, where frequency is the frequency array you computed in the previous step, threshold_low is the minimum number of entries in a range, and threshold_high is the maximum. This should give evenly balanced results.
count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
    count += frequency[letter];
    if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
        if (inrange): print rangeStart+'-'
        print letter+' '
        inrange=false
        count=0
    else:
        if (not inrange) rangeStart=letter
        inrange=true
use strict;
use warnings;
use List::Util qw(sum);
my @letters = ('A' .. 'Z');
my @raw_data = qw(
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);
# Store the data by starting letter.
my %data;
push @{$data{ substr $_, 0, 1 }}, $_ for @raw_data;
# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar @$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);
# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my @pages = (['']);
for my $letter (@letters){
my @d = exists $data{$letter} ? @{$data{$letter}} : ();
if (@{$pages[-1]} - 1 < $MAX_SIZE or @d == 0){
push @{$pages[-1]}, @d;
$pages[-1][0] .= $letter;
}
else {
push @pages, [ $letter, @d ];
}
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for @pages; # Convert letters to range.
This is an example of how I would write this program.
#! /opt/perl/bin/perl
use strict;
use warnings;
my %frequency;
{
use autodie;
open my $data_file, '<', 'datafile';
while( my $line = <$data_file> ){
my $first_letter = uc( substr( $line, 0, 1 ) );
$frequency{$first_letter} ++
}
# $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';
# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;
my @index;
my @group;
for my $letter ( sort keys %frequency ){
my $frequency = $frequency{$letter};
if( $frequency >= $threshold ){
if( @group ){
if( @group == 1 ){
push @index, @group;
}else{
# push @index, [@group]; # copy @group
push @index, "$group[0]-$group[-1]";
}
@group = ();
}
push @index, $letter;
}elsif( sum( @frequency{@group,$letter} ) >= $threshold ){
if( @group == 1 ){
push @index, @group;
}else{
#push @index, [@group];
push @index, "$group[0]-$group[-1]"
}
@group = ($letter);
}else{
push @group, $letter;
}
}
#push @index, [@group] if @group;
push @index, "$group[0]-$group[-1]" if @group;
print join( ', ', @index ), "\n";
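For reference, the second output asked for in the question (A B C E-G K-N P R-S, covering only letters that actually occur) falls out of a plain greedy pass: keep adding letters to the current group until the next one would push the group over the maximum, then start a new group. This is a minimal sketch, not a rewrite of any answer above; group_letters is a hypothetical helper that takes the per-letter counts from the question and a fixed maximum.

```perl
use strict;
use warnings;

# Hypothetical helper: $count is a hash ref of letter => number of items,
# $max is the largest number of items allowed in a multi-letter group.
sub group_letters {
    my ( $count, $max ) = @_;
    my ( @labels, @group );
    my $sum = 0;

    # Turn the current group into a label ("E-G" or just "P") and reset it.
    my $flush = sub {
        return unless @group;
        push @labels, @group > 1 ? "$group[0]-$group[-1]" : $group[0];
        @group = ();
        $sum   = 0;
    };

    for my $letter ( sort keys %$count ) {
        # Close the group first if this letter would overfill it. A letter
        # that is over the limit on its own still gets a group to itself.
        $flush->() if @group and $sum + $count->{$letter} > $max;
        push @group, $letter;
        $sum += $count->{$letter};
    }
    $flush->();
    return @labels;
}

my %count = ( A => 2, B => 2, C => 3, E => 1, G => 2, K => 1,
              M => 1, N => 1, P => 4, R => 2, S => 1 );
print join( ' ', group_letters( \%count, 3 ) ), "\n";    # A B C E-G K-N P R-S
```

To cover the full alphabet instead, iterate over 'A' .. 'Z' and treat missing letters as a count of zero, as the first answer does.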