Substitute Array Elements with Hash Values - perl

I am trying to write a Perl script that takes a date in the format October 24, 2011 and converts it to 10,24,2011.
To do this I have prepared a hash with the month name as the key and the month's numeric position as the value.
I read the input string and use a regular expression to extract the month name from the format above.
I then replace the month name with the hash value stored under that name.
Here's the script I have coded so far, but it's not working for me.
#dates array will have every element in this format -> October 24, 2011.
%days=("January",01,"February",02,"March",03,"April",04,"May",05,"June",06,"July",07,"August",08,"September",09,"October",10,"November",11,"December",12);
@output = map{
$pattern=$_;
$pattern =~ s/(.*)\s/$days{$1};
} @dates;
foreach $output (@output)
{
print $output."\n";
}
Here's a little explanation of what I am trying to do with this code.
@output will hold the newly formatted array, with each month name replaced by the corresponding number as defined in the hash.
The map function is used to transform the elements of the array on the fly.
The regular expression, a sequence of characters followed by a space, is used to extract the month name from a string such as October 24, 2011.
This will be referenced by $1.
I look up the corresponding value for $1 in the hash using, $days{$1}

I see a few problems here. The first is that there is no use strict;.
A number with a leading zero is assumed to be in octal format (i.e. base 8), so 08 and 09 are invalid. You want one of these:
%days = ("January", 1, "February", 2, ...
%days = ("January", "01", "February", "02", ...
%days = ("January" => 1, "February" => 2, ...
%days = ("January" => "01", "February" => "02", ...
You should also be declaring your variables with my:
my %days = ...
my @output = ...
You're missing the final slash on your substitution, you probably want a comma in there to match your desired output format, and .* will eat up more than you want:
$pattern =~ s/(\S*)\s/$days{$1}, /;
The block for your map needs to return the value you want in @output, but it currently returns 1 (the substitution returns the number of replacements made; see perldoc perlop to learn why). Something like this will serve you better:
my @output = map {
my $pattern=$_; # You don't need this, operating on $_ is fine here
$pattern =~ s/(\S*)\s/$days{$1}, /;
$pattern
} @dates;
If you really want the spaces removed from the output, then this should do the trick:
my @output = map {
my $pattern=$_; # You don't need this, operating on $_ is fine here
$pattern =~ s/(\S*)\s/$days{$1}, /;
$pattern =~ s/\s//g;
$pattern
} @dates;
There are more compact ways to do this map but I don't want to change too much and confuse you.
And, as mentioned in the comments, you might want to save yourself some trouble and have a look at DateTime and related packages.
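As a rough sketch of one of the more compact forms, the conversion can be moved into a small sub. The name to_numeric_date and the sample dates are made up for illustration, and the hash is built with sprintf so the values are strings rather than octal literals:

```perl
use strict;
use warnings;

# Month name => two-digit month number, stored as strings so "08"/"09" are safe.
my @names = qw(January February March April May June July
               August September October November December);
my %month_num;
@month_num{@names} = map { sprintf '%02d', $_ } 1 .. 12;

# Hypothetical helper: "October 24, 2011" -> "10,24,2011".
# Strings that do not look like "Month D, YYYY" are returned unchanged.
sub to_numeric_date {
    my ($date) = @_;
    if ( $date =~ /^(\w+)\s+(\d{1,2}),\s*(\d{4})$/ and exists $month_num{$1} ) {
        return join ',', $month_num{$1}, $2, $3;
    }
    return $date;
}

my @dates  = ( 'October 24, 2011', 'March 3, 2012' );
my @output = map { to_numeric_date($_) } @dates;
print "$_\n" for @output;    # prints 10,24,2011 then 03,3,2012
```

Anchoring the pattern and checking exists means an unknown month name leaves the string alone instead of silently inserting an empty value.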

Leaving aside the fact that you pasted non-compiling code (you forgot the trailing "/", as sarnold said), your regex is wrong.
You used a GREEDY regex: .* means take as many characters as possible while still matching. So your regex captured October 24, instead of October.
You need to use \S+\s

Do you want to "substitute array elements with hash values", or do you want to map month names to numbers? If it's the latter, the following will convert month_name day year to month_number day year with less code:
perl -le '$d=$ARGV[0]; for (qw{Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}) { $i++; last if $d =~ s/\b$_[^\s]*/$i/i; }; print $d' "october 24, 2011"

Here's some feedback on your code:
Your pasted code does not compile very well.
You didn't use strict and warnings.
08 and 09 are not valid octal literals, so 01 to 09 need to be quoted.
You do not need to reassign $_ inside your map statement.
map needs to end with the value you intend to insert, e.g.: map { s/(\w+)/$days{$1}/; $_ }
say for @output looks nicer. =)

Related

Assigning match variables from nth match

In the code which follows, I want $a and $b to be set to 02 02 (i.e. the values of the second match from $l). If I add a line ($a,$b)=($1,$2);, it works, but I'd rather do it in a single line if possible.
Can someone let me know what's wrong?
#!/usr/bin/perl -w
my $l = "01:01 02:02";
my ($a,$b);
if ( ($a,$b) = ( ( $l =~ /(\d\d):(\d\d)/g)[1] ) ) {
print "12 $1 $2\n";
print "ab $a $b\n";
}
Output:
12 02 02
Use of uninitialized value $b in concatenation (.) or string at ./gs.pl line 11.
ab 01
First off, don't use $a or $b as variable names. Those are used by sort - and you'll start to wonder what's going on if you end up sorting in this same scope.
So, for lack of better names, let's try $h and $m.
So, now, what does $l =~ /(\d\d):(\d\d)/g return in list context? It returns all the values it found: ("01","01","02","02"). Now do you see why [1] doesn't get what you're looking for?
Depending on what you're doing, this may be as simple as using [-2,-1] as the index lookup (to always get the last two). Or it may be that you want to loop through all the pairings and do something. There's not enough context, but that should give you some idea to go on with.
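Both options can be sketched as follows. The helper name time_pairs is made up for illustration; the input string is the one from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: collect the captures of every hh:mm match,
# one array reference per match.
sub time_pairs {
    my ($text) = @_;
    my @pairs;
    while ( $text =~ /(\d\d):(\d\d)/g ) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

my $l = "01:01 02:02";

# In list context, /g returns every capture of every match, flattened.
my @all = $l =~ /(\d\d):(\d\d)/g;    # ("01", "01", "02", "02")

# A slice picks the captures of the last match in one statement.
my ( $h, $m ) = @all[ -2, -1 ];
print "last: $h $m\n";               # last: 02 02

# Looping keeps the two captures of each match together.
my @pairs = time_pairs($l);
print "second: @{ $pairs[1] }\n";    # second: 02 02
```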

how to make grep use a "from X to Y" syntax? (Using date as parameter)

so I would like to write a script that scans through orders and files and pastes certain lines of these files into a file.
How can I let my file scan through a specified range instead of a singular date?
Actually the code I'd need to change looks like this:
$curdir = "$scriptdir\\$folder";
opendir my $dir_b, "$curdir" or die "Can't open directory: $!";
my @file = grep { /$timestamp/ } readdir $dir_b;
closedir $dir_b;
Now line 3 needs to work like this:
my @file = grep { /$timestamp1 to $timestamp2/ } readdir $dir_b;
Does anyone know how to achieve this? For example, timestamp1 would be 20160820 and timestamp2 would be 20160903 or 20160830.
Thanks for the help.
You can use Regexp::Assemble to build one pattern out of all timestamps that are in the range of your dates.
use strict;
use warnings;
use Regexp::Assemble;
my $timestamp_late = 20160830;
my $timestamp_early = 20160820;
my $ra = Regexp::Assemble->new;
$ra->add( $_ ) for $timestamp_early .. $timestamp_late;
print $ra->re;
The pattern for that case is: (?^:201608(?:2\d|30))
You can now use it like this:
my $pattern = $ra->re;
my @files = grep { /$pattern/ } readdir $dir_b;
It works by combining multiple patterns into a single one.
Regexp::Assemble takes an arbitrary number of regular expressions and assembles them into a single regular expression (or RE) that matches all that the individual REs match.
As a result, instead of having a large list of expressions to loop over, a target string only needs to be tested against one expression. This is interesting when you have several thousand patterns to deal with. Serious effort is made to produce the smallest pattern possible.
Our patterns here are rather simple (they are just strings), but it works nonetheless. The resulting pattern works like this:
(?^: ) # non-capture group w/o non-default flags for the sub pattern
201608 # literal 201608
(?: ) # non-capture group
2\d # literal 2 followed by a digit (0-9)
| # or
30 # literal 30
The (?^:) is explained in this part of perlre.
If you pass in more numbers, the resulting regex will look different. Of course, this is not meant for dates: the plain numeric range gives us every number in between. The .. is the range operator; for a simple expression such as 1 .. 9 it returns the list (1, 2, 3, 4, 5, 6, 7, 8, 9).
So if you want to make sure that you only get valid dates, you can generate the dates with a date module such as DateTime instead. Here's an example.
use strict;
use warnings;
use Regexp::Assemble;
use DateTime;
my $timestamp_late = DateTime->new( year => 2016, month => 9, day => 1 );
my $timestamp_early = DateTime->new( year => 2016, month => 8, day => 19 ); # -1 day
my $ra = Regexp::Assemble->new;
while ( $timestamp_early->add( days => 1 ) <= $timestamp_late ) {
$ra->add( $timestamp_early->ymd(q{}) );
}
print $ra->re;
This goes over to the next month and gives
(?^:20160(?:8(?:3[01]|2\d)|901))
which only matches real dates, while the other, simpler solution includes every number in between, including the 99th of August:
(?^:20160(?:8(?:2\d|3\d|4\d|5\d|6\d|7\d|8\d|9\d)|90[01]))
Solution by Сухой27, posted as a comment
my @file = grep { /$timestamp1/ .. /$timestamp2/ } readdir $dir_b;
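A very nice example of the range operator in its flip-flop form. Note that it depends on the names arriving in sorted order and on both boundary timestamps actually occurring; readdir does not guarantee any order, so sort the list first.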
I favor some simpler approaches that are easy for someone to understand. The flip-flop is cool, but almost no one knows what it does.
You don't have to do everything in one operation:
my @file = grep {
my $this_date = ...;
$lower_date <= $this_date and $this_date <= $higher_date;
} @inputs;
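One way to fill in the elided step, as a minimal sketch: it assumes each file name contains its date as a run of eight digits (YYYYMMDD), which the question suggests but does not show. The helper name files_in_range and the sample file names are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: keep the names whose 8-digit YYYYMMDD stamp
# lies within [$lower, $upper], inclusive.
sub files_in_range {
    my ( $lower, $upper, @names ) = @_;
    return grep {
        my ($stamp) = /(\d{8})/;    # first run of 8 digits, if any
        defined $stamp and $lower <= $stamp and $stamp <= $upper;
    } @names;
}

my @inputs = qw(orders_20160819.txt orders_20160820.txt
                orders_20160831.txt orders_20160904.txt readme.txt);
print "$_\n" for files_in_range( 20160820, 20160903, @inputs );
# prints orders_20160820.txt and orders_20160831.txt
```

Because YYYYMMDD stamps sort the same way numerically and chronologically, a plain numeric comparison is enough, and the result does not depend on the order in which the names arrive.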

Perl sort with time stamp using hash

I'm new to Perl and need help with sorting using the hash and/or any other possible method this can be done in Perl.
I've an input file like below and would like to generate the output file as shown.
I'm thinking if this can be done by putting it in hash and then comparing? Please also provide an explanations to the steps for the learning purpose if possible.
If the file has duplicate/triplicate entries matching with different timestamp, it should only list the latest time stamp entry.
Input file
A May 19 23:59:14
B May 19 21:59:14
A May 22 07:59:14
C Apr 10 12:23:00
B May 11 10:23:34
The output should be
A May 22 07:59:14
B May 19 21:59:14
C Apr 10 12:23:00
You can use your labels (A, B, etc.) as hash keys and the timestamps as values.
Then read the input file and compare the timestamps as real time values rather than as strings, keeping only the latest entry for each key and discarding the rest. Print the result at the end.
A hash is good for coalescing duplicates.
However, sorting by time stamp requires converting the 'text' representation to an actual time. Time::Piece is one of the better options for doing this:
#!/usr/local/bin/perl
use strict;
use warnings;
use Time::Piece;
my %things;
while (<DATA>) {
my ( $letter, $M, $D, $T ) = split;
my $timestamp =
Time::Piece->strptime( "$M $D $T 2015", "%b %d %H:%M:%S %Y" );
if ( not defined $things{$letter}
or $things{$letter} < $timestamp )
{
$things{$letter} = $timestamp;
}
}
foreach my $thing ( sort keys %things ) {
print "$thing => ", $things{$thing}, "\n";
}
__DATA__
A May 19 23:59:14
B May 19 21:59:14
A May 22 07:59:14
C Apr 10 12:23:00
B May 11 10:23:34
Note though - your timestamps are ambiguous because they omit the year. You have to deal with this some way. I've gone for the easy road of just inserting 2015. That's not good practice - at the very least you should use some way of discovering 'current year' automatically - but bear in mind that at some points in the year, this will Just Break.
You can format output date using the strftime method within Time::Piece - this is merely the default output.
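A small sketch of both points, assuming every entry belongs to the current calendar year (which, as noted, breaks around the turn of the year). The helper name parse_stamp is made up for illustration:

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: parse "Mon DD HH:MM:SS" under an assumed year.
sub parse_stamp {
    my ( $text, $year ) = @_;
    return Time::Piece->strptime( "$text $year", '%b %d %H:%M:%S %Y' );
}

# Assumption: all timestamps fall in the current year.
my $year = localtime->year;

my $t = parse_stamp( 'May 22 07:59:14', $year );

# strftime reproduces the layout of the input file.
print 'A ', $t->strftime('%b %d %H:%M:%S'), "\n";    # A May 22 07:59:14
```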

Optimize Perl script - runs too slow on 40GB+ files

I made the following Perl script to handle some file manipulation at work, but it's running far too slowly at the minute to be put in production.
I don't know Perl very well (not one of my languages), so can someone help me identify and replace parts of this script that would be slow given it's processing ~40 million lines?
Data being piped in is in the format:
col1|^|col2|^|col3|!|
col1|^|col2|^|col3|!|
... 40 million of these.
The date_cols array is calculated before this part of the script and basically holds the index of columns containing dates in the pre-converted format.
Here's the part of the script that will be executed for every input row. I've cleaned it up a little and added comments, but let me know if anything else is needed:
## Read from STDIN until no more lines are available.
while (<STDIN>)
{
## Split by field delimiter
my @fields = split('\|\^\|', $_, -1);
## Remove the terminating delimiter from the final field so it doesn't
## interfere with date processing.
$fields[-1] = (split('\|!\|', $fields[-1], -1))[0];
## Cycle through all column numbers in date_cols and convert date
## to yyyymmdd
foreach $col (@date_cols)
{
if ($fields[$col] ne "")
{
$fields[$col] = formatTime($fields[$col]);
}
}
print(join('This is an unprintable ASCII control code', @fields), "\n");
}
## Format the input time to yyyymmdd from 'Dec 26 2012 12:00AM' like format.
sub formatTime($)
{
my $col = shift;
if (substr($col, 4, 1) eq " ") {
substr($col, 4, 1) = "0";
}
return substr($col, 7, 4).$months{substr($col, 0, 3)}.substr($col, 4, 2);
}
If written purely for efficiency, I'd write your code like this:
sub run_loop {
local $/ = "|!|\n"; # set the record input terminator
# to the record separator of our problem space
while (<STDIN>) {
# remove the separator
chomp;
# Split by field delimiter
my @fields = split m/\|\^\|/, $_, -1;
# Cycle through all column numbers in date_cols and convert date
# to yyyymmdd
foreach my $col (@date_cols) {
if ($fields[$col] ne "") {
# $fields[$col] = formatTime($fields[$col]);
my $temp = $fields[$col];
if (substr($temp, 4, 1) eq " ") {
substr($temp, 4, 1) = "0";
}
$fields[$col] = substr($temp, 7, 4).$months{substr($temp, 0, 3)}.substr($temp, 4, 2);
}
}
print join("\022", @fields) . "\n";
}
}
The optimizations are:
Using chomp to remove the |!|\n string at the end.
Inlining the formatTime sub.
Subroutine calls are comparatively expensive in Perl. If subs have to be called very efficiently, prototype checking can be disabled with the &subroutine(@args) syntax. If the argument list is omitted entirely (&subroutine;), the current arguments @_ are visible to the called sub. This can lead to bugs or additional performance. Use wisely. The goto &subroutine; syntax can be used as well, but this meddles with return (basically a tail call). Do not use.
Further optimizations could include removing the hash lookup %months, as hashing is expensive.
You'll have to benchmark on your data set to compare, but you can throw a regex at it. (Made all the worse by your very regex-unfriendly field and record separators!)
my $i = 0;
our %months = map { $_ => sprintf('%02d', ++$i) } qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
while (<DATA>) {
s! \|\^\| !\022!xg; # convert field separator
s/ \| !\| $ //xg; # strip record terminator
s/\b(\w{3}) ( \d|\d\d) (\d{4}) \d\d:\d\d[AP]M\b/${3} . $months{$1} . sprintf('%02d', $2) /eg;
print;
}
Note that this won't do what you want if one of the non-@date_cols fields matches the date regex.
At my work I sometimes need to grep error logs etc. from 350+ frontends. I use a script template I call "SMP grep" ;) It's simple:
stat the file to get the file length
Compute the "chunk length" = file_length / num_processors
Adjust the chunk starts and ends so they fall on "\n". Just read(), find "\n" and calculate the offsets.
fork() to make num_processors workers, each working on its own chunk
This can help if you use regexps in your grep or do other CPU-bound operations (as in your case, I think). Admins complain that this script eats disk throughput, but that is the only bottleneck if the server has 8 CPUs =) Also, obviously, if you need to parse a week of data you can divide it between servers.
I can post the code tomorrow if you're interested.
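The offset calculation in the steps above could look roughly like this. It is only a sketch of the chunking step, not the poster's actual script: the helper name chunk_ranges is hypothetical, and the fork() part is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: split a file into up to $n byte ranges [start, end)
# so that every range ends just after a "\n" (or at end of file).
sub chunk_ranges {
    my ( $path, $n ) = @_;
    my $size = ( stat $path )[7];
    die "Can't stat $path: $!" unless defined $size;
    open my $fh, '<', $path or die "Can't open $path: $!";

    my @starts = (0);
    for my $i ( 1 .. $n - 1 ) {
        my $pos = int( $size * $i / $n );    # naive boundary
        next if $pos <= $starts[-1];
        seek $fh, $pos, 0 or die "seek failed: $!";
        my $partial = <$fh>;                 # read on to the end of this line
        my $next    = tell $fh;
        push @starts, $next if $next > $starts[-1] and $next < $size;
    }
    close $fh;

    my @ranges;
    for my $i ( 0 .. $#starts ) {
        my $end = $i < $#starts ? $starts[ $i + 1 ] : $size;
        push @ranges, [ $starts[$i], $end ];
    }
    return @ranges;
}

# Usage: perl chunks.pl FILE  (prints one "start-end" pair per chunk)
if (@ARGV) {
    printf "%d-%d\n", @$_ for chunk_ranges( $ARGV[0], 4 );
}
```

Each worker would then seek to its start offset and read lines until tell reaches its end offset, so no line is processed twice or split between workers.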

perl help finding proper date format

I need some help trying to figure out how to format dates in perl. I have a working perl script, with a regular expression, that works fine if I use hard coded date strings like this:
my $mon = 'Aug';
my $day = '05';
my $year = '2010';
These vars are used in a regular expression like this:
if ($line =~ m/(.* $mon $day) \d{2}:\d{2}:\d{2} $year: ([^:]+):backup:/)
Now, I need to automate this date portion of the code and use current date systematically.
I looked into Perl's localtime and also tried putting the output of the Unix date command into variables.
I need single-digit days of the month to be padded with '0', as in today's 'Aug' '05' '2010', because the input file I am using for the regex has dates like this.
My second attempt, using the Unix date command with format strings, is returning numbers, but I need them to be strings:
my $mon2=`date '+%b'`;
my $day2=`date '+%d'`;
my $year2=`date '+%Y'`;
My test code for playing with date formats:
#!/usr/local/bin/perl -w
use strict;
my $mon = 'Aug';
my $day = '05';
my $year = '2010';
my $mon2=`date '+%b'`;
my $day2=`date '+%d'`;
my $year2=`date '+%Y'`;
print "$mon";
print "$day";
print "$year";
print "$mon2";
print "$day2";
print "$year2";
My Output:
Aug052010Aug
05
2010
I hate to break it to you, but you're reinventing the wheel. All this is implemented quite comprehensively in the DateTime distribution and the DateTime::Format:: family of classes:
use DateTime;
my $dt = DateTime->now;
print 'It is currently ', $dt->strftime('%b %d %H:%M:%S'), ".\n";
prints:
It is currently Aug 05 23:54:01.
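If installing DateTime is not an option, the core POSIX module can produce the same three strings. This is a minimal sketch; the helper name date_parts is made up for illustration, and %b follows the current locale, so the month abbreviation is only guaranteed to be English in a C/English locale:

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: month abbreviation, zero-padded day and 4-digit year
# for a given epoch time (defaults to the current time).
sub date_parts {
    my ($epoch) = @_;
    $epoch = time unless defined $epoch;
    my @lt = localtime $epoch;
    return ( strftime( '%b', @lt ), strftime( '%d', @lt ), strftime( '%Y', @lt ) );
}

my ( $mon, $day, $year ) = date_parts();
print "$mon $day $year\n";    # e.g. "Aug 05 2010" when run on that date

# The backtick version from the question also works once the trailing
# newline that date prints is removed:
#   chomp( my $mon2 = `date '+%b'` );
```

The values are ordinary strings, so they can be interpolated into the regular expression exactly like the hard-coded $mon, $day and $year.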