Modifying a perl script - perl

I'm kinda brand new to perl (well programming in general), and have been presented with a perl script (Id_script3.pl).
Code in question from Id_script3.pl:
# main sub
{ # closure
# keep %species local to sub-routine but only init it once
my %species;
sub _init {
open my $in, '<', 'SpeciesId.txt' or die "could not open SpeciesId.txt: $!";
my $spec;
while (<$in>) {
chomp;
next if /^\s*$/; # skip blank lines
if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
# handle letter = lines
$species{$spec}{$1} = [$2];
push @{$species{$spec}{$1}}, $3 if $3;
} else {
# handle species name lines
$spec = $_;
$len = length($spec) if (length($spec) > $len);
}
}
close $in;
}
sub analyze {
my ($masses) = @_;
_init() unless %species;
my %data;
# loop over species entries
SPEC:
foreach my $spec (keys %species) {
# loop over each letter of a species
LTR:
foreach my $ltr (keys %{$species{$spec}}) {
# loop over each mass for a letter
foreach my $mass (@{$species{$spec}{$ltr}}) {
# skip to next letter if it is not found
next LTR unless exists($masses->{$mass});
}
# if we get here, all mass values were found for the species/letter
$data{$spec}{cnt}++;
}
}
The script requires a modification, in which 'SpeciesId3.txt' will be used instead of the 'SpeciesId.txt' which is currently used by the script.
There is a slight difference between the two files, so a slight modification would need to be made to the script for it to function; the difference being that SpeciesId3.txt contains no letters (A =, B =, C =) and simply a (much) longer list of values as compared to the original 'SpeciesId.txt'.
SpeciesId.txt:
African Elephant
B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4
Indian Elephant
B = 1453.7
C = 1577.8
D = 2115.1
E = 2808.4
F = 2853.5 AND 2869.5
G = 2999.4 AND 3015.4
Rabbit
A = 1221.6 AND 1235.6
B = 1453.7
C = 1592.8
D = 2129.1
E = 2808.4
F = 2883.5 AND 2899.5
G = 2957.4 AND 2973.4
SpeciesID3.txt (File to be used/script to be modified for:)
African Elephant
826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
1029.5
1095.6
1105.6
Indian Elephant
835.4
836.4
840.5
852.4
868.4
877.4
886.4
892.5
894.5
898.5
908.5
920.5
950.5
1095.6
1105.6
1154.6
1161.6
1180.7
1183.6
1189.6
1196.6
1201.6
1211.6
1230.6
1261.6
1267.7
Rabbit
817.5
836.4
852.4
868.5
872.4
886.4
892.5
898.5
908.5
950.5
977.5
1029.5
1088.6
1095.6
1105.6
1125.5
1138.6
1161.6
1177.6
1182.6
1201.6
1221.6
1235.6
1267.7
1280.6
1311.6
1332.7
1378.5
1437.7
1453.7
1465.7
1469.7
As you can see, the letters (A =, B = ) have been lost for SpeciesID3.txt.
I've tried a couple of attempted "work-arounds" but am yet to write one that works.
Many Thanks,
Stephen.

Well, I don't know if I would consider keeping that script, as it looks rather messy, using script-globals inside subroutines, and strange labels. Here's a method you might like to consider, using Perl's paragraph mode by setting the input record separator $/ to the empty string.
This is a bit clunky since chomp cannot remove newlines from hash keys, so I used a do block to compensate. do { ... } works like a subroutine and returns the value of its last executed statement, in this case returns the elements of the array.
use strict;
use warnings;
use Data::Dumper;
local $/ = ""; # paragraph mode
my %a = do { my @x = <DATA>; chomp(@x); @x; }; # read the file, remove newlines
$_ = [ split ] for values %a; # split numbers into arrays
print Dumper \%a; # print data structure
__DATA__
African Elephant

826.4
836.4
840.4
852.4
858.4
886.4
892.5
898.5
904.5
920.5
950.5
1001.5
1015.5
1029.5
1095.6
1105.6

Indian Elephant

835.4
836.4
840.5
852.4
868.4
877.4
886.4
892.5
894.5
898.5
908.5
920.5
950.5
1095.6
1105.6
1154.6
1161.6
1180.7
1183.6
1189.6
1196.6
1201.6
1211.6
1230.6
1261.6
1267.7

Rabbit

817.5
836.4
852.4
868.5
872.4
886.4
892.5
898.5
908.5
950.5
977.5
1029.5
1088.6
1095.6
1105.6
1125.5
1138.6
1161.6
1177.6
1182.6
1201.6
1221.6
1235.6
1267.7
1280.6
1311.6
1332.7
1378.5
1437.7
1453.7
1465.7
1469.7
Output:
$VAR1 = {
'Rabbit' => [
'817.5',
'836.4',
'852.4',
'868.5',
'872.4',
'886.4',
'892.5',
'898.5',
'908.5',
'950.5',
'977.5',
'1029.5',
'1088.6',
'1095.6',
'1105.6',
'1125.5',
'1138.6',
'1161.6',
'1177.6',
'1182.6',
'1201.6',
'1221.6',
'1235.6',
'1267.7',
'1280.6',
'1311.6',
'1332.7',
'1378.5',
'1437.7',
'1453.7',
'1465.7',
'1469.7'
],
'Indian Elephant' => [
'835.4',
'836.4',
'840.5',
'852.4',
'868.4',
'877.4',
'886.4',
'892.5',
'894.5',
'898.5',
'908.5',
'920.5',
'950.5',
'1095.6',
'1105.6',
'1154.6',
'1161.6',
'1180.7',
'1183.6',
'1189.6',
'1196.6',
'1201.6',
'1211.6',
'1230.6',
'1261.6',
'1267.7'
],
'African Elephant' => [
'826.4',
'836.4',
'840.4',
'852.4',
'858.4',
'886.4',
'892.5',
'898.5',
'904.5',
'920.5',
'950.5',
'1001.5',
'1015.5',
'1029.5',
'1095.6',
'1105.6'
]
};
As you can see from this rather verbose output, the result is a hash with animals as keys, and your numbers as values. As long as you can rely on the names and numbers being separated by at least two consecutive newlines, and there are no arbitrary newlines inside the data, this method will do the trick.
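For readers more comfortable in Python, the same paragraph-mode idea might be sketched roughly as follows. This is only an illustration under the same assumption the answer states (names and number blocks separated by blank lines); `parse_paragraphs` is a hypothetical helper name, not part of any script in this thread.

```python
import re

def parse_paragraphs(text):
    """Map each species name to its list of mass strings."""
    # Split on blank lines, roughly what Perl's $/ = "" does.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Paragraphs are assumed to alternate: name, numbers, name, numbers, ...
    names = paragraphs[0::2]
    blocks = paragraphs[1::2]
    return {name: block.split() for name, block in zip(names, blocks)}

if __name__ == "__main__":
    sample = "African Elephant\n\n826.4\n836.4\n\nRabbit\n\n817.5\n836.4\n852.4\n"
    print(parse_paragraphs(sample))
```

As in the Perl version, a stray blank line inside a block of numbers would break the name/numbers alternation.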

if (m{^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$}) {
This line contains a regular expression which looks for an uppercase letter [A-Z] followed by an equals sign with optional whitespace on either side \s*=\s*. You basically just want to remove that prefix and simply match a number (\d+(?:\.\d)?).
Because $1, $2, $3 are numbered starting from the leftmost opening parenthesis, the number you want will be in $1 now. (Parentheses with ?: are non-capturing, and don't count.)
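The group numbering can be checked with a small Python sketch (Python's `re` counts capturing groups the same way for these patterns). The sample lines are taken from the two data files in the question; the names `OLD` and `NEW` are just labels for this illustration.

```python
import re

# Original pattern: letter, first mass, optional second mass after AND.
OLD = re.compile(r"^([A-Z])\s*=\s*(\d+(?:\.\d)?)(?:\s+AND\s+(\d+(?:\.\d)?))?$")
# Simplified pattern: a bare mass on its own line.
NEW = re.compile(r"^(\d+(?:\.\d)?)$")

# Only plain parentheses capture; the (?:...) groups are skipped when numbering.
old_groups = OLD.match("F = 2853.5 AND 2869.5").groups()
new_groups = NEW.match("826.4").groups()

if __name__ == "__main__":
    print(old_groups)
    print(new_groups)
```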
You also need to change the variable %species so that its keys are species names and its values simply a list of numbers (the extracted observations).
So:
if (m{^(\d+(?:\.\d)?)$}) {
push @{$species{$spec}}, $1;
}
The analyze subroutine needs to be similarly adapted (the LTR level is basically gone now).
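To make the shape of the adapted code concrete, here is a rough Python sketch of the flattened structure (species name mapped to a list of masses) and one possible `analyze`. What the adapted `analyze` should count is an assumption here: this version simply counts how many of a species' masses appear in the observed set, whereas the original counted letters whose masses were all found. `load_species` is a hypothetical name.

```python
import re

MASS_LINE = re.compile(r"^(\d+(?:\.\d)?)$")

def load_species(lines):
    """Build {species name: [mass, ...]} from name lines followed by mass lines."""
    species = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = MASS_LINE.match(line)
        if match and current is not None:
            species[current].append(match.group(1))
        else:
            # Anything that is not a bare number is treated as a species name.
            current = line
            species.setdefault(current, [])
    return species

def analyze(species, observed):
    """Count, per species, how many of its masses are in the observed set (assumed metric)."""
    return {
        name: sum(1 for mass in masses if mass in observed)
        for name, masses in species.items()
    }

if __name__ == "__main__":
    data = ["African Elephant", "826.4", "836.4", "", "Rabbit", "817.5", "836.4"]
    print(analyze(load_species(data), {"836.4", "817.5"}))
```

There is no letter level any more, so a single loop over each species' masses is enough.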

Related

Aligning file output with "\t"

I have an assignment that requires me to print out some sorted lists and delimit the fields by '\t'. I've finished the assignment but I cannot seem to get all the fields to line up with just the tab character. Some of the output is below, names that are over a certain length break the fields. How can I still use '\t' and get everything aligned by only that much space?
open(DOB, ">dob.txt") || die "cannot open $!";
# Output name and DOB, sorted by month
foreach my $key (sort {$month{$a} <=> $month{$b}} keys %month)
{
my @fullName = split(/ /, $namelist{$key});
print DOB "$fullName[1], $fullName[0]\t$doblist{$key}\n";
}
close(DOB);
Current output:
Santiago, Jose 1/5/58
Pinhead, Zippy 1/1/67
Neal, Jesse 2/3/36
Gutierrez, Paco 2/28/53
Sailor, Popeye 3/19/35
Corder, Norma 3/28/45
Kirstin, Lesley 4/22/62
Fardbarkle, Fred 4/12/23
You need to know how many spaces are equivalent to a tab. Then you can work out how many tabs are covered by each entry.
If tabs take 4 spaces then the following code works:
my $TAB_SPACE = 4;
my $NUM_TABS = 4;
foreach my $key (sort {$month{$a} <=> $month{$b}} keys %month) {
    my @fullName = split(/ /, $namelist{$key});
    my $name = "$fullName[1], $fullName[0]";
    # This rounds down, but that just means you need a partial tab
    my $covered_tabs = int(length($name) / $TAB_SPACE);
    print $name . ("\t" x ($NUM_TABS - $covered_tabs)) . "$doblist{$key}\n";
}
You need to know how many tabs to pad out to, but you could work that out in a very similar way to actually printing the lines.
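One way to work out the padding automatically is sketched below in Python. It assumes tab stops every 8 columns, which is a common terminal default but still an assumption about where the file will be viewed; `pad_with_tabs` and `align` are hypothetical helper names.

```python
TAB_WIDTH = 8  # assumed tab stop spacing; adjust to match the viewer

def pad_with_tabs(name, target_stops, tab_width=TAB_WIDTH):
    """Append enough tabs to push the next field to column target_stops * tab_width."""
    covered = len(name) // tab_width  # whole tab stops already used by the name
    return name + "\t" * (target_stops - covered)

def align(rows, tab_width=TAB_WIDTH):
    """rows is a list of (name, value) pairs; returns tab-delimited, aligned lines."""
    longest = max(len(name) for name, _ in rows)
    # First tab stop strictly beyond the longest name.
    target_stops = longest // tab_width + 1
    return [pad_with_tabs(name, target_stops, tab_width) + value for name, value in rows]

if __name__ == "__main__":
    rows = [("Santiago, Jose", "1/5/58"), ("Fardbarkle, Fred", "4/12/23")]
    for line in align(rows):
        print(line)
```

The longest name fixes the target column, so every line gets at least one tab and the second field always starts at the same tab stop.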

Want to extract the first letter of each word

I basically have a variable COUNTRY along with variables SUBJID and TREAT and I want to concatenate it like this ABC002-123 /NZ/ABC.
Suppose the COUNTRY variable had the value 'New Zealand'. I want to extract the first letter of each word, but only the first two letters of the value when COUNTRY contains a single word. I wanted to know how to simplify the code below, if possible in Perl.
If COUNTW(COUNTRY) GT 1 THEN
CAT_VAR=
UPCASE(SUBJID||"/"||CAT(SUBSTR(SCAN(COUNTRY,1,' '),1,1),
SUBSTR(SCAN(COUNTRY,2,' '),1,1))||"/"||TREAT);
my @COUNTRY = ("New Zealand", "Germany");
# 'NZ', 'GE'
my @two_letters = map {
    my @r = /\s/ ? /\b(\w)/g : /(..)/;
    uc(join "", @r);
} @COUNTRY;
The SAS Perl Regular Expression solution is to use CALL PRXNEXT along with PRXPOSN or CALL PRXPOSN (or a similar function, if you prefer):
data have;
infile datalines truncover;
input @1 country $20.;
datalines;
New Zealand
Australia
Papua New Guinea
;;;;
run;
data want;
set have;
length country_letter $5.;
prx_1 = prxparse('~(?:\b([a-z])[a-z]*\b)+~io');
length=0;
start=1;
stop = length(country);
position=0;
call prxnext(prx_1,start,stop,country,position,length);
do while (position gt 0);
matchletter = prxposn(prx_1,1,country);
country_letter = cats(country_letter,matchletter);
call prxnext(prx_1,start,stop,country,position,length);
put i= position= start= stop=;
end;
run;
I realize the OP might not be interested in another answer, but for other users browsing this thread and not wanting to use Perl expressions I suggest the following simple solution (for the original COUNTRY variable):
FIRST_LETTERS = compress(propcase(COUNTRY),'','l');
The propcase function capitalizes the first letter of each word and puts the others in lower case. The compress function with the 'l' modifier deletes all lower case letters.
COUNTRY may have any number of words.
How about this:
#!/usr/bin/perl
use warnings;
use strict;
my @country = ('New Zealand', 'Germany', 'Tanzania', 'Mozambique', 'Irish Republic');
my ($one_word_letters, $two_word_letters, @initials);
foreach (@country){
    if ($_ =~ /\s+/){ # Captures CAPs if 'country' contains a space
        my ($first_letter, $second_letter) = ($_ =~ /([A-Z])/g);
        my ($two_word_letters) = ($first_letter.$second_letter);
        push @initials, $two_word_letters; # Add to array for later
    }
    else { ($one_word_letters) = ($_ =~ /([A-Z][a-z])/); # If 'country' is only one word long, then capture first two letters (CAP+noncap)
        push @initials, $one_word_letters; # Add this to the same array
    }
}
foreach (@initials){ # Print contents of the capture array:
    print "$_\n";
}
Outputs:
NZ
Ge
Ta
Mo
IR
This should do the job provided there really are no 3 word countries. Easily fixed if there are though...
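For comparison, a version that does not care how many words the country has might look like the Python sketch below. The function names are hypothetical, and upper-casing the one-word case (GE rather than Ge) follows the UPCASE call in the question rather than the output above.

```python
def country_code(country):
    """First letter of each word, or the first two letters of a one-word name."""
    words = country.split()
    if len(words) > 1:
        return "".join(word[0] for word in words).upper()
    return country.strip()[:2].upper()

def cat_var(subjid, country, treat):
    """Build SUBJID/code/TREAT as in the question's UPCASE(...) expression."""
    return "/".join([subjid, country_code(country), treat]).upper()

if __name__ == "__main__":
    for name in ("New Zealand", "Germany", "Papua New Guinea"):
        print(name, "->", country_code(name))
    print(cat_var("ABC002-123", "New Zealand", "ABC"))
```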
This should do.
#!/usr/bin/perl
$init = &getInitials($ARGV[0]);
if($init)
{
print $init . "\n";
exit 0;
}
else
{
print "invalid name\n";
exit 1;
}
1;
sub getInitials {
$name = shift;
$name =~ m/(^(\S)\S*?\s+(\S)\S*?$)|(^(\S\S)\S*?$)/ig;
if( defined($1) and $1 ne '' ) {
return uc($2.$3);
} elsif( defined($4) and $4 ne '' ) {
return uc($5);
} else {
return 0;
}
}

Perl file manipulation using Tie::File

I am parsing an HTML file that contains data that is associated in a grid like manner and am close to being done. I had previously thought that removing all blank lines would be needed but I failed to notice that some fields in the grid are blank. I am now trying to use the Tie::File module to store the file in an array, iterate over it, and if there are three continuous blank lines, I want to insert a dummy value I can manipulate later so that the blank line stripping does not alter the structure of my data.
What I have tried so far (the file is ~2 MB):
my @lines;
my $num = 0;
tie @lines, 'Tie::File', 'results.txt';
(tied @lines)->defer;
foreach (@lines)
{
    chomp $lines[$num];
    $num++;
    if ($lines[$num-1] =~ /^$/ && $lines[$num+1] =~ /^$/)
    {
        $lines[$num] = "null";
    }
}
(tied @lines)->flush;
untie @lines;
Edit: How do I go about iterating over the array and insert the value so there is only one space between each line so I can later get rid of all the blank lines?
If I understand your problem correctly (replace three consecutive empty lines with the word "null" and an empty line on either side), perhaps the following regex operating on your file's contents will help:
use Modern::Perl;
my $htmlFile = do { local $/; <DATA> };
$htmlFile =~ s/(?<!\S)\n{3}/\nnull\n\n/g;
say $htmlFile;
__DATA__
A
B
C
D
E
F
Output:
null
A
B
null
null
C
D
null
E
F
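A line-by-line alternative to the regex, sketched in Python, is shown below. It assumes the lines are already in a list without trailing newlines and that a run of three empty lines means "separator, empty field, separator", which is an interpretation of the question rather than something it states. `mark_blank_fields` is a hypothetical name.

```python
def mark_blank_fields(lines, placeholder="null"):
    """Replace the middle line of each run of three empty lines with a placeholder."""
    out = list(lines)
    i = 0
    while i + 2 < len(out):
        if out[i] == "" and out[i + 1] == "" and out[i + 2] == "":
            out[i + 1] = placeholder
            # Step by two so the trailing blank can also open the next run
            # (two adjacent empty fields share one separator line).
            i += 2
        else:
            i += 1
    return out

if __name__ == "__main__":
    print(mark_blank_fields(["A", "", "", "", "B"]))
```

After this pass the blank lines can be stripped without losing the empty grid cells.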

How to isolate a word that corresponds with a letter from a different column of a CSV file?

I have a CSV file, like this:
ACDB,this is a sentence
BECD,this is another sentence
BCAB,this is yet another
Each character in the first column corresponds to a word in the second column, e.g., in the first column, A corresponds with "this", C with "is", D with "a", and B, with sentence.
Given the variable character, which can be set to any of the characters appearing in the first column, I need to isolate the word which corresponds to the selected letter, e.g., if I set character="B", then the output of the above would be:
sentence
this
this another
If I set character="C", then the output of the above would be:
is
another
is
How can I output only those words which correspond to the position of the selected letter?
The file contains many UTF-8 characters.
For every character in column 1, there is always an equal number of words in column 2.
The words in column 2 are separated by spaces.
Here is the code I have so far:
while read line
do
characters="$(echo $line | awk -F, '{print $1}')"
words="$(echo $line | awk -F, '{print $2}')"
character="B"
done < ./file.csv
This might work for you:
x=B # set wanted key variable
sed '
:a;s/^\([^,]\)\(.*,\)\([^ \n]*\) *\(.*\)/\2\4\n\1 \3/;ta # pair keys with values
s/,// # delete ,
s/\n[^'$x'] [^\n]*//g # delete unwanted keys/values
s/\n.//g # delete wanted keys
s/ // # delete first space
/^$/d # delete empty lines
' file
sentence
this
this another
or in awk:
awk -F, -vx=B '{i=split($1,a,"");split($2,b," ");c=s="";for(n=1;n<=i;n++)if(a[n]==x){c=c s b[n];s=" "} if(length(c))print c}' file
sentence
this
this another
This seems to do the trick. It reads data from within the source file using the DATA file handle, whereas you will have to obtain it from your own source. You may also have to cater for there being no word corresponding to a given letter (as for 'A' in the second data line here).
use strict;
use warnings;
my @data;
while (<DATA>) {
    my ($keys, $words) = split /,/;
    my @keys = split //, $keys;
    my @words = split ' ', $words;
    my %index;
    push @{ $index{shift @keys} }, shift @words while @keys;
    push @data, \%index;
}
for my $character (qw/ B C /) {
    print "character = $character\n";
    print join(' ', @{$_->{$character}}), "\n" for @data;
    print "\n";
}
__DATA__
ACDB,this is a sentence
BECD,this is another sentence
BCAB,this is yet another
output
character = B
sentence
this
this another
character = C
is
another
is
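The pairing of letters with words can also be sketched in a few lines of Python. This relies on the question's guarantee that each line has as many words as letters; `words_for` is a hypothetical helper name.

```python
def words_for(line, character):
    """Return the words whose position matches `character` in the first column."""
    keys, _, text = line.partition(",")
    # zip pairs the n-th letter with the n-th space-separated word.
    return [word for key, word in zip(keys, text.split()) if key == character]

if __name__ == "__main__":
    rows = [
        "ACDB,this is a sentence",
        "BECD,this is another sentence",
        "BCAB,this is yet another",
    ]
    for row in rows:
        print(" ".join(words_for(row, "B")))
```

Because Python strings are sequences of code points, this also handles non-ASCII letters in the first column, though combining characters would still count as separate keys.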
Here's a mostly-done rump answer.
Since SO is not a "Do my work for me" site, you will need to fill in some trivial blanks.
sub get_index_of_char {
    my ($character, $charset) = @_;
    # Homework: read about index() function
    # http://perldoc.perl.org/functions/index.html
}
sub split_line {
    my ($line) = @_;
    # Separate the line into a charset (before comma),
    # and whitespace separated word list.
    # You can use a regex for that
    my ($charset, $rest) = ($line =~ /^([^,]+),(.*)$/); # Not tested
    my @words = split ' ', $rest;
    return ($charset, \@words);
}
sub process_line {
    my ($line, $character) = @_;
    chomp($line);
    my ($charset, $words) = split_line($line);
    my $index = get_index_of_char($character, $charset);
    print $words->[$index] . "\n"; # Could contain an off-by-one bug
}
# Here be the main loop calling process_line() for every line from input

Parsing files that use synonyms

If I had a text file with the following:
Today (is|will be) a (great|good|nice) day.
Is there a simple way I can generate a random output like:
Today is a great day.
Today will be a nice day.
Using Perl or UNIX utils?
Closures are fun:
#!/usr/bin/perl
use strict;
use warnings;
my @gens = map { make_generator($_, qr~\|~) } (
    'Today (is|will be) a (great|good|nice) day.',
    'The returns this (month|quarter|year) will be (1%|5%|10%).',
    'Must escape %% signs here, but not here (%|#).'
);
for ( 1 .. 5 ) {
    print $_->(), "\n" for @gens;
}
sub make_generator {
    my ($tmpl, $sep) = @_;
    my @lists;
    while ( $tmpl =~ s{\( ( [^)]+ ) \)}{%s}x ) {
        push @lists, [ split $sep, $1 ];
    }
    return sub {
        sprintf $tmpl, map { $_->[rand @$_] } @lists
    };
}
Output:
C:\Temp> h
Today will be a great day.
The returns this month will be 1%.
Must escape % signs here, but not here #.
Today will be a great day.
The returns this year will be 5%.
Must escape % signs here, but not here #.
Today will be a good day.
The returns this quarter will be 10%.
Must escape % signs here, but not here %.
Today is a good day.
The returns this month will be 1%.
Must escape % signs here, but not here %.
Today is a great day.
The returns this quarter will be 5%.
Must escape % signs here, but not here #.
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $template = 'Today (is|will be) a (great|good|nice) day.';
for (1..10) {
print pick_one($template), "\n";
}
exit;
sub pick_one {
    my ($template) = @_;
    $template =~ s{\(([^)]+)\)}{get_random_part($1)}ge;
    return $template;
}
sub get_random_part {
    my $string = shift;
    my @parts = split /\|/, $string;
    return $parts[rand @parts];
}
Logic:
Define template of output (my $template = ...)
Enter loop to print random output many times (for ...)
Call pick_one to do the work
Find all "(...)" substrings, and replace them with random part ($template =~ s...)
Print generated string
Getting random part is simple:
receive extracted substring (my $string = shift)
split it using | character (my @parts = ...)
return random part (return $parts[...)
That's basically all. Instead of using function you could put the same logic in s{}{}, but it would be a bit less readable:
$template =~ s{\( ( [^)]+ ) \)}
    { my @parts = split /\|/, $1;
      $parts[rand @parts];
    }gex;
Sounds like you may be looking for Regexp::Genex. From the module's synopsis:
#!/usr/bin/perl -l
use Regexp::Genex qw(:all);
$regex = shift || "a(b|c)d{2,4}?";
print "Trying: $regex";
print for strings($regex);
# abdd
# abddd
# abdddd
# acdd
# acddd
# acdddd
Use a regex to match each parenthetical (and the text inside it).
Use a string split operation (pipe delimiter) on the text inside of the matched parenthetical to get each of the options.
Pick one randomly.
Return it as the replacement for that capture.
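The four steps above could be sketched in Python with `re.sub` and a replacement callback. The `rng` parameter is an addition for this illustration so that a seeded `random.Random` can be passed in for repeatable output; it is not part of the steps as described.

```python
import random
import re

PAREN = re.compile(r"\(([^)]+)\)")  # step 1: match each parenthetical

def pick_one(template, rng=random):
    """Replace every (a|b|c) group with one randomly chosen alternative."""
    def choose(match):
        options = match.group(1).split("|")  # step 2: split on the pipe
        return rng.choice(options)           # steps 3 and 4: pick one, return it
    return PAREN.sub(choose, template)

if __name__ == "__main__":
    for _ in range(3):
        print(pick_one("Today (is|will be) a (great|good|nice) day."))
```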
Smells like a recursive algorithm
Edit: misread and thought you wanted all possibilities
#!/usr/bin/python
import re, random

def expand(line, results):
    # find the first parenthetical, if any
    result = re.search(r'\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        for v in variants:
            expand(line[:result.start()] + v + line[result.end():], results)
    else:
        results.append(line)
    return results

line = "Today (is|will be) a (great|good|nice) day."
results = expand(line, [])
# choose a random possibility at the end:
print(random.choice(results))
A similar construct that produces a single random line:
def expand_rnd(line):
    result = re.search(r'\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        choice = random.choice(variants)
        return expand_rnd(
            line[:result.start()] + choice + line[result.end():])
    else:
        return line
Will fail however on nested constructs
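One possible way around the nesting limitation is to resolve the innermost groups first and repeat until no parentheses remain. The sketch below assumes nested groups such as `(is|(will|shall) be)` should simply be expanded inside-out; `expand_nested` and its `rng` parameter are hypothetical.

```python
import random
import re

# Matches only groups that contain no further parentheses, i.e. the innermost ones.
INNERMOST = re.compile(r"\(([^()]*)\)")

def expand_nested(line, rng=random):
    """Resolve innermost (a|b) groups repeatedly until none are left."""
    while True:
        line, count = INNERMOST.subn(
            lambda m: rng.choice(m.group(1).split("|")), line
        )
        if count == 0:
            return line

if __name__ == "__main__":
    print(expand_nested("Today (is|(will|shall) be) a (great|good) day."))
```

Inner choices are made before the outer branch is selected, so the variants are not picked with equal probability, and unbalanced parentheses are left in place.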