Help converting to subroutine - perl

I have tried to convert my code into a series of subroutines to make it more modular. The conditional statements in the code below is what I can't incorporate into the subroutine.
next unless ( $sentblock =~ /\[sent. \d+ len. \d+\]: \[.+\]/ ); #1#
( $sentence, $sentencenumber ) = &sentence_sentnum_chptnum($sentblock); #SUBROUTINE
if ( $sentence =~ /\~\s(\d*F*[\.I_]\w+)\s/ ) { #2#
$chapternumber = $1;
$chapternumber =~ tr/./_/;
}
next
unless ( $sentence =~ /\b\Q$search_key\E/i #3#
&& $sentence =~ /\b\Q$addkey0\E/i
&& $sentence =~ /\b\Q$addkey1\E/i );
next
if ( defined($exc0) #4#
&& length($exc0)
&& $sentence =~ /\b\Q$exc0\E\b/i );
next
if ( defined($exc1) #5#
&& length($exc1)
&& $sentence =~ /\b\Q$exc1\E\b/i );
The subroutine so far:
sub sentence_sentnum_chptnum {
my $subsentblock = shift;
my ( $subsentence, $subsentencenumber );
return unless ( $subsentblock =~ /\[sent. (\d+) len. \d+\]: \[(.+)\]/ ); #DIDN'T replace the need to put one in the main script
$subsentencenumber = $1;
$subsentence = $2;
$subsentence =~ s/, / /g;
return ( $subsentence, $subsentencenumber );
}
It works as is, but if I try putting the other conditional statements in: I get errors saying $sentence is uninitialized later in the code. Example: If I try to include the check of $addkey using the same condition, but just swapping next for return I get an error that $sentence is uninitialized in the line: if ( $sentence =~ /\~\s(\d*F*[\.I_]\w+)\s/ ) { And likewise if I put any of those conditions into the subroutine.
Main Question: How can I:
(1) get rid of next unless ( $sentblock =~ /\[sent. \d+ len. \d+\]: \[.+\]/ ); (it's in the subroutine too)
(2) Include: if ( $sentence =~ /\~\s(\d*F*[\.I_]\w+)\s/ ) & all 3 next statements
(3) Since it's included, also return $chapternumber
Without affecting my code?
General Best Practice Question: If I have variables defined at the top of my code (from an HTML form) is it better practice to localize them each time in every subroutine, or just not pass anything into the subroutine, and use the value assigned at the beginning of the code? (Ex. $search_key, $addkey and $exc)?
Test Case I have made a test case, however it is pretty long, so I didn't include it. If you need one, it is very similar to: http://perlmonks.org/?node_id=912276 just find where the subroutine takes over and delete that part... It's right after foreach my $sentblock (#parsed).
Note: The test case does not include addkey or exc, and nothing will match the chapternumber (put '~ 5.5' in front of one sentence to include it)
I've tried checking the returned $sentence in the main program. This eliminates the error, but there are no matches for the rest of the program (ie. The end result of the search engine is 0 results).
Thanks, let me know if anything is unclear.

How much do you want to break things down? It's hard to see what the "best" or "right" way to split things up is without more code.
In general, if you go through your code and add comments describing what each block of code does, you could just as readily replace each commented block with a sub that has a name that recaps the sentence:
# Is this a sentence block?
next unless ( $sent_block =~ /\[sent. \d+ len. \d+\]: \[.+\]/ );
#1#
my ( $sentence, $sentence_number ) = parse_sentence_block($sent_block);
# Get chapter info if present
if ( $sentence =~ /\~\s(\d*F*[\.I_]\w+)\s/ ) { #2#
$chapter_number = $1;
$chapter_number =~ tr/./_/;
}
# Skip if key found
next
unless ( $sentence =~ /\b\Q$search_key\E/i #3#
&& $sentence =~ /\b\Q$addkey0\E/i
&& $sentence =~ /\b\Q$addkey1\E/i );
# skip if excrescence 0 (or whatever exc is short for)
next
if ( defined($exc0) #4#
&& length($exc0)
&& $sentence =~ /\b\Q$exc0\E\b/i );
# skip if excrescence 1.
next
if ( defined($exc1) #5#
&& length($exc1)
&& $sentence =~ /\b\Q$exc1\E\b/i );
Now take these comments and make them into subs:
next unless is_sentence_block( $sent_block );
my( $sentence, $sentence_number ) = parse_sentence_block($sent_block);
# Maybe update the chapter number
my $new_chapter_number = get_chapter_number( $sentence );
$chapter_number = $new_chapter_number if defined $new_chapter_number;
next unless have_all_keys( $sentence => $search_key, $add_key0, $add_key1 );
next if have_excrescence( $exc0 );
next if have_excrescence( $exc1 );
sub is_sentence_block {
my $block = shift;
return $sent_block =~ /\[sent. \d+ len. \d+\]: \[.+\]/ );
}
sub get_chapter_number {
my $sentence = shift;
return unless $sentence =~ /\~\s(\d*F*[\.I_]\w+)\s/;
return $1;
}
sub have_all_keys {
my $sentence = shift;
my #keys = #_;
for my $key ( #keys ) {
return unless $sentence =~ /\b\Q$key1\E/i;
}
return 1
}
sub have_excrescence {
my $sentence = shift;
my $exc = shift;
return 0 unless defined($exc);
return 0 unless length($exc)
return 0 unless $sentence =~ /\b\Q$exc\E\b/i );
return 1;
}

Try this approach (some of this code may look familiar to you ;-) ):
sub extractSentenceAndPositions {
my $sentenceBlock = shift;
my ($sentence, $sentenceNumber, $chapterNumber) = ("", "", "");
if ($sentenceBlock =~ /\[sent. (\d+) len. \d+\]: \[(.+)\]/) {
$sentenceNumber = $1;
$sentence = $2;
$sentence =~ s/, / /g;
if ($sentence =~ /\~\s(\d*F*[\.I_]\w+)\s/) { #2#
$chapterNumber = $1;
$chapterNumber =~ tr/./_/;
}
# Turning the original 'next-unless' chain into a conditional
# which zeroes out the return values instead
if ( !( $sentence =~ /\b\Q$search_key\E/i #3#
&& $sentence =~ /\b\Q$addkey0\E/i
&& $sentence =~ /\b\Q$addkey1\E/i )
||
!( defined($exc0) #4#
&& length($exc0)
&& $sentence =~ /\b\Q$exc0\E\b/i )
||
!( defined($exc1) #5#
&& length($exc1)
&& $sentence =~ /\b\Q$exc1\E\b/i )
) {
($sentence, $sentenceNumber, $chapterNumber) = ("", "", "");
}
}
return ($sentence, $sentenceNumber, $chapterNumber);
}
Then, replace your first listing with...
($sentence, $sentenceNumber, $chapterNumber) = extractSentenceAndPositions($sentblock);
next if (!$sentence || !$sentenceNumber || !$chapterNumber);
Regarding your best practices question, I would say for this use case (cgi vars and the like), where those values are almost certainly not going to change, I'd refer to them directly. The basic concept I generally follow is to scrub them once at the beginning of the run (by which I mean sanitize away any SQL injections, XSS, XSRF, shell injections, or other such nastiness in the values) and from then on treat them as read-only globals. I've heard other opinions on the subject, but that's what I usually do.
As far as checking the returned $sentence in the main program somehow destroying all the other matches, I'm not sure how that would happen unless there's something else going on. I've used this approach (next or last based on returned values) in numerous scripts, and there's nothing inherently destructive about it.

Related

(empty?) return of readline is not caught by control structure

I have a multidimensional hash containing opened file handles on SEEK_END with the intention to always read the latest line without getting to much I/O (what I would get with tail).
I'm now going through all of these handles with a for loop and calling readline on them.
It looks like this:
for $outer ( keys %config ) {
my $line = readline($config{$outer}{"filehandle"});
if (not defined $line || $line eq '' ){
next;
}
else{
print "\nLine: -->".$line."<--\n";
$line =~ m/(:)(\d?\.?\d\d?\d?\d?\d?)/;
$wert = $2;
}
}
If new content is written into these files, my script reads it and behaves just as planned.
The problem is that readline will usually return nothing because there is currently nothing at the end of the file, but my if doesn't seem to identify the empty return of readline as undef as empty -- it just prints nothing, which is would be right because there is nothing in this string, but I don't want it to be processed at all.
This is an operator precedence problem. You have used a mixture of low-priority not with high-priority || so your condition
not defined $line || $line eq ''
is parsed as
not( defined($line) || ($line eq '') )
which wrongly negates the $line eq '' part
It is usually safer to use the lower-priority and and or, and not over &&, ||, and !, but a mixture is a very bad idea
You can write either
if (not defined $line or $line eq '' ) {
...
}
or
if ( ! defined $line || $line eq '' ) {
...
}
then all will be well
I would prefer to see it written like this, because it loses the unnecessary else clause and the next statements, and discards lines that contain just space characters
Also note that I iterate over the values of the hash. Using the keys is wasteful when they are used only to access the values. You will probably be able to think of a better name for the loop control variable $item
And there's often no need for the concatenation operator when Perl will interpolate variables directly into double-quoted strings
for my $item ( values %config ) {
my $line = readline( $item->{filehandle} );
if ( defined $line and $line =~ /\S/ ) {
print "\nLine: -->$line<--\n";
$line =~ m/(:)(\d?\.?\d\d?\d?\d?\d?)/;
$wert = $2;
}
}
I am a proponent of simple conditions. If it is at all feasible, I avoid compound logical conditions with an associated else so that I don't have to think about set complements etc. So, I would have listed the skip conditions individually:
next unless defined $line;
next unless $line =~ /\S/;
This kind of thing also tends to keep the number of nested blocks lower which I find improves the readability of code.

Save the result of an expression to be used in a separate block

As you can see, I want to store the result of the first substitution in $enkel. I use this $enkel in the print of if.
But when I want to use this $enkel in the elsif print, it has no value. I actually want to use both $dubbel and $enkel in the elsif print.
Is there a way so that Perl stores this permanently in $dubbel, so it can be used in other prints?
if ($inputwoord =~ /((aa|uu|ee|oo)[^aeiour])$/) {
($enkel = $inputwoord) =~ s/([aueo])\1/$1/g;
print "$enkel$buig\n$inputwoord$gen\n$enkel$comp\n$enkel$compe\n$inputwoord$sup\n$inputwoord$supe\n";
}
elsif ($inputwoord =~ /[^aeiou][aeiou]([pktgnmlf])$/) {
($dubbel = $inputwoord) =~ s /([pktgnmlf]$)/$1$1/g;
print "$dubbel$buig\n$inputwoord$gen\n$dubbel$comp\n$dubbel$compe\n$inputwoord$sup\n$inputwoord$supe\n";
} # consonantgeminatie
You must calculate the value of $enkel outside the if statement if you want to use it in both the if and the elsif clauses. Like this
($enkel = $inputwoord) =~ s/([aueo])\1/$1/g;
($dubbel = $inputwoord) =~ s/([pktgnmlf]$)/$1$1/g;
if ($inputwoord =~ /((aa|uu|ee|oo)[^aeiour])$/) {
print "$enkel$buig\n$inputwoord$gen\n$enkel$comp\n$enkel$compe\n$inputwoord$sup\n$inputwoord$supe\n";
}
elsif ($inputwoord =~ /[^aeiou][aeiou]([pktgnmlf])$/) {
print "$dubbel$buig\n$inputwoord$gen\n$dubbel$comp\n$dubbel$compe\n$inputwoord$sup\n$inputwoord$supe\n";
} # consonantgeminatie

Finding substring in a string using perl

Below is the code
$string = "any text
Affected area :
menu
Feature to test :
diagnostics
";
$string1=rindex($string,':');
print "$string1\n";
$string2=substr($string,$string1+1);
print "$string2";
I can able to find the string after "Feature to test" using above code but i want to find the string which is for Affected area eg. menu.Please help
I take that this is a test program of some sort. Would doing this this way make more sense?
use strict;
use warnings;
my $feature_to_test;
my $affected_area;
while ( my $line <DATA> ) {
chomp $line;
if ( $line =~ /^Affected area\s*:/i ) {
for (;;) { #Loop forever (until I tell you to stop i.e.)
my $line = <DATA>;
if ( $line !~ /^\s*$/ ) {
$affected_area = $line;
last;
}
}
}
if ( $line =~ /^Affected area\s*:/i ) {
for (;;) { #Loop forever (until I tell you to stop i.e.)
my $line = <DATA>;
if ( $line !~ /^\s*$/ ) {
$affected_area = $line;
last;
}
}
}
if ( $line =~ /^Feature to test\s*:/i ) {
for (;;) { #Loop forever (until I tell you to stop i.e.)
my $line = <DATA>;
if ( $line !~ /^\s*$/ ) {
$feature_to_test = $line;
last;
}
}
}
}
else {
print qq("Not a special line: "$line"\n);
}
__DATA__
any text
Affected area :
menu
Feature to test :
diagnostics
The advantage of this method is that it allows you to test line-by-line instead of trying to parse the entire record at once. Plus, it better emulates the way a file would be read in.
It's also possible to use split to split your long text into an array that you could also go through line by line too:
use strict;
use warnings;
my $string = "any text
Affected area :
menu
Feature to test :
diagnostics
";
my #string_list = split /\n/, $string; #Now, this is split line by line
for my $line ( #string_list ) {
print "same logic as above...\n";
}
Doing this as a loop and reading in each line makes the logic a bit cleaner and easier to understand. It's probably not as efficient, but even reading in a multi-million line file in Perl doesn't take more than a few seconds even on an econo-box PC.
Perhaps a regex using a positive lookbehind and a capture will be helpful here:
use strict;
use warnings;
my $string = "any text
Affected area :
menu
Feature to test :
diagnostics
";
my ($area) = $string =~ /(?<=area :\n)(.+)/;
print $area;
Output:
menu

Returning 2 arrays from subroutine depending on stop list

This has been moved to a test case here.
RE-DONE:
I want to return arrays (must be references) from 2 subroutines, however the regex used as a conditional statement isn't working as I'd hoped. I've tried doing it with one, but I figure this will be easier.
To be clear, my goal is to have an array of matches sorted (#all_matches), and then add on another array (#all_pronoun_matches) sorted the same way but added at the end.
This is the #pronoun_matches subroutine:
my ($line, $verbform, $chapternumber, $sentencenumber, $sentence) = #_;
my #matches;
my #pronoun_matches;
return unless ($line =~ /(\w+)\((\w+)\-\d+\,\s(\w+)\-\d+\)/); #2nd repeat check
$grammar_relation = $1;
$argument1 = $2;
$argument2 = $3;
return if (($argument1 =~ /^$argument2/i)||($argument2 =~ /^$argument1/i));
foreach my $pronoun (#stopListNoun)
{
if ((lc $pronoun eq lc $argument1) || (lc $pronoun eq lc $argument2))
{
push (#pronoun_matches, $chapternumber, $sentencenumber, $sentence, $grammar_relation, $argument2, $argument1) if ($argument2 =~ /$verbform/i);
push (#pronoun_matches, $chapternumber, $sentencenumber, $sentence, $grammar_relation, $argument1, $argument2) if ($argument1 =~ /$verbform/i);
}
else
{return}
}
return (\#pronoun_matches);
The #matches has a very similar subroutine except this:
foreach my $pronoun (#stopListNoun) #Just a list of words
{
return if ((lc $pronoun eq lc $argument1) || (lc $pronoun eq lc $argument2));
}
push (#matches, $chapternumber, $sentencenumber, $sentence, $grammar_relation, $argument2, $argument1) if ($argument2 =~ /$verbform/i); ##USED TO BE 'eq', but that prevented protective from showing
push (#matches, $chapternumber, $sentencenumber, $sentence, $grammar_relation, $argument1, $argument2) if ($argument1 =~ /$verbform/i);
return \#matches;
This is called by:
my $matches;
my $pronoun_matches;
$matches = &dependency_checks($lines[$l], $verbform, $chapternumber, $sentencenumber, $sentence);
$pronoun_matches = &pronoun_dependency_checks($lines[$l], $verbform, $chapternumber, $sentencenumber, $sentence);
push #all_matches, $matches if ($matches);
push #all_pronoun_matches, $pronoun_matches if ($pronoun_matches);
To send to the print section after being sorted using hashes, I'd like to use:
#all_matches = (#all_matches, #all_pronoun_matches); However, #all_pronoun_matches has 0 matches (or they are being filtered somewhere).
Question
Why does #all_pronoun_matches have uninitialized values in it??
After some testing, I've found that the match never gets passed the conditional statement, but it's the same as the one in the #matches subroutine!
Originally, I had just wanted to remove the pronouns and it worked fine, so I know the condition works:
foreach my $pronoun (#stopListNoun)
{
return if ((lc $pronoun eq lc $argument1) || (lc $pronoun eq lc $argument2));
}
I've tried using an if-else in the foreach and combining the subroutines, but then all the matches (including pronouns) went into #all_matches despite being called correctly (this method was posted here before).
Let me know if anything is unclear about my intent or the problem.
#all_matches = #matches, #all_pronoun_matches;
should be
#all_matches = ( #matches, #all_pronoun_matches );
, has lower precedence than =
If you had warnings enabled, you would have gotten a Useless use of a variable in void context warning alerting you that #all_pronoun_matches didn't become part of the assignment.

How can I skip some block content while reading in Perl

I plan to skip the block content which include the start line of "MaterializeU4()" with the subroutin() read_block below. But failed.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push #lines, $line;
}
}
return \#lines if #lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I print the value of index($line, "MaterializeU4"), it output 25.
Then I updated the code as below
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25)
Now it works.
Any comments are welcome about my practice.
Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push #lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
First, using a regex instead of index is probably better since you can tune it to the exact format of status string if you may decide to be stricter than just "substring exists"
I would suggest as one solution adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (Materialize4U) block!!!
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
if ($line =~ /^status.*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my #all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (#all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift #all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];
On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line =~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.