While loop will not find regex when checking for false value - perl

I'm working on a script that gathers various specs of my company's computers and stores them in a database. One section of the script searches through a file for the serial number of the credit card reader attached to the computer.
What I've found is that if I do not include a true/false check in the while condition, the regex will be matched, and I can break from there. That's fine, and it's what I'm going with. I do not understand though, why, when I change the while condition to include the true/false check, the regex is never matched.
Here is what I've found works
use v5.10;
use warnings;
use strict;
my $test;
open(my $fh, "<", "\\\\192.168.0.132\\storeman\\log\\2018-06-22.sls");
while(my $line = <$fh>){
if($line =~ /unitid/i){
$line =~ /(\d{3}\-\d{3}\-\d{3})/;
$test = $1;
last;
}
};
say $test // "Nothing found";
On the other hand though, the following does not work.
use v5.10;
use warnings;
use strict;
my $test;
open(my $fh, "<", "\\\\192.168.0.132\\storeman\\log\\2018-06-22.sls");
while(my $line = <$fh> && !$test){
if($line =~ /unitid/i){
$line =~ /(\d{3}\-\d{3}\-\d{3})/;
$test = $1;
}
};
say $test // "Nothing found";
Note that in each case, $test is undefined until the regex finds a match. Also, even when initializing $test to an empty string, and trying && $test ne "", the regex still never matches.
I've debugged to the best of my abilities, and all I know is that when using && !$test, the if($line =~ /unitid/i) is never found to be true.
What is it about the && !$test that could cause the regex to never match, and thus the loop to never break, but instead run through the whole file?

Consider what <$fh> && !$test returns. Because && binds more tightly than =, the condition is parsed as my $line = (<$fh> && !$test). The && operator returns the last operand it evaluated, in this case !$test. Which, since $test is undefined, is true, so 1.
Thus, $line would now contain 1. Which doesn't match your regex :)
If you wanted to, you could write while( !$test && (my $line = <$fh>) ){,
which would then behave as you seemed to expect. You could even do
while( my $line = !$test && <$fh> ) - but I would not recommend that, as it is a bit confusing ( as you have noticed ;) ).
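To see the precedence effect in isolation, here is a minimal sketch. It assumes a made-up log held in an in-memory filehandle (the contents of $log are hypothetical, not your real .sls file), and it adds an explicit defined() around the read, which the one-liner above leaves out.

```perl
use strict;
use warnings;

# Hypothetical log contents standing in for the real .sls file.
my $log = "startup\nUnitID: 123-456-789\nshutdown\n";

# Returns what ends up in $line with the original, unparenthesized condition.
sub first_value_without_parens {
    open(my $fh, '<', \$log) or die "Cannot open in-memory file: $!";
    my $test;
    my $line = <$fh> && !$test;    # parsed as $line = (<$fh> && !$test)
    return $line;
}

# Same loop as the question, with the assignment parenthesized.
sub find_serial {
    open(my $fh, '<', \$log) or die "Cannot open in-memory file: $!";
    my $test;
    while (!$test && defined(my $line = <$fh>)) {
        if ($line =~ /unitid/i && $line =~ /(\d{3}-\d{3}-\d{3})/) {
            $test = $1;
        }
    }
    return $test;
}

print "without parentheses, \$line is: ", first_value_without_parens(), "\n";
print "with parentheses, found: ", find_serial() // 'Nothing found', "\n";
```

The first line printed shows 1 rather than a line of the file, which is why /unitid/i can never match in the original loop.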

Related

perl matching full string

I am very new to Perl. Trying to grep the full line for matched pattern of the string. Seems like it is not able to search for full string. Any suggestion?
use strict;
use warnings;
my $prev;
#my $pattern = ":E: (Sub level extra file/dir checks):";
open(INPUTFILE, "<log_be_sum2.txt") or die "$!";
open(OUTPUTFILE, ">>extract.txt") or die "$!";
while (<INPUTFILE>){
if ($_ =~ /^:E: (Sub level extra file/dir checks):/){
print OUTPUTFILE $prev, $_;
}
$prev = $_;
}
If you want to match a literal / in a regular expression, it either needs to be escaped with a backslash, or you need to use a different character as the regexp quote character (! in the below example). The parentheses also have to be escaped so they're not treated as a capturing group:
use strict;
use warnings;
my $prev;
#my $pattern = ":E: (Sub level extra file/dir checks):";
open(INPUTFILE, "<", "log_be_sum2.txt") or die "$!";
open(OUTPUTFILE, ">>", "extract.txt") or die "$!";
while (<INPUTFILE>){
if ($_ =~ m!^:E: \(Sub level extra file/dir checks\):!){
print OUTPUTFILE $prev, $_;
}
$prev = $_;
}
Note the change to the three-argument version of open, which is highly recommended. You might consider lexical filehandles too. And good for you for using warnings and strict mode! You don't see that enough in new users of the language.
It's good that you are using strict and warnings but you should pay attention to the error/warning messages, and post them along with the question if you don't understand them. My version of Perl fails with an error about an unmatched (. The reason this particular error is thrown is because Perl thinks your regexp is complete when it sees the "/" in "file/dir".
When you have special characters, a good practice is to use quotemeta. I noticed you have a commented line with a variable assignment to pattern. You could uncomment that and use it like this:
...
my $pattern = quotemeta ":E: (Sub level extra file/dir checks):";
...
if ($_ =~ /^$pattern/){
...
}
...
}
But there is also a shortcut documented in perlre: the \Q and \E escape sequences. You can use it like $_ =~ /^\Q$pattern\E/. You can still use it and avoid the variable assignment, but in your case you will need to use a different character for the quote-like operator, since your pattern contains a literal /. I tend to prefer m{}, but it's really up to you as long as it's not /.
use strict;
use warnings;
my $prev = q{}; # NOTE: see NOTE below
open INPUTFILE, "<", "log_be_sum2.txt" or die "$!";
open OUTPUTFILE, ">>", "extract.txt" or die "$!";
while (<INPUTFILE>){
if ($_ =~ m{^\Q:E: (Sub level extra file/dir checks):\E}){ # no backslashes needed: \Q...\E quotes the parentheses itself
print OUTPUTFILE $prev, $_;
}
$prev = $_;
}
*NOTE - I seeded $prev with an empty string, because otherwise if your match is on the first line, you will try to print an undefined value, which will result in a warning.
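As a quick check of the quoting approach, the sketch below matches the literal text from the question against two made-up sample lines. The helper name starts_with_literal is hypothetical, and the sample lines are assumptions about what the log might contain.

```perl
use strict;
use warnings;

# The literal text we want to find at the start of a line.
my $literal = ':E: (Sub level extra file/dir checks):';

# Returns 1 if $line starts with $literal, treating every character literally.
sub starts_with_literal {
    my ($line) = @_;
    return $line =~ /^\Q$literal\E/ ? 1 : 0;
}

# quotemeta shows the escaped form that \Q...\E produces internally.
print quotemeta($literal), "\n";
print starts_with_literal(":E: (Sub level extra file/dir checks): foo\n"), "\n";
print starts_with_literal(":E: Sub level extra file/dir checks: foo\n"), "\n";
```

Because the slash lives inside the variable rather than in the pattern source, the ordinary / delimiter is fine here.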

Parsing string in multiline data with positive lookbehind

I am trying to parse data like:
header1
-------
var1 0
var2 5
var3 9
var6 1
header2
-------
var1 -3
var3 5
var5 0
Now I want to get e.g. var3 for header2. What's the best way to do this?
So far I was parsing my files line-by-line via
open(FILE,"< $file");
while (my $line = <FILE>){
# do stuff
}
but I guess it's not possible to handle multiline parsing properly.
Now I am thinking to parse the file at once but wasn't successful so far...
my @Input;
open(FILE,"< $file");
while (<FILE>){ @Input = <FILE>; }
if (@Input =~ /header2/){
#...
}
The easier way to handle this is "paragraph mode".
local $/ = "";
while (<>) {
my ($header, $body) = /^([^\n]*)\n-+\n(.*)/s
or die("Bad data");
my @data = map [ split ], split /\n/, $body;
# ... Do something with $header and @data ...
}
The same can be achieved without messing with $/ as follows:
my @buf;
while (1) {
    my $line = <>;
    $line =~ s/\s+\z// if defined($line);
    if (!length($line)) {
        if (@buf) {
            my $header = shift(@buf);
            shift(@buf);
            my @data = map [ split ], splice(@buf);
            # ... Do something with $header and @data ...
        }
        last if !defined($line);
        next;
    }
    push @buf, $line;
}
(In fact, the second snippet includes a couple of small improvements over the first.)
Quick comments on your attempt:
The while loop is useless because @Input = <FILE> places the remaining lines of the file in @Input.
@Input =~ /header2/ matches header2 against the stringification of the array, which is the stringification of the number of elements in @Input. If you want to check whether an element of @Input contains header2, you will need to loop over the elements of @Input and check them individually.
while (<FILE>){ @Input = <FILE>; }
This doesn't make much sense. "While you can read a record from FILE, read all of the data on FILE into @Input". I think what you actually want is just:
my @Input = <FILE>;
if (@Input =~ /header2/){
This is quite strange too. The binding operator (=~) expects scalar operands, so it evaluates both operands in scalar context. That means @Input will be evaluated as the number of elements in @Input. That's an integer and will never match "header2".
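A small sketch of the difference, using a hypothetical @Input filled by hand instead of read from FILE: an array in scalar context gives its element count, while grep checks each element against the pattern.

```perl
use strict;
use warnings;

# Hypothetical lines, as if read with: my @Input = <FILE>;
my @Input = ("header1\n", "-------\n", "var1 0\n",
             "header2\n", "-------\n", "var3 5\n");

# An array in scalar context yields its element count.
my $count = @Input;

# grep in scalar context counts the elements that match the pattern.
my $has_header2 = grep { /header2/ } @Input;

print "elements: $count\n";
print "lines mentioning header2: $has_header2\n";
```

So if (grep { /header2/ } @Input) is the test the question was reaching for.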
A couple of approaches. Firstly a regex approach.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $file = 'file';
open my $fh, '<', $file or die $!;
my $data = join '', <$fh>;
if ($data =~ /header2.+var3 (.+?)\n/s) {
say $1;
} else {
say 'Not found';
}
The key to this is the /s on the m// operator. Without it, the two dots in the regex won't match newlines.
The other approach is more of a line by line parser.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $file = 'file';
open my $fh, '<', $file or die $!;
my $section = '';
while (<$fh>) {
chomp;
# if the line is all word characters,
# then we've got a section header.
if ($_ !~ /\W/) {
$section = $_;
next;
}
my ($key, $val) = split;
if ($section eq 'header2' and $key eq 'var3') {
say $val;
last;
}
}
We read the file a line at a time and make a note of the section headers. For data lines, we split on whitespace and check to see if we're in the right section and have the right key.
In both cases, I've switched to using a more standard approach (lexical filehandles, 3-arg open(), or die $!) for opening the file.
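If you need more than one lookup, a variation on the line-by-line approach is to build a hash of hashes first. This is only a sketch: parse_sections is a hypothetical helper, the data is the sample from the question held in a string, and it assumes every header line is immediately followed by a row of dashes.

```perl
use strict;
use warnings;

# Sample data copied from the question.
my $text = <<'END';
header1
-------
var1 0
var2 5
var3 9
var6 1
header2
-------
var1 -3
var3 5
var5 0
END

# Build { header => { key => value } } from the text.
sub parse_sections {
    my ($input) = @_;
    my (%sections, $current);
    my @lines = split /\n/, $input;
    for my $i (0 .. $#lines) {
        my $line = $lines[$i];
        next if $line =~ /^-+$/;                  # skip the underline rows
        if ($i < $#lines && $lines[$i + 1] =~ /^-+$/) {
            $current = $line;                     # a header is followed by dashes
            next;
        }
        my ($key, $val) = split ' ', $line;
        $sections{$current}{$key} = $val if defined $current && defined $val;
    }
    return \%sections;
}

my $sections = parse_sections($text);
print $sections->{header2}{var3}, "\n";
```

For the sample data this prints 5, the var3 value under header2.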

Perl Programming

I have these questions. But I don't know how to prove it or if I'm right. Are my answers right?
Find all complete lines of a file which contain only a row of any number of the letter x
x*
^x+$
^x*$ <-This one
^xxxxx$
Find all complete lines of a file which contain a row consisting only the letter x but ignoring any leading or trailing space on the line.
^\s* x+\s*$ <--This one
^\s(x*)\s$
\s* x+\s*
^\s+x+\s+$
I tried to use this
use strict;
use warnings;
my $filename = 'data.txt';
open( my $fh, '<:encoding(UTF-8)', $filename ) or die "Could not open file '$filename' $!";
while ( my $row = <$fh> ) {
chomp $row;
print "$row\n";
}
I tried this code but I got error at (^
use strict;
use warnings;
my $filename = 'data.txt';
open( my $fh, '<:encoding(UTF-8)', $filename ) or die "Could not open file '$filename' $!";
while ( my $row = <$fh> ) {
if ( ^x*$ ) {
print "This is";
}
}
You're talking about regular expressions and how to use them in Perl. Your question seems to be whether the answers you picked to homework are correct.
The code you've added should do what you want, but it has syntax errors.
if ( ^x*$ ) {
print "This is";
}
Your pattern is correct, but you don't know how to use a regular expression in Perl. You're missing the actual operator to tell Perl that you want a regular expression.
The short form is this, where I've highlighted the important part with #
if ( /^x*$/ ) {
     #    #
The slashes // tell Perl that it should match a pattern. The long form of it is:
if ( $_ =~ m/^x*$/ ) {
     ## ## ##    #
$_ is the variable that you are matching against a pattern. The =~ is the matching operator. The m// constructs a pattern to match with. If you use // you can leave out the m, but it's clearer to put it in.
The $_ is called the topic. It's like a default variable that stuff goes into in Perl if you don't specify another variable.
while ( <$fh> ) {
print $_ if $_ =~ m/foo/; # print all lines that contain foo
}
This code can be written without $_, because a lot of commands in Perl assume that you mean $_ when you don't explicitly name a variable.
while ( <$fh> ) { # puts each line in $_
print if m/foo/; # prints $_ if $_ contains foo
}
Your code looks like you wanted to do that, but in fact you have a $row in your loop. That's good, because it is more explicit. That means it's easier to read. So what you need to do for your match is:
while ( my $row = <$fh> ) {
if ( $row =~ m/^x*$/ ) {
print "This is";
}
}
Now you will iterate over each line of the file behind the $fh filehandle, and check if it matches the pattern ^x*$. If it does, you print "This is". That doesn't sound very useful.
Consider this example, where I am using the __DATA__ section instead of a file.
use strict;
use warnings;
while ( my $row = <DATA> ) {
if ( $row =~ m/^x*$/ ) {
print "This is";
}
}
__DATA__
foo
xxx
x
xxxxx
bar
This will print:
This isThis isThis isThis is
It really does not seem to be very useful. It would make more sense to include the line that matched.
if ( $row =~ m/^x*$/ ) {
print "match: $row";
}
Now we get this:
match: xxx
match:
match: x
match: xxxxx
That's almost what we expected. It matches a single x, and a bunch of xs. It did not match foo or bar. But it does match an empty line.
That's because you picked the wrong pattern.
The * quantifier means match as many as possible, possibly none.
The + quantifier means match as many as possible, at least one.
So your pattern should be the one with +, or it will match if there is nothing, because start of the line, no x, end of the line matches an empty line.
While you're at it, you could also rename your variable. Unless you're dealing with CSV, which has rows of data, you have lines, not rows. So $line would be a better name for your variable. Giving variables good, descriptive names is very important because it makes it easier to understand your program.
use strict;
use warnings;
my $filename = 'data.txt';
open( my $fh, '<:encoding(UTF-8)', $filename )
or die "Could not open file '$filename' $!";
while ( my $line = <$fh> ) {
if ( $line =~ m/^x+$/ ) {
print "match: $line";
}
}
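The same way of testing works for your second question. The sketch below assumes the intended pattern is ^\s*x+\s*$ (without the stray space shown in the question) and runs it against a handful of made-up strings; is_x_row is a hypothetical helper name.

```perl
use strict;
use warnings;

# Returns 1 if the line is a run of x with optional surrounding whitespace.
sub is_x_row {
    my ($line) = @_;
    return $line =~ /^\s*x+\s*$/ ? 1 : 0;
}

# Try the pattern on a few sample strings and report the result for each.
for my $line ("xxx", "  xx  ", "", "   ", "x y", "foo") {
    printf "%-8s %s\n", "'$line'", is_x_row($line) ? "match" : "no match";
}
```

Only the first two strings match. The empty and whitespace-only strings are rejected because x+ needs at least one x.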

perl script miscounting because of empty lines

The script below basically captures the second column and sums its values. The only minor issue I have is that the file has empty lines at the end (it's how the values are being exported), and because of these empty lines the script is miscounting. Any ideas please? Thanks.
my $sum_column_b = 0;
open my $file, "<", "file_to_count.txt" or die($!);
while( my $line = <$file>) {
$line =~ m/\s+(\d+)/; #regexpr to catch second column values
$sum_column_b += $1;
}
print $sum_column_b, "\n";
I think the main issue has been established: you are using $1 when it is not conditionally tied to the regex match, which causes you to add values when you should not. This is an alternative solution:
$sum_column_b += $1 if $line =~ m/\s+(\d+)/;
Typically, you should never use $1 unless you check that the regex you expect it to come from succeeded. Use either something like this:
if ($line =~ /(\d+)/) {
$sum += $1;
}
Or use direct assignment to a variable:
my ($num) = $line =~ /(\d+)/;
$sum += $num;
Note that you need to use list context by adding parentheses around the variable, or the regex will simply return 1 for success. Also note that, like Borodin says, this will give an undefined value when the match fails, and you must add code to check for that.
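A minimal sketch of that check, with made-up input lines and a hypothetical helper first_number: the capture is assigned in list context, and the sum only grows when the result is defined.

```perl
use strict;
use warnings;

# Return the first run of digits in $line, or undef when there is none.
sub first_number {
    my ($line) = @_;
    my ($num) = $line =~ /(\d+)/;   # list context: the capture, or an empty list
    return $num;
}

my $sum = 0;
for my $line ("a 10\n", "\n", "b 20\n") {
    my $num = first_number($line);
    next unless defined $num;       # failed match: nothing to add
    $sum += $num;
}
print "$sum\n";
```

This prints 30, because the empty line contributes nothing.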
This can be handy when capturing several values:
my @nums = $line =~ /(\d+)/g;
The main problem is that if the regex does not match, then $1 will hold the value it received in the previous successful match. So every empty line will cause the previous line to be counted again.
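Here is a small sketch of that effect, using a hypothetical four-line export (two data rows followed by two empty lines) instead of the real file. The sub names are made up for the comparison.

```perl
use strict;
use warnings;

# Hypothetical export: two data rows followed by two empty lines.
my @lines = ("a 10\n", "b 20\n", "\n", "\n");

# Unconditional use of $1, as in the question.
sub sum_unchecked {
    my $sum = 0;
    for my $line (@_) {
        $line =~ m/\s+(\d+)/;
        $sum += $1;                 # stale after a failed match
    }
    return $sum;
}

# Only add when the match succeeded.
sub sum_checked {
    my $sum = 0;
    for my $line (@_) {
        $sum += $1 if $line =~ m/\s+(\d+)/;
    }
    return $sum;
}

print "unchecked: ", sum_unchecked(@lines), "\n";
print "checked:   ", sum_checked(@lines), "\n";
```

The unchecked version reports 70 instead of 30, because each empty line adds the 20 left over from the last successful match.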
An improvement would be:
my $sum_column_b = 0;
open my $file, "<", "file_to_count.txt" or die($!);
while( my $line = <$file>) {
next if $line =~ /^\s*$/; # skip "empty" lines
# ... maybe skip other known invalid lines
if ($line =~ m/\s+(\d+)/) { #regexpr to catch second column values
$sum_column_b += $1;
} else {
warn "problematic line '$line'\n"; # report invalid lines
}
}
print $sum_column_b, "\n";
The else-block is of course optional, but it can help you notice invalid data.
Try putting this line just after the while line:
next if ( $line =~ /^$/ );
Basically, loop around to the next line if the current line has no content.
#!/usr/bin/perl
use warnings;
use strict;
my $sum_column_b = 0;
open my $file, "<", "file_to_count.txt" or die($!);
while (my $line = <$file>) {
next if $line =~ m/^\s*$/; # skip this line if it is insignificant
if ($line =~ m/\s+(\d+)/) {
$sum_column_b += $1;
}
}
print "$sum_column_b\n";

Perl Remove Stop Words from multiple files

I have read so many forums on how to remove stop words from files. My code removes many other things, but I also want it to remove stop words. This is how far I have got, but I don't know what I am missing. Please advise.
use Lingua::StopWords qw(getStopWords);
my $stopwords = getStopWords('en');
chdir("c:/perl/input");
@files = <*>;
foreach $file (@files)
{
open (input, $file);
while (<input>)
{
open (output,">>c:/perl/normalized/".$file);
chomp;
#####What should I write here to remove the stop words#####
$_ =~s/<[^>]*>//g;
$_ =~ s/\s\.//g;
$_ =~ s/[[:punct:]]\.//g;
if($_ =~ m/(\w{4,})\./)
{
$_ =~ s/\.//g;
}
$_ =~ s/^\.//g;
$_ =~ s/,/' '/g;
$_ =~ s/\(||\)||\\||\/||-||\'//g;
print output "$_\n";
}
}
close (input);
close (output);
The stop words are the keys of %$stopwords which have the value 1, i.e.:
@stopwords = grep { $stopwords->{$_} } (keys %$stopwords);
It might happen to be true that the stop words are just the keys of %$stopwords, but according to the Lingua::StopWords docs you also need to check the value associated with the key.
Once you have the stop words, you can remove them with code like this:
# remove all occurrences of @stopwords from $_
for my $w (@stopwords) {
s/\b\Q$w\E\b//ig;
}
Note the use of \Q...\E to quote any regular expression meta-characters that might appear in the stop word. Even though it is very unlikely that stop words will contains meta-characters, this is a good practice to follow any time you want to represent a literal string in a regular expression.
We also use \b to match a word boundary. This helps ensure that we won't match a stop word that occurs in the middle of another word. Hopefully this will work for you - it depends a lot on what your input text is like - i.e. do you have punctuation characters, etc.
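If looping over every stop word for every line turns out to be slow, one option is to combine the words into a single alternation. The sketch below is only an illustration: it uses a tiny hand-written hash in the shape described above (word => 1) instead of calling Lingua::StopWords, and strip_stop_words is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical stop-word hash in the shape described above:
# a hash reference mapping word => 1.
my $stopwords = { the => 1, a => 1, of => 1, and => 1 };

# Build one alternation, longest words first, each word quoted.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) || $a cmp $b }
    grep { $stopwords->{$_} } keys %$stopwords;
my $stop_re = qr/\b(?:$alternation)\b/i;

# Remove stop words and collapse the whitespace they leave behind.
sub strip_stop_words {
    my ($text) = @_;
    $text =~ s/$stop_re//g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

print strip_stop_words("The cat and the hat of a theme"), "\n";
```

This prints cat hat theme. The word theme survives because \b does not match between "the" and "me".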
# Always use these in your Perl programs.
use strict;
use warnings;
use File::Basename qw(basename);
use Lingua::StopWords qw(getStopWords);
# It's often better to build scripts that take their input
# and output locations as command-line arguments rather than
# being hard-coded in the program.
my $input_dir = shift @ARGV;
my $output_dir = shift @ARGV;
my @input_files = glob "$input_dir/*";
# Convert the hash ref of stop words to a regular array.
# Also quote any regex characters in the stop words.
my @stop_words = map quotemeta, keys %{getStopWords('en')};
for my $infile (@input_files){
# Open both input and output files at the outset.
# Your posted code reopened the output file for each line of input.
my $fname = basename $infile;
my $outfile = "$output_dir/$fname";
open(my $fh_in, '<', $infile) or die "$!: $infile";
open(my $fh_out, '>', $outfile) or die "$!: $outfile";
# Process the data: you need to iterate over all stop words
# for each line of input.
while (my $line = <$fh_in>){
$line =~ s/\b$_\b//ig for @stop_words;
print $fh_out $line;
}
# Close the files within the processing loop, not outside of it.
close $fh_in;
close $fh_out;
}