Perl Porter Stemmer

I am working with the Porter stemmer below. The comments say I should change the first line, but to what exactly? I have tried everything and the stemmer still isn't working. What would a good example be?
#!/usr/local/bin/perl -w
#
# Perl implementation of the porter stemming algorithm
# described in the paper: "An algorithm for suffix stripping, M F Porter"
# http://www.muscat.com/~martin/stem.html
#
# Daniel van Balen (vdaniel#ldc.usb.ve)
#
# October-1999
#
# To Use:
#
# Put the line "use porter;" in your code. This will import the subroutine
# porter into your current name space (by default this is Main:: ). Make
# sure this file, "porter.pm" is in your @INC path (it includes the current
# directory).
# Afterwards use by calling "porter(<word>)" where <word> is the word to strip.
# The stripped word will be the returned value.
#
# REMEMBER TO CHANGE THE FIRST LINE TO POINT TO THE PATH TO YOUR PERL
# BINARY
#
The code I am writing is as follows:
use Lingua::StopWords qw(getStopWords);
use Main::porter;
my $stopwords = getStopWords('en');
@stopwords = grep { $stopwords->{$_} } (keys %$stopwords);
chdir("c:/perl/input");
@files = <*>;
foreach $file (@files)
{
open (input, $file);
while (<input>)
{
open (output,">>c:/perl/normalized/".$file);
chomp;
porter<$_>;
for my $stop (@stopwords)
{
s/\b\Q$stop\E\b//ig;
}
$_ =~s/<[^>]*>//g;
$_ =~ s/[[:punct:]]//g;
print output "$_\n";
}
}
close (input);
close (output);
The code gives no errors, but it is not stemming anything!!!

That comment block is full of incorrect advice.
A #! line in a .pm file has no effect. It's a common mistake. The #! line tells Unix which interpreter to run the program with if and only if you run the file as a command line program.
./somefile # uses #! to determine what to run somefile with
/usr/bin/perl somefile # runs somefile with /usr/bin/perl regardless of #!
The #! line does nothing in a module, a .pm file which you use. Perl is already running at that point. The line is nothing but a comment.
The second problem is that your default namespace is main not Main. Casing matters.
Moving on to your code, use Main::porter; should not work. It should be use porter. You should get an error message like Can't locate Main/porter.pm in @INC (@INC contains: ...). If that code runs, perhaps you moved porter.pm into a Main/ directory? Move it out, it will confuse the importing of the porter function.
porter<$_>; says "try to read a line from the filehandle $_ and pass that into porter". $_ isn't a filehandle, it's a line from the file you just opened. You want porter($_) to pass the line into the porter function. If you turn on warnings (add use warnings to the top of your script) Perl will warn you about mistakes like that.
You'll also presumably want to do something with the return value from porter, otherwise it will truly do nothing: my $whatever_porter_returns = porter($_);
Likely one or more of your chdir or opens have silently failed, so your program may have no input. Unfortunately, Perl does not let you know when this happens; you have to check. Normally you add an or die $! after the function to check for the error. This is busy work and often one forgets; instead you can use autodie, which will automatically produce an error if any system call like chdir or open fails.
With that stuff fixed your code should work, or at least produce useful error messages.
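For reference, here is a rough sketch of the loop with those fixes applied. It assumes porter() takes a single word and returns its stem (so the sketch applies it word by word), and it uses autodie and warnings as suggested above:
use strict;
use warnings;
use autodie;
use Lingua::StopWords qw(getStopWords);
use porter;

my $stopwords = getStopWords('en');
my @stopwords = grep { $stopwords->{$_} } keys %$stopwords;

chdir("c:/perl/input");
my @files = <*>;
foreach my $file (@files) {
    open my $input,  '<',  $file;
    open my $output, '>>', "c:/perl/normalized/$file";
    while (<$input>) {
        chomp;
        s/<[^>]*>//g;                  # strip HTML tags
        s/[[:punct:]]//g;              # strip punctuation
        for my $stop (@stopwords) {
            s/\b\Q$stop\E\b//ig;       # remove stop words
        }
        # stem each remaining word and keep the return value
        my $stemmed = join ' ', map { porter($_) } split ' ';
        print {$output} "$stemmed\n";
    }
    close $input;
    close $output;
}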
Finally, there are many stemming modules on CPAN which are likely to be higher quality than the one you've found with documentation and tests and updates and all that. Lingua::Stem and Text::English specifically use the porter algorithm. You might want to give those a shot.
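For instance, Lingua::Stem has a simple functional interface; a sketch (check the module's documentation for the exact calling convention, the arrayref return shown here is how I recall it):
use Lingua::Stem qw(stem);

my @words = qw(running jumped stemming);
my $stems = stem(@words);    # returns a reference to an array of stems
print "@$stems\n";           # e.g. "run jump stem"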

Related

Calling one Perl program from another

I have two Perl files and I want to call one file from another with arguments
First file a.pl
$OUTFILE = "C://programs/perls/$ARGV[0]";
# this should be some out file created inside work like C://programs/perls/abc.log
Second File abc.pl
require "a.pl" "abc.log";
# $OUTFILE is a variable inside a.pl and want to append current file's name as log.
I want it to create an output file whose log name is taken from the current file's name.
One more constraint I have is to use $OUTFILE in both a.pl and abc.pl.
If there is any better approach please suggest.
The require keyword only takes one argument. That's either a file name or a package name. Your line
require "a.pl" "abc.log";
is wrong. It gives a syntax error along the lines of String found where operator expected.
You can require one .pl file from another .pl, but that is very old-fashioned, badly written Perl code.
If neither file defines a package then the code is implicitly placed in the main package. You can declare a package variable in the outside file and use it in the one that is required.
In abc.pl:
use strict;
use warnings;
# declare a package variable
our $OUTFILE = "C://programs/perls/filename";
# load and execute the other program
require 'a.pl';
And in a.pl:
use strict;
use warnings;
# do something with $OUTFILE, like use it to open a file handle
print $OUTFILE;
If you run this, it will print
C://programs/perls/filename
You should convert the perl file you want to call into a perl module:
Hello.pm
#!/usr/bin/perl
package Hello;
use strict;
use warnings;
sub printHello {
print "Hello $_[0]\n"
}
1;
Then you can call it:
test.pl
#!/usr/bin/perl
use strict;
use warnings;
# you have to put the current directory to the module search path
use lib (".");
use Hello;
Hello::printHello("a");
I tested it in git bash on windows, maybe you have to do some modifications in your environment.
In this way you can pass as many arguments as you like, and you don't have to hunt through the file you are calling for variables that may not even be initialized (a less safe approach, I think; e.g. sometimes you will delete something you didn't really want to). The disadvantage is that you need to learn a bit about Perl modules, but I think it is definitely worth it.
A second approach could be to use an exec/system call (you can pass arguments this way too, if forking a child process is acceptable), but that is another story.
I would do this another way. Have the program take the name of the log file as a command-line parameter:
% perl a.pl name-of-log-file
Inside a.pl, open that file to append to it then output whatever you like. Now you can run it from many other sorts of places besides another Perl program.
# a.pl
my $log_file = $ARGV[0] // 'default_log_name';
open my $fh, '>>:utf8', $log_file or die ...;
print { $fh } $stuff_to_output;
But you could also call it from another Perl program. The $^X is the path to the currently running perl, and this uses system in the slightly-safer list form:
system $^X, 'a.pl', $name_of_log_file
How you get something into $name_of_log_file is up to you. In your example you already knew the value in your first program.

Perl open file from command line with wildcard

I am executing my script this way:
./script.pl -f files*
I looked at some other threads (like How can I open a file in Perl using a wildcard in the directory name?)
If I hard-code the file name as written in that thread, I get my desired result. If I take it from the command line, it does not.
My options subroutine should save all the files I get this way in an array.
my @file;
sub Options{
my $i=0;
foreach my $opt (@ARGV){
switch ($opt){
case "-f" {
$i++;
### This part does not work:
@file = glob $ARGV[$i];
print Dumper("$ARGV[$i]"); #$VAR1 = 'files';
print Dumper(@file); #$VAR1 = 'files';
}
}
$i++;
}
}
It seems the execution is interpreted in advance and the wildcard (*) is dropped in the process.
Desired result: All files beginning with files are saved in an array, after execution from the command line.
I hope you get my problem. If not feel free to ask.
Thank you.
Well, first I'd suggest using a module to handle command-line arguments:
Getopt::Long, for example.
But otherwise your problem is simpler: your shell is expanding the files* before perl gets it (the shell glob gets there first).
If you quote the pattern instead:
-f 'files*'
then it'll work properly. You should be able to see this - for example - if you just:
use Data::Dumper;
print Dumper \@ARGV;
I expect you'll see a much longer list than you thought.
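For the Getopt::Long route, a sketch might look like this (the option name and variable names are illustrative; the pattern must be quoted on the command line so the shell doesn't expand it first):
use strict;
use warnings;
use Getopt::Long;

# invoked as:  ./script.pl -f 'files*'
GetOptions('f=s' => \my $pattern)
    or die "usage: $0 -f 'pattern'\n";

my @file = glob $pattern;    # expand the wildcard inside Perl instead
print "$_\n" for @file;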
However, I'd also point out - perl has a really nice feature you may be able to use (depending what you're doing with your files).
You can use <>, which automatically opens and reads all files specified on command line (in order).
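A sketch of that, letting the shell expand the pattern as usual:
# ./script.pl files*   -- the shell fills @ARGV with the matching names
while (<>) {
    chomp;
    print "$ARGV: $_\n";   # $ARGV holds the name of the file currently being read
}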
Since your shell is already expanding the glob files* into a list of filenames, that's what the Perl program gets.
$ perl -E 'say @ARGV' files*
files1files2files3
There's no need to do that in Perl, if your shell can do it for you. If all you want is the filenames in an array, you already have @ARGV which contains those.

Error with opening a filehandle

I have just begun working with Perl, I am only at the introductory level, and I have been having trouble with opening filehandles.
Here is the code:
#!/usr/bin/perl -w
$proteinfilename = 'peptide';
open(PROTEINFILE, $proteinfilename) or die "Can't write to file '$proteinfilename' [$!]\n";
$protein = <PROTEINFILE>;
close PROTEINFILE;
print $protein;
exit;
Every time I tried to run the program, it gave me an error
readline() on closed filehandle PROTEINFILE at C:\BIN\protein.pl
or
Can't write to file 'peptide' [No such file or directory]
Can you please help me figure this out? I have the file peptide saved as a .txt and it's in the same folder as protein.pl. What else can I do to make this work?
You're telling perl to open file peptide in the current directory, but it doesn't find such a file there ("No such file or directory").
Perhaps the current directory isn't C:\BIN, the directory in which you claim the file is located. You can address that by moving the file, using an absolute path, or changing the current directory to be the one where the script is located.
use Cwd qw( realpath );
use File::Basename qw( dirname );
chdir(dirname(realpath($0)));
Perhaps the file isn't named peptide. It might actually be named peptide.txt, for example. Windows hides extensions it recognises by default, a feature I HATE. You can address this by renaming the file or by using the correct file name.
Are you looking to open the file for reading or writing? Your open statement opens it for reading; your error message says 'writing'. You use it for reading — so your error message is confusing, I believe.
If you get 'No such file or directory' errors, it means that despite what you thought, the name 'peptide' is not the name of a file in the current directory. Perl does not add extensions to file names for you; if your file is actually peptide.txt (since you mention that it is a 'txt file'), then that's what you need to specify to open.
If you run perl protein.pl and peptide (or peptide.txt) is in the current directory, then it is not clear what your problem is. If your script is in the C:\BIN directory and your current directory is not C:\BIN but peptide (or peptide.txt) is also in C:\BIN, then you need to arrange to open C:/bin/peptide or C:/bin/peptide.txt.
Note the switch from backslashes to slashes. Backslashes have meanings specific to Perl as an escape character, and Windows is happy with slashes in place of backslashes. If you must use backslashes, then use single quotes around the name:
my $proteinfilename = 'C:\BIN\peptide.txt';
It may be simplest to take the protein file name from a command line argument; this gives you the flexibility of having the script anywhere on your PATH and the file anywhere you choose.
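A rough sketch of that approach (the fallback name here is just an assumption; the open itself is shown below):
# run as:  perl protein.pl C:/bin/peptide.txt
my $proteinfilename = shift @ARGV // 'peptide.txt';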
Two suggestions to help your Perl:
Use the 3-argument form of open and lexical file handles, as in:
open my $PROTEINFILE, '<', $proteinfilename or
die "Can't open file '$proteinfilename' for reading [$!]\n";
my $protein = <$PROTEINFILE>;
close $PROTEINFILE;
Note that this reads a single line from the file. If you need to slurp the whole file into $protein, then you have to do a little more work. There are modules to handle slurping for you, but you can also simply use:
my $protein;
{ local $/; $protein = <$PROTEINFILE>; }
This sets the line delimiter to undef which means the entire file is slurped in one read operation. The $/ variable is global, but this adjusts its value in a minimal scope. Note that $protein was declared outside the block containing the slurp operation!
Use use strict; as well as -w or use warnings;. It will save you grief over time.
I've only been using Perl for 20 years; I don't write a serious script without both use strict; and use warnings; because I don't trust my ability to spot silly mistakes (and Perl will do it for me). I don't make all that many mistakes, but Perl has saved me on many occasions because I use them.
Here is how your program would go:
#!/usr/bin/perl
use strict;
use warnings;
my $proteinfilename = 'peptide.txt';
open(PROTEINFILE, $proteinfilename) or die "Can't open file '$proteinfilename' [$!]\n";
my $protein = <PROTEINFILE>;
close PROTEINFILE;
print $protein;
You need to add the file extension (for example .txt) at the end, like below.
my $proteinfilename = 'peptide.txt';
Your program (say, peptide_test.pl) and the input text file peptide.txt should be in the same directory.
If they are not in the same directory, use absolute path like below.
my $proteinfilename = 'C:\somedirectory\peptide.txt';
Note: Use single quotes for the absolute path. This stops the backslashes in the path from being treated as escape characters.
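To see what the single quotes buy you, compare (illustrative path only):
my $bad  = "C:\temp\new.txt";    # double quotes: "\t" becomes a tab and "\n" a newline
my $good = 'C:\temp\new.txt';    # single quotes: the backslashes are left alone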
Now about the errors: if you don't use a die statement, you will get the error
readline() on closed filehandle PROTEINFILE at C:\BIN\protein.pl
After adding die,
or die $!;
you will get the error No such file or directory.
Also always
use strict;
use warnings;
The -w switch has been superseded by use warnings since Perl 5.6. These two statements will help you find typos and syntax errors.
And one more thing: I don't think you need exit; at the end.
Refer to the documentation for the exit function.

File locking with Fcntl: Baffling bug involving 'use' and 'require'

The following Perl script outputs "SUCCESS" as you'd expect:
use Fcntl qw(:DEFAULT :flock);
sysopen(LF, "test.txt", O_RDONLY | O_CREAT) or die "SYSOPEN FAIL: $!";
if(flock(LF, LOCK_EX)) { print "SUCCESS.\n"; }
else { print "FAIL: $!\n"; }
But now, replace that first line with
require "testlib.pl";
where testlib.pl contains
use Fcntl qw(:DEFAULT :flock);
1;
Now, strangely enough, the script fails, like so:
FAIL: Bad file descriptor
The question: Why?
ADDED:
And now that I know why -- thanks! -- I'm wondering what is the best way to deal with this:
Just do the use Fcntl twice, once in the main script and once in the required library (both the main script and the library need it).
Replace O_RDONLY with &O_RDONLY, etc.
Replace O_RDONLY with O_RDONLY(), etc.
Something else?
By foregoing use, you deprive the Perl parser of the knowledge that O_RDONLY et al. are parameterless subroutines. You have to be a bit more verbose in that situation:
sysopen(LF, "test.txt", O_RDONLY() | O_CREAT()) or die "SYSOPEN FAIL: $!";
if(flock(LF, LOCK_EX())) { print "SUCCESS.\n"; }
EDIT: To elaborate a bit further, without the parentheses, the O_RDONLY and O_CREAT were being interpreted as barewords (strings), which don't behave as you'd expect when binary-or'ed together:
$ perl -le 'print O_RDONLY | O_CREAT'
O_SVOO\Y
(The individual characters are being bitwise-or'ed together.)
In this case, the string "O_SVOO\Y" (or whatever it is on your system) was being interpreted as the number 0 by sysopen, which would therefore still work as long as O_RDONLY is 0 (as is typical) and the file already existed (so the O_CREAT was superfluous). But flock is apparently not as forgiving with non-numeric arguments:
$ perl -e 'flock STDOUT, "LOCK_EX" or die "Failed: $!"'
Failed: Bad file descriptor at -e line 1.
Similarly:
$ perl -e 'flock STDOUT, LOCK_EX or die "Failed: $!"'
Failed: Bad file descriptor at -e line 1.
However:
$ perl -e 'use Fcntl qw(:flock); flock STDOUT, LOCK_EX or die "Failed: $!"'
(no output)
Finally, note that use strict provides many helpful clues.
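For instance, with strict enabled the barewords are caught at compile time instead of silently turning into a garbage string; you would see something along these lines:
$ perl -Mstrict -e 'my $flags = O_RDONLY | O_CREAT;'
Bareword "O_RDONLY" not allowed while "strict subs" in use at -e line 1.
Bareword "O_CREAT" not allowed while "strict subs" in use at -e line 1.
Execution of -e aborted due to compilation errors.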
The line use Fcntl qw(:DEFAULT :flock); is not just loading the Fcntl library for you, but also exporting some symbols into your script's namespace. If you move that to a different file, then the constants O_RDONLY, O_CREAT, and LOCK_EX are no longer available to you, and your code won't do the same thing [however you could still reach them, if you know what namespace they ended up in -- since it was a script that did the export, you could call &main::NAME or simply &NAME, but then you have to be aware of what another file is doing with its code, which is not very clean].
This is described in the documentation under EXPORTED SYMBOLS:
By default your system's F_* and O_* constants (eg, F_DUPFD and O_CREAT) and the FD_CLOEXEC constant are exported into your namespace.
You can request that the flock() constants (LOCK_SH, LOCK_EX, LOCK_NB and LOCK_UN) be provided by using the tag ":flock". See Exporter.
If you add the lines
use strict;
use warnings;
to the top of your script, you will get more informative error messages such as "Name "main::O_RDONLY" used only once: possible typo at line ...", which would give you a clue that these constant definitions are no longer visible.
Edit: in response to your question, the best practice would be #1, to include
the use statement in every file that needs it. See perldoc -f use -- the Fcntl library is only included once, but the import() call is made every time it is needed, which is what you want.
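Concretely, option #1 is just repeating the use line in the main script while testlib.pl keeps its own; a sketch:
# main script
use Fcntl qw(:DEFAULT :flock);   # compile-time constants for this file
require "testlib.pl";            # testlib.pl keeps its own use Fcntl line

sysopen(LF, "test.txt", O_RDONLY | O_CREAT) or die "SYSOPEN FAIL: $!";
flock(LF, LOCK_EX) or die "FLOCK FAIL: $!";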
use is equivalent to:
BEGIN { require Module; Module->import( LIST ); }
guaranteeing that the imported symbols are available before the code starts executing. When you replace use with require, the file is simply read in at run time, at the point in the program where the require statement appears.

Why am I unable to load a Perl library when using the `do` function?

I'm new to Perl, and I'm updating an old Perl website. Every .pl file seems to have this line at the top:
do "func.inc";
So I figured I could use this file to tag on a subroutine for global use.
func.inc
#!/usr/bin/perl
sub foobar
{
return "Hello world";
}
index.pl
#!/usr/bin/perl
do "func.inc";
print "Content-type: text/html\n\n";
print foobar();
However, I get this error:
Undefined subroutine &main::foobar called at /path/to/index.pl line 4.
Both files are in the same directory, and there are tons of subs in func.inc already which are used throughout the website. The script works in the Linux production environment, but does not work in my Windows 7 dev environment (I'm using ActivePerl).
Update:
It looks like the file is not being included; the sub works if the file is included using an absolute path...
do "C:/path/to/func.inc";
... so it looks like relative paths don't work for my local dev environment, but they work in the production environment, though. But this is no good for me, because the absolute path on my dev machine will not work on the live server.
How do I get do to work using a relative path on my Windows 7 dev machine?
Update 2:
I was using the Perl -T switch. Unfortunately this removes "." from @INC, and so stops us from using relative paths for do. I removed this switch and the old code is working now. I'm aware that this is not good practice, but unfortunately I'm working with old code, so it seems that I have no choice.
The perlfunc documentation for do reads
do EXPR
Uses the value of EXPR as a filename and executes the contents of the file as a Perl script.
do 'stat.pl';
is just like
eval `cat stat.pl`;
except that it's more efficient and concise, keeps track of the current filename for error messages, searches the @INC directories, and updates %INC if the file is found.
So to see all this in action, say C:\Cygwin\tmp\mylib\func.inc looks like
sub hello {
print "Hello, world!\n";
}
1;
and we make use of it in the following program:
#!/usr/bin/perl
use warnings;
use strict;
# your code may have unshift @INC, ...
use lib "C:/Cygwin/tmp/mylib";
my $func = "func.inc";
do $func;
# Now we can just call it. Note that with strict subs enabled,
# we have to use parentheses. We could also predeclare with
# use subs qw/ hello /;
hello();
# do places func.inc's location in %INC
if ($INC{$func}) {
print "$0: $func found at $INC{$func}\n";
}
else {
die "$0: $func missing from %INC!";
}
Its output is
Hello, world!
./prog: func.inc found at C:/Cygwin/tmp/mylib/func.inc
As you've observed, do ain't always no crystal stair, which the do documentation explains:
If do cannot read the file, it returns undef and sets $! to the error. If do can read the file but cannot compile it, it returns undef and sets an error message in $@. If the file is successfully compiled, do returns the value of the last expression evaluated.
To check all these cases, we can no longer use simply do "func.inc" but
unless (defined do $func) {
my $error = $! || $@;
die "$0: do $func: $error";
}
Explanations for each case are below.
do cannot read the file
If we rename func.inc to nope.inc and rerun the program, we get
./prog: do func.inc: No such file or directory at ./prog line 12.
do can read the file but cannot compile it
Rename nope.inc back to func.inc and delete the closing curly brace in hello to make it look like
sub hello {
print "Hello, world!\n";
1;
Running the program now, we get
./prog: do func.inc: Missing right curly or square bracket at C:/Cygwin/tmp/mylib/func.inc line 4, at end of line
syntax error at C:/Cygwin/tmp/mylib/func.inc line 4, at EOF
do can read the file and compile it, but it does not return a true value.
Delete the 1; at the end of func.inc to make it
sub hello {
print "Hello, world!\n";
}
Now the output is
./prog: do func.inc: at ./prog line 13.
So without a return value, success resembles failure. We could complicate the code that checks the result of do, but the better choice is to always return a true value at the end of Perl libraries and modules.
Note that the program runs correctly even with taint checking (-T) enabled. Try it and see! Be sure to read Taint mode and @INC in perlsec.
You use the subroutine the same way that you'd use any other subroutine. It doesn't matter that you loaded it with do. However, you shouldn't use do for that. Check out the "Packages" chapter in Intermediate Perl for a detailed explanation of loading subroutines from other files. In short, use require instead.
See the documentation for do. You need to have func.inc (which you can also just call func.pl since pl is "perl library") in one of the directories where Perl will look for libraries. That might be different than the directory that has index.pl. Put func.inc in @INC somewhere, or add its directory to @INC. do also doesn't die if it can't load the file, so it doesn't tell you that it failed. That's why you shouldn't use do to load libraries. :)
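One way to keep the relative name working without depending on "." being in @INC is to add the script's own directory with the core FindBin module; a sketch:
#!/usr/bin/perl
use FindBin;
use lib $FindBin::Bin;    # put the directory holding index.pl (and func.inc) on @INC

do "func.inc";            # now found via @INC regardless of the current directory
print "Content-type: text/html\n\n";
print foobar();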
Making sure the path is correct, use:
#!/usr/bin/perl
require("func.inc");
print "Content-type: text/html\n\n";
print foobar();
I would first check whether the file was actually loaded; the documentation for do mentions that it updates %INC if the file was found. There is also more information in the documentation.
Make sure you have func.inc in the correct path.
do "func.inc"
means you are saying func.inc is in the same directory as your perl script. Check the correct path and then do this:
do "/path/func.inc"