Special character in the file exit the while loop in Perl - perl

I wrote a simple parser for a .txt file with the following instructions:
my $file2 = "test.txt";
open ($process, "<",$file2) or die "couldn't manage to open the file:$file2!";
while (<$process>)
{
...
}
In some files that I am trying to parse there is a special character that is like the right arrow (->) and that I don't manage to paste here from the file.
Every time the parser hits that character (->), it exits the file without processing it till the end.
Is there a way to avoid it and continue processing the file till the very end?
I am using perl 5.6.1 (I cannot use a newer one) and the files that I need to process might have these special characters.
Thanks for your help.

I don't think it's perl that's causing your problem, but almost certainly something in that middle missing block. Are you using eval in the while block on the input from the file? This is a minimal example that shows that the stream containing -> doesn't cause difficulties:
#!/usr/bin/env perl
use warnings;
use strict;
while(<DATA>) {
print "Data[$.]: $_";
}
__DATA__
this is some data
this is also some data
-> this looks fine
foo->dingle also looks fine
This produces:
$ perl ./foo.pl
Data[1]: this is some data
Data[2]: this is also some data
Data[3]: -> this looks fine
Data[4]: foo->dingle also looks fine
So, the -> characters in perl are special:
"-> " is an infix dereference operator, just as it is in C and C++. If
the right side is either a [...] , {...} , or a (...) subscript, then
the left side must be either a hard or symbolic reference to an array,
a hash, or a subroutine respectively. (Or technically speaking, a
location capable of holding a hard reference, if it's an array or hash
reference being used for assignment.) See perlreftut and perlref.
Otherwise, the right side is a method name or a simple scalar variable
containing either the method name or a subroutine reference, and the
left side must be either an object (a blessed reference) or a class
name (that is, a package name). See perlobj.
So if I had to guess, you're definitely using eval to try to parse your content and it's failing to dereference the left side of the operator and crashing. Please provide command line error messages or the code in the while loop if you want further assistance.

Related

How can I get perl to correctly pass a command line argument with multiple arguments and complex file paths (spaces and symbols)?

I have a small perl script which collects file paths from an excel file and passes them through the command line to perltex which then compiles a pdf based on the files and paths chosen.
My problem is that the moment I introduce more complex file paths (which is necessary based on the network setup of the final user pool) perltex fails to find the file paths, cutting them at the space.
A MWE is a follows
#!/usr/bin/perl
use strict;
use warnings;
use 5.14.2;
use Text::Template;
use Spreadsheet::Read;
use Spreadsheet::ParseXLSX;
use utf8;
use charnames qw( :full :short );
use autodie;
my $row = 5;
my $col = 15;
my $File = "C:/Users/me/Desktop/Reporting-Static/Input-test1.xlsm";
my $parser = Spreadsheet::ParseXLSX->new();
my $workbook = $parser->parse($File);
my $worksheet = $workbook->worksheet("Input");
my $cell = $worksheet->get_cell($row, $col);
my $Filename = $cell->Value();
my $texfile = "C:/Users/me/Desktop/Reporting-Static/file.tex";
# can't find this file if there are spaces in the address
system("perltex", "--latex=pdflatex", "--nosafe", "--jobname=$Filename", "$texfile");
if ( $? == -1 )
{
print "command failed: $!\n";
}
else
{
printf "command exited with value %d", $? >> 8;
}
exit;
However, the moment I change the folder name to one with spaces eg. "Reporting Static" it fails to find the tex file.
I have read several other posts regarding this on stack exchange and other websites but for whatever reason the proposed solutions do not appear to work for me. I have tried
my $texfile = "C:/Users/me/Desktop/Reporting Static/file.tex";
my $texfile = C:/Users/me/Desktop/"Reporting Static"/file.tex;
my $texfile = "\"C:/Users/me/Desktop/"Reporting Static"/file.tex\"";
my $texfile = "\"C:/Users/me/Desktop/Reporting Static/file.tex\"";
my $texfile = "C:/Users/me/Desktop/Reporting^ Static/file.tex";
my $texfile = "C:/Users/me/Desktop/Reporting\^ Static/file.tex";
As well as a few other combinations or varioations of the above, all without success. I have also tried replacing the double quote with a single quote so that perl doesn't interpolate the contents.
I have also tried manually typing all of the above into the command prompt to check whether there was a small issue with the way perl passed the commands to the command line but still no luck.
I am aware that I can use the 'dir /X ~1 c:\' command to find system name allocations that avoid spaces but the idea is that the filename and location will be dynamic and change as a funtion of department and site, so I would prefer to avoid trying to write a script which will go and find this pathname and use it to replace all locations using spaces or other special characters.
The final idea that I had is that this problem could be connected ot the way that perltex passes it's arguments yet I am unable to find any documentation (that I can follow...) on the specifics of how this particular aspect of the file functions.
So my questions are, is there something I am missing not metioned in the other answers that I have read regarding how to correctly pass these paths to perltex, is there perhaps some sort of incompatiblity in how I'm trying to go about this, is this more probabl linked to perltex as opposed to perl or cmd or is there something completely different that I am unaware of that is stopping this from working???
EDIT:
from cmd prompt perltex returns a "unable to find path X, please enter another file location". Until now I hadn't really tested retyping the path by by entering 'C:/Users/me/Desktop/"Reporting Static"/file.tex' (no quotes at the beginning) it is subsequently accepted and runs. but initially passing it this path does not work, suggesting that some internal perltex code accepts the inital path differently to being repassed the same path after an error.... not quite sure what to make of this.
EDIT:
The contents of #latexcmdline that I extracted
$VAR1 = [
'pdflatex',
'--jobname=--',
'\\makeatletter\\def\\plmac#tag{AYNNNUVKQVJGZKKPGPTH}\\def\\plmac#tofile{Perl.topl}\\def\\plmac#fromfile{Perl.frpl}\\def\\plmac#toflag{Perl.tfpl}\\def\\plmac#fromflag{Perl.ffpl}\\def\\plmac#doneflag{Perl.dfpl}\\def\\plmac#pipe{Perl.pipe}\\makeatother\\input C:/Users/me/Desktop/PERLTEST/Perl',
'Modules/RevuedeProjetDB.tex'
];
This was done by inserting
use File::Slurp;
use Data::Dumper;
write_file 'C:\Users\me\Desktop\PERLTEST\mydump.log', Dumper( \#latexcmdline );
before the exec command.
Update
I initially recommended that you should use String::ShellQuote but that module is for Linux only so I deleted my answer when I realised that your question was about the Windows system
It seems that there's also a Win32::ShellQuote which does the same thing for Windows, so I am renewing my suggestion
As I said before, the issue is that perltex itself doesn't properly handle paths containing whitespace, even if they are correctly passed as a single element of #ARGV. I believe the solution is to pass the path including enclosing quotes, although I have never been able to test this properly as I have no LaTex installation
Unfortunately, even if I pass qq{"$texfile"}, the quotes are still stripped before they reach the target program, so they must be protected in some way
You need the quote_system function from that module, which will prepare a list of strings so that they retain any quotation marks
Using a parameter of quote_system(qq{"$texfile"}) produces the correct result in my tests. It is the equivalent of passing qq{"\\"$texfile\\""} but less ugly
So your system call should be like this (with no modification to perltex.pl)
I have applied the same principle to $Filename as it may well be that this also contains whitespace
use Win32::ShellQuote 'quote_system';
system(quote_system(
'perltex',
'--latex=pdflatex',
'--nosafe',
qq{--jobname="$Filename"},
qq{"$texfile"},
));
Okay, well I have a solution of sorts
The issue, as I suspected, is that, although the path is passed as a single string to perltex.pl, the latter doesn't handle paths with spaces properly after it has received them
The temporary fix is to hack perltex.pl
Line 82 of my version of perltex.pl (there is no version number in the source) reads
$latexcmdline[$firstcmd] = "\\input $option";
If you change that to
$latexcmdline[$firstcmd] = qq{\\input "$option"};
then all should be well. However this is a solid fix only when it is distributed by the author of perltex. Meanwhile I am looking for a nicer solution from the calling side
There are two steps to solving this problem.
Work out how to get the correct arguments into an external
program.
Work out how to do that from a Perl program.
For step 1, I find a program like this to be useful.
#!/usr/bin/perl
use strict;
use warnings;
print "Received ", scalar #ARGV, " arguments:\n";
for (1 .. #ARGV) {
print "$_: $ARGV[$_ - 1]\n";
}
It just explains what arguments it receives on the command line. You can use this in place of "perltex" for testing purposes.
You'll see that if you give it an argument that contains spaces, then that is interpreted as the called program as multiple arguments. The way to get round that is to quote the argument that contains a space. And I seem to remember that Windows insists on double-quotes (for reasons that I can never remember).
So I think that you want this:
system('perltex', '--latex=pdflatex', '--nosafe', "--jobname=\"$Filename\"", "\"$texfile\"");
I've double-quoted both of the filenames. Of course, those escaped double-quote characters look really ugly, and Perl gives us qq(...) to make that look nicer.
system('perltex', '--latex=pdflatex', '--nosafe', qq(--jobname="$Filename"), qq("$texfile"));
If that's not quite right, then the program I showed earlier will make it easier to find the solution.
Update: Borodin's comment below about this making no difference to $texfile is accurate. The fact that we're passing a list to system() means that the shell isn't involved at all.

Simple perl file copy methods without using File::Copy

I have very little perl experience and have been trying a few methods on OS X before I attempt to use Macperl on a more difficult to access OS 9 with very limited memory.
I have been trying simple file copy methods without using File::Copy.
Both of the following appear to work:
open R,"<old";
open W,">new";
print W <R>;
open $r,"<old";
open $w,">new";
while (<$r>) { print $w $_ }
Why can't I use $r and $w with the first method?
How does the first method work without a 'while'?
Is there a better way to do this?
Is there a better way to do this?
There sure is... File::Copy is a core module (no installation requred), so there's little reason to avoid using it.
use File::Copy;
copy('old', 'new');
Alternatively, you can use a system call to the underlying OS, in your case, OS-X
system('cp', 'old', 'new');
UPDATE Oops, you're on OS9, so I'm not sure what system calls are available.
You can use your first method with lexical file handles, but you need to disambiguate a little.
open $r, '<', 'old';
open $w, '>', 'new';
print {$w} <$r>;
Bare in mind this is unidiomatic code, and if you just want to create a direct copy, the first method is preferable (EDIT If your memory constraints allow for it).
Perl operators and functions can return different things depending on what their context expects.
The first method works because the print function creates what is called a list context for the <> operator - then thee operator will "slurp in" the entire file.
In the second example, the <> operator is called in the condition of the loop, which creates a scalar context, returning one line at a time (some asterisks here, but that's another story.)
Here is some explanation about contexts: http://perlmaven.com/scalar-and-list-context-in-perl.
And, both methods should work with the R and W filehandles (which are old fashioned Perl filehandles that are different from regular variables), and with the $r/$w notation that actually denotes variables which hold a filehandle reference. The difference is subtle but in most everyday use cases these can be used interchangeably. Have you tried using $ variables in the first example?
In addition to the Hellmar Becker's answer:
The print $w <$r>; does not work (gives a syntax error) because if the FILEHANDLE argument is a variable, perl tries to interpret the print's argument list beginning ($w <$r) as an operator (see the NOTE in http://perldoc.perl.org/functions/print.html). To disambigue put parentheses around the <$r>:
print $w (<$r>);

How is a Perl filehandle a scalar if it can return multiple lines?

I have kind of fundamental question about scalars in Perl. Everything I read says scalars hold one value:
A scalar may contain one single value in any of three different
flavors: a number, a string, or a reference. Although a scalar may not
directly hold multiple values, it may contain a reference to an array
or hash which in turn contains multiple values.
--from perldoc
Was curious how the code below works
open( $IN, "<", "phonebook.txt" )
or die "Cannot open the file\n";
while ( my $line = <$IN> ) {
chomp($line);
my ( $name, $area, $phone ) = split /\|/, $line;
print "$name $phone $phone\n";
}
close $IN;
Just to clarify the code above is opening a pipe delimited text file in the following format name|areacode|phone
It opens the file up and then it splits them into $name $area $phone; how does it go through the multiple lines of the file and print them out?
Going back to the perldoc quote from above "A scalar may contain a single value of a string, number, reference." I am assuming that it has to be a reference, but doesn't even really seem like a reference and if it is looks like it would a reference of a scalar? so I am wondering what is going on internally that allows Perl to iterate through all of the lines in the code?
Nothing urgent, just something I noticed and was curious about. Thanks.
It looks like Borodin zeroed in on the part you wanted, but I'll add to it.
There are variables, which store things for us, and there are operators, which do things for us. A file handle, the thing you have in $IN, isn't the file itself or the data in the file. It's a connection that the program to use to get information from the file.
When you use the line input operator, <>, you give it a file handle to tell it where to grab the next line from. By itself, it defaults to ARGV, but you can put any file handle in there. In this case, you have <$IN>. Borodin already explained the reference and bareword stuff.
So, when you use the line input operator, it look at the connection you give in then gets a line from that file and returns it. You might be able to grok this more easily with it's function form:
my $line = readline( $IN );
The thing you get back doesn't come out of $IN, but the thing it points to. Along the way, $IN keeps track of where it is in the file. See seek and tell.
Along the same lines are Perl's regexes. Many people call something like /foo.*bar/ a regular expression. They are slightly wrong. There's a regular expression inside the pattern match operator //. The pattern is the instructions, but it doesn't do anything by itself until the operator uses it.
I find in my classes if I emphasize the difference between the noun and verb parts of the syntax, people have a much easier time with this sort of stuff.
Old Answer
Through each iteration of the while loop, exactly one value is put into the scalar variables. When the loop is done with a line, everything is reset.
The value in $line is a single value: the entire line which you have not broken up yet. Perl doesn't care what that single value looks like. With each iteration, you deal with exactly one line and that's what's in $line. Remember, these are variables, which means you can modify and replace their values, so they can only hold one thing at a time, but there can be multiple times.
The scalars $name, $area, and $phone have single values, each produced by split. Those are lexical variables (my), so they are only visible inside the specific loop iteration where they are defined.
Beyond that, I'm not sure which scalar you might be confused about.
The old-fashioned way of opening files is to use a bare name for the file handle, like so
open IN, 'phonebook.txt'
A file handle is a special type of value, like scalar, hash, array etc. but it has no prefix symbol to differentiate it. (This isn't actually the full extent of the truth, but I am worried about confusing you if I add even more detail.)
Perl still works like this, but it is best avoided for a couple of reasons.
All such file handles are global, and there is no way to restrict access to them by scope
There is no way to pass the value to a subroutine or store it in a data structure
So Perl was enhanced several years ago so that you can use references to file handles. These can be stored in scalar variables, arrays, or hashes, and can be passed as subroutine parameters.
What happens now when you write
open my $in, '<', 'phonebook.txt'
is that perl autovivifies an anonymous file handle, and puts a reference to it in variable $in, so yes, you were right, it is a reference. (Another thing that was changed about the same time was the move to three-parameter open calls, which allow you to open a file called, say, >.txt for input.)
I hope that helps you to understand. It's an unnecessary level of detail, but it can often help you to remember the way Perl works to understand the underlying details.
Incidentally, it is best to keep to lower-case letters for lexical variables, even for file handle references. I often add fh to the end to indicate that the variable holds a file handle, like $in_fh. But there's no need to use capitals, which are generally reserved for global variables like Package::Names.
Update - The Rest of the Story
I thought I should add something to explain what I have mised out, for fear of misleading people who care about the gory detail.
Perl keeps a symbol table hash - a stash - that work very like ordinary Perl hashes. There is one such stash for each package, including the default package main. Note that this hash nothing to do with lexical variables - declared with my - which are stored entirely separately.
Ther indexes for the stashes are the names of the package variables, without the initial symbol. So, for example, if you have
our $val;
our #val;
our %val;
then the stash will have only a single element, with a key of val and a value which is a reference to an intermediate structure called a typeglob. This is another hash structure, with one element for each different type of variable that has been declared. In this case our val typeglob will have three elements, for the scalar, array, and hash varieties of the val variables.
One of these elements may also be an IO variable type, which is where file handles are kept. But, for historical reasons, the value that is passed around as a file handle is in fact a reference to the typeglob that contains it. That is why, if you write open my $in, '<', 'phonebook.txt' and then print $in you will see something like GLOB(0x269581c) - the GLOB being short for typeglob.
Apart from that, the account above is accurate. Perl autovivifies an anonymous typeglob in the current package, and uses only its IO slot for the file handle.
Scalars in Perl are denoted by a $ and they can indeed contain the type of values you mention in your questions but next to that they can also contain a file handle. You can create file handles in Perl in two ways one way is Lexical
open my $filehandle, '>', '/path/to/file' or die $!;
and the other is global
open FILEHANDLE, '>', '/path/to/file' or die $!;
You should use the Lexical version which is what you're doing.
The while loop in your code uses the <> operator on your lexical filehandle which returns a line out of your file every time it's called, until it's out of lines (when End Of File is reached) in which case it returns false.
I went into a bit more detail on file handles as it seems it's a concept you're not completely clear on.

Why is the perl print filehandle syntax the way it is?

I am wondering why the perl creators chose an unusual syntax for printing to a filehandle:
print filehandle list
with no comma after filehandle. I see that it's to distinguish between "print list" and "print filehandle list", but why was the ad-hoc syntax preferred over creating two functions - one to print to stdout and one to print to given filehandle?
In my searches, I came across the explanation that this is an indirect object syntax, but didn't the print function exist in perl 4 and before, whereas the object-oriented features came into perl relatively late? Is anyone familiar with the history of print in perl?
Since the comma is already used as the list constructor, you can't use it to separate semantically different arguments to print.
open my $fh, ...;
print $fh, $foo, $bar
would just look like you were trying to print the values of 3 variables. There's no way for the parser, which operates at compile time, to tell that $fh is going to refer to a file handle at run time. So you need a different character to syntactically (not semantically) distinguish between the optional file handle and the values to actually print to that file handle.
At this point, it's no more work for the parser to recognize that the first argument is separated from the second argument by blank space than it would be if it were separated by any other character.
If Perl had used the comma to make print look more like a function, the filehandle would always have to be included if you are including anything to print besides $_. That is the way functions work: If you pass in a second parameter, the first parameter must also be included. There isn't one function I can think of in Perl where the first parameter is optional when the second parameter exists. Take a look at split. It can be written using zero to four parameters. However, if you want to specify a <limit>, you have to specify the first three parameters too.
If you look at other languages, they all include two different ways ways to print: One if you want STDOUT, and another if you're printing to something besides STDOUT. Thus, Python has both print and write. C has both printf and fprintf. However, Perl can do this with just a single statement.
Let's look at the print statement a bit more closely -- thinking back to 1987 when Perl was first written.
You can think of the print syntax as really being:
print <filehandle> <list_to_print>
To print to OUTFILE, you would say:
To print to this file, you would say:
print OUTFILE "This is being printed to myfile.txt\n";
The syntax is almost English like (PRINT to OUTFILE the string "This is being printed to myfile.txt\n"
You can also do the same with thing with STDOUT:
print STDOUT "This is being printed to your console";
print STDOUT " unless you redirected the output.\n";
As a shortcut, if the filehandle was not given, it would print to STDOUT or whatever filehandle the select was set to.
print "This is being printed to your console";
print " unless you redirected the output.\n";
select OUTFILE;
print "This is being printed to whatever the filehandle OUTFILE is pointing to\n";
Now, we see the thinking behind this syntax.
Imagine I have a program that normally prints to the console. However, my boss now wants some of that output printed to various files when required instead of STDOUT. In Perl, I could easily add a few select statements, and my problems will be solved. In Python, Java, or C, I would have to modify each of my print statements, and either have some logic to use a file write to STDOUT (which may involve some conniptions in file opening and dupping to STDOUT.
Remember that Perl wasn't written to be a full fledge language. It was written to do the quick and dirty job of parsing text files more easily and flexibly than awk did. Over the years, people used it because of its flexibility and new concepts were added on top of the old ones. For example, before Perl 5, there was no such things as references which meant there was no such thing as object oriented programming. If we, back in the days of Perl 3 or Perl 4 needed something more complex than the simple list, hash, scalar variable, we had to munge it ourselves. It's not like complex data structures were unheard of. C had struct since its initial beginnings. Heck, even Pascal had the concept with records back in 1969 when people thought bellbottoms were cool. (We plead insanity. We were all on drugs.) However, since neither Bourne shell nor awk had complex data structures, so why would Perl need them?
Answer to "why" is probably subjective and something close to "Larry liked it".
Do note however, that indirect object notation is not a feature of print, but a general notation that can be used with any object or class and method. For example with LWP::UserAgent.
use strict;
use warnings;
use LWP::UserAgent;
my $ua = new LWP::UserAgent;
my $response = get $ua "http://www.google.com";
my $response_content = decoded_content $response;
print $response_content;
Any time you write method object, it means exactly the same as object->method. Note also that parser seems to only reliably work as long as you don't nest such notations or do not use complex expressions to get object, so unless you want to have lots of fun with brackets and quoting, I'd recommend against using it anywhere except common cases of print, close and rest of IO methods.
Why not? it's concise and it works, in perl's DWIM spirit.
Most likely it's that way because Larry Wall liked it that way.

Perl script getting stuck in terminal for no apparent reason

I have a Perl script which reads three files and writes new files after reading each one of them. Everything is one thread.
In this script, I open and work with three text files and store the contents in a hash. The files are large (close to 3 MB).
I am using a loop to go through each of the files (open -> read -> Do some action (hash table) -> close)
I am noticing that the whenever I am scanning through the first file, the Perl terminal window in my Cygwin shell gets stuck. The moment I hit the enter key I can see the script process the rest of the files without any issues.
It's very odd as there is no read from STDIN in my script. Moreover, the same logic applies to all the three files as everything is in the same loop.
Has anyone here faced a similar issue? Does this usually happen when dealing with large files or big hashes?
I can't post the script here, but there is not much in it to post anyway.
Could this just be a problem in my Cygwin shell?
If this problem does not go away, how can I circumvent it? Like providing the enter input when the script is in progress? More importantly, how can I debug such a problem?
sub read_set
{
#lines_in_set = ();
push #lines_in_set , $_[0];
while (<INPUT_FILE>)
{ $line = $_;
chomp($line);
if ($line=~ /ENDNEWTYPE/i or $line =~ /ENDSYNTYPE/ or eof())
{
push #lines_in_set , $line;
last;
}
else
{
push #lines_in_set , $line;
}
}
return #lines_in_set;
}
--------> I think i found the problem :- or eof() call was ensuring that the script would be stuck !! Somehow happening only at the first time. I have no idea why though
The eof() call is the problem. See perldoc -f eof.
eof with empty parentheses refers to the pseudo file accessed via while (<>), which consists of either all the files named in #ARGV, or to STDIN if there are none.
And in particular:
Note that this function actually reads a character and then "ungetc"s it, so isn't useful in an interactive context.
But your loop reads from another handle, one called INPUT_FILE.
It would make more sense to call eof(INPUT_FILE). But even that probably isn't necessary; your outer loop will terminate when it reaches the end of INPUT_FILE.
Some more suggestions, not related to the symptoms you're seeing:
Add
use strict;
use warnings;
near the top of your script, and correct any error messages this produces (perl -cw script-name does a compile-only check). You'll need to declare your variables using my (perldoc -f my). And use consistent indentation; I recommend the same style you'll find in most Perl documentation.