Filehandles and XML::Simple -> Memory corruption. Can't isolate problem - perl

In a small test file, I can run
#!/usr/bin/perl
use warnings;
use strict;
use open qw{:utf8 :std};
use XML::Simple;
my @cmdline = ("hg", "log", "-v", "--style", "xml");
open my $xml, "@cmdline |";
my $xmllog = XMLin($xml, ForceArray => ['logentry', 'parent', 'copy', 'path']);
foreach my $rev (@{$xmllog->{logentry}}) {
#do stuff
}
and it works fine. When I run the same code in a larger program (with the same XML input), it terminates with
*** glibc detected *** /usr/bin/perl: malloc(): memory corruption: 0x0a40e308 ***
(full crash log @ pastebin.com)
However, if I do the exchange
#open my $xml, "@cmdline |";
my $xml = `@cmdline`;
then it works (in both files), so this is more a question of curiosity than a real problem for me.
Does anyone have any pointers on what the difference between my test case and the larger code base might be?
Is there a speed/memory/? difference in the different command calls? Best practices?
Debian Sid: Perl 5.12.4-1.
(This is my first Perl encounter, so don't assume too much about what I "should" know about the language. I just dove into existing code.)
(The larger program is ikiwiki, so the code is not a secret, but I don't know where to look for trouble, and I can't include all the code in this post for practical reasons. This concerns the Mercurial backend.)
As per suggestion from cjm, I added print "$_\n" for sort grep /XML/, keys %INC; which gave output
RPC/XML.pm
RPC/XML/Client.pm
RPC/XML/ParserFactory.pm
XML/NamespaceSupport.pm
XML/Parser.pm
XML/Parser/Expat.pm
XML/SAX.pm
XML/SAX/Base.pm
XML/SAX/Exception.pm
XML/SAX/Expat.pm
XML/SAX/ParserFactory.pm
XML/Simple.pm
in the large project, and
XML/NamespaceSupport.pm
XML/Parser.pm
XML/Parser/Expat.pm
XML/SAX.pm
XML/SAX/Base.pm
XML/SAX/Exception.pm
XML/SAX/Expat.pm
XML/SAX/ParserFactory.pm
XML/Simple.pm
in the test file.
Update: I installed the Debian package libxml-libxml-perl and added $XML::SAX::ParserPackage = "XML::LibXML::SAX"; as suggested. This also crashed, with a different message this time:
*** stack smashing detected ***: /usr/bin/perl terminated
full backtrace @ pastebin.com
This time it happened consistently in both the large and the small file, though. Also, only when using open, not when using backticks.
I also installed libxml-libxml-simple-perl, which is supposed to be little more than a wrapper that always uses XML::LibXML as the parser. It behaved differently, though, and complained about the options passed to XMLin(), so I discarded it.
Explicitly (and blindly) making the program use each of the alternatives listed by print "$_\n" for sort grep /XML/, keys %INC; seems to confirm that XML::SAX::Expat is used by default, as cjm said: all other alternatives exit with errors, and XML::SAX::Expat behaves exactly like the original problem in both files. (Explicitly demanding XML::Simple goes into a loop that allocates all my memory.)
I'm thankful for learning about different XML parsers and that XML::Simple automatically chooses different ones. Both parts of my original question somewhat remain though:
Why do the programs behave differently? Even if I explicitly set $XML::SAX::ParserPackage = "XML::SAX::Expat" in both programs, one crashes (using open) and the other works.
Should I use another method to receive output from the external command? Is it even wrong to expect XMLin() to work with open (but why does it work in one case, then)?
Or are they simply the "wrong" questions to ask (i.e. irrelevant)?
UPDATE: More than a week has passed without a flurry of activity here, and I now solve it a bit differently, without problems. I am marking cjm's answer as correct, since it got me further in the error analysis. Thanks!

XML::Simple is pure-Perl, so it's unlikely to cause the memory corruption you report. It depends on a lower-level XML parser, and it's likely the bug you've encountered is in there. But there are multiple parsers it could be using, and we'd need to know which one.
Try adding this line right after the XMLin line in your sample program, and update your question with the results:
print "$_\n" for sort grep /XML/, keys %INC;
This will tell us which XML parser you're actually using on your system.
Update: Since it looks like you're using XML::Parser (through its SAX interface, XML::SAX::Expat), I'd suggest trying XML::LibXML::SAX instead. Libxml2 is considered one of the better XML parsers.
If you don't already have XML::LibXML::SAX installed, just installing it should switch your default SAX parser to it. If it is installed, try putting
$XML::SAX::ParserPackage = "XML::LibXML::SAX";
at the beginning of your program. (See XML::SAX::ParserFactory for how the SAX parser is selected.)
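On the question of how to receive output from the external command: the following is a minimal sketch, not a diagnosis of the crash. It shows the list form of a pipe open, which passes each argument to the command directly instead of building a shell command line, and it checks both open and close for errors. The helper name read_command is hypothetical, and the Perl interpreter itself ($^X) stands in for hg so that the sketch runs without Mercurial installed.

```perl
use strict;
use warnings;

# Run a command without a shell and return everything it prints.
# read_command is a hypothetical helper name used only in this sketch.
sub read_command {
    my @cmd = @_;

    # List form of a pipe open: each element is passed as one argument.
    open(my $fh, '-|', @cmd) or die "Cannot run '$cmd[0]': $!\n";

    local $/;                 # slurp mode: read all output at once
    my $output = <$fh>;

    # close() on a pipe waits for the child and reports a failed exit.
    close($fh)
        or die "'$cmd[0]' failed: "
             . ($! ? $! : "exit status " . ($? >> 8)) . "\n";

    return $output;
}

# Stand-in command so the sketch runs anywhere; for the case in the
# question it would be ("hg", "log", "-v", "--style", "xml").
my $xml_text = read_command($^X, '-e', 'print "<log/>"');
print "$xml_text\n";
```

The returned string can be handed to XMLin() in the same way as the backtick result in the question, so the parser never sees the filehandle.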

Related

Perl: Is it possible to dynamically fix compile time error?

If I have, for example, the following Perl script:
use strict;
use warnings;
print $x;
When I run this script, compilation will fail with error:
Global symbol "$x" requires explicit package name (did you forget to declare "my $x"?) at ...
Is it possible to write a Perl module that is called when this error occurs, automatically fixes it, and lets compilation continue? (Even links to any information are OK.)
# This code is incorrect.
# Here I just ask about such ability
# This code is very weak approximation how it might look
package AutoFix;
sub fix {
$main::x = 'You are defined now';
}
1;
So that the following code will not fail and will print You are defined now:
use strict;
use warnings;
use AutoFix;
print $x;
How much work would you like to do to create the code that could figure out what the fix should be? And, will that amount of work be comparable or less to the work required to examine code by hand?
Now, I'm writing all of this having spent quite a bit of time trying to come up with a system to analyze CPAN installer output to figure out what went wrong (a major impetus for CPANPLUS, now relegated to history). It's easy to tell that something is not right, but beyond that is a lot of suffering.
In your example, you have an error about an undeclared variable. How does AutoFix know if that should be a package or a lexical variable? You can guess one or the other, but you actually have two big problems:
What is the intent of the code?
Does the code reflect the actual intent?
Determining the intent of the code is often very difficult, even for an experienced human programmer (just read StackOverflow question comments). Compiling code is often not correct code, in the sense that it doesn't achieve the desired outcome. Furthermore, does the programmer even understand the problem? Does the code the programmer wrote (incorrectly here) reflect the actual work the code should do? It's difficult for humans in code review to figure this out. Tools like Coverity can guess at problems they know about, but they aren't going to be able to correct the code.
But let's say that the programmer understands the problem. Have they correctly expressed that? The longer you've been programming, the more you lean toward "no", in general, in my experience.
This is completely different than the database constraint you mentioned. That's a narrowly targeted fix for an expected and allowed situation. Consider a different parallel: if the record has a New York area code but a Chicago address, should I fix the city? When I was a younger dumbass, I did a similar thing to a database. It was stupid because I thought I knew something I didn't, and everyone who understood the situation recognized it immediately. Even then, those sorts of constraints are how we model what we know about the world, not what the world actually is.
Now, to make AutoFix, you need to make something that can look at code, understand it, and figure out what it should do. You can make guesses, but you have no basis for playing the probabilities there.
Technical matters can't solve this. AutoFix can undo the work of pragmas such that some classes of errors don't show up, but so what? The program with an error just continues? How does that help anyone?
Not only that, compilers tend to complain when they realize they can't parse something. What they complain about is often not the problem. The first thing I teach people while debugging is that they need to look at the statement immediately preceding the line number in the error message. Any error message you catch can have a virtually infinite number of causes.
Consider this code, which fails in the same way as your example (same error message) but for a completely different but common reason:
use strict;
use warnings;
my $x = 5,
print $x++;
How do you figure out what the fix should be? It's not about declaring $x.
So, you now have two cases, and you build that into your fixer. Then you encounter another case, so you build that in. And you keep doing this until eventually you have a large dictionary of fixes. Maybe you get a bit crazy and do some machine learning (and wouldn't a corpus of bad code and resolutions be cool).
But, the program still can't continue. It has to start over because it has to at least back up to where it should have done something but didn't. You can't merely restart the program because you don't know if it's idempotent. Rerunning the program might redo work it shouldn't, such as inserting duplicates into databases.
Having said all that, this sort of thing is related to static analysis and the refactoring browser. Adam Kennedy's Parse Perl Isolated (PPI) project was a first step into understanding Perl code without compiling it, then move toward the Smalltalk ideal of understanding which parts of code represented the same thing. If you knew that two things named foo were the same thing, you could rearrange code dealing with foo. For example, if you renamed a method from bar to set_bar, you could immediately know which bars you should rename and which belonged to some other class.
Adam wrote Acme::BadExample and challenged anyone to get it to run. He wrote "any given piece of Perl source exists in bizarre pseudo-quantum-like state, in that it demonstrates both duality and indeterminism."
Jos Boumans stepped up and used some mind-bending Perl, which he then showed in Barely Legal XXX Perl, which I think he first presented in 2006. He was amazingly creative in his solutions, and in a way that I wouldn't want in production code.
Perl doesn't even know, by design, what type of thing will be in a variable or even that the method you might call on it will exist. In fact, it defers so much to the runtime, trusting that things will be in place by the time you need them, that we often say "only Perl can parse perl". You literally need to be able to run Perl code to properly compile it since BEGIN blocks can affect the parse. For example, a BEGIN can define a subroutine with a certain arity. How do you parse foo 5, 6? You have to know what has already been defined.
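To make the arity point concrete, here is a small sketch. The subroutine names one and many are hypothetical and exist only for this demonstration. The same token sequence, a name followed by 5, 6, is parsed differently depending on what the parser has already seen declared.

```perl
use strict;
use warnings;

# one() is declared unary via its ($) prototype; many() takes a list.
# Both names are hypothetical and exist only for this demonstration.
sub one($) { return "one(@_)" }
sub many   { return "many(@_)" }

# The same token sequence "NAME 5, 6" parses in two different ways.
my @unary = (one 5, 6);     # parsed as ( one(5), 6 )
my @list  = (many 5, 6);    # parsed as ( many(5, 6) )

print scalar(@unary), " element(s): @unary\n";
print scalar(@list),  " element(s): @list\n";
```

A tool that reads this file without knowing the prototype of one cannot tell which of the two parses applies, which is why static analysis of Perl has to guess.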
Perl has other "action at a distance" features that make this even tougher. autodie redefines CORE features to add extra behavior, but you might not be able to see that in the code. You can set default regex flags (and I've seen plenty of big screw ups by people applying /isxm to entire files without checking).
As noted above, automatically fixing a compile-time error is not possible (or at least very hard).
Instead of fixing the compile-time error, try to solve your problem in a different way.
For example, your script uses the variable $x. You probably know in advance that you will use it and that it should hold some value, e.g. You are defined now. In that case you could use Exporter:
use strict;
use warnings;
use AutoFix qw/ $x /;
print $x;
And AutoFix module will look like:
package AutoFix;
require Exporter;
our @ISA = qw(Exporter);
our @EXPORT_OK = qw( $x $y $z ); # symbols to export on request
... # code which will create instance of $x $y $z on request
1;
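For experimenting, the same idea can be tried in a single file. This is only a sketch under a few assumptions: the package is defined inline in a BEGIN block instead of in AutoFix.pm, only $x is exported, and the value assigned to $x is the one from the question. It uses use Exporter 'import' rather than @ISA, which is an equivalent way to get Exporter's import method.

```perl
use strict;
use warnings;

# Normally this package would live in its own file, AutoFix.pm.
BEGIN {
    package AutoFix;
    use Exporter 'import';            # borrow Exporter's import method
    our @EXPORT_OK = qw($x);          # symbols to export on request
    our $x = 'You are defined now';   # value taken from the question
    $INC{'AutoFix.pm'} = __FILE__;    # mark as loaded so "use" finds it
}

use AutoFix qw($x);    # imports $x into the current package

print "$x\n";
```

Because the import happens at compile time, strict accepts $x in the main program without a my or our declaration.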
Good luck ;-)

In Perl scripts, should we use shell commands or call Perl functions that imitate shell operations?

I want to know about the best practices here. Suppose I want to get the content of some line of a file. I can use a one-line shell command to get my answer, or write a subroutine, as shown in the code below.
A text file named some_text:
She laughed. Then both continued eating in silence, like strangers,
but after dinner they walked side by side; and there sprang up
between them the light jesting conversation of people who are free
and satisfied, to whom it does not matter where they go or what
they talk about.
Code to get content of line 5 of the file
#!perl
use warnings;
use strict;
my $file = "some_text";
my $lnum = 5;
my $shellcmd = "awk 'NR==$lnum' $file";
print qx($shellcmd);
print getSrcLine($file, $lnum);
sub getSrcLine {
my($file, $lnum) = @_;
open FILE, $file or die "$!";
my #ray = <FILE>;
return $ray[$lnum-1];
}
I ask this because I see a lot of Perl scripts where at some point, a shell command was called, while at some later point, the same task was done by a call to a (library or handwritten) function, for example, rm -rf versus File::Path::rmtree. I just want to make it consistent.
What is the recommended thing to do?
If there's a Perl function for the operation, Perl thinks you should use its version. However, you give an example of a Perl module providing a pure Perl way to do it. That's much different. There's no single answer (as in most things), so you have to decide for yourself what to do:
Does the pure Perl approach do it correctly? For example, File::Copy has some limitations because it makes some awkward decisions for the user, so many people think it's broken. See, for instance, File::Copy versus cp/mv.
Does the pure Perl approach do it in an acceptable time? Sometimes the external program is orders of magnitude faster. Sometimes it's a lot slower.
External commands usually are portable within a family of systems (e.g. all linux-like systems) but probably not across families (e.g. Windows and linux). Your tolerance for that might affect your answer. Even if you think you are running the same command, the different flavors of unix-like systems might have different switches for the operations.
Passing complicated arguments—spaces, quotes, and special characters—to external commands can make you cry. You have to do a lot of fiddly work to make sure you're handling arguments correctly. Perl subroutines don't care though.
You have to pay much more attention to what you are doing when you are using the external command. If you just call rm, Perl is going to search through your PATH and use the first thing called rm. That doesn't mean it's the program you think it is. I write about this quite a bit in the "Secure Programming Techniques" in Mastering Perl.
If the pure Perl approach requires a module, especially if that module has many complicated dependencies, you might be in for dependency or distribution hell down the road.
Personally, I start with the pure Perl approach until it doesn't work for the situation.
For your particular examples, I'd use Perl. Shelling out to awk, which is a proto-Perl, is just odd. You should be able to do everything awk does right in Perl. If you have an awk program, you can convert it to Perl with the a2p program:
NR==5
a2p turns that into (modulo some setup bits at the start):
while (<>) {
print $_ if $. == 5;
}
Notice that it still scans the entire file even after it has found the fifth line. However, you can use the translated program as a start:
while (<>) {
if( $. == 5 ) {
print;
last;
}
}
I don't think you should shell out to some other program to avoid that Perl code.
To remove a directory tree, I like File::Path. It has some dependencies, but they are all in the Perl Standard Library. There's very little pain, if any, associated with that module. I'd use it until I ran into a problem where it didn't work.
If you want your app to be portable to non-unix systems, then definitely code everything in Perl.
If not, it's really up to you... creating a new process is slower, but if that's not important for the task then it doesn't matter. Personally I would pick the solution that I can implement more quickly.
It seems to me that code that works should be the first priority. Yours fails if the file name has a space in it, for example.
Using the shell makes it harder to code correctly since your program needs to properly generate another program to be run by sh. (This problem goes away if you use the multi-arg version of system to avoid the shell.)
Furthermore, using external tools can make it hard to handle errors. You didn't even attempt to do so!
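A minimal sketch of both points, assuming a hypothetical helper named run_command: it uses the list form of system so that no shell parses the arguments, and it inspects $? so that a failure is not silently ignored. The Perl interpreter itself ($^X) stands in for the external tool so that the sketch runs anywhere.

```perl
use strict;
use warnings;

# Run a command with the list form of system, so no shell parses the
# arguments. run_command is a hypothetical helper name.
sub run_command {
    my @cmd = @_;

    # The block form forces a direct exec even for a one-element list.
    system { $cmd[0] } @cmd;

    die "Could not start '$cmd[0]': $!\n" if $? == -1;
    die "'$cmd[0]' died from signal " . ($? & 127) . "\n" if $? & 127;

    return $? >> 8;    # the command's exit status; 0 means success
}

# An argument with spaces and quotes needs no escaping here.
my $status = run_command($^X, '-e', 'exit(shift eq q{a "b" c} ? 0 : 3)', 'a "b" c');
print "exit status: $status\n";
```

The caller still has to decide what a non-zero exit status means for the particular tool, which is part of the extra work external commands bring.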
On the flip side, there are multiple reasons for using external tools. For example, Perl doesn't provide as good a file copy utility as cp; using the sort tool allows you to sort arbitrarily large files with limited RAM; etc.

Is it okay to use modules from within subroutines?

I recently started playing with OO Perl and have been creating quite a bunch of new objects for a new project that I'm working on. I'm unfamiliar with best practices for OO Perl, and we're in a bit of a rush to get it done :P
I'm putting a lot of this kind of code into each of my functions:
sub funcx{
use ObjectX; # i don't declare this on top of the pm file
# but inside the function itself
my $obj = new ObjectX;
}
I was wondering whether this has any negative impact compared with putting the use ObjectX line at the top of the Perl module, outside of any function scope.
I was doing this so that I feel it's cleaner in case I need to shift the function around.
The other thing I have noticed is that when I run a test.pl script that tests my objects on the Unix server itself, it is slow as heck. But when the same code is run through CGI on an Apache server, the web page doesn't load as slowly.
Where to put use?
use occurs at compile time, so it doesn't matter where you put it. At least from a purely pragmatic, 'will it work', point of view. Because it happens at compile time use will always be executed, even if you put it in a conditional. Never do this: if( $foo eq 'foo' ) { use SomeModule }
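The difference is easy to check with %INC, which records every file Perl has loaded. The sketch below uses a hypothetical helper named is_loaded, and File::Basename and Text::Wrap are arbitrary core modules chosen only for illustration. A use inside a branch that can never run still loads its module, while a require in the same position does not.

```perl
use strict;
use warnings;

# Report whether a module's file has been loaded, by checking %INC.
# is_loaded is a hypothetical helper name.
sub is_loaded {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return exists $INC{$file} ? 1 : 0;
}

if (0) { use File::Basename; }    # loaded anyway: use runs at compile time
if (0) { require Text::Wrap; }    # not loaded: require runs at run time

print "File::Basename loaded: ", is_loaded('File::Basename'), "\n";
print "Text::Wrap loaded: ",     is_loaded('Text::Wrap'), "\n";
```

When a module really should be loaded only under some condition, require followed by an explicit call to its import method inside the branch is the run-time equivalent of use.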
In my experience, it is best to put all your use statements at the top of the file. It makes it easy to see what is being loaded and what your dependencies are.
Update:
As brian d foy points out, things compiled before the use statement will not be affected by it. So, the location can matter. For a typical module, location does not matter, however, if it does things that affect compilation (for example it imports functions that have prototypes), the location could matter.
Also, Chas Owens points out that it can affect compilation. Modules that are designed to alter compilation are called pragmas. Pragmas are, by convention, given names in all lower-case. These effects apply only within the scope where the module is used. Chas uses the integer pragma as an example in his answer. You can also disable a pragma or module over a limited scope with the keyword no.
use strict;
use warnings;
my $foo;
print $foo; # Generates a warning
{ no warnings 'uninitialized'; # turn off warnings for working with uninitialized values.
print $foo; # No warning here
}
print $foo; # Generates a warning
Indirect object syntax
In your example code you have my $obj = new ObjectX;. This is called indirect object syntax, and it is best avoided as it can lead to obscure bugs. It is better to use this form:
my $obj = ObjectX->new;
Why is your test script slow on the server?
There is no way to tell with the info you have provided.
But the easy way to find out is to profile your code and see where the time is being consumed. NYTProf is another popular profiling tool you may want to check out.
Best practices
Check out Perl Best Practices, and the quick reference card. This page has a nice run down of Damian Conway's OOP advice from PBP.
Also, you may wish to consider using Moose. If the long script startup time is acceptable in your usage, then Moose is a huge win.
question 1
It depends on what the module does. If it has lexical effects, then it will only affect the scope it is used in:
my $x;
{
use integer;
$x = 5/2; #$x is now 2
}
my $y = 5/2; #$y is now 2.5
If it is a normal module then it makes no difference where you use it, but it is common to use all of those modules at the top of the program.
question 2
Things that can affect the speed of a program between machines
speed of the processor
version of modules installed (some modules have XS versions that are much faster)
version of Perl
number of entries in PERL5LIB
speed of the drive
daotoad and Chas. Owens already answered the part of your question pertaining to the position of use statements. Let me remark on something else here:
"I was doing this so that I feel it's cleaner in case I need to shift the function around."
Personally, I find it much cleaner to have all the used modules in one place at the top of the file. You won't have to search for use statements to see what other modules are being used and a quick glance will tell you what is being used and even what is not being used.
Regarding your performance problem: with Apache and mod_perl the Perl interpreter will have to parse and compile your used modules only once. The next time the script is run, execution should be much faster. On the command line, however, a second run doesn't get this benefit.

Why shouldn't I use shell tools in Perl code?

It is generally advised not to use additional Linux tools in Perl code;
e.g. if someone intends to print the last line of a text file, he can use:
$last_line = `tail -1 $file` ;
or otherwise, open the file and read it line by line
open(INFO,$file);
while(<INFO>) {
$last_line = $_ if eof;
}
What are the pitfalls of using the former, and why should I avoid using shell tools in my code?
thanx,
Efficiency - you don't have to spawn a new process
Portability - you don't have to worry about an executable not existing, accepting different switches, or having different output
Ease of use - you don't have to parse the output, the results are already in a usable form
Error handling - you have finer-grained control over errors and what to do about them in Perl.
It's better to keep all the action in Perl because it's faster and because it's more secure. It's faster because you're not spawning a new process, and it's more secure because you don't have to worry about shell meta character trickery.
For example, in your first case if $file contained "afilename ; rm -rf ~" you would be a very unhappy camper.
P.S. The best all-Perl way to do the tail is to use File::ReadBackwards
One of the primary reasons (besides portability) for not executing shell commands is that it introduces overhead by spawning another process. That's why much of the same functionality is available via CPAN in Perl modules.
One reason is that your Perl code might be running in an environment where there is no shell tool called 'tail'.
It's a personal call depending on the project:
Is it going to be always used in shell environments with tail?
Do you care about only using pure Perl code?
Using tail? Fine. But that's really a special case, since it's so easy to use and since it is so trivial.
The problem in general is not really efficiency or portability, that is largely irrelevant; the issue is ease of use. To run an external utility, you have to find out what arguments it accepts, write code to transform your program's data structures to that format, quote them properly, build the command line, and run the application. Then, you might have to feed it data and read data from it (involving complexity like an event loop, worrying about deadlocking, etc.), and finally interpret the return value. (UNIX processes consider "0" true and anything else false, but Perl assumes the opposite. foo() and die is hard to read.) This is a lot of work to do, and that's why people avoid it. It's much easier to create an instance of a class and call methods on it to get the data you need.
(You can abstract away processes this way; see Crypt::GpgME for example. It handles the complexity associated with invoking gpg, which would normally involve creating multiple filehandles other than STDOUT, STDIN, and STDERR, among other things.)
The main reason I see for doing it all in Perl would be for robustness. Your use of tail will fail if the filename has shell metacharacters or spaces or doesn't exist or isn't accessible. From Perl, characters in the filename aren't an issue, and you can distinguish between errors in accessing the file. Sometimes being robust is more important than speedy coding and sometimes it's not.
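As an illustration of that robustness, here is a minimal sketch of the line-by-line approach from the question with a lexical filehandle, a three-argument open, and an error check. The helper name last_line is hypothetical, and the temporary file (whose name deliberately contains a space) exists only to make the sketch self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return the last line of a file, or undef if the file is empty.
# last_line is a hypothetical helper name.
sub last_line {
    my ($file) = @_;

    # Three-argument open: the name is never interpreted by a shell.
    open(my $fh, '<', $file) or die "Cannot open '$file': $!\n";

    my $last;
    while (my $line = <$fh>) {
        $last = $line;
    }
    close($fh);

    return $last;
}

# Demonstration on a temporary file whose name contains a space.
my ($tmp_fh, $tmp_name) = tempfile('last line XXXXXX', TMPDIR => 1, UNLINK => 1);
print {$tmp_fh} "first\nsecond\nthird\n";
close($tmp_fh);

print last_line($tmp_name);
```

This still reads the whole file, so for very large files a module that reads from the end, such as the File::ReadBackwards mentioned in another answer, is likely the better fit.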

Why does IIS crash when I print to stderr in Perl?

This has been driving me crazy. We have IIS (6) and Windows 2008 and ActiveState Perl 5.10. For some reason, whenever we do a warn or a carp it eventually corrupts the app pool. Of course, that's a pretty big deal since it means that our errors actually cause problems.
This happened with the previous version of Perl (5.8) and Windows (2003) and IIS (5.) Anyway, basically I put in a carp or a warn and I get an error message and then some garbage text. Any thoughts?
Check to make sure that IIS and the perl DLL are linked with the same version of the C runtime library. (Use depends.exe or dumpbin /dependents).
To expand: the problem may be that IIS has its FILE* table in one place, and the perl DLL thinks it's going to be in a slightly different place. When perl goes to find the stderr handle, it treats random memory as a file handle, with predictable results.
Try adding the following to the top of your scripts:
BEGIN {
open STDERR, '>> c:/iisError.log'
or die "Can't write to c:/iisError.log: $!\n";
binmode STDERR;
}
I'm not sure why you would have this problem. But several "wild" guesses as to sources for such a problem would be addressed by the above code.
(It has been a while since I read the source code for appending to files in Win32, but, as I recall, the >> mode plus binmode means that writes to the file from different processes are unlikely to collide, preventing overlapping text in the log.)
A couple of suggestions:
Make sure that the id of the worker process has write permission to the directory/file you are writing. I probably wouldn't give it full control of C:, though. Better to make a sub-directory.
Write to the event log instead of a file using Win32::EventLog
Update: I discovered that this error only happens when you have a variable in the warn. If the warn is just regular text there are no issues. Also, the variable cannot be empty and it looks like you have to have two warns with nonempty variables to hit the bug.