Why is the SAX XML parser taking a long time? - Perl

I have an application which uses a SAX XML parser, but I've observed that it takes a long time (~30 sec) to parse an XML file of about 5 MB.
Is there a solution or workaround to improve the performance?
Here is the piece of code that is causing the delay.
package XML::SAX::PurePerl;
sub _parse_systemid {
    my $self = shift;
    my ($uri) = @_;
    my $reader = XML::SAX::PurePerl::Reader::URI->new($uri);
    return $self->_parse($reader); # taking 30 sec
}

From the BUGS section of the module's documentation:
XML::SAX::PurePerl is slow. Very slow. I suggest you use something
else in fact.
Is there a reason why you are using the pure-Perl version? Are you able to install XML::SAX?
I've also found XML::Simple to be a solid XML workhorse, with options to load XML into hashes, and SAX2 support.
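If you are able to install a compiled parser, forcing XML::SAX to pick it up is usually all it takes. A minimal sketch, assuming XML::SAX::Expat (or XML::LibXML::SAX) is installed and MyHandler is a stand-in for your existing SAX handler class:
use XML::SAX::ParserFactory;
use MyHandler;   # hypothetical handler class implementing the SAX callbacks

# Ask the factory for a C-based parser instead of letting it fall back
# to XML::SAX::PurePerl.
$XML::SAX::ParserPackage = "XML::SAX::Expat";

my $handler = MyHandler->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_uri("file.xml");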


What does this Lucene-related code actually do?

#!/usr/bin/perl
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::Index::Writer;
use Plucene::Analysis::SimpleAnalyzer;
use Plucene::Search::HitCollector;
use Plucene::Search::IndexSearcher;
use Plucene::QueryParser;

my $content = "I am the law";
my $doc = Plucene::Document->new;
$doc->add(Plucene::Document::Field->Text(content => $content));
$doc->add(Plucene::Document::Field->Text(author => "Philip Johnson"));

my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();
my $writer = Plucene::Index::Writer->new("my_index", $analyzer, 1);
$writer->add_document($doc);
undef $writer; # close

my $searcher = Plucene::Search::IndexSearcher->new("my_index");
my @docs;
my $hc = Plucene::Search::HitCollector->new(collect => sub {
    my ($self, $doc, $score) = @_;
    push @docs, $searcher->doc($doc);
});
$searcher->search_hc($query => $hc);
Try as I may, I don't understand what this code does. I understand the familiar Perl syntax and what's going on on that end... but what are a Lucene Document, an Index::Writer, etc.? Most importantly, when I run this code I expect something to be generated... yet I see nothing.
I know what an Analyzer is, thanks to this doc linked to from CPAN: http://onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2. But I am just not getting why, when I run this code, it doesn't seem to DO anything...
Lucene is a search engine designed to search huge amounts of text very fast.
My Perl is not strong, but here is what I understand from the Lucene objects:
my $content = "I am the law";
my $doc = Plucene::Document->new;
$doc->add(Plucene::Document::Field->Text(content => $content));
$doc->add(Plucene::Document::Field->Text(author => "Philip Johnson"));
This part creates a new document object and adds two text fields to it, content and author, in preparation for adding it to a Lucene index file as searchable data.
my $analyzer = Plucene::Analysis::SimpleAnalyzer->new();
my $writer = Plucene::Index::Writer->new("my_index", $analyzer, 1);
$writer->add_document($doc);
undef $writer; # close
This part creates the index files and adds the previously created document to that index. At this point you should have a "my_index" folder with several index files in it, in your application directory, containing the document's data as searchable text.
my $searcher = Plucene::Search::IndexSearcher->new("my_index");
my @docs;
my $hc = Plucene::Search::HitCollector->new(collect => sub {
    my ($self, $doc, $score) = @_;
    push @docs, $searcher->doc($doc);
});
$searcher->search_hc($query => $hc);
This part searches the index file created above for the same document data you just used to build the index. Presumably you'll have your search results in @docs at this point, which you might want to display to the user (though this sample does not).
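Note that the snippet never defines $query. A minimal sketch of how it could be built and how the results could be shown, following the Plucene POD synopsis (the query string and field names here are only illustrative):
my $parser = Plucene::QueryParser->new({
    analyzer => Plucene::Analysis::SimpleAnalyzer->new(),
    default  => "content",   # field searched when the query names none
});
my $query = $parser->parse('author:"Philip Johnson"');

$searcher->search_hc($query => $hc);   # the HitCollector fills @docs

printf "Found %d matching document(s)\n", scalar @docs;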
This seems to be a "hello world" application for Lucene usage in Perl. In real-life applications, I don't see a scenario where you would create the index file and then search it from the same piece of code.
Where did you get this code from? It is a copy of the code in the Synopsis at the start of the Plucene POD documentation.
I guess it was an attempt by someone to begin learning about Plucene. The code in a module's synopsis isn't necessarily meant to achieve something useful on its own.
As the documentation you refer to says, Lucene is a Java library that adds text indexing and searching capabilities to an application. It is not a complete application that one can just download, install, and run.
Where did you get the idea that you should run the code you show?

Perl module loading - Safeguarding against: perhaps you forgot to load "Bla"?

When you run perl -e "Bla->new", you get this well-known error:
Can't locate object method "new" via package "Bla"
(perhaps you forgot to load "Bla"?)
This happened in a Perl server process the other day due to an oversight of mine. There are multiple scripts, and most of them have the proper use statements in place. But there was one script that was doing Bla->new in sub blub at line 123 while missing a use Bla at the top, and when it was hit by a click before any of the other scripts that use Bla had been loaded by the server process, then bang!
Testing the script in isolation would be the obvious way to safeguard against this particular mistake, but alas the code is dependent upon a humongous environment. Do you know of another way to safeguard against this oversight?
Here's one example of how PPI (despite its merits) is limited in its view of Perl:
use strict;
use HTTP::Request::Common;
my $req = GET 'http://www.example.com';
$req->headers->push_header( Bla => time );
my $au=Auweia->new;
__END__
PPI::Token::Symbol '$req'
PPI::Token::Operator '->'
PPI::Token::Word 'headers'
PPI::Token::Operator '->'
PPI::Token::Word 'push_header'
PPI::Token::Symbol '$au'
PPI::Token::Operator '='
PPI::Token::Word 'Auweia'
PPI::Token::Operator '->'
PPI::Token::Word 'new'
Setting the header and assigning Auweia->new parse the same way. So I'm not sure how you can build upon such a shaky foundation. I think the problem is that Auweia could also be a subroutine; perl.exe cannot tell until runtime.
Further Update
Okay, from @Schwern's instructive comments below I learnt that PPI is just a tokenizer, and you can build upon it if you accept its limitations.
Testing is the only answer worth the effort. If the code contains mistakes like forgetting to load a class, it probably contains other mistakes. Whatever the obstacles, make it testable. Otherwise you're patching a sieve.
That said, you have two options. You can use Class::Autouse which will try to load a module if it isn't already loaded. It's handy, but because it affects the entire process it can have unintended effects.
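If you go the Class::Autouse route, the usage is tiny. A sketch, with "Bla" standing in for whichever class is being called without having been loaded:
use Class::Autouse qw( Bla );
# or, to catch any class that was never loaded explicitly:
# use Class::Autouse ':superloader';

# Bla.pm is now pulled in on demand the first time a method is called on it:
my $obj = Bla->new;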
Or you can use PPI to scan your code and find all the class method calls. PPI::Dumper is very handy to understand how PPI sees Perl.
use strict;
use warnings;
use PPI;
use PPI::Dumper;

my $file = shift;
my $doc = PPI::Document->new($file);

# How PPI sees a class method call.
#   PPI::Token::Word        'Class'
#   PPI::Token::Operator    '->'
#   PPI::Token::Word        'method'
$doc->find( sub {
    my($node, $class) = @_;

    # First we want a word
    return 0 unless $class->isa("PPI::Token::Word");

    # It's not a class, it's actually a method call.
    return 0 if $class->method_call;
    my $class_name = $class->literal;

    # Next to it is a -> operator
    my $op = $class->snext_sibling or return 0;
    return 0 unless $op->isa("PPI::Token::Operator") and $op->content eq '->';

    # And then another word which PPI identifies as a method call.
    my $method = $op->snext_sibling or return 0;
    return 0 unless $method->isa("PPI::Token::Word") and $method->method_call;
    my $method_name = $method->literal;

    printf "%s->%s seen at %s line %d.\n",
        $class_name, $method_name, $file, $class->line_number;
});
You don't say what server environment you're running under, but from what you say it sounds like you could do with preloading all your modules in advance, before executing any individual pages. Not only would this prevent the problems you're describing (where every script has to remember to load all the modules it uses) but it would also save you memory.
In pre-forking servers (as is commonly used with mod_perl and Apache) you really want to load as much of your code as possible before your server forks for the first time, so that the code is stored once in copy-on-write shared memory rather than multiple times in each child process when it is loaded on demand.
For information on pre-loading in Apache, see the relevant section of Practical mod_perl.
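A minimal sketch of such a preload file (the paths and module names are placeholders for your own), pulled into the Apache parent process once via a PerlRequire /path/to/startup.pl line in httpd.conf:
# startup.pl
use strict;
use warnings;

use lib '/path/to/your/app/lib';    # adjust to your environment

# Load everything the individual scripts rely on, so the code is compiled
# once in the parent and shared copy-on-write by all child processes.
use HTTP::Request::Common ();
use Bla ();

1;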

Why is it a bad idea to write configuration data in code?

Real-life case (from caff) to exemplify the short question subject:
$CONFIG{'owner'} = q{Peter Palfrader};
$CONFIG{'email'} = q{peter@palfrader.org};
$CONFIG{'keyid'} = [ qw{DE7AAF6E94C09C7F 62AF4031C82E0039} ];
$CONFIG{'keyserver'} = 'wwwkeys.de.pgp.net';
$CONFIG{'mailer-send'} = [ 'testfile' ];
Then in the code: eval `cat $config`, access %CONFIG
Please provide answers that lay out the general problems, not just ones specific to the example.
There are many reasons to avoid configuration in code, and I go through some of them in the configuration chapter in Mastering Perl.
No configuration change should carry the risk of breaking the program. It certainly shouldn't carry the risk of breaking the compilation stage.
People shouldn't have to edit the source to get a different configuration.
People should be able to share the same application without sharing the same settings, rather than having to re-install the application just to change the configuration.
People should be allowed to create several different configurations and run them in batches without having to edit the source.
You should be able to test your application under different settings without changing the code.
People shouldn't have to learn how to program to be able to use your tool.
You should only loosely tie your configuration data structures to the source of the information to make later architectural changes easier.
You really want an interface instead of direct access at the application level.
I sum this up in my Mastering Perl class by telling people that the first rule of programming is to create a situation where you do less work and people leave you alone. When you put configuration in code, you spend more time dealing with installation issues and responding to breakages. Unless you like that sort of thing, give people a way to change the settings without causing you more work.
$CONFIG{'unhappy_employee'} = `rm -rf /`
One major issue with this approach is that your config is not very portable. If a functionally identical tool were built in Java, loading configuration would have to be redone. If both the Perl and the Java variation used a simple key=value layout such as:
owner = "Peter Palfrader"
email = "peter#peter#palfrader.org"
...
they could share the config.
Also, calling eval on the config file seems to open this system up to attack. What could a malicious person add to this config file if they wanted to wreak some havoc? Do you realize that ANY arbitrary code in your config file will be executed?
Another issue is that it's highly counter-intuitive (at least to me). I would expect a config file to be read by some config loader, not executed as a runnable piece of code. This isn't so serious but could confuse new developers who aren't used to it.
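For instance, a plain key=value file like the one sketched above can be read by an off-the-shelf loader instead of being executed. A minimal sketch using Config::Tiny (the file name is illustrative):
use Config::Tiny;

# caff.conf contains plain data, no executable code:
#   owner = Peter Palfrader
#   email = peter@palfrader.org
my $cfg = Config::Tiny->read('caff.conf')
    or die "Cannot read config: " . Config::Tiny->errstr;

my $owner = $cfg->{_}{owner};   # "_" is the root (sectionless) block
my $email = $cfg->{_}{email};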
Finally, while it's highly unlikely that the implementation of constructs like q{...} will ever change, if it did, configs written this way might stop working.
It's a bad idea to put configuration data in compiled code, because it can't be easily changed by the user. For scripts, just make sure it's separated entirely from the rest and document it nicely.
A reason I'm surprised no one mentioned yet is testing. When config is in the code you have to write crazy, contorted tests to be able to test safely. You can end up writing tests that duplicate the code they test which makes the tests nearly useless; mostly just testing themselves, likely to drift, and difficult to maintain.
Hand in hand with testing is deployment which was mentioned. When something is easy to test, it is going to be easy (well, easier) to deploy.
The main issue here is reusability in an environment where multiple languages are possible. If your config file is in language A, then you want to share this configuration with language B, you will have to do some rewriting.
This is even more complicated if you have more complex configurations (for example, the Apache config files) and are trying to figure out how to handle potential differences in data structures. If you use something like JSON, YAML, etc., parsers in each language will know how to map things onto that language's data structures.
The one major drawback of not keeping the configuration in a programming language is that you lose the ability to set config values from dynamic data.
I agree with Tim Anderson. Somebody here is confusing configuration in code with the configuration not being configurable. That is correct for compiled code, but not for scripts.
A Perl or Ruby file is read and interpreted, just as a YAML or XML file with configuration data is. I choose YAML because it is easier on the eye than code; grouping settings by environment (test, development, staging and production) in code would involve more... code.
As a side note, XML contradicts the "easy on the eye" point completely. I find it interesting that XML config is used so extensively with compiled languages.
Reason 1. Aesthetics. While no one gets harmed by bad smell, people tend to put effort into getting rid of it.
Reason 2. Operational cost. For a team of 5 this is probably ok, but once you have developer/sysadmin separation, you must hire sysadmins who understand Perl (which is $$$), or give developers access to production system (big $$$).
And to make matters worse you won't have time (also $$$) to introduce a configuration engine when you suddenly need it.
My main problem with configuration in the many small scripts I write is that they often contain login data (username and password or auth token) for a service I use. Then later, when the script gets bigger, I start versioning it and want to upload it to GitHub.
So before every commit I need to replace my configuration with some dummy values.
$CONFIG{'user'} = 'username';
$CONFIG{'password'} = '123456';
You also have to be careful that those values never slipped into your commit history at some point. This can get very annoying. Once you have gone through this one or two times, you will never again try to put configuration into code.
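A simple way around this is to keep the secrets out of the source entirely, for example by reading them from environment variables (the variable names below are made up for illustration):
# Nothing sensitive left to scrub from the repository:
my $user     = $ENV{MYAPP_USER}     // die "MYAPP_USER not set\n";
my $password = $ENV{MYAPP_PASSWORD} // die "MYAPP_PASSWORD not set\n";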
Excuse the long code listing. Below is a handy Conf.pm module that I have used in many systems; it allows you to specify different variables for production, staging and dev environments. I then build my programs either to accept the environment parameter on the command line, or I store this file outside of the source control tree so that it never gets overwritten.
The AUTOLOAD provides automatic methods for variable retrieval.
# Instructions:
#   use Conf;
#   my $c = Conf->new("production");
#   print $c->root_dir;
#   print $c->log_dir;
package Conf;
use strict;
our $AUTOLOAD;

my $default_environment = "production";

my @valid_environments = qw(
    development
    production
);

#######################################################################################
# You might need to change this.
sub set_vars {
    my ($self) = @_;

    $self->{"access_token"} = 'asdafsifhefh';

    if ( $self->env eq "development" ) {
        $self->{"root_dir"}    = "/Users/patrickcollins/Documents/workspace/SysG_perl";
        $self->{"server_base"} = "http://localhost:3000";
    }
    elsif ( $self->env eq "production" ) {
        $self->{"root_dir"}    = "/mnt/SysG-production/current/lib";
        $self->{"server_base"} = "http://api.SysG.com";
        $self->{"log_dir"}     = "/mnt/SysG-production/current/log";
    }
    else {
        die "No environment defined\n";
    }

    #######################################################################################
    # You shouldn't need to configure this.

    # More dirs. Move these into the dev/prod sections if they're different per env.
    my $r = $self->{'root_dir'};
    my $b = $self->{'server_base'};
    $self->{"working_dir"} ||= "$r/working";
    $self->{"bin_dir"}     ||= "$r/bin";
    $self->{"log_dir"}     ||= "$r/log";

    # Other URLs. Move these into the dev/prod sections if they're different per env.
    $self->{"new_contract_url"}  = "$b/SysG-training-center/v1/contract/new";
    $self->{"new_documents_url"} = "$b/SysG-training-center/v1/documents/new";
}

#######################################################################################
# Code, don't change below here.
sub new {
    my ($class, $env) = @_;
    my $self = {};
    bless($self, $class);
    if ($env) {
        $self->env($env);
    }
    else {
        $self->env($default_environment);
    }
    $self->set_vars;
    return $self;
}

sub AUTOLOAD {
    my ($self, $val) = @_;
    my $type = ref($self) || die "$self is not an object";
    my $field = $AUTOLOAD;
    $field =~ s/.*://;
    # print "field: $field\n";
    unless (exists $self->{$field} || $field =~ /DESTROY/) {
        die "ERROR: {$field} does not exist in object/class $type\n";
    }
    $self->{$field} = $val if ($val);
    return $self->{$field};
}

sub env {
    my ($self, $in) = @_;
    if ($in) {
        die("Invalid environment $in") unless grep { $_ eq $in } @valid_environments;
        $self->{"_env"} = $in;
    }
    return $self->{"_env"};
}

1;

Efficient Way To Check 10,000s of Blog Feeds in Perl

We have 10,000s of blogs we want to check multiple times a day for new posts. I'd love some ideas with example code on the most efficient way to do this using Perl.
Currently we are just using LWP::UserAgent to download each RSS feed and then checking each URL in the resulting feed against a MySQL database table of already-found URLs, one at a time. Needless to say, this doesn't scale well and is super inefficient.
Thanks in advance for your help & advice!
Unfortunately, there is probably no other way than to do some kind of polling.
Luckily, implementing the PubSubHubbub protocol can greatly reduce the amount of polling for the feeds that support it.
For the feeds that don't support PubSubHubbub, you'll have to make sure you use HTTP-level mechanisms (like ETags or If-Modified-Since headers) to know if/when a resource has been updated.
Also make sure you implement some kind of back-off mechanisms.
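A rough sketch of such a conditional request with LWP::UserAgent (here $feed_url, $etag and $last_modified are assumed to be loaded from, and saved back to, your database for each feed):
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 20 );

my %req_headers;
$req_headers{'If-None-Match'}     = $etag          if defined $etag;
$req_headers{'If-Modified-Since'} = $last_modified if defined $last_modified;

my $res = $ua->get($feed_url, %req_headers);

if ($res->code == 304) {
    # Feed unchanged since the last poll -- nothing to parse.
}
elsif ($res->is_success) {
    $etag          = $res->header('ETag');
    $last_modified = $res->header('Last-Modified');
    # parse $res->decoded_content for new entries ...
}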
Perhaps look at AnyEvent::Feed. It is asynchronous (using the AnyEvent event loop), with configurable polling intervals, built-in support for 'seen' articles, and support for both RSS and Atom feeds. You could create a single process polling every feed, or multiple processes polling different sections of your feed list.
From the synopsis:
use AnyEvent;
use AnyEvent::Feed;

my $feed_reader =
    AnyEvent::Feed->new(
        url      => 'http://example.com/atom.xml',
        interval => $seconds,

        on_fetch => sub {
            my ($feed_reader, $new_entries, $feed, $error) = @_;

            if (defined $error) {
                warn "ERROR: $error\n";
                return;
            }

            for (@$new_entries) {
                my ($hash, $entry) = @$_;
                # $hash is a unique hash describing the $entry
                # $entry is the XML::Feed::Entry object of the new entries
                # since the last fetch.
            }
        }
    );
Seems like two questions rolled into one: fetching and comparing. Others have answered the fetch part. As for comparing:
I've been reading about redis lately and it seems like a good fit for you, as it can do a lot of simple operations per second (let's say ~80k/s). So checking whether you already have a URL should go really fast. Never actually used it though ;)
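A rough sketch of that "have we seen this URL?" check with the Redis CPAN module (the key name and @entry_urls are illustrative; it assumes a redis server running locally):
use Redis;

my $redis = Redis->new;   # defaults to 127.0.0.1:6379

for my $url (@entry_urls) {
    # SADD returns 1 if the member was newly added, 0 if it was already there.
    if ($redis->sadd('seen_urls', $url)) {
        # new post -- process it
    }
}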
An idea: have you tried comparing sizes before parsing the RSS? It might save you some time if the feeds change infrequently.
10,000 is not that many.
You could probably handle them using some simple approach like forking some worker processes that get RSS URLs from the database, fetch them and update the database:
my @pid;
for (1..$n) {
    my $pid = fork;
    if (!$pid) {
        defined $pid or die "fork failed";
        my $db = open_db();
        while (1) {
            my $url = get_next_url($db) or last;
            my $rss = feed_rss($url);
            update_rss($db, $rss);
        }
        exit(0);
    }
    push @pid, $pid;   # parent: remember the worker pid
}
wait_for_workers(@pid);
That is, assuming you are not able to use one of the existing applications already pointed out by other responders.

How can I fix the "Couldn't create file parser context for file ..." bug with Perl libxml on Debian?

When I try to read an XML file with XML::Simple, sometimes I get this error message:
Couldn't create file parser context for file ...
After some googling, it seems to be a problem with libxml-libxml-perl and is supposed to be fixed in the version I use (1.59-2).
Any ideas?
Edit: (code)
sub Read
{
    my ($file, $no_option) = @_;
    my %XML_INPUT_OPTIONS = ( KeyAttr => [], ForceArray => 1 );

    if ((defined $file) && (-f $file))
    {
        my @stats = stat($file);
        if ((defined $XML_CACHE{$file})
            && ($stats[9] == $XML_CACHE{$file}{modif_time}))
        {
            return ($XML_CACHE{$file}{xml});
        }
        else
        {
            my $xml = eval { XMLin($file,
                (defined $no_option ? () : %XML_INPUT_OPTIONS)) };
            AAT::Syslog("AAT::XML", "XML_READ_ERROR", $@) if ($@);
            $XML_CACHE{$file}{modif_time} = $stats[9];
            $XML_CACHE{$file}{xml}        = $xml;
            return ($xml);
        }
    }
    return (undef);
}
And yes, I should & will use XML::Simple cache feature...
Does the error continue with "No such file or directory at..."? If so, then I think the problem is that (for whatever reason) when you get to that point in the script, whatever you are passing to XML::Simple has no XML file in it. Long story short, the script you are using may be passing a bad variable (blank? empty?) to XML::Simple, at which point the module chokes. To debug, add a check on whatever you hand to XML::Simple before you pass it along. (See the next paragraph for a concrete example explaining why I think this may be your problem.)
A few months ago, I had a similar problem with Weather::Google. In a nutshell, the weather module was trying to get data from Google via LWP::Simple without a user agent. Google began (apparently) to reject requests without a user agent. I had to backtrack through the modules because the error appeared to come from XML::Simple. In fact, it was caused by what was done in LWP::Simple and Weather::Google. Or rather, the error was a result of Weather::Google not checking the data that was in an object created via LWP::Simple. In a case like this, it can be hard at first to see what's going wrong and where.
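A minimal sketch of that kind of check, reusing the variables from the Read() function above, so a bad path fails with a clearer message than the parser-context error:
die "No file given\n"            unless defined $file;
die "'$file' missing or empty\n" unless -f $file && -s _;

my $xml = eval { XMLin($file, %XML_INPUT_OPTIONS) };
die "XMLin failed for '$file': $@\n" if $@;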