I'm new to Perl's XML::SAX and I've run into a problem with the characters event. I'm trying to parse a very large XML file using Perl.
My goal is to get the content of each tag. I do not know the tag names in advance: given any XML file, I should be able to work out the record pattern and return every record with its tag and data, like Tag:Data.
With small files, everything works. But on a large file, the characters() event delivers only part of the content. There is no specific pattern to where it cuts off: sometimes it's the first few characters of the data, sometimes the last few, and sometimes just one letter of the actual data.
The SAX parser is:
$myhandler = MyFilter->new();
$parser = XML::SAX::ParserFactory->parser(Handler => $myhandler);
$parser->parse_file($filename);
I have written my own handler called MyFilter, overriding the parser's characters method:
sub characters {
    my ($self, $element) = @_;
    $globalvar = $element->{Data};
    print "content is: $globalvar \n";
}
Even this print statement shows the values only partially at times.
I also tried setting the parser package before calling $parser->parse():
$XML::SAX::ParserPackage = "XML::SAX::ExpatXS";
It still doesn't work. Could anyone help me out here? Thanks in advance!
Sounds like you need XML::Filter::BufferText.
http://search.cpan.org/dist/XML-Filter-BufferText/BufferText.pm
From the description "One common cause of grief (and programmer error) is that XML parsers aren't required to provide character events in one chunk. They can, but are not forced to, and most don't. This filter does the trivial but oft-repeated task of putting all characters into a single event."
It's very easy to use once you have it installed and will solve your partial character data problem.
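To illustrate what such a filter does, here is a minimal hand-rolled sketch of the same buffering idea. It is not the module's actual implementation: the package name BufferingHandler and its records method are made up for this example, and the SAX events are fired by hand rather than by a real parser, so the snippet runs without XML::SAX installed. It also ignores mixed content (text interleaved with child elements), which a real filter has to handle.

```perl
use strict;
use warnings;

package BufferingHandler;    # hypothetical name, for illustration only

sub new { return bless { buffer => '', records => [] }, shift }

# Each new element starts with an empty text buffer.
sub start_element {
    my ($self, $element) = @_;
    $self->{buffer} = '';
}

# The parser may call this several times for one text node,
# so append to the buffer instead of overwriting it.
sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data};
}

# Only when the element closes is its text known to be complete.
sub end_element {
    my ($self, $element) = @_;
    push @{ $self->{records} }, "$element->{Name}:$self->{buffer}"
        if $self->{buffer} =~ /\S/;
    $self->{buffer} = '';
}

sub records { return @{ $_[0]{records} } }

package main;

# Simulate a parser that delivers "Hello World" in three chunks.
my $handler = BufferingHandler->new;
$handler->start_element({ Name => 'title' });
$handler->characters({ Data => $_ }) for 'Hel', 'lo Wor', 'ld';
$handler->end_element({ Name => 'title' });

print "$_\n" for $handler->records;
```

With the real module, the filter sits between the parser and your handler: roughly, XML::Filter::BufferText->new(Handler => $myhandler) is passed as the parser's Handler, and your characters method then receives each text node in one piece.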
I have been trying to use a Perl subroutine's return value, together with string substitution, to get the required part of a string from the randomips subroutine in exim.conf. However, when I use string substitution I get an error (shown below).
Here is what I am trying to achieve:
I am trying to split the string on the colon and use the first part as "interface". I'll be using the second part as "helo_data".
exim.pl
sub randomhosts {
    @inet = ("x.x.x.1:hostname1.domain.com","x.x.x.2:hostname2.domain.com","x.x.x.3:hostname3.domain.com"
    );
    return $inet[int rand($#inet+1)];
}
exim.conf
dkim_remote_smtp:
driver = smtp
interface = "${perl{randomhosts}%:*}"
helo_data = "${sender_address_domain}"
The error I get is as follows:
"failed to expand "interface" option for dkim_remote_smtp transport: missing '}' after 'perl'".
Probably the syntax.
Any help?
The code that you are trying to copy was written by someone who doesn't know much about Perl. It includes this line:
return $inet[int rand($#inet+1)];
A Perl programmer would write this as
return $inet[rand @inet];
I think there are a couple of issues here - one with your Exim syntax and one with your Perl syntax.
Exim is giving you this error:
failed to expand "interface" option for dkim_remote_smtp transport: missing '}' after 'perl'
I don't know anything about calling Perl from Exim, but this page mentions a syntax like ${perl{foo}} (which is similar to the one used in the page you are copying from) and one like ${perl{foo}{argument}} for calling a subroutine and passing it an argument. Nowhere does it mention syntax like yours:
${perl{randomhosts}%:*}
I'm not sure where you have got that syntax from, but it seems likely that this is what is causing your first error.
In a comment, you say
I am trying to get the first part of the string before the colon for each random array value for "interface" and the part after the colon for "helo_data"
It seems to me that Exim doesn't support this requirement. You would need to call the function twice to get the two pieces of information that you require. You might be able to do this in the Perl using something like state variables - but it would be far more complex than the code you currently have there.
Secondly, your Perl code has a syntax error, so even if Exim was able to call your code, it wouldn't work.
The code you're copying sets up @inet like this:
@inet = ("x.x.x.1", "x.x.x.2", "x.x.x.3", "x.x.x.4");
Your equivalent code is this:
@inet = (
"x.x.x.1:hostname1.domain.com",
"x.x.x.2:hostname2.domain.com,
x.x.x.3:hostname3.domain.com
);
I've reformatted it, to make the problems more obvious. You are missing a number of quote marks around the elements of the array. (Note: I see that while I have been writing this answer, you have fixed that.)
Update: Ok, here is some code to put into exim.pl that does what you want.
use feature qw[state];
sub randomhosts {
    state $current;
    my @inet = (
        "x.x.x.1:hostname1.domain.com",
        "x.x.x.2:hostname2.domain.com",
        "x.x.x.3:hostname3.domain.com"
    );
    if ($_[0] eq 'generate') {
        shift;
        @{$current}{qw[ip host]} = split /:/, $inet[rand @inet];
    }
    return $current->{$_[0]};
}
It generates a new ip/host pair if its first argument is 'generate'. It will then return either the hostname or the ip address from the generated pair. I think you can probably call it from your Exim config file like this:
dkim_remote_smtp:
driver = smtp
interface = "${perl{randomhosts}{generate}{ip}}"
helo_data = "${perl{randomhosts}{host}}"
But I'm no expert in Exim, so that syntax might need tweaking.
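One way to check the subroutine before touching the Exim configuration is to call it from a plain Perl script in the order the transport is expected to use it. The sketch below assumes that Exim expands interface before helo_data, which I have not verified; the guard on @_ and the initial empty hash are small additions so that it runs cleanly under warnings.

```perl
use strict;
use warnings;
use feature qw(state);

sub randomhosts {
    state $current = {};    # remembered between calls
    my @inet = (
        "x.x.x.1:hostname1.domain.com",
        "x.x.x.2:hostname2.domain.com",
        "x.x.x.3:hostname3.domain.com",
    );
    # 'generate' picks a fresh ip/host pair before the lookup.
    if (@_ && $_[0] eq 'generate') {
        shift;
        @{$current}{qw(ip host)} = split /:/, $inet[rand @inet];
    }
    return $current->{ $_[0] };
}

# Same order as the two config lines: first the interface, then helo_data.
my $ip   = randomhosts('generate', 'ip');
my $host = randomhosts('host');

print "interface = $ip\n";
print "helo_data = $host\n";
```

Both values come from the same randomly chosen entry, because the second call only reads what the first call stored.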
First I would like to note I have not worked with exim so I cannot say what exactly you are trying to do and why you have done things exactly so.
In the link you posted, a method called 'randinet' is added to exim.pl and the interface line in exim.conf is replaced by
interface = "${perl{randinet}}"
You have implemented a 'randomhosts' method and replaced the interface line with
interface = "${perl{randomhosts}%:*}"
Now the parser complains about not finding the closing bracket. That is likely due to the symbols you felt free to add but the parser does not have the freedom to ignore.
I suggest you try
interface = "${perl{randomhosts}}"
I am trying to help out a client who was unhappy with an EMR (Electronic Medical Records) system and wanted to switch but the company said they couldn't extract patient demographic data from the database (we asked if they can get us name, address, dob in a csv file of some sort, very basic stuff) - yet they claim they couldn't do that. (crazy considering they are using a sql database).
Anyway - the way they handed over the patients were in xml files and there are about 40'000+ of them. But they contain a lot more than the demographics.
After doing some research and having done extensive Perl programming 15 years ago (I admit it got rusty over the years) - I thought this should be a good task to get done in Perl - and I came across the XML::Twig module which seems to be able to do the trick.
Unfortunately the xml code that is of interest looks like this:
<==snip==>
<patient extension="Patient ID Number"> <!-- Patient ID is a 5-digit number -->
<name>
<family>Patient Family name</family>
<given>Patient First/Given name</given>
<given>Patient Middle Initial</given>
</name>
<birthTime value="YEARMMDD"/>
more fields for address etc.are following in the xml file.
<==snip==>
Here is what I coded:
my $twig=XML::Twig->new( twig_handlers => {
'patient/name/family' => \&get_family_name,
'patient/name/given' => \&get_given_name
});
$twig->parsefile('test.xml');
my @fields;
sub get_family_name {my($twig,$data)=@_;$fields[0]=$data->text;$twig->purge;}
sub get_given_name {my($twig,$data)=@_;$fields[1]=$data->text;$twig->purge;}
I have no problems reading out all the information that has unique tags (family, city, zip code, etc.) but XML::Twig only returns the middle initial for the given tag.
How can I address the first occurrence of "given" and assign it to $fields[1] and the second occurrence of "given" to $fields[2] for instance - or chuck the middle initial.
Also how do I extract the "Patient ID" or the "birthTime" value with XML::Twig - I couldn't find a reference to that.
I tried using $data->findvalue('birthTime') but that came back empty.
I looked at: Perl, XML::Twig, how to reading field with the same tag which was very helpful but since the duplicate tags are in the same path it is different and I can't seem to find an answer. Does XML::Twig only return the last value found when finding a match while parsing a file? Is there a way to extract all occurrences of a value?
Thank you for your help in advance!
It is very easy to assume from the documentation that you're supposed to use callbacks for everything. But it's just as valid to parse the whole document and interrogate it in its entirety, especially if the data size is small.
It's unclear from your question whether each patient has a separate XML file to themselves, and you don't show what encloses the patient elements, but I suggest that you use a compromise approach and write a handler for just the patient elements which extracts all of the information required.
I've chosen to build a hash of information %patient out of each patient element and push it onto an array @patients that contains all the data in the file. If you have only one patient per file then this will need to be changed.
I've resolved the problem with the name/given elements by fetching all of them and joining them into a single string with intervening spaces. I hope that's suitable.
This is completely untested as I have only a tablet to hand at present, so beware. It does stand a chance of compiling, but I would be surprised if it has no bugs.
use strict;
use warnings 'all';
use XML::Twig;
my @patients;
my $twig = XML::Twig->new(
    twig_handlers => { patient => \&get_patient }
);
$twig->parsefile('test.xml');
sub get_patient {
    my ($twig, $pat) = @_;
    my %patient;
    $patient{id} = $pat->att('extension');
    my $name = $pat->first_child('name');
    $patient{family} = $name->first_child_trimmed_text('family');
    $patient{given} = join ' ', $name->children_trimmed_text('given');
    $patient{dob} = $pat->first_child('birthTime')->att('value');
    push @patients, \%patient;
}
I only want to parse one element of interest in the XML (e.g. the class element whose name equals math) and I want to stop as soon as the first element meeting this condition has been parsed. Since there is only one class whose name is math, it is unnecessary to continue once that element has been found.
However, if I implement it as follows, the code continues to read the whole file after it has found the element I am interested in (the XML file is very long, so this takes a long time). My question is: how do I stop it once the first class element with name = math is parsed?
my $twig = XML::Twig->new(TwigRoots => {"class[\@name='math']" => \&class});
$twig->parsefile( shift @ARGV );
Besides, I also want to delete this class from the XML file (not only from memory) after it is parsed, so that next time, when parsing a class with another name, this class element will not be parsed. Is it possible to do that?
It seems what you're looking for are XML::Twig's finish_print and finish_now :
finish_print
Stops twig processing, flush the twig and proceed to finish printing the document as fast as possible. Use this method when modifying a document and the modification is done.
finish_now
Stops twig processing, does not finish parsing the document (which could actually be not well-formed after the point where finish_now is called). Execution resumes after the parse or parsefile call. The content of the twig is what has been parsed so far (all open elements at the time finish_now is called are considered closed).
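As a rough sketch of how finish_now might be used for the math class case: the XML below and the second handler are invented purely to show that nothing after the matching element is handled, and the snippet assumes XML::Twig is installed.

```perl
use strict;
use warnings;
use XML::Twig;

# Made-up sample document; the real file would be read with parsefile.
my $xml = <<'XML';
<school>
  <class name="art"><student>Ann</student></class>
  <class name="math"><student>Bob</student></class>
  <class name="music"><student>Cid</student></class>
</school>
XML

my @students;
my $seen_after = 0;

my $twig = XML::Twig->new(
    twig_handlers => {
        'class[@name="math"]' => sub {
            my ($t, $class) = @_;
            @students = map { $_->text } $class->children('student');
            $t->finish_now;    # abandon the rest of the document
        },
        # Only here to show that later elements are never handled.
        'class[@name="music"]' => sub { $seen_after++ },
    },
);
$twig->parse($xml);

print "math students: @students\n";
print "later classes handled: $seen_after\n";
```

Execution continues after the parse call, so whatever the handler collected is available there. Removing the element from the file on disk is a separate job: it would mean writing out a modified copy, which is what finish_print is aimed at.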
I'm in a web scripting class, and honestly and unfortunately, it has come second to my networking and design and analysis classes. Because of this I find I encounter problems that may be mundane but can't find the solution to it easily.
I am writing a CGI form that is supposed to work with a MySQL DB. I can insert and delete into the DB just fine. My problem comes when querying the DB.
My code compiles fine and I don't get errors when trying to "display" the info in the DB through the browser, but the data and text don't in fact display. The code in question is here:
print br, 'test';
my $dbh = DBI->connect("DBI:mysql:austinc4", "*******", "*******", {RaiseError => 1} );
my $usersstatement = "select * from users";
my $projstatment = "select * from projects";
# Get the handle
my $userinfo = $dbh->query($usersstatement);
my $projinfo = $dbh->query($projstatement);
# Fetch rows
while (@userrow = $userinfo->fetchrow()) {
print $userrow[0], br;
}
print 'end';
This code is in an if statement that is surrounded by the print header, start_html, form, /form, end_html. I was just trying to debug and find out what was happening and printed the statements test and end. It prints out test but doesn't print out end. It also doesn't print out the data in my DB, which happens to come before I print out end.
What I believe I am doing is:
Connecting to my DB
Forming a string that contains the command/request to the DB
Getting a handle for my query I perform on the DB
Fetching a row from my handle
Printing the first field in the row I fetched from my table
But I don't see why my data wouldn't print out as well as the end text. I looked in DB and it does in fact contain data in the DB and the table that I am trying to get data from.
This one has got me stumped, so I appreciate any help. Thanks again. =)
Solution:
I was using a method that wasn't supported by the modules I was including. This leads me to another question. How can I detect errors like this? My program does in fact compile correctly and the webpage doesn't "break". Aside from double-checking that all the methods I use are valid, do I just see something like text not being displayed and assume that an error like this occurred?
Upon reading the comments: your program is broken because DBI database handles have no query() method, so that call does not execute an SQL query. Unless query() is a wrapper you have defined elsewhere, you are calling an undefined method and the script dies at that point. With DBI, the usual sequence is prepare(), then execute(), then a fetch method such as fetchrow_array().
Here is my original posting of helpful hints, which still apply:
1. I hope you have use CGI, use DBI, etc., as well as use CGI::Carp and use strict.
2. Look in /var/log/apache2/access.log or error.log for the errors.
3. Realize that the first thing a CGI script prints MUST be a valid header or the web server and browser become unhappy and often nothing else displays.
4. Because of #3, print the header first, BEFORE you do anything else, and especially before you connect to the database, where the script may die or print something else; otherwise the errors or other messages will be emitted before the header.
5. If you still don't see an error, go back to #2.
6. CGIs that use CGI.pm can be run from a command line in a terminal session without going through the webserver. This is also a good way to debug.
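On the question of how to detect this kind of error: Perl looks up method names at run time, so a call to a method that does not exist compiles fine and only dies when that line is reached. A minimal sketch of catching it follows. FakeHandle is a made-up stand-in for a database handle so that the example runs without a database; like a DBI handle, it has prepare() but no query().

```perl
use strict;
use warnings;

# FakeHandle is hypothetical: it only mimics the fact that a
# database handle has prepare() but no query().
package FakeHandle;
sub new     { return bless {}, shift }
sub prepare { return "prepared: $_[1]" }

package main;

my $dbh = FakeHandle->new;

# can() reports whether a method exists without calling it.
print "query supported? ", ($dbh->can('query') ? "yes" : "no"), "\n";

# eval traps the run-time failure, so the script can report it
# instead of silently stopping halfway through the page.
my $result = eval { $dbh->query('select * from users') };
my $error  = $@;
print "caught: $error" if $error;
```

In a CGI script the same message normally ends up in the web server's error log. Loading CGI::Carp with fatalsToBrowser is a common way to see it in the browser while debugging.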
I've started a little pet project to parse log files for Team Fortress 2. The log files have an event on each line, such as the following:
L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")
Notice there are some common parts of the syntax for log files. Names, for example consist of four parts: the name, an ID, a Steam ID, and the team of the player at the time. Rather than rewriting this type of regular expression, I was hoping to abstract this out slightly.
For example:
my $name = qr/(.*)<(\d+)><(.*)><(Red|Blue)>/;
my $kill = qr/"$name" killed "$name"/;
This works nicely, but the regular expression now returns results that depend on the format of $name (breaking the abstraction I'm trying to achieve). The example above would match as:
my ($name_1, $id_1, $steam_1, $team_1, $name_2, $id_2, $steam_2, $team_2)
But I'm really looking for something like:
my ($player1, $player2)
Where $player1 and $player2 would be tuples of the previous data. I figure the "killed" event doesn't need to know exactly about the player, as long as it has information to create the player, which is what these tuples provide.
Sorry if this is a bit of a ramble, but hopefully you can provide some advice!
I think I understand what you are asking. What you need to do is reverse your logic. First use a regex to split the string into two parts, then extract your tuples from each part. That way the outer regex doesn't need to know about the name format, and you just have two generic player-parsing regexes. Here is a short example:
#!/usr/bin/perl
use strict;
use Data::Dumper;
my $log = 'L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle" (customkill "headshot") (attacker_position "1848 813 94") (victim_position "1483 358 221")';
my ($player1_string, $player2_string) = $log =~ m/(".*") killed (".*?")/;
my @player1 = $player1_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
my @player2 = $player2_string =~ m/(.*)<(\d+)><(.*)><(Red|Blue)>/;
print STDERR Dumper(\@player1, \@player2);
Hope this is what you were looking for.
Another way to do it, but the same strategy as dwp's answer:
my @players =
map { [ /(.*)<(\d+)><(.*)><(Red|Blue)>/ ] }
$log_text =~ /"([^\"]+)" killed "([^\"]+)"/
;
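If named fields are more convenient than positional tuples, the same two-step strategy can be wrapped in a small helper. This is only a sketch: parse_player and the hash keys (name, id, steam_id, team) are names chosen for this example, not part of any existing module.

```perl
use strict;
use warnings;

# Pattern for one player string: name<id><steam id><team>
my $player_re = qr/^(.*)<(\d+)><(.*)><(Red|Blue)>$/;

# Returns a hash reference describing the player,
# or nothing if the text does not look like a player.
sub parse_player {
    my ($text) = @_;
    my @fields = $text =~ $player_re or return;
    my %player;
    @player{qw(name id steam_id team)} = @fields;
    return \%player;
}

my $log = 'L 10/23/2009 - 21:03:43: "Mmm... Cycles!<67><STEAM_0:1:4779289><Red>" killed "monkey<77><STEAM_0:0:20001959><Blue>" with "sniperrifle"';

# The event regex only knows that two quoted strings surround "killed".
my ($attacker, $victim) =
    map { parse_player($_) } $log =~ /"([^"]+)" killed "([^"]+)"/;

print "$attacker->{name} ($attacker->{team}) killed ",
      "$victim->{name} ($victim->{team})\n";
```

The event pattern stays independent of the player format, so other events such as suicides or captures could reuse parse_player unchanged.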
Your log data contains several items of balanced text (quoted and parenthesized), so you might consider Text::Balanced for parts of this job, or perhaps a parsing approach rather than a direct attack with regex. The latter might be fragile if the player names can contain arbitrary input, for example.
Consider writing a Regexp::Log subclass.