Which syntax is better XML::Simple or XML::Twig - perl

I was running a Perl script and I encountered the following result, instead of the answer I expected.
input HASH(0x17268bb0)
input HASH(0x172b3300)
input HASH(0x172b32a0)
Can anyone say what this is and how to rectify it?
This is my XML file here
<Root>
<Top name="ri_32">
<Module name="ALU">
<input name="power_control_bus"/>
<bidirection name="address_bus"/>
</Module>
<Module name="Power_control">
<input name="cpu_control_bus"/>
<output name="power_control_bus"/>
<bidirection name="address_bus"/>
</Module>
<input name="address"/>
<input name="clock"/>
<input name="data_in"/>
<output name="data_out"/>
<bidirection name="control"/>
</Top>
</Root>
I'm writing a Perl script which can be converted into a specific requirement (.v, .sv file)
use strict;
use XML::Simple;
use Data::Dumper;
my $xml_root = XMLin( './simodule.xml' );
my $root_top = $xml_root->{Top};
my $mod = $root_top->{Module};
print "Top $root_top->{name}\n";
my $top_in = $root_top->{input};
foreach my $namein ( keys %$top_in ) {
print " input $top_in->{$namein}\n";
}
my $top_ou = $root_top->{output};
foreach my $nameou ( keys %$top_ou ) {
print " output $top_ou->{$nameou}\n";
}
my $top_bi = $root_top->{bidirection};
foreach my $namebi ( keys %$top_bi ) {
print " bidirection $top_bi->{$namebi}\n";
}
output:
Top risc_32
input HASH(0x172b3300)
input HASH(0x172b32a0)
input HASH(0x17268bb0)
output data_out
bidirection control
Expected output
input address
input clock
input data_in
output data_out
bidirection control

You've made your task more difficult for yourself by using one of the most deceitful modules on CPAN. XML::Simple isn't simple.
But it's docs also suggest not using it:
Why is XML::Simple "Discouraged"?
So - how about instead, XML::Twig:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
#$twig now contains our XML data structure.
my $twig = XML::Twig->new->parsefile('simodule.xml');
#fetch a value with an xpath expression - ./Top
#then extract the attribute 'name' from this node.
print "Top ", $twig->get_xpath( './Top', 0 )->att('name'), "\n";
#iterate all 'input' elements beneath "Top":
#note - single argument to "get_xpath" means all of them in a list.
foreach my $input ( $twig->get_xpath('./Top/input') ) {
#retrieve from each their name attribute (and print)
print "input ", $input->att('name'), "\n";
}
#locate the 'output' and 'bidirection' nodes within the tree, and fetch
#their name attribute.
print "output ", $twig -> get_xpath( './Top/output',0) -> att('name'),"\n";
print "bidirection ", $twig -> get_xpath( './Top/bidirection',0) -> att('name'),"\n";
We use XML::Twig which makes use of get_xpath to specify an XML path. We also use att to retrieve a named attribute. You could use iterators such as first_child and children if you prefer though:
#Top element is below the root - we create a reference to it $top
my $top = $twig->root->first_child('Top');
#From this reference, fetch the name attribute.
print "Top ", $top->att('name'), "\n";
#get children of Top matching 'input' and iterate
foreach my $input ( $top -> children('input') ) {
#print attribute called 'name'.
print "input ", $input->att('name'), "\n";
}
#Find a child below Top called 'output' and retrieve 'name' attribute.
print "output ", $top -> first_child('output') -> att('name'),"\n";
#as above.
print "bidirection ", $top -> first_child('bidirection') -> att('name'),"\n";
These are doing the same thing - personally I like xpath as a way of navigating XML but that's a matter of taste. (It lets you do all sorts of things like specify a path with embedded attributes, that kind of thing - moot point in this example though).
Given your input XML, both produce:
Top ri_32
input address
input clock
input data_in
output data_out
bidirection control
it skips the nested hashes for ALU and Power_Control because your original code appears to.

You still haven't been at all clear about exactly what output you want. But I use XML::LibXML for most of my XML processing requirements and I'd write something like this:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use XML::LibXML;
my $parser = XML::LibXML->new();
my $doc = $parser->parse_file('simodule.xml');
foreach my $type (qw[input output bidirection]) {
foreach ($doc->findnodes("/Root/Top/$type")) {
say $_->nodeName, ' ', $_->getAttribute('name');
}
}

The output is correct.
As we don't know what exactly you need,
I modify your code to following so maybe you can figure out why you see the HASH and how to de-reference it by yourself, it's pretty simple:
use strict;
use XML::Simple;
use Data::Dumper;
local $/;
my $xml_root = XMLin(<DATA>);
print Dumper $xml_root;
my $root_top=$xml_root->{Top};
my $mod=$root_top->{Module};
print "Top $root_top->{name}\n";
my $top_in=$root_top->{input};
foreach my $namein (keys %$top_in)
{
print " input" , Dumper $top_in->{$namein};
}
my $top_ou=$root_top->{output};
foreach my $nameou (keys %$top_ou)
{
print " output $top_ou->{$nameou}\n";
}
my $top_bi=$root_top->{bidirection};
foreach my $namebi (keys %$top_bi)
{
print " bidirection $top_bi->{$namebi}\n";
}
__DATA__
<Root>
<Top name="ri_32">
<Module name="ALU">
<input name="power_control_bus"/>
<bidirection name="address_bus"/>
</Module>
<Module name="Power_control">
<input name="cpu_control_bus"/>
<output name="power_control_bus"/>
<bidirection name="address_bus"/>
</Module>
<input name="address">X</input>
<input name="clock"/>
<input name="data_in"/>
<output name="data_out"/>
<bidirection name="control"/>
</Top>
</Root>

Related

How to get the text contents of an XML child element based on an attribute of its parent

This is my XML data
<categories>
<category id="Id001" name="Abcd">
<project> ID_1234</project>
<project> ID_5678</project>
</category>
<category id="Id002" name="efgh">
<project> ID_6756</project>
<project> ID_4356</project>
</category>
</categories>
I need to get the text contents of each <project> element based on the name attribute of the containing <category> element.
I am using Perl with the XML::LibXML module.
For example, given category name Abcd i should get the list ID_1234, ID_5678.
Here is my code
my $parser = XML::LibXML->new;
$doc = $parser->parse_file( "/cctest/categories.xml" );
my #nodes = $doc->findnodes( '/categories/category' );
foreach my $cat ( #nodes ) {
my #catn = $cat->findvalue('#name');
}
This gives me the category names in array #catn. But how can I get the text values of each project?
You haven't shown what you've tried so far, or what your desired output is so I've made a guess at what you're looking for.
With XML::Twig you could do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
foreach my $project ( $twig -> findnodes ( '//project' ) ) {
print join ",", (map { $project -> parent -> att($_) } qw ( id name )), $project -> text,"\n";
}
__DATA__
<categories>
<category id="Id001" name="Abcd">
<project> ID_1234</project>
<project> ID_5678</project>
</category>
<category id="Id002" name="efgh">
<project> ID_6756</project>
<project> ID_4356</project>
</category>
</categories>
Which produces:
Id001,Abcd, ID_1234,
Id001,Abcd, ID_5678,
Id002,efgh, ID_6756,
Id002,efgh, ID_4356,
It does this by using findnodes to locate any element 'project'.
Then extract the 'id' and 'name' attributes from the parent (the category), and print that - along with the text in this particular element.
xpath is a powerful tool for selecting data from XML, and with a more focussed question, we can give more specific answers.
So if you were seeking all the projects 'beneath' category "Abcd" you could:
foreach my $project ( $twig -> findnodes ( './category[#name="Abcd"]/project' ) ) {
print $project -> text,"\n";
}
This uses XML::LibXML, which is the library you're already using.
Your $cat variable contains an XML element object which you can process with the same findnodes() and findvalue() methods that you used on the top-level $doc object.
#!/usr/bin/perl
use strict;
use warnings;
# We use modern Perl here (specifically say())
use 5.010;
use XML::LibXML;
my $doc = XML::LibXML->new->parse_file('categories.xml');
foreach my $cat ($doc->findnodes('//category')) {
say $cat->findvalue('#name');
foreach my $proj ($cat->findnodes('project')) {
say $proj->findvalue('.');
}
}
You can try with XML::Simple
use strict;
use warnings;
use XML::Simple;
use Data::Dumper
my $XML_file = 'your XML file';
my $XML_data;
#Get data from your XML file
open(my $IN, '<:encoding(UTF-8)', $XML_file) or die "cannot open file $XML_file";
{
local $/;
$XML_data = <$IN>;
}
close($IN);
#Store XML data as hash reference
my $xmlSimple = XML::Simple->new(KeepRoot => 1);
my $hash_ref = $xmlSimple->XMLin($XML_data);
print Dumper $hash_ref;
The hash reference will be as below:
$VAR1 = {
'categories' => {
'category' => {
'efgh' => {
'id' => 'Id002',
'project' => [
' ID_6756',
' ID_4356'
]
},
'Abcd' => {
'id' => 'Id001',
'project' => [
' ID_1234',
' ID_5678'
]
}
}
}
};
Now to get data which you want:
foreach(#{$hash_ref->{'categories'}->{'category'}->{'Abcd'}->{'project'}}){
print "$_\n";
}
The result is:
ID_1234
ID_5678

Print attributes from two tags together

I am using XML:twig to extract some attributes from an XML file using Perl;
Here is my code
use XML::Twig;
my $file = $ARGV[0];
$file =~ /(.+)\.xml/;
my $outfile = $1 . ".snp" ;
open my $out, '>', $outfile or die "Could not open file '$outfile' $!";
my $twig = XML::Twig->new(
twig_handlers => {
'Rs/MergeHistory' => \&MergeHistory,
}
);
$twig -> parsefile( "$file");
sub MergeHistory {
my ($twig, $elt) = #_;
print $out "\t";
print $out "rs";
print $out $elt->att('rsId'), ",";
print $out "b";
print $out $elt->att('buildId'), ",";
}
This print the following results:
rs56546490,b130, rs386588736,b142
rs56546490,b130, rs386588736,b142
What I want is to print each MergeHistory rsId and buildId together as the following:
rs56546490,rs386588736, b130,b142
rs56546490,rs386588736, b130,b142
Here is a part of the XML file which contains on two MergeHistory tags :
<Rs rsId="98324" snpClass="snp" snpType="notwithdrawn"
molType="genomic" genotype="true"bitField="050028000005130500030100"
taxId="9606">
<Het type="est" value="0.05" stdError="0.1547"/>
<Validation byCluster="true" byOtherPop="true" byHapMap="true"
by1000G="true">
<otherPopBatchId>7179</otherPopBatchId>
</Validation>
<Create build="36" date="2000-09-19 17:02"/>
<Update build="144" date="2015-05-07 10:52"/>
<Sequence exemplarSs="491581208" ancestralAllele="C,C">
<Seq5>
ATAAGCAAATAACTGAAGTTTAATCAGTCTCCTCCCAGCAAGTGATATGCAACTGAGATTCC
TTATGACACATCTGAACACTAGTGGATTTGCTTTGTAGTAGGAACAA
GGTACATTCGCGGGATAAATGTGGCCAAGTTTTATCTGCTGCCAGGGCTTTCAAAT
AGGTTGACCTGACAATGGGTCACCTCTGGGACTGA</Seq5>
<Observed>C/T</Observed>
<Seq3>AATTAGGAAGAGCTGGTACCTAAAATGAAAGATGCCCTTAAATTTCAGATTCACAATTTTTT
TTTCTTAGTATAAGCATGTCCCATGTAATATCTGGGATATACTCATACCTT
TAAAAATGTGCTCATTGTTTATCTGAAATTCACATTTTAACAGGGAACCATTGT
TTTGTTATTGTTTATTGTTTTGTTTCTAAATAA</Seq3>
</Sequence>
<Ss ssId="1556770886" handle="1000GENOMES" batchId="1061891"
locSnpId="PHASE3_chrY_229259" subSnpClass="snp" orient="reverse"
strand="top" molType="genomic" buildId="144"
methodClass="sequence" validated="by-submitter">
<Sequence>
<Seq5>TTTTAGGTACCAGCTCTTCCTAATT</Seq5>
<Observed>A/G</Observed>
<Seq3>TCAGTCCCAGAGGTGACCCATTGTC</Seq3>
</Sequence>
</Ss>
<Assembly dbSnpBuild="144" genomeBuild="38.2"
groupLabel="GRCh38.p2" current="true" reference="true">
<Component componentType="contig" accession="NT_011875.13"
chromosome="Y" start="11642902" end="21789280"
orientation="fwd" gi="568801947" groupTerm="NC_000024.10"
contigLabel="GCF_000001405.28">
<MapLoc asnFrom="5341580" asnTo="5341580" locType="exact"
alnQuality="1" orient="reverse" physMapInt="16984482"
leftContigNeighborPos="5341579"rightContigNeighborPos="5341581"
refAllele="G"/>
</Component>
<SnpStat mapWeight="unique-in-contig" chromCount="1"
placedContigCount="1" unplacedContigCount="0" seqlocCount="1"
hapCount="0"/>
</Assembly>
<RsLinkout resourceId="1" linkValue="3894"/>
<RsLinkout resourceId="4" linkValue="60936"/>
<RsLinkout resourceId="5" linkValue="23388839"/>
<MergeHistory rsId="56546490" buildId="130"/>
<MergeHistory rsId="386588736" buildId="142"/>
<hgvs>NC_000024.9:g.19096363G>A</hgvs>
<hgvs>NC_000024.10:g.16984483G>A</hgvs>
<Frequency freq="0.0276" allele="A" sampleSize="1233"/>
</Rs>
twig_handlers is good for pre-processing XML, and most especially for discarding it as you go.
It's probably not what you want here though - it looks like what you're trying to do is:
extract each 'MergeHistory' element from each 'Rs' element.
Print the content reformatted.
So with that in mind - I think what you probably want is findnodes and children.
my $twig = XML::Twig->parsefile( $file );
foreach my $rs ( $twig->findnodes('//Rs') ) {
print join( ",",
map { "rs" . $_->att('rsId') } $rs->children('MergeHistory') ),
"\t";
print join( ",",
map { "b" . $_->att('buildId') } $rs->children('MergeHistory') ),
"\n";
}
Given your sample, this prints:
rs56546490,rs386588736 b130,b142
Which looks roughly what you wanted?
We use findnodes to iterate Rs elements.
Within each, we use children to fetch the MergeHistory elements.
map to extract the attribute and concat the b or rs string on the front.
join to merge it comma separated.
(you could still do the above with twig_handlers if you prefer, by firing the handler on "Rs" instead)

Perl XML lib get full xpath

Looking to return the full xpath from a general xpath that may grab multiple results.
The search string would be something general like this:
/myXmlPath/#myValue
The contained xml nodes might look something like this:
<myXmlPath someAttribute="false" myValue="">
<myXmlPath someAttribute="true" myValue="">
Perl code something like this:
use XML::LibXML;
use XML::XPath::XMLParser;
my $filepath = "c:\\temp\\myfile.xml";
my $parser = XML::LibXML->new();
$parser->keep_blanks(0);
my $doc = $parser->parse_file($filepath);
#myWarn = ('/myXmlPath/#myValue');
foreach(#myWarn) {
my $nodeset = $doc->findnodes($_);
foreach my $node ($nodeset->get_nodelist) {
my $value = $node->to_literal;
print $_,"\n";
print $value," - value \n";
print $node," - node \n";
}
}
I'd like to be able to evaluate the returned full path values from the xml. This code works fine when I'm using it to lookup general things in an xpath, but would be more ideal if I could get at other data from the nodeset result.
Like ikegami said, I'm not sure exactly what you're after so I've kind of produced a shotgun approach for everything I could interpret your question.
use strict;
use warnings;
use XML::LibXML;
use v5.14;
my $doc = XML::LibXML->load_xml(IO => *DATA);
say "Get the full path to the node";
foreach my $node ($doc->findnodes('//myXmlPath/#myValue')) {
say "\t".$node->nodePath();
}
say "Get the parent node of the attribute by searching";
foreach my $node ($doc->findnodes('//myXmlPath[./#myValue="banana"]')) {
say "\t".$node->nodePath();
my ($someAttribute, $myValue) = map { $node->findvalue("./$_") } qw (#someAttribute #myValue);
say "\t\tsomeAttribute: $someAttribute";
say "\t\tmyValue: $myValue";
}
say "Get the parent node programatically";
foreach my $attribute ($doc->findnodes('//myXmlPath/#myValue')) {
my $element = $attribute->parentNode;
say "\t".$element->nodePath();
}
__DATA__
<document>
<a>
<b>
<myXmlPath someAttribute="false" myValue="apple" />
</b>
<myXmlPath someAttribute="false" myValue="banana" />
</a>
</document>
Which would produce:
Get the full path to the node
/document/a/b/myXmlPath/#myValue
/document/a/myXmlPath/#myValue
Get the parent node of the attribute by searching
/document/a/myXmlPath
someAttribute: false
myValue: banana
Get the parent node programatically
/document/a/b/myXmlPath
/document/a/myXmlPath

CGI and Perl script one file, passing arguments

I have a script which fetches the summary file from the NCBI website using command line argument (accession number).
Example:
./efetch.pl NM_000040
Now I am trying to fetch the same file using a HTML webpage which takes the form request via a CGI script.
My question: Is it possible to combine the CGI and my Perl script in one file and pass the HTML form argument from the CGI portion of the code to the perl script in single run?
I have tried to do some scripting but it seems that the argument from the CGI is not getting passed to the Perl script.
Any help will be greatly appreciated.
CGI and Perl Script in one single file:
#!/usr/bin/perl -wT
use strict;
use warnings;
use LWP::Simple;
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);
################### Environmental Variables ###################
my ($buffer, #pairs, $pair, $name, $value, %FORM);
# Read in text
$ENV{'REQUEST_METHOD'} =~ tr/a-z/A-Z/;
if ($ENV{'REQUEST_METHOD'} eq "POST")
{
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
} else {
$buffer = $ENV{'QUERY_STRING'};
}
#print "$buffer\n";
# Split information into name/value pairs
#pairs = split(/&/, $buffer);
foreach $pair (#pairs) {
($name, $value) = split(/=/, $pair);
$value =~ tr/+/ /;
#$value =~ s/%(..)/pack("C", hex($1))/eg;
$FORM{$name} = $value;
}
my $access = $FORM{accession};
if ($access =~ m{\A(\w+\d+)\z}) {
$access = $1;
}
print "Content-type:text/html\r\n\r\n";
print "<html>";
print "<head>";
print "<title> CGI Program</title>";
print "</head>";
print "<body>";
if ($access eq "") {
print "<h2> Please check the accession number</h2>";
exit;
}
print "<h2>$access</h2>";
print "</body>";
print "</html>";
print <HEADING
<html>
<head>
<title> Output result of the program </title>
</head>
<body>
<h1> Summary result </h1>
<table border=1>
<tr>
<th>S.No.</th>
<th>Fragment</th>
<th>Position</th>
<th>Region</th>
<th>GC%</th>
</tr>
HEADING
;
######################## INPUT PARAMETERS #####################
my $utils = "http://www.ncbi.nlm.nih.gov/entrez/eutils";
my $db = "nuccore";
my $query = $access; #"$ARGV[0]" or die "Please provide input for the accession number. $!";
############### END OF INPUT PARAMETERS ######################
############### FILE DOWNLOAD FROM NCBI ######################
my $report = "gb"; # downloads the summary text file
open (IN,">", $query.".summary");
my $esearch = "$utils/esearch.fcgi?" . "db=$db&retmax=1&usehistory=y&term=";
my $esearch_result = get($esearch . $query);
$esearch_result =~ m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s;
my $Count = $1; my $QueryKey = $2; my $WebEnv = $3;
my $retstart; my $retmax=3;
for($retstart = 0; $retstart < $Count; $retstart += $retmax) {
my $efetch = "$utils/efetch.fcgi?" .
"rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" .
"db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
my $efetch_result = get($efetch);
print IN $efetch_result, "\n";
}
close (IN);
Print command in the perl script prints the $access but it fails to pass the value of $access to $query.
HTML form:
<form action="/cgi-bin/efetch.cgi" method="post" id="myform">
<div>
NCBI accession number:<label for="accession"> <input type="text" name="accession"> </label><br>
<input type="submit" value="Submit" form="myform">
</div>
</form>
Your script is much more complicated than it needs to be. Specifically - you're using the CGI module (which is deprecated, so you might want to consider something else*) but then you're trying to roll your own input handling in your script.
You can write a single script that sends 'POST' or 'GET' data to itself for processing. That's not too difficult at all.
A simple example might be
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
print "Content-Type: text/html\n\n";
my %param;
while ( <STDIN> ) {
my ( $key, $value ) = split ( "=" );
$param{$key} = $value;
}
print Dumper \%param;
print "<FORM METHOD=\"POST\">\n";
print " <INPUT TYPE=\"text\" NAME=\"access\">\n";
print " <INPUT TYPE=\"submit\">\n";
print "</FORM>\n";
This isn't a good example, but it'll work, and hopefully it'll give you an idea of what's going on - POSTed stuff comes on STDIN. GET stuff comes in the URL.
You can test for the existence of such input, and either render your basic form or process the input you got.
if ( $param{'access'} ) {
#process it;
else {
#print form;
}
There are many modules that make this easier (you're even using one already, in the form of CGI), so I wouldn't EVER suggest doing it this way 'for real' - this is purely an illustration of the basics.
With the CGI module, which is perhaps the thing that'd require least code alteration, you could use the 'CGI::param()' method to retrieve parameters:
use CGI;
print CGI::header;
print CGI::param('access');
#form stuff.
But a more complete one would be to consider a bit more of an in-depth rewrite, and consider using one of the more up to date 'web handling' frameworks. There really are a lot of potential gotchas. (Although it does depend rather, on how much control over your environment you have - internal/limited user scripts I'm a lot more relaxed about than internet facing).
* See: CGI::Alternatives
Writing a CGI program in 2014 is a lot like using a typewriter. Sure, it'll work, but people will look at you very strangely.
But given that you already have a CGI program, let's look at what it might look like if you used techniques that weren't out of date in the last millennium.
There are basically two underlying problems with your code.
You open a file using a name that comes from user input. And that violates the taint mode rules, so your program dies. You would have seen this in your web server error log, had you looked there.
You don't actually need to write the data to a file, because you want to send the data to the user's web browser.
So here's an improved version of your code. It fixes the two problems I mentioned above but it also uses modern tools.
The CGI module has a param() method which makes it far easier for us to get the parameters passed to our program. We also use its header() method to output the CGI header (basically just the Content-type header).
We use the Template module to move all of the HTML out of out code and put it in a separate area. Here I've cheated slightly and have just put it in the DATA section of the CGI program. Usually we'd put it in a completely separate file. Notice how separating the Perl and the HTML makes the program look cleaner and easier to maintain.
It wasn't clear to me exactly how you wanted to format the data you're getting back from the other web site. So I've just stuck it in a "pre" tag. You'll need to work that out for yourself.
Here's the code:
#!/usr/bin/perl -T
use strict;
use warnings;
use LWP::Simple;
use Template;
use CGI ':cgi';
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);
my $access = param('accession');
my $utils = "http://www.ncbi.nlm.nih.gov/entrez/eutils";
my $db = "nuccore";
my $query = $access;
my $report = "gb"; # downloads the summary text file
my $esearch = "$utils/esearch.fcgi?" . "db=$db&retmax=1&usehistory=y&term=";
my $esearch_result = get($esearch . $query);
my $data = '';
if (my ($Count, $QueryKey, $WebEnv) = $esearch_result =~ m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s) {
my $retstart;
my $retmax=3;
for ($retstart = 0; $retstart < $Count; $retstart += $retmax) {
my $efetch = "$utils/efetch.fcgi?" .
"rettype=$report&retmode=text&retstart=$retstart&retmax=$retmax&" .
"db=$db&query_key=$QueryKey&WebEnv=$WebEnv";
my $efetch_result = get($efetch);
$data .= $efetch_result;
}
}
my $tt = Template->new;
print header;
$tt->process(\*DATA, { data => $data })
or die $tt->error;
__END__
<html>
<head>
<title> CGI Program</title>
</head>
<body>
<h1>Input</h1>
<form action="/cgi-bin/efetch.cgi" method="post" id="myform">
<div>NCBI accession number:<label for="accession"> <input type="text" name="accession"></label><br>
<input type="submit" value="Submit" form="myform"></div>
</form>
[% IF data -%]
<h1>Summary Result</h1>
<pre>
[% data %]
</pre>
[% END -%]
</body>
</html>

Perl HTML::Parser - search a for a specific string in a parsed file

I'm new to HTML::Parser for Perl.
I'm trying to parse a web page and then search for a specific string such as pass or fail. How might I go about that.
Due to framework issues I have to use the HTML::Parser base libary and not another module.
Snippet of code
#!/usr/bin/perl
use strict;
# define the subclass
package IdentityParse;
package HTMLStrip;
use base "HTML::Parser";
sub text {
my ($self, $text) = #_;
# just print out the original text
print $text;
}
sub comment {
my ($self, $comment) = #_;
# print out original text with comment marker
#print "hey hey";
}
sub end {
my ($self, $tag, $origtext) = #_;
# print out original text
#print $origtext;
}
#my $p = new IdentityParse;
my $p = new HTMLStrip;
my #file = $p->parse_file("testcase1.html");
if ($p->parse_file("testcase1.html") =~ "PASS") {
print " The test passed \n";
}
else {
print "\nthe test failed \n";
}
If all you want is to strip the tags from the XML leaving just the text content, then you're making things too hard for yourself. All you need is a text handler subroutine that concatenates each text node to a global scalar.
It looks like this. I've edited the final string to change all spaces and newlines to a single space; otherwise there is a lot of space in there from the layout indents.
use strict;
use warnings;
use HTML::Parser;
my $parser = HTML::Parser->new( text_h => [\&text, 'dtext'] );
my $text_content;
sub text {
$text_content .= shift;
}
$parser->parse_file(*DATA);
$text_content =~ s/\s+/ /g;
print $text_content;
__DATA__
<root>
<item>
Item 1
status failed
</item>
<item>
Item 2
status passed
</item>
<item>
Item 3
status failed
</item>
</root>
output
Item 1 status failed Item 2 status passed Item 3 status failed