How to modify an XML file using Perl and XML::Twig

I tried to modify the name attribute in an XML file using this program:
use XML::Twig;
open(OUT, ">resutl.xml") or die "cannot open out file main_file:$!";
my $twig = XML::Twig->new(
pretty_print => 'indented',
twig_handlers => {
association => sub {
$_->findnodes('div');
$_->set_att(name => 'xxx');
},
},
);
$twig->parsefile('in.xml');
$twig->flush(\*OUT);
<div
name="test1"
booktype="book1"
price="e200"
/>
<div
name="test2"
booktype="book2"
price="100" />
When I execute the Perl script it prints the error
junk after document element at line 6, column 0, byte 65 at C:/Perl64/lib/XML/Parser.pm line 187.
at C:\Users\admin\Desktop\parse.pl line 14.

I have tried to tidy your post a little but I don't understand the XML fragment that immediately follows the Perl code.
There are two empty div elements without a root element, so as it stands it isn't well-formed XML.
XML::Twig is assuming that the first div element is the document (root) element and, since it has no content, the subsequent text produces the error message
junk after document element
You have also set twig_handlers to a single handler for association elements, but your data contains no such elements.
I think you need to explain more about what you are trying to do.

Well-formed XML requires a single root element. When XML::Twig attempts to parse your file, it finds the first div and decides that it is the root element of the file. When it reaches the end of that element and finds another tag at line 6, it rightfully reports an error.
If this document is actually intended to be XML, you'll need to enclose that data in a fake root element in order for it to be parsable. The following does that:
use strict;
use warnings;
use XML::Twig;
my $data = do {local $/; <DATA>};
# Enclose $data in a fake <root> element
$data = qq{<root>$data</root>};
my $twig = XML::Twig->new(
pretty_print => 'indented',
twig_handlers => {
association => sub {
$_->findnodes('div');
$_->set_att(name => 'xxx');
},
},
);
$twig->parse($data);
$twig->print;
__DATA__
<div
name="test1"
booktype="book1"
price="e200"
/>
<div
name="test2"
booktype="book2"
price="100" />
Outputs:
<root>
<div booktype="book1" name="test1" price="e200"/>
<div booktype="book2" name="test2" price="100"/>
</root>
Now, it's also unclear what you're trying to do with your "XML". I suspect you're trying to change the name attributes of the div tags to be 'xxx'. If that's the case then you need to redo your twig_handlers to the following:
twig_handlers => {
'//div' => sub { $_->set_att(name => 'xxx'); },
},
The output will then be:
<root>
<div booktype="book1" name="xxx" price="e200"/>
<div booktype="book2" name="xxx" price="100"/>
</root>
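The single-root requirement described above can also be checked before any processing. The following is a minimal sketch, assuming XML::Twig is installed; the helper name is_well_formed is hypothetical and not part of the original answers. It relies on XML::Twig's safe_parse method, which wraps the parse in an eval and returns a false value on failure.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not from the original post): returns 1 when $xml
# parses as well-formed XML and 0 otherwise.
sub is_well_formed {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    # safe_parse returns the twig on success and a false value on failure
    return $twig->safe_parse($xml) ? 1 : 0;
}

# Two sibling elements with no enclosing root, as in the question
my $fragment = qq{<div name="test1"/>\n<div name="test2"/>};

print is_well_formed($fragment)                 ? "ok\n" : "not well-formed\n";
print is_well_formed("<root>$fragment</root>") ? "ok\n" : "not well-formed\n";
```

The first call should report the fragment as not well-formed, and the wrapped version should parse. On failure, the parser's error message is left in $@.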

Related

How to get the text contents of an XML child element based on an attribute of its parent

This is my XML data
<categories>
<category id="Id001" name="Abcd">
<project> ID_1234</project>
<project> ID_5678</project>
</category>
<category id="Id002" name="efgh">
<project> ID_6756</project>
<project> ID_4356</project>
</category>
</categories>
I need to get the text contents of each <project> element based on the name attribute of the containing <category> element.
I am using Perl with the XML::LibXML module.
For example, given the category name Abcd, I should get the list ID_1234, ID_5678.
Here is my code
my $parser = XML::LibXML->new;
$doc = $parser->parse_file( "/cctest/categories.xml" );
my @nodes = $doc->findnodes( '/categories/category' );
foreach my $cat ( @nodes ) {
my @catn = $cat->findvalue('@name');
}
This gives me the category names in array @catn. But how can I get the text values of each project?
You haven't shown what you've tried so far, or what your desired output is so I've made a guess at what you're looking for.
With XML::Twig you could do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig -> parse ( \*DATA );
foreach my $project ( $twig -> findnodes ( '//project' ) ) {
print join ",", (map { $project -> parent -> att($_) } qw ( id name )), $project -> text,"\n";
}
__DATA__
<categories>
<category id="Id001" name="Abcd">
<project> ID_1234</project>
<project> ID_5678</project>
</category>
<category id="Id002" name="efgh">
<project> ID_6756</project>
<project> ID_4356</project>
</category>
</categories>
Which produces:
Id001,Abcd, ID_1234,
Id001,Abcd, ID_5678,
Id002,efgh, ID_6756,
Id002,efgh, ID_4356,
It does this by using findnodes to locate any element 'project'.
Then extract the 'id' and 'name' attributes from the parent (the category), and print that - along with the text in this particular element.
xpath is a powerful tool for selecting data from XML, and with a more focussed question, we can give more specific answers.
So if you were seeking all the projects 'beneath' category "Abcd" you could:
foreach my $project ( $twig -> findnodes ( './category[@name="Abcd"]/project' ) ) {
print $project -> text,"\n";
}
This uses XML::LibXML, which is the library you're already using.
Your $cat variable contains an XML element object which you can process with the same findnodes() and findvalue() methods that you used on the top-level $doc object.
#!/usr/bin/perl
use strict;
use warnings;
# We use modern Perl here (specifically say())
use 5.010;
use XML::LibXML;
my $doc = XML::LibXML->new->parse_file('categories.xml');
foreach my $cat ($doc->findnodes('//category')) {
say $cat->findvalue('@name');
foreach my $proj ($cat->findnodes('project')) {
say $proj->findvalue('.');
}
}
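To answer the original question more directly (projects for one given category name), the attribute test can be moved into the XPath expression itself. This is a minimal sketch, assuming XML::LibXML is installed; the helper name projects_for is hypothetical, and it assumes the category name contains no double quotes, since the name is interpolated into the XPath string.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper (not from the original post): returns the trimmed text
# of every <project> inside the <category> whose name attribute is $name.
# Assumes $name contains no double-quote characters.
sub projects_for {
    my ($doc, $name) = @_;
    my @texts;
    for my $proj ($doc->findnodes(qq{/categories/category[\@name="$name"]/project})) {
        my $text = $proj->textContent;
        $text =~ s/^\s+|\s+$//g;    # the sample data has a leading space
        push @texts, $text;
    }
    return @texts;
}

my $doc = XML::LibXML->load_xml(string => <<'XML');
<categories>
<category id="Id001" name="Abcd">
<project> ID_1234</project>
<project> ID_5678</project>
</category>
<category id="Id002" name="efgh">
<project> ID_6756</project>
<project> ID_4356</project>
</category>
</categories>
XML

print join(', ', projects_for($doc, 'Abcd')), "\n";    # prints "ID_1234, ID_5678"
```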
You can try with XML::Simple
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;
my $XML_file = 'your XML file';
my $XML_data;
#Get data from your XML file
open(my $IN, '<:encoding(UTF-8)', $XML_file) or die "cannot open file $XML_file";
{
local $/;
$XML_data = <$IN>;
}
close($IN);
#Store XML data as hash reference
my $xmlSimple = XML::Simple->new(KeepRoot => 1);
my $hash_ref = $xmlSimple->XMLin($XML_data);
print Dumper $hash_ref;
The hash reference will be as below:
$VAR1 = {
'categories' => {
'category' => {
'efgh' => {
'id' => 'Id002',
'project' => [
' ID_6756',
' ID_4356'
]
},
'Abcd' => {
'id' => 'Id001',
'project' => [
' ID_1234',
' ID_5678'
]
}
}
}
};
Now to get data which you want:
foreach(@{$hash_ref->{'categories'}->{'category'}->{'Abcd'}->{'project'}}){
print "$_\n";
}
The result is:
ID_1234
ID_5678

HTML::Parser handler sends undefined parameter to callback function?

How it's being declared:
my $HTML_GRABBER = HTML::Parser->new('api_version' => 2,
'handlers' => {
'start' => [\&start_tag,"tagname,text"],
'text' => [\&read_text,"tagname, text"],
'end' => [\&end_tag,"tagname"]
}
);
callback function:
sub read_text {
print Dumper(@_);
die "\n";
my ($tag,$stuff) = @_;
if(($DO_NOTHING==0)&&($tag eq $current_tag))
{
push @{$data_queue}, $stuff;
}
}
result:
$VAR1 = undef;
$VAR2 = '
';
So it apparently passes an undefined value for the tag and a blank string (just a newline) for the text. This is reading from a saved HTML file on my hard drive, and I don't know why it happens.
I had something like this in mind:
#DOC structure:
#(
# "title"=> {"text"=>#("text")}
# "div" => [
# {
# "p"=> [
# {
# "class" => string
# "id" => string
# "style" => string
# "data"=>["first line", "second line"]
# }
# ],
# "class" => string
# "id" => string
# "style" => string
# }
# ]
#)
You've told it to.
You specified which parameters should be passed to the text handler:
'text' => [\&read_text,"tagname, text"],
Well, there is no tagname for a text token, and therefore it passes you undef as the first parameter.
What exactly are you trying to do? If you describe your actual goal, we might be able to suggest a better solution instead of just pointing out the flaws in your current implementation. Check out: What is an XY Problem?
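If the goal is simply to know which element each piece of text belongs to, one common approach is to remember the open tags in the start and end handlers and consult that stack in the text handler. The following is a minimal sketch, assuming HTML::Parser is installed; the helper name text_with_tags is hypothetical, and it assumes reasonably well-nested markup (void elements such as br would stay on the stack).

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the original post): returns a list of
# [tagname, text] pairs, one per non-blank text chunk in $html.
# Assumes well-nested markup; void elements like <br> are not handled.
sub text_with_tags {
    my ($html) = @_;
    my (@open, @found);
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Push each opening tag name, pop it again on the matching end tag
        start_h => [ sub { push @open, $_[0] }, 'tagname' ],
        end_h   => [ sub { pop @open if @open && $open[-1] eq $_[0] }, 'tagname' ],
        # The text event carries no tag name, so use the innermost open tag
        text_h  => [ sub {
            my ($text) = @_;
            return unless $text =~ /\S/;    # skip whitespace-only chunks
            push @found, [ $open[-1] // '', $text ];
        }, 'dtext' ],
    );
    $parser->unbroken_text(1);    # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;
    return @found;
}

for my $pair (text_with_tags('<div><p>first line</p><p>second line</p></div>')) {
    print "$pair->[0]: $pair->[1]\n";
}
```

Skipping whitespace-only chunks also avoids the lone newline seen in the Dumper output above.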
Addendum about Mojo::DOM
There are modern modules like Mojo::DOM that are much better for navigating a document structure and finding specific data. Check out Mojocast Episode 5 for a helpful 8 minute introductory video.
You appear to be prematurely worried about efficiency of the parse. Initially, I'd advise you to just store the raw html in the database, and reparse it whenever you need to pull new information.
If you Benchmark and decide this is too slow, then you can use Storable to save a serialized copy of the parsed $dom object. However, this should definitely be in addition to the saved html.
use strict;
use warnings;
use Mojo::DOM;
use Storable qw(freeze thaw);
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
# Serializing to memory - Can then put it into a DB if you want
my $serialized = freeze $dom;
my $newdom = thaw($serialized);
# Load the title from the deserialized DOM (at() returns the first match)
print $newdom->at('title')->text;
__DATA__
<html>
<head><title>My Title</title></head>
<body>
<h1>My Header one</h1>
<p>My Paragraph One</p>
<p>My Paragraph Two</p>
</body>
</html>
Outputs:
My Title

how to split a parent tag in xml twig?

I am new to xml-twig, how can I split parent tag?
File:
<xml>
<p class="indent">text text incluce <div>text</div> ateas</p>
<p class="text">text text incluce <div>text</div> ateas</p>
</xml>
I need Output as:
<xml>
<p class="indent">text text incluce</p>
<div>text</div>
<p class="indent">ateas</p>
<p class="text">text text incluce</p>
<div>text</div>
<p class="text">ateas</p>
</xml>
How can I split tag?
use strict;
use XML::Twig;
open(my $output , '>', "split.xml") || die "can't open the Output $!\n";
my $xml_twig_content = XML::Twig->new(
'p' => \&split, )
$xml_twig_content->parsefile("sample.xml");
$xml_twig_content->print($output);
sub split{
my ($xml_twig_content, $p) = @_;
}
how can I split tag?...
There are probably several ways to do it. The following code uses wrap_in, which adds a new <p> around each of the text nodes, and then erase to remove the original <p>. atts is used to copy the attributes of the original <p> to the new ones.
#!/usr/bin/perl
use warnings;
use strict;
use XML::Twig;
open(my $output , '>', "split.xml") || die "can't open the Output $!\n";
my $xml = XML::Twig->new( twig_handlers => { p => \&split_tag } );
$xml->parsefile("1.xml");
$xml->print($output);
sub split_tag {
my ($twig, $p) = @_;
$_->wrap_in('p', $p->atts) for $p->children('#TEXT');
$p->erase;
}
BTW, please post runnable code. Your sample code is missing important parts (e.g. twig_handlers and a semicolon).
For your additional constraint, you can bend the script as follows:
sub split_tag {
my ($twig, $p) = @_;
CHILD:
for my $ch ($p->children(sub {'div' ne shift->name})) {
my $wrap = $ch->wrap_in('p', $p->atts);
my $prev = $wrap->prev_sibling or next CHILD;
$prev->merge($wrap) if 'p' eq $prev->name;
}
$p->erase;
}
A lot depends on the nature of your complete XML data. If you expect nested <p> elements, for instance, then the solution is a lot more complex and the behaviour needs to be defined better.
However, this program seems to do what you need, and works on your sample data. As in your own code, the split subroutine processes every <p> element that is encountered. An element is left untouched if all it contains is text; otherwise the children are detached and used to create a list of replacement nodes in array @split. The text nodes in this list are transformed by creating a clone of the parent <p> element and pasting the text as its contents. Once all text nodes have been modified, a call to replace_with substitutes the new list of elements for the original <p> element.
Note that the print_to_file method avoids the need to open the output file separately.
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
twig_handlers => { p => \&split },
);
$twig ->parsefile('sample.xml');
$twig->print_to_file('split.xml', pretty_print => 'indented');
sub split{
my ($twig, $p) = @_;
return if $p->contains_only_text;
my @split = $p->cut_children;
for my $child (grep $_->is_pcdata, @split) {
my $text = $child;
$child = $p->copy;
$text->paste(last_child => $child);
}
$p->replace_with(@split);
}
output
<xml>
<p class="indent">text text incluce </p>
<div>text</div>
<p class="indent"> ateas</p>
<p class="text">text text incluce </p>
<div>text</div>
<p class="text"> ateas</p>
</xml>

Get contents from HTML tag using MyParser in Perl

I have a html as the following:
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body bgcolor="white">
<h1>foo.c</h1>
<form method="post" action=""
enctype="application/x-www-form-urlencoded">
Compare this file to the similar file:
<select name="file2">
<option value="...">...</option>
</select>
<input type="hidden" name="file1" value="foo.c" /><br>
Show the results in this format:
</form>
<hr>
<p>
<pre>
some code
</pre>
I need to get the value of the input named 'file1' and the contents of the HTML pre tag. I don't know Perl; by googling I wrote this small program (which I believe isn't "elegant"):
#!/usr/bin/perl
package MyParser;
use base qw(HTML::Parser);
#Store the file name and contents obtaind from HTML Tags
my($filename, $file_contents);
#This value is set at start() calls
#and use in text() routine..
my($g_tagname, $g_attr);
#Process tag itself and its attributes
sub start {
my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
$g_tagname = $tagname;
$g_attr = $attr;
}
#Process HTML tag body
sub text {
my ($self, $text) = @_;
#Gets the filename
if($g_tagname eq "input" and $g_attr->{'name'} eq "file1") {
$filename = $attr->{'value'};
}
#Gets the filecontents
if($g_tagname eq "pre") {
$file_contents = $text;
}
}
package main;
#read $filename file contents and returns
#note: it works only for text/plain files.
sub read_file {
my($filename) = @_;
open FILE, $filename or die $!;
my ($buf, $data, $n);
while((read FILE, $data, 256) != 0) {
$buf .= $data;
}
return ($buf);
}
my $curr_filename = $ARGV[0];
my $curr_file_contents = read_file($curr_filename);
my $parser = MyParser->new;
$parser->parse($curr_file_contents);
print "filename: ",$filename,"file contents: ",$file_contents;
Then I call ./foo.pl html.html But I'm getting empty values from $filename and $file_contents variables.
How to fix this?
Like always, there's more than one way to do it. Here's how to use the DOM Parser of Mojolicious for this task:
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;
# slurp all lines at once into the DOM parser
my $dom = Mojo::DOM->new(do { local $/; <> });
print $dom->at('input[name=file1]')->attr('value');
print $dom->at('pre')->text;
Output:
foo.c
some code
Using xpath and HTML::TreeBuilder::XPath Perl module ( very few lines ):
#!/usr/bin/env perl
use strict; use warnings;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new_from_content( <> );
print $tree->findvalue( '//input[@name="file1"]/@value' );
print $tree->findvalue( '//pre/text()' );
USAGE
./script.pl file.html
OUTPUT
foo.c
some code
NOTES
In the past, I used the HTML::TreeBuilder module for web scraping. Now I can't go back to that complexity: HTML::TreeBuilder::XPath does all the magic with useful XPath expressions.
You can use the new_from_file method to open a file or a filehandle instead of new_from_content; see perldoc HTML::TreeBuilder (HTML::TreeBuilder::XPath inherits its methods from HTML::TreeBuilder).
Using <> in this way is allowed here because HTML::TreeBuilder::new_from_content() specifically accepts a list of lines. Most constructors will not allow this usage; provide a scalar instead or use another method.
You don't generally want to use plain HTML::Parser unless you're writing your own parsing module or doing something generally tricky. In this case, HTML::TreeBuilder, which is a subclass of HTML::Parser, is the easiest to use.
Also, note that HTML::Parser has a parse_file method (and HTML::TreeBuilder makes it even easier with a new_from_file method), so you don't have to do all of this read_file business. Besides, there are better ways to do it than the one you picked, including File::Slurp and the old do { local $/; <$handle> } trick.
use strict;
use warnings;
use HTML::TreeBuilder;
# Path of the HTML file to parse (kept separate from the extracted file name)
my $html_file = $ARGV[0];
my $tree = HTML::TreeBuilder->new_from_file($html_file);
# Value of the hidden input named file1
my $filename = $tree->look_down(
_tag => 'input',
type => 'hidden',
name => 'file1'
)->attr('value');
my $file_contents = $tree->look_down(_tag => 'pre')->as_trimmed_text;
print "filename: ",$filename,"file contents: ",$file_contents;
For information on look_down, attr, and as_trimmed_text, see the HTML::Element docs; HTML::TreeBuilder both is a, and works with, elements.

How to parse a multi-record XML file using XML::Simple in Perl

My data.xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<catalog>
<cd country="UK">
<title>Hide your heart</title>
<artist>Bonnie Tyler</artist>
<price>10.0</price>
</cd>
<cd country="CHN">
<title>Greatest Hits</title>
<artist>Dolly Parton</artist>
<price>9.99</price>
</cd>
<cd country="USA">
<title>Hello</title>
<artist>Say Hello</artist>
<price>0001</price>
</cd>
</catalog>
my test.pl
#!/usr/bin/perl
# use module
use XML::Simple;
use Data::Dumper;
# create object
$xml = new XML::Simple;
# read XML file
$data = $xml->XMLin("data.xml");
# access XML data
print "$data->{cd}->{country}\n";
print "$data->{cd}->{artist}\n";
print "$data->{cd}->{price}\n";
print "$data->{cd}->{title}\n";
Output:
Not a HASH reference at D:\learning\perl\t1.pl line 16.
Comment: I googled and found this article (handling a single XML record):
http://www.go4expert.com/forums/showthread.php?t=812
I tested the article's code and it works quite well on my laptop.
Then I wrote the practice code above to try to access multiple records, but it failed. How can I fix it? Thank you.
Always use strict and use warnings. Don't quote complex references like you're doing. You're right to use Data::Dumper; it would have shown you that cd is an array ref, so you have to specify which cd.
#!/usr/bin/perl
use strict;
use warnings;
# use module
use XML::Simple;
use Data::Dumper;
# create object
my $xml = new XML::Simple;
# read XML file
my $data = $xml->XMLin("file.xml");
# access XML data
print $data->{cd}[0]{country};
print $data->{cd}[0]{artist};
print $data->{cd}[0]{price};
print $data->{cd}[0]{title};
If you do print Dumper($data), you will see that the data structure does not look like you think it does:
$VAR1 = {
'cd' => [
{
'country' => 'UK',
'artist' => 'Bonnie Tyler',
'price' => '10.0',
'title' => 'Hide your heart'
},
{
'country' => 'CHN',
'artist' => 'Dolly Parton',
'price' => '9.99',
'title' => 'Greatest Hits'
},
{
'country' => 'USA',
'artist' => 'Say Hello',
'price' => '0001',
'title' => 'Hello'
}
]
};
You need to access the data like so:
print "$data->{cd}->[0]->{country}\n";
print "$data->{cd}->[0]->{artist}\n";
print "$data->{cd}->[0]->{price}\n";
print "$data->{cd}->[0]->{title}\n";
In addition to what has been said by Evan, if you're unsure whether you have one or many elements, ref() can tell you which it is, and you can handle it accordingly:
my $data = $xml->XMLin("file.xml");
if(ref($data->{cd}) eq 'ARRAY')
{
for my $cd (@{ $data->{cd} })
{
print Dumper $cd;
}
}
else # Chances are it's a single element
{
print Dumper $data->{cd};
}
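An alternative to checking ref() at every access is to ask XML::Simple to always build an array for cd, using its ForceArray option. This is a minimal sketch, assuming XML::Simple is installed; the helper name load_cds is hypothetical, and KeyAttr => [] is set here only to switch off attribute-based folding so the structure stays predictable.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical helper (not from the original post): parses catalog XML and
# returns the list of cd records, whether the file holds one <cd> or many.
sub load_cds {
    my ($xml) = @_;
    # ForceArray => ['cd'] makes {cd} an array ref even for a single record
    my $data = XMLin($xml, ForceArray => ['cd'], KeyAttr => []);
    return @{ $data->{cd} || [] };
}

# A catalog with a single record, the case that breaks $data->{cd}[0]
my $one_cd = '<catalog><cd country="UK"><title>Hide your heart</title></cd></catalog>';

for my $cd (load_cds($one_cd)) {
    print "$cd->{country}: $cd->{title}\n";    # prints "UK: Hide your heart"
}
```

With this option the same loop works unchanged for the three-record data.xml from the question.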