Perl XML::Twig - preserving quotes in and around attributes - perl

I'm selectively fixing some elements and attributes. Unfortunately, our input files contain both single- and double-quoted attribute values. Also, some attribute values contain quotes (within a value).
Using XML::Twig, I cannot see how to preserve whatever quotes exist around attribute values.
Here's sample code:
use strict;
use XML::Twig;
my $file=qq(<file>
<label1 attr='This "works"!' />
<label2 attr="This 'works'!" />
</file>
);
my $fixes=0; # count fixes
my $twig = XML::Twig->new( twig_handlers => {
'[@attr]' => sub {fix_att(@_,\$fixes);} },
# ...
keep_atts_order => 1,
keep_spaces => 1,
keep_encoding => 1, );
#$twig->set_quote('single');
$twig->parse($file);
print $twig->sprint();
sub fix_att {
my ($t,$elt,$fixes) = @_;
# ...
}
The above code returns invalid XML for label1:
<label1 attr="This "works"!" />
If I add:
$twig->set_quote('single');
Then we would see invalid XML for label2:
<label2 attr='This 'works'!' />
Is there an option to preserve existing quotes? Or is there a better approach for selectively fixing twigs?

Is there any specific reason for you to use keep_encoding? Without it the quote is properly encoded.
keep_encoding is used to preserve the original encoding of the file, but there are other ways to do this. It was used mostly in the pre-5.8 era, when encodings didn't work as smoothly as they do now.
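A minimal sketch of that suggestion, reusing the sample document from the question with keep_encoding removed. It assumes XML::Twig's default output quote (double quotes). keep_atts_order is left out here only because it needs the extra Tie::IxHash module. The exact serialization of empty elements may differ between versions, so no expected output is shown.

```perl
use strict;
use warnings;
use XML::Twig;

my $file = qq(<file>
<label1 attr='This "works"!' />
<label2 attr="This 'works'!" />
</file>
);

# Same idea as the question's setup, but without keep_encoding.
my $twig = XML::Twig->new( keep_spaces => 1 );
$twig->parse($file);

# With double quotes around attribute values, a double quote inside
# a value should come out as the &quot; entity, keeping the XML valid.
my $out = $twig->sprint;
print $out;
```

If the original file encoding matters, it can be handled when reading and writing the file rather than through keep_encoding.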

Related

Writing simple parser in Perl: having lexer output, where to go next?

I'm trying to write a simple data manipulation language in Perl (read-only, it's meant to transform SQL-inspired queries into filters and properties to use with vSphere Perl API: http://pubs.vmware.com/vsphere-60/topic/com.vmware.perlsdk.pg.doc/viperl_advancedtopics.5.1.html_)
I currently have something similar to lexer output if I understand it properly - a list of tokens like this (Data::Dumper prints array of hashes):
$VAR1 = {
'word' => 'SHOW',
'part' => 'verb',
'position' => 0
};
$VAR2 = {
'part' => 'bareword',
'word' => 'name,',
'position' => 1
};
$VAR3 = {
'word' => 'cpu,',
'part' => 'bareword',
'position' => 2
};
$VAR4 = {
'word' => 'ram',
'part' => 'bareword',
'position' => 3
};
Now what I'd like to do is to build a syntax tree. The documentation I've seen so far is mostly on using modules and generating grammars from BNF, but at the moment I can't wrap my head around it.
I'd like to tinker with relatively simple procedural code, probably recursive, to make some ugly implementation myself.
What I'm currently thinking about is building a string of $token->{'part'}s like this:
my $parts = 'verb bareword bareword ... terminator';
and then running a big and ugly regular expression against it, (ab)using Perl's capability to embed code into regular expressions: http://perldoc.perl.org/perlretut.html#A-bit-of-magic:-executing-Perl-code-in-a-regular-expression:
$parts =~ /
^verb(?{ do_something_smart })\s # Statement always starts with a verb
(bareword\s(?{ do_something_smart }))+ # Followed by one or more barewords
| # Or
# Other rules duct taped here
/x;
Whatever I've found so far requires solid knowledge of CS and/or linguistics, and I'm failing to even understand it.
What should I do about lexer output to start understanding and tinker with proper parsing? Something like 'build a set of temporary hashes representing smaller part of statement' or 'remove substrings until the string is empty and then validate what you get'.
I'm aware of the Dragon Book and SICP, but I'd like something lighter at this time.
Thanks!
As mentioned in a couple of comments above, but here again as a real answer:
You might like Parser::MGC. (Disclaimer: I'm the author of Parser::MGC)
Start by taking your existing (regexp?) definitions of various kinds of token, and turn them into "token_..." methods by using the generic_token method.
From here, you can start to build up methods to parse larger and larger structures of your grammar, by using the structure-building methods.
As for actually building an AST: it's probably simplest to start by emitting HASH references with keys naming the parts of your structure. It's hard to infer a grammatical structure from the example in the question, but you might, for instance, have a concept of a "command" that is a "verb" followed by some "nouns". You might parse that using:
sub parse_command
{
my $self = shift;
my $verb = $self->token_verb;
my $nouns = $self->sequence_of( sub { $self->token_noun } );
# $nouns here will be an ARRAYref
return { type => "command", verb => $verb, nouns => $nouns };
}
It's usually around this point in writing a parser that I decide I want some actual typed objects instead of mere hash references. One easy way to do this is via another of my modules, Struct::Dumb:
use Struct::Dumb qw( -named_constructors );
struct Command => [qw( verb nouns )];
...
return Command( verb => $verb, nouns => $nouns );
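For comparison, the same "verb followed by nouns" rule can be sketched by hand in plain procedural Perl, working directly on a token list shaped like the lexer output in the question. This is not Parser::MGC; the helper names accept_part and parse_command are made up for illustration, and the tokens are assumed to have had their trailing commas stripped already.

```perl
use strict;
use warnings;

# Token list in the same shape as the question's lexer output
# (assumption: trailing commas were removed by the lexer).
my @tokens = (
    { part => 'verb',     word => 'SHOW' },
    { part => 'bareword', word => 'name' },
    { part => 'bareword', word => 'cpu'  },
    { part => 'bareword', word => 'ram'  },
);

# Consume and return the next token if it has the expected part;
# otherwise return nothing and leave the list untouched.
sub accept_part {
    my ($toks, $part) = @_;
    return unless @$toks && $toks->[0]{part} eq $part;
    return shift @$toks;
}

# command := verb bareword+
sub parse_command {
    my ($toks) = @_;
    my $verb = accept_part($toks, 'verb')
        or die "expected a verb\n";
    my @nouns;
    while (my $tok = accept_part($toks, 'bareword')) {
        push @nouns, $tok->{word};
    }
    die "expected at least one bareword\n" unless @nouns;
    return { type => 'command', verb => $verb->{word}, nouns => \@nouns };
}

my $ast = parse_command([@tokens]);   # parse a copy of the token list
print "$ast->{verb}: @{ $ast->{nouns} }\n";
```

Each grammar rule becomes one sub that consumes tokens from the front of the list and returns a hash reference, so larger rules can be built by calling smaller ones.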

How to build a hashref with arrays in perl?

I am having trouble building what I think is a hashref (href) in Perl with XML::Simple.
I am new to this, so I'm not sure how to go about it, and I can't find much on building this href with arrays. All the examples I have found are for a normal href.
The code below outputs the right XML, but I am really struggling with how to add more to this href.
Thanks
Dario
use XML::Simple;
$test = {
book => [
{
'name' => ['createdDate'],
'value' => [20141205]
},
{
'name' => ['deletionDate'],
'value' => [20111205]
},
]
};
$test ->{book=> [{'name'=> ['name'],'value'=>['Lord of the rings']}]};
print XMLout($test,RootName=>'library');
To add a new hash to the array-ref 'book', you need to dereference the array-ref as an array and then push onto it. @{ $test->{book} } dereferences the array-ref into an array.
push @{ $test->{book} }, { name => ['name'], value => ['The Hobbit'] };
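A small self-contained sketch of that push, using the structure from the question without XML::Simple so that it runs with core Perl only:

```perl
use strict;
use warnings;

my $test = {
    book => [
        { name => ['createdDate'],  value => [20141205] },
        { name => ['deletionDate'], value => [20111205] },
    ],
};

# Dereference the array ref stored under 'book' and append a new hash ref.
push @{ $test->{book} }, { name => ['name'], value => ['Lord of the rings'] };

my $count = scalar @{ $test->{book} };
print "books: $count\n";
print "last:  $test->{book}[-1]{value}[0]\n";
```

The resulting $test can be passed to XMLout exactly as in the question.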
XML::Simple is a pain because you're never sure whether you need an array or a hash, and it is hard to distinguish between elements and attributes.
I suggest you make a move to XML::API. This program demonstrates how it could be used to create the same XML data as your own program that uses XML::Simple.
It has an advantage because it builds a data structure in memory that properly represents the XML. Data can be added linearly, like this, or you can store bookmarks within the structure and go back and add information to nodes created previously.
This code adds the two book elements in different ways. The first is the standard way, where the element is opened, the name and value elements are added, and the book element is closed again. The second shows the _ast (abstract syntax tree) method that allows you to pass data in nested arrays similar to those in XML::Simple for conciseness. This structure requires you to prefix attribute names with a hyphen - to distinguish them from element names.
use strict;
use warnings;
use XML::API;
my $xml = XML::API->new;
$xml->library_open;
$xml->book_open;
$xml->name('createdDate');
$xml->value('20141205');
$xml->book_close;
$xml->_ast(book => [
name => 'deletionDate',
value => '20111205',
]);
$xml->library_close;
print $xml;
output
<?xml version="1.0" encoding="UTF-8" ?>
<library>
<book>
<name>createdDate</name>
<value>20141205</value>
</book>
<book>
<name>deletionDate</name>
<value>20111205</value>
</book>
</library>

how to get block of xml code through perl

I have been scratching my head since this morning trying to solve the requirement below. I know how to parse XML, but I cannot find a way to get an exact block along with its tags.
sample code:
<employee name="sample1">
<interest name="cricket">
<function action= "bowling">
<rating> average </rating>
</function>
</interest>
<interest name="football">
<function action="defender">
<rating> good </rating>
</function>
</interest>
</employee>
I just want to extract the below content from above xml file and write it into another text file.
<interest name="cricket">
<function action= "bowling">
<rating> average </rating>
</function>
</interest>
Thanks for your help
Using XML::Twig:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
XML::Twig->new( twig_handlers => { 'interest[@name="cricket"]' => sub { $_->print } },
)
->parsefile( 'interest.xml');
A little explanation: the twig handler is called whenever an element satisfies the trigger condition, in this case interest[@name="cricket"]. In the sub, $_ is set to the current element, which is then printed. For more complex subs, two arguments are passed: the twig itself (the document) and the current element.
Voilà.
Also with XML::Twig comes a tool called xml_grep, which makes it easy to extract what you want:
xml_grep --nowrap 'interest[@name="cricket"]' interest.xml
The --nowrap option prevents the default behaviour, which wraps the results in a containing element.
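For the "more complex subs" case, here is a sketch of a handler that takes the two arguments explicitly and collects the serialized element instead of printing it straight away. The sample XML is inlined (with indentation added) so the snippet is self-contained; the @blocks array is just an illustrative name.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<employee name="sample1">
  <interest name="cricket">
    <function action="bowling"><rating> average </rating></function>
  </interest>
  <interest name="football">
    <function action="defender"><rating> good </rating></function>
  </interest>
</employee>
XML

my @blocks;
XML::Twig->new(
    twig_handlers => {
        'interest[@name="cricket"]' => sub {
            my ($twig, $elt) = @_;       # the document and the matched element
            push @blocks, $elt->sprint;  # element as a string, tags included
        },
    },
)->parse($xml);

print "$_\n" for @blocks;
```

To write the block to another file, the collected strings can be printed to a filehandle opened for writing instead of STDOUT.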

What is wrong with my declaration of a hash inside a hash in Perl?

I am struggling with the following declaration of a hash in Perl:
my %xmlStructure = {
hostname => $dbHost,
username => $dbUsername,
password => $dbPassword,
dev_table => $dbTable,
octopus => {
alert_dir => $alert_dir,
broadcast_id => $broadcast_id,
system_id => $system_id,
subkey => $subkey
}
};
I've been googling, but I haven't been able to come up with a solution, and every modification I make ends up in another warning or in results that I do not want.
Perl complains with the following warning:
Reference found where even-sized list expected at ./configurator.pl line X.
I am doing it that way, since I want to use the module:
XML::Simple
In order to generate a XML file with the following structure:
<settings>
<username></username>
<password></password>
<database></database>
<hostname></hostname>
<dev_table></dev_table>
<octopus>
<alert_dir></alert_dir>
<broadcast_id></broadcast_id>
<subkey></subkey>
</octopus>
</settings>
so something like:
my $data = $xmlFile->XMLout(%xmlStructure);
warn Dumper($data);
would display the latter xml sample structure.
Update:
I forgot to mention that I also tried using parentheses instead of curly braces for the hash, and even though it seems to work, the XML file is not written properly:
I end up with the following structure:
<settings>
<dev_table>5L3IQWmNOw==</dev_table>
<hostname>gQMgO3/hvMjc</hostname>
<octopus>
<alert_dir>l</alert_dir>
<broadcast_id>l</broadcast_id>
<subkey>l</subkey>
<system_id>l</system_id>
</octopus>
<password>dZJomteHXg==</password>
<username>sjfPIQ==</username>
</settings>
Which is not exactly wrong, but I'm not sure if I'm going to have problems later on as the XML file grows bigger. The credentials are encrypted using the RC4 algorithm, and I am encoding them in Base64 to avoid any misbehavior with special characters.
Thanks
{} are used for hash references. To declare a hash use normal parentheses ():
my %xmlStructure = (
hostname => $dbHost,
username => $dbUsername,
password => $dbPassword,
dev_table => $dbTable,
octopus => {
alert_dir => $alert_dir,
broadcast_id => $broadcast_id,
system_id => $system_id,
subkey => $subkey
}
);
See also perldoc perldsc - Perl Data Structures Cookbook.
For your second issue, you should keep in mind that XML::Simple is indeed too simple for most applications. If you need a specific layout, you're better off with a different way of producing the XML, say, using HTML::Template. For example (I quoted variable names for illustrative purposes):
#!/usr/bin/env perl
use strict; use warnings;
use HTML::Template;
my $tmpl = HTML::Template->new(filehandle => \*DATA);
$tmpl->param(
hostname => '$dbHost',
username => '$dbUsername',
password => '$dbPassword',
dev_table => '$dbTable',
octopus => [
{
alert_dir => '$alert_dir',
broadcast_id => '$broadcast_id',
system_id => '$system_id',
subkey => '$subkey',
}
]
);
print $tmpl->output;
__DATA__
<settings>
<username><TMPL_VAR username></username>
<password><TMPL_VAR password></password>
<database><TMPL_VAR database></database>
<hostname><TMPL_VAR hostname></hostname>
<dev_table><TMPL_VAR dev_table></dev_table>
<octopus><TMPL_LOOP octopus>
<alert_dir><TMPL_VAR alert_dir></alert_dir>
<broadcast_id><TMPL_VAR broadcast_id></broadcast_id>
<subkey><TMPL_VAR subkey></subkey>
<system_id><TMPL_VAR system_id></system_id>
</TMPL_LOOP></octopus>
</settings>
Output:
<settings>
<username>$dbUsername</username>
<password>$dbPassword</password>
<database></database>
<hostname>$dbHost</hostname>
<dev_table>$dbTable</dev_table>
<octopus>
<alert_dir>$alert_dir</alert_dir>
<broadcast_id>$broadcast_id</broadcast_id>
<subkey>$subkey</subkey>
<system_id>$system_id</system_id>
</octopus>
</settings>
You're using the curly braces { ... } to construct a reference to an anonymous hash. You should either assign that to a scalar, or change the { ... } to standard parentheses ( ... ).
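Both options side by side, as a small sketch. The values are placeholders rather than the question's variables, and only two keys are shown.

```perl
use strict;
use warnings;

# A hash: a list in parentheses, assigned to a variable with the % sigil.
my %settings = (
    hostname => 'db.example.com',      # placeholder value
    octopus  => { subkey => 'abc' },   # nested hashes are always references
);

# A hash reference: curly braces, assigned to a scalar.
my $settings_ref = {
    hostname => 'db.example.com',
    octopus  => { subkey => 'abc' },
};

print "$settings{octopus}{subkey}\n";        # element of a hash
print "$settings_ref->{octopus}{subkey}\n";  # element through a reference

# A function that expects a reference can be given \%settings.
my $ref = \%settings;
print ref($ref), "\n";
```

XML::Simple's XMLout takes a reference as its first argument, so with the hash form it would be called with \%xmlStructure rather than %xmlStructure.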

Process quoted string within XML

Perl version: perl, v5.10.1 (*) built for x86_64-linux-thread-multi
I am a relative newbie to Perl. I have tried looking at the various XML processing utilities for Perl: XML::Simple, XML::Parser, XML::LibXML, XML::DOM, XML::Twig, XML::XPath etc.
I am trying to process some XML that has quotes in the value portion. I am specifically looking to extract the title from the below XML, however, I've been stumbling over this for a bit now and would appreciate some help if possible.
$VAR1 = {
'issue' => {
'priority' => {
'fid' => '11',
'content' => '3 - Best Effort'
},
'transNum' => {
'fid' => '2',
'content' => '170'
},
'dueDate' => {
'fid' => '17',
'content' => '1327944695'
},
'status' => {
'fid' => '18',
'content' => 'Open - Unassigned'
},
'createdBy' => {
'fid' => '15',
'content' => '32'
},
'title' => {
'fid' => '20',
'content' => 'Testing on spider - issue with "quotation marks"'
},
'description' => {
'fid' => '22',
'content' => 'Noticed issue with title having quotes in title'
},
'issueNum' => {
'fid' => '1',
'content' => '33'
}
}
};
Using XML::LibXML and the following code (note: the above is a print of the contents of the $issueXML variable):
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($issueXML);
print $doc->toString;
This prints out:
<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<issueNum fid="1">33</issueNum>
<transNum fid="2">170</transNum>
<createdBy fid="15">32</createdBy>
<status fid="18">Open - Unassigned</status>
<title fid="20">Testing on spider - issue with "quotation marks"</title>
<priority fid="11">3 - Best Effort</priority>
<description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
<dueDate fid="17">1327944695</dueDate>
</issue>
</issues>
I am looking to specifically extract value for the title tag.
When I was processing using XML::Parser, I kept ending up with just the final quote mark. I would like to maintain the same format of the string to display:
Testing on spider - issue with "quotation marks"
I am a bit overwhelmed at the moment with the various XML processing functions. I have tried for awhile now to figure this out, and I am seriously spinning my wheels.
TIA, Appreciate any help,
Regards,
Scott
Another go with XML::LibXML. You should have no problems with quotation marks inside text nodes.
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
use Data::Dumper;
my $xml = XML::LibXML->load_xml(string => q{<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<issueNum fid="1">33</issueNum>
<transNum fid="2">170</transNum>
<createdBy fid="15">32</createdBy>
<status fid="18">Open - Unassigned</status>
<title fid="20">Testing on spider - issue with "quotation marks"</title>
<priority fid="11">3 - Best Effort</priority>
<description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
<dueDate fid="17">1327944695</dueDate>
</issue>
</issues>
});
my $title = $xml->find('/issues/issue/title');
print $title->get_node(1)->textContent; # NodeList positions are 1-based, as in XPath
I am not sure what problem you run into with the quotation marks. They're just a character like any other, except in attribute values where you may have to use an entity if the quote is already used as the value delimiter. Are you sure the "problem" is not just with the way Data::Dumper displays the data structure generated by XML::Simple?
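To check whether the quotes are only a display effect, a quick sketch with core Data::Dumper can help. It dumps the same string with single-quoted and double-quoted display; the title string is taken from the question, and nothing here depends on an XML module.

```perl
use strict;
use warnings;
use Data::Dumper;

my $title = 'Testing on spider - issue with "quotation marks"';

$Data::Dumper::Terse = 1;   # print the value without the "$VAR1 =" prefix

{
    local $Data::Dumper::Useqq = 0;   # default: single-quoted display
    print Dumper($title);
}
{
    local $Data::Dumper::Useqq = 1;   # double-quoted display, inner quotes escaped
    print Dumper($title);
}

# The string itself is identical in both cases; only the dump differs.
print "$title\n";
```

Any backslashes shown in the second dump belong to Data::Dumper's notation, not to the data.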
In any case stay away from XML::Parser, which is too low-level, use XML::LibXML or XML::Twig. XML::Simple seems to generate a lot of questions, especially from people not familiar with Perl, so I am not sure it's the right tool to use.
Here is a solution with XML::Twig, but there are many other ways to do this, depending on exactly what you want to do with the titles.
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $issueXML=q{<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<issueNum fid="1">33</issueNum>
<transNum fid="2">170</transNum>
<createdBy fid="15">32</createdBy>
<status fid="18">Open - Unassigned</status>
<title fid="20">Testing on spider - issue with "quotation marks"</title>
<priority fid="11">3 - Best Effort</priority>
<description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
<dueDate fid="17">1327944695</dueDate>
</issue>
</issues>
};
my $t= XML::Twig->new( twig_handlers => { title => sub { print $_->text, "\n"; } })
->parse( $issueXML);
I usually use XML::XSH2 for XML manipulation. Your problem simplifies to:
open FILE.xml ;
for //title echo (.) ;
Your best way of pulling bits out of XML is with an XPath query.
In this case you are looking for the element 'title', inside an element 'issue', inside an element 'issues'.
So your XPath query is simply '//issues/issue/title'.
In two lines of code, you can use XML::LibXML::XPathContext to perform the XPath query for you, which will return the element's content which you are looking for.
This code snippet will demonstrate a simple way of doing an XPath query. The important bit of it is the two lines following the comment "Relevant bit here".
For more information, see the documentation for XML::LibXML::XPathContext
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $xml = XML::LibXML->load_xml(string => q{<?xml version="1.0" encoding="utf-8"?>
<issues>
<issue>
<issueNum fid="1">33</issueNum>
<transNum fid="2">170</transNum>
<createdBy fid="15">32</createdBy>
<status fid="18">Open - Unassigned</status>
<title fid="20">Testing on spider - issue with "quotation marks"</title>
<priority fid="11">3 - Best Effort</priority>
<description fid="22">Noticed issue with submission of Documentation issue #40 on accurev with quotes in title. </description>
<dueDate fid="17">1327944695</dueDate>
</issue>
</issues>
});
# Relevant bit here
my $xc = XML::LibXML::XPathContext->new($xml);
my $title = $xc->find('//issues/issue/title');
print "$title\n";
# prints:
# Testing on spider - issue with "quotation marks"