Perl LibXML raw data from textContent?

Given the following XML:
<?xml version="1.0" encoding="utf-8" ?>
<Request>
<form_submit>
<form_submit id = 1424>
<form_id>1424</form_id>
<field1 id=’5’> <![CDATA[ test ]]> </field1>
<field2 id=’6’> <![CDATA[ test2 ]]> </field2>
</form_submit>
</form_submit>
</Request>
I'm trying to get the raw values for the field1 and field2 elements. I'm using the following code:
foreach my $node ( $xml_request->findnodes('Request/*/*/*[@id]') )
{
    my $form_field_value = $node->textContent;
    print "Value:\"$form_field_value\"\n";
}
But the output is:
Value:" test "
Value:" test2 "
How do I retrieve the exact data, raw and as is, with all the special characters? So that the output is:
Value:" <![CDATA[ test ]]> "
Value:" <![CDATA[ test2 ]]> "
Thank you.

I'm not a libxml expert.
However, this is what I could figure out after playing with your XML and libxml a bit.
A CDATA section is a node of its own and is not part of the text.
The code below goes one level deep and calls toString() for CDATA child nodes
and textContent for other nodes.
foreach my $node ( $xml_request->findnodes('Request/*/*/*[@id]') )
{
    my $text;
    if($node->childNodes) {
        foreach my $child ($node->childNodes()) {
            if ($child->nodeType == XML::LibXML::XML_CDATA_SECTION_NODE) {
                $text .= $child->toString;
            } else {
                $text .= $child->textContent;
            }
        }
    } else {
        $text = $node->textContent;
    }
    print qq{"$text"\n};
}
will print
" <![CDATA[ test ]]> "
" <![CDATA[ test2 ]]> "

Your sample data is invalid XML, and won't parse unless you replace 1424, ’5’ and ’6’ with "1424", "5" and "6".
You have asked for the text content and have got exactly that. To get what you need you must search for the children of the <fieldN> elements and use the toString method on them.
This code shows the idea. Note that the spaces before and after the CDATA, which would otherwise appear as separate text nodes, have been eliminated using a keep_blanks => 0 option on the object constructor.
use strict;
use warnings;
use XML::LibXML;
my $xml_request = XML::LibXML->load_xml(string => <<'END', keep_blanks => 0);
<?xml version="1.0" encoding="utf-8" ?>
<Request>
<form_submit>
<form_submit id = "1424">
<form_id>1424</form_id>
<field1 id="5"> <![CDATA[ test ]]> </field1>
<field2 id="6"> <![CDATA[ test2 ]]> </field2>
</form_submit>
</form_submit>
</Request>
END
foreach my $node ( $xml_request->findnodes('//form_submit/*[@id]/text()') ) {
    my $form_field_value = $node->toString;
    print qq(Value: "$form_field_value"\n);
}
output
Value: "<![CDATA[ test ]]>"
Value: "<![CDATA[ test2 ]]>"
Edit
ikegami has commented that the output requested in the question includes the whitespace surrounding the CDATA section. I don't know whether that is truly part of the requirement, but this edit provides a way to do that.
This would be clearer using XML::LibXML::Reader as it has a readInnerXml method (comparable to JavaScript's innerHTML ) that does exactly what is necessary. Instead, this program has to serialize all the children of the <fieldN> nodes and concatenate them with join.
This is a new foreach loop. The rest of the program remains unchanged except for the construction of $xml_request, which must have the keep_blanks option set to 1 or removed altogether.
foreach my $node ( $xml_request->findnodes('//*[starts-with(name(),"field")][@id]') ) {
    my $form_field_value = join '', map $_->toString, $node->childNodes;
    print qq(Value: "$form_field_value"\n);
}
output
Value: " <![CDATA[ test ]]> "
Value: " <![CDATA[ test2 ]]> "
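To see why textContent drops the markers, it can help to list the child nodes of one field element. The following is a minimal sketch, assuming XML::LibXML's default parser options (blank text nodes and CDATA sections are kept as separate nodes); the helper name describe_children is made up for illustration and is not part of the library.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns one "kind:serialized" string per child node,
# so the text nodes and the CDATA section can be told apart.
sub describe_children {
    my ($element) = @_;
    my @out;
    for my $child ( $element->childNodes ) {
        my $type = $child->nodeType;
        my $kind = $type == XML::LibXML::XML_CDATA_SECTION_NODE() ? 'cdata'
                 : $type == XML::LibXML::XML_TEXT_NODE()          ? 'text'
                 :                                                  'other';
        push @out, "$kind:" . $child->toString;
    }
    return @out;
}

my $doc = XML::LibXML->load_xml(
    string => '<field1 id="5"> <![CDATA[ test ]]> </field1>' );

# One line per child node; brackets make the surrounding blanks visible.
print "[$_]\n" for describe_children( $doc->documentElement );
```

Only toString on the CDATA node serializes the `<![CDATA[ ... ]]>` wrapper; textContent returns the characters inside it.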

Related

Perl: unable to extract sibling value using Twig::XPath syntax

I recently started to use XML::Twig::XPath, but the module does not seem to recognize an XPath syntax.
In the following XML, I want the value of the "Txt" node if the value of the PlcAndNm node is "ext_1".
<?xml version="1.0" encoding="UTF-8"?>
<root>
<Document>
<RedOrdrV03>
<MsgId>
<Id>1</Id>
</MsgId>
<Xtnsn>
<PlcAndNm>ext_1</PlcAndNm>
<Txt>1234</Txt>
</Xtnsn>
<Xtnsn>
<PlcAndNm>ext_2</PlcAndNm>
<Txt>ABC</Txt>
</Xtnsn>
</RedOrdrV03>
</Document>
<Document>
<RedOrdrV03>
<MsgId>
<Id>2</Id>
</MsgId>
<Xtnsn>
<PlcAndNm>ext_1</PlcAndNm>
<Txt>9876</Txt>
</Xtnsn>
<Xtnsn>
<PlcAndNm>ext_2</PlcAndNm>
<Txt>DEF</Txt>
</Xtnsn>
</RedOrdrV03>
</Document>
</root>
I have tried the expression //Xtnsn[PlcAndNm="ext_1"]/Txt, but I received an error.
This is the code:
use XML::Twig::XPath;
my $subelt_count = 1;
my @processed_elements;
my $xmlfile = 'c:/test_file.xml';
my $parser = XML::Twig->new(
    twig_roots => { 'RedOrdrV03' => \&process_xml } ,
    end_tag_handlers => { 'Document' },
);
$parser->parsefile($xmlfile);
sub process_xml {
    my ( $twig, $elt ) = @_;
    push( @processed_elements, $elt );
    if ( @processed_elements >= $subelt_count ) {
        my $MsgId = $twig->findvalue('RedOrdrV03/MsgId/Id');
        my $Xtnsn_Txt1 = $twig->findvalue('//Xtnsn[PlcAndNm="ext_1"]/Txt');
        print "MsgId: $MsgId - Xtnsn_Txt1: $Xtnsn_Txt1\n";
    }
    $_->delete for @processed_elements;
    @processed_elements = ();
    $twig->purge;
}
Is there a simple way of using XPath to obtain the value?
I know that one possibility is something like:
my $Xtnsn_Txt1 = $twig->first_elt( sub { $_[0]->tag eq 'PlcAndNm' && $_[0]->text eq 'ext_1' })->next_sibling()->text();
but I would prefer to use the simplest XPath syntax.
Thanks in advance for your help!
You can use this:
my $Xtnsn_Txt1 = $twig->findvalue('//Xtnsn/PlcAndNm[string()="ext_1"]/../Txt');
Another approach could be:
//Txt[preceding-sibling::PlcAndNm[.="ext_1"]]
You can also modify your XPath expression slightly to see whether this works:
//Xtnsn[./PlcAndNm[contains(.,"ext_1")]]/Txt
EDIT: This works fine with the original XML::XPath module:
use XML::XPath;
use XML::XPath::Node::Element;
my $xp = XML::XPath->new(filename => 'pathtoyour.xml');
my $nodeset = $xp->find('//Xtnsn[PlcAndNm="ext_1"]/Txt');
foreach my $node ($nodeset->get_nodelist) {
print XML::XPath::Node::Element::string_value($node),"\n\n";
}
Output : 1234 9876
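For comparison, the question's original predicate can also be evaluated with XML::LibXML. This is a swapped-in parser, not the XML::Twig approach the question asks about, and is only a minimal sketch: the helper name ext1_by_msgid and the trimmed-down sample document are assumptions made for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed-down copy of the question's document (illustrative only).
my $xml = <<'XML';
<root>
  <Document><RedOrdrV03>
    <MsgId><Id>1</Id></MsgId>
    <Xtnsn><PlcAndNm>ext_1</PlcAndNm><Txt>1234</Txt></Xtnsn>
    <Xtnsn><PlcAndNm>ext_2</PlcAndNm><Txt>ABC</Txt></Xtnsn>
  </RedOrdrV03></Document>
  <Document><RedOrdrV03>
    <MsgId><Id>2</Id></MsgId>
    <Xtnsn><PlcAndNm>ext_1</PlcAndNm><Txt>9876</Txt></Xtnsn>
    <Xtnsn><PlcAndNm>ext_2</PlcAndNm><Txt>DEF</Txt></Xtnsn>
  </RedOrdrV03></Document>
</root>
XML

# Hypothetical helper: maps each message Id to the Txt value of the
# Xtnsn element whose PlcAndNm child equals "ext_1".
sub ext1_by_msgid {
    my ($doc) = @_;
    my %txt;
    for my $order ( $doc->findnodes('//RedOrdrV03') ) {
        # Relative paths are evaluated against the current RedOrdrV03 node.
        my $id = $order->findvalue('MsgId/Id');
        $txt{$id} = $order->findvalue('Xtnsn[PlcAndNm="ext_1"]/Txt');
    }
    return %txt;
}

my %found = ext1_by_msgid( XML::LibXML->load_xml( string => $xml ) );
print "MsgId: $_ - Xtnsn_Txt1: $found{$_}\n" for sort keys %found;
```

Unlike the twig_roots handler in the question, this loads the whole document into memory, so it only suits files of moderate size.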

unable to parse xml file using registered namespace

I am using XML::LibXML to parse an XML file. There seems to be some problem with using a registered namespace while accessing the node elements. I am planning to convert this XML data into a CSV file, so I am trying to access each and every element. To start with, I tried extracting the attribute values of the <country> and <state> tags. Below is the code I have come up with, but I am getting an error saying XPath error : Undefined namespace prefix.
use strict;
use warnings;
use Data::Dumper;
use XML::LibXML;
my $XML=<<EOF;
<DataSet xmlns="http://www.w3schools.com" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd">
<exec>
<survey_region ver="1.1" type="x789" date="20160312"/>
<survey_loc ver="1.1" type="x789" date="20160312"/>
<note>Population survey</note>
</exec>
<country name="ABC" type="MALE">
<state name="ABC_state1" result="PASS">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
Some random text
contained here
]]></comment>
</state>
</country>
<country name="XYZ" type="MALE">
<state name="XYZ_state2" result="FAIL">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
any random text data
]]></comment>
</state>
</country>
</DataSet>
EOF
my $parser = XML::LibXML->new();
my $doc = $parser->parse_string($XML);
my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerNs('x','http://www.w3schools.com');
foreach my $camelid ($xc->findnodes('//x:DataSet')) {
    my $country_name = $camelid->findvalue('./x:country/@name');
    my $country_type = $camelid->findvalue('./x:country/@type');
    my $state_name = $camelid->findvalue('./x:state/@name');
    my $state_result = $camelid->findvalue('./x:state/@result');
    print "state_name ($state_name)\n";
    print "state_result ($state_result)\n";
    print "country_name ($country_name)\n";
    print "country_type ($country_type)\n";
}
Update
If I remove the namespace from the XML and change my XPath slightly, it seems to work. Can someone help me understand the difference?
foreach my $camelid ($xc->findnodes('//DataSet')) {
    my $country_name = $camelid->findvalue('./country/@name');
    my $country_type = $camelid->findvalue('./country/@type');
    my $state_name = $camelid->findvalue('./country/state/@name');
    my $state_result = $camelid->findvalue('./country/state/@result');
    print "state_name ($state_name)\n";
    print "state_result ($state_result)\n";
    print "country_name ($country_name)\n";
    print "country_type ($country_type)\n";
}
This would be my approach
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $XML=<<EOF;
<DataSet xmlns="http://www.w3schools.com" xmlns:xsi="https://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3schools.com note.xsd">
<exec>
<survey_region ver="1.1" type="x789" date="20160312"/>
<survey_loc ver="1.1" type="x789" date="20160312"/>
<note>Population survey</note>
</exec>
<country name="ABC" type="MALE">
<state name="ABC_state1" result="PASS">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
Some random text
contained here
]]></comment>
</state>
</country>
<country name="XYZ" type="MALE">
<state name="XYZ_state2" result="FAIL">
<info>
<type>literacy rate comparison</type>
</info>
<comment><![CDATA[
any random text data
]]></comment>
</state>
</country>
</DataSet>
EOF
my $parser = XML::LibXML->new();
my $tree = $parser->parse_string($XML);
my $root = $tree->getDocumentElement;
my @country = $root->getElementsByTagName('country');
foreach my $citem(@country){
    my $country_name = $citem->getAttribute('name');
    my $country_type = $citem->getAttribute('type');
    print "Country Name -- $country_name\nCountry Type -- $country_type\n";
    my @state = $citem->getElementsByTagName('state');
    foreach my $sitem(@state){
        my @info = $sitem->getElementsByTagName('info');
        my $state_name = $sitem->getAttribute('name');
        my $state_result = $sitem->getAttribute('result');
        print "State Name -- $state_name\nState Result -- $state_result\n";
        foreach my $i (@info){
            my $text = $i->getElementsByTagName('type');
            print "Info --- $text\n";
        }
    }
    print "\n";
}
Of course you can manipulate the data any way you'd like. If you are parsing from a file, change parse_string to parse_file.
For the individual elements in the XML, use getElementsByTagName to get the elements within the tags. This should be enough to get you going.
There seem to be two small mistakes here.
1. Call findvalue on the XPathContext object, passing the context node as a parameter.
2. name is an attribute of country, not a node.
Therefore try:
my $country_name = $xc->findvalue('./x:country/@name', $camelid );
Update for the updated question ("if I remove the name space from XML and change my XPath slightly it seems to work. Can someone help me understand the difference."):
To understand what happens here, have a look at the section NOTE ON NAMESPACES AND XPATH in the XML::LibXML::Node documentation.
In your case, $camelid->findvalue('./x:state/@name'); calls findvalue on a node.
But, as the documentation says: "The recommended way is to use the XML::LibXML::XPathContext module to define an explicit context for XPath evaluation, in which a document independent prefix-to-namespace mapping can be defined." That is what I did above.
Conclusion:
Calling find on a node will only work if the root element has no namespace
(or if you use the same prefix as in the XML document, if there is one).
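The advice above can be sketched end to end as follows. This is a minimal illustration rather than a definitive implementation: the helper name country_rows and the shortened sample document are assumptions, and every XPath call goes through the XPathContext object so that the registered x prefix is always known.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened version of the question's document (illustrative only).
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<DataSet xmlns="http://www.w3schools.com">
  <country name="ABC" type="MALE">
    <state name="ABC_state1" result="PASS"/>
  </country>
  <country name="XYZ" type="MALE">
    <state name="XYZ_state2" result="FAIL"/>
  </country>
</DataSet>
XML

my $xc = XML::LibXML::XPathContext->new($doc);
$xc->registerNs( x => 'http://www.w3schools.com' );

# Hypothetical helper: one CSV-style row per state. The second argument of
# findnodes is the context node, so the relative path is resolved against it.
sub country_rows {
    my ($xc) = @_;
    my @rows;
    for my $country ( $xc->findnodes('//x:country') ) {
        for my $state ( $xc->findnodes( 'x:state', $country ) ) {
            push @rows, join ',',
                $country->getAttribute('name'),
                $country->getAttribute('type'),
                $state->getAttribute('name'),
                $state->getAttribute('result');
        }
    }
    return @rows;
}

print "$_\n" for country_rows($xc);
```

Attributes without a prefix are in no namespace, which is why getAttribute('name') works here without any namespace handling.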

Complex XML parsing with Perl and LIBXML

I have XML:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="MeasDataCollection.xsl"?>
<measCollecFile xmlns="">
<fileHeader fileFormatVersion="32.435 V7.2.0">
</fileHeader>
<measData>
<managedElement localDn="bs=8" swVersion="R21A"/>
<measInfo measInfoId="CORE,SIP_session_statistics">
<measType p="1">CPUUSAGE</measType>
<measType p="2">CPUMEM</measType>
<measType p="3">SYSMEM</measType>
<measValue measObjLdn="SGC.bsNo=17,networkRole=2">
<r p="1">10</r>
<r p="2">20</r>
<r p="3">30</r>
</measValue>
<measValue measObjLdn="SGC.bsNo=18,networkRole=2">
<r p="1">40</r>
<r p="2">50</r>
<r p="3">60</r>
</measValue>
</measInfo>
</measData>
</measCollecFile>
QUESTION:
I want to extract the 40 from the <r p="1">40</r> element. The only things given are <measType p="1">CPUUSAGE</measType> and <measValue measObjLdn="SGC.bsNo=18,networkRole=2">,
i.e. I only know that I need to find the CPUUSAGE of bsNo=18. The order of the data is always maintained.
Here is what I have tried so far:
my $qry="//measInfo[measType/text() = 'CPUUSAGE']/measValue";
my @nodes= $conn->findnodes($qry);
foreach my $vnode (@nodes) {
    if ($vnode->getAttribute('measObjLdn') =~ /'bsNo=18'/) {
        foreach my $node ($vnode) {
            foreach my $p ($node->getChildnodes) {
                if (ref($p)=~'Element'){
                    $no=$p->textContent;
                    print $no; # this prints the value of all the <r> elements
                }
            }
        }
    }
}
My challenge is that there can be many elements like CPUUSAGE, CPUMEM, and so on, and I don't know how to reach the <r> element in the corresponding position for a given measValue attribute (bsNo=18).
Subsequently, I want to modify that 40 to some other desired value.
Your Perl code can't work because you match the attribute value against 'bsNo=18' including single quotes.
If you want to find the r element with the same p attribute as the CPUUSAGE node, you could either try the single XPath expression by ikegami or something like the following:
for my $type_node ($conn->findnodes('//measInfo/measType[.="CPUUSAGE"]')) {
    my $p = $type_node->getAttribute('p');
    my $qry = <<"EOF";
        ..
        /measValue[contains(concat(\@measObjLdn, ','), 'bsNo=18,')]
        /r[\@p='$p']
EOF
    for my $r_node ($type_node->findnodes($qry)) {
        print $r_node->textContent, "\n";
    }
}
This first loops over all measType nodes whose content is CPUUSAGE, gets the p attribute then finds all the corresponding r nodes. This approach should be more efficient than a single XPath query.
To find the r node by position and modify its contents, try:
for my $type_node ($conn->findnodes('//measInfo/measType[.="CPUUSAGE"]')) {
    my $pos = $type_node->findvalue('count(preceding-sibling::measType) + 1');
    my $qry = <<"EOF";
        ..
        /measValue[contains(concat(\@measObjLdn, ','), 'bsNo=18,')]
        /r[$pos]
EOF
    for my $r_node ($type_node->findnodes($qry)) {
        $r_node->removeChildNodes;
        $r_node->appendText('50');
    }
}
print $conn->toString;
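For reference, the lookup by p attribute can also be written as one XPath 1.0 expression. This is only one possible form and not necessarily the expression ikegami posted. The helper name find_r and the shortened sample document are assumptions for illustration, and the sketch assumes the interpolated values contain no quote characters.

```perl
use strict;
use warnings;
use XML::LibXML;

# Shortened version of the question's document (illustrative only).
my $conn = XML::LibXML->load_xml( string => <<'XML' );
<measCollecFile xmlns="">
  <measData>
    <measInfo measInfoId="CORE,SIP_session_statistics">
      <measType p="1">CPUUSAGE</measType>
      <measType p="2">CPUMEM</measType>
      <measValue measObjLdn="SGC.bsNo=17,networkRole=2">
        <r p="1">10</r>
        <r p="2">20</r>
      </measValue>
      <measValue measObjLdn="SGC.bsNo=18,networkRole=2">
        <r p="1">40</r>
        <r p="2">50</r>
      </measValue>
    </measInfo>
  </measData>
</measCollecFile>
XML

# Hypothetical helper: returns the <r> nodes whose p attribute equals the
# p attribute of the named measType, within the measValue for $bs_no.
sub find_r {
    my ( $doc, $type, $bs_no ) = @_;
    my $qry = sprintf
        q{//measInfo/measValue[contains(concat(@measObjLdn, ','), 'bsNo=%s,')]}
      . q{/r[@p = ../../measType[. = '%s']/@p]},
        $bs_no, $type;
    return $doc->findnodes($qry);
}

print $_->textContent, "\n" for find_r( $conn, 'CPUUSAGE', 18 );
```

The comparison `@p = ../../measType[...]/@p` is true when the r element's p value equals the p value of any matching measType, so no position arithmetic is needed.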

print lines between x and y from the log using Perl

I have a log file that contains some XML messages like...
<fixsim xyz='tststtsts'>
<name test="test1">
<time t=234>
</time>
</name>
</fixsim>
here some normal log text
whoiwoei
blsdbndsnb
<fixsim xyz='tssts'
<name test="test2"
<time t=234>
</time>
</name>
</fixsim>
and so on....
From the above log file I want to grab an XML message (from <fixsim> to </fixsim>) that meets some condition. For example,
I want the XML message having test="test2", so as output I should get
<fixsim xyz='tssts'
<name test="test2"
<time t=234>
</time>
</name>
</fixsim>
The following will get the XML docs:
process($_) for $log =~ m{<fixsim.*?</fixsim>}sg;
and so would
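my $xml;
while (<$log_fh>) {
    if ( my $count = m{<fixsim} .. m{</fixsim>} ) {
        $xml .= $_;
        # The range operator's sequence number ends in "E0" on the last line.
        if ($count =~ /E0\z/) {
            process($xml);
            $xml = undef;
        }
    }
}
# Handle a message that was still open when the log ended.
process($xml) if defined($xml);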
Once you got the XML, you can extract the field you need using your favorite XML parser.
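The sample messages in the question are not well-formed XML (unquoted attribute values, unclosed tags), so a strict parser would likely reject them. As a hedged alternative for that case, the extracted blocks can be filtered with a plain regular expression. This is a minimal sketch; the helper name matching_blocks is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every <fixsim>...</fixsim> block in $log
# that contains the attribute test="$wanted".
sub matching_blocks {
    my ( $log, $wanted ) = @_;
    return grep { /\btest="\Q$wanted\E"/ } $log =~ m{<fixsim.*?</fixsim>}sg;
}

my $log = <<'LOG';
<fixsim xyz='tststtsts'>
<name test="test1">
<time t=234>
</time>
</name>
</fixsim>
here some normal log text
<fixsim xyz='tssts'
<name test="test2"
<time t=234>
</time>
</name>
</fixsim>
LOG

print "$_\n" for matching_blocks( $log, 'test2' );
```

The \Q...\E pair quotes any regex metacharacters in the wanted value, so it is matched literally.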

Dropdown-Menu with optgroup

I am trying to create a dynamic dropdown menu that receives its entries from an XML file at script startup.
First I tried a static version like this:
Tr(td([popup_menu( -name=>'betreff', -values=>[optgroup(-name=>'Mädels',
-values=>['Susi','Steffi',''], -labels=>{'Susi'=>'Petra','Steffi'=>'Paula'})
,optgroup(-name=>'Jungs', -values=>['moe', 'catch',''])])]));
That worked fine.
The problem starts when I try to put the -values parameter of popup_menu into a scalar variable.
It should look somewhat like this:
$popup_values = "[optgroup(-name=>'Mädels', -values=>['Susi','Steffi',''],
-labels=>{'Susi'=>'Petra','Steffi'=>'Paula'}),optgroup(-name=>'Jungs',
-values=>['moe', 'catch',''])]"
or with single quotation marks.
The goal is to build that string by concatenating the syntax-corrected elements of the XML file. That's because I do not know a priori how many optgroups, or list elements within the optgroups, will exist.
Any ideas?
Thanks in advance
Jochen
So you have an XML file which you use to generate that string? Why not directly generate the data structure necessary for the popup_menu call? It's just an array (you can call optgroup while "analysing" the XML file).
If you really want to use the string solution, you could use eval to transform the string into the data structure, though evaluating a string built from external data has security implications.
Reading From XML-File
Here's an example of how to transform the XML into optgroups; this of course depends on what your XML file looks like.
use strict;
use warnings;
use XML::Simple;
use CGI qw/:standard/;
my $xmlString = join('', <DATA>);
my $xmlData = XMLin($xmlString);
my @popup_values;
foreach my $group (keys(%{$xmlData->{group}})) {
    my (@values, %labels);
    my $options = $xmlData->{group}->{$group}->{opt};
    foreach my $option (keys(%{$options})) {
        push @values, $option;
        if(exists($options->{$option}->{label}) &&
           '' ne $options->{$option}->{label}) {
            $labels{$option} = $options->{$option}->{label};
        }
    }
    push @popup_values, optgroup(-name => $group,
                                 -labels => \%labels,
                                 -values => \@values
    );
}
print popup_menu(-name=>'betreff', -values=> \@popup_values);
__DATA__
<?xml version="1.0" encoding="UTF-8" ?>
<dropdown>
<group name="Mädels">
<opt name="Susi" label="Petra"/>
<opt name="Steffi" label="Paula"/>
<opt name="" />
</group>
<group name="Jungs">
<opt name="moe" />
<opt name="catch" />
<opt name="" />
</group>
</dropdown>