Perl XML::LibXML: how to access comment nodes - perl

For the life of me I can't figure out the proper code to access the comment lines in my XML file. Do I use findnodes, find, or getElementsByTagName (I doubt it)?
Am I even making the correct assumption that these comment lines are accessible? I would hope so, as I know I can add a comment.
The type number for a comment node is 8, so they must be parseable.
Ultimately, what I want to do is delete them.
my @nodes = $dom->findnodes("//*");
foreach my $node (@nodes) {
print $node->nodeType, "\n";
}
<TT>
<A>xyz</A>
<!-- my comment -->
</TT>

If all you need to do is produce a copy of the XML with comment nodes removed, then the first parameter of toStringC14N is a flag that says whether you want comments in the output. Omitting all parameters implicitly sets the first to a false value, so
$doc->toStringC14N
will reproduce the XML trimmed of comments. Note that the Canonical XML form specified by C14N doesn't include an XML declaration header. It is always XML 1.0 encoded in UTF-8.
If you need to remove the comments from the in-memory structure of the document before processing it further, then findnodes with the XPath expression //comment() will locate them for you, and unbindNode will remove them from the XML.
This program demonstrates both approaches.
use strict;
use warnings;
use XML::LibXML;
my $doc = XML::LibXML->load_xml(string => <<END_XML);
<TT>
<A>xyz</A>
<!-- my comment -->
</TT>
END_XML
# Print everything
print $doc->toString, "\n";
# Print without comments
print $doc->toStringC14N, "\n\n";
# Remove comments and print everything
$_->unbindNode for $doc->findnodes('//comment()');
print $doc->toString;
output
<?xml version="1.0"?>
<TT>
<A>xyz</A>
<!-- my comment -->
</TT>
<TT>
<A>xyz</A>
</TT>
<?xml version="1.0"?>
<TT>
<A>xyz</A>
</TT>
Update
To select a specific comment, you can add a predicate expression to the XPath selector. To find the specific comment in your example data you could write
$doc->findnodes('//comment()[. = " my comment "]')
Note that the text of the comment includes everything between the opening <!-- and the closing -->, so spaces are significant as shown in that call.
If you want to make things a bit more lax, you could use normalize-space, which removes leading and trailing whitespace, and contracts every sequence of whitespace within the string to a single space. Now you can write
$doc->findnodes('//comment()[normalize-space(.) = "my comment"]')
And the same call would find your comment even if it looked like this.
<!--
my
comment
-->
Finally, you can make use of contains, which, as you would expect, simply checks whether one string contains another. Using that you could write
$doc->findnodes('//comment()[contains(., "comm")]')
The one to choose depends on your requirement and your situation.
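To compare the three predicates side by side, here is a minimal sketch. The sample document is an assumption: it simply combines the single-line comment from the question with the multi-line variant shown above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed sample: the same comment text written in two layouts.
my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<TT>
<A>xyz</A>
<!-- my comment -->
<!--
my
comment
-->
</TT>
END_XML

# The three XPath expressions discussed above.
my %xpath = (
    exact      => '//comment()[. = " my comment "]',
    normalized => '//comment()[normalize-space(.) = "my comment"]',
    contains   => '//comment()[contains(., "comm")]',
);

# Count how many comment nodes each expression selects.
for my $name (sort keys %xpath) {
    my @found = $doc->findnodes($xpath{$name});
    printf "%-10s matched %d comment(s)\n", $name, scalar @found;
}
```

The exact comparison should select only the single-line comment, while the other two should select both.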

According to the XPath spec:
* is a test that matches element nodes of any name. Comment nodes aren't element nodes.
comment() is a test that matches comment nodes.
Untested:
for my $comment_node ($doc->findnodes('//comment()')) {
$comment_node->parentNode->removeChild($comment_node);
}
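The difference between the two node tests can be seen with a short sketch like the following, which runs both against the sample from the question. It assumes the :libxml export tag of XML::LibXML, which provides node type constants such as XML_COMMENT_NODE.

```perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);    # node type constants such as XML_COMMENT_NODE

my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<TT>
<A>xyz</A>
<!-- my comment -->
</TT>
END_XML

# '*' selects element nodes only; comment() selects comment nodes only.
my @elements = $doc->findnodes('//*');
my @comments = $doc->findnodes('//comment()');

printf "//* selected %d node(s), //comment() selected %d node(s)\n",
    scalar @elements, scalar @comments;

# Confirm the node type of each selected comment and show its text.
for my $node (@comments) {
    my $kind = $node->nodeType == XML_COMMENT_NODE ? 'comment' : 'other';
    printf "%s (type %d): '%s'\n", $kind, $node->nodeType, $node->data;
}
```

This is why the loop in the question only ever prints the type of element nodes: the comment is never part of the `//*` result.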

I know it's not XML::LibXML but here you have another way to remove comments easily with XML::Twig module:
#!/usr/bin/env perl
use warnings;
use strict;
use XML::Twig;
my $twig = XML::Twig->new(
pretty_print => 'indented',
comments => 'drop'
)->parsefile( shift )->print;
Run it like:
perl script.pl xmlfile
That yields:
<TT>
<A>xyz</A>
</TT>
The comments option also accepts the value process, which keeps the comments in the twig so that you can work with them using the XPath name #COMMENT.

How to prevent XML::LibXML to save modified xml using self-closing tag

The following working code reads my XML file, which contains lots of empty elements, applies two changes and saves it again under a different name.
But it also changes empty elements like <element></element> to self-closing tags like <element />, which is unwanted.
How can I save it without using self-closing tags?
Or, in other words, how do I tell XML::LibXML to keep empty elements as start/end tag pairs?
The original file is produced by a commercial application, which uses the style with empty elements, so I want to preserve that.
#! /usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $filename = 'out.xml';
my $dom = XML::LibXML->load_xml(location => $filename);
my $query = '//scalar[contains(@name, "partitionsNo")]/value';
for my $i ($dom->findnodes($query)) {
$i->removeChildNodes();
$i->appendText('16');
}
open my $out, '>', 'out2.xml';
binmode $out;
$dom->toFH($out);
# now out2.xml has only self-closing tags where previously
# were used empty elements
Unfortunately, XML::LibXML doesn't support libxml2's xmlsave module which has a flag to save without empty tags.
As a workaround you can add an empty text node to empty elements:
for my $node ($doc->findnodes('//*[not(node())]')) {
# Note that appendText doesn't work.
$node->appendChild($doc->createTextNode(''));
}
This is a bit costly for large documents, but I'm not aware of a better solution.
That said, the fragments <foo></foo> and <foo/> are both well-formed and semantically equivalent. Any XML parser or application that treats such fragments differently is buggy.
Note that some people believe the XML spec recommends using self-closing tags, but that's not exactly true. The XML spec says:
Empty-element tags may be used for any element which has no content, whether or not it is declared using the keyword EMPTY. For interoperability, the empty-element tag should be used, and should only be used, for elements which are declared EMPTY.
This means elements that are declared EMPTY in a DTD. For other elements, or if no DTD is present, the XML standard advises not to use self-closing tags ("and should only be used"). But this is only a non-binding recommendation for interoperability.
There is a package variable
$XML::LibXML::setTagCompression
Setting it to a true value forces all empty tags to be printed as <e></e>, while a false value forces <e/>.
See SERIALIZATION in the Parser documentation.
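A minimal sketch of how that variable could be used, assuming the installed XML::LibXML version still honors it. The sample document and the helper name to_string_expanded are made up for illustration. Using local keeps the setting from leaking into other code that serializes documents.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document with one empty element.
my $dom = XML::LibXML->load_xml(
    string => '<root><element></element><value>16</value></root>'
);

# Serialize with empty elements written as <e></e>.
# 'local' restores the previous value when the sub returns.
sub to_string_expanded {
    my ($doc) = @_;
    local $XML::LibXML::setTagCompression = 1;
    return $doc->toString;
}

my $expanded = to_string_expanded($dom);    # empty element kept as a tag pair
my $default  = $dom->toString;              # library default for comparison

print $expanded;
print $default;
```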

Remove non-digit characters perl

I have a file which has multiple quotes like the one below:
<verse-no>quote</verse-no>
<quote-verse>1:26,27 Man Created to Continually Develop</quote-verse>
<quote>When Adam came from the Creator’s hand, he bore, in his physical, mental, and
spiritual nature, a likeness to his Maker. “God created man in His own image”
(Genesis 1:27), and it was His purpose that the longer man lived the more fully
he should reveal this image—the more fully reflect the glory of the Creator. All
his faculties were capable of development; their capacity and vigor were
continually to increase. Ed 15
</quote>
I want to remove all the text after the verse reference in the <quote-verse>.....</quote-verse> line so that the end result will be <quote-verse>1:26,27</quote-verse>.
I have tried perl -pi.bak -e 's#\D*$<\/quote-verse>#<\/quote-verse>#g' file.txt
This does nothing. I am a beginner in perl (self taught) with less than 10 days experience. Please tell me what's wrong and how to proceed.
You have XML. Therefore you want an XML parser. XML::Twig is a good one.
The reason a lot of people say 'don't use regular expressions to parse XML' is that, whilst it does work in a limited scope, XML is a specification: certain things are valid and some are not. If you write code that's built on assumptions that aren't always true, what you end up with is brittle code: code that will break one day without warning, if someone alters their perfectly valid XML into a slightly different but still perfectly valid XML.
So with that in mind:
This works:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
sub quote_verse_handler {
my ( $twig, $quote ) = @_;
my $text = $quote->text;
$text =~ s/(\d)\D+$/$1/;
$quote->set_text($text);
}
my $parser = XML::Twig->new(
twig_handlers => { 'quote-verse' => \&quote_verse_handler },
pretty_print => 'indented'
);
#$parser -> parsefile ( 'your_file.xml' );
local $/;
$parser->parse(<DATA>);
$parser->print;
__DATA__
<xml>
<verse-no>quote</verse-no>
<quote-verse>1:26,27 Man Created to Continually Develop</quote-verse>
<quote>When Adam came from the Creator's hand, he bore, in his physical, mental, and
spiritual nature, a likeness to his Maker. "God created man in His own image"
(Genesis 1:27), and it was His purpose that the longer man lived the more fully
he should reveal this image-the more fully reflect the glory of the Creator. All
his faculties were capable of development; their capacity and vigor were
continually to increase. Ed 15
</quote>
</xml>
What this does is run through your file. Each time it encounters a quote-verse element it calls the handler, and gives it 'that bit' of the XML to do stuff with. We apply a regular expression to chop off the trailing bit of the line, and then update the XML accordingly.
Once parse is finished, we spit out the finished product.
You'll probably want to replace:
local $/;
$parser -> parse ( <DATA> );
with:
$parser -> parsefile ( 'your_file_name' );
You may also find:
$parser -> print_to_file( 'output_filename' );
to be useful.
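To see what the substitution in the handler does on its own, here is a small sketch that applies it to plain strings. The helper name trim_after_last_digit and the second sample string are made up for illustration.

```perl
use strict;
use warnings;

# Same substitution as in the handler: keep everything up to the
# last digit and drop the non-digit text that follows it.
sub trim_after_last_digit {
    my ($text) = @_;
    $text =~ s/(\d)\D+$/$1/;
    return $text;
}

my @samples = (
    '1:26,27 Man Created to Continually Develop',    # from the question
    '2:7 Part 2 of the Story',                       # hypothetical: digit in the title
);

printf "'%s' -> '%s'\n", $_, trim_after_last_digit($_) for @samples;
```

Note the limitation this exposes: the pattern cuts after the last digit in the string, so a title that itself contains a digit is only partly removed. If that can happen in your data, a pattern anchored on the verse reference at the start of the text may be a better fit.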

How to remove words with digits without removing a digit at the beginning of a string?

I'm doing some tweet sentiment analysis, and right now I'm trying to clean the data using perl on Ubuntu command line.
I have some data in the follow format:
sentiment, 'text'
Where sentiment = {0, 4} and text is any valid string.
Right now I'm having trouble removing data such as this:
0,'My 21yo son has finally graduated from college!'
4,'The NT2000 is an awesome product!'
4,'what is good88guy doing on my following list?'
I want the following to look like this after:
0,'My son has finally graduated from college!'
4,'The is an awesome product!'
4,'what is doing on my following list?'
I don't want to remove the sentiment and I also need to remove the yo. Any ideas how I can write this script?
You may want to try this:
s/ ?( |[a-z]+)\d+( |[a-z]+|)? ?/ /simg;
DEMO
http://regex101.com/r/zW2nJ3
Sounds like you want the following:
s/\w*\d\w*\s*//g;
Your statement that you don't want things removed "from the beginning" is a little confusing, but you'll have to add more information to get a better answer.
One of the easiest methods to communicate what you want is to create a list of before and after strings, trying to make each demonstrate a special case.
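Applied to a whole line, the substitution s/\w*\d\w*\s*//g would also delete the leading sentiment digit, because "0" on its own matches the pattern. One possible way around that, sketched below, is to split off the first field and apply the substitution only to the rest. The helper name clean_line is made up, and the sketch assumes every line contains a comma and that the sentiment field never does.

```perl
use strict;
use warnings;

# Sample lines taken from the question.
my @lines = (
    "0,'My 21yo son has finally graduated from college!'",
    "4,'The NT2000 is an awesome product!'",
    "4,'what is good88guy doing on my following list?'",
);

# Split at the first comma only, clean the text part, and rejoin.
sub clean_line {
    my ($line) = @_;
    my ( $sentiment, $text ) = split /,/, $line, 2;
    $text =~ s/\w*\d\w*\s*//g;    # drop every word that contains a digit
    return "$sentiment,$text";
}

print clean_line($_), "\n" for @lines;
```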
Since your recent comments I have understood your problem a little better.
The data format you describe must be processed using Text::CSV, so as to account for quoted fields and comma separators.
This program should suit your needs as far as I understand them. It has use autodie to avoid the need for hand-coding exceptions if the input file cannot be opened, and Text::CSV reads the data from the file, specifying single quotes as field delimiters.
I have used the code from my original answer to process each line of the file, as it provides the best flexibility if your requirements need to be refined.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $csv_proc = Text::CSV->new({ eol => $/, quote_char => "'" });
open my $fh, '<', 'myfile.txt';
while ( my $row = $csv_proc->getline($fh) ) {
my @fields = split ' ', $row->[1];
$row->[1] = join ' ', grep { not /\d/ } @fields;
$csv_proc->print(*STDOUT, $row);
}
output
0,'My son has finally graduated from college!'
4,'The is an awesome product!'
4,'what is doing on my following list?'

how to search and take particular text in perl

I have a folder that contains 'n' HTML files. I read each file and want to collect every <img /> tag into an array and then print the array, but nothing is printed. Can you help me? My code is here.
use strict;
use File::Basename;
use File::Path;
use File::Copy;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Excel';
print "Welcome to PERL program\n";
#print "\n\tProcessing...\n";
my $foldername = $ARGV[0];
opendir(DIR,$foldername) or die("Cannot open the input folder for reading\n");
my (@htmlfiles) = grep /\.html?$/i, readdir(DIR);
closedir(DIR);
@htmlfiles = grep !/(?:index|chapdesc|listdesc|listreview|addform|addform_all|pattern)\.html?$/i, @htmlfiles;
# print "HTML file is @htmlfiles";
my %fileimages;
my $search_for = 'img';
my $htmlstr;
for my $files (@htmlfiles)
{
if(-e "$foldername\\$files")
{
open(HTML, "$foldername\\$files") or die("Cannot open the html files '$files' for reading");
local undef $/;my $htmlstr=<HTML>;
close(HTML);
$fileimages{uc($2)}=[$1,$files] while($htmlstr =~/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi);
}
}
In command prompt.
perl findtext.pl "C:\viji\htmlfiles"
regards, viji
I would like to point out that parsing HTML with regexes is futile. See the epic https://stackoverflow.com/a/1732454/1521179 for the answer.
Your regex to extract image tags is quite broken. Instead of using a HTML parser and walking the tree, you search for a string that…
/<img id="([^"]*)" src="\.\/images\/[^t][^\/<>]*\/([^\.]+\.jpg)"/gi
begins with <img
after exactly one space, the sequence id=" is found. The contents of that attribute are captured if it is found, else the match fails. The closing " is consumed.
after exactly one space, the sequence src="./images/ is found,
followed by a character that is not t. (This allows for ", of course).
This is followed by any number of any characters that are not slashes or <> characters (This allows for ", again),
followed by a slash.
now capture this:
one or more characters that are not dots
followed by the suffix .jpg
after which " has to follow immediately.
false positives
Here is some data that your regex will match, where it shouldn't:
<ImG id="" src="./ImAgEs/s" alt="foo/bar.jpg"
So what is the image path you get? ./ImAgEs/s" alt="foo/bar.jpg may not be what you wanted.
<!-- <iMg id="" src="./images/./foobar.jpg" -->
Oops, I matched commented content. And the path does not contain a subfolder of ./images. The . folder is completely valid in your regex, but denotes the same folder. I could even use .., which would be the folder of your HTML file. Or I could use ./images/./t-rex/image.jpg, which would match a forbidden t-folder.
false negatives
Here is some data you would want, but that you won't get:
<img
id="you-cant-catch-me"
src='./images/x/awesome.jpg' />
Why? Newlines, but you only allow for single spaces between the attributes. Also, you don't allow for single quotes '.
<img src="./images/x/awesome.jpg" id="you-cant-catch-me" />
Why? I now have single spaces, but swapped the attributes. But both these fragments denote the exact same DOM and therefore should be considered equivalent.
Conclusion
Go to http://www.cpan.org/ and search for HTML and Tree. Use a module to parse your HTML, walk the tree and extract all matching nodes.
Also, add a print statement somewhere. I found a
use Data::Dumper;
print Dumper \%fileimages;
quite enlightening for debug purposes.
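As a rough sketch of the tree-walking approach, the following uses HTML::TreeBuilder from CPAN, which is one module such a search turns up. The sample HTML is made up, reusing the fragments discussed above, and the filename check is only an assumption about what the original regex was meant to capture.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up sample reusing the problem cases discussed above.
my $html = <<'END_HTML';
<html><body>
<img
id="you-cant-catch-me"
src='./images/x/awesome.jpg' />
<img src="./images/y/other.jpg" id="swapped" />
<!-- <img id="hidden" src="./images/z/commented.jpg" /> -->
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my %fileimages;
for my $img ( $tree->look_down( _tag => 'img' ) ) {
    my $src = $img->attr('src');
    next unless defined $src;
    my $id = $img->attr('id');
    $id = '' unless defined $id;

    # Assumption: key the result on the upper-cased .jpg file name.
    next unless $src =~ m{/([^/]+\.jpg)\z}i;
    $fileimages{ uc $1 } = [ $id, $src ];
}
$tree->delete;    # free the tree explicitly

print "$_ => @{ $fileimages{$_} }\n" for sort keys %fileimages;
```

Because the parser builds a real tree, attribute order, line breaks and quote style no longer matter, and the commented-out tag is not picked up with the default settings.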

Perl / LibXML : keep closing tags when serializing

By default, LibXML will render empty tags as <tag />, but I need to have it render them as <tag></tag>.
Is there an option I missed in the documentation, or do I have to tweak the output with regexp replacements (or any other solution you might know of)?
I'm looking for a better way of doing it in the place of:
$xml = $dom->serialize(0);
$xml =~ s/<([a-z]+)([^>]*?)\/>/<$1$2><\/$1>/gsi;
LibXML has a formerly documented feature, that might be considered deprecated as it's not in the documentation for the latest version, but it's still in the test files, so it might work.
All serialization functions understand the flag setTagCompression. if this Flag is set to 1 empty tags are displayed as <foo></foo> rather than <foo/>.
my $xml = do {
local $XML::LibXML::setTagCompression = 1;
$doc->toString();
};
How about:
use XML::LibXML;
my $x = XML::LibXML->new();
my $d = $x->load_xml(string => "<xml><foo/></xml>");
print $d->toString;
print qq{<?xml version="1.0"?>\n} . $d->toStringHTML();
yields:
<?xml version="1.0"?>
<xml><foo/></xml>
<?xml version="1.0"?>
<xml><foo></foo></xml>