XML::Twig and findnodes - perl

I have the following XML snippet:
<a>
<b> textb </b>
<c> textc </c>
<d> textd </d>
</a>
<a>
<b> textb </b>
<c> textc </c>
<d> textd </d>
</a>
I use XML::Twig to parse it as below:
my @c = map { $_->text."\n" } $_->findnodes('./a/');
and get textbtextctextd as one element of the array. Is there an option to make findnodes return
textb, textc, textd as 3 array elements instead of one?

Use the star at the end of the expression:
$_->findnodes( './a/*');
The '*' matches any tag, so you get the 3 child nodes - your current example only matches the 'a', and its text is the concatenation of the text of the nested elements.
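To make the difference concrete, here is a minimal sketch, assuming XML::Twig is installed. The <root> wrapper is made up for this example so that the sample is well-formed, and trimmed_text is used only to drop the padding spaces around each value.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumption: a made-up <root> wrapper keeps the sample well-formed.
my $xml = <<'XML';
<root>
<a>
<b> textb </b>
<c> textc </c>
<d> textd </d>
</a>
</root>
XML

my $twig = XML::Twig->new;
$twig->parse($xml);
my $root = $twig->root;

# './a' matches the <a> element itself: one node, all nested text combined.
my @whole = map { $_->trimmed_text } $root->findnodes('./a');

# './a/*' matches each child of <a>: one node per child element.
my @children = map { $_->trimmed_text } $root->findnodes('./a/*');

print scalar(@whole), " node for ./a\n";
print "$_\n" for @children;
```

The second call is the one that yields a separate array element per child.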

In XML::Twig 3.39 (and above) you can use findvalue to get an array of strings.
my @c = $_->findvalue('./a/');

Related

delete string between two strings in one line

I am trying to delete everything between angle brackets <>. I can do it if a line has only one <> pair, but if a line has more than one, it deletes everything from the first < to the last >.
echo "hi, <how> are you" | sed 's/<.*>//'
result: hi, are you
echo "hi, <how> are <you>? " | sed 's/<.*>//'
result: hi, ?
The first echo works fine, but if a sentence has more than one <>, the pairs are not handled separately.
Expected input: 1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>
Expected outcome: 1 2 3 4 .... 1000
Thanks
Using awk:
# using gsub - recommended
$ echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk 'gsub(/<[^>]*>/,"")'
1 2 3 4 ...... 1000
# OR using FS and OFS
$ echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk -F'<[^>]*>' -v OFS='' '$1=$1'
1 2 3 4 ...... 1000
Following awk will be helpful to you.
echo "hi, <how> are <you>? " | awk '{for(i=1;i<=NF;i++){if($i~/<.*>/){$i=""}}} 1'
OR
echo "1 <a> 2 <b> 3 <c> 4 <d> ...... 1000 <n>" | awk '{for(i=1;i<=NF;i++){if($i~/<.*>/){$i=""}}} 1'
Explanation: the for loop goes through every field of the line, from i=1 up to NF (the number of fields). If a field matches the regex <.*> (meaning it contains a <...> tag), it is set to an empty string.
* matches zero or more times and is greedy, so <.*> runs to the last > on the line. Use a negated character class instead: <[^>]*>
echo "hi, <how> are <you>? " | sed 's/<[^>]*>//g'
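The same contrast can be reproduced in Perl. This is only a sketch; strip_greedy and strip_tags are hypothetical helper names introduced for the comparison.

```perl
use strict;
use warnings;

# Greedy pattern from the question, kept only for comparison:
# .* runs from the first '<' to the last '>' on the line.
sub strip_greedy {
    my ($line) = @_;
    (my $out = $line) =~ s/<.*>//;
    return $out;
}

# Negated character class: [^>]* cannot cross a '>', so each
# <...> group is matched and removed on its own.
sub strip_tags {
    my ($line) = @_;
    (my $out = $line) =~ s/<[^>]*>//g;
    return $out;
}

print strip_greedy('hi, <how> are <you>? '), "\n";
print strip_tags('hi, <how> are <you>? '), "\n";
```

Note that the spaces on either side of each removed group are left in place, so the result contains doubled spaces.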

Remove line break characters in multi-line XML element

On a Unix system I have an input text file containing long multi-line strings.
I now want to remove line breaks only between two patterns (<remarks> and </remarks>) which can be on different lines.
Example input file:
text1 text2 <remarks> text3
text4 text5 </remarks> text6 text7 text8
Result output for the above input file should be:
text1 text2 <remarks> text3 text4 text5 </remarks> text6 text7 text8
I would prefer to use sed or Perl or maybe awk to do the job.
I do not see a solution as the newlines can happen "randomly" and text is just some log messages.
Here is a more detailed look of the input file I need to process. It does not contain a root XML section, but for testing I might just add one manually. Also there may be many "remarks" sections.
Inputfile Snippet (as it is very long), Filename is test:
<paymentTerm keyValue1="8" objectType="PAYMENTTERM" />
<paymentType keyValue1="20" objectType="PAYMENTTYPE" />
<priceList keyValue1="1" objectType="PRICELIST" />
<remarks>Zollanmeldung ab 250 €
Lager Adresse:
Hessen-Ring 456
D-64546 Mörfelden-Walldorff
eine Stunde vor Ankunft melden unter Mobile
Neu Spedition
A&R Logistics Group
Storkenburgstrasse 99
D-62546 Mörfelden-Walldorf
www.asp.de</remarks>
<salesPersons>
<PERSON keyValue1="2" keyValue2="SALESEMPLOYEE" objectType="PERSON" />
</salesPersons>
<shippingType keyValue1="5" objectType="SHIPPINGTYPE" />
As stated above I want to remove the linebreaks ONLY between the patterns "remarks" and "/remarks".
I tried the Perl XML Parsing suggested by borodin like this:
use strict;
use warnings 'all';
use XML::Twig;
use constant XML_FILE => 'test';
my $twig = XML::Twig->new(
twig_handlers => {
remarks => sub { $_->set_text($_->trimmed_text) }
}
);
$twig->parsefile(XML_FILE);
$twig->print;
It works, but prints everything on one line.
With GNU awk for multi-char RS:
$ awk -v RS='</?remarks>' -v ORS= '!(NR%2){gsub(/\n/,OFS)} {print $0 RT}' file
text1 text2 <remarks> text3 text4 text5 </remarks> text6 text7 text8
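Since the question also asks about Perl, a roughly equivalent text-only approach is sketched below. It assumes the whole input fits in memory and that <remarks> elements are never nested; join_remarks is a hypothetical helper name. Like the awk version, it treats the file as plain text rather than parsing the XML.

```perl
use strict;
use warnings;

# Hypothetical helper: replace line breaks with a space, but only inside
# <remarks>...</remarks> spans. Text outside those spans is left untouched.
sub join_remarks {
    my ($text) = @_;
    $text =~ s{(<remarks>)(.*?)(</remarks>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $body =~ s/\r?\n/ /g;
        "$open$body$close";
    }gse;
    return $text;
}

my $input = "text1 text2 <remarks> text3\n"
          . "text4 text5 </remarks> text6 text7 text8\n";
print join_remarks($input);
```

The /s modifier lets .*? span line breaks, and the non-greedy quantifier stops at the first closing tag so that several remarks sections are handled independently.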
XML can represent the same information in many different ways, and it is always a risk to try processing it using regular expressions. It is far better to use a proper XML module to process XML data. This solution uses XML::Twig.
In the constructor for the $twig object you can specify a callback which is called automatically every time a given XML element is encountered in the input.
The trimmed_text method removes leading and trailing whitespace from the text of the element, and turns any internal whitespace sequences, including line breaks, into a single space. That is exactly what you are asking for here, so a call to set_text is all that is necessary to update the string.
The file to be processed is specified by the XML_FILE constant and you should modify that to specify the path to your own data file. The modified XML is printed to STDOUT.
use strict;
use warnings 'all';
use open qw/ :std :encoding(UTF-8) /;
use XML::Twig;
use constant XML_FILE => 'remarks.xml';
my $twig = XML::Twig->new(
keep_spaces => 1,
twig_handlers => {
remarks => sub { $_->set_text($_->trimmed_text) }
}
);
$twig->parsefile(XML_FILE);
$twig->print;
input
Your sample data is invalid XML, so I have edited it to look like this. I have added the XML declaration that you said in a comment that you had, and I have added a root element <data>
<?xml version="1.0" encoding="UTF-8"?>
<data>
<paymentTerm keyValue1="8" objectType="PAYMENTTERM" />
<paymentType keyValue1="20" objectType="PAYMENTTYPE" />
<priceList keyValue1="1" objectType="PRICELIST" />
<remarks>Zollanmeldung ab 250 €
Lager Adresse:
Hessen-Ring 456
D-64546 Mörfelden-Walldorff
eine Stunde vor Ankunft melden unter Mobile
Neu Spedition
A&R Logistics Group
Storkenburgstrasse 99
D-62546 Mörfelden-Walldorf
www.asp.de</remarks>
<salesPersons>
<PERSON keyValue1="2" keyValue2="SALESEMPLOYEE" objectType="PERSON" />
</salesPersons>
<shippingType keyValue1="5" objectType="SHIPPINGTYPE" />
</data>
output
<?xml version="1.0" encoding="UTF-8"?>
<data>
<paymentTerm keyValue1="8" objectType="PAYMENTTERM"/>
<paymentType keyValue1="20" objectType="PAYMENTTYPE"/>
<priceList keyValue1="1" objectType="PRICELIST"/>
<remarks>Zollanmeldung ab 250 € Lager Adresse: Hessen-Ring 456 D-64546 Mörfelden-Walldorff eine Stunde vor Ankunft melden unter Mobile Neu Spedition A&R Logistics Group Storkenburgstrasse 99 D-62546 Mörfelden-Walldorf www.asp.de</remarks>
<salesPersons>
<PERSON keyValue1="2" keyValue2="SALESEMPLOYEE" objectType="PERSON"/>
</salesPersons>
<shippingType keyValue1="5" objectType="SHIPPINGTYPE"/>
</data>
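For readers who want to see the whitespace handling in isolation, the behaviour this answer attributes to trimmed_text can be approximated in plain Perl. This is a sketch of the described effect, not XML::Twig's own code, and normalize_space is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper that mimics the behaviour described above:
# collapse every whitespace run (line breaks included) to one space,
# then drop the single space that may remain at either end.
sub normalize_space {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

print normalize_space("Lager Adresse:\nHessen-Ring 456\n"), "\n";
```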

Get NodeSet Size in XML::XPath

I have an XML document and I want to get the size of a node set in it.
XML
<a>
<b>
<c>data</c>
<c>data</c>
<c>data</c>
</b>
</a>
I want to get the count of c elements in the b tag.
my $obj = XML::XPath->new(xml => $xml);
print size(($obj->find('/a/b'));
I am not able to get the count of c in this XML
size is a method, not a function. Also, your XPath expression matches the b node, not its children.
The following works:
my $cs = $obj->find('/a/b/c');
print $cs->size, "\n";
Or, shorter, without the intermediate variable:
print $obj->find('/a/b/c')->size, "\n";
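Put together as a complete script, the fix might look like the sketch below, assuming XML::XPath is installed. The variable names are illustrative only.

```perl
use strict;
use warnings;
use XML::XPath;

my $xml = '<a><b><c>data</c><c>data</c><c>data</c></b></a>';
my $xp  = XML::XPath->new(xml => $xml);

# find() on a location path returns a node-set object; size is its method.
my $count = $xp->find('/a/b/c')->size;

# findnodes() in list context returns the matching nodes themselves.
my @nodes = $xp->findnodes('/a/b/c');

print "size: $count, list: ", scalar(@nodes), "\n";
```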

Parse fixed-width files

I have a lot of text files with fixed-width fields:
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street
The rest of the files are in a similar format, where the <c> will mark the beginning of a column, but they have various (unknown) column & space widths. What's the best way to parse these files?
I tried using Text::CSV, but since there's no delimiter it's hard to get a consistent result (unless I'm using the module wrong):
my $csv = Text::CSV->new();
$csv->sep_char (' ');
while (<FILE>){
if ($csv->parse($_)) {
my @columns = $csv->fields();
print $columns[1] . "\n";
}
}
As user604939 mentions, unpack is the tool to use for fixed width fields. However, unpack needs to be passed a template to work with. Since you say your fields can change width, the solution is to build this template from the first line of your file:
my @template = map {'A'.length}        # convert each to 'A##'
               <DATA> =~ /(\S+\s*)/g;  # split first line into segments
$template[-1] = 'A*';                  # set the last segment to be slurpy
my $template = "@template";
print "template: $template\n";
my @data;
while (<DATA>) {
    push @data, [unpack $template, $_]
}
use Data::Dumper;
print Dumper \@data;
__DATA__
<c>     <c>       <c>
Dave    Thomas    123 Main
Dan     Anderson  456 Center
Wilma   Rainbow   789 Street
which prints:
template: A8 A10 A*
$VAR1 = [
[
'Dave',
'Thomas',
'123 Main'
],
[
'Dan',
'Anderson',
'456 Center'
],
[
'Wilma',
'Rainbow',
'789 Street'
]
];
CPAN to the rescue!
DataExtract::FixedWidth not only parses fixed-width files, but (based on POD) appears to be smart enough to figure out column widths from header line by itself!
Just use Perl's unpack function. Something like this:
while (<FILE>) {
my ($first,$last,$street) = unpack("A9A25A50",$_);
<Do something ....>
}
Inside the unpack template, each "A" is followed by the width of its field, as in "A###".
There are a variety of other formats that you can use to mix and match with, that is, integer fields, etc...
If the file is fixed width, like mainframe files, then this should be the easiest.
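As a small illustration of why "A" suits this kind of data, the sketch below unpacks one sample row twice, with widths 8 and 10 chosen for this example. The row is built with sprintf so that the padding is explicit.

```perl
use strict;
use warnings;

# Build a fixed-width row: 8 columns, 10 columns, then the remainder.
my $line = sprintf '%-8s%-10s%s', 'Dan', 'Anderson', '456 Center';

# 'A' fields: trailing spaces are stripped from each value.
my @trimmed = unpack 'A8 A10 A*', $line;

# 'a' fields: the data is returned as is, padding included.
my @raw = unpack 'a8 a10 a*', $line;

print join('|', @trimmed), "\n";
print join('|', @raw), "\n";
```

Spaces inside the template are ignored, so 'A8 A10 A*' and 'A8A10A*' behave the same.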

How do I use Perl's XML::Twig to count multiple tags in XML?

I am using XML::Twig to parse my input xml using Perl.
I need to extract a particular node in this XML and validate that node to see if it has multiple <p> tags and then count the words in those <p> tags.
For example:
<XML>
<name>
</name>
<address>
<p id="1">a b c d </p>
<p id="2">y y y </p>
</address>
</XML>
Output:
Address has 2 paragraph tags with 7 words.
Any suggestions?
Here is one way to do it:
use strict;
use warnings;
use XML::Twig;
my $xfile = q(
<XML>
<name>
</name>
<address>
<p id="1">a b c d </p>
<p id="2">y y y </p>
</address>
</XML>
);
my $t = XML::Twig->new(
twig_handlers => { 'address/p' => \&addr}
);
my $pcnt = 0;
my $wcnt = 0;
$t->parse($xfile);
print "Address has $pcnt paragraph tags with $wcnt words.\n";
sub addr {
my ($twig, $add) = @_;
my @words = split /\s+/, $add->text();
$wcnt += scalar @words;
$pcnt++;
}
__END__
Address has 2 paragraph tags with 7 words.
XML::Twig has a dedicated website with documentation and a Tutorial to describe the handler technique used above.
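One detail of the word count in the handler: split /\s+/ returns an empty leading field when the text starts with whitespace, which would inflate the total. The sample paragraphs do not start with whitespace, so the answer's result is correct as given. A sketch of a variant that tolerates leading whitespace, with count_words as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: count whitespace-separated words in a string.
# split ' ' is Perl's special case that skips leading whitespace.
sub count_words {
    my ($text) = @_;
    my @words = split ' ', $text;
    return scalar @words;
}

print count_words('a b c d '), "\n";
print count_words(' y y y '), "\n";
```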