XPath nodes text joined by br - dom

How can I join the text nodes that sit between br tags, using br as the separator?
Here is the XML:
<div>
text1.
<br>
text2.
<br>
text3.
<div>ad sense code</div>
<br>
text4.
<div>ad sense code</div>
<br>
textxx.
<br>
</div>
I need to get all the text nodes from text2 to textxx, joined by a br tag or \n\n.
I can get all the text using
//div/text()[position()>1] but it comes back joined without any separator, like this:
text1.text2.text3.text4.textxx.
while I want it like this:
text1.<br>text2.<br>text3.<br>text4.<br>textxx.<br>
In short, I need to keep the br tags.
I am using the Perl HTML::TreeBuilder::LibXML module.

XPath can be used (a) to select nodes from the input document, or (b) to compute atomic values such as strings, booleans, or numbers from the nodes in the input document. It can never (apart from rare edge cases) return nodes that weren't present in the input.
It's not entirely clear what you mean by your desired output of
text1.<br>text2.<br>text3.<br>text4.<br>textxx.<br>
Are you looking for this as a string? Or a sequence of text nodes and element nodes, interspersed?
Returning it as a string is possible in XPath 3.1 using the serialize() function, but in Perl you only have access to the venerable and limited XPath 1.0.
Returning it as a set of nodes isn't possible because the nodes aren't there in the source: the source contains text nodes that have values such as "__text1__" where underscores represent whitespace, and your desired output drops the whitespace.
You appear to be doing a transformation rather than merely a selection, so you are out of XPath territory and into XSLT.

This is the solution that did what I wanted in Perl:
use HTML::TreeBuilder::LibXML;

my $text = "";
my $tree = HTML::TreeBuilder::LibXML->new_from_content($content);
# Append each selected text node, followed by a <br> separator.
foreach my $node ($tree->findnodes("./div/text()[position()>1]")) {
    $text .= $node->findvalue('string(.)') . "<br>";
}
# Drop the trailing separator.
$text =~ s/<br>$//;
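For comparison, here is a minimal Python sketch of the same select-then-join idea. It is only an illustration, not the module's API: it assumes the markup is well-formed XML (so the sample writes br as <br/>), and the helper name direct_text_pieces is made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample from the question, with <br> written as <br/> so it parses as XML (assumption).
SAMPLE = """<div>
text1.
<br/>
text2.
<br/>
text3.
<div>ad sense code</div>
<br/>
text4.
<div>ad sense code</div>
<br/>
textxx.
<br/>
</div>"""


def direct_text_pieces(element):
    """Return the trimmed, non-empty text nodes that are direct children of element."""
    # ElementTree stores direct text as element.text plus each child's tail.
    raw = [element.text] + [child.tail for child in element]
    return [piece.strip() for piece in raw if piece and piece.strip()]


root = ET.fromstring(SAMPLE)
pieces = direct_text_pieces(root)
# Skip the first text node, then put the separator back between the rest.
joined = "<br>".join(pieces[1:])
print(joined)
```

Joining the list avoids having to strip a trailing separator afterwards.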


How to identify a character in a string?

I am trying to write PowerShell code that pulls one specific part out of a filename, for multiple files.
An example of a filename
20190902091031_202401192_50760_54206_6401.pdf
$Variable = $Filename.Substring(15,9)
Results:
202401192 (this is what I am after)
However in some instances the filename will be like below
20190902091031_20240119_50760_54206_6401.pdf
$Variable = $Filename.Substring(15,9)
Results:
20240119_ (this is NOT what I am after)
I am trying to find code that checks the 9th character of that substring.
If the 9th character is "_",
then set
$Variable = $Filename.Substring(15,8)
Results:
20240119
All credit to TheMadTechnician who beat me to the punch with this answer.
To expand on the technique a bit: use the Split method or the -split operator to break a string apart every time a certain character shows up. Your data is separated by the underscore character, so it is a good fit for this technique. By using either of the following:
$FileName.Split('_')
$FileName -split '_'
you can turn your long string into an array of shorter strings, each containing one of the parts of your original string. Since you want the 2nd one, you use the array index [1] (0 is the 1st) and you're done.
Good luck
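The same idea sketched in Python rather than PowerShell, using the two filenames from the question. The function name second_field is just a placeholder for this example.

```python
def second_field(filename):
    """Return the part between the first and second underscore."""
    # Splitting on "_" makes the length of each part irrelevant.
    return filename.split("_")[1]


print(second_field("20190902091031_202401192_50760_54206_6401.pdf"))
print(second_field("20190902091031_20240119_50760_54206_6401.pdf"))
```

Because the split is driven by the separator, no check on the 9th character is needed.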

How to remove quotes in my product description string?

I'm using OSCommerce for my online store and I'm currently optimizing my product page for rich snippets.
Some of my Google Indexed pages are being marked as "Failed" by Google due to double quotes in the description field.
I'm using an existing code which strips the html coding and truncates anything after 197 characters.
<?php echo substr(trim(preg_replace('/\s\s+/', ' ', strip_tags($product_info['products_description']))), 0, 197); ?>
How can I include the removal of quotes in that code so that the following string:
<strong>This product is the perfect "fit"</strong>
becomes:
This product is the perfect fit
This happened to me too. Try using:
tep_output_string($product_info['products_description'])
With it, " becomes &quot;
We can try using preg_replace_callback here:
$input = "SOME TEXT HERE <strong>This product is the perfect \"fit\"</strong> SOME MORE TEXT HERE";
$output = preg_replace_callback(
"/<([^>]+)>(.*?)<\/\\1>/",
function($m) {
return str_replace("\"", "", $m[2]);
},
$input);
echo $output;
This prints:
SOME TEXT HERE This product is the perfect fit SOME MORE TEXT HERE
The regex pattern used does the following:
<([^>]+)> match an opening HTML tag, and capture the tag name
(.*?) then match and capture the content inside the tag
<\/\\1> finally match the same closing tag
Then, we use a callback function which does an additional replacement to strip off all double quotes.
Note that in general using regex against HTML is bad practice. But, if your text only has single level/occasional HTML tags, then the solution I gave above might be viable.
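If a regex over HTML feels too fragile, the whole pipeline from the question (strip tags, drop double quotes, collapse whitespace, cut at 197 characters) can be sketched with a real HTML parser. This is a Python illustration only, not osCommerce code; the class _TextOnly and the function clean_description are hypothetical names.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect only the text content, ignoring every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_description(raw, limit=197):
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts).replace('"', "")  # remove double quotes
    text = re.sub(r"\s\s+", " ", text).strip()     # collapse runs of whitespace
    return text[:limit]                            # truncate as in the question


print(clean_description('<strong>This product is the perfect "fit"</strong>'))
```

In the PHP one-liner from the question, the equivalent change would be an extra quote-removal step wrapped around strip_tags().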

XML removeChild, but output has a blank row

Simple removeChild test, although the xml line is removed, it maintains an empty blank row, how come? Btw - my source xml file does have indents, however even when I remove them I get the same result. So what's the point of being able to removeChild row if it still retains a blank space?
Is there a way to reformat the resulting XML before writing it to the file?
foreach my $XYZ ($doc->findnodes("//EE1"))
{
my $library = $XYZ->parentNode;
$library->removeChild($XYZ);
}
print {$FH} $doc->toString(0);
RESULT IN OUTPUT FILE:
<?xml version="1.0"?>
<TopTag>
<AA1>ZNY</AA1>
<AA2>111</AA2>
<BB1>
<CC1>ZNY</CC1>
<CC2>
<DD1>
<-----blank line remains
<EE2>2000</EE2>
</DD1>
<DD1>
<-----blank line remains
<EE2>5000</EE2>
</DD1>
</CC2>
</BB1>
<AA1>ZNY2</AA1>
<AA2>2</AA2>
</TopTag>
The empty lines come from text nodes containing whitespace. Consider the following document:
<doc>
<elem/>
</doc>
The doc element contains the following nodes:
A text node containing a newline and two space characters.
An element node with the elem element.
Another text node containing a newline.
If the elem element is removed, only the text nodes remain resulting in a blank line.
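The effect is not specific to XML::LibXML. A small Python sketch with the standard xml.dom.minidom module, used here only as a stand-in DOM implementation, shows the whitespace text nodes surviving the removal:

```python
from xml.dom import minidom

doc = minidom.parseString("<doc>\n  <elem/>\n</doc>")
root = doc.documentElement

# Child nodes before the removal: text, element, text.
kinds_before = [node.nodeName for node in root.childNodes]

root.removeChild(root.getElementsByTagName("elem")[0])

# Only the two whitespace text nodes are left.
kinds_after = [node.nodeName for node in root.childNodes]
remaining = "".join(node.data for node in root.childNodes)

print(kinds_before, kinds_after, repr(remaining))
```

The leftover text is a newline, two spaces, and another newline, which is exactly what shows up as a blank line in the serialized output.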
The easiest way to reindent an XML::LibXML document is to use the module XML::LibXML::PrettyPrint. Also have a look at this question.
Remove newlines that are preceded by another newline (positive look-behind assertion) and optional whitespace in between.
my $output = $doc->toString(0);
$output =~ s/(?<=\n)\s*\n//g;
print {$FH} $output;
You can use the no_blanks option of load_xml(), which drops whitespace-only text nodes when the XML is parsed:
use XML::LibXML;
my $dom = XML::LibXML->load_xml(location => $filename, no_blanks => 1);
Since the whitespace is removed, you need to then use:
print $dom->toString(1);
to get nicely formatted output.
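The same two steps (discard whitespace-only text nodes, then let the serializer indent) can be sketched in Python with xml.dom.minidom. This only illustrates the idea and is not how XML::LibXML implements no_blanks; strip_blank_text_nodes is a hypothetical helper and the sample document is invented.

```python
from xml.dom import minidom


def strip_blank_text_nodes(node):
    """Recursively remove text nodes that contain only whitespace."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_blank_text_nodes(child)


doc = minidom.parseString("<DD1>\n  <EE1>x</EE1>\n  <EE2>2000</EE2>\n</DD1>")
strip_blank_text_nodes(doc)

root = doc.documentElement
root.removeChild(root.getElementsByTagName("EE1")[0])

# With the blank nodes gone, the serializer controls all indentation.
pretty = doc.toprettyxml(indent="  ")
print(pretty)
```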

Perl get array count so can start foreach loop at a certain array element

I have a file that I am reading in. I'm using Perl to reformat the data. It is a comma-separated file. In one of the files, I know that element.0 is a zipcode and element.1 is a counter. Each row can have 1-n number of cities. I need to know the number of elements from element.3 to the end of the line so that I can reformat them properly. I wanted to use a foreach loop starting at element.3 to format the other elements into a single string.
Any help would be appreciated. Basically I am trying to read in a csv file and create a cpp file that can then be compiled on another platform as a plug-in for that platform.
Best Regards
Michael Gould
You can do something like this to get the fields from a line:
my @fields = split /,/, $line;
To access all elements from 3 to the end, do this:
foreach my $city (@fields[3..$#fields])
{
#do stuff
}
(Note, based on your question I assume you are using zero-based indexing. Thus "element 3" is the 4th element).
Alternatively, consider Text::CSV to read your CSV file, especially if you have things like escaped delimiters.
Well if your line is being read into an array, you can get the number of elements in the array by evaluating it in scalar context, for example
my $elems = @line;
or to be really sure
my $elems = scalar(@line);
Although in that case the scalar is redundant, it's handy for forcing scalar context where it would otherwise be list context. You can also find the index of the last element of the array with $#line.
After that, if you want to get everything from element 3 onwards you can use an array slice:
my @threeonwards = @line[3 .. $#line];
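For readers more used to Python, the split, count and slice steps look roughly like this. The sample row and the name cities_from are invented for the illustration, since the question does not show real data.

```python
def cities_from(line, start=3):
    """Split a comma-separated row and return the fields from index start onward."""
    fields = line.split(",")
    return fields[start:]


# Hypothetical row: zipcode, counter, one more field, then the cities.
row = "12345,2,x,Springfield,Shelbyville"
cities = cities_from(row)
print(len(cities), " ".join(cities))
```

As in the Perl answers, a plain split does not handle quoted or escaped commas; a proper CSV reader is the safer choice for such input.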

Reading custom values in Ebay RSS feed (XML::RSS module)

I've spent way too long trying to figure this out. I'm using XML::RSS and Perl to read and parse an eBay RSS feed. Within the <item></item> area, I see these entries:
<rx:BuyItNowPrice xmlns:rx="urn:ebay:apis:eBLBaseComponents">1395</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:ebay:apis:eBLBaseComponents">1255</rx:CurrentPrice>
However, I can't figure out how to grab the details during the loop. I wrote a regex to grab them:
@current_price = $item =~ m/\<rx\:CurrentPrice.*\>(\d+)\<\/rx\:CurrentPrice\>/g;
Which works if you place the above 'CurrentPrice' entry into a standalone string, but not while the script is reading through the RSS feed.
I can grab most of the information I want out of the item->description area (# bids, auction end time, BIN price, thumbnail image, etc.), but it would be nicer if I could grab the info from the feed without me having to deal with grabbing all that information manually.
How to grab custom fields from an RSS feed (short of writing regexes to parse the entire feed w/o a module)?
Here's the code I'm working with:
$my_limit = 0;
use LWP::Simple;
use XML::RSS;
$rss = XML::RSS->new();
$data = get( $mylink );
$rss->parse( $data );
$channel = $rss->{channel};
$NumItems = 0;
foreach $item (@{$rss->{'items'}}) {
if($NumItems > $my_limit){
last;
}
@current_price = $item =~ m/\<rx\:CurrentPrice.*\>(\d+)\<\/rx\:CurrentPrice\>/g;
print "$current_price[0]";
}
If you have the rss/xml document and want specific data you could use XPATH:
Perl CPAN XPATH
XPath Introduction
What is the way in which "it doesn't work" from an RSS feed? Do you mean no matches when there should be matches? Or one match where there should be several matches?
One thing that jumps out at me about your regular expression is that you use .*, which can sometimes be greedier than you want. That is, if $item contained the expression
<rx:BuyItNowPrice xmlns:rx="urn:...nts">1395</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:...nts">1255</rx:CurrentPrice>
<rx:BuyItNowPrice xmlns:rx="urn:...nts">1395</rx:BuyItNowPrice>
<rx:SomeMoreStuff xmlns:rx="urn:...nts">zzz</rx:SomeMoreStuff>
<rx:CurrentPrice xmlns:rx="urn:...nts">1255</rx:CurrentPrice>
then, if the markup arrives on a single line (or the /s modifier is in effect, so that . also matches newlines), the first part of your regular expression (\<rx\:CurrentPrice.*\>) will wind up matching everything on lines 2, 3, and 4, plus the first part of line 5 (up to the >). Instead, you might want to use the regular expression [1]
m/\<rx:CurrentPrice[^>]*>(\d+)\<\/rx:CurrentPrice\>/
which will only match up to the closing </rx:CurrentPrice> tag after a single instance of an opening <rx:CurrentPrice> tag.
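The difference between the two patterns is easy to reproduce. The sketch below uses Python's re module, whose greedy matching behaves the same way for this case, on a made-up single-line fragment with a shortened namespace URI:

```python
import re

# Hypothetical single-line feed fragment with two CurrentPrice entries.
feed = (
    '<rx:CurrentPrice xmlns:rx="urn:a">1255</rx:CurrentPrice>'
    '<rx:BuyItNowPrice xmlns:rx="urn:a">1395</rx:BuyItNowPrice>'
    '<rx:CurrentPrice xmlns:rx="urn:a">1300</rx:CurrentPrice>'
)

# Greedy .* runs to the end and backtracks only as far as the last price.
greedy = re.findall(r"<rx:CurrentPrice.*>(\d+)</rx:CurrentPrice>", feed)

# [^>]* cannot cross the end of the opening tag, so each entry matches on its own.
bounded = re.findall(r"<rx:CurrentPrice[^>]*>(\d+)</rx:CurrentPrice>", feed)

print(greedy, bounded)
```

The greedy pattern reports a single price (the last one), while the bounded pattern reports both.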
[1] The other obvious answer is that you really don't want to use a regular expression at all, that regular expressions are inferior tools for parsing XML compared to customized parsing modules, and that all the special cases you will have to deal with using regular expressions will eventually render you unconscious from having repeatedly beaten your head against your desk. See Salgar's answer, for example.
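As a rough illustration of the parser-based route, here is a Python sketch that reads the namespaced element with the standard xml.etree.ElementTree module instead of a regex. The item below is a trimmed, hypothetical stand-in for one feed entry; only the two rx: elements and their namespace URI come from the question.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping for the custom eBay elements shown in the question.
NS = {"rx": "urn:ebay:apis:eBLBaseComponents"}

# Hypothetical item; a real feed entry carries many more child elements.
SAMPLE_ITEM = """<item>
<title>hypothetical listing</title>
<rx:BuyItNowPrice xmlns:rx="urn:ebay:apis:eBLBaseComponents">1395</rx:BuyItNowPrice>
<rx:CurrentPrice xmlns:rx="urn:ebay:apis:eBLBaseComponents">1255</rx:CurrentPrice>
</item>"""

item = ET.fromstring(SAMPLE_ITEM)
# Look the element up by namespace rather than by matching raw text.
current_price = item.findtext("rx:CurrentPrice", namespaces=NS)
print(current_price)
```

In Perl, the XPath suggestion made earlier in this thread follows the same principle: register the rx namespace and query for rx:CurrentPrice.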