XPath Expression: Optional character? - dom

I would like to find:
<div style="text-align:center;" >
<div style="text-align: center;" >
<div style="text-align:center" >
<div style="text-align: center" >
So an optional space before center and an optional semicolon at the end.
I can do:
//div[@style='text-align:center;' or @style='text-align: center;' or @style='text-align:center' or @style='text-align: center']
But is there a “cleaner” way, one that can handle many more optional characters without the expression getting too long?

You can first remove the optional characters (e.g. space and semicolon) with the translate() function, assuming they aren't used in the required text, and then check whether the result equals the required text, e.g. 'text-align:center':
//div[translate(@style, ' ;', '') = 'text-align:center']
Or, when the pattern gets more complex, you can use a regex in your XPath via PHP's preg_match:
$xp->query("//div[php:function('preg_match', '~text-align:\s*center;*~', string(@style))]");
See the full example demonstrating how to call a PHP function from XPath in my older post: Get hrefs that match regex expression using PHP & XPath.
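The two ideas above (strip the optional characters and compare, or match a pattern with optional parts) can be tried outside XPath as well. The following is only a minimal Python sketch of that logic applied to the four style values from the question; the helper names are made up for illustration, and the regex is anchored with fullmatch, which is an assumption about wanting an exact match.

```python
import re

REQUIRED = "text-align:center"
# Deletes every space and semicolon, like translate(@style, ' ;', '') in XPath 1.0.
STRIP_OPTIONAL = str.maketrans("", "", " ;")
# Optional whitespace before "center" and at most one trailing semicolon.
PATTERN = re.compile(r"text-align:\s*center;?")


def matches_by_stripping(style):
    """Remove the optional characters, then compare with the required text."""
    return style.translate(STRIP_OPTIONAL) == REQUIRED


def matches_by_regex(style):
    """Match the whole attribute value against the pattern."""
    return PATTERN.fullmatch(style) is not None


samples = [
    "text-align:center;",
    "text-align: center;",
    "text-align:center",
    "text-align: center",
]
for style in samples:
    print(repr(style), matches_by_stripping(style), matches_by_regex(style))
```

Note that the stripping approach is looser than the regex: it also accepts a value such as 'text-align:cen ter', because every space is removed wherever it occurs.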

Related

XPath nodes text joined by br

How can I join the text nodes that sit between br tags, using br again as the separator?
Here is the XML code:
<div>
text1.
<br>
text2.
<br>
text3.
<div>ad sense code</div>
<br>
text4.
<div>ad sense code</div>
<br>
textxx.
<br>
</div>
I need to get all the text nodes, text2 to textxx, joined by a br tag or \n\n.
I can get all the text, but joined without any separator, using
//div/text()[position()>1], and the result looks like this:
text1.text2.text3.text4.textxx.
while I want it like this:
text1.<br>text2.<br>text3.<br>text4.<br>textxx.<br>
In short, I need to keep the br tags.
I am using Perl HTML::TreeBuilder::LibXML module.
XPath can be used (a) to select nodes from the input document, or (b) to compute atomic values such as strings, booleans, or numbers from the nodes in the input document. It can never [with very edge-case exceptions] return nodes that weren't present in the input.
It's not entirely clear what you mean by your desired output of
text1.<br>text2.<br>text3.<br>text4.<br>textxx.<br>
Are you looking for this as a string? Or a sequence of text nodes and element nodes, interspersed?
Returning it as a string is possible in XPath 3.1 using the serialize() function, but in Perl you only have access to the venerable and limited XPath 1.0.
Returning it as a set of nodes isn't possible because the nodes aren't there in the source: the source contains text nodes that have values such as "__text1__" where underscores represent whitespace, and your desired output drops the whitespace.
You appear to be doing a transformation rather than merely a selection, so you are out of XPath territory and into XSLT.
The solution that did what I wanted in Perl looks like this:
$text = "";
$tree = HTML::TreeBuilder::LibXML->new_from_content($content);
foreach my $node ($tree->findnodes("./div/text()[position()>1]")) {
$text .= $node->findvalue('string(.)') . "<br>";
}
$text =~ s/<br>$//g;
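The same "select the text nodes, then join them in the host language" idea can be sketched in Python with only the standard library. This is an illustration, not the Perl module's API: the class and function names are hypothetical, and it assumes that only text sitting directly inside the outer div should be kept (so the nested "ad sense code" divs are skipped). Unlike the position()>1 filter above, it also keeps text1.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DirectTextCollector(HTMLParser):
    """Collects non-empty text chunks found directly inside the outermost element."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth == 1 and text:
            self.chunks.append(text)


def join_direct_text(markup, separator="<br>"):
    collector = DirectTextCollector()
    collector.feed(markup)
    collector.close()
    return separator.join(collector.chunks)


SAMPLE = """<div>
text1.
<br>
text2.
<br>
text3.
<div>ad sense code</div>
<br>
text4.
<div>ad sense code</div>
<br>
textxx.
<br>
</div>"""

print(join_direct_text(SAMPLE))
```

Passing separator="\n\n" gives the plain-text variant mentioned in the question.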

negative regex with xidel + garbage-collect function

I currently use this command to extract URLs from a site
xidel https://www.website.com --extract "//h1//extract(@href, '.*')[. != '']"
This will extract all URLs (.*) but I would like to change this in a way that it would not extract URLs that contain specific strings in their URI path. For example, I would like to extract all URLs, except the ones that contain -text1- and -text2-
Also, xidel has a function called garbage-collect, but it's not clear to me how to use it. It could be
--extract garbage-collect()
or
--extract garbage-collect()[0]
or
x:extract garbage-collect()
or
x"extract garbage-collect()
But these didn't reduce the memory usage when extracting URLs from multiple pages using --follow.
Just noticed this old question. It looks like OP's account is suspended, so I hope the following answer will be helpful for other users.
Let's assume 'test.htm' :
<html>
<body>
<span class="a-text1-u">1</span>
<span class="b-text2-v">2</span>
<span class="c-text3-w">3</span>
<span class="d-text4-x">4</span>
<span class="e-text5-y">5</span>
<span class="f-text6-z">6</span>
</body>
</html>
To extract all "class"-nodes, except the ones that contain "-text1-" and "-text2-":
xidel -s test.htm -e "//span[not(contains(@class,'-text1-') or contains(@class,'-text2-'))]/@class"
#or
xidel -s test.htm -e "//@class[not(contains(.,'-text1-') or contains(.,'-text2-'))]"
c-text3-w
d-text4-x
e-text5-y
f-text6-z
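For readers who want to check the filter logic without xidel, here is a rough Python equivalent of the second expression. It is only a sketch: it relies on 'test.htm' being well-formed XML (true for the sample above, not for HTML in general), and the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

TEST_HTM = """<html>
<body>
<span class="a-text1-u">1</span>
<span class="b-text2-v">2</span>
<span class="c-text3-w">3</span>
<span class="d-text4-x">4</span>
<span class="e-text5-y">5</span>
<span class="f-text6-z">6</span>
</body>
</html>"""

EXCLUDED = ("-text1-", "-text2-")


def kept_classes(markup):
    """Return every class attribute that contains none of the excluded substrings."""
    root = ET.fromstring(markup)
    classes = [el.get("class") for el in root.iter() if el.get("class") is not None]
    return [c for c in classes if not any(part in c for part in EXCLUDED)]


for value in kept_classes(TEST_HTM):
    print(value)
```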
xidel has a function called garbage-collect but it's not clear to me how to use these functions.
http://www.benibela.de/documentation/internettools/xpath-functions.html#x-garbage-collect:
x:garbage-collect (0 arguments)
Frees unused memory. Always call it as garbage-collect()[0], or it might garbage collect its own return value
and crash.
So that would be -e "garbage-collect()[0]".

How to remove quotes in my product description string?

I'm using OSCommerce for my online store and I'm currently optimizing my product page for rich snippets.
Some of my Google Indexed pages are being marked as "Failed" by Google due to double quotes in the description field.
I'm using an existing code which strips the html coding and truncates anything after 197 characters.
<?php echo substr(trim(preg_replace('/\s\s+/', ' ', strip_tags($product_info['products_description']))), 0, 197); ?>
How can I include the removal of quotes in that code so that the following string:
<strong>This product is the perfect "fit"</strong>
becomes:
This product is the perfect fit
This happened to me too. Try using:
tep_output_string($product_info['products_description'])
" becomes &quot;
We can try using preg_replace_callback here:
$input = "SOME TEXT HERE <strong>This product is the perfect \"fit\"</strong> SOME MORE TEXT HERE";
$output = preg_replace_callback(
"/<([^>]+)>(.*?)<\/\\1>/",
function($m) {
return str_replace("\"", "", $m[2]);
},
$input);
echo $output;
This prints:
SOME TEXT HERE This product is the perfect fit SOME MORE TEXT HERE
The regex pattern used does the following:
<([^>]+)> match an opening HTML tag, and capture the tag name
(.*?) then match and capture the content inside the tag
<\/\\1> finally match the same closing tag
Then, we use a callback function which does an additional replacement to strip off all double quotes.
Note that in general using regex against HTML is bad practice. But, if your text only has single level/occasional HTML tags, then the solution I gave above might be viable.
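Since the original one-liner already strips the tags, the quotes can also be removed as one extra step in that pipeline (in PHP, by wrapping the expression in str_replace('"', '', ...)). The order of operations is sketched below in Python purely as an illustration; the function name is hypothetical, and the tag-stripping regex is a crude stand-in for strip_tags, not an equivalent.

```python
import re


def clean_description(html, limit=197):
    """Strip tags, drop double quotes, collapse whitespace, then truncate."""
    text = re.sub(r"<[^>]*>", "", html)         # crude stand-in for strip_tags()
    text = text.replace('"', "")                # remove every double quote
    text = re.sub(r"\s\s+", " ", text).strip()  # same whitespace rule as the original
    return text[:limit]


print(clean_description('<strong>This product is the perfect "fit"</strong>'))
```

Removing the quotes before truncating keeps the 197-character limit applied to the final text rather than to a string that still contains characters about to be dropped.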

Perl HTML::Treebuilder XPATH Table Tags with no ID/Name

I want to extract some text which is present in a specific table cell in the HTML page.
Now, the problem is, this cell is present inside a table tag which has no ID/Name.
I am using HTML::TreeBuilder::XPath to extract the value using XPATH expressions.
Here is what the HTML content looks like:
<table border="0">
<tr>
<td>Some Text</td>
<td>The Text I want comes here</td>
</tr>
This is what my XPath expression looks like:
@nodes=$tree->findnodes(q{//table[8]/tr/td[2]/text()});
print $_->string_value."\n" foreach(@nodes); # corrected, thanks mirod.
It does not display the output.
I have used table[8] above since this is the eighth table tag in the HTML page (assuming the index starts from 1).
Also, I have used td[2] since I want the innerHTML of the second td tag.
Thanks.
What happens if you remove the text() at the end of the XPath query? I would think that calling string_value on the td itself would be enough.
Also method calls are not interpolated in strings, so you need to write print $_->string_value, "\n".
This will give you the text of the content, not the markup though. For that you would need to use as_HTML, and strip the outer tags (there is no method in HTML::Element that gives you the inner HTML):
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_content( <DATA>);
my @nodes=$tree->findnodes(q{//table[1]/tr/td[2]});
print $_->string_value, "\n" foreach(@nodes); # text
print $_->as_HTML, "\n" foreach(@nodes); # outerHTML
__DATA__
<html>
<body>
<table border="0">
<tr>
<td>Some Text</td>
<td>The Text I want comes here with <b>nested</b> content</td>
</tr>
</body>
</html>
mirod's approach should work for you.
But I recommend using findvalues instead of findnodes if you only need the text content.
Try running this code and show the output:
my @values=$tree->findvalues(q{//table[8]//tr[1]//td});
print $_, "\n" foreach(@values);

Adding 2nd search to regex pattern?

// LINE 1
<td align="left" nowrap><font face="courier, monospace" size="-1"> (2002 GC1)</font></td>
// LINE 2
<td align="left" nowrap><font face="courier, monospace" size="-1"> 99942 Cocoon</font></td>
I have created a simple regular expression to scrape a little data I need from the HTML lines above. The expression works well and puts the data I need in two groups.
Regular Expression Pattern = ([0-9]+) ([A-Za-z0-9]+)
LINE1: Group1 = 2002, Group2 = GC1
LINE2: Group1 = 99942, Group2 = Cocoon
Having run this through my data I have now noticed that there is a new type of HTML line that has an extra number at the start that I should get.
// LINE 3
<td align="left" nowrap><font face="courier, monospace" size="-1">162421 (2000 CG70)</font></td>
LINE3: Group1 = 2000, Group2 = CG70
What I am trying to do is alter my pattern to additionally capture 162421. This matches the same pattern ([0-9]+), but being new to regular expressions I am unsure how to add this possibility to my pattern. Each time I try, I either break my already working search or overwrite part of the result.
NOTE: I am using this with: NSRegularExpression on iOS.
You will have to add a capture group for the early digits in the string. In the example, these digits are followed by "&nbsp;" (one or many times) and "(", and all of this is optional for the regex to match.
(?:([0-9]+)(?:&nbsp;)+\()?([0-9]+) ([A-Za-z0-9]+)
// ^                      ^        ^ capture groups
The trickiest part comes with the capture ranges.
Now that you have one more capture group, you will always have 4 ranges when querying the NSTextCheckingResult object (the range at index 0 is the entire match, the others are the capture ranges).
But sometimes only the last two capture ranges will be valid.
To be sure, compare the location member of the first capture's NSRange with NSNotFound. If the location is not NSNotFound, the range is valid and the early digits were matched and captured; otherwise they were not present.
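The behaviour of an optional leading capture group can be tried out quickly in Python, where a group that did not take part in the match is reported as None (roughly the counterpart of a range whose location is NSNotFound). This is a sketch only: the sample strings are the cell contents from the question as they appear above, and the pattern accepts either a literal "&nbsp;" or plain whitespace after the leading digits, which is an assumption about the raw HTML.

```python
import re

# Optional prefix: leading digits, then "&nbsp;" or whitespace (one or more), then "(".
PATTERN = re.compile(r"(?:([0-9]+)(?:&nbsp;|\s)+\()?([0-9]+) ([A-Za-z0-9]+)")

samples = [
    " (2002 GC1)",
    " 99942 Cocoon",
    "162421 (2000 CG70)",
]


def extract(cell):
    """Return (early_digits_or_None, number, name), or None if nothing matches."""
    match = PATTERN.search(cell)
    return match.groups() if match else None


for cell in samples:
    print(repr(cell), extract(cell))
```

Only the third sample fills the first group; for the other two it stays None while groups two and three keep their original meaning.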
How about:
([0-9]+) ([A-Za-z0-9]*)
Btw. I use this site to test regular expressions, very useful.