Find and extract content of division of certain class using DomXPath - dom

I am trying to extract and save into PHP string (or array) the content of a certain section of a remote page. That particular section looks like:
<section class="intro">
<div class="container">
<h1>Student Club</h1>
<h2>Subtitle</h2>
<p>Lore ipsum paragraph.</p>
</div>
</section>
And since I can't narrow down using class container because there are several other sections of class "container" on the same page and because there is the only section of class "intro", I use the following code to find the right division:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
#$doc->loadHTMLFile("https://www.remotesite.tld/remotepage.html");
$finder = new DomXPath($doc);
$intro = $finder->query("//*[contains(#class, 'intro')]");
And at this point, I'm hitting a problem - can't extract the content of $intro as PHP string.
Trying further the following code
foreach ($intro as $item) {
$string = $item->nodeValue;
echo $string;
}
gives only the text value, all the tags are stripped and I really need all those divs, h1 and h2 and p tags preserved for further manipulation needs.
Trying:
foreach ($intro->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo $name;
echo $value;
}
is giving the error:
Notice: Undefined property: DOMNodeList::$attributes in
So how could I extract the full HTML code of the found DOM elements?

I knew I was so close... I just needed to do:
foreach ($intro as $item) {
$h1= $item->getElementsByTagName('h1');
$h2= $item->getElementsByTagName('h2');
$p= $item->getElementsByTagName('p');
}

Related

DOMXPath multiple contain selectors not working

I have the following XPath query that a kind user on SO helped me with:
$xpath->query(".//*[not(self::textarea or self::select or self::input) and contains(., '{{{')]/text()") as $node)
Its purpose is to replace certain placeholders with a value, and correctly catches occurences such as the below that should not be replaced:
<textarea id="testtextarea" name="testtextarea">{{{variable:test}}}</textarea>
And replaces correctly occurrences like this:
<div>{{{variable:test}}}</div>
Now I want to exclude elements that are of type <div> that contain the class name note-editable in that query, e.g., <div class="note-editable mayhaveanotherclasstoo">, in addition to textareas, selects or inputs.
I have tried:
$xpath->query(".//*[not(self::textarea or self::select or self::input) and not(contains(#class, 'note-editable')) and contains(., '{{{')]/text()") as $node)
and:
$xpath->query(".//*[not(self::textarea or self::select or self::input or contains(#class, 'note-editable')) and contains(., '{{{')]/text()") as $node)
I have followed the advice on some questions similar to this: PHP xpath contains class and does not contain class, and I do not get PHP errors, but the note-editable <div> tags are still having their placeholders replaced.
Any idea what's wrong with my attempted queries?
EDIT
Minimum reproducible DOM sample:
<div class="note-editing-area">
<textarea class="note-codable"></textarea>
<div class="note-editable panel-body" contenteditable="true" style="height: 350px;">{{{variable:system_url}}</div>
</div>
Code that does the replacement:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
foreach ($xpath->query(".//*[not(self::textarea or self::select or self::input or self::div[contains(#class,'note-editable')]) and contains(., '{{{')]/text()") as $node) {
$node->nodeValue = preg_replace_callback('~{{{([^:]+):([^}]+)}}}~', function($m) use ($placeholders) {
return $placeholders[$m[1]][$m[2]] ?? '';
},
$node->nodeValue);
}
$html = $dom->saveHTML();
echo html_entity_decode($html);
Use this below xpath.
.//*[not(self::textarea or self::select or self::input or self::div[contains(#class,'note-editable')]) and contains(., '{{{')]

domxpath- how to get content for parent tag only instead of child tags

I, am using domxpath query for fetching content for parent tag only that is (td[class='s']) instead of including div content which is nested inside that td as given below in my code.
<?php
$second_trim='<td class="s" style="line-height:18px;">THIS TEXT IS REQUIRED and <div id="a" style="display:none;background-color:black;border:1px solid #ddd;padding:5px;color:black;">THIS TEXT IS NOT REQUIRED </div></td>';
$dom = new DOMDocument();
$doc->validateOnParse = true;
#$dom->loadHTML($second_trim);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$b = $xpath->query('//td[#class="s"]');
echo "<p style='font-size:14px;color:red;'><b style='font-size:18px;color:gray;'>cONTENT :- </b>".$b->item(0)->nodeValue."</p>";
?>
so how to remove content of that div tag and fetching only td's content any ideas !!
EDIT
If you are only interested in the direct text content modify your xpath query:
$b = $xpath->query('//td[#class="s"]/text()');
echo '<p style="font-size:14px;color:red;">'
.'<b style="font-size:18px;color:gray;">cONTENT :- </b>'
.$b->item(0)->nodeValue
.'</p>';
Right now the result is very specific to the example:
If more than one direct text node exists, its not gone be displayed. To do that foreach through the DOMNodeList $b and echo every selected node value.

Extracting links inside <div>'s with HTML::TokeParser & URI

I'm an old-newbie in Perl, and Im trying to create a subroutine in perl using HTML::TokeParser and URI.
I need to extract ALL valid links enclosed within on div called "zone-extract"
This is my code:
#More perl above here... use strict and other subs
use HTML::TokeParser;
use URI;
sub extract_links_from_response {
my $response = $_[0];
my $base = URI->new( $response->base )->canonical;
# "canonical" returns it in the one "official" tidy form
my $stream = HTML::TokeParser->new( $response->content_ref );
my $page_url = URI->new( $response->request->uri );
print "Extracting links from: $page_url\n";
my($tag, $link_url);
while ( my $div = $stream->get_tag('div') ) {
my $id = $div->get_attr('id');
next unless defined($id) and $id eq 'zone-extract';
while( $tag = $stream->get_tag('a') ) {
next unless defined($link_url = $tag->[1]{'href'});
next if $link_url =~ m/\s/; # If it's got whitespace, it's a bad URL.
next unless length $link_url; # sanity check!
$link_url = URI->new_abs($link_url, $base)->canonical;
next unless $link_url->scheme eq 'http'; # sanity
$link_url->fragment(undef); # chop off any "#foo" part
print $link_url unless $link_url->eq($page_url); # Don't note links to itself!
}
}
return;
}
As you can see, I have 2 loops, first using get_tag 'div' and then look for id = 'zone-extract'. The second loop looks inside this div and retrieve all links (or that was my intention)...
The inner loop works, it extracts all links correctly working standalone, but I think there is some issues inside the first loop, looking for my desired div 'zone-extract'... Im using this post as a reference: How can I find the contents of a div using Perl's HTML modules, if I know a tag inside of it?
But all I have by the moment is this error:
Can't call method "get_attr" on unblessed reference
Some ideas? Help!
My HTML (Note URL_TO_EXTRACT_1 & 2):
<more html above here>
<div class="span-48 last">
<div class="span-37">
<div id="zone-extract" class="...">
<h2 class="genres"><img alt="extracting" class="png"></h2>
<li><a title="Extr 2" href="**URL_TO_EXTRACT_1**">2</a></li>
<li><a title="Con 1" class="sel" href="**URL_TO_EXTRACT_2**">1</a></li>
<li class="first">Pàg</li>
</div>
</div>
</div>
<more stuff from here>
I find that TokeParser is a very crude tool requiring too much code, its fault is that only supports the procedural style of programming.
A better alternatives which require less code due to declarative programming is Web::Query:
use Web::Query 'wq';
my $results = wq($response)->find('div#zone-extract a')->map(sub {
my (undef, $elem_a) = #_;
my $link_url = $elem_a->attr('href');
return unless $link_url && $link_url !~ m/\s/ && …
# Further checks like in the question go here.
return [$link_url => $elem_a->text];
});
Code is untested because there is no example HTML in the question.

Zend_Form_Element_MultiCheckbox: How to display a long list of checkboxes as columns?

So I am using a Zend_Form_Element_MultiCheckbox to display a long list of checkboxes. If I simply echo the element, I get lots of checkboxes separated by <br /> tags. I would like to figure out a way to utilize the simplicity of the Zend_Form_Element_MultiCheckbox but also display as multiple columns (i.e. 10 checkboxes in a <div style="float:left">). I can do it manually if I had an array of single checkbox elements, but it isn't the cleanest solution:
<?php
if (count($checkboxes) > 5) {
$columns = array_chunk($checkboxes, count($checkboxes) / 2); //two columns
} else {
$columns = array($checkboxes);
}
?>
<div id="checkboxes">
<?php foreach ($columns as $columnOfCheckboxes): ?>
<div style="float:left;">
<?php foreach($columnOfCheckboxes as $checkbox): ?>
<?php echo $checkbox ?> <?php echo $checkbox->getLabel() ?><br />
<?php endforeach; ?>
</div>
<?php endforeach; ?>
</div>
How can I do this same sort of thing and still use the Zend_Form_Element_MultiCheckbox?
The best place to do this is using a view helper. Here is something I thought of really quickly that you could do. You can use this in your view scripts are attach it to a Zend_Form_Element.
I am going to assume you know how to use custom view helpers and how to add them to form elements.
class My_View_Helper_FormMultiCheckbox extends Zend_View_Helper_FormMultiCheckbox
{
public function formMultiCheckbox($name, $value = null, $attribs = null,
$options = null, $listsep = "<br />\n")
{
// zend_form_element attrib has higher precedence
if (isset($attribs['listsep'])) {
$listsep = $attribs['listsep'];
}
// Store original separator for later if changed
$origSep = $listsep;
// Don't allow whitespace as a seperator
$listsep = trim($listsep);
// Force a separator if empty
if (empty($listsep)) {
$listsep = $attribs['listsep'] = "<br />\n";
}
$string = $this->formRadio($name, $value, $attribs, $options, $listsep);
$checkboxes = explode($listsep, $string);
$html = '';
// Your code
if (count($checkboxes) > 5) {
$columns = array_chunk($checkboxes, count($checkboxes) / 2); //two columns
} else {
$columns = array($checkboxes);
}
foreach ($columns as $columnOfCheckboxes) {
$html .= '<div style="float:left;">';
$html .= implode($origSep, $columnOfCheckboxes);
$html .= '</div>';
}
return $html;
}
}
If you need further explanation just let me know. I did this fairly quickly.
EDIT
The reason I named it the same and placed in a different directory was only to override Zend's view helper. By naming it the same and adding my helper path:
$view->addHelperPath('My/View/Helper', 'My_View_Helper');
My custom view helper gets precedence over Zend's helper. Doing this allowed me to test without changing any of my forms,elements, or views that used Zend's helper. Basically, that's how you replace one of Zend's view helpers with one of your own.
Only reason I mentioned the note on adding custom view helpers and adding to form elements was because I assumed you might rename the helper to better suit your needs.

How can I get the next immediate tag with HTML::Tokeparser?

I'm trying to get the tags that occur immediately after a particular div tag. For e.g., I have html code
<div id="example">
<h2>Example</h2>
<p>Hello !World</p>
</div>
I'm doing the following,
while ( $tag = $stream->get_tag('div') ) {
if( $tag->[1]{id} eq 'Example' ) {
$tag = $stream->get_tag;
$tag = $stream->get_tag;
if ( $tag->[0] eq 'div' ) {
...
}
}
}
But this throws the error
Can't use string ("</h2>") as a HASH ref while "strict refs" in use
It works fine if I say
$tag = $stream->get_tag('h2');
$tag = $stream->get_tag('p');
But I can't have that because I need to get the immediate two tags and verify if they are what i expect them to be.
It would be easier to tell if you posted a runnable example program, but it looks like the problem is you didn't realize that get_tag returns both start and end tags. End tags don't have attributes. Start tags are returned as [$tag, $attr, $attrseq, $text], and end tags are returned as ["/$tag", $text].