xpath query to parse html tags - dom

I need to parse the following sample HTML using an XPath query:
<td id="msgcontents">
<div class="user-data">Just seeing if I can post a link... please ignore post
http://finance.yahoo.com
</div>
</td>
<td id="msgcontents">
<div class="user-data">some text2...
http://abc.com
</div>
</td>
<td id="msgcontents">
<div class="user-data">some text3...
</div>
</td>
The above HTML may repeat any number of times in a page.
Also, sometimes the link portion may be absent, as in the last block above.
What I need is the XPath syntax so that I can get the parsed strings as:
array1[0]= "Just seeing if I can post a link... please ignore post ttp://finance.yahoo.com"
array[1]="some text2 htp://abc.com"
array[2]="sometext3"

Maybe something like the following:
$remote = file_get_contents('http://www.sitename.com');
$dom = new DOMDocument();
// Error suppression, unfortunately, as an invalid XHTML document throws up warnings.
$file = @$dom->loadHTML($remote);
$xpath = new DOMXPath($dom);
// Get all data with the user-data class.
$userdata = $xpath->query('//*[contains(@class, \'user-data\')]');
// Get links.
$links = $xpath->query('//a/@href');
So to access one of these variables, you need to use nodeValue:
$ret = array();
foreach($userdata as $data) {
$ret[] = $data->nodeValue;
}
Edit: I thought I'd mention that this will get all the links on a given page; I assume this is what you wanted?

Use:
concat(/td/div/text()[1], ' ', /td/div/a)
Instead of the ' ' above, you can use whatever delimiter you'd like to appear between the two strings.
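A minimal sketch of how the concat() expression could be applied per message from PHP, so that each block becomes one array entry. The sample markup below is an assumption: it wraps the URL in an a element (the question's sample shows only the bare URL) and puts the cells in a table. The whitespace cleanup with preg_replace is also an addition, not part of the answer above.

```php
<?php
// Sample markup modelled on the question; the <a> element is an assumption.
$html = <<<HTML
<table><tr>
<td id="msgcontents"><div class="user-data">Just seeing if I can post a link... please ignore post
<a href="http://finance.yahoo.com">http://finance.yahoo.com</a>
</div></td>
<td id="msgcontents"><div class="user-data">some text3...
</div></td>
</tr></table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$result = array();
foreach ($xpath->query('//td[@id="msgcontents"]/div[@class="user-data"]') as $div) {
    // Evaluate concat() relative to the current div; a missing <a> yields ''.
    $text = $xpath->evaluate("concat(text()[1], ' ', a)", $div);
    // Collapse runs of whitespace and trim the ends.
    $result[] = trim(preg_replace('/\s+/', ' ', $text));
}
print_r($result);
```

Evaluating the expression relative to each div, rather than from the document root, is what keeps the text of one message from being paired with the link of another.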

Related

DOMXPath multiple contain selectors not working

I have the following XPath query that a kind user on SO helped me with:
foreach ($xpath->query(".//*[not(self::textarea or self::select or self::input) and contains(., '{{{')]/text()") as $node)
Its purpose is to replace certain placeholders with a value, and it correctly catches occurrences such as the one below that should not be replaced:
<textarea id="testtextarea" name="testtextarea">{{{variable:test}}}</textarea>
And replaces correctly occurrences like this:
<div>{{{variable:test}}}</div>
Now I want to exclude elements that are of type <div> that contain the class name note-editable in that query, e.g., <div class="note-editable mayhaveanotherclasstoo">, in addition to textareas, selects or inputs.
I have tried:
foreach ($xpath->query(".//*[not(self::textarea or self::select or self::input) and not(contains(@class, 'note-editable')) and contains(., '{{{')]/text()") as $node)
and:
foreach ($xpath->query(".//*[not(self::textarea or self::select or self::input or contains(@class, 'note-editable')) and contains(., '{{{')]/text()") as $node)
I have followed the advice on some questions similar to this: PHP xpath contains class and does not contain class, and I do not get PHP errors, but the note-editable <div> tags are still having their placeholders replaced.
Any idea what's wrong with my attempted queries?
EDIT
Minimum reproducible DOM sample:
<div class="note-editing-area">
<textarea class="note-codable"></textarea>
<div class="note-editable panel-body" contenteditable="true" style="height: 350px;">{{{variable:system_url}}}</div>
</div>
Code that does the replacement:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
foreach ($xpath->query(".//*[not(self::textarea or self::select or self::input or self::div[contains(@class,'note-editable')]) and contains(., '{{{')]/text()") as $node) {
    $node->nodeValue = preg_replace_callback('~{{{([^:]+):([^}]+)}}}~', function($m) use ($placeholders) {
        return $placeholders[$m[1]][$m[2]] ?? '';
    },
    $node->nodeValue);
}
$html = $dom->saveHTML();
echo html_entity_decode($html);
Use the XPath below.
.//*[not(self::textarea or self::select or self::input or self::div[contains(@class,'note-editable')]) and contains(., '{{{')]
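A small self-contained sketch of this expression in use, with /text() appended as in the question's replacement loop. The $placeholders values and the second, plain div are assumptions added for illustration; the regex braces are escaped here only for readability.

```php
<?php
// Hypothetical placeholder values, for illustration only.
$placeholders = array(
    'variable' => array('test' => 'REPLACED', 'system_url' => 'http://example.test'),
);

$html = '<div class="note-editing-area"><textarea class="note-codable"></textarea>'
      . '<div class="note-editable panel-body">{{{variable:system_url}}}</div></div>'
      . '<div>{{{variable:test}}}</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Skip form controls and any div whose class attribute contains note-editable.
$query = ".//*[not(self::textarea or self::select or self::input"
       . " or self::div[contains(@class,'note-editable')])"
       . " and contains(., '{{{')]/text()";

foreach ($xpath->query($query) as $node) {
    $node->nodeValue = preg_replace_callback(
        '~\{\{\{([^:]+):([^}]+)\}\}\}~',
        function ($m) use ($placeholders) {
            return isset($placeholders[$m[1]][$m[2]]) ? $placeholders[$m[1]][$m[2]] : '';
        },
        $node->nodeValue
    );
}
$out = $dom->saveHTML();
echo $out;
```

In this sketch the placeholder in the plain div is replaced, while the one inside the note-editable div is left as it was.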

Find and extract content of division of certain class using DomXPath

I am trying to extract and save into PHP string (or array) the content of a certain section of a remote page. That particular section looks like:
<section class="intro">
<div class="container">
<h1>Student Club</h1>
<h2>Subtitle</h2>
<p>Lore ipsum paragraph.</p>
</div>
</section>
I can't narrow it down using the class container, because several other sections on the same page use that class. Since there is only one section with the class "intro", I use the following code to find the right element:
$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;
@$doc->loadHTMLFile("https://www.remotesite.tld/remotepage.html");
$finder = new DomXPath($doc);
$intro = $finder->query("//*[contains(@class, 'intro')]");
At this point I hit a problem: I can't extract the content of $intro as a PHP string.
Trying the following code
foreach ($intro as $item) {
$string = $item->nodeValue;
echo $string;
}
gives only the text value; all the tags are stripped, and I really need all those div, h1, h2 and p tags preserved for further manipulation.
Trying:
foreach ($intro->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo $name;
echo $value;
}
is giving the error:
Notice: Undefined property: DOMNodeList::$attributes in
So how could I extract the full HTML code of the found DOM elements?
I knew I was so close... I just needed to do:
foreach ($intro as $item) {
$h1= $item->getElementsByTagName('h1');
$h2= $item->getElementsByTagName('h2');
$p= $item->getElementsByTagName('p');
}
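If the goal is the full markup of each match as a string, one option is to pass the node to DOMDocument::saveHTML(). The sketch below is an alternative to the loop above, not the poster's own solution. It assumes the section markup from the question is loaded from a string instead of the remote page, and it uses a stricter class test so that a class such as "introduction" would not match.

```php
<?php
$html = '<section class="intro"><div class="container"><h1>Student Club</h1>'
      . '<h2>Subtitle</h2><p>Lore ipsum paragraph.</p></div></section>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // HTML5 tags such as <section> can trigger parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$finder = new DOMXPath($doc);
// Match "intro" as a whole class token, not as a substring.
$intro = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' intro ')]");

$fragments = array();
foreach ($intro as $item) {
    // Passing a node to saveHTML() serializes that node and its descendants.
    $fragments[] = $doc->saveHTML($item);
}
echo implode("\n", $fragments);
```

Each entry of $fragments then holds the section's markup, including the div, h1, h2 and p tags.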

domxpath- how to get content for parent tag only instead of child tags

I am using a DOMXPath query to fetch the content of the parent tag only (td[class='s']), without the content of the div nested inside that td, as shown in my code below.
<?php
$second_trim='<td class="s" style="line-height:18px;">THIS TEXT IS REQUIRED and <div id="a" style="display:none;background-color:black;border:1px solid #ddd;padding:5px;color:black;">THIS TEXT IS NOT REQUIRED </div></td>';
$dom = new DOMDocument();
$dom->validateOnParse = true;
@$dom->loadHTML($second_trim);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$b = $xpath->query('//td[@class="s"]');
echo "<p style='font-size:14px;color:red;'><b style='font-size:18px;color:gray;'>cONTENT :- </b>".$b->item(0)->nodeValue."</p>";
?>
So how can I drop the content of that div tag and fetch only the td's own content? Any ideas?
EDIT
If you are only interested in the direct text content modify your xpath query:
$b = $xpath->query('//td[@class="s"]/text()');
echo '<p style="font-size:14px;color:red;">'
.'<b style="font-size:18px;color:gray;">cONTENT :- </b>'
.$b->item(0)->nodeValue
.'</p>';
Right now the result is very specific to the example: if more than one direct text node exists, only the first is displayed. To show them all, loop over the DOMNodeList $b with foreach and echo every selected node value.
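The loop over every direct text node might look like the sketch below. The markup is a shortened variant of the question's td, wrapped in a table, with an extra piece of trailing text added as an assumption so that there is more than one text node to collect.

```php
<?php
$html = '<table><tr><td class="s">THIS TEXT IS REQUIRED and '
      . '<div id="a">THIS TEXT IS NOT REQUIRED </div>more direct text</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$parts = array();
// text() selects only the td's own text nodes, not text inside the nested div.
foreach ($xpath->query('//td[@class="s"]/text()') as $textNode) {
    $piece = trim($textNode->nodeValue);
    if ($piece !== '') {
        $parts[] = $piece;
    }
}
$content = implode(' ', $parts);
echo $content;
```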

Good practice parsing specific html table with urls in perl

Given a html with table data like the following...
<tr class=nbg1><td><A HREF=api.dll?pgm=cdq32&p1=oavmsd&p2=fg3m9s5d&p3=&cmd=w1d27d0id9654&hl=antibio&lstid=026>Nadifloxacin</A></td><td>Aknetherapeutikum Antibiotikum (Gyrasehemmer)</td><td>WST</td><td></td></tr>
<tr class=nbg2><td><A HREF=api.dll?pgm=cdq32&p1=oavmsd&p2=fg3m9s5d&p3=&cmd=w1d27d0id9728&hl=antibio&lstid=026>Ertapenem</A></td><td>Antibiotikum</td><td>WST</td><td></td></tr>
<tr class=nbg1><td><A HREF=api.dll?pgm=cdq32&p1=oavmsd&p2=fg3m9s5d&p3=&cmd=w1d27d0id9761&hl=antibio&lstid=026>Panipenem</A></td><td>Beta-Lactam-Antibiotikum</td><td>WST</td><td></td></tr>
<tr class=nbg2><td><A HREF=api.dll?pgm=cdq32&p1=oavmsd&p2=fg3m9s5d&p3=&cmd=w1d27d0id10302&hl=antibio&lstid=026>Prulifloxacin</A></td><td>Antibiotikum (Gyrasehemmer)</td><td>WST</td><td></td></tr>
</table></td>
<td width=15></td><td valign=top nowrap class=NBG1>
<TABLE width="200" border="0" cellspacing="0" cellpadding="2">
<TR><TD CLASS="NBG2">
</TD></TR></TABLE><BR>
I need to parse the URL and the URL description, where the extracted URL will be used for further parsing of the subpage. What would be a good practice to accomplish this, especially getting the URL?
Current code:
my $te = HTML::TableExtract->new( depth => 3, count => 0 );
$te->parse($mainpage);
foreach my $ts ($te->tables) {
    foreach my $row ($ts->rows) {
        # Print the first cell of each row.
        print $row->[0] . "\n";
    }
}
If you want to extract only the href attribute from each a element in that table, there is no need to use TableExtract; just use HTML::Query:
my $qry = HTML::Query->new(text => $mainpage);
my @hrefs = grep { defined($_) && m/api\.dll/i } map { $_->attr('href') } $qry->query('tr > td > a')->get_elements();
Not tested, but you get the idea...
HTML::TableExtract can help you with exactly this kind of table handling.

dojo file upload using zend framework problem

I am struggling with a bit of dojo that is needed to upload a file. Now the file upload form sits within a dojo dialog box, so is hidden until the user selects an 'upload file' button.
This button can be clicked on anywhere on the site, so I've created a controller to handle the upload.
At the moment I am just trying to get it to work, and in my head script I have the following:
<?php $this->headScript()->captureStart(); ?>
function sendForm(){
    //Hide the file input field
    dojo.style('inputField',"display","none");
    //Show the progress bar
    dojo.style('progressField',"display","inline");
    dojo.byId('preamble').innerHTML = "Uploading ...";
    dojo.io.iframe.send({
        url: "<?php echo $this->baseUrl(); ?>/fileprocssing/loadfile/",
        method: "post",
        handleAs: "text",
        form: dojo.byId('StartFrm'),
        handle: function(data,ioArgs){
            var fileData = dojo.fromJson(data);
            if (fileData.status == "success"){
                //Show the file input field
                dojo.style(dojo.byId('inputField'),"display","inline");
                dojo.byId('fileInput').value = '';
                //Hide the progress bar
                dojo.style(dojo.byId('progressField'),"display","none");
                dojo.byId('uploadedFiles').innerHTML += "success: File: " + fileData.details.name
                    + " size: " + fileData.details.size +"<br>";
                dojo.byId('preamble').innerHTML = "File to Upload: ";
            }else{
                dojo.style(dojo.byId('inputField'),"display","inline");
                dojo.style(dojo.byId('progressField'),"display","none");
                dojo.byId('preamble').innerHTML = "Error, try again: ";
            }
        }
    });
}
<?php $this->headScript()->captureEnd() ?>
With the basic upload form like this:
<form id="StartFrm" enctype="multipart/form-data"
name="cvupload"
action="<?php echo $this->baseUrl();?>/fileprocssing/loadfile/"
method="post">
<input type="hidden" name="MAX_FILE_SIZE" value="500000">
<!-- wrapping these in spans to be able to modify
parts of this form depending on what the
dojo.io.iframe.submit() does -->
<span id="preamble">File to Upload:</span><br>
<span id="inputField">
<input type="file" id="fileInput" name="uploadFile">
</span>
<span id="progressField" style="display:none;">
<div dojoType="dijit.ProgressBar" style="width:200px" indeterminate="true"></div>
</span>
<br/>
<button value="upload" dojoType="dijit.form.Button"
onclick="sendForm()">Upload</button>
</form>
What I would like to know is how I can get the JSON data object that contains the upload information from /fileprocssing/loadfile/ when the form is called from /somecontroller/someaction/, and how to redirect automatically to something like /fileprocesing/reviewdata/ once the file has been processed.
At the moment the action that I have looks like this:
public function loadfileAction() {
    $log = Zend_Registry::getInstance()->get('log');
    $log->log('in loadfileaction', Zend_Log::DEBUG);
    $log->log($_FILES['uploadFile']['name'], Zend_Log::DEBUG);
    $uploadedFile = array(
        'details' => $_FILES['uploadFile'],
        'status' => 'success'
    );
    // Log the upload details as JSON.
    $log->log(Zend_Json::encode($uploadedFile), Zend_Log::DEBUG);
    $foo = "{'status':'success',details: {name:'".
        $_FILES['uploadFile']['name'].
        "',size:".
        $_FILES['uploadFile']['size'].
        "}}";
    $log->log($foo, Zend_Log::DEBUG);
    $this->view->fileData = $foo;
}
I've handcrafted the JSON data for the time being; I will use Zend_Dojo_Data later, but at the moment I am just trying to get this working.
I have to confess that I don't know dojo that well, but I am trying to get my head around it in the shortest possible time.
Thanks in advance.
dojo.io.iframe.send requires the response data to be wrapped in a TEXTAREA tag. This is the only/easiest cross-browser way to successfully access and load the response data, and it is a requirement. It looks like you are sending plain JSON back from the action.
You can also set handleAs to "json" and skip the intermediate dojo.fromJson(data) call; the data will then be passed to you as a JSON object (provided the response is wrapped in the aforementioned TEXTAREA).
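As a rough sketch of the wrapping described above, the response body could be built along these lines in plain PHP. The helper name wrapForIframe and the sample file details are hypothetical, and how the string is emitted from the Zend action (layout, view script) is left out.

```php
<?php
// Hypothetical helper: wraps a payload as JSON inside a <textarea> element.
function wrapForIframe(array $payload)
{
    // JSON_HEX_TAG and JSON_HEX_AMP escape <, > and & so the JSON cannot
    // close the textarea early or be altered by entity decoding.
    $json = json_encode($payload, JSON_HEX_TAG | JSON_HEX_AMP);
    return '<html><body><textarea>' . $json . '</textarea></body></html>';
}

// Sample values standing in for the $_FILES entry of a real upload.
$response = wrapForIframe(array(
    'status'  => 'success',
    'details' => array('name' => 'cv.pdf', 'size' => 12345),
));
echo $response;
```

Using json_encode instead of a hand-built string also yields strictly valid JSON, which matters once handleAs is switched to "json".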