How to loop the result from findnodes() with HTML::TreeBuilder::XPath - perl

I have a script that monitors some Facebook pages. Since the Facebook API revoked public page access permission on 4-SEP-2019, I need to parse the content with XPath instead.
Each Facebook post is wrapped in div[contains(@class,"userContentWrapper")]. I would like to loop over the posts one by one to find the desired data.
I don't know why $message = $post->findvalue('//div[@data-testid="post_message"]//p'); returns the text of the <p> elements of every post.
use LWP::UserAgent;
$ua = new LWP::UserAgent;
$request = new HTTP::Request;
$request->url('https://www.facebook.com/pg/FIFA/posts/');
$request->method('GET');
$request->header('User-Agent' => 'Mozilla/5.0 Chrome/71.0.3578.98 Safari/537.36');
$response = $ua->request($request);
open(HTM, ">zzz.htm");
print HTM $response->content;
close(HTM);
use HTML::TreeBuilder::XPath;
$tree = HTML::TreeBuilder::XPath->new_from_content($response->content);
$posts = $tree->findnodes('//div[contains(@class,"userContentWrapper")]');
for my $post (@{$posts})
{
$id = $post->findnodes('//div[@data-testid="story-subtitle"]/@id');
$id = $id->[0]->getValue;
print "id = $id\n\n";
$object_id = $post->findnodes('//div[@data-testid="story-subtitle"]//a/@href');
$object_id = 'https://www.facebook.com' . $object_id->[0]->getValue;
print "object_id = $object_id\n\n";
$message = $post->findvalue('//div[@data-testid="post_message"]//p');
# $message = $message->[0]->getValue;
print "$message\n\n";
$ajaxify = $post->findnodes('//div[@class="mtm"]//a/@ajaxify');
$ajaxify = $ajaxify->[0]->getValue;
print "ajaxify = $ajaxify\n\n";
$ploi = $post->findnodes('//div[@class="mtm"]//a/@data-ploi');
$ploi = $ploi->[0]->getValue;
print "ploi = $ploi\n\n";
# $plsi = $post->findnodes('//div[@class="mtm"]//a/@data-plsi');
# $plsi = $plsi->[0]->getValue;
# print "plsi = $plsi\n\n";
$href = $post->findnodes('//div[@class="mtm"]//a/@href');
$href = 'https://www.facebook.com' . $href->[0]->getValue;
print "---------------------------------------------------------\n\n";
}

The post is unclear and seems to contain multiple questions. This needs to be fixed, but in the meantime, I'll address the following:
I would like to loop posts one by one to find a desired data.
From HTML::TreeBuilder::XPath,
findnodes ($path)
Returns a list of nodes found by $path. In scalar context returns an Tree::XPathEngine::NodeSet object.
From Tree::XPathEngine::NodeSet,
get_nodelist()
Returns a list of nodes. See Tree::XPathEngine::XMLParser for the format of the nodes.
So,
my @posts = $tree->findnodes('...');
for my $post (@posts) { ... }
or
my $posts = $tree->findnodes('...');
for my $post ($posts->get_nodelist()) { ... }
Any other questions should be posted as separate Questions.
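As for why findvalue() returns the text of every post: in XPath, an expression that starts with // is evaluated from the document root, whatever node the method is called on. A leading dot (.//) keeps the search inside the current node. A minimal sketch, assuming a made-up two-post HTML fragment in place of the real Facebook markup:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical stand-in for the downloaded page: two posts, one message each.
my $html = <<'HTML';
<html><body>
<div class="userContentWrapper"><div data-testid="post_message"><p>first</p></div></div>
<div class="userContentWrapper"><div data-testid="post_message"><p>second</p></div></div>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @messages;
# List context: findnodes() returns the matching nodes directly.
for my $post ($tree->findnodes('//div[contains(@class,"userContentWrapper")]')) {
    # ".//" restricts the search to descendants of $post;
    # a bare "//" would search the whole document again.
    push @messages, $post->findvalue('.//div[@data-testid="post_message"]//p');
}

print "$_\n" for @messages;
$tree->delete;
```

The same leading dot would apply to the other findnodes() calls inside the loop.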

How to use the snippet() function correctly?

My first Sphinx app almost works!
I successfully saved path, title and content as attributes in the index.
But I decided to switch from the API to SphinxQL via PDO.
I found a snippets() example, thanks to barryhunter again, but I don't see how to use it.
This is my working code, except for the snippets() part:
$conn = new PDO('mysql:host=ununtu;port=9306;charset=utf8', '', '');
if(isset($_GET['query']) and strlen($_GET['query']) > 1)
{
$query = $_GET['query'];
$sql= "SELECT * FROM `test1` WHERE MATCH('$query')";
foreach ($conn->query($sql) as $info) {
// snippet part: does not work
$docs = array();
foreach () {
$docs[] = "'".mysql_real_escape_string(strip_tags($info['content']))."'";
}
$result = mysql_query("CALL SNIPPETS((".implode(',',$docs)."),'test1','" . mysql_real_escape_string($query) . "')",$conn);
$reply = array();
while ($row = mysql_fetch_array($result,MYSQL_ASSOC)) {
$reply[] = $row['snippet'];
}
// path, title out. works
$path = rawurlencode($info["path"]); $title = $info["title"];
$output = '<a href=' . $path . '>' . $title . '</a>'; $output = str_replace('%2F', '/', $output);
print( $output . "<br><br>");
}
}
I have got such structure from Sphinx index:
Array
(
[0] => Array
(
[id] => 244
[path] => DOC7000/zdorovie1.doc
[title] => zdorovie1.doc
[content] => Stuff content
I am a little confused by the array of docs.
Also, I don't understand this advice: "So its should be MUCH more efficient, to compile the documents and call buildExcepts just once.
But even more interesting, is as you sourcing the the text from a sphinx attribute, can use the SNIPPETS() sphinx function (in setSelect()!) in the main query. SO you dont have to receive the full text, just to send back to sphinx. ie sphinx will fetch the text from attribute internally. even more efficient!
"
Please tell me how I should change the code to call snippet() once for the docs array, while still outputting the path (link) and title for every doc.
Well, because your data comes from Sphinx, you can just use the SNIPPET() function (not CALL SNIPPETS()!):
$query = $conn->quote($_GET['query']);
$sql= "SELECT *,SNIPPET(content,$query) AS `snippet` FROM `test1` WHERE MATCH($query)";
foreach ($conn->query($sql) as $info) {
$path = rawurlencode($info["path"]); $title = $info["title"];
$output = '<a href=' . $path . '>' . $title . '</a>'; $output = str_replace('%2F', '/', $output);
print("$output<br>{$info['snippet']}<br><br>");
}
The highlighted text is right there in the main query, so you don't need to mess around with bundling the data back up to send to Sphinx.
It also shows that you should be escaping the raw query from the user.
(The example you found does that because the full text comes from MySQL, not Sphinx, so it has no option but to send data back and forth!)
Just for completeness, if you REALLY want to use CALL SNIPPETS(), it would be something like:
<?php
$query =$conn->quote($_GET['query']);
//make query request
$sql= "SELECT * FROM `test1` WHERE MATCH($query)";
$result = $conn->query($sql);
$rows = $result->fetchAll(PDO::FETCH_ASSOC);
//build list of docs to send
$docs = array();
foreach ($rows as $info) {
$docs[] = $conn->quote(strip_tags($info['content']));
}
//make snippet request
$sql = "CALL SNIPPETS((".implode(',',$docs)."),'test1',$query)";
//decode reply
$reply = array();
foreach ($conn->query($sql) as $row) {
$reply[] = $row['snippet'];
}
//output results using $rows, and cross referencing with $reply
foreach ($rows as $idx => $info) {
// path, title out. works
$path = rawurlencode($info["path"]); $title = $info["title"];
$output = '<a href=' . $path . '>' . $title . '</a>'; $output = str_replace('%2F', '/', $output);
$snippet = $reply[$idx];
print("$output<br>$snippet<br><br>");
}
This puts the rows into an array because you need to loop through the data TWICE: once to 'bundle' up the docs array to send, then again to actually display the results, when both $rows AND $reply are available.

perl LDAP entry not recognised

We are writing a Perl script (to be run from Unix) that will reset the password of a Windows AD user. (We are not using PowerShell, as we have been asked not to use Windows scripts.)
With the following Perl code, we are able to connect to the AD User directory and query the correct user.
#!/usr/bin/perl -w
#########################
#This script resets the password in active user directory
#########################
use strict;
use warnings;
use DBI;
use Net::LDAP;
use Net::LDAPS;
use Authen::SASL qw(Perl);
use Net::LDAP::Control::Paged;
use Time::Local;
my $CERTDIR = "<cert path>";
my $AD_PASS = "$CERTDIR/.VDIAD_pass";
my $sAN = "vahmed";
### Generate Random Password ###
my $randompass = askPasswd();
my $uninewpass;
my $mail;
my $fullname;
my $name;
my $distName;
my $finalresult;
my @AD_passwords = get_domain_pass();
my $result = reset_AD_Password();
#Reset AD user password
sub reset_AD_Password {
my $ad = Net::LDAP->new($AD_passwords[0]);
my $msg = $ad->bind(dn => "cn=$AD_passwords[2],$AD_passwords[1]",
password => $AD_passwords[3],
version => 3);
if ($msg->code)
{
print "Error :" . $msg->error() . "\n";
exit 2;
}
my $acc_name = 'sAMAccountName';
my $acc_fullname = 'displayName';
my $acc_base = 'manager';
my $acc_distName = 'distinguishedName';
my $acc_mail = 'mail';
my $act = $ad->search(
base => "$AD_passwords[1]",
filter => "(&(objectCategory=person)(sAMAccountName=$sAN))",
attrs => [$acc_name, $acc_fullname, $acc_distName, $acc_mail]);
die 1 if ($act->count() !=1 );
my $samdn = $act->entry(0)->dn;
$fullname = $samdn->get_value($acc_fullname);
$mail = $samdn->get_value($acc_mail);
}
However we get an error on the line:
$fullname = $samdn->get_value($acc_fullname);
$mail = $samdn->get_value($acc_mail);
The error states "Can't locate object method "get_value" via package (distinguished Name) (perhaps you forgot to load (distinguished Name))"
However the code works correctly when we replace $samdn with the following code:
foreach my $entry ($act->entries){
$name = $entry->get_value($acc_name);
$fullname = $entry->get_value($acc_fullname);
$distName = $entry->get_value($acc_distName);
$mail = $entry->get_value($acc_mail);
}
It would appear that the code is unable to identify $samdn as a Net::LDAP::Entry record.
We have tried typecasting $samdn but got the same error.
Could someone help resolve this issue? We would prefer not to use the for loop, in case more than one record is returned by the search. Thanks in advance.
You are not assigning a Net::LDAP::Entry to $samdn. You are assigning the dn of the first entry.
#                         VVVV
my $samdn = $act->entry(0)->dn;
Get rid of that ->dn and it should work, if $act->entry(0) returns a Net::LDAP::Entry.
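The difference can be seen without a directory server. The sketch below builds a Net::LDAP::Entry by hand; the DN and attribute values are invented for illustration and are not taken from the question's directory:

```perl
use strict;
use warnings;
use Net::LDAP::Entry;

# Hypothetical entry standing in for what $act->entry(0) would return.
my $entry = Net::LDAP::Entry->new;
$entry->dn('CN=V Ahmed,OU=Users,DC=example,DC=com');
$entry->add(
    displayName => 'V Ahmed',
    mail        => 'vahmed@example.com',
);

# ->dn gives back a plain string, so it has no get_value() method.
my $dn = $entry->dn;
print ref($dn) ? "dn is an object\n" : "dn is a plain string\n";

# Attribute lookups go through the entry object itself.
my $fullname = $entry->get_value('displayName');
my $mail     = $entry->get_value('mail');
print "$fullname <$mail>\n";
```

Calling a method on the string is what produces the "Can't locate object method" error, with the DN text reported as the package name.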

strip_tags or preg_replace to remove few tags from html?

I am in a dilemma between these two.
I want to strip the head tag (and everything inside or before it, including the doctype and html tags), the body tag and script tags from a page that I am importing via curl.
So my first thought was this:
$content = strip_tags($content, '<img><p><a><div><table><tbody><th><tr><td><br><span><h1><h2><h3><h4><h5><h6><code><pre><b><strong><ol><ul><li><em>'.$tags);
which, as you can see, can get even longer with HTML5 tags, video, object, etc.
Then I saw this:
https://stackoverflow.com/a/16377509/594423
Can anyone advise the preferred method, or show your own way of doing this, explain why,
and possibly tell me which one is faster?
Thank you!
You can try something like this:
$dom = new DOMDocument();
@$dom->loadHTML($content);
$result = '';
$bodyNode = $dom->getElementsByTagName('body')->item(0);
$scriptNodes = $bodyNode->getElementsByTagName('script');
$toRemove = array();
foreach ($scriptNodes as $scriptNode) {
$toRemove[] = $scriptNode;
}
foreach($toRemove as $node) {
$node->parentNode->removeChild($node);
}
$bodyChildren = $bodyNode->childNodes;
foreach($bodyChildren as $bodyChild) {
$result .= $dom->saveHTML($bodyChild);
}
The advantage of the DOM approach is its relative robustness against several HTML traps, especially malformed tags or tags inside JavaScript strings, such as: var str = "<body>";
But what about speed?
If you use a regex approach, for example:
$pattern = <<<'EOD'
~
<script[^>]*> (?>[^<]++|<(?!/script>))* </script>
|
</body>.*$
|
^ (?>[^<]++|<(?!body\b))* <body[^>]*>
~xis
EOD;
$result = preg_replace($pattern, '', $content);
The result is a little faster (from 1x to 2x for an HTML file with 400 lines), but with this code reliability decreases.
If speed is important and you have a good idea of the HTML quality, you can get the same reliability level as the regex version with:
$offset = stripos($content, '<body');
$offset = strpos($content, '>', $offset);
$result = strrev(substr($content,++$offset));
$offset = stripos($result, '>ydob/<');
$result = substr($result, $offset+7);
$offset = 0;
while(false !== $offset = stripos($result, '>tpircs/<', $offset)) {
$soffset = stripos($result, 'tpircs<', $offset);
$result = substr_replace($result, '', $offset, $soffset-$offset+7);
}
$result = strrev($result);
That is between 2x and 5x faster than the DOM version.

POST array to LWP: only first entry getting posted

Here's a section of code. I am trying to POST a whole array using LWP, but the server receives only the first element of the array (index 0); the others are not sent. Please advise what I am doing wrong.
$data_post[0] = "text1";
$data_post[1] = "text2";
$data_post[2] = "texxt3";
$data_post[3] = "text4";
$data_post[4] ="text5";
my $ua= LWP::UserAgent->new();
my $response = $ua->post( $url, { 'istring' => @data_post} );
my $content = $response->decoded_content();
my $cgi = CGI->new();
print $cgi->header(), $content;
You can't assign an array to a hash key, only a scalar. Your attempt will expand the array and send this:
{ "istring" => "text1", "text2" => "texxt3", "text4" => "text5" }
Use an array ref instead by putting the "take a reference" operator in front of the array:
{ istring => \@data_post }
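The flattening can be checked in plain Perl, without LWP. A small sketch using the same five values; the variable names %flattened and %with_ref are only illustrative:

```perl
use strict;
use warnings;

my @data_post = ('text1', 'text2', 'texxt3', 'text4', 'text5');

# The array expands in place, so the list has six items: three key/value pairs.
my %flattened = (istring => @data_post);

# A reference is a single scalar, so the hash keeps one key holding all five values.
my %with_ref = (istring => \@data_post);

print join(', ', map { "$_ => $flattened{$_}" } sort keys %flattened), "\n";
print scalar(keys %with_ref), " key, ",
      scalar(@{ $with_ref{istring} }), " values\n";
```

HTTP::Request::Common, which LWP uses to build POST requests, documents an array reference as one way to send a multivalued form field, so the field name is repeated once per value. Whether the receiving script reads all of them depends on how it parses the form.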

How can I change the hostname in a URL using Perl?

I have some URLs like http://anytext.a.abs.com
In these, 'anytext' is the dynamic part. The rest of the URL remains the same in every case.
I'm using the following code:
$url = "http://anytext.a.abs.com";
my $request = new HTTP::Request 'GET', $url;
my $response = $ua->request($request);
if ($response->is_success)
{
function......;
}
Now, how can I parse a URL that has dynamic data in it?
Not sure, but is this close to what you're after?
for my $host (qw(anytext someothertext andanother)) {
my $url = "http://$host.a.abs.com";
my $request = new HTTP::Request 'GET', $url;
my $response = $ua->request($request);
if ($response->is_success)
{
function......;
}
}
Something like this maybe?
Otherwise, you can use the URI class to do url manipulation.
my $protocol = 'http://';
my $url_end = '.a.abs.com';
$url = $protocol . "anytext" . $url_end;
my $request = new HTTP::Request 'GET', $url;
my $response = $ua->request($request);
if ($response->is_success)
{
function......;
}
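The URI class mentioned above can read and replace the host directly. A short sketch, where 'othertext' is a hypothetical replacement for the dynamic part:

```perl
use strict;
use warnings;
use URI;

my $uri = URI->new('http://anytext.a.abs.com');

# host() with no argument returns the current host name.
my $old_host = $uri->host;

# Swap only the first label and keep the fixed ".a.abs.com" suffix.
(my $new_host = $old_host) =~ s/^[^.]+/othertext/;

# host() with an argument sets a new host on the same URI object.
$uri->host($new_host);

my $new_url = $uri->as_string;
print "$old_host -> $new_url\n";
```

This avoids hand-written string surgery on the scheme and separators, which matters more once the URL also carries a port or a path.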
I think this is probably enough:
# The regex specifies a string preceded by two slashes and all non-dots
my ( $host_name ) = $url =~ m{//([^.]+)};
And if you want to change it:
$url =~ s|^http://\K([^.]+)|$host_name_I_want|;
Or even:
substr( $url, index( $url, $host_name ), length( $host_name ), $host_name_I_want );
This will expand the segment sufficiently to accommodate $host_name_I_want.
Well, like you would parse any other data: Use the information you have about the structure.
You have a protocol part, followed by "colon slash slash", then the host followed by optional "colon port number" and an optional path on the host.
So ... build a little parser that extracts the information you are after.
And frankly, if you are only hunting for "what exactly is 'anytext' here?", a regex of this form should help (untested; use as guidance):
$url =~ m{http://(.*)\.a\.abs\.com};
$subdomain = $1;
do_something('with', $subdomain);
Sorry if I grossly misunderstood the problem at hand. Please explain what you mean with 'how can I parse a URL that has dynamic data in it?' in that case :)