strip_tags or preg_replace to remove few tags from html? - preg-replace

I am in dilemma between these two.
I want to strip head tags ( and everything inside/before including doctype/html) , body tag and script tags from a page that I am importing via curl.
So first thought was this
$content = strip_tags($content, '<img><p><a><div><table><tbody><th><tr><td><br><span><h1><h2><h3><h4><h5><h6><code><pre><b><strong><ol><ul><li><em>'.$tags);
which as you can see can get even longer with HTML5 tags, video object etc..
Than I saw this here.
https://stackoverflow.com/a/16377509/594423
Can anyone advise the preferred method or show your way of doing this and please explain why and
possibly tell me which one is faster.
Thank you!

You can test something like that:
$dom = new DOMDocument();
#$dom->loadHTML($content);
$result = '';
$bodyNode = $dom->getElementsByTagName('body')->item(0);
$scriptNodes = $bodyNode->getElementsByTagName('script');
$toRemove = array();
foreach ($scriptNodes as $scriptNode) {
$toRemove[] = $scriptNode;
}
foreach($toRemove as $node) {
$node->parentNode->removeChild($node);
}
$bodyChildren = $bodyNode->childNodes;
foreach($bodyChildren as $bodyChild) {
$result .= $dom->saveHTML($bodyChild);
}
The advantage of the DOM approach is a relative reliability against several html traps, especially some cases of malformed tags, or tags inside javascript strings: var str = "<body>";
But what about speed?
If you use a regex approach, for example:
$pattern = <<<'EOD'
~
<script[^>]*> (?>[^<]++|<(?!/script>))* </script>
|
</body>.*$
|
^ (?>[^<]++|<(?!body\b))* <body[^>]*>
~xis
EOD;
$result = preg_replace($pattern, '', $content);
The result is a little faster (from 1x to 2x for an html file with 400 lines). But with this code, the reliability decreases.
If speed is important and if you have a good idea of the html quality, for the same reliability level than the regex version, you can use:
$offset = stripos($content, '<body');
$offset = strpos($content, '>', $offset);
$result = strrev(substr($content,++$offset));
$offset = stripos($result, '>ydob/<');
$result = substr($result, $offset+7);
$offset = 0;
while(false !== $offset = stripos($result, '>tpircs/<', $offset)) {
$soffset = stripos($result, 'tpircs<', $offset);
$result = substr_replace($result, '', $offset, $soffset-$offset+7);
}
$result = strrev($result);
That is between 2x and 5x faster than the DOM version.

Related

How correctly to use the snippet() function?

My first Sphinx app almost works!
I successfully save path,title,content as attributes in index!
But I decided go to SphinxQL PDO from AP:
I found snippets() example thanks to barryhunter again but don't see how use it.
This is my working code, except snippets():
$conn = new PDO('mysql:host=ununtu;port=9306;charset=utf8', '', '');
if(isset($_GET['query']) and strlen($_GET['query']) > 1)
{
$query = $_GET['query'];
$sql= "SELECT * FROM `test1` WHERE MATCH('$query')";
foreach ($conn->query($sql) as $info) {
//snippet. don't works
$docs = array();
foreach () {
$docs[] = "'".mysql_real_escape_string(strip_tags($info['content']))."'";
}
$result = mysql_query("CALL SNIPPETS((".implode(',',$docs)."),'test1','" . mysql_real_escape_string($query) . "')",$conn);
$reply = array();
while ($row = mysql_fetch_array($result,MYSQL_ASSOC)) {
$reply[] = $row['snippet'];
}
// path, title out. works
$path = rawurlencode($info["path"]); $title = $info["title"];
$output = '<a href=' . $path . '>' . $title . '</a>'; $output = str_replace('%2F', '/', $output);
print( $output . "<br><br>");
}
}
I have got such structure from Sphinx index:
Array
(
[0] => Array
(
[id] => 244
[path] => DOC7000/zdorovie1.doc
[title] => zdorovie1.doc
[content] => Stuff content
I little bit confused with array of docs.
Also I don't see advice: "So its should be MUCH more efficient, to compile the documents and call buildExcepts just once.
But even more interesting, is as you sourcing the the text from a sphinx attribute, can use the SNIPPETS() sphinx function (in setSelect()!) in the main query. SO you dont have to receive the full text, just to send back to sphinx. ie sphinx will fetch the text from attribute internally. even more efficient!
"
Tell me please how I should change code for calling snippet() once for docs array, but output path (link), title for every doc.
Well because your data comes from sphinx, you can just use the SNIPPET() function (not CALL SNIPPETS()!)
$query = $conn->quote($_GET['query']);
$sql= "SELECT *,SNIPPET(content,$query) AS `snippet` FROM `test1` WHERE MATCH($query)";
foreach ($conn->query($sql) as $info) {
$path = rawurlencode($info["path"]); $title = $info["title"];
$output = '<a href=' . $path . '>' . $title . '</a>'; $output = str_replace('%2F', '/', $output);
print("$output<br>{$info['snippet']}<br><br>");
}
the highlighted text is right there in the main query, dont need to mess around with bundling the data back up to send to sphinx.
Also shows you should be escaping the raw query from user.
(the example you found does that, because the full text comes fom MySQL - not sphinx - so it has no option but to mess around sending data back and forth!)
Just for completeness, if REALLY want to use CALL SNIPPETS() would be something like
<?php
$query =$conn->quote($_GET['query']);
//make query request
$sql= "SELECT * FROM `test1` WHERE MATCH($query)";
$result = $conn->query($sql);
$rows = $result->fetchAll(PDO::FETCH_ASSOC);
//build list of docs to send
$docs = array();
foreach ($rows as $info) {
$docs[] = $conn->quote(strip_tags($info['content']));
}
//make snippet reqest
$sql = "CALL SNIPPETS((".implode(',',$docs)."),'test1',$query)";
//decode reply
$reply = array();
foreach ($conn->query($sql) as $row) {
$reply[] = $row['snippet'];
}
//output results using $rows, and cross referencing with $reply
foreach ($rows as $idx => $info) {
// path, title out. works
$path = rawurlencode($info["path"]); $title = $info["title"];
$output = '<a href=' . $path . '>' . $title . '</a>'; $output = str_replace('%2F', '/', $output);
$snippet = $reply[$idx];
print("$output<br>$snippet<br><br>");
}
Shows putting the rows into an array, because need to lopp though the data TWICE. Once to 'bundle' up the docs array to send. Then again to acully display rules, when have $rows AND $reply both available.

Unable to upload web page

I am trying to upload a file from a web page. I use CGI to query for all of the input fields but "my $upload_filehandle = $query->upload("fileInput");" is always empty even when "my $file = $query->param("fileInput");" has properly fetched the file name from the same field. Here is the code:
use CGI;
$CGI::POST_MAX = 1024 * 5000;
my $query = CGI->new;
my $url = $query->param("urlInput");
my $file = $query->param("fileInput");
my $upload_filehandle = $query->upload("fileInput");
my $text = $query->param("textInput");
my $k = $query->param("kInput");
Any advice is appreciated.
Regards.
I will repost my comment, as an answer:
make sure your HTML form contains the enctype attribute, and that it's set to enctype="multipart/form-data"

Write config in Zend Framework with APPLICATION_PATH

For an application I'd like to create some kind of setup-steps. In one of the steps the database configuration is written to the application.ini file. This all works, but something very strange happens: All the paths to the directories (library, layout, ...) are changed from paths with APPLICATION_PATH . to full paths. As you can imagine, this isn't very systemfriendly. Any idea how I can prevent that?
I update the application.ini with this code:
# read existing configuration
$config = new Zend_Config_Ini(
$location,
null,
array('skipExtends' => true,
'allowModifications' => true));
# add new values
$config->production->doctrine->connection = array();
$config->production->doctrine->connection->host = $data['server'];
$config->production->doctrine->connection->user = $data['username'];
$config->production->doctrine->connection->password = $data['password'];
$config->production->doctrine->connection->database = $data['database'];
# write new configuration
$writer = new Zend_Config_Writer_Ini(
array(
'config' => $config,
'filename' => $location));
$writer->write();
Since Zend_Config_Ini uses the default ini scanning mode (INI_SCANNER_NORMAL), it will parse all options and replace constants with their respective values. What you could do, is call parse_ini_file directly, using the INI_SCANNER_RAW mode, so the options aren't parsed.
ie. use
$config = parse_ini_file('/path/to/your.ini', TRUE, INI_SCANNER_RAW);
You will get an associative array that you can manipulate as you see fit, and afterwards you can write that back with the following snippet (from the comments):
function write_ini_file($assoc_arr, $path, $has_sections=FALSE) {
$content = "";
if ($has_sections) {
foreach ($assoc_arr as $key=>$elem) {
$content .= "[".$key."]\n";
foreach ($elem as $key2=>$elem2) {
if(is_array($elem2))
{
for($i=0;$i<count($elem2);$i++)
{
$content .= $key2."[] = ".$elem2[$i]."\n";
}
}
else if($elem2=="") $content .= $key2." = \n";
else $content .= $key2." = ".$elem2."\n";
}
}
}
else {
foreach ($assoc_arr as $key=>$elem) {
if(is_array($elem))
{
for($i=0;$i<count($elem);$i++)
{
$content .= $key2."[] = ".$elem[$i]."\n";
}
}
else if($elem=="") $content .= $key2." = \n";
else $content .= $key2." = ".$elem."\n";
}
}
if (!$handle = fopen($path, 'w')) {
return false;
}
if (!fwrite($handle, $content)) {
return false;
}
fclose($handle);
return true;
}
ie. call it with :
write_ini_file($config, '/path/to/your.ini', TRUE);
after manipulating the $config array. Just make sure you add double quotes to the option values where needed...
Or alternatively - instead of using that function - you could try writing it back using Zend_Config_Writer_Ini, after converting the array back to a Zend_Config object, I guess that should work as well...
I'm guess you could iterate over the values, checking for a match between the value of APPLICATION_PATH, and replacing it with string literal APPLICATION_PATH.
That is if you know that APPLICATION_PATH contains the string '/home/david/apps/myapp/application' and you find a config value '/home/david/apps/myapp/application/views/helpers', then you do some kind of replacement of the leading string '/home/david/apps/myapp/application' with the string 'APPLICATION_PATH', ending up with 'APPLICATION_PATH "/views/helpers"'.
Kind of a kludge, but something like that might work.
This is a long shot - but have you tried running your Zend_Config_Writer_Ini code while the APPLICATION_PATH constant is not defined? It should interpret it as the literal string 'APPLICATION_PATH' and could possibly work.

<fb:comments-count> not working on my WordPress powered blog

I am using the Facebook comments plugin on WordPress and the comments box is working fine but I want to access the number of counts on the index page and on single pages. On the pages, the Facebook Javascript is loaded on the pages.
Here's the code I used:
<fb:comments-count href=<?php echo get_permalink() ?>/></fb:comments-count> comments
But it doesn't count the FB comments.
Is there a simple code that let me retrieve the number of comment counts?
Thanks,
Include this function somewhere in your template file :
function fb_comment_count() {
global $post;
$url = get_permalink($post->ID);
$filecontent = file_get_contents('https://graph.facebook.com/?ids=' . $url);
$json = json_decode($filecontent);
$count = $json->$url->comments;
if ($count == 0 || !isset($count)) {
$count = 0;
}
echo $count;
}
use it like this in your homepage or wherever
<?php fb_comment_count() ?>
Had the same problem, that function worked for me... if you get an error... try reading this.
The comments often don't appear here :
graph.facebook.com/?ids = [your url]
Instead they appear well in
graph.facebook.com/comments/?ids = [your url]
Hence the value of the final solution.
Answer by ifennec seems fine, but actually is not working (facebook maybe changed something and now is only returning the number of shares).
You could try to get all the comments:
$filecontent = file_get_contents(
'https://graph.facebook.com/comments/?ids=' . $url);
And count all:
$json = json_decode($filecontent);
$content = $json->$url;
$count = count($content->data);
if (!isset($count) || $count == 0) {
$count = 0;
}
echo $count;
This is just a fix until facebook decides to read the FAQ about fb:comments-count, and discovers it's not working :) (http://developers.facebook.com/docs/reference/plugins/comments/ yeah, awesome comments).
By the way, I applied the function in Drupal 7 :) Thank you very much ifennec, you showed me the way.
This works for me :
function fb_comment_count() {
global $post;
$url = get_permalink($post->ID);
$filecontent = file_get_contents('https://graph.facebook.com/comments/?ids=' . $url);
$json = json_decode($filecontent);
echo(count($json->$url->comments->data));
}
This is resolved.
<p><span class="cmt"><fb:comments-count href=<?php the_permalink(); ?>></fb:comments-count></span> Comments</p>
The problem was that I was using 'url' than a 'href' attribute in my case.
Just put this function in functions.php and pass the post url to function fb_comment_count wherever you call it on your theme files
function fb_comment_count($url) {
$filecontent = file_get_contents('https://graph.facebook.com/comments/?ids=' . $url);
$json = json_decode($filecontent);
$content = $json->$url;
echo count($content->comments->data);
}

Cannot print on multiple pages using PDF::API2

I have been tinkering around with PDF::API2 and i am facing a problem, create a pdf file very well and add text into it. However say if the text to be written flows over to more than one page, the script does not print over to the next page. I have tried researching for an answer to this but to no avail. I would like each page to have exactly 50 lines of text. My script is as below. It only prints on the first page, creates the other pages but does not print into them. Anyone with a solution
!/usr/bin/perl
use PDF::API2;
use POSIX qw(setsid strftime);
my $filename = scalar(strftime('%F', localtime));
my $pdf = PDF::API2->new(-file => "$filename.pdf");
$pdf->mediabox(595,842);
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
$txt->textstart;
$txt->font($fnt, 20);
$txt->translate(100,800);
$txt->text("Lines for $filename");
my $i=0;
my $line = 780;
while($i<310)
{
if(($i%50) == 0)
{
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
}
$txt->font($fnt, 10);
$txt->translate(100,$line);
$txt->text("$i This is the first line");
$line=$line-15;
$i++;
}
$txt->textend;
$pdf->save;
$pdf->end( );
The problem is that you are making new page, but forget new variables instantly:
if(($i%50) == 0)
{
my $page = $pdf->page;
my $fnt = $pdf->corefont('Arial',-encoding => 'latin1');
my $txt = $page->text;
}
All my variables you make disappear on closing parentheses. Just remove my and you will modify variables from top-level scope.
Edit: You also probably want to reset $line variable when making new page.
The typeface, $fnt, does not have to be changed since it depends on the PDF, $pdf, and not the page, $page.
As much as I love Perl, I learned enough Python to use the ReportLabs library for PDF generation. Creating PDF is one of the weak spots of Perl v. Python.