I've been using both cURL and simple_html_dom for a while. For anyone who isn't familiar with Simple HTML DOM: it lets you walk through elements with ease, without the hassle of regexes, explode() calls, and so on.
E.g.
$html = file_get_html($obj->loc);
$item['title'] = $html->find('#Prod-Name h1',0)->plaintext;
However, as far as I'm aware, it doesn't support cookies the way cURL does. Is there something out there that does?
I'd be interested to hear people's experiences with screen scraping and bot creation.
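You can just download the page with cURL (which handles the cookies) and parse the result with the parsing library of your choice; with simple_html_dom, str_get_html() accepts the downloaded string. I use this method sometimes, but I'm not very happy with it. It would be nice if PHP had some decent scraping libraries, and even nicer if they were built in.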
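To make the "fetch with a cookie-aware client, then parse separately" split concrete, here is a minimal sketch. It is written in Python's standard library rather than PHP, purely as an illustration; the names make_opener, fetch, and product_title are made up for this example, and the parser only approximates the '#Prod-Name h1' selector from the question (it takes the first h1 that appears after the element with id="Prod-Name").

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser


def make_opener():
    """Build an opener that stores and resends cookies, much like a curl cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, url):
    """Download a page; cookies set by the server are kept in the opener's jar."""
    with opener.open(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


class FirstH1AfterContainer(HTMLParser):
    """Rough stand-in for the '#Prod-Name h1' selector (assumption: the first
    <h1> following the id="Prod-Name" element is the one we want)."""

    def __init__(self):
        super().__init__()
        self.seen_container = False
        self.in_h1 = False
        self.title = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "Prod-Name":
            self.seen_container = True
        elif tag == "h1" and self.seen_container and self.title is None:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.title = "".join(self._buf).strip()

    def handle_data(self, data):
        if self.in_h1:
            self._buf.append(data)


def product_title(html):
    """Return the plain text of the product heading, or None if it is missing."""
    parser = FirstH1AfterContainer()
    parser.feed(html)
    parser.close()
    return parser.title
```

Usage would be something like `opener, jar = make_opener()` followed by `product_title(fetch(opener, url))`. Reusing the same opener for a login request and for later page requests is what carries the session cookies along.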
I am hosting my Sinatra application using Apache with Passenger. It's hosted within a subfolder -- meaning, my main site is example.com, my application is at example.com/popcorn.
So I have a get '/' route, and that works fine. The problem is that my view includes an HTML form that makes a POST request to /upload, and the post '/upload' route isn't handling it. Instead of example.com/popcorn/upload, it's trying to reach example.com/upload.
So I figured, okay, not the ideal solution, but for now I'll hardcode the form action URL. But that doesn't work either: making the action popcorn/upload fails too. This is where I get a little baffled, and my Google-fu was weak; I couldn't find help there.
Maybe I could have some kind of Apache rewrite rule, but is that the correct solution? Am I missing something? I would really appreciate a tip here because it feels like I've messed up something very simple and it's really bugging me.
You probably want the url helper method. That takes into account where the app is mounted on the server:
url('/upload')
The above code will evaluate to something like this:
http://example.com/popcorn/upload
Inside your app you shouldn't need to change anything; the request will be routed to the existing post '/upload' handler.
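For anyone wondering why both form actions from the question miss, the browser's relative-URL resolution can be reproduced with a small sketch. This is Python's urljoin, used only for illustration. It assumes the page was served from example.com/popcorn/ with a trailing slash, which the question doesn't state. mounted_url is a made-up name that mimics the idea of prefixing the mount point; it is not Sinatra's implementation.

```python
from urllib.parse import urljoin

# Assumption: the form page is served from this URL (note the trailing slash).
PAGE = "http://example.com/popcorn/"

# A leading slash means "relative to the site root", so the mount point is lost.
root_relative = urljoin(PAGE, "/upload")

# Without a leading slash the path is resolved against the current directory,
# which already is /popcorn/, so the prefix gets doubled.
doubled = urljoin(PAGE, "popcorn/upload")


def mounted_url(base, script_name, path):
    """Prefix the path with the app's mount point, which is the idea behind
    a mount-aware URL helper."""
    return base.rstrip("/") + script_name + path
```

Here root_relative is http://example.com/upload and doubled is http://example.com/popcorn/popcorn/upload, while mounted_url("http://example.com", "/popcorn", "/upload") gives the URL the route actually lives at.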
We use Memcached and Zend Framework in our web project. Now we need to clear the cache selectively, using tags as specified in the Zend_Cache API.
Unfortunately, memcached doesn't support tags.
I have found these workarounds:
Memcached-tag project. Has anybody tested it? How would it be implemented with Zend?
Use wildcards like in this question, but it seems a bit confusing, less transparent, and harder to implement with Zend.
Use this implementation or this one for supporting tags in Memcached, being aware of the drawbacks.
Any other option?
Thanks in advance
You're right. Memcached doesn't support tags.
You can use another key-value entry to implement tags on top of Memcached: store the list of keys that belong to a tag under the tag's own key.
Example:
$this->objCache->save($arrResults, $strKey, array($strMyTag), $intCacheTime); // note: array($strMyTag) doesn't work with Memcache
MemcacheTag::setTag($strKey, $strMyTag); // our workaround
About setTag Method & MemcacheTag:
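class MemcacheTag {
    public static $objCache; // assumed: a connected Memcache instance, assigned elsewhere

    public static function setTag($strKey, $strTag) {
        $arrKey = self::$objCache->get($strTag);
        if (!is_array($arrKey)) $arrKey = array(); // get() returns false on a miss
        $arrKey[] = $strKey;
        self::$objCache->set($strTag, $arrKey); // write the updated key list back
    }

    public static function deleteCacheWithTag($strTag) {
        $arrKey = self::$objCache->get($strTag);
        foreach (is_array($arrKey) ? $arrKey : array() as $strKey) {
            self::$objCache->delete($strKey);
        }
        self::$objCache->delete($strTag); // drop the tag's own key list too
    }
}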
This workaround is quite simple, and it works for my projects.
*Note: this code needs some modification; sorry for posting in a hurry.
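As another option, and a different technique from the key-list approach above, a tag can be given a version counter that becomes part of every cache key. Bumping the counter makes all old entries unreachable without deleting them one by one. Below is a minimal sketch of that idea in Python, purely as an illustration. FakeCache is an in-memory stand-in for a real memcached client, and all function names are hypothetical. It also assumes one tag per entry.

```python
class FakeCache:
    """In-memory stand-in for a memcached client (get/set/incr only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def incr(self, key):
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


def tag_version(cache, tag):
    """Return the tag's current version, creating the counter on first use."""
    version = cache.get("tagver:" + tag)
    if version is None:
        version = 1
        cache.set("tagver:" + tag, version)
    return version


def tagged_key(cache, tag, key):
    """Build the real cache key, which embeds the tag's current version."""
    return f"{tag}:v{tag_version(cache, tag)}:{key}"


def save(cache, tag, key, value):
    cache.set(tagged_key(cache, tag, key), value)


def load(cache, tag, key):
    return cache.get(tagged_key(cache, tag, key))


def invalidate_tag(cache, tag):
    """Bump the version; entries saved under the old version are never read again."""
    tag_version(cache, tag)  # make sure the counter exists before incrementing
    cache.incr("tagver:" + tag)
```

The trade-off is one extra lookup per read for the version counter, and stale entries linger until the cache evicts them. In exchange, invalidation is a single increment and doesn't depend on reading and rewriting a shared key list.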
I would like to generate documentation in a Dancer application the same way Mojolicious does with Mojolicious::Plugin::PODRenderer, that is, in the browser, under the /perldoc path.
Does anybody know a module that can help? I found no ready-made plugin for Dancer. If none exists, any recommendation is welcome.
Porting Mojolicious' PODRenderer to Dancer should be fairly simple - it's an example plugin and the code is fairly short. I've done this for my own use in my CGI framework at work.
https://github.com/kraih/mojo/blob/master/lib/Mojolicious/Plugin/PODRenderer.pm#L34
Essentially, the plugin defines the route /perldoc/:module to call the _perldoc method. The _perldoc method uses Pod::Simple::Search to look for a documentation file matching the module param in the @INC directories. If it doesn't find one, it redirects the search to MetaCPAN. If it does, it uses Pod::Simple::HTML to convert the documentation to HTML, which is then tidied up with Mojo::DOM and wrapped in a lovely template.
Finding the location of that template is left as an exercise for... oh, nevermind, here it is: https://github.com/kraih/mojo/blob/master/lib/Mojolicious/templates/perldoc.html.ep
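The control flow described above (look the module up locally, otherwise hand off to MetaCPAN) is small enough to sketch. This is a language-neutral illustration in Python, not the plugin's code: find_pod and perldoc are made-up names, and the file lookup is a simplification that only checks for .pod and .pm files, whereas Pod::Simple::Search does more than that. The render step (Pod::Simple::HTML plus the template) is left out.

```python
import os

# Simplifying assumption: documentation lives in Foo/Bar.pod or Foo/Bar.pm.
POD_EXTENSIONS = (".pod", ".pm")


def find_pod(module, inc_dirs):
    """Return the first file matching the module name in the given directories."""
    relative = os.path.join(*module.split("::"))
    for directory in inc_dirs:
        for ext in POD_EXTENSIONS:
            candidate = os.path.join(directory, relative + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


def perldoc(module, inc_dirs):
    """Decide what the /perldoc/:module route should do for this module."""
    path = find_pod(module, inc_dirs)
    if path is None:
        # Nothing installed locally: send the browser to MetaCPAN instead.
        return ("redirect", "https://metacpan.org/pod/" + module)
    # Found locally: the caller would convert this file to HTML and template it.
    return ("render", path)
```

In a Dancer port, the "redirect" branch would map to a redirect response and the "render" branch to the POD-to-HTML conversion, with inc_dirs playing the role of @INC.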
I have a Perl Dancer web application that uses GD to dynamically create images. I am trying to deliver these images to the user as PNG. For example:
package MyApp;
use Dancer ':syntax';
use GD;
...
get '/dynamic_image/:var1/:var2' => sub {
    my $im = GD::Image->new(100,100);
    my $black = $im->colorAllocate(0,0,0);
    my $white = $im->colorAllocate(255,255,255);
    $im->rectangle(10,10,90,90,$white);
    my $png = $im->png;
    return send_file( \$png, content_type => 'image/png', filename => params->{var1}."_".params->{var2}.".png" );
};
However, when accessing the above route, Chrome and Firefox don't seem to know what to do with the image data. If I try to use the route in Lightbox, Chrome complains. For example, when clicking on a link like this:
link
Chrome's console says:
Resource interpreted as Image but transferred with MIME type application/octet-stream: "http://www.example.com/dynamic_image/my/image".
It looks like Dancer is not using content_type correctly. Interestingly, IE8 seems to load the images just fine. Any idea what's going on? I'm currently running it standalone on Windows 7 with Strawberry Perl v5.16.2.
To explain the different behavior with IE: If IE encounters a Content-Type of application/octet-stream, it will attempt to scan the file to determine a more specific MIME type. That behavior is covered more here.
I recommend using the GET command-line tool from Perl's LWP distribution to confirm what's going on. You can try this:
GET -sSe http://www.example.com/dynamic_image/my/image | less
The result should include among other things the Content-Type header. It sounds like you'll find that it says application/octet-stream. This starts to look like an issue with Dancer.
You didn't specify what version of Dancer you are using. Older versions did not support the content_type option to send_file(). If you are reading the latest docs on CPAN and expecting them to apply to an older version, there could be some confusion.
It does not seem to be a Dancer problem. There are other environments where it happens too:
Resource interpreted as Document but transferred with MIME type image
After banging my head against this for a while, I think I can answer my own question. Firefox actually tipped me off to a bug in my own code. When I accessed the dynamically created image in Firefox, it displayed a page with the HTTP request info along with the PNG data, and I noticed that some debugging text was displayed on the page too. It turns out that I had left a print in one of the loops that generated the image data (I had used it to verify the image was being built correctly), and that text somehow made it into the "image" itself, which I assume caused Firefox and Chrome to freak out a bit. So this wasn't a Dancer or application bug, but a PEBKAC issue. Thanks for the input, everybody.
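A quick way to catch this kind of problem is to check whether the response body still begins with the PNG signature, since stray output in front of the image data breaks it. A minimal sketch follows, in Python for illustration only; looks_like_png is a made-up helper name, and the sample bytes are invented.

```python
# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(body):
    """True if the body begins with the PNG signature."""
    return body.startswith(PNG_SIGNATURE)


# Invented sample data: a clean body, and one with debug text printed first.
clean_body = PNG_SIGNATURE + b"...rest of the image data..."
polluted_body = b"row 1 ok\n" + clean_body
```

Saving the raw response to a file and running a check like this (or just opening it in a hex viewer) shows immediately whether something was written before the image bytes.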
I want to use the #! token to make my GWT application crawlable, as described here:
http://code.google.com/web/ajaxcrawling/
There is a GWT sample app available online that uses this, for example:
http://gwt.google.com/samples/Showcase/Showcase.html#!CwRadioButton
It serves the following static web page to Googlebot:
http://gwt.google.com/samples/Showcase/Showcase.html?_escaped_fragment_=CwRadioButton
I want my GWT app to do something similar. In short, I'd like to serve a different flavor of the page whenever the _escaped_fragment_ parameter is found in the URL.
What should I modify in order for the server to serve something else (a static page, or a page dynamically generated through a headless browser like HTML Unit)? I'm guessing it could be the web.xml file, but I'm not sure.
(Note: I thought of checking the Showcase app provided with the GWT SDK, but unfortunately it doesn't seem to support serving static files on _escaped_fragment_, and it doesn't use the #! token.)
If you want to use web.xml, then I think it won't work with a servlet-mapping, because the url-patterns ignore the GET parameters. (I'm not 100% sure whether there is another way to make this possible.)
You could of course map Showcase.html to a servlet, and in that servlet decide what to do based on the GET parameter "_escaped_fragment_". But it's a little expensive to call a servlet just to serve a static page for the majority of requests. It's not too bad, and you could set cache headers if you're sure the page doesn't change.
Or you could have an Apache or something in front of your server - but I understand, I wouldn't like to have to do that either. Maybe your JavaEE server (which one are you using BTW?) provides some mechanism for URL filtering before the request gets passed on to the web container - I'd like to know that, too!
Found my answer! The Showcase sample supporting crawlable hyperlinks is in the following branch:
http://code.google.com/p/google-web-toolkit/source/browse/branches/crawlability/samples/showcase/?r=7726
It defines a filter in the web.xml to redirect URLs with the _escaped_fragment_ token to the output of HTML Unit.
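The decision such a filter has to make is simple: does the query string carry _escaped_fragment_, and if so, which #! URL should the headless browser render? Here is a minimal sketch of that logic, in Python for illustration only (the real filter would be a Java servlet filter). escaped_fragment and pretty_url are made-up names, and only the example URLs from the question are assumed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PARAM = "_escaped_fragment_"


def escaped_fragment(url):
    """Return the crawler's fragment value, or None for an ordinary request."""
    for name, value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
        if name == PARAM:
            return value
    return None


def pretty_url(url):
    """Rebuild the #! URL that a headless browser should load and render."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    fragment = next((v for n, v in pairs if n == PARAM), None)
    if fragment is None:
        return url  # not a crawler request: leave it untouched
    # Keep any other query parameters, drop the crawler-only one.
    rest = urlencode([(n, v) for n, v in pairs if n != PARAM])
    return urlunsplit((parts.scheme, parts.netloc, parts.path, rest, "!" + fragment))
```

A filter built on this would pass ordinary requests down the chain unchanged, and for crawler requests write out the HTML snapshot produced by rendering pretty_url(...).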