Perl WWW::Mechanize Module

I am using the WWW::Mechanize module to access website controls. Some HTML pages contain frames. I can't get the link names, and I am unable to access the links inside frames. Can anyone suggest the right way to resolve this issue?
Working Platform: Windows, Perl
Thanks in advance

From what I see, WWW::Mechanize does not load frames automatically; you need to do so yourself. You can get links to the frames with:
my @frames = $mech->find_all_links( tag => 'frame' );
and then $mech->get each one (cloning your mech object if necessary).
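A minimal sketch of that approach (the page URL and loop details here are illustrative, not from the original question):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.example.com/page_with_frames.html');

# FRAME tags are exposed as links, so collect them by tag.
my @frames = $mech->find_all_links( tag => 'frame' );

for my $frame (@frames) {
    # Clone so the outer page's state is preserved while each frame is fetched.
    my $frame_mech = $mech->clone();
    $frame_mech->get( $frame->url_abs() );

    # The links inside this frame are now available as usual.
    print $_->url_abs(), "\n" for $frame_mech->links();
}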

Related

Selenium, Web Scraping, can't access the class

I am very new to web scraping and have been dealing with the same problem for several days.
Please look at the line of code below (extracted directly from the web page):
<option value='pick' id='ember2314' class='x-option ember view'>To Pick</option>
Whatever I do, I can't access that element by its class:
driver.find_element_by_class_name('x-option ember view') # when I try to print the text here, it says it's unable to locate the element
But in some other cases I can easily access the element by its class, and sometimes, as in this case, I can't.
Can anyone please shed some light on this? (sorry, I am very new to web scraping)
Please note that the 'id' and 'value' are changing every time so I can't rely on them.
Any help would be much appreciated.
Thanks,
For people who are beginners like me, here is the solution. It is easy to locate the element with its XPath:
//tagname[@attribute='value of the attribute']
so for this case:
driver.find_element_by_xpath('//option[@class="x-option ember view"]')
would do the trick.
From my understanding, the class attribute here contains several space-separated classes ('x-option', 'ember', 'view'), and find_element_by_class_name() only accepts a single class name, so searching like this: find_element_by_class_name('x-option ember view') won't give you anything.

Perl Dancer send_file Issue with Images

I have a Perl Dancer web application that uses GD to dynamically create images. I am trying to deliver these images to the user as PNG. For example:
package MyApp;
use Dancer ':syntax';
use GD;
...
get '/dynamic_image/:var1/:var2' => sub {
    my $im = GD::Image->new(100,100);
    my $black = $im->colorAllocate(0,0,0);
    my $white = $im->colorAllocate(255,255,255);
    $im->rectangle(10,10,90,90,$white);
    my $png = $im->png;
    return send_file( \$png, content_type => 'image/png', filename => params->{var1}."_".params->{var2}.".png" );
};
However, when accessing the above route, Chrome and Firefox don't seem to know what to do with the image data. If I try to use the route in Lightbox, Chrome complains. For example, when clicking a link that points at the route, Chrome's console says:
Resource interpreted as Image but transferred with MIME type application/octet-stream: "http://www.example.com/dynamic_image/my/image".
It looks like Dancer is not using content_type correctly. Interestingly, IE8 seems to load the images just fine. Any idea what's going on? I'm currently running it standalone on Windows 7 with Strawberry Perl v5.16.2.
To explain the different behavior with IE: if IE encounters a Content-Type of application/octet-stream, it will attempt to scan the file to determine a more specific MIME type. That behavior is covered in more detail here.
I recommend using the GET command-line tool from Perl's LWP distribution to confirm what's going on. You can try this:
GET -sSe http://www.example.com/dynamic_image/my/image | less
The result should include, among other things, the Content-Type header. It sounds like you'll find that it says application/octet-stream. This starts to look like an issue with Dancer.
You didn't specify which version of Dancer you are using. Older versions did not support the content_type option to send_file(). If you are reading the latest docs on CPAN and expecting them to apply to an older version, there could be some confusion.
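If you are stuck on a Dancer version that ignores the content_type option, a minimal workaround sketch (assuming Dancer 1 syntax, reusing the illustrative route from the question) is to set the content type on the response yourself and return the raw PNG data:

get '/dynamic_image/:var1/:var2' => sub {
    my $im    = GD::Image->new(100,100);
    my $black = $im->colorAllocate(0,0,0);      # first allocation becomes the background
    my $white = $im->colorAllocate(255,255,255);
    $im->rectangle(10,10,90,90,$white);

    # Set the Content-Type header directly instead of relying on
    # send_file()'s content_type option.
    content_type 'image/png';
    return $im->png;
};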
It does not seem to be a Dancer problem; there are other environments where it happens too. See, for example, "Resource interpreted as Document but transferred with MIME type image".
After banging my head against this for a while, I think I can answer my own question. Firefox actually tipped me off to a bug in my own code. Basically, when accessing the dynamically created image in Firefox, it would display a page with the HTTP request info along with the PNG data, and I noticed that some debugging text was displayed on the page. It turns out that I had left a print statement in one of the loops that generated the image data (I had used it to verify the image was being built correctly), and that text made it into the "image" itself, which I assume caused Firefox and Chrome to freak out a bit. So this wasn't a Dancer or application bug, but a PEBKAC issue. Thanks for the input, everybody.

GtkWebKit, save HTML to PDF

For the last few days I have been searching for the best and shortest way to convert HTML files to PDF. Since I create my HTML files with a C program and view them through GtkWebKit, which uses Cairo, there should be some efficient and direct way to convert the content of the displayed page to PDF with C (I think).
But I can't find any example or direction to follow on the net.
So far, among various virtual printers, I have found only command-line tools which are written in Perl or which depend on Qt, which is not what I want.
I would appreciate any suggestion, example, or advice on how to get this functionality from GtkWebKit, or failing that, from some tiny C library.
As far as I can tell from reading the documentation (haven't tried it out myself):
Get the main frame with webkit_web_view_get_main_frame().
Create a GtkPrintOperation with gtk_print_operation_new().
Set the export-filename property on your print operation to the name of the PDF file you want to export to.
Print the frame with webkit_web_frame_print_full(). Make sure to pass GTK_PRINT_OPERATION_ACTION_EXPORT as the 'action' parameter.
I once wrote some code to accomplish that without opening a window. But then I ran into a problem using that code from multiple threads (e.g. in a web server). I did some research and figured out that GTK itself is single-threaded, so I made my code thread-safe by queuing the print operations to the main thread. Anyway, if it helps, check it out: https://github.com/gnudles/wkgtkprinter

Zend Framework under document root in subdirectory

I developed an application with Zend Framework and now I want to be able to place the app in a subdirectory of a document root.
e.g. http://www.example.com/myapp/
I have read quite a lot of documentation on how this could work, but all in all these solutions don't fit my needs. Is there a trivial way to do the subdirectory thing without adding the concrete path to every file which generates the pages?
There are some examples on the net where a basePath is set in the application environment, so there is a method call before each form creation which prepends the path to the link:
$form->setAction($this->_request->getBaseUrl() . $this->_helper->url('sign'));
This was from: http://johnmee.com/2008/11/zend-framework-quickstart-tutorial-deploy-to-a-subdirectory-instead-of-web-root/
But this only works for small examples; I have tons of forms, tons of views, and tons of scripts. I can't believe this (let's call it a hack :) ) is the only solution.
Any ideas?
You don't have to do anything special. See my tutorial at http://akrabat.com/Zend-framework-tutorial which is developed entirely within a sub-directory.
As they say on the web page:
I'm told this last issue has been lodged as a defect and is not necessary from releases 1.7 and beyond. The helper->url will henceforth prepend the baseUrl to its result.
So you should be fine. Do you actually use the $form->setAction() method on every form already? Because if you use it in combination with the url helper, the baseUrl will already be included.

How can I download Yahoo Groups?

I want to download some Yahoo Groups (files, photos, messages, memberlist) and I've found these scripts:
http://freshmeat.net/projects/grabyahoogroup/
http://sourceforge.net/project/showfiles.php?group_id=62034
I've downloaded ActivePerl and the needed modules from CPAN (nothing fancy; they're very easy to find). I've managed to install them, but when I run the script I get an error after it tells me that I've successfully logged in:
"Use of uninitialized value $cells in pattern match (m//) at yahoogroups_files.pl line 244, line 2."
I'm guessing that Yahoo changed the layout of the page or something, but I'm not able to update the script myself. I'm a newbie when it comes to Perl and to understanding the way Yahoo generates its pages; I only know some basic C++. I want to mention that I'm not lazy; I'll try to fix it myself, but I need your help: hints, advice, anything.
PS: I've contacted the author, but he isn't willing to update the scripts.
You would need knowledge in the following fields:
use of an HTML parser
HTTP knowledge (GET/POST/HEAD)
web scraping
I suggest you focus on WWW::Mechanize, since it's capable of all these things (and more).
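A rough sketch of that approach (the login URL, form number, and field names below are assumptions for illustration; inspect the real pages to find the correct ones):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Log in first (form and field names are assumed, not verified).
$mech->get('https://login.yahoo.com/');
$mech->submit_form(
    form_number => 1,
    fields      => { login => 'my_username', passwd => 'my_password' },
);

# Then fetch a group page and walk its links.
$mech->get('http://groups.yahoo.com/group/examplegroup/files/');
print $_->url_abs(), "\n" for $mech->links();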
EDIT: another solution (that doesn't need programming) is this: log in to Yahoo Groups with your browser, store the cookie, and then run wget, passing the stored cookie file as a parameter. This way you'll get the task accomplished very quickly.
Find your browser's cookies.txt file on your hard drive, and then call wget like this (if I remember the commands correctly):
wget --load-cookies path_to_cookie_file -r -w 60 website
The full man page can be found here
EDIT2: Another option is to use WebDriver to automate Firefox. You can use this article as a guide on how to accomplish this.
By the filename, I'm assuming you're using the Yahoo Group archiver found here: http://sourceforge.net/projects/grabyahoogroup/
I ran the files script against the SubEthaEdit group and it works great. All of the files downloaded without incident.
Looking at the code, it seems to barf while processing an HTML table in a while loop if $cells is empty.
Considering the code did work when I tested it, it's possible there's something going on with the listing of that group's files. You'll want to try outputting $content and figure out where and why the regular expression on line 243 isn't able to process that HTML.
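One simple way to do that (a sketch; $content is the variable already used in the script, and the dump filename is arbitrary) is to write the fetched page to a file just before the failing match, so you can open it in a browser or editor:

# Just before the regular expression around line 243:
open my $fh, '>', 'files_page_dump.html' or die "Cannot write dump: $!";
print {$fh} $content;
close $fh;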
EDIT: If you don't mind posting the group this is happening with, I'm sure I or someone else here can try it out and troubleshoot it. It's tough to pinpoint what's up when the issue can't be duplicated. Also, try the same group I did and see if it works out for you; if it does, there's certainly something up with the group you're trying.
Dunno if it will help you, but here's what I did to get the message-download working:
http://sourceforge.net/forum/forum.php?thread_id=3283915&forum_id=209170
(I only used message-download, I didn't look at file-download)
I was tinkering with this a while ago to back up my girlfriend's group messages and files from uni. While debugging the latest scripts, I found that there seems to be a bug in the group_domain declaration (there's also a group declaration bug that I found in yahoo2maildir.pl of the same project, see $request):
($group_domain) = $url =~ /\/\/(.*?groups.yahoo.com)\//;
In this case, I've overwritten the $request variable in the function sub download_folder(), from:
$request = GET "http://$group_domain/group/$group/files$sub_folder/";
to:
$request = GET "http://groups.yahoo.com/group/$user_group/files$sub_folder/";
grabyahoogroup works well in the latest edition, which can be found in the SVN repo:
http://grabyahoogroup.svn.sourceforge.net/viewvc/grabyahoogroup/trunk/yahoo_group/
The version at sourceforge.net/projects/grabyahoogroup/files/ HAS BUGS AND DID NOT WORK FOR ME.
I've been looking for a tool that collects messages/conversations from Yahoo! Groups. After struggling to make my own and searching everywhere on the internet, I finally found this tool that converts your Yahoo! Groups messages into MBOX format.
Download tools
Both of the following are Google Chrome extensions.
Chrome Extension to Download Members posted by Sam Hobbs (2015).
Chrome Application To Download Messages posted by Mark Fletcher (Jan 2016).
Plain string to Base64 binary data
At some point after September 16, 2010 (at least for me), the messages retrieved are no longer plain text but Base64-encoded binary data (ASCII). Using this swiss converter tool allows you to read the data as it is.
Sample content from the MBOX format
VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZy4=
Sample result after conversion
The quick brown fox jumps over the lazy dog.
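If you would rather decode the data locally than paste messages into an online converter, a small sketch with Perl's core MIME::Base64 module does the same conversion:

use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# Sample Base64 content as found in the MBOX file.
my $encoded = 'VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZy4=';

print decode_base64($encoded), "\n";   # prints: The quick brown fox jumps over the lazy dog.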
As of 2019/09, there is also:
https://github.com/csaftoiu/yahoo-groups-backup
.....