How to deal with nameless forms on websites? - perl

I would like to write a script that lets me use this website
http://proteinmodel.org/AS2TS/LGA/lga.html
(I need to use it a few hundred times, and I don't feel like doing that manually)
I have searched the internet for ways how this could be done using Perl, and I came across WWW::Mechanize, which seemed to be just what I was looking for. But now I have discovered that the form on that website which I want to use has no name - its declaration line simply reads
<FORM METHOD="POST" ACTION="./lga-form.cgi" ENCTYPE=multipart/form-data>
At first I tried simply not setting my WWW::Mechanize object's form_name property, which gave me this error message when I provided a value for the form's email address field:
Argument "my_email@address.com" isn't numeric in numeric gt (>) at /usr/share/perl5/WWW/Mechanize.pm line 1618.
I then tried setting form_name to '' and later ' ', but it was to no avail, I simply got this message:
There is no form named " " at ./automate_LGA.pl line 40
What way is there to deal with forms that have no names? It would be most helpful if someone on here could answer this question - even if the answer points away from using WWW::Mechanize, as I just want to get the job done, (more or less) no matter how.
Thanks a lot in advance!

An easy and more robust way is to use the $mech->form_with_fields() method from WWW::Mechanize to select the form you want based on the fields it contains.
Easier still, use the submit_form method with the with_fields option.
For instance, to locate a form which has fields named 'username' and 'password', complete them and submit the form, it's as easy as:
$mech->submit_form(
with_fields => { username => $username, password => $password }
);
Doing it this way has the advantage that if they shuffle their HTML around, changing the order of the forms in the HTML, or adding a new form before the one you're interested in, your code will continue to work.
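Applied to the form in the question, a minimal sketch might look like the following. The field names Address, Options and PDB_2 are taken from the asker's own code; the helper names lga_fields and submit_lga and the sample PDB id in the commented call are made up for illustration, and nothing here has been run against the live site.

```perl
use strict;
use warnings;

# Build the field values for one LGA run. The field names come from the
# question's code; the helper itself is hypothetical.
sub lga_fields {
    my ($email, $options, $pdb_id) = @_;
    return {
        Address => $email,
        Options => $options,
        PDB_2   => "${pdb_id}_1 ${pdb_id}_2",
    };
}

# Fetch the page, pick whichever form contains all the given fields,
# fill them in and submit. No form name or number is needed.
sub submit_lga {
    my ($fields) = @_;
    require WWW::Mechanize;    # loaded only when a request is actually made
    my $mech = WWW::Mechanize->new( autocheck => 1 );    # die on HTTP errors
    $mech->get('http://proteinmodel.org/AS2TS/LGA/lga.html');
    $mech->submit_form( with_fields => $fields );
    return $mech->content;
}

# Example call (commented out because it contacts the live server):
# print submit_lga( lga_fields('my_email@address.com', '-3 -o0 -d:4.0', '1abc') );
```

Keeping the field-building step separate from the request makes it easy to loop over a few hundred PDB ids and call submit_lga once per id.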

I don't know about WWW::Mechanize, but its Python equivalent, mechanize, gives you an array of forms that you can iterate even if you don't know their names.
Example (taken from its homepage):
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
for form in br.forms():
    print form
EDIT: searching in the docs of WWW::Mechanize I found the $mech->forms() method, that could be what you need. But since I don't know perl or WWW::Mechanize, I'll leave there my python answer.

Okay, I have found the answer. I can address the nameless form by its number (there's just one form on the webpage, so I guessed it would be number 1, and it worked). Here's part of my code:
my $lga = WWW::Mechanize->new();
my $address = 'my_email@address.com';
my $options = '-3 -o0 -d:4.0';
my $pdb_2 = "${pdb_id}_1 ${pdb_id}_2";
$lga->get('http://proteinmodel.org/AS2TS/LGA/lga.html');
$lga->success or die "LGA GET fail\n";
$lga->form_number(1);
$lga->field('Address', $address);
$lga->field('Options', $options);
$lga->field('PDB_2', $pdb_2);
$lga->submit();
$lga->success or die "LGA POST fail\n";

Related

Handle POST data sent as array

I have an html form which sends a hidden field and a radio button with the same name.
This allows people to submit the form without picking from the list (but records a zero answer).
When the user does select a radio button, the form posts BOTH the hidden value and the selected value.
I'd like to write a perl function to convert the POST data to a hash. The following works for standard text boxes etc.
#!/usr/bin/perl
use CGI qw(:standard);
sub GetForm {
    my %form;
    foreach my $p (param()) {
        $form{$p} = param($p);    # scalar context: first value only
    }
    return %form;
}
However when faced with two form inputs with the same name it just returns the first one (ie the hidden one)
I can see that the inputs are included in the POST header as an array but I don't know how to process them.
I'm working with legacy code so I can't change the form unfortunately!
Is there a way to do this?
I have an html form which sends a hidden field and a radio button with
the same name.
This allows people to submit the form without picking from the list
(but records a zero answer).
That's an odd approach. It would be easier to leave the hidden input out and treat the absence of the data as a zero answer.
However, if you want to stick to your approach, read the documentation for the CGI module.
Specifically, the documentation for param:
When calling param() If the parameter is multivalued (e.g. from multiple selections in a scrolling list), you can ask to receive an array. Otherwise the method will return the first value.
Thus:
$form{$p} = [ param($p) ];
However, you do seem to be reinventing the wheel. There is a built-in method to get a hash of all parameters (multivalued parameters come back as a single string joined with "\0"):
my $form = CGI->new->Vars;
That said, the documentation also says:
CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
So you should migrate away from this anyway.
Replace
$form{$p} = param($p); # Value of first field named $p
with
$form{$p} = ( multi_param($p) )[-1]; # Value of last field named $p
or
$form{$p} = ( grep length, multi_param($p) )[-1]; # Value of last field named $p
# that has a non-blank value
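To make the difference between "first value" and "all values" concrete, here is a small sketch that uses only core Perl and deliberately does not use CGI.pm. It hand-parses a URL-encoded body into a hash of array references and then picks the last value for a field, which is the same idea as the multi_param line above. The sub names parse_form, url_decode and last_value, and the field name rating, are invented for this example; real code should rely on the CGI module or one of its alternatives for parsing.

```perl
use strict;
use warnings;

# Decode one application/x-www-form-urlencoded component.
sub url_decode {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Collect every value for every field name, preserving submission order.
sub parse_form {
    my ($body) = @_;
    my %form;
    for my $pair (split /[&;]/, $body) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        push @{ $form{ url_decode($name) } }, url_decode($value);
    }
    return \%form;
}

# Last submitted value for a field: the radio button, when it follows
# the hidden input in the form. Returns undef for an unknown field.
sub last_value {
    my ($form, $name) = @_;
    my @values = @{ $form->{$name} || [] };
    return $values[-1];
}

# Hidden input sends 0, the selected radio button sends 4.
my $form = parse_form('rating=0&rating=4&comment=nice+film');
print join(',', @{ $form->{rating} }), "\n";    # prints 0,4
print last_value($form, 'rating'), "\n";        # prints 4
```

When the user picks nothing, only the hidden value is posted, so last_value falls back to the zero answer without any special casing.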

Perl-Mechanize posting hidden form value

I am attempting to create a perl script to test a web form for me. I am using mechanize for the automation and am having trouble finding documentation on the field method. I am using the field method to return the value of a hidden form field, and this is causing my post to fail. The problem is probably a simple oversight, but I am curious about $mech->field('name'); as it seems to be returning the hidden field's value for me.
Using perl v5.16.3 (w & w/o warnings)
$id = $test->field('MId');
print $id . " \n";
#This is printing the desired Id ,
#the post will not succeed as long as $id assigned this way.
print "This is where I am attempting to upload the images\n";
my $fileuploadresult;
$fileuploadresult = $mech->post($uploadURL,
'Content_Type' => "multipart/form-data",
'Content' => [
'myFile' => $file , 'MId' => $id
]
);
print $fileuploadresult->content() . "\n\n\n"; #If I set $id to something like 'test' it
#will work fine.
#I am using two agent because there are two POST's going on and they have to be sequential.
I was wondering why my submit failed when I grabbed the value of a form field. I just changed the 'field' method to 'value' and realized this fixed my problem. Sorry, noob question, I didn't look at the documentation enough. When field returned the value I assumed that was part of its functionality; I did not realize it also set the value to null. (Or as far as I can tell that is what it's doing.)

perl dom mechanize xpath

I'm trying to scrape some data from metacriti* website using mechanize, but I'm getting no output
Here's my code with a url example:
my $metaURL = "http://www.metacriti*.com/game/pc/dota-2";
my $mech = WWW::Mechanize->new();
$mech->get($metaURL) or die "unable to get $metaURL";
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($mech->content);
my @nodes = $tree->findnodes(q{//*[@id="main"]//a[contains(./@href, "user-reviews")]/span[@class="score_value"]});
print $_->string_value, "\n" foreach(@nodes); # text
The @nodes array seems to be empty. My XPath looks good to me, and since I'm using the same syntax in another working script, I really couldn't figure out what is wrong with this one...
Also, since this is just the beginning, maybe you can suggest another easy way to scrape/parse websites, if there's a better one :)
Thank you in advance
The HTML seems to be really bad: if you look at $tree->findnodes( '//div[@id="main"]')->[0]->as_HTML you get a very bare div:
<div class="col main_col" id="main"><div itemscope="itemscope" itemtype="http://schema.org/SoftwareApplication"></div></div>
this indeed does not contain any a, which explains the result you got.
I tried using tidy to pretty print the HTML, but it barfed on the file.
If you forget about the div and use q{//a[contains(./@href, "user-reviews")]/span[@class="score_value"]} you will get a result though, 7.9 in this case.

security for a simple php search form

I have a table that lists movies and I have incorporated a simple search function.
I have one text field in a form where a title or keyword can be entered and then the form is submitted.
php/mysql code that does the work is:
$find = $_POST['find'];
$find = mysql_real_escape_string($find);
$find = htmlspecialchars($find);
$sql = "SELECT * FROM tbl_buyerguide WHERE rel_date BETWEEN NOW() AND DATE_ADD(now(), INTERVAL 2 MONTH) AND title LIKE '%".$find."%' ORDER BY title";
where 'find' is the name of the text input in the search form.
This works well enough for the search functionality for the required purpose.
My question to all is:
Is the mysql_real_escape_string and htmlspecialchars enough to make my search form secure?
I have read all of the questions that I can find on stackoverflow about this, but I would really like someone in the know to just say to me "yes, that is all you need", or "no, you also need to take into account ...".
Thanks in Advance.
Cheers Al.
Remember the adage: Filter In, Escape Out.
You're not outputting the term there, so why are you escaping it for HTML purposes with htmlspecialchars()?
Instead, ONLY escape it for the database (you should be using prepared statements, but that's another point). So you should not be using htmlspecialchars there.
Instead, when you go to output the variable onto the HTML page, that's when you should escape it for HTML (again, using htmlspecialchars).
Right now, you're mixing database and html escaping, which is going to lead to neither being effective...
Yes it is enough to make it secure....you could always throw strip_tags() in there as well....
Although I would just do it in one line...instead of using three
$find = htmlspecialchars(mysql_real_escape_string($_POST['find']));
But to really make it secure and up to date, you should stop using mysql_* functions as they are deprecated, and will be removed in future releases of PHP....
You should instead switch to either mysqli_* or PDO, and implement prepared statements which handles security for you.
Example...in PDO
$db = new PDO('mysql:host=localhost;dbname=test', 'username', 'password');
$find = $_POST['find'];
$query = $db->prepare('SELECT * FROM tbl_buyerguide WHERE rel_date BETWEEN NOW() AND DATE_ADD(now(), INTERVAL 2 MONTH) AND title LIKE :like ORDER BY title');
$query->bindValue(':like', '%' . $find . '%');
$query->execute();

Dom-Processing with Perl-Mechanize: finalizing a little programme

I'm currently working on a little harvester, using this dataset of 2700 foundations. All the data are free to use with no limitations or copyright issues.
What I have so far: The harvesting task should be no problem if I take WWW::Mechanize — particularly for doing the form based search and selecting the individual entries. Hmm — I guess that the algorithm would be basically two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.
The outer loop would use the select() and the submit_form() functions on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
The inner loop through the results would use the follow link function to get to the actual entries using the following call.
$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/evs2000.*\?Id=\d+$/, n => $result_nbr);
This would forward our Mechanize browser to the entry page. Basically the URL regex looks for links that match the webgrab_path to Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which one of the results it should follow next.
If we have several result pages we would also use the same trick to traverse through the result pages. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it gives you some powerful DOM selection methods (using XPath).
Well the actual looping through the pages should be doable in a few lines of Perl (max. 20 lines — likely less).
But wait: the processing of the entry pages will then be the most complex part
of the script.
Approaches: In principle we could do the same algorithm with a single while loop
if we use the back() function smartly.
Can you give me a hint for the beginning (the processing of the entry pages), doing this in Perl with WWW::Mechanize?
Here's what I have:
GetThePage(
starting url
);
sub GetThePage {
    my $mech ...
    my @pages = ...
    while(@pages) {
        my $page = shift @pages;
        $mech->get( $page );
        push @pages, GetMorePages( $mech );
        SomethingImportant( $mech );
        SomethingXPATH( $mech );
    }
}
The question is how to find the DOM-paths.
Use Firebug, Opera Dragonfly, Chromium Developer tools.
Call the context menu on the indicated element to copy an XPath expression or CSS selector (useful for Web::Query) to clipboard.
Really you want to use Web::Scraper for this kind of thing.