DOM processing with Perl's WWW::Mechanize: finalizing a little program

I'm currently working on a little harvester, using this dataset of 2700 foundations. All the data are free to use with no limitations or copyright issues.
What I have so far: the harvesting task should be no problem with WWW::Mechanize, particularly for the form-based search and for selecting the individual entries. I guess the algorithm would basically be two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.
The outer loop would use the select() and submit_form() methods on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
The inner loop through the results would use the follow_link() method to get to the actual entries, using the following call.
$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/esv2000.*\?Id=\d+$/, n => $result_nbr);
This would forward our Mechanize browser to the entry page. Basically, the URL regex looks for links that match the webgrab_path ... Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which of the results it should follow next.
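As an alternative to counting results with n => $result_nbr, the inner loop could collect all matching links in one call to find_all_links(). The following is only a sketch under stated assumptions: the result page is a made-up local HTML file, and the entry Id 1309 is invented for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical stand-in for a search result page (Id 1309 is invented)
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><body>
<a href="index.html?lang=de&amp;webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=1308">A</a>
<a href="index.html?lang=de&amp;webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=1309">B</a>
<a href="about.html">About</a>
</body></html>
HTML
close $fh or die "close failed: $!";

my $mech = WWW::Mechanize->new;
$mech->get(URI::file->new_abs($path));

# Collect every link whose URL matches the entry pattern
my @entry_links = $mech->find_all_links(
    url_regex => qr/webgrab_path=http:\/\/esv2000.*\?Id=\d+$/,
);

my @ids;
for my $link (@entry_links) {
    my ($id) = $link->url =~ /Id=(\d+)$/;
    push @ids, $id;
    # In a real crawl one would fetch $link->url_abs here and extract the entry
}
print "entry ids: @ids\n";
```

Because the list of links is built before any entry is visited, this variant would not need back() or a running result counter.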
If there are several result pages, we would use the same trick to traverse them. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it provides powerful DOM selection methods using XPath.
The actual looping through the pages should be doable in a few lines of Perl (20 lines at most, likely fewer).
But wait: the processing of the entry pages will then be the most complex part of the script.
Approaches: in principle we could implement the same algorithm with a single while loop if we use the back() method smartly.
Can you give me a hint for the beginning, i.e. the processing of the entry pages, doing this with WWW::Mechanize?
Here's what I have:
GetThePage(
    starting url
);
sub GetThePage {
    my $mech ...
    my @pages = ...
    while (@pages) {
        my $page = shift @pages;
        $mech->get( $page );
        push @pages, GetMorePages( $mech );
        SomethingImportant( $mech );
        SomethingXPATH( $mech );
    }
}
The question is how to find the DOM-paths.

Use Firebug, Opera Dragonfly, or the Chromium Developer Tools.
Open the context menu on the element of interest to copy an XPath expression or CSS selector (useful for Web::Query) to the clipboard.
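Once an XPath expression has been found that way, it could be applied with XML::LibXML's HTML parser, which the question already mentions. A minimal sketch, assuming a made-up entry page; the markup, the table id "entry" and the field names are invented and will differ on the real site.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical entry page; in the real script this would be $mech->content
my $html = <<'HTML';
<html><body>
<h1>Example Foundation</h1>
<table id="entry">
<tr><th>Purpose</th><td>Education</td></tr>
<tr><th>Canton</th><td>Bern</td></tr>
</table>
</body></html>
HTML

# recover/suppress_errors let the parser cope with sloppy real-world HTML
my $doc = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

my %entry = ( name => $doc->findvalue('//h1') );

# One label/value pair per table row, queried relative to the row node
for my $row ($doc->findnodes('//table[@id="entry"]//tr')) {
    my $label = $row->findvalue('th');
    my $value = $row->findvalue('td');
    $entry{$label} = $value;
}

print "$_: $entry{$_}\n" for sort keys %entry;
```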

Really you want to use Web::Scraper for this kind of thing.
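For what it's worth, a Web::Scraper version might look roughly like this. It is a sketch only: the HTML and the selectors ('h1', 'ul.contacts li') are invented, since the real page structure is not shown here.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical entry page
my $html = <<'HTML';
<html><body>
<h1>Example Foundation</h1>
<ul class="contacts"><li>Bern</li><li>Zurich</li></ul>
</body></html>
HTML

# Declare what to extract: a single value and a list of values
my $entry = scraper {
    process 'h1',             name       => 'TEXT';
    process 'ul.contacts li', 'places[]' => 'TEXT';
};

# scrape() accepts a reference to an HTML string (or a URI object)
my $data = $entry->scrape(\$html);

print "$data->{name}: @{ $data->{places} }\n";
```

The declarative block keeps the selectors in one place, which should make them easier to adjust when the page layout changes.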

Related

Handle POST data sent as array

I have an html form which sends a hidden field and a radio button with the same name.
This allows people to submit the form without picking from the list (but records a zero answer).
When the user does select a radio button, the form posts BOTH the hidden value and the selected value.
I'd like to write a perl function to convert the POST data to a hash. The following works for standard text boxes etc.
#!/usr/bin/perl
use CGI qw(:standard);
sub GetForm {
    my %form;
    foreach my $p (param()) {
        $form{$p} = param($p); # scalar context: only the first value is kept
    }
    return %form;
}
However, when faced with two form inputs with the same name, it just returns the first one (i.e. the hidden one).
I can see that the inputs are included in the POST header as an array but I don't know how to process them.
I'm working with legacy code so I can't change the form unfortunately!
Is there a way to do this?
"I have an html form which sends a hidden field and a radio button with the same name. This allows people to submit the form without picking from the list (but records a zero answer)."
That's an odd approach. It would be easier to leave the hidden input out and treat the absence of the data as a zero answer.
However, if you want to stick to your approach, read the documentation for the CGI module.
Specifically, the documentation for param:
When calling param() If the parameter is multivalued (e.g. from multiple selections in a scrolling list), you can ask to receive an array. Otherwise the method will return the first value.
Thus:
$form{$p} = [ param($p) ];
However, you do seem to be reinventing the wheel. There is a built-in method to get a hash of all parameters:
my %form = CGI->new->Vars;
That said, the documentation also says:
CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.
So you should migrate away from this anyway.
Replace
$form{$p} = param($p); # Value of first field named $p
with
$form{$p} = ( multi_param($p) )[-1]; # Value of last field named $p
or
$form{$p} = ( grep length, multi_param($p) )[-1]; # Value of last field named $p
# that has a non-blank value
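Putting the suggestions above together, a self-contained sketch could look like this. Assumptions: the field names (choice, name) are invented, the submission is simulated with a query string instead of a real POST, and multi_param() needs CGI.pm 4.08 or later.

```perl
use strict;
use warnings;
use CGI;

# Simulated submission: hidden field choice=0 plus the selected radio choice=3
my $q = CGI->new('choice=0&choice=3&name=Bob');

# Build a hash that keeps the LAST value submitted under each name,
# so a selected radio button wins over the hidden default
sub get_form {
    my ($query) = @_;
    my %form;
    for my $p ($query->param) {
        my @values = $query->multi_param($p);
        $form{$p} = $values[-1];
    }
    return %form;
}

my %form = get_form($q);
print "$_ = $form{$_}\n" for sort keys %form;
```

If nothing is selected, only the hidden field arrives, so the last value is also the first and the zero answer is still recorded.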

In SAP scripts how do you define which data is sent to an element

I need to make some changes to a SAPscript. I have the program and form name:
Program: RBOSORDER01
Form: RBOSORDER02
I am looking to change some of the data shown in the form. I have debugged the program and I can see the call to write to the form, for example:
CALL FUNCTION 'WRITE_FORM'
  EXPORTING
    ELEMENT = 'ITEM_TEXT'
  EXCEPTIONS
    ELEMENT = 1
    WINDOW  = 2.
But how is the data passed between the program and the form? I cannot see the link between them. I was expecting to see a structure or a data element passed with 'ITEM_TEXT', with this data then printed at the element "ITEM_TEXT" in the form, but the link is not clear to me.
I have looked at the form also in SE71 and cannot see where you define this. Where is the link here, what am I missing?
This is in the form, so SE71 is what you need. You have to find the window where this element (ITEM_TEXT) is displayed first, then look for the element and see what is displayed inside. By default, the SAPscript form uses the global variables (structures, internal tables) of the print program directly (there are some other options as well, INCLUDE texts for example). So, for example, if a global variable gv_text is declared in the print program and it is displayed in the SAPscript, then it will look like &GV_TEXT& in the form.
You can also debug the SAPScript if you switch on debugging in SE71 (can be painful, if the form is big).
Function 'WRITE_FORM' just calls the EntryPoint of the Form (SE71 / RBOSORDER02) in this case with ELEMENT='ITEM_TEXT'.
So you will end up in MAIN-Window at:
/E ITEM_TEXT
/: INCLUDE &VBDPA-TDNAME& OBJECT VBBP ID 0001 PARAGRAPH IT
In this case you have to debug what "VBDPA-TDNAME" is at this time and then you will find its value with transaction "SO10" (Standard-Text)
The INCLUDE can be a complex text and can have its own format strings.
As Jozsef said before, VBDPA-TDNAME is defined globally in the print program (SE38n / RBOSORDER01).

Perl XPath statement with a conditional - is that possible?

This question has been rephrased. I am using the CPAN Perl modules WWW::Mechanize to navigate a website, HTML::TreeBuilder::XPath to capture the content, and xacobeo to test my XPath code on the HTML/XML. The goal is to call this Perl script from a PHP-based website and upload the scraped contents into a database. Therefore, if content is "missing" it still needs to be accounted for.
Below is a tested, reduced sample code depicting my challenge. Note:
This page is dynamically filled and contains various ITEMS output for different stores; a different number of Products will exist for each store. Those product listings may or may not have an itemized table underneath them.
The captured data has to be in arrays and the association of any itemized list (if it exists) to the Product listing has to be maintained.
Below, the example xml changes per store (as described above) but for brevity I only show one "type" of output. I realize that all data can be captured into one array and then regex used to decipher the content for the purpose of uploading it into a database. I am seeking a better knowledge of XPath to help streamline this (and future) solution(s).
<!DOCTYPE XHTML>
<table id="8jd9c_ITEMS">
<tr><th style="color:red">The Products we have in stock!</th></tr>
<tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
<tr><td>
<!--Table may or may not exist -->
<table>
<tr><td style="color:blue;text-indent:10px">Almonds</td></tr>
<tr><td style="color:blue;text-indent:10px">Cashews</td></tr>
<tr></tr>
</table>
</td></tr>
<tr><td><span id="Product_VEGGIES">We have veggies!</span></td></tr>
<tr><td>
<!--Table may or may not exist -->
<table>
<tr><td style="color:blue;text-indent:10px">Carrots</td></tr>
<tr><td style="color:blue;text-indent:10px">Celery</td></tr>
<tr></tr>
</table>
</td></tr>
<tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
<!--In this case, the table does not exist -->
</table>
An XPath statement of:
'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text()'
would find:
We have nuts!
We have veggies!
We have booze!
And an XPath statement of:
'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'
would find:
Almonds
Cashews
Carrots
Celery
The two XPath statements can be combined:
'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text() | //table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'
To find:
We have nuts!
Almonds
Cashews
We have veggies!
Carrots
Celery
We have booze!
Again, the above array can be deciphered (in the real code) for its product-to-list association using regex. But can the array be built using XPath in a manner that would keep that association?
For example (pseudo-speak, this does not work):
'//table[contains(@id, "ITEMS")]/tr[position()>1]/td/span/text() |
if exists('//table[contains(@id, "ITEMS")]/tr[position() >1]/table))
then ("NoTable") else ("TableRef") |
Save this result into @TableRef ('//table[contains(@id, "ITEMS")]/tr[position() >1]/table/tr/td/text()')'
It is not possible to build multi-dimensional arrays (in the traditional sense) in Perl (see perldoc perlref), but hopefully a solution similar to the above could create something like:
@ITEMS[0] => We have nuts!
@ITEMS[1] => nutsREF <-- say, the last word of the span value + REF
@ITEMS[2] => We have veggies!
@ITEMS[3] => veggiesREF <-- say, the last word of the span value + REF
@ITEMS[4] => We have booze!
@ITEMS[5] => NoTable <-- value accounts for the missing info
@nutsREF[0] => Almonds
@nutsREF[1] => Cashews
@veggiesREF[0] => Carrots
@veggiesREF[1] => Celery
In the real code the Products are known, so my @veggiesREF and my @nutsREF can be defined in anticipation of the XPath output.
I realize the XPath if/else/then functionality is in the XPath 2.0 version. I am on a ubuntu system and working locally, but I am still not clear on whether my apache2 server is using it or the 1.0 version. How do I check that?
Finally, if you can show how to call a Perl script from a PHP form submit AND how to pass back a Perl array to the calling PHP function, then that would go a long way to getting the bounty. :)
Thanks!
FINAL EDIT:
Comments immediately below this post were directed at an initial post that was too vague. The subsequent re-post (and bounty) was responded to by ikegami with a very creative use which solved the pseudo problem, but was proving difficult for me to grasp and reuse in my real application - which entails multiple uses on various html pages. In about the 18th comment in our dialog I finally discovered his meaning and use of ($cat) - an undocumented Perl syntax that he used. For new readers, understanding that syntax makes it possible to understand (and reformat) his intelligent solution to the problem. His post certainly meets the basic requirements sought in the OP but does not use HTML::TreeBuilder::XPath to do it.
jpalecek uses the HTML::TreeBuilder::XPath but does not place the captured data into arrays for passing back to a PHP function and uploading into a database.
I have learned from both responders and hope this post helps others who are new to Perl, like myself. Any final contributions would be greatly appreciated.
If I were to guess, your question is: "How do I get the following from the provided input?"
my $categorized_items = {
    'We have nuts!'    => [ 'Almonds', 'Cashews' ],
    'We have veggies!' => [ 'Carrots', 'Celery' ],
    'We have booze!'   => [ ],
};
If so, here's how I'd do it:
use strict;
use warnings;
use Data::Dumper qw( Dumper );
use XML::LibXML qw( );
my $root = XML::LibXML->load_xml(IO => \*DATA)->documentElement;
my %cat_items;
# Every row that contains a <span> is a category heading
for my $cat_tr ($root->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]')) {
    # List assignment: take the first (and only) span text
    my ($cat) = map $_->textContent(),
        $cat_tr->findnodes('td/span');
    # Items live in the row right after the heading, if that row holds a table
    my @items = map $_->textContent(),
        $cat_tr->findnodes('following-sibling::tr[position()=1]/td/table/tr/td');
    $cat_items{$cat} = \@items;
}
print(Dumper(\%cat_items));
__DATA__
...xml...
PS - What you have there isn't valid HTML.
A TABLE element cannot be placed directly inside a TR element. There's a missing TD element.
A TR element cannot be empty. It must have at least one TH or TD element.
How to ascertain that something exists before running a query, e.g. if //p[@class='red'] exists, then return //table:
/.[//p[@class='red']]//table
x[3 and 4 and 5]: 3 and 4 and 5 is a boolean expression that yields true. Therefore it will get you all x elements. For the 3rd, 4th and 5th you want
x[position() >= 3 and position() <= 5]
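The difference between the two predicates can be checked with a few lines of XML::LibXML. This is just a sketch; the element names r and x and the six numbered children are made up for the demonstration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Six <x> children, numbered so the selection is easy to read
my $doc = XML::LibXML->load_xml(
    string => '<r><x>1</x><x>2</x><x>3</x><x>4</x><x>5</x><x>6</x></r>',
);

# "3 and 4 and 5" is a boolean (true), so the predicate filters nothing
my @all = map { $_->textContent }
    $doc->findnodes('/r/x[3 and 4 and 5]');

# Explicit position tests select the 3rd to 5th element
my @range = map { $_->textContent }
    $doc->findnodes('/r/x[position() >= 3 and position() <= 5]');

print "boolean predicate:  @all\n";    # 1 2 3 4 5 6
print "position predicate: @range\n";  # 3 4 5
```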
Answer for the edited question:
Why don't you use XML::XPathEngine with multiple queries?
my $xp = XML::XPathEngine->new;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse(something);
Then, you can query:
my $shops = $xp->findnodes('//table[contains(@id, "ITEMS")]/tr[position() >1]/td[span]', $tree);
for ($shops->get_nodelist) {
    print "Name of shop is ".$xp->findvalue('span/text()', $_)."\n"; # <- query relative to $_
    print "The shop sells:\n". join("\n", $xp->findvalues('parent::*/following-sibling::tr[1][not(span)]/td/table/tr/td', $_));
}
This does the same thing as @ikegami's answer (XML::XPathEngine is used by HTML::TreeBuilder::XPath). BTW, if the shops can have more lines with products after them, this should be updated.
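For readers who want both things at once (HTML::TreeBuilder::XPath as the parser, and the results kept in arrays per product), the two answers could be combined roughly as follows. This is a sketch against a shortened copy of the sample markup from the question, not tested against the real pages; the hash name %items_for is made up.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Shortened copy of the sample from the question
my $html = <<'HTML';
<table id="8jd9c_ITEMS">
<tr><th>The Products we have in stock!</th></tr>
<tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
<tr><td><table>
<tr><td>Almonds</td></tr>
<tr><td>Cashews</td></tr>
</table></td></tr>
<tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my %items_for;   # product heading => array ref of items (empty if no table)
for my $row ($tree->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]')) {
    my ($name) = map { $_->as_text } $row->findnodes('td/span');
    my @items  = map { $_->as_text }
        $row->findnodes('following-sibling::tr[1]/td/table/tr/td');
    $items_for{$name} = \@items;
}
$tree->delete;   # free the parse tree

for my $name (sort keys %items_for) {
    my @items = @{ $items_for{$name} };
    print "$name => ", (@items ? join(', ', @items) : 'NoTable'), "\n";
}
```

A missing table simply yields an empty array, so the "NoTable" case from the question is accounted for without a conditional in the XPath itself.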

How to process a simple loop in Perl's WWW::Mechanize?

Especially interesting for me as a PHP/Perl beginner is this site in Switzerland:
http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de&webgrab_path=http://esv2000.edi.admin.ch/d/entry.asp?Id=1308
Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it.
What we have so far: the harvesting task should be no problem with WWW::Mechanize, particularly for the form-based search and for selecting the individual entries. I guess the algorithm would basically be two nested loops: the outer loop runs the form-based search, the inner loop processes the search results.
The outer loop would use the select() and submit_form() methods on the second search form on the page. Can we use DOM processing here? And how can we get the selection values?
The inner loop through the results would use the follow_link() method to get to the actual entries, using the following call.
$mech->follow_link(url_regex => qr/webgrab_path=http:\/\/esv2000.*\?Id=\d+$/, n => $result_nbr);
This would forward our Mechanize browser to the entry page. Basically, the URL regex looks for links that match the webgrab_path ... Id pattern, which is unique for each database entry. The $result_nbr variable tells Mechanize which of the results it should follow next.
If there are several result pages, we would use the same trick to traverse them. For the semantic extraction of the entry information, we could parse the content of the actual entries with XML::LibXML's HTML parser (which works fine on this page), because it provides powerful DOM selection methods using XPath.
The actual looping through the pages should be doable in a few lines of Perl (20 lines at most, likely fewer).
But wait: the processing of the entry pages will then be the most complex part of the script.
Approaches: in principle we could implement the same algorithm with a single while loop if we use the back() method smartly.
Can you give me a hint for the beginning, i.e. the processing of the entry pages, doing this with WWW::Mechanize?
"Which has a dataset of 2700 foundations. All the data are free to use with no limitations copyrights on it."
Not true. See http://perlmonks.org/?node_id=905767
"The data is copyrighted even though it is made available freely: "Downloading or copying of texts, illustrations, photos or any other data does not entail any transfer of rights on the content." (and again, in German, as you've been scraping some other German list to spam before)."

How to deal with nameless forms on websites?

I would like to write a script that lets me use this website
http://proteinmodel.org/AS2TS/LGA/lga.html
(I need to use it a few hundred times, and I don't feel like doing that manually)
I have searched the internet for ways how this could be done using Perl, and I came across WWW::Mechanize, which seemed to be just what I was looking for. But now I have discovered that the form on that website which I want to use has no name - its declaration line simply reads
<FORM METHOD="POST" ACTION="./lga-form.cgi" ENCTYPE=multipart/form-data>
At first I tried simply not setting my WWW::Mechanize object's form_name property, which gave me this error message when I provided a value for the form's email address field:
Argument "my_email@address.com" isn't numeric in numeric gt (>) at /usr/share/perl5/WWW/Mechanize.pm line 1618.
I then tried setting form_name to '' and later ' ', but it was to no avail, I simply got this message:
There is no form named " " at ./automate_LGA.pl line 40
What way is there to deal with forms that have no names? It would be most helpful if someone on here could answer this question - even if the answer points away from using WWW::Mechanize, as I just want to get the job done, (more or less) no matter how.
Thanks a lot in advance!
An easy and more robust way is to use the $mech->form_with_fields() method from WWW::Mechanize to select the form you want based on the fields it contains.
Easier still, use the submit_form method with the with_fields option.
For instance, to locate a form which has fields named 'username' and 'password', complete them and submit the form, it's as easy as:
$mech->submit_form(
with_fields => { username => $username, password => $password }
);
Doing it this way has the advantage that if they shuffle their HTML around, changing the order of the forms in the HTML, or adding a new form before the one you're interested in, your code will continue to work.
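Applied to a nameless form like the one in the question, selecting by field names might look like the sketch below. Assumptions: the page is replaced by a local copy written to a temp file, the field names Address, Options and PDB_2 are taken from the asker's own code further down, and the input types are guessed. Nothing is submitted.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical local copy of a page with one nameless form
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><body>
<FORM METHOD="POST" ACTION="./lga-form.cgi" ENCTYPE="multipart/form-data">
<input type="text" name="Address">
<input type="text" name="Options">
<textarea name="PDB_2"></textarea>
<input type="submit" value="Run">
</FORM>
</body></html>
HTML
close $fh or die "close failed: $!";

my $mech = WWW::Mechanize->new;
$mech->get(URI::file->new_abs($path));

# Select the form by the fields it must contain; no form name is needed
my $form = $mech->form_with_fields('Address', 'Options')
    or die "no form with those fields\n";

$mech->field(Address => 'my_email@address.com');
$mech->field(Options => '-3 -o0 -d:4.0');

print $form->method, ' ', $form->action, "\n";
# $mech->submit would send the form; it is left out of this offline sketch
```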
I don't know about WWW::Mechanize, but its Python equivalent, mechanize, gives you an array of forms that you can iterate even if you don't know their names.
Example (taken from its homepage):
import mechanize
br = mechanize.Browser()
br.open("http://www.example.com/")
for form in br.forms():
    print(form)
EDIT: searching the docs of WWW::Mechanize I found the $mech->forms() method, which could be what you need. But since I don't know Perl or WWW::Mechanize, I'll leave my Python answer here.
Okay, I have found the answer. I can address the nameless form by its number (there's just one form on the webpage, so I guessed it would be number 1, and it worked). Here's part of my code:
my $lga = WWW::Mechanize->new();
my $address = 'my_email@address.com';
my $options = '-3 -o0 -d:4.0';
my $pdb_2 = "${pdb_id}_1 ${pdb_id}_2";
$lga->get('http://proteinmodel.org/AS2TS/LGA/lga.html');
$lga->success or die "LGA GET fail\n";
$lga->form_number(1);
$lga->field('Address', $address);
$lga->field('Options', $options);
$lga->field('PDB_2', $pdb_2);
$lga->submit();
$lga->success or die "LGA POST fail\n";