I am trying to run the Selenium driver with Perl bindings, and due to the lack of examples and documentation, I am running into some roadblocks. I have figured out how to do some basic things, but I seem to be running into some issues with other simple things like validating the text on a page using Remote::Driver package.
If I try to do something like this:
$sel->get("https://www.yahoo.com/" );
$ret = $sel->find_element("//div[contains( text(),'Thursday, April 26, 2012')]");
I get a message back that the element couldn't be found. I am using xpath because the driver package doesn't appear to have a sub specific for finding text.. at least not that I've found.
If my xpath setup is wrong or if someone knows a better way, that would be extremely helpful. I'm having problems with some button clicking too.. but this seems like it should be easier and is bugging me.
Finding text on a web page and comparing that text to some "known good value" using Selenium::Remote::Driver can be implemented as follows:
File: SomeWebApp.pm
package SomeWebApp;
sub get_text_present {
my $self = shift;
my $target = shift;
my $locator = shift;
my $text = $self->{driver}->find_element($target, $locator)->get_text();
return $text;
}
Somewhere in your test script: test.pl
my $text = $some_web_app->get_text_present("MainContent_RequiredFieldValidator6", "id");
The above finds the element identified by $target using the locating scheme identified by $locator and stores it in the variable $text. You can then use that to compare / validate as required / needed.
https is a tad slower loading than http. Although WebDriver is pretty good about waiting until it's figured out that the requested page is fully loaded, maybe you need to give it a little help here. Add a sleep(2); after the get() call and see if it works. If it does, try cutting down to 1 second. You can also do a get_title call to see if you've loaded the page you think you have.
The other possibility is that your text target isn't quite exactly the same as what's on the page. You could try looking first for one word, such as "April", and see if you get a hit, and then expand until you find the mismatch (e.g., does this string actually have a newline or break within it? How about an HTML entity such as a non-breaking space?). Also, you are looking for that bit of text anywhere under a div (all child text supposedly is concatenated, and then the search done). That would more likely cast too wide a net than not get anything at all, but it's worth knowing.
Related
This is the first time I'm encountering GetLayoutObjectAttribute and I am having serious issues with it. My variable $web won't set. I think it's because PD_WebV isn't the right object name to refer to, but I don't know how to find the right object name. I can't find the objects name when I hit Edit Layout, so does anyone know how to find an layout objects name?
Loop
Pause/Resume Script [Duration (seconds): 1]
Set Variable[$Web; Value: GetLayoutObjectAttribute("PD_WebV";"content")]
If[$Web="done"]
#execute if statements
After Edit:
After some troubleshooting, I found out that PD_WebV is the right object name to refer and it's refered to correctly, so my new question is why doesn't the script go to the line If[$Web="done"] and how could I fix it? Ss my If statement not evaluating something it should be? Is my $web variable never set to done or is the issue something completely different? Would the problem possibly have to do with my WebDirect sharing settings? Any guidance would help. Thanks.
After, After Edit:
So now that my application is getting past Set Variable[$Web; Value: GetLayoutObjectAttribute("PD_WebV";"content")], the variable $web only equals <HTML></HTML>. Does anyone know a way, without using javascript, to test the inside of the html tags?
Also, I printed the bounds of the webViewer PD_WebV that I can't locate on the layout but am referring to in the script. The bounds that are printed are different each time I run the script. Is the usual or unusual? My source is also about:blank so it doesn't look like I'm sourcing from a URL
Is my $web variable never set to done or is the issue something
completely different?
If you're doing:
Set Variable[$Web; Value: GetLayoutObjectAttribute("PD_WebV";"content")]
then the only time
$Web="done"
will return true is when the web page loaded into your web viewer contains exactly the string "done" (in practical terms, that's never).
I have already suggested in a comment that you test for:
PatternCount ( $webpage ; "</html>" )
This is assuming you want the subsequent steps to execute only after the page has finished loading. The entire script would look something like this:
Loop
Pause/Resume Script [Duration (seconds): 1]
Set Variable[$Web; Value: GetLayoutObjectAttribute("PD_WebV";"content")]
Exit Loop If [ PatternCount ( $webpage ; "</html>" ) ]
End Loop
# execute statements
You might also want to add a counter to the loop and exit the script after n trials.
Ah, I reread your question.
To set the object name for your webviewer so that the GetLayoutObjectAttribute function works you need to set it in the Name field in the inspector when you have the webviewer selected.
e.g.:
After that your variable should populate.
Be aware
What it will populate with will be all of the html from the browser, i.e. not a boolean true/false liek your conditional suggests.
I'm not sure exactly what you're trying to accomplish, but to be able to determine a result from your web viewer you'll need to either parse the HTML to see if it's content contains what you're looking for or within the code you're setting the webviewer with, fire a javascript function that calls back to the FileMaker file using a FileMaker url.
I believe this may be a bug in the module I am using, or I am just completely overlooking something.
My code is this:
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use CAM::PDF::Annot;
sub main()
{
my $pdf = CAM::PDF::Annot->new( 'b.pdf' );
my $otherDoc = CAM::PDF::Annot->new( 'b_an.pdf' );
my $page = 1;
my %refs;
my #list = #{$pdf->getAnnotations($page)};
for my $annotRef (#list){
$otherDoc->appendAnnotation( $page, $pdf, $annotRef, \%refs);
}
$otherDoc->output('pdf_merged.pdf');
}
exit main;
This code was taken almost directly from the synopsis found on the module's CPAN page: http://metacpan.org/pod/CAM::PDF::Annot
The problem comes when I run the script using TWO pdf's with annotations. Using two pdf's without annotations runs. Using one pdf with annotations, and one pdf without annotations, runs. Only when both pdf's have annotations does it error.
The error is: "Can't use string ("46") as an ARRAY ref while "strict refs" in use at /usr/opt/perl5/lib/site_perl/5.10.1/CAM/PDF/Annot.pm line 195"
Line 195 of Annot.pm is:
push #{$annots->{value}}, $pupRef;
Annot.pm is inside the CAM::PDF::Annot module.
Any guidance in fixing this would be greatly appreciated!
P.S. In the error, "string ("x")", x is always a number, and seems to change depending on the pdf and the annotations within the pdf.
And I will try to add any other information that you need to help figure this out!
Whenever I have a problem with a CPAN module, I go to its webpage to try and assess its quality and see if any bugs have already been reported.
http://search.cpan.org/~donatoaz/CAM-PDF-Annot-0.06 shows the following suspicious results:
CPAN Testers PASS (2) FAIL (168) NA (49)
It is surprising that you were able to install the module. No one has reported bugs, but there is clearly a major problem with the code. It seems the author is either unaware of the tester reports (which have been sent to his CPAN email address for more than a year), or has stopped maintaining it.
You could submit a bug report, so at least others will be aware of your issue.
I realize this does not answer your question of how to fix the problem, but even if you do identify a fix, the author may not apply it (in which case, someone could start the process of becoming a co-maintaner).
I'm pretty new to NLP in general, but getting really good at Perl, and I was wondering what kind of powerful NLP modules are out there. Basically, I have a file with a bunch of paragraphs, and some of them are people's biographies. So, first I need to look for a person's name, and that helps with the rest of the process later.
So I was roughly starting with something like this:
foreach $PPid (0 .. $PPscalar) {
$paragraph = #PP[$PPid];
if ($paragraph =~ /^(\w+ \w\. \w+|\w+ \w+)( also|)( has served| served| worked| joined| currently serves| has| was| is|, )/){
$possibleName = $1;
$badName = 0;
foreach $piece (#pieces){
if ($possibleName =~ /$piece/){
$badName = 1;
}
}
if ($badName == 0){
push #namePile, $possibleName;
}
}
}
Because most of the names start at the beginning of the paragraphs. And then I'm looking for keywords that denote action or possession, but right now, that picks up extra junk that is not a name. There has to be a module to do this, right?
Extracting names from data is hard. There are a variety of solutions. For named entity extraction you've got the following
The naive approach. I remember looking at this and being unimpressed with the output.
The dictionary approach. I've used this, but lots of false negatives, and I'm not too fond of the code underneath it.
An open source binary with a perl interface (not recommended, and I'm the author of this cpan library - and setting it up is fiddly too).
Best solution is the propietary web service with the Net::Calais perl wrapper
Net::Calais is by far the best bet for speed and accuracy. Go with the Stanford library if you need the underlying implementation to be open source.
Have you tried searching CPAN?
http://search.cpan.org/search?query=NLP&mode=all
I also tried searching for "Natural Language" and found the following that you might be interested in:
Lingua::EN::Tagger
Also, if you must roll your own, with regards to NLP, you want to check out Regexp::Grammars. This is the successor to Parse::RecDesent.
I don't know of any Perl modules which do processing of English in order to break it into parts of speech. I expect there are libraries out there which do that, in C or C++ or something, so if you don't find a good answer, maybe you can broaden your search.
One easy hack is to check for two words which are both capitalized:
if (/[A-Z][a-z]+\s+[A-Z][a-z]/) { ...
or check for titles:
if (/(?:Mr|Mrs|Ms|Dr)\.?\s+[A-Z][a-z]+/) { ...
I'm playing around with Win32::IE:Mechanize to try to access some authentication-required sites automatically. So far I've achieved moderate success, for example, I can automatically log in to my yahoo mailbox. But I find many sites are using some kind of image verification mechanism, which is possibly called CAPTCHA. I can do nothing to them. But one of the sites I'm trying to auto access is using a plain-text verification code. It is comnposed of four digits, selectable and copyable. But they're not in the source file which can be fetched using
$mech->content;
I searched for the keyword that appears on the webpage but not in the source file through all the files in the Temporary Internet Files but still can't find it.
Any idea what's going on? I was suspecting that the verification code was somehow hidden in some cookie file but I can't seem to find it :(
The following is the code that completes all the fields requirements except for the verification code:
use warnings;
use Win32::IE::Mechanize;
my $url = "http://www.zjsmap.com/smap/smap_login.jsp";
my $eccode = "myeccode";
my $username = "myaccountname";
my $password = "mypassword";
my $verify = "I can't figure out how to let the script get the code yet"
my $mech = Win32::IE::Mechanize->new(visible=>1);
$mech->get($url);
sleep(1); #avoids undefined value error
$mech->form_name("BaseForm");
$mech->field(ECCODE => $eccode);
$mech->field(MEMBERACCOUNT => $username);
$mech->field(PASSWORD => $password);
$mech->field(verify => $verify);
$mech->click();
Like always any suggestions/comments would be greatly appreciated :)
UPDATE
I've figured out a not-so-smart way to solve this problem. Please comment on my own asnwer posted below. Thanks like always :)
This is the reason why they are there. To stop program like yours to do automated stuff ;-)
A CAPTCHA or Captcha is a type of
challenge-response test used in
computing to ensure that the response
is not generated by a computer.
This appears to be an irrelevant number. The page uses it in 3 places: generating it; displaying it on the form next to the input field for it; and checking for the input value being equal to the random number chosen. That is, it is a client-only check. Still, if you disable javascript it looks like, I'm guessing, important cookies don't get set. If you can execute JavaScript in the context of the page (you should be able to with a get method call and a javascript URI), you could change the value of random_number to f.e. 42 and fill that in on the form.
The code is inserted by JavaScript – disable JS, reload the page and see it disappear. You have to hunt through the JS code to get an idea where it comes from and how to replicate it.
Thanks to james2vegas, zoul and Shoban.
I've finally figured out on my own a not-so-smart but at-least-workable way to solve the problem I described here. I'd like to share it here. I think the approach suggested by #james2vegas is probably much better...but anyway I'm learning along the way.
My approach is this:
Although the verification code is not in the source file but since it is still selectable and copyable, I can let my script copy everything in the login page and then extract the verification code.
To do this, I use the sendkeys functions in the Win32::Guitest module to do "Select All" and "Copy" to the login page.
Then I use Win32:Clipboard to get the clipboard content and then Regexp to extract the code. Something like this:
$verify = Win32::Clipboard::GetText();
$verify =~ s/.* (\d{4}).*/$1/msg;
A few thoughts:
The random number is generated by something like this in Perl
my $random_number = int(rand(8999)) + 1000; #var random_number = rand(1000,10000);
And then it checks if $verify == $random_number. I don't know how to catch the value of one-session-only $random_number. I think it is stored somewhere in the memory. If I can capture the value directly then I wouldn't have gone to so much trouble of using this and that extra module.
I'm trying to read out the POST-Data that was sent from a form in a page to my Perl Script. I googled and found out that:
read(STDIN, $param_string, $ENV{'CONTENT_LENGTH'})
reads out the whole Data-String with and writes the whole string to $param_string in the form of
Param1=Value1&Param2=Value2&Param3=Value3
by spliting it at the right places I get the necessary Data.
But I wonder why my $param_string is empty.
When I try the whole thing with GET:
$param_string = $ENV{'QUERY_STRING'};
everything works fine. Does anybody have an idea?
There absolutely no real reason for someone at your level to want to hand parse CGI requests.
Please use CGI::Simple or CGI.pm.
CGI.pm has a lot of baggage (HTML generation, function oriented interface) which makes CGI::Simple preferable.
Using any CGI processing module on CPAN is better than trying to write CGI processing code from scratch.
See parse_query_string in CGI::Simple for a way of accessing parameters passed using the query string when processing a form that is POSTed to your script.
If you want to learn how to do it right, you can read the source code of either module. Reading through the CGI.pm CHANGES file is also instructive.
If you are able to retrieve GET-data but not able to retrieve POST-data, most likely you forgot to change form method from to be post. You can check your submit method by using this condition in if statement:
if ($ENV{'REQUEST_METHOD'} eq "POST"){
read(STDIN, $param_string, $ENV{'CONTENT_LENGTH'});
}else {
$param_string = $ENV{'QUERY_STRING'};
}
Under mod_perl 2, Apache2::Request works for me.