feedpp and session ID - perl

we are using Perl and cpan Modul FeedPP to parse RSS Feeds.
The Perl script runs trough the different items of the RSS Feeds and save the link to the database, liket his:
my $response = $ua->get($url);
if ($response->is_success) {
my $feed = XML::FeedPP->new( $response->content, -type => 'string' );
foreach my $item ( $feed->get_item() ) {
my $link = $item->link();
[...]
$url contains the URL to an RSS Feed, like http://my.domain/RSS/feeds.xml
in this case, $item->link() will contain links to the RSS article, like http://my.domain/topic/myarticle.html
The Problem is, some webservers (which provides the RSS feeds) does an HTTP refer in order to add an session ID to the URL, like this: http://my.domain/RSS/feeds.xml;jsessionid=4C989B1DB91D706C3E46B6E30427D5CD.
The strange think is, that feedPP seams to add this session-ID to the link of every item. So $item->link() contain links to the RSS article, like http://my.domain/topic/myarticle.html;jsessionid=4C989B1DB91D706C3E46B6E30427D5CD
Even if the original link does not contain an session ID.
Is there a way to turn of that behavior of feedPP??
Thank you for any kind of help.

I took a look through http://metacpan.org/pod/XML::FeedPP but didn't see any way to turn have the link() method trim those session IDs for you. (I'm using XML::FeedPP in one of my scripts and the site I happen to be parsing doesn't use session IDs.)
So I think the answer is no, not currently. You could try contacting the author or filing a bug.

IMHO, the behavior is correct: uri components which follow a semi-colon are defined part of the path (configuration parameter for interpretation), so when the uri is used to make a relative url into an absolute uri it needs to be copied as well.
You expect compatible behavior with '&' parameters, but they are not equal.
https://rt.cpan.org/Ticket/Display.html?id=73895

Related

Mojolicious not following redirection from webarchive.org

I'm using Mojolicious DOM and UserAgent to get the source of a page from Webarchive.org, parse it, and import it into a Dotclear database (using webarchive as a backup).
In the source, there are "Previous" and "Next" links allowing to get to the different posts originaly made on the blog.
The perl script I have developped is supposed to run through those links to import all pages of this blog's snapshot.
It first get the source of the first post of the blog, parses it, put the result in a local DB, and gets the link under "Next" to do that same thing on the next post, until there is no more "Next" posts.
As for the bases.
But the trick is that the link I get from the source is not the link Webarchive has.
Webarchive's links to snapshots go like this :
http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost
The big number between "web" and the original URL is (i guess) the date the snapshot was made. The trick is that it changes at each snapshot, and although it may appear on one post, the next post have been snapshoted on anotherdate. So the URL wont fit.
When I click on the link i get from the source, it brings me to webarchive.org, which automaticaly searches on the page i pass, and redirect me to it.
But when I try to get the source via the get() function of Mojolicious, it just gets the "Page not found" page of webarchive.
So, there is my question : is there a way to let mojolicious follow the redirection of webarchive ? I activated max_redirects(5) on my UserAgent, but still does the same.
Here is my code :
sub main{
my ($url) = #_;
my $ua = Mojo::UserAgent->new;
$ua = $ua->max_redirects(5);
my $dom = $ua->get($url)->res->dom;
#...Treatment and parsing of the source ...
return $nextUrl;
}
my $nextUrl="http://web.archive.org/web/20131012182412/http://www.mytarget.com/post?mypost";
my $secondUrl;
while ($nextUrl){
$secondUrl = main($nextUrl);
$nextUrl = $secondUrl;
}
Thanks in advance...
I've finally found a way around.
I use this piece of code to follow the URL and get the finally reached URL :
use LWP::UserAgent qw();
my $ua = LWP::UserAgent->new;
my $ret = $ua->get($url);
$url = $ret->request->uri ."";
print "URL returned: ".$url."\n";
Then I use that URL to get the source code and fetch it.

How to post a file in grails

I am trying to use HTTP to POST a file to an outside API from within a grails service. I've installed the rest plugin and I'm using code like the following:
def theFile = new File("/tmp/blah.txt")
def postBody = [myFile: theFile, foo:'bar']
withHttp(uri: "http://picard:8080/breeze/project/acceptFile") {
def html = post(body: postBody, requestContentType: URLENC)
}
The post works, however, the 'myFile' param appears to be a string rather than an actual file. I have not had any success trying to google for things like "how to post a file in grails" since most of the results end up dealing with handling an uploaded file from a form.
I think I'm using the right requestContentType, but I might have missed something in the documentation.
POSTing a file is not as simple as what you have included in your question (sadly). Also, it depends on what the API you are calling is expecting, e.g. some API expect files as base64 encoded text, while others accept them as mime-multipart.
Since you are using the rest plugin, as far as I can recall it uses the Apache HttpClient, I think this link should provide enough info to get you started (assuming you are dealing with mime-multipart). It shouldn't be too hard to change it around to work with your API and perhaps make it a bit 'groovy-ier'

How to read a web page content which may itself be redirected to another url?

I'm using this code to read the web page content:
my $ua = new LWP::UserAgent;
my $response= $ua->post($url);
if ($response->is_success){
my $content = $response->content;
...
But if $url is pointing to moved page then $response->is_success is returning false. Now how do I get the content of redirected page easily?
You need to chase the redirect itself.
if ($response->is_redirect()) {
$url = $response->header('Location');
# goto try_again
}
You may want to put this in a while loop and use "next" instead of "goto". You may also want to log it, limit the number of redirections you are willing to chase, etc.
[update]
OK I just noticed there is an easier way to do this. From the man page of LWP::UserAgent:
$ua->requests_redirectable
$ua->requests_redirectable( \#requests )
This reads or sets the object's list of request names that
"$ua->redirect_ok(...)" will allow redirection for. By default,
this is "['GET', 'HEAD']", as per RFC 2616. To change to include
'POST', consider:
push #{ $ua->requests_redirectable }, 'POST';
So yeah, maybe just do that. :-)

zend framework urls and get method

I am developing a website using zend framework.
i have a search form with get method. when the user clicks submit button the query string appears in the url after ? mark. but i want it to be zend like url.
is it possible?
As well as the JS approach you can do a redirect back to the preferred URL you want. I.e. let the form submit via GET, then redirect to the ZF routing style.
This is, however, overkill unless you have a really good reason to want to create neat URLs for your search queries. Generally speaking a search form should send a GET query that can be bookmarked. And there's nothing wrong with ?param=val style parameters in a URL :-)
ZF URLs are a little odd in that they force URL parameters to be part of the main URL. I.e. domain.com/controller/action/param/val/param2/val rather than domain.com/controller/action?param=val&param2=val
This isn't always what you want, but seems to be the way frameworks are going with URL parameters
There is no obvious solution. The form generated by zf will be a standard html one. When submitted from the browser using GET it will result in a request like
/action/specified/in/form?var1=val1&var2=var2
Only solution to get a "zendlike url" (one with / instead of ? or &), would be to hack the form submission using javascript. For example you can listen for onSubmit, abort the submission and instead redirect browser to a translated url. I personally don't believe this solution is worth the added complexity, but it should perform what you're looking for.
After raging against this for a day-and-a-half, and doing my best to figure out the right way to do this fairly simple this, I gave up and did the following. I still can't believe there's not a better way.
The use case that necessitates this is a simple record listing, with a form up top for adding some filters (via GET), maybe some column sorting, and Zend_Paginate thrown in for good measure. I ran into issues using the Url view helper in my pagination partial, but I suspect with even just sorting and a filter-form, Zend_View_Helper_Url would still fall down.
But I digress. My solution was to add a method to my base controller class that merges any raw query-string parameters with the existing zend-style slashy-params, and redirects (but only if necessary). The method can be called in any action that doesn't have to handle POSTs.
Hopefully someone will find this useful. Or even better, find a better way:
/**
* Translate standard URL parameters (?foo=bar&baz=bork) to zend-style
* param (foo/bar/baz/bork). Query-string style
* values override existing route-params.
*/
public function mergeQueryString(){
if ($this->getRequest()->isPost()){
throw new Exception("mergeQueryString only works on GET requests.");
}
$q = $this->getRequest()->getQuery();
$p = $this->getRequest()->getParams();
if (empty($q)) {
//there's nothing to do.
return;
}
$action = $p['action'];
$controller = $p['controller'];
$module = $p['module'];
unset($p['action'],$p['controller'],$p['module']);
$params = array_merge($p,$q);
$this->_helper->getHelper('Redirector')
->setCode(301)
->gotoSimple(
$action,
$controller,
$module,
$params);
}

How do I use and debug WWW::Mechanize?

I am very new to Perl and i am learning on the fly while i try to automate some projects for work. So far its has been a lot of fun.
I am working on generating a report for a customer. I can get this report from a web page i can access.
First i will need to fill a form with my user name, password and choose a server from a drop down list, and log in.
Second i need to click a link for the report section.
Third a need to fill a form to create the report.
Here is what i wrote so far:
my $mech = WWW::Mechanize->new();
my $url = 'http://X.X.X.X/Console/login/login.aspx';
$mech->get( $url );
$mech->submit_form(
form_number => 1,
fields =>{
'ctl00$ctl00$cphVeriCentre$cphLogin$txtUser' => 'someone',
'ctl00$ctl00$cphVeriCentre$cphLogin$txtPW' => '12345',
'ctl00$ctl00$cphVeriCentre$cphLogin$ddlServers' => 'Live',
button => 'Sign-In'
},
);
die unless ($mech->success);
$mech->dump_forms();
I dont understand why, but, after this i look at the what dump outputs and i see the code for the first login page, while i belive i should have reached the next page after my successful login.
Could there be something with a cookie that can effect me and the login attempt?
Anythings else i am doing wrong?
Appreciate you help,
Yaniv
This is several months after the fact, but I resolved the same issue based on a similar questions I asked. See Is it possible to automate postback from the client side? for more info.
I used Python's Mechanize instead or Perl, but the same principle applies.
Summarizing my earlier response:
ASP.NET pages need a hidden parameter called __EVENTTARGET in the form, which won't exist when you use mechanize normally.
When visited by a normal user, there is a __doPostBack('foo') function on these pages that gives the relevant value to __EVENTTARGET via a javascript onclick event on each of the links, but since mechanize doesn't use javascript you'll need to set these values yourself.
The python solution is below, but it shouldn't be too tough to adapt it to perl.
def add_event_target(form, target):
#Creates a new __EVENTTARGET control and adds the value specified
#.NET doesn't generate this in mechanize for some reason -- suspect maybe is
#normally generated by javascript or some useragent thing?
form.new_control('hidden','__EVENTTARGET',attrs = dict(name='__EVENTTARGET'))
form.set_all_readonly(False)
form["__EVENTTARGET"] = target
You can only mechanize stuff that you know. Before you write any more code, I suggest you use a tool like Firebug and inspect what is happening in your browser when you do this manually.
Of course there might be cookies that are used. Or maybe your forgot a hidden form parameter? Only you can tell.
EDIT:
WWW::Mechanize should take care of cookies without any further intervention.
You should always check whether the methods you called were successful. Does the first get() work?
It might be useful to take a look at the server logs to see what is actually requested and what HTTP status code is sent as a response.
If you are on Windows, use Fiddler to see what data is being sent when you perform this process manually, and then use Fiddler to compare it to the data captured when performed by your script.
In my experience, a web debugging proxy like Fiddler is more useful than Firebug when inspecting form posts.
I have found it very helpful to use Wireshark utility when writing web automation with WWW::Mechanize. It will help you in few ways:
Enable you realize whether your HTTP request was successful or not.
See the reason of failure on HTTP level.
Trace the exact data which you pass to the server and see what you receive back.
Just set an HTTP filter for the network traffic and start your Perl script.
The very short gist of aspx pages it that they hold all of the local session information within a couple of variables prefixed by "__" in the general aspxform. Usually this is a top level form and all form elements will be part of it, but I guess that can vary by implementation.
For the particular implementation I was dealing with I needed to worry about 2 of these state variables, specifically:
__VIEWSTATE
__EVENTVALIDATION.
Your goal is to make sure that these variables are submitted into the form you are submitting, since they might be part of that main form aspxform that I mentioned above, and you are probably submitting a different form than that.
When a browser loads up an aspx page a piece of javascript passes this session information along within the asp server/client interaction, but of course we don't have that luxury with perl mechanize, so you will need to manually post these yourself by adding the elements to the current form using mechanize.
In the case that I just solved I basically did this:
my $browser = WWW::Mechanize->new( );
# fetch the login page to get the initial session variables
my $login_page = 'http://www.example.com/login.aspx';
$response = $browser->get( $login_page);
# very short way to find the fields so you can add them to your post
$viewstate = ($browser->find_all_inputs( type => 'hidden', name => '__VIEWSTATE' ))[0]->value;
$validation = ($browser->find_all_inputs( type => 'hidden', name => '__EVENTVALIDATION' ))[0]->value;
# post back the formdata you need along with the session variables
$browser->post( $login_page, [ username => 'user', password => 'password, __VIEWSTATE => $viewstate, __EVENTVALIDATION => $validation ]);
# finally get back the content and make sure it looks right
print $response->content();