How can I "drill down" into a website using Perl's WWW::Mechanize - perl

I have used the WWW::Mechanize Perl module on a number of projects and it's helped me out a lot.
I am trying to use it on a different site and I can't "drill down" into the content of the site.
The site is https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services
I have tried to figure out what URLs I would need to request to get the content shown in the resulting HTML, such as bb.d570283b87c834518ba9.css, bb.d570283b87c834518ba9.js and version.js.
I tried to copy the resulting HTML into this posting, but whatever combination of quoting and code formatting I used, it wouldn't display properly.
Does anyone have any idea how I "navigate" this site using this Perl module please?

WWW::Mechanize is a web client with some HTML parsing capabilities. But as you clearly noticed, the information you want is not in the HTML document you requested. Either download the correct document (whatever that might be), or do what the browser does and execute the JavaScript. This would require a JavaScript engine. The simplest way to achieve that is to remote-control a web browser (e.g. using Selenium::Chrome).
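A minimal sketch of the remote-controlled-browser approach mentioned above, using Selenium::Chrome (it assumes chromedriver is installed and on your PATH, and the fixed sleep is only a crude stand-in for a proper wait on the rendered content):

use strict;
use warnings;
use Selenium::Chrome;

# Drive a real Chrome instance so the site's JavaScript actually runs.
my $driver = Selenium::Chrome->new;
$driver->get('https://customer.bookingbug.com/?client=hantsrecyclingcentres#/services');
sleep 5;                                  # crude wait for the JavaScript app to render
my $html = $driver->get_page_source;      # the DOM after JavaScript has run
print $html;
$driver->shutdown_binary;

From that rendered HTML you can carry on with whatever parsing you would normally do.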

Related

How do I crawl a website written in JSP (Java Server Pages) using Perl Script?

I have this website: https://www.connect2nse.com/iislNet/UserFolder.jsp
First I tried using WWW::Mechanize, but it doesn't seem to work; apparently WWW::Mechanize doesn't work with a website written in JSP. So I researched how to download a file from a JSP-based website, but couldn't find a good answer. Can anybody help me with this one? Thanks in advance.
As far as the client is concerned, JavaServer Pages is no different from PHP, Perl, or even static HTML files: the result is a page of HTML that can be rendered and displayed, so the server-side technology isn't the reason WWW::Mechanize is failing to do what you want.
"Doesn't work" is useless as a problem description, and the issue could be pretty much anything. However, if the HTML relies on JavaScript (which is executed on the client after the page has been retrieved, not on the server), then WWW::Mechanize may be more or less handicapped because it doesn't support JavaScript. For that you will need WWW::Mechanize::Firefox or similar, which works by using a real instance of Firefox to render the HTML and execute any JavaScript.
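A bare-bones sketch of that approach (it assumes a running Firefox with the MozRepl extension, which WWW::Mechanize::Firefox talks to; the API otherwise mirrors plain WWW::Mechanize):

use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Pages are fetched and rendered by the real browser, so any JavaScript runs.
my $mech = WWW::Mechanize::Firefox->new;
$mech->get('https://www.connect2nse.com/iislNet/UserFolder.jsp');
print $mech->content;   # the DOM as Firefox sees it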

Wget to download html

I have been trying to download the HTML from http://osu.ppy.sh/u/2330158 to get the Historical data,
but it doesn't download that part, nor does it download the General, Top Ranks, etc. sections.
Is there a way to make wget download it?
That part of the page is loaded dynamically, so wget won't see it, as wget doesn't support JavaScript. However, if you open the web developer tools in your browser of choice and then load the main page, you can find the URL you're really after. For this page, it's: http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0
Luckily, it's another simple, parameterised URL so you can feed that to wget:
wget "http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0"
That'll get you an HTML document containing just the historical data you're looking for.
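Since the surrounding thread is about WWW::Mechanize, the same direct fetch looks roughly like this in Perl (assuming the endpoint still exists and responds without authentication):

use strict;
use warnings;
use WWW::Mechanize;

# Request the dynamically loaded fragment directly, bypassing the JavaScript
# that normally pulls it into the main page.
my $mech = WWW::Mechanize->new;
$mech->get('http://osu.ppy.sh/pages/include/profile-history.php?u=2330158&m=0');
print $mech->content;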

How to extract data from a web site and format to raw text - iPhone Dev

I have been looking around for a while and not found anything useful; I'm also not sure I have worded the question in the clearest fashion, so apologies.
I have a section of an app I am building called 'Company News'. The company in question has a news page on their website which displays a title, an excerpt of text and a read more option.
At the minute, in the iPhone application I just have a UIWebView which links to that URL and displays an error if no connection is available. However, if my user clicks a story to read the news, it obviously opens up a new page; I want to avoid having to build in 'back' and 'forward' buttons and stay away from it looking like a browser within the app.
With that said, I am looking for a way to just extract that data from the website and just display it in my app as raw text. I am not particularly bothered about rich text formatting or anything fancy. I would just like the title and body of text.
Is this possible?
In essence, then, you are looking for an HTML parser.
Assuming the HTML you wish to parse has a predictable format, the approach I would take is to load the HTML via whatever URL loading system you want - e.g. NSURLConnection, ASIHTTPRequest, etc.
Then you will need to parse the raw HTML. I use XPath. It requires that you learn the syntax, but it should work.
For more details about how you might use XPath for parsing HTML, see the second response to this question. You will need to link to libxml2 in your project then use XPath to extract the nodes of interest.
Scraping web pages in this way is fragile, though, because it depends on the structure of a page you don't control and which could be changed unpredictably.
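(Since the rest of this thread is Perl-centric, here is roughly what the same parse-then-XPath idea looks like in Perl with HTML::TreeBuilder::XPath; the URL and the class names in the XPath expressions are made-up assumptions about the news page's structure, not anything taken from the question.)

use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TreeBuilder::XPath;

# Fetch the page and build a queryable tree from it.
my $html = get('http://example.com/company-news') or die "download failed";
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Extract just the titles and excerpts; adjust the XPath to the real markup.
my @titles   = map { $_->as_text } $tree->findnodes('//div[@class="news-item"]/h2');
my @excerpts = map { $_->as_text } $tree->findnodes('//div[@class="news-item"]/p[@class="excerpt"]');

print "$titles[$_]\n$excerpts[$_]\n\n" for 0 .. $#titles;
$tree->delete;   # free the parse tree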

Perl Mechanize module for scraping pdfs

I have a website into which many PDFs are uploaded. What I want to do is download all the PDFs present on the website. To do so I first need to provide a username and password to the website. After searching for some time I found the WWW::Mechanize package, which does this work. The problem now is that I want to make a recursive search of the website, meaning that if a link does not contain a PDF, I should not simply discard it but should follow it and check whether the new page has links that contain PDFs. In this way I should exhaustively search the entire website to download all the uploaded PDFs. Any suggestion on how to do this?
I'd also go with wget, which runs on a variety of platforms.
If you want to do it in Perl, check CPAN for web crawlers.
You might want to decouple collecting PDF URLs from actually downloading them. Crawling is already a lengthy process, and it might be advantageous to be able to hand off the downloading tasks to separate worker processes, as in the sketch below.
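A minimal sketch of that split using Parallel::ForkManager (the pdf-urls.txt file and the worker count are assumptions; phase one, the crawl itself, is whatever produced that list):

use strict;
use warnings;
use WWW::Mechanize;
use Parallel::ForkManager;

# Phase 1 (the crawl) has already written one PDF URL per line to a file,
# so the download phase can be run, retried or parallelised independently.
open my $fh, '<', 'pdf-urls.txt' or die "pdf-urls.txt: $!";
chomp( my @pdf_urls = <$fh> );

my $pm = Parallel::ForkManager->new(4);   # up to 4 concurrent downloads
for my $url (@pdf_urls) {
    $pm->start and next;                  # parent: spawn a child, move on
    my $mech = WWW::Mechanize->new;
    (my $file = $url) =~ s{.*/}{};        # crude filename: last path segment
    $mech->get( $url, ':content_file' => $file );
    $pm->finish;                          # child exits
}
$pm->wait_all_children;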
You are right about using the WWW::Mechanize module. This module has a method, find_all_links(), to which you can pass a regex matching the kind of links you want to grab or follow.
For example:
use WWW::Mechanize;

my $obj = WWW::Mechanize->new;
# ... (log in and fetch the page you want to scan) ...
my @pdf_links = $obj->find_all_links( url_regex => qr/^.+?\.pdf/ );
This gives you all the links pointing to PDF files. Now iterate through these links and issue a get call on each of them.
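A rough continuation of that idea (the ':content_file' option comes from LWP::UserAgent, which WWW::Mechanize builds on; deriving the filename from the last path segment is just a quick assumption):

for my $link (@pdf_links) {
    my $url = $link->url_abs;             # absolute URL of the PDF
    (my $file = $url) =~ s{.*/}{};        # take the last path segment as the filename
    $obj->get( $url, ':content_file' => $file );   # save the response straight to disk
}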
I suggest trying wget. Something like:
wget -r --no-parent -A.pdf --user=LOGIN --password=PASSWORD http://www.server.com/dir/

How safe is the data being parsed by RTF editors like TinyMCE?

I have a great concern about deploying the TinyMCE editor on a website. Looking at the code produced by the editor, it does a great job, and I leave the HTML button off the toolbar configuration so users cannot inject their own source.
However, from what I read in the TinyMCE docs, it claims to degrade nicely to a regular textarea should JavaScript be disabled in a user's browser... and therein lies my concern. If it does revert to a normal textarea, the user is then able to easily inject their own HTML, and this leaves me with a security concern.
I just pass through data created with TinyMCE, and it is used within another page created by my script, so it poses no security risk to my server. The security concern arises over what malicious data may be passed to another user viewing the generated page.
I know many of you will tell me to just use regexes, or to parse this data, but that itself could be a nightmare, as I would be trying to either:
a.) use regexes to try and clean up the HTML without breaking the generated page
(and it would be better to parse the data for that anyway), or
b.) re-parse data that has already been parsed by the RTF editor, which also
would probably end up breaking the generated page.
If anyone has previous experience with this type of scenario, I would really appreciate a 'heads-up' as to any other risks that using an RTF editor for user data could entail.
I would really like to provide this as a user option, but not if the risks outweigh the benefit: it would give a user of the RTF editor a chance to take a whack at another user viewing the page generated by the script.
My gut feeling at this point is to give the RTF editor a wide berth.
Thanks for any direction you can give me with your own experiences.
You cannot have client-side security on the web. You simply can't trust the browser, because it's easy for a malicious user to substitute a replacement browser that does whatever he wants.
If you accept HTML from users (using TinyMCE or through any other method) and display it to other users, you must sanitize or validate the HTML in some way on the server. If you're using Perl, the leading package seems to be HTML::Scrubber (along with various other modules that help you plug it in to various frameworks). I haven't had occasion to try it myself.
The TinyMCE Security page mentions some ways to make it harder for people to submit arbitrary HTML, but you still need server-side checks.
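A minimal sketch of that server-side sanitising step with HTML::Scrubber (the whitelist of tags here is an illustrative assumption, not a recommendation):

use strict;
use warnings;
use HTML::Scrubber;

# Allow only a small whitelist of formatting tags; markup outside it,
# including <script> and unlisted attributes, is stripped.
my $scrubber = HTML::Scrubber->new( allow => [qw(p b i em strong ul ol li a)] );
$scrubber->rules( a => { href => 1 } );   # permit only the href attribute on links

my $untrusted = '<p onclick="evil()">hi</p><script>alert(1)</script>';
my $clean     = $scrubber->scrub($untrusted);   # cleaned-up HTML, safe(r) to redisplay
print "$clean\n";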
Regex is generally not considered a good tool for parsing HTML (see the famous "RegEx match open tags except XHTML self-contained tags" answer), but I have noted the "perl" tag :)
My advice when taking markup from users is to always run it through something that can accept malformed HTML and return well-formed HTML. These parsers generally produce something that can be queried and updated with some form of XPath.
In Python there is a module called BeautifulSoup, Ruby has Nokogiri, and in ASP.NET there is a project called HtmlAgilityPack that all do this sort of thing. I'm not sure what library Perl has, but I'm sure there would be something.
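For what it's worth, the closest Perl equivalents are probably HTML::TreeBuilder (with HTML::TreeBuilder::XPath for XPath queries) or Mojo::DOM. A tiny illustration with Mojo::DOM, using made-up input:

use strict;
use warnings;
use Mojo::DOM;

# Parse (possibly malformed) HTML and query it with CSS selectors.
my $dom = Mojo::DOM->new('<div><p class=msg>Hello<p class=msg>World</div>');
print "$_\n" for $dom->find('p.msg')->map('text')->each;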