GET html using WWW::Mechanize causes "Forbidden" - perl

I want to get the content of a film's IMDb page using WWW::Mechanize. First of all, I have to find the respective /title/tt* URL. When I have, e.g., a movie called Fight Club, I want to visit this link:
http://www.imdb.com/find?s=all&q=fight+club
For some reason, this fails already. Here's the line that causes the error:
$mech->get('http://www.imdb.com/find?s=all&q=fight+club');
error message:
Error GETing
http://www.imdb.com/find?s=all&q=fight+club:
Forbidden
If I write something like $mech->get('http://www.google.com'), it works fine. What's the difference when using IMDb? Any proposals for an alternative solution?

IMDb probably sniffs the User-Agent string and rejects WWW::Mechanize requests. The "solution" is to respect their wish to block you from interacting with the site in an automated fashion.
(Or you could read their terms and conditions very, very carefully and then change the user agent string)
Licensing IMDb Content; Consent to Use Robots and Crawlers: If you are interested in receiving our express written permission to use IMDb content for your non-personal (including commercial) use, please visit our Content Licensing section or contact our Licensing Department. We do allow the limited use of robots and crawlers, such as those from certain search engines, with our express written consent. If you are interested in receiving our express written permission to use robots or crawlers on our site, please contact our Licensing Department.
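If you do obtain that written permission, changing the agent string is straightforward. A minimal sketch (the alias below is just one of WWW::Mechanize's built-in browser presets; whether you should use it at all depends on that consent):

use strict;
use warnings;
use WWW::Mechanize;

# Sketch only: present a browser-like User-Agent instead of the default
# "WWW-Mechanize/x.xx" string. Only do this with IMDb's written consent.
my $mech = WWW::Mechanize->new();
$mech->agent_alias('Windows Mozilla');   # built-in preset
# or set an explicit string: $mech->agent('Mozilla/5.0 (compatible; MyBot)');

$mech->get('http://www.imdb.com/find?s=all&q=fight+club');
print $mech->content();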

David is right, that's probably what's happening.
But did you know that lots of information is available from IMDb via FTP? And that they have a number of tools you can use to get at their information other than scraping?
See http://www.imdb.com/interfaces

Related

Disallow certain URLs in robots.txt

I'm currently running a web service where people can browse products. The URL for that is basically just /products/product_pk/. However, we don't serve products with certain product_pks, e.g. nothing smaller than 200. Is there a way to discourage bots from hitting URLs like /products/10/ (since they will just receive a 404)?
Thank you for your help :)
I am pretty sure that crawlers don't try and fail on auto-generated URLs. A crawler crawls your website and finds the next links to follow. If you have any links that return a 404, that is bad design on your site, since they should not be there.
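If you still want to hint crawlers away from paths you know are dead, robots.txt only lets you list them per path (or path prefix); it has no numeric ranges, so "everything below 200" cannot be expressed directly. A hypothetical sketch:

User-agent: *
Disallow: /products/10/
Disallow: /products/11/
Disallow: /products/199/

Google also honours * and $ wildcards in patterns, but even those cannot express a numeric comparison.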

How to track direct URL referrer

Most hosts come with software, or Google Analytics, that lets you know how a person got to your site, for example via a link on yelp.com or a facebook.com page.
But it is impossible for the software to know what method of marketing got a person to visit your site if they typed the URL directly (analytics software shows these visits as "direct URL").
I need a creative idea for refining this broad "direct URL" category.
One way would be to use flyers with QR codes linking to the website, but instead of the website itself I would direct them to clickemart.ca/flyer1_referrer, which in turn sends the visitor to clickemart.ca. A different flyer distributed in a different location would have a QR code that directs a mobile user to clickemart.ca/flyer2_referrer.
So my question is: is this possible? Will I be able to figure out which flyer (1 or 2) was more effective based on visits to the redirect URLs? If it is possible, can you give me a brief idea of how to implement it?
I know a lot of you will say to add a "source of referrer" form field on the site, but from my experience this is never filled in, or is filled with incorrect values (typically whatever is closest to the mouse pointer, the topmost option, or "other" - you get it, something useless).
Any help or guidance is really appreciated!
Solved it - here is how:
Connect your site to Google Analytics (www.google.ca/analytics)
Create a campaign URL (support.google.com/analytics/answer/1033867?hl=en)
OPTIONAL: shorten the URL (www.bitly.com)
Create a QR code to the URL (www.qrstuff.com/)
Scan the QR code and watch it appear as a campaign referral under Google Analytics
Here is an explanation of the different steps:
Once you create an account on Google Analytics, you can connect the site by copying a PHP or JavaScript snippet onto every page, or in my case (Magento) the tracking code can simply be connected through the configuration settings.
The campaign URL adds information to the link to your website, just as query variables (such as noredirect=... on stackoverflow.com links) carry info for the server to process, so the campaign parameters act as a tag that identifies the source of the referral (see the example URLs below).
Shortening the URL is recommended because the QR code becomes less dense, which in turn reduces the chance of errors while scanning the code. The shortened URL links directly to the campaign URL you provided, so it is a seamless process.
QRStuff is a good place to download the QR code image at a high resolution
So when you scan, this is what happens:
CODE SCANNED >> PHONE GOES TO THE SHORTENED URL >> REDIRECT TO THE CAMPAIGN URL >> GOOGLE ANALYTICS RECEIVES INFO ABOUT THE CAMPAIGN REFERRAL >> YOU CAN SEE IT BY LOGGING INTO GOOGLE ANALYTICS
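For reference, a campaign URL simply appends the utm_* query parameters to your landing page URL, so the two flyers can be told apart with something like the following (the parameter values here are made up):

http://clickemart.ca/?utm_source=flyer1&utm_medium=qr&utm_campaign=flyer_promo
http://clickemart.ca/?utm_source=flyer2&utm_medium=qr&utm_campaign=flyer_promo

Google Analytics then reports each visit under the matching source, medium and campaign, which is exactly the breakdown needed to compare the two flyers.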

How to prevent Google from indexing redirect URL I do not own

A domain name that I do not own is redirecting to my domain. I don't know who owns it or why it is redirecting to my domain.
This domain, however, is showing up in Google's search results. When doing a whois it also returns this message:
"Domain:http://[baddomain].com webserver returns 307 Temporary Redirect"
Since I do not own this domain I cannot set a 301 redirect, or disable it. When clicking the bad domain in Google, it shows the content of my website, but baddomain.com stays visible in the URL bar.
My question is: How can I stop Google from indexing and showing this bad domain in the search results and only show my website instead?
Thanks.
Some thoughts:
You cannot directly stop Google from indexing other sites, but what you could do is add the canonical tag to your pages so Google can see that the original content is located on your domain and not on "bad domain".
For example, check out: https://support.google.com/webmasters/answer/139394?hl=en
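A canonical link element goes in the <head> of every page, along these lines (with your real domain in place of www.yourdomain.com):

<link rel="canonical" href="http://www.yourdomain.com/current-page/" />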
Other actions can be taken SEO-wise if 'baddomain' is outscoring you in the search rankings, because then it sounds like your site could use some optimizing.
The better your site and domain rank in the SERPs, the less likely it is that people will see the scraped content and 'baddomain'.
You could, however, also look at the referrer of the request, and if it is 'baddomain' you should be able to redirect to your own domain, change the content, etc., because the code is being run on your own server.
But that might be more trouble than it's worth, as you'd need to investigate how 'baddomain' is doing things and code accordingly (probably an iframe or similar from what you describe, but that can still be circumvented using scripts).
Depending on what country you and 'baddomain' are located in, there are also legal options, so-called DMCA complaints. This, however, can also be quite a task, and, well - it's often not worth it because a new domain will just pop up.

Facebook Graph API SEO Comments and Profanity Filter

I'm trying to integrate the Facebook comments left on our site in a way that lets the content be crawled by search engines and also be read by people (although I highly doubt there will be many) who don't have JavaScript enabled in their browser.
Currently our Facebook comments are displayed via the Facebook Comments social plugin (using the <fb:comments href="MY_URL" num_posts="50" width="665"></fb:comments> tag). This ends up rendering an iframe (which is mostly ignored by search engine crawlers), so the plan is to render this information and format it with basic HTML. To do this, the comments are pulled using the Graph API - this is then only displayed to crawlers and people with JavaScript disabled.
This all works nicely using the Graph API call (https://graph.facebook.com/comments/?ids=MY_URL), parsing the JSON result and displaying it on the page. The problem is that the <fb:comments> approach filters our results based on a blacklist we have set up on one of our Facebook Apps. The AppId with the relevant blacklist is stored on the page using metadata (<meta property="fb:app_id" content="APP_ID"/>) which the <fb:comments> control obviously must somehow use to filter the comments.
The problem is that the Graph API method does not filter any results, as I guess no blacklist (or App ID containing a blacklist) is specified. Does anyone know how to specify a Facebook App ID in the API call URL, or of another way to avoid fetching comments that violate the blacklist?
On a side note, I know the debate about filtering content in comments rages on, but it is a management decision to implement the blacklist, and one that I have no influence in changing - just in case anyone felt the need to explain the reasons why content filtering is or isn't a good idea!
Any thoughts on a solution?
Unfortunately there's no way to access a filtered list of comments using the API - it might be a reasonable request to have this in the API - you should file a wishlist item in Facebook's bug tracker.
Otherwise, the only solution I can think of is to implement your own filter on your side when retrieving and displaying the comments from the API.
According to the Comments plugin documentation the filter on Facebook's side is implemented as a simple substring match, so it should be trivial to implement.
A fairly simple regular expression match should be able to check each comment against a relatively long list quickly.
(Unfortunately, the tradeoff here is that implementing a filter is easy, but you'd also need to write an interface so that whoever's updating the list of disallowed words can maintain the list for both the Facebook plugin, and your own filtering.)
Quote from docs:
The comment is checked via substring matching. This means if you blacklist the word 'at', if the comment contains the sequence 'a' 't' anywhere it will be marked with limited visibility; e.g. if the comment contained the words 'bat', 'hat', 'attend', etc. it would be caught.
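A minimal sketch of that kind of filter in Perl (the blacklist and the comment list are placeholders; in practice the comments come from the parsed Graph API JSON and the blacklist mirrors the one configured in the Facebook app):

use strict;
use warnings;

my @blacklist = ('badword', 'swear');            # placeholder list
my @comments  = ('a perfectly fine comment',     # placeholder data
                 'this one contains a swear');

# Keep only comments that match none of the blacklisted substrings,
# using the same case-insensitive substring test the plugin describes.
my @clean = grep {
    my $text = lc $_;
    my $hit  = 0;
    for my $word (@blacklist) {
        $hit = 1 if index($text, lc $word) >= 0;
    }
    !$hit;
} @comments;

print "$_\n" for @clean;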
Pretty sure there is no current way of doing this from the Graph API; the only thing I can suggest is taking the blacklist and building your own filter.

Best way to store data for Greasemonkey based crawler?

I want to crawl a site with Greasemonkey and wonder if there is a better way to temporarily store values than with GM_setValue.
What I want to do is crawl my contacts in a social network and extract the Twitter URLs from their profile pages.
My current plan is to open each profile in its own tab, so that it looks more like a normal browsing person (i.e. CSS, scripts and images will be loaded by the browser), then store the Twitter URL with GM_setValue. Once all profile pages have been crawled, create a page using the stored values.
I am not so happy with the storage option, though. Maybe there is a better way?
I have considered inserting the user profiles into the current page so that I could process them all with the same script instance, but I am not sure whether XMLHttpRequest looks indistinguishable from normal user-initiated requests.
I've had a similar project where I needed to get a whole lot of data (invoice lines) from a website and export it into an accounting database.
You could create a .aspx (or PHP etc) back end, which processes POST data and stores it in a database.
Any data you want from a single page can be stored in a form (hidden using style properties if you want), using field names or IDs to identify the data. Then all you need to do is make the form action an .aspx page and submit the form using JavaScript.
(Alternatively you could add a submit button to the page, allowing you to check the form values before submitting to the database).
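The same idea sketched as a Perl CGI back end instead of .aspx or PHP (the table and field names are invented for this example; the form on the crawled page would POST matching fields):

use strict;
use warnings;
use CGI;
use DBI;

# Receive the POSTed fields and store them in a local SQLite database.
my $q   = CGI->new;
my $dbh = DBI->connect('dbi:SQLite:dbname=crawl.db', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS twitter_urls (profile TEXT, url TEXT)');
$dbh->prepare('INSERT INTO twitter_urls (profile, url) VALUES (?, ?)')
    ->execute($q->param('profile'), $q->param('twitter_url'));

print $q->header('text/plain'), "stored\n";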
I think you should first ask yourself why you want to use Greasemonkey for your particular problem. Greasemonkey was developed as a way to modify one's browsing experience -- not as a web spider. While you might be able to get Greasemonkey to do this using GM_setValue, I think you will find your solution to be kludgy and hard to develop. That, and it will require many manual steps (like opening all of those tabs, clearing the Greasemonkey variables between runs of your script, etc).
Does anything you are doing require the JavaScript on the page to be executed? If so, you may want to consider using Perl and WWW::Mechanize::Plugin::JavaScript. Otherwise, I would recommend that you do all of this in a simple Python script. You will want to take a look at the urllib2 module. For example, take a look at the following code (note that it uses cookielib to support cookies, which you will most likely need if your script requires you to be logged into a site):
import urllib2
import cookielib
# Build an opener with cookie support, so requests that need a logged-in
# session keep their cookies between calls.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
# Fetch the profile page and read the raw HTML into a string.
response = opener.open("http://twitter.com/someguy")
responseText = response.read()
Then you can do all of the processing you want using regular expressions.
Have you considered Google Gears? That would give you access to a local SQLite database which you can store large amounts of information in.
The reason for wanting Greasemonkey is that the page to be crawled does not really approve of robots. Greasemonkey seemed like the easiest way to make the crawler look legitimate.
Actually, running your crawler through the browser does not make it that much more legitimate. You are still breaking the terms of use of the site! WWW::Mechanize, for example, is equally well suited to 'spoof' your User-Agent string, but doing that and crawling a site that does not allow spiders/crawlers is still illegal!
The reason for wanting Greasemonkey is that the page to be crawled does not really approve of robots. Greasemonkey seemed like the easiest way to make the crawler look legitimate.
I think this is the hardest way imaginable to make a crawler look legitimate. Spoofing a web browser is trivially easy with some basic understanding of HTTP headers.
Also, some sites have heuristics that look for clients that behave like spiders, so simply making requests look like they come from a browser doesn't mean they won't know what you are doing.