Fundamental rules of crawling a website - robots.txt

I am studying website crawling.
I would like to ask the following questions.
If a website appears in Google search results, can I crawl that website?
A website's robots.txt contains the line below. How can I find out, using a browser, which webpages on this site are prohibited from crawling?
Disallow: /usr/top
Could you tell me the answers to the above questions?

If a website appears in Google search results, can I crawl that website?
I assume that you want to honour robots.txt. In that case the answer is: No, not necessarily.
You have to check the robots.txt. It might be the case that Google’s bot is allowed to crawl it, but your bot is not allowed to.
I want to know the concrete webpage URLs covered by /usr/top
When there is a line like Disallow: /usr/top, you can't know which existing URLs are blocked by it. Disallow always takes the beginning of a URL path as its value. So in this example, it would block the following URLs (assuming that the robots.txt is at http://example.com/robots.txt):
http://example.com/usr/top
http://example.com/usr/top/
http://example.com/usr/top.html
http://example.com/usr/topfoo
http://example.com/usr/top/foo/bar
http://example.com/usr/top/foo/bar.html
…
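In a browser, the closest you can get is to open http://example.com/robots.txt itself and compare the Disallow prefixes against the URLs you are interested in. If you want to check programmatically whether a specific URL is blocked, here is a minimal sketch using Python's standard-library robots.txt parser (the example.com URLs are the placeholders from above):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt (example.com is a placeholder).
    rp = RobotFileParser("http://example.com/robots.txt")
    rp.read()

    # Check whether a given user agent may fetch a given URL.
    for url in ["http://example.com/usr/top",
                "http://example.com/usr/top/foo/bar.html",
                "http://example.com/usr/other"]:
        print(url, rp.can_fetch("MyCrawler", url))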

If a website appears in Google search results, can I crawl that website?
The short answer is maybe. The long answer is: many websites have terms of use or a user agreement that may state whether crawling is allowed. For example, I believe Facebook does not allow crawling.
Regarding the robots.txt file: this link may be helpful.

Related

Prevent Facebook following/discarding short url to redirect page

We're trying to build a share widget for referral links (links that are essentially short URLs, such as a.b.com/uniqueCode, which redirect to a client website but go through our service and let us track them); essentially someone needs to share their own affiliate URL on Facebook.
The problem is Facebook always seems to resolve the URL to its final destination and doesn't display the URL we pass in. I can't find any documentation on whether there's a way to prevent this. We've tried both a 301 and a 302 redirect, with no change. I've tried different URLs to make sure we're not seeing the result of their URL caching.
Is there a way to instruct Facebook (and Google Plus, LinkedIn) to keep the referring link we provide?
You could try doing the redirect using JavaScript.
Also, you might be able to do it by showing the social networks' crawlers different pages based on their user agent or IP address. Look into your webserver logs and check what happens there immediately after you've posted the link.
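If you try the user-agent approach, here is a minimal sketch, assuming a Flask app serving the short URLs and assuming the crawlers announce themselves with user-agent strings like facebookexternalhit (the route, the signatures and the destination URL are placeholders):

    from flask import Flask, redirect, request

    app = Flask(__name__)

    # User-agent substrings the social networks' crawlers are commonly reported to send.
    CRAWLER_SIGNATURES = ("facebookexternalhit", "linkedinbot", "google")

    @app.route("/<unique_code>")
    def referral(unique_code):
        ua = request.headers.get("User-Agent", "").lower()
        if any(sig in ua for sig in CRAWLER_SIGNATURES):
            # Give the crawler a simple page describing the short URL itself
            # instead of following the redirect to the client site.
            return "<html><head><title>Referral link</title></head><body>Referral link</body></html>"
        # Regular visitors get redirected to the client website (placeholder URL).
        return redirect("https://client.example.com/landing", code=302)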

Description in google search + redirect metatag to fan page

Here is my problem:
There is this domain: dbrinterativa.com
When I try to search for 'dbr' in Google, it returns this website as the first result. But the problem is that the HTML has a meta tag redirecting the user to a Facebook fan page, and Google is picking up that fan page's description and title!
<meta http-equiv="refresh" content="0;url=https://www.facebook.com/dbrinterativa">
And there is another problem:
When I try to search for its URL in Google, like dbrinterativa.com (example URL), it says that robots.txt is not allowing Google to get its metadata... here is a link to my robots.txt
Does anybody know what I can do to solve this problem?
Thanks!
Looking at your robots.txt: it doesn't really matter what you disallow there, because you are using a meta refresh, which is not good practice.
The crawler enters your website and immediately moves on to Facebook, so whatever you have in your robots.txt no longer applies; the crawler isn't following your site any more.
I don't know why you are redirecting straight to Facebook. You could build a website with a link to your Facebook page, and then you would be indexed correctly.
From Google's crawler's point of view, your website is the Facebook page, so it takes the title and description from your Facebook page and not from yours.

Comments not crawlable by search engines?

I was wondering whether search engine spiders can see the comments. When I open the page source, the comments don't show up (same as with Disqus), so I assume that when search engines crawl the page they won't see the comments either. Is this assumption correct? If so, is there a way to change this?
Found the solution:
http://developers.facebook.com/docs/reference/plugins/comments/
How can I get an SEO boost from the comments left on my site?
The Facebook comments box is rendered in an iframe on your page, and
most search engines will not crawl content within an iframe. However,
you can access all the comments left on your site via the graph API as
described above. Simply grab the comments from the API and render them
in the body of your page behind the comments box. We recommend you
cache the results, as pulling the comments from the graph API on each
page load could slow down the rendering time of the page.
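A rough sketch of that suggestion in Python; the comments endpoint and the response fields used below are assumptions based on the plugin documentation of the time, so verify them against the current docs before relying on them:

    import requests

    # Placeholder: the page on your site that hosts the comments box.
    PAGE_URL = "http://example.com/article.html"

    # Ask the Graph API for the comments attached to this URL (endpoint and
    # response shape are assumptions; check the current documentation).
    resp = requests.get("https://graph.facebook.com/comments/",
                        params={"ids": PAGE_URL})
    entry = resp.json().get(PAGE_URL, {})
    comments = entry.get("comments", entry).get("data", [])

    # Render the comments as plain HTML to print into the page body behind
    # the comments box, so crawlers can see the text.
    for c in comments:
        name = c.get("from", {}).get("name", "")
        print('<p class="fb-comment">{}: {}</p>'.format(name, c.get("message", "")))

As the documentation notes, cache the result server-side rather than calling the API on every page load.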
A crawler can only see what is actually sent to it, so these comments need to be output in the page in order to be crawled and saved into the search engine's database (or whatever it uses to collect data about websites). You can check the headers the request came with to see whether it belongs to a crawler; that's the user agent, which for us humans is the browser. Here you can find a way to detect crawlers using PHP; after detecting one, you force the comments to be shown so that they get crawled. Here is also a good resource, from Google itself, on how to deal with crawlers.
Now, if you're talking about comments on Facebook pages themselves, it's impossible to have them indexed by a crawler or search engine: when a crawler attempts to visit one of the Facebook pages, it won't be able to see users' data because of the login page. If you're talking about the Facebook comments plugin, you can do what I suggested above; here is an article about crawling Facebook comments.

Facebook Graph for URLs - Getting the number of shares for a group of pages

How can we know the sum of all Facebook shares for all URLs that start with:
http://www.guardian.co.uk/artanddesign/
There's an API method for a specific URL that is:
https://graph.facebook.com/?ids=http://www.guardian.co.uk/artanddesign/2012/feb/19/frank-gehry-new-york-interview
But I can't make it sum all the URLs that start with
http://www.guardian.co.uk/artanddesign/.
Notes:
I can't iterate through all the URLs in the directory "artanddesign".
Please note that the URLs above are just examples.
I can't access Insights because I don't have permission to add a meta tag to the root directory of the site.
Maybe the solution is using the Facebook Query Language (FQL)? How could it work?
I don't believe there's any way to get information for all URLs that match a certain pattern; I think you have to specify the individual URLs manually. That shouldn't be a big problem if it's your own site, as you'll already have a database of all possible article URLs, but if you're checking someone else's site it could be difficult.
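If you do have the list of article URLs, you can batch them through the same ?ids= endpoint shown in the question and add up the counts yourself. A sketch, assuming that endpoint returns a shares field per URL (the second URL is an invented example):

    import requests

    # Example article URLs; in practice this list would come from your own database.
    urls = [
        "http://www.guardian.co.uk/artanddesign/2012/feb/19/frank-gehry-new-york-interview",
        "http://www.guardian.co.uk/artanddesign/2012/feb/20/another-article",
    ]

    # The ?ids= endpoint accepts several comma-separated URLs per request.
    resp = requests.get("https://graph.facebook.com/",
                        params={"ids": ",".join(urls)})
    data = resp.json()

    # Sum the share counts; the shares field is absent for URLs never shared.
    total = sum(info.get("shares", 0) for info in data.values())
    print("Total shares:", total)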

How to move facebook likes on wordpress from one domain to another

Background
I have a domain, say www.foo.com, and I host my WordPress blog there.
There are many Facebook likes on the posts I have put up.
Now I have redirected www.foo.com to www.boo.com and closed the foo.com domain. (On the server I have simply copied the files from foo.com to boo.com's folder. The database is the same for both.)
Problem
The problem I am facing is that the Facebook likes are "gone". How can I retrieve the Facebook likes? Are the likes linked to the domain name?
You can do this with a 301 redirect, although it's not 100% reliable.
I kept my Google page rank and Facebook page likes when moving between www.foo.co.uk and www.foo.com, even though I completely changed the design and URLs (although "foo" stayed the same). Facebook, Google+ and Google page rank followed the 301s and attributed them to the new page. Having said that, if your original site is closed it may be a problem.
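The 301 itself would normally be configured on the web server (for example in .htaccess on a typical WordPress host), but as an illustration of the idea, here is a minimal Flask sketch that sends every request on the old domain to the same path on the new one; foo.com and boo.com are the placeholder names from the question:

    from flask import Flask, redirect

    app = Flask(__name__)

    # Serve this app on the old domain (www.foo.com) and send every request
    # to the same path on the new domain with a permanent (301) redirect.
    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def permanent_redirect(path):
        return redirect("http://www.boo.com/" + path, code=301)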
You can't do this. Even if in your case it is a valid request, think about the meaning of this action: if it were possible, people could just move likes from one page to another, from applications, and from websites. Users that liked "foo.com" DIDN'T like "boo.com"; the fact that the content is exactly the same is purely coincidental. The user did not "like" that second URL, therefore you cannot "move" likes.
Perhaps if you contact Facebook (as a developer) they might be able to assist you, but there is no method that we (non Mark Zuckerberg types) can use.