From where does Google get the abstract for each of its site results that it displays on its search results page? - metadata

I am working on a project in which I have to search for terms on a search engine and then cluster the results by their contextual sense, so I have to treat each result as a document. Unfortunately, the data present along with each result on the results page is too little for clustering. Hence, I wanted to know where the search engines get the abstract for each result that they show. If I could get that entire abstract, I could cluster the results by treating them as separate documents.
From where does Google get the abstract?
For example, if you search for "1000 Mile" on Google, the second result shows the following abstract:
"The women's 1000 Mile Collection is based on classic designs and reflects Wolverine's long heritage of crafting quality footwear. Complementing these classics ..."
This abstract is not present in the meta tags of the page.
From where does Google find this data?
Thanks

From "Does Google use the Meta Description Tag for Description of Page?":
Google will choose your search results snippets from the following places (not necessarily in this order):
The page's Meta Description tag
The page's Open Directory Project (ODP) Listing
Page content relevant to the search query
If you do not want Google to use the ODP listing's description then you can tell them not to do so with the following Meta tag:
<meta name="robots" content="NOODP">
If you want to encourage Google to use your Meta Description tag then make sure it is unique to each page. Also make sure it contains an accurate description of the page's content.
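For instance, a hypothetical tag along these lines (the wording is invented) would go in the page's <head>:
<meta name="description" content="Women's 1000 Mile boots: classic designs and hand-crafted leather footwear from Wolverine." />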
In the absence of an ODP description and a Meta Description tag, Google will use a portion of the page's text as the description. This text will contain the closest matches to the search query. I have not seen any official limit on how long this can be, but a couple of sentences seems about right.
On a related note, if you don't want a snippet to be shown with a particular page you can use the following Meta tag to prevent one from being shown:
<meta name="robots" content="nosnippet">
See this blog post for Google's tips on using the meta description tag.
According to this site, "The meta description should typically be at most 145 to 150 characters in length as these are the maximum number of characters typically displayed at Yahoo! and Google, respectively."

That site is Flash-based, and Google can index Flash content, so given that the snippet isn't in the HTML source of the page as you point out, nor is it in the cached version of the page, I'm guessing that it's somewhere in the Flash movie.
It's kind of arbitrary that the snippet mentions 'The women's 1000 Mile Collection' while the site link itself is to the parent category of 1000 mile, not just women's, so I'm guessing here that gathering snippet-friendly metadata from a Flash site is an imprecise science. That's my best guess.
In this Google Webmaster blog post, they explain how they use external text or HTML files loaded into the Flash movie, and in one of the comments Jonathan Simon says:
"We try our best to crawl Flash content but the results can sometimes be less than ideal. You are only seeing a title in the search results for your site because that's the only bit of HTML text that you have outside of your Flash content. You could add a Meta description element to offer more information in HTML. You could also add some other text that's not a part of your Flash content. Just doing this should improve the snippet you see associated with your site in the search results."

Related

How to remove duplicate pages from search engine

I have duplicate content on my home page.
In Google Webmaster Tools they tell me that I have a problem with duplicate content:
For example:
www.example.com/page1/
www.example.com/page2/
www.example.com/page2/
How can I remove it?
What that page says is not that you have duplicate pages but that you have several pages with the same meta description:
Meta descriptions are HTML attributes that provide concise summaries of webpages. They commonly appear underneath the blue clickable links in a search engine results page.
Usually each page should have its own meta description that describes its content (that is the reason Google warns you about duplicates), but sometimes it's OK for several pages to share the same description.
For example, based on your screenshot your site appears to be about mobile phones. Let's say that the duplicate with 2 pages consists of one page for the summary of a phone and another for its technical specifications (I'm guessing, as I don't understand Arabic). The meta descriptions of both pages could be similar but not identical, because each should reflect that the pages cover different aspects of the phone (summary vs. technical specs).
On the other hand, the duplicate with 14 pages appears to be several pages from a product list, perhaps phones with the same tag. If that is correct, then it's OK for all those pages to have the same meta description, as they are just parts of the same topic split across several pages.
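As a hypothetical illustration (URLs and wording are invented), the two phone pages would get distinct but related descriptions, while the paginated list could reuse one:

<!-- /phones/model-x/summary/ -->
<meta name="description" content="Model X overview: design, camera and battery life at a glance." />
<!-- /phones/model-x/specs/ -->
<meta name="description" content="Model X full technical specifications: display, chipset, memory and sensors." />
<!-- /phones/tag/android/page/1 through /page/14 can share one description -->
<meta name="description" content="All Android phones in our catalogue, with prices and reviews." />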

Googlebot is not reading dynamic content

The website is fully dynamic.
Meta tags, Open Graph tags, and content are created dynamically on the pages.
I might be doing something wrong. Please guide me on getting approved for the Google AdSense program.
Google AdSense gave the reason "insufficient content" for this.
I think the only real answer is to implement some kind of partial caching. If the needed content is not in the source code of your pages, it won't be indexed.
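A rough sketch of the distinction (hypothetical page, invented names): the crawler only indexes the markup the server actually returns, not what JavaScript adds afterwards.

<!-- What the crawler receives when everything is injected client-side: -->
<head><title>Loading...</title></head>
<body><div id="app"></div><script src="app.js"></script></body>

<!-- What it needs to receive: the same markup pre-rendered or cached on the server: -->
<head>
  <title>Example Product</title>
  <meta name="description" content="Example Product description text." />
</head>
<body><div id="app"><h1>Example Product</h1><p>Real content here...</p></div></body>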
What exactly do you mean by "fully dynamic" and what parts do you want to be indexed?

Rich Snippet not showing in Google Search result page

About a month ago we implemented Rich Snippets on the product detail pages for our e-commerce site (example).
We used the http://schema.org/ syntax for the structured data, as it seems to be the route Google is taking going forward.
The data appears to be correct in the Rich Snippet Testing Tool and the data has started to appear in Google Webmaster Tools.
However, the data is still nowhere to be seen on the SERP.
We have followed the rich data guide on Google to the letter and still no results. Is this a case of just waiting?
Here is an additional piece of information that makes it all the more puzzling: we initially went with a Microformats implementation, and within 24 hours the data started showing up on the SERP. However, we moved away from this because the schema.org approach seemed a better bet.
I suppose it is one of the reasons explained in my Wiki post at
http://wiki.goodrelations-vocabulary.org/FFAQ#Why_is_Google_not_showing_rich_snippets_for_my_pages.3F
While that one refers to GoodRelations markup, the situation should be the same for schema.org.
Martin
Quote:
If you have added GoodRelations (manually or via a shop extension module) to your shop and still do not get rich snippets in Google search results, this can have one of the following reasons:
Google has not yet re-crawled your page or pages. Google dedicates just a limited amount of crawling time to a site, depending on its global relevance. It may be that Google has simply not yet re-indexed your page. Wait 2 - 8 weeks ;-)
The markup is invalid. Try the Google validator. If that shows a rich snippet in the preview, you may just have to wait 4 - 12 weeks until Google notices and white-lists your pages. If it does not show a rich snippet, you either do not have valid GoodRelations markup in the page, you are missing properties that Google requires (e.g. gr:validThrough for prices), the price of the item has expired, or you use markup for which Google does not show rich snippets. Currently, Google shows snippets only for products and offers.
Google cannot see that your page changed. Your XML sitemap (http://example.com/sitemap.xml or similar) does not contain a lastmod attribute or the lastmod attribute was not updated after you added GoodRelations/schema.org. This attribute is important for crawlers to notice which pages need to be reindexed.
Low ranking of your item pages. Your item pages have a low ranking and what you see in your Google results are category pages or other pages summarizing multiple items. GoodRelations shop extensions add markup only to the "deep" item pages, because those are best for rich snippets. Use the title / product name of one of your products and restrict the Google search to your site with the additional statement site:www.example.com.
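Since the question uses schema.org, here is a minimal product/offer example in microdata for comparison (all names, URLs and prices are invented); if the Rich Snippet Testing Tool renders a preview for markup like this, the remaining factors are the crawling and ranking points above.

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example Widget</span>
  <img itemprop="image" src="http://www.example.com/img/widget.jpg" alt="Example Widget" />
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">19.99</span>
    <meta itemprop="priceCurrency" content="USD" />
  </div>
</div>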

How can I create a website summary with Perl?

When you share something on Facebook or Digg, it generates some summary of the page. How would I do this in Perl? What algorithms are there?
For example:
If I go to Facebook and try to share this question as a link:
How can I create a website summary with Perl?
It retrieves "Facebook/Digg get website summary? - Stack Overflow" as the title (which is just the title of the page) and ...
CPAN is your friend.
Some promising-looking modules:
HTML::Summary
HTML::SummaryBasic
Lingua::EN::Summarize
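As a sketch of how the first of these might be wired up (based on HTML::Summary's documented interface; the URL is a placeholder and this is untested):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);   # fetch the page
use HTML::TreeBuilder;     # parse the HTML into a tree
use HTML::Summary;         # generate a short summary from the tree

my $url  = 'http://www.example.com/';   # placeholder URL
my $html = get($url) or die "Could not fetch $url\n";

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

my $summarizer = HTML::Summary->new(LENGTH => 200);   # maximum summary length
print $summarizer->generate($tree), "\n";

$tree->delete;   # free the parse tree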
Assuming you mean sharing a link...
Usually the summary is written by the user submitting the URL. If you have to write a summary automagically, this can be achieved by:
Using the first 100 or so characters of the document body (in itself not easy)
Using metadata like the description or keywords (often empty or spammed)
Context-relevant summaries, like recreating Google snippets (sorry, it's PHP, but simple)
Tags/keywords from the document using something like the Yahoo Keyword Extractor API or your own keyword density function
Your best bet is to ask the user!
Hope that helps somewhat :)
Basically you want to scrape the URL and find the "most significant paragraph" which might be the first <div> or <p> element after the first <h2> or <h1>, depending on the layout of the page.
You could check and see if there is a meta description on the page, but that leaves you at the mercy of whoever wrote the meta description.
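Combining both ideas, a rough sketch in Perl (the 80-character threshold and the fallback order are arbitrary choices, not a standard recipe):

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $html = do { local $/; <STDIN> };   # read the already-fetched HTML
my $tree = HTML::TreeBuilder->new_from_content($html);

# Prefer the meta description when the author provided one.
my $meta    = $tree->look_down(_tag => 'meta', name => 'description');
my $summary = $meta ? $meta->attr('content') : undef;

# Otherwise fall back to the first reasonably long paragraph.
unless (defined $summary) {
    my ($p) = grep { length($_->as_trimmed_text) > 80 }   # arbitrary threshold
              $tree->look_down(_tag => 'p');
    $summary = $p ? $p->as_trimmed_text : '';
}

print "$summary\n";
$tree->delete;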

Facebook Post Link Image

When someone posts a link on Facebook, a script usually scans that link for any images and displays a quick thumbnail next to the post. For certain URLs though (including mine), FB doesn't seem to pick up anything, despite there being a number of images on the page.
I read that FB prefers the "image_src" rel tag for the image the user wishes to specify, but this does not generate that thumbnail for my site either.
My URL resolves directly via DNS and is not forwarded, so I don't imagine that could be the problem either.
Does anyone have an idea as to why FB can't generate any thumbnails from my site?
The easiest way is just a link tag:
<link rel="image_src" href="http://stackoverflow.com/images/logo.gif" />
But there are some other things you can add to your site to make it more social-media friendly:
Open Graph Tags
Open Graph tags are tags that you add to the <head> of your website to describe the entity your page represents, whether it is a band, restaurant, blog, or something else.
An Open Graph tag looks like this:
<meta property="og:tag name" content="tag value"/>
If you use Open Graph tags, the following six are required:
og:title - The title of the entity.
og:type - The type of entity. You must select a type from the list of Open Graph types.
og:image - The URL to an image that represents the entity. Images must be at least 50 pixels by 50 pixels. Square images work best, but you are allowed to use images up to three times as wide as they are tall.
og:url - The canonical, permanent URL of the page representing the entity. When you use Open Graph tags, the Like button posts a link to the og:url instead of the URL in the Like button code.
og:site_name - A human-readable name for your site, e.g., "IMDb".
fb:admins or fb:app_id - A comma-separated list of either the Facebook IDs of page administrators or a Facebook Platform application ID. At a minimum, include only your own Facebook ID.
More information on Open Graph tags and details on Administering your page can be found on the Open Graph protocol documentation.
http://developers.facebook.com/docs/reference/plugins/like
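Putting the six required tags together, a hypothetical <head> could look like this (all values are placeholders):

<head>
  <meta property="og:title" content="Example Page" />
  <meta property="og:type" content="website" />
  <meta property="og:image" content="http://www.example.com/img/logo.png" />
  <meta property="og:url" content="http://www.example.com/" />
  <meta property="og:site_name" content="Example Site" />
  <meta property="fb:admins" content="YOUR_FACEBOOK_ID" />
</head>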
I know this question is old, but I recently dealt with the exact same problem and went round and round on it for a couple weeks. Multiple searches on Google turned up a lot of useful information, but most of it was focused on Open Graph tags, which I wasn't interested in using. Turns out my site had multiple issues, but here are some of the basics.
As EightyEight said, make sure your HTML is valid - and the same goes for your JavaScript and server-side code (PHP, ASP, etc.). I had a small PHP error in a piece of code that was executing as a separate call to the server from the main page. Due to a number of bizarre coincidences, that code was generating a 500 error - but ONLY for IE6 and strict parsing engines like the W3C validator and the Facebook page crawler. The problem didn't appear in modern browsers (Chrome 4, FF 3.5, IE 8, etc.), so I didn't see it right away, but older/stricter clients were showing the 500 every time, and that was the main reason FB wasn't crawling our page (when everything else seemed to be correct).
Regarding Randy's response, he's correct that Facebook will keep an old cached copy of your page long after you've updated it. FB claims it's only held for 24 hours, but I experienced much longer times than that. FORTUNATELY, FB has released their "URL Linter" tool that will show you a preview of how your page will appear when being shared on FB, and it will force FB to instantly update its cache of your page. This was a lifesaving tool. You can find it at http://developers.facebook.com/tools/lint/
Regarding the URL Linter tool, be aware that each variation of a URL is cached separately on Facebook, so "www.example.com" is not the same as "example.com". Also, unique capitalization is stored as well, so "ExampleOne.com" is not the same as "exampleone.com". (This led to a lot of confusion between my client and myself when it appeared to me that the cache had been updated just fine and the client claimed they weren't seeing the updates. Turns out I was looking at exampleone.com and had used Linter to update the cache, but they were looking at exampleOne.com which I hadn't submitted to Linter. As a result, I ended up submitting quite a few variations of the URL to Linter just to cover the bases.)
WyrdNEXUS's advice to use the image_src link tag is spot-on. This allows you to be sure that FB is scraping the best possible image for your page. There are some varying guidelines out there about what specs the image file should have, but I've successfully used a 128px square image and have seen a 130x97 image make it through as well. Here is Facebook's official documentation from http://developers.facebook.com/docs/reference/plugins/like/:
Images must be at least 50 pixels by 50 pixels. Square images work best, but you are allowed to use images up to three times as wide as they are tall.
Obviously, FB will resize a large image for you, but you'll almost always get better results if you resize it yourself beforehand.
Regarding Mike Cooper's link to the eHow article, avoid using step #1 in that article. It was valid advice when the article was written and when Mike posted the link, but it's now better to use the URL Linter tool for previewing how your page will appear when being shared. By using Linter, you won't cause FB to cache a (potentially) bad copy of the page before you get a chance to tweak it.
Use the Facebook linter available here: http://developers.facebook.com/tools/lint/
This will check your link and re-fetch any images. It also clears any old cache.
Or try this - https://developers.facebook.com/tools/debug
To change the title, description, and image, we need to add some meta tags under the head tag.
STEP 1:
Add the meta tags under the head tag:
<html>
<head>
<meta property="og:url" content="http://www.test.com/" />
<meta property="og:image" content="http://www.test.com/img/fb-logo.png" />
<meta property="og:title" content="Prepaid Phone Cards, low rates for International calls with Lucky Prepay" />
<meta property="og:description" content="Cheap prepaid Phone Cards. Low rates for international calls anywhere in the world." />
<!-- og:type is also required, per the Open Graph documentation quoted above -->
<meta property="og:type" content="website" />
</head>
NEXT STEP:
Click on the link below:
https://developers.facebook.com/tools/debug
Add your URL (e.g. http://www.test.com/), the one where you added the tags, in the text box. Click on the DEBUG button.
It's done.
You can verify it here: https://www.facebook.com/sharer/sharer.php?u=http://www.test.com/
In the above URL, u = your website link.
Enjoy!
try this: http://www.ehow.com/how_4938148_thumbnail-show-up-facebook-share.html
Is the site's HTML valid? Run it through the W3C validation service.
Actually, if you've already tried linking that page on Facebook BEFORE adding the "image_src" link, Facebook will keep using the old cached copy and won't even see your changes. Try modifying the URL by removing or adding the 'www', or duplicate your page to test it.
I've noticed that Facebook does not take thumbnails from websites if they start with https. Is that maybe your case?
I had the same problem and figured out that my closing head tag was in the wrong place.
Old question, but recently I seemed to be running into the same issue with thumbnail images from my links not showing in status updates on Facebook. I post for many clients, and this is relatively new.
FB doesn't seem to like long URLs anymore; if you use a URL shortener such as goo.gl or bitly.com, the thumbnail from your link/post will appear in your FB update.
Try using something like this:
<link rel="image_src" href="http://yoursite.com/graphics/yourimage.jpg" />
It seems to work just fine in Firefox as long as you use a full path to your image.
Trouble is, it gets vertically offset downward for some reason. The image is 200 x 200, as recommended somewhere I read.
If you use an SEO plugin, check the plugin's settings first. Look for its noindex options: if media attachments are set to noindex, disable that setting.