How should I connect Day CQ5 to the GSA? I would like to know if there is any way to integrate the GSA with Day CQ5 (CMS). Your valuable thoughts are invited.
Would appreciate a quick turnaround.
The most flexible way is to make an HTTP request to the GSA and parse the XML response into a result view.
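To make that concrete, here is a rough Java sketch that queries the appliance's XML interface and walks the results into something you could feed a result view. The host, collection, and front-end values are placeholders, and the element names (R, T, U) come from the GSA search protocol, so verify them against the protocol reference for your appliance version:

    import java.net.URL;
    import java.net.URLEncoder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class GsaSearchClient {

        public static void main(String[] args) throws Exception {
            // Placeholder host/collection/front-end -- replace with your appliance's values.
            String query = URLEncoder.encode("cq5", "UTF-8");
            String searchUrl = "http://gsa.example.com/search?q=" + query
                    + "&site=default_collection&client=default_frontend"
                    + "&output=xml_no_dtd&num=10";

            // Fetch and parse the XML response.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new URL(searchUrl).openStream());

            // Each <R> element is one result; <T> is the title, <U> the URL.
            NodeList results = doc.getElementsByTagName("R");
            for (int i = 0; i < results.getLength(); i++) {
                Element r = (Element) results.item(i);
                System.out.println(text(r, "T") + " -> " + text(r, "U"));
            }
        }

        private static String text(Element parent, String tag) {
            NodeList nodes = parent.getElementsByTagName(tag);
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : "";
        }
    }

From there it is a matter of mapping the parsed fields into whatever result component your front end uses.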
As you know, the GSA is a crawling search engine appliance. The lowest-tech way to go is to publish a dynamic sitemap that displays a link to every page on your site (that you want to index). Make the sitemap available on your CQ5 publisher/dispatcher. Then, enter the publisher/dispatcher's sitemap URI into the GSA's start URLs configuration. In version 6.10 of the GSA software, this is under "Crawl and Index - Crawl URLs - Start Crawling from the Following URLs".
Now tell the GSA that the sitemap changes frequently. Enter the same sitemap URI in "Crawl and Index - Freshness Tuning - Crawl Frequently".
In this way, any page you activate from the CQ5 author will go to the publisher, show up in the sitemap, and be automatically crawled by the GSA. It will take the GSA up to an hour to crawl the newly activated content, assuming a page population of less than 15,000.
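If you generate that dynamic sitemap yourself, a plain Sling servlet on the publish instance is usually enough. Below is a minimal sketch; the servlet path /bin/gsa/sitemap and the content root /content/mysite are placeholders, and error handling is omitted:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Iterator;

    import javax.servlet.ServletException;

    import org.apache.felix.scr.annotations.sling.SlingServlet;
    import org.apache.sling.api.SlingHttpServletRequest;
    import org.apache.sling.api.SlingHttpServletResponse;
    import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

    import com.day.cq.wcm.api.Page;
    import com.day.cq.wcm.api.PageFilter;
    import com.day.cq.wcm.api.PageManager;

    /**
     * Tiny sitemap servlet: renders one <a> per page below /content/mysite
     * so the GSA can discover every activated page from a single start URL.
     */
    @SlingServlet(paths = "/bin/gsa/sitemap", methods = "GET")
    public class GsaSitemapServlet extends SlingSafeMethodsServlet {

        @Override
        protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
                throws ServletException, IOException {
            response.setContentType("text/html");
            response.setCharacterEncoding("UTF-8");

            PageManager pageManager = request.getResourceResolver().adaptTo(PageManager.class);
            Page root = pageManager.getPage("/content/mysite");

            PrintWriter out = response.getWriter();
            out.println("<html><body>");
            if (root != null) {
                writeLinks(root, out);
            }
            out.println("</body></html>");
        }

        private void writeLinks(Page page, PrintWriter out) {
            // .html links so the GSA crawls the rendered pages via the publisher/dispatcher.
            String title = page.getTitle() != null ? page.getTitle() : page.getName();
            out.println("<a href=\"" + page.getPath() + ".html\">" + title + "</a><br/>");
            Iterator<Page> children = page.listChildren(new PageFilter());
            while (children.hasNext()) {
                writeLinks(children.next(), out);
            }
        }
    }

The resulting URL on the publisher/dispatcher is then what you register as the GSA start URL and freshness-tuning URL described above.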
WordPress newbie here. I am trying to optimize my website and GTmetrix is showing me a very low score on "minimize redirects". It says:
Remove the following redirect chain if possible:
https://match.prod.bidr.io/cookie-sync/pm&gdpr=0&gdpr_consent=
https://cm.g.doubleclick.net/pixel?google_nid=beeswaxio&google_sc=&google_hm=QUFtNGYwNjl6T2NBQUFfa1pwY09UUQ&bee_sync_partners=sas%2Csyn%2Cpp%2Cpm&bee_sync_current_partner=adx&bee_sync_initiator=pm&bee_sync_hop_count=1
https://match.prod.bidr.io/cookie-sync/adx?bee_sync_partners=sas%2Csyn%2Cpp%2Cpm&bee_sync_current_partner=adx&bee_sync_initiator=pm&bee_sync_hop_count=1
https://rtb-csync.smartadserver.com/redir?partnerid=127&partneruserid=AAm4f069zOcAAA_kZpcOTQ&redirurl=https%3A%2F%2Fmatch.prod.bidr.io%2Fcookie-sync%3Fbee_sync_partners%3Dsyn%252Cpp%252Cpm%26bee_sync_current_partner%3Dsas%26bee_sync_initiator%3Dadx%26bee_sync_hop_count%3D2%26userid%3DSMART_USER_ID
https://match.prod.bidr.io/cookie-sync?bee_sync_partners=syn%2Cpp%2Cpm&bee_sync_current_partner=sas&bee_sync_initiator=adx&bee_sync_hop_count=2&userid=5474368427267615867
https://sync.technoratimedia.com/services?srv=cs&pid=73&uid=AAm4f069zOcAAA_kZpcOTQ&cb=https%3A%2F%2Fmatch.prod.bidr.io%2Fcookie-sync%3Fuserid%3D5474368427267615867%26bee_sync_partners%3Dpp%252Cpm%26bee_sync_current_partner%3Dsyn%26bee_sync_initiator%3Dadx%26bee_sync_hop_count%3D3
https://match.prod.bidr.io/cookie-sync?userid=5474368427267615867&bee_sync_partners=pp,pm&bee_sync_current_partner=syn&bee_sync_initiator=adx&bee_sync_hop_count=3
https://image2.pubmatic.com/AdServer/Pug?vcode=bz0yJnR5cGU9MSZjb2RlPTMyOTcmdGw9MTI5NjAw&piggybackCookie=AAm4f069zOcAAA_kZpcOTQ
... and there are at least 50 more like the above.
For the life of me, I am unable to trace the source of these links and how and where they are being generated. I have searched the database and viewed the page source of my website, but I am unable to find any of these links.
I have tried deactivating the plugins, but they still show.
Kindly let me know how I can kill these requests.
I implemented a Content Security Policy to solve this issue and improve the overall optimization of my website.
I used the Content Security Policy Generator extension, available in the Chrome Web Store, to generate the Content Security Policy.
I then installed the HTTP Security Options plugin on my WordPress site and applied the generated Content Security Policy in its CSP options.
I tweaked the Content Security Policy by replacing each subdomain with *. to cover all subdomains. I also added a slash at the end of each domain, which seemed to help, e.g. *.domain.com/. I now have no redirect URLs on GTmetrix and have a score of 100%.
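For anyone trying the same thing, a policy of this shape ends up looking roughly like the header below. The whitelisted hosts are only examples; yours will be whatever domains the generator extension records on your own site (the header is wrapped here for readability but is sent as one line):

    Content-Security-Policy: default-src 'self';
        script-src 'self' 'unsafe-inline' *.googleapis.com/;
        style-src 'self' 'unsafe-inline' fonts.googleapis.com/;
        font-src 'self' fonts.gstatic.com/;
        img-src 'self' data: *.gravatar.com/

Any host not on the list (the cookie-sync chain above, for instance) is simply blocked by the browser, which is what removes those redirects from the GTmetrix report.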
I am working on the link checker and want to know when AEM saves URLs in /var/linkchecker and on what basis.
Does it save a link when I open it, or does it poll, traversing the complete content and putting it in /var/linkchecker?
Which Java class helps to store valid or invalid links in its storage directory?
The LinkChecker is based on an event handler for /content (and child) nodes that fires on creates and updates. All content is parsed and links are validated against the allowed protocols and (configurable) external site links.
External Links
All the validation is done asynchronously in the background and the HTML is updated based on verification results.
/var/linkchecker is the cache for external links. The results are based on simple GET requests to the external links in order to optimize the process; an HTTP 200/30x response means that the link is valid. AEM looks at this cache before requesting a validation of the external link in order to optimize page processing. This also means that link validation is NOT real time, and the delay is proportional to the load on your server.
All the links that have been checked can be seen via the /etc/linkchecker.html screen, where you can request revalidation and refresh the status of the links.
You can configure the frequency of this background check via the Day CQ Link Checker Service configuration under /system/console/configMgr. The default interval is 5 seconds (scheduler.period parameter).
Under the config manager /system/console/configMgr you will find a lot of other Day CQ Link * configurations that control this feature.
For example, Day CQ Link Checker Transformer contains config for all the elements that need to be transformed by the link checker.
Similarly, Day CQ Link Checker Info Storage Service configures the link cache.
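If you want to ship such a setting with your code instead of editing it in the Felix console, you can drop a sling:OsgiConfig node into your project's config folder. A minimal sketch, assuming the Day CQ Link Checker Service PID is com.day.cq.rewriter.linkchecker.impl.LinkCheckerImpl (check the PID shown in your own configMgr, since it can differ between versions):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- e.g. /apps/myproject/config/com.day.cq.rewriter.linkchecker.impl.LinkCheckerImpl.xml -->
    <jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0"
              xmlns:jcr="http://www.jcp.org/jcr/1.0"
              jcr:primaryType="sling:OsgiConfig"
              scheduler.period="{Long}5"/>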
Internal Links
Internal links are ignored unless they use an FQDN and therefore look like external URLs (which is not normally the case on author). The only exception is a multi-tenant environment where a page from one site links to another site and all the mapping information is stored in Sling mappings.
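For completeness, those Sling mappings live under /etc/map; a cross-site mapping node looks roughly like the sketch below (the host name and target path are made up):

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- e.g. node /etc/map/http/www.other-tenant.com.80 -->
    <jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0"
              xmlns:jcr="http://www.jcp.org/jcr/1.0"
              jcr:primaryType="sling:Mapping"
              sling:internalRedirect="/content/other-tenant"/>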
I have coded my ASP.NET MVC application in a way that allows stored entities to be retrieved via a friendly name in the URL, for example:
www.mysite.com/artists/james-brown/songs
Where james-brown is a URL-friendly string stored on my Artist entity.
Now imagine I add an artist that no one has heard of before, and no one ever navigated to that artist's songs page.
How would Google, Yahoo, or other search engines know that my site does indeed have songs for that unknown artist?
Do I create a sitemap and maintain it through code as I add / remove artists?
There are a few well-known ways to make new links visible to the search engine world.
XML and HTML Sitemap:
Add the new URLs to your XML sitemap and submit it through webmaster tools.
HTML sitemaps are another way to achieve this: if your site has a footer sitemap, you can add the new links there.
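To answer the "maintain it through code" part: yes, the usual approach is to generate the XML sitemap dynamically from the same artist data, so a new artist shows up as soon as it is saved. A minimal entry for the page from the question (the date and change frequency are placeholders) looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.mysite.com/artists/james-brown/songs</loc>
        <lastmod>2013-01-15</lastmod>
        <changefreq>weekly</changefreq>
      </url>
    </urlset>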
Internal Links
Create internal links from your high-ranking or highly crawled pages to the new pages. Google and other search engines tend to crawl pages whose content changes frequently, so if you have pages with regularly refreshed content, try adding links to the new pages there; the chances are high that they will be discovered quickly.
External Links
Create links from external blogs, company blogs, and sites like pagetube.org, which can help the new pages get discovered.
Yeah, just add them to a sitemap, internal links, or even external links.
My blog was successfully transferred to Octopress and GitHub Pages. My problem, though, is that the website's search uses Google search, and the search results, as you can see, point to the old (WordPress) links. These links have now changed structure, following the default Octopress structure.
I don't understand why this is happening. Is it possible that Google has stored the old links in its database (my blog was on the first page for some searches, but gathered just 3,000 hits/month... not much by internet standards) and this will change with time, or is it something I can change somehow?
Thanks.
1. You can wait for Google to crawl and re-index your pages, or you can use the URL Removal Request tool to expedite removal of old pages from the index:
http://www.google.com/support/webmasters/bin/answer.py?answer=61062
According to that page, the removal process "usually takes 3-5 business days."
Consider submitting a Sitemap:
http://www.google.com/support/webmasters/bin/answer.py?answer=40318
You can resubmit your sitemap at any time through Webmaster Tools.
More information about Sitemaps:
http://www.google.com/support/webmasters/bin/answer.py?answer=34575
http://www.google.com/support/webmasters/bin/topic.py?topic=8467
http://www.google.com/support/webmasters/bin/topic.py?topic=8477
https://www.google.com/webmasters/tools/docs/en/protocol.html
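Octopress 2.x normally ships with a sitemap generator plugin, so the generated site should already contain a sitemap.xml at its root; if so, you only need to point crawlers at it, either by submitting it in Webmaster Tools or via a robots.txt line like the one below (the host is a placeholder for your GitHub Pages domain):

    # robots.txt at the root of the published site
    Sitemap: http://yourblog.github.io/sitemap.xml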
2. Perhaps your company might consider the Google Mini? You could set up the Mini to crawl the site every night or even 'continuously'.
http://www.google.com/enterprise/mini/
According to the US pricing page, the Mini currently starts at $1995 for a 50,000-document license with a year of support.
Here is the Google Mini discussion group:
http://groups.google.com/group/Google-Mini
http://www.google.com/enterprise/hosted_vs_appliance.html (click "show all descriptions")
http://www.google.com/support/mini/ (Google Mini detailed FAQ)
I am working on an Apache Nutch modification project. We have already swapped Nutch's original module for ours, built using HtmlUnit. I need to download a whole Facebook user page (e.g. http://www.facebook.com/profile.php?id=100002517096832), which is going to be parsed using our own parser. Unfortunately, Facebook uses a mechanism called BigPipe (http://www.facebook.com/note.php?note_id=389414033919). That's why most of the current page is hidden inside <!-- --> comment tags.
Usually, when we scroll down a Facebook page, new content is unpacked every time we are about to hit the bottom of the page. I tried using JavaScript that scrolls my htmlPage (an HtmlPage object from the HtmlUnit project), but I finally realized that scrolling does not trigger loading new content on a Facebook user page.
How can I check what event on the page triggers loading the content of the current Facebook page? Maybe I should approach the problem from a different side, for example by trying to extract the BigPipe "things" on my own? Have you ever done that?
Before dealing with your question… what kind of project are you trying to build there?
Since Apache Nutch is open-source web-search software, I think you are trying to build some kind of search engine that scrapes Facebook user profiles/feeds to get data and make it searchable on some third-party website?
Well, that would be a violation of the Facebook Platform Policies:
I. Features and Functionality
12. You must not include data obtained from us in any search engine or directory without our written permission.
So, do you have that written permission?