Nutch - How to crawl only URLs newly added in the last 24 hours using Nutch? - plugins

I'm using Nutch 1.7 and everything seems to be working just fine. However, there is one big issue I don't know how to overcome.
How can I crawl ONLY URLs newly added in the last 24 hours? Of course we could use adaptive fetching, but we hope there is a better way that we are not aware of yet.
We only need the URLs that were added in the last 24 hours, as we visit our source websites every day.
Please let me know if Nutch can be configured and set up to do that, or if there is an existing plugin for crawling only URLs added in the last 24 hours.
Kind regards,
Christian

You obtain your new URLs by parsing HTML, and there is no way to tell how old a link is just from its <a> tag.
You have to keep a list of previously seen URLs in your database so you can skip them; whatever is left over is what was added since your last visit.
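There is no switch in Nutch that exposes "added in the last 24 hours" directly, so the usual trick is exactly what is described above: compare today's parsed outlinks with the URLs you already stored and keep only the difference. A minimal sketch of that idea, written in TypeScript with a JSON file standing in for the database (the file name and helper are assumptions, not anything Nutch ships with):

import { existsSync, readFileSync, writeFileSync } from "fs";

// Hypothetical sketch (not a Nutch plugin): diff today's parsed outlinks against the
// URLs recorded on earlier runs, so only links added since the last crawl are kept.
// A JSON file stands in for whatever database you actually use.
const SEEN_FILE = "seen-urls.json";

function selectNewUrls(parsedOutlinks: string[]): string[] {
  const seen = new Set<string>(
    existsSync(SEEN_FILE) ? (JSON.parse(readFileSync(SEEN_FILE, "utf8")) as string[]) : []
  );
  const fresh = parsedOutlinks.filter((url) => !seen.has(url)); // never seen before = "new"
  fresh.forEach((url) => seen.add(url));
  writeFileSync(SEEN_FILE, JSON.stringify([...seen], null, 2)); // remember them for tomorrow
  return fresh; // feed only these into the next crawl cycle
}

console.log(selectNewUrls(["https://example.com/new-article"])); // example call

Whether you keep the "seen" list in a file, a table, or Nutch's own crawldb is up to you; the important part is that "new in the last 24 hours" is defined by your own history, not by anything in the HTML.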

Related

Disable session timeout in just one jsp in Liferay

I have set the session timeout in my Liferay 6.x to ten minutes and it works great, but now I need to override it with a larger value on just one page of the site, as it is a pretty long read and my customers can't finish it.
Is there any magic JavaScript, or do I maybe need to move that JSP to a separate portlet by itself, or what?
EDIT: There's an AUI().ready in a main.js, maybe there?
You can call Liferay.Session.extend() from JavaScript in a loop, say every 9 minutes, until the user leaves that page.
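A minimal sketch of that, assuming the page is served by Liferay so the global Liferay JavaScript object is available (written as TypeScript; the declare line only tells the compiler about that global):

// Assumption: this script is included on the one long-read page only.
declare const Liferay: { Session: { extend: () => void } };

const NINE_MINUTES = 9 * 60 * 1000; // just under the 10-minute timeout

// Keep the session alive while this page is open.
const keepAlive = window.setInterval(() => Liferay.Session.extend(), NINE_MINUTES);

// Stop extending when the reader leaves, so the normal timeout applies everywhere else.
window.addEventListener("beforeunload", () => window.clearInterval(keepAlive));

If it helps, this could live inside the AUI().ready callback in the main.js you mention, so it only starts once the page has loaded.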

TYPO3 hack - Viagra and other stuff

I am currently on the latest TYPO3 6.2 version ... 6.2.31 ... I know ... I'm working on an upgrade.
But now I have a Google hack which replaces my links in Google with Viagra stuff. I had this several weeks ago and thought I had fixed it with the update from 6.2.9 to 6.2.31.
There is unknown code in the core. Does anybody know this and can help me fix the hole?
last time it was here:
/data/www/domain/public/typo3/typo3/sysext/cms/tslib/index_ts.php
Thanks to all.
Please follow the TYPO3 Security Guide, which means that if your website is hacked you must take it offline, check the site, find the security issue and only then bring it back online.
If your website is hacked, not only your server and data are at risk but also every user who visits your website. Especially if users trust you and your knowledge, you should take the issue seriously.
Most of the time I have seen this issue, one of the following problems was the cause:
Hacked FTP account
Security issues in custom or 3rd party extensions.
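If you need a starting point for the "check the site" step, a rough heuristic sketch is below (TypeScript/Node rather than anything TYPO3-specific; the root path mirrors the one in the question, and the patterns are just common signs of injected PHP, so treat any hit as a lead, not proof):

import { readdirSync, readFileSync, statSync } from "fs";
import { join } from "path";

// Heuristic scan: list PHP files changed in the last N days or containing patterns
// that commonly show up in injected code. Adjust ROOT and DAYS for your setup.
const ROOT = "/data/www/domain/public/typo3"; // assumption: same layout as in the question
const DAYS = 30;
const SUSPICIOUS = [/eval\s*\(\s*base64_decode/i, /gzinflate\s*\(\s*base64_decode/i];

function scan(dir: string): void {
  for (const name of readdirSync(dir)) {
    const path = join(dir, name);
    const info = statSync(path);
    if (info.isDirectory()) { scan(path); continue; }
    if (!name.endsWith(".php")) continue;
    const recentlyChanged = Date.now() - info.mtimeMs < DAYS * 24 * 60 * 60 * 1000;
    const body = readFileSync(path, "utf8");
    if (recentlyChanged || SUSPICIOUS.some((re) => re.test(body))) {
      console.log(`${recentlyChanged ? "[recent] " : ""}${path}`);
    }
  }
}

scan(ROOT);

Comparing the flagged files against a clean download of the same TYPO3 version is the quickest way to confirm whether core files were really modified.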

Dynamic Sitemap Generation

I'm currently using a CMS solution that does not generate any type of Sitemap for use with Google/Bing/Yahoo!/etc. I've requested it for 4 years now but they do not show any interest in adding it any time soon.
With that being said, I'm trying to find a way to create a sitemap for 1) all of our pages (over 5,000); 2) all of our images; and 3) all of our documents.
Can anyone help me with this? I know my way around PHP and would like to code this up that way, but I don't know where to start with crawling my site to generate the links needed. I tried https://github.com/jdevalk/XML-Sitemap-PHP-Script but had no luck, as it only returned the 5 pages in the root and none of the child pages inside folders like it was supposed to. It also showed our last modification date as 1970, which is incorrect.
Have you tried the Bing XML Sitemap Plugin?
The Bing XML Sitemap Plugin is an open source server-side technology that takes care of generating XML Sitemaps compliant with sitemaps.org for websites running on Internet Information Services (IIS) for Windows® Server as well as Apache HTTP Server.
Bing XML Sitemap Plugin
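If you would rather roll your own instead, the core of it is just a breadth-first crawl of your own domain followed by writing a sitemaps.org-compliant XML file. A hedged sketch of that idea (TypeScript/Node with the built-in fetch, rather than the PHP you mention; the start URL is a placeholder):

import { writeFileSync } from "fs";

// Minimal same-domain crawler that writes a sitemaps.org-style sitemap.xml.
// START_URL is a placeholder; collecting images and documents would be an extension of the same loop.
const START_URL = "https://www.example.com/";

async function crawl(start: string): Promise<string[]> {
  const origin = new URL(start).origin;
  const queue: string[] = [start];
  const seen = new Set<string>([start]);

  while (queue.length > 0) {
    const page = queue.shift()!;
    const res = await fetch(page);
    if (!res.ok || !(res.headers.get("content-type") ?? "").includes("text/html")) continue;
    const html = await res.text();
    // Pull href values out of anchor tags and keep only same-origin pages we haven't queued yet.
    for (const match of html.matchAll(/<a[^>]+href=["']([^"'#]+)["']/gi)) {
      try {
        const link = new URL(match[1], page).toString().split("?")[0];
        if (link.startsWith(origin) && !seen.has(link)) {
          seen.add(link);
          queue.push(link);
        }
      } catch { /* ignore unparsable hrefs such as javascript: links */ }
    }
  }
  return [...seen];
}

crawl(START_URL).then((urls) => {
  // lastmod is a placeholder here because the CMS does not expose real modification dates.
  const entries = urls
    .map((u) => `  <url><loc>${u}</loc><lastmod>${new Date().toISOString().slice(0, 10)}</lastmod></url>`)
    .join("\n");
  writeFileSync(
    "sitemap.xml",
    `<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n${entries}\n</urlset>\n`
  );
});

The same loop can be extended to collect image sources and document links (PDFs and so on) into separate sitemap files for the other two requirements.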

Wrong Link on Facebook and wrong open graph but good meta property

My website www.jeancharlesbarthelet.com has good meta tags (I checked).
When I try to share it on Facebook, there is no image and no description.
I have tried the debug tool a hundred times over the past month!
Nothing works.
I am very worried because lots of people are sharing my news at the moment.
Thanks for help.
You should try to show some code examples.
However, with that said, it may be a caching issue with your images and their server.
I did see that the path to the image (http://www.jeancharlesbarthelet.com/autre/logo.jpg) does serve up the image correctly. Perhaps revisit this tomorrow?
When did you upload it to your server? If your hosting company uses Akamai for caching, it could take up to 6 hours to be seen globally.
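One more thing you can do yourself is fetch the page and confirm which Open Graph tags are actually in the HTML being served, since that is what the scraper sees regardless of browser caching. A rough sketch (TypeScript with the global fetch; the simple regex assumes property comes before content, which is the usual order):

// Report which Open Graph tags are present in the served HTML for the page from the question.
const PAGE = "http://www.jeancharlesbarthelet.com/";

async function checkOpenGraph(url: string): Promise<void> {
  const html = await (await fetch(url)).text();
  for (const tag of ["og:title", "og:description", "og:image", "og:url"]) {
    const match = html.match(new RegExp(`property=["']${tag}["'][^>]*content=["']([^"']+)["']`, "i"));
    console.log(`${tag}: ${match ? match[1] : "MISSING"}`);
  }
}

checkOpenGraph(PAGE);

If og:image or og:description is missing from the served HTML, the debugger has nothing to pick up no matter how often you re-scrape.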

Facebook struggles to scrape one domain

I have already checked out this question, and it sounds like he's describing the same exact problem as me except for a few things:
I'm not running on https
80% of the time I try to debug, I get this message: "Error parsing input URL, no data was scraped."
The scraper works perfectly on a different domain on the same server, with the same theme and almost identical content. Every time I try that domain it scrapes perfectly, including the image.
During the 20% of the time that it actually scrapes my page, I have the same issue as in the linked question: it reads my thumbnail, yet shows a blank image. The link takes me to a working image, but the preview doesn't want to show anything.
The weird part is it worked completely fine about 10 months ago when I updated this blog on a daily basis. The only difference is that I've switched servers recently. While that could be an explanation, the other domain moved as well and doesn't have this problem.
I am at a loss as to why my links either show no image at all on Facebook or give me this:
Domain Link
Domain
(no image, no description)
Very frustrating situation. Does anyone have any suggestions?
Update:
I have 6 domains...
When I moved servers recently, I found the new server wasn't set up to handle the compressed pages, so my blog posts looked broken. This forced me to turn compression off in WP Super Cache on my main blog. I also did it on my second-highest-traffic blog, figuring I'd get to the other 4 later.
Well, now those first two blogs appear to work fine in the Facebook debugger, but the remaining 4 have trouble. The tricky part is, I completely removed WP Super Cache from one site and still had trouble fetching the data.
So while it seems like it should logically have been a WP Super Cache issue, the fact that the errors continue even after removing it makes me doubt that now. I'm still baffled.
Update:
OK, I loaded Chrome and IE, and both were able to pull the data with ease. The Google snippet tool also worked great. I am going to try posting a link to my Facebook fan page via Chrome and see if it works correctly.
I did clear my Firefox cache and it didn't change anything, but I am still confused as to why one domain works OK while the other does not. Either way, if adding the link in Chrome works, I'll stick with that for now.
Any other suggestions?
Caching should not be the problem. If a browser can see your page, so can the Facebook debugger.
Check whether a 500 error is being returned somewhere. Try a different browser, clear the browser cache, and so on. Try the Google rich snippet tool and see whether a custom search engine scrapes it fine.
PS: It would be nicer if you posted the URL.
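It is also worth requesting the page yourself with Facebook's crawler user agent and comparing the response with what a normal browser request gets; caching and compression plugins sometimes serve the crawler something different. A rough sketch (TypeScript; the URL is a placeholder since the post does not include one):

// Compare what a browser-like request and a Facebook-crawler-like request receive.
// facebookexternalhit/1.1 is the user agent Facebook's scraper identifies itself with.
const URL_TO_TEST = "https://example.com/some-post/"; // placeholder

async function fetchAs(userAgent: string): Promise<void> {
  const res = await fetch(URL_TO_TEST, { headers: { "User-Agent": userAgent } });
  const body = await res.text();
  const hasImage = body.includes('property="og:image"');
  console.log(`${userAgent} -> status ${res.status}, ${body.length} bytes, og:image ${hasImage ? "present" : "missing"}`);
}

async function main(): Promise<void> {
  await fetchAs("Mozilla/5.0 (ordinary browser)");
  await fetchAs("facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)");
}

main();

If the crawler-style request gets an error status, an empty body, or HTML without the og: tags while the browser-style request looks fine, that points straight at the server or caching layer rather than at Facebook.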