wget recursive fails on wiki pages - wget

I'm trying to recursively fetch all pages linked from a Moin wiki page. I've tried many different wget recursive options, which all have the same result: only the html file from the given URL gets downloaded, not any of the pages linked from that html page.
If I use the --convert-links option, wget correctly translates the unfetched links to the right web links. It just doesn't recursively download those linked pages.
wget --verbose -r https://wiki.gnome.org/Outreachy
--2017-03-02 10:34:03-- https://wiki.gnome.org/Outreachy
Resolving wiki.gnome.org (wiki.gnome.org)... 209.132.180.180, 209.132.180.168
Connecting to wiki.gnome.org (wiki.gnome.org)|209.132.180.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘wiki.gnome.org/Outreachy’
wiki.gnome.org/Outreachy [ <=> ] 52.80K 170KB/s in 0.3s
2017-03-02 10:34:05 (170 KB/s) - ‘wiki.gnome.org/Outreachy’ saved [54064]
FINISHED --2017-03-02 10:34:05--
Total wall clock time: 1.4s
Downloaded: 1 files, 53K in 0.3s (170 KB/s)
I'm not sure if it's failing because the wiki's html links don't end with .html. I've tried using various combinations of --accept='[a-zA-Z0-9]+', --page-requisites, and --accept-regex='[a-zA-Z0-9]+' to work around that, with no luck.
I'm not sure if it's failing because the wiki has html pages like https://wiki.gnome.org/Outreachy that link to page URLs like https://wiki.gnome.org/Outreachy/Admin and https://wiki.gnome.org/Outreachy/Admin/GettingStarted. Maybe wget is confused because there needs to be an HTML page and a directory with the same name? I also tried using --nd, but no luck.
The linked html pages are all relative to the base wiki URL (e.g. the Outreachy history page). I've also tried adding --base="https://wiki.gnome.org/", with no luck.
At this point, I've tried a whole lot of different wget options, read several Stack Overflow and unix.stackexchange.com questions, and nothing I've tried has worked. I'm hoping there's a wget expert who can look at this particular wiki page and figure out why wget is failing to recursively fetch linked pages. The same options work fine on other domains.
I've also tried httrack, with the same result. I'm running Linux, so please don't suggest Windows or proprietary tools.

This seems to be caused by the following tag in the wiki:
<meta name="robots" content="index,nofollow">
If you are sure you want to ignore the tag, you can make wget ignore it using -e robots=off:
wget -e robots=off --verbose -r https://wiki.gnome.org/Outreachy
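If the goal is also to browse the mirror offline, a hedged sketch that combines the robots override with the link-conversion options already mentioned in the question (-k is --convert-links, -E is --adjust-extension):
wget -e robots=off -r -k -E https://wiki.gnome.org/Outreachy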

Related

TYPO3 website migration issue - the TYPO3 backend works but the frontend doesn't work

I am trying to migrate a TYPO3 website from one web host to another. The site is using TYPO3 version 6.2.10.
I am following the steps provided here - https://blog.scwebs.in/how-to-transfer-typo3-site-to-a-new-host/
I can log into the backend of the site, which is here: http://79.170.40.34/historylearning.com/typo3/ and I can see the list of all pages under the Page section.
But the front end is broken: http://79.170.40.34/historylearning.com/index.php
Can you please suggest a solution?
Additional Note -
The linked guide suggests uploading these folders and files: /fileadmin, /t3lib, /typo3, /typo3conf, /typo3temp, /uploads, .htaccess, index.php. But I don't have a /t3lib folder inside the site content.
The above link also suggests:
When the files are all uploaded, you will need to change the
permissions recursively for /fileadmin, /typo3conf, /typo3temp,
/uploads, and index.php to chmod 777.
But if I set the permissions to 777, I cannot access the site at all, so I have left them at the default of 755.
In step 20, the same link suggests:
Click on “Templavoila” and then “Update mapping”
I cannot find that option anywhere.
During installation, I see this error. I don't quite understand what it means.
Is that the reason? If so, how do I resolve this issue?
I do not have any previous experience with TYPO3. Please suggest a solution. My server environment is Linux.
Since your referenced page tells you to copy t3lib, it shows that the guide is very old: that folder was removed a long time ago.
Your TYPO3 6.2 installation is also very old and should not be in production any more. It could, however, serve as a base for an update to TYPO3 10 LTS (don't use 9 LTS, as its support ends in October), although it is a long way.
Your file access rights on the server should allow both you and your web server to access the files. The most commonly used solution: you and the web-server user (something like apache, www, wwwrun, ...) share a common group, and the group owner of everything is set to that common group.
Then the access mask is set to 775 for folders (better: 2775, so the group is inherited) and 664 for files.
chown youraccount:www -R *
find . -type d -exec chmod 2775 "{}" \;
find . -type f -exec chmod g+w "{}" \;
If you have copied the files with your own account and use only 755 rights for folders, TYPO3 cannot work correctly, and the result is an incomplete website.
TemplaVoilà is an extension that was not used in every installation. Be happy if your installation has no TemplaVoilà.
TYPO3 6.2.x is not supported anymore; you should not use this version in production because it could contain vulnerabilities that threaten the security of your host system.
When I load your website, I see lots of missing resources in the network tab of the developer tools (hit F12), e.g.
GET http://79.170.40.34/typo3temp/stylesheet_5a17574694.css
[HTTP/1.1 404 Not Found 31ms]
GET http://79.170.40.34/fileadmin/template/2.8/bootstrap/css/bootstrap.min.css
[HTTP/1.1 404 Not Found 80ms]
GET http://79.170.40.34/fileadmin/template/2.8/bootstrap/css/bootstrap-responsive.min.css
[HTTP/1.1 404 Not Found 84ms]
GET http://79.170.40.34/fileadmin/template/2.8/docs61.css
[HTTP/1.1 404 Not Found 79ms]
GET http://79.170.40.34/fileadmin/template/2.8/History-Learning-Logo.png
[HTTP/1.1 404 Not Found 148ms]
GET http://79.170.40.34/fileadmin/historyLearningSite/roman_9.jpg
[HTTP/1.1 404 Not Found 116ms]
GET http://79.170.40.34/fileadmin/historyLearningSite/will1.6.jpg
[HTTP/1.1 404 Not Found 80ms]
GET http://79.170.40.34/uploads/pics/H8ii-front.jpg
[HTTP/1.1 404 Not Found 80ms]
GET http://79.170.40.34/fileadmin/historyLearningSite/gunpow1.gif
[HTTP/1.1 404 Not Found 79ms]
GET http://79.170.40.34/fileadmin/historyLearningSite/domest1.gif
[HTTP/1.1 404 Not Found 111ms]
GET http://79.170.40.34/fileadmin/historyLearningSite/infant2.jpg
[HTTP/1.1 404 Not Found 75ms]
GET http://79.170.40.34/fileadmin/historyLearningSite/derby_2.jpg
[HTTP/1.1 404 Not Found 73ms]
Make sure to copy all files from the fileadmin folder of your old host to the new one and check if things get better.
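A hedged sketch of one way to do that copy, assuming SSH access to the old host; the user, host name, and document-root paths are placeholders to be replaced with the real ones:
rsync -avz olduser@old-host.example.com:/var/www/historylearning.com/fileadmin/ /var/www/historylearning.com/fileadmin/
rsync -avz olduser@old-host.example.com:/var/www/historylearning.com/uploads/ /var/www/historylearning.com/uploads/
The trailing slash on each source path matters to rsync: it copies the directory's contents instead of nesting the directory one level deeper.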

wget - selective recursive download + page-requisites?

I'm trying to scrape a forum site, to build a read-only archive.
I understand how to use -A and -R to limit the pages I retrieve, but is there a way to also retrieve page requisites (e.g., icons and such)?
Thanks!
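For reference, a hedged sketch of the kind of invocation the question describes (forum.example.com and the accept patterns are placeholders, not from the original post). --page-requisites (-p) is wget's option for fetching icons, stylesheets and the like, but note that -A/-R filtering may apply to the requisites too, so their extensions may need to appear in the accept list:
wget -r -l 3 -p -k -E -A '*viewtopic*,*.css,*.js,*.png,*.gif,*.jpg' https://forum.example.com/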

wget mirror rss website

I want to use wget to mirror an RSS site, following all links 3 levels deep:
wget -m -k -l 3 http://www.cnn.com/services/rss/
However, in the output I only see index.html and terms.html. Is there anything special about this RSS site, or can I not mirror it?
Thanks in advance.
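One hedged guess, not verified against this site: RSS feed entries often point at other hostnames or subdomains, and recursive wget does not follow links to other hosts by default. If that is the case here, --span-hosts together with a --domains whitelist might help:
wget -m -k -l 3 --span-hosts --domains=cnn.com http://www.cnn.com/services/rss/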

Are you able to create clean URLs with Wget?

I'm attempting to create a mirror of a WordPress site with clean URLs (i.e. http://example.org/foo, not http://example.org/foo.php). When Wget mirrors the site, it gives all pages and links a ".html" extension (i.e. http://example.org/foo.html).
Is it possible to set options for Wget to create a clean URL structure, so that the mirrored file corresponding to the page "http://example.org/foo" would be "/foo/index.html" and the link to that page would be "http://example.org/foo"? If so, how?
If I understand your question correctly, you're asking for what is the default behaviour of Wget.
Wget will only add the extension to the local copy if the --adjust-extension option has been passed to it. Quoting the man page for Wget:
--adjust-extension
If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://example.com/article.cgi?25 will be saved as article.cgi?25.html.
However, what you seem to be asking for, that Wget save example.org/foo as /foo/index.html, is actually the default behaviour. If you're seeing some other output, you should post the complete output of Wget with the --debug switch.
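For reference, a hedged sketch showing the flag in question (example.org is a placeholder): the first command leaves filenames exactly as they appear in the URLs, the second appends .html as described in the man page excerpt above.
wget --mirror --convert-links http://example.org/
wget --mirror --convert-links --adjust-extension http://example.org/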

How to skip selected URLs while mirroring a site with wget

I have the following problem: I need to mirror a password-protected site. Sounds like a simple task:
wget -m -k -K -E --cookies=on --keep-session-cookies --load-cookies=myCookies.txt http://mysite.com
In myCookies.txt I keep the proper session cookie. This works until wget comes across the logout page; then the session is invalidated and, effectively, further mirroring is useless.
I tried to add the --reject option, but it works only with file types: I can block html or swf downloads, but I can't say
--reject http://mysite.com/*.php?type=Logout*
Any ideas how to skip certain URLs with wget? Maybe there is another tool that can do the job (it must work on MS Windows).
What if you first download (or even just touch) the logout page, and then run
wget --no-clobber --your-original-arguments
This should skip the logout page, as it has already been downloaded.
(Disclaimer: I didn't try this myself.)
I have also encountered this problem and later solved it like this: --reject-regex logout. More: wget-devTips
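Applied to the command above, a hedged sketch; it assumes a wget new enough to support --reject-regex and drops the --cookies=on flag, since cookies are enabled by default in current wget:
wget -m -k -K -E --keep-session-cookies --load-cookies=myCookies.txt --reject-regex 'type=Logout' http://mysite.com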