I have a MoinMoin site which I've inherited from a previous system
administrator. I'd like to shut it down but keep a static copy of the
content as an archive, ideally with the same URLs. At the moment I'm
trying to accomplish this using wget with the following parameters:
--mirror
--convert-links
--page-requisites
--no-parent
-w 1
-e robots=off
--user-agent="Mozilla/5.0"
-4
This seems to work for getting the HTML and CSS, but it fails to
download any of the attachments. Is there an argument I can add to wget
which will get round this problem?
Alternatively, is there a way I can tell MoinMoin to link directly to
files in the HTML it produces? If I could do that then I think wget
would "just work" and download all the attachments. I'm not bothered
about the attachment URLs changing as they won't have been linked to
directly in other places (e.g. email archives).
The site is running MoinMoin 1.9.x.
My version of wget:
$ wget --version
GNU Wget 1.16.1 built on linux-gnu.
+digest +https +ipv6 +iri +large-file +nls +ntlm +opie -psl +ssl/openssl
The solution in the end was to use MoinMoin's export dump functionality:
https://moinmo.in/FeatureRequests/MoinExportDump
It doesn't preserve the file paths in the way that wget does, but has the major advantage of including all the files and the attachments.
Related
I'm using wget with the --page-requisites option, and I'd like to combine it with --directory-prefix. For example, I'd like wget --page-requisites --directory-prefix=/tmp/1 https://google.com to download the Google page into the /tmp/1/ directory without creating its own folder (like google.com).
I'd expect the google homepage to end up at /tmp/1/index.html
Is there a way to do this without creating some kind of script that would move the files where I want them to be?
Ok using option --no-directories seems to do the trick.
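For reference, the combination could be wrapped up like this. It is only a minimal sketch: the function name fetch_flat is made up for illustration, and it uses nothing beyond the three wget options already named in this thread.

```shell
#!/bin/sh
# fetch_flat is a hypothetical helper name, not part of wget.
# It saves a page and its requisites directly into DEST_DIR, without
# creating a host-named subdirectory such as google.com/.
fetch_flat() {
    dest_dir=$1
    url=$2
    wget --page-requisites --no-directories \
         --directory-prefix="$dest_dir" "$url"
}

# Example call (commented out so that sourcing this file downloads nothing):
# fetch_flat /tmp/1 https://google.com
```

Since everything lands in one flat directory, two requisites with the same file name from different paths cannot both keep that name, so check the result if the page pulls in many files.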
wget recurses to the second-bottom level and goes no further. If I specify the bottom-level HTML file as the source, it parses it and goes further. I think this may be caused by the PDF files linked from the HTML document being under a different root path on the server. I need it to retrieve all the PDF files at the leaves of this hierarchy, since I am going to promote them together as part of a campaign for depression awareness.
I am using GNU Wget 1.19.4 built on linux-gnu.
I have tried --exclude, --exclude-directory, -l2, -l10, --continue and many other switches. I need to use the --include options, or wget grabs the entire site. If I use -np, it won't go "up" into /docs.
This code gets me the HTML files but does not follow links in the "bottom most"
HTML files.
wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
This code, when I manually specify the HTML file, gets the PDF files I want in it.
wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
I want it to visit all the HTML files in this branch, get out all the PDF links in them, and retrieve all the PDF files from /docs
https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
Here is one of the PDFs. The /docs directory does not have a listing.
https://www.beyondblue.org.au/docs/default-source/research-project-files/online-forums-2015-report.pdf?sfvrsn=3d00adea_2
The best I can get wget to do is walk the site and get HTML files down to this level:
https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
https://www.beyondblue.org.au/about-us/research-projects/research-projects/networks-of-advocacy-and-influence-peer-mentors-in-beyond-blue-s-mental-health-forums
...
150 of them
It seems like a depth-limiting setting or a path traversal limitation or something. I suspect it's an easy one to spot.
Thanks again!
Alright, it looks like wget might be breadth-first. This means it gets everything in the directory before recursing into the pages. I'm not sure of this, but I let the command below run, and it seemed to fetch all the leaf HTML files first and only recurse into them once it had all of them.
wget -r --verbose --include /docs/default-source/research-project-files/,/about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
Certainly, stopping it when it seemed to halt at the bottom HTML layer without getting the PDFs was stopping it too early.
I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.
I am able to download all of the files I need using wget -r -np -nd -e robots=off -l 0 URL but this leaves me with about 60,000 extra files to waste time both downloading and deleting.
I am really only looking for files with the extensions "tbt" and "zip". When I add in -A tbt,zip to the input, wget then only downloads a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the file type specified, and then the process stops entirely, with wget announcing that it is finished. It does not attempt to download any of the other files that it grabs when the -A flag is not included.
What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?
Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a link refers to a file or a directory based on whether or not it ends with a /.
For example, say I have a directory named files, and a web page that has:
<a href="/files">Lots o' files!</a>
If I were to request this with wget -r, then wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.
However, if I add -A zip to my command line, and run wget with --debug, I see:
appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.
In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.
If I modify the remote file so that it looks like...
<a href="/files/">Lots o' files!</a>
...then wget will follow the link and download files as desired.
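The behaviour described above can be modelled in a few lines of shell. This is only a rough approximation based on the debug output shown; the function name would_enqueue and its simplified rules are assumptions made for illustration, not wget's actual code.

```shell
#!/bin/sh
# would_enqueue URL SUFFIX
# Rough model (an assumption, not wget's real implementation):
#   - a URL ending in "/" has no file name, so -A is not applied to it
#   - otherwise the last path component must end in ".SUFFIX"
would_enqueue() {
    url=$1
    suffix=$2
    case $url in
        */) return 0 ;;              # looks like a directory: followed
    esac
    file=${url##*/}                  # last path component, e.g. "files"
    case $file in
        *."$suffix") return 0 ;;     # matches the accept list
        *) return 1 ;;               # treated as a non-matching file
    esac
}

for u in http://localhost:8080/files \
         http://localhost:8080/files/ \
         http://localhost:8080/files/a.zip; do
    if would_enqueue "$u" zip; then
        echo "follow: $u"
    else
        echo "reject: $u"
    fi
done
```

With -A zip, the first URL is rejected and the other two are followed, which matches what --debug reports.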
I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.
It's also possible you're experiencing a different issue; adding --debug to your command line should clarify things in that case.
I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match for both the linked file, and the downloaded file. So my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This catches all the .ashx that are linked in the webpage, then when wget does the second check, all the mp3 files that were downloaded will stay.
As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the file that gets downloaded from filedownload.ashx?name=file.mp3 to just file.mp3.
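Put together, the command might look like the sketch below. The wrapper name fetch_mp3_links and the example URL are placeholders, and -r is an assumption on my part (the accept list only matters for recursive retrieval); the other two options are the ones described above.

```shell
#!/bin/sh
# fetch_mp3_links is a hypothetical wrapper name, not part of wget.
# It accepts both the suffix seen in the page's links (ashx) and the
# suffix of the files as they end up on disk (mp3).
fetch_mp3_links() {
    page_url=$1
    wget -r -A 'ashx,mp3' --trust-server-names "$page_url"
}

# Example call (commented out; the URL is a placeholder):
# fetch_mp3_links http://example.com/downloads/
```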
I am using Pentaho CE 5 on Windows. I would like to use CTools, but I can't get them to show up in the File -> New menu.
Being behind a proxy, I can not use the Marketplace plugin, so I have tried a manual installation.
First, I tried to use the ctools-installer.sh. I have run the following command line in cygwin (wget and unzip are installed):
./ctools-installer.sh -s /cygdrive/d/Users/[user]/Mes\ Programmes/pentaho/biserver-ce/pentaho-solutions/ -w /cygdrive/d/Users/[user]/Mes\ programmes/pentaho/biserver-ce/tomcat/webapps/pentaho/
The script starts, asks me what module I want to install, and begins the downloads.
For each module, I get an output like (set -x added to the script) :
echo -n 'Downloading CDF...'
Downloading CDF...+ wget -q --no-check-certificate 'http://ci.analytical-labs.com/job/Webdetails-CDF-5-Release/lastSuccessfulBuild/artifact/bi-platform-v2-plugin/dist/zip/dist.zip' -O .tmp/cdf/dist.zip
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
'[' '!' -z '' ']'
rm -f .tmp/dist/marketplace.xml
unzip -o .tmp/cdf/dist.zip -d .tmp
End-of-central-directory signature not found. Either this file is not a zipfile, or it
constitutes one disk of a multi-part archive. In the latter case
the central directory and zipfile comment will be found on the last
disk(s) of this archive.
unzip: cannot find zipfile directory in .tmp/cdf/dist.zip,
and cannot find .tmp/cdf/dist.zip.zip, period.
chmod -R u+rwx .tmp
echo Done
Done
Then the script ends. I have seen on this page (pentaho-bi-suite) that it is the normal output. Nevertheless, it seems a bit strange to me and when I start my pentaho server (login: admin/password), I cannot see any new tools in the menus.
After looking at a few other tutorials and the script itself, I downloaded the .zip snapshots for every tool and unzipped them in the system directory of my Pentaho server. Same result.
I would like to make the .sh script work. What can I try or adjust?
Thanks
EDIT 05/06/2014
I checked the dist.zip files downloaded by the script and they are all empty. It seems that wget cannot fetch the zip files, and therefore the installation fails.
When I try to get any webpage through wget, it fails. I think it is because of the proxy.
Here is my .wgetrc file, located in my user's cygwin home folder:
use_proxy=on
http_proxy=http://[url]:[port]
https_proxy=http://[url]:[port]
proxy_user=[user]
proxy_password=[password]
How could I make this work?
EDIT 10/06/2014
In the end, I changed my network connection settings to bypass the proxy. It seems that there is an offline mode for the installer, so one can download all the needed files in a proxy-free environment and then run the script offline.
I guess this is related to the -r option.
I consider this post solved, since it is not a CTools issue anymore.
It is difficult to identify the issue in the above procedure,
but you can refer to this blog; its author is a key member of Pentaho itself.
You can manually install the components from http://www.webdetails.pt/ctools/ or, if you have Pentaho 5.1 or above, you can add the following parameters to the CATALINA_OPTS option (in start-pentaho.bat or start-pentaho.sh):
-Dhttp.proxyHost= -Dhttp.proxyPort= -Dhttp.nonProxyHosts="localhost|127.0.0.1|10...*"
http://docs.treasuredata.com/articles/pentaho-dataintegration#tips-how-can-i-use-pentaho-through-a-proxy
I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
HOWEVER, in certain situations the url will have a query string and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command will result in:
index.html?page=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#page=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence on taking a different approach. I found out I can take the first filename that wget saves by parsing the output. So the name that appears after Saving to: is the one I need.
However, the name is wrapped in this strange character â. Rather than just stripping it with hardcoded logic, I'd like to know: where does it come from?
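On the strange character: the â is most likely the first byte of the UTF-8 curly quotes (‘ and ’) that wget prints around file names in UTF-8 locales, shown by something that decodes the output as Latin-1. A minimal sketch of parsing the name without hardcoding one quote style follows; the function name first_saved_name is made up for illustration.

```shell
#!/bin/sh
# first_saved_name reads wget's log on stdin and prints the first file name
# that follows "Saving to:", with the surrounding quote characters removed.
# It strips curly quotes, ASCII apostrophes and a leading backtick.
first_saved_name() {
    LC_ALL=C sed -n "s/^Saving to: //p" |
        head -n 1 |
        LC_ALL=C sed -E -e "s/^(‘|'|\`)//" -e "s/(’|')\$//"
}

# Example call (commented out; wget writes its log to stderr):
# wget -p -k 'http://www.onlinetechvision.com/?p=566' 2>&1 | first_saved_name
```

Running wget with LC_ALL=C should also make it fall back to plain ASCII quotes, which avoids the odd bytes in the first place.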
If you try the --adjust-extension parameter:
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you come closer. In the www.onlinetechvision.com folder there will be a file with a corrected extension: index.html#p=566.html, or index.html?p=566.html on *nix systems. It is then simple to rename that file to index.html, even with a script.
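The rename step could be scripted along these lines. This is a sketch, not a tested recipe for every site: the function name rename_query_index is invented here, and it assumes the download directory holds exactly one file whose name starts with index.html followed by a ?, @ or # separator.

```shell
#!/bin/sh
# rename_query_index DIR
# Renames the single file in DIR named index.html<sep>... (where <sep> is
# ?, @ or #) to plain index.html. Refuses to act if there is no such file
# or more than one, so nothing is overwritten by accident.
rename_query_index() {
    dir=$1
    set -- "$dir"/index.html[?@#]*
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one match in $dir" >&2
        return 1
    fi
    mv -- "$1" "$dir/index.html"
}

# Example call (commented out):
# rename_query_index www.onlinetechvision.com
```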
If you are on a Microsoft OS, make sure you have a recent version of wget. It is also available here: https://eternallybored.org/misc/wget/
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
My solution is to do recursive crawling outside wget:
get directory structure with wget (no file)
loop to get main entry file (index.html) from each dir
This works well with WordPress sites. It could miss some pages, though.
#!/bin/bash
# Replace these placeholders with the real host name and domain.
site="<site>"
domain="<domain>"
#
# get directory structure
#
wget --spider -r --no-parent "http://${site}/"
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while IFS= read -r line; do
  wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domains="${domain}" --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull in the content from another page, for example with a script on the server side (it may be client side; look in the JavaScript).
Have you tried using --no-cookies? The site could be storing this information in a cookie and pulling it when you hit the page. This could also be caused by URL rewrite logic, over which you will have little control from the client side.
Use the -O or --output-document option. See http://www.electrictoolbox.com/wget-save-different-filename/
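A minimal sketch of that approach, assuming only a single page is wanted. The wrapper name fetch_as is made up for illustration. It leaves out -p on purpose, because -O writes everything that is downloaded into the one named file.

```shell
#!/bin/sh
# fetch_as is a hypothetical helper name, not part of wget.
# It downloads one URL and stores it under the given file name,
# regardless of any query string in the URL.
fetch_as() {
    out_file=$1
    url=$2
    wget --output-document="$out_file" "$url"
}

# Example call (commented out so that sourcing this file downloads nothing):
# fetch_as index.html 'http://www.onlinetechvision.com/?p=566'
```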