I'm trying to crawl all the links in a sitemap.xml to re-cache a website. But the recursive option of wget does not work; the only response I get is:
Remote file exists but does not contain any link -- not retrieving.
Yet the sitemap.xml is definitely full of "http://..." links.
I have tried almost every wget option, but nothing has worked for me:
wget -r --mirror http://mysite.com/sitemap.xml
Does anyone know how to fetch all the links inside a website's sitemap.xml?
Thanks,
Dominic
It seems that wget can't parse XML. So, you'll have to extract the links manually. You could do something like this:
wget --quiet http://www.mysite.com/sitemap.xml --output-document - | egrep -o "https?://[^<]+" | wget -i -
I learned this trick here.
Although this question is older, Google sent me here.
I finally used xsltproc to parse the sitemap.xml:
sitemap-txt.xsl:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:sitemap="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" version="1.0" encoding="UTF-8" indent="no"/>
  <xsl:template match="/">
    <xsl:for-each select="sitemap:urlset/sitemap:url">
      <xsl:value-of select="sitemap:loc"/><xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
Using it (in this case it comes from a cache-prewarming script, so the retrieved pages are not kept ("-o /dev/null") and only some statistics are printed ("-w ....")):
curl -sS http://example.com/sitemap.xml | xsltproc sitemap-txt.xsl - | xargs -n1 -r -P4 curl -sS -o /dev/null -w "%{http_code}\t%{time_total}\t%{url_effective}\n"
(Rewriting this to use wget instead of curl is left as an exercise for the reader ;-) )
What this does is:
Retrieve the sitemap.xml
Parse the sitemap and output the URL list as plain text (one URL per line)
Use xargs to call curl on each URL, running 4 requests in parallel
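For anyone who does want the wget version, here is a rough, untested sketch of the same pipeline (note that wget has no equivalent of curl's -w format string, so the per-URL statistics are lost and the pages are simply fetched and discarded):
wget -q -O - http://example.com/sitemap.xml | xsltproc sitemap-txt.xsl - | xargs -n1 -r -P4 wget -q -O /dev/null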
You can use one of the sitemapping tools. Try Slickplan. It has a site crawler option, and by using it you can import the structure of an existing website and create a visual sitemap from it. Then you can export it to Slickplan XML format, which contains not only links but also SEO metadata, page titles (product names), and a bunch of other helpful data.
Related
I'm trying to download a file using:
wget https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
I'm expecting to get the .txt file; however, I get the page's HTML instead.
I tried wget --max-redirect=2 --trust-server-names <url> based on the suggestions here and wget -m <url> which downloads the entire website, and a few other variations that also don't work.
wget https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
This points wget to an HTML page even though the URL has a .txt suffix. After visiting that page I found there is a link to the text file itself under "raw", which you should be able to use with wget in the following way:
wget https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt
If you need to reveal the true type of a file without downloading it, you might use the --spider option. In this case,
wget --spider https://huggingface.co/distilbert-base-uncased/blob/main/vocab.txt
gives output containing
Length: 7889527 (7,5M) [text/html]
and
wget --spider https://huggingface.co/distilbert-base-uncased/raw/main/vocab.txt
gives output containing
Length: 231508 (226K) [text/plain]
I have a download link to a large file.
You need to be logged in to the site, so a cookie is used.
The download link redirects to another URL.
I'm able to download the file with wget, but I only want the "real" direct download link as output.
wget prints exactly this before starting the download:
Location: https://foo.com/bar.zip [following]
Is there a way to make wget stop there and not actually download the file?
The solutions I found recommend redirecting to /dev/null, but that would still download the file. What I want is wget to follow the redirects but not actually start the download.
I couldn't find a way to do it with wget, but I found a way to do it with curl:
curl https://openlibrary.org/data/ol_dump_latest.txt.gz -s -L -I -o /dev/null -w '%{url_effective}'
This only requests the page's headers (-I sends a HEAD request) and discards them to /dev/null, so the file itself is never downloaded.
(src: https://stackoverflow.com/a/5300429/2317712 )
Going off of qqilihq's comment on the curl answer, this first pulls out the line starting with "Location:" and then uses awk to print just the URL, dropping the leading "Location:" and the trailing "[following]". I'm not sure I would use this, since it looks like a small change in wget's output could break it; I would use the curl answer myself.
wget --max-redirect=0 http://example.com/link-to-get-redirect-url-from 2>&1 | awk '/Location: /,// { print }' | awk '{print $2}'
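Another rough, untested option is to run wget in --spider mode, which follows the redirects and issues requests but never saves a body, and then pull the last "Location:" line out of its log. This assumes spider mode prints the same "Location: ... [following]" lines as a normal download, and you would still pass your cookie options (for example --load-cookies; cookies.txt and the URL below are placeholders) just as for the real download:
wget --spider --max-redirect=20 --load-cookies cookies.txt https://example.com/download-link 2>&1 | awk '/Location: .*\[following\]/ { loc = $2 } END { if (loc) print loc }'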
I've been trying to use wget to download all midi files from a website (http://cyberhymnal.org/) using:
wget64 -r -l1 H -t1 -nd -N -np -A.mid -erobots=off http://cyberhymnal.org/
I got the syntax from various sites which all suggest the same thing, but it doesn't download anything. I've tried various variations on the theme, such as different values for '-l' etc.
Does anybody have any suggestions as to what I am doing wrong? Is it the fact that I am using Windows?
Thanks in advance.
I don't know much about all the parameters you are using, like H, -t1, -N etc., though they can be found online. But I also had to download files from a URL matching a wildcard, so here is the command that worked for me:
wget -r -l1 -nH --cut-dirs=100 -np "$url" -P "${newLocalLib/$tokenFind}" -A "com.iontrading.arcreporting.*.jar"
After -P you specify the path where you want to save the files, and after -A you provide the wildcard pattern; in your case that would be "*.mid".
-A means accept: here we list the file patterns to accept from the given URL. Similarly, -R takes a reject list.
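Applied to the original question, an untested adaptation might look like the line below; note the -H here has the leading dash that the command in the question is missing, and whether it actually finds any .mid files depends on what the page currently links to:
wget -r -l1 -H -t1 -nd -N -np -A "*.mid" -e robots=off http://cyberhymnal.org/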
You may have better luck (at least, you'll get more MIDI files) if you try the actual Cyber Hymnal™, which moved over 10 years ago. Its current URL is http://www.hymntime.com/tch/.
I have the following site http://www.asd.com.tr. I want to download all PDF files into one directory. I've tried a couple of commands but am not having much luck.
$ wget --random-wait -r -l inf -nd -A pdf http://www.asd.com.tr/
With this command only four PDF files were downloaded. Check this link; there are several thousand PDFs available:
http://www.asd.com.tr/Default.aspx
For instance, hundreds of files are in the following folder:
http://www.asd.com.tr/Folders/asd/…
But I can't figure out how to access them correctly so that I can see and download them all. There are a number of folders in the subdirectory http://www.asd.com.tr/Folders/, and thousands of PDFs in those folders.
I've also tried to mirror the site using the -m option, but that failed too.
Any more suggestions?
First, verify that the site's TOS permit crawling it. Then one solution is:
mech-dump --links 'http://domain.com' |
grep pdf$ |
sed 's/\s\+/%20/g' |
xargs -I% wget http://domain.com/%
The mech-dump command comes with Perl's WWW::Mechanize module (the libwww-mechanize-perl package on Debian and Debian-like distros).
I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts.
For example, let's look at this page
https://dl.dropbox.com/u/11471672/wget-all-the-things.html
Let's pretend that this is a large site that I would like to mirror in full, including all page requisites, even those that are hotlinked.
wget -e robots=off -r -l inf -pk
^^ gets everything but the hotlinked image
wget -e robots=off -r -l inf -pk -H
^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web
wget -e robots=off -r -l inf -pk -H --ignore-tags=a
^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site.
I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I'd like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I'm outputting.
You can't specify to span hosts for page-reqs only; -H is all or nothing. Since -r and -H will pull down the entire Internet, you'll want to split the crawls that use them. To grab hotlinked page-reqs, you'll have to run wget twice: once to recurse through the site's structure, and once to grab hotlinked reqs. I've had luck with this method:
1) wget -r -l inf [other non-H non-p switches] http://www.example.com
2) build a list of all HTML files in the site structure (find . | grep html) and pipe to file
3) wget -pH [other non-r switches] -i [infile]
Step 1 builds the site's structure on your local machine, and gives you any HTML pages in it. Step 2 gives you a list of the pages, and step 3 wgets all assets used on those pages. This will build a complete mirror on your local machine, so long as the hotlinked assets are still live.
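As a concrete, illustrative sketch (example.com, the robots setting, and the find pattern are assumptions, and it ignores edge cases such as pages saved with query strings in their names):
# 1) recurse through the site itself (no -H, no -p)
wget -e robots=off -r -l inf http://www.example.com/
# 2) turn the saved HTML files back into a URL list
find www.example.com -name '*.html' | sed 's|^|http://|' > pages.txt
# 3) fetch the page requisites for each page, spanning hosts (no -r)
wget -e robots=off -p -H -i pages.txt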
I've managed to do this by using regular expressions. Something like this mirrors http://www.example.com/docs:
wget --mirror --convert-links --adjust-extension \
--page-requisites --span-hosts \
--accept-regex '^http://www\.example\.com/docs|\.(js|css|png|jpeg|jpg|svg)$' \
http://www.example.com/docs
You'll probably have to tune the regexes for each specific site. For example, some sites like to use parameters on CSS files (e.g. style.css?key=value), which this example will exclude.
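If you do want those parameterized files, one possible tweak is to allow an optional query string after the extension, e.g. (untested):
--accept-regex '^http://www\.example\.com/docs|\.(js|css|png|jpeg|jpg|svg)(\?.*)?$'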
The files you want to include from other hosts will probably include at least
Images: png jpg jpeg gif
Fonts: ttf otf woff woff2 eot
Others: js css svg
Anybody know any others?
So the actual regex you want will probably look more like this (as one string with no linebreaks):
^http://www\.example\.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj][Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww][Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$