wget --accept-regex for URL path matching?

Is it possible to wget files that match a specific directory pattern?
For example, I want to download all the .dsc files from this site, http://ftp.us.debian.org/debian/pool/, but only from the subdirectory /libg/.
http://ftp.us.debian.org/debian/pool/main/libg/libgdsii/libgdsii_0.1+ds.1-1.dsc
http://ftp.us.debian.org/debian/pool/non-free/libg/libgeotiff-epsg/libgeotiff-epsg_1.4.0-1.dsc
...
I tried this command, but it is not working as expected.
wget -r -nH -np --accept-regex="\/libg\/.*\.dsc" -R index.htm* http://ftp.us.debian.org/debian/pool/
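One thing worth checking: as far as I can tell, --accept-regex is matched against every complete URL wget considers during the recursion, including the directory listings it needs to recurse through. With the regex above, pool/, pool/main/, and so on do not match, so wget never descends as far as /libg/. Below is a sketch of a regex that also accepts the directory chain leading to /libg/ (untested, default POSIX regex type assumed):
# Accept the listings pool/, pool/<area>/, pool/<area>/libg/ and pool/<area>/libg/<pkg>/
# as well as the .dsc files themselves, so recursion can actually descend into /libg/.
wget -r -np -nH -R "index.html*" \
  --accept-regex='^http://ftp\.us\.debian\.org/debian/pool/([^/]+/(libg/([^/]+/([^/]+\.dsc)?)?)?)?$' \
  http://ftp.us.debian.org/debian/pool/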

Related

Recursive file download with wget not working

From what I can tell by studying the wget manual, the following should work:
wget -r -l1 -np -nd -A.* -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/
However, instead of getting all of the files in the blat folder (without the apparently auto-generated index.html file), I get a 404 Not Found error on this command and on several dozen variations that I've tried.
I can easily download any of the 4 files but trying to do it recursively fails.
Any pointers would be greatly appreciated.
Try replacing -r -l1 with -r -l 1. You need a space between the l and the 1. Also, try adding -k with your options. This will convert the links to point to the corresponding files on your computer.
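For reference, here is the command with the answer's suggestions applied (same URL as in the question; this is only a sketch, not verified against that S3 bucket). I also dropped -A.*, since accepting everything is already wget's default:
wget -r -l 1 -np -nd -R "index.html*" -k http://s3.amazonaws.com/gp.gms/blat/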

downloading using wget for multiple files, with renaming of files

I am aware that you can download from multiple URLs using:
wget "url1" "url2" "url3"
Renaming the output file can be done via:
wget "url1" -O "new_name1"
But when I tried
wget "url1" "url2" "url3" -O "name1" "name2" "name3"
all the files end up using name1.
What is the proper way to do this in a single command?
Yes, something like this: you can add a file name next to each URL in the file, then do:
# read the URL and its target file name from each line of "list"
while read -r url fileName; do
    wget -O "$fileName" "$url"
done < list
where it is assumed you have added a (unique) file name after each URL in the file (separated by a space).
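For example, the list file could look like this (the URLs and names here are made up purely for illustration):
http://example.com/pkg/a.tar.gz name1.tar.gz
http://example.com/pkg/b.tar.gz name2.tar.gz
http://example.com/pkg/c.tar.gz name3.tar.gz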
The -O option allows you to specify the destination file name. But if you're downloading multiple files at once, wget will save all of their content to the file you specify via -O. Note that in either case, the file will be truncated if it already exists. See the man page for more info.
You can exploit this option by telling wget to download the links one-by-one:
while IFS= read -r url; do
    fileName="blah" # Add a rule to define a new name for each file here
    wget -O "$fileName" "$url"
done < list
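As an illustration of that placeholder, one possible rule (purely made up for this sketch) is to number the files and keep the extension from the URL:
i=1
while IFS= read -r url; do
    fileName="file$i.${url##*.}"   # e.g. file1.dsc, file2.gz, ... (extension = everything after the last dot in the URL)
    wget -O "$fileName" "$url"
    i=$((i + 1))
done < list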
Hope you find it useful.

How to rename files downloaded with wget -r

I want to download an entire website using the wget -r command and change the names of the downloaded files.
I have tried with:
wget -r -o doc.txt "http....
hoping that it would automatically create files named in order, like doc1.txt, doc2.txt, and so on, but it actually saves wget's console output to that file.
Is there any way to do this with just one command?
Thanks!
-r tells wget to recursively get resources from a host.
-o file saves log messages to file instead of standard error. I think that is not what you are looking for; I think what you want is -O file.
-O file stores the resource(s) in the given file, instead of creating a file in the current directory with the name of the resource. If used in conjunction with -r, it causes wget to store all resources concatenated to that file.
Since wget -r downloads and stores more than one file, recreating the server file tree on the local system, it makes no sense to give the name of a single file to store everything in.
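To make the difference concrete (example.com is just a placeholder URL):
wget -o wget.log http://example.com/page    # -o: write wget's log messages to wget.log
wget -O page.html http://example.com/page   # -O: write the downloaded content to page.html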
If what you want is to rename all downloaded files to match the pattern docX.txt, you can do it with a different command after wget has finished:
wget -r http....
i=1
while read -r file
do
    mv "$file" "$(dirname "$file")/doc$i.txt"
    i=$((i + 1))
done < <(find . -type f)

Avoid wget appending index.html to links

I am trying to make a static HTML copy of a Wordpress site that I can upload somewhere else, like Github pages.
I use this command:
Option 1:
wget -k -r -l 1000 -p -N -F -nH -P ./website http://example.com/website
It downloads the entire site etc. but my main issue here is that it adds "index.html" to every single link. I understand the need for this to view the site locally, but it is not required on a static website host.
So is there a way to tell wget not to modify all the links and add index.html to them?
For example, on the default Wordpress "Hello world!" post, the link it creates points at that post's index.html.
Option 2:
Use mirroring command with -k convert links:
wget -E -m -p -F -nH -P ./website http://example.com/website
Then it does not append index.html and it retains the domain name.
But then it also crawls up to http://example.com and indexes everything there. I do not want that. I want /website to be the root (because this is a Wordpress multi-site). How do I fix this?
I also want it to rewrite the hostname instead of stripping it or keeping it, so it should go from http://example.com/website/ (Wordpress multi-site) to http://example.org/. Is this possible, or do I need to run sed/awk on all the files after the download?
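One thing that might help with the crawling-up problem (a sketch based on the question's own command, not verified against this setup): adding --no-parent and a trailing slash on the start URL should keep the mirror below /website. The hostname rewrite would still need post-processing, e.g. with sed as in the answer below.
# -np (--no-parent) keeps recursion below the starting directory;
# the trailing slash makes /website/ count as that directory.
wget -E -m -p -np -F -nH -P ./website http://example.com/website/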
I faced a similar problem and solved it by post-processing with sed.
This replaces all occurrences of /index.html' with /'; since a comment above indicated that a redirect occurs anyway if the trailing slash is missing, I added it =)
find ./ -type f -exec sed -i -e "s/\/index\.html'/\/\'/g" {} \;
And this monster replaces all occurrences of "index.html" or 'index.html' (or "index.html' or 'index.html" ..) by ".":
find ./ -type f -exec sed -i -e "s/['\\\"]index\.html['\\\"]/\\\".\\\"/g" {} \;
You can look what sed is doing with your matches e.g. on index.html with this command:
sed -n "s/['\\\"]index\.html['\\\"]/'\/'/p" index.html
Hope you find that useful
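For the hostname rewrite asked about in the question (example.com/website/ to example.org/, both placeholders taken from the question), the same find/sed pattern can be reused. An untested sketch:
# Rewrite absolute links from the old host/path to the new host in every downloaded file.
find ./ -type f -exec sed -i -e "s|http://example\.com/website/|http://example.org/|g" {} \;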

How to download all files from a specific Sourceforge project?

After spending about an hour downloading almost every Msys package from sourceforge I'm wondering whether there is a more clever way to do this. Is it possible to use wget for this purpose?
I've used this script successfully:
https://github.com/SpiritQuaddicted/sourceforge-file-download
For your use run:
./sourceforge-file-download.sh msys
It should download all the pages first then find the actual links in the pages and download the final files.
From the project description:
Allows you to download all of a sourceforge project's files. Downloads to the current directory into a directory named like the project. Pass the project's name as first argument, eg ./sourceforge-file-download.sh inkscape to download all of http://sourceforge.net/projects/inkscape/files/
Just in case the repo ever gets removed I'll post it here since it's short enough:
#!/bin/sh
project=$1
echo "Downloading $project's files"
# download all the pages on which direct download links are
# be nice, sleep a second
wget -w 1 -np -m -A download http://sourceforge.net/projects/$project/files/
# extract those links
grep -Rh direct-download sourceforge.net/ | grep -Eo '".*" ' | sed 's/"//g' > urllist
# remove temporary files, unless you want to keep them for some reason
rm -r sourceforge.net/
# download each of the extracted URLs, put into $projectname/
while read url; do wget --content-disposition -x -nH --cut-dirs=1 "${url}"; done < urllist
rm urllist
If you have neither wget nor a shell installed, you can do it with FileZilla: open an SFTP connection to sftp://yourname@web.sourceforge.net with your password, then browse to /home/pfs/.
After that path (it may show up as a question-mark sign) you fill in the path of the folder you want to download on the remote site, in my case:
/home/pfs/project/maxbox/Examples/
This is the access path of the FRS (File Release System): /home/frs/project/PROJECTNAME/