After spending about an hour downloading almost every MSYS package from SourceForge, I'm wondering whether there is a more clever way to do this. Is it possible to use wget for this purpose?
I've used this script successfully:
https://github.com/SpiritQuaddicted/sourceforge-file-download
For your use, run:
./sourceforge-file-download.sh msys
It should download all the pages first, then find the actual download links in those pages and fetch the final files.
From the project description:
Allows you to download all of a sourceforge project's files. Downloads to the current directory into a directory named like the project. Pass the project's name as first argument, eg ./sourceforge-file-download.sh inkscape to download all of http://sourceforge.net/projects/inkscape/files/
Just in case the repo ever gets removed I'll post it here since it's short enough:
#!/bin/sh
project=$1
echo "Downloading $project's files"
# download all the pages on which direct download links are
# be nice, sleep a second
wget -w 1 -np -m -A download http://sourceforge.net/projects/$project/files/
# extract those links
grep -Rh direct-download sourceforge.net/ | grep -Eo '".*" ' | sed 's/"//g' > urllist
# remove temporary files, unless you want to keep them for some reason
rm -r sourceforge.net/
# download each of the extracted URLs, put into $projectname/
while read url; do wget --content-disposition -x -nH --cut-dirs=1 "${url}"; done < urllist
rm urllist
If you have no wget or shell installed, you can do it with FileZilla: open an SFTP connection to sftp://yourname@web.sourceforge.net with your password, then browse to /home/pfs/.
After that path (there may be a ? sign), fill in the folder path you want to download from the remote site, in my case:
/home/pfs/project/maxbox/Examples/
This is the access path of the FRS (File Release System): /home/frs/project/PROJECTNAME/
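If you prefer a command-line client over FileZilla, a minimal sketch with the OpenSSH sftp client could look like this (yourname and PROJECTNAME are placeholders):
sftp yourname@web.sourceforge.net
# then, at the sftp> prompt:
cd /home/frs/project/PROJECTNAME/
get -r .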
I am currently using wget to download assets from a server, with the following options:
wget --user=m_username --password=m_password -r -np -x -nH -q -nc URL_PATH
-r  - download recursively
-np - no parent (only files below a certain hierarchy will be downloaded)
-x  - force creation of the same directory structure
-nH - disable generation of host-prefixed directories
-q  - quiet, no output
-nc - existing files will not be re-downloaded
In addition to the above options, I want wget to re-download a file if it has been updated on the server. Is there an option I can use for that? I couldn't find anything specific.
You're looking for -N: "When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file."
Quoting the manual pages:
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.

When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, "no-clobber" is actually a misnomer in this mode---it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.

When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.

When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be specified at the same time as -N.

A combination with -O/--output-document is only accepted if the given output file does not exist.

Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.
From my understanding what you really want is just the --mirror option, which sets the -r -N -l inf --no-remove-listing flags.
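For example, a sketch with the question's original options, swapping -nc for -N (the manual above notes the two can't be combined):
wget --user=m_username --password=m_password -r -np -x -nH -q -N URL_PATH
# or let --mirror pull in -r -N -l inf --no-remove-listing:
wget --user=m_username --password=m_password -np -x -nH -q --mirror URL_PATH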
I want to download an entire website using the wget -r command and change the name of the file.
I have tried with:
wget -r -o doc.txt "http....
hoping that the OS would automatically create files in order, like doc1.txt, doc2.txt, but it actually saves the stdout stream (the log) to that file.
Is there any way to do this with just one command?
Thanks!
-r tells wget to recursively get resources from a host.
-o file saves log messages to file instead of standard error. That is not what you are looking for; I think you want -O file.
-O file stores the resource(s) in the given file, instead of creating a file in the current directory with the name of the resource. If used in conjunction with -r, it causes wget to store all resources concatenated to that file.
Since wget -r downloads and stores more than one file, recreating the server file tree on the local system, it makes no sense to specify the name of a single file to store.
If what you want is to rename all downloaded files to match the pattern docX.txt, you can do it with a separate command after wget has finished:
wget -r http....
i=1
while read file
do
    mv "$file" "$(dirname "$file")/doc$i.txt"
    i=$(( i + 1 ))
done < <(find . -type f)
I tried '-N' and '--no-clobber', but the only result I get is a new copy of the existing example.exe with a number added, named example.exe.1. That is not what I want. I just need to download and overwrite example.exe in the same folder where I already saved a copy of example.exe, without wget checking whether mine is older or newer than the example.exe already present in my download folder. Do you think this is possible, or do I need a script that deletes the example.exe file first, or maybe something that changes its modification date, etc.?
If you specify the output file using the -O option it will overwrite any existing file.
For example:
wget -O index.html bbc.co.uk
Running it multiple times will keep overwriting index.html.
wget doesn't let you overwrite an existing file unless you explicitly name the output file on the command line with option -O.
I'm a bit lazy and I don't want to type the output file name on the command line when it is already known from the downloaded file. Therefore, I use curl like this:
curl -O http://ftp.vim.org/vim/runtime/spell/fr.utf-8.spl
Be careful when downloading files like this from unsafe sites. The above command will write a file named as the connected web site wishes to name it (inside the current directory though). The final name may be hidden through redirections and php scripts or be obfuscated in the URL. You might end up overwriting a file you don't want to overwrite.
And if you ever find a file named ls or any other enticing name in the current directory after using curl that way, refrain from executing the downloaded file. It may be a trojan downloaded from a rogue or corrupted web site!
wget --backups=1 google.com
renames the original file with a .1 suffix and writes the new file to the intended filename.
Not exactly what was requested, but could be handy in some cases.
-c or --continue
From the manual:
If you use '-c' on a non-empty file, and the server does not support continued downloading, Wget will restart the download from scratch and overwrite the existing file entirely.
I like the -c option. I started with the man page and then the web, and I've had to search for this several times. It's useful, for example, if you're relaying a webcam so the image always needs to be named image.jpg. It seems like this should be clearer in the man page.
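As a sketch of that webcam case (hypothetical URL; per the manual quote above, this overwriting only happens when the server does not support continued downloads):
# re-fetch the snapshot; with no range support on the server, -c restarts
# the download and overwrites the existing image.jpg in place
wget -c http://example.com/image.jpg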
I've been using this for a couple years to download things in the background, sometimes combined with "limit-rate = " in my wgetrc file
while true
do
    wget -c -i url.txt && break
    echo "Restarting wget"
    sleep 2
done
Make a little file called url.txt and paste the file's URL into it. Set this script up in your path or maybe as an alias and run it. It keeps retrying the download until there's no error. Sometimes at the end it gets into a loop displaying
416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
but that's harmless, just ctrl-c it. I think it's always gotten the file I wanted even if wget runs out of retries or the connection temporarily goes away. I've downloaded things for days at a time with it. A CD image on dialup, yes, always with wget.
My use case involves two different URLs, sometimes the second one doesn't exist, but if it DOES exist, I want it to overwrite the first file.
The problem with using wget -O is that when the second file DOESN'T exist, it will overwrite the first file with a BLANK file.
So the only way I could find is with an if statement:
--spider checks if a file exists, and returns 0 if it does
--quiet fails quietly, with no output
-nv is quiet, but still reports errors
wget -nv https://example.com/files/file01.png -O file01.png
# quietly check if a different version exists
wget --quiet --spider https://example.com/custom-files/file01.png
if [ $? -eq 0 ] ; then
# A different version exists, so download and overwrite the first
wget -nv https://example.com/custom-files/file01.png -O file01.png
fi
It's verbose, but I found it necessary. I hope this is helpful for someone.
Here is an easy way to get it done with shell parameter expansion (trimming everything up to the last /):
url=https://example.com/example.exe ; wget -nv "$url" -O "${url##*/}"
Or you can use basename:
url=https://example.com/example.exe ; wget -nv "$url" -O "$(basename "$url")"
For those who do not want to use -O and want to specify the output directory only, the following command can be used.
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"
The first command downloads from the source with wget; the second removes the older backup file:
wget \
    --directory-prefix "$dest" \
    --backups 0 \
    -- "$link"
rm -f "$file.1"
I suppose I could compare the number of files in the source directory to the number of files in the target directory as cp progresses, or perhaps do it with folder size instead? I tried to find examples, but all bash progress bars seem to be written for copying single files. I want to copy a bunch of files (or a directory, if the former is not possible).
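For reference, a crude version of that file-count idea might look like this (both paths are placeholders):
watch -n 5 'echo "$(find /path/to/dest -type f | wc -l) of $(find /path/to/src -type f | wc -l) files copied"'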
You can also use rsync instead of cp like this:
rsync -Pa source destination
Which will give you a progress bar and estimated time of completion. Very handy.
To show a progress bar while doing a recursive copy of files & folders & subfolders (including links and file attributes), you can use gcp (easily installed in Ubuntu and Debian by running "sudo apt-get install gcp"):
gcp -rf SRC DEST
Here is the typical output while copying a large folder of files:
Copying 1.33 GiB 73% |##################### | 230.19 M/s ETA: 00:00:07
Notice that it shows just one progress bar for the whole operation, whereas if you want a single progress bar per file, you can use rsync:
rsync -ah --progress SRC DEST
You may have a look at the tool vcp. That's a simple copy tool with two progress bars: one for the current file, and one for the overall progress.
EDIT
Here is the link to the sources: http://members.iinet.net.au/~lynx/vcp/
Manpage can be found here: http://linux.die.net/man/1/vcp
Most distributions have a package for it.
Here's another solution: use the bar tool.
You could invoke it like this:
#!/bin/bash
filesize=$(du -sb "${1}" | awk '{ print $1 }')
tar -cf - -C "${1}" ./ | bar --size "${filesize}" | tar -xf - -C "${2}"
You have to go through tar, and it will be inaccurate for small files. Also, you must make sure the target directory exists. But it is one way to do it.
My preferred option is Advanced Copy, as it uses the original cp source files.
$ wget http://ftp.gnu.org/gnu/coreutils/coreutils-8.32.tar.xz
$ tar xvJf coreutils-8.32.tar.xz
$ cd coreutils-8.32/
$ wget --no-check-certificate https://raw.githubusercontent.com/jarun/advcpmv/master/advcpmv-0.8-8.32.patch
$ patch -p1 -i advcpmv-0.8-8.32.patch
$ ./configure
$ make
The new programs are now located in src/cp and src/mv. You may choose to replace your existing commands:
$ sudo cp src/cp /usr/local/bin/cp
$ sudo cp src/mv /usr/local/bin/mv
Then you can use cp as usual, or specify -g to show the progress bar:
$ cp -g src dest
A simple Unix way is to go to the destination directory and run watch -n 5 du -s. Perhaps make it prettier by showing it as a bar. This can help in environments where you have just the standard Unix utilities and no scope for installing additional tools. du -s is the key; watch just re-runs it every 5 seconds.
Pros: works on any Unix system. Cons: no progress bar.
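A rough sketch of dressing that up into a percentage (assuming GNU du for the -b byte-count option; both paths are placeholders):
SRC=/path/to/source
DEST=/path/to/destination
total=$(du -sb "$SRC" | awk '{print $1}')
while true
do
    copied=$(du -sb "$DEST" | awk '{print $1}')
    printf '\r%3d%% copied' "$(( copied * 100 / total ))"
    [ "$copied" -ge "$total" ] && break
    sleep 5
done
echo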
To add another option, you can use cpv. It uses pv to imitate the usage of cp.
It works like pv, but you can use it to recursively copy directories.
There's a tool pv to do this exact thing: http://www.ivarch.com/programs/pv.shtml
There's an Ubuntu package in apt.
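For a single file, minimal usage looks like this (paths are placeholders):
# pv reads the source file and shows throughput, ETA and a progress bar
pv /path/to/source.iso > /path/to/destination.iso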
How about something like
find . -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /DEST/$(dirname {})
It finds all the files in the current directory, pipes the list through pv (giving pv an estimated size so the progress meter works), and then pipes it to cp with the --parents flag so the DEST path matches the SRC path.
One problem I have yet to overcome is that if you issue this command
find /home/user/test -type f | pv -s $(find . -type f | wc -c) | xargs -i cp {} --parents /www/test/$(dirname {})
the destination path becomes /www/test/home/user/test/....FILES... and I am unsure how to tell the command to get rid of the '/home/user/test' part. That's why I have to run it from inside the SRC directory.
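One workaround, sticking with the same example paths, is to run the pipeline from inside the source directory in a subshell so that cp --parents only sees relative paths:
(
  cd /home/user/test &&
  find . -type f | pv -s $(find . -type f | wc -c) | xargs -i cp --parents {} /www/test/
)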
Check the source code for progress_bar in my git repository below:
https://github.com/Kiran-Bose/supreme
Also try the custom bash script package supreme to see how the progress bar works with the cp and mv commands.
Functionality overview
(1)Open Apps
----Firefox
----Calculator
----Settings
(2)Manage Files
----Search
----Navigate
----Quick access
|----Select File(s)
|----Inverse Selection
|----Make directory
|----Make file
|----Open
|----Copy
|----Move
|----Delete
|----Rename
|----Send to Device
|----Properties
(3)Manage Phone
----Move/Copy from phone
----Move/Copy to phone
----Sync folders
(4)Manage USB
----Move/Copy from USB
----Move/Copy to USB
There is the command progress (https://github.com/Xfennec/progress), a coreutils progress viewer.
Just run progress in another terminal to see the copy/move progress. For continuous monitoring, use the -M flag.
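For example (paths are placeholders):
# start the copy in the background, or in another terminal
cp -r /big/source /some/destination &
# keep watching cp, mv and other supported commands until interrupted
progress -M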
I have the following site http://www.asd.com.tr. I want to download all PDF files into one directory. I've tried a couple of commands but am not having much luck.
$ wget --random-wait -r -l inf -nd -A pdf http://www.asd.com.tr/
With this command, only four PDF files were downloaded. Check this link; there are several thousand PDFs available:
http://www.asd.com.tr/Default.aspx
For instance, hundreds of files are in the following folder:
http://www.asd.com.tr/Folders/asd/…
But I can't figure out how to access them all. There are several folders in the subdirectory http://www.asd.com.tr/Folders/, and thousands of PDFs in those folders.
I've tried to mirror the site using the -m option, but that failed too.
Any more suggestions?
First, verify that the TOS of the web site permits crawling it. Then one solution is:
mech-dump --links 'http://domain.com' |
grep pdf$ |
sed 's/\s\+/%20/g' |
xargs -I% wget http://domain.com/%
The mech-dump command comes with Perl's WWW::Mechanize module (the libwww-mechanize-perl package on Debian and Debian-like distros).