wget --warc-file --recursive, prevent writing individual files - wget

I run wget to create a warc archive as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/
$ l -h /tmp/epfl.warc.gz
-rw-r--r-- 1 david wheel 657K Sep 2 15:18 /tmp/epfl.warc.gz
$ find .
./www.epfl.ch/index.html
./www.epfl.ch/public/hp2013/css/homepage.70a623197f74.css
[...]
I only need the epfl.warc.gz file. How do I prevent wget from creating all the individual files?
I tried as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 --output-document=/dev/null http://www.epfl.ch/
ERROR: -k or -r can be used together with -O only if outputting to a regular file.

tl;dr Add the options --delete-after and --no-directories.
Option --delete-after instructs wget to delete each downloaded file immediately after its download is complete. As a consequence, the maximum disk usage during execution will be the size of the WARC file plus the size of the single largest downloaded file.
Option --no-directories prevents wget from leaving behind a useless tree of empty directories. By default wget creates a directory tree that mirrors the one on the host, and downloads each file into the appropriate directory of the mirrored tree. wget does this even when the downloaded file is temporary due to --delete-after. To prevent that, use option --no-directories.
The following demonstrates the result, using your example (slightly altered).
$ cd $(mktemp -d)
$ wget --delete-after --no-directories \
--warc-file=epfl --recursive --level=1 http://www.epfl.ch/
...
Total wall clock time: 12s
Downloaded: 22 files, 1.4M in 5.9s (239 KB/s)
$ ls -lhA
-rw-rw-r--. 1 chadv chadv 1.5M Aug 31 07:55 epfl.warc
If you forget to use --no-directories, you can easily clean up the tree of empty directories with find -type d -delete.
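If the empty directories are already there, a slightly more cautious cleanup restricts deletion to directories that are actually empty. The sketch below simulates the leftover tree in a scratch directory (the directory and file names are only illustrative) and assumes GNU or BSD find, which provide -mindepth, -empty and -delete.

```shell
#!/bin/sh
# Simulate what wget leaves behind without --no-directories:
# a WARC file plus a tree of now-empty directories (names are illustrative).
work=$(mktemp -d)
mkdir -p "$work/www.epfl.ch/public/hp2013/css"
: > "$work/epfl.warc.gz"

# -delete implies depth-first traversal, so nested empty directories are
# removed bottom-up; -mindepth 1 leaves the scratch directory itself alone.
find "$work" -mindepth 1 -type d -empty -delete

# Only the WARC file should remain.
ls -A "$work"
```

Because of -empty, a directory that still holds a file is left in place rather than producing an error.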

For individual files (without --recursive), the option -O /dev/null stops wget from creating a file for the output. For recursive fetches, /dev/null is not accepted (I don't know why). But why not just write all the output, concatenated, into a single file via -O tmpfile and delete that file afterwards?

Related

wget converted some of the links - how to convert links after download?

I used
wget --mirror --convert-links http://example.com/ 2>&1 | tee -a wget.log
to download a website. It turns out that only some of the links were converted. How can I have all of the links converted, even after the download? I do not want to download all of the contents again.
Firstly, please be aware that --convert-links does its job only after everything has been downloaded, so if you inspect a downloaded file before wget has finished, you might see unconverted links.
I do not want to download all of the contents again.
In that case, use --no-clobber. However, according to the man page, --mirror is equivalent to -r -N -l inf --no-remove-listing, and --no-clobber and -N are mutually exclusive. You therefore cannot use --mirror itself, only its component options without -N. Taking this into account, your command should look like this:
wget -r --no-clobber -l inf --no-remove-listing --convert-links http://example.com/

how to exclude snapshots while running tar in Solaris

I'm trying to take a tar of the '/home/store/' directory content.
tar cvf store.tar /home/store/
While doing so, I can see that the .snapshot directories are also getting included. My understanding is that snapshots are a kind of backup. Can I skip them? If so, how? I tried excluding a test directory using the command below, run from /home/store/
tar cvfX store.tar <(echo /home/store/test) /home/store/
But this is not excluding the test directory from the tar created.
Also, tried this
tar cvf store.tar /home/store/ --exclude-file=exclude.txt
Output:
a /home/store// 0K
a /home/store//.profile 1K
a /home/store//local.profile 1K
a /home/store//.vas_logon_server 1K
a /home/store//.vas_disauthcc_611400381 1K
a /home/store//.bash_history 7K
a /home/store//test/ 0K
a /home/store//test/1.txt 1K
a /home/store//test/migrate-perf3.txt 3958K
a /home/store//test.txt 1K
a /home/store//exclude.txt 1K
a /home/store//.snapshot/hourly.0/d2/dd/d5d/f82-1 59K
a /home/store//.snapshot/hourly.0/d2/dd/d5d/f83-1 58K
.....
tar: --exclude-file=exclude.txt: No such file or directory
/home/store/exclude.txt has the entry 'test'. Tried entering the following as well and got same error.
/home/store/test/
/home/store/test/1.txt
When I gave the full path to 'exclude.txt' like this
`tar cvf store.tar /home/store/ --exclude-file=/home/store/exclude.txt`
it's giving the below error
tar: can't change directories to --exclude-file=/home/store: No such file or directory
tar -h
Usage: tar {c|r|t|u|x}[BDeEFhilmnopPqTvw#[0-7]][bfk][X...] [blocksize] [tarfile] [size] [exclude-file...] {file | -I include-file | -C directory file}...
Thanks well in advance!
Van Peer
Try this:
tar cvfX /var/tmp/src.tar /var/tmp/excl.txt /var/tmp/src/
Your exclude file should contain the path:
/home/store//.snapshot
It is best practice not to use the full path of the directory you are archiving, because later, when you extract the archive (from /var/tmp, for example), absolute paths could overwrite system files such as your /etc.
For example:
sudo tar -zcvpf /backup/farm-backup-$(date +%d-%m-%Y).tar.gz --exclude ".snapshots" --exclude ".cache" farm
Note that there is no leading slash before the directory in the command (farm, not /farm). Run the tar command from the /home directory to back up the "farm" user; the backup is written to the /backup directory in the root.
OS: OpenSuse 15.1
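To see the exclusion and the relative-path advice working together, here is a minimal sketch that builds a stand-in for /home/store in a scratch directory and archives it without the .snapshot tree. It assumes GNU tar (or bsdtar), not the Solaris tar from the question, and every path in it is made up for illustration.

```shell
#!/bin/sh
# Build a small stand-in for /home/store in a scratch directory
# (all paths here are illustrative).
work=$(mktemp -d)
mkdir -p "$work/home/store/test" "$work/home/store/.snapshot/hourly.0"
echo data > "$work/home/store/test/1.txt"
echo old  > "$work/home/store/.snapshot/hourly.0/f82-1"

# -C changes into the parent directory first, so the archive stores
# relative names (store/...) rather than absolute ones.
# --exclude is placed before the path it should apply to.
tar -cf "$work/store.tar" --exclude='.snapshot' -C "$work/home" store

# List the archive; no .snapshot entries should appear.
tar -tf "$work/store.tar"
```

On Solaris tar the equivalent would go through the X modifier and an exclude file, as shown in the first answer.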

How Can I make gsutil cp skip false symlinks?

I am using gsutil to upload a folder which contains symlinks. The problem is that some of these symlinks are false, that is, they point to files that do not exist (unfortunately, that's the case).
Here is an example of the command I am using:
gsutil -m cp -c -n -e -L output-upload.log -r output gs://my-storage
and I get the following:
[Errno 2] No such file or directory: 'output/1231/file.mp4'
CommandException: 1 file/object could not be transferred.
Is there a way to make gsutil skip this file or fail safely without stopping the upload?
This was a bug in gsutil (which it looks like you reported here) and it will be fixed in gsutil 4.23.
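Until a fixed version is available, one possible workaround is to list the dangling symlinks before uploading. The following is only a sketch using plain find and test, not a gsutil feature; the scratch tree and file names are invented for illustration.

```shell
#!/bin/sh
# Scratch upload tree with one valid and one dangling symlink
# (file names are illustrative).
work=$(mktemp -d)
mkdir -p "$work/output/1231"
echo video > "$work/output/1231/real.mp4"
ln -s real.mp4 "$work/output/1231/good.mp4"
ln -s missing.mp4 "$work/output/1231/file.mp4"

# A symlink is dangling when `test -e`, which follows links, fails on it.
broken=$(find "$work/output" -type l ! -exec test -e {} \; -print)
printf '%s\n' "$broken"
```

The listed links could then be repaired or removed before running gsutil cp; whether removing them is acceptable depends on the data.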

how to use wget on a site with many folders and subfolders

I am trying to download this site with this command:
wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off tenshi.spb.ru/anime-ost/
But wget only gets the index and goes into the first level of folders, not the subfolders. Can you help me?
I use this command to download sites including their subfolders:
wget --mirror -p --convert-links -P . [site address]
A little explanation:
--mirror is a shortcut for -N -r -l inf --no-remove-listing.
--convert-links makes links in downloaded HTML or CSS point to local files.
-p gets all images and other resources needed to display HTML pages.
-P specifies that the next argument is the directory the files will be saved to.
I found the command at:
http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/
You are using -l 1, also known as --level=1, which limits recursion to one level. Set it to a higher level to download more pages. By the way, I prefer long options like --level because it's easier to see what you are doing without going back to the man pages.

How to specify the download location with wget?

I need files to be downloaded to /tmp/cron_test/. My wget code is
wget --random-wait -r -p -nd -e robots=off -A".pdf" -U mozilla http://math.stanford.edu/undergrad/
So is there some parameter to specify the directory?
From the manual page:
-P prefix
--directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the
directory where all other files and sub-directories will be
saved to, i.e. the top of the retrieval tree. The default
is . (the current directory).
So you need to add -P /tmp/cron_test/ (short form) or --directory-prefix=/tmp/cron_test/ (long form) to your command. Also note that if the directory does not exist it will get created.
-O is the option to specify the path of the file you want to download to:
wget <uri> -O /path/to/file.ext
-P is prefix where it will download the file in the directory:
wget <uri> -P /path/to/folder
Make sure you have the URL correct for whatever you are downloading. First of all, URLs containing characters like ? and & should be quoted, otherwise the shell may interpret those characters itself. In addition, wget takes everything after the last slash of the URL, including any query string, as the name of the file it downloads into.
For example:
wget "sourceforge.net/projects/ebosse/files/latest/download?source=typ_redirect"
will download into a file named download?source=typ_redirect.
As you can see, knowing a thing or two about URLs helps to understand wget.
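To preview what name such a URL leads to before downloading, the part after the last slash can be extracted with shell parameter expansion. This only approximates wget's default naming on Unix-like systems as far as I understand it; redirects or options such as --content-disposition can change the actual name.

```shell
#!/bin/sh
# Quote the URL so the shell does not treat ? and & specially.
url="sourceforge.net/projects/ebosse/files/latest/download?source=typ_redirect"

# Strip everything up to the last slash, which approximates how wget
# picks its default output name (query string included).
name=${url##*/}
printf '%s\n' "$name"
```

If that name is not what you want, pass -O with an explicit file name.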
I am booting from a hirens disk and only had Linux 2.6.1 as a resource (import os is unavailable). The correct syntax that solved my problem downloading an ISO onto the physical hard drive was:
wget "(source url)" -O "(directory where HD was mounted)/isofile.iso"
You can work out the correct URL by checking at what point wget downloads into a file named index.html (the default file name) and reports the correct size and other attributes for the file you need, using the following command:
wget "(source url)"
Once the URL and source file are correct and it is downloading into index.html, you can stop the download (Ctrl+C) and change the output file by using:
-O "<specified download directory>/filename.extension"
after the source url.
In my case this results in downloading an ISO and storing it as a binary file under isofile.iso, which hopefully mounts.
"-P" is the right option, please read on for more related information:
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Relevant snippets from man pages for convenience:
-P prefix
--directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
man wget:
-O file
--output-document=file
wget "url" -O /tmp/cron_test/<file>