I'm using the following command:
wget --no-check-certificate -nd -r -l5 --no-parent -A "*revisions.txt" https://mysite.com/folder/test/product1/
to get all files whose names end in "revisions.txt", but my structure is like this:
> https://mysite.com/folder/test/product1/xxx/final/
> https://mysite.com/folder/test/product1/yy/rc/
> https://mysite.com/folder/test/product1/yy/beta/
> https://mysite.com/folder/test/product2/zzz/final/
> https://mysite.com/folder/test/product2/yy/rc/
> https://mysite.com/folder/test/product2/xx/beta/
> https://mysite.com/folder/test/product2/xxx/alpha/
> https://mysite.com/folder/test/product2/zz/alpha/
...
And I only want to browse the directories "rc" and "final". The problem is that I have more than 50 products, and this number keeps growing, so the solution has to be "dynamic".
How can I browse only rc and final to get all revisions.txt from these two directories? The parent directories can have different names.
Got it!
wget -P ./revisions/ --no-check-certificate -nd -r -l8 -A "*revisions.txt" --exclude-directories="*/BETA/*","*/ALPHA/*" --no-parent https://mysite.com/folder/test/
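The same idea can be sketched with the excluded stage names kept in one place, so that new products need no changes. This is only a sketch: the helper name build_excludes is made up for illustration, the lowercase names beta and alpha are taken from the directory listing in the question, and --ignore-case is added on the assumption that the real directory names may differ in case from the patterns.

```shell
# Turn stage names into a comma-separated list of wildcard patterns,
# e.g. "beta alpha" -> "*/beta/*,*/alpha/*".
# build_excludes is a hypothetical helper, not part of wget.
build_excludes() {
    out=""
    for stage in "$@"; do
        out="${out:+$out,}*/$stage/*"
    done
    printf '%s\n' "$out"
}

excludes=$(build_excludes beta alpha)

# Dry run: print the command instead of running it; drop "echo" to download.
# --ignore-case (an assumption here) makes -A and -X match regardless of case.
echo wget -P ./revisions/ --no-check-certificate -nd -r -l8 \
    -A "*revisions.txt" --exclude-directories="$excludes" \
    --ignore-case --no-parent https://mysite.com/folder/test/
```

Everything that is not excluded is still crawled, so rc and final are reached without listing product names.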
Related
I'm trying to semi mirror a site. What I want is to download all of the MP3s and make sure I'm not redownloading those that I already have (hence the "mirror" part). I've typed in the following:
wget -m -nd -e robots=off --random-wait -A "*.mp3" -P FOLDER http://www.example.com/
It downloads all the MP3s on the current page, but it never follows the links to the "Next Page" or the like. I've replaced -m with -N -c -r without success. What other options can I use?
Try:
wget --execute robots=off --recursive --accept mp3,MP3 --random-wait --no-parent --continue --no-clobber http://site.com/
I run wget to create a warc archive as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/
$ ls -lh /tmp/epfl.warc.gz
-rw-r--r-- 1 david wheel 657K Sep 2 15:18 /tmp/epfl.warc.gz
$ find .
./www.epfl.ch/index.html
./www.epfl.ch/public/hp2013/css/homepage.70a623197f74.css
[...]
I only need the epfl.warc.gz file. How do I prevent wget from creating all the individual files?
I tried as follows:
$ wget --warc-file=/tmp/epfl --recursive --level=1 --output-document=/dev/null http://www.epfl.ch/
ERROR: -k or -r can be used together with -O only if outputting to a regular file.
tl;dr Add the options --delete-after and --no-directories.
Option --delete-after instructs wget to delete each downloaded file immediately after its download is complete. As a consequence, the maximum disk usage during execution will be the size of the WARC file plus the size of the single largest downloaded file.
Option --no-directories prevents wget from leaving behind a useless tree of empty directories. By default wget creates a directory tree that mirrors the one on the host, and downloads each file into the appropriate directory of the mirrored tree. wget does this even when the downloaded file is temporary due to --delete-after. To prevent that, use option --no-directories.
The following demonstrates the result, using your example (slightly altered).
$ cd $(mktemp -d)
$ wget --delete-after --no-directories \
--warc-file=epfl --recursive --level=1 http://www.epfl.ch/
...
Total wall clock time: 12s
Downloaded: 22 files, 1.4M in 5.9s (239 KB/s)
$ ls -lhA
-rw-rw-r--. 1 chadv chadv 1.5M Aug 31 07:55 epfl.warc
If you forget to use --no-directories, you can easily clean up the tree of empty directories with find -type d -delete.
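A slightly more defensive variant of that cleanup can be tried on a scratch tree first. The sketch below is an illustration, not output from a real wget run: the directory names are borrowed from the question, and the file epfl.warc is an empty stand-in for the archive. It adds -empty so that only empty directories are ever removed.

```shell
# Simulate the leftover tree in a scratch directory (stand-in data only).
scratch=$(mktemp -d)
mkdir -p "$scratch/www.epfl.ch/public/hp2013/css"
touch "$scratch/epfl.warc"   # empty stand-in for the real archive

# -empty limits deletion to empty directories; -delete implies depth-first
# traversal, so nested directories are removed before their parents.
# -mindepth 1 keeps the scratch directory itself.
find "$scratch" -mindepth 1 -type d -empty -delete

# List what is left in the scratch directory.
ls -A "$scratch"
```

Directories that still contain files are left alone, which makes the command safer to run in a directory that holds other data.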
For individual files (without --recursive), the option -O /dev/null keeps wget from creating an output file. For recursive fetches, /dev/null is not accepted, because -r can be combined with -O only when the output is a regular file. But why not just write all the output concatenated into one file via -O tmpfile and delete this file afterwards?
Here is my wget command, with which I am trying to rename the file I am downloading. I am using the -O option, but it is not working.
access="http://mvn:8081/nexus/content/com/mvn/"
wget -r -np -nd -l1 -O "access.war" "$access" -A "com.infa.products.ldm.ingestion.access.web-"$n"-.-1-ldm-access-web.war"
Here I am renaming it to access.war. I can only use wget for this job due to some restrictions.
Thanks for the help.
The option -A is "comma separated", but you are using dots to separate the extensions!
Instead of
-A "com.infa.products.ldm.ingestion.access.web-"$n"-.-1-ldm-access-web.war"
Try
-A "com,infa,products,ldm,ingestion,access,web-"$n"-,-1-ldm-access-web,war"
If this is not the solution to your problem, I suggest you simplify your wget call down to something like this:
wget -r -np -nd -l1 -O "access.war" "$access"
just to verify that everything else is working.
Or even better (to get fewer files):
wget -r -np -nd -l1 -O "access.war" "$access" -A "war"
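One thing to keep in mind: the wget manual describes -O as concatenating all downloaded documents into the given file, so with -r it is not a per-file rename. If a follow-up mv is acceptable (it may not be under a wget-only restriction), a possible workaround is to download into a scratch directory and rename the single match afterwards. In the sketch below, the helper rename_single_match, the pattern, and the simulated file name are all made up for illustration, and the download itself is replaced by touch.

```shell
# Move the single file in directory $1 matching glob $2 to $3.
# Fails if there is not exactly one match. Hypothetical helper.
rename_single_match() {
    dir=$1 pattern=$2 target=$3
    set -- "$dir"/$pattern          # expand the glob into positional parameters
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one match for $pattern in $dir" >&2
        return 1
    fi
    mv -- "$1" "$target"
}

workdir=$(mktemp -d)

# In real use the download would go here, for example:
#   wget -r -np -nd -l1 -A "*ldm-access-web.war" -P "$workdir" "$access"
# Simulated download with an invented file name:
touch "$workdir/example-1.0-ldm-access-web.war"

rename_single_match "$workdir" "*ldm-access-web.war" "$workdir/access.war"
ls "$workdir"
```

Checking for exactly one match avoids silently renaming the wrong file when the accept pattern is too broad.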
I'm trying to download this site with this command:
wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off tenshi.spb.ru/anime-ost/
But wget only fetches the index and enters the first level of folders, not the subfolders. How can I fix this?
I use this command to download sites including their subfolders:
wget --mirror -p --convert-links -P . [site address]
A little explanation:
--mirror is a shortcut for -N -r -l inf --no-remove-listing.
--convert-links makes links in downloaded HTML or CSS point to local files.
-p allows you to get all images, etc. needed to display HTML pages.
-P specifies that the next argument is the directory where the files will be saved.
I found the command at:
http://www.thegeekstuff.com/2009/09/the-ultimate-wget-download-guide-with-15-awesome-examples/
You use -l 1 (also known as --level=1), which limits recursion to one level. Set it to a higher value to download more pages. BTW, I like long options like --level because it's easier to see what you are doing without going back to the man pages.
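To illustrate, here is the command from the question spelled out with long options and a deeper recursion level. It is a sketch only: the function name long_form is made up, and the depth of 3 is an arbitrary assumption, since the right value depends on how deeply the site nests its folders.

```shell
# Print the long-option form of the question's command for a given depth.
# long_form is a hypothetical helper; it prints rather than downloads.
long_form() {
    level=$1
    echo wget --recursive --level="$level" --span-hosts --tries=1 \
        --no-directories --timestamping --no-parent \
        --accept=.mp3 --execute robots=off tenshi.spb.ru/anime-ost/
}

# Depth 3 is an assumption; raise it if subfolders are still missed.
long_form 3
```

Running the printed command (or removing echo) performs the actual download.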
What's the best way to update data files from a website that has moved to a new domain and changed its folder structure?
For example, the old URL is http://folder.old-domain.com while the new URL is http://new-domain.com/directory1/directory2. My data is stored locally in the ~/Data_Backup/folder.old-domain.com folder.
Data was originally downloaded using:
$ wget -S -t 0 -c --mirror -w 2 -k http://folder.old-domain.com
I was thinking of using mv to rename the old folder to follow the new URL pattern, but is there a better way of doing this?
Will this work? I'm not particular about the directory structure. What's important is to update the contents of the target folder (and its sub-folders).
$ wget -S -t 0 -c -m -w 2 -k -N -np -P ~/Data_Backup/folder.old-domain.com http://new-domain.com/directory/directory
Thanks in advance.
Got it!
I needed to add the following options:
-nH --cut-dirs=2
and now it works.
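For anyone wondering why these two options help: -nH drops the host name directory, and --cut-dirs=2 drops the first two directory components of the remote path, so files land directly under the -P prefix. The sketch below only approximates that mapping in plain shell. The function local_path and the sample file sub/file.txt are made up for illustration, and real wget handles more cases (query strings, index pages, and so on).

```shell
# Approximate where wget saves a URL when run with -nH --cut-dirs=N -P PREFIX.
# local_path is a hypothetical helper, not a wget feature.
local_path() {
    url=$1 cut=$2 prefix=$3
    path=${url#*://}        # drop the scheme
    path=${path#*/}         # drop the host name (what -nH does)
    while [ "$cut" -gt 0 ]; do
        path=${path#*/}     # drop one leading directory (what --cut-dirs does)
        cut=$((cut - 1))
    done
    printf '%s/%s\n' "$prefix" "$path"
}

# Sample remote file (invented) under the new URL from the question.
local_path "http://new-domain.com/directory1/directory2/sub/file.txt" 2 \
    "Data_Backup/folder.old-domain.com"
```

With both options, the remote sub/file.txt maps onto the same relative path inside the old local folder, which is why the existing files get updated in place.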