How to use Rsync to copy only specific subdirectories (same names in several directories) - centos

I have such directories structure on server 1:
data
company1
unique_folder1
other_folder
...
company2
unique_folder1
...
...
And I want duplicate this folder structure on server 2, but copy only directories/subdirectories of unique_folder1. I.e. as result must be:
data
company1
unique_folder1
company2
unique_folder1
...
I know that rsync is very good for this.
I've tried 'include/exclude' options without success.
E.g. I've tried:
rsync -avzn --list-only --include '*/unique_folder1/**' --exclude '*' -e ssh user#server.com:/path/to/old/data/ /path/to/new/data/
But, as result, I don't see any files/directories:
receiving file list ... done
sent 43 bytes received 21 bytes 42.67 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
What's wrong? Ideas?
Additional information:
I have sudo access to both servers. One idea I have - is to use find command and cpio together to copy to new directory with content I need and after that use Rsync. But this is very slow, there are a lot of files, etc.

I've found the reason. As for me - it wasn't clear that Rsync works in this way.
So correct command (for company1 directory only) must be:
rsync -avzn --list-only --include 'company1/' --include 'company1/unique_folder1/***' --exclude '*' -e ssh user#server.com:/path/to/old/data/ /path/to/new/data
I.e. we need include each parent company directory. And of course we cannot write manually all these company directories in the command line, so we save the list into the file and use it.
Final things we need to do:
1.Generate include file on server 1, so its content will be (I've used ls and awk):
+ company1/
+ company1/unique_folder1/***
...
+ companyN/
+ companyN/unique_folder1/***
2.Copy include.txt to server 2 and use such command:
rsync -avzn \
--list-only \
--include-from '/path/to/new/include.txt' \
--exclude '*' \
-e ssh user#server.com:/path/to/old/data/ \
/path/to/new/data

If the first matching pattern excludes a directory, then all its descendants will never be traversed. When you want to include a deep directory e.g. company*/unique_folder1/** but exclude everything else *, you need to tell rsync to include all its ancestors too:
rsync -r -v --dry-run \
--include='/' \
--include='/company*/' \
--include='/company*/unique_folder1/' \
--include='/company*/unique_folder1/**' \
--exclude='*'
You can use bash’s brace expansion to save some typing. After brace expansion, the following command is exactly the same as the previous one:
rsync -r -v --dry-run --include=/{,'company*/'{,unique_folder1/{,'**'}}} --exclude='*'

An alternative to Andron's Answer which is simpler to both understand and implement in many cases is to use the --files-from=FILE option. For the current problem,
rsync -arv --files-from='list.txt' old_path/data new_path/data
Where list.txt is simply
company1/unique_folder1/
company2/unique_folder1/
...
Note the -r flag must be included explicitly since --files-from turns off this behaviour of the -a flag. It also seems to me that the path construction is different from other rsync commands, in that company1/unique_folder1/ matches but /data/company1/unique_folder1/ does not.

For example, if you only want to sync target/classes/ and target/lib/ to a remote system, do
rsync -vaH --delete --delete-excluded --include='classes/***' --include='lib/***' \
--exclude='*' target/ user#host:/deploy/path/
The important things to watch:
Don't forget the "/" from the end of the pathes, or you will get a copy into subdirectory.
The order of the --include, --exclude counts.
Contrary the other answers, starting with "/" an include/exclude parameter is unneeded, they will automatically appended to the source directory (target/ in the example).
To test, what exactly will happen, we can use a --dry-run flags, as the other answers say.
--delete-excluded will delete all content in the target directory, except the subdirectories we specifically included. It should be used wisely! On this reason, a --delete is not enough, it does not deletes the excluded files on the remote side by default (every other, yes), it should be given beside the ordinary --delete, again.

Related

wget redownloading a file only if it has been update on the server

I am currently using wget to downloads assets from a server. I currently use the following options of wget
wget --user=m_username --password=m_password -r -np -x -nH -q -nc URL_PATH
/**
* -r - download recursively
* -np - no parent ( only the files below a certain hierarchy will be downloaded)
* -x - force to create the same directory structure.
* -nH - Disable generation of host-prefixed directories
* -q - quiet - no output.
* -nc - existing files will not be redownloaded.
*
* */
In addition to the above options I want wget to re-download the file if the file has been updated in the server. Is there an option that I can use for that. I couldn't find anything specifically for that.
You're looking for -N: "When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file."
Quoting the manual pages:
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain
cases, the local file
will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file
being preserved and
the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is
also the behavior
with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer
copies of file.
Therefore, ""no-clobber"" is actually a misnomer in this mode---it's not clobbering that's prevented (as the numeric suffixes
were already preventing
clobbering), but rather the multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting
the old. Adding -nc
will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the
local and remote
timestamp and size of the file. -nc may not be specified at the same time as -N.
A combination with -O/--output-document is only accepted if the given output file does not exist.
Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been
retrieved from
the Web.
From my understanding what you really want is just the --mirror option, which sets the -r -N -l inf --no-remove-listing flags.

rsync will not exclude hidden files in gsutil 4.15

Previously gsutil appeared to not upload hidden files. Now hidden files cannot be prevented from upload. Using the -x command with either
.*/\\..* or
.*/[.].* still uploads both hidden files and directories.
This is with a local directory up to a bucket.
Is there a different expression that is required?
The -x exclude option should work:
gsutil rsync -x '\..*|./[.].*$' source-dir gs://your-bucket
You can learn more about it from the [official documentation].
This works for both hidden files and directories, at any spot in the path:
gsutil rsync -x '.*/\..*|^\..*' source dest
The other answer didn't work for me.
As the regexp is not tied to the edges of the string, .*'s at the beginning and at the end are not necessary, plus we can use grouping to simplify (sic!) a bit:
gsutil rsync -x '(^|/)\.' source dest
Where \. is the dot itself and (^|/) states that the dot should follow either the beginning of file name (^) or a / - a dot file in a subfolder.

How to force wget to overwrite an existing file ignoring timestamp?

I tried '-N' and '--no-clobber' but the only result that I get is to retrieve a new copy of the existing example.exe with number a number added using this synax 'example.exe.1'. This is not what I'd like to get. I just need to download and overwrite the file example.exe in the same folder where I already saved a copy of example.com without that wget verifies if the mine is older or newer respect the on example.exe file already present in my download folder. Do you think is i possible or I need to create a script that delete the example.exe file or maybe something that change his modification date etc?
If you specify the output file using the -O option it will overwrite any existing file.
For example:
wget -O index.html bbc.co.uk
Run multiple times will keep over-writting index.html.
wget doesn't let you overwrite an existing file unless you explicitly name the output file on the command line with option -O.
I'm a bit lazy and I don't want to type the output file name on the command line when it is already known from the downloaded file. Therefore, I use curl like this:
curl -O http://ftp.vim.org/vim/runtime/spell/fr.utf-8.spl
Be careful when downloading files like this from unsafe sites. The above command will write a file named as the connected web site wishes to name it (inside the current directory though). The final name may be hidden through redirections and php scripts or be obfuscated in the URL. You might end up overwriting a file you don't want to overwrite.
And if you ever find a file named ls or any other enticing name in the current directory after using curl that way, refrain from executing the downloaded file. It may be a trojan downloaded from a rogue or corrupted web site!
wget --backups=1 google.com
renames original file with .1 suffix and writes new file to the intended filename.
Not exactly what was requested, but could be handy in some cases.
-c or --continue
From the manual:
If you use ‘-c’ on a non-empty file, and the server does not support
continued downloading, Wget will restart the download from scratch and
overwrite the existing file entirely.
I like the -c option. I started with the man page then the web but I've searched for this several times. Like if you're relaying a webcam so the image needs to always be named image.jpg. Seems like it should be more clear in the man page.
I've been using this for a couple years to download things in the background, sometimes combined with "limit-rate = " in my wgetrc file
while true
do
wget -c -i url.txt && break
echo "Restarting wget"
sleep 2
done
Make a little file called url.txt and paste the file's URL into it. Set this script up in your path or maybe as an alias and run it. It keeps retrying the download until there's no error. Sometimes at the end it gets into a loop displaying
416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
but that's harmless, just ctrl-c it. I think it's always gotten the file I wanted even if wget runs out of retries or the connection temporarily goes away. I've downloaded things for days at a time with it. A CD image on dialup, yes, always with wget.
My use case involves two different URLs, sometimes the second one doesn't exist, but if it DOES exist, I want it to overwrite the first file.
The problem of using wget -O is that, when the second file DOESN'T exist, it will overwrite the first file with a BLANK file.
So the only way I could find is with an if statement:
--spider checks if a file exists, and returns 0 if it does
--quiet fail quietly, with no output
-nv is quiet, but still reports errors
wget -nv https://example.com/files/file01.png -O file01.png
# quietly check if a different version exists
wget --quiet --spider https://example.com/custom-files/file01.png
if [ $? -eq 0 ] ; then
# A different version exists, so download and overwrite the first
wget -nv https://example.com/custom-files/file01.png -O file01.png
fi
It's verbose, but I found it necessary. I hope this is helpful for someone.
Here is an easy way to get it done with parameter trimming
url=https://example.com/example.exe ; wget -nv $url -O ${url##*/}
Or you can use basename
url=https://example.com/example.exe ; wget -nv $url -O $( basename $url )
For those who do not want to use -O and want to specify the output directory only, the following command can be used.
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"
the first command will download from the source with the wget command
the second command will remove the older file
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"; \
rm '$file.1' -f;

Mirroring a website and maintaining URL structure

The goal
I want to mirror a website, such that I can host the static files anywhere (localhost, S3, etc.) and the URLs will appear just like the original to the end user.
The command
This is almost perfect for my needs (...but not quite):
wget --mirror -nH -np -p -k -E -e robots=off http://mysite
What this does do
--mirror : Recursively download the entire site
-p : Download all necessary page requisites
-k : Convert the URL's to relative paths so I can host them anywhere
What this doesn't do
Prevent duplicate downloads
Maintain (exactly) the same URL structure
The problem
Some things are being downloaded more than once, which results in myfile.html and myfile.1.html. This wouldn't be bad, except that when wget rewrites the hyperlinks, it is writing it with the myfile.1.html version, which is changing the URLs and therefore has SEO considerations (Google will index ugly looking URL's).
The -nc option would prevent this, but as of wget-v1.13, I cannot use -k and -nc at the same time. Details for this are here.
Help?!
I was hoping to use wget, but I am now considering looking into using another tool, like httrack, but I don't have any experience with that yet.
Any ideas on how to achieve this (with wget, httrack or anything else) would be greatly appreciated!
httrack got me most of the way, the only URL mangling it did was make the links to point to /folder/index.html instead of /folder/.
Using either httrack or wget didn't seem to result in perfect URL structure, so we ended up writing a little bash script that runs the crawler, followed by sed to clean up some of the URLS (crop the index.html from links, replace bla.1.html with bla.html, etc.)
wget description and help
According to this (and a quick experiment of my own) you should have no problems using -nc and -k options together to gather the pages you are after.
What will cause an issue is using -N with -nc (Does not work at all, incompatible) so you won't be able to compare files by timestamp and still no-clobber them, and with the --mirror option you are including -N inherently.
Rather than use --mirror try instead replacing it with "-r -l inf" which will enable recursive downloading to an infinite level but still allow your other options to work.
An example, based on your original:
wget -r -l inf -k -nc -nH -p -E -e robots=off http://yoursite
Notes: I would suggest using -w 5 --random-wait --limit-rate=200k in order to avoid DOSing the server and be a little less rude, but obviously up to you.
Generally speaking I try to avoid using option groupings like --mirror because of conflicts like this being harder to trace.
I know this is an answer to a very old question but I think it should be addressed - wget is a new command for me but so far proving to be invaluable and I would hope others would feel the same.

wget - does -nc option skip downloading existed files?

I'm downloading a website with wget. the command is as below :
wget -nc --recursive --page-requisites --html-extension --convert-links --restrict-file-names=windows --domain any-domain.com --no-parent http://any-domain.com/any-page.html
does -nc option skip downloading existed files even when we download a website recursively? It seems -nc option not works.
the man say that :
-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon
repeated download.
Here is more details (from the man too) :
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old.
Yes, the -nc option will prevent re-download of the file.
The manual page is confusing because it describes all of the related options together.
Here is pertinent bits from the man page:
When running Wget with -r or -p, but without -N or -nc, re-downloading a file will result in the
new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the
original version to be preserved and any newer copies on the server to be ignored.
The option --convert-links seams to conflict with -nc. Try removing it.