gsutil rsync uploading then immediately deleting file, leaving source and target in different states - google-cloud-storage

I have a script that runs gsutil rsync -r -d -c, and occasionally it leaves the source and target directories out of sync. The last file in the list (named version.json) is first uploaded and then immediately deleted.
Has anybody encountered this bug?
Additional information:
versioning is turned off in the target bucket
This occurs when attempting to overwrite the entire contents of the target bucket, which is already present.


How to download all bucket files. (The issue with the -m flag gsutil)

I am trying to copy all files from a Cloud Storage bucket recursively, and as far as I have been able to investigate, the problem lies with the -m flag.
The command that I am running:
gsutil -m cp -r gs://{{ src_bucket }} {{ bucket_backup }}
I am getting something like this:
CommandException: 1 file/object could not be transferred.
where the number of files/objects differs every time.
After investigating, I tried reducing the number of threads/processes used with the -m option, but this did not help, so I am looking for advice. The bucket holds 170 MiB of data in approximately 300k files. I need to download them as fast as possible.
UPD:
Logs with -L flag
[Errno 2] No such file or directory: '<path>/en_.gstmp' -> '<path>/en'
6 errors like that.
The root of the issue might be that both a directory and a file of the same name exist in the GCS bucket. Run the command with the -L flag to get additional logs on the execution, which should let you find the file that is causing this error.
I would suggest you delete that file, make sure there is no directory of that name in the bucket, and then upload the file to the bucket again.
Also check whether any directories were created with the JAR name. Delete them and then proceed with copying the files.
Finally, check whether the required file already exists at the destination; if it does, delete it there and run the copy again.
There are alternatives to copy, for example, it is possible to transfer files using rsync, as described here.
You can also check similar threads: thread1 , thread2 & thread3
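As a rough way to look for the "file and directory of the same name" situation described above, the following sketch reads object names (one per line) and prints every name that is also a path prefix of another object. The function name find_collisions is hypothetical, and the sketch assumes object names contain no newlines.

```shell
# find_collisions: read object names on stdin, one per line, and print the
# names that are both an object and a "directory" (prefix) of another object.
find_collisions() {
  awk '
    {
      objects[$0] = 1
      prefix = $0
      # Strip one trailing path component at a time and remember each parent.
      while (sub("/[^/]*$", "", prefix)) parents[prefix] = 1
    }
    END { for (name in objects) if (name in parents) print name }
  ' | sort
}
```

Assuming a bucket called my-bucket (a placeholder), it could be fed from a flat listing such as gsutil ls 'gs://my-bucket/**' | find_collisions. Any name it prints is a candidate for the rename error shown in the logs.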

Nearline - Backup Solution - Versioning

I've set up some Nearline buckets and enabled versioning and object lifecycle management. The use case is to replace my current backup solution, Crashplan.
Using gsutil I can see the different versions of a file using a command like gsutil ls -al gs://backup/test.txt.
First, is there any way of finding files that don't have a live version (e.g. deleted) but still have a version attached?
Second, is there any easier way of managing versions? For instance if I delete a file from my PC, it will no longer have a live version in my bucket but will still have the older versions associated. Say, if I didn't know the file name would I just have to do a recursive ls on the entire bucket and sift through the output?
Would love a UI that supported versioning.
Thanks.
To check whether an object currently has no live version, set the x-goog-if-generation-match header to 0, for example:
gsutil -h x-goog-if-generation-match:0 cp file.txt gs://bucket/file.txt
This will fail (PreconditionException: 412 Precondition Failed) if the file has a live version and will succeed if it has only archived versions.
To automatically synchronize your local folder with a folder in the bucket (or the other way around), use gsutil rsync:
gsutil rsync -r -d ./test gs://bucket/test/
Notice the trailing / in gs://bucket/test/; without it you will receive
CommandException: arg (gs://graham-dest/test) does not name a directory, bucket, or bucket subdir.
-r synchronizes all the directories in ./test recursively to gs://bucket/test/
-d deletes all files from gs://bucket/test/ that are not found in ./test
Regarding a UI, a feature request already exists. I don't know anything about third-party applications, however.
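For the first question (finding objects that only have archived versions), one possible approach is to compare two listings: one of live objects and one of all versions. The sketch below assumes the all-versions listing carries a #generation suffix on each line, as gsutil ls -a output does; the function name list_without_live and the file names are hypothetical.

```shell
# list_without_live LIVE_FILE ALL_VERSIONS_FILE
# Print object names that appear in the all-versions listing but not in the
# live listing, i.e. objects that only have archived versions.
list_without_live() {
  awk '
    # First file: remember every live object name.
    FILENAME == ARGV[1] { live[$0] = 1; next }
    # Second file: drop the "#generation" suffix and report unseen names once.
    {
      sub("#[0-9]+$", "")
      if (!($0 in live) && !seen[$0]++) print
    }
  ' "$1" "$2" | sort
}
```

The two inputs could be produced with gsutil ls 'gs://backup/**' > live.txt and gsutil ls -a 'gs://backup/**' > all.txt, followed by list_without_live live.txt all.txt.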

Prevent downtime using lftp mirror

I'm using lftp to deploy a website via Travis CI. There is a build process before the deployment, for that reason a build directory is present and pushed to the root of the ftp server.
lftp $FTP_URL -e "glob -d mirror build . --reverse --delete-first --parallel=10 && exit"
It works quite well, but I dislike having downtime or temporary PHP parse errors on my website because of missing files. What is the best way to work around that issue?
My first approach was to look for an option to set a temporary directory, but the lftp man page says there is only an option for temporary files. I still tried that option, but it didn't help.
My second approach was to use "mirror build temp" to upload into a temporary folder and then replace the root with it. The problem here is that I cannot exclude the temp folder while deleting the old files and folders with something like rm -rf *.
For small changes that do not involve adding or removing PHP files, enabling xfer:use-temp-file (set xfer:use-temp-file yes) should be sufficient. Also, don't use --delete-first, as it causes lftp to delete obsolete files before uploading.
For larger changes I'd create a separate directory for each version of the site and redirect the web server to the directory using .htaccess mod_rewrite or some other configuration file. This technique will allow atomic switch to the new version (and back if needed). Besides, you will be able to do final pre-production testing of the new version if you redirect to the new version conditionally based on your IP address or using some other rule.
If you don't want to re-upload the whole site for each new version and the FTP server supports FXP with itself, then you can copy the old version to a new directory using mirror old_directory ftp://user@example.com/new_directory, then update the new directory using mirror -eR local_dir new_directory.
This is a zero-downtime pattern; replace each placeholder:
lftp $FTP_URL -e "mirror {SOURCE} {TARGET}-new-{TIMESTAMP} --reverse --delete-first;
mv {TARGET} {TARGET}-old-{TIMESTAMP};
mv {TARGET}-new-{TIMESTAMP} {TARGET};
rm -rf {TARGET}-old-{TIMESTAMP};
exit"
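To avoid filling in the placeholders by hand, the command list can be generated by a small shell function. This is only a sketch: the name build_deploy_script and its three arguments are hypothetical, paths are assumed to contain no spaces, and --delete-first is left out because the freshly created timestamped directory has nothing to delete.

```shell
# build_deploy_script SOURCE TARGET TIMESTAMP
# Print the lftp commands for the upload-then-swap pattern shown above.
build_deploy_script() {
  source_dir=$1
  target=$2
  stamp=$3
  cat <<EOF
mirror $source_dir $target-new-$stamp --reverse;
mv $target $target-old-$stamp;
mv $target-new-$stamp $target;
rm -rf $target-old-$stamp;
exit
EOF
}
```

It could then be used as lftp "$FTP_URL" -e "$(build_deploy_script build public "$(date +%Y%m%d%H%M%S)")", where build and public stand in for the real source and target directories.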

Can we wget with file list and renaming destination files?

I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=X will avoid the generation of the host's subdirectories.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directory.
This works nice, no problem.
Now, to the issue:
Let's say my list-of-files-to-download.txt contains URLs like these:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with -N, and I need that. I've tried to use -nd, but it doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of URLs the way I do now, keeping my parameters.
b.- get all files into the same directory and be able to rename each file.
Does anybody have any solution to this?
Thanks in advance.
I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, "no-clobber" is actually a misnomer in this mode---it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be specified at the same time as -N.
A combination with -O/--output-document is only accepted if the given output file does not exist.
Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However, the logic is sound, and with a few modifications you can get it to work on other platforms and scripting languages as well.
Sample pseudocode bash script -
# Read the list one URL per line; quoting "$i" protects URLs with special characters.
while IFS= read -r i; do
    wget <all your flags except the -i flag> "$i" -O /path/to/custom/directory/filename
done < list-of-files-to-download.txt
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check whether the file exists on the disk, and then decide whether to rename the temp file to the name that you want.
This offers much more control over your downloads.
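One way to fill in the "parse $i to get the filename" step is to build the local name from the URL path, so that files with the same basename in different remote directories no longer collide. The helper names unique_name and download_all below are hypothetical, and the sketch leaves out -N because it does not combine with -O.

```shell
# unique_name URL
# Turn the URL path into a flat file name, e.g.
# http://website/directory1/pic.jpg becomes directory1_pic.jpg
unique_name() {
  path=${1#*://}   # drop the scheme
  path=${path#*/}  # drop the host
  printf '%s\n' "$path" | tr '/' '_'
}

# download_all LIST_FILE DEST_DIR
# Fetch every URL in LIST_FILE into DEST_DIR under its flattened name.
download_all() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip blank lines
    wget "$url" -O "$2/$(unique_name "$url")"
  done < "$1"
}
```

A call such as download_all list-of-files-to-download.txt /directory/for/downloaded/files would then store directory1_picture-same-name.jpg, directory2_picture-same-name.jpg and so on side by side. Other flags from the original command (user agent, referer, timeout) can be added to the wget line.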

rsync using both --update and --append

I have a problem using rsync. Actually I want to synchronize 2 folders, one on the server (192.168.1.5) and one on my pc, the rsync command that I use is:
rsync -rvzru --exclude=._* xxx@192.168.1.5::Folder /Users/blabla/boh/received
It works fine, if a file has been modified on my server it modifies it also on my pc, skipping all the others. Now I want to introduce the resume feature, so I add --append like this:
rsync -rvzru --exclude=._* --append xxx@192.168.1.5::Folder /Users/blabla/boh/received
I also tried --partial, but it doesn't work at all: it skips the incomplete files while still updating all the others that were modified. If I remove -u from the command, the resume works fine, but if I modify a file on the server, I can't see the modification on my PC.
rsync -rvzr --exclude=._* --append xxx@192.168.1.5::Folder /Users/blabla/boh/received
Is there a way to make this work for both resuming and updating files? Thanks