How to download all bucket files (an issue with the gsutil -m flag) - google-cloud-storage

I am trying to copy all files from a Cloud Storage bucket recursively, and as far as I can tell the problem is related to the -m flag.
The command I am running:
gsutil -m cp -r gs://{{ src_bucket }} {{ bucket_backup }}
I am getting something like this:
CommandException: 1 file/object could not be transferred.
where the number of files/objects differs every time.
After investigating, I have tried reducing the number of threads/processes used with the -m option, but this has not helped, so I am looking for advice. I have about 170 MiB of data in the bucket, which is approximately 300k files, and I need to download them as fast as possible.
UPD:
Logs with the -L flag:
[Errno 2] No such file or directory: '<path>/en_.gstmp' -> '<path>/en'
6 errors like that.

The root of the issue might be that a directory and a file with the same name exist in the GCS bucket. Try executing the command with the -L flag so you get additional logs on the execution and can find the file that is causing this error.
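As a sketch (file and path names here are placeholders), the -L option writes a per-object manifest CSV that you can filter for the failing objects:
gsutil -m cp -L cp_manifest.csv -r gs://src_bucket ./bucket_backup
grep -v ',OK,' cp_manifest.csv
The grep is a crude filter that keeps the header line plus any row whose Result column is not OK. As far as I know, re-running the same command with the same manifest file retries only the objects that have not been copied successfully yet.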
I would suggest deleting that file, making sure there is no directory with that name in the bucket, and then uploading the file to the bucket again.
Also check whether any directories were created with the same name as a JAR file; delete them and then proceed with copying the files.
Also check whether the required file already exists at the destination; if it does, delete it there and run the copy again.
There are alternatives to cp; for example, it is possible to transfer files using gsutil rsync, as described here.
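For instance, a minimal parallel rsync sketch (bucket and directory names are placeholders):
gsutil -m rsync -r gs://src_bucket ./bucket_backup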
You can also check similar threads: thread1, thread2 and thread3.

Related

Nearline - Backup Solution - Versioning

I've set up some Nearline buckets and enabled versioning and object lifecycle management. The use case is to replace my current backup solution, Crashplan.
Using gsutil I can see the different versions of a file using a command like gsutil ls -al gs://backup/test.txt.
First, is there any way of finding files that don't have a live version (e.g. deleted) but still have a version attached?
Second, is there any easier way of managing versions? For instance, if I delete a file from my PC, it will no longer have a live version in my bucket but will still have the older versions associated with it. Say, if I didn't know the file name, would I just have to do a recursive ls on the entire bucket and sift through the output?
Would love a UI that supported versioning.
Thanks.
To check whether an object currently has no live version, use the x-goog-if-generation-match header set to 0, for example:
gsutil -h x-goog-if-generation-match:0 cp file.txt gs://bucket/file.txt
will fail (PreconditionException: 412 Precondition Failed) if the file has a live version and will succeed if it has only archived versions.
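For the first question, a rough bulk check (bucket name is a placeholder) is to diff the live listing against the all-versions listing; the sed strips the #generation suffix that -a appends:
gsutil ls 'gs://backup/**' | sort -u > live.txt
gsutil ls -a 'gs://backup/**' | sed 's/#[0-9]*$//' | sort -u > all.txt
comm -13 live.txt all.txt
The last command prints objects that exist only as archived versions.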
In order to automatically synchronize your local folder with a folder in the bucket (or the other way around), use gsutil rsync:
gsutil rsync -r -d ./test gs://bucket/test/
notice the trailing / in gs://bucket/test/; without it you will receive:
CommandException: arg (gs://graham-dest/test) does not name a directory, bucket, or bucket subdir.
-r synchronizes all the directories in ./test recursively to gs://bucket/test/
-d deletes all files from gs://bucket/test/ that are not found in ./test
Regarding a UI, there already exists a feature request. I don't know anything about third-party applications, however.

gsutil rsync uploading then immediately deleting file, leaving source and target in different states

I have a script which is running gsutil rsync -r -d -c, and occasionally it will leave the source and target directories out of sync. The last file in the list (named version.json) is first uploaded and then immediately deleted.
Has anybody encountered this bug?
Additional information:
versioning is turned off in the target bucket
This occurs when attempting to overwrite the entire contents of the target bucket, which are already present.

Compare file sizes and download if they're different via wget

I'm downloading some .mp3 files (all legal) via wget:
wget -r -nc files.myserver.com
I sometimes have to stop the download, and at those times the file is only partially downloaded. For example, a 10-minute record.mp3 file becomes a 4-minute record.mp3 file. It plays correctly but is incomplete.
If I run the same command again, wget skips record.mp3 because the file already exists on my local computer, even though it isn't complete.
I wonder if there is a way to compare the file sizes and, if the size on the remote server and on my local computer isn't the same, re-download the file. (I've learned that the --spider option gives the file size, but is there a way to automatically check the file sizes and decide whether to download?)
I would go with wget's -N option for timestamping, but note that wget will only compare the file sizes if you also specify the --no-if-modified-since option. Without it, incomplete files are indeed skipped on the next run because they receive a timestamp of the current time, which is newer than that on the server.
The reason is probably that with only -N, a GET request is sent for the file with the If-Modified-Since field set. The server responds with either 200 or 304, but the 304 doesn't contain the file size so wget can't check it.
With --no-if-modified-since wget sends a HEAD request instead to get the timestamp and file size, and checks both.
What I use for recursive download of a folder:
wget -T 300 -nv -t 1 -r -nd -np -l 1 -N --no-if-modified-since -P $my_folder $my_url
With:
-T 300: Set the network timeout to 300 seconds
-nv: Turn off verbose without being completely quiet
-t 1: Set number of tries to 1
-r: Turn on recursive retrieving
-nd: Do not create a hierarchy of directories when retrieving recursively
-np: Do not ever ascend to the parent directory when retrieving recursively
-l 1: Specify recursion maximum depth 1
-N: Turn on time-stamping
--no-if-modified-since: Do not send If-Modified-Since header in '-N' mode, send preliminary HEAD request instead
You may try the -c option to continue the download of partially downloaded files; however, the manual gives an explicit warning:
You need to be especially careful of this when using -c in conjunction
with -r, since every file will be considered as an "incomplete
download" candidate.
While there is no perfect solution to this problem, you could try the -N option to turn on timestamping. This might prevent errors when the file has changed on the server, but only if the server supports timestamping and partial downloads. Try it and see how it goes.
wget -r -N -c files.myserver.com
If you need to check whether a file was partially downloaded (has a different size) or was updated on the remote server (by timestamp), and should therefore be updated locally, you need to use the -N option.
Here is some additional info about the -N (--timestamping) option from the Wget docs:
If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the
time-stamps say.
Source: https://www.gnu.org/software/wget/manual/wget.html (Chapter 5, Time-Stamping)

using wget to overwrite file but use temporary filename until full file is received, then rename

I'm using wget in a cron job to fetch a .jpg file into a web server folder once per minute (with same filename each time, overwriting). This folder is "live" in that the web server also serves that image from there. However if someone web-browses to that page during the time the image is being fetched, it is considered a jpg with errors and says so in the browser. So what I need to do is, similar to when Firefox is downloading a file, wget should write to a temporary file, either in /var or in the destination folder but with a temporary name, until it has the whole thing, then rename in an atomic (or at least negligible-duration) step.
I've read the wget man page and there doesn't seem to be a command line option for this. Have I missed it? Or do I need to do two commands in my cron job, a wget and a move?
There is no way to do this purely with GNU Wget.
wget's job is to download files, and it does that. A simple one-line script can achieve what you're looking for:
$ wget -O myfile.jpg.tmp example.com/myfile.jpg && mv myfile.jpg{.tmp,}
Since mv is atomic, at least on Linux when source and destination are on the same filesystem, you get an atomic update to a fully downloaded file.
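For the cron use case, a hypothetical crontab entry (URL and paths are made up) that keeps the temp file in the destination folder so the final mv stays on the same filesystem and remains atomic:
* * * * * wget -q -O /var/www/html/live.jpg.tmp http://camera.example.com/latest.jpg && mv /var/www/html/live.jpg.tmp /var/www/html/live.jpg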
Just wanted to share my solution:
alias wget='func(){ (wget --tries=0 --retry-connrefused --timeout=30 -O download_pkg.tmp "$1" && mv download_pkg.tmp "${1##*/}") || rm download_pkg.tmp; unset -f func; }; func'
It creates a function that receives a URL parameter and downloads the file to a temporary name. If the download succeeds, the file is renamed to the correct filename, which is extracted from parameter $1 with ${1##*/}; if it fails, the temp file is deleted. If the operation is aborted, the temp file will be replaced on the next run. Finally, unset -f removes the function definition once the alias has executed.
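With that alias defined, a hypothetical invocation would look like this (URL is a placeholder):
wget https://example.com/files/archive.tar.gz
The file is first downloaded as download_pkg.tmp and then renamed to archive.tar.gz on success.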

Capistrano - How to put files in the shared folder?

I am new to Capistrano and I saw there is a shared folder and also the option :linked_files. I think the shared folder is used to keep files between releases. But my question is: how do files end up in the shared folder?
Also, if I want to symlink another directory into the current directory, e.g. a static folder at some path, how do I add it to linked_dirs?
Lastly, how do I set chmod 755 on linked_files and linked_dirs?
Thank you.
Folders inside your app are symlinks to folders in the shared directory. If your app writes to log/production.log, it will actually write to ../shared/log/production.log. That's how the files end up being in the shared folder.
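As a rough illustration (paths and release timestamp are made up), the linking Capistrano performs for linked_dirs and linked_files boils down to something like:
ln -nfs /var/www/app/shared/log /var/www/app/releases/20240101120000/log
ln -nfs /var/www/app/shared/config/database.yml /var/www/app/releases/20240101120000/config/database.yml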
You can see how this works by looking at the feature specs or tests in Capistrano.
If you want to chmod these shared files, you can just do it once directly over ssh since they won't ever be modified by Capistrano after they've been created.
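For example, a one-off chmod over ssh might look like this (host, user and path are placeholders):
ssh deploy@example.com 'chmod 755 /var/www/app/shared/config/database.yml'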
To add a linked directory, in your deploy.rb:
set :linked_dirs, %w{bin log tmp/backup tmp/pids tmp/cache tmp/sockets vendor/bundle}
or
set :linked_dirs, fetch(:linked_dirs) + %w{public/system}
Capistrano 3.5+
Capistrano 3.5 introduced append for array fields. From the official docs, you should use these:
For Shared Files:
append :linked_files, %w{config/database.yml}
For Shared Directories:
append :linked_dirs, %w{bin log public/uploads vendor/bundle}
I've written a task for Capistrano 3 to upload your config files to the shared folder of each of your servers; it'll check these directories in order:
config/deploy/config/:stage/*.yml
config/deploy/config/*.yml
And upload all config files it finds. It'll only upload files that have changed. Note also that if you have the same file in both directories, the second one will be ignored.
Here's the code: https://gist.github.com/Jesus/448d618c83fb0445ebbf
One last thing: this task just uploads the config files to your remote shared folder; you still need to set linked_files in config/deploy.rb, e.g.:
set :linked_files, %w{config/database.yml config/aws.yml}
UPDATE:
If you're using Git, you'll probably want to ignore these files:
echo "config/deploy/config/*" >> .gitignore
There are 3 simple steps you can follow to share a file that you don't want to change across consecutive releases. First, add your file to the linked_files list:
set :linked_files, fetch(:linked_files, []).push('config.php')
Do this for all the files you want to share. Next, copy the file from your local machine to the remote server's shared folder through scp:
scp config.php deployer@amazon:~/capistrano/shared/config.php
Now, deploy through the command given below:
bundle exec cap staging deploy
Of course, staging can be changed as per your requirements; it may be production, sandbox, etc.
One more thing: you don't want your team members to commit such files, so add this file to your .gitignore and push that to your remote Git repo.
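For example (assuming the file is config.php as above):
echo "config.php" >> .gitignore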
For Capistrano 3.5+, as specified in the official docs:
append :linked_dirs, ".bundle", "tmp"
For me, none of the above worked, so I ended up adding two tasks to the end of the deployment process:
namespace :your_company do
  desc "remove index.php"
  task :rm_files do
    on roles(:all) do
      execute "rm -rf #{release_path}/index.php"
    end
  end
end

namespace :your_company do
  desc "add symlink to index.php"
  task :add_files do
    on roles(:all) do
      execute "ln -sf #{shared_path}/index.php #{release_path}/index.php"
    end
  end
end
after "deploy:finished", "your_company:rm_files"
after "deploy:finished", "your_company:add_files"