Google Cloud Tar Without Leaving the Cloud - google-cloud-storage

If I want to move files from a nearline to a coldline bucket and I want to tar the files during the move, is that possible?
I’m using gsutil with the -z/-Z flags, but I can’t get them to do what I’m asking for above. That makes me think they just apply gzip during the transfer.

You are right. The -z and -Z flags apply gzip encoding to those files, but the files keep the Content-Type they had.
(Note: mv takes the same options as cp, minus the -R flag.)
The key aspect here is that the files are given a Content-Encoding: gzip header, indicating that they are stored compressed on the server.
If you want to save those files as tar archives, you have to create the archives yourself before moving them, either manually or with a script.
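For example, a minimal sketch of that staging approach (bucket names and paths are placeholders; run it from, say, a Compute Engine instance so the traffic stays inside Google's network):
mkdir -p /tmp/staging
gsutil -m cp -r gs://my-nearline-bucket/mydir /tmp/staging/   # pull the objects down
tar -czf /tmp/mydir.tar.gz -C /tmp/staging mydir              # tar (and gzip) locally
gsutil cp /tmp/mydir.tar.gz gs://my-coldline-bucket/          # push the archive up
rm -rf /tmp/staging /tmp/mydir.tar.gz                         # clean up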

Related

GSUTIL CP using file size

I am trying to copy files from a directory on my Google Compute Engine instance to a Google Cloud Storage bucket. I have it working; however, there are ~35k files but only ~5k have any data in them.
Is there any way to only copy files above a certain size?
I've not tried this but...
You should be able to do this using a resumable transfer and setting the threshold to 5k (the default is 8 MiB). See: https://cloud.google.com/storage/docs/gsutil/commands/cp#resumable-transfers
It may be advisable to set BOTO_CONFIG specifically for this copy, (a) to be intentional and (b) to remind yourself how it works. See: https://cloud.google.com/storage/docs/boto-gsutil
Resumable uploads have the added benefit, of course, of resuming after any failures.
Recommend: try this on a small subset and confirm it works to your satisfaction.
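A sketch of what that could look like (untested, as above; the config file path is arbitrary, and resumable_threshold is the [GSUtil] boto setting that controls when transfers become resumable):
cat > /tmp/boto-resumable-5k <<'EOF'
[GSUtil]
resumable_threshold = 5120
EOF
BOTO_CONFIG=/tmp/boto-resumable-5k gsutil -m cp /path/to/files/* gs://bucket2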
While it's not possible with gsutil alone, you can do it by filtering the file names and feeding them to cp with the -I flag. On a Linux Compute Engine instance you can use the du and awk commands:
du -b * | awk '$1 > 1000 {print $2}' | gsutil -m cp -I gs://bucket2
The command gets the size in bytes of each file in the current directory with du -b * (plain du reports blocks, not bytes, hence the -b flag) and copies only the files larger than 1000 bytes to bucket2; you can change that value to suit your needs.

Nearline - Backup Solution - Versioning

I've set up some Nearline buckets and enabled versioning and object lifecycle management. The use-case is to replace my current backup solution, Crashplan.
Using gsutil I can see the different versions of a file using a command like gsutil ls -al gs://backup/test.txt.
First, is there any way of finding files that don't have a live version (e.g. deleted) but still have a version attached?
Second, is there any easier way of managing versions? For instance, if I delete a file from my PC, it will no longer have a live version in my bucket but will still have the older versions associated. Say I didn't know the file name: would I just have to do a recursive ls on the entire bucket and sift through the output?
Would love a UI that supported versioning.
Thanks.
To check if the object currently has no live version, use the x-goog-if-generation-match header equal to 0, for example:
gsutil -h x-goog-if-generation-match:0 cp file.txt gs://bucket/file.txt
will fail (PreconditionException: 412 Precondition Failed) if the file has a live version, and will succeed if it has only archived versions.
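For the first question there is no built-in filter, but a hedged sketch that diffs the all-versions listing against the live listing (bucket name is a placeholder) could look like:
comm -13 \
    <(gsutil ls gs://backup/** | sort -u) \
    <(gsutil ls -a gs://backup/** | sed 's/#[0-9]*$//' | sort -u)
# prints object names that have archived versions but no live one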
In order to automatically synchronize your local folder with a folder in the bucket (or the other way around), use gsutil rsync:
gsutil rsync -r -d ./test gs://bucket/test/
Notice the trailing / in gs://bucket/test/; without it you will receive:
CommandException: arg (gs://graham-dest/test) does not name a directory, bucket, or bucket subdir.
-r synchronizes all the directories in ./test recursively to gs://bucket/test/
-d deletes all files from gs://bucket/test/ that are not found in ./test
Regarding a UI, there is already a feature request. I don't know anything about third-party applications, however.

Can we wget with a file list and rename the destination files?

I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=X will avoid the generation of the host's subdirectories.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directory.
This works nicely, no problem.
Now, to the issue:
Let's say my files-to-download.txt has these kinds of files:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with -N, and I need that. I've tried to use -nd, but it doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of URLs the way I do now, keeping my parameters.
b.- get all the files in the same directory and be able to rename each file.
Does anybody have any solution to this?
Thanks in advance.
I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, "no-clobber" is actually a misnomer in this mode---it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be specified at the same time as -N.
A combination with -O/--output-document is only accepted if the given output file does not exist.
Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
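Note from the quoted text that -nc may not be combined with -N, so you would be swapping one for the other, e.g. (untested; flags otherwise copied from the question):
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -nc -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt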
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However, the logic is sound, and with a bit of modification you can get it to work on other platforms/shells as well.
Sample pseudocode bash script (the placeholders need your actual values):
while read -r i; do
    wget <all your flags except the -i flag> "$i" -O /path/to/custom/directory/filename
done < list-of-files-to-download.txt
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check if the file exists on the disk, and then take a decision to rename the temp file to the name that you want.
This offers much more control over your downloads.
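For instance, a hedged sketch of that enhanced loop (flag values are copied from the question; the timeout value and the name-mangling scheme are arbitrary choices):
dest=/directory/for/downloaded/files
while read -r url; do
    # build a unique local name from the URL path, e.g.
    # http://website/directory2/picture-same-name.jpg -> directory2_picture-same-name.jpg
    name=$(echo "$url" | sed 's|^[a-z]*://[^/]*/||; s|/|_|g')
    tmp=$(mktemp)
    if wget --user-agent='some-agent' --referer=http://some-referrer.html --timeout=60 -O "$tmp" "$url"; then
        # keep the old copy unless the content actually changed (a poor man's -N)
        if [ -f "$dest/$name" ] && cmp -s "$tmp" "$dest/$name"; then
            rm "$tmp"
        else
            mv "$tmp" "$dest/$name"
        fi
    else
        rm -f "$tmp"
    fi
done < list-of-files-to-download.txt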

php wget and provide a time for last modified

I know that wget has an -N option (and also a timestamping option for non-header-sending protocols like FTP), but how would I specify a time and date for wget?
For example: I don't want to compare local and remote files; I would like to directly specify a time and date for wget to use. I know the following is not correct, it just serves the purpose of the example:
wget -N **--jan-2-2013-05:00** -r ftp://user:myPassword#ftp.example.com/public_html
Is it possible to give wget a timestamp to use when checking last modified?
No, there is no such option.
Timestamping behavior is pretty much automatic in wget. When you use the --timestamping or -N option, the file will not be downloaded if:
a) The file already exists locally
AND
b) The file has the same size locally and remotely
AND
c) The timestamp of the remote file is equal or older than the timestamp of the local file (in http it compares against the Last-Modified header)
So you can emulate the behavior you are asking for if you:
1. Create the files you are going to download locally, with the same size as those in the remote location (conditions a and b).
2. Touch the files so that they have the timestamp you want (condition c).
Example:
For a single file whose name we already know, say foo.txt, of size 7348 bytes, where we only want to fetch it if the remote copy is newer than 2013-07-24T14:27, that would be:
dd if=/dev/zero of=foo.txt bs=1 count=7348
touch -t 201307241427 foo.txt
Now if you use wget -N http://url/path/foo.txt, wget will work like you asked.
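To apply the same trick to many files, a hedged sketch (sizes.txt is a hypothetical list of "<size> <name>" pairs matching the remote files):
while read -r size name; do
    # conditions a and b: create a local placeholder with the remote file's size
    dd if=/dev/zero of="$name" bs=1 count="$size" 2>/dev/null
    # condition c: give it the cutoff timestamp
    touch -t 201307241427 "$name"
done < sizes.txt
After that, wget -N fetches only the files modified after the chosen date.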

How to create an identical gzip of the same file?

I have a file whose contents are unchanged between runs. It is passed into gzip and only the compressed form is stored. I'd like to be able to generate the gzip again, and only update my copy should they differ. As it stands, diffing tools (diff, xdelta, subversion) see the files as having changed.
Premise: I'm storing a mysqldump of an important database in a subversion repository. The intention is that a cronjob periodically dumps the db, gzips it, and commits the file. Currently, every time the file is dumped and gzipped it is considered to differ. I'd prefer not to have my revision numbers needlessly increase every 15 minutes.
I realize I could dump the file as just plain text, but I'd prefer not as it's rather large.
The command I am currently using to generate the dumps is:
mysqldump $DB --skip-extended-insert | sed '$d' | gzip -n > $REPO/$DB.sql.gz
The -n instructs gzip to remove the filename/timestamp information. The sed '$d' removes the last line of the file where mysqldump places a timestamp.
At this point, I'm probably going to revert to storing it in a plain text fashion, but I was curious as to what kind of solution there is.
Resolved: Mr. Bright was correct; I had mistakenly used a capital N when the correct argument was a lowercase one.
The -N instructs gzip to remove the filename/timestamp information.
Actually, that does just the opposite. -n is what tells it to forget the original file name and time stamp.
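As an aside, if your mysqldump is new enough to support the --skip-dump-date flag, the trailing "-- Dump completed on ..." line is never written and the sed step can be dropped too:
mysqldump $DB --skip-extended-insert --skip-dump-date | gzip -n > $REPO/$DB.sql.gz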
I think gzip is preserving the original date and timestamp on the file(s) which will cause it to produce a different archive.
-N --name
When compressing, always save the original file
name and time stamp; this is the default. When
decompressing, restore the original file name and
time stamp if present. This option is useful on
systems which have a limit on file name length or
when the time stamp has been lost after a file
transfer.
But watch out: two gzips made at different times of the same unchanged file differ (unless -n is used), because gzip writes its own creation timestamp to the header of the gzip file. Thus apparently different gzips can contain the exact same content.
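A quick way to see both behaviors (the creation time only ends up in the header when gzip reads from a pipe or stdin, as in the mysqldump case above):
printf 'same content\n' > a.txt
cat a.txt | gzip > one.gz;  sleep 2;  cat a.txt | gzip > two.gz
cmp one.gz two.gz || echo "differ: header carries the creation time"
cat a.txt | gzip -n > one.gz;  sleep 2;  cat a.txt | gzip -n > two.gz
cmp -s one.gz two.gz && echo "identical with -n"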