Compute Engine: using gsutil to download a tgz file gives a crcmod error - google-cloud-storage

I find that if you create a Compute Engine (CentOS or Debian) machine and use gsutil to download (cp) a tgz file, it causes a crcmod error...
$ gsutil cp gs://mybucket/data.tgz .
Copying gs://mybucket/data.tgz...
CommandException:
Downloading this composite object requires integrity checking with CRC32c, but
your crcmod installation isn't using the module's C extension, so the hash
computation will likely throttle download performance. For help installing the
extension, please see:
$ gsutil help crcmod
To download regardless of crcmod performance or to skip slow integrity checks,
see the "check_hashes" option in your boto config file.
Currently I use "check_hashes = never" to bypass the check...
$ vi /etc/boto.cfg
[GSUtil]
default_project_id = 429100748693
default_api_version = 2
check_hashes = never
...
But what is the root cause, and is there a good solution to this problem?

The object you're trying to download is a composite object, which basically means it was uploaded in parallel chunks. gsutil automatically does this when uploading objects larger than 150M (a configurable threshold), to provide better performance.
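(As an aside: if you'd rather not have composite objects created in the first place, that threshold lives in the boto config. A sketch, using the documented parallel_composite_upload_threshold setting, where 0 disables parallel composite uploads entirely:)
$ vi /etc/boto.cfg
[GSUtil]
# raise the threshold, or set it to 0 to disable parallel composite uploads
parallel_composite_upload_threshold = 150M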
Composite objects only have a CRC32C checksum (no MD5), so in order to validate data integrity when downloading composite objects, gsutil needs to compute a CRC32C checksum. Unfortunately, the libraries distributed with Python don't include a compiled CRC32C implementation, so unless you install one, gsutil falls back to a pure-Python implementation of CRC32C that's quite slow. That warning is printed to let you know there's a way to fix this performance problem. Please run:
gsutil help crcmod
and follow the instructions there for installing crcmod with its compiled C extension. It's pretty easy to do, and well worth the effort.
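For reference, the gist of those instructions is to install a C compiler plus the Python headers and then reinstall crcmod from source so the C extension gets built. On Debian/Ubuntu it looks roughly like this (package and pip names vary by distro and Python version):
$ sudo apt-get install gcc python-dev python-setuptools
$ sudo pip uninstall crcmod
$ sudo pip install --no-cache-dir -U crcmod
$ gsutil version -l    # should now report "compiled crcmod: True"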
One other note: I strongly recommend against setting check_hashes = never in your boto config file. That will disable integrity checking, which means it's possible your download could get corrupted and you wouldn't know it. You want data integrity checking enabled to ensure you're working with correct data.
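If you've already added check_hashes = never, the safer move once compiled crcmod is installed is to delete that line or set it back to the documented default, which to my knowledge is if_fast_else_fail:
$ vi /etc/boto.cfg
[GSUtil]
default_project_id = 429100748693
default_api_version = 2
check_hashes = if_fast_else_fail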

Related

Extending PCR of TPM2.0 during boot by using buildroot with uboot

I feel very stupid asking this question, since originally I thought I would just have to enable a config option and afterwards it would run smoothly. But I cannot find the correct settings.
I have an embedded system and build a rootfs, Linux kernel, U-Boot, etc. using Buildroot.
Now I want to implement remote attestation. Therefore I want the different steps of the boot process to extend the PCRs of my TPM 2.0 with the hash values of the next step.
I can run commands on the TPM using tpm2-tools when the system is booted.
I thought that U-Boot, the kernel, etc. all have their own TPM driver, so it should not be a problem for them to extend the PCRs.
But how do I enable this?
Thank you so much for your answer.
I am answering my own question in case someone has a similar problem.
The problem is solved by creating a U-Boot patch. To boot the operating system, U-Boot runs several steps, and these are extended by my patch.
I copy the rootfs into memory, hash it, and extend a PCR with the result. The following commands are needed:
$ tpm2 init // init the tpm
$ tpm2 start TPM2_SU_CLEAR // start the tpm
$ mmc read $loadaddr 0x800 0x80000 //read your rootfs
$ hash sha256 $loadaddr *0x10000000 // hash over it
$ tpm2 pcr_extend 4 0x10000000 // extend a pcr with the hashed value
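For anyone reproducing this: the tpm2 and hash commands must be compiled into U-Boot first. The fragment below is only a sketch; the symbol names are taken from mainline U-Boot Kconfig and the driver line assumes a SPI-attached TPM, so adjust for your board, bus and U-Boot version (in Buildroot you can apply it as a U-Boot config fragment):
# U-Boot config fragment (assumed symbol names, SPI-attached TPM)
CONFIG_TPM=y
CONFIG_TPM_V2=y
CONFIG_TPM2_TIS_SPI=y
CONFIG_CMD_TPM=y
CONFIG_CMD_HASH=y
CONFIG_SHA256=y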
Hopefully someone finds this helpful. If you find an error, please comment.
EDIT: added missing asterisk

OS development: creating a bootable ISO from files

I'm studying OS development using the BrokenThorn resources, but with slightly different tools: I use CentOS, NASM and QEMU as my test/dev environment. I've been facing some issues while creating a bootable image file with a secondary loader.
I've got two files:
1. bootloader.bin, which is the first-stage loader.
2. stage2.bin, which is the secondary loader.
In order to create bootable img file I do the following:
dd if=/dev/zero of=floppy.iso bs=1024 count=1440    # create an empty image file
mkfs.vfat -F 12 floppy.iso    # create a FAT-12 file system in the image
dd if=../bin/bootloader.bin of=floppy.iso bs=512 count=1 conv=notrunc    # write the first-stage loader to the boot sector
sudo mount -o loop floppy.iso /mnt/floppy/    # try to mount the file system to copy in the secondary loader using the previously created FAT-12 file system
In the last step I'm getting the following error:
mount: /dev/loop0 is write-protected, mounting read-only
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
Can you please help me understand what I'm doing wrong, and what other ways I can use to create a bootable image with a file system on board?
Thanks!
I once stumbled upon a similar problem, and this answer may be of help to you.
However, I would strongly recommend switching to a bootloader like GRUB and spending your time and effort on developing your actual OS. For that I would recommend grub-mkrescue, as it's simple to use and lets you quickly create an ISO that you can either burn or feed to a virtual machine. Otherwise, you may just drown in all these minor things like enabling protected mode, loading your stages and so on.
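To illustrate the GRUB route: assuming you have a Multiboot-compliant kernel image (kernel.bin is a placeholder name here), the usual workflow is roughly:
$ mkdir -p isodir/boot/grub
$ cp kernel.bin isodir/boot/kernel.bin
$ cat > isodir/boot/grub/grub.cfg << 'EOF'
menuentry "myos" {
    multiboot /boot/kernel.bin
}
EOF
$ grub-mkrescue -o myos.iso isodir
$ qemu-system-i386 -cdrom myos.iso
Note that grub-mkrescue needs xorriso installed, and on Debian-family systems the grub-pc-bin package for BIOS images.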

Is there any advantage to using gsutil or the google cloud storage API in production transfers?

Which is better to use for production transfers: gsutil, or the Google Cloud Storage API?
gsutil uses a Google Cloud Storage API to transfer data, specifically the JSON API (by default, you can change it). Its main advantage over using the API directly is that it has been tuned to transfer data quickly. For example, it can open up multiple simultaneous connections to GCS, each of which is uploading or downloading part of the file concurrently, which in many cases can provide a significant boost to total throughput.
There's no reason that programming against the API directly could not also provide the same or even better performance, but I would expect gsutil to be at least a little bit faster on average if you implement things in the simplest possible manner.
I'm not sure this is adding much over what Brandon has said. I'm very new to Google Cloud Storage and Python, but I've quickly found that I prefer to use the gsutil command line over the Python client library wherever possible. I create compute instances that copy a few GB of input data from Cloud Storage after they have booted. I found that it's both neater and faster to do this with the gsutil command line, so in my Python code I use:
import subprocess
subprocess.call("gsutil -m cp gs://my-uberdata-archive/* /home/<username>/rawdata/", shell=True)
The main reasons are that I can do the copy in a single line, whereas it takes several lines using the client library, and, as Brandon points out, gsutil supports multi-threaded transfers with the '-m' flag. I haven't found an equivalent way to do this with the Python client library yet.
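For comparison, here's roughly what the same copy looks like with the Python client library (google-cloud-storage); the bucket name and local path are the same placeholders as above, and this sketch downloads objects sequentially rather than in parallel:
import os
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-uberdata-archive")
for blob in bucket.list_blobs():
    if blob.name.endswith("/"):
        continue  # skip folder placeholder objects
    # mirror the object path under the local destination directory
    dest = os.path.join("/home/<username>/rawdata", blob.name)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    blob.download_to_filename(dest)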

Why is gsutil rsync re-downloading all our files?

We've been using gsutil -m rsync -r to keep dev and deploy boxes in sync with a GCS bucket for nearly 2 years without any problem. There are about 85k objects in the bucket.
Until recently, this worked perfectly: we'd run a deploy-box -> GCS rsync every 15 minutes or so, to keep all newly uploaded resources backed up, and then a GCS -> dev-box rsync whenever we wanted to refresh the local dev data (running on OS X El Capitan).
Within the last couple of months, though, the GCS->dev rsync has started to bloat, downloading more and more images.
Initially I just thought "great, we're getting more resources uploaded", but it's been growing way faster than the data, until today when it seems to be downloading the whole 85k images.
I've double-checked that I'm in the right place, the command is correct, the paths are correct, etc. Even though the gsutil output is scrolling by with reams and reams of "Copying..." and "Downloading..." messages, making good parallel use of our 100 Mbps connection, when I go to another terminal and run find . -type f | wc -l on the destination directory every 10 seconds, it shows that barely 2 or 3 new files are being added per minute. I look at the modification times of files that gsutil says it's downloading right now, and the large majority are old; plenty haven't changed in a year or more. In other words: it's downloading all the data, using tons of time and bandwidth, all for the sake of a few hundred files.
Has something changed in recent OSX gsutil versions? Is there possibly a bug? How would I even start to go about tracking this down? Or reporting it? The newsgroups gsutil-discuss and gs-discussion have been archived, and the talk in gce-discussion is all about using gsutil from GCE instances.
Thanks!
I had a similar issue where the same files were synced over and over. I don't have that many files, so you might need to check the performance impact, but I decided to use the -c option to force comparing checksums instead of mtime, which was being modified locally by my build process.
I think (and hope) the documentation is slightly wrong when it states that it will "compare checksums for files if the size of source and destination as well as mtime match", as it seems to use checksums even if mtime does not match.
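For reference, the invocation with checksum comparison forced on looks something like this (bucket and local path are placeholders):
$ gsutil -m rsync -r -c gs://mybucket/resources ./local-resources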
gsutil 4.20 (released 2016-07-20) modified the change detection algorithm for rsync. Instead of comparing only the size of the local file with its cloud counterpart, it now compares both the size and file modification time of local files. The file modification time is stored in the custom user metadata for the file when it is uploaded with rsync. If that doesn't exist the object creation time is used.
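If you want to check whether your objects actually carry that mtime metadata, you can inspect an object's custom metadata with gsutil ls -L; the key is, as far as I can tell, goog-reserved-file-mtime:
$ gsutil ls -L gs://mybucket/resources/some-object | grep -i mtime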

On resume gsutil seems to re-upload files

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, is there any diagnosis and fix to get the correct resume behavior? Thanks in advance!
You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also, according to this, when transferring files over 2MB, gsutil automatically uses a resumable transfer mode.
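That 2MB cutoff is itself a boto config option; if you want to experiment with it, the relevant setting (under [GSUtil], in bytes) should be resumable_threshold:
$ vi /etc/boto.cfg
[GSUtil]
resumable_threshold = 2097152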
If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.
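Once you're on gsutil v4, an rsync-based upload of the same tree would look something like this (reusing the placeholders from above); unlike cp -n, rsync compares source and destination and only transfers what differs:
$ gsutil -m rsync -r <disk-top-directory> gs://<bucket>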