Can't download GSuite exported data using gsutil - google-cloud-storage

I am trying to download the exported data from my GSuite (Google Workspace) account. I ran the data export tool and the data is sitting in a bucket. I want to download all of the files, but it says the only way to download multiple files is to use the gsutil utility.
I installed it using pip install -U gsutil.
I tried running the following command:
gsutil cp -r \
gs://takeout-export-3ba9a6a2-c080-430a-bece-6f830889cc83/20201202T070520Z/ \
gs://takeout-export-3ba9a6a2-c080-430a-bece-6f830889cc83/Status\ Report.html \
.
...but it failed with an error:
ServiceException: 401 Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object.
I suppose that is because I am not authenticated. I tried going through the motions with gsutil config, but it is now asking me for a "Project ID", which I cannot find anywhere in the cloud storage web page showing the bucket with the exported files.
I tried following the top answer for this question, but the project ID no longer appears to be optional.
How do I download my files?

The project ID is "optional" in the sense that it's only used in certain scenarios. For example, when you create a bucket without explicitly specifying a project for it to live in, that default project is used as the bucket's parent. For most operations, like your scenario of copying existing GCS objects to your local filesystem, the default project ID doesn't matter; you can type whatever you want at the project ID prompt, since the point of running gsutil config here is just to generate your boto file for authentication.
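For example, a minimal sketch of that flow (the project ID entered at the prompt is just a placeholder):
gsutil config
# Follow the OAuth flow in the browser and paste the authorization code.
# When asked "What is your project-id?" any value works for copying, e.g.:
#   my-placeholder-project
gsutil cp -r \
  "gs://takeout-export-3ba9a6a2-c080-430a-bece-6f830889cc83/20201202T070520Z/" \
  "gs://takeout-export-3ba9a6a2-c080-430a-bece-6f830889cc83/Status Report.html" \
  .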

Related

Why does gsutil cp require storage.objects.delete on versioned bucket?

I'm using a service account to upload a file to a Google Cloud Storage bucket that has versioning enabled. I want to keep the service account's privileges minimal: it only ever needs to upload files, so I don't want to give it permission to delete them. But the upload fails (only after streaming everything!), saying it requires delete permission.
Shouldn't it be creating a new version instead of deleting?
Here's the command:
cmd-that-streams | gsutil cp -v - gs://my-bucket/${FILE}
ResumableUploadAbortException: 403 service-account#project.iam.gserviceaccount.com does not have storage.objects.delete access to my-bucket/file
I've double checked that versioning is enabled on the bucket
> gsutil versioning get gs://my-bucket
gs://my-bucket: Enabled
The permission storage.objects.delete is required when you execute the gsutil cp command, as documented in the Cloud Storage gsutil command reference:
Command: cp
Required permissions:
storage.objects.list* (for the destination bucket)
storage.objects.get (for the source objects)
storage.objects.create (for the destination bucket)
storage.objects.delete** (for the destination bucket)
**This permission is only required if you don't use the -n flag and you insert an object that has the same name as an object that already exists in the bucket.
The Google docs suggest using -n (do not overwrite an existing file) so that storage.objects.delete won't be required. But your use case involves overwriting, so you will need to add storage.objects.delete to the service account's permissions.
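If you do need overwrites and want to stay close to least privilege, one option (a sketch; the bucket and service account names are placeholders) is to grant the object admin role on just that bucket:
gsutil iam ch \
  serviceAccount:service-account@project.iam.gserviceaccount.com:objectAdmin \
  gs://my-bucket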
I tested this with a bucket that has versioning enabled and holds a single object version, using a service account that has only the Storage Object Creator and Storage Object Viewer roles.
If you're overwriting an object, regardless of whether or not its parent bucket has versioning enabled, you must have storage.objects.delete permission for that object.
Versioning works such that when you delete the "live" version of an object, that version is marked as a "noncurrent" version (and the timeDeleted field is populated). In order to create a new version of an object when a live version already exists (i.e. overwriting the object), the transaction that happens is:
Delete the current version
Create a new version that becomes the "live" or "current" version
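You can see this happen with gsutil: listing with -a shows all generations, and after an overwrite the previous generation remains as a noncurrent version (the bucket, object, and generation numbers below are illustrative):
gsutil ls -a gs://my-bucket/file
# gs://my-bucket/file#1607000000000001   <- noncurrent version (pre-overwrite)
# gs://my-bucket/file#1607000000000002   <- live version (post-overwrite)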

Transfer file from Azure Blob Storage to Google Cloud Storage programmatically

I have a number of files that I transferred into Azure Blob Storage via the Azure Data Factory. Unfortunately, this tool doesn't appear to set the Content-MD5 value for any of the blobs, so when I pull that value from the Blob Storage API, it's empty.
I'm aiming to transfer these files out of Azure Blob Storage and into Google Storage. The documentation I'm seeing for Google's Storagetransfer service at https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#HttpData indicates that I can easily initiate such a transfer if I supply a list of the files with their URL, length in bytes and an MD5 hash of each.
Well, I can easily pull the first two from Azure Storage, but the third doesn't appear to automatically get populated by Azure Storage, nor can I find any way to get it to do so.
Unfortunately, my other options look limited. The possibilities so far:
Download file to local machine, determine the hash and update the Blob MD5 value
See if I can't write an Azure Functions app in the same region that can calculate the hash value and write it to the blob for each in the container
Use an Amazon S3 egress from Data Factory and then use Google's support for importing from S3 to pull it from there, per https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#AwsS3Data but this really seems like a waste of bandwidth (and I'd have to set up an Amazon account).
Ideally, I want to be able to write a script, hit go, and leave it alone. I don't have the fastest download rate from Azure, so #1 would be less than desirable as it would take a long time.
Are there any other approaches?
May 2020 update: Google Cloud's Storage Transfer Service now supports Azure Blob Storage as a source. This is a no-code solution.
We used this to transfer ~ 1TB of files from Azure Blob storage to Google Cloud Storage. We also have a daily refresh so any new files in Azure Blob are automatically copied to Cloud Storage.
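As a sketch of the CLI route, assuming a recent Cloud SDK (the account, container, bucket, and credentials file names are placeholders, and flag names may vary between SDK versions):
# creds.json contains the Azure SAS token, e.g. {"sasToken": "<sas-token>"}
gcloud transfer jobs create \
  azure://myaccount/mycontainer gs://my-gcs-bucket \
  --source-creds-file=creds.json
# Scheduling flags (see: gcloud transfer jobs create --help) can add a daily refresh.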
I know it's a bit late to answer this question for you, but it might help others who are trying to migrate data from Azure Blob Storage to Google Cloud Storage.
Google Cloud Storage and Azure Blob Storage, both being storage services, do not provide a command-line environment in which we can simply run transfer commands between them. For that, we need an intermediate compute instance that can actually run the required commands. We will follow the steps below to achieve the cloud-to-cloud transfer.
First and foremost, create a Compute Instance in Google Cloud Platform. You needn't create a computationally powerful instance; all you need is a Debian machine with a 10 GB disk, a 2-core CPU, and 4 GB of memory.
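For example, something like this is enough (a sketch; the name, zone, and machine type are placeholders — e2-medium provides 2 vCPUs and 4 GB of memory):
gcloud compute instances create transfer-vm \
  --zone=us-central1-a \
  --machine-type=e2-medium \
  --image-family=debian-10 \
  --image-project=debian-cloud \
  --boot-disk-size=10GB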
In the early days, you would have downloaded the data to the Compute Instance in GCP and then moved it on to Google Cloud Storage. But now, with the introduction of gcsfuse, we can simply mount a Cloud Storage bucket as a file system.
Once the compute instance is created, log in to it using SSH from the Google Cloud Console and install the following packages.
Install Google Cloud Storage Fuse
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update -y
sudo apt-get install gcsfuse -y
# Create local folder
mkdir local_folder_name
# Mount the Storage Account as a bucket
gcsfuse <bucket_name> <local_folder_path>
Install Azcopy
wget https://aka.ms/downloadazcopy-v10-linux
tar -xvf downloadazcopy-v10-linux
sudo cp ./azcopy_linux_amd64_*/azcopy /usr/bin/
Once these packages are installed, the next step is to create a Shared Access Signature (SAS) key. If you have Azure Storage Explorer, just right-click on the storage account name in the directory tree and select Generate Shared Access Signature.
Now you will have to create a URL to your blob objects. To do this, simply right-click on any of your blob objects, select Properties, and copy the URL from the dialog box.
Your final URL should look like this:
<https://URL_to_file> + <SAS Token>
https://myaccount.blob.core.windows.net/sascontainer/sasblob.txt?sv=2015-04-05&st=2015-04-29T22%3A18%3A26Z&se=2015-04-30T02%3A23%3A26Z&sr=b&sp=rw&sip=168.1.5.60-168.1.5.70&spr=https&sig=Z%2FRHIX5Xcg0Mq2rqI3OlWTjEg2tYkboXr1P9ZUXDtkk%3D
Now, use the following command to start copying the files from Azure into the locally mounted bucket, which lands them in GCP storage:
azcopy cp --recursive=true "<source-url-with-sas>" "<local_folder_path>"
If your job fails, you can list your jobs using:
azcopy jobs list
and resume a failed job with:
azcopy jobs resume <job-id> --source-sas="<sas-token>"
You can collate all the steps into one bash script, as sketched below, and leave it running until your data transfer is complete.
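For example (a sketch; the bucket name, mount directory, and SAS URL are placeholders for your own values):
#!/usr/bin/env bash
set -euo pipefail

BUCKET="my-gcs-bucket"        # destination Cloud Storage bucket
MOUNT_DIR="$HOME/gcs"         # local mount point for gcsfuse
SRC_URL="https://myaccount.blob.core.windows.net/mycontainer?<sas-token>"

mkdir -p "$MOUNT_DIR"
gcsfuse "$BUCKET" "$MOUNT_DIR"   # mount the bucket as a file system

# Copy everything from the Azure container into the mounted bucket.
azcopy cp --recursive=true "$SRC_URL" "$MOUNT_DIR"

fusermount -u "$MOUNT_DIR"       # unmount when finished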
And that's all! I hope it helps others.
We migrated about 3 TB of files from Azure to Google Storage. We started a cheap Linux server with a few TB of local disk in Google Compute Engine, transferred the Azure files to the local disk with blobxfer, then copied the files from the local disk to Google Storage with gsutil rsync (gsutil cp works too).
You can use other tools to transfer files from Azure; you may even start a Windows server in GCE and use gsutil on Windows.
It took a few days, but it was simple and straightforward.
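The final copy can look like this (a sketch; the disk path and bucket name are placeholders, and -m parallelizes the transfer):
gsutil -m rsync -r /mnt/transfer-disk gs://my-target-bucket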
Have you thought about using Azure Data Factory's custom activity support, which is meant for data transformation? On the back end, you can use Azure Batch to download, update, and upload your files into Google Storage if you go with an ADF custom activity.

Google Cloud Storage - make objects in a bucket publicly viewable

I've got a bucket in Google Cloud Storage, and a website. People can currently upload to the bucket through the website (using Google authentication).
However, I need to set it so that anyone can view the files that are uploaded (and can't modify them).
This can't be something that requires Google authentication, as some of our clients' IT departments have blocked Google (for whatever reason) and refuse to budge. It could be something where, if the request is made from my website, it is allowed (as I'll record the URL in the website's database).
Preferably, if this could be done without using gsutil that would be great.
You can set a default object ACL on the bucket that makes all objects uploaded to that bucket publicly readable. For example you could do it using gsutil:
gsutil defacl ch -u AllUsers:R gs://your-bucket
Note that the above command only affects newly written objects. If you already have objects in your bucket that need to be made public you could accomplish that with gsutil as well:
gsutil acl ch -u AllUsers:R gs://your-bucket/**
Regarding your point about making sure anyone can view the files but not modify them: You can accomplish this by making sure the bucket ACL only allows you (or your service account) to write objects, not all users.
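If the bucket uses IAM rather than per-object ACLs (for example, with uniform bucket-level access enabled), the equivalent is to grant allUsers the object viewer role — a sketch with a placeholder bucket name:
gsutil iam ch allUsers:objectViewer gs://your-bucket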

How to use Service Accounts with gsutil, for downloading from CS - DCM Google private owned bucket

A project and a Google Group have been set up for controlling data access, following the DCM guide: https://support.google.com/dcm/partner/answer/3370481?hl=en-GB&ref_topic=6107456
The project does not contain the bucket I want to access (under Storage -> Cloud Storage), since it's a Google-owned bucket to which I only have read-only access. I can see the bucket in my browser because my Google account is allowed to (I am a member of the ACL).
I used the gsutil tool to configure the service account of the project that was linked with the private bucket using
gsutil config -e
but when I try to access that private bucket with
gsutil ls gs://<bucket_name>
I always get 403 errors, and I don't know why. Has anyone tried this before? Any ideas are welcome.
Since the bucket is private and in project A, service accounts in your project (project B) will not have access. The service account for your project (project B) would need to be added to the ACL for that bucket.
Note that since you can access this bucket with read access as a user, you can run gsutil config to grant your user credentials to gsutil and use that to read the bucket.
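A minimal sketch of that flow (the bucket name is a placeholder):
gsutil config       # runs the OAuth flow for your own Google account
gsutil ls gs://<bucket_name>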

The gsutil tool is not working to register a channel in object change notification

When executing the following command:
gsutil notifyconfig watchbucket -i myapp-channel -t myapp-token https://myapp.appspot.com/gcsnotify gs://mybucket
I receive the following error, although I have used the same command on other buckets before and it worked:
Watching bucket gs://mybucket/ with application URL https://myapp.appspot.com/gcsnotify...
Failure: <HttpError 401 when requesting https://www.googleapis.com/storage/v1beta2/b/mybucket/o/watch?alt=json returned "Unauthorized WebHook callback channel: https://myapp.appspot.com/gcsnotify">.
I used gsutil config to set permissions and tried with gsutil config -e also.
I already tried to set the permissions and made myself owner of the project, but it is not working. Any help?
I was getting the same error. You must configure gsutil to use a service account before you can watch a bucket.
An additional security requirement was recently added for Object Change Notification. You must add your endpoint domain as a trusted domain on your cloud project. To do that, the domain first has to be whitelisted with the Google Webmaster Tools.
See instructions here:
https://developers.google.com/storage/docs/object-change-notification#_Authorization
I also determined that I needed to:
Whitelist my appspot domain
Create a service account before I can watch a bucket.
At first I was using the Google Cloud Shell and figured it should just be authenticated; gsutil ls listed the objects in my bucket, so I assumed I was. However, that is not the case.
You need to install gsutil or the Google Cloud SDK, log in, get the .p12 file from the service account, and authenticate with it as Wind Up Toy described. After that it will work.
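For reference, a sketch of that service-account setup (the email address and key path are placeholders; gsutil config -e prompts for both):
gsutil config -e
#   What is your service account email address? my-sa@my-project.iam.gserviceaccount.com
#   What is the full path to your private key file? /path/to/key.p12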