How to debug remote-cache write failures? - google-cloud-storage

We're using Bazel (via Bazelisk) and set up a GCS bucket as a remote cache as documented. However, when we run a build, we regularly get BulkTransferExceptions during the remote cache writing phase:
> bazel build //... --sandbox_debug --verbose_failures
INFO: Invocation ID: fba91f67-788f-47cc-be4e-24f92ed11301
INFO: Analyzed 25 targets (74 packages loaded, 3245 targets configured).
INFO: Found 25 targets...
WARNING: Writing to Remote Cache:
BulkTransferException
INFO: Elapsed time: 17.115s, Critical Path: 15.47s
INFO: 16 processes: 16 worker.
INFO: Build completed successfully, 39 total actions
As far as I can tell, I have the appropriate access (Storage Object Admin).
I've been trying to get more information about that specific exception, but have been unable to.
And if the bucket weren't working at all, I'd expect an exception when reading from the cache as well: I'd seen such exceptions when trying other URLs to reach the bucket, such as storage.cloud.google.com instead of storage.googleapis.com.
Any and all advice to help debug what's going on here is welcome! The documentation is sparse on what to do if you get exceptions, and as far as I can tell no results are being uploaded, so no caching is actually occurring.
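One sanity check that seems worth running is writing an object to the bucket directly with the same credentials Bazel uses (the bucket name below is a placeholder):
# try an object write with the caller's credentials (placeholder bucket name)
echo test | gsutil cp - gs://my-bazel-cache/permissions-check.txt
# inspect the bucket-level IAM bindings
gsutil iam get gs://my-bazel-cache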
Update 2020/07/09
For some unknown reason, when we moved from one bucket to a more permanent, planned one, the problem stopped occurring. So things work for us now; as far as we can tell the buckets were configured the same, so we don't know why it was failing initially.

You can use --verbose_failures, which makes Bazel print a longer stack trace. I just had a very similar problem and figured out that mine was due to insufficient permissions for my service account on my GCS bucket. With --verbose_failures I got this more helpful error message:
<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message><Details>REDACTED#REDACTED.iam.gserviceaccount.com does not have storage.objects.delete access to REDACTED/cas/REDACTED.</Details></Error>
I had to read the source code the message came from to figure this out. I'll try to submit a PR to add this hint to the Bazel documentation: https://github.com/bazelbuild/bazel/pull/12945
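For anyone hitting the same thing, these are roughly the flags involved when pointing Bazel at a GCS bucket as a remote cache; the bucket name and key path below are placeholders, so adjust them to your setup:
# .bazelrc sketch for a GCS-backed remote cache (placeholder values)
build --remote_cache=https://storage.googleapis.com/my-bazel-cache
build --google_credentials=/path/to/service-account-key.json
# or, to use Application Default Credentials instead of a key file:
# build --google_default_credentials
With this in place, running the build with --verbose_failures is what surfaced the AccessDenied response shown above.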

Related

GCloud custom image upload failure due to size or permissions

I've been trying to upload two custom images for some time now and have failed repeatedly. During the import process, the Google application always responds with the message that the Compute Engine default service account does not have the role 'roles/compute.storageAdmin'. However, I have assigned it using both the CLI and the web interface.
Notably, the application throws this error while resizing the disk. The original size of the disk is about 10 GB; however, it tries to convert it to a 1024 GB (!) disk. This got me thinking: could it be that this is too big for the application, hence the error that it lacks permissions?
As a follow-up question, I have not found any options to set the size of the disk (neither in the CLI nor in the web app). Does anybody know of such options?
Here is the error message I have received:
ate-import-3ly9z": StatusMatch found: "Import: Resizing temp-translation-disk-3ly9z to 1024GB in projects/0000000000000/zones/europe-west4-a."
[import-and-translate]: 2020-05-01T07:46:30Z Error running workflow: step "import" run error: step "wait-for-signal" run error: WaitForInstancesSignal FailureMatch found for "inst-importer-import-and-translate-import-3ly9z": "ImportFailed: Failed to resize disk. The Compute Engine default service account needs the role: roles/compute.storageAdmin'"
[import-and-translate]: 2020-05-01T07:46:30Z Serial-output value -> target-size-gb:1024
[import-and-translate]: 2020-05-01T07:46:30Z Serial-output value -> source-size-gb:7
[import-and-translate]: 2020-05-01T07:46:30Z Serial-output value -> import-file-format:vmdk
[import-and-translate]: 2020-05-01T07:46:30Z Workflow "import-and-translate" cleaning up (this may take up to 2 minutes).
[import-and-translate]: 2020-05-01T07:47:34Z Workflow "import-and-translate" finished cleanup.
[import-image] 2020/05/01 07:47:34 step "import" run error: step "wait-for-signal" run error: WaitForInstancesSignal FailureMatch found for "inst-importer-import-and-translate-import-3ly9z": "ImportFailed: Failed to resize disk. The Compute Engine default service account needs the role: roles/compute.storageAdmin'"
ERROR
ERROR: build step 0 "gcr.io/compute-image-tools/gce_vm_image_import:release" failed: step exited with non-zero status: 1
ERROR: (gcloud.compute.images.import) build a9ccbeac-92c5-4457-a784-69d486e85c3b completed with status "FAILURE"
Thanks for your time!
EDIT: Not sure, but I'm fairly certain this is due to the 1024 GB being too big. I've uploaded a 64 GB image without any issues using the same methods. For those who read this after me, that's most likely the issue (:
This error message with the import of virtual disks has 2 root causes:
1.- Cloud Build and/or Compute Engine and/or your user account did not have the correct IAM roles to perform these tasks. You can verify them here.
Cloud Build SA roles needed:
roles/iam.serviceAccountTokenCreator
roles/compute.admin
roles/iam.serviceAccountUser
Compute Engine SA roles needed:
roles/compute.storageAdmin
roles/storage.objectViewer
User Account roles needed:
roles/storage.admin
roles/viewer
roles/resourcemanager.projectIamAdmin
2.- "Not sure but I'm fairly certain this is due to the 1024GB being too big." The disk quota you have is less than 1 TB. The normal disk quota is 250-500 GB, which could be why importing a 64 GB disk gives you no problem.
You can check your quota in step 1 of this document; if you need to request more, you can follow steps 2 to 7.
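For reference, a sketch of how the missing role can be granted to the Compute Engine default service account and how the regional disk quota can be inspected with gcloud (the project ID, project number, and region below are placeholders):
# grant the role the import tool complains about (placeholder project ID and number)
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:123456789012-compute@developer.gserviceaccount.com" \
    --role="roles/compute.storageAdmin"
# inspect regional quotas; look for DISKS_TOTAL_GB in the region the import runs in
gcloud compute regions describe europe-west4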

Invalid permissions after setting gcloud caching use_kaniko?

I encountered a strange permissions error while building Docker images on the cloud. I switched to another machine, installed gcloud, ran gcloud init, and everything worked again.
However, I noticed that building images took much longer, because I hadn't enabled the Kaniko cache (which I figured out from this post: gcloud rebuilds complete container but Dockerfile is the same, only the script has changed).
After enabling this feature, I tried to rebuild my last image and bam, the same error message:
Status: Downloaded newer image for gcr.io/kaniko-project/executor:latest
gcr.io/kaniko-project/executor:latest
error checking push permissions --
make sure you entered the correct tag name, and that you are authenticated correctly, and try again:
checking push permission for "eu.gcr.io/pipeline/tree-par": creating push check transport for eu.gcr.io failed:
GET https://eu.gcr.io/v2/token?scope=repository%3pipeline%2Ftree-par%3Apush%2Cpull&service=eu.gcr.io:
UNAUTHORIZED: You don't have the needed permissions to perform this operation, and you may have invalid credentials.
To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
ERROR
ERROR: build step 0 "gcr.io/kaniko-project/executor:latest" failed: step exited with non-zero status: 1
-------------------------------------------------------------------------------------------------------------------------------
ERROR: (gcloud.builds.submit) build bad4a9a4-054d-4ad7-991d-e5aeae039b7c completed with status "FAILURE"
Does anyone have any idea why this failed once the Kaniko cache was enabled? I'd hate not to use it, because when it still worked it really decreased the time it took to create Docker images.
It seems that the issue comes from Kaniko's end.
Three days ago, in version v0.21.0, they added this fix:
Fix: GCR credential helper check does not respect DOCKER_CONFIG environment variable
Even after this release, one day later, this issue was reported in which users saw a very similar error message:
"[...] You don't have the needed permissions to perform this operation, and you may have invalid credentials[...] "
This was already fixed yesterday with the release of version v0.22.0. The suggested workaround is to use the following executor image:
gcr.io/kaniko-project/executor:v0.22.0
I would suggest using that image instead of executor:latest to "force" the use of the v0.22.0 version.
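If you build with an explicit cloudbuild.yaml rather than the plain gcloud builds submit path, the executor version can also be pinned there. A minimal sketch, reusing the destination tag from the error message above:
# pin the kaniko executor version explicitly and keep the layer cache enabled
steps:
- name: 'gcr.io/kaniko-project/executor:v0.22.0'
  args:
  - --destination=eu.gcr.io/pipeline/tree-par
  - --cache=true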
I hope this is helpful! :)

Why does BitBake error if it can't find www.example.com?

BitBake fails for me because it can't find https://www.example.com.
My computer is an x86-64 running native Xubuntu 18.04. Network connection is via DSL. I'm using the latest versions of the OpenEmbedded/Yocto toolchain.
This is the response I get when I run BitBake:
$ bitbake -k core-image-sato
WARNING: Host distribution "ubuntu-18.04" has not been validated with this version of the build system; you may possibly experience unexpected failures. It is recommended that you use a tested distribution.
ERROR: OE-core's config sanity checker detected a potential misconfiguration.
Either fix the cause of this error or at your own risk disable the checker (see sanity.conf).
Following is the list of potential problems / advisories:
Fetcher failure for URL: 'https://www.example.com/'. URL https://www.example.com/ doesn't work.
Please ensure your host's network is configured correctly,
or set BB_NO_NETWORK = "1" to disable network access if
all required sources are on local disk.
Summary: There was 1 WARNING message shown.
Summary: There was 1 ERROR message shown, returning a non-zero exit code.
The networking issue (the reason why I can't access www.example.com) is a question for the SuperUser forum. My question here is: why does BitBake rely on the existence of www.example.com? What is it about that website that is so vital to BitBake's operation? Why does BitBake raise an error if it cannot find https://www.example.com?
At this time, I don't wish to set BB_NO_NETWORK = "1". I would rather understand and resolve the root cause of the problem first.
Modifying poky.conf didn't work for me (and from what I've read, modifying anything under Poky is a no-no for a long-term solution).
Modifying conf/local.conf was the only solution that worked for me. Simply add one of the two options:
#check connectivity using google
CONNECTIVITY_CHECK_URIS = "https://www.google.com/"
#skip connectivity checks
CONNECTIVITY_CHECK_URIS = ""
This solution was originally found here.
For me, this appears to be a problem with my ISP (CenturyLink) not correctly resolving www.example.com. If I try to navigate to https://www.example.com in the browser address bar I just get taken to the ISP's "this is not a valid address" page.
Technically speaking, this isn't supposed to happen, but for whatever reason it does. I was able to work around this temporarily by modifying the CONNECTIVITY_CHECK_URIS in poky/meta-poky/conf/distro/poky.conf to something that actually resolves:
# The CONNECTIVITY_CHECK_URI's are used to test whether we can succesfully
# fetch from the network (and warn you if not). To disable the test set
# the variable to be empty.
# Git example url: git://git.yoctoproject.org/yocto-firewall-test;protocol=git;rev=master
CONNECTIVITY_CHECK_URIS ?= "https://www.google.com/"
See this commit for more insight and discussion on the addition of the www.example.com check. Not sure what the best long-term fix is, but the change above allowed me to build successfully.
If you want to resolve this issue without modifying poky.conf, local.conf, or any other file for that matter, just do:
$ touch conf/sanity.conf
It is clearly written in meta/conf/sanity.conf that:
Expert users can confirm their sanity with "touch conf/sanity.conf"
If you don't want to execute this command in every session or build, you can comment out the line INHERIT += "sanity" in meta/conf/sanity.conf, so the file looks something like this:
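Roughly, the relevant part of meta/conf/sanity.conf would then read (a sketch, not the full file):
# Expert users can confirm their sanity with "touch conf/sanity.conf"
#INHERIT += "sanity"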
I had the same issue with my Bell ISP, where accessing example.com gave a DNS error.
I solved it by switching from the ISP's DNS to Google's public DNS (to avoid making changes to configs):
https://developers.google.com/speed/public-dns/docs/using
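A quick way to tell whether it is the ISP's resolver that is at fault is to query Google's public DNS directly and compare the result with the default resolver:
# query Google's public DNS directly
nslookup www.example.com 8.8.8.8
# compare with whatever resolver the system is currently using
nslookup www.example.com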

SSO Bad Data Error

I'm running BizTalk 2013 R2 CU5 on Windows Server 2012 R2.
I noticed a file wasn't being collected from a receive location. The relevant host instance was running, so I checked the event log and found this:
SSO AUDIT Function: GetConfigInfo
({E182FB76-16B4-47D7-8178-4C66C9E3BA9D}) Tracking ID:
c4d0d0d1-0763-4ec5-99ea-fb2ac3bcc744 Client Computer: BizTalkBuild01
(BTSNTSvc64.exe:7940) Client User: BIZTALKBUILD01\BizTalkSvc
Application Name: {E182FB76-16B4-47D7-8178-4C66C9E3BA9D} Error Code:
0xC0002A1F, Cannot perform encryption or decryption because the secret
is not available from the master secret server. See the event log for
related errors.
I then restored the master secret using:
ssoConfig -restoresecret SSOxxxx.bak
After restoring, the file is still not being collected but the error messages in the event log have changed to this:
SSO AUDIT Function: GetConfigInfo
({2DC11892-82FF-4617-A491-5324CAEF8E90}) Tracking ID:
5e91d09d-1128-491b-851b-e8c8e69d06eb Client Computer: BizTalkBuild01
(BTSNTSvc64.exe:26408) Client User: BIZTALKBUILD01\BizTalkSvc
Application Name: {2DC11892-82FF-4617-A491-5324CAEF8E90} Error Code:
0x80090005, Bad Data.
Does anyone know of a solution to this, please? This is the second time I've faced this problem on different servers in the last 3 months.
The MSI for CU6 has now been fixed
For BizTalk 2013 R2 this may be a known issue, with a hotfix available!
There is a hotfix for this issue; however, the hotfix may introduce another issue (a memory leak). A solution can be found here: https://blogs.msdn.microsoft.com/amantaras/2015/11/10/event-id-10536-entsso-bad-data-issue/

Google cloud datalab deployment unsuccessful - sort of

This is a different scenario from other questions on this topic. My deployment almost succeeded, and I can see the following lines at the end of my log:
[datalab].../#015Updating module [datalab]...done.
Jul 25 16:22:36 datalab-deploy-main-20160725-16-19-55 startupscript: Deployed module [datalab] to [https://main-dot-datalab-dot-.appspot.com]
Jul 25 16:22:36 datalab-deploy-main-20160725-16-19-55 startupscript: Step deploy datalab module succeeded.
Jul 25 16:22:36 datalab-deploy-main-20160725-16-19-55 startupscript: Deleting VM instance...
The landing page keeps showing a wait bar indicating that the deployment is still in progress. I have tried deploying several times in the last couple of days.
Regarding the additions described on the landing page:
An App Engine "datalab" module is added. When I click the pop-out URL "https://datalab-dot-.appspot.com/" it throws an error page with "404 page not found".
A "datalab" Compute Engine network is added. Under "Compute Engine > Operations" I can see a create-instance operation for the datalab deployment with my ID and a delete-instance operation with the *******-ompute#developer.gserviceaccount.com ID; I'm not sure what that means.
A Datalab branch is added to the git repo. Yes, and with all the components.
I think the deployment is only partially successful. When I visit the landing page again, the only option I see is to deploy Datalab again, not to start it. Can someone spot the problem? I appreciate the help.
I read the other posts on this topic and tried to verify my deployment using "https://console.developers.google.com/apis/api/source/overview?project=", and I get the following message:
The API doesn't exist or you don't have permission to access it
You can try looking at the App Engine dashboard here to verify that there is a "datalab" service deployed.
If it is missing, then you need to redeploy (or switch to the new locally-run version).
If it is present, then you should also be able to see a "datalab" network here, and a VM instance named something like "gae-datalab-main-..." here. If either of those is missing, try going back to the App Engine console, deleting the "datalab" service, and redeploying.
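If you prefer the command line over the console, roughly the same check and cleanup can be done with gcloud (a sketch; the service name is the "datalab" service described above):
# list App Engine services and look for one named "datalab"
gcloud app services list
# if it is present but broken, delete it before redeploying
gcloud app services delete datalab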