How to run cluster initialization script on GCP after creation of cluster - google-cloud-dataproc

I have created a Google Dataproc cluster, but I now have a requirement to install Presto on it. Presto is provided as an initialization action on Dataproc here. How can I run this initialization action after the cluster has been created?

Most init actions would probably run even after the cluster is created (though I haven't tried the Presto init action).
I like to run gcloud dataproc clusters describe to get the instance names, then run something like gcloud compute ssh <NODE> -- -T sudo bash -s < presto.sh for each node. Reference: How to use SSH to run a shell script on a remote machine?.
Notes:
Everything after the -- is passed as arguments to the normal ssh command.
The -T means don't try to create an interactive session (otherwise you'll get a warning like "Pseudo-terminal will not be allocated because stdin is not a terminal.")
I use "sudo bash" because init action scripts assume they're being run as root.
presto.sh must be a copy of the script on your local machine. You could alternatively ssh and gsutil cp gs://dataproc-initialization-actions/presto/presto.sh . && sudo bash presto.sh.
But @Kanji Hara is correct in general. Spinning up a new cluster is pretty fast/painless, so we advocate using initialization actions when creating a cluster.
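The per-node step described above can be wrapped in a small loop. This is only a minimal sketch: run_init_action_on_nodes is a hypothetical helper name, the node names and zone are passed in by hand (take them from the clusters describe output), and it assumes presto.sh has already been copied to the local machine.

```shell
# Hypothetical helper: run a local init-action script as root on each node.
# Usage: run_init_action_on_nodes SCRIPT ZONE NODE [NODE...]
run_init_action_on_nodes() {
  local script="$1" zone="$2" node
  shift 2
  for node in "$@"; do
    echo "Running ${script} on ${node}" >&2
    # Everything after "--" goes to ssh; the script is fed in on stdin.
    gcloud compute ssh "${node}" --zone "${zone}" -- -T sudo bash -s < "${script}" || return 1
  done
}
```

For example, run_init_action_on_nodes presto.sh us-central1-a my-cluster-m my-cluster-w-0 (the zone and node names here are placeholders). The loop stops at the first node on which the script fails.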

You could use the --initialization-actions parameter when creating the cluster.
Example:
gcloud dataproc clusters create $CLUSTERNAME \
--project $PROJECT \
--num-workers $WORKERS \
--bucket $BUCKET \
--master-machine-type $VMMASTER \
--worker-machine-type $VMWORKER \
--initialization-actions \
gs://dataproc-initialization-actions/presto/presto.sh \
--scopes cloud-platform
Maybe this script can help you: https://github.com/kanjih-ciandt/script-dataproc-datalab

Related

How to get gcloud dataproc create flag in a spark job?

I want to get flags used when creating a dataproc cluster in a spark job.
for example, I created my cluster using this command line:
gcloud dataproc clusters create cluster-name \
--region=region \
--bucket=bucket-name \
--temp-bucket=bucket-name \
other args ...
In my Scala Spark job I want to get the bucket name and the other arguments. How can I do that? I know that to get the arguments of my job I can do this:
val sc = sparkSession.sparkContext
val conf_context=sc.getConf.getAll
conf_context.foreach(println)
Any help, please?
Thanks
Dataproc also publishes some attributes, including the bucket name, to GCE instance Metadata. You can also specify your own Metadata. See https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/metadata.
These will be available to you through the Metadata server. For example, if you want to read the bucket name, you can run
curl -s -H Metadata-Flavor:Google http://metadata.google.internal/computeMetadata/v1/instance/attributes/dataproc-bucket
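The same request can be generalized to any attribute name. A minimal sketch, assuming it runs on one of the cluster's VMs (the metadata server is not reachable elsewhere); get_instance_attribute is a hypothetical helper name, and -f is added so that a missing attribute makes curl fail instead of printing an error page.

```shell
# Hypothetical helper: print one instance metadata attribute, e.g. dataproc-bucket.
# Only works on a GCE VM, where metadata.google.internal resolves.
get_instance_attribute() {
  curl -sf -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/$1"
}
```

For example, get_instance_attribute dataproc-bucket prints the bucket name, as in the curl command above.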
You can use gcloud dataproc clusters describe shell command to get details about the cluster:
gcloud dataproc clusters describe $clusterName --region $clusterRegion
To get the bucket name from this command, you can use grep:
BUCKET_NAME=$(gcloud dataproc clusters describe $clusterName \
--region $clusterRegion \
| grep 'configBucket:' \
| sed 's/.* //')
You should be able to execute this from Scala; see this post for how to do it.
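As an alternative to grep and sed, gcloud can print a single field directly with --format. A minimal sketch, assuming the field path config.configBucket (it mirrors the configBucket: key that the grep above matches); get_config_bucket is a hypothetical helper name.

```shell
# Hypothetical helper: print the cluster's config bucket name.
# Usage: get_config_bucket CLUSTER_NAME REGION
get_config_bucket() {
  gcloud dataproc clusters describe "$1" \
    --region "$2" \
    --format='value(config.configBucket)'
}
```

For example, BUCKET_NAME=$(get_config_bucket "$clusterName" "$clusterRegion") gives the same result as the grep pipeline without depending on the layout of the YAML output.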

Error executing access token command "/google/google-cloud-sdk/bin/gcloud config config-helper --format=json"

I'm trying to follow this step-by-step guide to deploy Airflow on Kubernetes (https://github.com/EamonKeane/airflow-GKE-k8sExecutor-helm), but I run into a problem at the step shown here. My research so far has not turned up a solution. Does anyone have any suggestions? The commands and the resulting error are:
SQL_ALCHEMY_CONN=postgresql+psycopg2://$AIRFLOW_DB_USER:$AIRFLOW_DB_USER_PASSWORD@$KUBERNETES_POSTGRES_CLOUDSQLPROXY_SERVICE:$KUBERNETES_POSTGRES_CLOUDSQLPROXY_PORT/$AIRFLOW_DB_NAME
echo $SQL_ALCHEMY_CONN > /secrets/airflow/sql_alchemy_conn
# Create the Fernet key, which is needed to decrypt the database
FERNET_KEY=$(dd if=/dev/urandom bs=32 count=1 2>/dev/null | openssl base64)
echo $FERNET_KEY > /secrets/airflow/fernet-key
kubectl create secret generic airflow \
--from-file=fernet-key=/secrets/airflow/fernet-key \
--from-file=sql_alchemy_conn=/secrets/airflow/sql_alchemy_conn
Unable to connect to the server: error executing access token command
"/google/google-cloud-sdk/bin/gcloud config config-helper
--format=json": err=exit status 1 output= stderr=ERROR: gcloud crashed (BadStatusLine): '' If you would like to report this issue, please run
the following command: gcloud feedback To check gcloud for common
problems, please run the following command: gcloud info
--run-diagnostics
I solved this by opening a new Cloud Shell tab and reconnecting to the cluster:
gcloud container clusters get-credentials testcluster1 --zone=your_zone
Example:
Get the name and location of your cluster:
gcloud container clusters list
Then:
gcloud container clusters get-credentials demo --zone=us-west1-a
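The location reported by gcloud container clusters list can be either a zone or a region, and get-credentials takes a different flag for each. A minimal sketch of choosing the flag, assuming zone names are a region name plus a "-<letter>" suffix (us-west1-a is a zone, us-west1 is a region); location_flag and refresh_credentials are hypothetical helper names.

```shell
# Hypothetical helper: pick --zone or --region from the shape of the location.
# Assumes a zone is a region name followed by "-<letter>", e.g. us-west1-a.
location_flag() {
  case "$1" in
    *-[a-z]) printf '%s\n' "--zone=$1" ;;
    *)       printf '%s\n' "--region=$1" ;;
  esac
}

# Hypothetical wrapper: refresh kubectl credentials for a cluster.
# Usage: refresh_credentials CLUSTER_NAME LOCATION
refresh_credentials() {
  gcloud container clusters get-credentials "$1" "$(location_flag "$2")"
}
```

For example, refresh_credentials demo us-west1-a runs get-credentials with --zone=us-west1-a.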

How to create an SSH in gcloud, but keep getting API error

I am trying to set up Datalab from my Chromebook using the following tutorial: https://cloud.google.com/dataproc/docs/tutorials/dataproc-datalab. However, when I try to set up an SSH tunnel following these guidelines, https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces#create_an_ssh_tunnel, I keep receiving the following error.
ERROR: (gcloud.compute.ssh) Could not fetch resource:
- Project 57800607318 is not found and cannot be used for API calls. If it is recently created, enable Compute Engine API by visiting https://console.developers.google.com/apis/api/compute.googleapis.com/overview?project=57800607318 then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.
The error message would lead me to believe my "Compute Engine API" is not enabled. However, I have double checked and "Compute Engine API" is enabled.
Here is what I am entering into the cloud shell
gcloud compute ssh ${test-cluster-m} \
--project=${datalab-test-229519} --zone=${us-west1-b} -- \
-4 -N -L ${8080}:${test-cluster-m}:${8080}
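The ${} syntax expands a shell variable, so it must wrap a variable name, not the literal value. (The shell reads ${test-cluster-m} as "the variable test, or cluster-m if test is unset", so the command receives the wrong names.) You set the variables in the step before with: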
export PROJECT=project;export HOSTNAME=hostname;export ZONE=zone;PORT=number
In this case would be:
export PROJECT=datalab-test-229519;export HOSTNAME=test-cluster-m;export ZONE=us-west1-b;PORT=8080
Either try this:
gcloud compute ssh test-cluster-m \
--project datalab-test-229519 --zone us-west1-b -- \
-D 8080 -N
Or access the environment variables with:
gcloud compute ssh ${HOSTNAME} \
--project=${PROJECT} --zone=${ZONE} -- \
-D ${PORT} -N
Also check that the VM you are trying to access is running.
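Two of the points above can be checked from the shell: what ${...} does to a literal value, and whether the VM is running. This is a minimal sketch; vm_is_running is a hypothetical helper name, and it assumes the status field of gcloud compute instances describe reads RUNNING for a started VM.

```shell
# ${name-word} means "value of $name, or word if name is unset", so wrapping a
# literal value in ${...} silently mangles it.
unset test
echo "${test-cluster-m}"   # prints "cluster-m", not "test-cluster-m"

# Hypothetical helper: succeed only if the VM reports status RUNNING.
# Usage: vm_is_running NAME PROJECT ZONE
vm_is_running() {
  local status
  status=$(gcloud compute instances describe "$1" \
    --project "$2" --zone "$3" --format='value(status)') || return 1
  [ "${status}" = "RUNNING" ]
}
```

For example, vm_is_running test-cluster-m datalab-test-229519 us-west1-b && echo "VM is up" checks the master node before opening the tunnel.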

Shell (ssh) into Azure AKS (Kubernetes) cluster worker node

I have a Kubernetes cluster in Azure using AKS and I'd like to 'login' to one of the nodes. The nodes do not have a public IP.
Is there a way to accomplish this?
The procedure is described at length in the Azure documentation:
https://learn.microsoft.com/en-us/azure/aks/ssh. It consists of running a pod that you use as a relay to ssh into the nodes, and it works perfectly fine.
You probably specified the ssh username and public key when you created the cluster. If not, you have to configure your nodes to accept them as ssh credentials:
$ az vm user update \
--resource-group MC_myResourceGroup_myAKSCluster_region \
--name node-name \
--username theusername \
--ssh-key-value ~/.ssh/id_rsa.pub
To find your node names:
az vm list --resource-group MC_myResourceGroup_myAKSCluster_region -o table
When done, run a pod on your cluster with an ssh client inside; this is the pod you will use to ssh to your nodes:
kubectl run -it --rm my-ssh-pod --image=debian
# install the ssh components, as there are none in the Debian image
apt-get update && apt-get install openssh-client -y
On your workstation, get the name of the pod you just created:
$ kubectl get pods
Add your private key into the pod:
$ kubectl cp ~/.ssh/id_rsa pod-name:/id_rsa
Then, in the pod, connect via ssh to one of your nodes:
ssh -i /id_rsa theusername@10.240.0.4
(To find the node IPs, run this on your workstation:)
az vm list-ip-addresses --resource-group MC_myResourceGroup_myAKSCluster_region -o table
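The node IPs can also be read from Kubernetes itself rather than from az. A minimal sketch, assuming kubectl on the workstation is already configured for the cluster; list_node_internal_ips is a hypothetical helper name, and the JSONPath filter selects each node's InternalIP address.

```shell
# Hypothetical helper: print "<node-name><TAB><internal-ip>" for every node.
# Run it on the workstation, where kubectl is already configured.
list_node_internal_ips() {
  kubectl get nodes -o \
    jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
}
```

Each output line pairs a node name with the address to pass to ssh from inside the relay pod.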
This Gist and this page have pretty good explanations of how to do it. They cover SSHing into the nodes themselves, as opposed to opening a shell in the pods or containers.
You can use this instead of SSH. It creates a tiny privileged pod and uses nsenter to access the node.
https://github.com/mohatb/kubectl-wls

How to know when dataproc initialization actions are done

I need to run a Dataproc cluster with both BigQuery and Cloud Storage connectors installed.
I use a variant of this script (because I have no access to the bucket used in the general one). Everything works fine, but when I run a job once the cluster is up and running, it always results in a "Task was not acquired" error.
I can fix this by simply restarting the Dataproc agent on every node, but I really need this to work properly so that I can run a job right after my cluster is created. It seems that this part of the script is not working properly:
# Restarts Dataproc Agent after successful initialization
# WARNING: this function relies on undocumented and not officially supported Dataproc Agent
# "sentinel" files to determine successful Agent initialization and not guaranteed
# to work in the future. Use at your own risk!
restart_dataproc_agent() {
  # Because Dataproc Agent should be restarted after initialization, we need to wait until
  # it creates a sentinel file that signals initialization completion (success or failure)
  while [[ ! -f /var/lib/google/dataproc/has_run_before ]]; do
    sleep 1
  done
  # If Dataproc Agent didn't create a sentinel file that signals initialization
  # failure then it means that initialization succeeded and it should be restarted
  if [[ ! -f /var/lib/google/dataproc/has_failed_before ]]; then
    service google-dataproc-agent restart
  fi
}
export -f restart_dataproc_agent
# Schedule asynchronous Dataproc Agent restart so it will use updated connectors.
# It could not be restarted synchronously because Dataproc Agent should be restarted
# after its initialization, including init actions execution, has been completed.
bash -c restart_dataproc_agent & disown
My questions are:
How can I know that the initialization actions are done?
Do I have to restart the Dataproc agent on my newly created cluster's nodes, and if so, how do I do it properly?
EDIT:
Here is the command I use to create a cluster (using the 1.3 image version):
gcloud dataproc --region europe-west1 \
clusters create my-cluster \
--bucket my-bucket \
--subnet default \
--zone europe-west1-b \
--master-machine-type n1-standard-1 \
--master-boot-disk-size 50 \
--num-workers 2 \
--worker-machine-type n1-standard-2 \
--worker-boot-disk-size 100 \
--image-version 1.3 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--project my-project \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh \
--metadata 'gcs-connector-version=1.9.6' \
--metadata 'bigquery-connector-version=0.13.6'
Also, please note that the connectors initialization script has since been fixed and works fine, so I am using it now, but I still have to restart the Dataproc agent manually to be able to run a job.
The Dataproc agent logs the message "Custom initialization actions finished." in the /var/log/google-dataproc-agent.0.log file after the initialization actions succeed.
No, you don't need to restart the Dataproc agent manually.
This issue is caused by the Dataproc agent service restart in the connectors initialization action and should be resolved by this PR.
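Checking for that log message can be scripted on a node. A minimal sketch, using the log path and message text quoted above; init_actions_finished is a hypothetical helper name, and the optional argument exists only so that a different log file can be checked.

```shell
# Hypothetical helper: succeed if the agent log contains the completion message.
# Usage: init_actions_finished [LOG_FILE]
init_actions_finished() {
  local log_file="${1:-/var/log/google-dataproc-agent.0.log}"
  # -F matches the message literally, -q suppresses output.
  grep -qF 'Custom initialization actions finished.' "${log_file}"
}
```

For example, until init_actions_finished; do sleep 5; done blocks until the message appears in the default log file.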
As for knowing when the initialization actions have finished, you can check the cluster's status.state: CREATING means the initialization actions are still being executed, and RUNNING means they are done.
Check here
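The state check can be turned into a polling loop. A minimal sketch, assuming the state is read with gcloud dataproc clusters describe and --format='value(status.state)'; wait_for_cluster_running is a hypothetical helper name, and the attempt count and delay are arbitrary defaults.

```shell
# Hypothetical helper: poll status.state until the cluster is RUNNING.
# Usage: wait_for_cluster_running CLUSTER REGION [MAX_ATTEMPTS] [SLEEP_SECONDS]
wait_for_cluster_running() {
  local cluster="$1" region="$2" attempts="${3:-60}" delay="${4:-10}" state i
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(gcloud dataproc clusters describe "$cluster" \
      --region "$region" --format='value(status.state)') || return 1
    case "$state" in
      RUNNING) return 0 ;;
      ERROR)   echo "cluster ${cluster} is in state ERROR" >&2; return 1 ;;
    esac
    # Any other state (e.g. CREATING) means: keep waiting.
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out waiting for cluster ${cluster}" >&2
  return 1
}
```

For example, wait_for_cluster_running my-cluster europe-west1 && echo "init actions done" returns once the cluster leaves CREATING and reaches RUNNING.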