Setting Dynamic Properties in Dataproc Job - google-cloud-dataproc

Here's what I am trying to accomplish. I want to create a workflow template so that I can spin up a cluster, run a job, and delete the cluster. Within the job, I want to pass in properties that can be set dynamically. For example, set a property to the current date.
Below is a simple example. It uses the date command correctly, but that is evaluated at template creation time, so it looks like it will always be 12/31/2020 if I set up the workflow today. I know I can delete the job and add it back to the template for each run, but I was hoping for a simpler way.
gcloud dataproc workflow-templates create workflow-mk-test --region us-east1 --project data-engineering-doz4
gcloud dataproc workflow-templates set-managed-cluster workflow-mk-test \
--cluster-name=cluster-mk-test \
--project data-engineering-doz4 \
--image-version=1.3-ubuntu18 \
--bucket data-engineering-dev \
--region us-east1 \
--subnet ml-data-engineering-east1 \
--no-address \
--zone us-east1-b \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 15 \
--num-workers 2 \
--worker-machine-type n1-standard-4 \
--worker-boot-disk-size 15
gcloud dataproc workflow-templates add-job pyspark gs://data-engineering-dev/jobs/millard-test.py \
--workflow-template=workflow-mk-test \
--step-id=test-job \
--region=us-east1 \
--project=data-engineering-doz4 \
-- date `date -v -1d '+%Y/%m/%d'` \
--output-location s3n://missionlane-data-engineering-dev-us-east-1/delete-me/`date -v -1d '+%Y/%m/%d'`

Dynamic properties generated by running shell commands are not a supported feature of Dataproc jobs. In this case, you might want to consider making the logic part of your job, i.e., getting the current date dynamically in millard-test.py.
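For illustration, here is a minimal sketch of what that could look like inside the PySpark job. The flag name and paths are hypothetical stand-ins for whatever millard-test.py actually expects; the point is that the date is computed every time the job runs rather than once when the template is created.
# Hypothetical snippet for millard-test.py: compute yesterday's date at run time
# instead of receiving a value rendered once at template creation time.
import argparse
from datetime import datetime, timedelta

parser = argparse.ArgumentParser()
parser.add_argument("--output-location", required=True)  # hypothetical flag name
args = parser.parse_args()

run_date = (datetime.utcnow() - timedelta(days=1)).strftime("%Y/%m/%d")
output_path = args.output_location + "/" + run_date  # e.g. .../delete-me/2020/12/30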

Related

How to make Powershell pass a list as argument to gcloud

I want to submit my neural network model to Google Cloud via the following command, as in the tutorial:
gcloud ai-platform jobs submit training ${JOB_NAME} \
--region=us-central1 \
--master-image-uri=gcr.io/cloud-ml-public/training/pytorch-gpu.1-10 \
--scale-tier=CUSTOM \
--master-machine-type=n1-standard-8 \
--master-accelerator=type=nvidia-tesla-p100,count=1 \
--job-dir=${JOB_DIR} \
--package-path=./trainer \
--module-name=trainer.task \
-- \
--train-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_train.csv \
--eval-files=gs://cloud-samples-data/ai-platform/chicago_taxi/training/small/taxi_trips_eval.csv \
--num-epochs=10 \
--batch-size=100 \
--learning-rate=0.001
I was working with PowerShell and I have a problem with the master-accelerator argument, which should be a dictionary. I don't know how to pass such a value to gcloud. I have tried #{count=1; type=...}, but received a Bad syntax for dict arg: [#] error.
How can I pass a list of parameters in PowerShell such that the gcloud submit command accepts it?
I thank you in advance for your help.
EDIT
I have tried to use delimiters like ^^^^:^^^^, along with this, but it still does not work (Invalid delimeter).
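One thing that may be worth trying, assuming the underlying problem is PowerShell splitting the value on the comma before gcloud ever sees it, is to quote the whole flag value so it reaches gcloud as a single string. This is an untested sketch, not a confirmed fix; all other flags stay as in the original command:
gcloud ai-platform jobs submit training ${JOB_NAME} --master-accelerator="type=nvidia-tesla-p100,count=1" ...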

ERROR: (gcloud.sql.instances.create) Projects instance [my-project] not found: The requested flag is either misspelled or unsupported by Cloud SQL

When I try to create a Cloud SQL instance using gcloud, I get this error. Any thoughts, folks? The command I'm running is:
--database-version=$DB_VERSION \
--cpu=$NUMBER_CPUS \
--memory=$MEMORY_SIZE \
--storage-type=$STORAGE_TYPE \
--storage-size=$STORAGE_SIZE \
--storage-auto-increase \
--database-flags=$DATABASE_FLAGS \
--region=$REGION \
--authorized-networks=$NETWORKS \
--assign-ip \
--project=$PROJECT_ID
It doesn't matter whether I include the projectId in this command or not.
Thanks!
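For reference, the paste starts at the flags, but gcloud sql instances create also takes the instance name as a positional argument in front of them. A minimal sketch with purely illustrative placeholder values (none of these names come from the question):
gcloud sql instances create my-instance \
    --database-version=MYSQL_8_0 \
    --cpu=2 \
    --memory=8GiB \
    --region=us-east1 \
    --project=my-project
If the instance name was present in the real command, then, going by the error text itself, the next thing to check is whether every flag is spelled correctly and supported for the chosen database version (gcloud sql instances create --help lists them).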

In `aws cloudformation deploy --parameter-overrides`, how to pass multiple values to `List<AWS::EC2::Subnet::ID>` parameter?

I am using this CloudFormation template
The List parameter I'm trying to pass values to is:
"Subnets" : {
"Type" : "List<AWS::EC2::Subnet::Id>",
"Description" : "The list of SubnetIds in your Virtual Private Cloud (VPC)",
"ConstraintDescription" : "must be a list of at least two existing subnets associated with at least two different availability zones. They should be residing in the selected Virtual Private Cloud."
},
I've written a utility script that looks like this:
#!/bin/bash
SUBNET1=subnet-abcdef
SUBNET2=subnet-ghijlm
echo -e "\n==Deploying stack.cf.yaml===\n"
aws cloudformation deploy \
--region $REGION \
--profile $CLI_PROFILE \
--stack-name $STACK_NAME \
--template-file stack.cf.json \
--no-fail-on-empty-changeset \
--capabilities CAPABILITY_NAMED_IAM \
--parameter-overrides \
VpcId=$VPC_ID \
Subnets="$SUBNET1 $SUBNET2" \ #<---------------this fails
InstanceType=$EC2_INSTANCE_TYPE \
OperatorEMail=$OPERATOR_EMAIL \
KeyName=$KEY_NAME \
If I deploy this, after a while my stack fails to deploy saying that a Subnet with the value "subnet-abcdef subnet-ghijlmn" does not exist.
The correct way to pass values to a list parameter is to comma-separate them.
So:
#!/bin/bash
SUBNET1=subnet-abcdef
SUBNET2=subnet-ghijlm
aws cloudformation deploy --parameter-overrides Subnets="$SUBNET1,$SUBNET2"
will work
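Applied to the utility script from the question, only the Subnets line should presumably need to change (note the $ in front of both variables and the removal of the trailing comment):
  --parameter-overrides \
    VpcId=$VPC_ID \
    Subnets="$SUBNET1,$SUBNET2" \
    InstanceType=$EC2_INSTANCE_TYPE \
    OperatorEMail=$OPERATOR_EMAIL \
    KeyName=$KEY_NAME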
Tried every possible solution found online, none worked.
According to the documentation below, you should escape the comma with double backslashes. I tried that; it didn't work either.
https://docs.aws.amazon.com/cli/latest/reference/cloudformation/create-stack.html
What worked FOR ME (apparently this is very environment-dependent) was the command below, escaping the comma with just one backslash.
aws cloudformation create-stack --stack-name teste-memdb --template-body file://memorydb.yml --parameters ParameterKey=VpcId,ParameterValue=vpc-xxxx ParameterKey=SubnetIDs,ParameterValue=subnet-xxxxx\,subnet-yyyy --profile whatever
From the documentation here:
A list/array can be passed just like a Python list:
'["value1", "value2", "value3"]'
Note also that CloudFormation internally uses Python.

How Can I Debug apiserver Startup When No Logs Are Generated?

I am trying to install the aws-encryption-provider following the steps at https://github.com/kubernetes-sigs/aws-encryption-provider. After I added the --encryption-provider-config=/etc/kubernetes/aws-encryption-provider-config.yaml parameter to /etc/kubernetes/manifests/kube-apiserver.yaml, the apiserver process did not restart, nor did I see any error messages.
What technique can I use to see errors created when apiserver starts?
Realizing that the apiserver runs inside a Docker container, I connected to one of my controller nodes over SSH. Then I started a container with the following command to get a shell prompt, using the same Docker image that the apiserver uses.
docker run \
-it \
--rm \
--entrypoint /bin/sh \
--volume /etc/kubernetes:/etc/kubernetes:ro \
--volume /etc/ssl/certs:/etc/ssl/certs:ro \
--volume /etc/pki:/etc/pki:ro \
--volume /etc/pki/ca-trust:/etc/pki/ca-trust:ro \
--volume /etc/pki/tls:/etc/pki/tls:ro \
--volume /etc/ssl/etcd/ssl:/etc/ssl/etcd/ssl:ro \
--volume /etc/kubernetes/ssl:/etc/kubernetes/ssl:ro \
--volume /var/run/kmsplugin:/var/run/kmsplugin \
k8s.gcr.io/kube-apiserver:v1.18.5
Once inside that container, I could run the same command that is set up in kube-apiserver.yaml. This command was:
kube-apiserver \
--encryption-provider-config=/etc/kubernetes/aws-encryption-provider-config.yaml \
--advertise-address=10.250.203.201 \
...
--service-node-port-range=30000-32767 \
--storage-backend=etcd3 \
I elided the bulk of the command since you'll need to get specific values from your own kube-apiserver.yaml file.
Using this technique showed me the error message:
Error: error while parsing encryption provider configuration file
"/etc/kubernetes/aws-encryption-provider-config.yaml": error while parsing
file: resources[0].providers[0]: Invalid value:
config.ProviderConfiguration{AESGCM:(*config.AESConfiguration)(nil),
AESCBC:(*config.AESConfiguration)(nil), Secretbox:(*config.SecretboxConfiguration)
(nil), Identity:(*config.IdentityConfiguration)(nil), KMS:(*config.KMSConfiguration)
(nil)}: provider does not contain any of the expected providers: KMS, AESGCM,
AESCBC, Secretbox, Identity

How to get output of gcloud composer command?

I'm executing gcloud composer commands:
gcloud composer environments run airflow-composer \
--location europe-west1 --user-output-enabled=true \
backfill -- -s 20171201 -e 20171208 dags.my_dag_name
kubeconfig entry generated for europe-west1-airflow-compos-007-gke.
It's a regular Airflow backfill. The command above prints the results only at the end of the whole backfill range; is there any way to get the output in a streaming manner? Each time a DAG run gets backfilled, it would be printed to standard output, like with the regular Airflow CLI.