Integration with Dataproc + Datalab + Source Code repos - google-cloud-dataproc

Has anyone been able to integrate Dataproc, Datalab and a source code repo? As many of us have seen, when you use an init action to install Datalab, it does not create the source code repo. I am trying to achieve a full end-to-end solution where a user logs in to a Datalab notebook, interacts with Dataproc through PySpark, and checks the notebooks in to the source code repo. I have not been able to do this with the init action, as I pointed out earlier. I also tried installing Dataproc and then Datalab as a separate install (this time it creates the source repo); however, I can't run any Spark code in that Datalab notebook. Can someone please give me some pointers on how to achieve this? Any and all help is appreciated.
Code in Datalab
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""show databases""").show()
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://my-exercise-project-2019016-ds-team/datasets/invoices'""")
hc.sql("""select * from invoices limit 10""").show()
Error
Py4JJavaError: An error occurred while calling o55.sql.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3208)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3240)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$or

Unfortunately, it takes some pre-work to be able to create the datalab-notebooks repository in Cloud Source Repositories from an init action.
The reason is that creating the repository requires the service account for the VM to have the "source.repos.create" IAM permission on the project, which is not true by default.
You can either grant that permission to the service account, and then create the repository via gcloud source repos create datalab-notebooks, or manually create the repository before creating the cluster.
Then, to clone the repository inside of your startup script, add the following lines:
mkdir -p ${HOME}/datalab
gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
If you are modifying the canned init action for Datalab, then I would suggest adding these lines here
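The two steps above (optionally create the repository, then clone it) can be sketched in Python as follows. This is only an illustration: init actions are usually shell scripts, and the helper names `repo_setup_commands` and `run_repo_setup` are hypothetical. The gcloud subcommands and the `datalab-notebooks` name are the ones quoted in the answer.

```python
import os
import subprocess

REPO_NAME = "datalab-notebooks"  # repository name used in the answer above


def repo_setup_commands(home, create_repo=False):
    """Return the gcloud invocations that optionally create, then clone, the repo."""
    clone_dir = os.path.join(home, "datalab", "notebooks")
    commands = []
    if create_repo:
        # Only succeeds if the VM's service account holds source.repos.create.
        commands.append(["gcloud", "source", "repos", "create", REPO_NAME])
    commands.append(["gcloud", "source", "repos", "clone", REPO_NAME, clone_dir])
    return commands


def run_repo_setup(home, create_repo=False, runner=subprocess.check_call):
    """Create the parent directory (mkdir -p), then run each command in order."""
    os.makedirs(os.path.join(home, "datalab"), exist_ok=True)
    for command in repo_setup_commands(home, create_repo):
        runner(command)
```

Calling `run_repo_setup(os.path.expanduser("~"))` assumes the repository already exists; pass `create_repo=True` only after the permission has been granted.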

How to run a Databricks notebook with MLflow in an Azure Data Factory pipeline?

My colleagues and I are facing an issue when trying to run our Databricks notebook in Azure Data Factory, and the error is coming from MLflow.
The command that is failing is the following:
import json
from datetime import datetime

import mlflow

# dbutils is provided by the Databricks notebook environment.
# Take the parent notebook path to use as path for the experiment
context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
nb_base_path = context['extraContext']['notebook_path'][:-len("00_training_and_validation")]
experiment_path = nb_base_path + 'trainings'
mlflow.set_experiment(experiment_path)
experiment = mlflow.get_experiment_by_name(experiment_path)
experiment_id = experiment.experiment_id
run = mlflow.start_run(experiment_id=experiment_id, run_name=f"run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}")
And the error that is thrown is:
An exception was thrown from a UDF: 'mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: No experiment ID was specified. An experiment ID must be specified in Databricks Jobs and when logging to the MLflow server from outside the Databricks workspace. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("/path/to/experiment/in/workspace") at the start of your program.', from , line 32.
The pipeline just runs the notebook from ADF; it does not have any other step, and the cluster we are using is of type 7.3 ML.
Could you please help us?
Thank you in advance!
I think you need to set the artifact URI and specify the experiment ID explicitly (in case the artifact directory contains more than one experiment ID).
Reference: https://www.mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded
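As a rough illustration of "specify the experiment ID", the sketch below resolves the ID explicitly, creating the experiment if the lookup returns nothing, and passes it to `start_run`. The function name `start_run_with_explicit_experiment` is hypothetical, and `tracking` is assumed to be the `mlflow` module (or any object exposing the same three calls). Whether this removes the error raised inside a UDF is not confirmed by the thread.

```python
from datetime import datetime


def start_run_with_explicit_experiment(tracking, experiment_path):
    """Resolve the experiment ID for experiment_path, then start a run under it."""
    experiment = tracking.get_experiment_by_name(experiment_path)
    if experiment is None:
        # create_experiment returns the ID of the newly created experiment.
        experiment_id = tracking.create_experiment(experiment_path)
    else:
        experiment_id = experiment.experiment_id
    run_name = f"run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
    return tracking.start_run(experiment_id=experiment_id, run_name=run_name)
```

In a notebook this would be called as `start_run_with_explicit_experiment(mlflow, experiment_path)` after `import mlflow`.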

Is there a way in Terraform Enterprise to read the payload from VCS?

I have configured a webhook between GitHub and Terraform Enterprise correctly, so each time I push a commit, the Terraform module gets executed. What I want to achieve is to take part of the name of the branch where the push was made and pass it as a variable to the Terraform module.
I have read that the value of a variable can be HCL code, but I am unable to find the correct object to access the payload (or at least the branch name), so at this moment I think it is not possible to get that value directly from the workspace configuration.
If you have a workaround for this, it may also work for me.
At this point the only idea I have is to call the Terraform webhook using an API call.
Thanks in advance
OK, after some trial and error I found out that it is not possible to get any of that information in the Terraform module if you are using the VCS mode. So, in order to get the branch, I had these options:
Use several workspaces
You can configure a workspace for each branch, then create a variable and set that branch in each workspace. The problem is that you will be repeating yourself with this option.
Use Terraform CLI and a GitHub Action
I used this fine tutorial from HashiCorp for creating a GitHub Action that uses Terraform Cloud. It gets 99% of the job done. For passing a variable, be aware that there are two methods: using a file or using an environment variable (check that information on the HashiCorp site here). So using:
terraform apply -var="branch=value"
won't work. In my case I used the tfvars approach, so in my GitHub Action I put this snippet:
- name: Setup Terraform variables
  id: vars
  run: |-
    cat > terraform.auto.tfvars <<EOF
    branch = "${GITHUB_REF#refs/*/}"
    EOF
Because I defined a variable called branch in Terraform, I was able to read and work with this value.
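To make the `${GITHUB_REF#refs/*/}` expansion less opaque, here is a small Python sketch of what it computes: the shell removes the shortest leading match of `refs/*/`, so `refs/heads/main` becomes `main`. The function names are hypothetical, and the rendering assumes the branch name contains no characters that need escaping in a tfvars string.

```python
import re


def branch_from_ref(github_ref):
    """Mimic ${GITHUB_REF#refs/*/}: strip the shortest leading 'refs/<kind>/' prefix."""
    # Non-greedy '.*?' reproduces the shell's shortest-prefix match;
    # a ref that does not start with 'refs/' is returned unchanged.
    return re.sub(r"^refs/.*?/", "", github_ref, count=1)


def render_tfvars(github_ref):
    """Return the text the workflow step writes to terraform.auto.tfvars."""
    return f'branch = "{branch_from_ref(github_ref)}"\n'
```

Note that a branch such as `feature/login` keeps its slash, because only the first `refs/<kind>/` segment is removed.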

Deploy a zip to AWS Lambda automatically

I have zipped my source code using Python and moved the zip file to an S3 bucket. How can I automatically deploy this zip file to my existing Lambda function?
Could you please give me an idea on this?
Thanks in advance.
First, install Serverless:
npm install -g serverless
Check the serverless examples repo; it includes a simple Python Lambda function example.
You can reference your Lambda function from the files, create the necessary roles and invoke permissions, and declare your resources in serverless.yml.
To deploy the CloudFormation stack, simply run the command below from the directory that contains serverless.yml:
serverless deploy
To delete the resources you deployed, run the following command from the same directory:
serverless remove
This saves you a lot of time compared with creating your resources through the console.
You can also find other examples (Node.js etc.) in that repo.
Alternatively, you can set up S3 to trigger a different Lambda function whenever code is uploaded to the S3 bucket, and configure that Lambda function to deploy the zip from S3 to your target Lambda function.
If your use case is only to make changes and update the code from the bucket, you can use Serverless instead of paying for another Lambda function.
Serverless uses CloudFormation under the hood.
See the "create s3 trigger" reference on how to set up an S3 trigger. Write your logic using the boto3 client in the triggered Lambda to upload the code to the other Lambda.
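A minimal sketch of such a triggered Lambda is shown below, assuming the standard S3 event layout and boto3's `update_function_code` call. The target name `my-existing-function` and the helper `deploy_from_s3_event` are placeholders. The deploying function's role is also assumed to be allowed to update the target function and read the object.

```python
import urllib.parse

TARGET_FUNCTION = "my-existing-function"  # hypothetical name of the Lambda to update


def deploy_from_s3_event(event, lambda_client, function_name=TARGET_FUNCTION):
    """Point function_name at the zip object described by the first S3 event record."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])
    return lambda_client.update_function_code(
        FunctionName=function_name, S3Bucket=bucket, S3Key=key
    )


def handler(event, context):
    """Entry point for the S3-triggered Lambda."""
    import boto3  # bundled with the AWS Lambda Python runtime

    return deploy_from_s3_event(event, boto3.client("lambda"))
```

Passing the client in as an argument keeps the deployment logic testable without AWS credentials.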

Error in Google Cloud Shell Commands while working on the lab (Securing Google Cloud with CFT Scorecard)

I am working in a GCP lab (Securing Google Cloud with CFT Scorecard). All instructions for the lab are given.
First I have to run the following two commands to set environment variables:
export GOOGLE_PROJECT=$DEVSHELL_PROJECT_ID
export CAI_BUCKET_NAME=cai-$GOOGLE_PROJECT
In the second command above, I don't know what to replace with my own credentials. Maybe that is the reason I am getting the error.
Now I have to enable the "cloudasset.googleapis.com" service with gcloud. For this they gave the following command:
gcloud services enable cloudasset.googleapis.com \
--project $GOOGLE_PROJECT
The error for this is shown in the attached screenshot:
Error in the service enabling command
The next step is to clone the policy library. The given command for that is:
git clone https://github.com/forseti-security/policy-library.git
After that they said: "You realize Policy Library enforces policies that are located in the policy-library/policies/constraints folder, in which case you can copy a sample policy from the samples directory into the constraints directory".
and gave this command:
cp policy-library/samples/storage_blacklist_public.yaml policy-library/policies/constraints/
On running this command I received this:
error on running the directory command
Finally they said "Create the bucket that will hold the data that Cloud Asset Inventory (CAI) will export" and gave the following command:
gsutil mb -l us-central1 -p $GOOGLE_PROJECT gs://$CAI_BUCKET_NAME
I am confused about where to substitute my own credentials; for example, in place of the project ID I wrote my own project ID.
Also, I don't know why these errors are occurring. Kindly help me.
I'm unable to access the tutorial.
What happens if you run the following:
echo ${DEVSHELL_PROJECT_ID}
I suspect you'll get an empty result because I think this environment variable isn't actually set.
I think it should be:
echo ${DEVSHELL_GCLOUD_CONFIG}
Does that return a result?
If so, perhaps try using that variable instead:
export GOOGLE_PROJECT=${DEVSHELL_GCLOUD_CONFIG}
export CAI_BUCKET_NAME=cai-${GOOGLE_PROJECT}
It's not entirely clear to me why this tutorial is using this approach but, if the above works, it may get you further along.
Were you asked to create a Google Cloud Platform project?
As per the shared error, this seems to be because your env variable GOOGLE_PROJECT is not set. You can verify it by using echo $GOOGLE_PROJECT and seeing whether it returns the project ID or not. You could also use echo $DEVSHELL_PROJECT_ID. If that returns the project ID and the former doesn't, it means that you didn't export the variable as stated at the beginning.
If the problem is that GOOGLE_PROJECT doesn't have any value, there are different approaches on how to solve it.
Set the env variable as you explained at the beginning. Obviously this will only work if the variable DEVSHELL_PROJECT_ID is also set.
export GOOGLE_PROJECT=$DEVSHELL_PROJECT_ID
Manually set the project ID in that variable. This is far from ideal because Qwiklabs creates a new temporary project for every lab, so this would only have worked while you were still on that project. The project ID can be seen in both of your shared screenshots.
export GOOGLE_PROJECT=qwiklabs-gcp-03-c6e1787dc09e
Avoid using the argument --project. According to the documentation, this argument is optional; if it is omitted, the command uses the default project from the gcloud configuration settings. You can get the current project by using this:
gcloud config get-value project
If the previous command matches the project ID you want to use, you can simply issue the following command:
gcloud services enable cloudasset.googleapis.com
Notice that the project ID is not being explicitly mentioned using --project.
Regarding your issue with the GitHub file, I have checked the repository and the file storage_blacklist_public.yaml doesn't seem to be in the directory policy-library/samples. There seems to be a trace that it was once there, but it isn't anymore, so they should probably update the lab.
About your credentials confusion, you don't have to use your own project ID, just the one given in your lab. If I recall properly, all the needed data should be on the left side of the lab. Still, you shouldn't need to authenticate in a normal situation, as you are already logged in to your temporary project if you are accessing it from the Cloud Shell, which is where you should be doing all this.
Adding this for later versions:
In Cloud Shell you can set a temporary variable for the current project ID with
PROJECT_ID="$(gcloud config get-value project)"
then use it like
--project ${PROJECT_ID}
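The fallback order discussed in these answers (an exported GOOGLE_PROJECT, then DEVSHELL_PROJECT_ID, then the project in the gcloud configuration) can be sketched in Python as below. The function names are hypothetical; the only external command used is `gcloud config get-value project`, as quoted above.

```python
import os
import subprocess


def _gcloud_project():
    """Ask gcloud for the project of the active configuration."""
    result = subprocess.run(
        ["gcloud", "config", "get-value", "project"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def resolve_project_id(env=None, gcloud_lookup=_gcloud_project):
    """Return the first non-empty project ID from the environment, else ask gcloud."""
    env = os.environ if env is None else env
    for name in ("GOOGLE_PROJECT", "DEVSHELL_PROJECT_ID"):
        value = env.get(name, "").strip()
        if value:
            return value
    return gcloud_lookup()


def cai_bucket_name(project_id):
    """Mirror the lab's CAI_BUCKET_NAME=cai-$GOOGLE_PROJECT convention."""
    return f"cai-{project_id}"
```

An empty result from every source is a strong hint that no project is selected in the current shell session.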

How to import a file from Google Cloud Storage to H2O running in R

I would like to import a csv file from my Google Cloud Storage bucket into H2O running in R locally (h2o.init(ip = "localhost")).
I tried following the instructions at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/cloud-integration/gcs.html?highlight=environment.
I can already upload csv files from R to GCS and vice-versa using the R package cloudml. So I am reasonably sure I have the authorizations set correctly.
I have tried using Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = "/full/path/to/auth.json"). I tried using the terminal from Rstudio to do the same thing: export GOOGLE_APPLICATION_CREDENTIALS="/full/path/to/auth.json". I also tried gcloud auth application-default login using the terminal from Rstudio.
But in every case, I could not make this work, from within Rstudio:
h2o.init()
h2o::h2o.importFile(path = "gs://[gcs_bucket]/[tbl.csv]",
                    destination_frame = "tbl_from_gcs")
H2O throws the error:
Error in h2o.importFolder(path, pattern = "", destination_frame = destination_frame, :
all files failed to import
If I turn on logging (h2o::h2o.startLogging("logfile")), it shows:
GET http://localhost:54321/3/ImportFiles?path=gs%3A%2F%2F[gcs_bucket]%2F[tbl.csv]&pattern=
postBody:
curlError: FALSE
curlErrorMessage:
httpStatusCode: 200
httpStatusMessage: OK
millis: 182
{"__meta":{"schema_version":3,"schema_name":"ImportFilesV3","schema_type":"ImportFiles"},"_exclude_fields":"","path":"gs://[gcs_bucket]/[tbl.csv]","pattern":"","files":[],"destination_frames":[],"fails":["gs://[gcs_bucket]/[tbl.csv]"],"dels":[]}
(Obviously, I changed the bucket name and table name, but hopefully you get the idea.)
I am running h2o version 3.26.0.2 in R 3.6.1 and Rstudio 1.2.1578. (I am running Rstudio server in Docker on my local server using rocker/tidyverse:latest, FYI.)
If anyone could walk me through the steps to authenticate H2O so it can access GCS buckets directly, I would appreciate it. I know I could use cloudml or googleCloudStorageR as a workaround, but I would like to be able to use H2O directly so I can more easily switch from a local H2O cluster to a cloud H2O cluster.
I found one solution to this authentication issue: Because I am running h2o in Docker swarm, I can set an environment variable for the container in Docker Compose.
The relevant parts of the docker compose file look like this:
    environment:
      - GOOGLE_APPLICATION_CREDENTIALS=/run/secrets/google_auth_secret
    secrets:
      - google_auth_secret
...
secrets:
  google_auth_secret:
    file: ./gcloud_auth.json
Where gcloud_auth.json is the credentials file described here for your GCS bucket.
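Since the fix was to set the variable in the environment of the container that runs H2O, a quick sanity check run inside that same container can confirm the variable is visible and points at a parseable JSON file. The sketch below is an assumption-laden illustration in Python rather than R, with a hypothetical function name. It does not verify that the credentials actually grant access to the bucket.

```python
import json
import os


def check_gcs_credentials(env=None):
    """Return a list of problems with GOOGLE_APPLICATION_CREDENTIALS; empty means it looks usable."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set in this process"]
    if not os.path.isfile(path):
        return [f"{path} does not exist or is not a file"]
    try:
        with open(path, encoding="utf-8") as handle:
            json.load(handle)  # key files are JSON documents
    except (OSError, ValueError) as exc:
        return [f"{path} is not readable JSON: {exc}"]
    return []
```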