Data Fusion Dataproc compute profile in a different account - google-cloud-dataproc

I'm trying to execute a pipeline on a Dataproc cluster in a different project than the one the Data Fusion instance is deployed in, but I'm having some trouble. The Dataproc cluster seems to be created correctly, but the start of the job fails. Any idea how to solve this?
Here is the stack trace of the error:
Thanks

It sounds like the project where the Dataproc cluster runs doesn't have the SSH port open. Can you check that your project allows connections on port 22? Cloud Data Fusion uses SSH to upload and monitor jobs on the Dataproc cluster.
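A quick way to verify reachability is a plain TCP probe against the cluster's master node. A minimal sketch; the address below is a placeholder for your Dataproc master's internal IP:

```python
import socket

def ssh_port_open(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder address; substitute your Dataproc master node's internal IP.
print(ssh_port_open("10.128.0.2"))
```

If this returns False from the network Data Fusion connects from, check the VPC firewall rules for one that allows ingress on tcp:22.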

Related

Connect PySpark session to Dataproc

I'm trying to connect a PySpark session running locally to a Dataproc cluster. I want to be able to work with files on GCS without downloading them. My goal is to perform ad-hoc analyses using local Spark, then switch to a larger cluster when I'm ready to scale. I realize that Dataproc runs Spark on YARN, so I've copied yarn-site.xml locally. I've also opened an SSH tunnel from my local machine to the Dataproc master node and set up port forwarding for the ports identified in the YARN XML. It doesn't seem to be working, though: when I try to create a session in a Jupyter notebook, it hangs indefinitely, and I see nothing in stdout or the Dataproc logs. Has anyone had success with this?
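For reference, the setup described above looks roughly like the sketch below. The config path is a placeholder, and it assumes the YARN ResourceManager port (8032 by default) is among those forwarded through the tunnel:

```python
import os

from pyspark.sql import SparkSession

# Placeholder path; point this at the directory holding the yarn-site.xml
# copied from the Dataproc master. Must be set before the JVM starts.
os.environ["HADOOP_CONF_DIR"] = "/path/to/dataproc-yarn-conf"

# With master="yarn", Spark reads the ResourceManager address from
# yarn-site.xml; the SSH tunnel must forward that port to localhost.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("local-pyspark-to-dataproc")
    .getOrCreate()
)
```

A common reason a setup like this hangs is that the YARN containers on the cluster need to open connections back to the local driver, and a one-way SSH tunnel from the laptop to the master does not give the cluster a route back.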
For anyone interested, I eventually abandoned this approach. Instead, I'm running Jupyter Enterprise Gateway on the master node, setting up port forwarding, and then launching my notebooks locally to connect to the kernel(s) running on the server. It works very nicely so far.

Has anyone deployed Metabase with Cloud Run?

I did some research but couldn't find any how-to for deploying Metabase with Cloud Run on GCP; I only found Q&As about problems people hit deploying it.
My goal is to deploy Metabase on Cloud Run and use Postgres as the database. I have already deployed an app on Cloud Run and have a cloudbuild.yml pipeline in Git that I use for building my app; I just want to add Metabase.
Any directions or solutions?
It is not recommended to deploy Metabase on Cloud Run because:
Computation is scoped to a request. You should only expect to be able to do computation within the scope of a request: a container instance does not have any CPU available if it is not processing a request. (Container runtime contract)
Therefore you might face issues loading the application, and database connections may time out.
I think your best option is the App Engine flexible environment.
Install Metabase on Google Cloud with Docker – App Engine
It's very hard to use Metabase on Cloud Run, and running it on the flexible App Engine environment is more expensive (~$100/month). The Metabase entrypoint does a lot of processing, and if Cloud Run decides to create another container instance, it will display the Metabase loading page, so you have to refresh manually multiple times until the new container is ready.
I worked around this by giving the JVM more memory via JAVA_OPTS (-Xmx4g) and limiting MAX_INSTANCES to 1. I also set up a Cloud Scheduler task that sends a GET request to the Metabase service's base URL every 10 minutes.
With this configuration Metabase stays alive and Cloud Run never creates more than one container, so the refresh problem no longer appears.
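For what it's worth, a keep-alive job like that can be created with the google-cloud-scheduler Python client; a minimal sketch, where the project, region, and service URL are placeholders:

```python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()

# Placeholders: substitute your project ID, region, and Cloud Run service URL.
parent = "projects/my-project/locations/us-central1"

job = scheduler_v1.Job(
    name=f"{parent}/jobs/metabase-keepalive",
    schedule="*/10 * * * *",  # a GET request every 10 minutes
    http_target=scheduler_v1.HttpTarget(
        uri="https://metabase-xyz-uc.a.run.app/",
        http_method=scheduler_v1.HttpMethod.GET,
    ),
)

client.create_job(parent=parent, job=job)
```

The same job can also be created from the console or with gcloud scheduler jobs create http.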
For the moment I don't have problems with this configuration, but I think the best way to host Metabase is still to run it on a Compute Engine VM.

Stopping Cloud Data Fusion Instance

I have production pipelines in Google Data Fusion which only run for a couple of hours. I would like to stop the Data Fusion instance and start it again the next day, but I don't see an option to stop the instance. Is there any way to stop the instance and start the same instance again?
By design, a Data Fusion instance runs in a GCP tenancy unit that manages, in a fully automated way, all the cloud resources and services (a GKE cluster, Cloud Storage, Cloud SQL, Persistent Disk, Elasticsearch, Cloud KMS, etc.) used for storing, developing, and executing customer pipelines. There is therefore no way to stop a Data Fusion instance; the resources that actually execute pipelines are launched on demand and cleaned up after pipeline completion. See the pricing documentation for how charges are calculated.

A way to automatically start and stop a SQL database in GCP

I want to run a Cloud Scheduler job in GCP to start and stop a SQL database during working hours on weekdays.
I have tried triggering a Cloud Function via Pub/Sub, but I can't figure out the proper way to do it.
You can use the Cloud SQL Admin API to start or stop an instance. Depending on your language, there are client libraries available to help you do this. This page contains examples using curl.
Once you've created two Cloud Functions (one to start, and one to stop), you can configure Cloud Scheduler to send a Pub/Sub trigger to your functions. Check out this tutorial, which walks you through the process.
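For illustration, here is a minimal sketch of such a pair of Pub/Sub-triggered functions, using the Google API discovery client; the project and instance names are placeholders. Starting and stopping a Cloud SQL instance is done by patching its activationPolicy:

```python
from googleapiclient import discovery

PROJECT = "my-project"        # placeholder project ID
INSTANCE = "my-sql-instance"  # placeholder Cloud SQL instance name

def start_instance(event, context):
    """Pub/Sub-triggered Cloud Function that starts the instance."""
    _set_activation_policy("ALWAYS")

def stop_instance(event, context):
    """Pub/Sub-triggered Cloud Function that stops the instance."""
    _set_activation_policy("NEVER")

def _set_activation_policy(policy):
    # ALWAYS starts the instance; NEVER stops it.
    service = discovery.build("sqladmin", "v1beta4")
    body = {"settings": {"activationPolicy": policy}}
    service.instances().patch(
        project=PROJECT, instance=INSTANCE, body=body
    ).execute()
```

Deploy start_instance and stop_instance as two separate functions, each subscribed to its own Pub/Sub topic, and point one Cloud Scheduler job at each topic.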
In order to achieve this, you can use a Cloud Function to call the Cloud SQL Admin API to start and stop your Cloud SQL instance (you will need two Cloud Functions). You can see my code on how to use a Cloud Function to start a Cloud SQL instance and stop a Cloud SQL instance.
After creating your Cloud Functions, you can configure Cloud Scheduler to trigger the HTTP address of each function.

Spring Cloud Data Flow deployment

I want to deploy Spring Cloud Data Flow on several hosts.
I will deploy the Spring Cloud Data Flow server on one host (host A) and deploy the agents on the other hosts (these hosts are in charge of executing the tasks).
Except for host A, all the other hosts run the same tasks.
Should I build on the Spring Cloud Data Flow Local Server, on the Spring Cloud Data Flow Apache YARN Server, or is there a better choice?
Do you mean how the apps are deployed on several hosts? If so, the apps are deployed using the underlying deployer implementation. For instance, with the local deployer, each app is deployed by spawning a new process. You can scale out the number of app instances using the count deployer property (deployer.<appName>.count) when deploying the stream. I am not sure what you mean by "agents" here.