Stopping Cloud Data Fusion Instance - google-cloud-data-fusion

I have production pipelines that run for only a couple of hours on Google Cloud Data Fusion. I would like to stop the Data Fusion instance and start it again the next day, but I don't see an option to stop it. Is there any way to stop the instance and later start the same instance again?

By design, a Data Fusion instance runs in a GCP tenancy unit, which gives the user a fully automated way to manage all the cloud resources and services (GKE cluster, Cloud Storage, Cloud SQL, Persistent Disk, Elasticsearch, Cloud KMS, etc.) used for storing, developing, and executing customer pipelines. Therefore, it is not possible to stop a Data Fusion instance. The resources that execute a pipeline, however, are launched on demand and cleaned up after the pipeline completes. See the Data Fusion pricing documentation for how charges are calculated.

Related

GKE and Task Queues

I am working on a cloud service platform that consists of getting tasks from users, executing them, and giving back the results.
TL;DR
Is there a way to have a "task queue" where tasks can be inserted via a REST API and pulled automatically by a Google Kubernetes Engine cluster that scales automatically?
Long description
Users can send tasks in parallel, and each task is time-consuming and needs to be performed on a GPU. So, setting up an auto-scaling GPU cluster is what I thought of.
More specifically, my idea is that users send tasks/data through a REST API, the REST API fills a task queue, and the task queue itself feeds tasks to workers on the auto-scaling GPU cluster. Of course, there are other details (authentication, database, storage, etc.) that have to be addressed, but they are not the point of my question.
For reasons I won't specify here, the project has already been started on the Google Cloud Platform, so switching to AWS or other providers is not an option.
From what I understand, things are a bit different from standard Docker-only clusters in AWS: we have to use the Google Kubernetes Engine (GKE) to set up the auto-scaling cluster, even for "simple" GPU-enabled Docker containers.
By looking at the not-so-exhaustive documentation, I know that queues are used, but what I don't know is whether feeding of tasks to the cluster is automatically handled. Also, the so-called "Task Queue" service has been deprecated.
Thank you!
At first I thought Cloud Tasks queues might be the answer to your troubles, but this post seems to promote Cloud Pub/Sub as a better alternative.
After a quick chat with the Batch developers, the current solution (until the Batch service becomes public) is to adopt a third-party queue system such as Slurm.
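Whichever queue is chosen, the scaling decision itself usually reduces to mapping the queue backlog to a worker count. Below is a minimal sketch of that mapping only. The function name, its parameters, and the default limits are hypothetical and are not part of any Google Cloud API. In a real setup the backlog would come from the queue (for example, a Pub/Sub subscription's undelivered-message count) and an autoscaler would apply the result.

```python
import math


def desired_workers(backlog: int, tasks_per_worker: int,
                    min_workers: int = 0, max_workers: int = 10) -> int:
    """Return how many workers to run for a given queue backlog.

    All names and defaults here are illustrative assumptions.
    """
    if backlog < 0:
        raise ValueError("backlog must be non-negative")
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    # One worker per `tasks_per_worker` queued tasks, rounded up.
    wanted = math.ceil(backlog / tasks_per_worker)
    # Clamp to the configured bounds so the cluster never scales unbounded.
    return max(min_workers, min(max_workers, wanted))


if __name__ == "__main__":
    for backlog in (0, 9, 100):
        print(backlog, "queued tasks ->", desired_workers(backlog, 4), "workers")
```

Setting `min_workers` to 0 lets the GPU pool scale down completely when the queue is empty, which matters when GPU nodes are the main cost.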

Spring Cloud Data Flow with Azure Event Hub limitations?

We plan to use Spring Cloud Data Flow on Azure Cloud using Azure EventHub as a messaging binder.
On Azure EventHub, there are hard limits:
100 namespaces
10 topics per namespace
The Spring Cloud Azure Event Hub Stream Binder seems to be able to configure only one namespace, so how can we manage multiple namespaces?
Maybe we should use multiple binders, to have multiple instances of the Spring Cloud Azure Event Hub Stream Binder?
Does anyone have any ideas, or know of documentation we did not find?
Regards
Rémi
Spring Cloud Data Flow and Spring Cloud Skipper support the concept of "platform accounts". Using that, you can set up multiple accounts, one for each namespace or even for other K8s clusters. This gives you a lot of flexibility to work around these hard limits in the Azure stack.
We have a recipe on multi-platform deployments.
When deploying streams from SCDF, you pick the platform account (that is, the namespace or other configuration), so the deployed stream apps (with the Azure binder on the classpath) automatically run in different namespaces, effectively dodging the limits enforced in Azure.
SCDF also automatically captures the provenance of where the apps run along with the audit trail, so at any given time you know who did what and in which namespace.

Mongo database in GCP app engine

I'm currently looking into GCP App Engine and figuring out how I would deploy a very large application with multiple services. I also want to use MongoDB. The GCP docs say that App Engine allows Dockerfiles and images. What would happen if I used the mongo Docker image as a service on App Engine? How would it scale its instances? What would happen to consistency? I'm aware GCP has a third-party solution for Mongo, but since they allow Docker images, what stops me from using it?
App Engine routinely tears down and creates new instances. If your instance is running MongoDB, then all the data stored in that instance will be lost.
This is why Google Cloud offers other, permanent places to store state, like Datastore and Cloud SQL. You can also run MongoDB yourself on Google Compute Engine.
What would happen if I used the mongo docker image as a service on app engine?
The App Engine flexible environment allows you to use Docker images to build your own application, as mentioned in this document [1]: "App Engine flexible environment instances are Compute Engine virtual machines, which means that you can take advantage of custom libraries, use SSH for debugging, and deploy your own Docker containers."
So there is no problem with using your own Docker image in the App Engine flexible environment.
How would it scale its instances?
Each active version in App Engine must have at least one instance to handle requests. There are two ways to scale instances in App Engine: automatic and manual.
As mentioned in the document [2]:
Automatic scaling creates instances based on request rate, response latencies, and other application metrics. You can specify thresholds for each of these metrics, as well as a minimum number of instances to keep running at all times.
Manual scaling specifies the number of instances that continuously run regardless of the load level. This allows tasks such as complex initializations and applications that rely on the state of the memory over time.
You configure these features through the app.yaml file. I suggest you read this document [3].
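As a rough illustration of the two scaling modes, the sketch below renders the scaling stanza that would go into an app.yaml for the flexible environment. The helper `render_scaling` and its parameters are hypothetical. The key names (`manual_scaling`, `automatic_scaling`, `min_num_instances`, `max_num_instances`, `cpu_utilization.target_utilization`) are the ones the flexible environment's app.yaml reference uses, but check document [3] before relying on them.

```python
def render_scaling(mode: str, *, instances: int = 1,
                   min_instances: int = 1, max_instances: int = 2,
                   target_cpu: float = 0.5) -> str:
    """Return an app.yaml scaling stanza as text (illustrative helper)."""
    if mode == "manual":
        # A fixed number of instances, regardless of load.
        return f"manual_scaling:\n  instances: {instances}\n"
    if mode == "automatic":
        if min_instances > max_instances:
            raise ValueError("min_instances must not exceed max_instances")
        # Instance count varies between the bounds, driven by CPU utilization.
        return (
            "automatic_scaling:\n"
            f"  min_num_instances: {min_instances}\n"
            f"  max_num_instances: {max_instances}\n"
            "  cpu_utilization:\n"
            f"    target_utilization: {target_cpu}\n"
        )
    raise ValueError(f"unknown scaling mode: {mode!r}")


if __name__ == "__main__":
    print(render_scaling("manual", instances=1))
    print(render_scaling("automatic", min_instances=1, max_instances=4,
                         target_cpu=0.6))
```

For a stateful container such as a database, a single manually scaled instance would at least avoid several independent copies running at once. It would not address the data loss that occurs when an instance is torn down.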
What will happen to consistency?
Because App Engine scaling can be configured according to load, it provides good performance in service execution, consistency in operations, and efficient use of resources.
[1] https://cloud.google.com/appengine/docs/flexible#features
[2] https://cloud.google.com/appengine/docs/flexible/go/how-instances-are-managed#instance_scaling
[3] https://cloud.google.com/appengine/docs/flexible/go/configuring-your-app-with-app-yaml

Spring cloud data flow deployment

I want to deploy Spring Cloud Data Flow on several hosts.
I will deploy the Spring Cloud Data Flow server on one host (host-A) and deploy the agents on the other hosts (these hosts are in charge of executing the tasks).
Except for host-A, all the other hosts run the same tasks.
Should I base my changes on the Spring Cloud Data Flow Local Server, on the Spring Cloud Data Flow Apache Yarn Server, or is there a better choice?
Do you mean how the apps are deployed on several hosts? If so, the apps are deployed using the underlying deployer implementation. For instance, with the local deployer, each app is deployed by spawning a new process. You can scale out the number of app instances by using the count property during stream deploy. I am not sure what you mean by the agents here.

Google Cloud SQL CPU Monitoring

I'm trying to set up some monitoring on a Google Cloud SQL node and am not seeing how to do it. I was able to install the monitoring agent on my Google Compute Engine instances to monitor CPU, network, etc. I have not been able to figure out how to do so on the Cloud SQL instance. I have access to these types of monitoring:
Storage Usage (GB)
Number of Read/Write operations
Egress Bytes
Active Connections
MySQL Queries
MySQL Questions
InnoDB Pages Read/Written (pages/sec)
InnoDB Data fsyncs (operations/sec)
InnoDB Log fsyncs (operations/sec)
I'm sure these are great options, but at this point all I want to know is how my node is performing from a CPU/RAM standpoint, as those seem to be the first and foremost measures of performance.
If I'm missing something, or misunderstanding what I'm trying to do, any advice is appreciated.
Thanks!
Google has Stackdriver, which provides logging and monitoring for Google Cloud and AWS infrastructure. It can monitor every resource present on GCP. You can create visualizations to monitor your Cloud SQL instance in one dashboard. You just have to:
1. Log in to Stackdriver and go to any existing dashboard. If you don't have one, create one.
2. Add a chart and select Cloud SQL as the resource name.
3. Select CPU Utilization as the metric and save. You can also monitor memory, disk I/O, delta count of queries, server uptime, and many more.
If you want to monitor any other GCP resource (Compute Engine, App Engine, Kubernetes Engine, a storage bucket, Bigtable, or Pub/Sub), you just have to select the appropriate resource name from the list. Hope this answers your question.
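The same chart can also be described as a monitoring filter, which is what the metrics explorer and the Monitoring API accept. The sketch below only builds that filter string. The helper name `cloudsql_cpu_filter` and the example project and instance IDs are hypothetical. The metric type `cloudsql.googleapis.com/database/cpu/utilization` and resource type `cloudsql_database` are taken from the Cloud Monitoring metrics list as I understand it and should be verified there.

```python
def cloudsql_cpu_filter(project_id: str, instance_id: str) -> str:
    """Build a monitoring filter for one Cloud SQL instance's CPU utilization.

    Illustrative helper: it only formats a string and calls no API.
    """
    # Cloud SQL resources are labelled with "<project>:<instance>".
    database_id = f"{project_id}:{instance_id}"
    return (
        'metric.type = "cloudsql.googleapis.com/database/cpu/utilization" '
        'AND resource.type = "cloudsql_database" '
        f'AND resource.labels.database_id = "{database_id}"'
    )


if __name__ == "__main__":
    # "my-project" and "my-instance" are placeholder values.
    print(cloudsql_cpu_filter("my-project", "my-instance"))
```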
You can view all of them directly from the "Overview" tab of the Cloud SQL console.
I have added this as a feature request as issue 110.
https://code.google.com/p/googlecloudsql/issues/detail?id=110