Google Cloud SDK from Dataproc cluster - gcloud

What is the right way to use/install Python Google Cloud APIs such as Pub/Sub from a Google Dataproc cluster? For example, if I'm using Zeppelin/PySpark on the cluster and I want to use the Pub/Sub API, how should I prepare it?
It is unclear to me what is and isn't installed during default cluster provisioning, and if/how I should install Python libraries for the Google Cloud APIs.
I realise there may additionally be scopes/authentication to set up.
To be clear, I can use the APIs locally, but I am not sure what the cleanest way is to make them accessible from the cluster, and I don't want to perform any unnecessary steps.

In general, at the moment, you need to bring your own client libraries for the various Google APIs unless using the Google Cloud Storage connector or BigQuery connector from Java or via RDD methods in PySpark which automatically delegate into the Java implementations.
For authentication, you should simply pass --scopes https://www.googleapis.com/auth/pubsub and/or --scopes https://www.googleapis.com/auth/cloud-platform when creating the cluster, and the service account on the Dataproc cluster's VMs will be able to authenticate to use Pub/Sub via the Application Default Credentials flow.
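For instance, once a client library such as google-cloud-pubsub has been installed on the cluster nodes (e.g. with pip, perhaps via an initialization action), a publish call might look like the minimal sketch below; the project and topic names are placeholders:

    from google.cloud import pubsub_v1

    # Uses Application Default Credentials, which on a Dataproc VM resolve
    # to the cluster's service account (given the pubsub/cloud-platform scope).
    publisher = pubsub_v1.PublisherClient()

    # "my-project" and "my-topic" are placeholders; substitute your own.
    topic_path = publisher.topic_path("my-project", "my-topic")

    future = publisher.publish(topic_path, b"hello from dataproc")
    print("Published message ID:", future.result())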

Related

What benefits does Cloud Composer provide over a Helm chart and GKE?

As I dive into the world of Cloud Composer, Airflow, Google Kubernetes Engine, and Kubernetes, I've not yet found a good answer to what exactly makes Cloud Composer better than Helm and GKE.
Here are some things I've found that could be unique to Composer but mostly seem like they could be handled by GKE.
On their homepage:
End-to-end integration with Google Cloud products including BigQuery, Dataflow, Dataproc, Datastore, Cloud Storage, Pub/Sub, and AI Platform gives users the freedom to fully orchestrate their pipeline.
On the features page:
Identity-Aware Proxy protects the interface
Cloud Composer associates a Cloud Storage bucket with the environment. The associated bucket stores the DAGs, logs, custom plugins, and data for the environment.
The downsides of Composer I've seen include:
It takes many hours to spin up a new instance
It doesn't support Kubernetes Executor
It is risky to change the underlying GKE config because it could be changed back by a composer update
Errors often happen during auto-scaling, but they are documented as known issues
Upgrading environments is still beta
To be clear, I'm not saying Cloud Composer is bad. I'm just having trouble seeing why people like it. When I've asked folks why it is better than Helm + GKE, they haven't had any compelling answers, even though they can tell many stories of Composer being unpredictable and having lots of issues.
Are you comparing the same things?
On one side, GKE, you have a container orchestrator. Declare what you want, and it will deploy and maintain the stability of the cluster according to the declared configuration. This configuration can be packaged with Helm to make it easier to write. Because you deploy containers, you can use whatever language you want in your services.
On the other side, you have a workflow manager, with a scheduler, retry policies, parallel tasks, and context forwarding. You write DAGs in Python (only!) and you have operators to interact with external products/services. It's mainly designed for data processing and is used a lot by data science and data engineering teams.
Note: Cloud Composer is deployed on top of GKE (scheduler and workers), Redis, App Engine, and Cloud SQL.
You are comparing two different worlds: the Ops world (GKE/Helm) and the App/Data world (Composer/Airflow). Have a look at this video.
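To make the workflow-manager side concrete, here is a minimal Airflow DAG sketch (assuming Airflow 2.x import paths; the DAG id, schedule, and task are purely illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # One daily task with automatic retries; the scheduler handles re-runs.
    with DAG(
        dag_id="example_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(
            task_id="extract",
            bash_command="echo 'pulling data...'",
            retries=2,  # retry policy applied per task
        )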
Update 1:
My bad, I misunderstood! Anyway, personally I don't want to manage things myself: a cluster, Kubernetes updates, VM patching, replicas, snapshots, backup/restore, ...
If someone can do this for me, I prefer that, and managed services are perfect for me!
Do you ask yourself this question about Cloud SQL versus a database you manage yourself on a Compute Engine instance? If not (because Cloud SQL solves a lot of boring issues), my opinion is the same for Composer.
But it's only an opinion; I haven't tested both and compared performance, cost, and ease of use.

Eclipse Google Cloud Tools: Where is Datastore running?

When running a Java Google App Engine through Eclipse (with Google Cloud Tools) I can inspect my Datastore through the admin dashboard (localhost:8080/_ah/admin/datastore).
Is it possible to access the Datastore Rest API? Where would I be able to do that? Is it running on the same port under a different path?
It looks like Eclipse is starting up dev_appserver.py, and with this you can't use the Datastore API.
I've never used the Datastore emulator, but that might allow you to use the API.
Another option is to use the live Datastore API for a test GAE project.
The port on which the Datastore emulator listens for calls is 8081 by default, and it can be changed with gcloud beta emulators datastore start --host-port=localhost:8081
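If you do run the emulator, client libraries can be pointed at it via an environment variable; a minimal Python sketch, assuming the google-cloud-datastore library and the default port above (the project ID is a placeholder):

    import os

    from google.cloud import datastore

    # Point the client library at the local emulator instead of production;
    # DATASTORE_EMULATOR_HOST is honored by google-cloud-datastore.
    os.environ["DATASTORE_EMULATOR_HOST"] = "localhost:8081"

    # "my-test-project" is a placeholder project ID.
    client = datastore.Client(project="my-test-project")

    # Write and read back a trivial entity to verify connectivity.
    key = client.key("Task", "sample")
    entity = datastore.Entity(key=key)
    entity["done"] = False
    client.put(entity)
    print(client.get(key))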
Alternatively, if you want to manually access the Datastore API of your GCP project you can:
Use the Datastore Dashboard in the Cloud Console
Manually make use of the Datastore API, as in the "Try this API" feature.

Can AWS Toolkit in Eclipse be used with localstack?

For local development, I was hoping to set up a localhost profile for the AWS Toolkit that I could then use in Eclipse to interact with resources on localstack, but I'm at a loss as to how to set this up. There is a local (localhost) option in the AWS Toolkit, but I don't see how it would know what endpoints to access for the various services in localstack.
It seems like a relatively logical thing to want to do. Or do I have to do all my interaction with the aws (or awslocal) CLI?
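For context on what such a profile would need to encode: programmatic SDK clients reach localstack through an explicit endpoint override. A minimal boto3 sketch, assuming localstack's default edge endpoint in recent versions:

    import boto3

    # Point the SDK at localstack instead of the real AWS endpoints.
    # http://localhost:4566 is localstack's default edge endpoint (assumed
    # here); localstack accepts dummy credentials.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:4566",
        region_name="us-east-1",
        aws_access_key_id="test",
        aws_secret_access_key="test",
    )

    s3.create_bucket(Bucket="demo-bucket")
    print(s3.list_buckets()["Buckets"])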

How to add AppSync backend to AWS MobileHub project via console?

Although awsmobile-cli has a feature for enabling and configuring an AppSync backend like:
awsmobile appsync enable
awsmobile appsync configure
it is prone to ending up with a totally irrelevant configuration: it creates DynamoDB tables in us-west-2 (Oregon), even if my project is located in eu-central-1 (Frankfurt). It does so through its default "events" GraphQL schema. And after all that, it does not appear on the Mobile Hub project console as a backend feature.
Now, the thing I want to do is add an AppSync backend to an AWS Mobile Hub project via the console. Then I can pull the changes from the CLI once I am done, i.e. once I have modified my GraphQL schema, attached the resolvers, and wired up the data sources.
Is it possible as of now?
Unfortunately, right now this is not possible via the Mobile Hub console. It is on the CLI roadmap to support importing existing AppSync resources.
As it is not possible on Mobile Hub right now, you could try using the Serverless Framework together with serverless-appsync-plugin. It allows you to write your infrastructure as code and deploy it to AWS via the CLI.
While Mobile Hub is somewhat limiting, you can actually design a more complex backend for your app with the Serverless tool. You can even set up Lambda data sources for AppSync. Here you can find some examples of different GraphQL API setups: https://github.com/serverless/serverless-graphql
If you have a more or less complex schema, deploying it from the CLI is the right solution, as the AppSync console starts to lag with big schemas.

Testing BigInsights + Cloud Storage (how to use Node.js with these two components)

Hi, I have been trying to test these two components on Bluemix for the last 2 days. I need to know whether both of them have a robust Node.js library, because I have been trying the ones I found on npm, and even the one featured on Bluemix Cloud Storage as a Node.js SDK, and I have been unsuccessful at even connecting to Cloud Storage and Hive. I feel completely lost. I hope someone here could at least give me a lead ...
Thanks in advance
I found out that the Bluemix platform had an issue when creating the Cloud Storage within the BigInsights cluster manager, so I created the Object Storage first and then linked it to the BigInsights service, and now the Node.js SDK from Bluemix works just fine. Thanks anyway.