Connect to a DB hosted within a Kubernetes engine cluster from a PySpark Dataproc job - mongodb

I am a new Dataproc user and I am trying to run a PySpark job that is supposed to use the MongoDB connector to retrieve data from a MongoDB replicaset hosted within a Googke Kubernetes Engine cluster.
Is it there a way to achieve this as my replicaset is not supposed to be accessible from the outside without using a port-forward or something?

In this case I assume by saying "outside" you're pointing to the internet or other networks than your GKE cluster's. If you deploy your Dataproc cluster on the same network as your GKE cluster, and expose the MongoDB service to the internal network, you should be able to connect to the databases from your Dataproc job without needing to expose it to outside of the network.
You can find more information in this link to know how to create a Cloud Dataproc cluster with internal IP addresses.

Just expose your Mogodb service in GKE and your should be able to reach it from within the same VPC network.
Take a look at this post for reference.
You should also be able to automate the service exposure through an init script

Related

How to authenticate to a GKE cluster without using the gcloud CLI

I've got a container inside a GKE cluster and I want it to be able to talk to the Kubernetes API of another GKE cluster to list some resources there.
This works well if run the following command in a separate container to proxy the connection for me:
gcloud container clusters get-credentials MY_CLUSTER --region MY_REGION --project MY_PROJECT; kubectl --context MY_CONTEXT proxy --port=8001 --v=10
But this requires me to run a separate container that, due to the size of the gcloud cli is more than 1GB big.
Ideally I would like to talk directly from my primary container to the other GKE cluster. But I can't figure out how to figure out the IP address and set-up the authentication required for the connection.
I've seen a few questions:
How to Authenticate GKE Cluster on Kubernetes API Server using its Java client library
Is there a golang sdk equivalent of "gcloud container clusters get-credentials"
But it's still not really clear to me if/how this would work with the Java libraries, if at all possible.
Ideally I would write something like this.
var info = gkeClient.GetClusterInformation(...);
var auth = gkeClient.getAuthentication(info);
...
// using the io.fabric8.kubernetes.client.ConfigBuilder / DefaultKubernetesClient
var config = new ConfigBuilder().withMasterUrl(inf.url())
.withNamespace(null)
// certificate or other autentication mechanishm
.build();
return new DefaultKubernetesClient(config);
Does that make sense, is something like that possible?
There are multiple ways to connect to your cluster without using the gcloud cli, since you are trying to access the cluster from another cluster within the cloud you can use the workload identity authentication mechanism. Workload Identity is the recommended way for your workloads running on Google Kubernetes Engine (GKE) to access Google Cloud services in a secure and manageable way. For more information refer to this official document. Here they have detailed a step by step procedure for configuring workload identity and provided reference links for code libraries.
This is drafted based on information provided in google official documentation.

How to connect to AWS ECS cluster?

I have successfully created ECS cluster (EC2 Linux + Networking). Is it possible to login to the cluster to perform some administrative tasks? I have not deployed any containers or tasks to it yet. I can’t find any hints for it in AWS console or AWS documentation.
The "cluster" is just a logical grouping of resources. The "cluster" itself isn't a server you can log into or anything. You would perform actions on the cluster via the AWS console or the AWS API. You can connect to the EC2 servers managed by the ECS cluster individually. You would do that via the standard ssh method you would use to connect to any other EC2 Linux server.
ECS will take care most of the administrative works for you.You simply have to deploy and manage your applications on ECS. If you setup ECS correctly, you will never have to connect to instances.
Follow these instructions to deploy your service (docker image): https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-service.html
Also you can use Cloudwatch to store container logs, so that you don't have to connect to instances to check the logs: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html

Fail to connect the GKE with GCE on the same VPC?

I am new to Google Cloud Platform and the following context:
I have a Compute Engine VM running as a MongoDB server and a Compute Engine VM running as a NodeJS server already with Docker. Then the NodeJS application connects to Mongo via the default VPC internal IP. Now, I'm trying to migrate the NodeJS application to Google Kubernetes Engine, but I can't connect to the MongoDB server when I deploy the NodeJS application Docker image to the cluster.
All services like GCE and GKE are in the same region (us-east-1).
I did a hard test accessing a kubernetes cluster node via SSH and deploying a simple MongoDB Docker image and trying to connect to the remote MongoDB server via command line, but the problem is the same, time out when trying to connect.
I have also checked the firewall settings on GCP as well as the bindIp setting on the MongoDB server and it has no blocking on that.
Does anyone know what may be happening? Thank you very much.
In my case traffic from GKE to GCE VM was blocked by Google Firewall even thou both are in the same network (default).
I had to whitelist cluster pod network listed in cluster details:
Pod address range 10.8.0.0/14
https://console.cloud.google.com/kubernetes/list
https://console.cloud.google.com/networking/firewalls/list
By default, containers in a GKE cluster should be able to access GCE VMs of the same VPC through internal IPs. It is just like you access the internet (e.g., google.com) from GKE containers, GKE and VPC know how to route the traffic. The problem must be with other configurations (firewall or your application).
You can do a test, start a simple HTTP server in the GCE VM, say the internal IP is 10.138.0.5:
python -m SimpleHTTPServer 8080
then create a GKE container and try to access the service:
kubectl run my-client -it --image=tutum/curl --generator=run-pod/v1 -- curl http://10.138.0.5:8080

How to submit a Dask job to a remote Kubernetes cluster from local machine

I have a Kubernetes cluster set up using Kubernetes Engine on GCP. I have also installed Dask using the Helm package manager. My data are stored in a Google Storage bucket on GCP.
Running kubectl get services on my local machine yields the following output
I can open the dashboard and jupyter notebook using the external IP without any problems. However, I'd like to develop a workflow where I write code in my local machine and submit the script to the remote cluster and run it there.
How can I do this?
I tried following the instructions in Submitting Applications using dask-remote. I also tried exposing the scheduler using kubectl expose deployment with type LoadBalancer, though I do not know if I did this correctly. Suggestions are greatly appreciated.
Yes, if your client and workers share the same software environment then you should be able to connect a client to a remote scheduler using the publicly visible IP.
from dask.distributed import Client
client = Client('REDACTED_EXTERNAL_SCHEDULER_IP')

Google cloud access mongo deployed on compute engine from app deployed on kubernetes engine

I have three instances for kubernetes cluster and three instances for mongo cluster as shown here:
I can access my mongo cluster from app console and other compute instances using uri like this:
mongo mongodb:root:passwd#mongodb-1-servers-vm-0:27017,mongodb-1-servers-vm-1:27017/devdb?replicaSet=rs0
I also tried replacing instance names with internal and external ip addresses, but that didn't help it either.
But the same command does not work from instances inside the kubernetes cluster. I assume that I have to configure some kind of permissions for my cubernetes cluster to access compute instances? Can someone help?
Ok, I managed to find a solution, not sure if the best one.
First we add firewall rules to allow mongodb traffic
gcloud compute firewall-rules create allow-mongodb --allow tcp:27017
Then we use external ip's to connect to mongodb from kubernetes instances
mongodb:root:passwd#<ip1>:27017,<ip2>:27017/devdb?replicaSet=rs0