Expose Hue with Component Gateway for Dataproc

Is it possible to expose Hue with Component Gateway for Dataproc? I went through the docs and didn't find any option to add a service to it. I am creating the Dataproc cluster with the command below.
gcloud beta dataproc clusters create hive-cluster \
--scopes sql-admin,bigquery \
--image-version 1.5 \
--master-machine-type n1-standard-4 \
--num-masters 1 \
--worker-machine-type n1-standard-1 \
--num-workers 2 \
--region $REGION \
--zone $ZONE \
--optional-components=ANACONDA,JUPYTER \
--initialization-actions gs://bucket/init-scripts/cloud-sql-proxy.sh,gs://bucket/init-scripts/hue.sh \
--properties hive:hive.metastore.warehouse.dir=gs://$PROJECT-warehouse/datasets,dataproc:jupyter.notebook.gcs.dir=gs://bucket/notebooks/jupyter \
--metadata "hive-metastore-instance=$PROJECT:$REGION:hive-metastore" \
--enable-component-gateway

Hue is not an optional component of Dataproc, so it is not accessible through Component Gateway. For now, you have to use the Dataproc web interfaces:
Once the cluster has been created, Hue runs on port 8888 on the master node. To connect to the Hue web interface, create an SSH tunnel and use a SOCKS 5 proxy with your web browser, as described in the Dataproc web interfaces documentation, then open the master node's port 8888 through the proxy (or forward the port locally and browse to localhost:8888) and you should see the Hue UI.
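For example, a minimal sketch of the SOCKS approach, assuming the master node is named hive-cluster-m (the default for a cluster called hive-cluster) and Chrome is installed at the usual Linux path:
# Open a SOCKS tunnel to the master node (runs in the foreground)
gcloud compute ssh hive-cluster-m --zone=$ZONE -- -D 1080 -N
# In a second terminal, start a browser that routes traffic through the tunnel
# and open the Hue UI via the master's hostname
/usr/bin/google-chrome --proxy-server="socks5://localhost:1080" \
  --user-data-dir=/tmp/hive-cluster-m http://hive-cluster-m:8888
If you only need Hue, a plain port forward also works: gcloud compute ssh hive-cluster-m --zone=$ZONE -- -L 8888:localhost:8888 -N, then browse to http://localhost:8888.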

Related

Creating an AKS Cluster using C#?

I have not found a way to do this using the C# Kubernetes SDK: https://github.com/kubernetes-client/csharp
How can I create an AKS cluster in C#? Basically, the equivalent of the following command:
az aks create -g $RESOURCE_GROUP -n $AKS_CLUSTER \
--enable-addons azure-keyvault-secrets-provider \
--enable-managed-identity \
--node-count $AKS_NODE_COUNT \
--generate-ssh-keys \
--enable-pod-identity \
--network-plugin azure
You create the cluster by sending a PUT request with a JSON payload to ARM (Azure Resource Manager); the Kubernetes C# client only talks to the API server of an existing cluster, it does not provision Azure resources.
See this: https://learn.microsoft.com/en-us/rest/api/aks/managed-clusters/create-or-update?tabs=HTTP
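For reference, a rough sketch of that call using az rest (which handles authentication); the api-version and the minimal body below are illustrative only, so take the authoritative schema from the linked reference. In C#, the same PUT can be issued with an authenticated HttpClient or via the Azure.ResourceManager.ContainerService management SDK.
# Minimal illustrative ManagedCluster body; see the linked reference for the full schema
cat > managed-cluster.json <<'EOF'
{
  "location": "westeurope",
  "identity": { "type": "SystemAssigned" },
  "properties": {
    "dnsPrefix": "myaks",
    "agentPoolProfiles": [
      { "name": "nodepool1", "count": 1, "vmSize": "Standard_DS2_v2", "mode": "System" }
    ]
  }
}
EOF
# PUT it to ARM (api-version is an example; pick a current one from the docs)
az rest --method put \
  --uri "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.ContainerService/managedClusters/$AKS_CLUSTER?api-version=2023-08-01" \
  --body @managed-cluster.json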

Connect Kubernetes cluster on GCP and keep different projects separated

I want to set up several GKE clusters like here. So essentially, I would first create a VPC
gcloud compute networks create ${network_name} --subnet-mode=custom
and then the subnets
gcloud compute networks subnets create ${subnet_1} \
--region=${region_1} \
--network=${network_name} \
--range=10.0.0.0/16 \
--secondary-range pods=10.10.0.0/16,services=10.100.0.0/16
gcloud compute networks subnets create ${subnet_2} \
--region=${region_2} \
--network=${network_name} \
--range=10.1.0.0/16 \
--secondary-range pods=10.11.0.0/16,services=10.101.0.0/16
gcloud compute networks subnets create ${subnet_3} \
--region=${region_3} \
--network=${network_name} \
--range=10.2.0.0/16 \
--secondary-range pods=10.12.0.0/16,services=10.102.0.0/16
and then three GKE clusters:
gcloud beta container clusters create ${cluster_1} \
--region ${region_1} --num-nodes 1 \
--network ${network_name} --subnetwork ${subnet_1} \
--cluster-dns clouddns --cluster-dns-scope vpc \
--cluster-dns-domain ${cluster_domain_1} \
--enable-ip-alias \
--cluster-secondary-range-name=pods --services-secondary-range-name=services
gcloud beta container clusters create ${cluster_2} \
--region ${region_2} --num-nodes 1 \
--network ${network_name} --subnetwork ${subnet_2} \
--cluster-dns clouddns --cluster-dns-scope vpc \
--cluster-dns-domain ${cluster_domain_2} \
--enable-ip-alias \
--cluster-secondary-range-name=pods --services-secondary-range-name=services
gcloud beta container clusters create ${cluster_3} \
--region ${region_3} --num-nodes 1 \
--network ${network_name} --subnetwork ${subnet_3} \
--cluster-dns clouddns --cluster-dns-scope vpc \
--cluster-dns-domain ${cluster_domain_3} \
--enable-ip-alias \
--cluster-secondary-range-name=pods --services-secondary-range-name=services
Furthermore, we need the node pools (here only done for cluster no. 1):
gcloud container node-pools create pd --cluster ${cluster_1} --machine-type n1-standard-4 --num-nodes=1 \
--node-labels=dedicated=pd --node-taints=dedicated=pd:NoSchedule
gcloud container node-pools create tikv --cluster ${cluster_1} --machine-type n1-highmem-8 --num-nodes=1 \
--node-labels=dedicated=tikv --node-taints=dedicated=tikv:NoSchedule
gcloud container node-pools create tidb --cluster ${cluster_1} --machine-type n1-standard-8 --num-nodes=1 \
--node-labels=dedicated=tidb --node-taints=dedicated=tidb:NoSchedule
Here begins the interesting part: We list the firewalls for cluster subnet no. 1:
gcloud compute firewall-rules list --filter="name~gke-${cluster_1}-.*-all"
and we allow incoming traffic from the other clusters
gcloud compute firewall-rules update ${firewall_rule_name} --source-ranges 10.10.0.0/16,10.11.0.0/16,10.12.0.0/16
If we repeat this for all clusters, then they are interconnected, i.e., we can access a service from cluster A in cluster B.
Now I am facing the following situation. Say we have projects A and B and one cluster C.
I can use NetworkPolicies to ensure that the resources in the namespaces of project A (A1, A2, A3) can communicate with one another, as can the resources in the namespaces of project B (B1, B2), while no communication is possible between, say, A1 and B2.
Now, my question is: how can we make that work across multiple clusters that are connected as above? So assume we have clusters C1, C2, C3; for project A we have namespaces A1_C1, A2_C1, A3_C2, A4_C3, A5_C3 (in the respective clusters), and for project B we have namespaces B1_C1, B2_C2, B3_C2, B4_C3.
How can I make it possible that all the resources in the namespaces belonging to project A can communicate (say, A1_C1 with A3_C2), and likewise for project B, while no communication is possible between projects, say between resources in A1_C1 and B1_C1 or B2_C2?
Is such a thing possible? If so, how?
Your support is greatly appreciated.

Grant AKS access to Postgres database in Azure via VNet

Trying to figure out how to best give my AKS cluster access to a Postgres database in Azure.
This is how I create the cluster:
az group create \
--name $RESOURCE_GROUP \
--location $LOCATION
az aks create \
--resource-group $RESOURCE_GROUP \
--name $CLUSTER_NAME \
--node-vm-size Standard_DS2_v2 \
--node-count 1 \
--enable-addons monitoring \
--enable-managed-identity \
--generate-ssh-keys \
--kubernetes-version 1.19.6 \
--attach-acr $ACR_NAME \
--location $LOCATION
This will automatically create a VNet with a subnet that the node pool uses.
The following works:
Find the VNet resource in Azure
Go to "subnets" -> select the subnet -> Choose "Microsoft.SQL" under "Services". Save
Find the Postgres resource in Azure
Go to "Connection Security" -> Add existing virtual network -> Select the AKS VNet subnet. Save
So I have two questions:
Is it recommended to "fiddle" with the VNet subnet automatically created by az aks create? I.e adding the service endpoint for Micrsoft.SQL
If it's ok, how can I accomplish the same using Azure CLI only? The problem I have is how to figure out the id of the subnet (based on what az aks create returns)
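For what it's worth, a rough CLI sketch of those portal steps, assuming the auto-created VNet lives in the node resource group that az aks create provisions (variable names are illustrative):
# Find the auto-created VNet and subnet in the node resource group
NODE_RG=$(az aks show -g $RESOURCE_GROUP -n $CLUSTER_NAME --query nodeResourceGroup -o tsv)
VNET_NAME=$(az network vnet list -g $NODE_RG --query "[0].name" -o tsv)
SUBNET_NAME=$(az network vnet subnet list -g $NODE_RG --vnet-name $VNET_NAME --query "[0].name" -o tsv)
# Enable the Microsoft.Sql service endpoint on the node subnet
az network vnet subnet update -g $NODE_RG --vnet-name $VNET_NAME -n $SUBNET_NAME \
  --service-endpoints Microsoft.Sql
SUBNET_ID=$(az network vnet subnet show -g $NODE_RG --vnet-name $VNET_NAME -n $SUBNET_NAME --query id -o tsv)
# Allow that subnet on the Postgres server (single-server "Connection security" model)
az postgres server vnet-rule create -g $POSTGRES_RG -s $POSTGRES_SERVER_NAME \
  -n allow-aks-subnet --subnet $SUBNET_ID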

Assigning IP addresses to multiple nightly k8s clusters

Background:
We run analytics pipelines on dedicated clusters once a day. All clusters are created at the same time, have one pod deployed, run their pipeline, and are deleted once complete; they use the default VPC network in the same region and are created with a command like this:
gcloud beta container clusters create <CLUSTER_NAME> \
--zone "europe-west1-b" \
--machine-type "n1-standard-2" \
--num-nodes=1 \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--service-account=<SA_EMAIL> \
--disk-size 10GB \
--network default \
--subnetwork <SUB_NETWORK> \
--enable-master-global-access \
--enable-private-nodes \
--enable-private-endpoint \
--enable-ip-alias \
--enable-intra-node-visibility \
--enable-master-authorized-networks \
--master-ipv4-cidr=<MASTER_IP>/28 \
--cluster-ipv4-cidr <CLUSTER_IP>/14 \
--services-ipv4-cidr <SERVICES_IP>/20 \
--enable-network-policy \
--enable-shielded-nodes
When we add a new cluster for a new pipeline, we have encountered issues where the IP address ranges collide or overlap, or are simply unavailable. As we expect to continually add more pipelines, and thus more clusters, we want an automated way of avoiding this issue.
We have explored creating a dedicated network (and subnetwork) for each cluster so each cluster can have the same IP addresses (albeit in different networks) but are unsure if this is best practice.
Question:
Is it possible to create Kubernetes clusters in Google Cloud so that the master, cluster and service IP addresses are auto-assigned?
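Not an authoritative answer, but a sketch of one option: on a VPC-native cluster you can omit the pod and service CIDR flags and GKE will create and assign unused secondary ranges on the subnet for you; the control-plane range still has to be a non-overlapping /28 per cluster. Trimmed to the relevant flags:
gcloud beta container clusters create <CLUSTER_NAME> \
--zone "europe-west1-b" \
--network default \
--subnetwork <SUB_NETWORK> \
--enable-ip-alias \
--enable-private-nodes \
--enable-private-endpoint \
--enable-master-authorized-networks \
--master-ipv4-cidr=<MASTER_IP>/28
# No --cluster-ipv4-cidr / --services-ipv4-cidr: GKE picks unused secondary ranges automatically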

Pass packages to pyspark running on dataproc from airflow?

We have an Airflow DAG that involves running a pyspark job on Dataproc. We need a JDBC driver during the job, which I'd normally pass to the dataproc submit command:
gcloud dataproc jobs submit pyspark \
--cluster my-cluster \
--properties spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
--py-files ...
But how can I do it with Airflow's DataProcPySparkOperator?
For now we're adding this library to the cluster itself:
gcloud dataproc clusters create my-cluster \
--region global \
--zone europe-west1-d \
...
--properties spark:spark.jars.packages=mysql:mysql-connector-java:6.0.6 \
...
This seems to be working fine, but it doesn't feel like the right way to do it. Is there another way?
I believe you want to pass dataproc_pyspark_properties to the DataProcPySparkOperator.
See:
https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/operators/dataproc_operator.py
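For reference, a minimal sketch using the contrib operator from the linked file (the task id, GCS path and DAG object are placeholders):
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

submit_pyspark = DataProcPySparkOperator(
    task_id='submit_pyspark',
    main='gs://bucket/jobs/my_job.py',  # placeholder path to the PySpark script
    cluster_name='my-cluster',
    dataproc_pyspark_properties={
        'spark.jars.packages': 'mysql:mysql-connector-java:6.0.6',
    },
    dag=dag,  # assumes an existing DAG object
)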