Druid MapReduce job not listed on Google Dataproc dashboard - google-cloud-dataproc

I will explain my situation. I have an on-prem Apache Druid cluster, and I managed to have Druid execute its Hadoop ingestion job (MapReduce) on a Dataproc cluster. The job runs fine, but it isn't visible on the Dataproc dashboard.
Note: The only connection between the Dataproc cluster and Druid is a VPN connection between the master node and the Druid cluster.

If you are not submitting the MapReduce job through the Dataproc Jobs API but in some other way, e.g. submitting it directly to YARN, then it won't appear on the Dataproc dashboard. In that case, you should still be able to view the job through the YARN ResourceManager UI in Component Gateway.
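For reference, a job submitted through the Dataproc Jobs API does appear on the dashboard. Below is a minimal sketch with gcloud, assuming hypothetical cluster, region, bucket, and jar names (Druid normally launches the MapReduce job itself, so this only illustrates the distinction):
```bash
# A job submitted through the Dataproc Jobs API is listed on the dashboard
# (cluster name, region, jar path, and arguments below are placeholders).
gcloud dataproc jobs submit hadoop \
  --cluster=my-dataproc-cluster \
  --region=us-central1 \
  --jar=gs://my-bucket/my-indexing-job.jar \
  -- arg1 arg2

# A job submitted directly to YARN is only visible in the ResourceManager UI;
# enabling Component Gateway at cluster creation makes that UI reachable
# from the Cloud Console.
gcloud dataproc clusters create my-dataproc-cluster \
  --region=us-central1 \
  --enable-component-gateway
```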

Related

Can an Apache Storm setup be done on ECS Fargate?

I have an application with an Apache Storm setup and I want to move it to the AWS cloud. I am looking for information on whether it is possible to migrate/deploy this to containers on ECS with Fargate.
If not, is there an equivalent AWS service?

Autoscaling On-Demand HDInsight cluster

Is there a way to enable autoscaling in an on-demand cluster created in Azure Data Factory?
The CPU usage sometimes exceeds expectations.
Unfortunately, the autoscaling option is not available for on-demand HDInsight clusters; it is only available on standard Azure HDInsight clusters.
You can create an on-demand HDInsight cluster only with a predefined cluster size.
In this type of configuration, the computing environment is fully managed by the Azure Data Factory service. It is automatically created by the Data Factory service before a job is submitted to process data and removed when the job is completed. You can create a linked service for the on-demand compute environment, configure it, and control granular settings for job execution, cluster management, and bootstrapping actions.
For more details, refer to the Azure HDInsight on-demand linked service documentation.
I would suggest providing feedback on this here:
https://feedback.azure.com/forums/217335-hdinsight
All of the feedback you share in these forums will be monitored and reviewed by the Microsoft engineering teams responsible for building Azure.

Flink session cluster and job submission in Kubernetes

Our team set up a Flink Session Cluster in our K8S cluster. We chose a Flink Session Cluster rather than a Job Cluster because we have a number of different Flink jobs, and we want to decouple the development and deployment of Flink from those of our jobs. Our Flink setup contains:
Single JobManager as a K8S pod, no High Availability (HA) setup
A number of TaskManagers, each as a K8S pod
We develop our jobs in a separate repository and deploy them to the Flink cluster when code is merged.
Now, we noticed that the JobManager, being a K8S pod, can be redeployed by K8S at any time, and once it is redeployed it loses all jobs. To work around this, we developed a script that keeps monitoring the jobs in Flink; if the jobs are not running, the script resubmits them to the cluster. Since it may take some time for the script to discover and resubmit the jobs, there are frequent short service interruptions, and we are wondering whether this could be improved.
So far, we have some ideas or questions:
One possible solution could be: when the JobManager is (re)deployed, it fetches the latest job jar and runs the jobs. This approach looks good overall. Still, since our jobs are developed in a separate repo, we need a way for the cluster to notice the latest jobs when they change: either the JobManager keeps polling for the latest job jar, or the jobs repo deploys the latest jar to the cluster.
I see that the Flink HA feature can store checkpoints/savepoints, but I am not sure whether Flink HA already handles this redeployment issue?
Does anyone have any comment or suggestion on this? Thanks!
Yes, Flink HA will solve the JobManager failover problems you're concerned about. The new JobManager will pick up information about what jobs are (supposed to be) running, their jars, checkpoint status, etc., from the HA storage.
Note also that Flink 1.10 includes a beta release of native support for Kubernetes session clusters. See the docs.
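As a rough sketch of what enabling HA looks like for a standalone session cluster of that generation (ZooKeeper-based, since the Kubernetes-native HA services only arrived in later Flink releases), with placeholder values for the quorum, storage directory, and cluster id:
```bash
# Append ZooKeeper-based HA settings to the flink-conf.yaml baked into the
# JobManager/TaskManager image (all values below are illustrative placeholders;
# the storage directory must be a durable filesystem Flink can access).
cat >> conf/flink-conf.yaml <<'EOF'
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-0.zk:2181,zk-1.zk:2181,zk-2.zk:2181
high-availability.storageDir: s3://my-flink-ha-bucket/ha/
high-availability.cluster-id: /my-session-cluster
EOF
```
With this in place, a restarted JobManager recovers the submitted job graphs, their jars, and the latest checkpoint metadata from the HA storage instead of coming up empty, which removes the need for an external resubmission script.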

Connect to a DB hosted within a Kubernetes engine cluster from a PySpark Dataproc job

I am a new Dataproc user and I am trying to run a PySpark job that is supposed to use the MongoDB connector to retrieve data from a MongoDB replica set hosted within a Google Kubernetes Engine cluster.
Is there a way to achieve this, given that my replica set is not supposed to be accessible from the outside without a port-forward or something similar?
In this case I assume that by "outside" you mean the internet or networks other than your GKE cluster's. If you deploy your Dataproc cluster on the same network as your GKE cluster and expose the MongoDB service to the internal network, you should be able to connect to the database from your Dataproc job without needing to expose it outside of the network.
You can find more information in this link on how to create a Cloud Dataproc cluster with internal IP addresses only.
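A minimal sketch of that, assuming a hypothetical cluster name, region, and a subnet shared with the GKE cluster's VPC:
```bash
# Create a Dataproc cluster with internal IP addresses only, on the same
# VPC subnet as the GKE cluster (names and region are placeholders).
gcloud dataproc clusters create my-dataproc-cluster \
  --region=us-central1 \
  --subnet=my-shared-subnet \
  --no-address
```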
Just expose your MongoDB service in GKE and you should be able to reach it from within the same VPC network.
Take a look at this post for reference.
You should also be able to automate the service exposure through an init script.
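For illustration, one way to expose the replica set only inside the VPC is an internal load balancer service. The manifest below is a sketch with placeholder names and uses the older GKE annotation, so check the GKE documentation for your cluster version:
```bash
# Expose MongoDB inside the VPC through an internal load balancer
# (service name and selector are placeholders).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: mongodb-internal
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
spec:
  type: LoadBalancer
  selector:
    app: mongodb
  ports:
    - port: 27017
      targetPort: 27017
EOF
```
The PySpark job can then reference the service's internal IP in its MongoDB connection URI (e.g. via the connector's spark.mongodb.input.uri option).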

Running flink job on kubernetes

I am trying out a Flink job on Kubernetes with the latest version of Flink, 1.5.
The Flink on Kubernetes documentation describes how to deploy Flink, and I used Minikube on a Mac. The Flink UI comes up nicely, showing the JobManager and TaskManager.
The question I have is how to run an example app on the above Flink cluster. The Flink example project describes how to build a Docker image with a Flink app and submit that application to Flink. I followed the example, only changing the Flink version to the latest. I find that the application (example-app) is submitted successfully and shows up as a pod in Kubernetes, but the Flink UI does not show any running jobs. Can someone please point me to an example of how to submit a Flink job to a Flink cluster running on Kubernetes?
There is a problem with Minikube's VM where a pod cannot reach a service that points back to itself. Here is the corresponding issue.
You have to log into the Minikube VM to set the proper ip link. The following command should do the trick:
minikube ssh 'sudo ip link set docker0 promisc on'
The reason this is needed is that the web submission handler, which runs on the cluster entrypoint, needs to connect back to the cluster entrypoint to submit the job.
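Once that is in place, the example-app pod should be able to submit its job. As an alternative sketch for submitting from outside the cluster, assuming the JobManager service is named flink-jobmanager and exposes the REST port 8081 as in the standard Kubernetes setup:
```bash
# Forward the JobManager's REST port to the local machine and submit a job
# with the Flink CLI (service name and example jar path are placeholders).
kubectl port-forward service/flink-jobmanager 8081:8081 &
./bin/flink run -m localhost:8081 ./examples/streaming/WordCount.jar
```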