We want to deploy a number of applications into our cluster (Tez, Hue, Presto, Zeppelin and Oozie). A quick scan of the repo suggests that some of the ports will conflict by default (Zeppelin and Presto).
Is this a bug? How do I go about ensuring we can initialize the cluster with the tools we need? Suggesting I have multiple clusters isn't really a useful answer.
Thanks
As per above:
For posterity, the discussion/answer is here: github.com/GoogleCloudPlatform/dataproc-initialization-actions/
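For reference, stacking several initialization actions on one cluster creation looks roughly like this. The script paths follow the repo's layout but should be verified against it, and whether a port clash can be resolved through a metadata override depends on the individual script, so the --metadata line below is an assumption:

```bash
# Minimal sketch: one cluster, several initialization actions (comma-separated).
gcloud dataproc clusters create analytics-cluster \
  --initialization-actions \
gs://dataproc-initialization-actions/tez/tez.sh,\
gs://dataproc-initialization-actions/hue/hue.sh,\
gs://dataproc-initialization-actions/presto/presto.sh,\
gs://dataproc-initialization-actions/zeppelin/zeppelin.sh,\
gs://dataproc-initialization-actions/oozie/oozie.sh \
  --metadata zeppelin-port=8082   # assumption: only helps if the script reads this key
```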
Our use case is pretty simple; however, I haven't found a solution for it yet.
In the organization I'm working at, we decided to move to Kubernetes as our container manager in order to spin up slaves.
Until we moved to this kind of environment, we had dedicated slaves for each team. Each team got the resources it needed, and on that basis it worked.
However, when we moved to Kubernetes, this started to cause issues, as all the teams share the same pool of resources, which can lead to congestion or job failures.
The suggested solution was to create a Kubernetes cluster per team; however, this would burn out the teams involved in maintaining multiple clusters.
Searching online, I haven't found any available solution, hence I'm asking here: what is the best way to approach this? I understand that we might need to implement a dispatcher, but currently that's not possible with the way the Kubernetes plugin is developed.
Thanks,
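Not from the thread, but one commonly suggested way to bound this problem on a single shared cluster is a namespace per team with a ResourceQuota, pointing each team's Kubernetes plugin configuration at its own namespace. A minimal sketch with illustrative names and limits:

```bash
# One namespace per team, each capped by a ResourceQuota.
kubectl create namespace team-a

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"       # illustrative caps; size these per team
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "20"
EOF
```

With this in place, one team's burst can only exhaust its own quota rather than the whole cluster.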
How do I productionise https://hub.docker.com/r/fiorix/freegeoip so that it is launched as a Fargate task, and how do I take care of the geoipupdate functionality so that the GeoLite2-City.mmdb is kept updated in the task?
I have the required environment details like GEOIPUPDATE_ACCOUNT_ID, GEOIPUPDATE_LICENSE_KEY and GEOIPUPDATE_EDITION_IDS, but I could not understand the deployment flow, as there are two separate Dockerfiles/images for geoip and geoipupdate.
Has someone deployed this on Fargate? If yes, could you please list the high-level steps? I have already researched whether such a thing has been deployed on ECS, but I can only find examples for Lambda and EC2.
Thanks
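One pattern worth sketching (my assumption, not something confirmed in this thread) is a single Fargate task that runs the freegeoip server alongside the geoipupdate image as a sidecar, with both sharing the database directory through a task-level volume. A trimmed task definition follows; the role ARN, the maxmindinc/geoipupdate image, the shared path, and the server's -db flag are all assumptions to verify, and in practice the license key belongs in Secrets Manager rather than a plain environment variable:

```bash
aws ecs register-task-definition --cli-input-json '{
  "family": "freegeoip",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "volumes": [{"name": "geoipdb"}],
  "containerDefinitions": [
    {
      "name": "geoipupdate",
      "image": "maxmindinc/geoipupdate",
      "essential": false,
      "environment": [
        {"name": "GEOIPUPDATE_ACCOUNT_ID", "value": "YOUR_ACCOUNT_ID"},
        {"name": "GEOIPUPDATE_LICENSE_KEY", "value": "YOUR_LICENSE_KEY"},
        {"name": "GEOIPUPDATE_EDITION_IDS", "value": "GeoLite2-City"},
        {"name": "GEOIPUPDATE_FREQUENCY", "value": "24"}
      ],
      "mountPoints": [{"sourceVolume": "geoipdb", "containerPath": "/usr/share/GeoIP"}]
    },
    {
      "name": "freegeoip",
      "image": "fiorix/freegeoip",
      "essential": true,
      "command": ["-db", "/usr/share/GeoIP/GeoLite2-City.mmdb"],
      "portMappings": [{"containerPort": 8080}],
      "mountPoints": [{"sourceVolume": "geoipdb", "containerPath": "/usr/share/GeoIP"}]
    }
  ]
}'
```

The sidecar refreshes the .mmdb on the shared volume on its own schedule, so there is no need to bake the database into the server image.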
Our organisation has recently moved its infrastructure from AWS to Google Cloud, and I figured Dataproc clusters are a good solution for running our existing Spark jobs. But when it comes to comparing the pricing, I also realised that I can just fire up a Google Kubernetes Engine cluster and install Spark on it to run Spark applications.
Now my question is: how do "running Spark on GKE" and using Dataproc compare? Which one would be the best option in terms of autoscaling, pricing and infrastructure? I've read Google's documentation on GKE and Dataproc, but there isn't enough there to be sure about the advantages and disadvantages of using one over the other.
Any expert opinion will be extremely helpful.
Thanks in advance.
Spark on Dataproc is proven and in use at many organizations. Though it's not fully managed, you can automate cluster creation and teardown, job submission, etc. through the GCP API, but it's still another stack you have to manage.
Spark on GKE is something new. Spark started adding features from 2.4 onwards to support Kubernetes, and Google even updated its Kubernetes support for the preview a couple of days back, Link.
I would just go with Dataproc if I had to run jobs in a prod environment as we speak; otherwise you could experiment yourself with Docker and see how it fares, though I think it needs a little more time to be stable. From a purely cost perspective it would be cheaper with Docker, as you can share resources with your other services.
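For the GKE side of the comparison, the native Spark-on-Kubernetes submission that 2.4 introduced looks roughly like this (API server address and container image are placeholders; the example jar path is the one from the Spark docs):

```bash
spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar
```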
Adding my two cents to the above answer.
I would favor Dataproc, because it's managed and supports Spark out of the box. No hassles, and, more importantly, it's cost-optimized: you may not need clusters all the time, and with Dataproc you can have ephemeral clusters.
With GKE, I would need to explicitly discard the cluster and recreate it when necessary, and additional care needs to be taken.
I could not come across any direct service offering from GCP on data lineage. In that case, I would probably use Apache Atlas with the Spark-Atlas-Connector on a Spark installation managed by myself, and then running Spark on GKE, with all the control lying with me, would make a compelling choice.
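To illustrate the ephemeral-cluster point, a minimal sketch (cluster name and region are placeholders; --max-idle may require the beta track depending on your gcloud version):

```bash
# Create a cluster that deletes itself after 30 idle minutes.
gcloud dataproc clusters create ephemeral-spark \
  --region=us-central1 \
  --max-idle=30m

# Submit a job; once work stops, the cluster goes away on its own.
gcloud dataproc jobs submit spark \
  --cluster=ephemeral-spark \
  --region=us-central1 \
  --class=org.apache.spark.examples.SparkPi \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- 1000
```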
I would like to run a sequence of Kubernetes jobs one after another. It's okay if they are run on different nodes, but it's important that each one run to completion before the next one starts. Is there anything built into Kubernetes to facilitate this? Other architecture recommendations also welcome!
This requirement to add control flow, even if it's a simple sequential flow, is outside the scope of Kubernetes native entities as far as I know.
There are many workflow engine implementations for Kubernetes, most of them are focusing on solving CI/CD but are generic enough for you to use however you want.
Argo: https://applatix.com/open-source/argo/
Adds a custom resource definition (CRD) to Kubernetes for Workflow entities (see the sketch after this list)
Brigade: https://brigade.sh/
Takes a more serverless-like approach and is built on JavaScript, which is very flexible
Codefresh: https://codefresh.io
Has a unique approach where you can use the SaaS to get started easily, without complicated installation and maintenance, and you can point Codefresh at your Kubernetes nodes to run the workflow on them.
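As an illustration of the Argo entry above, a sequential two-step Workflow might look like this (assumes the Argo CRDs and controller are already installed; names are illustrative):

```bash
# `create` rather than `apply`, since the Workflow uses generateName.
kubectl create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sequential-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:                    # each outer list item runs after the previous completes
    - - name: step-one
        template: echo
        arguments:
          parameters:
          - name: msg
            value: "first"
    - - name: step-two
        template: echo
        arguments:
          parameters:
          - name: msg
            value: "second"
  - name: echo
    inputs:
      parameters:
      - name: msg
    container:
      image: busybox
      command: [echo, "{{inputs.parameters.msg}}"]
EOF
```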
Feel free to Google for "Kubernetes Workflow", and discover the right platform for yourself.
Disclaimer: I work at Codefresh
I would try to use CronJobs and set the concurrencyPolicy to Forbid, so that a new run does not start while the previous one is still active.
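A minimal sketch of that (use batch/v1beta1 on clusters older than 1.21; the schedule and image are illustrative):

```bash
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-sequential-task      # illustrative name
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid     # skip a run if the previous one is still active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: worker
            image: busybox      # illustrative workload
            command: ["sh", "-c", "echo working; sleep 30"]
EOF
```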
I have worked with IBM TWS (Workload Automation), which is a scheduler similar to cron where you can declare dependencies between jobs.
You can specify that a job runs only after its dependencies have run, using the FOLLOWS keyword.
I'm new to Akka Clusters; however, as I understand the documentation, I need to know at least one "seed node" to join an existing cluster.
So when using clusters with OpenShift, I would need to know whether the current gear is the first node - then I would create a new cluster - or whether there are already some other gears around - then I would need to know at least one of their IPs to join them.
Is this possible with OpenShift cloud? (I'm using the DIY cartridge, so customizing the startup script wouldn't be a problem. However, I can't find any environment variable which provides the relevant data.)
DIY gears on OpenShift Online do not scale, and if you are spinning up separate applications for each of the nodes in your cluster, you will probably run into inter-gear communication issues. You might need to create your own Akka cartridge (http://docs.openshift.org/origin-m4/oo_cartridge_developers_guide.html); then you can set your own scaling options. You might check out this cartridge (https://github.com/smarterclayton/openshift-redis-cart), which supports scaling and might give you some ideas about how to implement yours.
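As a rough illustration of the start-script route, a DIY start hook could bootstrap a new cluster when no seed is known and otherwise join a seed address you maintain yourself (e.g. via rhc env set). The SEED_NODE variable, the system name, and the jar path are assumptions, and the inter-gear TCP caveat above still applies:

```bash
#!/bin/bash
# .openshift/action_hooks/start -- minimal sketch for a DIY gear.
# Assumes you maintain SEED_NODE (host:port) yourself, e.g. `rhc env set SEED_NODE=...`.
HOST="${OPENSHIFT_DIY_IP}"
PORT="${OPENSHIFT_DIY_PORT}"

if [ -z "${SEED_NODE}" ]; then
  # No seed configured: bootstrap a new cluster with ourselves as the only seed.
  SEED="akka.tcp://MySystem@${HOST}:${PORT}"
else
  # Join the existing cluster through the known seed node.
  SEED="akka.tcp://MySystem@${SEED_NODE}"
fi

nohup java \
  -Dakka.remote.netty.tcp.hostname="${HOST}" \
  -Dakka.remote.netty.tcp.port="${PORT}" \
  -Dakka.cluster.seed-nodes.0="${SEED}" \
  -jar "${OPENSHIFT_REPO_DIR}my-akka-app.jar" \
  > "${OPENSHIFT_LOG_DIR}app.log" 2>&1 &
```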