GCP - spark on GKE vs Dataproc - pyspark

Our organisation has recently moved its infrastructure from aws to google cloud compute and I figured dataproc clusters are a good solution to running our existing spark jobs . But when it comes to comparing the pricing , I also realised that I can just fire up a google kubernetes engine cluster and install spark in it to run spark applications on it .
Now my question is , how do “running spark on gke “ and using dataproc compare ? Which one would be the best option in terms of autoscaling , pricing and infrastructure . I’ve read googles documentation on gke and dataproc but there isn’t enough for to be sure in terms of advantages and disadvantages of using GKE or dataproc over the other .
Any expert opinion will be extremely helpful.
Thanks in advance.

Spark on DataProc is proven and it's in use at many organizations, though its not fully managed, you can automate cluster creation and tear down, submitting jobs etc through GCP api, but still it's another stack you have to manage.
Spark on GKE is something new, Spark started adding features from 2.4 onwards to support Kubernetes, and even Google updated the Kubernetes for the preview couple of days back, Link
I would just go with DataProc if I have to run Jobs in Prod environment as we speak otherwise you could just experiment yourself with Docker and see how it fares, but I think it needs little more time to be stable, from purely cost perspective it would be cheaper with Docker as you can share resources with your other services.

Adding my two cents to above answer.
I would favor DataProc, because its managed and supports Spark out of
the box. No hazzles. More importantly, cost optimized. You may not
need clusters all the time, you can have ephemeral clusters with
dataproc.
With GKE, I need to explicitly discard the cluster and recreate when
necessary. Additional care needs to be taken care of.
I could not come across any direct service offering from GCP on data
lineage. In that case, I would probably use Apache Atlas with
Spark-Atlas-Connector on Spark installation managed by myself. In
that case, running Spark on GKE with all the control lying with
myself would make a compelling choice.

Related

How to avoid congestion when using Kubernetes pods as Jenkins slaves

Our usecase is pretty simple, however, I haven't found a solution for it yet.
In the organization I'm working at, we decided to move to Kubernetes as our container manager in order to spin-up slaves.
Until we moved to this kind of environment, we used to have dedicated slaves per each team. Each got the resources it needs and based on that, it was working.
However, when we moved to use Kubernetes, it started to cause issues as each team shares the same pile of resources, which, can lead to congestion or job failures.
The suggested solution was to create Kubernetes cluster per each team, however, this will lead to burnout of the teams involved with maintanance of multiple clusters.
Searching online, I didn't found any solution avilable, hence, I'm asking here - what is the best way to approach the solution? I understand that we might need to implament a dispacher, but currently it's not possible in the way the Kubernetes plugin is developed.
Thanks,

Web application deployment approach using Google Cloud - GKE

I deploying a python + tensorflow + flask application using a fully managed Google Cloud Run Service (1 vCPUs and 4 GB Ram).
System works fine but it is really slow, so I am evaluating ways of making it fast (it needs to run 20-30 times faster than what is doing now)
What would be the best approach?
To use a Kubernetes Cluster with one or two powerful machines
To use a Kubernetes Cluster with 3-5 weaker machines
To forget about Kubernets/Docker and run everything on single powerfull VM
Something else maybe?
For now I don't expect to have more than 10 users at a time but I want to be able to scale it up eventually.
You might want to evaluate according to your use case
Per this article, Fully managed Cloud Run is an ideal serverless platform for stateless containerized microservices that don’t require Kubernetes features like namespaces, co-location of containers in pods (sidecars) or node allocation and management.
GKE is a great choice if you are looking for a container orchestration platform that offers advanced scalability and configuration flexibility.
You mentioned you are looking the cheaper/easier method to develop, but this will probably not be as scalable, efficient or manageable, you might want to take a closer look at all cloud compute options in GCP to see what could benefit your use case the most.
You mentioned your use case is CPU intensive, so you might want to leverage the high CPU machine types, these might be used directly by creating a VM, creating an instance group or using them in other services like GKE or App Engine

How can I have a GKE cluster "expire" and delete itself?

We stand up a lot of clusters for testing/poc/deving and its up to us to remember to delete them
What I would like is a way of setting a ttl on an entire gke cluster and having it get deleted/purged automatically.
I could tag the clusters with a timestamp at creation and have an external process running on a schedule that reaps old clusters, but it'd be great if I didn't have to do that- it might be the only way but maybe there is a gke/k8s feature for this?
Is there a way to have the cluster delete itself without relying on an external service? I suppose it could spawn a cloud function itself- but Im wondering if there is a native gke/k8s feature to do this more elegantly
You can spawn GKE cluster with Alpha features. Such clusters exist for one month maximum and then are auto-deleted.
Read more: https://cloud.google.com/kubernetes-engine/docs/concepts/alpha-clusters
Try Cloud Scheduler and hook it up with your build server. Cloud Scheduler supports Http , App Engine , Pub/Sub endpoints.
I don't believe there is a native way to do this, but it doesn't seem unreasonable to use cloud scheduler to every so often trigger a cloud function which looks for appropriately labeled clusters and triggers their deletion via the API.

Best way to deploy MongoDB to Google Cloud Platform?

Been working on a web app with a simple database model that only needs CRUD operations, figured MongoDB would be perfect for it. The most important constraints of the project is that it be able to scale from a small amount of users to a large amount. I’ve been looking at the cloud launcher and I’ve noticed that the most popular MongoDB solution advertises a cost of ~$350/mo. This is a surprisingly large amount that makes me consider using cloud sql for my database instead. Is there a better way to deploy MongoDB to GCP that’s more fitted to my use case? I’ve been reading about automatic scaling with kubernetes but I can’t find anything about price. Any and all advice is greatly appreciated
I haven't used mongodb with kubernetes but we do use the cloud launcher solution at work. We use 2 nodes(n1-standard-1) and an arbiter(micro) + 100GB storage on each node which comes up around $100 a month. You would need a replicaset in a production environment so this seems to be a reasonable base cost.
Kubernetes does not provide a lot of advantages over the classic GCE deployment for mongodb compared to a webserver. Setting up a replicaset on kubernetes is a bit more work compared to GCE setup. https://medium.com/google-cloud/mongodb-replica-sets-with-kubernetes-d96606bd9474 and http://blog.kubernetes.io/2017/01/running-mongodb-on-kubernetes-with-statefulsets.html should serve as decent references but wouldn't lower your costs. Scaling nodes would be slightly easier though but does not strictly translate to scaling mongodb.
I have lately been working on a similar solution.
GCP announced that they don't charge for Kubernetes cluster management but only for resources used by it (instances, network ...):
https://cloud.google.com/kubernetes-engine/pricing
In general, databases are high maintenance (data mounts, backups, migrations...), so I would not start running Mongo on Kubernetes right away. You could get there but it will be more complicated than deploying your web app on Kubernetes.
Better to use MongoDB as a service that supports GCP (e.g. MongoDB Atlas), I have done so myself and see a few other companies do that.
If you scale gradually you should be able to control your costs.
The web app itself should be easy to deploy and maintain on Kubernetes.

Akka cluster and OpenShift

I'm new to Akka Clusters, however as I am understanding its documentation, I need to know at least one "seed node" to join an existing cluster.
So when using clusters with OpenShift I would need to know if the current gear is the first node - then I would create a new cluster - or if there are already some other gears around - I would need to know at least one of their IPs to join them.
Is this possible with OpenShift cloud? (I'm using the DIY catridge, so customizing the start up script wouldn't be a problem. However I can't find any environment variable which provides me relevant data.)
DIY gears on OpenShift Online do not scale. And if you are spinning up separate applications for each of the nodes in your cluster, you are going to (probably) run into inter-gear communication issues. You might need to create your own akka cartridge (http://docs.openshift.org/origin-m4/oo_cartridge_developers_guide.html), then you can set your own scaling options. You might check out this cartridge (https://github.com/smarterclayton/openshift-redis-cart) which supports scaling and might give you some ideas about how to implement yours.