So for my Lab I need to create a cluster in Dataproc.
I've followed all the steps listed so many times that I could do them blind by now.
It just keeps coming up with an error saying I don't have permission.
I can't imagine I'm doing anything wrong.
Or could I...?
I'm on the third course of the Data Engineering track on Coursera.
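In case it helps to narrow this down, here is a minimal sketch for checking which Dataproc-related permissions your active credentials actually hold on the lab project. The project ID and the exact permission names to test are placeholders, not something from the lab itself.

```python
# Sketch: ask GCP which of these permissions the active credentials hold on
# the project. "my-lab-project" is a placeholder; use your lab's project ID.
from google.cloud import resourcemanager_v3
from google.iam.v1 import iam_policy_pb2

client = resourcemanager_v3.ProjectsClient()

request = iam_policy_pb2.TestIamPermissionsRequest(
    resource="projects/my-lab-project",
    permissions=[
        "dataproc.clusters.create",
        "compute.instances.create",
        "iam.serviceAccounts.actAs",
    ],
)

response = client.test_iam_permissions(request=request)
# Only the permissions you actually have are echoed back.
print("granted:", list(response.permissions))
```

If `dataproc.clusters.create` does not come back, the error is coming from IAM on that project rather than from anything in the cluster creation steps.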
Our use case is pretty simple, but I haven't found a solution for it yet.
In the organization I'm working at, we decided to move to Kubernetes as our container manager in order to spin up build slaves.
Until we moved to this kind of environment, we had dedicated slaves per team. Each team got the resources it needed, and on that basis everything worked.
However, once we moved to Kubernetes, it started to cause issues, since every team now shares the same pool of resources, which can lead to congestion or job failures.
The suggested solution was to create a Kubernetes cluster per team, but that would burn out the teams involved in the maintenance of multiple clusters.
Searching online, I didn't find any available solution, hence I'm asking here: what is the best way to approach this? I understand that we might need to implement a dispatcher, but currently that isn't possible with the way the Kubernetes plugin is developed.
Thanks,
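One direction that might be worth exploring before splitting into one cluster per team (this is only a sketch of the idea, not something the plugin provides): keep a single shared cluster, but give each team its own namespace with a ResourceQuota, so one team's agents cannot starve another's. The team name, the CPU/memory numbers, and the kubeconfig handling below are all placeholders.

```python
# Sketch: per-team namespace + ResourceQuota in a shared cluster, using the
# official `kubernetes` Python client. Names and limits are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

team = "team-a"  # hypothetical team name

# One namespace per team keeps that team's build agents logically separated.
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=team))
)

# A ResourceQuota caps how much of the shared pool the team's pods can claim,
# which is the mechanism that prevents one team's burst from starving others.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name=f"{team}-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "8",
            "requests.memory": "16Gi",
            "pods": "20",
        }
    ),
)
core.create_namespaced_resource_quota(namespace=team, body=quota)
```

Whether this fits depends on how the plugin schedules agents into namespaces, so treat it as one option to evaluate rather than a drop-in fix.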
This morning we noticed that all Kubernetes clusters in all projects (2 projects, 2 clusters per project) showed as unavailable / ERROR in the Google Cloud Console.
The dashboard shows no current issues: https://status.cloud.google.com/
It basically looks like the master nodes are down: the API does not respond and the clusters cannot be edited in the UI. Everything was up before the weekend, and since at least yesterday evening they have all shown as red.
The deployed services fortunately respond, but we cannot manage the cluster in any way.
I reported it here too:
https://issuetracker.google.com/issues/172841082
Did anyone else encounter this, and is there any way to restart the master node or trigger it to restart? I cannot edit the cluster, so an upgrade is not possible either.
I read elsewhere that only SRE folks from Google can (re)start them.
It's beyond me how this can happen.
By the way, auto-repair is set to on, and I followed the troubleshooting page, with basically every path leading to: master node down, nothing to be done.
Any help would be greatly appreciated, or simply an SRE doing a "start node" action ;).
Thank you #dany L, it was indeed a billing issue.
I'm surprised there is no message about this in the Cloud Console; you have to go to the Billing section specifically to find out about it.
After billing was fixed, it took a few minutes before the clusters were available again, and then everything looked back to normal.
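For anyone hitting the same thing: one way to check this without clicking through to the Billing page is to query the project's billing state directly. A minimal sketch, assuming the google-cloud-billing client library and a placeholder project ID:

```python
# Sketch: check whether billing is still enabled on a project, which was the
# root cause here. "my-project-id" is a placeholder.
from google.cloud import billing_v1

client = billing_v1.CloudBillingClient()
info = client.get_project_billing_info(name="projects/my-project-id")

print("billing account:", info.billing_account_name)
print("billing enabled:", info.billing_enabled)  # False would match the symptom we saw
```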
Our organisation has recently moved its infrastructure from AWS to Google Cloud, and I figured Dataproc clusters would be a good solution for running our existing Spark jobs. But when it came to comparing the pricing, I also realised that I could just fire up a Google Kubernetes Engine (GKE) cluster and install Spark on it to run Spark applications.
Now my question is: how do "running Spark on GKE" and using Dataproc compare? Which would be the better option in terms of autoscaling, pricing and infrastructure? I've read Google's documentation on GKE and Dataproc, but there isn't enough there to be sure about the advantages and disadvantages of one over the other.
Any expert opinion will be extremely helpful.
Thanks in advance.
Spark on Dataproc is proven and in use at many organizations. Although it's not fully managed, you can automate cluster creation, tear-down and job submission through the GCP API, but it's still another stack you have to manage.
Spark on GKE is newer: Spark started adding features to support Kubernetes from 2.4 onwards, and Google itself released Kubernetes support in preview just a couple of days back (Link).
As things stand, I would go with Dataproc if I had to run jobs in a production environment; otherwise you could experiment with the Docker route yourself and see how it fares, but I think it needs a little more time to become stable. From a purely cost perspective it would be cheaper with Docker, since you can share resources with your other services.
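To illustrate the "automate cluster creation, tear-down and job submission through the GCP API" point, here is a rough sketch of an ephemeral-cluster flow with the google-cloud-dataproc client. The project, region, machine types and the SparkPi example are placeholders for whatever your jobs actually use.

```python
# Sketch: create a Dataproc cluster, submit one Spark job, then tear the
# cluster down. Project, region and sizes are placeholders.
from google.cloud import dataproc_v1

project_id, region, cluster_name = "my-project", "us-central1", "ephemeral-spark"
endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}

cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Create the cluster and wait for it to come up.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
cluster_client.create_cluster(
    project_id=project_id, region=region, cluster=cluster
).result()

# 2. Submit a Spark job (the bundled SparkPi example here) and wait for it.
job = {
    "placement": {"cluster_name": cluster_name},
    "spark_job": {
        "main_class": "org.apache.spark.examples.SparkPi",
        "jar_file_uris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
        "args": ["1000"],
    },
}
job_client.submit_job_as_operation(
    project_id=project_id, region=region, job=job
).result()

# 3. Tear the cluster down so you only pay while the job runs.
cluster_client.delete_cluster(
    project_id=project_id, region=region, cluster_name=cluster_name
).result()
```

Wrapping this in your scheduler of choice is what makes the "ephemeral clusters" pattern practical: the cluster exists only for the duration of the job.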
Adding my two cents to the above answer.

I would favor Dataproc because it's managed and supports Spark out of the box. No hassles. More importantly, it is cost-optimized: you may not need clusters all the time, and with Dataproc you can use ephemeral clusters.

With GKE, I would need to explicitly discard the cluster and recreate it when necessary, which requires additional care.

I could not find any direct GCP service offering for data lineage. For that, I would probably use Apache Atlas with the Spark-Atlas-Connector on a Spark installation I manage myself; in that case, running Spark on GKE, with all the control in my own hands, would make a compelling choice.
I followed the tutorial for setting up JupyterHub on an AWS EMR cluster at this link: https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
I got the cluster up and running, but now my question is how do I stress/load test? (i.e. simulate 100 users running through the notebooks simultaneously).
In a classroom setting, I had about 30 users SSHed into my cluster working through the notebook exercises, but there was a huge slowdown when more people started executing the code blocks in the notebooks. What happened was that some Python library imports took forever, and some exercises stopped working or just hung. CloudWatch showed a network bottleneck.
Basically what I'm asking is: how can I go about debugging something like that? What's the best way to simulate multiple users SSHing into the EMR cluster, opening Jupyter notebooks and running the code blocks concurrently?
You should look at (and contribute to?) projects like this one, which are meant to load-test JupyterHub and should migrate to the JupyterHub organisation once they are more polished.
Note that in your case you don't really want to test JupyterHub itself; you are testing your cluster. Just run N scripts in parallel importing your library and you have your load test.
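Something along these lines is enough for the "N scripts in parallel" part. It is only a sketch: the library imports, the user count and the stand-in workload are placeholders for whatever your notebooks actually do.

```python
# Sketch: simulate N concurrent users by running the same "notebook-like"
# workload (heavy imports plus a bit of computation) in parallel processes.
import time
from concurrent.futures import ProcessPoolExecutor

N_USERS = 30  # how many simultaneous "students" to simulate


def simulate_user(user_id: int) -> float:
    start = time.time()
    # Heavy imports are done inside the worker so every "user" pays the cost,
    # since that was the first thing to slow down in class.
    import numpy as np
    import pandas as pd

    # A small stand-in for the exercise cells.
    df = pd.DataFrame(np.random.rand(100_000, 10))
    df.describe()
    return time.time() - start


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=N_USERS) as pool:
        durations = list(pool.map(simulate_user, range(N_USERS)))
    print(f"median: {sorted(durations)[len(durations) // 2]:.1f}s, "
          f"max: {max(durations):.1f}s")
```

Run it from one or more client machines against the cluster while watching CloudWatch; if the per-user durations climb sharply as N grows, you have reproduced the bottleneck you saw in class.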
I suddenly became an admin of the cluster in my lab and I'm lost.
I have experience managing Linux servers, but not clusters.
Clusters seem to be quite different.
I figured out that the cluster is running CentOS and Rocks.
I'm not sure what SGE is, or whether it is used in the cluster or not.
Could you point me to an overview or documentation of how the cluster is configured and how to manage it? I googled, but there seem to be lots of ways to build a cluster and it's confusing to know where to start.
I too suddenly became a Rocks Clusters admin. While your CentOS knowledge will be handy, there is a 'Rocks' way of doing things which you need to read up on. Most of it starts with the CLI commands rocks list and rocks set, and they are very nice to work with once you get to know them.
You should probably start by reading the documentation (this is for the newest version; you can find yours with 'rocks report version'):
http://central6.rocksclusters.org/roll-documentation/base/6.1/
You can read up on SGE part at
http://central6.rocksclusters.org/roll-documentation/sge/6.1/
I would recommend signing up for the Rocks Clusters discussion mailing list at:
https://lists.sdsc.edu/mailman/listinfo/npaci-rocks-discussion
The list is very friendly.