I followed the tutorial for setting up JupyterHub on an AWS EMR cluster at this link: https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/
I got the cluster up and running, but now my question is how do I stress/load test? (i.e. simulate 100 users running through the notebooks simultaneously).
In a classroom setting, I had about 30 users SSHed into my cluster running through the notebook exercises, but there was a huge slowdown when more people started executing the code blocks in the notebooks. Some Python library imports took forever, and some exercises stopped working or just hung. CloudWatch showed that there was a network bottleneck.
Basically what I'm asking is how can I go about debugging something like that? What's the best way to simulate multiple users sshing into the EMR cluster, opening up jupyter notebooks and running the code blocks concurrently?
You should look at (and contribute to?) projects like this one, which are meant to load-test JupyterHub and should migrate to the JupyterHub organisation once more polished.
Note that in your case you don't really want to test JupyterHub; you are testing your cluster. Just run N scripts in parallel that import your library and you have your load test.
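As a rough illustration of the "run N scripts in parallel" idea, here is a minimal sketch using only the Python standard library. The function names and the choice of `json` as the imported module are placeholders of mine, not anything from the tutorial; swap in whichever library was slow for your students. Each simulated user gets a fresh interpreter, so every import is a cold one.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_import(module: str) -> float:
    """Start a fresh Python interpreter, import `module`, return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start


def load_test(module: str, users: int) -> list[float]:
    """Run `users` imports concurrently, one interpreter per simulated user."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        return list(pool.map(time_import, [module] * users))


if __name__ == "__main__":
    # "json" is only a stand-in; use the library that was slow in class.
    timings = load_test("json", users=10)
    print(f"slowest import: {max(timings):.2f}s, fastest: {min(timings):.2f}s")
```

Running this on the cluster with increasing values of `users` while watching CloudWatch should show roughly where the slowdown begins. It only exercises imports, not the notebooks themselves.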
I have a python app that builds a dataset for a machine learning task on GCP.
Currently I have to start an instance of a VM that we have, and then SSH in, and run the app, which will complete in 2-24 hours depending on the size of the dataset requested.
Once the dataset is complete the VM needs to be shutdown so we don't incur additional charges.
I am looking to streamline this process as much as possible, so that we have a "1 click" or "1 command" solution, but I'm not sure the best way to go about it.
From what I've read about so far it seems like containers might be a good way to go, but I'm inexperienced with docker.
Can I set up a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down? How would I pass information to the container, such as where to get the config file? It's conceivable that we will have multiple datasets being generated at the same time based on different config files.
Is there a better gcloud feature that suits our purpose more effectively than containers?
I'm struggling to find information on these basic questions; container tutorials seem to be dominated by web apps.
It would be useful to have a batch-like container service that runs a container until its process completes. I'm unsure whether such a service exists. I'm most familiar with Google Cloud Platform and this provides a wealth of compute and container services. However -- to your point -- these predominantly scale by (HTTP) requests.
One possibility may be Cloud Run, triggering jobs using Cloud Pub/Sub. I see there are async capabilities too, and this may be interesting (I've not explored).
Another runtime for you to consider is Kubernetes itself. While Kubernetes requires some overhead in having Google, AWS or Azure manage a cluster for you (I strongly recommend you don't run Kubernetes yourself) and some inertia in the capacity of the cluster's nodes vs. the needs of your jobs, as you scale the number of jobs, you will smooth these needs. A big advantage with Kubernetes is that it will scale (nodes|pods) as you need them. You tell Kubernetes to run X container jobs, it does it (and cleans-up) without much additional management on your part.
I'm biased and approach the container vs image question mostly from a perspective of defaulting to container-first. In this case, you'd receive several benefits from containerizing your solution:
reproducible: the same image is more likely to produce the same results
deployability: container run vs. manage OS, app stack, test for consistency etc.
maintainable: smaller image representing your app, less work to maintain it
One (beneficial!?) workflow change if you choose to use containers is that you will need to build your images before using them. Something like Knative combines these steps, but I'd stick with doing this yourself initially. A common solution is to trigger builds (Docker, GitHub Actions, Cloud Build) from your source code repo. Commonly you would run tests against the images that are built, but you may also run your machine-learning tasks this way too.
Your containers would contain only your code. When you build your container images, you would pip install, perhaps pip install --requirement requirements.txt, to pull the appropriate packages. Your data (models?) are better kept separate from your code when this makes sense. When your runtime platform runs containers for you, you provide configuration information (environment variables and|or flags) to the container.
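To make the "environment variables and|or flags" point concrete, here is a minimal sketch of how the app's entry point might accept its config location either way. The `--config` flag and the `DATASET_CONFIG` variable are names I made up for illustration, and the bucket path is a placeholder.

```python
import argparse
import os


def parse_config(argv=None, env=None):
    """Read the config location from --config, falling back to $DATASET_CONFIG."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Build a dataset from a config file.")
    parser.add_argument(
        "--config",
        default=env.get("DATASET_CONFIG"),  # hypothetical variable name
        help="location of the config file (falls back to $DATASET_CONFIG)",
    )
    args = parser.parse_args(argv)
    if not args.config:
        parser.error("pass --config or set DATASET_CONFIG")
    return args


# Example call with an explicit flag; the path is a placeholder.
args = parse_config(["--config", "gs://example-bucket/dataset-a.yaml"], env={})
print(f"building dataset from {args.config}")
```

With something like this in place, each concurrent dataset build would just be the same image started with a different value, for example via `docker run -e DATASET_CONFIG=...` or by appending `--config ...` to the container command.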
A startup script seems to fit the bill better than containers. The instance always executes startup scripts as root, so the script can do anything it needs to.
A startup script will perform automated tasks every time your instance boots up. Startup scripts can perform many actions, such as installing software, performing updates, turning on services, and any other tasks defined in the script.
Keep in mind that a startup script cannot stop an instance but you can stop an instance through the guest operating system.
This would be the ideal solution for the question you posed. It would require a small change in your Python app so that it shuts down the operating system when the dataset is complete.
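The small change to the app could look roughly like the sketch below. `build_dataset` is a hypothetical stand-in for your real build, and the `sudo shutdown -h now` command assumes a typical Linux guest where the app's user is allowed to run it; adjust for your image.

```python
import subprocess


def build_dataset() -> None:
    """Hypothetical placeholder for the real dataset build."""
    print("building dataset...")


def run_then_shutdown(
    build=build_dataset,
    run=subprocess.run,
    command=("sudo", "shutdown", "-h", "now"),  # assumed shutdown command
):
    """Run the build, then halt the guest OS whether the build succeeded or not."""
    try:
        build()
    finally:
        # Runs even if build() raised, so a crashed build does not keep billing.
        run(list(command), check=True)
```

Calling `run_then_shutdown()` as the last step of the app is the idea; the `finally` block is a design choice so that a failed build still stops the VM. Upload results and logs before that point, since the machine goes away right after.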
Q1) Can I setup a container that will pip install the latest app from our private GitHub and execute the dataset build before shutting down?
A1) Medium has a great article on installing a package from a private git repo inside a container. You can execute the dataset build before shutting down.
Q2) How would I pass information to the container such as where to get the config file etc?
A2) You can use the Dockerfile ENV instruction to set an environment variable. It will be available within the container.
You may consider looking into the Docker documentation for more information about containers.
I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
But I am using Azure HDInsight to deploy my YARN cluster, and I don't think I can log into the machines and update the ulimit on all of them.
Is there any other way to solve this problem? I cannot reduce the number of reducers by too much either, or my job will become much slower.
You can SSH into all nodes from the head node (the Ambari UI shows the FQDN of every node).
ssh sshuser@nameofthecluster.azurehdinsight.net
You can then write a custom action that alters the settings on the necessary nodes if you want to automate this.
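Once you are on a node, it helps to confirm what the open-file limit actually is before and after changing it. A minimal sketch using Python's standard `resource` module (Unix only); the helper name is mine. Note that it reports the limit for the process it runs in, so it should be run as the same user that launches the Spark executors.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = nofile_limits()
print(f"open-file limit: soft={soft}, hard={hard}")
```

This is the same information `ulimit -n` (soft) and `ulimit -Hn` (hard) give in a shell.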
I want to start learning Spark 2.0, so I am trying to set up my dev environment (Scala v2.11).
Spark uses a distributed environment, working as one cluster across multiple separate machines, with one node per machine. However, I do not have many machines for testing; I only have one machine with CentOS 7 on it.
I am not after performance, I need something that would simulate a working cluster so that I could learn Spark.
How can I set up a development environment to learn and develop Spark applications without access to multiple machines, while still being able to learn and write code for a fully functional Spark-based environment?
Start with local mode.
Spark will do everything as usual: spawn executors, distribute tasks, etc. The only step omitted is the transfer of data across the network, which happens completely under the hood in production, so you don't need to take this omission into account while coding.
You will be able to specify the number of executors (which are only threads in this mode) and test, for example, the fact that Spark Streaming needs at least 2 of them.
Referring to your comments:
Or it does not make much sense to make a cluster to learn spark
because it is all done under the hood and the programming is all the
same on local and say standalone/YARN/mesos mode
Yes, there are some conventions, but they are exactly the same on local and other modes.
Does the local mode means that I will be able to start exemplary
cluster with say 3 nodes?
local[3] should do the trick.
This is a question of integration:
I would like to run Jenkins on Google Compute Engine. I can do this, but I will quickly break my budget if I leave an 8-core virtual machine running at all times. As a solution, I think I can leave a micro instance with a low amount of memory powered on at all times, acting as the Jenkins master. It seems that I should be able to configure GitHub to start up a Jenkins slave (with 8 cores) whenever a push is performed. How do I connect GitHub post-commit hooks to Google Compute Engine to achieve this? A complete answer is probably asking too much, but even just pointers to the relevant documentation would be helpful.
Alternatively, how would you solve my problem?
You can run an App Engine instance and use the URL it provides as the target of your GitHub on-commit webhook. This way, you won't be charged unless the instance is actually running, which may even be cheaper than running a micro instance 24x7 on Compute Engine.
You can then start/stop instances on Compute Engine or trigger actions on them from your code running on App Engine.
Here's a related question which has an answer for how to authenticate to Compute Engine from code running on App Engine.
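Since that webhook URL would be public and would end up starting an 8-core machine, the handler probably wants to check that a request really came from GitHub before acting on it. GitHub signs each delivery with HMAC-SHA256 of the raw body using the webhook secret and sends it in the `X-Hub-Signature-256` header as `sha256=<hexdigest>`. A minimal sketch of that check follows; `handle_push` and the `start_builder` callback are hypothetical names, and the actual Compute Engine call is deliberately left to that callback.

```python
import hashlib
import hmac
import json


def is_valid_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Compare the X-Hub-Signature-256 header with our own HMAC of the raw body."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)


def handle_push(secret: bytes, payload: bytes, signature_header: str, start_builder) -> bool:
    """Start the build machine only for correctly signed push events.

    `start_builder` is a hypothetical callback that would start the
    Compute Engine instance; it receives the pushed ref.
    """
    if not is_valid_signature(secret, payload, signature_header):
        return False
    event = json.loads(payload)
    start_builder(event.get("ref", ""))
    return True
```

The web framework in front of this only needs to pass in the raw request body and the header value unchanged.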
I ended up using a preemptible instance that automatically gets restarted every few minutes. I had to set up the instance manager to perform this restart, and I had to use the API, since this is a somewhat advanced and peculiar use of the features.
I suddenly became an admin of the cluster in my lab and I'm lost.
I have experience managing Linux servers but not clusters.
Clusters seem to be quite different.
I figured out that the cluster is running CentOS and Rocks.
I'm not sure what SGE is or whether it is used in the cluster.
Would you point me to an overview or documentation of how such a cluster is configured and how to manage it? I googled, but there seem to be lots of ways to build a cluster and it is confusing where to start.
I too suddenly became a Rocks Clusters admin. While your CentOS knowledge will be handy, there are some 'Rocks' ways of doing things, which you need to read up on. They mostly start with the CLI commands rocks list and rocks set, and they are very nice to work with once you get to know them.
You should probably start by reading the documentation (this is for the newest version; you can find yours with 'rocks report version'):
http://central6.rocksclusters.org/roll-documentation/base/6.1/
You can read up on the SGE part at
http://central6.rocksclusters.org/roll-documentation/sge/6.1/
I would recommend that you sign up for the Rocks Clusters discussion mailing list at:
https://lists.sdsc.edu/mailman/listinfo/npaci-rocks-discussion
The list is very friendly.