PBS: Can a large-memory serial job make use of memory from more than one node? - hpc

I am new to high-performance computing and am trying to run a serial job that requires around 80GB of memory in total. However, the total memory available on one node of our cluster is only 12GB (our lab's cluster is a little old). I read through some guides online and, to my understanding, only MPI jobs can make use of memory from more than one node. Is that true? Any ideas on how to solve my particular problem? Thank you very much!

What you're describing is some sort of shared-memory abstraction for distributed systems. Unfortunately, clusters and other HPC systems don't work like that: you need to use inter-node communication (message passing) to access more memory. MPI is the de facto standard for distributed processing, and you won't be able to scale beyond a single node's memory limit without editing the code.
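As a rough back-of-the-envelope sketch of what a distributed rewrite would need, the snippet below estimates how many 12GB nodes it takes to hold an 80GB working set if the data were partitioned evenly across MPI ranks. The `usable_fraction` parameter is my own assumption (the OS and the MPI runtime need some memory too); it is not a PBS or MPI setting.

```python
import math


def nodes_needed(total_gb: float, node_gb: float, usable_fraction: float = 1.0) -> int:
    """Minimum number of nodes whose combined usable memory covers total_gb.

    usable_fraction is a hypothetical knob: the share of each node's RAM
    that the job can actually use.
    """
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_per_node = node_gb * usable_fraction
    return math.ceil(total_gb / usable_per_node)


# Numbers from the question: 80GB job, 12GB per node.
print(nodes_needed(80, 12))       # all RAM usable
print(nodes_needed(80, 12, 0.8))  # assume only 80% of each node is usable
```

This only says how many nodes the data would occupy; the code still has to be rewritten so that each rank works on its own slice.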

I have never heard of MPI being able to run code that requires more memory than a single node has (unless the application was specifically designed to share memory between nodes). ElasticOS is working on that, though: http://synergy.cs.vt.edu/2015-nsf-xps-workshop/reports/Richard_Han_45-Conference_Presentation_ElasticOS_XPS_2015.2.pdf

Related

OpenShift/Kubernetes memory handling

What is the correct way of memory handling in OpenShift/Kubernetes?
If I create a project in OKD, how can I determine the optimal memory usage of pods? For example, say I use 1 deployment for 1-2 pods, and each pod (a Spring Boot app) uses 300-500MB of RAM. So technically 20 pods use around 6-10GB of RAM, but as I see it, a project can sometimes have around 100-150 containers, which need at least 30-50GB of RAM.
I also tried horizontal scaling and requests/limits, but each microservice still uses a lot of memory.
However, a pod requires around 500-700MB of RAM to start; once the Spring container has started, it can live with around 300MB, as mentioned.
So, I have 2 questions:
Is it possible to give each pod extra memory only for the first X minutes after it starts?
If not, then what is the best practice for handling a memory shortage if I have limited memory (16GB) and want to run 35-40 pods?
Thanks for the answer in advance!
Is it possible to give each pod extra memory only for the first X minutes after it starts?
You do get this behavior when you set the limit to a higher value than the request. This allows pods to burst, unless they all need the memory at the same time.
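To make the request/limit trade-off concrete, here is a minimal sketch using the numbers from the question (16GB total, up to 40 pods, about 300MB steady state, about 700MB at startup). It treats the 16GB as one pool and ignores memory reserved for the OS and node components, both of which are simplifying assumptions; the function and field names are made up for illustration.

```python
def memory_budget(node_mb: int, pods: int, request_mb: int, limit_mb: int) -> dict:
    """Compare summed requests and limits against the available memory."""
    total_requests = pods * request_mb  # what scheduling is based on
    total_limits = pods * limit_mb      # worst case if every pod bursts at once
    headroom = node_mb - total_requests
    burst_per_pod = limit_mb - request_mb
    if headroom < 0:
        max_bursting = 0
    elif burst_per_pod <= 0:
        max_bursting = pods
    else:
        # Pods that can sit at their limit while the rest stay at their request.
        max_bursting = min(pods, headroom // burst_per_pod)
    return {
        "requests_fit": total_requests <= node_mb,
        "overcommitted": total_limits > node_mb,
        "max_pods_bursting_at_once": max_bursting,
    }


print(memory_budget(node_mb=16 * 1024, pods=40, request_mb=300, limit_mb=700))
```

With these numbers the requests fit (12000MB of 16384MB), the limits do not (28000MB), and roughly 10 pods can be at their startup peak at the same time, so staggering pod starts matters.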
If not, then what is the best practice for handling a memory shortage if I have limited memory (16GB) and want to run 35-40 pods?
It is common to use some form of cluster autoscaler to add more nodes to your cluster if it needs more capacity. This is easy if you run in the cloud.
In general, Java and the JVM are memory hungry; consider some other technology if you want to use less memory. How much memory an application needs depends entirely on the application, e.g. which data structures it uses.

Multiple node pools vs single pool with many machines vs big machines

We're moving all of our infrastructure to Google Kubernetes Engine (GKE) - we currently have 50+ AWS machines with lots of APIs, Services, Webapps, Database servers and more.
As we have already dockerized everything, it's time to start moving everything to GKE.
I have a question that may sound too basic, but I've been searching the Internet for a week and have not found any reasonable post about this.
Straight to the point, which of the following approaches is better and why:
1) Having multiple node pools with multiple machine types and always specifying in which pool each deployment should run; or
2) Having a single pool with lots of machines and letting the Kubernetes scheduler do the job, without worrying about where my deployments end up; or
3) Having BIG machines (in multiple zones to improve the cluster's availability and resilience) and letting Kubernetes deploy everything there.
Here is a list of considerations, to be taken merely as hints; I do not claim to describe best practice.
Each node you add brings some overhead with it, but you gain flexibility and availability, so that node failures and maintenance have less impact on production.
Nodes that are too small waste a lot of resources, since it will sometimes be impossible to schedule a pod even though the total amount of free RAM or CPU across the nodes would be enough. You can think of this issue as similar to memory fragmentation.
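The fragmentation effect can be sketched in a few lines. The node sizes and the pod request below are hypothetical, and the check is deliberately naive (memory only, one pod), not a model of the real Kubernetes scheduler.

```python
def can_schedule(free_gb_per_node: list, pod_request_gb: float) -> bool:
    """A pod must fit on a single node; free memory cannot be pooled across nodes."""
    return any(free >= pod_request_gb for free in free_gb_per_node)


small_nodes = [1.5, 1.5, 1.5, 1.5]  # GB free on each of four small nodes
one_big_node = [6.0]                # same total free memory on one node
pod_request = 4.0                   # GB requested by the pod

print(sum(small_nodes), can_schedule(small_nodes, pod_request))
print(sum(one_big_node), can_schedule(one_big_node, pod_request))
```

Both layouts have 6GB free in total, but only the single larger node can host the 4GB pod.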
I guess that the sizes of your pods and their memory and CPU requests are not all similar, but I do not see this as a big issue in principle, nor as a reason to go for 1). I do not see why a big pod should run only on big machines and a small one should be scheduled on small nodes. I would rather use 1) if you need a different memory-GB-to-CPU-core ratio to support different workloads.
I would advise you to run some tests in the initial phase to find the size of the biggest pod and the average size of the workload, in order to choose the machine types properly. Note that having one pod that exactly fits one node and assigning it to that node is not the right way to proceed (virtual machines exist for this kind of scenario), since fragmentation of resources could easily make it impossible to schedule a large pod.
Consider that their size will likely increase in the future, and scaling vertically is not always immediate: you need to switch off machines and terminate pods. I would oversize a bit to take this into account, especially since scaling horizontally is much easier.
As for the machine type, you can decide to go for a machine 5x the size of the biggest pod you have (or 3x? or 10x?). Oversize the number of nodes in the cluster a bit as well, to account for overhead and fragmentation and to still have free resources.
Remember that you have a hard limit of 100 pods per node and 5000 nodes.
Remember that in GCP the network egress throughput cap depends on the number of vCPUs a virtual machine instance has. Each vCPU has a 2 Gbps egress cap for peak performance, and each additional vCPU increases the network cap, up to a theoretical maximum of 16 Gbps per virtual machine.
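Read literally, the rule above is a simple capped linear function. The sketch below uses only the two figures quoted in this answer (2 Gbps per vCPU, 16 Gbps maximum); they may not match current GCP limits, so treat them as parameters rather than facts.

```python
def egress_cap_gbps(vcpus: int, per_vcpu_gbps: float = 2.0, max_gbps: float = 16.0) -> float:
    """Egress cap as described in the answer: linear in vCPUs, capped at max_gbps."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return min(vcpus * per_vcpu_gbps, max_gbps)


for n in (1, 4, 8, 32):
    print(n, egress_cap_gbps(n))
```

Under these figures the cap stops growing at 8 vCPUs, so a very large machine does not get proportionally more egress bandwidth than several medium ones combined.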
Regarding the prices of virtual machines, notice that there is no difference in price between buying two machines of size x and one of size 2x. Avoid customizing the size of machines, because it is rarely cost-effective; if you feel your workload needs more CPU or memory, go for a HighMem or HighCpu machine type.
P.S. Since you are going to build a pretty big cluster, check the size of the DNS.
I will add any other considerations that come to mind. Consider updating your question in the future with a description of the path you chose and the issues you faced.
1) makes a lot of sense: if you want, you can still let Kubernetes deployments treat it as one large pool (by not adding nodeSelector/nodeAffinity), but you can also have machines of different sizes, a pool of spot instances, and so on. And, after all, you can have pools that are tainted, and so excluded from normal scheduling and available only to a particular set of workloads. In my opinion it is preferable to gain some proficiency with this approach from the very beginning, yet with many provisioners it should be very easy to migrate from 2) to 1) anyway.
2) As explained above, this is effectively a subset of 1), so it is better to build up experience with approach 1) from day 1. But if you make sure your provisioning solution supports easy extension to the 1) model, then you can get away with starting with this simplified approach.
3) Big is nice, but "big" is relative. It depends on the requirements and the amount of your workloads. Remember that while you need to plan for the loss of a whole AZ anyway, you will lose single nodes much more frequently (reboots, decommissioning of underlying hardware, updates, etc.), so if you have more hosts, the impact of losing one will be smaller. The bottom line is that you need to find your own balance, one that makes sense for your particular scale. Maybe 50 nodes is too many; would 15 cut it? Who knows but you :)

Apache Spark Auto Scaling properties - Add Worker on the Fly

During the execution of a Spark Program, let's say,
reading 10GB of data into memory, and just doing a filtering, a map, and then saving in another storage.
Can I auto-scale the cluster based on the load and, for instance, add more worker nodes to the program if it eventually needs to handle 1TB instead of 10GB?
If this is possible, how can it be done?
It is possible to some extent, using dynamic allocation, but the behavior depends on job latency, not on direct usage of a particular resource.
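For reference, a minimal sketch of how dynamic allocation is typically switched on through `spark-submit` options. The property names are standard Spark configuration keys; the executor bounds, timeouts, and the `my_job.py` script name are placeholders of mine. Note that this grows and shrinks the number of executors within the capacity the cluster manager already has; it does not by itself add new worker machines.

```python
def dynamic_allocation_args(min_executors: int, max_executors: int, app: str = "my_job.py") -> list:
    """Build a spark-submit command line with dynamic allocation enabled."""
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service lets executors be removed safely.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors are requested when tasks have been pending this long...
        "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
        # ...and released after being idle this long.
        "spark.dynamicAllocation.executorIdleTimeout": "60s",
    }
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


print(" ".join(dynamic_allocation_args(2, 50)))
```

The backlog timeout is what ties scaling to pending work rather than to memory or CPU usage.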
You have to remember that, in general, Spark can handle data larger than memory just fine, and memory problems are usually caused by user mistakes or vicious garbage-collection cycles. Neither of these can be easily solved by "adding more resources".
If you are using one of the cloud platforms to create the cluster, you can use its auto-scaling functionality, which will scale the cluster horizontally (the number of nodes will change).
Agree with #user8889543: you can read much more data than fits in your memory.
As for adding more resources on the fly, it depends on your cluster type.
I use standalone mode, and I have code that adds machines on the fly; they attach to the master automatically, and then my cluster has more cores and memory.
If you only have one job/program in the cluster, then it is pretty simple. Just set
spark.cores.max
to a very high number and the job will take all the cores of the cluster always. see
If you have several jobs in the cluster, it becomes more complicated, as mentioned in #user8889543's answer.

Multi-core processor for multiple data containers

I have a dual-core Intel processor and would like to use one core for processing certain commands, like SATA writes, and the other for reads. How do we do this? Can this be controlled from the application (with multiple threads), or would it require a change in the kernel to ensure the reads/writes don't get processed by the 'wrong' core?
This will be pretty much totally up to your operating system, which you haven't specified.
Some may offer thread affinity to try and keep one thread on the same execution engine (be that a core or a CPU), but that's only for threads. If two threads both write to disk, then they may well do so on different engines.
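As an illustration of what thread affinity looks like from user space, here is a small sketch using Python's `os.sched_setaffinity`, which is only available on some platforms (notably Linux). The helper name `pin_to_cpu` is made up. It pins the caller to one CPU, but it says nothing about which core the kernel uses to service the resulting disk I/O, which is the limitation described above.

```python
import os


def pin_to_cpu(cpu: int):
    """Try to restrict the caller to a single CPU.

    Returns the affinity set reported afterwards, or None if the platform
    does not support affinity calls or the request is rejected.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # e.g. macOS or Windows
    try:
        if cpu not in os.sched_getaffinity(0):
            return None  # this CPU is not available to us
        os.sched_setaffinity(0, {cpu})  # pid 0 means "the caller"
        return os.sched_getaffinity(0)
    except OSError:
        return None


print(pin_to_cpu(0))
```

A reader thread and a writer thread could each call something like this with a different CPU number, but the I/O path inside the kernel remains outside the application's control.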
If you want that sort of low level control, it's probably best to do it at the kernel level.
My question to you would be "Why?". A great deal of performance tuning goes into OS kernels, and they generally know better than any application how to do this low-level stuff efficiently.

Benefits of multiple memcached instances

Is there any difference between having four 0.5GB memcached servers running and one 2GB instance?
Does running multiple instances offer any benefits?
If one instance fails, you still get the advantages of using the cache. This is especially true if you are using consistent hashing, which keeps sending the same data to the same instance rather than spreading new reads/writes among the machines that are still up.
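A minimal sketch of consistent hashing, to show why losing one instance leaves the rest of the cache intact. This is a generic hash ring, not the algorithm of any particular memcached client; the class name, server names, and replica count are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Map keys to servers by walking clockwise on a hash ring."""

    def __init__(self, servers, replicas=100):
        # Each server gets `replicas` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        # First ring point after the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


servers = ["cache1", "cache2", "cache3", "cache4"]
before = HashRing(servers)
after = HashRing(servers[:3])  # cache4 has failed
keys = [f"key{i}" for i in range(1000)]
moved = sum(
    1 for k in keys
    if before.server_for(k) != "cache4" and before.server_for(k) != after.server_for(k)
)
print("keys remapped that were not on the failed server:", moved)
```

Only the keys that lived on the failed server are reassigned; every other key still maps to the instance that already holds it, so those cache entries stay warm.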
You may also elect to run servers on 32 bit operating systems, that cannot address more than around 3GB of memory.
Check the FAQ: http://www.socialtext.net/memcached/ and http://www.danga.com/memcached/
High availability is nice, and the memcached client will automatically distribute your cache across the 4 servers. If one of those servers dies for some reason, you can handle that error by just continuing as if the cache were empty, by redirecting to a different server, or with any sort of custom error handling you want. If your 1x 2GB server dies, then your options are pretty limited.
The important thing to remember is that you do not have 4 copies of your cache, it is 1 cache, split amongst the 4 servers.
The only downside is that it's easier to run out of memory on 4x 0.5GB than on 1x 2GB.
I would also add that, theoretically, several machines might gain you some performance: if you have a lot of frontends doing a lot of heavy reads, it's much better to split them across different machines, since the network capacity and processing power of one machine can become an upper bound for you.
This advantage is highly dependent on memcached utilization, however (sometimes it might be way faster to fetch everything from one machine).