Can a GPU support multiple jobs without delay? - neural-network

So I am running a PyTorch deep learning job on a GPU,
but the job is pretty light.
My GPU has 8 GB of memory, but the job only uses about 2 GB, and GPU-Util is close to 0%.
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:01:00.0 On | N/A |
| 0% 36C P2 45W / 210W | 1155MiB / 8116MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Based on GPU-Util and memory usage, I might be able to fit in another 3 jobs.
However, I am not sure whether that would affect the overall runtime.
If I run multiple jobs on the same GPU, does that affect the overall runtime?
I tried it once, and I think there was a delay.

Yes, you can. One option is to use NVIDIA's Multi-Process Service (MPS) to run four copies of your model on the same card.
This is the best description I have found of how to do it: How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?
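In practice that means starting the MPS control daemon and then launching the jobs as ordinary processes. A minimal sketch, assuming a training script named train.py (the script name and job count are placeholders):

import os
import subprocess

env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"  # pin every job to the GPU shown above

# Start the MPS control daemon; client processes then share one CUDA
# context on the GPU instead of time-slicing between separate contexts.
subprocess.run(["nvidia-cuda-mps-control", "-d"], env=env, check=True)

# Launch four independent copies of the (hypothetical) training job.
jobs = [subprocess.Popen(["python", "train.py"], env=env) for _ in range(4)]
for job in jobs:
    job.wait()

# Shut the daemon down once every job has finished.
subprocess.run(["nvidia-cuda-mps-control"], input=b"quit\n", env=env, check=True)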
If you are using your card for inference only, you can host several models (either copies, or different models) on the same card using NVIDIA's TensorRT Inference Server.

Related

Google Cloud SQL Storage Expanding Issue

My PostgreSQL database storage is expanding far beyond my actual database size. I am assuming it is writing logs for every action against the database. If so, how do I turn that off?
All tables' storage size in the database:
table_name | pg_size_pretty
-----------------+----------------
matches | 3442 MB
temp_matches | 3016 MB
rankings | 262 MB
atp_matches | 41 MB
players | 11 MB
injuries | 4648 kB
tournaments | 1936 kB
temp_prematches | 112 kB
locations | 104 kB
countries | 16 kB
(10 rows)
My storage usage should be around 10GB.
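(For reference, a per-table listing like the one above can be produced with pg_total_relation_size; a minimal sketch, assuming psycopg2 and placeholder connection details:

import psycopg2

# Placeholder DSN; substitute your Cloud SQL connection details.
conn = psycopg2.connect("dbname=mydb user=postgres host=localhost")
cur = conn.cursor()

# pg_total_relation_size counts the table plus its indexes and TOAST data.
cur.execute("""
    SELECT relname,
           pg_size_pretty(pg_total_relation_size(relid))
    FROM pg_catalog.pg_statio_user_tables
    ORDER BY pg_total_relation_size(relid) DESC;
""")
for table_name, size in cur.fetchall():
    print(table_name, size)

cur.close()
conn.close()
)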
Your PostgreSQL instance may have Point-in-time recovery (PITR) enabled.
To explain: PITR relies on write-ahead logs (WAL), which must be archived for any instance that has it enabled. This archiving happens automatically on the backend and consumes storage space (even when the instance is idle), so using this feature results in increased storage usage on your DB instance.
Here's a similar issue: Google Cloud SQL - Postgresql storage keeps growing
You can stop the storage increase by disabling Point-in-time recovery: https://cloud.google.com/sql/docs/postgres/backup-recovery/pitr#disablingpitr
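As a sketch, disabling it from the command line looks something like this (the instance name is a placeholder, and the flag follows the linked docs page; treat that page as authoritative):

import subprocess

# Placeholder instance name. Disabling PITR stops the WAL archiving
# that has been consuming the extra storage.
subprocess.run(
    ["gcloud", "sql", "instances", "patch", "my-instance",
     "--no-enable-point-in-time-recovery"],
    check=True,
)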
First, I recommend you check whether you have "Enable automatic storage increases" set, since with it your instance's storage will keep growing and your bill along with it.
Please keep in mind that you can increase the storage size, but you cannot decrease it; storage increases are permanent for the life of the instance. With this setting enabled, a spike in storage requirements can permanently (if incrementally) increase storage costs for your instance.
On the other hand, if you have point-in-time recovery (PITR) enabled, I recommend you disable it in order to delete the logs. If that does not help, I think it would be necessary to contact the GCP support team so that they can inspect your instance carefully.

ReadWriteMany volumes on kubernetes with terabytes of data

We want to deploy a k8s cluster which will run ~100 IO-heavy pods at the same time. They should all be able to access the same volume.
What we tried so far:
CephFS
was very complicated to set up. Hard to troubleshoot. In the end, it crashed a lot and the cause was not entirely clear.
Helm NFS Server Provisioner
runs pretty well, but when IO peaks a single replica is not enough. We could not get multiple replicas to work at all.
MinIO
is a great tool for creating storage buckets in k8s, but our operations require fs mounting. That is theoretically possible with s3fs, but since we run ~100 pods, we would need to run 100 additional s3fs sidecars. That seems like a bad idea.
There has to be some way to get 2TB of data mounted in a GKE cluster with relatively high availability?
Filestore seems to work, but it's an order of magnitude more expensive than other solutions, and with a lot of IO operations it quickly becomes infeasible.
I contemplated creating this question on Server Fault, but the k8s community there is a lot smaller than SO's.
I think I have a definitive answer as of Jan 2020, at least for our use case:
| Solution | Complexity | Performance | Cost |
|-----------------|------------|-------------|----------------|
| NFS | Low | Low | Low |
| Cloud Filestore | Low | Mediocre? | Per Read/Write |
| CephFS | High* | High | Low |
* For GKE you need one additional step: change the node base image to Ubuntu
I haven't benchmarked Filestore myself, but I'll just go with stringy05's response: others have trouble getting really good throughput from it.
Ceph could be a lot easier if it were supported by Helm.
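For anyone taking the NFS route, the consuming side is just an ordinary ReadWriteMany claim. A minimal sketch with the official Python client (the storage class name, claim name, namespace, and size are placeholders):

from kubernetes import client, config

config.load_kube_config()

# A 2Ti ReadWriteMany claim against a (placeholder) NFS storage class;
# every pod that mounts this claim sees the same shared filesystem.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="shared-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],
        storage_class_name="nfs",
        resources=client.V1ResourceRequirements(requests={"storage": "2Ti"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)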

Jenkins and PostgreSQL is consuming a lot of memory

We have a data warehouse server running on Debian Linux, using PostgreSQL, Jenkins, and Python.
For the past few days, a lot of memory has been consumed by Jenkins and Postgres. I tried everything I could find on Google, but the issue is still there.
Can anyone give me a lead on how to reduce this memory consumption? It would be very helpful.
Below is the output from free -m:
total used free shared buff/cache available
Mem: 63805 9152 429 16780 54223 37166
Swap: 0 0 0
Below is the postgresql.conf file.
Below are the system configurations.
Results from htop:
Please don't post text as images. It is hard to read and process.
I don't see your problem.
Your machine has 64 GB RAM; 16 GB are used for PostgreSQL shared memory as you configured it, 9 GB are private memory used by processes, and 37 GB are available (the available column).
Linux uses available memory for the file system cache, which boosts PostgreSQL performance. The low value for free just means that the cache is in use.
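To make the accounting concrete, here is the arithmetic on the free -m numbers above (a sketch; all values in MiB, and free -m rounds each figure):

# Figures (MiB) taken from the free -m output in the question.
total, used, free, shared, buff_cache, available = 63805, 9152, 429, 16780, 54223, 37166

# total = used + free + buff/cache (up to rounding by free -m).
assert abs(total - (used + free + buff_cache)) <= 2

# "free" is tiny because the kernel parks spare RAM in the page cache;
# most of that cache is reclaimable, which is why "available" is large.
print(f"available: {available} MiB, despite free being only {free} MiB")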
For Jenkins, run it with these Java options:
JAVA_OPTS=-Xms200m -Xmx300m -XX:PermSize=68m -XX:MaxPermSize=100m
For Postgres, start it with the option:
-c shared_buffers=256MB
These values are the ones I use on a small homelab with 8 GB of memory; you might want to increase them to match your hardware. (Note that -XX:PermSize and -XX:MaxPermSize apply to Java 7 and earlier; on Java 8+ the metaspace equivalent is -XX:MaxMetaspaceSize.)

Large number of worker nodes vs. few worker nodes with more resources

Is it preferable to have a Kubernetes cluster with 4 nodes of 4 CPUs and 16 GB RAM each, or a 2-node cluster where each node has 8 CPUs and 32 GB RAM?
What benefits does a user get from horizontal scaling over vertical scaling in Kubernetes? I mean, suppose we want to run 4 pods: is it better to go with a 2-node cluster of 8 CPUs and 32 GB RAM each, or a 4-node cluster of 4 CPUs and 16 GB RAM each?
In general I would recommend larger nodes because it's easier to place containers on them.
If you have a pod that declares resources: {requests: {cpu: 2.5}}, you can only place one of them on a 4-core node (two across 2x 4-core nodes), but you can fit three on a single 8-core node.
+----+----+----+----+ +----+----+----+----+
|-WORKLOAD--| | |-WORKLOAD--| |
+----+----+----+----+ +----+----+----+----+
+----+----+----+----+----+----+----+----+
|-WORKLOAD--|--WORKLOAD--|-WORKLOAD--| |
+----+----+----+----+----+----+----+----+
If you have 16 cores total and 8 cores allocated, it's possible that no single node has more than 2 cores free with 4x 4-CPU nodes, but you're guaranteed to be able to fit that pod with 2x 8-CPU nodes.
+----+----+----+----+ +----+----+----+----+
|-- USED -| | |-- USED -| |
+----+----+----+----+ +----+----+----+----+
+----+----+----+----+ +----+----+----+----+
|-- USED -| | |-- USED -| |
+----+----+----+----+ +----+----+----+----+
Where does |-WORKLOAD--| go?
+----+----+----+----+----+----+----+----+
|------- USED ------| |
+----+----+----+----+----+----+----+----+
+----+----+----+----+----+----+----+----+
|------- USED ------| |
+----+----+----+----+----+----+----+----+
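If it helps, here is the same argument as a toy first-fit check in Python (numbers are hypothetical, matching the diagrams above):

def fits(pod_cpu, free_per_node):
    # First-fit: the pod schedules only if some single node has enough free CPU.
    return any(free >= pod_cpu for free in free_per_node)

# 16 cores total, 8 already allocated, spread evenly across the nodes:
four_small = [4 - 2] * 4   # four 4-core nodes, 2 cores used on each
two_large  = [8 - 4] * 2   # two 8-core nodes, 4 cores used on each

print(fits(2.5, four_small))  # False: no small node has 2.5 cores free
print(fits(2.5, two_large))   # True: each large node still has 4 cores free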
At the specific scale you're talking about, though, I'd be a little worried about running a 2-node cluster: if a single node dies you've lost half your cluster capacity. Unless I knew that I was running multiple pods that needed 2.0 CPU or more I might lean towards the 4-node setup here so that it will be more resilient in the event of node failure (and that does happen in reality).
Horizontal Autoscaling
Pros
Likely to have more capacity, since you are adding VMs and/or servers. You are essentially expanding your cluster.
In theory, more redundancy since you are spreading your workloads across different physical servers.
Cons
In theory, it's slower: provisioning new servers or VMs takes longer than spinning up pods/containers on an existing machine (as vertical autoscaling does).
Also, you need to provision both servers/VMs and containers/pods when you scale up.
Doesn't work that well with plain bare-metal infrastructure/servers.
Vertical Autoscaling
Pros
In theory, it should be faster to autoscale if you have large servers provisioned. (Also, faster response)
If you have data-intensive apps you might benefit from workloads running on the same machines.
Great if you have a few extra unused bare-metal servers.
Cons
If you have large servers provisioned you may waste a lot of resources.
You need to calculate the capacity of your workloads more precisely (this could be a pro or a con depending on how you see it)
If you have a fixed set of physical servers, you will run into eventual limitations of CPUs, Storage, Memory, etc.
Generally, you'd want to have a combination of both Horizontal and Vertical autoscaling.

AppFog claims I have used both of my service slots when I have none in use

AppFog claims that I am using both of my available services, when in reality I have deleted the services that were in use. (I had MySQL databases attached to a couple of apps I was using, but when I deleted the apps, I also deleted the services... they just never freed up for some reason.)
Anyone have any suggestions on how I might reclaim those lost services? It's kinda hard to have apps without services and it won't show me anything to unbind or delete in order to free up those slots.
-Thanks
C:\Sites>af info
AppFog Free Your Cloud Edition
For support visit http://support.appfog.com
Target: https://api.appfog.com (v0.999)
Client: v0.3.18.12
User: j****g@gmail.com
Usage: Memory (0B of 512.0M total)
Services (2 of 2 total)
Apps (0 of 2 total)
C:\Sites>af services
============== System Services ==============
+------------+---------+-------------------------------+
| Service | Version | Description |
+------------+---------+-------------------------------+
| mongodb | 1.8 | MongoDB NoSQL store |
| mongodb2 | 2.4.8 | MongoDB2 NoSQL store |
| mysql | 5.1 | MySQL database service |
| postgresql | 9.1 | PostgreSQL database service |
| rabbitmq | 2.4 | RabbitMQ message queue |
| redis | 2.2 | Redis key-value store service |
+------------+---------+-------------------------------+
=========== Provisioned Services ============
Probably easiest to email support@appfog.com and get them to look into it.