PBS Pro: $PBS_NODEFILE variable empty

For some reason, my $PBS_NODEFILE is empty, even when requesting multiple nodes / hosts.

It seems to depend on whether mpiprocs is set:
#PBS -l select=60:ncpus=4:mpiprocs=0 - nodefile empty.
#PBS -l select=60:ncpus=4:mpiprocs=1 - nodefile contains list of nodes.
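A minimal job script illustrating the working case above (just a sketch; the job name is made up, and exact node-file contents can vary with PBS Pro version and site configuration):

#!/bin/bash
#PBS -l select=60:ncpus=4:mpiprocs=1
#PBS -N nodefile-check
# With mpiprocs set to at least 1, PBS writes one entry per MPI rank into the
# node file; with mpiprocs=0 (as observed above) the file comes up empty.
echo "Node file: $PBS_NODEFILE"
sort -u "$PBS_NODEFILE"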

Related

Kubectl appears to be discarding standard output

I'm trying to copy the contents of a large (~350 files, ~40MB total) directory from a Kubernetes pod to my local machine. I'm using the technique described here.
Sometimes it succeeds, but very frequently the standard output piped to the tar xf command on my host appears to get truncated. When that happens, I see errors like:
<some file in the archive being transmitted over the pipe>: Truncated tar archive
The files in the source directory don't change. The file in the error message is usually different (i.e. it appears to be truncated in a different place).
For reference (copied from the document linked to above), this is analogous to what I'm trying to do (I'm using a different pod name and directory names):
kubectl exec -n my-namespace my-pod -- tar cf - /tmp/foo | tar xf - -C /tmp/bar
After running it, I expect the contents of my local /tmp/bar to be the same as those in the pod.
However, more often than not, it fails. My current theory (I have a very limited understanding of how kubectl works, so this is all speculation) is that when kubectl determines that the tar command has completed, it terminates -- regardless of whether or not there are remaining bytes in transit (over the network) containing the contents of standard output.
I've tried various combinations of:
stdbuf
Changing tar's blocking factor
Making the command take longer to run (by adding && sleep <x>)
I'm not going to list all combinations I've tried, but this is an example that uses everything:
kubectl exec -n my-namespace my-pod -- stdbuf -o 0 tar -b 1 -c -f - -C /tmp/foo . && sleep 2 | tar xf - -C /tmp/bar
There are combinations of that command that I can make work pretty reliably. For example, forgetting about stdbuf and -b 1 and just sleeping for 100 seconds, i.e.:
kubectl exec -n my-namespace my-pod -- tar -c -f - -C /tmp/foo . && sleep 100 | tar xf - -C /tmp/bar
But further experimentation led me to believe that the block size of tar (512 bytes, I believe?) was still too large (the argument to -b is a count of blocks, not the size of those blocks). This is the command I'm using for now:
kubectl exec -n my-namespace my-pod -- bash -c 'dd if=<(tar cf - -C /tmp/foo .) bs=16 && sleep 10' | tar xf - -C /tmp/bar
And yes, I HAD to make bs that small and sleep "that big" to make it work. But this at least gives me two variables I can mess with. I did find that if I set bs=1, I didn't have to sleep... but it took a LONG time to move all the data (one byte at a time).
So, I guess my questions are:
Is my theory correct that kubectl truncates standard output once it determines that the command given to exec has finished?
Is there a better solution to this problem?
Maybe you haven't been specific enough with kubectl about what the full command it must run really is. There may be ambiguity about which process owns the pipe: the "--" probably doesn't tell kubectl to include everything after the "|" as part of the remote command; that part is being intercepted by your local shell instead.
Have you tried wrapping all of it in double quotes and handing it to a shell in the pod?
CMD="tar cf - /tmp/foo | tar xf - -C /tmp/bar"
kubectl exec -n my-namespace my-pod -- sh -c "${CMD}"
That way the extraction at the target is included in the process kubectl monitors for completion.
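For completeness, one way to check whether a given variant actually transferred everything is to compare checksums on both ends (a sketch; it assumes sha256sum and find exist in the container image and reuses the example paths from the question):

kubectl exec -n my-namespace my-pod -- sh -c 'cd /tmp/foo && find . -type f | sort | xargs sha256sum' > remote.sums
(cd /tmp/bar && find . -type f | sort | xargs sha256sum) > local.sums
diff remote.sums local.sums && echo "copy is complete"

If the copy was truncated, diff will list the missing or mismatched files.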

Why do some buckets not appear in the output of gsutil ls?

When I do gsutil ls -p myproject-id I get a list of buckets (in my case 2 buckets), which I expect to be the list of all my buckets in the project:
gs://bucket-one/
gs://bucket-two/
But, if I do gsutil ls -p myproject-id gs://asixtythreecharacterlongnamebucket I actually get the elements of that long-named bucket:
gs://asixtythreecharacterlongnamebucket/somefolder/
So my question is: why doesn't the long-named bucket show up when I list the buckets of the project?
The only explanation that made sense to me was this one: https://stackoverflow.com/a/34738829/3457432
But I'm not sure. Is this the reason, or could it be something else?
Are you sure that asixtythreecharacterlongnamebucket belongs to myproject-id? It really sounds like asixtythreecharacterlongnamebucket was created in a different project.
You can verify this by checking the bucket ACLs for asixtythreecharacterlongnamebucket and bucket-one and seeing if the project numbers in the listed entities match:
$ gsutil ls -Lb gs://asixtythreecharacterlongnamebucket | grep projectNumber
$ gsutil ls -Lb gs://bucket-one | grep projectNumber
Also note that the -p argument to ls has no effect in your second command when you're listing objects in some bucket. The -p argument only affects which project should be used when you're listing buckets in some project, as in your first command. Think of ls as listing the children resources belonging to some parent -- the parent of a bucket is a project, while the parent of an object is a bucket.
You aren't performing the same request!
gsutil ls -p myproject-id
Here you ask for all the bucket resources that belong to a project.
gsutil ls -p myproject-id gs://asixtythreecharacterlongnamebucket
Here you ask for all the objects that belong to the bucket asixtythreecharacterlongnamebucket, using myproject-id as the quota project.
In both cases, you need permission to access the resources.
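To double-check, a small loop along the lines of the first answer prints the owning project number for every bucket involved (a sketch; the bucket names are the placeholders used above):

for b in gs://bucket-one gs://bucket-two gs://asixtythreecharacterlongnamebucket ; do
  echo "== $b"
  gsutil ls -Lb "$b" | grep projectNumber
done

Any bucket whose projectNumber differs from the others was created in, and is listed under, a different project.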

How can I clear all children nodes of a data node, but NOT delete the data node itself in zookeeper?

I have a znode: /test
And /test has two children nodes: /test/data1, /test/data2
How can I delete /test/data1 and /test/data2, but at the same time, NOT delete the node /test?
You can execute something like the following:
zkCli.sh -server xxx ls /test | \
grep "^\[" | \
grep -o -P "\w+" | \
while read znode ; do zkCli.sh -server xxx delete /test/$znode ; done
This uses only zkCli.sh and standard bash commands, but it is not optimal because it connects to the ZooKeeper server multiple times (once for each direct child deletion, plus once to fetch the children list). A more straightforward approach would be to use a ZooKeeper client library such as kazoo or the ZooKeeper Java API for such a task.
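If you want to stay with zkCli.sh, the repeated connections can be reduced by piping all of the delete commands into a single session (a sketch; it assumes the child names consist of word characters as in the grep above, that the children are themselves leaf nodes, and that your zkCli.sh accepts commands on standard input):

zkCli.sh -server xxx ls /test | \
grep "^\[" | \
grep -o -P "\w+" | \
sed 's|^|delete /test/|' | \
zkCli.sh -server xxx

This opens two connections (one for the listing, one for all the deletes) instead of one per child.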

Can we see transfer progress with kubectl cp?

Is it possible to see the progress of a file transfer with kubectl cp on Google Cloud?
No, this doesn't appear to be possible.
kubectl cp appears to be implemented by doing the equivalent of
kubectl exec podname -c containername \
tar cf - /whatever/path \
| tar xf -
This means two things:
tar(1) doesn't print any useful progress information. (You could in principle add a v flag to print out each file name as it goes by to stderr, but that won't tell you how many files in total there are or how large they are.) So kubectl cp as implemented doesn't have any way to get this out.
There's no richer native Kubernetes API for copying files.
If moving files in and out of containers is a key use case for you, it will probably be easier to build, test, and run a simple HTTP service for this instead. You can then rely on things like the HTTP Content-Length: header for progress metering.
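A rough sketch of that idea without touching the application itself (the pod name, container name, port, and paths are placeholders; it assumes python3 3.7+ is present in the image, but any static file server would do):

# create a single archive in the pod so the server can report a Content-Length
kubectl exec podname -c containername -- tar cf /tmp/path.tar /whatever/path
# serve it from inside the pod and forward the port locally
kubectl exec podname -c containername -- python3 -m http.server 8080 --directory /tmp &
kubectl port-forward podname 8080:8080 &
sleep 2   # give the port-forward a moment to come up
# curl shows a progress meter because the size is known up front
curl -o path.tar http://localhost:8080/path.tar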
One option is to use pv, which will show time elapsed, data transferred and throughput (e.g. MB/s):
$ kubectl exec podname -c containername -- tar cf - /whatever/path | pv | tar xf -
14.1MB 0:00:10 [1.55MB/s] [ <=> ]
If you know the expected transfer size ahead of time you can also pass this to pv and it will then calculate a % progress and also an ETA, e.g. for a 100m transfer:
$ kubectl exec podname -c containername -- tar cf - /whatever/path | pv -s 100m | tar xf -
13.4MB 0:00:09 [1.91MB/s] [==> ] 13% ETA 0:00:58
You obviously need to have pv installed (locally) for any of the above to work.
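If you don't know the size up front, you can ask the pod for it first and feed it to pv (a sketch; it assumes GNU du inside the container so that -sb reports a byte count):

SIZE=$(kubectl exec podname -c containername -- du -sb /whatever/path | cut -f1)
kubectl exec podname -c containername -- tar cf - /whatever/path | pv -s "$SIZE" | tar xf -

The tar stream is slightly larger than the raw byte total because of headers and padding, so treat the percentage as an approximation.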
It's not possible, but you can find here how to implement rsync with Kubernetes; rsync shows you the progress of the file transfer.
rsync files to a kubernetes pod
I figured out a hacky way to do this. If you have bash access to the container you're copying to, you can do something like wc -c <file> on the remote, then compare that to the size locally. du -h <file> is another option; it gives human-readable output, so it may be easier to read.
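A sketch of that approach as a one-liner you can leave running in a second terminal during the copy (the pod name and paths are hypothetical; it assumes wc is available in the destination container):

watch -n 2 'wc -c ./big.file ; kubectl exec my-pod -- wc -c /path/on/pod/big.file'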
On MacOS, there is still the hacky way of opening the "Activity Monitor" on the "Network" tab. If you are copying with kubectl cp from your local machine to a distant pod, then the total transfer is shown in the "Sent Bytes" column.
Not of super high precision, but it sort of does the job without installing anything new.
I know it doesn't show active progress for each file, but it does output a status including a byte count for each completed file, which for multiple files copied via scripts is almost as good as active progress:
kubectl cp local.file container:/path/on/container --v=4
Note that --v=4 enables verbose mode and will give you output. I found that kubectl cp shows this output from v=3 through v=5.

K-means on Mahout returning non-exclusive clusters

In my data I have users with a list of likes. I've dumped these likes into individual files for each user and would like to cluster them. Everything is working, except that the output has the same likes in multiple clusters. My understanding is that k-means should be exclusive. I figure the problem is perhaps with how I am dumping the data. I have also dumped all of the likes without spaces for the time being, until I can write a custom tokenizer. Here's what I'm running (from a Ruby script).
system("#{MAHOUT_CMD} seqdirectory -c UTF-8 -i data/users -o data/kmeans/converted")
system("#{MAHOUT_CMD} seq2sparse -i data/kmeans/converted -o data/kmeans/vectors")
system("#{MAHOUT_CMD} kmeans -i data/kmeans/vectors/tfidf-vectors -c data/kmeans/initial_clusters -o data/kmeans/kmeans_clusters -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.1 -k 20 -x 20")
last_cluster_folder = Dir["data/kmeans/kmeans_clusters/*"].last.gsub("data/kmeans/kmeans_clusters/", "")
system("#{MAHOUT_CMD} clusterdump -s data/kmeans/kmeans_clusters/#{last_cluster_folder}/ -d data/kmeans/vectors/dictionary.file-0 -dt sequencefile -o data/kmeans/clusters.txt -n 1000")
The output lists the "top terms" in each cluster; however, many of the likes occur in each cluster (though with different weights). Is this the normal output for clusterdump? Do I need to work out which cluster each word belongs to by its weight?
Thanks
Mahout is probably only doing an approximate k-means. Also, there might be objects that have the same distance to more than one cluster.
You should, however, be able to just take the k means and then do a 1-nearest-neighbor classification to get a unique result for each object (this is trivial to parallelize and very fast).
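If the goal is one cluster per user, a sketch of that idea using only the Mahout CLI (hedged: it assumes the Mahout version in use supports the kmeans driver's --clustering/-cl flag, which runs a final assignment pass that puts each vector into exactly one, the closest, cluster; "mahout" stands in for MAHOUT_CMD above):

# Same kmeans call as in the question, with -cl added so Mahout writes a
# clusteredPoints directory holding the exclusive vector-to-cluster assignment.
mahout kmeans \
  -i data/kmeans/vectors/tfidf-vectors \
  -c data/kmeans/initial_clusters \
  -o data/kmeans/kmeans_clusters \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -cd 0.1 -k 20 -x 20 \
  -cl

The "top terms" printed by clusterdump describe each cluster's centroid, so seeing the same terms in several clusters (with different weights) is expected; the per-point assignment in clusteredPoints is what is exclusive.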