Mahout clustering: How to retrieve the name of a named vector - cluster-analysis

I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster.
I read that you can use the option --namedVector when creating the sparse-files but where does it take the ID from and how can I retrieve this ID after the clustering is completed?
Right now I am doing the following steps:
I have a directory with a file for each document. The files are in the following format with the ID of the document as filename:
filename: documentID.txt
[TITLE]
[CONTENT]
I create a sparse directory with namedVectors using:
./mahout seqdirectory -i tmp/es-out -o tmp/es-out-seqdir -c UTF-8 -chunk 64 -xm sequential
./mahout seq2sparse -i tmp/es-out-seqdir -o tmp/es-out-sparse --maxDFPercent 85 --namedVector
Then I can cluster the results and create a dump:
./mahout kmeans -i tmp/es-out-sparse/tfidf-vectors -c tmp/es-kmeans-clusters -o tmp/es-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
./mahout clusterdump -i tmp/es-kmeans/clusters-10-final -o tmp/clusterdump -d tmp/es-out-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir tmp/es-kmeans/clusteredPoints
The dump looks like this:
:VL-190{n=1 c=[1:3.407, 110:6.193, 2007:3.736, about:1.762, according:2.948, account:3.507, acting:6.
Top Terms:
epa => 13.471728324890137
mountaintop => 11.364262580871582
mine => 10.942587852478027
Weight : [props - optional]: Point:
[...]

k-means in Mahout is only a toy.
You can use it for howtos and tutorials, but for real use it is too slow, too limited, too hard to use. (Also, k-means results are not half as good as people think... most of the time they are dogfood.)
Benchmark other tools, and you'll be surprised big time.

I found a way. You can use the seqdumper to extract the cluster mapping:
./mahout seqdumper -i /tmp/es-kmeans/clusteredPoints/part-m-00000 -o /tmp/cluster-points.txt
Then you can use a regex to extract the mapping of the vector IDs to cluster IDs.
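As a rough illustration of the regex step, here is a minimal Python sketch. It assumes each point in the seqdumper output sits on one line shaped roughly like `Key: <cluster ID>: Value: <weight>: <vector name> = [<terms>]`. The exact layout differs between Mahout versions, so check your own /tmp/cluster-points.txt and adjust the pattern. The function name and the sample lines are made up for the example.

```python
import re
from collections import defaultdict

# Assumed line layout (verify against your own dump):
#   Key: 190: Value: 1.0: /documentID.txt = [1:3.407, 110:6.193, ...]
LINE_PATTERN = re.compile(r"^Key: (\d+): Value: .*?([^\s:]+) = \[")


def parse_cluster_points(lines):
    """Map each cluster ID to the list of vector names assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:  # skip header/footer lines that do not describe a point
            cluster_id, vector_name = match.groups()
            clusters[int(cluster_id)].append(vector_name)
    return dict(clusters)


# Hypothetical sample lines, not real Mahout output.
sample = [
    "Input Path: /tmp/es-kmeans/clusteredPoints/part-m-00000",
    "Key: 190: Value: 1.0: /doc1.txt = [1:3.407, 110:6.193]",
    "Key: 190: Value: 1.0: /doc7.txt = [about:1.762]",
    "Key: 42: Value: 1.0: /doc3.txt = [mine:10.942]",
    "Count: 3",
]
mapping = parse_cluster_points(sample)
```

With --namedVector, the vector name is typically the key that seqdirectory wrote for the document, which is usually derived from the file name, so the values in the mapping should lead back to the documentID.txt files.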

Related

How to print debugging information on one/specific OpenAPI model?

According to the OpenAPI docs here is how one can print generator's models data:
$ java -jar openapi-generator-cli.jar generate \
-g typescript-fetch \
-o out \
-i api.yaml \
-DdebugModels
which outputs 39000 lines, and it's a little difficult to find the model of interest.
How to output debug information on just one model?
Unfortunately, there's no way to generate the debug log for just one model or operation.
As a workaround, you can draft a new spec that contains the model you want to debug.
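One way to draft such a reduced spec is sketched below in Python. It assumes an OpenAPI 3 document that has already been loaded into a dict (for example with a YAML parser), with models under `components.schemas` and only local `#/components/schemas/...` references. The function names and the sample spec are hypothetical, and the generator may need more than this minimal skeleton for some specs.

```python
def collect_refs(node, found):
    """Collect schema names referenced via '#/components/schemas/<name>'."""
    prefix = "#/components/schemas/"
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str) and value.startswith(prefix):
                found.add(value[len(prefix):])
            else:
                collect_refs(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_refs(item, found)


def single_model_spec(spec, model_name):
    """Build a small spec holding one model plus the schemas it refers to."""
    schemas = spec["components"]["schemas"]
    wanted, pending = set(), [model_name]
    while pending:
        name = pending.pop()
        if name in wanted:
            continue
        wanted.add(name)
        refs = set()
        collect_refs(schemas[name], refs)
        pending.extend(refs - wanted)
    return {
        "openapi": spec.get("openapi", "3.0.0"),
        "info": {"title": "debug " + model_name, "version": "0.0.0"},
        "paths": {},  # no operations, only the models of interest
        "components": {"schemas": {n: schemas[n] for n in sorted(wanted)}},
    }


# Hypothetical input spec for illustration.
full_spec = {
    "openapi": "3.0.3",
    "info": {"title": "Shop", "version": "1.0.0"},
    "paths": {},
    "components": {"schemas": {
        "Pet": {"type": "object", "properties": {
            "tags": {"type": "array", "items": {"$ref": "#/components/schemas/Tag"}}}},
        "Tag": {"type": "object", "properties": {"name": {"type": "string"}}},
        "Order": {"type": "object"},
    }},
}
reduced = single_model_spec(full_spec, "Pet")
```

Writing `reduced` back out as YAML or JSON and passing that file to the generator with -DdebugModels should then print only the models that were kept.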

How to copy partial topic data from one cluster to another

I have a use case where I need to copy the data from one topic to another topic in a different cluster but I need to copy only from a given offset. What can I use for the above use case?
I have looked into mirror maker as it copies data from one cluster to another but how to mention the offset part, I am not getting that.
Is there any utility I can use?
If, as you say, "This will be a one time operation", you can use kafkacat with its -o option.
For example (the easiest case):
kafkacat -C -b mybrocker_cluster_1:9092 -t mytopic1 -o <offset> | \
kafkacat -P -b mybrocker_cluster_2:9092 -t mytopic1
You probably still need to add a few parameters to the consumer:
-X message.max.bytes=<value> -X fetch.message.max.bytes=<value> -X receive.message.max.bytes=<value>

Can we see transfer progress with kubectl cp?

Is it possible to know the progress of file transfer with kubectl cp for Google Cloud?
No, this doesn't appear to be possible.
kubectl cp appears to be implemented by doing the equivalent of
kubectl exec podname -c containername -- \
tar cf - /whatever/path \
| tar xf -
This means two things:
tar(1) doesn't print any useful progress information. (You could in principle add a v flag to print out each file name as it goes by to stderr, but that won't tell you how many files in total there are or how large they are.) So kubectl cp as implemented doesn't have any way to get this out.
There's not a richer native Kubernetes API to copy files.
If moving files in and out of containers is a key use case for you, it will probably be easier to build, test, and run by adding a simple HTTP service. You can then rely on things like the HTTP Content-Length: header for progress metering.
One option is to use pv which will show time elapsed, data transferred and throughput (eg MB/s):
$ kubectl exec podname -c containername -- tar cf - /whatever/path | pv | tar xf -
14.1MB 0:00:10 [1.55MB/s] [ <=> ]
If you know the expected transfer size ahead of time you can also pass this to pv and it will then calculate a % progress and also an ETA, eg for a 100m transfer:
$ kubectl exec podname -c containername -- tar cf - /whatever/path | pv -s 100m | tar xf -
13.4MB 0:00:09 [1.91MB/s] [==> ] 13% ETA 0:00:58
You obviously need to have pv installed (locally) for any of the above to work.
It's not possible with kubectl cp, but you can set up rsync to work with Kubernetes instead, and rsync can show you the progress of the file transfer. See:
rsync files to a kubernetes pod
I figured out a hacky way to do this. If you have bash access to the container you're copying to, you can run something like wc -c <file> on the remote and compare the result to the local file size. du -h <file> is another option; its human-readable output may be more convenient.
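The size comparison can be scripted. The following Python sketch is only an illustration: it assumes kubectl is on the PATH, that wc is available inside the container, and that the destination file grows while the copy is running. All function names are hypothetical.

```python
import os
import subprocess


def parse_wc_bytes(output):
    """Extract the byte count from `wc -c` output such as '1048576 /tmp/file'."""
    return int(output.split()[0])


def percent_done(remote_bytes, local_bytes):
    """Return transfer progress as a percentage, capped at 100."""
    if local_bytes <= 0:
        return 100.0
    return min(100.0, 100.0 * remote_bytes / local_bytes)


def remote_size(pod, container, remote_path):
    """Ask the pod for the current size of the partially copied file."""
    result = subprocess.run(
        ["kubectl", "exec", pod, "-c", container, "--", "wc", "-c", remote_path],
        capture_output=True, text=True, check=True,
    )
    return parse_wc_bytes(result.stdout)


def copy_progress(pod, container, remote_path, local_path):
    """Compare the remote size with the local size of the file being copied."""
    return percent_done(remote_size(pod, container, remote_path),
                        os.path.getsize(local_path))
```

Calling `copy_progress("podname", "containername", "/path/on/container/local.file", "local.file")` in a loop from a second terminal would give a coarse percentage while kubectl cp runs.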
On MacOS, there is still the hacky way of opening the "Activity Monitor" on the "Network" tab. If you are copying with kubectl cp from your local machine to a distant pod, then the total transfer is shown in the "Sent Bytes" column.
Not of super high precision, but it sort of does the job without installing anything new.
I know it doesn't show active progress for each file, but it does output a status, including a byte count, for each completed file. For multiple files copied via scripts, that is almost as good as active progress:
kubectl cp local.file container:/path/on/container --v=4
Note that --v=4 sets the log verbosity and is what produces this output. I found that kubectl cp shows this output from v=3 through v=5.

Kmeans clustering using mahout

I am trying to run the k-means algorithm on my data using Mahout. One of the options that has to be passed when running it needs a path to initial clusters. Can anyone tell me how we can have initial clusters even before starting the algorithm?
bin/mahout kmeans \
-i <input vectors directory> \
-c <input clusters directory> \
-o <output working directory> \
-k <optional number of initial clusters to sample from input vectors> \
-dm <DistanceMeasure> \
-x <maximum number of iterations> \
-cd <optional convergence delta. Default is 0.5> \
-ow <overwrite output directory if present> \
-cl <run input vector clustering after computing Canopies> \
-xm <execution method: sequential or mapreduce>
A) Mahout is slooooow. If your data fits into main memory, use other tools such as ELKI. They outperformed Mahout for me by far. If your data doesn't fit into main memory: are you sure k-means makes any sense on your data anyway? There is no point in doing a computation that doesn't solve your problem. Start with a sample to first check if it works at all, then scale up. Mahout is a last resort choice: if you absolutely need this to be computed on all your data, and everything else failed, then use Mahout.
B) Read all the documentation... next line in the documentation of Mahout k-means says:
Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.
In other words: if you know the initial cluster centers, supply them via -c and do not set -k. Otherwise an empty -c folder is okay, if you provide -k, the number of cluster centers to sample.
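To make the -k behaviour concrete, here is a small Python sketch of sampling k input vectors as initial centers. It illustrates the idea only and is not Mahout's implementation; the function name and the toy vectors are invented for the example.

```python
import random


def sample_initial_centers(vectors, k, seed=None):
    """Pick k distinct input vectors at random to serve as initial centers."""
    if k > len(vectors):
        raise ValueError("k cannot exceed the number of input vectors")
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    return rng.sample(vectors, k)


# Toy two-dimensional input vectors, made up for illustration.
vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.5)]
centers = sample_initial_centers(vectors, k=2, seed=42)
```

Because the centers are drawn from the data itself, no prior knowledge of the clusters is needed, which is why an empty -c directory is acceptable when -k is given.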

K-means on Mahout returning non-exclusive clusters

In my data I have users, each with a list of likes. I've dumped these likes into an individual file for each user and would like to cluster them. Everything is working except that the output has the same likes in multiple clusters. My understanding is that k-means should be exclusive. I figure the problem is perhaps with how I am dumping the data. I have also dumped all of the likes without spaces for the time being, until I can write a custom tokenizer. Here's what I'm running (from a Ruby script).
system("#{MAHOUT_CMD} seqdirectory -c UTF-8 -i data/users -o data/kmeans/converted")
system("#{MAHOUT_CMD} seq2sparse -i data/kmeans/converted -o data/kmeans/vectors")
system("#{MAHOUT_CMD} kmeans -i data/kmeans/vectors/tfidf-vectors -c data/kmeans/initial_clusters -o data/kmeans/kmeans_clusters -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.1 -k 20 -x 20")
last_cluster_folder = Dir["data/kmeans/kmeans_clusters/*"].last.gsub("data/kmeans/kmeans_clusters/", "")
system("#{MAHOUT_CMD} clusterdump -s data/kmeans/kmeans_clusters/#{last_cluster_folder}/ -d data/kmeans/vectors/dictionary.file-0 -dt sequencefile -o data/kmeans/clusters.txt -n 1000")
The output lists the "top terms" in each cluster; however, many of the likes occur in each cluster (though with different weights). Is this the normal output for clusterdump? Do I need to find out which cluster each word belongs to by its weight?
Thanks
Mahout probably is only doing approximate k-means. Plus, there might be objects that have the same distance to more than one cluster.
You should, however, be able to just take the k means and then do a 1-nearest-neighbor classification to get a unique result for each object (this is trivial to parallelize and very fast).
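A minimal Python sketch of that 1-nearest-neighbor step follows. It assumes dense points and Euclidean distance, as in the kmeans call in the question, and breaks ties by taking the lowest cluster index. The names and the toy data are hypothetical.

```python
import math


def nearest_center(point, centers):
    """Return the index of the closest center; ties go to the lowest index."""
    best_index, best_distance = 0, math.inf
    for index, center in enumerate(centers):
        distance = math.dist(point, center)  # Euclidean distance (Python 3.8+)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def assign_points(points, centers):
    """Give every point exactly one cluster label."""
    return [nearest_center(p, centers) for p in points]


# Toy data: the last point is equally far from both centers.
centers = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 1.0), (9.0, -1.0), (5.0, 0.0)]
labels = assign_points(points, centers)
```

Each point is handled independently of the others, which is what makes this step easy to parallelize.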