Kmeans clustering using mahout - cluster-analysis

I am trying to perform kmeans algorithm on data using . The option that has to be passed while running need a path to initial clusters. Can anyone tell me how can we have initial clusters even before starting the algorithm?
bin/mahout kmeans \
-i <input vectors directory> \
-c <input clusters directory> \
-o <output working directory> \
-k <optional number of initial clusters to sample from input vectors> \
-dm <DistanceMeasure> \
-x <maximum number of iterations> \
-cd <optional convergence delta. Default is 0.5> \
-ow <overwrite output directory if present>
-cl <run input vector clustering after computing Canopies>
-xm <execution method: sequential or mapreduce>

A) Mahout is slooooow. If your data fits into main memory, use other tools such as ELKI. They outperformed Mahout for me by far. If your data doesn't fit into main memory: are you sure k-means makes any sense on your data anyway? There is no point in doing a computation that doesn't solve your problem. Start with a sample to first check if it works at all, then scale up. Mahout is a last resort choice: if you absolutely need this to be computed on all your data, and everything else failed, then use Mahout.
B) Read all the documentation... next line in the documentation of Mahout k-means says:
Note: if the -k argument is supplied, any clusters in the -c directory will be overwritten and -k random points will be sampled from the input vectors to become the initial cluster centers.
In other words: if you know the initial cluster centers, supply them via -c and do not set -k. Otherwise an empty -c folder is okay, if you provide -k, the number of cluster centers to sample.

Related

Can I use multiple --subtitutions flags in gcloud builds sbmit command?

This is what I'm doing right now:
gcloud builds submit ./dist \
--config=./cloudbuild.yaml \
--substitutions=_SUB_1=$VALUE_1,_SUB_2=$VALUE_2,_SUB_3=$VALUE_3 \
--project=$PROJECT_ID
This is what I'd like to do:
--substitutions=_SUB_1=$VALUE_1 \
--substitutions=_SUB_2=$VALUE_2 \
--substitutions=_SUB_3=$VALUE_3 \
Is this allowed?
Have you tried repeating the flag to prove to yourself whether that works?
I think it doesn't and you must use your first syntax.
When parsing command-line arguments, there's a distinction between a flag --xxx that has a repeating value and a repeating flag ---xxx=A --xxx=B that takes a singular value so, generally, the two aren't interchangeable (though it's logical to want these to be equivalent).
Do you want this because you're trying to script the command and encountering problems?

How to print debugging information on one/specific OpenAPI model?

According to the OpenAPI docs here is how one can print generator's models data:
$ java -jar openapi-generator-cli.jar generate \
-g typescript-fetch \
-o out \
-i api.yaml \
-DdebugModels
which outputs 39000 lines and it's a little difficult to find a modele of one's interest.
How to output debug information on just one model?
Unfortunately, there's no way to generate the debug log for just one model or operation.
As a workaround, you can draft a new spec that contains the model you want to debug.

Mahout clustering: How to retrieve the name of a named vector

I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster.
I read that you can use the option --namedVector when creating the sparse-files but where does it take the ID from and how can I retrieve this ID after the clustering is completed?
Right now I am doing the following steps:
I have a directory with a file for each document. The files are in the following format with the ID of the document as filename:
filename: documentID.txt
[TITLE]
[CONTENT]
I create a sparse directory with namedVectors using:
./mahout seqdirectory -i tmp/es-out -o tmp/es-out-seqdir -c UTF-8 -chunk 64 -xm sequential
./mahout seq2sparse -i tmp/es-out-seqdir -o tmp/es-out-sparse --maxDFPercent 85 --namedVector
Then I can cluster the results and create a dump:
./mahout kmeans -i tmp/es-out-sparse/tfidf-vectors -c tmp/es-kmeans-clusters -o tmp/es-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
./mahout clusterdump -i tmp/es-kmeans/clusters-10-final -o tmp/clusterdump -d tmp/es-out-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir tmp/es-kmeans/clusteredPoints
The dump looks like this:
:VL-190{n=1 c=[1:3.407, 110:6.193, 2007:3.736, about:1.762, according:2.948, account:3.507, acting:6.
Top Terms:
epa => 13.471728324890137
mountaintop => 11.364262580871582
mine => 10.942587852478027
Weight : [props - optional]: Point:
[...]
k-means in Mahout is only a toy.
You can use it for howtos and tutorials, but for real use it is too slow, too limited, roo hard to use. (Also, k-means results are not half as good as people think... most of the time they are dogfood.)
Benchmark other tools, and you'll be surprised big time.
I found a way. You can use the seqdumper to extract the cluster mapping:
./mahout seqdumper -i /tmp/es-kmeans/clusteredPoints/part-m-00000 -o /tmp/cluster-points.txt
Than you can use a regex to extract the mapping of the vector IDs to cluster IDs.

Progress of simulations in headless mode

Is there any way to check the progress of simulations in headless mode as opposed to gui?
Basic Code:
$ ~/netlogo-5.1.0/netlogo-headless.sh \
--model ~/myproject/MyModel.nlogo \
--experiment MyExperiment \
--table ~/myproject/MyNewOutputData.csv
I'd suggest doing tail -f ~/myproject/MyNewOutputData.csv. This will show you a live view of the output file as it is being written to.

K-means on Mahout returning non-exclusive clusters

In my data I have users with a list of likes, I've dumped these likes into individual files for each user and would like to cluster them. Everything is working except the output has the same likes in multiple clusters. My understanding is k-means should be exclusive. I figure the problem is perhaps with how I am dumping the data. I have also dumped all of the likes without spaces for the time being until I can write a custom tokenizer. Here's what I'm running (from a ruby script).
system("#{MAHOUT_CMD} seqdirectory -c UTF-8 -i data/users -o data/kmeans/converted")
system("#{MAHOUT_CMD} seq2sparse -i data/kmeans/converted -o data/kmeans/vectors")
system("#{MAHOUT_CMD} kmeans -i data/kmeans/vectors/tfidf-vectors -c data/kmeans/initial_clusters -o data/kmeans/kmeans_clusters -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.1 -k 20 -x 20")
last_cluster_folder = Dir["data/kmeans/kmeans_clusters/*"].last.gsub("data/kmeans/kmeans_clusters/", "")
system("#{MAHOUT_CMD} clusterdump -s data/kmeans/kmeans_clusters/#{last_cluster_folder}/ -d data/kmeans/vectors/dictionary.file-0 -dt sequencefile -o data/kmeans/clusters.txt -n 1000")
The output lists the "top terms" in each cluster, however many of the likes occur in each cluster (though with different weights). Is the normal output for clusterdumper, do I need to find out what cluster each word belongs to by its weight?
Thanks
Mahout probably is only doing approximate k-means. Plus, there might be objects that have the same distance to more than one cluster.
You should however be able to just take the k means, and then do a 1-nearest-neighbor classification to get a unique result for each objects (this is trivial to parallelize and very fast).