Dumping clustering results with vector names - cluster-analysis

I have created my Vectors as described in this question and have run mahout kmeans on the data.
Since I'm using Mahout 0.7, the clusterdump command didn't work as described in Mahout in Action, but I got it to work like this:
export HADOOP_CLASSPATH=/path/to/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar:/path/to/mahout-distribution-0.7/integration/target/mahout-integration-0.7.jar
hadoop jar core/target/mahout-core-0.7-job.jar org.apache.mahout.utils.clustering.ClusterDumper -i /clustering/out/clusters-20-final -o textout -of TEXT
and I am getting lines like this one:
VL-1383471{n=192 c=[0.180, -0.087, 0.281, 0.512, 0.678, 1.833, 2.613, 0.313, 0.226, 1.023, 0.229, -0.104, -0.461, -0.553, -0.318, 0.315, 0.658, 0.245, 0.635, 0.220, 0.660, 0.193, 0.277, -0.182, 0.497, 0.346, 0.658, 0.660, 0.191, 0.660, 0.636, 0.018, 0.519, 0.335, 0.535, 0.008, -0.028, 0.461, 0.229, 0.287, 0.619, 0.509, 0.566, 0.389, -0.075, -0.180, -0.461, 0.381, -0.108, 0.126, -0.728] r=[0.983, 0.890, 0.384, 0.823, 0.702, 0.000, 0.000, 1.132, 0.605, 0.979, 0.897, 0.862, 0.438, 0.546, 0.390, 0.171, 0.257, 0.234, 0.251, 0.106, 0.257, 0.093, 0.929, 0.077, 0.204, 0.218, 0.257, 0.257, 0.258, 0.257, 0.249, 0.112, 0.217, 0.157, 0.284, 0.197, 0.228, 0.229, 0.323, 0.401, 0.248, 0.217, 0.269, 1.002, 0.819, 0.706, 0.412, 0.964, 0.787, 0.872, 0.172]}
which is not yet useful to me, since I need the names of my vectors in each cluster.
I saw that for text documents a dictionary file is created. How would I create a dictionary for my data?
Also, using -of CSV gives me an empty file; am I doing something wrong?
Another attempt I took was to directly access the clusters-20-final/part-m-00000 file, as is done in listing 7.2 of Mahout in Action. It turns out it doesn't contain WeightedVectorWritable but ClusterWritable, from which I can get the Cluster instance but not any actual contained Vector.

A bit late, but this might help someone somewhere, sometime.
When running
KMeansDriver.run(input, clustersIn, outputPath, measure, convergenceDelta, maxIterations, true, 0.0, false);
one of the outputs is a directory called clusteredPoints. It contains a part file with all the clustered vectors, keyed by cluster ID. This means that something like this
IntWritable key = new IntWritable();
WeightedVectorWritable value = new WeightedVectorWritable();
Path clusteredPoints = new Path(output + "/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000");
FileSystem fs = FileSystem.get(clusteredPoints.toUri(), new Configuration());
try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, clusteredPoints, fs.getConf())) {
    while (reader.next(key, value)) {
        // key is the cluster ID; value wraps the clustered vector.
        // If the input vectors were NamedVectors, the original name is recoverable here.
        String vectorName = ((NamedVector) value.getVector()).getName();
        // Do something useful with key.get() and vectorName here
    }
}
seems to do the trick. Using something like this, I was able to get a good sense of what was clustered where when running my tests with k-means clustering and Mahout.
I was using Mahout 0.8 when I did this.
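If you want the names grouped per cluster (which is what the question asks for), here is a minimal sketch along the same lines, written in Scala against the same Hadoop/Mahout classes. The clusteredPoints path is a placeholder, and the package of WeightedVectorWritable may differ between Mahout versions, so treat those details as assumptions:

import scala.collection.mutable
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{IntWritable, SequenceFile}
import org.apache.mahout.clustering.classify.WeightedVectorWritable // package location may vary by Mahout version
import org.apache.mahout.math.NamedVector

val clusteredPoints = new Path("/clustering/out/clusteredPoints/part-m-00000") // placeholder path
val conf = new Configuration()
val fs = FileSystem.get(clusteredPoints.toUri, conf)

val key = new IntWritable()
val value = new WeightedVectorWritable()
val namesByCluster = mutable.Map.empty[Int, mutable.Buffer[String]]

val reader = new SequenceFile.Reader(fs, clusteredPoints, conf)
try {
  while (reader.next(key, value)) {
    value.getVector match {
      case nv: NamedVector =>
        // group the vector's name under its cluster ID
        namesByCluster.getOrElseUpdate(key.get, mutable.Buffer.empty[String]) += nv.getName
      case _ => // the vector was not written as a NamedVector
    }
  }
} finally {
  reader.close()
}

namesByCluster.foreach { case (clusterId, names) =>
  println("cluster " + clusterId + ": " + names.mkString(", "))
}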

(A really late answer, but since I just spent a day figuring this out, I thought I would share it.)
What you are missing is the dictionary that maps each vector dimension name to its index.
This dictionary is used by clusterdump to give you the names of the different dimensions in the vector.
When you run clusterdump, you can specify two additional flags:
-d: dictionary file
-dt: type of the dictionary file (text|sequencefile)
Here is a sample invocation:
mahout clusterdump -i clusteringExperiment/exp1/initialCentroids/clusters-0-final -d clusteringExperiment/dictionary/vectorDimensions -dt sequencefile
and your output will look something like:
VL-0{n=185 c=[A:0.006, G:0.550, M:0.011, O:0.026, S:0.000, T:0.072, U:0.096, V:0.010] r=[A:0.029, G:0.176, M:0.043, O:0.054, S:0.001, T:0.098, U:0.113, V:0.035]}
Note that the dictionary is a simple key-value file, where the key is the category name (a string) and the value is the numerical index.
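If you built your vectors yourself (rather than with seq2sparse, which writes a dictionary for you), one way to produce such a dictionary is to write a Hadoop SequenceFile of Text keys (dimension names) and IntWritable values (indices). The sketch below (in Scala, but the same Hadoop calls work from Java) is only an illustration of that idea; the output path and the dimension names are placeholders taken from the sample output above:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{IntWritable, SequenceFile, Text}

val conf = new Configuration()
val dictPath = new Path("clusteringExperiment/dictionary/vectorDimensions") // placeholder path
val fs = FileSystem.get(dictPath.toUri, conf)

// Dimension name -> index, in the same order the vectors were built
val dimensions = Seq("A", "G", "M", "O", "S", "T", "U", "V")

val writer = SequenceFile.createWriter(fs, conf, dictPath, classOf[Text], classOf[IntWritable])
try {
  dimensions.zipWithIndex.foreach { case (name, idx) =>
    writer.append(new Text(name), new IntWritable(idx))
  }
} finally {
  writer.close()
}

Point this file at clusterdump with -d and -dt sequencefile as in the invocation above.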

Related

Pedersen circom/circomlibjs inconsistency?

As a unit test for a larger use case, I am checking that the Pedersen hash I compute in the frontend matches the hash produced by a circom circuit. I feed both the unhashed and hashed values to the circuit, recreate the hash inside it, and use a simple assert to make sure it goes through when generating a witness.
I am running the Pedersen hash in my frontend using circomlibjs; the circuit contains a simple assert that checks whether my frontend result lines up with the Pedersen hash computed in circom.
The circuit I am using:
include "../node_modules/circomlib/circuits/bitify.circom";
include "../node_modules/circomlib/circuits/pedersen.circom";
template check() {
signal input unhashed;
signal input hashed;
signal output createdHash[2];
component hasher = Pedersen(256);
component unhashedBits = Num2Bits(256);
unhashedBits.in <== unhashed;
for (var i = 0; i < 256; i++){
hasher.in[i] <== unhashedBits.out[i];
}
createdHash[0] <== hasher.out[0];
createdHash[1] <== hasher.out[1];
hashed === createdHash[1];
}
component main = check();
In the frontend, I am running the following,
import { buildPedersenHash } from 'circomlibjs';
export function buff2hex(buff) {
    function i2hex(i) {
        return ('0' + i.toString(16)).slice(-2);
    }
    return '0x' + Array.from(buff).map(i2hex).join('');
}
const secret = (new TextEncoder(32)).encode("Hello");
var pedersen = await buildPedersenHash();
var h = pedersen.hash(secret);
console.log(buff2hex(secret));
console.log(buff2hex(h));
The values that are printed are:
0x48656c6c6f
0x0e90d7d613ab8b5ea7f4f8bc537db6bb0fa2e5e97bbac1c1f609ef9e6a35fd8b
Which are consistent with the test done here.
So I then create an input.json file which looks as follows,
{
    "unhashed": "0x48656c6c6f",
    "hashed": "0x0e90d7d613ab8b5ea7f4f8bc537db6bb0fa2e5e97bbac1c1f609ef9e6a35fd8b"
}
And lastly run the following script to create a witness, in the hopes that the assert will go through.
# Compile the circuit
circom ${CIRCUIT}.circom --r1cs --wasm --sym --c
# Generate the witness.wtns
node ${CIRCUIT}_js/generate_witness.js ${CIRCUIT}_js/${CIRCUIT}.wasm input.json ${CIRCUIT}_js/witness.wtns
However, I keep getting an assert error:
Error: Error: Assert Failed.
Error in template check_11 line: 26
This points to the assert in the circuit, so I assume there is an inconsistency in the hash.
I am new to circom so any insights would be greatly appreciated!
For anyone who stumbles across this, it turns out the cause of the issue is endianness. The issue was fixed by converting the unhashed value to little-endian in the input. I am not sure where exactly the problem is, but it seems the hasher on the frontend reads it as big-endian while the circuit expects the input in little-endian (or vice versa).
As I have managed to patch up a fix for the moment, I will stop investigating, but I encourage anyone who understands this better to give a fuller explanation.

Neo4j: Converting REST call output to JSON

I have a requirement to convert the output of a Cypher query into JSON.
Here is my code snippet.
RestCypherQueryEngine rcqer = new RestCypherQueryEngine(restapi);
String nodeN = "MATCH n=(Company) WITH COLLECT(n) AS paths RETURN EXTRACT(k IN paths | LAST(nodes(k))) as lastNode";
final QueryResult<Map<String,Object>> queryResult = rcqer.query(nodeN);
for (Map<String,Object> row : queryResult) {
    System.out.println((ArrayList) row.get("lastNode"));
}
Output:
[http://XXX.YY6.192.103:7474/db/data/node/445, http://XXX.YY6.192.103:7474/db/data/node/446, http://XXX.YY6.192.103:7474/db/data/node/447, http://XXX.YY6.192.103:7474/db/data/node/448, http://XXX.YY6.192.103:7474/db/data/node/449, http://XXX.YY6.192.103:7474/db/data/node/450, http://XXX.YY6.192.103:7474/db/data/node/451, http://XXX.YY6.192.103:7474/db/data/node/452, http://XXX.YY6.192.103:7474/db/data/node/453]
I am not able to see the actual data (I am getting URLs). I am pretty sure I am missing something here.
I would also like to convert the output to JSON.
The cypher works in my browser interface.
I looked at various articles around this:
Java neo4j, REST and memory
Neo4j Cypher: How to iterate over ExecutionResult result
Converting ExecutionResult object to json
The last two make use of an embedded database, which may not be possible in my scenario (the Neo4j instance is hosted in another cloud, hence the use of REST).
Thanks.
Try to understand what you're doing; your query does not make sense at all.
Perhaps you should re-visit the online course for Cypher: http://neo4j.com/online-course
Instead of:
MATCH n=(Company) WITH COLLECT(n) AS paths RETURN EXTRACT(k IN paths | LAST(nodes(k))) as lastNode
you can just do:
MATCH (c:Company) RETURN c
RestCypherQueryEngine rcqer = new RestCypherQueryEngine(restapi);
final QueryResult<Map<String,Object>> queryResult = rcqer.query(query);
for (Node node : queryResult.to(Node.class)) {
    for (String prop : node.getPropertyKeys()) {
        System.out.println(prop + " " + node.getProperty(prop));
    }
}
I think it's better to use the JDBC driver for what you're trying to do, and also to actually return the properties you want to convert to JSON.
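As a rough illustration of the JDBC route, here is a sketch in Scala (the same calls work from Java). The driver artifact and the exact JDBC URL format depend on the version of the Neo4j JDBC driver, so treat those as assumptions and check the driver's documentation:

import java.sql.DriverManager

// Assumed URL format for the Neo4j JDBC driver; adjust host/port/format for your driver version
val conn = DriverManager.getConnection("jdbc:neo4j://localhost:7474/")
try {
  val stmt = conn.createStatement()
  val rs = stmt.executeQuery("MATCH (c:Company) RETURN c.name AS name")
  while (rs.next()) {
    // each row now exposes actual property values instead of node URLs
    println(rs.getString("name"))
  }
} finally {
  conn.close()
}

From the ResultSet it is straightforward to build a list of maps and serialize it to JSON with whichever JSON library you already use.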

How to bundle many files in S3 using Spark

I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text containing between 0 (empty) and 100KB of text (95th percentile, although there are a few files that are up to several MBs).
Using Spark and Scala (I'm new to both and want to learn), I would like to save "daily bundles" (8000 of them), each containing whatever number of files were found for that day. Ideally I would like to store the original filenames as well as their content. The output should reside in S3 as well and be compressed, in some format that is suitable for input in further Spark steps and experiments.
One idea was to store bundles as a bunch of JSON objects (one per line and '\n'-separated), e.g.
{id:"doc0001", meta:{x:"blah", y:"foo", ...}, content:"some long string here"}
{id:"doc0002", meta:{x:"foo", y:"bar", ...}, content: "another long string"}
Alternatively, I could try the Hadoop SequenceFile, but again I'm not sure how to set that up elegantly.
Using the Spark shell, for example, I saw that it is very easy to read the files:
val textFile = sc.textFile("s3n://mybucket/path/txt/1996/04/09/*.txt.gz")
// or even
val textFile = sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz")
// which will take for ever
But how do I "intercept" the reader to provide the file name?
Or perhaps I should get an RDD of all the files, split by day, and in a reduce step write out K=filename, V=fileContent?
You can use the following approach.
First, get a Buffer/List of the S3 paths:
import scala.collection.JavaConverters._
import java.util.ArrayList
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import com.amazonaws.services.s3.model.S3ObjectSummary
import com.amazonaws.services.s3.model.ListObjectsRequest

def listFiles(s3_bucket: String, base_prefix: String) = {
  var files = new ArrayList[String]

  // S3 client and list-objects request
  var s3Client = new AmazonS3Client()
  var objectListing: ObjectListing = null
  var listObjectsRequest = new ListObjectsRequest()

  // Your S3 bucket
  listObjectsRequest.setBucketName(s3_bucket)
  // Your folder path or prefix
  listObjectsRequest.setPrefix(base_prefix)

  // Adding s3:// to the paths and adding them to the list
  do {
    objectListing = s3Client.listObjects(listObjectsRequest)
    for (objectSummary <- objectListing.getObjectSummaries().asScala) {
      files.add("s3://" + s3_bucket + "/" + objectSummary.getKey())
    }
    listObjectsRequest.setMarker(objectListing.getNextMarker())
  } while (objectListing.isTruncated())

  // Removing the base directory name
  files.remove(0)

  // Returning a Scala collection
  files.asScala
}
Now pass this list to the following piece of code. Note: sc here is the SparkContext, and the loop builds up a plain RDD[String]:
import org.apache.spark.rdd.RDD

var df: RDD[String] = null
for (file <- files) {
  val fileRdd = sc.textFile(file)
  if (df != null) {
    df = df.union(fileRdd)
  } else {
    df = fileRdd
  }
}
Now you have a final unified RDD, i.e. df.
Optionally, you can also repartition it into a single partition:
df = df.repartition(1)
Repartitioning always works :D
Have you tried something along the lines of sc.wholeTextFiles?
It creates an RDD where the key is the filename and the value is the content of the whole file as a string. You can then map this so the key is the file date, and then groupByKey?
http://spark.apache.org/docs/latest/programming-guide.html
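A minimal sketch of that idea, assuming the path layout described in the question (the glob, the date-extraction regex, and the output handling are illustrative placeholders, not something tested at the 20-million-file scale):

// Read (path, content) pairs; the glob follows the layout from the question
val docs = sc.wholeTextFiles("s3n://mybucket/path/txt/1996/*/*/*.txt.gz")

// Pull the YYYY/MM/DD portion out of each path to use as the bundle key
val datePattern = """.*/txt/(\d{4})/(\d{2})/(\d{2})/.*""".r
val byDay = docs.map { case (path, content) =>
  val datePattern(y, m, d) = path // assumes every path matches the layout
  (y + "-" + m + "-" + d, (path, content))
}.groupByKey()

// byDay is an RDD[(String, Iterable[(filename, content)])], one entry per day;
// from here each day's bundle can be serialized (JSON lines, SequenceFile, ...) and written back to S3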
At your scale, an elegant solution would be a stretch.
I would recommend against using sc.textFile("s3n://mybucket/path/txt/*/*/*/*.txt.gz") as it takes forever. What you can do is use AWS DistCp or something similar to move the files into HDFS. Once it's in HDFS, Spark is quite fast at ingesting the information in whatever way suits you.
Note that most of these processes require some sort of file list, so you'll need to generate one somehow. For 20 million files, creating this file list will be a bottleneck. I'd recommend maintaining a file that gets appended with the file path every time a file is uploaded to S3.
The same goes for output: write to HDFS and then move to S3 (although a direct copy might be equally efficient).

Retrieving gdcm DataElement values as strings

I am basically trying to read out all or most attribute values from a DICOM file, using the gdcm C++ library. I am having a hard time getting out any non-string values. The gdcm examples generally assume I know the group/element numbers beforehand so I can use the Attribute template classes, but I have no need or interest in them; I just have to report all attribute names and values. Actually, the values should go into XML, so I need a string representation. What I currently have is something like:
for (gdcm::DataSet::ConstIterator it = ds.Begin(); it != ds.End(); ++it) {
    const gdcm::DataElement& elem = *it;
    if (elem.GetVR() != gdcm::VR::SQ) {
        const gdcm::Tag& tag = elem.GetTag();
        std::cout << dict.GetDictEntry(tag).GetKeyword() << ": ";
        std::cout << elem.GetValue() << "\n";
    }
}
It seems that for numeric values like UL the output is something like "Loaded:4", presumably meaning that the library has loaded 4 bytes of data (an unsigned long). This is not helpful at all; how do I get the actual value? I must certainly be overlooking something obvious.
From the examples it seems there is a gdcm::StringFilter class which is able to do that, but it seems to want to search for each element by itself in the DICOM file, which would make the algorithm's complexity quadratic; this is certainly something I would like to avoid.
TIA
Paavo
Have you looked at gdcmdump? You can use it to output the DICOM file as text or XML. You can also look at the source to see how it does this.
I ended up with extracting parts of gdcm::StringFilter::ToStringPair() into a separate function. Seems to work well for simpler DCM files at least...
You could also start by reading the FAQ, in particular How do I convert an attribute value to a string ?
As explained there, you simply need to use gdcm::StringFilter:
sf = gdcm.StringFilter()
sf.SetFile(r.GetFile())
print sf.ToStringPair(gdcm.Tag(0x0028,0x0010))
Try something like this:
gdcm::Reader reader;
reader.SetFileName( absSlicePath.c_str() );
if( !reader.Read() )
{
    return;
}
gdcm::File file = reader.GetFile();
gdcm::DataSet ds = file.GetDataSet();
std::stringstream strm;
strm << ds;
You get a stringstream containing all the DICOM tag-value pairs.
Actually, most of the DICOM classes (DataElement, DataSet, etc.) overload std::ostream& operator<<(std::ostream& os, const SomeClass& val). So you can just expand the for loop and use operator<< to put the values into the stringstream, and then into the string.
For example, if you are using Qt:
ui->pTagsTxt->append(QString(strm.str().c_str()));

Preferred way of processing this data with parallel arrays

Imagine a sequence of java.io.File objects. The sequence is not in any particular order, it gets populated after a directory traversal. The names of the files can be like this:
/some/file.bin
/some/other_file_x1.bin
/some/other_file_x2.bin
/some/other_file_x3.bin
/some/other_file_x4.bin
/some/other_file_x5.bin
...
/some/x_file_part1.bin
/some/x_file_part2.bin
/some/x_file_part3.bin
/some/x_file_part4.bin
/some/x_file_part5.bin
...
/some/x_file_part10.bin
Basically, I can have 3 types of files. First type is the simple ones, which only have a .bin extension. The second type of file is the one formed from _x1.bin till _x5.bin. And the third type of file can be formed of 10 smaller parts, from _part1 till _part10.
I know the naming may be strange, but this is what I have to work with :)
I want to group the files together (all the pieces of a file should be processed together), and I was thinking of using parallel arrays to do this. The thing I'm not sure about is how I can perform the reduce/accumulation part, since all the threads will be working on the same array.
val allBinFiles = allBins.toArray // array of java.io.File
I was thinking of handling something like that:
val mapAcumulator = java.util.Collections.synchronizedMap[String,ListBuffer[File]](new java.util.HashMap[String,ListBuffer[File]]())

allBinFiles.par.foreach { file =>
  file match {
    // for something like /some/x_file_x4.bin, nameTillPart will be /some/x_file
    case ComposedOf5Name(nameTillPart) => {
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    }
    case ComposedOf10Name(nameTillPart) => {
      mapAcumulator.getOrElseUpdate(nameTillPart, new ListBuffer[File]()) += file
    }
    // simple file, without any pieces
    case _ => {
      mapAcumulator.getOrElseUpdate(file.toString, new ListBuffer[File]()) += file
    }
  }
}
I was thinking of doing it as shown in the code above: having extractors for the files and using part of the path as the key in the map. For example, /some/x_file can hold as values /some/x_file_x1.bin to /some/x_file_x5.bin. I also think there could be a better way of handling this; I would be interested in your opinions.
The alternative is to use groupBy:
val mp = allBinFiles.par.groupBy {
  case ComposedOf5Name(x) => x
  case ComposedOf10Name(x) => x
  case f => f.toString
}
This will return a parallel map of parallel arrays of files (ParMap[String, ParArray[File]]). If you want a sequential map of sequential sequences of files from this point:
val sqmp = mp.map(_.seq).seq
To ensure that the parallelism kicks in, make sure you have enough elements in your parallel array (10k+).
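For reference, the ComposedOf5Name and ComposedOf10Name extractors used above are not defined in the question; a minimal regex-based sketch of what they might look like (the patterns are assumptions based on the naming scheme described) is:

import java.io.File

// matches paths like /some/other_file_x3.bin and yields /some/other_file
object ComposedOf5Name {
  private val Pattern = """(.+)_x([1-5])\.bin""".r
  def unapply(f: File): Option[String] = f.getPath match {
    case Pattern(prefix, _) => Some(prefix)
    case _ => None
  }
}

// matches paths like /some/x_file_part7.bin and yields /some/x_file
object ComposedOf10Name {
  private val Pattern = """(.+)_part(\d{1,2})\.bin""".r
  def unapply(f: File): Option[String] = f.getPath match {
    case Pattern(prefix, _) => Some(prefix)
    case _ => None
  }
}

With these in place, both the synchronized-map version and the groupBy version match on plain java.io.File values.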