Using ELKI on custom objects and making sense of results - cluster-analysis

I am trying to use ELKI's SLINK implementation of hierarchical clustering in my program.
I have a set of objects (of my own type) that need to be clustered. For that, I convert them to feature vectors before clustering.
This is how I currently got it to run and produce some result (code is in Scala):
val clusterer = new SLINK(CosineDistanceFunction.STATIC, 3)
val connection = new ArrayAdapterDatabaseConnection(featureVectors)
val database = new StaticArrayDatabase(connection, null)
database.initialize()
val result = clusterer.run(database).asInstanceOf[Clustering[_ <: Model]]
Now, the result is a Clustering that contains elements of type Model. I can output them, but I don't know how to make sense of this result, especially since SLINK returns models of type DendrogramModel which does not seem to be parametrizable.
Specifically, how can I link the results back to my original elements (the ones from which I created the variable featureVectors earlier)?
I assume I need to create some kind of custom model or somehow maintain some link to the original elements through initialization and execution of the algorithm to retrieve from the result. I cannot find where to get started on this though.
I am aware that embedding ELKI into own programs is discouraged. However, it seems that calling ELKI in some other way would not be any different: I need to cluster and map the results back to my objects during runtime of my program.

The DendrogramModel does not include the objects in the cluster. Models are additional metadata on the clusters.
Use the getIDs() method to access the members of a Cluster instance.
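As a rough sketch (assuming an ELKI 0.7.x-style API under the de.lmu.ifi.dbs.elki packages, and the database and result values from your question): with ArrayAdapterDatabaseConnection and StaticArrayDatabase the DBIDs form a contiguous DBIDRange whose offsets correspond to positions in featureVectors, so you can map each cluster member back to your original objects by index.
import de.lmu.ifi.dbs.elki.data.`type`.TypeUtil
import de.lmu.ifi.dbs.elki.database.ids.DBIDRange
import scala.collection.JavaConverters._

val relation = database.getRelation(TypeUtil.NUMBER_VECTOR_FIELD)
val ids = relation.getDBIDs.asInstanceOf[DBIDRange]  // contiguous range for a static array database

for (cluster <- result.getAllClusters.asScala) {
  val iter = cluster.getIDs.iter()
  while (iter.valid()) {
    val index = ids.getOffset(iter)  // index into featureVectors, and thus into your original objects
    // e.g. myObjects(index) is the original element for this cluster member
    iter.advance()
  }
}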

Related

How LGBM handles categorical features without specification

I am playing with LGBM and indexed my categorical features using StringIndexer, but after that I haven't told my model which features are categorical. So I am wondering how it knows which features are the categorical ones.
Here is how I init my LGBM model.
val lgbm = new LightGBMClassifier("lgbm")
  .setObjective("binary")
  .setFeatureFraction(0.85)
  .setFeaturesCol("features")
  .setLabelCol("is_booker")
If you are using mmlspark (you didn't mention how you're using LightGBM in Scala), LightGBM automatically figures out which columns should be treated as categorical, based on the attributes of the columns.
From Azure/mmlspark#559:
...if you use string indexer or our value indexer, categorical metadata will be automatically added to the dataframe and lightgbm will actually be able to interpret it and treat those columns as categoricals by splitting on the feature values directly (so you won't need to one-hot-encode them)
The method that accomplishes that is called LightGBMUtils.getCategoricalIndexes(), and you can find it at https://github.com/Azure/mmlspark/blob/95c1f8a782191e3578587a49313e1d57abee5da3/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMUtils.scala#L74-L104.
That method is re-used by LightGBMBase.getCategoricalIndexes() during training:
definition: https://github.com/Azure/mmlspark/blob/96f0b7775629d6e7b521d1ed8ca0e54655deef00/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBase.scala#L149-L159
use: https://github.com/Azure/mmlspark/blob/96f0b7775629d6e7b521d1ed8ca0e54655deef00/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBase.scala#L211.
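As a rough sketch of the whole pipeline (hedged: the column names are illustrative and the exact package of LightGBMClassifier varies between mmlspark versions), the point is that StringIndexer attaches nominal-attribute metadata to the indexed column, VectorAssembler carries that metadata into the features vector, and LightGBM reads the categorical indexes from there:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import com.microsoft.ml.spark.lightgbm.LightGBMClassifier

val indexer = new StringIndexer().setInputCol("country").setOutputCol("country_idx")
val assembler = new VectorAssembler()
  .setInputCols(Array("country_idx", "price"))
  .setOutputCol("features")
val lgbm = new LightGBMClassifier()
  .setObjective("binary")
  .setFeaturesCol("features")
  .setLabelCol("is_booker")

val pipeline = new Pipeline().setStages(Array(indexer, assembler, lgbm))
// val model = pipeline.fit(trainingDf)  // trainingDf with "country", "price", "is_booker" is hypothetical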
If I'm right that you're using mmlspark and you have further questions about how this works, I recommend opening issues in Azure/mmlspark.

Monitoring runtime use of concrete collections

Background:
Our Scala software consists of various components, developed by different teams, that pass Scala collections back and forth. The APIs usually use abstract collections such as Seq[T] and Set[T], and developers are currently essentially free to choose any implementation they like: e.g. when creating new instances, some go with List() or Vector(), others with Seq.empty.
Problem:
Different implementations have different performance characteristics, e.g. List might have been a good choice locally (for one component) because the collection is only sequentially iterated over or modified at the head, but it could have been a poor choice globally, because another component performs loads of random accesses.
Question:
Are there any tools — ideally Scala-specific, but JVM-general might also be OK — that can monitor runtime use of collections and record the information necessary to detect and report undesirable access/usage patterns of collections?
My feeling is that runtime monitoring would be more fruitful than static analysis (including simple linting) because (i) statically detecting usage patterns in hot code is virtually impossible, and (ii) static analysis would most likely miss collections that are created internally, e.g. when performing complex filter/map/fold/etc. operations on immutable collections.
Edits/Clarifications:
Changing the interfaces to enforce specific types such as List isn't an option; it would also not prevent purely internal use of "wrong" collections/usage patterns.
The goal is identifying a globally optimal (over many runs of the software) collection type rather than locally optimising for each applied algorithm.
You don't need linting for this, let alone runtime monitoring. This is exactly what having a strictly-typed language does for you out of the box. If you want to ensure a particular collection type is passed to the API, just declare the API as accepting that collection type (e.g., def foo(x: Stream[Bar]), not def foo(x: Seq[Bar]), etc.).
Alternatively, when practical, just convert to the desired type as part of implementation: def foo(x: List[Bar]) = { val y = x.toArray ; lotsOfRandomAccess(y); }
Collections that are "internally created" are typically the same type as the parent object: List(1,2,3).map(_ + 1) returns a List etc.
Again, if you want to ensure you are using a particular type, just say so:
val mapped: List[Int] = List(1,2,3).map(_ + 1)
You can actually change the type this way if there is a need for it (this uses scala.collection.breakOut, available up to Scala 2.12):
val mappedStream: Stream[Int] = List(1,2,3).map(_ + 1)(breakOut)
As discussed in the comments, this is a problem that needs to be solved at a local level rather than via global optimisation.
Each algorithm in the system will work best with a particular data type, so using a single global structure will never be optimal. Instead, each algorithm should ensure that the incoming data is in a format that can be processed efficiently. If it is not in the right format, the data should be converted to a better format as the first part of the process. Since the algorithm works better on the right format, this conversion is always a performance improvement.
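As a minimal sketch of that "convert on entry" idea (the method and its body are purely illustrative): the API keeps the abstract Seq[T], and the algorithm normalises to an indexed representation before doing heavy random access; Seq#toVector is cheap when the caller already passed a Vector.
def randomAccessHeavy(xs: Seq[Double]): Double = {
  val v = xs.toVector  // effectively free if xs is already a Vector
  var sum = 0.0
  var i = 0
  while (i < v.length) {  // lots of effectively constant-time indexed lookups
    sum += v(v.length - 1 - i)
    i += 1
  }
  sum
}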
The output data format is more of a problem if the system does not know which algorithm will be used next. The solution is to use the most efficient output format for the algorithm in question, and rely on other algorithms to re-format the data if required.
If you do want to monitor the whole system, it would be better to track the algorithms rather than the collections. If you monitor which algorithms are called and in which order you can create multiple traces through the code. You can then play back those traces with different algorithms and data structures to see which is the most efficient configuration.

Apache Spark - Implementing a distributed QuadTree

I am really, really new to Apache Spark.
I am working on implementing Approximate LOCI (or ALOCI), an anomaly detection algorithm, in a distributed way on Spark. This algorithm is based on storing points in a QuadTree that is used to find a point's number of neighbors.
I know exactly how QuadTrees work. In fact, I have implemented such a structure in Java recently. But I am completely lost as far as it concerns the way that such a structure can work in a distributed way over Spark.
Something similar to what I need can be found in Geospark.
https://github.com/DataSystemsLab/GeoSpark/tree/b2b6f1d7f0015d5c9d663a7b28d5e1bb1043c413/core/src/main/java/org/datasyslab/geospark/spatialPartitioning/quadtree
In many cases GeoSpark uses a PointRDD class that extends a SpatialRDD class which, as far as I can see, uses the QuadTree from the link above to partition the spatial objects. That is what I understood, at least in theory.
In practice, I still cannot figure this out. Let's say for example that I have millions of records in a csv and I want to read and load them in a QuadTree.
I could read the csv into an RDD, but then what? How does this RDD logically connect to the QuadTree I am trying to build?
Of course, I don't expect a working solution here. I just need the logic here to fill the gap in my mind. How do I implement a distributed QuadTree and how do I use it?
Ok, sadly there are no answers to this, but here I am two weeks later with a working solution. Not 100% sure if it is the right approach here, though.
I created a class named Element and turned each line of my csv into an RDD[Element]. I then created a serializable class named QuadNode which has a List[Element] and an Array[String] of size 4. When elements are added to a node, they are added to the node's List. If the list gets more than X elements (20 in my case), the node breaks into 4 children and the elements are sent to the children. Finally, I created a class QuadTree which has an RDD[QuadNode] among its other properties. Every time a node breaks into children, these child nodes are added to the tree's RDD.
In a non-functional language, each node would have 4 pointers, one for each child. Since we are in a distributed environment, this approach would not work. So I gave each node a unique id. The root node has id "0". The root's children have ids "00", "01", "02" and "03". The children of node "00" have ids "000", "001", "002" and "003". This way, if we want to find all descendants of a node, we filter our tree's RDD[QuadNode] by checking whether each node's id starts with our node's id. Reversing this logic helps us find a node's parent node.
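A rough sketch of that id-based navigation (the class shapes are illustrative, not my exact code):
import org.apache.spark.rdd.RDD

case class Element(x: Double, y: Double)
case class QuadNode(id: String, elements: List[Element])  // id "0" = root, "00".."03" = its children, ...

def descendants(nodes: RDD[QuadNode], nodeId: String): RDD[QuadNode] =
  nodes.filter(n => n.id != nodeId && n.id.startsWith(nodeId))

def parent(nodes: RDD[QuadNode], nodeId: String): RDD[QuadNode] =
  nodes.filter(_.id == nodeId.dropRight(1))  // the parent id is the child id without its last digit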
This is how I implemented my QuadTree, at least for now. If someone knows a better way of implementing this I would love to hear his/her opinion.

Object cache on Spark executors

A good question for Spark experts.
I am processing data in a map operation (RDD). Within the mapper function, I need to lookup objects of class A to be used in processing of elements in an RDD.
Since this will be performed on executors AND creation of elements of type A (that will be looked up) happens to be an expensive operation, I want to pre-load and cache these objects on each executor. What is the best way of doing it?
One idea is to broadcast a lookup table, but class A is not serializable (no control over its implementation).
Another idea is to load them up in a singleton object. However, I want to control what gets loaded into that lookup table (e.g. possibly different data on different Spark jobs).
Ideally, I want to specify what will be loaded on executors once (including the case of Streaming, so that the lookup table stays in memory between batches), through a parameter that will be available on the driver during its start-up, before any data gets processed.
Is there a clean and elegant way of doing it or is it impossible to achieve?
This is exactly the targeted use case for broadcast. Broadcast variables are transmitted once, distributed efficiently to all executors via a torrent-like protocol, and stay in memory / on local disk until you no longer need them.
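In the simplest case, where the lookup objects are serializable, that is just (lookupTable, fn and rdd standing in for your own values):
val lookupBc = sc.broadcast(lookupTable)  // shipped once per executor, not once per task
val result = rdd.map(elem => fn(lookupBc.value, elem))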
Serialization often pops up as an issue when using others' interfaces. If you can enforce that the objects you consume are serializable, that's going to be the best solution. If this is impossible, your life gets a little more complicated. If you can't serialize the A objects, then you have to create them on the executors for each task. If they're stored in a file somewhere, this would look something like:
rdd.mapPartitions { it =>
  // the lookup table is loaded once per partition, i.e. once per task
  val lookupTable = loadLookupTable(path)
  it.map(elem => fn(lookupTable, elem))
}
Note that if you're using this model, then you have to load the lookup table once per task -- you can't benefit from the cross-task persistence of broadcast variables.
EDIT: Here's another model, which I believe lets you share the lookup table across tasks per JVM.
class BroadcastableLookupTable[A] extends Serializable {
  // transient, so nothing substantial is serialized; rebuilt lazily on each executor JVM
  @transient private var lookupTable: LookupTable[A] = null

  def get: LookupTable[A] = synchronized {
    if (lookupTable == null)
      lookupTable = ??? // <load lookup table from disk>
    lookupTable
  }
}
This class can be broadcast (nothing substantial is transmitted) and the first time it's called per JVM, you'll load the lookup table and return it.
In case serialisation turns out to be impossible, how about storing the lookup objects in a database? It's not the easiest solution, granted, but it should work just fine. I could recommend checking out e.g. spark-redis, but I am sure there are better solutions out there.
Since A is not serializable, the easiest solution is to create your own serializable type A1 with all the data from A required for the computation. Then use the new lookup table in a broadcast.
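A rough sketch of that approach (A1's fields, originalAs, fn and rdd are illustrative placeholders):
case class A1(key: String, weight: Double)  // serializable mirror of the parts of A needed on executors

// on the driver: extract the needed data from the non-serializable A instances, then broadcast it
val lookup: Map[String, A1] = originalAs.map(a => a.key -> A1(a.key, a.weight)).toMap
val lookupBc = sc.broadcast(lookup)

val processed = rdd.map(elem => fn(lookupBc.value, elem))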

Spark/Mllib Train many GaussianMixture models in a distributed way

I've been playing around with the Gaussian Mixture Models provided for spark/mllib.
I found it really nice to generate a GaussianMixture from an enormous number of vectors/points. However, this is not always what you need in ML. Very often you do not need to fit one model to countless vectors, but rather countless models, each one from only a few vectors (e.g., building a GMM for each user of a database with hundreds of users).
At this point, I do not know how to proceed with MLlib, as I cannot see an easy way to distribute both by user and by data.
Example:
Let featuresByUser be an RDD[(user, List[Vector])];
the natural way to train a GMM for each user might be something like
featuresByUser.mapValues(
  feats => new GaussianMixture().setK(nGaussians).run(sc.parallelize(feats))
)
However, it is well known that this is forbidden in Spark: the sc.parallelize inside the closure does not run on the driver, so it leads to an error.
So the question is: should the MLlib methods accept Seq[Vector] as input in addition to RDD[Vector], so that the programmer could choose one or the other depending on the problem?
Is there any other workaround that I'm missing to deal with this case (using mllib)?
MLlib is unfortunately currently not meant to create many models, but only one at a time, which was confirmed at a recent Spark meetup in London.
What you can do is launch a separate job for each model in a separate thread in the driver. This is described in the job scheduling documentation. So you would create one RDD per user and run a Gaussian mixture on each, triggering the action that starts each job from a separate thread.
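A rough sketch of that idea (hedged: featuresByUser and nGaussians are from the question, the per-user data is assumed small enough to collect to the driver, and a parallel collection is just one simple way to get concurrent driver threads):
import org.apache.spark.mllib.clustering.GaussianMixture

val perUser = featuresByUser.collect()  // Array[(user, List[Vector])], assumed to fit on the driver
val models = perUser.toSeq.par.map { case (user, feats) =>
  user -> new GaussianMixture().setK(nGaussians).run(sc.parallelize(feats))  // one Spark job per user
}.toList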
Another option, if the amount of data per user fits on one instance, is to fit a Gaussian mixture per user with something other than MLlib. This approach was described at the meetup for a case where sklearn was used within PySpark to create multiple models. You'd do something like:
val users: List[Long] = getUsers
val models = sc.parallelize(users).map(user => {
  val userData = getDataForUser(user)
  buildGM(userData)
})