Spark MLlib: train many GaussianMixture models in a distributed way (Scala)

I've been playing around with the Gaussian mixture models provided by Spark MLlib.
I found it really nice for fitting a GaussianMixture to an enormous number of vectors/points. However, that is not always the situation in ML. Very often you do not need one model built from a huge number of vectors, but a huge number of models, each built from a few vectors (e.g., a GMM for each user of a database with hundreds of users).
At this point I do not know how to proceed with MLlib, as I cannot see an easy way to distribute both by user and by data.
Example:
Let featuresByUser be an RDD[(User, List[Vector])].
The natural way to train a GMM for each user might be something like
featuresByUser.mapValues(
  feats => new GaussianMixture().setK(nGaussians).run(sc.parallelize(feats))
)
However, it is well known that this is forbidden in Spark. The inner sc.parallelize would run inside a task on an executor rather than on the driver, and the SparkContext is only usable on the driver, so this leads to an error.
So the questions are:
Should the MLlib methods accept Seq[Vector] as input in addition to RDD[Vector]? The programmer could then choose one or the other depending on the problem.
Is there any other workaround that I'm missing to deal with this case (using MLlib)?

Unfortunately, MLlib is currently meant to build only one model at a time, not many; this was confirmed at a recent Spark meetup in London.
What you can do is launch a separate job for each model, each from its own thread in the driver. This is described in the job scheduling documentation. So you would create one RDD per user and run a Gaussian mixture on each, triggering the action that starts each job from a separate thread.
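A minimal sketch of the threaded approach, written in Python so it can run without a cluster. The names train_models_concurrently and fit_one are hypothetical, and the stand-in at the bottom only computes a mean; with PySpark, fit_one would be a function that builds an RDD on the driver and runs MLlib on it, as noted in the docstring.

```python
from concurrent.futures import ThreadPoolExecutor


def train_models_concurrently(features_by_user, fit_one, max_workers=4):
    """Run fit_one(features) for every user, each call on its own driver thread.

    features_by_user: dict mapping user id -> list of feature vectors.
    fit_one: hypothetical callable that trains one model. With PySpark it could
             build an RDD on the driver and run MLlib on it, for example
             lambda feats: GaussianMixture.train(sc.parallelize(feats), k).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit() returns immediately, so the per-user jobs overlap.
        futures = {user: pool.submit(fit_one, feats)
                   for user, feats in features_by_user.items()}
        # result() blocks until that user's job is done and re-raises its errors.
        return {user: fut.result() for user, fut in futures.items()}


# Stand-in for a real training call, so the sketch runs without Spark.
def mean_of_first_dimension(feats):
    return sum(v[0] for v in feats) / len(feats)


models = train_models_concurrently(
    {"alice": [[1.0], [3.0]], "bob": [[10.0], [20.0], [30.0]]},
    mean_of_first_dimension,
)
print(models)
```

Jobs submitted from different threads are scheduled FIFO by default, so with hundreds of users it may be worth looking at the fair scheduler setting described in the same documentation.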
Another option, if the data for each user fits on a single instance, is to fit each user's Gaussian mixture with something other than MLlib. This approach was described at the meetup for a case where sklearn was used within PySpark to create multiple models. You'd do something like:
// getUsers, getDataForUser and buildGM are placeholders you would implement yourself
val users: List[Long] = getUsers
val models = sc.parallelize(users).map { user =>
  val userData = getDataForUser(user) // load this user's data inside the task
  buildGM(userData)                   // fit locally, without MLlib
}
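To make the "fit locally inside the task" idea concrete, here is a rough Python sketch. It is an assumption-laden stand-in, not what was shown at the meetup: fit_gmm_1d is a hypothetical, dependency-free EM for one-dimensional data playing the role of buildGM, and train_per_user uses a plain dict where a real job would call mapValues on an RDD of (user, values) pairs. In practice you would more likely call an existing library such as scikit-learn inside the task.

```python
import math


def fit_gmm_1d(xs, k=2, iterations=50):
    """Fit a k-component 1-D Gaussian mixture to a small list of floats with EM.

    Needs no RDD, so it could run inside a Spark task.
    Returns a list of (weight, mean, variance) tuples sorted by mean.
    """
    xs = sorted(xs)
    n = len(xs)
    # Deterministic start: spread the initial means over the sorted data.
    means = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n or 1.0
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return sorted(zip(weights, means, variances), key=lambda c: c[1])


def train_per_user(features_by_user, k=2):
    # Plain-Python stand-in for mapValues on an RDD of (user, values) pairs:
    # each user's fit is independent of every other user's.
    return {user: fit_gmm_1d(xs, k) for user, xs in features_by_user.items()}


models = train_per_user({
    "u1": [0.0, 0.1, -0.1, 0.2, 5.0, 5.1, 4.9, 5.2],
    "u2": [10.0, 10.2, 9.8, 20.0, 20.2, 19.8],
})
for user, components in models.items():
    print(user, components)
```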

Literature for Classification Problem with Changing Classes

I am currently looking at a text classification problem (say N classes) for which labeled training data exists. Occasionally, a new class is created, and some of the labels in the "old" training data become wrong because those examples should now carry the new class label. In other words, the new class recruits from the old classes.
We can assume that we have some new labeled data for the new class, or even that from an input stream of new data we eventually obtain the correct labels by human verification (the goal, however, is to require as few manual corrections as possible).
How can I set up a classifier that may face new "recruiting" classes from time to time? Are you aware of approaches/literature for the specific setting described above?
Perhaps basic strategies include
trying to relabel the training data and re-train,
using incremental classifiers (e.g., KNN).

Convert PySpark ML Word2Vec model to Gensim Word2Vec model

I've generated a PySpark Word2Vec model like so:
from pyspark.ml.feature import Word2Vec
w2v = Word2Vec(vectorSize=100, minCount=1, inputCol='words', outputCol = 'vector')
model = w2v.fit(df)
(The data I trained the model on isn't relevant; what's important is that it's all in the right format and successfully yields a pyspark.ml.feature.Word2VecModel object.)
Now I need to convert this model to a Gensim Word2Vec model. How would I go about this?
If you still have the training data, re-training the gensim Word2Vec model may be the most straightforward approach.
If you only need the word-vectors, perhaps PySpark's model can export them in the word2vec.c format that gensim can load with .load_word2vec_format().
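If no built-in export is available, writing the text variant of that format by hand is short. The sketch below is only an illustration: write_word2vec_text is a hypothetical helper, and it assumes words contain no spaces. The commented lines show how it might be wired to PySpark's getVectors() and gensim's KeyedVectors loader.

```python
import os
import tempfile


def write_word2vec_text(word_vectors, path):
    """Write (word, vector) pairs in the plain-text word2vec.c format.

    The first line is "<vocab_size> <dimensions>"; every following line is a
    word and its components separated by single spaces.
    """
    rows = [(word, [float(x) for x in vec]) for word, vec in word_vectors]
    dims = len(rows[0][1]) if rows else 0
    with open(path, "w", encoding="utf-8") as out:
        out.write(f"{len(rows)} {dims}\n")
        for word, vec in rows:
            if len(vec) != dims:
                raise ValueError(f"{word!r} has {len(vec)} dimensions, expected {dims}")
            out.write(word + " " + " ".join(repr(x) for x in vec) + "\n")


# Assumed wiring (not run here): collect the vectors from the PySpark model,
# write them out, then load them with gensim.
#   pairs = [(r["word"], r["vector"].toArray()) for r in model.getVectors().collect()]
#   write_word2vec_text(pairs, "vectors.txt")
#   kv = gensim.models.KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
write_word2vec_text([("cat", [0.1, 0.2]), ("dog", [0.3, 0.4])], path)
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

This gives you the word-vectors only; nothing else from the model survives, so it is not enough for continued training.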
The only reason to port the full model would be to continue training. Such incremental training, while possible, involves weighing a lot of tradeoffs in balancing the influence of the older and later training to get good results.
If you do want this conversion in order to train further in that manner, that again suggests that re-training on the original data to reproduce a similar model could be plausible.
But if you have to convert the model, the general approach would be to study the source code and internal data structures of the two models, to discover how each represents the key aspects of the model:
the known word-vectors (model.wv.vectors in gensim)
the known vocabulary of words, including stats about word frequencies and the position of individual words (model.wv.vocab in gensim)
the hidden-to-output weights of the model (model.trainables and its properties in gensim)
other model properties describing the model's modes & metaparameters
A reasonable interactive approach could be:
Write some acceptance tests that take models of both types, and test whether they are truly 'equivalent' for your purposes. (This is relatively easy for just checking if the vectors for individual words are present and identical, but nearly as hard as the conversion itself for verifying other ready-to-be-trained-more behaviors.)
Then, in an interactive notebook, load the source model, and also create a dummy gensim model with the same vocabulary size. Consulting the source code, write Python statements to iteratively copy/transform key properties over from the source into the target, repeatedly testing if they verify as equivalent.
When they do, take those steps you did manually and combine them into a utility method to do the conversion. Again verify its operation then try using the converted model however you'd hoped – perhaps discovering overlooked info or discovering other bugs in the process, and then improving the verification method and conversion method.
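As a starting point for the acceptance test in the first step, something like the following could cover the easy part (vectors present and identical). It is a sketch under assumptions: compare_word_vectors is a hypothetical helper, and both models are assumed to have already been reduced to plain dicts of word -> list of floats; it says nothing about the ready-to-train-more state.

```python
import math


def compare_word_vectors(source, target, tol=1e-6):
    """Return a list of mismatch descriptions; an empty list means they agree.

    source, target: dicts mapping each word to a sequence of floats.
    """
    problems = []
    for word in sorted(set(source) | set(target)):
        if word not in source:
            problems.append(f"{word}: missing from source")
        elif word not in target:
            problems.append(f"{word}: missing from target")
        elif len(source[word]) != len(target[word]):
            problems.append(f"{word}: dimensions differ")
        elif not all(math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
                     for a, b in zip(source[word], target[word])):
            problems.append(f"{word}: values differ")
    return problems


# Toy data standing in for vectors extracted from the two models.
spark_like = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
gensim_like = {"cat": [0.1, 0.2], "dog": [0.3, 0.5], "emu": [0.0, 0.0]}
print(compare_word_vectors(spark_like, spark_like))
print(compare_word_vectors(spark_like, gensim_like))
```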
It's possible that the PySpark model will be missing things the gensim model expects, which might require synthesizing workable replacement values.
Good luck! (But re-train the gensim model from the original data if you want things to just be straightforward and work.)

DeepLearning4J - Acquiring Data and Train Model

I am trying to create the simplest possible neural network and train it with some data.
For that I created a test.csv with the following pattern:
number,number+1;
number2,number2+1
...
I am trying to do a linear regression with the network...
But I cannot find a way to load the data; DataSetIterator does not work.
How do I fit the data, and how do I test it?
In our examples, we encourage people to use DataVec + RecordReaderDataSetIterator.
DataVec has all of the various data loading components.
I'm not sure what you mean about "datasetiterator not working" without seeing any code, but it seems like you didn't really look at our examples.
There are multiple examples of a CSV record reader you can use for both regression and classification use cases.
Consider reorienting your data pipeline to use those.
Those examples are always found here:
https://github.com/deeplearning4j/dl4j-examples
If you follow any of those, the same pattern emerges:
Record reader for whatever data format -> RecordReaderDataSetIterator
The iterator's constructors let you specify common options, such as whether the task is a regression and which column holds your label.

How to use Spark MlLib/Pipelines to build 1 model per each user [duplicate]

This question already has an answer here:
Run ML algorithm inside map function in Spark
(1 answer)
Closed 4 years ago.
I want to train a different model for each user in my dataset. Is there built-in support for that in Spark MLlib/Pipelines?
If not, what's the easiest/cleanest way to train multiple and separate models for each user?
Unfortunately, Spark ML doesn't provide built-in support for the "single model - single user" concept, but you can implement custom logic for it. I see two possible ways of solving this task.
The first is to follow the algorithm below (the details are only an example; your steps will differ, but the logic will be similar):
Obtain the training data for the specific user (e.g. read a CSV file from HDFS, S3, etc.).
Train a model on that user's dataset. For example, suppose your dataset has two columns, a specific criterion X and the user's productivity Y, where the latter varies by user group; you could train, for instance, a LinearRegression model to predict whether the user can do the work in time or not.
Next, save the trained model to disk under a name that depends on the user's id, group, etc.
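The three steps above can be sketched roughly as follows. This is plain Python rather than Spark ML, purely to show the shape of the loop: fit_line is a hypothetical closed-form least-squares fit standing in for LinearRegression, the (user_id, x, y) row layout and the model_<id>.json file naming are assumptions, and a real job would read from HDFS/S3 and save with the model's own writer.

```python
import json
import os
import tempfile
from collections import defaultdict


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0
    return {"intercept": mean_y - slope * mean_x, "slope": slope}


def train_and_save_per_user(rows, out_dir):
    # Step 1: gather each user's training data from (user_id, x, y) rows.
    by_user = defaultdict(list)
    for user_id, x, y in rows:
        by_user[user_id].append((x, y))
    paths = {}
    for user_id, points in by_user.items():
        # Step 2: train one model on this user's data only.
        model = fit_line(points)
        # Step 3: save it under a name derived from the user id.
        path = os.path.join(out_dir, f"model_{user_id}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(model, f)
        paths[user_id] = path
    return paths


rows = [("u1", 0, 1), ("u1", 1, 3), ("u1", 2, 5), ("u2", 0, 0), ("u2", 2, 1)]
paths = train_and_save_per_user(rows, tempfile.mkdtemp())
saved = {}
for user_id, path in paths.items():
    with open(path, encoding="utf-8") as f:
        saved[user_id] = json.load(f)
print(saved)
```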
The second approach is to train one model that is applicable to every user. You must choose the algorithm's options so that it does not depend on the user group; in other words, generalize the training to all user groups. In this case there is no "single model -> single user" separation. If the second variant is more complicated to implement on your dataset, follow the first approach.

Using ELKI on custom objects and making sense of results

I am trying to use ELKI's SLINK implementation of hierarchical clustering in my program.
I have a set of objects (of my own type) that need to be clustered. For that, I convert them to feature vectors before clustering.
This is how I currently got it to run and produce some result (code is in Scala):
val clusterer = new SLINK(CosineDistanceFunction.STATIC, 3)
val connection = new ArrayAdapterDatabaseConnection(featureVectors)
val database = new StaticArrayDatabase(connection, null)
database.initialize()
val result = clusterer.run(database).asInstanceOf[Clustering[_ <: Model]]
Now, the result is a Clustering that contains elements of type Model. I can output them, but I don't know how to make sense of this result, especially since SLINK returns models of type DendrogramModel which does not seem to be parametrizable.
Specifically, how can I link the results back to my original elements (the ones from which I created the variable featureVectors earlier)?
I assume I need to create some kind of custom model or somehow maintain some link to the original elements through initialization and execution of the algorithm to retrieve from the result. I cannot find where to get started on this though.
I am aware that embedding ELKI into one's own programs is discouraged. However, it seems that calling ELKI in some other way would not be any different: I need to cluster and map the results back to my objects during runtime of my program.
The DendrogramModel does not include the objects in the cluster. Models are additional metadata on the clusters.
Use the getIDs() method to access the members of a Cluster instance.