How LGBM handles categorical features without specification - Scala

I am playing with LGBM and indexed my categorical features using StringIndexer, but after that I haven't told my model which features are categorical. So I am wondering how it knows which features are the categorical ones.
Here is how I init my LGBM model.
val lgbm = new LightGBMClassifier("lgbm")
  .setObjective("binary")
  .setFeatureFraction(0.85)
  .setFeaturesCol("features")
  .setLabelCol("is_booker")

If you are using mmlspark (you didn't mention how you're using LightGBM in Scala), LightGBM automatically figures out which columns should be treated as categorical, based on the attributes of the columns.
From Azure/mmlspark#559:
...if you use string indexer or our value indexer, categorical metadata will be automatically added to the dataframe and lightgbm will actually be able to interpret it and treat those columns as categoricals by splitting on the feature values directly (so you won't need to one-hot-encode them)
The method that accomplishes that is called LightGBMUtils.getCategoricalIndexes(), and you can find it at https://github.com/Azure/mmlspark/blob/95c1f8a782191e3578587a49313e1d57abee5da3/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMUtils.scala#L74-L104.
That method is re-used by LightGBMBase.getCategoricalIndexes() during training:
definition: https://github.com/Azure/mmlspark/blob/96f0b7775629d6e7b521d1ed8ca0e54655deef00/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBase.scala#L149-L159
use: https://github.com/Azure/mmlspark/blob/96f0b7775629d6e7b521d1ed8ca0e54655deef00/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBase.scala#L211.
If I'm right that you're using mmlspark and you have further questions about how this works, I recommend opening issues in Azure/mmlspark.
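For reference, here is a minimal sketch of how that metadata ends up on the features column. It is not your exact pipeline: the "city", "price" and "stay_length" columns, the trainingDf DataFrame and the import path are assumptions (the package path in particular varies across mmlspark versions). StringIndexer attaches nominal-attribute metadata to its output column, VectorAssembler carries that metadata into the assembled vector, and LightGBM then picks it up via getCategoricalIndexes():

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import com.microsoft.ml.spark.lightgbm.LightGBMClassifier  // package path differs by mmlspark version

val cityIndexer = new StringIndexer()
  .setInputCol("city")
  .setOutputCol("city_idx")          // output column carries nominal (categorical) metadata

val assembler = new VectorAssembler()
  .setInputCols(Array("city_idx", "price", "stay_length"))
  .setOutputCol("features")          // the metadata survives assembly into the vector

val lgbm = new LightGBMClassifier()
  .setObjective("binary")
  .setFeatureFraction(0.85)
  .setFeaturesCol("features")
  .setLabelCol("is_booker")

val model = new Pipeline()
  .setStages(Array(cityIndexer, assembler, lgbm))
  .fit(trainingDf)                   // trainingDf: your raw input DataFrame

Note that nothing here explicitly tells LightGBM which slots are categorical; it reads that from the attribute metadata attached to the "features" column.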

Related

Literature for Classification Problem with Changing Classes

I am currently looking at a text classification problem (say N classes), for which labeled training data exists. Now, occasionally, a new class is created and some of the labels in the "old" training data become wrong because they should now carry the new class label. So the new class recruits from the old classes.
We can assume that we have some new labeled data for the new class, or even that from an input stream of new data we eventually obtain the correct labels by human verification (the goal, however, is to require as few manual corrections as possible).
How to set up a classifier that may face new "recruiting" classes from time to time? Are you aware of approaches/literature for the specific setting described above?
Basic strategies might include
trying to relabel the training data and re-train,
using incremental classifiers (e.g., KNN)

Convert PySpark ML Word2Vec model to Gensim Word2Vec model

I've generated a PySpark Word2Vec model like so:
from pyspark.ml.feature import Word2Vec
w2v = Word2Vec(vectorSize=100, minCount=1, inputCol='words', outputCol = 'vector')
model = w2v.fit(df)
(The data that I used to train the model on isn't relevant; what's important is that it's all in the right format and successfully yields a pyspark.ml.feature.Word2VecModel object.)
Now I need to convert this model to a Gensim Word2Vec model. How would I go about this?
If you still have the training data, re-training the gensim Word2Vec model may be the most straightforward approach.
If you only need the word-vectors, perhaps PySpark's model can export them in the word2vec.c format that gensim can load with .load_word2vec_format().
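As a sketch of that export option (shown here with the Scala Spark ML API, which mirrors PySpark's; the output path and the exportWord2VecText helper are my own inventions), you could dump model.getVectors as a plain-text word2vec.c file and then load it on the gensim side with KeyedVectors.load_word2vec_format(path, binary=False):

import java.io.PrintWriter
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.ml.linalg.Vector

def exportWord2VecText(model: Word2VecModel, path: String): Unit = {
  // getVectors is a DataFrame with a "word" column and a "vector" column
  val rows = model.getVectors.collect()
  val dim = rows.head.getAs[Vector]("vector").size
  val out = new PrintWriter(path)
  try {
    out.println(s"${rows.length} $dim")                    // header: vocabulary size, dimensionality
    rows.foreach { r =>
      val word = r.getAs[String]("word")
      val values = r.getAs[Vector]("vector").toArray.mkString(" ")
      out.println(s"$word $values")
    }
  } finally out.close()
}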
The only reason to port the model would be to continue training. Such incremental training, while possible, involves considering a lot of tradeoffs in balancing the influence of the older and later training to get good results.
If you are in fact wanting to do this conversion in order to do more training in such a manner, it again suggests that using the original training to reproduce a similar model could be plausible.
But, if you have to convert the model, the general approach would be to study the source code and internal data structures of the two models, to discover how each of them represents the key aspects of the model:
the known word-vectors (model.wv.vectors in gensim)
the known-vocabulary of words, including stats about word-frequencies and the position of individual words (model.wv.vocab in gensim)
the hidden-to-output weights of the model (model.trainables and its properties in gensim)
other model properties describing the model's modes & metaparameters
A reasonable interactive approach could be:
Write some acceptance tests that take models of both types, and test whether they are truly 'equivalent' for your purposes. (This is relatively easy for just checking if the vectors for individual words are present and identical, but nearly as hard as the conversion itself for verifying other ready-to-be-trained-more behaviors.)
Then, in an interactive notebook, load the source model, and also create a dummy gensim model with the same vocabulary size. Consulting the source code, write Python statements to iteratively copy/transform key properties over from the source into the target, repeatedly testing if they verify as equivalent.
When they do, take those steps you did manually and combine them into a utility method to do the conversion. Again verify its operation then try using the converted model however you'd hoped – perhaps discovering overlooked info or discovering other bugs in the process, and then improving the verification method and conversion method.
It's possible that the PySpark model will be missing things the gensim model expects, which might require synthesizing workable replacement values.
Good luck! (But re-train the gensim model from the original data if you want things to just be straightforward and work.)

DeepLearning4J - Acquiring Data and Train Model

I try to create the easiest of a NeuralNetwork and training it with some data:
Therefore I created a test.csv with the following pattern:
number,number+1;
number2,number2+1
...
I am trying to do a linear regression with the network, but I cannot find a way to load the data; DataSetIterator does not work.
How do I fit the data, and how do I test it?
In our examples, we encourage people to use DataVec + RecordReaderDataSetIterator.
DataVec has all of the various data loading components.
I'm not sure what you mean by "DataSetIterator not working" without seeing any code, but it seems like you didn't really look at our examples.
In there are multiple examples of a csv record reader you can use for both regression and classification use cases.
Consider reorienting your data pipeline to use those.
Those examples are always found here:
https://github.com/deeplearning4j/dl4j-examples
If you follow any of those, the same pattern emerges:
Record reader for whatever data format -> RecordReaderDataSetIterator
The iterator has constructors that let you specify common options, such as whether it is a regression or not, which column your label is, and so on.
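As an illustration only (a minimal sketch, not the exact dl4j-examples code; the batch size and file path are placeholders), the regression case for a two-column CSV like yours would look roughly like:

import java.io.File
import org.datavec.api.records.reader.impl.csv.CSVRecordReader
import org.datavec.api.split.FileSplit
import org.deeplearning4j.datasets.datavec.RecordReaderDataSetIterator

val recordReader = new CSVRecordReader()                    // defaults: no header lines skipped, comma-delimited
recordReader.initialize(new FileSplit(new File("test.csv")))

val batchSize = 10
val labelIndex = 1                                          // second column (number + 1) is the regression target
// (reader, batchSize, labelIndexFrom, labelIndexTo, regression = true)
val iterator = new RecordReaderDataSetIterator(recordReader, batchSize, labelIndex, labelIndex, true)

// iterator can now be passed to MultiLayerNetwork.fit(...)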

Spark/Mllib Train many GaussianMixture models in a distributed way

I've been playing around with the Gaussian Mixture Models provided for spark/mllib.
I found it really nice for generating a GaussianMixture from an enormous number of vectors/points. However, this is not always what you need in ML. Very often you do not need to generate one model from countless vectors, but rather countless models, each one from a few vectors (e.g., building a GMM for each user of a database with hundreds of users).
At this point, I do not know how to proceed with MLlib, as I cannot see an easy way to distribute the work both by user and by data.
Example:
Let featuresByUser: RDD[(user, List[Vector])];
the natural way to train a GMM for each user might be something like
featuresByUser.mapValues(
  feats => new GaussianMixture().setK(nGaussians).run(sc.parallelize(feats))
)
However, it is well known that this is forbidden in Spark: the sc.parallelize call inside the closure does not run on the driver, so it leads to an error.
So the questions are:
Should the MLlib methods accept Seq[Vector] as input, in addition to RDD[Vector], so that the programmer could choose one or the other depending on the problem?
Is there any other workaround that I'm missing for dealing with this case (using MLlib)?
Unfortunately, MLlib is currently not meant to create many models at once, but only one at a time; this was confirmed at a recent Spark meetup in London.
What you can do is launch a separate job for each model from a separate thread in the driver. This is described in the job scheduling documentation. So you would create one RDD per user and run a Gaussian mixture on each, triggering the action that makes each job run from its own thread.
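A rough sketch of that approach (my own illustration, assuming the featuresByUser and nGaussians from the question, with Long user ids; collecting the distinct user ids to the driver only makes sense if the number of users is moderate):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}

val userIds: Array[Long] = featuresByUser.keys.distinct().collect()

val futures: Seq[Future[(Long, GaussianMixtureModel)]] = userIds.toSeq.map { user =>
  Future {
    // each filter + run is its own Spark job; the jobs are submitted from separate threads
    val userRdd = featuresByUser.filter(_._1 == user).flatMap(_._2).cache()
    val gmm = new GaussianMixture().setK(nGaussians).run(userRdd)
    (user, gmm)
  }
}

val models: Map[Long, GaussianMixtureModel] =
  Await.result(Future.sequence(futures), Duration.Inf).toMap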
Another option, if the amount of data per user fits on one machine, is to fit a Gaussian mixture for each user with something other than MLlib. This approach was described at the meetup in a case where sklearn was used within PySpark to create multiple models. You'd do something like:
val users: List[Long] = getUsers
val models = sc.parallelize(users).map { user =>
  val userData = getDataForUser(user)
  buildGM(userData)
}
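To make that sketch concrete, one possible buildGM (an assumption on my part, using Apache Commons Math's EM fitter so the whole fit happens locally inside the map; it expects getDataForUser to return the user's points as Array[Array[Double]] and commons-math3 to be on the executor classpath) could be:

import org.apache.commons.math3.distribution.MixtureMultivariateNormalDistribution
import org.apache.commons.math3.distribution.fitting.MultivariateNormalMixtureExpectationMaximization

def buildGM(data: Array[Array[Double]]): MixtureMultivariateNormalDistribution = {
  val em = new MultivariateNormalMixtureExpectationMaximization(data)
  // estimate() builds an initial mixture from the data, fit() runs EM starting from it
  val initial = MultivariateNormalMixtureExpectationMaximization.estimate(data, nGaussians)  // nGaussians from the question
  em.fit(initial)
  em.getFittedModel
}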

Using ELKI on custom objects and making sense of results

I am trying to use ELKI's SLINK implementation of hierarchical clustering in my program.
I have a set of objects (of my own type) that need to be clustered. For that, I convert them to feature vectors before clustering.
This is how I currently got it to run and produce some result (code is in Scala):
val clusterer = new SLINK(CosineDistanceFunction.STATIC, 3)
val connection = new ArrayAdapterDatabaseConnection(featureVectors)
val database = new StaticArrayDatabase(connection, null)
database.initialize()
val result = clusterer.run(database).asInstanceOf[Clustering[_ <: Model]]
Now, the result is a Clustering that contains elements of type Model. I can output them, but I don't know how to make sense of this result, especially since SLINK returns models of type DendrogramModel which does not seem to be parametrizable.
Specifically, how can I link the results back to my original elements (the ones from which I created the variable featureVectors earlier)?
I assume I need to create some kind of custom model, or somehow maintain a link to the original elements through the initialization and execution of the algorithm, so that I can retrieve them from the result. I cannot find where to get started on this, though.
I am aware that embedding ELKI into own programs is discouraged. However, it seems that calling ELKI in some other way would not be any different: I need to cluster and map the results back to my objects during runtime of my program.
The DendrogramModel does not include the objects in the cluster. Models are additional metadata on the clusters.
Use the getIDs() method to access the members of a Cluster instance.
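For example (a sketch assuming ELKI 0.7.x package names, and that featureVectors was built in the same order as your own objects, here called myObjects): with ArrayAdapterDatabaseConnection the database assigns a contiguous DBIDRange, so getOffset() maps each member ID back to its row index in featureVectors, which is also the index of the original object.

import scala.collection.JavaConverters._
import de.lmu.ifi.dbs.elki.data.`type`.TypeUtil
import de.lmu.ifi.dbs.elki.database.ids.DBIDRange

val relation = database.getRelation(TypeUtil.NUMBER_VECTOR_FIELD)
val ids = relation.getDBIDs.asInstanceOf[DBIDRange]    // contiguous range created for the static array

for (cluster <- result.getAllClusters.asScala) {
  val iter = cluster.getIDs.iter()
  while (iter.valid()) {
    val offset = ids.getOffset(iter)                   // row index into featureVectors
    val original = myObjects(offset)                   // your own object for that row
    // ... use `original` and `cluster` here
    iter.advance()
  }
}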