synapseml lightgbm model doesn't converge on Dataproc - google-cloud-dataproc

I'm trying to deploy a GBDT model with SynapseML LightGBM [0.9.5] on Google Dataproc [2.0-debian10].
I use Spark's StringIndexer to index the string categorical columns and assemble all columns into a single feature vector. With the categorical features configured, the model error doesn't converge and the logs are full of warnings:
DEFAULT [LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
It's strange, because I checked that all categorical feature values are in [0.0, 72234.0], which is within the Int32 range (see https://github.com/microsoft/LightGBM/issues/1359).
Then I removed the categorical meta info and treated all features as numeric. The warning is gone, but the validation metric still looks weird:
[screenshot: LightGBM valid metric in the logs]
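For context, here is a minimal sketch of the kind of pipeline described above (the actual code is not in the question; the column names, the choice of LightGBMClassifier, and the use of setCategoricalSlotNames are assumptions):

import com.microsoft.azure.synapse.ml.lightgbm.LightGBMClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// placeholder column names
val catCols = Array("cat_a", "cat_b")          // string categorical columns
val numCols = Array("num_a", "num_b")          // numeric columns
val indexedCatCols = catCols.map(_ + "_idx")

// index string categoricals, then assemble everything into one feature vector
val indexer = new StringIndexer().setInputCols(catCols).setOutputCols(indexedCatCols)
val assembler = new VectorAssembler().setInputCols(indexedCatCols ++ numCols).setOutputCol("features")

// tell LightGBM which slots of the vector are categorical (the "categorical features setting")
val lgbm = new LightGBMClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setCategoricalSlotNames(indexedCatCols)

val model = new Pipeline().setStages(Array(indexer, assembler, lgbm)).fit(trainDf)  // trainDf is a placeholder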
The model works fine in a local Spark environment, so I suspect something goes wrong when the data is passed from the JVM to the native LightGBM library on Dataproc.
Can anybody help?

Related

Can ELKI cluster non-normalized negative points? [duplicate]

I have gone through this question, but the solution doesn't help:
ELKI Kmeans clustering Task failed error for high dimensional data
This is my first time with ELKI, so please bear with me. I have 45,000 2D data points (obtained after running doc2vec) that contain negative values and are not normalized. The dataset looks something like this:
-4.688612 32.793335
-42.990147 -20.499323
-24.948868 -10.822767
-45.502155 -40.917801
27.979715 -40.012688
1.867812 -9.838544
56.284512 6.756072
I am using the K-means algorithm to get 2 clusters. However, I get the following error:
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=0,maxdim=1 LabelList
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
So my question is: does ELKI require the data to be in the range [0, 1]? All the examples I came across had their data within that range.
Or is it that ELKI does not accept negative values?
If it is something else, can someone please guide me through this?
Thank you!
ELKI can handle negative values just fine.
Your input data is not correctly formatted. This is the same problem as in ELKI Kmeans clustering Task failed error for high dimensional data.
Judging from the reported types (mindim=0, maxdim=1), some of your lines contain either zero or one value instead of two. ELKI itself is fine with that, but
k-means requires the data to lie in a vector space R^d, so ELKI cannot run k-means on such a data set. The root cause is a bad input file: double-check your file, there is probably at least one line that is not properly formatted.
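A quick way to double-check the file is to scan for lines that do not contain exactly two numeric values (a minimal sketch in Scala using only the standard library; the file name is a placeholder):

import scala.io.Source
import scala.util.Try

// print every line that does not parse into exactly two doubles
Source.fromFile("doc2vec-points.txt").getLines().zipWithIndex.foreach { case (line, idx) =>
  val tokens = line.trim.split("\\s+").filter(_.nonEmpty)
  if (tokens.length != 2 || tokens.exists(t => Try(t.toDouble).isFailure))
    println(s"line ${idx + 1} is malformed: '$line'")
}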

Converting Dataframe from Spark to the type used by DL4j

Is there any convenient way to convert a Spark DataFrame to the type used by DL4J? Currently, when using a DataFrame in algorithms with DL4J, I get the error:
"type mismatch, expected: RDD[DataSet], actual: Dataset[Row]".
In general, we use DataVec for that. I can point you at examples if you want. DataFrames make too many assumptions, which makes them too brittle to be used for real-world deep learning.
Beyond that, a data frame is not typically a good abstraction for representing linear algebra (it falls down when dealing with images, for example).
We have some interop with spark.ml here: https://github.com/deeplearning4j/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-scaleout/spark/dl4j-spark-ml/src/test/java/org/deeplearning4j/spark/ml/impl/SparkDl4jNetworkTest.java
But in general, a DataSet is just a pair of ndarrays, just like in NumPy. If you have to use Spark tools and only want to use ndarrays for the last mile, then my advice would be to get the DataFrame to match a schema that is purely numerical and map each row of that to an ndarray "row".
In general, a big reason we do this is because all of our ndarrays are off heap.
Spark has many limitations when it comes to working with its data pipelines, and it uses the JVM for things it shouldn't be used for (matrix math); we took a different approach that allows us to use GPUs and a bunch of other things efficiently.
When we do that conversion, it ends up being:
raw data -> numerical representation -> ndarray
What you could do is map the DataFrame onto a double/float array and then use Nd4j.create(float/doubleArray), or you could also do:
someRdd.map(inputFloatArray -> new DataSet(Nd4j.create(yourInputArray), yourLabelINDArray))
That will give you a DataSet. You need a pair of ndarrays matching your input data and a label.
The label from there depends on the kind of problem you're solving, whether that's classification or regression.
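For example, a minimal sketch in Scala (it assumes a DataFrame df whose columns are all numeric, with the label in the last column; that layout is an assumption, not part of the question):

import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.factory.Nd4j

// map each purely numerical row to a (features, label) pair of ndarrays
val numFeatures = df.columns.length - 1
val dataSetRdd = df.rdd.map { row =>
  val values = (0 until row.length).map(row.getDouble).toArray
  new DataSet(Nd4j.create(values.take(numFeatures)), Nd4j.create(values.takeRight(1)))
}
// dataSetRdd now has the RDD[DataSet] type expected by DL4J's Spark training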

How to revert One-Hot Encoding in Spark (Scala)

After running k-means (Spark MLlib, Scala), I want to make sense of the cluster centers I obtained from data that I pre-processed using (among other transformers) MLlib's OneHotEncoder.
A center looks like this:
Cluster Center 0 [0.3496378699559276,0.05482645034473324,111.6962521358467,1.770525792286651,0.0,0.8561916265130964,0.014382183950365071,0.0,0.0,0.0,0.47699722692567864,0.0,0.0,0.0,0.04988557988346689,0.0,0.0,0.0,0.8981811028926263,0.9695107580117296,0.0,0.0,1.7505886931570156,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,17.771620072281845,0.0,0.0,0.0,0.0]
Which is obviously not very human friendly... Any ideas on how to revert the one-hot encoding and retrieve the original categorical features?
What if I look for the data point which is closest (using the same distance metric that is used by k-means, which I assume is Euclidean distance) to the centroid and then revert the encoding of that particular data point?
For the cluster centroids it is not possible (or at least strongly discouraged) to reverse the encoding. Imagine the original feature value "3" out of 6 is encoded as [0.0,0.0,1.0,0.0,0.0,0.0]. In this case it's easy to extract 3 as the correct feature value from the encoding.
But after running k-means, a cluster centroid may look like [0.0,0.13,0.0,0.77,0.1,0.0] for this feature. If you decode that back to the earlier representation, say "4" out of 6 because position 4 has the largest value, you lose information and may misrepresent the model.
Edit: adding a possible way to revert the encoding on data points, taken from the comments to this answer.
If you have IDs on the data points, you can perform a select/join on the ID after assigning each data point to a cluster, in order to recover the old state from before the encoding.
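A sketch of that ID-based approach (it uses the spark.ml KMeans API and assumes an "id" column plus two placeholder DataFrames: encodedDf with the assembled features and originalDf with the raw categorical columns):

import org.apache.spark.ml.clustering.KMeans

val kmeansModel = new KMeans().setK(10).setFeaturesCol("features").fit(encodedDf)

// cluster assignment per data point, keyed by id
val assignments = kmeansModel.transform(encodedDf).select("id", "prediction")

// join back to the dataframe as it looked before indexing/one-hot encoding
val readableClusters = originalDf.join(assignments, Seq("id"))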

ELKI clustering FDBSCAN algorithm

Could you please show me an example of an input file for FDBSCAN in ELKI? I get an error like this:
Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: UncertainObject,field
Available types: DBID DoubleVector,dim=2
at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
at de.lmu.ifi.dbs.elki.algorithm.clustering.uncertain.FDBSCANNeighborPredicate.instantiate(FDBSCANNeighborPredicate.java:131)
at de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN.run(GeneralizedDBSCAN.java:122)
at de.lmu.ifi.dbs.elki.algorithm.clustering.gdbscan.GeneralizedDBSCAN.run(GeneralizedDBSCAN.java:79)
at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
at [...]
FDBSCAN requires data of the type UncertainObject, i.e. objects with uncertainty information.
If you simply load a CSV file, the data will be certain, and you cannot use uncertain clustering.
There are several ways of modeling uncertainty. These are implemented as filters in the typeconversions package.
UncertainSplitFilter can split a vector of length k*N into k possible instances, each of length N with uniform weight.
WeightedUncertainSplitFilter is similar, but every instance can also have a weight associated.
UncertainifyFilter can simulate uncertainty by e.g. assuming a Gaussian or Uniform distribution around the original vector.
UniformUncertainifier (the U-Model, see Javadoc of UniformContinuousUncertainObject)
SimpleGaussianUncertainifier (see Javadoc of SimpleGaussianContinuousUncertainObject)
UnweightedDiscreteUncertainifier (BID Model, see Javadoc of WeightedDiscreteUncertainObject)
WeightedDiscreteUncertainifier (as above)
or add your own uncertainty information by extending the API!
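As a rough illustration of the flattened layout that UncertainSplitFilter expects (this interpretation of k samples per line is an assumption based on the description above; here k = 3 samples of a 2-dimensional point):

// two uncertain objects, each given as 3 possible 2-d positions, flattened to k*N = 6 values per line
val objects = Seq(
  Seq((1.0, 2.0), (1.1, 2.1), (0.9, 1.9)),
  Seq((5.0, 5.0), (5.2, 4.8), (4.9, 5.1))
)
objects.foreach { samples =>
  println(samples.flatMap { case (x, y) => Seq(x, y) }.mkString(" "))
}
// prints e.g.: 1.0 2.0 1.1 2.1 0.9 1.9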

Mahout: Why is using setProbes() having this effect?

I'm using Mahout 0.7 to do some classification. I have an encoder for a continuous variable:
ContinuousValueEncoder durationPlanEncoder = new ContinuousValueEncoder("duration_plan");
The feature associated with this encoder is a number of days and can range from about 6 to 16.
I'm using an OnlineLogisticRegression model and I use the encoder to train it:
durationPlanEncoder.addToVector(null, <duration_plan double val>, trainDataVector);
For simplicity (since I'm trying to understand this whole classification thing while also learning Mahout), I am using 2 variables: 1) a categorical variable with 6 categories, one of which ("dev") always predicts the =1 category; and 2) this "duration_plan" variable.
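For reference, the setup described above roughly corresponds to the following sketch (it assumes Mahout 0.7's classifier.sgd API; the hashed vector size, the use of StaticWordValueEncoder for the categorical variable, and the example values are assumptions):

import org.apache.mahout.classifier.sgd.{L1, OnlineLogisticRegression}
import org.apache.mahout.math.RandomAccessSparseVector
import org.apache.mahout.vectorizer.encoders.{ContinuousValueEncoder, StaticWordValueEncoder}

val FEATURES = 100  // hypothetical cardinality of the hashed feature vector

val categoryEncoder = new StaticWordValueEncoder("category")
val durationPlanEncoder = new ContinuousValueEncoder("duration_plan")
val model = new OnlineLogisticRegression(2, FEATURES, new L1())

// one training example: category "dev", duration_plan = 12 days, target class 1
val trainDataVector = new RandomAccessSparseVector(FEATURES)
categoryEncoder.addToVector("dev", trainDataVector)
durationPlanEncoder.addToVector(null.asInstanceOf[String], 12.0, trainDataVector)
model.train(1, trainDataVector)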
What I expect to find is that, when I give the classifier test data consisting of the category "dev" and a "duration_plan" value, the accuracy of the classifier will increase as the "duration_plan" value gets closer to its average value across the training data. This is not what I'm seeing, however. Instead, the accuracy of the classifier improves as the value of "duration_plan" goes to 0.0, even though there are no training vectors with duration_plan=0.0! Why would this be the case?
Then I modified my durationPlanEncoder as follows:
durationPlanEncoder.setProbes(2);
and the accuracy improved. It got even better when I made the number of probes 20, then 200. Why? What is setProbes() doing, and is this an anomaly or is this actually how I should be doing it?
The final part of my question is that, even after setting setProbes(20), changing the value of "duration_plan" in the test data has no effect on the accuracy of the classifier, which I don't think is how it should be. If I give a value for duration_plan that doesn't even exist in any of the training data, and thus is never correlated with the =1 class, I would expect the classifier to classify the test sample as =0. Right? Which makes me think I must be coding something just plain wrong. Any suggestions are appreciated.
The Mahout documentation is woefully sparse.
Thanks.