This code works great!
val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(training)
I am able to call model.predict(...)
However, when I try to set up the model parameters, I can't call model.predict.
For example, with the following code, I can't call predict on the model variable.
val model = new LogisticRegressionWithLBFGS().setNumClasses(2)
model.optimizer.setUpdater(new L1Updater).setRegParam(0.0000001).setNumIterations(numIterations)
model.run(training)
Any help with this will be great.
It happens because model in the second case is a LogisticRegressionWithLBFGS, not a LogisticRegressionModel. What you need is something like this:
import org.apache.spark.mllib.classification.{
LogisticRegressionWithLBFGS, LogisticRegressionModel}
import org.apache.spark.mllib.optimization.L1Updater
// Create algorithm instance
val lr: LogisticRegressionWithLBFGS = new LogisticRegressionWithLBFGS()
.setNumClasses(2)
// Set optimizer params (it modifies lr object)
lr.optimizer
.setUpdater(new L1Updater)
.setRegParam(0.0000001)
.setNumIterations(numIterations)
// Train model
val model: LogisticRegressionModel = lr.run(training)
Now model is LogisticRegressionModel and can be used for predictions.
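For example, a minimal sketch of scoring with it (the feature values and the test set here are made up, assuming test is an RDD[LabeledPoint] like training):
import org.apache.spark.mllib.linalg.Vectors
// Score a single feature vector (arbitrary values, just for illustration)...
val score = model.predict(Vectors.dense(0.1, 2.3, 0.5))
// ...or score a whole RDD of LabeledPoints, keeping the true labels alongside
val predictionAndLabels = test.map(p => (model.predict(p.features), p.label))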
I'm trying to extend scipy's rv_discrete class, since that is supposed to work like extending any other class.
I just want to add a couple of instance attributes.
from scipy.stats import rv_discrete

class Distribution(rv_discrete):
    def __init__(self, realization):
        self._realization = realization
        self.num = len(realization)
        # stuff to obtain random alphabet and probabilities from realization
        super().__init__(values=(alphabet, probabilities))
This should allow me to do something like this:
realization = #some values
dist = Distribution(realization)
print(dist.mean())
Instead, I receive this error
ValueError: rv_discrete.__init__(..., values != None, ...)
If I simply create a new rv_discrete object as in the following line of code
dist = rv_discrete(values=(alphabet,probabilities))
It works just fine.
Any idea why? Thank you for your help.
I'm using a generator to make sequential training data for a hierarchical recurrent model, which needs the outputs of the previous batch to generate the inputs for the next batch. This is a similar situation to the Keras argument stateful=True which saves the hidden states for the next batch, except it's more complicated so I can't just use that as-is.
So far I tried putting a hack in the loss function:
def custom_loss(y_true, y_pred):
    global output_ref
    output_ref[0] = y_pred[0].eval(session=K.get_session())
    output_ref[1] = y_pred[1].eval(session=K.get_session())
but that didn't compile and I hope there's a better way. Will Keras callbacks be of any help?
Learned from here:
from keras.callbacks import Callback

model.compile(optimizer='adam')

# hack after compile
output_layers = ['gru']
s_name = 's'
model.metrics_names += [s_name]
model.metrics_tensors += [layer.output for layer in model.layers if layer.name in output_layers]

class my_callback(Callback):
    def on_batch_end(self, batch, logs=None):
        s_pred = logs[s_name]
        print('s_pred:', s_pred)
        return

model.fit(..., callbacks=[my_callback()])
I use this in the TensorFlow version of Keras (tf.keras), but it should work in Keras without TensorFlow as well.
import tensorflow as tf

class ModelOutput:
    '''Class wrapper for a metric that stores the output passed to it'''
    def __init__(self, name):
        self.name = name
        self.y_true = None
        self.y_pred = None

    def save_output(self, y_true, y_pred):
        self.y_true = y_true
        self.y_pred = y_pred
        return tf.constant(True)

class ModelOutputCallback(tf.keras.callbacks.Callback):
    def __init__(self, model_outputs):
        tf.keras.callbacks.Callback.__init__(self)
        self.model_outputs = model_outputs

    def on_train_batch_end(self, batch, logs=None):
        # use self.model_outputs to get the outputs here
        pass

model_outputs = [
    ModelOutput('rbox_score_map'),
    ModelOutput('rbox_shapes'),
    ModelOutput('rbox_angles')
]

# Note the extra [] around m.save_output: this example is for a model with
# 3 outputs, and metrics must be a list of lists if you type it out.
model.compile(..., metrics=[[m.save_output] for m in model_outputs])
model.fit(..., callbacks=[ModelOutputCallback(model_outputs)])
I get a DataFrame containing Tuple2[String, org.apache.spark.ml.regression.LinearRegressionModel]:
val result = rows.map(row => {
  val userid = row.getString(0)
  val frame = filterByUserId(userid, dataFrame)
  (userid, lr.fit(frame, "topicDistribution", "s"))
}).toDF()
When I use the foreach function, I get this error.
result.foreach(row => {
  val model = row.getAs[LinearRegressionModel](1)
  val userid = row.getString(0)
  model.save(SocialTextTest.userModelPath + userid)
})
Exception in thread "main" java.lang.UnsupportedOperationException:
No Encoder found for org.apache.spark.ml.regression.LinearRegressionModel
- field (class: "org.apache.spark.ml.regression.LinearRegressionModel", name: "_2")
- root class: "scala.Tuple2"
Should I write an Encoder myself?
The exception happens for a reason.
No Encoder found for org.apache.spark.ml.regression.LinearRegressionModel
The code simply does not make much sense once you understand what the Dataset abstraction really is and what Encoders are for.
In essence, prepare your dataset first (as a series of transformations on a Dataset) and only when the dataset is ready train a model (i.e. fit a model). The model then lives outside the realm of Datasets and you won't see the exception.
The exception surfaces when you call foreach because that is the point at which the computation is triggered and Spark actually tries to execute the code.
Should I write an Encoder myself?
Oh, no. Rewrite the code following the Machine Learning Library (MLlib) Guide and review some examples to learn how to use the API.
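A rough sketch of that rewrite, reusing the question's own helpers (filterByUserId, lr, SocialTextTest.userModelPath) and assuming rows is the same collection of Rows as in the question and can be collected to the driver, keeps the models entirely out of the Dataset so no Encoder is ever needed:
// Collect only the user ids (plain Strings, which Spark knows how to encode),
// then fit and save one model per user on the driver.
val userIds = rows.map(_.getString(0)).collect()

userIds.foreach { userid =>
  val frame = filterByUserId(userid, dataFrame)
  val model = lr.fit(frame, "topicDistribution", "s") // same fit call as in the question
  model.save(SocialTextTest.userModelPath + userid)
}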
I have the following two objects (in Scala, using Spark):
1. The main object
object Omain {
  def main(args: Array[String]) {
    odbscan
  }
}
2. The object odbscan
object odbscan {
  val conf = new SparkConf().setAppName("Clustering").setMaster("local")
  conf.set("spark.driver.maxResultSize", "3g")
  val sc = new SparkContext(conf)
  val param_user_minimal_rating_count = 2

  /*** Connection ***/
  val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
  val sql = "SELECT id, data FROM user_profile"
  val options = connectMysql.getOptionsMap(sql)
  val uSQL = sqlcontext.load("jdbc", options)

  val users = uSQL.rdd.map { x =>
    val v = x.toString().substring(1, x.toString().size - 1).split(",")
    var ap: Map[Int, Double] = Map()
    if (v.size > 1)
      ap = v(1).split(";").map { y => (y.split(":")(0).toInt, y.split(":")(1).toDouble) }.toMap
    (v(0).toInt, ap)
  }.filter(_._2.size >= param_user_minimal_rating_count)

  println(users.collect().mkString("\n"))
}
When I execute this code, I get an infinite loop, until I change:
filter(_._2.size >= param_user_minimal_rating_count)
to
filter(_._2.size >= 1)
or any other numerical value; in that case the code works and my result is displayed.
What I think is happening here is that Spark serializes functions to send them over the wire. Because your function (the one you pass to map) calls the accessor param_user_minimal_rating_count of object odbscan, the entire odbscan object needs to be serialized and sent along with it. Deserializing and then using that deserialized object causes the code in its body to be executed again, which results in an infinite loop of serializing --> sending --> deserializing --> executing --> serializing --> ...
Probably the easiest thing to do here is to change that val to final val param_user_minimal_rating_count = 2 so the compiler will inline the value. But note that this only works for literal constants. For more information see constant value definitions and constant expressions.
Another, better solution is to refactor your code so that no instance variables are used in lambda expressions. Referencing vals that are defined in an object or class gets the whole object serialized. So try to only refer to vals that are local to a method. And most importantly, don't execute your business logic from within a constructor or the body of an object or class.
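As a rough sketch of that refactoring (reusing the question's own code; connectMysql and the SQL string are taken from the question), the whole thing can live in main so the closure only captures a local Int:
import org.apache.spark.{SparkConf, SparkContext}

object odbscan {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Clustering").setMaster("local")
    conf.set("spark.driver.maxResultSize", "3g")
    val sc = new SparkContext(conf)

    // Local val: the closure below captures just this Int,
    // not the enclosing odbscan object.
    val minRatingCount = 2

    val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
    val options = connectMysql.getOptionsMap("SELECT id, data FROM user_profile")
    val uSQL = sqlcontext.load("jdbc", options)

    val users = uSQL.rdd.map { x =>
      val v = x.toString().substring(1, x.toString().size - 1).split(",")
      val ap =
        if (v.size > 1)
          v(1).split(";").map { y => (y.split(":")(0).toInt, y.split(":")(1).toDouble) }.toMap
        else Map.empty[Int, Double]
      (v(0).toInt, ap)
    }.filter(_._2.size >= minRatingCount)

    println(users.collect().mkString("\n"))
  }
}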
Your problem is somewhere else.
The only difference between the two snippets is the definition of val Eps = 5 outside of the map, which does not change the control flow of your code at all.
Please post more context so we can help.
I'm trying out Nak (a machine learning package for Scala). However, it doesn't provide easy access to basic methods like NaiveBayes or Maximum Entropy. I want to do it manually, but I failed to create an instance of the NaiveBayes class. Part of their NaiveBayes code looks like this:
object NaiveBayes {
  class Trainer[L, T](wordSmoothing: Double = 0.05, classSmoothing: Double = 0.01)
      extends Classifier.Trainer[L, Counter[T, Double]] {
    type MyClassifier = NaiveBayes[L, T]
    override def train(data: Iterable[Example[L, Counter[T, Double]]]) = {
      new NaiveBayes(data, wordSmoothing, classSmoothing)
    }
  }
}
I can't access the Trainer class...and I don't know why. The full code can be found here:
https://github.com/scalanlp/nak/blob/master/src/main/scala/nak/classify/NaiveBayes.scala
I tried to write code like:
Trainer train = new Trainer() or NaiveBayes.Trainer train = new ...
It's just not working...
Trainer takes type parameters, so you have to specify them if they can't be inferred. For example:
val trainer = new NaiveBayes.Trainer[???,???]()
where the question marks should be replaced by type arguments for L and T. According to the comments in Classifier.scala, L should be the type of your labels, and T should be the type of your observations.
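For instance (a hypothetical sketch: the String/String type arguments, the toy labels and the word counts are made up, and Example/Counter are assumed to come from nak.data and breeze.linalg as in the linked source):
import nak.classify.NaiveBayes
import nak.data.Example
import breeze.linalg.Counter

// L = String (labels), T = String (the keys of the observation counts)
val trainer = new NaiveBayes.Trainer[String, String]()

// Toy training data, assuming the Example(label, features) factory from nak.data
val data = Seq(
  Example("spam", Counter("buy" -> 2.0, "now" -> 1.0)),
  Example("ham", Counter("meeting" -> 1.0, "tomorrow" -> 1.0))
)

val nb = trainer.train(data) // a NaiveBayes[String, String]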