Spark ML - Save OneVsRestModel - scala

I am in the middle of refactoring my code to take advantage of DataFrames, Estimators, and Pipelines. I was originally using MLlib Multiclass LogisticRegressionWithLBFGS on RDD[LabeledPoint]. I am enjoying learning and using the new API, but I am not sure how to save my new model and apply it on new data.
Currently, the ML implementation of LogisticRegression only supports binary classification. I am instead using OneVsRest, like so:
val lr = new LogisticRegression().setFitIntercept(true)
val ovr = new OneVsRest()
ovr.setClassifier(lr)
val ovrModel = ovr.fit(training)
I would now like to save my OneVsRestModel, but this does not seem to be supported by the API. I have tried:
ovrModel.save("my-ovr") // Cannot resolve symbol save
ovrModel.models.foreach(_.save("model-" + _.uid)) // Cannot resolve symbol save
Is there a way to save this, so I can load it in a new application for making new predictions?

Spark 2.0.0
OneVsRestModel implements MLWritable, so it can be saved directly. The method shown below can still be useful for saving individual models separately.
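For example (a minimal sketch, assuming Spark 2.0+ where OneVsRestModel extends MLWritable; the path and the new-data DataFrame are illustrative):
import org.apache.spark.ml.classification.OneVsRestModel

// save the fitted model (overwrite() is optional)
ovrModel.write.overwrite().save("my-ovr")

// ...later, in another application, load it back and score new data
val restored = OneVsRestModel.load("my-ovr")
val predictions = restored.transform(newData)  // newData: a DataFrame with the same features column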
Spark < 2.0.0
The problem here is that models returns an Array[ClassificationModel[_, _]], not an Array of LogisticRegressionModel (or MLWritable). To make it work you'll have to be specific about the types:
import org.apache.spark.ml.classification.LogisticRegressionModel
ovrModel.models.zipWithIndex.foreach {
  case (model: LogisticRegressionModel, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}
or to be more generic:
import org.apache.spark.ml.util.MLWritable
ovrModel.models.zipWithIndex.foreach {
  case (model: MLWritable, i: Int) =>
    model.save(s"model-${model.uid}-$i")
}
Unfortunately, as of now (Spark 1.6), OneVsRestModel doesn't implement MLWritable, so it cannot be saved on its own.
Note:
All models in the OneVsRest seem to use the same uid, hence we need an explicit index. It will also be useful to identify the model later.
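To load the individually saved models back in another application (a sketch; the exact paths depend on the uid and naming scheme you saved with, so the ones below are hypothetical):
import org.apache.spark.ml.classification.LogisticRegressionModel

// replace with whatever "model-<uid>-<i>" paths were written out above
val savedPaths = Seq("model-logreg_abc-0", "model-logreg_abc-1", "model-logreg_abc-2")
val restoredModels = savedPaths.map(LogisticRegressionModel.load)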

Related

Register UDF to SqlContext from Scala to use in PySpark

Is it possible to register a UDF (or function) written in Scala to use in PySpark?
E.g.:
val mytable = sc.parallelize(1 to 2).toDF("spam")
mytable.registerTempTable("mytable")
def addOne(m: Integer): Integer = m + 1
// Spam: 1, 2
In Scala, the following is now possible:
val UDFaddOne = sqlContext.udf.register("UDFaddOne", addOne _)
val mybiggertable = mytable.withColumn("moreSpam", UDFaddOne(mytable("spam")))
// Spam: 1, 2
// moreSpam: 2, 3
I would like to use "UDFaddOne" in PySpark like
%pyspark
mytable = sqlContext.table("mytable")
UDFaddOne = sqlContext.udf("UDFaddOne") # does not work
mybiggertable = mytable.withColumn("+1", UDFaddOne(mytable("spam"))) # does not work
Background: we are a team of developers, some coding in Scala and some in Python, who would like to share already-written functions. Another option would be to save them into a library and import them.
As far as I know, PySpark doesn't provide any equivalent of the callUDF function, and because of that it is not possible to access a registered UDF directly.
The simplest solution here is to use raw SQL expression:
from pyspark.sql.functions import expr

mytable.withColumn("moreSpam", expr("UDFaddOne({})".format("spam")))
## OR
sqlContext.sql("SELECT *, UDFaddOne(spam) AS moreSpam FROM mytable")
## OR
mytable.selectExpr("*", "UDFaddOne(spam) AS moreSpam")
This approach is rather limited, so if you need to support more complex workflows you should build a package and provide complete Python wrappers. You'll find an example UDAF wrapper in my answer to Spark: How to map Python with Scala or Java User Defined Functions?
The following worked for me (basically a summary of multiple places including the link provided by zero323):
In scala:
package com.example

import org.apache.spark.sql.functions.udf

object udfObj extends Serializable {
  def createUDF = {
    udf((x: Int) => x + 1)
  }
}
In Python (assume sc is the SparkContext; if you are using Spark 2.0 you can get it from the SparkSession):
from py4j.java_gateway import java_import
from pyspark.sql.column import Column

jvm = sc._gateway.jvm
java_import(jvm, "com.example")

def udf_f(col):
    return Column(jvm.com.example.udfObj.createUDF().apply(col))
And of course, make sure the jar created in Scala is added using --jars and --driver-class-path.
So what happens here:
We create a function inside a serializable object which returns the UDF in Scala (I am not 100% sure Serializable is required; it was required for me for a more complex UDF, so it could be because it needed to pass Java objects).
In Python we access the internal JVM (this is a private member, so it could change in the future, but I see no way around it) and import our package using java_import.
We access the createUDF function and call it. This returns an object which has an apply method (functions in Scala are actually Java objects with an apply method). The input to the apply method is a column. The result of applying it is a new column, so we need to wrap it with the Column constructor to make it available to withColumn.

Read Parquet files from Scala without using Spark

Is it possible to read parquet files from Scala without using Apache Spark?
I found a project which allows us to read and write Avro files using plain Scala.
https://github.com/sksamuel/avro4s
However, I can't find a way to read and write Parquet files in a plain Scala program without using Spark.
It's straightforward enough to do using the parquet-mr project, which is the project Alexey Raga is referring to in his answer.
Some sample code
// imports assume parquet-mr 1.7+ (org.apache.parquet packages)
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetReader
import org.apache.parquet.hadoop.ParquetReader

val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
// iter is of type Iterator[GenericRecord]
val iter = Iterator.continually(reader.read).takeWhile(_ != null)
// if you want a list then...
val list = iter.toList
This will return standard Avro GenericRecords, but if you want to turn those into a Scala case class, then you can use my Avro4s library, which you linked to in your question, to do the marshalling for you. Assuming you are using version 1.30 or higher:
import com.sksamuel.avro4s.RecordFormat

case class Bibble(name: String, location: String)
val format = RecordFormat[Bibble]
// then for a given record
val bibble = format.from(record)
We can obviously combine that with the original iterator in one step:
val reader = AvroParquetReader.builder[GenericRecord](path).build().asInstanceOf[ParquetReader[GenericRecord]]
val format = RecordFormat[Bibble]
// iter is now an Iterator[Bibble]
val iter = Iterator.continually(reader.read).takeWhile(_ != null).map(format.from)
// and list is now a List[Bibble]
val list = iter.toList
There is also a relatively new project called eel; this is a lightweight (non-distributed processing) toolkit for using some of the 'big data' technology in the small.
Yes, you don't have to use Spark to read/write Parquet.
Just use parquet lib directly from your Scala code (and that's what Spark is doing anyway): http://search.maven.org/#search%7Cga%7C1%7Cparquet
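Writing without Spark works the same way via parquet-avro; here is a rough sketch (the schema, record values, and output path are illustrative):
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// an Avro schema matching the Bibble case class used above
val schema = SchemaBuilder.record("Bibble").fields()
  .requiredString("name")
  .requiredString("location")
  .endRecord()

val writer = AvroParquetWriter.builder[GenericRecord](new Path("bibbles.parquet"))
  .withSchema(schema)
  .build()

val record = new GenericData.Record(schema)
record.put("name", "blobby")
record.put("location", "downstairs")
writer.write(record)
writer.close()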

Run a read-only test in Spark

I want to compare the read performance of different storage systems using Spark, e.g. HDFS/S3N. I have written a small Scala program for this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    val file = sc.textFile("s3n://test/wordtest")
    val splits = file.map(word => word)
    splits.saveAsTextFile("s3n://test/myoutput")
  }
}
My question is, is it possible to run a read-only test with Spark? For the program above, isn't saveAsTextFile() causing some write as well?
I am not sure that is possible at all. In order to run a transformation, a subsequent action is necessary.
From the official Spark documentation:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
Taking this into account, saveAsTextFile is not the lightest of the wide range of actions available. Several lightweight alternatives exist, such as count or first. These leave almost all of the work to the transformation phase, letting you measure the read performance of your solution.
You might want to check the available actions and choose the one that best fits your requirements.
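As an illustration, here is a minimal read-only variant of the program above that uses count as the action (the timing code is just a sketch; the input path is the one from the question):
import org.apache.spark.{SparkConf, SparkContext}

object ReadOnlyTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReadOnlyTest"))

    val start = System.nanoTime()
    // count forces Spark to read every record, but only a single number
    // is returned to the driver, so nothing is written back to storage
    val lines = sc.textFile("s3n://test/wordtest").count()
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(s"Read $lines lines in $elapsedMs ms")
    sc.stop()
  }
}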
Yes."saveAsTextFile" writes the RDD data to text file using given path.

Fastest serialization/deserialization of Scala case classes

If I've got a nested object graph of case classes, similar to the example below, and I want to store collections of them in a Redis list, what libraries or tools should I look at that will give the fastest overall round trip to Redis?
This will include:
Time to serialize the item
network cost of transferring the serialized data
network cost of retrieving stored serialized data
time to deserialize back into case classes
case class Person(name: String, age: Int, children: List[Person]) {}
UPDATE (2018): scala/pickling is no longer actively maintained. There are hordes of other libraries that have arisen as alternatives which take similar approaches but which tend to focus on specific serialization formats; e.g., JSON, binary, protobuf.
Your use case is exactly the targeted use case for scala/pickling (https://github.com/scala/pickling). Disclaimer: I'm an author.
Scala/pickling was designed to be a faster, more typesafe, and more open alternative to automatic frameworks like Java serialization or Kryo. It was built in particular for distributed applications, so serialization/deserialization time and serialized data size take a front seat. It takes a different approach to serialization altogether: it generates pickling (serialization) code inline at the use site at compile time, so it's really very fast.
The latest benchmarks are in our OOPSLA paper: for the binary pickle format (you can also choose others, like JSON), scala/pickling is consistently faster than Java serialization and Kryo, and produces binary representations that are on par with or smaller than Kryo's, meaning less latency when passing your pickled data over the network.
For more info, there's a project page:
http://lampwww.epfl.ch/~hmiller/pickling
And a ScalaDays 2013 talk from June on Parleys.
We'll also be presenting some new developments in particular related to dealing with sending closures over the network at Strange Loop 2013, in case that might also be a pain point for your use case.
As of the time of this writing, scala/pickling is in pre-release, with our first stable release planned for August 21st.
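For reference, a rough usage sketch (this assumes the later scala-pickling 0.10.x API; method names in the pre-release version mentioned above may differ slightly):
import scala.pickling.Defaults._
import scala.pickling.binary._

case class Person(name: String, age: Int, children: List[Person])

val alice = Person("Alice", 42, List(Person("Bob", 7, Nil)))
val pickled = alice.pickle              // binary pickle; pickled.value is an Array[Byte]
val restored = pickled.unpickle[Person]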
Update:
Be careful when using the serialization methods from the JDK: the performance is not great, and one small change in your class will make previously serialized data impossible to deserialize.
I've used scala/pickling but it has a global lock while serializing/deserializing.
So instead of using it, I write my own serialization/deserialization code like this:
import java.io._
object Serializer {
  def serialize[T <: Serializable](obj: T): Array[Byte] = {
    val byteOut = new ByteArrayOutputStream()
    val objOut = new ObjectOutputStream(byteOut)
    objOut.writeObject(obj)
    objOut.close()
    byteOut.close()
    byteOut.toByteArray
  }

  def deserialize[T <: Serializable](bytes: Array[Byte]): T = {
    val byteIn = new ByteArrayInputStream(bytes)
    val objIn = new ObjectInputStream(byteIn)
    val obj = objIn.readObject().asInstanceOf[T]
    byteIn.close()
    objIn.close()
    obj
  }
}
Here is an example of using it:
case class Example(a: String, b: String)
val obj = Example("a", "b")
val bytes = Serializer.serialize(obj)
val obj2 = Serializer.deserialize[Example](bytes)
According to the upickle benchmarks: "uPickle runs 30-50% faster than Circe for reads/writes, and ~200% faster than play-json" for serializing case classes.
It's easy to use, here's how to serialize a case class to a JSON string:
case class City(name: String, funActivity: String, latitude: Double)
val bengaluru = City("Bengaluru", "South Indian food", 12.97)
implicit val cityRW = upickle.default.macroRW[City]
upickle.default.write(bengaluru) // "{\"name\":\"Bengaluru\",\"funActivity\":\"South Indian food\",\"latitude\":12.97}"
You can also serialize to binary or other formats.
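Reading the JSON back uses the same implicit ReadWriter; a quick round-trip sketch:
val json = upickle.default.write(bengaluru)
val parsed = upickle.default.read[City](json)
// parsed == City("Bengaluru", "South Indian food", 12.97)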
The accepted answer from 2013 proposes a library that is no longer maintained. There are many similar questions on StackOverflow but I really couldn't find a good answer which would meet the following criteria:
serialization/deserialization should be fast
high performance data exchange over the wire where you only encode as much metadata as you need
supports schema evolution so that changing the serialized object (ex: case class) doesn't break past deserializations
I recommend against using low-level JDK SerDes (like ByteArrayOutputStream and ByteArrayInputStream). Supporting schema evolution becomes a pain, and it's difficult to make it work with external services (ex: Thrift) since you have no control over whether the data being sent back used the same type of streams.
Some people use JSON, with libraries like json4s, but it is not suitable for message transfer in distributed computing. It marshals data as a JSON string, so it'll be both slower and storage-inefficient, since it uses 8 bits to store every character in the string.
I highly recommend using the MessagePack binary serialization format. I would recommend reading the spec to understand the encoding specifics. It has implementations in many different languages; here's a generic example I wrote for a Scala case class that you can copy-paste into your code.
import java.nio.ByteBuffer
import java.util.concurrent.TimeUnit
import org.msgpack.core.MessagePack
case class Data(message: String, number: Long, timeUnit: TimeUnit, price: Long)
object Data extends App {

  def serialize(data: Data): ByteBuffer = {
    val packer = MessagePack.newDefaultBufferPacker
    packer
      .packString(data.message)
      .packLong(data.number)
      .packString(data.timeUnit.toString)
      .packLong(data.price)
    packer.close()
    ByteBuffer.wrap(packer.toByteArray)
  }

  def deserialize(data: ByteBuffer): Data = {
    val unpacker = MessagePack.newDefaultUnpacker(convertDataToByteArray(data))
    val newdata = Data.apply(
      message = unpacker.unpackString(),
      number = unpacker.unpackLong(),
      timeUnit = TimeUnit.valueOf(unpacker.unpackString()),
      price = unpacker.unpackLong()
    )
    unpacker.close()
    newdata
  }

  def convertDataToByteArray(data: ByteBuffer): Array[Byte] = {
    val buffer = Array.ofDim[Byte](data.remaining())
    data.duplicate().get(buffer)
    buffer
  }

  println(deserialize(serialize(Data("Hello world!", 1L, TimeUnit.DAYS, 3L))))
}
It will print:
Data(Hello world!,1,DAYS,3)

Storing an object to a file

I want to save an object (an instance of a class) to a file. I didn't find any useful example of this. Do I need to use serialization for it?
How do I do that?
UPDATE:
Here is how I tried to do that
import scala.util.Marshal
import scala.io.Source
import scala.collection.immutable
import java.io._
object Example {
  class Foo(val message: String) extends scala.Serializable

  val foo = new Foo("qweqwe")
  val out = new FileOutputStream("out123.txt")
  out.write(Marshal.dump(foo))
  out.close()
}
First of all, out123.txt contains a lot of extra data, and it is in the wrong encoding. My gut tells me there should be a proper way to do this.
At the last ScalaDays, Heather introduced a new library which provides a cool new mechanism for serialization: pickling. I think it would be the idiomatic way to do serialization in Scala, and just what you want.
Check out the paper on this topic, as well as the slides and talk from ScalaDays '13.
It is also possible to serialize to and deserialize from JSON using Jackson.
A nice wrapper that makes it Scala-friendly is Jacks.
JSON has the following advantages
a simple human readable text
a rather efficient format byte-wise
it can be used directly by JavaScript
and it can even be natively stored and queried using a DB like MongoDB
(Edit) Example Usage
Serializing to JSON:
val json = JacksMapper.writeValueAsString[MyClass](instance)
... and deserializing
val obj = JacksMapper.readValue[MyClass](json)
Take a look at Twitter Chill to handle your serialization: https://github.com/twitter/chill. It's a Scala helper for the Kryo serialization library. The documentation/example on the GitHub page looks to be sufficient for your needs.
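A minimal sketch of what that might look like using chill's KryoInjection (the file name and the Foo case class are illustrative, not from the question):
import java.nio.file.{Files, Paths}
import com.twitter.chill.KryoInjection

case class Foo(message: String)

val bytes: Array[Byte] = KryoInjection(Foo("qweqwe"))          // serialize with Kryo
Files.write(Paths.get("foo.bin"), bytes)                       // store to a file

val loaded = Files.readAllBytes(Paths.get("foo.bin"))
val back: scala.util.Try[Any] = KryoInjection.invert(loaded)   // deserialize (returns a Try[Any])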
Just adding my answer here for the convenience of someone like me.
The pickling library, which is mentioned by @4lex1v, only supports Scala 2.10/2.11, but I'm using Scala 2.12, so I'm not able to use it in my project.
And then I found BooPickle. It supports Scala 2.11 as well as 2.12!
Here's the example:
import boopickle.Default._
val data = Seq("Hello", "World!")
val buf = Pickle.intoBytes(data)
val helloWorld = Unpickle[Seq[String]].fromBytes(buf)
For more details, please check here.
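Since the question is about storing an object to a file, here is one way to persist the pickled bytes and read them back (a sketch; the file name is illustrative):
import java.nio.ByteBuffer
import java.nio.file.{Files, Paths}

// pickle again and drain the resulting ByteBuffer into an Array[Byte]
val out: ByteBuffer = Pickle.intoBytes(data)
val fileBytes = Array.ofDim[Byte](out.remaining())
out.get(fileBytes)
Files.write(Paths.get("data.bin"), fileBytes)

// later: read the file back and unpickle
val restored = Unpickle[Seq[String]].fromBytes(ByteBuffer.wrap(Files.readAllBytes(Paths.get("data.bin"))))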