Extending DefaultParamsReadable and DefaultParamsWritable not allowing reading of custom model - scala

Good day,
I have been struggling for a few days to save a custom transformer that is part of a large pipeline of stages. The transformer is completely defined by its params. I also have an estimator whose fit method generates a matrix and then sets the transformer's parameters accordingly, so that I can use DefaultParamsWritable and DefaultParamsReadable to take advantage of the serialisation/deserialisation already present in util.ReadWrite.scala.
My summarised code is as follows (includes important aspects):
...
import org.apache.spark.ml.util._
...
// trait to implement in Estimator and Transformer for params
trait NBParams extends Params {
final val featuresCol= new Param[String](this, "featuresCol", "The input column")
setDefault(featuresCol, "_tfIdfOut")
final val labelCol = new Param[String](this, "labelCol", "The labels column")
setDefault(labelCol, "P_Root_Code_Index")
final val predictionsCol = new Param[String](this, "predictionsCol", "The output column")
setDefault(predictionsCol, "NBOutput")
final val ratioMatrix = new Param[DenseMatrix](this, "ratioMatrix", "The transformation matrix")
def getfeaturesCol: String = $(featuresCol)
def getlabelCol: String = $(labelCol)
def getPredictionCol: String = $(predictionsCol)
def getRatioMatrix: DenseMatrix = $(ratioMatrix)
}
// Estimator
class CustomNaiveBayes(override val uid: String, val alpha: Double)
extends Estimator[CustomNaiveBayesModel] with NBParams with DefaultParamsWritable {
def copy(extra: ParamMap): CustomNaiveBayes = {
defaultCopy(extra)
}
def setFeaturesCol(value: String): this.type = set(featuresCol, value)
def setLabelCol(value: String): this.type = set(labelCol, value)
def setPredictionCol(value: String): this.type = set(predictionsCol, value)
def setRatioMatrix(value: DenseMatrix): this.type = set(ratioMatrix, value)
override def transformSchema(schema: StructType): StructType = {...}
override def fit(ds: Dataset[_]): CustomNaiveBayesModel = {
...
val model = new CustomNaiveBayesModel(uid)
model
.setRatioMatrix(ratioMatrix)
.setFeaturesCol($(featuresCol))
.setLabelCol($(labelCol))
.setPredictionCol($(predictionsCol))
}
}
// companion object for Estimator
object CustomNaiveBayes extends DefaultParamsReadable[CustomNaiveBayes]{
override def load(path: String): CustomNaiveBayes = super.load(path)
}
// Transformer
class CustomNaiveBayesModel(override val uid: String)
extends Model[CustomNaiveBayesModel] with NBParams with DefaultParamsWritable {
def this() = this(Identifiable.randomUID("customnaivebayes"))
def copy(extra: ParamMap): CustomNaiveBayesModel = {defaultCopy(extra)}
def setFeaturesCol(value: String): this.type = set(featuresCol, value)
def setLabelCol(value: String): this.type = set(labelCol, value)
def setPredictionCol(value: String): this.type = set(predictionsCol, value)
def setRatioMatrix(value: DenseMatrix): this.type = set(ratioMatrix, value)
override def transformSchema(schema: StructType): StructType = {...}
def transform(dataset: Dataset[_]): DataFrame = {...}
}
// companion object for Transformer
object CustomNaiveBayesModel extends DefaultParamsReadable[CustomNaiveBayesModel]
When I add this model to a pipeline and fit the pipeline, everything runs fine. When I save the pipeline, there are no errors. However, when I attempt to load the pipeline back in, I get the following error:
NoSuchMethodException: $line3b380bcad77e4e84ae25a6bfb1f3ec0d45.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$6fa979eb27fa6bf89c6b6d1b271932c$$$$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$CustomNaiveBayesModel.read()
To save the pipeline, which includes a number of other transformers related to NLP pre-processing, I run
fittedModelRootCode.write.save("path")
and to then load it (where the failure occurs) I run
import org.apache.spark.ml.PipelineModel
val fittedModelRootCode = PipelineModel.load("path")
The model itself appears to be working well but I cannot afford to retrain the model on a dataset every time I wish to use it. Does anyone have any ideas why even with the companion object, the read() method appears to be unavailable?
Notes:
I am running on Databricks Runtime 8.3 (Spark 3.1.1, Scala 2.12)
My model is in a separate package so is external to Spark
I have reproduced this based on a number of existing examples all of which appear to work fine so I am unsure why my code is failing
I am aware there is a Naive Bayes model available in Spark ML, however, I have been tasked with making a large number of customizations so it is not worth modifying the existing version (plus I would like to learn how to get this right)
Any help would be greatly appreciated.

Since the CustomNaiveBayesModel companion object extends DefaultParamsReadable, I think you should use the companion object CustomNaiveBayesModel for loading the model. Here is some code for saving and loading models, and it works properly:
import org.apache.spark.SparkConf
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.SparkSession
import path.to.CustomNaiveBayesModel
object SavingModelApp extends App {
val spark: SparkSession = SparkSession.builder().config(
new SparkConf()
.setMaster("local[*]")
.setAppName("Test app")
.set("spark.driver.host", "localhost")
.set("spark.ui.enabled", "false")
).getOrCreate()
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")
val fittedModelRootCode: PipelineModel = new Pipeline().setStages(Array(new CustomNaiveBayesModel())).fit(training)
fittedModelRootCode.write.save("path/to/model")
val mod = PipelineModel.load("path/to/model")
}
I think your mistake is using PipelineModel.load for loading the concrete model.
My environment:
scalaVersion := "2.12.6"
scalacOptions := Seq(
"-encoding", "UTF-8", "-target:jvm-1.8", "-deprecation",
"-feature", "-unchecked", "-language:implicitConversions", "-language:postfixOps")
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.1.1"
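One detail in the error message may be worth checking: the class name starts with `$line…$read$$iw…`, which is the kind of wrapper name the Scala REPL and notebook cells give to classes defined interactively. The following is a minimal sketch, assuming the loader locates the reader by reflectively calling a static `read()` on the saved class name. That is an assumption suggested by the `NoSuchMethodException ... read()` message, not something the post confirms. `ReadLookupSketch`, `NestedModel` and `staticRead` are hypothetical names and none of this is Spark code.

```scala
object ReadLookupSketch {
  // Hypothetical stand-in for a model whose companion exposes read().
  // It is nested inside another object, roughly like a class defined in a notebook cell.
  class NestedModel
  object NestedModel {
    def read(): String = "reader for NestedModel"
  }

  // Mimics a loader that only has the Class and expects a static read() on it.
  def staticRead(cls: Class[_]): Either[String, String] =
    try Right(cls.getMethod("read").invoke(null).asInstanceOf[String])
    catch {
      case e: NoSuchMethodException => Left("NoSuchMethodException: " + e.getMessage)
    }

  def main(args: Array[String]): Unit =
    // The nested class carries no static forwarder for the companion's read(),
    // so the lookup ends in the Left branch.
    println(staticRead(classOf[NestedModel]))
}
```

The compiler emits static forwarders for companion methods only when the class and its companion are top-level definitions. If this is what is happening here, compiling the model and its companion into a jar attached to the cluster, rather than defining them in a notebook cell, would be the thing to try.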

Related

What is Scala 3 equivalent to this Scala 2 code that uses Enumeration and play-json?

I have some code that works in Scala 2.{10,11,12,13} that I'm now trying to convert to Scala 3. Scala 3 does Enumeration differently than Scala 2. I'm trying to figure out how to convert the following code that interacts with play-json so that it will work with Scala 3. Any tips or pointers to code from projects that have already crossed this bridge?
// Scala 2.x style code in EnumUtils.scala
import play.api.libs.json._
import scala.language.implicitConversions
// see: http://perevillega.com/enums-to-json-in-scala
object EnumUtils {
def enumReads[E <: Enumeration](enum: E): Reads[E#Value] =
new Reads[E#Value] {
def reads(json: JsValue): JsResult[E#Value] = json match {
case JsString(s) => {
try {
JsSuccess(enum.withName(s))
} catch {
case _: NoSuchElementException =>
JsError(s"Enumeration expected of type: '${enum.getClass}', but it does not appear to contain the value: '$s'")
}
}
case _ => JsError("String value expected")
}
}
implicit def enumWrites[E <: Enumeration]: Writes[E#Value] = new Writes[E#Value] {
def writes(v: E#Value): JsValue = JsString(v.toString)
}
implicit def enumFormat[E <: Enumeration](enum: E): Format[E#Value] = {
Format(EnumUtils.enumReads(enum), EnumUtils.enumWrites)
}
}
// ----------------------------------------------------------------------------------
// Scala 2.x style code in Xyz.scala
import play.api.libs.json.{Reads, Writes}
object Xyz extends Enumeration {
type Xyz = Value
val name, link, unknown = Value
implicit val enumReads: Reads[Xyz] = EnumUtils.enumReads(Xyz)
implicit def enumWrites: Writes[Xyz] = EnumUtils.enumWrites
}
As an option, you can switch to jsoniter-scala.
It supports enums for Scala 2 and Scala 3 out of the box.
It also offers handy derivation of safe and efficient JSON codecs.
You just need to add the required libraries to your dependencies:
libraryDependencies ++= Seq(
// Use the %%% operator instead of %% for Scala.js and Scala Native
"com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-core" % "2.13.5",
// Use the "provided" scope instead when the "compile-internal" scope is not supported
"com.github.plokhotnyuk.jsoniter-scala" %% "jsoniter-scala-macros" % "2.13.5" % "compile-internal"
)
And then derive a codec and use it:
import com.github.plokhotnyuk.jsoniter_scala.core._
import com.github.plokhotnyuk.jsoniter_scala.macros._
implicit val codec: JsonValueCodec[Xyz.Xyz] = JsonCodecMaker.make
println(readFromString[Xyz.Xyz]("\"name\""))
BTW, you can run the full code on Scastie: https://scastie.scala-lang.org/Evj718q6TcCZow9lRhKaPw
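If staying with play-json is preferred, one part of the migration is independent of the JSON library: Scala 3 no longer accepts a type projection such as `E#Value` on an abstract type `E`, so helpers like `enumReads` are usually restated with a path-dependent type. Here is a minimal sketch of just that lookup, using only the standard library. `EnumLookupSketch`, `parseValue` and the `Either` error channel are illustrative choices, not part of play-json or of the original code.

```scala
object EnumLookupSketch {
  object Xyz extends Enumeration {
    type Xyz = Value
    val name, link, unknown = Value
  }

  // The result type depends on the enumeration instance (e.Value),
  // which avoids the E#Value projection used in the Scala 2 version.
  def parseValue(e: Enumeration)(s: String): Either[String, e.Value] =
    try Right(e.withName(s))
    catch {
      case _: NoSuchElementException =>
        Left(s"'$s' is not a value of ${e.getClass.getName}")
    }

  def main(args: Array[String]): Unit = {
    println(parseValue(Xyz)("link"))  // a Right holding Xyz.link
    println(parseValue(Xyz)("bogus")) // a Left with the error text
  }
}
```

In a play-json Reads, the Right and Left branches would map to JsSuccess and JsError, as in the original enumReads.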

Could not find implicit value of org.json4s.AsJsonInput in json4s 4.0.4

json4s version
In sbt:
"org.json4s" %% "json4s-jackson" % "4.0.4"
scala version
2.12.15
jdk version
JDK8
My problem
I am learning to use json4s to read a JSON file, "file.json" (the example is from the book "Scala Design Patterns"):
import org.json4s._
import org.json4s.jackson.JsonMethods._
trait DataReader {
def readData(): List[Person]
def readDataInefficiently(): List[Person]
}
class DataReaderImpl extends DataReader {
implicit val formats = DefaultFormats
private def readUntimed(): List[Person] =
parse(StreamInput(getClass.getResourceAsStream("file.json"))).extract[List[Person]]
override def readData(): List[Person] = readUntimed()
override def readDataInefficiently(): List[Person] = {
(1 to 10000).foreach(_ =>
readUntimed())
readUntimed()
}
}
object DataReaderExample {
def main(args: Array[String]): Unit = {
val dataReader = new DataReaderImpl
println(s"I just read the following data efficiently:${dataReader.readData()}")
println(s"I just read the following data inefficiently:${dataReader.readDataInefficiently()}")
}
}
It does not compile and reports:
could not find implicit value for evidence parameter of type org.json4s.AsJsonInput[org.json4s.StreamInput]
Error occurred in an application involving default arguments.
parse(StreamInput(getClass.getResourceAsStream("file.json"))).extract[List[Person]]
When I change the json4s version to 3.6.0-M2 in sbt:
"org.json4s" %% "json4s-jackson" % "3.6.0-M2"
It works well.
Why would this happen? How should I fix it in 4.0.4 or a higher version?
Thank you for your help.
I tried many ways to solve this problem, and finally found one: remove the StreamInput wrapper and pass the InputStream to parse directly (this needs import java.io.InputStream):
private def readUntimed(): List[Person] = {
val inputStream: InputStream = getClass.getResourceAsStream("file.json")
// parse(StreamInput(inputStream)).extract[List[Person]] // this works in 3.6.0-M2
parse(inputStream).extract[List[Person]]
}
Now it works!

get annotations from class in scala 3 macros

I am writing a macro to get annotations from a 'Class':
inline def getAnnotations(clazz: Class[?]): Seq[Any] = ${ getAnnotationsImpl('clazz) }
def getAnnotationsImpl(expr: Expr[Class[?]])(using Quotes): Expr[Seq[Any]] =
import quotes.reflect.*
val cls = expr.valueOrError // error: value value is not a member of quoted.Expr[Class[?]]
val tpe = TypeRepr.typeConstructorOf(cls)
val annotations = tpe.typeSymbol.annotations.map(_.asExpr)
Expr.ofSeq(annotations)
But I get an error when I try to get the class value from the expr parameter:
@main def test(): Unit =
val cls = getCls
val annotations = getAnnotations(cls)
def getCls: Class[?] = Class.forName("Foo")
Is it possible to get the annotations of a Class at compile time with this macro?
By the way, eval for Class[_] doesn't work even in Scala 2 macros: c.eval(c.Expr[Class[_]](clazz)) produces
java.lang.ClassCastException:
scala.reflect.internal.Types$ClassNoArgsTypeRef cannot be cast to java.lang.Class.
Class[_] is too much of a runtime thing. How could you extract its value from its tree (Expr is a wrapper over a tree)?
If you already have a Class[?] you should use Java reflection rather than Scala 3 macros (with Tasty reflection).
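A minimal sketch of the Java-reflection route, assuming the annotations of interest are retained at runtime. `Tagged` and `annotationNames` are hypothetical names, and `java.lang.Deprecated` is used only because it is a standard annotation with runtime retention. Annotations that exist only in Scala or TASTy metadata will not show up this way.

```scala
object RuntimeAnnotationsSketch {
  // java.lang.Deprecated (capital D), not Scala's @deprecated.
  @Deprecated
  class Tagged

  // Names of the runtime-visible annotations on a class, via Java reflection.
  def annotationNames(cls: Class[_]): Seq[String] =
    cls.getAnnotations.toSeq.map(_.annotationType.getName)

  def main(args: Array[String]): Unit = {
    val cls: Class[_] = classOf[Tagged]
    println(annotationNames(cls))
    println(cls.isAnnotationPresent(classOf[Deprecated]))
  }
}
```

The list may also contain compiler-generated annotations, so checking for a specific annotation type is usually more robust than comparing the whole list.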
Actually, you can try to evaluate a tree from its source code (hacking multi-staging programming and implementing our own eval instead of forbidden staging.run). It's a little similar to context.eval in Scala 2 macros (but we evaluate from a source code rather than from a tree).
import scala.quoted.*
object Macro {
inline def getAnnotations(clazz: Class[?]): Seq[Any] = ${getAnnotationsImpl('clazz)}
def getAnnotationsImpl(expr: Expr[Class[?]])(using Quotes): Expr[Seq[Any]] = {
import quotes.reflect.*
val str = expr.asTerm.pos.sourceCode.getOrElse(
report.errorAndAbort(s"No source code for ${expr.show}")
)
val cls = Eval[Class[?]](str)
val tpe = TypeRepr.typeConstructorOf(cls)
val annotations = tpe.typeSymbol.annotations.map(_.asExpr)
Expr.ofSeq(annotations)
}
}
import dotty.tools.dotc.core.Contexts.Context
import dotty.tools.dotc.{Driver, util}
import dotty.tools.io.{VirtualDirectory, VirtualFile}
import java.net.URLClassLoader
import java.nio.charset.StandardCharsets
import dotty.tools.repl.AbstractFileClassLoader
object Eval {
def apply[A](str: String): A = {
val content =
s"""
|package $$generated
|
|object $$Generated {
| def run = $str
|}""".stripMargin
val sourceFile = util.SourceFile(
VirtualFile(
name = "$Generated.scala",
content = content.getBytes(StandardCharsets.UTF_8)),
codec = scala.io.Codec.UTF8
)
val files = this.getClass.getClassLoader.asInstanceOf[URLClassLoader].getURLs
val depClassLoader = new URLClassLoader(files, null)
val classpathString = files.mkString(":")
val outputDir = VirtualDirectory("output")
class DriverImpl extends Driver {
private val compileCtx0 = initCtx.fresh
val compileCtx = compileCtx0.fresh
.setSetting(
compileCtx0.settings.classpath,
classpathString
).setSetting(
compileCtx0.settings.outputDir,
outputDir
)
val compiler = newCompiler(using compileCtx)
}
val driver = new DriverImpl
given Context = driver.compileCtx
val run = driver.compiler.newRun
run.compileSources(List(sourceFile))
val classLoader = AbstractFileClassLoader(outputDir, depClassLoader)
val clazz = Class.forName("$generated.$Generated$", true, classLoader)
val module = clazz.getField("MODULE$").get(null)
val method = module.getClass.getMethod("run")
method.invoke(module).asInstanceOf[A]
}
}
package mypackage
import scala.annotation.experimental
@experimental
class Foo
Macro.getAnnotations(Class.forName("mypackage.Foo"))
// new scala.annotation.internal.SourceFile("/path/to/src/main/scala/mypackage/Foo.scala"), new scala.annotation.experimental()
scalaVersion := "3.1.3"
libraryDependencies += scalaOrganization.value %% "scala3-compiler" % scalaVersion.value
How to compile and execute scala code at run-time in Scala3?
(compile time of the code expanding macros is the runtime of macros)
Actually, there is even a way to evaluate a tree itself (not its source code). Such functionality exists in the Scala 3 compiler but is deliberately blocked because of the phase consistency principle. So for this to work, the code expanding macros should be compiled with a patched compiler:
https://github.com/DmytroMitin/dotty-patched
scalaVersion := "3.2.1"
libraryDependencies += scalaOrganization.value %% "scala3-staging" % scalaVersion.value
// custom Scala settings
managedScalaInstance := false
ivyConfigurations += Configurations.ScalaTool
libraryDependencies ++= Seq(
scalaOrganization.value % "scala-library" % "2.13.10",
scalaOrganization.value %% "scala3-library" % "3.2.1",
"com.github.dmytromitin" %% "scala3-compiler-patched-assembly" % "3.2.1" % "scala-tool"
)
import scala.quoted.{Expr, Quotes, staging, quotes}
object Macro {
inline def getAnnotations(clazz: Class[?]): Seq[String] = ${impl('clazz)}
def impl(expr: Expr[Class[?]])(using Quotes): Expr[Seq[String]] = {
import quotes.reflect.*
given staging.Compiler = staging.Compiler.make(this.getClass.getClassLoader)
val tpe = staging.run[Any](expr).asInstanceOf[TypeRepr]
val annotations = Expr(tpe.typeSymbol.annotations.map(_.asExpr.show))
report.info(s"annotations=${annotations.show}")
annotations
}
}
Normally, for expr: Expr[A] staging.run(expr) returns a value of type A. But Class is specific. For expr: Expr[Class[_]] inside macros it returns a value of type dotty.tools.dotc.core.Types.CachedAppliedType <: TypeRepr. That's why I had to cast.
In Scala 2 this also would be c.eval(c.Expr[Any](/*c.untypecheck*/(clazz))).asInstanceOf[Type].typeSymbol.annotations because for Class[_] c.eval returns scala.reflect.internal.Types$ClassNoArgsTypeRef <: Type.
https://github.com/scala/bug/issues/12680

create an ambiguous low priority implicit

Consider the default codec as offered in the io package.
implicitly[io.Codec].name //res0: String = UTF-8
It's a "low priority" implicit so it's easy to override without ambiguity.
implicit val betterCodec: io.Codec = io.Codec("US-ASCII")
implicitly[io.Codec].name //res1: String = US-ASCII
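The override works because a local implicit is found in lexical scope before the companion object of the requested type is consulted. Below is a minimal sketch of the same arrangement with a hypothetical `Label` type. The names `Label`, `LowPriorityLabelImplicits` and `fallbackLabel` are made up for illustration and only mirror the shape of `fallbackSystemCodec` in `LowPriorityCodecImplicits`.

```scala
object ImplicitPrioritySketch {
  final case class Label(name: String)

  // The default lives in a trait mixed into the companion object.
  trait LowPriorityLabelImplicits {
    implicit lazy val fallbackLabel: Label = Label("fallback")
  }
  object Label extends LowPriorityLabelImplicits

  // No implicit Label in lexical scope: the companion's default is used.
  def fromCompanion: String = implicitly[Label].name

  // A local implicit wins without any ambiguity error.
  def fromLocal: String = {
    implicit val better: Label = Label("local")
    implicitly[Label].name
  }

  def main(args: Array[String]): Unit = {
    println(fromCompanion) // fallback
    println(fromLocal)     // local
  }
}
```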
It's also easy to raise its priority level.
import io.Codec.fallbackSystemCodec
implicit val betterCodec: io.Codec = io.Codec("US-ASCII")
implicitly[io.Codec].name //won't compile: ambiguous implicit values
But can we go in the opposite direction? Can we create a low level implicit that disables ("ambiguates"?) the default? I've been looking at the priority equation and playing around with low priority implicits but I've yet to create something ambiguous to the default.
If I understand correctly, you want to check at compile time that there is a local ("higher-priority") implicit io.Codec, and produce a compile error otherwise. This can be done with macros (using compiler internals).
import scala.language.experimental.macros
import scala.reflect.macros.{contexts, whitebox}
object Macros {
def localImplicitly[A]: A = macro impl[A]
def impl[A: c.WeakTypeTag](c: whitebox.Context): c.Tree = {
import c.universe._
val context = c.asInstanceOf[contexts.Context]
val global: context.universe.type = context.universe
val analyzer: global.analyzer.type = global.analyzer
val callsiteContext = context.callsiteTyper.context
val tpA = weakTypeOf[A]
val localImplicit = new analyzer.ImplicitSearch(
tree = EmptyTree.asInstanceOf[global.Tree],
pt = tpA.asInstanceOf[global.Type],
isView = false,
context0 = callsiteContext.makeImplicit(reportAmbiguousErrors = true),
pos0 = c.enclosingPosition.asInstanceOf[global.Position]
) {
override def searchImplicit(
implicitInfoss: List[List[analyzer.ImplicitInfo]],
isLocalToCallsite: Boolean
): analyzer.SearchResult = {
if (isLocalToCallsite)
super.searchImplicit(implicitInfoss, isLocalToCallsite)
else analyzer.SearchFailure
}
}.bestImplicit
if (localImplicit.isSuccess)
localImplicit.tree.asInstanceOf[c.Tree]
else c.abort(c.enclosingPosition, s"no local implicit $tpA")
}
}
localImplicitly[io.Codec].name // doesn't compile
// Error: no local implicit scala.io.Codec
implicit val betterCodec: io.Codec = io.Codec("US-ASCII")
localImplicitly[Codec].name // US-ASCII
import io.Codec.fallbackSystemCodec
localImplicitly[Codec].name // UTF-8
import io.Codec.fallbackSystemCodec
implicit val betterCodec: io.Codec = io.Codec("US-ASCII")
localImplicitly[Codec].name // doesn't compile
//Error: ambiguous implicit values:
// both value betterCodec in object App of type => scala.io.Codec
// and lazy value fallbackSystemCodec in trait LowPriorityCodecImplicits of type => //scala.io.Codec
// match expected type scala.io.Codec
Tested in 2.13.0.
libraryDependencies ++= Seq(
scalaOrganization.value % "scala-reflect" % scalaVersion.value,
scalaOrganization.value % "scala-compiler" % scalaVersion.value
)
Still working in Scala 2.13.10.
Scala 3 implementation
import scala.quoted.{Expr, Quotes, Type, quotes}
import dotty.tools.dotc.typer.{Implicits => dottyImplicits}
inline def localImplicitly[A]: A = ${impl[A]}
def impl[A: Type](using Quotes): Expr[A] = {
import quotes.reflect.*
given c: dotty.tools.dotc.core.Contexts.Context =
quotes.asInstanceOf[scala.quoted.runtime.impl.QuotesImpl].ctx
val typer = c.typer
val search = new typer.ImplicitSearch(
TypeRepr.of[A].asInstanceOf[dotty.tools.dotc.core.Types.Type],
dotty.tools.dotc.ast.tpd.EmptyTree,
Position.ofMacroExpansion.asInstanceOf[dotty.tools.dotc.util.SourcePosition].span
)
def eligible(contextual: Boolean): List[dottyImplicits.Candidate] =
if contextual then
if c.gadt.isNarrowing then
dotty.tools.dotc.core.Contexts.withoutMode(dotty.tools.dotc.core.Mode.ImplicitsEnabled) {
c.implicits.uncachedEligible(search.wildProto)
}
else c.implicits.eligible(search.wildProto)
else search.implicitScope(search.wildProto).eligible
val searchImplicitMethod = classOf[typer.ImplicitSearch]
.getDeclaredMethod("searchImplicit", classOf[List[dottyImplicits.Candidate]], classOf[Boolean])
searchImplicitMethod.setAccessible(true)
def implicitSearchResult(contextual: Boolean) =
searchImplicitMethod.invoke(search, eligible(contextual), contextual)
.asInstanceOf[dottyImplicits.SearchResult]
.tree.asInstanceOf[ImplicitSearchResult]
implicitSearchResult(true) match {
case success: ImplicitSearchSuccess => success.tree.asExprOf[A]
case failure: ImplicitSearchFailure =>
report.errorAndAbort(s"no local implicit ${Type.show[A]}: ${failure.explanation}")
}
}
Scala 3.2.0.
Sort of, yes.
You can do this by creating a 'newtype'. I.e. a type that is simply a proxy to io.Codec, and wraps the instance. This means that you also need to change all your implicit arguments from io.Codec to CodecWrapper, which may not be possible.
trait CodecWrapper {
def orphan: io.Codec
}
object CodecWrapper {
/* because it's in the companion, this will have the highest implicit resolution priority. */
implicit def defaultInstance: CodecWrapper =
new CodecWrapper {
def orphan = new io.Codec { /* your default implementation here */ }
}
}
import io.Codec.fallbackSystemCodec
implicitly[CodecWrapper].orphan // io.Codec we defined above - no ambiguity

About an error accessing a field inside Tuple2

I am trying to access a field within a Tuple2 and the compiler is returning an error. The software pushes a case class to a Kafka topic; I then want to recover it using Spark Streaming so that I can feed a machine learning algorithm and save the results in a Mongo instance.
Solved!
I finally solved my problem, so I am going to post the final solution:
This is the github project:
https://github.com/alonsoir/awesome-recommendation-engine/tree/develop
build.sbt
name := "my-recommendation-spark-engine"
version := "1.0-SNAPSHOT"
scalaVersion := "2.10.4"
val sparkVersion = "1.6.1"
val akkaVersion = "2.3.11" // override Akka to be this version to match the one in Spark
libraryDependencies ++= Seq(
"org.apache.kafka" % "kafka_2.10" % "0.8.1"
exclude("javax.jms", "jms")
exclude("com.sun.jdmk", "jmxtools")
exclude("com.sun.jmx", "jmxri"),
//not working play module!! check
//jdbc,
//anorm,
//cache,
// HTTP client
"net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
// HTML parser
"org.jodd" % "jodd-lagarto" % "3.5.2",
"com.typesafe" % "config" % "1.2.1",
"com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
"org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
"org.twitter4j" % "twitter4j-core" % "4.0.2",
"org.twitter4j" % "twitter4j-stream" % "4.0.2",
"org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
"org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1" ,
"org.apache.spark" % "spark-core_2.10" % "1.6.1" ,
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1",
"org.apache.spark" % "spark-sql_2.10" % "1.6.1",
"org.apache.spark" % "spark-mllib_2.10" % "1.6.1",
"com.google.code.gson" % "gson" % "2.6.2",
"commons-cli" % "commons-cli" % "1.3.1",
"com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
// Akka
"com.typesafe.akka" %% "akka-actor" % akkaVersion,
"com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
// MongoDB
"org.reactivemongo" %% "reactivemongo" % "0.10.0"
)
packAutoSettings
//play.Project.playScalaSettings
Kafka Producer
package example.producer
import play.api.libs.json._
import example.utils._
import scala.concurrent.Future
import example.model.{AmazonProductAndRating,AmazonProduct,AmazonRating}
import example.utils.AmazonPageParser
import scala.concurrent.ExecutionContext.Implicits.global
/**
args(0) : productId
args(1) : userId
args(2) : rating
Usage: ./amazon-producer-example 0981531679 someUserId 3.0
*/
object AmazonProducerExample {
def main(args: Array[String]): Unit = {
val productId = args(0).toString
val userId = args(1).toString
val rating = args(2).toDouble
val topicName = "amazonRatingsTopic"
val producer = Producer[String](topicName)
//0981531679 is Scala Puzzlers...
AmazonPageParser.parse(productId,userId,rating).onSuccess { case amazonRating =>
//Is this the correct way? the best performance? possibly not, what about using avro or parquet? How can i push data in avro or parquet format?
//You can see that i am pushing json String to kafka topic, not raw String, but is there any difference?
//of course there are differences...
producer.send(Json.toJson(amazonRating).toString)
//producer.send(amazonRating.toString)
println("amazon product with rating sent to kafka cluster..." + amazonRating.toString)
System.exit(0)
}
}
}
This is the definition of necessary case classes (UPDATED), the file is named models.scala:
package example.model
import play.api.libs.json.Json
import reactivemongo.bson.Macros
case class AmazonProduct(itemId: String, title: String, url: String, img: String, description: String)
case class AmazonRating(userId: String, productId: String, rating: Double)
case class AmazonProductAndRating(product: AmazonProduct, rating: AmazonRating)
// For MongoDB
object AmazonRating {
implicit val amazonRatingHandler = Macros.handler[AmazonRating]
implicit val amazonRatingFormat = Json.format[AmazonRating]
//added using @Yuval tip
lazy val empty: AmazonRating = AmazonRating("-1", "-1", -1d)
}
This is the full code of the spark streaming process:
package example.spark
import java.io.File
import java.util.Date
import play.api.libs.json._
import com.google.gson.{Gson,GsonBuilder, JsonParser}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import com.mongodb.casbah.Imports._
import com.mongodb.QueryBuilder
import com.mongodb.casbah.MongoClient
import com.mongodb.casbah.commons.{MongoDBList, MongoDBObject}
import reactivemongo.api.MongoDriver
import reactivemongo.api.collections.default.BSONCollection
import reactivemongo.bson.BSONDocument
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
import example.model._
import example.utils.Recommender
/**
* Collect at least the specified number of json amazon products in order to feed the recommendation system and feed the mongo instance with results.
Usage: ./amazon-kafka-connector 127.0.0.1:9092 amazonRatingsTopic
on mongo shell:
use alonsodb;
db.amazonRatings.find();
*/
object AmazonKafkaConnector {
private var numAmazonProductCollected = 0L
private var partNum = 0
private val numAmazonProductToCollect = 10000000
//this settings must be in reference.conf
private val Database = "alonsodb"
private val ratingCollection = "amazonRatings"
private val MongoHost = "127.0.0.1"
private val MongoPort = 27017
private val MongoProvider = "com.stratio.datasource.mongodb"
private val jsonParser = new JsonParser()
private val gson = new GsonBuilder().setPrettyPrinting().create()
private def prepareMongoEnvironment(): MongoClient = {
val mongoClient = MongoClient(MongoHost, MongoPort)
mongoClient
}
private def closeMongoEnviroment(mongoClient : MongoClient) = {
mongoClient.close()
println("mongoclient closed!")
}
private def cleanMongoEnvironment(mongoClient: MongoClient) = {
cleanMongoData(mongoClient)
mongoClient.close()
}
private def cleanMongoData(client: MongoClient): Unit = {
val collection = client(Database)(ratingCollection)
collection.dropCollection()
}
def main(args: Array[String]) {
// Process program arguments and set properties
if (args.length < 2) {
System.err.println("Usage: " + this.getClass.getSimpleName + " <brokers> <topics>")
System.exit(1)
}
val Array(brokers, topics) = args
println("Initializing Streaming Spark Context and kafka connector...")
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
.setMaster("local[4]")
.set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
sc.addJar("target/scala-2.10/blog-spark-recommendation_2.10-1.0-SNAPSHOT.jar")
val ssc = new StreamingContext(sparkConf, Seconds(2))
//this checkpointdir should be in a conf file, for now it is hardcoded!
val streamingCheckpointDir = "/Users/aironman/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
println("Initialized Streaming Spark Context and kafka connector...")
//create recomendation module
println("Creating rating recommender module...")
val ratingFile= "ratings.csv"
val recommender = new Recommender(sc,ratingFile)
println("Initialized rating recommender module...")
//THIS IS THE MOST INTERESTING PART AND WHAT I NEED!
//THE SOLUTION IS NOT PROBABLY THE MOST EFFICIENT, BECAUSE I HAD TO
//USE DATAFRAMES, ARRAYs and SEQs BUT IS FUNCTIONAL!
try{
messages.foreachRDD(rdd => {
val count = rdd.count()
if (count > 0){
val json= rdd.map(_._2)
val dataFrame = sqlContext.read.json(json) //converts json to DF
val myRow = dataFrame.select(dataFrame("userId"),dataFrame("productId"),dataFrame("rating")).take(count.toInt)
println("myRow is: " + myRow)
val myAmazonRating = AmazonRating(myRow(0).getString(0), myRow(0).getString(1), myRow(0).getDouble(2))
println("myAmazonRating is: " + myAmazonRating.toString)
val arrayAmazonRating = Array(myAmazonRating)
//this method needs Seq[AmazonRating]
recommender.predictWithALS(arrayAmazonRating.toSeq)
}//if
})
}catch{
case e: IllegalArgumentException => {println("illegal arg. exception")};
case e: IllegalStateException => {println("illegal state exception")};
case e: ClassCastException => {println("ClassCastException")};
case e: Exception => {println(" Generic Exception")};
}finally{
println("Finished taking data from kafka topic...")
}
ssc.start()
ssc.awaitTermination()
println("Finished!")
}
}
Thank you all, folks: @Yuval, @Emecas and @Riccardo.cardin.
The Recommender.predict method looks like:
def predict(ratings: Seq[AmazonRating]) = {
// train model
val myRatings = ratings.map(toSparkRating)
val myRatingRDD = sc.parallelize(myRatings)
val startAls = DateTime.now
val model = ALS.train((sparkRatings ++ myRatingRDD).repartition(NumPartitions), 10, 20, 0.01)
val myProducts = myRatings.map(_.product).toSet
val candidates = sc.parallelize((0 until productDict.size).filterNot(myProducts.contains))
// get ratings of all products not in my history ordered by rating (higher first) and only keep the first NumRecommendations
val myUserId = userDict.getIndex(MyUsername)
val recommendations = model.predict(candidates.map((myUserId, _))).collect
val endAls = DateTime.now
val result = recommendations.sortBy(-_.rating).take(NumRecommendations).map(toAmazonRating)
val alsTime = Seconds.secondsBetween(startAls, endAls).getSeconds
println(s"ALS Time: $alsTime seconds")
result
}
//I think I've been as clear as possible, tell me if you need anything more and thanks for your patience teaching me @Yuval
Diagnosis
IllegalStateException suggests that you are operating on a StreamingContext that is already ACTIVE or STOPPED. See details here (lines 218-231):
java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported
Code Review
Looking at your AmazonKafkaConnector code, you are doing map, filter and foreachRDD inside another foreachRDD over the same DirectStream object, messages.
General Advice:
Be functional, my friend: divide your logic into small pieces, one for each of the tasks you want to perform:
Streaming
ML Recommendation
Persistence
etc.
That will make the Spark pipeline you want to implement easier to understand and debug.
The problem is that the statement rdd.take(count.toInt) returns an Array[T], as stated here:
def take(num: Int): Array[T]
Take the first num elements of the RDD.
You are telling your RDD to take its first n elements. So, contrary to what you expected, you do not have an object of type Tuple2 but an array.
If you want to print each element of the array, you can use the method mkString defined on the Array type to obtain a single String with all the elements of the array.
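As a small illustration of that suggestion, here is a sketch with plain Scala collections and no Spark. `Rating` and `render` are hypothetical stand-ins for the rows returned by take(n). Concatenating an Array directly onto a String prints only the JVM's array identity, whereas mkString renders the elements.

```scala
object MkStringSketch {
  // Hypothetical stand-in for the elements returned by take(n).
  final case class Rating(userId: String, productId: String, rating: Double)

  // Joins the elements' toString values with a prefix, separator and suffix.
  def render(rows: Array[Rating]): String = rows.mkString("[", ", ", "]")

  def main(args: Array[String]): Unit = {
    val rows = Array(Rating("u1", "p1", 3.0), Rating("u2", "p2", 4.5))
    println(render(rows)) // [Rating(u1,p1,3.0), Rating(u2,p2,4.5)]
  }
}
```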
It looks like what you're trying to do is simply a map over a DStream. A map operation is a projection from type A to type B, where A is a String (that you're receiving from Kafka), and B is your case class AmazonRating.
Let's add an empty value to your AmazonRating:
case class AmazonRating(userId: String, productId: String, rating: Double)
object AmazonRating {
lazy val empty: AmazonRating = AmazonRating("-1", "-1", -1d)
}
Now let's parse the JSONs:
val messages = KafkaUtils
.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
messages
.map { case (_, jsonRating) =>
// jsonRating is the message value received from Kafka
val format = Json.format[AmazonRating]
val jsValue = Json.parse(jsonRating)
format.reads(jsValue) match {
case JsSuccess(rating, _) => rating
case JsError(_) => AmazonRating.empty
}
}
.filter(_ != AmazonRating.empty)
.foreachRDD(_.foreachPartition(it => recommender.predict(it.toSeq)))