Spark Scala - compile errors

I have a script in Scala. When I run it in Zeppelin it works well, but when I try to compile it with sbt it doesn't work. I believe it is something related to the versions, but I haven't been able to identify it.
These three approaches return the same error:
val catMap = catDF.rdd.map((row: Row) => (row.getAs[String](1)->row.getAs[Integer](0))).collect.toMap
val catMap = catDF.select($"description", $"id".cast("int")).as[(String, Int)].collect.toMap
val catMap = catDF.rdd.map((row: Row) => (row.getAs[String](1)->row.getAs[Integer](0))).collectAsMap()
All of them return the error: "value rdd is not a member of Unit"
val bizCat = bizCatRDD.rdd.map(t => (t.getAs[String](0),catMap(t.getAs[String](1)))).toDF
And this one returns the error: "value toDF is not a member of org.apache.spark.rdd.RDD[U]"
Scala version: 2.12
Sbt Version: 1.3.13
UPDATE:
The whole class is:
package importer
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import udf.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
object BusinessImporter extends Importer {
  def importa(spark: SparkSession, inputDir: String): Unit = {
    import spark.implicits._
    val bizDF = spark.read.json(inputDir).cache

    // categories
    val explode_categories = bizDF.withColumn("categories", explode(split(col("categories"), ",")))
    val sort_categories = explode_categories.select(col("categories").as("description"))
      .distinct
      .coalesce(1)
      .orderBy(asc("categories"))

    // Create sequence column
    val windowSpec = Window.orderBy("description")
    val categories_with_sequence = sort_categories.withColumn("id", row_number.over(windowSpec))
    val categories = categories_with_sequence.select("id", "description")
    val catDF = categories.write.insertInto("categories")

    // business categories
    //val catMap = catDF.rdd.map((row: Row) => (row.getAs[String](1)->row.getAs[Integer](0))).collect.toMap
    //val catMap = catDF.select($"description", $"id".cast("int")).as[(String, Int)].collect.toMap
    val catMap = catDF.rdd.map((row: Row) => (row.getAs[String](1)->row.getAs[Integer](0))).collectAsMap()
    val auxbizCatRDD = bizDF.withColumn("categories", explode(split(col("categories"), ",")))
    val bizCatRDD = auxbizCatRDD.select("business_id", "categories")
    val bizCat = bizCatRDD.rdd.map(t => (t.getAs[String](0), catMap(t.getAs[String](1)))).toDF
    bizCat.write.insertInto("business_category")

    // Business
    val businessDF = bizDF.select("business_id", "categories", "city", "address", "latitude", "longitude", "name", "is_open", "review_count", "stars", "state")
    businessDF.coalesce(1).write.insertInto("business")

    // Hours
    val bizHoursDF = bizDF.select("business_id", "hours.Sunday", "hours.Monday", "hours.Tuesday", "hours.Wednesday", "hours.Thursday", "hours.Friday", "hours.Saturday")
    val bizHoursDF_structs = bizHoursDF
      .withColumn("Sunday", struct(
        split(col("Sunday"), "-").getItem(0).as("Open"),
        split(col("Sunday"), "-").getItem(1).as("Close")))
      .withColumn("Monday", struct(
        split(col("Monday"), "-").getItem(0).as("Open"),
        split(col("Monday"), "-").getItem(1).as("Close")))
      .withColumn("Tuesday", struct(
        split(col("Tuesday"), "-").getItem(0).as("Open"),
        split(col("Tuesday"), "-").getItem(1).as("Close")))
      .withColumn("Wednesday", struct(
        split(col("Wednesday"), "-").getItem(0).as("Open"),
        split(col("Wednesday"), "-").getItem(1).as("Close")))
      .withColumn("Thursday", struct(
        split(col("Thursday"), "-").getItem(0).as("Open"),
        split(col("Thursday"), "-").getItem(1).as("Close")))
      .withColumn("Friday", struct(
        split(col("Friday"), "-").getItem(0).as("Open"),
        split(col("Friday"), "-").getItem(1).as("Close")))
      .withColumn("Saturday", struct(
        split(col("Saturday"), "-").getItem(0).as("Open"),
        split(col("Saturday"), "-").getItem(1).as("Close")))
    bizHoursDF_structs.coalesce(1).write.insertInto("business_hour")
  }

  def singleSpace(col: Column): Column = {
    trim(regexp_replace(col, " +", " "))
  }
}
sbt file:
name := "yelp-spark-processor"
version := "1.0"
scalaVersion := "2.12.12"
libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "3.0.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "3.0.1"
libraryDependencies += "org.apache.spark" % "spark-hive_2.12" % "3.0.1"
Can someone please give me some guidance about what is wrong?
Many thanks,
Xavy

The issue here is that in Scala this line returns Unit:
val catDF = categories.write.insertInto("categories")
Unit in Scala is like void in Java: it is returned by functions that don't return anything meaningful. So at this point catDF is not a DataFrame and you can't treat it as one. You probably want to keep using categories instead of catDF in the lines that follow.
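For example, a minimal sketch of that change inside importa (reusing the names from the question; spark.implicits._ is already imported there and bizCatRDD is built as before):
// insertInto returns Unit, so don't capture its result
categories.write.insertInto("categories")
// build the lookup map from the DataFrame that is still in scope
val catMap = categories.rdd
  .map((row: Row) => row.getAs[String]("description") -> row.getAs[Int]("id"))
  .collectAsMap()
// the later lines then work unchanged
val bizCat = bizCatRDD.rdd
  .map(t => (t.getAs[String](0), catMap(t.getAs[String](1))))
  .toDF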

Related

Spark: DF.as[Type] fails to compile

I'm trying to run an example from the Spark book Spark: The Definitive Guide
build.sbt
ThisBuild / scalaVersion := "3.2.1"
libraryDependencies ++= Seq(
("org.apache.spark" %% "spark-sql" % "3.2.0" % "provided").cross(CrossVersion.for3Use2_13)
)
Compile / run := Defaults.runTask(Compile / fullClasspath, Compile / run / mainClass, Compile / run / runner).evaluated
lazy val root = (project in file("."))
.settings(
name := "scalalearn"
)
main.scala
// imports
...
object spark1 {
  @main
  def main(args: String*): Unit = {
    ...
    case class Flight(
      DEST_COUNTRY_NAME: String,
      ORIGIN_COUNTRY_NAME: String,
      count: BigInt
    )

    val flightsDF = spark.read
      .parquet(s"$dataRootPath/data/flight-data/parquet/2010-summary.parquet/")

    // import spark.implicits._ // FAILS
    // import spark.sqlContext.implicits._ // FAILS

    val flights = flightsDF.as[Flight]

    // in Scala
    flights
      .filter(flight_row => flight_row.ORIGIN_COUNTRY_NAME != "Canada")
      .map(flight_row => flight_row)
      .take(5)

    spark.stop()
  }
}
Getting an error with the line val flights = flightsDF.as[Flight]
Unable to find encoder for type Flight. An implicit Encoder[Flight] is needed to store Flight
instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._ Support for serializing other types will be added in
future releases.
Any help is appreciated.
Scala - 3.2.1
Spark - 3.2.0
I tried importing implicits from spark.implicits._ and spark.sqlContext.implicits._.
The example works on Scala 2.x.
I'm looking for a way to convert the DF to a case class without any third-party workarounds.
You need to add a Scala 3 dependency for the Spark codecs:
https://github.com/vincenzobaz/spark-scala3
libraryDependencies += "io.github.vincenzobaz" %% "spark-scala3" % "0.1.3"
and use the Scala 3 import
import scala3encoders.given
instead of the Scala 2 imports
import spark.implicits._ // FAILS
import spark.sqlContext.implicits._ // FAILS
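Put together, a minimal sketch of a working setup (the SparkSession settings and the parquet path are illustrative; count is declared as Long here to sidestep the BigInt encoder question discussed below):
import org.apache.spark.sql.SparkSession
import scala3encoders.given // derives the Encoder for the case class under Scala 3

// top-level case class; count as Long avoids the BigInt codecs covered below
case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: Long)

@main def run(): Unit =
  val spark = SparkSession.builder().master("local[*]").appName("scalalearn").getOrCreate()
  val flights = spark.read
    .parquet("data/flight-data/parquet/2010-summary.parquet/")
    .as[Flight]
  flights.filter(_.ORIGIN_COUNTRY_NAME != "Canada").take(5).foreach(println)
  spark.stop()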
Scala Spark Encoders.product[X] (where X is a case class) keeps giving me "No TypeTag available for X" error
Regarding BigInt,
Does Spark support BigInteger type?
Spark does support Java BigIntegers but possibly with some loss of precision. If the numerical value of the BigInteger fits in a long (i.e. between -2^63 and 2^63-1) then it will be stored by Spark as a LongType. Otherwise it will be stored as a DecimalType, but this type only supports 38 digits of precision.
Correct codecs for comparatively small BigInts (fitting into LongType) are
import scala3encoders.derivation.{Deserializer, Serializer}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{DataType, LongType, ObjectType}
given Deserializer[BigInt] with
  def inputType: DataType = LongType
  def deserialize(path: Expression): Expression =
    StaticInvoke(
      BigInt.getClass,
      ObjectType(classOf[BigInt]),
      "apply",
      path :: Nil,
      returnNullable = false
    )

given Serializer[BigInt] with
  def inputType: DataType = ObjectType(classOf[BigInt])
  def serialize(inputObject: Expression): Expression =
    Invoke(inputObject, "longValue", LongType, returnNullable = false)
import scala3encoders.given
https://github.com/DmytroMitin/spark_stackoverflow/blob/87ef5361dd3553f8cc5ced26fed4c17c0061d6a2/src/main/scala/main.scala
(https://github.com/databricks/Spark-The-Definitive-Guide)
https://github.com/yashwanthreddyg/spark_stackoverflow/pull/1
https://gist.github.com/DmytroMitin/3c0fe6983a254b350ff9feedbb066bef
https://github.com/vincenzobaz/spark-scala3/pull/22
For large BigInts (not fitting into LongType when DecimalType is necessary) the codecs are
import scala3encoders.derivation.{Deserializer, Serializer}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.expressions.objects.{Invoke, StaticInvoke}
import org.apache.spark.sql.types.{DataType, DataTypes, Decimal, ObjectType}
val decimalType = DataTypes.createDecimalType(38, 0)
given Deserializer[BigInt] with
  def inputType: DataType = decimalType
  def deserialize(path: Expression): Expression =
    Invoke(path, "toScalaBigInt", ObjectType(classOf[scala.math.BigInt]), returnNullable = false)

given Serializer[BigInt] with
  def inputType: DataType = ObjectType(classOf[BigInt])
  def serialize(inputObject: Expression): Expression =
    StaticInvoke(
      Decimal.getClass,
      decimalType,
      "apply",
      inputObject :: Nil,
      returnNullable = false
    )
import scala3encoders.given
which is almost the same as
import org.apache.spark.sql.catalyst.DeserializerBuildHelper.createDeserializerForScalaBigInt
import org.apache.spark.sql.catalyst.SerializerBuildHelper.createSerializerForScalaBigInt
import scala3encoders.derivation.{Deserializer, Serializer}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.{DataType, DataTypes, ObjectType}
val decimalType = DataTypes.createDecimalType(38, 0)
given Deserializer[BigInt] with
  def inputType: DataType = decimalType
  def deserialize(path: Expression): Expression =
    createDeserializerForScalaBigInt(path)

given Serializer[BigInt] with
  def inputType: DataType = ObjectType(classOf[BigInt])
  def serialize(inputObject: Expression): Expression =
    createSerializerForScalaBigInt(inputObject)
import scala3encoders.given
https://gist.github.com/DmytroMitin/8124d2a4cd25c8488c00c5a32f244f64
The runtime exception you observed means that the BigInts from the parquet file are comparatively small (they fit into LongType) while you tried my codecs for large BigInts (DecimalType):
https://gist.github.com/DmytroMitin/ad77677072c1d8d5538c94cb428c8fa4 (ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java': A method named "toScalaBigInt" is not declared in any enclosing class nor any supertype, nor through a static import)
Vice versa, for large BigInts (DecimalType) and codecs for small BigInts (LongType): https://gist.github.com/DmytroMitin/3a3a61082fbfc12447f6e926fc45c7cd (ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java': No applicable constructor/method found for actual parameters "org.apache.spark.sql.types.Decimal"; candidates are: ...)
We can't use both codecs for LongType and DecimalType: https://gist.github.com/DmytroMitin/32040a6b702fff5c53c727616b318cb5 (Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: All input types must be the same except nullable, containsNull, valueContainsNull flags. The input types found are LongType DecimalType(38,0))
For a mixture of small and large BigInts, the correct choice is the codecs for DecimalType: https://gist.github.com/DmytroMitin/626e09a63a387e6ff1d7fe264fc14d6b
The approach with manually created TypeTags seems to work too (not using scala3encoders)
// libraryDependencies += scalaOrganization.value % "scala-reflect" % "2.13.10" // in Scala 3
import scala.reflect.api
import scala.reflect.runtime.universe.{NoType, Type, TypeTag, internal}
import scala.reflect.runtime.universe
inline def createTypeTag[T](mirror: api.Mirror[_ <: api.Universe with Singleton], tpe: mirror.universe.Type): mirror.universe.TypeTag[T] = {
  mirror.universe.TypeTag.apply[T](mirror.asInstanceOf[api.Mirror[mirror.universe.type]],
    new api.TypeCreator {
      override def apply[U <: api.Universe with Singleton](m: api.Mirror[U]): m.universe.Type = {
        tpe.asInstanceOf[m.universe.Type]
      }
    }
  )
}
val rm = universe.runtimeMirror(this.getClass.getClassLoader)
// val bigIntTpe = internal.typeRef(internal.typeRef(NoType, rm.staticPackage("scala.math"), Nil), rm.staticClass("scala.math.BigInt"), Nil)
// val strTpe = internal.typeRef(internal.typeRef(NoType, rm.staticPackage("java.lang"), Nil), rm.staticClass("java.lang.String"), Nil)
val flightTpe = internal.typeRef(NoType, rm.staticClass("Flight"), Nil)
// given TypeTag[BigInt] = createTypeTag[BigInt](rm, bigIntTpe)
// given TypeTag[String] = createTypeTag[String](rm, strTpe)
given TypeTag[Flight] = createTypeTag[Flight](rm, flightTpe)
import spark.implicits._
https://gist.github.com/DmytroMitin/bb0ccd5f1c533b2baec1756da52f8824

get annotations from class in scala 3 macros

I am writing a macro to get the annotations from a Class:
inline def getAnnotations(clazz: Class[?]): Seq[Any] = ${ getAnnotationsImpl('clazz) }

def getAnnotationsImpl(expr: Expr[Class[?]])(using Quotes): Expr[Seq[Any]] =
  import quotes.reflect.*
  val cls = expr.valueOrError // error: value value is not a member of quoted.Expr[Class[?]]
  val tpe = TypeRepr.typeConstructorOf(cls)
  val annotations = tpe.typeSymbol.annotations.map(_.asExpr)
  Expr.ofSeq(annotations)
but I get an error when I try to get the class value from the expr parameter:
@main def test(): Unit =
  val cls = getCls
  val annotations = getAnnotations(cls)

def getCls: Class[?] = Class.forName("Foo")
Is it possible to get the annotations of a Class at compile time with this macro?
By the way, eval for Class[_] doesn't work even in Scala 2 macros: c.eval(c.Expr[Class[_]](clazz)) produces
java.lang.ClassCastException:
scala.reflect.internal.Types$ClassNoArgsTypeRef cannot be cast to java.lang.Class.
Class[_] is too much of a runtime thing. How could you extract its value from its tree (an Expr is a wrapper over a tree)?
If you already have a Class[?] you should use Java reflection rather than Scala 3 macros (with Tasty reflection).
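For completeness, a runtime sketch with plain Java reflection (the helper name is mine); note that only annotations retained at runtime as Java annotations are visible this way, so Scala-only annotations will not show up:
def runtimeAnnotations(clazz: Class[?]): Seq[Any] =
  clazz.getAnnotations.toSeq // java.lang.annotation.Annotation instances with RUNTIME retention

@main def demo(): Unit =
  println(runtimeAnnotations(Class.forName("Foo")))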
Actually, you can try to evaluate a tree from its source code (hacking multi-staging programming and implementing our own eval instead of the forbidden staging.run). It's a little similar to context.eval in Scala 2 macros (but we evaluate from source code rather than from a tree).
import scala.quoted.*
object Macro {
inline def getAnnotations(clazz: Class[?]): Seq[Any] = ${getAnnotationsImpl('clazz)}
def getAnnotationsImpl(expr: Expr[Class[?]])(using Quotes): Expr[Seq[Any]] = {
import quotes.reflect.*
val str = expr.asTerm.pos.sourceCode.getOrElse(
report.errorAndAbort(s"No source code for ${expr.show}")
)
val cls = Eval[Class[?]](str)
val tpe = TypeRepr.typeConstructorOf(cls)
val annotations = tpe.typeSymbol.annotations.map(_.asExpr)
Expr.ofSeq(annotations)
}
}
import dotty.tools.dotc.core.Contexts.Context
import dotty.tools.dotc.{Driver, util}
import dotty.tools.io.{VirtualDirectory, VirtualFile}
import java.net.URLClassLoader
import java.nio.charset.StandardCharsets
import dotty.tools.repl.AbstractFileClassLoader
object Eval {
def apply[A](str: String): A = {
val content =
s"""
|package $$generated
|
|object $$Generated {
| def run = $str
|}""".stripMargin
val sourceFile = util.SourceFile(
VirtualFile(
name = "$Generated.scala",
content = content.getBytes(StandardCharsets.UTF_8)),
codec = scala.io.Codec.UTF8
)
val files = this.getClass.getClassLoader.asInstanceOf[URLClassLoader].getURLs
val depClassLoader = new URLClassLoader(files, null)
val classpathString = files.mkString(":")
val outputDir = VirtualDirectory("output")
class DriverImpl extends Driver {
private val compileCtx0 = initCtx.fresh
val compileCtx = compileCtx0.fresh
.setSetting(
compileCtx0.settings.classpath,
classpathString
).setSetting(
compileCtx0.settings.outputDir,
outputDir
)
val compiler = newCompiler(using compileCtx)
}
val driver = new DriverImpl
given Context = driver.compileCtx
val run = driver.compiler.newRun
run.compileSources(List(sourceFile))
val classLoader = AbstractFileClassLoader(outputDir, depClassLoader)
val clazz = Class.forName("$generated.$Generated$", true, classLoader)
val module = clazz.getField("MODULE$").get(null)
val method = module.getClass.getMethod("run")
method.invoke(module).asInstanceOf[A]
}
}
package mypackage
import scala.annotation.experimental
@experimental
class Foo

Macro.getAnnotations(Class.forName("mypackage.Foo"))
// new scala.annotation.internal.SourceFile("/path/to/src/main/scala/mypackage/Foo.scala"), new scala.annotation.experimental()
scalaVersion := "3.1.3"
libraryDependencies += scalaOrganization.value %% "scala3-compiler" % scalaVersion.value
How to compile and execute scala code at run-time in Scala3?
(the compile time of the code expanding the macros is the runtime of the macros)
Actually, there is even a way to evaluate a tree itself (not its source code). Such functionality exists in the Scala 3 compiler but is deliberately blocked because of the phase consistency principle. So for this to work, the code expanding the macros should be compiled with a patched compiler:
https://github.com/DmytroMitin/dotty-patched
scalaVersion := "3.2.1"
libraryDependencies += scalaOrganization.value %% "scala3-staging" % scalaVersion.value
// custom Scala settings
managedScalaInstance := false
ivyConfigurations += Configurations.ScalaTool
libraryDependencies ++= Seq(
scalaOrganization.value % "scala-library" % "2.13.10",
scalaOrganization.value %% "scala3-library" % "3.2.1",
"com.github.dmytromitin" %% "scala3-compiler-patched-assembly" % "3.2.1" % "scala-tool"
)
import scala.quoted.{Expr, Quotes, staging, quotes}
object Macro {
inline def getAnnotations(clazz: Class[?]): Seq[String] = ${impl('clazz)}
def impl(expr: Expr[Class[?]])(using Quotes): Expr[Seq[String]] = {
import quotes.reflect.*
given staging.Compiler = staging.Compiler.make(this.getClass.getClassLoader)
val tpe = staging.run[Any](expr).asInstanceOf[TypeRepr]
val annotations = Expr(tpe.typeSymbol.annotations.map(_.asExpr.show))
report.info(s"annotations=${annotations.show}")
annotations
}
}
Normally, for expr: Expr[A] staging.run(expr) returns a value of type A. But Class is specific. For expr: Expr[Class[_]] inside macros it returns a value of type dotty.tools.dotc.core.Types.CachedAppliedType <: TypeRepr. That's why I had to cast.
In Scala 2 this would also be c.eval(c.Expr[Any](/*c.untypecheck*/(clazz))).asInstanceOf[Type].typeSymbol.annotations, because for Class[_] c.eval returns scala.reflect.internal.Types$ClassNoArgsTypeRef <: Type.
https://github.com/scala/bug/issues/12680

Why is there an error when I use org.json4s.jackson.Serialization.write from Spark?

This is my first time using Scala and Spark. I use this code to extract data, and the extraction succeeds. If I use println(output1) instead of println(write(output1)), the code runs well and prints the correct output (without keys). But when I use println(write(output1)), I get an error. I use Scala 2.12.10 and Java 1.8. My code is as follows:
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.write
import org.apache.spark.SparkContext
import scala.collection.mutable.Map
//import java.io.Writer
object task1 {
def main(args : Array[String]): Unit = {
// Logger.getRootLogger.setLevel(Level.INFO)
val sc = new SparkContext("local[*]", "task1")
val reviewRDD = sc.textFile("review.json")
val n_review = reviewRDD.count().toInt
val n_review_2018 = reviewRDD.filter(x => x contains ",\"date\":\"2018-").count().toInt
val n_user = reviewRDD.map(x => x.split(",")(1).split(":")(1)).distinct().count().toInt
val top10_user = reviewRDD.map(x => (x.split(",")(1).split(":")(1).split("\"")(1), 1)).reduceByKey((x, y) => x + y).sortBy(r => (-r._2, r._1)).take(10)
val n_business = reviewRDD.map(x => x.split(",")(2).split(":")(1)).distinct().count().toInt
val top10_business = reviewRDD.map(x => (x.split(",")(2).split(":")(1).split("\"")(1), 1)).reduceByKey((x, y) => x + y).sortBy(r => (-r._2, r._1)).take(10)
implicit val formats = DefaultFormats
case class Output1(
n_review: Int,
n_review_2018: Int,
n_user: Int,
top10_user: List[(String, Int)],
n_business: Int,
top10_business: List[(String, Int)]
)
val output1 = Output1(
n_review = n_review,
n_review_2018 = n_review_2018,
n_user = n_user,
top10_user = top10_user.toList,
n_business = n_business,
top10_business = top10_business.toList
)
//val output11: String = write(output1)
println(write(output1))
//println(output11)
}
}
The error is as follows:
Exception in thread "main" org.json4s.package$MappingException: Can't find ScalaSig for class task1$Output1$1
at org.json4s.reflect.package$.fail(package.scala:95)
at org.json4s.reflect.ScalaSigReader$.$anonfun$findClass$1(ScalaSigReader.scala:63)
at scala.Option.getOrElse(Option.scala:189)
at org.json4s.reflect.ScalaSigReader$.findClass(ScalaSigReader.scala:63)
at org.json4s.reflect.ScalaSigReader$.readConstructor(ScalaSigReader.scala:31)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.ctorParamType(Reflector.scala:119)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.$anonfun$ctorParamType$3(Reflector.scala:109)
at scala.collection.immutable.List.map(List.scala:297)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.ctorParamType(Reflector.scala:106)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.$anonfun$ctorParamType$3(Reflector.scala:109)
at scala.collection.immutable.List.map(List.scala:293)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.ctorParamType(Reflector.scala:106)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.$anonfun$createConstructorDescriptors$7(Reflector.scala:194)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.$anonfun$createConstructorDescriptors$3(Reflector.scala:177)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:75)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.createConstructorDescriptors(Reflector.scala:157)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.constructorsAndCompanion(Reflector.scala:136)
at org.json4s.reflect.Reflector$ClassDescriptorBuilder.result(Reflector.scala:220)
at org.json4s.reflect.Reflector$.createDescriptorWithFormats(Reflector.scala:61)
at org.json4s.reflect.Reflector$.$anonfun$describeWithFormats$1(Reflector.scala:52)
at org.json4s.reflect.package$Memo.apply(package.scala:36)
at org.json4s.reflect.Reflector$.describeWithFormats(Reflector.scala:52)
at org.json4s.Extraction$.decomposeObject$1(Extraction.scala:126)
at org.json4s.Extraction$.internalDecomposeWithBuilder(Extraction.scala:241)
at org.json4s.Extraction$.decomposeWithBuilder(Extraction.scala:70)
at org.json4s.Extraction$.decompose(Extraction.scala:255)
at org.json4s.jackson.Serialization$.write(Serialization.scala:22)
at task1$.main(task1.scala:51)
at task1.main(task1.scala)
Here is my build.sbt
name := "hw1"
ThisBuild / version := "0.1"
ThisBuild / scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.2"
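(A common cause of this particular MappingException is the case class being declared inside main: json4s reads the ScalaSig of top-level classes, so locally defined case classes such as task1$Output1$1 can't be reflected on. A sketch of the usual restructuring, untested against this exact setup:)
// Output1 moved to the top level so json4s can find its ScalaSig
case class Output1(
  n_review: Int,
  n_review_2018: Int,
  n_user: Int,
  top10_user: List[(String, Int)],
  n_business: Int,
  top10_business: List[(String, Int)]
)

object task1 {
  def main(args: Array[String]): Unit = {
    implicit val formats: DefaultFormats.type = DefaultFormats
    // ... compute the counts and build output1 exactly as before ...
    // println(write(output1))
  }
}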

About an error accessing a field inside Tuple2

I am trying to access a field within a Tuple2 and the compiler returns an error. The software pushes a case class to a Kafka topic; I then want to recover it using Spark Streaming so I can feed a machine learning algorithm and save the results to a Mongo instance.
Solved!
I finally solved my problem; I am going to post the final solution.
This is the GitHub project:
https://github.com/alonsoir/awesome-recommendation-engine/tree/develop
build.sbt
name := "my-recommendation-spark-engine"
version := "1.0-SNAPSHOT"
scalaVersion := "2.10.4"
val sparkVersion = "1.6.1"
val akkaVersion = "2.3.11" // override Akka to be this version to match the one in Spark
libraryDependencies ++= Seq(
"org.apache.kafka" % "kafka_2.10" % "0.8.1"
exclude("javax.jms", "jms")
exclude("com.sun.jdmk", "jmxtools")
exclude("com.sun.jmx", "jmxri"),
//not working play module!! check
//jdbc,
//anorm,
//cache,
// HTTP client
"net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
// HTML parser
"org.jodd" % "jodd-lagarto" % "3.5.2",
"com.typesafe" % "config" % "1.2.1",
"com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
"org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
"org.twitter4j" % "twitter4j-core" % "4.0.2",
"org.twitter4j" % "twitter4j-stream" % "4.0.2",
"org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
"org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1" ,
"org.apache.spark" % "spark-core_2.10" % "1.6.1" ,
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1",
"org.apache.spark" % "spark-sql_2.10" % "1.6.1",
"org.apache.spark" % "spark-mllib_2.10" % "1.6.1",
"com.google.code.gson" % "gson" % "2.6.2",
"commons-cli" % "commons-cli" % "1.3.1",
"com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
// Akka
"com.typesafe.akka" %% "akka-actor" % akkaVersion,
"com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
// MongoDB
"org.reactivemongo" %% "reactivemongo" % "0.10.0"
)
packAutoSettings
//play.Project.playScalaSettings
Kafka Producer
package example.producer
import play.api.libs.json._
import example.utils._
import scala.concurrent.Future
import example.model.{AmazonProductAndRating,AmazonProduct,AmazonRating}
import example.utils.AmazonPageParser
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
/**
args(0) : productId
args(1) : userdId
Usage: ./amazon-producer-example 0981531679 someUserId 3.0
*/
object AmazonProducerExample {
def main(args: Array[String]): Unit = {
val productId = args(0).toString
val userId = args(1).toString
val rating = args(2).toDouble
val topicName = "amazonRatingsTopic"
val producer = Producer[String](topicName)
//0981531679 is Scala Puzzlers...
AmazonPageParser.parse(productId,userId,rating).onSuccess { case amazonRating =>
//Is this the correct way? the best performance? possibly not, what about using avro or parquet? How can i push data in avro or parquet format?
//You can see that i am pushing json String to kafka topic, not raw String, but is there any difference?
//of course there are differences...
producer.send(Json.toJson(amazonRating).toString)
//producer.send(amazonRating.toString)
println("amazon product with rating sent to kafka cluster..." + amazonRating.toString)
System.exit(0)
}
}
}
This is the definition of necessary case classes (UPDATED), the file is named models.scala:
package example.model
import play.api.libs.json.Json
import reactivemongo.bson.Macros
case class AmazonProduct(itemId: String, title: String, url: String, img: String, description: String)
case class AmazonRating(userId: String, productId: String, rating: Double)
case class AmazonProductAndRating(product: AmazonProduct, rating: AmazonRating)
// For MongoDB
object AmazonRating {
implicit val amazonRatingHandler = Macros.handler[AmazonRating]
implicit val amazonRatingFormat = Json.format[AmazonRating]
//added using @Yuval tip
lazy val empty: AmazonRating = AmazonRating("-1", "-1", -1d)
}
This is the full code of the spark streaming process:
package example.spark
import java.io.File
import java.util.Date
import play.api.libs.json._
import com.google.gson.{Gson,GsonBuilder, JsonParser}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import com.mongodb.casbah.Imports._
import com.mongodb.QueryBuilder
import com.mongodb.casbah.MongoClient
import com.mongodb.casbah.commons.{MongoDBList, MongoDBObject}
import reactivemongo.api.MongoDriver
import reactivemongo.api.collections.default.BSONCollection
import reactivemongo.bson.BSONDocument
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
import example.model._
import example.utils.Recommender
/**
* Collect at least the specified number of json amazon products in order to feed the recommendation system and feed the mongo instance with results.
Usage: ./amazon-kafka-connector 127.0.0.1:9092 amazonRatingsTopic
on mongo shell:
use alonsodb;
db.amazonRatings.find();
*/
object AmazonKafkaConnector {
private var numAmazonProductCollected = 0L
private var partNum = 0
private val numAmazonProductToCollect = 10000000
//these settings must be in reference.conf
private val Database = "alonsodb"
private val ratingCollection = "amazonRatings"
private val MongoHost = "127.0.0.1"
private val MongoPort = 27017
private val MongoProvider = "com.stratio.datasource.mongodb"
private val jsonParser = new JsonParser()
private val gson = new GsonBuilder().setPrettyPrinting().create()
private def prepareMongoEnvironment(): MongoClient = {
val mongoClient = MongoClient(MongoHost, MongoPort)
mongoClient
}
private def closeMongoEnviroment(mongoClient : MongoClient) = {
mongoClient.close()
println("mongoclient closed!")
}
private def cleanMongoEnvironment(mongoClient: MongoClient) = {
cleanMongoData(mongoClient)
mongoClient.close()
}
private def cleanMongoData(client: MongoClient): Unit = {
val collection = client(Database)(ratingCollection)
collection.dropCollection()
}
def main(args: Array[String]) {
// Process program arguments and set properties
if (args.length < 2) {
System.err.println("Usage: " + this.getClass.getSimpleName + " <brokers> <topics>")
System.exit(1)
}
val Array(brokers, topics) = args
println("Initializing Streaming Spark Context and kafka connector...")
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
.setMaster("local[4]")
.set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
sc.addJar("target/scala-2.10/blog-spark-recommendation_2.10-1.0-SNAPSHOT.jar")
val ssc = new StreamingContext(sparkConf, Seconds(2))
//this checkpointdir should be in a conf file, for now it is hardcoded!
val streamingCheckpointDir = "/Users/aironman/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
println("Initialized Streaming Spark Context and kafka connector...")
//create recommendation module
println("Creating rating recommender module...")
val ratingFile= "ratings.csv"
val recommender = new Recommender(sc,ratingFile)
println("Initialized rating recommender module...")
//THIS IS THE MOST INTERESTING PART AND WHAT I NEED!
//THE SOLUTION IS NOT PROBABLY THE MOST EFFICIENT, BECAUSE I HAD TO
//USE DATAFRAMES, ARRAYs and SEQs BUT IS FUNCTIONAL!
try{
messages.foreachRDD(rdd => {
val count = rdd.count()
if (count > 0){
val json= rdd.map(_._2)
val dataFrame = sqlContext.read.json(json) //converts json to DF
val myRow = dataFrame.select(dataFrame("userId"),dataFrame("productId"),dataFrame("rating")).take(count.toInt)
println("myRow is: " + myRow)
val myAmazonRating = AmazonRating(myRow(0).getString(0), myRow(0).getString(1), myRow(0).getDouble(2))
println("myAmazonRating is: " + myAmazonRating.toString)
val arrayAmazonRating = Array(myAmazonRating)
//this method needs Seq[AmazonRating]
recommender.predictWithALS(arrayAmazonRating.toSeq)
}//if
})
}catch{
case e: IllegalArgumentException => {println("illegal arg. exception")};
case e: IllegalStateException => {println("illegal state exception")};
case e: ClassCastException => {println("ClassCastException")};
case e: Exception => {println(" Generic Exception")};
}finally{
println("Finished taking data from kafka topic...")
}
ssc.start()
ssc.awaitTermination()
println("Finished!")
}
}
Thank you all, folks: @Yuval, @Emecas and @Riccardo.cardin.
The Recommender.predict method signature looks like:
def predict(ratings: Seq[AmazonRating]) = {
// train model
val myRatings = ratings.map(toSparkRating)
val myRatingRDD = sc.parallelize(myRatings)
val startAls = DateTime.now
val model = ALS.train((sparkRatings ++ myRatingRDD).repartition(NumPartitions), 10, 20, 0.01)
val myProducts = myRatings.map(_.product).toSet
val candidates = sc.parallelize((0 until productDict.size).filterNot(myProducts.contains))
// get ratings of all products not in my history ordered by rating (higher first) and only keep the first NumRecommendations
val myUserId = userDict.getIndex(MyUsername)
val recommendations = model.predict(candidates.map((myUserId, _))).collect
val endAls = DateTime.now
val result = recommendations.sortBy(-_.rating).take(NumRecommendations).map(toAmazonRating)
val alsTime = Seconds.secondsBetween(startAls, endAls).getSeconds
println(s"ALS Time: $alsTime seconds")
result
}
//I think I've been as clear as possible; tell me if you need anything more, and thanks for your patience teaching me, @Yuval
Diagnosis
IllegalStateException suggests that you are operating on a StreamingContext that is already ACTIVE or STOPPED. See the details here (lines 218-231):
java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported
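A minimal reproduction of that state (my sketch, not code from the question):
// registering a new output operation after start() raises exactly that IllegalStateException
ssc.start()
messages.foreachRDD(rdd => println(rdd.count())) // too late: the context is already ACTIVE
ssc.awaitTermination()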
Code Review
Looking at your AmazonKafkaConnector code, you are doing map, filter and foreachRDD inside another foreachRDD over the same DirectStream object, called messages.
General Advice:
Be functional, my friend: divide your logic into small pieces, one for each of the tasks you want to perform:
Streaming
ML Recommendation
Persistence
etc.
That will make the Spark pipeline you want to implement easier to understand and debug.
The problem is that the statement rdd.take(count.toInt) returns an Array[T], as stated here:
def take(num: Int): Array[T]
Take the first num elements of the RDD.
You're telling your RDD to take the first n elements it contains. So, contrary to what you assumed, you don't have an object of type Tuple2 but an array.
If you want to print each element of the array, you can use the method mkString defined on the Array type to obtain a single String with all the elements of the array.
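For example, with myRow being the Array[Row] returned by take above:
// mkString joins all elements of the array into one readable String
println("myRow is: " + myRow.mkString("[", ", ", "]"))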
It looks like what you're trying to do is simply a map over a DStream. A map operation is a projection from type A to type B, where A is a String (that you're receiving from Kafka), and B is your case class AmazonRating.
Let's add an empty value to your AmazonRating:
case class AmazonRating(userId: String, productId: String, rating: Double)
object AmazonRating {
lazy val empty: AmazonRating = AmazonRating("-1", "-1", -1d)
}
Now let's parse the JSONs:
val messages = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

messages
  .map { case (_, jsonRating) =>
    val format = Json.format[AmazonRating]
    val jsValue = Json.parse(jsonRating)
    format.reads(jsValue) match {
      case JsSuccess(rating, _) => rating
      case JsError(_) => AmazonRating.empty
    }
  }
  .filter(_ != AmazonRating.empty)
  .foreachRDD(_.foreachPartition(it => recommender.predict(it.toSeq)))

spark unit testing with dataframe : Collect return empty array

I'm using Spark and I've been struggling to make a simple unit test pass with a DataFrame and Spark SQL.
Here is the code snippet:
class TestDFSpec extends SharedSparkContext {
"Test DF " should {
"pass equality" in {
val createDF = sqlCtx.createDataFrame(createsRDD,classOf[Test]).toDF()
createDF.registerTempTable("test")
sqlCtx.sql("select * FROM test").collectAsList() === List(Row(Test.from(create1)),Row(Test.from(create2)))
}
}
val create1 = "4869215,bbbbb"
val create2 = "4869215,aaaaa"
val createsRDD = sparkContext.parallelize(Seq(create1,create2)).map(Test.from)
}
I copied the code from the Spark GitHub repository and made some small changes to provide a SQLContext:
trait SharedSparkContext extends Specification with BeforeAfterAll {
import net.lizeo.bi.spark.conf.JobConfiguration._
@transient private var _sql: SQLContext = _
def sqlCtx: SQLContext = _sql
override def beforeAll() {
println(sparkConf)
_sql = new SQLContext(sparkContext)
}
override def afterAll() {
sparkContext.stop()
_sql = null
}
}
The Test model is pretty simple:
case class Test(key:Int, value:String)
object Test {
def from(line:String):Test = {
val f = line.split(",")
Test(f(0).toInt,f(1))
}
}
The job configuration object:
object JobConfiguration {
val conf = ConfigFactory.load()
val sparkName = conf.getString("spark.name")
val sparkMaster = conf.getString("spark.master")
lazy val sparkConf = new SparkConf()
.setAppName(sparkName)
.setMaster(sparkMaster)
.set("spark.executor.memory",conf.getString("spark.executor.memory"))
.set("spark.io.compression.codec",conf.getString("spark.io.compression.codec"))
val sparkContext = new SparkContext(sparkConf)
}
I'm using Spark 1.3.0 with Specs2. The exact dependencies from my sbt project files are:
object Dependencies {
private val sparkVersion = "1.3.0"
private val clouderaVersion = "5.4.4"
private val sparkClouderaVersion = s"$sparkVersion-cdh$clouderaVersion"
val sparkCdhDependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkClouderaVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkClouderaVersion % "provided"
)
}
The test output is :
[info] TestDFSpec
[info]
[info] Test DF should
[error] x pass equality
[error] '[[], []]'
[error]
[error] is not equal to
[error]
[error] List([Test(4869215,bbbbb)], [Test(4869215,aaaaa)]) (TestDFSpec.scala:17)
[error] Actual: [[], []] [error] Expected: List([Test(4869215,bbbbb)], [Test(4869215,aaaaa)])
sqlCtx.sql("select * FROM test").collectAsList() returns [[], []].
Any help would be greatly appreciated. I didn't run into any problems when testing with RDDs.
I do want to migrate from RDDs to DataFrames and be able to use Parquet directly from Spark to store files.
Thanks in advance
The test passes with the following code:
class TestDFSpec extends SharedSparkContext {
import sqlCtx.implicits._
"Test DF " should {
"pass equality" in {
val createDF = sqlCtx.createDataFrame(Seq(create1,create2).map(Test.from))
createDF.registerTempTable("test")
val result = sqlCtx.sql("select * FROM test").collect()
result === Array(Test.from(create1),Test.from(create2)).map(Row.fromTuple)
}
}
val create1 = "4869215,bbbbb"
val create2 = "4869215,aaaaa"
}
The main difference is the way the DataFrame is created: from a Seq[Test] instead of an RDD[Test].
I asked for an explanation on the Spark mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-dataframe-td24240.html#none
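If you prefer to keep the RDD, a sketch that should also work under Spark 1.3 is to let schema inference handle the case class instead of passing classOf[Test] (the classOf overload goes through the JavaBean path, which likely finds no getter-style properties on a Scala case class, hence the empty rows):
val createDF = sqlCtx.createDataFrame(createsRDD) // RDD[Test]: schema inferred from the case class
// or, with import sqlCtx.implicits._ in scope:
// val createDF = createsRDD.toDF()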