About an error accessing a field inside Tuple2 - scala

I am trying to access a field within a Tuple2 and the compiler is returning an error. The software pushes a case class to a Kafka topic, then I want to recover it using Spark Streaming so I can feed a machine learning algorithm and save the results to a Mongo instance.
Solved!
I finally solved my problem; I am going to post the final solution:
This is the github project:
https://github.com/alonsoir/awesome-recommendation-engine/tree/develop
build.sbt
name := "my-recommendation-spark-engine"
version := "1.0-SNAPSHOT"
scalaVersion := "2.10.4"
val sparkVersion = "1.6.1"
val akkaVersion = "2.3.11" // override Akka to be this version to match the one in Spark
libraryDependencies ++= Seq(
"org.apache.kafka" % "kafka_2.10" % "0.8.1"
exclude("javax.jms", "jms")
exclude("com.sun.jdmk", "jmxtools")
exclude("com.sun.jmx", "jmxri"),
//not working play module!! check
//jdbc,
//anorm,
//cache,
// HTTP client
"net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
// HTML parser
"org.jodd" % "jodd-lagarto" % "3.5.2",
"com.typesafe" % "config" % "1.2.1",
"com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
"org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
"org.twitter4j" % "twitter4j-core" % "4.0.2",
"org.twitter4j" % "twitter4j-stream" % "4.0.2",
"org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
"org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1" ,
"org.apache.spark" % "spark-core_2.10" % "1.6.1" ,
"org.apache.spark" % "spark-streaming_2.10" % "1.6.1",
"org.apache.spark" % "spark-sql_2.10" % "1.6.1",
"org.apache.spark" % "spark-mllib_2.10" % "1.6.1",
"com.google.code.gson" % "gson" % "2.6.2",
"commons-cli" % "commons-cli" % "1.3.1",
"com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
// Akka
"com.typesafe.akka" %% "akka-actor" % akkaVersion,
"com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
// MongoDB
"org.reactivemongo" %% "reactivemongo" % "0.10.0"
)
packAutoSettings
//play.Project.playScalaSettings
Kafka Producer
package example.producer
import play.api.libs.json._
import example.utils._
import scala.concurrent.Future
import example.model.{AmazonProductAndRating,AmazonProduct,AmazonRating}
import example.utils.AmazonPageParser
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
/**
args(0) : productId
args(1) : userId
Usage: ./amazon-producer-example 0981531679 someUserId 3.0
*/
object AmazonProducerExample {
def main(args: Array[String]): Unit = {
val productId = args(0).toString
val userId = args(1).toString
val rating = args(2).toDouble
val topicName = "amazonRatingsTopic"
val producer = Producer[String](topicName)
//0981531679 is Scala Puzzlers...
AmazonPageParser.parse(productId,userId,rating).onSuccess { case amazonRating =>
//Is this the correct way? the best performance? possibly not, what about using avro or parquet? How can i push data in avro or parquet format?
//You can see that i am pushing json String to kafka topic, not raw String, but is there any difference?
//of course there are differences...
producer.send(Json.toJson(amazonRating).toString)
//producer.send(amazonRating.toString)
println("amazon product with rating sent to kafka cluster..." + amazonRating.toString)
System.exit(0)
}
}
}
This is the definition of the necessary case classes (UPDATED); the file is named models.scala:
package example.model
import play.api.libs.json.Json
import reactivemongo.bson.Macros
case class AmazonProduct(itemId: String, title: String, url: String, img: String, description: String)
case class AmazonRating(userId: String, productId: String, rating: Double)
case class AmazonProductAndRating(product: AmazonProduct, rating: AmazonRating)
// For MongoDB
object AmazonRating {
implicit val amazonRatingHandler = Macros.handler[AmazonRating]
implicit val amazonRatingFormat = Json.format[AmazonRating]
//added using @Yuval's tip
lazy val empty: AmazonRating = AmazonRating("-1", "-1", -1d)
}
This is the full code of the spark streaming process:
package example.spark
import java.io.File
import java.util.Date
import play.api.libs.json._
import com.google.gson.{Gson,GsonBuilder, JsonParser}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import com.mongodb.casbah.Imports._
import com.mongodb.QueryBuilder
import com.mongodb.casbah.MongoClient
import com.mongodb.casbah.commons.{MongoDBList, MongoDBObject}
import reactivemongo.api.MongoDriver
import reactivemongo.api.collections.default.BSONCollection
import reactivemongo.bson.BSONDocument
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder
import example.model._
import example.utils.Recommender
/**
* Collect at least the specified number of JSON Amazon products in order to feed the recommendation system and feed the Mongo instance with results.
Usage: ./amazon-kafka-connector 127.0.0.1:9092 amazonRatingsTopic
on mongo shell:
use alonsodb;
db.amazonRatings.find();
*/
object AmazonKafkaConnector {
private var numAmazonProductCollected = 0L
private var partNum = 0
private val numAmazonProductToCollect = 10000000
//these settings must be in reference.conf
private val Database = "alonsodb"
private val ratingCollection = "amazonRatings"
private val MongoHost = "127.0.0.1"
private val MongoPort = 27017
private val MongoProvider = "com.stratio.datasource.mongodb"
private val jsonParser = new JsonParser()
private val gson = new GsonBuilder().setPrettyPrinting().create()
private def prepareMongoEnvironment(): MongoClient = {
val mongoClient = MongoClient(MongoHost, MongoPort)
mongoClient
}
private def closeMongoEnviroment(mongoClient : MongoClient) = {
mongoClient.close()
println("mongoclient closed!")
}
private def cleanMongoEnvironment(mongoClient: MongoClient) = {
cleanMongoData(mongoClient)
mongoClient.close()
}
private def cleanMongoData(client: MongoClient): Unit = {
val collection = client(Database)(ratingCollection)
collection.dropCollection()
}
def main(args: Array[String]) {
// Process program arguments and set properties
if (args.length < 2) {
System.err.println("Usage: " + this.getClass.getSimpleName + " <brokers> <topics>")
System.exit(1)
}
val Array(brokers, topics) = args
println("Initializing Streaming Spark Context and kafka connector...")
// Create context with 2 second batch interval
val sparkConf = new SparkConf().setAppName("AmazonKafkaConnector")
.setMaster("local[4]")
.set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
sc.addJar("target/scala-2.10/blog-spark-recommendation_2.10-1.0-SNAPSHOT.jar")
val ssc = new StreamingContext(sparkConf, Seconds(2))
//this checkpointdir should be in a conf file, for now it is hardcoded!
val streamingCheckpointDir = "/Users/aironman/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
println("Initialized Streaming Spark Context and kafka connector...")
//create recommendation module
println("Creating rating recommender module...")
val ratingFile= "ratings.csv"
val recommender = new Recommender(sc,ratingFile)
println("Initialized rating recommender module...")
//THIS IS THE MOST INTERESTING PART AND WHAT I NEED!
//THE SOLUTION IS NOT PROBABLY THE MOST EFFICIENT, BECAUSE I HAD TO
//USE DATAFRAMES, ARRAYs and SEQs BUT IS FUNCTIONAL!
try{
messages.foreachRDD(rdd => {
val count = rdd.count()
if (count > 0){
val json= rdd.map(_._2)
val dataFrame = sqlContext.read.json(json) //converts json to DF
val myRow = dataFrame.select(dataFrame("userId"),dataFrame("productId"),dataFrame("rating")).take(count.toInt)
println("myRow is: " + myRow)
val myAmazonRating = AmazonRating(myRow(0).getString(0), myRow(0).getString(1), myRow(0).getDouble(2))
println("myAmazonRating is: " + myAmazonRating.toString)
val arrayAmazonRating = Array(myAmazonRating)
//this method needs Seq[AmazonRating]
recommender.predictWithALS(arrayAmazonRating.toSeq)
}//if
})
}catch{
case e: IllegalArgumentException => {println("illegal arg. exception")};
case e: IllegalStateException => {println("illegal state exception")};
case e: ClassCastException => {println("ClassCastException")};
case e: Exception => {println(" Generic Exception")};
}finally{
println("Finished taking data from kafka topic...")
}
ssc.start()
ssc.awaitTermination()
println("Finished!")
}
}
Thank you all, folks: @Yuval, @Emecas and @Riccardo.cardin.
Recommender.predict signature method looks like:
def predict(ratings: Seq[AmazonRating]) = {
// train model
val myRatings = ratings.map(toSparkRating)
val myRatingRDD = sc.parallelize(myRatings)
val startAls = DateTime.now
val model = ALS.train((sparkRatings ++ myRatingRDD).repartition(NumPartitions), 10, 20, 0.01)
val myProducts = myRatings.map(_.product).toSet
val candidates = sc.parallelize((0 until productDict.size).filterNot(myProducts.contains))
// get ratings of all products not in my history ordered by rating (higher first) and only keep the first NumRecommendations
val myUserId = userDict.getIndex(MyUsername)
val recommendations = model.predict(candidates.map((myUserId, _))).collect
val endAls = DateTime.now
val result = recommendations.sortBy(-_.rating).take(NumRecommendations).map(toAmazonRating)
val alsTime = Seconds.secondsBetween(startAls, endAls).getSeconds
println(s"ALS Time: $alsTime seconds")
result
}
//I think I've been as clear as possible; tell me if you need anything more, and thanks for your patience teaching me, @Yuval

Diagnosis
An IllegalStateException suggests that you are operating on a StreamingContext that is already ACTIVE or STOPPED. See details here (lines 218-231):
java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported
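In other words, register every transformation and output operation on the stream before calling start(). A minimal sketch of that ordering, reusing the names from your own code:
// all per-batch processing is wired up here, before the context starts
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
messages.foreachRDD { rdd =>
  // process each micro-batch
}
ssc.start()            // no new transformations may be added after this point
ssc.awaitTermination()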
Code Review
Looking at your AmazonKafkaConnector code, you are doing map, filter and foreachRDD inside another foreachRDD over the same DirectStream object called messages.
General Advice:
Be functional, my friend, by dividing your logic into small pieces, one for each of the tasks you want to perform:
Streaming
ML Recommendation
Persistence
etc.
That will help you understand and debug the Spark pipeline you want to implement more easily.
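For example, a rough sketch only; the helper names are illustrative and the bodies just reuse logic already present in your AmazonKafkaConnector:
// hypothetical decomposition of the per-batch logic into named steps
// assumes the imports already in your file, plus org.apache.spark.rdd.RDD
def parseRatings(sqlContext: SQLContext, json: RDD[String]): Seq[AmazonRating] = {
  val df = sqlContext.read.json(json)
  df.select("userId", "productId", "rating")
    .collect()
    .map(row => AmazonRating(row.getString(0), row.getString(1), row.getDouble(2)))
    .toSeq
}
def recommend(recommender: Recommender, ratings: Seq[AmazonRating]): Unit =
  if (ratings.nonEmpty) recommender.predictWithALS(ratings)
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) recommend(recommender, parseRatings(sqlContext, rdd.map(_._2)))
}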

The problem is that the statement rdd.take(count.toInt) returns an Array[T], as stated here:
def take(num: Int): Array[T]
Take the first num elements of the RDD.
You're telling your RDD to take its first n elements. So, contrary to what you assumed, you don't have an object of type Tuple2, but an array.
If you want to print each element of the array, you can use the method mkString defined on the Array type to obtain a single String with all the elements of the array.
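For example, a small sketch based on your own snippet (not necessarily the most efficient way):
// myRow is an Array[Row], so print it with mkString or index into it first
val myRow = dataFrame.select(dataFrame("userId"), dataFrame("productId"), dataFrame("rating")).take(count.toInt)
println("myRow is: " + myRow.mkString("[", ", ", "]"))
val first = myRow(0)   // a Row, not a Tuple2
val myAmazonRating = AmazonRating(first.getString(0), first.getString(1), first.getDouble(2))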

It looks like what you're trying to do is simply a map over a DStream. A map operation is a projection from type A to type B, where A is a String (which you're receiving from Kafka) and B is your case class AmazonRating.
Let's add an empty value to your AmazonRating:
case class AmazonRating(userId: String, productId: String, rating: Double)
object AmazonRating {
lazy val empty: AmazonRating = AmazonRating("-1", "-1", -1d)
}
Now let's parse the JSONs:
val messages = KafkaUtils
.createDirectStream[String, String, StringDecoder, StringDecoder]
(ssc, kafkaParams, topicsSet)
messages
  .map { case (_, jsonRating) =>
    val format = Json.format[AmazonRating]
    val jsValue = Json.parse(jsonRating)
    format.reads(jsValue) match {
      case JsSuccess(rating, _) => rating
      case JsError(_) => AmazonRating.empty
    }
  }
  .filter(_ != AmazonRating.empty)
  .foreachRDD(_.foreachPartition(it => recommender.predict(it.toSeq)))

Related

Extending DefaultParamsReadable and DefaultParamsWritable not allowing reading of custom model

Good day,
I have been struggling for a few days to save a custom transformer that is part of a large pipeline of stages. I have a transformer that is completely defined by its params. I have an estimator which, in its fit method, will generate a matrix and then set the transformer parameters accordingly, so that I can use DefaultParamsReadable and DefaultParamsWritable to take advantage of the serialisation/deserialisation already present in util.ReadWrite.scala.
My summarised code is as follows (includes important aspects):
...
import org.apache.spark.ml.util._
...
// trait to implement in Estimator and Transformer for params
trait NBParams extends Params {
final val featuresCol= new Param[String](this, "featuresCol", "The input column")
setDefault(featuresCol, "_tfIdfOut")
final val labelCol = new Param[String](this, "labelCol", "The labels column")
setDefault(labelCol, "P_Root_Code_Index")
final val predictionsCol = new Param[String](this, "predictionsCol", "The output column")
setDefault(predictionsCol, "NBOutput")
final val ratioMatrix = new Param[DenseMatrix](this, "ratioMatrix", "The transformation matrix")
def getfeaturesCol: String = $(featuresCol)
def getlabelCol: String = $(labelCol)
def getPredictionCol: String = $(predictionsCol)
def getRatioMatrix: DenseMatrix = $(ratioMatrix)
}
// Estimator
class CustomNaiveBayes(override val uid: String, val alpha: Double)
extends Estimator[CustomNaiveBayesModel] with NBParams with DefaultParamsWritable {
def copy(extra: ParamMap): CustomNaiveBayes = {
defaultCopy(extra)
}
def setFeaturesCol(value: String): this.type = set(featuresCol, value)
def setLabelCol(value: String): this.type = set(labelCol, value)
def setPredictionCol(value: String): this.type = set(predictionsCol, value)
def setRatioMatrix(value: DenseMatrix): this.type = set(ratioMatrix, value)
override def transformSchema(schema: StructType): StructType = {...}
override def fit(ds: Dataset[_]): CustomNaiveBayesModel = {
...
val model = new CustomNaiveBayesModel(uid)
model
.setRatioMatrix(ratioMatrix)
.setFeaturesCol($(featuresCol))
.setLabelCol($(labelCol))
.setPredictionCol($(predictionsCol))
}
}
// companion object for Estimator
object CustomNaiveBayes extends DefaultParamsReadable[CustomNaiveBayes]{
override def load(path: String): CustomNaiveBayes = super.load(path)
}
// Transformer
class CustomNaiveBayesModel(override val uid: String)
extends Model[CustomNaiveBayesModel] with NBParams with DefaultParamsWritable {
def this() = this(Identifiable.randomUID("customnaivebayes"))
def copy(extra: ParamMap): CustomNaiveBayesModel = {defaultCopy(extra)}
def setFeaturesCol(value: String): this.type = set(featuresCol, value)
def setLabelCol(value: String): this.type = set(labelCol, value)
def setPredictionCol(value: String): this.type = set(predictionsCol, value)
def setRatioMatrix(value: DenseMatrix): this.type = set(ratioMatrix, value)
override def transformSchema(schema: StructType): StructType = {...}
def transform(dataset: Dataset[_]): DataFrame = {...}
}
// companion object for Transformer
object CustomNaiveBayesModel extends DefaultParamsReadable[CustomNaiveBayesModel]
When I add this Model as part of a pipeline and fit the pipeline, all runs ok. When I save the pipeline, there are no errors. However, when I attempt to load the pipeline in I get the following error:
NoSuchMethodException: $line3b380bcad77e4e84ae25a6bfb1f3ec0d45.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$6fa979eb27fa6bf89c6b6d1b271932c$$$$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$CustomNaiveBayesModel.read()
To save the pipeline, which includes a number of other transformers related to NLP pre-processing, I run
fittedModelRootCode.write.save("path")
and to then load it (where the failure occurs) I run
import org.apache.spark.ml.PipelineModel
val fittedModelRootCode = PipelineModel.load("path")
The model itself appears to be working well, but I cannot afford to retrain the model on a dataset every time I wish to use it. Does anyone have any ideas why, even with the companion object, the read() method appears to be unavailable?
Notes:
I am running on Databricks Runtime 8.3 (Spark 3.1.1, Scala 2.12)
My model is in a separate package so is external to Spark
I have reproduced this based on a number of existing examples, all of which appear to work fine, so I am unsure why my code is failing
I am aware there is a Naive Bayes model available in Spark ML, however, I have been tasked with making a large number of customizations so it is not worth modifying the existing version (plus I would like to learn how to get this right)
Any help would be greatly appreciated.
Since you extend the CustomNaiveBayesModel companion object with DefaultParamsReadable, I think you should use the companion object CustomNaiveBayesModel for loading the model. Here is some code I wrote for saving and loading models, and it works properly:
import org.apache.spark.SparkConf
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.sql.SparkSession
import path.to.CustomNaiveBayesModel
object SavingModelApp extends App {
val spark: SparkSession = SparkSession.builder().config(
new SparkConf()
.setMaster("local[*]")
.setAppName("Test app")
.set("spark.driver.host", "localhost")
.set("spark.ui.enabled", "false")
).getOrCreate()
val training = spark.createDataFrame(Seq(
(0L, "a b c d e spark", 1.0),
(1L, "b d", 0.0),
(2L, "spark f g h", 1.0),
(3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")
val fittedModelRootCode: PipelineModel = new Pipeline().setStages(Array(new CustomNaiveBayesModel())).fit(training)
fittedModelRootCode.write.save("path/to/model")
val mod = PipelineModel.load("path/to/model")
}
I think your mistake is using PipelineModel.load for loading the concrete model.
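If you do need to load that single stage on its own, a minimal sketch (assuming the model was saved by itself rather than inside a PipelineModel) would go through the companion object that DefaultParamsReadable provides:
import path.to.CustomNaiveBayesModel
// hypothetical path; load the concrete model through its own companion object
val model: CustomNaiveBayesModel = CustomNaiveBayesModel.load("path/to/model")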
My environment:
scalaVersion := "2.12.6"
scalacOptions := Seq(
"-encoding", "UTF-8", "-target:jvm-1.8", "-deprecation",
"-feature", "-unchecked", "-language:implicitConversions", "-language:postfixOps")
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.1.1",
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.1.1"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.1.1"

Spark Scala - compile errors

I have a script in Scala; when I run it in Zeppelin it works well, but when I try to compile it with sbt it doesn't work. I believe it is something related to the versions, but I am not able to identify it.
These three approaches return the same error:
val catMap = catDF.rdd.map((row: Row) => (row.getAs[String](1)->row.getAs[Integer](0))).collect.toMap
val catMap = catDF.select($"description", $"id".cast("int")).as[(String, Int)].collect.toMap
val catMap = catDF.rdd.map((row: Row) => (row.getAs[String](1)->row.getAs[Integer](0))).collectAsMap()
Returning an error: "value rdd is not a member of Unit"
val bizCat = bizCatRDD.rdd.map(t => (t.getAs[String](0),catMap(t.getAs[String](1)))).toDF
Returning an error: "value toDF is not a member of org.apache.spark.rdd.RDD[U]"
Scala version: 2.12
Sbt Version: 1.3.13
UPDATE:
The whole class is:
package importer
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import udf.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.Column
object BusinessImporter extends Importer{
def importa(spark: SparkSession, inputDir: String): Unit = {
import spark.implicits._
val bizDF = spark.read.json(inputDir).cache
// categories
val explode_categories = bizDF.withColumn("categories", explode(split(col("categories"), ",")))
val sort_categories = explode_categories.select(col("categories").as("description"))
.distinct
.coalesce(1)
.orderBy(asc("categories"))
// Create sequence column
val windowSpec = Window.orderBy("description")
val categories_with_sequence = sort_categories.withColumn("id",row_number.over(windowSpec))
val categories = categories_with_sequence.select("id","description")
val catDF = categories.write.insertInto("categories")
// business categories
//val catMap = catDF.rdd.map((row: Row) => (row.getAs[String](1)->row.getAs[Integer](0))).collect.toMap
//val catMap = catDF.select($"description", $"id".cast("int")).as[(String, Int)].collect.toMap
val catMap = catDF.rdd.map((row: Row) => (row.getAs[String](1)->row.getAs[Integer](0))).collectAsMap()
val auxbizCatRDD = bizDF.withColumn("categories", explode(split(col("categories"), ",")))
val bizCatRDD = auxbizCatRDD.select("business_id","categories")
val bizCat = bizCatRDD.rdd.map(t => (t.getAs[String](0),catMap(t.getAs[String](1)))).toDF
bizCat.write.insertInto("business_category")
// Business
val businessDF = bizDF.select("business_id","categories","city","address","latitude","longitude","name","is_open","review_count","stars","state")
businessDF.coalesce(1).write.insertInto("business")
// Hours
val bizHoursDF = bizDF.select("business_id","hours.Sunday","hours.Monday","hours.Tuesday","hours.Wednesday","hours.Thursday","hours.Friday","hours.Saturday")
val bizHoursDF_structs = bizHoursDF
.withColumn("Sunday",struct(
split(col("Sunday"),"-").getItem(0).as("Open"),
split(col("Sunday"),"-").getItem(1).as("Close")))
.withColumn("Monday",struct(
split(col("Monday"),"-").getItem(0).as("Open"),
split(col("Monday"),"-").getItem(1).as("Close")))
.withColumn("Tuesday",struct(
split(col("Tuesday"),"-").getItem(0).as("Open"),
split(col("Tuesday"),"-").getItem(1).as("Close")))
.withColumn("Wednesday",struct(
split(col("Wednesday"),"-").getItem(0).as("Open"),
split(col("Wednesday"),"-").getItem(1).as("Close")))
.withColumn("Thursday",struct(
split(col("Thursday"),"-").getItem(0).as("Open"),
split(col("Thursday"),"-").getItem(1).as("Close")))
.withColumn("Friday",struct(
split(col("Friday"),"-").getItem(0).as("Open"),
split(col("Friday"),"-").getItem(1).as("Close")))
.withColumn("Saturday",struct(
split(col("Saturday"),"-").getItem(0).as("Open"),
split(col("Saturday"),"-").getItem(1).as("Close")))
bizHoursDF_structs.coalesce(1).write.insertInto("business_hour")
}
def singleSpace(col: Column): Column = {
trim(regexp_replace(col, " +", " "))
}
}
sbt file:
name := "yelp-spark-processor"
version := "1.0"
scalaVersion := "2.12.12"
libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "3.0.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "3.0.1"
libraryDependencies += "org.apache.spark" % "spark-hive_2.12" % "3.0.1"
Can someone please give me some guidance about what is wrong?
Many thanks,
Xavy
The issue here is that in Scala this line returns type Unit:
val catDF = categories.write.insertInto("categories")
Unit in Scala is like void in Java: it's returned by functions that don't return anything meaningful. So at this point catDF is not a DataFrame and you can't treat it as one. You probably want to keep using categories instead of catDF in the lines that follow.
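A minimal sketch of that change, keeping the write as a side effect and building the map from categories:
// insertInto returns Unit, so don't assign it; keep working with `categories`
categories.write.insertInto("categories")
val catMap = categories.rdd
  .map((row: Row) => row.getAs[String](1) -> row.getAs[Integer](0))
  .collectAsMap()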

How to read from embedded-kafka with fs2-kafka

I am using fs2-kafka to read from embedded-kafka.
I create the embedded Kafka using withRunningKafkaOnFoundPort, create a topic and publish a few messages. However, when I try to read them back with fs2-kafka I get a NullPointerException. I have isolated a test case and the code is below.
Here is my code:
import cats.effect._
import cats.implicits._
import cats.effect.implicits._
import fs2.Stream
import fs2.kafka.{AutoOffsetReset, ConsumerSettings, KafkaConsumer, consumerStream}
import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.scalatest.{BeforeAndAfterAll, FunSuite}
import scala.concurrent.ExecutionContext
class KafkaSuite extends FunSuite with EmbeddedKafka {
val singleThreadExecutor = ExecutionContext.fromExecutor((task: Runnable) => task.run())
implicit val contextShift = IO.contextShift(singleThreadExecutor)
implicit val timer = IO.timer(singleThreadExecutor)
val topic = "example"
val partition = 0
val clientId = "client"
test("works") {
val userDefinedConfig = EmbeddedKafkaConfig(kafkaPort = 0, zooKeeperPort = 0)
withRunningKafkaOnFoundPort(userDefinedConfig) { implicit actualConfig =>
createCustomTopic(topic)
publishStringMessageToKafka(topic, "example-message1")
publishStringMessageToKafka(topic, "example-message2")
publishStringMessageToKafka(topic, "example-message3")
publishStringMessageToKafka(topic, "example-message4")
val broker = s"localhost:${actualConfig.kafkaPort}"
val consumerSettings = ConsumerSettings[IO, String, String]
.withAutoOffsetReset(AutoOffsetReset.Earliest)
.withBootstrapServers(broker)
.withGroupId("group")
.withClientId(clientId)
val r = consumerStream[IO].using(consumerSettings)
.evalTap(_.subscribeTo(topic))
.evalTap(_.seekToBeginning)
.flatMap { consumer =>
consumer.stream.take(1)
}
.compile
.toList
val res = r.unsafeRunSync()
Console.println(res)
assert(res.size == 1)
}
}
}
build.sbt:
name := "test"
version := "0.1"
scalaVersion := "2.12.6"
libraryDependencies ++= Seq(
"org.scalatest" % "scalatest_2.12" % "3.1.2" % "test",
"org.slf4j" % "slf4j-simple" % "1.7.25",
"com.github.fd4s" %% "fs2-kafka" % "1.0.0",
"io.github.embeddedkafka" %% "embedded-kafka" % "2.4.1.1" % Test
)
And here is the stack trace:
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at java.lang.String.<init>(String.java:515)
at fs2.kafka.Deserializer$.$anonfun$string$1(Deserializer.scala:208)
at fs2.kafka.Deserializer$.$anonfun$lift$1(Deserializer.scala:184)
at fs2.kafka.Deserializer$$anon$1.deserialize(Deserializer.scala:133)
at fs2.kafka.ConsumerRecord$.deserializeFromBytes(ConsumerRecord.scala:166)
at fs2.kafka.ConsumerRecord$.fromJava(ConsumerRecord.scala:177)
at fs2.kafka.internal.KafkaConsumerActor.$anonfun$records$2(KafkaConsumerActor.scala:378)
at cats.data.NonEmptyVectorInstances$$anon$1.traverse(NonEmptyVector.scala:300)
at cats.data.NonEmptyVectorInstances$$anon$1.traverse(NonEmptyVector.scala:245)
at cats.Traverse$Ops.traverse(Traverse.scala:19)
at cats.Traverse$Ops.traverse$(Traverse.scala:19)
at cats.Traverse$ToTraverseOps$$anon$2.traverse(Traverse.scala:19)
at fs2.kafka.internal.KafkaConsumerActor.$anonfun$records$1(KafkaConsumerActor.scala:376)
at cats.instances.VectorInstances$$anon$1.$anonfun$traverse$2(vector.scala:80)
at cats.instances.VectorInstances$$anon$1.loop$2(vector.scala:43)
at cats.instances.VectorInstances$$anon$1.$anonfun$foldRight$2(vector.scala:44)
at cats.Eval$.advance(Eval.scala:271)
at cats.Eval$.loop$1(Eval.scala:350)
at cats.Eval$.cats$Eval$$evaluate(Eval.scala:368)
at cats.Eval$Defer.value(Eval.scala:257)
at cats.instances.VectorInstances$$anon$1.traverse(vector.scala:79)
at cats.instances.VectorInstances$$anon$1.traverse(vector.scala:15)
at cats.Traverse$Ops.traverse(Traverse.scala:19)
at cats.Traverse$Ops.traverse$(Traverse.scala:19)
at cats.Traverse$ToTraverseOps$$anon$2.traverse(Traverse.scala:19)
at fs2.kafka.internal.KafkaConsumerActor.records(KafkaConsumerActor.scala:373)
at fs2.kafka.internal.KafkaConsumerActor.$anonfun$poll$2(KafkaConsumerActor.scala:405)
at cats.effect.internals.IORunLoop$.liftedTree1$1(IORunLoop.scala:95)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:95)
at cats.effect.internals.IORunLoop$.startCancelable(IORunLoop.scala:41)
at cats.effect.internals.IOBracket$BracketStart.run(IOBracket.scala:86)
at cats.effect.internals.Trampoline.cats$effect$internals$Trampoline$$immediateLoop(Trampoline.scala:70)
at cats.effect.internals.Trampoline.startLoop(Trampoline.scala:36)
at cats.effect.internals.TrampolineEC$JVMTrampoline.super$startLoop(TrampolineEC.scala:93)
at cats.effect.internals.TrampolineEC$JVMTrampoline.$anonfun$startLoop$1(TrampolineEC.scala:93)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
at cats.effect.internals.TrampolineEC$JVMTrampoline.startLoop(TrampolineEC.scala:93)
at cats.effect.internals.Trampoline.execute(Trampoline.scala:43)
at cats.effect.internals.TrampolineEC.execute(TrampolineEC.scala:44)
at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:72)
at cats.effect.internals.IOBracket$BracketStart.apply(IOBracket.scala:52)
at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:136)
at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:355)
at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:376)
at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:316)
at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
It turns out the problem is that the key type in ConsumerSettings[IO, String, String] is String, but embedded-kafka writes null as the key, so deserializing the key fails with a NullPointerException. Setting the key type to Unit solves that exception.
Another problem is that withRunningKafkaOnFoundPort finishes before the evaluation of the IO starts. To keep it running, you need to make a Resource from embedded-kafka and wrap the IO in it.
val embeddedKafka = Resource.make(IO(EmbeddedKafka.start()))((kafka) => IO(kafka.stop(true)))
The next problem is that fs2-kafka cannot work with a single-thread executor, so you have to provide it with an executor pool (for example, ExecutionContext.global).
Here is a full working example:
import cats.effect._
import fs2.Stream
import fs2.kafka.{AutoOffsetReset, ConsumerSettings, consumerStream}
import net.manub.embeddedkafka.{EmbeddedKafka, EmbeddedKafkaConfig}
import org.scalatest.FunSuite
import scala.concurrent.ExecutionContext
class KafkaSuite extends FunSuite with EmbeddedKafka {
implicit val ec = ExecutionContext.global
implicit val contextShift = IO.contextShift(ec)
implicit val timer = IO.timer(ec)
val topic = "example"
val partition = 0
val clientId = "client"
val userDefinedConfig = EmbeddedKafkaConfig(kafkaPort = 0, zooKeeperPort = 0)
def broker(port: Long) = s"localhost:${port}"
val consumerSettings = ConsumerSettings[IO, Unit, String]
.withAutoOffsetReset(AutoOffsetReset.Earliest)
.withEnableAutoCommit(true)
.withGroupId("group")
.withClientId(clientId)
val embeddedKafka = Resource.make(IO(EmbeddedKafka.start()))((kafka) => IO(kafka.stop(true)))
test("works") {
val r = Stream.resource(embeddedKafka).flatMap { kafka =>
implicit val actualConfig: EmbeddedKafkaConfig = kafka.config
createCustomTopic(topic)
publishStringMessageToKafka(topic, "example-message1")
publishStringMessageToKafka(topic, "example-message2")
publishStringMessageToKafka(topic, "example-message3")
publishStringMessageToKafka(topic, "example-message4")
consumerStream(consumerSettings.withBootstrapServers(broker(actualConfig.kafkaPort)))
.evalTap(_.subscribeTo(topic))
.evalTap(_.seekToBeginning)
.flatMap(_.stream)
.map(_.record.value)
.take(1)
}
val res = r.compile.toList.unsafeRunSync()
assert(res.contains("example-message1"))
}
}

Scala: how to parameterize a case class, and pass the case class variable to [T <: Product: TypeTag]

// class definition of RsGoods schema
case class RsGoods(add_time: Int)
// my operation
originRDD.toDF[Schemas.RsGoods]()
// and the function definition
def toDF[T <: Product: TypeTag](): DataFrame = mongoSpark.toDF[T]()
Now I have defined too many schemas (RsGoods1, RsGoods2, RsGoods3), and more will be added in the future.
So the question is how to pass a case class as a variable in order to structure the code.
Here are my sbt dependencies:
"org.apache.spark" % "spark-core_2.11" % "2.3.0",
"org.apache.spark" %% "spark-sql" % "2.3.0",
"org.mongodb.spark" %% "mongo-spark-connector" % "2.3.1",
Here is the key code snippet:
var originRDD = MongoSpark.load(sc, readConfig)
val df = table match {
case "rs_goods_multi" => originRDD.toDF[Schemas.RsGoodsMulti]()
case "rs_goods" => originRDD.toDF[Schemas.RsGoods]()
case "ma_item_price" => originRDD.toDF[Schemas.MaItemPrice]()
case "ma_siteuid" => originRDD.toDF[Schemas.MaSiteuid]()
case "pi_attribute" => originRDD.toDF[Schemas.PiAttribute]()
case "pi_attribute_name" => originRDD.toDF[Schemas.PiAttributeName]()
case "pi_attribute_value" => originRDD.toDF[Schemas.PiAttributeValue]()
case "pi_attribute_value_name" => originRDD.toDF[Schemas.PiAttributeValueName]()
From what I have understood of your requirement, I think the following should be a decent starting point.
def readDataset[A: Encoder](
spark: SparkSession,
mongoUrl: String,
collectionName: String,
clazz: Class[A]
): Dataset[A] = {
val config = ReadConfig(
Map("uri" -> s"$mongoUrl.$collectionName")
)
val df = MongoSpark.load(spark, config)
val fieldNames = clazz.getDeclaredFields.map(f => f.getName).dropRight(1).toList
val dfWithMatchingFieldNames = df.toDF(fieldNames: _*)
dfWithMatchingFieldNames.as[A]
}
You can use it like this,
case class RsGoods(add_time: Int)
val spark: SparkSession = ...
import spark.implicits._
val rsGoodsDS = readDataset[RsGoods](
spark,
"mongodb://example.com/database",
"rs_goods",
classOf[RsGoods]
)
Also, the following two lines,
val fieldNames = clazz.getDeclaredFields.map(f => f.getName).dropRight(1).toList
val dfWithMatchingFieldNames = df.toDF(fieldNames: _*)
are only required because normally Spark reads DataFrames with column names like value1, value2, .... So we want to change the column names to match what we have in our case class.
I am not sure what these "default" column names will be because MongoSpark is involved.
You should first check the column names in the df created as follows:
val config = ReadConfig(
Map("uri" -> s"$mongoUrl.$collectionName")
)
val df = MongoSpark.load(spark, config)
If MongoSpark fixes the problem of these "default" column names and picks the column names from your collection, then those two lines will not be required and your method will become just this:
def readDataset[A: Encoder](
spark: SparkSession,
mongoUrl: String,
collectionName: String
): Dataset[A] = {
val config = ReadConfig(
Map("uri" -> s"$mongoUrl.$collectionName")
)
val df = MongoSpark.load(spark, config)
df.as[A]
}
And,
val rsGoodsDS = readDataset[RsGoods](
spark,
"mongodb://example.com/database",
"rs_goods"
)

Spark unit testing with DataFrame: collect returns empty array

I'm using Spark and I've been struggling to make a simple unit test pass with a DataFrame and Spark SQL.
Here is the code snippet:
class TestDFSpec extends SharedSparkContext {
"Test DF " should {
"pass equality" in {
val createDF = sqlCtx.createDataFrame(createsRDD,classOf[Test]).toDF()
createDF.registerTempTable("test")
sqlCtx.sql("select * FROM test").collectAsList() === List(Row(Test.from(create1)),Row(Test.from(create2)))
}
}
val create1 = "4869215,bbbbb"
val create2 = "4869215,aaaaa"
val createsRDD = sparkContext.parallelize(Seq(create1,create2)).map(Test.from)
}
I copied code from the Spark GitHub repository and added some small changes to provide a SQLContext:
trait SharedSparkContext extends Specification with BeforeAfterAll {
import net.lizeo.bi.spark.conf.JobConfiguration._
@transient private var _sql: SQLContext = _
def sqlCtx: SQLContext = _sql
override def beforeAll() {
println(sparkConf)
_sql = new SQLContext(sparkContext)
}
override def afterAll() {
sparkContext.stop()
_sql = null
}
}
The Test model is pretty simple:
case class Test(key:Int, value:String)
object Test {
def from(line:String):Test = {
val f = line.split(",")
Test(f(0).toInt,f(1))
}
}
The job configuration object:
object JobConfiguration {
val conf = ConfigFactory.load()
val sparkName = conf.getString("spark.name")
val sparkMaster = conf.getString("spark.master")
lazy val sparkConf = new SparkConf()
.setAppName(sparkName)
.setMaster(sparkMaster)
.set("spark.executor.memory",conf.getString("spark.executor.memory"))
.set("spark.io.compression.codec",conf.getString("spark.io.compression.codec"))
val sparkContext = new SparkContext(sparkConf)
}
I'm using Spark 1.3.0 with Specs2. The exact dependencies from my sbt project files are:
object Dependencies {
private val sparkVersion = "1.3.0"
private val clouderaVersion = "5.4.4"
private val sparkClouderaVersion = s"$sparkVersion-cdh$clouderaVersion"
val sparkCdhDependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkClouderaVersion % "provided",
"org.apache.spark" %% "spark-sql" % sparkClouderaVersion % "provided"
)
}
The test output is :
[info] TestDFSpec
[info]
[info] Test DF should
[error] x pass equality
[error] '[[], []]'
[error]
[error] is not equal to
[error]
[error] List([Test(4869215,bbbbb)], [Test(4869215,aaaaa)]) (TestDFSpec.scala:17)
[error] Actual: [[], []] [error] Expected: List([Test(4869215,bbbbb)], [Test(4869215,aaaaa)])
sqlCtx.sql("select * FROM test").collectAsList() return [[], []]
Any help would be greatly appreciated. I didn't meet any problem testing with RDD
I do want to migrate from RDD to Dataframe and be able to use Parquet directly from Spark to store files
Thanks in advance
The test passes with the following code:
class TestDFSpec extends SharedSparkContext {
import sqlCtx.implicits._
"Test DF " should {
"pass equality" in {
val createDF = sqlCtx.createDataFrame(Seq(create1,create2).map(Test.from))
createDF.registerTempTable("test")
val result = sqlCtx.sql("select * FROM test").collect()
result === Array(Test.from(create1),Test.from(create2)).map(Row.fromTuple)
}
}
val create1 = "4869215,bbbbb"
val create2 = "4869215,aaaaa"
}
The main difference is the way the DataFrame is created: from a Seq[Test] instead of an RDD[Test].
I asked for an explanation on the Spark mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Unit-testing-dataframe-td24240.html#none