I am new to Spark and I'm using it with Scala. I wrote a simple object that loads fine in spark-shell using :load test.scala.
import org.apache.spark.ml.feature.StringIndexer
object Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
Now I want to put it in a class so that I can pass parameters. I use the same code, with class instead of object.
import org.apache.spark.ml.feature.StringIndexer
class Collaborative {
  def trainModel() = {
    val data = sc.textFile("/user/PT/data/newfav.csv")
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}
This fails with the following errors.
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[(String, String, Double)]
val df = data.map(_.split(",") match { case Array(user,food,fav) => (user,food,fav.toDouble) }).toDF("userID","foodID","favorite")
<console>:24: error: not found: type StringIndexer
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
What am I missing here?
Try this; it seems to work fine. Create the SparkSession inside the method and import spark.implicits._ there, so the required implicits are in scope:

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

def trainModel() = {
  val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
  import spark.implicits._
  val data = spark.read.textFile("/user/PT/data/newfav.csv")
  val df = data.map(_.split(",") match {
    case Array(user, food, fav) => (user, food, fav.toDouble)
  }).toDF("userID", "foodID", "favorite")
  val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
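If the goal is to pass parameters, one way (a sketch; the constructor parameters here are just placeholders for whatever you actually need) is to hand the SparkSession and the input path to the class and keep the implicits import inside the method:

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

class Collaborative(spark: SparkSession, path: String) {
  def trainModel() = {
    import spark.implicits._
    val data = spark.read.textFile(path)
    val df = data.map(_.split(",") match {
      case Array(user, food, fav) => (user, food, fav.toDouble)
    }).toDF("userID", "foodID", "favorite")
    val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
  }
}

// usage, e.g. from a main method or the shell:
// new Collaborative(spark, "/user/PT/data/newfav.csv").trainModel()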
Related
How can I select a case class based on a String value?
My code is
val spark = SparkSession.builder()...
val rddOfJsonStrings: RDD[String] = // some json strings as RDD
val classSelector: String = ??? // could be "Foo" or "Bar", or any other String value
case class Foo(foo: String)
case class Bar(bar: String)
if (classSelector == "Foo") {
  val df: DataFrame = spark.read.json(rddOfJsonStrings)
  df.as[Foo]
} else if (classSelector == "Bar") {
  val df: DataFrame = spark.read.json(rddOfJsonStrings)
  df.as[Bar]
} else {
  throw ClassUnknownException // custom Exception
}
The variable classSelector is a simple String that should be used to point to the case class of the same name.
Imagine I don't only have Foo and Bar as case classes, but more than those two. How is it possible to call the df.as[] statement based on the String (if possible at all)?
Or is there a completely different approach available in Scala?
Check the code below:

classSelector match {
  case c if Foo.getClass.getSimpleName.replace("$", "").equalsIgnoreCase(c) => spark.read.json(rddOfJsonStrings).as[Foo]
  case c if Bar.getClass.getSimpleName.replace("$", "").equalsIgnoreCase(c) => spark.read.json(rddOfJsonStrings).as[Bar]
  case _ => throw ClassUnknownException // custom Exception
}
How is it possible to call the df.as[] statement based on the String (if possible at all)?
It isn't (or based on any runtime value). You may note that all answers still need to:
have a separate branch for Foo and Bar (and one more branch for each class you'll want to add);
repeat the class name twice in the branch.
You can avoid the second:
import scala.reflect.{classTag, ClassTag}
import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

val df: DataFrame = spark.read.json(rddOfJsonStrings)

// local function defined where df and classSelector are visible
// (Option.when requires Scala 2.13)
def dfAsOption[T : Encoder : ClassTag]: Option[Dataset[T]] =
  Option.when(classSelector == classTag[T].runtimeClass.getSimpleName)(df.as[T])

dfAsOption[Foo].orElse(dfAsOption[Bar]).getOrElse(throw ClassUnknownException)
But for the first you'd need a macro if it's possible at all. I would guess it isn't.
Define a generic method and invoke it:
getDs[Foo](spark, rddOfJsonStrings)
getDs[Bar](spark, rddOfJsonStrings)
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}
def getDs[T](spark: SparkSession, rddOfJsonStrings: RDD[String])(implicit ct: ClassTag[T]): Dataset[T] = {
  // Encoders.bean needs the runtime Class[T], hence the ClassTag (classOf[T] does not compile for an abstract T)
  spark.read.json(rddOfJsonStrings).as[T](Encoders.bean(ct.runtimeClass.asInstanceOf[Class[T]]))
}
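Note that Encoders.bean expects a Java-bean-style class; for Scala case classes a product encoder is usually the better fit. A sketch under that assumption (the method name getCaseClassDs is just illustrative):

import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

def getCaseClassDs[T <: Product : TypeTag](spark: SparkSession, rddOfJsonStrings: RDD[String]): Dataset[T] =
  spark.read.json(rddOfJsonStrings).as[T](Encoders.product[T])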
Alternative approach. Highlights:
Use the simpleName of the case class itself, not of the companion object.
If classSelector is null, the solution won't fail.
case class Foo(foo: String)
case class Bar(bar: String)
Test case:
import spark.implicits._ // provides the Encoder[Foo] / Encoder[Bar] needed by .as[...]
val rddOfJsonStrings: RDD[String] = spark.sparkContext.parallelize(Seq("""{"foo":1}"""))
val classSelector: String = "Foo" // could be "Foo" or "Bar", or any other String value
val ds = classSelector match {
case foo if classOf[Foo].getSimpleName == foo =>
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[Foo]
case bar if classOf[Bar].getSimpleName == bar =>
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[Bar]
case _ => throw new UnsupportedOperationException
}
ds.show(false)
/**
* +---+
* |foo|
* +---+
* |1 |
* +---+
*/
You can use a reflective toolbox (scala.tools.reflect.ToolBox, which needs the scala-compiler dependency on the classpath):
import org.apache.spark.sql.{Dataset, SparkSession}
import scala.reflect.runtime
import scala.tools.reflect.ToolBox
object Main extends App {
val spark = SparkSession.builder
.master("local")
.appName("Spark SQL basic example")
.getOrCreate()
import spark.implicits._
val rddOfJsonStrings: Dataset[String] = spark.createDataset(Seq("""{"foo":"aaa"}"""))
// val rddOfJsonStrings: Dataset[String] = spark.createDataset(Seq("""{"bar":"bbb"}"""))
val classSelector: String = "Foo"
// val classSelector: String = "Bar"
case class Foo(foo: String)
case class Bar(bar: String)
val runtimeMirror = runtime.currentMirror
val toolbox = runtimeMirror.mkToolBox()
val res = toolbox.eval(toolbox.parse(s"""
import org.apache.spark.sql.DataFrame
import Main._
import spark.implicits._
val df: DataFrame = spark.read.json(rddOfJsonStrings)
df.as[$classSelector]
""")).asInstanceOf[Dataset[_]]
println(res) // [foo: string]
}
Notice that statically you will have a Dataset[_], not Dataset[Foo] or Dataset[Bar].
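If you need the static type back afterwards, a cast is the only option, and it is only safe when classSelector really named that class; a minimal sketch inside Main:

val typedRes: Dataset[Foo] = res.asInstanceOf[Dataset[Foo]]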
I am not able to perform an implicit conversion from an RDD to a Dataframe in a Scala program although I am importing spark.implicits._.
Any help would be appreciated.
Main Program with the implicits:
object spark1 {
def main(args: Array[String]) {
val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
import spark.implicits._
val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = spark.sparkContext
val data = sc.textFile("/TestDataB.txt")
val allSplit = data.map(line => line.split(","))
case class CC1(LAT: Double, LONG: Double)
val allData = allSplit.map( p => CC1( p(0).trim.toDouble, p(1).trim.toDouble))
val allDF = allData.toDF()
// ... other code
}
}
Error is as follows:
Error:(40, 25) value toDF is not a member of org.apache.spark.rdd.RDD[CC1]
val allDF = allData.toDF()
When you define the case class CC1 inside the main method, you hit https://issues.scala-lang.org/browse/SI-6649; toDF() then fails to locate the appropriate implicit TypeTag for that class at compile time.
You can see this in this simple example:
import scala.reflect.runtime.universe.TypeTag

case class Out()
object TestImplicits {
def main(args: Array[String]) {
case class In()
val typeTagOut = implicitly[TypeTag[Out]] // compiles
val typeTagIn = implicitly[TypeTag[In]] // does not compile: Error:(23, 31) No TypeTag available for In
}
}
Spark's relevant implicit conversion has this type parameter: [T <: Product : TypeTag] (see newProductEncoder in SQLImplicits), which means an implicit TypeTag[CC1] is required.
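For reference, that conversion looks roughly like this in the Spark 2.x sources (paraphrased; Encoder and Encoders are from org.apache.spark.sql):

implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = Encoders.product[T]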
To fix this, simply move the definition of CC1 out of the method, or out of the object entirely:
case class CC1(LAT: Double, LONG: Double)
object spark1 {
def main(args: Array[String]) {
val spark = SparkSession.builder().appName("e1").config("o1", "sv").getOrCreate()
import spark.implicits._
val data = spark.sparkContext.textFile("/TestDataB.txt")
val allSplit = data.map(line => line.split(","))
val allData = allSplit.map( p => CC1( p(0).trim.toDouble, p(1).trim.toDouble))
val allDF = allData.toDF()
// ... other code
}
}
I thought toDF is in sqlContext.implicits._, so you need to import that rather than spark.implicits._. At least that is the case in Spark 1.6.
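For completeness, the Spark 1.6-style setup would look roughly like this (a sketch, assuming an existing SparkContext named sc):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// rdd.toDF() is now available for RDDs of case classes defined at the top level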
I am trying to build a framework on top of Spark that can automatically create datasets from data stored on disk. An example of the sort of thing I would like to do is:
var sparkSession: SparkSession = _ // initialized elsewhere
def generateDataset[T <: Product : TypeTag](path: Path): Dataset[T] = {
val df: DataFrame = generateDataFrameFromPath(path)
import sparkSession.implicits._
df.as[T]
}
which works just fine. The problem I have is trying to extend that to DataFrames and other classes for which implicit encoders can be generated (like String or Int). I have tried to do something like this:
var sparkSession: SparkSession = _ // initialized elsewhere
def generateDataset[T : TypeTag](path: Path): Dataset[T] = {
val df: DataFrame = generateDataFrameFromPath(path)
typeOf[T] match {
case t if t =:= typeOf[Row] => df
case t if t <:< typeOf[Product] =>
import sparkSession.implicits._
df.as[T]
}
}
But the compiler doesn't like this even though we know T is a subclass of Product when we call .as[T].
I know that the standard approach would be to use the Encoder context bound/implicit however my calling code has no knowledge of the sparkSession until it gets the data back.
Is there a way to get this to work without having the encoder generated by the caller?
Try generating the encoder in place:
def generateDataset[T : TypeTag](path: Path) = {
  val df: DataFrame = generateDataFrameFromPath(path)
  typeOf[T] match {
    case t if t =:= typeOf[Row] => df
    case t if t <:< typeOf[Product] =>
      implicit val enc: org.apache.spark.sql.Encoder[T] =
        org.apache.spark.sql.catalyst.encoders.ExpressionEncoder()
      df.as[T]
  }
}
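A hedged usage sketch (assuming a top-level case class Foo(foo: String) and a Path value path): because one branch returns the untyped DataFrame, the inferred result type is not Dataset[T], so a caller who needs the static type still has to cast:

import org.apache.spark.sql.{DataFrame, Dataset, Row}

val foos: Dataset[Foo] = generateDataset[Foo](path).asInstanceOf[Dataset[Foo]]
val rows: DataFrame = generateDataset[Row](path).asInstanceOf[DataFrame]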
I am very new to akka-http, and I would like to stream a csv with an arbitrary number of lines.
For instance, I would like to return:
a,1
b,2
c,3
with the following code
implicit val actorSystem = ActorSystem("system")
implicit val actorMaterializer = ActorMaterializer()
val map = new mutable.HashMap[String, Int]()
map.put("a", 1)
map.put("b", 2)
map.put("c", 3)
val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)
val route =
path("test") {
complete {
HttpEntity(`text/csv`, ??? using map)
}
}
Http().bindAndHandle(route,"localhost",8080)
Thanks for your help
EDIT: Thanks to Ramon J Romero y Vigil
package test
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpCharsets.`UTF-8`
import akka.http.scaladsl.model._
import akka.http.scaladsl.server.Directives._
import akka.stream._
import akka.util.ByteString
import scala.collection.mutable
object Test{
def main(args: Array[String]) {
implicit val actorSystem = ActorSystem("system")
implicit val actorMaterializer = ActorMaterializer()
val map = new mutable.HashMap[String, Int]()
map.put("a", 1)
map.put("b", 2)
map.put("c", 3)
val mapStream = Stream.fromIterator(() => map.toIterator)
.map((k: String, v: Int) => s"$k,$v")
.map(ByteString.apply)
val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)
val route =
path("test") {
complete {
HttpEntity(`text/csv`, mapStream)
}
}
Http().bindAndHandle(route, "localhost", 8080)
}
}
With this code I have two compile errors:
Error:(29, 28) value fromIterator is not a member of object scala.collection.immutable.Stream
val mapStream = Stream.fromIterator(() => map.toIterator)
Error:(38, 11) overloaded method value apply with alternatives:
(contentType: akka.http.scaladsl.model.ContentType,file: java.io.File,chunkSize: Int)akka.http.scaladsl.model.UniversalEntity <and>
(contentType: akka.http.scaladsl.model.ContentType,data: akka.stream.scaladsl.Source[akka.util.ByteString,Any])akka.http.scaladsl.model.HttpEntity.Chunked <and>
(contentType: akka.http.scaladsl.model.ContentType,data: akka.util.ByteString)akka.http.scaladsl.model.HttpEntity.Strict <and>
(contentType: akka.http.scaladsl.model.ContentType,bytes: Array[Byte])akka.http.scaladsl.model.HttpEntity.Strict <and>
(contentType: akka.http.scaladsl.model.ContentType.NonBinary,string: String)akka.http.scaladsl.model.HttpEntity.Strict
cannot be applied to (akka.http.scaladsl.model.ContentType.WithCharset, List[akka.util.ByteString])
HttpEntity(`text/csv`, mapStream)
I used a List of tuples to get around the first issue (however, I do not know how to stream a Map in Scala).
No idea for the second.
Thanks for your help.
(I am using scala 2.11.8)
Use the HttpEntity.apply overload that takes a Source[ByteString, Any]; that apply creates a Chunked entity. You can read your file using code based on the documentation for streaming file IO with an Akka Streams Source:
import java.nio.file.Paths
import akka.stream.scaladsl._
val file = Paths.get("yourFile.csv")
val entity = HttpEntity(`text/csv`, FileIO.fromPath(file))
The stream will break your file up into chunks; the default chunk size is currently 8192 bytes.
To stream the map that you've created you can use a similar trick:
val mapStream = Source.fromIterator(() => map.toIterator)
  .map { case (k, v) => s"$k,$v\n" } // destructure the (key, value) tuple and terminate each CSV line
  .map(ByteString.apply)
val mapEntity = HttpEntity(`text/csv`, mapStream)
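Putting it together, a minimal corrected version of the edited program from the question (a sketch against the akka-http / Akka Streams APIs used above; host, port and route are taken from the question):

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.HttpCharsets.`UTF-8`
import akka.http.scaladsl.model._
import akka.http.scaladsl.server.Directives._
import akka.stream._
import akka.stream.scaladsl.Source
import akka.util.ByteString
import scala.collection.mutable

object Test {
  def main(args: Array[String]): Unit = {
    implicit val actorSystem = ActorSystem("system")
    implicit val actorMaterializer = ActorMaterializer()

    val map = mutable.HashMap("a" -> 1, "b" -> 2, "c" -> 3)

    val `text/csv` = ContentType(MediaTypes.`text/csv`, `UTF-8`)

    // Source[ByteString, _] matches the chunked HttpEntity overload
    val mapStream = Source.fromIterator(() => map.toIterator)
      .map { case (k, v) => s"$k,$v\n" }
      .map(ByteString.apply)

    val route =
      path("test") {
        complete(HttpEntity(`text/csv`, mapStream))
      }

    Http().bindAndHandle(route, "localhost", 8080)
  }
}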
I am trying to return RDD[(String, String, String)] and I am not able to do that using flatMap. I tried returning (tweetId, tweetBody, gender), but it gives me a type mismatch error. Can you guide me on how I can return RDD[(String, String, String)] from flatMap?
override def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]], config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
val idColumnName = config.getConfigString("column_name").getOrElse("id")
val bodyColumnName = config.getConfigString("column_name").getOrElse("body")
val genderColumnName = config.getConfigString("column_name").getOrElse("gender")
// convert each input element to a JsonValue
val jsonRDD = rdd.map(r => byteUtils.bytesToUTF8String(r))
val hashtagsRDD: RDD[(String,String, String)] = jsonRDD.mapPartitions(r => {
// register jackson mapper (this needs to be instantiated per partition
// since it is not serializable)
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
r.flatMap(tweet => tweet match {
case _ :: tweet =>
val rootNode = mapper.readTree(tweet)
val tweetId = rootNode.path("id").asText.split(":")(2)
val tweetBody = rootNode.path("body").asText
val tweetVector = new HashingTF().transform(tweetBody.split(" "))
val result =genderModel.predict(tweetVector)
val gender = if(result == 1.0){"Male"}else{"Female"}
(tweetId, tweetBody, gender)
// Array(1).map(x => (tweetId, tweetBody, gender))
})
})
val rowRDD: RDD[Row] = hashtagsRDD.map(x => Row(x._1,x._2,x._3))
val schema = StructType(Array(StructField(idColumnName,StringType, true),StructField(bodyColumnName, StringType, true),StructField(genderColumnName,StringType, true)))
sqlContext.createDataFrame(rowRDD, schema)
}
}
Try to use map instead of flatMap.
flatMap is used when the result type of the parameter function is a collection or an RDD,
i.e. flatMap is used when every element of the current collection is mapped to zero or more elements,
while map is used when every element of the current collection is mapped to exactly one element.
map with A => B exchanges the type A for the type B in functorial types, i.e. it transforms an RDD[A] into an RDD[B].
flatMap can be read as map then flatten in monadic types. E.g. if you have an RDD[A] and the parameter function is of type A => RDD[B], a plain map gives RDD[RDD[B]], and that nesting can be simplified to just RDD[B] via flatten.
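A tiny illustration of the difference (a sketch, assuming a SparkContext named sc):

import org.apache.spark.rdd.RDD

val lines: RDD[String] = sc.parallelize(Seq("a b", "c"))

// map: exactly one output element per input element
val arrays: RDD[Array[String]] = lines.map(_.split(" ")) // 2 elements: Array("a", "b"), Array("c")

// flatMap: zero or more output elements per input element
val words: RDD[String] = lines.flatMap(_.split(" ")) // 3 elements: "a", "b", "c"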
Here is an example of code that compiles successfully.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
class UserTransformConfig {
def getConfigString(name: String): Option[String] = ???
}
class PhaseLogger
object byteUtils {
def bytesToUTF8String(r: Array[Byte]): String = ???
}
class HashingTF {
def transform(strs: Array[String]): Array[Double] = ???
}
object genderModel {
def predict(v: Array[Double]): Double = ???
}
def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]], config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
val idColumnName = config.getConfigString("column_name").getOrElse("id")
val bodyColumnName = config.getConfigString("column_name").getOrElse("body")
val genderColumnName = config.getConfigString("column_name").getOrElse("gender")
// convert each input element to a JsonValue
val jsonRDD = rdd.map(r => byteUtils.bytesToUTF8String(r))
val hashtagsRDD: RDD[(String, String, String)] = jsonRDD.mapPartitions(r => {
// register jackson mapper (this needs to be instantiated per partition
// since it is not serializable)
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
r.map { tweet =>
val rootNode = mapper.readTree(tweet)
val tweetId = rootNode.path("id").asText.split(":")(2)
val tweetBody = rootNode.path("body").asText
val tweetVector = new HashingTF().transform(tweetBody.split(" "))
val result = genderModel.predict(tweetVector)
val gender = if (result == 1.0) {"Male"} else {"Female"}
(tweetId, tweetBody, gender)
}
})
val rowRDD: RDD[Row] = hashtagsRDD.map(x => Row(x._1, x._2, x._3))
val schema = StructType(Array(StructField(idColumnName, StringType, true), StructField(bodyColumnName, StringType, true), StructField(genderColumnName, StringType, true)))
sqlContext.createDataFrame(rowRDD, schema)
}
Please note how much I had to fill in from imagination because you did not supply a minimal example. In general, questions like this are not worth answering.