I currently do as follows to save a spark RDD result into a mysql database.
import anorm._
import java.sql.Connection
import org.apache.spark.rdd.RDD
val wordCounts: RDD[(String, Int)] = ...
def getDbConnection(dbUrl: String): Connection = {
Class.forName("com.mysql.jdbc.Driver").newInstance()
java.sql.DriverManager.getConnection(dbUrl)
}
def using[X <: {def close()}, A](resource : X)(f : X => A): A =
try { f(resource)
} finally { resource.close() }
wordCounts.map.foreachPartition(iter => {
using(getDbConnection(dbUrl)) { implicit conn =>
iter.foreach { case (word, count) =>
SQL"insert into WordCount VALUES(word, count)".executeUpdate()
}
}
})
Is there a better way?
I tried as follows, but it is so slow, compared to the first approach:
val sqlContext = new SQLContext(sc)
val wordCountSchema = StructType(List(StructField("word", StringType, nullable = false), StructField("count", IntegerType, nullable = false)))
val wordCountRowRDD = wordCounts.map(p => org.apache.spark.sql.Row(p._1,p._2))
val wordCountDF = sqlContext.createDataFrame(wordCountRowRDD, wordCountSchema)
wordCountDF.registerTempTable("WordCount")
wordCountDF.write.mode("overwrite").jdbc(dbUrl, "WordCount", new java.util.Properties())
Related
I have a DF looking like this:
time,channel,value
0,foo,5
0,bar,23
100,foo,42
...
I want a DF like this:
time,foo,bar
0,5,23
100,42,...
In Spark 2, I did it with a UDAF like this:
case class ColumnBuilderUDAF(channels: Seq[String]) extends UserDefinedAggregateFunction {
#transient lazy val inputSchema: StructType = StructType {
StructField("channel", StringType, nullable = false) ::
StructField("value", DoubleType, nullable = false) ::
Nil
}
#transient lazy val bufferSchema: StructType = StructType {
channels
.toList
.indices
.map(i => StructField("c%d".format(i), DoubleType, nullable = false))
}
#transient lazy val dataType: DataType = bufferSchema
#transient lazy val deterministic: Boolean = false
def initialize(buffer: MutableAggregationBuffer): Unit = channels.indices.foreach(buffer(_) = NaN)
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val channel = input.getAs[String](0)
val p = channels.indexOf(channel)
if (p >= 0 && p < channels.length) {
val v = input.getAs[Double](1)
if (!v.isNaN) {
buffer(p) = v
}
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
channels
.indices
.foreach { i =>
val v2 = buffer2.getAs[Double](i)
if ((!v2.isNaN) && buffer1.getAs[Double](i).isNaN) {
buffer1(i) = v2
}
}
def evaluate(buffer: Row): Any =
new GenericRowWithSchema(channels.indices.map(buffer.getAs[Double]).toArray, dataType.asInstanceOf[StructType])
}
which I use like this:
val cb = ColumnBuilderUDAF(Seq("foo", "bar"))
val dfColumnar = df.groupBy($"time").agg(cb($"channel", $"value") as "c")
and then, I rename c.c0, c.c1 etc. to foo, bar etc.
In Spark 3, UDAF is deprecated and Aggregator should be used instead. So I began to port it like this:
case class ColumnBuilder(channels: Seq[String]) extends Aggregator[(String, Double), Array[Double], Row] {
lazy val bufferEncoder: Encoder[Array[Double]] = Encoders.javaSerialization[Array[Double]]
lazy val zero: Array[Double] = channels.map(_ => Double.NaN).toArray
def reduce(b: Array[Double], a: (String, Double)): Array[Double] = {
val index = channels.indexOf(a._1)
if (index >= 0 && !a._2.isNaN) b(index) = a._2
b
}
def merge(b1: Array[Double], b2: Array[Double]): Array[Double] = {
(0 until b1.length.min(b2.length)).foreach(i => if (b1(i).isNaN) b1(i) = b2(i))
b1
}
def finish(reduction: Array[Double]): Row =
new GenericRowWithSchema(reduction.map(x => x: Any), outputEncoder.schema)
def outputEncoder: Encoder[Row] = ??? // what goes here?
}
I don't know how to implement the Encoder[Row] as Spark does not have a pre-defined one. If I simply do a straightforward approach like this:
val outputEncoder: Encoder[Row] = new Encoder[Row] {
val schema: StructType = StructType(channels.map(StructField(_, DoubleType, nullable = false)))
val clsTag: ClassTag[Row] = classTag[Row]
}
I get a ClassCastException because outputEncoder actually has to be ExpressionEncoder.
So, how do I implement this correctly? Or do I still have to use the deprecated UDAF?
You can do it with the use of groupBy and pivot
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(0, "foo", 5),
(0, "bar", 23),
(100, "foo", 42)
).toDF("time", "channel", "value")
df.groupBy("time")
.pivot("channel")
.agg(first("value"))
.show(false)
Output:
+----+----+---+
|time|bar |foo|
+----+----+---+
|100 |null|42 |
|0 |23 |5 |
+----+----+---+
I want to do clickstream sessionization on the spark data frame. Let's I have loaded the data frame which has events from multiple sessions with the following schema -
And I want to aggregate(stitch) the sessions, like this -
I have explored UDAF and Window functions but could not understand how I can use them for this specific use case. I know that partitioning the data by session id puts entire session data in a single partition but how do I aggregate them?
The idea is to aggregate all the events specific to each session as a single output record.
You can use collect_set:
def process(implicit spark: SparkSession) = {
import spark._
import org.apache.spark.sql.functions.{ concat, col, collect_set }
val seq = Seq(Row(1, 1, "startTime=1549270909"), Row(1, 1, "endTime=1549270913"))
val rdd = spark.sparkContext.parallelize(seq)
val df1 = spark.createDataFrame(rdd, StructType(List(StructField("sessionId", IntegerType, false), StructField("userId", IntegerType, false), StructField("session", StringType, false))))
df1.groupBy("sessionId").agg(collect_set("session"))
}
}
That gives you:
+---------+------------------------------------------+
|sessionId|collect_set(session) |
+---------+------------------------------------------+
|1 |[startTime=1549270909, endTime=1549270913]|
+---------+------------------------------------------+
as output.
If you need a more complex logic, it could be included in the following UDAF:
class YourComplexLogicStrings extends UserDefinedAggregateFunction {
override def inputSchema: StructType = StructType(StructField("input", StringType) :: Nil)
override def bufferSchema: StructType = StructType(StructField("pair", StringType) :: Nil)
override def dataType: DataType = StringType
override def deterministic: Boolean = true
override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val b = buffer.getAs[String](0)
val i = input.getAs[String](0)
buffer(0) = { if(b.isEmpty) b + i else b + " + " + i }
}
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val b1 = buffer1.getAs[String](0)
val b2 = buffer2.getAs[String](0)
if(!b1.isEmpty)
buffer1(0) = (b1) ++ "," ++ (b2)
else
buffer1(0) = b2
}
override def evaluate(buffer: Row): Any = {
val yourString = buffer.getAs[String](0)
// Compute your logic and return another String
yourString
}
}
def process0(implicit spark: SparkSession) = {
import org.apache.spark.sql.functions.{ concat, col, collect_set }
val agg0 = new YourComplexLogicStrings()
val seq = Seq(Row(1, 1, "startTime=1549270909"), Row(1, 1, "endTime=1549270913"))
val rdd = spark.sparkContext.parallelize(seq)
val df1 = spark.createDataFrame(rdd, StructType(List(StructField("sessionId", IntegerType, false), StructField("userId", IntegerType, false), StructField("session", StringType, false))))
df1.groupBy("sessionId").agg(agg0(col("session")))
}
It gives:
+---------+---------------------------------------+
|sessionId|yourcomplexlogicstrings(session) |
+---------+---------------------------------------+
|1 |startTime=1549270909,endTime=1549270913|
+---------+---------------------------------------+
Note that you could include very complex logic using spark sql functions directly if you want to avoid UDAFs.
I am new to Spark and I'm using it with Scala. I wrote a simple object that is loaded fine in spark-shell using :load test.scala.
import org.apache.spark.ml.feature.StringIndexer
object Collaborative{
def trainModel() ={
val data = sc.textFile("/user/PT/data/newfav.csv")
val df = data.map(_.split(",") match {
case Array(user,food,fav) => (user,food,fav.toDouble)
}).toDF("userID","foodID","favorite")
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
}
Now I want to put it in a class to pass parameters. I use the same code with class instead.
import org.apache.spark.ml.feature.StringIndexer
class Collaborative{
def trainModel() ={
val data = sc.textFile("/user/PT/data/newfav.csv")
val df = data.map(_.split(",") match {
case Array(user,food,fav) => (user,food,fav.toDouble)
}).toDF("userID","foodID","favorite")
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
}
This returns import errors.
<console>:19: error: value toDF is not a member of org.apache.spark.rdd.RDD[(String, String, Double)]
val df = data.map(_.split(",") match { case Array(user,food,fav) => (user,food,fav.toDouble) }).toDF("userID","foodID","favorite")
<console>:24: error: not found: type StringIndexer
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
What am I missing here?
Try this one, this one seems to work fine.
def trainModel() ={
val spark = SparkSession.builder().appName("test").master("local").getOrCreate()
import spark.implicits._
val data = spark.read.textFile("/user/PT/data/newfav.csv")
val df = data.map(_.split(",") match {
case Array(user,food,fav) => (user,food,fav.toDouble)
}).toDF("userID","foodID","favorite")
val userIndexer = new StringIndexer().setInputCol("userID").setOutputCol("userIndex")
}
I am trying to return RDD[(String,String,String)] and I am not able to do that using flatMap. I tried (tweetId, tweetBody, gender) and (tweetId, tweetBody, gender) but it give me an error of type mismatch can you guid me to know how I can return RDD[(String, String, String)] from flatMap
override def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]], config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
val idColumnName = config.getConfigString("column_name").getOrElse("id")
val bodyColumnName = config.getConfigString("column_name").getOrElse("body")
val genderColumnName = config.getConfigString("column_name").getOrElse("gender")
// convert each input element to a JsonValue
val jsonRDD = rdd.map(r => byteUtils.bytesToUTF8String(r))
val hashtagsRDD: RDD[(String,String, String)] = jsonRDD.mapPartitions(r => {
// register jackson mapper (this needs to be instantiated per partition
// since it is not serializable)
val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
r.flatMap(tweet => tweet match {
case _ :: tweet =>
val rootNode = mapper.readTree(tweet)
val tweetId = rootNode.path("id").asText.split(":")(2)
val tweetBody = rootNode.path("body").asText
val tweetVector = new HashingTF().transform(tweetBody.split(" "))
val result =genderModel.predict(tweetVector)
val gender = if(result == 1.0){"Male"}else{"Female"}
(tweetId, tweetBody, gender)
// Array(1).map(x => (tweetId, tweetBody, gender))
})
})
val rowRDD: RDD[Row] = hashtagsRDD.map(x => Row(x._1,x._2,x._3))
val schema = StructType(Array(StructField(idColumnName,StringType, true),StructField(bodyColumnName, StringType, true),StructField(genderColumnName,StringType, true)))
sqlContext.createDataFrame(rowRDD, schema)
}
}
Try to use map instead of flatMap.
flatMap is being used when result type of parameter function is collection or RDD
I.e. flatMap is being used when every element of current collection is mapped to zero or more elements.
While map is being used when every element of current collection is mapped to exactly one element.
map with A => B exchanges symbol A with symbol B in functorial types, i.e. transforms RDD[A] to RDD[B]
flatMap could be read as map then flatten in monadic types. E.g. you have and RDD[A] and parameter function is of type A => RDD[B] result of simple map will be RDD[RDD[B]] and that pair of occurences could be simplified to just RDD[B] via flatten
Here the example of succesfully compiled code.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
class UserTransformConfig {
def getConfigString(name: String): Option[String] = ???
}
class PhaseLogger
object byteUtils {
def bytesToUTF8String(r: Array[Byte]): String = ???
}
class HashingTF {
def transform(strs: Array[String]): Array[Double] = ???
}
object genderModel {
def predict(v: Array[Double]): Double = ???
}
def transform(sqlContext: SQLContext, rdd: RDD[Array[Byte]], config: UserTransformConfig, logger: PhaseLogger): DataFrame = {
val idColumnName = config.getConfigString("column_name").getOrElse("id")
val bodyColumnName = config.getConfigString("column_name").getOrElse("body")
val genderColumnName = config.getConfigString("column_name").getOrElse("gender")
// convert each input element to a JsonValue
val jsonRDD = rdd.map(r => byteUtils.bytesToUTF8String(r))
val hashtagsRDD: RDD[(String, String, String)] = jsonRDD.mapPartitions(r => {
// register jackson mapper (this needs to be instantiated per partition
// since it is not serializable)
val mapper = new ObjectMapper
mapper.registerModule(DefaultScalaModule)
r.map { tweet =>
val rootNode = mapper.readTree(tweet)
val tweetId = rootNode.path("id").asText.split(":")(2)
val tweetBody = rootNode.path("body").asText
val tweetVector = new HashingTF().transform(tweetBody.split(" "))
val result = genderModel.predict(tweetVector)
val gender = if (result == 1.0) {"Male"} else {"Female"}
(tweetId, tweetBody, gender)
}
})
val rowRDD: RDD[Row] = hashtagsRDD.map(x => Row(x._1, x._2, x._3))
val schema = StructType(Array(StructField(idColumnName, StringType, true), StructField(bodyColumnName, StringType, true), StructField(genderColumnName, StringType, true)))
sqlContext.createDataFrame(rowRDD, schema)
}
please note how much I should bring from my imagination because you did not supply the minimum example. In general questions like this are not worth to answer
The column names in this example from spark-sql come from the case class Person.
case class Person(name: String, age: Int)
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html
However in many cases the parameter names may be changed. This would cause columns to not be found if the file has not been updated to reflect the change.
How can I specify an appropriate mapping?
I am thinking something like:
val schema = StructType(Seq(
StructField("name", StringType, nullable = false),
StructField("age", IntegerType, nullable = false)
))
val ps: Seq[Person] = ???
val personRDD = sc.parallelize(ps)
// Apply the schema to the RDD.
val personDF: DataFrame = sqlContext.createDataFrame(personRDD, schema)
Basically, all the mapping you need to do can be achieved with DataFrame.select(...). (Here, I assume, that no type conversions need to be done.)
Given the forward- and backward-mapping as maps, the essential part is
val mapping = from.map{ (x:(String, String)) => personsDF(x._1).as(x._2) }.toArray
// personsDF your original dataframe
val mappedDF = personsDF.select( mapping: _* )
where mapping is an array of Columns with alias.
Example code
object Example {
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
case class Person(name: String, age: Int)
object Mapping {
val from = Map("name" -> "a", "age" -> "b")
val to = Map("a" -> "name", "b" -> "age")
}
def main(args: Array[String]) : Unit = {
// init
val conf = new SparkConf()
.setAppName( "Example." )
.setMaster( "local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// create persons
val persons = Seq(Person("bob", 35), Person("alice", 27))
val personsRDD = sc.parallelize(persons, 4)
val personsDF = personsRDD.toDF
writeParquet( personsDF, "persons.parquet", sc, sqlContext)
val otherPersonDF = readParquet( "persons.parquet", sc, sqlContext )
}
def writeParquet(personsDF: DataFrame, path:String, sc: SparkContext, sqlContext: SQLContext) : Unit = {
import Mapping.from
val mapping = from.map{ (x:(String, String)) => personsDF(x._1).as(x._2) }.toArray
val mappedDF = personsDF.select( mapping: _* )
mappedDF.write.parquet("/output/path.parquet") // parquet with columns "a" and "b"
}
def readParquet(path: String, sc: SparkContext, sqlContext: SQLContext) : Unit = {
import Mapping.to
val df = sqlContext.read.parquet(path) // this df has columns a and b
val mapping = to.map{ (x:(String, String)) => df(x._1).as(x._2) }.toArray
df.select( mapping: _* )
}
}
Remark
If you need to convert a dataframe back to an RDD[Person], then
val rdd : RDD[Row] = personsDF.rdd
val personsRDD : Rdd[Person] = rdd.map { r: Row =>
Person( r.getAs("person"), r.getAs("age") )
}
Alternatives
Have also a look at How to convert spark SchemaRDD into RDD of my case class?