I am trying to apply substring(column, numOne, numTwo) several times to a column of an original DataFrame and build a new DataFrame as the UNION of all the resulting subsets.
Below is the code I've come up with:
def main(args: Array[String]): Unit = {
//To Log only ERRORS
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.appName("PopularMoviesDS")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.master("local[*]")
.getOrCreate()
var swing = 2
val dataframeInt = spark.createDataFrame(Seq(
(1, "Chandler", "Pasadena", "US")
)).toDF("id", "name", "city", "country")
var returnDf:DataFrame = spark.emptyDataFrame.withColumn("name",functions.lit(null))
def dataFrameCreatorOrg(df:DataFrame): DataFrame ={
val map:Map[Int, Seq[String]] = Map(1 -> Seq("1","4"), 2 -> Seq("2","5"))
var returnDf:DataFrame = spark.emptyDataFrame.withColumn("name",functions.lit(null))
while(swing>0){
returnDf = returnDf.union(df.selectExpr(s"substring(name,${map(swing)(0)},${map(swing)(1)})"))
swing -= 1
}
returnDf
}
dataFrameCreatorOrg(dataframeInt).show()
+-----+
| name|
+-----+
|handl|
| Chan|
+-----+
The above code works as I expected, but I want to do the same thing using tail recursion. Code below:
var swing = 2
val dataframeInt = spark.createDataFrame(Seq(
(1, "Chandler", "Pasadena", "US")
)).toDF("id", "name", "city", "country")
var returnDf:DataFrame = spark.emptyDataFrame.withColumn("name",functions.lit(null))
def dataFrameCreator(df:DataFrame): DataFrame ={
val map:Map[Int, Seq[String]] = Map(1 -> Seq("1","4"), 2 -> Seq("2","5"))
returnDf = returnDf.union(df.selectExpr(s"substring(name,${map(swing)(0)},${map(swing)(1)})"))
returnDf
}
@tailrec
def bigUnionHelper(num: Int, df: DataFrame): DataFrame = {
if (num<0) df
else bigUnionHelper(num-1, dataFrameCreator(dataframeInt))
}
bigUnionHelper(swing, dataframeInt).show()
//Result:
+-----+
| name|
+-----+
|handl|
|handl|
|handl|
+-----+
I totally get that there is room for optimization, but I am unable to figure out why the tail-recursive bigUnionHelper does not give the same result as the first function.
Any help is appreciated. Thank you so much in advance.
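I think it should be this way. In your version, dataFrameCreator always reads the outer swing, which never changes, and ignores num, so every call unions the same substring; the recursion also runs for num = 2, 1 and 0, which is why you get three identical rows. Pass num into the helper and accumulate the intermediate DataFrames instead: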
import scala.annotation.tailrec
val swing = 2
val dataframeInt = spark.createDataFrame(Seq(
(1, "Chandler", "Pasadena", "US")
)).toDF("id", "name", "city", "country")
def bigUnionHelper(df: DataFrame, num: Int): DataFrame = {
@tailrec
def dataFrameCreator(df: DataFrame, num:Int, acc:List[DataFrame] = List()): List[DataFrame] = {
if (num < 1) acc
else {
val map: Map[Int, Seq[String]] = Map(1 -> Seq("1", "4"), 2 -> Seq("2", "5"))
val tempDf = df.selectExpr(s"substring(name,${map(num).head},${map(num)(1)})")
dataFrameCreator(df, num -1, tempDf :: acc)
}
}
dataFrameCreator(df, num).reduce(_ union _)
}
bigUnionHelper(dataframeInt, swing).show()
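If recursion is not a hard requirement, the same result can probably be built without any mutable state by mapping over the ranges and reducing with union. This is only a minimal sketch, assuming the (start, length) pairs from the question; the names SubstringUnion and unionOfSubstrings are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SubstringUnion {
  // (start position, length) pairs, mirroring the Map in the question
  val ranges: Seq[(Int, Int)] = Seq((2, 5), (1, 4))

  // One single-column DataFrame per range, combined with union
  def unionOfSubstrings(df: DataFrame): DataFrame =
    ranges
      .map { case (pos, len) => df.selectExpr(s"substring(name, $pos, $len) AS name") }
      .reduce(_ union _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SubstringUnion")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "Chandler", "Pasadena", "US")).toDF("id", "name", "city", "country")
    unionOfSubstrings(df).show()
    spark.stop()
  }
}
```

Because nothing is shared between calls, there is no outer var that can go stale, which was the cause of the repeated rows in the question.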
I have a DF looking like this:
time,channel,value
0,foo,5
0,bar,23
100,foo,42
...
I want a DF like this:
time,foo,bar
0,5,23
100,42,...
In Spark 2, I did it with a UDAF like this:
case class ColumnBuilderUDAF(channels: Seq[String]) extends UserDefinedAggregateFunction {
@transient lazy val inputSchema: StructType = StructType {
StructField("channel", StringType, nullable = false) ::
StructField("value", DoubleType, nullable = false) ::
Nil
}
@transient lazy val bufferSchema: StructType = StructType {
channels
.toList
.indices
.map(i => StructField("c%d".format(i), DoubleType, nullable = false))
}
@transient lazy val dataType: DataType = bufferSchema
@transient lazy val deterministic: Boolean = false
def initialize(buffer: MutableAggregationBuffer): Unit = channels.indices.foreach(buffer(_) = NaN)
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val channel = input.getAs[String](0)
val p = channels.indexOf(channel)
if (p >= 0 && p < channels.length) {
val v = input.getAs[Double](1)
if (!v.isNaN) {
buffer(p) = v
}
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
channels
.indices
.foreach { i =>
val v2 = buffer2.getAs[Double](i)
if ((!v2.isNaN) && buffer1.getAs[Double](i).isNaN) {
buffer1(i) = v2
}
}
def evaluate(buffer: Row): Any =
new GenericRowWithSchema(channels.indices.map(buffer.getAs[Double]).toArray, dataType.asInstanceOf[StructType])
}
which I use like this:
val cb = ColumnBuilderUDAF(Seq("foo", "bar"))
val dfColumnar = df.groupBy($"time").agg(cb($"channel", $"value") as "c")
and then, I rename c.c0, c.c1 etc. to foo, bar etc.
In Spark 3, UDAF is deprecated and Aggregator should be used instead. So I began to port it like this:
case class ColumnBuilder(channels: Seq[String]) extends Aggregator[(String, Double), Array[Double], Row] {
lazy val bufferEncoder: Encoder[Array[Double]] = Encoders.javaSerialization[Array[Double]]
lazy val zero: Array[Double] = channels.map(_ => Double.NaN).toArray
def reduce(b: Array[Double], a: (String, Double)): Array[Double] = {
val index = channels.indexOf(a._1)
if (index >= 0 && !a._2.isNaN) b(index) = a._2
b
}
def merge(b1: Array[Double], b2: Array[Double]): Array[Double] = {
(0 until b1.length.min(b2.length)).foreach(i => if (b1(i).isNaN) b1(i) = b2(i))
b1
}
def finish(reduction: Array[Double]): Row =
new GenericRowWithSchema(reduction.map(x => x: Any), outputEncoder.schema)
def outputEncoder: Encoder[Row] = ??? // what goes here?
}
I don't know how to implement the Encoder[Row] as Spark does not have a pre-defined one. If I simply do a straightforward approach like this:
val outputEncoder: Encoder[Row] = new Encoder[Row] {
val schema: StructType = StructType(channels.map(StructField(_, DoubleType, nullable = false)))
val clsTag: ClassTag[Row] = classTag[Row]
}
I get a ClassCastException because outputEncoder actually has to be ExpressionEncoder.
So, how do I implement this correctly? Or do I still have to use the deprecated UDAF?
You can do it with the use of groupBy and pivot
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(0, "foo", 5),
(0, "bar", 23),
(100, "foo", 42)
).toDF("time", "channel", "value")
df.groupBy("time")
.pivot("channel")
.agg(first("value"))
.show(false)
Output:
+----+----+---+
|time|bar |foo|
+----+----+---+
|100 |null|42 |
|0 |23 |5 |
+----+----+---+
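Since the question already knows its channels up front (the Seq passed to the UDAF), the pivot values can be given explicitly. A small sketch under that assumption; PivotChannels and pivotChannels are illustrative names, and the values are made Double here only to match the question's DoubleType.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.first

object PivotChannels {
  // Passing the channel list explicitly fixes the output column order and
  // spares Spark the extra pass that discovers the distinct pivot values.
  def pivotChannels(df: DataFrame, channels: Seq[String]): DataFrame =
    df.groupBy("time").pivot("channel", channels).agg(first("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PivotChannels")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((0, "foo", 5.0), (0, "bar", 23.0), (100, "foo", 42.0))
      .toDF("time", "channel", "value")
    pivotChannels(df, Seq("foo", "bar")).show(false)
    spark.stop()
  }
}
```

With the list given, the columns come out as time, foo, bar, so the renaming step from c.c0, c.c1 is no longer needed.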
I want to do clickstream sessionization on a Spark DataFrame. Let's say I have loaded a DataFrame that holds events from multiple sessions, with the following schema -
And I want to aggregate(stitch) the sessions, like this -
I have explored UDAF and Window functions but could not understand how I can use them for this specific use case. I know that partitioning the data by session id puts entire session data in a single partition but how do I aggregate them?
The idea is to aggregate all the events specific to each session as a single output record.
You can use collect_set:
def process(implicit spark: SparkSession) = {
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.collect_set
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val seq = Seq(Row(1, 1, "startTime=1549270909"), Row(1, 1, "endTime=1549270913"))
val rdd = spark.sparkContext.parallelize(seq)
val df1 = spark.createDataFrame(rdd, StructType(List(StructField("sessionId", IntegerType, false), StructField("userId", IntegerType, false), StructField("session", StringType, false))))
df1.groupBy("sessionId").agg(collect_set("session"))
}
That gives you:
+---------+------------------------------------------+
|sessionId|collect_set(session) |
+---------+------------------------------------------+
|1 |[startTime=1549270909, endTime=1549270913]|
+---------+------------------------------------------+
as output.
If you need a more complex logic, it could be included in the following UDAF:
class YourComplexLogicStrings extends UserDefinedAggregateFunction {
override def inputSchema: StructType = StructType(StructField("input", StringType) :: Nil)
override def bufferSchema: StructType = StructType(StructField("pair", StringType) :: Nil)
override def dataType: DataType = StringType
override def deterministic: Boolean = true
override def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val b = buffer.getAs[String](0)
val i = input.getAs[String](0)
buffer(0) = { if(b.isEmpty) b + i else b + " + " + i }
}
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val b1 = buffer1.getAs[String](0)
val b2 = buffer2.getAs[String](0)
if(!b1.isEmpty)
buffer1(0) = (b1) ++ "," ++ (b2)
else
buffer1(0) = b2
}
override def evaluate(buffer: Row): Any = {
val yourString = buffer.getAs[String](0)
// Compute your logic and return another String
yourString
}
}
def process0(implicit spark: SparkSession) = {
import org.apache.spark.sql.functions.{ concat, col, collect_set }
val agg0 = new YourComplexLogicStrings()
val seq = Seq(Row(1, 1, "startTime=1549270909"), Row(1, 1, "endTime=1549270913"))
val rdd = spark.sparkContext.parallelize(seq)
val df1 = spark.createDataFrame(rdd, StructType(List(StructField("sessionId", IntegerType, false), StructField("userId", IntegerType, false), StructField("session", StringType, false))))
df1.groupBy("sessionId").agg(agg0(col("session")))
}
It gives:
+---------+---------------------------------------+
|sessionId|yourcomplexlogicstrings(session) |
+---------+---------------------------------------+
|1 |startTime=1549270909,endTime=1549270913|
+---------+---------------------------------------+
Note that you could include very complex logic using spark sql functions directly if you want to avoid UDAFs.
I was wondering if there is a much more efficient way to do this job than the following:
val df0 = df.select($"id", explode($"event.x0") as "n_0" ).groupBy("id").agg(sum("n_0") as "0")
val df1 = df.select($"id", explode($"event.x1") as "n_1").groupBy("id").agg(sum("n_1") as "1")
val df2 = df.select($"id", explode($"event.x2") as "n_2").groupBy("id").agg(sum("n_2") as "2")
val df3 = df.select($"id", explode($"event.x3") as "n_3").groupBy("id").agg(sum("n_3") as "3")
val final_df = df.join(df0, "id").join(df1, "id").join(df2, "id").join(df3, "id")
I was trying something like this:
val df_x = df.select($"id", $"event", explode($"event.x0") as "0" )
.select($"id", $"event", $"0", explode($"event.x1") as "1")
.select($"id", $"event", $"0", $"1", explode($"event.x2") as "2")
.groupBy("id")
.agg(sum("0") as "0", sum("1") as "1", sum("2") as "2")
val final_df = df.join(df_x, "id")
Although it runs much faster, the aggregated values are wrong, so it does not actually work :(
Any ideas on how to reduce the number of joins?
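A likely reason for the wrong values: each chained explode multiplies the rows produced by the previous one, so every element of x0 is repeated once per element of x1 (and again per element of x2) before the sums are taken. The plain-Scala model below only illustrates that effect; Event and chainedExplode are hypothetical stand-ins, not Spark code.

```scala
object ExplodeCrossProduct {
  // Hypothetical stand-in for one row's `event` struct with two array fields
  final case class Event(x0: Seq[Int], x1: Seq[Int])

  // Models select(explode(x0)) followed by select(explode(x1)):
  // every element of x0 gets paired with every element of x1.
  def chainedExplode(e: Event): Seq[(Int, Int)] =
    for { a <- e.x0; b <- e.x1 } yield (a, b)

  def main(args: Array[String]): Unit = {
    val e = Event(Seq(1, 2), Seq(10, 20, 30))
    val rows = chainedExplode(e)
    println(s"rows after chained explode: ${rows.size}")
    println(s"x0 summed over exploded rows: ${rows.map(_._1).sum}, true sum: ${e.x0.sum}")
    println(s"x1 summed over exploded rows: ${rows.map(_._2).sum}, true sum: ${e.x1.sum}")
  }
}
```

In this model the x0 total is inflated by the length of x1 (9 instead of 3) and the x1 total by the length of x0 (120 instead of 60), which matches the kind of overcounting described above.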
Assuming each id doesn't have too many matching records, you can use the collect_list aggregation function to collect all matching arrays into an array-of-arrays, and then a User Defined Function to sum over these nested arrays:
val flattenAndSum = udf[Int, mutable.Seq[mutable.Seq[Int]]] { seqOfArrays => seqOfArrays.flatten.sum }
val sums = df.groupBy($"id").agg(
collect_list($"event.x0") as "arr0",
collect_list($"event.x1") as "arr1",
collect_list($"event.x2") as "arr2",
collect_list($"event.x3") as "arr3"
).select($"id",
flattenAndSum($"arr0") as "0",
flattenAndSum($"arr1") as "1",
flattenAndSum($"arr2") as "2",
flattenAndSum($"arr3") as "3"
)
df.join(sums, "id")
Alternatively, if that assumption cannot be made, you can create a User Defined Aggregation Function to perform the flatten-and-sum on the fly. This is safer and potentially faster but requires a bit more work:
// implement a UDAF:
class FlattenAndSum extends UserDefinedAggregateFunction {
override def inputSchema: StructType = new StructType().add("arr", ArrayType(IntegerType))
override def bufferSchema: StructType = new StructType().add("sum", IntegerType)
override def dataType: DataType = IntegerType
override def deterministic: Boolean = true
override def initialize(buffer: MutableAggregationBuffer): Unit = buffer.update(0, 0)
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val current = buffer.getAs[Int](0)
val toAdd = input.getAs[Seq[Int]](0).sum
buffer.update(0, current + toAdd)
}
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1.update(0, buffer1.getAs[Int](0) + buffer2.getAs[Int](0))
}
override def evaluate(buffer: Row): Any = buffer.getAs[Int](0)
}
// use it in aggregation:
val flattenAndSum = new FlattenAndSum()
val sums = df.groupBy($"id").agg(
flattenAndSum($"event.x0") as "0",
flattenAndSum($"event.x1") as "1",
flattenAndSum($"event.x2") as "2",
flattenAndSum($"event.x3") as "3"
)
df.join(sums, "id")
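On Spark 2.4 or later, a third option may be to sum each array per row with the SQL higher-order function aggregate and then sum those totals per id, which avoids explode, the UDF and the UDAF. This is a sketch under assumptions: the case classes Event and Record are a guessed shape for the input, the arrays are assumed to hold Int values (the zero literal 0 must match the element type), and ArraySums and sumsPerId are illustrative names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{expr, sum}

// Assumed input shape: a struct column `event` holding integer arrays
final case class Event(x0: Seq[Int], x1: Seq[Int])
final case class Record(id: Int, event: Event)

object ArraySums {
  // For each field: total the array inside every row, then sum the totals per id
  def sumsPerId(df: DataFrame, fields: Seq[String]): DataFrame = {
    val aggs = fields.map { f =>
      sum(expr(s"aggregate(event.$f, 0, (acc, x) -> acc + x)")).as(f)
    }
    df.groupBy("id").agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ArraySums")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Record(1, Event(Seq(1, 2), Seq(10))),
      Record(1, Event(Seq(3), Seq(20, 30))),
      Record(2, Event(Seq(4), Seq(5)))
    ).toDF()

    sumsPerId(df, Seq("x0", "x1")).show()
    spark.stop()
  }
}
```

The result has one row per id, so a single join back to df is still enough if the original columns are needed.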
I have the following DataFrame:
|-----id-------|----value------|-----desc------|
| 1 | v1 | d1 |
| 1 | v2 | d2 |
| 2 | v21 | d21 |
| 2 | v22 | d22 |
|--------------|---------------|---------------|
I want to transform it into:
|-----id-------|----value------|-----desc------|
| 1 | v1;v2 | d1;d2 |
| 2 | v21;v22 | d21;d22 |
|--------------|---------------|---------------|
Is it possible through DataFrame operations?
What would the RDD transformation look like in this case?
I presume rdd.reduce is the key, but I have no idea how to adapt it to this scenario.
You can transform your data using spark sql
case class Test(id: Int, value: String, desc: String)
val data = sc.parallelize(Seq((1, "v1", "d1"), (1, "v2", "d2"), (2, "v21", "d21"), (2, "v22", "d22")))
.map(line => Test(line._1, line._2, line._3))
.toDF()
data.registerTempTable("data")
val result = sqlContext.sql("select id,concat_ws(';', collect_list(value)),concat_ws(';', collect_list(desc)) from data group by id")
result.show
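The same query can also be written with DataFrame operations only, which is what the question asked about. A minimal sketch, assuming a Spark 2.x or later SparkSession; ConcatPerId and concatPerId are illustrative names. Note that collect_list does not guarantee the order of elements within a group.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_list, concat_ws}

object ConcatPerId {
  // Collect the strings of each group into a list, then join them with ';'
  def concatPerId(df: DataFrame): DataFrame =
    df.groupBy("id").agg(
      concat_ws(";", collect_list("value")).as("value"),
      concat_ws(";", collect_list("desc")).as("desc")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ConcatPerId")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "v1", "d1"), (1, "v2", "d2"), (2, "v21", "d21"), (2, "v22", "d22"))
      .toDF("id", "value", "desc")
    concatPerId(df).show()
    spark.stop()
  }
}
```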
Suppose you have something like
import scala.util.Random
val sqlc: SQLContext = ???
case class Record(id: Long, value: String, desc: String)
val testData = for {
(i, j) <- List.fill(30)(Random.nextInt(5), Random.nextInt(5))
} yield Record(i, s"v$i$j", s"d$i$j")
val df = sqlc.createDataFrame(testData)
You can easily join data as:
import sqlc.implicits._
def aggConcat(col: String) = df
.map(row => (row.getAs[Long]("id"), row.getAs[String](col)))
.aggregateByKey(Vector[String]())(_ :+ _, _ ++ _)
val result = aggConcat("value").zip(aggConcat("desc")).map{
case ((id, value), (_, desc)) => (id, value, desc)
}.toDF("id", "values", "descs")
If you would like to have concatenated strings instead of arrays, you can run later
import org.apache.spark.sql.functions._
val resultConcat = result
.withColumn("values", concat_ws(";", $"values"))
.withColumn("descs" , concat_ws(";", $"descs" ))
If working with DataFrames, use UDAF
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}
class ConcatStringsUDAF(InputColumnName: String, sep:String = ",") extends UserDefinedAggregateFunction {
def inputSchema: StructType = StructType(StructField(InputColumnName, StringType) :: Nil)
def bufferSchema: StructType = StructType(StructField("concatString", StringType) :: Nil)
def dataType: DataType = StringType
def deterministic: Boolean = true
def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""
private def concatStrings(str1: String, str2: String): String = {
(str1, str2) match {
case (s1: String, s2: String) => Seq(s1, s2).filter(_ != "").mkString(sep)
case (null, s: String) => s
case (s: String, null) => s
case _ => ""
}
}
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val acc1 = buffer.getAs[String](0)
val acc2 = input.getAs[String](0)
buffer(0) = concatStrings(acc1, acc2)
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val acc1 = buffer1.getAs[String](0)
val acc2 = buffer2.getAs[String](0)
buffer1(0) = concatStrings(acc1, acc2)
}
def evaluate(buffer: Row): Any = buffer.getAs[String](0)
}
And then use this way
val stringConcatener = new ConcatStringsUDAF("Category_ID", ",")
data.groupBy("aaid", "os_country").agg(stringConcatener(data("X")).as("Xs"))
As of Spark 1.6, have a look at Datasets and Aggregator.
After some research, I came up with something like this:
val data = sc.parallelize(
List(
("1", "v1", "d1"),
("1", "v2", "d2"),
("2", "v21", "d21"),
("2", "v22", "d22")))
.map{ case(id, value, desc)=>((id), (value, desc))}
.reduceByKey((x,y)=>(x._1+";"+y._1, x._2+";"+y._2))
.map{ case(id,(value, desc))=>(id, value, desc)}.toDF("id", "value","desc")
.show()
that leaves me with:
+---+-------+-------+
| id| value| desc|
+---+-------+-------+
| 1| v1;v2| d1;d2|
| 2|v21;v22|d21;d22|
+---+-------+-------+
The column names in this example from spark-sql come from the case class Person.
case class Person(name: String, age: Int)
val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.
// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html
However, in many cases the parameter names may change. This would cause columns not to be found if the file has not been updated to reflect the change.
How can I specify an appropriate mapping?
I am thinking something like:
val schema = StructType(Seq(
StructField("name", StringType, nullable = false),
StructField("age", IntegerType, nullable = false)
))
val ps: Seq[Person] = ???
val personRDD = sc.parallelize(ps)
// Apply the schema to the RDD.
val personDF: DataFrame = sqlContext.createDataFrame(personRDD, schema)
Basically, all the mapping you need can be achieved with DataFrame.select(...). (Here, I assume that no type conversions need to be done.)
Given the forward- and backward-mapping as maps, the essential part is
val mapping = from.map{ (x:(String, String)) => personsDF(x._1).as(x._2) }.toArray
// personsDF your original dataframe
val mappedDF = personsDF.select( mapping: _* )
where mapping is an array of Columns with alias.
Example code
object Example {
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.{DataFrame, SQLContext}
case class Person(name: String, age: Int)
object Mapping {
val from = Map("name" -> "a", "age" -> "b")
val to = Map("a" -> "name", "b" -> "age")
}
def main(args: Array[String]) : Unit = {
// init
val conf = new SparkConf()
.setAppName( "Example." )
.setMaster( "local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// create persons
val persons = Seq(Person("bob", 35), Person("alice", 27))
val personsRDD = sc.parallelize(persons, 4)
val personsDF = personsRDD.toDF
writeParquet( personsDF, "persons.parquet", sc, sqlContext)
val otherPersonDF = readParquet( "persons.parquet", sc, sqlContext )
}
def writeParquet(personsDF: DataFrame, path:String, sc: SparkContext, sqlContext: SQLContext) : Unit = {
import Mapping.from
val mapping = from.map{ (x:(String, String)) => personsDF(x._1).as(x._2) }.toArray
val mappedDF = personsDF.select( mapping: _* )
mappedDF.write.parquet(path) // parquet with columns "a" and "b"
}
def readParquet(path: String, sc: SparkContext, sqlContext: SQLContext) : DataFrame = {
import Mapping.to
val df = sqlContext.read.parquet(path) // this df has columns a and b
val mapping = to.map{ (x:(String, String)) => df(x._1).as(x._2) }.toArray
df.select( mapping: _* )
}
}
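A variation on the same idea, sketched under the assumption of a Spark 2.x or later SparkSession: derive the backward mapping from the forward one with swap, and apply either of them by folding withColumnRenamed over the DataFrame. ColumnMapping and rename are illustrative names, not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ColumnMapping {
  // Forward mapping (case class field -> stored column) and its inverse
  val from: Map[String, String] = Map("name" -> "a", "age" -> "b")
  val to: Map[String, String] = from.map(_.swap)

  // Rename one column per mapping entry; names missing from the schema are skipped
  def rename(df: DataFrame, mapping: Map[String, String]): DataFrame =
    mapping.foldLeft(df) { case (acc, (oldName, newName)) =>
      acc.withColumnRenamed(oldName, newName)
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ColumnMapping")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val persons = Seq(("bob", 35), ("alice", 27)).toDF("name", "age")
    val stored = rename(persons, from)   // columns a, b
    val restored = rename(stored, to)    // columns name, age again
    stored.printSchema()
    restored.printSchema()
    spark.stop()
  }
}
```

Unlike the select-based version, this keeps any columns that are not mentioned in the mapping, which may or may not be what you want.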
Remark
If you need to convert a dataframe back to an RDD[Person], then
val rdd : RDD[Row] = personsDF.rdd
val personsRDD : RDD[Person] = rdd.map { r: Row =>
Person( r.getAs[String]("name"), r.getAs[Int]("age") )
}
Alternatives
Also have a look at: How to convert spark SchemaRDD into RDD of my case class?