Aggregating all Column values within a Map after groupBy in Apache Spark - scala

I've been trying this all day long with a Dataframe but no luck so far. Already did it with a RDD but it isn't really readable, so this approach would be much better when it comes to code readability.
Take this initial and result DF, both the starting DF and what I would like to obtain after peforming .groupBy().
case class SampleRow(name:String, surname:String, age:Int, city:String)
case class ResultRow(name: String, surnamesAndAges: Map[String, (Int, String)])
val df = List(
SampleRow("Rick", "Fake", 17, "NY"),
SampleRow("Rick", "Jordan", 18, "NY"),
SampleRow("Sandy", "Sample", 19, "NY")
).toDF()
val resultDf = List(
ResultRow("Rick", Map("Fake" -> (17, "NY"), "Jordan" -> (18, "NY"))),
ResultRow("Sandy", Map("Sample" -> (19, "NY")))
).toDF()
What I've tried so far is performing the following .groupBy...
val resultDf = df
.groupBy(
Name
)
.agg(
functions.map(
selectColumn(Surname),
functions.array(
selectColumn(Age),
selectColumn(City)
)
)
)
However, the following is prompt into console.
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression '`surname`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
However, doing that would result in a single entry per surname and I would like to accumulate those in a single Map as you can see in resultDf. Is there an easy way to achieve this using DFs?

you can achieve it with a single UDF to convert your data to map:
val toMap = udf((keys: Seq[String], values1: Seq[String], values2: Seq[String]) => {
keys.zip(values1.zip(values2)).toMap
})
val myResultDF = df.groupBy("name").agg(collect_list("surname") as "surname", collect_list("age") as "age", collect_list("city") as "city").withColumn("surnamesAndAges", toMap($"surname", $"age", $"city")).drop("age", "city", "surname").show(false)
+-----+--------------------------------------+
|name |surnamesAndAges |
+-----+--------------------------------------+
|Sandy|[Sample -> [19, NY]] |
|Rick |[Fake -> [17, NY], Jordan -> [18, NY]]|
+-----+--------------------------------------+

If you are not concerned about typecasting the Dataframe to DataSet (In this case ResultRow you could do something like this
val grouped =df.withColumn("surnameAndAge",struct($"surname",$"age"))
.groupBy($"name")
.agg(collect_list("surnameAndAge").alias("surnamesAndAges"))
Then you could create a User defined function which would look like
import org.apache.spark.sql._
val arrayToMap = udf[Map[String, String], Seq[Row]] {
array => array.map {
case Row(key: String, value: String) => (key, value) }.toMap
}
Now you could use a .withColumn and call this udf
val finalData = grouped.withColumn("surnamesAndAges",arrayToMap($"surnamesAndAges"))
The Dataframe would look something like this
finalData: org.apache.spark.sql.DataFrame = [name: string, surnamesAndAges: map<string,string>]

Since Spark 2.4, you don't need to use a Spark user-defined function:
import org.apache.spark.sql.functions.{col, collect_set, map_from_entries, struct}
df.withColumn("mapEntry", struct(col("surname"), struct(col("age"), col("city"))))
.groupBy("name")
.agg(map_from_entries(collect_set("mapEntry")).as("surnameAndAges"))
Explanation
You first add a column containing a Map entry from desired columns. a Map entry is merely a struct containing two columns: first column is the key and the second column is the value. You can put another struct as the value. So here your Map entry will use column surname as key, and a struct of columns age and city as value:
struct(col("surname"), struct(col("age"), col("city")))
Then, you collect all the Map entries grouped by your groupBy key, which is column name using function collect_set, and you convert this list of Map entries to a Map using function map_from_entries

Related

Passing struct type to methods or UDFS in spark sql dataframes

I have two dataframes and I have joined them and after joining in the joined dataframe , i have got two columns which are of type struct. Basically they are of Array[[String,Int]]. I need to derive a third column based on the elements of this struct type.
My code looks like below.
val bdf = Seq(
("a",1,1,10)
,("a",1,2,10)
,("a",1,3,10)
,("a",1,4,10)
,("b",1,1,20)
,("b",1,2,10)
,("a",2,3,10)
,("a",2,4,20)
,("a",2,5,20)
,("c",2,1,10)
,("c",2,2,20)
,("c",2,3,20)
).toDF("contract_number","linenumber","monthdel","open_quant")
val gbdf = bdf.withColumn("bmergedcol",struct(bdf("monthdel"),bdf("open_quant"))).groupBy("contract_number","linenumber").agg(collect_list("bmergedcol"))
val pl = Seq(
("a",1,"FLAT",10)
,("a",1,"FLAT",30)
,("a",1,"NFE",10)
,("b",1,"FLAT",10)
,("b",1,"NFE",10)
,("c",2,"NFE",10)
,("a",3,"NFE",20)
,("c",2,"FLAT",20)).toDF("connum","linnum","type","qnt")
import org.apache.spark.sql.functions._
val gpl = pl.withColumn("mergedcol",struct(pl("type"),pl("qnt"))).groupBy("connum","linnum").agg(collect_list("mergedcol"))
val jdf = gbdf.join(gpl,expr("((contract_number = connum) AND (linenumber = linnum ))"),"left_outer")
My output of jdf is like
I need to understand how can i pass the two struct type fields to some method and derive a third one from it?
Both array of structs should enter your UDF as Seq[Row], which you can then map into tuples by specifing the types of the structs (i think its string,int in your case). In this example I use pattern-matching on Row, but there are also other ways to do it (e.g. using Row#.getAs):
val myUDF = udf((arr1:Seq[Row],arr2:Seq[Row]) => {
// convert to tuples
val arr1Tup: Seq[(String, Int)] = arr1.map{case Row(s:String,i:Int) => (s,i)}
val arr2Tup: Seq[(String, Int)] = arr2.map{case Row(s:String,i:Int) => (s,i)}
// now do derive new quantities
})
Using the 2 Sequences of Tuples you can derive your new column
User-Defined Functions (aka UDF) is a feature of Spark SQL to define new Column-based functions transforming Datasets. An UDF could be used to pass the two struct type fields to derive a result.
val customUdf = udf((col1: Seq[Row], col2: Int) => {
// This is an example.
col1(1).getAs[String]("type") + "--" + col2
})
val cdf = jdf.withColumn("custom", customUdf(jdf.col("collect_list(mergedcol)"), jdf.col("linnum")))
cdf.show(10)
In above udf col1 is Seq[Row] as it an array of struct type, If only struct type has to be accessed than simply Row should be used.

Spark - provide extra parameter to udf

I'm trying to create a spark scala udf in order to transform MongoDB objects of the following shape:
Object:
"1": 50.3
"8": 2.4
"117": 1.0
Into Spark ml SparseVector.
The problem is that in order to create a SparseVector, I need one more input parameter - its size.
And in my app I keep the Vector sizes in a separate MongoDB collection.
So, I defined the following UDF function:
val mapToSparseVectorUdf = udf {
(myMap: Map[String, Double], size: Int) => {
val vb: VectorBuilder[Double] = new VectorBuilder(length = -1)
vb.use(myMap.keys.map(key => key.toInt).toArray, myMap.values.toArray, size)
vb.toSparseVector
}
}
And I was trying to call it like this:
df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), vecSize)).drop("MapColumn")
However, my IDE says "Not applicable" to that udf call.
Is there a way to make this kind of UDF that can take an extra parameter?
Udf functions would require columns to be passed as arguments and the columns passed would be parsed to primitive data types through serialization and desirialization. Thats why udf functions are expensive
If vecSize is an Integer constant then you can simply use lit inbuilt function as
df.withColumn("VecColumn", mapToSparseVectorUdf(col("MapColumn"), lit(vecSize))).drop("MapColumn")
This will do it:
def mapToSparseVectorUdf(vectorSize: Int) = udf[Vector, Map[String, Double]](
(myMap: Map[String, Double]) => {
val elements = myMap.toSeq.map {case (index, value) => (index.toInt, value)}
Vectors.sparse(vectorSize, elements)
}
)
Usage:
val data = spark.createDataFrame(Seq(
("1", Map("1" -> 50.3, "8" -> 2.4)),
("2", Map("2" -> 23.5, "3" -> 41.2))
)).toDF("id", "MapColumn")
data.withColumn("VecColumn", mapToSparseVectorUdf(10)($"MapColumn")).show(false)
NOTE:
Consider fixing your MongoDB schema! ;) The size is a member of a SparseVector, I wouldn't separate it from it's elements.

Filtering in Scala

So suppose I have the following data (only the first few rows, this data covers an entire year) -
(2014-08-31T00:05:00.000+01:00, John)
(2014-08-31T00:11:00.000+01:00, Sarah)
(2014-08-31T00:12:00.000+01:00, George)
(2014-08-31T00:05:00.000+01:00, John)
(2014-09-01T00:05:00.000+01:00, Sarah)
(2014-09-01T00:05:00.000+01:00, George)
(2014-09-01T00:05:00.000+01:00, Jason)
I would like to filter the data so that I only see what the names are for a specific date (say, 2014-09-05). I've tried doing this using the filter function in Scala but I keep receiving the following error -
error: value xxxx is not a member of (org.joda.time.DateTime, String)
Is there another way of doing this?
The filter method takes a function, called a predicate, that takes as parameter an element of your (I'm assuming) RDD, and returns a Boolean.
The returned RDD will keep only the rows for which the predicate evaluates to true.
In your case, it seems that what you want is something like
rdd.filter{
case (date, _) => date.withTimeAtStartOfDay() == new DateTime("2017-03-31")
}
I presume from the tag your question is in the context of Spark and not pure Scala. Given that, you could filter a dataframe on a date and get the associated name(s) like this:
import org.apache.spark.sql.functions._
import sparkSession.implicits._
Seq(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah")
...
)
.toDF("date", "name")
.filter(to_date('date).equalTo(Date.valueOf("2014-09-05")))
.select("name")
Note that the Date above is java.sql.Date.
Here's a function that takes a date, a list of datetime-name pairs, and returns a list of names for the date:
def getNames(d: String, l: List[(String, String)]): List[String] = {
val date = """^([^T]*).*""".r
val dateMap = list.map {
case (x, y) => ( x match { case date(z) => z }, y )
}.
groupBy(_._1) mapValues( _.map(_._2) )
dateMap.getOrElse(d, List[String]())
}
val list = List(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah"),
("2014-08-31T00:12:00.000+01:00", "George"),
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-09-01T00:05:00.000+01:00", "Sarah"),
("2014-09-01T00:05:00.000+01:00", "George"),
("2014-09-01T00:05:00.000+01:00", "Jason")
)
getNames("2014-09-01", list)
res1: List[String] = List(Sarah, George, Jason)
val dateTimeStringZero = "2014-08-12T00:05:00.000+01:00"
val dateTimeOne:DateTime = org.joda.time.format.ISODateTimeFormat.dateTime.withZoneUTC.parseDateTime(dateTimeStringZero)
import java.text.SimpleDateFormat
val df = new DateTime(new SimpleDateFormat("yyyy-MM-dd").parse("2014-08-12"))
println(dateTimeOne.getYear==df.getYear)
println(dateTimeOne.getMonthOfYear==df.getYear)
...

How to convert datatype in SPARK SQL to specific datatype but RDD result to a specifical class

I am reading a csv file and need to create a RDDSchema
I read the file by using the sqlContext.csvFile
val testfile = sqlContext.csvFile("file")
testfile.registerTempTable(testtable)
I wanted to change the pick some of the fields and return an RDD type of those fields
For example : class Test(ID: String, order_date: Date, Name: String, value: Double)
Using sqlContext.sql("Select col1, col2, col3, col4 FROM ...)
val testfile = sqlContext.sql("Select col1, col2, col3, col4 FROM testtable).collect
testfile.getClass
Class[_ <: Array[org.apache.spark.sql.Row]] = class [Lorg.apache.spark.sql.Row;
So I wanted to change col1 to double, col2 to a date , and column3 to string?
Is there a way to do this in the sqlContext.sql or I have to run a map function to the result and then turn it back to RDD..
I tried to do the do the item in one statement and I got this error:
val old_rdd : RDD[Test] = sqlContext.sql("SELECT col, col2, col3,col4 FROM testtable").collect.map(t => (t(0) : String ,dateFormat.parse(dateFormat.format(1)),t(2) : String, t(3) : Double))
The issue I am having is the assignment does not result on RDD[Test] where Test is a class defined
The error is saying that the map command is coming out as an Array Class and not an RDD Class
found : Array[edu.model.Test]
[error] required: org.apache.spark.rdd.RDD[edu.model.Test]
Lets say you have a case class like this:
case class Test(
ID: String, order_date: java.sql.Date, Name: String, value: Double)
Since you load your data with csvFile with default parameters it doesn't perform any schema inference and your data is stored as plain strings. Lets assume that there are no other fields:
val df = sc.parallelize(
("ORD1", "2016-01-02", "foo", "2.23") ::
("ORD2", "2016-07-03", "bar", "9.99") :: Nil
).toDF("col1", "col2", "col3", "col4")
Your attempt to use map is wrong for more than one reason:
function you use annotates individual values with incorrect types. Not only Row.apply is of type Int => Any but also your data table contains shouldn't contain any Double values
since you collect (which doesn't makes sense here) you fetch all data to the driver and result is local Array not RDD
finally, if all previous issues were resolved, (String, Date, String, Double) is clearly not a Test
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
val casted = df.select(
$"col1".alias("ID"),
$"col2".cast("date").alias("order_date"),
$"col3".alias("name"),
$"col4".cast("double").alias("value")
)
val tests: RDD[Test] = casted.map {
case Row(id: String, date: java.sql.Date, name: String, value: Double) =>
Test(id, date, name, value)
}
You can also try to use new Dataset API but it is far from stable:
casted.as[Test].rdd

Converting multiple different columns to Map column with Spark Dataframe scala

I have a data frame with column: user, address1, address2, address3, phone1, phone2 and so on.
I want to convert this data frame to - user, address, phone where address = Map("address1" -> address1.value, "address2" -> address2.value, "address3" -> address3.value)
I was able to convert the columns to map using:
val mapData = List("address1", "address2", "address3")
df.map(_.getValuesMap[Any](mapData))
but I am not sure how to add this to my df.
I am new to spark and scala and could really use some help here.
Spark >= 2.0
You can skip udf and use map (create_map in Python) SQL function:
import org.apache.spark.sql.functions.map
df.select(
map(mapData.map(c => lit(c) :: col(c) :: Nil).flatten: _*).alias("a_map")
)
Spark < 2.0
As far as I know there is no direct way to do it. You can use an UDF like this:
import org.apache.spark.sql.functions.{udf, array, lit, col}
val df = sc.parallelize(Seq(
(1L, "addr1", "addr2", "addr3")
)).toDF("user", "address1", "address2", "address3")
val asMap = udf((keys: Seq[String], values: Seq[String]) =>
keys.zip(values).filter{
case (k, null) => false
case _ => true
}.toMap)
val keys = array(mapData.map(lit): _*)
val values = array(mapData.map(col): _*)
val dfWithMap = df.withColumn("address", asMap(keys, values))
Another option, which doesn't require UDFs, is to struct field instead of map:
val dfWithStruct = df.withColumn("address", struct(mapData.map(col): _*))
The biggest advantage is that it can easily handle values of different types.