How to do aggregation on multiple columns at once in Spark - scala

I have a dataframe which has multiple columns. I want to group by one of the columns and aggregate the other columns all at once. Let's say the table has 4 columns, cust_id, f1, f2, f3, and I want to group by cust_id and then get avg(f1), avg(f2) and avg(f3). The table will have many columns. Any hints?
The following code is a good start, but as I have many columns it may not be a good idea to write them out manually.
df.groupBy("cust_id").agg(sum("f1"), sum("f2"), sum("f3"))

Maybe you can try mapping over a list of the column names:
import org.apache.spark.sql.functions.avg

val groupCol = "cust_id"
val aggCols = (df.columns.toSet - groupCol).map(
  colName => avg(colName).as(colName + "_avg")
).toList
df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*)
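As a quick end-to-end check of this approach, here is a minimal sketch with a made-up dataframe in the cust_id, f1, f2, f3 layout from the question (the sample data and the local SparkSession setup are assumptions for illustration):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").appName("dynamic-agg").getOrCreate()
import spark.implicits._

// made-up sample data
val df = Seq(
  ("c1", 1.0, 2.0, 3.0),
  ("c1", 3.0, 4.0, 5.0),
  ("c2", 10.0, 20.0, 30.0)
).toDF("cust_id", "f1", "f2", "f3")

val groupCol = "cust_id"
val aggCols = (df.columns.toSet - groupCol).map(c => avg(c).as(c + "_avg")).toList

df.groupBy(groupCol).agg(aggCols.head, aggCols.tail: _*).show()
// c1 -> f1_avg 2.0, f2_avg 3.0, f3_avg 4.0; c2 -> 10.0, 20.0, 30.0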
Alternatively, if needed, you can also pattern match on the schema and build the aggregations based on the column type:
import org.apache.spark.sql.functions.{avg, first}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField}

val aggCols = df.schema.collect {
  case StructField(colName, IntegerType, _, _) => avg(colName).as(colName + "_avg")
  case StructField(colName, StringType, _, _) => first(colName).as(colName + "_first")
}

Dynamically select multiple columns while joining different Dataframe in scala spark

Related: Join Dataframes dynamically using Spark Scala when JOIN columns differ

From the related question above, I was able to get the join expression working, but what if the column names are different? Then we cannot use Seq(columns) and need to build the join condition dynamically. Here left_ds and right_ds are the dataframes I want to join.
Below I want to join on the columns id = acc_id and acc_no = number:
left_ds => id, acc_no, name, ph
right_ds => acc_id, number, location
val joinKeys = "id,acc_id|acc_no,number"
val joinKeyPair: Array[(String, String)] = joinKeys.split("\\|").map(_.split(",")).map(x => x(0).toUpperCase -> x(1).toUpperCase)
val joinExpr: Column = joinKeyPair.map { case (ltable_col, rtable_col) => left_ds.col(ltable_col) === right_ds.col(rtable_col) }.reduce(_ and _)
left_ds.join(right_ds, joinExpr, "left_outer")
Above is the join expression I was trying, but it is not working. Is there a way to achieve this when the join column names are different, without using Seq? If the number of join keys increases, the code should still work dynamically.
With aliases it should work fine:
import org.apache.spark.sql.functions.col

val conditionArrays = joinKeys.split("\\|").map(c => c.split(","))
val joinExpr = conditionArrays.map { case Array(a, b) => col("a." + a) === col("b." + b) }.reduce(_ and _)
left_ds.alias("a").join(right_ds.alias("b"), joinExpr, "left_outer")
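For reference, a minimal end-to-end sketch of the alias-based join, with small made-up dataframes matching the layout in the question (the sample rows and the local SparkSession setup are illustrative assumptions):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("dynamic-join").getOrCreate()
import spark.implicits._

// made-up sample rows in the question's column layout
val left_ds = Seq((1, "A100", "Alice", "555-1234")).toDF("id", "acc_no", "name", "ph")
val right_ds = Seq((1, "A100", "NYC")).toDF("acc_id", "number", "location")

val joinKeys = "id,acc_id|acc_no,number"

// one equality condition per key pair, combined with AND
val joinExpr = joinKeys.split("\\|")
  .map(_.split(","))
  .map { case Array(l, r) => col("a." + l) === col("b." + r) }
  .reduce(_ and _)

left_ds.alias("a").join(right_ds.alias("b"), joinExpr, "left_outer").show()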

How to change column type for a list of dataframe columns

I'm trying to change the type of a list of columns for a Dataframe in Spark 1.6.0.
All the examples I have found so far, however, only allow casting a single column (df.withColumn) or all of the columns in the dataframe:
val castedDF = filteredDf.columns.foldLeft(filteredDf)((filteredDf, c) => filteredDf.withColumn(c, col(c).cast("String")))
Is there any efficient, batch way of doing this for a list of columns in the dataframe?
There is nothing wrong with withColumn* but you can use select if you prefer:
import org.apache.spark.sql.functions.col

val columnsToCast: Set[String] = Set("colA", "colB") // the columns you want to cast (example names)
val outputType: String = "string"

df.select(df.columns.map(
  c => if (columnsToCast.contains(c)) col(c).cast(outputType).as(c) else col(c) // .as(c) keeps the original column name
): _*)
* The execution plan will be the same for a single select as for chained withColumn calls.
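As a rough illustration, here is the same select-based cast applied to a small made-up dataframe (the column names, data, and local SparkSession are assumptions for the example):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("cast-columns").getOrCreate()
import spark.implicits._

val df = Seq((1, 2.5, "x"), (2, 3.5, "y")).toDF("id", "score", "label")

val columnsToCast = Set("id", "score") // cast only these columns
val outputType = "string"

val casted = df.select(df.columns.map(c =>
  if (columnsToCast.contains(c)) col(c).cast(outputType).as(c) else col(c)
): _*)

casted.printSchema() // id and score are now strings, label is unchanged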

Join two Dataframe without a common field in Spark-scala

I have two dataframes in Spark Scala, but one of them consists of a single column. I have to join them, but they have no column in common. The number of rows is the same.
val userFriends = userJson.select($"friends", $"user_id")
val x = userFriends.select($"friends")
  .rdd
  .map(x => x.getList(0).toArray.map(_.toString))
val y = x.map(z => z.count(_ => true)).toDF("friendCount")
I have to join userFriends with y
It's not possible to join them without common fields, unless you can rely on an ordering; in that case you can add a row number (with a window function) to both dataframes and join on the row number.
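For completeness, a minimal sketch of that row-number approach (the dataframe names are the ones from this question; note that the empty window pulls all rows into a single partition, so it only suits small data):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// row_number needs an ordering; a synthetic id assumes the current row order is the one you want to rely on
val w = Window.orderBy(monotonically_increasing_id())

val leftNumbered = userFriends.withColumn("rn", row_number().over(w))
val rightNumbered = y.withColumn("rn", row_number().over(w))

val joined = leftNumbered.join(rightNumbered, Seq("rn")).drop("rn")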
But in your case this does not seem necessary; just keep the user_id column in your dataframe. Something like this should work:
val userFriends = userJson.select($"friends", $"user_id")
val result_df = userFriends
  .rdd
  .map(x => (x.getList(0).toArray.map(_.toString).count(_ => true), x.getInt(1)))
  .toDF("friendsCount", "user_id")

Scala/Spark - Aggregating RDD

Just wondering how I can do the following:
Suppose I have an RDD containing (username, age, movieBought) tuples for many usernames, where some lines can have the same username and age but a different movieBought.
How can I remove the duplicated lines and transform it into (username, age, movieBought1, movieBought2...)?
Kind Regards
val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(_._3)))
val results = grouped.collect.toList
UPDATE (if each tuple also has a number-of-movies item):
val grouped = rdd.groupBy(x => (x._1, x._2)).map(x => (x._1._1, x._1._2, x._2.map(m => (m._3, m._4))))
val results = grouped.collect.toList
I was going to suggest collect and toList, but ka4eli beat me to it.
I guess you could also use groupBy / groupByKey and then a reduce / reduceByKey operation. The downside of this, of course, is that the results (movie1,movie2,movie3...) are concatenated into one string (instead of a List structure), which makes accessing them difficult.
val group = rdd.map(x => ((x.name, x.age), x.movie)).groupBy(_._1)
val result = group.map(x => (x._1._1, x._1._2, x._2.map(y => y._2).reduce(_ + "," + _)))
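A rough end-to-end sketch of that second approach, assuming the records are a small case class (the class and sample data below are made up for illustration):
// hypothetical record type and sample data, just for illustration
case class Purchase(name: String, age: Int, movie: String)

val rdd = sc.parallelize(Seq(
  Purchase("alice", 30, "Alien"),
  Purchase("alice", 30, "Heat"),
  Purchase("bob", 25, "Up")
))

val group = rdd.map(x => ((x.name, x.age), x.movie)).groupBy(_._1)
val result = group.map(x => (x._1._1, x._1._2, x._2.map(_._2).reduce(_ + "," + _)))

result.collect().foreach(println)
// e.g. (alice,30,Alien,Heat) and (bob,25,Up)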

Access joined RDD fields in a readable way

I joined 2 RDDs, and now when I'm trying to access the new RDD's fields I need to treat them as tuples. It leads to code that is not very readable. I tried to use 'type' in order to create some aliases, but it doesn't work and I still need to access the fields as tuples. Any idea how to make the code more readable?
for example - when trying to filter rows in the joined RDD:
val joinedRDD = RDD1.join(RDD2).filter(x => x._2._2._5 != "temp")
I would like to use names instead of _2, _5, etc.
Thanks
Use pattern matching wisely.
val rdd1 = sc.parallelize(List(("John", (28, true)), ("Mary", (22, true))))
val rdd2 = sc.parallelize(List(("John", List(100, 200, -20))))

rdd1
  .join(rdd2)
  .map {
    case (name, ((age, isProlonged), payments)) => (name, payments.sum)
  }
  .filter {
    case (name, sum) => sum > 0
  }
  .collect()
res0: Array[(String, Int)] = Array((John,280))
Another option is to use the DataFrame abstraction over RDDs and write SQL-style queries against named columns.
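A minimal sketch of that option, using the same sample data with named columns (the column names and the Spark 2.x SparkSession setup are assumptions for the example):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("named-fields").getOrCreate()
import spark.implicits._

// same sample data as above, with named columns instead of tuple positions
val people = Seq(("John", 28, true), ("Mary", 22, true)).toDF("name", "age", "isProlonged")
val payments = Seq(("John", 100), ("John", 200), ("John", -20)).toDF("name", "payment")

people.join(payments, Seq("name"))
  .groupBy("name")
  .agg(sum("payment").as("total"))
  .filter($"total" > 0)
  .show()
// |John|  280|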