Spark DataFrame, how to aggregate a sequence of columns? - scala

I have a dataframe and I can do the aggregation with static column names, i.e.:
df.groupBy("_c0", "_c1", "_c2", "_c3", "_c4").agg(
concat_ws(",", collect_list("_c5")),
concat_ws(",", collect_list("_c6")))
This works fine, but how do I do the same when I am given a sequence of groupBy columns and a sequence of aggregate columns?
In other words, what if I have
val toGroupBy = Seq("_c0", "_c1", "_c2", "_c3", "_c4")
val toAggregate = Seq("_c5", "_c6")
and want to perform the above?

To perform the same groupBy and aggregation using the sequences, you can do the following:
import org.apache.spark.sql.functions.expr

val aggCols = toAggregate.map(c => expr(s"""concat_ws(",", collect_list($c))"""))
df.groupBy(toGroupBy.head, toGroupBy.tail: _*).agg(aggCols.head, aggCols.tail: _*)
The expr function parses a SQL expression string into a Column. The varargs variants of groupBy and agg are then applied to the resulting lists of columns.
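For reference, an equivalent sketch that builds the aggregates with the column functions directly instead of a SQL string; the _list aliases are illustrative, not required:
import org.apache.spark.sql.functions.{collect_list, concat_ws}

// build one aggregate Column per name; alias each so the output
// columns get readable names instead of the full expression text
val aggCols2 = toAggregate.map(c => concat_ws(",", collect_list(c)).alias(s"${c}_list"))
df.groupBy(toGroupBy.head, toGroupBy.tail: _*).agg(aggCols2.head, aggCols2.tail: _*)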

Related

List of columns for orderBy in spark dataframe

I have a list of variables that contains column names. I am trying to use that to call orderBy on a dataframe.
val l = List("COL1", "COL2")
df.orderBy(l.mkString(","))
But mkString combines the column names into a single string, leading to this error -
org.apache.spark.sql.AnalysisException: cannot resolve '`COL1,COL2`' given input columns: [COL1, COL2, COL3, COL4];
How can I pass this list so that Spark looks for the separate columns "COL1", "COL2" instead of the single column "COL1,COL2"?
Thanks,
You can call orderBy for a specific column:
import org.apache.spark.sql.functions._
df.orderBy(asc("COL1")) // df.orderBy(asc(l.headOption.getOrElse("COL1")))
// OR
df.orderBy(desc("COL1"))
If you want to sort by multiple columns, you can write something like this:
val l = List($"COL1", $"COL2".desc)
df.sort(l: _*)
Passing a single String argument tells Spark to sort the data frame using one column with the given name. There is an overload that accepts multiple column names, and you can use it this way:
val l = List("COL1", "COL2")
df.orderBy(l.head, l.tail: _*)
If you care about the sort direction, use the Column version of orderBy instead:
val l = List($"COL1", $"COL2".desc)
df.orderBy(l: _*)
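If the sort directions also arrive as data rather than as literals, a minimal sketch that builds the Column list dynamically; the (name, ascending) pairs here are illustrative:
import org.apache.spark.sql.functions.col

val sortSpec = Seq(("COL1", true), ("COL2", false)) // (column name, ascending?)
val sortCols = sortSpec.map { case (name, asc) => if (asc) col(name).asc else col(name).desc }
df.orderBy(sortCols: _*)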

Spark: filter out columns and create a dataFrame with the remaining columns, and a dataFrame with the filtered columns

I am new to Spark.
I have loaded a CSV file into a Spark DataFrame, say OriginalDF
Now I want to
1. filter out some columns and create a new dataframe from the remaining columns of OriginalDF
2. create a dataFrame out of the extracted columns
How can these 2 dataframes be created in spark scala?
Using select, you can choose the columns you want:
val df2 = OriginalDF.select($"col1", $"col2", $"col3")
Using where, you can filter the rows:
val df3 = OriginalDF.where($"col1" < 10)
Another way to filter rows is filter. Both filter and where are synonyms, so you can use them interchangeably:
val df3 = OriginalDF.filter($"col1" < 10)
Note that select, where, and filter all return a new dataframe as a result.
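To split the columns into two dataframes, as the question asks, one possible sketch (the column names are illustrative): pick the columns to extract, derive the remaining ones with diff, and select each set.
import org.apache.spark.sql.functions.col

val extracted = Seq("col1", "col2")                // columns to pull out
val remaining = OriginalDF.columns.diff(extracted) // everything else

val extractedDF = OriginalDF.select(extracted.map(col): _*)
val remainingDF = OriginalDF.select(remaining.map(col): _*)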

Converting a Column of a Dataframe to Seq[Column] in Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required arguments inside agg should be a Seq[Column].
I then have a dataframe containing the following:
sequences
--------------------------
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the value in each row as an input to agg, but I cannot figure out how to access those values.
Any ideas on how to approach this?
If you can change the strings in the sequences column to be valid SQL expressions, then this is possible to solve. Spark provides a function expr that takes a SQL string and converts it into a Column. An example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail: _*)
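Putting the pieces together, a self-contained sketch; df and keys are assumed from the question, and spark is the active SparkSession:
import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._ // for toDF and as[String]

// dataframe of SQL expression strings, one aggregate per row
val exprDF = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")

// collect the strings to the driver and parse each into a Column
val aggExprs = exprDF.as[String].collect().map(expr)

df.groupBy(keys.map(col): _*).agg(aggExprs.head, aggExprs.tail: _*)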

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here: Joining two DataFrames in Spark SQL and selecting columns of only one.
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I did not find anywhere how to select a dataframe's column by their numbered index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df2cols = df2.columns
  val desiredDf2Col = df2cols(1) // the second column
  val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
    .select($"df1.*", $"df2.$desiredDf2Col")
  df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select logic that you want to use in the reduce function is similar to Joining two DataFrames in Spark SQL and selecting columns of only one, then you should do the following:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the column selected by index (Seq(1), i.e. the second column of d2) should not have the same name as any of the other dataframe's columns, otherwise the select becomes ambiguous.
You can select multiple columns as well, but remember the note above:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)

Passing a list of tuples as a parameter to a spark udf in scala

I am trying to pass a list of tuples to a udf in scala. I am not sure how exactly to define the datatype for this. I tried to pass it as a whole Row, but it can't really be resolved. I need to sort the list based on the first element of the tuple and then send back the first n elements. I have tried the following definitions for the udf:
def udfFilterPath = udf((id: Long, idList: Array[structType[Long, String]] )
def udfFilterPath = udf((id: Long, idList: Array[Tuple2[Long, String]] )
def udfFilterPath = udf((id: Long, idList: Row)
This is what the idList looks like:
[[1234,"Tony"], [2345, "Angela"]]
[[1234,"Tony"], [234545, "Ruby"], [353445, "Ria"]]
This is a dataframe with 100 rows like the above. I call the udf as follows:
testSet.select("id", "idList").withColumn("result", udfFilterPath($"id", $"idList")).show
When I print the schema for the dataframe it reads it as a array of structs. The idList itself is generated by doing a collect list over a column of tuples grouped by a key and stored in the dataframe. Any ideas on what I am doing wrong? Thanks!
When defining a UDF, you should use plain Scala types (e.g. Tuples, Primitives...) and not the Spark SQL types (e.g. StructType) as the output types.
As for the input types - this is where it gets tricky (and not too well documented) - an array of tuples would actually be a mutable.WrappedArray[Row]. So - you'll have to "convert" each row into a tuple first, then you can do the sorting and return the result.
Lastly, by your description it seems that the id column isn't used at all, so I removed it from the UDF definition, but it can easily be added back.
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val udfFilterPath = udf { idList: mutable.WrappedArray[Row] =>
  // converts the array items into tuples, sorts by the first item and returns the first two tuples:
  idList.map(r => (r.getAs[Long](0), r.getAs[String](1))).sortBy(_._1).take(2)
}
df.withColumn("result", udfFilterPath($"idList")).show(false)
+------+-------------------------------------------+----------------------------+
|id |idList |result |
+------+-------------------------------------------+----------------------------+
|1234 |[[1234,Tony], [2345,Angela]] |[[1234,Tony], [2345,Angela]]|
|234545|[[1234,Tony], [2345454,Ruby], [353445,Ria]]|[[1234,Tony], [353445,Ria]] |
+------+-------------------------------------------+----------------------------+
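For completeness, a sketch of how an input dataframe of this shape could be built for testing; the flat (id, rank, name) rows and column names are illustrative:
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._

val flat = Seq(
  (1234L, 1234L, "Tony"), (1234L, 2345L, "Angela"),
  (234545L, 1234L, "Tony"), (234545L, 2345454L, "Ruby"), (234545L, 353445L, "Ria")
).toDF("id", "rank", "name")

// collect_list over a struct column reproduces the array-of-structs idList
val testSet = flat.groupBy("id").agg(collect_list(struct($"rank", $"name")).alias("idList"))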