Dataframe : GroupBy by list of column names [duplicate] - scala

This question already has an answer here:
Scala-Spark Dynamically call groupby and agg with parameter values
(1 answer)
Closed 4 years ago.
I have a Dataframe with multiple columns and a List of column names.
I want to process my Dataframe by grouping it according to my list.
Here is an example of what I am trying to do :
val tagList = List("col1","col3","col5")
var tagsForGroupBy = tagList(0)
if(tagList.length>1){
for(i <- 1 to tagList.length-1){
tagsForGroupBy = tagsForGroupBy+","+tags(i)
}
}
// df is a Dataframe with schema (col0, col1, col2, col3, col4, col5)
df.groupBy("col0",tagsForGroupBy)
I understand why it does not work, but I don't know how to make it work.
What is the best solution to do that ?
EDIT :
Here is a more complete example of what I am doing (including SCouto solution) :
I have my tagList that contains some column names ("col3","col5"). I also want to include "col0" and "col1" in my groupBy, independently of my list.
After my groupBy and my aggregations, I want to select all columns used for group By and the new columns from aggregation.
val tagList = List("col3","col5")
val tmpListForGroup = new ListBuffer[String]()
val tmpListForSelect = new ListBuffer[String]()
tmpListForGroup +=tagList (0)
tmpListForSelect +=tagList (0)
for(i <- 1 to tagList .length-1){
tmpListForGroup +=(tagList (i))
tmpListForSelect +=(tagList (i))
}
tmpListForGroup +="col0"
tmpListForGroup +="col1"
tmpListForSelect +="aggValue1"
tmpListForSelect +="aggValue2"
// df is a Dataframe with schema (col0, col1, col2, col3, col4, col5)
df.groupBy(tmpListForGroup.head,tmpListForGroup.tail:_*)
.agg(
[aggFunction].as("aggValue1"),
[aggFunction].as("aggValue1"))
)
.select(tmpListForSelect .head,tmpListForSelect .tail:_*)
This code do exactly what I want, but it look very ugly and complicated for something that (I think) should be simple.
Is there another solution for that ?

When sending column names as Strings, groupBy receives a column as first parameter and a sequence of them as second:
def groupBy(col1: String,cols: String*)
So you need to send two arguments and convert the second one to a sequence:
This will work fine for you:
df.groupBy(tagsForGroupBy.head, tagsForGroupBy.tail:_*)
Or if you want to separete col0 from the list as in your example:
df.groupBy("col0", tagsForGroupBy:_*)

Related

Transform rows to multiple rows in Spark Scala

I have a problem where I need to transform one row to multiple rows. This is based on a different mapping that I have. I have tried to provide an example below.
Suppose I have a parquet file with the below schema
ColA, ColB, ColC, Size, User
I need to aggregate the above data into multiple rows based on a lookup map. Suppose I have a static map
ColA, ColB, Sum(Size)
ColB, ColC, Distinct (User)
ColA, ColC, Sum(Size)
This means that one row in the input RDD needs to be transformed to 3 aggregate. I believe RDD is the way to go with FlatMapPair, but I am not sure how to go about this.
I am also OK to concat the columns into one key, something like ColA_ColB etc.
For creating multiple aggregates from the same data, I have started with something like this
val keyData: PairFunction[Row, String, Long] = new PairFunction[Row, String, Long]() {
override def call(x: Row) = {
(x.getString(1),x.getLong(5))
}
}
val ip15M = spark.read.parquet("a.parquet").toJavaRDD
val pairs = ip15M.mapToPair(keyData)
java.util.List[(String, Long)] = [(ios,22), (ios,23), (ios,10), (ios,37), (ios,26), (web,52), (web,1)]
I believe I need to do flatmaptopair instead of mapToPair. On similar lines, I tried
val FlatMapData: PairFlatMapFunction[Row, String, Long] = new PairFlatMapFunction[Row, String, Long]() {
override def call(x: Row) = {
(x.getString(1),x.getLong(5))
}
}
but it is giving Error
Expression of type (String, Long) doesn't conform to expected type util.Iterator[(String, Long)]
Any help is appreciated. Please let me know if I need to add any more details.
the outcome should only have 3 columns? I mean col1, col2, col3 (the agg outcome).
The second aggregate is a distinct count of users? (I assume yes).
If so you can basically create 3 data frames and then union them.
Something in the way of:
val df1 = spark.sql("select colA as col1, colB as col2 ,sum(Size) as colAgg group by colA,colB")
val df2 = spark.sql("select colB as col1, colC as col2 ,Distinct(User) as colAgg group by colB,colC")
val df3 = spark.sql("select colA as col1, colC as col2 ,sum(Size) as colAgg group by colA,colC")
df1.union(df2).union(df3)

Best way to update a dataframe in Spark scala

Consider two Dataframe data_df and update_df. These two dataframes have the same schema (key, update_time, bunch of columns).
I know two (main) way to "update" data_df with update_df
full outer join
I join the two dataframes (on key) and then pick the appropriate columns (according to the value of update_timestamp)
max over partition
Union both dataframes, compute the max update_timestamp by key and then filter only rows that equal this maximum.
Here are the questions :
Is there any other way ?
Which one is the best way and why ?
I've already done the comparison with some Open Data
Here is the join code
var join_df = data_df.alias("data").join(maj_df.alias("maj"), Seq("key"), "outer")
var res_df = join_df.where( $"data.update_time" > $"maj.update_time" || $"maj.update_time".isNull)
.select(col("data.*"))
.union(
join_df.where( $"data.update_time" < $"maj.update_time" || $"data.update_time".isNull)
.select(col("maj.*")))
And here is window code
import org.apache.spark.sql.expressions._
val byKey = Window.partitionBy($"key") // orderBy is implicit here
res_df = data_df.union(maj_df)
.withColumn("max_version", max("update_time").over(byKey))
.where($"update_time" === $"max_version")
I can paste you DAGs and Plans here if needed, but they are pretty large
My first guess is that the join solution might be the best way but it only works if the update dataframe got only one version per key.
PS : I'm aware of Apache Delta solution but sadly i'm not able too use it.
Below is one way of doing it to only join on the keys, in an effort to minimize the amount of memory to be used on filters and on join commands.
///Two records, one with a change, one no change
val originalDF = spark.sql("select 'aa' as Key, 'Joe' as Name").unionAll(spark.sql("select 'cc' as Key, 'Doe' as Name"))
///Two records, one change, one new
val updateDF = = spark.sql("select 'aa' as Key, 'Aoe' as Name").unionAll(spark.sql("select 'bb' as Key, 'Moe' as Name"))
///Make new DFs of each just for Key
val originalKeyDF = originalDF.selectExpr("Key")
val updateKeyDF = updateDF.selectExpr("Key")
///Find the keys that are similar between both
val joinKeyDF = updateKeyDF.join(originalKeyDF, updateKeyDF("Key") === originalKeyDF("Key"), "inner")
///Turn the known keys into an Array
val joinKeyArray = joinKeyDF.select(originalKeyDF("Key")).rdd.map(x=>x.mkString).collect
///Filter the rows from original that are not found in the new file
val originalNoChangeDF = originalDF.where(!($"Key".isin(joinKeyArray:_*)))
///Update the output with unchanged records, update records, and new records
val finalDF = originalNoChangeDF.unionAll(updateDF)

get multiple columns within a map: rdd

I've a DF that I'm explicitly converting into an RDD and trying to fetch each column's record. Not able to fetch each of them within a map. Below is what I've tried:
val df = sql("Select col1, col2, col3, col4, col5 from tableName").rdd
The resultant df becomes the member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
Now I'm trying to access each element of this RDD via:
val dfrdd = df.map{x => x.get(0); x.getAs[String](1); x.get(3)}
The issue is, the above statement returns only the data present on the last transformation of map i.e., the data present on x.get(3). Can someone let me know what I'm doing wrong?
The last line is always returned from the map, In your case x.get(3) gets returned.
To return multiple values you can return tuples as below
val dfrdd = df.map{x => (x.get(0), x.getAs[String](1), x.get(3))}
Hope this helped!

How to insert record into a dataframe in spark

I have a dataframe (df1) which has 50 columns, the first one is a cust_id and the rest are features. I also have another dataframe (df2) which contains only cust_id. I'd like to add one records per customer in df2 to df1 with all the features as 0. But as the two dataframe have two different schema, I cannot do a union. What is the best way to do that?
I use a full outer join but it generates two cust_id columns and I need one. I should somehow merge these two cust_id columns but don't know how.
You can try to achieve something like that by doing a full outer join like the following:
val result = df1.join(df2, Seq("cust_id"), "full_outer")
However, the features are going to be null instead of 0. If you really need them to be zero, one way to do it would be:
val features = df1.columns.toSet - "cust_id" // Remove "cust_id" column
val newDF = features.foldLeft(df2)(
(df, colName) => df.withColumn(colName, lit(0))
)
df1.unionAll(newDF)

how to select all columns that starts with a common label

I have a dataframe in Spark 1.6 and want to select just some columns out of it. The column names are like:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
I know I can do like this to select specific columns:
df.select("colA", "colB", "colE")
but how to select, say "colA", "colB" and all the colF-* columns at once? Is there a way like in Pandas?
The process canbe broken down into following steps:
First grab the column names with df.columns,
then filter down to just the column names you want .filter(_.startsWith("colF")). This gives you an array of Strings.
But the select takes select(String, String*). Luckily select for columns is select(Column*), so finally convert the Strings into Columns with .map(df(_)),
and finally turn the Array of Columns into a var arg with : _*.
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
This filter could be made more complex (same as Pandas). It is however a rather ugly solution (IMO):
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
If the list of other columns is fixed you could also merge a fixed array of columns names with filtered array.
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show
Python (tested in Azure Databricks)
selected_columns = [column for column in df.columns if column.startswith("colF")]
df2 = df.select(selected_columns)
In PySpark, use: colRegex to select columns starting with colF
Whit the sample:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
Apply:
df.select(col("colA"), col("colB"), df.colRegex("`(colF)+?.+`")).show()
The result is:
colA, colB, colF-0, colF-1, colF-2
I wrote a function that does that. Read the comments to see how it works.
/**
* Given a sequence of prefixes, select suitable columns from [[DataFrame]]
* #param columnPrefixes Sequence of prefixes
* #param dF Incoming [[DataFrame]]
* #return [[DataFrame]] with prefixed columns selected
*/
def selectPrefixedColumns(columnPrefixes: Seq[String], dF: DataFrame): DataFrame = {
// Find out if given column name matches any of the provided prefixes
def colNameStartsWith: String => Boolean = (colName: String) =>
columnsPrefix.map(prefix => colName.startsWith(prefix)).reduce(_ || _)
// Filter columns list by checking against given prefixes sequence
val columns = dF.columns.filter(colNameStartsWith)
// Select filtered columns list
dF.select(columns.head, columns.tail:_*)
}