Scala Spark collect_list() vs array() - scala

What is the difference between collect_list() and array() in spark using scala?
I see uses all over the place and the use cases are not clear to me to determine the difference.

Even though both array and collect_list return an ArrayType column, the two methods are very different.
Method array combines "column-wise" a number of columns into an array, whereas collect_list aggregates "row-wise" on a single column typically by group (or Window partition) into an array, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1, "a", "b"),
(1, "c", "d"),
(2, "e", "f")
).toDF("c1", "c2", "c3")
df.
withColumn("arr", array("c2", "c3")).
show
// +---+---+---+------+
// | c1| c2| c3| arr|
// +---+---+---+------+
// | 1| a| b|[a, b]|
// | 1| c| d|[c, d]|
// | 2| e| f|[e, f]|
// +---+---+---+------+
df.
groupBy("c1").agg(collect_list("c2")).
show
// +---+----------------+
// | c1|collect_list(c2)|
// +---+----------------+
// | 1| [a, c]|
// | 2| [e]|
// +---+----------------+

Related

How to apply a customized function with multiple parameters to each group of a dataframe and union the resulting dataframes in Scala Spark?

I have a customized function that looks like this that returns a different dataframe as the output
def customizedfun(data : DataFrame, param1 : Boolean, param2 : string) : DataFrame = {...}
and I want to apply this function to each group of
df.groupBy("type")
then append the output dataframes from each type into one dataframe.
This is a little different from other questions regarding applying customized functions to grouped dataframes in that this function also take other inputs, in addition to the dataframe in question df.groupBy("type").
What's the best way to do this?
You can filter down the original df to the different groups, call customizedfun for each group and then union the results.
I assume that customizedfun is a function that simply adds the two parameters as a new column, but it could be any function:
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame =
data.withColumn("newCol", lit(s"$param2 $param1"))
I need two helper function that calculate the values of param1 and param2 dependent on the value of type. In a real world application, these functions could be for example a lookup into a dictionary.
def calcParam1(typ: Integer): Boolean = typ % 2 == 0
def calcParam2(typ: Integer): String = s"type is $typ"
Now the original df is filtered into the different groups, customizedfun is called and the result is unioned:
//create some test data
val df = Seq((1, "A", "a"), (1, "B", "b"), (1, "C", "c"), (2, "D", "d"), (2, "E", "e"), (3, "F", "f"))
.toDF("type", "val1", "val2")
//+----+----+----+
//|type|val1|val2|
//+----+----+----+
//| 1| A| a|
//| 1| B| b|
//| 1| C| c|
//| 2| D| d|
//| 2| E| e|
//| 3| F| f|
//+----+----+----+
//get the distinct values of column type
val distinctTypes = df.select("type").distinct().as[Integer].collect()
//call customizedfun for each group
val resultPerGroup= for( typ <- distinctTypes)
yield customizedfun( df.filter(s"type = $typ"), calcParam1(typ), calcParam2(typ))
//the final union
val result = resultPerGroup.tail.foldLeft(resultPerGroup.head)(_ union _)
//+----+----+----+---------------+
//|type|val1|val2| newCol|
//+----+----+----+---------------+
//| 1| A| a|type is 1 false|
//| 1| B| b|type is 1 false|
//| 1| C| c|type is 1 false|
//| 3| F| f|type is 3 false|
//| 2| D| d| type is 2 true|
//| 2| E| e| type is 2 true|
//+----+----+----+---------------+

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe of two columns of string type A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between distinct elements of each columns and a set of integers
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create a new columns in the dataframe applying those maps to their correspondent columns. For one map only this would be
df.select("A").map(x=>mapColA.get(x)).show()
However I don't understand how to apply each map to their correspondent columns and create two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like below -
I am trying to create another dataframe from this which has 2 columns - the column name and the sum of values in each column like this -
So far, I've tried this (in Spark 2.2.0) but throws a stack trace -
val get_count: (String => Long) = (c: String) => {
df.groupBy("id")
.agg(sum(c) as "s")
.select("s")
.collect()(0)
.getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives of accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
(0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
.toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need, we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail :_*) may seem strange. The idea is simply to compute all the aggregations computed in aggs. One would expect something more simple like df.agg(aggs : _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)
maybe to ensure that at least one column is used, and this is why you need to split aggs in aggs.head and aggs.tail.
What i do is to define a method to create a struct from the desired values:
def kv (columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This functions receives a list of columns to transpose (your 3 last columns in your case) and transform them in a struct with the column name as key and the column value as value
And then use that method to create an struct and process it as you want
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previous defined function to have the desired columns as the said struct, and then deconstruct the struct to have a column key and a column value in each row.
Then you can aggregate by the column name and sum the values
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following
when col1=x,store the value of col2 and col3
and assign those column values to next row when col1=y
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated
Note:-spark 1.6
Here's one approach using Window function with steps as follows:
Add row-identifying column (not needed if there is already one) and combine non-key columns (presumably many of them) into one
Create tmp1 with conditional nulls and tmp2 using last/rowsBetween Window function to back-fill with the last non-null value
Create newcols conditionally from cols and tmp2
Expand newcols back to individual columns using foldLeft
Note that this solution uses Window function without partitioning, thus may not work for large dataset.
val df = Seq(
("X", "x1", "x2"),
("Z", "z1", "z2"),
("Y", "", ""),
("X", "x3", "x4"),
("P", "p1", "p2"),
("Q", "q1", "q2"),
("Y", "", "")
).toDF("col1", "col2", "col3")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val colList = df.columns.filter(_ != "col1")
val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
struct(colList.map(col): _*).as("cols")
)
val df3 = df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
) )
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A work-around is to revert to use Spark SQL which supports ignore-null in its last() method:
df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
createOrReplaceTempView("df2table")
// might need to use registerTempTable("df2table") instead
val df3 = spark.sqlContext.sql("""
select col1, id, cols, tmp1, last(tmp1, true) over (
order by id rows between unbounded preceding and current row
) as tmp2
from df2table
""")
Yes, there is a lag function that requires ordering
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame(
Seq(
Temp("A", Some("a1"), Some("a2")),
Temp("D", Some("d1"), Some("d2")),
Temp("B", Some("b1"), Some("b2")),
Temp("E", None, None),
Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

How to use Spark ML ALS algorithm? [duplicate]

Is it possible to factorize a Spark dataframe column? With factorizing I mean creating a mapping of each unique value in the column to the same ID.
Example, the original dataframe:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| A|
|1473492972|4060600988513370| A|
|1473509764|4060600988513370| B|
|1473513432|4060600988513370| C|
|1473513432|4060600988513370| A|
+----------+----------------+--------------------+
to the factorized version:
+----------+----------------+--------------------+
| col1| col2| col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370| 0|
|1473492972|4060600988513370| 0|
|1473509764|4060600988513370| 1|
|1473513432|4060600988513370| 2|
|1473513432|4060600988513370| 0|
+----------+----------------+--------------------+
In scala itself it would be fairly simple, but since Spark distributes it's dataframes over nodes I'm not sure how to keep a mapping from A->0, B->1, C->2.
Also, assume the dataframe is pretty big (gigabytes), which means loading one entire column into the memory of a single machine might not be possible.
Can it be done?
You can use StringIndexer to encode letters into indices:
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("col3")
.setOutputCol("col3Index")
val indexed = indexer.fit(df).transform(df)
indexed.show()
+----------+----------------+----+---------+
| col1| col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370| A| 0.0|
|1473492972|4060600988513370| A| 0.0|
|1473509764|4060600988513370| B| 1.0|
|1473513432|4060600988513370| C| 2.0|
|1473513432|4060600988513370| A| 0.0|
+----------+----------------+----+---------+
Data:
val df = spark.createDataFrame(Seq(
(1473490929, "4060600988513370", "A"),
(1473492972, "4060600988513370", "A"),
(1473509764, "4060600988513370", "B"),
(1473513432, "4060600988513370", "C"),
(1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")
You can use an user defined function.
First you create the mapping you need:
val updateFunction = udf {(x: String) =>
x match {
case "A" => 0
case "B" => 1
case "C" => 2
case _ => 3
}
}
And now you only have to apply it to your DataFrame:
df.withColumn("col3", updateFunction(df.col("col3")))