Spark: get rows value from a Window - scala

With spark I defined a Window:
val window = Window
.partitionBy("myaggcol")
.orderBy("datefield")
.rowsBetween(-2, 0)
Then I can compute a new column from the window' rows, eg:
dataset
.withColumn("newcol", last("diffcol").over(window) - first("diffcol").over(window))
This will compute, for each point, the difference in "diffcol" with the n-2 row.
Now my question: how can I get the "diffcol" of n-1 row, not the first nor the last but the intermediary one?

If I understand your question correctly, Window function lag would work better than rowsBetween, as shown in the following example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = Seq(
("a", 1, 100), ("a", 2, 200), ("a", 3, 300), ("a", 4, 400),
("b", 1, 500), ("b", 2, 600), ("b", 3, 700)
).toDF("c1", "c2", "c3")
val win = Window.partitionBy("c1").orderBy("c2")
df.
withColumn("c3Diff1", $"c3" - coalesce(lag("c3", 1).over(win), lit(0))).
withColumn("c3Diff2", $"c3" - coalesce(lag("c3", 2).over(win), lit(0))).
show
// +---+---+---+-------+-------+
// | c1| c2| c3|c3Diff1|c3Diff2|
// +---+---+---+-------+-------+
// | b| 1|500| 500| 500|
// | b| 2|600| 100| 600|
// | b| 3|700| 100| 200|
// | a| 1|100| 100| 100|
// | a| 2|200| 100| 200|
// | a| 3|300| 100| 200|
// | a| 4|400| 100| 200|
// +---+---+---+-------+-------+

Related

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe of two columns of string type A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between distinct elements of each columns and a set of integers
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create a new columns in the dataframe applying those maps to their correspondent columns. For one map only this would be
df.select("A").map(x=>mapColA.get(x)).show()
However I don't understand how to apply each map to their correspondent columns and create two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+

How to move a specific column of a pyspark dataframe in the start of the dataframe

I have a pyspark dataframe as follows (this is just a simplified example, my actual dataframe has hundreds of columns):
col1,col2,......,col_with_fix_header
1,2,.......,3
4,5,.......,6
2,3,........,4
and I want to move col_with_fix_header in the start, so that the output comes as follows:
col_with_fix_header,col1,col2,............
3,1,2,..........
6,4,5,....
4,2,3,.......
I don't want to list all the columns in the solution.
In case you don't want to list all columns of your dataframe, you can use the dataframe property columns. This property gives you a python list of column names and you can simply slice it:
df = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
("d", "David", 29),
("e", "Esther", 32),
("f", "Fanny", 36),
("g", "Gabby", 60)], ["id", "name", "age"])
df.select([df.columns[-1]] + df.columns[:-1]).show()
Output:
+---+---+-------+
|age| id| name|
+---+---+-------+
| 34| a| Alice|
| 36| b| Bob|
| 30| c|Charlie|
| 29| d| David|
| 32| e| Esther|
| 36| f| Fanny|
| 60| g| Gabby|
+---+---+-------+

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following
when col1=x,store the value of col2 and col3
and assign those column values to next row when col1=y
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated
Note:-spark 1.6
Here's one approach using Window function with steps as follows:
Add row-identifying column (not needed if there is already one) and combine non-key columns (presumably many of them) into one
Create tmp1 with conditional nulls and tmp2 using last/rowsBetween Window function to back-fill with the last non-null value
Create newcols conditionally from cols and tmp2
Expand newcols back to individual columns using foldLeft
Note that this solution uses Window function without partitioning, thus may not work for large dataset.
val df = Seq(
("X", "x1", "x2"),
("Z", "z1", "z2"),
("Y", "", ""),
("X", "x3", "x4"),
("P", "p1", "p2"),
("Q", "q1", "q2"),
("Y", "", "")
).toDF("col1", "col2", "col3")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val colList = df.columns.filter(_ != "col1")
val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
struct(colList.map(col): _*).as("cols")
)
val df3 = df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
) )
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A work-around is to revert to use Spark SQL which supports ignore-null in its last() method:
df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
createOrReplaceTempView("df2table")
// might need to use registerTempTable("df2table") instead
val df3 = spark.sqlContext.sql("""
select col1, id, cols, tmp1, last(tmp1, true) over (
order by id rows between unbounded preceding and current row
) as tmp2
from df2table
""")
Yes, there is a lag function that requires ordering
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame(
Seq(
Temp("A", Some("a1"), Some("a2")),
Temp("D", Some("d1"), Some("d2")),
Temp("B", Some("b1"), Some("b2")),
Temp("E", None, None),
Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

Get the last element of a window in Spark 2.1.1

I have a dataframe in which I have subcategories, and want the last element of each of these subcategories.
val windowSpec = Window.partitionBy("name").orderBy("count")
sqlContext
.createDataFrame(
Seq[(String, Int)](
("A", 1),
("A", 2),
("A", 3),
("B", 10),
("B", 20),
("B", 30)
))
.toDF("name", "count")
.withColumn("firstCountOfName", first("count").over(windowSpec))
.withColumn("lastCountOfName", last("count").over(windowSpec))
.show()
returns me something strange:
+----+-----+----------------+---------------+
|name|count|firstCountOfName|lastCountOfName|
+----+-----+----------------+---------------+
| B| 10| 10| 10|
| B| 20| 10| 20|
| B| 30| 10| 30|
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
+----+-----+----------------+---------------+
As we can see, the first value returned is correctly computed, but the last isn't, it's always the current value of the column.
Has someone a solution to do what I want?
According to the issue SPARK-20969, you should be able to get the expected results by defining adequate bounds to your window, as shown below.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window
.partitionBy("name")
.orderBy("count")
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
sqlContext
.createDataFrame(
Seq[(String, Int)](
("A", 1),
("A", 2),
("A", 3),
("B", 10),
("B", 20),
("B", 30)
))
.toDF("name", "count")
.withColumn("firstCountOfName", first("count").over(windowSpec))
.withColumn("lastCountOfName", last("count").over(windowSpec))
.show()
Alternatively, if your are ordering on the same column you are computing first and last, you can change for min and max with a non-ordered window, then it should also work properly.

How to combine two spark data frames in sorted order

I want to combine two dataframes a and b into a dataframe c that is sorted on a column.
val a = Seq(("a", 1), ("c", 2), ("e", 3)).toDF("char", "num")
val b = Seq(("b", 4), ("d", 5)).toDF("char", "num")
val c = // how do I sort on char column?
Here is the result I want:
a.show() b.show() c.show()
+----+---+ +----+---+ +----+---+
|char|num| |char|num| |char|num|
+----+---+ +----+---+ +----+---+
| a| 1| | b| 4| | a| 1|
| c| 2| | d| 5| | b| 4|
| e| 3| +----+---+ | c| 2|
+----+---+ | d| 5|
| e| 3|
+----+---+
In simple, you can use sort() on each dataframe and union().
val a = Seq(("a", 1), ("c", 2), ("e", 3)).toDF("char", "num").sort($"char")
val b = Seq(("b", 4), ("d", 5)).toDF("char", "num").sort($"char")
val c = a.union(b).sort($"char")
if you want to do union for multiple dataframes we can try this way.
val df1 = sc.parallelize(List(
(50, 2, "arjun"),
(34, 4, "bob")
)).toDF("age", "children","name")
val df2 = sc.parallelize(List(
(51, 3, "jane"),
(35, 5, "bob")
)).toDF("age", "children","name")
val df3 = sc.parallelize(List(
(50, 2,"arjun"),
(34, 4,"bob")
)).toDF("age", "children","name")
val result= Seq(df1, df2, df3)
val res_union=result.reduce(_ union _).sort($"age",$"name",$"children")
res_union.show()