Programmatically Rename All But One Column Spark Scala

I have this DataFrame:
val df = Seq(
("LeBron", 36, 18, 12),
("Kevin", 42, 8, 9),
("Russell", 44, 5, 14)).
toDF("player", "points", "rebounds", "assists")
df.show()
+-------+------+--------+-------+
| player|points|rebounds|assists|
+-------+------+--------+-------+
| LeBron| 36| 18| 12|
| Kevin| 42| 8| 9|
|Russell| 44| 5| 14|
+-------+------+--------+-------+
I want to add "season_high" to every column name except player. I also want to use a function to do this because my real data set has 250 columns.
I've come up with the method below that gets me the output that I want, but I'm wondering if there is a way to pass a rule to the renamedColumns mapping function so that the column name player doesn't get switched to season_high_player and then back to player with the additional .withColumnRenamed call.
val renamedColumns = df.columns.map(name => col(name).as(s"season_high_$name"))
val df2 = df.select(renamedColumns : _*).
withColumnRenamed("season_high_player", "player")
df2.show()
+-------+------------------+--------------------+-------------------+
| player|season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
| LeBron| 36| 18| 12|
| Kevin| 42| 8| 9|
|Russell| 44| 5| 14|
+-------+------------------+--------------------+-------------------+

@philantrovert was right, but he forgot to show you how to use that "formula", so here you go:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
val selection : Seq[Column] = Seq(col("player")) ++ df.columns.filter(_ != "player")
.map(name => col(name).as(s"season_high_$name"))
df.select(selection : _*).show
// +-------+------------------+--------------------+-------------------+
// | player|season_high_points|season_high_rebounds|season_high_assists|
// +-------+------------------+--------------------+-------------------+
// | LeBron| 36| 18| 12|
// | Kevin| 42| 8| 9|
// |Russell| 44| 5| 14|
// +-------+------------------+--------------------+-------------------+
So what we have done here is filter out the column name we don't need (this is plain Scala). Then we map the column names that we kept, converting them into Columns and renaming them.
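Since the real data set has 250 columns, it can help to wrap this in a small reusable function. A minimal sketch (the helper name renameAllExcept and the prefix parameter are illustrative, not part of the original answer):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
// Keep the `keep` column as-is and prefix every other column name.
def renameAllExcept(df: DataFrame, keep: String, prefix: String): DataFrame = {
  val selection: Seq[Column] =
    col(keep) +: df.columns.filter(_ != keep).map(name => col(name).as(s"$prefix$name"))
  df.select(selection: _*)
}
renameAllExcept(df, "player", "season_high_").show()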

You can do the following by making the column you don't want to rename the first column and then applying the logic below:
import org.apache.spark.sql.functions._
val columnsRenamed = col(df.columns.head) +: df.columns.tail.map(name => col(name).as(s"season_high_$name"))
df.select(columnsRenamed :_*).show(false)
You should get the following output:
+-------+------------------+--------------------+-------------------+
|player |season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
|LeBron |36 |18 |12 |
|Kevin |42 |8 |9 |
|Russell|44 |5 |14 |
+-------+------------------+--------------------+-------------------+

Here is one more variation that doesn't depend on the position of the field:
scala> val df = Seq(
| ("LeBron", 36, 18, 12),
| ("Kevin", 42, 8, 9),
| ("Russell", 44, 5, 14)).
| toDF("player", "points", "rebounds", "assists")
df: org.apache.spark.sql.DataFrame = [player: string, points: int ... 2 more fields]
scala> val newColumns = df.columns.map( x => x match { case "player" => col("player") case x => col(x).as(s"season_high_$x")} )
newColumns: Array[org.apache.spark.sql.Column] = Array(player, points AS `season_high_points`, rebounds AS `season_high_rebounds`, assists AS `season_high_assists`)
scala> df.select(newColumns:_*).show(false)
+-------+------------------+--------------------+-------------------+
|player |season_high_points|season_high_rebounds|season_high_assists|
+-------+------------------+--------------------+-------------------+
|LeBron |36 |18 |12 |
|Kevin |42 |8 |9 |
|Russell|44 |5 |14 |
+-------+------------------+--------------------+-------------------+
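If more than one column should be left untouched, the same match-based idea extends to a set of excluded names. A minimal sketch, using a hypothetical keep set:
import org.apache.spark.sql.functions.col
val keep = Set("player") // columns to leave unrenamed
val newColumns = df.columns.map {
  case name if keep(name) => col(name)
  case name               => col(name).as(s"season_high_$name")
}
df.select(newColumns: _*).show(false)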

Related

Spark Scala split column values in a dataframe to appended lists

I have data in a Spark dataframe where I need to search for elements by name, append the values to a list, and split the searched elements into separate columns of the dataframe.
I am using Scala, and below is an example of my current code; it works to get the first value, but I need to append all available values, not just the first.
I'm new to Scala (and Python), so any help will be greatly appreciated!
val getNumber: (String => String) = (colString: String) => {
if (colString != null) {
raw"number:(\d+)".r
.findAllIn(colString)
.group(1)
}
else
null
}
val udfGetColumn = udf(getNumber)
val mydf = df.select(cols.....)
.withColumn("var_number", udfGetColumn($"var"))
Example Data:
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| key| var |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |["[number:123456 rate:111970 position:1]","[number:123457 rate:662352 position:2]","[number:123458 rate:890 position:3]","[number:123459 rate:190 position:4]"] | |
|2 |["[number:654321 rate:211971 position:1]","[number:654322 rate:124 position:2]","[number:654323 rate:421 position:3]"] |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
Desired Result:
+------+------------+----------+--------------+
| key  | var_number | var_rate | var_position |
+------+------------+----------+--------------+
| 1    | 123456     | 111970   | 1            |
| 1    | 123457     | 662352   | 2            |
| 1    | 123458     | 890      | 3            |
| 1    | 123459     | 190      | 4            |
| 2    | 654321     | 211971   | 1            |
| 2    | 654322     | 124      | 2            |
| 2    | 654323     | 421      | 3            |
+------+------------+----------+--------------+
You don't need to use a UDF here. You can easily transform the array column var by converting each element into a map using str_to_map, after removing the square brackets ([]) with the regexp_replace function. Finally, explode the transformed array and select the fields:
val df = Seq(
(1, Seq("[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]")),
(2, Seq("[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"))
).toDF("key", "var")
val result = df.withColumn(
"var",
explode(expr(raw"transform(var, x -> str_to_map(regexp_replace(x, '[\\[\\]]', ''), ' '))"))
).select(
col("key"),
col("var").getField("number").alias("var_number"),
col("var").getField("rate").alias("var_rate"),
col("var").getField("position").alias("var_position")
)
result.show
//+---+----------+--------+------------+
//|key|var_number|var_rate|var_position|
//+---+----------+--------+------------+
//| 1| 123456| 111970| 1|
//| 1| 123457| 662352| 2|
//| 1| 123458| 890| 3|
//| 1| 123459| 190| 4|
//| 2| 654321| 211971| 1|
//| 2| 654322| 124| 2|
//| 2| 654323| 421| 3|
//+---+----------+--------+------------+
From your comment, it appears the column var is of type string, not array. In this case, you can first transform it by removing the [] and " characters and then splitting on commas to get an array:
val df = Seq(
(1, """["[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]"]"""),
(2, """["[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"]""")
).toDF("key", "var")
val result = df.withColumn(
"var",
split(regexp_replace(col("var"), "[\\[\\]\"]", ""), ",")
).withColumn(
"var",
explode(expr("transform(var, x -> str_to_map(x, ' '))"))
).select(
// select your columns as above...
)

How to split Comma-separated multiple columns into multiple rows?

I have a dataframe with N fields, as shown below. The number of columns and the length of the values will vary.
Input Table:
+--------------+-----------+-----------+
|Date          |Amount     |Status     |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018          |73         |IN         |
|2018,2017     |56,89      |IN,PRE     |
+--------------+-----------+-----------+
I have to convert it into the below format with one sequence column.
Expected Output Table:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100   |IN    |1       |
|2018|200   |PRE   |2       |
|2017|300   |POST  |3       |
|2018|73    |IN    |1       |
|2018|56    |IN    |1       |
|2017|89    |PRE   |2       |
+----+------+------+--------+
I have tried using explode, but explode only takes one array at a time.
var df = dataRefined.withColumn("TOT_OVRDUE_TYPE", explode(split($"TOT_OVRDUE_TYPE", "\\"))).toDF
var df1 = df.withColumn("TOT_OD_TYPE_AMT", explode(split($"TOT_OD_TYPE_AMT", "\\"))).show
Does someone know how I can do it? Thank you for your help.
Here is another approach using posexplode for each column and joining all produced dataframes into one:
import org.apache.spark.sql.functions.{posexplode, monotonically_increasing_id, col}
val df = Seq(
(Seq("2019", "2018", "2017"), Seq("100", "200", "300"), Seq("IN", "PRE", "POST")),
(Seq("2018"), Seq("73"), Seq("IN")),
(Seq("2018", "2017"), Seq("56", "89"), Seq("IN", "PRE")))
.toDF("Date","Amount", "Status")
.withColumn("idx", monotonically_increasing_id)
df.columns.filter(_ != "idx").map{
c => df.select($"idx", posexplode(col(c))).withColumnRenamed("col", c)
}
.reduce((ds1, ds2) => ds1.join(ds2, Seq("idx", "pos")))
.select($"Date", $"Amount", $"Status", $"pos".plus(1).as("Sequence"))
.show
Output:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
You can achieve this by using the built-in DataFrame functions arrays_zip, split, and posexplode.
Explanation:
scala>val df=Seq((("2019,2018,2017"),("100,200,300"),("IN,PRE,POST")),(("2018"),("73"),("IN")),(("2018,2017"),("56,89"),("IN,PRE"))).toDF("date","amount","status")
scala>:paste
df.selectExpr("""posexplode(
arrays_zip(
split(date,","), //split date string with ',' to create array
split(amount,","),
split(status,","))) //zip arrays
as (p,colum) //pos explode on zip arrays will give position and column value
""")
.selectExpr("colum.`0` as Date", //get 0 column as date
"colum.`1` as Amount",
"colum.`2` as Status",
"p+1 as Sequence") //add 1 to the position value
.show()
Result:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Yes, I personally also find explode a bit annoying and in your case I would probably go with a flatMap instead:
import spark.implicits._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq((Seq(2019,2018,2017), Seq(100,200,300), Seq("IN","PRE","POST")),(Seq(2018), Seq(73), Seq("IN")),(Seq(2018,2017), Seq(56,89), Seq("IN","PRE")))).toDF()
val transformedDF = df
.flatMap{case Row(dates: Seq[Int], amounts: Seq[Int], statuses: Seq[String]) =>
dates.indices.map(index => (dates(index), amounts(index), statuses(index), index+1))}
.toDF("Date", "Amount", "Status", "Sequence")
Output:
transformedDF.show
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Assuming the number of data elements in each column is the same for each row:
First, I recreated your DataFrame
import org.apache.spark.sql._
import scala.collection.mutable.ListBuffer
val df = Seq(("2019,2018,2017", "100,200,300", "IN,PRE,POST"), ("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")).toDF("Date", "Amount", "Status")
Next, I split the rows and added a sequence value, then converted back to a DF:
val exploded = df.rdd.flatMap(row => {
val buffer = new ListBuffer[(String, String, String, Int)]
val dateSplit = row(0).toString.split("\\,", -1)
val amountSplit = row(1).toString.split("\\,", -1)
val statusSplit = row(2).toString.split("\\,", -1)
val seqSize = dateSplit.size
for(i <- 0 to seqSize-1)
buffer += Tuple4(dateSplit(i), amountSplit(i), statusSplit(i), i+1)
buffer.toList
}).toDF((df.columns:+"Sequence"): _*)
I'm sure there are other ways to do it without first converting the DF to an RDD, but this will still result with a DF with the correct answer.
Let me know if you have any questions.
I took advantage of transpose to zip all the sequences by position and then did a posexplode. The selects on the DataFrames are built dynamically to satisfy the condition from the question that the number of columns and the length of the values will vary.
import org.apache.spark.sql.functions._
val df = Seq(
("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
df: org.apache.spark.sql.DataFrame = [Date: string, Amount: string ... 1 more field]
scala> df.show(false)
+--------------+-----------+-----------+
|Date |Amount |Status |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------+
scala> def transposeSeqOfSeq[S](x:Seq[Seq[S]]): Seq[Seq[S]] = { x.transpose }
transposeSeqOfSeq: [S](x: Seq[Seq[S]])Seq[Seq[S]]
scala> val myUdf = udf { transposeSeqOfSeq[String] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true))))
scala> val df2 = df.select(df.columns.map(c => split(col(c), ",") as c): _*)
df2: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 1 more field]
scala> df2.show(false)
+------------------+---------------+---------------+
|Date |Amount |Status |
+------------------+---------------+---------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|
|[2018] |[73] |[IN] |
|[2018, 2017] |[56, 89] |[IN, PRE] |
+------------------+---------------+---------------+
scala> val df3 = df2.withColumn("allcols", array(df.columns.map(c => col(c)): _*))
df3: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 2 more fields]
scala> df3.show(false)
+------------------+---------------+---------------+------------------------------------------------------+
|Date |Amount |Status |allcols |
+------------------+---------------+---------------+------------------------------------------------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|[[2019, 2018, 2017], [100, 200, 300], [IN, PRE, POST]]|
|[2018] |[73] |[IN] |[[2018], [73], [IN]] |
|[2018, 2017] |[56, 89] |[IN, PRE] |[[2018, 2017], [56, 89], [IN, PRE]] |
+------------------+---------------+---------------+------------------------------------------------------+
scala> val df4 = df3.withColumn("ab", myUdf($"allcols")).select($"ab", posexplode($"ab"))
df4: org.apache.spark.sql.DataFrame = [ab: array<array<string>>, pos: int ... 1 more field]
scala> df4.show(false)
+------------------------------------------------------+---+-----------------+
|ab |pos|col |
+------------------------------------------------------+---+-----------------+
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|0 |[2019, 100, IN] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|1 |[2018, 200, PRE] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|2 |[2017, 300, POST]|
|[[2018, 73, IN]] |0 |[2018, 73, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |0 |[2018, 56, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |1 |[2017, 89, PRE] |
+------------------------------------------------------+---+-----------------+
scala> val selCols = (0 until df.columns.length).map(i => $"col".getItem(i).as(df.columns(i))) :+ ($"pos"+1).as("Sequence")
selCols: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(col[0] AS `Date`, col[1] AS `Amount`, col[2] AS `Status`, (pos + 1) AS `Sequence`)
scala> df4.select(selCols:_*).show(false)
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100 |IN |1 |
|2018|200 |PRE |2 |
|2017|300 |POST |3 |
|2018|73 |IN |1 |
|2018|56 |IN |1 |
|2017|89 |PRE |2 |
+----+------+------+--------+
This is why I love the Spark core APIs. With just map and flatMap you can handle many problems. Pass your df and an instance of SQLContext to the method below and it will give the desired result:
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
def reShapeDf(df: DataFrame, sqlContext: SQLContext): DataFrame = {
// Pull the three comma-separated string columns out of each Row
val rdd = df.rdd.map(m => (m.getAs[String](0), m.getAs[String](1), m.getAs[String](2)))
// Split each column on ',' and zip the pieces together position by position
val rdd1 = rdd.flatMap(a => a._1.split(",").zip(a._2.split(",")).zip(a._3.split(",")))
// Flatten the nested tuples ((a, b), c) into (a, b, c)
val rdd2 = rdd1.map{
case ((a,b),c) => (a,b,c)
}
// Rebuild a DataFrame with the original schema
sqlContext.createDataFrame(rdd2.map(m => Row.fromTuple(m)), df.schema)
}
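A minimal usage sketch, assuming spark.implicits._ (for toDF) and an SQLContext named sqlContext are in scope; note that, as written, the method keeps the original three-column schema and does not add a Sequence column:
val input = Seq(
  ("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
  ("2018", "73", "IN"),
  ("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
reShapeDf(input, sqlContext).show()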

How to randomly choose element in array column of different size?

Given a dataframe with a column of arrays of integers with different sizes:
scala> sampleDf.show()
+------------+
| arrays|
+------------+
|[15, 16, 17]|
|[15, 16, 17]|
| [14]|
| [11]|
| [11]|
+------------+
scala> sampleDf.printSchema()
root
|-- arrays: array (nullable = true)
| |-- element: integer (containsNull = true)
I would like to generate a new column with a randomly chosen item from each array.
I've tried two solutions:
1. Using UDF:
import scala.util.Random
def getRandomElement(arr: Array[Int]): Int = {
arr(Random.nextInt(arr.size))
}
val getRandomElementUdf = udf{arr: Array[Int] => getRandomElement(arr)}
sampleDf.withColumn("randomItem", getRandomElementUdf('arrays)).show
crashes on the last line with a long error message (extracts):
...
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<int>) => int)
...
Caused by: java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I
I've tried with the alternative udf definition:
val getRandomElementUdf = udf[Int, Array[Int]] (getRandomElement)
but I get the same error.
2. Second method: creating intermediary columns with a random index in the range of the corresponding array:
// Return a dataframe with a column with random index from column of Arrays with different sizes
def choice(df: DataFrame, colName: String): DataFrame = {
df.withColumn("array_size", size(col(colName)))
.withColumn("random_idx", least('array_size, floor(rand * 'array_size)))
}
choice(sampleDf, "arrays").show
outputs:
+------------+----------+----------+
| arrays|array_size|random_idx|
+------------+----------+----------+
|[15, 16, 17]| 3| 2|
|[15, 16, 17]| 3| 1|
| [14]| 1| 0|
| [11]| 1| 0|
| [11]| 1| 0|
+------------+----------+----------+
Ideally, we would like to use the column random_idx to choose an item from the arrays column, something like:
sampleDf.withColumn("choosen_item", 'arrays.getItem('random_idx))
Unfortunately, getItem cannot take a column as an argument.
Any suggestion is welcome.
You can use the UDF below to select a random element from the array:
import scala.util.Random
import org.apache.spark.sql.functions.udf
val getRandomElement = udf ((array: Seq[Integer]) => {
array(Random.nextInt(array.size))
})
df.withColumn("c1", getRandomElement($"arrays"))
.withColumn("c2", getRandomElement($"arrays"))
.withColumn("c3", getRandomElement($"arrays"))
.withColumn("c4", getRandomElement($"arrays"))
.withColumn("c5", getRandomElement($"arrays"))
.show(false)
You can see that a random element is selected on each use, each as a new column:
+------------+---+---+---+---+---+
|arrays |c1 |c2 |c3 |c4 |c5 |
+------------+---+---+---+---+---+
|[15, 16, 17]|15 |16 |16 |17 |16 |
|[15, 16, 17]|16 |16 |17 |15 |15 |
|[14] |14 |14 |14 |14 |14 |
|[11] |11 |11 |11 |11 |11 |
|[11] |11 |11 |11 |11 |11 |
+------------+---+---+---+---+---+
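One hedged caveat: on Spark 2.3+ it can be safer to mark such a UDF as nondeterministic, so the optimizer does not deduplicate or reorder the random calls:
import scala.util.Random
import org.apache.spark.sql.functions.udf
// Same UDF as above, flagged as nondeterministic (Spark 2.3+ only).
val getRandomElement = udf((array: Seq[Integer]) => array(Random.nextInt(array.size))).asNondeterministic()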
If you want to remain udf-free, here is a possibility:
First, add a key to the dataframe output by choice (assume its name is choiceDf):
val myDf = choiceDf.withColumn("key", monotonically_increasing_id())
Then create an intermediary dataframe that explodes the arrays column and keeps the index of the values:
val tmp = myDf.select('key, posexplode('arrays))
Finally, join using key and random_idx:
myDf.join(tmp.withColumnRenamed("pos", "random_idx"), Seq("key", "random_idx"), "left")
The item you are looking for is stored in the column col:
+---+----------+------------+----------+---+
|key|random_idx| arrays|array_size|col|
+---+----------+------------+----------+---+
| 0| 2|[15, 16, 17]| 3| 17|
| 1| 1|[15, 16, 17]| 3| 16|
| 2| 0| [14]| 1| 14|
+---+----------+------------+----------+---+
You can extract a random element from the array in one line using Spark SQL:
sampleDF.createOrReplaceTempView("sampleDF")
spark.sql("select arrays[Cast((FLOOR(RAND() * FLOOR(size(arrays)))) as INT)] as random from sampleDF")

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following
When col1 = X, store the values of col2 and col3,
and assign those column values to the next row where col1 = Y.
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated
Note: Spark 1.6
Here's one approach using Window functions, with steps as follows:
Add a row-identifying column (not needed if there is already one) and combine the non-key columns (presumably many of them) into one
Create tmp1 with conditional nulls and tmp2 using the last/rowsBetween Window function to back-fill with the last non-null value
Create newcols conditionally from cols and tmp2
Expand newcols back into individual columns using foldLeft
Note that this solution uses a Window function without partitioning, so it may not work well for a large dataset.
val df = Seq(
("X", "x1", "x2"),
("Z", "z1", "z2"),
("Y", "", ""),
("X", "x3", "x4"),
("P", "p1", "p2"),
("Q", "q1", "q2"),
("Y", "", "")
).toDF("col1", "col2", "col3")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val colList = df.columns.filter(_ != "col1")
val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
struct(colList.map(col): _*).as("cols")
)
val df3 = df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
withColumn( "tmp2", last("tmp1", ignoreNulls = true).over(
Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
) )
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft( df4 )(
(accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A workaround is to fall back to Spark SQL, which supports ignoring nulls in its last() method:
df2.
withColumn( "tmp1", when($"col1" === "X", $"cols") ).
createOrReplaceTempView("df2table")
// on Spark 1.x, use registerTempTable("df2table") and sqlContext.sql instead
val df3 = spark.sqlContext.sql("""
select col1, id, cols, tmp1, last(tmp1, true) over (
order by id rows between unbounded preceding and current row
) as tmp2
from df2table
""")
Yes, there is a lag function, which requires an ordering:
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}
case class Temp(a: String, b: Option[String], c: Option[String])
val input = ss.createDataFrame(
Seq(
Temp("A", Some("a1"), Some("a2")),
Temp("D", Some("d1"), Some("d2")),
Temp("B", Some("b1"), Some("b2")),
Temp("E", None, None),
Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+

Dataframe.map need to result with more than the rows in dataset

I am using Scala and Spark and have a simple dataframe.map to produce the required transformation on the data. However, I need to provide an additional row of data along with the modified original. How can I use dataframe.map to achieve this?
ex:
dataset from:
id, name, age
1, john, 23
2, peter, 32
if age < 25 default to 25.
dataset to:
id, name, age
1, john, 25
1, john, -23
2, peter, 32
Would a 'UnionAll' handle it?
eg.
df1 = original dataframe
df2 = transformed df1
df1.unionAll(df2)
EDIT: implementation using unionAll()
val df1=sqlContext.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfTransform= udf[Int,Int] { (age) => if (age<25) 25 else age }
val df2=df1.withColumn("age2", udfTransform($"age")).
where("age!=age2").
drop("age2")
df1.withColumn("age", udfTransform($"age")).
unionAll(df2).
orderBy("id").
show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 25|
| 1| john| 23|
| 2|peter| 32|
+---+-----+---+
Note: the implementation differs a bit from the originally proposed (naive) solution. The devil is always in the detail!
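A small side note: on Spark 2.x, unionAll is deprecated in favor of union, so the final step could equally be written as follows (reusing df1, df2 and udfTransform from above):
df1.withColumn("age", udfTransform($"age")).
  union(df2).
  orderBy("id").
  show()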
EDIT 2: implementation using nested array and explode
val df1=sx.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfArr= udf[Array[Int],Int] { (age) =>
if (age<25) Array(age,25) else Array(age) }
val df2=df1.withColumn("age", udfArr($"age"))
df2.show()
+---+-----+--------+
| id| name| age|
+---+-----+--------+
| 1| john|[23, 25]|
| 2|peter| [32]|
+---+-----+--------+
df2.withColumn("age",explode($"age") ).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 23|
| 1| john| 25|
| 2|peter| 32|
+---+-----+---+
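If a UDF is to be avoided entirely, the same nested-array-and-explode idea can be expressed with built-in functions only. A sketch, not from the original answer, reusing df1 from the snippet above:
import org.apache.spark.sql.functions.{array, explode, lit, when}
df1.withColumn("age",
  explode(when($"age" < 25, array($"age", lit(25))).otherwise(array($"age")))
).show()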