How to map values in a column (or multiple columns) of one dataset to another dataset - Scala

I am working with GraphFrames, where I need the edges/links in d3.js to use the indexed values of the vertices/nodes as source and destination.
Now I have VertexDF as
+--------------------+-----------+
| id| rowID|
+--------------------+-----------+
| Raashul Tandon| 3|
| Helen Jones| 5|
----------------------------------
EdgesDF
+-------------------+--------------------+
| src| dst|
+-------------------+--------------------+
| Raashul Tandon| Helen Jones |
------------------------------------------
Now I need to transform EdgesDF into the following:
+-------------------+--------------------+
| src| dst|
+-------------------+--------------------+
| 3 | 5 |
------------------------------------------
All the column values should contain the index of the corresponding name taken from VertexDF. I would prefer a solution using higher-order functions.
My approach is to convert VertexDF to a map, then iterate over EdgesDF and replace every occurrence.
What I have tried
First, I made a map of names to ids:
val Actmap = VertxDF.collect().map(f => {
  val name = f.getString(0)
  val id = f.getLong(1)
  (name, id)
}).toMap
Then I used that map with EdgesDF:
EdgesDF.collect().map(f => {
  val src = f.getString(0)
  val dst = f.getString(1)  // dst is the second column
  val src_id = Actmap.get(src)
  val dst_id = Actmap.get(dst)
  (src_id, dst_id)
})

Your approach of collect-ing the vertex and edge dataframes would work only if they're small. I would suggest left-joining the edge and vertex dataframes to get what you need:
import org.apache.spark.sql.functions._
import spark.implicits._

val VertxDF = Seq(
  ("Raashul Tandon", 3),
  ("Helen Jones", 5),
  ("John Doe", 6),
  ("Rachel Smith", 7)
).toDF("id", "rowID")

val EdgesDF = Seq(
  ("Raashul Tandon", "Helen Jones"),
  ("Helen Jones", "John Doe"),
  ("Unknown", "Raashul Tandon"),
  ("John Doe", "Rachel Smith")
).toDF("src", "dst")

EdgesDF.as("e").
  join(VertxDF.as("v1"), $"e.src" === $"v1.id", "left_outer").
  join(VertxDF.as("v2"), $"e.dst" === $"v2.id", "left_outer").
  select($"v1.rowID".as("src"), $"v2.rowID".as("dst")).
  show
// +----+---+
// | src|dst|
// +----+---+
// | 3| 5|
// | 5| 6|
// |null| 3|
// | 6| 7|
// +----+---+
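If you want to keep only the edges whose endpoints both resolve to a vertex (the "Unknown" row above produces a null), a minimal follow-up sketch is to use inner joins instead, or to drop the null rows from the left-outer result:
// Inner joins keep only edges whose src and dst both exist in VertxDF;
// alternatively, keep the left_outer joins above and drop the null rows afterwards.
val resolvedEdges = EdgesDF.as("e").
  join(VertxDF.as("v1"), $"e.src" === $"v1.id").
  join(VertxDF.as("v2"), $"e.dst" === $"v2.id").
  select($"v1.rowID".as("src"), $"v2.rowID".as("dst"))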

Related

Spark Scala split column values in a dataframe to appended lists

I have data in a Spark dataframe in which I need to search for elements by name, append their values to a list, and split the searched elements into separate columns of the dataframe.
I am using Scala, and below is an example of my current code; it works to get the first value, but I need to append all available values, not just the first.
I'm new to Scala (and python) so any help will be greatly appreciated!
val getNumber: (String => String) = (colString: String) => {
  if (colString != null) {
    raw"number:(\d+)".r
      .findAllIn(colString)
      .group(1)
  }
  else
    null
}
val udfGetColumn = udf(getNumber)
val mydf = df.select(cols.....)
  .withColumn("var_number", udfGetColumn($"var"))
Example Data:
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| key| var |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |["[number:123456 rate:111970 position:1]","[number:123457 rate:662352 position:2]","[number:123458 rate:890 position:3]","[number:123459 rate:190 position:4]"] | |
|2 |["[number:654321 rate:211971 position:1]","[number:654322 rate:124 position:2]","[number:654323 rate:421 position:3]"] |
+------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
Desired Result:
+------+------------+----------+--------------+
| key  | var_number | var_rate | var_position |
+------+------------+----------+--------------+
| 1    | 123456     | 111970   | 1            |
| 1    | 123457     | 662352   | 2            |
| 1    | 123458     | 890      | 3            |
| 1    | 123459     | 190      | 4            |
| 2    | 654321     | 211971   | 1            |
| 2    | 654322     | 124      | 2            |
| 2    | 654323     | 421      | 3            |
+------+------------+----------+--------------+
You don't need to use a UDF here. You can easily transform the array column var by converting each element into a map using str_to_map, after removing the square brackets ([]) with the regexp_replace function. Finally, explode the transformed array and select the fields:
val df = Seq(
  (1, Seq("[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]")),
  (2, Seq("[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"))
).toDF("key", "var")

val result = df.withColumn(
  "var",
  explode(expr(raw"transform(var, x -> str_to_map(regexp_replace(x, '[\\[\\]]', ''), ' '))"))
).select(
  col("key"),
  col("var").getField("number").alias("var_number"),
  col("var").getField("rate").alias("var_rate"),
  col("var").getField("position").alias("var_position")
)
result.show
//+---+----------+--------+------------+
//|key|var_number|var_rate|var_position|
//+---+----------+--------+------------+
//| 1| 123456| 111970| 1|
//| 1| 123457| 662352| 2|
//| 1| 123458| 890| 3|
//| 1| 123459| 190| 4|
//| 2| 654321| 211971| 1|
//| 2| 654322| 124| 2|
//| 2| 654323| 421| 3|
//+---+----------+--------+------------+
From your comment, it appears the column var is of type string, not array. In this case, you can first transform it by removing the [] and " characters, then split by comma to get an array:
val df = Seq(
  (1, """["[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]"]"""),
  (2, """["[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"]""")
).toDF("key", "var")

val result = df.withColumn(
  "var",
  split(regexp_replace(col("var"), "[\\[\\]\"]", ""), ",")
).withColumn(
  "var",
  explode(expr("transform(var, x -> str_to_map(x, ' '))"))
).select(
  // select your columns as above...
)
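One detail worth checking (my observation about the sample data, not part of the original answer): splitting on "," leaves a leading space on every element after the first, which adds an empty entry to each map produced by str_to_map. Wrapping each element in trim inside the transform avoids that, e.g. replace the second withColumn above with:
.withColumn(
  "var",
  // trim(x) strips the space left behind by the comma split, so str_to_map
  // produces only the number/rate/position keys
  explode(expr("transform(var, x -> str_to_map(trim(x), ' '))"))
)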

Iterating on columns in dataframe

I have the following data frames
df1
+----------+----+----+----+-----+
| WEEK|DIM1|DIM2| T1| T2|
+----------+----+----+----+-----+
|2016-04-02| 14|NULL|9874| 880|
|2016-04-30| 14| FR|9875| 13|
|2017-06-10| 15| PQR|9867|57721|
+----------+----+----+----+-----+
df2
+----------+----+----+----+-----+
| WEEK|DIM1|DIM2| T1| T2|
+----------+----+----+----+-----+
|2016-04-02| 14|NULL|9879| 820|
|2016-04-30| 14| FR|9785| 9|
|2017-06-10| 15| XYZ|9967|57771|
+----------+----+----+----+-----+
I need to produce my output as follows:
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
| WEEK|DIM1|DIM2| T1| T2| T1| T2|t1_diff|t2_diff|pr_primary|pr_reference|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
|2016-04-02| 14|NULL|9874| 880|9879| 820| -5| 60| Y| Y|
|2017-06-10| 15| PQR|9867|57721|null| null| null| null| Y| N|
|2017-06-10| 15| XYZ|null| null|9967|57771| null| null| N| Y|
|2016-04-30| 14| FR|9875| 13|9785| 9| 90| 4| Y| Y|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
Here, t1_diff is the difference between the left T1 and the right T1, t2_diff is the difference between the left T2 and the right T2, pr_primary is Y if the row is present in df1 and not in df2, and similarly for pr_reference.
I have generated the above with the following piece of code:
val df1 = Seq(
  ("2016-04-02", "14", "NULL", 9874, 880),
  ("2016-04-30", "14", "FR", 9875, 13),
  ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")

val df2 = Seq(
  ("2016-04-02", "14", "NULL", 9879, 820),
  ("2016-04-30", "14", "FR", 9785, 9),
  ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2", "T1", "T2")

import org.apache.spark.sql.functions._

val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
val j1 = joined.
  withColumn("t1_diff", col("l.T1") - col("r.T1")).
  withColumn("t2_diff", col("l.T2") - col("r.T2"))

val isPresentSubstitution = udf( (x: String, y: String) => if (x == null && y == null) "N" else "Y" )

j1.withColumn("pr_primary", isPresentSubstitution(col("l.T1"), col("l.T2"))).
  withColumn("pr_reference", isPresentSubstitution(col("r.T1"), col("r.T2"))).
  show
I want to generalize this for any number of columns, not just T1 and T2. Can someone suggest a better way to do this? I am running this in Spark.
To set any number of columns like t1_diff, each with its own expression calculating its values, we need some refactoring that allows withColumn to be used in a more generic manner.
First, we need to collect the target values: the names of the target columns and the expressions that calculate their contents. This can be done with a sequence of tuples:
val diffColumns = Seq(
  ("t1_diff", col("l.T1") - col("r.T1")),
  ("t2_diff", col("l.T2") - col("r.T2"))
)
// or, to make it more readable, create a dedicated "case class DiffColumn(colName: String, expression: Column)"
Now we can fold over this sequence to produce the resulting DataFrame from joined:
val joinedWithDiffCols =
  diffColumns.foldLeft(joined) { case (df, diffTuple) =>
    df.withColumn(diffTuple._1, diffTuple._2)
  }
joinedWithDiffCols contains the same data as j1 from the question.
To append new columns, you now only have to modify the diffColumns sequence. You can even put the calculation of pr_primary and pr_reference in this sequence (but rename the val to appendedColumns in that case, to be more precise), as sketched below.
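A minimal sketch of that idea (my addition, reusing the l/r aliases from the join above; the flags mirror the isPresentSubstitution logic from the question, expressed with when instead of a UDF):
val appendedColumns = diffColumns ++ Seq(
  ("pr_primary",   when(col("l.T1").isNull && col("l.T2").isNull, "N").otherwise("Y")),
  ("pr_reference", when(col("r.T1").isNull && col("r.T2").isNull, "N").otherwise("Y"))
)
// same fold as before, just over the extended sequence
val joinedWithAllCols = appendedColumns.foldLeft(joined) {
  case (df, (name, expression)) => df.withColumn(name, expression)
}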
Update
To facilitate the creation of the tuples for diffColumns, their construction can also be generalized, for example:
// when both column names are the same:
def generateDiff(column: String): (String, Column) = generateDiff(column, column)

// when left and right column names are different:
def generateDiff(leftCol: String, rightCol: String): (String, Column) =
  (s"${leftCol}_diff", col("l." + leftCol) - col("r." + rightCol))

val diffColumns = Seq("T1", "T2").map(generateDiff)
End-of-update
Assuming the columns are named the same in both df1 and df2, you can do something like:
val diffCols = df1.columns
  .filter(_.matches("T\\d+"))
  .map(c => col(s"l.$c") - col(s"r.$c") as s"${c.toLowerCase}_diff")
And then use it with joined like this:
joined.select((col("*") +: diffCols): _*).show(false)
//+----------+----+----+----+-----+----+-----+-------+-------+
//|WEEK |DIM1|DIM2|T1 |T2 |T1 |T2 |t1_diff|t2_diff|
//+----------+----+----+----+-----+----+-----+-------+-------+
//|2016-04-02|14 |NULL|9874|880 |9879|820 |-5 |60 |
//|2017-06-10|15 |PQR |9867|57721|null|null |null |null |
//|2017-06-10|15 |XYZ |null|null |9967|57771|null |null |
//|2016-04-30|14 |FR |9875|13 |9785|9 |90 |4 |
//+----------+----+----+----+-----+----+-----+-------+-------+
You can do it by adding a sequence number to each dataframe and then joining the two dataframes on that sequence number.
val df3 = df1.withColumn("SeqNum", monotonicallyIncreasingId)
val df4 = df2.withColumn("SeqNum", monotonicallyIncreasingId)
df3.as("l").join(df4.as("r"),"SeqNum").withColumn("t1_diff",col("l.T1") - col("r.T1")).withColumn("t2_diff",col("l.T2") - col("r.T2")).drop("SeqNum").show()

Spark lookup to return value less than or equal to key

Table A:
------------------------
col1|col2|col3|col4|
------------------------
A |123 |d1 |d2 |
------------------------
A |134 |d3 |d4 |
------------------------
B |156 |d5 |d6 |
------------------------
B |178 |d7 |d8 |
------------------------
Table B:
----------
col1|col2|
----------
A |129 |
----------
A |147 |
----------
B |199 |
----------
B |175 |
----------
I am trying to create a lookup in a Spark UDF to look up values from table A using col1 and col2, and fetch the remaining columns for table B, using the condition
tableA.col1 = tableB.col1 and tableA.col2 <= tableB.col2.
Output:
------------------------
col1|col2|col3|col4|
------------------------
A |129 |d1 |d2 |
------------------------
A |147 |d3 |d4 |
------------------------
B |199 |d7 |d8 |
------------------------
B |175 |d5 |d6 |
------------------------
This is what I have done so far. It works for exact matches, but I am not sure how to handle the less-than-or-equal condition.
// get values
val values = df.select("col1", "col2").map(r => r.toString()).collect.toList
// get keys
val keys = enriched_2080.select($"col3", $"col4").map(r => (r.getString(0), r.getLong(1))).collect.toList
// create a map
val lookup_map = keys.zip(values).toMap
// udf
val lookup_udf = udf { (a: String, b: Long) =>
  (a, b) match { case (x: String, y: Long) => lookup_map.getOrElse((x, y), "") }
}
// call udf
df1.withColumn("result", lookup_udf(df1("col1"), df1("col2"))).show(false)
Is there any reason you want to use a lookup UDF that requires your DataFrame data collect-ed, thus putting a constraint on the size of DataFrames?
If your goal is only to generate the required output DataFrame, the following approach won't impose an unnecessary constraint on the DataFrame size:
val dfA = Seq(
  ("A", 123L, "d1", "d2"),
  ("A", 134L, "d3", "d4"),
  ("B", 156L, "d5", "d6"),
  ("B", 178L, "d7", "d8")
).toDF("col1", "col2", "col3", "col4")

val dfB = Seq(
  ("A", 129L),
  ("A", 147L),
  ("B", 199L),
  ("B", 175L)
).toDF("col1", "col2")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Create a DataFrame with all `dfB.col2 - dfA.col2` values that are >= 0
val dfDiff = dfA.join(dfB, Seq("col1")).
  select(
    dfA("col1"), dfA("col2").as("col2a"), dfB("col2").as("col2b"),
    dfA("col3"), dfA("col4"), (dfB("col2") - dfA("col2")).as("diff")
  ).
  where($"diff" >= 0)
dfDiff.show
// +----+-----+-----+----+----+----+
// |col1|col2a|col2b|col3|col4|diff|
// +----+-----+-----+----+----+----+
// | A| 123| 147| d1| d2| 24|
// | A| 123| 129| d1| d2| 6|
// | A| 134| 147| d3| d4| 13|
// | B| 156| 175| d5| d6| 19|
// | B| 156| 199| d5| d6| 43|
// | B| 178| 199| d7| d8| 21|
// +----+-----+-----+----+----+----+
// Create the result dataset with the minimum `diff` for every `(col1, col2)` in dfA
// and assign the corresponding `dfB.col2` as the new `col2`
val dfResult = dfDiff.
  withColumn("rank", rank.over(Window.partitionBy($"col1", $"col2a").orderBy($"diff"))).
  where($"rank" === 1).
  select($"col1", $"col2b".as("col2"), $"col3", $"col4")
dfResult.show
// +----+----+----+----+
// |col1|col2|col3|col4|
// +----+----+----+----+
// | A| 147| d3| d4|
// | B| 175| d5| d6|
// | A| 129| d1| d2|
// | B| 199| d7| d8|
// +----+----+----+----+
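A small usage note (my addition): if two dfB rows happened to produce the same minimal diff within a partition, rank would keep both of them; row_number keeps exactly one row per group:
// replace rank with row_number to guarantee a single row per (col1, col2a) group
val dfResultSingle = dfDiff.
  withColumn("rn", row_number().over(Window.partitionBy($"col1", $"col2a").orderBy($"diff"))).
  where($"rn" === 1).
  select($"col1", $"col2b".as("col2"), $"col3", $"col4")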

How to update column of spark dataframe based on the values of previous record

I have three columns in df
Col1,col2,col3
X,x1,x2
Z,z1,z2
Y,
X,x3,x4
P,p1,p2
Q,q1,q2
Y
I want to do the following:
when col1 = X, store the values of col2 and col3,
and assign those column values to the next row where col1 = Y.
expected output
X,x1,x2
Z,z1,z2
Y,x1,x2
X,x3,x4
P,p1,p2
Q,q1,q2
Y,x3,x4
Any help would be appreciated
Note: Spark 1.6
Here's one approach using Window functions, with the following steps:
1. Add a row-identifying column (not needed if there already is one) and combine the non-key columns (presumably many of them) into a single struct column
2. Create tmp1 with conditional nulls, and tmp2 using the last/rowsBetween Window function to back-fill with the last non-null value
3. Create newcols conditionally from cols and tmp2
4. Expand newcols back into individual columns using foldLeft
Note that this solution uses a Window function without partitioning, so it may not work well for large datasets.
val df = Seq(
  ("X", "x1", "x2"),
  ("Z", "z1", "z2"),
  ("Y", "", ""),
  ("X", "x3", "x4"),
  ("P", "p1", "p2"),
  ("Q", "q1", "q2"),
  ("Y", "", "")
).toDF("col1", "col2", "col3")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val colList = df.columns.filter(_ != "col1")

val df2 = df.select($"col1", monotonically_increasing_id.as("id"),
  struct(colList.map(col): _*).as("cols")
)

val df3 = df2.
  withColumn("tmp1", when($"col1" === "X", $"cols")).
  withColumn("tmp2", last("tmp1", ignoreNulls = true).over(
    Window.orderBy("id").rowsBetween(Window.unboundedPreceding, 0)
  ))
df3.show
// +----+---+-------+-------+-------+
// |col1| id| cols| tmp1| tmp2|
// +----+---+-------+-------+-------+
// | X| 0|[x1,x2]|[x1,x2]|[x1,x2]|
// | Z| 1|[z1,z2]| null|[x1,x2]|
// | Y| 2| [,]| null|[x1,x2]|
// | X| 3|[x3,x4]|[x3,x4]|[x3,x4]|
// | P| 4|[p1,p2]| null|[x3,x4]|
// | Q| 5|[q1,q2]| null|[x3,x4]|
// | Y| 6| [,]| null|[x3,x4]|
// +----+---+-------+-------+-------+
val df4 = df3.withColumn( "newcols",
when($"col1" === "Y", $"tmp2").otherwise($"cols")
).select($"col1", $"newcols")
df4.show
// +----+-------+
// |col1|newcols|
// +----+-------+
// | X|[x1,x2]|
// | Z|[z1,z2]|
// | Y|[x1,x2]|
// | X|[x3,x4]|
// | P|[p1,p2]|
// | Q|[q1,q2]|
// | Y|[x3,x4]|
// +----+-------+
val dfResult = colList.foldLeft(df4)(
  (accDF, c) => accDF.withColumn(c, df4(s"newcols.$c"))
).drop($"newcols")
dfResult.show
// +----+----+----+
// |col1|col2|col3|
// +----+----+----+
// | X| x1| x2|
// | Z| z1| z2|
// | Y| x1| x2|
// | X| x3| x4|
// | P| p1| p2|
// | Q| q1| q2|
// | Y| x3| x4|
// +----+----+----+
[UPDATE]
For Spark 1.x, last(colName, ignoreNulls) isn't available in the DataFrame API. A workaround is to revert to Spark SQL, which supports ignoring nulls in its last() method:
df2.
  withColumn("tmp1", when($"col1" === "X", $"cols")).
  createOrReplaceTempView("df2table")
// might need to use registerTempTable("df2table") instead

val df3 = spark.sqlContext.sql("""
  select col1, id, cols, tmp1, last(tmp1, true) over (
    order by id rows between unbounded preceding and current row
  ) as tmp2
  from df2table
""")
Yes, there is a lag function, which requires an ordering:
import org.apache.spark.sql.expressions.Window.orderBy
import org.apache.spark.sql.functions.{coalesce, lag}

case class Temp(a: String, b: Option[String], c: Option[String])

val input = ss.createDataFrame(Seq(
  Temp("A", Some("a1"), Some("a2")),
  Temp("D", Some("d1"), Some("d2")),
  Temp("B", Some("b1"), Some("b2")),
  Temp("E", None, None),
  Temp("C", None, None)
))
+---+----+----+
| a| b| c|
+---+----+----+
| A| a1| a2|
| D| d1| d2|
| B| b1| b2|
| E|null|null|
| C|null|null|
+---+----+----+
val order = orderBy($"a")
input
.withColumn("b", coalesce($"b", lag($"b", 1).over(order)))
.withColumn("c", coalesce($"c", lag($"c", 1).over(order)))
.show()
+---+---+---+
| a| b| c|
+---+---+---+
| A| a1| a2|
| B| b1| b2|
| C| b1| b2|
| D| d1| d2|
| E| d1| d2|
+---+---+---+
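One limitation worth noting (my addition, not part of the original answer): lag($"b", 1) only looks one row back, so two consecutive null rows would leave the second one unfilled. A hedged generalization, assuming Spark 2.x, uses last with ignoreNulls over a running window to carry the most recent non-null value forward:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

// running window from the first row up to and including the current row
val w = Window.orderBy($"a").rowsBetween(Window.unboundedPreceding, 0)

input
  .withColumn("b", last($"b", ignoreNulls = true).over(w))
  .withColumn("c", last($"c", ignoreNulls = true).over(w))
  .show()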

Dataframe.map needs to produce more rows than are in the dataset

I am using Scala and Spark, and I have a simple dataframe.map to produce the required transformation on the data. However, I need to provide an additional row of data along with the modified original. How can I use dataframe.map to do this?
ex:
dataset from:
id, name, age
1, john, 23
2, peter, 32
if age < 25 default to 25.
dataset to:
id, name, age
1, john, 25
1, john, -23
2, peter, 32
Would a 'UnionAll' handle it?
eg.
df1 = original dataframe
df2 = transformed df1
df1.unionAll(df2)
EDIT: implementation using unionAll()
val df1 = sqlContext.createDataFrame(Seq((1, "john", 23), (2, "peter", 32))).
  toDF("id", "name", "age")

def udfTransform = udf[Int, Int] { (age) => if (age < 25) 25 else age }

val df2 = df1.withColumn("age2", udfTransform($"age")).
  where("age != age2").
  drop("age2")

df1.withColumn("age", udfTransform($"age")).
  unionAll(df2).
  orderBy("id").
  show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 25|
| 1| john| 23|
| 2|peter| 32|
+---+-----+---+
Note: the implementation differs a bit from the originally proposed (naive) solution. The devil is always in the detail!
EDIT 2: implementation using nested array and explode
val df1 = sx.createDataFrame(Seq((1, "john", 23), (2, "peter", 32))).
  toDF("id", "name", "age")

def udfArr = udf[Array[Int], Int] { (age) =>
  if (age < 25) Array(age, 25) else Array(age)
}

val df2 = df1.withColumn("age", udfArr($"age"))
df2.show()
+---+-----+--------+
| id| name| age|
+---+-----+--------+
| 1| john|[23, 25]|
| 2|peter| [32]|
+---+-----+--------+
df2.withColumn("age",explode($"age") ).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 23|
| 1| john| 25|
| 2|peter| 32|
+---+-----+---+
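For completeness, a hedged variant of the same idea without a UDF (my sketch, not part of the original answer), building the array with when/array/lit and exploding it directly:
import org.apache.spark.sql.functions.{array, col, explode, lit, when}

val df3 = df1.withColumn(
  "age",
  explode(when(col("age") < 25, array(col("age"), lit(25))).otherwise(array(col("age"))))
)
df3.show()
// produces the same rows as the UDF-based version above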