Add additional columns to Spark dataframe - scala

I build Spark DataFrames from file paths, and now I would like to add the path, together with its time, to the resulting DataFrame as separate columns. Here is my current solution (pathToDF is a helper method):
val paths = pathsDF
  .orderBy($"time")
  .select($"path")
  .as[String]
  .collect()

if (paths.nonEmpty) {
  paths
    .grouped(groupsNum.getOrElse(paths.length))
    .map(_.map(pathToDF).reduceLeft(_ union _))
} else {
  Seq.empty[DataFrame]
}
I am trying to do something like the following, but I am not sure how to add the time column as well using withColumn:
val orderedPaths = pathsDF
  .orderBy($"time")
  .select($"path")
  //.select($"path", $"time") for both columns

val paths = orderedPaths
  .as[String]
  .collect()

if (paths.nonEmpty) {
  paths
    .grouped(groupsNum.getOrElse(paths.length))
    .map(group => group.map(pathToDataDF).reduceLeft(_ union _)
      .withColumn("path", orderedPaths("path")))
    //.withColumn("time", orderedPaths("time")) something like this
} else {
  Seq.empty[DataFrame]
}
What would be a better way to implement it?
Input DF:
time Long
path String
Current result:
resultDF schema
field1 Int
field2 String
....
fieldN String
Expected result:
resultDF schema
field1 Int
field2 String
....
path String
time Long

Please check the code below.
1. Change grouped to par to load the data in parallel.
2. Change

// The code below would attach one path to the content of several files.
paths.grouped(groupsNum.getOrElse(paths.length))
  .map(group => group.map(pathToDataDF).reduceLeft(_ union _)
    .withColumn("path", orderedPaths("path")))
to

// The code below attaches each file's own path to that file's content.
paths
  .grouped(groupsNum.getOrElse(paths.length))
  .flatMap(group => {
    group.map(path => {
      pathToDataDF(path).withColumn("path", lit(path))
    })
  })
  .reduceLeft(_ union _)
In the examples below I show both the par and grouped approaches.
Note: please ignore helper methods like pathToDataDF; I have defined them only to replicate your methods.
scala> val orderedPaths = Seq(("/tmp/data/foldera/foldera.json","2020-05-29 01:30:00"),("/tmp/data/folderb/folderb.json","2020-05-29 02:00:00"),("/tmp/data/folderc/folderc.json","2020-05-29 03:00:00")).toDF("path","time")
orderedPaths: org.apache.spark.sql.DataFrame = [path: string, time: string]
scala> def pathToDataDF(path: String) = spark.read.format("json").load(path)
pathToDataDF: (path: String)org.apache.spark.sql.DataFrame
// Sample file contents I used (the .! shell call requires import scala.sys.process._).
scala> "cat /tmp/data/foldera/foldera.json".!
{"name":"Srinivas","age":29}
scala> "cat /tmp/data/folderb/folderb.json".!
{"name":"Ravi","age":20}
scala> "cat /tmp/data/folderc/folderc.json".!
{"name":"Raju","age":25}
Using par
scala> val paths = orderedPaths.orderBy($"time").select($"path").as[String].collect
paths: Array[String] = Array(/tmp/data/foldera/foldera.json, /tmp/data/folderb/folderb.json, /tmp/data/folderc/folderc.json)
scala> val parDF = paths match {
  case p if !p.isEmpty => {
    p.par
      .map(path => {
        pathToDataDF(path)
          .withColumn("path", lit(path))
      }).reduceLeft(_ union _)
  }
  case _ => spark.emptyDataFrame
}
parDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 1 more field]
scala> parDF.show(false)
+---+--------+------------------------------+
|age|name |path |
+---+--------+------------------------------+
|29 |Srinivas|/tmp/data/foldera/foldera.json|
|20 |Ravi |/tmp/data/folderb/folderb.json|
|25 |Raju |/tmp/data/folderc/folderc.json|
+---+--------+------------------------------+
// With time column.
scala> val paths = orderedPaths.orderBy($"time").select($"path",$"time").as[(String,String)].collect
paths: Array[(String, String)] = Array((/tmp/data/foldera/foldera.json,2020-05-29 01:30:00), (/tmp/data/folderb/folderb.json,2020-05-29 02:00:00), (/tmp/data/folderc/folderc.json,2020-05-29 03:00:00))
scala> val parDF = paths match {
  case p if !p.isEmpty => {
    p.par
      .map(path => {
        pathToDataDF(path._1)
          .withColumn("path", lit(path._1))
          .withColumn("time", lit(path._2))
      }).reduceLeft(_ union _)
  }
  case _ => spark.emptyDataFrame
}
parDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 2 more fields]
scala> parDF.show(false)
+---+--------+------------------------------+-------------------+
|age|name |path |time |
+---+--------+------------------------------+-------------------+
|29 |Srinivas|/tmp/data/foldera/foldera.json|2020-05-29 01:30:00|
|20 |Ravi |/tmp/data/folderb/folderb.json|2020-05-29 02:00:00|
|25 |Raju |/tmp/data/folderc/folderc.json|2020-05-29 03:00:00|
+---+--------+------------------------------+-------------------+
Using grouped
scala> val paths = orderedPaths.orderBy($"time").select($"path").as[String].collect
paths: Array[String] = Array(/tmp/data/foldera/foldera.json, /tmp/data/folderb/folderb.json, /tmp/data/folderc/folderc.json)
scala> val groupedDF = paths match {
  case p if !p.isEmpty => {
    paths
      .grouped(groupsNum.getOrElse(paths.length))
      .flatMap(group => {
        group
          .map(path => {
            pathToDataDF(path)
              .withColumn("path", lit(path))
          })
      }).reduceLeft(_ union _)
  }
  case _ => spark.emptyDataFrame
}
groupedDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 1 more field]
scala> groupedDF.show(false)
+---+--------+------------------------------+
|age|name |path |
+---+--------+------------------------------+
|29 |Srinivas|/tmp/data/foldera/foldera.json|
|20 |Ravi |/tmp/data/folderb/folderb.json|
|25 |Raju |/tmp/data/folderc/folderc.json|
+---+--------+------------------------------+
// with time column.
scala> val paths = orderedPaths.orderBy($"time").select($"path",$"time").as[(String,String)].collect
paths: Array[(String, String)] = Array((/tmp/data/foldera/foldera.json,2020-05-29 01:30:00), (/tmp/data/folderb/folderb.json,2020-05-29 02:00:00), (/tmp/data/folderc/folderc.json,2020-05-29 03:00:00))
scala> val groupedDF = paths match {
  case p if !p.isEmpty => {
    paths
      .grouped(groupsNum.getOrElse(paths.length))
      .flatMap(group => {
        group
          .map(path => {
            pathToDataDF(path._1)
              .withColumn("path", lit(path._1))
              .withColumn("time", lit(path._2))
          })
      }).reduceLeft(_ union _)
  }
  case _ => spark.emptyDataFrame
}
groupedDF: org.apache.spark.sql.DataFrame = [age: bigint, name: string ... 2 more fields]
scala> groupedDF.show(false)
+---+--------+------------------------------+-------------------+
|age|name |path |time |
+---+--------+------------------------------+-------------------+
|29 |Srinivas|/tmp/data/foldera/foldera.json|2020-05-29 01:30:00|
|20 |Ravi |/tmp/data/folderb/folderb.json|2020-05-29 02:00:00|
|25 |Raju |/tmp/data/folderc/folderc.json|2020-05-29 03:00:00|
+---+--------+------------------------------+-------------------+
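A side note on an alternative, not part of the solution above: if the goal is only to tag each row with its source file, Spark's built-in input_file_name() function can do that without adding the column file by file; the time column would then have to be joined back from pathsDF on path. A minimal sketch, assuming the same orderedPaths as in the session above:

import org.apache.spark.sql.functions.input_file_name

val pathStrings = orderedPaths.orderBy($"time").select($"path").as[String].collect()

// Read all collected paths in one go and let Spark record each row's source file.
val allDF = spark.read.format("json")
  .load(pathStrings: _*)
  .withColumn("path", input_file_name())

Note that input_file_name() returns the fully qualified URI (e.g. file:///tmp/...), so a join back on path may need normalization.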

Related

How to order .jpg files in ascending order of the datetime when the image was captured, in Scala

The images are saved in the format photo name.extension, cityname, yyyy-mm-dd hh:mm:ss.
I am trying to write a function in Scala which gives the desired result.
For example:
john.jpg, USA,2013-09-15 14:08:15
BOB.jpg, UK,2013-09-15 14:08:15
RONY.jpg, USA,2013-09-15 19:08:15
A.PNG, USA,2018-09-15 21:08:15
TONY.jpg, CHINA,2020-09-15 19:08:15
MONY.PNG, CHINA,2021-09-15 21:08:15
RONY.jpg, CHINA,2015-09-15 19:08:15
A.PNG, JAPAN,2019-09-15 21:08:15
EXPECTED OUTPUT:
USA01.JPG
UK01.JPG
USA02.JPG
USA03.PNG
CHINA01.JPG
CHINA02.PNG
CHINA03.JPG
JAPAN01.PNG
There are 3 pictures from the USA, so USA01, USA02 and USA03; similarly CHINA01, CHINA02 and CHINA03.
I would appreciate your suggestions or approach.
Thanks
I broke it down into steps to make it clearer:
scala> val images = List(
| "john.jpg, USA,2013-09-15 14:08:15",
| "BOB.jpg, UK,2013-09-15 14:08:15",
| "RONY.jpg, USA,2013-09-15 19:08:15",
| "A.PNG, USA,2018-09-15 21:08:15",
| "TONY.jpg, CHINA,2020-09-15 19:08:15",
| "MONY.PNG, CHINA,2021-09-15 21:08:15",
| "RONY.jpg, CHINA,2015-09-15 19:08:15",
| "A.PNG, JAPAN,2019-09-15 21:08:15"
| )
val images: List[String] = List(john.jpg, USA,2013-09-15 14:08:15, BOB.jpg, UK,2013-09-15 14:08:15, RONY.jpg, USA,2013-09-15 19:08:15, A.PNG, USA,2018-09-15 21:08:15, TONY.jpg, CHINA,2020-09-15 19:08:15, MONY.PNG, CHINA,2021-09-15 21:08:15, RONY.jpg, CHINA,2015-09-15 19:08:15, A.PNG, JAPAN,2019-09-15 21:08:15)
scala> val props = images.map(_.split(",").map(_.trim)).filter(_.size == 3).map{case Array(x, y, z) => (x, y, z)}
val props: List[(String, String, String)] = List((john.jpg,USA,2013-09-15 14:08:15), (BOB.jpg,UK,2013-09-15 14:08:15), (RONY.jpg,USA,2013-09-15 19:08:15), (A.PNG,USA,2018-09-15 21:08:15), (TONY.jpg,CHINA,2020-09-15 19:08:15), (MONY.PNG,CHINA,2021-09-15 21:08:15), (RONY.jpg,CHINA,2015-09-15 19:08:15), (A.PNG,JAPAN,2019-09-15 21:08:15))
scala> val sortedProps = props.sortBy(_._3)
val sortedProps: List[(String, String, String)] = List((john.jpg,USA,2013-09-15 14:08:15), (BOB.jpg,UK,2013-09-15 14:08:15), (RONY.jpg,USA,2013-09-15 19:08:15), (RONY.jpg,CHINA,2015-09-15 19:08:15), (A.PNG,USA,2018-09-15 21:08:15), (A.PNG,JAPAN,2019-09-15 21:08:15), (TONY.jpg,CHINA,2020-09-15 19:08:15), (MONY.PNG,CHINA,2021-09-15 21:08:15))
scala> val relevantProps = sortedProps.map{ case (fname, cntry, date) => (fname.split("\\.")(1).toUpperCase, cntry) }
val relevantProps: List[(String, String)] = List((JPG,USA), (JPG,UK), (JPG,USA), (JPG,CHINA), (PNG,USA), (PNG,JAPAN), (JPG,CHINA), (PNG,CHINA))
scala> val (files, counts) = relevantProps.foldLeft((List[String](), Map[String, Int]())) { case ((res, counts), (ext, cntry)) =>
| val count = counts.getOrElse(cntry, 0) + 1
| ((s"$cntry$count.$ext") :: res, counts.updated(cntry, count))
| }
val files: List[String] = List(CHINA3.PNG, CHINA2.JPG, JAPAN1.PNG, USA3.PNG, CHINA1.JPG, USA2.JPG, UK1.JPG, USA1.JPG)
val counts: scala.collection.immutable.Map[String,Int] = Map(USA -> 3, UK -> 1, CHINA -> 3, JAPAN -> 1)
scala> val result = files.reverse
val result: List[String] = List(USA1.JPG, UK1.JPG, USA2.JPG, CHINA1.JPG, USA3.PNG, JAPAN1.PNG, CHINA2.JPG, CHINA3.PNG)
Or, a one-liner just for fun:
List(
"john.jpg, USA,2013-09-15 14:08:15",
"BOB.jpg, UK,2013-09-15 14:08:15",
"RONY.jpg, USA,2013-09-15 19:08:15",
"A.PNG, USA,2018-09-15 21:08:15",
"TONY.jpg, CHINA,2020-09-15 19:08:15",
"MONY.PNG, CHINA,2021-09-15 21:08:15",
"RONY.jpg, CHINA,2015-09-15 19:08:15",
"A.PNG, JAPAN,2019-09-15 21:08:15"
).map(_.split(",").map(_.trim))
.filter(_.size == 3).map{case Array(x, y, z) => (x, y, z)}
.sortBy(_._3)
.map{ case (fname, cntry, date) => (fname.split("\\.")(1).toUpperCase, cntry) }
.foldLeft((List[String](), Map[String, Int]())) { case ((res, counts), (ext, cntry)) =>
val count = counts.getOrElse(cntry, 0) + 1
((s"$cntry$count.$ext") :: res, counts.updated(cntry, count))
}._1.reverse
Output:
val res0: List[String] = List(USA1.JPG, UK1.JPG, USA2.JPG, CHINA1.JPG, USA3.PNG, JAPAN1.PNG, CHINA2.JPG, CHINA3.PNG)
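If the zero-padded numbering from the expected output matters (USA01 rather than USA1), the fold step can format the counter with two digits; a small tweak of the step above:

val (files, counts) = relevantProps.foldLeft((List[String](), Map[String, Int]())) {
  case ((res, counts), (ext, cntry)) =>
    val count = counts.getOrElse(cntry, 0) + 1
    // f-interpolator pads the per-country counter to two digits.
    (f"$cntry$count%02d.$ext" :: res, counts.updated(cntry, count))
}
// files.reverse should then give:
// List(USA01.JPG, UK01.JPG, USA02.JPG, CHINA01.JPG, USA03.PNG, JAPAN01.PNG, CHINA02.JPG, CHINA03.PNG)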

How to loop through multiple column values in a DataFrame to get a count

I have a list of tables, say x, y, z, and each table has some columns, for example test, test1, test2, test3 for table x, just as we have columns rem, rem1, rem2 for table y; similarly for table z. The requirement is to loop through each column in a table and get a row count based on the scenario below.
If test is not NULL and all the others (test1, test2, test3) are NULL, then it counts as one.
So we have to loop through each table, find the columns like test*, check the condition above, and mark the row as one count if it satisfies that condition.
I'm pretty new to Scala, but I thought of the approach below.
for each $tablename {
  val df = sql("select * from $tablename")
  val coldf = df.select(df.columns.filter(_.startsWith("test")).map(df(_)) : _*)
  val df_filtered = coldf.map(eachrow => df.filter(s$"eachrow".isNull))
}
It is not working for me, and I have no idea where to put the count variable. If someone can help with this, I would really appreciate it.
I'm using Spark 2 with Scala.
Code update
Below is the code for generating the table list and the table-column mapping list.
val table_names = sql("SELECT t1.Table_Name ,t1.col_name FROM table_list t1 LEFT JOIN db_list t2 ON t2.tableName == t1.Table_Name WHERE t2.tableName IS NOT NULL ").toDF("tabname", "colname")
//List of all tables in the db as list of df
val dfList = table_names.select("tabname").map(r => r.getString(0)).collect.toList
val dfTableList = dfList.map(spark.table)
//Mapping each table with keycol
val tabColPairList = table_names.rdd.map( r => (r(0).toString, r(1).toString)).collect
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
After this I'm using the methods below.
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft(0)(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}
val dfCountList = dfTableList.map { df =>
  val keyCol = dfColMap(df)
  //println(keyCol)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten
  val rddWithCount = df.rdd.map { case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
It's giving me the error below:
createCount: (row: org.apache.spark.sql.Row, keyCol: String, checkCols: Seq[String])Int
java.util.NoSuchElementException: key not found: [id: string, projid: string ... 40 more fields]
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:121)
at $$$$cfae60361b5eed1a149235c6e9b59b24$$$$$anonfun$1.apply(<console>:120)
at scala.collection.immutable.List.map(List.scala:273)
... 78 elided
Given your requirement, I would suggest taking advantage of RDD functionality and using a Row-based method that creates your count for each Row per DataFrame:
val dfX = Seq(
("a", "ma", "a1", "a2", "a3"),
("b", "mb", null, null, null),
("null", "mc", "c1", null, "c3")
).toDF("xx", "mm", "xx1", "xx2", "xx3")
val dfY = Seq(
("nd", "d", "d1", null),
("ne", "e", "e1", "e2"),
("nf", "f", null, null)
).toDF("nn", "yy", "yy1", "yy2")
val dfZ = Seq(
("g", null, "g1", "g2", "qg"),
("h", "ph", null, null, null),
("i", "pi", null, null, "qi")
).toDF("zz", "pp", "zz1", "zz2", "qq")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.IntegerType
val dfList = List(dfX, dfY, dfZ)
val dfColMap = Map(dfX -> "xx", dfY -> "yy", dfZ -> "zz")
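// createCount below returns 1 when the key column is non-null and every matching
// "check" column (keyCol1, keyCol2, ...) is null; otherwise it returns 0.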
def createCount(row: Row, keyCol: String, checkCols: Seq[String]): Int = {
  if (row.isNullAt(row.fieldIndex(keyCol))) 0 else {
    val nonNulls = checkCols.map(row.fieldIndex(_)).foldLeft(0)(
      (acc, i) => if (row.isNullAt(i)) acc else acc + 1
    )
    if (nonNulls == 0) 1 else 0
  }
}
val dfCountList = dfList.map { df =>
  val keyCol = dfColMap(df)
  val colPattern = s"$keyCol\\d+".r
  val checkCols = df.columns.map( c => c match {
    case colPattern() => Some(c)
    case _ => None
  } ).flatten
  val rddWithCount = df.rdd.map { case r: Row =>
    Row.fromSeq(r.toSeq ++ Seq(createCount(r, keyCol, checkCols)))
  }
  spark.createDataFrame(rddWithCount, df.schema.add("count", IntegerType))
}
dfCountList(0).show
// +----+---+----+----+----+-----+
// | xx| mm| xx1| xx2| xx3|count|
// +----+---+----+----+----+-----+
// | a| ma| a1| a2| a3| 0|
// | b| mb|null|null|null| 1|
// |null| mc| c1|null| c3| 0|
// +----+---+----+----+----+-----+
dfCountList(1).show
// +---+---+----+----+-----+
// | nn| yy| yy1| yy2|count|
// +---+---+----+----+-----+
// | nd| d| d1|null| 0|
// | ne| e| e1| e2| 0|
// | nf| f|null|null| 1|
// +---+---+----+----+-----+
dfCountList(2).show
// +---+----+----+----+----+-----+
// | zz| pp| zz1| zz2| qq|count|
// +---+----+----+----+----+-----+
// | g|null| g1| g2| qg| 0|
// | h| ph|null|null|null| 1|
// | i| pi|null|null| qi| 1|
// +---+----+----+----+----+-----+
[UPDATE]
Note that the above solution works for any number of DataFrames as long as you have them in dfList and their corresponding key columns in dfColMap.
If you have a list of Hive tables instead, simply convert them into DataFrames using spark.table(), as below:
val tableList = List("tableX", "tableY", "tableZ")
val dfList = tableList.map(spark.table)
// dfList: List[org.apache.spark.sql.DataFrame] = List(...)
Now you still have to tell Spark what the key column for each table is. Let's say you have the key columns in a list in the same order as the tables. You can zip the two lists to create dfColMap, and you'll have everything needed to apply the above solution:
val keyColList = List("xx", "yy", "zz")
val dfColMap = dfList.zip(keyColList).toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
[UPDATE #2]
If you have the Hive table names and their corresponding key column names stored in a DataFrame, you can generate dfColMap as follows:
val dfTabColPair = Seq(
("tableX", "xx"),
("tableY", "yy"),
("tableZ", "zz")
).toDF("tabname", "colname")
val tabColPairList = dfTabColPair.rdd.map(r => (r(0).toString, r(1).toString)).collect
// tabColPairList: Array[(String, String)] = Array((tableX,xx), (tableY,yy), (tableZ,zz))
val dfColMap = tabColPairList.map{ case (t, c) => (spark.table(t), c) }.toMap
// dfColMap: scala.collection.immutable.Map[org.apache.spark.sql.DataFrame,String] = Map(...)
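Regarding the key not found error in your updated code: a Map keyed by DataFrames uses reference equality, and two separate spark.table(t) calls return two distinct DataFrame instances, so dfColMap(df) cannot find a df that came from an independently built dfTableList. A possible fix, sketched against your updated code, is to reuse the exact instances that are the map's keys:

val dfColMap = tabColPairList.map { case (t, c) => (spark.table(t), c) }.toMap
// Build the table list from the map's own keys instead of calling spark.table again,
// so that dfColMap(df) succeeds inside the loop.
val dfTableList = dfColMap.keys.toList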

Spark Best way groupByKey, orderBy and filter

I have 50 GB of data with the schema [ID, timestamp, countryId], and I would like to get each "change" of each person across all of their events, ordered by timestamp, using Spark 2.2.1. I mean, if I have these events:
1,20180101,2
1,20180102,3
1,20180105,3
2,20180105,3
1,20180108,4
1,20180109,3
2,20180108,3
2,20180109,6
I would like to obtain this:
1,20180101,2
1,20180102,3
1,20180108,4
1,20180109,3
2,20180105,3
2,20180109,6
For this I have developed this code:
val eventsOrdened = eventsDataFrame.orderBy("ID", "timestamp")
val grouped = eventsOrdened
  .rdd.map(x => (x.getString(0), x))
  .groupByKey(300)
  .mapValues(y => cleanEvents(y))
  .flatMap(_._2)
where "cleanEvents" is:
def cleanEvents(ordenedEvents: Iterable[Row]): Iterable[Row] = {
  val ordered = ordenedEvents.toList
  val cleanedList: ListBuffer[Row] = ListBuffer.empty[Row]
  ordered.map { x =>
    val next = if (ordered.indexOf(x) != ordered.length - 1) ordered(ordered.indexOf(x) + 1) else x
    val country = x.get(2)
    val nextCountry = next.get(2)
    val isFirst = if (cleanedList.isEmpty) true else false
    val isLast = if (ordered.indexOf(x) == ordered.length - 1) true else false
    if (isFirst) {
      cleanedList.append(x)
    } else {
      if (cleanedList.size >= 1 && cleanedList.last.get(2) != country && country != nextCountry) {
        cleanedList.append(x)
      } else {
        if (isLast && cleanedList.last.get(2) != country) cleanedList.append(x)
      }
    }
  }
  cleanedList
}
It works, but it's too slow; any optimizations are welcome!
Thanks!
Window function "lag" can be used:
case class Details(id: Int, date: Int, cc: Int)
val list = List[Details](
Details(1, 20180101, 2),
Details(1, 20180102, 3),
Details(1, 20180105, 3),
Details(2, 20180105, 3),
Details(1, 20180108, 4),
Details(1, 20180109, 3),
Details(2, 20180108, 3),
Details(2, 20180109, 6))
val ds = list.toDS()
// action
val window = Window.partitionBy("id").orderBy("date")
val result = ds.withColumn("lag", lag($"cc", 1).over(window)).where(isnull($"lag") || $"lag" =!= $"cc").orderBy("id", "date")
result.show(false)
Result is (lag column can be removed):
+---+--------+---+----+
|id |date    |cc |lag |
+---+--------+---+----+
|1  |20180101|2  |null|
|1  |20180102|3  |2   |
|1  |20180108|4  |3   |
|1  |20180109|3  |4   |
|2  |20180105|3  |null|
|2  |20180109|6  |3   |
+---+--------+---+----+
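As noted, once the filter has produced the result, the helper column can simply be dropped, for example:

val finalResult = result.drop("lag")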
You might want to try the following:
Secondary sorting. It does low-level partitioning and sorting, and you create a custom partitioner. More info here: http://codingjunkie.net/spark-secondary-sort/
Use combineByKey
case class Details(id: Int, date: Int, cc: Int)
val sc = new SparkContext("local[*]", "App")
val list = List[Details](
Details(1,20180101,2),
Details(1,20180102,3),
Details(1,20180105,3),
Details(2,20180105,3),
Details(1,20180108,4),
Details(1,20180109,3),
Details(2,20180108,3),
Details(2,20180109,6))
val rdd = sc.parallelize(list)
val createCombiner = (v: (Int, Int)) => List[(Int, Int)](v)
val combiner = (c: List[(Int, Int)], v: (Int, Int)) => (c :+ v).sortBy(_._1)
val mergeCombiner = (c1: List[(Int, Int)], c2: List[(Int, Int)]) => (c1 ++ c2).sortBy(_._1)
rdd
  .map(det => (det.id, (det.date, det.cc)))
  .combineByKey(createCombiner, combiner, mergeCombiner)
  .collect()
  .foreach(println)
the output would be something like this:
(1,List((20180101,2), (20180102,3), (20180105,3), (20180108,4), (20180109,3)))
(2,List((20180105,3), (20180108,3), (20180109,6)))
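The combineByKey output above only groups and sorts each id's events; to get the "changes" from the question, the sorted list per key still has to be collapsed so that consecutive events with the same country are reduced to the first one. A sketch of that extra step, reusing the same createCombiner/combiner/mergeCombiner:

rdd
  .map(det => (det.id, (det.date, det.cc)))
  .combineByKey(createCombiner, combiner, mergeCombiner)
  .flatMapValues { events =>
    // Keep an event only when its cc differs from the previously kept event's cc.
    events.foldLeft(List.empty[(Int, Int)]) {
      case (acc, ev) if acc.isEmpty || acc.head._2 != ev._2 => ev :: acc
      case (acc, _) => acc
    }.reverse
  }
  .collect()
  .foreach(println)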

NullPointerException when using UDF in Spark

I have a DataFrame in Spark such as this one:
var df = List(
(1,"{NUM.0002}*{NUM.0003}"),
(2,"{NUM.0004}+{NUM.0003}"),
(3,"END(6)"),
(4,"END(4)")
).toDF("CODE", "VALUE")
+----+---------------------+
|CODE| VALUE|
+----+---------------------+
| 1|{NUM.0002}*{NUM.0003}|
| 2|{NUM.0004}+{NUM.0003}|
| 3| END(6)|
| 4| END(4)|
+----+---------------------+
My task is to iterate through the VALUE column and do the following: check if there is a substring such as {NUM.XXXX}, get the XXXX number, get the row where $"CODE" === XXXX, and replace the {NUM.XXXX} substring with the VALUE string from that row.
I would like the dataframe to look like this in the end:
+----+--------------------+
|CODE| VALUE|
+----+--------------------+
| 1|END(4)+END(6)*END(6)|
| 2| END(4)+END(6)|
| 3| END(6)|
| 4| END(4)|
+----+--------------------+
This is the best I've come up with:
val process = udf((ln: String) => {
  var newln = ln
  while (newln contains "{NUM.") {
    var num = newln.slice(newln.indexOf("{") + 5, newln.indexOf("}")).toInt
    var new_value = df.where($"CODE" === num).head.getAs[String](1)
    newln = newln.replace(newln.slice(newln.indexOf("{"), newln.indexOf("}") + 1), new_value)
  }
  newln
})
var df2 = df.withColumn("VALUE", when('VALUE contains "{NUM.",process('VALUE)).otherwise('VALUE))
Unfortunately, I get a NullPointerException when I try to filter/select/save df2, and no error when I just show df2. I believe the error appears when I access the DataFrame df within the UDF, but I need to access it every iteration, so I can't pass it as an input. Also, I've tried saving a copy of df inside the UDF but I don't know how to do that. What can I do here?
Any suggestions to improve the algorithm are very welcome! Thanks!
I wrote something which works but is not very optimized, I think. I do recursive joins on the initial DataFrame to replace the NUMs by the ENDs. Here is the code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class Data(code: Long, value: String)

def main(args: Array[String]): Unit = {
  val sparkSession: SparkSession = SparkSession.builder().master("local").getOrCreate()
  val data = Seq(
    Data(1, "{NUM.0002}*{NUM.0003}"),
    Data(2, "{NUM.0004}+{NUM.0003}"),
    Data(3, "END(6)"),
    Data(4, "END(4)"),
    Data(5, "{NUM.0002}")
  )
  val initialDF = sparkSession.createDataFrame(data)
  val endDF = initialDF.filter(!(col("value") contains "{NUM"))
  val numDF = initialDF.filter(col("value") contains "{NUM")
  val resultDF = endDF.union(replaceNumByEnd(initialDF, numDF))
  resultDF.show(false)
}

val parseNumUdf = udf((value: String) => {
  if (value.contains("{NUM")) {
    val regex = """.*?\{NUM\.(\d+)\}.*""".r
    value match {
      case regex(code) => code.toLong
    }
  } else {
    -1L
  }
})

val replaceUdf = udf((value: String, replacement: String) => {
  val regex = """\{NUM\.(\d+)\}""".r
  regex.replaceFirstIn(value, replacement)
})

def replaceNumByEnd(initialDF: DataFrame, currentDF: DataFrame): DataFrame = {
  if (currentDF.count() == 0) {
    currentDF
  } else {
    val numDFWithCode = currentDF
      .withColumn("num_code", parseNumUdf(col("value")))
      .withColumnRenamed("code", "code_original")
      .withColumnRenamed("value", "value_original")
    val joinedDF = numDFWithCode.join(initialDF, numDFWithCode("num_code") === initialDF("code"))
    val replacedDF = joinedDF.withColumn("value_replaced", replaceUdf(col("value_original"), col("value")))
    val nextDF = replacedDF.select(col("code_original").as("code"), col("value_replaced").as("value"))
    val endDF = nextDF.filter(!(col("value") contains "{NUM"))
    val numDF = nextDF.filter(col("value") contains "{NUM")
    endDF.union(replaceNumByEnd(initialDF, numDF))
  }
}
If you need more explanation, don't hesitate.
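An alternative worth considering (a sketch, not the recursive-join approach above): the NullPointerException comes from touching the DataFrame df inside a UDF running on the executors, so another way out is to collect the small CODE -> VALUE lookup to the driver and resolve the references against a plain Map captured in the UDF's closure:

import scala.util.matching.Regex
import org.apache.spark.sql.functions.udf

// Assumes the lookup table is small enough to collect to the driver
// and that there are no cyclic {NUM.xxxx} references.
val lookup: Map[Int, String] = df.as[(Int, String)].collect().toMap

val resolve = udf { (value: String) =>
  val numPattern = """\{NUM\.(\d+)\}""".r
  var current = value
  // Resolve nested references iteratively until none are left.
  while (numPattern.findFirstIn(current).isDefined) {
    current = numPattern.replaceAllIn(
      current,
      m => Regex.quoteReplacement(lookup(m.group(1).toInt)))
  }
  current
}

val df2 = df.withColumn("VALUE", resolve($"VALUE"))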

Splitting columns in a Spark dataframe into new rows [Scala]

I have output from a spark data frame like below:
Amt |id |num |Start_date |Identifier
43.45|19840|A345|[2014-12-26, 2013-12-12]|[232323,45466]|
43.45|19840|A345|[2010-03-16, 2013-16-12]|[34343,45454]|
My requirement is to generate output in the format below from the above output:
Amt |id |num |Start_date |Identifier
43.45|19840|A345|2014-12-26|232323
43.45|19840|A345|2013-12-12|45466
43.45|19840|A345|2010-03-16|34343
43.45|19840|A345|2013-16-12|45454
Can somebody help me to achieve this?
Is this the thing you're looking for?
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sparkSession = ...
import sparkSession.implicits._
val input = sc.parallelize(Seq(
(43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq(232323,45466)),
(43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq(34343,45454))
)).toDF("amt", "id", "num", "start_date", "identifier")
val zipArrays = udf { (dates: Seq[String], identifiers: Seq[Int]) =>
dates.zip(identifiers)
}
val output = input.select($"amt", $"id", $"num", explode(zipArrays($"start_date", $"identifier")))
.select($"amt", $"id", $"num", $"col._1".as("start_date"), $"col._2".as("identifier"))
output.show()
Which returns:
+-----+-----+----+----------+----------+
| amt| id| num|start_date|identifier|
+-----+-----+----+----------+----------+
|43.45|19840|A345|2014-12-26| 232323|
|43.45|19840|A345|2013-12-12| 45466|
|43.45|19840|A345|2010-03-16| 34343|
|43.45|19840|A345|2013-16-12| 45454|
+-----+-----+----+----------+----------+
EDIT:
Since you would like to have multiple columns that should be zipped, you should try something like this:
val input = sc.parallelize(Seq(
(43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq("232323","45466"), Seq("123", "234")),
(43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq("34343","45454"), Seq("345", "456"))
)).toDF("amt", "id", "num", "start_date", "identifier", "another_column")
val zipArrays = udf { seqs: Seq[Seq[String]] =>
for(i <- seqs.head.indices) yield seqs.fold(Seq.empty)((accu, seq) => accu :+ seq(i))
}
val columnsToSelect = Seq($"amt", $"id", $"num")
val columnsToZip = Seq($"start_date", $"identifier", $"another_column")
val outputColumns = columnsToSelect ++ columnsToZip.zipWithIndex.map { case (column, index) =>
$"col".getItem(index).as(column.toString())
}
val output = input.select($"amt", $"id", $"num", explode(zipArrays(array(columnsToZip: _*)))).select(outputColumns: _*)
output.show()
/*
+-----+-----+----+----------+----------+--------------+
| amt| id| num|start_date|identifier|another_column|
+-----+-----+----+----------+----------+--------------+
|43.45|19840|A345|2014-12-26| 232323| 123|
|43.45|19840|A345|2013-12-12| 45466| 234|
|43.45|19840|A345|2010-03-16| 34343| 345|
|43.45|19840|A345|2013-16-12| 45454| 456|
+-----+-----+----+----------+----------+--------------+
*/
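As a side note (an option the question does not require): on Spark 2.4 or later the built-in arrays_zip function can replace the custom zipArrays UDF; the zipped struct's fields take the names of the input columns when plain column references are passed, so the original two-column case could be sketched as:

import org.apache.spark.sql.functions.{arrays_zip, col, explode}

val output = input
  .withColumn("zipped", explode(arrays_zip(col("start_date"), col("identifier"))))
  .select(col("amt"), col("id"), col("num"),
    col("zipped.start_date").as("start_date"),
    col("zipped.identifier").as("identifier"))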
If I understand correctly, you want the first elements of col 3 and 4.
Does this make sense?
val newDataFrame = for {
  row <- oldDataFrame.rdd
} yield {
  val zro = row(0)                   // 43.45
  val one = row(1)                   // 19840
  val two = row(2)                   // A345
  val dates = row.getSeq[String](3)  // [2014-12-26, 2013-12-12]
  val numbers = row.getSeq[Int](4)   // [232323, 45466]
  Row(zro, one, two, dates(0), numbers(0))
}
You could use SparkSQL.
First you create a view with the information we need to process:
df.createOrReplaceTempView("tableTest")
Then you can select the data with the expansions:
sparkSession.sqlContext.sql(
"SELECT Amt, id, num, expanded_start_date, expanded_id " +
"FROM tableTest " +
"LATERAL VIEW explode(Start_date) Start_date AS expanded_start_date " +
"LATERAL VIEW explode(Identifier) AS expanded_id")
.show()
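One caveat with the two LATERAL VIEW explodes (an observation, not part of the original answer): they produce the cross product of Start_date and Identifier for every row rather than position-wise pairs, which is not the output the question asks for. If position-wise pairing is needed in SQL, posexplode can expose the positions so they can be matched; a sketch:

sparkSession.sqlContext.sql(
  "SELECT Amt, id, num, expanded_start_date, expanded_id " +
  "FROM tableTest " +
  "LATERAL VIEW posexplode(Start_date) sd AS date_pos, expanded_start_date " +
  "LATERAL VIEW posexplode(Identifier) idf AS id_pos, expanded_id " +
  "WHERE date_pos = id_pos")
  .show()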