Spark dataframe explode pair of lists - scala

My dataframe has 2 columns which look like this:
col_id     | col_name
-----------|---------------
[id1, id2] | [name1, name2]
[id3, id4] | [name3, name4]
...        | ...
So for each row there are two matching arrays of the same length in columns col_id and col_name. What I want is to get each id/name pair as a separate row, like:
col_id | col_name
-------|---------
id1    | name1
id2    | name2
...    | ...
explode seems like the function to use but I can't seem to get it to work. What I tried is:
rdd.explode(col("col_id"), col("col_name")) ({
  case row: Row =>
    val ids: java.util.List[String] = row.getList(0)
    val names: java.util.List[String] = row.getList(1)
    var res: Array[(String, String)] = new Array[(String, String)](ids.size)
    for (i <- 0 until ids.size) {
      res :+ (ids.get(i), names.get(i))
    }
    res
})
This however returns only nulls so it might just be my poor knowledge of Scala. Can anyone point out the issue?

Looks like the last 10 minutes out of the past 1-2 hours did the trick lol. The problem in my attempt above was that res :+ (...) builds a new array and discards it instead of updating res, so the returned array stayed full of nulls. This works just fine:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import scala.collection.JavaConverters._

df.explode(col("col_id"), col("col_name")) ({
  case row: Row =>
    val ids: List[String] = row.getList[String](0).asScala.toList
    val names: List[String] = row.getList[String](1).asScala.toList
    ids zip names
})
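For newer Spark versions (2.4+), a hedged alternative sketch that avoids the since-deprecated Dataset.explode is to zip the two array columns element-wise and explode the result; the column names follow the example above:

import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// Zip the arrays into an array of structs, explode it into one row per pair,
// then pull the struct fields back out as top-level columns.
val exploded = df
  .withColumn("pair", explode(arrays_zip(col("col_id"), col("col_name"))))
  .select(col("pair.col_id").as("col_id"), col("pair.col_name").as("col_name"))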

Related

How to convert spark dataframe array to tuple

How can I convert a Spark dataframe array to tuples of length 2 in Scala?
I tried to explode the array and create a new column with the help of the lead function, so that I could use the two columns to build the tuples.
In order to use the lead function I need a column to sort by, and I don't have one.
What is the best way to solve this?
Note: I need to retain the same order in the array.
For example, the input looks something like this:
id1 | [text1, text2, text3, text4]
id2 | [txt, txt2, txt4, txt5, txt6, txt7, txt8, txt9]
Expected output (tuples of length 2):
id1 | [(text1, text2), (text2, text3), (text3,text4)]
id2 | [(txt, txt2), (txt2, txt4), (txt4, txt5), (txt5, txt6), (txt6, txt7), (txt7, txt8), (txt8, txt9)]
You can create a UDF that builds the list of tuples using the sliding function:
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq(
  ("id1", List("text1", "text2", "text3", "text4")),
  ("id2", List("txt", "txt2", "txt4", "txt5", "txt6", "txt7", "txt8", "txt9"))
).toDF("id", "text")

// sliding(2) walks the list in order, pairing each element with the next one.
val sliding = udf((value: Seq[String]) => {
  value.toList.sliding(2).map { case List(a, b) => (a, b) }.toList
})

val result = df.withColumn("text", sliding($"text"))
Output:
+---+-------------------------------------------------------------------------------------------------+
|id |text |
+---+-------------------------------------------------------------------------------------------------+
|id1|[[text1, text2], [text2, text3], [text3, text4]] |
|id2|[[txt, txt2], [txt2, txt4], [txt4, txt5], [txt5, txt6], [txt6, txt7], [txt7, txt8], [txt8, txt9]]|
+---+-------------------------------------------------------------------------------------------------+
Hope this helps!
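If you would rather avoid the UDF, a hedged alternative for Spark 2.4+ is to zip the array with itself shifted by one element using slice and arrays_zip (note this produces structs rather than Scala tuples):

import org.apache.spark.sql.functions.{arrays_zip, expr}

// Pair element i with element i+1 using two overlapping slices of the array.
val resultNoUdf = df.withColumn(
  "text",
  arrays_zip(
    expr("slice(text, 1, size(text) - 1)"),
    expr("slice(text, 2, size(text) - 1)")
  )
)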

Spark generate a list of column names that contains(SQL LIKE) a string

Below is the simple syntax for searching for a string in a particular column using SQL LIKE functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The question is: how do I grab each and every column NAME whose VALUES contain the particular string, and generate a new column with a list of those "column names" for every row?
So far this is the approach I took, but I am stuck because I can't use the Spark SQL "like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
  (0, "mango", "man", "dit"),
  (1, "i-man", "man2", "mane"),
  (2, "iman", "mango", "ho"),
  (3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")

val df2 = df1.columns.foldLeft(df1) { (acc: DataFrame, colName: String) =>
  acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}

val df3 = df2.withColumn("merged_cols",
  split(concat_ws("X", df2.columns.map(c => col(c)): _*), "X"))
Here is a sample of the expected output. Note that there are only 3 data columns here, but in the real job I'll be reading multiple tables that can contain a dynamic number of columns.
+---+------+------+------+------------------+
|id | col1 | col2 | col3 | merged_cols      |
+---+------+------+------+------------------+
|0  | mango| man  | dit  | col1, col2       |
|1  | i-man| man2 | mane | col1, col2, col3 |
|2  | iman | mango| ho   | col1, col2       |
|3  | dim  | kim  | sim  |                  |
+---+------+------+------+------------------+
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df.withColumn("merged_cols", lit(""))){(df, c) =>
df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
.withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfy the condition e will have their names appended to the string in the merged_cols column. Note that the column must exist for the first append to work, so it is added (containing an empty string) to the dataframe before it is fed into the foldLeft.
The last line in the code simply removes the extra , that is left at the end. If you want the result as an array instead, simply add .withColumn("merged_cols", split($"merged_cols", ",")) at the end.
An alternative approach is to use a UDF instead. This can be preferable when dealing with many columns, since foldLeft will create multiple dataframe copies. Here a regex is used (not SQL like, since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df.withColumn("merged_cols", concat_cols(array(df.columns.map(col(_)): _*), typedLit(df.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+, when using older versions use array(df.columns.map(lit(_)): _*) instead.
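If the search term arrives as a SQL LIKE pattern rather than as a regex, one hedged sketch (this helper is mine, not part of the answer) is to translate it before matching inside the UDF:

// Translate a simple SQL LIKE pattern (% = any sequence, _ = any single char)
// into an equivalent regex for use with String.matches.
def likeToRegex(pattern: String): String =
  pattern.flatMap {
    case '%' => ".*"
    case '_' => "."
    case c   => java.util.regex.Pattern.quote(c.toString)
  }

val e2 = likeToRegex("%man%") // yields ".*man.*", usable as the regex e above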

Spark - Select sum and all columns of joined dataset

I have 2 tables Employees (Id, Name), EmployeeSalary (EmployeeId, Designation, Salary). One employee can hold multiple designations in the company and have multiple salaries. How do I get EmployeeId, Name, Sum of salaries, Seq of all designations.
What I tried so far is
employeeDS.join(employeeSalaryDS, employeeDS.col("Id")
.equalTo(employeeSalaryDS.col("EmployeeId")),"left_outer")
.groupBy(employeeDS.col("Id")).agg(sum("Salary") as "Sum of salaries")
Something like this
scala> val dfe = Seq((101,"John"),(102,"Mike"), (103,"Paul"), (104,"Tom")).toDF("id","name")
dfe: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> val dfes = Seq((101,"Dev", 4000),(102,"Designer", 4000),(102,"Architect", 5000), (103,"Designer",6000), (104,"Consultant",8000), (104,"Supervisor",9000), (104,"PM",10000) ).toDF("id","desig","salary")
dfes: org.apache.spark.sql.DataFrame = [id: int, desig: string ... 1 more field]
scala> dfe.join(dfes, dfe.col("id").equalTo(dfes.col("id")),"left_outer").groupBy(dfe.col("Id")).agg(sum("Salary") as "Sum of salaries", collect_list('desig as "desig_list")).show(false)
+---+---------------+-----------------------------------+
|Id |Sum of salaries|collect_list(desig AS `desig_list`)|
+---+---------------+-----------------------------------+
|101|4000 |[Dev] |
|103|6000 |[Designer] |
|102|9000 |[Architect, Designer] |
|104|27000 |[PM, Supervisor, Consultant] |
+---+---------------+-----------------------------------+
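If the employee name should also appear in the result (the question asks for Id, Name, the sum of salaries and the designations), a hedged variant is to group on both columns:

import org.apache.spark.sql.functions.{collect_list, sum}

// Carry the employee name through the aggregation by grouping on it as well.
val summary = dfe
  .join(dfes, dfe("id") === dfes("id"), "left_outer")
  .groupBy(dfe("id"), dfe("name"))
  .agg(
    sum("salary").as("sum_of_salaries"),
    collect_list("desig").as("designations")
  )
summary.show(false)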

Lookup in Spark dataframes

I am using Spark 1.6 and I would like to know how to implement a lookup in dataframes.
I have two dataframes employee & department.
Employee Dataframe
-------------------
Emp Id | Emp Name
------------------
1 | john
2 | David
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1 | Admin | 1
2 | HR | 2
I would like to look up the emp id from the employee table in the department table and get the dept name. So, the result set would be:
Emp Id | Dept Name
-------------------
1 | Admin
2 | HR
How do I implement this lookup-UDF feature in Spark? I don't want to use JOIN on both dataframes.
As already mentioned in the comments, joining the dataframes is the way to go.
You can use a lookup, but I think there is no "distributed" solution, i.e. you have to collect the lookup-table into driver memory. Also note that this approach assumes that EmpID is unique:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import scala.collection.Map
val emp = Seq((1,"John"),(2,"David"))
val deps = Seq((1,"Admin",1),(2,"HR",2))
val empRdd = sc.parallelize(emp)
val depsDF = sc.parallelize(deps).toDF("DepID","Name","EmpID")
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap:Map[Int,String]) = udf((empID:Int) => lookupMap.get(empID))
val combinedDF = depsDF
.withColumn("empNames",lookup(lookupMap)($"EmpID"))
My initial thought was to pass the empRdd to the UDF and use the lookup method defined on PairRDD, but this of course does not work because you cannot have Spark actions (i.e. lookup) within transformations (i.e. the UDF).
EDIT:
If your empDf has multiple columns (e.g. Name,Age), you can use this
val empRdd = empDf.rdd.map { row =>
  (row.getInt(0), (row.getString(1), row.getInt(2)))
}
val lookupMap = empRdd.collectAsMap()

def lookup(lookupMap: Map[Int, (String, Int)]) =
  udf((empID: Int) => lookupMap.lift(empID))

depsDF
  .withColumn("lookup", lookup(lookupMap)($"EmpID"))
  .withColumn("empName", $"lookup._1")
  .withColumn("empAge", $"lookup._2")
  .drop($"lookup")
  .show()
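A related refinement (a sketch of mine, not part of the answer above) is to wrap the collected map in a broadcast variable, so it is shipped to each executor only once rather than being captured in the UDF's serialized closure:

// Broadcast the driver-side lookup map and use it inside the UDF.
val lookupBc = sc.broadcast(empRdd.collectAsMap())
val lookupUdf = udf((empID: Int) => lookupBc.value.get(empID))

val combinedBroadcastDF = depsDF.withColumn("empNames", lookupUdf($"EmpID"))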
Since you say you already have dataframes, it's pretty easy; just follow these steps:
1) Create a SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2) Create temporary tables (views) for both dataframes, e.g.:
EmployeeDataframe.createOrReplaceTempView("EmpTable")
3) Query them using Spark SQL:
val MatchingDetails = sqlContext.sql("SELECT DISTINCT E.EmpID, DeptName FROM EmpTable E inner join DeptTable G on " +
"E.EmpID=g.EmpID")
Starting with some "lookup" data, there are two approaches:
Method #1 -- using a lookup DataFrame
// use a DataFrame (via a join)
val lookupDF = sc.parallelize(Seq(
  ("banana", "yellow"),
  ("apple", "red"),
  ("grape", "purple"),
  ("blueberry", "blue")
)).toDF("SomeKeys", "SomeValues")
Method #2 -- using a map in a UDF
// turn the above DataFrame into a map which a UDF uses
val Keys = lookupDF.select("SomeKeys").collect().map(_(0).toString).toList
val Values = lookupDF.select("SomeValues").collect().map(_(0).toString).toList
val KeyValueMap = Keys.zip(Values).toMap

def ThingToColor(key: String): String = {
  if (key == null) return ""
  val firstword = key.split(" ")(0) // fragile!
  val result: String = KeyValueMap.getOrElse(firstword, "not found!")
  return result
}

val ThingToColorUDF = udf( ThingToColor(_: String): String )
Take a sample data frame of things that will be looked up:
val thingsDF = sc.parallelize(Seq(
  ("blueberry muffin"),
  ("grape nuts"),
  ("apple pie"),
  ("rutabaga pudding")
)).toDF("SomeThings")
Method #1 is to join on the lookup DataFrame
Here, the rlike does the matching, and null appears where no key matches. Both columns of the lookup DataFrame get added.
val result_1_DF = thingsDF.join(lookupDF, expr("SomeThings rlike SomeKeys"), "left_outer")
Method #2 is to add a column using the UDF
Here, only one column is added, and the UDF can return a non-null value. However, if the lookup data is very large it may fail to serialize, as it must be sent to the workers in the cluster.
val result_2_DF = thingsDF.withColumn("AddValues",ThingToColorUDF($"SomeThings"))
Both approaches give you each thing paired with its colour: the join leaves null where nothing matches, while the UDF returns "not found!".
In my case I had some lookup data that was over 1 million values, so Method #1 was my only choice.

Add new column with its data to existing DataFrame

In Scala I have a List[String] that I want to add as a new column to an existing DataFrame.
Original DF:
Name | Date
======|===========
Rohan | 2007-12-21
... | ...
... | ...
Suppose I want to add a new column, Department.
Expected DF:
Name | Date | Department
=====|============|============
Rohan| 2007-12-21 | Comp
... | ... | ...
... | ... | ...
How can I do this in Scala?
One way to do it is to create a dataframe from the names and the list values, and then join the two dataframes on the name column.
This solved my issue
val newrows = dataset.rdd.zipWithIndex.map(_.swap)
  .join(spark.sparkContext.parallelize(results).zipWithIndex.map(_.swap))
  .values
  .map { case (row: Row, x: String) => Row.fromSeq(row.toSeq :+ x) }
I would still like an exact explanation of how it works.
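For what it's worth, here is a hedged sketch of what that snippet does and how to turn the resulting RDD[Row] back into a DataFrame: zipWithIndex tags every row of dataset and every element of results with its position, the join pairs them up by that index, and Row.fromSeq appends the value to each row. The new column name and nullability below are assumptions.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// newrows is an RDD[Row]; rebuild a DataFrame with the original schema
// plus the appended column (assumed here to hold the Department strings).
val newSchema = StructType(
  dataset.schema.fields :+ StructField("Department", StringType, nullable = true)
)
val withDepartment = spark.createDataFrame(newrows, newSchema)
withDepartment.show()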