I have two Spark DataFrames:
cities DataFrame with the following column:
city
-----
London
Austin
bigCities DataFrame with the following column:
name
------
London
Cairo
I need to transform the cities DataFrame and add an additional Boolean column bigCity. The value of this column must be calculated based on the condition "cities.city IN bigCities.name".
I can do this in the following way (with a static bigCities collection):
cities.createOrReplaceTempView("cities")
var resultDf = spark.sql("SELECT city, CASE WHEN city IN ('London', 'Cairo') THEN 'Y' ELSE 'N' END AS bigCity FROM cities")
but I don't know how to replace the static bigCities collection ('London', 'Cairo') with the bigCities DataFrame in the query. I want to use bigCities as the reference data in the query.
Please advise how to achieve this.
val df = cities.join(bigCities, $"name".equalTo($"city"), "leftouter")
  .withColumn("bigCity", when($"name".isNull, "N").otherwise("Y")) // no match in bigCities => 'N'
  .drop("name")
You can use collect_list() on the bigCities table. Check this out:
scala> val df_city = Seq(("London"),("Austin")).toDF("city")
df_city: org.apache.spark.sql.DataFrame = [city: string]
scala> val df_bigCities = Seq(("London"),("Cairo")).toDF("name")
df_bigCities: org.apache.spark.sql.DataFrame = [name: string]
scala> df_city.createOrReplaceTempView("cities")
scala> df_bigCities.createOrReplaceTempView("bigCities")
scala> spark.sql(" select city, case when array_contains((select collect_list(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city |bigCity|
+------+-------+
|London|Y |
|Austin|N |
+------+-------+
If the dataset is big, you can use collect_set which will be more efficient.
scala> spark.sql(" select city, case when array_contains((select collect_set(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city |bigCity|
+------+-------+
|London|Y |
|Austin|N |
+------+-------+
I have a simple LEFT JOIN:
spark.sql(
s"""
|SELECT
| a.*,
| b.company_id AS companyId
|FROM profile_views a
|LEFT JOIN companies_info b
| ON a.memberId = b.member_id
|""".stripMargin
).createOrReplaceTempView("company_views")
How do I replace this with the scala API?
Try the code below. It will work for temp views as well as Hive tables.
val profile_views = spark
.table("profile_views")
.as("a")
val companies_info = spark
.table("companies_info")
.select($"company_id".as("companyId"),$"member_id".as("memberId"))
.as("b")
profile_views
.join(companies_info,Seq("memberId"),"left")
.createOrReplaceTempView("company_views")
If you already have the data in DataFrames, you can use the code below.
profile_viewsDF.as("a")
.join(
companies_infoDF.select($"company_id".as("companyId"),$"member_id".as("memberId")).as("b"),
Seq("memberId"),
"left"
)
.createOrReplaceTempView("company_views")
Update: temp views can also be read using spark.table(). Please check the code below.
scala> val df = Seq(("srinivas",10)).toDF("name","age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> df.createTempView("person")
scala> spark.table("person").show
+--------+---+
| name|age|
+--------+---+
|srinivas| 10|
+--------+---+
The question is the same as this one but can this:
df.withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID")).
select("VEHICLE","ACCOUNTNO"). //only select reqired columns
groupBy("ACCOUNTNO").
agg(collect_list("VEHICLE").as("VEHICLE")). //for the same group create a list of vehicles
toJSON. //convert to json
show(false)
be rewritten with pure SQL? I mean something like that:
val sqlDF = spark.sql("SELECT VEHICLE, ACCOUNTNO as collect_list(ACCOUNTNO) FROM VEHICLES group by ACCOUNTNO)
sqlDF.show()
Is that possible?
The SQL equivalent of your dataframe example would be:
scala> val df = Seq((10003014,"MH43AJ411",20000000),
| (10003014,"MH43AJ411",20000001),
| (10003015,"MH12GZ3392",20000002)
| ).toDF("ACCOUNTNO","VEHICLENUMBER","CUSTOMERID").withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID"))
df: org.apache.spark.sql.DataFrame = [ACCOUNTNO: int, VEHICLENUMBER: string ... 2 more fields]
scala> df.registerTempTable("vehicles")
scala> val sqlDF = spark.sql("SELECT ACCOUNTNO, collect_list(VEHICLE) as ACCOUNT_LIST FROM VEHICLES group by ACCOUNTNO").toJSON
sqlDF: org.apache.spark.sql.Dataset[String] = [value: string]
scala> sqlDF.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"ACCOUNTNO":10003014,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}]}|
|{"ACCOUNTNO":10003015,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}]} |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
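Side note: on Spark 2.x, registerTempTable is deprecated in favour of createOrReplaceTempView, which behaves the same for this purpose:
scala> df.createOrReplaceTempView("vehicles")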
I have a table which has a column containing an array, like this:
Student_ID | Subject_List | New_Subject
1 | [Mat, Phy, Eng] | Chem
I want to append the new subject into the subject list and get the new list.
Creating the dataframe -
val df = sc.parallelize(Seq((1, Array("Mat", "Phy", "Eng"), "Chem"))).toDF("Student_ID","Subject_List","New_Subject")
I have tried this with UDF as follows -
def append_list = (arr: Seq[String], s: String) => {
arr :+ s
}
val append_list_UDF = udf(append_list)
val df_new = df.withColumn("New_List", append_list_UDF($"Subject_List",$"New_Subject"))
With UDF, I get the required output
Student_ID | Subject_List | New_Subject | New_List
1 | [Mat, Phy, Eng] | Chem | [Mat, Phy, Eng, Chem]
Can we do it without a UDF? Thanks.
In Spark 2.4 or later, a combination of array and concat should do the trick,
import org.apache.spark.sql.functions.{array, concat}
import org.apache.spark.sql.Column
def append(arr: Column, col: Column) = concat(arr, array(col))
df.withColumn("New_List", append($"Subject_List",$"New_Subject")).show
+----------+---------------+-----------+--------------------+
|Student_ID| Subject_List|New_Subject| New_List|
+----------+---------------+-----------+--------------------+
| 1|[Mat, Phy, Eng]| Chem|[Mat, Phy, Eng, C...|
+----------+---------------+-----------+--------------------+
but I wouldn't expect serious performance gains here.
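The same thing can be expressed as a SQL expression through expr if you prefer; a sketch, again assuming Spark 2.4+:
import org.apache.spark.sql.functions.expr

// concat on arrays (Spark 2.4+) appends a one-element array built from New_Subject
df.withColumn("New_List", expr("concat(Subject_List, array(New_Subject))")).show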
val df = Seq((1, Array("Mat", "Phy", "Eng"), "Chem"),
(2, Array("Hindi", "Bio", "Eng"), "IoT"),
(3, Array("Python", "R", "scala"), "C")).toDF("Student_ID","Subject_List","New_Subject")
df.show(false)
val final_df = df.withColumn("exploded", explode($"Subject_List"))
  .select($"Student_ID", $"exploded")
  .union(df.select($"Student_ID", $"New_Subject"))
  .groupBy($"Student_ID")
  .agg(collect_list($"exploded") as "Your_New_List")

final_df.show(false)
The one-liner below is a simple way to search for a string in a particular column using SQL LIKE functionality.
val dfx = df.filter($"name".like(s"%${productName}%"))
The question is: how do I grab every column NAME whose VALUES contain the particular string, and generate a new column with a list of those "column names" for every row?
So far this is the approach I took, but I am stuck as I can't use the Spark SQL "like" function inside a UDF.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._
import spark.implicits._
val df1 = Seq(
(0, "mango", "man", "dit"),
(1, "i-man", "man2", "mane"),
(2, "iman", "mango", "ho"),
(3, "dim", "kim", "sim")
).toDF("id", "col1", "col2", "col3")
val df2 = df1.columns.foldLeft(df1) {
(acc: DataFrame, colName: String) =>
acc.withColumn(colName, concat(lit(colName + "="), col(colName)))
}
val df3 = df2.withColumn("merged_cols", split(concat_ws("X", df2.columns.map(c=> col(c)):_*), "X"))
Here is a sample output. Note that here there are only 3 columns but in the real job I'll be reading multiple tables which can contain dynamic number of columns.
+---+------+------+----+----------------+
|id |col1  |col2  |col3|merged_cols     |
+---+------+------+----+----------------+
|0  |mango |man   |dit |col1, col2      |
|1  |i-man |man2  |mane|col1, col2, col3|
|2  |iman  |mango |ho  |col1, col2      |
|3  |dim   |kim   |sim |                |
+---+------+------+----+----------------+
This can be done using a foldLeft over the columns together with when and otherwise:
val e = "%man%"
val df2 = df1.columns.foldLeft(df1.withColumn("merged_cols", lit(""))){(df, c) =>
    df.withColumn("merged_cols", when(col(c).like(e), concat($"merged_cols", lit(s"$c,"))).otherwise($"merged_cols"))}
  .withColumn("merged_cols", expr("substring(merged_cols, 1, length(merged_cols)-1)"))
All columns that satisfy the condition e will be appended to the string in the merged_cols column. Note that the column must exist for the first append to work, so it is added (containing an empty string) to the dataframe before it is sent into the foldLeft.
The last line in the code simply removes the extra , that is added at the end. If you want the result as an array instead, simply adding .withColumn("merged_cols", split($"merged_cols", ",")) would work.
An alternative approach is to use a UDF instead. This can be preferable when dealing with many columns, since foldLeft will create multiple dataframe copies. Here a regex is used (not the SQL like, since that operates on whole columns).
val e = ".*man.*"
val concat_cols = udf((vals: Seq[String], names: Seq[String]) => {
vals.zip(names).filter{case (v, n) => v.matches(e)}.map(_._2)
})
val df2 = df1.withColumn("merged_cols", concat_cols(array(df1.columns.map(col(_)): _*), typedLit(df1.columns.toSeq)))
Note: typedLit can be used in Spark versions 2.2+; when using older versions, use array(df1.columns.map(lit(_)): _*) instead.
I am using Spark 1.6 and I would like to know how to implement a lookup in the dataframes.
I have two dataframes employee & department.
Employee Dataframe
-------------------
Emp Id | Emp Name
------------------
1 | john
2 | David
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1 | Admin | 1
2 | HR | 2
I would like to lookup emp id from the employee table to the department table and get the dept name. So, the resultset would be
Emp Id | Dept Name
-------------------
1 | Admin
2 | HR
How do I implement this lookup UDF feature in Spark? I don't want to use JOIN on both the dataframes.
As already mentioned in the comments, joining the dataframes is the way to go.
You can use a lookup, but I think there is no "distributed" solution, i.e. you have to collect the lookup-table into driver memory. Also note that this approach assumes that EmpID is unique:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import scala.collection.Map
val emp = Seq((1,"John"),(2,"David"))
val deps = Seq((1,"Admin",1),(2,"HR",2))
val empRdd = sc.parallelize(emp)
val depsDF = sc.parallelize(deps).toDF("DepID","Name","EmpID")
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap:Map[Int,String]) = udf((empID:Int) => lookupMap.get(empID))
val combinedDF = depsDF
.withColumn("empNames",lookup(lookupMap)($"EmpID"))
My initial thought was to pass the empRdd to the UDF and use the lookup method defined on PairRDD, but this of course does not work, because you cannot have Spark actions (i.e. lookup) within transformations (i.e. the UDF).
EDIT:
If your empDf has multiple columns (e.g. Name, Age), you can use this:
val empRdd = empDf.rdd.map{row =>
(row.getInt(0),(row.getString(1),row.getInt(2)))}
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap:Map[Int,(String,Int)]) =
udf((empID:Int) => lookupMap.lift(empID))
depsDF
.withColumn("lookup",lookup(lookupMap)($"EmpID"))
.withColumn("empName",$"lookup._1")
.withColumn("empAge",$"lookup._2")
.drop($"lookup")
.show()
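If the lookup map is on the larger side, a broadcast variable keeps a single read-only copy per executor instead of shipping the map inside every task closure; a minimal sketch using the single-column lookupMap from the first snippet (the names lookupBc and lookupBroadcast are just illustrative):
import org.apache.spark.sql.functions.udf

// broadcast the driver-side map once; each executor reads lookupBc.value locally
val lookupBc = sc.broadcast(lookupMap)
val lookupBroadcast = udf((empID: Int) => lookupBc.value.get(empID))

depsDF.withColumn("empNames", lookupBroadcast($"EmpID")).show()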
As you say you already have DataFrames, it's pretty easy; follow these steps:
1) Create a SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2) Create temporary tables for both DataFrames, e.g.:
EmployeeDataframe.createOrReplaceTempView("EmpTable") // on Spark 1.6 use registerTempTable instead
3) Query using Spark SQL
val MatchingDetails = sqlContext.sql("SELECT DISTINCT E.EmpID, DeptName FROM EmpTable E inner join DeptTable G on " +
"E.EmpID=g.EmpID")
Starting with some "lookup" data, there are two approaches:
Method #1 -- using a lookup DataFrame
// use a DataFrame (via a join)
val lookupDF = sc.parallelize(Seq(
("banana", "yellow"),
("apple", "red"),
("grape", "purple"),
("blueberry","blue")
)).toDF("SomeKeys","SomeValues")
Method #2 -- using a map in a UDF
// turn the above DataFrame into a map which a UDF uses
val Keys = lookupDF.select("SomeKeys").collect().map(_(0).toString).toList
val Values = lookupDF.select("SomeValues").collect().map(_(0).toString).toList
val KeyValueMap = Keys.zip(Values).toMap
def ThingToColor(key: String): String = {
if (key == null) return ""
val firstword = key.split(" ")(0) // fragile!
val result: String = KeyValueMap.getOrElse(firstword,"not found!")
return (result)
}
val ThingToColorUDF = udf( ThingToColor(_: String): String )
Take a sample data frame of things that will be looked up:
val thingsDF = sc.parallelize(Seq(
("blueberry muffin"),
("grape nuts"),
("apple pie"),
("rutabaga pudding")
)).toDF("SomeThings")
Method #1 is to join on the lookup DataFrame
Here, rlike is doing the matching, and null appears where there is no match. Both columns of the lookup DataFrame get added.
val result_1_DF = thingsDF.join(lookupDF, expr("SomeThings rlike SomeKeys"),
"left_outer")
Method #2 is to add a column using the UDF
Here, only one column is added, and the UDF can return a non-null value. However, if the lookup data is very large, it may fail to serialize as required to send it to the workers in the cluster.
val result_2_DF = thingsDF.withColumn("AddValues",ThingToColorUDF($"SomeThings"))
Which gives you an AddValues column next to SomeThings (blue, purple, red, and "not found!" for the four sample things).
In my case I had some lookup data that was over 1 million values, so Method #1 was my only choice.
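As a side note, when the lookup side does fit comfortably in memory, hinting Spark to broadcast it keeps the rlike join of Method #1 from shuffling the larger side; a sketch on top of Method #1:
import org.apache.spark.sql.functions.{broadcast, expr}

// same left outer join as Method #1, with a broadcast hint on the small lookup side
val result_1_broadcast_DF = thingsDF.join(broadcast(lookupDF),
  expr("SomeThings rlike SomeKeys"), "left_outer")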