Spark dataframe collecting specific results - Scala

I have two case classes as below:
case class EmployeeDetails(id:Long, empName:String, dept:String)
case class SalDetails(salary:Long, dept:String)
and created two dataframes out of them and computed the average salary for each department:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("Emp")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val empDetails = Seq(
  EmployeeDetails(1, "nachiket", "IT"),
  EmployeeDetails(2, "sanketa", "Admin"),
  EmployeeDetails(3, "kedar", "IT")).toDF()

val salaryDetails = Seq(
  SalDetails(120000, "IT"),
  SalDetails(35000, "Admin"),
  SalDetails(300000, "IT")).toDF()

val commonFields = salaryDetails.join(empDetails, "dept").orderBy("salary")
val sortedFields = commonFields.groupBy("dept").avg("salary")
sortedFields.show()
The output is similar to below; so far so good:
+-----+-----------+
| dept|avg(salary)|
+-----+-----------+
|Admin|    35000.0|
|   IT|   210000.0|
+-----+-----------+
As you can see, the average is calculated for 2 IT department employees and 1 Admin department employee. Along with the above output, I need to show another column, say "count", with the values 1 and 2 for the respective rows.

import org.apache.spark.sql.functions.{avg, countDistinct}

val sortedFields = commonFields.groupBy("dept").agg(avg("salary"), countDistinct("empName"))
will give you the desired result (note the column is empName, as defined in EmployeeDetails).
However, I see a problem with the above logic: since there are multiple entries for the 'IT' department in SalDetails, the join on dept multiplies rows and yields misleading results in commonFields. Not sure if that is intended. Rather, you could think of having an employee id alongside the salary in SalDetails and joining the two dataframes on employee ids.
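For illustration, here is a rough sketch of that suggestion, assuming a hypothetical SalDetailsById case class keyed by an employee id (the names below are made up, not from the question):

import org.apache.spark.sql.functions.{avg, countDistinct}

// Hypothetical salary records keyed by employee id instead of department.
case class SalDetailsById(empId: Long, salary: Long)

val salaryById = Seq(
  SalDetailsById(1, 120000),
  SalDetailsById(2, 35000),
  SalDetailsById(3, 300000)).toDF()

// One salary row per employee, so the join no longer multiplies rows.
val perEmployee = empDetails.join(salaryById, empDetails("id") === salaryById("empId"))

perEmployee.groupBy("dept")
  .agg(avg("salary"), countDistinct("empName"))
  .show()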

Related

How to find the unique product among the stores using Spark?

I am new to Apache Spark. I want to find the unique product among the stores using Scala Spark.
The data in the file is like below, where the 1st column in each row represents the store name:
Sears,shoe,ring,pan,shirt,pen
Walmart,ring,pan,hat,meat,watch
Target,shoe,pan,shirt,hat,watch
I want the output to be:
Only Walmart has Meat.
Only Sears has Pen.
I tried the below in Scala Spark; I am able to get the unique products but don't know how to get the store names of those products. Please help.
val filerdd = sc.textFile("file:///home/hduser/stores_products")
val uniquerdd = filerdd.map(x => x.split(","))
  .map(x => Array(x(1), x(2), x(3), x(4), x(5)))
  .flatMap(x => x)
  .map(x => (x, 1))
  .reduceByKey((a, b) => a + b)
  .filter(x => x._2 == 1)
uniquerdd holds - Array((pen,1),(meat,1))
Now I want to find in which rows of filerdd these products are present and display the output as below:
Only Walmart has Meat.
Only Sears has Pen.
Can you please help me to get the desired output?
The DataFrame API is probably easier than the RDD API for this. You can explode the list of products and keep only those with count = 1.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.csv("filepath")

val result = df.select(
    $"_c0".as("store"),
    explode(array(df.columns.tail.map(col): _*)).as("product")
  ).withColumn(
    "count",
    count("*").over(Window.partitionBy("product"))
  ).filter(
    "count = 1"
  ).select(
    format_string("Only %s has %s.", $"store", $"product").as("output")
  )
result.show(false)
+----------------------+
|output                |
+----------------------+
|Only Walmart has meat.|
|Only Sears has pen.   |
+----------------------+
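If you would rather stay with the RDD code from the question, here is a sketch along the same lines (continuing from the filerdd and uniquerdd defined above; output order may vary):

// Pair every product with its store, then keep only the products that
// uniquerdd flagged as appearing exactly once across all stores.
val productToStore = filerdd
  .map(_.split(","))
  .flatMap(cols => cols.tail.map(product => (product, cols.head)))  // (product, store)

val uniqueWithStore = uniquerdd            // (product, 1) for unique products
  .join(productToStore)                    // (product, (1, store))
  .map { case (product, (_, store)) => s"Only $store has ${product.capitalize}." }

uniqueWithStore.collect().foreach(println)
// Only Walmart has Meat.
// Only Sears has Pen.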

Compare dataframes in Scala and write the mismatching old and new columns to a new dataframe

I have two df's
df1
ID |BTH_DT |CDC_FLAG|CDC_TS |CNSM_ID
123|1986-10-07|I |2018-10-10 05:51:24.000000941|301634310
124|1973-02-15|I |2018-10-10 17:12:22.000000254|298910234
df2
ID |BTH_DT |CDC_FLAG|CDC_TS |CNSM_ID
123|1986-10-07|I |2018-10-10 05:51:24.000000941|\c
124|1973-02-15|I |2018-10-10 17:12:22.000000254|298910234
How do I compare the two df's and write only the mismatching columns to a different df?
ID |CNSM_ID
123|301634310
123| \\c
df2.except(df1)
The above isn't serving the purpose.
How about:
val diff1 = df1.except(df2)
val diff2 = df2.except(df1)
val join = diff1.unionAll(diff2)
Then join.select("ID", "CNSM_ID").show() gives the mismatching rows. (On Spark 2.x, use union instead of the deprecated unionAll.)

How to combine several dataframes together in Scala?

I have several dataframes which contain a single column each. Let's say I have 4 such dataframes, all with one column. How can I form a single dataframe by combining all of them?
val df = xmldf.select(col("UserData.UserValue._valueRef"))
val df2 = xmldf.select(col("UserData.UserValue._title"))
val df3 = xmldf.select(col("author"))
val df4 = xmldf.select(col("price"))
To combine, I am trying this, but it doesn't work:
var newdf = df
newdf = newdf.withColumn("col1",df1.col("UserData.UserValue._title"))
newdf.show()
It errors out saying that the field of one dataframe is not present in another. I am not sure how I can combine these 4 dataframes together; they don't have any common column.
df2 looks like this:
+---------------+
| _title|
+---------------+
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
|_CONFIG_CONTEXT|
+---------------+
and df looks like this:
+-----------+
|_valuegiven|
+-----------+
| qwe|
| dfdfrt|
| dfdf|
+-----------+
df3 and df4 are also in the same format. I want a dataframe like the one below:
+-----------+---------------+
|_valuegiven| _title|
+-----------+---------------+
| qwe|_CONFIG_CONTEXT|
| dfdfrt|_CONFIG_CONTEXT|
| dfdf|_CONFIG_CONTEXT|
+-----------+---------------+
I used this:
val newdf = xmldf.select(col("UserData.UserValue._valuegiven"),col("UserData.UserValue._title") )
newdf.show()
But I am getting the column names on the go, so I would need to append them on the go as well; I don't know exactly how many columns I will get, which is why I cannot use the above command.
Your goal is a little unclear. You may be asking how to join these dataframes, but perhaps you just want to select those 4 columns:
val newdf = xmldf.select($"UserData.UserValue._valueRef", $"UserData.UserValue._title", $"author", $"price")
newdf.show
If you really want to join all these dataframes, you'll need to join them all and select the appropriate fields.
If the goal is to get 4 columns from xmldf into a new dataframe you shouldn't be splitting it into 4 dataframes in the first place.
You can select multiple columns from a dataframe by providing additional column names in the select function.
val newdf = xmldf.select(
  col("UserData.UserValue._valueRef"),
  col("UserData.UserValue._title"),
  col("author"),
  col("price"))
newdf.show()
So I looked at various ways, and finally Ram Ghadiyaram's answer in Solution 2 does what I wanted to do. Using this approach, you can combine any number of columns on the go. Basically, you create an index column on each dataframe, join the dataframes on that index, and drop the index column after joining.
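For reference, here is a rough sketch of that index-and-join idea (the withRowIndex helper and the row_idx column are illustrative names, not from the linked answer), assuming the four single-column dataframes df, df2, df3 and df4 from the question:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Add a positional row index to a dataframe so frames can be joined row-by-row.
def withRowIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
  df.sqlContext.createDataFrame(indexed, schema)
}

// Join all the single-column dataframes on the generated index, then drop it.
val combined = Seq(df, df2, df3, df4)
  .map(withRowIndex)
  .reduce((left, right) => left.join(right, "row_idx"))
  .drop("row_idx")

combined.show()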

Lookup in Spark dataframes

I am using Spark 1.6 and I would like to know how to implement a lookup in dataframes.
I have two dataframes, employee & department.
Employee Dataframe
-------------------
Emp Id | Emp Name
------------------
1 | john
2 | David
Department Dataframe
--------------------
Dept Id | Dept Name | Emp Id
-----------------------------
1 | Admin | 1
2 | HR | 2
I would like to look up the emp id from the employee table in the department table and get the dept name. So, the result set would be:
Emp Id | Dept Name
-------------------
1 | Admin
2 | HR
How do I implement this lookup UDF feature in Spark? I don't want to use JOIN on both the dataframes.
As already mentioned in the comments, joining the dataframes is the way to go.
You can use a lookup, but I think there is no "distributed" solution, i.e. you have to collect the lookup-table into driver memory. Also note that this approach assumes that EmpID is unique:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import scala.collection.Map

val emp = Seq((1, "John"), (2, "David"))
val deps = Seq((1, "Admin", 1), (2, "HR", 2))

val empRdd = sc.parallelize(emp)
val depsDF = sc.parallelize(deps).toDF("DepID", "Name", "EmpID")

val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap: Map[Int, String]) = udf((empID: Int) => lookupMap.get(empID))

val combinedDF = depsDF
  .withColumn("empNames", lookup(lookupMap)($"EmpID"))
My initial thought was to pass the empRdd to the UDF and use the lookup method defined on PairRDD, but this of course does not work because you cannot have Spark actions (i.e. lookup) within transformations (i.e. the UDF).
EDIT:
If your empDf has multiple columns (e.g. Name,Age), you can use this
val empRdd = empDf.rdd.map { row =>
  (row.getInt(0), (row.getString(1), row.getInt(2)))
}
val lookupMap = empRdd.collectAsMap()
def lookup(lookupMap: Map[Int, (String, Int)]) =
  udf((empID: Int) => lookupMap.lift(empID))

depsDF
  .withColumn("lookup", lookup(lookupMap)($"EmpID"))
  .withColumn("empName", $"lookup._1")
  .withColumn("empAge", $"lookup._2")
  .drop($"lookup")
  .show()
As you say you already have dataframes, it's pretty easy; follow these steps:
1) Create a SQLContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2) Create temporary views for both dataframes (on Spark 1.6 use registerTempTable), e.g.:
EmployeeDataframe.createOrReplaceTempView("EmpTable")
3) Query using SQL:
val MatchingDetails = sqlContext.sql("SELECT DISTINCT E.EmpID, DeptName FROM EmpTable E INNER JOIN DeptTable D ON " +
  "E.EmpID = D.EmpID")
Starting with some "lookup" data, there are two approaches:
Method #1 -- using a lookup DataFrame
// use a DataFrame (via a join)
val lookupDF = sc.parallelize(Seq(
("banana", "yellow"),
("apple", "red"),
("grape", "purple"),
("blueberry","blue")
)).toDF("SomeKeys","SomeValues")
Method #2 -- using a map in a UDF
// turn the above DataFrame into a map which a UDF uses
val Keys = lookupDF.select("SomeKeys").collect().map(_(0).toString).toList
val Values = lookupDF.select("SomeValues").collect().map(_(0).toString).toList
val KeyValueMap = Keys.zip(Values).toMap
def ThingToColor(key: String): String = {
if (key == null) return ""
val firstword = key.split(" ")(0) // fragile!
val result: String = KeyValueMap.getOrElse(firstword,"not found!")
return (result)
}
val ThingToColorUDF = udf( ThingToColor(_: String): String )
Take a sample data frame of things that will be looked up:
val thingsDF = sc.parallelize(Seq(
("blueberry muffin"),
("grape nuts"),
("apple pie"),
("rutabaga pudding")
)).toDF("SomeThings")
Method #1 is to join on the lookup DataFrame
Here, rlike does the matching, and null appears where there is no match. Both columns of the lookup DataFrame get added.
val result_1_DF = thingsDF.join(lookupDF, expr("SomeThings rlike SomeKeys"),
"left_outer")
Method #2 is to add a column using the UDF
Here, only one column is added, and the UDF can return a non-null value. However, if the lookup data is very large, it may fail to serialize as required to send it to the workers in the cluster.
val result_2_DF = thingsDF.withColumn("AddValues",ThingToColorUDF($"SomeThings"))
Which gives you a new AddValues column holding the matched color (or "not found!") for each row.
In my case I had some lookup data that was over 1 million values, so Method #1 was my only choice.
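If the lookup DataFrame still fits comfortably in executor memory, a broadcast hint can keep Method #1 from shuffling the large side (a sketch of that variation, not part of the original comparison):

import org.apache.spark.sql.functions.{broadcast, expr}

// Same rlike join as Method #1, but hinting Spark to ship the small lookup
// DataFrame to every executor instead of shuffling both sides.
val result_1_broadcast_DF = thingsDF.join(
  broadcast(lookupDF),
  expr("SomeThings rlike SomeKeys"),
  "left_outer")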

Filter dataframe by value NOT present in column of other dataframe [duplicate]

This question already has answers here:
Filter Spark DataFrame based on another DataFrame that specifies denylist criteria
(2 answers)
Closed 6 years ago.
Banging my head a little with this one, and I suspect the answer is very simple. Given two dataframes, I want to filter the first where values in one column are not present in a column of another dataframe.
I would like to do this without resorting to full-blown Spark SQL, so just using DataFrame.filter, or Column.contains or the "isin" keyword, or one of the join methods.
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
val res = df1.filter(df2("cities").contains("city") === false)
// doesn't work, nor do the 20 other variants I have tried
Anyone got any ideas?
I've discovered that I can solve this using a simpler method: an anti join is available via the joinType parameter of the join method, though the Spark Scaladoc does not describe it:
import org.apache.spark.sql.functions._
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
df1.join(df2, df1("city") === df2("cities"), "leftanti").show
Results in:
+----------+-------+
| location| city|
+----------+-------+
|Chittagong|Chennai|
+----------+-------+
P.S. Thanks for the pointer to the duplicate; duly marked as such.
If you are trying to filter a DataFrame using another, you should use join (or any of its variants). If what you need is to filter it using a List or any other data structure that fits in your driver and workers, you could broadcast it and then reference it inside the filter or where method.
For instance I would do something like:
import org.apache.spark.sql.functions._
val df1 = Seq(("Hampstead", "London"),
("Spui", "Amsterdam"),
("Chittagong", "Chennai")).toDF("location", "city")
val df2 = Seq(("London"),("Amsterdam"), ("New York")).toDF("cities")
df2.join(df1, joinExprs = df1("city") === df2("cities"), joinType = "full_outer")
  .select("city", "cities")
  .where(isnull($"cities"))
  .drop("cities")
  .show()
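For completeness, here is a sketch of the collect-and-filter alternative mentioned above, assuming the exclusion list is small enough to bring to the driver (for a larger list, the broadcast-plus-UDF route would apply):

// Collect the small exclusion list to the driver and filter with isin.
val citiesToExclude = df2.select("cities").collect().map(_.getString(0))

val res = df1.filter(!$"city".isin(citiesToExclude: _*))
res.show()
// +----------+-------+
// |  location|   city|
// +----------+-------+
// |Chittagong|Chennai|
// +----------+-------+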