Spark SQL DataFrame join on one field - Scala

I am very new to Spark and have the following question.
I have 2 tables: Business and Inspections.
The Business table has the fields Business_id, name, and address.
The Inspections table has the field score.
I want to calculate the top 10 scores.
So I need to join on the Business_id field. I tried 2 ways, but neither of them works:
1) Using sqlContext.sql (I wrote a SQL query):
sqlContext.sql("""select CBusinesses.BUSINESS_ID, CBusinesses.name, CBusinesses.address, CBusinesses.city, CBusinesses.postal_code, CBusinesses.latitude, CBusinesses.longitude, Inspections_notnull.score from CBusinesses, Inspections_notnull where CBusinesses.BUSINESS_ID = Inspections_notnull.BUSINESS_ID and Inspections_notnull.score <> 0 order by Inspections_notnull.score""").show()
2) Using the DataFrame API:
val df = businessesDF.join(raw_inspectionsDF, businessesDF.col("BUSINESS_ID") == raw_inspectionsDF.col("BUSINESS_ID"))
How should I write it?
Thanks!

val df = businessesDF.join(raw_inspectionsDF, businessesDF("BUSINESS_ID") === raw_inspectionsDF("BUSINESS_ID"))
This should work; please take a look here for more details: https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html
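Since the end goal is the top 10 scores, one way to finish the job is to order the joined result by score descending and keep the first 10 rows. A minimal sketch, assuming the two DataFrames are built as shown further down and that score is numeric:
val joined = businessesDF.join(raw_inspectionsDF,
  businessesDF("BUSINESS_ID") === raw_inspectionsDF("BUSINESS_ID"))
// drop the zero/placeholder scores, sort highest first, keep the 10 best
val top10 = joined
  .filter(raw_inspectionsDF("score") =!= 0)
  .orderBy(raw_inspectionsDF("score").desc)
  .limit(10)
top10.select(businessesDF("BUSINESS_ID"), businessesDF("name"), raw_inspectionsDF("score")).show()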

Sure... I created a case class for each dataset, split each line by tab, and then converted the RDD to a DataFrame.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.util.{Try, Success, Failure}
def parseScore(s: String): Option[Int] = {
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(_) => None
  }
}
case class CInspections (business_id:Int, score:Option[Int], date:String, type1:String)
val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile(s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map(raw_inspections => CInspections(raw_inspections(0).toInt, parseScore(raw_inspections(1)), raw_inspections(2), raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView("Inspections")
raw_inspectionsDF.printSchema
//raw_inspectionsDF.show()
val raw_inspectionsDF_replacenull = raw_inspectionsDF.na.fill(0) // Replacing null values with '0'
raw_inspectionsDF_replacenull.show()
raw_inspectionsDF_replacenull.createOrReplaceTempView ("Inspections_notnull")
For Business -->
case class CBusinesses (business_id:Int, name: String, address:String, city:String, postal_code:Int, latitude:String, longitude:String, phone_number:String, tax_code:String, business_certificate:String, application_date:String,owner_name:String, owner_address:String, owner_city:String, owner_state:String, owner_zip:String )
val businesses = sc.textFile(s"$baseDir/businesses_plus.txt")
val businessesmap = businesses.map(line => line.split("\t"))
val businessesRDD = businessesmap.map(businesses => CBusinesses(businesses(0).toInt, businesses(1), businesses(2), businesses(3), businesses(4).toInt,
  businesses(5), businesses(6), businesses(7), businesses(8), businesses(9), businesses(10), businesses(11), businesses(12), businesses(13), businesses(14), businesses(15)))
val businessesDF = businessesRDD.toDF
businessesDF.createOrReplaceTempView("CBusinesses")
businessesDF.printSchema
//businessesDF.show()
It shows the proper result for both DataFrames.
For Inspection -->
+-----------+-----+--------+--------------------+
|business_id|score| date| type1|
+-----------+-----+--------+--------------------+
| 10| 0|20140807|Reinspection/Foll...|
| 10| 94|20140729|Routine - Unsched...|
| 10| 0|20140124|Reinspection/Foll...|
| 10| 92|20140114|Routine - Unsched...|
For Business -->
+-----------+--------------------+--------------------+-------------+-----------+---------+-----------+------------+--------+--------------------+----------------+--------------------+--------------------+-----------------+-------------+---------+
|business_id| name| address| city|postal_code| latitude| longitude|phone_number|tax_code|business_certificate|application_date| owner_name| owner_address| owner_city| owner_state|owner_zip|
+-----------+--------------------+--------------------+-------------+-----------+---------+-----------+------------+--------+--------------------+----------------+--------------------+--------------------+-----------------+-------------+---------+
| 10| Tiramisu Kitchen| 033 Belden Pl|San Francisco| 94104|37.791116|-122.403816| | H24| 779059| | Tiramisu LLC| 33 Belden St| San Francisco| CA| 94104|
| 17|GEORGE'S COFFEE SHOP| 2200 OAKDALE Ave | S.F.| 94124|37.741086|-122.401737| 14155531470| H24| 78443| 4/5/75|"LIEUW, VICTOR & ...| 648 MACARTHUR DRIVE| DALY CITY| CA| 94015|

Related

Functional way of writing a huge when/rlike statement

I'm using regexes to identify the file type based on the extension in a DataFrame.
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when} // for col and when below
import spark.implicits._                          // for List(...).toDF
val ignoreCase :String = "(?i)"
val ignoreExtension :String = "(?:\\.[_\\d]+)*(?:|\\.bck|\\.old|\\.orig|\\.bz2|\\.gz|\\.7z|\\.z|\\.zip)*(?:\\.[_\\d]+)*$"
val pictureFileName :String = "image"
val pictureFileType :String = ignoreCase + "^.+(?:\\.gif|\\.ico|\\.jpeg|\\.jpg|\\.png|\\.svg|\\.tga|\\.tif|\\.tiff|\\.xmp)" + ignoreExtension
val videoFileName :String = "video"
val videoFileType :String = ignoreCase + "^.+(?:\\.mod|\\.mp4|\\.mkv|\\.avi|\\.mpg|\\.mpeg|\\.flv)" + ignoreExtension
val otherFileName :String = "other"
def pathToExtension(cl: Column): Column = {
  when(cl.rlike(pictureFileType), pictureFileName).
    when(cl.rlike(videoFileType), videoFileName).
    otherwise(otherFileName)
}
val df = List("file.jpg", "file.avi", "file.jpg", "file3.tIf", "file5.AVI.zip", "file4.mp4","afile" ).toDF("filename")
val df2 = df.withColumn("filetype", pathToExtension( col( "filename" ) ) )
df2.show
This is only a sample; I have 30 regexes and types to identify, so the function pathToExtension() gets really long because I have to add a new when clause for each type.
I can't find a proper way to write this code in a functional style, with a list or map containing the regex and the name, like this:
val typelist = List((pictureFileName,pictureFileType),(videoFileName,videoFileType))
foreach [need help for this part]
All the code I've tried so far won't work properly.
You can use foldLeft to traverse your list of when conditions and chain them as shown below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import spark.implicits._
val default = "other"
def chainedWhen(c: Column, rList: List[(String, String)]): Column =
  rList.tail.foldLeft(when(c rlike rList.head._2, rList.head._1)) { (acc, t) =>
    acc.when(c rlike t._2, t._1)
  }.otherwise(default)
Testing the method:
val df = Seq(
(1, "a.txt"), (2, "b.gif"), (3, "c.zip"), (4, "d.oth")
).toDF("id", "file_name")
val rList = List(("text", ".*\\.txt"), ("gif", ".*\\.gif"), ("zip", ".*\\.zip"))
df.withColumn("file_type", chainedWhen($"file_name", rList)).show
// +---+---------+---------+
// | id|file_name|file_type|
// +---+---------+---------+
// | 1| a.txt| text|
// | 2| b.gif| gif|
// | 3| c.zip| zip|
// | 4| d.oth| other|
// +---+---------+---------+
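The same helper also drops straight into the original example. Here is a sketch assuming the regex vals from the question (pictureFileName, pictureFileType, videoFileName, videoFileType) are still in scope:
val typelist = List((pictureFileName, pictureFileType), (videoFileName, videoFileType))
// same sample filenames as in the question, but under a different val name to avoid clashing with df above
val filesDf = List("file.jpg", "file.avi", "file.jpg", "file3.tIf", "file5.AVI.zip", "file4.mp4", "afile").toDF("filename")
filesDf.withColumn("filetype", chainedWhen(col("filename"), typelist)).show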

Create a dataframe based on 2 dataframes for eligibility and ordering

I have 2 input dataframes like so
eligibleDs
+---------+--------------------+
| cid| eligibleUIds|
+---------+--------------------+
| 1234|offer3,offer1,offer2|
| 2345| offer1,offer3|
| 3456| offer2,offer3|
| 4567| offer2|
| 5678| null|
+---------+--------------------+
suggestedDs
+---------+--------------------+
| cid| suggestedUids|
+---------+--------------------+
| 1234|offer2,offer1,offer3|
| 2345|offer1,offer2,offer3|
| 3456|offer1,offer3,offer2|
| 4567|offer3,offer1,offer2|
| 5678|offer1,offer2,offer3|
+---------+--------------------+
I want the output dataframe to be like so
outputDs
+---------+--------+
| cid| topUid|
+---------+--------+
|3456 |offer3 |
|5678 |null |
|4567 |offer2 |
|1234 |offer2 |
|2345 |offer1 |
+---------+--------+
The idea is that:
The first data frame is the list of uids (each corresponding to some content id) that a user is eligible to see.
The second data frame is a suggested order of uids to be shown to that user.
If, for an id, the top suggested uid is present in the uids that can be shown, then show that uid; otherwise move down the suggested list until you reach a uid that can be shown.
Basically, eligibleDs decides the presence and suggestedDs decides the order.
I have been able to come up with something like this
val combinedDs = eligibleDs.join(suggestedDs, Seq("cid"), "left")
val outputDs = combinedDs.map(row => {
  val cid = row.getInt(0)
  val eligibleUids = row.getString(1)
  val suggestedUids = row.getString(2)
  val suggestedUidsList = suggestedUids.split(",")
  var topUid = ""
  import scala.util.control.Breaks._
  breakable {
    for (uid <- suggestedUidsList) {
      if (eligibleUids != null && eligibleUids.contains(uid)) {
        topUid = uid
        break
      }
    }
  }
  Out(cid, topUid)
})
This seems rather kludgy, can someone help let me know if there is a better way to do this?
Using dropWhile to drop unmatched items from the list in suggestedUids and headOption to pick the first item in the remaining list, here's a more idiomatic way of generating outputDs:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
case class Out(cid: Int, topUid: String)
val outputDs = combinedDs.map {
  case Row(cid: Int, null, _) =>
    Out(cid, null)
  case Row(cid: Int, eligibleUids: String, suggestedUids: String) =>
    val topUid = suggestedUids.split(",").
      dropWhile(!eligibleUids.contains(_)).headOption match {
        case Some(uid) => uid
        case None => null
      }
    Out(cid, topUid)
}
outputDs.show
// +----+------+
// | cid|topUid|
// +----+------+
// |1234|offer2|
// |2345|offer1|
// |3456|offer3|
// |4567|offer2|
// |5678| null|
// +----+------+
Note that combinedDs as described in the question is a DataFrame. Should it be converted to a Dataset, case Row(...) should be replaced with case (...).
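For example, a hypothetical Dataset[(Int, String, String)] version of the same logic, assuming the Out case class and spark.implicits._ from above are in scope, would look like this:
val outputDs = combinedDs.map {
  case (cid, null, _) => Out(cid, null)  // no eligible uids at all for this cid
  case (cid, eligibleUids, suggestedUids) =>
    val topUid = suggestedUids.split(",").dropWhile(u => !eligibleUids.contains(u)).headOption.orNull
    Out(cid, topUid)
}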

How to extract values from an RDD based on the parameter passed

I have created a key-value RDD, but I am not sure how to select values from it.
val mapdf = merchantData_df.rdd.map(row => {
val Merchant_Name = row.getString(0)
val Display_Name = row.getString(1)
val Store_ID_name = row.getString(2)
val jsonString = s"{Display_Name: $Display_Name, Store_ID_name: $Store_ID_name}"
(Merchant_Name, jsonString)
})
scala> mapdf.take(4).foreach(println)
(Amul,{Display_Name: Amul, Store_ID_name: null})
(Nestle,{Display_Name: Nestle, Store_ID_name: null})
(Ace,{Display_Name: Ace , Store_ID_name: null})
(Acme ,{Display_Name: Acme Fresh Market, Store_ID_name: Acme Markets})
Now suppose the input string to a function is Amul. My expected output for DisplayName is Amul, and another function for StoreID should return NULL.
How can I achieve this?
I don't want to use Spark SQL for this purpose.
Given an input dataframe such as
+-----------------+-----------------+-------------+
|Merchant_Name |Display_Name |Store_ID_name|
+-----------------+-----------------+-------------+
|Fitch |Fitch |null |
|Kids |Kids |null |
|Ace Hardware |Ace Hardware |null |
| Fresh Market |Acme Market |Acme Markets |
|Adventure | Island |null |
+-----------------+-----------------+-------------+
You can write a function with a string parameter as
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name")
And calling the function as
filterRowsWithKey("Fitch").show(false)
would give you
+------------+-------------+
|Display_Name|Store_ID_name|
+------------+-------------+
|Fitch |null |
+------------+-------------+
I hope the answer is helpful
Updated
If you want the first row to be returned from the function as a string, then you can do
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name").first().mkString(",")
println(filterRowsWithKey("Fitch"))
which should give you
Fitch,null
The above function will throw an exception if the key passed is not found, so to be safe you can use the following function:
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = {
  val filteredDF = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name")
  if (filteredDF.count() > 0) filteredDF.first().mkString(",") else "key not found"
}
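Since the question builds a key-value (pair) RDD and wants to avoid Spark SQL, another option worth mentioning is PairRDDFunctions.lookup, which returns every value stored under a key. A sketch against the mapdf RDD from the question:
// Seq of the JSON-like strings stored for that merchant; empty if the key is absent
val values: Seq[String] = mapdf.lookup("Amul")
println(values.headOption.getOrElse("key not found"))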

How to convert a simple DataFrame to a DataSet Spark Scala with case class?

I am trying to convert a simple DataFrame to a DataSet from the example in Spark:
https://spark.apache.org/docs/latest/sql-programming-guide.html
case class Person(name: String, age: Int)
import spark.implicits._
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
But the following problem arises:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot up cast `age` from bigint to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: ....
Can anyone help me out?
Edit
I noticed that it works with Long instead of Int!
Why is that?
Also:
val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()
augmentedDS.as[Person].show()
Prints:
+-----+---+
| _1| _2|
+-----+---+
|var_1| 2|
|var_2| 3|
|var_3| 4|
+-----+---+
Exception in thread "main"
org.apache.spark.sql.AnalysisException: cannot resolve '`name`' given input columns: [_1, _2];
Can anyone help me understand what is happening here?
If you change Int to Long (or BigInt) it works fine:
case class Person(name: String, age: Long)
import spark.implicits._
val path = "examples/src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
Output:
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
EDIT:
spark.read.json parses numbers as Long by default, since it is safer to do so.
You can change the column type afterwards using a cast or a UDF.
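For instance, as a minimal sketch reusing the people.json path and the spark.implicits._ import from above, the bigint column can be cast down before the conversion. Note that a null age still cannot live in a plain Int field, so Option[Int] (or simply keeping Long) is the safer field type; the PersonOpt class below is a hypothetical variant:
import org.apache.spark.sql.functions.col

case class PersonOpt(name: String, age: Option[Int]) // tolerates the null age in the sample data

val peopleDS = spark.read.json(path)
  .withColumn("age", col("age").cast("int")) // explicit downcast, so .as[] no longer complains
  .as[PersonOpt]
peopleDS.show()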
EDIT2:
To answer your 2nd question, you need to name the columns correctly before the conversion to Person will work:
val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => ("var_" + i.toString, (i + 1).toLong)).
  withColumnRenamed("_1", "name").
  withColumnRenamed("_2", "age")
augmentedDS.as[Person].show()
Outputs:
+-----+---+
| name|age|
+-----+---+
|var_1| 2|
|var_2| 3|
|var_3| 4|
+-----+---+
This is how you create a Dataset from a case class:
case class Person(name: String, age: Long)
Keep the case class outside of the class that contains the code below:
val primitiveDS = Seq(1,2,3).toDS()
val augmentedDS = primitiveDS.map(i => Person("var_" + i.toString, (i + 1).toLong))
augmentedDS.show()
augmentedDS.as[Person].show()
Hope this helped

How to count a certain character in Spark

I'd like to count the character 'a' in spark-shell.
I have a somewhat troublesome method: split by 'a', and "length - 1" is what I want.
Here is the code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val test_data = sqlContext.read.json("music.json")
test_data.registerTempTable("test_data")
val temp1 = sqlContext.sql("select user.id_str as userid, text from test_data")
val temp2 = temp1.map(t => (t.getAs[String]("userid"),t.getAs[String]("text").split('a').length-1))
However, someone told me this is not remotely safe. I don't know why; can you give me a better way to do this?
It is not safe because the value can be NULL:
val df = Seq((1, None), (2, Some("abcda"))).toDF("id", "text")
getAs[String] will return null:
scala> df.first.getAs[String]("text") == null
res1: Boolean = true
and split will give NPE:
scala> df.first.getAs[String]("text").split("a")
java.lang.NullPointerException
...
which is most likely the situation you got in your previous question.
One simple solution:
import org.apache.spark.sql.functions._
val aCnt = coalesce(length(regexp_replace($"text", "[^a]", "")), lit(0))
df.withColumn("a_cnt", aCnt).show
// +---+-----+-----+
// | id| text|a_cnt|
// +---+-----+-----+
// | 1| null| 0|
// | 2|abcda| 2|
// +---+-----+-----+
If you want to make your code relatively safe, you should either check for the existence of null:
def countAs1(s: String) = s match {
  case null => 0
  case chars => chars.count(_ == 'a')
}
countAs1(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs1(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2
or catch possible exceptions:
import scala.util.Try
def countAs2(s: String) = Try(s.count(_ == 'a')).toOption.getOrElse(0)
countAs2(df.where($"id" === 1).first.getAs[String]("text"))
// Int = 0
countAs2(df.where($"id" === 2).first.getAs[String]("text"))
// Int = 2