How to extract values from an RDD based on the parameter passed - scala

I have created a key-value RDD, but I am not sure how to select the values from it.
val mapdf = merchantData_df.rdd.map(row => {
  val Merchant_Name = row.getString(0)
  val Display_Name = row.getString(1)
  val Store_ID_name = row.getString(2)
  val jsonString = s"{Display_Name: $Display_Name, Store_ID_name: $Store_ID_name}"
  (Merchant_Name, jsonString)
})
scala> mapdf.take(4).foreach(println)
(Amul,{Display_Name: Amul, Store_ID_name: null})
(Nestle,{Display_Name: Nestle, Store_ID_name: null})
(Ace,{Display_Name: Ace , Store_ID_name: null})
(Acme ,{Display_Name: Acme Fresh Market, Store_ID_name: Acme Markets})
Now suppose the input string passed to a function is Amul. My expected output for Display_Name is Amul, and another function for Store_ID_name should return null.
How can I achieve this?
I don't want to use Spark SQL for this purpose.

Given input dataframe as
+-------------+-------------+-------------+
|Merchant_Name|Display_Name |Store_ID_name|
+-------------+-------------+-------------+
|Fitch        |Fitch        |null         |
|Kids         |Kids         |null         |
|Ace Hardware |Ace Hardware |null         |
|Fresh Market |Acme Market  |Acme Markets |
|Adventure    |Island       |null         |
+-------------+-------------+-------------+
You can write a function with a string parameter as
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name")
And calling the function as
filterRowsWithKey("Fitch").show(false)
would give you
+------------+-------------+
|Display_Name|Store_ID_name|
+------------+-------------+
|Fitch       |null         |
+------------+-------------+
I hope the answer is helpful
Updated
If you want the first row to be returned from the function as a string, then you can do
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name").first().mkString(",")
println(filterRowsWithKey("Fitch"))
which should give you
Fitch,null
The above function will throw an exception if the key passed is not found, so to be safe you can use the following function:
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = {
  val filteredDF = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name")
  if (filteredDF.count() > 0) filteredDF.first().mkString(",") else "key not found"
}
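As a side note, the count() in the safe version triggers a separate job just to test for emptiness; a slightly cheaper sketch of the same safe function asks for at most one row instead:
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String): String =
  df.filter(col("Merchant_Name") === key)
    .select("Display_Name", "Store_ID_name")
    .take(1)
    .headOption
    .map(_.mkString(","))
    .getOrElse("key not found")
And since the question asks for a way that does not go through Spark SQL at all, here is a minimal sketch working directly on the key-value RDD, keeping the two columns as a tuple instead of a JSON string and using PairRDDFunctions.lookup:
val pairRdd = merchantData_df.rdd.map(row =>
  (row.getString(0), (row.getString(1), row.getString(2))))

// lookup() returns every value stored under the given key
def displayNameFor(key: String): Option[String] =
  pairRdd.lookup(key).headOption.map(_._1)

def storeIdFor(key: String): Option[String] =
  pairRdd.lookup(key).headOption.flatMap(v => Option(v._2))

// displayNameFor("Amul") -> Some(Amul)
// storeIdFor("Amul")     -> None, because Store_ID_name is null for Amul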

Related

Dataframe foreach loop - better way to extract result?

I have a DataFrame in Scala from which I need to create a new DataFrame for the distinct values of the SourceHash field.
var myProductsList = List[ProductInfo]()
val distinctFiles = dfDateFiltered.select(col("SourceHash")).distinct()
distinctFiles.foreach(rowFilter => {
  val productInfo = createProductInfo(validFrom, validTo, dfDateFiltered, rowFilter.getString(0))
  myProductsList = myProductsList :+ productInfo
})
myProductsList.toDF()
The problem is, this code throws a java.lang.NullPointerException inside createProductInfo for any invocation on the dataframe dfDateFiltered.
The only way I can overcome this is by using collect() before foreach, like:
distinctFiles.collect().foreach(rowFilter => {
  ...
})
But collect is an expensive call, so it must be avoided.
How can I efficiently extract a new Dataset without losing performance?
Below is the createProductInfo code:
private def createProductInfo(validFrom: String, validTo: String, dfDateFiltered: Dataset[Row], rowFilter: String): ProductInfo = {
  val dfPerFile = dfDateFiltered.filter(col("SourceHash") === rowFilter)
  val dfRow = dfPerFile.head
  val clientCount = dfPerFile.filter(col("ServerOrClient") === "Client").count
  val buildVersion = dfPerFile.filter(col("ServerOrClient") === "Server").select(col("BuildVersion")).head.getString(0)
  val productInfo = ProductInfo(dfRow.getInt(0),
    dfRow.getInt(1),
    dfRow.getString(12),
    dfRow.getString(13),
    dfRow.getString(14),
    validFrom,
    validTo,
    dfRow.getString(8),
    dfRow.getTimestamp(9),
    clientCount,
    buildVersion
  )
  productInfo
}
Function "createProductInfo" can be avoided, values can be collected by grouping. Original dataset don't exists in question, approach can be shown on such data:
import org.apache.spark.sql.functions._
import spark.implicits._ // assuming a SparkSession named spark, as in spark-shell

// target type for the typed Dataset returned by .as[...]
case class Product(SourceHash: Int, clientCount: Long, buildVersion: Option[Int], validFrom: String)

val dfDateFiltered = Seq(
  (1, "Server", 1),
  (1, "Client", 2),
  (2, "Client", 3)
).toDF("SourceHash", "ServerOrClient", "BuildVersion")
val validFrom = "Today"

dfDateFiltered
  .groupBy("SourceHash")
  .agg(
    sum(when($"ServerOrClient" === lit("Client"), 1).otherwise(0)).alias("clientCount"),
    first(when($"ServerOrClient" === lit("Server"), col("BuildVersion")).otherwise(null), true).alias("buildVersion")
  )
  .withColumn("validFrom", lit(validFrom))
  .as[Product]
Output:
+----------+-----------+------------+---------+
|SourceHash|clientCount|buildVersion|validFrom|
+----------+-----------+------------+---------+
|1         |1          |1           |Today    |
|2         |1          |null        |Today    |
+----------+-----------+------------+---------+
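If the remaining row-level fields read in createProductInfo (dfRow.getInt(0), dfRow.getString(8), and so on) are needed as well, they can be carried through the same aggregation with first(), which mirrors the dfPerFile.head access of the original function. A sketch with hypothetical column names standing in for the positional lookups, reusing the imports above:
// "SomeIdColumn" and "SomeNameColumn" are placeholders for the real columns
// behind dfRow.getInt(0), dfRow.getString(12), dfRow.getTimestamp(9), etc.
// validFrom and validTo are assumed in scope, as in the original signature.
val productsDF = dfDateFiltered
  .groupBy("SourceHash")
  .agg(
    first("SomeIdColumn").alias("someId"),
    first("SomeNameColumn").alias("someName"),
    sum(when($"ServerOrClient" === "Client", 1).otherwise(0)).alias("clientCount"),
    first(when($"ServerOrClient" === "Server", $"BuildVersion"), ignoreNulls = true).alias("buildVersion")
  )
  .withColumn("validFrom", lit(validFrom))
  .withColumn("validTo", lit(validTo))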

create view for two different dataframe in scala spark

I have a code snippet that reads a JSON array of file paths, unions the output, and gives me two different tables. I want to call createOrReplaceTempView(name) for each of those two tables, where the name is available in the JSON array like below:
{
  "source": [
    {
      "name": "testPersons",
      "data": [
        "E:\\dataset\\2020-05-01\\",
        "E:\\dataset\\2020-05-02\\"
      ],
      "type": "json"
    },
    {
      "name": "testPets",
      "data": [
        "E:\\dataset\\2020-05-01\\078\\",
        "E:\\dataset\\2020-05-02\\078\\"
      ],
      "type": "json"
    }
  ]
}
My output:
testPersons
+------+---+
|name  |age|
+------+---+
|John  |24 |
|Cammy |20 |
|Britto|30 |
|George|23 |
|Mikle |15 |
+------+---+
testPets
+------+---+
|name  |age|
+------+---+
|piku  |2  |
|jimmy |3  |
|rapido|1  |
+------+---+
Above are my output tables and the JSON array; my code iterates through each array entry and reads its data section.
But how do I change my code below to create a temp view for each output table?
For example, I want to call .createOrReplaceTempView("testPersons") and .createOrReplaceTempView("testPets"),
with the view names taken from the JSON array.
if (dataArr(counter)("type").value.toString() == "json") {
  val name = dataArr(counter)("name").value.toString()
  val dataPath = dataArr(counter)("data").arr
  val input = dataPath.map(item => {
    val rdd = spark.sparkContext.wholeTextFiles(item.str).map(i => "[" + i._2.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
    spark
      .read
      .schema(Schema.getSchema(name))
      .option("multiLine", true)
      .json(rdd)
  })
  val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], Schema.getSchema(name))
  val finalDF = input.foldLeft(emptyDF)((x, y) => x.union(y))
  finalDF.show()
Expected output:
spark.sql("SELECT * FROM testPersons").show()
spark.sql("SELECT * FROM testPets").show()
It should give me the table for each one.
Since you already have your data wrangled into shape and have your rows in DataFrames and simply want to access them as temporary views, I suppose you are looking for the function(s):
createOrReplaceGlobalTempView
createOrReplaceTempView
They can be invoked from a DataFrame/Dataset.
df.createOrReplaceGlobalTempView("testPersons")
spark.sql("SELECT * FROM global_temp.testPersons").show()
df.createOrReplaceTempView("testPersons")
spark.sql("SELECT * FROM testPersons").show()
For an explanation about the difference between the two, you can take a look at this question.
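Applied to the loop in your question, this is a one-line addition right after finalDF is built (a sketch reusing the name and finalDF values already computed there):
// inside the existing if-block, after finalDF.show()
finalDF.createOrReplaceTempView(name) // name is "testPersons" or "testPets"

// afterwards both expected queries work
spark.sql("SELECT * FROM testPersons").show()
spark.sql("SELECT * FROM testPets").show()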
If you are trying to dynamically read the JSON, load the files listed in data into DataFrames, and then register each result under its own view name, you could do something like this:
import net.liftweb.json._
import net.liftweb.json.DefaultFormats

case class Source(name: String, data: List[String], `type`: String)

val file = scala.io.Source.fromFile("path/to/your/file").mkString
implicit val formats: DefaultFormats.type = DefaultFormats
val json = parse(file)
val sourceList = (json \ "source").children

for (source <- sourceList) {
  val s = source.extract[Source]
  // `type` is "json" for both entries in the example, so read each path as JSON
  val df = s.data.map(d => spark.read.json(d)).reduce(_ union _)
  df.createOrReplaceTempView(s.name)
}

Create a dataframe based on 2 dataframes for eligibility and ordering

I have 2 input dataframes like so
eligibleDs
+---------+--------------------+
| cid| eligibleUIds|
+---------+--------------------+
| 1234|offer3,offer1,offer2|
| 2345| offer1,offer3|
| 3456| offer2,offer3|
| 4567| offer2|
| 5678| null|
+---------+--------------------+
suggestedDs
+---------+--------------------+
| cid| suggestedUids|
+---------+--------------------+
| 1234|offer2,offer1,offer3|
| 2345|offer1,offer2,offer3|
| 3456|offer1,offer3,offer2|
| 4567|offer3,offer1,offer2|
| 5678|offer1,offer2,offer3|
+---------+--------------------+
I want the output dataframe to be like so
outputDs
+---------+--------+
| cid| topUid|
+---------+--------+
|3456 |offer3 |
|5678 |null |
|4567 |offer2 |
|1234 |offer2 |
|2345 |offer1 |
+---------+--------+
The idea being that
The first data frame is a list of uids (each corresponding to some content id) that a user is eligible to see.
The second data frame is a suggested order of uids to be shown for that user.
If, for an id, the top suggested uid is present in the uids that can be shown, then show that uid; otherwise move down the suggested list until you reach a uid that can be shown.
Basically, eligibleDs decides presence and suggestedDs decides order.
I have been able to come up with something like this
val combinedDs = eligibleDs.join(suggestedDs, Seq("cid"), "left")
val outputDs = combinedDs.map(row => {
  val cid = row.getInt(0)
  val eligibleUids = row.getString(1)
  val suggestedUids = row.getString(2)
  val suggestedUidsList = suggestedUids.split(",")
  var topUid = ""
  import scala.util.control.Breaks._
  breakable {
    for (uid <- suggestedUidsList) {
      if (eligibleUids != null && eligibleUids.contains(uid)) {
        topUid = uid
        break
      }
    }
  }
  Out(cid, topUid)
})
This seems rather kludgy; can someone let me know if there is a better way to do this?
Using dropWhile to drop the unmatched leading items in the suggestedUids list and headOption to pick the first item in the remaining list, here's a more idiomatic way of generating outputDs:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import spark.implicits._ // provides the Encoder needed by the typed map

case class Out(cid: Int, topUid: String)

val outputDs = combinedDs.map {
  case Row(cid: Int, null, _) =>
    Out(cid, null)
  case Row(cid: Int, eligibleUids: String, suggestedUids: String) =>
    val topUid = suggestedUids.split(",").
      dropWhile(!eligibleUids.contains(_)).headOption match {
        case Some(uid) => uid
        case None => null
      }
    Out(cid, topUid)
}
outputDs.show
// +----+------+
// | cid|topUid|
// +----+------+
// |1234|offer2|
// |2345|offer1|
// |3456|offer3|
// |4567|offer2|
// |5678| null|
// +----+------+
Note that combinedDs as described in the question is a DataFrame. Should it be converted to a Dataset, case Row(...) should be replaced with case (...).
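For comparison, here is a sketch of the same pick-the-first-eligible logic written as a udf instead of a typed map, assuming combinedDs keeps the column names cid, eligibleUIds and suggestedUids from the two input frames:
import org.apache.spark.sql.functions._

// returns the first suggested uid that appears in the eligible list, or null
val pickTopUid = udf { (eligible: String, suggested: String) =>
  if (eligible == null || suggested == null) null
  else suggested.split(",").find(uid => eligible.contains(uid)).orNull
}

val outputViaUdf = combinedDs
  .select(col("cid"), pickTopUid(col("eligibleUIds"), col("suggestedUids")).as("topUid"))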

UDF usage in spark

I have a custom UDF registered in Spark. If I try to access that UDF, it throws an error (unable to access).
I tried it like this:
spark.udf.register("rssi_weightage", FilterMap.rssi_weightage)
val filterop = input_data.groupBy($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID").agg(first(rssi_weightage($"RSSI").as("RSSI_Weight")))
It shows an error at first(rssi_weightage($"RSSI")): rssi_weightage not found.
Any help will be appreciated.
This is not how you use the UDF; the actual UDF is the return value of spark.udf.register. So you can do:
val udf_rssi_weightage = spark.udf.register("rssi_weightage", FilterMap.rssi_weightage)
val filterop = input_data
  .groupBy($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID")
  .agg(first(udf_rssi_weightage($"RSSI")).as("RSSI_Weight"))
But in your case you do not need to register the udf at all; just use org.apache.spark.sql.functions.udf to convert a regular function to a udf:
val udf_rssi_weightage = udf(FilterMap.rssi_weightage)
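Either way, the grouping from the question can then use the returned udf value. A sketch, assuming FilterMap.rssi_weightage is a one-argument method (the trailing underscore eta-expands it to a function):
import org.apache.spark.sql.functions._

val rssiWeight = udf(FilterMap.rssi_weightage _)

val filterop = input_data
  .groupBy($"tagShortID", $"Timestamp", $"ListenerShortID", $"rootOrgID", $"subOrgID")
  .agg(first(rssiWeight($"RSSI")).as("RSSI_Weight"))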
I suppose you have an issue with the way you're defining the udf function.
The next snippet takes a slightly different approach: the udf is defined inline as an anonymous function:
import org.apache.spark.sql.functions._
val data = sqlContext.read.json(sc.parallelize(Seq("{'foo' : 'Bar'}", "{'foo': 'Baz'}")))
val example = Seq("Bar", "Bazzz")
val urbf = udf { foo: String => if (example.contains(foo)) 1 else 0 }
data.select($"foo", urbf($"foo")).show
+-----+--------+
|  foo|UDF(foo)|
+-----+--------+
|  Bar|       1|
|Bazzz|       0|
+-----+--------+

Spark Sql Dataframe Join on one field

I am very new to Spark. I have the queries below.
I have 2 tables, Business and Inspections.
The Business table has the fields Business_id, name, address.
The Inspections table has score.
I want to calculate the top 10 scores.
So, I need to join based on the Business_id field. I tried 2 ways but neither of them is working:
1) Using sqlContext.sql (I wrote a SQL query):
sqlContext.sql("""select CBusinesses.BUSINESS_ID, CBusinesses.name, CBusinesses.address, CBusinesses.city, CBusinesses.postal_code, CBusinesses.latitude, CBusinesses.longitude, Inspections_notnull.score from CBusinesses, Inspections_notnull where CBusinesses.BUSINESS_ID=Inspections_notnull.BUSINESS_ID and Inspections_notnull.score <> 0 order by Inspections_notnull.score""").show()
2) val df = businessesDF.join(raw_inspectionsDF, businessesDF.col("BUSINESS_ID") == raw_inspectionsDF.col("BUSINESS_ID"))
How should I write it?
Thanks!
val df = businessesDF.join(raw_inspectionsDF, businessesDF("BUSINESS_ID") === raw_inspectionsDF("BUSINESS_ID"))
This should work, please take a look here for more details: https://spark.apache.org/docs/1.5.1/api/java/org/apache/spark/sql/DataFrame.html
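To then get the top 10 scores out of the joined result, a sketch assuming columns business_id, name, address and score (Spark resolves column names case-insensitively by default, so BUSINESS_ID works too):
import org.apache.spark.sql.functions._

val top10 = businessesDF
  .join(raw_inspectionsDF, businessesDF("business_id") === raw_inspectionsDF("business_id"))
  .select(businessesDF("business_id"), col("name"), col("address"), col("score"))
  .where(col("score") > 0)          // skip the 0 placeholder scores
  .orderBy(col("score").desc)
  .limit(10)

top10.show()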
Sure... I created a case class for each dataset, split each line by tab, and then converted the RDD to a DataFrame.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.util.{Try, Success, Failure}
def parseScore(s: String): Option[Int] = {
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(x) => None
  }
}
case class CInspections (business_id:Int, score:Option[Int], date:String, type1:String)
val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile (s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map ( line => line.split ("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map ( raw_inspections => CInspections (raw_inspections(0).toInt,parseScore(raw_inspections(1)), raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
//raw_inspectionsDF.show()
val raw_inspectionsDF_replacenull = raw_inspectionsDF.na.fill(0) // Replacing null values with '0'
raw_inspectionsDF_replacenull.show()
raw_inspectionsDF_replacenull.createOrReplaceTempView ("Inspections_notnull")
For Business -->
case class CBusinesses (business_id:Int, name: String, address:String, city:String, postal_code:Int, latitude:String, longitude:String, phone_number:String, tax_code:String, business_certificate:String, application_date:String,owner_name:String, owner_address:String, owner_city:String, owner_state:String, owner_zip:String )
val businesses = sc.textFile (s"$baseDir/businesses_plus.txt")
val businessesmap = businesses.map ( line => line.split ("\t"))
val businessesRDD = businessesmap.map ( businesses => CBusinesses (businesses(0).toInt, businesses(1), businesses(2),businesses(3),businesses(4).toInt,
businesses(5),businesses(6), businesses(7), businesses(8), businesses(9), businesses(10), businesses(11), businesses(12), businesses(13), businesses(14), businesses(15)))
val businessesDF = businessesRDD.toDF
businessesDF.createOrReplaceTempView ("CBusinesses")
businessesDF.printSchema
//businessesDF.show()
It is showing the proper result for both dataframes.
For Inspection -->
+-----------+-----+--------+--------------------+
|business_id|score|    date|               type1|
+-----------+-----+--------+--------------------+
|         10|    0|20140807|Reinspection/Foll...|
|         10|   94|20140729|Routine - Unsched...|
|         10|    0|20140124|Reinspection/Foll...|
|         10|   92|20140114|Routine - Unsched...|
For Business -->
+-----------+--------------------+--------------------+-------------+-----------+---------+-----------+------------+--------+--------------------+----------------+--------------------+--------------------+-----------------+-------------+---------+
|business_id| name| address| city|postal_code| latitude| longitude|phone_number|tax_code|business_certificate|application_date| owner_name| owner_address| owner_city| owner_state|owner_zip|
+-----------+--------------------+--------------------+-------------+-----------+---------+-----------+------------+--------+--------------------+----------------+--------------------+--------------------+-----------------+-------------+---------+
| 10| Tiramisu Kitchen| 033 Belden Pl|San Francisco| 94104|37.791116|-122.403816| | H24| 779059| | Tiramisu LLC| 33 Belden St| San Francisco| CA| 94104|
| 17|GEORGE'S COFFEE SHOP| 2200 OAKDALE Ave | S.F.| 94124|37.741086|-122.401737| 14155531470| H24| 78443| 4/5/75|"LIEUW, VICTOR & ...| 648 MACARTHUR DRIVE| DALY CITY| CA| 94015|