Extracting value of columns in spark dataframe - scala

I need to filter the rows of a Spark DataFrame so that only rows remain where the value of a certain column (say "price") matches the value stored in a Scala map. The key of the Scala map is the value of another column (say "id").
My DataFrame contains two columns: id and price.
I need to filter out all the rows where the price does not match the price in the Scala map.
My code resembles this:
object obj1 {
  // Returns the price of an item, looked up by its id
  def getPrice(id: String): String = {
    // look up the id in a map and return the price
    ???
  }
}
object Main {
  val validIds = Seq("1", "2", "3", "4")
  val filteredDf = baseDataframe.where(baseDataframe("id").isin(validIds: _*) &&
    baseDataframe("price") === obj1.getPrice(baseDataframe("id").toString()))
  // But this line sends the string "id" to obj1.getPrice()
  // rather than the value of the id column
}
I am not able to pass the value of the id column to obj1.getPrice().
Any suggestions on how to achieve this?
Thanks,

You can write a UDF to do this:
import org.apache.spark.sql.functions.udf
// the $"..." column syntax also needs spark.implicits._ in scope
val checkPrice = (id: String, price: String) => validIds.contains(id) && obj1.getPrice(id) == price
val checkPriceUdf = udf(checkPrice)
baseDataframe.where(checkPriceUdf($"id", $"price"))
Another solution is to convert the Map of id -> price to a DataFrame, and then do an inner join with baseDataframe on the id and price columns.
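The join-based alternative could look roughly like the sketch below. It is meant to be pasted into spark-shell or run as a script. The sample rows and the priceById map are hypothetical stand-ins for baseDataframe and for whatever lookup sits behind obj1.getPrice.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-filter-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for baseDataframe
val baseDataframe = Seq(("1", "10"), ("2", "99"), ("5", "50")).toDF("id", "price")

// Hypothetical id -> price map standing in for the lookup behind obj1.getPrice
val priceById = Map("1" -> "10", "2" -> "20", "3" -> "30")

// Turn the map into a two-column DataFrame with the same column names
val pricesDf = priceById.toSeq.toDF("id", "price")

// An inner join on both columns keeps only rows whose (id, price) pair is in the map
val filteredDf = baseDataframe.join(pricesDf, Seq("id", "price"))

filteredDf.show()
```

Because the join is on both id and price, ids that are missing from the map are dropped as well, so a separate check against validIds is only needed if that list is narrower than the map's keys.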

Related

Spark - pass column value to a udf and then get another column value inside udf

I am trying to write a UDF that takes a column value and, depending on that value, returns the value of one of two other columns. My code looks like this:
val udfMobileDeviceId = udf { (os_type: String) =>
if (os_type == "android") $"androidIdfa" else $"appleIdfv"
}
Either pass those columns to the UDF:
val udfMobileDeviceId = udf { (os_type: String, androidIdfa: String, appleIdfv: String) =>
  if (os_type == "android") androidIdfa else appleIdfv
}
or, even better, don't use a UDF for that and do it with the DataFrame API:
df
  .withColumn("mobileDeviceId", when($"os_type" === "android", $"androidIdfa").otherwise($"appleIdfv"))
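To see both options side by side, here is a small sketch intended for spark-shell or a Scala script. The sample rows are made up; only the column names come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, when}

val spark = SparkSession.builder().master("local[*]").appName("device-id-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows; the column names follow the question
val df = Seq(
  ("android", "a-1", "i-1"),
  ("ios", "a-2", "i-2")
).toDF("os_type", "androidIdfa", "appleIdfv")

// Option 1: a UDF that receives all three column values as arguments
val udfMobileDeviceId = udf { (osType: String, androidIdfa: String, appleIdfv: String) =>
  if (osType == "android") androidIdfa else appleIdfv
}
val viaUdf = df.withColumn("mobileDeviceId", udfMobileDeviceId($"os_type", $"androidIdfa", $"appleIdfv"))

// Option 2: the same logic expressed with built-in column functions
val viaWhen = df.withColumn("mobileDeviceId",
  when($"os_type" === "android", $"androidIdfa").otherwise($"appleIdfv"))

viaUdf.show()
viaWhen.show()
```

Both DataFrames should contain the same rows. The when/otherwise version is usually preferable because Spark can optimize built-in expressions, while a UDF is opaque to the optimizer.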

how to access map values and keys stored in a data frame in scala spark

I have a table whose description is as follows:
# col_name data_type comment
id string
persona_model map<string,struct<score:double,tag:string>>
# Partition Information
# col_name data_type comment
process_date string
A sample row would look something like this (tab separated):
000000E91010441BB122402A45D439E7 {"Tech":{"score":0.21678,"tag":"OTHERS"}} 2018-05-16-01
Now I want to create another table with only two columns: id and its respective score.
How can I do it in Scala Spark?
Moreover, what is really bugging me is how I can access only one particular score and store it in an integer variable, let's say temp.
You can do this:
val newDF = oldDF.select(col("id"), col("persona_model")("Tech")("score").as("temp"))
Then you can extract the temp values easily.
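As a rough illustration of that last step, the sketch below builds a one-row DataFrame with the same shape as the table and pulls the score into a local variable. It assumes spark-shell or a Scala script. The Score case class is only a hypothetical way to create the struct column. Note that the score is a double, so temp is a Double rather than an integer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("map-access-sketch").getOrCreate()
import spark.implicits._

// Hypothetical case class used only to build a map<string, struct<score, tag>> column
case class Score(score: Double, tag: String)

val oldDF = Seq(
  ("000000E91010441BB122402A45D439E7", Map("Tech" -> Score(0.21678, "OTHERS")))
).toDF("id", "persona_model")

// Look up the "Tech" key in the map, then the "score" field of the struct
val newDF = oldDF.select(col("id"), col("persona_model")("Tech")("score").as("temp"))

// Bring the value of the first row back to the driver as a plain Scala Double
val temp: Double = newDF.first().getAs[Double]("temp")
println(temp)
```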
Update: if you have more than one key, the procedure is a little more complex.
First create a class for the struct (necessary for the type cast):
case class Score(score: Double, tag: String)
Then extract all the keys from the data:
val keys = oldDF.rdd
.flatMap(r => r.getMap(1).asInstanceOf[Map[String, Score]].toList)
.collect.map(_._1).distinct.toList
Finally, you can extract the first non-null score among those keys like this:
def condition(keys: List[String]): Column = {
  keys match {
    case k :: ks => when(col("persona_model")(k)("score").isNotNull, col("persona_model")(k)("score")).otherwise(condition(ks))
    case Nil => lit(null)
  }
}
val newDF = oldDF.select(col("id"), condition(keys))
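A possible variation, sketched here under the assumption that at least one key exists: the distinct keys can be gathered with explode instead of going through the RDD, and the recursive when/otherwise chain can be written with coalesce, which returns the first non-null column. The sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, explode}

val spark = SparkSession.builder().master("local[*]").appName("map-keys-sketch").getOrCreate()
import spark.implicits._

// Hypothetical case class used only to build the struct values
case class Score(score: Double, tag: String)

// Hypothetical rows: each id carries a single, differently named key
val oldDF = Seq(
  ("id1", Map("Tech" -> Score(0.21678, "OTHERS"))),
  ("id2", Map("Sports" -> Score(0.5, "OTHERS")))
).toDF("id", "persona_model")

// Exploding a map column produces one row per entry, with columns "key" and "value"
val keys: List[String] =
  oldDF.select(explode(col("persona_model"))).select("key").distinct.as[String].collect.toList

// First non-null score across all keys (coalesce needs at least one column)
val firstScore = coalesce(keys.map(k => col("persona_model")(k)("score")): _*).as("score")

val newDF = oldDF.select(col("id"), firstScore)
newDF.show()
```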

Filter a list based on a parameter

I want to filter the employees based on name and return the id of each employee
case class Company(emp:List[Employee])
case class Employee(id:String,name:String)
val emp1=Employee("1","abc")
val emp2=Employee("2","def")
val cmpy= Company(List(emp1,emp2))
val a = cmpy.emp.find(_.name == "abc")
val b = a.map(_.id)
val c = cmpy.emp.find(_.name == "def")
val d = c.map(_.id)
println(b)
println(d)
I want to create a generic function that contains the filter logic, so that I can use it with different kinds of lists and different filter parameters.
For example, employeeIdByName, which takes these parameters:
criteria for the filter, e.g. _.name and id
list to filter, e.g. cmpy.emp
value for comparison, e.g. abc/def
Is there a better way to achieve the result?
I have used map and find.
If you really want a "generic" filter function, that can filter any list of elements, by any property of these elements, based on a closed set of "allowed" values, while mapping results to some other property - it would look something like this:
def filter[T, P, R](
list: List[T], // input list of elements with type T (in our case: Employee)
propertyGetter: T => P, // function extracting value for comparison, in our case a function from Employee to String
values: List[P], // "allowed" values for the result of propertyGetter
resultMapper: T => R // function extracting result from each item, in our case from Employee to String
): List[R] = {
list
// first we filter only items for which the result of
// applying "propertyGetter" is one of the "allowed" values:
.filter(item => values.contains(propertyGetter(item)))
// then we map remaining values to the result using the "resultMapper"
.map(resultMapper)
}
// for example, we can use it to filter by name and return id:
filter(
List(emp1, emp2),
(emp: Employee) => emp.name, // function that takes an Employee and returns its name
List("abc"),
(emp: Employee) => emp.id // function that takes an Employee and returns its id
)
// List(1)
However, this is a ton of noise around a very simple Scala operation: filtering and mapping a list. This specific use case can be written as:
val goodNames = List("abc")
val input = List(emp1, emp2)
val result = input.filter(emp => goodNames.contains(emp.name)).map(_.id)
Or even:
val result = input.collect {
case Employee(id, name) if goodNames.contains(name) => id
}
Scala's built-in map, filter, collect functions are already "generic" in the sense that they can filter/map by any function that applies to the elements in the collection.
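If only the first match is wanted, as in the question's find followed by map, the built-in collectFirst covers it. The helper below is a hypothetical sketch (findBy is not a library function) that wraps it with the three parameters the question asks for.

```scala
case class Employee(id: String, name: String)

val employees = List(Employee("1", "abc"), Employee("2", "def"))

// Hypothetical helper: take the first element whose property equals `value`
// and map it to a result; returns None when nothing matches
def findBy[T, P, R](list: List[T])(property: T => P, value: P)(result: T => R): Option[R] =
  list.collectFirst { case item if property(item) == value => result(item) }

val b = findBy(employees)(_.name, "abc")(_.id)     // id of the employee named "abc"
val none = findBy(employees)(_.name, "xyz")(_.id)  // no employee has this name

println(b)
println(none)
```

Putting the list in its own parameter list lets the compiler infer the element type first, so the property and result functions can be written as `_.name` and `_.id` without type annotations.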
You can use Shapeless. If you have employees: List[Employee], you can use
import shapeless._
import shapeless.record._
employees.map(LabelledGeneric[Employee].to(_).toMap)
to convert each Employee to a map from field key to field value. Then you can apply the filters on the map.

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of the Stack Overflow user base. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation (in this case "11849") and the number from Age (in this example "35"), and I would like to have them as floats.
The file is located in HDFS, so it comes in as an RDD.
val linesWithAge = lines.filter(line => line.contains("Age=")) // keep only the lines that have an age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) // split each line at every quotation mark
When I split on quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
I also need to do this for every line in the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array, and I have now managed to assign a value to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name:String):Option[String] = {
for {
entry <- src.find(_.startsWith(name))
value <- entry.split("\"").lift(1)
} yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap{line =>
val elements = line.split(" ")
for {
age <- findElement(elements, "Age")
rep <- findElement(elements, "Reputation")
} yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Hence the result of the for-comprehension will be None, and the record will be dropped by the flatMap operation.
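The dropping behaviour can be checked without a cluster. The sketch below runs the same logic on a plain List instead of an RDD. The two input lines are shortened, hypothetical versions of the record format, and the second one has no Age attribute.

```scala
// Same helper as above: find the token starting with `name` and take the quoted value
def findElement(src: Array[String], name: String): Option[String] =
  for {
    entry <- src.find(_.startsWith(name))
    value <- entry.split("\"").lift(1)
  } yield value

// Shortened, hypothetical sample records; the second has no Age attribute
val lines = List(
  """<row Id="42" Reputation="11849" Views="648" Age="35" AccountId="33" />""",
  """<row Id="43" Reputation="101" Views="7" AccountId="34" />"""
)

val reputationByAge = lines.flatMap { line =>
  val elements = line.split(" ")
  for {
    age <- findElement(elements, "Age")
    rep <- findElement(elements, "Reputation")
  } yield (rep.toInt, age.toInt)
}

println(reputationByAge) // only the first record survives
```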
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala has XML support through scala.xml (shipped as the separate scala-xml module since Scala 2.11), so we can do this:
import scala.xml.XML
// helper function to map Strings to Option, where empty strings become None
def emptyStrToNone(str: String): Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap { line =>
  val record = XML.loadString(line)
  for {
    // attributes are selected with the "@" prefix
    rep <- emptyStrToNone((record \ "@Reputation").text)
    age <- emptyStrToNone((record \ "@Age").text)
  } yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function that extracts the value for a given key of your line (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs(line, "Age")(_.toFloat), getValueForKeyAs(line, "Reputation")(_.toFloat)))
This should give you an RDD of type RDD[(Float, Float)].
getValueForKeyAs could be implemented like this (the conversion function is passed explicitly, because asInstanceOf cannot turn a String into a Float):
def getValueForKeyAs[A](line: String, key: String)(convert: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  // take the text between the first pair of quotes after key= and convert it
  convert(res(1).split("\"")(1))
}
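A cast alone does not parse text into a number, so the value has to be converted with something like toFloat. If throwing on a missing key is not wanted, a variant that returns an Option is one way to go. The helper name floatAttribute and the sample line below are hypothetical.

```scala
import scala.util.Try

// Hypothetical helper: value of key="..." in the line, parsed as a Float.
// Returns None when the key is missing or the value is not a number.
def floatAttribute(line: String, key: String): Option[Float] = {
  val marker = " " + key + "=\""
  val start = line.indexOf(marker)
  if (start < 0) None
  else {
    val from = start + marker.length
    val end = line.indexOf('"', from)
    if (end < 0) None else Try(line.substring(from, end).toFloat).toOption
  }
}

// Shortened, hypothetical sample record
val line = """<row Id="42" Reputation="11849" Age="35" />"""

val pair: Option[(Float, Float)] = for {
  age <- floatAttribute(line, "Age")
  rep <- floatAttribute(line, "Reputation")
} yield (age, rep)

println(pair)
```

On an RDD of lines, the same for-comprehension inside flatMap would yield the (age, reputation) pairs and skip records where either value is missing.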

Array of Sequences in Scala

I'm trying to read the distinct values of each column in a DataFrame and store them in an array of sequences:
def getColumnDistinctValues(df: DataFrame, colNames: String): Unit = {
  val cols: Array[String] = colNames.split(',')
  cols.foreach(println) // print column names
  var colDistValues: Array[Seq[Any]] = null
  for (i <- 0 until cols.length) {
    colDistValues(i) = df.select(cols(i)).distinct.map(x => x.get(0)).collect // read distinct values from each column
  }
}
The assignment to colDistValues(i) doesn't work and always results in a NullPointerException. What is the correct syntax to assign the distinct values of each column?
Regards
You're trying to access the ith index of a null reference (which you assigned yourself), so of course you get a NullPointerException. You don't need to initialize an Array[T] beforehand; let the returned collection do that for you:
// collecting first keeps the map on a local array, so no Encoder for Any is needed on Spark 2 and later
val colDistValues: Array[Array[Any]] =
  cols.map(c => df.select(c).distinct.collect.map(_.get(0)))
You are initialising colDistValues to null.
Replace
var colDistValues: Array[Seq[Any]] = null
with
var colDistValues: Array[Seq[Any]] = Array.ofDim[Seq[Any]](cols.length)
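Both fixes can be tried out on plain Scala collections. In the sketch below a hypothetical Map from column name to values stands in for the DataFrame, so it runs without Spark.

```scala
import scala.util.Try

// Hypothetical stand-in for a DataFrame: column name -> column values
val table: Map[String, Seq[Any]] = Map(
  "city" -> Seq("Oslo", "Rome", "Oslo"),
  "year" -> Seq(2019, 2020, 2019)
)
val cols: Array[String] = "city,year".split(',')

// Indexing into a null array reference fails with a NullPointerException
val broken = Try {
  val arr: Array[Seq[Any]] = null
  arr(0) = table("city").distinct
}

// Fix 1: build the array directly with map, no pre-allocation needed
val viaMap: Array[Seq[Any]] = cols.map(c => table(c).distinct)

// Fix 2: allocate the array first, then assign by index
val viaOfDim: Array[Seq[Any]] = Array.ofDim[Seq[Any]](cols.length)
for (i <- cols.indices) viaOfDim(i) = table(cols(i)).distinct

println(viaMap.toList)
```

The map version is usually the more idiomatic of the two, since it avoids both the mutable array and the index bookkeeping.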