Can I have a condition inside of a where or a filter? - scala

I have a dataframe with many columns. To explain the situation, let's say there is a column with letters in it from a-z. I also have a list which includes some specific letters.
val testerList = List("a","k")
The dataframe has to be filtered to only include entries whose letter is in the list. This is very straightforward:
val resultDF = df.where($"column".isin(testerList: _*))
So the problem is that the list is given to this function as a parameter, and it can be an empty list. That situation could be solved like this (resultDF is defined beforehand as an empty dataframe):
if (!testerList.isEmpty) {
  resultDF = df.where(some other stuff has to be filtered away)
               .where($"column".isin(testerList: _*))
} else {
  resultDF = df.where(some other stuff has to be filtered away)
}
Is there a way to do this more simply, something like this:
val resultDF = df.where(some other stuff has to be filtered away)
                 .where((!(testerList.isEmpty)) && $"column".isin(testerList: _*))
This one throws an error though:
error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
.where( (!(testerList.isEmpty)) && (($"agent_account_homepage").isin(testerList:_*)))
^
So, thanks a lot for any kind of ideas for a solution!! :)

What about this?
val filtered1 = df.where(some other stuff has to be filtered away)
val resultDF = if (testerList.isEmpty)
  filtered1
else
  filtered1.where($"column".isin(testerList: _*))
Or, if you don't want filtered1 to be available below and perhaps unintentionally used, it can be declared inside a block that initializes resultDF:
val resultDF = {
  val filtered1 = df.where(some other stuff has to be filtered away)
  if (testerList.isEmpty) filtered1 else filtered1.where($"column".isin(testerList: _*))
}
Or, if you change the order:
val resultDF = (if (testerList.isEmpty)
                  df
                else
                  df.where($"column".isin(testerList: _*))
               ).where(some other stuff has to be filtered away)

Essentially, what Spark expects to receive in where is a plain Column object. This means you can extract all of your complicated where logic into a separate function:
def testerFilter(testerList: List[String]): Column = testerList match {
  // of course, you have to replace ??? with real conditions,
  // just append them by joining with "and"
  case Nil => $"column".isNotNull and ???
  case tl  => $"column".isin(tl: _*) and ???
}
And then you just use it like:
df.where(testerFilter(testerList))
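Another option, just as a sketch (it assumes spark.implicits._ is in scope for the $ syntax and that your Spark version accepts isin on an empty list): lift the Scala Boolean into a Column with lit, so the whole condition stays inside a single where call.
import org.apache.spark.sql.functions.lit
// When testerList is empty the left operand is literally true and every row passes;
// otherwise the isin condition does the filtering.
val resultDF = df.where(lit(testerList.isEmpty) || $"column".isin(testerList: _*))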

The solution I use now uses SQL code inside the where clause:
var testerList = s""""""
var cond = testerList.isEmpty().toString
testerList = if (cond == "true") "''" else testerList
val resultDF = df.where(some other stuff has to be filtered away)
  .where("('" + cond + "' = 'true') or (agent_account_homepage in (" + testerList + "))")
What do you think?

Related

How can I remap a list based on different conditions in Scala?

I have a list of custom objects (Buffer[CustomObject]) on which I'm applying .map in order to return a list of (String, Boolean, Boolean, Boolean), but only if at least one of the conditions on the element is met. The string is taken from inside the custom object. The code is something like this:
list.map { item =>
  val string = item.string
  val boolean1 = true
  val boolean2 = true
  val boolean3 = true
  // -- some code to check the conditions -- //
  if (!boolean1 || !boolean2 || !boolean3)
    (string, boolean1, boolean2, boolean3)
  else
    null
}
The issue with this is that the list obviously contains some null values, and I need to remove them later. I also tried to use .collect(), but without success, given that there is more than one condition and they must be checked with some code. I know I could maybe use a couple of .filter and .map calls to achieve this, but my goal is to do everything in one iteration. Can you help me?
Thanks
Edit: I found a "partial" solution: you can return an Option and use .flatten, or a .flatMap, which automatically removes the None values.
list.flatMap { item =>
  val string = item.string
  val boolean1 = true
  val boolean2 = true
  val boolean3 = true
  // -- some code to check the conditions -- //
  if (!boolean1 || !boolean2 || !boolean3)
    Some((string, boolean1, boolean2, boolean3))
  else
    None
}
But it is a kind of "ugly" solution; is there a more "elegant" one?
You can take advantage of the fact that flatMap on List treats Options as collections of at most one element.
list.flatMap { item =>
  val string = item.string
  val boolean1 = true
  val boolean2 = true
  val boolean3 = true
  // Option.when (available since Scala 2.13) yields Some(...) when the condition holds and None otherwise
  Option.when(!boolean1 || !boolean2 || !boolean3) {
    (string, boolean1, boolean2, boolean3)
  }
}

Unable to convert a java.util.List into a Scala List

I want the if block to return Right(List[PracticeQuestionTag]), but I am not able to do so. The if/else returns an Either.
//I get a java.util.List[Result]
val resultList: java.util.List[Result] = transaction.scan(scan);
if (resultList.isEmpty == false) {
  val listIterator = resultList.listIterator()
  val finalList: List[PracticeQuestionTag] = List()
  //this returns Unit. How do I make it return List[PracticeQuestionTag]?
  val answer = while (listIterator.hasNext) {
    val result = listIterator.next()
    val convertedResult: PracticeQuestionTag = rowToModel(result) //rowToModel takes a Result and converts it into a PracticeQuestionTag
    finalList ++ List(convertedResult) //add to the list. I assumed the while would return List[PracticeQuestionTag] because it is the last statement of the block, but the while returns Unit
  }
  Right(answer) //answer is Unit, so the block returns Right[Nothing, Unit] :(
} else { Left(Error) }
Change the java.util.List to a Scala List as soon as possible. Then you can handle it in Scala fashion.
import scala.jdk.CollectionConverters._
val resultList = transaction.scan(scan).asScala.toList
Either.cond( resultList.nonEmpty
           , resultList.map(rowToModel(_))
           , new Error)
Your finalList: List[PracticeQuestionTag] = List() is an immutable Scala List, so you cannot change it: there is no way to add to, remove from, or modify this list.
One way to achieve this is the functional Scala approach (a sketch of that follows the code below); another is to use a mutable list, append to it, and let that list be the final value of the if expression.
Also, a while expression always evaluates to Unit; it never has a value. You can use while to build your answer and then return it separately.
val resultList: java.util.List[Result] = transaction.scan(scan)
if (resultList.isEmpty) {
  Left(Error)
} else {
  val listIterator = resultList.listIterator()
  val listBuffer: scala.collection.mutable.ListBuffer[PracticeQuestionTag] =
    scala.collection.mutable.ListBuffer()
  while (listIterator.hasNext) {
    val result = listIterator.next()
    val convertedResult: PracticeQuestionTag = rowToModel(result)
    listBuffer.append(convertedResult)
  }
  Right(listBuffer.toList)
}
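For reference, the functional approach mentioned above could look roughly like this (just a sketch, reusing rowToModel and Error from the question and the converters from the other answer):
import scala.jdk.CollectionConverters._
// convert the Java list once, then map over it; no mutable state and no while loop
val resultList = transaction.scan(scan)
if (resultList.isEmpty) Left(Error)
else Right(resultList.asScala.toList.map(rowToModel))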

Matching column names from a CSV file in Spark Scala

I want to take the headers (column names) from my CSV file and then match them against my existing headers.
I am using the code below:
val cc = sparksession.read.csv(filepath).take(1)
It gives me a value like:
Array([id,name,salary])
and I have created one more static schema, which gives me a value like this:
val ss=Array("id","name","salary")
and then I'm trying to compare the column names using an if condition:
if (cc == ss) {
  println("matched")
} else {
  println("not matched")
}
I guess that due to the [] and () mismatch it always goes to the else part. Is there any other way to compare these values without considering the [] and ()?
First, for convenience, set the header option to true when reading the file:
val df = sparksession.read.option("header", true).csv(filepath)
Get the column names and define the expected column names:
val cc = df.columns
val ss = Array("id", "name", "salary")
To check if the two match (not considering the ordering):
if (cc.toSet == ss.toSet) {
  println("matched")
} else {
  println("not matched")
}
If the order is relevant, then the condition can be done as follows (you can't use Array here but Seq works):
cc.toSeq == ss.toSeq
or you can use a deep array comparison:
cc.deep == ss.deep
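A side note in case you are on a newer Scala version: deep is deprecated in Scala 2.13, so an order-sensitive comparison is better written with sameElements (or with cc.toSeq == ss.toSeq as above):
// order-sensitive comparison of the two arrays, without .deep
cc.sameElements(ss)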
First of all, I think you are trying to compare an Array[org.apache.spark.sql.Row] with an Array[String]. I believe you should change how you load the headers to something like: val cc = spark.read.format("csv").option("header", "true").load(fileName).columns.toArray.
Then you could compare using cc.deep == ss.deep.
The below code worked for me:
val cc = spark.read.csv("filepath").take(1)(0).toString
The above code gave the output as the String [id,name,salary].
Then I created one static schema as
val ss = "[id,name,salary]"
and then wrote the if/else conditions.

Scala broadcast join with "one to many" relationship

I am fairly new to Scala and RDDs.
I have a very simple scenario yet it seems very hard to implement with RDDs.
Scenario:
I have two tables. One large and one small. I broadcast the smaller table.
I then want to join the tables and finally aggregate the values after the join into a final total.
Here is an example of the code:
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1, a._2) } // turn to pair RDD
                                        .collectAsMap()             // collect as Map
                               )
//first join
val joinedRDD = bigRDD.map( accs => {
  //get list of groups
  val groups = List("Fruit", "ZipCode")
  val i = "Fruit"
  //for each group
  //for(i <- groups) {
    if (broadcastVar.value.get(accs._1, i) != None) {
      ( broadcastVar.value.get(accs._1, i).get._1,
        broadcastVar.value.get(accs._1, i).get._2,
        accs._2, accs._3)
    } else {
      None
    }
  //}
})
//expected after this
//("A","Fruit","Apples",1, "1Jan2000"),("B","Fruit","Apples",2, "1Jan2000"),
//("A","ZipCode","1234", 1,"1Jan2000"),("B","ZipCode","456", 2,"1Jan2000")
//then group and sum
//cannot do anything with the joinedRDD!!!
//error == value copy is not a member of Product with Serializable
// Final Expected Result
//("Fruit","Apples",3, "1Jan2000"),("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
My questions:
First of all, is this the best approach with RDDs?
Disclaimer - I have done this whole task using dataframes successfully. The idea is to create another version using only RDDs to compare performance.
Why is the type of my joinedRDD not recognised after it was created so that I can continue to use functions like copy on it?
How can I get away with not doing a .collectAsMap() when broadcasting the variable? I currently have to include the first two items to enforce uniqueness and not drop any values.
Thanks for the help in advance!
Final solution, for anyone interested:
case class dt(group: String, group_key: String, count: Long, date: String)
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(smallRDD.keyBy{ a => (a._1) } // turn to pair RDD
                                        .groupByKey()         // to not lose any data
                                        .collectAsMap()       // collect as Map
                               )
//first join
val joinedRDD = bigRDD.flatMap( accs => {
  if (broadcastVar.value.get(accs._1) != None) {
    val bc = broadcastVar.value.get(accs._1).get
    bc.map(p => {
      dt(p._2, p._3, accs._2, accs._3)
    })
  } else {
    None
  }
})
//expected after this
//("Fruit","Apples",1, "1Jan2000"),("Fruit","Apples",2, "1Jan2000"),
//("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
//then group and sum
var finalRDD = joinedRDD.map(s => {
    (s.copy(count = 0), s.count) // trick to keep code to a minimum (count = 0)
  })
  .reduceByKey(_ + _)
  .map(pair => {
    pair._1.copy(count = pair._2)
  })
In your map statement you return either a tuple or None based on the if condition. These types do not match, so you fall back to a common supertype and joinedRDD is an RDD[Product with Serializable], which is not what you want at all (it's basically RDD[Any]). You need to make sure all paths return the same type. In this case, you probably want an Option[(String, String, Int, String)]. All you need to do is wrap the tuple result in a Some:
if (broadcastVar.value.get(accs._1, i) != None) {
  Some(( broadcastVar.value.get(accs._1, i).get.group_key,
         broadcastVar.value.get(accs._1, i).get.group,
         accs._2, accs._3))
} else {
  None
}
And now your types will match up. This will make joinedRDD an RDD[Option[(String, String, Int, String)]]. Now that the type is correct the data is usable; however, it means that you will need to map over the Option to work with the tuples. If you don't need the None values in the final result, you can use flatMap instead of map to create joinedRDD, which will unwrap the Options for you, filtering out all the Nones.
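A rough sketch of that flatMap variant (using the original broadcast map keyed by (name, group) and iterating the groups list instead of the hard-coded i):
val joinedRDD = bigRDD.flatMap { accs =>
  val groups = List("Fruit", "ZipCode")
  // Option.map turns a hit into Some(tuple) and a miss into None;
  // the surrounding flatMaps drop the Nones automatically.
  groups.flatMap { g =>
    broadcastVar.value.get((accs._1, g)).map { v =>
      (v._2, v._3, accs._2, accs._3)
    }
  }
}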
collectAsMap is the correct way to turn an RDD into a HashMap, but you need multiple values for a single key. Before using collectAsMap, and after mapping smallRDD into key/value pairs, use groupByKey to group all of the values for a single key together. Then, when you look up a key in your HashMap, you can map over the values, creating a new record for each one.
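A minimal sketch of that pattern (names reused from the question; getOrElse just replaces the explicit None check):
val broadcastVar = sc.broadcast(
  smallRDD.keyBy(_._1)    // key by the join column only
          .groupByKey()    // keep every (group, value) pair for that key
          .collectAsMap()  // Map[String, Iterable[(String, String, String)]]
)
val joinedRDD = bigRDD.flatMap { accs =>
  // one output record per matching (group, value) pair; keys with no match produce nothing
  broadcastVar.value.getOrElse(accs._1, Nil).map(p => (p._2, p._3, accs._2, accs._3))
}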

How to efficiently delete subset in spark RDD

While conducting research, I find it somewhat difficult to delete all the subsets in a Spark RDD.
The data structure is RDD[(key,set)]. For example, it could be:
RDD[ ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ]
Since mike's set (Set(1,3)) is a subset of peter's (Set(1,2,3)), I want to delete "mike", which will end up with
RDD[ ("peter",Set(1,2,3)), ("jack",Set(5)) ]
It is easy to implement locally in Python with two for loops, but when I want to scale this out to the cloud with Scala and Spark, it is not that easy to find a good solution.
Thanks
I doubt we can avoid comparing each element to each other (the equivalent of a double loop in a non-distributed algorithm). The subset relation between sets is not symmetric, meaning that we need to check both whether "alice" is a subset of "bob" and whether "bob" is a subset of "alice".
To do this using the Spark API, we can resort to multiplying the data with itself using a cartesian product and verifying each entry of the resulting matrix:
val data = Seq(("peter", Set(1,2,3)), ("mike", Set(1,3)), ("olga", Set(1,2,3)), ("anne", Set(7)), ("jack", Set(5,4,1)), ("lizza", Set(5,1)), ("bart", Set(5,4)), ("maggie", Set(5)))
// expected result from this dataset = peter, olga, anne, jack
val userSet = sparkContext.parallelize(data)
val prod = userSet.cartesian(userSet)
val subsetMembers = prod.collect { case ((name1, set1), (name2, set2)) if (name1 != name2) && set2.subsetOf(set1) && (set1 -- set2).nonEmpty => (name2, set2) }
val superset = userSet.subtract(subsetMembers)
// let's see the results:
superset.collect()
// Array[(String, scala.collection.immutable.Set[Int])] = Array((olga,Set(1, 2, 3)), (peter,Set(1, 2, 3)), (anne,Set(7)), (jack,Set(5, 4, 1)))
This can be achieved by using the RDD.fold function.
In this case the required output is a "List" (ItemList) of superset items. For this, the input should also be converted to a "List" (an RDD of ItemList).
import org.apache.spark.rdd.RDD
// type aliases for convenience
type Item = Tuple2[String, Set[Int]]
type ItemList = List[Item]
// Source RDD
val lst: RDD[Item] = sc.parallelize(List(("peter", Set(1,2,3)), ("mike", Set(1,3)), ("jack", Set(5))))
// Convert each element to a List. This is needed for using the fold function on the RDD,
// since the data type of the parameters has to be the same as the output data type
// of the fold function
val listOflst: RDD[ItemList] = lst.map(x => List(x))
// for each element in the second ItemList:
// - check whether it is a subset of any element in the first ItemList; if not, add it
// - remove any elements that are subsets of the newly added element
def combiner(first: ItemList, second: ItemList): ItemList = {
  def helper(lst: ItemList, i: Item): ItemList = {
    val isSubset: Boolean = lst.exists(x => i._2.subsetOf(x._2))
    if (isSubset) lst else i :: lst.filterNot(x => x._2.subsetOf(i._2))
  }
  second.foldLeft(first)(helper)
}
listOflst.fold(List())(combiner)
You can use a filter after a map. You can build a map function that returns the line you want to keep and None for what you want to delete. First build a function (in PySpark-style Python here):
def filter_mike(line):
    if line[1] != {1, 3}:
        return line
    else:
        return None
Then you can filter like this:
your_rdd.map(filter_mike).filter(lambda x: x is not None)
This will work.
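Since the question is about Scala, the same idea there would be roughly (a sketch; rdd stands for the RDD[(String, Set[Int])] from the question, and the set to drop is still hard-coded, unlike the general solutions above):
// keep every entry whose set is not the hard-coded Set(1, 3)
val result = rdd.filter { case (_, s) => s != Set(1, 3) }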