How to add more code logic to spark/scala's getOrElse()? - scala

In Spark's RDD join functions, we can use getOrElse() like this:
rdd_a.leftOuterJoin(rdd_b) { (id, data_a, data_b) =>
  data_b.getOrElse(0)
}
But I want to add more complex code logic to the getOrElse function, and I do not know how to do that.
For example, say I want to generate an array of Gaussian random variables:
rdd_a.leftOuterJoin(rdd_b) { (id, data_a, data_b) =>
  val arr = new Array[Double](10).map(_ => Utils.random.nextGaussian())
  data_b.getOrElse(arr)
}
But only a few pairs are missing data_b after the leftOuterJoin, so generating an array for every joined pair is a waste...

You can build the array inside the .getOrElse():
rdd_a.leftOuterJoin(rdd_b) { (id, data_a, data_b) =>
  data_b.getOrElse {
    val arr = new Array[Double](10)
    arr.map(_ => Utils.random.nextGaussian())
  }
}
Option's .getOrElse() takes its default by name, so the default is only evaluated when the Option is empty (I don't have Spark handy, but here is an example in plain Scala):
scala> List().headOption.getOrElse{ println("Building an array then."); Array(1) }
Building an array then.
res1: Array[Int] = Array(1)
scala> List(1).headOption.getOrElse{ println("Building an array then."); Array(1) }
res2: Any = 1
By the way, you can create an n-element array with specific values directly using Array.fill (rather than creating it and then mapping over it):
scala> Array.fill(3)(Random.nextGaussian)
res6: Array[Double] = Array(-0.2773138805049755, -1.4178827462945545, -0.8710624835785054)
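Putting the two together in Spark, a minimal sketch (assuming rdd_a: RDD[(Int, Double)] and rdd_b: RDD[(Int, Array[Double])], and using scala.util.Random in place of Spark's internal Utils.random):
import scala.util.Random
// leftOuterJoin yields RDD[(K, (A, Option[B]))]; because getOrElse takes its
// default by name, the random array is only built for keys missing from rdd_b
val joined = rdd_a.leftOuterJoin(rdd_b).mapValues { case (a, maybeB) =>
  (a, maybeB.getOrElse(Array.fill(10)(Random.nextGaussian())))
}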

Related

Array of tuples to Map while retaining insertion order

def arrayToMap(fields: Array[CustomClass]): Map[String, CustomClass] = {
  val fieldData = fields.map(f => f.name -> CustomClass(f.name)) // This is Array[(String, CustomClass)], and order is fine at this point
  fieldData.toMap // order gets jumbled up
  /*
  What I've also tried:
  Map(fieldData: _*)
  */
}
Why is converting the Array to a Map messing up the order? Is there a way to retain the order of the Array of tuples when converting to a Map?
The default immutable Map makes no guarantee about iteration order (beyond four entries it is hash-based), so the order gets jumbled; ListMap, by contrast, preserves insertion order. The solution is therefore to use ListMap rather than Map, although the question remains why the order should matter for a Map. Also, Array is a Java type rather than a pure Scala type, so take a Seq instead to allow Scala collection types to be passed as well.
import scala.collection.immutable.ListMap
def arrayToMap(fields: Seq[CustomClass]): ListMap[String, CustomClass] =
ListMap(fields.map(f => f.name -> CustomClass(f.name)):_*)
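A quick way to convince yourself of the difference (with plain strings standing in for CustomClass):
import scala.collection.immutable.ListMap
val pairs = Seq("b" -> 2, "a" -> 1, "z" -> 3, "c" -> 4, "d" -> 5)
pairs.toMap.keys.mkString(", ")        // hash-based Map: iteration order is not guaranteed
ListMap(pairs: _*).keys.mkString(", ") // ListMap: "b, a, z, c, d"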

Scala broadcast join with "one to many" relationship

I am fairly new to Scala and RDDs.
I have a very simple scenario yet it seems very hard to implement with RDDs.
Scenario:
I have two tables. One large and one small. I broadcast the smaller table.
I then want to join the tables and finally aggregate the joined values into a final total.
Here is an example of the code:
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(
  smallRDD.keyBy { a => (a._1, a._2) } // turn into a pair RDD
    .collectAsMap()                    // collect as a Map
)
//first join
val joinedRDD = bigRDD.map( accs => {
  //get list of groups
  val groups = List("Fruit", "ZipCode")
  val i = "Fruit"
  //for each group
  //for(i <- groups) {
  if (broadcastVar.value.get(accs._1, i) != None) {
    ( broadcastVar.value.get(accs._1, i).get._1,
      broadcastVar.value.get(accs._1, i).get._2,
      accs._2, accs._3)
  } else {
    None
  }
  //}
})
//expected after this
//("A","Fruit","Apples",1, "1Jan2000"),("B","Fruit","Apples",2, "1Jan2000"),
//("A","ZipCode","1234", 1,"1Jan2000"),("B","ZipCode","456", 2,"1Jan2000")
//then group and sum
//cannot do anything with the joinedRDD!!!
//error == value copy is not a member of Product with Serializable
// Final Expected Result
//("Fruit","Apples",3, "1Jan2000"),("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
My questions:
First of all, is this the best approach with RDDs?
Disclaimer - I have done this whole task using dataframes successfully. The idea is to create another version using only RDDs to compare performance.
Why is the type of my joinedRDD not recognised after it was created so that I can continue to use functions like copy on it?
How can I get away with not doing a .collectAsMap() when broadcasting the variable? I currently have to include the first two items to enforce uniqueness and avoid dropping any values.
Thanks for the help in advance!
Final solution for anyone interested
case class dt (group:String, group_key:String, count:Long, date:String)
val bigRDD = sc.parallelize(List(("A",1,"1Jan2000"),("B",2,"1Jan2000"),("C",3,"1Jan2000"),("D",3,"1Jan2000"),("E",3,"1Jan2000")))
val smallRDD = sc.parallelize(List(("A","Fruit","Apples"),("A","ZipCode","1234"),("B","Fruit","Apples"),("B","ZipCode","456")))
val broadcastVar = sc.broadcast(
  smallRDD.keyBy { a => (a._1) } // turn into a pair RDD
    .groupByKey()                // so as not to lose any data
    .collectAsMap()              // collect as a Map
)
//first join
val joinedRDD = bigRDD.flatMap( accs => {
  if (broadcastVar.value.get(accs._1) != None) {
    val bc = broadcastVar.value.get(accs._1).get
    bc.map(p => {
      dt(p._2, p._3, accs._2, accs._3)
    })
  } else {
    None
  }
})
//expected after this
//("Fruit","Apples",1, "1Jan2000"),("Fruit","Apples",2, "1Jan2000"),
//("ZipCode","1234", 1,"1Jan2000"),("ZipCode","456", 2,"1Jan2000")
//then group and sum
val finalRDD = joinedRDD
  .map(s => (s.copy(count = 0), s.count)) // trick to keep the code to a minimum (count = 0)
  .reduceByKey(_ + _)
  .map(pair => pair._1.copy(count = pair._2))
In your map statement you return either a tuple or None based on the if condition. These types do not match, so you fall back to a common supertype and joinedRDD becomes an RDD[Product with Serializable], which is not what you want at all (it's basically RDD[Any]). You need to make sure all paths return the same type. In this case, you probably want an Option[(String, String, Int, String)]. All you need to do is wrap the tuple result in a Some:
if (broadcastVar.value.get(accs._1, i) != None) {
  Some(( broadcastVar.value.get(accs._1, i).get._1,
         broadcastVar.value.get(accs._1, i).get._2,
         accs._2, accs._3))
} else {
  None
}
And now your types will match up. This makes joinedRDD an RDD[Option[(String, String, Int, String)]]. Now that the type is correct the data is usable; however, it means that you will need to map over the Option to work with the tuples. If you don't need the None values in the final result, you can use flatMap instead of map to create joinedRDD, which will unwrap the Options for you, filtering out all the Nones.
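The unwrapping behaviour is easy to see on a plain Scala collection (the same idea applies to the RDD):
val opts = List(Some(("A", "Fruit", 1, "1Jan2000")), None, Some(("B", "Fruit", 2, "1Jan2000")))
// flatMap unwraps the Options and drops the Nones:
opts.flatMap(x => x)
// List((A,Fruit,1,1Jan2000), (B,Fruit,2,1Jan2000))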
collectAsMap is the correct way to turn an RDD into a HashMap, but you need multiple values for a single key. Before using collectAsMap, but after mapping the smallRDD into (key, value) pairs, use groupByKey to group all of the values for a single key together. Then, when you look up a key from your HashMap, you can map over the values, creating a new record for each one.
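As a small illustration of that lookup step (assuming broadcastVar holds the Map[String, Iterable[(String, String, String)]] built in the final solution above):
// each lookup returns all of the small-table rows for that key,
// and one output record is emitted per row
val rows = broadcastVar.value.getOrElse("A", Iterable.empty)
rows.map { case (_, group, groupKey) => (group, groupKey) }
// -> something like List((Fruit,Apples), (ZipCode,1234))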

How to efficiently delete subset in spark RDD

When conducting research, I found it somewhat difficult to delete all the subsets in a Spark RDD.
The data structure is RDD[(key,set)]. For example, it could be:
RDD[ ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ]
Since the set of mike (Set(1,3)) is a subset of peter's (Set(1,2,3)), I want to delete "mike", which will end up with
RDD[ ("peter",Set(1,2,3)), ("jack",Set(5)) ]
It is easy to implement locally in Python with two for loops, but when I want to scale this out to a cluster with Scala and Spark, it is not that easy to find a good solution.
Thanks
I doubt we can avoid comparing each element with every other (the equivalent of a double loop in a non-distributed algorithm). The subset relation between sets is not symmetric, meaning that we need to check both whether "alice" is a subset of "bob" and whether "bob" is a subset of "alice".
To do this using the Spark API, we can resort to multiplying the data with itself using a cartesian product and verifying each entry of the resulting matrix:
val data = Seq(("peter",Set(1,2,3)), ("mike",Set(1,3)), ("anne", Set(7)),("jack",Set(5,4,1)), ("lizza", Set(5,1)), ("bart", Set(5,4)), ("maggie", Set(5)))
// expected result from this dataset = peter, olga, anne, jack
val userSet = sparkContext.parallelize(data)
val prod = userSet.cartesian(userSet)
val subsetMembers = prod.collect{case ((name1, set1), (name2,set2)) if (name1 != name2) && (set2.subsetOf(set1)) && (set1 -- set2).nonEmpty => (name2, set2) }
val superset = userSet.subtract(subsetMembers)
// lets see the results:
superset.collect()
// Array[(String, scala.collection.immutable.Set[Int])] = Array((olga,Set(1, 2, 3)), (peter,Set(1, 2, 3)), (anne,Set(7)), (jack,Set(5, 4, 1)))
This can be achieved by using the RDD.fold function.
In this case the output required is a "List" (ItemList) of superset items. For this, the input should also be converted to a "List" (an RDD of ItemList).
import org.apache.spark.rdd.RDD
// type alias for convenience
type Item = Tuple2[String, Set[Int]]
type ItemList = List[Item]
// Source RDD
val lst:RDD[Item] = sc.parallelize( List( ("peter",Set(1,2,3)), ("mike",Set(1,3)), ("jack",Set(5)) ) )
// Convert each element as a List. This is needed for using fold function on RDD
// since the data-type of the parameters are the same as output parameter
// data-type for fold function
val listOflst:RDD[ItemList] = lst.map(x => List(x))
// For each element in the second ItemList:
// - if it is not a subset of any element in the first ItemList, add it
// - then remove any existing elements that are subsets of the newly added element
def combiner(first: ItemList, second: ItemList): ItemList = {
  def helper(lst: ItemList, i: Item): ItemList = {
    val isSubset: Boolean = lst.exists(x => i._2.subsetOf(x._2))
    if (isSubset) lst else i :: lst.filterNot(x => x._2.subsetOf(i._2))
  }
  second.foldLeft(first)(helper)
}
listOflst.fold(List())(combiner)
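If I read the combiner correctly, running this on the sample RDD leaves only the non-subset entries:
val result = listOflst.fold(List())(combiner)
// expected: something like List((jack,Set(5)), (peter,Set(1,2,3)))
// "mike" is dropped because Set(1,3) is a subset of Set(1,2,3); the ordering may vary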
You can use filter after a map.
You can build a map function that returns the line you want to keep, or None for what you want to delete. First build the function (this answer uses Python/PySpark):
def filter_mike(line):
    if line[1] != {1, 3}:
        return line
    else:
        return None
Then you can filter like this:
your_rdd.map(filter_mike).filter(lambda x: x is not None)
This will work.
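For reference, a rough Scala equivalent of the same idea (assuming your_rdd has type RDD[(String, Set[Int])] as in the question):
// keep every record whose set is not exactly Set(1, 3); like the Python
// version above, this hard-codes the value to drop
val filtered = your_rdd.filter { case (_, s) => s != Set(1, 3) }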

Looping through a list of tuples in Scala

I have a sample List as below
List[(String, Object)]
How can I loop through this list using for?
I want to do something like
for(str <- strlist)
but for the list of tuples above. What would be the placeholder for str?
Here it is,
scala> val fruits: List[(Int, String)] = List((1, "apple"), (2, "orange"))
fruits: List[(Int, String)] = List((1,apple), (2,orange))
scala>
scala> fruits.foreach {
| case (id, name) => {
| println(s"$id is $name")
| }
| }
1 is apple
2 is orange
Note: The expected type requires a one-argument function accepting a 2-Tuple.
Consider a pattern matching anonymous function, { case (id, name) => ... }
Easy to copy code:
val fruits: List[(Int, String)] = List((1, "apple"), (2, "orange"))
fruits.foreach {
  case (id, name) => {
    println(s"$id is $name")
  }
}
With for you can extract the elements of the tuple,
for ( (s,o) <- list ) yield f(s,o)
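For example, with the fruits list from the answer above:
val described = for ((id, name) <- fruits) yield s"$id is $name"
// described: List[String] = List(1 is apple, 2 is orange)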
I suggest using map, filter, fold, or foreach (whatever suits your need) rather than iterating over a collection with a loop.
Edit 1:
For example, if you want to apply some function foo(tuple) to each element:
val newList = oldList.map(tuple => foo(tuple))
val tupleStrings = tupleList.map(tuple => tuple._1) // in your situation
If you want to filter according to some boolean condition:
val newList = oldList.filter(tuple => someCondition(tuple))
Or simply, if you want to print your List:
oldList.foreach(tuple => println(tuple)) // assuming tuple is printable
You can find examples of these and similar functions here: https://twitter.github.io/scala_school/collections.html
If you just want to get the strings you could map over your list of tuples like this:
// Just some example object
case class MyObj(i: Int = 0)
// Create a list of tuples like you have
val tuples = Seq(("a", new MyObj), ("b", new MyObj), ("c", new MyObj))
// Get the strings from the tuples
val strings = tuples.map(_._1)
// Output: Seq[String] = List(a, b, c)
Note: Tuple members are accessed using the underscore notation (which
is indexed from 1, not 0)

Filter a list by item index?

val data = List("foo", "bar", "bash")
val selection = List(0, 2)
val selectedData = data.filter(datum => selection.contains(datum.MYINDEX))
// INVALID CODE HERE ^
// selectedData: List("foo", "bash")
Say I want to filter a List given a list of selected indices. If, in the filter method, I could reference the index of a list item then I could solve this as above, but datum.MYINDEX isn't valid in the above case.
How could I do this instead?
How about using zipWithIndex to keep a reference to the item's index, filtering as such, then mapping the index away?
data.zipWithIndex
  .filter { case (datum, index) => selection.contains(index) }
  .map(_._1)
It's neater to do it the other way around, although this is potentially slow with Lists, since indexing into a List is O(n); Vectors would be better. On the other hand, running the other solution's contains check for every item in data isn't exactly fast either.
val data = List("foo", "bar", "bash")
//> data : List[String] = List(foo, bar, bash)
val selection = List(0, 2)
//> selection : List[Int] = List(0, 2)
selection.map(index=>data(index))
//> res0: List[String] = List(foo, bash)
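If indexing cost matters, the same idea with a Vector (whose apply is effectively constant time) might look like this:
val data = Vector("foo", "bar", "bash")
val selection = List(0, 2)
val selectedData = selection.map(data)
//> selectedData : List[String] = List(foo, bash)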
The first solution that came to my mind was to create a list of (element, index) pairs, filter every element by checking whether selection contains that index, then map over the resulting list to keep only the raw elements (omitting the index). The code is self-explanatory:
data.zipWithIndex.filter(pair => selection.contains(pair._2)).map(_._1)
or more readable:
val elemsWithIndices = data.zipWithIndex
val filteredPairs = elemsWithIndices.filter(pair => selection.contains(pair._2))
val selectedElements = filteredPairs.map(_._1)
This works:
val data = List("foo", "bar", "bash")
val selection = List(0, 2)
val selectedData = data.filter(datum => selection.contains(data.indexOf(datum)))
println (selectedData)
Output:
List(foo, bash)
Since you have a list of indices already, the most efficient way is to pick those indices directly:
val data = List("foo", "bar", "bash")
val selection = List(0, 2)
val selectedData = selection.map(index => data(index))
or even:
val selectedData = selection.map(data)
or if you need to preserve the order of the items in data:
val selectedData = selection.sorted.map(data)
UPDATED
In the spirit of finding all the possible algorithms, here's the version using collect:
val selectedData = data
  .zipWithIndex
  .collect {
    case (item, index) if selection.contains(index) => item
  }
The following is probably the most scalable way to do it in terms of efficiency, and unlike many answers on SO, it actually follows the official Scala style guide exactly.
import scala.collection.immutable.HashSet
val selectionSet = HashSet(selection: _*)
data.zipWithIndex.collect {
  case (datum, index) if selectionSet.contains(index) => datum
}
If the resulting collection is to be passed to additional map, flatMap, etc. calls, consider turning data into a lazy sequence. In fact you could do this anyway to avoid two passes (one for the zipWithIndex, one for the collect), but I doubt it would gain much when benchmarked.
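A possible one-pass variant along those lines (Scala 2.13 view syntax; treat this as a sketch):
val selectedData = data.view
  .zipWithIndex
  .collect { case (datum, index) if selectionSet.contains(index) => datum }
  .toList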
There is actually an easier way to filter by index, using the map method. Here is an example:
val indices = List(0, 2)
val data = List("a", "b", "c")
println(indices.map(data)) // will print List("a", "c")