flatMapping in scala/spark

Looking for some assistance with how to do something in Scala using Spark.
I have:
type DistanceMap = HashMap[(VertexId,String), Int]
this forms part of my data in the form of an RDD of:
org.apache.spark.rdd.RDD[(DistanceMap, String)]
in short my dataset looks like this:
({(101,S)=3},piece_of_data_1)
({(101,S)=3},piece_of_data_2)
({(101,S)=1, (109,S)=2},piece_of_data_3)
What I want to do is flatMap my distance map (which I can do), but at the same time, for each flatMapped DistanceMap, I want to retain the string associated with it. So my resulting data would look like this:
({(101,S)=3},piece_of_data_1)
({(101,S)=3},piece_of_data_2)
({(101,S)=1},piece_of_data_3)
({(109,S)=2},piece_of_data_3)
As mentioned I can flatMap the first part using:
x.flatMap(x => x._1).collect.foreach(println)
but am stuck on how I can retain the string from the second part of my original data.

This might work for you:
x.flatMap(x => x._1.map(y => (y,x._2)))
The idea is to convert from (Seq(a,b,c),Value) to Seq( (a,Value), (b, Value), (c, Value)).
The same works on plain Scala collections, so here is a standalone, simplified example you can paste into the Scala REPL:
Seq((Seq("a","b","c"), 34), (Seq("r","t"), 2)).flatMap( x => x._1.map(y => (y,x._2)))
This results in:
res0: Seq[(String, Int)] = List((a,34), (b,34), (c,34), (r,2), (t,2))
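For completeness, here is a hedged, self-contained sketch of the same idea applied to the original shape of the data; it assumes a spark-shell where sc is available and uses Long in place of VertexId:
import scala.collection.immutable.HashMap

type DistanceMap = HashMap[(Long, String), Int]  // Long stands in for VertexId here

val data: org.apache.spark.rdd.RDD[(DistanceMap, String)] = sc.parallelize(Seq(
  (HashMap((101L, "S") -> 3), "piece_of_data_1"),
  (HashMap((101L, "S") -> 3), "piece_of_data_2"),
  (HashMap((101L, "S") -> 1, (109L, "S") -> 2), "piece_of_data_3")
))

// Each (map, string) pair becomes one (single-entry map, string) pair per map entry
val flattened = data.flatMap { case (distances, payload) =>
  distances.map(entry => (HashMap(entry), payload))
}

flattened.collect.foreach(println)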

update
I have an alternative solution: flip key with value, use the flatMapValues transformation, and then flip key with value again. See the pseudocode:
x.map(x => (x._2, x._1)).flatMapValues(x => x).map(x => (x._2, x._1))
previous version
I propose to add one preprocessing step (sorry, I have no computer with a Scala interpreter in front of me until tomorrow to come up with working code):
transform the pair RDD from (DistanceMap, String) into an RDD of lists of Tuple4s: List((VertexId, String, Int, String), ...)
apply flatMap on the result
Pseudocode:
rdd.map((DistanceMap, String) => List((VertexId, String, Int, String), ...))
  .flatMap(x => x)
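A hedged, runnable version of that pseudocode, assuming rdd has the RDD[(DistanceMap, String)] type from the question:
// Step 1: turn each (DistanceMap, String) into a list of Tuple4s
// Step 2: flatMap the lists away
val flattened = rdd
  .map { case (distances, payload) =>
    distances.toList.map { case ((vertex, label), dist) => (vertex, label, dist, payload) }
  }
  .flatMap(x => x)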

Related

Is there a better functional method to operate on Map[String,List[Int]]

I'm fairly new to Scala and functional programming, and I'm working on a project where I have grocery prices over 30 days and want to run some analysis over the data that I have.
The data is stored as a Map[String, List[Int]].
What I'm trying to do is get the lowest and highest price for each item. I did it like this, and then I have another function that loops over the returned Map and prints it.
def f(): Map[String, List[Int]] = {
  var result = Map.empty[String, List[Int]]
  for ((k, v) <- data) {
    var low = v.min
    var high = v.max
    result += (k -> List(low, high))
  }
  result
}
I think this is not the most functional method to do it, can anyone elaborate if there is a way to iterate over the data and return the result without creating an empty map?
The computation does not depend on the keys in any way, so there is no reason to introduce the ks anywhere, it's just distracting from the main goal. Just map the values:
data.view.mapValues(v => (v.min, v.max)).toMap
Also, your signature f() doesn't tell you anything useful. How do you know what it's doing? If you deleted the body of that function and were given only "f()", would you be able to unambiguously reconstruct the body? Would GPT be able to? Probably not.
Ideally, the signature should be precise enough so you never need to dig into the implementation bodies (and also that you don't actually have to write them). Here is a possible improvement:
def priceRanges(itemsToPrices: Map[String, List[Int]]): Map[String, (Int, Int)] =
  itemsToPrices.view.mapValues(v => (v.min, v.max)).toMap
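For example, with the same sample data that appears further down in this thread:
val data = Map(
  "Milk" -> List(9, 8, 7, 10),
  "Eggs" -> List(1, 3, 4, 3, 5, 2)
)

priceRanges(data)  // Map(Milk -> (7,10), Eggs -> (1,5))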
There are several ways to achieve this. I think one key aspect is readability, so while the following can be done as a pure one-liner, I think this could be a viable and readable solution:
data.map { case (k, v) =>
  k -> Seq(v.min, v.max)
}
Feel free to shorten it if you like.
This would also work, but it may be less readable for someone not used to functional programming.
data.map(kv => kv._1 -> Seq(kv._2.min, kv._2.max))
Another thing you may want to consider:
There is nothing that protects the List/Seq in the result type from containing more than two elements. You may want to use a tuple or create a custom type for it.
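A hedged sketch of what such a custom type could look like (PriceRange is a made-up name, not something from the question), following the same view.mapValues idiom used above:
final case class PriceRange(low: Int, high: Int)

def priceRanges(data: Map[String, List[Int]]): Map[String, PriceRange] =
  data.view.mapValues(prices => PriceRange(prices.min, prices.max)).toMap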
I love it when people encourage themselves to do functional Scala instead of the imperative style, so congratulations on that.
Returning to your question, I think the easiest way to solve this problem is with the famous map function: it takes a function as a parameter which describes how you want to transform each element within the collection. In your case, this function goes from the tuple (item, values), which in your question would be the (k, v), to a new similar tuple, but this time only with the "prices" we are interested in:
def getLowAndHighPrices(itemsWithPrices: Map[String, List[Int]]): Map[String, List[Int]] =
  itemsWithPrices.map((item, prices) => (item, List(prices.min, prices.max)))
You can read the previous map implementation as: for each entry (item, prices), convert it into the tuple (item, List(prices.min, prices.max)). The map function literally describes what you want to do, without spelling out exactly what steps to follow, because map takes care of that for you; that is, to me, one of the advantages of functional programming.
You can also print the results in a very “functional” way (ignoring the side effects):
// For demonstration purposes
val allItemPrices: Map[String, List[Int]] =
  Map(
    "Milk" -> List(9, 8, 7, 10),
    "Eggs" -> List(1, 3, 4, 3, 5, 2)
  )

def main(args: Array[String]): Unit =
  getLowAndHighPrices(allItemPrices).foreach((item, prices) => println(s"$item -> $prices"))

/**
 * Which prints out:
 * Milk -> List(7, 10)
 * Eggs -> List(1, 5)
 */
In this case, foreach does something very similar to map, with the difference that foreach is designed to perform side effects such as printing to the console.
I hope I made myself clear. Good luck on your Scala journey!

Char count in string

I'm new to Scala and FP in general and am trying to practice it on a dummy example.
val counts = ransomNote.map(e=>(e,1)).reduceByKey{case (x,y) => x+y}
The following error is raised:
Line 5: error: value reduceByKey is not a member of IndexedSeq[(Char, Int)] (in solution.scala)
The above example looks similar to the classic FP word-count primer, so I'd appreciate it if you pointed out my mistake.
It looks like you are trying to use a Spark method on a Scala collection. The two APIs have a few similarities, but reduceByKey is not one of them.
In pure Scala you can do it like this:
val counts =
  ransomNote.foldLeft(Map.empty[Char, Int].withDefaultValue(0)) {
    (counts, c) => counts.updated(c, counts(c) + 1)
  }
foldLeft iterates over the collection from the left, using the empty map of counts as the accumulated state (a map which returns 0 if no value is found); for each character, the function passed as an argument updates that map with the character's count, incremented by one.
Note that accessing a map directly (counts(c)) is likely to be unsafe in most situations, since it will throw an exception if no item is found. In this situation it's fine, because in this scope I know I'm using a map with a default value. When accessing a map you will more often than not want to use get, which returns an Option. More on that in the official Scala documentation (here for version 2.13.2).
You can play around with this code on Scastie.
On Scala 2.13 you can use the new groupMapReduce
ransomNote.groupMapReduce(identity)(_ => 1)(_ + _)
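For readers new to this method, here is an annotated version (the annotations are mine; groupMapReduce requires Scala 2.13 or later):
// groupMapReduce(key)(map)(reduce):
//   key:    identity -> group the characters by themselves
//   map:    _ => 1   -> turn every occurrence into a 1
//   reduce: _ + _    -> sum the 1s within each group
val counts: Map[Char, Int] = "ransom note".groupMapReduce(identity)(_ => 1)(_ + _)
// e.g. 'n' -> 2, 'o' -> 2, 'r' -> 1, ...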
val str = "hello"
val countsMap: Map[Char, Int] = str
  .groupBy(identity)
  .mapValues(_.length)
println(countsMap)
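Since the error came from reaching for Spark's API on a plain Scala collection, here is, for comparison, a hedged sketch of what the reduceByKey version would look like if the characters really were distributed with Spark (assuming a spark-shell where sc is available):
val ransomNote = "some ransom note"

val counts: Map[Char, Int] = sc
  .parallelize(ransomNote.toSeq)   // distribute the characters
  .map(c => (c, 1))
  .reduceByKey(_ + _)              // works here: this is now a pair RDD
  .collectAsMap()
  .toMap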

How to work with a Spark RDD to produce or map to another RDD

I have a key/value RDD. I want to iterate over the entities in it and create, or map to, another RDD which could have more or fewer entries than the first RDD.
Example:
I have records in Accumulo that represent observations of colors in paintings.
An observation entity/object holds data on the painting name and the colors in the painting.
Observation
public String getPaintingName() { return paintingName; }
public List<String> getObservedColors() { return colorList; }
I pull the observations from Accumulo into my code as an RDD.
val observationRDD: RDD[(Text, Observation)] = getObservationsFromAccumulo();
I want to take this RDD and create an RDD of the form of (Color, paintingName) where the key is the color observed and the value is the painting name which the color was observed in.
val colorToPaintingRDD: RDD[(String, String)] = observationRDD.somefunction({ case (_, observation) =>
for(String color : observations.getObservedColors()) {
// Some how output a entry into a new RDD
//output/map (color, observation.getPaintingName)
})
I know map can't work, because it's 1 to 1. I thought maybe observationRDD.flatMap(some function), but I can't seem to find any examples of how to do that to create a new, larger or smaller, RDD.
Could someone help me out and tell me if flatMap is correct, and if so give me an example using the example I provided, or tell me if I'm way off base?
Please understand this is just a simple example; it's not the content I'm asking about, it's how one would transform an RDD into an RDD with more or fewer entries.
You should use flatMap and return a List[(String, String)] for each element in the RDD. flatMap will flatten the result and you'll get an RDD[(String, String)].
I didn't try the code, but it would be something like this:
val colorToPaintingRDD: RDD[(String, String)] = observationRDD.flatMap { case (_, observation) =>
  observation.getObservedColors().map(color => (color, observation.getPaintingName))
}
If the getObservedColors method is defined in Java, you will probably have to import JavaConversions and convert to a Scala list:
import scala.collection.JavaConversions._
observation.getObservedColors().toList
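Note that scala.collection.JavaConversions has been deprecated since Scala 2.12; a hedged sketch of the same conversion with the explicit converter API (scala.jdk.CollectionConverters works the same way on 2.13+):
import scala.collection.JavaConverters._

val colorToPaintingRDD: RDD[(String, String)] = observationRDD.flatMap { case (_, observation) =>
  observation.getObservedColors.asScala.map(color => (color, observation.getPaintingName))
}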

Scala Spark map type matching issue

I'm trying to perform a series of transformations on log data with Scala, and I'm having difficulties with matching tuples. I have a data frame with user ids, urls and dates. I can map the data frame to an RDD and reduce by key with this map:
val countsRDD = usersUrlsDays.map {
  case Row(date: java.sql.Date, user_id: Long, url: String) => Tuple2(Tuple2(user_id, url), 1)
}.rdd.reduceByKey(_ + _)
This gives me an RDD of ((user_id, url), count):
scala> countsRDD.take(1)
res9: Array[((Long, String), Int)]
scala> countsRDD.take(1)(0)
res10: ((Long, String), Int)
Now I want to invert that by url to yield:
(url, [(user_id, count), ...])
I have tried this:
val urlIndex = countsRDD.map{ case Row(((user_id:Long, url:String), count:Int)) => Tuple2(url, List(Tuple2(user_id, count))) }.reduceByKey(_++_)
This produces match errors, however:
scala.MatchError: ... (of class scala.Tuple2)
I've tried many, many different permutations of these two map calls with explicitly and implicit types and this seems to have gotten me the farthest. I'm hoping that someone here can help point me in the right direction.
Something like this should work:
countsRDD
.map{ case ((user_id, url), count) => (url, (user_id, count)) }
.groupByKey
countsRDD is an RDD[((Long, String), Int)], not an RDD[Row].
There is no need to use TupleN; tuple literals will work just fine.
Since countsRDD is statically typed (unlike RDD[Row]), you don't have to specify types.
Don't use reduceByKey for list concatenation. It is the worst possible approach you can take: it ignores computational complexity, the garbage collector, and common sense. If you really need grouped data, use an operation designed for it, such as the groupByKey shown above.
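Putting it together, a hedged sketch of the second step, assuming countsRDD is the RDD[((Long, String), Int)] produced by the question's first map/reduceByKey:
val urlIndex = countsRDD
  .map { case ((user_id, url), count) => (url, (user_id, count)) }
  .groupByKey                // RDD[(String, Iterable[(Long, Int)])]
  .mapValues(_.toList)       // (url, [(user_id, count), ...])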

Spark: Cannot add RDD elements into a mutable HashMap inside a closure

I have the following code where rddMap is of org.apache.spark.rdd.RDD[(String, (String, String))], and myHashMap is scala.collection.mutable.HashMap.
I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map.
However, even though println(" t " + t) prints things, myHashMap later still has only the one element I manually put in at the beginning, ("test1", ("10", "20")).
Nothing from rddMap gets put into myHashMap.
Snippet code:
val myHashMap = new HashMap[String, (String, String)]
myHashMap.put("test1", ("10", "20"))
rddMap.map { t =>
  println(" t " + t)
  myHashMap.put(t._1, t._2)
}.saveAsTextFile("temp_out")
println(rddMap.count)
println(myHashMap.toString)
Why can't I put the elements from rddMap into myHashMap?
Here is a working example of what you want to accomplish.
val rddMap = sc.parallelize(Map("A" -> ("v", "v"), "B" -> ("d","d")).toSeq)
// Collects all the data in the RDD and converts the data to a Map
val myMap = rddMap.collect().toMap
myMap.foreach(println)
Output:
(A,(v,v))
(B,(d,d))
Here is similar code to what you've posted
rddMap.map { t =>
  println("t" + t)
  newHashMap.put(t._1, t._2)
  println(newHashMap.toString)
}.collect
Here is the output to the above code from the Spark shell
t(A,(v,v))
Map(A -> (v,v), test1 -> (10,20))
t(B,(d,d))
Map(test1 -> (10,20), B -> (d,d))
To me it looks like Spark copies your HashMap and does add the element to the copied map.
What you are trying to do is not really supported in Spark today.
Note that every user-defined function (e.g., what you add inside a map()) is a closure that gets serialized and pushed to each executor.
Therefore everything you have inside this map() gets serialized and gets transferred around:
.map { t =>
  println(" t " + t)
  myHashMap.put(t._1, t._2)
}
Essentially, your myHashMap will be copied to each executor, and each executor will be updating its own version of that HashMap. This is why, at the end of the execution, the myHashMap you have in your driver will never get changed. (The driver is the JVM that manages/orchestrates your Spark jobs. It's the place where you define your SparkContext.)
In order to push structures defined in the driver to all executors you need to broadcast them (see the Spark documentation on broadcast variables). Note that broadcast variables are read-only, so, again, using broadcasts will not help you here.
Another way is to use accumulators, but I feel these are more tuned towards summarizing numeric values, like computing sums, maxima, minima, etc. You could take a look at creating a custom accumulator that extends AccumulatorParam (see the Spark documentation on accumulators).
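For what it's worth, here is a hedged sketch of what such a custom accumulator could look like with the legacy AccumulatorParam API (deprecated in favour of AccumulatorV2 since Spark 2.0; the object name is made up for illustration):
import org.apache.spark.AccumulatorParam

object MapAccumParam extends AccumulatorParam[Map[String, (String, String)]] {
  def zero(initial: Map[String, (String, String)]) = Map.empty[String, (String, String)]
  def addInPlace(m1: Map[String, (String, String)], m2: Map[String, (String, String)]) = m1 ++ m2
}

val acc = sc.accumulator(Map.empty[String, (String, String)])(MapAccumParam)
rddMap.foreach { case (k, v) => acc += Map(k -> v) }  // executors add, the driver sees the merged result
println(acc.value)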
Coming back to the original question: if you want to collect values to your driver, currently the best way to do this is to transform your RDDs until they become a small, manageable collection of elements, and then collect() that final, small RDD.
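A minimal sketch of that pattern applied to this example (collectAsMap is available on pair RDDs and is fine as long as the final result is small):
// Keep the work distributed; collect only the final, small result on the driver
val smallMap: scala.collection.Map[String, (String, String)] = rddMap
  .reduceByKey((a, b) => a)   // example transformation: keep one value per key
  .collectAsMap()

println(smallMap)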