Joining pairs of key-value with pairs of key-map - scala

I have this dataset:
(apple,1)
(banana,4)
(orange,3)
(grape,2)
(watermelon,2)
and the other dataset is:
(apple,Map(Bob -> 1))
(banana,Map(Chris -> 1))
(orange,Map(John -> 1))
(grape,Map(Smith -> 1))
(watermelon,Map(Phil -> 1))
I am aiming to combine both datasets to get:
(apple,1,Map(Bob -> 1))
(banana,4,Map(Chris -> 1))
(orange,3,Map(John -> 1))
(grape,2,Map(Smith -> 1))
(watermelon,2,Map(Phil -> 1))
The code I have:
...
val counts_firstDataset = words.map(word =>
(word.firstWord, 1)).reduceByKey{case (x, y) => x + y}
Second dataset:
...
val counts_secondDataset = secondSet.map(x => (x._1,
x._2.toList.groupBy(identity).mapValues(_.size)))
I tried to use the join method (val joined_data = counts_firstDataset.join(counts_secondDataset)), but it did not work because join operates on pairs of [K, V] rather than producing the three-element tuples shown above. How would I get around this issue?

The easiest way is just to convert to DataFrames and then join:
import spark.implicits._
val counts_firstDataset = words
.map(word => (word.firstWord, 1))
.reduceByKey{case (x, y) => x + y}
.toDF("type", "value")
val counts_secondDataset = secondSet
.map(x => (x._1,x._2.toList.groupBy(identity).mapValues(_.size)))
.toDF("type_2","map")
counts_firstDataset
  .join(counts_secondDataset, $"type" === $"type_2")
  .drop("type_2")

As the first elements (the fruit names) of both datasets are in the same order, you can combine the two RDDs of tuples using zip and then use map to reshape each nested pair into a flat triple, in the following way:
counts_firstDataset.zip(counts_secondDataset)
.map(vk => (vk._1._1, vk._1._2, vk._2._2))
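Note that zip requires both RDDs to have the same number of partitions and the same number of elements in each partition. If that assumption ever breaks, a plain RDD join works too; it returns the nested shape (key, (value1, value2)), which you can then flatten. A minimal sketch, assuming both RDDs are keyed by the fruit name:
val joined = counts_firstDataset
  .join(counts_secondDataset)                                    // RDD[(String, (Int, Map[String, Int]))]
  .map { case (fruit, (count, names)) => (fruit, count, names) } // flatten to triples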

Related

foldLeft: if exists, add to a map, else update values

I wanted to try out foldLeft, similar to the reduceByKey function.
If the letter exists, increment its count by the value; otherwise append the tuple to the HashMap.
The code below fails:
val output = input.toLowerCase.filter(Character.isLetter).map(x => (x, 1))
  .foldLeft(HashMap.empty[Char, Int].withDefaultValue(0)) { case (acc, (x, y)) => acc += x }
Please suggest.
With Scala 2.13 you can use the new groupMapReduce().
val output = "In-Pint".collect{case c if c.isLetter => c.toLower}
.groupMapReduce(identity)(_ => 1)(_+_)
//output: Map[Char,Int] = Map(p -> 1, t -> 1, i -> 2, n -> 2)
Breaking down your code snippet:
.toLowerCase.filter(Character.isLetter)
As showcased in #jwvh's answer, this can be simplified to .collect{case c if c.isLetter => c.toLower}
.map(x => (x, 1))
This transformation is unnecessary if you intend to use foldLeft.
.foldLeft(HashMap.empty[Char,Int].withDefaultValue(0)){case (acc, (x,y)) => acc += x}
This wouldn't compile: += desugars to the assignment acc = acc + x, which cannot be applied because acc is not a var; and even then, an immutable HashMap's + expects a key-value pair, not a bare Char.
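If you do want to keep the mutable HashMap with a default value, the fold compiles once the update returns the accumulator; the map step can be dropped since the default takes care of missing keys. A minimal sketch, assuming scala.collection.mutable:
import scala.collection.mutable

val output = input.toLowerCase.filter(_.isLetter)
  .foldLeft(mutable.HashMap.empty[Char, Int].withDefaultValue(0)) { (acc, c) =>
    acc(c) += 1 // in-place update; missing keys default to 0
    acc         // foldLeft's operator must return the accumulator
  }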
For counting distinct characters in a string, your foldLeft can be formulated as shown below:
"abac".foldLeft(Map[Char, Int]()){
case (m, c) => m + (c -> (m.getOrElse(c, 0) + 1))
}
// res1: scala.collection.immutable.Map[Char,Int] = Map(a -> 2, b -> 1, c -> 1)
The idea is simple: in foldLeft's binary operator, add c -> m(c) + 1 to the existing Map if c is already present; otherwise add c -> 0 + 1.
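On Scala 2.13+, the same insert-or-increment step can also be spelled with updatedWith, which makes both cases explicit:
"abac".foldLeft(Map.empty[Char, Int]) { (m, c) =>
  m.updatedWith(c) {
    case Some(n) => Some(n + 1) // key exists: increment
    case None    => Some(1)     // key missing: start at 1
  }
}
// Map(a -> 2, b -> 1, c -> 1)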

loop with accumulator on an rdd

I want to loop n times, where n is an accumulator, over the same RDD.
Let's say n = 10; I want the code below to loop 5 times (since the accumulator is increased by two on each pass):
val key = keyAcm.value.toInt
val rest = rdd.filter(_._1 > (key + 1))
val combined = rdd.filter(k => (k._1 == key) || (k._1 == key + 1))
.map(x => (key, x._2))
.reduceByKey { case (x, y) => (x ++ y) }
keyAcm.add(2)
combined.union(rest)
With this code I filter the RDD, keeping keys 0 (the initial value of the accumulator) and 1. Then I merge their second elements and change the key, to create a new RDD with key 0 and a merged array. After that, I union this RDD with the original one, leaving behind the filtered keys (0 and 1). Lastly, I increase the accumulator by two. How can I repeat these steps until the accumulator reaches 10?
Any ideas?
val rdd: RDD[(Int, String)] = ???
val res: RDD[(Int, Iterable[String])] = rdd.map(x => (x._1 / 2, x._2)).groupByKey()
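The trick is that integer division by 2 sends keys 0 and 1 to bucket 0, keys 2 and 3 to bucket 1, and so on, so a single groupByKey replaces the whole accumulator loop. A quick illustration with hypothetical sample data:
val rdd = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "d")))
val res = rdd.map(x => (x._1 / 2, x._2)).groupByKey()
// res.collect() contains (0, [a, b]) and (1, [c, d])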

merge sets' elements with HashMap's key in scala

I hope there is an easy way to solve this.
I have two RDDs
g.vertices
(4,Set(5, 3))
(0,Set(1, 4))
(1,Set(2))
(6,Set())
(3,Set(0))
(5,Set(2))
(2,Set(1))
maps
Map(4 -> Set(5, 3))
Map(0 -> Set(1, 4))
Map(1 -> Set(2))
Map(6 -> Set())
Map(3 -> Set(0))
Map(5 -> Set(2))
Map(2 -> Set(1))
How can I do something like this?
(4,Map(5 -> Set(2), 3 -> Set(0)))
(0,Map(1 -> Set(2), 4 -> Set(5, 3)))
(1,Map(2 -> Set(1)))
(6,Map())
(3,Map(0 -> Set(1, 4)))
(5,Map(2 -> Set(1)))
(2,Map(1 -> Set(2)))
I want to combine the maps' keys with the elements of the sets, i.e., replace each element of a set with the map entry whose key matches it.
I thought about
val maps = g.vertices.map { case (id, attr) => HashMap(id -> attr) }
g.mapVertices { case (id, data) => data.map { case vId => maps
  .map { case i if i.keySet.contains(vId) => HashMap(vId -> i.values) } } }
but I have an error
org.apache.spark.SparkException: RDD transformations and actions can
only be invoked by the driver, not inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x) is invalid because
the values transformation and count action cannot be performed inside
of the rdd1.map transformation. For more information, see SPARK-5063.
This is a simple use case for join. In the following code, A is the type of the keys in g.vertices, K and V are the key and value types for maps:
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def joinByKeys[A: ClassTag, K: ClassTag, V: ClassTag](sets: RDD[(A, Set[K])], maps: RDD[Map[K, V]]): RDD[(A, Map[K, V])] = {
  val flattenSets = sets.flatMap(p => p._2.map(_ -> p._1)) // create a pair for each element of a vertex's set
  val flattenMaps = maps.flatMap(identity) // create an RDD of all the entries in the Maps
  flattenMaps.join(flattenSets).map { // join them by their key
    case (k, (v, a)) => (a, (k, v)) // reorder to make the vertex id the key
  }.aggregateByKey(Map.empty[K, V])(_ + _, _ ++ _) // rebuild one map per vertex id
}
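Applied to the data in the question, the call is just (a sketch; g.vertices and maps are the two RDDs shown above):
val merged = joinByKeys(g.vertices, maps)
merged.collect().foreach(println)
// e.g. (4, Map(5 -> Set(2), 3 -> Set(0)))
One caveat: ids whose set is empty (like 6 above) contribute no pairs to flattenSets and so drop out of the result; if you need them, union the missing ids back in with empty maps afterwards.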

How to un-nest a spark rdd that has the following type ((String, scala.collection.immutable.Map[String,scala.collection.immutable.Map[String,Int]]))

It's a nested map, with contents like this when I print it to the screen:
(5, Map ( "ABCD" -> Map("3200" -> 3,
"3350.800" -> 4,
"200.300" -> 3)
(1, Map ( "DEF" -> Map("1200" -> 32,
"1320.800" -> 4,
"2100" -> 3)
I need to get something like this:
CaseClass(5, ABCD, 3200, 3)
CaseClass(5, ABCD, 3350.800, 4)
CaseClass(5, ABCD, 200.300, 3)
CaseClass(1, DEF, 1200, 32)
CaseClass(1, DEF, 1320.800, 4)
etc.
Basically I need a list of case-class instances, mapped to a case class object so that I can save it to Cassandra.
I have tried flatMapValues, but that un-nests the map only one level. I also used flatMap; that doesn't work either, or I'm making mistakes.
Any suggestions?
Fairly straightforward using a for-comprehension and some pattern matching to destructure things:
val in = List((5, Map ( "ABCD" -> Map("3200" -> 3, "3350.800" -> 4, "200.300" -> 3))),
(1, Map ("DEF" -> Map("1200" -> 32, "1320.800" -> 4, "2100" -> 3))))
case class Thing(a:Int, b:String, c:String, d:Int)
for { (index, m) <- in
(k,v) <-m
(innerK, innerV) <- v}
yield Thing(index, k, innerK, innerV)
//> res0: List[maps.maps2.Thing] = List(Thing(5,ABCD,3200,3),
// Thing(5,ABCD,3350.800,4),
// Thing(5,ABCD,200.300,3),
// Thing(1,DEF,1200,32),
// Thing(1,DEF,1320.800,4),
// Thing(1,DEF,2100,3))
So let's pick apart the for-comprehension.
(index, m) <- in
This is the same as
t <- in
(index, m) = t
In the first line t will successively be set to each element of in.
t is therefore a tuple (Int, Map(...)).
Pattern matching lets us put that "pattern" for the tuple on the left-hand side, and the compiler picks apart the tuple, setting index to the Int and m to the Map.
(k, v) <- m
As before, this is equivalent to
u <- m
(k, v) = u
And this time u takes each element of the Map, which again are tuples of key and value. So k is set successively to each key and v to the corresponding value.
And v is your inner map, so we do the same thing again with the inner map:
(innerK, innerV) <- v
Now we have everything we need to create the case class. yield just says make a collection of whatever is "yielded" each time through the loop.
yield Thing(index, k, innerK, innerV)
Under the hood, this just translates to a set of maps/flatMaps.
The yield is just the value Thing(index, k, innerK, innerV)
We get one of those for each element of v
v.map{x => val (innerK, innerV) = x; Thing(index, k, innerK, innerV)}
but there's an inner map per element of the outer map
m.flatMap{y => val (k, v) = y; v.map{x => val (innerK, innerV) = x; Thing(index, k, innerK, innerV)}}
(flatMap, because we would get a List of Lists if we just did a map, and we want to flatten it to a single list of items.)
Similarly, we do one of those for every element in the List
in.flatMap{z => val (index, m) = z; m.flatMap{y => val (k, v) = y; v.map{x => val (innerK, innerV) = x; Thing(index, k, innerK, innerV)}}}
Let's do that in _1, _2 style:
in.flatMap{z => z._2.flatMap{y => y._2.map{x => Thing(z._1, y._1, x._1, x._2)}}}
which produces exactly the same result. But isn't it clearer as a for-comprehension?
You can do it like this if you prefer collection operations:
case class Record(v1: Int, v2: String, v3: Double, v4: Int)
val data = List(
  (5, Map("ABC" ->
    Map(
      3200.0 -> 3,
      3350.800 -> 4,
      200.300 -> 3))
  ),
  (1, Map("DEF" ->
    Map(
      1200.0 -> 32,
      1320.800 -> 4,
      2100.0 -> 3))
  )
)
val rdd = sc.parallelize(data)
val result = rdd.flatMap(p => {
p._2.toList
.flatMap(q => q._2.toList.map(l => (q._1, l)))
.map((p._1, _))
}).map(p => Record(p._1, p._2._1, p._2._2._1, p._2._2._2))
println(result.collect.toList)
//List(
// Record(5,ABC,3200.0,3),
// Record(5,ABC,3350.8,4),
// Record(5,ABC,200.3,3),
// Record(1,DEF,1200.0,32),
// Record(1,DEF,1320.8,4),
// Record(1,DEF,2100.0,3)
//)

Can map function be applied to Tuple?

I have a Tuple of type:
val data1: (String, scala.collection.mutable.ArrayBuffer[(String, Int)]) =
  ("", scala.collection.mutable.ArrayBuffer(("a", 1), ("b", 1), ("a", 1)))
When I attempt to map using data1.map(m => println(m)) I receive the error:
value map is not a member of (String, scala.collection.mutable.ArrayBuffer[(String, Int)])
Is it possible to use the map function with the tuple accessor syntax ._2? Syntax like data1.map(m._2 => println(m._2)) does not compile.
I'm attempting to apply a map function that sums the counts for each letter in the ArrayBuffer, so the example above should map to ((a, 2), (b, 1)).
It's unclear what you want. What output are you expecting?
Do you want to print the second item of data1?
println(data1._2)
Or print each item of the buff in data1?
data1._2.foreach(m => println(m))
Do you want for data1 to be a collection of tuples and to map over that?
import scala.collection.mutable.ArrayBuffer
val data1 = Vector(("", ArrayBuffer(("", 1))), ("", ArrayBuffer(("", 1))))
data1.foreach { case (a,b) => println(b) }
Note that if you're just printing stuff out, you want foreach, not map.
Based on your edits:
import scala.collection.mutable.ArrayBuffer
val data1 = (("", ArrayBuffer(("a", 1), ("b", 1), ("a", 1))))
val grouped = data1._2.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
// Map(b -> 1, a -> 2)
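If you'd rather avoid building the intermediate groups, the same result can be computed in one pass with foldLeft (a sketch along the lines of the foldLeft answer earlier on this page):
val summed = data1._2.foldLeft(Map.empty[String, Int]) {
  case (m, (k, n)) => m + (k -> (m.getOrElse(k, 0) + n)) // insert or increment
}
// Map(a -> 2, b -> 1)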
You can't use map on tuples; tuples don't have such a method.
Also, a map function should transform a value and return it, but you just want to print it, not change it.
To print ArrayBuffer in your case try this:
data1._2.foreach(x=>println(x))
or just
data1._2.foreach(println)
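If you really do need to walk over a tuple's elements generically, productIterator is a (type-unsafe) escape hatch; it yields every element as Any:
data1.productIterator.foreach(println) // prints the String, then the ArrayBuffer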
Try the following (assuming data1 is a collection of tuples, as in the Vector example above; a bare tuple has no foreach):
data1.foreach { case (x, y) => println(y) }