How to make this RDD in Spark - Scala

This is my code:
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setMaster("local").setAppName("My app")
  val sc = new SparkContext(conf)
  val inputFile = "D:/test.txt"
  val inputData = sc.textFile(inputFile)
  val DupleRawData = inputData.map(_.split("\\<\\>").toList)
    .map(s => (s(8), s(18)))
    .map(s => (s, 1))
    .reduceByKey(_ + _)
  val UserShopCount = DupleRawData.groupBy(s => s._1._1)
    .map(s => (s._1, s._2.toList.sortBy(z => z._2).reverse))
  val ResultSet = UserShopCount.map(s => (s._1, s._2.take(1000).map(z => (z._1._2, z._2))))
  ResultSet.foreach(println)
  //(aaa,List((100,4), (200,4), (300,3), (800,1)))
  //(bbb,List((100,6), (400,5), (500,4)))
  //(ccc,List((300,7), (400,6), (700,3)))
  // this is as far as I have got
}
and this is the result I'm getting:
(aaa,List((100,4), (200,4), (300,3), (800,1)))
(bbb,List((100,6), (400,5), (500,4)))
(ccc,List((300,7), (400,6), (700,3)))
The final result RDD I want is:
// val ResultSet: org.apache.spark.rdd.RDD[(String, List[(String, Int)])]
(aaa,List((200,4), (800,1))) // 100 and 300 also appear under bbb and ccc, so they are excluded
(bbb,List((500,4)))          // 100 and 400 also appear under aaa and ccc, so they are excluded
(ccc,List((700,3)))          // 300 and 400 also appear under aaa and bbb, so they are excluded
Please give me a solution or some advice. Sincerely.

Here is my attempt (on a plain Seq; the same transformations carry over to an RDD):
val data: Seq[(String, List[(Int, Int)])] = Seq(
  ("aaa", List((1,4), (2,4), (3,3), (8,1))),
  ("bbb", List((1,6), (4,5), (5,4))),
  ("ccc", List((3,7), (4,6), (7,3)))
)
// keys that occur under exactly one user
val uniqKeys = data.flatMap {
  case (_, v) => v.map(_._1)
}.groupBy(identity).filter(_._2.size == 1)
val result = data.map {
  case (pk, v) =>
    val finalValue = v.filter {
      case (k, _) => uniqKeys.contains(k)
    }
    (pk, finalValue)
}
Output:
result: Seq[(String, List[(Int, Int)])] = List((aaa,List((2,4), (8,1))), (bbb,List((5,4))), (ccc,List((7,3))))
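Since the question asks for an RDD result, a minimal sketch of the same idea with RDD operations might look like this (assuming ResultSet: RDD[(String, List[(String, Int)])] from the question's code; names like exploded and finalResult are illustrative):
// explode each (user, list) pair into (shopKey, (user, entry)) pairs,
// keep shop keys that belong to exactly one user, then regroup by user
val exploded = ResultSet.flatMap { case (user, list) =>
  list.map { case entry @ (shop, _) => (shop, (user, entry)) }
}
val uniqueOnly = exploded
  .groupByKey()
  .filter { case (_, owners) => owners.size == 1 } // shop keys are unique per user after reduceByKey
  .flatMap { case (_, owners) => owners }          // back to (user, (shop, count))
val finalResult = uniqueOnly
  .groupByKey()
  .mapValues(_.toList)                             // RDD[(String, List[(String, Int)])]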

I am assuming your ResultSet is an RDD[(String, List[(Int, Int)])].
val zeroVal1: (Long, String, (Int, Int)) = (Long.MaxValue, "", (0, 0))
val zeroVal2: List[(Int, Int)] = List()
val yourNeededRdd = ResultSet
  .zipWithIndex()
  .flatMap {
    case ((key, list), index) => list.map(t => (t._1, (index, key, t)))
  }
  .aggregateByKey(zeroVal1)(
    (t1, t2) => if (t1._1 <= t2._1) t1 else t2,
    (t1, t2) => if (t1._1 <= t2._1) t1 else t2
  )
  .map { case (_, (index, key, t)) => (key, t) }
  .aggregateByKey(zeroVal2)(
    (l, t) => t :: l,
    (l1, l2) => l1 ++ l2
  )

Related

scala - map & flatten shows different result than flatMap

val adjList = Map("Logging" -> List("Networking", "Game"))
// val adjList: Map[String, List[String]] = Map(Logging -> List(Networking, Game))
adjList.flatMap { case (v, vs) => vs.map(n => (v, n)) }.toList
// val res7: List[(String, String)] = List((Logging,Game))
adjList.map { case (v, vs) => vs.map(n => (v, n)) }.flatten.toList
// val res8: List[(String, String)] = List((Logging,Networking), (Logging,Game))
I am not sure what is happening here. I was expecting the same result from both of them.
.flatMap resolves to Map's .flatMap (its function returns key/value pairs, so the result is rebuilt as a Map), while .map resolves to Iterable's .map, because its function returns a List rather than a key/value pair.
In a Map, "Logging" -> "Networking" and "Logging" -> "Game" collapse into just the latter, "Logging" -> "Game", because the keys are the same.
val adjList: Map[String, List[String]] = Map("Logging" -> List("Networking", "Game"))
val x0: Map[String, String] = adjList.flatMap { case (v, vs) => vs.map(n => (v, n)) }
//Map(Logging -> Game)
val x: List[(String, String)] = x0.toList
//List((Logging,Game))
import scala.collection.immutable

val adjList: Map[String, List[String]] = Map("Logging" -> List("Networking", "Game"))
val y0: immutable.Iterable[List[(String, String)]] = adjList.map { case (v, vs) => vs.map(n => (v, n)) }
//List(List((Logging,Networking), (Logging,Game)))
val y1: immutable.Iterable[(String, String)] = y0.flatten
//List((Logging,Networking), (Logging,Game))
val y: List[(String, String)] = y1.toList
//List((Logging,Networking), (Logging,Game))
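If the intent was to get all pairs from both expressions, converting the Map to a List before the flatMap avoids the key collapse (a minimal sketch):
// convert to List first so no Map is rebuilt and no keys can collide
val allPairs: List[(String, String)] =
  adjList.toList.flatMap { case (v, vs) => vs.map(n => (v, n)) }
// List((Logging,Networking), (Logging,Game))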
Also https://users.scala-lang.org/t/map-flatten-flatmap/4180

How to transform a tuple's element values in a list of two-tuples using map

I have the following:
val x: List[(String, Int)] = List(("mealOne", 2), ("mealTwo", 1), ("mealThree", 2))
I want to replace or transform the String with a Double, using the values below, with a map:
val mealOne = 5.99; val mealTwo = 6.99; val mealThree = 7.99
x.map { x =>
  if (x._1 == "mealOne") mealOne
  else if (x._1 == "mealTwo") mealTwo
  else mealThree
}
Result:
List[Double] = List(5.99, 6.99, 7.99)
but I want this:
List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))
So how can I achieve the above?
Thanks
Just don't drop the second component of the tuple then:
x.map { x => (
  if (x._1 == "mealOne") mealOne
  else if (x._1 == "mealTwo") mealTwo
  else mealThree,
  x._2
)}
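This yields the desired List((5.99,2), (6.99,1), (7.99,2)).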
of course, it works for arbitrary mappings from Strings to Doubles:
def replaceNamesByPrices(
  nameToPrice: String => Double,
  xs: List[(String, Int)]
): List[(Double, Int)] =
  for ((name, amount) <- xs) yield (nameToPrice(name), amount)
so that you can then store the mapping of names to prices in a map, i.e.
val priceTable = Map(
  "mealOne" -> 42.99,
  "mealTwo" -> 5.99,
  "mealThree" -> 2345.65
)
so that
replaceNamesByPrices(priceTable, x)
yields the desired result.
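With the priceTable above, replaceNamesByPrices(priceTable, x) returns List((42.99,2), (5.99,1), (2345.65,2)).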
This works as follows (further simplified; thanks to Andrey Tyukin):
for (m <- x) yield (y(m._1), m._2)
for ((m, n) <- x) yield (y(m), n)
or
x.map(t => (y(t._1), t._2))
x.map { case (m, n) => (y(m), n) }
Your input lists (y is now a Map):
val x = List(("mealOne", 2), ("mealTwo", 1), ("mealThree", 2))
val y = Map(("mealOne", 5.99), ("mealTwo", 6.99), ("mealThree", 7.99))
In Scala REPL:
scala> for (m <- x) yield (y(m._1), m._2)
res35: List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))

scala> for ((m, n) <- x) yield (y(m), n)
res60: List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))

scala> x.map(t => (y(t._1), t._2))
res57: List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))

scala> x.map { case (m, n) => (y(m), n) }
res59: List[(Double, Int)] = List((5.99,2), (6.99,1), (7.99,2))

None if no clear winner in Scala Map maxBy

val valueCountsMap: mutable.Map[String, Int] = mutable.Map[String, Int]()
valueCountsMap("a") = 1
valueCountsMap("b") = 1
valueCountsMap("c") = 1
val maxOccurredValueNCount: (String, Int) = valueCountsMap.maxBy(_._2)
// maxOccurredValueNCount: (String, Int) = (b,1)
How can I get None if there's no clear winner when I do maxBy values? I am wondering if there's any native solution already implemented within scala mutable Maps.
No, there's no native solution for what you've described.
Here's how I might go about it.
implicit class UniqMax[K, V: Ordering](m: Map[K, V]) {
  def uniqMaxByValue: Option[(K, V)] = {
    m.headOption.fold(None: Option[(K, V)]) { hd =>
      val ev = implicitly[Ordering[V]]
      val (count, max) = m.tail.foldLeft((1, hd)) { case ((c, x), v) =>
        if (ev.gt(v._2, x._2)) (1, v)
        else if (v._2 == x._2) (c + 1, x)
        else (c, x)
      }
      if (count == 1) Some(max) else None
    }
  }
}
Usage:
Map("a"->11, "b"->12, "c"->11).uniqMaxByValue //res0: Option[(String, Int)] = Some((b,12))
Map(2->"abc", 1->"abx", 0->"ab").uniqMaxByValue //res1: Option[(Int, String)] = Some((1,abx))
Map.empty[Long,Boolean].uniqMaxByValue //res2: Option[(Long, Boolean)] = None
Map('c'->2.2, 'w'->2.2, 'x'->2.1).uniqMaxByValue //res3: Option[(Char, Double)] = None
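For Scala 2.13+, a shorter two-pass sketch of the same idea, using maxByOption (this assumes the Ordering's equiv agrees with == for your value type):
def uniqMaxByValue2[K, V: Ordering](m: Map[K, V]): Option[(K, V)] =
  m.maxByOption(_._2).filter { case (_, maxV) =>
    // keep the maximum only if exactly one entry attains it
    m.count { case (_, v) => implicitly[Ordering[V]].equiv(v, maxV) } == 1
  }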

Spark aggregateByKey using a Map and defining data types for functions

The title is probably confusing, so I will do my best to explain. I am trying to break this function up into named functions, so that other teams working with my code can see more clearly how the aggregateByKey works. I have the following aggregate:
val firstLetter = stringRDD.aggregateByKey(Map[Char, Int]())(
  (accumCount, value) => accumCount.get(value.head) match {
    case None => accumCount + (value.head -> 1)
    case Some(count) => accumCount + (value.head -> (count + 1))
  },
  (accum1, accum2) => accum1 ++ accum2.map { case (k, v) => k -> (v + accum1.getOrElse(k, 0)) }
).collect()
I've been wanting to break this up as follows:
val firstLet = Map[Char, Int]()
def fSeq(accumCount: ?, value: ?) = {
  accumCount.get(value.head) match {
    case None => accumCount + (value.head -> 1)
    case Some(count) => accumCount + (value.head -> (count + 1))
  }
}
def fComb(accum1: ?, accum2: ?) = {
  accum1 ++ accum2.map { case (k, v) => k -> (v + accum1.getOrElse(k, 0)) }
}
Due to the initial value being a Map[Char, Int], I am not sure what data types to give accumCount and value. I've tried different things but nothing seems to work. Can someone help me define the data types and explain how you determined them?
seqOp takes an accumulator of the same type as the initial value as its first argument, and a value of the same type as the values in your RDD.
combOp takes two accumulators of the same type as the initial value.
Assuming you want to aggregate an RDD[(T, U)]:
def fSeq(accumCount: Map[Char, Int], value: U): Map[Char, Int] = ???
def fComb(accum1: Map[Char, Int], accum2: Map[Char, Int]): Map[Char, Int] = ???
I guess in your case U is simply String, so you should adjust the fSeq signature accordingly.
BTW, you can provide a default mapping and simplify your functions:
val firstLet = Map[Char, Int]().withDefault(x => 0)

def fSeq(accumCount: Map[Char, Int], value: String): Map[Char, Int] = {
  accumCount + (value.head -> (accumCount(value.head) + 1))
}

def fComb(accum1: Map[Char, Int], accum2: Map[Char, Int]): Map[Char, Int] = {
  val accum = (accum1.keys ++ accum2.keys).map(k => (k, accum1(k) + accum2(k)))
  accum.toMap.withDefault(x => 0)
}
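A quick local check of the helpers (no Spark needed; the sample inputs here are made up):
val m1 = Map('a' -> 2).withDefault(_ => 0)
val m2 = Map('a' -> 1, 'b' -> 3).withDefault(_ => 0)
fSeq(firstLet, "apple") // Map(a -> 1)
fComb(m1, m2)           // Map(a -> 3, b -> 3)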
Finally it could be more efficient to use scala.collection.mutable.Map:
import scala.collection.mutable.{Map => MMap}

def firstLetM = MMap[Char, Int]().withDefault(x => 0)

def fSeqM(accumCount: MMap[Char, Int], value: String): MMap[Char, Int] = {
  accumCount += (value.head -> (accumCount(value.head) + 1))
}

def fCombM(accum1: MMap[Char, Int], accum2: MMap[Char, Int]): MMap[Char, Int] = {
  accum2.foreach { case (k, v) => accum1 += (k -> (accum1(k) + v)) }
  accum1
}
Test:
def randomChar() = (scala.util.Random.nextInt.abs % 58 + 65).toChar

def randomString() = {
  (Seq(randomChar()) ++ Iterator.iterate(randomChar())(_ => randomChar())
    .takeWhile(_ => scala.util.Random.nextFloat > 0.1)).mkString
}

val stringRDD = sc.parallelize(
  (1 to 500000).map(_ => (scala.util.Random.nextInt.abs % 60, randomString())))

val firstLetter = stringRDD.aggregateByKey(Map[Char, Int]())(
  (accumCount, value) => accumCount.get(value.head) match {
    case None => accumCount + (value.head -> 1)
    case Some(count) => accumCount + (value.head -> (count + 1))
  },
  (accum1, accum2) => accum1 ++ accum2.map {
    case (k, v) => k -> (v + accum1.getOrElse(k, 0)) }
).collectAsMap()

val firstLetter2 = stringRDD
  .aggregateByKey(firstLet)(fSeq, fComb)
  .collectAsMap()

val firstLetter3 = stringRDD
  .aggregateByKey(firstLetM)(fSeqM, fCombM)
  .mapValues(_.toMap)
  .collectAsMap()

firstLetter == firstLetter2
firstLetter == firstLetter3

How to change the sample word count to return the labels?

The code below performs a word count over a collection of type org.apache.spark.rdd.RDD[(String, List[(String, Int)])]:
val words: org.apache.spark.rdd.RDD[(String, List[(String, Int)])] = sc.parallelize(List(("a", List(("test", 1), ("test", 1)))))
val missingLabels : RDD[(String, Int)] = words.flatMap(m => m._2).reduceByKey((a, b) => a + b)
println("Labels Missing")
missingLabels.collect().foreach(println)
How can I also grab the labels, so that instead of ("test", 2) the value ("a", ("test", 2)) is extracted? In other words, a result of type RDD[(String, List[(String, Int)])].
If I understand you right, you should just play with tuples a little bit.
import org.apache.spark.rdd.RDD
val words : RDD[(String, List[(String, Int)])] = sc.parallelize( List(("a" , List( ("test" , 1) , ("test" , 1)))) )
val wordsWithLabels = words
.flatMap {
case (label, listOfValues) => listOfValues.map {
case (word,count) => (word, (label, count))
}
}
val result = wordsWithLabels
.reduceByKey {
case ((label1, count1), (label2, count2)) =>
(label1, count1 + count2)
}
.map {
case (word, (label, count)) =>
(label, (word, count))
}
result.foreach(println)
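With the sample words RDD above, this prints (a,(test,2)).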
If the key can be repeated, then I am assuming you want to reduce it all down to one pairing? If so:
def reduceList(list: List[(String, Int)]) = list.groupBy(_._1).mapValues(_.aggregate(0)(_ + _._2, _ + _))
val words : org.apache.spark.rdd.RDD[(String, List[(String, Int)])] = sc.parallelize( List(("a" , List( ("test" , 1) , ("test" , 1)))) )
val mergedList = words.mapValues((list : List[(String, Int)]) => reduceList(list).toList)
val missingLabels = mergedList.reduceByKey((accum: List[(String, Int)], value: List[(String, Int)]) => {
  val valueMap = value.toMap
  val accumMap = accum.toMap
  val mergedMap = accumMap ++ valueMap.map { case (k, v) => k -> (v + accumMap.getOrElse(k, 0)) }
  mergedMap.toList
})
missingLabels.foreach(println)
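For the sample words RDD, mergedList first collapses the duplicates inside each list, so this prints (a,List((test,2))).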