I wrote the following code to merge two maps and update the common keys. Is there a better way to write this?
case class Test(index: Int, min: Int, max: Int, aggMin: Int, aggMax: Int)
def mergeMaps(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] = {
  val intersect: Map[Int, Test] = oldMap.keySet.intersect(newMap.keySet)
    .map(indexKey => indexKey -> Test(newMap(indexKey).index, newMap(indexKey).min, newMap(indexKey).max,
      oldMap(indexKey).aggMin.min(newMap(indexKey).aggMin), oldMap(indexKey).aggMax.max(newMap(indexKey).aggMax)))
    .toMap
  val merge = oldMap ++ newMap ++ intersect
  merge
}
Here is my test case
it("test my case"){
val oldMap = Map(10 -> Test(10, 1, 2, 1, 2), 25 -> Test(25, 3, 4, 3, 4), 46 -> Test(46, 3, 4, 3, 4), 26 -> Test(26, 1, 2, 1, 2))
val newMap = Map(32 -> Test(32, 5, 6, 5, 6), 26 -> Test(26, 5, 6, 5, 6))
val result = mergeMaps(oldMap, newMap)
//Total elements count should be map 1 elements + map 2 elements
assert(result.size == 5)
//Common key element aggMin and aggMax should be updated, keep min aggMin and max aggMax from 2 common key elements and keep min and max of second map key
assert(result.get(26).get.aggMin == 1)//min aggMin -> min(1,5)
assert(result.get(26).get.aggMax == 6)//max aggMax -> max(2,6)
assert(result.get(26).get.min == 5)// 5 from second map
assert(result.get(26).get.max == 6)//6 from second map
}
Here's a slightly different take on a solution.
def mergeMaps(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] =
  (oldMap.values ++ newMap.values)
    .groupBy(_.index)
    .map { case (k, v) =>
      k -> v.reduceLeft((a, b) =>
        Test(k, b.min, b.max, a.aggMin min b.aggMin, a.aggMax max b.aggMax))
    }
I could have followed the groupBy() with mapValues() instead of map(), but that doesn't result in a strict Map.
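For reference, here is a sketch of that mapValues variant, assuming Scala 2.13 (where mapValues is reached through .view and returns a lazy MapView, so a final .toMap is needed to get a strict Map back):

def mergeMapsView(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] =
  (oldMap.values ++ newMap.values)
    .groupBy(_.index)
    .view
    .mapValues(_.reduceLeft((a, b) =>
      Test(a.index, b.min, b.max, a.aggMin min b.aggMin, a.aggMax max b.aggMax)))
    .toMap // force the lazy view back into a strict Map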
Another version to do the same task.
def mergeMaps(oldMap: Map[Int, Test], newMap: Map[Int, Test]): Map[Int, Test] = {
  (newMap ++ oldMap).map(key => {
    val _newMapData = newMap.get(key._1)
    if (_newMapData.isDefined) {
      val _newMapDataValue = _newMapData.get
      val oldMapValue = key._2
      val result = Test(_newMapDataValue.index, _newMapDataValue.min, _newMapDataValue.max,
        oldMapValue.aggMin.min(_newMapDataValue.aggMin), oldMapValue.aggMax.max(_newMapDataValue.aggMax))
      key._1 -> result
    } else key._1 -> key._2
  })
}
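Applied to the test data from the question, any of these versions should satisfy the assertions, for example:

val result = mergeMaps(oldMap, newMap)
result.size // 5
result(26)  // Test(26, 5, 6, 1, 6): min/max from newMap, aggMin = min(1, 5), aggMax = max(2, 6)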
I want to find a percentage from the value of a key-value pair stored in a map.
For example, given Map('a' -> 10, 'b' -> 20), I need to find the percentage occurrence of 'a' and 'b'.
Adding to Thilo's answer, you can try the code below. The final result will again be a Map[String, Double].
val map = Map("a" -> 10.0, "b" -> 20.0)
val total = map.values.sum
val mapWithPerc = map.mapValues(x => (x * 100) / total)
println(mapWithPerc)
//prints Map(a -> 33.333333333333336, b -> 66.66666666666667)
def mapToPercentage(key: String)(implicit map: Map[String, Double]) = {
  val valuesSum = map.values.sum
  (map(key) * 100) / valuesSum
}
implicit val m: Map[String, Double] = Map("a" -> 10, "b" -> 20, "c" -> 30)
println(mapToPercentage("a")) // 16.666666666666668
println(mapToPercentage("b")) // 33.333333333333336
println(mapToPercentage("c")) // 50
Note: there is absolutely no need to curry the function parameters or make the map implicit. I just think it looks nicer in this example. Something like def mapToPercentage(key: String, map: Map[String, Double]) = {...} and mapToPercentage("a", m) is also perfectly valid. That being said, if you want to get even fancier:
implicit class MapToPercentage(map: Map[String, Double]) {
  def getPercentage(key: String) = {
    val valuesSum = map.values.sum
    (map(key) * 100) / valuesSum
  }
}
val m: Map[String, Double] = Map("a" -> 10, "b" -> 20, "c" -> 30)
println(m.getPercentage("a")) // 16.666666666666668
println(m.getPercentage("b")) // 33.333333333333336
println(m.getPercentage("c")) // 50
Point being, the logic behind getting the percentage can be written a few ways:
(map(key) * 100) / valuesSum               // get the value for the given key, multiply by 100,
                                           // divide by the total sum of all values
                                           // - will throw if the key doesn't exist
(map.getOrElse(key, 0D) * 100) / valuesSum // safer than the above, will not fail
                                           // if the key doesn't exist
map.get(key).map(_ * 100 / valuesSum)      // returns None if the key doesn't exist
                                           // and Some(result) if it does
val map = Map('a' -> 10, 'b' -> 20)
val total = map.values.sum
map.get('a').map(_ * 100 / total) // gives Some(33)
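Note that with Int values, as in this last snippet, / is integer division, hence Some(33); declaring the values as Double (as in the earlier snippets) gives the fractional percentage:

val dmap = Map('a' -> 10.0, 'b' -> 20.0)
val dtotal = dmap.values.sum
dmap.get('a').map(_ * 100 / dtotal) // Some(33.333333333333336)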
I've manually created a mapping (the Map named map below) to relabel some elements of an RDD.
val rdd = sc.parallelize(List(
LabeledPoint(5.0, Vectors.dense(1,2)),
LabeledPoint(12.0, Vectors.dense(1,3)),
LabeledPoint(20.0, Vectors.dense(-1,4))))
val map = Map(5 -> 0.0, 12.0 -> 1.0, 20.0 -> 2.0)
val trainingData = rdd.map {
  case LabeledPoint(category, features) => LabeledPoint(map(category), features)
}
How can I create such a mapping for 1000 elements? These 1000 elements are in an RDD called distinct.
I have tried this
var i = 0
val map = distinct.map { x =>
  Map(x -> i)
  i = i + 1
}
But this cannot be used as a mapping function for
val trainingData = rdd.map {
  case LabeledPoint(category, features) => LabeledPoint(map(category), features)
}
zipWithIndex and then collectAsMap:
val distinct = sc.parallelize(Seq(5, 12, 20))
distinct.zipWithIndex.collectAsMap
// res2: scala.collection.Map[Int,Long] = Map(20 -> 2, 5 -> 0, 12 -> 1)
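From there, a sketch of how the collected map could be used for the original relabelling (assuming distinct holds the same label values that appear in the LabeledPoints; broadcasting the small lookup map is optional but avoids shipping it with every task):

val categoryToIndex = distinct.zipWithIndex.collectAsMap
val categoryToIndexB = sc.broadcast(categoryToIndex)

val trainingData = rdd.map {
  case LabeledPoint(category, features) =>
    LabeledPoint(categoryToIndexB.value(category).toDouble, features)
}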
How to get from an Iterator like this
val it = Iterator("one","two","three","four","five")
a map like
Map(four -> 4, three -> 5, two -> 3, five -> 4, one -> 3)
var m = Map[String, Int]()
while (it.hasNext) {
  val cell = it.next()
  m += (cell -> cell.length())
}
This is a solution using a var, but I'd like to use only immutable collections and val.
If I use a for/yield expression, the resulting object is an Iterator[Map], which I do not want:
val m = for(i<- it if it.hasNext) yield Map(i->i.length())
You can just use map:
val m = it.map(c => c -> c.length).toMap
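If you prefer an explicit fold (over a fresh iterator, since an iterator can only be traversed once), a val-only alternative is:

val m2 = Iterator("one", "two", "three", "four", "five")
  .foldLeft(Map.empty[String, Int])((acc, s) => acc + (s -> s.length))
// Map(one -> 3, two -> 3, three -> 5, four -> 4, five -> 4)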
I had a file that contained a list of elements like this
00|905000|20160125204123|79644809999||HGMTC|1||22|7905000|56321647569|||34110|I||||||250995210056537|354805064211510||56191|||38704||A|||11|V|81079681404134|5||||SE|||G|144|||||||||||||||Y|b00534589.huawei_anadyr.20151231184912||1|||||79681404134|0|||+##+1{79098509982}2{2}3{2}5{79644809999}6{0000002A7A5AC635}7{79681404134}|20160125|
Through a series of steps, I managed to convert it to a list of elements like this
(902996760100000,CompactBuffer(6, 5, 2, 2, 8, 6, 5, 3))
Here 905000 and 902996760100000 are keys and 6, 5, 2, 2, 8, 6, 5, 3 are values. Values can be numbers from 1 to 8. Is there any way to count the number of occurrences of these values using Spark, so the result looks like this?
(902996760100000, 0_1, 2_2, 1_3, 0_4, 2_5, 2_6, 0_7, 1_8)
I could do it with if/else blocks and such, but that won't be pretty, so I wondered whether there are any tools I could use in Scala/Spark.
This is my code.
class ScalaJob(sc: SparkContext) {
  def run(cdrPath: String): RDD[(String, Iterable[String])] = {
    // read the file
    val fileCdr = sc.textFile(cdrPath)
    // find the values in every raw CDR line
    val valuesCdr = fileCdr.map { dataRaw =>
      val p = dataRaw.split("[|]", -1)
      (p(1), ScalaJob.processType(ScalaJob.processTime(p(2)) + "_" + p(32)))
    }
    val x = valuesCdr.groupByKey()
    return x
  }
Any advice on optimizing it would be appreciated. I'm really new to Scala/Spark.
First, Scala is a type-safe language and so is Spark's RDD API - so it's highly recommended to use the type system instead of going around it by "encoding" everything into Strings.
So I'll suggest a solution that creates an RDD[(String, Seq[(Int, Int)])] (with the second item in each tuple being a sequence of (ID, count) pairs) rather than an RDD[(String, Iterable[String])], which seems less useful.
Here's a simple function that counts the occurrences of 1 to 8 in a given Iterable[Int]:
def countValues(l: Iterable[Int]): Seq[(Int, Int)] = {
  (1 to 8).map(i => (i, l.count(_ == i)))
}
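For example, with the values from the question's CompactBuffer:

countValues(Seq(6, 5, 2, 2, 8, 6, 5, 3))
// Vector((1,0), (2,2), (3,1), (4,0), (5,2), (6,2), (7,0), (8,1))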
You can use mapValues with this function (place the function in the object for serializability, like you did with the rest) on an RDD[(String, Iterable[Int])] to get the result:
valuesCdr.groupByKey().mapValues(ScalaJob.countValues)
The entire solution can then be simplified a bit:
class ScalaJob(sc: SparkContext) {
  import ScalaJob._

  def run(cdrPath: String): RDD[(String, Seq[(Int, Int)])] = {
    val valuesCdr = sc.textFile(cdrPath)
      .map(_.split("\\|"))
      .map(p => (p(1), processType(processTime(p(2)), p(32))))
    valuesCdr.groupByKey().mapValues(countValues)
  }
}

object ScalaJob {
  val dayParts = Map((6 to 11) -> 1, (12 to 18) -> 2, (19 to 23) -> 3, (0 to 5) -> 4)

  def processTime(s: String): Int = {
    val hour = DateTime.parse(s, DateTimeFormat.forPattern("yyyyMMddHHmmss")).getHourOfDay
    dayParts.filterKeys(_.contains(hour)).values.head
  }

  def processType(dayPart: Int, s: String): Int = s match {
    case "S" => 2 * dayPart - 1
    case "V" => 2 * dayPart
  }

  def countValues(l: Iterable[Int]): Seq[(Int, Int)] = {
    (1 to 8).map(i => (i, l.count(_ == i)))
  }
}
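A hypothetical invocation of the refactored job (the path argument is just a placeholder):

val job = new ScalaJob(sc)
val counts: RDD[(String, Seq[(Int, Int)])] = job.run("/path/to/cdr/files")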
I'm working through Cay Horstmann's Scala for the Impatient book where I came across this way of updating a mutable map.
scala> val scores = scala.collection.mutable.Map("Alice" -> 10, "Bob" -> 3, "Cindy" -> 8)
scores: scala.collection.mutable.Map[String,Int] = Map(Bob -> 3, Alice -> 10, Cindy -> 8)
scala> scores("Alice") // retrieve the value of type Int
res2: Int = 10
scala> scores("Alice") = 5 // Update the Alice value to 5
scala> scores("Alice")
res4: Int = 5
It looks like scores("Alice") hits apply in MapLike.scala. But this only returns the value, not something that can be updated.
Out of curiosity I tried the same syntax on an immutable map and was presented with the following error,
scala> val immutableScores = Map("Alice" -> 10, "Bob" -> 3, "Cindy" -> 8)
immutableScores: scala.collection.immutable.Map[String,Int] = Map(Alice -> 10, Bob -> 3, Cindy -> 8)
scala> immutableScores("Alice") = 5
<console>:9: error: value update is not a member of scala.collection.immutable.Map[String,Int]
immutableScores("Alice") = 5
^
Based on that, I'm assuming that scores("Alice") = 5 is transformed into scores update ("Alice", 5) but I have no idea how it works, or how it is even possible.
How does it work?
This is an example of Scala's apply/update syntax.
When you call map("Something") this calls map.apply("Something") which in turn calls get.
When you call map("Something") = "SomethingElse" this calls map.update("Something", "SomethingElse") which in turn calls put.
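A minimal sketch (a hypothetical class, not from the standard library) showing how the compiler desugars the two forms:

class Scores {
  private var contents = Map.empty[String, Int]

  def apply(key: String): Int = contents(key)    // used by scores("Alice")
  def update(key: String, value: Int): Unit =    // used by scores("Alice") = 5
    contents += (key -> value)
}

val s = new Scores
s("Alice") = 5 // the compiler rewrites this to s.update("Alice", 5)
s("Alice")     // the compiler rewrites this to s.apply("Alice"), returning 5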
You can try this to update a Map whose values are Lists:
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._
import scala.collection.concurrent
val map: concurrent.Map[String, List[String]] = new ConcurrentHashMap[String, List[String]].asScala
def updateMap(key: String, map: concurrent.Map[String, List[String]], value: String): Unit = {
  map.get(key) match {
    case Some(list: List[String]) =>
      // key already present: prepend the new value to the existing list
      val new_list = value :: list
      map.put(key, new_list)
    case None =>
      // key not seen before: start a new single-element list
      map += (key -> List(value))
  }
}
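For example:

updateMap("fruits", map, "apple")
updateMap("fruits", map, "pear")
map.get("fruits") // Some(List(pear, apple))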
The problem is that you're trying to update an immutable map. I had the same error message when my map was declared as
var m = new java.util.HashMap[String, Int]
But when I replaced the definition with
var m = new scala.collection.mutable.HashMap[String, Int]
the m.update worked.
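Both forms then work with that definition, for example:

m("Alice") = 5 // desugars to m.update("Alice", 5)
m.update("Bob", 3)
m("Alice")     // 5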