Scala concatenate Map - one with Option and other without Option - scala

I have the below input values
import java.sql.Timestamp
import java.lang.{Double => JDouble}
val date = Timestamp.valueOf("2021-08-01 00:00:00")
val contractRate: Map[String, JDouble] = Map("ITABUS" -> 0.075,
"KARAT-S" -> 0.10,
"KAUTRA" -> 0.05)
val timeBoundContractRatesList: Map[String, List[(Timestamp, JDouble)]] = Map(
"ITABUS" -> List((Timestamp.valueOf("2021-07-30 23:59:59"), 0.085.asInstanceOf[JDouble]),
)
)
My requirement here is:
There are 2 types of rates. One is fixed rate and other is time bound rate
I need to apply the time bound rate if the date is greater than today (for example)
I am trying with the approach to have a single consolidated Map like below
val withTimeBoundContractRate = contractRate ++ timeBoundContractRatesList
.map { case (carrier, timeRateSet) =>
val filteredEntry = timeRateSet
.filter { case (startDate, _) => date.after(startDate) }
(carrier, filteredEntry.map(_._2).headOption)
}
.filter(_._2.nonEmpty)
The problem is with the output. I get the below output
withTimeBoundContractRate: scala.collection.immutable.Map[String,java.io.Serializable] = Map(ITABUS -> Some(0.085), KARAT-S -> 0.1, KAUTRA -> 0.05)
But what I am looking for is a Map with original datatype(without Option)
withTimeBoundContractRate: Map[String, JDouble] = Map(ITABUS -> 0.085, KARAT-S -> 0.1, KAUTRA -> 0.05)
Or is there a totally different approach to solve this efficiently?

(I think I misunderstood what you were originally trying to do, so I deleted my first answer and replaced it with this one):
You actually almost have it, the only problem is that the values in your second map are Options. Just "unwrap" them:
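contractRate ++ timeBoundContractRatesList.mapValues {
// a nested placeholder like date.after(_._1) would not compile here, so match the tuple explicitly
_.find { case (startDate, _) => date.after(startDate) }.map(_._2)
}.collect { case (k, Some(v)) => k -> v }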
The main difference from your snippet here is to use collect instead of filter: it lets you not only remove the empty values, but also transform the non-empty ones to get rid of the Option around them.
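As a minimal, self-contained sketch of the collect idea (script/REPL style), the snippet below rebuilds the question's data and merges the two maps. The names `timeBound`, `activeTimeBound` and `effective`, and the extra future-dated "KAUTRA" entry, are assumptions added for illustration only.

```scala
import java.sql.Timestamp
import java.lang.{Double => JDouble}

val date = Timestamp.valueOf("2021-08-01 00:00:00")

val contractRate: Map[String, JDouble] = Map(
  "ITABUS"  -> JDouble.valueOf(0.075),
  "KARAT-S" -> JDouble.valueOf(0.10),
  "KAUTRA"  -> JDouble.valueOf(0.05)
)

// Hypothetical data: the KAUTRA entry starts after `date`, so it must be ignored.
val timeBound: Map[String, List[(Timestamp, JDouble)]] = Map(
  "ITABUS" -> List((Timestamp.valueOf("2021-07-30 23:59:59"), JDouble.valueOf(0.085))),
  "KAUTRA" -> List((Timestamp.valueOf("2021-09-01 00:00:00"), JDouble.valueOf(0.2)))
)

// First map each carrier to an Option, then let collect drop the Nones
// and unwrap the Somes in a single step.
val activeTimeBound: Map[String, JDouble] =
  timeBound
    .map { case (carrier, rates) =>
      carrier -> rates.collectFirst { case (start, rate) if date.after(start) => rate }
    }
    .collect { case (carrier, Some(rate)) => carrier -> rate }

// Entries on the right-hand side of ++ override the fixed rates.
val effective: Map[String, JDouble] = contractRate ++ activeTimeBound
```

Because the Option is removed before the `++`, the result keeps the type Map[String, JDouble] instead of widening to java.io.Serializable.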

Related

Scala create immutable nested map

I have a situation here
I have two strings
val keyMap = "androidApp,key1;iosApp,key2;xyz,key3"
val tentMap = "androidApp,tenant1; iosApp,tenant1; xyz,tenant2"
What I want is to create a nested immutable map like this
tenant1 -> (androidApp -> key1, iosApp -> key2),
tenant2 -> (xyz -> key3)
So basically I want to group by tenant and create a map of keyMap
Here is what I tried, but it uses a mutable map, which I don't want. Is there a way to create this using an immutable map?
case class TenantSetting() {
val requesterKeyMapping = new mutable.HashMap[String, String]()
}
val requesterKeyMapping = keyMap.split(";")
.map { keyValueList => keyValueList.split(',')
.filter(_.size==2)
.map(keyValuePair => (keyValuePair[0],keyValuePair[1]))
.toMap
}.flatten.toMap
val config = new mutable.HashMap[String, TenantSetting]
tentMap.split(";")
.map { keyValueList => keyValueList.split(',')
.filter(_.size==2)
.map { keyValuePair =>
val requester = keyValuePair[0]
val tenant = keyValuePair[1]
if (!config.contains(tenant)) config.put(tenant, new TenantSetting)
config.get(tenant).get.requesterKeyMapping.put(requester, requesterKeyMapping.get(requester).get)
}
}
The logic to break the strings into a map can be the same for both as it's the same syntax.
What you had for the first string was not quite right as the filter you were applying to each string from the split result and not on the array result itself. Which also showed in that you were using [] on keyValuePair which was of type String and not Array[String] as I think you were expecting. Also you needed a trim in there to cope with the spaces in the second string. You might want to also trim the key and value to avoid other whitespace issues.
Additionally in this case the combination of map and filter can be more succinctly done with collect as shown here:
How to convert an Array to a Tuple?
The use of the pattern with 2 elements ensures you filter out anything with length other than 2 as you wanted.
The iterator is to make the combination of map and collect more efficient by only requiring one iteration of the collection returned from the first split (see comments below).
With both strings turned into a map, it just needs the right use of groupBy to group the first map by the value of the second based on the same key to get what you wanted. Obviously this only works if every key of the first map is also present in the second map; otherwise tentMap(k) throws a NoSuchElementException.
def toMap(str: String): Map[String, String] =
str
.split(";")
.iterator
.map(_.trim.split(','))
.collect { case Array(key, value) => (key.trim, value.trim) }
.toMap
val keyMap = toMap("androidApp,key1;iosApp,key2;xyz,key3")
val tentMap = toMap("androidApp,tenant1; iosApp,tenant1; xyz,tenant2")
val finalMap = keyMap.groupBy { case (k, _) => tentMap(k) }
Printing out finalMap gives:
Map(tenant2 -> Map(xyz -> key3), tenant1 -> Map(androidApp -> key1, iosApp -> key2))
Which is what you wanted.
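If some keys might be missing from the tenant map, one possible variant is to look them up with get and drop the entries that have no tenant. This is only a sketch: the "orphan" entry and the name `byTenant` are made up for illustration, and silently skipping unmatched keys is an assumption about what you would want.

```scala
// Already-parsed maps; "orphan" has no tenant on purpose.
val keyMap = Map("androidApp" -> "key1", "iosApp" -> "key2", "xyz" -> "key3", "orphan" -> "key4")
val tentMap = Map("androidApp" -> "tenant1", "iosApp" -> "tenant1", "xyz" -> "tenant2")

val byTenant: Map[String, Map[String, String]] =
  keyMap.toList
    // tentMap.get returns an Option, so apps without a tenant vanish in flatMap
    .flatMap { case (app, key) => tentMap.get(app).map(tenant => (tenant, app -> key)) }
    .groupBy { case (tenant, _) => tenant }
    .map { case (tenant, entries) => tenant -> entries.map(_._2).toMap }
```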

How to combine all the values with the same key in Scala?

I have a map like :
val programming = Map(("functional", 1) -> "scala", ("functional", 2) -> "perl", ("orientedObject", 1) -> "java", ("orientedObject", 2) -> "C++")
with the same first element of key appearing multiple times.
How to regroup all the values corresponding to the same first element of key ? Which would turn this map into :
Map("functional" -> List("scala","perl"), "orientedObject" -> List("java","C++"))
UPDATE: This answer is based upon your original question. If you need the more complex Map definition, using a tuple as the key, then the other answers will address your requirements. You may still find this approach simpler.
As has been pointed out, you can't actually have the same key more than once in a map: a later entry overwrites an earlier one with the same key. In the REPL, you'll note that your declaration becomes:
scala> val programming = Map("functional" -> "scala", "functional" -> "perl", "orientedObject" -> "java", "orientedObject" -> "C++")
programming: scala.collection.immutable.Map[String,String] = Map(functional -> perl, orientedObject -> C++)
So you end up missing some values. If you make this a List instead, you can get what you want as follows:
scala> val programming = List("functional" -> "scala", "functional" -> "perl", "orientedObject" -> "java", "orientedObject" -> "C++")
programming: List[(String, String)] = List((functional,scala), (functional,perl), (orientedObject,java), (orientedObject,C++))
scala> programming.groupBy(_._1).map(p => p._1 -> p._2.map(_._2)).toMap
res0: scala.collection.immutable.Map[String,List[String]] = Map(functional -> List(scala, perl), orientedObject -> List(java, C++))
Based on your edit, you have a data structure that looks something like this
val programming = Map(("functional", 1) -> "scala", ("functional", 2) -> "perl",
("orientedObject", 1) -> "java", ("orientedObject", 2) -> "C++")
and you want to scrap the numerical indices and group by the string key. Fortunately, Scala provides a built-in that gets you close.
programming groupBy { case ((k, _), _) => k }
This will return a new map which contains submaps of the original, grouped by the key that we return from the "partial" function. But we want a map of lists, so let's ignore the keys in the submaps.
programming groupBy { case ((k, _), _) => k } mapValues { _.values }
This gets us a map of... some kind of Iterable. But we really want lists, so let's take the final step and convert to a list.
programming groupBy { case ((k, _), _) => k } mapValues { _.values.toList }
You should try the .groupBy method
programming.groupBy(_._1._1)
and you will get
scala> programming.groupBy(_._1._1)
res1: scala.collection.immutable.Map[String,scala.collection.immutable.Map[(String, Int),String]] = Map(functional -> Map((functional,1) -> scala, (functional,2) -> perl), orientedObject -> Map((orientedObject,1) -> java, (orientedObject,2) -> C++))
you can now "clean" by doing something like:
scala> res1.mapValues(m => m.values.toList)
res3: scala.collection.immutable.Map[String,List[String]] = Map(functional -> List(scala, perl), orientedObject -> List(java, C++))
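On Scala 2.13 or later, the groupBy followed by mapValues can probably be collapsed into a single groupMap call. The sketch below assumes 2.13+ (groupMap does not exist in 2.12), and the name `grouped` is just for illustration.

```scala
val programming = Map(
  ("functional", 1) -> "scala", ("functional", 2) -> "perl",
  ("orientedObject", 1) -> "java", ("orientedObject", 2) -> "C++"
)

// groupMap takes the grouping key first and the value projection second.
// Converting to a List first makes the grouped values come back as Lists.
val grouped: Map[String, List[String]] =
  programming.toList.groupMap { case ((kind, _), _) => kind } { case (_, lang) => lang }
```

The order of the values inside each list follows the iteration order of the source map, which is not something to rely on in general; sort explicitly if the order matters.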
Read the csv file and create a map that contains each key and the list of its values.
import scala.io.Source
val fileStream = getClass.getResourceAsStream("/keyvaluepair.csv")
val lines = Source.fromInputStream(fileStream).getLines()
// turn every line into a (key, value) pair taken from its first two columns
val codeMap: List[(String, String)] = lines.map { line =>
  val cols = line.split(",").map(_.trim)
  cols(0) -> cols(1)
}.toList
// group the pairs by key and keep only the values
val res: Map[String, List[String]] = codeMap.groupBy(_._1).map(p => p._1 -> p._2.map(_._2))
Since no one has put in the specific ordering he asked for:
programming.groupBy(_._1._1)
.mapValues(_.toSeq.map { case ((t, i), l) => (i, l) }.sortBy(_._1).map(_._2))

Does this specific exercise lend itself well to a 'functional style' design pattern?

Say we have an array of one dimensional javascript objects contained in a file Array.json for which the key schema isn't known, that is the keys aren't known until the file is read.
Then we wish to output a CSV file with a header or first entry which is a comma delimited set of keys from all of the objects.
Each next line of the file should contain the comma separated values which correspond to each key from the file.
Array.json
[
abc:123,
xy:"yz",
s12:13,
],
...
[
abc:1
s:133,
]
A valid output:
abc,xy,s12,s
123,yz,13,
1,,,133
I'm teaching myself 'functional style' programming but I'm thinking that this problem doesn't lend itself well to a functional solution.
I believe that this problem requires some state to be kept for the output header and that subsequently each line depends on that header.
I'm looking to solve the problem in a single pass. My goals are efficiency for a large data set, minimal traversals, and if possible, parallelizability. If this isn't possible then can you give a proof or reasoning to explain why?
EDIT: Is there a way to solve the problem like this functionally?:
Say you pass through the array once, in some particular order. Then
from the start the header set looks like abc,xy,s12 for the first
object. With CSV entry 123,yz,13 . Then on the next object we add an
additional key to the header set so abc,xy,s12,s would be the header
and the CSV entry would be 1,,,133 . In the end we wouldn't need to
pass through the data set a second time. We could just append extra
commas to the result set. This is one way we could approach a single
pass....
Are there functional tools ( functions ) designed to solve problems like this, and what should I be considering? [ By functional tools I mean Monads,FlatMap, Filters, etc. ] . Alternatively, should I be considering things like Futures ?
Currently I've been trying to approach this using Java8, but am open to solutions from Scala, etc. Ideally I would be able to determine if Java8s' functional approach can solve the problem since that's the language I'm currently working in.
Since the csv output will change with every new line of input, you must hold that in memory before writing it out. If you consider creating an output text format from an internal representation of a csv file another "pass" over the data (the internal representation of the csv is practically a Map[String,List[String]] which you must traverse to convert it to text) then it's not possible to do this in a single pass.
If, however, this is acceptable, then you can use a Stream to read a single item from your json file, merge that into the csv file, and do this until the stream is empty.
Assuming, that the internal representation of the csv file is
trait CsvFile {
def merge(line: Map[String, String]): CsvFile
}
And you can represent a single item as
trait Item {
def asMap: Map[String, String]
}
You can implement it using foldLeft:
def toCsv(items: Stream[Item]): CsvFile =
items.foldLeft(CsvFile(Map()))((csv, item) => csv.merge(item.asMap))
or use recursion to get the same result
import scala.annotation.tailrec
@tailrec def toCsv(items: Stream[Item], prevCsv: CsvFile): CsvFile =
items match {
case Stream.Empty => prevCsv
case item #:: rest =>
val newCsv = prevCsv.merge(item.asMap)
toCsv(rest, newCsv)
}
Note: Of course you don't have to create types for CsvFile or Item, you can use Map[String,List[String]] and Map[String,String] respectively
UPDATE:
As more detail was requested for the CsvFile trait/class, here's an example implementation:
case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
def merge(line: Map[String, String]): CsvFile = {
val orig = lines.withDefaultValue(List.fill(rowCount)(""))
val current = line.withDefaultValue("")
val newLines = (lines.keySet ++ line.keySet) map {
k => (k, orig(k) :+ current(k))
}
CsvFile(newLines.toMap, rowCount + 1)
}
}
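To see the whole thing end to end, here is a small sketch that folds two records into a CsvFile and then renders the text. The `render` method is a hypothetical helper that is not part of the answer above, and sorting the header alphabetically is an assumption made only to get a deterministic column order.

```scala
// Same merge logic as above, repeated so the sketch is self-contained.
final case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
  def merge(line: Map[String, String]): CsvFile = {
    val orig = lines.withDefaultValue(List.fill(rowCount)(""))
    val current = line.withDefaultValue("")
    val newLines = (lines.keySet ++ line.keySet).map(k => k -> (orig(k) :+ current(k)))
    CsvFile(newLines.toMap, rowCount + 1)
  }

  // Hypothetical helper: walks the column map once more to build the text.
  def render: String = {
    val header = lines.keys.toList.sorted
    val rows = (0 until rowCount).map(i => header.map(k => lines(k)(i)).mkString(","))
    (header.mkString(",") +: rows).mkString("\n")
  }
}

val items = List(
  Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"),
  Map("abc" -> "1", "s" -> "133")
)

val csvText = items.foldLeft(CsvFile(Map.empty))((csv, item) => csv.merge(item)).render
```

The rendering step is the extra traversal mentioned earlier: the input is read once, but the in-memory columns still have to be walked to produce the output.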
This could be one approach:
val arr = Array(Map("abc" -> 123, "xy" -> "yz", "s12" -> 13), Map("abc" -> 1, "s" -> 133))
val keys = arr.flatMap(_.keys).distinct // get the distinct keys for header
arr.map(x => keys.map(y => x.getOrElse(y,""))) // get an array of rows
It's completely OK to have state in functional programming. What functional programming avoids is mutable state, that is, mutating state in place.
Functional programming advocates creating a new, changed state instead of mutating the existing one.
So it's OK to read and access state created in the program as long as you are not mutating it or causing side effects.
Coming to the point.
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.map { inner => inner.map { case (k, v) => k}}.flatten
list.map { inner => inner.map { case (k, v) => v}}.flatten
REPL
scala> val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list: List[List[(String, String)]] = List(List((abc,123), (xy,yz)), List((abc,1)))
scala> list.map { inner => inner.map { case (k, v) => k}}.flatten
res1: List[String] = List(abc, xy, abc)
scala> list.map { inner => inner.map { case (k, v) => v}}.flatten
res2: List[String] = List(123, yz, 1)
or use flatMap instead of map and flatten
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.flatMap { inner => inner.map { case (k, v) => k}}
list.flatMap { inner => inner.map { case (k, v) => v}}
In functional programming, mutable state is not allowed. But immutable states/values are fine.
Assuming that you have read your json file into a value input:List[Map[String,String]], the code below will solve your problem:
val input = List(Map("abc"->"123", "xy"->"yz" , "s12"->"13"), Map("abc"->"1", "s"->"33"))
val keys = input.map(_.keys).flatten.toSet
val keyvalues = input.map(kvs => keys.map(k => (k->kvs.getOrElse(k,""))).toMap)
val values = keyvalues.map(_.values)
val result = keys.mkString(",") + "\n" + values.map(_.mkString(",")).mkString("\n")
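If the column order should follow the order in which keys are first seen (as in the question's expected output), one option is to keep the records as ordered lists of pairs and use distinct instead of a Set. This is only a sketch; representing each record as a List of pairs is an assumption about how the JSON was parsed.

```scala
// Each record is an ordered list of (key, value) pairs.
val input: List[List[(String, String)]] = List(
  List("abc" -> "123", "xy" -> "yz", "s12" -> "13"),
  List("abc" -> "1", "s" -> "133")
)

// distinct keeps the first occurrence of every key, in encounter order.
val header: List[String] = input.flatMap(_.map(_._1)).distinct

// One row per record, with "" for keys the record does not have.
val rows: List[List[String]] = input.map { record =>
  val lookup = record.toMap
  header.map(k => lookup.getOrElse(k, ""))
}

val csv: String = (header :: rows).map(_.mkString(",")).mkString("\n")
```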

Better way for aggregation on list of case classes

I have list of case classes. Output requires aggregation on different parameters of case class. Looking for more optimized way to do it.
Example:
case class Students(city: String, college: String, group: String,
name: String, fee: Int, age: Int)
object GroupByStudents {
val studentsList= List(
Students("Mumbai","College1","Science","Jony",100,30),
Students("Mumbai","College1","Science","Tony", 200, 25),
Students("Mumbai","College1","Social","Bony",250,30),
Students("Mumbai","College2","Science","Gony", 240, 28),
Students("Bangalore","College3","Science","Hony", 270, 28))
}
Now to get details of students from a City, I need to first aggregate by City, then break up those details college-wise, then group-wise.
Output is list of case class in below format.
Students(Mumbai,,,,790,0) -- aggregate city wise
Students(Mumbai,College1,,,550,0) -- aggregate college wise
Students(Mumbai,College1,Social,,250,0)
Students(Mumbai,College1,Science,,300,0)
Students(Mumbai,College2,,,240,0)
Students(Mumbai,College2,Science,,240,0)
Students(Bangalore,,,,270,0)
Students(Bangalore,College3,,,270,0)
Students(Bangalore,College3,Science,,270,0)
Two methods to achieve this:
1) Loop over the whole list, create a map for each combination (in the above case, 3 combinations), aggregate the data, create a new result list and append the data to it.
2) Using foldLeft option
studentsList.groupBy(d=>(d.city))
.mapValues(_.foldLeft(Students("","","","",0,0))
((r,c) => Students(c.city,"","","",r.fee+c.fee,0)))
studentsList.groupBy(d=>(d.city,d.college))
.mapValues(_.foldLeft(Students("","","","",0,0))
((r,c) => Students(c.city,c.college,"","",r.fee+c.fee,0)))
studentsList.groupBy(d=>(d.city,d.college,d.group))
.mapValues(_.foldLeft(Students("","","","",0,0))
((r,c) => Students(c.city,c.college,c.group,"",r.fee+c.fee,0)))
In both cases, looping on list more than once. Is there any way to achieve this with single pass and optimized way.
With GroupBy
Code looks a little bit nicer, but I think it isn't faster. With groupby you have always 2 "loops"
studentsList.groupBy(d=>(d.city)).map { case (k,v) =>
Students(v.head.city,"","","",v.map(_.fee).sum, 0)
}
studentsList.groupBy(d=>(d.city,d.college)).map { case (k,v) =>
Students(v.head.city,v.head.college,"","",v.map(_.fee).sum, 0)
}
studentsList.groupBy(d=>(d.city,d.college,d.group)).map { case (k,v) =>
Students(v.head.city,v.head.college,v.head.group,"",v.map(_.fee).sum, 0)
}
You then get something like this
List(Students(Bangalore,,,,270,0),
Students(Mumbai,,,,790,0))
List(Students(Mumbai,College2,,,240,0),
Students(Bangalore,College3,,,270,0),
Students(Mumbai,College1,,,550,0))
List(Students(Bangalore,College3,Science,,270,0),
Students(Mumbai,College2,Science,,240,0),
Students(Mumbai,College1,Social,,250,0),
Students(Mumbai,College1,Science,,300,0))
It is not exactly the same output like in your example, but it is the desired output: a list of case class students.
With a for comprehension
You could avoid this extra looping if you do the grouping yourself. I only show the city example; the others are straightforward.
var m = Map[String, Students]()
for (v <- studentsList) {
m += v.city -> Students(v.city,"","","",v.fee + m.getOrElse(v.city, Students("","","","",0,0)).asInstanceOf[Students].fee, 0)
}
m
Output
It's the same output as for your studentsList, but I only loop once for each Map[String,Students] output.
Map(Mumbai -> Students(Mumbai,,,,790,0), Bangalore -> Students(Bangalore,,,,270,0))
With Foldleft
Just going in one loop over the complete list.
val emptyStudent = Students("","","","",0,0);
studentsList.foldLeft(Map[String, Students]()) { case (m, v) =>
m + (v.city -> Students(v.city,"","","",
v.fee + m.getOrElse(v.city, emptyStudent).fee, 0))
}
studentsList.foldLeft(Map[(String,String), Students]()) { case (m, v) =>
m + ((v.city,v.college) -> Students(v.city,v.college,"","",
v.fee + m.getOrElse((v.city,v.college), emptyStudent).fee, 0))
}
studentsList.foldLeft(Map[(String,String,String), Students]()) { case (m, v) =>
m + ((v.city,v.college,v.group) -> Students(v.city,v.college,v.group,"",
v.fee + m.getOrElse((v.city,v.college,v.group), emptyStudent).fee, 0))
}
Output
It's the same output as for your studentsList, but I only loop once for each Map output.
Map(Mumbai -> Students(Mumbai,,,,790,0),
Bangalore -> Students(Bangalore,,,,270,0))
Map((Mumbai,College1) -> Students(Mumbai,College1,,,550,0),
(Mumbai,College2) -> Students(Mumbai,College2,,,240,0),
(Bangalore,College3) -> Students(Bangalore,College3,,,270,0))
Map((Mumbai,College1,Science) -> Students(Mumbai,College1,Science,,300,0),
(Mumbai,College1,Social) -> Students(Mumbai,College1,Social,,250,0),
(Mumbai,College2,Science) -> Students(Mumbai,College2,Science,,240,0),
(Bangalore,College3,Science) -> Students(Bangalore,College3,Science,,270,0))
With FoldLeft One Loop
You can just generate one Big Map with all the List.
val emptyStudent = Students("","","","",0,0);
studentsList.foldLeft(Map[(String,String,String), Students]()) { case (m, v) =>
{
var t = m + ((v.city,"","") -> Students(v.city,"","","",
v.fee + m.getOrElse((v.city,"",""), emptyStudent).fee, 0))
t = t + ((v.city,v.college,"") -> Students(v.city,v.college,"","",
v.fee + m.getOrElse((v.city,v.college,""), emptyStudent).fee, 0))
t + ((v.city,v.college,v.group) -> Students(v.city,v.college,v.group,"",
v.fee + m.getOrElse((v.city,v.college,v.group), emptyStudent).fee, 0))
}
}
Output
In this case you loop once and get back the results for all the aggregations, but in a single Map. This would work with a for comprehension, too.
Map((Mumbai,College1,Science) -> Students(Mumbai,College1,Science,,300,0),
(Bangalore,,) -> Students(Bangalore,,,,270,0),
(Mumbai,College2,Science) -> Students(Mumbai,College2,Science,,240,0),
(Mumbai,College2,) -> Students(Mumbai,College2,,,240,0),
(Mumbai,College1,Social) -> Students(Mumbai,College1,Social,,250,0),
(Mumbai,,) -> Students(Mumbai,,,,790,0),
(Bangalore,College3,) -> Students(Bangalore,College3,,,270,0),
(Mumbai,College1,) -> Students(Mumbai,College1,,,550,0),
(Bangalore,College3,Science) -> Students(Bangalore,College3,Science,,270,0))
The Map is always copied, so it could have some performance and memory issues. To solve this use a for comprehension
For Comprehension One Loop
This generates one Map with the 3 aggregate types.
val emptyStudent = Students("","","","",0,0);
var m = Map[(String,String,String), Students]()
for (v <- studentsList) {
m += ((v.city,"","") -> Students(v.city,"","","", v.fee + m.getOrElse((v.city,"",""), emptyStudent).fee, 0))
m += ((v.city,v.college,"") -> Students(v.city,v.college,"","", v.fee + m.getOrElse((v.city,v.college,""), emptyStudent).fee, 0))
m += ((v.city,v.college,v.group) -> Students(v.city,v.college,v.group,"", v.fee + m.getOrElse((v.city,v.college,v.group), emptyStudent).fee, 0))
}
m
Be aware that m is still an immutable Map held in a var, so every += builds a new Map just like the foldLeft example does. To really avoid that, switch to a scala.collection.mutable.Map.
Output
Map((Mumbai,College1,Science) -> Students(Mumbai,College1,Science,,300,0),
(Bangalore,,) -> Students(Bangalore,,,,270,0),
(Mumbai,College2,Science) -> Students(Mumbai,College2,Science,,240,0),
(Mumbai,College2,) -> Students(Mumbai,College2,,,240,0),
(Mumbai,College1,Social) -> Students(Mumbai,College1,Social,,250,0),
(Mumbai,,) -> Students(Mumbai,,,,790,0),
(Bangalore,College3,) -> Students(Bangalore,College3,,,270,0),
(Mumbai,College1,) -> Students(Mumbai,College1,,,550,0),
(Bangalore,College3,Science) -> Students(Bangalore,College3,Science,,270,0))
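If the aim of the single loop is to avoid building a new map for every element, a mutable map updated in place is one way to get there. The sketch below is an assumption about what that could look like: it stores only the fee totals (an Int per key) rather than whole Students values, and the name `totals` is made up.

```scala
import scala.collection.mutable

final case class Students(city: String, college: String, group: String,
                          name: String, fee: Int, age: Int)

val studentsList = List(
  Students("Mumbai", "College1", "Science", "Jony", 100, 30),
  Students("Mumbai", "College1", "Science", "Tony", 200, 25),
  Students("Mumbai", "College1", "Social", "Bony", 250, 30),
  Students("Mumbai", "College2", "Science", "Gony", 240, 28),
  Students("Bangalore", "College3", "Science", "Hony", 270, 28)
)

// One mutable map for all three aggregation levels; missing keys start at 0.
val totals = mutable.Map.empty[(String, String, String), Int].withDefaultValue(0)

for (s <- studentsList) {
  totals((s.city, "", "")) += s.fee              // city level
  totals((s.city, s.college, "")) += s.fee       // college level
  totals((s.city, s.college, s.group)) += s.fee  // group level
}
```

Using "" as the marker for "all colleges" or "all groups" mirrors the answer above and would clash with a real empty name, so an Option per level may be a cleaner key.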
In all cases you could shorten the code by giving the parameters of your Students case class default values, because then you can write something like Students(city=v.city,fee=v.fee+m.getOrElse(v.city,emptyStudent).fee) during grouping.
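A minimal sketch of that idea, assuming you are free to change the case class: every parameter gets a default, so the aggregate rows only name the fields they actually fill in. The defaults ("" and 0) and the name `byCity` are assumptions.

```scala
// Defaults let callers omit everything they do not care about.
final case class Students(city: String = "", college: String = "", group: String = "",
                          name: String = "", fee: Int = 0, age: Int = 0)

val studentsList = List(
  Students("Mumbai", "College1", "Science", "Jony", 100, 30),
  Students("Mumbai", "College1", "Science", "Tony", 200, 25),
  Students("Mumbai", "College1", "Social", "Bony", 250, 30),
  Students("Mumbai", "College2", "Science", "Gony", 240, 28),
  Students("Bangalore", "College3", "Science", "Hony", 270, 28)
)

val emptyStudent = Students()

// City-level aggregation with named arguments instead of a row of "" and 0.
val byCity: Map[String, Students] =
  studentsList.foldLeft(Map.empty[String, Students]) { (m, v) =>
    m + (v.city -> Students(city = v.city, fee = v.fee + m.getOrElse(v.city, emptyStudent).fee))
  }
```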
Use a foldLeft
First, let's define some type aliases to make the syntax easier
object GroupByStudents {
type City = String
type College = String
type Group = String
type Name = String
type Aggregate = Map[City, Map[College, Map[Group, List[Students]]]]
def emptyAggregate: Aggregate = Map.empty
case class Students(city: City, college: College, group: Group,
name: Name, fee: Int, age: Int)
}
You can aggregate the students list into an Aggregate map in a single foldLeft
object Test {
import GroupByStudents._
def main(args: Array[String]): Unit = {
val studentsList = List(
Students("Mumbai","College1","Science","Jony",100,30),
Students("Mumbai","College1","Science","Tony", 200, 25),
Students("Mumbai","College1","Social","Bony",250,30),
Students("Mumbai","College2","Science","Gony", 240, 28),
Students("Bangalore","College3","Science","Hony", 270, 28))
val aggregated = studentsList.foldLeft(emptyAggregate){(agg, students) =>
val cityBin = agg.getOrElse(students.city, Map.empty)
val collegeBin = cityBin.getOrElse(students.college, Map.empty)
val groupBin = collegeBin.getOrElse(students.group, List.empty)
val nextGroupBin = students :: groupBin
val nextCollegeBin= collegeBin + (students.group -> nextGroupBin)
val nextCityBin = cityBin + (students.college -> nextCollegeBin)
agg + (students.city -> nextCityBin)
}
}
}
aggregated can then be mapped over to calculate fees.
If you really want, you can calculate the fees in the foldLeft itself, but this would make the code harder to read.
Note that you can also try monocle's lenses to put the students value in the aggregated structure.
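As a rough sketch of the "map over aggregated" step, the function below walks the nested map and emits one summary row per city, college and group, in the shape the question asked for. `feeRows` is a hypothetical name, and the nested map is built here with nested groupBy calls only to keep the example short; the foldLeft above would produce the same structure.

```scala
final case class Students(city: String, college: String, group: String,
                          name: String, fee: Int, age: Int)

type Aggregate = Map[String, Map[String, Map[String, List[Students]]]]

// Hypothetical helper: flattens the nested bins into summary rows.
def feeRows(agg: Aggregate): List[Students] =
  agg.toList.flatMap { case (city, colleges) =>
    val perCollege = colleges.toList.map { case (college, groups) =>
      val groupRows = groups.toList.map { case (group, members) =>
        Students(city, college, group, "", members.map(_.fee).sum, 0)
      }
      // A college total is the sum of its group totals.
      (Students(city, college, "", "", groupRows.map(_.fee).sum, 0), groupRows)
    }
    val cityRow = Students(city, "", "", "", perCollege.map(_._1.fee).sum, 0)
    cityRow :: perCollege.flatMap { case (collegeRow, groupRows) => collegeRow :: groupRows }
  }

val studentsList = List(
  Students("Mumbai", "College1", "Science", "Jony", 100, 30),
  Students("Mumbai", "College1", "Science", "Tony", 200, 25),
  Students("Mumbai", "College1", "Social", "Bony", 250, 30),
  Students("Mumbai", "College2", "Science", "Gony", 240, 28),
  Students("Bangalore", "College3", "Science", "Hony", 270, 28)
)

val aggregated: Aggregate =
  studentsList.groupBy(_.city).map { case (city, inCity) =>
    city -> inCity.groupBy(_.college).map { case (college, inCollege) =>
      college -> inCollege.groupBy(_.group)
    }
  }

val rows = feeRows(aggregated)
```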

Access a tuple inside a tuple for an anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String,String) and since they aren't unique I want to get a look at how many times each String, String combination occurs so I use countByValue like so
val PairCount = Pairs.countByValue().toSeq
which gives me a tuple as output like this ((String,String),Long) where long is the number of times that the (String, String) tuple appeared
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output that this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is going to be one of the String values from the inner tuple, and the value is going to be the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String,Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap{case((x,y),n) => Seq(x->n)}.toMap
for each unique x I get x->1
For example, the above line of code generates mom->1 dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala so I might be misinterpreting the functionality.
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I correctly understand question, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect: I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
List(
"mom"-> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
)
)
// don't use countByValue: if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occur, keys and values:
val wordPairCount = pairs.flatMap { case (a,b) =>
if (a == b) {
Seq(a->1)
} else {
Seq(a -> 1, b ->1)
}
}.reduceByKey(_ + _)
wordPairCount.take(10)
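The toMap surprise from the question can be reproduced on plain Scala collections, without Spark. The sketch below uses the mom/dad numbers from the question; the names `collapsed` and `summed` are illustrative. It shows that toMap keeps only the last value per key, while grouping and summing gives the totals that reduceByKey(_ + _) would produce on an RDD of pairs.

```scala
val counts: Seq[(String, Long)] = Seq("mom" -> 30L, "dad" -> 59L, "mom" -> 2L, "dad" -> 14L)

// A Map holds one value per key, so later pairs overwrite earlier ones.
val collapsed: Map[String, Long] = counts.toMap

// Grouping first and then summing keeps the contribution of every pair.
val summed: Map[String, Long] =
  counts.groupBy(_._1).map { case (word, hits) => word -> hits.map(_._2).sum }
```

So there is no need to go through toMap before reducing: the reduction should run on the sequence (or RDD) of pairs itself.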
To get the histograms for the (String,String) RDD, I used this code.
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram was the (String,String) RDD