access tuple inside a tuple for anonymous map job in Spark - Scala

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String, String). Since they aren't unique, I want to see how many times each (String, String) combination occurs, so I use countByValue like so
val PairCount = Pairs.countByValue().toSeq
which gives me output of type ((String, String), Long), where the Long is the number of times that the (String, String) tuple appeared.
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key-value pair from a map job in this case, where the key is going to be one of the String values from the inner tuple and the value is going to be the Long value from the outer tuple?
Update:
@vitalii gets me almost there. The answer gets me to a Seq[(String, Long)], but what I really need is to turn that into a map so that I can run reduceByKey on it afterwards. When I run
PairCount.flatMap { case ((x, y), n) => Seq(x -> n) }.toMap
for each unique x I get x -> 1.
For example, the above line of code generates mom->1, dad->1 even when the tuples coming out of the flatMap included (mom,30), (dad,59), (mom,2), (dad,14), in which case I would expect toMap to provide mom->30, dad->59, mom->2, dad->14. However, I'm new to Scala, so I might be misinterpreting the functionality.
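For what it's worth, a quick REPL check shows what toMap actually does with duplicate keys: a Map holds exactly one value per key, and the last binding wins, so earlier pairs are silently discarded rather than summed:

scala> Seq("mom" -> 30, "dad" -> 59, "mom" -> 2, "dad" -> 14).toMap
res0: Map[String,Int] = Map(mom -> 2, dad -> 14)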
How can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?

If I understand the question correctly, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res: RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
  Seq(s1 -> n, s2 -> n)
}
Update: I didn't quite understand what your final goal is, but here are a few more examples that may help you. By the way, the code above is incorrect; I missed the fact that countByValue returns a Map, not an RDD:
val pairs = sc.parallelize(
  List(
    "mom" -> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
  )
)
// don't use countByValue: if pairs is large you will run out of memory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a, b) => Seq(a -> 1, b -> 1) }.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occurs, across keys and values:
val wordPairCount = pairs.flatMap { case (a, b) =>
  if (a == b) {
    Seq(a -> 1)
  } else {
    Seq(a -> 1, b -> 1)
  }
}.reduceByKey(_ + _)
wordPairCount.take(10)
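Worked out by hand for the sample pairs above, the two counts should come out as (ordering aside):

// wordCount:     (mom,1), (dad,2), (granny,1), (foo,4), (bar,1), (baz,1)
// wordPairCount: (mom,1), (dad,2), (granny,1), (foo,3), (bar,1), (baz,1)
// "foo" drops from 4 to 3 because foo -> foo is counted once per pair, not once per side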

To get the histograms for the (String, String) RDD, I used this code:
val Hist_X = histogram.map(x => x._1 -> 1.0).reduceByKey(_ + _).collect().toMap
val Hist_Y = histogram.map(x => x._2 -> 1.0).reduceByKey(_ + _).collect().toMap
val Hist_XY = histogram.map(x => x -> 1.0).reduceByKey(_ + _)
where histogram was the (String, String) RDD.
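Presumably the counts are Doubles so they can be normalized later. A minimal sketch of turning them into relative frequencies (my assumption, not part of the original answer):

val total = histogram.count().toDouble
val Px  = Hist_X.map { case (k, n) => k -> n / total }  // marginal distribution of the first String
val Py  = Hist_Y.map { case (k, n) => k -> n / total }  // marginal distribution of the second String
val Pxy = Hist_XY.mapValues(_ / total)                  // joint distribution, still an RDD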

Related

Scala Map from Text File

Looking to create a Scala map from a text file. A sample of the text file (a few lines of it) can be seen below:
Alabama (9),Democratic:849624,Republican:1441170,Libertarian:25176,Others:7312
Alaska (3),Democratic:153778,Republican:189951,Libertarian:8897,Others:6904
Arizona (11),Democratic:1672143,Republican:1661686,Libertarian:51465,Green:1557,Others:475
I have been given the map buffer as follows:
var mapBuffer: Map[String, List[(String, Int)]] = Map()
Note the party values are separated by a colon.
I am trying to read the file contents and store the data in a map structure where each line of the file is used to construct a map entry with the state as the key and a list of tuples as the value. The type of the structure should be Map[String, List[(String, Int)]].
Essentially I'm just trying to create a map entry from each line of the file, but I can't quite get it right. I tried the below with no luck - I think that val lines should be an array rather than an iterator.
val stream : InputStream = getClass.getResourceAsStream("")
val lines: Iterator[String] = scala.io.Source.fromInputStream(stream).getLines
var map: Map[String, List[(String, Int)]] = lines
  .map(_.split(","))
  .map(line => (line(0).toString, line(1).toList))
  .toMap
This appears to do the job. (Scala 2.13.x)
val stateVotes =
  util.Using(io.Source.fromFile("votes.txt")) {
    val PartyVotes = "([^:]+):(\\d+)".r
    _.getLines()
      .map(_.split(",").toList)
      .toList
      .groupMapReduce(_.head)(_.tail.collect {
        case PartyVotes(p, v) => (p, v.toInt)
      })(_ ++ _)
  } // file is auto-closed
//stateVotes: Try[Map[String,List[(String, Int)]]] = Success(
// Map(Alabama (9) -> List((Democratic,849624), (Republican,1441170), (Libertarian,25176), (Others,7312))
// , Arizona (11) -> List((Democratic,1672143), (Republican,1661686), (Libertarian,51465), (Green,1557), (Others,475))
// , Alaska (3) -> List((Democratic,153778), (Republican,189951), (Libertarian,8897), (Others,6904))))
In this case the number following the state name is preserved. That can be changed.
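For readers who haven't met groupMapReduce (Scala 2.13+): it fuses a groupBy, a per-element transformation, and a per-group reduction into one pass. Assuming rows: List[List[String]] is the result of the .map(_.split(",").toList).toList step above, and with the same PartyVotes regex in scope, the call is roughly equivalent to this sketch:

rows
  .groupBy(_.head)                        // key each line by its "Alabama (9)" field
  .view
  .mapValues(_.flatMap(_.tail.collect {   // parse every "Party:votes" cell per state
    case PartyVotes(p, v) => (p, v.toInt)
  }))
  .toMap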
No, an iterator is fine (better than a list, actually); you just need to split the values too to create those tuples. Note that split returns an Array, so match on it with the Array extractor:
lines
  .map(_.split(","))
  .map { l =>
    l.head -> l.tail.toList.map(_.split(":"))
      .collect { case Array(a, b) => a -> b.toInt }
  }
  .toMap
An alternative that looks a little more aesthetic to my eye is converting to a map early and then using mapValues (I personally much prefer short lambdas). The downside is that mapValues is lazy, so you end up having to do .toMap twice to force it in the end:
lines
  .map(_.split(","))
  .map { l => l.head -> l.tail.toList }
  .toMap
  .mapValues(_.map(_.split(":")))
  .mapValues(_.collect { case Array(a, b) => a -> b.toInt })
  .toMap

Does this specific exercise lend itself well to a 'functional style' design pattern?

Say we have an array of one-dimensional JavaScript objects contained in a file Array.json, for which the key schema isn't known; that is, the keys aren't known until the file is read.
Then we wish to output a CSV file whose header (first line) is the comma-delimited set of keys from all of the objects.
Each subsequent line of the file should contain the comma-separated values which correspond to each key from the file.
Array.json
{
  abc: 123,
  xy: "yz",
  s12: 13
},
...
{
  abc: 1,
  s: 133
}
A valid output:
abc,xy,s12,s
123,yz,13,
1,,,133
I'm teaching myself 'functional style' programming but I'm thinking that this problem doesn't lend itself well to a functional solution.
I believe that this problem requires some state to be kept for the output header and that subsequently each line depends on that header.
I'm looking to solve the problem in a single pass. My goals are efficiency for a large data set, minimal traversals, and if possible, parallelizability. If this isn't possible then can you give a proof or reasoning to explain why?
EDIT: Is there a way to solve the problem like this, functionally? Say you pass through the array once, in some particular order. From the start, the header set looks like abc,xy,s12 for the first object, with CSV entry 123,yz,13. Then on the next object we add an additional key to the header set, so abc,xy,s12,s would be the header and the CSV entry would be 1,,,133. In the end we wouldn't need to pass through the data set a second time; we could just append extra commas to the result set. This is one way we could approach a single pass.
Are there functional tools (functions) designed to solve problems like this, and what should I be considering? (By functional tools I mean monads, flatMap, filters, etc.) Alternatively, should I be considering things like Futures?
Currently I've been trying to approach this using Java 8, but I am open to solutions in Scala, etc. Ideally I would like to determine whether Java 8's functional approach can solve the problem, since that's the language I'm currently working in.
Since the CSV output will change with every new line of input, you must hold it in memory before writing it out. If you consider creating the output text format from an internal representation of the CSV file another "pass" over the data (the internal representation is practically a Map[String, List[String]], which you must traverse to convert it to text), then it's not possible to do this in a single pass.
If, however, this is acceptable, then you can use a Stream to read a single item at a time from your JSON file, merge it into the CSV file, and continue until the stream is empty.
Assuming that the internal representation of the CSV file is
trait CsvFile {
  def merge(line: Map[String, String]): CsvFile
}
And you can represent a single item as
trait Item {
  def asMap: Map[String, String]
}
You can implement it using foldLeft:
def toCsv(items: Stream[Item]): CsvFile =
  items.foldLeft(CsvFile(Map()))((csv, item) => csv.merge(item.asMap))
Or use recursion to get the same result:
import scala.annotation.tailrec

@tailrec def toCsv(items: Stream[Item], prevCsv: CsvFile): CsvFile =
  items match {
    case Stream.Empty => prevCsv
    case item #:: rest =>
      val newCsv = prevCsv.merge(item.asMap)
      toCsv(rest, newCsv)
  }
Note: Of course you don't have to create types for CsvFile or Item, you can use Map[String,List[String]] and Map[String,String] respectively
UPDATE:
As more detail was requested for the CsvFile trait/class, here's an example implementation:
case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
  def merge(line: Map[String, String]): CsvFile = {
    // existing columns default to a full column of empty cells, so a key
    // first seen now still gets a value for every earlier row
    val orig = lines.withDefaultValue(List.fill(rowCount)(""))
    // the incoming row defaults to "" for columns it doesn't mention
    val current = line.withDefaultValue("")
    val newLines = (lines.keySet ++ line.keySet) map { k =>
      (k, orig(k) :+ current(k))
    }
    CsvFile(newLines.toMap, rowCount + 1)
  }
}
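A quick usage check with the two objects from the question (my example, not part of the original answer), showing how the columns line up:

val csv = CsvFile(Map())
  .merge(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"))
  .merge(Map("abc" -> "1", "s" -> "133"))
// csv.lines == Map(
//   "abc" -> List("123", "1"),
//   "xy"  -> List("yz", ""),
//   "s12" -> List("13", ""),
//   "s"   -> List("", "133"))
// i.e. exactly the columns of the valid output shown in the question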
This could be one approach:
val arr = Array(Map("abc" -> 123, "xy" -> "yz", "s12" -> 13), Map("abc" -> 1, "s" -> 133))
val keys = arr.flatMap(_.keys).distinct // get the distinct keys for header
arr.map(x => keys.map(y => x.getOrElse(y,""))) // get an array of rows
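To go from there to the actual CSV text, a small follow-up sketch (header and rows joined by newlines):

val header = keys.mkString(",")
val rows = arr.map(x => keys.map(y => x.getOrElse(y, "")).mkString(","))
val csvText = (header +: rows).mkString("\n")
// abc,xy,s12,s
// 123,yz,13,
// 1,,,133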
It's completely OK to have state in functional programming; what is not allowed is mutable state, or mutating state in place.
Functional programming advocates creating new, changed state instead of mutating the state in place.
So it's OK to read and access state created in the program, as long as you are not mutating it or producing side effects.
Coming to the point.
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.map { inner => inner.map { case (k, v) => k}}.flatten
list.map { inner => inner.map { case (k, v) => v}}.flatten
REPL
scala> val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list: List[List[(String, String)]] = List(List((abc,123), (xy,yz)), List((abc,1)))
scala> list.map { inner => inner.map { case (k, v) => k}}.flatten
res1: List[String] = List(abc, xy, abc)
scala> list.map { inner => inner.map { case (k, v) => v}}.flatten
res2: List[String] = List(123, yz, 1)
Or use flatMap instead of map and flatten:
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.flatMap { inner => inner.map { case (k, v) => k}}
list.flatMap { inner => inner.map { case (k, v) => v}}
In functional programming, mutable state is not allowed, but immutable state/values are fine.
Assuming that you have read your JSON file into a value input: List[Map[String, String]], the code below will solve your problem:
val input = List(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"), Map("abc" -> "1", "s" -> "33"))
val keys = input.map(_.keys).flatten.toSet
val keyvalues = input.map(kvs => keys.map(k => k -> kvs.getOrElse(k, "")).toMap)
val values = keyvalues.map(_.values)
val result = keys.mkString(",") + "\n" + values.map(_.mkString(",")).mkString("\n")
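For this input, result should print as below (this relies on Scala's small Set/Map implementations iterating in insertion order, which holds up to four entries):

// abc,xy,s12,s
// 123,yz,13,
// 1,,,33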

How to extract values from Some() in Scala

I have an array of Some-wrapped Maps of type Map[String, String], such as
Array[Option[Any]] = Array(Some(Map(String, String)))
I want to return it as
Array(Map(String, String))
I've tried a few different ways of extracting it. Let's say
val x = Array(Some(Map(String, String)))
val x1 = for (i <- 0 until x.length) yield { x.apply(i) }
but this returns IndexedSeq(Some(Map)), which is not what I want.
I tried pattern matching,
x.foreach { i =>
i match {
case Some(value) => value
case _ => println("nothing") }}
Another thing I tried, which was somewhat successful, was that
x.apply(0).get.asInstanceOf[Map[String, String]]
will do something like what I want, but it only gets the 0th index of the entire array, and I want all the maps in the array.
How can I extract Map type out of Some?
If you want an Array[Any] from your Array[Option[Any]], you can use this for expression:
for {
  opt <- x
  value <- opt
} yield value
This will put the values of all the non-empty Options inside a new array.
It is equivalent to this:
x.flatMap(_.toArray[Any])
Here, all options will be converted to an array of either 0 or 1 element. All these arrays will then be flattened back to one single array containing all the values.
Generally, the pattern is to use transformations on the Option[T], like map, flatMap, filter, etc.
The problem is that we'll need a type cast to retrieve the underlying Map[String, String] from Any. So we'll use flatten to remove any None entries and unwrap the Options, and asInstanceOf to retrieve the type:
scala> val y = Array(Some(Map("1" -> "1")), Some(Map("2" -> "2")), None)
y: Array[Option[scala.collection.immutable.Map[String,String]]] = Array(Some(Map(1 -> 1)), Some(Map(2 -> 2)), None)
scala> y.flatten.map(_.asInstanceOf[Map[String, String]])
res7: Array[Map[String,String]] = Array(Map(1 -> 1), Map(2 -> 2))
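If the array really is typed Array[Option[Any]], as in the question, a collect can unwrap and cast in one pass. A sketch (the cast on the Map's type parameters is unchecked, which erasure makes unavoidable):

val x: Array[Option[Any]] = Array(Some(Map("1" -> "1")), None)
val maps: Array[Map[String, String]] = x.collect {
  case Some(m: Map[_, _]) => m.asInstanceOf[Map[String, String]]
}
// Array(Map(1 -> 1))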
Also, when you're dealing with just a single value, you can try Some("test").head, and for null simply Some(null).flatten.

How to find tuple with different value in a list using scala?

I have following list:
val list = List(("name1",20),("name2",20),("name1",30),("name2",30),
("name3",40),("name3",30),("name3",20))
I want following output:
List(("name3",40))
I tried following:
val distElements = list.map(_._2).distinct
list.groupBy(_._1).map { case (k, v) =>
  val h = v.map(_._2)
  if (distElements.equals(h)) List.empty else distElements.diff(h)
}.flatten
But this is not what I am looking for.
Can anybody give me an answer, or a hint, on how to get the expected output?
I understand the question as looking for the element of the list whose _2 (number) occurs only once.
val list = List(("name1",20),("name2",20),("name1",30),("name2",30),
("name3",40),("name3",30),("name3",20))
First you group by the _2 element, which gives you a map from each distinct _2 value to the list of all elements having that _2:
val g = list.groupBy(_._2) // Map[Int, List[(String, Int)]]
Now you can pick out those entries whose group consists of only one element:
val opt = g.collectFirst { // Option[(String, Int)]
  case (_, single :: Nil) => single
}
Or (if you are expecting possibly more than one distinct value)
val col = g.collect { // Map[String, Int]
  case (_, single :: Nil) => single
}
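For the sample list, 40 is the only count that occurs exactly once, so:

// opt == Some((name3,40))
// col == Map(name3 -> 40)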
Seems to me that you're looking to match against both the left-hand and right-hand values at the same time, while also preserving the type of collection you're looking at, a List. I would use collect:
val out = myList.collect {
  case item @ ("name3", 40) => item
}
which combines a PartialFunction with filter- and map-like qualities. In this case, it filters out any value for which the PartialFunction is not defined, while mapping the values that match. Here, I've only allowed for a single exact match.

How to find unique elements from list of tuples based on some elements using scala?

I have following list
val a = List(("name1","add1","city1",10),("name1","add1","city1",10),
("name2","add2","city2",10),("name2","add2","city2",20),("name3","add3","city3",20))
I want the distinct elements from the above list based on the first three values of each tuple. The fourth value should not be considered when finding distinct elements in the list.
I want following output:
val output = List(("name1","add1","city1",10),("name2","add2","city2",10),
("name3","add3","city3",20))
Is it possible to get above output?
As far as I know, distinct works only if the whole tuple/value is duplicated. I tried distinct like the following:
val b = List(("name1","add1","city1",10),("name1","add1","city1",10),("name2","add2","city2",10),
("name2","add2","city2",20),("name3","add3","city3",20)).distinct
but it gives this output:
List(("name1","add1","city1",10),("name2","add2","city2",10),
("name2","add2","city2",20),("name3","add3","city3",20))
Any alternate approach will also be appreciated.
Use groupBy like this:
a.groupBy(v => (v._1, v._2, v._3)).keys.toList
This constructs a Map whose keys are, by definition, the unique triplets produced by the lambda function above. Note that taking .keys drops the fourth element of each tuple.
Should the result also include the last element of the tuple, fetch the first element of each group instead, like this:
a.groupBy(v => (v._1, v._2, v._3)).mapValues(_.head)
If the order of the output list isn't important (i.e. you are happy to get List(("name3","add3","city3",20),("name1","add1","city1",10),("name2","add2","city2",10))), the following works as specified:
a.groupBy(v => (v._1,v._2,v._3)).values.map(_.head).toList
(Due to Scala collections design, you'll see the order kept for output lists up to 4 elements, but above that size HashMap will be used.) If you do need to keep the order, you can do something like (generalizing a bit)
import scala.collection.mutable.LinkedHashMap

// curried so the lambda's parameter type can be inferred at the call site
def distinctBy[A, B](xs: Seq[A])(f: A => B): List[A] = {
  val seen = LinkedHashMap.empty[B, A]
  xs.foreach { x =>
    val key = f(x)
    if (!seen.contains(key)) { seen.update(key, x) }
  }
  seen.values.toList
}
distinctBy(a)(v => (v._1, v._2, v._3))
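On Scala 2.13+ you don't need the helper at all: the standard library's distinctBy is equivalent and also keeps the first occurrence of each key:

a.distinctBy(v => (v._1, v._2, v._3))
// List((name1,add1,city1,10), (name2,add2,city2,10), (name3,add3,city3,20))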
You could try
a.map { case x @ (name, add, city, _) => (name, add, city) -> x }.toMap.values.toList
Note that toMap keeps the last duplicate; to make sure it's the first one in the list that is kept:
type String3 = (String, String, String)
type String3Int = (String, String, String, Int)

a.foldLeft(collection.immutable.ListMap.empty[String3, String3Int]) {
  case (acc, b) =>
    if (acc.contains((b._1, b._2, b._3))) acc
    else acc + ((b._1, b._2, b._3) -> b)
}.values.toList
One simple solution would be to convert the List to a Set; Sets don't contain duplicates (check the documentation). Note, though, that this removes only tuples that are identical in all four fields, so both (name2, ...) entries survive:
val setOfTuples = a.toSet
println(setOfTuples)
Output (order may vary): Set((name1,add1,city1,10), (name2,add2,city2,10), (name2,add2,city2,20), (name3,add3,city3,20))