Scala: Swapping two key values in a collection?

I have a val pointing to a big collection of records that were read from a file in HDFS. Let's call this val 'a'. 'a' has a bunch of records that all contain these 3 properties: SRC, DEST, ACT. I need to make a clone of 'a', but with the values of the SRC and DEST keys swapped in each record. How do I go about doing this in Scala? I've tried different variants of the map function but can't seem to get this to work correctly.

Well, without a code example, I am guessing at your needs and prerequisites, but something like this could work:
case class Record(src: String, dest: String, act: String)

val a = List(
  Record("srcA", "destA", "actA"),
  Record("srcB", "destB", "actB"),
  Record("srcC", "destC", "actC"),
  Record("srcD", "destD", "actD"),
  Record("srcE", "destE", "actE")
)

val b = a.map(r => Record(r.dest, r.src, r.act))
println(a)
// => List(Record(srcA,destA,actA), Record(srcB,destB,actB), Record(srcC,destC,actC), Record(srcD,destD,actD), Record(srcE,destE,actE))
println(b)
// => List(Record(destA,srcA,actA), Record(destB,srcB,actB), Record(destC,srcC,actC), Record(destD,srcD,actD), Record(destE,srcE,actE))
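Since Record is a case class, an equivalent variant (a sketch; it has the advantage of not depending on the order of the constructor fields) is to use copy:

val b = a.map(r => r.copy(src = r.dest, dest = r.src))

The same map works unchanged if 'a' is a Spark RDD read from HDFS rather than a List.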

Related

Does this specific exercise lend itself well to a 'functional style' design pattern?

Say we have an array of one-dimensional JavaScript objects contained in a file Array.json, for which the key schema isn't known, that is, the keys aren't known until the file is read.
We then wish to output a CSV file whose header (first line) is the comma-delimited set of keys from all of the objects.
Each following line of the file should contain the comma-separated values corresponding to each key, one line per object.
Array.json
[
  {
    "abc": 123,
    "xy": "yz",
    "s12": 13
  },
  ...
  {
    "abc": 1,
    "s": 133
  }
]
A valid output:
abc,xy,s12,s
123,yz,13,
1,,,133
I'm teaching myself 'functional style' programming but I'm thinking that this problem doesn't lend itself well to a functional solution.
I believe that this problem requires some state to be kept for the output header and that subsequently each line depends on that header.
I'm looking to solve the problem in a single pass. My goals are efficiency for a large data set, minimal traversals, and if possible, parallelizability. If this isn't possible then can you give a proof or reasoning to explain why?
EDIT: Is there a way to solve the problem like this functionally?:
Say you pass through the array once, in some particular order. Then from the start the header set looks like abc,xy,s12 for the first object, with CSV entry 123,yz,13. Then on the next object we add an additional key to the header set, so abc,xy,s12,s would be the header and the CSV entry would be 1,,,133. In the end we wouldn't need to pass through the data set a second time; we could just append extra commas to the result set. This is one way we could approach a single pass.
Are there functional tools (functions) designed to solve problems like this, and what should I be considering? [By functional tools I mean monads, flatMap, filters, etc.] Alternatively, should I be considering things like Futures?
Currently I've been trying to approach this using Java 8, but am open to solutions in Scala, etc. Ideally I would be able to determine whether Java 8's functional approach can solve the problem, since that's the language I'm currently working in.
Since the CSV output can change with every new line of input, you must hold it in memory before writing it out. The internal representation of the CSV is practically a Map[String, List[String]], which you must traverse to convert it to text; if you count that conversion as another "pass" over the data, then it's not possible to do this in a single pass.
If, however, this is acceptable, then you can use a Stream to read one item at a time from your JSON file and merge it into the CSV until the stream is empty.
Assuming that the internal representation of the CSV file is
trait CsvFile {
  def merge(line: Map[String, String]): CsvFile
}
And you can represent a single item as
trait Item {
  def asMap: Map[String, String]
}
You can implement it using foldLeft:
def toCsv(items: Stream[Item]): CsvFile =
  items.foldLeft(CsvFile(Map()))((csv, item) => csv.merge(item.asMap))
or use recursion to get the same result
import scala.annotation.tailrec

@tailrec
def toCsv(items: Stream[Item], prevCsv: CsvFile): CsvFile =
  items match {
    case Stream.Empty => prevCsv
    case item #:: rest =>
      val newCsv = prevCsv.merge(item.asMap)
      toCsv(rest, newCsv)
  }
Note: of course you don't have to create types for CsvFile or Item; you can use Map[String, List[String]] and Map[String, String] respectively.
UPDATE:
As more detail was requested for the CsvFile trait/class, here's an example implementation:
case class CsvFile(lines: Map[String, List[String]], rowCount: Int = 0) {
  def merge(line: Map[String, String]): CsvFile = {
    // Columns that are new in this line get empty cells for all previous rows;
    // columns missing from this line get an empty cell for the current row.
    val orig = lines.withDefaultValue(List.fill(rowCount)(""))
    val current = line.withDefaultValue("")
    val newLines = (lines.keySet ++ line.keySet) map { k =>
      (k, orig(k) :+ current(k))
    }
    CsvFile(newLines.toMap, rowCount + 1)
  }
}
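A quick usage sketch under the same assumptions (the items here are invented for illustration and, as the note above suggests, represented directly as Maps):

val items = Stream(
  Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"),
  Map("abc" -> "1", "s" -> "133")
)
val csv = items.foldLeft(CsvFile(Map()))((acc, item) => acc.merge(item))

// Render the internal representation as CSV text (this is the extra traversal
// discussed above). Key order may vary, since Map keys are unordered.
val header = csv.lines.keys.toList
val rows = (0 until csv.rowCount).map(i => header.map(k => csv.lines(k)(i)))
val text = (header.mkString(",") +: rows.map(_.mkString(","))).mkString("\n")
// e.g.
// abc,xy,s12,s
// 123,yz,13,
// 1,,,133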
This could be one approach:
val arr = Array(Map("abc" -> 123, "xy" -> "yz", "s12" -> 13), Map("abc" -> 1, "s" -> 133))
val keys = arr.flatMap(_.keys).distinct // get the distinct keys for header
arr.map(x => keys.map(y => x.getOrElse(y,""))) // get an array of rows
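To render that as CSV text (a small extension of the same approach, not part of the original snippet):

val header = keys.mkString(",")
val lines = arr.map(x => keys.map(y => x.getOrElse(y, "")).mkString(","))
println((header +: lines).mkString("\n"))
// abc,xy,s12,s
// 123,yz,13,
// 1,,,133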
It's completely OK to have state in functional programming, but having mutable state, or mutating state in place, is not.
Functional programming advocates creating new, changed state instead of mutating state in place.
So it's OK to read and access state created in the program, as long as you are not mutating it or performing side effects.
Coming to the point:
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.map { inner => inner.map { case (k, v) => k}}.flatten
list.map { inner => inner.map { case (k, v) => v}}.flatten
REPL
scala> val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list: List[List[(String, String)]] = List(List((abc,123), (xy,yz)), List((abc,1)))
scala> list.map { inner => inner.map { case (k, v) => k}}.flatten
res1: List[String] = List(abc, xy, abc)
scala> list.map { inner => inner.map { case (k, v) => v}}.flatten
res2: List[String] = List(123, yz, 1)
or use flatMap instead of map and flatten
val list = List(List("abc" -> "123", "xy" -> "yz"), List("abc" -> "1"))
list.flatMap { inner => inner.map { case (k, v) => k}}
list.flatMap { inner => inner.map { case (k, v) => v}}
In functional programming, mutable state is not allowed, but immutable state/values are fine.
Assuming that you have read your JSON file into a value input: List[Map[String, String]], the code below will solve your problem:
val input = List(Map("abc" -> "123", "xy" -> "yz", "s12" -> "13"), Map("abc" -> "1", "s" -> "33"))
val keys = input.flatMap(_.keys).distinct // a List rather than a Set, to keep a stable key order
val values = input.map(kvs => keys.map(k => kvs.getOrElse(k, "")))
val result = keys.mkString(",") + "\n" + values.map(_.mkString(",")).mkString("\n")
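With the example input above, result is:
abc,xy,s12,s
123,yz,13,
1,,,33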

How to fill Scala Seq of Sets with unique values from Spark RDD?

I'm working with Spark and Scala. I have an RDD of Array[String] that I'm going to iterate through. The RDD contains values for attributes like (name, age, work, ...). I'm using a Sequence of mutable Sets of Strings (called attributes) to collect all unique values for each attribute.
Think of the RDD as something like this:
("name1","21","JobA")
("name2","21","JobB")
("name3","22","JobA")
In the end I want something like this:
attributes = (("name1","name2","name3"),("21","22"),("JobA","JobB"))
I have the following code:
val someLength = 10
val attributes = Seq.fill[mutable.Set[String]](someLength)(mutable.Set())
val splitLines = rdd.map(line => line.split("\t"))
splitLines.foreach(line => {
  for ((value, index) <- line.zipWithIndex) {
    attributes(index).add(value)
    // #1
  }
})
// #2
When I debug and stop at the line marked #1, everything is fine: attributes is correctly filled with unique values.
But after the loop, at line #2, attributes is empty again. Looking into it shows that attributes is a sequence of sets that are all of size 0:
Seq()
Seq()
...
What am I doing wrong? Is there some kind of scoping going on that I'm not aware of?
The answer lies in the fact that Spark is a distributed engine. I will give you a rough idea of the problem you are facing. The elements of an RDD are bucketed into partitions, and each partition potentially lives on a different node.
When you write rdd1.foreach(f), that f is wrapped inside a closure (which gets copies of the corresponding objects). Now, this closure is serialized and then sent to each node, where it is applied to each element in that partition.
Here, your f will get a copy of attributes in its wrapped closure, so when f is executed, it interacts with that copy of attributes and not with the attributes on the driver that you want. This results in your attributes being left without any changes.
I hope the problem is clear now.
val yourRdd = sc.parallelize(List(
  ("name1", "21", "JobA"),
  ("name2", "21", "JobB"),
  ("name3", "22", "JobA")
))

val yourNeededRdd = yourRdd
  .flatMap({ case (name, age, work) => List(("name", name), ("age", age), ("work", work)) })
  .groupBy({ case (attrName, attrVal) => attrName })
  .map({ case (attrName, group) => (attrName, group.toList.map(_._2).distinct) })
// RDD(
//   ("name", List("name1", "name2", "name3")),
//   ("age", List("21", "22")),
//   ("work", List("JobA", "JobB"))
// )
// Or
val distinctNamesRdd = yourRdd.map(_._1).distinct
// RDD("name1", "name2", "name3")
val distinctAgesRdd = yourRdd.map(_._2).distinct
// RDD("21", "22")
val distinctWorksRdd = yourRdd.map(_._3).distinct
// RDD("JobA", "JobB")

How to create Map using multiple lists in Spark

I am trying to figure out how to access particular elements from RDD myRDD with example entries below:
(600,List((600,111,7,1), (615,111,3,5)))
(601,List((622,112,2,1), (615,111,3,5), (456,111,9,12)))
I want to extract some data from a Redis DB, using the third field of the sub-lists as the ID. For example, in the case of (600,List((600,111,7,1), (615,111,3,5))), the IDs are 7 and 3.
In the case of (601,List((622,112,2,1), (615,111,3,5), (456,111,9,12))), the IDs are 2, 3 and 9.
The problem is that I don't know how to collect values using multiple IDs. In the code below I use line._2(3), but that's not correct, because this way I access sublists instead of the fields inside those sublists.
Should I use flatMap or similar?
val newRDD = myRDD.mapPartitions(iter => {
  val redisPool = new Pool(new JedisPool(new JedisPoolConfig(), "localhost", 6379, 2000))
  iter.map({ line => (line._1,
    redisPool.withJedisClient { client =>
      val start_date: String = Dress.up(client).hget("id:" + line._2(3), "start_date")
      val end_date: String = Dress.up(client).hget("id:" + line._2(3), "end_date")
      val additionalData = List((start_date, end_date))
      Map(("base_data", line._2), ("additional_data", additionalData))
    })
  })
})
newRDD.collect().foreach(println)
If we assume that the Redis DB contains some relevant data, then the resulting newRDD could look like the following:
(600,Map("base_data" -> List((600,111,7,1), (615,111,3,5)), "additional_data" -> List((2014,2015),(2015,2016))))
(601,Map("base_data" -> List((622,112,2,1), (615,111,3,5), (456,111,9,12)), "additional_data" -> List((2010,2015),(2011,2016),(2014,2016))))
To get a list of the third element of each tuple in line._2, use line._2.map(_._3) (assuming the type of line is (Int, List[(Int, Int, Int, Int)]), as it appears from your example, and types like Any aren't involved). Overall, it seems like your code should look like
iter.map({ case (first, second) => (first,
  redisPool.withJedisClient { client =>
    val additionalData = second.map { tuple =>
      val start_date: String = Dress.up(client).hget("id:" + tuple._3, "start_date")
      val end_date: String = Dress.up(client).hget("id:" + tuple._3, "end_date")
      (start_date, end_date)
    }
    Map(("base_data", second), ("additional_data", additionalData))
  })
})
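The core extraction can be checked in the REPL, without Spark or Redis:

scala> val line = (600, List((600, 111, 7, 1), (615, 111, 3, 5)))
line: (Int, List[(Int, Int, Int, Int)]) = (600,List((600,111,7,1), (615,111,3,5)))
scala> line._2.map(_._3)
res0: List[Int] = List(7, 3)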

How to find unique elements from list of tuples based on some elements using scala?

I have the following list:
val a = List(("name1","add1","city1",10),("name1","add1","city1",10),
  ("name2","add2","city2",10),("name2","add2","city2",20),("name3","add3","city3",20))
I want the distinct elements of the above list, based on the first three values of each tuple. The fourth value should not be considered when finding distinct elements.
I want the following output:
val output = List(("name1","add1","city1",10),("name2","add2","city2",10),
("name3","add3","city3",20))
Is it possible to get the above output?
As far as I know, distinct only works if the whole tuple/value is duplicated. I tried distinct like this:
val b = List(("name1","add1","city1",10),("name1","add1","city1",10),("name2","add2","city2",10),
("name2","add2","city2",20),("name3","add3","city3",20)).distinct
but it gives this output:
List(("name1","add1","city1",10),("name2","add2","city2",10),
("name2","add2","city2",20),("name3","add3","city3",20))
Any alternative approach will also be appreciated.
Use groupBy like this:
a.groupBy(v => (v._1, v._2, v._3)).keys.toList
This constructs a Map in which each key is, by definition, a unique triplet, as specified in the lambda above.
If the output should also include the last element of each tuple, fetch the first record for each key:
a.groupBy(v => (v._1, v._2, v._3)).mapValues(_.head).values.toList
If the order of the output list isn't important (i.e. you are happy to get List(("name3","add3","city3",20),("name1","add1","city1",10),("name2","add2","city2",10))), the following works as specified:
a.groupBy(v => (v._1,v._2,v._3)).values.map(_.head).toList
(Due to Scala collections design, you'll see the order kept for output lists up to 4 elements, but above that size HashMap will be used.) If you do need to keep the order, you can do something like (generalizing a bit)
import scala.collection.mutable.LinkedHashMap

def distinctBy[A, B](xs: Seq[A], f: A => B) = {
  val seen = LinkedHashMap.empty[B, A]
  xs.foreach { x =>
    val key = f(x)
    if (!seen.contains(key)) { seen.update(key, x) }
  }
  seen.values.toList
}
distinctBy(a, v => (v._1, v._2, v._3))
You could try
a.map { case x @ (name, add, city, _) => (name, add, city) -> x }.toMap.values.toList
Note that toMap keeps the last entry for each key, so this keeps the last occurrence of each triplet.
To make sure the first occurrence in the list is kept:
type String3 = (String, String, String)
type String3Int = (String, String, String, Int)

a.foldLeft(collection.immutable.ListMap.empty[String3, String3Int]) {
  case (acc, b) =>
    if (acc.contains((b._1, b._2, b._3))) acc
    else acc + ((b._1, b._2, b._3) -> b)
}.values.toList
One simple solution would be to convert the List to a Set. Sets don't contain duplicates: check the documentation.
val setOfTuples = a.toSet
println(setOfTuples)
Output: Set((name1,add1,city1,10), (name2,add2,city2,10), (name2,add2,city2,20), (name3,add3,city3,20))
Note that this removes only exact duplicates; it does not ignore the fourth field.

How to sort and merge lists using scala?

I have two different lists which contain different data.
Here is an example of the lists:
list1:[{"name":"name1","srno":"srno1"},{"name":"name2","srno":"srno2"}]
list2:[{"location":"location1","srno":"srno2"},{"location":"location2","srno":"srno1"}]
These two lists have a field in common, 'srno', which is of type String.
I want to join the lists on srno, merging the record for 'srno1' from list1 with the record for 'srno1' from list2, and so on.
So the final list would be like this:
[{"name":"name1","srno":"srno1","location":"location2"},{"name":"name2","srno":"srno2","location":"location2"}]
How do I sort and merge these two lists to form a single list using scala?
Edit:
There will be a one-to-one correspondence, i.e. each srno will be present exactly once in both lists.
Assuming you are converting your JSON to case classes, you can use a for comprehension to do this:
case class NameSrno(name: String, srno: String)
case class SrnoLoc(srno: String, location: String)
case class All(name: String, srno: String, location: String)

def merge(nsl: List[NameSrno], sll: List[SrnoLoc]): List[All] = {
  for {
    ns <- nsl
    sl <- sll
    if ns.srno == sl.srno
  } yield All(ns.name, ns.srno, sl.location)
}
Usage:
val nsl = List(NameSrno("item1", "1"), NameSrno("item2", "2"))
val sll = List(SrnoLoc("1", "London"), SrnoLoc("2", "Tokyo"))
merge(nsl, sll)
//> res0: List[test.SeqOps.All] = List(All(item1,1,London), All(item2,2,Tokyo))
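Since the edit guarantees a one-to-one correspondence, you could also build a lookup Map from the second list and avoid the nested traversal of the for comprehension (a sketch; O(n) map lookups instead of the O(n²) nested loop, and it assumes every srno in nsl is present in sll):

def mergeViaMap(nsl: List[NameSrno], sll: List[SrnoLoc]): List[All] = {
  // Index locations by srno once, then do a constant-time lookup per name record.
  val locBySrno = sll.map(sl => sl.srno -> sl.location).toMap
  nsl.map(ns => All(ns.name, ns.srno, locBySrno(ns.srno)))
}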