combining two lists with index wise - scala

first List
remoteDeviceAndPort===>List(
(1,891w.yourdomain.com,wlan-ap0),
(13,ap,GigabitEthernet0),
(11,Router-3900,GigabitEthernet0/0)
)
second List
interfacesList===>List(
(1,UP,,0,0,0,0,UP,4294,other,VoIP-Null0,0,0),
(13,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet6,0,0),
(11,UP,,0,0,0,0,UP,100,vlan,Vlan11,4558687845,1249542878),
(2,UP,,0,0,972,1327,UP,0,Tunnel,Virtual-Access1,0,0),
(4,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0),
(6,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0)
)
The above are my two lists now i have to combine these two lists like below.
Expected OutPut =>
combineList = List(
(1,UP,,0,0,0,0,UP,4294,other,VoIP-Null0,0,0,891w.yourdomain.com,wlan-ap0),
(13,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet6,0,0,ap,GigabitEthernet0),
(11,UP,,0,0,0,0,UP,100,vlan,Vlan11,4558687845,1249542878,Router-3900,GigabitEthernet0/0),
(2,UP,,0,0,972,1327,UP,0,Tunnel,Virtual-Access1,0,0,empty,empty),
(4,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0,empty,empty),
(6,DOWN,,0,0,0,0,UP,100,Ethernet,FastEthernet2,0,0,empty,empty)
)

The similar question here
case class NetworkDeviceInterfaces(index: Int, params: String*)
val remoteDeviceAndPort = List(
(1,"891w.yourdomain.com","wlan-ap0"),
(13,"ap","GigabitEthernet0"),
(11,"Router-3900","GigabitEthernet0/0")
)
val rdapMap = remoteDeviceAndPort map {case (k, v1, v2) => k -> (v1, v2) } toMap
val interfacesList = List(NetworkDeviceInterfaces(1,"UP","","0","0","0","0","UP","4294","other","VoIP-Null0","0","0"))
val result = interfacesList map {
interface => {
val (first, second) = rdapMap.getOrElse(interface.index, ("empty", "empty"))
NetworkDeviceInterfaces(interface.index, (interface.params ++ Seq(first, second)):_*)
}
}

Related

Reading CSV into Map[String, Array[String]] in Scala

Given a csv in the format below, what is the best way to load it into Scala as type Map[String, Array[String]], with the first key being the unique values for Col2, and the value Array[String]] as all co-occurring values of Col1?
a,1,
b,2,m
c,2,
d,1,
e,3,m
f,4,
g,2,
h,3,
I,1,
j,2,n
k,2,n
l,1,
m,5,
n,2,
I have tried to use the function below, but am getting errors trying to add to the Option type:
+= is not a member of Option[Array[String]]
In addition, I get overloaded method value ++ with alternatives:
with regards to the line case None => mapping ++ (linesplit(2) -> Array(linesplit(1)))
def parseCSV() : Map[String, Array[String]] = {
var mapping = Map[String, Array[String]]()
val lines = Source.fromFile("test.csv")
for (line <- lines.getLines) {
val linesplit = line.split(",")
mapping.get(linesplit(2)) match {
case Some(_) => mapping.get(linesplit(2)) += linesplit(1)
case None => mapping ++ (linesplit(2) -> Array(linesplit(1)))
}
}
mapping
}
}
I am hoping for a Map[String, Array[String]] like the following:
(2 -> Array["b","c","g","j", "k", "n"])
(3 -> Array["e","h"])
(4 -> Array["f"])
(5 -> Array["m"])
You can do the following:
First - read the file to List[List[String]]:
val rows: List[List[String]] = using(io.Source.fromFile("test.csv")) { source =>
source.getLines.toList map { line =>
line.split(",").map(_.trim).toList
}
}
Then, because the input has only 2 values per row, I filter the rows (rows with only one value I want to ignore)
val filteredRows = rows.filter(row => row.size > 1)
And the last step is to groupBy the first value (which is the second column - the index column is not returned from Source.fromFile):
filteredRows.groupBy(row => row.head).mapValues(_.map(_.last)))
This isn't complete, but it should give you an outline of how it might be done.
io.Source
.fromFile("so.txt") //open file
.getLines() //line by line
.map(_.split(",")) //split on commas
.toArray //load into memory
.groupMap(_(1))(_(0)) //Scala 2.13
//res0: Map[String,Array[String]] = Map(4 -> Array(f), 5 -> Array(m), 1 -> Array(a, d, I, l), 2 -> Array(b, c, g, j, k, n), 3 -> Array(e, h))
You'll notice that the file resource isn't closed, and it doesn't handle malformed input. I leave that for the diligent reader.
For the above code mutable Map & ArrayBuffer should be used, as they could be mutated/updated later.
def parseCSV(): Map[String, Array[String]] = {
val mapping = scala.collection.mutable.Map[String, ArrayBuffer[String]]()
val lines = Source.fromFile("test.csv")
for (line <- lines.getLines) {
val linesplit = line.split(",")
val key = line.split(",")(1)
val values = line.replace(s",$key", "").split(",")
mapping.get(key) match {
case Some(_) => mapping(linesplit(1)) ++= values
case None =>
val ab = ArrayBuffer[String]()
mapping(linesplit(1)) = ab ++= values
}
}
mapping.map(v => (v._1, v._2.toArray)).toMap
}

Full outer join in Scala

Given a list of lists, where each list has an object that represents the key, I need to write a full outer join that combines all the lists. Each record in the resulting list is the combination of all the fields of all the lists. In case that one key is present in list 1 and not present in list 2, then the fields in list 2 should be null or empty.
One solution I thought of is to embed an in-memory database, create the tables, run a select and get the result. However, I'd like to know if there are any libraries that handle this in a more simpler way. Any ideas?
For example, let's say I have two lists, where the key is the first field in the list:
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
val allLists = List (list1, list2)
The full outer joined list would be:
val allListsJoined = List ((1,2,"A"), (3,4,None), (5,6,None), (7,None,"B"))
NOTE: the solution needs to work for N lists
def fullOuterJoin[K, V1, V2](xs: List[(K, V1)], ys: List[(K, V2)]): List[(K, Option[V1], Option[V2])] = {
val map1 = xs.toMap
val map2 = ys.toMap
val allKeys = map1.keySet ++ map2.keySet
allKeys.toList.map(k => (k, map1.get(k), map2.get(k)))
}
Example usage:
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
println(fullOuterJoin(list1, list2))
Which prints:
List((1,Some(2),Some(A)), (3,Some(4),None), (5,Some(6),None), (7,None,Some(B)))
Edit per suggestion in comments:
If you're interested in joining an arbitrary number of lists and don't care about type info, here's a version that does that:
def fullOuterJoin[K](xs: List[List[(K, Any)]]): List[(K, List[Option[Any]])] = {
val maps = xs.map(_.toMap)
val allKeys = maps.map(_.keySet).reduce(_ ++ _)
allKeys.toList.map(k => (k, maps.map(m => m.get(k))))
}
val list1 = List ((1,2), (3,4), (5,6))
val list2 = List ((1,"A"), (7,"B"))
val list3 = List((1, 3.5), (7, 4.0))
val lists = List(list1, list2, list3)
println(fullOuterJoin(lists))
which outputs:
List((1,List(Some(2), Some(A), Some(3.5))), (3,List(Some(4), None, None)), (5,List(Some(6), None, None)), (7,List(None, Some(B), Some(4.0))))
If you want both an arbitrary number of lists and well-typed results, that's probably beyond the scope of a stackoverflow answer but could probably be accomplished with shapeless.
Here is a way to do it using collect separately on both list
val list1Ite = list1.collect{
case ele if list2.filter(e=> e._1 == ele._1).size>0 => { //if list2 _1 contains ele._1
val left = list2.find(e=> e._1 == ele._1) //find the available element
(ele._1, ele._2, left.get._2) //perform join
}
case others => (others._1, others._2, None) //others add None as _3
}
//list1Ite: List[(Int, Int, java.io.Serializable)] = List((1,2,A), (3,4,None), (5,6,None))
Do similar operation but exclude the elements which are already available in list1Ite
val list2Ite = list2.collect{
case ele if list1.filter(e=> e._1 == ele._1).size==0 => (ele._1, None , ele._2)
}
//list2Ite: List[(Int, None.type, String)] = List((7,None,B))
Combine both list1Ite and list2Ite to result
val result = list1Ite.++(list2Ite)
result: List[(Int, Any, java.io.Serializable)] = List((1,2,A), (3,4,None), (5,6,None), (7,None,B))

How to use reduceByKey to add a value into a Set in Scala Spark?

After I map my RDD to
((_id_1, section_id_1), (_id_1, section_id_2), (_id_2, section_3), (_id_2, section_4))
I want to reduceByKey to
((_id_1, Set(section_id_1, section_id_2), (_id_2, Set(section_3, section_4)))
val collectionReduce = collection_filtered.map(item => {
val extras = item._2.get("extras")
var section_id = ""
var extras_id = ""
if (extras != null) {
val extras_parse = extras.asInstanceOf[BSONObject]
section_id = extras_parse.get("guid").toString
extras_id = extras_parse.get("id").toString
}
(extras_id, Set {section_id})
}).groupByKey().collect()
My output is
((_id_1, (Set(section_1), Set(section_2))), (_id_2, (Set(section_3), Set(section_4))))
How do I fix that?
You can use reduceByKey by simply using ++ to combine the lists.
val rdd = sc.parallelize((1, Set("A")) :: (2, Set("B")) :: (2, Set("C")) :: Nil)
val reducedRdd = rdd.reduceByKey(_ ++ _)
reducedRdd.collect()
// Array((1,Set(A)), (2,Set(B, C)))
In your case :
collection_filtered.map(item => {
// ...
(extras_id, Set(section_id))
}).reduceByKey(_ ++ _).collect()
Here is an alternative with groupByKey/mapValues
val rdd = sc.parallelize(List(("_id_1", "section_id_1"), ("_id_1", "section_id_2"), ("_id_2", "section_3"), ("_id_2", "section_4")))
rdd.groupByKey().mapValues( v => v.toSet).foreach(println)
Here is another alternative using combineByKey (recommended over groupByKey) :
rdd.combineByKey(
(value: String) => Set(value),
(x: Set[String], value: String) => x + value ,
(x: Set[String], y: Set[String]) => (x ++ y)
).foreach(println)

Split RDD into RDD's with no repeating values

I have a RDD of Pairs as below :
(105,918)
(105,757)
(502,516)
(105,137)
(516,816)
(350,502)
I would like to split it into two RDD's such that the first has only the pairs with non-repeating values (for both key and value) and the second will have the rest of the omitted pairs.
So from the above we could get two RDD's
1) (105,918)
(502,516)
2) (105,757) - Omitted as 105 is already included in 1st RDD
(105,137) - Omitted as 105 is already included in 1st RDD
(516,816) - Omitted as 516 is already included in 1st RDD
(350,502) - Omitted as 502 is already included in 1st RDD
Currently I am using a mutable Set variable to track the elements already selected after coalescing the original RDD to a single partition :
val evalCombinations = collection.mutable.Set.empty[String]
val currentValidCombinations = allCombinations
.filter(p => {
if(!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
evalCombinations += p._1;evalCombinations += p._2; true
} else
false
})
This approach is limited by memory of the executor on which the operations run. Is there a better scalable solution for this ?
Here is a version, which will require more memory for driver.
import org.apache.spark.rdd._
import org.apache.spark._
def getUniq(rdd: RDD[(Int, Int)], sc: SparkContext): RDD[(Int, Int)] = {
val keys = rdd.keys.distinct
val values = rdd.values.distinct
// these are the keys which appear in value part also.
val both = keys.intersection(values)
val bBoth = sc.broadcast(both.collect.toSet)
// remove those key-value pairs which have value which is also a key.
val uKeys = rdd.filter(x => !bBoth.value.contains(x._2))
.reduceByKey{ case (v1, v2) => v1 } // keep uniq keys
uKeys.map{ case (k, v) => (v, k) } // swap key, value
.reduceByKey{ case (v1, v2) => v1 } // keep uniq value
.map{ case (k, v) => (v, k) } // correct placement
}
def getPartitionedRDDs(rdd: RDD[(Int, Int)], sc: SparkContext) = {
val r = getUniq(rdd, sc)
val remaining = rdd subtract r
val set = r.flatMap { case (k, v) => Array(k, v) }.collect.toSet
val a = remaining.filter{ case (x, y) => !set.contains(x) &&
!set.contains(y) }
val b = getUniq(a, sc)
val part1 = r union b
val part2 = rdd subtract part1
(part1, part2)
}
val rdd = sc.parallelize(Array((105,918),(105,757),(502,516),
(105,137),(516,816),(350,502)))
val (first, second) = getPartitionedRDDs(rdd, sc)
// first.collect: ((516,816), (105,918), (350,502))
// second.collect: ((105,137), (502,516), (105,757))
val rdd1 = sc.parallelize(Array((839,841),(842,843),(840,843),
(839,840),(1,2),(1,3),(4,3)))
val (f, s) = getPartitionedRDDs(rdd1, sc)
//f.collect: ((839,841), (1,2), (840,843), (4,3))

Map one value to all values with a common relation Scala

Having a set of data:
{sentenceA1}{\t}{sentenceB1}
{sentenceA1}{\t}{sentenceB2}
{sentenceA2}{\t}{sentenceB1}
{sentenceA3}{\t}{sentenceB1}
{sentenceA4}{\t}{sentenceB2}
I want to map a sentenceA to all the sentences that have a common sentenceB in Scala so the result will be something like this:
{sentenceA1}->{sentenceA2,sentenceA3,sentenceA4} or
{sentenceA2}->{sentenceA1, sentenceA3}
val lines = List(
"sentenceA1\tsentenceB1",
"sentenceA1\tsentenceB2",
"sentenceA2\tsentenceB1",
"sentenceA3\tsentenceB1",
"sentenceA4\tsentenceB2"
)
val afterSplit = lines.map(_.split("\t"))
val ba = afterSplit
.groupBy(_(1))
.mapValues(_.map(_(0)))
val ab = afterSplit
.groupBy(_(0))
.mapValues(_.map(_(1)))
val result = ab.map { case (a, b) =>
a -> b.foldLeft(Set[String]())(_ ++ ba(_)).diff(Set(a))
}