Combining two lists into one in Scala

How can I combine the following two lists
L1 = List((a,1), (b,2), (c,3), (d,4))
L2 = List((a,b), (b,c), (a,d))
so that the combined list is:
L3 = List((1,2), (2,3), (1,4))

OK. First you will need to convert the first list to a map (the example below uses numeric keys, but the approach is identical):
val l1 = List((1,1),(4,4),(5,4),(8,4),(9,5))
val l2 = List((1,4),(1,9),(5,9),(8,9))
val mapL1 = l1.toMap
val requiredList = l2.map({ case (i, j) => (mapL1(i), mapL1(j)) })
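Note that mapL1(i) throws a NoSuchElementException if a key from the second list is missing from the map. A safer sketch, applied to the question's original data (with string keys standing in for the bare symbols), uses .get and keeps only fully resolved pairs:
val pairs = List(("a", 1), ("b", 2), ("c", 3), ("d", 4)).toMap
val edges = List(("a", "b"), ("b", "c"), ("a", "d"))
val combined = edges.flatMap { case (i, j) =>
  for (x <- pairs.get(i); y <- pairs.get(j)) yield (x, y)  // skip pairs with unknown keys
}
// combined: List[(Int, Int)] = List((1,2), (2,3), (1,4))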

Related

How to merge two Scala lists into another list without using built-in functions

val l1 = List(1,2)
val l2 = List(3,4)
I know that List.concat(l1, l2) will merge these two lists and create a new list. But is there any way to achieve it without using built-in methods?
I tried the code below, but it didn't help:
for(i <- l1; j <- l2){
List.add(i,j)
}
I want the result to be List(1,2,3,4), but I don't want to use concat, :::, or ++.
Please help.
You can choose from the several options below. I don't know whether List.concat(l1, l2) fits your constraints, so here are alternatives that avoid it. The first builds the result with an explicit tail-recursive loop:
val l1 = List(1, 2)
val l2 = List(3, 4, 5)
import scala.annotation.tailrec

def merge(l1: List[Int], l2: List[Int]): List[Int] = {
  @tailrec
  def loop(l1: Vector[Int], l2: List[Int]): List[Int] = {
    l1.lastOption match {
      case Some(x) => loop(l1.dropRight(1), x +: l2)  // move the last element of l1 to the front of l2
      case None    => l2                              // l1 exhausted; l2 now holds the merged result
    }
  }
  loop(l1.toVector, l2)
}
println(merge(l1, l2))
This next example uses foldRight to prepend the first list's elements onto the second:
val l1 = List(1, 2)
val l2 = List(3, 4, 5)
val concatenatedList: List[Int] = l1.toVector.foldRight(l2)((el, result) => el +: result)
println(concatenatedList)
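If you prefer foldLeft, the same idea works after reversing the first list (a sketch; note that reverse is itself a built-in, so this may or may not satisfy the constraint):
val viaFoldLeft: List[Int] = l1.reverse.foldLeft(l2)((acc, el) => el :: acc)
println(viaFoldLeft) // List(1, 2, 3, 4, 5)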
Also, as noted in the comments, do not use :+ on Lists: appending to the end of a List takes O(n) time, whereas prepending with +: takes constant time.
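For completeness, the textbook alternative is structural recursion on the first list. Here is a sketch (merge2 is my own name; it is not tail-recursive, so a very long first list can overflow the stack, which is exactly what the Vector-based @tailrec version above avoids):
def merge2[A](l1: List[A], l2: List[A]): List[A] = l1 match {
  case Nil    => l2                  // nothing left to prepend
  case h :: t => h :: merge2(t, l2)  // rebuild l1 in front of l2
}
println(merge2(List(1, 2), List(3, 4)))  // List(1, 2, 3, 4)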

Define a 2d list and append lists to it in a for loop in Scala

I want to define a 2d list before a for loop and then append 1d lists to it inside the loop, like so:
var 2dEmptyList: listOf<List<String>>
for (element<-elements){
///do some stuff
2dEmptyList.plusAssign(1dlist)
}
The code above does not work, but I can't seem to find a solution for something this simple!
scala> val elements = List("a", "b", "c")
elements: List[String] = List(a, b, c)
scala> val twoDimensionalList: List[List[String]] = List.empty[List[String]]
twoDimensionalList: List[List[String]] = List()
scala> val res = for (element <- elements) yield twoDimensionalList ::: List(element)
res: List[List[java.io.Serializable]] = List(List(a), List(b), List(c))
Better still:
scala> twoDimensionalList ::: elements.map(List(_))
res8: List[List[String]] = List(List(a), List(b), List(c))
If you want 2dEmptyList to be mutable, consider using scala.collection.mutable.ListBuffer:
scala> val ll = scala.collection.mutable.ListBuffer.empty[List[String]]
ll: scala.collection.mutable.ListBuffer[List[String]] = ListBuffer()
scala> ll += List("Hello")
res7: ll.type = ListBuffer(List(Hello))
scala> ll += List("How", "are", "you?")
res8: ll.type = ListBuffer(List(Hello), List(How, are, you?))
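Putting it together in the shape of the original loop (a sketch; elements and makeRow are stand-ins for whatever "do some stuff" produces):
import scala.collection.mutable.ListBuffer

val elements = List("a", "b", "c")
def makeRow(e: String): List[String] = List(e, e.toUpperCase)  // hypothetical row builder

val buf = ListBuffer.empty[List[String]]
for (element <- elements) {
  buf += makeRow(element)  // append one 1d list per iteration
}
val twoDList: List[List[String]] = buf.toList  // freeze into an immutable List at the end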

SubtractByKey and keep rejected values

I was playing around with spark and I am getting stuck with something that seems foolish.
Let's say we have two RDD:
rdd1 = {(1, 2), (3, 4), (3, 6)}
rdd2 = {(3, 9)}
if I do rdd1.subtractByKey(rdd2), I will get {(1, 2)}, which is perfectly fine. But I also want to save the rejected values {(3,4),(3,6)} to another RDD. Is there a prebuilt function in Spark or an elegant way to do this?
Please keep in mind that I am new with Spark, any help will be appreciated, thanks.
As Rohan suggests, there is no (to the best of my knowledge) standard API call to do this. What you want can be expressed as union - intersection.
Here is how you can do it in Spark:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val intersection = r1.map(_._1).intersection(r2.map(_._1))
val union = r1.map(_._1).union(r2.map(_._1))
val diff = union.subtract(intersection)
diff.collect()
> Array[Int] = Array(1)
To get the actual pairs:
val d = diff.collect()
r1.union(r2).filter(x => d.contains(x._1)).collect
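One caveat about that last step: collect() pulls the key set back to the driver and ships it inside the filter closure. That is fine for a small key set; for a larger one, a broadcast variable (sketched below) or a join, as in the next answer, scales better:
val keySet = sc.broadcast(diff.collect().toSet)  // ship the keys to each executor once
r1.union(r2).filter(x => keySet.value.contains(x._1)).collect()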
I'd claim this is slightly more elegant:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val r3 = r1.leftOuterJoin(r2)
val subtracted = r3.filter(_._2._2.isEmpty).map(x => (x._1, x._2._1))   // no match in r2
val discarded  = r3.filter(_._2._2.nonEmpty).map(x => (x._1, x._2._1))  // key also present in r2
//subtracted: (1,2)
//discarded: (3,4)(3,6)
The insight is that leftOuterJoin produces both the discarded records (those with a matching key in r2) and the remaining ones (no matching key) in one go.
It's a pity Spark doesn't have RDD.partition (in the Scala-collection sense of splitting a collection in two based on a predicate), or we could calculate subtracted and discarded in one pass.
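For comparison, this is what partition does on a plain Scala collection: one traversal yields both halves that the RDD code above has to compute separately:
val localPairs = Seq((1, 2), (3, 4), (3, 6))
val keysInR2   = Set(3)
val (discardedLocal, keptLocal) = localPairs.partition { case (k, _) => keysInR2(k) }
// discardedLocal: Seq((3,4), (3,6))
// keptLocal:      Seq((1,2))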
You can try
val rdd3 = rdd1.subtractByKey(rdd2)
val rdd4 = rdd1.subtractByKey(rdd3)
This recovers the rejected values, but by running a second subtraction rather than keeping them from the first pass.
Unfortunately, I don't think there's an easy way to keep the rejected values using subtractByKey(). One way to get your desired result is through cogrouping and filtering. Something like:
val cogrouped = rdd1.cogroup(rdd2, numPartitions)  // numPartitions is optional; see the REPL session below
def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
You might be able to borrow the work done here to make the last two lines look more elegant.
When I run this on your example, I see:
scala> val rdd1 = sc.parallelize(Array((1, 2), (3, 4), (3, 6)))
scala> val rdd2 = sc.parallelize(Array((3, 9)))
scala> val cogrouped = rdd1.cogroup(rdd2)
scala> def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
scala> val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> res1.collect()
...
res7: Array[(Int, Int)] = Array((1,2))
scala> res2.collect()
...
res8: Array[(Int, Int)] = Array((3,4), (3,6))
First use subtractByKey() and then subtract():
val rdd1 = spark.sparkContext.parallelize(Seq((1,2), (3,4), (3,5)))
val rdd2 = spark.sparkContext.parallelize(Seq((3,10)))
val result = rdd1.subtractByKey(rdd2)
result.foreach(print) // (1,2)
val rejected = rdd1.subtract(result)
rejected.foreach(print) // (3,5)(3,4)
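Since both calls traverse rdd1, caching it first may be worthwhile (a sketch; whether it pays off depends on how expensive rdd1 is to recompute):
rdd1.cache()  // keep rdd1 in memory so the second subtraction doesn't recompute it
val result = rdd1.subtractByKey(rdd2)
val rejected = rdd1.subtract(result)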

N to N matching in Scala

I have two lists, namely
val a = List(1,2,3)
val b = List(4,5)
I want to perform N to N bipartite mapping and want to get output
List((1,4),(1,5),(2,4),(2,5),(3,4),(3,5))
How can I do this?
Assuming A = List(1,2,3) and B = List(4,5), you can use a for comprehension to achieve your goal:
val A = List(1,2,3)
val B = List(4,5)
val result = for (a <- A; b <- B) yield {
  (a, b)
}
The output is
result:List[(Int, Int)] = List((1,4), (1,5), (2,4), (2,5), (3,4), (3,5))
Consider also
a.flatMap(x => b.map(y => (x,y)))
though not as concise as a for comprehension.

Multiple yields in sequence comprehension?

I'm trying to learn Scala and tried to write a sequence comprehension that extracts unigrams, bigrams and trigrams from a sequence. E.g., [1,2,3,4] should be transformed to (not Scala syntax)
[1; _,1; _,_,1; 2; 1,2; _,1,2; 3; 2,3; 1,2,3; 4; 3,4; 2,3,4]
In Scala 2.8, I tried the following:
def trigrams(tokens : Seq[T]) = {
  var t1 : Option[T] = None
  var t2 : Option[T] = None
  for (t3 <- tokens) {
    yield t3
    yield (t2, t3)
    yield (t1, t2, Some(t3))
    t1 = t2
    t2 = t3
  }
}
But this doesn't compile as, apparently, only one yield is allowed in a for-comprehension (no block statements either). Is there any other elegant way to get the same behavior, with only one pass over the data?
You can't have multiple yields in a for comprehension because for comprehensions are syntactic sugar for map (or flatMap) operations:
for (i <- collection) yield( func(i) )
translates into
collection map {i => func(i)}
Without a yield at all
for (i <- collection) func(i)
translates into
collection foreach {i => func(i)}
So the entire body of the for loop is turned into a single closure, and the presence of the yield keyword determines whether the function called on the collection is map or foreach (or flatMap). Because of this translation, the following are forbidden:
Using imperative statements next to a yield to determine what will be yielded.
Using multiple yields
(Not to mention that your proposed version will return a List[Any], because the tuples and the 1-grams all have different types. You probably want a List[List[Int]] instead.)
Try the following instead (which puts the n-grams in the order they appear):
val basis = List(1,2,3,4)
val slidingIterators = 1 to 4 map (basis sliding _)
for {
  onegram <- basis
  ngram   <- slidingIterators if ngram.hasNext
} yield ngram.next
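If I have traced the interleaving correctly, on List(1, 2, 3, 4) this yields the windows grouped by starting element rather than by length:
// List(List(1), List(1, 2), List(1, 2, 3), List(1, 2, 3, 4),
//      List(2), List(2, 3), List(2, 3, 4),
//      List(3), List(3, 4),
//      List(4))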
or
import scala.collection.mutable.ListBuffer

val basis = List(1, 2, 3, 4)
val slidingIterators = 1 to 4 map (basis sliding _)
val first = slidingIterators.head
val buf = new ListBuffer[List[Int]]
while (first.hasNext)
  for (i <- slidingIterators)
    if (i.hasNext)
      buf += i.next
If you prefer the n-grams to be in length order, try:
val basis = List(1,2,3,4)
1 to 4 flatMap (n => (basis sliding n).toList)
scala> val basis = List(1, 2, 3, 4)
basis: List[Int] = List(1, 2, 3, 4)
scala> val nGrams = (basis sliding 1).toList ::: (basis sliding 2).toList ::: (basis sliding 3).toList
nGrams: List[List[Int]] = ...
scala> nGrams foreach (println _)
List(1)
List(2)
List(3)
List(4)
List(1, 2)
List(2, 3)
List(3, 4)
List(1, 2, 3)
List(2, 3, 4)
I guess I should have given this more thought.
def trigrams[T](tokens: Seq[T]): Seq[(Option[T], Option[T], T)] = {
  var t1: Option[T] = None  // the token two positions back
  var t2: Option[T] = None  // the previous token
  for (t3 <- tokens)
    yield {
      val tri = (t1, t2, t3)
      t1 = t2
      t2 = Some(t3)
      tri
    }
}
Then extract the unigrams and bigrams from the trigrams, e.g. as sketched below. But can anyone explain to me why 'multi-yields' are not permitted, and if there's any other way to achieve their effect?
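That extraction step could look like this (a sketch; collect drops the entries whose Option padding is still None):
val tris = trigrams(Seq(1, 2, 3, 4))
// tris: ((None,None,1), (None,Some(1),2), (Some(1),Some(2),3), (Some(2),Some(3),4))
val unigrams = tris.map(_._3)                                   // Seq(1, 2, 3, 4)
val bigrams  = tris.collect { case (_, Some(p), c) => (p, c) }  // Seq((1,2), (2,3), (3,4))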
val basis = List(1, 2, 3, 4)
val nGrams = basis.map(x => x) :::
  (for (a <- basis; b <- basis) yield (a, b)) :::
  (for (a <- basis; b <- basis; c <- basis) yield (a, b, c))
nGrams: List[Any] = ...
nGrams foreach (println(_))
1
2
3
4
(1,1)
(1,2)
(1,3)
(1,4)
...
(4,4)
(1,1,1)
(1,1,2)
(1,1,3)
...
(4,4,4)
You could try a functional version without assignments:
def trigrams[T](tokens: Seq[T]) = {
  val s1 = tokens.map { Some(_) }  // the current token
  val s2 = None +: s1              // the previous token (shifted by one)
  val s3 = None +: s2              // the token before that (shifted by two)
  s1 zip s2 zip s3 map {
    // oldest-first, matching the (_,_,1), (_,1,2), (1,2,3) order in the question
    case ((t1, t2), t3) => (List(t1), List(t2, t1), List(t3, t2, t1))
  }
}
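A quick check of this version (with the oldest-first ordering above) on the running example:
trigrams(Seq(1, 2, 3, 4))
// per token: (unigram, bigram, trigram) padded with None; e.g. for the token 3:
// (List(Some(3)), List(Some(2), Some(3)), List(Some(1), Some(2), Some(3)))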