scala intersection with count - scala

I have a simple question, suppose I have 2 RDDs:
RDD1: [a,b,b,c,c,c,d] RDD2:[a,b,c,d]
and I want to find out how many a,b,c,d are there such that the returned results should be something like:
RDD:[(a,b,c,d),(1,2,3,1)]
This is easy with Lists, but with an RDD I seem to have to collect the elements into an Array first and do something like:
count(_==string)
Is there something easier that I could work with?

I know very little about RDDs or Spark, but in plain Scala you can try something like this:
val l1 = List('a', 'b', 'c', 'd')
val l2 = List('a', 'b', 'b', 'c', 'c', 'c', 'd')
def f(l1: List[Char], l2: List[Char]): (List[Char], List[Int]) = {
  val count = l1.map {
    x => l2.count(_ == x)
  }
  (l1, count)
}
f(l1,l2)
Output at REPL :
res0: (List[Char], List[Int]) = (List(a, b, c, d),List(1, 2, 3, 1))
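The version above rescans l2 once for every key. Here is a minimal sketch that counts every element once up front and then looks the keys up, using plain Scala collections and the data from the question. The names `keys`, `data`, `counts`, and `result` are made up for illustration. As far as I know, the same group-then-count idea is what `countByValue()` or `map(x => (x, 1)).reduceByKey(_ + _)` do on a Spark RDD, but that part is not run here.

```scala
// Hypothetical names; plain collections stand in for the two RDDs.
val keys = List('a', 'b', 'c', 'd')
val data = List('a', 'b', 'b', 'c', 'c', 'c', 'd')

// Group equal elements, then keep only the size of each group.
val counts: Map[Char, Int] =
  data.groupBy(identity).map { case (k, group) => (k, group.size) }

// Look up each key, defaulting to 0 for keys that never occur in data.
val result: (List[Char], List[Int]) =
  (keys, keys.map(k => counts.getOrElse(k, 0)))
```

The default of 0 is a design choice for keys that are missing from the data; drop the `getOrElse` if a missing key should be treated as an error instead.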

Related

Scala: concise syntax for constructing `Map` that assigns int-values to characters?

I want to define a Map that assigns values to letters like so:
'A', 'B', 'C' should be assigned value 1
'D', 'E', 'F' should be assigned value 2
etc.
Here is what I tried:
def lettersAndValues = Map(
1 -> Set('A', 'B', 'C'),
2 -> Set('D', 'E', 'F'),
).flatMap {case (value, letters) => letters.map(letter =>(letter, value))}
Now I want to use the values of the letters to compute a score for words, for instance calculating the value of "ABCD" should give 1+1+1+2 = 5. How can I define the score function? Are there other more concise ways to assign values to letters for calculations?
If your goal is to quickly define values of many letters, and then define a score function, here is a shorter way to do this:
val letterToValue = List(
"ABC" -> 1,
"DEF" -> 2
).flatMap{
case (letters, value) => letters.map(letter => (letter, value))
}.toMap
def score(word: String) = word.map(letterToValue).sum
println(score("BED"))
println(score("BAD"))
println(score("CAFEBABE"))
It prints:
5
4
11
A for-comprehension approach:
val generatorMap = Map(
"ABC" -> 1,
"DEF" -> 2
)
val letterToValue: Map[Char, Int] = for {
(ls, v) <- generatorMap
l <- ls
} yield {
(l, v)
}
def score(word: String) = word.map(letterToValue).sum
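One thing to watch with both versions: word.map(letterToValue) calls Map.apply, which throws a NoSuchElementException for a letter that has no assigned value. Below is a small sketch of a more forgiving variant, assuming that unknown letters should simply count as 0. That default and the name `scoreOrZero` are my assumptions, not something from the question.

```scala
// Same letter table as in the answer above.
val letterToValue: Map[Char, Int] =
  List("ABC" -> 1, "DEF" -> 2).flatMap { case (letters, value) =>
    letters.map(letter => (letter, value))
  }.toMap

// Hypothetical variant of score: letters without a value contribute 0
// instead of throwing.
def scoreOrZero(word: String): Int =
  word.iterator.map(c => letterToValue.getOrElse(c, 0)).sum
```

With this table, scoreOrZero("ABCD") gives 1+1+1+2 = 5 as in the question, and a word containing an unmapped letter such as 'Z' no longer fails.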

Scala - byte array of UTF8 strings

I have a byte array (or more precisely a ByteString) of UTF8 strings, which are prefixed by their length as 2-bytes (msb, lsb). For example:
val z = akka.util.ByteString(0, 3, 'A', 'B', 'C', 0, 5,
'D', 'E', 'F', 'G', 'H',0,1,'I')
I would like to convert this to a list of strings, so the result should look like List("ABC", "DEFGH", "I").
Is there an elegant way to do this?
(EDIT) These strings are NOT null terminated, the 0 you are seeing in the array is just the MSB. If the strings were long enough, the MSB would be greater than zero.
Edit: Updated based on clarification in comments that first 2 bytes define an int. So I converted it manually.
def convert(bs: List[Byte]) : List[String] = {
  bs match {
    case count_b1 :: count_b2 :: t =>
      val count = ((count_b1 & 0xff) << 8) | (count_b2 & 0xff)
      val (chars, leftover) = t.splitAt(count)
      new String(chars.toArray, "UTF-8") :: convert(leftover)
    case _ => List()
  }
}
Call convert(z.toList)
Consider the multiSpan method as defined here, which is a repeated application of span over a given list,
z.multiSpan(_ == 0).map( _.drop(2).map(_.toChar).mkString )
Here the spanning condition is whether an item equals 0; we then drop the two length-prefix bytes from each group and convert the remaining bytes to a String. Note that this relies on every MSB being 0, that is, on every string being shorter than 256 bytes, which the question's edit says cannot be assumed.
Note: when using multiSpan, remember to import annotation.tailrec.
Here is my answer with foldLeft.
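def convert(z : ByteString) = z.foldLeft((List() : List[String], ByteString(), 0, 0))((p, b : Byte) => {
  // p = (strings decoded so far in reverse order, bytes of the current string,
  //      bytes still to read or -1 when the length LSB comes next, MSB of the length)
  // b & 0xff reads the length bytes as unsigned; b.toInt would go negative for bytes >= 0x80
  p._3 match {
    case 0 if p._2.nonEmpty => (p._2.utf8String :: p._1, ByteString(), -1, b & 0xff)
    case 0 => (p._1, p._2, -1, b & 0xff)
    case -1 => (p._1, p._2, (p._4 << 8) + (b & 0xff), 0)
    case _ => (p._1, p._2 :+ b, p._3 - 1, 0)
  }
})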
It works like this:
scala> val bs = ByteString(0, 3, 'A', 'B', 'C', 0, 5, 'D', 'E', 'F', 'G', 'H',0,1,'I')
scala> val k = convert(bs); (k._2.utf8String :: k._1).reverse
k: (List[String], akka.util.ByteString, Int, Int) = (List(DEFGH, ABC),ByteString(73),0,0)
res20: List[String] = List(ABC, DEFGH, I)
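For comparison, here is a sketch of the same decoding over a plain Array[Byte] with a tail-recursive loop, so it needs no Akka dependency and does not grow the stack on long inputs. `frame` and `decodeAll` are hypothetical helper names. A truncated final record is decoded from whatever bytes remain, which is an assumption about the desired behaviour rather than something the question specifies.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import scala.annotation.tailrec

// Hypothetical helper: encode one string with a 2-byte (msb, lsb) length prefix.
def frame(s: String): Array[Byte] = {
  val body = s.getBytes(UTF_8)
  Array(((body.length >> 8) & 0xff).toByte, (body.length & 0xff).toByte) ++ body
}

// Hypothetical helper: decode all length-prefixed UTF-8 strings in order.
def decodeAll(bytes: Array[Byte]): List[String] = {
  @tailrec
  def loop(pos: Int, acc: List[String]): List[String] =
    if (pos + 2 > bytes.length) acc.reverse // no complete length prefix left
    else {
      // Mask with 0xff so the length bytes are read as unsigned values.
      val len = ((bytes(pos) & 0xff) << 8) | (bytes(pos + 1) & 0xff)
      val start = pos + 2
      val end = math.min(start + len, bytes.length) // tolerate a truncated tail
      loop(end, new String(bytes, start, end - start, UTF_8) :: acc)
    }
  loop(0, Nil)
}

// The byte sequence from the question, rebuilt with frame.
val sample: Array[Byte] =
  List("ABC", "DEFGH", "I").flatMap(s => frame(s).toList).toArray
val decoded: List[String] = decodeAll(sample)
```

Working on indices instead of repeatedly splitting a List[Byte] also avoids boxing every byte, which may matter for large buffers.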

padTo error inside a foldLeft

I'm teaching myself Scala, and one of the small test applications I wrote isn't working the way I expect it to. Can someone please help me understand why my test application is failing?
My small test application consists of a "decompress" method that does the following "decompression"
val testList = List(Tuple2(4, 'a'), Tuple2(1, 'b'), Tuple2(2, 'c'), Tuple2(2, 'a'), Tuple2(1, 'd'), Tuple2(4, 'e'))
require(decompress(testList) == List('a', 'a', 'a', 'a', 'b', 'c', 'c', 'a', 'a', 'd', 'e', 'e', 'e', 'e'))
In other words the Tuple2 objects should just be "decompressed" into a more verbose form. Yet all that I get back from the method is List('a', 'a', 'a', 'a') - the padTo statement works for the first Tuple2 but then it just suddenly stops working? If I however do the padding per element using a for loop - everything works...?
The full code:
object P12 extends App {
  def decompress(tList: List[Tuple2[Int,Any]]): List[Any] = {
    val startingList: List[Any] = List();
    val newList = tList.foldLeft(startingList)((b,a) => {
      val padCount = a._1;
      val padElement = a._2;
      println
      println(" Current list: " + b)
      println(" Current padCount: " + padCount)
      println(" Current padElement: " + padElement)
      println(" Padded using padTo: " + b.padTo(padCount, padElement))
      println
      // This doesn't work
      b.padTo(padCount, padElement)
      // // This works, yay
      // var tmpNewList = b;
      // for (i <- 1 to padCount)
      // tmpNewList = tmpNewList :+ padElement
      // tmpNewList
    })
    newList
  }
  val testList = List(Tuple2(4, 'a'), Tuple2(1, 'b'), Tuple2(2, 'c'), Tuple2(2, 'a'), Tuple2(1, 'd'), Tuple2(4, 'e'))
  require(decompress(testList) == List('a', 'a', 'a', 'a', 'b', 'c', 'c', 'a', 'a', 'd', 'e', 'e', 'e', 'e'))
  println("Everything is okay!")
}
Any help appreciated - learning Scala, just can't figure out this problem on my own with my current Scala knowledge.
The problem is that padTo fills the list up to a given total length. The first call works because it pads an empty list to 4 elements, but on every later call you have to add the current length of the list to the count, hence:
def decompress(tList: List[Tuple2[Int,Any]]): List[Any] = {
  val newList = tList.foldLeft(List[Any]())((b,a) => {
    b.padTo(a._1+b.length, a._2)
  })
  newList
}
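To make the "total length" behaviour of padTo concrete, here is a tiny sketch on the list that exists after the first tuple has been processed. The names `four`, `unchanged`, and `grown` are made up for illustration.

```scala
// State of the accumulator after the first tuple (4, 'a').
val four = List('a', 'a', 'a', 'a')

// Target length 1 is already reached, so nothing is appended.
val unchanged = four.padTo(1, 'b')

// Adding the current length makes padTo append exactly one element.
val grown = four.padTo(four.length + 1, 'b')
```

Here `unchanged` is still the four 'a's, while `grown` ends with the extra 'b', which is exactly the difference between the failing and the fixed fold.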
You could do your decompress like this:
val list = List(Tuple2(4, 'a'), Tuple2(1, 'b'), Tuple2(2, 'c'), Tuple2(2, 'a'), Tuple2(1, 'd'), Tuple2(4, 'e'))
list.flatMap{case (times, value) => Seq.fill(times)(value)}
This works:
scala> testList.foldLeft(List[Char]()){ case (xs, (count, elem)) => xs ++ List(elem).padTo(count, elem)}
res7: List[Char] = List(a, a, a, a, b, c, c, a, a, d, e, e, e, e)
The actual problem is that b.padTo(padCount, padElement) pads the accumulated list b up to a total length of padCount. Because the first tuple already produces the longest list, nothing is added in the later steps of foldLeft. If you change the data so that the second count is larger, you will see a change:
scala> val testList = List(Tuple2(3, 'a'), Tuple2(4, 'b'))
testList: List[(Int, Char)] = List((3,a), (4,b))
scala> testList.foldLeft(List[Char]()){ case (xs, (count, elem)) => xs.padTo(count, elem)}
res11: List[Char] = List(a, a, a, b)
Instead of foldLeft you can also use flatMap to generate the elements:
scala> testList flatMap { case (count, elem) => List(elem).padTo(count, elem) }
res8: List[Char] = List(a, a, a, a, b, c, c, a, a, d, e, e, e, e)
By the way, Tuple2(3, 'a') can be written (3, 'a') or 3 -> 'a'.
Note that padTo doesn't work as expected when you have data with a count of <= 0:
scala> List(0 -> 'a') flatMap { case (count, elem) => List(elem).padTo(count, elem) }
res31: List[Char] = List(a)
Thus use the solution mentioned by Garret Hall:
def decompress[A](xs: Seq[(Int, A)]) =
xs flatMap { case (count, elem) => Seq.fill(count)(elem) }
scala> decompress(List(2 -> 'a', 3 -> 'b', 2 -> 'c', 0 -> 'd'))
res34: Seq[Char] = List(a, a, b, b, b, c, c)
scala> decompress(List(2 -> 0, 3 -> 1, 2 -> 2))
res35: Seq[Int] = List(0, 0, 1, 1, 1, 2, 2)
A generic type signature should be preferred so that the correct element type is always returned.

Scala collection one-to-one mapping?

Pardon me if it's simple, but what is the most efficient way to do the following in scala:
Say I have two collections A and B with exactly same number of elements. For example,
A = {objectA1, objectA2, .... objectAN}
B = {objectB1, objectB2, .... objectBN}
I would like to get {{objectA1, objectB1}, {objectA2, objectB2}, ... {objectAN, objectBN}}. Note that these collections might be very large.
Some additions to @Tomasz's answer: if the collections are very large, a zip b is inefficient when you only want to transform the pairs afterwards, because it builds a complete intermediate collection. There is an alternative:
scala> (a,b).zipped
res15: scala.runtime.Tuple2Zipped[Int,Seq[Int],Char,Seq[Char]] = scala.runtime.Tuple2Zipped@71060c3e
scala> (a,b,b).zipped // works also for Tuple3
res16: scala.runtime.Tuple3Zipped[Int,Seq[Int],Char,Seq[Char],Char,Seq[Char]] = scala.runtime.Tuple3Zipped@30b688e1
Internally, Tuple2Zipped and Tuple3Zipped use iterators. This makes them more efficient when you want to transform the zipped elements.
Zip them:
A zip B
Example:
scala> val a = Seq(1, 2, 3, 4, 5)
a: Seq[Int] = List(1, 2, 3, 4, 5)
scala> val b = Seq('a', 'b', 'c', 'd', 'e')
b: Seq[Char] = List(a, b, c, d, e)
scala> a zip b
res5: Seq[(Int, Char)] = List((1,a), (2,b), (3,c), (4,d), (5,e))
If A and B are iterators, this will create an iterator of pairs as well.
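A short sketch of the iterator variant, which pairs elements on demand and never materialises the full list of tuples. The values are the ones from the example above; `pairs` and `rendered` are made-up names. On Scala 2.13 and later, where (a, b).zipped is deprecated in favour of a.lazyZip(b), zipping iterators like this still works unchanged.

```scala
val a = Seq(1, 2, 3, 4, 5)
val b = Seq('a', 'b', 'c', 'd', 'e')

// Zipping the iterators produces pairs lazily, one at a time.
val pairs: Iterator[(Int, Char)] = a.iterator.zip(b.iterator)

// Only the final, transformed collection is actually built.
val rendered: List[String] = pairs.map { case (n, c) => s"$n$c" }.toList
```

Keep in mind that an iterator can be traversed only once, so `pairs` is exhausted after the call to toList.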

In Scala, is it possible to zip two lists of differing sizes?

For example suppose I have
val letters = ('a', 'b', 'c', 'd', 'e')
val numbers = (1, 2)
Is it possible to produce a list
(('a',1), ('b',2), ('c',1),('d',2),('e',1))
Your letters and numbers are tuples, not lists. So let's fix that
scala> val letters = List('a', 'b', 'c', 'd', 'e')
letters: List[Char] = List(a, b, c, d, e)
scala> val numbers = List(1,2)
numbers: List[Int] = List(1, 2)
Now, if we zip them we don't get the desired result
scala> letters zip numbers
res11: List[(Char, Int)] = List((a,1), (b,2))
But that suggests that if numbers were repeated infinitely then the problem would be solved
scala> letters zip (Stream continually numbers).flatten
res12: List[(Char, Int)] = List((a,1), (b,2), (c,1), (d,2), (e,1))
Unfortunately, that's based on knowledge that numbers is shorter than letters. So to fix it all up
scala> ((Stream continually letters).flatten zip (Stream continually numbers).flatten take (letters.size max numbers.size)).toList
res13: List[(Char, Int)] = List((a,1), (b,2), (c,1), (d,2), (e,1))
The shorter of the lists needs to be repeated indefinitely. In this case it's obvious that numbers is shorter, but in case you need it to work in general, here is how you can do it:
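// Two type parameters keep the element types; with a single T the call below
// would infer T = AnyVal for a List[Char] and a List[Int].
def zipLongest[A, B](list1 : List[A], list2 : List[B]) : Seq[(A, B)] =
  if (list1.size < list2.size)
    Stream.continually(list1).flatten zip list2
  else
    list1 zip Stream.continually(list2).flatten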
val letters = List('a', 'b', 'c', 'd', 'e')
val numbers = List(1, 2)
println(zipLongest(letters, numbers))
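The same cycling idea can be sketched with Iterator.continually instead of Stream, which I would expect to behave the same on older and newer Scala versions (Stream is deprecated from 2.13 on). `zipCycled` is a hypothetical name, and returning Nil when either list is empty is my own choice, made so that cycling an empty list cannot loop forever.

```scala
// Hypothetical helper: repeat both lists as needed, up to the longer length.
def zipCycled[A, B](xs: List[A], ys: List[B]): List[(A, B)] =
  if (xs.isEmpty || ys.isEmpty) Nil // nothing to cycle through
  else {
    val n = math.max(xs.size, ys.size)
    // Endless iterators that restart each list when it runs out.
    val cycledXs = Iterator.continually(xs).flatMap(x => x)
    val cycledYs = Iterator.continually(ys).flatMap(y => y)
    cycledXs.zip(cycledYs).take(n).toList
  }

val letters = List('a', 'b', 'c', 'd', 'e')
val numbers = List(1, 2)
val zipped = zipCycled(letters, numbers)
```

Because both sides are cycled, the caller does not need to know which list is shorter, and the argument order only decides the order inside each pair.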
You could do a simple one liner, using the map method
val letters = List('a', 'b', 'c', 'd', 'e')
val numbers = List(1, 2)
val longZip1 = letters.zipWithIndex.map( x => (x._1, numbers(x._2 % numbers.length)) )
//or, using a for loop
//for (x <- letters.zipWithIndex) yield (x._1, numbers(x._2 % numbers.size))
If your lists are much longer, convert them to arrays first, because indexing into a List takes time proportional to the index:
val letters = List('a', 'b', 'c', 'd', 'e' /* 'f', ...*/)
val numbers = List(1, 2 /* 3, ... */)
val (longest, shortest) = (letters.toArray, numbers.toArray)
val longZip1 = longest
.zipWithIndex
.map(x => (x._1, shortest(x._2 % shortest.length)))
If you do not want to reuse any of the list data, however, you need to decide ahead of time what values fill the gaps:
val result = (0 to (Math.max(list1.size, list2.size) - 1)) map { index =>
(list1.lift(index).getOrElse(valWhen1Empty),list2.lift(index).getOrElse(valWhen2Empty))
}
I doubt this will work well with infinite lists or streams of course...
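For strict lists, the standard library's zipAll does this gap filling directly. A minimal sketch, where the filler values '?' and 0 are arbitrary placeholders chosen for illustration:

```scala
val list1 = List('a', 'b', 'c', 'd', 'e')
val list2 = List(1, 2)

// zipAll(that, thisElem, thatElem): thisElem fills in when list1 runs out,
// thatElem fills in when list2 runs out.
val padded: List[(Char, Int)] = list1.zipAll(list2, '?', 0)
```

Here list2 is the shorter one, so the last three pairs carry the filler 0 and the '?' filler is never used.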