Tune Nested Loop in Scala

I was wondering if I can tune the following Scala code:
def removeDuplicates(listOfTuple: List[(Class1,Class2)]): List[(Class1,Class2)] = {
  var listNoDuplicates: List[(Class1, Class2)] = Nil
  for (outerIndex <- 0 until listOfTuple.size) {
    if (outerIndex != listOfTuple.size - 1)
      for (innerIndex <- outerIndex + 1 until listOfTuple.size) {
        if (listOfTuple(outerIndex)._1.flag.equals(listOfTuple(innerIndex)._1.flag))
          listNoDuplicates = listOfTuple(outerIndex) :: listNoDuplicates
      }
  }
  listNoDuplicates
}

Usually, if you have something looking like:
var accumulator: A = new A
for (b <- collection) {
  accumulator = update(accumulator, b)
}
val result = accumulator
it can be converted into something like:
val result = collection.foldLeft( new A ){ (acc, b) => update( acc, b ) }
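For instance, a running sum written both ways (a trivial illustration of the same rewrite):
var acc = 0
for (b <- List(1, 2, 3)) acc = acc + b
val result1 = acc                                              // 6

val result2 = List(1, 2, 3).foldLeft(0)((acc, b) => acc + b)   // 6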
So here we can first use a map to enforce the uniqueness of flags. Supposing the flag has type F:
val result = listOfTuples.foldLeft( Map[F,(ClassA,ClassB)]() ){
  ( map, tuple ) => map + ( tuple._1.flag -> tuple )
}
Then the remaining tuples can be extracted from the map and converted to a list:
val uniqList = result.values.toList
It will keep the last tuple encountered; if you want to keep the first one, replace foldLeft with foldRight and swap the arguments of the lambda.
Example:
case class ClassA( flag: Int )
case class ClassB( value: Int )
val listOfTuples =
  List( (ClassA(1),ClassB(2)), (ClassA(3),ClassB(4)), (ClassA(1),ClassB(-1)) )
val result = listOfTuples.foldRight( Map[Int,(ClassA,ClassB)]() ) {
  ( tuple, map ) => map + ( tuple._1.flag -> tuple )
}
val uniqList = result.values.toList
//uniqList: List((ClassA(1),ClassB(2)), (ClassA(3),ClassB(4)))
Edit: If you need to retain the order of the initial list, use instead:
val uniqList = listOfTuples.filter( result.values.toSet )
(This works because a Set[A] is itself a function A => Boolean, so it can be passed directly as the filter predicate.)

This compiles, but as I can't test it, it's hard to say whether it does "The Right Thing" (tm):
def removeDuplicates(listOfTuple: List[(Class1,Class2)]): List[(Class1,Class2)] =
  (for { outerIndex <- 0 until listOfTuple.size
         if outerIndex != listOfTuple.size - 1
         innerIndex <- outerIndex + 1 until listOfTuple.size
         if listOfTuple(outerIndex)._1.flag == listOfTuple(innerIndex)._1.flag
       } yield listOfTuple(outerIndex)).reverse.toList
Note that you can use == instead of equals (use eq if you need reference equality).
BTW: https://codereview.stackexchange.com/ is better suited for this type of question.

Do not use indexes with lists (like listOfTuple(i)); indexed access on lists has very lousy performance. So, some alternatives...
The easiest:
import scala.collection.immutable.SortedSet

def removeDuplicates(listOfTuple: List[(Class1,Class2)]): List[(Class1,Class2)] =
  SortedSet(listOfTuple: _*)(Ordering by (_._1.flag)).toList
This will preserve the last element of the list. If you want it to preserve the first element, pass listOfTuple.reverse instead. Because of the sorting, performance is, at best, O(n log n). So, here's a faster way, using a mutable HashSet:
def removeDuplicates(listOfTuple: List[(Class1,Class2)]): List[(Class1,Class2)] = {
  // Use a hash set to spot the duplicate flags
  import scala.collection.mutable.HashSet
  val seen = HashSet[Flag]()   // Flag stands for the type of _1.flag
  // now fold
  listOfTuple.foldLeft(Nil: List[(Class1,Class2)]) {
    case (acc, el) =>
      val result = if (seen(el._1.flag)) acc else el :: acc
      seen += el._1.flag
      result
  }.reverse
}
One can avoid using a mutable HashSet in two ways:
- Make seen a var holding an immutable Set, so that it can be updated.
- Pass the set along with the list being created in the fold, as sketched below. The case then becomes:
case ((seen, acc), el) =>
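A minimal sketch of that second option, again assuming Flag stands for the type of _1.flag:
def removeDuplicates(listOfTuple: List[(Class1, Class2)]): List[(Class1, Class2)] =
  listOfTuple.foldLeft((Set.empty[Flag], List.empty[(Class1, Class2)])) {
    case ((seen, acc), el) =>
      if (seen(el._1.flag)) (seen, acc)        // flag already seen: skip
      else (seen + el._1.flag, el :: acc)      // first occurrence: keep
  }._2.reverse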

Related

How can I emit periodic results over an iteration?

I might have something like this:
val found = source.toCharArray.foreach { c =>
  // Process char c
  // Sometimes (e.g. on newline) I want to emit a result to be
  // captured in 'found'. There may be 0 or more captured results.
}
This shows my intent. I want to iterate over some collection of things. Whenever the need arises, I want to "emit" a result to be captured in found. It's not a direct 1-for-1 like map. collect() is a "pull", applying a partial function over the collection. I want a "push" behavior, where I visit everything but push out something when needed.
Is there a pattern or collection method I'm missing that does this?
Apparently, you have a Collection[Thing], and you want to obtain a new Collection[Event] by emitting a Collection[Event] for each Thing. That is, you want a function
(Collection[Thing], Thing => Collection[Event]) => Collection[Event]
That's exactly what flatMap does.
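For reference, this is roughly the shape flatMap has in the 2.x collections (simplified, eliding CanBuildFrom and variance):
def flatMap[B](f: A => GenTraversableOnce[B]): Collection[B]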
You can write it down with nested fors where the second generator defines what "events" have to be "emitted" for each input from the source. For example:
val input = "a2ba4b"
val result = (for {
  c <- input
  emitted <- {
    if (c == 'a') List('A')
    else if (c.isDigit) List.fill(c.toString.toInt)('|')
    else Nil
  }
} yield emitted).mkString
println(result)
prints
A||A||||
because each 'a' emits an 'A', each digit emits the right amount of tally marks, and all other symbols are ignored.
There are several other ways to express the same thing, for example, the above expression could also be rewritten with an explicit flatMap and with a pattern match instead of if-else:
println(input.flatMap{
  case 'a' => "A"
  case d if d.isDigit => "|" * d.toString.toInt
  case _ => ""
})
I think you are looking for a way to build a Stream for your condition. Streams are lazy and are computed only when required.
val sourceString = "sdfdsdsfssd\ndfgdfgd\nsdfsfsggdfg\ndsgsfgdfgdfg\nsdfsffdg\nersdff\n"
val sourceStream = sourceString.toCharArray.toStream
def foundStreamCreator(source: Stream[Char], emitBoundaryFunction: Char => Boolean): Stream[String] = {
  def loop(sourceStream: Stream[Char], collector: List[Char]): Stream[String] =
    sourceStream.isEmpty match {
      case true => collector.mkString.reverse #:: Stream.empty[String]
      case false => {
        val char = sourceStream.head
        emitBoundaryFunction(char) match {
          case true =>
            collector.mkString.reverse #:: loop(sourceStream.tail, List.empty[Char])
          case false =>
            loop(sourceStream.tail, char :: collector)
        }
      }
    }
  loop(source, List.empty[Char])
}
val foundStream = foundStreamCreator(sourceStream, c => c == '\n')
val foundIterator = foundStream.toIterator
foundIterator.next()
// res0: String = sdfdsdsfssd
foundIterator.next()
// res1: String = dfgdfgd
foundIterator.next()
// res2: String = sdfsfsggdfg
It looks like foldLeft to me:
val found = ((List.empty[String], "") /: source.toCharArray) { case ((agg, tmp), char) =>
  if (char == '\n') (tmp :: agg, "") // <- emit
  else (agg, tmp + char)
}._1
Where you keep collecting items in a temporary location and then emit it when you run into a character signifying something. Since I used List you'll have to reverse at the end if you want it in order.
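One caveat with the sketch above: any characters after the last '\n' stay in tmp and are dropped. A variant that also flushes the trailing buffer and restores the order (same idea, just two extra steps):
val (agg, tmp) = ((List.empty[String], "") /: source.toCharArray) {
  case ((agg, tmp), char) =>
    if (char == '\n') (tmp :: agg, "")   // <- emit
    else (agg, tmp + char)
}
val found = (if (tmp.isEmpty) agg else tmp :: agg).reverse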

Scala count number of times function returns each value, functionally

I want to count up the number of times that a function f returns each value in its range (0 to f_max, inclusive) when applied to a given list l, and return the result as an array, in Scala.
Currently, I accomplish this as follows:
def count(l: List[A]): Array[Int] = {   // A is whatever element type f accepts
  val arr = new Array[Int](f_max + 1)
  l.foreach {
    el => arr(f(el)) += 1
  }
  return arr
}
So arr(n) is the number of times that f returns n when applied to each element of l. This works; however, it is imperative in style, and I am wondering if there is a clean way to do this purely functionally.
Thank you
How about a more general approach:
def count[InType, ResultType](l: Seq[InType], f: InType => ResultType): Map[ResultType, Int] = {
  l.view                         // create a view so we don't create new collections after each step
    .map(f)                      // apply your function to every item in the original sequence
    .groupBy(x => x)             // group the returned values
    .map(x => x._1 -> x._2.size) // count the occurrences of each returned value
}
val f = (i:Int) => i
count(Seq(1,2,3,4,5,6,6,6,4,2), f)
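If you specifically want the indexed Vector result from the question, the same count can be written as a pure foldLeft: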
l.foldLeft(Vector.fill(f_max + 1)(0)) { (acc, el) =>
  val result = f(el)
  acc.updated(result, acc(result) + 1)
}
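For instance, with a hypothetical f and f_max (both invented here just to make the snippet self-contained):
val f: Int => Int = _ % 3   // hypothetical function with range 0 to 2
val f_max = 2
val counts = List(1, 2, 3, 4, 5, 6).foldLeft(Vector.fill(f_max + 1)(0)) { (acc, el) =>
  val result = f(el)
  acc.updated(result, acc(result) + 1)
}
// counts: Vector(2, 2, 2)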
Alternatively, a good balance of performance and external purity would be:
def count(l: List[???]): Vector[Int] = {
  val arr = l.foldLeft(Array.fill(f_max + 1)(0)) { (acc, el) =>
    val result = f(el)
    acc(result) += 1
    acc   // foldLeft must return the accumulator
  }
  arr.toVector
}

New to Scala, better/idiomatic way than reduceLeft to find max (key,val) in collection?

New-ish to Scala ... I am trying to find the best match from a collection of (key,value) pairs, where best match is defined as highest frequency. The method reduceLeft would be ideal, but the collection size may be smaller than 2 (1 or 0), so well-defined behavior for small collections is good.
Is there a more idiomatic scala approach to finding the max?
Other sources explain reduceLeft, which makes sense and reads well, but still other sources suggest different methods.
Is there a better way to extract the lone item from a collection of size=1?
Assume I have a map with some unknown number of values,
m:Map[String,Int]
val vm = m.filterNot{ case (k,v) => k.equals("ignore") }
val size = vm.size
val best = if (size > 1) {
  val list = vm.map { case (k, v) => KeyCount(k, v) }
  list.reduceLeft( maxKey )
} else if (size == 1) {
  vm.toList(0)
  // another source has suggested vm.head as an alternative
} else {
  KeyCount("default", 0)
}
Where KeyCount and maxKey are declared as,
case class KeyCount( key: String, count: Long ) {
  def max( a: KeyCount, z: KeyCount ) = { if ( a.count > z.count ) a else z }
  def min( a: KeyCount, z: KeyCount ) = { if ( a.count < z.count ) a else z }
}
val maxKey = (x:KeyCount, y:KeyCount) => if( x.count > y.count ) x else y;
reduceLeft works fine with lists of size 1. If count is always greater than 0, you can use foldLeft with the default as the starting value:
val list = vm.map{ case (k,v) => KeyCount(k,v) }
val best = list.foldLeft(KeyCount("default",0))(maxKey)
Otherwise simply use a condition with maxBy or reduceLeft:
val best = if (size > 0) {
  val list = vm.map { case (k, v) => KeyCount(k, v) }
  list.maxBy(_.count)
} else {
  KeyCount("default", 0)
}
Note that you can use maxBy on the original Map[String, Int], there is no need to convert the elements to KeyCount.
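A minimal sketch of that last point, reusing vm: Map[String, Int] from the question:
val best = if (vm.nonEmpty) vm.maxBy(_._2)   // the (key, count) pair with the highest count
           else "default" -> 0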

Apache Spark: dealing with Option/Some/None in RDDs

I'm mapping over an HBase table, generating one RDD element per HBase row. However, sometimes the row has bad data (throwing a NullPointerException in the parsing code), in which case I just want to skip it.
I have my initial mapper return an Option to indicate that it returns 0 or 1 elements, then filter for Some, then get the contained value:
// myRDD is RDD[(ImmutableBytesWritable, Result)]
val output = myRDD.
map( tuple => getData(tuple._2) ).
filter( {case Some(y) => true; case None => false} ).
map( _.get ).
// ... more RDD operations with the good data
def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L
  try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions
    Some( ( id, ( List(x),
      // more stuff ...
    ) ) )
  } catch {
    case e: NullPointerException => {
      logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e)
      None
    }
  }
}
Is there a more idiomatic way to do this that's shorter? I feel like this looks pretty messy, both in getData() and in the map.filter.map dance I'm doing.
Perhaps a flatMap could work (generate 0 or 1 items in a Seq), but I don't want it to flatten the tuples I'm creating in the map function, just eliminate empties.
An alternative, and often overlooked, way would be using collect(PartialFunction pf), which is meant to 'select' or 'collect' specific elements of the RDD for which the partial function is defined.
The code would look like this:
import scala.util.{Try, Success}

val output = myRDD.
  map( tuple => getData(tuple._2) ).
  collect{ case Success(tuple) => tuple }

def getData(r: Result): Try[(String, List[Long])] = Try {
  val key = r.getRow
  val id = Bytes.toString(key, 0, 11)
  val x = Long.MaxValue - Bytes.toLong(key, 11)
  (id, List(x))
}
If you change your getData to return a scala.util.Try then you can simplify your transformations considerably. Something like this could work:
def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L
  val tr = util.Try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions
    ( id, ( List(x)
      // more stuff ...
    ) )
  }
  tr.failed.foreach(e => logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e))
  tr
}
Then your transform could start like so:
myRDD.
flatMap(tuple => getData(tuple._2).toOption)
If your Try is a Failure, it will be turned into a None via toOption and then removed as part of the flatMap logic. From that point on, the transform works only with the successful cases, i.e. the underlying type returned from getData without any wrapping (no Option).
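If the toOption step is unclear, it can be illustrated in plain Scala, without Spark:
import scala.util.Try
Try(1 / 1).toOption   // Some(1): a Success becomes a Some
Try(1 / 0).toOption   // None: a Failure is discarded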
If you are OK with dropping the bad data, then you can just use mapPartitions. Here is a sample:
import scala.util._
val mixedData = sc.parallelize(List(1,2,3,4,0))
mixedData.mapPartitions(x => {
  val foo = for (y <- x) yield {
    Try(1/y)
  }
  for { goodVals <- foo.partition(_.isSuccess)._1 }
    yield goodVals.get
})
If you want to see the bad values, then you can use an accumulator or just log as you have been.
Your code would look something like this:
val output = myRDD.
  mapPartitions( tupleIter => getCleanData(tupleIter) )
  // ... more RDD operations with the good data

def getCleanData(iter: Iterator[???]) = {
  val triedData = getDataInTry(iter)
  for { goodVals <- triedData.partition(_.isSuccess)._1 }
    yield goodVals.get
}

def getDataInTry(iter: Iterator[???]) = {
  for (r <- iter) yield {
    Try {
      val key = r._2.getRow
      var id = "(unk)"
      var x = -1L
      id = Bytes.toString(key, 0, 11)
      x = Long.MaxValue - Bytes.toLong(key, 11)
      // ... more code that might throw exceptions
      (id, List(x))   // the Try needs to yield the parsed result
    }
  }
}

How do I flatten a nested For Comprehension that uses I/O?

I am having trouble flattening a nested For Generator into a single For Generator.
I created MapSerializer to save and load Maps.
Listing of MapSerializer.scala:
import java.io.{ObjectInputStream, ObjectOutputStream}

object MapSerializer {

  def loadMap(in: ObjectInputStream): Map[String, IndexedSeq[Int]] =
    (for (_ <- 1 to in.readInt()) yield {
      val key = in.readUTF()
      for (_ <- 1 to in.readInt()) yield {
        val value = in.readInt()
        (key, value)
      }
    }).flatten.groupBy(_._1).mapValues(_.map(_._2))

  def saveMap(out: ObjectOutputStream, map: Map[String, Seq[Int]]) {
    out.writeInt(map.size)
    for ((key, values) <- map) {
      out.writeUTF(key)
      out.writeInt(values.size)
      values.foreach(out.writeInt(_))
    }
  }
}
Modifying loadMap to assign key within the generator causes it to fail:
def loadMap(in: ObjectInputStream): Map[String, IndexedSeq[Int]] =
  (for (_ <- 1 to in.readInt();
        key = in.readUTF()) yield {
    for (_ <- 1 to in.readInt()) yield {
      val value = in.readInt()
      (key, value)
    }
  }).flatten.groupBy(_._1).mapValues(_.map(_._2))
Here is the stacktrace I get:
java.io.UTFDataFormatException
at java.io.ObjectInputStream$BlockDataInputStream.readWholeUTFSpan(ObjectInputStream.java)
at java.io.ObjectInputStream$BlockDataInputStream.readOpUTFSpan(ObjectInputStream.java)
at java.io.ObjectInputStream$BlockDataInputStream.readWholeUTFSpan(ObjectInputStream.java)
at java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java)
at java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2819)
at java.io.ObjectInputStream.readUTF(ObjectInputStream.java:1050)
at MapSerializer$$anonfun$loadMap$1.apply(MapSerializer.scala:8)
at MapSerializer$$anonfun$loadMap$1.apply(MapSerializer.scala:7)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:194)
at scala.collection.immutable.Range.foreach(Range.scala:76)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:194)
at scala.collection.immutable.Range.map(Range.scala:43)
at MapSerializer$.loadMap(MapSerializer.scala:7)
I would like to flatten the loading code to a single For Comprehension, but I get errors that suggest that it is either executing in a different order or repeating steps I am not expecting it to repeat.
Why is it that moving the assignment of key into the generator causes it to fail?
Can I flatten this into a single generator? If so, what would that generator be?
Thank you for the self-contained, compiling code in your question. I don't think you want to flatten the loops, as the structure is not flat; you then need groupBy to recover the structure. Also, if you have "zero -> Seq()" as an element of the map, it would be lost. Using a simple map avoids the groupBy and preserves the elements mapped to empty sequences:
def loadMap(in: ObjectInputStream): Map[String, IndexedSeq[Int]] = {
  val size = in.readInt
  (1 to size).map { _ =>
    val key = in.readUTF
    val nval = in.readInt
    key -> (1 to nval).map(_ => in.readInt)
  }(collection.breakOut)
}
I use breakOut to generate the right type, as otherwise I think the compiler complains about a mismatch between generic Map and immutable Map. You can also use Map() ++ (...).
Note: I arrived at this solution after being confused by your for loop and starting to rewrite it using flatMap and map:
val tuples = (1 to size).flatMap { _ =>
  val key = in.readUTF
  println("key " + key)
  val nval = in.readInt
  (1 to nval).map(_ => key -> in.readInt)
}
I think something odd happens in the for loop when part of a generator goes unused. I thought this would be equivalent to:
val tuples = for {
  _ <- 1 to size
  key = in.readUTF
  nval = in.readInt
  _ <- 1 to nval
  value = in.readInt
} yield { key -> value }
But this is not the case, so I think I'm missing something in the translation.
Edit: I figured out what's wrong with the single for loop. Short story: the translation of definitions within for loops causes the key = in.readUTF statements to all be evaluated consecutively, before the inner loop is executed. To work around this, use view and force:
val tuples = (for {
  _ <- (1 to size).view
  key = in.readUTF
  nval = in.readInt
  _ <- 1 to nval
  value = in.readInt
} yield { key -> value }).force
The issue can be demonstrated more clearly with this piece of code:
val iter = Iterator.from(1)
val tuple = for {
  _ <- 1 to 3
  outer = iter.next
  _ <- 1 to 3
  inner = iter.next
} yield (outer, inner)
It returns Vector((1,4), (1,5), (1,6), (2,7), (2,8), (2,9), (3,10), (3,11), (3,12)), which shows that all outer values are evaluated before the inner values. That is because the comprehension is more or less translated to something like:
for {
  (i, outer) <- for (i <- (1 to 3)) yield (i, iter.next)
  _ <- 1 to 3
  inner = iter.next
} yield (outer, inner)
This computes all the outer iter.next calls first. Going back to the original use case, all the in.readUTF calls would run consecutively, before any in.readInt.
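Applying the same view/force trick to the Iterator demonstration above shows the reads interleaving as one would expect (a quick sketch; the exact numbers assume a fresh iterator):
val iter = Iterator.from(1)
val tuple = (for {
  _ <- (1 to 3).view
  outer = iter.next
  _ <- 1 to 3
  inner = iter.next
} yield (outer, inner)).force
// Vector((1,2), (1,3), (1,4), (5,6), (5,7), (5,8), (9,10), (9,11), (9,12))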
Here is the compacted version of @huynhjl's answer that I eventually deployed:
def loadMap(in: ObjectInputStream): Map[String, IndexedSeq[Int]] =
  ((1 to in.readInt()) map { _ =>
    in.readUTF() -> ((1 to in.readInt()) map { _ => in.readInt() })
  })(collection.breakOut)
The advantage of this version is that there are no direct assignments.