How to find the first duplicate in a Stream in Scala

How can I find the first duplicate in a Stream in Scala?
My current idea is to pair each element with a Set of all previous elements, and then call find on the resulting Stream.
So, for each element, we have:
an insertion into a Set: O(1)
a contains test on a Set: O(1)
Hence, the overall complexity of this algorithm seems to be O(n).
def firstDuplicate[A](s: Stream[A]) = {
  def recurse(s: Stream[A], set: Set[A]): Stream[(A, Set[A])] =
    (s.head, set) #:: recurse(s.tail, set + s.head)
  val pairedWithElements = recurse(s, Set.empty)
  pairedWithElements.find { case (e, elems) => elems.contains(e) }.get._1
}
Is there a better way?

You should make your function tail recursive. The way you have it, you are making pretty much another copy of your whole stream on the stack. Also, I don't understand why you are making a copy of the entire stream (and a whoooole buuunch of sets), and then scanning it again to find the dup. You can tell it's a dup right away, when adding it to the set, and stop right there.
Something like this perhaps:
def firstDup[T](s: Stream[T], seen: Set[T] = Set.empty[T]): Option[T] = s match {
  case head #:: tail if seen(head) => Some(head)
  case head #:: tail => firstDup(tail, seen + head)
  case _ => None
}
The bloom filter suggestion from the comments above is a good idea for truly huge input streams. The "outer shell" would stay the same in that case, you'd just need to change the underlying seen implementation.
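For comparison, here is a minimal sketch of the same early-exit idea on an Iterator with a local mutable set. It is a variation, not the code from the answer above, and the name firstDupIter is made up for illustration. It relies on mutable.HashSet.add returning false when the element is already present, so the search stops at the first repeat and the consumed prefix is not retained.

```scala
import scala.collection.mutable

// Hypothetical helper: returns the first element that repeats an earlier one.
def firstDupIter[T](xs: Iterator[T]): Option[T] = {
  val seen = mutable.HashSet.empty[T]
  // add returns false if x was already in the set, i.e. x is a duplicate
  xs.find(x => !seen.add(x))
}

firstDupIter(Iterator(1, 2, 3, 2, 1))          // Some(2)
firstDupIter(Iterator.from(1).map(_ % 5))      // Some(1), even though the input is infinite
```

The mutable set never escapes the function, so callers still see a pure interface. Whether that trade-off is acceptable is a matter of taste.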

Related

Scala - Traversing a ByteString Until Empty

Is there a more concise and/or performant way to traverse the message than what I have here?
import akka.util.ByteString
import scala.annotation.tailrec
@throws[GarbledMessageException]
def nextValue(message: ByteString) =
  message.indexOf(delimiter) match {
    case i if i >= 0 => message.splitAt(i)
    case _ => throw new GarbledMessageException("Delimiter Not Found")
  }
@tailrec
def processFields(message: ByteString): Unit = nextValue(message) match {
  case (_, ByteString.empty) => // Complete Parsing
  case (value, rest) =>
    // Do work with value
    // loop
    processFields(rest)
}
A new ByteString is created for each split which hurts performance, but at least the underlying Buffer is not copied, only reference counted.
Maybe it can be even better than that?
It may depend on specifically what kind of work you are doing, but if you are looking for something more performant than splitting off ByteStrings, take a look at ByteIterator, which you can get by calling iterator on a ByteString.
A ByteIterator would allow you to go directly to primitive values (ints, floats, etc.) without having to split off new ByteStrings first.
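To illustrate the general idea of walking the data in place instead of splitting it, here is a rough sketch on a plain Array[Byte], so that it runs without Akka. It does not use the ByteIterator API; foreachField and its callback shape are hypothetical names chosen for this example. Bytes after the last delimiter are silently ignored here, whereas the code in the question throws.

```scala
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer

// Hypothetical helper: calls work(start, end) for every delimiter-terminated
// field, where the field occupies the index range [start, end) of bytes.
def foreachField(bytes: Array[Byte], delimiter: Byte)(work: (Int, Int) => Unit): Unit = {
  @tailrec
  def loop(from: Int): Unit = {
    val i = bytes.indexOf(delimiter, from)
    if (i >= 0) {
      work(from, i)
      loop(i + 1) // skip the delimiter and continue with the remainder
    }
  }
  loop(0)
}

val data = "ab,c,def,".getBytes("US-ASCII")
val fields = ListBuffer.empty[String]
foreachField(data, ','.toByte) { (start, end) =>
  fields += new String(data, start, end - start, "US-ASCII")
}
// fields.toList == List("ab", "c", "def")
```

Only two integers change per step, so no intermediate buffer object is created for the remainder of the message.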

Performance impact of flatten and reduce - without intermediary list creation

Given the following code:
def findMinOpt(li: List[Option[Int]]): Option[Int] =
{
val listwithoutOptions = li.flatten
listwithoutOptions.reduceLeftOption(_ min _)
}
It filters out all the empty options, effectively creating a new list, and then returns the minimum.
The problem I see with this code is that it processes the list twice. In fact it is worse than that, since a second list is created which isn't cached. Is there an idiomatic way of processing the list just once?
Optional question: how would one benchmark this? The OS usually uses caching mechanisms, so between repetitions of a test I'd like to clear that cache. Is there a way of doing that?
One more possible implementation (without list duplication):
def optMin(a: Option[Int], b: Option[Int]): Option[Int] =
  (a, b) match {
    case (Some(x), Some(y)) => Option(x min y)
    case (sx, None) => sx
    case (None, sy) => sy
    case _ => None
  }
li.reduceLeft { optMin(_, _) }
But a Pair (Tuple2) object is created for every comparison.
IMHO it is a trade-off between expressiveness and efficiency.
You can use view to avoid creating an intermediate list:
li.view.flatten.reduceLeftOption(_ min _)
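As a further sketch of a single-pass version, the minimum can be accumulated directly with foldLeft, or the list can be consumed through its iterator. The names findMinOnePass and findMinIter are illustrative only. Neither variant builds an intermediate list, and both return None for an empty or all-None input.

```scala
// Single pass with an explicit accumulator holding the best value so far.
def findMinOnePass(li: List[Option[Int]]): Option[Int] =
  li.foldLeft(Option.empty[Int]) { (best, next) =>
    next match {
      case None    => best
      case Some(v) => if (best.exists(_ <= v)) best else next
    }
  }

// The same result via an Iterator, which is consumed element by element.
def findMinIter(li: List[Option[Int]]): Option[Int] =
  li.iterator.collect { case Some(v) => v }.reduceLeftOption(_ min _)

findMinOnePass(List(Some(3), None, Some(1), Some(2))) // Some(1)
findMinIter(List(None, None))                          // None
```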

Functional style early exit from depth-first recursion

I have a question about writing recursive algorithms in a functional style. I will use Scala for my example here, but the question applies to any functional language.
I am doing a depth-first enumeration of an n-ary tree where each node has a label and a variable number of children. Here is a simple implementation that prints the labels of the leaf nodes.
case class Node[T](label:T, ns:Node[T]*)
def dfs[T](r:Node[T]):Seq[T] = {
if (r.ns.isEmpty) Seq(r.label) else for (n<-r.ns;c<-dfs(n)) yield c
}
val r = Node('a, Node('b, Node('d), Node('e, Node('f))), Node('c))
dfs(r) // returns Seq[Symbol] = ArrayBuffer('d, 'f, 'c)
Now say that sometimes I want to be able to give up on parsing oversize trees by throwing an exception. Is this possible in a functional language? Specifically is this possible without using mutable state? That seems to depend on what you mean by "oversize". Here is a purely functional version of the algorithm that throws an exception when it tries to handle a tree with a depth of 3 or greater.
def dfs[T](r:Node[T], d:Int = 0):Seq[T] = {
require(d < 3)
if (r.ns.isEmpty) Seq(r.label) else for (n<-r.ns;c<-dfs(n, d+1)) yield c
}
But what if a tree is oversized because it is too broad rather than too deep? Specifically what if I want to throw an exception the n-th time the dfs() function is called recursively regardless of how deep the recursion goes? The only way I can see how to do this is to have a mutable counter that is incremented with each call. I can't see how to do it without a mutable variable.
I'm new to functional programming and have been working under the assumption that anything you can do with mutable state can be done without, but I don't see the answer here. The only thing I can think to do is write a version of dfs() that returns a view over all the nodes in the tree in depth-first order.
def dfs[T](r:Node[T]):TraversableView[T, Traversable[_]] = ...
Then I could impose my limit by saying dfs(r).take(n), but I don't see how to write this function. In Python I'd just create a generator by yielding nodes as I visited them, but I don't see how to achieve the same effect in Scala. (Scala's equivalent to a Python-style yield statement appears to be a visitor function passed in as a parameter, but I can't figure out how to write one of these that will generate a sequence view.)
EDIT Getting close to the answer.
Here is a function that returns a Stream of nodes in depth-first order.
def dfs[T](r: Node[T]): Stream[Node[T]] = {
(r #:: Stream.empty /: r.ns)(_ ++ dfs(_))
}
That is almost it. The only problem is that Stream memoizes all results, which is a waste of memory. I want a traversable view. The following is the idea, but does not compile.
def dfs[T](r: Node[T]): TraversableView[Node[T], Traversable[Node[T]]] = {
(Traversable(r).view /: r.ns)(_ ++ dfs(_))
}
It gives a "found TraversableView[Node[T], Traversable[Node[T]]], required TraversableView[Node[T], Traversable[_]]" error for the ++ operator. If I change the return type to TraversableView[Node[T], Traversable[_]], I get the same problem with the "found" and "required" clauses switched. So there's some magic type variance incantation I haven't lit upon yet, but this is close.
It can be done: you just have to write some code to actually iterate through the children in the way you want (as opposed to relying on for).
More explicitly, you'll have to write code to iterate through a list of children and check if the "depth" crossed your threshold. Here's some Haskell code (I'm really sorry, I'm not fluent in Scala, but this can probably be easily transliterated):
http://ideone.com/O5gvhM
In this code, I've basically replaced the for loop for an explicit recursive version. This allows me to stop the recursion if the number of visited nodes is already too deep (i.e., limit is not positive). When I recurse to examine the next child, I subtract the number of nodes the dfs of the previous child visited and set this as the limit for the next child.
Functional languages are fun, but they're a huge leap from imperative programming. It really makes you pay attention to the concept of state, because all of it is excruciatingly explicit in the arguments when you go functional.
EDIT: Explaining this a bit more.
I ended up converting from "print just the leaf nodes" (which was the original algorithm from the OP) to "print all nodes". This enabled me to have access to the number of nodes the subcall visited through the length of the resulting list. If you want to stick to the leaf nodes, you'll have to carry around how many nodes you have already visited:
http://ideone.com/cIQrna
EDIT again: To tidy up this answer, I'm putting all the Haskell code on ideone, and I've transliterated my Haskell code to Scala, so this can stay here as the definitive answer to the question:
case class Node[T](label:T, children:Seq[Node[T]])
case class TraversalResult[T](num_visited:Int, labels:Seq[T])
def dfs[T](node:Node[T], limit:Int):TraversalResult[T] =
  limit match {
    case 0 => TraversalResult(0, Nil)
    case limit =>
      node.children match {
        case Nil => TraversalResult(1, List(node.label))
        case children => {
          val result = traverse(node.children, limit - 1)
          TraversalResult(result.num_visited + 1, result.labels)
        }
      }
  }
def traverse[T](children:Seq[Node[T]], limit:Int):TraversalResult[T] =
  limit match {
    case 0 => TraversalResult(0, Nil)
    case limit =>
      children match {
        case Nil => TraversalResult(0, Nil)
        case first :: rest => {
          val trav_first = dfs(first, limit)
          val trav_rest =
            traverse(rest, limit - trav_first.num_visited)
          TraversalResult(
            trav_first.num_visited + trav_rest.num_visited,
            trav_first.labels ++ trav_rest.labels
          )
        }
      }
  }
val n = Node(0, List(
  Node(1, List(Node(2, Nil), Node(3, Nil))),
  Node(4, List(Node(5, List(Node(6, Nil))))),
  Node(7, Nil)
))
for (i <- 1 to 8)
  println(dfs(n, i))
Output:
TraversalResult(1,List())
TraversalResult(2,List())
TraversalResult(3,List(2))
TraversalResult(4,List(2, 3))
TraversalResult(5,List(2, 3))
TraversalResult(6,List(2, 3))
TraversalResult(7,List(2, 3, 6))
TraversalResult(8,List(2, 3, 6, 7))
P.S. this is my first attempt at Scala, so the above probably contains some horrid non-idiomatic code. I'm sorry.
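The answer above truncates the traversal once the limit is reached. A variant that fails instead, which is closer to "give up on oversize trees", might look like the sketch below. It keeps the question's Node definition and threads the remaining node budget through a fold, so there is no mutable counter. The names leaves and Outcome, and the use of Either with a String error, are assumptions made for this example.

```scala
case class Node[T](label: T, ns: Node[T]*)

// Left(error) if the tree has more nodes than the budget allows, otherwise
// Right((budget still unused, leaf labels in depth-first order)).
type Outcome[T] = Either[String, (Int, Vector[T])]

def leaves[T](r: Node[T], budget: Int): Outcome[T] =
  if (budget <= 0) Left("tree too large")
  else if (r.ns.isEmpty) Right((budget - 1, Vector(r.label)))
  else {
    // Visiting r itself costs one unit; the rest is passed from child to child.
    val start: Outcome[T] = Right((budget - 1, Vector.empty[T]))
    r.ns.foldLeft(start) {
      case (Right((remaining, acc)), child) =>
        leaves(child, remaining) match {
          case Right((rest, found)) => Right((rest, acc ++ found))
          case Left(err)            => Left(err)
        }
      case (failure, _) => failure // already failed: skip the remaining siblings
    }
  }

val r = Node("a", Node("b", Node("d"), Node("e", Node("f"))), Node("c"))
leaves(r, 6) // Right((0, Vector(d, f, c)))
leaves(r, 5) // Left(tree too large)
```

After a failure the fold still steps over the remaining siblings, but it does no further traversal work for them.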
You can convert breadth into depth by passing along an index or taking the tail:
def suml(xs: List[Int], total: Int = 0): Int = xs match {
  case Nil => total
  case x :: rest => suml(rest, total+x)
}
def suma(xs: Array[Int], from: Int = 0, total: Int = 0): Int = {
  if (from >= xs.length) total
  else suma(xs, from+1, total + xs(from))
}
In the latter case, you already have something to limit your breadth if you want; in the former, just add a width or somesuch.
The following implements a lazy depth-first search over nodes in a tree.
import collection.TraversableView
case class Node[T](label: T, ns: Node[T]*)
def dfs[T](r: Node[T]): TraversableView[Node[T], Traversable[Node[T]]] =
(Traversable[Node[T]](r).view /: r.ns) {
(a, b) => (a ++ dfs(b)).asInstanceOf[TraversableView[Node[T], Traversable[Node[T]]]]
}
This prints the labels of all the nodes in depth-first order.
val r = Node('a, Node('b, Node('d), Node('e, Node('f))), Node('c))
dfs(r).map(_.label).force
// returns Traversable[Symbol] = List('a, 'b, 'd, 'e, 'f, 'c)
This does the same thing, quitting after 3 nodes have been visited.
dfs(r).take(3).map(_.label).force
// returns Traversable[Symbol] = List('a, 'b, 'd)
If you want only leaf nodes you can use filter, and so forth.
Note that the fold clause of the dfs function requires an explicit asInstanceOf cast. See "Type variance error in Scala when doing a foldLeft over Traversable views" for a discussion of the Scala typing issues that necessitate this.
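An alternative sketch that sidesteps the cast is to return an Iterator, which plays roughly the role of a Python generator here. This is not the approach of the answer above. It assumes a one-shot traversal is acceptable, since an Iterator can be consumed only once. String labels are used instead of symbol literals so that the snippet does not depend on the Scala version.

```scala
case class Node[T](label: T, ns: Node[T]*)

// Lazy pre-order traversal: a subtree is not entered until the iterator is
// advanced that far, and visited nodes are not memoized.
def dfsIter[T](r: Node[T]): Iterator[Node[T]] =
  Iterator.single(r) ++ r.ns.iterator.flatMap(child => dfsIter(child))

val r = Node("a", Node("b", Node("d"), Node("e", Node("f"))), Node("c"))
dfsIter(r).map(_.label).toList                      // List(a, b, d, e, f, c)
dfsIter(r).take(3).map(_.label).toList              // List(a, b, d)
dfsIter(r).filter(_.ns.isEmpty).map(_.label).toList // List(d, f, c)
```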

Scala - return empty Option if value contained in array

I'm splitting an input of type Option[String] into an Option[Array[String]] as follows:
val input:Option[String] = Option("a=b,1000,what?")
val result: Option[Array[String]] = input map { _.split(",") }
I want to add a test whereby if any member of the array matches (eg, is an Long less than 0), the whole array is discarded and an empty Option returned.
Use filter to perform a test on the content of an Option.
Use exists to check whether any member of the collection fulfils a condition.
result.filter(! _.exists(s => test(s)))
or
result.filterNot(_.exists(s => test(s)))
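To make this concrete, here is a small sketch with one possible predicate, following the question's example of "a Long less than 0". The names isNegativeLong and splitUnlessNegative are hypothetical, and fields that do not parse as a Long are assumed to be acceptable.

```scala
import scala.util.Try

// Hypothetical predicate: true when the field parses as a Long below zero.
def isNegativeLong(s: String): Boolean =
  Try(s.trim.toLong).toOption.exists(_ < 0)

// Split on commas, then discard the whole array if any field matches.
def splitUnlessNegative(input: Option[String]): Option[Array[String]] =
  input.map(_.split(",")).filterNot(_.exists(isNegativeLong))

splitUnlessNegative(Option("a=b,1000,what?")) // Some(Array(a=b, 1000, what?))
splitUnlessNegative(Option("a=b,-5,what?"))   // None
```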
Have you considered using find() on the collection ? If it returns a Some(x), then something has satisfied the condition.
list.find(_ < 0) match {
case Some(x) => None
case None => Some(list)
}
Of course, you can split and then filter as @ziggystar suggests, but if you have a really big String and an element near the beginning matches, it is pointless to finish splitting the string when you know it is going to be discarded.
In this case, if you're worried about time efficiency, you can use a Stream and re-implement the split operation, something like this:
def result(input:Option[String]):Option[Seq[String]] = {
  def split(c: Char, chars:Stream[Char]):Stream[String] = {
    val (head,tail) = chars span(_ != c)
    head.mkString #:: (if(tail isEmpty) Stream.empty else split(c, tail tail))
  }
  input map {s => split(',', Stream(s:_*)) } filter (_.forall (s => !test(s)))
}
Note that the map/filter structure stays the same, but it is now short-circuiting due to the use of Stream.
If it's a really big string, you probably have it as a Stream[Char] already, which means you don't even have the memory overhead of hanging on to the original String.

Processing Scala Option[T]

I have a Scala Option[T]. If the value is Some(x), I want to process it with a procedure that does not return a value (Unit), but if it is None, I want to print an error.
I can use the following code to do this, but I understand that the more idiomatic way is to treat the Option[T] as a sequence and use map, foreach, etc. How do I do this?
opt match {
case Some(x) => // process x with no return value, e.g. write x to a file
case None => // print error message
}
I think explicit pattern matching suits your use case best.
Scala's Option is, sadly, missing a method to do exactly this. I add one:
class OptionWrapper[A](o: Option[A]) {
def fold[Z](default: => Z)(action: A => Z) = o.map(action).getOrElse(default)
}
implicit def option_has_utility[A](o: Option[A]) = new OptionWrapper(o)
which has the slightly nicer (in my view) usage
op.fold{ println("Empty!") }{ x => doStuffWith(x) }
You can see from how it's defined that map/getOrElse can be used instead of pattern matching.
Alternatively, Either already has a fold method. So you can
op.toRight(()).fold{ _ => println("Empty!") }{ x => doStuffWith(x) }
but this is a little clumsy given that you have to provide the left value (here (), i.e. Unit) and then define a function on that, rather than just stating what you want to happen on None.
The pattern match isn't bad either, especially for longer blocks of code. For short ones, the overhead of the match starts getting in the way of the point. For example:
op.fold{ printError }{ saveUserInput }
has a lot less syntactic overhead than
op match {
case Some(x) => saveUserInput(x)
case None => printError
}
and therefore, once you expect it, is a lot easier to comprehend.
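For what it's worth, assuming Scala 2.10 or later, Option has a built-in fold with the same shape (the None case first, then the Some case), so the wrapper class should not be needed there. A minimal sketch follows; report and handle are names invented for the example.

```scala
// Both branches produce a String.
def report(opt: Option[String]): String =
  opt.fold("Empty!")(x => s"Got $x")

// Side-effecting use, where both branches return Unit.
def handle(opt: Option[String]): Unit =
  opt.fold(println("Empty!"))(x => println(s"Processing $x"))

report(Some("x")) // "Got x"
report(None)      // "Empty!"
```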
I'd recommend to simply and safely use opt.get which itself throws a NoSuchElementException exception if opt is None. Or if you want to throw your own exception, you can do this:
val x = opt.getOrElse(throw new Exception("Your error message"))
// x is of type T
As @missingfaktor says, you are in the exact scenario where pattern matching gives the most readable results.
If Option has a value you want to do something, if not you want to do something else.
While there are various ways to use map and other functional constructs on Option types, they are generally useful when:
you want to use the Some case and ignore the None case e.g. in your case
opt.map(writeToFile(_)) //(...if None just do nothing)
or you want to chain the operations on more than one option and give a result only when all of them are Some. For instance, one way of doing this is:
val concatThreeOptions =
for {
n1 <- opt1
n2 <- opt2
n3 <- opt3
} yield n1 + n2 + n3 // this will be None if any of the three is None
// we will either write them all to a file or none of them
but none of these seem to be your case
Pattern matching is the best choice here.
However, if you want to treat Option as a sequence and to map over it, you can do it, because Unit is a value:
opt map { v =>
println(v) // process v (result type is Unit)
} getOrElse {
println("error")
}
By the way, printing an error is some kind of "anti-pattern", so it's better to throw an exception anyway:
opt.getOrElse(throw new SomeException)