populate immutable sequence with iterator - scala

I'm interoperating with some Java code that uses iterator-like functionality, and presumes you will continue to test it's .next for null values. I want to put it into immutable Scala data structures to reason about it with functional programming.
Right now, I'm filling mutable data structures and then converting them to immutable data structures. I know there's a more functional way to do this.
How can I refactor the code below to populate the immutable data structures without using intermediate mutable collections?
Thanks
{
val sentences = MutableList[Seq[Token]]()
while(this.next() != null){
val sentence = MutableList[Token]()
var token = this.next()
while(token.next != null){
sentence += token
token = token.next
}
sentences += sentence.to[Seq]
}
sentences.to[Seq]
}

You might try to use the Iterator.iterate method in order to simulate a real iterator, and then use standard collection methods like takeWhile and toSeq. I'm not totally clear on the type of this and next, but something along these lines might work:
Iterator.iterate(this){ _.next }.takeWhile{ _ != null }.map { sentence =>
Iterator.iterate(sentence.next) { _.next }.takeWhile{ _ != null }.toSeq
}.toSeq
You can also extend Iterable by defining your own next and hasNext method in order to use these standard methods more easily. You might even define an implicit or explicit conversion from these Java types to this new Iterable type – this is the same pattern you see in JavaConversions and JavaConverters

Related

Scala: dynamic programming recursion using iterators

Learning how to do dynamic programming in Scala, and I'm often finding myself in a situation where I want to recursively proceed over an array (or some other iterable) of items. When I do this, I tend to write cumbersome functions like this:
def arraySum(array: Array[Int], index: Int, accumulator: Int): Int => {
if (index == array.length) {
accumulator
} else {
arraySum(array, index + 1, accumulator + array(index)
}
}
arraySum(Array(1,2,3), 0, 0)
(Ignore for a moment that I could just call sum on the array or do a .reduce(_ + _), I'm trying to learn programming principles.)
But this seems like I'm passing alot of variables, and what exactly is the point of passing the array to each function call? This seems unclean.
So instead I got the idea to do this with iterators and not worry about passing indexes:
def arraySum(iter: Iterator[Int])(implicit accumulator: Int = 0): Int = {
try {
val nextInt = iter.next()
arraySum(iter)(accumulator + nextInt)
} catch {
case nee: NoSuchElementException => accumulator
}
}
arraySum(Array(1,2,3).toIterator)
This seems like a much cleaner solution. However, this falls apart when you need to use dynamic programming to explore some outcome space and you don't need to call the iterator at every function call. E.g.
def explore(iter: Iterator[Int])(implicit accumulator: Int = 0): Int = {
if (someCase) {
explore(iter)(accumulator)
} else if (someOtherCase){
val nextInt = iter.next()
explore(iter)(accumulator + nextInt)
} else {
// Some kind of aggregation/selection of explore results
}
}
My understanding is that the iter iterator here functions as pass by reference, so when this function calls iter.next() that changes the instance of iter that is passed to all other recursive calls of the function. So to get around that, now I'm cloning the iterator at every call of the explore function. E.g.:
def explore(iter: Iterator[Int])(implicit accumulator: Int = 0): Int = {
if (someCase) {
explore(iter)(accumulator)
} else if (someOtherCase){
val iterClone = iter.toList.toIterator
explore(iterClone)(accumulator + iterClone.next())
} else {
// Some kind of aggregation/selection of explore results
}
}
But this seems pretty stupid, and the stupidity escalates when I have multiple iterators that may or may not need cloning in multiple else if cases. What is the right way to handle situations like this? How can I elegantly solve these kinds of problems?
Suppose that you want to write a back-tracking recursive function that needs some complex data structure as an argument, so that the recursive calls receive a slightly modified version of the data structure. You have several options how you could do it:
Clone the entire data structure, modify it, pass it to recursive call. This is very simple, but usually very expensive.
Modify the mutable structure in-place, pass it to the recursive call, then revert the modification when backtracking. You have to ensure that every possible call of your recursive function always restores the original state of the data structure exactly. This is much more efficient, but is hard to implement, because it can be very error prone.
Subdivide the structure into a large immutable and a small mutable part. For example, you could pass an index (or a pair of indices) that specify some slice of an array explicitly, along with an array that is never mutated. You could then "clone" and save only the mutable part, and restore it when backtracking. If it works, it is both simple and fast, but it doesn't always work, because substructures can be hard to describe by just few integer indices.
Rely on persistent immutable data structures whenever you can.
I'd like to elaborate on the last point, because this is the preferred way to do it in Scala and in functional programming in general.
Here is your original code, that uses the third strategy:
def arraySum(array: Array[Int], index: Int, accumulator: Int): Int = {
if (index == array.length) {
accumulator
} else {
arraySum(array, index + 1, accumulator + array(index))
}
}
If you would use a List instead of an Array, you could rewrite it to this:
#annotation.tailrec
def listSum(list: List[Int], acc: Int): Int = list match {
case Nil => acc
case h :: t => listSum(t, acc + h)
}
Here, h :: t is a pattern that deconstructs the list into the head and the tail.
Note that you don't need an explicit index any more, because accessing the tail t of the list is a constant-time operation, so that only the relevant remaining sublist is passed to the recursive call of listSum.
There is no backtracking here, but if the recursive method would backtrack, using lists would bring another advantage: extracting a sublist is almost free (constant time operation), but it's still guaranteed to be immutable, so you can just pass it into the recursive call, without having to care about whether the recursive call modifies it or not, and so you don't have to do anything to undo any modifications that could have been done by the recursive calls. This is the advantage of persistent immutable data structures: related lists can share most of their structure, while still appearing immutable from the outside, so that it's impossible to break anything in the parent list just because you have the access to the tail of this list. This would not be the case with a view over a mutable array.

Elegant traversal of a source in Scala

As a data scientist I frequently use the following pattern for data extraction (i.e. DB, file reading and others):
val source = open(sourceName)
var item = source.getNextItem()
while(item != null){
processItem(item)
item = source.getNextItem()
}
source.close
My (current) dream is to wrap this verbosity into a Scala object "SourceTrav" that would allow this elegance:
SourceTrav(sourceName).foreach(item => processItem(item))
with the same functionality as above, but without running into StackOverflowError, as might happen with the examples in Semantics of Scala Traversable, Iterable, Sequence, Stream and View?
Any idea?
If Scala's standard library (for example scala.io.Source) doesn't suit your needs, you can use different Iterator or Stream companion object methods to wrap manual iterator traversal.
In this case, for example, you can do the following, when you already have an open source:
Iterator.continually(source.getNextItem()).takeWhile(_ != null).foreach(processItem)
If you also want to add automatic opening and closing of the source, don't forget to add try-finally or some other flavor of loan pattern:
case class SourceTrav(sourceName: String) {
def foreach(processItem: Item => Unit): Unit = {
val source = open(sourceName)
try {
Iterator.continually(source.getNextItem()).takeWhile(_ != null).foreach(processItem)
} finally {
source.close()
}
}
}

What is the advantage of using Option.map over Option.isEmpty and Option.get?

I am a new to Scala coming from Java background, currently confused about the best practice considering Option[T].
I feel like using Option.map is just more functional and beautiful, but this is not a good argument to convince other people. Sometimes, isEmpty check feels more straight forward thus more readable. Is there any objective advantages, or is it just personal preference?
Example:
Variation 1:
someOption.map{ value =>
{
//some lines of code
}
} orElse(foo)
Variation 2:
if(someOption.isEmpty){
foo
} else{
val value = someOption.get
//some lines of code
}
I intentionally excluded the options to use fold or pattern matching. I am simply not pleased by the idea of treating Option as a collection right now, and using pattern matching for a simple isEmpty check is an abuse of pattern matching IMHO. But no matter why I dislike these options, I want to keep the scope of this question to be the above two variations as named in the title.
Is there any objective advantages, or is it just personal preference?
I think there's a thin line between objective advantages and personal preference. You cannot make one believe there is an absolute truth to either one.
The biggest advantage one gains from using the monadic nature of Scala constructs is composition. The ability to chain operations together without having to "worry" about the internal value is powerful, not only with Option[T], but also working with Future[T], Try[T], Either[A, B] and going back and forth between them (also see Monad Transformers).
Let's try and see how using predefined methods on Option[T] can help with control flow. For example, consider a case where you have an Option[Int] which you want to multiply only if it's greater than a value, otherwise return -1. In the imperative approach, we get:
val option: Option[Int] = generateOptionValue
var res: Int = if (option.isDefined) {
val value = option.get
if (value > 40) value * 2 else -1
} else -1
Using collections style method on Option, an equivalent would look like:
val result: Int = option
.filter(_ > 40)
.map(_ * 2)
.getOrElse(-1)
Let's now consider a case for composition. Let's say we have an operation which might throw an exception. Additionaly, this operation may or may not yield a value. If it returns a value, we want to query a database with that value, otherwise, return an empty string.
A look at the imperative approach with a try-catch block:
var result: String = _
try {
val maybeResult = dangerousMethod()
if (maybeResult.isDefined) {
result = queryDatabase(maybeResult.get)
} else result = ""
}
catch {
case NonFatal(e) => result = ""
}
Now let's consider using scala.util.Try along with an Option[String] and composing both together:
val result: String = Try(dangerousMethod())
.toOption
.flatten
.map(queryDatabase)
.getOrElse("")
I think this eventually boils down to which one can help you create clear control flow of your operations. Getting used to working with Option[T].map rather than Option[T].get will make your code safer.
To wrap up, I don't believe there's a single truth. I do believe that composition can lead to beautiful, readable, side effect deferring safe code and I'm all for it. I think the best way to show other people what you feel is by giving them examples as we just saw, and letting them feel for themselves the power they can leverage with these sets of tools.
using pattern matching for a simple isEmpty check is an abuse of pattern matching IMHO
If you do just want an isEmpty check, isEmpty/isDefined is perfectly fine. But in your case you also want to get the value. And using pattern matching for this is not abuse; it's precisely the basic use-case. Using get allows to very easily make errors like forgetting to check isDefined or making the wrong check:
if(someOption.isEmpty){
val value = someOption.get
//some lines of code
} else{
//some other lines
}
Hopefully testing would catch it, but there's no reason to settle for "hopefully".
Combinators (map and friends) are better than get for the same reason pattern matching is: they don't allow you to make this kind of mistake. Choosing between pattern matching and combinators is a different question. Generally combinators are preferred because they are more composable (as Yuval's answer explains). If you want to do something covered by a single combinator, I'd generally choose them; if you need a combination like map ... getOrElse, or a fold with multi-line branches, it depends on the specific case.
It seems similar to you in case of Option but just consider the case of Future. You will not be able to interact with the future's value after going out of Future monad.
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Promise
import scala.util.{Success, Try}
// create a promise which we will complete after sometime
val promise = Promise[String]();
// Now lets consider the future contained in this promise
val future = promise.future;
val byGet = if (!future.value.isEmpty) {
val valTry = future.value.get
valTry match {
case Success(v) => v + " :: Added"
case _ => "DEFAULT :: Added"
}
} else "DEFAULT :: Added"
val byMap = future.map(s => s + " :: Added")
// promise was completed now
promise.complete(Try("PROMISE"))
//Now lets print both values
println(byGet)
// DEFAULT :: Added
println(byMap)
// Success(PROMISE :: Added)

How to efficiently select a random element from a Scala immutable HashSet

I have a scala.collection.immutable.HashSet that I want to randomly select an element from.
I could solve the problem with an extension method like this:
implicit class HashSetExtensions[T](h: HashSet[T]) {
def nextRandomElement (): Option[T] = {
val list = h.toList
list match {
case null | Nil => None
case _ => Some (list (Random.nextInt (list.length)))
}
}
}
...but converting to a list will be slow. What would be the most efficient solution?
WARNING This answer is for experimental use only. For real project you probably should use your own collection types.
So i did some research in the HashSet source and i think there is little opportunity to someway extract the inner structure of most valuable class HashTrieSet without package violation.
I did come up with this code, which is extended Ben Reich's solution:
package scala.collection
import scala.collection.immutable.HashSet
import scala.util.Random
package object random {
implicit class HashSetRandom[T](set: HashSet[T]) {
def randomElem: Option[T] = set match {
case trie: HashSet.HashTrieSet[T] => {
trie.elems(Random.nextInt(trie.elems.length)).randomElem
}
case _ => Some(set.size) collect {
case size if size > 0 => set.iterator.drop(Random.nextInt(size)).next
}
}
}
}
file should be created somewhere in the src/scala/collection/random folder
note the scala.collection package - this thing makes the elems part of HashTrieSet visible. This is only solution i could think, which could run better than O(n). Current version should have complexity O(ln(n)) as any of immutable.HashSet's operation s.
Another warning - private structure of HashSet is not part of scala's standard library API, so it could change any version making this code erroneous (though it's didn't changed since 2.8)
Since size is O(1) on HashSet, and iterator is as lazy as possible, I think this solution would be relatively efficient:
implicit class RichHashSet[T](val h: HashSet[T]) extends AnyVal {
def nextRandom: Option[T] = Some(h.size) collect {
case size if size > 0 => h.iterator.drop(Random.nextInt(size)).next
}
}
And if you're trying to get every ounce of efficiency you could use match here instead of the more concise Some/collect idiom used here.
You can look at the mutable HashSet implementation to see the size method. The iterator method defined there basically just calls iterator on FlatHashTable. The same basic efficiencies of these methods apply to immutable HashSet if that's what you're working with. As a comparison, you can see the toList implementation on HashSet is all the way up the type hierarchy at TraversableOnce and uses far more primitive elements which are probably less efficient and (of course) the entire collection must be iterated to generate the List. If you were going to convert the entire set to a Traversable collection, you should use Array or Vector which have constant-time lookup.
You might also note that there is nothing special about HashSet in the above methods, and you could enrich Set[T] instead, if you so chose (although there would be no guarantee that this would be as efficient on other Set implementations, of course).
As a side note, when implementing enriched classes for extension methods, you should always consider making an implicit, user-defined value class by extending AnyVal. You can read about some of the advantages and limitations in the docs, and on this answer.

What are good examples of: "operation of a program should map input values to output values rather than change data in place"

I came across this sentence in Scala in explaining its functional behavior.
operation of a program should map input of values to output values rather than change data in place
Could somebody explain it with a good example?
Edit: Please explain or give example for the above sentence in its context, please do not make it complicate to get more confusion
The most obvious pattern that this is referring to is the difference between how you would write code which uses collections in Java when compared with Scala. If you were writing scala but in the idiom of Java, then you would be working with collections by mutating data in place. The idiomatic scala code to do the same would favour the mapping of input values to output values.
Let's have a look at a few things you might want to do to a collection:
Filtering
In Java, if I have a List<Trade> and I am only interested in those trades executed with Deutsche Bank, I might do something like:
for (Iterator<Trade> it = trades.iterator(); it.hasNext();) {
Trade t = it.next();
if (t.getCounterparty() != DEUTSCHE_BANK) it.remove(); // MUTATION
}
Following this loop, my trades collection only contains the relevant trades. But, I have achieved this using mutation - a careless programmer could easily have missed that trades was an input parameter, an instance variable, or is used elsewhere in the method. As such, it is quite possible their code is now broken. Furthermore, such code is extremely brittle for refactoring for this same reason; a programmer wishing to refactor a piece of code must be very careful to not let mutated collections escape the scope in which they are intended to be used and, vice-versa, that they don't accidentally use an un-mutated collection where they should have used a mutated one.
Compare with Scala:
val db = trades filter (_.counterparty == DeutscheBank) //MAPPING INPUT TO OUTPUT
This creates a new collection! It doesn't affect anyone who is looking at trades and is inherently safer.
Mapping
Suppose I have a List<Trade> and I want to get a Set<Stock> for the unique stocks which I have been trading. Again, the idiom in Java is to create a collection and mutate it.
Set<Stock> stocks = new HashSet<Stock>();
for (Trade t : trades) stocks.add(t.getStock()); //MUTATION
Using scala the correct thing to do is to map the input collection and then convert to a set:
val stocks = (trades map (_.stock)).toSet //MAPPING INPUT TO OUTPUT
Or, if we are concerned about performance:
(trades.view map (_.stock)).toSet
(trades.iterator map (_.stock)).toSet
What are the advantages here? Well:
My code can never observe a partially-constructed result
The application of a function A => B to a Coll[A] to get a Coll[B] is clearer.
Accumulating
Again, in Java the idiom has to be mutation. Suppose we are trying to sum the decimal quantities of the trades we have done:
BigDecimal sum = BigDecimal.ZERO
for (Trade t : trades) {
sum.add(t.getQuantity()); //MUTATION
}
Again, we must be very careful not to accidentally observe a partially-constructed result! In scala, we can do this in a single expression:
val sum = (0 /: trades)(_ + _.quantity) //MAPPING INTO TO OUTPUT
Or the various other forms:
(trades.foldLeft(0)(_ + _.quantity)
(trades.iterator map (_.quantity)).sum
(trades.view map (_.quantity)).sum
Oh, by the way, there is a bug in the Java implementation! Did you spot it?
I'd say it's the difference between:
var counter = 0
def updateCounter(toAdd: Int): Unit = {
counter += toAdd
}
updateCounter(8)
println(counter)
and:
val originalValue = 0
def addToValue(value: Int, toAdd: Int): Int = value + toAdd
val firstNewResult = addToValue(originalValue, 8)
println(firstNewResult)
This is a gross over simplification but fuller examples are things like using a foldLeft to build up a result rather than doing the hard work yourself: foldLeft example
What it means is that if you write pure functions like this you always get the same output from the same input, and there are no side effects, which makes it easier to reason about your programs and ensure that they are correct.
so for example the function:
def times2(x:Int) = x*2
is pure, while
def add5ToList(xs: MutableList[Int]) {
xs += 5
}
is impure because it edits data in place as a side effect. This is a problem because that same list could be in use elsewhere in the the program and now we can't guarantee the behaviour because it has changed.
A pure version would use immutable lists and return a new list
def add5ToList(xs: List[Int]) = {
5::xs
}
There are plenty examples with collections, which are easy to come by but might give the wrong impression. This concept works at all levels of the language (it doesn't at the VM level, however). One example is the case classes. Consider these two alternatives:
// Java-style
class Person(initialName: String, initialAge: Int) {
def this(initialName: String) = this(initialName, 0)
private var name = initialName
private var age = initialAge
def getName = name
def getAge = age
def setName(newName: String) { name = newName }
def setAge(newAge: Int) { age = newAge }
}
val employee = new Person("John")
employee.setAge(40) // we changed the object
// Scala-style
case class Person(name: String, age: Int) {
def this(name: String) = this(name, 0)
}
val employee = new Person("John")
val employeeWithAge = employee.copy(age = 40) // employee still exists!
This concept is applied on the construction of the immutable collection themselves: a List never changes. Instead, new List objects are created when necessary. Use of persistent data structures reduce the copying that would happen on a mutable data structure.