Scala: Option[Seq[String]] vs Seq[Option[String]]? - scala

I'm creating a method to retrieve a list of users from a database by ID.
I'm trying to decide on the pros and cons of declaring the ids parameter as Option[Seq[String]] vs Seq[Option[String]]?
In what cases should I favour one over the other?

A list of users is well represented neither as an Option[Seq[String]] nor as a Seq[Option[String]]. I would expect something like a List[User] for a list of users, or maybe a Vector or Seq.
If your string represents your user, and the None case does nothing, you could consider filtering those out. You can do this with
val dbresult: Seq[Option[String]] = ???
val strings = dbresult collect { case Some(str) => str }
or
val strings = dbresult.flatten
but it's difficult to give good advice without knowing what the Option[String] or Option[Seq] represents

As usual this strongly depends on the use case.
A Seq[Option[String]] will be useful if the size of the sequence is relevant (e.g., because you want to zip it with another sequence).
If this is not the case I would opt for flattening the sequence in order to just have a Seq[String]. This will likely be a better choice than Option[Seq[String]], as the sequence can also be of zero length.
In fact an Option can usually be treated as if it were a collection that can have either length zero or one. Therefore wrapping an Iterable in an Option often only adds unnecessary complexity.
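To make the trade-off concrete, here is a minimal sketch. The lookup function, the sample ids and the object name are made up for illustration and are not from the question.

```scala
object OptionShapes {
  // Hypothetical lookup: a user name for a known id, None otherwise.
  def lookup(id: String): Option[String] =
    Map("1" -> "alice", "3" -> "carol").get(id)

  val ids: Seq[String] = Seq("1", "2", "3")

  // Seq[Option[String]]: one slot per id, so positions still line up for zip.
  val perId: Seq[Option[String]] = ids.map(lookup)
  val paired: Seq[(String, Option[String])] = ids.zip(perId)

  // Seq[String]: the misses are dropped, so the length no longer matches ids.
  val found: Seq[String] = perId.flatten

  // Option[Seq[String]] has two "nothing" states, None and Some(Seq()).
  // Collapsing them gives back a plain Seq[String].
  def normalize(maybe: Option[Seq[String]]): Seq[String] =
    maybe.getOrElse(Seq.empty)
}
```

If callers never need to distinguish "no list" from "empty list", normalize shows that the outer Option carries no extra information.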

Related

Option as a singleton collection - real life use cases

The title pretty much sums it up. Option as a singleton collection can sometimes be confusing, but sometimes it allows for an interesting application. I have one example off the top of my head, and would like to learn more such examples.
My only example is running for comprehension on the Option[List[T]]. We can do the following:
val v = Some(List(1, 2, 3))
for {
  list <- v.toList
  elem <- list
} yield elem + 1
Without having Option.toList, it wouldn't be possible to stay in the same for comprehension, and I'd be forced to write something like this:
for {
  list <- v
} yield for {
  elem <- list
} yield elem + 1
The first example is cleaner, and it's an advantage of Option being a collection. Of course, the result type will be different in these 2 examples, but let's assume it doesn't matter for the sake of discussion.
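For reference, a small sketch of the two shapes with explicit type annotations, so the difference in result type is visible. The object and value names are only illustrative.

```scala
object ForComprehensionShapes {
  val v: Option[List[Int]] = Some(List(1, 2, 3))

  // Converting the Option to a List first keeps the whole comprehension in List.
  val flat: List[Int] =
    for {
      list <- v.toList
      elem <- list
    } yield elem + 1

  // Nesting the comprehensions preserves the outer Option.
  val nested: Option[List[Int]] =
    for (list <- v) yield for (elem <- list) yield elem + 1

  // With None, the flat version yields an empty List and the nested one yields None.
  val none: Option[List[Int]] = None
  val flatNone: List[Int] =
    for {
      list <- none.toList
      elem <- list
    } yield elem + 1
}
```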
Any other examples? I'd especially like to concentrate on collection-like usage, and not usage of Option's monadic properties - those are pretty much obvious. In other words, map and flatMap functions are out of scope of this question. They're definitely very useful, just coming from elsewhere.
I find that the main benefit of working with Option[T] as a collection is that you get to use operations defined on a collection, such as map, flatMap, filter, foreach etc. This makes it easier to do operations on a given option, instead of using pattern matching or checking Option[T].isDefined to see if a value exists.
For example, let's take the user repository example from Daniel Westheide's blog post about Option[T]:
Say you have a UserRepository object which returns users based on their ID. The user may or may not exist, hence it returns an Option[Person]. Now let's say we want to look up a person by id and then get their age. We can do:
val age: Option[Int] = UserRepository.findById(1).map(_.age)
Now let's say that a Person also has a gender property of type Option[String]. If you wanted to extract that out, you could use map:
val gender: Option[Option[String]] = UserRepository.findById(1).map(_.gender)
But working with nested options isn't too convenient. For that, you have flatMap:
val gender: Option[String] = UserRepository.findById(1).flatMap(_.gender)
And if we want to print out the gender if it exists, we can use foreach:
gender.foreach(println)
You'll find yourself working with Scala types that have nested Option[T] fields defined, and it's really handy to have collection-like methods that help you remove the boilerplate and noise of extracting the actual value.
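The snippets above assume a UserRepository that is not shown. A self-contained sketch could look like the following, where the Person fields and the in-memory Map are assumptions made for illustration rather than the blog post's actual code.

```scala
object UserRepositoryDemo {
  // Illustrative data model; only age and gender are mentioned in the text.
  case class Person(id: Int, age: Int, gender: Option[String])

  object UserRepository {
    private val users = Map(
      1 -> Person(1, 32, Some("male")),
      2 -> Person(2, 25, None)
    )
    // Map#get already returns an Option, so a missing id becomes None.
    def findById(id: Int): Option[Person] = users.get(id)
  }

  val age: Option[Int] = UserRepository.findById(1).map(_.age)

  // map keeps both layers, flatMap collapses them.
  val nestedGender: Option[Option[String]] = UserRepository.findById(1).map(_.gender)
  val gender: Option[String] = UserRepository.findById(1).flatMap(_.gender)

  // Both "user exists without a gender" and "no such user" end up as None.
  val unknownGender: Option[String] = UserRepository.findById(2).flatMap(_.gender)
  val unknownUser: Option[String] = UserRepository.findById(3).flatMap(_.gender)
}
```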
A more real life use case I just encountered the other day was working with the awscala SDK, where I wanted to retrieve an object from S3 storage:
val bucket: Option[Bucket] = s3.bucket(amazonConfig.bucketName)
val result: Option[S3Object] = bucket.flatMap(_.get(amazonConfig.offsetKey))
result.flatMap(s3Object =>
  Source.fromInputStream(s3Object.content).mkString.decodeOption[Array[KafkaOffset]])
So what happens here is that you query the S3 service for a bucket, which may or may not exist. Then you want to extract an S3Object from it, which actually contains the data, but the API itself returns an Option[S3Object], so flatMap is handy for getting an Option[S3Object] instead of an Option[Option[S3Object]]. Finally, I want to deserialize the S3Object, which contains JSON. The Argonaut library returns an Option[MyObject] (here, an Option[Array[KafkaOffset]]), so flatMap again comes to the rescue to extract the inner option.
Edit:
As you pointed out, map and flatMap belong to the monadic property of Option[T]. I've written a blog post describing the reduction of two options where the final solution was:
def reduce[T](a: Option[T], b: Option[T], f: (T, T) => T): Option[T] = {
  (a ++ b).reduceLeftOption(f)
}
This takes advantage of the ++ operator, which is defined on collections and is also available on Option[T] because an Option can be treated as a collection.
I'd suggest taking a look at the corresponding chapter of The Neophyte's Guide to Scala.
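As a usage sketch of that reduction: the version below calls toList explicitly instead of relying on the implicit conversion from Option to Iterable that a ++ b uses, which is a deliberate substitution and not the blog post's exact code. The add function and the sample values are illustrative.

```scala
object ReduceOptions {
  // Combine two optional values: both present -> f(a, b); one present -> that one;
  // none present -> None.
  def reduce[T](a: Option[T], b: Option[T], f: (T, T) => T): Option[T] =
    (a.toList ++ b.toList).reduceLeftOption(f)

  val add: (Int, Int) => Int = (x, y) => x + y

  val both: Option[Int]    = reduce[Int](Some(1), Some(2), add)
  val onlyOne: Option[Int] = reduce[Int](Some(1), None, add)
  val neither: Option[Int] = reduce[Int](None, None, add)
}
```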
In my experience, most useful use-cases of Option-as-collection are to filter an option and to make flatMap that implicitly filters None values.
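A short sketch of those two uses. The parseInt helper and the sample input are assumptions made for the example.

```scala
object OptionAsCollection {
  import scala.util.Try

  // Hypothetical helper: Some(n) for a valid integer, None otherwise.
  def parseInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

  val raw: List[String] = List("1", "x", " 3 ", "40")

  // flatMap over a collection keeps the Some values and drops the Nones.
  val numbers: List[Int] = raw.flatMap(s => parseInt(s))

  // filter on an Option turns a value that fails the predicate into None.
  val small: Option[Int] = parseInt("40").filter(_ < 10)
  val kept: Option[Int]  = parseInt("3").filter(_ < 10)
}
```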

filtering only one side of a list/iterable in scala

I'd like to remove only the last few elements of a List (or Seq), and avoid traversing all elements (and avoid applying the filter function to all of them).
Let's say, for example, I have a random strictly increasing list of values:
import scala.util.Random.nextInt
val r = (1 to 100).map(_ => nextInt(10)+1).scanLeft(0)(_+_)
And I want to remove the elements greater than, say, 300. I can do it like this:
r.filter(_<300)
but this method traverses the whole list. So, is it possible to filter a list only on one end? Something like a filterRight method?
subquestions:
Also, would it be possible for a list of values that is not strictly increasing, i.e. to remove elements from the end of the list until one element is, say, below 300?
If it is not possible for a List/Seq, what about an IndexedSeq like Vector or Array?
Solutions
I selected @elm's solution because it answers the subquestion for general lists, not only (strictly) increasing ones.
However, @dcastro's solution looks more efficient as it doesn't do two reverses.
First, note SI-4247, about there being a dropWhile but no dropRightWhile.
A simple implementation that conveys the desired semantics, though, is
def dropRightWhile[A](xs: Seq[A], p: A => Boolean) =
  xs.reverse.dropWhile(p).reverse
or equivalently
implicit class OpsSeq[A](val xs: Seq[A]) extends AnyVal {
  def dropRightWhile(p: A => Boolean) = xs.reverse.dropWhile(p).reverse
}
You're looking for something like dropRightWhile, which doesn't exist in the standard library (but has been requested before).
I think your best bet is:
r.takeWhile(_<300)
Since it's an increasing list of values, you can stop checking as soon as you encounter the first element that is not below 300.
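For the IndexedSeq subquestion, one possible sketch avoids both reverses by scanning from the right with lastIndexWhere and then taking the prefix. The function name mirrors the answers above, but this particular implementation and the sample vectors are only an illustration.

```scala
object DropRight {
  // Drops the longest suffix whose elements all satisfy p. The scan starts at
  // the right end and stops at the first element that does not satisfy p.
  def dropRightWhile[A](xs: IndexedSeq[A])(p: A => Boolean): IndexedSeq[A] =
    xs.take(xs.lastIndexWhere(x => !p(x)) + 1)

  val increasing: Vector[Int] = Vector(0, 120, 250, 310, 400)
  val unsorted: Vector[Int]   = Vector(500, 100, 350, 200, 320, 900)

  // On increasing data this agrees with takeWhile(_ < 300).
  val trimmedIncreasing = dropRightWhile(increasing)(_ >= 300)

  // On unsorted data only the trailing run of large values is removed,
  // whereas takeWhile(_ < 300) would stop at the very first element.
  val trimmedUnsorted = dropRightWhile(unsorted)(_ >= 300)
}
```

On a Vector or Array the right-to-left scan is cheap. On a linked List, reaching the end is itself linear, so the gain there is limited to skipping predicate calls.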

Removing empty strings from maps in scala

val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val tokenizedLines = lines.map(Tokenizer.tokenize)
In the above code snippet, the tokenize function may return empty strings. How do I skip adding them to the map in that case, or remove the empty entries after adding them to the map?
tokenizedLines.filter(_.nonEmpty)
The currently accepted answer, using filter and nonEmpty, incurs some performance penalty because nonEmpty is not a method on String but is added through an implicit conversion. With value classes being used, I expect the difference to be almost imperceptible, but on versions of Scala where that is not the case, it can be a substantial hit.
Instead, one could use this, which calls String's own isEmpty method and so avoids the conversion:
tokenizedLines.filterNot(_.isEmpty)
You could use flatMap with Option.
Something like this:
lines.flatMap {
  case "" => None
  case s => Some(s)
}
val tokenizedLines = (lines.map(Tokenizer.tokenize)).filter(_.nonEmpty)
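The approaches above can be compared side by side on a plain Seq, which has the same filter, filterNot and flatMap operations as an RDD. The tokenize function here is a stand-in, since the real Tokenizer.tokenize is not shown in the question.

```scala
object SkipEmpty {
  // Stand-in for Tokenizer.tokenize (an assumption): may return an empty string.
  def tokenize(line: String): String = line.trim.toLowerCase

  // Helper for the flatMap variant: None for "", Some(s) otherwise.
  def nonEmptyOption(s: String): Option[String] =
    if (s.isEmpty) None else Some(s)

  val lines: Seq[String] = Seq("Hello", "   ", "World", "")

  val viaFilter: Seq[String]    = lines.map(tokenize).filter(_.nonEmpty)
  val viaFilterNot: Seq[String] = lines.map(tokenize).filterNot(_.isEmpty)
  val viaFlatMap: Seq[String]   = lines.map(tokenize).flatMap(s => nonEmptyOption(s))
}
```

All three produce the same result, so the choice is mostly about readability and the small conversion cost discussed above.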

Scala - encapsulating data in objects

Motivations
This question is about working with Lists of data in Scala, and about resorting to either tuples or class objects for holding data. Perhaps some of my assumptions are wrong, so here goes.
My current approach
As I understand, tuples do not afford the possibility of elegantly addressing their elements beyond the provided ._1, ._2, etc. I can use them, but code will be a bit unpleasant wherever data is extracted far from the lines of code that had defined it.
Also, as I understand, a Scala Map can only declare a single type for its values, so the values cannot vary in type except through type inheritance. (On the latter point, using a type hierarchy to get "diverse" Map values may seem very artificial unless a class hierarchy fits the model intuitively to begin with.)
So, when I need to have lists where each element contains two or more named data entities, e.g. as below one of type String and one of type List, each accessible through an intelligible name, I resort to:
case class Foo (name1: String, name2: List[String])
val foos: List[Foo] = ...
Then I can later access instances of the list using .name1 and .name2.
Shortcomings and problems I see here
When the list is very large, should I assume this is less performant or more memory consuming than using a tuple as the List's type? Alternatively, is there a different elegant way of accomplishing struct semantics in Scala?
In terms of performance, I don't think there is going to be any distinction between a tuple and an instance of a case class. In fact, a tuple is an instance of a case class.
Secondly, if you're looking for another, more readable way to get the data out of the tuple, I suggest you consider pattern matching:
val (name1, name2) = ("first", List("second", "third"))
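A brief sketch of how the same destructuring works for a list of case class instances and a list of tuples. Foo is the class from the question, while the sample data is invented.

```scala
object NamedFields {
  case class Foo(name1: String, name2: List[String])

  val foos: List[Foo] = List(
    Foo("a", List("x", "y")),
    Foo("b", Nil)
  )

  // The same data held in tuples instead of case class instances.
  val tuples: List[(String, List[String])] = foos.map(f => (f.name1, f.name2))

  // Case classes give named access.
  val namesByField: List[String] = foos.map(_.name1)

  // Both case classes and tuples can be taken apart by pattern matching,
  // which gives the parts local names without ._1 or ._2.
  val sizesFromFoos: List[Int]   = foos.map { case Foo(_, items) => items.size }
  val sizesFromTuples: List[Int] = tuples.map { case (_, items) => items.size }
}
```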

Use-cases for Streams in Scala

In Scala there is a Stream class that is very much like an iterator. The topic Difference between Iterator and Stream in Scala? offers some insights into the similarities and differences between the two.
Seeing how to use a stream is pretty simple but I don't have very many common use-cases where I would use a stream instead of other artifacts.
The ideas I have right now:
If you need to make use of an infinite series. But this does not seem like a common use-case to me so it doesn't fit my criteria. (Please correct me if it is common and I just have a blind spot)
If you have a series of data where each element needs to be computed but that you may want to reuse several times. This is weak because I could just load it into a list which is conceptually easier to follow for a large subset of the developer population.
Perhaps there is a large set of data or a computationally expensive series and there is a high probability that the items you need will not require visiting all of the elements. But in this case an Iterator would be a good match unless you need to do several searches, in that case you could use a list as well even if it would be slightly less efficient.
There is a complex series of data that needs to be reused. Again a list could be used here. Although in this case both cases would be equally difficult to use and a Stream would be a better fit since not all elements need to be loaded. But again not that common... or is it?
So have I missed any big uses? Or is it a developer preference for the most part?
Thanks
The main difference between a Stream and an Iterator is that the latter is mutable and "one-shot", so to speak, while the former is not. Iterator has a better memory footprint than Stream, but the fact that it is mutable can be inconvenient.
Take this classic prime number generator, for instance:
def primeStream(s: Stream[Int]): Stream[Int] =
  Stream.cons(s.head, primeStream(s.tail filter { _ % s.head != 0 }))
val primes = primeStream(Stream.from(2))
It can easily be written with an Iterator as well, but an Iterator won't keep the primes computed so far.
So, one important aspect of a Stream is that you can pass it to other functions without having it duplicated first, or having to generate it again and again.
As for expensive computations/infinite lists, these things can be done with Iterator as well. Infinite lists are actually quite useful -- you just don't know it because you didn't have it, so you have seen algorithms that are more complex than strictly necessary just to deal with enforced finite sizes.
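The "one-shot" versus "keeps what it computed" difference can be seen in a small sketch. The evaluation counter and the names are illustrative. Note that in Scala 2.13 Stream is deprecated in favour of LazyList, so this code compiles there with a deprecation warning.

```scala
object StreamVsIterator {
  // Counts how often the mapping function actually runs.
  var evaluations = 0
  val squares: Stream[Int] = Stream.from(1).map { n => evaluations += 1; n * n }

  val first: List[Int] = squares.take(3).toList
  val afterFirst: Int  = evaluations

  // Traversing the same prefix again reuses the memoized elements.
  val second: List[Int] = squares.take(3).toList
  val afterSecond: Int  = evaluations

  // An Iterator is consumed by the first traversal.
  val it: Iterator[Int]     = Iterator(1, 2, 3)
  val firstPass: List[Int]  = it.toList
  val secondPass: List[Int] = it.toList
}
```

The flip side of memoization is memory: as long as something holds on to the head of the Stream, every element computed so far stays reachable.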
In addition to Daniel's answer, keep in mind that Stream is useful for short-circuiting evaluations. For example, suppose I have a huge set of functions that take String and return Option[String], and I want to keep executing them until one of them works:
val stringOps = List(
  (s: String) => if (s.length > 10) Some(s.length.toString) else None,
  (s: String) => if (s.length == 0) Some("empty") else None,
  (s: String) => if (s.indexOf(" ") >= 0) Some(s.trim) else None
)
Well, I certainly don't want to execute the entire list, and there isn't any handy method on List that says, "treat these as functions and execute them until one of them returns something other than None". What to do? Perhaps this:
def transform(input: String, ops: List[String => Option[String]]) = {
  ops.toStream.map(_(input)).find(_.isDefined).getOrElse(None)
}
This takes a list and treats it as a Stream (which doesn't actually evaluate anything), then defines a new Stream that is a result of applying the functions (but that doesn't evaluate anything either yet), then searches for the first one which is defined--and here, magically, it looks back and realizes it has to apply the map, and get the right data from the original list--and then unwraps it from Option[Option[String]] to Option[String] using getOrElse.
Here's an example:
scala> transform("This is a really long string",stringOps)
res0: Option[String] = Some(28)
scala> transform("",stringOps)
res1: Option[String] = Some(empty)
scala> transform(" hi ",stringOps)
res2: Option[String] = Some(hi)
scala> transform("no-match",stringOps)
res3: Option[String] = None
But does it work? If we put a println into our functions so we can tell if they're called, we get
val stringOps = List(
  (s: String) => { println("1"); if (s.length > 10) Some(s.length.toString) else None },
  (s: String) => { println("2"); if (s.length == 0) Some("empty") else None },
  (s: String) => { println("3"); if (s.indexOf(" ") >= 0) Some(s.trim) else None }
)
// (transform is the same)
scala> transform("This is a really long string",stringOps)
1
res0: Option[String] = Some(28)
scala> transform("no-match",stringOps)
1
2
3
res1: Option[String] = None
(This is with Scala 2.8; 2.7's implementation will sometimes overshoot by one, unfortunately. And note that you do accumulate a long list of None as your failures accrue, but presumably this is inexpensive compared to your true computation here.)
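If the intermediate results do not need to be kept, the same short-circuiting can also be sketched with an Iterator and collectFirst. This is an alternative formulation, not the code from the answer above, and it uses a call counter in place of println so the laziness can be checked.

```scala
object FirstSuccess {
  // Incremented by every function that actually runs.
  var calls = 0

  val ops: List[String => Option[String]] = List(
    s => { calls += 1; if (s.length > 10) Some(s.length.toString) else None },
    s => { calls += 1; if (s.isEmpty) Some("empty") else None },
    s => { calls += 1; if (s.contains(" ")) Some(s.trim) else None }
  )

  // Iterator.map is lazy and collectFirst stops at the first Some,
  // so functions after the first success are never called.
  def transform(input: String, ops: List[String => Option[String]]): Option[String] =
    ops.iterator.map(op => op(input)).collectFirst { case Some(result) => result }
}
```

Unlike the Stream version, nothing is memoized here, so no list of None results accumulates. The trade-off is that the intermediate sequence cannot be traversed a second time.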
I could imagine that if you poll some device in real time, a Stream is more convenient.
Think of a GPS tracker, which returns the current position when you ask it. You can't precompute the location where you will be in 5 minutes. You might use it for a few minutes only, to update a path in OpenStreetMap, or you might use it for an expedition of six months in a desert or the rain forest.
Or a digital thermometer or other kinds of sensors which repeatedly return new data, as long as the hardware is alive and turned on - a log file filter could be another example.
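A minimal sketch of that idea, using Stream.continually to turn repeated polling into an on-demand sequence. The FakeThermometer class and its readings are invented for the example, and real hardware access would replace read().

```scala
object SensorPolling {
  // Hypothetical sensor: each read returns a new, slightly higher temperature.
  final class FakeThermometer {
    private var ticks = 0
    def read(): Double = { ticks += 1; 20.0 + ticks * 0.5 }
  }

  val sensor = new FakeThermometer

  // An unbounded sequence of readings; each element is polled only when needed.
  val readings: Stream[Double] = Stream.continually(sensor.read())

  // The consumer decides how much to take, whether three readings or many more.
  val firstThree: List[Double] = readings.take(3).toList

  // Because Stream memoizes, looking at the same prefix again does not poll
  // the sensor a second time.
  val replay: List[Double] = readings.take(3).toList
}
```

For a very long-running poll, keeping a reference to readings would retain every value read so far, so an Iterator.continually may be the better fit when old readings are not needed.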
Stream is to Iterator as immutable.List is to mutable.List. Favouring immutability prevents a class of bugs, occasionally at the cost of performance.
scalac itself isn't immune to these problems: http://article.gmane.org/gmane.comp.lang.scala.internals/2831
As Daniel points out, favouring laziness over strictness can simplify algorithms and make it easier to compose them.