Should Scala immutable case classes be defined to hold Seq[T], immutable.Seq[T], List[T] or Vector[T]? - scala

If we want to define a case class that holds a single object, say a tuple, we can do it easily:
sealed case class A(x: (Int, Int))
In this case, retrieving the "x" value will take a small constant amount of time, and this class will only take a small constant amount of space, regardless of how it was created.
Now, let's assume we want to hold a sequence of values instead; we could do it like this:
final case class A(x: Seq[Int])
This might seem to work as before, except that now storage and time to read all of x is proportional to x.length.
However, this is not actually the case, because someone could do something like this:
val hugeList = (1 to 1000000000).toList
val a = A(hugeList.view.filter(_ == 500000000))
In this case, the a object looks like an innocent case class holding a single int in a sequence, but in fact it requires gigabytes of memory, and it will take on the order of seconds to access that single element every time.
This could be fixed by specifying something like List[T] as the type instead of Seq[T]; however, this seems ugly since it adds a reference to a specific implementation, while in fact other well behaved implementations, like Vector[T], would also do.
Another worrying issue is that one could pass a mutable Seq[T], so it seems that one should at least use immutable.Seq instead of scala.collection.Seq (although the compiler can't actually enforce the immutability at the moment).
Looking at most libraries it seems that the common pattern is to use scala.collection.Seq[T], but is this really a good idea?
Or perhaps Seq is being used just because it's the shortest to type, and in fact it would be best to use immutable.Seq[T], List[T], Vector[T] or something else?
New text added in edit
Looking at the class library, some of the most core functionality like scala.reflect.api.Trees does in fact use List[T], and in general using a concrete class seems a good idea.
But then, why use List and not Vector?
Vector has O(1)/O(log(n)) length, prepend, append and random access, is more compact (a List is ~3-4 times bigger because every cell carries its own object header and next pointer), and supports cache-efficient and parallelized computation, while List has none of those properties except O(1) prepend.
So, personally I'm leaning towards Vector[T] being the correct choice for something exposed in a library data structure, where one doesn't know what operations the library user will need, despite the fact that it seems less popular.
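To make the comparison concrete, here is a minimal sketch of the operations in question. The object name VectorVsListSketch and the sample values are only illustrative; the methods themselves (:+, +:, apply, updated, ::) are standard library calls.

```scala
object VectorVsListSketch {
  val xs: Vector[Int] = Vector(1, 2, 3)

  // All of these are effectively constant time on a Vector
  val appended: Vector[Int]  = xs :+ 4            // Vector(1, 2, 3, 4)
  val prepended: Vector[Int] = 0 +: xs            // Vector(0, 1, 2, 3)
  val third: Int             = xs(2)              // 3
  val updated: Vector[Int]   = xs.updated(1, 20)  // Vector(1, 20, 3)

  val ys: List[Int] = List(1, 2, 3)

  // Prepending to a List is O(1) ...
  val consed: List[Int] = 0 :: ys                 // List(0, 1, 2, 3)
  // ... but appending and indexed access walk the list, so they are O(n)
  val listAppended: List[Int] = ys :+ 4           // List(1, 2, 3, 4)
  val listThird: Int = ys(2)                      // 3
}
```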

First of all, you talk both about space and time requirements. In terms of space, your object will always be as large as the collection. It doesn't matter whether you wrap a mutable or immutable collection, that collection for obvious reasons needs to be in memory, and the case class wrapping it doesn't take any additional space (except its own small object reference). So if your collection takes "gigabytes of memory", that's a problem of your collection, not whether you wrap it in a case class or not.
You then go on to argue that a problem arises when using views instead of eager collections. But again, what is the actual problem? You use the example of lazily filtering a collection. Running a filter is an O(n) operation in general, just as if you were iterating over the original list, and because the view is lazy that cost is paid again on every access. If the collection were made strict first, the filter would run once and successive accesses would be O(1). But that is a problem at the call site of your case class, not in the definition of your case class.
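A rough sketch of what "making it strict" can look like. The helper A.fromIterable is hypothetical (it is not something the question defines), and the case class here holds a Vector purely for illustration. A counter shows that every traversal of the view re-runs the mapping, while the copied Vector does not.

```scala
object StrictCopySketch {
  final case class A(x: Vector[Int])

  object A {
    // Hypothetical helper: force any (possibly lazy) input into a strict Vector
    def fromIterable(xs: Iterable[Int]): A = A(xs.toVector)
  }

  /** Returns (evaluations after two traversals of the view, evaluations at the end). */
  def demo(): (Int, Int) = {
    var evaluations = 0
    val lazyXs = List(1, 2, 3).view.map { i => evaluations += 1; i * 2 }

    // Each traversal of the view re-runs the mapping function
    val first  = lazyXs.toList
    val second = lazyXs.toList
    val afterView = evaluations          // 3 elements x 2 traversals

    // Copying into a Vector traverses the view exactly once more
    val a = A.fromIterable(lazyXs)

    // Reading the strict copy does not touch the view again
    val total = a.x.sum + a.x(0)

    (afterView, evaluations)
  }
}
```

The cost of this approach is one full copy at construction time, which is exactly the trade-off the caller would otherwise have to make themselves.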
The only valid point I see is with respect to mutable collections. Given the defining semantics of case classes, you should really only use effectively immutable objects as arguments, so either pure immutable collections or collections to which no instance has any more write access.
There is a design error in Scala in that scala.Seq is not aliased to collection.immutable.Seq but to a general Seq which can be either mutable or immutable. (This holds through Scala 2.12; from Scala 2.13 on, scala.Seq is an alias for collection.immutable.Seq.) I advise against any use of unqualified Seq. It is really wrong and should be rectified in the Scala standard library. Use collection.immutable.Seq instead, or if the collection doesn't need to be ordered, collection.immutable.Traversable.
So I agree with your suspicion:
Looking at most libraries it seems that the common pattern is to use scala.collection.Seq[T], but is this really a good idea?
No! Not good. It might be convenient, because you can pass in an Array for example without explicit conversion, but I think a cleaner design is to require immutability.
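A small sketch of why the mutable case matters. The class names Loose and Strict are made up for this illustration; the point is only that a case class typed against scala.collection.Seq can change its value (and its equality) behind your back, while one typed against collection.immutable.Seq forces the caller to hand over an immutable copy.

```scala
import scala.collection.mutable.ArrayBuffer

object ImmutableSeqSketch {
  // Accepts mutable and immutable sequences alike
  final case class Loose(xs: scala.collection.Seq[Int])
  // Accepts only immutable sequences; passing an ArrayBuffer does not compile
  final case class Strict(xs: scala.collection.immutable.Seq[Int])

  /** Returns (loose length, strict length, equal before mutation, equal after mutation). */
  def demo(): (Int, Int, Boolean, Boolean) = {
    val buf    = ArrayBuffer(1, 2, 3)
    val loose  = Loose(buf)          // shares the buffer with the caller
    val strict = Strict(buf.toList)  // explicit conversion to an immutable copy

    val equalBefore = loose == Loose(List(1, 2, 3))
    buf += 4                         // the caller mutates its buffer later on
    val equalAfter = loose == Loose(List(1, 2, 3))

    (loose.xs.length, strict.xs.length, equalBefore, equalAfter)
  }
}
```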

Related

Convert tuple to array in Scala

What is the best way to convert a tuple into an array in Scala? Here "best" means in as few lines of code as possible. I was shocked to search Google and StackOverflow only to find nothing on this topic, which seems like it should be trivial and common. Lists have a toArray function; why don't tuples?
Use productIterator, immediately followed by toArray:
(42, 3.14, "hello", true).productIterator.toArray
gives:
res0: Array[Any] = Array(42, 3.14, hello, true)
The type of the result shows the main reason why it's rarely used: in tuples, the types of the elements can be heterogeneous, in arrays they must be homogeneous, so that often too much type information is lost during this conversion. If you want to do this, then you probably shouldn't have stored your information in tuples in the first place.
There is simply almost nothing you can (safely) do with an Array[Any], except printing it out, or converting it to an even more degenerate Set[Any]. Instead you could use:
Lists of case classes belonging to a common sealed trait,
shapeless HLists,
a carefully chosen base class with a bit of inheritance,
or something that at least keeps some kind of schema at runtime (like Apache Spark Datasets)
All of these would be better alternatives.
In the somewhat less likely case that the elements of the "tuples" that you are processing frequently turn out to have an informative least upper bound type, then it might be because you aren't working with plain tuples, but with some kind of traversable data structure that puts restrictions on the number of substructures in the nodes. In this case, you should consider implementing something like a Traverse interface for the structure, instead of messing with some "tuples" manually.
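As a sketch of the first alternative listed above (case classes under a common sealed trait), compared with what productIterator gives you. The names Field, IntField, TextField, FlagField and render are invented for this example.

```scala
object TupleAlternativesSketch {
  // productIterator always yields Any, even for a homogeneous tuple,
  // so the element type has to be recovered by pattern matching
  val point = (1, 2, 3)
  val ints: Array[Int] = point.productIterator.collect { case i: Int => i }.toArray

  // Alternative: keep the type information in a small sealed hierarchy
  sealed trait Field
  final case class IntField(value: Int)      extends Field
  final case class TextField(value: String)  extends Field
  final case class FlagField(value: Boolean) extends Field

  val row: List[Field] = List(IntField(42), TextField("hello"), FlagField(true))

  // The compiler can check this match for exhaustiveness
  def render(f: Field): String = f match {
    case IntField(v)  => "int:" + v
    case TextField(v) => "text:" + v
    case FlagField(v) => "flag:" + v
  }
}
```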

Scala - encapsulating data in objects

Motivations
This question is about working with Lists of data in Scala, and about resorting to either tuples or class objects for holding data. Perhaps some of my assumptions are wrong, so there it goes.
My current approach
As I understand, tuples do not afford the possibility of elegantly addressing their elements beyond the provided ._1, ._2, etc. I can use them, but code will be a bit unpleasant wherever data is extracted far from the lines of code that had defined it.
Also, as I understand, a Scala Map can only use a single type declaration for its values, so it can't diversify the type of its values except through type inheritance. (On the latter point, using a type hierarchy for Map value "diversity" may seem very artificial unless a class hierarchy fits some "model" intuition to begin with.)
So, when I need to have lists where each element contains two or more named data entities, e.g. as below one of type String and one of type List, each accessible through an intelligible name, I resort to:
case class Foo (name1: String, name2: List[String])
val foos: List[Foo] = ...
Then I can later access instances of the list using .name1 and .name2.
Shortcomings and problems I see here
When the list is very large, should I assume this is less performant or more memory consuming than using a tuple as the List's type? Alternatively, is there a different elegant way of accomplishing struct semantics in Scala?
In terms of performance, I don't think there is going to be any distinction between a tuple and an instance of a case class. In fact, a tuple is an instance of a case class.
Secondly, if you're looking for another, more readable way to get the data out of the tuple, I suggest you consider pattern matching:
val (name1, name2) = ("first", List("second", "third"))
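The same idea carries over to whole lists. A minimal sketch, reusing the Foo case class from the question with made-up sample data, showing that field access, case class patterns and tuple patterns all read the same data:

```scala
object NamedFieldsSketch {
  final case class Foo(name1: String, name2: List[String])

  val foos: List[Foo] = List(Foo("a", List("x", "y")), Foo("b", Nil))

  // Access by field name
  val viaFields: List[String] = foos.map(f => f.name1 + ":" + f.name2.size)

  // Destructure the case class in a pattern; local names can be anything
  val viaPattern: List[String] = foos.map { case Foo(name, items) => name + ":" + items.size }

  // The same pattern syntax works for tuples, so ._1 and ._2 are avoidable
  val pairs = List(("a", List("x", "y")), ("b", Nil))
  val viaTuple: List[String] = pairs.map { case (name, items) => name + ":" + items.size }
}
```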

Scala immutable map, when to go mutable?

My present use case is pretty trivial, either mutable or immutable Map will do the trick.
Have a method that takes an immutable Map, which then calls a 3rd party API method that takes an immutable Map as well
def doFoo(foo: String = "default", params: Map[String, Any] = Map()): Unit = {
  val newMap =
    if (someCondition) params + ("foo" -> foo) else params
  api.doSomething(newMap)
}
The Map in question will generally be quite small, at most there might be an embedded List of case class instances, a few thousand entries max. So, again, assume little impact in going immutable in this case (i.e. having essentially 2 instances of the Map via the newMap val copy).
Still, it nags me a bit, copying the map just to get a new map with a few k->v entries tacked onto it.
I could go mutable and params.put("bar", bar), etc. for the entries I want to tack on, and then params.toMap to convert to immutable for the API call; that is an option. But then I have to import and pass around mutable maps, which is a bit of a hassle compared to going with Scala's default immutable Map.
So, what are the general guidelines for when it is justified/good practice to use mutable Map over immutable Maps?
Thanks
EDIT
So, it appears that an add operation on an immutable map takes near constant time, confirming @dhg's and @Nicolas's assertion that a full copy is not made, which solves the problem for the concrete case presented.
Depending on the immutable Map implementation, adding a few entries may not actually copy the entire original Map. This is one of the advantages to the immutable data structure approach: Scala will try to get away with copying as little as possible.
This kind of behavior is easiest to see with a List. If I have a val a = List(1,2,3), then that list is stored in memory. However, if I prepend an additional element like val b = 0 :: a, I do get a new 4-element List back, but Scala did not copy the original list a. Instead, we just created one new link, called it b, and gave it a pointer to the existing List a.
You can envision strategies like this for other kinds of collections as well. For example, if I add one element to a Map, the collection could simply wrap the existing map, falling back to it when needed, all while providing an API as if it were a single Map.
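The List case can be checked directly with reference equality. A minimal sketch (the object and value names are just for illustration); for Map it only shows the observable behavior, since how much is shared internally depends on the implementation:

```scala
object SharingSketch {
  val a: List[Int] = List(1, 2, 3)
  val b: List[Int] = 0 :: a

  // eq compares references: the tail of b is the very same object as a
  val sharesTail: Boolean = b.tail eq a

  val m1: Map[String, Int] = Map("x" -> 1)
  val m2: Map[String, Int] = m1 + ("y" -> 2)
  // m1 is unchanged and m2 sees both entries
}
```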
Using a mutable object is not bad in itself, it becomes bad in a functional programming environment, where you try to avoid side-effects by keeping functions pure and objects immutable.
However, if you create a mutable object inside a function and modify this object, the function is still pure if you don't release a reference to this object outside the function. It is acceptable to have code like:
def buildVector( x: Double, y: Double, z: Double ): Vector[Double] = {
  val ary = Array.ofDim[Double]( 3 )
  ary( 0 ) = x
  ary( 1 ) = y
  ary( 2 ) = z
  ary.toVector
}
Now, I think this approach is useful/recommended in two cases: (1) Performance, if creating and modifying an immutable object is a bottleneck of your whole application; (2) Code readability, because sometimes it's easier to modify a complex object in place (rather than resorting to lenses, zippers, etc.)
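Applied to the map question, the same pattern might look like the sketch below. The function name withExtras and its parameters are hypothetical; Map.newBuilder is the standard library builder for immutable maps. The builder is mutable, but it never leaves the function, so callers only ever see immutable maps.

```scala
object LocalMutationSketch {
  // Hypothetical helper: mutation is confined to the local builder
  def withExtras(base: Map[String, Any], extras: List[(String, Any)]): Map[String, Any] = {
    val builder = Map.newBuilder[String, Any]
    builder ++= base
    extras.foreach(kv => builder += kv)  // later entries win on duplicate keys
    builder.result()
  }
}
```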
In addition to dhg's answer, you can take a look at the performance characteristics of the Scala collections. If an add/remove operation doesn't take linear time, it must be doing something other than simply copying the entire structure. (Note that the converse is not true: just because an operation takes linear time doesn't mean you're copying the whole structure.)
I like to use collection.Map as the declared parameter type (input or return values) rather than mutable or immutable maps. collection.Map is a read-only interface that works for both kinds of implementation. A consumer method using a map really doesn't need to know about a map implementation or how it was constructed. (It's really none of its business anyway).
If you go with the approach of hiding a map's particular construction (be it mutable or immutable) from the consumers who use it then you're still getting an essentially immutable map downstream. And by using collection.Map as an immutable interface you completely remove all the ".toMap" inefficiency that you would have with consumers written to use immutable.Map typed objects. Having to convert a completely constructed map into another one simply to comply to an interface not supported by the first one really is absolutely unnecessary overhead when you think about it.
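A minimal sketch of that style, with a made-up consumer totalOf. Note that scala.collection.Map only hides the mutating methods from the consumer; it does not stop whoever built a mutable map from changing it later.

```scala
import scala.collection.mutable

object ReadOnlyMapSketch {
  // The consumer only needs read access, so it asks for collection.Map
  def totalOf(counts: scala.collection.Map[String, Int]): Int =
    counts.values.sum

  /** Returns (total from an immutable map, total from a mutable map). */
  def demo(): (Int, Int) = {
    val fixed = Map("a" -> 1, "b" -> 2)

    val building = mutable.Map("a" -> 1)
    building("b") = 2

    // Both are accepted as-is, with no .toMap conversion
    (totalOf(fixed), totalOf(building))
  }
}
```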
I suspect in a few years from now we'll look back at the three separate sets of interfaces (mutable maps, immutable maps, and collections maps) and realize that 99% of the time only 2 are really needed (mutable and collections) and that using the (unfortunately) default immutable map interface really adds a lot of unnecessary overhead for the "Scalable Language".

How do I deal with Scala collections generically?

I have realized that my typical way of passing Scala collections around could use some improvement.
def doSomethingCool(theFoos: List[Foo]) = { /* insert cool stuff here */ }
// if I happen to have a List
doSomethingCool(theFoos)
// but elsewhere I may have a Vector, Set, Option, ...
doSomethingCool(theFoos.toList)
I tend to write my library functions to take a List as the parameter type, but I'm certain that there's something more general I can put there to avoid all the occasional .toList calls I have in the application code. This is especially annoying since my doSomethingCool function typically only needs to call map, flatMap and filter, which are defined on all the collection types.
What are my options for that 'something more general'?
Here are more general traits, each of which extends the previous one:
GenTraversableOnce
GenTraversable
GenIterable
GenSeq
The traits above do not specify whether the collection is sequential or parallel. If your code requires that things be executed sequentially (typically, if your code has side effects of any kind), they are too general for it.
The following traits mandate sequential execution:
TraversableOnce
Traversable
Iterable
Seq
LinearSeq
The first one, TraversableOnce only allows you to call one method on the collection. After that, the collection has been "used". In exchange, it is general enough to accept iterators as well as collections.
Traversable is a pretty general collection that has most methods. There are some things it cannot do, however, in which case you need to go to Iterable.
All Iterable implement the iterator method, which allows you to get an Iterator for that collection. This gives it the capability for a few methods not present in Traversable.
A Seq[A] implements the function Int => A, which means you can access any element by its index. This is not guaranteed to be efficient, but it is a guarantee that each element has an index, and that you can make assertions about what that index is going to be. Contrast this with Map and Set, where you cannot tell what the index of an element is.
A LinearSeq is a Seq that provides fast head, tail, isEmpty and prepend. This is as close as you can get to a List without actually using a List explicitly.
Alternatively, you could have an IndexedSeq, which has fast indexed access (something List does not provide).
See also this question and this FAQ based on it.
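As a sketch of how this plays out for the function in the question: declaring the parameter as Iterable is enough for map, flatMap and filter, and callers no longer need .toList. The Foo case class and the body of doSomethingCool are invented for this example. The last line illustrates the point about a Seq being usable as a function Int => A.

```scala
object GenericParamsSketch {
  final case class Foo(n: Int)

  // Iterable is general enough for map, flatMap and filter
  def doSomethingCool(theFoos: Iterable[Foo]): Iterable[Int] =
    theFoos.filter(_.n > 1).map(_.n * 10)

  val fromList   = doSomethingCool(List(Foo(1), Foo(2), Foo(3)))
  val fromVector = doSomethingCool(Vector(Foo(1), Foo(2), Foo(3)))
  val fromSet    = doSomethingCool(Set(Foo(2)))

  // A Seq[A] can be used wherever a function Int => A is expected
  val lookup: Int => String = Vector("zero", "one", "two")
}
```

The static result type is only Iterable[Int] here; if callers need the concrete collection type back, a more elaborate signature is required, which is a separate design decision.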
The most obvious one is to use Traversable as the most general trait which will have the goodies you want. However, I think you are generally better sticking to:
Seq
IndexedSeq
Set
Map
A Seq will cover List, Vector etc, IndexedSeq will cover Vector etc etc. I found myself not using Iterable because I often need (or want) to know the size of the thing I have and back pre scala-2.8 Iterable did not provide access to this, so I kept having to turn things into sequences anyway!
Looks like Traversable and Iterable now have size methods so maybe I should go back to using them! Of course you could start "going mad" with GenTraversableOnce but that is not likely to aid in readability.

Scala type that is Iterable and has a length?

Writing Scala code, I regularly encounter cases where I have "processor" functions that operate iteratively on a collection of elements and also need to know the length of the collection.
On the other hand I have "provider" functions that generate collections and so already know the length. The generated collections may be List[T], Array[T] or Set[T], etc., but even in the case of List[T], my generator knows the size (even if the List type does not store it).
So I would naturally declare the "processor" functions as taking the most generic type that seems to fit all collection types, Iterable[T], as a parameter. However, they then internally need to find out the size via iterative collection traversal at a cost of O(N), which is undesirable.
So my naive solution would be to create a new type like IterableWithSize[T] and have the provider and processor functions create and take this type. Neither Seq[T] nor IndexedSeq[T] seem to fit the bill. But this seems like a relatively common use case, so I'm suspecting that there is a more idiomatic way to do this. What would that be?
In Scala collections, performance-sensitive methods like size are not simply inherited from the general traits but overridden in the concrete implementation classes. For example, see the implementation of immutable.HashSet:
https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_0_1/src//library/scala/collection/immutable/HashSet.scala
So you don't need to worry about it. Just declare a high-level common trait like Traversable or Iterable and you're done.
Actually, there's no idiomatic way around that. Scala collections were really meant to be traversed or used in other prescribed manners (such as Set.contains or Map.get). Checking for size is not part of them, and some of them are not even finite.
Now, IndexedSeq is a relatively safe bet -- it guarantees O(log n) indexed access, which is only possible if you have O(log n) size. Also, Set and Map are reasonably safe as well, for similar reasons. But if you are looking for a trait that gives you a guarantee on size speed, there isn't one.
I don't think there is an idiomatic way to do this. But here are two alternatives:
(1) Extend Scala's List/Set/Array collections and override the size method. This is not as difficult as it seems at first glance.
(2) Wrap your List/Set/Array collections together with the size and define an implicit unwrapper like:
import scala.language.implicitConversions

class IterableWithSizeWrapper[E](private val c: Iterable[E], val size: Int)

object IterableWithSizeWrapper {
  // Expose the wrapped collection wherever an Iterable[E] is expected
  implicit def unwrap[E](iws: IterableWithSizeWrapper[E]): Iterable[E] = iws.c
}

object ListWithSizeTest {
  def process[E](iws: IterableWithSizeWrapper[E]): Unit = {
    // iws.size returns your cached size value
    // iws.take(i) forces the unwrap to the original collection,
    // so iws.take(i).size is computed by that collection
    for (i <- 0 to iws.size) assert(iws.take(i).size == i)
  }

  def main(args: Array[String]): Unit = {
    process(new IterableWithSizeWrapper(List(1,2,3), 3))
    process(new IterableWithSizeWrapper(Set(1,2,3), 3))
    process(new IterableWithSizeWrapper(Array(1,2,3), 3))
  }
}
How about Traversable? All the collections you mention inherit from it (Array indirectly via WrappedArray) and it provides size and toIterable (or toIterator) for traversal.
Your processor functions should accept Seq[T]. A Seq is precisely an Iterable that "has a length". Your only remaining problem is making length efficient. AFAIK it is already efficient in all cases except for List. To make List.length efficient just do as the others describe: Create an implementation of Seq that wraps a List and stores its length.
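A minimal sketch of such a wrapper, assuming the provider already knows the length. SizedList, SizedList.from and average are hypothetical names for this example; the only members a scala.collection.Seq needs are apply, length and iterator.

```scala
object SizedSeqSketch {
  // A Seq backed by a List, with the length stored instead of recomputed
  final class SizedList[A](underlying: List[A], val length: Int)
      extends scala.collection.Seq[A] {
    def apply(idx: Int): A = underlying(idx)           // still O(idx) on a List
    def iterator: Iterator[A] = underlying.iterator
  }

  object SizedList {
    // Pays the O(n) length computation once, at construction time.
    // A provider that already knows the length can call the constructor directly.
    def from[A](xs: List[A]): SizedList[A] = new SizedList(xs, xs.length)
  }

  // A "processor" that needs both traversal and the length
  def average(xs: scala.collection.Seq[Int]): Double =
    if (xs.isEmpty) 0.0 else xs.sum.toDouble / xs.length
}
```

Nothing checks that the stored length matches the list, so the wrapper is only as trustworthy as the code that constructs it.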