Scala - encapsulating data in objects

Motivations
This question is about working with Lists of data in Scala, and about choosing between tuples and class objects for holding data. Perhaps some of my assumptions are wrong, so here goes.
My current approach
As I understand it, tuples offer no elegant way of addressing their elements beyond the provided ._1, ._2, etc. I can use them, but the code becomes unpleasant wherever data is extracted far from the lines that defined it.
Also, as I understand it, a Scala Map declares a single type for its values, so the values cannot have different types unless they share a common supertype. (On the latter point, introducing a type hierarchy just to get "diverse" Map values seems artificial unless a class hierarchy fits the model to begin with.)
So, when I need to have lists where each element contains two or more named data entities, e.g. as below one of type String and one of type List, each accessible through an intelligible name, I resort to:
case class Foo (name1: String, name2: List[String])
val foos: List[Foo] = ...
Then I can later access instances of the list using .name1 and .name2.
Shortcomings and problems I see here
When the list is very large, should I assume this is less performant or more memory-consuming than using a tuple as the List's element type? Alternatively, is there a different elegant way of getting struct semantics in Scala?

In terms of performance, I don't think there is going to be any difference between a tuple and an instance of a case class. In fact, a tuple is an instance of a case class.
Secondly, if you're looking for another, more readable way to get the data out of the tuple, I suggest you consider pattern matching:
val (name1, name2) = ("first", List("second", "third"))
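To make the two options concrete, here is a minimal sketch. The Foo definition is taken from the question; the FooDemo object and the sample values are made up for illustration. It shows named field access on a case class next to pattern matching on tuples and on case classes inside a List.

```scala
// Case class from the question: two named fields.
final case class Foo(name1: String, name2: List[String])

object FooDemo {
  // Hypothetical sample data, once as case classes and once as tuples.
  val foos: List[Foo] = List(
    Foo("a", List("x", "y")),
    Foo("b", List("z"))
  )
  val pairs: List[(String, List[String])] = List(
    ("a", List("x", "y")),
    ("b", List("z"))
  )

  // Named access on case class instances.
  val namesFromFoos: List[String] = foos.map(_.name1)

  // Pattern matching gives tuple elements local names, avoiding ._1 / ._2.
  val sizesFromPairs: List[Int] = pairs.map { case (_, items) => items.size }

  // Case classes can be destructured the same way.
  val sizesFromFoos: List[Int] = foos.map { case Foo(_, items) => items.size }
}
```

Pattern matching helps at the point of use, but the names live only in that local scope; the case class carries its field names everywhere the value travels.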

Related

Convert tuple to array in Scala

What is the best way to convert a tuple into an array in Scala? Here "best" means in as few lines of code as possible. I was shocked to search Google and StackOverflow only to find nothing on this topic, which seems like it should be trivial and common. Lists have a toArray function; why don't tuples?
Use productIterator, immediately followed by toArray:
(42, 3.14, "hello", true).productIterator.toArray
gives:
res0: Array[Any] = Array(42, 3.14, hello, true)
The type of the result shows the main reason why it's rarely used: in tuples, the types of the elements can be heterogeneous, in arrays they must be homogeneous, so that often too much type information is lost during this conversion. If you want to do this, then you probably shouldn't have stored your information in tuples in the first place.
There is simply almost nothing you can (safely) do with an Array[Any], except printing it out, or converting it to an even more degenerate Set[Any]. Instead you could use:
Lists of case classes belonging to a common sealed trait,
shapeless HLists,
a carefully chosen base class with a bit of inheritance,
or something that at least keeps some kind of schema at runtime (like Apache Spark Datasets)
they would all be better alternatives.
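As a rough sketch of the first alternative, the values from the tuple above could be wrapped in case classes that share a sealed trait. The names Field, IntField, DoubleField, TextField, FlagField and FieldDemo are hypothetical and chosen only for this illustration.

```scala
// Hypothetical schema: every element is one of a closed set of cases.
sealed trait Field
final case class IntField(value: Int) extends Field
final case class DoubleField(value: Double) extends Field
final case class TextField(value: String) extends Field
final case class FlagField(value: Boolean) extends Field

object FieldDemo {
  // A homogeneous List[Field] instead of a heterogeneous tuple or an Array[Any].
  val fields: List[Field] =
    List(IntField(42), DoubleField(3.14), TextField("hello"), FlagField(true))

  // Because Field is sealed, the compiler can check that this match is exhaustive.
  def render(f: Field): String = f match {
    case IntField(i)    => s"int:$i"
    case DoubleField(d) => s"double:$d"
    case TextField(s)   => s"text:$s"
    case FlagField(b)   => s"flag:$b"
  }

  val rendered: List[String] = fields.map(render)
}
```

Unlike Array[Any], the element type here still tells you which operations are safe on each value.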
In the somewhat less likely case that the elements of the "tuples" you are processing frequently turn out to have an informative least upper bound type, it might be because you aren't working with plain tuples, but with some kind of traversable data structure that restricts the number of substructures in its nodes. In this case, you should consider implementing something like a Traverse interface for the structure, instead of handling the "tuples" manually.

Is it possible for a structure made of immutable types to have a cycle?

Consider the following:
case class Node(var left: Option[Node], var right: Option[Node])
It's easy to see how you could traverse this, search it, whatever. But now imagine you did this:
val root = Node(None, None)
root.left = Some(root)
Now, this is bad, catastrophic. In fact, if you type it into a REPL, you'll get a StackOverflow (hey, that would be a good name for a band!) and a stack trace a thousand lines long. If you want to try it, do this:
{ root.left = Some(root) }: Unit
to suppress the REPL's well-intentioned attempt to print out the result.
But to construct that, I had to specifically give the case class mutable members, something I would never do in real life. If I use ordinary immutable members, I get a problem with construction. The closest I can come is
case class Node(left: Option[Node], right: Option[Node])
val root: Node = Node(Some(root), None)
Then root has the rather ugly value Node(Some(null),None), but it's still not cyclic.
So my question is, if a data-structure is transitively immutable (that is, all of its members are either immutable values or references to other data-structures that are themselves transitively immutable), is it guaranteed to be acyclic?
It would be cool if it were.
Yes, it is possible to create cyclic data structures even with purely immutable data structures in a pure, referentially transparent, effect-free language.
The "obvious" solution is to pull out the potentially cyclic references into a separate data structure. For example, if you represent a graph as an adjacency matrix, then you don't need cycles in your data structure to represent cycles in your graph. But that's cheating: every problem can be solved by adding a layer of indirection (except the problem of having too many layers of indirection).
Another cheat would be to circumvent Scala's immutability guarantees from the outside, e.g. on the default Scala-JVM implementation by using Java reflection methods.
It is possible to create actual cyclic references. The technique is called Tying the Knot, and it relies on laziness: you can set the reference to an object that you haven't created yet, because the reference is evaluated lazily, by which time the object will have been created. Scala supports laziness in various forms: lazy vals, by-name parameters, and the now deprecated DelayedInit. You can also "fake" laziness using functions or methods: wrap the thing you want to make lazy in a function or method that produces it, and it won't be created until you call the function or method.
So, the same techniques should be possible in Scala as well.
How about using lazy with call-by-name?
scala> class Node(l: => Node, r: => Node, v: Int)
// defined class Node
scala> lazy val root: Node = new Node(root, root, 5)
// root: Node = <lazy>
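A slightly fuller sketch of the same idea follows. The fields left, right and value and the KnotDemo object are additions for illustration; they are not part of the snippet above. By-name parameters delay evaluation of the child references, and lazy vals cache them on first access. Note that this has to be a plain class, since case class parameters cannot be by-name.

```scala
// By-name parameters defer evaluation; lazy vals evaluate once and cache the result.
final class Node(l: => Node, r: => Node, val value: Int) {
  lazy val left: Node = l
  lazy val right: Node = r
}

object KnotDemo {
  // A node that is its own left and right child.
  lazy val root: Node = new Node(root, root, 5)

  // Two nodes that refer to each other.
  lazy val a: Node = new Node(b, b, 1)
  lazy val b: Node = new Node(a, a, 2)
}
```

Here KnotDemo.root.left is the same object as KnotDemo.root, and following left twice from a leads back to a, so the structure is cyclic even though no field is ever reassigned by user code.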

Why does class Set not exist in Scala?

There are many classes that inherit the trait Set:
HashSet, TreeSet, etc.
There is also the object Set (could I call it the companion object of trait Set, even though there is no class Set?) alongside the trait Set.
It seems to me that adding one more class, "Set", to this list would make the structure much easier to understand.
Is there any reason a class Set should not exist?
If you just need a set, use Set.apply and you will have a valid set that supports all the important operations. You don't need to worry about how it is implemented. It is designed to work well for most use cases.
On the other hand, if performance of certain operations matters for you, create a concrete class for concrete implementation of set, and you will know exactly what you have.
In Java you would write:
Set<String> strings = new HashSet<>(Arrays.asList("a", "b"));
In Scala you could spell out the same types:
val strings: Set[String] = HashSet("a", "b")
but you can also use a handy factory if you don't need to worry about the concrete type, and simply write
val strings = Set("a", "b")
Nothing is wrong with this, and I don't see how adding another class would help at all. It is normal to have an interface/trait and concrete implementations; nothing in the middle is needed or helpful.
Set.apply is a factory for sets. You can check the actual class of the resulting object using getClass. This factory creates special, optimized sets for sizes 0-4, for example
scala.collection.immutable.Set$EmptySet$
scala.collection.immutable.Set$Set1
scala.collection.immutable.Set$Set2
for bigger sets it is a hash set, namely scala.collection.immutable.HashSet$HashTrieSet.
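The following sketch shows one way to observe this yourself. The SetFactoryDemo object is made up for illustration, and the exact class names printed depend on the Scala version, so the example only compares them rather than relying on a specific name for the larger set.

```scala
import scala.collection.immutable.HashSet

object SetFactoryDemo {
  // Built through the Set factory: the concrete class is chosen for us.
  val small: Set[Int] = Set(1, 2, 3)
  val large: Set[Int] = (1 to 10).toSet

  // Built through a concrete implementation, but still usable as a Set.
  val explicit: Set[Int] = HashSet(1, 2, 3)

  val smallClass: String = small.getClass.getName
  val largeClass: String = large.getClass.getName

  def main(args: Array[String]): Unit = {
    println(smallClass) // a small, size-specialised set class
    println(largeClass) // a hash-based set class; the name varies by version
  }
}
```

All three values have the static type Set[Int], and small == explicit holds regardless of which class backs each one, which is the point of programming against the trait.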
In Scala there is no overlap between classes and traits. Classes are implementations that can be instantiated, while traits are independently mixable interfaces. The use of Set.apply gives an object with interface Set and that is all you need to know to use it. I fully understand wanting a concrete type, but that would be unnecessary. The right thing to do here is save it to a val of type Set and use only the interface Set provides.
I know that may not be satisfying, but give it time and the Scala type system will make sense in terms of itself, even if that is different than what Java does.

Should Scala immutable case classes be defined to hold Seq[T], immutable.Seq[T], List[T] or Vector[T]?

If we want to define a case class that holds a single object, say a tuple, we can do it easily:
sealed case class A(x: (Int, Int))
In this case, retrieving the "x" value will take a small constant amount of time, and this class will only take a small constant amount of space, regardless of how it was created.
Now, let's assume we want to hold a sequence of values instead; we could do it like this:
sealed final case class A(x: Seq[Int])
This might seem to work as before, except that now storage and time to read all of x is proportional to x.length.
However, this is not actually the case, because someone could do something like this:
val hugeList = (1 to 1000000000).toList
val a = A(hugeList.view.filter(_ == 500000000))
In this case, the a object looks like an innocent case class holding a single int in a sequence, but in fact it requires gigabytes of memory, and it will take on the order of seconds to access that single element every time.
This could be fixed by specifying something like List[T] as the type instead of Seq[T]; however, this seems ugly since it adds a reference to a specific implementation, while in fact other well behaved implementations, like Vector[T], would also do.
Another worrying issue is that one could pass a mutable Seq[T], so it seems that one should at least use immutable.Seq instead of scala.collection.Seq (although the compiler can't actually enforce the immutability at the moment).
Looking at most libraries it seems that the common pattern is to use scala.collection.Seq[T], but is this really a good idea?
Or perhaps Seq is being used just because it's the shortest to type, and in fact it would be best to use immutable.Seq[T], List[T], Vector[T] or something else?
New text added in edit
Looking at the class library, some of the most core functionality like scala.reflect.api.Trees does in fact use List[T], and in general using a concrete class seems a good idea.
But then, why use List and not Vector?
Vector has O(1)/O(log(n)) length, prepend, append and random access, is asymptotically smaller (List is ~3-4 times bigger due to vtable and next pointers), and supports cache efficient and parallelized computation, while List has none of those properties except O(1) prepend.
So, personally I'm leaning towards Vector[T] being the correct choice for something exposed in a library data structure, where one doesn't know what operations the library user will need, despite the fact that it seems less popular.
First of all, you talk both about space and time requirements. In terms of space, your object will always be as large as the collection. It doesn't matter whether you wrap a mutable or immutable collection, that collection for obvious reasons needs to be in memory, and the case class wrapping it doesn't take any additional space (except its own small object reference). So if your collection takes "gigabytes of memory", that's a problem of your collection, not whether you wrap it in a case class or not.
You then go on to argue that a problem arises when using views instead of eager collections. But again, the question is what the problem actually is. You use the example of lazily filtering a collection. In general, running a filter is an O(n) operation, just as if you were iterating over the original list. In that example it would be O(1) for successive calls if the collection had been made strict. But that is a problem at the call site of your case class, not in the definition of your case class.
The only valid point I see is with respect to mutable collections. Given the defining semantics of case classes, you should really only use effectively immutable objects as arguments, so either pure immutable collections or collections to which no instance has any more write access.
There is a design error in Scala in that scala.Seq is not aliased to collection.immutable.Seq but a general seq which can be either mutable or immutable. I advise against any use of unqualified Seq. It is really wrong and should be rectified in the Scala standard library. Use collection.immutable.Seq instead, or if the collection doesn't need to be ordered, collection.immutable.Traversable.
So I agree with your suspicion:
Looking at most libraries it seems that the common pattern is to use scala.collection.Seq[T], but is this really a good idea?
No! Not good. It might be convenient, because you can pass in an Array for example without explicit conversion, but I think a cleaner design is to require immutability.
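A minimal sketch of what requiring immutability looks like in practice; the names Samples and SamplesDemo are hypothetical. The field is typed as scala.collection.immutable.Seq, so a mutable buffer has to be converted explicitly before it can be stored.

```scala
import scala.collection.immutable
import scala.collection.mutable.ArrayBuffer

// The field type only admits immutable sequences (List, Vector, ...).
final case class Samples(values: immutable.Seq[Int])

object SamplesDemo {
  val buffer: ArrayBuffer[Int] = ArrayBuffer(1, 2, 3)

  // Samples(buffer) is rejected by the compiler: ArrayBuffer is not an immutable.Seq.
  // An explicit conversion takes a snapshot instead.
  val samples: Samples = Samples(buffer.toVector)

  // Later changes to the buffer do not leak into the case class.
  buffer += 4

  val unchanged: Boolean = samples.values == Vector(1, 2, 3)
}
```

The cost is one explicit conversion at the call site; in exchange, equality and hashCode of the case class cannot change behind its back.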

Scala immutable map, when to go mutable?

My present use case is pretty trivial, either mutable or immutable Map will do the trick.
I have a method that takes an immutable Map and then calls a 3rd party API method that takes an immutable Map as well:
def doFoo(foo: String = "default", params: Map[String, Any] = Map()): Unit = {
  // someCondition and api are defined elsewhere
  val newMap =
    if (someCondition) params + ("foo" -> foo) else params
  api.doSomething(newMap)
}
The Map in question will generally be quite small, at most there might be an embedded List of case class instances, a few thousand entries max. So, again, assume little impact in going immutable in this case (i.e. having essentially 2 instances of the Map via the newMap val copy).
Still, it nags me a bit, copying the map just to get a new map with a few k->v entries tacked onto it.
I could go mutable and params.put("bar", bar), etc. for the entries I want to tack on, and then params.toMap to convert to immutable for the API call; that is an option. But then I have to import and pass around mutable maps, which is a bit of a hassle compared to going with Scala's default immutable Map.
So, what are the general guidelines for when it is justified/good practice to use mutable Map over immutable Maps?
Thanks
EDIT
So, it appears that an add operation on an immutable map takes near constant time, confirming @dhg's and @Nicolas's assertion that a full copy is not made, which solves the problem for the concrete case presented.
Depending on the immutable Map implementation, adding a few entries may not actually copy the entire original Map. This is one of the advantages to the immutable data structure approach: Scala will try to get away with copying as little as possible.
This kind of behavior is easiest to see with a List. If I have a val a = List(1,2,3), then that list is stored in memory. However, if I prepend an additional element like val b = 0 :: a, I do get a new 4-element List back, but Scala did not copy the original list a. Instead, we just created one new link, called it b, and gave it a pointer to the existing List a.
You can envision strategies like this for other kinds of collections as well. For example, if I add one element to a Map, the collection could simply wrap the existing map, falling back to it when needed, all while providing an API as if it were a single Map.
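The List case can be checked directly with reference equality. In this sketch the SharingDemo object and the sample Map entries are made up for illustration; it also shows that adding an entry to an immutable Map leaves the original untouched, without claiming anything about how a particular Map implementation shares structure internally.

```scala
object SharingDemo {
  val a: List[Int] = List(1, 2, 3)
  val b: List[Int] = 0 :: a

  // b's tail is the very same object as a: prepending allocated one new cell only.
  val sharesTail: Boolean = b.tail eq a

  // Adding to an immutable Map returns a new Map; the original is unchanged.
  val params: Map[String, Any] = Map("x" -> 1)
  val extended: Map[String, Any] = params + ("foo" -> "default")
}
```

Because a can never change, it is safe for b to point at it instead of copying it.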
Using a mutable object is not bad in itself, it becomes bad in a functional programming environment, where you try to avoid side-effects by keeping functions pure and objects immutable.
However, if you create a mutable object inside a function and modify this object, the function is still pure if you don't release a reference to this object outside the function. It is acceptable to have code like:
def buildVector(x: Double, y: Double, z: Double): Vector[Double] = {
  val ary = Array.ofDim[Double](3)
  ary(0) = x
  ary(1) = y
  ary(2) = z
  ary.toVector
}
Now, I think this approach is useful/recommended in two cases: (1) Performance, if creating and modifying an immutable object is a bottleneck of your whole application; (2) Code readability, because sometimes it's easier to modify a complex object in place (rather than resorting to lenses, zippers, etc.)
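Applied to the Map from the question, the same pattern might look like the sketch below. The function buildParams and the object LocalMutationDemo are hypothetical names; the standard library's Map.newBuilder provides the locally mutable accumulator, and only the finished immutable Map leaves the function.

```scala
object LocalMutationDemo {
  // Mutation is confined to the builder, which never escapes this function.
  def buildParams(base: Map[String, Any], extras: (String, Any)*): Map[String, Any] = {
    val builder = Map.newBuilder[String, Any]
    builder ++= base
    extras.foreach(builder += _)
    builder.result()
  }

  val result: Map[String, Any] =
    buildParams(Map("a" -> 1), "foo" -> "default", "bar" -> 2)
}
```

Callers see a pure function from immutable inputs to an immutable output, so no mutable Map type needs to be imported or passed around.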
In addition to dhg's answer, you can take a look at the performance characteristics of the Scala collections. If an add/remove operation doesn't take linear time, it must be doing something other than simply copying the entire structure. (Note that the converse is not true: an operation taking linear time does not mean that it copies the whole structure.)
I like to use collection.Map as the declared parameter type (for inputs or return values) rather than mutable or immutable maps. The collection.Map traits are immutable (read-only) interfaces that work for both kinds of implementation. A consumer method using a map really doesn't need to know about the map's implementation or how it was constructed. (It's really none of its business anyway.)
If you go with the approach of hiding a map's particular construction (be it mutable or immutable) from the consumers who use it then you're still getting an essentially immutable map downstream. And by using collection.Map as an immutable interface you completely remove all the ".toMap" inefficiency that you would have with consumers written to use immutable.Map typed objects. Having to convert a completely constructed map into another one simply to comply to an interface not supported by the first one really is absolutely unnecessary overhead when you think about it.
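A small sketch of that style, with the hypothetical names describe and MapParamDemo: the consumer is written against scala.collection.Map, so it accepts immutable and mutable maps alike without a .toMap conversion.

```scala
import scala.collection.mutable

object MapParamDemo {
  // The parameter type is the general, read-only Map interface.
  def describe(params: scala.collection.Map[String, Any]): String =
    params.toSeq.map { case (k, v) => s"$k=$v" }.sorted.mkString(", ")

  // Both kinds of map can be passed directly.
  val fromImmutable: String = describe(Map("a" -> 1, "b" -> "two"))
  val fromMutable: String = describe(mutable.Map("a" -> 1, "b" -> "two"))
}
```

One caveat with this design: collection.Map only promises that the consumer cannot modify the map through that reference. If the caller keeps a mutable map and changes it later, the consumer can observe the change.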
I suspect in a few years from now we'll look back at the three separate sets of interfaces (mutable maps, immutable maps, and collections maps) and realize that 99% of the time only 2 are really needed (mutable and collections) and that using the (unfortunately) default immutable map interface really adds a lot of unnecessary overhead for the "Scalable Language".