How to access complex data structures in Scala while preserving immutability?

Calling expert Scala developers! Let's say you have a large object representing a writable data store. Are you comfortable with this common Java-like approach:
val complexModel = new ComplexModel()
complexModel.modify()
complexModel.access(...)
Or do you prefer:
val newComplexModel = complexModel.withADifference
newComplexModel.access(...)
If you prefer that, and you have a client accessing the model, how is the client going
to know to point to newComplexModel rather than complexModel? From the user's perspective
you have a mutable data store. How do you reconcile that perspective with Scala's emphasis
on immutability?
How about this:
var complexModel = new ComplexModel()
complexModel = complexModel.withADifference
complexModel.access(...)
This seems a bit like the first approach, except that the code inside withADifference will have to do more work than the code inside modify(), because it has to create a whole new complex data object rather than modifying the existing one. (Do you run into this problem of having to do more work when trying to preserve immutability?) Also, you now have a var with a large scope.
How would you decide on the best strategy? Are there exceptions to the strategy you would choose?
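For concreteness, here is a minimal sketch (with a hypothetical two-field ComplexModel) of the copy-based style the question describes. Note that a case-class copy shares its unchanged fields with the original, so "a whole new object" is usually a cheap shallow copy rather than a deep clone:

// Hypothetical immutable model: copy() reuses the unchanged fields,
// so withADifference allocates one small wrapper, not a deep clone.
final case class ComplexModel(name: String, entries: Vector[Int]) {
  def withADifference: ComplexModel = copy(entries = entries :+ 42)
  def access(i: Int): Int = entries(i)
}

val m0 = ComplexModel("store", Vector(1, 2, 3))
val m1 = m0.withADifference   // m0 is untouched; m1 shares m0's name and most of its entries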

I think the functional way is to have a Stream containing all the different versions of your data structure, with the consumer pulling the next element from that stream.
But I think in Scala it is an absolutely valid approach to keep a mutable reference in one central place and change that, while the whole data structure stays immutable.
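A minimal sketch of that central-reference approach (the names are mine; AtomicReference makes the swap safe if several threads touch the store):

import java.util.concurrent.atomic.AtomicReference

// The model is an ordinary immutable value...
final case class Model(entries: Vector[Int]) {
  def withADifference: Model = copy(entries = entries :+ 42)
}

// ...and the only mutable thing is one reference in one central place.
object ModelStore {
  private val current = new AtomicReference(Model(Vector.empty))
  def get: Model = current.get()   // clients always read a consistent snapshot
  def update(): Model = current.updateAndGet(_.withADifference)
}

Clients that hold on to the result of ModelStore.get keep a stable snapshot; clients that want the latest version simply call get again.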
When the data structure becomes more complex, you might be interested in this question: Cleaner way to update nested structures, which asks (and gets answered) how to create updated versions of a non-trivial immutable data structure.

From a method name like modify alone, it is easy to identify your ComplexModel as a mutator object, i.e. one that changes some state. That alone implies this kind of object has nothing to do with functional programming, and trying to make it immutable just because someone with questionable knowledge told you that everything in Scala should be immutable would simply be a mistake.
Now, you could modify your API so that this ComplexModel operates on immutable data, and I think you should, but you definitely must not try to make the ComplexModel itself immutable.
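A sketch of that split (hypothetical names): the ComplexModel stays a mutator object, but every value it hands out is immutable, so callers can never corrupt its state:

// The object mutates, the data does not: modify() swaps in a new
// immutable Map instead of changing one in place.
final class ComplexModel {
  private var data: Map[String, Int] = Map.empty

  def modify(key: String, value: Int): Unit =
    data = data + (key -> value)

  def access(key: String): Option[Int] = data.get(key)
  def snapshot: Map[String, Int] = data   // safe to share: the Map itself can't change
}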

The canonical answer to your question is to use a Zipper; there is an SO question about it.
The only implementation for Scala I know of is in Scalaz.
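For illustration, a minimal hand-rolled list zipper (a sketch, not the Scalaz implementation): a focus element plus the elements to its left (stored reversed) and right, so moving the focus and updating it are both O(1):

// left is kept in reverse order so that both moves are constant-time.
final case class ListZipper[A](left: List[A], focus: A, right: List[A]) {
  def moveLeft: Option[ListZipper[A]] = left match {
    case l :: ls => Some(ListZipper(ls, l, focus :: right))
    case Nil     => None
  }
  def moveRight: Option[ListZipper[A]] = right match {
    case r :: rs => Some(ListZipper(focus :: left, r, rs))
    case Nil     => None
  }
  def set(a: A): ListZipper[A] = copy(focus = a)   // O(1) "update in the middle"
}

val z  = ListZipper(Nil, 1, List(2, 3))
val z2 = z.moveRight.map(_.set(99))   // Some(ListZipper(List(1), 99, List(3)))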

Immutability is merely a useful tool, not dogma. Situations will arise where the cost and inconvenience of immutability outweigh its usefulness.
The size of a ComplexModel may make it so that creating a modified copy is sufficiently expensive in terms of memory and/or CPU that a mutable model is more practical.

Related

Scala advantages of Seq.newBuilder over Seq vars

Currently in my application, I'm using var fooSeq: Seq[Foo] = Seq.empty and then using :+ to append items. I understand that this could lead to multithreading issues and potential race conditions, but so far I have not had any issues.
I recently discovered Seq.newBuilder() and it seems this might be the preferred way to build Scala sequences. I'm wondering whether its performance benefit over using vars is significant, and what other benefits it may bring.
In general, if you are concerned with thread-safety then a common approach is to use Java's AtomicReference to wrap your mutable variable like so:
import java.util.concurrent.atomic.AtomicReference

val fooSeq: AtomicReference[Seq[Foo]] = new AtomicReference(Seq.empty)
and that would be the better approach if you need intermediate results rather than going with the Builder.
If you don't need intermediate results, then Builders are generally better. (Though, as Luis Miguel mentions in a comment, Builders are internally mutable and not necessarily thread-safe.)
A third alternative is to use a mutable data structure from Scala's collections: https://docs.scala-lang.org/overviews/collections/performance-characteristics.html
You might be interested in MutableList; however, this would still need the AtomicReference wrapping for thread-safety if that is a concern. Some data structures are natively thread-safe, like TrieMap, which lives in scala.collection.concurrent.
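A sketch contrasting the two approaches (Foo is a stand-in type):

import java.util.concurrent.atomic.AtomicReference

final case class Foo(n: Int)

// 1. Builder: the fastest way to assemble a Seq once, on a single thread.
def collect(items: Iterator[Foo]): Seq[Foo] = {
  val b = Seq.newBuilder[Foo]
  items.foreach(b += _)
  b.result()               // no intermediate Seqs were ever created
}

// 2. AtomicReference: thread-safe appends, and every intermediate Seq is usable.
val fooSeq = new AtomicReference[Seq[Foo]](Seq.empty)
def append(f: Foo): Seq[Foo] = fooSeq.updateAndGet(_ :+ f)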

Optimizing lazy collections

This question is about optimizing lazy collections. I will first explain the problem and then give some thoughts for a possible solution. Questions are in bold.
Problem
Swift expects operations on Collections to be O(1). Some operations, especially prefix- and suffix-like ones, deviate and are on the order of O(n) or higher.
Lazy collections can't iterate through the base collection during initialization since computation should be deferred for as long as possible until the value is actually needed.
So, how can we optimize lazy collections? And of course this raises the question: what constitutes an optimized lazy collection?
Thoughts
The most obvious solution is caching. This means that the first call to a collection's method has an unfavourable time complexity, but subsequent calls to the same or other methods can possibly be computed in O(1). We trade space complexity on the order of O(n) for faster computation.
Attempting to optimize lazy collections on structs by using caching is impossible, since subscript(_ position:) and all the other methods you'd need to implement to conform to LazyCollectionProtocol are non-mutating and structs are immutable by default. This means we have to recompute all operations for every call to a property or method.
This leaves us with classes. Classes are mutable, meaning that all computed properties and methods can internally mutate state. When we use classes to optimize a lazy collection, we have two options. First, if the properties of the lazy type are variables, then we're bringing ourselves into a world of hurt: changing a property could invalidate previously cached results, and I can imagine that managing the code paths needed to support mutable properties would be headache-inducing. Second, if we use lets, we're fine; the state set during initialization can't be changed, so a cached result never needs to be updated. Note that we're only talking about lazy collections with pure methods and no side effects here.
But classes are reference types. What are the downsides of using reference types for lazy collections? The Swift standard library doesn't use them for starters.
Any thoughts, or ideas about different approaches?
I completely agree with Alexander here. If you're storing lazy collections, you're generally doing something wrong, and the cost of repeated accesses is going to constantly surprise you.
These collections already blow up their complexity requirements, it's true:
Note: The performance of accessing startIndex, first, or any methods that depend on startIndex depends on how many elements satisfy the predicate at the start of the collection, and may not offer the usual performance given by the Collection protocol. Be aware, therefore, that general operations on LazyDropWhileCollection instances may not have the documented complexity.
But caching won't fix that. They'll still be O(n) on the first access, so a loop like
for i in 0..<xs.count { print(xs[i]) }
is still O(n^2). Also remember that O(1) and "fast" are not the same thing. It feels like you're trying to get to "fast" but that doesn't fix the complexity promise (that said, lazy structures are already breaking their complexity promises in Swift).
Caching is a net-negative because it makes the normal (and expected) use of lazy data structures slower. The normal way to use lazy data structures is to consume them either zero or one times. If you were going to consume them more than one time, you should use a strict data structure. Caching something that you never use is a waste of time and space.
There are certainly conceivable use cases where you have a large data structure that will be sparsely accessed multiple times, and so caching would be useful, but this isn't the use case lazy was built to handle.
Attempting to optimize lazy collections on structs by using caching is impossible since subscript(_ position:) and all other methods that you'd need to implement to conform to LazyCollectionProtocol are non-mutating and structs are immutable by default. This means that we have to recompute all operations for every call to a property or method.
This isn't true. A struct can internally store a reference type to hold its cache and this is common. Strings do exactly this. They include a StringBuffer which is a reference type (for reasons related to a Swift compiler bug, StringBuffer is actually implemented as a struct that wraps a class, but conceptually it is a reference type). Lots of value types in Swift store internal buffer classes this way, which allows them to be internally mutable while presenting an immutable interface. (It's also important for CoW and lots of other performance and memory related reasons.)
Note that adding caching today would also break existing use cases of lazy:
struct Massive {
let id: Int
// Lots of data, but rarely needed.
}
// We have lots of items that we look at occasionally
let ids = 0..<10_000_000
// `massives` is lazy. When we ask for an element it is created, and when
// we're done with it, it's thrown away. If `lazy` forced caching, everything
// we accessed would be kept alive forever. Also, if the values in `Massive`
// change over time, I may well want an element rebuilt at that point rather
// than cached.
let massives = ids.lazy.map(Massive.init)
let aMassive = massives[10]
This isn't to say a caching data structure wouldn't be useful in some cases, but it certainly isn't always a win. It imposes a lot of costs and breaks some uses while helping others. So if you want those other use cases, you should build a data structure that provides them. But it's reasonable that lazy is not that tool.
Swift's lazy collections are intended to provide one-off access to elements. Subsequent accesses cause redundant computation (e.g. a lazy map sequence would recompute its transform closure).
In the case where you want repeated access to elements, it's best to just slice the portion of the lazy sequence/collection you care about, and create a proper Collection (e.g. an Array) out of it.
The bookkeeping overhead of lazily evaluating and caching each element would probably be greater than the benefits.
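For what it's worth, Scala (the language most of this page is about) exposes both behaviours side by side, which makes the trade-off easy to see: views recompute on every traversal, while LazyList memoizes each element once forced, and therefore retains it:

val view = (1 to 3).view.map { i => println(s"computing $i"); i * 2 }
view.foreach(_ => ())   // prints "computing 1" through "computing 3"
view.foreach(_ => ())   // prints them all again: views never cache

val ll = LazyList.from(1).map { i => println(s"computing $i"); i * 2 }
ll.take(3).toList       // prints "computing 1" through "computing 3"
ll.take(3).toList       // silent: memoized, at the cost of holding the elements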

should entity field be mutable or immutable

In a Scala project, should entity fields be mutable or immutable?
Mutable field:
It is very easy to change a field in a nested entity; also, when logic is pushed into the entity, it is very easy to implement.
Immutable field:
It guarantees consistency while a single system is running, but data may still become inconsistent if more than one system is running. Also, if entity fields are immutable, updating nested fields takes a lot of boilerplate, which means some concept like lenses has to be introduced.
What should I choose when starting up a Scala project?
Always favor immutability. Definitely in Scala, and probably in every other language too.
It's hard to give a more specific answer without a more specific question. But immutability is almost always a safe answer.
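To make the boilerplate point in the question concrete, here is what a nested update looks like with plain case-class copies (Person/Address/Street are hypothetical types); every enclosing level must be copied by hand, which is exactly what lens libraries such as Monocle abstract away:

final case class Street(name: String, number: Int)
final case class Address(street: Street, city: String)
final case class Person(name: String, address: Address)

val p = Person("Ada", Address(Street("High St", 1), "London"))

// One innermost field changes, three copy() calls are needed:
val moved = p.copy(
  address = p.address.copy(
    street = p.address.street.copy(number = 2)))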

Best way to create a List container object in Scala

Here's the scenario. I am creating a simple session handler in Scala, and I need a class that can store lists, along with the other operations it needs to function properly.
I will be accessing sessions by a session ID
I will rarely be traversing the list
I will be constantly adding and removing from the list
My questions:
What is the proper Scala object to use for this situation?
What is the best way to add or remove an entity from said Scala object?
I am fairly new to Scala so please forgive the elementary question I might be asking. Any assistance would be most appreciated.
Edit: To add to it all... thread safety is a factor. The object used must be thread-safe, or it must be easy to provide thread safety when adding and removing items by session ID.
You can use java.util.concurrent.ConcurrentHashMap - it has the best performance with guaranteed thread safety.
You can use the immutable implementation of HashSet, whose add and remove operations take effectively constant time.
Since this collection is immutable, you'll need to learn the "scala way" of working with collections: how to deal with state, and so on. You may need to change the way you work with collections, but this way you won't need to worry about concurrency.
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   // use the companion apply; List has no public constructor
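Putting the ConcurrentHashMap suggestion together with immutable lists, a sketch of such a session store (Session and the stored event type are stand-ins); compute() runs atomically per key, so concurrent adds and removes by session ID don't race:

import java.util.concurrent.ConcurrentHashMap

final case class Session(id: String, events: List[String])

val sessions = new ConcurrentHashMap[String, Session]()

def addEvent(id: String, event: String): Unit =
  sessions.compute(id, (key: String, existing: Session) =>
    if (existing == null) Session(key, List(event))         // first event for this session
    else existing.copy(events = event :: existing.events))  // prepend is O(1) on List

def remove(id: String): Unit = sessions.remove(id)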

NSDictionaries vs. custom objects with properties, what's your take?

I'm writing an App that basically uses 5 business entities: A, B, C, D and E
A has some properties and holds a list of B's
B has some other properties and a list of C's and a list of D's
C has some other properties and a list of D's and a list of E's
D has only a few properties
E has only a few properties
There is no inheritance between any of them.
There's no real business logic involved, the objects are created, populated, and then accessed read-only, no further manipulations.
My natural coding style would be to go object oriented and write classes for each of those entities, use NSArrays for the lists, and have the mentioned properties synthesized.
It would make the code readable.
But another approach seems obvious too: only use NSDictionaries and NSArrays, and work with keys/values instead of properties. This seems more efficient, and somehow "closer" to iPhone-style programming to me... but it obviously leads to less readable code. Another advantage: no additional custom encoding/decoding is required for serialization (persisting state to disk, using JSON, ...).
So on the paper, it speaks for the latter approach, on the other hand, it still feels somehow awkward NOT to use custom objects...
Is this really just a matter of taste? Or are there other arguments in favour of or against one of the approaches? Is using only dictionaries better memory/performance-wise? Is it the preferred "Apple Coding Style"? (I'm coming from Java/C#.)
I don't see much difference between Java/C# and Cocoa in this area. Your question is equivalently applicable to those platforms as well (the same also applies to key-value stores and relational stores).
In an object-oriented environment, you have to make a trade-off between the flexibility of the key-value approach to storing data and the structured, object-oriented style. I'd go with the key-value approach only when I need the flexibility (e.g. the structure is dynamic, might change per user, or is not known at compile time). Otherwise, taking that route might lead you completely away from OOP conventions and benefits. (By the way, this is the important point: is the hassle of sticking to object-oriented principles worth it in your specific circumstance? I think your question reduces to that one, and to answer it you should analyze your specific situation.)
It largely depends on whether your objects are just collections of data (key/value pairs) or implement their own functionality.
If they're data I'd say go with NSDictionary, it's a lot less code and as you point out you won't have to write serialization routines for each class.
Use a hybrid approach. Store the dictionaries the objects are based on, but expose the most-used values as properties that are either filled when the object is initialized from a dictionary, or whose accessors look into the dictionary for values (less efficient).
Also provide a property to get at the dictionary. This way, if you need to quickly propagate a new value from the dictionary to a specific area of the code (presumably a new value added by the server), you have that flexibility. Then, if callers are making heavy use of a value, you can migrate it to be a true property and get a property's completion and type checking.
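Sketched in Scala for consistency with the rest of this page (the idea itself is language-neutral): keep the raw key-value data reachable, but promote the hot fields to typed accessors:

// Hybrid: the dictionary stays available, while the most-used values
// get real, typed, lazily computed properties.
final class Entity(val raw: Map[String, Any]) {
  lazy val id: Int      = raw("id").asInstanceOf[Int]
  lazy val name: String = raw.getOrElse("name", "").asInstanceOf[String]
  def get(key: String): Option[Any] = raw.get(key)   // everything else, on demand
}

If callers start hammering one of the get(...) keys, promoting it to another lazy val is a one-line change.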