Scala advantages of Seq.newBuilder over Seq vars

Currently in my application I'm using var fooSeq: Seq[Foo] = Seq.empty and then using :+ to append items. I understand that this could lead to multithreading issues and potential race conditions, but so far I have not had any problems.
I recently discovered Seq.newBuilder(), and it seems this might be the preferred way to build Scala sequences. I'm wondering whether it has a performance advantage over using vars, and what other benefits it may bring.

In general, if you are concerned with thread-safety then a common approach is to use Java's AtomicReference to wrap your mutable variable like so:
val fooSeq: AtomicReference[Seq[Foo]] = new AtomicReference(Seq.empty)
and that would be the better approach if you need intermediate results rather than going with the Builder.
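As a minimal sketch (assuming a simple Foo case class, Java 8+ for updateAndGet, and Scala 2.12+ for the lambda-to-SAM conversion), appends against that wrapper stay thread-safe because the update function is retried on contention:
import java.util.concurrent.atomic.AtomicReference

case class Foo(id: Int)

val fooSeq = new AtomicReference[Seq[Foo]](Seq.empty)

// Atomically append; updateAndGet retries the function if another thread won the race.
def append(foo: Foo): Seq[Foo] =
  fooSeq.updateAndGet(current => current :+ foo)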
If you don't need intermediate results then Builders are generally better. (Though as Luis Miguel mentions in a comment Builders are internally mutable and not necessarily thread-safe)
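For comparison, a sketch of the Builder route (again assuming a Foo type, with all appends happening on a single thread before result() is called):
val builder = Seq.newBuilder[Foo]
builder += Foo(1)                      // append one element
builder ++= Seq(Foo(2), Foo(3))        // append many
val foos: Seq[Foo] = builder.result()  // build once, then treat as immutable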
A third alternative is to use a mutable data structure from Scala's collections: https://docs.scala-lang.org/overviews/collections/performance-characteristics.html
You might be interested in MutableList; however, this would still need the AtomicReference wrapping for thread-safety if that is a concern. Some data structures, such as TrieMap, are natively thread-safe and are available in scala.collection.concurrent.
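For example, a quick sketch with one of those natively thread-safe structures (the map contents here are purely illustrative):
import scala.collection.concurrent.TrieMap

val counts = TrieMap.empty[String, Int]
counts.put("a", 1)          // safe to call from multiple threads
counts.putIfAbsent("b", 2)  // atomic insert-if-missing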

Related

Optimizing lazy collections

This question is about optimizing lazy collections. I will first explain the problem and then give some thoughts for a possible solution. Questions are in bold.
Problem
Swift expects operations on Collections to be O(1). Some operations, especially prefix- and suffix-like ones, deviate and are on the order of O(n) or higher.
Lazy collections can't iterate through the base collection during initialization since computation should be deferred for as long as possible until the value is actually needed.
So, how can we optimize lazy collections? And of course this raises the question: what constitutes an optimized lazy collection?
Thoughts
The most obvious solution is caching. This means that the first call to a collection's method has an unfavourable time complexity, but subsequent calls to the same or other methods can possibly be computed in O(1). We trade space complexity on the order of O(n) for faster computation.
Attempting to optimize lazy collections on structs by using caching is impossible, since subscript(_ position:) and all the other methods you'd need to implement to conform to LazyCollectionProtocol are non-mutating and structs are immutable by default. This means that we have to recompute all operations for every call to a property or method.
This leaves us with classes. Classes are mutable, meaning that all computed properties and methods can internally mutate state. When we use classes to optimize a lazy collection we have two options. First, if the properties of the lazy type are variables then we're bringing ourselves into a world of hurt. If we change a property it could potentially invalidate previously cached results. I can imagine managing the code paths to make properties mutable to be headache inducing. Second, if we use lets we're good; the state set during initialization can't be changed so a cached result doesn't need to be updated. Note that we're only talking about lazy collections with pure methods without side effects here.
But classes are reference types. What are the downsides of using reference types for lazy collections? The Swift standard library doesn't use them for starters.
Any thoughts, or ideas for different approaches?
I completely agree with Alexander here. If you're storing lazy collections, you're generally doing something wrong, and the cost of repeated accesses is going to constantly surprise you.
These collections already blow up their complexity requirements, it's true:
Note: The performance of accessing startIndex, first, or any methods that depend on startIndex depends on how many elements satisfy the predicate at the start of the collection, and may not offer the usual performance given by the Collection protocol. Be aware, therefore, that general operations on LazyDropWhileCollection instances may not have the documented complexity.
But caching won't fix that. They'll still be O(n) on the first access, so a loop like
for i in 0..<xs.count { print(xs[i]) }
is still O(n^2). Also remember that O(1) and "fast" are not the same thing. It feels like you're trying to get to "fast", but caching doesn't fix the complexity promise (that said, lazy structures are already breaking their complexity promises in Swift).
Caching is a net-negative because it makes the normal (and expected) use of lazy data structures slower. The normal way to use lazy data structures is to consume them either zero or one times. If you were going to consume them more than one time, you should use a strict data structure. Caching something that you never use is a waste of time and space.
There are certainly conceivable use cases where you have a large data structure that will be sparsely accessed multiple times, and so caching would be useful, but this isn't the use case lazy was built to handle.
Attempting to optimize lazy collections on structs by using caching is impossible, since subscript(_ position:) and all the other methods you'd need to implement to conform to LazyCollectionProtocol are non-mutating and structs are immutable by default. This means that we have to recompute all operations for every call to a property or method.
This isn't true. A struct can internally store a reference type to hold its cache and this is common. Strings do exactly this. They include a StringBuffer which is a reference type (for reasons related to a Swift compiler bug, StringBuffer is actually implemented as a struct that wraps a class, but conceptually it is a reference type). Lots of value types in Swift store internal buffer classes this way, which allows them to be internally mutable while presenting an immutable interface. (It's also important for CoW and lots of other performance and memory related reasons.)
Note that adding caching today would also break existing use cases of lazy:
struct Massive {
    let id: Int
    // Lots of data, but rarely needed.
}
// We have lots of items that we look at occasionally
let ids = 0..<10_000_000
// `massives` is lazy. When we ask for something it creates it, but when we're
// done with it, it's thrown away. If `lazy` forced caching, then everything
// we accessed would be kept forever. Also, if the values in `Massive` change over
// time, I certainly may want it to be rebuilt at this point and not cached.
let massives = ids.lazy.map(Massive.init)
let aMassive = massives[10]
This isn't to say a caching data structure wouldn't be useful in some cases, but it certainly isn't always a win. It imposes a lot of costs and breaks some uses while helping others. So if you want those other use cases, you should build a data structure that provides them. But it's reasonable that lazy is not that tool.
Swift's lazy collections are intended to provide one-off access to elements. Subsequent accesses cause redundant computation (e.g. a lazy map sequence would recompute the transform closure).
In the case where you want repeated access to elements, it's best to just slice the portion of the lazy sequence/collection you care about, and create a proper Collection (e.g. an Array) out of it.
The bookkeeping overhead of lazily evaluating and caching each element would probably be greater than the benefits.

Why does Breeze use Array to represent a matrix?

The class DenseMatrix has a parameter data of type Array[V]. Why not use some other mutable collection that can grow dynamically, such as Vector?
The comments (from Raphael Roth and Jasper-M) both make good points, and they're both part of the reason. Breeze uses netlib-java for its interface to native BLAS via JNI, and that interface uses arrays. (It could have been implemented in terms of Java Buffers, but it wasn't.) Dynamically resizing DenseMatrices doesn't make a lot of sense, and the implementations of DenseMatrix and DenseVector are deliberately similar.
Arrays also have far better performance characteristics than any other built-in collection-like thing in Java or Scala, and since Breeze cares about being fast, they're the best choice. All generic collections in both Scala and Java box primitive elements, which is totally unacceptable in performance-sensitive environments. (I could have rolled my own specialized ArrayBuffer-like thing using Scala's @specialized, but I didn't.) Also, java.util.Vector synchronizes all accesses, so it is particularly unacceptable unless you actually need the locking.
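To make the boxing point concrete, here is a rough sketch (plain Scala, not Breeze code): an Array[Double] is a flat primitive double[] on the JVM, whereas a generic Seq[Double] boxes every element as a java.lang.Double.
val flat: Array[Double] = Array.fill(1000)(0.0)  // primitive double[], contiguous in memory
val boxed: Seq[Double] = Seq.fill(1000)(0.0)     // every element is a boxed java.lang.Double

// Summing the array allocates nothing in the hot loop.
def sum(xs: Array[Double]): Double = {
  var acc = 0.0
  var i = 0
  while (i < xs.length) { acc += xs(i); i += 1 }
  acc
}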
You can use VectorBuilder (which has a settable length parameter that can be set to -1 to turn off bounds checking) if you're unsure of the dimensionality of your data set.

Functional Programming + Domain-Driven Design

Functional programming promotes immutable classes and referential transparency.
Domain-driven design is composed of Value Objects (immutable) and Entities (mutable).
Should we create immutable Entities instead of mutable ones?
Let's assume the project uses Scala as its main language: how could we write Entities as case classes (and therefore immutable) without risking stale state when dealing with concurrency?
What is good practice? Keeping Entities mutable (var fields, etc.) and giving up the great syntax of case classes?
You can effectively use immutable Entities in Scala and avoid the horror of mutable fields and all the bugs that derive from mutable state. Using immutable entities helps you with concurrency; it doesn't make things worse. Your previous mutable state becomes a series of transformations, each of which creates a new reference.
At a certain level of your application, however, you will need some mutable state, or your application would be useless. The idea is to push it as far up in your program logic as you can. Let's take the example of a bank account, which can change because of interest, ATM withdrawals, or deposits.
You have two valid approaches:
You expose methods that can modify an internal property and you manage concurrency on those methods (very few, in fact)
You make the class entirely immutable and you surround it with a "manager" that can change the account.
Since the first is pretty straightforward, I will detail the second.
case class BankAccount(balance: Double, code: Int)

class BankAccountRef(private var bankAccount: BankAccount) {
  def withdraw(withdrawal: Double): Double = {
    bankAccount = bankAccount.copy(balance = bankAccount.balance - withdrawal)
    bankAccount.balance
  }
}
This is nice, but gosh, you are still stuck with managing concurrency. Well, Scala offers you a solution for that. The problem here is that if you share your reference to the BankAccountRef with a background job, then you will have to synchronize the calls; you are doing concurrency in a suboptimal way.
The optimal way of doing concurrency: message passing
What if on the other side, the different jobs cannot invoke methods directly on the BankAccount or a BankAccountRef, but just notify them that some operations needs to be performed? Well, then you have an Actor, the favourite way of doing concurrency in Scala.
import akka.actor.Actor

// The message protocol for the actor
case object BalanceRequest
case class Balance(amount: Double)
case class Withdraw(amount: Double)
case class Deposit(amount: Double)

class BankAccountActor(private var bankAccount: BankAccount) extends Actor {
  def receive: Receive = {
    case BalanceRequest =>
      sender() ! Balance(bankAccount.balance)
    case Withdraw(amount) =>
      bankAccount = bankAccount.copy(balance = bankAccount.balance - amount)
    case Deposit(amount) =>
      bankAccount = bankAccount.copy(balance = bankAccount.balance + amount)
  }
}
This solution is described extensively in the Akka documentation: http://doc.akka.io/docs/akka/2.1.0/scala/actors.html . The idea is that you communicate with an Actor by sending messages to its mailbox, and those messages are processed in the order they are received. As such, you will not run into concurrency flaws when using this model.
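As a hedged usage sketch (the actor-system setup below is illustrative, not part of the original answer), sending the messages defined above looks like this:
import akka.actor.{ActorSystem, Props}

val system = ActorSystem("bank")
val account = system.actorOf(Props(classOf[BankAccountActor], BankAccount(balance = 100.0, code = 42)))

account ! Deposit(50.0)    // messages queue up in the mailbox...
account ! Withdraw(30.0)   // ...and are processed strictly one at a time
account ! BalanceRequest   // the Balance reply goes to the sender ref; outside an actor you'd use the ask pattern to receive it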
This is sort of an opinion question that is less Scala-specific than you might think.
If you really want to embrace FP, I would go the immutable route for all your domain objects and never put any behavior on them.
That is, some people call the above the service pattern, where there is always a separation between behavior and state (sketched below). This is eschewed in OOP but natural in FP.
It also depends on what your domain is. OOP is sometimes easier with stateful things like UIs and video games. For hard-core backend services like web sites or REST APIs, I think the service pattern is better.
Two really nice things that I like about immutable objects, besides the often-mentioned concurrency benefits, are that they are much more reliable to cache and that they are great for distributed message passing (e.g. protobuf over AMQP), as the intent is very clear.
Also, in FP people bridge the mutable-to-immutable divide by creating a "language" or "dialect", a.k.a. a DSL (Builders, Monads, Pipes, Arrows, STM, etc.), that lets you mutate and then transform back into the immutable domain. The services mentioned above use such a DSL to make changes. This is more natural than you might think (e.g. SQL is an example of such a "dialect"). OOP, on the other hand, prefers having a mutable domain and leveraging the existing procedural part of the language.
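For concreteness, a minimal sketch of that separation (the Order and OrderService names are purely illustrative): the entity is an immutable case class, and the behavior lives in a separate service that returns a new copy.
final case class Order(id: Long, items: List[String], total: BigDecimal)

object OrderService {
  // Behavior lives outside the entity and never mutates it in place.
  def addItem(order: Order, item: String, price: BigDecimal): Order =
    order.copy(items = item :: order.items, total = order.total + price)
}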

What are the obstacles for Scala having "const classes" a la Fantom?

Fantom supports provably immutable classes. The advantages of the compiler knowing a class is immutable must be numerous, not the least of which would be guaranteed immutable messages passed between actors. Fantom's approach seems straightforward - what difficulties would it pose for Scala?
There's more interest on the Scala side in tracking side effects, which is a much harder proposition than simple immutability.
Immutability in itself isn't as relevant as referential transparency, and, as a matter of fact, some of Scala's immutable collections would not pass muster on a "provably immutable" test because, in fact, they are not. They are immutable as far as anyone can observe from the outside, but they have mutable fields for various purposes.
One such example is List's subclass :: (the class that makes up everything in a list except the empty list), in which the fields for head and tail are actually mutable. This is done so that a List can be built efficiently in FIFO order -- see ListBuffer and its toList method, sketched below.
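A quick sketch of that FIFO trick from the user's point of view (this is ordinary library usage, not Scala-internal code):
import scala.collection.mutable.ListBuffer

val buf = ListBuffer.empty[Int]
buf += 1   // appends in FIFO order
buf += 2
buf += 3
val xs: List[Int] = buf.toList  // hands over the internally built List, effectively O(1)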
Regardless, while it would be interesting to have a guarantee of immutability, such things are really more of an artifact of languages where mutability is the default. It doesn't come up as a practical concern when programming in Scala, in my experience.
While the approach may be straightforward,
its guarantees can be broken by reflection;
it requires quite a bit of effort, which the Scala team may think is not worth it or not a priority.

how to access complex data structures in Scala while preserving immutability?

Calling expert Scala developers! Let's say you have a large object representing a writable data store. Are you comfortable with this common Java-like approach:
val complexModel = new ComplexModel()
complexModel.modify()
complexModel.access(...)
Or do you prefer:
val newComplexModel = complexModel.withADifference
newComplexModel.access(...)
If you prefer that, and you have a client accessing the model, how is the client going to know to point to newComplexModel rather than complexModel? From the user's perspective you have a mutable data store. How do you reconcile that perspective with Scala's emphasis on immutability?
How about this:
var complexModel = new ComplexModel()
complexModel = complexModel.withADifference
complexModel.access(...)
This seems a bit like the first approach, except that it seems the code inside withADifference is going to have to do more work than the code inside modify(), because it has to create a whole new complex data object rather than modifying the existing one. (Do you run into this problem of having to do more work when trying to preserve immutability?) Also, you now have a var with a large scope.
How would you decide on the best strategy? Are there exceptions to the strategy you would choose?
I think the functional way is to actually have a Stream containing all the different versions of your data structure, with the consumer just pulling the next element from that stream.
But I think in Scala it is an absolutely valid approach to keep a mutable reference in one central place and change that, while your whole data structure stays immutable.
When the data structure becomes more complex you might be interested in this question: Cleaner way to update nested structures, which asks (and gets answered) how to create new, changed versions of an immutable data structure that is not trivial.
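For a flavour of what such changed versions look like without any extra libraries (the types below are made up for illustration), a nested field can be updated by chaining copy calls:
case class Address(street: String, city: String)
case class Customer(name: String, address: Address)
case class Model(customer: Customer)

val m0 = Model(Customer("Ada", Address("1 Main St", "Oldtown")))
// Each copy produces a new version; m0 is left untouched.
val m1 = m0.copy(customer =
  m0.customer.copy(address =
    m0.customer.address.copy(city = "Newtown")))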
From a method name like modify alone it's easy to identify your ComplexModel as a mutator object, i.e. something that changes state. That implies that this kind of object has nothing to do with functional programming, and trying to make it immutable just because someone with questionable knowledge told you that everything in Scala should be immutable would simply be a mistake.
Now, you could modify your API so that this ComplexModel operates on immutable data (and, by the way, I think you should), but you definitely must not try to make the ComplexModel itself immutable.
The canonical answer to your question is to use a Zipper; there is an SO question about it.
The only implementation for Scala I know of is in Scalaz.
Immutability is merely a useful tool, not dogma. Situations will arise where the cost and inconvenience of immutability outweigh its usefulness.
The size of a ComplexModel may make it so that creating a modified copy is sufficiently expensive in terms of memory and/or CPU that a mutable model is more practical.