In my project I'm using a Seq[Drone] to track drones in a world. It's a functional project, so both the world and the drones are values of case classes.
In the world's process() method, a new World instance is returned containing a transformed version of that sequence, and since it's unordered, there's no guarantee the drones come back in the same order. This was by design, for a preliminary implementation.
Now, though, it's time to implement an ID system so that they can be assigned actions individually (e.g. "d1 move to (4, 6)"). This means that the drones need to be stored in a way that preserves "order".
I've spent some time coming up with several approaches, but first, let me establish how the IDs actually work.
ID Behaviour
IDs are unique for all existing drones (drones that are in the world).
When a drone expires, its ID is released.
When a drone is added, it takes the lowest available ID. (This means IDs can be reused.)
The Drone type does not have an ID - this is a concept only given meaning by a World.
Option 1: Plain tuples
My Seq[Drone] would become a Vector[(Int, Drone)]. References to drones would change from world.drones(n) to world.drones(n)._2, which is bad for a whole bunch of reasons. ID would be accessible by world.drones(n)._1.
Option 2: Type-aliased tuples
I'd add a type alias, type D = (Int, Drone), and change the Seq[Drone] to Vector[D]. This has similar issues to Option 1, I believe, though I don't have much experience with type aliases.
Option 3: Case class
I'd make something like case class D(id: Int, drone: Drone) and turn Seq[Drone] into Vector[D] as in Option 2. This has the advantage of providing nicer calls (d.id and d.drone rather than tuple element syntax), and can be used almost identically to tuples (D(1, Drone()) vs (1, Drone()) - a single character's difference).
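For concreteness, a minimal sketch of what Option 3 might look like (the World wrapper and lookup method are illustrative, not part of the original design):
case class D(id: Int, drone: Drone)

case class World(drones: Vector[D]) {
  def drone(id: Int): Option[Drone] = drones.find(_.id == id).map(_.drone)
}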
My question is thus: is Option 3 a suitable solution here? If so, what kinds of problems might I run into in the future? (I envisage some work to tidy up calls, but other than that, nothing.) If not, what avenues can I explore to find something more suitable?
All three of your options are almost the same. A 2-tuple is really just a case class called Tuple2, where Scala adds some syntactic sugar so you can write (a, b) instead of Tuple2(a, b). Given those options, I would choose Option 3 because of the more descriptive method names. In fact, tuples are often discouraged for exactly that reason.
However, there is another possibility: use a Map[Int, Drone]. This will give you some of the functionality you need out of the box (including fast lookup by ID and uniqueness of IDs) and accomplish the same thing without needing to define your own new type.
For instance, you can define adding a drone as:
def addDrone(drones: Map[Int, Drone], newDrone: Drone): Map[Int, Drone] = {
  // The lowest free ID: the first gap in 0 until drones.size, or drones.size if there is none.
  val id = (0 until drones.size).find(i => !drones.contains(i)).getOrElse(drones.size)
  drones + (id -> newDrone)
}
and releasing an id is as simple as removing it from the map.
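In code, that is simply (the method name is illustrative):
def removeDrone(drones: Map[Int, Drone], id: Int): Map[Int, Drone] = drones - id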
Related
Background:
Our Scala software consists of various components, developed by different teams, that pass Scala collections back and forth. The APIs usually use abstract collections such as Seq[T] and Set[T], and developers are currently essentially free to choose any implementation they like: e.g. when creating new instances, some go with List() or Vector(), others with Seq.empty.
Problem:
Different implementations have different performance characteristics, e.g. List might have been a good choice locally (for one component) because the collection is only sequentially iterated over or modified at the head, but it could have been a poor choice globally, because another component performs loads of random accesses.
Question:
Are there any tools — ideally Scala-specific, but JVM-general might also be OK — that can monitor runtime use of collections and record the information necessary to detect and report undesirable access/usage patterns of collections?
My feeling is that runtime monitoring would be more fruitful than static analysis (including simple linting) because (i) statically detecting usage patterns in hot code is virtually impossible, and (ii) static analysis would most likely miss collections that are created internally, e.g. when performing complex filter/map/fold/etc. operations on immutable collections.
Edits/Clarifications:
Changing the interfaces to enforce specific types such as List isn't an option; it would also not prevent purely internal use of "wrong" collections/usage patterns.
The goal is identifying a globally optimal (over many runs of the software) collection type, rather than locally optimising for each applied algorithm.
You don't need linting for this, let alone runtime monitoring. This is exactly what having a strictly-typed language does for you out of the box. If you want to ensure a particular collection type is passed to the API, just declare that the API accepts that collection type (e.g., def foo(x: Stream[Bar]), not def foo(x: Seq[Bar]), etc.).
Alternatively, when practical, just convert to the desired type as part of implementation: def foo(x: List[Bar]) = { val y = x.toArray ; lotsOfRandomAccess(y); }
Collections that are "internally created" are typically the same type as the parent object: List(1,2,3).map(_ + 1) returns a List etc.
Again, if you want to ensure you are using a particular type, just say so:
val mapped: List[Int] = List(1,2,3).map(_ + 1)
You can actually change the type this way if there is a need for that (this requires import scala.collection.breakOut):
import scala.collection.breakOut
val mappedStream: Stream[Int] = List(1,2,3).map(_ + 1)(breakOut)
As discussed in the comments, this is a problem that needs to be solved at a local level rather than via global optimisation.
Each algorithm in the system will work best with a particular data type, so using a single global structure will never be optimal. Instead, each algorithm should ensure that the incoming data is in a format that can be processed efficiently. If it is not in the right format, the data should be converted to a better format as the first part of the process. Since the algorithm works better on the right format, this conversion is always a performance improvement.
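As a sketch of that idea (the function and parameter names are purely illustrative):
def sumOfSamples(xs: Seq[Int], indices: Seq[Int]): Int = {
  val v = xs.toVector        // pay the O(n) conversion once, up front
  indices.map(i => v(i)).sum // every random access is now effectively constant time
}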
The output data format is more of a problem if the system does not know which algorithm will be used next. The solution is to use the most efficient output format for the algorithm in question, and rely on other algorithms to re-format the data if required.
If you do want to monitor the whole system, it would be better to track the algorithms rather than the collections. If you monitor which algorithms are called and in which order you can create multiple traces through the code. You can then play back those traces with different algorithms and data structures to see which is the most efficient configuration.
I want to use a cache to hold recently accessed objects that just came from a database read.
The database primary key, in my case, will be a Long.
In each case I'll have an Object (Case Class) that represents this data.
The combination of the Long plus the full class name will be a unique identifier for finding any specific object. (The namespace should never have any conflicts, as class names do not use numbers (as a rule?). In any case, for this use case I control the entire namespace, so it is not a huge concern.)
The objects will be relatively short-lived in the cache - I just see a few situations where I can save memory by sharing a single immutable Object rather than holding separate instances of the same Object, where avoiding that by "passing everything everywhere" would be extremely difficult.
This also would help performance in situations where different eyeballs are checking out the same stuff but this is not the driver for this particular use case (just gravy).
My concern now is that every time I need a given object, I'll need to recreate the cache key. This will involve a Long.toString and a String concat. The case classes in question have a val in their companion object so that they know their class name without any further reflection occurring.
I'm thinking of putting a "cache" together in the companion object for the main cache keys as I wish to avoid the (needless?) repeat ops per lookup as well as the resultant garbage collection etc. (The fastest code to run is the code that never gets written (or called) - right?)
Is there a more elegant way to handle this? Has someone else already solved this specific problem?
I thought of writing a key class but even with a val (lazy or otherwise) for the hash and toString I still get a hit for each and every object I ask for as now I have to create the key object each time. (That could of course go back into the companion object key cache but if I go to the trouble of setting up that companion object cache for keys the key object approach is redundant.)
As a secondary part of this question: assuming I use a Long and a full class name (as a String), which is most likely to give the quickest pull from the cache?
Long.toString + fullClassName
or
fullClassName + Long.toString
The Long IS a string within the key, so assuming the cache does a string "find", which ordering would be easier to index: the numeric portion first, or the string class name first?
Numbers first means you wade through ALL the objects with matching numbers searching for the matching class, whereas class first means you find the block for a particular class first, but you have to go to the very end of the string to find the exact match.
I suspect the former might be more easily optimized for a "fast find" (I know in MySQL terms it would be...)
Then again perhaps someone already has a dual-key lookup based cache? :)
I would keep it extremely simple until you had concrete performance metrics to the contrary. Something like:
trait Key {
  def id: Long
  // Concrete class name plus id; computed at most once per instance.
  lazy val key: String = s"${getClass.getName}-$id"
}
case class MyRecordObject(id: Long, ...) extends Key
Use a simple existing caching solution like Guava Caching.
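For illustration, a minimal sketch combining the Key trait above with Guava's cache (the User case class plus the size and expiry settings are placeholder assumptions):
import java.util.concurrent.TimeUnit
import com.google.common.cache.{Cache, CacheBuilder}

case class User(id: Long, name: String) extends Key

val cache: Cache[String, User] = CacheBuilder.newBuilder()
  .maximumSize(10000)                    // bound memory use
  .expireAfterWrite(5, TimeUnit.SECONDS) // entries are short-lived anyway
  .build()

val u = User(42L, "Johnny")
cache.put(u.key, u)                         // key is the package-qualified class name plus "-42"
val hit = Option(cache.getIfPresent(u.key)) // Some(u) until the entry expires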
To your secondary question, I would not worry about the performance of generating a key at all until you could actually prove key generation is a bottleneck (which I kind of doubt it ever would be).
import play.api.cache.Cache
It turns out that Cache.getOrElse[T](idAsString, seconds) actually does most of the heavy lifting!
[T] is of course a type in Scala and that is enough to keep things separated in the cache. Each [T] is a unique, separate and distinct bucket in the cache.
So Cache.getOrElse[AUser]("10", 5) will get a completely different object from Cache.getOrElse[ALog]("10", 5) (where the ID of 10 just happens to be the same in both, for the purpose of illustration here).
I'm currently doing this with thousands of objects across hundreds of types so I know it works...
I say most of the work, as the Long has to be .toString'ed before it can be used as a key. Not a complete GC disaster, as I simply set up a Map to hold the most commonly/recently .toString'ed Long values.
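That map can be as small as the following sketch (assuming single-threaded access; otherwise swap in a concurrent map):
val keyCache = scala.collection.mutable.Map.empty[Long, String]
def keyFor(id: Long): String = keyCache.getOrElseUpdate(id, id.toString)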
For those of you that simply don't get the value of this consider a simple log screen which is very common in most web applications.
2015/10/22 10:22 - Johnny Rotten - deleted an important file
2015/10/22 10:22 - Johnny Rotten - deleted another important file
2015/10/22 10:22 - Johnny Rotten - looked up another user
2015/10/22 10:22 - Johnny Rotten - added a bogus file
2015/10/22 10:22 - Johnny Rotten - insulted his boss
Under Java (Tomcat) there would typically be a single Object that represented that user (Johnny Rotten), and that single Object would be linked to each and every time the name of that user appeared in the log display.
Now under Scala we tend to create a new instance (Case Class) for each and every line of the log entry simply because we have no (efficient/plumbing) way of getting to the last used instance of that Case Class. The Log itself tends to be a case class and it has a lazy val of the User Case Class.
So, along comes user-x, they look up a log and set the pagination to 500 lines, and lo and behold we now have 500 case classes being created simply to display a user's name (the "who" in each log entry).
And then a few seconds later we have yet another 500 User Case Classes when they hit refresh because they didn't think they clicked the mouse right the first time...
With a simple cache however that holds a recently accessed object for all of say 5 seconds, all we create for the entire 500 log entries is a single instance of a User Case Class for each unique name we display in the log.
In Scala, Case Classes are immutable, so sharing a single instance is a perfectly acceptable use case here, and the GC has no needless work to do...
In Odersky et al's Scala book, they say to use lists. I haven't read the book cover to cover, but all the examples seem to use val with List. As I understand it, one is also encouraged to use vals over vars. But in most applications, is there not a trade-off between using a var List or a val MutableList? Obviously we use a val List when we can. But is it good practice to be using a lot of var Lists (or var Vectors etc)?
I'm pretty new to Scala coming from C#. There I had a lot of:
public List<T> myList {get; private set;}
collections which could easily have been declared as vals if C# had immutability built in, because the collection reference itself never changed after construction, even though elements were added to and removed from the collection over its lifetime. So declaring a var collection almost feels like a step away from immutability.
In response to answers and comments: one of the strong selling points of Scala is that it can deliver many benefits without one having to completely change the way one writes code, as is the case with, say, Lisp or Haskell.
Is it good practice to be using a lot of var Lists (or var Vectors etc)?
I would say it's better practice to use var with immutable collections than it is to use val with mutable ones. Off the top of my head, because
You have more guarantees about behaviour: if your object has a mutable list, you never know if some other external object is going to update it
You limit the extent of mutability; methods returning a collection will yield an immutable one, so you only have mutability within your one object
It's easy to immutabilize a var by simply assigning it to a val, whereas to make a mutable collection immutable you have to use a different collection type and rebuild it
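The third point in code (a trivial sketch):
var words = List("b", "a") // a var pointing at an immutable List
words = "c" :: words       // "mutation" just rebinds the reference
val frozen = words         // immutabilized: no rebuilding required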
In some circumstances, such as time-dependent applications with extensive I/O, the simplest solution is to use mutable state. And in some circumstances, a mutable solution is just more elegant. However in most code you don't need mutability at all. The key is to use collections with higher order functions instead of looping, or recursion if a suitable function doesn't exist. This is simpler than it sounds. You just need to spend some time getting to know the methods on List (and other collections, which are mostly the same). The most important ones are:
map: applies your supplied function to each element in the collection - use instead of looping and updating values in an array
foldLeft: returns a single result from a collection - use instead of looping and updating an accumulator variable
for-yield expressions: simplify your mapping and filtering especially for nested-loop type problems
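A few one-liners showing these in place of loops:
val xs = List(1, 2, 3, 4)
val doubled = xs.map(_ * 2)                              // no index variable, no array writes
val total = xs.foldLeft(0)(_ + _)                        // no mutable accumulator
val pairs = for (x <- xs; y <- xs if x < y) yield (x, y) // nested loops, flattened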
Ultimately, much of functional programming is a consequence of immutability and you can't really have one without the other; however, local vars are mostly an implementation detail: there's nothing wrong with a bit of mutability so long as it cannot escape from the local scope. So use vars with immutable collections since the local vars are not what will be exported.
You are assuming either the List must be mutable, or whatever is pointing to it must be mutable. In other words, that you need to pick one of the two choices below:
val list: collection.mutable.LinkedList[T]
var list: List[T]
That is a false dichotomy. You can have both:
val list: List[T]
So, the question you ought to be asking is how do I do that? We can only answer that when you try it out and face a specific problem. There's no generic answer that will solve all your problems (well, there is -- monads and recursion -- but that's too generic).
So... give it a try. You might be interested in looking at Code Review, where most Scala questions pertain precisely to how to make some code more functional. But, ultimately, I think the best way to go about it is to try, and resort to Stack Overflow when you just can't figure out some specific problem.
Here is how I see this problem of mutability in functional programming.
Best solution: Values are best, so the preferred approach in functional programming is values and recursive functions:
def func(n: Int): List[Int] = if (n > 0) n :: func(n - 1) else Nil
val myList = func(4) // List(4, 3, 2, 1)
Need mutable stuff: Sometimes mutable structures are needed or make everything a lot easier. My inclination when we face this situation is to use the mutable structures, i.e. val list: collection.mutable.LinkedList[T] rather than var list: List[T] - not because of any real performance improvement, but because the mutating operations are already defined on the mutable collection.
This advice is personal and maybe not recommended when you want performance, but it is a guideline I use for daily programming in Scala.
I believe you can't separate the question of mutable val vs. immutable var from the specific use case. Let me go a bit deeper: there are two questions you want to ask yourself:
How am I exposing my collection to the outside?
I want a snapshot of the current state, and this snapshot should not change regardless of the changes that are made to the entity hosting the collection. In such case, you should prefer immutable var. The reason is that the only way to do so with a mutable val is through a defensive copy.
I want a view on the state that changes to reflect changes to the original object's state. In this case, you should opt for a mutable val, and return it wrapped in an unmodifiable wrapper (much like what Collections.unmodifiableList() does in Java; here is a question asking how to do the same in Scala). I see no way to achieve this with an immutable var, so I believe your choice here is forced.
I only care about the absence of side effects. In this case the two approaches are very similar. With an immutable var you can directly return the internal representation, so it is slightly clearer maybe.
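A sketch of the first two cases side by side (class and member names are illustrative):
class Host {
  private var snapshotItems: List[Int] = Nil // immutable var: anything handed out is a frozen snapshot
  def snapshot: List[Int] = snapshotItems

  private val liveItems = scala.collection.mutable.ListBuffer.empty[Int]
  def live: collection.Seq[Int] = liveItems  // mutable val behind a read-only-typed view
}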
How am I modifying my collection?
I usually make bulk changes; namely, I set the whole collection at once. This makes immutable var a better choice: you can just assign what's in input to your current state, and you are done. With a mutable val, you need to first clear your collection, then copy the new contents in. Definitely worse.
I usually make pointwise changes, namely, I add/remove a single element (or few of them) to/from the collection. This is what I actually see most of the time, the collection being just an implementation detail and the trait only exposing methods for the pointwise manipulation of its status. In this case, a mutable val may be generally better performance-wise, but in case this is an issue I'd recommend taking a look at Scala's collections performance.
If it is necessary to use var lists, why not? To avoid problems you could for example limit the scope of the variable. There was a similar question a while ago with a pretty good answer: scala's mutable and immutable set when to use val and var.
The other day I was wondering why scala.collection.Map defines its unzip method as
def unzip [A1, A2] (implicit asPair: ((A, B)) ⇒ (A1, A2)): (Iterable[A1], Iterable[A2])
Since the method returns "only" a pair of Iterables instead of a pair of Seqs, it is not guaranteed that the key/value pairs in the original map occur at matching indices in the returned sequences, since Iterable doesn't guarantee the order of traversal. So if I had a
Map((1,A), (2,B))
, then after calling
Map((1,A), (2,B)) unzip
I might end up with
... = (List(1, 2),List(A, B))
just as well as with
... = (List(2, 1),List(B, A))
While I can imagine storage-related reasons behind this (think of HashMaps, for example) I wonder what you guys think about this behavior. It might appear to users of the Map.unzip method that the items were returned in the same pair order (and I bet this is probably almost always the case) yet since there's no guarantee this might in turn yield hard-to-find bugs in the library user's code.
Maybe that behavior should be expressed more explicitly in the accompanying scaladoc?
EDIT: Please note that I'm not referring to maps as ordered collections. I'm only interested in "matching" sequences after unzip, i.e. for
val (keys, values) = someMap.unzip
it holds for all i that (keys(i), values(i)) is an element of the original mapping.
Actually, the examples you gave will not occur. The Map will always be unzipped in a pair-wise fashion. Your statement that Iterable does not guarantee the ordering is not entirely true. It is more accurate to say that any given Iterable does not have to guarantee the ordering; whether it does is up to the implementation. In the case of Map.unzip, the ordering of pairs is not guaranteed, but the items in the pairs will not change the way they are matched up -- that matching is a fundamental property of the Map. You can read the source to GenericTraversableTemplate to verify this is the case.
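You can convince yourself with a quick check:
val m = Map(1 -> "A", 2 -> "B")
val (keys, values) = m.unzip
keys.zip(values).forall { case (k, v) => m.get(k).contains(v) } // always true, whatever the traversal order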
If you expand unzip's description, you'll get the answer:
definition classes: GenericTraversableTemplate
In other words, it didn't get specialized for Map.
Your argument is sound, though, and I daresay you might get your wish if you open an enhancement ticket with your reasoning. Especially if you go ahead and produce a patch as well -- if nothing else, at least you'll learn a lot more about Scala collections in doing so.
Maps, generally, do not have a natural sequence: they are unordered collections. The fact your keys happen to have a natural order does not change the general case.
(However I am at a loss to explain why Map has a zipWithIndex method. This provides a counter-argument to my point. I guess it is there for consistency with other collections and that, although it provides indices, they are not guaranteed to be the same on subsequent calls.)
If you use a LinkedHashMap or LinkedHashSet, the iterators are supposed to return the pairs in the original order of insertion. With other HashMaps, yeah, you have no control. Retaining the original order of insertion is quite useful in UI contexts; it allows you to re-sort tables on any column in a Web application without changing types, for instance.
Consider this simplified application domain:
Criminal Investigative database
Person is anyone involved in an investigation
Report is a bit of info that is part of an investigation
A Report references a primary Person (the subject of an investigation)
A Report has accomplices who are secondarily related (and could certainly be primary in other investigations or reports)
These classes have ids that are used to store them in a database, since their info can change over time (e.g. we might find new aliases for a person, or add persons of interest to a report)
Domain model diagram: http://yuml.me/13fc6da0
If these are stored in some sort of database and I wish to use immutable objects, there seems to be an issue regarding state and referencing.
Supposing that I change some metadata about a Person. Since my Person objects are immutable, I might have some code like:
class Person(
  val id: UUID,
  val aliases: List[String],
  val reports: List[Report]) {
  def addAlias(name: String) = new Person(id, name :: aliases, reports)
}
So that my Person with a new alias becomes a new object, also immutable. If a Report refers to that person, but the alias was changed elsewhere in the system, my Report now refers to the "old" person, i.e. the person without the new alias.
Similarly, I might have:
class Report(val id:UUID, val content:String) {
/** Adding more info to our report */
def updateContent(newContent:String) = new Report(id,newContent)
}
Since these objects don't know who refers to them, it's not clear to me how to let all the "referrers" know that there is a new object available representing the most recent state.
This could be done by having all objects "refresh" from a central data store, and having all operations that create new, updated objects store them to the central data store, but this feels like a cheesy reimplementation of the underlying language's referencing; i.e., it would be clearer to just make these "secondary storable objects" mutable. So, if I add an alias to a Person, all referrers see the new value without doing anything.
How is this dealt with when we want to avoid mutability, or is this a case where immutability is not helpful?
If X refers to Y, both are immutable, and Y changes (i.e. you replace it with an updated copy), then you have no choice but to replace X also (because it has changed, since the new X points to the new Y, not the old one).
This rapidly becomes a headache to maintain in highly interconnected data structures. You have three general approaches.
Forget immutability in general. Make the links mutable. Fix them as needed. Be sure you really do fix them, or you might get a memory leak (X refers to old Y, which refers to old X, which refers to older Y, etc.).
Don't store direct links, but rather ID codes that you can look up (e.g. a key into a hash map). You then need to handle the lookup failure case, but otherwise things are pretty robust. This is a little slower than the direct link, of course.
Change the entire world. If something is changed, everything that links to it must also be changed (and performing this operation simultaneously across a complex data set is tricky, but theoretically possible, or at least the mutable aspects of it can be hidden e.g. with lots of lazy vals).
Which is preferable depends on your rate of lookups and updates, I expect.
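A minimal sketch of the second approach, reworking Report to store IDs and resolving them through a (hypothetical) repository:
import java.util.UUID

case class Report(id: UUID, content: String, primary: UUID, accomplices: List[UUID])

class Repository(people: Map[UUID, Person]) {
  def person(id: UUID): Option[Person] = people.get(id) // lookup failure is explicit
  def updated(p: Person): Repository = new Repository(people + (p.id -> p))
}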
I suggest you read how people deal with this problem in Clojure and Akka. Read about software transactional memory. And some of my thoughts...
Immutability does not exist for its own sake; immutability is an abstraction. It does not "exist" in nature: the world is mutable, permanently changing. So it's quite natural for data structures to be mutable - they describe the state of a real or simulated object at a given moment in time. And it looks like OOP rules here. At the conceptual level, the problem with this attitude is that the object in RAM != the real object - the data can be inaccurate, it arrives with a delay, etc.
So in the case of the most trivial requirements you can go with everything mutable - persons, reports, etc. Practical problems will arise when:
data structures are modified from concurrent threads
users provide conflicting changes for the same objects
a user provides invalid data and it should be rolled back
With a naive mutable model you will quickly end up with inconsistent data and a crashing system. Mutability is error-prone; immutability is impossible. What you need is a transactional view of the world: within a transaction, the program sees an immutable world, and the STM manages the changes so they are applied in a consistent and thread-safe way.
I think you are trying to square the circle. Person is immutable, the list of Reports on a Person is part of the Person, and yet the list of Reports can change.
Would it be possible for an immutable Person have a reference to a mutable PersonRecord that keeps things like Reports and Aliases?