Monitoring runtime use of concrete collections - scala

Background:
Our Scala software consists of various components, developed by different teams, that pass Scala collections back and forth. The APIs usually use abstract collections such as Seq[T] and Set[T], and developers are currently essentially free to choose any implementation they like: e.g. when creating new instances, some go with List() or Vector(), others with Seq.empty.
Problem:
Different implementations have different performance characteristics, e.g. List might have been a good choice locally (for one component) because the collection is only sequentially iterated over or modified at the head, but it could have been a poor choice globally, because another component performs loads of random accesses.
Question:
Are there any tools — ideally Scala-specific, but JVM-general might also be OK — that can monitor runtime use of collections and record the information necessary to detect and report undesirable access/usage patterns of collections?
My feeling is that runtime monitoring would be more fruitful than static analysis (including simple linting) because (i) statically detecting usage patterns in hot code is virtually impossible, and (ii) static analysis would most likely miss collections that are created internally, e.g. when performing complex filter/map/fold/etc. operations on immutable collections.
Edits/Clarifications:
Changing the interfaces to enforce specific types such as List isn't an option; it would also not prevent purely internal use of "wrong" collections/usage patterns.
The goal is identifying a globally optimal (over many runs of the software) collection type rather than locally optimising for each applied algorithm.

You don't need linting for this, let alone runtime monitoring. This is exactly what a strictly-typed language does for you out of the box. If you want to ensure a particular collection type is passed to an API, just declare that the API accepts that collection type (e.g., def foo(x: Stream[Bar]), not def foo(x: Seq[Bar]), etc.).
Alternatively, when practical, just convert to the desired type as part of implementation: def foo(x: List[Bar]) = { val y = x.toArray ; lotsOfRandomAccess(y); }
Collections that are "internally created" are typically the same type as the parent object: List(1,2,3).map(_ + 1) returns a List etc.
Again, if you want to ensure you are using a particular type, just say so:
val mapped: List[Int] = List(1,2,3).map(_ + 1)
You can actually change the type this way if there is a need for that:
val mappedStream: Stream[Int] = List(1,2,3).map(_ + 1)(breakOut)

As discussed in the comments, this is a problem that needs to be solved at a local level rather than via global optimisation.
Each algorithm in the system will work best with a particular data type, so using a single global structure will never be optimal. Instead, each algorithm should ensure that the incoming data is in a format that can be processed efficiently. If it is not in the right format, the data should be converted to a better format as the first part of the process. Since the algorithm works better on the right format, this conversion is always a performance improvement.
The output data format is more of a problem if the system does not know which algorithm will be used next. The solution is to use the most efficient output format for the algorithm in question, and rely on other algorithms to re-format the data if required.
If you do want to monitor the whole system, it would be better to track the algorithms rather than the collections. If you monitor which algorithms are called and in which order you can create multiple traces through the code. You can then play back those traces with different algorithms and data structures to see which is the most efficient configuration.
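The algorithm-tracing idea above can be sketched with a minimal recorder. Everything here is hypothetical (`TraceRecorder`, `sortPrices`, `sumPrices` are made-up names): each algorithm logs itself on entry, producing an ordered trace that could later be replayed against different collection implementations.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical minimal trace recorder: algorithms log their name on entry,
// yielding an ordered trace of which algorithms ran, and in what order.
object TraceRecorder {
  private val trace = ArrayBuffer.empty[String]
  def record(algorithm: String): Unit = trace += algorithm
  def currentTrace: List[String] = trace.toList
}

// Two made-up algorithms that accept the abstract Seq type used in the APIs.
def sortPrices(xs: Seq[Int]): Seq[Int] = {
  TraceRecorder.record("sortPrices")
  xs.sorted
}

def sumPrices(xs: Seq[Int]): Int = {
  TraceRecorder.record("sumPrices")
  xs.sum
}

val total = sumPrices(sortPrices(Seq(3, 1, 2)))
assert(total == 6)
assert(TraceRecorder.currentTrace == List("sortPrices", "sumPrices"))
```

A recorded trace like `List("sortPrices", "sumPrices")` is what you would replay with, say, `Vector` instead of `List` to compare configurations.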

Related

How best to implement an ID system on a functional collection

In my project I'm using a Seq[Drone] to track drones in a world. It's a functional project, so both the world and the drones are values of case classes.
In the world's process() method, a new World instance is returned containing a transformed version of that sequence, and since it's unordered, there's no guarantee the drones come back in the same order. This was by design, for a preliminary implementation.
Now, though, it's time to implement an ID system so that they can be assigned actions individually (e.g. "d1 move to (4, 6)"). This means that the drones need to be stored in a way that preserves "order".
I've spent some time coming up with several approaches, but first, an establishment of how IDs actually work.
ID Behaviour
IDs are unique for all existing drones (drones that are in the world).
When a drone expires, its ID is released.
When a drone is added, it takes the lowest available ID. (This means IDs can be reused.)
The Drone type does not have an ID - this is a concept only given meaning by a World.
Option 1: Plain tuples
My Seq[Drone] would become a Vector[(Int, Drone)]. References to drones would change from world.drones(n) to world.drones(n)._2, which is bad for a whole bunch of reasons. ID would be accessible by world.drones(n)._1.
Option 2: Type-aliased tuples
I'd add a type alias called D for (Int, Drone) (i.e. type D = (Int, Drone)) and change the Seq[Drone] to Vector[D]. This has similar issues to Option 1, I believe, though I don't have a lot of experience with type aliases.
Option 3: Case class
I'd make something like case class D(id: Int, drone: Drone) and turn Seq[Drone] into Vector[D] as Option 2. This has the advantage of providing nicer calls (d.id and d.drone rather than tuple element syntax), and can be used almost identically to tuples (D(1, Drone()) vs (1, Drone()) - a single character's difference).
My question is thus: is Option 3 a suitable solution here? If so, what kinds of problems might I run into in the future? (I envisage some work to tidy up calls, but other than that, nothing.) If not, what avenues can I explore to find something more suitable?
All 3 of your options are almost the same. A 2-tuple is really just a case class called Tuple2, where Scala adds some syntactic sugar so you can write (a, b) instead of Tuple2(a, b). So given those options, I would choose Option 3 because of the more descriptive method names. In fact, for that reason tuples are often discouraged.
However there is another possibility, use a Map[Int, Drone]. This will give you some of the functionality you need out of the box (including fast lookup by id and uniqueness checks) and accomplish the same thing without needing to define your own new type.
For instance, you can define adding a drone as:
def addDrone(drones: Map[Int, Drone], newDrone: Drone): Map[Int, Drone] = {
  val id = (0 until drones.size).find(!drones.contains(_)).getOrElse(drones.size)
  drones + (id -> newDrone)
}
and releasing an id is as simple as removing it from the map.
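To make the release-and-reuse behaviour concrete, here is the same idea spelled out as a self-contained sketch (the Drone fields are placeholders, not the real case class):

```scala
// Stand-in Drone so the sketch compiles on its own; the real case class
// would come from the project.
case class Drone(x: Int, y: Int)

def addDrone(drones: Map[Int, Drone], newDrone: Drone): Map[Int, Drone] = {
  // lowest free ID in 0 until size, or size if 0..size-1 are all taken
  val id = (0 until drones.size).find(!drones.contains(_)).getOrElse(drones.size)
  drones + (id -> newDrone)
}

// Releasing an ID is just removing the entry; the ID becomes free again.
def removeDrone(drones: Map[Int, Drone], id: Int): Map[Int, Drone] =
  drones - id

// ID reuse in action:
val two      = addDrone(addDrone(Map.empty, Drone(0, 0)), Drone(5, 5)) // IDs 0, 1
val released = removeDrone(two, 0)                                     // ID 0 freed
val reused   = addDrone(released, Drone(9, 9))                         // takes ID 0 again
assert(reused.keySet == Set(0, 1) && reused(0) == Drone(9, 9))
```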

Disadvantages of Immutable objects

I know that immutable objects offer several advantages over mutable objects: they are easier to reason about, they do not have complex state spaces that change over time, we can pass them around freely, they make safe hash table keys, etc. So my question is: what are the disadvantages of immutable objects?
Quoting from Effective Java:
The only real disadvantage of immutable classes is that they require a
separate object for each distinct value. Creating these objects can be
costly, especially if they are large. For example, suppose that you
have a million-bit BigInteger and you want to change its low-order
bit:
BigInteger moby = ...;
moby = moby.flipBit(0);
The flipBit method
creates a new BigInteger instance, also a million bits long, that
differs from the original in only one bit. The operation requires time
and space proportional to the size of the BigInteger. Contrast this to
java.util.BitSet. Like BigInteger, BitSet represents an arbitrarily
long sequence of bits, but unlike BigInteger, BitSet is mutable. The
BitSet class provides a method that allows you to change the state of
a single bit of a million-bit instance in constant time.
Read the full item on Item 15: Minimize mutability
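The contrast the quote describes can be reproduced directly from Scala, since both classes live on the JVM (a smaller number than a million bits is used here for brevity):

```scala
import java.math.BigInteger
import java.util.BitSet

// Immutable: flipBit copies the whole value into a new instance, O(n).
val moby: BigInteger = BigInteger.ONE.shiftLeft(20) // 2^20, bit 0 is clear
val flipped: BigInteger = moby.flipBit(0)           // new instance, differs in one bit
assert(!moby.testBit(0))                            // original untouched
assert(flipped.testBit(0))

// Mutable: BitSet flips one bit of the same instance in O(1).
val bits = new BitSet()
bits.set(20)
bits.flip(0)                                        // in place, constant time
assert(bits.get(0) && bits.get(20))
```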
Apart from possible performance drawbacks (possible! because with the complexity of GC and HotSpot optimisations, immutable structures are not necessarily slower) - one drawback can be that state must now be threaded through your whole application. For simple applications or tiny scripts the effort to maintain state this way might be too high to buy you concurrency safety.
For example think of a GUI framework like Swing. It would be definitely possible to write a GUI framework entirely using immutable structures and one main "unsafe" outer loop, and I guess this has been done in Haskell. Some of the problems of maintaining nested immutable state can be addressed for example with lenses. But managing all the interactions (registering listeners etc.) may get quite involved, so you might instead want to introduce new abstractions such as functional-reactive or hybrid-reactive GUIs.
Basically you lose some of OO's encapsulation by going all immutable, and when this becomes a problem there are alternative approaches such as actors or STM.
I work with Scala on a daily basis. Immutability has certain key advantages as we all know. However sometimes it's just plain easier to allow mutable content in some situations. Here's a contrived example:
var counter = 0
something.map { e =>
  ...
  counter += 1
}
Of course I could just have the map return a tuple with the payload and count, or use a collection.size if available. But in this case the mutable counter is arguably more clear. In general I prefer immutability but also allow myself to make exceptions.
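For comparison, the tuple alternative mentioned above looks like this without any var: the payload and the count travel together through a fold (the uppercasing step is a made-up stand-in for the real per-element work):

```scala
val items = Seq("a", "b", "c")

// Thread (accumulated payload, count) through the traversal instead of
// mutating an external counter.
val (payload, count) =
  items.foldLeft((List.empty[String], 0)) { case ((acc, n), e) =>
    (e.toUpperCase :: acc, n + 1)
  }

assert(payload.reverse == List("A", "B", "C"))
assert(count == 3)
```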
To answer this question I would quote Programming in Scala, second Edition, chapter "Next Steps in Scala", item 11, by Lex Spoon, Bill Venners and Martin Odersky :
The Scala perspective, however, is that val and var are just two different tools in your toolbox, both useful, neither inherently evil. Scala encourages you to lean towards vals, but ultimately reach for the best tool given the job at hand.
So I would say that just as for programming languages, val and var solves different problems : there is no "disavantage / avantage" without context, there is just a problem to solve, and both of val / var address differently the problem.
Hope it helps, even if it does not provide a concrete list of pros / cons !

How is Scala suitable for Big Scalable Application

I am taking course Functional Programming Principles in Scala | Coursera on Scala.
I fail to understand how, with immutability, so many functions, and so much dependence on recursion, Scala is really suitable for real-world applications.
I mean, coming from imperative languages, I see a risk of StackOverflowError or garbage collection kicking in, and with multiple copies of everything, of running Out Of Memory.
What am I missing here?
Stack overflow: it's possible to make your recursive function tail recursive. Add @tailrec from scala.annotation.tailrec to make sure your function is 100% tail recursive. This is basically a loop.
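A minimal illustration of the annotation (sum over a list, accumulator-style): @tailrec makes the compiler reject the function if the recursive call is not in tail position, guaranteeing it compiles to a loop.

```scala
import scala.annotation.tailrec

// Compile-time guarantee: if the recursive call were not in tail position,
// this would not compile.
@tailrec
def sum(xs: List[Int], acc: Long = 0L): Long = xs match {
  case Nil    => acc
  case h :: t => sum(t, acc + h)
}

// Safe even for inputs that would blow the stack with naive recursion:
assert(sum(List.fill(1000000)(1)) == 1000000L)
```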
Most importantly, recursive solutions are only one of many patterns available. See "Effective Java" for why mutability is bad. Immutable data is much better suited for large applications: no need to synchronize access, clients can't mess with data internals, etc. Immutable structures are very efficient in many cases. If you add an element to the head of a list, elem :: list, all data is shared between the 2 lists - awesome! Only a head is created and pointed at the list. Imagine if you had to create a new deep clone of a list every time a client asked for it.
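The sharing behind elem :: list can be observed directly: the tail of the new list is the old list itself (reference equality), not a copy.

```scala
val list = List(2, 3, 4)
val extended = 1 :: list

// Structural sharing: prepending allocates one cons cell; the tail of the
// new list *is* the old list.
assert(extended.tail eq list)   // same object, no copying
assert(list == List(2, 3, 4))   // original unchanged
```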
Expressions in Scala are more succinct and maybe more lazy - create filter and map and all that applied as needed. You can do the same in Java but ceremony takes forever so usually devs just create multiple temp collections on the way.
Martin Odersky defines mutability as a dependence on time/history. That's very interesting because you can use var inside of a function as long as no other code can be affected in any way, i.e. results are always the same.
Look at Option[T] and compare to null. Use them in for comprehensions. Exception becomes really exceptional and Option, Try, Box, Either communicate failures in a very nice way.
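For instance (the lookupPrice function is a made-up example), Option in a for comprehension short-circuits on the first None, replacing null checks and defensive exception handling:

```scala
// Hypothetical lookup that may fail; None instead of null or an exception.
def lookupPrice(symbol: String): Option[Double] =
  Map("ACME" -> 10.0).get(symbol)

// Any None makes the whole comprehension evaluate to None.
val total: Option[Double] = for {
  a <- lookupPrice("ACME")
  b <- lookupPrice("ACME")
} yield a + b

assert(total == Some(20.0))
assert(lookupPrice("MISSING").isEmpty)
```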
Scala allows you to write more modular and generic code with less effort compared to Java.
Find a good piece of Scala code and try to see how you would do it in Java - it will be self-evident.
Real-world applications are getting more event-driven, which involves passing data across different processes or systems and calls for immutable data structures.
In most cases we are either manipulating data or waiting on a resource.
In that case it's easy to hook in a callback with Actors.
Take a look at
http://pavelfatin.com/scala-for-project-euler/
which gives you some examples of using functions like map, filter, etc. Functions like these are used routinely in Ruby applications.
The combination of immutability and recursion avoids a lot of stack overflow problems. This comes in handy while dealing with event-driven applications.
akka.io is a classic example of something that could have been built very concisely in Scala.

Using Scala, does a functional paradigm make sense for analyzing live data?

For example, when analyzing live stockmarket data I expose a method to my clients
def onTrade(trade: Trade) {
}
The clients may choose to do anything from counting the number of trades, calculating averages, storing high lows, price comparisons and so on. The method I expose returns nothing and the clients often use vars and mutable structures for their computation. For example when calculating the total trades they may do something like
var numTrades = 0
def onTrade(trade: Trade) {
  numTrades += 1
}
A single onTrade call may have to do six or seven different things. Is there any way to reconcile this type of flexibility with a functional paradigm? In other words, a return type, vals, and immutable data structures?
You might want to look into Functional Reactive Programming. Using FRP, you would express your trades as a stream of events, and manipulate this stream as a whole, rather than focusing on a single trade at a time.
You would then use various combinators to construct new streams, for example one that would return the number of trades or highest price seen so far.
The link above contains links to several Haskell implementations, but there are probably several Scala FRP implementations available as well.
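Even without a full FRP library, the shift in perspective can be sketched with plain collections: derive whole streams of results from the stream of trades, rather than mutating per event (Trade here is a stand-in type and the numbers are made up):

```scala
case class Trade(symbol: String, price: Double)

val trades = List(Trade("ACME", 10.0), Trade("ACME", 12.5), Trade("ACME", 9.0))

// Whole-stream combinators: running trade count and highest price seen so far.
val tradeCounts: List[Int]    = trades.indices.map(_ + 1).toList
val runningHigh: List[Double] =
  trades.scanLeft(0.0)((hi, t) => math.max(hi, t.price)).tail

assert(tradeCounts == List(1, 2, 3))
assert(runningHigh == List(10.0, 12.5, 12.5))
```

An FRP library applies the same idea to a live, unbounded stream of events instead of a finished list.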
One possibility is using monads to encapsulate state within a purely functional program. You might check out the Scalaz library.
Also, according to reports, the Scala team is developing a compiler plug-in for an effect system. Then you might consider providing an interface like this to your clients,
def callbackOnTrade[A, B](f: (A, Trade) => B)
The clients define their input and output types A and B, and define a pure function f that processes the trade. All "state" gets encapsulated in A and B and threaded through f.
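A sketch of what a client's pure step might look like (Trade and State are stand-in types, and A and B are collapsed into a single State for simplicity): the framework, not the client, threads the state from one call to the next.

```scala
case class Trade(symbol: String, price: Double)

// Client-defined state: a trade count plus a running high price.
case class State(numTrades: Int, highPrice: Double)

// A pure per-trade step: old state in, new state out. No vars anywhere.
def onTrade(state: State, trade: Trade): State =
  State(state.numTrades + 1, math.max(state.highPrice, trade.price))

// The framework threads the state through the event sequence:
val trades = List(Trade("ACME", 10.0), Trade("ACME", 12.5))
val finalState = trades.foldLeft(State(0, 0.0))(onTrade)
assert(finalState == State(2, 12.5))
```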
Callbacks may not be the best approach, but there are certainly functional designs that can solve such a problem. You might want to consider FRP or a state-monad solution as already suggested, actors are another possibility, as is some form of dataflow concurrency, and you can also take advantage of the copy method that's automatically generated for case classes.
A different approach is to use STM (software transactional memory) and stick with the imperative paradigm whilst still retaining some safety.
The best approach depends on exactly how you're persisting the data and what you're actually doing in these state changes. As always, let a profiler be your guide if performance is critical.

Specification: Use cases for CRUD

I am writing a Product requirements specification. In this document I must describe the ways that the user can interact with the system in a very high level. Several of these operations are "Create-Read-Update-Delete" on some objects.
The question is, when writing use cases for these operations, what is the right way to do so? Can I write only one Use Case called "Manage Object x" and then have these operations as included Use Cases? Or do I have to create one use case per operation, per object? The problem I see with the last approach is that I would be writing quite a few pages that I feel do not really contribute to the understanding of the problem.
What is the best practice?
The original concept for use cases was that they -- like actors, class definitions, and frankly everything else -- enjoy inheritance, as well as <<uses>> and <<extends>> relationships.
A Use Case superclass ("CRUD") makes sense. A lot of use cases are trivial extensions to "CRUD" with an entity type plugged into the use case.
A few use cases will be interesting extensions to "CRUD" with variant processing scenarios for -- maybe -- a fancy search as part of Retrieve, or a multi-step process for Create or Update, or a complex confirmation for Delete.
Feel free to use inheritance to simplify and normalize your use cases. If you use a UML tool, you'll notice that Use Cases have an "inheritance" arrow available to them.
The answer really depends on how complex the interactions are and how many variations are possible from object to object. There are two real reasons why I suggest that you develop specific use cases for each CRUD operation:
(a) If you really are only doing a high-level summary of the interaction then the overhead is very small
(b) I've found it useful to specify a set of generic Use Cases for modifying 'Resources' and then extending / overriding particular steps for particular objects. Obviously the common behaviour is captured in the generic 'Resource' use cases.
As your understanding of the domain develops (i.e. as business users dump more requirements on you), you are more likely to add to the CRUD rather than remove it.
It makes sense to distinguish between workflow cases and resource/object lifecycles.
They interact but they are not the same; it makes sense to specify them both.
Use case scenarios or more extended workflow specifications typically describe how a case may proceed through the system's workflow. This will typically include interaction with various different resources. These interactions can often be characterized as C,R,U or D.
Resource lifecycles provide the process model of what may happen to a particular (type of) resource (object). They are often trivial "flower" models that say: any of C,R,U,D may happen to this resource in any order, so they are not very interesting by themselves.
The link between the two is that steps from the workflow and from the lifecycles coincide.
I feel the representation - as long as it makes sense and is readable - does not matter. Conforming to the UML spec in all details is especially irrelevant.
What does matter is that your spec clearly states the operations and operation types the implementation requires.
C: What form of insert operations exists. Can you insert rows not fully populated? Can you insert rows without an ID? Can you retrieve the ID last inserted? Can you cancel an insert selectively? What happens on duplicate keys or constraints failure? Is there a REPLACE INTO equivalent?
R: By what fields can you select? Can you do arbitrary grouping, orders? Can you create aggregate fields, aliases? How can you retrieve embedded (has many etc.) data? How do you specify depth of recursion, limits?
U, D: see R + C