Using Scala, does a functional paradigm make sense for analyzing live data?

For example, when analyzing live stock market data I expose a method to my clients:
def onTrade(trade: Trade) {
}
The clients may choose to do anything from counting the number of trades, calculating averages, storing highs and lows, price comparisons, and so on. The method I expose returns nothing, and the clients often use vars and mutable structures for their computation. For example, when calculating the total trades they may do something like:
var numTrades = 0

def onTrade(trade: Trade) {
  numTrades += 1
}
A single onTrade call may have to do six or seven different things. Is there any way to reconcile this kind of flexibility with a functional paradigm? In other words: return values, vals, and immutable data structures.

You might want to look into Functional Reactive Programming. Using FRP, you would express your trades as a stream of events, and manipulate this stream as a whole, rather than focusing on a single trade at a time.
You would then use various combinators to construct new streams, for example one that would return the number of trades or highest price seen so far.
The link above contains links to several Haskell implementations, but there are probably several Scala FRP implementations available as well.
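As a rough sketch of the idea, using plain Scala Streams rather than a real FRP library (and assuming a Trade type with a price field), derived streams such as a running trade count or a running high can be built from ordinary combinators:
case class Trade(price: Double)

// Each derived stream is built from the stream of all trades,
// rather than from per-trade callbacks mutating shared state.
def tradeCount(trades: Stream[Trade]): Stream[Int] =
  trades.scanLeft(0)((n, _) => n + 1).tail

def runningHigh(trades: Stream[Trade]): Stream[Double] =
  trades.map(_.price).scanLeft(Double.MinValue)(math.max).tail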

One possibility is using monads to encapsulate state within a purely functional program. You might check out the Scalaz library.
Also, according to reports, the Scala team is developing a compiler plug-in for an effect system. You might then consider providing an interface like this to your clients:
def callbackOnTrade[A, B](f: (A, Trade) => B)
The clients define their input and output types A and B, and define a pure function f that processes the trade. All "state" gets encapsulated in A and B and threaded through f.
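One way to read that interface is as a fold that the framework runs over incoming trades; here is a minimal sketch under that assumption (runClient and someTrades are illustrative names, and Trade is assumed to be a case class with a price field):
// The client supplies an initial state and a pure step function;
// the framework threads the state through every trade.
def runClient[S](initial: S)(step: (S, Trade) => S)(trades: Seq[Trade]): S =
  trades.foldLeft(initial)(step)

// Example client: counting trades without a var.
val someTrades = Seq(Trade(10.0), Trade(12.5))
val total = runClient(0)((count, _) => count + 1)(someTrades)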

Callbacks may not be the best approach, but there are certainly functional designs that can solve such a problem. You might want to consider FRP or a state-monad solution as already suggested; actors are another possibility, as is some form of dataflow concurrency; and you can also take advantage of the copy method that's automatically generated for case classes.
A different approach is to use STM (software transactional memory) and stick with the imperative paradigm whilst still retaining some safety.
The best approach depends on exactly how you're persisting the data and what you're actually doing in these state changes. As always, let a profiler be your guide if performance is critical.
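For instance, the generated copy method makes it cheap to express an updated snapshot of accumulated statistics without mutation (a sketch; the Stats fields are illustrative):
case class Trade(price: Double)
case class Stats(numTrades: Int = 0, high: Double = Double.MinValue)

// Each trade yields a new Stats value; the previous one is never mutated.
def update(stats: Stats, trade: Trade): Stats =
  stats.copy(
    numTrades = stats.numTrades + 1,
    high      = math.max(stats.high, trade.price)
  )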


Monitoring runtime use of concrete collections

Background:
Our Scala software consists of various components, developed by different teams, that pass Scala collections back and forth. The APIs usually use abstract collections such as Seq[T] and Set[T], and developers are currently essentially free to choose any implementation they like: e.g. when creating new instances, some go with List() or Vector(), others with Seq.empty.
Problem:
Different implementations have different performance characteristics, e.g. List might have been a good choice locally (for one component) because the collection is only sequentially iterated over or modified at the head, but it could have been a poor choice globally, because another component performs loads of random accesses.
Question:
Are there any tools (ideally Scala-specific, but JVM-general might also be OK) that can monitor runtime use of collections and record the information necessary to detect and report undesirable access/usage patterns of collections?
My feeling is that runtime monitoring would be more fruitful than static analysis (including simple linting) because (i) statically detecting usage patterns in hot code is virtually impossible, and (ii) static analysis would most likely miss collections that are created internally, e.g. when performing complex filter/map/fold/etc. operations on immutable collections.
Edits/Clarifications:
Changing the interfaces to enforce specific types such as List isn't an option; it would also not prevent purely internal use of "wrong" collections/usage patterns.
The goal is identifying a globally optimal (over many runs of the software) collection type rather than locally optimising for each applied algorithm.
You don't need linting for this, let alone runtime monitoring. This is exactly what a strictly typed language gives you out of the box. If you want to ensure a particular collection type is passed to an API, just declare that the API accepts that collection type (e.g., def foo(x: Stream[Bar]), not def foo(x: Seq[Bar]), etc.).
Alternatively, when practical, just convert to the desired type as part of the implementation: def foo(x: List[Bar]) = { val y = x.toArray; lotsOfRandomAccess(y) }
Collections that are "internally created" are typically the same type as the parent object: List(1,2,3).map(_ + 1) returns a List etc.
Again, if you want to ensure you are using a particular type, just say so:
val mapped: List[Int] = List(1,2,3).map(_ + 1)
You can actually change the type this way if there is a need for that:
import scala.collection.breakOut
val mappedStream: Stream[Int] = List(1,2,3).map(_ + 1)(breakOut)
As discussed in the comments, this is a problem that needs to be solved at a local level rather than via global optimisation.
Each algorithm in the system will work best with a particular data type, so using a single global structure will never be optimal. Instead, each algorithm should ensure that the incoming data is in a format that can be processed efficiently. If it is not in the right format, the data should be converted to a better format as the first part of the process. Since the algorithm works better on the right format, this conversion is always a performance improvement.
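A minimal sketch of that pattern, assuming the hot part of the algorithm needs random access:
// Normalise the input to an indexed structure before the random-access-heavy
// loop; xs.toVector is essentially free if the caller already passed a Vector.
def sumEveryOther(xs: Seq[Int]): Int = {
  val v = xs.toVector
  (0 until v.length by 2).map(v).sum
}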
The output data format is more of a problem if the system does not know which algorithm will be used next. The solution is to use the most efficient output format for the algorithm in question, and rely on other algorithms to re-format the data if required.
If you do want to monitor the whole system, it would be better to track the algorithms rather than the collections. If you monitor which algorithms are called and in which order you can create multiple traces through the code. You can then play back those traces with different algorithms and data structures to see which is the most efficient configuration.

Is the actor model not an anti-pattern, as the fire-and-forget style forces actors to remember a state?

When learning Scala, one of the first things I learned was that every function returns something. There is no "void" function/method as there is, for instance, in Java. Thus many Scala functions are true functions in the mathematical sense, and objects can remain largely stateless.
Now I learned that the actor model is a very popular model among functional languages like Scala. However, actors promote a fire-and-forget style of programming, and callers usually don't expect callees to directly reply to messages (except when using the "ask"/"?"-method). Therefore, actors need to remember some sort of state.
Am I right assuming that the actor model is more like a trade-off between scalability and maintainability (due to its statefulness), and could sometimes even be considered an anti-pattern?
Yes, you're essentially right (I'm not quite sure what you have in mind when you say scalability vs maintainability).
Actors are popular in Scala because of Akka (which presumably is in turn popular because of the support it gets from Lightbend). It is not, however, the case that actors are overwhelmingly popular in the functional programming world in general (although implementations exist for all the languages I'm thinking of). Below are my vastly simplified impressions (so take them with the requisite amount of salt) of two other FP language communities, both of which use actors (far?) less frequently than Scala does.
The Haskell community tends to use STM or channels (often channels in an STM context). Straight-up MVars also get used surprisingly often.
The Clojure community sometimes touts its own built-in version of STM, but its flagship concurrency model is really core.async, which is, at its heart, again channels.
As an aside, STM, channels, and actors can all be layered upon one another; it's sort of weird to compare them as if they were mutually exclusive approaches. In practice, though, it's rare to see them all used in tandem.
Actors do indeed involve state (and in the case of Akka skirt type safety) and as a result are very expressive and can pretty much do anything concurrency-wise. In this way they're similar to side-effectful functions, which are more expressive than pure functions. Indeed actors in a way are the pure essence of OO, with all its pros and cons.
As such there is a sizable chunk of the Scala community that would say yes, if most of the time when you face concurrency issues, you're using actors, that's probably an anti-pattern.
If you can, try to get away with just using Futures or scalaz.concurrent.Task. In return for less expressiveness you get more composability (see the sketch below).
If your problem naturally lends itself to a single, global state (e.g. in the form of global invariants that you want to enforce), think about STM. In the Scala community, although an STM library exists, my impression is that STM is usually emulated by using actors.
If your concurrency problems mainly revolve around streaming multiple sources of data, think about using one of Scala's streaming libraries.
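As a sketch of the composability point about Futures above (standard library only; the price function is a stand-in for a real asynchronous lookup):
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

// Futures compose with for-comprehensions; no shared mutable state is involved.
def price(symbol: String): Future[Double] = Future(42.0) // stand-in

val spread: Future[Double] =
  for {
    bid <- price("ACME.bid")
    ask <- price("ACME.ask")
  } yield ask - bid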
Actors are specifically a tool in the toolbox for handling and distributing state. So yes, they should have state - if they don't, then you could just use Futures.
Please note, however, that actors (at least Akka actors) handle distribution (running location-transparently on multiple nodes), which neither functions nor Futures are able to do. The concurrency aspects of actors are a result of them handling the more complex case - networking. In that sense, actors unify the remote case with the local case by making the remote case first-class. And as it turns out, on networks messaging is exactly what you can both count on and build on if you want reliable, resilient and also fast systems.
Hope this answers the "big picture" part of your question.

Disadvantages of Immutable objects

I know that immutable objects offer several advantages over mutable objects: they are easier to reason about, they do not have complex state spaces that change over time, we can pass them around freely, they make safe hash-table keys, etc. So my question is: what are the disadvantages of immutable objects?
Quoting from Effective Java:
The only real disadvantage of immutable classes is that they require a separate object for each distinct value. Creating these objects can be costly, especially if they are large. For example, suppose that you have a million-bit BigInteger and you want to change its low-order bit:
BigInteger moby = ...;
moby = moby.flipBit(0);
The flipBit method creates a new BigInteger instance, also a million bits long, that differs from the original in only one bit. The operation requires time and space proportional to the size of the BigInteger. Contrast this to java.util.BitSet. Like BigInteger, BitSet represents an arbitrarily long sequence of bits, but unlike BigInteger, BitSet is mutable. The BitSet class provides a method that allows you to change the state of a single bit of a million-bit instance in constant time.
Read the full discussion in Item 15: Minimize mutability.
Apart from possible performance drawbacks (possible, because with the complexity of GC and HotSpot optimisations, immutable structures are not necessarily slower), one drawback can be that state must now be threaded through your whole application. For simple applications or tiny scripts, the effort of maintaining state this way might be too high a price for concurrency safety.
For example, think of a GUI framework like Swing. It would definitely be possible to write a GUI framework entirely using immutable structures and one main "unsafe" outer loop, and I guess this has been done in Haskell. Some of the problems of maintaining nested immutable state can be addressed, for example, with lenses. But managing all the interactions (registering listeners etc.) may get quite involved, so you might instead want to introduce new abstractions such as functional-reactive or hybrid-reactive GUIs.
Basically you lose some of OO's encapsulation by going all immutable, and when this becomes a problem there are alternative approaches such as actors or STM.
I work with Scala on a daily basis. Immutability has certain key advantages, as we all know. However, in some situations it's just plain easier to allow mutable content. Here's a contrived example:
var counter = 0
something.map { e =>
  ...
  counter += 1
}
Of course I could just have the map return a tuple with the payload and count, or use the collection's size if available. But in this case the mutable counter is arguably clearer. In general I prefer immutability but also allow myself to make exceptions.
To answer this question I would quote Programming in Scala, Second Edition, chapter "Next Steps in Scala", item 11, by Lex Spoon, Bill Venners and Martin Odersky:
The Scala perspective, however, is that val and var are just two different tools in your toolbox, both useful, neither inherently evil. Scala encourages you to lean towards vals, but ultimately reach for the best tool given the job at hand.
So I would say that, just as with programming languages themselves, val and var solve different problems: there is no "disadvantage/advantage" without context, there is just a problem to solve, and val and var address it differently.
Hope it helps, even if it does not provide a concrete list of pros and cons!

How is Scala suitable for Big Scalable Application

I am taking the course Functional Programming Principles in Scala on Coursera.
I fail to understand how, with immutability, so many functions, and so much reliance on recursion, Scala is really suitable for real-world applications.
I mean, coming from imperative languages, I see a risk of stack overflows, of garbage collection kicking in, and, with multiple copies of everything, of running out of memory.
What am I missing here?
Stack overflow: it's possible to make your recursive function tail recursive. Add @tailrec from scala.annotation.tailrec to make sure your function is 100% tail recursive. This is basically a loop.
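For instance, a minimal illustration (the compiler rejects the annotation if the recursive call is not in tail position):
import scala.annotation.tailrec

// Compiles down to a loop: the recursive call is the last action taken,
// so no stack frames accumulate and large lists cannot overflow the stack.
@tailrec
def sum(xs: List[Int], acc: Int = 0): Int = xs match {
  case Nil    => acc
  case h :: t => sum(t, acc + h)
}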
Most importantly, recursive solutions are only one of many available patterns. See Effective Java on why mutability is bad. Immutable data is much better suited for large applications: no need to synchronize access, clients can't mess with data internals, etc. Immutable structures are very efficient in many cases. If you add an element to the head of a list, elem :: list, all data is shared between the two lists - awesome! Only a new head cell is created, pointing at the existing list. Imagine if you had to create a deep clone of the list every time a client asked for it.
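A quick illustration of that sharing:
val list  = List(2, 3)
val list2 = 1 :: list // allocates a single cons cell; list2's tail is list itself

// The original list is untouched and safe to hand out to other code.
assert(list2.tail eq list)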
Expressions in Scala are more succinct and can be lazier: filter, map, and the rest are applied only as needed. You can do the same in Java, but the ceremony takes forever, so devs usually just create multiple temporary collections along the way.
Martin Odersky defines mutability as a dependence on time/history. That's very interesting, because you can use a var inside a function as long as no other code can be affected in any way, i.e. the results are always the same.
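A small example of that notion: the function below is observably pure even though it uses vars internally.
// Callers can never observe the mutation; the same input always gives
// the same output, so there is no dependence on time or history.
def sumOfSquares(n: Int): Long = {
  var acc = 0L
  var i = 1
  while (i <= n) { acc += i.toLong * i; i += 1 }
  acc
}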
Look at Option[T] and compare it to null. Use them in for comprehensions. Exceptions become really exceptional, and Option, Try, Box, and Either communicate failures in a very nice way.
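For example, a sketch (parsePort is an illustrative helper):
import scala.util.Try

// None propagates through the comprehension; there are no null checks anywhere.
def parsePort(s: String): Option[Int] = Try(s.toInt).toOption

val endpoint: Option[String] =
  for {
    host <- Option("localhost")
    port <- parsePort("8080")
  } yield s"$host:$port" // Some("localhost:8080"); any failure yields None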
Scala allows you to write more modular and generic code with less effort compared to Java.
Find a good piece of Scala code and try to see how you would do it in Java - it will be self-evident.
Real-world applications are getting more event-driven, which involves passing data across different processes or systems and therefore favours immutable data structures.
In most cases we are either manipulating data or waiting on a resource.
In that case it's easy to hook in a callback with actors.
Take a look at
http://pavelfatin.com/scala-for-project-euler/
which gives you some examples of using functions like map, filter, etc. Functions like these are used routinely in Ruby applications.
The combination of immutability and recursion avoids a lot of stack overflow problems. This comes in handy when dealing with event-driven applications.
Akka (akka.io) is a classic example of what can be built very concisely in Scala.

What is the best way to manage mutable state?

I just finished Martin Odersky's Scala class on Coursera. Scala being my first FP language, I was excited by the idea of limiting mutable state. This allows for much easier concurrency and also makes the code super maintainable.
While learning all this, I realized you could guarantee the immutability of an object as long as it had no mutable variables and only referenced immutable objects. So now I can do everything by creating a new state instead of modifying an old one, using tail recursion where possible.
Great. But I can only do this so far up the chain. At some point, my application needs to be able to modify some existing state. I know to put concurrency control in at this point: locks, blah blah. I'm still defaulting to the standard multi-threaded concurrency control I've always used.
Oh scala community, is there a better way? Monads maybe?
EDIT: this question is a bit general, so I wanted to give a use case:
I have a machine learning algorithm that stores several collections of data. The collections have functions that return updated representations of the data (training, etc.), all immutable. Ultimately I can keep this return-updated-state pattern going up the chain to the actual object running the simulation. This object has mutable state and holds references to the collections. I may want to distribute this across multiple cores or multiple systems.
This is a bit of a subjective question, so I won't attempt to answer the 'which is best' part of it. If your chief concern is state in the context of multithreaded concurrency, then one option may be Software Transactional Memory.
There is an implementation of STM provided by Akka (see the quickstart). Depending on your use case, it might be heavyweight or overkill, but then again, it might be preferable to a mess of locks. Unlike locks, STM tends to be optimistic, in the same way database transactions are. As with database transactions, you make changes to shared state explicitly in a transactional context, and the changes you describe will be committed atomically or re-attempted if a conflict is detected. Basically, you have to wrap all your state in Refs, which can be manipulated only in an atomic block - implemented as a method that takes a closure within which you manipulate your Refs - and ScalaSTM ensures that the whole set of operations on your state either succeeds or fails; there will be no half-way or inconsistent changes.
This leverages Scala's implicit parameters: all operations on Refs require a transaction object as an argument, which is received by the closure given to atomic and can be declared implicit, so all the code within atomic can be written in a very natural yet safe style.
The catch is that, for this to be useful, you do need to use the transactional data structures provided; that means TSet instead of Set, TMap instead of Map. These provide all-or-nothing update semantics when used in a transactional context (within an atomic block). They are very much like Clojure's persistent collections. You can also build your own transactional data structures out of Refs for use within these atomic blocks.
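A minimal ScalaSTM sketch of the idea (the account Refs are illustrative):
import scala.concurrent.stm._

val from = Ref(100) // transactional references to shared state
val to   = Ref(0)

// Either both writes commit or the transaction retries; no other thread
// can ever observe a half-done transfer.
def transfer(amount: Int): Unit =
  atomic { implicit txn =>
    from() = from() - amount
    to()   = to() + amount
  }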
If you are not averse to parentheses, the Clojure explanation of refs is really good: http://clojure.org/refs
Depending on your use case, you might be able to stick with deeply immutable object structures which you partially copy instead of actually mutating (similar to how an "updated" immutable list shares a suffix with its original list). So-called lenses are a nice way of dealing with such structures; read about them in this SO question or in this blog post.
Sticking with immutable structures of course only works if you don't want changes to be globally observable. An example where immutable structures are most likely not an option are two concurrent clients working on a shared list, where the modifications done by client A must be observable by client B, and vice versa.
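A small sketch of such a partial copy on nested case classes (the types are illustrative; lenses mostly remove the boilerplate this shows):
case class Address(city: String, zip: String)
case class Person(name: String, address: Address)

val p = Person("Ada", Address("London", "N1"))

// A new Person sharing everything except the updated city; p itself is untouched.
val moved = p.copy(address = p.address.copy(city = "Cambridge"))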
I suggest the best way is to store the mutable variable inside an Akka actor, and to use message passing in and out of the actor to send and receive values. Use immutable data structures.
I have a StorageActor as follows. The variable entityMap gets updated every time something is stored via a StoreEntity message. It doesn't need to be volatile, because the actor processes messages one at a time, and it still works.
The Akka actor is the place where things can change; messages passed in and out of it stay in the pure functional world.
import akka.actor.Actor
import java.util.UUID
import com.orsa.minutesheet.entity.Entity

case class EntityRef(entity: Option[Entity])
case class FindEntity(uuid: UUID)
case class StoreEntity[T >: Entity](uuid: UUID, entity: Option[T])

class StorageActor extends Actor {
  // All mutation is confined to this actor; messages are processed one at a time.
  private var entityMap = Map[UUID, Entity]()

  private def findEntityByUUID(uuid: UUID): Option[Entity] = entityMap.get(uuid)

  def receive = {
    case FindEntity(uuid) => sender ! EntityRef(findEntityByUUID(uuid))
    case StoreEntity(uuid, entity) =>
      entity match {
        case Some(store) => entityMap += uuid -> store.asInstanceOf[Entity]
        case None        => entityMap -= uuid // a None payload removes the entry
      }
  }
}