Clojure or Scala for bioinformatics/biostatistics/medical research [closed] - scala

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
I am not a professional programmer (my area is medical research), but I am quite capable in C/C++, and various scripting languages. A while back I got intrigued by Lisp, but I never got the time to seriously learn it. After a brief exposure to R I decided to invest more time in a functional programming language.
I would like the practicality of a JVM language and thus narrowed to Clojure and Scala. From what I understand, both can use already existing Java libraries and given at performance-critical code can be delegated to Java, have the potential to perform relatively equally well.
How do these languages compare in the application space I need them for?
Are There any real-life projects in bioinformatics using either?
Already existing code would be a serious plus, as would be good documentation and a fairly gentle learning curve. Also, how does the concurrency model of the two compare with each other?
Any significant advantages/disadvantages any one has?

I can personally vouch for Clojure as a great tool for this kind of work. (I believe Scala would be great too, I just have less experience with it).
My personal research is in the field of predictive modelling / machine learning and is very computationally intensive - so I think it has many parallels with bioinformatics or biostatistics.
My personal approach / setup includes:
Incanter used primarily as a data visualisation tool. Great for producing quick visualisations which are usually just 1-liners at the REPL. There are also lots of statistical and numerical processing tools which I believe use the Colt library under the hood. I'm not an expert in R but I understand that Incanter is roughly "R translated to Clojure/Lisp".
Exploiting quite a few Java libraries as needed. Some of these are my own, for example algorithms that I have written in Java in order to get the best possible fine-tuned performance out of the JVM. But you could equally easily use any of the other great Java libraries available, as calling Java from Clojure is very simple (.methodName object param1 param2)
Quite a lot of higher order functions to automate my workflow. For example I have a higher order function that will run an optimisation algorithm of any kind in a loop for a specified amount of time and then produce an Incanter graph of the improvement on each iteration. Not rocket science, but really easy to code up in a few lines of Clojure.
Never really having to worry about performance. You can make Clojure go pretty fast if you want to (e.g. with type hints, primitive arithmetic support etc.) but normally it's irrelevant as you're going to spend 99%+ of your cycles in well-optimised library code anyway. Hence a bit of overhead in the "glue" code is negligible - I feel I gain much more in terms of personal productivity by having a dynamic, high-level, functional language to work in.
Major use of Clojure's concurrency features - this has to be one of Clojure's strongest features. I tend to use the STM to code concurrent processes with transactions that can't interfere with each other, then kick off long-running calculations in a future so that I can get on with other tasks and wait for notification of the result.
A slowly growing collection of macros to "extend the language" when needed. I actually use macros less than I thought I would (higher order functions are often a better choice). But when you need them they are invaluable - this is where you really appreciate the value of a homoiconic language. Since they effectively allow you to add new syntax to the language itself, they are very powerful when used correctly to build the DSL that you need.
In short - I don't think you can go wrong with Clojure as a researcher.
The one thing I probably wouldn't use it for (yet) is actually writing a new numerical library - this would probably be better done in Scala or pure Java as you would probably want to adopt a more imperative / OOP style.

I am not sure about bioinformatics and biostatistics per se, but I do scientific data analysis frequently and I appreciate that Scala allows me to write as-fast-as-Java code with relative ease. I believe that it is often possible in Clojure now, but I haven't seen the benchmarks to back that up. For the time being, I think the prudent thing to assume is that they do not perform equally well. See, for example, the Computer Languages Benchmark Game, where Scala is faster than Clojure in every single test. (Ignore the horrible "pidigits" result for Clojure--Scala (and Java) are calling the GMP library written in C, which Clojure could do but because of a technical detail requiring a different wrapping for the library, isn't presently allowed in the game). Looking at multicore comparisons doesn't improve Clojure's showing, and note that the Clojure code is no shorter for these sorts of lowish-level algorithmic tasks.
Clojure is ahead for the time being with parallel collections, though the upcoming 2.9 release of Scala should make up much of the difference. Neither has a gentle learning curve when coming from C++; Scala is maybe a little easier given that the syntax outwardly looks a little more familiar. I believe there are good materials for learning each of them.
Edit: P.S. You can call R from Java (and therefore from either Clojure or Scala) using rJava (specifically the JRI interface). Edit to edit: and, these days, rScala.
Edit #2: Scala was faster than Clojure in everything at the time of writing; as of this edit, Clojure's a little ahead in one (at the cost of a huge amount of code)--but anyway, the overall point stands. (And the Scala implementation on that one test could be sped up.)

If you like R, give Incanter a try! It's R for Clojure.
Scala's is geared toward being syntactically easy for people coming from Java, which was intended to be syntactically easy for people coming from C though with two levels of indirection like this the advantage may be lost.
Clojure is getting a lot of traction in the Big Data space and maps very well onto Hadoop jobs for Huge Data. I think this would be a big advantage in the bioinformatics world.
Really, these things are largely personal taste so try both and see that makes you happy :)
If you are looking to get a feel for Clojure without a lot of "intellectual overhead" may I suggest using leiningen to get a test project started quickly?

To build on Rex's answer I would like to add some Scala libraries/products that may be of interest to you:
ADAM
Spark (sparkseq, 2)
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
ScalaLab: MATLAB-like scientific computing in Scala
ScalaNLP: Collection of libraries for natural language processing (NLP), machine learning, and statistics.
Factorie: Toolkit for deployable probabilistic modeling
Gridgain: Compute cluster for Scala and Java
BioScala: Bioinformatics for the Scala programming language

I don't know Scala, so I can't offer a comparison, but I am actively using Clojure in bioinformatics projects.
The Java integration is excellent, and I have had no problem making use of the BioJava libraries.
Where Clojure's concurrency model shines is in the immutable default data types and functional programming with the seq abstraction.
In my bioinformatic work I very often find myself with a lot input data (say gene sequences) which need to be subjected to the same analysis. Once I have my analysis function I can map it over a sequence of inputs (with the results lazily generated). I have gotten full utilization of a large 48-core server simply by changing that map to a pmap.
Large scale parallelization with a single character change is hard to beat!
Of course pmap isn't a magic bullet and only helps when the analysis function computationally dominates, but the fact that map and pmap can just be plugged in and out shows the elegance and simplicity enabled by Clojure's design.

I am only passingly familiar with Scala, so the best I can do is evangelize a bit for Clojure. It's a great language, but take all this advice with a grain of salt as it's coming from an enthusiast.
If you are looking for concurrency, Clojure is fantastic both for ease of programming and for performance. The immutable data structures mean that it's trivial to work with a coherent snapshot of the world without any manual and error-prone locking; the STM makes it fairly simple to change data in a thread-sensitive way without breaking anyone else's snapshots.
My understanding is that Scala has a lot of the nice functional tools that Clojure does, but Clojure will always win syntactically by virtue of being a Lisp. If you're looking to do some specialized bioinformatics stuff, Clojure is able to hide the bits of Lisp that you don't want, and raise your own constructs to the same level as the built-in language constructs. I can't find the reference right now, but there's some well-known quote about Lisp that goes like:
Lisp is not the perfect language for any program. But it is the perfect language for building the perfect language for every program.
That's horribly paraphrased, but in my experience it has been true. It looks like you'll want a fairly specialized set of tools, and no language will make those feel as natural as a Lisp.

You have to ask yourself how important functional programming is for you. You know C++ so you probably know OO. I would say it's easier to do FP in Clojure (because you can't really drop back to OO-style) in Scala you will probebly end up dropping FP and do more OO style.
I can't really say anything about your application space.
Since you mentioned R, there is an R-like Clojure library for statistics called Incanter. I don't know about other existing projects in your application space.
There is a lot of information about both languages, so that should not be a problem. The learning curve is kind of steep with both languages. Clojure is a much smaller language and since you already know some lisp it should not be to hard to learn the important stuff. Scala has a type system that will be hard to pick up especially since your main experience is with C/C++.
Both languages have great concurrency models and you will probably be happy with both.

I have some experience in Scala and only little knowledge in Clojure, but I programmed Lisp many years ago.
Lisp is a beautiful language, but it never made it to the world, because it was too limited. I believe you need a statically-typed language to develop robust systems. The type system in Scala is not difficult to master to benefit from it. If you want to do very advanced things with it to make your libraries idiot-proof, you can, but then you will need to study the type system a little more.
Scala favours immutable types, but you can use mutables without any problem, which you sometimes do need. Concurrency in Scala is very well implemented and frameworks like akka extend and enhance these possibilities.
Scala stands a better chance to become a mainstream language since it's a fuller language. I'm afraid that Clojure is too much like Lisp (but reimplemented on the JVM). I liked Lisp a lot, but it had too many disadvantages for real-life programs. With Scala I think we have the best of both worlds (OO and functional) in a clean marriage. On top of that, Scala seems to really catch on in the market.

We have been working on some experimental code in the Rudolf/BioClojure project on GitHub. Also, look at Jan Aert's BioClojure project which is more structured.
Additionally, there is a BioCaml project in the works...

Related

Why was NetLogo implemented in a mixture of Java and Scala?

Why was NetLogo implemented in a mixture of Java and Scala? Was it because of better support for concurrency? I am not familiar with Scala, but I think the functional programming style is better suited for expressing guarantees for concurrent programming.
The following is based on my experience as the lead developer of NetLogo from 2001 until 2014.
All of the Java code in NetLogo is old code that we simply never got around to converting to Scala.
The NetLogo project began around 1999, long before Scala existed.
By the time we switched to Scala in 2008, we had accumulated a great deal of Java code. Over time, a lot of that Java code has been converted to Scala, but not all of it. Converting absolutely all of it to Scala would be a big effort; it isn't worth it.
Pretty much all new code in NetLogo, since 2008, is in Scala.
Concurrency was not a major motivation for switching from Java to Scala. (In most of NetLogo, there are only two threads, the GUI thread and the job thread. Only BehaviorSpace uses more threads, and they are relatively easy to keep separate in either language.)
The main motivation for using Scala was simply that Scala is a much more powerful, expressive, flexible, and elegant language than Java, with much better support for functional programming and immutability, which are a great help in writing elegant, correct code.
That makes Scala is more fun and satisfying to work in, and the resulting code is likelier to be correct and is shorter and more maintainable. Using Scala also (more often than not, in our experience) attracts smart people to the development team.

Cost of Scala's immutable object creation [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 8 years ago.
Improve this question
I see posts like the for-comprehension in [1] and it really makes me wonder what the overall implication of using the immutable Map vs a Mutable one is. It seems like Scala developers are very comfortable with allowing mutations of immutable data structures to incur the cost of a new object- or maybe I'm just missing something. If every mutation operation on an immutable data structure is returning a new instance, though I understand it's good for thread safety, but what if i know how to fine-tune my mutable objects already to make these same guarantees?
[1] In Scala, how can I do the equivalent of an SQL SUM and GROUP BY?
In general, the only way to answer these kind of performance questions is to profile them in your real-world code. Microbenchmarks are often misleading (see e.g. this benchmarking tale) - and particularly if you're talking about concurrency the best strategy can be very different depending on how concurrent your use case is in practice.
In theory, a Sufficiently Smart Compilerâ„¢ should be able - perhaps with the help of a linear type system (inferred or otherwise) - to reproduce all the efficiency advantages of a mutable data structure. In fact, since it has more information available about the programmer's intent and is less constrained by incidental details that the programmer had to specify, such a compiler ought to be able to generate higher-performance code - and e.g. GCC rewrites code into immutable form (SSA) for optimization purposes. For an example that hits closer to home, many real-world Java programs have perfectly adequate throughput, but have issues with latency caused by Java's garbage collector stopping the world to compact the heap. A JVM that was aware that certain objects were immutable would be able to move them without stopping the world (you can simply copy the object, update all references to it, and then delete the old copy, since it doesn't matter if some threads see the old version while some of them see the new one).
In practice, it depends, and again the only way is to benchmark your specific case. In my experience, for the level of investment of programmer time that's available for most practical business problems, spending x hours on a (immutable) Scala version tends to yield a more performant program than spending the same time on a mutable Scala or Java version - indeed, in the amount of programmer time it takes to produce an acceptably-performing Scala version it would probably be impossible to complete a Java version at all (particularly if we require the same defect rate). On the other hand, if you have unlimited expert programmer time available and need to get the absolute best performance possible, you would probably want to use a very low-level mutable language (this is why LAPACK is still written in Fortran) - or even implement your algorithm directly on an FPGA as JP Morgan recently did.
But even in this case you probably want to have a prototype in a higher-level language so that you can write tests and compare the two to confirm that the high-performance implementation works correctly. Particularly if we're just talking about mutable vs. immutable in Scala, premature optimization is the root of all evil. Write your program, and then if performance is inadequate, profile it and look at the hotspots. If you really are spending too much time copying an immutable data structure, that's an appropriate time to replace it with a mutable version, and carefully check the thread safety guarantees by hand. If you're writing properly decoupled code then it should be easy to replace the performance-critical pieces as and when you need to, and until then you can reap the development time gains of code that's simpler and easier to reason about (particularly in concurrency cases). In my experience performance problems in well-written code are a lot less likely than people expect; most software performance issues are caused by a poor choice of algorithm or data structure rather than this kind of small overhead.
Your question starts with a wrong assumption, based on a misunderstanding of the cost incurring of using immutable objects.
Working with guaranteed immutable objects that are build form immutable objects allows you to use structural sharing, so you can create new objects based on the old ones without having to resort to a deep copy of the object and you can ,roughly spoken, reuse parts of the object the new on is based on.
So this mitigates the impact of using immutable objects greatly.
So what is the difference to fine-tuned, hand-crafted mutable objects ?
immutable objects fit better for the FP paradigma
compile time optimization and checks
lowers the chance of runtime exceptions
The question is very generic, so it is hard to give a definite answer. It seems that you are just uncomfortable with the amount of object allocation happening in idiomatic scala code using for comprehensions and the like.
The scala compiler does not do any special magic to fuse operations or to elide object allocations. It is up to the person writing the data structure to make sure that functional data structures reuse the as much as possible from previous versions (structural sharing). Many of the data structures used in scala collections do this reasonably well. See for example this talk about Functional Data Structures in Scala to give you a general idea.
If you are interested in the details, the book to get is Purely Functional Data Structures by Chris Okasaki. The material in this book applies also to other functional languages like Haskell and OCaml and Clojure.
The JVM is extremely good at allocating and collecting short-lived objects. So many things that seem outrageously inefficient to somebody accustomed to low level programming are actually surprisingly efficient. But there are definitely situations where mutable state has performance or other advantages. That is why scala does not forbid mutable state, but only has a preference towards immutability. If you find that you really need mutable state for performance reasons, it is usually a good idea to wrap your mutable state in an akka actor instead of trying to get low-level thread synchronization right.

Is Scala faster than Java 7 for number crunching and for heavy string processing?

Assume there are two class of applications:
(1) Intensive number crunching and numerical and mathematical computations
(2) Intensive string regex expression matching, xpath searching, and other string manipulations where strings are mostly stored in collection classes.
In Both cases assume clients access these applications thousands of times per second or even in parallel.
So if I have the choice to implement the applications in the server backends, I can choose either Java 7 or Scala. Which one should I choose to get faster performance and produce more reliable code?
Google did some benchmarks recently that you might find interesting - see paper linked to here: http://www.readwriteweb.com/hack/2011/06/cpp-go-java-scala-performance-benchmark.php
The paper is surprisingly un-scientific, but you will get a rough feel for what can be done. Of particular interest may be section V.F
Daniel Mahler improved the Scala version by creating a
more functional version, which is kept in the Scala Pro
directories. This version is only 270 lines of code, about 25%
of the C++ version, and not only is it shorter, run-time also
improved by about 3x. It should be noted that this version
performs algorithmic improvements as well, and is therefore
not directly comparable to the other Pro versions.
It's not clear to me whether this version with algorithmic improvements is included in their speed benchmark table (I don't think so), but it does indicate that you may be able to produce performance improvements by adopting algorithmic improvements that are more viable to implement in Scala. It won't do much for simple string processing, however.
A big factor will be how competent you are in programming these languages, and how good you are at optimizing them. Java is obviously more verbose but you're less likely to run into performance "gotchas".
Two points which might enable better performance for numerical computations than in Java:
The practical one: Scala makes it extremely easy to enable parallel computation of "embarrassingly parallel" problems. While the same could be done in Java it would require much more time and expertise, making it likely that it will only be done in rare circumstances.
The technical one: Scala can specialize generic data structures for primitive types, making boxing/unboxing unnecessary. The Java compiler is not able to do that.
Scala uses Java's String so the amount of possible improvements here is quite limited. But there are other data structures like ropes which provide better performance than String in some cases.
Depending on your expertise and effort, I would expect that you can get better results here or there. Normally, with an infinite amount of development time and money, you can improve, improve and improve your code in every language. (Think of bigger and bigger caches, specialised sorters, precomputed defaults and so on).
With a good understanding of both languages and some experience in performance questions of your field, I wouldn't expect much differences, but you could save some time by the more collection friendly scala approach, and the time, saved on normal development, could be spend in performance analysis and improvement.
There is in principle not really a reason why Scala would be faster than Java for number crunching applications.
I would not choose Java or Scala or any other JVM language if I wanted to write a serious high-performance number crunching application.
From my own experience (and ofcourse this is only anecdotal evidence and definitely not proof that this is true in all cases) the JVM is not the best suited platform for heavy number crunching. If raw number crunching speed is important you would probably be better off with something that's more close to the "metal", for example C++, which allows you to for example use Intel SSE instructions and do other low-level optimizations, or use the GPU with CUDA if your algorithm is suitable for that.

Why is Scala good for concurrency?

Are there any special concurrency operators, or is functional style programming good for concurrency? And why?
At the moment, Scala already supports two major strategies for concurrency - thread-based concurrency (derived from Java) and type-safe Actor-based concurrency (inspired by Erlang). In the nearest future (Scala 2.9), there will be two big additions to that:
Software Transactional Memory (the basis for concurrency in Clojure, and probably the second most popular concurrency-style in Haskell)
Parallel Collections (without going into detail, they allow for parallelizing basic collection transformers, like foreach or map across multiple threads).
Actor syntax (concurrency operators) is heavily influenced by Erlang (with some important additions) - with regards to the library you use (standard actors, Akka, Lift, scalaz), there will be different combinations of question and exclamation marks: ! (in the most cases for sending a message one-way), !!, !?, etc.
Besides that, first-class functions make your life way easier even when you work with old Java concurrency frameworks: ExecutorService, Fork-Join Framework, etc.
Above all of that stands immutability that simplifies concurrency a lot, making code more predictable and reliable.
There are a number of language features that make Scala good for concurrency. For example:
Most of the data structures are immutable and don't require anything special for concurrent access.
The functional style is a really good way to do highly concurrent operations.
Scala includes a really handy "actor" framework that helps with concurrent asynchronous operations.
Further reading:
http://www.ibm.com/developerworks/java/library/j-scala02049.html
http://blog.objectmentor.com/articles/2008/08/14/the-seductions-of-scala-part-iii-concurrent-programming
http://akka.io
Well, there is the hype and there is the reality. Scala got a fame for being good for concurrency because it is a functional language and because of its actors library. Functional languages are good for concurrency because they focus on immutability, which helps concurrent algorithms. Actors got their reputation because they are the base to Erlang's track record of massively concurrent systems.
So, in a sense, Scala's reputation is due to being a "me too" of successful techniques. Yet, there is something that Scala does bring to the table, which is its ability to support such additions to the language through libraries, which makes it able to adapt and adopt new techniques as they are devised.
Actors are not native to Scala, and yet there are already there different libraries in wide use that all seem to be. Neither is transactional memory, but, again, there are already libraries that look like they are.
They, these libraries, are even available for java, but there they are clunky to use.
So the secret is not so much what it can do, but that it makes it look easy.
The big keyword here is immutability. See this Wiki page. Since any variable can be defined as mutable or immutable, this is a big win for concurrency, since if an object cannot be changed, it is thread safe and therefore writing concurrent programs is easier.
Just to rain on everyone's parade a bit, see this question.
I've dual-coded a fair few multithreaded things in Scala (mainly using Futures, a bit with Actors too) and C++ (using TBB) since then (mostly Project Euler problems). General picture seems to be that Scala needs ~1/3 the number of lines of code of the C++ solution (and is quicker to write), but the C++ solution will be ~x10 faster runtime (unless you go to some efforts to avoid any "object churn", as shown in the fast version in the answer referenced above, but at that point you're losing much of the elegance of Scala). I'm still on 2.7.7 mind you; haven't tried Scala 2.8 yet.
Using Scala was my first real encounter with a language which strongly emphasised immutability and I'm impressed by its power to hugely simplify the mental model you have to maintain of object "state machines" while programming (and this in turn makes coding for concurrency easier). It's certainly influenced how I approach C++ coding.

Does Perl language aim at producing fast programs at runtime?

I recently had a friend tell me
"see perl was never designed to be fast"
Is that true?
The relevant piece of information I can find is this from Wikipedia:
The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).
But it doesn't directly talk about speed. I think that with all the text processing that it needs to do, speed of execution really matters for a language like Perl. And with all the weird syntax, elegance was never an objective, I agree.
Was high speed of execution one of the design objectives of Perl?
There is one important aspect to be considered : algorithms. Perl secret weapons are the algorithms backing certain language features and the CPAN library.
Good algorithms trump raw execution speed for non trivial problems. It typically takes more effort to select and implement algorithms in C-like languages than in Perl. This means that for half a day coding some little tool the perl version often outperforms a C version because it was easier to make good datastructures with hashes and by using the features provided in the language and libraries.
Once a Perl script starts running (i.e. after loading and compiling everything), it can be very speedy. It's that yucky compile-every-time that's a bit nasty.
However, I find that people don't really have to worry about how fast Perl can be. They waste all of their time by implementing stupid designs that do a lot more work than they need to do, misunderstanding key technologies, or just being boneheaded. It's not uncommon for me to help someone make their stuff go orders of magnitude faster by just tuning in the right places. That's not particular to Perl though. People have that problem with every language.
Perl has always aimed toward practicality, not anything (even close to) some sort of ivory tower purity, where a few goals are given absolute priority, and others are ignored (completely or nearly so).
As such, I think it's reasonable to say that maintaining a reasonable speed of execution has always been seen as important for Perl, but there are other factors (especially things like flexibility and ease of use) that are generally more important, so if a choice has to be made between one of them and speed of execution, the other factor will generally win unless the effect on execution speed is really serious.
I would have said that a language that designed for optimal run time performance would not have constructs that allow compiling while running. So no, perhaps.
It became a design objective as of Perl 5.0. But keep in mind it is still interpreted, so it is fast for an interpreted language.