Is Scala faster than Java 7 for number crunching and for heavy string processing? - scala

Assume there are two classes of applications:
(1) intensive number crunching and numerical/mathematical computation;
(2) intensive regular-expression matching, XPath searching, and other string manipulation, where the strings are mostly stored in collection classes.
In both cases assume clients access these applications thousands of times per second, possibly in parallel.
If I have the choice when implementing the server backend, I can pick either Java 7 or Scala. Which one should I choose to get faster performance and more reliable code?

Google did some benchmarks recently that you might find interesting - see paper linked to here: http://www.readwriteweb.com/hack/2011/06/cpp-go-java-scala-performance-benchmark.php
The paper is surprisingly un-scientific, but you will get a rough feel for what can be done. Of particular interest may be section V.F
Daniel Mahler improved the Scala version by creating a
more functional version, which is kept in the Scala Pro
directories. This version is only 270 lines of code, about 25%
of the C++ version, and not only is it shorter, run-time also
improved by about 3x. It should be noted that this version
performs algorithmic improvements as well, and is therefore
not directly comparable to the other Pro versions.
It's not clear to me whether this version with algorithmic improvements is included in their speed benchmark table (I don't think so), but it does indicate that you may be able to gain performance by adopting algorithmic improvements that are easier to implement in Scala. It won't do much for simple string processing, however.
A big factor will be how competent you are in programming these languages, and how good you are at optimizing them. Java is obviously more verbose but you're less likely to run into performance "gotchas".

Two points which might enable better performance for numerical computations than in Java:
The practical one: Scala makes it extremely easy to enable parallel computation of "embarrassingly parallel" problems. While the same could be done in Java it would require much more time and expertise, making it likely that it will only be done in rare circumstances.
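For illustration, here is a minimal sketch of what that looks like with Scala's parallel collections. The workload is made up, and the import is only needed on Scala 2.13+, where parallel collections moved to the separate scala-parallel-collections module (on the Scala versions contemporary with this question, .par was built in):

    import scala.collection.parallel.CollectionConverters._

    object ParDemo extends App {
      val inputs = (1 to 1000000).toVector
      // .par switches to a parallel collection; map and sum then run on a
      // fork/join pool across all available cores
      val total = inputs.par.map(n => math.sqrt(n.toDouble)).sum
      println(total)
    }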
The technical one: Scala can specialize generic data structures for primitive types, making boxing/unboxing unnecessary. The Java compiler is not able to do that.
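A rough sketch of what that looks like with the @specialized annotation; the accumulator here is only an illustration, and specialization has code-size costs and caveats of its own:

    object SpecializedDemo extends App {
      // The compiler emits extra Int and Double variants of this class, so
      // adding Ints does not box them into java.lang.Integer.
      class Accumulator[@specialized(Int, Double) T](zero: T)(plus: (T, T) => T) {
        private var acc: T = zero
        def add(x: T): Unit = { acc = plus(acc, x) }
        def value: T = acc
      }

      val intSum = new Accumulator[Int](0)(_ + _) // uses the Int-specialized variant
      intSum.add(42)
      println(intSum.value)
    }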
Scala uses Java's String so the amount of possible improvements here is quite limited. But there are other data structures like ropes which provide better performance than String in some cases.

Depending on your expertise and effort, I would expect that you can get better results here or there. Normally, with an infinite amount of development time and money, you can improve, improve and improve your code in every language. (Think of bigger and bigger caches, specialised sorters, precomputed defaults and so on).
With a good understanding of both languages and some experience with the performance questions of your field, I wouldn't expect much difference. But you could save some time with Scala's more collection-friendly approach, and the time saved on ordinary development could be spent on performance analysis and improvement.

There is in principle not really a reason why Scala would be faster than Java for number crunching applications.
I would not choose Java or Scala or any other JVM language if I wanted to write a serious high-performance number crunching application.
From my own experience (and of course this is only anecdotal evidence, definitely not proof that it holds in all cases) the JVM is not the best-suited platform for heavy number crunching. If raw number-crunching speed is important, you would probably be better off with something closer to the metal, for example C++, which allows you to use Intel SSE instructions and other low-level optimizations, or to use the GPU with CUDA if your algorithm is suitable for that.

Related

Do cats and scalaz create performance overhead on application?

I know this may seem like a naive question, but due to my limited programming knowledge it came to mind.
Cats and scalaz are used so that we can write Scala in a pure functional programming style, similar to Haskell. But to achieve this we have to add those libraries as extra dependencies to our projects, and ultimately we end up wrapping our code in their objects and functions, which means extra code and extra dependencies.
I don't know whether this creates larger objects in memory.
That is what makes me wonder. So my question: will I face any performance issues, such as higher memory consumption, if I use cats/scalaz?
Or should I avoid them if my application needs performance?
Do cats and scalaz create performance overhead on application?
Absolutely.
The same way any line of code adds performance overhead.
So, if that is your concern, then don't write any code at all (well, actually the world might be simpler if we had never tried all this).
Now, snarky answer aside, the proper question you should be asking is: "Is the overhead of library X harmful to my software?" Remember that this applies to any library, and really to any code you write, any algorithm you pick, etc.
And, in order to answer that question, we need a few things first.
Define the SLAs the software you are writing must hold. Without those, any performance question / observation you made is pointless. It doesn't matter if something is faster / slower if you don't know if that is meaningful for you and your clients.
Once you have SLAs you need to perform stress tests to verify if your current version of the software satisfies those. Because, if your current code is performant enough, then you should worry about other things like maintainability, testing, adding more features, etc.
PS: Remember that those SLAs should not be raw numbers but be expressed in terms of percentiles, the same goes for the results of the tests.
When you find that you are failing your SLAs, then you need to do proper benchmarking and debugging to identify the bottlenecks of your project. As you saw, caring about performance can be done on each line of code, but that is a lot of work that usually doesn't produce any relevant output. Thus, instead of evaluating the performance of everything, we find the bottlenecks first: those small pieces of code that make the biggest contribution to the overall performance of your software (remember the Pareto principle).
Remember that in this step we have to look at the whole picture; the network matters too (and you will see that this last one is usually the biggest slowdown; thus, you would usually rather look for architectural solutions, like using Fibers instead of Threads, than try to optimize small functions. Also, sometimes the easier and cheaper solution is better infrastructure).
When you find the bottleneck, then you need to formulate some alternatives, implement those and not only benchmark them but do Statistical hypothesis testing to validate if the proposed changes are worth it or not. And, of course, validate if they were enough to satisfy the SLAs.
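Purely as an illustration, a benchmark of such an alternative could look like the following JMH sketch (run via the sbt-jmh plugin); the two compared operations are made up, and the numbers only mean something against your own SLAs on your own hardware:

    import org.openjdk.jmh.annotations._
    import cats.implicits._

    @State(Scope.Thread)
    class FoldBenchmark {
      var xs: List[Int] = List.empty

      @Setup
      def prepare(): Unit = xs = (1 to 10000).toList

      // plain standard-library fold
      @Benchmark
      def plainSum: Int = xs.sum

      // the cats equivalent, going through Foldable/Monoid
      @Benchmark
      def catsFoldMap: Int = xs.foldMap(identity)
    }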
Thus, as you can see, performance is an art and a lot of work. So, unless you are committed to doing all this then stop worrying about something you will not measure and optimize properly.
Rather, focus on increasing the maintainability of your code. This actually also helps performance, because when you find that you need to change something you would be grateful that the code is as clean as possible and that the whole architecture of the code allows for an easy change.
And believe me when I say that using tools like cats, cats-effect, fs2, etc. will help in that regard. Also, they are actually pretty well optimized at their core, so you should be fine for a lot of use cases.
Now, the big exception is that if you know that the work you are doing will be very CPU and memory bound then yeah, you pretty much can be sure all those abstractions will be harmful. In those cases, you may even want to stay away from the JVM and rather write pretty low-level code in a language like Rust which will provide you with proper tools for that kind of problem and still be way safer than plain old C.

What is the obvious advantage of using AMPL?

I am doing a project using the CPLEX solver, on Netbeans with Java. We have several optimization problems to solve; I have already solved one of them by coding all the constraints, the objective and the variables in Java, without using AMPL. However, some people on my team want to use AMPL.
Thus, as I don't want to read the whole AMPL book to find the answer: is there an obvious reason to use AMPL rather than coding all the constraints "manually"? Moreover, can AMPL be integrated into Netbeans? I did not find any documentation about that.
Is AMPL useful when the constraints need to be "flexible" (that is, we can't guess the exact number of constraints in advance; it depends on parameters fixed by the user, and modularity is a high-importance factor)?
I am really curious to hear about this!
Thanks for the help
AMPL is an algebraic modeling language and quoting from that link:
One advantage of AMPL is the similarity of its syntax to the
mathematical notation of optimization problems.
For example, this can allow you to define groups of constraints without knowing in advance the dimensions of the model. And, perhaps, you can make big changes to your model more quickly. (You'll have to think about how often you will actually do that.)
However, one could argue that the "obvious advantage" of AMPL is that it supports dozens of different solvers. You can create your model and solve it with CPLEX, but then decide that you want to use a different solver (e.g., Gurobi, Xpress, etc.). On the AMPL Solvers web page, they have the following recommendation:
We recommend that you then test alternative solvers to determine which
offers the best tradeoff of price and performance for your needs.
The AMPL API web page says that there is a Java API, so that should allow you to include it in a Netbeans project, but I have no experience with that.
At the end of the day, you could also argue that these "advantages" are a matter of taste. Using the CPLEX Java API directly, as you have already done, is certainly a valid solution if it meets your requirements. It may allow you to build the model more efficiently, use solver-specific/advanced features that might not be supported by AMPL, and to have more fine-grained control over the model formulation.
You have just coded an optimisation model to optimise your company's production of widgets. Your company got a really good deal on $SOLVER1 so that's what you're using.
Over the next ten years, you improve and extend that model as your bosses throw new requirements at you. By the end of that time, you may have tens of thousands of lines of optimisation code as part of a system that, by now, is absolutely critical to your company's operations.
Your company's original licensing deal has expired, and the manufacturers of $SOLVER1 have massively increased the licensing fees, so you're now paying hundreds of thousands a year in licensing costs.
Meanwhile, the boffins at a rival company have just released a new version of $SOLVER2. It has fancy new algorithms that could solve the widget optimisation problem 20% faster and find better solutions than $SOLVER1 is giving you. It doesn't cost any more than $SOLVER1 and the performance is better.
Meanwhile, the open-source community has released $FREESOLVER. It might not be quite as powerful as the top commercial options, but it's as good as $SOLVER1 was ten years ago, and if you weren't paying $100k/year for licensing you could rent an awful lot of server time to make up for it.
...so, did you write your optimisation model on a platform that lets you switch to a new solver and take advantage of these opportunities without having to jettison ten years' worth of code?
There are huge advantages to being able to switch solvers quickly and easily. I know of one company who uses three different solvers for their work: they try two different open-source solvers both running in the cloud, and if neither of those can find an adequate solution then they throw it to an expensive solver with smarter algorithms. The open-source solvers handle 90% of their problems, so they only have to use the commercial solver for the last 10%, which allows them to make significant savings on their licensing costs.
One option we've discussed at my work is to use a commercial solver for mission-critical work, and open-source alternatives for applications like training or small-scale prototyping where we don't have the same requirements. That way we can minimise the number of concurrent users we need to license for the commercial solver.
(And, yes, there is still an issue of lock-in with the platform, but platforms like AMPL are significantly cheaper than a high-end commercial solver.)
Totally agree with everything that rkersh says. Also note that you should never write your model in a way that hard-codes details of your problem sizes, etc., whether you write it in an algebraic modelling language or through one of the more direct APIs.
Also, working with a modelling language gives you an extra level/layer of abstraction, which can help, especially when sharing or explaining your model to others, or when comparing it with a range of standard problem types. That said, I prefer the more nuts-and-bolts 'feel' of working with the more direct APIs, and I almost never need (or have the time & budget) to reformulate my models that deeply.
Even though "GPL" means general-purpose language, newer and newer GPLs keep coming to life, so a given GPL is "more general" for some tasks than for others... :-) In theory it should not matter whether the most efficient compiler is written for Pascal or Perl; you could write in whatever language you want and you should not lose expressivity or efficiency (e.g. for C#, which is in the same league as Java now, MS writes a better compiler than the open-source equivalent).
Humans specialize; this is why we have gotten this far :-). It is no different when it comes to the task of converting a business problem into a math model (aka modeling). The whole idea of having a dedicated modeling layer is that:
A. you get the utmost expressivity for that particular task (math modeling);
B. it enforces some modeling best practices that a GPL does not "force" on you (1. you are free to do anything, 2. that freedom is marketed to you as flexibility). E.g. AMPL, GAMS and others mix declarative code (the model) with procedural code (flow control), which is not good practice. On the other hand, separating the data from an abstract model is making its way into ALL modeling languages, though interestingly enough very slowly...
C. through point A you can maintain the code more efficiently than otherwise (contrary to API modeling; I have clients who say they turned to a modeling language because API modeling was a liability for rapid model revamps);
D. in theory you can be solver-independent.
If you look around, all modeling languages try to maintain point C, except OPL (for historical reasons). But even with OPL you get constraint programming and constraint-based scheduling (besides math programming), which you don't get with AMPL/GAMS, however solver-independent they are...
The $SOLVER1 / $SOLVER2 / $FREESOLVER comparison is a bit broken, for four reasons:
A. open-source solvers are still very far behind commercial solvers in terms of performance when it comes to large/complex problems (LP is probably becoming the exception). I have clients (the fastest sales in my memory) who bought as soon as they tested commercial solvers after their "free ride".
B. while the scenario described for $SOLVER1 and $SOLVER2 does seem plausible ($SOLVER1, the incumbent, getting more expensive over time), we have also witnessed exactly the opposite, where $SOLVER2 (a newcomer) actually increased its pricing 4x in 7 years and in some cases doubled it, while $SOLVER1 (the incumbent) did not change at all.
C. mixing up modeling capabilities and solvers is a mistake. Writing models against a solver's API is exactly what ties you to that solver, far more than a modeling language does. At a minimum, as the Hungarians say, "what you gain on the customs you lose on the ferry"; in other words, freedom (i.e. flexibility) comes with the responsibility of using it well.
D. owning a solver for development is NOT expensive at all; a company can maintain a large number of solvers (for less than $10k it could have 4+ solvers for development), test which is the fastest for any given model, and then choose the best-suited one for deployment.
In addition, the solver is just one piece of the puzzle. E.g. I have a client with disparate data sources where it takes 8 hours to create a model and 4 hours to solve it. Would this client benefit more from a more efficient data-handling suite, or from a faster solver? Modelers are too isolated from the business in most cases: in their minds a given model is perfect and how it is populated with data is secondary, yet that is what makes or breaks performance.
What I see is API modelers moving to modeling languages, not the other way around, for various reasons...
But as somebody wrote above, there is a lot of "taste in the game", so in the end, if you feel more comfortable with a given approach, nobody can blame you for choosing it... :-) After all, it is very difficult to compare against the other approach, since it is almost never tried on the same case... so ultimately what counts is the speed of getting from a business problem to a model that solves fast in the given application context :-)
Phew, that was long... but I gave it all my shots... :-)
To keep it short and illustrate the advantages/disadvantages of using AMPL, just compare using Java (AMPL) instead of assembly language (CPLEX).

Cost of Scala's immutable object creation [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I see posts like the for-comprehension in [1] and it really makes me wonder what the overall implication of using an immutable Map vs. a mutable one is. It seems like Scala developers are very comfortable with letting mutations of immutable data structures incur the cost of a new object, or maybe I'm just missing something. If every mutation operation on an immutable data structure returns a new instance, I understand that is good for thread safety, but what if I already know how to fine-tune my mutable objects to make the same guarantees?
[1] In Scala, how can I do the equivalent of an SQL SUM and GROUP BY?
In general, the only way to answer these kind of performance questions is to profile them in your real-world code. Microbenchmarks are often misleading (see e.g. this benchmarking tale) - and particularly if you're talking about concurrency the best strategy can be very different depending on how concurrent your use case is in practice.
In theory, a Sufficiently Smart Compiler™ should be able - perhaps with the help of a linear type system (inferred or otherwise) - to reproduce all the efficiency advantages of a mutable data structure. In fact, since it has more information available about the programmer's intent and is less constrained by incidental details that the programmer had to specify, such a compiler ought to be able to generate higher-performance code - and e.g. GCC rewrites code into immutable form (SSA) for optimization purposes. For an example that hits closer to home, many real-world Java programs have perfectly adequate throughput, but have issues with latency caused by Java's garbage collector stopping the world to compact the heap. A JVM that was aware that certain objects were immutable would be able to move them without stopping the world (you can simply copy the object, update all references to it, and then delete the old copy, since it doesn't matter if some threads see the old version while some of them see the new one).
In practice, it depends, and again the only way is to benchmark your specific case. In my experience, for the level of investment of programmer time that's available for most practical business problems, spending x hours on a (immutable) Scala version tends to yield a more performant program than spending the same time on a mutable Scala or Java version - indeed, in the amount of programmer time it takes to produce an acceptably-performing Scala version it would probably be impossible to complete a Java version at all (particularly if we require the same defect rate). On the other hand, if you have unlimited expert programmer time available and need to get the absolute best performance possible, you would probably want to use a very low-level mutable language (this is why LAPACK is still written in Fortran) - or even implement your algorithm directly on an FPGA as JP Morgan recently did.
But even in this case you probably want to have a prototype in a higher-level language so that you can write tests and compare the two to confirm that the high-performance implementation works correctly. Particularly if we're just talking about mutable vs. immutable in Scala, premature optimization is the root of all evil. Write your program, and then if performance is inadequate, profile it and look at the hotspots. If you really are spending too much time copying an immutable data structure, that's an appropriate time to replace it with a mutable version, and carefully check the thread safety guarantees by hand. If you're writing properly decoupled code then it should be easy to replace the performance-critical pieces as and when you need to, and until then you can reap the development time gains of code that's simpler and easier to reason about (particularly in concurrency cases). In my experience performance problems in well-written code are a lot less likely than people expect; most software performance issues are caused by a poor choice of algorithm or data structure rather than this kind of small overhead.
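To make the "replace it with a mutable version" point concrete, here is a minimal sketch (names made up): the public API stays immutable while a local mutable map does the heavy lifting and never escapes the method, which keeps the hand-checked thread-safety argument simple.

    import scala.collection.mutable

    object Hotspot {
      def wordCounts(words: Seq[String]): Map[String, Int] = {
        val counts = mutable.HashMap.empty[String, Int]
        for (w <- words) counts.update(w, counts.getOrElse(w, 0) + 1)
        counts.toMap // callers only ever see an immutable snapshot
      }
    }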
Your question starts from a wrong assumption, based on a misunderstanding of the cost incurred by using immutable objects.
Working with guaranteed-immutable objects that are themselves built from immutable objects allows you to use structural sharing: you can create new objects based on the old ones without resorting to a deep copy of the object, and you can, roughly speaking, reuse parts of the object the new one is based on.
This greatly mitigates the cost of using immutable objects.
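A minimal sketch of what structural sharing looks like with Scala's standard immutable collections:

    object SharingDemo extends App {
      val xs = List(1, 2, 3)
      val ys = 0 :: xs // O(1): ys reuses xs as its tail, nothing is copied

      val m1 = (1 to 100).map(i => i -> i.toString).toMap
      val m2 = m1 + (101 -> "101") // only the path to the new entry is rebuilt;
                                   // the rest of the hash trie is shared with m1
      println((ys.size, m2.size))
    }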
So what is the difference compared to fine-tuned, hand-crafted mutable objects?
immutable objects fit the FP paradigm better
compile time optimization and checks
lowers the chance of runtime exceptions
The question is very generic, so it is hard to give a definite answer. It seems that you are just uncomfortable with the amount of object allocation happening in idiomatic scala code using for comprehensions and the like.
The Scala compiler does not do any special magic to fuse operations or to elide object allocations. It is up to the person writing the data structure to make sure that functional data structures reuse as much as possible from previous versions (structural sharing). Many of the data structures used in the Scala collections do this reasonably well. See for example this talk about Functional Data Structures in Scala to get a general idea.
If you are interested in the details, the book to get is Purely Functional Data Structures by Chris Okasaki. The material in this book also applies to other functional languages like Haskell, OCaml and Clojure.
The JVM is extremely good at allocating and collecting short-lived objects. So many things that seem outrageously inefficient to somebody accustomed to low level programming are actually surprisingly efficient. But there are definitely situations where mutable state has performance or other advantages. That is why scala does not forbid mutable state, but only has a preference towards immutability. If you find that you really need mutable state for performance reasons, it is usually a good idea to wrap your mutable state in an akka actor instead of trying to get low-level thread synchronization right.
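For example, a minimal sketch (classic Akka API, all names made up) of confining mutable state to an actor so that only the actor's single-threaded message loop ever touches it:

    import akka.actor.{Actor, ActorSystem, Props}

    case class Add(n: Int)
    case object Get

    class CounterActor extends Actor {
      private var count = 0 // mutable, but never shared between threads
      def receive: Receive = {
        case Add(n) => count += n
        case Get    => sender() ! count // reply with the current value
      }
    }

    object CounterDemo extends App {
      val system  = ActorSystem("demo")
      val counter = system.actorOf(Props(new CounterActor), "counter")
      counter ! Add(3)
      counter ! Add(4) // messages are processed strictly one at a time
    }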

Clojure or Scala for bioinformatics/biostatistics/medical research [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I am not a professional programmer (my area is medical research), but I am quite capable in C/C++, and various scripting languages. A while back I got intrigued by Lisp, but I never got the time to seriously learn it. After a brief exposure to R I decided to invest more time in a functional programming language.
I would like the practicality of a JVM language and thus narrowed it down to Clojure and Scala. From what I understand, both can use existing Java libraries and, given that performance-critical code can be delegated to Java, both have the potential to perform roughly equally well.
How do these languages compare in the application space I need them for?
Are there any real-life projects in bioinformatics using either?
Already existing code would be a serious plus, as would be good documentation and a fairly gentle learning curve. Also, how does the concurrency model of the two compare with each other?
Any significant advantages/disadvantages any one has?
I can personally vouch for Clojure as a great tool for this kind of work. (I believe Scala would be great too, I just have less experience with it).
My personal research is in the field of predictive modelling / machine learning and is very computationally intensive - so I think it has many parallels with bioinformatics or biostatistics.
My personal approach / setup includes:
Incanter used primarily as a data visualisation tool. Great for producing quick visualisations which are usually just 1-liners at the REPL. There are also lots of statistical and numerical processing tools which I believe use the Colt library under the hood. I'm not an expert in R but I understand that Incanter is roughly "R translated to Clojure/Lisp".
Exploiting quite a few Java libraries as needed. Some of these are my own, for example algorithms that I have written in Java in order to get the best possible fine-tuned performance out of the JVM. But you could equally easily use any of the other great Java libraries available, as calling Java from Clojure is very simple (.methodName object param1 param2)
Quite a lot of higher order functions to automate my workflow. For example I have a higher order function that will run an optimisation algorithm of any kind in a loop for a specified amount of time and then produce an Incanter graph of the improvement on each iteration. Not rocket science, but really easy to code up in a few lines of Clojure.
Never really having to worry about performance. You can make Clojure go pretty fast if you want to (e.g. with type hints, primitive arithmetic support etc.) but normally it's irrelevant as you're going to spend 99%+ of your cycles in well-optimised library code anyway. Hence a bit of overhead in the "glue" code is negligible - I feel I gain much more in terms of personal productivity by having a dynamic, high-level, functional language to work in.
Major use of Clojure's concurrency features - this has to be one of Clojure's strongest features. I tend to use the STM to code concurrent processes with transactions that can't interfere with each other, then kick off long-running calculations in a future so that I can get on with other tasks and wait for notification of the result.
A slowly growing collection of macros to "extend the language" when needed. I actually use macros less than I thought I would (higher order functions are often a better choice). But when you need them they are invaluable - this is where you really appreciate the value of a homoiconic language. Since they effectively allow you to add new syntax to the language itself, they are very powerful when used correctly to build the DSL that you need.
In short - I don't think you can go wrong with Clojure as a researcher.
The one thing I probably wouldn't use it for (yet) is actually writing a new numerical library - this would probably be better done in Scala or pure Java as you would probably want to adopt a more imperative / OOP style.
I am not sure about bioinformatics and biostatistics per se, but I do scientific data analysis frequently and I appreciate that Scala allows me to write as-fast-as-Java code with relative ease. I believe that it is often possible in Clojure now, but I haven't seen the benchmarks to back that up. For the time being, I think the prudent thing to assume is that they do not perform equally well. See, for example, the Computer Languages Benchmark Game, where Scala is faster than Clojure in every single test. (Ignore the horrible "pidigits" result for Clojure--Scala (and Java) are calling the GMP library written in C, which Clojure could do but because of a technical detail requiring a different wrapping for the library, isn't presently allowed in the game). Looking at multicore comparisons doesn't improve Clojure's showing, and note that the Clojure code is no shorter for these sorts of lowish-level algorithmic tasks.
Clojure is ahead for the time being with parallel collections, though the upcoming 2.9 release of Scala should make up much of the difference. Neither has a gentle learning curve when coming from C++; Scala is maybe a little easier given that the syntax outwardly looks a little more familiar. I believe there are good materials for learning each of them.
Edit: P.S. You can call R from Java (and therefore from either Clojure or Scala) using rJava (specifically the JRI interface). Edit to edit: and, these days, rScala.
Edit #2: Scala was faster than Clojure in everything at the time of writing; as of this edit, Clojure's a little ahead in one (at the cost of a huge amount of code)--but anyway, the overall point stands. (And the Scala implementation on that one test could be sped up.)
If you like R, give Incanter a try! It's R for Clojure.
Scala is geared toward being syntactically easy for people coming from Java, which was in turn intended to be syntactically easy for people coming from C, though with two levels of indirection like this the advantage may be lost.
Clojure is getting a lot of traction in the Big Data space and maps very well onto Hadoop jobs for Huge Data. I think this would be a big advantage in the bioinformatics world.
Really, these things are largely personal taste, so try both and see which makes you happy :)
If you are looking to get a feel for Clojure without a lot of "intellectual overhead" may I suggest using leiningen to get a test project started quickly?
To build on Rex's answer I would like to add some Scala libraries/products that may be of interest to you:
ADAM
Spark (sparkseq, 2)
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
ScalaLab: MATLAB-like scientific computing in Scala
ScalaNLP: Collection of libraries for natural language processing (NLP), machine learning, and statistics.
Factorie: Toolkit for deployable probabilistic modeling
Gridgain: Compute cluster for Scala and Java
BioScala: Bioinformatics for the Scala programming language
I don't know Scala, so I can't offer a comparison, but I am actively using Clojure in bioinformatics projects.
The Java integration is excellent, and I have had no problem making use of the BioJava libraries.
Where Clojure's concurrency model shines is in the immutable default data types and functional programming with the seq abstraction.
In my bioinformatics work I very often find myself with a lot of input data (say gene sequences) that needs to be subjected to the same analysis. Once I have my analysis function I can map it over a sequence of inputs (with the results lazily generated). I have gotten full utilization of a large 48-core server simply by changing that map to a pmap.
Large scale parallelization with a single character change is hard to beat!
Of course pmap isn't a magic bullet and only helps when the analysis function computationally dominates, but the fact that map and pmap can just be plugged in and out shows the elegance and simplicity enabled by Clojure's design.
I am only passingly familiar with Scala, so the best I can do is evangelize a bit for Clojure. It's a great language, but take all this advice with a grain of salt as it's coming from an enthusiast.
If you are looking for concurrency, Clojure is fantastic both for ease of programming and for performance. The immutable data structures mean that it's trivial to work with a coherent snapshot of the world without any manual and error-prone locking; the STM makes it fairly simple to change data in a thread-sensitive way without breaking anyone else's snapshots.
My understanding is that Scala has a lot of the nice functional tools that Clojure does, but Clojure will always win syntactically by virtue of being a Lisp. If you're looking to do some specialized bioinformatics stuff, Clojure is able to hide the bits of Lisp that you don't want, and raise your own constructs to the same level as the built-in language constructs. I can't find the reference right now, but there's some well-known quote about Lisp that goes like:
Lisp is not the perfect language for any program. But it is the perfect language for building the perfect language for every program.
That's horribly paraphrased, but in my experience it has been true. It looks like you'll want a fairly specialized set of tools, and no language will make those feel as natural as a Lisp.
You have to ask yourself how important functional programming is for you. You know C++, so you probably know OO. I would say it's easier to do FP in Clojure (because you can't really drop back to an OO style), whereas in Scala you will probably end up dropping FP and doing more OO-style programming.
I can't really say anything about your application space.
Since you mentioned R, there is an R-like Clojure library for statistics called Incanter. I don't know about other existing projects in your application space.
There is a lot of information about both languages, so that should not be a problem. The learning curve is kind of steep with both languages. Clojure is a much smaller language, and since you already know some Lisp it should not be too hard to learn the important stuff. Scala has a type system that will be hard to pick up, especially since your main experience is with C/C++.
Both languages have great concurrency models and you will probably be happy with both.
I have some experience in Scala and only little knowledge in Clojure, but I programmed Lisp many years ago.
Lisp is a beautiful language, but it never made it in the wider world, because it was too limited. I believe you need a statically-typed language to develop robust systems. Scala's type system is not difficult to learn well enough to benefit from it. If you want to do very advanced things with it to make your libraries idiot-proof, you can, but then you will need to study the type system a little more.
Scala favours immutable types, but you can use mutables without any problem, which you sometimes do need. Concurrency in Scala is very well implemented and frameworks like akka extend and enhance these possibilities.
Scala stands a better chance to become a mainstream language since it's a fuller language. I'm afraid that Clojure is too much like Lisp (but reimplemented on the JVM). I liked Lisp a lot, but it had too many disadvantages for real-life programs. With Scala I think we have the best of both worlds (OO and functional) in a clean marriage. On top of that, Scala seems to really catch on in the market.
We have been working on some experimental code in the Rudolf/BioClojure project on GitHub. Also, look at Jan Aert's BioClojure project which is more structured.
Additionally, there is a BioCaml project in the works...

Does Perl language aim at producing fast programs at runtime?

I recently had a friend tell me
"see perl was never designed to be fast"
Is that true?
The relevant piece of information I can find is this from Wikipedia:
The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal).
But it doesn't directly talk about speed. I think that with all the text processing that it needs to do, speed of execution really matters for a language like Perl. And with all the weird syntax, elegance was never an objective, I agree.
Was high speed of execution one of the design objectives of Perl?
There is one important aspect to consider: algorithms. Perl's secret weapons are the algorithms backing certain language features and the CPAN library.
Good algorithms trump raw execution speed for non-trivial problems. It typically takes more effort to select and implement algorithms in C-like languages than in Perl. This means that for a little tool coded in half a day, the Perl version often outperforms a C version, because it was easier to build good data structures with hashes and to use the features provided by the language and its libraries.
Once a Perl script starts running (i.e. after loading and compiling everything), it can be very speedy. It's that yucky compile-every-time that's a bit nasty.
However, I find that people don't really have to worry about how fast Perl can be. They waste all of their time by implementing stupid designs that do a lot more work than they need to do, misunderstanding key technologies, or just being boneheaded. It's not uncommon for me to help someone make their stuff go orders of magnitude faster by just tuning in the right places. That's not particular to Perl though. People have that problem with every language.
Perl has always aimed toward practicality, not anything (even close to) some sort of ivory tower purity, where a few goals are given absolute priority, and others are ignored (completely or nearly so).
As such, I think it's reasonable to say that maintaining a reasonable speed of execution has always been seen as important for Perl, but there are other factors (especially things like flexibility and ease of use) that are generally more important, so if a choice has to be made between one of them and speed of execution, the other factor will generally win unless the effect on execution speed is really serious.
I would have said that a language designed for optimal run-time performance would not have constructs that allow compiling while running. So no, perhaps.
It became a design objective as of Perl 5.0. But keep in mind it is still interpreted, so it is fast for an interpreted language.