Does functional programming reduce the Von Neumann bottleneck? - scala

I believe (from doing some reading) that reading/writing data across the bus from CPU caches to main memory places a considerable constraint on how fast a computational task (which needs to move data across the bus) can complete - the Von Neumann bottleneck.
I have come across a few articles so far which mention that functional programming can be more performant than other paradigms, such as imperative approaches like OO (in certain models of computation).
Can someone please explain some of the ways that purely functional programming can reduce this bottleneck? That is, are any of the following points found (in general) to be true?
Using immutable data structures means that generally less data moves across the bus, because there are fewer writes?
Using immutable data structures means that data is more likely to be hanging around in CPU cache, because fewer updates to existing state means less flushing of objects from cache?
Is it possible that using immutable data structures means we often never read the data back from main memory at all? We may create an object during computation, have it in local cache, and then during the same time slice create a new immutable object from it (if an update is needed) and never use the original object again. In other words, we would be working much more with objects that sit in local cache.

Oh man, that’s a classic. John Backus’ 1977 ACM Turing Award lecture is all about that: “Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs.” (The paper, “Lambda: The Ultimate Goto,” was presented at the same conference.)
I’m guessing that either you or whoever raised this question had that lecture in mind. What Backus called “the von Neumann bottleneck” was “a connecting tube that can transmit a single word between the CPU and the store (and send an address to the store).”
CPUs do still have a data bus, although in modern computers, it’s usually wide enough to hold a vector of words. Nor have we gotten away from the problem that we need to store and look up a lot of addresses, such as the links to daughter nodes of lists and trees.
But Backus was not just talking about physical architecture (emphasis added):
Not only is this tube a literal bottleneck for the data traffic of a problem, but, more importantly, it is an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand. Thus programming is basically planning and detailing the enormous traffic of words through the von Neumann bottleneck, and much of that traffic concerns not significant data itself but where to find it.
In that sense, functional programming has been largely successful at getting people to write higher-level functions, such as maps and reductions, rather than “word-at-a-time thinking” such as the for loops of C. If you try to perform an operation on a lot of data in C, today, then just like in 1977, you need to write it as a sequential loop. Potentially, each iteration of the loop could do anything to any element of the array, or any other program state, or even muck around with the loop variable itself, and any pointer could potentially alias any of these variables. At the time, that was true of the DO loops of Backus’ first high-level language, Fortran, as well, except maybe the part about pointer aliasing. To get good performance today, you try to help the compiler figure out that, no, the loop doesn’t really need to run in the order you literally specified: this is an operation it can parallelize, like a reduction or a transformation of some other array or a pure function of the loop index alone.
But that’s no longer a good fit for the physical architecture of modern computers, which are all vectorized symmetric multiprocessors—like the Cray supercomputers of the late ’70s, but faster.
Indeed, the C++ Standard Template Library now has algorithms on containers that are totally independent of the implementation details or the internal representation of the data, and Backus’ own creation, Fortran, added FORALL and PURE in 1995.
When you look at today’s big data problems, you see that the tools we use to solve them resemble functional idioms a lot more than the imperative languages Backus designed in the ’50s and ’60s. You wouldn’t write a bunch of for loops to do machine learning in 2018; you’d define a model in something like TensorFlow and evaluate it. If you want to work with big data with a lot of processors at once, it’s extremely helpful to know that your operations are associative, and therefore can be grouped in any order and then combined, allowing for automatic parallelization and vectorization. Or that a data structure can be lock-free and wait-free because it is immutable. Or that a transformation on a vector is a map that can be implemented with SIMD instructions on another vector.
Examples
Last year, I wrote a couple short programs in several different languages to solve a problem that involved finding the coefficients that minimized a cubic polynomial. A brute-force approach in C11 looked, in relevant part, like this:
static variable_t ys[MOST_COEFFS];
// #pragma omp simd safelen(MOST_COEFFS)
for ( size_t j = 0; j < n; ++j )
  ys[j] = ((a3s[j]*t + a2s[j])*t + a1s[j])*t + a0s[j];
variable_t result = ys[0];
// #pragma omp simd reduction(min:result)
for ( size_t j = 1; j < n; ++j ) {
  const variable_t y = ys[j];
  if (y < result)
    result = y;
} // end for j
The corresponding section of the C++14 version looked like this:
const variable_t result =
(((a3s*t + a2s)*t + a1s)*t + a0s).min();
In this case, the coefficient vectors were std::valarray objects, a special container type in the standard library with restrictions on how its elements can be aliased and a limited set of member operations. Many of the restrictions on which operations are safe to vectorize sound a lot like the restrictions on pure functions. The list of allowed reductions, like .min() at the end, is, not coincidentally, similar to the instances of Data.Semigroup. You’ll see a similar story these days if you look through <algorithm> in the STL.
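For anyone who wants to try the valarray version on its own, here is a minimal self-contained sketch. The function name min_cubic_at and the use of double in place of variable_t are assumptions made for illustration; they are not taken from the original program.

```cpp
#include <cassert>
#include <valarray>

// Evaluate many cubics a3*t^3 + a2*t^2 + a1*t + a0 at the same t, using
// Horner's rule on whole coefficient vectors, then reduce to the minimum.
// Assumption: all four valarrays have the same, non-zero length.
double min_cubic_at(const std::valarray<double>& a3s,
                    const std::valarray<double>& a2s,
                    const std::valarray<double>& a1s,
                    const std::valarray<double>& a0s,
                    double t)
{
    assert(a3s.size() > 0);
    assert(a3s.size() == a2s.size() && a2s.size() == a1s.size() &&
           a1s.size() == a0s.size());
    // Materialise the element-wise result before reducing it.
    const std::valarray<double> ys = ((a3s * t + a2s) * t + a1s) * t + a0s;
    return ys.min();
}
```

Nothing in the body says in which order the elements are visited, which is what leaves the library and compiler free to vectorize it.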
Now, I’m not going to claim that C++ has become a functional language. As it happened, I made all the objects in the program immutable and automatically collected by RAII, but that’s just because I’ve had a lot of exposure to functional programming and that’s how I like to code now. The language itself doesn’t impose such things as immutability, garbage collection or absence of side-effects. But when we look at what Backus in 1977 said was the real von Neumann bottleneck, “an intellectual bottleneck that has kept us tied to word-at-a-time thinking instead of encouraging us to think in terms of the larger conceptual units of the task at hand,” does that apply to the C++ version? The operations are linear algebra on coefficient vectors, not word-at-a-time. And the ideas C++ borrowed to do this (and the ideas behind expression templates even more so) are largely functional concepts. (Compare that snippet to how it would’ve looked in K&R C, and how Backus defined a functional program to compute inner product in section 5.2 of his Turing Award lecture in 1977.)
I also wrote a version in Haskell, but I don’t think it’s as good an example of escaping that kind of von Neumann bottleneck.
It’s absolutely possible to write functional code that meets all of Backus’ descriptions of the von Neumann bottleneck. Looking back on the code I wrote this week, I’ve done it myself. A fold or traversal that builds a list? They’re high-level abstractions, but they’re also defined as sequences of word-at-a-time operations, and half or more of the data passed through the bottleneck when you create and traverse a singly-linked list is the addresses of other data! They’re efficient ways to put data through the von Neumann bottleneck, and that’s basically why I did it: they’re great patterns for programming von Neumann machines.
If we’re interested in coding a different way, however, functional programming gives us tools to do so. (I’m not going to claim it’s the only thing that does.) Express a reduction as a foldMap, apply it to the right kind of vector, and the associativity of the monoidal operation lets you split up the problem into chunks of whatever size you want and combine the pieces later. Make an operation a map rather than a fold, on a data structure other than a singly-linked list, and it can be automatically parallelized or vectorized. Or transformed in other ways that produce the same result, since we’ve expressed the result at a higher level of abstraction, not a particular sequence of word-at-a-time operations.
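To make the point about associativity concrete, here is a rough C++ sketch of a reduction done in chunks. The function chunked_min is a made-up name, and the chunks are processed one after another here; the only thing it demonstrates is that an associative operation gives the same answer however the input is split, which is the property a parallel runtime relies on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reduce xs with min by first reducing fixed-size chunks and then combining
// the partial results. Because min is associative, the answer does not
// depend on the chunk size; each chunk could go to a different core.
double chunked_min(const std::vector<double>& xs, std::size_t chunk)
{
    assert(!xs.empty() && chunk > 0);
    std::vector<double> partial;
    for (std::size_t lo = 0; lo < xs.size(); lo += chunk) {
        const std::size_t len = std::min(chunk, xs.size() - lo);
        const auto first = xs.begin() + static_cast<std::ptrdiff_t>(lo);
        const auto last = first + static_cast<std::ptrdiff_t>(len);
        partial.push_back(*std::min_element(first, last));
    }
    // Combine the per-chunk minima.
    return *std::min_element(partial.begin(), partial.end());
}
```

Calling it with a chunk size of 1, 3 or 100 on the same vector returns the same minimum.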
My examples so far have been about parallel programming, but I’m sure quantum computing will shake up what programs look like a lot more fundamentally.

Related

How to do hardware independent parallel programming?

These days there are two main hardware environments for parallel programming: multithreaded CPUs, and graphics cards, which can do parallel operations on arrays of data.
The question is, given these two different hardware environments, how can I write a program that is parallel but independent of them?
I mean that I would like to write a program and, regardless of whether I have a graphics card, a multithreaded CPU, or both, have the system choose automatically where to execute it: on the graphics card, the CPU, or both.
Are there any software libraries or language constructs that allow this?
I know there are ways to target the graphics card directly, but my question is about how we as programmers can write parallel code without knowing anything about the hardware, and have the software system schedule it onto either the graphics card or the CPU.
If you require me to be more specific as to the platform/language, I would like the answer to be about C++ or Scala or Java.
Thanks
Martin Odersky's research group at EPFL just recently received a multi-million-euro European Research Grant to answer exactly this question. (The article contains several links to papers with more details.)
In a few years from now programs will rewrite themselves from scratch at run-time (hey, why not?)...
...as of right now (as far as I am aware) it's only viable to target related groups of parallel systems with given paradigms and a GPU ("embarrassingly parallel") is significantly different than a "conventional" CPU (2-8 "threads") is significantly different than a 20k processor supercomputer.
There are actually parallel run-times/libraries/protocols like Charm++ or MPI (think "Actors") that can scale -- with specially engineered algorithms to certain problems -- from a single CPU to tens of thousands of processors, so the above is a bit of hyperbole. However, there are enormous fundamental differences between a GPU -- or even a Cell microprocessor -- and a much more general-purpose processor.
Sometimes a square peg just doesn't fit into a round hole.
Happy coding.
OpenCL is precisely about running the same code on CPUs and GPUs, on any platform (Cell, Mac, PC...).
From Java you can use JavaCL, which is an object-oriented wrapper around the OpenCL C API that will save you lots of time and effort (handles memory allocation and conversion burdens, and comes with some extras).
From Scala, there's ScalaCL which builds upon JavaCL to completely hide away the OpenCL language : it converts some parts of your Scala program into OpenCL code, at compile-time (it comes with a compiler plugin to do so).
Note that Scala features parallel collections as part of its standard library since 2.9.0, which are useable in a pretty similar way to ScalaCL's OpenCL-backed parallel collections (Scala's parallel collections can be created out of regular collections with .par, while ScalaCL's parallel collections are created with .cl).
The (very-)recently announced MS C++ AMP looks like the kind of thing you're after. It seems (from reading the news articles) that initially it's targeted at using GPUs, but the longer term aim seems to be to include multi-core too.
Sure. See ScalaCL for an example, though it's still alpha code at the moment. Note also that it uses some Java libraries that perform the same thing.
I will cover the more theoretical answer.
Different parallel hardware architectures implement different models of computation. Bridging between these is hard.
In the sequential world we've been happily hacking away basically the same single model of computation: the Random Access Machine. This creates a nice common language between hardware implementors and software writers.
No such single optimal model for parallel computation exists. Since the dawn of modern computers a large design space has been explored; current multicore CPUs and GPUs cover but a small fraction of this space.
Bridging these models is hard because parallel programming is essentially about performance. You typically make something work on two different models or systems by adding a layer of abstraction to hide specifics. However, it is rare that an abstraction does not come with a performance cost. This will typically land you with the lowest common denominator of both models.
And now answering your actual question. Having a computational model (language, OS, library, ...) that is independent of CPU or GPU will typically not abstract over both while retaining the full power you're used to with your CPU, due to the performance penalties. To keep everything relatively efficient the model will lean towards GPUs by restricting what you can do.
Silver lining:
What does happen is hybrid computations. Some computations are more suited for other kinds of architectures. You also rarely do only one type of computation, so that a 'sufficiently smart compiler/runtime' will be able to distinguish what part of your computation should run on what architecture.

How to find the time value of operation to optimize new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach to estimating the processor load is sensible, though: make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one so thorough for other platforms, except processor / instruction set manuals, but they deal with assembly instructions)
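If you do want to build such a table yourself, a small timing harness is one way to start. The sketch below uses only the portable C++ standard library; nanos_per_call is a hypothetical helper name, and nothing here is specific to iOS hardware. It measures wall-clock time, so treat the numbers as relative costs rather than exact cycle counts.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Average wall-clock nanoseconds per call of op, measured over reps calls.
// The figure includes the loop overhead itself, so compare operations
// against each other rather than reading the value as an absolute cost.
template <typename Op>
double nanos_per_call(Op op, std::uint64_t reps)
{
    assert(reps > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::uint64_t i = 0; i < reps; ++i)
        op();
    const auto stop = clock::now();
    const double total =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return total / static_cast<double>(reps);
}
```

To compare "+" with "%", for example, time two lambdas that each update a volatile variable (so the compiler cannot remove the work) and look at the ratio of the two results.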
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
Xcode has Instruments to measure the performance of each function/operation; you can simply use them.

Fastest language for FOR loops

I'm trying to figure out the best programming language for an analytical model I'm building. Primary consideration is speed at which it will run FOR loops.
Some detail:
The model needs to perform numerous (~30 per entry, over 12 cycles) operations on a set of elements from an array -- there are ~300k rows, and ~150 columns in the array. Most of these operations are logical in nature, e.g., if place(i) = 1, then j(i) = 2.
I've built an earlier version of this model using Octave -- to run it takes ~55 hours on an Amazon EC2 m2.xlarge instance (and it uses ~10 GB of memory, but I'm perfectly happy to throw more memory at it). Octave/Matlab won't do elementwise logical operations, so a large number of for loops are needed -- I'm relatively certain that I've vectorized as much as possible -- the loops that remain are necessary. I've gotten octave-multicore to work with this code, which makes some improvement (~30% reduction in run time when I get it running on 8 EC2 cores), but ends up being unstable with file locking, etc.
I'm really looking for a step change in run-time -- I know that actually using Matlab might get me as much as a 50% improvement from looking at some benchmarks, but that is cost-prohibitive. The original plan when starting this was to actually run a Monte Carlo with this, but at 55 hours a run, that's completely impractical.
The next version of this is going to be a complete rebuild from the ground up (for IP reasons I won't get in to if nothing else), so I'm completely open to any programming language. I'm most familiar with Octave/Matlab, but have dabbled in R, C, C++, Java. I'm also proficient w/ SQL if the solution involves storing the data in a database. I'll learn any language for this -- these aren't complicated functionality we're looking for, no interfacing with other programs, etc., so not too concerned about learning curve.
So with all that said, what's the fastest programming language specifically for FOR loops? From a search of SO and Google, Fortran and C bubble to the top, but looking for some more advice before diving in to one or the other.
Thanks!
This for loop looks no more complex than this when it hits the CPU:
for(int i = 0; i != 1024; i++) translates to
    mov r0, 0        ;; start the counter
top:
    ;; some processing
    add r0, r0, 1    ;; increment the counter by 1
    cmp r0, 1024     ;; compare against the loop bound (1024 elements)
    jne top          ;; jump back to the top if we haven't reached the bound
    ;; continue on
As you can tell, this is sufficiently simple you can't really optimize it very well[1]... Refocus towards the algorithm level.
The first cut at the problem is to look at cache locality. Look up the classic example of matrix multiplication and swapping the i and j indexes.
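Here is one common form of that loop-interchange example, as a sketch. It assumes square matrices stored row-major in a flat std::vector, and it swaps the j and k loops (one of several possible interchanges). Both functions compute the same product; only the memory access pattern of the inner loop differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Square n x n matrix stored row-major: element (r, c) lives at m[r * n + c].
using Matrix = std::vector<double>;

// Textbook i-j-k order: the inner loop walks down a column of b (stride n).
Matrix multiply_ijk(const Matrix& a, const Matrix& b, std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
    return c;
}

// Interchanged i-k-j order: the inner loop walks along a row of b and of c
// (stride 1), which is friendlier to the cache.
Matrix multiply_ikj(const Matrix& a, const Matrix& b, std::size_t n)
{
    assert(a.size() == n * n && b.size() == n * n);
    Matrix c(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            const double aik = a[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    return c;
}
```

How much the second version gains depends on the matrix size and the cache hierarchy, so measure it on your own data before relying on it.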
edit: As a second cut, I would suggest evaluating the algorithm for data-dependencies between iterations and data dependency between localities in your 'matrix' of data. It may be a good candidate for parallelization.
[1] There are some micro-optimizations possible, but those will not produce the speedups you're looking for.
~300k * ~150 * ~30 * ~12 = ~16G iterations, right?
This number of primitive operations should complete in a matter of minutes (if not seconds) in any compiled language on any decent CPU.
Fortran, C/C++ should do it almost equally well. Even managed languages, such as Java and C#, would only fall behind by a small margin (if at all).
If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more. (as was already suggested, is disk access limiting performance? is it network access?)
As @Rotsor said, 16G operations / 55 hours is about 80,000 operations per second, or one operation every 12.5 microseconds. That's a lot of time per operation.
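The back-of-the-envelope arithmetic can be checked with a few lines of code. The function names are made up for this sketch, and the inputs are the rough figures from the question. Without rounding, the product is about 16.2 billion iterations, which over 55 hours comes to roughly 12.2 microseconds each, the same ballpark as the rounded 12.5 quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Total inner-loop iterations implied by the question's rough figures.
std::uint64_t total_iterations(std::uint64_t rows, std::uint64_t cols,
                               std::uint64_t ops_per_entry, std::uint64_t passes)
{
    return rows * cols * ops_per_entry * passes;
}

// Average microseconds spent per iteration for a run lasting `hours`.
double micros_per_iteration(std::uint64_t iterations, double hours)
{
    assert(iterations > 0);
    const double total_micros = hours * 3600.0 * 1e6;
    return total_micros / static_cast<double>(iterations);
}
```

For the numbers in the question that is total_iterations(300000, 150, 30, 12) followed by micros_per_iteration(..., 55.0).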
That means your loops are not the cause of poor performance, it's what's in the innermost loop that's taking the time. And Octave is an interpreted language. That alone means an order of magnitude slowdown.
If you want speed, you first need to be in a compiled language. Then you need to do performance tuning (aka profiling) or, just single step it in a debugger at the instruction level. That will tell you what it is actually doing in its heart of hearts. Once you've got it to where it's not wasting cycles, fancier hardware, cores, CUDA, etc. will give you a parallelism speedup. But it's silly to do that if your code is taking unnecessarily many cycles. (Here's an example of performance tuning - a 43x speedup just by trimming the fat.)
I can't believe the number of responders talking about matlab, APL, and other vectorized languages. Those are interpreters. They give you concise source code, which is not at all the same thing as fast execution. When it comes down to the bare metal, they are stuck with the same hardware as every other language.
Added: to show you what I mean, I just ran this C++ code, which does 16G operations, on this moldy old laptop, and it took 94 seconds, or about 6ns per iteration. (I can't believe you baby-sat that thing for 2 whole days.)
void doit(){
    double sum = 0;
    for (int i = 0; i < 1000; i++){
        for (int j = 0; j < 16000000; j++){
            sum += j * 3.1415926;
        }
    }
}
In terms of absolute speed, probably Fortran followed by C, followed by C++. In practical application, well-written code in any of the three, compiled with a decent compiler, should be plenty fast.
Edit- generally you are going to see much better performance with any kind of looped or forking (e.g.- if statements) code with a compiled language, versus an interpreted language.
To give an example, on a recent project I was working on, the data sizes and operations were about 3/4 the size of what you're talking about here, but like your code, had very little room for vectorization, and required significant looping. After converting the code from matlab to C++, runtimes went from 16-18 hours, down to around 25 minutes.
For what you're discussing, Fortran is probably your first choice. The closest second place is probably C++. Some C++ libraries use "expression templates" to gain some speed over C for this kind of task. It's not entirely certain those will do you any good, but C++ can be at least as fast as C, and possibly somewhat faster.
At least in theory, there's no reason Ada couldn't be competitive as well, but it's been so long since I used it for anything like this that I hesitate to recommend it -- not because it isn't good, but because I just haven't kept track of current Ada compilers well enough to comment on them intelligently.
Any compiled language should perform the loop itself on roughly equal terms.
If you can formulate your problem in its terms, you might want to look at CUDA or OpenCL and run your matrix code on the GPU - though this is less good for code with lots of conditionals.
If you want to stay on conventional CPUs, you may be able to formulate your problem in terms of SIMD gather/scatter and bitmask operations (gather and scatter instructions arrived with AVX2 and AVX-512 rather than SSE itself).
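As an illustration of the bitmask idea, here is the question's rule "if place(i) = 1, then j(i) = 2" written as a branch-free select in plain C++. This is scalar code, not SIMD intrinsics; the function name masked_assign and the choice of 32-bit integers are assumptions. Loops of this shape are the kind a compiler may be able to auto-vectorize, but whether it does depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Apply "if place[i] == 1 then j[i] = 2" without a branch in the loop body.
void masked_assign(const std::vector<std::int32_t>& place,
                   std::vector<std::int32_t>& j)
{
    assert(place.size() == j.size());
    for (std::size_t i = 0; i < place.size(); ++i) {
        // All bits set when the condition holds, all bits clear otherwise.
        const std::int32_t mask = -static_cast<std::int32_t>(place[i] == 1);
        // Keep the old value where mask is 0; take 2 where mask is all ones.
        j[i] = (j[i] & ~mask) | (2 & mask);
    }
}
```

With place = {1, 0, 1, 3} and j = {5, 6, 7, 8}, j becomes {2, 6, 2, 8}.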
Probably the assembly language for whatever your platform is. But compilers (especially special-purpose ones that specifically target a single platform (e.g., Analog Devices or TI DSPs)) are often as good as or better than humans. Also, compilers often know about tricks that you don't. For example, the aforementioned DSPs support zero-overhead loops and the compiler will know how to optimize code to use those loops.
Matlab will do element-wise logical operations and they are generally quite fast.
Here is a quick example on my computer (AMD Athlon 2.3GHz w/3GB):
d=rand(300000,150);
d=floor(d*10);
>> numel(d(d==1))
ans =
4501524
>> tic;d(d==1)=10;toc;
Elapsed time is 0.754711 seconds.
>> numel(d(d==1))
ans =
0
>> numel(d(d==10))
ans =
4501524
In general I've found matlab's operators are very speedy, you just have to find ways to express your algorithms directly in terms of matrix operators.
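For comparison with the compiled languages discussed elsewhere in this thread, the same "replace every 1 with 10" step can be written in C++ with standard algorithms instead of a hand-written loop. This is only a sketch on a flat std::vector<int>; replace_equal is a made-up helper name, and no timing is claimed for it.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Replace every element equal to `from` with `to`.
// Returns how many elements matched before the replacement.
std::size_t replace_equal(std::vector<int>& d, int from, int to)
{
    const std::size_t hits =
        static_cast<std::size_t>(std::count(d.begin(), d.end(), from));
    std::replace(d.begin(), d.end(), from, to);
    return hits;
}
```

The returned count plays the role of numel(d(d==1)) in the session above.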
C++ is not fast when doing matrixy things with for loops. C is, in fact, specifically bad at it. See good math bad math.
I hear that C99 has restrict pointers that help (many C++ compilers offer the same thing as the __restrict extension), but don't know much about it.
Fortran is still the go-to language for numerical computing.
How is the data stored? Your execution time is probably more affected by I/O (especially disk or, worse, network) than by your language.
Assuming operations on rows are orthogonal, I would go with C# and use PLINQ to exploit all the parallelism I could.
Might you not be best with a hand-coded assembler insert? Assuming, of course, that you don't need portability.
That and an optimized algorithm should help (and perhaps restructuring the data?).
You might also want to try multiple algorithms and profile them.
APL.
Even though it's interpreted, its primitive operators all operate on arrays natively, therefore you rarely need any explicit loops. You write the same code, whether the data is scalar or array, and the interpreter takes care of any looping needed internally, and thus with the minimum overhead - the loops themselves are in a compiled language, and will have been heavily optimised for the specific architecture of the CPU it's running on.
Here's an example of the simplicity of array handling in APL:
A <- 2 3 4 5 6 8 10
((2|A)/A) <- 0
A
2 0 4 0 6 8 10
The first line sets A to a vector of numbers.
The second line replaces all the odd numbers in the vector with zeroes.
The third line queries the new values of A, and the fourth line is the resulting output.
Note that no explicit looping was required, as scalar operators such as '|' (remainder) automatically extend to arrays as required. APL also has built-in primitives for searching and sorting, which will probably be faster than writing your own loops for these operations.
Wikipedia has a good article on APL, which also provides links to suppliers such as IBM and Dyalog.
Any modern compiled or JITted language is going to render down to pretty much the same machine language code, giving a loop overhead of 10 nanoseconds or less per iteration on modern processors.
Quoting @Rotsor:
If you have a problem of ~16G iterations running 55 hours, this means that they are very far from being primitive (80k per second? this is ridiculous), so maybe we should know more.
80k operations per second is around 12.5 microseconds each - a factor of 1000 greater than the loop overhead you'd expect.
Assuming your 55 hour runtime is single threaded, and if your per item operations are as simple as suggested, you should be able to (conservatively) achieve a speedup of 100x and cut it down to under an hour very easily.
If you want to run faster still, you'll want to look at writing a multi-threaded solution, in which case a language that provides good support would be essential. I'd lean towards PLINQ and C# 4.0, but that's because I already know C# - YMMV.
What about a language with lazy sequences, like Clojure? It is a Lisp, so like most Lisp dialects it lacks a C-style for loop, but it has many other forms that are more idiomatic for list processing. It might help with your scaling issues as well, because the operations are thread-safe and, because the language is functional, code tends to have fewer side effects. If you wanted to flag all the items in the list that were "i" values so you could operate on them, you might do something like this.
(def mylist ["i" "j" "i" "i" "j" "i"])
(map #(= "i" %) mylist)
result
(true false true true false true)

Your experiences with Matlab/F#/R for data analysis and modeling algorithms

I've been using F# for a while now to model algorithms before coding them in C++, and also using it afterwards to check the results of the C++ code, and also against real-world recorded data.
For the modeling side of things, it's very handy, but for the 'data mashup' kind of stuff, pulling in data from CSV and other sources, generating statistics, drawing charts etc., my colleague teases me no end ("why are you coding that yourself? It's built in to MatLab").
And I have another colleague who swears by R, which also has charting stuff 'built-in'.
I know that MatLab, R and F# are not strictly comparable, so I'm not asking for a 'feature comparison shoot out'. I just wondered what other people are using for these kind of pre- and post-analysis scenarios, and how happy they are with it.
(If there's anyone out there working on wrapping Microsoft Charts into something F#-friendly, let me know, I'd be happy to participate...)
(Note: answers to this question will be subjective, but based on experience, please)
I have very little experience with F#, but regarding C++/Matlab/R: If the speed of your program's execution is the most important, use C++. If speed of implementation is the most important, use Matlab or R. This is true for a number of reasons, not the least of which is their massive libraries of math/stats packages.
Both Matlab and R can be sped up through parallelism: so generally, I think that speed and quality of implementation should be a bigger concern. That's where the real "value" of programming is taking place, in the design of the application. It's not a minor proposition if you can write 3 or 4 good R programs in the same time it takes you to write 1 good C++ program.
Regarding F#: so far as it is part of Microsoft's framework, it must have a lot to offer. If you're developing in Visual Studio or working on a big .Net project (for instance), it might make sense to use F#. On the other hand, you can call both Matlab and R from .Net applications, so I would probably argue that their libraries should be a bigger concern. For instance, see this article as an example for R and the Matlab Builder.
Long story short: comparing F# and Matlab/R isn't a good comparison. F# is a general purpose programming language, while Matlab/R can be viewed as massive mathematical/data analysis toolkits. Some people call Matlab or R from F# in order to take advantage of each language's benefits (e.g. see this discussion, this article on Matlab/F#, or this article on R/F#).
So far as charting is concerned: R is extremely strong on this front. Have a look at the graphics view on CRAN and this series of posts on the LearnR blog about Lattice and ggplot2.
I've worked a bit with matlab and python/pylab for these purposes. What these tools have 'built-in' is a programming environment, a shell, and gui tools designed for quickly looking at data from a variety of sources.
In a few commands, you can go from having a csv file to interactive plots on the screen, then to an image export in just about any format. It takes a minute or two to go from data to visualization once you have the hang of it. I would imagine this is uncommon in the C++ world (although I have seen some professors with pretty impressive work-flows).
I've tried R, but I can't say much useful about it. It seems to offer about the same set of features, but it may be troublesome to Google for support.
If you are spending more than a couple minutes getting from data to plot using your current method, it's definitely worth learning one of these environments. The best choice depends on your colleagues, your work environment, experience, and your budget.
This is a reasonably close duplicate of the previous question on a suitable functional language for scientific/statistical computing, so you may want to peruse the long and detailed answers there.
Answers depend, as so often, on your experience and prior language training. I very much prefer R for data munging / modeling / visualization.
I use R because on the one hand it has everything built in and on the other hand you can still manipulate almost everything or start from scratch. Nevertheless, R is rather slow for heavy calculations (although I do all my Monte Carlo simulations in it).
I would say that Matlab is best for the availability of mathematical functionality in general, R is best for data input/manipulation/visualisation/analysis/etc., and C++ for high-speed subroutines. You can, by the way, easily integrate C++ (or C, Fortran, ...) code in R. Why not read and manipulate input data in R, apply the models in C++, and analyse/visualize output back in R?
I always prototype my models in MATLAB. If my prototype is fast enough, I refactor and it's done. If not, I go back and implement certain functions in C to be called by MATLAB. This requires knowledge of a low level language, which I think is always going to be the case if you are doing anything that is technically challenging.
I'm intrigued with this Lisp flavor if it ever gets off the ground.

Using Lisp (or AutoLisp) how good is the associative lists performance?

I'm doing an AutoLisp project which uses long associative structures to do heavy geometrical processing - so I'm curious about the associative list intense use timing results.
How simple/complex is the implementation? Does it use some special data structure or a normal list of dotted pairs?
Are there any extensions for B-trees or something similar?
The turning point for SBCL on recent x86 hardware between alists and identity-based hash tables, assuming even distribution of access, is around 30-40 elements.
In Common Lisp and Emacs Lisp association lists are linked lists, so they have linear search time. Assuming that AutoLisp is the same (and if it isn't, then their use of the term "Associative List" is misleading), you can assume that all operations will be linear in the length of the list. For example, an alist with 100 elements will, on average, need 50 accesses to find the thing that you are after.
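The linear cost is easy to see in a small model. The sketch below represents an association list as a C++ vector of key/value pairs and counts how many keys a lookup inspects. It is a simplification for illustration (a real Lisp alist is a linked list of cons cells, and this is not AutoLisp code); the AList alias and the comparison counter are assumptions of the example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// An association list modelled as a sequence of (key, value) pairs.
using AList = std::vector<std::pair<std::string, int>>;

// Linear lookup in the style of Lisp's assoc: scan from the front until the
// key is found. `comparisons` is incremented once per key inspected.
// Returns a pointer to the value, or nullptr if the key is absent.
const int* assoc(const AList& alist, const std::string& key,
                 std::size_t& comparisons)
{
    for (const auto& entry : alist) {
        ++comparisons;
        if (entry.first == key)
            return &entry.second;
    }
    return nullptr;
}
```

Looking up each of 100 distinct keys once inspects 1 + 2 + ... + 100 = 5050 keys in total, an average of about 50 per lookup, which matches the estimate above. A missing key always costs a full scan.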
Of course, most Scheme implementations (or maybe it's in the specs?) have hashtables, which use mostly the same API; but it's not transparent, when you ask for an alist, you get a list of pairs, if you want a hashtable, ask for it.
That said, it's important to remember that linear algorithms aren't slow; they're 'unscalable'. For a small number of elements, they'll outperform a more complex 'clever' algorithm. Just how large 'n' has to be depends a lot on the algorithm, and fast processors with big caches but slow RAM keep pushing it higher. Also, heavily optimising compilers (like those available for some Lisps) generate very tight linear code.
I have not worked with AutoLisp in about 10 years, but I never found any real performance issues with association list manipulation. And I wrote code that would do a fair amount of association list manipulation.
Working in VBA or ObjectARX might have some performance benefits, but you would probably need to run some comparison testing to see if it is really better.
There is no extension for b-tree that I know of but if you use Visual LISP you can use ActiveX objects and thus access most types of databases.