Simple explanation of MapReduce? - frameworks

Related to my CouchDB question.
Can anyone explain MapReduce in terms a numbnuts could understand?

Going all the way down to the basics for Map and Reduce.
Map is a function which "transforms" items in some kind of list into another kind of item and puts them back into the same kind of list.
Suppose I have a list of numbers [1, 2, 3] and I want to double every number. In this case, the function to "double every number" is x => x * 2. Without a map, I could write a simple loop, say
A = [1, 2, 3]
foreach (index in A) A[index] = A[index] * 2
and I'd have A = [2, 4, 6]. But instead of writing loops, if I have a map function I could write
A = [1, 2, 3].Map(x => x * 2)
the x => x * 2 is a function to be executed against the elements in [1, 2, 3]. What happens is that the program takes each item, executes (x => x * 2) against it by binding x to that item, and produces a list of the results.
1 : 1 => 1 * 2 : 2
2 : 2 => 2 * 2 : 4
3 : 3 => 3 * 2 : 6
so after executing the map function with (x => x * 2) you'd have [2, 4, 6].
Reduce is a function which "collects" the items in lists and performs some computation on all of them, thus reducing them to a single value.
Finding a sum or finding an average are both instances of a reduce function. For example, if you have a list of numbers, say [7, 8, 9], and you want them summed up, you'd write a loop like this
A = [7, 8, 9]
sum = 0
foreach (index in A) sum = sum + A[index]
But, if you have access to a reduce function, you could write it like this
A = [7, 8, 9]
sum = A.reduce( 0, (x, y) => x + y )
Now it's a little confusing why there are 2 arguments (0 and the function with x and y) passed. For a reduce function to be useful, it must be able to take 2 items, compute something and "reduce" those 2 items to just one single value; that way the program can reduce each pair until we have a single value.
The execution would go as follows:
result = 0
7 : result = result + 7 = 0 + 7 = 7
8 : result = result + 8 = 7 + 8 = 15
9 : result = result + 9 = 15 + 9 = 24
But you don't want to start with zeroes all the time, so the first argument is there to let you specify a seed value, specifically the value in the first result = line.
Say you want to sum 2 lists; it might look like this:
A = [7, 8, 9]
B = [1, 2, 3]
sum = 0
sum = A.reduce( sum, (x, y) => x + y )
sum = B.reduce( sum, (x, y) => x + y )
or a version you'd be more likely to find in the real world:
A = [7, 8, 9]
B = [1, 2, 3]
sum_func = (x, y) => x + y
sum = A.reduce( B.reduce( 0, sum_func ), sum_func )
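For reference, here is roughly the same thing in a real language; I'm using Scala purely as an illustration, where foldLeft plays the role of the seeded two-argument reduce above:
// The map and seeded-reduce pseudocode above, written as runnable Scala.
val doubled = List(1, 2, 3).map(x => x * 2)               // List(2, 4, 6)

val a = List(7, 8, 9)
val b = List(1, 2, 3)
val sumFunc = (x: Int, y: Int) => x + y

val sumA    = a.foldLeft(0)(sumFunc)                      // 24, with seed value 0
val sumBoth = a.foldLeft(b.foldLeft(0)(sumFunc))(sumFunc) // 30, seeded with the sum of b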
It's a good thing in DB software because, with Map/Reduce support, you can work with the database without needing to know how the data is stored in the DB; that's what a DB engine is for.
You just need to be able to "tell" the engine what you want by supplying it with either a Map or a Reduce function, and then the DB engine can find its way around the data, apply your function, and come up with the results you want, all without you knowing how it loops over all the records.
There are indexes and keys and joins and views and a lot of stuff a single database could hold, so by shielding you from how the data is actually stored, your code is made easier to write and maintain.
The same goes for parallel programming: if you only specify what you want done with the data instead of actually implementing the looping code, then the underlying infrastructure can "parallelize" it and execute your function in a simultaneous parallel loop for you.

MapReduce is a method for processing vast amounts of data in parallel without requiring the developer to write any code other than the mapper and reduce functions.
The map function takes data in and churns out a result, which is held in a barrier. This function can run in parallel with a large number of instances of the same map task. The dataset can then be reduced to a scalar value.
So if you think of it like a SQL statement
SELECT SUM(salary)
FROM employees
WHERE salary > 1000
GROUP BY deptname
We can use map to get our subset of employees with salary > 1000, which map emits to the barrier in group-sized buckets.
Reduce will then sum each of those groups, giving you your result set.
I just plucked this from my university study notes on the Google paper.
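As a rough, non-authoritative sketch of the SQL analogy above on a local collection (Scala, with made-up employee data), the map step filters and emits (deptname, salary) pairs and the reduce step sums each department's bucket:
case class Employee(deptname: String, salary: Int)

val employees = List(
  Employee("sales", 1200), Employee("sales", 900),
  Employee("eng",   2000), Employee("eng",   1500)
)

// "Map": keep salaries > 1000 and emit (deptname, salary) pairs.
val mapped = employees.filter(_.salary > 1000).map(e => (e.deptname, e.salary))

// "Reduce": sum the salaries within each deptname bucket.
val sums = mapped.groupBy(_._1).map { case (dept, rows) => dept -> rows.map(_._2).sum }
// Map(sales -> 1200, eng -> 3500)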

Take a bunch of data
Perform some kind of transformation that converts every datum to another kind of datum
Combine those new data into yet simpler data
Step 2 is Map. Step 3 is Reduce.
For example,
Get time between two impulses on a pair of pressure meters on the road
Map those times into speeds based upon the distance of the meters
Reduce those speeds to an average speed
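Sketched in code (Scala, with made-up readings and a fixed meter spacing, just to make the two steps concrete):
// Made-up numbers: times (in seconds) between impulses on meters 50 metres apart.
val meterDistance = 50.0
val timesSeconds  = List(2.5, 2.0, 4.0, 3.2)

// Map: turn each time into a speed (metres per second).
val speeds = timesSeconds.map(t => meterDistance / t)

// Reduce: collapse the speeds into a single average speed.
val averageSpeed = speeds.sum / speeds.size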
The reason MapReduce is split between Map and Reduce is because different parts can easily be done in parallel. (Especially if Reduce has certain mathematical properties.)
For a complex but good description of MapReduce, see: Google's MapReduce Programming Model -- Revisited (PDF).

MAP and REDUCE are old Lisp functions from a time when man killed the last dinosaurs.
Imagine you have a list of cities with information about the name, the number of people living there and the size of the city:
(defparameter *cities*
  '((a :people 100000 :size 200)
    (b :people 200000 :size 300)
    (c :people 150000 :size 210)))
Now you may want to find the city with the highest population density.
First we create a list of city names and population density using MAP:
(map 'list
     (lambda (city)
       (list (first city)
             (/ (getf (rest city) :people)
                (getf (rest city) :size))))
     *cities*)
=> ((A 500) (B 2000/3) (C 5000/7))
Using REDUCE we can now find the city with the largest population density.
(reduce (lambda (a b)
          (if (> (second a) (second b))
              a
              b))
        '((A 500) (B 2000/3) (C 5000/7)))
=> (C 5000/7)
Combining both parts we get the following code:
(reduce (lambda (a b)
          (if (> (second a) (second b))
              a
              b))
        (map 'list
             (lambda (city)
               (list (first city)
                     (/ (getf (rest city) :people)
                        (getf (rest city) :size))))
             *cities*))
Let's introduce functions:
(defun density (city)
  (list (first city)
        (/ (getf (rest city) :people)
           (getf (rest city) :size))))

(defun max-density (a b)
  (if (> (second a) (second b))
      a
      b))
Then we can write our MAP REDUCE code as:
(reduce 'max-density
        (map 'list 'density *cities*))
=> (C 5000/7)
It calls MAP and REDUCE (evaluation is inside out), so it is called map reduce.

Let's take the example from the Google paper. The goal of MapReduce is to be able to efficiently use a load of processing units working in parallel for some kinds of algorithms. The example is the following: you want to extract all the words and their counts from a set of documents.
Typical implementation:
for each document
    for each word in the document
        get the counter associated to the word for the document
        increment that counter
    end for
end for
MapReduce implementation:
Map phase (input: document key, document)
    for each word in the document
        emit an event with the word as the key and the value "1"
    end for

Reduce phase (input: key (a word), an iterator going through the emitted values)
    for each value in the iterator
        sum up the value in a counter
    end for
Around that, you'll have a master program which partitions the set of documents into "splits", which will be handled in parallel during the Map phase. The emitted values are written by each worker into a buffer specific to that worker. The master program then delegates other workers to perform the Reduce phase as soon as it is notified that a buffer is ready to be handled.
Every worker's output (whether it is a Map or a Reduce worker) is in fact a file stored on the distributed file system (GFS for Google), or in the distributed database in the case of CouchDB.

A really easy, quick and "for dummies" introduction to MapReduce is available at: http://www.marcolotz.com/?p=67
Posting some of its content:
First of all, why was MapReduce originally created?
Basically Google needed a solution for making large computation jobs easily parallelizable, allowing data to be distributed across a number of machines connected through a network. Aside from that, it had to handle machine failures in a transparent way and manage load-balancing issues.
What are MapReduce true strengths?
One may say that the MapReduce magic is based on the application of the Map and Reduce functions. I must confess, mate, that I strongly disagree. The main feature that made MapReduce so popular is its capability for automatic parallelization and distribution, combined with its simple interface. These factors, summed with transparent failure handling for most errors, made this framework so popular.
A little more depth on the paper:
MapReduce was originally mentioned in a Google paper (Dean & Ghemawat, 2004 – link here) as a solution to make computations on Big Data using a parallel approach and commodity-computer clusters. In contrast to Hadoop, which is written in Java, Google's framework is written in C++. The document describes how a parallel framework would behave using the Map and Reduce functions from functional programming over large data sets.
In this solution there would be two main steps – called Map and Reduce –, with an optional step between the first and the second – called Combine. The Map step would run first, do computations in the input key-value pair and generate a new output key-value. One must keep in mind that the format of the input key-value pairs does not need to necessarily match the output format pair. The Reduce step would assemble all values of the same key, performing other computations over it. As a result, this last step would output key-value pairs. One of the most trivial applications of MapReduce is to implement word counts.
The pseudo-code for this application is given below:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
As one can notice, the map reads all the words in a record (in this case a record can be a line) and emits the word as a key and the number 1 as a value.
Later on, the reduce will group all values of the same key. Let's give an example: imagine that the word 'house' appears three times in the record. The input of the reducer would be [house, [1, 1, 1]]. In the reducer, it will sum all the values for the key house and give as output the following key-value pair: [house, 3].
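As a rough local simulation of this word count (Scala, with no cluster, shuffle or fault-tolerance machinery), the map phase emits (word, 1) pairs, the grouping step produces e.g. ("house", List(1, 1, 1)), and the reduce phase sums each group:
val documents = List("the house is big", "the house near the house")

// Map phase: emit (word, 1) for every word of every document.
val emitted: List[(String, Int)] =
  documents.flatMap(doc => doc.split(" ").map(word => (word, 1)))

// The framework's shuffle groups values by key, e.g. "house" -> List(1, 1, 1).
val grouped: Map[String, List[Int]] =
  emitted.groupBy(_._1).map { case (word, pairs) => word -> pairs.map(_._2) }

// Reduce phase: sum the list of counts for each word.
val counts: Map[String, Int] = grouped.map { case (word, ones) => word -> ones.sum }
// counts("house") == 3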
As a few other classical examples of MapReduce applications, one can say:
•Count of URL access frequency
•Reverse Web-link Graph
•Distributed Grep
•Term Vector per host
In order to avoid too much network traffic, the paper describes how the framework should try to maintain data locality. This means that it should always try to make sure that a machine running Map jobs has the data in its memory/local storage, avoiding fetching it over the network. Aiming to reduce the network throughput of a mapper, the optional Combine step described before is used. The Combiner performs computations on the output of the mappers in a given machine before sending it to the Reducers, which may be on another machine.
The document also describes how the elements of the framework should behave in case of faults. These elements are called worker and master in the paper. They are divided into more specific elements in open-source implementations.
Since Google has only described the approach in the paper and not released its proprietary software, many open-source frameworks were created in order to implement the model. As examples one may cite Hadoop or the limited MapReduce feature in MongoDB.
The run-time should take care of details that non-expert programmers shouldn't have to worry about, like partitioning the input data, scheduling the program execution across the large set of machines, handling machine failures (in a transparent way, of course) and managing the inter-machine communication. An experienced user may tune these parameters, such as how the input data is partitioned between workers.
Key Concepts:
•Fault Tolerance: It must tolerate machine failure gracefully. In order to do this, the master pings the workers periodically. If the master does not receive responses from a given worker within a definite time lapse, the master marks the work as failed on that worker. In this case, all map tasks completed by the faulty worker are thrown away and given to another available worker. The same happens if the worker was still processing a map or a reduce task. Note that if the worker has already completed its reduce part, all computation was already finished by the time it failed and does not need to be reset. As a primary point of failure, if the master fails, the whole job fails. For this reason, one may define periodic checkpoints for the master, in order to save its data structures. All computations that happen between the last checkpoint and the master failure are lost.
•Locality: In order to avoid network traffic, the framework tries to make sure that all the input data is locally available to the machines that are going to perform computations on it. In the original description, it uses Google File System (GFS) with the replication factor set to 3 and block sizes of 64 MB. This means that the same block of 64 MB (which composes a file in the file system) will have identical copies on three different machines. The master knows where the blocks are and tries to schedule map jobs on those machines. If that fails, the master tries to allocate a machine near a replica of the task's input data (i.e. a worker machine in the same rack as the data machine).
•Task Granularity: Assuming that each map phase is divided into M pieces and that each reduce phase is divided into R pieces, the ideal would be for M and R to be a lot larger than the number of worker machines. This is due to the fact that a worker performing many different tasks improves dynamic load balancing. Aside from that, it increases the recovery speed in case a worker fails (since the many map tasks it has completed can be spread out across all the other machines).
•Backup Tasks: Sometimes, a Map or Reduce worker may behave a lot more slowly than the others in the cluster. This may hold up the total processing time and make it equal to the processing time of that single slow machine. The original paper describes an alternative called Backup Tasks: backup copies of the in-progress tasks that the master schedules when a MapReduce operation is close to completion. Thus, the MapReduce operation completes when either the primary or the backup finishes.
•Counters: Sometimes one may desire to count the occurrences of events. For this reason, counters were created. The counter values from each worker are periodically propagated to the master. The master then aggregates (yep, looks like Pregel aggregators came from this place) the counter values of successful map and reduce tasks and returns them to the user code when the MapReduce operation is complete. There is also a current counter value available in the master status, so a human watching the process can keep track of how it is behaving.
Well, I guess with all the concepts above, Hadoop will be a piece of cake for you. If you have any question about the original MapReduce article or anything related, please let me know.

If you are familiar with Python, following is the simplest possible explanation of MapReduce:
In [2]: data = [1, 2, 3, 4, 5, 6]
In [3]: mapped_result = map(lambda x: x*2, data)
In [4]: mapped_result
Out[4]: [2, 4, 6, 8, 10, 12]
In [10]: final_result = reduce(lambda x, y: x+y, mapped_result)
In [11]: final_result
Out[11]: 42
See how each segment of raw data was processed individually, in this case multiplied by 2 (the map part of MapReduce). Based on the mapped_result, we concluded that the result would be 42 (the reduce part of MapReduce). (Note that this is a Python 2 session; in Python 3, map returns a lazy iterator and reduce has to be imported from functools.)
An important conclusion from this example is the fact that each chunk of processing doesn't depend on another chunk. For instance, if thread_1 maps [1, 2, 3], and thread_2 maps [4, 5, 6], the eventual result of both the threads would still be [2, 4, 6, 8, 10, 12] but we have halved the processing time for this. The same can be said for the reduce operation and is the essence of how MapReduce works in parallel computing.

I don't want to sound trite, but this helped me so much, and it's pretty simple:
cat input | map | reduce > output

Related

Why is no global time such a big issue in distributed systems? When would a global time be useful?

It's not hard for me to understand why a global time cannot really exist, or at least cannot be measured, in a distributed system. However, I don't really understand why this is such a big issue. I mean, most code is executed sequentially anyway, or in a causal relation (so something like: A happens, then we can use it in B, then execute C). I've never seen code that was like "it's critical that these two threads execute something at the exact same time". In what scenario would it be useful to have a global time?
I mean most code is executed sequentially anyway
I disagree. That's true almost by definition for a single process on a single thread. But if Taylor Swift drops a new album on Twitter and 1M of her 88.7M followers like it, they don't have to wait in a queue for users 1-999,999 to finish their "like" updates. (Or at least, the data structure is much faster than taking out a heavyweight lock to guarantee a sequence.) There's a lot of non-sequential code there.
or in a causal relation (so something like A then we can use it in B then execute C).
Right, and a naive clock-based implementation does not implement causality. Say the true order of events is this:
Initially: x = 1
Process 1: set x = 2
Process 2: read x
Process 2: if (x == 2) set x = 4
But if we rely on their clocks, it might look like:
Process 1: set x = 2 (t = 4:00:00.000)
Process 2: read x (t = 3:59:59.000)
Process 2: if (x == 2) set x = 4 (t = 3:59:59.100)
Replicas might rely on these timestamps to replay the operations, sorting them by timestamp. Relying on skewed clocks like this would violate causality.
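As a tiny illustration (my own sketch, not from the original answer), sorting those three operations by their skewed timestamps replays Process 2's steps before Process 1's write, which breaks the causal order:
case class Op(process: String, action: String, clock: String)

// True causal order, with the skewed wall-clock times from above.
val causalOrder = Seq(
  Op("P1", "set x = 2",             "4:00:00.000"),
  Op("P2", "read x",                "3:59:59.000"),
  Op("P2", "if (x == 2) set x = 4", "3:59:59.100")
)

// A replica that sorts by timestamp (string comparison works here because the
// format is fixed) replays P2's operations first, violating causality.
val replayOrder = causalOrder.sortBy(_.clock)
replayOrder.foreach(println)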

Bipartite graph distributed processing with dynamic programming

I am trying to figure out an efficient algorithm for processing Documents in a distributed (FaaS, to be more precise) environment.
The brute-force approach would be O(D * F * R), where:
D is the number of Documents to process
F is the number of Filters
R is the highest number of Rules in a single Filter
I can assume that:
a single Filter has no more than 10 Rules
some Filters may share Rules (so it's an N-to-N relation)
Rules are boolean functions (predicates), so I can try to take advantage of early cutting, meaning that if I have f() && g() && h() with f() evaluating to false, then I do not have to process g() and h() at all and can return false immediately
in a single Document the number of Fields is always the same (and about 5-10)
Filters, Rules and Documents are already in the database
every Filter has at least one Rule
Using sharing (the second assumption), I had an idea to first process a Document against every Rule and then (after finishing), for every Filter, compute its result from the already-computed Rules. This way, if a Rule is shared, I compute it only once. However, it doesn't take advantage of early cutting (the third assumption).
A second idea is to use early cutting as a slightly optimized brute force, but then it won't use Rule sharing.
Rule sharing looks like subproblem sharing, so memoization and dynamic programming will probably be helpful.
I have noticed that the Filter-Rule relation is a bipartite graph, though I'm not quite sure whether that can help me. I have also noticed that I could use reverse sets and store the corresponding Set in every Rule. This, however, would create a circular dependency and may cause desynchronization problems in the database.
The default idea is that Documents are streamed, and every single one of them is an event that creates a FaaS instance to process it. However, this would probably force every FaaS instance to query for all Filters, which leaves me at O(F * D) queries because of the shared-nothing architecture.
Sample Filter:
{
    'normalForm': 'CONJUNCTIVE',
    'rules':
    [
        {
            'isNegated': true,
            'field': 'X',
            'relation': 'STARTS_WITH',
            'value': 'G',
        },
        {
            'isNegated': false,
            'field': 'Y',
            'relation': 'CONTAINS',
            'value': 'KEY',
        },
    ]
}
or in more condensed form:
document -> !document.x.startsWith("G") && document.y.contains("KEY")
for Document:
{
    'x': 'CAR',
    'y': 'KEYBOARD',
    'z': 'PAPER',
}
evaluates to true.
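For what it's worth, here is a small sketch (my own, in Scala, with illustrative Rule/Filter classes rather than the real schema) of the first idea combined with early cutting: shared Rules are memoized per Document, while the conjunction still short-circuits within a Filter:
type Document = Map[String, String]

case class Rule(id: String, isNegated: Boolean, field: String, relation: String, value: String) {
  def eval(doc: Document): Boolean = {
    val fieldValue = doc.getOrElse(field, "")
    val matched = relation match {
      case "STARTS_WITH" => fieldValue.startsWith(value)
      case "CONTAINS"    => fieldValue.contains(value)
      case _             => false
    }
    if (isNegated) !matched else matched
  }
}

case class Filter(name: String, rules: Seq[Rule])

// Shared Rules are evaluated at most once per Document (memoization),
// while forall still short-circuits on the first false Rule (early cutting).
def evaluate(filters: Seq[Filter], doc: Document): Map[String, Boolean] = {
  val cache = scala.collection.mutable.Map.empty[String, Boolean]
  def ruleValue(r: Rule): Boolean = cache.getOrElseUpdate(r.id, r.eval(doc))
  filters.map(f => f.name -> f.rules.forall(ruleValue)).toMap
}

val notStartsWithG = Rule("r1", isNegated = true,  field = "x", relation = "STARTS_WITH", value = "G")
val containsKey    = Rule("r2", isNegated = false, field = "y", relation = "CONTAINS",    value = "KEY")
val doc = Map("x" -> "CAR", "y" -> "KEYBOARD", "z" -> "PAPER")

evaluate(Seq(Filter("sample", Seq(notStartsWithG, containsKey))), doc)   // Map(sample -> true)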
I can slightly change the data model, stream something else instead of Documents (e.g. Filters), and use any NoSQL database and tools to help. Apache Flink (event processing) and MongoDB (a single query to retrieve a Filter with its Rules), or maybe Neo4j (as the model looks like a bipartite graph), look like they could help me, but I'm not sure about it.
Can it be processed efficiently (with regard to - probably - database queries)? What tools would be appropriate?
I have also been wondering whether I am trying to solve a special case of some more general (math) problem that may have useful theorems and algorithms.
EDIT: My newest idea: gather all Documents in a cache like Redis. Then a single event starts up and publishes N functions (as in Function as a Service), and every function selects F/N Filters (the number of Filters divided by the number of instances, so just evenly distributing Filters across instances). This way every Filter is fetched from the database only once.
Now, every instance streams all Documents from the cache (one Document should be less than 1 MB, and at the same time I should have 1-10k of them, so they should fit in the cache). This way every Document is selected from the database only once (into the cache).
I have reduced database read operations (still, some Rules are selected multiple times), but I am still not taking advantage of Rule sharing across Filters. I could intentionally ignore it by using a document database; this way, by selecting a Filter I would also get its Rules. Still, I would have to recalculate their values.
I guess that's what I get for using a shared-nothing scalable architecture?
I realized that although my graph is indeed (in theory) bipartite, in practice it's going to be a set of disjoint bipartite graphs (as not all Rules are going to be shared). This means that I can process those disjoint parts independently on different FaaS instances without recalculating the same Rules.
This reduces my problem to processing a single connected bipartite graph. Now, I can use the benefits of dynamic programming and share the result of a Rule computation only if memory is shared, so I cannot divide (and distribute) this problem further without sacrificing this benefit. So I thought this way: if I have already decided that I will have to recompute some Rules, then let that number be low compared to the disjoint parts I will get.
This is actually the minimum cut problem, which (fortunately) has a known polynomial-complexity algorithm.
However, this may not be ideal in my case, because I don't want to cut off just any part of the graph; I would like to cut the graph ideally in half (a divide-and-conquer strategy that could be reapplied recursively until the graph is so small that it can be processed in seconds in a FaaS instance, which has a time bound).
This means that I am looking for a cut that would create two disjoint bipartite graphs, each with possibly the same (or at least a similar) number of vertices.
This is the sparsest cut problem, which is NP-hard but has an O(sqrt(log N))-approximation algorithm that also favors fewer cut edges.
Currently, this does look like a solution for my problem; however, I would love to hear any suggestions, improvements and other answers.
Maybe it can be done better with another data model or algorithm? Maybe I can reduce it further with some theorem? Maybe I could transform it into another (simpler) problem, or at least one that is easier to divide and distribute across nodes?
This idea and analysis strongly suggests using a graph database.

Spark: groupBy taking lot of time

In my application, when taking performance numbers, groupBy is eating away a lot of time.
My RDD has the below structure:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in RDD like which week, month, city, etc.
public class CustomTuple implements Serializable {
    private Map hierarchyMap = null;
    private Map granularMap = null;
    private String timePeriod = null;
    private String sourceKey = null;
}
Map
This map contains the statistical data about that row like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing the below DAG on this RDD:
apply filter on this RDD and scope out relevant rows : Filter
apply filter on this RDD and scope out relevant rows : Filter
Join the RDDs: Join
apply map phase to compute investment: Map
apply GroupBy phase to group the data according to the desired view: GroupBy
apply a map phase to aggregate the data as per the grouping achieved in above step (say view data across timeperiod) and also create new objects based on the resultset desired to be collected: Map
collect the result: Collect
So if the user wants to view investment across time periods, then the below List is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked the time taken by the operations, GroupBy was taking 90% of the time taken to execute the whole DAG.
IMO, we can replace the GroupBy and subsequent Map operations by a single reduce.
But reduce will work on objects of type JavaPairRDD<CustomTuple, Map<String, Double>>.
So my reduce will be like T reduce(T, T, T) where T will be <CustomTuple, Map<String, Double>>.
Or maybe after step 3 in the above DAG I run another map function that returns me an RDD of the appropriate type for the metric that needs to be aggregated, and then run a reduce.
Also, I am not sure how the aggregate function works and whether it will be able to help me in this case.
Secondly, my application will receive requests on varying keys. In my current RDD design, each request would require me to repartition or re-group my RDD on that key. This means that for each request, grouping/repartitioning would take 95% of my time to compute the job.
<"market1", 20>
<"market2", 30>
This is very discouraging, as the current performance of the application without Spark is 10 times better than the performance with Spark.
Any insight is appreciated.
[EDIT] We also noticed that JOIN was taking a lot of time. Maybe that's why groupBy was taking time. [EDIT]
TIA!
Spark's documentation encourages you to avoid groupBy operations; instead it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). You have to use these operations in order to perform an aggregation both before and after the shuffle (in the Map phase and in the Reduce phase, if we use Hadoop terminology), so your execution times will improve (I don't know if it will be 10 times better, but it has to be better).
If I understand your processing correctly, I think you can use a single combineByKey operation. The following explanation is written for Scala code, but you can translate it to Java without too much effort.
combineByKey has three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: In this operation you create a new class in order to combine your data, so you could aggregate your CustomTuple data into a new class, CustomTupleCombiner (I don't know if you only want to make a sum, or maybe you want to apply some other process to this data, but either option can be done in this operation).
mergeValue: In this operation you have to describe how a CustomTuple is summed into a CustomTupleCombiner (again, I am presupposing a simple sum operation). For example, if you want to sum the data by key, you will have a Map in your CustomTupleCombiner class, so the operation should be something like CustomTupleCombiner.sum(CustomTuple), which does CustomTupleCombiner.Map(CustomTuple.key) -> CustomTuple.Map(CustomTuple.key) + CustomTupleCombiner.value.
mergeCombiners: In this operation you have to define how to merge two combiner classes, CustomTupleCombiner in my example. So this will be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), which will be something like CustomTupleCombiner1.Map.keys.foreach( k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k) ), or something like that.
The pasted code is untested (it will not even compile, because I wrote it in vim), but I think it might work for your scenario.
I hope this will be useful.
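Since the pasted code mentioned above is not reproduced here, the following is only my own minimal, untested sketch of what a combineByKey along those lines could look like in Scala (the key and the metric types are simplified stand-ins, not the asker's real CustomTuple classes):
import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeySketch {
  type Metrics = Map[String, Double]   // e.g. "Inv" -> 20.0, "GRP" -> 30.0

  // Sum two metric maps key by key.
  def sumMaps(a: Metrics, b: Metrics): Metrics =
    (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0))).toMap

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combineByKey-sketch").setMaster("local[*]"))

    // (groupingKey, metrics) pairs; in the question the key would be derived from CustomTuple.
    val rows = sc.parallelize(Seq(
      ("week1-city1", Map("Inv" -> 20.0, "GRP" -> 30.0)),
      ("week1-city1", Map("Inv" -> 5.0,  "GRP" -> 1.0)),
      ("week2-city1", Map("Inv" -> 7.0))
    ))

    val aggregated = rows.combineByKey(
      (v: Metrics) => v,                            // createCombiner: first value becomes the combiner
      (c: Metrics, v: Metrics) => sumMaps(c, v),    // mergeValue: fold a value into the combiner (map side)
      (c1: Metrics, c2: Metrics) => sumMaps(c1, c2) // mergeCombiners: merge combiners across partitions
    )

    aggregated.collect().foreach(println)
    sc.stop()
  }
}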
Shuffling is triggered by any change in the key of a [K,V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default, partitioning is calculated using the hash value of your key, implemented by the hashCode() method. In your case your key contains two Map instance variables. The default implementation of the hashCode() method will have to calculate the hashCode() of those maps as well, causing an iteration over all their elements to in turn calculate the hashCode() of each of those elements.
The solutions are:
Do not include the Map instances in your Key. This seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map instance variables (a rough sketch follows below).
Possibly you can avoid using the Map objects completely. If it is something that is shared amongst multiple elements, you might need to consider using broadcast variables in spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling, by tuning your hashing between two consecutive group-by's.
Keep shuffling Node local, by choosing a Partitioner that will have an affinity of keeping partitions local during consecutive use.
Good reading on hashCode(), including a reference to quotes by Josh Bloch, can be found on the wiki.
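For the second option, a rough sketch (in Scala for brevity, although the question's class is Java; the class and field names just mirror CustomTuple and are illustrative) of a key whose hashCode() skips the heavyweight Map fields, while equals() still compares everything so the hashCode/equals contract holds:
class LightweightKey(
    val timePeriod: String,
    val sourceKey: String,
    val hierarchyMap: Map[String, String],
    val granularMap: Map[String, String]
) extends Serializable {

  // Only the cheap scalar fields are hashed, so partitioning never walks the maps.
  override def hashCode(): Int =
    31 * timePeriod.hashCode + sourceKey.hashCode

  // equals still compares all fields; equal objects still get equal hash codes.
  override def equals(other: Any): Boolean = other match {
    case that: LightweightKey =>
      timePeriod == that.timePeriod &&
      sourceKey == that.sourceKey &&
      hierarchyMap == that.hierarchyMap &&
      granularMap == that.granularMap
    case _ => false
  }
}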

Can reduceLeft be executed in parallel?

I just started learning Scala, so please be patient :-)
I have a question about how reduceLeft behaves. Here an example:
List(1, 2, 3, 4, 5) reduceLeft (_ + _)
I wonder if the calculation can be done simultaneously, e.g.:
first round:
process 1 calculates: 1 + 2
process 2 calculates: 4 + 5
second round:
process 1 calculates: 3 + 3
third round:
process 1 calculates: 6 + 9
At least that's what I would expect to happen if I just used the reduce function instead of reduceLeft. Or does reduceLeft really do only one reduction at a time?
((((1 + 2) + 3) + 4) + 5)
This would basically mean it can't be executed in parallel and one should always prefer reduce over reduceLeft/Right if possible?
The answer is yes, and it is very easy:
List(1, 2, 3, 4, 5).par.reduce (_ + _)
The par method turns the list into a parallel collection. When you call reduce on this parallel collection, it will be executed in parallel.
See the parallel collection documentation
As you've noticed, reduceLeft is not parallelizable, as it explicitly assumes a form that is not associative: (B,A) => B.
As long as you use an associative operator, reduce is parallelizable.
There's also an analog of foldLeft called aggregate that takes two functions: one to fold an element into a combinable form, and an associative one to merge the intermediate results: (B, A) => B and (B, B) => B.
This one, as long as the two functions agree on the output and you have a zero to mix in wherever you want, is parallelizable (a short example is sketched at the end of this answer).
So if you want to be able to be parallel,
reduceLeft/Right -> reduce
foldLeft/Right -> aggregate
There may be some cases where reduce is more restrictive than reduceLeft but aggregate will do the trick.
That said, this only makes the statement able to be parallel. For it to actually be parallel you need to use a collection that inherits from ParIterable, and these all have Par in their names: ParVector, etc.. The easiest way to get a parallel collection is to call .par on a regular one (.seq goes the other way, from parallel to non-parallel). It's done this way because in general there's no reason to be parallel except for speed, but parallelism adds overhead. So you should only operate in parallel if there's enough work to do, and while you may know that, the compiler probably doesn't. Thus, you are to explicitly select which kind of collection you want. (Parallel collections return parallel, and sequential return sequential.)
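A small example of the reduce/aggregate forms above on a parallel collection (this assumes a Scala version where .par is available out of the box, e.g. 2.12, or the scala-parallel-collections module on 2.13+):
val xs = List(1, 2, 3, 4, 5).par

// reduce needs an associative operator and no seed:
val sum = xs.reduce(_ + _)                 // 15

// aggregate takes a zero, a (B, A) => B "fold in" function and a (B, B) => B merge.
// Here it computes (count, total) in one pass.
val (count, total) = xs.aggregate((0, 0))(
  (acc, x) => (acc._1 + 1, acc._2 + x),    // fold an element into a partial result
  (a, b)   => (a._1 + b._1, a._2 + b._2)   // merge partial results from parallel chunks
)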

Scala parallel unordered iterator

I have an Iterable of "work units" that need to be performed, in no particular order, and can easily run in parallel without interfering with one another.
Unfortunately, running too many of them at a time will exceed my available RAM, so I need to make sure that only a handful is running simultaneously at any given time.
At the most basic, I want a function of this type signature:
parMap[A, B](xs: Iterator[A], f: A => B, chunkSize: Int): Iterator[B]
such that the output Iterator is not necessarily in the same order as the input (if I want to maintain knowledge of where the result came from, I can output a pair with the input or something.) The consumer can then consume the resulting iterator incrementally without eating up all of the machine's memory, while maintaining as much parallelism as is possible for this task.
Furthermore, I want the function to be as efficient as possible. An initial idea I had was for example to do something along the following lines:
xs.iterator.grouped(chunkSize).flatMap(_.toSet.par.map(f).iterator)
where I was hoping the toSet would inform Scala's parallel collection that it could start producing elements from its iterator as soon as they were ready, in any order, and the grouped call was to limit the number of simultaneous workers. Unfortunately, it doesn't look like the toSet call achieves the desired effect (the results are returned in the same order as they would have been without the par call, in my experiments,) and the grouped call is suboptimal. For example, if we have a group size of 100, and 99 of those jobs complete immediately on a dozen cores, but one of them is particularly slow, most of the remaining cores will be idle until we can move to the next group. It would be much cleaner to have an "adaptive window" that is at most as big as my chunk size, but doesn't get held up by slow workers.
I can envision writing something like this myself with a work-stealing (de)queue or something along those lines, but I imagine that a lot of the hard work of dealing with the concurrency primitives is already done for me at some level in Scala's parallel collections library. Does anyone know what parts of it I could reuse to build this piece of functionality, or have other suggestions on how to implement such an operation?
The Parallel collections framework allows you to specify the maximum number of threads to be used for a given task. Using scala-2.10, you'd want to do:
def parMap[A, B](x: Iterable[A], f: A => B, chunkSize: Int) = {
  val px = x.par
  px.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(chunkSize))
  px map f
}
This will prevent more than chunkSize operations running at any one time. This uses a work-stealing strategy underneath to keep the actors working, and so doesn't suffer from the same problem as your grouped example above.
Doing it this way won't reorder the results into first-completed order, however. For that, I'd suggest something like turning your operation into an actor and having a small actor pool running the operations and then sending results back to you as they complete.
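If you don't want to pull in actors, here is my own rough sketch of that last idea using plain Futures on a bounded thread pool: each work unit pushes its result onto a queue as soon as it finishes, and the returned iterator hands results back in completion order (error handling, backpressure on the result queue and executor shutdown are deliberately omitted):
import java.util.concurrent.{Executors, LinkedBlockingQueue}
import scala.concurrent.{ExecutionContext, Future}

def parMapUnordered[A, B](xs: Iterator[A], f: A => B, chunkSize: Int): Iterator[B] = {
  // At most chunkSize work units run at once; the rest wait in the executor's queue.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(chunkSize))

  val inputs  = xs.toSeq
  val results = new LinkedBlockingQueue[B]()

  // Each work unit pushes its result the moment it completes, in whatever order that is.
  inputs.foreach { a => Future { results.put(f(a)) } }

  // Pull exactly as many results as there were inputs, blocking until each is ready.
  Iterator.fill(inputs.size)(results.take())
}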