too many map keys causing out of memory exception in spark - scala

I have an RDD 'inRDD' of the form RDD[(Vector[(Int, Byte)], Vector[(Int, Byte)])] which is a PairRDD(key,value) where key is Vector[(Int, Byte)] and value is Vector[(Int, Byte)].
For each element (Int, Byte) in the vector of key field, and each element (Int, Byte) in the vector of value field I would like to get a new (key,value) pair in the output RDD as (Int, Int), (Byte, Byte).
That should give me an RDD of the form RDD[((Int, Int), (Byte, Byte))].
For example, inRDD contents could be like,
(Vector((3,2)),Vector((4,2))), (Vector((2,3), (3,3)),Vector((3,1))), (Vector((1,3)),Vector((2,1))), (Vector((1,2)),Vector((2,2), (1,2)))
which would become
((3,4),(2,2)), ((2,3),(3,1)), ((3,3),(3,1)), ((1,2),(3,1)), ((1,2),(2,2)), ((1,1),(2,2))
I have the following code for that.
val outRDD = inRDD.flatMap {
case (left, right) =>
for ((ll, li) <- left; (rl, ri) <- right) yield {
(ll,rl) -> (li,ri)
}
}
It works when the vectors are small in size in the inRDD. But when there are lot elements in the vectors, I get out of memory exception. Increasing the available memory
to spark could only solve for smaller inputs and the error appears again for even larger inputs.
Looks like I am trying to assemble a huge structure in memory. I am unable to rewrite this code in any other ways.
I have implemented a similar logic with java in hadoop as follows.
for (String fromValue : fromAssetVals) {
fromEntity = fromValue.split(":")[0];
fromAttr = fromValue.split(":")[1];
for (String toValue : toAssetVals) {
toEntity = toValue.split(":")[0];
toAttr = toValue.split(":")[1];
oKey = new Text(fromEntity.trim() + ":" + toEntity.trim());
oValue = new Text(fromAttr + ":" + toAttr);
outputCollector.collect(oKey, oValue);
}
}
But when I try something similar in spark, I get nested rdd exceptions.
How do I do this efficiently with spark using scala?

Well, if Cartesian product is the only option you can at least make it a little bit more lazy:
inRDD.flatMap { case (xs, ys) =>
xs.toIterator.flatMap(x => ys.toIterator.map(y => (x, y)))
}
You can also handle this at the Spark level
import org.apache.spark.RangePartitioner
val indexed = inRDD.zipWithUniqueId.map(_.swap)
val partitioner = new RangePartitioner(indexed.partitions.size, indexed)
val partitioned = indexed.partitionBy(partitioner)
val lefts = partitioned.flatMapValues(_._1)
val rights = partitioned.flatMapValues(_._2)
lefts.join(rights).values

Related

groupBy on List as LinkedHashMap instead of Map

I am processing XML using scala, and I am converting the XML into my own data structures. Currently, I am using plain Map instances to hold (sub-)elements, however, the order of elements from the XML gets lost this way, and I cannot reproduce the original XML.
Therefore, I want to use LinkedHashMap instances instead of Map, however I am using groupBy on the list of nodes, which creates a Map:
For example:
def parse(n:Node): Unit =
{
val leaves:Map[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.groupBy(_.label)
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
...
}
In this example, I want leaves to be of type LinkedHashMap to retain the order of n.child. How can I achieve this?
Note: I am grouping by label/tagname because elements can occur multiple times, and for each label/tagname, I keep a list of elements in my data structures.
Solution
As answered by #jwvh I am using foldLeft as a substitution for groupBy. Also, I decided to go with LinkedHashMap instead of ListMap.
def parse(n:Node): Unit =
{
val leaves:mutable.LinkedHashMap[String, Seq[XmlItem]] =
n.child
.filter(node => { ... })
.foldLeft(mutable.LinkedHashMap.empty[String, Seq[Node]])((m, sn) =>
{
m.update(sn.label, m.getOrElse(sn.label, Seq.empty[Node]) ++ Seq(sn))
m
})
.map((tuple:Tuple2[String, Seq[Node]]) =>
{
val items = tuple._2.map(node =>
{
val attributes = ...
if (node.text.nonEmpty)
XmlItem(Some(node.text), attributes)
else
XmlItem(None, attributes)
})
(tuple._1, items)
})
To get the rough equivalent to .groupBy() in a ListMap you could fold over your collection. The problem is that ListMap preserves the order of elements as they were appended, not as they were encountered.
import collection.immutable.ListMap
List('a','b','a','c').foldLeft(ListMap.empty[Char,Seq[Char]]){
case (lm,c) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res0: ListMap[Char,Seq[Char]] = ListMap(b -> Seq(b), a -> Seq(a, a), c -> Seq(c))
To fix this you can foldRight instead of foldLeft. The result is the original order of elements as encountered (scanning left to right) but in reverse.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}
//res1: ListMap[Char,Seq[Char]] = ListMap(c -> Seq(c), b -> Seq(b), a -> Seq(a, a))
This isn't necessarily a bad thing since a ListMap is more efficient with last and init ops, O(1), than it is with head and tail ops, O(n).
To process the ListMap in the original left-to-right order you could .toList and .reverse it.
List('a','b','a','c').foldRight(ListMap.empty[Char,Seq[Char]]){
case (c,lm) => lm.updated(c, c +: lm.getOrElse(c, Seq()))
}.toList.reverse
//res2: List[(Char, Seq[Char])] = List((a,Seq(a, a)), (b,Seq(b)), (c,Seq(c)))
Purely immutable solution would be quite slow. So I'd go with
import collection.mutable.{ArrayBuffer, LinkedHashMap}
implicit class ExtraTraversableOps[A](seq: collection.TraversableOnce[A]) {
def orderedGroupBy[B](f: A => B): collection.Map[B, collection.Seq[A]] = {
val map = LinkedHashMap.empty[B, ArrayBuffer[A]]
for (x <- seq) {
val key = f(x)
map.getOrElseUpdate(key, ArrayBuffer.empty) += x
}
map
}
To use, just change .groupBy in your code to .orderedGroupBy.
The returned Map can't be mutated using this type (though it can be cast to mutable.Map or to mutable.LinkedHashMap), so it's safe enough for most purposes (and you could create a ListMap from it at the end if really desired).

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from reputation, in this case it is "11849" and the number from age, in this example it is "35" I would like to have them as floats.
The file is located in a HDFS so it comes in the format RDD
val linesWithAge = lines.filter(line => line.contains("Age=")) //This is filtering data which doesnt have age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) //Here I am trying to split the data where there is a "
so when I split it with quotation marks the reputation is in index 3 and age in index 23 but how do I assign these to a map or a variable so I can use them as floats.
Also I need it to do this for every line on the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So if added an index to the array and now I've successfully managed to assign it to a variable but I can only do it to one item in the RDD, does anyone know how I could do it to all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name:String):Option[String] = {
for {
entry <- src.find(_.startsWith(name))
value <- entry.split("\"").lift(1)
} yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap{line =>
val elements = line.split(" ")
for {
age <- findElement(elements, "Age")
rep <- findElement(elements, "Reputation")
} yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Henceforth the result of the for-comprehension will be None and the record will be flattened by the flatMap operation.
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML
import scala.xml.XML._
// help function to map Strings to Option where empty strings become None
def emptyStrToNone(str:String):Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap{line =>
val record = XML.loadString(line)
for {
rep <- emptyStrToNone((record \ "#Reputation").text)
age <- emptyStrToNone((record \ "#Age").text)
} yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key of your line (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs[Float](line,"Age"), getValueForKeyAs[Float](line,"Reputation")))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line:String, key:String) : A = {
val res = line.split(key+"=")
if(res.size==1) throw new RuntimeException(s"no value for key $key")
val value = res(1).split("\"")(1)
return value.asInstanceOf[A]
}

acces tuple inside a tuple for anonymous map job in Spark

This post is essentially about how to build joint and marginal histograms from a (String, String) RDD. I posted the code that I eventually used below as the answer.
I have an RDD that contains a set of tuples of type (String,String) and since they aren't unique I want to get a look at how many times each String, String combination occurs so I use countByValue like so
val PairCount = Pairs.countByValue().toSeq
which gives me a tuple as output like this ((String,String),Long) where long is the number of times that the (String, String) tuple appeared
These Strings can be repeated in different combinations and I essentially want to run word count on this PairCount variable so I tried something like this to start:
PairCount.map(x => (x._1._1, x._2))
But the output the this spits out is String1->1, String2->1, String3->1, etc.
How do I output a key value pair from a map job in this case where the key is going to be one of the String values from the inner tuple, and the value is going to be the Long value from the outter tuple?
Update:
#vitalii gets me almost there. the answer gets me to a Seq[(String,Long)], but what I really need is to turn that into a map so that I can run reduceByKey it afterwards. when I run
PairCount.flatMap{case((x,y),n) => Seq[x->n]}.toMap
for each unique x I get x->1
for example the above line of code generates mom->1 dad->1 even if the tuples out of the flatMap included (mom,30) (dad,59) (mom,2) (dad,14) in which case I would expect toMap to provide mom->30, dad->59 mom->2 dad->14. However, I'm new to scala so I might be misinterpreting the functionality.
how can I get the Tuple2 sequence converted to a map so that I can reduce on the map keys?
If I correctly understand question, you need flatMap:
val pairCountRDD = pairs.countByValue() // RDD[((String, String), Int)]
val res : RDD[(String, Int)] = pairCountRDD.flatMap { case ((s1, s2), n) =>
Seq(s1 -> n, s2 -> n)
}
Update: I didn't quiet understand what your final goal is, but here's a few more examples that may help you, btw code above is incorrect, I have missed the fact that countByValue returns map, and not RDD:
val pairs = sc.parallelize(
List(
"mom"-> "dad", "dad" -> "granny", "foo" -> "bar", "foo" -> "baz", "foo" -> "foo"
)
)
// don't use countByValue, if pairs is large you will run out of memmory
val pairCountRDD = pairs.map(x => (x, 1)).reduceByKey(_ + _)
val wordCount = pairs.flatMap { case (a,b) => Seq(a -> 1, b ->1)}.reduceByKey(_ + _)
wordCount.take(10)
// count in how many pairs each word occur, keys and values:
val wordPairCount = pairs.flatMap { case (a,b) =>
if (a == b) {
Seq(a->1)
} else {
Seq(a -> 1, b ->1)
}
}.reduceByKey(_ + _)
wordPairCount.take(10)
to get the histograms for the (String,String) RDD I used this code.
val Hist_X = histogram.map(x => (x._1-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_Y = histogram.map(x => (x._2-> 1.0)).reduceByKey(_+_).collect().toMap
val Hist_XY = histogram.map(x => (x-> 1.0)).reduceByKey(_+_)
where histogram was the (String,String) RDD

Is there a better way for reduce operation on RDD[Array[Double]]

I want to reduce a RDD[Array[Double]] in order to each element of the array will be add with the same element of the next array.
I use this code for the moment :
var rdd1 = RDD[Array[Double]]
var coord = rdd1.reduce( (x,y) => { (x, y).zipped.map(_+_) })
Is there a better way to make this more efficiently because it cost a harm.
Using zipped.map is very inefficient, because it creates a lot of temporary objects and boxes the doubles.
If you use spire, you can just do this
> import spire.implicits._
> val rdd1 = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)))
> var coord = rdd1.reduce( _ + _)
res1: Array[Double] = Array(4.0, 6.0)
This is much nicer to look at, and should also be much more efficient.
Spire is a dependency of spark, so you should be able to do the above without any extra dependencies. At least it worked with a spark-shell for spark 1.3.1 here.
This will work for any array where there is an AdditiveSemigroup typeclass instance available for the element type. In this case, the element type is Double. Spire typeclasses are #specialized for double, so there will be no boxing going on anywhere.
If you really want to know what is going on to make this work, you have to use reify:
> import scala.reflect.runtime.{universe => u}
> val a = Array(1.0, 2.0)
> val b = Array(3.0, 4.0)
> u.reify { a + b }
res5: reflect.runtime.universe.Expr[Array[Double]] = Expr[scala.Array[Double]](
implicits.additiveSemigroupOps(a)(
implicits.ArrayNormedVectorSpace(
implicits.DoubleAlgebra,
implicits.DoubleAlgebra,
Predef.this.implicitly)).$plus(b))
So the addition works because there is an instance of AdditiveSemigroup for Array[Double].
I assume the concern is that you have very large Array[Double] and the transformation as written does not distribute the addition of them. If so, you could do something like (untested):
// map Array[Double] to (index, double)
val rdd2 = rdd1.flatMap(a => a.zipWithIndex.map(t => (t._2,t._1))
// get the sum for each index
val reduced = rdd2.reduceByKey(_ + _)
// key everything the same to get a single iterable in groubByKey
val groupAll = reduced.map(t => ("constKey", (t._1, t._2)
// get the doubles back together into an array
val coord = groupAll.groupByKey { (k,vs) =>
vs.toList.sortBy(_._1).toArray.map(_._2) }

applying a function to graph data using mapReduceTriplets in spark and graphx

I'm having some problems applying the mapReduceTriplets to my graph network in spark using graphx.
I've been following the tutorials and read in my own data which is put together as [Array[String],Int], so for example my vertices are:
org.apache.spark.graphx.VertexRDD[Array[String]] e.g. (3999,Array(17, Low, 9))
And my edges are:
org.apache.spark.graphx.EdgeRDD[Int]
e.g. Edge(3999,4500,1)
I'm trying to apply an aggregate type function using mapReduceTriplets which counts how many of the last integer in the array of a vertices (in the above example 9) is the same or different to the first integer (in the above example 17) of all connected vertices.
So you would end up with a list of counts for the number of matches or non-matches.
The problem I am having is applying any function using mapReduceTriplets, I am quite new to scala so this may be really obvious, but in the graphx tutorials it has an example which is using a graph with the format Graph[Double, Int], however my graph is in the format of Graph[Array[String],Int], so i'm just trying as a first step to figure out how I can use my graph in the example and then work from there.
The example on the graphx website is as follows:
val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Double)](
triplet => { // Map Function
if (triplet.srcAttr > triplet.dstAttr) {
// Send message to destination vertex containing counter and age
Iterator((triplet.dstId, (1, triplet.srcAttr)))
} else {
// Don't send a message for this triplet
Iterator.empty
}
},
// Add counter and age
(a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
)
Any advice would be most appreciated, or if you think there is a better way than using mapreducetriplets I would be happy to hear it.
Edited new code
val nodes = (sc.textFile("C~nodeData.csv")
.map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
val edges = GraphLoader.edgeListFile(sc, "C:~edges.txt")
val graph = edges.outerJoinVertices(nodes) {
case (uid, deg, Some(attrList)) => attrList
case (uid, deg, None) => Array.empty[String]
}
val countsRdd = graph.collectNeighbors(EdgeDirection.Either).leftOuterJoin(graph.vertices).map {
case (id, t) => {
val neighbors: Array[(VertexId, Array[String])] = t._1
val nodeAttr = (t._2)
neighbors.map(_._2).count( x => x.apply(x.size - 1) == nodeAttr(0))
}
}
I think you want to use GraphOps.collectNeighbors instead of either mapReduceTriplets or aggregateMessages.
collectNeighbors will give you an RDD with, for every VertexId in your graph, the connected nodes as an array. Just reduce the Array based on your needs. Something like:
val countsRdd = graph.collectNeighbors(EdgeDirection.Either)
.join(graph.vertices)
.map{ case (vid,t) => {
val neighbors = t._1
val nodeAttr = t._2
neighbors.map(_._2).filter( <add logic here> ).size
}
If this doesn't get you going in the right direction, or you get stuck, let me know (the "" part, for example).