Adding a constant to an RDD - Scala

I have a really stupid question. I know that an RDD is immutable, but is there any way to add a constant column to an RDD?
More specifically, I have an RDD of (a: String, b: String) pairs, and I wish to append a column of 1's so that I end up with an RDD of (a: String, b: String, c: Int).
The reason is that I want to use the reduceByKey function to process these strings, and an arbitrary Int (which will be updated as the reduction proceeds) will help the function during the reduce.

The solution in Scala is simply to use map:
rdd.map( t => (t._1, t._2, 1))
Or
rdd.map{ case (a, b) => (a, b, 1)}

You can easily do it with the map function; here's an example in Python (note that this tuple-unpacking lambda syntax works in Python 2 only; in Python 3 you would write something like lambda t: (t[0], t[1], 1)):
rdd.map(lambda (a,b): (a,b,1))
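To connect this back to the reduceByKey motivation in the question, here is a minimal, hedged Scala sketch; keying by the (a, b) pair is illustrative and not specified by the asker:
// Illustrative only: attach a count of 1 to each record and sum the 1's per key.
val withOnes = rdd.map { case (a, b) => ((a, b), 1) }
val counts = withOnes.reduceByKey(_ + _)  // RDD[((String, String), Int)]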

Related

Count operation in reduceByKey in spark

val temp1 = tempTransform.map({ temp => ((temp.getShort(0), temp.getString(1)), (USAGE_TEMP.getDouble(2), USAGE_TEMP.getDouble(3)))})
.reduceByKey((x, y) => ((x._1+y._1),(x._2+y._2)))
Here I have performed a sum operation, but is it possible to do a count operation inside reduceByKey?
Something like what I'm thinking:
reduceByKey((x, y) => (math.count(x._1),(x._2+y._2)))
But this does not work. Any suggestions, please?
Well, counting is equivalent to summing 1s, so just map the first item in each value tuple into 1 and sum both parts of the tuple like you did before:
val temp1 = tempTransform.map { temp =>
    ((temp.getShort(0), temp.getString(1)), (1, USAGE_TEMP.getDouble(3)))
  }
  .reduceByKey((x, y) => ((x._1 + y._1), (x._2 + y._2)))
Result would be an RDD[((Short, String), (Int, Double))] where the first item in the value tuple (the Int) is the number of original records matching that key.
That's actually the classic map-reduce example - word count.
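For reference, a minimal word-count sketch of that pattern; lines is an assumed RDD[String] and not part of the question:
// Classic word count: emit (word, 1) pairs and sum the 1's per word.
val wordCounts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)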
No, you can't do that. RDDs provide an iterator model for lazy computation, so every element will be visited only once.
If you really want to do the sum as described, repartition your RDD first, then use mapPartitions and implement your calculation in the closure (keep in mind that elements in an RDD are not in order).
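A rough sketch of that mapPartitions route, assuming a keyed RDD pairs of type RDD[((Short, String), Double)] (the names here are illustrative, not from the question):
// Pre-aggregate (count, sum) per key inside each partition,
// then merge the partial results across partitions with reduceByKey.
val partials = pairs.mapPartitions { iter =>
  val acc = scala.collection.mutable.Map.empty[(Short, String), (Int, Double)]
  iter.foreach { case (key, value) =>
    val (c, s) = acc.getOrElse(key, (0, 0.0))
    acc(key) = (c + 1, s + value)
  }
  acc.iterator
}
val result = partials.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))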

Spark closure argument binding

I am working with Apache Spark in Scala.
I have a problem when trying to manipulate one RDD with data from a second RDD. I am trying to pass the 2nd RDD as an argument to a function being 'mapped' against the first RDD, but seemingly the closure created on that function binds an uninitialized version of that value.
Following is a simpler piece of code that shows the type of problem I'm seeing. (My real example where I first had trouble is larger and less understandable).
I don't really understand the argument binding rules for Spark closures.
What I'm really looking for is a basic approach or pattern for how to manipulate one RDD using the content of another (which was previously constructed elsewhere).
In the following code, calling Test1.process(sc) will fail with a null pointer access in findSquare (as the 2nd arg bound in the closure is not initialized)
object Test1 {
  def process(sc: SparkContext) {
    val squaresMap = (1 to 10).map(n => (n, n * n))
    val squaresRDD = sc.parallelize(squaresMap)
    val primes = sc.parallelize(List(2, 3, 5, 7))
    for (p <- primes) {
      println("%d: %d".format(p, findSquare(p, squaresRDD)))
    }
  }

  def findSquare(n: Int, squaresRDD: RDD[(Int, Int)]): Int = {
    squaresRDD.filter(kv => kv._1 == n).first._1
  }
}
The problem you experience has nothing to do with closures or RDDs, which, contrary to popular belief, are serializable.
It simply breaks a fundamental Spark rule, which states that you cannot trigger an action or transformation from another action or transformation*; different variants of this question have been asked on SO multiple times.
To understand why that's the case you have to think about the architecture:
SparkContext is managed on the driver
everything that happens inside transformations is executed on the workers. Each worker has access only to its own part of the data and doesn't communicate with other workers**.
If you want to use content of multiple RDDs you have to use one of the transformations which combine RDDs, like join, cartesian, zip or union.
Here you most likely (I am not sure why you pass a tuple and use only the first element of it) want to either use a broadcast variable:
val squaresMapBD = sc.broadcast(squaresMap)

def findSquare(n: Int): Seq[(Int, Int)] = {
  squaresMapBD.value
    .filter { case (k, v) => k == n }
    .map { case (k, v) => (n, k) }
    .take(1)
}

primes.flatMap(findSquare)
or Cartesian:
primes
.cartesian(squaresRDD)
.filter{case (n, (k, _)) => n == k}.map{case (n, (k, _)) => (n, k)}
Converting primes to dummy pairs (Int, null) and using join would be more efficient:
primes.map((_, null)).join(squaresRDD).map(...)
but based on your comments I assume you're interested in a scenario where there is a natural join condition.
Depending on a context you can also consider using database or files to store common data.
On a side note, RDDs are not iterable, so you cannot simply use a for loop. To be able to do something like this you have to collect first or convert with toLocalIterator. You can also use the foreach method.
* To be precise you cannot access SparkContext.
** Torrent broadcast and tree aggregates involve communication between executors so it is technically possible.
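As a hedged illustration of the side note about iteration, using the primes RDD from the question:
// Bring elements to the driver before looping over them.
for (p <- primes.toLocalIterator) {
  println(p)
}
// Or run the body on the executors instead; output then goes to the executor logs.
primes.foreach(p => println(p))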
RDDs are not serializable, so you can't use an RDD inside an RDD transformation.
Also, I've never seen an RDD enumerated with a for statement; usually I use the foreach statement that is part of the RDD API.
In order to combine data from two RDDs, you can leverage join, union or broadcast (in case your RDD is small).

aggregate data for uniquely tagged values in a list in scala

I was wondering if somebody could help.
I'm trying to aggregate some data in a list based on id values. I have a ListBuffer which is updated from a foreach function, so my output consists of an id number and a value. Because the foreach applies a function to each id, often more than once, the list I end up with looks something like the following:
ListBuffer(3106;0, 3106;3, 3108;2, 3108;0, 3110;1, 3110;2, 3113;0, 3113;2, 3113;0)
What I want to do is apply a simple function to aggregate this data, so I am left with
List(3106;3 ,3108;2, 3110;3, 3113;2)
I thought this could be done with foldLeft or groupBy, however I'm not sure how to get it to distinguish the id values from the normal values.
Any help or pointers would be much appreciated
First of all, you can't group key-value pairs written this way. In Scala you have tuples, which are written as
val pair: (Int, Int) = (3106,3), where
pair._1 == 3106
pair._2 == 3
are true statements.
So you have:
import scala.collection.mutable.ListBuffer
val l = ListBuffer((3106,0), (3106,3), (3108,2), (3108,0), (3110,1), (3110,2), (3113,0), (3113,2), (3113,0))
val result = l.groupBy(x => x._1).map(x => (x._1, x._2.map(_._2))).map(x => (x._1, x._2.sum)).toList
println(result)
will give you
List((3106,3), (3108,2), (3110,3), (3113,2))
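Since the asker mentioned foldLeft, here is a hedged sketch of that alternative, folding the pairs into a Map of running sums:
// Fold into a Map of running sums per id, then convert back to a sorted list.
val result2 = l.foldLeft(Map.empty[Int, Int]) { case (acc, (id, v)) =>
  acc.updated(id, acc.getOrElse(id, 0) + v)
}.toList.sortBy(_._1)
// List((3106,3), (3108,2), (3110,3), (3113,2))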

Is it possible to use reduceByKey((x, y, z) => ...)?

Is it possible to have a reduceByKey of the form reduceByKey((x, y, z) => ...)?
Because I have an RDD:
RDD[((String, String, Double), (Double, Double, scala.collection.immutable.Map[String,Double]))]
And I want to reduce by key, so I tried this operation:
reduceByKey((x, y, z) => (x._1 + y._1 + z._1, x._2 + y._2 + z._2, (((x._3)++y._3)++z._3)))
and it shows me an error message: missing parameter type
I tested it before with two elements and it works, but with 3 I really don't know what my error is. What is the way to do that?
Here's what you're missing: reduceByKey is telling you that you have a key-value pairing, and the reducing function only ever combines two values of the same type at a time; that's part of what makes it a reduce. Hence, the function passed to reduceByKey can only ever take two arguments, both of the value type. So, no, you can't directly have a function of arity 3, only of arity 2.
Here's how I'd handle your situation:
reduceByKey { (x, y) =>
  val (d1, d2, map1) = x
  val (e1, e2, map2) = y
  // rest of work, e.g. (d1 + e1, d2 + e2, map1 ++ map2)
}
However, let me make one slight suggestion: use a case class for your value. It's easier to grok and is essentially equivalent to your 3-tuple.
If you look at the reduceByKey function on PairRDDFunctions, its signature is
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Hence, it's not possible to pass it a three-argument function.
However, you can wrap your 3-tuple into a model while keeping your existing key, making your RDD an RDD[(key, your-model)], and now you can aggregate the model in whatever way you want.
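A hedged sketch of that model-wrapping idea; the case class name and field names below are hypothetical, chosen to match the shape of the value tuple in the question:
// Hypothetical model for the (Double, Double, Map[String, Double]) value.
case class Agg(a: Double, b: Double, m: Map[String, Double]) {
  def combine(other: Agg): Agg = Agg(a + other.a, b + other.b, m ++ other.m)
}

// rdd: RDD[((String, String, Double), (Double, Double, Map[String, Double]))]
val aggregated = rdd
  .mapValues { case (a, b, m) => Agg(a, b, m) }
  .reduceByKey(_ combine _)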
Hope this helps.

Scala List of tuples to flat list

I have a list of tuple pairs, List[(String,String)], and want to flatten it to a list of strings, List[String].
Some of the options might be:
concatenate:
list.map(t => t._1 + t._2)
one after the other interleaved (after your comment it seems you were asking for this):
list.flatMap(t => List(t._1, t._2))
split and append them:
list.map(_._1) ++ list.map(_._2)
Well, you can always use flatMap as in:
list flatMap (x => List(x._1, x._2))
Although your question is a little vague.
Try:
val tt = List(("John","Paul"),("George","Ringo"))
tt.flatMap{ case (a,b) => List(a,b) }
This results in:
List(John, Paul, George, Ringo)
In general for lists of tuples of any arity, consider this,
myTuplesList.map(_.productIterator.map(_.toString)).flatten
Note that productIterator yields the elements of a tuple typed as Any, hence we convert the values back to String here.
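For example, with the list from the earlier answer this gives:
List(("John", "Paul"), ("George", "Ringo"))
  .map(_.productIterator.map(_.toString))
  .flatten
// List(John, Paul, George, Ringo)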
See -
https://stackoverflow.com/a/43716004/4610065
In this case (using the shapeless library) -
import shapeless.syntax.std.tuple._
List(("John","Paul"),("George","Ringo")).flatMap(_.toList)