How to create Key-Value RDD (Scala)

I have the following RDD (name: AllTrainingDATA_RDD) which is of type
org.apache.spark.rdd.RDD[(String, Double, Double, String)] :
(ICCH_1,4.3,3.0,Iris-setosa)
(ICCH_1,4.4,2.9,Iris-setosa)
(ICCH_1,4.4,3.0,Iris-setosa)
(ICCH_2,4.4,3.2,Iris-setosa)
1st column: ICCH_ID, 2nd column: X_Coordinates, 3rd column: Y_Coordinates, 4th column: Class
I would like to end up with an RDD which has the 2nd and 3rd columns as the key and the 4th column as the value. The column ICCH_ID should remain in the RDD.
My current attempt, based on some Internet research, is this:
val AllTrainingDATA_RDD_Final = AllTrainingDATA_RDD.map(_.split(",")).keyBy(_(X_COORD,Y_COORD)).mapValues(fields => ("CLASS")).groupByKey().collect()
However I get this error:
error: value split is not a member of (String, Double, Double, String)
P.S. I am using Databricks Community Edition. I am new to Scala.

Let's try to break down your solution, part by part:
val AllTrainingDATA_RDD_Final = AllTrainingDATA_RDD
.map(_.split(","))
.keyBy(_(X_COORD,Y_COORD))
.mapValues(fields => ("CLASS"))
.groupByKey()
.collect()
Your first problem is the use of .map(_.split(",")). This is likely a preprocessing stage done on an RDD[String] to extract the comma-separated values from the text input lines. But since that has already been done, we can go ahead and drop that part.
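For comparison, a split step like that only makes sense if you are still starting from raw text lines; a hypothetical sketch (the file name and parsing here are assumptions, not part of your pipeline):
// Only needed when the data still lives in a text file,
// not in an already-typed RDD[(String, Double, Double, String)].
val rawLines = sc.textFile("iris_training.csv")         // RDD[String]
val typedRDD = rawLines
  .map(_.split(","))
  .map(a => (a(0), a(1).toDouble, a(2).toDouble, a(3))) // RDD[(String, Double, Double, String)]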
Your second problem will come from .keyBy(_(X_COORD,Y_COORD)), and it's going to look something like this:
error: (String, Double, Double, String) does not take parameters
This is because you supplied keyBy with an anonymous function that attempts to apply (X_COORD, Y_COORD) as arguments to each of the tuples in your RDD, but what you actually want is a function that extracts the x and y coordinates (the 2nd and 3rd values) from your tuple. One way to achieve this is .keyBy{case (_, x, y, _) => (x, y)}
Lastly, your use of mapValues just produces the same String value ("CLASS") for all elements in the RDD. Instead, you can simply take the 4th value of the tuple, like so: .mapValues(_._4)
Putting this all together, you get the following code:
val AllTrainingDATA_RDD_Final = AllTrainingDATA_RDD
.keyBy{case (_, x, y, _) => (x, y)}
.mapValues(_._4)
.groupByKey()
.collect()
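For reference, collecting over your four sample rows should yield something roughly like the following (group order is not guaranteed, and groupByKey materializes the values as an Iterable, typically a CompactBuffer):
// AllTrainingDATA_RDD_Final: Array[((Double, Double), Iterable[String])]
// Array(
//   ((4.3,3.0), CompactBuffer(Iris-setosa)),
//   ((4.4,2.9), CompactBuffer(Iris-setosa)),
//   ((4.4,3.0), CompactBuffer(Iris-setosa)),
//   ((4.4,3.2), CompactBuffer(Iris-setosa))
// )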
Since you are new to Scala, I suggest you take some time to get acquainted with its syntax, features and APIs before you continue. It will help you understand and overcome such problems much faster.

Related

How to create an RDD by selecting specific data from an existing RDD where output should of RDD[String]?

I have a scenario where I need to capture some data (not all) from an existing RDD and then pass it to another Scala class for the actual operations. Let's see it with example data (empnum, empname, emplocation, empsal) in a text file:
11,John,Paris,1000
12,Daniel,UK,3000
As a first step, I create an RDD of type RDD[String] with the code below:
val empRDD = spark
.sparkContext
.textFile("empInfo.txt")
So, my requirement is to create another RDD with empnum, empname, emplocation (again with RDD[String]).
For that I have tried the code below, but I am getting RDD[(String, String, String)].
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
I have tried with slice also; it gives me RDD[Array[String]].
My required RDD should be of RDD[String] to pass to required Scala class to do some operations.
The expected output should be,
11,John,Paris
12,Daniel,UK
Can anyone help me with how to achieve this?
I would try this:
val empReqRDD = empRDD
.map(a=> a.split(","))
.map(x=> (x(0), x(1), x(2)))
val rddString = empReqRDD.map({case(id,name,city) => "%s,%s,%s".format(id,name,city)})
In your initial implementation, the second map is putting the array elements into a 3-tuple, hence the RDD[(String, String, String)].
One way to accomplish your objective is to change the second map to construct a string like so:
empRDD
.map(a=> a.split(","))
.map(x => s"${x(0)},${x(1)},${x(2)}")
Alternatively, and a bit more concisely, you could do it by taking the first 3 elements of the array and using the mkString method:
empRDD.map(_.split(',').take(3).mkString(","))
Probably overkill for this use-case, but you could also use a regex to extract the values:
val r = "([^,]*),([^,]*),([^,]*).*".r
empRDD.map { case r(id, name, city) => s"$id,$name,$city" }
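One caveat worth noting: map with a pattern-matching function like the one above will throw a MatchError on any line that does not fit the regex. If such lines may exist in your file, a defensive variant (a sketch, not required for the well-formed sample data) is RDD.collect with a partial function, which simply drops non-matching lines:
empRDD.collect { case r(id, name, city) => s"$id,$name,$city" }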

Filtering RDDs based on value of Key

I have two RDDs that wrap the following arrays:
Array((3,Ken), (5,Jonny), (4,Adam), (3,Ben), (6,Rhonda), (5,Johny))
Array((4,Rudy), (7,Micheal), (5,Peter), (5,Shawn), (5,Aaron), (7,Gilbert))
I need to design the code in such a way that if I provide 3 as input, it returns
Array((3,Ken), (3,Ben))
If the input is 6, the output should be
Array((6,Rhonda))
I tried something like this:
val list3 = list1.union(list2)
list3.reduceByKey(_+_).collect
list3.reduceByKey(6).collect
None of these worked; can anyone help me out with a solution to this problem?
Given the following, which you would have to define yourself:
// Provide your SparkContext and inputs here
val sc: SparkContext = ???
val array1: Array[(Int, String)] = ???
val array2: Array[(Int, String)] = ???
val n: Int = ???
val rdd1 = sc.parallelize(array1)
val rdd2 = sc.parallelize(array2)
You can use union and filter to reach your goal:
rdd1.union(rdd2).filter(_._1 == n)
Since filtering by key is something that you would probably want to do in several occasions, it makes sense to encapsulate this functionality in its own function.
It would also be interesting if we could make sure that this function could work on any type of keys, not only Ints.
You can express this in the old RDD API as follows:
import org.apache.spark.rdd.RDD

def filterByKey[K, V](rdd: RDD[(K, V)], k: K): RDD[(K, V)] =
  rdd.filter(_._1 == k)
You can use it as follows:
val rdd = rdd1.union(rdd2)
val filtered = filterByKey(rdd, n)
Let's look at this method a little bit more in detail.
This method allows you to filter by key an RDD that contains generic pairs, where the type of the first item is K and the type of the second is V (for key and value). It also accepts a key of type K that will be used to filter the RDD.
You then use the filter function, which takes a predicate (a function that goes from the element type - in this case the pair (K, V) - to a Boolean) and makes sure that the resulting RDD only contains items that satisfy this predicate.
We could also have written the body of the function as:
rdd.filter(pair => pair._1 == k)
or
rdd.filter { case (key, value) => key == k }
but we took advantage of the _ wildcard to express the fact that we want to act on the first (and only) parameter of this anonymous function.
To use it, you first parallelize your arrays into RDDs, call union on them, and then invoke the filterByKey function with the number you want to filter by (as shown in the example).
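For example, with the two sample arrays from the question and n = 3, the call should produce:
val filtered = filterByKey(rdd1.union(rdd2), 3)
filtered.collect()   // Array((3,Ken), (3,Ben))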

Spark: Dividing one array by elements in another

I am new to Apache Spark and Scala. I am trying to understand something here:
I have one array:
Companies= Array(
(Microsoft,478953),
(IBM,332042),
(JP Morgan,226003),
(Google,342033)
)
I wanted to divide this by another array, element by element:
Count = Array((Microsoft,4), (IBM,3), (JP Morgan,2), (Google,3))
I used this code:
val result: Array[(String, Double)] = wordMapCount
.zip(letterMapCount)
.map { case ((letter, wc), (_, lc)) => (letter, lc.toDouble / wc) }
From here: Divide Arrays
This works. However, I do not understand it. Why does zip require the second array and not the first one? And the case matching, how is that working here?
Why does zip require the second array and not the first one?
Because that's how zip works. It takes two separate RDD instances and pairs the elements of one with the elements of the other, creating pairs of (first element, second element):
def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
case matching how is that working here
You have two tuples:
(Microsoft, 478953), (Microsoft,4)
What this partial function does is decompose the tuple types via calls to Tuple2.unapply. This:
case ((letter, wc), (_, lc))
Means "extract the first argument (_1) from the first tuple into a fresh value named letter, and the second argument (_2) to a fresh value named wc. Same goes for the second tuple. And then, it creates a new tuple with letter as the first value and the division of lc and wc as the second argument.

How to transform Array[(Double, Double)] into Array[Double] in Scala?

I'm using MLlib of Spark (v1.1.0) and Scala to do k-means clustering applied to a file with points (longitude and latitude).
My file contains 4 fields separated by comma (the last two are the longitude and latitude).
Here, it's an example of k-means clustering using Spark:
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
What I want to do is to read the last two fields of my files, which are in a specific directory in HDFS, and transform them into an RDD<Vector> to use this method of the KMeans class:
train(RDD<Vector> data, int k, int maxIterations)
This is my code:
val data = sc.textFile("/user/test/location/*")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))))
But when I run it in spark-shell I get the following error:
error: overloaded method value dense with alternatives: (values:
Array[Double])org.apache.spark.mllib.linalg.Vector (firstValue:
Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[(Double, Double)])
So, I don't know how to transform my Array[(Double, Double)] into Array[Double]. Maybe there is another way to read the two fields and convert them into RDD<Vector>, any suggestion?
The previous suggestion using flatMap was based on the assumption that you wanted to map over the elements of the array given by the .split(","), and offered to satisfy the types by using Array instead of Tuple2.
The argument received by the .map/.flatMap functions is an element of the original collection, so it should be named 'field' (singular) for clarity. Calling fields(2) selects the 3rd character of each of the elements of the split - hence the source of confusion.
If what you're after is the 3rd and 4th elements of the .split(",") array, converted to Double:
s.split(",").drop(2).take(2).map(_.toDouble)
or, if you want all BUT the first two fields converted to Double (in case there may be more than 2):
s.split(",").drop(2).map(_.toDouble)
There are two 'factory' methods for dense Vectors:
def dense(values: Array[Double]): Vector
def dense(firstValue: Double, otherValues: Double*): Vector
while the type provided above is Array[Tuple2[Double, Double]], which does not type-match:
(Extracting the logic above:)
val parseLineToTuple: String => Array[(Double, Double)] = s => s.split(',').map(fields => (fields(2).toDouble, fields(3).toDouble))
What is needed here is to create a new Array out of the input String, like this: (again focusing only on the specific parsing logic)
val parseLineToArray: String => Array[Double] = s => s.split(",").flatMap(fields => Array(fields(2).toDouble, fields(3).toDouble))
Integrating that in the original code should solve the issue:
val data = sc.textFile("/user/test/location/*")
val vectors = data.map(s => Vectors.dense(parseLineToArray(s)))
(You can of course inline that code, I separated it here to focus on the issue at hand)
val parsedData = data.map(s => Vectors.dense(s.split(',').flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble))))
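As an aside, both dense factory overloads listed above accept plain Doubles, so either form works once the parsing yields an Array[Double]; a small sketch:
import org.apache.spark.mllib.linalg.Vectors

Vectors.dense(Array(2.3522, 48.8566))   // dense(values: Array[Double])
Vectors.dense(2.3522, 48.8566)          // dense(firstValue: Double, otherValues: Double*)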

Is it possible to use reduceByKey((x, y, z) => ...)?

Is it possible to have a reduceByKey of the form reduceByKey((x, y, z) => ...)?
Because I have an RDD:
RDD[((String, String, Double), (Double, Double, scala.collection.immutable.Map[String,Double]))]
And I want to reduce by key, so I tried this operation:
reduceByKey((x, y, z) => (x._1 + y._1 + z._1, x._2 + y._2 + z._2, (((x._3)++y._3)++z._3)))
and it shows me an error message: missing parameter type
Before, I tested with two elements and it worked, but with 3 I really don't know what my error is. What is the way to do that?
Here's what you're missing: reduceByKey is telling you that you have a key-value pairing. Conceptually, there can only ever be 2 items in a pair; it's part of what makes a pair a pair. Hence, the function you pass to reduceByKey can only ever take 2 arguments (two values that share the same key). So, no, you can't directly have a function of arity 3, only of arity 2.
Here's how I'd handle your situation; note that the two arguments the reducer receives are both values of your value type, not a key and a value:
reduceByKey { (left, right) =>
  val (d1, d2, map1) = left
  val (d3, d4, map2) = right
  // rest of work: combine the two values, e.g. as in your attempt
  (d1 + d3, d2 + d4, map1 ++ map2)
}
However, let me make one slight suggestion: use a case class for your value. It's easier to grok and is essentially equivalent to your 3-tuple.
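As a hypothetical sketch of that suggestion (the case class name, field names, and the combining logic are assumptions that mirror the value type and the attempt in the question):
case class Agg(d1: Double, d2: Double, weights: Map[String, Double])

// rdd is your original RDD[((String, String, Double), (Double, Double, Map[String, Double]))]
val modeled = rdd.mapValues { case (d1, d2, m) => Agg(d1, d2, m) }
val reduced = modeled.reduceByKey((a, b) => Agg(a.d1 + b.d1, a.d2 + b.d2, a.weights ++ b.weights))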
If you look at the reduceByKey function on PairRDDFunctions, it looks like this:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
Hence, it's not possible to have it work on a 3-tuple.
However, you can wrap your 3-tuple value into a model class and still keep your existing key, making your RDD an RDD[(your-key, your-model)], and then you can aggregate the model in whatever way you want.
Hope this helps.