Scala Spark - changing the sign of a Seq[Double] in a flatMap - scala

I'm trying to change the sign of a Seq[Double] inside a flatMap, but I am getting a type mismatch error.
import util.Random.nextDouble
var numbers = Seq.fill(1000)(nextDouble)
val nrdd = sc.parallelize(numbers)
val mrdd = nrdd.flatMap(a => (a)* -1.0)

I think you are simply looking for the map method instead of flatMap.
val mrdd = nrdd.map(a => -a)
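For reference, a minimal sketch of both variants (this assumes a live SparkContext named sc, as in the question); flatMap also type-checks if the function returns a collection, but map is the natural fit for a one-to-one transformation:
import util.Random.nextDouble
val numbers = Seq.fill(1000)(nextDouble)
val nrdd = sc.parallelize(numbers)
// map produces exactly one output value per input value
val viaMap = nrdd.map(a => -a)
// flatMap works too, but only because each result is wrapped in a collection
val viaFlatMap = nrdd.flatMap(a => Seq(-a))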

Related

How to use flatMap for flatten one component of a tuple

I have a tuple like (a, List(b,c,d)). I want the output to look like
(a,b)
(a,c)
(a,d)
I am trying to use flatMap for this but am not having any success. Even map is not helping in this case.
Input Data :
Chap01:Spark is an emerging technology
Chap01:You can easily learn Spark
Chap02:Hadoop is a Bigdata technology
Chap02:You can easily learn Spark and Hadoop
Code:
val rawData = sc.textFile("C:\\wc_input.txt")
val chapters = rawData.map(line => (line.split(":")(0), line.split(":")(1)))
val chapWords = chapters.flatMap(a => (a._1, a._2.split(" ")))
You could map over the second element of the tuple:
val t = ('a', List('b','c','d'))
val res = t._2.map((t._1, _))
The snippet above evaluates to:
res: List[(Char, Char)] = List((a,b), (a,c), (a,d))
This scenario can also be handled easily by the flatMapValues method on an RDD. It operates only on the values of a pair RDD, keeping the key the same, as sketched below.
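A minimal sketch, assuming chapters is the RDD[(String, String)] built in the question:
// Split only the value; the chapter key is carried through unchanged
val chapWords = chapters.flatMapValues(_.split(" "))
// e.g. (Chap01,Spark), (Chap01,is), (Chap01,an), ...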

How to add one to every element of a SparseVector in Breeze?

Given a Breeze SparseVector object:
scala> val sv = new SparseVector[Double](Array(0, 4, 5), Array(1.5, 3.6, 0.4), 8)
sv: breeze.linalg.SparseVector[Double] = SparseVector(8)((0,1.5), (4,3.6), (5,0.4))
What is the best way to take the log of the values + 1?
Here is one way that works:
scala> new SparseVector(sv.index, log(sv.data.map(_ + 1)), sv.length)
res11: breeze.linalg.SparseVector[Double] = SparseVector(8)((0,0.9162907318741551), (4,1.5260563034950492), (5,0.3364722366212129))
I don't like this because it doesn't really make use of Breeze to do the addition. We are using a Breeze UFunc to take the log of an Array[Double], but that isn't much. I am concerned that in a distributed application with large SparseVectors, this will be slow.
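As a side note, a purely Breeze-side sketch of the same computation, assuming breeze.numerics.log1p is available (log1p(x) == log(x + 1), and log1p(0.0) == 0.0, so implicit zeros stay zero):
import breeze.linalg.SparseVector
import breeze.numerics.log1p
val sv = new SparseVector[Double](Array(0, 4, 5), Array(1.5, 3.6, 0.4), 8)
// Whether the result keeps a sparse storage layout depends on the Breeze version
val logged = log1p(sv)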
Spark 1.6.3
You can define some UDFs to do arbitrary vectorized addition in Spark. First, you need to set up the ability to convert Spark vectors to Breeze vectors; an example of doing that is here. Once you have the implicit conversions in place, you have a few options.
To add any two columns you can use:
def addVectors(v1Col: String, v2Col: String, outputCol: String): DataFrame => DataFrame = {
  // Error checking column names here
  df: DataFrame => {
    def add(v1: SparkVector, v2: SparkVector): SparkVector =
      (v1.asBreeze + v2.asBreeze).fromBreeze
    val func = udf((v1: SparkVector, v2: SparkVector) => add(v1, v2))
    df.withColumn(outputCol, func(col(v1Col), col(v2Col)))
  }
}
Note that the use of asBreeze and fromBreeze (as well as the alias for SparkVector) is established in the question linked above. To add one to every element, a possible approach is to make a literal integer column with
df.withColumn(colName, lit(1))
and then add the columns.
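A hypothetical usage of addVectors (the DataFrame df and the column names "v1" and "v2" are made up for illustration; both columns are assumed to hold Spark vectors):
val summed = addVectors("v1", "v2", "v1PlusV2")(df)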
The alternative for more complex mathematical functions is:
def applyMath(func: BreezeVector[Double] => BreezeVector[Double],
              inColName: String, outColName: String): DataFrame => DataFrame = {
  df: DataFrame => df.withColumn(outColName,
    udf((v1: SparkVector) => func(v1.asBreeze).fromBreeze).apply(col(inColName)))
}
You could also make this generic in the Breeze vector parameter.
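For the original "+1 then log" question, a hedged usage sketch of applyMath, again assuming the asBreeze/fromBreeze implicits are in scope and that breeze.numerics.log1p applies element-wise to a Breeze vector (the column names here are made up):
import breeze.numerics.log1p
// log1p(v) computes log(v + 1) element-wise on the Breeze vector
val addOneAndLog = applyMath(v => log1p(v), "features", "logFeatures")
val result = addOneAndLog(df)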

How do I reduce tuples into a tuple of tuples in Scala

I have RDD with rows of type
(a,(b,c,d))
(a,(e,f,g))
I am trying to reduce it by key such that it yields rows of type
(a,(b,c,d),(e,f,g)).
But I am getting an error when using this:
val redcd = mapd.reduceByKey((_,_))
How do I do it?
reduceByKey requires the combining function to return a value of the same type as the RDD's values, so (_, _), which builds a nested pair, does not type-check. If you have an RDD such as
scala> mapd.foreach(println)
(a,(b,c,d))
(a,(e,f,g))
(b,(b,c,d))
Then doing
val redcd = mapd.groupBy(_._1).mapValues(x => x.map(_._2).toList)
would give you
scala> redcd.foreach(println)
(b,List((b,c,d)))
(a,List((b,c,d), (e,f,g)))
Now if you want it in the format explained in the question, you can do
val redcd = mapd.groupBy(_._1).mapValues(x => x.map(_._2).toList.mkString(", "))
which would generate
scala> redcd.foreach(println)
(a,(b,c,d), (e,f,g))
(b,(b,c,d))
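Equivalently, a short sketch with Spark's groupByKey, assuming mapd is already a pair RDD:
val grouped = mapd.groupByKey().mapValues(_.toList)
grouped.foreach(println)
// (a,List((b,c,d), (e,f,g)))
// (b,List((b,c,d)))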
I hope the answer is helpful

Scala function does not return a value

I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class TaxiModel(sc: SparkContext, dat: String) {
  val rawData = sc.textFile(dat)
  val head = rawData.take(10)
  val splithead = head.slice(1, 11).foreach(splitData)
  def splitData(dat: String): Array[String] = {
    val splits = dat.split("\",\"")
    val split0 = splits(0).substring(1, splits(0).length)
    val split8 = splits(8).substring(0, splits(8).length - 1)
    Array(split0).union(splits.slice(1, 8)).union(Array(split8))
  }
}
foreach just evaluates the expression for its side effects and does not collect any results while iterating, so splithead ends up being Unit. You probably need map or flatMap (see the docs here):
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]
Consider also a for comprehension (which desugars in this case into flatMap),
for (s <- head.slice(1,11)) yield splitData(s)
Note also that Scala strings are equipped with ordered collection methods, so
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail
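Putting the fix together, a minimal sketch of the corrected field using the question's own names (untested here, since it needs a live SparkContext):
// map keeps one Array[String] per input line; flatMap would flatten them into a single Array[String]
val splithead: Array[Array[String]] = head.slice(1, 11).map(splitData)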

How to convert Spark's TableRDD to RDD[Array[Double]] in Scala?

I am trying to perform a Scala operation on Shark. I am creating an RDD as follows:
val tmp: shark.api.TableRDD = sc.sql2rdd("select duration from test")
I need to convert it to RDD[Array[Double]]. I tried toArray, but it doesn't seem to work.
I also tried converting it to Array[String] and then converting using map as follows:
val tmp_2 = tmp.map(row => row.getString(0))
val tmp_3 = tmp_2.map { row =>
  val features = Array[Double](row(0))
}
But this gives me Spark's RDD[Unit] (the last statement in the map body is a val definition, which evaluates to Unit), and that cannot be used in the function. Is there any other way to proceed with this type conversion?
Edit: I also tried using toDouble, but this gives me an RDD[Double], not an RDD[Array[Double]]:
val tmp_5 = tmp_2.map(_.toDouble)
Edit 2:
I managed to do this as follows:
A sample of the data:
296.98567000000003
230.84362999999999
212.89751000000001
914.02404000000001
305.55383
A Spark Table RDD was created first.
val tmp = sc.sql2rdd("select duration from test")
I made use of getString to translate it to an RDD[String] and then converted it to an RDD[Array[Double]].
val duration = tmp.map(row => Array[Double](row.getString(0).toDouble))
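If some rows might not parse cleanly, a hedged variant of the same conversion that simply drops unparseable values (assuming the same tmp TableRDD):
import scala.util.Try
// Try(...).toOption.toSeq yields an empty Seq for rows whose duration is not a valid Double
val duration = tmp.flatMap(row => Try(Array(row.getString(0).toDouble)).toOption.toSeq)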