Spark: Best way to build RDD line by line - scala

I have an iterative function that generates some data I want the output in an RDD. What is the best way to build it? I can come up with two solutions
1)
var results = sparkSession.sparkContext.emptyRDD[String]
for (values <- values) {
results = results.union(sparkSession.sparkContext.parallelize[String]Seq(performSomething(value))
}
2)
var results = Seq[String]
for (values <- values) {
results = results += performSomething(value)
}
sparkSession.sparkContext.parallelize[String](results)
I guess the first approach will be slower but will probably reduce memory consumption on driver and second approach will be faster, but before parallelizing all data will be in the driver? Am I correct?
Is there a 3rd better approach?
Thanks!

Related

What's the simplest way to get a Spark DataFrame from arbitrary Array Data in Scala?

I've been breaking my head about this one for a couple of days now. It feels like it should be intuitively easy... Really hope someone can help!
I've built an org.nd4j.linalg.api.ndarray.INDArray of word occurrence from some semi-structured data like this:
import org.nd4j.linalg.factory.Nd4j
import org.nd4s.Implicits._
val docMap = collection.mutable.Map[Int,Map[Int,Int]] //of the form Map(phrase -> Map(phrasePosition -> word)
val words = ArrayBuffer("word_1","word_2","word_3",..."word_n")
val windows = ArrayBuffer("$phrase,$phrasePosition_1","$phrase,$phrasePosition_2",..."$phrase,$phrasePosition_n")
var matrix = Nd4j.create(windows.length*words.length).reshape(windows.length,words.length)
for (row <- matrix.shape(0)){
for(column <- matrix.shape(1){
//+1 to (row,column) if word occurs at phrase, phrasePosition indicated by window_n.
}
}
val finalmatrix = matrix.T.dot(matrix) // to get co-occurrence matrix
So far so good...
Downstream of this point I need to integrate the data into an existing pipeline in Spark, and use that implementation of pca etc, so I need to create a DataFrame, or at least an RDD. If I knew the number of words and/or windows in advance I could do something like:
case class Row(window : String, word_1 : Double, word_2 : Double, ...etc)
val dfSeq = ArrayBuffer[Row]()
for (row <- matrix.shape(0)){
dfSeq += Row(windows(row),matrix.get(NDArrayIndex.point(row), NDArrayIndex.all()))
}
sc.parallelize(dfSeq).toDF("window","word_1","word_2",...etc)
but the number of windows and words is determined at runtime. I'm looking for a WindowsxWords org.apache.spark.sql.DataFrame as output, input is a WindowsxWords org.nd4j.linalg.api.ndarray.INDArray
Thanks in advance for any help you can offer.
Ok, so after several days work it looks like the simple answer is: there isn't one. In fact, it looks like trying to use Nd4j in this context at all is a bad idea for several reasons:
It's (really) hard to get data out of the native INDArray format once you've put it in.
Even using something like guava, the .data() method brings everything on heap which will quickly become expensive.
You've got the added hassle of having to compile an assembly jar or use hdfs etc to handle the library itself.
I did also consider using Breeze which may actually provide a viable solution but carries some of the same problems and can't be used on distributed data structures.
Unfortunately, using native Spark / Scala datatypes, although easier once you know how, is - for someone like me coming from Python + numpy + pandas heaven at least - painfully convoluted and ugly.
Nevertheless, I did implement this solution successfully:
import org.apache.spark.mllib.linalg.{Vectors,Vector,Matrix,DenseMatrix,DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
//first make a pseudo-matrix from Scala Array[Double]:
var rowSeq = Seq.fill(windows.length)(Array.fill(words.length)(0d))
//iterate through 'rows' and 'columns' to fill it:
for (row 0 until windows.length){
for (column 0 until words.length){
// rowSeq(row)(column) += 1 if word occurs at phrase, phrasePosition indicated by window_n.
}
}
//create Spark DenseMatrix
val rows : Array[Double] = rowSeq.transpose.flatten.toArray
val matrix = new DenseMatrix(windows.length,words.length,rows)
One of the main operations that I needed Nd4J for was matrix.T.dot(matrix) but it turns out that you can't multiply 2 matrices of Type org.apache.spark.mllib.linalg.DenseMatrix together, one of them (A) has to be a org.apache.spark.mllib.linalg.distributed.RowMatrix and - you guessed it - you can't call matrix.transpose() on a RowMatrix, only on a DenseMatrix! Since it's not really relevant to the question, I'll leave that part out, except to explain that what comes out of that step is a RowMatrix. Credit is also due here and here for the final part of the solution:
val rowMatrix : [RowMatrix] = transposeAndDotDenseMatrix(matrix)
// get DataFrame from RowMatrix via DenseMatrix
val newdense = new DenseMatrix(rowMatrix.numRows().toInt,rowMatrix.numCols().toInt,rowMatrix.rows.collect.flatMap(x => x.toArray)) // the call to collect() here is undesirable...
val matrixRows = newdense.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Rows")
// then separate columns:
val df2 = (0 until words.length).foldLeft(df)((df, num) =>
df.withColumn(words(num), $"Rows".getItem(num)))
.drop("Rows")
Would love to hear improvements and suggestions on this, thanks.

How to connect component in Spark when data is too large

When dealing with component connecting of big data, I find it very difficult to merging them in spark.
The data structure in my research can be simplified to RDD[Array[Int]]. For example:
RDD[Array(1,2,3), Array(1,4), Array(5,6), Array(5,6,7,8), Array(9), Array(1)]
The objective is to merge two Array if they have intersection set, ending up with arrays without any intersection. Therefore after merging, it should be:
RDD[Array(1,2,3,4), Array(5,6,7,8), Array(9)]
The problem is kind of component connecting in Pregel framework in Graph Algo. One solution is to first find the edge connection between two Array using cartesian product and then merge them. However, in my case, there are 300K Array with total size 1G. Therefore, the time and memory complexity would be roughly 300K*300K. When I run the program in my Mac Pro in spark, it is completely stuck.
Baiscally, it is like:
Thanks
Here is my solution. Might not be decent enough, but works for a small data. Whether it can apply to large data needs further proof.
def mergeCanopy(canopies:RDD[Array[Int]]):Array[Array[Int]] = {
/*
try to merge two canopies
*/
val s = Set[Array[Int]]()
val c = canopies.aggregate(s)(mergeOrAppend, _++_)
return c.toArray
def mergeOrAppend(disjoint: Set[Array[Int]], cluster: Array[Int]):Set[Array[Int]] = {
var disjoints = disjoint
for (clus <- disjoint) {
if (clus.toSet.&(cluster.toSet) != Set()) {
disjoints += (clus.toSet++cluster.toSet).toArray
disjoints -= clus
return disjoints
}
}
disjoints += cluster
return disjoints
}

How to use existing trained model using LinearRegressionModel to work with SparkStreaming and predict data with it [duplicate]

I have the following code:
val blueCount = sc.accumulator[Long](0)
val output = input.map { data =>
for (value <- data.getValues()) {
if (record.getEnum() == DataEnum.BLUE) {
blueCount += 1
println("Enum = BLUE : " + value.toString()
}
}
data
}.persist(StorageLevel.MEMORY_ONLY_SER)
output.saveAsTextFile("myOutput")
Then the blueCount is not zero, but I got no println() output! Am I missing anything here? Thanks!
This is a conceptual question...
Imagine You have a big cluster, composed of many workers let's say n workers and those workers store a partition of an RDD or DataFrame, imagine You start a map task across that data, and inside that map you have a print statement, first of all:
Where will that data be printed out?
What node has priority and what partition?
If all nodes are running in parallel, who will be printed first?
How will be this print queue created?
Those are too many questions, thus the designers/maintainers of apache-spark decided logically to drop any support to print statements inside any map-reduce operation (this include accumulators and even broadcast variables).
This also makes sense because Spark is a language designed for very large datasets. While printing can be useful for testing and debugging, you wouldn't want to print every line of a DataFrame or RDD because they are built to have millions or billions of rows! So why deal with these complicated questions when you wouldn't even want to print in the first place?
In order to prove this you can run this scala code for example:
// Let's create a simple RDD
val rdd = sc.parallelize(1 to 10000)
def printStuff(x:Int):Int = {
println(x)
x + 1
}
// It doesn't print anything! because of a logic design limitation!
rdd.map(printStuff)
// But you can print the RDD by doing the following:
rdd.take(10).foreach(println)
I was able to work it around by making a utility function:
object PrintUtiltity {
def print(data:String) = {
println(data)
}
}

Implement a MergeSort like feature in spark with scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDD which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this using Merge by:
Sorting RDD by a key (sort within a group will have not more than 100 records in that group. In the above example is within a group)
Performing the merge operation similar to mergesort. Here I need to keep a track of the previous value as well to find the max; still I traverse the list only once.
Since there are too may variables here I am getting "Task not serializable" exception. Is this implementation approach Correct? I am trying to avoid the Cartesian Product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues( itmorg => {
val miorec = itmorg.toList.sortBy(_(1).toString)
for( r <- 0 to miorec.length) {
for ( q <- 0 to rdd2.value.length) {
if ( (miorec(r)(1).toString > rdd2.value(q).toString && miorec(r-1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
org.apache.spark.sql.Row(miorec(r-1)(0),miorec(r-1)(1),miorec(r-1)(2),miorec(r-1)(3),rdd2.value(q))
}
}
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling an keeping the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 then aggregateByKey, but less efficient).
As for your exception, you woudl need to show the code, "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)

Adding immutable Vectors

I am trying to work more with scalas immutable collection since this is easy to parallelize, but i struggle with some newbie problems. I am looking for a way to create (efficiently) a new Vector from an operation. To be precise I want something like
val v : Vector[Double] = RandomVector(10000)
val w : Vector[Double] = RandomVector(10000)
val r = v + w
I tested the following:
// 1)
val r : Vector[Double] = (v.zip(w)).map{ t:(Double,Double) => t._1 + t._2 }
// 2)
val vb = new VectorBuilder[Double]()
var i=0
while(i<v.length){
vb += v(i) + w(i)
i = i + 1
}
val r = vb.result
}
Both take really long compared to the work with Array:
[Vector Zip/Map ] Elapsed time 0.409 msecs
[Vector While Loop] Elapsed time 0.374 msecs
[Array While Loop ] Elapsed time 0.056 msecs
// with warm-up (10000) and avg. over 10000 runs
Is there a better way to do it? I think the work with zip/map/reduce has the advantage that it can run in parallel as soon as the collections have support for this.
Thanks
Vector is not specialized for Double, so you're going to pay a sizable performance penalty for using it. If you are doing a simple operation, you're probably better off using an array on a single core than a Vector or other generic collection on the entire machine (unless you have 12+ cores). If you still need parallelization, there are other mechanisms you can use, such as using scala.actors.Futures.future to create instances that each do the work on part of the range:
val a = Array(1,2,3,4,5,6,7,8)
(0 to 4).map(_ * (a.length/4)).sliding(2).map(i => scala.actors.Futures.future {
var s = 0
var j = i(0)
while (j < i(1)) {
s += a(j)
j += 1
}
s
}).map(_()).sum // _() applies the future--blocks until it's done
Of course, you'd need to use this on a much longer array (and on a machine with four cores) for the parallelization to improve things.
You should use lazily built collections when you use more than one higher-order methods:
v1.view zip v2 map { case (a,b) => a+b }
If you don't use a view or an iterator each method will create a new immutable collection even when they are not needed.
Probably immutable code won't be as fast as mutable but the lazy collection will improve execution time of your code a lot.
Arrays are not type-erased, Vectors are. Basically, JVM gives Array an advantage over other collections when handling primitives that cannot be overcome. Scala's specialization might decrease that advantage, but, given their cost in code size, they can't be used everywhere.