Scala - Vector.tabulate and structural sharing

Suppose I have a value representing a chess board
val board: Vector[Vector[Option[Piece]]] = ...
and in some function to apply moves I construct a new board from this one using tabulate
Vector.tabulate(8, 8) { (x, y) =>
  if ((x, y) == (start_x, start_y))
    None
  else if ((x, y) == (end_x, end_y))
    board(start_x)(start_y)
  else
    board(x)(y)
}
Would the memory usage of this snippet be constant, since only two cells are changed? In other words, is the data reused?

No, there will be no structural sharing between the new board and the old board. If you throw the old board away after this snippet, memory usage will be constant, but it would be more efficient to reuse as much of the old board as possible. Try:
val piece = board(start_x)(start_y)
// clear the start square; only the outer vector's spine and one row are copied
val board2 = board.updated(start_x, board(start_x).updated(start_y, None))
// place the piece on the end square; all untouched rows are shared with board2
val newboard = board2.updated(end_x, board2(end_x).updated(end_y, piece))
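Wrapped as a helper, the same pattern looks like this (a minimal sketch; movePiece and its parameter names are illustrative, not from the original question, and Piece is the question's piece type):
type Board = Vector[Vector[Option[Piece]]]

// Hypothetical helper: updates only the two affected cells, so every
// untouched row vector is shared between the old and the new board.
// Clearing first and then placing also keeps same-row moves correct.
def movePiece(board: Board, startX: Int, startY: Int, endX: Int, endY: Int): Board = {
  val piece = board(startX)(startY)
  val cleared = board.updated(startX, board(startX).updated(startY, None))
  cleared.updated(endX, cleared(endX).updated(endY, piece))
}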

Related

What's the simplest way to get a Spark DataFrame from arbitrary Array Data in Scala?

I've been breaking my head about this one for a couple of days now. It feels like it should be intuitively easy... Really hope someone can help!
I've built an org.nd4j.linalg.api.ndarray.INDArray of word occurrences from some semi-structured data like this:
import org.nd4j.linalg.factory.Nd4j
import org.nd4s.Implicits._
val docMap = collection.mutable.Map[Int, Map[Int, Int]]() // of the form Map(phrase -> Map(phrasePosition -> word))
val words = ArrayBuffer("word_1","word_2","word_3",..."word_n")
val windows = ArrayBuffer("$phrase,$phrasePosition_1","$phrase,$phrasePosition_2",..."$phrase,$phrasePosition_n")
val matrix = Nd4j.create(windows.length * words.length).reshape(windows.length, words.length)
for (row <- 0 until windows.length) {
  for (column <- 0 until words.length) {
    // +1 to (row,column) if word occurs at phrase, phrasePosition indicated by window_n.
  }
}
val finalmatrix = matrix.T.dot(matrix) // to get co-occurrence matrix
So far so good...
Downstream of this point I need to integrate the data into an existing pipeline in Spark and use Spark's implementation of PCA etc., so I need to create a DataFrame, or at least an RDD. If I knew the number of words and/or windows in advance I could do something like:
case class Row(window : String, word_1 : Double, word_2 : Double, ...etc)
val dfSeq = ArrayBuffer[Row]()
for (row <- 0 until windows.length) {
dfSeq += Row(windows(row),matrix.get(NDArrayIndex.point(row), NDArrayIndex.all()))
}
sc.parallelize(dfSeq).toDF("window","word_1","word_2",...etc)
but the number of windows and words is determined at runtime. I'm looking for a Windows × Words org.apache.spark.sql.DataFrame as output; the input is a Windows × Words org.nd4j.linalg.api.ndarray.INDArray.
Thanks in advance for any help you can offer.
Ok, so after several days' work it looks like the simple answer is: there isn't one. In fact, it looks like trying to use Nd4j in this context at all is a bad idea for several reasons:
It's (really) hard to get data out of the native INDArray format once you've put it in.
Even using something like Guava, the .data() method brings everything onto the heap, which quickly becomes expensive.
You've got the added hassle of having to compile an assembly jar or use HDFS etc. to handle the library itself.
I did also consider using Breeze which may actually provide a viable solution but carries some of the same problems and can't be used on distributed data structures.
Unfortunately, using native Spark / Scala datatypes, although easier once you know how, is - for someone like me coming from Python + numpy + pandas heaven at least - painfully convoluted and ugly.
Nevertheless, I did implement this solution successfully:
import org.apache.spark.mllib.linalg.{Vectors,Vector,Matrix,DenseMatrix,DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
//first make a pseudo-matrix from Scala Array[Double]:
val rowSeq = Seq.fill(windows.length)(Array.fill(words.length)(0d))
//iterate through 'rows' and 'columns' to fill it:
for (row <- 0 until windows.length) {
  for (column <- 0 until words.length) {
    // rowSeq(row)(column) += 1 if word occurs at phrase, phrasePosition indicated by window_n.
  }
}
//create Spark DenseMatrix
val rows: Array[Double] = rowSeq.transpose.flatten.toArray // transpose because DenseMatrix is column-major
val matrix = new DenseMatrix(windows.length,words.length,rows)
One of the main operations that I needed Nd4j for was matrix.T.dot(matrix), but it turns out that you can't multiply two matrices of type org.apache.spark.mllib.linalg.DenseMatrix together: one of them (A) has to be an org.apache.spark.mllib.linalg.distributed.RowMatrix, and - you guessed it - you can't call matrix.transpose() on a RowMatrix, only on a DenseMatrix! Since it's not really relevant to the question, I'll leave that part out, except to explain that what comes out of that step is a RowMatrix. Credit is also due here and here for the final part of the solution:
val rowMatrix: RowMatrix = transposeAndDotDenseMatrix(matrix)
// get DataFrame from RowMatrix via DenseMatrix
val newdense = new DenseMatrix(rowMatrix.numRows().toInt,rowMatrix.numCols().toInt,rowMatrix.rows.collect.flatMap(x => x.toArray)) // the call to collect() here is undesirable...
val matrixRows = newdense.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Rows")
// then separate columns:
val df2 = (0 until words.length).foldLeft(df) { (df, num) =>
  df.withColumn(words(num), $"Rows".getItem(num))
}.drop("Rows")
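For reference, the omitted helper might look something like this (an illustrative sketch only - the original post left this step out - built from a local transpose plus RowMatrix.multiply):
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical implementation of transposeAndDotDenseMatrix: computes m.T dot m.
// The local DenseMatrix is transposed, its rows are distributed as a RowMatrix,
// and multiply then yields (m^T * m) as a distributed RowMatrix.
def transposeAndDotDenseMatrix(m: DenseMatrix): RowMatrix = {
  val rowsOfTranspose = m.transpose.rowIter.map(v => Vectors.dense(v.toArray)).toSeq
  new RowMatrix(spark.sparkContext.parallelize(rowsOfTranspose)).multiply(m)
}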
Would love to hear improvements and suggestions on this, thanks.
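One possible simplification of the column-splitting step (an untested sketch, not from the original post): build the schema programmatically with StructType, so the runtime-determined word columns never need the Rows-then-withColumn detour:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical alternative: one StructField per word, built at runtime.
val schema = StructType(words.map(w => StructField(w, DoubleType, nullable = false)))
val rowRdd = spark.sparkContext.parallelize(matrixRows.map(r => Row.fromSeq(r.toSeq)))
val dfAlt = spark.createDataFrame(rowRdd, schema)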

Spark: Randomly sampling with replacement a DataFrame with the same amount of sample for each class

Although there are a lot of seemingly similar questions, none of them answers my question.
I have a DataFrame already processed in order to be fed to a DecisionTreeClassifier and it contains a column label which is filled with either 0.0 or 1.0.
I need to bootstrap my data set by randomly selecting, with replacement, the same number of rows for each value of my label column.
I've looked at all the docs and all I could find are DataFrame.sample(...) and DataFrameStatFunctions.sampleBy(...), but the issue with those is that the number of samples retained is not guaranteed, and the second one doesn't allow replacement! This wouldn't be an issue on larger data sets, but in around 50% of my cases I'll have one of the label values with fewer than a hundred rows, and I really don't want skewed data.
Despite my best efforts, I was unable to find a clean solution to this problem, and I resigned myself to collecting the whole DataFrame and doing the sampling "manually" in Scala before recreating a new DataFrame to train my DecisionTreeClassifier on. But this seems highly inefficient and cumbersome; I would much rather stay with DataFrames and keep all the benefits coming from that structure.
Here is my current implementation for reference and so you know exactly what I'd like to do:
val nbSamplePerClass = /* some int value currently ranging between 50 and 10000 */
val onesDataFrame = inputDataFrame.filter("label > 0.0")
val zeros = inputDataFrame.except(onesDataFrame).collect()
val ones = onesDataFrame.collect()
val nbZeros = zeros.length
val nbOnes = ones.length
val rng = new scala.util.Random()
def randomIndexes(maxIndex: Int) =
  (0 until nbSamplePerClass).map(_ => rng.nextInt(maxIndex))
val zerosSample = randomIndexes(nbZeros).map(idx => zeros(idx))
val onesSample = randomIndexes(nbOnes).map(idx => ones(idx))
val samples = scala.collection.JavaConversions.seqAsJavaList(zerosSample ++ onesSample)
val resDf = sqlContext.createDataFrame(samples, inputDataFrame.schema)
Does anyone know how I could implement such a sampling while only working with DataFrames?
I'm pretty sure that it would significantly speed up my code!
Thank you for your time.
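One direction that stays closer to Spark (a sketch only, not validated against the asker's pipeline): RDD.takeSample guarantees exactly num rows and supports replacement, so each class can be sampled without collecting it in full first:
// Sketch: exactly nbSamplePerClass rows per class, sampled with replacement.
// takeSample still returns its result to the driver, but only the sample itself.
val onesSample = inputDataFrame.filter("label > 0.0").rdd
  .takeSample(withReplacement = true, num = nbSamplePerClass)
val zerosSample = inputDataFrame.filter("label <= 0.0").rdd
  .takeSample(withReplacement = true, num = nbSamplePerClass)
val resDf = sqlContext.createDataFrame(
  sqlContext.sparkContext.parallelize(zerosSample ++ onesSample),
  inputDataFrame.schema)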

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR: I am trying to train off of an existing data set (Seq[Words] with corresponding categories), and use that trained dataset to filter another dataset using category similarity.
I am trying to train on a corpus of data and then use it for text analysis*. I've tried using NaiveBayes, but that seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
So, I am now trying to use TFIDF and passing that output into a RowMatrix and computing the similarities. But, I'm not sure how to run my query (one word for now). Here's what I've tried:
val rddOfTfidfFromCorpus: RDD[Vector] = ???
val query = "word"
val tf = new HashingTF().transform(List(query))
val tfIDF = new IDF().fit(sc.makeRDD(List(tf))).transform(tf)
val mergedVectors = rddOfTfidfFromCorpus.union(sc.makeRDD(List(tfIDF)))
val similarities = new RowMatrix(mergedVectors).columnSimilarities(1.0)
Here is where I'm stuck (if I've even done everything right up to here). I tried filtering the similarities i and j down to the parts corresponding to my query's TFIDF and ended up with an empty collection.
The gist is that I want to train on a corpus of data and find what category a query falls in. The above code is at least trying to get it down to one category and checking if I can get a prediction from that....
*Note that this is a toy example, so I only need something that works well enough
*I am using Spark 1.4.0
Using columnSimilarities doesn't make sense here. Since each column in your matrix represents a set of terms, you'll get a matrix of similarities between tokens, not documents. You could transpose the matrix and then use columnSimilarities, but as far as I understand, what you want is a similarity between query and corpus. You can express that using matrix multiplication as follows:
For starters you'll need an IDFModel you've trained on the corpus. Let's assume it is called idf:
import org.apache.spark.mllib.feature.IDFModel
val idf: IDFModel = ??? // Trained using corpus data
and a small helper (with the distributed-matrix imports it needs):
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

def toBlockMatrix(rdd: RDD[Vector]) = new IndexedRowMatrix(
  rdd.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
).toCoordinateMatrix.toBlockMatrix
First let's convert the query to an RDD and compute TF:
val query: Seq[String] = ???
// wrap the query in an RDD so HashingTF yields an RDD[Vector]
val queryTf = new HashingTF().transform(sc.parallelize(Seq(query)))
Next we can apply IDF model and convert result to matrix:
val queryTfidf = idf.transform(queryTf)
val queryMatrix = toBlockMatrix(queryTfidf)
We'll need a corpus matrix as well:
val corpusMatrix = toBlockMatrix(rddOfTfidfFromCorpus)
If you multiply the two, you get a matrix whose number of rows equals the number of docs in the query and whose number of columns equals the number of documents in the corpus.
val dotProducts = queryMatrix.multiply(corpusMatrix.transpose)
To get a proper cosine similarity you would have to divide by the product of the magnitudes, but that is easy enough to handle.
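For instance (a sketch using MLlib's Normalizer, not part of the original approach), L2-normalizing both sides first makes the dot products cosine similarities directly:
import org.apache.spark.mllib.feature.Normalizer

// Sketch: L2-normalize every vector so that dot products equal cosine similarity.
val normalizer = new Normalizer() // p = 2 by default
val queryMatrixN = toBlockMatrix(normalizer.transform(queryTfidf))
val corpusMatrixN = toBlockMatrix(normalizer.transform(rddOfTfidfFromCorpus))
val cosineSims = queryMatrixN.multiply(corpusMatrixN.transpose)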
There are two problems here. First of all, it is rather expensive. Moreover, I am not sure if it is really useful. To reduce the cost you could apply some dimensionality reduction algorithm first, but let's leave that for now.
Judging from the following statement
NaiveBayes (...) seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
I guess you want some kind of unsupervised learning method. The simplest thing you can try is K-means:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
val numClusters: Int = ???
val numIterations = 20
val model = KMeans.train(rddOfTfidfFromCorpus, numClusters, numIterations)
val predictions = model.predict(queryTfidf)
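From there (again only a sketch, going beyond the original answer), the corpus can be filtered down to the query's cluster:
// Sketch: keep only corpus documents that land in the same cluster as the query.
val queryCluster = model.predict(queryTfidf).first()
val sameCategory = rddOfTfidfFromCorpus.filter(v => model.predict(v) == queryCluster)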

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft (which computes prefix sums for an associative operator) function for Hadoop (Scalding in particular; see below for how this is done).
Given a sequence of numbers in an HDFS file (one per line) I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
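For reference, on an in-memory collection the pairwise sum itself is a one-liner in plain Scala (shown here only to pin down the desired transformation, before any Hadoop concerns):
val input = Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
// grouped(2) yields consecutive pairs; the final group may be a singleton
val pairSums = input.grouped(2).map(_.sum).toVector // Vector(1, 5, 9, 13, 17, 10)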
I think in order to do this, I need to write an InputFormat and InputSplits classes for Hadoop, but I don't know how to do this.
See section 3.3 here. Below is an example algorithm in Scala:
// for simplicity assume input length is a power of 2
def scanadd(input: IndexedSeq[Int]): IndexedSeq[Int] =
  if (input.length == 1)
    input
  else {
    // calculate a new collapsed sequence which is the sum of sequential even/odd pairs
    val collapsed = IndexedSeq.tabulate(input.length / 2)(i => input(2 * i) + input(2 * i + 1))
    // recursively scan the collapsed values
    val scancollapse = scanadd(collapsed)
    // use the scan of the collapsed seq to calculate the full sequence:
    // an odd index ends a pair, so the collapsed scan already holds its prefix sum;
    // an even index > 0 takes the prefix sum of the previous pair plus its own value
    IndexedSeq.tabulate(input.length) { i =>
      if (i % 2 == 1) scancollapse((i - 1) / 2)
      else if (i == 0) input(0)
      else scancollapse(i / 2 - 1) + input(i)
    }
  }
I understand that this might need a fair bit of optimization for it to work nicely with Hadoop. Translating it directly would, I think, lead to pretty inefficient Hadoop code. For example, in Hadoop you obviously can't use an IndexedSeq. I would appreciate any specific problems you see. I think it can probably be made to work well, though.
Superfluous. You mean this code?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false)) { (a, v) =>
  if (a._3) (a._1, a._2 + v.sum, !a._3)
  else (a._1 + v.sum, a._2, !a._3)
}
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.

Use forall instead of filter on List[A]

I'm trying to determine whether or not to display an overtime game display flag in a weekly game results report.
The database game results table has 3 columns (p4, p5, p6) that represent potential overtime game period score totals (for OT, Double OT, and Triple OT respectively). These columns are mapped to Option[Int] in the application layer.
Currently I am filtering through game result teamA, teamB pairs, but really I just want to know if an OT game exists of any kind (vs. stepping through the collection).
def overtimeDisplay(a: GameResult, b: GameResult) = {
val isOT = !(List(a,b).filter(_.p4.isDefined).filter(_.p5.isDefined).filter(_.p6.isDefined).isEmpty)
if(isOT) {
<b class="b red">
{List( ((a.p4,a.p5,a.p6),(b.p4,b.p5,b.p6)) ).zipWithIndex.map{
case( ((Some(_),None,None), (Some(_),None,None)), i)=> "OT"
case( ((Some(_),Some(_),None), (Some(_),Some(_),None )), i)=> "Double OT"
case( ((Some(_),Some(_),Some(_)), (Some(_),Some(_),Some(_) )), i)=> "Triple OT"
}}
</b>
}
else scala.xml.NodeSeq.Empty
}
Secondarily, the determination of which type of overtime to display, currently that busy pattern match (which, looking at it now, does not appear to cover all the scoring scenarios), could probably be done in a more functional/concise manner.
Feel free to lay it down if you have the better way.
Thanks
Not sure if I understand the initial code correctly, but here is an idea:
val results = List(a, b).map(r => Seq(r.p4, r.p5, r.p6).flatten)
val isOT = results.exists(_.nonEmpty)
val labels = IndexedSeq("", "Double ", "Triple ")
// note: this lookup assumes each result has at least one OT period defined
results.map(p => labels(p.size - 1) + "OT")
Turning the score columns into a flat list in the first line is crucial here. You have GameResult(p4: Option[Int], p5: Option[Int], p6: Option[Int]), which you can map to a Seq[Option[Int]] with r => Seq(r.p4, r.p5, r.p6) and then flatten to turn each Some[Int] into an Int and get rid of the Nones. This will turn Some(42), None, None into Seq(42).
Looking at this:
val isOT = !(List(a,b).filter(_.p4.isDefined).filter(_.p5.isDefined).filter(_.p6.isDefined).isEmpty)
This can be rewritten using exists instead of filter. I would rewrite it as follows:
List(a, b).exists(x => x.p4.isDefined && x.p5.isDefined && x.p6.isDefined)
In addition to using exists, I am combining the three conditions you passed to the filters into a single anonymous function.
In addition, I don't know why you're using zipWithIndex when it doesn't seem as though you're using the index in the map function afterwards. It could be removed entirely.
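Putting both suggestions together, the original function might be condensed to something like this (a sketch built only from the answers above; taking the max over both teams' OT depth is an assumption about the intended display):
// Sketch: exists replaces the filter chain, and the flattened score columns
// replace the tuple pattern match.
def overtimeDisplay(a: GameResult, b: GameResult) = {
  val results = List(a, b).map(r => Seq(r.p4, r.p5, r.p6).flatten)
  if (results.exists(_.nonEmpty)) {
    val labels = IndexedSeq("OT", "Double OT", "Triple OT")
    val depth = results.map(_.size).max // deepest overtime reached by either team
    <b class="b red">{ labels(depth - 1) }</b>
  } else scala.xml.NodeSeq.Empty
}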