Scala Spark: How do I bootstrap sample from a column of a Spark DataFrame?

I am looking to sample values, with replacement, from a column of a Spark DataFrame, using the Scala programming language in a Jupyter Notebook setting in a cluster environment. How do I do this?
I tried the following function that I found online:
import scala.util

def bootstrapMean(originalData: Array[Double]): Double = {
  val n = originalData.length
  def draw: Double = originalData(util.Random.nextInt(n))
  // a tail-recursive loop to randomly draw and add a value to the accumulating sum
  def drawAndSumValues(current: Int, acc: Double = 0D): Double = {
    if (current == 0) acc
    else drawAndSumValues(current - 1, acc + draw)
  }
  drawAndSumValues(n) / n
}
Like so:
val data = stack.select("column_with_values").collect.map(_.toSeq).flatten
val m = 10
val bootstraps = Vector.fill(m)(bootstrapMean(data))
But I get the error:
An error was encountered:
<console>:47: error: type mismatch;
found : Array[Any]
required: Array[Double]
Note: Any >: Double, but class Array is invariant in type T.
You may wish to investigate a wildcard type such as `_ >: Double`. (SLS 3.2.10)
val bootstraps = Vector.fill(m)(bootstrapMean(data))
I'm not sure how to debug this, or whether I should even bother to rather than try another approach. I'm looking for ideas/documentation/code. Thanks.
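For context, collect() returns untyped Rows here, so mapping with _.toSeq and flattening yields Array[Any]. A minimal sketch of one way to collect the column directly as Array[Double], assuming the column actually holds doubles:

// Row.getDouble(0) is the typed accessor for the first column of each Row
val data: Array[Double] = stack.select("column_with_values")
  .collect()
  .map(_.getDouble(0))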
Update:
How do I put user mck's solution (below) in a for loop? I tried the following:
var bootstrap_container = Seq()
var a = 1
for (a <- 1 until 3) {
  var sampled = stack_b.select("diff_hours").sample(withReplacement = true, fraction = 0.5, seed = a)
  var smpl_average = sampled.select(avg("diff_hours")).collect()(0)(0)
  var bootstrap_smpls = bootstrap_container.union(Seq(smpl_average)).collect()
}
bootstrap_smpls
but that gives an error:
<console>:49: error: not enough arguments for method collect: (pf: PartialFunction[Any,B])(implicit bf: scala.collection.generic.CanBuildFrom[Seq[Any],B,That])That.
Unspecified value parameter pf.
var bootstrap_smpls = bootstrap_container.union(Seq(smpl_average)).collect()

You can use the sample method of DataFrames. For example, to sample with replacement with a fraction of 0.5:
val sampled = stack.select("column_with_values").sample(true, 0.5)
To get the mean, you can do:
val col_average = sampled.select(avg("column_with_values")).collect()(0)(0)
EDIT:
var bootstrap_container = List[Double]()
var a = 1
for (a <- 1 until 3) {
  var sampled = stack_b2.select("diff_hours").sample(withReplacement = true, fraction = 0.5, seed = a)
  var smpl_average = sampled.select(avg("diff_hours")).collect()(0)(0)
  bootstrap_container = bootstrap_container :+ smpl_average.asInstanceOf[Double]
}
var mean_bootstrap = bootstrap_container.reduce(_ + _) / bootstrap_container.length
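For reference, the same loop can also be written without mutable state. Here is a minimal sketch under the same assumptions (a stack_b2 DataFrame with a diff_hours column; the bootstrap count of two, matching 1 until 3, is illustrative):

import org.apache.spark.sql.functions.avg

// draw one bootstrap mean per seed, then average the results
val bootstrap_smpls = (1 until 3).map { seed =>
  stack_b2.select("diff_hours")
    .sample(withReplacement = true, fraction = 0.5, seed = seed)
    .select(avg("diff_hours"))
    .collect()(0)
    .getDouble(0)
}
val mean_bootstrap = bootstrap_smpls.sum / bootstrap_smpls.length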

Related

TapeEquilibrium ScalaCheck

I have been trying to code some scalacheck property to verify the Codility TapeEquilibrium problem. For those who do not know the problem, see the following link: https://app.codility.com/programmers/lessons/3-time_complexity/tape_equilibrium/.
I coded the following, still incomplete, code:
test("Lesson 3 property"){
val left = Gen.choose(-1000, 1000).sample.get
val right = Gen.choose(-1000, 1000).sample.get
val expectedSum = Math.abs(left - right)
val leftArray = Gen.listOfN(???, left) retryUntil (_.sum == left)
val rightArray = Gen.listOfN(???, right) retryUntil (_.sum == right)
val property = forAll(leftArray, rightArray){ (r: List[Int], l: List[Int]) =>
val array = (r ++ l).toArray
Lesson3.solution3(array) == expectedSum
}
property.check()
}
The idea is as follows. I choose two random numbers (values left and right) and calculate their absolute difference. Then, my idea is to generate two arrays. Each array will contain random numbers whose sum is either "left" or "right". Then, by concatenating these arrays, I should be able to verify this property.
My issue is then generating the leftArray and rightArray. This itself is a complex problem and I would have to code a solution for this. Therefore, writing this property seems over-complicated.
Is there any way to code this? Is coding this property an overkill?
Best.
My issue is then generating the leftArray and rightArray
One way to generate these arrays (or lists) is to provide a generator of non-empty lists whose elements sum to a given number; in other words, something defined by a method like this:
import org.scalacheck.{Gen, Properties}
import org.scalacheck.Prop.forAll
def listOfSumGen(expectedSum: Int): Gen[List[Int]] = ???
That verifies the property:
forAll(Gen.choose(-1000, 1000)) { sum: Int =>
  forAll(listOfSumGen(sum)) { listOfSum: List[Int] =>
    (listOfSum.sum == sum) && listOfSum.nonEmpty
  }
}
Building such a list only constrains one element of the list, so basically here is a way to generate one:
Generate a list.
The extra, constrained element is given by expectedSum minus the sum of that list.
Insert the constrained element at a random index of the list (because obviously any permutation of the list would work).
So we get:
def listOfSumGen(expectedSum: Int): Gen[List[Int]] =
  for {
    list <- Gen.listOf(Gen.choose(-1000, 1000))
    constrainedElement = expectedSum - list.sum
    index <- Gen.oneOf(0 to list.length)
  } yield list.patch(index, List(constrainedElement), 0)
Now, with the above generator, leftArray and rightArray can be defined as follows:
val leftArray = listOfSumGen(left)
val rightArray = listOfSumGen(right)
However, I think that the overall approach of the property described is incorrect: it builds an array where one specific partition of the array equals the expectedSum, but this doesn't ensure that another partition of the array wouldn't produce a smaller difference.
Here is a counter-example run-through:
val left = Gen.choose(-1000, 1000).sample.get  // --> 4
val right = Gen.choose(-1000, 1000).sample.get // --> 9
val expectedSum = Math.abs(left - right)       // --> |4 - 9| = 5
val leftArray = listOfSumGen(left)   // let's assume one of its samples is List(3,1) (whose sum equals 4)
val rightArray = listOfSumGen(right) // let's assume one of its samples is List(2,4,3) (whose sum equals 9)
val property = forAll(leftArray, rightArray) { (l: List[Int], r: List[Int]) =>
  // l = List(3,1)
  // r = List(2,4,3)
  val array = (l ++ r).toArray // --> Array(3,1,2,4,3), which is the array from the given example in the exercise
  Lesson3.solution3(array) == expectedSum
  // according to the example, Lesson3.solution3(array) equals 1, which is different from 5
}
Here is an example of a correct property that essentially applies the definition:
def tapeDifference(index: Int, array: Array[Int]): Int = {
  val (left, right) = array.splitAt(index)
  Math.abs(left.sum - right.sum)
}

// the problem statement requires N >= 2 and split points 1 <= P < N
forAll(Gen.nonEmptyListOf(Gen.choose(-1000, 1000)) suchThat (_.length >= 2)) { list: List[Int] =>
  val array = list.toArray
  forAll(Gen.choose(1, array.length - 1)) { index =>
    Lesson3.solution3(array) <= tapeDifference(index, array)
  }
}
This property definition might collide with the way the actual solution has been implemented (which is one of the potential pitfalls of ScalaCheck); however, that would make the checked solution slow and inefficient. Hence this is more a way to check an optimized, fast implementation against a slow but correct one (see this presentation).
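For completeness, the slow but correct reference implementation alluded to here could simply reuse tapeDifference; this is a sketch assuming Lesson3.solution3 has the signature Array[Int] => Int and N >= 2:

// brute force over all valid split points 1 <= P < N; O(n^2) with the naive sums above
def referenceSolution(array: Array[Int]): Int =
  (1 until array.length).map(tapeDifference(_, array)).min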
Try this with C#:
using System;
using System.Collections.Generic;
using System.Linq;

private static int TapeEquilibrium(int[] A)
{
    var sumA = A.Sum();
    var size = A.Length;
    var take = 0;
    var res = new List<int>();
    for (int i = 1; i < size; i++)
    {
        take = take + A[i - 1];
        var resp = Math.Abs((sumA - take) - take);
        res.Add(resp);
        if (resp == 0) return resp;
    }
    return res.Min();
}

How do I fix "found: AnyVal, required: Double" in Scala?

I am traversing a Scala Map and I am getting a type mismatch error in my code. Here is what I am trying to do:
private var cumulativeCapacity: Map[String, Double] = Map()
private var cumulativeDelay: Map[String, Double] = Map()

cumulativeCapacity.keys.foreach { linkId =>
  val delay = cumulativeDelay.get(linkId).getOrElse(0)
  val capacity = cumulativeCapacity.get(linkId).getOrElse(0)
  val bin = largeset(capacity)
}
The error occurs at val bin = largeset(capacity): capacity should be a Double, but AnyVal was found. Please suggest a solution, or let me know if I am doing something wrong.
Welcome to SO.
The problem you are experiencing is that you are providing an Int as the default value when the key is not found in your Map, instead of a Double. If you change 0 to 0.0 or 0D, it should work, i.e.:
cumulativeCapacity.keys.foreach { linkId =>
  val delay = cumulativeDelay.getOrElse(linkId, 0D)
  val capacity = cumulativeCapacity.getOrElse(linkId, 0D)
  val bin = largeset(capacity)
}
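To see why the compiler infers AnyVal in the first place, here is a minimal self-contained sketch (the map and key are illustrative):

// Option.getOrElse infers the least upper bound of the element type and the default's type
val m: Map[String, Double] = Map("a" -> 1.5)

val bad  = m.get("a").getOrElse(0)    // Int default: inferred as AnyVal (lub of Double and Int)
val good = m.get("a").getOrElse(0.0)  // Double default: inferred as Double
val same = m.getOrElse("a", 0D)       // equivalent, as in the fixed code above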

Using the kronecker product on complex matrices with scalaNLP breeze

I had a piece of code:
def this(vectors: List[DenseVector[Double]]) {
  this(vectors.length)
  var resultVector = vectors.head
  for (vector <- vectors) {
    resultVector = kron(resultVector.toDenseMatrix, vector.toDenseMatrix).toDenseVector
  }
  _vector = resultVector
}
It worked just the way I wanted it to. The problem is that I needed complex values instead of doubles. After importing breeze.math.Complex, I changed the code to:
def this(vectors: List[DenseVector[Complex]]) {
  this(vectors.length)
  var resultVector = vectors.head
  for (vector <- vectors) {
    resultVector = kron(resultVector.toDenseMatrix, vector.toDenseMatrix).toDenseVector
  }
  _vector = resultVector
}
This, however, results in the errors:
Error:(42, 26) could not find implicit value for parameter impl: breeze.linalg.kron.Impl2[breeze.linalg.DenseMatrix[breeze.math.Complex],breeze.linalg.DenseMatrix[breeze.math.Complex],VR]
resultVector = kron(resultVector.toDenseMatrix, vector.toDenseMatrix).toDenseVector
^
Error:(42, 26) not enough arguments for method apply: (implicit impl: breeze.linalg.kron.Impl2[breeze.linalg.DenseMatrix[breeze.math.Complex],breeze.linalg.DenseMatrix[breeze.math.Complex],VR])VR in trait UFunc.
Unspecified value parameter impl.
resultVector = kron(resultVector.toDenseMatrix, vector.toDenseMatrix).toDenseVector
^
Is this a bug or am I forgetting to do something?
I found the problem in the following way:
I first rewrote the function to use fewer matrix conversions.
As there was a problem with the implicit impl parameter of kron, I also rewrote the function call to state explicitly which implicit value to use:
def this(vectors: List[DenseVector[Complex]]) {
  this(vectors.length)
  var resultMatrix = vectors.head.toDenseMatrix
  for (i <- 1 until vectors.length) {
    resultMatrix = kron(resultMatrix, vectors(i).toDenseMatrix)(kron.kronDM_M[Complex, Complex, DenseMatrix[Complex], Complex])
  }
  _vector = resultMatrix.toDenseVector
}
This showed me that there was no ScalarMulOp for V2, M, DenseMatrix[RV], where M is a Matrix[V1], V1 and V2 are the input types, and RV is the output type of the ScalarMulOp.
Digging through the Breeze source code, I found in DenseMatrixOps that there is only an implicit ScalarMulOp for the above types if V1, V2 and RV are of type Int, Long, Float or Double. By copying that function and making it specific to Complex numbers, I was able to get the Kronecker product to work. I could then also remove the explicit use of (kron.kronDM_M[Complex, Complex, DenseMatrix[Complex], Complex]). The ScalarMulOp function in question is:
implicit def s_dm_op_Complex_OpMulScalar(implicit op: OpMulScalar.Impl2[Complex, Complex, Complex]):
    OpMulScalar.Impl2[Complex, DenseMatrix[Complex], DenseMatrix[Complex]] =
  new OpMulScalar.Impl2[Complex, DenseMatrix[Complex], DenseMatrix[Complex]] {
    def apply(b: Complex, a: DenseMatrix[Complex]): DenseMatrix[Complex] = {
      val res: DenseMatrix[Complex] = DenseMatrix.zeros[Complex](a.rows, a.cols)
      val resd: Array[Complex] = res.data
      val ad: Array[Complex] = a.data
      var c = 0
      var off = 0
      while (c < a.cols) {
        var r = 0
        while (r < a.rows) {
          resd(off) = op(b, ad(a.linearIndex(r, c)))
          r += 1
          off += 1
        }
        c += 1
      }
      res
    }
    implicitly[BinaryRegistry[Complex, Matrix[Complex], OpMulScalar.type, Matrix[Complex]]].register(this)
  }
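With that implicit in scope, a minimal usage sketch (the matrix values are illustrative) could look like:

import breeze.linalg.{DenseMatrix, kron}
import breeze.math.Complex

// assumes the s_dm_op_Complex_OpMulScalar implicit above is in scope
val a = DenseMatrix((Complex(1, 0), Complex(0, 1)))  // 1x2 matrix
val b = DenseMatrix((Complex(2, 0), Complex(0, -1))) // 1x2 matrix
val k = kron(a, b)                                   // 1x4 DenseMatrix[Complex]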

Flink: PageRank type mismatch error

I want to compute PageRank from a CSV file of edges formatted as follows:
12,13,1.0
12,14,1.0
12,15,1.0
12,16,1.0
12,17,1.0
...
My code:
var filename = "<filename>.csv"
val graph = Graph.fromCsvReader[Long, Double, Double](
  env = env,
  pathEdges = filename,
  readVertices = false,
  hasEdgeValues = true,
  vertexValueInitializer = new MapFunction[Long, Double] {
    def map(id: Long): Double = 0.0
  })
val ranks = new PageRank[Long](0.85, 20).run(graph)
I get the following error from the Flink Scala Shell:
error: type mismatch;
found : org.apache.flink.graph.scala.Graph[Long,_23,_24] where type _24 >: Double with _22, type _23 >: Double with _21
required: org.apache.flink.graph.Graph[Long,Double,Double]
val ranks = new PageRank[Long](0.85, 20).run(graph)
^
What am I doing wrong?
( And are the initial values 0.0 for every vertex and 1.0 for every edge correct? )
The problem is that you're passing the Scala org.apache.flink.graph.scala.Graph to PageRank.run, which expects the Java org.apache.flink.graph.Graph.
In order to run a GraphAlgorithm for a Scala Graph object, you have to call the run method of the Scala Graph with the GraphAlgorithm.
graph.run(new PageRank[Long](0.85, 20))
Update
In the case of the PageRank algorithm it is important to note that the algorithm expects an instance of type Graph[K, java.lang.Double, java.lang.Double]. Since Java's Double type is different from Scala's Double type (in terms of type checking), this has to be accounted for.
For the example code this means
val graph = Graph.fromCsvReader[Long, java.lang.Double, java.lang.Double](
  env = env,
  pathEdges = filename,
  readVertices = false,
  hasEdgeValues = true,
  vertexValueInitializer = new MapFunction[Long, java.lang.Double] {
    def map(id: Long): java.lang.Double = 0.0
  })
  .asInstanceOf[Graph[Long, java.lang.Double, java.lang.Double]]
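The algorithm can then be invoked through the Scala Graph's run method as shown earlier; a minimal sketch:

// run PageRank (damping factor 0.85, 20 iterations) on the boxed-Double graph
val ranks = graph.run(new PageRank[Long](0.85, 20))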

Type mismatch from partition in Scala (expected (Set[String]...), actual (Set[String]...) )

I have a partition method that creates tuple of two sets of string.
def partition(i: Int) = {
  dictionary.keySet.partition(dictionary(_)(i) == true)
}
I also have a map that maps integer to the return value from the partition method.
val m = Map[Int, (Set[String], Set[String])]()
for (i <- Range(0, getMaxIndex())) {
  m(i) = partition(i)
}
The issue is that I get a type mismatch error, but the error message does not make sense to me.
What might be wrong?
This is the code:
import scala.collection.mutable.Map
import scala.collection.BitSet

case class Partition(dictionary: Map[String, BitSet]) {
  def max(x: Int, y: Int) = if (x > y) x else y

  def partition(i: Int) = {
    dictionary.keySet.partition(dictionary(_)(i) == true)
  }

  def getMaxIndex() = {
    val values = dictionary.values
    (0 /: values) ((m, bs) => max(m, bs.last))
  }

  def get() = {
    val m = Map[Int, (Set[String], Set[String])]()
    for (i <- Range(0, getMaxIndex())) {
      m(i) = partition(i)
    }
    m
  }
}
When I compile your example, the error is clear:
<console>:64: error: type mismatch;
found : (scala.collection.Set[String], scala.collection.Set[String])
required: (scala.collection.immutable.Set[String], scala.collection.immutable.Set[String])
m(i) = partition(i)
^
Looking into the API, the keySet method of a mutable map does not guarantee that the returned set is immutable. Compare this with keySet on an immutable Map—it does indeed return an immutable set.
Therefore, you could either:
use an immutable Map and a var,
force the result of your partition method to return an immutable set (e.g. via toSet), or
define the value type of your map to be collection.Set instead of Predef.Set, which is an alias for collection.immutable.Set.
To clarify these types, it also helps to specify an explicit return type for your public methods (get and partition).
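As an illustration of the second option, a minimal sketch of partition with an explicit return type, forcing immutable sets:

// option 2: convert the possibly-mutable key set into immutable sets
def partition(i: Int): (Set[String], Set[String]) = {
  val (in, out) = dictionary.keySet.partition(dictionary(_)(i))
  (in.toSet, out.toSet)
}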