Stratified sampling in Spark - scala

I have a data set that contains user and purchase data. Here is an example, where the first element is the userId, the second is the productId, and the third is a boolean flag.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to take exactly 80% of each user's data and build one RDD from it, and put the remaining 20% into another RDD. Let's call them train and test. I would like to stay away from groupBy, since it can create memory problems given that the data set is large. What's the best way to do this?
I could do the following, but it will not give 80% of each user's data.
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()

One possible solution is in Holden's answer, and here are some other solutions:
Using RDDs:
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it.
Considering the following list :
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create a pair RDD, mapping all the users as keys:
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up the fraction for each key as follows, since sampleByKeyExact takes a Map of fractions per key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct ones, then associate each with a fraction of 0.8, and collect the whole thing as a Map.
To sample now :
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions,seed = 2L)
You can check the counts on your data and on the sample:
scala> data.count
// [...]
// res10: Long = 12
scala> sampleData.count
// [...]
// res11: Long = 10
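If you also need the remaining ~20% as a test set, one option (a sketch on top of the answer above, not part of it) is to subtract the sample from the full pair RDD:
// assumes the (key, value) rows are unique, since subtract removes all matching elements
val trainData = sampleData
val testData = data.subtract(sampleData)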
Using DataFrames:
Let's consider the same data (seq) from the previous section.
// assumes a spark-shell session (or import spark.implicits._) so that toDF is available
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD for that, on which we create tuples by defining our key to be the first column:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: Map[Int, Double] = data.map(_._1)
.distinct
.map(x => (x, 0.8))
.collectAsMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions, 2L)
.values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // use sqlContext.createDataFrame(...) instead on Spark 1.6
You can now check the counts on your DataFrame and on the sample:
scala> df.count
// [...]
// res9: Long = 12
scala> sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can also use the DataFrameStatFunctions.sampleBy method (note that, unlike sampleByKeyExact, it only approximates the requested fractions):
df.stat.sampleBy("keyColumn", fractions, seed)
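A minimal sketch of a train/test split on the DataFrame side, reusing the fractions map from above (hypothetical variable names, and rows are assumed to be distinct so that except behaves as expected):
val trainDF = df.stat.sampleBy("keyColumn", fractions, 7L)
val testDF = df.except(trainDF)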

Something like this may be well suited to something like BlinkDB, but let's look at the question. There are two ways to interpret what you've asked:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each user's data.
For #1 you could do a map to get the user ids, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils or BernoulliCellSampler). You can then filter your input data to just the set of IDs you want.
For #2 you could look at BernoulliCellSampler and simply apply it directly.
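For reference, a rough sketch of interpretation #1 using plain RDD.sample instead of the samplers mentioned above (assuming data: RDD[(Int, Int, Int)] as in the question, and that the set of user ids fits in driver memory):
val userIds = data.map(_._1).distinct
val keepIds = userIds.sample(withReplacement = false, fraction = 0.8, seed = 42L).collect.toSet
val trainRdd = data.filter { case (user, _, _) => keepIds(user) }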

Related

How to generate a DataFrame with random content and N rows?

How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in range (1, 100)?
I know how to create a DataFrame manually, but I cannot automate it:
val df = sc.parallelize(Seq((1,20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3")
Generating the data locally and then parallelizing it is totally fine, especially if you don't have to generate a lot of data.
However, should you ever need to generate a huge dataset, you can always implement an RDD that does this for you in parallel, as in the following example.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
// Each random partition will hold `numValues` items
final class RandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
def values: Iterator[A] = Iterator.fill(numValues)(random)
}
// The RDD will parallelize the workload across `numSlices`
final class RandomRDD[A: ClassTag](@transient private val sc: SparkContext, numSlices: Int, numValues: Int, random: => A) extends RDD[A](sc, deps = Seq.empty) {
// Based on the item and executor count, determine how many values are
// computed in each executor. Distribute the rest evenly (if any).
private val valuesPerSlice = numValues / numSlices
private val slicesWithExtraItem = numValues % numSlices
// Just ask the partition for the data
override def compute(split: Partition, context: TaskContext): Iterator[A] =
split.asInstanceOf[RandomPartition[A]].values
// Generate the partitions so that the load is as evenly spread as possible
// e.g. 10 partitions and 22 items -> 2 slices with 3 items and 8 slices with 2
override protected def getPartitions: Array[Partition] =
((0 until slicesWithExtraItem).view.map(new RandomPartition[A](_, valuesPerSlice + 1, random)) ++
(slicesWithExtraItem until numSlices).view.map(new RandomPartition[A](_, valuesPerSlice, random))).toArray
}
Once you have this, you can use it by passing your own random data generator to get an RDD[Int]:
val rdd = new RandomRDD(spark.sparkContext, 10, 22, scala.util.Random.nextInt(100) + 1)
rdd.foreach(println)
/*
* outputs:
* 30
* 86
* 75
* 20
* ...
*/
or an RDD[(Int, Int, Int)]
def rand = scala.util.Random.nextInt(100) + 1
val rdd = new RandomRDD(spark.sparkContext, 10, 22, (rand, rand, rand))
rdd.foreach(println)
/*
* outputs:
* (33,22,15)
* (65,24,64)
* (41,81,44)
* (58,7,18)
* ...
*/
and of course you can wrap it in a DataFrame very easily as well:
spark.createDataFrame(rdd).show()
/*
* outputs:
* +---+---+---+
* | _1| _2| _3|
* +---+---+---+
* |100| 48| 92|
* | 34| 40| 30|
* | 98| 63| 61|
* | 95| 17| 63|
* | 68| 31| 34|
* .............
*/
Notice how in this case the generated data is different every time the RDD/DataFrame is acted upon. By changing the implementation of RandomPartition to actually store the values instead of generating them on the fly, you can have a stable set of random items, while still retaining the flexibility and scalability of this approach.
One nice property of the stateless approach is that you can generate a huge dataset even locally. The following ran in a few seconds on my laptop:
new RandomRDD(spark.sparkContext, 10, Int.MaxValue, 42).count
// returns: 2147483647
Here you go, Seq.fill is your friend:
def randomInt1to100 = scala.util.Random.nextInt(100)+1
val df = sc.parallelize(
Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)}
).toDF("col1", "col2", "col3")
You can simply use scala.util.Random to generate the random numbers within the range, loop for 100 rows, and finally use the createDataFrame API:
import scala.util.Random
val data = 1 to 100 map(x => (1+Random.nextInt(100), 1+Random.nextInt(100), 1+Random.nextInt(100)))
sqlContext.createDataFrame(data).toDF("col1", "col2", "col3").show(false)
You can use the generic code below:
import scala.util.Random
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// number of rows required
val rows = 15
// number of columns required
val cols = 10
val spark = SparkSession.builder
.master("local[*]")
.appName("testApp")
.config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
.getOrCreate()
import spark.implicits._
val columns = 1 to cols map (i => "col" + i)
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, IntegerType)))
val lstrows = Seq.fill(rows * cols)(Random.nextInt(100) + 1).grouped(cols).toList.map { x => Row(x: _*) }
val rdd = spark.sparkContext.makeRDD(lstrows)
val df = spark.createDataFrame(rdd, schema)
If you need to create a large amount of random data, Spark provides an object called RandomRDDs that can generate datasets filled with random numbers following a uniform, normal, or various other distributions.
https://spark.apache.org/docs/latest/mllib-statistics.html#random-data-generation
From their example:
import org.apache.spark.mllib.random.RandomRDDs._
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
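If you want the three-integer-column shape from the question, a hedged sketch on top of RandomRDDs (assuming a spark-shell session where sc and spark.implicits._ are available) is to draw uniform vectors and rescale them:
import org.apache.spark.mllib.random.RandomRDDs._
// 100 rows of 3 uniform doubles in [0, 1), rescaled to integers in 1..100
val randomDF = uniformVectorRDD(sc, 100L, 3)
.map(v => (1 + (v(0) * 100).toInt, 1 + (v(1) * 100).toInt, 1 + (v(2) * 100).toInt))
.toDF("col1", "col2", "col3")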

Comparing values from different keys in scala / spark

I am trying to find the difference between values for keys that are related (but not the same). For example, lets say that I have the following map:
("John_1",["a","b","c"])
("John_2",["a","b"])
("John_3",["b","c"])
("Mary_5",["a","d"])
("John_5",["c","d","e"])
I want to compare the contents of Name_# to Name_(#-1) and get the difference. So, for the example above, I would like to get:
("John_1",["a","b","c"]) // since there is no John_0, all of the contents are new, so I keep them all
("John_2",[]) // since all of the contents of John_2 appear in John_1, the resulting list is empty (for now, I don't care about what happened to "c")
("John_3",["c"]) // in this case, "c" is a new item (because I don't care whether it existed prior to John_2). Again, I don't care what happened to "a".
("Mary_5",["a","d"]) // there is no Mary_4 so all the items are kept
("John_5",["c","d","e"]) // there is no John_4 so all the items are kept
I was thinking on doing some kind of aggregateByKey and then just find the difference between the lists, but I do not know how to make the match between the keys that I care about, namely Name_# with Name_(#-1).
Split "id":
import org.apache.spark.sql.functions._
val df = Seq(
("John_1", Seq("a","b","c")), ("John_2", Seq("a","b")),
("John_3", Seq("b","c")), ("Mary_5", Seq("a","d")),
("John_5", Seq("c","d","e"))
).toDF("key", "values").withColumn(
"user", split($"key", "_")(0)
).withColumn("id", split($"key", "_")(1).cast("long"))
Add window:
val w = org.apache.spark.sql.expressions.Window
.partitionBy($"user").orderBy($"id")
and a udf:
val diff = udf((x: Seq[String], y: Seq[String]) => y.diff(x))
and compute:
df
.withColumn("is_previous", coalesce($"id" - lag($"id", 1).over(w) === 1, lit(false)))
.withColumn("diff", when($"is_previous", diff( lag($"values", 1).over(w), $"values")).otherwise($"values"))
.show
// +------+---------+----+---+-----------+---------+
// | key| values|user| id|is_previous| diff|
// +------+---------+----+---+-----------+---------+
// |Mary_5| [a, d]|Mary| 5| false| [a, d]|
// |John_1|[a, b, c]|John| 1| false|[a, b, c]|
// |John_2| [a, b]|John| 2| true| []|
// |John_3| [b, c]|John| 3| true| [c]|
// |John_5|[c, d, e]|John| 5| false|[c, d, e]|
// +------+---------+----+---+-----------+---------+
I managed to solve my issue as follows:
First, create a function that computes the previous key from the current key:
def getPrevKey(k: String): String = {
val Array(n, h) = k.split("_")
val i = h.toInt
val sb = new StringBuilder
sb.append(n).append("_").append(i - 1)
sb.toString
}
Then, create a copy of my RDD with the shifted key:
val copyRdd = myRdd.map(row => {
val k1 = row._1
val v1 = row._2
val k2 = getPrevKey(k1)
(k2,v1)
})
And finally, I union both RDDs and reduce by key by taking the difference between the lists:
val result = myRdd.union(copyRdd)
.reduceByKey(_.diff(_))
This gets me the exact result I need, but has the problem that it requires a lot of memory due to the union. The final result is not that large, but the partial results really weigh down the process.

scala: Remove columns where column value below median value for all columns

In data reduction phase of analysis, I want to remove all columns where column total is below the median value of all column totals.
So with dataset:
v1,v2,v3
1 3 5
3 4 3
I sum columns
v1,v2,v3
4 7 8
The median is 7 so I drop v1
v2,v3
3 5
4 3
I thought I could do this with a streaming function on Row. But this does not seem possible.
The code I have come up with works, but it seems very verbose and looks a lot like Java code (which I take as a sign that I am doing it wrong).
Are there any more efficient ways of performing this operation?
val dfv2 = DataFrameUtils.openFile(spark,"C:\\Users\\jake\\__workspace\\R\\datafiles\\ikodaDataManipulation\\VERB2.csv")
//return a single row dataframe with sum of each column
val dfv2summed:DataFrame=dfv2.groupBy().sum()
logger.info(s"dfv2summed col count is ${dfv2summed.schema.fieldNames.length}")
//get the rowValues
val rowValues:Array[Long]=dfv2summed.head().getValuesMap(dfv2summed.schema.fieldNames).values.toArray
//sort the rows
scala.util.Sorting.quickSort(rowValues)
//calculate medians (simplistically)
val median:Long = rowValues(rowValues.length/2)
//ArrayBuffer to hold column needs that need removing
var columnArray: ArrayBuffer[String] = ArrayBuffer[String]()
//get tuple key value pairs of columnName/value
val entries: Map[String, Long]=dfv2summed.head().getValuesMap(dfv2summed.schema.fieldNames)
entries.foreach
{
//find all columns where total value below median value
kv =>
if(kv._2.<(median))
{
columnArray+=kv._1
}
}
//drop columns
val dropColumns:Seq[String]=columnArray.map(s => s.substring(s.indexOf("sum(")+4,s.length-1)).toSeq
logger.info(s"todrop ${dropColumns.size} : ${dropColumns}")
val reducedDf=dfv2.drop(dropColumns: _*)
logger.info(s"reducedDf col count is ${reducedDf.schema.fieldNames.length}")
After calculating the sum of each column in Spark, we can get the median value in plain Scala and then keep only the columns whose sums are greater than or equal to it, selected by column indices.
Let's start by defining a function for calculating the median; it is a slight modification of this example:
def median(seq: Seq[Long]): Long = {
// in case you are not sure that 'seq' is already sorted
val sortedSeq = seq.sortWith(_ < _)
if (seq.size % 2 == 1) sortedSeq(sortedSeq.size / 2)
else {
val (up, down) = sortedSeq.splitAt(seq.size / 2)
(up.last + down.head) / 2
}
}
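For the column sums in the example above (4, 7, 8) this gives:
median(Seq(4L, 7L, 8L)) // returns 7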
We first calculate the sums for all columns and convert it to Seq[Long]:
import org.apache.spark.sql.functions._
val sums = df.select(df.columns.map(c => sum(col(c)).alias(c)): _*)
.first.toSeq.asInstanceOf[Seq[Long]]
Then we calculate the median:
val med = median(sums)
And use it as a threshold to generate the column indices to keep:
val cols_keep = sums.zipWithIndex.filter(_._1 >= med).map(_._2)
Finally, we map these indices inside a select() statement:
df.select(cols_keep map df.columns map col: _*).show()
+---+---+
| v2| v3|
+---+---+
| 3| 5|
| 4| 3|
+---+---+

How to use RDD.flatMap?

I have a text file with lines that contain userid and rid separated by | (pipe). rid values correspond to many labels on another file.
How can I use flatMap to implement a method as follows:
xRdd = sc.textFile("file.txt").flatMap { line =>
val (userid,rid) = line.split("\\|")
val labelsArr = getLabels(rid)
labelsArr.foreach{ i =>
((userid, i), 1)
}
}
At compile time, I get an error:
type mismatch; found : Unit required: TraversableOnce[?]
Piecing together the information provided, it seems you will have to replace your foreach operation with a map operation.
xRdd = sc.textFile("file.txt") flatMap { line =>
val Array(userid, rid) = line.split("\\|")
val labelsArr = getLabels(rid)
labelsArr.map(i => ((userid, i), 1))
}
This is exactly the reason why I said here and here that Scala's for-comprehension could make things easier, and it should help you out too.
When you see a series of flatMap and map, that nesting should trigger some thinking about ways to cut the "noise". That begs for a simpler solution, doesn't it?
See the following and appreciate Scala (and its for-comprehension) yourself!
val lines = sc.textFile("file.txt")
val pairs = for {
line <- lines
Array(userid, rid) = line.split("\\|")
label <- getLabels(rid)
} yield ((userid, label), 1)
If you throw in Spark SQL to the mix, things would get even simpler. Just to whet your appetite:
scala> pairs.toDF.show
+-----------------+---+
| _1| _2|
+-----------------+---+
| [jacek,1]| 1|
|[jacek,getLabels]| 1|
| [agata,2]| 1|
|[agata,getLabels]| 1|
+-----------------+---+
I'm sure you can guess what was inside my file.txt file, can't you?
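For reference, a stub that would be consistent with the output above (purely a guess, not the author's actual code) is:
// hypothetical stand-ins, only to reproduce the demo output
def getLabels(rid: String): Seq[String] = Seq(rid, "getLabels")
// with file.txt containing:
// jacek|1
// agata|2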

Spark - Random Number Generation

I have written a method that must consider a random number to simulate a Bernoulli distribution. I am using random.nextDouble to generate a number between 0 and 1 then making my decision based on that value given my probability parameter.
My problem is that Spark is generating the same random numbers within each iteration of my for loop mapping function. I am using the DataFrame API. My code follows this format:
val myClass = new MyClass()
val M = 3
val myAppSeed = 91234
val rand = new scala.util.Random(myAppSeed)
for (m <- 1 to M) {
val newDF = sqlContext.createDataFrame(myDF
.map{row => RowFactory
.create(row.getString(0),
myClass.myMethod(row.getString(2), rand.nextDouble()))
}, myDF.schema)
}
Here is the class:
class MyClass extends Serializable {
val q = qProb
def myMethod(s: String, rand: Double) = {
if (rand <= q) // do something
else // do something else
}
}
I need a new random number every time myMethod is called. I also tried generating the number inside my method with java.util.Random (scala.util.Random in Scala 2.10 does not extend Serializable) like below, but I'm still getting the same numbers within each for loop.
val r = new java.util.Random(s.hashCode.toLong)
val rand = r.nextDouble()
I've done some research, and it seems this has to do with Spark's deterministic nature.
Just use the SQL function rand:
import org.apache.spark.sql.functions._
//df: org.apache.spark.sql.DataFrame = [key: int]
df.select($"key", rand() as "rand").show
+---+-------------------+
|key| rand|
+---+-------------------+
| 1| 0.8635073400704648|
| 2| 0.6870153659986652|
| 3|0.18998048357873532|
+---+-------------------+
df.select($"key", rand() as "rand").show
+---+------------------+
|key| rand|
+---+------------------+
| 1|0.3422484248879837|
| 2|0.2301384925817671|
| 3|0.6959421970071372|
+---+------------------+
According to this post, the best solution is not to put the new scala.util.Random inside the map, nor completely outside (i.e. in the driver code), but in an intermediate mapPartitionsWithIndex:
import scala.util.Random
val myAppSeed = 91234
val newRDD = myRDD.mapPartitionsWithIndex { (indx, iter) =>
val rand = new scala.util.Random(indx+myAppSeed)
iter.map(x => (x, Array.fill(10)(rand.nextDouble)))
}
The reason why the same sequence is repeated is that the random generator is created and initialized with a seed before the data is partitioned. Each partition then starts from the same random seed. Maybe not the most efficient way to do it, but the following should work:
val myClass = new MyClass()
val M = 3
for (m <- 1 to M) {
val newDF = sqlContext.createDataFrame(myDF
.map{
val rand = scala.util.Random
row => RowFactory
.create(row.getString(0),
myClass.myMethod(row.getString(2), rand.nextDouble()))
}, myDF.schema)
}
Using Spark Dataset API, perhaps for use in an accumulator:
df.withColumn("_n", substring(rand(),3,4).cast("bigint"))
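For what it's worth, position 3 in the substring skips the leading "0." of the stringified double, so the new column typically holds a 4-digit pseudo-random integer, e.g.:
// rand() = 0.8635073400704648 -> substring(_, 3, 4) = "8635" -> 8635 as bigint
df.select($"key", substring(rand(), 3, 4).cast("bigint") as "_n").show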