I have written a method that must consider a random number to simulate a Bernoulli distribution. I am using random.nextDouble to generate a number between 0 and 1 then making my decision based on that value given my probability parameter.
My problem is that Spark is generating the same random numbers within each iteration of my for loop mapping function. I am using the DataFrame API. My code follows this format:
val myClass = new MyClass()
val M = 3
val myAppSeed = 91234
val rand = new scala.util.Random(myAppSeed)
for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(myDF
    .map { row => RowFactory
      .create(myClass.myMethod(row.getString(2), rand.nextDouble()))
    }, myDF.schema)
}
Here is the class:
class MyClass extends Serializable {
  val q = qProb
  def myMethod(s: String, rand: Double) = {
    if (rand <= q) { /* do something */ }
    else { /* do something else */ }
  }
}
I need a new random number every time myMethod is called. I also tried generating the number inside my method with java.util.Random (scala.util.Random v10 does not extend Serializable) like below, but I'm still getting the same numbers within each for loop
val r = new java.util.Random(s.hashCode.toLong)
val rand = r.nextDouble()
I've done some research, and it seems this has to do with Spark's deterministic nature.

Just use the SQL function rand:
import org.apache.spark.sql.functions._
// df: org.apache.spark.sql.DataFrame = [key: int]
df.select($"key", rand() as "rand").show
+---+-------------------+
|key|               rand|
+---+-------------------+
|  1| 0.8635073400704648|
|  2| 0.6870153659986652|
|  3|0.18998048357873532|
+---+-------------------+
df.select($"key", rand() as "rand").show
+---+------------------+
|key|              rand|
+---+------------------+
|  1|0.3422484248879837|
|  2|0.2301384925817671|
|  3|0.6959421970071372|
+---+------------------+

According to this post, the best solution is to put the new scala.util.Random neither inside the map nor completely outside it (i.e. in the driver code), but in an intermediate mapPartitionsWithIndex:
import scala.util.Random
val myAppSeed = 91234
val newRDD = myRDD.mapPartitionsWithIndex { (indx, iter) =>
  val rand = new scala.util.Random(indx + myAppSeed)
  iter.map(x => (x, Array.fill(10)(rand.nextDouble)))
}
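The per-partition seeding idea can be tried without a cluster. The following is a minimal sketch in plain Scala, assuming partitions are modelled as nested local sequences; `PartitionSeedSketch` and `drawPerPartition` are hypothetical names used only for illustration, not Spark API.

```scala
import scala.util.Random

object PartitionSeedSketch {
  // Simulates mapPartitionsWithIndex on local collections: every "partition"
  // gets its own generator, seeded with appSeed + partition index.
  def drawPerPartition(partitions: Seq[Seq[String]], appSeed: Long): Seq[Seq[(String, Double)]] =
    partitions.zipWithIndex.map { case (rows, index) =>
      val rand = new Random(appSeed + index)
      // One draw per row, all taken from the partition-local generator
      rows.map(row => (row, rand.nextDouble()))
    }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq("a", "b"), Seq("c", "d"))
    drawPerPartition(partitions, 91234L).foreach(println)
  }
}
```

Running it twice with the same seed gives the same draws, while different partitions get different sequences. One caveat: scala.util.Random wraps java.util.Random, whose first outputs for adjacent seeds tend to be very close to each other, so it may be worth discarding the first draw or hashing the seed before use.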

The reason why the same sequence is repeated is that the random generator is created and initialized with a seed before the data is partitioned. Each partition then starts from the same random seed. Maybe not the most efficient way to do it, but the following should work:
val myClass = new MyClass()
val M = 3
for (m <- 1 to M) {
  val newDF = sqlContext.createDataFrame(myDF
    .map {
      val rand = scala.util.Random
      row => RowFactory
        .create(myClass.myMethod(row.getString(2), rand.nextDouble()))
    }, myDF.schema)
}
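The explanation above (every task starting from the same seeded state) can be reproduced in plain Scala. This is only a sketch under stated assumptions: Java serialization stands in for shipping a closure to a task, java.util.Random is used because it is Serializable, and `SharedSeedSketch`, `shipToTask` and `drawsPerTask` are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object SharedSeedSketch {
  // Round-trips a generator through Java serialization, roughly what happens
  // when a closure that captures it is shipped to a task.
  def shipToTask(rand: java.util.Random): java.util.Random = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(rand)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[java.util.Random] finally in.close()
  }

  // Each simulated task receives its own copy of the driver-side generator
  // and draws `n` numbers from that copy.
  def drawsPerTask(seed: Long, tasks: Int, n: Int): Seq[Seq[Double]] = {
    val driverRand = new java.util.Random(seed)
    (0 until tasks).map { _ =>
      val taskRand = shipToTask(driverRand)
      Seq.fill(n)(taskRand.nextDouble())
    }
  }

  def main(args: Array[String]): Unit =
    drawsPerTask(91234L, tasks = 3, n = 2).foreach(println)
}
```

All simulated tasks print the same numbers, because each copy carries the same internal state and the driver-side generator itself is never advanced.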

Using Spark Dataset API, perhaps for use in an accumulator:
df.withColumn("_n", substring(rand(),3,4).cast("bigint"))


How to generate a DataFrame with random content and N rows?

How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in range (1, 100)?
I know how to create a DataFrame manually, but I cannot automate it:
val df = sc.parallelize(Seq((1,20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3")
Generating the data locally and then parallelizing it is totally fine, especially if you don't have to generate a lot of data.
However, should you ever need to generate a huge dataset, you can always implement an RDD that does this for you in parallel, as in the following example.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each random partition will hold `numValues` items
final class RandomPartition[A: ClassTag](val index: Int, numValues: Int, random: => A) extends Partition {
  def values: Iterator[A] = Iterator.fill(numValues)(random)
}

// The RDD will parallelize the workload across `numSlices`
final class RandomRDD[A: ClassTag](@transient private val sc: SparkContext, numSlices: Int, numValues: Int, random: => A) extends RDD[A](sc, deps = Seq.empty) {
  // Based on the item and executor count, determine how many values are
  // computed in each executor. Distribute the rest evenly (if any).
  private val valuesPerSlice = numValues / numSlices
  private val slicesWithExtraItem = numValues % numSlices

  // Just ask the partition for the data
  override def compute(split: Partition, context: TaskContext): Iterator[A] =
    split.asInstanceOf[RandomPartition[A]].values

  // Generate the partitions so that the load is as evenly spread as possible
  // e.g. 10 partitions and 22 items -> 2 slices with 3 items and 8 slices with 2
  override protected def getPartitions: Array[Partition] =
    ((0 until slicesWithExtraItem).map(new RandomPartition[A](_, valuesPerSlice + 1, random)) ++
      (slicesWithExtraItem until numSlices).map(new RandomPartition[A](_, valuesPerSlice, random))).toArray
}
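The partition sizing rule from the comments (22 items over 10 slices gives 2 slices with 3 items and 8 with 2) can be checked in isolation. A minimal sketch in plain Scala; `SliceSizesSketch` and `sliceSizes` are hypothetical helper names, not part of the class above.

```scala
object SliceSizesSketch {
  // Items per slice: the first (numValues % numSlices) slices get one extra
  // item, the remaining slices get numValues / numSlices items each.
  def sliceSizes(numValues: Int, numSlices: Int): Seq[Int] = {
    val valuesPerSlice = numValues / numSlices
    val slicesWithExtraItem = numValues % numSlices
    (0 until numSlices).map { i =>
      if (i < slicesWithExtraItem) valuesPerSlice + 1 else valuesPerSlice
    }
  }

  def main(args: Array[String]): Unit =
    println(sliceSizes(22, 10).mkString(", ")) // prints 3, 3, 2, 2, 2, 2, 2, 2, 2, 2
}
```

The sizes always add up to `numValues`, and no two slices differ by more than one item.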
Once you have this you can use it passing your own random data generator to get an RDD[Int]
val rdd = new RandomRDD(spark.sparkContext, 10, 22, scala.util.Random.nextInt(100) + 1)
rdd.foreach(println)
/*
 * outputs:
 * 30
 * 86
 * 75
 * 20
 * ...
 */
or an RDD[(Int, Int, Int)]
def rand = scala.util.Random.nextInt(100) + 1
val rdd = new RandomRDD(spark.sparkContext, 10, 22, (rand, rand, rand))
rdd.foreach(println)
/*
 * outputs:
 * (33,22,15)
 * (65,24,64)
 * (41,81,44)
 * (58,7,18)
 * ...
 */
and of course you can wrap it in a DataFrame very easily as well:
spark.createDataFrame(rdd).show()
/*
 * outputs:
 * +---+---+---+
 * | _1| _2| _3|
 * +---+---+---+
 * |100| 48| 92|
 * | 34| 40| 30|
 * | 98| 63| 61|
 * | 95| 17| 63|
 * | 68| 31| 34|
 * .............
 */
Notice how in this case the generated data is different every time the RDD/DataFrame is acted upon. By changing the implementation of RandomPartition to actually store the values instead of generating them on the fly, you can have a stable set of random items, while still retaining the flexibility and scalability of this approach.
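The difference between generating on the fly and storing the values can be sketched in plain Scala. `OnTheFly` and `Stored` are hypothetical stand-ins for the two variants of RandomPartition, not Spark classes, and the sketch leaves out anything Spark-specific such as how a stored partition would be serialized.

```scala
object StableValuesSketch {
  // Re-evaluates the by-name generator on every traversal,
  // like the RandomPartition shown earlier.
  final class OnTheFly[A](numValues: Int, random: => A) {
    def values: Iterator[A] = Iterator.fill(numValues)(random)
  }

  // Evaluates the generator once at construction and replays the stored
  // values on every traversal.
  final class Stored[A](numValues: Int, random: => A) {
    private val stored: Vector[A] = Vector.fill(numValues)(random)
    def values: Iterator[A] = stored.iterator
  }

  def main(args: Array[String]): Unit = {
    def rand = scala.util.Random.nextInt(100) + 1
    val stable = new Stored(3, rand)
    // Two traversals of the stored variant see the same items
    println(stable.values.toList == stable.values.toList)
  }
}
```

In a real RDD the stored values would presumably travel with the partition object, so this trades memory and serialization cost for stability.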
One nice property of the stateless approach is that you can generate a huge dataset even locally. The following ran in a few seconds on my laptop:
new RandomRDD(spark.sparkContext, 10, Int.MaxValue, 42).count
// returns: 2147483647
Here you go, Seq.fill is your friend:
def randomInt1to100 = scala.util.Random.nextInt(100)+1
val df = sc.parallelize(
  Seq.fill(100)((randomInt1to100, randomInt1to100, randomInt1to100))
).toDF("col1", "col2", "col3")
You can simply use scala.util.Random to generate the random numbers within range and loop for 100 rows and finally use createDataFrame api
import scala.util.Random
val data = 1 to 100 map(x => (1+Random.nextInt(100), 1+Random.nextInt(100), 1+Random.nextInt(100)))
sqlContext.createDataFrame(data).toDF("col1", "col2", "col3").show(false)
You can use the generic code below:
import scala.util.Random
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// number of rows required
val rows = 15
// number of columns required
val cols = 10
val spark = SparkSession.builder
  .config("spark.sql.warehouse.dir", "file:///c:/tmp/spark-warehouse")
  .getOrCreate()
import spark.implicits._
val columns = 1 to cols map (i => "col" + i)
// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(name => StructField(name, IntegerType)))
// one Row per group of `cols` random values
val lstrows = Seq.fill(rows * cols)(Random.nextInt(100) + 1).grouped(cols).toList.map { x => Row(x: _*) }
val rdd = spark.sparkContext.makeRDD(lstrows)
val df = spark.createDataFrame(rdd, schema)
If you need to create a large amount of random data, Spark provides an object called RandomRDDs that can generate datasets filled with random numbers following a uniform, normal, or various other distributions.
From their example:
import org.apache.spark.mllib.random.RandomRDDs._
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
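The shift-and-scale step is easy to sanity-check locally: if x follows `N(0, 1)`, then 1.0 + 2.0 * x has mean 1 and variance 4. The sketch below uses plain scala.util.Random with a fixed seed instead of RandomRDDs; `NormalTransformSketch` and its helpers are hypothetical names.

```scala
import scala.util.Random

object NormalTransformSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population variance around the sample mean
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Draws n standard normal values and applies x => 1.0 + 2.0 * x
  def sample(n: Int, seed: Long): Vector[Double] = {
    val rand = new Random(seed)
    Vector.fill(n)(1.0 + 2.0 * rand.nextGaussian())
  }

  def main(args: Array[String]): Unit = {
    val v = sample(100000, 42L)
    println(f"mean=${mean(v)}%.3f variance=${variance(v)}%.3f")
  }
}
```

With 100000 samples the printed mean and variance should land close to 1 and 4 respectively.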

Spark UDAF with ArrayType as bufferSchema performance issues

I'm working on a UDAF that returns an array of elements.
The input for each update is a tuple of index and value.
What the UDAF does is to sum all the values under the same index.
For input(index,value) : (2,1), (3,1), (2,3)
should return (0,0,4,1,...,0)
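To make the expected result concrete, here is the same aggregation written in plain Scala over a local sequence. It is only a reference for the semantics (no Spark, no UDAF); `SumAtIndexSketch` and `sumAtIndex` are hypothetical names.

```scala
object SumAtIndexSketch {
  // Adds every value into the bucket named by its index and returns the buckets.
  def sumAtIndex(input: Seq[(Int, Long)], bucketSize: Int): Vector[Long] = {
    val buckets = new Array[Long](bucketSize)
    input.foreach { case (index, value) => buckets(index) += value }
    buckets.toVector
  }

  def main(args: Array[String]): Unit =
    println(sumAtIndex(Seq((2, 1L), (3, 1L), (2, 3L)), 5)) // prints Vector(0, 0, 4, 1, 0)
}
```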
The logic works fine, but I have an issue with the update method, my implementation only updates 1 cell for each row, but the last assignment in that method actually copies the entire array - which is redundant and extremely time consuming.
This assignment alone is responsible for 98% of my query execution time.
My question is, how can I reduce that time? Is it possible to assign 1 value in the buffer array without having to replace the entire buffer?
P.S.: I'm working with Spark 1.6, and I cannot upgrade it anytime soon, so please stick to solutions that would work with this version.
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumArrayAtIndexUDAF() extends UserDefinedAggregateFunction {
  val bucketSize = 1000
  def inputSchema: StructType = StructType(StructField("index", LongType) :: StructField("value", LongType) :: Nil)
  def dataType: DataType = ArrayType(LongType)
  def deterministic: Boolean = true
  def bufferSchema: StructType = {
    StructType(StructField("buckets", ArrayType(LongType)) :: Nil)
  }
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = new Array[Long](bucketSize)
  }
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val index = input.getLong(0)
    val value = input.getLong(1)
    val arr = buffer.getAs[mutable.WrappedArray[Long]](0)
    // update the single cell for this row
    arr(index.toInt) += value
    buffer(0) = arr // TODO THIS TAKES WAYYYYY TOO LONG - it actually copies the entire array for every call to this method (which essentially updates only 1 cell)
  }
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val arr1 = buffer1.getAs[mutable.WrappedArray[Long]](0)
    val arr2 = buffer2.getAs[mutable.WrappedArray[Long]](0)
    for (i <- arr1.indices) {
      arr1.update(i, arr1(i) + arr2(i))
    }
    buffer1(0) = arr1
  }
  override def evaluate(buffer: Row): Any = {
    buffer.getAs[mutable.WrappedArray[Long]](0)
  }
}
TL;DR Either don't use UDAF or use primitive types in place of ArrayType.
Without UserDefinedFunction
Both solutions should skip expensive juggling between internal and external representation.
Using standard aggregates and pivot
This uses standard SQL aggregations. While optimized internally, it might become expensive as the number of keys and the size of the array grow.
Given input:
val df = Seq((1, 2, 1), (1, 3, 1), (1, 2, 3)).toDF("id", "index", "value")
You can:
import org.apache.spark.sql.functions.{array, coalesce, col, lit}
val nBuckets = 10
@transient val values = array(
  0 until nBuckets map (c => coalesce(col(c.toString), lit(0))): _*
)
df
  .groupBy("id")
  .pivot("index", 0 until nBuckets)
  .sum("value")
  .select($"id", values.alias("values"))
  .show
| id| values|
| 1|[0, 0, 4, 1, 0, 0...|
Using RDD API with combineByKey / aggregateByKey.
Plain old byKey aggregation with a mutable buffer. No bells and whistles, but it should perform reasonably well with a wide range of inputs. If you suspect the input to be sparse, you may consider a more efficient intermediate representation, like a mutable Map.
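df.as[(Int, Int, Int)].rdd
  .map { case (id, index, value) => (id, (index, value)) }
  .aggregateByKey(Array.fill(nBuckets)(0))(
    { case (acc, (index, value)) => { acc(index) += value; acc } },
    (acc1, acc2) => { for (i <- 0 until nBuckets) acc1(i) += acc2(i); acc1 }
  ).toDF.show
| _1| _2|
| 1|[0, 0, 4, 1, 0, 0...|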
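The mutable Map idea for sparse input can be sketched as a pair of functions shaped like the two arguments of aggregateByKey. This is plain Scala run on local sequences, with each inner sequence standing in for one partition; `SparseBucketsSketch`, `seqOp`, `combOp`, `toDense` and `aggregate` are hypothetical names.

```scala
import scala.collection.mutable

object SparseBucketsSketch {
  type Acc = mutable.Map[Int, Long]

  // Folds one (index, value) pair into a partial accumulator
  def seqOp(acc: Acc, pair: (Int, Long)): Acc = {
    val (index, value) = pair
    acc(index) = acc.getOrElse(index, 0L) + value
    acc
  }

  // Merges the second partial accumulator into the first
  def combOp(acc1: Acc, acc2: Acc): Acc = {
    acc2.foreach { case (index, value) => acc1(index) = acc1.getOrElse(index, 0L) + value }
    acc1
  }

  // Expands the sparse accumulator into a dense vector of nBuckets entries
  def toDense(acc: Acc, nBuckets: Int): Vector[Long] =
    Vector.tabulate(nBuckets)(i => acc.getOrElse(i, 0L))

  // Local stand-in for aggregateByKey on a single key
  def aggregate(partitions: Seq[Seq[(Int, Long)]], nBuckets: Int): Vector[Long] = {
    val partials = partitions.map(_.foldLeft(mutable.Map.empty[Int, Long])(seqOp))
    toDense(partials.foldLeft(mutable.Map.empty[Int, Long])(combOp), nBuckets)
  }

  def main(args: Array[String]): Unit =
    println(aggregate(Seq(Seq((2, 1L), (3, 1L)), Seq((2, 3L))), 10))
}
```

Only the indices that actually occur are stored and merged, so the cost of a merge depends on the number of touched buckets rather than on `nBuckets`.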
Using UserDefinedFunction with primitive types
As far as I understand the internals, the performance bottleneck is ArrayConverter.toCatalystImpl.
It looks like it is called on each call to MutableAggregationBuffer.update, and in turn allocates a new GenericArrayData for each Row.
If we redefine bufferSchema as:
def bufferSchema: StructType = {
  StructType(0 to nBuckets map (i => StructField(s"x$i", LongType)))
}
both update and merge can be expressed as plain replacements of primitive values in the buffer. Call chain will remain pretty long, but it won't require copies / conversions and crazy allocations. Omitting null checks you'll need something similar to
// update
val index = input.getLong(0).toInt
buffer.update(index, buffer.getLong(index) + input.getLong(1))
// merge
for (i <- 0 to nBuckets) {
  buffer1.update(i, buffer1.getLong(i) + buffer2.getLong(i))
}
Finally evaluate should take Row and convert it to output Seq:
for (i <- 0 to nBuckets) yield buffer.getLong(i)
Please note that in this implementation a possible bottleneck is merge. While it shouldn't introduce any new performance problems, with M buckets each call to merge is O(M).
With K unique keys and P partitions, it will be called P * K times in the worst case scenario, where each key occurs at least once on each partition. This effectively increases the complexity of the merge component to O(M * P * K).
In general there is not much you can do about it. However if you make specific assumptions about the data distribution (data is sparse, key distribution is uniform), you can shortcut things a bit, and shuffle first:
df
  .repartition(n, $"key")
  .groupBy($"key")
  .agg(SumArrayAtIndexUDAF($"index", $"value"))
If the assumptions are satisfied it should:
Counterintuitively reduce shuffle size by shuffling sparse pairs, instead of dense array-like Rows.
Aggregate data using updates only (each O(1)), possibly touching only a subset of indices.
However if one or both assumptions are not satisfied, you can expect that shuffle size will increase while number of updates will stay the same. At the same time data skews can make things even worse than in update - shuffle - merge scenario.
Using Aggregator with "strongly" typed Dataset:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.{Encoder, Encoders}
class SumArrayAtIndex[I](f: I => (Int, Long))(bucketSize: Int) extends Aggregator[I, Array[Long], Seq[Long]]
    with Serializable {
  def zero = Array.fill(bucketSize)(0L)
  def reduce(acc: Array[Long], x: I) = {
    val (i, v) = f(x)
    acc(i) += v
    acc
  }
  def merge(acc1: Array[Long], acc2: Array[Long]) = {
    for {
      i <- 0 until bucketSize
    } acc1(i) += acc2(i)
    acc1
  }
  def finish(acc: Array[Long]) = acc.toSeq
  def bufferEncoder: Encoder[Array[Long]] = Encoders.kryo[Array[Long]]
  def outputEncoder: Encoder[Seq[Long]] = ExpressionEncoder()
}
which could be used as shown below
val ds = Seq((1, (1, 3L)), (1, (2, 5L)), (1, (0, 1L)), (1, (4, 6L))).toDS
ds
  .groupByKey(_._1)
  .agg(new SumArrayAtIndex[(Int, (Int, Long))](_._2)(10).toColumn)
  .show(false)
|value|SumArrayAtIndex(scala.Tuple2) |
|1 |[1, 3, 5, 0, 6, 0, 0, 0, 0, 0] |
|2 |[0, 11, 0, 0, 0, 0, 0, 0, 0, 0]|
See also SPARK-27296 - User Defined Aggregating Functions (UDAFs) have a major efficiency problem

The proper way to compute correlation between two Seq columns into a third column

I have a DataFrame where each row has 3 columns:
ID:Long, ratings1:Seq[Double], ratings2:Seq[Double]
For each row I need to compute the correlation between those Vectors.
I came up with the following solution, which seems to be inefficient (and, as Jarrod Roberson has mentioned, does not work), as I have to create RDDs for each Seq:
val similarities = df.map(row => {
  val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
  val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
  val corr: Double = Statistics.corr(ratings1, ratings2)
  Similarity(row.getAs[Long]("ID"), corr)
})
Is there a way to compute such correlations properly?
Let's assume you have a correlation function for arrays:
def correlation(arr1: Array[Double], arr2: Array[Double]): Double
(For potential implementations of that function, which is completely independent of Spark, you can ask a separate question or search online; there are some close-enough resources, e.g. this implementation.)
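As one possible body for such a function, here is a minimal sketch of the Pearson correlation coefficient in plain Scala. Choosing Pearson is an assumption on my part (the question does not say which correlation is wanted), and the wrapper object `CorrelationSketch` is a hypothetical name.

```scala
object CorrelationSketch {
  // Pearson correlation coefficient of two equally sized, non-empty arrays.
  // Returns NaN when either input has zero variance.
  def correlation(arr1: Array[Double], arr2: Array[Double]): Double = {
    require(arr1.length == arr2.length && arr1.nonEmpty,
      "arrays must be non-empty and of equal length")
    val n = arr1.length
    val mean1 = arr1.sum / n
    val mean2 = arr2.sum / n
    var cov = 0.0
    var var1 = 0.0
    var var2 = 0.0
    var i = 0
    while (i < n) {
      val d1 = arr1(i) - mean1
      val d2 = arr2(i) - mean2
      cov += d1 * d2
      var1 += d1 * d1
      var2 += d2 * d2
      i += 1
    }
    cov / math.sqrt(var1 * var2)
  }

  def main(args: Array[String]): Unit =
    println(correlation(Array(1.0, 2.0, 3.0), Array(2.0, 4.0, 6.0))) // prints 1.0
}
```

Because it only touches local arrays, a function like this can be called from inside a UDF without creating any RDDs per row.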
Now, all that's left to do is to wrap this function with a UDF and use it:
import org.apache.spark.sql.functions._
import spark.implicits._
val corrUdf = udf {
  (arr1: Seq[Double], arr2: Seq[Double]) => correlation(arr1.toArray, arr2.toArray)
}
val result = df.select($"ID", corrUdf($"ratings1", $"ratings2") as "correlation")

Join per line two different RDDs in just one - Scala

I'm programming a K-means algorithm in Spark-Scala.
My model predicts which cluster each point belongs to.
-6.59 -44.68
-35.73 39.93
47.54 -52.04
23.78 46.82
Load the data
val data = sc.textFile("/home/borja/flink/kmeans/points")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
Cluster the data into two classes using KMeans
val numClusters = 10
val numIterations = 100
val clusters = KMeans.train(parsedData, numClusters, numIterations)
val prediction = clusters.predict(parsedData)
However, I need to put the result and the points in the same file, in the following format:
[no title, numberOfCluster (1,2,3,..10), pointX, pointY]:
6 -6.59 -44.68
8 -35.73 39.93
10 47.54 -52.04
7 23.78 46.82
This file is the input to a Python script that displays the result nicely.
But my best effort has got just this:
(you can check the first numbers are wrong: 68, 384, ...)
var i = 0
val c = sc.parallelize(data.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
i = 0
val c2 = sc.parallelize(prediction.collect().map(x => {
  val tuple = (i, x)
  i += 1
  tuple
}))
val result = c.join(c2)
res94: Array[(Int, (String, Int))] = Array((68,(17.79 13.69,0)), (384,(-33.47 -4.87,8)), (440,(-4.75 -42.21,1)), (4,(-33.31 -13.11,6)), (324,(-39.04 -16.68,6)))
Thanks for your help! :)
I don't have a spark cluster handy to test, but something like this should work:
val result = parsedData.map { v =>
  val cluster = clusters.predict(v)
  s"$cluster ${v(0)} ${v(1)}"
}
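The same "predict and format in one pass" idea can be tried without Spark. In this sketch a nearest-centroid lookup stands in for the trained model's predict, and the centroids are made-up values; `ClusterLinesSketch`, `predict` and `formatLines` are hypothetical names.

```scala
object ClusterLinesSketch {
  // Index of the closest centroid by squared Euclidean distance
  // (a local stand-in for the model's predict method).
  def predict(centroids: Seq[Array[Double]], point: Array[Double]): Int =
    centroids.indices.minBy { c =>
      centroids(c).zip(point).map { case (a, b) => (a - b) * (a - b) }.sum
    }

  // One output line per point: cluster index, then the two coordinates
  def formatLines(centroids: Seq[Array[Double]], points: Seq[Array[Double]]): Seq[String] =
    points.map(p => s"${predict(centroids, p)} ${p(0)} ${p(1)}")

  def main(args: Array[String]): Unit = {
    val centroids = Seq(Array(-10.0, -40.0), Array(40.0, -50.0)) // made-up values
    val points = Seq(Array(-6.59, -44.68), Array(47.54, -52.04))
    formatLines(centroids, points).foreach(println)
  }
}
```

Since the cluster is computed next to the point it belongs to, no index counter or join is needed. The indices here are 0-based, so add 1 if the target format expects clusters numbered from 1.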

Match Dataframe Categorical Variables in vector Spark Scala

I have been trying to follow the Stack Overflow example about creating DataFrames for the Spark ML library in Scala.
How to create correct data frame for classification in Spark ML
However, I cannot get the matching udf to work.
Syntax: "kinds of the type arguments (Vector,Int,Int,String,String) do
not conform to the expected kinds of the type parameters (type RT,type
A1,type A2,type A3,type A4). Vector's type parameters do not match
type RT's expected parameters: type Vector has one type parameter, but
type RT has none"
I need to create a dataframe to input into the logistic regression library. Source sample data example has:
Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0
I suppose my desired output is:
| features|label|
|[1.0,9120.50,999] | 0.0|
|[1.0,3897.25,999] | 0.0|
|[2.0,-523.00,999] | 0.0|
|[0.0,-8723.15,999] | 0.0|
So far I have:
val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")
val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) =>
  val e3 = c match {
    case "MASCC2" => 0
    case "CACC1" => 1
    case "AMXCC1" => 2
  }
  Vectors.dense(e1, b, c)
}
val encodeLabel = udf[Double, Int](_ match {case "0" => 0.0 case "1" => 1.0})
val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account"))).withColumn("label", encodeLabel(df("fraud"))).select("features","label")
Using Spark 2.3.1, I suggest the following code for a classification-ready Spark ML Pipeline. If you want to include a classifier in the Pipeline, just add it where I point out. ClassificationPipeline returns a PipelineModel. Once you transform the DataFrame with this model, you get classification-ready columns named features and label.
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Handles categorical features
def stringIndexerPipeline(inputCol: String): (Pipeline, String) = {
  val indexer = new StringIndexer()
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_indexed")
  val pipeline = new Pipeline().setStages(Array(indexer))
  (pipeline, inputCol + "_indexed")
}

// Classification Pipeline Function
def ClassificationPipeline(df: DataFrame): PipelineModel = {
  // Preprocessing categorical features
  val (SourcePipeline, Source_indexed) = stringIndexerPipeline("Source")
  // Use StringIndexer output as input for OneHotEncoderEstimator
  val oneHotEncoder = new OneHotEncoderEstimator()
    .setInputCols(Array(Source_indexed))
    .setOutputCols(Array(Source_indexed + "_vec")) // assumed column name
  // Gather features that will be passed through the pipeline
  val inputCols = oneHotEncoder.getOutputCols ++ Array("Amount", "Account")
  // Put all inputs in a column as a vector
  val vectorAssembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("featureVector") // assumed intermediate column name
  // Scale vector column
  val standartScaler = new StandardScaler()
    .setInputCol("featureVector")
    .setOutputCol("features")
  // Create StringIndexer for the label column
  val labelIndexer = new StringIndexer().
    setInputCol("Fraud").
    setOutputCol("label")
  // create classification object in here
  // val classificationObject = new ....
  // Create a pipeline
  val pipeline = new Pipeline().setStages(
    Array(SourcePipeline, oneHotEncoder, vectorAssembler, standartScaler, labelIndexer/*, classificationObject*/))
  pipeline.fit(df)
}

val pipelineModel = ClassificationPipeline(df)
val transformedDF = pipelineModel.transform(df)
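For comparison, the manual encoding that the question attempted can be written as an ordinary function first and only then wrapped in a UDF. The sketch below is plain Scala, reuses the category codes from the question's match expression, and assumes the amount is a Double; `ManualEncodingSketch`, `sourceCode` and `toFeatures` are hypothetical names. As a side note, the quoted error ("type Vector has one type parameter") suggests that `Vector` resolved to Scala's collection type rather than Spark's linalg Vector, so an explicit import of the Spark type may be what the original UDF was missing.

```scala
object ManualEncodingSketch {
  // Category codes copied from the question's match expression.
  // Sources that are not listed there are rejected.
  def sourceCode(source: String): Double = source match {
    case "MASCC2" => 0.0
    case "CACC1"  => 1.0
    case "AMXCC1" => 2.0
    case other    => throw new IllegalArgumentException(s"unknown source: $other")
  }

  // Feature values in the order used by the desired output: source code, amount, account
  def toFeatures(source: String, amount: Double, account: Int): Array[Double] =
    Array(sourceCode(source), amount, account.toDouble)

  def main(args: Array[String]): Unit =
    println(toFeatures("CACC1", 9120.50, 999).mkString("[", ",", "]")) // prints [1.0,9120.5,999.0]
}
```

Matching on the source string (rather than on the account) and keeping the amount as a Double avoids two of the mismatches visible in the question's UDF.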