How can I divide rdd to specific number of rdds - scala

I have the below code which generates RDD from a text file:
val data = sparkContext.textFile(path)
val k = 3
How can I divide data into k unique RDDs?

You can use RDD.randomSplit, which divides an existing RDD according to the weights passed as parameters and returns an Array of RDDs.
Internally it works as follows:
/**
 * Randomly splits this RDD with the provided weights.
 *
 * @param weights weights for splits, will be normalized if they don't sum to 1
 * @param seed random seed
 *
 * @return split RDDs in an array
 */
def randomSplit(
    weights: Array[Double],
    seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  require(weights.forall(_ >= 0),
    s"Weights must be nonnegative, but got ${weights.mkString("[", ",", "]")}")
  require(weights.sum > 0,
    s"Sum of weights must be positive, but got ${weights.mkString("[", ",", "]")}")
  withScope {
    val sum = weights.sum
    val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    normalizedCumWeights.sliding(2).map { x =>
      randomSampleWithRange(x(0), x(1), seed)
    }.toArray
  }
}
NOTE: the weights for the splits will be normalized if they don't sum to 1.
Based on that behavior, I wrote the following snippet, which worked:
def getDoubleWeights(numparts: Int): Array[Double] = {
  // Assumed body: equal weights, one per output RDD (randomSplit normalizes them)
  Array.fill(numparts)(1.0)
}
The caller would look like this:
val rddWithNumParts = yourRDD.randomSplit(getDoubleWeights(yourRDD.partitions.length))
This divides the RDD roughly uniformly into as many RDDs as it has partitions; pass k instead of the partition count to get exactly k RDDs.
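To see why equal weights lead to an even split, the normalization step from the randomSplit source quoted above can be mimicked in plain Python, without Spark. This is only a sketch: cumulative_ranges is a made-up helper name, not part of any Spark API.

```python
from itertools import accumulate

def cumulative_ranges(weights):
    """Normalize weights and return one (low, high) range per split."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("sum of weights must be positive")
    # Counterpart of weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    bounds = [0.0] + list(accumulate(w / total for w in weights))
    # Counterpart of sliding(2): pair each bound with the next one
    return list(zip(bounds, bounds[1:]))

# Four equal weights: every split covers a quarter of [0, 1)
print(cumulative_ranges([1.0, 1.0, 1.0, 1.0]))
# [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
```

Each output RDD then keeps the elements whose random draw falls inside its range, so ranges of equal width give splits of roughly equal size.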
NOTE: The same applies to DataFrame.randomSplit.
You can also convert the RDD to a DataFrame by giving it a schema, for example sqlContext.createDataFrame(rddOfRow, schema), and then call this method:
DataFrame[] randomSplit(double[] weights)
Randomly splits this DataFrame with the provided weights.
Another thought I had is to divide based on the number of partitions, i.e. RDD.mapPartitionsWithIndex(....)
For each partition you get an Iterator (which can be converted into an RDD), so you could have number of partitions = number of RDDs.


Apply function to subset of Spark Datasets (Iteratively)

I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid, apply a sampling function called "sample()" (which takes three inputs) to each square of the grid, and then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in map-reduce fashion:
val sampleDataRDD = => {
val latMin = row._1
val latMax = latMin + 0.0001
val lonMin = row._2
val lonMax = lonMin + 0.0001
val filterDF = featuresDS.filter($"Latitude" > latMin).filter($"Latitude" < latMax).filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
val sampleDS = filterDF.sample(false, 0.05, 1234)
val output = sampleDataDS.reduce(_ union _)
I've tried various ways of dealing with this, such as converting sampleDS to an RDD and to a List, but I still get a NullPointerException when calling "collect" on output.
I'm thinking I need to find a different solution, but I don't see it.
I've referenced these questions thus far:
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
Creating a Spark DataFrame from an RDD of lists

Compute average of numbers in a text file in spark scala

Let's say I have a file in which each line is a number. How do I find the average of all the numbers in the file using Scala and Spark?
val data = sc.textFile("../../numbers.txt")
val sum = data.reduce( (x,y) => x+y )
val avg = sum/data.count()
The problem here is that x and y are strings. How do I convert them to Long within the reduce function?
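You need to apply a map which parses the strings before reducing them:
val sum = data.map(_.toLong).reduce(_ + _)
// Convert to Double first; Long / Long would truncate the average
val avg = sum.toDouble / data.count()
But I think you're better off using DoubleRDDFunctions.mean instead of calculating it yourself:
val mean = data.map(_.toDouble).mean()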
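The parse-then-reduce step can also be seen in isolation in plain Python, without Spark. The sample lines below are made up for illustration.

```python
from functools import reduce

lines = ["4", "8", "15", "16"]  # made-up file contents, one number per line

# Reducing the raw strings concatenates them instead of adding
concatenated = reduce(lambda x, y: x + y, lines)

# Parse first, then reduce
numbers = [int(line) for line in lines]
total = reduce(lambda x, y: x + y, numbers)
average = total / len(numbers)  # true division, so the fraction is kept

print(concatenated)  # 481516
print(average)       # 10.75
```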

How to compute cumulative sum using Spark

I have an rdd of (String,Int) which is sorted by key
val data = Array(("c1",6), ("c2",3),("c3",4))
val rdd = sc.parallelize(data).sortByKey
Now I want to start the value for the first key with zero and the subsequent keys as sum of the previous keys.
Eg: c1 = 0 , c2 = c1's value , c3 = (c1 value +c2 value) , c4 = (c1+..+c3 value)
expected output:
(c1,0), (c2,6), (c3,9)...
Is it possible to achieve this ?
I tried it with map but the sum is not preserved inside the map.
var sum = 0 ;
val t = rdd.map { x => val temp = sum; sum = sum + x._2; (x._1, temp) }
Compute partial results for each partition:
val partials = rdd.mapPartitionsWithIndex((i, iter) => {
  val (keys, values) = iter.toSeq.unzip
  val sums = values.scanLeft(0)(_ + _)
  // sums.init is the running total before each key, as asked in the question;
  // sums.tail would give the total including the key's own value
  Iterator((keys.zip(sums.init), sums.last))
})
Collect the partial sums:
val partialSums = partials.values.collect
Compute the cumulative sum over partitions and broadcast it:
val sumMap = sc.broadcast(
  (0 until rdd.partitions.size)
    .zip(partialSums.scanLeft(0)(_ + _))
    .toMap
)
Compute the final results:
val result = partials.keys.mapPartitionsWithIndex((i, iter) => {
  val offset = sumMap.value(i)
  if (iter.isEmpty) Iterator()
  else iter.next.map { case (k, v) => (k, v + offset) }.toIterator
})
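The three steps above can be traced without a cluster. The following plain-Python sketch models partitions as a list of lists; cumsum_by_partition and the partition layout are made up for illustration, and an empty partition is included on purpose.

```python
from itertools import accumulate

def cumsum_by_partition(partitions):
    """partitions: list of lists of (key, value) pairs, already sorted by key."""
    # Step 1: exclusive running sums inside each partition, plus each partition's total
    partials = []
    for part in partitions:
        running = [0] + list(accumulate(v for _, v in part))
        keyed = [(k, s) for (k, _), s in zip(part, running)]
        partials.append((keyed, running[-1]))

    # Step 2: offset of a partition = sum of the totals of all partitions before it
    totals = [total for _, total in partials]
    offsets = [0] + list(accumulate(totals))

    # Step 3: shift every partial result by its partition's offset
    return [
        [(k, s + offsets[i]) for k, s in keyed]
        for i, (keyed, _) in enumerate(partials)
    ]

parts = [[("c1", 6)], [], [("c2", 3), ("c3", 4)]]
print(cumsum_by_partition(parts))
# [[('c1', 0)], [], [('c2', 6), ('c3', 9)]]
```

Only the per-partition totals cross partition boundaries, which is why the Spark version needs to collect just one number per partition.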
Spark has built-in support for Hive ANALYTICS/WINDOWING functions, and a cumulative sum can be computed easily with them.
See the Hive wiki page on ANALYTICS/WINDOWING functions.
Assuming you have a sqlContext object:
val datardd = sqlContext.sparkContext.parallelize(Seq(("a",1),("b",2), ("c",3),("d",4),("d",5),("d",6)))
import sqlContext.implicits._
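//Register as test table (assumed step; table and column names are taken from the query below)
datardd.toDF("id", "val").createOrReplaceTempView("test")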
//Calculate Cumulative sum
sqlContext.sql("select id,val, " +
"SUM(val) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
"from test").show()
This approach triggers the warning below. If an executor runs out of memory, tune the job's memory parameters so it can handle a large dataset.
WARN WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance degradation.
I hope this helps.
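For readers unfamiliar with window frames, here is what "rows between unbounded preceding and current row" computes, written as a plain-Python sketch on the same sample rows. It is not Spark code, just the arithmetic.

```python
from itertools import accumulate

rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("d", 5), ("d", 6)]

# ORDER BY id; Python's sort is stable, so tied ids keep their input order
ordered = sorted(rows, key=lambda r: r[0])

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: inclusive running sum
running = list(accumulate(v for _, v in ordered))
result = [(k, v, s) for (k, v), s in zip(ordered, running)]
print(result)
# [('a', 1, 1), ('b', 2, 3), ('c', 3, 6), ('d', 4, 10), ('d', 5, 15), ('d', 6, 21)]
```

Note that in Spark the relative order of rows with the same id ("d" here) is not guaranteed, so the intermediate sums for tied rows may differ between runs.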
Here is a solution in PySpark. Internally it's essentially the same as @zero323's Scala solution, but it provides a general-purpose function with a Spark-like API.
import numpy as np
def cumsum(rdd, get_summand):
    """Given an ordered rdd of items, computes cumulative sum of
    get_summand(row), where row is an item in the RDD.
    """
    def cumsum_in_partition(iter_rows):
        total = 0
        for row in iter_rows:
            total += get_summand(row)
            yield (total, row)
    rdd = rdd.mapPartitions(cumsum_in_partition)
    def last_partition_value(iter_rows):
        final = None
        for cumsum, row in iter_rows:
            final = cumsum
        return (final,)
    partition_sums = rdd.mapPartitions(last_partition_value).collect()
    partition_cumsums = list(np.cumsum(partition_sums))
    partition_cumsums = [0] + partition_cumsums
    partition_cumsums = sc.broadcast(partition_cumsums)
    def add_sums_of_previous_partitions(idx, iter_rows):
        return ((cumsum + partition_cumsums.value[idx], row)
                for cumsum, row in iter_rows)
    rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
    return rdd
# test for correctness by summing numbers, with and without Spark
rdd = sc.range(10000, numSlices=10).sortBy(lambda x: x)
cumsums, values = zip(*cumsum(rdd, lambda x: x).collect())
assert all(cumsums == np.cumsum(values))
I came across a similar problem and implemented @Paul's solution. I wanted to compute a cumsum on an integer frequency table sorted by key (the integer), and there was a minor problem with np.cumsum(partition_sums): the error was unsupported operand type(s) for +=: 'int' and 'NoneType'.
When the range of keys is large enough, every partition is likely to contain something, so there are no None values. However, if the range is much smaller than the count and the number of partitions stays the same, some partitions end up empty. Here is the modified solution:
import numpy as np
import pandas as pd
def cumsum(rdd, get_summand):
    """Given an ordered rdd of items, computes cumulative sum of
    get_summand(row), where row is an item in the RDD.
    """
    def cumsum_in_partition(iter_rows):
        total = 0
        for row in iter_rows:
            total += get_summand(row)
            yield (total, row)
    rdd = rdd.mapPartitions(cumsum_in_partition)
    def last_partition_value(iter_rows):
        final = None
        for cumsum, row in iter_rows:
            final = cumsum
        return (final,)
    partition_sums = rdd.mapPartitions(last_partition_value).collect()
    # partition_cumsums = list(np.cumsum(partition_sums))
    #----from here are the changed lines
    # An empty partition reports None; count it as 0
    partition_sums = [x if x is not None else 0 for x in partition_sums]
    temp = np.cumsum(partition_sums)
    partition_cumsums = list(temp)
    partition_cumsums = [0] + partition_cumsums
    partition_cumsums = sc.broadcast(partition_cumsums)
    def add_sums_of_previous_partitions(idx, iter_rows):
        return ((cumsum + partition_cumsums.value[idx], row)
                for cumsum, row in iter_rows)
    rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
    return rdd
#test on random integer frequency
x = np.random.randint(10, size=1000)
D = sqlCtx.createDataFrame(pd.DataFrame(x.tolist(), columns=['D']))
c = D.groupBy('D').count().orderBy('D')
c_rdd = c.rdd.map(lambda x: x['count'])
cumsums, values = zip(*cumsum(c_rdd, lambda x: x).collect())
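The empty-partition case is easy to reproduce without Spark. In this plain-Python sketch the partitions are made-up lists of (running_sum, row) pairs, with the middle one left empty on purpose.

```python
from itertools import accumulate

def last_partition_value(rows):
    """Return the last running sum seen in a partition, or None if it is empty."""
    final = None
    for cumsum, _row in rows:
        final = cumsum
    return final

# Three made-up partitions of (running_sum, row) pairs; the middle one is empty
partitions = [[(2, 2), (5, 3)], [], [(4, 4)]]
partition_sums = [last_partition_value(p) for p in partitions]
print(partition_sums)  # [5, None, 4]

# Treat an empty partition as contributing 0 before accumulating
cleaned = [0 if s is None else s for s in partition_sums]
offsets = [0] + list(accumulate(cleaned))
print(offsets)  # [0, 5, 5, 9]
```

The empty partition and the one after it share the same offset, which is the intended result.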
You may also want to try a window with rowsBetween. Hope this is still helpful.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val data = Array(("c1",6), ("c2",3),("c3",4))
val df = sc.parallelize(data).sortByKey().toDF("c", "v")
val w = Window.orderBy("c")
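// rowsBetween(-2, -1) sums only the two preceding rows, which is enough for this three-row example;
// use rowsBetween(Window.unboundedPreceding, -1) to include every preceding row
val r = df.select($"c", sum($"v").over(w.rowsBetween(-2, -1)).alias("cs"))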
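What a row-based frame such as rowsBetween(-2, -1) adds up can be checked with a few lines of plain Python. frame_sum is a made-up helper for illustration, not a Spark function; it returns None for an empty frame, mirroring the null Spark gives for the first row.

```python
def frame_sum(values, start, end):
    """Sum values[i+start .. i+end] for each row i; None when the frame is empty."""
    out = []
    for i in range(len(values)):
        lo = max(0, i + start)
        hi = min(len(values) - 1, i + end)
        out.append(sum(values[lo:hi + 1]) if lo <= hi else None)
    return out

print(frame_sum([6, 3, 4], -2, -1))      # [None, 6, 9]
print(frame_sum([6, 3, 4, 5], -2, -1))   # [None, 6, 9, 7]
```

With a fourth row the last value is 7 rather than 13, because the frame reaches back only two rows.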

Naive Bayes with Apache Spark MLlib

I'm using Naive Bayes with Apache Spark MLlib for text classification, following this tutorial:
/* instantiate Spark context (not needed for running inside Spark shell) */
val sc = new SparkContext("local", "test")
/* word to vector space converter, limit to 10000 words */
val htf = new HashingTF(10000)
/* load positive and negative sentences from the dataset */
/* let 1 - positive class, 0 - negative class */
/* tokenize sentences and transform them into vector space model */
val positiveData = sc.textFile("/data/rt-polaritydata/rt-polarity.pos")
.map { text => new LabeledPoint(1, htf.transform(text.split(" ")))}
val negativeData = sc.textFile("/data/rt-polaritydata/rt-polarity.neg")
.map { text => new LabeledPoint(0, htf.transform(text.split(" ")))}
/* split the data 60% for training, 40% for testing */
val posSplits = positiveData.randomSplit(Array(0.6, 0.4), seed = 11L)
val negSplits = negativeData.randomSplit(Array(0.6, 0.4), seed = 11L)
/* union train data with positive and negative sentences */
val training = posSplits(0).union(negSplits(0))
/* union test data with positive and negative sentences */
val test = posSplits(1).union(negSplits(1))
/* Multinomial Naive Bayesian classifier */
val model = NaiveBayes.train(training)
/* predict */
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
/* metrics */
val metrics = new MulticlassMetrics(predictionAndLabels)
/* output F1-measure for all labels (0 and 1, negative and positive) */
metrics.labels.foreach( l => println(metrics.fMeasure(l)))
But after training, what should I do if I want to know whether the sentence "Have a nice day" is positive or negative?
Thank you.
Generally speaking, you need two things to make a prediction on raw data:
Apply the same transformations you used for the training data. If a transformer requires fitting (like IDF, normalization, or encoding), you have to use the one fitted on the training data. Since your approach is extremely simplistic, all you need here is something like this:
val testData = htf.transform("Have a nice day".split(" "))
Use the predict method of the trained model:
model.predict(testData)
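To illustrate why the same transformation has to be reused, here is a rough plain-Python sketch of the hashing trick. It is not Spark's HashingTF: hashing_tf is a made-up name and crc32 is a stand-in hash, so the indices will differ from what Spark produces.

```python
import zlib

def hashing_tf(tokens, num_features=10000):
    """Map tokens to a sparse {index: count} term-frequency vector."""
    vector = {}
    for token in tokens:
        # crc32 is a stand-in hash, chosen because it is stable across runs
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0) + 1
    return vector

train_vec = hashing_tf("have a nice day".split(" "))
new_vec = hashing_tf("have a nice day".split(" "))
# The same tokens land in the same slots, so a model trained on one
# representation can score the other
print(train_vec == new_vec)  # True
```

Changing the number of features or the tokenization between training and prediction would move words to different slots, and the model's learned weights would no longer line up.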

Spark - scala: shuffle RDD / split RDD into two random parts randomly

How can I take an RDD in Spark and split it randomly into two RDDs, so that each one contains some part of the data (let's say 97% and 3%)?
I thought of shuffling the list and then calling shuffledList.take((0.97*rddList.count).toInt)
But how can I shuffle the RDD?
Or is there a better way to split the list?
I've found a simple and fast way to split the array:
val Array(f1,f2) = data.randomSplit(Array(0.97, 0.03))
It will split the data using the provided weights.
You should use the randomSplit method:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
// Randomly splits this RDD with the provided weights.
// weights for splits, will be normalized if they don't sum to 1
// returns split RDDs in an array
Here is its implementation in Spark 1.0:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
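The range-based sampling in that implementation can be imitated in plain Python to see why the resulting parts are disjoint and together cover the input. This is a simplified sketch, not Spark's sampler: random_split is a made-up function, and it assumes one random draw per element that is shared by all splits, which is the role the common seed plays above.

```python
import random

def random_split(items, weights, seed):
    """Split items into len(weights) disjoint lists using cumulative ranges."""
    total = sum(weights)
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound

    # One draw per item, computed once and shared by every split
    rng = random.Random(seed)
    draws = [rng.random() for _ in items]

    return [
        [item for item, x in zip(items, draws) if lo <= x < hi]
        for lo, hi in zip(bounds, bounds[1:])
    ]

big, small = random_split(list(range(10000)), [0.97, 0.03], seed=42)
print(len(big), len(small))  # sizes are close to 9700 and 300, not exact
```

As with randomSplit itself, the weights control the expected proportions only; the actual sizes vary from seed to seed.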