Apply function to subset of Spark Datasets (Iteratively) - scala

I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid and, on each square of the grid, apply a sampling function called "sample()" that takes three inputs, then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in map-reduce fashion:
val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
  val latMin = row._1
  val latMax = latMin + 0.0001
  val lonMin = row._2
  val lonMax = lonMin + 0.0001
  val filterDF = featuresDS
    .filter($"Latitude" > latMin).filter($"Latitude" < latMax)
    .filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
  val sampleDS = filterDF.sample(false, 0.05, 1234)
  sampleDS
})
val output = sampleDataRDD.reduce(_ union _)
I've tried various ways of dealing with this, such as converting sampleDS to an RDD and to a List, but I still get a NullPointerException when calling "collect" on output.
I'm thinking I need to find a different solution, but I don't see it.
I've referenced these questions thus far:
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
Creating a Spark DataFrame from an RDD of lists
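For what it's worth, one direction that avoids nesting Datasets inside an RDD altogether (a sketch only, not from the original post; it assumes featuresDS has Latitude and Longitude columns and uses the 0.0001 cell size from the code above) is to key each feature by the grid cell it falls into and sample the keyed Dataset once:
import org.apache.spark.sql.functions._

// Sketch (not from the original post): tag each feature with its grid cell,
// then sample once instead of building one Dataset per cell.
val cellSize = 0.0001

val withCell = featuresDS
  .withColumn("latCell", floor($"Latitude" / cellSize))
  .withColumn("lonCell", floor($"Longitude" / cellSize))

// A uniform 5% sample is roughly 5% within each cell; for exact per-cell
// fractions, DataFrameStatFunctions.sampleBy on a combined cell key is an option.
val sampledFeatures = withCell.sample(withReplacement = false, fraction = 0.05, seed = 1234)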

Related

Scala: two sliding more efficiently

I am working through the Quick Start of Apache Spark and was wondering about the efficiency of transformations on collections. I would like to know how to improve the following code:
// Variable initialisation
val N = 300.0
val input = (0.0 to N - 1 by 1.0).toArray
val firstBigDivi = 100
val windowDuration = 6
val windowStep = 3
// Process
val windowedInput = input.
  sliding(firstBigDivi, firstBigDivi).toArray. // First, a big division
  map(arr => arr.sliding(windowDuration, windowStep).toArray) // Second, divide each division
Is there another way to do the same thing more efficiently? I think this code iterates twice over the input array (which could be an issue for big collections); is that right?
sliding creates an Iterator, so mapping over it is cheap. The .toArray between sliding and map is superfluous, though; it suffices to write:
val windowedInputIt = input.
  sliding(firstBigDivi, firstBigDivi). // First, a big division
  map(arr => arr.sliding(windowDuration, windowStep).toArray) // Second, divide each division
Then you can evaluate that iterator into an Array by writing
val windowedInput = windowedInputIt.toArray
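As a small illustration (not part of the original answer), the pipeline stays lazy until the iterator is consumed, so you can peek at the first chunk without evaluating the rest:
// Illustrative only: build the iterator again and pull a single chunk.
val peekIt = input.
  sliding(firstBigDivi, firstBigDivi).
  map(arr => arr.sliding(windowDuration, windowStep).toArray)

val firstChunk = peekIt.next() // evaluates only the first big division
println(firstChunk.length)     // number of small windows in that chunk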

Transforming Spark Dataframe Column

I am working with Spark dataframes. I have a categorical variable in my dataframe with many levels. I am attempting a simple transformation of this variable: only keep the top few levels which have more than n observations (say, 1000), and club all other levels into an "Others" category.
I am fairly new to Spark, so I have been struggling to implement this. This is what I have been able to achieve so far:
// Extract all levels having > 1000 observations (df is the dataframe name)
val levels_count = df.groupBy("Col_name").count.filter("count > 1000").sort(desc("count"))
// Extract the level names
val level_names = levels_count.select("Col_name").rdd.map(x => x(0)).collect
This gives me an Array which has the level names that I would like to retain. Next, I should define the transformation function which can be applied to the column. This is where I am getting stuck. I believe we need to create a User defined function. This is what I tried:
// Define UDF
val var_transform = udf((x: String) => {
  if (level_names contains x) x
  else "others"
})
// Apply UDF to the column
val df_new = df.withColumn("Var_new", var_transform($"Col_name"))
However, when I try df_new.show it throws a "Task not serializable" exception. What am I doing wrong? Also, is there a better way to do this?
Thanks!
Here is a solution that would be, in my opinion, better for such a simple transformation: stick to the DataFrame API and trust Catalyst and Tungsten to optimise it (e.g. by using a broadcast join):
val levels_count = df
  .groupBy($"Col_name".as("new_col_name"))
  .count
  .filter("count > 1000")
val df_new = df
  .join(levels_count, $"Col_name" === $"new_col_name", joinType = "leftOuter")
  .drop("Col_name")
  .withColumn("new_col_name", coalesce($"new_col_name", lit("other")))
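If you do want to keep the UDF approach, the "Task not serializable" error usually means the closure captured a non-serializable enclosing object along with level_names. A hedged sketch of one common workaround (assuming a SparkSession named spark and the levels_count defined in the question, which is keyed by Col_name):
import org.apache.spark.sql.functions.udf

// Assumes `spark` is the active SparkSession and levels_count is keyed by Col_name.
// Collect the retained level names locally and broadcast them, so the UDF
// closure only captures the (serializable) broadcast handle.
val levelNames: Set[String] =
  levels_count.select("Col_name").rdd.map(_.getString(0)).collect().toSet
val levelNamesBc = spark.sparkContext.broadcast(levelNames)

val var_transform = udf((x: String) =>
  if (levelNamesBc.value.contains(x)) x else "others")

val df_new = df.withColumn("Var_new", var_transform($"Col_name"))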

spark scala avoid shuffling after RowMatrix calculation

scala spark: How to avoid RDD shuffling in a join after a distributed matrix operation
I created a dense row matrix as input for calculating the cosine distance between columns:
val rowMatrixIn = sc.textFile("input.csv").map { line =>
  val values = line.split(" ").map(_.toDouble)
  Vectors.dense(values)
}
Then I extracted a set of entries from the coordinate matrix after the cosine calculations:
val coMatrix = new RowMatrix(rowMatrixIn)
val similerRows = coMatrix.columnSimilarities()
// extract entries over a specific threshold
val rowIndices = similerRows.entries
  .filter { case MatrixEntry(row: Long, col: Long, sim: Double) => sim > someThreshold }
  .map { case MatrixEntry(row: Long, col: Long, sim: Double) => (col, sim) }
We have another RDD, rdd2(key, Val2).
I just want to join the two RDDs, rowIndices(key, Val) and rdd2(key, Val2):
val joinedRDD = rowIndices.join(rdd2)
This will result in a shuffle.
What are the best practices to follow in order to avoid the shuffle? Any suggestion on a better approach is much appreciated.
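One common practice (a sketch only; it assumes both rowIndices and rdd2 are pair RDDs keyed by the same column index) is to co-partition the two RDDs with the same partitioner before the join, so the join itself becomes a narrow dependency:
import org.apache.spark.HashPartitioner

// Assumes rowIndices and rdd2 are keyed by the same column index.
// Partitioning both sides the same way shuffles each RDD once up front,
// but the join itself (and any repeated joins) avoids a further shuffle.
val partitioner = new HashPartitioner(rowIndices.getNumPartitions)
val rowIndicesPart = rowIndices.partitionBy(partitioner).cache()
val rdd2Part = rdd2.partitionBy(partitioner).cache()

val joinedRDD = rowIndicesPart.join(rdd2Part)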

converting RDD to vector with fixed length file data

I'm new to Spark + Scala and still developing my intuition. I have a file containing many samples of data. Every 2048 lines represents a new sample. I'm attempting to convert each sample into a vector and then run through a k-means clustering algorithm. The data file looks like this:
123.34 800.18
456.123 23.16
...
When I'm playing with a very small subset of the data, I create an RDD from the file like this:
val fileData = sc.textFile("hdfs://path/to/file.txt")
and then create the vector using this code:
val freqLineCount = 2048
val numSamples = 200
val freqPowers = fileData.map( _.split(" ")(1).toDouble )
val allFreqs = freqPowers.take(numSamples*freqLineCount).grouped(freqLineCount)
val lotsOfVecs = allFreqs.map(spec => Vectors.dense(spec) ).toArray
val lotsOfVecsRDD = sc.parallelize( lotsOfVecs ).cache()
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(lotsOfVecsRDD, numClusters, numIterations)
The key here is that I can call .grouped on the array and it returns an array of arrays, each holding 2048 sequential values. Those are then trivial to convert to vectors and run through the KMeans training algorithm.
I'm attempting to run this code on a much larger data set and I'm running into java.lang.OutOfMemoryError: Java heap space errors, presumably because I'm calling the take method on my freqPowers variable, which pulls that data onto the driver before I operate on it.
How would I go about achieving my goal of running KMeans on this data set, keeping in mind that:
each data sample occurs every 2048 lines in the file (so the file should be parsed somewhat sequentially)
this code needs to run on a distributed cluster
I need to not run out of memory :)
thanks in advance
You can do something like:
val freqLineCount = 2048
val freqPowers = fileData.map(_.split(" ")(1).toDouble)
// Replacement of your current code.
val groupedRDD = freqPowers.zipWithIndex().groupBy(_._2 / freqLineCount)
val vectorRDD = groupedRDD.map { grouped =>
  // sort by the original line index so values stay in file order within a sample
  Vectors.dense(grouped._2.toSeq.sortBy(_._2).map(_._1).toArray)
}
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(vectorRDD, numClusters, numIterations)
The replacement code uses zipWithIndex() and integer division of the index to group RDD elements into chunks of freqLineCount. After the grouping, the values of each chunk are assembled into the actual vectors.
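One small addition worth considering (my suggestion, not part of the original answer): KMeans makes several passes over its input, so caching the vector RDD before training avoids recomputing the grouping pipeline on every iteration.
// Suggestion: cache the vectors so the zipWithIndex/groupBy pipeline is not
// re-run on every KMeans iteration.
val cachedVectors = vectorRDD.cache()
val clusters = KMeans.train(cachedVectors, numClusters, numIterations)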

Spark - scala: shuffle RDD / split RDD into two random parts randomly

How can I take an RDD in Spark and split it randomly into two RDDs, so that each one contains a portion of the data (let's say 97% and 3%)?
I thought about shuffling the list and then taking shuffledList.take((0.97 * rddList.count).toInt).
But how can I shuffle the RDD?
Or is there a better way to split the list?
I've found a simple and fast way to split the array:
val Array(f1,f2) = data.randomSplit(Array(0.97, 0.03))
It will split the data using the provided weights.
You should use the randomSplit method:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
// Randomly splits this RDD with the provided weights.
// weights: weights for the splits; they will be normalized if they don't sum to 1
// returns: the split RDDs in an array
Here is its implementation in Spark 1.0:
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  val sum = weights.sum
  val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
  normalizedCumWeights.sliding(2).map { x =>
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](x(0), x(1)), seed)
  }.toArray
}
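For completeness, a short usage example (illustrative; rddList is the RDD from the question) of a reproducible 97% / 3% split:
// Illustrative: passing an explicit seed makes the split reproducible across runs.
val Array(mainPart, holdout) = rddList.randomSplit(Array(0.97, 0.03), seed = 1234L)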