I have a Dataset of geospatial data that I need to sample in a grid-like fashion. I want to divide the experiment area into a grid and, on each square of the grid, apply a sampling function called "sample()" that takes three inputs, then merge the sampled datasets back together. My current method uses a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in a map-reduce fashion:
val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
  val latMin = row._1
  val latMax = latMin + 0.0001
  val lonMin = row._2
  val lonMax = lonMin + 0.0001
  // filter the features down to the current grid cell, then sample that cell
  val filterDF = featuresDS
    .filter($"Latitude" > latMin).filter($"Latitude" < latMax)
    .filter($"Longitude" > lonMin).filter($"Longitude" < lonMax)
  val sampleDS = filterDF.sample(false, 0.05, 1234)
  sampleDS
})
val output = sampleDataRDD.reduce(_ union _)
I've tried various ways of dealing with this, such as converting sampleDS to an RDD and to a List, but I still get a NullPointerException when calling "collect" on output.
I'm thinking I need to find a different solution, but I don't see it.
I've referenced these questions thus far:
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
Creating a Spark DataFrame from an RDD of lists
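One workaround (not from the original post, just a sketch of a common pattern) is to avoid nesting Datasets altogether: tag every feature with the grid cell it falls into, and sample within each cell in a single pass, rather than filtering featuresDS once per cell inside a map. The cell-key columns and the rand-based filter below are illustrative assumptions, not the asker's sample() function:

import org.apache.spark.sql.functions.{floor, rand}

val cellSize = 0.0001

// Tag every feature with the grid cell it falls into, by flooring its
// coordinates to the cell size (assumes spark.implicits._ is in scope, as in the question).
val keyedDF = featuresDS
  .withColumn("cellLat", floor($"Latitude" / cellSize))
  .withColumn("cellLon", floor($"Longitude" / cellSize))

// Keep roughly 5% of the rows of every cell in one pass.
// rand(seed) < fraction is an approximate Bernoulli sample, similar in spirit
// to Dataset.sample(false, 0.05, 1234), but done once over the whole Dataset.
val sampled = keyedDF.filter(rand(1234) < 0.05)

// Any per-cell logic can then be expressed as a groupBy on the cell key,
// instead of looping over cells and filtering the full Dataset each time.
val perCellCounts = sampled.groupBy($"cellLat", $"cellLon").count()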
I have the data set a,
a = sc.parallelize([((1,2),(10,20,[1,3])),((1,2),(30,40,[1]))])
and I need the following:
(1,2) is going to be the key
Since I want to calculate the streaming standard deviation of the first two values, I need to evaluate the
pure sums and sums of squares for each of these values. In other words, I need to
sumx=(10+30), sumx^2=(10^2 + 30^2) for the first value,
and
sumx=(20+40), sumx^2=(20^2 + 40^2) for the second value.
For the final value (the lists), I just want to concatenate them.
The final result needs to be:
[((1,2),(40,1000,60,2000,[1,3]))]
Here is my code:
a.aggregateByKey((0.0,0.0,0.0,0.0,[]),\
(lambda x,y: (x[0]+y[0],x[0]*x[0]+y[0]*y[0],x[1]+y[1],x[1]*x[1]+y[1]*y[1],x[2]+y[2])),\
(lambda rdd1,rdd2: (rdd1[0]+rdd2[0],rdd1[1]+rdd2[1],rdd1[2]+rdd1[2],rdd1[3]+rdd2[3],rdd1[4]+rdd2[4]))).collect()
Unfortunately it returns the following error:
"TypeError: unsupported operand type(s) for +: 'float' and 'list'"
Any thoughts?
You can use HiveContext to solve this:
from pyspark.sql.context import HiveContext
hivectx = HiveContext(sc)
a = sc.parallelize([((1,2),(10,20,[1,3])),((1,2),(30,40,[1]))])
# Convert this to a dataframe
b = a.toDF(['col1','col2'])
# Explode col2 into individual columns
c = b.map(lambda x: (x.col1,x.col2[0],x.col2[1],x.col2[2])).toDF(['col1','col21','col22','col23'])
c.registerTempTable('mydf')
sql = """
select col1,
sum(col21) as sumcol21,
sum(POW(col21,2)) as sum2col21,
sum(col22) as sumcol22,
sum(POW(col22,2)) as sum2col22,
collect_set(col23) as col23
from mydf
group by col1
"""
d = hivectx.sql(sql)
# Get back your original dataframe
e = d.map(lambda x:(x.col1,(x.sumcol21,x.sum2col21,x.sumcol22,x.sum2col22,[item for sublist in x.col23 for item in sublist]))).toDF(['col1','col2'])
Scala Spark: How to avoid RDD shuffling in a join after a distributed matrix operation
I created a dense matrix as input to calculate the cosine distance between columns:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

val rowMatrixIn = sc.textFile("input.csv").map { line =>
  val values = line.split(" ").map(_.toDouble)
  Vectors.dense(values)
}
I extracted a set of entries from the coordinate matrix after the cosine calculations:
val coMatrix = new RowMatrix(rowMatrixIn)
val similarRows = coMatrix.columnSimilarities()

// extract entries over a specific threshold
val rowIndices = similarRows.entries
  .filter { case MatrixEntry(_, _, sim) => sim > someThreshold }
  .map { case MatrixEntry(_, col, sim) => (col, sim) }
We have another RDD, rdd2(key, Val2).
We just want to join the two RDDs: rowIndices(key, Val) and rdd2(key, Val2).
val joinedRDD = rowIndices.join(rdd2)
This will result in a shuffle.
What are the best practices to follow in order to avoid the shuffle? Any suggestion on a better approach is much appreciated.
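One common pattern (a sketch, not necessarily the best fit for this exact pipeline) is to co-partition the two RDDs before joining: when both sides share the same partitioner, the join itself does not re-shuffle them. The partition count below is an arbitrary assumption:

import org.apache.spark.HashPartitioner

// Use one partitioner for both RDDs so the join becomes a narrow dependency.
val partitioner = new HashPartitioner(200) // partition count chosen arbitrarily here

val rowIndicesPart = rowIndices.partitionBy(partitioner).persist()
val rdd2Part = rdd2.partitionBy(partitioner).persist()

// No extra shuffle happens inside this join, because both inputs are
// already hash-partitioned the same way.
val joinedRDD = rowIndicesPart.join(rdd2Part)

Note that partitionBy itself shuffles each RDD once, so this mainly pays off when the partitioned RDDs are reused across several operations; if rdd2 is small, collecting and broadcasting it for a map-side join avoids the shuffle entirely.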
Let's say I have a file with each line representing a number. How do I find the average of all the numbers in the file in Scala Spark?
val data = sc.textFile("../../numbers.txt")
val sum = data.reduce( (x,y) => x+y )
val avg = sum/data.count()
The problem here is that x and y are strings. How do I convert them into Long within the reduce function?
You need to apply an RDD.map which parses the strings before reducing them:
val sum = data.map(_.toInt).reduce(_+_)
val avg = sum / data.count()
But I think you're better off using DoubleRDDFunctions.mean instead of calculating it yourself:
val mean = data.map(_.toInt).mean()
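A side note on the manual version, not part of the original answer: sum is an Int and data.count() is a Long, so sum / data.count() is integer division and drops the fractional part. Parsing to Double avoids that; a minimal sketch assuming the same numbers.txt:

val data = sc.textFile("../../numbers.txt")

// Parse to Double so the division keeps the fractional part
val sum = data.map(_.toDouble).reduce(_ + _)
val avg = sum / data.count()

// Or simply rely on the built-in mean() of a numeric RDD
val mean = data.map(_.toDouble).mean()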
I have a basic RDD[Object] on which I apply a map with a hash function on the Object values, using the Scala nextGaussian and nextDouble functions. When I print the values, they change at each print.
import scala.util.Random

def hashmin(x: Data_Object, w: Double) = {
  val x1 = x.get_vector.toArray
  var a1 = Array.empty[Double]
  val b = Random.nextDouble * w
  // draw one Gaussian coefficient per vector component
  for (ind <- 0 to x1.size - 1) {
    val nG = Random.nextGaussian
    a1 = a1 :+ nG
  }
  // dot product of the vector with the random coefficients
  var sum = 0.0
  for (ind <- 0 to x1.size - 1) {
    sum = sum + (x1(ind) * a1(ind))
  }
  val hash_val = (sum + b) / w
  val hash_val1 = (x.get_id, hash_val)
  hash_val1
}
val w = 8
val rddhash = parsedData.map(x => hashmin(x,w))
rddhash.foreach(println)
rddhash.foreach(println)
I don't understand why. Thank you in advance.
RDDs are merely a "pointer" to the data + operations to be applied to it. Actions materialize those operations by executing the RDD lineage.
So, RDDs are basically recomputed when an action is requested. In this case, the map function calling hashmin is being evaluated every time the foreach action is called.
There are a few options:
Cache the RDD - this keeps the results of the first computation around, so later actions reuse them instead of re-running the map:
val rddhash = parsedData.map(x => hashmin(x,w)).cache()
Use a seed for your random function, so that the pseudo-random sequence generated is the same each time; a sketch of this option follows.
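A minimal sketch of that option, not the poster's exact code: hashminSeeded is a hypothetical variant of hashmin that receives its random generator as a parameter, and the seed is derived from each object's id, so recomputing the RDD reproduces the same values:

import scala.util.Random

// Hypothetical variant of hashmin that takes the RNG instead of using the global Random
def hashminSeeded(x: Data_Object, w: Double, rng: Random) = {
  val x1 = x.get_vector.toArray
  val b = rng.nextDouble * w
  // one Gaussian coefficient per vector component, drawn from the seeded generator
  val a1 = Array.fill(x1.length)(rng.nextGaussian)
  val sum = x1.zip(a1).map { case (v, g) => v * g }.sum
  (x.get_id, (sum + b) / w)
}

// Seeding from the object's id makes the hash deterministic across actions,
// so both foreach(println) calls print the same values.
val rddhash = parsedData.map(x => hashminSeeded(x, w, new Random(x.get_id.hashCode.toLong)))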
RDDs are lazy - they're computed when they're used. So the calls to Random.nextGaussian are made again each time you call foreach.
You can use persist() to store an RDD if you want to keep fixed values.