Compute average of numbers in a text file in spark scala - scala

Lets say I have a file with each line representing a number. How do I find average of all the numbers in the file in Scala - Spark.
val data = sc.textFile("../../numbers.txt")
val sum = data.reduce( (x,y) => x+y )
val avg = sum/data.count()
The problem here is x and y are strings. How do I convert them into Long within the reduce function.

You need to apply a RDD.map which parses the strings before reducing them:
val sum = data.map(_.toInt).reduce(_+_)
val avg = sum / data.count()
But I think you're better off using DoubleRDDFunctions.mean instead of calculating it yourself:
val mean = data.map(_.toInt).mean()

Related

Length of dataframe inside UDF function

I need to write a complex User Defined Function (UDF) that takes multiple columns as input. Something like:
val uudf = udf{(val:Int, lag:Int, cumsum_p:Double) => val + lag + cum_p} // actually a more complex function but let's make it simple
The third parameter cumsum_p indicate is a cumulative sum of p where p is a the length of the group it is computed. Because this udf will then be used in a groupby.
I come up with this solution which is almost ok:
val uudf = udf{(val:Int, lag:Int, cumsum_p:Double) => val + lag + cum_p}
val w = Window.orderBy($"sale_qty")
df.withColumn("needThat",
uudf(col("sale_qty"),
lead("sale_qty",1).over(w), sum(lit(1/length_group)).over(w)
)
).show()
The problem is that if I replace lit(1/length_group) with lit(1/count("sale_qty")) the created column now contains only 1 element which lead to an error...
You should compute count("sale_qty") first:
val w = Window.orderBy($"sale_qty")
df
.withColumn("cnt",count($"sale_qty").over())
.withColumn("needThat",
uudf(col("sale_qty"),
lead("sale_qty",1).over(w), sum(lit(1)/$"cnt").over(w)
)
).show()

Calculate cosine similarity of document relevance

I have go the normalized TF-IDF for and also the keyword RDD and now want to compute the cosine similarity to find relevance score for the document .
So I tried as
documentRdd = sc.textFile("documents.txt").flatMap(lambda l: re.split(r'[^\w]+',l))
keyWords = sc.textFile("keywords.txt").flatMap(lambda l: re.split(r'[^\w]+',l))
normalizer1 = Normalizer()
hashingTF = HashingTF()
tf = hashingTF.transform(documentRdd)
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
normalizedtfidf=normalizer1.transform(tfidf)
Now I wanted to calculate the cosine similarity between the normalizedtfidf and keyWords.So I tried using
x = Vectors.dense(normalizedtfidf)
y = Vectors.dense(keywordTF)
print(1 - x.dot(y)/(x.norm(2)*y.norm(2)) , "is the releavance score")
But this throw the error as
TypeError: float() argument must be a string or a number
Which means I am passing a wrong format .Any help is appreciated .
Update
I tried then
x = Vectors.sparse(normalizedtfidf.count(),normalizedtfidf.collect())
y = Vectors.sparse(keywordTF.count(),keywordTF.collect())
but got
TypeError: Cannot treat type as a
vector
as the error.
You got the errors because you are attempting to convert RDD into Vectors forcibly.
You can achieve what you need without doing the conversion by doing the following steps :
Join both your RDDs into one RDD. Note that I am assuming you do not have a unique index in both RDDs for joining.
# Adding index to both RDDs by row.
rdd1 = normalizedtfidf.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
rdd2 = keywordTF.zipWithIndex().map(lambda arg : (arg[1], arg[0]))
# Join both RDDs.
rdd_joined = rdd1.join(rdd2)
map RDD with a function to calculate cosine distance.
def cosine_dist(row):
x = row[1][0]
y = row[1][1]
return (1 - x.dot(y)/(x.norm(2)*y.norm(2)))
res = rdd_joined.map(cosine_dist)
You can then use your results or run collect to see them.

Apply function to subset of Spark Datasets (Iteratively)

I have a Dataset of geospatial data that I need to sample in a grid like fashion. I want divide the experiment area into a grid, and use a sampling function called "sample()" that takes three inputs, on each square of the grid, and then merge the sampled datasets back together. My current method utilized a map function, but I've learned that you can't have an RDD of RDDs/Datasets/DataFrames. So how can I apply the sampling function to subsets of my dataset? Here is the code I tried to write in map reduce fashion:
val sampleDataRDD = boundaryValuesDS.rdd.map(row => {
val latMin = row._1
val latMax = latMin + 0.0001
val lonMin = row._2
val lonMax = lonMin + 0.0001
val filterDF = featuresDS.filter($"Latitude" > latMin).filter($"Latitude"< latMax).filter($"Longitude">lonMin).filter($"Longitude"< lonMin)
val sampleDS = filterDF.sample(false, 0.05, 1234)
(sampleDS)
})
val output = sampleDataDS.reduce(_ union _)
I've tried various ways of dealing with this. Converting sampleDS to an RDD and to a List, but I still continue to get a NullPointerExcpetion when calling "collect" on output.
I'm thinking I need to find a different solution, but I don't see it.
I've referenced these questions thus far:
Caused by: java.lang.NullPointerException at org.apache.spark.sql.Dataset
Creating a Spark DataFrame from an RDD of lists

spark scala avoid shuffling after RowMatix calculation

scala spark : How to avoid RDD shuffling in join after Distributed Matrix operation
created a dense matrix as a input to calculate cosine distance between columns
val rowMarixIn = sc.textFile("input.csv").map{ line =>
val values = line.split(" ").map(_.toDouble)
Vectors.dense(values)
}
Extracted set of entries from co-ordinated matrix after the cosine calculations
val coMarix = new RowMatrix(rowMarixIn)
val similerRows = coMatrix.columnSimilarities()
//extract entires over a specific Threshold
val rowIndices = similerRows.entries.map {case MatrixEntry(row: Long, col: Long, sim: Double) =>
if (sim > someTreshold )){
col,sim
}`
We have a another RDD with rdd2(key,Val2)
just want to join the two rdd's, rowIndices(key,Val) , rdd2(key,Val2)
val joinedRDD = rowIndices.join(rdd2)
this will result in a shuffle ,
What are best practices to follow in order to avoid shuffle or any suggestion on a better approach is much appreciated

converting RDD to vector with fixed length file data

I'm new to Spark + Scala and still developing my intuition. I have a file containing many samples of data. Every 2048 lines represents a new sample. I'm attempting to convert each sample into a vector and then run through a k-means clustering algorithm. The data file looks like this:
123.34 800.18
456.123 23.16
...
When I'm playing with a very small subset of the data, I create an RDD from the file like this:
val fileData = sc.textFile("hdfs://path/to/file.txt")
and then create the vector using this code:
val freqLineCount = 2048
val numSamples = 200
val freqPowers = fileData.map( _.split(" ")(1).toDouble )
val allFreqs = freqPowers.take(numSamples*freqLineCount).grouped(freqLineCount)
val lotsOfVecs = allFreqs.map(spec => Vectors.dense(spec) ).toArray
val lotsOfVecsRDD = sc.parallelize( lotsOfVecs ).cache()
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(lotsOfVecsRDD, numClusters, numIterations)
The key here is that I can call .grouped on an array of strings and it returns an array of arrays with the sequential 2048 values. That is then trivial to convert to vectors and run it through the KMeans training algo.
I'm attempting to run this code on a much larger data set and running into java.lang.OutOfMemoryError: Java heap space errors. Presumably because I'm calling the take method on my freqPowers variable and then performing some operations on that data.
How would I go about achieving my goal of running KMeans on this data set keeping in mind that
each data sample occurs every 2048 lines in the file (so the file should be parsed somewhat sequentially)
this code needs to run on a distributed cluster
I need to not run out of memory :)
thanks in advance
You can do something like:
val freqLineCount = 2048
val freqPowers = fileData.flatMap(_.split(" ")(1).toDouble)
// Replacement of your current code.
val groupedRDD = freqPowers.zipWithIndex().groupBy(_._2 / freqLineCount)
val vectorRDD = groupedRDD.map(grouped => Vectors.dense(grouped._2.map(_._1).toArray))
val numClusters = 2
val numIterations = 2
val clusters = KMeans.train(vectorRDD, numClusters, numIterations)
The replacing code uses zipWithIndex() and division of longs to group RDD elements into freqLineCount chunks. After the grouping, the elements in question are extracted into their actual vectors.