How to convert a map to Spark's RDD - scala

I have a data set which is in the form of some nested maps, and its Scala type is:
Map[String, (LabelType,Map[Int, Double])]
The first String key is a unique identifier for each sample, and the value is a tuple that contains the label (which is -1 or 1), and a nested map which is the sparse representation of the non-zero elements which are associated with the sample.
I would like to load this data into Spark (using MLUtils) and train and test some machine learning algorithms.
It's easy to write this data into a file with LibSVM's sparse encoding, and then load it in Spark:
writeMapToLibSVMFile(data_map,"libsvm_data.txt") // Implemented somewhere else
val conf = new SparkConf().setAppName("DecisionTree").setMaster("local[4]")
val sc = new SparkContext(conf)
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "libsvm_data.txt")
// Split the data into training and test sets
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
I know it should be as easy to directly load the data variable from data_map, but I don't know how.
Any help is appreciated!

I guess you want something like this
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// If you know this upfront, otherwise it can be computed
// using flatMap
// trainMap.values.flatMap(_._2.keys).max + 1
val nFeatures: Int = ???
val trainMap = Map(
"x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
"x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))
val trainRdd: RDD[(String, LabeledPoint)] = sc
  // Convert Map to Seq so it can be passed to parallelize
  .parallelize(trainMap.toSeq)
  .map{case (id, (labelInt, values)) => {
    // Convert nested map to Seq so it can be passed to Vector
    val features = Vectors.sparse(nFeatures, values.toSeq)
    // Convert label to Double so it can be used for LabeledPoint
    val label = labelInt.toDouble
    (id, LabeledPoint(label, features))
  }}
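As a minimal sketch of the same idea carried through to a train/test split: the block below computes the feature count from the map itself and drops the sample ids so the result is a plain RDD[LabeledPoint]. The sample data, the use of Int for the label type, and the remapping of -1 to 0.0 are assumptions made for illustration (MLlib classifiers generally expect class labels starting at 0).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val conf = new SparkConf().setAppName("MapToRDD sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data in the shape described in the question
// (the label type is assumed to be Int here)
val dataMap: Map[String, (Int, Map[Int, Double])] = Map(
  "x001" -> (-1, Map(0 -> 1.0, 3 -> 5.0)),
  "x002" -> (1, Map(2 -> 5.0, 3 -> 6.0)))

// Largest feature index + 1 gives the size of the sparse vectors
val nFeatures: Int = dataMap.values.flatMap(_._2.keys).max + 1

val points: RDD[LabeledPoint] = sc.parallelize(dataMap.values.toSeq).map {
  case (label, values) =>
    // Assumption: remap the -1/1 labels to 0.0/1.0
    val mllibLabel = if (label == 1) 1.0 else 0.0
    LabeledPoint(mllibLabel, Vectors.sparse(nFeatures, values.toSeq))
}

// Same 70/30 split as in the question, with a fixed seed for repeatability
val Array(trainingData, testData) = points.randomSplit(Array(0.7, 0.3), seed = 42L)
```

Keeping the ids, as in the answer above, is useful when predictions have to be traced back to samples; dropping them is only a convenience for APIs that take RDD[LabeledPoint] directly.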

It can be done in two ways
sc.textFile("libsvm_data.txt").map(s => createObject())
Convert map into collection of objects and use sc.parallelize()
The first one is preferable.


How to filter an rdd by data type?

I have an RDD that I am trying to filter for only float type. Do Spark RDDs provide any way of doing this?
I have a CSV from which I need only float values greater than 40 in a new RDD. To achieve this, I check whether the value is an instance of type Float and filter on that. When I filter with a !, all the strings are still there in the output, and when I don't use !, the output is empty.
val airports1 = airports.filter(line => !line.split(",")(6).isInstanceOf[Float])
val airports2 = airports1.filter(line => line.split(",")(6).toFloat > 40)
At the .toFloat, I run into a NumberFormatException, which I've tried to handle in a try/catch block.
Since you have a plain string and are trying to get float values from it, you are not actually filtering by type but by whether the value can be parsed as a float. Every element returned by split is a String, so isInstanceOf[Float] is always false.
You can accomplish that using a flatMap together with Option.
import org.apache.spark.sql.SparkSession
import scala.util.Try
val spark = SparkSession.builder.master("local[*]").appName("Float caster").getOrCreate()
val sc = spark.sparkContext
val data = List("x,10", "y,3.3", "z,a")
val rdd = sc.parallelize(data) // rdd: RDD[String]
val filtered = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption) // filtered: RDD[Float]
filtered.collect() // res0: Array[Float] = Array(10.0, 3.3)
For the > 40 part you can either apply another filter afterwards or filter the inner Option.
(Both should perform more or less the same because of Spark's laziness, so choose whichever is clearer to you.)
// Option 1 - Another filter.
val filtered2 = filtered.filter(x => x > 40)
// Option 2 - Filter the inner option in one step.
val filteredInOneStep = rdd.flatMap(line => Try(line.split(",")(1).toFloat).toOption.filter(x => x > 40))
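If the whole CSV line should be kept rather than just the parsed number, the same Try/Option idea can be used inside filter. This is a sketch under assumptions: the sample rows are made up, and column index 6 is taken from the question's code.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

val spark = SparkSession.builder.master("local[*]").appName("Float filter sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical rows; column 6 is expected to hold a float
val airports = sc.parallelize(Seq(
  "1,A,B,C,D,E,41.5",
  "2,A,B,C,D,E,not-a-number",
  "3,A,B,C,D,E,12.0",
  "4,too,short"))

// Keep a line only if column 6 exists, parses as a float, and exceeds 40.
// Try also absorbs the ArrayIndexOutOfBoundsException for short rows.
val above40 = airports.filter { line =>
  Try(line.split(",")(6).toFloat).toOption.exists(_ > 40)
}

above40.collect().foreach(println)
```

Rows that fail to parse are silently dropped here; if they need to be inspected, the negated predicate gives the rejected lines.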
Let me know if you have any question.

Store Spark distributed matrix in MongoDB

After calculating the distance matrix related to a set of points stored in a file on HDFS, I need to store the calculated distance matrix which is in a distributed form (CoordinateMatrix/RowMatrix), in MongoDB through MongoDB Connector for Apache Spark. Is there a recommended way to do this or even a better connector for such an operation ?
Here is the part of my code:
val data = sc.textFile("hdfs://localhost:54310/usrp/copy_sample_data.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value, index) => (index, value)}
val pairedSamples = indexedData.cartesian(indexedData)
val dist = pairedSamples.map{case (x,y) => ((x,y),distance(x._2,y._2))}.map{case ((x,y),z) => (((x,y),z,covariance(z)))}
val entries: RDD[MatrixEntry] = dist.map{case (((x,y),z,cov)) => MatrixEntry(x._1, y._1, cov)}
val coomat: CoordinateMatrix = new CoordinateMatrix(entries)
To further note, I have created this matrix in Spark from an RDD. So maybe it is even better/possible to save data from the RDD to MongoDB?
CoordinateMatrix and RowMatrix are basically wrappers around RDD[MatrixEntry] and RDD[Vector] respectively, and both can be saved to MongoDB relatively easily. For a coordinate matrix:
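val spark: SparkSession = ???
import spark.implicits._
// For 1.x
// val sqlContext: SQLContext = ???
// import sqlContext.implicits._
val options = Map(
  "uri" -> ???,
  "database" -> ???
)
val coordMat = new CoordinateMatrix(sc.parallelize(Seq(
  MatrixEntry(1, 3, 1.4), MatrixEntry(3, 6, 2.8))
))
// Write the entries as a DataFrame through the connector's data source
// (format name as used by the MongoDB Spark connector 2.x)
coordMat.entries.toDF().write
  .options(options)
  .option("collection", "coordinates")
  .format("com.mongodb.spark.sql")
  .save()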
you'll get documents of shape:
{'_id': ObjectId('...'), 'i': 3, 'j': 6, 'value': 2.8}
which can easily be cast back to the original form:
val entries = spark.read.options(options)
  .option("collection", "coordinates")
  .format("com.mongodb.spark.sql").load()
  .drop("_id").as[MatrixEntry]
new CoordinateMatrix(entries.rdd)
Pretty much the same thing can be done for RowMatrix but you'll need a little bit more work (represent Vectors either as dense arrays or sparse tuple (size, indices, values)).
Unfortunately, in both cases (CoordinateMatrix, RowMatrix) you'll lose information about the matrix shape.
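For the RowMatrix case, one possible representation is a DataFrame with a row index and a dense array per row, with the shape kept separately so it can be restored. This is only a sketch: the sample matrix and the column names "row" and "values" are made up, and the MongoDB write itself is left out (the resulting DataFrame would be written the same way as the coordinate entries).

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("RowMatrix sketch").getOrCreate()
import spark.implicits._

// Hypothetical 2 x 3 matrix
val rowMat = new RowMatrix(spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 0.0))))

// One record per row: (row index, dense values)
val rowsDF = rowMat.rows.zipWithIndex
  .map { case (vector, index) => (index, vector.toArray) }
  .toDF("row", "values")

// The shape is not recoverable from the records alone, so keep it separately
val shape = (rowMat.numRows(), rowMat.numCols())

// Round trip: rebuild the vectors and pass the stored shape back in
val restored = new RowMatrix(
  rowsDF.as[(Long, Array[Double])].rdd.map { case (_, values) => Vectors.dense(values) },
  shape._1, shape._2.toInt)
```

Dense arrays are the simpler choice; for very sparse rows, storing (size, indices, values) as mentioned above would be more compact.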

Spark MLlib - Create LabeledPoint from RDD[Vector] features and RDD[Vector] label

I am building a training set using two text files representing documents and labels.
hello world
hello mars
I have read in these files and converted my document data to a tf-idf weighted term-document matrix which is represented as a RDD[Vector]. I have also read-in and created a RDD[Vector] for my labels:
val docs: RDD[Seq[String]] = sc.textFile("Documents.txt").map(_.split(" ").toSeq)
val labs: RDD[Vector] = sc.textFile("Labels.txt")
.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(docs)
val idf = new IDF(minDocFreq = 3).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
I would like to use tfidf and labs to create an RDD[LabeledPoint], but I am not sure how to apply a mapping with two different RDDs. Is this even possible/efficient, or do I need to rethink my approach?
One way to handle this is to join based on indices:
import org.apache.spark.RangePartitioner
// Add indices
val idfIndexed = tfidf.zipWithIndex.map(_.swap)
val labelsIndexed = labs.zipWithIndex.map(_.swap)
// Create range partitioner on larger RDD
val partitioner = new RangePartitioner(idfIndexed.partitions.size, idfIndexed)
// Join with custom partitioner
labelsIndexed.join(idfIndexed, partitioner).values
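The index-based join can be sketched end to end as follows, including the final step of turning each (label, features) pair into a LabeledPoint. The sample vectors are made up, and the sketch assumes each label vector holds a single value, since LabeledPoint takes a Double label. The custom partitioner is omitted to keep the example short.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("LabeledPoint join sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-ins for the tf-idf vectors and the label vectors
val features: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(0.1, 0.2), Vectors.dense(0.3, 0.4)))
val labels: RDD[Vector] = sc.parallelize(Seq(Vectors.dense(1.0), Vectors.dense(0.0)))

// Key both RDDs by position so they can be joined
val featuresIndexed = features.zipWithIndex.map(_.swap)
val labelsIndexed = labels.zipWithIndex.map(_.swap)

// Assumption: one label value per document, taken from position 0
val labeled: RDD[LabeledPoint] = labelsIndexed.join(featuresIndexed).values
  .map { case (label, feats) => LabeledPoint(label(0), feats) }
```

This relies on both files listing documents and labels in the same order, which is the same assumption the index-based join makes.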

Convert Rdd[Vector] to Rdd[Double]

How do I convert csv to Rdd[Double]? I have the error: cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
line => val values = line.split(',').map(_.toDouble)
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
Since rows is a RDD[org.apache.spark.mllib.linalg.Vector] following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support multivariate data at this moment.
Regarding the question from the title, you can flatMap:
rows.flatMap(_.toArray)
or even better when you create rows
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
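Since KernelDensity works on a univariate sample, one option is to estimate the density for a single column at a time. The following is a sketch under assumptions: the CSV lines are made up, column 0 is picked arbitrarily, and the bandwidth of 1.0 is only a placeholder value.

```scala
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("KDE sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical CSV lines standing in for the data file
val lines = sc.parallelize(Seq("1.0,10.0", "2.0,20.0", "3.0,30.0"))

// Keep a single column (index 0 here) so the sample is an RDD[Double]
val sample: RDD[Double] = lines.map(_.split(',')(0).toDouble)

// Gaussian kernels with an assumed standard deviation (bandwidth) of 1.0
val kd = new KernelDensity().setSample(sample).setBandwidth(1.0)

// Density estimates at the given points
val densities: Array[Double] = kd.estimate(Array(-1.0, 2.0, 5.0))
```

Flattening all columns into one RDD[Double] would also type-check, but it mixes values from different variables into one sample, which is rarely what is wanted.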
I have prepared this code; please check whether it helps:
val doubleRDD = => x)

Finding the average of a data set using Apache Spark

I am learning how to use Apache Spark and I am trying to get the average temperature from each hour from a data set. The data set that I am trying to use is from weather information stored in a csv. I am having trouble finding how to first read in the csv file and then calculating the average temperature for each hour.
From the spark documentation I am using the example Scala line to read in a file.
val textFile = sc.textFile("")
I have given the link for the data file below. I am using the file called JCMB_2014.csv as it is the latest one with all months covered.
Weather Data
The code I have tried so far is:
class SimpleCSVHeader(header:Array[String]) extends Serializable {
  val index = header.zipWithIndex.toMap
  def apply(array:Array[String], key:String):String = array(index(key))
}
val csv = sc.textFile("JCMB_2014.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header
val rows = data.filter(line => header(line,"date-time") != "date-time")
val users = rows.map(row => header(row,"date-time"))
val usersByHits = rows.map(row => header(row,"date-time") -> header(row,"surface temperature (C)").toInt)
Here is sample code for calculating averages on hourly basis
Step 1: Read the file, filter out the header, and extract the time and temperature columns
scala> val hourlyTemps = csv.map(line=>line.split(",")).filter(entries=>(!"time".equals(entries(3)))).map(entries=>(entries(3).toInt/60,(entries(8).toFloat,1)))
scala> hourlyTemps.take(1)
res25: Array[(Int, (Float, Int))] = Array((9,(10.23,1)))
(time/60) discards minutes and keeps only hours
Step 2: Aggregate temperatures and number of occurrences
scala> val aggregateTemps=hourlyTemps.reduceByKey((a,b)=>(a._1+b._1,a._2+b._2))
scala> aggregateTemps.take(1)
res26: Array[(Int, (Float, Int))] = Array((34,(8565.25,620)))
Step 3: Calculate averages using the total and number of occurrences
scala> val avgTemps = aggregateTemps.mapValues{case (total, count) => total / count}
Find the final result below.
scala> avgTemps.collect
res28: Array[(Int, Float)] = Array((34,13.814922), (4,11.743354), (16,14.227251), (22,15.770312), (28,15.5324545), (30,15.167026), (14,13.177828), (32,14.659948), (36,12.865237), (0,11.994799), (24,15.662579), (40,12.040322), (6,11.398838), (8,11.141323), (12,12.004652), (38,12.329914), (18,15.020147), (20,15.358524), (26,15.631921), (10,11.192643), (2,11.848178), (13,12.616284), (19,15.198371), (39,12.107664), (15,13.706351), (21,15.612191), (25,15.627121), (29,15.432097), (11,11.541124), (35,13.317129), (27,15.602408), (33,14.220147), (37,12.644306), (23,15.83412), (1,11.872819), (17,14.595772), (3,11.78971), (7,11.248139), (9,11.049844), (31,14.901464), (5,11.59693))
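The three steps can be put together in one self-contained sketch. The two-column layout (time in minutes, temperature) and the sample rows are assumptions for illustration; the real file has more columns, so the indices would need adjusting.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("Hourly average sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical rows: time in minutes since midnight, temperature
val lines = sc.parallelize(Seq(
  "time,temperature",
  "540,10.0",
  "570,12.0",
  "600,8.0"))

// Step 1: split, drop the header, key each reading by hour
val hourlyTemps = lines
  .map(_.split(",").map(_.trim))
  .filter(fields => fields(0) != "time")
  .map(fields => (fields(0).toInt / 60, (fields(1).toDouble, 1)))

// Step 2: sum temperatures and counts per hour
val aggregateTemps = hourlyTemps.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))

// Step 3: divide the total by the count
val avgTemps = aggregateTemps.mapValues { case (total, count) => total / count }

avgTemps.collect().sortBy(_._1).foreach(println)
```

Carrying the count alongside the sum is what makes the average computable in a single reduceByKey pass; averaging partial averages would give wrong results when hours have different numbers of readings.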
You may want to provide a structure (schema) definition for your CSV file and convert your RDD to a DataFrame, as described in the documentation. DataFrames provide a whole set of useful predefined statistical functions as well as the possibility of writing simple custom functions. You will then be able to compute the average with:
dataFrame.groupBy(<your columns here>).agg(avg(<column to compute average>))
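A minimal sketch of the DataFrame approach, assuming the CSV has already been parsed into (hour, temperature) pairs; the column names and the sample readings are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.master("local[*]").appName("DataFrame average sketch").getOrCreate()
import spark.implicits._

// Hypothetical (hour, temperature) readings standing in for the parsed CSV
val readings = Seq((9, 10.0), (9, 12.0), (10, 8.0)).toDF("hour", "temperature")

// Group by hour and average the temperature column
val hourlyAvg = readings
  .groupBy("hour")
  .agg(avg("temperature").as("avg_temperature"))

hourlyAvg.orderBy("hour").show()
```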