Idiomatic way to aggregate map results into a Map datastructure - scala

I feel this should be simple but can't think of an idiomatic way to write this
case class Dataset(name: String)
val datasets = List(Dataset("first"), Dataset("second"))
def getData(datasetName: String): InputStream = {
// not important
}
val data: Map[String, InputStream] = {
// aggregate this so the dataset name maps to the result of getData
datasets.foreach(dataset => getData(dataset.name))
...
}
what I have currently creates a mutable map and puts the data one-by-one, then makes it immutable which is terrible imo
val data: Map[String, InputStream] = {
val map = new util.HashMap[String, InputStream]()
datasets.foreach(dataset => map.put(dataset.name, getData(dataset.name))
map.asScala.toMap
}

Related

Efficient way to collect HashSet during map operation on some Dataset

I have big dataset to transform one structure to another. During that phase I want also collect some info about computed field (quadkeys for given lat/longs). I dont want attach this info to every result row, since it would give a lot of duplication information and memory overhead. All I need is to know which particular quadkeys are touched by given coordinates. If there are any way to do it within one job to not iterate dataset twice?
def load(paths: Seq[String]): (Dataset[ResultStruct], Dataset[String]) = {
val df = sparkSession.sqlContext.read.format("com.databricks.spark.csv").option("header", "true")
.schema(schema)
.option("delimiter", "\t")
.load(paths:_*)
.as[InitialStruct]
val qkSet = mutable.HashSet.empty[String]
val result = df.map(c => {
val id = c.id
val points = toPoints(c.geom)
points.foreach(p => qkSet.add(Quadkey.get(p.lat, p.lon, 6).getId))
createResultStruct(id, points)
})
return result, //some dataset created from qkSet's from all executors
}
You could use accumulators
class SetAccumulator[T] extends AccumulatorV2[T, Set[T]] {
import scala.collection.JavaConverters._
private val items = new ConcurrentHashMap[T, Boolean]
override def isZero: Boolean = items.isEmpty
override def copy(): AccumulatorV2[T, Set[T]] = {
val other = new SetAccumulator[T]
other.items.putAll(items)
other
}
override def reset(): Unit = items.clear()
override def add(v: T): Unit = items.put(v, true)
override def merge(
other: AccumulatorV2[T, Set[T]]): Unit = other match {
case setAccumulator: SetAccumulator[T] => items.putAll(setAccumulator.items)
}
override def value: Set[T] = items.keys().asScala.toSet
}
val df = Seq("foo", "bar", "foo", "foo").toDF("test")
val acc = new SetAccumulator[String]
spark.sparkContext.register(acc)
df.map {
case Row(str: String) =>
acc.add(str)
str
}.count()
println(acc.value)
Prints
Set(bar, foo)
Note that map itself is lazy so something like count etc. is needed to actually force the calculation. Depending on the real use-case, another option would be to cache the data frame and just using plain SQL functions df.select("test").distinct()

How to add to an Immutable map : Scala

I have a ResultSet object returned from Hive using JDBC.
I am trying to store the values in a resultset in a Scala Immutable Map.
How can i add there values to an Immutable map as i am iterating the resultset using while loop
val m : Map[String, String] = null
while ( resultSet.next() ) {
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
m += (col -> data) // This Gives Reassignment error
}
I propose :
Iterator.continually{
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
col->data
}.takeWhile( _ => resultSet.next()).toMap
Instead of thinking "let's init an empty collection and fill it" which is imho the mutable way to think, this proposition rather think in terms of "let's declare how to build a collection with those elements in it and be done" :-)
You might want to use scala.collection.Iterator[A] so that you can create immutable map out of your java resultSet.
val myMap : Map[String, String] = new Iterator[(String, String)] {
override def hasNext = resultSet.next()
override def next() = {
val col = resultSet.getString("col_name")
val data = resultSet.getString("data_type")
col -> data
}
}.toMap
Otherwise you have to use mutable scala.collection.mutable.Map.

How to create collection of RDDs out of RDD?

I have an RDD[String], wordRDD. I also have a function that creates an RDD[String] from a string/word. I would like to create a new RDD for each string in wordRDD. Here are my attempts:
1) Failed because Spark does not support nested RDDs:
var newRDD = wordRDD.map( word => {
// execute myFunction()
(new MyClass(word)).myFunction()
})
2) Failed (possibly due to scope issue?):
var newRDD = sc.parallelize(new Array[String](0))
val wordArray = wordRDD.collect
for (w <- wordArray){
newRDD = sc.union(newRDD,(new MyClass(w)).myFunction())
}
My ideal result would look like:
// input RDD (wordRDD)
wordRDD: org.apache.spark.rdd.RDD[String] = ('apple','banana','orange'...)
// myFunction behavior
new MyClass('apple').myFunction(): RDD[String] = ('pple','aple'...'appl')
// after executing myFunction() on each word in wordRDD:
newRDD: RDD[String] = ('pple','aple',...,'anana','bnana','baana',...)
I found a relevant question here: Spark when union a lot of RDD throws stack overflow error, but it didn't address my issue.
Use flatMap to get RDD[String] as you desire.
var allWords = wordRDD.flatMap { word =>
(new MyClass(word)).myFunction().collect()
}
You cannot create a RDD from within another RDD.
However, it is possible to rewrite your function myFunction: String => RDD[String], which generates all words from the input where one letter is removed, into another function modifiedFunction: String => Seq[String] such that it can be used from within an RDD. That way, it will also be executed in parallel on your cluster. Having the modifiedFunction you can obtain the final RDD with all words by simply calling wordRDD.flatMap(modifiedFunction).
The crucial point is to use flatMap (to map and flatten the transformations):
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val input = sc.parallelize(Seq("apple", "ananas", "banana"))
// RDD("pple", "aple", ..., "nanas", ..., "anana", "bnana", ...)
val result = input.flatMap(modifiedFunction)
}
def modifiedFunction(word: String): Seq[String] = {
word.indices map {
index => word.substring(0, index) + word.substring(index+1)
}
}

Creating Dataframe from XML parsed by scalaxb

I can successfully parse XML data dropped into a directory by using the Spark streaming fileStream method, and I can write the resulting RDDs out to a text file just fine:
val fStream = {
ssc.fileStream[LongWritable, Text, XmlInputFormat](
WATCHDIR, xmlFilter _, newFilesOnly = false, conf = hadoopConf)
}
fStream.foreachRDD(rdd =>
if (rdd.count() == 0) {
logger.info("No files..")
})
val dStream = fStream.map{ case(x, y) =>
logger.info("Hello from the dStream")
logger.info(y.toString)
scalaxb.fromXML[Music](scala.xml.XML.loadString(y.toString))
}
dStream.foreachRDD(rdd => rdd.saveAsTextFile("file:///tmp/xmlout"))
The trouble is when I want to convert the RDDs to DataFrames in order to either register them as a temp table or saveAsParquetFile.
This code:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
dStream.foreachRDD(rdd => rdd.distinct().toDF().printSchema())
Results in this error:
java.lang.UnsupportedOperationException: Schema for type scalaxb.DataRecord[scala.Any] is not supported
I would have thought that since scalaxb generates case classes for my records, and that it would be simple for Spark to infer using reflection, and I see this is what it's trying to do, except Spark doesn't support the scalaxb.DataRecord type. Are there any Spark or Scalaxb experts who have any ideas on how to make the case classes generated by Scalaxb compatible with Spark?
BTW, here are the generated classes from scalaxb:
package generated
case class Song(attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
lazy val title = attributes.get("#title") map { _.as[String] }
lazy val length = attributes.get("#length") map { _.as[String] }
}
case class Album(song: Seq[generated.Song] = Nil,
description: String,
attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
lazy val title = attributes.get("#title") map { _.as[String] }
}
case class Artist(album: Seq[generated.Album] = Nil,
attributes: Map[String, scalaxb.DataRecord[Any]] = Map()) {
lazy val name = attributes.get("#name") map { _.as[String] }
}
case class Music(artist: Seq[generated.Artist] = Nil)

Reduce two Scala methods, that only differ in one Object Type

I have the following two methods, using objects from Apache Spark.
def SVMModelScoring(sc: SparkContext, scoringDataset: String, modelFileName: String): RDD[(Double, Double)] = {
val model = SVMModel.load(sc, modelFileName)
val scoreAndLabels =
MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
val score = model.predict(point.features)
(score, point.label)
}
return scoreAndLabels
}
def DecisionTreeScoring(sc: SparkContext, scoringDataset: String, modelFileName: String): RDD[(Double, Double)] = {
val model = DecisionTreeModel.load(sc, modelFileName)
val scoreAndLabels =
MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
val score = model.predict(point.features)
(score, point.label)
}
return scoreAndLabels
}
My previous attempts to merge these functions have resulted in errors surround model.predict.
Is there a way I can use model as a parameter that is weakly typed in Scala?
Disclaimer - I've never used Apache Spark.
It looks to me like the only difference between the two methods is the way the model is instantiated. It's unfortunate that the two model instances don't actually share a common trait that provides predict(...) but we can still make this work by pulling out the part that changes - the scorer:
def scoreWith(sc: SparkContext, scoringDataset: String)(scorer: (Vector)=>Double): RDD[(Double, Double)] = {
MLUtils.loadLibSVMFile(sc, scoringDataset).randomSplit(Array(0.1), seed = 11L)(0).map { point =>
val score = scorer(point.features)
(score, point.label)
}
}
Now we can get the previous functionality with:
def svmScorer(sc: SparkContext, scoringDataset:String, modelFileName:String) =
scoreWith(sc: SparkContext, scoringDataset:String)(SVMModel.load(sc, modelFileName).predict)
def dtScorer(sc: SparkContext, scoringDataset:String, modelFileName:String) =
scoreWith(sc: SparkContext, scoringDataset:String)(DecisionTreeModel.load(sc, modelFileName).predict)