var myMap:Map[String, Int] = Map()
myRDD.foreach { data =>
println( "1. " + data.name + " : " + data.time)
myMap += ( data.name -> data.time)
println( "2. " + myMap)
}
println( "Total Map : " + myMap)
Result
A : 1
Map(A -> 1)
B: 2
Map(B -> 2) // deleted key A
C: 3
Map(C -> 3) // deleted Key A and B
Total Map : Map() // nothing
Somehow I cannot store Map data in foreach. It keeps deleting or re-initializing the previous data when a new key and value are added.
Any idea why?
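Spark closures are serialized and executed in a separate context (remotely, when running on a cluster). Each task mutates its own deserialized copy of myMap, so the myMap variable on the driver is never updated.
To get the data from an RDD of key-value pairs as a map, there's a built-in operation:
val myMap = myRDD.map(data => data.name -> data.time).collectAsMap()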
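The copy semantics described above can be illustrated without Spark. The following is only a rough sketch: it round-trips an object through plain Java serialization, which approximates what happens when a closure is shipped to an executor. The names ClosureCopyDemo, Holder and shipToExecutor are hypothetical and are not part of any Spark API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureCopyDemo {
  // Hypothetical stand-in for the state captured by a task closure.
  class Holder(var data: Map[String, Int]) extends Serializable

  // Serialize and deserialize the holder, yielding an independent copy
  // (an approximation of shipping a closure to an executor).
  def shipToExecutor(h: Holder): Holder = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(h)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[Holder] finally in.close()
  }

  // Returns (driver-side map, executor-side map) after the copy is mutated.
  def run(): (Map[String, Int], Map[String, Int]) = {
    val driverSide = new Holder(Map.empty)
    val executorSide = shipToExecutor(driverSide)
    executorSide.data += ("A" -> 1) // only the copy changes
    (driverSide.data, executorSide.data)
  }
}
```

After run(), the first map is still empty while the second contains A -> 1, which mirrors the "Total Map : Map()" output in the question.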
I want to test multiple methods, one that outputs a map and one that outputs a list. I have two separate test cases for each method, but I want a way to combine them and test both methods at the same time.
test("test 1 map") {
val testCases: Map[String, Map[String, Int]] = Map(
"Andorra" -> Map("la massana" -> 7211)
)
for ((input, expectedOutput) <- testCases) {
var computedOutput: mutable.Map[String, Int] = PaleBlueDot.cityPopulations(countriesFile, citiesFilename, input, "04")
assert(computedOutput == expectedOutput, input + " -> " + computedOutput)
}
}
test(testName="test 1 list") {
val testCases: Map[String, List[String]] = Map{
"Andorra" -> List("les escaldes")
}
for ((input, expectedOutput) <- testCases) {
var computedOutput: List[String] = PaleBlueDot.aboveAverageCities(countriesFile, citiesFilename, input)
assert(computedOutput.sorted == expectedOutput.sorted, input + " -> " + computedOutput)
}
}
Firstly, it is better to use a List rather than a Map for testCases as a Map can return values in any order. Using List ensures that tests are done in the order they are written in the list.
You can then make testCases into a List containing a tuple with test data for both tests, like this:
test("test map and list") {
val testCases = List(
"Andorra" -> (Map("la massana" -> 7211), List("les escaldes"))
)
for ((input, (mapOut, listOut)) <- testCases) {
val computedMap: mutable.Map[String, Int] =
PaleBlueDot.cityPopulations(countriesFile, citiesFilename, input, "04")
val computedList: List[String] =
PaleBlueDot.aboveAverageCities(countriesFile, citiesFilename, input)
assert(computedMap == mapOut, input + " -> " + computedMap)
assert(computedList.sorted == listOut.sorted, input + " -> " + computedList)
}
}
I have a situation where I have to store each partition's data to a file and load the stored data in the same partition later. Here is my code:
Base class
case class foo ( posVals : Array[Double] , velVals : Array[Double] , f: Array[Double] => Double ,
fitnessVal: Double , LR1 : Double , PR1 : Double) extends Serializable {
var position : Array[Double] = posVals
var velocity : Array[Double] = velVals
var fitness : Double = fitnessVal
var PulseRate: Double = PR1
var LoudnessRate: Double = LR1
}
Objective function
def sphere (ar : Array[Double]) : Double = ar.reduce((x,y) => x+y*y)
Store and read data inside each partition
def execute(RDD: RDD[foo], c_itr: Int ): Array[(foo, Int)] = {
val newRDD = RDD.mapPartitionsWithIndex {
(index, Iterator) => {
var arr: Array[foo] = Iterator.toArray
if (c_itr != 0) {
//Read Data from stored file where file name is equal to partition number (index)
val bufferedSource = Source.fromFile("/result/"+index+".txt")
val lines = bufferedSource.getLines()
val data : Array[BAT1] = lines.flatMap{line =>
val p = line.split(",")
Seq( BAT1(p(0).toArray.map(_.toDouble) , p(1).toArray.map(_.toDouble) ,sphere ,line(2).toDouble, p(3).toDouble, p(4).toDouble) )
}.toArray
}
arr = data.clone() // Replace arr with loaded data from file
//Save to file
val writer = new FileWriter(Path + index + ".txt")
for ( i <- 0 until arr.length ) {
writer.write(arr(i).position.toList + "," + arr(i).velocity.toList + "," + arr(i).fitness + "," +
arr(i).LoudnessRate + "," + arr(i).PulseRate + "\n")
}
writer.close()
val bests : Array[(foo , Int)] = res1.map(x => (x, index))
bests.toIterator
}
}
newRDD.persist().collect()
}
A sample of the data stored in the file:
List(86.6582767815429, -25.224569272200586, 90.52371028878218, -59.91851894060545, -37.12944037124118),List(-59.60155033146984, -8.927455672466586, -23.679516503590534, 87.58857469881022 ,-14.864361504195127),6.840659702736215E10,0.6012,0.04131580765457621
List(86.6582767815429, -25.224569272200586, 90.52371028878218, -59.91851894060545, -26.10553311409422),List(-66.83980088207335, 51.088426986986015, -109.74073303298485, 66.87095748811572, -22.941448024344268),9.195157603574039E10,0.9025,0.06132589765454988
This code does not read the data back exactly as it was stored. I have tried a lot but am unable to find the issue. How can I read the stored data correctly into the data object?
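One observation about the reading code: line.split(",") also splits inside the List(...) groups, p(0).toArray turns a String into an Array[Char] (so _.toDouble yields character codes), and line(2) is the third character of the line rather than the third field. A minimal sketch of parsing one line in the stored format, assuming every line looks like the samples above, could be the following. StoredLineParser and Record are hypothetical names introduced only for this illustration.

```scala
object StoredLineParser {
  // Hypothetical container; the field order follows the writer loop in the question.
  final case class Record(
    position: Array[Double],
    velocity: Array[Double],
    fitness: Double,
    loudnessRate: Double,
    pulseRate: Double
  )

  // Two List(...) groups followed by three plain numeric fields.
  private val LinePattern =
    """List\(([^)]*)\),List\(([^)]*)\),([^,]+),([^,]+),([^,]+)""".r

  // Split the inside of one List(...) group into doubles, tolerating stray spaces.
  private def toDoubles(csv: String): Array[Double] =
    csv.split(",").map(_.trim).filter(_.nonEmpty).map(_.toDouble)

  // Returns None when the line does not match the expected layout.
  def parse(line: String): Option[Record] = line.trim match {
    case LinePattern(pos, vel, fit, lr, pr) =>
      Some(Record(toDoubles(pos), toDoubles(vel),
        fit.trim.toDouble, lr.trim.toDouble, pr.trim.toDouble))
    case _ => None
  }
}
```

Inside mapPartitionsWithIndex, the lines read from the file could then be passed through something like lines.flatMap(StoredLineParser.parse) before rebuilding the objects.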
We are trying to generate column-wise statistics of our dataset in Spark, in addition to using the summary function from the statistics library. We use the following procedure:
We determine the columns with string values.
We generate key-value pairs for the whole dataset, using the column number as the key and the column's value as the value.
We generate a new map of the format
(K,V) -> ((K,V),1)
Then we use reduceByKey to count the occurrences of each unique value in every column. We cache this output to reduce further computation time.
In the next step we cycle through the columns using a for loop to find the statistics for all the columns.
We are trying to replace the for loop with another map-reduce step, but we have not found a way to do it. Doing so would allow us to generate the statistics for all columns in one execution. The for loop runs sequentially, which makes it very slow.
Code:
//drops the header
def dropHeader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
// Iterator.drop returns a new iterator, so its result must be the one returned
if (idx == 0) lines.drop(1) else lines
})
}
def retAtrTuple(x: String) = {
val newX = x.split(",")
for (h <- 0 until newX.length)
yield (h,newX(h))
}
val line = sc.textFile("hdfs://.../myfile.csv")
val withoutHeader: RDD[String] = dropHeader(line)
val kvPairs = withoutHeader.flatMap(retAtrTuple) //generates a key-value pair where key is the column number and value is column's value
var bool_numeric_col = kvPairs.map{case (x,y) => (x,isNumeric(y))}.reduceByKey(_&&_).sortByKey() //this contains column indexes as key and boolean as value (true for numeric and false for string type)
var str_cols = bool_numeric_col.filter{case (x,y) => y == false}.map{case (x,y) => x}
var num_cols = bool_numeric_col.filter{case (x,y) => y == true}.map{case (x,y) => x}
var str_col = str_cols.collect() //array of the string column indexes
var num_col = num_cols.collect() //array of the numeric column indexes
val colCount = kvPairs.map((_,1)).reduceByKey(_+_)
val e1 = colCount.map{case ((x,y),z) => (x,(y,z))}
var numPairs = e1.filter{case (x,(y,z)) => str_col.contains(x) }
//running for loops which needs to be parallelized/optimized as it sequentially operates on each column. Idea is to find the top10, bottom10 and number of distinct elements column wise
for(i <- str_col){
var total = numPairs.filter{case (x,(y,z)) => x==i}.sortBy(_._2._2)
var leastOnes = total.take(10)
println("leastOnes for Col" + i)
leastOnes.foreach(println)
var maxOnes = total.sortBy(-_._2._2).take(10)
println("maxOnes for Col" + i)
maxOnes.foreach(println)
println("distinct for Col" + i + " is " + total.count)
}
Let me simplify your question a bit. (A lot actually.) We have an RDD[(Int, String)] and we want to find the top 10 most common Strings for each Int (which are all in the 0–100 range).
Instead of sorting, as in your example, it is more efficient to use the Spark built-in RDD.top(n) method. Its run-time is linear in the size of the data, and requires moving much less data around than a sort.
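The idea behind a bounded top-n, as opposed to a full sort, can be sketched on local collections. This is only an illustration under simple assumptions, not Spark's actual implementation; TopNSketch and topN are hypothetical names.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep the n largest items in a single pass using a heap of at most n elements.
  def topN[A](items: Iterator[A], n: Int)(implicit ord: Ordering[A]): List[A] = {
    // With the reversed ordering, the head of the queue is the smallest item kept so far.
    val heap = mutable.PriorityQueue.empty[A](ord.reverse)
    for (item <- items) {
      if (heap.size < n) {
        heap.enqueue(item)
      } else if (n > 0 && ord.gt(item, heap.head)) {
        heap.dequeue() // evict the current smallest
        heap.enqueue(item)
      }
    }
    // Largest first.
    heap.toList.sorted(ord.reverse)
  }
}
```

For (count, value) pairs, TopNSketch.topN(Iterator((2L, "a"), (5L, "b"), (1L, "c")), 1) keeps only the pair with the highest count, and the heap never holds more than n elements.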
Consider the implementation of top in RDD.scala. You want to do the same, but with one priority queue (heap) per Int key. The code becomes fairly complex:
import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private.
def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
// A heap that only keeps the top N values, so it has bounded size.
type Heap = BoundedPriorityQueue[(Long, String)]
// Get the word counts.
val counts: RDD[[(Int, String), Long)] =
rdd.map(_ -> 1L).reduceByKey(_ + _)
// In each partition create a column -> heap map.
val perPartition: RDD[Map[Int, Heap]] =
counts.mapPartitions { items =>
val heaps =
collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
for (((k, v), count) <- items) {
heaps(k) += count -> v
}
Iterator.single(heaps)
}
// Merge the per-partition heap maps into one.
val merged: Map[Int, Heap] =
perPartition.reduce { (heaps1, heaps2) =>
val heaps =
collection.mutable.Map[Int, Heap].withDefault(i => new Heap(n))
for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
for (cv <- heap) {
heaps(k) += cv
}
}
heaps
}
// Discard counts, return just the top strings.
merged.mapValues(_.map { case(count, value) => value })
}
This is efficient, but made painful because we need to work with multiple columns at the same time. It would be way easier to have one RDD per column and just call rdd.top(10) on each.
Unfortunately the naive way to split up the RDD into N smaller RDDs does N passes:
def split(together: RDD[(Int, String)], columns: Int): Seq[RDD[String]] = {
together.cache // We will make N passes over this RDD.
(0 until columns).map {
i => together.filter { case (key, value) => key == i }.values
}
}
A more efficient solution could be to write out the data into separate files by key, then load it back into separate RDDs. This is discussed in "Write to multiple outputs by key Spark - one Spark job".
Thanks to @Daniel Darabos for the answer, but it contains some mistakes:
Mixed use of Map and collection.mutable.Map.
withDefault((i: Int) => new Heap(n)) does not store a new Heap in the map when you call heaps(k) += count -> v; the default value is returned but never inserted.
Mismatched brackets and parentheses.
Here is the modified code:
//import org.apache.spark.util.BoundedPriorityQueue // Pretend it's not private. copy to your own folder and import it
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object BoundedPriorityQueueTest {
// https://stackoverflow.com/questions/28166190/spark-column-wise-word-count
def top(n: Int, rdd: RDD[(Int, String)]): Map[Int, Iterable[String]] = {
// A heap that only keeps the top N values, so it has bounded size.
type Heap = BoundedPriorityQueue[(Long, String)]
// Get the word counts.
val counts: RDD[((Int, String), Long)] =
rdd.map(_ -> 1L).reduceByKey(_ + _)
// In each partition create a column -> heap map.
val perPartition: RDD[collection.mutable.Map[Int, Heap]] =
counts.mapPartitions { items =>
val heaps =
collection.mutable.Map[Int, Heap]() // .withDefault((i: Int) => new Heap(n))
for (((k, v), count) <- items) {
println("\n---")
println("before add " + ((k, v), count) + ", the map is: ")
println(heaps)
if (!heaps.contains(k)) {
println("not contains key " + k)
heaps(k) = new Heap(n)
println(heaps)
}
heaps(k) += count -> v
println("after add " + ((k, v), count) + ", the map is: ")
println(heaps)
}
println(heaps)
Iterator.single(heaps)
}
// Merge the per-partition heap maps into one.
val merged: collection.mutable.Map[Int, Heap] =
perPartition.reduce { (heaps1, heaps2) =>
val heaps =
collection.mutable.Map[Int, Heap]() //.withDefault((i: Int) => new Heap(n))
println(heaps)
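for ((k, heap) <- heaps1.toSeq ++ heaps2.toSeq) {
// Create the heap for k on first use; a plain heaps(k) would throw for a missing key.
val target = heaps.getOrElseUpdate(k, new Heap(n))
for (cv <- heap) {
target += cv
}
}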
heaps
}
// Discard counts, return just the top strings.
merged.mapValues(_.map { case (count, value) => value }).toMap
}
def main(args: Array[String]): Unit = {
Logger.getRootLogger().setLevel(Level.FATAL) //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
val conf = new SparkConf().setAppName("word count").setMaster("local[1]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN") //http://stackoverflow.com/questions/27781187/how-to-stop-messages-displaying-on-spark-console
val words = sc.parallelize(List((1, "s11"), (1, "s11"), (1, "s12"), (1, "s13"), (2, "s21"), (2, "s22"), (2, "s22"), (2, "s23")))
println("# words:" + words.count())
val result = top(1, words)
println("\n--result:")
println(result)
sc.stop()
print("DONE")
}
}
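The withDefault pitfall listed above can be reproduced in isolation. The sketch below uses ListBuffer as a stand-in for the heap type, so it runs without Spark; WithDefaultDemo and its field names are hypothetical.

```scala
import scala.collection.mutable

object WithDefaultDemo {
  // withDefault only supplies a value for missing keys on lookup; it never stores it.
  val lost: mutable.Map[Int, mutable.ListBuffer[String]] =
    mutable.Map.empty[Int, mutable.ListBuffer[String]]
      .withDefault(_ => mutable.ListBuffer.empty[String])
  lost(1) += "s11" // appends to a temporary buffer that is then discarded

  // getOrElseUpdate inserts the new value under the key before returning it.
  val kept: mutable.Map[Int, mutable.ListBuffer[String]] = mutable.Map.empty
  kept.getOrElseUpdate(1, mutable.ListBuffer.empty[String]) += "s11"
}
```

Here lost stays empty, while kept ends up holding the appended element under key 1.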
For example I want to encrypt each token of a sentence and reduce them to a final encrypted text:
def convert(str: String) = {
str + ":"
}
val tokens = "Hi this is a text".split("\\ ").toList
val reduce = tokens.reduce((a, b) => convert(a) + convert(b))
println(reduce)
// result is `Hi:this::is::a::text:`
val fold = tokens.fold("") {
case (a, b) => convert(a) + convert(b)
}
println(fold)
// result is `:Hi::this::is::a::text:`
val scan = tokens.scan("") {
case (a, b) => convert(a) + convert(b)
}
println(scan)
// result is List(, :Hi:, :Hi::this:, :Hi::this::is:, :Hi::this::is::a:, :Hi::this::is::a::text:)
Assume that convert is an encryption function, so each token should be encrypted only once, not twice. But fold, reduce and scan re-encrypt the already encrypted tokens. The desired result is Hi:this:is:a:text:
Well if you want to encrypt each Token individually, map should work.
val tokens = "Hi this is a text".split("\\ ").toList
val encrypted = tokens.map(convert).mkString
println(encrypted) //prints Hi:this:is:a:text:
def convert(str: String) = {
str + ":"
}
Edit: If you want to use a fold:
val encrypted = tokens.foldLeft("")((result, token) => result + convert(token))
A one-liner specialised for this very example:
"Hi this is a text" split " " mkString("",":",":")
Or
val tokens = "Hi this is a text" split " "
val sep = ":"
val encrypted = tokens mkString("",sep,sep)
Note that fold and reduce operate on two operands at every step. However, you want to encrypt each token individually, which is a unary operation. Therefore you should first do a map and then either a fold or a reduce:
tokens map(convert)
Reduce / Fold:
scala> tokens.map(convert).fold("")(_ + _)
res10: String = Hi:this:is:a:text:
scala> tokens.map(convert).reduce(_ + _)
res11: String = Hi:this:is:a:text:
In fact, you can simply use mkString, which makes it even more concise:
scala> tokens.map(convert).mkString
res12: String = Hi:this:is:a:text:
You can also do the conversion in parallel (using par; on Scala 2.13 and later this requires the separate scala-parallel-collections module):
scala> tokens.par.map(convert).mkString
res13: String = Hi:this:is:a:text:
scala> tokens.par.map(convert).reduce(_ + _)
res14: String = Hi:this:is:a:text:
I think your main problem is understanding how reduce and fold work. The other answers explain that.
As for your question, fold can help:
"Hi this is a text".split("\\ ").fold("") { (a, b) => a + convert(b) }
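To see where the doubled separators come from, it can help to look at the intermediate accumulators. The sketch below uses scanLeft, which exposes each step of the corresponding foldLeft; FoldTrace and its field names are hypothetical.

```scala
object FoldTrace {
  def convert(str: String): String = str + ":"

  val tokens: List[String] = List("Hi", "this", "is", "a", "text")

  // Converting the accumulator as well re-applies convert to everything built so far.
  val reencrypting: List[String] =
    tokens.scanLeft("")((acc, token) => convert(acc) + convert(token))

  // Converting only the incoming token applies convert exactly once per token.
  val once: List[String] =
    tokens.scanLeft("")((acc, token) => acc + convert(token))
}
```

The last element of reencrypting is the :Hi::this::is::a::text: string from the question, while the last element of once is the desired Hi:this:is:a:text:.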
Here is a version with the code cleaned up and unnecessary conversions removed:
def convert(str: String) = str + ":"
val tokens = "Hi this is a text" split " "
val encrypted = (tokens map convert).mkString
mkString can be seen as a specialized version of reduce (or fold) for Strings.
If, for some reason, you don't want to use mkString, the code would look like this:
def convert(str: String) = str + ":"
val tokens = "Hi this is a text" split " "
val encrypted = (tokens map convert) reduce (_ + _)
Or shortened with foldLeft:
val encrypted = "Hi this is a text".split(" ").foldLeft ("") { case (accum, str) => accum + convert(str) }
val db = mongoClient("test")
val coll = db("test")
val q = MongoDBObject("id" -> 100)
val result= coll.findOne(q)
How can I convert result to a map of key --> value pairs?
The result of findOne is an Option[Map[String, AnyRef]] because MongoDBObject is a Map.
A Map is already a collection of pairs.
To print them, simply:
for {
r <- result
(key, value) <- r
} println(key + " " + value.toString)
or
result.foreach(_.foreach { case (k, v) => println(k + " " + v) })
To serialize the Mongo result, try com.mongodb.util.JSON.serialize, for example:
com.mongodb.util.JSON.serialize(result.get)