How to create RDD[Map(Int,Int)] using Spark and Scala? - scala
I have the following simple code in Java. This code creates and fills the Map with 0 values.
Map<Integer,Integer> myMap = new HashMap<Integer,Integer>();
for (int i=0; i<=20; i++) { myMap.put(i, 0); }
I want to create a similar RDD using Spark and Scala. I tried this approach, but it returns RDD[(Any) => (Any,Int)] instead of RDD[Map(Int,Int)]. What am I doing wrong?
val data = (0 to 20).map(_ => (_,0))
val myMapRDD = sparkContext.parallelize(data)
What you are creating are tuples. Instead, you need to create a Map and parallelize it, as below:
val data = (0 to 20).map(x => Map(x -> 0)) //data: scala.collection.immutable.IndexedSeq[scala.collection.immutable.Map[Int,Int]] = Vector(Map(0 -> 0), Map(1 -> 0), Map(2 -> 0), Map(3 -> 0), Map(4 -> 0), Map(5 -> 0), Map(6 -> 0), Map(7 -> 0), Map(8 -> 0), Map(9 -> 0), Map(10 -> 0), Map(11 -> 0), Map(12 -> 0), Map(13 -> 0), Map(14 -> 0), Map(15 -> 0), Map(16 -> 0), Map(17 -> 0), Map(18 -> 0), Map(19 -> 0), Map(20 -> 0))
val myMapRDD = sparkContext.parallelize(data) //myMapRDD: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = ParallelCollectionRDD[0] at parallelize at test.sc:19
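If the goal is instead a single Map holding all 21 entries as one RDD element (rather than 21 single-entry Maps), here is a minimal sketch along the same lines (the singleMapData/singleMapRDD names are just illustrative):
val singleMapData = Seq((0 to 20).map(i => i -> 0).toMap)
val singleMapRDD = sparkContext.parallelize(singleMapData)
//singleMapRDD: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] with one element containing all 21 keys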
In Scala, (0 to 20).map(_ => (_, 0)) would not compile, as it has invalid placeholder syntax. I believe you might be looking for something like below instead:
val data = (0 to 20).map( _->0 )
which would generate a list of key-value pairs, and is really just a placeholder shorthand for:
val data = (0 to 20).map( n => n->0 )
// data: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector(
// (0,0), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0), (9,0), (10,0),
// (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0), (19,0), (20,0)
// )
An RDD is an immutable collection (like a Seq or Array) of data. To create an RDD of Map[Int,Int], you would expand data inside a Map, which in turn gets placed inside a Seq collection:
val rdd = sc.parallelize(Seq(Map(data: _*)))
rdd.collect
// res1: Array[scala.collection.immutable.Map[Int,Int]] = Array(
// Map(0 -> 0, 5 -> 0, 10 -> 0, 14 -> 0, 20 -> 0, 1 -> 0, 6 -> 0, 9 -> 0, 13 -> 0, 2 -> 0, 17 -> 0,
// 12 -> 0, 7 -> 0, 3 -> 0, 18 -> 0, 16 -> 0, 11 -> 0, 8 -> 0, 19 -> 0, 4 -> 0, 15 -> 0)
// )
Note that, as is, this RDD consists of only a single Map, and you can certainly assemble as many Maps as you wish in an RDD.
val rdd2 = sc.parallelize(Seq(
Map((0 to 4).map( _->0 ): _*),
Map((5 to 9).map( _->0 ): _*),
Map((10 to 14).map( _->0 ): _*),
Map((15 to 19).map( _->0 ): _*)
))
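As a quick usage check of that shape, something like the following confirms that each RDD element is its own Map (a sketch; collect preserves the input order of parallelize):
rdd2.map(_.keys.max).collect
// expected: Array(4, 9, 14, 19)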
You can't parallelize a Map, as parallelize takes a Seq. What you can achieve is creating an RDD[(Int, Int)], which however does not enforce the uniqueness of keys. To perform operations by key, you can leverage PairRDDFunctions, which despite this limitation can end up being useful for your use case.
Let's try at least to get an RDD[(Int, Int)].
You used slightly "wrong" syntax when applying the map to your range.
The _ placeholder can have different meanings depending on the context. The two meanings that got mixed up in your snippet of code are:
a placeholder for an anonymous function parameter that is not going to be used (as in (_ => 42), a function which ignores its input and always returns 42)
a positional placeholder for arguments in anonymous functions (as in (_, 42), a function that takes one argument and returns a tuple where the first element is the input and the second is the number 42)
The above examples are simplified and do not account for type inference as they only wish to point out two of the meanings of the _ placeholder that got mixed up in your snippet of code.
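To make the distinction concrete, here is a small sketch with explicit function types (so type inference doesn't blur the picture); both lines compile:
val ignoreInput: Int => Int = _ => 42          // parameter placeholder: the input is ignored
val pairWithZero: Int => (Int, Int) = (_, 0)   // positional placeholder: shorthand for x => (x, 0)
// ignoreInput(7)  == 42
// pairWithZero(7) == (7, 0)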
The first step is to use one of the two following functions to create the pairs that are going to be part of the map, either
a => (a, 0)
or
(_, 0)
and after parallelizing it you can get the RDD[(Int, Int)], as follows:
val pairRdd = sc.parallelize((0 to 20).map((_, 0)))
I believe it's worth noting here that mapping over the local collection is executed eagerly and bound to your driver's resources, while you can obtain the same final result by parallelizing the collection first and then mapping the pair-creating function on the RDD.
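A sketch of that alternative, so the pair creation itself runs on the executors rather than on the driver (the pairRddDistributed name is just illustrative):
val pairRddDistributed = sc.parallelize(0 to 20).map((_, 0))
// pairRddDistributed: org.apache.spark.rdd.RDD[(Int, Int)], same contents as pairRdd above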
Now, as mentioned, you don't have a distributed map, but rather a collection of key-value pairs where key uniqueness is not enforced. But you can work pretty seamlessly with those values using PairRDDFunctions, which you obtain automatically by importing org.apache.spark.rdd.RDD.rddToPairRDDFunctions (or without having to do anything in the spark-shell, as the import has already been done for you), and which will decorate your RDD leveraging Scala's implicit conversions.
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
pairRdd.mapValues(_ + 1).foreach(println)
will print the following
(10,1)
(11,1)
(12,1)
(13,1)
(14,1)
(15,1)
(16,1)
(17,1)
(18,1)
(19,1)
(20,1)
(0,1)
(1,1)
(2,1)
(3,1)
(4,1)
(5,1)
(6,1)
(7,1)
(8,1)
(9,1)
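Along the same lines, since nothing stops duplicate keys from appearing, reduceByKey can be used to combine them; here is a small sketch reusing pairRdd (the extra pairs and the withDuplicates name are just illustrative):
val withDuplicates = pairRdd ++ sc.parallelize(Seq((0, 5), (1, 7)))
withDuplicates.reduceByKey(_ + _).filter(_._1 <= 1).collect
// contains (0,5) and (1,7): the original zeros were summed with the extra values (ordering may vary)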
You can learn more about working with key-value pairs with the RDD API on the official documentation.
Related
Create nested Java TreeMap in Scala
I want to create a nested tree map in Scala from 2 lists, e.g.
private val a = List(0, 100000, 500000, 1000000)
private val b = List(0, 5, 25, 50)
I want the nested tree map to contain keys as the values from list a. The value for these keys would be another tree map that contains keys as the values from list b, with a default value. To clarify, here's the format in which I want the tree map (default value being 0):
{
  0:       { 0:0, 5:0, 25:0, 50:0 },
  100000:  { 0:0, 5:0, 25:0, 50:0 },
  ..
}
Is there an efficient way to do this in Scala?
Edit: I want to use the same logic but use a Java TreeMap in Scala instead. Could someone please guide me?
Here's an approach to get it into the requested shape. There may be more efficient ways to do this for a large data set.
a.flatMap(aa => TreeMap(aa -> b.flatMap(bb => TreeMap(bb -> 0)).toMap))
// val res50: List[(Int, Map[Int, Int])] = List(
//   (0,       Map(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0)),
//   (100000,  Map(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0)),
//   (500000,  Map(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0)),
//   (1000000, Map(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0))
// )
All you need to do is:
a.iterator.map { aa =>
  aa -> b.iterator.map(bb => bb -> 0).to(TreeMap.canBuildFrom)
}.to(TreeMap.canBuildFrom)
And using foldLeft:
a.foldLeft(TreeMap.empty[Int, TreeMap[Int, Int]]) { (acc, elem) =>
  acc + (elem -> b.map(bKey => bKey -> 0)
    .to(TreeMap.canBuildFrom)) //.to(TreeMap) if you use 2.13
}
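For what it's worth, on Scala 2.13 (where canBuildFrom is gone) the same nested structure can be built like this; a sketch assuming the a and b lists from the question:
import scala.collection.immutable.TreeMap

val nested: TreeMap[Int, TreeMap[Int, Int]] =
  a.iterator.map(aa => aa -> b.iterator.map(bb => bb -> 0).to(TreeMap)).to(TreeMap)
// TreeMap(0 -> TreeMap(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0), 100000 -> TreeMap(...), ...)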
How to move contents of one element in a map to another element in Scala
I am trying to transfer/copy an element in a map to another element in the map in Scala. For example:
Map(0 -> 5)
Let's say this is the initial state of the map. What I want to happen is the following:
Map(0 -> 0, 1 -> 5)
So after the transformation, 0 (which initially pointed to 5) will point to 0, and a new element (1) is added that points to 5. I have tried the following:
theMap + (pointer -> (theMap(pointer) + 1))
However, I get the following error:
java.util.NoSuchElementException: key not found: 1
Thanks for any help!
This should do the trick.
def transfer(pointer: Int)(map: Map[Int, Int]): Map[Int, Int] =
  map.get(key = pointer) match {
    case Some(value) =>
      map ++ Map(
        pointer -> 0,
        (pointer + 1) -> value
      )
    case None =>
      // Pointer didn't exist, what should happen here?
      map // For now returning the map unmodified.
  }
And you can use it like this:
transfer(pointer = 0)(map = Map(0 -> 5))
// res: Map[Int,Int] = Map(0 -> 0, 1 -> 5)
Scala - reducing a map based on changed key
Let's say we have the following Map data:
val testMap: Map[String, Int] = Map("AAA_abc" -> 1, "AAA_anghesh" -> 2, "BBB_wfejw" -> 3, "BBB_qgqwe" -> 4, "C_fkee" -> 5)
Now I want to reduce the map by key.split("_").head and add all the values for the keys that became equal. So for this example the Map should result in:
Map(AAA -> 3, BBB -> 7, C -> 5)
What would be the correct way to do so in Scala? I tried constructions with groupBy and reduceLeft but could not find a solution.
Here's a way to do it:
testMap.groupBy(_._1.split("_").head).mapValues(_.values.sum)
A variation in one pass:
testMap.foldLeft(Map[String, Int]())( (map, kv) => {
  val key = kv._1.split("_").head
  val previous = map.getOrElse(key, 0)
  map.updated(key, previous + kv._2)
})
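For what it's worth, on Scala 2.13 the same one-pass reduction can also be expressed with groupMapReduce; a sketch using the testMap above:
testMap.groupMapReduce(_._1.split("_").head)(_._2)(_ + _)
// Map(AAA -> 3, BBB -> 7, C -> 5) (entry order may differ)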
Per-Document Word Count in Spark
I am learning Spark (in Scala) and have been trying to figure out how to count all the words on each line of a file. I am working with a dataset where each line contains a tab-separated document_id and the full text of the document:
doc_1 <full-text>
doc_2 <full-text>
etc..
Here is a toy example I have in a file called doc.txt:
doc_1 new york city new york state
doc_2 rain rain go away
I think what I need to do is transform the data into tuples containing ((doc_id, word), 1) and then call reduceByKey() to sum the 1's. I wrote the following:
val file = sc.textFile("docs.txt")
val tuples = file.map(_.split("\t"))
              .map( x => (x(1).split("\\s+")
              .map(y => ((x(0), y), 1 )) ) )
Which does give me the intermediate representation I think I need:
tuples.collect
res0: Array[Array[((String, String), Int)]] = Array(Array(((doc_1,new),1), ((doc_1,york),1), ((doc_1,city),1), ((doc_1,new),1), ((doc_1,york),1), ((doc_1,state),1)), Array(((doc_2,rain),1), ((doc_2,rain),1), ((doc_2,go),1), ((doc_2,away),1)))
But if I call reduceByKey on tuples it produces an error:
tuples.reduceByKey(_ + _)
<console>:21: error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[Array[((String, String), Int)]]
              tuples.reduceByKey(_ + _)
I can't seem to get my head around how to do this. I think I need to do a reduce on an array inside an array. I have tried many different things but, like the above, keep getting errors and making no progress. Any guidance / advice on this would be much appreciated.
Note: I know that there is a word count example on https://spark.apache.org/examples.html showing how to get counts for all of the words in a file. But that is for an entire input file. I am talking about getting counts per-document, where each document is on a different line.
reduceByKey expects type RDD[(K,V)], whereas the instant you perform the split in the first map, you end up with an RDD[Array[...]], which is not the type signature that is needed. You can rework your current solution as below... but it probably will not be as performant (read after the code for a rework using flatMap):
//Dummy data load
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))

//Split the data on tabs to get an array of (key, line) tuples
val firstPass = file.map(_.split("\t"))

//Split the line inside each tuple so you now have an array of (key, Array(...))
//Where the inner array is full of (word, 1) tuples
val secondPass = firstPass.map(x=>(x(0), x(1).split("\\s+").map(y=>(y,1))))

//Now group the words and re-map so that the inner tuple is the wordcount
val finalPass = secondPass.map(x=>(x._1, x._2.groupBy(_._1).map(y=>(y._1,y._2.size))))
Probably the better solution: if you want to keep your current structure, then you need to change to using a Tuple2 from the start and then use a flatMap after:
//Load your data
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))

//Turn the data into a key-value RDD (I suggest caching the split, kept 1 line for SO)
val firstPass = file.map(x=>(x.split("\t")(0), x.split("\t")(1)))

//Change your key to be a Tuple2[String,String] and the value is the count
val tuples = firstPass.flatMap(x=>x._2.split("\\s+").map(y=>((x._1, y), 1)))
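From there, a sketch of the remaining steps (not spelled out in the answer above), assuming the pair-RDD implicits are in scope as they are in the spark-shell: sum the 1's per (doc_id, word) key and, if a per-document map is wanted, regroup by document:
// Sum the 1's per (doc_id, word) pair
val counts = tuples.reduceByKey(_ + _)            // RDD[((String, String), Int)]

// Optionally regroup to one (doc_id -> wordCounts) entry per document
val perDoc = counts
  .map { case ((doc, word), n) => (doc, (word, n)) }
  .groupByKey()
  .mapValues(_.toMap)                             // RDD[(String, Map[String, Int])]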
Here is a quick demo on a VERY small dataset.
scala> val file = sc.textFile("../README.md")
15/02/02 00:32:38 INFO MemoryStore: ensureFreeSpace(32792) called with curMem=45512, maxMem=278302556
15/02/02 00:32:38 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 32.0 KB, free 265.3 MB)
file: org.apache.spark.rdd.RDD[String] = ../README.md MappedRDD[7] at textFile at <console>:12

scala> val splitLines = file.map{ line => line.split(" ") }
splitLines: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[9] at map at <console>:14

scala> splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
res19: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[10] at map at <console>:17

scala> val result = splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
result: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[11] at map at <console>:16

scala> result.take(10).foreach(println)
Map(# -> 1, Spark -> 1, Apache -> 1)
Map( -> 1)
Map(for -> 1, is -> 1, Data. -> 1, system -> 1, a -> 1, provides -> 1, computing -> 1, cluster -> 1, general -> 1, Spark -> 1, It -> 1, fast -> 1, Big -> 1, and -> 1)
Map(in -> 1, Scala, -> 1, optimized -> 1, APIs -> 1, that -> 1, Java, -> 1, high-level -> 1, an -> 1, Python, -> 1, and -> 2, engine -> 1)
Map(for -> 1, data -> 1, a -> 1, also -> 1, general -> 1, supports -> 2, It -> 1, graphs -> 1, analysis. -> 1, computation -> 1)
Map(for -> 1, set -> 1, tools -> 1, rich -> 1, Spark -> 1, structured -> 1, including -> 1, of -> 1, and -> 1, higher-level -> 1, SQL -> 2)
Map(GraphX -> 1, for -> 2, processing, -> 2, data -> 1, MLlib -> 1, learning, -> 1, machine -> 1, graph -> 1)
Map(for -> 1, Streaming -> 1, processing. -> 1, stream -> 1, Spark -> 1, and -> 1)
Map( -> 1)
Map(<http://spark.apache.org/> -> 1)
Input file content:
a b c d a b e f h i j l m h i l

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");
    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
    JavaRDD<String> textFile = javaSparkContext.textFile("C:\\Users\\arun7.gupta\\Desktop\\Spark\\word.txt");

    /** Splitting the word with space */
    textFile = textFile.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

    /** Pair the word with count */
    JavaPairRDD<String, Integer> mapToPair = textFile.mapToPair(w -> new Tuple2<>(w, 1));

    /** Reduce the pair with key and add count */
    JavaPairRDD<String, Integer> reduceByKey = mapToPair.reduceByKey((wc1, wc2) -> wc1 + wc2);

    System.out.println(reduceByKey.collectAsMap());
    javaSparkContext.close();
}
Output:
{e=1, h=2, b=2, j=1, m=1, d=1, a=2, i=2, c=1, l=2, f=1}
How to convert an Iterator[int] to a map with frequency per bin in scala
I just learned how to convert a list of integers to a map with frequency per bin in Scala: How to convert list of integers to a map with frequency per bin in scala. However, I am working with a 22 GB file, therefore I am streaming through the file:
Source.fromFile("test.txt").getLines.filter(x => x.charAt(0) != '#').map(x => x.split("\t")(1)).map(x => x.toInt)
The groupBy function only works on a list, not on an iterator, I guess because it needs all the values in memory. I can't convert the iterator to a list because of the file size. So an example would be:
List(1,2,3,101,330,302).iterator
And how can I go from there to:
res1: scala.collection.immutable.Map[Int,Int] = Map(100 -> 1, 300 -> 2, 0 -> 3)
You may use fold:
val iter = List(1,2,3,101,330,302).iterator
iter.foldLeft(Map[Int, Int]()) { (accum, a) =>
  val key = a / 100 * 100
  accum + (key -> (accum.getOrElse(key, 0) + 1))
}
// scala.collection.immutable.Map[Int,Int] = Map(0 -> 3, 100 -> 1, 300 -> 2)
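The same fold applies directly to the file iterator from the question, so the 22 GB file never has to be fully materialized; a sketch assuming test.txt has the tab-separated layout described above:
import scala.io.Source

val source = Source.fromFile("test.txt")
val histogram =
  try {
    source.getLines
      .filter(line => line.nonEmpty && line.charAt(0) != '#')   // skip comment lines, as in the question
      .map(_.split("\t")(1).toInt)
      .foldLeft(Map[Int, Int]()) { (accum, a) =>
        val key = a / 100 * 100
        accum + (key -> (accum.getOrElse(key, 0) + 1))
      }
  } finally source.close()
// histogram: Map[Int,Int], e.g. Map(0 -> 3, 100 -> 1, 300 -> 2) for the sample values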