I want to create a nested tree map in Scala from two lists, e.g.:
private val a = List(0, 100000, 500000, 1000000)
private val b = List (0, 5, 25, 50)
I want the keys of the nested tree map to be the values from list a. The value for each of these keys would be another tree map whose keys are the values from list b, each mapped to a default value. To clarify, here's the format I want the tree map in, with a default value of 0:
{
0:{
0:0,
5:0,
25:0,
50:0
},
100000:{
0:0,
5:0,
25:0,
50:0
},
..}
Is there an efficient way to do this in Scala?
Edit: I want to use the same logic but use a Java Tree Map in Scala instead. Could someone please guide me?
Here's an approach that gets close to the requested shape. Note that the result is a List of (key, Map) pairs rather than a nested TreeMap. There may be more efficient ways to do this for a large data set.
a.flatMap(aa => TreeMap(aa -> b.flatMap(bb => TreeMap(bb -> 0)).toMap))
val res50: List[(Int, Map[Int, Int])] =
List(
(0, Map(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0)),
(100000, Map(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0)),
(500000, Map(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0)),
(1000000,Map(0 -> 0, 5 -> 0, 25 -> 0, 50 -> 0))
)
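If the goal is an actual nested TreeMap rather than a List of pairs, one option is to build both levels with the varargs TreeMap constructor. This is only a minimal sketch using the two lists from the question; the names inner and nested are made up for illustration.

```scala
import scala.collection.immutable.TreeMap

val a = List(0, 100000, 500000, 1000000)
val b = List(0, 5, 25, 50)

// Inner map: every value of `b` mapped to the default 0, sorted by key.
// It is immutable, so the same instance can safely be shared by all outer keys.
val inner: TreeMap[Int, Int] = TreeMap(b.map(bb => bb -> 0): _*)

// Outer map: every value of `a` mapped to the inner TreeMap.
val nested: TreeMap[Int, TreeMap[Int, Int]] = TreeMap(a.map(aa => aa -> inner): _*)
```

Looking up nested(100000)(25) then gives the default value.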
All you need to do is:
a.iterator.map { aa =>
aa -> b.iterator.map(bb => bb -> 0).to(TreeMap.canBuildFrom)
}.to(TreeMap.canBuildFrom)
And using foldLeft
a.foldLeft(TreeMap.empty[Int, TreeMap[Int, Int]]) { (acc, elem) =>
acc + (elem -> b.map(bKey => bKey -> 0)
.to(TreeMap.canBuildFrom))
//.to(TreeMap)) //if you use 2.13
}
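Regarding the edit asking for a Java TreeMap: none of the snippets above cover it, so here is a rough sketch of the same idea with java.util.TreeMap used directly from Scala. The alias JTreeMap and the name nested are my own, and since Java maps are mutable, each outer key gets its own fresh inner map.

```scala
import java.util.{TreeMap => JTreeMap}

val a = List(0, 100000, 500000, 1000000)
val b = List(0, 5, 25, 50)

// Outer map keyed by the values of `a`.
val nested = new JTreeMap[Int, JTreeMap[Int, Int]]()

a.foreach { aa =>
  // A fresh inner map per outer key, so later updates do not leak between keys.
  val inner = new JTreeMap[Int, Int]()
  b.foreach(bb => inner.put(bb, 0))
  nested.put(aa, inner)
}
```

Reading a value back is then nested.get(100000).get(25).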
I am trying to transfer/copy an element in a map, to another element in the map in Scala. For example:
Map(0 -> 5)
Let's say this is the initial state of the map. What I want to happen is the following:
Map(0 -> 0, 1 -> 5)
So initially 0 points to 5. After the transformation, 0 points to 0, and a new key (1) is added that points to 5.
I have tried the following:
theMap + (pointer -> (theMap(pointer) + 1))
However, I get the following error:
java.util.NoSuchElementException: key not found: 1
Thanks for any help!
This should do the trick.
def transfer(pointer: Int)(map: Map[Int, Int]): Map[Int, Int] =
  map.get(key = pointer) match {
    case Some(value) =>
      map ++ Map(
        pointer -> 0,
        (pointer + 1) -> value
      )
    case None =>
      // Pointer didn't exist, what should happen here?
      map // For now returning the map unmodified.
  }
And you can use it like this:
transfer(pointer = 0)(map = Map(0 -> 5))
// res: Map[Int,Int] = Map(0 -> 0, 1 -> 5)
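As for the error in the question: theMap(pointer) calls Map.apply, which throws NoSuchElementException when the key is missing, so the attempt fails as soon as pointer is a key that is not in the map yet. A small sketch of the difference (the value names are just for illustration):

```scala
import scala.util.Try

val theMap = Map(0 -> 5)

// Map.apply throws when the key is absent; Try captures the exception here.
val viaApply = Try(theMap(1))

// get returns an Option instead of throwing.
val viaGet = theMap.get(1)

// getOrElse falls back to a default chosen by the caller (0 is an arbitrary choice).
val viaDefault = theMap.getOrElse(1, 0)
```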
I have the following simple code in Java. This code creates and fills the Map with 0 values.
Map<Integer,Integer> myMap = new HashMap<Integer,Integer>();
for (int i=0; i<=20; i++) { myMap.put(i, 0); }
I want to create a similar RDD using Spark and Scala. I tried this approach, but it gives me RDD[(Any) => (Any, Int)] instead of RDD[Map[Int, Int]]. What am I doing wrong?
val data = (0 to 20).map(_ => (_,0))
val myMapRDD = sparkContext.parallelize(data)
What you are creating are tuples. Instead, you need to create a Map for each element and parallelize the result, as below:
val data = (0 to 20).map(x => Map(x -> 0)) //data: scala.collection.immutable.IndexedSeq[scala.collection.immutable.Map[Int,Int]] = Vector(Map(0 -> 0), Map(1 -> 0), Map(2 -> 0), Map(3 -> 0), Map(4 -> 0), Map(5 -> 0), Map(6 -> 0), Map(7 -> 0), Map(8 -> 0), Map(9 -> 0), Map(10 -> 0), Map(11 -> 0), Map(12 -> 0), Map(13 -> 0), Map(14 -> 0), Map(15 -> 0), Map(16 -> 0), Map(17 -> 0), Map(18 -> 0), Map(19 -> 0), Map(20 -> 0))
val myMapRDD = sparkContext.parallelize(data) //myMapRDD: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = ParallelCollectionRDD[0] at parallelize at test.sc:19
In Scala, (0 to 20).map(_ => (_, 0)) would not compile, as it has invalid placeholder syntax. I believe you might be looking for something like below instead:
val data = (0 to 20).map( _->0 )
which would generate a list of key-value pairs, and is really just a placeholder shorthand for:
val data = (0 to 20).map( n => n->0 )
// data: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector(
// (0,0), (1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0), (9,0), (10,0),
// (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0), (19,0), (20,0)
// )
An RDD is an immutable collection of data, which can be built from a local collection (e.g. Seq, Array). To create an RDD of Map[Int,Int], you would expand data inside a Map, which in turn gets placed inside a Seq collection:
val rdd = sc.parallelize(Seq(Map(data: _*)))
rdd.collect
// res1: Array[scala.collection.immutable.Map[Int,Int]] = Array(
// Map(0 -> 0, 5 -> 0, 10 -> 0, 14 -> 0, 20 -> 0, 1 -> 0, 6 -> 0, 9 -> 0, 13 -> 0, 2 -> 0, 17 -> 0,
// 12 -> 0, 7 -> 0, 3 -> 0, 18 -> 0, 16 -> 0, 11 -> 0, 8 -> 0, 19 -> 0, 4 -> 0, 15 -> 0)
// )
Note that, as is, this RDD consists of only a single Map, but you can assemble as many Maps as you wish in an RDD:
val rdd2 = sc.parallelize(Seq(
Map((0 to 4).map( _->0 ): _*),
Map((5 to 9).map( _->0 ): _*),
Map((10 to 14).map( _->0 ): _*),
Map((15 to 19).map( _->0 ): _*)
))
You can't parallelize a Map, as parallelize takes a Seq. What you can do is create an RDD[(Int, Int)], which, however, does not enforce the uniqueness of keys. To perform operations by key, you can leverage PairRDDFunctions, which despite this limitation can end up being useful for your use case.
Let's try at least to get an RDD[(Int, Int)].
You used a slightly "wrong" syntax when applying the map to your range.
The _ placeholder can have different meanings depending on the context. The two meanings that got mixed up in your snippet of code are:
a placeholder for an anonymous function parameter that is not going to be used (as in (_ => 42), a function which ignores its input and always returns 42)
a positional placeholder for arguments in anonymous functions (as in (_, 42) a function that takes one argument and returns a tuple where the first element is the input and the second is the number 42)
The above examples are simplified and do not account for type inference as they only wish to point out two of the meanings of the _ placeholder that got mixed up in your snippet of code.
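To make the two meanings concrete, here is a small plain-Scala sketch (no Spark involved). The names always42 and pairWithZero are made up for illustration, and the explicit function types are there only so that type inference is not an issue.

```scala
// `_ =>` declares a parameter that is ignored: the result is always 42.
val always42: Int => Int = _ => 42

// `(_, 0)` uses `_` as a positional placeholder: the argument becomes the first tuple element.
val pairWithZero: Int => (Int, Int) = (_, 0)

// Mapping a range with the second form yields key-value pairs.
val pairs = (0 to 3).map(pairWithZero)
```

Mixing the two, as in _ => (_, 0), is what went wrong in the question's snippet.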
The first step is to use one of the two following functions to create the pairs that are going to be part of the map, either
a => (a, 0)
or
(_, 0)
and after parallelizing it you can get the RDD[(Int, Int)], as follows:
val pairRdd = sc.parallelize((0 to 20).map((_, 0)))
I believe it's worth noting here, that mapping on the local collection is going to be executed eagerly and bound to your driver's resources, while you can obtain the same final result by parallelizing the collection first and then mapping the pair-creating function on the RDD.
Now, as mentioned, you don't have a distributed map, but rather a collection of key-value pairs where key uniqueness is not enforced. But you can work pretty seamlessly with those values using PairRDDFunctions, which you obtain automatically by importing org.apache.spark.rdd.RDD.rddToPairRDDFunctions (or without having to do anything in the spark-shell, as the import has already been done for you), which will decorate your RDD leveraging Scala's implicit conversions.
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
pairRdd.mapValues(_ + 1).foreach(println)
will print the following
(10,1)
(11,1)
(12,1)
(13,1)
(14,1)
(15,1)
(16,1)
(17,1)
(18,1)
(19,1)
(20,1)
(0,1)
(1,1)
(2,1)
(3,1)
(4,1)
(5,1)
(6,1)
(7,1)
(8,1)
(9,1)
You can learn more about working with key-value pairs with the RDD API on the official documentation.
If you have two maps (one is mutable, the other is immutable), how would you multiply the values of one with the corresponding values of the other?
For example:
val testA = scala.collection.mutable.Map("£2" -> 3, "£1" -> 0,
"50p" -> 4, "20p" -> 0, "10p" -> 0, "5p" -> 0)
val testB = scala.collection.immutable.Map("£2" -> 2, "£1" -> 1,
"50p" -> 0.5, "20p" -> 0.2, "10p" -> 0.1, "5p" -> 0.05)
Expecting a result of:
val total = scala.collection.immutable.Map("£2" -> 6, "£1" -> 0,
"50p" -> 2, "20p" -> 0, "10p" -> 0, "5p" -> 0)
You can use map to map each value to that value multiplied by the lookup result on testB (or 1.0, if none found)
testA.map { case (k, v) => (k, v * testB.getOrElse(k, 1.0)) }
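One caveat: because testB in the question mixes Int and Double literals, its value type may not be inferred as Double, in which case the multiplication above would not type-check. A sketch that sidesteps this by annotating testB as Map[String, Double] (my assumption about the intent) and converting the result to an immutable Map:

```scala
import scala.collection.mutable

val testA = mutable.Map("£2" -> 3, "£1" -> 0, "50p" -> 4, "20p" -> 0, "10p" -> 0, "5p" -> 0)

// Explicit value type so every face value is a Double.
val testB: Map[String, Double] =
  Map("£2" -> 2.0, "£1" -> 1.0, "50p" -> 0.5, "20p" -> 0.2, "10p" -> 0.1, "5p" -> 0.05)

// Multiply each count by its face value (1.0 if the key is missing), then make the result immutable.
val total: Map[String, Double] =
  testA.map { case (k, v) => k -> (v * testB.getOrElse(k, 1.0)) }.toMap
```

Note that the values come out as Double (6.0, 2.0, ...) rather than the Int values shown in the expected result.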
I am learning Spark (in Scala) and have been trying to figure out how to count all the words on each line of a file.
I am working with a dataset where each line contains a tab-separated document_id and the full text of the document
doc_1 <full-text>
doc_2 <full-text>
etc..
Here is a toy example I have in a file called doc.txt
doc_1 new york city new york state
doc_2 rain rain go away
I think what I need to do is transform each line into tuples containing
((doc_id, word), 1)
and then call reduceByKey() to sum the 1's. I wrote the following:
val file = sc.textFile("docs.txt")
val tuples = file.map(_.split("\t"))
.map( x => (x(1).split("\\s+")
.map(y => ((x(0), y), 1 )) ) )
Which does give me the intermediate representation I think I need:
tuples.collect
res0: Array[Array[((String, String), Int)]] = Array(Array(((doc_1,new),1), ((doc_1,york),1), ((doc_1,city),1), ((doc_1,new),1), ((doc_1,york),1), ((doc_1,state),1)), Array(((doc_2,rain),1), ((doc_2,rain),1), ((doc_2,go),1), ((doc_2,away),1)))
But if I call reduceByKey on tuples, it produces an error:
tuples.reduceByKey(_ + _)
<console>:21: error: value reduceByKey is not a member of org.apache.spark.rdd.RDD[Array[((String, String), Int)]]
tuples.reduceByKey(_ + _)
I can't seem to get my head around how to do this. I think I need to do a reduce on an array inside an array. I have tried many different things but like the above keep getting errors and making no progress.
Any guidance / advice on this would be much appreciated.
Note: I know that there is a word count example on https://spark.apache.org/examples.html showing how to get counts for all of the words in a file. But that is for an entire input file. I am talking about getting counts per-document where each document is on a different line.
reduceByKey expects type RDD[(K,V)] whereas the instant you perform the split in the first map, you end up with an RDD[Array[...]], which is not the type signature that is needed. You can rework your current solution as below...but it probably will not be as performant (read after the code for a rework using flatMap):
//Dummy data load
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))
//Split the data on tabs to get an array of (key, line) tuples
val firstPass = file.map(_.split("\t"))
//Split the line inside each tuple so you now have an array of (key, Array(...))
//Where the inner array is full of (word, 1) tuples
val secondPass = firstPass.map(x=>(x(0), x(1).split("\\s+").map(y=>(y,1))))
//Now group the words and re-map so that the inner tuple is the wordcount
val finalPass = secondPass.map(x=>(x._1, x._2.groupBy(_._1).map(y=>(y._1,y._2.size))))
Probably the better solution:
If you want to keep your current structure, then you need to change to using a Tuple2 from the start and then using a flatMap after:
//Load your data
val file = sc.parallelize(List("doc_1\tnew york city","doc_2\train rain go away"))
//Turn the data into a key-value RDD (I suggest caching the split, kept 1 line for SO)
val firstPass = file.map(x=>(x.split("\t")(0), x.split("\t")(1)))
//Change your key to be a Tuple2[String,String] and the value is the count
val tuples = firstPass.flatMap(x=>x._2.split("\\s+").map(y=>((x._1, y), 1)))
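With the flattened ((doc_id, word), 1) pairs, the remaining step on a pair RDD would be the reduceByKey(_ + _) from the question. To show the shape of the whole pipeline without needing a Spark context, here is a sketch on plain Scala collections, where groupBy plus sum stands in for reduceByKey. This is an illustration of the logic only, not Spark code.

```scala
// Same toy data as in the question, tab-separated: doc id, then the full text.
val lines = List("doc_1\tnew york city new york state", "doc_2\train rain go away")

// flatMap produces one flat list of ((docId, word), 1) pairs instead of a list of arrays.
val tuples: List[((String, String), Int)] = lines.flatMap { line =>
  val parts = line.split("\t", 2)
  val docId = parts(0)
  parts(1).split("\\s+").toList.map(word => ((docId, word), 1))
}

// Local stand-in for reduceByKey(_ + _): group by key and sum the 1's.
val counts: Map[(String, String), Int] =
  tuples.groupBy(_._1).map { case (key, ones) => key -> ones.map(_._2).sum }
```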
Here is a quick demo on a VERY small dataset.
scala> val file = sc.textFile("../README.md")
15/02/02 00:32:38 INFO MemoryStore: ensureFreeSpace(32792) called with curMem=45512, maxMem=278302556
15/02/02 00:32:38 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 32.0 KB, free 265.3 MB)
file: org.apache.spark.rdd.RDD[String] = ../README.md MappedRDD[7] at textFile at <console>:12
scala> val splitLines = file.map{ line => line.split(" ") }
splitLines: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[9] at map at <console>:14
scala> splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
res19: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[10] at map at <console>:17
scala> val result = splitLines.map{ arr => arr.toList.groupBy(identity).map{ x => (x._1, x._2.size) } }
result: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Int]] = MappedRDD[11] at map at <console>:16
scala> result.take(10).foreach(println)
Map(# -> 1, Spark -> 1, Apache -> 1)
Map( -> 1)
Map(for -> 1, is -> 1, Data. -> 1, system -> 1, a -> 1, provides -> 1, computing -> 1, cluster -> 1, general -> 1, Spark -> 1, It -> 1, fast -> 1, Big -> 1, and -> 1)
Map(in -> 1, Scala, -> 1, optimized -> 1, APIs -> 1, that -> 1, Java, -> 1, high-level -> 1, an -> 1, Python, -> 1, and -> 2, engine -> 1)
Map(for -> 1, data -> 1, a -> 1, also -> 1, general -> 1, supports -> 2, It -> 1, graphs -> 1, analysis. -> 1, computation -> 1)
Map(for -> 1, set -> 1, tools -> 1, rich -> 1, Spark -> 1, structured -> 1, including -> 1, of -> 1, and -> 1, higher-level -> 1, SQL -> 2)
Map(GraphX -> 1, for -> 2, processing, -> 2, data -> 1, MLlib -> 1, learning, -> 1, machine -> 1, graph -> 1)
Map(for -> 1, Streaming -> 1, processing. -> 1, stream -> 1, Spark -> 1, and -> 1)
Map( -> 1)
Map(<http://spark.apache.org/> -> 1)
Input file content:
a b c d
a b e f
h i j l
m h i l
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("test").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
JavaRDD<String> textFile = javaSparkContext.textFile("C:\\Users\\arun7.gupta\\Desktop\\Spark\\word.txt");
/**Splitting the word with space*/
textFile = textFile.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
/**Pair the word with count*/
JavaPairRDD<String, Integer> mapToPair = textFile.mapToPair(w -> new Tuple2<>(w, 1));
/**Reduce the pair with key and add count*/
JavaPairRDD<String, Integer> reduceByKey = mapToPair.reduceByKey((wc1, wc2) -> wc1 + wc2);
System.out.println(reduceByKey.collectAsMap());
javaSparkContext.close();
}
Output:
{e=1, h=2, b=2, j=1, m=1, d=1, a=2, i=2, c=1, l=2, f=1}
Given an immutable map
val m = (0 to 3).map {x => (x,x*10) }.toMap
m: scala.collection.immutable.Map[Int,Int] = Map(0 -> 0, 1 -> 10, 2 -> 20, 3 -> 30)
a collection of keys of interest
val k = Set(0,2)
and a function
def f(i:Int) = i + 1
How can I apply f to the values in the map for the keys of interest, so that the resulting map would be
Map(0 -> 1, 1 -> 10, 2 -> 21, 3 -> 30)
m.transform{ (key, value) => if (k(key)) f(value) else value }
That's the first thing that popped into my mind but I am pretty sure that in Scala you could do it prettier:
m.map(e => {
if(k.contains(e._1)) (e._1 -> f(e._2)) else (e._1 -> e._2)
})
A variation of @regis-jean-gilles' answer using map and pattern matching
m.map { case a @ (key, value) => if (k(key)) key -> f(value) else a }
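Another option, if the set of keys is much smaller than the map, is to fold over the keys of interest instead of traversing the whole map. A minimal sketch using the definitions from the question (the name updatedMap is made up); keys in k that are missing from m are simply skipped, which is an assumption about the desired behaviour.

```scala
val m = (0 to 3).map(x => (x, x * 10)).toMap
val k = Set(0, 2)
def f(i: Int): Int = i + 1

// Start from m and update only the entries whose key is in k.
val updatedMap: Map[Int, Int] = k.foldLeft(m) { (acc, key) =>
  acc.get(key) match {
    case Some(value) => acc.updated(key, f(value))
    case None        => acc // key of interest not present: leave the map as is
  }
}
```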