I get the error java.lang.IndexOutOfBoundsException: 5. However, I have double-checked the code and, as far as I understand, everything is correct, so I cannot figure out how to solve the issue.
I have an RDD[(String, Map[String, List[Product with Serializable]])], such as:
(1566,Map(data1 -> List(List(1469785000, 111, 1, 3, null, 0), List(1469785022, 111, 1, 3, null, 1)), data2 -> List((4,88,1469775603,1,3370,f,537490800,661.09))))
I want to create a new RDD that aggregates the values of the 5th elements of the sub-lists in data1:
Map(id -> 1566, type -> List(0,1))
I wrote the following code:
val newRDD = currentRDD.map({
  line => Map(("id", line._1),
              ("type", line._2.get("data1").get.map(_.productElement(5))))
})
If I put _.productElement(0), then the result is Map(id -> 1566, type -> List(1469785000, 1469785022)). So I absolutely do not understand why the 0th field can be accessed, while the 3rd, 4th and 5th fields provoke an IndexOutOfBoundsException.
The problem was due to List[Product with Serializable]: I was handling the inner values as if they were List[List[Any]], but they were typed as Product with Serializable. A non-empty List is the case class ::(head, tail), so productElement(0) returns the head and productElement(1) the tail, and any higher index throws an IndexOutOfBoundsException; that is why only the 0th field appeared to work. I changed the initial RDD to RDD[(String, Map[String, List[List[Any]]])] and now everything works.
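To see this in isolation, here is a minimal plain-Scala sketch (no Spark); the cast stands in for the Product with Serializable typing in the original RDD:

// A non-empty List is the case class ::(head, tail), so viewed as a Product it has only two elements.
val inner = List[Any](1469785000, 111, 1, 3, null, 0)
val asProduct = inner.asInstanceOf[Product]

println(asProduct.productArity)      // 2
println(asProduct.productElement(0)) // 1469785000 -- the head of the list
println(asProduct.productElement(1)) // List(111, 1, 3, null, 0) -- the tail
// asProduct.productElement(5)       // java.lang.IndexOutOfBoundsException: 5, as in the question
println(inner(5))                    // 0 -- indexing the list itself gives the intended element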
When working with large collections, we usually hear the term "lazy evaluation". I want to better demonstrate the difference between strict and lazy evaluation, so I tried the following example - getting the first two even numbers from a list:
scala> var l = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
l: List[Int] = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
scala> l.filter(_ % 2 == 0).take(2)
res0: List[Int] = List(38, 46)
scala> l.toStream.filter(_ % 2 == 0).take(2)
res1: scala.collection.immutable.Stream[Int] = Stream(38, ?)
I noticed that when I'm using toStream, I'm getting Stream(38, ?). What does the "?" mean here? Does this have something to do with lazy evaluation?
Also, what are some good examples of lazy evaluation? When should I use it, and why?
One benefit of using lazy collections is to "save" memory, e.g. when mapping to large data structures. Consider this:
val r = (1 to 10000)
.map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
.map(_.sum)
.sum
And using lazy evaluation:
val r = (1 to 10000).toStream
.map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
.map(_.sum)
.sum
The first statement generates 10000 Seqs of size 10000 and keeps them all in memory, while in the second case only one Seq at a time needs to exist in memory, so it is much faster...
Another use case is when only a part of the data is actually needed. I often use lazy collections together with take, takeWhile, etc.; a small sketch follows.
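For instance (the println is only there to show how many elements are actually evaluated):

val nums = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54)

// Strict: the predicate runs for every element before take(2) is applied.
nums.filter { n => println(s"strict: $n"); n % 2 == 0 }.take(2)

// Lazy: evaluation stops once two even numbers have been found.
nums.toStream.filter { n => println(s"lazy: $n"); n % 2 == 0 }.take(2).toList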
Let's take a real-life scenario. Instead of a list, you have a big log file from which you want to extract the first 10 lines that contain "Success".
The straightforward solution would be to read the file line by line and, whenever a line contains "Success", print it and continue to the next line.
But since we love functional programming, we don't want to use the traditional loops. Instead, we want to achieve our goal by composing functions.
First attempt:
Source.fromFile("log_file").getLines.toList.filter(_.contains("Success")).take(10)
Let's try to understand what actually happened here:
we read the whole file into a list
we filtered it down to the relevant lines
we took the first 10 elements
If we try to print Source.fromFile("log_file").getLines.toList, we will get the whole file, which is obviously a waste, since not all lines are relevant to us.
Why did we get all the lines and only then perform the filtering? Because List is a strict data structure: when we call toList, it is evaluated immediately, and only once the whole file is in memory is the filter applied.
Luckily, Scala provides lazy data structures, and Stream is one of them:
Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success")).take(10)
In order to demonstrate the difference, let's try:
Source.fromFile("log_file").getLines.toStream
Now we get something like:
scala.collection.immutable.Stream[String] = Stream(That's the first line, ?)
toStream evaluates only one element - the first line in the file. The next element is represented by a "?", which indicates that the stream has not evaluated it yet; that's because toStream is lazy, and the next item is evaluated only when it is actually used.
Now, after we apply the filter function, it will keep reading lines until it reaches the first line that contains "Success":
> var res = Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success"))
scala.collection.immutable.Stream[String] = Stream(First line contains Success!, ?)
Now we apply the take function. Still no action is performed, but the stream now knows that it should pick 10 lines, and it doesn't evaluate them until we use the result:
res foreach println
Finally, if we now print res, we get a Stream containing the first 10 matching lines, as we expected.
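Putting the pieces together, here is a sketch of the complete lazy pipeline ("log_file" being the same placeholder path as above):

import scala.io.Source

// Only as many lines are read from the file as are needed to find the first 10 matches.
val firstTen = Source.fromFile("log_file")
  .getLines
  .toStream
  .filter(_.contains("Success"))
  .take(10)

firstTen.foreach(println)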
I am trying to convert a Map to a List of tuples. Given a map like the one below:
Map("a"->2,"a"->4,"a"->5,"b"->6,"b"->1,"c"->3)
I want output like:
List(("a",2),("a",4),("a",5),("b",6),("b",1),("c",3))
I tried the following:
val seq = inputMap.toList // output (a,5)(b,1)(c,3)

var list: List[(String, Int)] = Nil
for ((k, v) <- inputMap) {
  (k, v) :: list
} // output (a,5)(b,1)(c,3)
Why does it remove the duplicates? I don't see the other tuples that have "a" as the key.
That's because a Map doesn't allow duplicate keys:
val map = Map("a"->2,"a"->4,"a"->5,"b"->6,"b"->1,"c"->3)
println(map) // Map(a -> 5, b -> 1, c -> 3)
Since the map has duplicate keys, the duplicate entries are removed during map creation itself.
Map("a"->2,"a"->4,"a"->5,"b"->6,"b"->1,"c"->3)
turns into
Map(a -> 5, b -> 1, c -> 3)
so all subsequent operations are performed on this shorter map.
The problem is with Map, whose keys form a Set, so you cannot have the same key twice. Maps are dictionaries, designed for looking up a value by its key, so keys must be unique. The builder therefore keeps only the last value given for the key "a".
By the way, Map already has a toList method that does exactly what you implemented.
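As a minimal sketch of the point: if the duplicate keys need to survive, keep the data as a sequence of pairs rather than a Map:

// Start from a sequence of pairs; duplicates are kept.
val pairs = List("a" -> 2, "a" -> 4, "a" -> 5, "b" -> 6, "b" -> 1, "c" -> 3)
println(pairs)              // List((a,2), (a,4), (a,5), (b,6), (b,1), (c,3))

// Converting to a Map collapses the duplicates, keeping the last value per key:
println(pairs.toMap)        // Map(a -> 5, b -> 1, c -> 3)
println(pairs.toMap.toList) // List((a,5), (b,1), (c,3))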
I am new to Scala and trying out the map function on a Map.
Here is my Map:
scala> val map1 = Map ("abc" -> 1, "efg" -> 2, "hij" -> 3)
map1: scala.collection.immutable.Map[String,Int] =
Map(abc -> 1, efg -> 2, hij -> 3)
Here is a map function and the result:
scala> val result1 = map1.map(kv => (kv._1.toUpperCase, kv._2))
result1: scala.collection.immutable.Map[String,Int] =
Map(ABC -> 1, EFG -> 2, HIJ -> 3)
Here is another map function and the result:
scala> val result1 = map1.map(kv => (kv._1.length, kv._2))
result1: scala.collection.immutable.Map[Int,Int] = Map(3 -> 3)
The first map function returns all the members as expected; however, the second map function returns only the last member of the Map. Can someone explain why this is happening?
Thanks in advance!
In Scala, a Map cannot have duplicate keys. When you add a new key -> value pair to a Map, if that key already exists, you overwrite the previous value. If you're creating maps from functional operations on collections, then you're going to end up with the value corresponding to the last instance of each unique key. In the example you wrote, each string key of the original map map1 has the same length, and so all your string keys produce the same integer key 3 for result1. What's happening under the hood to calculate result1 is:
A new, empty map is created
You map "abc" -> 1 to 3 -> 3 and add it to the map. Result now contains 1 -> 3.
You map "efg" -> 2 to 3 -> 2 and add it to the map. Since the key is the same, you overwrite the existing value for key = 3. Result now contains 2 -> 3.
You map "hij" -> 3 to 3 -> 3 and add it to the map. Since the key is the same, you overwrite the existing value for key = 3. Result now contains 3 -> 3.
Return the result, which is Map(3 -> 3)`.
Note: I made a simplifying assumption that the order of the elements in the map iterator is the same as the order you wrote in the declaration. The order is determined by hash bin and will probably not match the order you added elements, so don't build anything that relies on this assumption.
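If the intent is to keep all the values whose derived keys collide rather than overwrite them, one option (a sketch, not the only way) is to group into a Map of lists:

val map1 = Map("abc" -> 1, "efg" -> 2, "hij" -> 3)

// Group the original entries by the derived key (string length) and keep every value.
val grouped: Map[Int, List[Int]] =
  map1.toList
    .groupBy { case (k, _) => k.length }
    .map { case (len, kvs) => len -> kvs.map(_._2) }

println(grouped) // Map(3 -> List(1, 2, 3))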
Problem at Hand
I wrote an attempt at an improved bi-gram generator that works over lines and takes full stops and the like into account. The results are as wanted. It does not use mapPartitions; the code is below.
import org.apache.spark.mllib.rdd.RDDFunctions._

val wordsRdd = sc.textFile("/FileStore/tables/natew5kh1478347610918/NGram_File.txt", 10)
val wordsRDDTextSplit = wordsRdd.map(line => line.trim.split(" ")).flatMap(x => x)
  .map(x => x.toLowerCase())
  .map(x => x.replaceAll(",{1,}", "")).map(x => x.replaceAll("!{1,}", "."))
  .map(x => x.replaceAll("\\?{1,}", ".")).map(x => x.replaceAll("\\.{1,}", "."))
  .map(x => x.replaceAll("\\W+", ".")).filter(_ != ".").filter(_ != "")

val x = wordsRDDTextSplit.collect() // need to do this due to lazy evaluation etc. I think, need collect()
val y = for (Array(a, b, _*) <- x.sliding(2).toArray) yield (a, b)
val z = y.filter(x => !(x._1 contains "."))
  .map(x => (x._1.replaceAll("\\.{1,}", ""), x._2.replaceAll("\\.{1,}", "")))
I have some questions:
The results are as expected. No data is missed. But can I convert such an approach to a mapPartitions approach? Would I not lose some data? Many say that this is the case, because each partition we process holds only a subset of all the words and therefore misses the relationship at the boundary of a split, i.e. the next and the previous word. With a large file split I can see, from the map point of view, that this could occur as well. Correct?
However, if you look at the code above (no mapPartitions attempt), it always works regardless of how much I parallelize it - 10 or 100 partitions - even with consecutive words ending up in different partitions. I checked this with mapPartitionsWithIndex. This I am not clear on. OK, a reduce on (x, y) => x + y is well understood.
Thanks in advance. I must be missing some elementary point in all this.
Output & Results
z: Array[(String, String)] = Array((hello,how), (how,are), (are,you), (you,today), (i,am), (am,fine), (fine,but), (but,would), (would,like), (like,to), (to,talk), (talk,to), (to,you), (you,about), (about,the), (the,cat), (he,is), (is,not), (not,doing), (doing,so), (so,well), (what,should), (should,we), (we,do), (please,help), (help,me), (hi,there), (there,ged))
mapped: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[669] at mapPartitionsWithIndex at :123
Partition Assignment
res13: Array[String] = Array(hello -> 0, how -> 0, are -> 0, you -> 0, today. -> 0, i -> 0, am -> 32, fine -> 32, but -> 32, would -> 32, like -> 32, to -> 32, talk -> 60, to -> 60, you -> 60, about -> 60, the -> 60, cat. -> 60, he -> 60, is -> 60, not -> 96, doing -> 96, so -> 96, well. -> 96, what -> 96, should -> 122, we -> 122, do. -> 122, please -> 122, help -> 122, me. -> 122, hi -> 155, there -> 155, ged. -> 155)
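For reference, a partition assignment like the one above can be produced with something along these lines (a reconstruction, not necessarily the exact code used):

// Tag every word with the index of the partition it lives in.
val mapped = wordsRDDTextSplit.mapPartitionsWithIndex { (index, iter) =>
  iter.map(word => s"$word -> $index")
}
mapped.collect()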
Maybe Spark is just really smart - smarter than I thought initially. Or maybe not? I saw some material on partition preservation, some of it contradictory imho.
Is the difference between map and mapValues that the former destroys the partitioning, and hence implies single-partition processing?
You can use mapPartitions in place of any of the maps used to create wordsRDDTextSplit, but I don't really see any reason to. mapPartitions is most useful when you have a high initialization cost that you don't want to pay for every record in the RDD.
Whether you use map or mapPartitions to create wordsRDDTextSplit, your sliding window doesn't operate on anything until you create the local data structure x.
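As a hedged illustration of that initialization-cost point (not something this particular job needs), here is how one of the cleanup steps from the chain above could be written with mapPartitions so that the regex is compiled once per partition rather than once per word:

import java.util.regex.Pattern
import org.apache.spark.rdd.RDD

// Collapse runs of "!" into "." with a regex compiled once per partition.
def collapseExclamations(words: RDD[String]): RDD[String] =
  words.mapPartitions { iter =>
    val exclam = Pattern.compile("!{1,}") // paid once per partition
    iter.map(w => exclam.matcher(w).replaceAll("."))
  }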
Why does the immutable version of ListMap store entries in ascending order, while the mutable version stores them in descending order?
Here is a test you can use if you have scalatest-1.6.1.jar and junit-4.9.jar:
import org.junit.Test

@Test def stackoverflowQuestion() {
  val map = Map("A" -> 5, "B" -> 12, "C" -> 2, "D" -> 9, "E" -> 18)

  val sortedIMMUTABLEMap = collection.immutable.ListMap[String, Int](map.toList.sortBy[Int](_._2): _*)
  println("head : " + sortedIMMUTABLEMap.head._2)
  println("last : " + sortedIMMUTABLEMap.last._2)
  sortedIMMUTABLEMap.foreach(x => println(x))
  assert(sortedIMMUTABLEMap.head._2 < sortedIMMUTABLEMap.last._2)

  val sortedMUTABLEMap = collection.mutable.ListMap[String, Int](map.toList.sortBy[Int](_._2): _*)
  println("head : " + sortedMUTABLEMap.head._2)
  println("last : " + sortedMUTABLEMap.last._2)
  sortedMUTABLEMap.foreach(x => println(x))
  assert(sortedMUTABLEMap.head._2 > sortedMUTABLEMap.last._2)
}
Here's the output of the passing test:
head : 2
last : 18
(C,2)
(A,5)
(D,9)
(B,12)
(E,18)
head : 18
last : 2
(E,18)
(B,12)
(D,9)
(A,5)
(C,2)
The symptoms can be simplified to:
scala> collection.mutable.ListMap(1 -> "one", 2 -> "two").foreach(println)
(2,two)
(1,one)
scala> collection.immutable.ListMap(1 -> "one", 2 -> "two").foreach(println)
(1,one)
(2,two)
The "sorting" in your code is not the core of the issue, your call to ListMap is using the ListMap.apply call from the companion object that constructs a list map backed by a mutable or immutable list. The rule is that the insertion order will be preserved.
The difference seems to be that mutable list is backed by an immutable list and insert happens at the front. So that's why when iterating you get LIFO behavior. I'm still looking at the immutable one but I bet the inserts are effectively at the back. Edit, I'm changing my mind: insert are probably at the front, but it seems the immutable.ListMap.iterator method decides to reverse the result with a toList.reverseIterator on the returned iterator. I think it worth bringing it in the mailing list.
Could the documentation be better? Certainly. Is there pain? Not really, I don't let it happen. If the documentation is incomplete, it's wise to test the behavior or go look up at the source before picking a structure versus another one.
Actually, there can be pain if the Scala team decides to change behavior at a later time and feel they can because the behavior is effectively undocumented and there is no contract.
To address the use case you explained in the comment, say you've collected the string frequency counts in a map (mutable or immutable):
val map = Map("A" -> 5, "B" -> 12, "C" -> 2, "D" -> 9, "E" -> 18, "B" -> 5)
Since you only need to sort once at the end, you can convert the tuples from the map to a seq and then sort:
map.toSeq.sortBy(_._2)
// Seq[(java.lang.String, Int)] = ArrayBuffer((C,2), (A,5), (B,5), (D,9), (E,18))
As I see it, neither ListMap claims to be a sorted map; each is just a map implemented with a list. In fact, I don't see anything in their contracts about preserving insertion order.
Programming in Scala explains that ListMap may be of use if the early elements are more likely to be accessed, but that otherwise it has little advantage over Map.
Don't build any expectations on the order; it is not specified and it will vary between Scala versions.
For example:
import scala.collection.mutable.{ListMap => MutableListMap}
MutableListMap("A" -> 5, "B" -> 12, "C" -> 2, "D" -> 9, "E" -> 18).foreach(println)
On 2.9.1 gives:
(E,18)
(D,9)
(C,2)
(B,12)
(A,5)
but on 2.11.6 gives:
(E,18)
(C,2)
(A,5)
(B,12)
(D,9)
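If a documented ordering is actually needed, a hedged suggestion (not from the answers above) is to use a structure that specifies one, such as SortedMap, which orders entries by key:

import scala.collection.immutable.SortedMap

// Entries iterate in key order (A, B, C, D, E), regardless of insertion order.
val byKey = SortedMap("E" -> 18, "C" -> 2, "A" -> 5, "D" -> 9, "B" -> 12)
byKey.foreach(println) // (A,5) (B,12) (C,2) (D,9) (E,18)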