SPARK N-grams & Parallelization not using mapPartitions - scala

Problem at Hand
I wrote an attempted improvement of a bi-gram generator that works over lines, taking full stops and the like into account. The results are as wanted. It does not use mapPartitions; the code is below.
import org.apache.spark.mllib.rdd.RDDFunctions._
val wordsRdd = sc.textFile("/FileStore/tables/natew5kh1478347610918/NGram_File.txt",10)
val wordsRDDTextSplit = wordsRdd
  .map(line => line.trim.split(" "))
  .flatMap(x => x)
  .map(_.toLowerCase)
  .map(_.replaceAll(",{1,}", ""))
  .map(_.replaceAll("!{1,}", "."))
  .map(_.replaceAll("\\?{1,}", "."))
  .map(_.replaceAll("\\.{1,}", "."))
  .map(_.replaceAll("\\W+", "."))
  .filter(_ != ".")
  .filter(_ != "")
val x = wordsRDDTextSplit.collect() // I think collect() is needed here because of lazy evaluation etc.
val y = for (Array(a, b, _*) <- x.sliding(2).toArray) yield (a, b)
val z = y.filter(x => !(x._1 contains ".")).map(x => (x._1.replaceAll("\\.{1,}", ""), x._2.replaceAll("\\.{1,}", "")))
I have some questions:
The results are as expected and no data is missed. But can I convert such an approach to a mapPartitions approach, and would I not lose some data if I did? Many say that this is the case because each partition being processed holds only a subset of all the words, so the relationship at a partition boundary, i.e. between the previous and the next word, is lost. With a large file split I can see how this could occur from the map point of view as well. Correct?
However, if you look at the code above (no mapPartitions attempt), it always works regardless of how much I parallelize it, whether I specify 10 or 100 partitions, even for words that are consecutive but land in different partitions. I checked this with mapPartitionsWithIndex. This is what I am not clear on. OK, a reduce on (x, y) => x + y is well understood.
Thanks in advance. I must be missing some elementary point in all this.
Output & Results
z: Array[(String, String)] = Array((hello,how), (how,are), (are,you), (you,today), (i,am), (am,fine), (fine,but), (but,would), (would,like), (like,to), (to,talk), (talk,to), (to,you), (you,about), (about,the), (the,cat), (he,is), (is,not), (not,doing), (doing,so), (so,well), (what,should), (should,we), (we,do), (please,help), (help,me), (hi,there), (there,ged))
mapped: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[669] at mapPartitionsWithIndex at :123
Partition Assignment
res13: Array[String] = Array(hello -> 0, how -> 0, are -> 0, you -> 0, today. -> 0, i -> 0, am -> 32, fine -> 32, but -> 32, would -> 32, like -> 32, to -> 32, talk -> 60, to -> 60, you -> 60, about -> 60, the -> 60, cat. -> 60, he -> 60, is -> 60, not -> 96, doing -> 96, so -> 96, well. -> 96, what -> 96, should -> 122, we -> 122, do. -> 122, please -> 122, help -> 122, me. -> 122, hi -> 155, there -> 155, ged. -> 155)
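For reference, a minimal sketch of the kind of mapPartitionsWithIndex tagging that can produce output like the above (my reconstruction, not necessarily the OP's exact code):
// Tag each word with the index of the partition it lives in (illustrative only).
val mapped = wordsRDDTextSplit.mapPartitionsWithIndex { (index, words) =>
  words.map(word => s"$word -> $index")
}
mapped.collect()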
Maybe Spark is just really smart, smarter than I thought initially. Or maybe not? I have seen some material on partition preservation, some of it contradictory imho.
Does map vs mapValues mean that the former destroys the partitioning, and hence forces single-partition processing?

You can use mapPartitions in place of any of the maps used to create wordsRDDTextSplit, but I don't really see any reason to. mapPartitions is most useful when you have a high initialization cost that you don't want to pay for every record in the RDD.
Whether you use map or mapPartitions to create wordsRDDTextSplit, your sliding window doesn't operate on anything until you create the local data structure x.
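For illustration only, here is a rough mapPartitions version of the cleaning stage, where the regexes are compiled once per partition instead of once per record (the pattern names and exact regexes are my own approximation, not code from the question):
import java.util.regex.Pattern

val wordsCleaned = wordsRdd.mapPartitions { lines =>
  // Per-partition setup: compile the patterns once for all records in this partition.
  val commas  = Pattern.compile(",+")
  val stops   = Pattern.compile("[!?]+|\\.+")
  val nonWord = Pattern.compile("\\W+")
  lines.flatMap(_.trim.split(" ")).map { raw =>
    val lower    = raw.toLowerCase
    val noCommas = commas.matcher(lower).replaceAll("")
    val dotted   = stops.matcher(noCommas).replaceAll(".")
    nonWord.matcher(dotted).replaceAll(".")
  }.filter(w => w != "." && w != "")
}
Whether this buys anything here is doubtful, for the reason given above: the per-record work is cheap.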

Related

Split Map type column with huge values into multiple rows using Scala and Spark

This is in continuation of the question: Combine value part of Tuple2 which is a map, into single map grouping by the key of Tuple2
I am now able to reduce the rows by using reduceByKey.
But now, in the final DataFrame...
e.g.
(A, {1->100, 2->200, 3->100})
(B, {4->300, 1->500, 9->300, 11->900, 5->900, 6-> 111, 7-> 222, 8-> 333, 12-> 444, 13->555, 19->666})
(C, {6->100, 4->200, 7->100, 8->200, 5->800})
...some rows have a very large map in the map column, e.g. row B above.
I am trying to write the DF to Azure Cosmos DB Core SQL. Each row of the above DF becomes one Cosmos DB document, and the issue is that if the row size is more than 2 MB, Cosmos DB rejects the request.
Question: I want to split rows with huge map columns into multiple rows (so that each becomes less than 2 MB in size). Duplicate key columns are not an issue.
The final result could be, for example, if I split the map whenever it has more than 5 elements:
(A, {1->100, 2->200, 3->100})
(B, {4->300, 1->500, 9->300, 11->900, 5->900})
(B, {6-> 111, 7-> 222, 8-> 333, 12-> 444, 13->555})
(B, {19->666})
(C, {6->100, 4->200, 7->100, 8->200, 5->800})
You may ask: in the previous question it was already split, so why did I merge? The reason is that in the previous question, without reduceByKey, I might have 1000 rows for B, but in the end I only need about 20 rows as shown above. One row would have been ideal, but due to the Cosmos limit I have to create multiple documents (each less than 2 MB).
I hope I am clear. Please let me know if any clarification is required.
I was able to solve this by writing my own custom code as below:
import scala.collection.mutable.ListBuffer
import spark.implicits._ // for toDF; assumes a SparkSession named `spark` is in scope

originalDF.rdd.reduceByKey((a, b) => a ++ b).map(row => {
  // Index every map entry so the map can be sliced into fixed-size chunks.
  val indexedMapEntries: Map[Int, (String, String)] =
    row._2.zipWithIndex.map(mapWithIndex => (mapWithIndex._2, mapWithIndex._1))
  var min = 0
  var max = Math.min(indexedMapEntries.size - 1, 9999)
  var proceed = true
  val rowKeyIdToAttributesMapList = new ListBuffer[(String, Map[String, String])]()
  while (proceed) {
    // Collect the entries from min to max into one chunk of at most 10000 entries.
    var tempMapToHoldEntries = Map[String, String]()
    var i = min
    while (i <= max) {
      val entry: (String, String) = indexedMapEntries(i)
      tempMapToHoldEntries += entry
      i = i + 1
    }
    // Emit one (rowKeyId, chunk) pair per chunk.
    rowKeyIdToAttributesMapList += ((row._1, tempMapToHoldEntries))
    min = max + 1
    max = Math.min(indexedMapEntries.size - 1, max + 9999)
    if (min > (indexedMapEntries.size - 1))
      proceed = false
  }
  rowKeyIdToAttributesMapList.toList
}).flatMap(x => x).toDF("rowKeyId", "attributes")
Here, originalDF is the one from my previous question (check the OP). 10000 is the maximum size of each map for a rowKeyId. If the map size exceeds 10000, then I create a new row with the same rowKeyId and the remaining attributes, in a loop.
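For comparison, a shorter sketch of the same chunking idea using grouped (the chunk size and column names mirror the code above; whether this fits your exact schema is an assumption on my part):
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

val chunkedDF = originalDF.rdd
  .reduceByKey(_ ++ _)
  .flatMap { case (rowKeyId, attributes) =>
    // Split the merged map into chunks of at most 10000 entries,
    // emitting one (rowKeyId, chunk) row per chunk.
    attributes.toSeq.grouped(10000).map(chunk => (rowKeyId, chunk.toMap))
  }
  .toDF("rowKeyId", "attributes")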

What are some good use cases of lazy evaluation in Scala?

When working with large collections, we usually hear the term "lazy evaluation". I want to better demonstrate the difference between strict and lazy evaluation, so I tried the following example - getting the first two even numbers from a list:
scala> var l = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
l: List[Int] = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
scala> l.filter(_ % 2 == 0).take(2)
res0: List[Int] = List(38, 46)
scala> l.toStream.filter(_ % 2 == 0).take(2)
res1: scala.collection.immutable.Stream[Int] = Stream(38, ?)
I noticed that when I'm using toStream, I'm getting Stream(38, ?). What does the "?" mean here? Does this have something to do with lazy evaluation?
Also, what are some good examples of lazy evaluation? When should I use it and why?
One benefit of using lazy collections is to "save" memory, e.g. when mapping to large data structures. Consider this:
val r = (1 to 10000)
.map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
.map(_.sum)
.sum
And using lazy evaluation:
val r = (1 to 10000).toStream
.map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
.map(_.sum)
.sum
The first statement will generate 10000 Seqs of size 10000 and keep them all in memory, while in the second case only one Seq at a time needs to exist in memory, so it uses far less memory and finishes much faster...
Another use case is when only a part of the data is actually needed. I often use lazy collections together with take, takeWhile, etc.
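A small illustrative example (my own, not from the original answer): with a Stream, the map runs only for the prefix that takeWhile actually needs.
val firstSmallSquares = (1 to 1000000).toStream
  .map(n => n * n)      // evaluated lazily, one element at a time
  .takeWhile(_ < 100)   // stops at the first square >= 100
  .toList               // List(1, 4, 9, 16, 25, 36, 49, 64, 81)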
Let's take a real-life scenario: instead of a list, you have a big log file from which you want to extract the first 10 lines that contain "Success".
The straightforward solution would be to read the file line by line, and once you have a line that contains "Success", print it and continue to the next line.
But since we love functional programming, we don't want to use the traditional loops. Instead, we want to achieve our goal by composing functions.
First attempt:
Source.fromFile("log_file").getLines.toList.filter(_.contains("Success")).take(10)
Let's try to understand what actually happened here:
we read the whole file
we filtered the relevant lines
we took the first 10 elements
If we try to print Source.fromFile("log_file").getLines.toList, we will get the whole file, which is obviously a waste, since not all lines are relevant for us.
Why did we get all the lines and only then perform the filtering? That's because List is a strict data structure, so when we call toList it evaluates immediately, and the filtering is applied only after we have the whole data in memory.
Luckily, Scala provides lazy data structures, and Stream is one of them:
Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success")).take(10)
In order to demonstrate the difference, let's try:
Source.fromFile("log_file").getLines.toStream
Now we get something like:
scala.collection.immutable.Stream[String] = Stream(That's the first line, ?)
toStream evaluates only the first element, i.e. the first line in the file. The next element is represented by a "?", which indicates that the stream has not evaluated it yet; because toStream is lazy, the next item is evaluated only when it is used.
Now, after we apply the filter function, it will keep reading lines until it reaches the first one that contains "Success":
> var res = Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success"))
scala.collection.immutable.Stream[String] = Stream(First line contains Success!, ?)
Now we apply the take function. Still no action is performed, but it knows that it should pick 10 lines, so nothing is evaluated until we use the result:
res foreach println
Finally, if we now print res, we'll get the first 10 matching lines, as we expected.

java.lang.IndexOutOfBoundsException when accessing a list of lists

I get the error java.lang.IndexOutOfBoundsException: 5, however I double-checked the code and according to my understanding everything is correct. So, I cannot figure out how to solve the issue.
I have RDD[(String,Map[String,List[Product with Serializable]])], such as:
(1566,Map(data1 -> List(List(1469785000, 111, 1, 3, null, 0),List(1469785022, 111, 1, 3, null, 1)), data2 -> List((4,88,1469775603,1,3370,f,537490800,661.09)))
I want to create a new RDD that collects the element at index 5 of each sub-list in data1:
Map(id -> 1566, type -> List(0,1))
I wrote the following code:
val newRDD = currentRDD.map { line =>
  Map(
    "id"   -> line._1,
    "type" -> line._2.get("data1").get.map(_.productElement(5))
  )
}
If I put _.productElement(0), then the result is Map(id -> 1566, type -> List(1469785000,1469785022)). So I completely fail to see why the 0th field can be accessed, while the 3rd, 4th, and 5th fields provoke the IndexOutOfBoundsException.
The problem was due to List[Product with Serializable], while I actually handled it as List[List[Any]]. (When the static type is Product, productElement dispatches to the list's cons cell ::, which only has arity 2: index 0 is the head, which is why productElement(0) worked, and any larger index is out of bounds.) I changed the initial RDD to RDD[(String,Map[String,List[List[Any]]])] and now everything works.
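For illustration, a minimal sketch of the access once the inner values are typed as List[List[Any]] (variable names assumed from the question):
// With currentRDD: RDD[(String, Map[String, List[List[Any]]])],
// (5) now indexes into the inner list itself rather than into ::'s Product.
val newRDD = currentRDD.map { case (id, data) =>
  Map("id" -> id, "type" -> data("data1").map(inner => inner(5)))
}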

Implement a partition function using a fold in Scala

I'm new to Scala and I want to write a higher-order function (say "partition2") that takes a list of integers and a function that returns either true or false. The output would be a list of values for which the function is true and a list of values for which the function is false. I'd like to implement this using a fold. I know something like this would be a really straightforward way to do this:
val (passed, failed) = List(49, 58, 76, 82, 88, 90) partition ( _ > 60 )
I'm wondering how this same logic could be applied using a fold.
You can start by thinking about what you want your accumulator to look like. In many cases it'll have the same type as the thing you want to end up with, and that works here—you can use two lists to keep track of the elements that passed and failed. Then you just need to write the cases and add the element to the appropriate list:
List(49, 58, 76, 82, 88, 90).foldRight((List.empty[Int], List.empty[Int])) {
  case (i, (passed, failed)) if i > 60 => (i :: passed, failed)
  case (i, (passed, failed))           => (passed, i :: failed)
}
I'm using a right fold here because prepending to a list is nicer than the alternative, but you could easily rewrite it to use a left fold.
You can do this:
List(49, 58, 76, 82, 88, 90).foldLeft((Vector.empty[Int], Vector.empty[Int])) {
  case ((passed, failed), x) =>
    if (x > 60) (passed :+ x, failed)
    else (passed, failed :+ x)
}
Basically you have two accumulators, and as you visit each element, you add it to the appropriate accumulator.
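As a quick sanity check (my own addition, not from either answer), both fold versions agree with the built-in partition, up to the final collection type:
val xs = List(49, 58, 76, 82, 88, 90)
val (passed, failed) = xs.partition(_ > 60)
val folded = xs.foldRight((List.empty[Int], List.empty[Int])) {
  case (i, (p, f)) => if (i > 60) (i :: p, f) else (p, i :: f)
}
// Both yield (List(76, 82, 88, 90), List(49, 58)).
assert(folded == (passed, failed))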

Spark: How to write different group values from an RDD to different files?

I need to write values with key 1 to file file1.txt and values with key 2 to file2.txt:
val ar = Array (1 -> 1, 1 -> 2, 1 -> 3, 1 -> 4, 1 -> 5, 2 -> 6, 2 -> 7, 2 -> 8, 2 -> 9)
val distAr = sc.parallelize(ar)
val grk = distAr.groupByKey()
How can I do this without iterating over the collection grk twice?
We write data from different customers to different tables, which is essentially the same use case. The common pattern we use is something like this:
val customers: List[String] = ???
customers.foreach { customer =>
  rdd.filter(record => belongsToCustomer(record, customer)).saveToFoo()
}
This probably does not fulfill the wish of 'not iterating over the RDD twice (or n times)', but filter is a cheap operation in a parallel, distributed environment and it works, so I think it does comply with the 'general Spark way' of doing things.
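Applied to the RDD in the question, the same pattern might look roughly like this (the output paths are my own assumption; note that saveAsTextFile writes a directory of part files rather than a single file):
// One pass over the distinct keys; one filtered write per key.
val keys = distAr.keys.distinct.collect()
keys.foreach { k =>
  distAr.filter(_._1 == k)
        .values
        .saveAsTextFile(s"/output/file$k")
}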