Scala java.lang.NoSuchElementException Error

I have a Scala for loop that goes like this:
val a = sc.textFile("path to file containing 8 elements")
for (i <- 0 to a.count.toInt) {
  println(a.take(i).last)
}
But it throws a java.lang.NoSuchElementException.
I am not able to understand what's wrong or how to resolve it.

There are two problems:
1) The "to" operator used to define the range (in 0 to a.count.toInt) is inclusive, so the loop runs from 0 to 8. With a collection of 8 elements, the last iteration asks for an element at index 8. You can use 0 until a.count.toInt instead.
2) The second problem is the way "last" is called. When i = 0, the expression a.take(i) is an empty collection, and calling "last" on an empty collection throws NoSuchElementException.
Why would you iteratively take 1, 2, 3... 8 elements from a collection just to read the last element every time?
Doing this with a collection of 8 elements is fine, but if you want to do something like this on a larger RDD, you should consider caching that RDD first.
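A minimal sketch of a corrected loop, assuming the goal is simply to print each line in order (sc and the path are taken from the question); starting the range at 1 avoids the empty take(0), and the upper bound never reaches past the end:
val a = sc.textFile("path to file containing 8 elements")
for (i <- 1 to a.count.toInt) {
  println(a.take(i).last)
}
// More directly, without the repeated take calls:
a.collect().foreach(println)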

Related

Iterating through Seq[Row] till a particular condition is met using Scala

I need to iterate over a Scala Seq of Row type until a particular condition is met. I don't need to process anything after that condition.
I have a Seq[Row] r -> WrappedArray([1/1/2020,abc,1],[1/2/2020,pqr,1],[1/3/2020,stu,0],[1/4/2020,opq,1],[1/6/2020,lmn,0])
I want to iterate through this collection, checking r.getInt(2), until I encounter 0. As soon as I encounter 0, I need to break the iteration and collect r.getString(1) for everything up to that point. I don't need to look at any data after that.
My output should be: Array(abc,pqr,stu)
I am new to Scala programming. This Seq was actually a DataFrame. I know how to handle this using Spark DataFrames, but due to some restrictions put forth by my organization, window functions and the createDataFrame function are not available/working in our environment. Hence I have to resort to Scala programming to achieve the same.
All I could come up was something like below, but not really working!
breakable {
  for (i <- r)
  var temp = i.getInt(3) === 0
  if (temp == true) {
    val = i.getInt(2)
    break()
  }
}
Can someone please help me here!
You can use the takeWhile method to grab the elements while their value is 1:
s.takeWhile(_.getInt(2) == 1).map(_.getString(1))
That will give you:
List(abc, pqr)
So you still need to get the first element where the int value is 0, which you can do as follows:
s.find(_.getInt(2) == 0).map(_.getString(1)).get
Putting it all together (and handling the case where no such element exists):
s.takeWhile(_.getInt(2) == 1).map(_.getString(1)) ++ s.find(_.getInt(2) == 0).map(r => List(r.getString(1))).getOrElse(Nil)
Result:
Seq[String] = List(abc, pqr, stu)
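A self-contained sketch of the whole thing, assuming the rows can be modelled with org.apache.spark.sql.Row in the field order shown in the question (date, string, flag); s is the same name the answer uses:
import org.apache.spark.sql.Row

val s: Seq[Row] = Seq(
  Row("1/1/2020", "abc", 1),
  Row("1/2/2020", "pqr", 1),
  Row("1/3/2020", "stu", 0),
  Row("1/4/2020", "opq", 1),
  Row("1/6/2020", "lmn", 0)
)

// Strings while the flag is 1, plus the string of the first row whose flag is 0.
val result =
  s.takeWhile(_.getInt(2) == 1).map(_.getString(1)) ++
    s.find(_.getInt(2) == 0).map(r => List(r.getString(1))).getOrElse(Nil)
// result: Seq[String] = List(abc, pqr, stu)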

Implement a MergeSort-like feature in Spark with Scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDDs which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this with a merge, by:
Sorting the RDD by a key (the sort is within a group, and a group will have no more than 100 records; in the above example everything falls within one group)
Performing a merge operation similar to mergesort. Here I need to keep track of the previous value as well to find the max; still, I traverse the list only once.
Since there are too many variables here, I am getting a "Task not serializable" exception. Is this implementation approach correct? I am trying to avoid the Cartesian product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues(itmorg => {
  val miorec = itmorg.toList.sortBy(_(1).toString)
  for (r <- 0 to miorec.length) {
    for (q <- 0 to rdd2.value.length) {
      if ((miorec(r)(1).toString > rdd2.value(q).toString && miorec(r - 1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
        org.apache.spark.sql.Row(miorec(r - 1)(0), miorec(r - 1)(1), miorec(r - 1)(2), miorec(r - 1)(3), rdd2.value(q))
    }
  }
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be done trivially using mapPartitions and normal Scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling and to keep the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/value pairs from RDD1 and RDD2 and then aggregateByKey, but less efficient).
As for your exception, you would need to show the code; "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)
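A minimal sketch of that per-partition idea, under the simplifying assumption that the big table is an RDD of (name, value) pairs and the small table is broadcast as plain (key, value) pairs; the names rdd1 and smallTable are illustrative, not from the original post:
val rdd1 = sc.parallelize(Seq(("A", 3), ("B", 4), ("C", 5), ("D", 7), ("E", 8)))
val smallTable = sc.broadcast(Seq(("SUM", 2), ("WIN", 6), ("MOM", 7), ("DOM", 9), ("POM", 10)))

// Per partition: for each RDD2 key, keep the RDD1 row with the largest value that is still <= the RDD2 value.
val partial = rdd1.mapPartitions { iter =>
  val rows = iter.toList
  val best = scala.collection.mutable.HashMap.empty[String, (String, Int)]
  for ((key2, limit) <- smallTable.value; (name, v) <- rows if v <= limit)
    if (best.get(key2).forall(_._2 < v)) best(key2) = (name, v)
  Iterator(best.toMap)
}

// Merge the per-partition maxima with reduce, again keeping the max per RDD2 key.
val merged = partial.reduce { (a, b) =>
  (a.keySet ++ b.keySet).map { k =>
    k -> (a.get(k).toList ++ b.get(k).toList).maxBy(_._2)
  }.toMap
}
// merged: Map(WIN -> (C,5), MOM -> (D,7), DOM -> (E,8), POM -> (E,8))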

Index-wise most frequently occurring element

I have an array of the form
val array: Array[(Int, (String, Int))] = Array(
  (idx1, (word1, count1)),
  (idx2, (word2, count2)),
  (idx1, (word1, count1)),
  (idx3, (word3, count1)),
  (idx4, (word4, count4)))
I want to get the top 10 and bottom 10 elements from this array for each index (idx1, idx2, ...). Basically, I want the 10 most frequently occurring and the 10 least frequently occurring elements for each index value.
Please suggest how to achieve this in Spark in the most efficient way.
I have tried using for loops for each index, but this makes the program too slow and it runs sequentially.
An example would be this:
(0,("apple",1))
(0,("peas",2))
(0,("banana",4))
(1,("peas",2))
(1,("banana",1))
(1,("apple",3))
(2,("NY",3))
(2,("London",5))
(2,("Zurich",6))
(3,("45",1))
(3,("34",4))
(3,("45",6))
Suppose I take the top 2 on this set; the output would be:
(0,("banana",4))
(0,("peas",2))
(1,("apple",3))
(1,("peas",2))
(2,("Zurich",6))
(2,("London",5))
(3,("45",6))
(3,("34",4))
I also need the bottom 2 in the same way.
I understand this is equivalent to producing the whole list per index with groupByKey on (K, V) pairs and then sorting it. The operation itself is correct, but in a typical Spark environment groupByKey involves a lot of shuffle output, which can make it inefficient.
Not sure about Spark, but I think you can go with something like:
def f(array: Array[(Int, (String, Int))], n: Int) =
  array.groupBy(_._1)
    .map(pair => (
      pair._1,
      pair._2.sortBy(_._2._2).toList
    ))
    .map(pair => (
      pair._1,
      (
        pair._2.take(Math.min(n, pair._2.size)),
        pair._2.drop(Math.max(0, pair._2.size - n))
      )
    ))
The groupBy returns a map from index to the list of entries, sorted by frequency. After that, you map these entries to a pair of lists, one containing the bottom n elements and the other containing the top n elements. Note that you can replace all the named parameters with _; I used them for clarity.
This version assumes that you are always interested in computing both the top n and bottom n elements, and thus does both in a single pass. If you usually only need one of the two, it's more efficient to add the .take or .drop immediately after the toList.
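A quick usage sketch of f on the sample data from the question, with n = 2; since the sort is ascending, the first list in each pair holds the bottom n entries and the second the top n:
val sample: Array[(Int, (String, Int))] = Array(
  (0, ("apple", 1)), (0, ("peas", 2)), (0, ("banana", 4)),
  (1, ("peas", 2)), (1, ("banana", 1)), (1, ("apple", 3)),
  (2, ("NY", 3)), (2, ("London", 5)), (2, ("Zurich", 6)),
  (3, ("45", 1)), (3, ("34", 4)), (3, ("45", 6))
)

f(sample, 2).foreach { case (idx, (bottom, top)) =>
  println(s"$idx bottom=${bottom.map(_._2)} top=${top.map(_._2)}")
}
// prints, for each index, something like:
// 0 bottom=List((apple,1), (peas,2)) top=List((peas,2), (banana,4))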

Scala vector splicing (maintaining sorted sequence)

I need to maintain a sorted sequence (mutable or immutable — I don't care), dynamically inserting elements into the middle of it (to keep it sorted) and removing them likewise (so, random access by index is crucial).
The best thing I came up with is using a Vector and scala.collection.Searching from 2.11, and then:
var vector: Vector[Ordered]
...
val ip = vector.search(element)
Inserting:
vector = (vector.take(ip.insertionPoint) :+ element) ++ vector.drop(ip.insertionPoint)
Deleting:
vector.patch(from = ip.insertionPoint, patch = Nil, replaced = 1)
Doesn't look elegant to me, and I suspect performance issues. Is there a better way? Splicing sequences seems like a very basic operation to me, but I can't find an elegant solution.
You should use SortedSet. The default implementation of SortedSet is an immutable red-black tree (TreeSet). There is also a mutable implementation.
SortedSet[Int]() + 5 + 3 + 4 + 7 + 1
// SortedSet[Int] = TreeSet(1, 3, 4, 5, 7)
A Set contains no duplicate elements. In case you want to count duplicate elements, you could use a SortedMap[Key, Int] with the elements as keys and the counts as values. See this answer for MultiSet emulation using Map.
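A small sketch of that SortedMap-as-multiset idea; the insert and remove helpers are just illustrative names:
import scala.collection.immutable.SortedMap

def insert[A](m: SortedMap[A, Int], x: A): SortedMap[A, Int] =
  m.updated(x, m.getOrElse(x, 0) + 1)

def remove[A](m: SortedMap[A, Int], x: A): SortedMap[A, Int] =
  m.get(x) match {
    case Some(1) => m - x               // last copy, drop the key
    case Some(n) => m.updated(x, n - 1) // decrement the count
    case None    => m                   // not present, nothing to do
  }

val m = List(5, 3, 4, 3, 7).foldLeft(SortedMap.empty[Int, Int])((acc, x) => insert(acc, x))
// m: TreeMap(3 -> 2, 4 -> 1, 5 -> 1, 7 -> 1)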

What does "rows[0]" mean?

Hi! I am looking for a document that defines what "rows[0]" means. This is for BIRT in the Eclipse framework. Perhaps this is a JavaScript term? I dunno... I've been searching like mad and have found nothing yet. Any ideas?
rows is a shortcut to dataSet.rows. Returns the current data rows (of type DataRow[]) for the data set associated with this report item instance. If this report element has no data set, this property is undefined.
Source: http://www.eclipse.org/birt/phoenix/ref/ROM_Scripting_SPEC.pdf
Typically code like rows[x] is accessing an element inside an array. Any intro to programming book should be able to define that for you.
rows[0] would be accessing the first element in the array.
That operation has several names depending on the language, but it is generally the same concept. In Java, it's an array access expression; in C#, it's an indexer or array access operator. As with just about anything, C++ is more complicated, but basically the [] operator takes a collection of something or an array and pulls out (or assigns to) a specific numbered element in that collection or array (generally starting at 0). So in C#...
// create a list of integers
List<int> lst = new List<int>() { 1, 2, 3, 4, 5 };
// access list
int x = lst[0]; // get the first element of the list, x = 1 afterwards
x = lst[2]; // get the third element of the list, x = 3 afterwards
x = lst[4]; // get the fifth element of the list, x = 5 afterwards
x = lst[5]; // throws ArgumentOutOfRangeException