For example, I have a Scala RDD with 10,000 elements and I want to process each element one by one. How do I do that? I tried using take(i).drop(i-1), but it is extraordinarily time consuming.
According to what you said in the comments:
yourRDD.map(tuple => tuple._2.map(elem => doSomething(elem)))
The first map iterates over the tuples inside your RDD, which is why I named the variable tuple. Then, for every tuple, we take its second element with ._2 and apply an inner map that iterates over all the elements of your Iterable, which is why I named that variable elem.
doSomething() is just a placeholder for whatever function you want to apply to each element.
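For concreteness, a minimal sketch of that pattern, assuming a spark-shell session where sc is in scope, an RDD of (String, Iterable[Int]) pairs, and a hypothetical doSomething that simply doubles each element:

// Hypothetical RDD of (key, Iterable of values) pairs.
val yourRDD = sc.parallelize(Seq(
  ("a", Iterable(1, 2, 3)),
  ("b", Iterable(4, 5))
))

// Placeholder for whatever you actually want to do with each element.
def doSomething(elem: Int): Int = elem * 2

// The outer map walks the tuples; the inner map walks each tuple's Iterable.
val result = yourRDD.map(tuple => tuple._2.map(elem => doSomething(elem)))
result.collect().foreach(println)   // prints the transformed Iterables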
I understand that take(n) will return n elements of an RDD, but how does Spark decide which partitions to pull those elements from, and which elements are chosen?
Does it maintain indexes internally on the driver?
In the take(n) method of RDD, Spark starts scanning for elements from the first partition. If that partition does not contain enough elements, Spark increases the number of partitions to scan. Which elements are taken is determined by the following line:
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
The take(n) method of Scala's Iterator is documented as "Selects first ''n'' values of this iterator." (Scaladoc). So elements are selected from the front of each partition's iterator.
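As a quick illustration of that front-of-the-iterator behaviour, a small sketch assuming a spark-shell session where sc is in scope:

// 10 elements spread over 3 partitions.
val rdd = sc.parallelize(1 to 10, numSlices = 3)

// take scans partitions in order and keeps the first n elements it finds,
// so the result is the front of the RDD, not a random sample.
rdd.take(4)   // Array(1, 2, 3, 4)

Here the first partition only holds three elements, so Spark goes back for another pass over more partitions to fetch the fourth.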
In Scala I have a list of tuples, List[(String, String)]. From this list I want to find how many times each unique tuple appears.
One way to do this would be to apply groupBy(x => x) and then take the length of each group, but my data set is quite large and this is taking a lot of time.
So is there a better way of doing this?
I would do the counting manually, using a Map. Iterate over your list and build a count map as you go: the keys of the count map are the unique items from the original list, and the values are the number of occurrences of each key. If the item being processed is already in the count map, increase its value by 1; if not, add it with value 1. You can use getOrElse:
count(current_item) = count.getOrElse(current_item, 0) + 1
This should be faster than groupBy followed by a length check, and it will also require less memory.
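A minimal sketch of that approach in plain Scala, assuming the list fits comfortably in memory (the helper name countOccurrences is made up):

import scala.collection.mutable

def countOccurrences(items: List[(String, String)]): mutable.Map[(String, String), Int] = {
  val count = mutable.Map.empty[(String, String), Int]
  for (currentItem <- items) {
    // Increment the existing count, or start at 1 if the key is new.
    count(currentItem) = count.getOrElse(currentItem, 0) + 1
  }
  count
}

// countOccurrences(List(("a", "b"), ("a", "b"), ("c", "d")))
// -> Map(("a","b") -> 2, ("c","d") -> 1)

The single pass is O(n), and the map holds at most one entry per distinct tuple.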
For other suggestions, see also this discussion.
I have a set of (key, value) tuples which look like this:
X=[(('cat','mouse'),1),(('dog','rat'),20),(('hamster','skittles'),67)]
which I want to sort in order of the second item in the tuple. Pythonically I would have used:
sorted(X, key=lambda tup:tup[1])
I also want to get the key-value pair with the highest value; again, Pythonically this would be simple:
max_X=max(x[1] for x in X)
max_tuple=[x for x in X if x[1]==max_X]
However, I do not know how to translate this into a Spark job.
X.max(lambda x: x[1])
You could also do it another way, which is probably preferable if you need to sort your RDD anyway. But it is slower if you don't need the RDD sorted, because sorting takes longer than just asking for the max. (So, in a vacuum, use the max function.)
X.sortBy(lambda x: x[1], False).first()
This sorts as you did before, but passing False sorts in descending order. Then first() takes the first element, which is the largest.
Figured it out in the 2 minutes since posting!
X.sortBy(lambda x:x[1]).collect()
I am trying to find a data structure that supports a constant-time lookup and can then scan the subsequent sorted elements from that point until an end element is reached. Basically a linear scan over a sorted set, but instead of starting from the first element it should start from a specific element so I can scan a range efficiently. TreeMap might be the right data structure for this; correct me if I'm wrong.
I am trying to use its def slice(from: Int, until: Int): TreeMap[A, B] and supply the from and until values as the indexOf the elements where the scan should start and end. However, I can't find a method to get the indexOf a TreeMap element based on its key. I'm sure it exists internally, but is it exposed somewhere? Also, what is the performance of this method? Is it really better than doing a linear scan from the first element?
I think you are looking for TreeMap.from(), TreeMap.iteratorFrom(), or TreeMap.range().
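A small sketch with scala.collection.immutable.TreeMap (the keys and values here are made up):

import scala.collection.immutable.TreeMap

val m = TreeMap(1 -> "a", 3 -> "b", 5 -> "c", 7 -> "d", 9 -> "e")

// All entries whose key lies in [3, 7): a key-based range, no index needed.
val slice = m.range(3, 7)   // TreeMap(3 -> "b", 5 -> "c")

// Lazily walk the entries starting from a given key, stopping whenever you like.
val upToSeven = m.iteratorFrom(3).takeWhile { case (k, _) => k <= 7 }.toList
// List((3, "b"), (5, "c"), (7, "d"))

Both locate the starting key by descending the tree, i.e. in logarithmic time, rather than scanning from the first element.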
I need to write a function that takes a List[Int], a start index and an end index, and returns the max element within that range. I have a working recursive solution, but I was wondering if it's possible to do it using a built-in function from the Scala collection library.
Additional criteria are:
1) O(N) run time
2) No intermediate structure creation
3) No mutable data structures
def sum(input: List[Int], startIndex: Int, endIndex: Int): Int = ???
This is easily possible with Scala's views:
def subListMax[A](list: List[A], start: Int, end: Int)(implicit cmp: Ordering[A]): A =
list.view(start, end).max
view does not create an intermediate data structure; it only provides an interface into a slice of the original one. See the Views chapter in the documentation for the Scala collections library.
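For example, a quick usage sketch (note that end is exclusive here, like slice):

subListMax(List(3, 7, 2, 9, 4), 1, 4)   // elements at indices 1 until 4 are 7, 2, 9, so this returns 9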
I think it is not possible with your criteria.
There is no higher-order function known to me that works on a sublist of a list; most of them create intermediate collections.
This would be a simple O(n) solution, but it creates an intermediate structure.
input.slice(startIndex, endIndex + 1).max
There are of course also functions that traverse a collection and yield a value, such as reduce and foldLeft. The problem is that they traverse the whole collection, and you can't set a start or an end.
You could iterate from startIndex to endIndex and use foldLeft on that to get a value via indexing. An example would be:
(startIndex to endIndex).foldLeft(Integer.MIN_VALUE)((curmax, i) => input(i).max(curmax))
Here we only have a lightweight Range, which basically behaves like a for-loop and produces no heavy intermediate collection. However, iterating from startIndex to endIndex is O(n), and on each iteration we have an indexing operation (input(i)) that is itself generally O(n) on a List. So in the end, sum would not be O(n); it is closer to O(n²) in the worst case.
This is of course only my experience with Scala, and maybe I am mistaken.
Ah, and on your 3): I'm not sure what you mean by that. Using mutable state inside the function should not be a problem at all, since it only exists to compute your result and then disappears.