Skip multiple iterations in Scala for loop - scala

for ((c,i) <- collection.zip(0 until collection.length)) c match ???
I want to iterate over a collection with indexes. If c matches some pattern, I want to skip multiple iterations (how many depends on what it matched).
Is there a way to do this or a better way to iterate like this?
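One possible approach is to drop the for comprehension and advance the index yourself, for instance with a tail-recursive helper. The sketch below is only an illustration: the step function and its '#' rule are hypothetical stand-ins for "what it matched", and the collection is assumed to be an IndexedSeq so that indexing is cheap.

```scala
import scala.annotation.tailrec

object SkipDemo {
  // Hypothetical rule: how far to advance the index after seeing `c`.
  def step(c: Char): Int = c match {
    case '#' => 3 // jump over the next two elements
    case _   => 1 // just move on to the next element
  }

  // Walks the collection by index and returns the (element, index) pairs actually visited.
  def visit(collection: IndexedSeq[Char]): List[(Char, Int)] = {
    @tailrec
    def loop(i: Int, acc: List[(Char, Int)]): List[(Char, Int)] =
      if (i >= collection.length) acc.reverse
      else loop(i + step(collection(i)), (collection(i), i) :: acc)
    loop(0, Nil)
  }
}
```

For the input "a#bcd".toVector this visits indexes 0, 1 and 4: the '#' at index 1 causes indexes 2 and 3 to be skipped.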

Related

Scala Count Occurrences in Large List

In Scala I have a list of tuples List[(String, String)]. So now from this list I want to find how many times each unique tuple appears in the list.
One way to do this would be to apply groupBy { x => x } and then take the length of each group. But my data set is quite large and this is taking a lot of time.
So is there any better way of doing this?
I would do the counting manually, using a mutable Map. Iterate over your collection/list and build a count map as you go. Keys in the count map are the unique items from the original collection/list; values are the number of occurrences of each key. If the item being processed is already in the count map, increase its value by 1. If not, add it with value 1. You can use getOrElse:
count(current_item) = count.getOrElse(current_item, 0) + 1;
This should be faster than groupBy followed by a length check, and it will also require less memory.
For other suggestions, see also this discussion.
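The counting approach described above could look roughly like this. It is a minimal sketch: the name countOccurrences is made up for illustration, and the mutable map stays local to the function so callers only see an immutable result.

```scala
import scala.collection.mutable

object CountDemo {
  // Counts occurrences in a single pass, keeping one map entry per distinct item.
  def countOccurrences[A](items: Iterable[A]): Map[A, Int] = {
    val count = mutable.Map.empty[A, Int]
    for (item <- items)
      count(item) = count.getOrElse(item, 0) + 1 // 0 when the item has not been seen yet
    count.toMap
  }
}
```

For example, CountDemo.countOccurrences(List(("a", "b"), ("c", "d"), ("a", "b"))) yields a map in which ("a", "b") has count 2 and ("c", "d") has count 1.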

How to sort Array[Row] by given column index in Scala

How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect(), which gives me an Array[Row], but I want to sort it based on a given column index.
I have already used quick-sort logic and it's working, but it involves too many for loops.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it: once you collect, you lose the distributed (and parallel) computation. You can use the DataFrame's sort, for example in ascending order by column "col1":
import org.apache.spark.sql.functions.asc
val sorted = dataframe.sort(asc("col1"))
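If the rows have already been collected, the built-in sortBy on arrays is one option for sorting by a column index. The sketch below does not use Spark at all: FakeRow is a hypothetical stand-in for Row so that the example runs on its own. With a real Row you would presumably read the column with a typed getter such as row.getInt(columnIndex), which assumes the column actually holds integers.

```scala
object SortRowsDemo {
  // Stand-in for Spark's Row so the sketch runs without Spark: each row is a Vector[Int].
  type FakeRow = Vector[Int]

  // Returns a new array with the rows in ascending order of the value at `columnIndex`.
  def sortByColumn(rows: Array[FakeRow], columnIndex: Int): Array[FakeRow] =
    rows.sortBy(row => row(columnIndex))
}
```

Sorting Array(Vector(1, 30), Vector(2, 10), Vector(3, 20)) by column index 1 puts the row containing 10 first and the row containing 30 last.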

How to traverse a RDD efficiently

For example, I have a Scala RDD with 10000 elements, and I want to process each element one by one. How do I do that? I tried using take(i).drop(i-1), but it is extraordinarily time-consuming.
According to what you said in the comments:
yourRDD.map(tuple => tuple._2.map(elem => doSomething(elem)))
The outer map iterates over the tuples in your RDD (hence the variable name tuple). For each tuple we take the second element, ._2, and apply an inner map that iterates over all the elements of your Iterable (hence the variable name elem).
doSomething() stands for whatever function you want to apply to each element.

Scala: Find max element of a sub list

I need to write a function that takes a List[Int], a start index and an end index, and returns the max element within that range. I have a working recursive solution, but I was wondering if it's possible to do it using a built-in function in the Scala collection library.
Additional criteria are:
1) O(N) run time
2) No intermediate structure creation
3) No mutable data structures
def sum(input:List[Int],startIndex:Int,endIndex:Int): Int = ???
This is easily possible with Scala's views:
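// end is exclusive; .view.slice(start, end) also works on Scala 2.13, where .view(start, end) is deprecated
def subListMax[A](list: List[A], start: Int, end: Int)(implicit cmp: Ordering[A]): A =
  list.view.slice(start, end).max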
view does not create an intermediate data structure; it only provides an interface into a slice of the original one. See the Views chapter in the documentation for the Scala collections library.
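As a usage sketch of the view-based idea: the question's end index is inclusive, while the upper bound of slice is exclusive, and max fails on an empty range. The hypothetical wrapper below (maxInRange is not from the answer) adjusts the bound and returns an Option instead of throwing.

```scala
object ViewMaxDemo {
  // Inclusive bounds, as in the question; None when the range selects no elements.
  def maxInRange[A](list: List[A], startIndex: Int, endIndex: Int)(implicit cmp: Ordering[A]): Option[A] = {
    val window = list.view.slice(startIndex, endIndex + 1) // slice's upper bound is exclusive
    if (window.isEmpty) None else Some(window.max)
  }
}
```

For List(3, 9, 2, 7, 5) with indexes 2 to 4 this returns Some(7); for a range that lies entirely past the end of the list it returns None.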
I think it is not possible with your criteria.
I don't know of any higher-order function that works on a sublist of a list; most of them just create intermediate collections.
This would be a simple O(n) solution, but it creates an intermediate structure:
input.slice(startIndex, endIndex + 1).max
There are of course also functions that traverse a collection and yield a value, such as reduce and foldLeft. The problem is that they traverse the whole collection, and you can't set a start or end.
You could iterate from startIndex to endIndex and use foldLeft on that range to get a value via indexing. For example:
(startIndex to endIndex).foldLeft(Integer.MIN_VALUE)((curmax, i) => input(i).max(curmax))
Here we only have a Range, which basically behaves like a for loop and produces no heavy intermediate collections. However, iterating from startIndex to endIndex is O(n), and on each iteration we have an indexing operation (input(i)) that is itself generally O(n) on a List. So in the end, sum would be O(n^2) in the worst case rather than O(n).
This is of course only my experience with Scala, and maybe I am misguided.
As for your 3): I'm not sure what you mean by that. Using mutable state inside the function should not be a problem at all, since it only exists to compute your result and then disappears.
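One way to keep the foldLeft idea while avoiding the repeated input(i) lookups might be to fold over a sliced iterator, which walks the list once and does not copy elements. This is a sketch under assumptions: maxInRange is a hypothetical name, the bounds are inclusive as in the question, and whether an iterator counts as mutable state under criterion 3 is a judgment call.

```scala
object IteratorMaxDemo {
  // Single pass over the list; the iterator skips to startIndex and stops after endIndex.
  def maxInRange(input: List[Int], startIndex: Int, endIndex: Int): Int =
    input.iterator
      .slice(startIndex, endIndex + 1) // upper bound of slice is exclusive
      .foldLeft(Int.MinValue)((curMax, x) => x.max(curMax))
}
```

For List(3, 9, 2, 7, 5) with indexes 2 to 4 this returns 7. Note that an empty range returns Int.MinValue, the seed of the fold, rather than failing.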

Use distinct and skip in a query

I tried running this:
db.col.find().skip(5).distinct("field1")
But it throws an error.
How to use them together?
I can use aggregation but results are different:
db.col.aggregate([{$group:{_id:'$field1'}}, {$skip:3},{$sort:{"field1":1}}])
What I want is the links in sorted order, i.e. numbers should come first, then capital letters, and then small letters.
The distinct method must be run on a collection, not on a cursor, and it returns an array. See:
http://docs.mongodb.org/manual/reference/method/db.collection.distinct/
So you can't use skip after distinct.
Maybe you should use this query:
db.col.aggregate([{$group:{_id:'$field1'}}, {$skip:3},{$sort:{"_id":1}}])
because the field field1 will not exist in the result after the first grouping stage.
Also, I think you should sort first and then skip, because your query skips 3 unsorted results and then sorts the remaining ones.
(If you provide more information about the structure of your documents and the output you want, it will be clearer and I will correct the answer accordingly.)