Sorting strings using Ordered[String] in Scala? - scala

I need a Scala equivalent of Java's compareWith(a, b) method.
I have a list of strings and I need to sort them by comparing them with each other. sortBy takes a single string and returns a score, but that's not enough in my case: I need to compare two strings with each other and return a number based on which one is better.
It seems like the only option is to write a custom case class, convert the strings to it, and then convert them back. For performance reasons, I want to avoid this, as I have a large amount of data to process.
Is there a way to do this with just the strings?

I think you are looking for sortWith.
sortWith(lt: (A, A) ⇒ Boolean): Repr
Sorts this sequence according to a comparison function.
Note: will not terminate for infinite-sized collections.
The sort is stable. That is, elements that are equal (as determined by lt) appear in the same order in the sorted sequence as in the original.
lt: the comparison function, which tests whether its first argument precedes its second argument in the desired ordering.
Example:
List("Steve", "Tom", "John", "Bob").sortWith(_.compareTo(_) < 0) =
List("Bob", "John", "Steve", "Tom")
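For a comparison that genuinely needs both strings, the function passed to sortWith can be any two-argument "comes before" test. The sketch below is only an illustration: the rule in precedes (shorter strings first, ties broken alphabetically) is a made-up assumption, not something from the question. It also shows the same rule wrapped in an Ordering via Ordering.fromLessThan, which may be handy if other APIs expect an Ordering[String].

```scala
object SortWithSketch {
  // Hypothetical rule for illustration only: shorter strings come first,
  // and strings of equal length are ordered alphabetically.
  def precedes(a: String, b: String): Boolean =
    if (a.length != b.length) a.length < b.length
    else a.compareTo(b) < 0

  val names = List("Steve", "Tom", "John", "Bob")

  // sortWith takes the two-argument "less than" test directly;
  // no wrapper class around the strings is needed.
  val byRule: List[String] = names.sortWith(precedes)

  // The same rule packaged as an Ordering, for APIs that expect one.
  val ruleOrdering: Ordering[String] = Ordering.fromLessThan[String](precedes)
  val viaOrdering: List[String] = names.sorted(ruleOrdering)

  def main(args: Array[String]): Unit = {
    println(byRule)      // List(Bob, Tom, John, Steve)
    println(viaOrdering) // same order as byRule
  }
}
```

Either form works on the strings as they are, so there is no conversion step to pay for.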

Related

How to loop through tuple in scala [duplicate]

I have a tuple in scala
val captainStuff = ("One", "Two", "Three", "Four", "Five")
How can I iterate over it with a for loop? It's easy to loop through a list or a map, but how do I loop through a tuple?
Thanks!!
You can convert it to an iterator like this:
val captainStuff = ("One", "Two", "Three", "Four", "Five")
captainStuff.productIterator.foreach(x => {
  println(x)
})
This question is a duplicate btw:
Scala: How to convert tuple elements to lists
How can I iterate over it with a for loop? It's easy to loop through a list or a map, but how do I loop through a tuple?
Lists and maps are collections. Tuples are not. Iterating (aka "looping through") really only makes sense for collections, which tuples aren't.
Tuples are product types. They are a way of grouping multiple values of different types together into a single structure. Considering that the fields of a tuple may have different types, how exactly would you iterate over it? What would be the type of your element variable?
If you are familiar with other languages, you may be familiar with the concept of records (e.g. RECORD in Pascal or struct in C). Tuples are kind of like them, except the fields don't have names. How do you iterate over a record in Pascal or a struct in C? You don't, it makes no sense.
In fact, you can think of an object as a record. Again, how do you iterate over the fields of an object? You don't, it makes no sense.
Note #1: Yes, sometimes it does make sense to iterate over the fields of an object, but only if you are doing reflective metaprogramming.
Note #2: In Scala, tuples inherit from Product, which has a non-typesafe productIterator method that gives you an Iterator[Any] which allows you to iterate over a tuple, but without type-safety. Just don't do it.
tl;dr: tuples are not collections. You simply don't iterate over them. Period. If you think you have to, you're doing something wrong, i.e. you shouldn't have a tuple but maybe an array or a list instead.
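To make the type-safety point concrete, here is a small sketch (the object and value names are hypothetical). It contrasts productIterator, whose elements are typed Any, with a List[String], which keeps the element type.

```scala
object TupleIterationSketch {
  val captainStuff = ("One", "Two", "Three", "Four", "Five")

  // productIterator is an Iterator[Any]: the String type of each field is lost,
  // so String methods are not available without a cast or pattern match.
  val untyped: List[Any] = captainStuff.productIterator.toList

  // If every field really is a String, a List keeps the element type.
  val typed: List[String] = List("One", "Two", "Three", "Four", "Five")

  def main(args: Array[String]): Unit = {
    untyped.foreach(println)                   // each element is typed Any
    typed.foreach(s => println(s.toUpperCase)) // String methods are available
  }
}
```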

Scala Count Occurrences in Large List

In Scala I have a list of tuples, List[(String, String)]. From this list I want to find how many times each unique tuple appears.
One way to do this would be to apply groupBy { x => x } and then take the length of each group. But my data set is quite large and this is taking a lot of time.
So is there any better way of doing this?
I would do the counting manually, using a Map. Iterate over your collection/list. During the iteration, build a count map. Keys in the count map are unique items from the original collection/list, values are number of occurrences of the key. If the item being processed during the iteration is in the count collection, increase its value by 1. If not, add value 1 to the count map. You can use getOrElse:
count(current_item) = count.getOrElse(current_item, 0) + 1;
This should be faster than groupBy followed by a length check, and it should also require less memory.
For other suggestions, see also this discussion.
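The count-map approach described above could look roughly like the following sketch. The names countOccurrences and countOccurrencesFold are made up for illustration, and whether this actually beats groupBy on a given data set would need to be measured.

```scala
import scala.collection.mutable

object CountSketch {
  // Single pass with a mutable map: one entry per distinct item.
  def countOccurrences[A](items: Iterable[A]): Map[A, Int] = {
    val count = mutable.Map.empty[A, Int]
    for (item <- items)
      count(item) = count.getOrElse(item, 0) + 1
    count.toMap
  }

  // Same idea without mutable state, using foldLeft over an immutable Map.
  def countOccurrencesFold[A](items: Iterable[A]): Map[A, Int] =
    items.foldLeft(Map.empty[A, Int]) { (acc, item) =>
      acc.updated(item, acc.getOrElse(item, 0) + 1)
    }

  def main(args: Array[String]): Unit = {
    val pairs = List(("a", "b"), ("c", "d"), ("a", "b"))
    println(countOccurrences(pairs))
  }
}
```

Unlike groupBy, neither version keeps a list of the grouped items around; only the counts are stored.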

How to traverse a RDD efficiently

For example, I have a Scala RDD with 10000 elements, and I want to take each element one by one and process it. How do I do that? I tried using take(i).drop(i-1), but it is extraordinarily time consuming.
According to what you said in the comments:
yourRDD.map(tuple => tuple._2.map(elem => doSomething(elem)))
The first map iterates over the tuples inside your RDD, which is why I called the variable tuple. For every tuple we then take the second element, ._2, and apply a map that iterates over all the elements of your Iterable, which is why I called the variable elem.
doSomething() is just a random function of your choice to apply on each element.
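The shape of that nested map can be tried without a cluster. The sketch below uses a plain Seq as a stand-in for the RDD (an assumption for illustration; a real RDD needs a Spark dependency and a SparkContext), and doSomething is a hypothetical placeholder.

```scala
object NestedMapSketch {
  // Stand-in for an RDD[(String, Iterable[Int])]; not an actual Spark RDD.
  val data: Seq[(String, Iterable[Int])] =
    Seq("a" -> List(1, 2, 3), "b" -> List(10, 20))

  // Hypothetical per-element function.
  def doSomething(elem: Int): Int = elem * 2

  // The outer map visits each tuple; the inner map visits each element of ._2.
  val result: Seq[Iterable[Int]] =
    data.map(tuple => tuple._2.map(elem => doSomething(elem)))

  def main(args: Array[String]): Unit =
    println(result)
}
```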

sorting a list of tuples in Pyspark

I have a set of tuple key,value pairs which look like this:
X=[(('cat','mouse'),1),(('dog','rat'),20),(('hamster','skittles'),67)]
which I want to sort in order of the second item in the tuple. Pythonically I would have used:
sorted(X, key=lambda tup:tup[1])
I also want to get the key,value pair with the highest value, again, pythonically this would be simple:
max_X=max(x[1] for x in X)
max_tuple=[x for x in X if x[1]==max_X]
however I do not know how to translate this into a spark job.
X.max(lambda x: x[1])
You could also do it another way, which is probably faster if you need to sort your RDD anyway. But this is slower if you don't need your RDD to be sorted, because sorting takes longer than just finding the max. (So, in a vacuum, use the max function.)
X.sortBy(lambda x: x[1], False).first()
This will sort as you did before, but adding the False will sort it in descending order. Then you take the first one, which will be the largest.
Figured it out in the 2 minutes since posting!
X.sortBy(lambda x:x[1]).collect()

Scala: Find max element of a sub list

I need to write a function that takes a List[Int], a start index, and an end index and returns the max element within that range. I have a working recursive solution, but I was wondering if it's possible to do it using a built-in function in the Scala collection library.
Additional criteria are:
1) O(N) run time
2) No intermediate structure creation
3) No mutable data structures
def sum(input:List[Int],startIndex:Int,endIndex:Int): Int = ???
This is easily possible with Scala's views:
def subListMax[A](list: List[A], start: Int, end: Int)(implicit cmp: Ordering[A]): A =
  list.view(start, end).max
view does not create an intermediate data structure, it only provides an interface into a slice of the original one. See the Views chapter in the documentation for the Scala collections library.
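As a hedged variation on the function above: in Scala 2.13 the two-argument view(from, until) is deprecated in favour of view.slice(from, until), so a version that compiles without warnings might look like this sketch. Note that end is exclusive here, and that max throws on an empty range.

```scala
object ViewMaxSketch {
  // end is exclusive, matching slice semantics.
  // Throws UnsupportedOperationException if the selected range is empty.
  def subListMax[A](list: List[A], start: Int, end: Int)(implicit cmp: Ordering[A]): A =
    list.view.slice(start, end).max

  def main(args: Array[String]): Unit = {
    val xs = List(3, 9, 2, 7, 5)
    println(subListMax(xs, 2, 5)) // 7
    println(subListMax(xs, 0, 2)) // 9
  }
}
```

If the end index should be inclusive, as the question seems to imply, the caller would pass end + 1.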
I think it is not possible with your criteria.
I know of no higher-order function that works directly on a sublist of a list. Many of them just create intermediate collections.
This would be a simple O(n) solution, but it creates an intermediate structure.
input.slice(startIndex, endIndex + 1).max
There are of course also functions that traverse a collection and yield a value. Examples are reduce and foldLeft. The problem is that they traverse the whole collection and you can't set a start or end.
You could iterate from startIndex to endIndex and use foldLeft on that to get a value via indexing. An example would be:
(startIndex to endIndex).foldLeft(Integer.MIN_VALUE)((curmax, i) => input(i).max(curmax))
Here we only have an iterator, which basically behaves like a for-loop and produces no heavy intermediate collections. However, iterating from startIndex to endIndex is O(n), and on each iteration we have an indexing operation (input(i)) that is also generally O(n) on List. So in the end, sum would not be O(n).
This is of course only my experience with scala and maybe I am misguided.
Ah and on your 3): I'm not sure what you mean by that. Using mutable state inside the function should not be a problem at all, since it only exists to compute your result and then disappear.
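One way around the repeated input(i) lookups, offered as a sketch rather than a definitive answer, is to fold over a sliced iterator. The list is walked once and no intermediate collection is built. The name rangeMax is hypothetical, endIndex is treated as inclusive to match the slice example above, and the iterator itself is internally stateful even though no mutable structure appears in the code.

```scala
object IteratorMaxSketch {
  // endIndex is inclusive. An empty range yields Int.MinValue.
  def rangeMax(input: List[Int], startIndex: Int, endIndex: Int): Int =
    input.iterator
      .slice(startIndex, endIndex + 1) // lazy: skips and stops while traversing
      .foldLeft(Int.MinValue)((curMax, x) => curMax.max(x))

  def main(args: Array[String]): Unit = {
    val xs = List(3, 9, 2, 7, 5)
    println(rangeMax(xs, 1, 3)) // 9
    println(rangeMax(xs, 2, 4)) // 7
  }
}
```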