How to sort on multiple columns using takeOrdered? - scala

How can I sort by two or more columns using the takeOrdered(4)(Ordering[Int]) approach in Spark/Scala?
I can achieve this with sortBy like this:
lines.sortBy(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).map(p => println(p)).take(50)
But when I try to sort using the takeOrdered approach, it fails.

tl;dr Do something like this (but consider rewriting your code to call split only once):
lines.map(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).takeOrdered(50)
Here is the explanation.
When you call takeOrdered directly on lines, the implicit Ordering that takes effect is Ordering[String] because lines is an RDD[String]. You need to transform lines into a new RDD[(Int, Int)]. Because there is an implicit Ordering[(Int, Int)] available, it takes effect on your transformed RDD.
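For instance, the built-in tuple ordering compares by the first element and breaks ties with the second, which is why negating the second column gives a descending sort on it (a quick illustration with made-up values):

List((2, -10), (1, -5), (1, -7)).sorted   // List((1,-7), (1,-5), (2,-10))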
Meanwhile, sortBy works a little differently. Here is the signature:
sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
I know that is an intimidating signature, but if you cut through the noise, you can see that sortBy takes a function that maps your original type to a new type just for sorting purposes and applies the Ordering for that return type if one is in implicit scope.
In your case, you are applying a function to the Strings in your RDD to transform them into a "view" of how Spark should treat them merely for sorting purposes, i.e. as an (Int, Int), and then relying on the fact that the implicit Ordering[(Int, Int)] is available as mentioned.
The sortBy approach allows you to keep lines intact as an RDD[String] and use the mapping just to sort while the takeOrdered approach operates on a brand new RDD containing (Int, Int) derived from the original lines. Whichever approach is more suitable for your needs depends on what you wish to accomplish.
On another note, you probably want to rewrite your code to only split your text once.
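A minimal sketch of that idea, splitting each line only once before building the sort key (the field indices are taken from the question):

lines
  .map(_.split(","))
  .map(cols => (cols(1).toInt, -cols(4).toInt))
  .takeOrdered(50)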

You could implement your custom Ordering:
lines.takeOrdered(4)(new Ordering[String] {
  override def compare(x: String, y: String): Int = {
    val xs = x.split(",")
    val ys = y.split(",")
    val d1 = xs(1).toInt - ys(1).toInt
    if (d1 != 0) d1 else ys(4).toInt - xs(4).toInt
  }
})

Related

How to generalise implementations of 'Seq[String] => Seq[Int]' and 'Iterator[String] => Iterator[Int]' for file processing?

Suppose I've got a function Seq[String] => Seq[Int], e.g. def len(as: Seq[String]): Seq[Int] = as.map(_.length). Now I would like to apply this function to a text file, e.g. transform all the file lines to numbers.
I read a text file with scala.io.Source.fromFile("/tmp/xxx.txt").getLines(), which returns an iterator.
I can use toList or to(LazyList) to "convert" the iterator to a Seq, but then I read the whole file into memory.
So I need to write another function Iterator[String] => Iterator[Int], which is essentially a copy of the Seq[String] => Seq[Int] one. Is that correct? What is the best way to avoid the duplicated code?
If you have an arbitrary function Seq[String] => Seq[Int], then
I use toList or to(LazyList) to "convert" the iterator to a Seq, but in both cases I read the whole file into memory.
is the best you can do, because the function might start by looking at the end of the Seq[String], or at its length, etc.
And Scala doesn't let you look "inside" the function and figure out "it's map(something), I can just do the same map for iterators" (there are some caveats with macros, but they aren't really useful here).
So I need to write another function Iterator[String] => Iterator[Int], which is essentially a copy of the Seq[String] => Seq[Int] one. Is that correct? What is the best way to avoid the duplicated code?
If you control the definition of the function, you can use higher-kinded types to define a single function which works for both cases. E.g. in Scala 2.13:
import scala.collection.IterableOnceOps

def len[C[A] <: IterableOnceOps[A, C, C[A]]](as: C[String]): C[Int] = as.map(_.length)
val x: Seq[Int] = len(Seq("a", "b"))
val y: Iterator[Int] = len(Iterator("a", "b"))

Handle different states

I was wondering if it was possible to maintain radically different states across an application? For example, have the update function of the first state call the one from the second state?
I do not recall going through any such example, nor did I find any counter indication... Based on the example from https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.html, I know of no reason why I wouldn't be able to have different trackStateFuncs with different States, and still update those thanks to their Key, as shown below:
def firstTrackStateFunc(batchTime: Time,
                        key: String,
                        value: Option[Int],
                        state: State[Long]): Option[(String, Long)] = {
  val sum = value.getOrElse(0).toLong + state.getOption.getOrElse(0L)
  val output = (key, sum)
  state.update(sum)
  Some(output)
}
and
def secondTrackStateFunc(batchTime: Time,
                         key: String,
                         value: Option[Int],
                         state: State[Int]): Option[(String, Int)] = {
  // disregard problems this example would cause
  val dif = value.getOrElse(0) - state.getOption.getOrElse(0)
  val output = (key, dif)
  state.update(dif)
  Some(output)
}
I think this is possible but still remain unsure. I would like someone to validate or invalidate this assumption...
I was wondering if it was possible to maintain radically different states across an application?
Every call to mapWithState on a DStream[(Key, Value)] can hold one State[T] object. This T needs to be the same for every invocation of mapWithState. In order to use different states, you can either chain mapWithState calls, where one's Option[U] is another's input, or you can split the DStream and apply a different mapWithState call to each part. You cannot, however, call a different State[T] object inside another, as they are isolated from one another, and one cannot mutate the state of the other.
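A rough sketch of the chaining approach, assuming a DStream[(String, Int)] named input and the two functions from the question (the re-keying step is an illustrative assumption, since the second function expects Int values):

import org.apache.spark.streaming.{State, StateSpec, Time}

val firstSpec  = StateSpec.function(firstTrackStateFunc _)
val secondSpec = StateSpec.function(secondTrackStateFunc _)

// First stateful pass: emits (key, runningSum) pairs.
val sums = input.mapWithState(firstSpec)

// Re-key the first pass's output to the (String, Int) shape the second
// function expects, then run the second stateful pass on it.
val diffs = sums.map { case (key, sum) => (key, sum.toInt) }.mapWithState(secondSpec)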
@Yuval gave a great answer on chaining mapWithState calls. However, here is another approach: instead of having two mapWithState calls, you can put both the sum and the diff in the same State[(Long, Int)].
In this case, you would only need one mapWithState function, in which you update both values. Something like this:
def trackStateFunc(batchTime: Time,
                   key: String,
                   value: Option[Int],
                   state: State[(Long, Int)]): Option[(String, (Long, Int))] = {
  val (prevSum, prevDiff) = state.getOption.getOrElse((0L, 0))
  val sum = value.getOrElse(0).toLong + prevSum
  val diff = value.getOrElse(0) - prevDiff
  val output = (key, (sum, diff))
  state.update((sum, diff))
  Some(output)
}
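For completeness, a hypothetical wiring of this single combined function, assuming the same DStream[(String, Int)] named input as in the sketch above:

val combined = input.mapWithState(StateSpec.function(trackStateFunc _))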

iterative lookup from within rdd.map in scala

def retrieveindex(stringlist: List[String], lookuplist: List[String]) =
  stringlist.foreach(y => lookuplist.indexOf(y))
is my function.
I am trying to use this within an rdd like this:
val libsvm = libsvmlabel.map(x =>
  Array(x._2._2, retrieveindex(x._2._1.toList, featureSet.toList)))
However, I am getting an output that is empty. There is no error, but the output from retrieveindex is empty. When I use println to see if I am retrieving correctly, I do see the indices printed. Is there any way to do this? Should I first 'distribute' the function to all the workers? I am a newbie.
retrieveindex has a return type of Unit (because foreach just applies a function (String) ⇒ Unit to each element for its side effects). Therefore it does not map to anything.
You probably want it to return the list of indices, like:
def retrieveindex(stringlist: List[String], lookuplist: List[String]): List[Int] =
  stringlist.map(y => lookuplist.indexOf(y))
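A quick sanity check of the fixed function, with made-up values:

retrieveindex(List("b", "c"), List("a", "b", "c"))  // List(1, 2)

It may also be worth computing featureSet.toList once outside the map rather than rebuilding it for every record.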

How to combine Maps with different value types in Scala

I have the following code which is working:
case class Step() {
  def bindings(): Map[String, Any] = ???
}

class Builder {
  private val globalBindings = scala.collection.mutable.HashMap.empty[String, Any]
  private val steps = scala.collection.mutable.ArrayBuffer.empty[Step]

  private def context: Map[String, Any] =
    globalBindings.foldLeft(Map[String, Any]())((l, r) => l + r) ++
      Map[String, Any]("steps" -> steps.foldLeft(Vector[Map[String, Any]]())((l, r) => l.+:(r.bindings)))
}
But I think it could be simplified so as to not need the first foldLeft in the 'context' method.
The desired result is to produce a map where the entry values are either a String, an object upon which toString will be invoked later, or a function which returns a String.
Is this the best I can do with Scala's type system or can I make the code clearer?
TIA
First of all, the toMap method on mutable.HashMap returns an immutable.Map. You can also use map instead of the inner foldLeft together with toVector if you really need a vector, which might be unnecessary. Finally, you can just use + to add the desired key-value pair of "steps" to the map.
So your whole method body could be:
globalBindings.toMap + ("steps" -> steps.map(_.bindings).toVector)
I'd also note that you should be wary of types like Map[String, Any] in Scala. Much of Scala's power comes from its type system, which can be used to great effect in situations like this, so such types are often considered unidiomatic. Of course, there are situations where this approach makes the most sense, and without more context it is hard to tell whether that is the case here.
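As a purely illustrative sketch (not part of the original answer), the three value shapes described in the question could be modelled without Any, so that the context map becomes a Map[String, Binding]:

sealed trait Binding
final case class Text(value: String) extends Binding             // a plain String
final case class Renderable(value: Any) extends Binding          // an object whose toString is invoked later
final case class Deferred(render: () => String) extends Binding  // a function returning a String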

Does Scala have syntax for 0- and 1-tuples?

scala> val two = (1,2)
two: (Int, Int) = (1,2)
scala> val one = (1,)
<console>:1: error: illegal start of simple expression
val one = (1,)
^
scala> val zero = ()
zero: Unit = ()
Is this:
val one = Tuple1(5)
really the most concise way to write a singleton tuple literal in Scala? And does Unit work like an empty tuple?
Does this inconsistency bother anyone else?
really the most concise way to write a singleton tuple literal in Scala?
Yes.
And does Unit work like an empty tuple?
No, since it does not implement Product.
Does this inconsistency bother anyone else?
Not me.
It really is the most concise way to write a tuple with an arity of 1.
In the comments above I see many references to "why Tuple1 is useful".
Tuples in Scala extend the Product trait, which lets you iterate over the tuple members.
One can implement a method that takes a parameter of type Product; tuples are the generic way to iterate over a fixed-size group of values of different types without losing type information, and Tuple1 is the only way to pass a single value into such a method.
There are other reasons for using Tuple1, but this is the most common use-case that I had.
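A small illustrative sketch of that Product-based use case (the method name is made up for the example):

def describe(p: Product): String =
  p.productIterator.mkString("(", ", ", ")")

describe((1, "a", true))   // "(1, a, true)"
describe(Tuple1(42))       // "(42)" -- a single value still fits the same API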
I have never seen a single use of Tuple1. Nor can I imagine one.
In Python, where people do use it, tuples are fixed-size collections. Tuples in Scala are not collections; they are Cartesian products of types. So an Int x Int is a Tuple2[Int, Int], or (Int, Int) for short. Naturally, a product of a single Int is just an Int, and a product of no types at all is meaningless.
The previous answers cover a valid Tuple of one element.
For one with zero elements, this code could work:
object tuple0 extends AnyRef with Product {
  def productArity = 0
  def productElement(n: Int) = throw new IllegalStateException("No element")
  def canEqual(that: Any) = false
}
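It behaves like an empty product when used through the same Product API:

tuple0.productArity              // 0
tuple0.productIterator.isEmpty   // true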