Iterative lookup from within rdd.map in Scala

def retrieveindex(stringlist: List[String], lookuplist: List[String]) =
  stringlist.foreach(y => lookuplist.indexOf(y))
is my function.
I am trying to use this within an rdd like this:
val libsvm = libsvmlabel.map(x =>
  Array(x._2._2, retrieveindex(x._2._1.toList, featureSet.toList)))
However, the output from retrieveindex is empty; there is no error. When I use println to check, I do see the indices printed. Is there any way to do this? Should I first 'distribute' the function to all the workers? I am a newbie.

retrieveindex has a return type of Unit (because foreach just applies a function (String) ⇒ Unit to each element for its side effects). Therefore it does not map to anything.
You probably want it to return the list of indices, like:
def retrieveindex(stringlist: List[String], lookuplist: List[String]): List[Int] =
  stringlist.map(y => lookuplist.indexOf(y))
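With that fix, the original map can carry the indices through. A minimal sketch, assuming the tuple shape of libsvmlabel from the question; featureSet.toList is hoisted out of the closure so the conversion happens once rather than per record, and a tuple replaces the mixed-type Array:

val features = featureSet.toList // convert once; captured by the closure sent to workers

val libsvm = libsvmlabel.map { x =>
  (x._2._2, retrieveindex(x._2._1.toList, features))
}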

Related

How to generalise implementations of 'Seq[String] => Seq[Int]' and 'Iterator[String] => Iterator[Int]' for file processing?

Suppose I've got a function Seq[String] => Seq[Int], e.g. def len(as: Seq[String]): Seq[Int] = as.map(_.length). Now I would like to apply this function to a text file, e.g. transform all the file lines to numbers.
I read a text file with scala.io.Source.fromFile("/tmp/xxx.txt").getLines, which returns an iterator.
I can use toList or to(LazyList) to "convert" the iterator to a Seq, but then I read the whole file into memory.
So I need to write another function, Iterator[String] => Iterator[Int], which is effectively a copy of Seq[String] => Seq[Int]. Is that correct? What is the best way to avoid the duplicated code?
If you have an arbitrary function Seq[String] => Seq[Int], then
I can use toList or to(LazyList) to "convert" the iterator to a Seq, but then I read the whole file into memory.
is the best you can do, because the function can start by looking at the end of the Seq[String], or its length, etc.
And Scala doesn't let you look "inside" the function and figure out "it's map(something), I can just do the same map for iterators" (there are some caveats with macros, but not really useful here).
So I need to write another function, Iterator[String] => Iterator[Int], which is effectively a copy of Seq[String] => Seq[Int]. Is that correct? What is the best way to avoid the duplicated code?
If you control the definition of the function, you can use higher-kinded types to define a single function that works for both cases. E.g. in Scala 2.13:
import scala.collection.IterableOnceOps

def len[C[A] <: IterableOnceOps[A, C, C[A]]](as: C[String]): C[Int] = as.map(_.length)
val x: Seq[Int] = len(Seq("a", "b"))
val y: Iterator[Int] = len(Iterator("a", "b"))
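Applied to the file-reading case from the question, the generic len stays lazy when given the iterator from getLines(). A minimal sketch, assuming a plain text file at /tmp/xxx.txt:

import scala.io.Source

val source = Source.fromFile("/tmp/xxx.txt")
try {
  // Iterator in, Iterator out: each line is transformed on demand,
  // so the whole file is never materialized in memory
  val lengths: Iterator[Int] = len(source.getLines())
  lengths.foreach(println)
} finally source.close()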

Protobuf to Scala conversion

I have in protobuf
message ResultsPb {
  repeated int32 Result = 1;
}
and the corresponding in scala
Results: List[Int]
I’m new to this and I’m having a hard time finding the proper way to convert from one to the other. Here is what I've come up with so far, but I'm not at all sure about it. The first def doesn't compile; the second one does.
def toResults(resultsPb: Option[ResultsPb]): List[Int] = {
  List[Int](resultsPb.Result)
}
def fromResults(results: List[Int]): Option[ResultsPb] = {
  Some(ResultsPb(results.toSeq))
}
Any help would be greatly appreciated.
Thanks
def toResults(resultsPb: Option[ResultsPb]): List[Int] = {
  resultsPb.toList.flatMap(_.Result)
}
def fromResults(results: List[Int]): Option[ResultsPb] = {
  Some(results).filter(_.nonEmpty).map(t => ResultsPb(t))
}
Update
So, to convert the optional proto class to its payload you need to:
check that the Option is not empty
extract the data from the class
This can be done with pattern matching, as Bob proposed, or more simply with flatMap (but first you need to convert the Option to a Seq/List to satisfy the type checker).
In the second method, you can again use an if statement (checking whether the List is empty), pattern matching (checking the same with "case _ :: _ => ... case Nil => ...") or, as I proposed, filter out an empty result and get None automatically in that case.
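For reference, a minimal sketch of the pattern-matching variants described above; the ResultsPb case class and its Result accessor are assumed from the question:

def toResults(resultsPb: Option[ResultsPb]): List[Int] = resultsPb match {
  case Some(pb) => pb.Result.toList // extract the repeated field
  case None     => Nil
}

def fromResults(results: List[Int]): Option[ResultsPb] = results match {
  case _ :: _ => Some(ResultsPb(results)) // non-empty list
  case Nil    => None
}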

Scala Future[MyType] to MyType

I have a Future[MyType] and I need to pass the value of MyType to a method which returns Seq[Future[MyType]], so the basic shape of the problem is:
val a: Seq[Future[MyType]] = ...
def getValue(t: MyType): Seq[Future[MyType]] = {...}
I want to pass value of a to getValue. I tried something like:
val b: Seq[Future[MyType]] = a.map { v => getValue(v) }
I want b to be of type Seq[Future[MyType]], but it obviously didn't work, since v is of type Future[MyType] while getValue takes a plain MyType as its parameter. What could be a possible solution?
You can do:
val b = a.map(_.map(getValue(_)))
This will give you a Seq[Future[Seq[Future[MyType]]]]. That's pretty ugly. There are three tools that can make that better.
Future.sequence takes a Seq[Future[A]] and gives you a Future[Seq[A]]. The output future will wait for all input futures to complete before giving a result. This might not always be what you want.
flatMap on a Future takes a function that itself returns a Future, but does not produce a nested Future, as .map would.
You can call .flatten on a Seq[Seq[A]] to get a Seq[A].
Putting this all together, you could do something like:
val b: Seq[Future[Seq[MyType]]] = a.map(_.flatMap(x => Future.sequence(getValue(x))))
val c: Future[Seq[MyType]] = Future.sequence(b).map(_.flatten)
More generally, when dealing with "container" types, you'll use some combination of map and flatMap to get at the inner types and pass them around. And common containers often have ways to flatten or swap orders, e.g. A[A[X]] => A[X] or A[B[X]] => B[A[X]].
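Put together as a self-contained sketch, with MyType and getValue as stand-ins and the global ExecutionContext used just for illustration:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

case class MyType(n: Int)

// Stand-in: fan each value out into several futures
def getValue(t: MyType): Seq[Future[MyType]] =
  Seq(Future(MyType(t.n + 1)), Future(MyType(t.n + 2)))

val a: Seq[Future[MyType]] = Seq(Future(MyType(0)), Future(MyType(10)))

val b: Seq[Future[Seq[MyType]]] = a.map(_.flatMap(x => Future.sequence(getValue(x))))
val c: Future[Seq[MyType]] = Future.sequence(b).map(_.flatten)

// Await is used here only to show the result; avoid blocking in production code
println(Await.result(c, 5.seconds)) // List(MyType(1), MyType(2), MyType(11), MyType(12))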

How to sort on multiple columns using takeOrdered?

How do I sort by two or more columns using the takeOrdered(4)(Ordering[Int]) approach in Spark-Scala?
I can achieve this using sortBy, like this:
lines.sortBy(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).map(p => println(p)).take(50)
But when I try to sort using the takeOrdered approach, it fails.
tl;dr Do something like this (but consider rewriting your code to call split only once):
lines.map(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).takeOrdered(50)
Here is the explanation.
When you call takeOrdered directly on lines, the implicit Ordering that takes effect is Ordering[String] because lines is an RDD[String]. You need to transform lines into a new RDD[(Int, Int)]. Because there is an implicit Ordering[(Int, Int)] available, it takes effect on your transformed RDD.
Meanwhile, sortBy works a little differently. Here is the signature:
sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
I know that is an intimidating signature, but if you cut through the noise, you can see that sortBy takes a function that maps your original type to a new type just for sorting purposes and applies the Ordering for that return type if one is in implicit scope.
In your case, you are applying a function to the Strings in your RDD to transform them into a "view" of how Spark should treat them merely for sorting purposes, i.e. as an (Int, Int), and then relying on the fact that the implicit Ordering[(Int, Int)] is available as mentioned.
The sortBy approach allows you to keep lines intact as an RDD[String] and use the mapping just to sort while the takeOrdered approach operates on a brand new RDD containing (Int, Int) derived from the original lines. Whichever approach is more suitable for your needs depends on what you wish to accomplish.
On another note, you probably want to rewrite your code to only split your text once.
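For example, splitting once per line might look like this (a sketch against the lines RDD from the question):

val top50 = lines
  .map { line =>
    val cols = line.split(",") // split once, reuse both columns
    (cols(1).toInt, -cols(4).toInt)
  }
  .takeOrdered(50)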
You could implement your custom Ordering:
lines.takeOrdered(4)(new Ordering[String] {
  override def compare(x: String, y: String): Int = {
    val xs = x.split(",")
    val ys = y.split(",")
    val d1 = xs(1).toInt - ys(1).toInt
    if (d1 != 0) d1 else ys(4).toInt - xs(4).toInt
  }
})
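Equivalently, the custom Ordering can be built with Ordering.by plus the implicit tuple ordering, which reads a bit more directly (a sketch; it still splits inside every comparison):

val byColumns: Ordering[String] = Ordering.by { line: String =>
  val cols = line.split(",")
  (cols(1).toInt, -cols(4).toInt) // ascending on column 1, descending on column 4
}
lines.takeOrdered(4)(byColumns)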

nested "if"/"match" statement in scala: better approach?

I have a series of validation functions, each of which returns an Option[Problem]: Some if a validation problem is found, or None otherwise.
I would like to write a simple function that calls each validation function in turn, stopping at and returning the first non-None result.
Naturally I could write this method in the "Java style", but I would like to know if a better approach exists.
EDIT
This was the original Java solution:
validate01(arg);
validate02(arg);
validate03(arg);
...
Each method throws an exception in case of a problem. I would like to stay away from exceptions while writing Scala.
As an example, let's say we want to validate a String. Our validation function takes a String and a list of validators, which are functions from String to Option[Problem]. We can implement it in a functional manner like this:
def firstProblem(validators: List[String => Option[Problem]], s: String) =
  validators.view.flatMap(_(s)).headOption
This applies each validation function to the string, keeps the result only if it is a Some, and takes the first such element. Because of the call to view, the sequence is evaluated lazily, so as soon as the first Problem is found, no further validators are called.
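For example, with a hypothetical Problem type and two toy validators:

case class Problem(message: String)

val nonEmpty: String => Option[Problem] =
  s => if (s.isEmpty) Some(Problem("empty string")) else None
val maxLen: String => Option[Problem] =
  s => if (s.length > 10) Some(Problem("too long")) else None

firstProblem(List(nonEmpty, maxLen), "")      // Some(Problem(empty string))
firstProblem(List(nonEmpty, maxLen), "hello") // None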
If you have a finite number of validations, known at compile time, you may use .orElse on Options:
def foo(x: Int): Option[Problem] = ...
def bar(x: Int): Option[Problem] = ...
...
def baz(x: Int): Option[Problem] = ...
foo(1) orElse bar(2) orElse .. baz(n)
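With concrete (hypothetical) validators the chain might read:

def tooShort(s: String): Option[Problem] =
  if (s.length < 3) Some(Problem("too short")) else None
def hasDigit(s: String): Option[Problem] =
  if (s.exists(_.isDigit)) Some(Problem("contains a digit")) else None

tooShort("ab") orElse hasDigit("a1c") // Some(Problem(too short)); orElse is by-name, so hasDigit is never evaluated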
Maybe you want the following, assuming the validation functions take no arguments:
def firstProblem(fs: (() => Option[Problem])*) = {
  fs.iterator.map(f => f()).find(_.isDefined).flatten
}
You'll get the first defined Option[Problem] if there is one, or None if all validations succeed. If you need to pass arguments to the functions, then you need to explain what those arguments are. For example, you could write
def firstProblem[A](a: A)(fs: (A => Option[Problem])*) =
  fs.iterator.map(f => f(a)).find(_.isDefined).flatten
if you can pass the same argument to all of them. You would use it like this:
firstProblem(myData)(
  validatorA,
  validatorB,
  validatorC,
  validatorD
)