Spark Cassandra Connector: for comprehension error (type mismatch) - scala

Problem
Maybe this is due to my lack of Scala knowledge, but it seems like adding another level to the for comprehension should just work. If the first for comprehension line is commented out, the code works. I ultimately want a Set[Int] instead of '1 to 2', but it serves to show the problem. The first two lines of the for should not need a type specifier, but I include it to show that I've tried the obvious.
Tools/Jars
IntelliJ 2016.1
Java 8
Scala 2.10.5
Cassandra 3.x
spark-assembly-1.6.0-hadoop2.6.0.jar (pre-built)
spark-cassandra-connector_2.10-1.6.0-M1-SNAPSHOT.jar (pre-built)
spark-cassandra-connector-assembly-1.6.0-M1-SNAPSHOT.jar (I built)
Code
case class NotifHist(intnotifhistid:Int, eventhistids:Seq[Int], yosemiteid:String, initiatorname:String)
case class NotifHistSingle(intnotifhistid:Int, inteventhistid:Int, dataCenter:String, initiatorname:String)
object SparkCassandraConnectorJoins {
def joinQueryAfterMakingExpandedRdd(sc:SparkContext, orgNodeId:Int) {
val notifHist:RDD[NotifHistSingle] = for {
orgNodeId:Int <- 1 to 2 // comment out this line and it works
notifHist:NotifHist <- sc.cassandraTable[NotifHist](keyspace, "notifhist").where("intorgnodeid = ?", orgNodeId)
eventHistId <- notifHist.eventhistids
} yield NotifHistSingle(notifHist.intnotifhistid, eventHistId, notifHist.yosemiteid, notifHist.initiatorname)
...etc...
}
Compilation Output
Information:3/29/16 8:52 AM - Compilation completed with 1 error and 0 warnings in 1s 507ms
/home/jpowell/Projects/SparkCassandraConnector/src/com/mir3/spark/SparkCassandraConnectorJoins.scala
**Error:(88, 21) type mismatch;
found : scala.collection.immutable.IndexedSeq[Nothing]
required: org.apache.spark.rdd.RDD[com.mir3.spark.NotifHistSingle]
orgNodeId:Int <- 1 to 2
^**
Later
#slouc Thanks for the comprehensive answer. I was using the for comprehension's syntactic sugar to also keep state from the second statement to fill elements in the NotifHistSingle ctor, so I don't see how to get the equivalent map/flatmap to work. Therefore, I went with the following solution:
def joinQueryAfterMakingExpandedRdd(sc:SparkContext, orgNodeIds:Set[Int]) {
def notifHistForOrg(orgNodeId:Int): RDD[NotifHistSingle] = {
for {
notifHist <- sc.cassandraTable[NotifHist](keyspace, "notifhist").where("intorgnodeid = ?", orgNodeId)
eventHistId <- notifHist.eventhistids
} yield NotifHistSingle(notifHist.intnotifhistid, eventHistId, notifHist.yosemiteid, notifHist.initiatorname)
}
val emptyTable:RDD[NotifHistSingle] = sc.emptyRDD[NotifHistSingle]
val notifHistForAllOrgs:RDD[NotifHistSingle] = orgNodeIds.foldLeft(emptyTable)((accum, oid) => accum ++ notifHistForOrg(oid))
}

For comprehension is actually syntax sugar; what's really going on underneath is a series of chained flatMap calls, with a single map at the end which replaces yield. Scala compiler translates every for comprehension like this. If you use if conditions in your for comprehension, they are translated into filters, and if you don't yield anything foreach is used. For more information, see here.
So, to explain on your case - this:
val notifHist:RDD[NotifHistSingle] = for {
orgNodeId:Int <- 1 to 2 // comment out this line and it works
notifHist:NotifHist <- sc.cassandraTable[NotifHist](keyspace, "notifhist").where("intorgnodeid = ?", orgNodeId)
eventHistId <- notifHist.eventhistids
} yield NotifHistSingle(...)
is actually translated by the compiler to this:
val notifHist:RDD[NotifHistSingle] = (1 to 2)
.flatMap(x => sc.cassandraTable[NotifHist](keyspace, "notifhist").where("intorgnodeid = ?", x)
.flatMap(x => x.eventhistids)
.map(x => NotifHistSingle(...))
You are getting the error if you include the 1 to 2 line because that makes your for comprehension operate on a sequence (vector, to be more precise). So when invoking flatMap(), compiler expects you to follow up with a function that transforms each element of your vector to a GenTraversableOnce. If you take a closer look at the type of your for expression (most IDEs will display it just by hovering over it) you can see it for yourself:
def flatMap[B, That](f: A => GenTraversableOnce[B])(implicit bf: CanBuildFrom[Repr, B, That]): That
This is the problem. Compiler doesn't know how to flatMap the vector 1 to 10 using a function that returns CassandraRDD. It wants a function that returns GenTraversableOnce. If you remove the 1 to 2 line then you remove this restriction.
Bottom line - if you want to use a for comprehension and yield values out of it, you have to obey the type rules. It's impossible to flatten a sequence consisting of elements which are not sequences and cannot be turned into sequences.
You can always map instead of flatMap since map is less restrictive (it requires A => B instead of A => GenTraversableOnce[B]). This means that instead of getting all results in one giant sequence, you will get a sequence where each element is a group of results (one group for each query). You can also play around the types, trying to get a GenTraversableOnce from your query result (e.g. invoking sc.cassandraTable().where().toArray or something; I don't really work with Cassandra so I don't know).

Related

Monadic way to get first Right to result from getting an Either from items of a list?

Up front: I know how to just write a custom function that will do this, but I swear there's a built-in thing whose name I'm just forgetting, to handle it in an idiomatic way. (Also, in my actual use case I'm likely to be using more complex monads involving state and assorted nonsense, and I feel like the answer I'm looking for will handle those as well, while the hand-coded one would need to be updated.)
I have a list items : List[A] and a function f : (A) -> Either[Error, B]. I vaguely recall there's an easy dedicated function that will apply f to each item in items and then return the first Right(b) to result, without applying f to the remaining items (or return Left[error] of the last error, if nothing succeeds.)
For example, if you had f(items(0)) result in Left("random error"), f(items(1)) result in Right("Find this one!"), and f(items(2)) result in launchTheNukes(); Right("Uh oh."), then the return should be Right("Find this one!") and no nukes should be launched.
It's sort of like what's happening in a for comprehension, where you could do:
for{
res0 <- f(items(0))
res1 <- f(items(1))
res2 <- f(items(2))
} yield res2
Which would return either the first error or the final result - so I want that, but to handle an arbitrary list rather than hard-coded, and returning the first success, not the first error. (The answer I'm looking for might be two functions, one to swap the sides of an Either, and one to automatically chain foldLefts across a list... I think there's a single-step solution though.)
Code snippet for commented solution:
def tester(i : Int) : Either[String, Int] = {if (i % 2 == 0) Right(100 / (4 - i)) else Left(i.toString)}
(1 to 5).collectFirst(tester)
I'm assuming (from your mention of more complex monads such as State) that you're using the Cats library.
You probably want one of the methods that come from Traverse
For example, its sequence and traverse methods are two variations of the "I have a list of things, and a thing-to-monad function, and I want a monad of things". Since Either is a monad whose flatMap aborts early upon encountering a Left, you could .swap your Eithers so that the flatMap aborts early upon encountering a Right, and then .swap the result back at the end.
def tester(i : Int): Either[String, Int] = /* from your question */
val items = (1 to 5).toList
items.traverse(tester(_).swap).swap // Right(50)
val allLeft = List(Left("oh no"), Left("uh oh"))
allLeft.traverse(_.swap).swap // Left(List("oh no", "uh oh"))
Ho about list.iterator.map(f).collectFirst { case Right(x) =>x } (this returns Option(x) of the first Right(x) it finds ... Could return Option(Right(x)) but that seems redundant.
Or you might go back to either:
list.iterator.map(f).collectFirst { case x#Right(_) => x }.getOrElse(Left("oops"))
If you insist on getting the last Left in case there are no Rights (doesn't seem to be very meaningful actually), then it seems like a simple recursive scan (that you said you knew how to write) is the best option.

scala for/yield on instance of class

I am scratching my head on an example I've seen in breeze's documentation about distributions.
After creating a Rand instance, they show that you can do the following:
import breeze.stats.distributions._
val pois = new Poisson(3.0);
val doublePoi: Rand[Double] = for(x <- pois) yield x.toDouble
Now, this is very cool, I can get a Rand object that I can get Double instead of Int when I call the samples method. Another example might be:
val abc = ('a' to 'z').map(_.toString).toArray
val letterDist: Rand[String] = for(x <- pois) yield {
val i = if (x > 26) x % 26 else x
abc(i)
}
val lettersSamp = letterDist.samples.take(20)
println(letterSamp)
The question is, what is going on here? Rand[T] is not a collection, and all the for/yield examples I've seen so far work on collections. The scala docs don't mention much, the only thing I found is translating for-comprehensions in here.
What is the underlying rule here? How else can this be used (doesn't have to be a breeze related answer)
Scala has rules for translating for and for-yield expressions to the equivalent flatMap and map calls, optionally also applying filters using withFilter and such. The actual specification for how you translate each for comprehension expression into the equivalent method calls can be found in this section of the Scala Specification.
If we take your example and compile it we'll see the underlying transformation happen to the for-yield expression. This is done using scalac -Xprint:typer command to print out the type trees:
val letterDist: breeze.stats.distributions.Rand[String] =
pois.map[String](((x: Int) => {
val i: Int = if (x.>(26))
x.%(26)
else
x;
abc.apply(i)
}));
Here you can see that for-yield turns into a single map passing in an Int and applying the if-else inside the expression. This works because Rand[T] has a map method defined:
def map[E](f: T => E): Rand[E] = MappedRand(outer, f)
For comprehensions are just syntactic sugar for flatMap, map and withFilter. The main requirement for use in a for comprehension is that those methods are implemented. They are therefore not limited to collections E.g. some common non-collections used in for comprehensions are Option, Try and Future.
In your case, Poisson seems to inherit from a trait called Rand
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/stats/distributions/Rand.scala
This trait has map, flatmap, and withFilter defined.
Tip: If you use an IDE like IntelliJ - you can press alt+enter on your for comprehension, and chose convert to desugared expression and you will see how it expands.

Ending a for-comprehension loop when a check on one of the items returns false

I am a bit new to Scala, so apologies if this is something a bit trivial.
I have a list of items which I want to iterate through. I to execute a check on each of the items and if just one of them fails I want the whole function to return false. So you can see this as an AND condition. I want it to be evaluated lazily, i.e. the moment I encounter the first false return false.
I am used to the for - yield syntax which filters items generated through some generator (list of items, sequence etc.). In my case however I just want to break out and return false without executing the rest of the loop. In normal Java one would just do a return false; within the loop.
In an inefficient way (i.e. not stopping when I encounter the first false item), I could do it:
(for {
item <- items
if !satisfiesCondition(item)
} yield item).isEmpty
Which is essentially saying that if no items make it through the filter all of them satisfy the condition. But this seems a bit convoluted and inefficient (consider you have 1 million items and the first one already did not satisfy the condition).
What is the best and most elegant way to do this in Scala?
Stopping early at the first false for a condition is done using forall in Scala. (A related question)
Your solution rewritten:
items.forall(satisfiesCondition)
To demonstrate short-circuiting:
List(1,2,3,4,5,6).forall { x => println(x); x < 3 }
1
2
3
res1: Boolean = false
The opposite of forall is exists which stops as soon as a condition is met:
List(1,2,3,4,5,6).exists{ x => println(x); x > 3 }
1
2
3
4
res2: Boolean = true
Scala's for comprehensions are not general iterations. That means they cannot produce every possible result that one can produce out of an iteration, as, for example, the very thing you want to do.
There are three things that a Scala for comprehension can do, when you are returning a value (that is, using yield). In the most basic case, it can do this:
Given an object of type M[A], and a function A => B (that is, which returns an object of type B when given an object of type A), return an object of type M[B];
For example, given a sequence of characters, Seq[Char], get UTF-16 integer for that character:
val codes = for (char <- "A String") yield char.toInt
The expression char.toInt converts a Char into an Int, so the String -- which is implicitly converted into a Seq[Char] in Scala --, becomes a Seq[Int] (actually, an IndexedSeq[Int], through some Scala collection magic).
The second thing it can do is this:
Given objects of type M[A], M[B], M[C], etc, and a function of A, B, C, etc into D, return an object of type M[D];
You can think of this as a generalization of the previous transformation, though not everything that could support the previous transformation can necessarily support this transformation. For example, we could produce coordinates for all coordinates of a battleship game like this:
val coords = for {
column <- 'A' to 'L'
row <- 1 to 10
} yield s"$column$row"
In this case, we have objects of the types Seq[Char] and Seq[Int], and a function (Char, Int) => String, so we get back a Seq[String].
The third, and final, thing a for comprehension can do is this:
Given an object of type M[A], such that the type M[T] has a zero value for any type T, a function A => B, and a condition A => Boolean, return either the zero or an object of type M[B], depending on the condition;
This one is harder to understand, though it may look simple at first. Let's look at something that looks simple first, say, finding all vowels in a sequence of characters:
def vowels(s: String) = for {
letter <- s
if Set('a', 'e', 'i', 'o', 'u') contains letter.toLower
} yield letter.toLower
val aStringVowels = vowels("A String")
It looks simple: we have a condition, we have a function Char => Char, and we get a result, and there doesn't seem to be any need for a "zero" of any kind. In this case, the zero would be the empty sequence, but it hardly seems worth mentioning it.
To explain it better, I'll switch from Seq to Option. An Option[A] has two sub-types: Some[A] and None. The zero, evidently, is the None. It is used when you need to represent the possible absence of a value, or the value itself.
Now, let's say we have a web server where users who are logged in and are administrators get extra javascript on their web pages for administration tasks (like wordpress does). First, we need to get the user, if there's a user logged in, let's say this is done by this method:
def getUser(req: HttpRequest): Option[User]
If the user is not logged in, we get None, otherwise we get Some(user), where user is the data structure with information about the user that made the request. We can then model that operation like this:
def adminJs(req; HttpRequest): Option[String] = for {
user <- getUser(req)
if user.isAdmin
} yield adminScriptForUser(user)
Here it is easier to see the point of the zero. When the condition is false, adminScriptForUser(user) cannot be executed, so the for comprehension needs something to return instead, and that something is the "zero": None.
In technical terms, Scala's for comprehensions provides syntactic sugars for operations on monads, with an extra operation for monads with zero (see list comprehensions in the same article).
What you actually want to accomplish is called a catamorphism, usually represented as a fold method, which can be thought of as a function of M[A] => B. You can write it with fold, foldLeft or foldRight in a sequence, but none of them would actually short-circuit the iteration.
Short-circuiting arises naturally out of non-strict evaluation, which is the default in Haskell, in which most of these papers are written. Scala, as most other languages, is by default strict.
There are three solutions to your problem:
Use the special methods forall or exists, which target your precise use case, though they don't solve the generic problem;
Use a non-strict collection; there's Scala's Stream, but it has problems that prevents its effective use. The Scalaz library can help you there;
Use an early return, which is how Scala library solves this problem in the general case (in specific cases, it uses better optimizations).
As an example of the third option, you could write this:
def hasEven(xs: List[Int]): Boolean = {
for (x <- xs) if (x % 2 == 0) return true
false
}
Note as well that this is called a "for loop", not a "for comprehension", because it doesn't return a value (well, it returns Unit), since it doesn't have the yield keyword.
You can read more about real generic iteration in the article The Essence of The Iterator Pattern, which is a Scala experiment with the concepts described in the paper by the same name.
forall is definitely the best choice for the specific scenario but for illustration here's good old recursion:
#tailrec def hasEven(xs: List[Int]): Boolean = xs match {
case head :: tail if head % 2 == 0 => true
case Nil => false
case _ => hasEven(xs.tail)
}
I tend to use recursion a lot for loops w/short circuit use cases that don't involve collections.
UPDATE:
DO NOT USE THE CODE IN MY ANSWER BELOW!
Shortly after I posted the answer below (after misinterpreting the original poster's question), I have discovered a way superior generic answer (to the listing of requirements below) here: https://stackoverflow.com/a/60177908/501113
It appears you have several requirements:
Iterate through a (possibly large) list of items doing some (possibly expensive) work
The work done to an item could return an error
At the first item that returns an error, short circuit the iteration, throw away the work already done, and return the item's error
A for comprehension isn't designed for this (as is detailed in the other answers).
And I was unable to find another Scala collections pre-built iterator that provided the requirements above.
While the code below is based on a contrived example (transforming a String of digits into a BigInt), it is the general pattern I prefer to use; i.e. process a collection and transform it into something else.
def getDigits(shouldOnlyBeDigits: String): Either[IllegalArgumentException, BigInt] = {
#scala.annotation.tailrec
def recursive(
charactersRemaining: String = shouldOnlyBeDigits
, accumulator: List[Int] = Nil
): Either[IllegalArgumentException, List[Int]] =
if (charactersRemaining.isEmpty)
Right(accumulator) //All work completed without error
else {
val item = charactersRemaining.head
val isSuccess =
item.isDigit //Work the item
if (isSuccess)
//This item's work completed without error, so keep iterating
recursive(charactersRemaining.tail, (item - 48) :: accumulator)
else {
//This item hit an error, so short circuit
Left(new IllegalArgumentException(s"item [$item] is not a digit"))
}
}
recursive().map(digits => BigInt(digits.reverse.mkString))
}
When it is called as getDigits("1234") in a REPL (or Scala Worksheet), it returns:
val res0: Either[IllegalArgumentException,BigInt] = Right(1234)
And when called as getDigits("12A34") in a REPL (or Scala Worksheet), it returns:
val res1: Either[IllegalArgumentException,BigInt] = Left(java.lang.IllegalArgumentException: item [A] is not digit)
You can play with this in Scastie here:
https://scastie.scala-lang.org/7ddVynRITIOqUflQybfXUA

Scala partially applied curried functions

Why can't i rewrite
println(abc.foldRight(0)((a,b) => math.max(a.length,b)))
in
object Main {
def main(args : Array[String]) {
val abc = Array[String]("a","abc","erfgg","r")
println(abc.foldRight(0)((a,b) => math.max(a.length,b)))
}
}
to
println(abc.foldRight(0)(math.max(_.length,_)))
? scala interpreter yields
/path/to/Main.scala:4: error: wrong number of parameters; expected = 2
println(abc.foldRight(0)(math.max(_.length,_)))
^
one error found
Which is not descriptive enough for me. Isn't resulting lambda takes two parameters one of which being called for .length method, as in abc.map(_.length)?
abc.foldRight(0)(math.max(_.length, _)) will expand to something like abc.foldRight(0)(y => math.max(x => x.length, y)). The placeholder syntax expands in the nearest pair of closing parentheses, except when you have only the underscore in which case it will expand outside the closest pair of parentheses.
You can use abc.foldRight(0)(_.length max _) which doesn't suffer from this drawback.

Scala's for-comprehension `if` statements

Is it possible in scala to specialize on the conditions inside an if within a for comprehension? I'm thinking along the lines of:
val collection: SomeGenericCollection[Int] = ...
trait CollectionFilter
case object Even extends CollectionFilter
case object Odd extends CollectionFilter
val evenColl = for { i <- collection if(Even) } yield i
//evenColl would be a SomeGenericEvenCollection instance
val oddColl = for { i <- collection if(Odd) } yield i
//oddColl would be a SomeGenericOddCollection instance
The gist is that by yielding i, I get a new collection of a potentially different type (hence me referring to it as "specialization")- as opposed to just a filtered-down version of the same GenericCollection type.
The reason I ask is that I saw something that I couldn't figure out (an example can be found on line 33 of this ScalaQuery example. What it does is create a query for a database (i.e. SELECT ... FROM ... WHERE ...), where I would have expected it to iterate over the results of said query.
So, I think you are asking if it is possible for the if statement in a for-comprehension to change the result type. The answer is "yes, but...".
First, understand how for-comprehensions are expanded. There are questions here on Stack Overflow discussing it, and there are parameters you can pass to the compiler so it will show you what's going on.
Anyway, this code:
val evenColl = for { i <- collection if(Even) } yield i
Is translated as:
val evenColl = collection.withFilter(i => Even).map(i => i)
So, if the withFilter method changes the collection type, it will do what you want -- in this simple case. On more complex cases, that alone won't work:
for {
x <- xs
y <- ys
if cond
} yield (x, y)
is translated as
xs.flatMap(ys.withFilter(y => cond).map(y => (x, y)))
In which case flatMap is deciding what type will be returned. If it takes the cue from what result was returned, then it can work.
Now, on Scala Collections, withFilter doesn't change the type of the collection. You could write your own classes that would do that, however.
yes you can - please refer to this tutorial for an easy example. The scala query example you cited is also iterating on the collection, it is then using that data to build the query.