I need to break out of a Seq map when a condition is met. Something like this, where foo would return a list of objects whose size depends on how long it takes to find the targetId:
def foo(ids: Seq[String], targetId: String) = ids.map(id => getObject(id)).until(id == targetId)
Obviously the until method does not exist, but I am looking for something that does the equivalent.
No need to create intermediate stream/iterator/view.
Just call takeWhile first:
ids.takeWhile(_ != targetId).map(getObject)
There are 2 ways I use:
1) Replace map with a recursive call that processes the elements in whatever way is needed. This is pretty handy if there are complex side effects.
2) Use a Stream or Iterator with takeWhile to evaluate its elements lazily and terminate once the condition is met. I would go with this variant, since it behaves like the first option but is much more concise.
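A minimal sketch of both options on plain Scala collections. The getObject below is a hypothetical stand-in for the one in the question (it only tags the id and counts its calls), and the names fooRec and fooLazy are made up for illustration. Like the takeWhile answer, both variants stop before the target, so the target's own object is not included in the result.

```scala
import scala.annotation.tailrec

object BreakOutOfMap {
  // Hypothetical stand-in for the question's getObject; counts how often it runs.
  var calls = 0
  def getObject(id: String): String = { calls += 1; s"obj-$id" }

  // Option 1: explicit recursion that stops at the target (or at the end of the list).
  @tailrec
  def fooRec(ids: List[String], targetId: String, acc: List[String] = Nil): List[String] =
    ids match {
      case id :: rest if id != targetId => fooRec(rest, targetId, getObject(id) :: acc)
      case _                            => acc.reverse
    }

  // Option 2: a lazy iterator; getObject only runs for ids seen before the target.
  def fooLazy(ids: Seq[String], targetId: String): List[String] =
    ids.iterator.takeWhile(_ != targetId).map(getObject).toList
}
```

In both cases getObject is never called for the target or for anything after it.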
Since the RDD I was playing with was small, I achieved the same using take(n).
I am just starting to learn Spark with Scala. The problem I am working on requires me to read a file, split each line on a certain character, then filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
This meant going through the collection three times, which seemed quite unreasonable to me. So I tried replacing them with one collect (the collect that takes a partial function as an argument). Much to my surprise, this made it run much slower. I tried locally on regular Scala collections; there, as expected, the collect version is much faster.
So why is that? My idea is that the map, filter and map are not applied sequentially, but rather fused into one operation; in other words, when an action forces evaluation, each element is checked and the pending operations are executed on it. Is that right? But even so, why does collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
val s = l.split(" ")
(s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
case s if s.split(" ")(1).contains("hello") => s.split(" ")(0)
}
The answer lies in the implementation of collect:
/**
* Return an RDD that contains all matching values by applying `f`.
*/
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
val cleanF = sc.clean(f)
filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same filter -> map sequence, but less efficient in your case.
In Scala, both the isDefinedAt and apply methods of a PartialFunction evaluate the guard (the if part).
So, in your collect example, the guard's split will be performed twice for every matching line: once in isDefinedAt during the filter, and again in apply during the map.
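The double evaluation can be observed without Spark. The sketch below uses plain Scala collections and mimics the filter(isDefinedAt).map(f) shape of RDD.collect; the cols helper, the splits counter and the sample lines are all made up for illustration. It also shows one hedged alternative, a flatMap that splits each line exactly once.

```scala
object CollectGuardDemo {
  // Counts every split so the repeated work becomes visible.
  var splits = 0
  def cols(line: String): Array[String] = { splits += 1; line.split(" ") }

  // Partial function with a guard, in the style of the question's collect.
  val pf: PartialFunction[String, String] = {
    case s if cols(s)(1).contains("hello") => cols(s)(0)
  }

  // Same shape as the RDD.collect implementation quoted above.
  def viaFilterMap(lines: Seq[String]): Seq[String] =
    lines.filter(pf.isDefinedAt).map(pf)

  // Alternative: split once per line and keep or drop the result.
  def singleSplit(lines: Seq[String]): Seq[String] =
    lines.flatMap { l =>
      val s = cols(l)
      if (s(1).contains("hello")) Some(s(0)) else None
    }
}
```

For a matching line, viaFilterMap splits once in isDefinedAt, once more when apply re-checks the guard, and a third time in the body; singleSplit always splits once.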
val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val tokenizedLines = lines.map(Tokenizer.tokenize)
In the above code snippet, the tokenize function may return empty strings. How do I skip adding them to the mapped result in that case, or remove the empty entries after the map?
tokenizedLines.filter(_.nonEmpty)
The currently accepted answer, using filter and nonEmpty, incurs some performance penalty because nonEmpty is not a method on String; it is added through an implicit conversion. Where the wrapper is a value class, I expect the difference to be almost imperceptible, but on versions of Scala where that is not the case, it is a substantial hit.
Instead, one could use this, which is assured to be faster:
tokenizedLines.filter(!_.isEmpty)
You could use flatMap with Option.
Something like this:
lines.flatMap{
case "" => None
case s => Some(s)
}
val tokenizedLines = (lines.map(Tokenizer.tokenize)).filter(_.nonEmpty)
Here is the standard format for a for/yield in Scala. Notice that it expects a collection, whose elements drive the iteration.
for (blah <- blahs) yield someThingDependentOnBlah
I have a situation where an indeterminate number of iterations will occur in a loop. The inner loop logic determines how many will be executed.
while (condition) { some logic that affects the triggering condition } yield blah
Each iteration will generate one element of a sequence - just like a yield is programmed to do. What is a recommended way to do this?
You can
Iterator.continually{ some logic; blah }.takeWhile(condition)
to get pretty much the same thing. You'll need to use something mutable (e.g. a var) for the logic to impact the condition. Otherwise you can
Iterator.iterate((blah, whatever)){ case (_,w) => (blah, some logic on w) }.
takeWhile(condition on _._2).
map(_._1)
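To make the two patterns concrete, here is a small sketch that produces the sequence of repeated halvings of a number until it reaches zero. The halving logic and the method names are purely illustrative; they stand in for the "some logic" and "condition" placeholders above.

```scala
object WhileYield {
  // Mutable variant: the var lets the loop body influence the stop condition.
  def halvingsMutable(start: Int): List[Int] = {
    var n = start
    Iterator.continually { val cur = n; n = n / 2; cur }
      .takeWhile(_ > 0)
      .toList
  }

  // Immutable variant: the state is threaded through Iterator.iterate.
  def halvingsPure(start: Int): List[Int] =
    Iterator.iterate(start)(_ / 2).takeWhile(_ > 0).toList
}
```

Both iterators are lazy, so nothing past the first failing element is computed.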
A for comprehension is the wrong tool for that. What you describe is generally done by unfold. That method was not present in Scala's standard library before 2.13 (which added unfold to the collection factories, e.g. Iterator.unfold); on older versions you can find it in Scalaz.
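On Scala 2.13 or later, the standard library offers Iterator.unfold, which takes an initial state and a function returning Some((element, nextState)) to continue or None to stop. A minimal sketch, again using an illustrative halving sequence rather than anything from the question:

```scala
object UnfoldSketch {
  // Requires Scala 2.13+. Emits n and moves on to n / 2 until n reaches 0.
  def halvings(start: Int): List[Int] =
    Iterator.unfold(start) { n =>
      if (n > 0) Some((n, n / 2)) else None
    }.toList
}
```

The state and the emitted element are kept separate, so no mutable variable is needed to drive the stop condition.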
Another way, similar to the suggestion by @rexkerr:
blahs.toIterator.map{ do something }.takeWhile(condition)
This feels a bit more natural than the Iterator.continually
I'm learning Scala now, and I have a scenario where I have to compare an element (say num) with all the elements in a list.
Assume,
val MyList = List(1, 2, 3, 4)
If num is equal to any one of the elements in the list, I need to return true. I know how to do it recursively using the head and tail functions, but is there a simpler way to do it (I think I'll be able to do it using foreach, but I'm not sure how to implement it exactly)?
There are a number of possibilities:
val x = 3
MyList.contains(x)
!MyList.forall(y => y != x) // early exit, basically the same as .contains
If you plan to do it frequently, you may consider converting your list to a Set, because every .contains lookup on a List is, in the worst case, proportional to the number of elements, whereas on a Set it is effectively constant:
val mySet = MyList.toSet
mySet.contains(x)
or simply:
mySet(x)
A contains method is pretty standard for lists in any language. Scala's List has it too:
http://www.scala-lang.org/api/current/scala/collection/immutable/List.html
As others have answered, the contains method on the list will do exactly this, and it's the most understandable/performant way.
Looking at your closing comments though, you wouldn't be able to do it (in an elegant fashion) with foreach, since that returns Unit. Foreach "does" something for each element, but you don't get any result back. It's useful for logging/println statements, but it doesn't act as a transformation.
If you want to run a function on every element individually, you would use map, which returns a List of the results of applying the function. So assuming num = 3, then MyList.map(_ == num) would return List(false, false, true, false). Since you're looking for a single result, and not a list of results, then this is not what you're after.
In order to collapse a sequence of things into a single result, you would use a fold over the data. Folding involves a function that takes two arguments (the result so far, and the current thing in the list) and returns the new running result. So that this can work on the very first element, you also need to provide the initial value to use for the ongoing result (usually some sort of zero).
In your particular case, then, you want a Boolean answer at the end - "was an element found that was equal to num". So the running result would be "have I seen an element so far that was equal to num". Which means the initial value is false. And the function itself should return true if an element has already been seen, or if the current element is equal to num.
Putting this together, it would look like this:
MyList.foldLeft(false) { case (runningResult, listElem) =>
// return true if runningResult is true, or if listElem is the target number
runningResult || listElem == num
}
This doesn't have the nice aspect of stopping as soon as the target value has been found - and it's nowhere near as concise as calling MyList.contains. But as an instructional example, this is how you could implement this yourself from the primitive functional operations on a list.
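For comparison, a short sketch of the fold next to exists, which does stop at the first match. The comparisonsUntilFound helper is made up purely to make the early exit visible by counting how many elements were inspected.

```scala
object MembershipCheck {
  // The fold from above: always walks the whole list.
  def viaFold(num: Int, xs: List[Int]): Boolean =
    xs.foldLeft(false) { (seen, x) => seen || x == num }

  // exists stops as soon as the predicate returns true.
  def viaExists(num: Int, xs: List[Int]): Boolean =
    xs.exists(_ == num)

  // Counts how many elements exists looks at before it returns.
  def comparisonsUntilFound(num: Int, xs: List[Int]): Int = {
    var count = 0
    xs.exists { x => count += 1; x == num }
    count
  }
}
```

With List(1, 2, 3, 4) and num = 2, exists inspects only the first two elements.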
List has a method for that:
val found = MyList.contains(num)
Why is using foreach, map, flatMap etc. considered better than using get for Scala Options? If I use isEmpty, I can call get safely.
Well, it kind of comes back to "tell, don't ask". Consider these two lines:
if (opt.isDefined) println(opt.get)
// versus
opt foreach println
In the first case, you are looking inside opt and then reacting depending on what you see. In the second case you are just telling opt what you want done, and let it deal with it.
The first case knows too much about Option, replicates logic internal to it, is fragile and prone to errors (it can result in run-time errors, instead of compile-time errors, if written incorrectly).
Add to that, it is not composable. If you have three options, a single for comprehension takes care of them:
for {
op1 <- opt1
op2 <- opt2
op3 <- opt3
} println(op1+op2+op3)
With if, things start to get messy fast.
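As a rough illustration of that messiness, here is the same three-option sum written both ways. The method names are hypothetical, and the versions return the sum as an Option instead of printing it so that the result can be checked.

```scala
object ThreeOptions {
  // The "ask" style: every option is inspected by hand before get is called.
  def sumWithIf(o1: Option[Int], o2: Option[Int], o3: Option[Int]): Option[Int] =
    if (o1.isDefined && o2.isDefined && o3.isDefined)
      Some(o1.get + o2.get + o3.get)
    else
      None

  // The "tell" style: the for comprehension short-circuits on the first None.
  def sumWithFor(o1: Option[Int], o2: Option[Int], o3: Option[Int]): Option[Int] =
    for {
      a <- o1
      b <- o2
      c <- o3
    } yield a + b + c
}
```

Forgetting one of the isDefined checks in the first version still compiles and only fails at run time, which is the fragility described above.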
One nice reason to use foreach is parsing something with nested options. If you have something like
val nestedOption = Some(Some(Some(1)))
for {
opt1 <- nestedOption
opt2 <- opt1
opt3 <- opt2
} println(opt3)
The console prints 1. If you extend this to a case where you have a class that optionally stores a reference to something, which in turn stores another reference, for comprehensions allow you to avoid a giant "pyramid" of None/Some checking.
There are already excellent answers to the actual question, but for more Option-foo you should definitely check out Tony Morris' Option Cheat Sheet.
The reason it's more useful to apply things like map, foreach, and flatMap directly to the Option instead of using get and then performing the function is that it works on either Some or None and you don't have to do special checks to make sure the value is there.
val x: Option[Int] = foo()
val y = x.map(_+1) // works fine for None
val z = x.get + 1 // doesn't work if x is None
The result for y here is an Option[Int], which is desirable since if x is optional, then y might be undetermined as well. Since get doesn't work on None, you'd have to do a bunch of extra work to make sure you didn't get any errors; extra work that is done for you by map.
Put simply:
If you need to do something (a procedure where you don't need to capture the return value of each invocation) only if the option is defined (i.e. is a Some): use foreach (if you care about the results of each invocation, use map)
If you need to do something if the option is defined and something else if it's not: use isDefined in an if statement
If you need the value if the option is a Some, or a default value if it is a None: use getOrElse
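The three rules above can be sketched as follows. All names here are illustrative, and the StringBuilder merely stands in for whatever side effect is wanted.

```scala
object OptionRules {
  // Rule 1: a side effect that only happens for Some.
  def logIfPresent(opt: Option[String], sink: StringBuilder): Unit =
    opt.foreach(s => sink.append(s))

  // Rule 1, when the result matters: map keeps the value inside the Option.
  def lengthOf(opt: Option[String]): Option[Int] =
    opt.map(_.length)

  // Rule 2: one branch for Some, another for None.
  def describe(opt: Option[String]): String =
    if (opt.isDefined) s"got ${opt.get}" else "nothing"

  // Rule 3: the value if present, otherwise a default.
  def orDefault(opt: Option[String]): String =
    opt.getOrElse("default")
}
```

The second rule could also be written with a pattern match or fold on the Option, which avoids calling get at all.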
Performing our operations with get is the more imperative style, where you need to say both what to do and how to do it. In other words, we are dictating things and digging into the Option's internals. map and flatMap, by contrast, are the more functional way of doing things, where we say what to do but not how to do it.