Removing empty strings from maps in Scala

val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val tokenizedLines = lines.map(Tokenizer.tokenize)
In the above code snippet, the tokenize function may return empty strings. How do I skip adding them to the map in that case, or remove the empty entries after they have been added?

tokenizedLines.filter(_.nonEmpty)

The currently accepted answer, using filter and nonEmpty, incurs some performance penalty because nonEmpty is not a method on String; instead, it is added through an implicit conversion. With value classes being used, I expect the difference to be almost imperceptible, but on versions of Scala where that is not the case, it is a substantial hit.
Instead, one could use the following, which is assured to be faster:
tokenizedLines.filterNot(_.isEmpty)

You could use flatMap with Option.
Something like this:
lines.flatMap {
  case "" => None
  case s  => Some(s)
}
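Applied to the pipeline from the question, the same idea looks roughly like this (a sketch reusing the question's lines and Tokenizer.tokenize; it assumes tokenize returns a single String per line):

val tokenizedLines: RDD[String] = lines.flatMap { line =>
  Tokenizer.tokenize(line) match {
    case "" => None // drop empty tokenization results
    case s  => Some(s)
  }
}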

val tokenizedLines = (lines.map(Tokenizer.tokenize)).filter(_.nonEmpty)

Related

(Scala) Am I using Options correctly?

I'm currently working on my functional programming - I am fairly new to it. Am I using Options correctly here? I feel pretty insecure about my skills currently. I want my code to be as safe as possible. Can anyone point out what I am doing wrong here, or is it not that bad? My code is pretty straightforward:
def main(args: Array[String]): Unit = {
  val file = "myFile.txt"
  val myGame = Game(file) // I have my game that returns an Option here
  if (myGame.isDefined) { // Check that I indeed passed a .txt file
    val solutions = myGame.get.getAllSolutions() // This returns an Option as well
    if (solutions.isDefined) { // Is it possible to solve the puzzle (crossword)?
      for (i <- solutions.get) { // print all solutions to the crossword
        i.solvedCrossword foreach println
      }
    }
  }
}
-Thanks!! ^^
When using Option, it is recommended to use pattern matching (match/case) instead of calling isDefined and get.
Instead of the Java-style for loop, use a higher-order function:
myGame match {
  case Some(game) =>
    val solutions = game.getAllSolutions()
    solutions.foreach(_.foreach(_.solvedCrossword.foreach(println)))
  case None =>
}
As a rule of thumb, you can think of Option as a replacement for Java's null pointer. That is, in cases where you might want to use null in Java, it often makes sense to use Option in Scala.
Your Game() function uses None to represent errors. So you're not really using it as a replacement for null (at least I'd consider it poor practice for an equivalent Java method to return null there instead of throwing an exception), but as a replacement for exceptions. That's not a good use of Option because it loses error information: you can no longer differentiate between the file not existing, the file being in the wrong format or other types of errors.
Instead you should use Either. Either consists of the cases Left and Right where Right is like Option's Some, but Left differs from None in that it also takes an argument. Here that argument can be used to store information about the error. So you can create a case class containing the possible types of errors and use that as an argument to Left. Or, if you never need to handle the errors differently, but just present them to the user, you can use a string with the error message as the argument to Left instead of case classes.
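A hedged sketch of what that could look like; the GameError hierarchy and the .txt check are made up for illustration and are not part of the original code:

sealed trait GameError
case class FileNotFound(path: String) extends GameError
case class InvalidFormat(details: String) extends GameError

object Game {
  // Either preserves the error information that Option's None would discard.
  def apply(file: String): Either[GameError, Game] =
    if (!file.endsWith(".txt")) Left(InvalidFormat(s"$file is not a .txt file"))
    else ??? // load and parse, returning Left(FileNotFound(file)), Left(InvalidFormat(...)) or Right(game)
}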
In getAllSolutions you're just using None as a replacement for the empty list. That's unnecessary because the empty list needs no replacement. It's perfectly fine to just return an empty list when there are no solutions.
When it comes to interacting with the Options, you're using isDefined + get, which is a bit of an anti-pattern. get can be used as a shortcut if you know that the option you have is never None, but it should generally be avoided. isDefined should generally only be used in situations where you need to know whether an option contains a value but don't need to know the value itself.
In cases where you need to know both whether there is a value and what that value is, you should either use pattern matching or one of Option's higher-order functions, such as map, flatMap or getOrElse (which is kind of a higher-order function if you squint a bit and consider by-name arguments as kind-of like functions). For cases where you want to do something with the value if there is one and do nothing otherwise, you can use foreach (or equivalently a for loop), but note that you really shouldn't do nothing in the error case here: you should tell the user about the error instead.
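For illustration, with a made-up value, here is how those combinators compare to isDefined plus get:

val maybeName: Option[String] = Some("crossword")

val length = maybeName.map(_.length).getOrElse(0) // transform if present, fall back otherwise

maybeName.foreach(println) // run a side effect only when a value is present

maybeName match { // pattern matching makes both branches explicit
  case Some(name) => println(s"loaded $name")
  case None       => println("nothing to load")
}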
If all you need here is to print the solutions when everything is fine, you can use a for-comprehension, which is considered a quite idiomatic Scala way:
for {
  myGame    <- Game("myFile.txt")
  solutions <- myGame.getAllSolutions()
  solution  <- solutions
  crossword <- solution.solvedCrossword
} println(crossword)

Scala's collect inefficient in Spark?

I am currently starting to learn to use Spark with Scala. The problem I am working on requires me to read a file, split each line on a certain character, then filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
This meant going through the collection 3 times, and that seemed quite unreasonable to me. So I tried replacing them with one collect (the collect that takes a partial function as an argument). And much to my surprise, this made it run much slower. I tried it locally on regular Scala collections; as expected, the latter way of doing it is much faster.
So why is that? My idea is that the map, filter and map are not applied sequentially but rather mixed into one operation; in other words, when an action forces evaluation, every element of the list will be checked and the pending operations will be executed. Is that right? But even so, why does collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  case s if s.split(" ")(0).contains("hello") => s(0)
}
The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same filter -> map sequence, but less efficient in your case.
In Scala, both the isDefinedAt and apply methods of a PartialFunction evaluate the guard (the if part).
So, in your "collect" example, split will be performed twice for each element that matches (once in isDefinedAt and once more in apply).
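You can reproduce the double evaluation with a small local experiment that mirrors the filter -> map shape above (plain Scala collections standing in for the RDD; the counter is made up for the demo):

var guardEvaluations = 0

val pf: PartialFunction[String, String] = {
  case s if { guardEvaluations += 1; s.split(" ")(0).contains("hello") } => s
}

// Same shape as Spark's RDD.collect(pf): filter on isDefinedAt, then map with apply.
val result = Seq("hello world").filter(pf.isDefinedAt).map(pf)
println(guardEvaluations) // 2 -- the guard (and its split) ran in isDefinedAt and again in apply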

How can I map a null value to Seq.empty in Slick

I have encoded a list of values to a single database column by joining them with a delimiter. This works fine, except when the list is empty. In that case the database column is filled with an empty string, and when mapping back this gives me a Seq("") instead of Seq.empty.
implicit val SeqUriColumnType = MappedColumnType.base[Seq[Uri], String](
  p => p.map(_.toString).mkString(","),
  s => if (s.isEmpty) Seq.empty else s.split(",").map(Uri(_)).toSeq
)
I've worked around this by using an if statement but that feels odd. I've tried using MappedColumnType.base[Seq[Uri], Option[String]], but that didn't compile. I think it requires me to also use an option for the Seq, and that's not what I'm looking for.
In essence I want an empty Seq to result in a null value in the db, and to return an empty Seq again when retrieving. How do I do this properly?
Oh, if they handle Options, you can remove orNull from the end :). Also, note that here you are not really converting a collection to an Option; you are converting a String to an Option. Does that make it better? :)
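For reference, the Seq("") surprise comes from how mkString and split round-trip an empty sequence; a quick check in plain Scala (independent of Slick) shows why the isEmpty guard is needed:

Seq.empty[String].mkString(",")  // "" -- an empty Seq serialises to the empty string
"".split(",").toSeq              // Seq("") -- splitting "" yields one empty element, not Seq.empty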

Scala break out of map

I need to break out of a Seq map when a condition is met, something like this, where foo would return a list of objects whose size depends on how long it takes to find the targetId:
def foo(ids: Seq[String], targetId: String) = ids.map(id => getObject(id)).until(id == targetId)
Obviously the until method does not exist, but I am looking for something that does the equivalent.
No need to create an intermediate stream/iterator/view.
Just call takeWhile first:
ids.takeWhile(_ != targetId).map(getObject)
There are 2 ways I use:
1) Replace map with a recursive call that processes things in a certain way. Pretty handy if there are some complex side effects (a sketch of this option follows below the list).
2) Use a Stream or Iterator and takeWhile to evaluate its elements lazily and terminate once the condition is met. I would go with this variant since it is close to the first option, but much more concise.
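A hedged sketch of the first option, reusing the question's foo and getObject names (getObject and its result type, AnyRef here, are assumptions):

import scala.annotation.tailrec

def foo(ids: Seq[String], targetId: String): Seq[AnyRef] = {
  @tailrec
  def loop(rest: Seq[String], acc: Vector[AnyRef]): Vector[AnyRef] = rest match {
    case id +: tail if id != targetId => loop(tail, acc :+ getObject(id)) // keep going until the target
    case _                            => acc                              // target found or list exhausted
  }
  loop(ids, Vector.empty)
}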
Since the RDD I was playing with was small, I achieved the same using take(n).

Scala: Option[Seq[String]] vs Seq[Option[String]]?

I'm creating a method to retrieve a list of users from a database by ID.
I'm trying to decide on the pros and cons of declaring the ids parameter as Option[Seq[String]] vs Seq[Option[String]]?
In what cases should I favour one over the other?
A list of users is neither well represented as an Option[Seq[String]] nor as a Seq[Option[String]]. I would expect something like a List[User] as a list of users, or maybe a Vector or Seq.
If your string represents your user, and the None case does nothing, you could consider filtering those out. You can do this with
val dbresult: Seq[Option[String]] = ???
val strings = dbresult collect { case Some(str) => str }
or
val strings = dbresult.flatten
but it's difficult to give good advice without knowing what the Option[String] or Option[Seq] represents.
As usual, this strongly depends on the use case.
A Seq[Option[String]] will be useful if the size of the sequence is relevant (e.g., because you want to zip it with another sequence).
If this is not the case, I would opt for flattening the sequence in order to just have a Seq[String]. This will likely be a better choice than Option[Seq[String]], as the sequence can also be of zero length.
In fact, an Option can usually be treated as if it were an array that can have either length zero or one. Therefore wrapping an Iterable in an Option often only adds unnecessary complexity.
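A small illustration of that trade-off, with made-up data:

val ids   = Seq("id1", "id2", "id3")
val users = Seq(Some("Alice"), None, Some("Carol")) // Seq[Option[String]]

users.flatten  // Seq("Alice", "Carol") -- positions are lost
ids.zip(users) // Seq(("id1", Some("Alice")), ("id2", None), ("id3", Some("Carol")))
               // keeping the Options preserves which id had no match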