I have encoded a list of values to a single database column by joining them with a delimiter. This works fine, except when the list is empty. In that case the database column is filled with an empty string, and when mapping back this gives me a Seq("") instead of Seq.empty.
implicit val SeqUriColumnType = MappedColumnType.base[Seq[Uri], String](
p => p.map(_.toString).mkString(","),
s => if (s.isEmpty) Seq.empty else s.split(",").map(Uri(_)).toSeq
)
I've worked around this by using an if statement but that feels odd. I've tried using MappedColumnType.base[Seq[Uri], Option[String]], but that didn't compile. I think it requires me to also use an option for the Seq, and that's not what I'm looking for.
In essence I want an empty Seq to result in a null value in the db, and to return an empty Seq again when retrieving. How do I do this properly?
Oh, if they handle options, you can remove orNull from the end :). Also, note, that here you are not really converting a collection to option. You are converting a String to option. Does it make it better? :)
Related
I do some data conversion on a list of string and I get a list of Either where Left represents an error and Right represents a successfully converted item.
val results: Seq[Either[String, T]] = ...
I partition my results with:
val (errors, items) = results.partition(_.isLeft)
After doing some error processing I want to return a Seq[T] of valid items. That means, returning the value of all Right elements. Because of the partitioning I already knew that all elements of items Right. I have come up with five possibilities of how to do it. But what is the best in readability and performance? Is there an idiomatic way of how to do it in Scala?
// which variant is most scala like and still understandable?
items.map(_.right.get)
items.map(_.right.getOrElse(null))
items.map(_.asInstanceOf[Right[String, T]].value)
items.flatMap(_.toOption)
items.collect{case Right(item) => item}
Using .get is considered "code smell": it will work in this case, but makes the reader of the code pause and spend a few extra "cycles" in order to "prove" that it is ok. It is better to avoid using things like .get on Either and Option or .apply on a Map or a IndexedSeq.
.getOrElse is ok ... but null is not something you often see in scala code. Again, makes the reader stop and think "why is this here? what will happen if it ends up returning null?" etc. Better to avoid as well.
.asInstanceOf is ... just bad. It breaks type safety, and is just ... not scala.
That leaves .flatMap(_.toOption) or .collect. Both are fine. I would personally prefer the latter as it is a bit more explicit (and does not make the reader stop to remember which way Either is biased).
You could also use foldRight to do both partition and extract in one "go":
val (errors, items) = results.foldRight[(List[String], List[T])](Nil,Nil) {
case (Left(error), (e, i)) => (error :: e, i)
case ((Right(result), (e, i)) => (e, result :: i)
}
Starting in Scala 2.13, you'll probably prefer partitionMap to partition.
It partitions elements based on a function which returns either Right or Left. Which in your case, is simply the identity:
val (lefts, rights) = List(Right(1), Left("2"), Left("3")).partitionMap(identity)
// val lefts: List[String] = List(2, 3)
// val rights: List[Int] = List(1)
which let you use lefts and rights independently and with the right types.
Going through them one by one:
items.map(_.right.get)
You already know that these are all Rights. This will be absolutely fine.
items.map(_.right.getOrElse(null))
The .getOrElse is unnecessary here as you already know it should never happen. I would recommend throwing an exception if you find a Left (somehow) though, something like this: items.map(x => x.right.getOrElse(throw new Exception(s"Unexpected Left: [$x]")) (or whatever exception you think is suitable), rather than meddling with null values.
items.map(_.asInstanceOf[Right[String, T]].value)
This is unnecessarily complicated. I also can't get this to compile, but I may be doing something wrong. Either way there's no need to use asInstanceOf here.
items.flatMap(_.toOption)
I also can't get this to compile. items.flatMap(_.right.toOption) compiles for me, but at that point it'll always be a Some and you'll still have to .get it.
items.collect{case Right(item) => item}
This is another case of "it works but why be so complicated?". It also isn't exhaustive in the case of a Left item being there, but this should never happen so there's no need to use .collect.
Another way you could get the right values out is with pattern matching:
items.map {
case Right(value) => value
case other => throw new Exception(s"Unexpected Left: $other")
}
But again, this is probably unnecessary as you already know that all values will be Right.
If you are going to partition results like this, I recommend the first option, items.map(_.right.get). Any other options either have unreachable code (code you'd never be able to hit through Unit tests or real-life operation) or are unnecessarily complicated for the sake of "looking functional".
I'm admittedly very new to Scala, and I'm having trouble with the syntactical sugar I see in many Scala examples.
It often results in a very concise statement, but honestly so far (for me) a bit unreadable.
So I wish to take a typical use of the Option class, safe-dereferencing, as a good place to start for understanding, for example, the use of the underscore in a particular example I've seen.
I found a really nice article showing examples of the use of Option to avoid the case of null.
https://medium.com/#sinisalouc/demystifying-the-monad-in-scala-cc716bb6f534#.fhrljf7nl
He describes a use as so:
trait User {
val child: Option[User]
}
By the way, you can also write those functions as in-place lambda
functions instead of defining them a priori. Then the code becomes
this:
val result = UserService.loadUser("mike")
.flatMap(user => user.child)
.flatMap(user => user.child)
That looks great! Maybe not as concise as one can do in groovy, but not bad.
So I thought I'd try to apply it to a case I am trying to solve.
I have a type Person where the existence of a Person is optional, but if we have a person, his attributes are guaranteed. For that reason, there are no use of the Option type within the Person type itself.
The Person has an PID which is of type Id. The Id type consists of two String types; the Id-Type and the Id-Value.
I've used the Scala console to test the following:
class Id(val idCode : String, val idVal : String)
class Person(val pid : Id, val name : String)
val anId: Id = new Id("Passport_number", "12345")
val person: Person = new Person(anId, "Sean")
val operson : Option[Person] = Some(person)
OK. That setup my person and it's optional instance.
I learned from the above linked article that I could get the Persons Id-Val by using flatMap; Like this:
val result = operson.flatMap(person => Some(person.pid)).flatMap(pid => Some(pid.idVal)).getOrElse("NoValue")
Great! That works. And if I infact have no person, my result is "NoValue".
I used flatMap (and not Map) because, unless I misunderstand (and my tests with Map were incorrect) if I use Map I have to provide an alternate or default Person instance. That I didn't want to have to do.
OK, so, flatMap is the way to go.
However, that is really not a very concise statement.
If I were writing that in more of a groovy style, I guess i'd be able to do something like this:
val result = person?.pid.idVal
Wow, that's a bit nicer!
Surely Scala has a means to provide something at least nearly as nice as Groovy?
In the above linked example, he was able to make his statement more concise using some of that syntactical sugar I mentioned before. The underscore:
or even more concise:
val result = UserService.loadUser("mike")
.flatMap(_.child)
.flatMap(_.child)
So, it seems in this case the underscore character allows you to skip specifying the type (as the type is inferred) and replace it with underscore.
However, when I try the same thing with my example:
val result = operson.flatMap(Some(_.pid)).flatMap(Some(_.idVal)).getOrElse("NoValue")
Scala complains.
<console>:15: error: missing parameter type for expanded function ((x$2) => x$2.idVal)
val result = operson.flatMap(Some(_.pid)).flatMap(Some(_.idVal)).getOrElse("NoValue")
Can someone help me along here?
How am I misunderstanding this?
Is there a short-hand method of writing my above lengthy statement?
Is flatMap the best way to achieve what I am after? Or is there a better more concise and/or readable way to do it ?
thanks in advance!
Why do you insist on using flatMap? I'd just use map for your example instead:
val result = operson.map(_.pid).map(_.idVal).getOrElse("NoValue")
or even shorter:
val result = operson.map(_.pid.idVal).getOrElse("NoValue")
You should only use flatMap with functions that return Options. Your pid and idVals are not Options, so just map them instead.
You said
I have a type Person where the existence of a Person is optional, but if we have a person, his attributes are guaranteed. For that reason, there are no use of the Option type within the Person type itself.
This is the essential difference between your example and the User example. In the User example, both the existence of a User instance, and its child field are options. This is why, to get a child, you need to flatMap. However, since in your example, only the existence of a Person is not guaranteed, after you've retrieved an Option[Person], you can safely map to any of its fields.
Think of flatMap as a map, followed by a flatten (hence its name). If I mapped on child:
val ouser = Some(new User())
val child: Option[Option[User]] = ouser.map(_.child)
I would end up with an Option[Option[User]]. I need to flatten that to a single Option level, that's why I use flatMap in the first place.
If you looking for the most concise solution, consider this:
val result = operson.fold("NoValue")(_.pid.idVal)
Though one could find it not clear or confusing
I'm creating a method to retrieve a list of users from a database by ID.
I'm trying to decide on the pros and cons of declaring the ids parameter as Option[Seq[String]] vs Seq[Option[String]]?
In what cases should I favour one over the other?
A list of users in neither well represented as an Option[Seq[String]] nor as a Seq[Option[String]]. I would expect something like a List[User] as a list of users. Or maybe a Vector or Seq
If your string represents your user, and the None case does nothing, you could consider filtering those out. You can do this with
val dbresult: Seq[Option[String]] = ???
val strings = dbresult collect { case Some(str) => str }
or
val strings = dbresult.flatten
but it's difficult to give good advice without knowing what the Option[String] or Option[Seq] represents
As usual this strongly depends on the use case.
A Seq[Option[String]] will be useful if the size of the sequence is relevant (eg., because you want to zip it with another sequence).
If this is not the case I would opt for flattening the sequence in order to just have a Seq[String]. This will likely be a better choice than Option[Seq[String]], as the sequence can also be of zero length.
In fact an Option can usually be treated as if it where an array that can have either length zero or one. Therefore wrapping an Iterable in an Option often only adds unnecessary complexity.
I want to convert a string column of a data frame to a list. What I can find from the Dataframe API is RDD, so I tried converting it back to RDD first, and then apply toArray function to the RDD. In this case, the length and SQL work just fine. However, the result I got from RDD has square brackets around every element like this [A00001]. I was wondering if there's an appropriate way to convert a column to a list or a way to remove the square brackets.
Any suggestions would be appreciated. Thank you!
This should return the collection containing single list:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
Without the mapping, you just get a Row object, which contains every column from the database.
Keep in mind that this will probably get you a list of Any type. Ïf you want to specify the result type, you can use .asInstanceOf[YOUR_TYPE] in r => r(0).asInstanceOf[YOUR_TYPE] mapping
P.S. due to automatic conversion you can skip the .rdd part.
With Spark 2.x and Scala 2.11
I'd think of 3 possible ways to convert values of a specific column to a List.
Common code snippets for all the approaches
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate
import spark.implicits._ // for .toDF() method
val df = Seq(
("first", 2.0),
("test", 1.5),
("choose", 8.0)
).toDF("id", "val")
Approach 1
df.select("id").collect().map(_(0)).toList
// res9: List[Any] = List(one, two, three)
What happens now? We are collecting data to Driver with collect() and picking element zero from each record.
This could not be an excellent way of doing it, Let's improve it with the next approach.
Approach 2
df.select("id").rdd.map(r => r(0)).collect.toList
//res10: List[Any] = List(one, two, three)
How is it better? We have distributed map transformation load among the workers rather than a single Driver.
I know rdd.map(r => r(0)) does not seems elegant you. So, let's address it in the next approach.
Approach 3
df.select("id").map(r => r.getString(0)).collect.toList
//res11: List[String] = List(one, two, three)
Here we are not converting DataFrame to RDD. Look at map it won't accept r => r(0)(or _(0)) as the previous approach due to encoder issues in DataFrame. So end up using r => r.getString(0) and it would be addressed in the next versions of Spark.
Conclusion
All the options give the same output but 2 and 3 are effective, finally 3rd one is effective and elegant(I'd think).
Databricks notebook
I know the answer given and asked for is assumed for Scala, so I am just providing a little snippet of Python code in case a PySpark user is curious. The syntax is similar to the given answer, but to properly pop the list out I actually have to reference the column name a second time in the mapping function and I do not need the select statement.
i.e. A DataFrame, containing a column named "Raw"
To get each row value in "Raw" combined as a list where each entry is a row value from "Raw" I simply use:
MyDataFrame.rdd.map(lambda x: x.Raw).collect()
In Scala and Spark 2+, try this (assuming your column name is "s"):
df.select('s').as[String].collect
sqlContext.sql(" select filename from tempTable").rdd.map(r => r(0)).collect.toList.foreach(out_streamfn.println) //remove brackets
it works perfectly
List<String> whatever_list = df.toJavaRDD().map(new Function<Row, String>() {
public String call(Row row) {
return row.getAs("column_name").toString();
}
}).collect();
logger.info(String.format("list is %s",whatever_list)); //verification
Since no one has given any solution in java(Real Programming Language)
Can thank me later
from pyspark.sql.functions import col
df.select(col("column_name")).collect()
here collect is functions which in turn convert it to list.
Be ware of using the list on the huge data set. It will decrease performance.
It is good to check the data.
Below is for Python-
df.select("col_name").rdd.flatMap(lambda x: x).collect()
An updated solution that gets you a list:
dataFrame.select("YOUR_COLUMN_NAME").map(r => r.getString(0)).collect.toList
This is java answer.
df.select("id").collectAsList();
val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val tokenizedLines = lines.map(Tokenizer.tokenize)
in the above code snippet, the tokenize function may return empty strings. How do i skip adding it to the map in that case? or remove empty entries post adding to map?
tokenizedLines.filter(_.nonEmpty)
The currently accepted answer, using filter and nonEmpty, incurs some performance penalty because nonEmpty is not a method on String, but, instead, it's added through implicit conversion. With value objects being used, I expect the difference to be almost imperceptible, but on versions of Scala where that is not the case, it is a substantial hit.
Instead, one could use this, which is assured to be faster:
tokenizedLines.filterNot(_.isEmpty)
You could use flatMap with Option.
Something like that:
lines.flatMap{
case "" => None
case s => Some(s)
}
val tokenizedLines = (lines.map(Tokenizer.tokenize)).filter(_.nonEmpty)