val patterns = Seq(
"the name and age are ([a-z]+), ([0-9]+)".r,
"name:([a-z]+),age:([0-9]+)".r,
"n=([a-z]+),a=([0-9]+)".r
)
def transform(subject: String): Option[String] = {
patterns.map(_.unapplySeq(subject)).collectFirst { case Some(List(name, age)) => s"$name$age" }
}
println(transform("name:joe,age:42")) // Some(joe42)
This code finds and transforms the first match in a list of regexes. It could be improved by returning early in the common case, since map currently runs every regex even after one of them has matched.
What is the best way to make patterns a lazy collection taking into consideration performance and thread safety? Can a view be reused by multiple threads or should a new view be created for each invocation of transform?
patterns.view
patterns.to(LazyList)
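As a rough illustration of the early-return idea (a sketch, not a benchmark): creating the view inside the method is cheap and means nothing lazy is shared between threads. The name `transformCounting` and the `tried` counter are hypothetical, added only to make the short-circuit visible.

```scala
import scala.util.matching.Regex

object LazyPatterns {
  val patterns: Seq[Regex] = Seq(
    "the name and age are ([a-z]+), ([0-9]+)".r,
    "name:([a-z]+),age:([0-9]+)".r,
    "n=([a-z]+),a=([0-9]+)".r
  )

  // Returns the transformed match plus the number of patterns actually tried.
  // The counter is hypothetical and exists only to show the short-circuit.
  def transformCounting(subject: String): (Option[String], Int) = {
    var tried = 0
    val result = patterns.view // a fresh view per call, so no state is shared
      .map { p => tried += 1; p.unapplySeq(subject) }
      .collectFirst { case Some(List(name, age)) => s"$name$age" }
    (result, tried)
  }
}
```

With the input from the question, only the first two patterns should be evaluated, because collectFirst stops pulling from the view once a match is found.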
I still feel that there could be advantages when using a single regex pattern.
val pattern = List("the name and age are (\\w+), (\\d+)"
,"name:(\\w+),age:(\\d+)"
// add more as needed
,"n=(\\w+),a=(\\d+)"
).mkString("|").r
def transform(subject: String): Option[String] =
pattern.findFirstMatchIn(subject)
.map(_.subgroups.filter(_ != null).mkString)
I'm in the process of learning Scala and am trying to write a function that compares each element in one list against the element at the same index in another list. I know that there has to be a more Scalatic way to do this than to write two for loops and keep track of the current index of each manually.
Let's say that we're comparing URLs, for example. Say that we have the following two Lists that are URLs split by the / character:
val incomingUrl = List("users", "profile", "12345")
and
val urlToCompare = List("users", "profile", ":id")
Say that I want to treat any element that begins with the : character as a match, but any element that does not begin with a : will not be a match.
What is the best and most Scalatic way to go about doing this comparison?
Coming from a OOP background, I would immediately jump to a for loop, but I know that there has to be a good FP way to go about it that will teach me a thing or two about Scala.
EDIT
For completeness, I found this outdated question shortly after posting mine that relates to the problem.
EDIT 2
The implementation that I chose for this specific use case:
def doRoutesMatch(incomingURL: List[String], urlToCompare: List[String]): Boolean = {
// if the lengths don't match, return immediately
if (incomingURL.length != urlToCompare.length) return false
// merge the lists into a tuple
urlToCompare.zip(incomingURL)
// iterate over it
.foreach {
// get each path
case (existingPath, pathToCompare) =>
if (
// check if this is some value supplied to the url, such as `:id`
existingPath(0) != ':' &&
// if this isn't a placeholder for a value that the route needs, then check if the strings are equal
existingPath != pathToCompare
)
// if neither matches, it doesn't match the existing route
return false
}
// return true if a `false` didn't get returned in the above foreach loop
true
}
You can use zip: invoked on a Seq[A] with a Seq[B], it returns a Seq[(A, B)]. In other words, it pairs up the elements of the two sequences by index:
incomingUrl.zip(urlToCompare).map { case(incoming, pattern) => f(incoming, pattern) }
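The answer leaves `f` unspecified. One possible reading, sketched here under the assumption that a pattern segment starting with `:` matches anything and every other segment must be equal, replaces map with forall so the comparison stops at the first mismatch. The names `RouteMatch` and `matches` are hypothetical.

```scala
object RouteMatch {
  // True when both lists have the same length and every pattern segment
  // is either a placeholder (starts with ':') or equal to the incoming one.
  def matches(incoming: List[String], pattern: List[String]): Boolean =
    incoming.length == pattern.length &&
      incoming.zip(pattern).forall { case (in, pat) =>
        pat.startsWith(":") || in == pat
      }
}
```

The explicit length check is needed because zip silently truncates to the shorter list.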
There is already another answer to the question, but I am adding another one since there is one corner case to watch out for. If you don't know the lengths of the two Lists, you need zipAll. Since zipAll needs a default value to insert if no corresponding element exists in the List, I am first wrapping every element in a Some, and then performing the zipAll.
object ZipAllTest extends App {
val incomingUrl = List("users", "profile", "12345", "extra")
val urlToCompare = List("users", "profile", ":id")
val list1 = incomingUrl.map(Some(_))
val list2 = urlToCompare.map(Some(_))
val zipped = list1.zipAll(list2, None, None)
println(zipped)
}
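To turn the zipped pairs into an actual comparison, something along these lines might work. It is only a sketch: the placeholder rule (a segment starting with `:` matches anything) is taken from the question, and the names `ZipAllMatch` and `matches` are hypothetical.

```scala
object ZipAllMatch {
  // zipAll pads the shorter list with None, so a length mismatch shows up
  // as a pair with a None on one side and is rejected by the last case.
  def matches(incoming: List[String], pattern: List[String]): Boolean =
    incoming.map(s => Option(s))
      .zipAll(pattern.map(s => Option(s)), None, None)
      .forall {
        case (Some(in), Some(pat)) => pat.startsWith(":") || in == pat
        case _                     => false
      }
}
```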
One thing that might bother you is that we are making several passes through the data. If that's something you are worried about, you can use lazy collections or else write a custom fold that makes only one pass over the data. That is probably overkill though. If someone wants to, they can add those alternative implementations in another answer.
Since the OP is curious to see how we would use lazy collections or a custom fold to do the same thing, I have included a separate answer with those implementations.
The first implementation uses lazy collections. Note that lazy collections have poor cache properties, so in practice it often does not make sense to use them as a micro-optimization: although they minimize the number of times you traverse the data, the underlying data structure does not have good cache locality. To understand why lazy collections minimize the number of passes over the data, read chapter 5 of Functional Programming in Scala.
object LazyZipTest extends App{
val incomingUrl = List("users", "profile", "12345", "extra").view
val urlToCompare = List("users", "profile", ":id").view
val list1 = incomingUrl.map(Some(_))
val list2 = urlToCompare.map(Some(_))
val zipped = list1.zipAll(list2, None, None)
println(zipped)
}
The second implementation uses a custom fold to go over lists only one time. Since we are appending to the rear of our data structure, we want to use IndexedSeq, not List. You should rarely be using List anyway. Otherwise, if you are going to convert from List to IndexedSeq, you are actually making one additional pass over the data, in which case, you might as well not bother and just use the naive implementation I already wrote in the other answer.
Here is the custom fold.
object FoldTest extends App{
val incomingUrl = List("users", "profile", "12345", "extra").toIndexedSeq
val urlToCompare = List("users", "profile", ":id").toIndexedSeq
def onePassZip[T, U](l1: IndexedSeq[T], l2: IndexedSeq[U]): IndexedSeq[(Option[T], Option[U])] = {
val folded = l1.foldLeft((l2, IndexedSeq[(Option[T], Option[U])]())) { (acc, e) =>
acc._1 match {
case x +: xs => (xs, acc._2 :+ (Some(e), Some(x)))
case IndexedSeq() => (IndexedSeq(), acc._2 :+ (Some(e), None))
}
}
folded._2 ++ folded._1.map(x => (None, Some(x)))
}
println(onePassZip(incomingUrl, urlToCompare))
println(onePassZip(urlToCompare, incomingUrl))
}
If you have any questions, I can answer them in the comments section.
Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
def foo(source: DataFrame): DataFrame = {
...complex iterative algorithm with a stopping condition...
}
}
Since the implementation of foo has many "Actions" (collect, reduce etc.), calling foo immediately triggers the expensive execution.
This is not a big problem. However, since foo only converts one DataFrame to another, by convention it should allow lazy execution: the implementation of foo should be executed only if the resulting DataFrame or its derivative(s) are used on the driver (through another "Action").
So far, the only way to reliably achieve this is to write the whole implementation into a SparkPlan and superimpose it onto the DataFrame's SparkExecution. This is very error-prone and involves a lot of boilerplate code. What is the recommended way to do this?
It is not exactly clear to me what you are trying to achieve, but Scala itself provides at least a few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
if (takeFirst) first else second
val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)
foo(
{ println("first"); rdd1.count },
{ println("second"); rdd2.count },
true // Only first will be evaluated
)
// first
// Long = 10000
Note: In practice you should create local lazy binding to make sure that arguments are not evaluated on every access.
infinite lazy collections like Stream
import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.rdd.RDD
val initial = normalRDD(sc, 1000000L, 10)
// Infinite stream of RDDs and actions and nothing blows :)
val stream: Stream[RDD[Double]] = Stream(initial).append(
stream.map {
case rdd if !rdd.isEmpty =>
val mu = rdd.mean
rdd.filter(_ > mu)
case _ => sc.emptyRDD[Double]
}
)
Some subset of these should be more than enough to implement complex lazy computations.
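The note about local lazy bindings for call-by-name arguments can be illustrated without Spark. In this sketch `Counter.expensive` is a hypothetical stand-in for an action such as `rdd.count`; it simply records how often it runs.

```scala
object ByNameDemo {
  // Hypothetical stand-in for an expensive action such as rdd.count.
  final class Counter {
    var calls = 0
    def expensive(): Long = { calls += 1; 10000L }
  }

  // x is re-evaluated on every access: two accesses, two evaluations.
  def sumNaive(x: => Long): Long = x + x

  // The local lazy val evaluates x at most once, and only if it is used.
  def sumCached(x: => Long): Long = {
    lazy val cached = x
    cached + cached
  }
}
```

Passing `counter.expensive()` to `sumNaive` runs it twice, while `sumCached` runs it once, which is the difference that matters when the argument triggers a Spark job.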
I have 2 datasets.
One is a dataframe with a bunch of data, one column has comments (a string).
The other is a list of words.
If a comment contains a word in the list, I want to replace the word in the comment with ##### and return the comment in full with the replaced words.
Here's some sample data:
CommentSample.txt
1 A badword small town
2 "Love the truck, though rattle is annoying."
3 Love the paint!
4
5 "Like that you added the ""oh badword2"" handle to passenger side."
6 "badword you. specific enough for you, badword3?"
7 This car is a piece if badword2
ProfanitySample.txt
badword
badword2
badword3
Here's my code so far:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Response(UniqueID: Int, Comment: String)
val response = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim.toString, r(10).trim.toInt)).toDF()
var profanity = sc.textFile("file:/data/ProfanitySample.txt").map(x => (x.toLowerCase())).toArray();
def replaceProfanity(s: String): String = {
val l = s.toLowerCase()
val r = "#####"
if(profanity.contains(s))
r
else
s
}
def processComment(s: String): String = {
val commentWords = sc.parallelize(s.split(' '))
commentWords.foreach(replaceProfanity)
commentWords.collect().mkString(" ")
}
response.select(processComment("Comment")).show(100)
It compiles, it runs, but the words are not replaced.
I don't know how to debug in Scala.
I'm totally new! This is my first project ever!
Many thanks for any pointers.
-M
First, I think the use case you describe here won't benefit much from DataFrames. It's simpler to implement using RDDs only (DataFrames are mostly convenient when your transformations can easily be described using SQL, which isn't the case here).
So - here's a possible implementation using RDDs. This assumes the list of profanities isn't too large (i.e. up to ~thousands), so we can collect it into non-distributed memory. If that's not the case, a different approach (involving a join) might be needed.
import org.apache.spark.rdd.RDD

case class Response(UniqueID: Int, Comment: String)
val mask = "#####"
val responses: RDD[Response] = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim))
val profanities: Array[String] = sc.textFile("file:/data/ProfanitySample.txt").collect()
val result = responses.map(r => {
// using foldLeft here means we'll replace profanities one by one,
// with the result of each replace as the input of the next,
// starting with the original comment
profanities.foldLeft(r.Comment)({
case (updatedComment, profanity) => updatedComment.replaceAll(s"(?i)\\b$profanity\\b", mask)
})
})
result.take(10).foreach(println) // just printing some examples...
Note that the case-insensitivity and the "words only" limitations are implemented in the regex itself: "(?i)\\bSomeWord\\b".
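The regex part can be tried on plain strings, without Spark. The sketch below follows the same foldLeft idea; `Pattern.quote` is an addition that is not in the answer above, included on the assumption that a word in the list might contain regex metacharacters. The names `MaskWords` and `maskAll` are hypothetical.

```scala
import java.util.regex.Pattern

object MaskWords {
  val mask = "#####"

  // Replaces each listed word, case-insensitively and only at word
  // boundaries, feeding the result of one replacement into the next.
  def maskAll(comment: String, words: Seq[String]): String =
    words.foldLeft(comment) { (acc, word) =>
      acc.replaceAll("(?i)\\b" + Pattern.quote(word) + "\\b", mask)
    }
}
```

Because `\b` requires a word boundary, `badword` does not match inside `badword2`; the longer word is masked by its own entry in the list.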
Here is an idiom I find myself writing.
def chooseName(nameFinder: NameFinder) = {
if(nameFinder.getReliableName.nonEmpty) nameFinder.getReliableName
else nameFinder.secondBestChoice
}
In order to avoid calling getReliableName() twice on nameFinder, I add code that makes my method look less elegant.
def chooseName(nameFinder: NameFinder) = {
val reliableName = nameFinder.getReliableName()
val secondBestChoice = nameFinder.getSecondBestChoice()
if(reliableName.nonEmpty) reliableName
else secondBestChoice
}
This feels dirty because I am creating an unnecessary amount of state using the vals for no reason other than to prevent a duplicate method call. Scala has taught me that whenever I feel dirty there is almost always a better way.
Is there a more elegant way to write this?
Here's two Strings, return whichever isn't empty while favoring the first
There's no need to always call getSecondBestChoice, of course. Personally, I find nothing inelegant about the code after changing that: it's clear what it does and has no mutable state. The other answers seem overcomplicated just to avoid using a val.
def chooseName(nameFinder: NameFinder) = {
val reliableName = nameFinder.getReliableName()
if(reliableName.nonEmpty) reliableName
else nameFinder.getSecondBestChoice()
}
If you really want to avoid the val, here's another variant (generalises well if there are more than two alternatives)
List(nameFinder.getReliableName(), nameFinder.getSecondBestChoice()).find(_.nonEmpty).get
(or getOrElse(lastResort) if everything in the list may be empty too)
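One caveat with the List variant is that every alternative is evaluated before find runs. If that matters, a lazy version could take the alternatives as thunks. This is only a sketch, and `firstNonEmpty` is a hypothetical helper, not part of any library.

```scala
object FirstNonEmpty {
  // Each candidate is a thunk, so it only runs if all earlier ones were empty.
  def firstNonEmpty(candidates: (() => String)*): Option[String] =
    candidates.iterator.map(f => f()).find(_.nonEmpty)
}
```

A call might look like `FirstNonEmpty.firstNonEmpty(() => nameFinder.getReliableName(), () => nameFinder.getSecondBestChoice())`, and it returns None when every candidate is empty.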
Here's a way using Option. It's not that much prettier, but everything is called only once. This assumes you want a String as a result, and don't care if the second string is empty.
Some(nameFinder.getReliableName)
.filter(_.nonEmpty)
.getOrElse(nameFinder.secondBestChoice)
Option(nameFinder.getReliableName) // transforms a potential null into None
.filter(_.trim.nonEmpty) // "" is None, but also " "
.getOrElse(nameFinder.secondBestChoice)
Or better, if you can modify getReliableName to return an Option[String]:
def chooseName(nameFinder: NameFinder): String =
nameFinder.getReliableName getOrElse nameFinder.secondBestChoice
Finally, if secondBestChoice can fail as well (assuming it returns an Option[String]):
def chooseName(nameFinder: NameFinder): Option[String] =
nameFinder.getReliableName orElse nameFinder.secondBestChoice
If you need it more than once:
scala> implicit class `nonempty or else`(val s: String) extends AnyVal {
| def nonEmptyOrElse(other: => String) = if (s.isEmpty) other else s }
defined class nonempty
scala> "abc" nonEmptyOrElse "def"
res2: String = abc
scala> "" nonEmptyOrElse "def"
res3: String = def
Pattern matching may give neater, more Scala-like code:
def chooseName(nameFinder: NameFinder) = {
nameFinder.getReliableName match {
case r if r.nonEmpty => r
case _ => nameFinder.secondBestChoice
}
}
I have a SQL database table with the following structure:
create table category_value (
category varchar(25),
property varchar(25)
);
I want to read this into a Scala Map[String, Set[String]] where each entry in the map is a set of all of the property values that are in the same category.
I would like to do it in a "functional" style with no mutable data (other than the database result set).
Following on the Clojure loop construct, here is what I have come up with:
import scala.annotation.tailrec

def fillMap(statement: java.sql.Statement): Map[String, Set[String]] = {
  val resultSet = statement.executeQuery("select category, property from category_value")
  @tailrec
  def loop(m: Map[String, Set[String]]): Map[String, Set[String]] = {
    if (resultSet.next) {
      val category = resultSet.getString("category")
      val property = resultSet.getString("property")
      // add the property to the set already collected for this category
      loop(m + (category -> (m.getOrElse(category, Set.empty[String]) + property)))
    } else m
  }
  loop(Map.empty)
}
Is there a better way to do this, without using mutable data structures?
If you like, you could try something around
def fillMap(statement: java.sql.Statement): Map[String, Set[String]] = {
val resultSet = statement.executeQuery("select category, property from category_value")
Iterator.continually((resultSet, resultSet.next)).takeWhile(_._2).map(_._1).map{ res =>
val category = res.getString("category")
val property = res.getString("property")
(category, property)
}.toIterable.groupBy(_._1).mapValues(_.map(_._2).toSet)
}
Untested, because I don’t have a proper sql.Statement. And the groupBy part might need some more love to look nice.
Edit: Added the requested changes.
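The grouping step can be checked in isolation on in-memory tuples. The sketch below uses map instead of mapValues, since mapValues returns a lazy view in Scala 2.12 and earlier and is deprecated on Map in 2.13. The names `GroupRows` and `toCategoryMap` are hypothetical.

```scala
object GroupRows {
  // Groups (category, property) pairs by category and collects the
  // properties of each category into a Set, dropping duplicates.
  def toCategoryMap(rows: Seq[(String, String)]): Map[String, Set[String]] =
    rows.groupBy(_._1).map { case (category, pairs) =>
      category -> pairs.map(_._2).toSet
    }
}
```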
There are two parts to this problem.
Getting the data out of the database and into a list of rows.
I would use a Spring SimpleJdbcOperations for the database access, so that things at least appear functional, even though the ResultSet is being changed behind the scenes.
First, a simple conversion to let us use a closure to map each row:
implicit def rowMapper[T<:AnyRef](func: (ResultSet)=>T) =
new ParameterizedRowMapper[T]{
override def mapRow(rs:ResultSet, row:Int):T = func(rs)
}
Then let's define a data structure to store the results. (You could use a tuple, but defining my own case class has the advantage of being just a little bit clearer regarding the names of things.)
case class CategoryValue(category:String, property:String)
Now select from the database
val db:SimpleJdbcOperations = //get this somehow
val resultList:java.util.List[CategoryValue] =
db.query("select category, property from category_value",
{ rs:ResultSet => CategoryValue(rs.getString(1),rs.getString(2)) } )
Converting the data from a list of rows into the format that you actually want
import scala.collection.JavaConversions._
val result:Map[String,Set[String]] =
resultList.groupBy(_.category).mapValues(_.map(_.property).toSet)
(You can omit the type annotations. I've included them to make it clear what's going on.)
Builders are built for this purpose. Get one via the desired collection type companion, e.g. HashMap.newBuilder[String, Set[String]].
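A single map builder would overwrite a category each time it reappears, so a grouping version probably needs one Set builder per category. The following is a sketch under that assumption. It takes the rows as an iterator of (category, property) pairs rather than a ResultSet, and the builders are mutable internally even though nothing mutable escapes the method. The names `BuilderGrouping` and `group` are hypothetical.

```scala
import scala.collection.mutable

object BuilderGrouping {
  def group(rows: Iterator[(String, String)]): Map[String, Set[String]] = {
    // One Set builder per category, created on first sight of the category.
    val perCategory =
      mutable.Map.empty[String, mutable.Builder[String, Set[String]]]
    for ((category, property) <- rows)
      perCategory.getOrElseUpdate(category, Set.newBuilder[String]) += property

    // Freeze every builder into an immutable Set and assemble the result.
    val result = Map.newBuilder[String, Set[String]]
    for ((category, builder) <- perCategory)
      result += (category -> builder.result())
    result.result()
  }
}
```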
This solution is basically the same as my other solution, but it doesn't use Spring, and the logic for converting a ResultSet to some sort of list is simpler than Debilski's solution.
def streamFromResultSet[T](rs:ResultSet)(func: ResultSet => T):Stream[T] = {
  if (rs.next())
    func(rs) #:: streamFromResultSet(rs)(func)
  else {
    // braces are required so that both statements belong to the else branch
    rs.close()
    Stream.empty
  }
}
def fillMap(statement:java.sql.Statement):Map[String,Set[String]] = {
case class CategoryValue(category:String, property:String)
val resultSet = statement.executeQuery("""
select category, property from category_value
""")
val queryResult = streamFromResultSet(resultSet){rs =>
CategoryValue(rs.getString(1),rs.getString(2))
}
queryResult.groupBy(_.category).mapValues(_.map(_.property).toSet)
}
There is only one approach I can think of that does not include either mutable state or extensive copying*. It is actually a very basic technique I learnt in my first term studying CS. Here goes, abstracting from the database stuff:
def empty[K,V](k : K) : Option[V] = None
def add[K,V](m : K => Option[V])(k : K, v : V) : K => Option[V] = q => {
if ( k == q ) {
Some(v)
}
else {
m(q)
}
}
def build[K,V](input : TraversableOnce[(K,V)]) : K => Option[V] = {
input.foldLeft(empty[K,V]_)((m,i) => add(m)(i._1, i._2))
}
Usage example:
val map = build(List(("a",1),("b",2)))
println("a " + map("a"))
println("b " + map("b"))
println("c " + map("c"))
> a Some(1)
> b Some(2)
> c None
Of course, the resulting function does not have type Map (nor any of its benefits) and has linear lookup costs. I guess you could implement something in a similar way that mimics simple search trees.
(*) I am talking concepts here. In reality, things like value sharing might enable e.g. mutable list constructions without memory overhead.