How does $(objectName) work in Scala Spark? - scala

Hi, I am trying to learn the design of org.apache.spark.ml. I noticed that param values are often retrieved using the $ symbol.
For example, in NGram.scala
val n: IntParam = new IntParam(this, "n", "number of elements per n-gram (>=1)",
  ParamValidators.gtEq(1))

/** @group setParam */
def setN(value: Int): this.type = set(n, value)

/** @group getParam */
def getN: Int = $(n)
Is $ a Scala operator with general features?
I can see in Eclipse (when I hover over $(n)) that it is defined as
final protected def $[T](param: Param[T]): T
but I could not find the class where it is defined this way.
Can you please tell me where it is defined?

A dollar sign is a perfectly legal method name in Scala. You can see the definition of this method here, in the Params trait of org.apache.spark.ml.param.
In this case, it is simply an alias for getOrDefault.
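To make that concrete, here is a minimal, self-contained sketch of the pattern. This is my own simplified mock, not Spark's actual class hierarchy; it only shows that `$` is an ordinary method name acting as an alias for `getOrDefault`:

```scala
// A Param carries a name and a default value (simplified stand-in).
class Param[T](val name: String, val default: T)

trait Params {
  private val values = scala.collection.mutable.Map.empty[String, Any]

  def set[T](param: Param[T], value: T): this.type = {
    values(param.name) = value
    this
  }

  def getOrDefault[T](param: Param[T]): T =
    values.getOrElse(param.name, param.default).asInstanceOf[T]

  // The dollar sign is just a legal Scala identifier:
  final protected def $[T](param: Param[T]): T = getOrDefault(param)
}

class NGramLike extends Params {
  val n = new Param[Int]("n", 2)
  def setN(value: Int): this.type = set(n, value)
  def getN: Int = $(n) // reads like "the value of n"
}
```

The alias buys nothing but brevity: inside Spark's transformers, `$(n)` is shorter and reads more like "the value of n" than `getOrDefault(n)`.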

Related

Scala : StringBuilder class (append vs ++=) performance

I was using ++= in Scala to combine string values with a StringBuilder instance.
The StringBuilder class also provides an append method that takes a string parameter.
I can see both methods in the Scala StringBuilder documentation here.
I was told that ++= is slower than append when combining multiple string values, but I have not been able to find any documentation saying it would be slower.
If anyone can give me a link to documentation, or an explanation of why ++= would be slower than append, that would help me understand the concept better.
The operation I am performing is as below:
val sourceAlias = "source"
val destinationAlias = "destination"
val compositeKeys = Array("Id", "Name")
val initialUpdateExpression = new StringBuilder("")
for (key <- compositeKeys) {
  initialUpdateExpression ++= s"$sourceAlias.$key = $destinationAlias.$key and "
}
initialUpdateExpression ++= s"$sourceAlias.Valid = $destinationAlias.Valid"
val updateExpression = initialUpdateExpression.toString()
They seem to be the same, as far as I can tell directly from the code in StringBuilder.scala for Scala 2.13.8.
++= looks like this:
/** Alias for `addAll` */
def ++= (s: String): this.type = addAll(s)
which calls:
/** Overloaded version of `addAll` that takes a string */
def addAll(s: String): this.type = { underlying.append(s); this }
That in the end, calls:
@Override
@HotSpotIntrinsicCandidate
public StringBuilder append(String str) {
    super.append(str);
    return this;
}
Whereas, append is overloaded, but assuming you want the String version:
/** Appends the given String to this sequence.
 *
 *  @param  s  a String.
 *  @return    this StringBuilder.
 */
def append(s: String): StringBuilder = {
  underlying append s
  this
}
Which in turn, calls:
@Override
@HotSpotIntrinsicCandidate
public StringBuilder append(String str) {
    super.append(str);
    return this;
}
Which is exactly the same code, as ++= calls. So no, they should have the same performance for your particular use case. I also tried decompiling an example with both method calls, and I did not see any difference between them.
EDIT:
Perhaps you were told about better performance when combining multiple string values in the same append versus using String concatenation with +. For example:
sb.append("some" + "thing")
sb.append("some").append("thing")
The second line is slightly more efficient, since in the first one you create an additional String and an additional unnamed StringBuilder. If this is the case, check this post for clarification on this matter.
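For completeness, a quick sanity check (not a benchmark) that both spellings produce the same string for the asker's use case, consistent with both delegating to the same underlying java.lang.StringBuilder.append(String):

```scala
// Build the same update expression twice, once with ++= and once with append.
val keys = Array("Id", "Name")

val viaPlusPlusEq = new StringBuilder
for (key <- keys) viaPlusPlusEq ++= s"source.$key = destination.$key and "
viaPlusPlusEq ++= "source.Valid = destination.Valid"

val viaAppend = new StringBuilder
for (key <- keys) viaAppend.append(s"source.$key = destination.$key and ")
viaAppend.append("source.Valid = destination.Valid")

// Identical output, identical call chain underneath.
val same = viaPlusPlusEq.toString == viaAppend.toString
```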
There is no difference. But you should not be using StringBuilder to begin with. Mutable structures and vars are evil, you should pretend they do not exist at all, at least until you acquire enough command of the language to be able to identify the 0.1% of use cases where they are actually necessary.
val updateExpression = Seq("Id", "Name", "Valid")
  .map { key => s"source.$key = destination.$key" }
  .mkString(" and ")

Understanding Sets and Sequences using String checking as an example

I have a string which I would like to check to see whether it is made purely of letters and spaces.
val str = "my long string to test"
val purealpha = " abcdefghijklmnopqrstuvwxyz".toSet
if (str.forall(purealpha(_))) println("PURE") else println("NOTPURE")
The above CONCISE code does the job. However, if I run it this way:
val str = "my long string to test"
val purealpha = " abcdefghijklmnopqrstuvwxyz" // not converted toSet
str.forall(purealpha(_)) // CONCISE code
I get an error (found: Char ... required: Boolean) and it can only work using the contains method this way:
str.forall(purealpha.contains(_))
My question is how can I use the CONCISE form without converting the string to a Set. Any suggestions on having my own String class with the right combination of methods to enable the nice code; or maybe some pure function(s) working on strings.
It's just a fun exercise I'm doing, so I can understand the intricate details of various methods on collections (including apply method) and how to write nice concise code and classes.
A slightly different approach is to use a regex pattern.
val str = "my long string to test"
val purealpha = "[ a-z]+"
str matches purealpha // res0: Boolean = true
If we look at the source code we can see that both these implementations are doing different things, although giving the same result.
When you convert it to a Set and use forall, you are ultimately calling the Set's apply method. Here is how apply is called explicitly in your code, also using named parameters in the anonymous functions:
if (str.forall(s => purealpha.apply(s))) println("PURE") else println("NOTPURE") // first example
str.forall(s => purealpha.apply(s)) // second example
Anyway, let's take a look at the source code for apply for Set (gotten from GenSetLike.scala):
/** Tests if some element is contained in this set.
 *
 *  This method is equivalent to `contains`. It allows sets to be interpreted as predicates.
 *  @param elem the element to test for membership.
 *  @return `true` if `elem` is contained in this set, `false` otherwise.
 */
def apply(elem: A): Boolean = this contains elem
When you keep it as a plain String, you have to call .contains explicitly (this is the source code for that, from SeqLike.scala):
/** Tests whether this $coll contains a given value as an element.
 *  $mayNotTerminateInf
 *
 *  @param elem the element to test.
 *  @return `true` if this $coll has an element that is equal (as
 *          determined by `==`) to `elem`, `false` otherwise.
 */
def contains[A1 >: A](elem: A1): Boolean = exists (_ == elem)
As you can imagine, doing an apply for the String literal will not give the same result as doing an apply for a Set.
For more conciseness, you can omit the (_) entirely in the second example (the compiler's type inference will pick that up):
val str = "my long string to test"
val purealpha = " abcdefghijklmnopqrstuvwxyz" // not converted toSet
str.forall(purealpha.contains)
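As for getting the CONCISE form on a plain String: an implicit class adding apply(Char) would not help here, because String already has apply(Int): Char, a Char argument weakly conforms to Int, and so the compiler picks that overload and reports exactly the "found: Char ... required: Boolean" error from the question before any implicit conversion is tried. One workaround is a small explicit wrapper; CharPred below is my own name, not a standard library type:

```scala
// A tiny wrapper that gives a String a Set-like apply(Char): Boolean,
// sidestepping the clash with StringOps.apply(Int): Char.
final class CharPred(s: String) {
  def apply(c: Char): Boolean = s.contains(c)
}

val str = "my long string to test"
val purealpha = new CharPred(" abcdefghijklmnopqrstuvwxyz")
val isPure = str.forall(purealpha(_)) // the CONCISE form, no Set needed
```

The one-time `new CharPred(...)` at the definition site buys the concise call sites everywhere else.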

Map an instance using function in Scala

Say I have a local method/function
def withExclamation(string: String) = string + "!"
Is there a way in Scala to transform an instance by supplying this method? Say I want to append an exclamation mark to a string. Something like:
val greeting = "Hello"
val loudGreeting = greeting.applyFunction(withExclamation) //result: "Hello!"
I would like to be able to invoke (local) functions when writing a chain transformation on an instance.
EDIT: Multiple answers show how to program this possibility, so it seems that this feature is not present on an arbitrary class. To me this feature seems incredibly powerful. Consider where in Java I want to execute a number of operations on a String:
appendExclamationMark(" Hello! ".trim().toUpperCase()); //"HELLO!"
The order of operations is not the same as how they read. The last operation, appendExclamationMark is the first word that appears. Currently in Java I would sometimes do:
Function.<String>identity()
    .andThen(String::trim)
    .andThen(String::toUpperCase)
    .andThen(this::appendExclamationMark)
    .apply(" Hello "); //"HELLO!"
Which reads better in terms of expressing a chain of operations on an instance, but also contains a lot of noise, and it is not intuitive to have the String instance at the last line. I would want to write:
" Hello "
.applyFunction(String::trim)
.applyFunction(String::toUpperCase)
.applyFunction(this::withExclamation); //"HELLO!"
Obviously the name of the applyFunction function can be anything (shorter please). I thought backwards compatibility was the sole reason Java's Object does not have this.
Is there any technical reason why this was not added on, say, the Any or AnyRef classes?
You can do this with an implicit class which provides a way to extend an existing type with your own methods:
object StringOps {
  implicit class RichString(val s: String) extends AnyVal {
    def withExclamation: String = s"$s!"
  }

  def main(args: Array[String]): Unit = {
    val m = "hello"
    println(m.withExclamation)
  }
}
Yields:
hello!
If you want to apply any functions (anonymous, converted from methods, etc.) in this way, you can use a variation on Yuval Itzchakov's answer:
object Combinators {
  implicit class Combinators[A](val x: A) {
    def applyFunction[B](f: A => B): B = f(x)
  }
}
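Putting that to use, a self-contained sketch (repeating the implicit class so the snippet runs on its own). Note that since Scala 2.13 the standard library ships this exact pattern as pipe in scala.util.chaining:

```scala
// Extension method letting any value be piped through a function.
object Combinators {
  implicit class Combinators[A](val x: A) {
    def applyFunction[B](f: A => B): B = f(x)
  }
}
import Combinators._

def withExclamation(string: String) = string + "!"

// The chain now reads top to bottom, with the instance first:
val loudGreeting = " Hello "
  .applyFunction((s: String) => s.trim)
  .applyFunction((s: String) => s.toUpperCase)
  .applyFunction(withExclamation)
```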
A while after asking this question, I noticed that Kotlin has this built in:
inline fun <T, R> T.let(block: (T) -> R): R
Calls the specified function block with this value as its argument and returns
its result.
Many more useful variations of the above function are provided on all types, like with, also, apply, etc.

Adding types in [ ] brackets in function definition

Welcome everyone,
I am actually learning Scala by studying the "Functional Programming in Scala" book, and in this book the authors parametrize functions by adding types in [ ] brackets after the name of the function, like:
def sortList[A](xs: List[A]): List[A] = ...
What is the reason for doing that? Can't the compiler infer it by itself from the parameters? Or am I missing something?
In the specific instance above, A is the type that the sortList function will work on. In other words, it will take a List containing objects of type A and sort them, returning a new List with objects of type A.
You would use it as follows:
val list = 10::30::List(20)
val sortedList = sortList(list)
The Scala compiler will detect that the type of the List being passed in is List[Int] and understand that the A in the declaration stands for the element type of that List.
The type has to be known at compile time, but Scala is very good at inferring types, and in the example above it can see that the list being passed in is a List of Int.
It is important to note that the type is initially inferred when the List is created (as List[Int]), and the compiler can then see later, when the list is passed to sortList, that the List going into sortList is a List[Int].
Here are some additional examples that I threw together to show more. If you run these commands in the Scala command line or using a Scala Eclipse worksheet you'll see what is going on. The only additional thing here really is that it also shows you can apply the type to sortList without other parameters to make it specific to Ints rather than applicable to all types.
/* Declare a List[Int] and List[String] for use later */
val list = 10 :: 30 :: List(20)
val stringList = "1" :: "2" :: List("3")

/* Doesn't actually sort - just returns xs */
def sortList[A](xs: List[A]): List[A] = xs

/* Sort both lists */
sortList(list)
sortList(stringList)

/* Create a version of sortList which just works on Ints */
def sortListInt = sortList[Int] _

/* Sort the List[Int] with the new function */
sortListInt(list)

/* Fails to compile - sortListInt is a version of sortList
   which is only applicable to Int */
// sortListInt(stringList)
You have to declare the type parameter before you use it; this way you introduce the type, so you can use it later. This is similar to generic types on classes (as in Java or C#).
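A short sketch of declaration versus use: the [A] after the name is what tells the compiler A is a type parameter rather than some concrete type that should already exist in scope. At the call site the compiler infers it, or you can pin it explicitly:

```scala
// [A] declares the type parameter; the parameter list then uses it.
def firstAndLast[A](xs: List[A]): (A, A) = (xs.head, xs.last)

val ints   = firstAndLast(List(1, 2, 3))     // A inferred as Int
val strs   = firstAndLast(List("a", "b"))    // A inferred as String
val pinned = firstAndLast[Double](List(1.0)) // A fixed explicitly
```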

Strongly typed access to csv in scala?

I would like to access csv files in scala in a strongly typed manner. For example, as I read each line of the csv, it is automatically parsed and represented as a tuple with the appropriate types. I could specify the types beforehand in some sort of schema that is passed to the parser. Are there any libraries that exist for doing this? If not, how could I go about implementing this functionality on my own?
product-collections appears to be a good fit for your requirements:
scala> val data = CsvParser[String,Int,Double].parseFile("sample.csv")
data: com.github.marklister.collections.immutable.CollSeq3[String,Int,Double] =
CollSeq((Jan,10,22.33),
(Feb,20,44.2),
(Mar,25,55.1))
product-collections uses opencsv under the hood.
A CollSeq3 is an IndexedSeq[Product3[T1,T2,T3]] and also a Product3[Seq[T1],Seq[T2],Seq[T3]] with a little sugar. I am the author of product-collections.
Here's a link to the io page of the scaladoc
Product3 is essentially a tuple of arity 3.
If your content has double quotes that enclose other double quotes, commas, or newlines, I would definitely use a library like opencsv that deals properly with special characters. Typically you end up with an Iterator[Array[String]]. Then you use Iterator.map or collect to transform each Array[String] into your tuple, handling type-conversion errors there. If you need to process the input without loading it all into memory, keep working with the iterator; otherwise you can convert it to a Vector or List and close the input stream.
So it may look like this:
import scala.collection.JavaConverters._ // for .asScala on the Java iterator

val reader = new CSVReader(new FileReader(filename))
val iter = reader.iterator().asScala
val typed = iter collect {
  case Array(double, int, string) => (double.toDouble, int.toInt, string)
}
// do more work with typed
// close reader in a finally block
Depending on how you need to deal with errors, you can return Left for errors and Right for success tuples to separate the errors from the correct rows. Also, I sometimes wrap all of this using scala-arm for closing resources, so my data may be wrapped in the resource.ManagedResource monad and I can use input coming from multiple files.
Finally, although you want to work with tuples, I have found that it is usually clearer to have a case class that is appropriate for the problem and then write a method that creates that case class object from an Array[String].
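A sketch of that case-class suggestion: one function that builds the case class from an Array[String], returning Either so bad rows surface as errors instead of exceptions. The Reading type and field names here are illustrative, and partitionMap needs Scala 2.13+:

```scala
// Illustrative row type for a three-column CSV: city, month, temperature.
case class Reading(city: String, month: Int, temp: Double)

def readingFromFields(fields: Array[String]): Either[String, Reading] =
  fields match {
    case Array(city, month, temp) =>
      try Right(Reading(city, month.toInt, temp.toDouble))
      catch { case _: NumberFormatException => Left("bad row: " + fields.mkString(",")) }
    case _ => Left("wrong arity: " + fields.mkString(","))
  }

// Split parsed rows into errors and successes in one pass.
val rows = List(Array("Oslo", "1", "-3.5"), Array("Oslo", "x", "2.0"))
val (errors, readings) = rows.map(readingFromFields).partitionMap(identity)
```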
You can use kantan.csv, which is designed with precisely that purpose in mind.
Imagine you have the following input:
1,Foo,2.0
2,Bar,false
Using kantan.csv, you could write the following code to parse it:
import kantan.csv.ops._
new File("path/to/csv").asUnsafeCsvRows[(Int, String, Either[Float, Boolean])](',', false)
And you'd get an iterator where each entry is of type (Int, String, Either[Float, Boolean]). Note the bit where the last column in your CSV can be of more than one type, but this is conveniently handled with Either.
This is all done in an entirely type safe way, no reflection involved, validated at compile time.
Depending on how far down the rabbit hole you're willing to go, there's also a shapeless module for automated case class and sum type derivation, as well as support for scalaz and cats types and type classes.
Full disclosure: I'm the author of kantan.csv.
I've created a strongly-typed CSV helper for Scala, called object-csv. It is not a fully fledged framework, but it can be adjusted easily. With it you can do this:
val peopleFromCSV = readCSV[Person](fileName)
Where Person is case class, defined like this:
case class Person (name: String, age: Int, salary: Double, isNice:Boolean = false)
Read more about it in GitHub, or in my blog post about it.
Edit: as pointed out in a comment, kantan.csv (see other answer) is probably the best as of the time I made this edit (2020-09-03).
This is made more complicated than it ought to be because of the nontrivial quoting rules for CSV. You probably should start with an existing CSV parser, e.g. OpenCSV or one of the projects called scala-csv. (There are at least three.)
Then you end up with some sort of collection of collections of strings. If you don't need to read massive CSV files quickly, you can just try to parse each line into each of your types and take the first one that doesn't throw an exception. For example,
import scala.util._

case class Person(first: String, last: String, age: Int) {}

object Person {
  def fromCSV(xs: Seq[String]) = Try(xs match {
    case s0 +: s1 +: s2 +: more => new Person(s0, s1, s2.toInt)
  })
}
If you do need to parse them fairly quickly and you don't know what might be there, you should probably use some sort of matching (e.g. regexes) on the individual items. Either way, if there's any chance of error you probably want to use Try or Option or somesuch to package errors.
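A usage sketch of the fromCSV approach above, repeating the definitions so the snippet stands alone (the sample rows are my own). Rows that are too short become Failure(MatchError), rows with a bad number become Failure(NumberFormatException), and collect keeps only the successes:

```scala
import scala.util.{Success, Try}

case class Person(first: String, last: String, age: Int)

object Person {
  def fromCSV(xs: Seq[String]): Try[Person] = Try(xs match {
    case s0 +: s1 +: s2 +: more => Person(s0, s1, s2.toInt)
  })
}

val rows = List(
  Seq("Ada", "Lovelace", "36"),
  Seq("not", "a", "number"), // toInt fails -> Failure
  Seq("too", "short")        // pattern fails -> Failure
)
val people = rows.map(Person.fromCSV).collect { case Success(p) => p }
```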
I built my own approach to strongly type the final product, more than the reading stage itself, which (as pointed out) might be better handled as stage one with something like Apache Commons CSV; stage two could be what I've done here. The idea is to parameterize CsvReader[T] with a type T; upon construction, you must supply the reader with a factory object of type CsvFactory[T]. The class itself (or, in my example, a helper object) decides the construction details and thus decouples them from the actual reading. You could use implicit objects to pass the helper around, but I've not done that here. The only downside is that each row of the CSV must be of the same class type, but you could expand this concept as needed. Here's the code; you are welcome to it.
import scala.io.Source

/**
 * @param factory builds one T from a row of fields
 * @param fname   file name
 * @param delim   field delimiter: "\t", etc.
 * (the header row is assumed to exist and is skipped)
 */
class CsvReader[T](factory: CsvFactory[T], fname: String, delim: String) {

  private val f = Source.fromFile(fname)
  private var lines = f.getLines // iterator
  private var fileClosed = false

  if (lines.hasNext) lines = lines.dropWhile(_.trim.isEmpty) // skip leading blank lines
  lines = lines.drop(1) // drop header, assumed to exist

  def hasNext = if (fileClosed) false else lines.hasNext

  /**
   * Also closes the file when the end is reached.
   * @return the line
   */
  def nextRow(): String = { // public version
    val ans = lines.next
    if (ans.isEmpty) throw new Exception("Error in CSV, reading past end " + fname)
    if (lines.hasNext) lines = lines.dropWhile(_.trim.isEmpty) else close()
    ans
  }

  //def nextObj[T](factory: CsvFactory[T]): T  // past version
  def nextObj(): T = { // public version
    val s = nextRow()
    val a = s.split(delim)
    factory makeObj a
  }

  def allObj(): Seq[T] = {
    val ans = scala.collection.mutable.Buffer[T]()
    while (hasNext) ans += nextObj()
    ans.toList
  }

  def close() = {
    f.close()
    fileClosed = true
  }
} // class
Next, the example helper factory and an example main:
trait CsvFactory[T] { // handles all serial controls (in and out)
  def makeObj(a: Seq[String]): T   // for reading
  def makeRow(obj: T): Seq[String] // the factory basically just passes this duty along
  def header: Seq[String]          // must define headers for writing
}

/**
 * Each class implements this as needed, so the object can be serialized by the writer.
 */
case class TestRecord(val name: String, val addr: String, val zip: Int) {
  def toRow(): Seq[String] = List(name, addr, zip.toString) // handle conversion to CSV
}

object TestFactory extends CsvFactory[TestRecord] {
  def makeObj(a: Seq[String]): TestRecord = new TestRecord(a(0), a(1), a(2).toDouble.toInt)
  def header = List("name", "addr", "zip")
  def makeRow(o: TestRecord): Seq[String] = o.toRow.map(_.toUpperCase())
}
object CsvSerial {
  def main(args: Array[String]): Unit = {
    val whereami = System.getProperty("user.dir")
    println("Begin CSV test in " + whereami)

    val reader = new CsvReader(TestFactory, "TestCsv.txt", "\t")
    val all = reader.allObj() // read the whole CSV into memory
    println(all)
    reader.close()

    val writer = new CsvWriter(TestFactory, "TestOut.txt", "\t")
    for (x <- all) writer.printObj(x)
    writer.close()
  } // main
}
Example CSV (tab separated; you might need to repair the tabs if you copy from an editor):
Name	Addr	Zip
"Sanders, Dante R."	4823 Nibh Av.	60797.00
"Decker, Caryn G."	994-2552 Ac Rd.	70755.00
"Wilkerson, Jolene Z."	3613 Ultrices. St.	62168.00
"Gonzales, Elizabeth W."	"P.O. Box 409, 2319 Cursus. Rd."	72909.00
"Rodriguez, Abbot O."	Ap #541-9695 Fusce Street	23495.00
"Larson, Martin L."	113-3963 Cras Av.	36008.00
"Cannon, Zia U."	549-2083 Libero Avenue	91524.00
"Cook, Amena B."	Ap #668-5982 Massa Ave	69205.00
And finally the writer (notice the factory must also supply makeRow for this):
import java.io._

class CsvWriter[T](factory: CsvFactory[T], fname: String, delim: String, append: Boolean = false) {
  private val out = new PrintWriter(new BufferedWriter(new FileWriter(fname, append)))

  if (!append) out.println(factory.header mkString delim)

  def flush() = out.flush()
  def println(s: String) = out.println(s)
  def printObj(obj: T) = println(factory.makeRow(obj).mkString(delim))
  def printAll(objects: Seq[T]) = objects.foreach(printObj(_))
  def close() = out.close()
}
If you know the number and types of the fields, maybe something like this:
case class Friend(id: Int, name: String) // 1, Fred

val friends = scala.io.Source.fromFile("friends.csv").getLines.map { line =>
  val fields = line.split(',')
  Friend(fields(0).toInt, fields(1))
}