Scala : StringBuilder class (append vs ++=) performance - scala

I was using '++=' in scala for combining the string values using StringBuilder class instance.
StringBuilder class also provides append method that takes string parameter to combine string values.
I could see both methods in the Scala StringBuilder class documentation.
I was told that ++= is slower than append while combining multiple string values but I am not able to find any documentation that says it would be slower.
If anyone can give some link to documentation or explanation on why ++= is slower than append that would help me understand the concept better.
Operation that I am performing or the code is as below:
val sourceAlias = "source"
val destinationAlias = "destination"
val compositeKeys = Array("Id","Name")
val initialUpdateExpression = new StringBuilder("")
for (key <- compositeKeys) {
initialUpdateExpression ++= s"$sourceAlias.$key = $destinationAlias.$key and "
}
initialUpdateExpression ++= s"$sourceAlias.Valid = $destinationAlias.Valid"
val updateExpression = initialUpdateExpression.toString()

They seem to be the same, as far as I can tell directly from the code in StringBuilder.scala for Scala 2.13.8.
++= looks like this:
/** Alias for `addAll` */
def ++= (s: String): this.type = addAll(s)
which calls:
/** Overloaded version of `addAll` that takes a string */
def addAll(s: String): this.type = { underlying.append(s); this }
That in the end, calls:
@Override
@HotSpotIntrinsicCandidate
public StringBuilder append(String str) {
super.append(str);
return this;
}
Whereas, append is overloaded, but assuming you want the String version:
/** Appends the given String to this sequence.
*
* @param s a String.
* @return this StringBuilder.
*/
def append(s: String): StringBuilder = {
underlying append s
this
}
Which in turn, calls:
@Override
@HotSpotIntrinsicCandidate
public StringBuilder append(String str) {
super.append(str);
return this;
}
This is exactly the same code that ++= calls. So no, they should have the same performance for your particular use case. I also tried decompiling an example with both method calls, and I did not see any difference between them.
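A quick way to check this for yourself is to build the same expression with both methods and compare the results. The sketch below is only an illustration: the object name AppendVsPlusPlusEq and the build helper are made up for this example, and it shows that the two calls are interchangeable, not how fast they are.

```scala
object AppendVsPlusPlusEq {
  // Builds the update expression from the question, using either append or ++=
  def build(keys: Seq[String], useAppend: Boolean): String = {
    val sb = new StringBuilder
    for (key <- keys) {
      val clause = s"source.$key = destination.$key and "
      if (useAppend) { sb.append(clause) } else { sb ++= clause }
    }
    sb ++= "source.Valid = destination.Valid"
    sb.toString
  }
}
```

For a real performance comparison you would still want a benchmark harness, but given the source above there is no reason to expect a difference.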
EDIT:
Perhaps you were told about the better performance of chaining several append calls compared with concatenating the strings with + before a single append. For example:
sb.append("some" + "thing")
sb.append("some").append("thing")
The second line is slightly more efficient, since in the first one you create an additional String and an additional unnamed StringBuilder. If this is the case, check this post for clarification on this matter.

There is no difference. But you should not be using StringBuilder to begin with. Mutable structures and vars are evil, you should pretend they do not exist at all, at least until you acquire enough command of the language to be able to identify the 0.1% of use cases where they are actually necessary.
val updateExpression = Seq("Id", "Name", "Valid")
.map { key => s"source.$key = destination.$key" }
.mkString(" and ")

Related

Is it possible to insert a variable in Scala string like python format string?

I'm trying to create a Scala code that is similar to python where I read a text file such as:
test.sql
select * from {name}
and insert name variable from the main program itself.
I'm new to Scala but I was able to read the file as such:
val filename = "test.sql"
val src_file = scala.io.Source.fromFile(filename)
val sql_str = try src_file.getLines mkString "\n" finally src_file.close()
Now I'm trying to do something like I would do in Python but in Scala:
sql_str.format(name = "table1")
Is there an equivalent of this in Scala?
I strongly advise against using a string interpolation / replacement to build SQL queries, as it's an easy way to leave your program vulnerable to SQL Injection (regardless of what programming language you're using). If interacting with SQL is your goal, I'd recommend looking into a database helper library like Doobie or Slick.
That disclaimer out of the way, there are a few approaches to string interpolation in Scala.
Normally, string interpolation is done with a string literal in your code, with $... or ${...} used to interpolate expressions into your string (the {} are needed if your expression is more than just a name reference). For example
val name: String = /* ... */
val message = s"Hello, $name"
// bad idea if `table` might come from user input
def makeQuery(table: String) = s"select * from $table"
But this doesn't work for string templates that you load from a file; it only works for string literals that are defined in your code. If you can change things up so that your templates are defined in your code instead of a file, that'll be the easiest way.
If that doesn't work, you could resort to Java's String.format method, in which the template String uses %s as a placeholder for an expression (see the docs for full info on the syntax for that). This related question has an example for using that. This is probably closest to what you actually asked for.
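A minimal sketch of the String.format route, assuming the file contents have been rewritten to use %s (and %d for numbers) instead of {name}. The template value here is a hypothetical stand-in for text loaded from a file, and the SQL injection caveat above applies just the same.

```scala
object FormatTemplate {
  // Hypothetical template; in practice this text would be read from test.sql
  val template: String = "select * from %s where id = %d"

  // Scala's format method on String delegates to java.lang.String.format
  def render(table: String, id: Int): String = template.format(table, id)
}
```

Note that the placeholders are positional, so unlike Python's named fields the order of the arguments matters.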
You could also do something custom with string replacement, e.g.
val template: String = /* load from file */
template.replace("{name}", "my_table")
// or something more general-purpose
def customInterpolate(template: String, vars: Map[String, String]): String = {
vars.foldLeft(template) { case (s, (k, v)) =>
s.replace(s"{$k}", v)
}
}
val exampleTmp = s"update {name} set message = {message}"
customInterpolate(exampleTmp, Map(
"name" -> "my_table",
"message" -> "hello",
))

Understanding Sets and Sequences using String checking as an example

I have a string which I would like to cross check if it is purely made of letters and space.
val str = "my long string to test"
val purealpha = " abcdefghijklmnopqrstuvwxyz".toSet
if (str.forall(purealpha(_))) println("PURE") else println("NOTPURE")
The above CONCISE code does the job. However, if I run it this way:
val str = "my long string to test"
val purealpha = " abcdefghijklmnopqrstuvwxyz" // not converted toSet
str.forall(purealpha(_)) // CONCISE code
I get an error (found: Char ... required: Boolean) and it can only work using the contains method this way:
str.forall(purealpha.contains(_))
My question is how can I use the CONCISE form without converting the string to a Set. Any suggestions on having my own String class with the right combination of methods to enable the nice code; or maybe some pure function(s) working on strings.
It's just a fun exercise I'm doing, so I can understand the intricate details of various methods on collections (including apply method) and how to write nice concise code and classes.
A slightly different approach is to use a regex pattern.
val str = "my long string to test"
val purealpha = "[ a-z]+"
str matches purealpha // res0: Boolean = true
If we look at the source code we can see that both these implementations are doing different things, although giving the same result.
When you convert it to a Set and use forall, you are ultimately calling the apply method of the Set. Here is how apply is called explicitly in your code, with the parameter of each anonymous function named:
if (str.forall(s => purealpha.apply(s))) println("PURE") else println("NOTPURE") // first example
str.forall(s => purealpha.apply(s)) // second example
Anyway, let's take a look at the source code for apply for Set (gotten from GenSetLike.scala):
/** Tests if some element is contained in this set.
*
* This method is equivalent to `contains`. It allows sets to be interpreted as predicates.
* @param elem the element to test for membership.
* @return `true` if `elem` is contained in this set, `false` otherwise.
*/
def apply(elem: A): Boolean = this contains elem
When you leave the String literal, you have to specifically call the .contains (this is the source code for that gotten from SeqLike.scala):
/** Tests whether this $coll contains a given value as an element.
* $mayNotTerminateInf
*
* @param elem the element to test.
* @return `true` if this $coll has an element that is equal (as
* determined by `==`) to `elem`, `false` otherwise.
*/
def contains[A1 >: A](elem: A1): Boolean = exists (_ == elem)
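As you can imagine, calling apply on a String does not do the same thing as calling apply on a Set. On a String, apply takes an Int index and returns the Char at that position, which is why the compiler reports that it found a Char where a Boolean was required.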
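To make the difference concrete, here is a small sketch that calls apply on both a String and a Set. The object and value names are made up for illustration.

```scala
object ApplyDemo {
  val asString: String = " abc"
  val asSet: Set[Char] = asString.toSet

  // On a String, apply takes an Int index and returns the Char at that position
  val byIndex: Char = asString(1)

  // On a Set, apply takes an element and returns whether it is a member
  val member: Boolean = asSet('a')
}
```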
A suggestion on having more conciseness is to omit the (_) entirely in the second example (compiler type inference will pick that up):
val str = "my long string to test"
val purealpha = " abcdefghijklmnopqrstuvwxyz" // not converted toSet
str.forall(purealpha.contains)
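If you really want the purealpha(_) form without building a Set, one option is a small wrapper that is itself a Char => Boolean. This is only a sketch: the Allowed class is a hypothetical name, and each lookup scans the allowed characters linearly, which is fine for a short alphabet like this one.

```scala
object CharSetDemo {
  // Hypothetical wrapper: a string of allowed characters viewed as a predicate
  final class Allowed(chars: String) extends (Char => Boolean) {
    def apply(c: Char): Boolean = chars.contains(c)
  }

  val purealpha = new Allowed(" abcdefghijklmnopqrstuvwxyz")

  // Because Allowed is a function, it can be passed to forall directly
  def isPure(s: String): Boolean = s.forall(purealpha)
}
```

With this in place, both str.forall(purealpha(_)) and str.forall(purealpha) compile, because apply now has the Char => Boolean shape that forall expects.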

Why Scala needs duplicate constructor? (java.lang.NoSuchMethodException)

I was receiving this error in my Hadoop job.
java.lang.NoSuchMethodException: <PackageName>.<ClassName>.<init>(<parameters>)
In most Scala code, you would catch this at compile time. But since this job is invoked at runtime, I was not catching it at compile time.
I would think default parameter would cause constructors with both signatures to be created, one taking a single argument.
class BasicDynamicBlocker(args: Args, evaluation: Boolean = false) extends Job(args) with HiveAccess {
//I NEEDED THIS TOO:
def this(args: Args) = {
this(args, false)
}
...
}
I learned the hard way that I needed to declare the overloaded constructor using this. (I wanted to write this out in case it helps someone else.)
I also have a small question. It still seems redundant to me. Is there a reason the Scala language's design requires this?
Having a default parameter does not mean that overloads get generated for each possible case. For example, given:
def method(num: Int = 4, str: String = "") = ???
you might expect the compiler to generate
def method(num: Int) = method(num, "")
def method(str: String) = method(4, str)
def method() = method(4, "")
but that is not the case.
Instead, one method is generated for each default parameter (for constructor defaults, in the companion object):
def method$default$1: Int = 4
def method$default$2: String = ""
and whenever you say in your code
method(str = "a")
it will be just changed to
method(method$default$1, "a")
So in your case, a constructor with the signature this(args: Args) simply did not exist; there was only the two-parameter version.
You can read more here: http://docs.scala-lang.org/sips/completed/named-and-default-arguments.html
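You can observe this with plain Java reflection, which is roughly what a framework does when it instantiates a class by name. The sketch below uses a made-up Blocker class in place of the job from the question.

```scala
import scala.util.Try

object DefaultArgsDemo {
  // Stand-in for the job class: one primary constructor with a default argument
  class Blocker(val name: String, val evaluation: Boolean = false)

  // Reflective lookup of a one-argument constructor; no such constructor is compiled
  val oneArg = Try(classOf[Blocker].getConstructor(classOf[String]))

  // From Scala source the default still works, because the compiler fills it in at the call site
  val fromScala = new Blocker("job")
}
```

Here oneArg is a Failure wrapping a NoSuchMethodException, which matches the error from the question, while the call from Scala source compiles fine.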

Strongly typed access to csv in scala?

I would like to access csv files in scala in a strongly typed manner. For example, as I read each line of the csv, it is automatically parsed and represented as a tuple with the appropriate types. I could specify the types beforehand in some sort of schema that is passed to the parser. Are there any libraries that exist for doing this? If not, how could I go about implementing this functionality on my own?
product-collections appears to be a good fit for your requirements:
scala> val data = CsvParser[String,Int,Double].parseFile("sample.csv")
data: com.github.marklister.collections.immutable.CollSeq3[String,Int,Double] =
CollSeq((Jan,10,22.33),
(Feb,20,44.2),
(Mar,25,55.1))
product-collections uses opencsv under the hood.
A CollSeq3 is an IndexedSeq[Product3[T1,T2,T3]] and also a Product3[Seq[T1],Seq[T2],Seq[T3]] with a little sugar. I am the author of product-collections.
Here's a link to the io page of the scaladoc
Product3 is essentially a tuple of arity 3.
If your content uses double quotes to enclose other double quotes, commas and newlines, I would definitely use a library like opencsv that deals properly with special characters. Typically you end up with an Iterator[Array[String]]. Then you use Iterator.map or collect to transform each Array[String] into your tuples, dealing with type conversion errors there. If you need to process the input without loading it all into memory, you keep working with the iterator; otherwise you can convert it to a Vector or List and close the input stream.
So it may look like this:
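import scala.jdk.CollectionConverters._ // Scala 2.13+; older versions use scala.collection.JavaConverters._
val reader = new CSVReader(new FileReader(filename))
val iter = reader.iterator().asScala // opencsv returns a java.util.Iterator, so convert it first
val typed = iter collect {
case Array(double, int, string) => (double.toDouble, int.toInt, string)
}
// do more work with typed
// close reader in a finally block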
Depending on how you need to deal with errors, you can return Left for errors and Right for success tuples to separate the errors from the correct rows. Also, I sometimes wrap all of this using scala-arm for closing resources. So my data may be wrapped in the resource.ManagedResource monad so that I can use input coming from multiple files.
Finally, although you want to work with tuples, I have found that it is usually clearer to have a case class that is appropriate for the problem and then write a method that creates that case class object from an Array[String].
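As a rough sketch of those last two suggestions combined, the following turns one Array[String] into a case class and reports failures as Left. Reading, RowParsing and the error messages are hypothetical names chosen for this example, and the column order mirrors the snippet above.

```scala
import scala.util.Try

object RowParsing {
  // Hypothetical record type for a (double, int, string) row
  final case class Reading(value: Double, count: Int, label: String)

  // Right for a well-formed row, Left with a message for anything else
  def parse(row: Array[String]): Either[String, Reading] = row match {
    case Array(d, i, s) =>
      Try(Reading(d.trim.toDouble, i.trim.toInt, s)).toEither.left
        .map(e => s"bad row: ${e.getMessage}")
    case other =>
      Left(s"expected 3 columns, got ${other.length}")
  }
}
```

Mapping the iterator with parse then gives an Iterator[Either[String, Reading]], so bad rows can be logged or collected separately instead of being silently dropped as they are with collect.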
You can use kantan.csv, which is designed with precisely that purpose in mind.
Imagine you have the following input:
1,Foo,2.0
2,Bar,false
Using kantan.csv, you could write the following code to parse it:
import kantan.csv.ops._
new File("path/to/csv").asUnsafeCsvRows[(Int, String, Either[Float, Boolean])](',', false)
And you'd get an iterator where each entry is of type (Int, String, Either[Float, Boolean]). Note the bit where the last column in your CSV can be of more than one type, but this is conveniently handled with Either.
This is all done in an entirely type safe way, no reflection involved, validated at compile time.
Depending on how far down the rabbit hole you're willing to go, there's also a shapeless module for automated case class and sum type derivation, as well as support for scalaz and cats types and type classes.
Full disclosure: I'm the author of kantan.csv.
I've created a strongly-typed CSV helper for Scala, called object-csv. It is not a fully fledged framework, but it can be adjusted easily. With it you can do this:
val peopleFromCSV = readCSV[Person](fileName)
Where Person is case class, defined like this:
case class Person (name: String, age: Int, salary: Double, isNice:Boolean = false)
Read more about it in GitHub, or in my blog post about it.
Edit: as pointed out in a comment, kantan.csv (see other answer) is probably the best as of the time I made this edit (2020-09-03).
This is made more complicated than it ought to be because of the nontrivial quoting rules for CSV. You should probably start with an existing CSV parser, e.g. OpenCSV or one of the projects called scala-csv. (There are at least three.)
Then you end up with some sort of collection of collections of strings. If you don't need to read massive CSV files quickly, you can just try to parse each line into each of your types and take the first one that doesn't throw an exception. For example,
import scala.util._
case class Person(first: String, last: String, age: Int) {}
object Person {
def fromCSV(xs: Seq[String]) = Try(xs match {
case s0 +: s1 +: s2 +: more => new Person(s0, s1, s2.toInt)
})
}
If you do need to parse them fairly quickly and you don't know what might be there, you should probably use some sort of matching (e.g. regexes) on the individual items. Either way, if there's any chance of error you probably want to use Try or Option or somesuch to package errors.
I built my own approach, which focuses on strongly typing the final product rather than the reading stage itself. As pointed out, reading might be better handled as stage one with something like Apache CSV, and stage two could be what I've done. Here's the code; you are welcome to it. The idea is to parameterize CsvReader[T] with a type T; upon construction, you must also supply the reader with a factory object of type CsvFactory[T]. The class itself (or, in my example, a helper object) decides the construction details, which decouples them from the actual reading. You could use implicit objects to pass the helper around, but I've not done that here. The only downside is that every row of the CSV must be of the same class type, but you could expand this concept as needed.
import scala.io.Source
/**
* @param factory : builds a T from the fields of one row
* @param fname : path of the CSV file to read
* @param delim : "\t" , etc
*/
class CsvReader[T] ( factory:CsvFactory[T], fname:String, delim:String) {
private val f = Source.fromFile(fname)
private var lines = f.getLines //iterator
private var fileClosed = false
if (lines.hasNext) lines = lines.dropWhile(_.trim.isEmpty) //skip white space
def hasNext = (if (fileClosed) false else lines.hasNext)
lines = lines.drop(1) //drop header , assumed to exist
/**
* also closes the file
* @return the line
*/
def nextRow ():String = { //public version
val ans = lines.next
if (ans.isEmpty) throw new Exception("Error in CSV, reading past end "+fname)
if (lines.hasNext) lines = lines.dropWhile(_.trim.isEmpty) else close()
ans
}
//def nextObj[T](factory:CsvFactory[T]): T = past version
def nextObj(): T = { //public version
val s = nextRow()
val a = s.split(delim)
factory makeObj a
}
def allObj() : Seq[T] = {
val ans = scala.collection.mutable.Buffer[T]()
while (hasNext) ans+=nextObj()
ans.toList
}
def close() = {
f.close;
fileClosed = true
}
} //class
Next, the example helper factory and an example main:
trait CsvFactory[T] { //handles all serial controls (in and out)
def makeObj(a:Seq[String]):T //for reading
def makeRow(obj:T):Seq[String]//the factory basically just passes this duty
def header:Seq[String] //must define headers for writing
}
/**
* Each class implements this as needed, so the object can be serialized by the writer
*/
case class TestRecord(val name:String, val addr:String, val zip:Int) {
def toRow():Seq[String] = List(name,addr,zip.toString) //handle conversion to CSV
}
object TestFactory extends CsvFactory[TestRecord] {
def makeObj (a:Seq[String]):TestRecord = new TestRecord(a(0),a(1),a(2).toDouble.toInt)
def header = List("name","addr","zip")
def makeRow(o:TestRecord):Seq[String] = {
o.toRow.map(_.toUpperCase())
}
}
object CsvSerial {
def main(args: Array[String]): Unit = {
val whereami = System.getProperty("user.dir")
println("Begin CSV test in "+whereami)
val reader = new CsvReader(TestFactory,"TestCsv.txt","\t")
val all = reader.allObj() //read the CSV into a list
println(all)
reader.close
val writer = new CsvWriter(TestFactory,"TestOut.txt", "\t")
for (x<-all) writer.printObj(x)
writer.close
} //main
}
Example CSV (tab separated; it might need repairing if you copy it from an editor)
Name	Addr	Zip
"Sanders, Dante R."	4823 Nibh Av.	60797.00
"Decker, Caryn G."	994-2552 Ac Rd.	70755.00
"Wilkerson, Jolene Z."	3613 Ultrices. St.	62168.00
"Gonzales, Elizabeth W."	"P.O. Box 409, 2319 Cursus. Rd."	72909.00
"Rodriguez, Abbot O."	Ap #541-9695 Fusce Street	23495.00
"Larson, Martin L."	113-3963 Cras Av.	36008.00
"Cannon, Zia U."	549-2083 Libero Avenue	91524.00
"Cook, Amena B."	Ap #668-5982 Massa Ave	69205.00
And finally the writer (notice that it relies on the factory as well, through makeRow and header):
import java.io._
class CsvWriter[T] (factory:CsvFactory[T], fname:String, delim:String, append:Boolean = false) {
private val out = new PrintWriter(new BufferedWriter(new FileWriter(fname,append)));
if (!append) out.println(factory.header mkString delim )
def flush() = out.flush()
def println(s:String) = out.println(s)
def printObj(obj:T) = println( factory makeRow(obj) mkString(delim) )
def printAll(objects:Seq[T]) = objects.foreach(printObj(_))
def close() = out.close
}
If you know the number and types of the fields, maybe something like this?
case class Friend(id: Int, name: String) // 1, Fred
val friends = scala.io.Source.fromFile("friends.csv").getLines.map { line =>
val fields = line.split(',')
Friend(fields(0).toInt, fields(1))
}

type mismatch in string concatenation

I'm really new to Scala and I'm not even able to concatenate Strings. Here is my code:
object RandomData {
private[this] val bag = new scala.util.Random
def apply(sensorId: String, stamp: Long, size: Int): String = {
var cpt: Int = 0
var data: String = "test"
repeat(10) {
data += "_test"
}
return data
}
}
I got the error:
type mismatch;
found : Unit
required: com.excilys.ebi.gatling.core.structure.ChainBuilder
What am I doing wrong ??
repeat is offered by Gatling in order to repeat Gatling tasks, e.g., query a website. If you have a look at the documentation (I wasn't able to find a link to the API doc of repeat), you'll see that repeat expects a chain, which is why your error message says "required: com.excilys.ebi.gatling.core.structure.ChainBuilder". However, all you do is to append to a string - which will not return a value of type ChainBuilder.
Moreover, appending to a string is nothing that should be done via Gatling. It looks to me as if you are confusing Gatling's repeat with a Scala for loop. If you only want to append "_test" to data 10 times, use one of Scala's loops (for, while) or a functional approach with e.g. foldLeft. Here are two examples:
/* Imperative style loop */
for(i <- 1 to 10) {
data += "_test"
}
/* Functional style with lazy streams */
data += Stream.continually("_test").take(10).mkString("")
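For completeness, the foldLeft variant mentioned above could look like the sketch below. appendTimes is a made-up helper name, and the block uses only the standard library, not Gatling.

```scala
object RepeatDemo {
  // Appends `suffix` to `start` n times without any mutable variable
  def appendTimes(start: String, suffix: String, n: Int): String =
    (1 to n).foldLeft(start)((acc, _) => acc + suffix)

  // Equivalent shortcut: * on a String repeats it
  def appendTimesShort(start: String, suffix: String, n: Int): String =
    start + suffix * n
}
```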
Your problem is that the block
{
data += "_test"
}
evaluates to Unit, whereas the repeat method seems to want it to evaluate to a ChainBuilder.
Check out the documentation for the repeat method. I was unable to find it, but it's probably reasonable to assume that it looks something like
def repeat(numTimes: Int)(thunk: => ChainBuilder): Unit
I'm not sure if the repeat method does anything special, but with your usage, you could just use this block instead of the repeat(10){...}
for(i <- 1 to 10) data += "_test"
Also, as a side note, you don't need the return keyword in Scala. You can just write data instead of return data.