Unit test to check retrievals from config file - scala

In a Spark project we use Typesafe Config. We have a big config file with some (necessary) redundancy. It is quite easy to refer to the wrong branch of the config JSON, and sometimes the error slips into production code.
I need to verify that every call to config.getString(x), where x is a literal, will not fail for a given config file.
I'd like to write a unit test that checks every string used in my application to obtain a config value.
A possible solution we found is to preload all the config values into a case class, so
val rawPath = config.getString("comp1.data.files.rawData")
val coresNumber = config.getLong("comp1.setup.cores")
would become
case class ConfigData(rawPath: String, coresNumber: Long)

def initConfig(): ConfigData = {
  val rawPath = config.getString("comp1.data.files.rawData")
  val coresNumber = config.getLong("comp1.setup.cores")
  ConfigData(rawPath, coresNumber)
}

val conf = initConfig()
val rawPath = conf.rawPath
val coresNumber = conf.coresNumber
and then simply call initConfig() in a test to check for config load errors.
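For instance, a test along these lines (a minimal ScalaTest sketch; it assumes initConfig() is visible to the test and that the config file is on the test classpath) fails with a ConfigException.Missing or ConfigException.WrongType as soon as any literal key is wrong:

import org.scalatest.FunSuite

class ConfigLoadSpec extends FunSuite {
  test("initConfig resolves every config key it uses") {
    // Any bad key makes Typesafe Config throw, which fails this test.
    initConfig()
  }
}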
I've also thought of using Scala reflection: I'd need to find all the places in my code where config.getString(x) is called, obtain each x, and test its existence in the config file. But I can't find a way to enumerate all the call sites of a method and build a test on the arguments.
Is there anything I haven't thought of?

One solution is to provide the configuration in a singleton that you initialise when starting the server, so the server starts only if the configuration is correct. Alternatively, load the configuration in unit tests.
In this singleton you load your configuration into single values or case classes, depending on the number of values.
We use pureconfig, which makes it pretty simple to map the configuration directly into case classes.
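For illustration, here is a minimal pureconfig sketch (the case class names are hypothetical and the exact API depends on your pureconfig version; also note that by default pureconfig maps camelCase fields to kebab-case keys, which you can change with a ProductHint):

import pureconfig.generic.auto._ // automatic ConfigReader derivation for case classes

final case class FilesConf(rawData: String)
final case class DataConf(files: FilesConf)
final case class SetupConf(cores: Long)
final case class Comp1Conf(data: DataConf, setup: SetupConf)
final case class AppConf(comp1: Comp1Conf)

object AppConfig {
  // Fails fast at startup (or in a unit test) if any key is missing or mistyped.
  val conf: AppConf = pureconfig.loadConfig[AppConf]
    .fold(failures => sys.error(failures.toString), identity)
}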
Here is an example without pureconfig: DemoAdapterContext and AdaptersContext.
Let me know if you need more info.

Related

Sealed trait and dynamic case objects

I have a few enumerations implemented as sealed traits and case objects. I prefer the ADT approach because of the non-exhaustive-match warnings, and mostly because we want to avoid type erasure issues. Something like this:
sealed abstract class Maker(val value: String) extends Product with Serializable {
  override def toString = value
}

object Maker {
  case object ChryslerMaker extends Maker("Chrysler")
  case object ToyotaMaker extends Maker("Toyota")
  case object NissanMaker extends Maker("Nissan")
  case object GMMaker extends Maker("General Motors")
  case object UnknownMaker extends Maker("")

  val tipos = List(ChryslerMaker, ToyotaMaker, NissanMaker, GMMaker, UnknownMaker)
  private val fromStringMap: Map[String, Maker] = tipos.map(s => s.toString -> s).toMap
  def apply(key: String): Option[Maker] = fromStringMap.get(key)
}
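For reference, lookups through the companion apply then behave like this:

Maker("Toyota") // Some(ToyotaMaker)
Maker("Tesla")  // None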
This is working well so far. Now we are considering giving other programmers access to our code so they can configure it on site. I see two potential problems:
1) people messing up and writing things like:
case object ChryslerMaker extends Maker("Nissan")
2) people forgetting to update the tipos list.
I have been looking into using a configuration file (JSON or CSV) to provide these values and read them as we do with plenty of other elements, but all the answers I have found rely on macros and seem to be extremely dependent on the Scala version used (2.12 for us).
What I would like to find is:
1a) (Preferred) a way to dynamically create the case objects from a list of strings, making sure the objects are named consistently with the value they hold
1b) (Acceptable) if this proves too hard a way to obtain the objects and the values during the test phase
2) Check that the number of elements in the list matches the number of case objects created.
I forgot to mention, I have looked briefly at enumeratum, but I would prefer not to include additional libraries unless I really understand the pros and cons (and right now I am not sure how enumeratum compares with the ADT approach; if you think this is the best way and can point me to such a discussion, that would work great).
Thanks!
One idea that comes to mind is to create an SBT source generator task.
It will read an input JSON, CSV, XML or whatever file that is part of your project, and will create a Scala file.
// ----- File: project/VendorsGenerator.scala -----
import sbt.Keys._
import sbt._

/**
 * An SBT task that generates a managed source file with all vendors.
 */
object VendorsGenerator {

  // For demonstration, I will use this plain List[String] to generate the code,
  // you may change the code to read a file instead.
  // Or maybe this will be good enough.
  final val vendors: List[String] =
    List(
      "Chrysler",
      "Toyota",
      ...
      "Unknown"
    )

  val generatorTask = Def.task {
    // Make the 'tipos' List, which contains all vendors.
    val tipos =
      vendors
        .map(vendorName => s"${vendorName}Vendor")
        .mkString("val tipos: List[Vendor] = List(", ",", ")")

    // Make a case object for each vendor.
    val vendorObjects = vendors.map { vendorName =>
      s"""case object ${vendorName}Vendor extends Vendor { override final val value: String = "${vendorName}" }"""
    }

    // Fill the code template.
    val code =
      List(
        List(
          "package vendors",
          "sealed trait Vendor extends Product with Serializable {",
          "def value: String",
          "override final def toString: String = value",
          "}",
          "object Vendors extends (String => Option[Vendor]) {"
        ),
        vendorObjects,
        List(
          tipos,
          "private final val fromStringMap: Map[String, Vendor] = tipos.map(v => v.toString -> v).toMap",
          "override def apply(key: String): Option[Vendor] = fromStringMap.get(key)",
          "}"
        )
      ).flatten

    // Save the new file to the managed sources dir.
    val vendorsFile = (sourceManaged in Compile).value / "vendors.scala"
    IO.writeLines(vendorsFile, code)
    Seq(vendorsFile)
  }
}
Now you can activate your source generator. This task will be run each time, before the compile step.
// ----- File: build.sbt -----
sourceGenerators in Compile += VendorsGenerator.generatorTask.taskValue
Please note that I suggest this because I have done it before and because I don't have any macro nor metaprogramming experience.
Also, note that this example relies a lot on Strings, which makes the code a little bit hard to understand and maintain.
BTW, I haven't used enumeratum, but from a quick look it seems like the best solution to this problem.
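For comparison, here is a rough sketch of the same enum with enumeratum (based on a quick read of its README, so treat the details as assumptions): findValues is a macro that collects every case object, so there is no tipos list to keep in sync by hand.

import enumeratum._

sealed abstract class Vendor(override val entryName: String) extends EnumEntry

object Vendor extends Enum[Vendor] {
  // Collected by macro: no hand-maintained list to forget to update.
  val values = findValues

  case object Chrysler extends Vendor("Chrysler")
  case object Toyota extends Vendor("Toyota")
  case object Unknown extends Vendor("")
}

Vendor.withNameOption("Toyota") // Some(Toyota)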
Edit
I have my code ready to read a HOCON file and generate the matching code. My question now is where to place the Scala file in the project directory, and where the files will be generated. I am a little bit confused because there seem to be multiple steps: 1) compile my Scala generator, 2) run the generator, and 3) compile and build the project. Is this right?
Your generator is not part of your project code, but of your meta-project (I know that sounds confusing; you may read this to understand it). As such, you place the generator inside the project folder at the root level (the same folder that holds the build.properties file specifying the sbt version).
If your generator needs some dependencies (I'm sure it does for reading the HOCON) you place them in a build.sbt file inside that project folder.
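For example, the resulting layout would look roughly like this (the dependency line is illustrative, with a made-up version number):

project/
  build.properties        // sbt.version=...
  build.sbt               // dependencies of the generator itself, e.g.
                          //   libraryDependencies += "com.typesafe" % "config" % "1.3.3"
  VendorsGenerator.scala  // the generator shown above
build.sbt                 // activates the generator for the main project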
If you plan to add unit tests to the generator, you may create an entire Scala project inside the meta-project (for reference, you may take a look at the project folder of an open source project I work on - yes, yes, I know, confusing again). My personal suggestion is that, more than testing the generator itself, you should test the generated file instead, or better, both.
The generated file will be automatically placed in the src_managed folder (which lives inside target and is thus ignored by your source code version control).
The path inside it is just for order's sake, as everything inside the src_managed folder is included by default when compiling.
val vendorsFile = (sourceManaged in Compile).value / "vendors.scala" // Path to the file to write.
In order to access the values defined in the generated file on your source code, you only need to add a package to the generated file and import the values from that package in your code (as with any normal file).
You don't need to worry about anything related with compilation order, if you include your source generator in your build.sbt file, SBT will take care of everything automatically.
sourceGenerators in Compile += VendorsGenerator.generatorTask.taskValue // Activate the source generator.
SBT will re-run your generator every time it needs to compile.
"BTW I get "not found: object sbt" on the imports".
If the project is inside the meta-project space, it will find the sbt package by default, don't worry about it.

Not able to mock the getTimestamp method on mocked Row

I am writing an application that interacts with Cassandra using Scala. While performing unit testing, I am using Mockito, wherein I am mocking the ResultSet and Row:
val mockedResultSet = mock[ResultSet]
val mockedRow = mock[Row]
Now while mocking the methods of the mockedRow, such as
doReturn("mocked").when(mockedRow).getString("ColumnName")
works fine. However, I am not able to mock the getTimestamp method of the mockedRow. I have tried 2 approaches but was not successful.
First approach
val testDate = "2018-08-23 15:51:12+0530"
val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ") // MM for the month, not mm (minutes)
val date: Date = formatter.parse(testDate)
doReturn(date).when(mockedRow).getTimestamp("ColumnName")
and second approach
when(mockedRow.getTimestamp("column")).thenReturn(Timestamp.valueOf("2018-08-23 15:51:12+0530"))
Both of them return null, i.e. getTimestamp does not return the mocked value. I am using the cassandra-driver-core 3.0 dependency in my project.
Any help would be highly appreciated. Thanks in advance!
Mocking objects you don't own is usually considered a bad practice. That said, to see what's going on you can verify the interactions with the mock, i.e.
verify(mockedRow).getTimestamp("column")
Given you are getting null out of the mock, that statement should fail, but the failure will show all the actual calls received by the mock (and their parameters), which should help you debug.
A way to minimize this kind of problem is to use a Mockito session. In standard Mockito, sessions can only be used through a JUnit runner, but with mockito-scala you can use them manually, like this:
MockitoScalaSession().run {
  val mockedRow = mock[Row]
  // Note: Timestamp.valueOf does not accept a timezone suffix, hence no "+0530" here.
  when(mockedRow.getTimestamp("column")).thenReturn(Timestamp.valueOf("2018-08-23 15:51:12"))
  // Execute your test
}
That code will check that the mock is not being called with anything that hasn't been stubbed; it will also tell you if you provided stubs that weren't actually used, and a few more things.
If you like that behaviour (and you are using ScalaTest) you can apply it automatically to every test by using MockitoFixture
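For example, a sketch of such a suite (the exact package of MockitoFixture varies across mockito-scala versions, so treat the import as an assumption to verify):

import org.mockito.integrations.scalatest.MockitoFixture
import org.scalatest.WordSpec

class RowSpec extends WordSpec with MockitoFixture {
  // Every test in this suite now runs inside a mockito session,
  // so unexpected calls and unused stubs are reported automatically.
}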
I'm a developer of mockito-scala btw

Elastic4s, mockito and verify with type erasure and implicits

In previous versions of Elastic4s you could do something like
val argument1: ArgumentCaptor[DeleteIndexDefinition] = ???
verify(client).execute(argument1.capture())
assert(argument1 == ???)

val argument2: ArgumentCaptor[IndexDefinition] = ???
verify(client, times(2)).execute(argument2.capture())
assert(argument2 == ???)
after several executions in your test (i.e. one DeleteIndexDefinition, followed by two IndexDefinitions), and each verify would be matched against its type.
However, Elastic4s now takes an implicit parameter in its client.execute method. The parameter is of type Executable[T,R], which means you now need something like
val argument1: ArgumentCaptor[DeleteIndexDefinition] = ???
verify(client).execute(argument1.capture())(any[Executable[DeleteIndexDefinition, R]])
assert(argument1 == ???)

val argument2: ArgumentCaptor[IndexDefinition] = ???
verify(client, times(2)).execute(argument2.capture())(any[Executable[IndexDefinition, R]])
assert(argument2 == ???)
After doing that, I was getting an error: Mockito considers all three client.execute calls in the first verify, even though the first parameter is of a different type.
That's because the implicit (the second parameter) has, after type erasure, the same type, Executable.
So the assertions were failing. How can I test in this setup?
The approach now taken in elastic4s to encapsulate the logic for executing each request type is one using typeclasses. This is why the implicit now exists. It helps modularize each request type, and avoids the God class anti-pattern that was starting to creep into the ElasticClient class.
Two things I can think of that might help you:
What you already posted: using Mockito and passing in the implicit as another matcher. This is how you can mock a method that takes implicits in general.
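As a generic illustration of that first point (the Service trait here is hypothetical, not part of elastic4s):

import org.mockito.Mockito.{mock, verify, when}
import org.mockito.Matchers.any // org.mockito.ArgumentMatchers.any in newer Mockito

trait Service { def run[T](arg: T)(implicit ord: Ordering[T]): T }

val service = mock(classOf[Service])
// The implicit parameter list gets its own matcher, just like the explicit one.
when(service.run(any[Int])(any[Ordering[Int]])).thenReturn(42)
service.run(1)
verify(service).run(any[Int])(any[Ordering[Int]])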
Not using mockito, but spinning up a local embedded node and trying it against real data. This is my preferred approach when I write elasticsearch code. The advantage is that you're testing real queries against a real server, so you're not only checking that they are invoked, but that they actually work. (Some people might consider this an integration test, but I don't agree: it all runs inside a single self-contained test with no outside deps.)
The latest release of elastic4s even includes a testkit that makes it really easy to get the embedded node. You can look at almost any of the unit tests to get an idea of how to use it.
My solution was to create one verify with a generic type. It took me a while to realise that even if there is no common type, you always have AnyRef.
So, something like this works
val objs: ArgumentCaptor[AnyRef] = ArgumentCaptor.forClass(classOf[AnyRef])
verify(client, times(3)).execute(objs.capture())(any())
val values = objs.getAllValues
assert(values.get(0).isInstanceOf[DeleteIndexDefinition])
assert(values.get(1).isInstanceOf[IndexDefinition])
assert(values.get(2).isInstanceOf[IndexDefinition])
I've created both the question and the answer. But I'll consider other answers.

how to make saveAsTextFile NOT split output into multiple files?

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter (path) to it.
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
year.saveAsTextFile("year")
Does the number of outputs correspond to the number of reducers it uses?
Does this mean the output is compressed?
I know I can combine the output together using bash, but is there an option to store the output in a single text file, without splitting? I looked at the API docs, but they don't say much about this.
The reason it saves it as multiple files is that the computation is distributed. If the output is small enough that you think you can fit it on one machine, then you can end your program with
val arr = year.collect()
and then save the resulting array as a file. Another way would be to use a custom partitioner with partitionBy and make everything go to one partition, though that isn't advisable because you won't get any parallelization.
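For the first idea, saving the collected array is plain Java IO, e.g. (a quick sketch, with a made-up output path):

import java.io.PrintWriter

val arr = year.collect()
val writer = new PrintWriter("year.txt")
// println(Any) writes each (count, word) tuple on its own line.
try arr.foreach(writer.println) finally writer.close()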
If you require the file to be saved with saveAsTextFile, you can use coalesce(1, true).saveAsTextFile(). This basically means: do the computation, then coalesce to 1 partition. You can also use repartition(1), which is just a wrapper for coalesce with the shuffle argument set to true. Looking through the source of RDD.scala is how I figured most of this stuff out; you should take a look.
For those working with a larger dataset:
rdd.collect() should not be used in this case, as it will collect all data as an Array in the driver, which is the easiest way to run out of memory.
rdd.coalesce(1).saveAsTextFile() should also not be used, as the parallelism of upstream stages will be lost: they will be performed on the single node where the data ends up stored.
rdd.coalesce(1, shuffle = true).saveAsTextFile() is the best simple option, as it keeps the processing of upstream tasks parallel and only then performs the shuffle to one node (rdd.repartition(1).saveAsTextFile() is an exact synonym).
rdd.saveAsSingleTextFile(), as provided below, additionally allows one to store the rdd in a single file with a specific name while keeping the parallelism properties of rdd.coalesce(1, shuffle = true).saveAsTextFile().
Something that can be inconvenient with rdd.coalesce(1, shuffle = true).saveAsTextFile("path/to/file.txt") is that it actually produces a file whose path is path/to/file.txt/part-00000 and not path/to/file.txt.
The following solution rdd.saveAsSingleTextFile("path/to/file.txt") will actually produce a file whose path is path/to/file.txt:
package com.whatever.package

import org.apache.spark.rdd.RDD
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.hadoop.io.compress.CompressionCodec

object SparkHelper {

  // This is an implicit class so that saveAsSingleTextFile can be attached to
  // RDD[String] and be called like this: rdd.saveAsSingleTextFile("path")
  implicit class RDDExtensions(val rdd: RDD[String]) extends AnyVal {

    def saveAsSingleTextFile(path: String): Unit =
      saveAsSingleTextFileInternal(path, None)

    def saveAsSingleTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit =
      saveAsSingleTextFileInternal(path, Some(codec))

    private def saveAsSingleTextFileInternal(
      path: String, codec: Option[Class[_ <: CompressionCodec]]
    ): Unit = {

      // The interface with hdfs:
      val hdfs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)

      // Classic saveAsTextFile in a temporary folder:
      hdfs.delete(new Path(s"$path.tmp"), true) // to make sure it's not there already
      codec match {
        case Some(codec) => rdd.saveAsTextFile(s"$path.tmp", codec)
        case None        => rdd.saveAsTextFile(s"$path.tmp")
      }

      // Merge the folder of resulting part-xxxxx into one file:
      hdfs.delete(new Path(path), true) // to make sure it's not there already
      FileUtil.copyMerge(
        hdfs, new Path(s"$path.tmp"),
        hdfs, new Path(path),
        true, rdd.sparkContext.hadoopConfiguration, null
      )
      // Working with Hadoop 3?: https://stackoverflow.com/a/50545815/9297144

      hdfs.delete(new Path(s"$path.tmp"), true)
    }
  }
}
which can be used this way:
import com.whatever.package.SparkHelper.RDDExtensions
rdd.saveAsSingleTextFile("path/to/file.txt")
// Or if the produced file is to be compressed:
import org.apache.hadoop.io.compress.GzipCodec
rdd.saveAsSingleTextFile("path/to/file.txt.gz", classOf[GzipCodec])
This snippet:
First stores the rdd with rdd.saveAsTextFile in a temporary folder, path/to/file.txt.tmp, as if we didn't want to store the data in one file (which keeps the processing of upstream tasks parallel)
And only then, using the Hadoop file system API, proceeds with the merge (FileUtil.copyMerge()) of the different output files to create our final single output file, path/to/file.txt.
You could call coalesce(1) and then saveAsTextFile(), but it might be a bad idea if you have a lot of data. Separate files per split are generated just like in Hadoop, in order to let separate mappers and reducers write to different files. Having a single output file is only a good idea if you have very little data, in which case you could do collect() as well, as @aaronman said.
As others have mentioned, you can collect or coalesce your data set to force Spark to produce a single file. But this also limits the number of Spark tasks that can work on your dataset in parallel. I prefer to let it create a hundred files in the output HDFS directory, then use hadoop fs -getmerge /hdfs/dir /local/file.txt to extract the results into a single file in the local filesystem. This makes the most sense when your output is a relatively small report, of course.
In Spark 1.6.1 the format is as shown below. It creates a single output file, and it is best practice to use it if the output is small enough to handle. Basically, coalesce returns a new RDD that is reduced into numPartitions partitions. If you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you'd like (e.g. one node in the case of numPartitions = 1).
pair_result.coalesce(1).saveAsTextFile("/app/data/")
You can call repartition() and proceed this way:
val year = sc.textFile("apat63_99.txt").map(_.split(",")(1)).flatMap(_.split(",")).map((_,1)).reduceByKey((_+_)).map(_.swap)
val repartitioned = year.repartition(1)
repartitioned.saveAsTextFile("C:/Users/TheBhaskarDas/Desktop/wc_spark00")
You will be able to do it in the next version of Spark; in the current version, 1.0.0, it's not possible unless you do it manually somehow, for example, as you mentioned, with a bash script call.
I also want to mention that the documentation clearly states that users should be careful when calling coalesce with a really small number of partitions, since this can cause upstream partitions to inherit that number of partitions.
I would not recommend using coalesce(1) unless really required.
Here's my answer to output a single file. I just added coalesce(1)
val year = sc.textFile("apat63_99.txt")
  .map(_.split(",")(1))
  .flatMap(_.split(","))
  .map((_, 1))
  .reduceByKey(_ + _)
  .map(_.swap)
year.saveAsTextFile("year")
Code:
year.coalesce(1).saveAsTextFile("year")

Scala, Specs2, Mockito and null return values

I'm trying to test-drive some Scala code using Specs2 and Mockito. I'm relatively new to all three, and having difficulty with the mocked methods returning null.
In the following (transcribed with some name changes)
"My Component's process(File)" should {
"pass file to Parser" in new modules {
val file = mock[File]
myComponent.process(file)
there was one(mockParser).parse(file)
}
"pass parse result to Translator" in new modules {
val file = mock[File]
val myType1 = mock[MyType1]
mockParser.parse(file) returns (Some(myType1))
myComponent.process(file)
there was one(mockTranslator).translate(myType1)
}
}
The "pass file to Parser" works until I add the translator call in the SUT, and then dies because the mockParser.parse method has returned a null, which the translator code can't take.
Similarly, the "pass parse result to Translator" passes until I try to use the translation result in the SUT.
The real code for both of these methods can never return null, but I don't know how to tell Mockito to make the expectations return usable results.
I can of course work around this by putting null checks in the SUT, but I'd rather not, as I'm making sure to never return nulls and instead using Option, None and Some.
Pointers to a good Scala/Specs2/Mockito tutorial would be wonderful, as would a simple example of how to change a line like
there was one(mockParser).parse(file)
to make it return something that allows continued execution in the SUT when it doesn't deal with nulls.
Flailing about trying to figure this out, I have tried changing that line to
there was one(mockParser).parse(file) returns myResult
with a value for myResult that is of the type I want returned. That gave me a compile error as it expects to find a MatchResult there rather than my return type.
If it matters, I'm using Scala 2.9.0.
If you haven't seen it, you can look at the mock expectation page of the specs2 documentation.
In your code, the stub should be mockParser.parse(file) returns myResult
Edited after Don's edit:
There was a misunderstanding. The way you do it in your second example is the right one, and you should do exactly the same in the first test:
val file = mock[File]
val myType1 = mock[MyType1]
mockParser.parse(file) returns (Some(myType1))
myComponent.process(file)
there was one(mockParser).parse(file)
The idea of unit testing with mock is always the same: explain how your mocks work (stubbing), execute, verify.
That should answer the question, now a personal advice:
Most of the time, except when you want to verify some algorithmic behavior (stop on first success, process a list in reverse order), you should not test expectations in your unit tests.
In your example, the process method should "translate things", thus your unit tests should focus on that: mock your parsers and translators, stub them, and only check the result of the whole process. It's less fine-grained, but the goal of a unit test is not to check every step of a method. If you want to change the implementation, you should not have to modify a bunch of unit tests that verify each line of the method.
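Concretely, that state-based test could look like this (a sketch reusing the names from the question, and assuming process returns the translation as a hypothetical MyType2):

"return the translation of the parsed file" in new modules {
  val file = mock[File]
  val myType1 = mock[MyType1]
  val myType2 = mock[MyType2]
  mockParser.parse(file) returns Some(myType1)
  mockTranslator.translate(myType1) returns myType2
  // Only the end result is checked; no verify on intermediate calls.
  myComponent.process(file) must beEqualTo(myType2)
}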
I have managed to solve this, though there may be a better solution, so I'm going to post my own answer, but not accept it immediately.
What I needed to do was supply a sensible default return value for the mock, in the form of an org.mockito.stubbing.Answer<T> with T being the return type.
I was able to do this with the following mock setup:
val defaultParseResult = new Answer[Option[MyType1]] {
  def answer(p1: InvocationOnMock): Option[MyType1] = None
}
val mockParser = org.mockito.Mockito.mock(
  implicitly[ClassManifest[Parser]].erasure,
  defaultParseResult
).asInstanceOf[Parser]
after a bit of browsing of the source for the org.specs2.mock.Mockito trait and things it calls.
And now, instead of returning null, the parse returns None when not stubbed (including when it's expected as in the first test), which allows the test to pass with this value being used in the code under test.
I will likely make a test support method hiding the mess in the mockParser assignment and letting me do the same for various return types, as I'm going to need that capability several times just in this set of tests.
I couldn't locate support for a shorter way of doing this in org.specs2.mock.Mockito, but perhaps this will inspire Eric to add such. Nice to have the author in the conversation...
Edit
On further perusal of the source, it occurred to me that I should be able to just call the method
def mock[T, A](implicit m: ClassManifest[T], a: org.mockito.stubbing.Answer[A]): T =
  org.mockito.Mockito.mock(implicitly[ClassManifest[T]].erasure, a).asInstanceOf[T]
defined in org.specs2.mock.MockitoMocker, which was in fact the inspiration for my solution above. But I can't figure out the call. mock is rather overloaded, and all my attempts seem to end up invoking a different version and not liking my parameters.
So it looks like Eric has already included support for this, but I don't understand how to get to it.
Update
I have defined a trait containing the following:
def mock[T, A](implicit m: ClassManifest[T], default: A): T = {
  org.mockito.Mockito.mock(
    implicitly[ClassManifest[T]].erasure,
    new Answer[A] {
      def answer(p1: InvocationOnMock): A = default
    }
  ).asInstanceOf[T]
}
and now by using that trait I can setup my mock as
implicit val defaultParseResult = None
val mockParser = mock[Parser,Option[MyType1]]
I don't after all need more usages of this in this particular test, as supplying a usable value for this makes all my tests work without null checks in the code under test. But it might be needed in other tests.
I'd still be interested in how to handle this issue without adding this trait.
Without the full code it's difficult to say, but can you please check that the method you're trying to mock is not a final method? In that case Mockito won't be able to mock it and will return null.
Another piece of advice, when something doesn't work, is to rewrite the code with Mockito in a standard JUnit test. Then, if it fails, your question might be best answered by someone on the Mockito mailing list.