Is there any benefit to using partially applied functions vs injecting dependencies into a class? Both approaches, as I understand them, are shown here:
class DB(conn: String) {
  def get(sql: String): List[Any] = ???
}

object DB {
  def get(conn: String)(sql: String): List[Any] = ???
}
object MyApp {
  val conn = "jdbc:..."
  val sql = "select * from employees"

  val db1 = new DB(conn)
  db1.get(sql)

  val db2 = DB.get(conn) _
  db2(sql)
}
Using partially applied functions is somewhat simpler, but the conn is passed in on every call, and a different conn could be supplied each time. The advantage of using a class is that it can perform one-off operations when it is created, such as validation or caching, and retain the results for re-use.
For example, the conn in this code is a String, but it is presumably used to connect to a database of some sort. With the partially applied function that connection must be made on every call. With a class the connection can be made once, when the instance is created, and re-used for each query. The class version can also refuse to be created at all unless the conn is valid.
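To make that concrete, here is a hedged sketch of the difference. The class names, the use of java.sql.DriverManager and the validation rule are illustrative assumptions, not part of the original question:

import java.sql.{Connection, DriverManager}

// Class version: validate and connect once, at construction, then reuse.
class CachingDB(connString: String) {
  require(connString.startsWith("jdbc:"), s"invalid connection string: $connString")
  private val connection: Connection = DriverManager.getConnection(connString)

  def get(sql: String): List[Any] = ??? // query using the cached connection
}

// Partially applied version: nothing survives between calls, so every call
// has to establish (and tear down) its own connection.
object PerCallDB {
  def get(connString: String)(sql: String): List[Any] = {
    val connection = DriverManager.getConnection(connString)
    try {
      ??? // query using the fresh connection
    } finally connection.close()
  }
}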
The class is usually used when the dependency is longer-lived or used by multiple functions. Partial application is more common when the dependency is shorter-lived, like during a single loop or callback. For example:
list.map(f(context))
def f(context: Context)(element: Int): Result = ???
It wouldn't really make sense to create a class just to hold f. On the other hand, if you have 5 functions that all take context, you should probably just put those into a class. In your example, get is unlikely to be the only thing that requires the conn, so a class makes more sense.
For accessing objects, I created a Slick DAO containing functions that return actions and objects of a stored type. Example:
def findByKeysAction(a: String, b: String, c: String) = {
  Users.filter(x => x.a === a && x.b === b && x.c === c).result
}

def findByKeys(a: String, b: String, c: String): Future[Option[foo]] = {
  db.run(findByKeysAction(consumerId, contextId, userId)).map(_.headOption)
}
Notice how the non-action-based function wraps the other in db.run().
What is a solid approach to test both functions and minimizing redundancy of code?
A naive method would of course be to test them both with their own individual test setups (the above is a simple example; there could be a lot of test setup needed to satisfy DB restrictions).
"Notice how the non-action-based function wraps the other in db.run()."
Not really. Your findByKeys method does not call findByUserIdAction, so I'm adjusting for that minor detail in this answer.
def findByUserIdAction(userId: String) = {
  Users.filter(_.userId === userId).result
}
The above code returns a DBIOAction. As the documentation states:
Just like a query, an I/O action is only a description of an operation. Creating or composing actions does not execute anything on a database.
As far as a user of Slick is concerned, there is no meaningful test for a DBIOAction, because by itself it does nothing; it's only a recipe for what one wants to do. To execute the above DBIOAction, you have to materialize it, which is what the following does:
def findByUserId(userId: String): Future[Option[User]] = {
  db.run(findByUserIdAction(userId)).map(_.headOption)
}
The materialized result is what you want to test. One way to do so is to use ScalaTest's ScalaFutures trait. For example, in a spec that mixes in that trait, you could have something like:
"Users" should "return a single user by id" in {
findByUserId("id3").futureValue shouldBe Option(User("id3", ...))
}
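For reference, here is a minimal, self-contained sketch of how those pieces fit together; the spec name and the stubbed findByUserId below stand in for the real repository and db.run call and are assumptions for illustration only:

import scala.concurrent.Future
import org.scalatest.concurrent.ScalaFutures
import org.scalatest.{FlatSpec, Matchers}

class MaterializedResultSpec extends FlatSpec with Matchers with ScalaFutures {
  case class User(id: String)

  // Stand-in for db.run(findByUserIdAction(id)).map(_.headOption)
  def findByUserId(id: String): Future[Option[User]] =
    Future.successful(Some(User(id)))

  // futureValue blocks (with a configurable timeout) until the Future completes,
  // so the assertion runs against the materialized value.
  "findByUserId" should "return a single user by id" in {
    findByUserId("id3").futureValue shouldBe Option(User("id3"))
  }
}

In a real spec the repository would of course be wired to a test database rather than a stub.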
Take a look at this Slick 3.2.0 test project for more examples: specifically, TestSpec and QueryCoffeesTest.
In summary, don't bother trying to test a DBIOAction in isolation; just test its materialized result.
Using Scala, Play Framework, Slick 3, Specs2.
I have a repository layer and a service layer. The repositories are quite dumb, and I use specs2 to make sure the service layer does its job.
My repositories used to return futures, like this:
def findById(id: Long): Future[Option[Foo]] =
  db.run(fooQuery(id))
Then it would be used in the service:
def foonicate(id: Long): Future[Foo] = {
  fooRepository.findById(id).flatMap { optFoo =>
    val foo: Foo = optFoo match {
      case Some(foo) => [business logic returning Foo]
      case None => [business logic returning Foo]
    }
    fooRepository.save(foo)
  }
}
Services were easy to spec. In the service spec, the FooRepository would be mocked like this:
fooRepository.findById(3).returns(Future(Foo(3)))
I recently found the need for database transactions. Several queries should be combined into a single transaction. The prevailing opinion seems to be that it's perfectly ok to handle transaction logic in the service layer.
With that in mind, I've changed my repositories to return slick.dbio.DBIO and added a helper method to run the query transactionally:
def findById(id: Long): DBIO[Option[Foo]] =
  fooQuery(id)

def run[T](query: DBIO[T]): Future[T] =
  db.run(query.transactionally)
The services compose the DBIOs and finally call the repository to run the query:
def foonicate(id: Long): Future[Foo] = {
  val query = fooRepository.findById(id).flatMap { optFoo =>
    val foo: Foo = optFoo match {
      case Some(foo) => [business logic finally returning Foo]
      case None => [business logic finally returning Foo]
    }
    fooRepository.save(foo)
  }
  fooRepository.run(query)
}
That seems to work, but now I can only spec it like this:
val findFooDbio = DBIO.successful(None)
val saveFooDbio = DBIO.successful(Foo(3))

fooRepository.findById(3) returns findFooDbio
fooRepository.save(Foo(3)) returns saveFooDbio
fooRepository.run(any[DBIO[Foo]]) returns Future(Foo(3))
That any in the run mock sucks! Now I'm not testing the actual logic but instead accepting any DBIO[Foo]! I've tried to use the following:
fooRepository.run(findFooDbio.flatMap(_ => saveFooDbio)) returns Future(Foo(3))
But it breaks with java.lang.NullPointerException: null, which is specs2's way of saying "sorry mate, the method with this parameter wasn't found". I've tried various variations, but none of them worked.
I suspect that might be because functions can't be compared:
scala> val a: Int => String = x => "hi"
a: Int => String = <function1>
scala> val b: Int => String = x => "hi"
b: Int => String = <function1>
scala> a == b
res1: Boolean = false
Any ideas how to spec DBIO composition without cheating?
I had a similar idea and investigated whether I could:
create the same DBIO composition
use it with a matcher when mocking
However, I found out that it is actually infeasible in practice:
as you noticed, you cannot compare functions
additionally, when you look at the internals of DBIO, it is basically a Free-monad-like structure: it has implementations for plain values and directly generated queries (so if you extracted the statements you could compare some part of the query), but there are also mappings which store functions
even if you somehow managed to reuse functions so that they had reference equality, the implementations of DBIO do not bother overriding equals, so they would be a different beast anyway
So knowing that, I gave up the initial idea. What I can recommend instead:
mock on any input to database.run - this is more error-prone, as it won't warn you when the test expectations start to differ from the actual results, but it's better than nothing
replace DBIO with some intermediate structure that you know you can compare safely. E.g. Cats' Free monad implementation uses case classes, so as long as you ensure that the functions involved are comparable (e.g. by not creating them ad hoc, but defining them as vals and objects), you can compare the intermediate representation and mock the whole interpret -> run process
replace unit tests with a mocked database with integration tests against an actual database
try out the Typed Tagless Final Interpreter pattern for handling databases, and inject a different monad in tests than in production (e.g. production -> services returning DBIO, tests -> services returning the Futures you want); a sketch of this idea follows below
Actually, you could try many other things with Free, TTFI and swapping implementations. The bottom line is: you cannot reliably compare DBIO values, so design your code in a way that lets you test without doing so. It's not a pleasant answer, especially if you just wanted to put a test together and move on, but AFAIK there is no other way.
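To make that last bullet concrete, here is a minimal sketch of the "swap the monad" idea; FooRepository[F[_]] and InMemoryFooRepository are hypothetical names, and Foo is reduced to a bare id for illustration:

import scala.concurrent.Future

case class Foo(id: Long)

// The repository is parameterized over the effect type: production code can
// implement it with F = DBIO (running programs transactionally in run), while
// tests implement it with something directly inspectable such as Future.
trait FooRepository[F[_]] {
  def findById(id: Long): F[Option[Foo]]
  def save(foo: Foo): F[Foo]
  def run[T](program: F[T]): Future[T]
}

// Test implementation: no Slick, no mocks; assertions run against real values.
class InMemoryFooRepository extends FooRepository[Future] {
  private var store = Map.empty[Long, Foo]
  def findById(id: Long): Future[Option[Foo]] = Future.successful(store.get(id))
  def save(foo: Foo): Future[Foo] = { store += foo.id -> foo; Future.successful(foo) }
  def run[T](program: Future[T]): Future[T] = program
}

To compose repository calls generically inside the service you would also need a Monad[F] instance (e.g. from Cats), which is the main cost of this approach; in exchange the service logic can be asserted on with plain values instead of DBIO comparisons.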
I have a copy of Programming MapReduce with Scalding by Antonios Chalkiopoulos. In the book he discusses the External Operations design pattern for Scalding code. You can see an example on his website here. I have chosen to use the Type Safe API. Naturally, this introduces new challenges, but I prefer it over the Fields API, which is what is heavily discussed in both the book and the site.
I am wondering how people have implemented the external operations pattern with the Type Safe API. My initial implementation is as follows:
I create a class that extends com.twitter.scalding.Job which will serve as my Scalding job class where I will 'manage arguments, define taps, and use external operations to construct data processing pipelines'.
I create an object where I define my functions to be used in the Type Safe pipes. Because the Type Safe pipes take as arguments a function, I can then just pass the functions in the object as arguments to the pipes.
This creates code that looks like this:
class MyJob(args: Args) extends Job(args) {
  import MyOperations._

  val input_path = args(MyJob.inputArgPath)
  val output_path = args(MyJob.outputArgPath)

  val eventInput: TypedPipe[(LongWritable, Text)] = this.mode match {
    case m: HadoopMode => TypedPipe.from(WritableSequenceFile[LongWritable, Text](input_path))
    case _ => TypedPipe.from(WritableSequenceFile[LongWritable, Text](input_path))
  }

  val eventOutput: FixedPathSource with TypedSink[(LongWritable, Text)] with TypedSource[(LongWritable, Text)] = this.mode match {
    case m: HadoopMode => WritableSequenceFile[LongWritable, Text](output_path)
    case _ => TypedTsv[(LongWritable, Text)](output_path)
  }

  val validatedEvents: TypedPipe[(LongWritable, Either[Text, Event])] = eventInput.map(convertTextToEither).fork
  validatedEvents.filter(isEvent).map(removeEitherWrapper).write(eventOutput)
}
object MyOperations {
  def convertTextToEither(v: (LongWritable, Text)): (LongWritable, Either[Text, Event]) = {
    ...
  }

  def isEvent(v: (LongWritable, Either[Text, Event])): Boolean = {
    ...
  }

  def removeEitherWrapper(v: (LongWritable, Either[Text, Event])): (LongWritable, Text) = {
    ...
  }
}
As you can see, the functions that are passed to the Scalding Type Safe operations are kept separate from the job itself. While this is not as 'clean' as the external operations pattern presented, this is a quick way to write this kind of code. Additionally, I can use JUnitRunner for doing job level integration tests and ScalaTest for function level unit tests.
The main point of this post though is to ask how people are doing this sort of thing? The documentation around the internet for Scalding Type Safe API is sparse. Are there more functional Scala friendly ways for doing this? Am I missing a key component here for the design pattern? I sort of feel nervous about this because with the Fields API you can write unit tests on pipes with ScaldingTest. As far as I know, you can't do that with TypedPipes. Please let me know if there is a generally agreed upon pattern for Scalding Type Safe API or how you create reusable, modular, and testable Type Safe API code. Thanks for the help!
Update 2 after Antonios' reply
Thank you for the reply. That was basically the answer I was looking for, and I wanted to continue the conversation. The main issue I see in your answer, as I commented, is that this implementation expects a specific type, but what if the types change throughout your job? I have explored the following code and it seems to work, but it feels hacked on.
def self: TypedPipe[Any]

def testingPipe: TypedPipe[(LongWritable, Text)] = self.map(
  (firstVar: Any) => {
    val tester = firstVar.asInstanceOf[(LongWritable, Text)]
    (tester._1, tester._2)
  }
)
The upside to this is I declare one implementation of self but the downside is this ugly type casting. Additionally, I have not tested this out in depth with a more complex pipeline. So basically, what are your thoughts on how to handle types as they change with only one self implementation for cleanliness/brevity?
Scala extension methods are implemented using implicit classes.
You add to the compiler the capability of converting a TypedPipe into a (wrapper) class that contains your external operations:
import com.twitter.scalding._
import cascading.flow.FlowDef

class MyJob(args: Args) extends Job(args) {
  implicit class MyOperationsWrapper(val self: TypedPipe[Double]) extends MyOperations with Serializable

  val pipe = TypedPipe.from(TypedTsv[Double](args("input")))

  val result = pipe
    .operation1
    .operation2(x => x * 2)
    .write(TypedTsv[Double](args("output")))
}
trait MyOperations {
  def self: TypedPipe[Double]

  def operation1(implicit fd: FlowDef): TypedPipe[Double] =
    self.map { x =>
      println(s"Input: $x")
      x / 100
    }

  def operation2(datafn: Double => Double)(implicit fd: FlowDef): TypedPipe[Double] =
    self.map { x =>
      val result = datafn(x)
      println(s"Result: $result")
      result
    }
}
import org.apache.hadoop.util.ToolRunner
import org.apache.hadoop.conf.Configuration

object MyRunner extends App {
  ToolRunner.run(new Configuration(), new Tool, (classOf[MyJob].getName :: "--local" ::
    "--input" :: "doubles.tsv" ::
    "--output" :: "result.tsv" :: args.toList).toArray)
}
Regarding how to manage types across the pipes, my recommendation would be to work out some basic types that make sense and use case classes. To use your example, I would rename the method convertTextToEither to extractEvents:
case class LogInput(l: Long, text: Text)
case class Event(data: String)

def extractEvents: TypedPipe[Event] =
  self.filter(line => isEvent(line))
      .map(line => getEvent(line.text))
Then you would have
LogInputOperations for LogInput types
EventOperations for Event types
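As a hedged sketch of what those two traits might look like, following the same self/implicit-class pattern as above (the event-detection logic, helper names and job wiring here are assumptions for illustration):

import com.twitter.scalding._
import org.apache.hadoop.io.Text

trait LogInputOperations {
  def self: TypedPipe[LogInput]

  // keep only the lines that look like events and turn them into Events
  def extractEvents: TypedPipe[Event] =
    self.filter(_.text.toString.contains("EVENT"))
        .map(line => Event(line.text.toString))
}

trait EventOperations {
  def self: TypedPipe[Event]

  def payloads: TypedPipe[String] = self.map(_.data)
}

class MyEventJob(args: Args) extends Job(args) {
  implicit class LogInputWrapper(val self: TypedPipe[LogInput]) extends LogInputOperations with Serializable
  implicit class EventWrapper(val self: TypedPipe[Event]) extends EventOperations with Serializable

  TypedPipe.from(TypedTsv[(Long, String)](args("input")))
    .map { case (l, t) => LogInput(l, new Text(t)) }
    .extractEvents
    .payloads
    .write(TypedTsv[String](args("output")))
}

Each trait only sees pipes of its own element type, which is what keeps the operations reusable across jobs.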
I am not sure what the problem is with the snippet you showed, or why you think it is "less clean". It looks fine to me.
As for the question about unit testing jobs that use the Type Safe API, take a look at JobTest; it seems to be just what you are looking for.
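For example, a spec for the MyJob defined above (the one reading TypedTsv[Double]) might look roughly like this; treat the exact JobTest incantation as a sketch to adapt to your Scalding version rather than a guaranteed API:

import com.twitter.scalding._
import org.scalatest.{FlatSpec, Matchers}

class MyJobSpec extends FlatSpec with Matchers {
  "MyJob" should "divide each value by 100 and then double it" in {
    JobTest(new MyJob(_))
      .arg("input", "doubles.tsv")
      .arg("output", "result.tsv")
      .source(TypedTsv[Double]("doubles.tsv"), List(100.0, 250.0))
      .typedSink(TypedTsv[Double]("result.tsv")) { out =>
        // 100.0 -> operation1 (/100) -> operation2 (*2) -> 2.0, and so on
        out.toList shouldBe List(2.0, 5.0)
      }
      .run
      .finish
  }
}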
I'm attempting to write a method which accepts multiple generic types and takes as an argument a unit of work to execute.
The idea is that the unit of work is a common function that itself is generic. For the sake of example, let's say it's something like the following:
def loadModelRdd[T: TypeTag](sc: SparkContext): RDD[T] = {
  ...
}
loadModelRdd() will construct an RDD of the given type after some internal processing like loading the Model information, etc.
A prototype method I've been hacking on looks something like the following (non-working):
def forkAll[A : Manifest, B : Manifest](work: => RDD[_]): (RDD[A], RDD[B]) = {
  def aFuture = Future { work } // How can I notify that this work call returns type A?
  def bFuture = Future { work } // How can I notify that this work call returns type B?

  val res = for {
    a <- aFuture
    b <- bFuture
  } yield (a.asInstanceOf[A], b.asInstanceOf[B])

  Await.result(res, 10.seconds)
}
This is a shortened version of the code I'm working on as I'm actually looking at accepting as many as 10 different types.
As you can see, the overall goal of the forkAll method is to wrap the unit of work in a Future, fork-join the execution of the unit of work for each type, then return the results as a Tuple'd result. An example consumer statement would be:
val (a, b) = forkAll[ClassA, ClassB](loadModelRdd)
i.e. I want to fork-join at this point and wait for the results, but I want the executions to run in parallel and then be collected back to the Driver (the Spark Driver, to be specific).
The problem is I'm not sure how to coerce the type returned by the unit of work within forkAll when constructing the Future {} blocks. Without the forkAll, the implementation looked like the following:
val resA = loadModelRdd[ClassA](sc)
val resB = loadModelRdd[ClassB](sc)
...
I am looking at doing this for two reasons:
To abstract the details of fork-join for any unit of work which matches this model.
A version of this code, which explicitly states what the unit of work is, is working in Production and was responsible for cutting the execution time of a long-running block by close to half. I have a couple of execution steps where this pattern could be applied.
Is this something that is possible in Scala's type system, or should I look at this problem from a different perspective? I've tried a couple of implementations (including one described here), but I haven't quite found one that fits my current view of the problem.
Please let me know if there is any additional information needed.
Thanks!
Short answer: Scala does not allow functions with type parameters, so what you want is not exactly possible.
You are attempting to pass a method with a type parameter. Although methods are allowed to have type parameters, functions are not. When you try to pass a method, it acts like an anonymous function, so you must specify a type.
However, since methods do allow type parameters, you can take advantage of this by creating an abstract class that will do your fork/join:
abstract class ForkJoin {
  protected def work[T]: RDD[T]

  def apply[A, B]: (RDD[A], RDD[B]) = {
    // Write implementation of fork/join here
    (work[A], work[B])
  }
}
and then overriding the type-generic work method so that it does what you want, such as calling some other predefined method:
val forkJoin = new ForkJoin {
  override protected def work[T]: RDD[T] =
    loadModelRdd[T](sc)
}

val (intRdd, stringRdd) = forkJoin[Int, String]
Check out this for a prototype implementation that compiles and runs without issues.
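As a hedged sketch of what the "fork/join here" placeholder could look like, the body below runs each work[T] in its own Future and blocks until both finish; the global ExecutionContext and the 10-minute timeout are simplifying assumptions:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.rdd.RDD

abstract class ForkJoin {
  protected def work[T]: RDD[T]

  def apply[A, B]: (RDD[A], RDD[B]) = {
    val aFuture = Future(work[A])        // fork
    val bFuture = Future(work[B])        // fork
    val joined = for { a <- aFuture; b <- bFuture } yield (a, b)
    Await.result(joined, 10.minutes)     // join
  }
}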
I want to improve the following Cassandra-related Scala code. I have two unrelated user-defined types which are actually in Java source files (leaving out the details).
public class Blob { .. }
public class Meta { .. }
So here is how I use them currently from Scala:
private val blobMapper: Mapper[Blob] = mappingManager.mapper(classOf[Blob])
private val metaMapper: Mapper[Meta] = mappingManager.mapper(classOf[Meta])

def save(entity: Object) = {
  entity match {
    case blob: Blob => blobMapper.saveAsync(blob)
    case meta: Meta => metaMapper.saveAsync(meta)
    case _ => // exception
  }
}
While this works, how can you avoid the following problems:
repetition when adding new user defined type classes like Blob or Meta
pattern matching repetition when adding new methods like save
having Object as parameter type
You can definitely use Mapper as a typeclass, doing:
def save[A](entity: A)(implicit mapper: Mapper[A]) = mapper.saveAsync(entity)
Now you have a generic method able to perform a save operation on every type A for which a Mapper[A] is in scope.
Also, the mappingManager.mapper implementation could probably be improved to avoid classOf, but it's hard to tell from the question in its current state.
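Building on the mappingManager from the question, the implicit instances might be wired up roughly like this (the no-argument Blob/Meta constructors are assumptions for illustration):

implicit val blobMapper: Mapper[Blob] = mappingManager.mapper(classOf[Blob])
implicit val metaMapper: Mapper[Meta] = mappingManager.mapper(classOf[Meta])

def save[A](entity: A)(implicit mapper: Mapper[A]) = mapper.saveAsync(entity)

save(new Blob())  // compiler picks blobMapper
save(new Meta())  // compiler picks metaMapper
// save("nope")   // does not compile: no Mapper[String] in scope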
A few questions:
Is mappingManager.mapper(cls) expensive?
How much do you care about handling subclasses of Blob or Meta?
Can something like this work for you?
def save[T: Manifest](entity: T) = {
  mappingManager.mapper(manifest[T].runtimeClass).saveAsync(entity)
}
If you do care about making sure that subclasses of Meta grab the proper mapper, then you may find isAssignableFrom helpful in your .mapper lookup (and store the sub-classes you find in a HashMap so you only have to look them up once).
EDIT: Then maybe you want something like this (ignoring threading concerns):
private[this] val mapperMap = mutable.HashMap[Class[_], Mapper[_]]()

def save[T: Manifest](entity: T) = {
  val cls = manifest[T].runtimeClass
  mapperMap.getOrElseUpdate(cls, mappingManager.mapper(cls))
    .asInstanceOf[Mapper[T]]
    .saveAsync(entity)
}
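A quick usage sketch of the cached version (again assuming no-argument constructors):

save(new Blob())  // first call for Blob: looks up and caches Mapper[Blob]
save(new Meta())  // first call for Meta: looks up and caches Mapper[Meta]
save(new Blob())  // subsequent calls hit mapperMap, no mappingManager lookup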