Scalding TypedPipe API External Operations pattern - scala

I have a copy of Programming MapReduce with Scalding by Antonios Chalkiopoulos. In the book he discusses the External Operations design pattern for Scalding code. You can see an example on his website here. I have chosen to use the Type Safe API. Naturally, this introduces new challenges, but I prefer it over the Fields API, which is what the book and the site mostly discuss.
I am wondering how people have implemented the external operations pattern with the Type Safe API. My initial implementation is as follows:
I create a class that extends com.twitter.scalding.Job which will
serve as my Scalding job class where I will 'manage arguments, define
taps, and use external operations to construct data processing
pipelines'.
I create an object where I define my functions to be used in the Type
Safe pipes. Because the Type Safe pipes take as arguments a function,
I can then just pass the functions in the object as arguments to the
pipes.
This creates code that looks like this:
class MyJob(args: Args) extends Job(args) {
import MyOperations._
val input_path = args(MyJob.inputArgPath)
val output_path = args(MyJob.outputArgPath)
val eventInput: TypedPipe[(LongWritable, Text)] = this.mode match {
case m: HadoopMode => TypedPipe.from(WritableSequenceFile[LongWritable, Text](input_path))
case _ => TypedPipe.from(WritableSequenceFile[LongWritable, Text](input_path))
}
val eventOutput: FixedPathSource with TypedSink[(LongWritable, Text)] with TypedSource[(LongWritable, Text)] = this.mode match {
case m: HadoopMode => WritableSequenceFile[LongWritable, Text](output_path)
case _ => TypedTsv[(LongWritable, Text)](output_path)
}
val validatedEvents: TypedPipe[(LongWritable, Either[Text, Event])] = eventInput.map(convertTextToEither).fork
validatedEvents.filter(isEvent).map(removeEitherWrapper).write(eventOutput)
}
object MyOperations {
def convertTextToEither(v: (LongWritable, Text)): (LongWritable, Either[Text, Event]) = {
...
}
def isEvent(v: (LongWritable, Either[Text, Event])): Boolean = {
...
}
def removeEitherWrapper(v: (LongWritable, Either[Text, Event])): (LongWritable, Text) = {
...
}
}
As you can see, the functions that are passed to the Scalding Type Safe operations are kept separate from the job itself. While this is not as 'clean' as the external operations pattern presented, this is a quick way to write this kind of code. Additionally, I can use JUnitRunner for doing job level integration tests and ScalaTest for function level unit tests.
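To illustrate the function-level testing mentioned above, here is a minimal sketch that runs without Hadoop or Scalding. It assumes String in place of Text and Long in place of LongWritable, and the "event:" prefix rule is invented purely for the example; the real bodies of these functions are not shown in the question.

```scala
// Stand-in types: String replaces Hadoop's Text and Long replaces LongWritable
// so the sketch runs without Hadoop or Scalding on the classpath.
object OperationsSketch {
  final case class Event(data: String)

  // Hypothetical parsing rule: lines starting with "event:" are valid events.
  def convertTextToEither(v: (Long, String)): (Long, Either[String, Event]) = {
    val (key, text) = v
    if (text.startsWith("event:")) (key, Right(Event(text.stripPrefix("event:"))))
    else (key, Left(text))
  }

  def isEvent(v: (Long, Either[String, Event])): Boolean = v._2.isRight

  // Intended to run after filtering with isEvent; a Left is returned as its raw text.
  def removeEitherWrapper(v: (Long, Either[String, Event])): (Long, String) =
    (v._1, v._2.fold(text => text, event => event.data))

  // The same function chain the job applies to the pipe, run here on a plain List.
  def run(input: List[(Long, String)]): List[(Long, String)] =
    input.map(convertTextToEither).filter(isEvent).map(removeEitherWrapper)
}
```

Because the functions are plain Scala, they can be exercised with ordinary assertions and the job class only has to be covered by a small number of integration tests.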
The main point of this post though is to ask how people are doing this sort of thing? The documentation around the internet for Scalding Type Safe API is sparse. Are there more functional Scala friendly ways for doing this? Am I missing a key component here for the design pattern? I sort of feel nervous about this because with the Fields API you can write unit tests on pipes with ScaldingTest. As far as I know, you can't do that with TypedPipes. Please let me know if there is a generally agreed upon pattern for Scalding Type Safe API or how you create reusable, modular, and testable Type Safe API code. Thanks for the help!
Update 2 after Antonios' reply
Thank you for the reply. That was basically the answer I was looking for. I wanted to continue the conversation. The main issue I see in your answer as I commented was that this implementation expects a specific type implementation but what if the types change throughout your job? I have explored this code and it seems to work but it seems hacked on.
def self: TypedPipe[Any]
def testingPipe: TypedPipe[(LongWritable, Text)] = self.map(
(firstVar: Any) => {
val tester = firstVar.asInstanceOf[(LongWritable, Text)]
(tester._1, tester._2)
}
)
The upside to this is I declare one implementation of self but the downside is this ugly type casting. Additionally, I have not tested this out in depth with a more complex pipeline. So basically, what are your thoughts on how to handle types as they change with only one self implementation for cleanliness/brevity?

Scala extension methods are implemented using implicit classes.
You add to the compiler the capability of converting a TypedPipe into a (wrapper) class that contains your external operations:
import com.twitter.scalding.TypedPipe
import com.twitter.scalding._
import cascading.flow.FlowDef
class MyJob(args: Args) extends Job(args) {
implicit class MyOperationsWrapper(val self: TypedPipe[Double]) extends MyOperations with Serializable
val pipe = TypedPipe.from(TypedTsv[Double](args("input")))
val result = pipe
.operation1
.operation2(x => x*2)
.write(TypedTsv[Double](args("output")))
}
trait MyOperations {
def self: TypedPipe[Double]
def operation1(implicit fd: FlowDef): TypedPipe[Double] =
self.map { x =>
println(s"Input: $x")
x / 100
}
def operation2(datafn:Double => Double)(implicit fd: FlowDef): TypedPipe[Double] =
self.map { x=>
val result = datafn(x)
println(s"Result: $result")
result
}
}
import org.apache.hadoop.util.ToolRunner
import org.apache.hadoop.conf.Configuration
object MyRunner extends App {
ToolRunner.run(new Configuration(), new Tool, (classOf[MyJob].getName :: "--local" ::
"--input" :: "doubles.tsv" ::
"--output":: "result.tsv" :: args.toList).toArray)
}
Regarding how to manage types across the pipes, my recommendation would be to work out some basic types that make sense and use case classes. To use your example, I would rename the method convertTextToEither to extractEvents:
case class LogInput(l: Long, text: Text)
case class Event(data: String)
// Lives in a LogInputOperations trait where self is a TypedPipe[LogInput];
// isEvent and getEvent stand for your own predicate and parser.
def extractEvents: TypedPipe[Event] =
self.filter(line => isEvent(line))
.map(line => getEvent(line.text))
Then you would have
LogInputOperations for LogInput types
EventOperations for Event types
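The split into one set of operations per type could look roughly like the sketch below. Pipe is a hypothetical List-backed stand-in for TypedPipe so the code runs locally, text is a String rather than Text, and the "non-empty text is an event" rule and the upperCased operation are made up for illustration.

```scala
object TypedOperationsSketch {
  // Pipe is a hypothetical stand-in for TypedPipe, backed by a List.
  final case class Pipe[A](values: List[A]) {
    def map[B](f: A => B): Pipe[B] = Pipe(values.map(f))
    def filter(p: A => Boolean): Pipe[A] = Pipe(values.filter(p))
  }

  final case class LogInput(l: Long, text: String)
  final case class Event(data: String)

  // Operations that only exist on pipes of LogInput.
  implicit class LogInputOperations(val self: Pipe[LogInput]) {
    // Hypothetical rule: a line is an event when its text is non-empty.
    def extractEvents: Pipe[Event] =
      self.filter(_.text.nonEmpty).map(line => Event(line.text))
  }

  // Operations that only exist on pipes of Event.
  implicit class EventOperations(val self: Pipe[Event]) {
    def upperCased: Pipe[Event] = self.map(e => Event(e.data.toUpperCase))
  }

  // The element type changes along the chain and no casts are needed.
  val result: Pipe[Event] =
    Pipe(List(LogInput(1L, "click"), LogInput(2L, ""))).extractEvents.upperCased
}
```

Each wrapper declares its own self with a concrete element type, so calling extractEvents on a pipe of Event is rejected by the compiler instead of failing at runtime.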

I am not sure what is the problem you see with the snippet you showed, and why you think it is "less clean". It looks fine to me.
As for the unit testing jobs using typed API question, take a look at JobTest, it seems to be just what you are looking for.

How to normalise a Union Type (T | Option[T])?

I have the following case class:
case class Example[T](
obj: Option[T] | T = None,
)
This allows me to construct it like Example(myObject) instead of Example(Some(myObject)).
To work with obj I need to normalise it to Option[T]:
lazy val maybeIn = obj match
case o: Option[T] => o
case o: T => Some(o)
However, the compiler warns: the type test for Option[T] cannot be checked at runtime
I tried with TypeTest but I got also warnings - or the solutions I found look really complicated - see https://stackoverflow.com/a/69608091/2750966
Is there a better way to achieve this pattern in Scala 3?
I don't know about Scala3. But you could simply do this:
case class Example[T](v: Option[T] = None)
object Example {
def apply[T](t: T): Example[T] = Example(Some(t))
}
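A short usage sketch of the overloaded apply above, wrapped in an object (ExampleDemo is just a name chosen here) so the case class and its companion are defined together:

```scala
object ExampleDemo {
  case class Example[T](v: Option[T] = None)
  object Example {
    // Extra factory that wraps a plain value in Some.
    def apply[T](t: T): Example[T] = Example(Some(t))
  }

  val fromValue: Example[Int]  = Example(42)         // picks the overload above
  val fromOption: Example[Int] = Example(Option(42)) // picks the generated apply
  val empty: Example[Int]      = Example[Int]()      // falls back to the default None
}
```

When the argument is already an Option, the generated apply is the more specific alternative, so the value is not wrapped twice.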
One could also go for implicit conversion, regarding the specific use case of the OP:
import scala.language.implicitConversions
case class Optable[Out](value: Option[Out])
object Optable {
implicit def fromOpt[T](o: Option[T]): Optable[T] = Optable(o)
implicit def fromValue[T](v: T): Optable[T] = Optable(Some(v))
}
case class SomeOpts(i: Option[Int], s: Option[String])
object SomeOpts {
def apply(i: Optable[Int], s: Optable[String]): SomeOpts = SomeOpts(i.value, s.value)
}
println(SomeOpts(15, Some("foo")))
We have a specialized Option-like type for this purpose: OptArg (in Scala 2 but should be easily portable to 3)
import com.avsystem.commons._
def gimmeLotsOfParams(
intParam: OptArg[Int] = OptArg.Empty,
strParam: OptArg[String] = OptArg.Empty
): Unit = ???
gimmeLotsOfParams(42)
gimmeLotsOfParams(strParam = "foo")
It relies on an implicit conversion so you have to be a little careful with it, i.e. don't use it as a drop-in replacement for Option.
The implementation of OptArg is simple enough that if you don't want external dependencies then you can probably just copy it into your project or some kind of "commons" library.
EDIT: the following answer is incorrect. As of Scala 3.1, flow analysis is only able to check for nullability. More information is available in the Scala book.
I think that the already given answer is probably better suited for the use case you proposed (exposing an API that can take a simple value and normalize it to an Option).
However, the question in the title is still interesting and I think it makes sense to address it.
What you are observing is a consequence of type parameters being erased at runtime, i.e. they only exist during compilation, while matching happens at runtime, once those have been erased.
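A small sketch of what erasure means in practice (ErasureDemo and its member names are chosen here for illustration): two Option values with different element types share one runtime class, so a runtime check can only see the outer Option.

```scala
object ErasureDemo {
  // Both values have the same runtime class: the element type is not recorded.
  val intOptionClass: Class[_] = Some(1).getClass
  val stringOptionClass: Class[_] = Some("one").getClass
  val sameClass: Boolean = intOptionClass == stringOptionClass

  // A runtime check can therefore only ask "is this an Option of anything?".
  def isSomeOption(x: Any): Boolean = x.isInstanceOf[Option[_]]
}
```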
However, the Scala compiler is able to perform flow analysis for union types. Intuitively I'd say there's probably a way to make it work in pattern matching (as you did), but you can make it work for sure using an if and isInstanceOf (not as clean, I agree):
case class Example[T](
obj: Option[T] | T = None
) {
lazy val maybeIn =
if (obj.isInstanceOf[Option[_]]) {
obj
} else {
Some(obj)
}
}
You can play around with this code here on Scastie.
Here is the announcement from 2019 when flow analysis was added to the compiler.

Preparing proper HTTP Response from an Akka-stream that can produce error situations

I intend to model a trivial game-play (HTTPReq/HTTPResp) using Akka Streams. In a round, the player is challenged by the server to guess a number. The server checks the player's response, and if what the server holds and what the player guesses are the same, the player is given a point.
A typical flow is like this:
Player (already authenticated, sessionID assigned) requests to start a round
Server checks if the sessionID is valid; if it is not, the player is informed with a suitable message
Server generates a number and offers to the player, along with a RoundID
... so on. Nothing extraordinary.
This is a rough arrangement of types and the flows:
import akka.{Done, NotUsed}
import akka.stream.scaladsl.{Flow, Keep, Source}
import java.util.Random
import akka.stream.scaladsl.Sink
import scala.concurrent.Future
import scala.util.{Failure, Success}
sealed trait GuessingGameMessageToAndFro
case class StartARound(sessionID: String) extends GuessingGameMessageToAndFro
case class RoundStarted(sessionID: String, roundID: Int) extends GuessingGameMessageToAndFro
case class NumberGuessed(sessionID: String, roundID: Int, guessedNo: Int) extends GuessingGameMessageToAndFro
case class CorrectNumberGuessed(sessionID: String, nextRoundID: Int) extends GuessingGameMessageToAndFro
case class FinalRoundScore(sessionID: String, finalScore: Int) extends GuessingGameMessageToAndFro
case class MissingSession(sessionID: String) extends GuessingGameMessageToAndFro
case class IncorrectNumberGuessed(sessionID: String, clientGuessed: Int, serverChose: Int) extends GuessingGameMessageToAndFro
object SessionService {
def exists(m: StartARound) = if (m.sessionID.startsWith("1")) m else MissingSession(m.sessionID)
}
object NumberGenerator {
def numberToOfferToPlayer(m: GuessingGameMessageToAndFro) = {
m match {
case StartARound(s) => RoundStarted(s, new Random().nextInt())
case MissingSession(s) => m
case _ => throw new RuntimeException("Not yet implemented")
}
}
}
val sessionExistenceChecker: Flow[StartARound,GuessingGameMessageToAndFro,NotUsed]
= Flow.fromFunction(m => SessionService.exists(m))
val guessNumberPreparator: Flow[GuessingGameMessageToAndFro,GuessingGameMessageToAndFro,_]
= Flow.fromFunction(m => NumberGenerator.numberToOfferToPlayer(m))
val s1 = StartARound("123")
val k =
Source
.single(s1)
.via(sessionExistenceChecker)
.via(guessNumberPreparator)
.toMat(Sink.head)(Keep.right)
val finallyObtained = k.run
finallyObtained.onComplete(v => {
v match {
case Success(x) => // Prepare proper HTTP Response
case Failure(ex) => // Prepare proper HTTP Response
}
})
The reason I am going through a long pattern-matching block in numberToOfferToPlayer() (I have shown two cases here, but obviously its size will increase with every type that can flow) is that if an operator like sessionExistenceChecker generates a MissingSession (which is an error condition), it has to travel through the rest of the stream unchanged until it reaches the Future[Done] stage. In fact, the problem is more general: at any stage, a proper transformation should result in either an acceptable type or an error type (mutually exclusive). If I follow this approach, the pattern-matching blocks will proliferate, at the cost of unnecessary repetition, if not ugliness.
I am feeling uncomfortable with this solution of mine. It is becoming verbose and ungainly.
Needless to say, I have not shown the Akka-HTTP facing part here (including the Routes). The code above can be easily stitched, with the route handlers. So, I have skipped it.
My question is: what is a right idiom for such streams? Conceptually speaking, if everything is fine, the elements should keep moving along the stream. However, whenever an error occurs, the (error) element should shoot off to the final stage, directly, skipping all other stages in between. What is the accepted way to model this?
I have gone through a number of Stackoverflow posts, which demonstrate that for similar situations, one should go the partition/merge way. I understand how I can adopt that approach, but for simple cases like mine, that seems to be unnecessary work. Or, am I completely off the mark here?
Any hint, snippet or rap on the knuckles, will be appreciated.
Use a PartialFunction
For this particular use case I would generally agree that a partition & merge setup is "unnecessary work". The other stack posts, referred to in the question, are for the use case where you only have Flow values to combine without the ability to manipulate the underlying logic within the Flow.
When you are able to modify the underlying logic then a simpler solution exists. But the solution does not strictly lie within akka's domain. Instead, you can utilize functional programming constructs available in scala itself.
If you rewrite the numberToOfferToPlayer function to be a PartialFunction:
object NumberGenerator {
val numberToOfferToPlayer : PartialFunction[GuessingGameMessageToAndFro, GuessingGameMessageToAndFro] = {
case s : StartARound => RoundStarted(s.sessionID, new Random().nextInt())
}
}
Then this PartialFunction can be lifted into a regular function which will apply the logic if the message is of type StartARound or just forward the message if it is any other type.
This lifting is done with the applyOrElse method of PartialFunction in conjunction with the predefined identity function in scala which returns the input as the output (i.e. "forwards" the input):
import NumberGenerator.numberToOfferToPlayer
val messageForwarder : GuessingGameMessageToAndFro => GuessingGameMessageToAndFro =
identity[GuessingGameMessageToAndFro]
val guessNumberPreparator: Flow[GuessingGameMessageToAndFro,GuessingGameMessageToAndFro,_] =
Flow fromFunction (numberToOfferToPlayer applyOrElse (_, messageForwarder))
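The forwarding behaviour can be checked without akka at all. The sketch below uses simplified stand-ins for the question's messages (Message, a fixed round ID of 1 instead of a random one) so the result is deterministic; only applyOrElse and identity are the parts taken from the approach above.

```scala
object ForwardingSketch {
  // Simplified stand-ins for the messages in the question.
  sealed trait Message
  final case class StartARound(sessionID: String) extends Message
  final case class RoundStarted(sessionID: String, roundID: Int) extends Message
  final case class MissingSession(sessionID: String) extends Message

  // Defined only for StartARound; the round ID is fixed to keep the sketch deterministic.
  val startRound: PartialFunction[Message, Message] = {
    case StartARound(s) => RoundStarted(s, 1)
  }

  // Returns its input unchanged, i.e. "forwards" the message.
  val forward: Message => Message = identity[Message]

  // Total function: apply startRound where it is defined, otherwise forward the input.
  val startOrForward: Message => Message =
    msg => startRound.applyOrElse(msg, forward)
}
```

A MissingSession passes through startOrForward untouched, which is the "error travels unchanged" behaviour the question asks for.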
Higher Level Of Abstraction
If you have several of these PartialFunctions which you would like to add forwarding logic to:
val foo : PartialFunction[GuessingGameMessageToAndFro, GuessingGameMessageToAndFro] = {
case r : RoundStarted => ???
}
val bar : PartialFunction[GuessingGameMessageToAndFro, GuessingGameMessageToAndFro] = {
case n : NumberGuessed => ???
}
Then you can write a general lifter that will abstract away the regular function creation:
val applyOrForward : PartialFunction[GuessingGameMessageToAndFro, GuessingGameMessageToAndFro] => GuessingGameMessageToAndFro => GuessingGameMessageToAndFro =
pf => msg => pf.applyOrElse(msg, messageForwarder)
This lifter will clean up your code nicely:
val offerFlow = Flow fromFunction applyOrForward(numberToOfferToPlayer)
val fooFlow = Flow fromFunction applyOrForward(foo)
val barFlow = Flow fromFunction applyOrForward(bar)
These Flows can then be combined in the manner that the question describes:
val combinedFlow = offerFlow via fooFlow via barFlow
Similarly, you could get the same result by combining the PartialFunctions first and then creating a single Flow from the combination. This would be useful for unit testing outside of akka:
val combinedPartial = numberToOfferToPlayer orElse foo orElse bar
//no akka test kit necessary
assert {
val testError = MissingSession("testId")
applyOrForward(combinedPartial)(testError) equals testError
}
//nothing much to test
val otherCombinedFlow = Flow fromFunction applyOrForward(combinedPartial)

Scalatest custom matchers for 'should contain'

This is a situation I have encountered frequently, but I have not been able to find a solution yet.
Suppose you have a list of persons and you just want to verify the person names.
This works:
persons.map(_.name) should contain theSameElementsAs(List("A","B"))
Instead, I would rather write this like
val toName: Person => String = _.name
persons should contain theSameElementsAs(List("A","B")) (after mapping toName)
because this is how you would say this.
Sometimes however, you'd like to use a custom matcher which matches more than just one property of the object. How would it be possible to use
persons should contain(..)
syntax, but somehow be able to use a custom matcher?
Both these situations I could easily solve using JUnit or TestNG using Hamcrest matchers, but I have not found a way to do this with ScalaTest.
I have tried to use the 'after being' syntax from the Explicitly trait, but that's not possible since this takes a 'Normalization' which defines that the 'normalized' method uses the same type for the argument and return type. So it's not possible to change a Person to a String.
Also I have not succeeded yet in implementing an 'Explicitly' like trait because it does not like the Equality[.] type I return and/or it does not know anymore what the original list type was, so using '_.name' does not compile.
Any suggestions are welcome.
You can manage something similar via the word decided and moderate abuse of the Equality trait. This works because the Equality trait's areEqual method takes one parameter of the generic type and one of type Any, so you can use it to compare a Person with a String, and decided by simply takes an Equality object, which means you don't have to futz around with Normalization.
import org.scalactic.Equality
import org.scalatest.{FreeSpec, Matchers}
final class Test extends FreeSpec with Matchers {
case class Person(name: String)
val people = List(Person("Alice"), Person("Eve"))
val namesBeingEqual = MappingEquality[Person, String](p => p.name)
"test should pass" in {
(people should contain theSameElementsAs List("Alice", "Eve"))(
decided by namesBeingEqual)
}
"test should fail" in {
(people should contain theSameElementsAs List("Alice", "Bob"))(
decided by namesBeingEqual)
}
case class MappingEquality[S, T](map: S => T) extends Equality[S] {
override def areEqual(s: S, b: Any): Boolean = b match {
case t: T => map(s) == t
case _ => false
}
}
}
I'm not sure I'd say this is a good idea since it doesn't exactly behave in the way one would expect anything called Equality to behave, but it works.
You can even get the beingMapped syntax you suggest by adding it to after via implicit conversion:
implicit class AfterExtensions(aft: TheAfterWord) {
def beingMapped[S, T](map: S => T): Equality[S] = MappingEquality(map)
}
}
I did try getting it work with after via the Uniformity trait, which has similar methods involving Any, but ran into problems because the normalization is the wrong way around: I can create a Uniformity[String] object from your example, but not a Uniformity[Person] one. (The reason is that there's a normalized method returning the generic type which is used to construct the Equality object, meaning that in order to compare strings with strings the left-side input must be a string.) This means that the only way to write it is with the expected vs actual values in the opposite order from normally:
"test should succeed" in {
val mappedToName = MappingUniformity[Person, String](person => person.name)
(List("Alice", "Eve") should contain theSameElementsAs people)(
after being mappedToName)
}
case class MappingUniformity[S, T](map: S => T) extends Uniformity[T] {
override def normalizedOrSame(b: Any): Any = b match {
case s: S => map(s)
case t: T => t
}
override def normalizedCanHandle(b: Any): Boolean =
b.isInstanceOf[S] || b.isInstanceOf[T]
override def normalized(s: T): T = s
}
Definitely not how you'd usually want to write this.
Use inspectors:
forAll (xs) { x => x should be < 3 }

How to write efficient type bounded code if the types are unrelated in Scala

I want to improve the following Cassandra related Scala code. I have two unrelated user defined types which are actually in Java source files (leaving out the details).
public class Blob { .. }
public class Meta { .. }
So here is how I use them currently from Scala:
private val blobMapper: Mapper[Blob] = mappingManager.mapper(classOf[Blob])
private val metaMapper: Mapper[Meta] = mappingManager.mapper(classOf[Meta])
def save(entity: Object) = {
entity match {
case blob: Blob => blobMapper.saveAsync(blob)
case meta: Meta => metaMapper.saveAsync(meta)
case _ => // exception
}
}
While this works, how can you avoid the following problems
repetition when adding new user defined type classes like Blob or Meta
pattern matching repetition when adding new methods like save
having Object as parameter type
You can definitely use Mapper as a typeclass, doing:
def save[A](entity: A)(implicit mapper: Mapper[A]) = mapper.saveAsync(entity)
Now you have a generic method able to perform a save operation on every type A for which a Mapper[A] is in scope.
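As a rough, driver-free sketch of the same idea: Saver below is a hypothetical stand-in for Mapper[A] that returns a description string instead of writing to Cassandra, and Blob and Meta are simplified to Scala case classes.

```scala
object TypeClassSketch {
  // Saver is a hypothetical stand-in for the driver's Mapper[A].
  trait Saver[A] { def saveAsync(entity: A): String }

  final case class Blob(id: Int)
  final case class Meta(name: String)

  // One implicit instance per entity type; a new type needs one new instance.
  implicit val blobSaver: Saver[Blob] = new Saver[Blob] {
    def saveAsync(entity: Blob): String = s"blob:${entity.id}"
  }
  implicit val metaSaver: Saver[Meta] = new Saver[Meta] {
    def saveAsync(entity: Meta): String = s"meta:${entity.name}"
  }

  // Compiles only when a Saver[A] is available, so Object is no longer needed.
  def save[A](entity: A)(implicit saver: Saver[A]): String = saver.saveAsync(entity)

  val savedBlob: String = save(Blob(1))
  val savedMeta: String = save(Meta("m"))
  // save("a string") would not compile: there is no Saver[String] in scope.
}
```

New methods follow the same shape as save, so the pattern match does not have to be repeated for each of them.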
Also, the mappingManager.mapper implementation could be probably improved to avoid classOf, but it's hard to tell from the question in the current state.
A few questions:
Is mappingManager.mapper(cls) expensive?
How much do you care about handling subclasses of Blob or Meta?
Can something like this work for you?
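def save[T: Manifest](entity: T) = {
// runtimeClass is a Class[_]; the cast restores the type parameter the mapper needs
val cls = manifest[T].runtimeClass.asInstanceOf[Class[T]]
mappingManager.mapper(cls).saveAsync(entity)
}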
If you do care about making sure that subclasses of Meta grab the proper mapper then you may find isAssignableFrom helpful in your .mapper (and store found sub-classes in a HashMap so you only have to look once).
EDIT: Then maybe you want something like this (ignoring threading concerns):
private[this] val mapperMap = mutable.HashMap[Class[_], Mapper[_]]()
def save[T: Manifest](entity: T) = {
val cls = manifest[T].runtimeClass
mapperMap.getOrElseUpdate(cls, mappingManager.mapper(cls))
.asInstanceOf[Mapper[T]]
.saveAsync(entity)
}
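The isAssignableFrom lookup mentioned earlier could be sketched as follows. The registry, the class names and the use of a plain String in place of a real Mapper are all assumptions made to keep the example self-contained, and like the snippet above it ignores threading concerns.

```scala
import scala.collection.mutable

object MapperLookupSketch {
  class Meta
  class SpecialMeta extends Meta
  class Blob

  // Hypothetical registry: each mapper is represented by a name for brevity.
  private val registered: List[(Class[_], String)] =
    List(classOf[Blob] -> "blobMapper", classOf[Meta] -> "metaMapper")

  // Cache of resolved classes so the linear scan runs once per concrete class.
  private val cache = mutable.HashMap.empty[Class[_], Option[String]]

  def mapperNameFor(entity: AnyRef): Option[String] = {
    val cls: Class[_] = entity.getClass
    cache.getOrElseUpdate(
      cls,
      // The first registered base class that the entity's class extends wins.
      registered.collectFirst { case (base, name) if base.isAssignableFrom(cls) => name }
    )
  }
}
```

With this, a SpecialMeta instance resolves to the Meta mapper, and an unregistered type yields None rather than an exception.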

Generic synchronisation design

We are building some sync functionality using two-way json requests and this algorithm. All good and we have it running in prototype mode. Now I am trying to genericise the code, as we will be synching for several tables in the app. It would be cool to be able to define a class as "extends Synchable" and get the additional attributes and sync processing methods with a few specialisations/overrides. I have got this far:
abstract class Synchable [T<:Synchable[T]] (val ruid: String, val lastSyncTime: String, val isDeleted:Int) {
def contentEquals(target: T): Boolean
def updateWith(target: T)
def insert
def selectSince(clientLastSyncTime: String): List[T]
def findByRuid(ruid: String): Option[T]
implicit val validator: Reads[T]
def process(clientLastSyncTime: String, updateRowList: List[JsObject]) = {
for (syncRow <- updateRowList) {
val validatedSyncRow = syncRow.validate[T]
validatedSyncRow.fold(
valid = { result => // valid row
findByRuid(result.ruid) match { //- do we know about it?
case Some(knownRow) => knownRow.updateWith(result)
case None => result.insert
}
}... invalid, etc
I am new to Scala and know I am probably missing things - WIP!
Any pointers or suggestions on this approach would be much appreciated.
Some quick ones:
Those _ parameters you pass in and then immediately assign to vals: why not do it in one hit? e.g.
abstract class Synchable( val ruid: String = "", val lastSyncTime: String = "", val isDeleted: Int = 0) {
which saves you a line and is clearer in intent as well I think.
I'm not sure about your defaulting of Strings to "" - unless there's a good reason (and there often is), I think using something like ruid:Option[String] = None is more explicit and lets you do all sorts of nice monad-y things like fold, map, flatMap etc.
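A tiny sketch of what the Option version buys you. Row, describe and normalizedRuid are names invented for this example, not part of the question's code.

```scala
object OptionSketch {
  // ruid is optional instead of defaulting to the empty string.
  final case class Row(ruid: Option[String] = None, isDeleted: Int = 0)

  // fold covers both cases in one expression: no ruid vs. a known ruid.
  def describe(row: Row): String =
    row.ruid.fold("new row")(id => s"row $id")

  // map transforms the value only when it is present.
  def normalizedRuid(row: Row): Option[String] = row.ruid.map(_.trim.toLowerCase)
}
```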
Looking pretty cool otherwise - the only other thing you might want to do is strengthen the typing with a bit of this.type magic so you'll prevent incorrect usage at compile-time. With your current abstract class, nothing prevents me from doing:
class SynchableCat extends Synchable { ... }
class SynchableDog extends Synchable { ... }
val cat = new SynchableCat
val dog = new SynchableDog
cat.updateWith(dog) // This won't end well
But if you just change your abstract method signatures to things like this:
def updateWith(target: this.type)
Then the change ripples down through the subclasses, narrowing down the types, and the compiler will emit a (relatively clear) error if I try the above update operation.
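One caveat worth checking before adopting this: this.type is the singleton type of a single instance, so with target: this.type even updating one cat from another cat would be rejected. The F-bounded parameter already present in the question's Synchable[T <: Synchable[T]] narrows to the subclass instead. A minimal sketch, with the class reduced to one field and updateWith returning a String purely so the effect is visible:

```scala
object SynchableSketch {
  // F-bounded type parameter, as in the question: T is the concrete subclass itself.
  abstract class Synchable[T <: Synchable[T]](val ruid: String) {
    def updateWith(target: T): String
  }

  class SynchableCat(id: String) extends Synchable[SynchableCat](id) {
    def updateWith(target: SynchableCat): String = s"cat $ruid <- cat ${target.ruid}"
  }
  class SynchableDog(id: String) extends Synchable[SynchableDog](id) {
    def updateWith(target: SynchableDog): String = s"dog $ruid <- dog ${target.ruid}"
  }

  val cat = new SynchableCat("c1")
  val otherCat = new SynchableCat("c2")
  val dog = new SynchableDog("d1")

  val updated: String = cat.updateWith(otherCat) // same subclass: compiles
  // cat.updateWith(dog)                          // different subclass: type mismatch
}
```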