Keras callback for the Model class API

I'd like to compute the AUROC (Area Under the ROC Curve) for every epoch and store the values for later display. My model uses the Model (functional) API.
Is it possible to write a snippet similar to the one below, which uses callbacks with the Sequential API, given that self.model, self.validation_data and self.model.predict are not usable in the same way with the Model API?
import keras
from sklearn.metrics import roc_auc_score

class Histories(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.aucs = []
        self.losses = []

    def on_train_end(self, logs={}):
        return

    def on_epoch_begin(self, epoch, logs={}):
        return

    def on_epoch_end(self, epoch, logs={}):
        self.losses.append(logs.get('loss'))
        y_pred = self.model.predict(self.validation_data[0])
        self.aucs.append(roc_auc_score(self.validation_data[1], y_pred))
        return

    def on_batch_begin(self, batch, logs={}):
        return

    def on_batch_end(self, batch, logs={}):
        return
Here is my main:
histories = my_callbacks.Histories()

history = model.fit_generator(
    train_gen.generate(labels, partition['train']),
    steps_per_epoch=len(partition['train']) // args.batch_size,
    epochs=args.nb_epochs,
    verbose=1,
    callbacks=[histories],
    validation_data=test_gen.generate(labels, partition['test']),
    validation_steps=len(partition['test']) // args.batch_size,
    use_multiprocessing=True)
Will the AUC be computed for every batch generated by test_gen (the test data generator)? What should I do to compute the AUC on each test batch and then aggregate the values at the end of each epoch?
Thank you for your help.

Related

Unit Testing IO Scala

I'm starting unit testing in Scala using ScalaTest.
The method I'm testing is as follows:
def readAlpha: IO[Float] = IO {
  val alpha = scala.io.StdIn.readFloat()
  alpha
}
The test is meant to check that the float the user enters is limited to two decimal places.
Here is what I've tried, but it doesn't seem to work. How could I fix it?
"alpha" should " have two decimal numbers after comma" in {
val alpha = readAlpha
//assert(alpha == (f"$alpha%.2f"))
}
I can't say exactly what you are doing wrong, but you cannot compare an effect type with a pure type, and it's not really possible to write a direct test against console input. First you should mock the readAlpha method somehow, then evaluate the value, and only after that can you compare it. Here is a small demonstration:
import cats.effect.IO

class ConsoleIO {
  def readAlpha: IO[Float] = IO {
    val alpha = scala.io.StdIn.readFloat()
    alpha
  }
}

// ---------------------------------

import cats.effect.{ContextShift, IO, Timer}
import org.scalatest.funsuite.AsyncFunSuite

import scala.concurrent.ExecutionContext

class ConsoleIOTest extends AsyncFunSuite {

  class MockConsole extends ConsoleIO {
    override def readAlpha: IO[Float] = IO.pure(2.23f)
  }

  implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
  implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)

  test("alpha should have two decimal numbers after comma") {
    val consoleIO = new MockConsole
    val alpha = consoleIO.readAlpha.unsafeRunSync()
    assert(alpha.toString === f"$alpha%.2f")
  }
}
Here unsafeRunSync() produces the result by impurely running the encapsulated effect.
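If the goal is literally to assert "at most two decimal places", note that comparing against a formatted string only passes for values that already print with exactly two decimals. A minimal alternative assertion, sketched here as another test case inside ConsoleIOTest and reusing the MockConsole above, inspects the scale of the value instead:

test("alpha has at most two decimals") {
  val consoleIO = new MockConsole
  val alpha = consoleIO.readAlpha.unsafeRunSync()
  // BigDecimal.scale counts the digits to the right of the decimal point
  assert(BigDecimal(alpha.toString).scale <= 2)
}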

Implement simple architecture using Akka Graphs

I’m attempting to set up a simple graph structure that processes data by invoking REST services and forwards the result of each service to an intermediary processing unit before forwarding the combined result. Here is a high-level architecture (diagram not shown).
Can this be defined using Akka Stream graphs? Reading https://doc.akka.io/docs/akka/current/stream/stream-graphs.html, I don't understand how to implement even this simple architecture.
I've tried to implement custom code to execute functions within a graph:
package com.graph

class RestG {
  def flow(in: String): String = {
    in + "extra"
  }
}

object RestG {
  case class Flow(in: String) {
    def out: String = in + "out"
  }

  def main(args: Array[String]): Unit = {
    List(new RestG().flow("test"), new RestG().flow("test2")).foreach(println)
  }
}
I'm unsure how to send data between the functions. I think I should be using Akka graphs, but how do I implement the architecture above?
Here's how I would approach the problem. First some types:
type Data = Int
type RestService1Response = String
type RestService2Response = String
type DisplayedResult = Boolean
Then stub functions to asynchronously call the external services:
def callRestService1(data: Data): Future[RestService1Response] = ???
def callRestService2(data: Data): Future[RestService2Response] = ???
def resultCombiner(resp1: RestService1Response, resp2: RestService2Response): DisplayedResult = ???
Now for the Akka Streams (I'm leaving out setting up an ActorSystem etc.)
import akka.Done
import akka.stream.FlowShape
import akka.stream.scaladsl._

import scala.concurrent.Future

type SourceMatVal = Any
val dataSource: Source[Data, SourceMatVal] = ???

def restServiceFlow[Response](callF: Data => Future[Response], maxInflight: Int) =
  Flow[Data].mapAsync(maxInflight)(callF)

// NB: since we're fanning out, there's no reason to have different maxInflights here...
val service1 = restServiceFlow(callRestService1, 4)
val service2 = restServiceFlow(callRestService2, 4)

val downstream = Flow[(RestService1Response, RestService2Response)]
  .map((resultCombiner _).tupled)

val splitAndCombine = GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._

  val fanOut = b.add(Broadcast[Data](2))
  val fanIn = b.add(Zip[RestService1Response, RestService2Response])

  fanOut.out(0).via(service1) ~> fanIn.in0
  fanOut.out(1).via(service2) ~> fanIn.in1

  FlowShape(fanOut.in, fanIn.out)
}

// This future will complete with a `Done` if/when the stream completes
val future: Future[Done] = dataSource
  .via(splitAndCombine)
  .via(downstream)
  .runForeach { displayableData =>
    ??? // Display the data
  }
It's possible to do all the wiring within the Graph DSL, but I generally prefer to keep my graph stages as simple as possible and only use them to the extent that the standard methods on Source/Flow/Sink can't do what I want.
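For this particular shape it is not strictly necessary to drop into the Graph DSL at all: since both service calls are independent per element, the fan-out/fan-in can also be expressed with a single mapAsync that combines the two futures. A rough sketch, reusing the types and stub functions defined above:

import scala.concurrent.ExecutionContext.Implicits.global

// Per element, start both service calls concurrently and combine their responses
val splitAndCombineAlt: Flow[Data, DisplayedResult, _] =
  Flow[Data].mapAsync(4) { data =>
    val resp1F = callRestService1(data) // both futures are started before the for-comprehension
    val resp2F = callRestService2(data)
    for {
      resp1 <- resp1F
      resp2 <- resp2F
    } yield resultCombiner(resp1, resp2)
  }

Whether this is preferable to the Broadcast/Zip graph is mostly a style choice; the graph version keeps the two service flows as separately reusable pieces.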

Akka streams: dealing with futures within graph stage

Within an Akka Stream stage FlowShape[A, B], part of the processing I need to do on the As is to save to / query a datastore with a query built from the A's data. But that datastore driver returns a Future, and I am not sure how best to deal with it (my main question here).
case class Obj(a: String, b: Int, c: String)
case class Foo(myobject: Obj, name: String)
case class Bar(st: String)

class SaveAndGetId extends GraphStage[FlowShape[Foo, Bar]] {
  val in = Inlet[Foo]("SaveAndGetId.in")
  val out = Outlet[Bar]("SaveAndGetId.out")
  override val shape = FlowShape(in, out)

  val dao = new DbDao // some dao with an async driver

  override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) {
    setHandlers(in, out, new InHandler with OutHandler {
      override def onPush(): Unit = {
        val foo = grab(in)
        val result: Future[String] = dao.saveAndGetRecord(foo.myobject) // saves and returns the id as a string

        // the naive approach
        val record = Await.result(result, Duration.Inf)
        push(out, Bar(record)) // *** tests pass every time

        // mapping the future approach
        result.map { x =>
          push(out, Bar(x))
        } // *** tests fail every time
      }

      override def onPull(): Unit = pull(in)
    })
  }
}
The next stage depends on the id of the db record returned from the query, but I want to avoid Await. I am not sure why the mapping approach fails:
"it should work" in {
val source = Source.single(Foo(Obj("hello", 1, "world")))
val probe = source
.via(new SaveAndGetId))
.runWith(TestSink.probe)
probe
.request(1)
.expectBarwithId("one")//say we know this will be
.expectComplete()
}
private implicit class RichTestProbe(probe: Probe[Bar]) {
def expectBarwithId(expected: String): Probe[Bar] =
probe.expectNextChainingPF{
case r # Bar(str) if str == expected => r
}
}
When run with the future-mapping approach, I get this failure:
should work ***FAILED***
java.lang.AssertionError: assertion failed: expected: message matching partial function but got unexpected message OnComplete
at scala.Predef$.assert(Predef.scala:170)
at akka.testkit.TestKitBase$class.expectMsgPF(TestKit.scala:406)
at akka.testkit.TestKit.expectMsgPF(TestKit.scala:814)
at akka.stream.testkit.TestSubscriber$ManualProbe.expectEventPF(StreamTestKit.scala:570)
The async side channels example in the docs has the future in the constructor of the stage, as opposed to building the future within the stage, so it doesn't seem to apply to my case.
I agree with Ramon. Constructing a new FlowShape is not necessary in this case, and it is too complicated. The mapping approach fails because push is called from the Future's callback thread, outside the stage's own handlers; GraphStage callbacks must not be invoked from other threads, and by the time the Future completes the single-element stream has already finished, which is why the probe sees OnComplete instead of an element. It is much more convenient to use the mapAsync method here:
Here is a code snippet to utilize mapAsync:
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object MapAsyncExample {

  val parallelism: Int = 10 // maximum number of concurrent asyncSquare calls

  def main(args: Array[String]): Unit = {
    // An ActorSystem is needed to materialize the stream (it provides the materializer in Akka 2.6+)
    implicit val system: ActorSystem = ActorSystem("MapAsyncExample")

    Source.repeat(5)
      .mapAsync(parallelism)(x => asyncSquare(x))
      .runWith(Sink.foreach(println)) // print each result in the next stage
  }

  // This method returns a Future
  // You can replace this part with your database operations
  def asyncSquare(value: Int): Future[Int] = Future {
    value * value
  }
}
In the snippet above, Source.repeat(5) is a dummy source that emits 5 indefinitely. The sample function asyncSquare takes an integer and calculates its square inside a Future. The .mapAsync(parallelism)(x => asyncSquare(x)) line applies that function and emits the output of each Future to the next stage. In this snippet, the next stage is a sink that prints every item.
parallelism is the maximum number of asyncSquare calls that can run concurrently.
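As an aside, mapAsync preserves the upstream order, so a slow Future can hold back results that completed earlier. If the downstream does not care about ordering, mapAsyncUnordered emits each result as soon as its Future completes; the stream in main could just as well be written as:

Source.repeat(5)
  .mapAsyncUnordered(parallelism)(asyncSquare)
  .runWith(Sink.foreach(println))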
I think your GraphStage is unnecessarily overcomplicated. The Flow below performs the same actions without the need to write a custom stage:
val dao = new DbDao
val parallelism = 10 // number of parallel db queries

val saveAndGetId: Flow[Foo, Bar, _] =
  Flow[Foo]
    .map(_.myobject)
    .mapAsync(parallelism)(rec => dao.saveAndGetRecord(rec))
    .map(Bar.apply)
I generally try to treat GraphStage as a last resort, there is almost always an idiomatic way of getting the same Flow by using the methods provided by the akka-stream library.
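For completeness, here is roughly how that Flow slots into the original test in place of the custom stage, assuming the same imports and implicit ActorSystem/materializer as in the question's test:

val probe = Source.single(Foo(Obj("hello", 1, "world")))
  .via(saveAndGetId)
  .runWith(TestSink.probe)

probe
  .request(1)
  .expectBarwithId("one")
  .expectComplete()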

Race condition when generating data with a GenericInputFormat

I'm trying Flink and wrote the following example program:
import org.apache.flink.api.common.io.GenericInputFormat
import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala._

object IFJob {

  @SerialVersionUID(1L)
  final class StringInputFormat extends GenericInputFormat[String] {
    val N = 100
    var i = 0L

    override def reachedEnd(): Boolean = this.synchronized {
      i >= N
    }

    override def nextRecord(ot: String): String = this.synchronized {
      i += 1
      (i % 2) + ""
    }
  }

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val text: DataSet[String] = env.createInput(new StringInputFormat())

    val map = text.map {
      (_, 1)
    }
    // map.print()

    val by = map.groupBy(0)
    val aggregate: AggregateDataSet[(String, Int)] = by.aggregate(Aggregations.SUM, 1)
    aggregate.print()
  }
}
I am creating a StringInputFormat once and reading from it in parallel (with the default parallelism of 8).
When I run the above program, the results vary between executions, i.e., they are not deterministic. The counts come out as anywhere between 1 and 8 times the 50 records a single split should produce.
For example I get the following results:
// first run
(0,150)
(1,150)
// second run
(0,50)
(1,50)
// third run
(0,200)
(1,200)
The expected result would be
(0,400)
(1,400)
because the StringInputFormat should generate 50 "0" and 50 "1" records in each of the 8 parallel instances.
I even added synchronization to the input format, but it didn't help.
What am I missing in the Flink computation model?
The behavior you observe is the result of how Flink assigns work to an InputFormat. This works as follows:
1. On the master (JobManager), the createInputSplits() method is called, which returns an array of InputSplit. An InputSplit is a chunk of data to read (or generate). The GenericInputFormat creates one InputSplit for each parallel task instance. In your case, it creates 8 InputSplit objects, and each InputSplit should generate 50 "1" and 50 "0" records.
2. The parallel instances of a DataSourceTask are started on the workers (TaskManagers). Each DataSourceTask has its own instance of the InputFormat.
3. Once started, a DataSourceTask requests an InputSplit from the master and calls the open() method of its InputFormat with that InputSplit. When the InputFormat has finished processing the InputSplit, the DataSourceTask requests a new one from the master.
In your case, each InputSplit is processed very quickly. Hence, there is a race between DataSourceTasks requesting InputSplits for their InputFormats, and some InputFormats process more than one InputSplit. Since an InputFormat does not reset its internal state (i.e., set i = 0) when it opens a new InputSplit, it only generates data for the first InputSplit it processes.
You can fix this by adding this method to the StringInputFormat:
override def open(split: GenericInputSplit): Unit = {
  super.open(split)
  i = 0
}
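If the input format only exists to generate test data, a possibly simpler alternative (just a sketch, assuming the org.apache.flink.api.scala._ import shown above) is to avoid per-instance mutable state altogether and derive the records from a deterministic sequence instead of env.createInput:

// 800 records in total: 400 "0"s and 400 "1"s, independent of the parallelism
val text: DataSet[String] = env.generateSequence(1, 800).map(i => (i % 2).toString)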

Multiple types for a variable in Spark using Scala

I am working on a Spark project with Scala. I want to train a model, which can be k-means, a Gaussian mixture, logistic regression, naive Bayes, etc., but I cannot define a generic model as a return type, since these algorithms' result types are different (GaussianMixtureModel, KMeansModel, etc.). I cannot find any clean way to return the trained model.
Here is a piece of code from the project:
model.model_algorithm match {
  case "k_means" =>
    val model_k_means = k_means(data, parameters)
  case "gaussian_mixture" =>
    val model_gaussian_mixture = gaussian_mixture(data, parameters)
  case "logistic_regression" =>
    val model_logistic_regression = logistic_regression(data, parameters)
}
So is there a way to return this trained model, or to define a generic model type that covers all of them?
You can create a common interface to wrap all your internal logic of training and predicting, and just expose that simple interface to be reused.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

trait AlgorithmInterface extends Serializable {
  def train(data: RDD[LabeledPoint]): Unit
  def predict(record: Vector): Double
}
And have the algorithms implemented in classes like:
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}

class LogisticRegressionAlgorithm extends AlgorithmInterface {
  var model: LogisticRegressionModel = null

  override def train(data: RDD[LabeledPoint]): Unit = {
    model = new LogisticRegressionWithLBFGS()
      .setNumClasses(10)
      .run(data)
  }

  override def predict(record: Vector): Double = model.predict(record)
}

class GaussianMixtureAlgorithm extends AlgorithmInterface {
  var model: GaussianMixtureModel = null

  override def train(data: RDD[LabeledPoint]): Unit = {
    model = new GaussianMixture().setK(2).run(data.map(_.features))
  }

  override def predict(record: Vector): Double = model.predict(record)
}
Implementing it
// Assigning the models to an Array[AlgorithmInterface]
val models: Array[AlgorithmInterface] = Array(
  new LogisticRegressionAlgorithm(),
  new GaussianMixtureAlgorithm()
)

// Training the models using the interface's train function
models.foreach(_.train(data))

// Predicting the value
models.foreach(model => println(model.predict(vectorData)))
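To tie this back to the match statement in the question: once every algorithm hides behind AlgorithmInterface, the match can simply select an implementation and return a single common type. A rough sketch, reusing model.model_algorithm and the data value from the question:

val algorithm: AlgorithmInterface = model.model_algorithm match {
  case "logistic_regression" => new LogisticRegressionAlgorithm()
  case "gaussian_mixture"    => new GaussianMixtureAlgorithm()
  case other                 => throw new IllegalArgumentException(s"Unknown algorithm: $other")
}

algorithm.train(data)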