Would you please guide me what the philosophy of Callbacks is? In other words, how can I define them for each unique problem?
I assume you are taking about keras. So callbacks are used to execute code during training. Everything is explained quite well here
Here an example on how to log loss during training:
class LossHistory(Callback):
def on_train_begin(self, logs={}):
self.losses = []
def on_batch_end(self, batch, logs={}):
self.losses.append(logs.get('loss'))
print logs.get('loss')
model.fit(data, data, epochs=50, batch_size=72, validation_data=(data, data), verbose=0, shuffle=False,callbacks=[LossHistory()])
Related
Here's what I'm trying to accomplish: I have a Chisel accelerator which calls another Chisel accelerator and passes in a value. I want the second one to have a while loop in it where the condition is partially based on the input value. Here's some sample code:
class Module1 extends Module {
val in = 0.U
val Module2Module = Module2()
Module2Module.io.in := in
}
class Module2 extends Module {
val io = IO(new Bundle {
val in = Input(Reg(UInt))
}
val test = 0.U
while (test < io.in) {
}
}
I'm getting the error that "test < io.in" is a chisel.Bool, not a Boolean. I know that I can't convert that to Scala types, right?
What is the proper way to implement this? Is it by having signals sent to/from Module1 to Module2 to indicate that the accelerator isn't done yet and to only proceed when it is? If so, wouldn't this get complex quickly, if you have several functions, each in different modules?
You will need to use registers, created by the Reg family of constructors and control the flow with when, elsewhen, and otherwise. I think a good example for you is in 2.6_testers2.ipynb of chisel bootcamp. The GCD circuit is equivalent to a while loop. The circuit continues until the y register is decremented to zero. Each clock cycle corresponds to a single iteration of a software while loop. The circuit uses the ready and valid fields of the Decoupled inputs and outputs to coordinate ingesting new data and reporting when a GCD value has been computed. Take a look at this example and see if you have more questions.
Just to elaborate on why you can't use a while loop with hardware values like chisel3.Bool, you can think about a chisel3 design as a Scala program that constructs a hardware design as it executes. When chisel3 runs, it is just running a program who's output is your circuit (ultimately emitted as Verilog). while is a Scala construct so it's only available during the execution of the program, it doesn't exist in the actual hardware. There's a similar question and answer about for loops on the chisel-users mailing list.
Now to answer your question, as Chick mentioned you can use the chisel3 constructs when, .elsewhen, and .otherwise to handle control flow in the actual hardware:
class Module2 extends Module {
val io = IO(new Bundle {
val in = Input(Reg(UInt))
}
val test = 0.U
when (test < io.in) {
// Logic for that applies when (or while) the condition is true
} .otherwise {
// Logic that applies when it isn't
}
}
Also as Chick mentioned, you'll likely need some state (using Regs) since you may need to do things over multiple clock cycles. It's hard to advise beyond this simple example without more info, but please expand on your question or ask more questions if you need more help.
If so, wouldn't this get complex quickly, if you have several functions, each in different modules?
I'm not sure how to answer this bit without more context, but the whole purpose of Chisel is to make it easier to create abstractions that allow you to handle complexity. Chisel enables software engineering when designing hardware.
I am writing a class that takes a Flow (representing a kind of socket) as a constructor argument and that allows to send messages and wait for the respective answers asynchronously by returning a Future. Example:
class SocketAdapter(underlyingSocket: Flow[String, String, _]) {
def sendMessage(msg: MessageType): Future[ResponseType]
}
This is not necessarily trivial because there may be other messages in the socket stream that are irrelevant, so some filtering is required.
In order to test the class I need to provide something like a "TestFlow" analogous to TestSink and TestSource. In fact I can create a flow by combining both. However, the problem is that I only obtain the actual probes upon materialization and materialization happens inside the class under test.
The problem is similar to the one I described in this question. My problem would be solved if I could materialize the flow first and then pass it to a client to connect to it. Again, I'm thinking about using MergeHub and BroadcastHub and again I see the problem that the resulting stream would behave differently because it is not linear anymore.
Maybe I misunderstood how a Flow is supposed to be used. In order to feed messages into the flow when sendMessage() is called, I need a certain kind of Source anyway. Maybe a Source.actorRef(...) or Source.queue(...), so I could pass in the ActorRef or SourceQueue directly. However, I'd prefer if this choice was up to the SocketAdapter class. Of course, this applies to the Sink as well.
It feels like this is a rather common case when working with streams and sockets. If it is not possible to create a "TestFlow" like I need it, I'm also happy with some advice on how to improve my design and make it better testable.
Update: I browsed through the documentation and found SourceRef and SinkRef. It looks like these could solve my problem but I'm not sure yet. Is it reasonable to use them in my case or are there any drawbacks, e.g. different behaviour in the test compared to production where there are no such refs?
Indirect Answer
The nature of your question suggests a design flaw which you are bumping into at testing time. The answer below does not address the issue in your question, but it demonstrates how to avoid the situation altogether.
Don't Mix Business Logic with Akka Code
Presumably you need to test your Flow because you have mixed a substantial amount of logic into the materialization. Lets assume you are using raw sockets for your IO. Your question suggests that your flow looks like:
val socketFlow : Flow[String, String, _] = {
val socket = new Socket(...)
//business logic for IO
}
You need a complicated test framework for your Flow because your Flow itself is also complicated.
Instead, you should separate out the logic into an independent function that has no akka dependencies:
type MessageProcessor = MessageType => ResponseType
object BusinessLogic {
val createMessageProcessor : (Socket) => MessageProcessor = {
//business logic for IO
}
}
Now your flow can be very simple:
val socket : Socket = new Socket(...)
val socketFlow = Flow.map(BusinessLogic.createMessageProcessor(socket))
As a result: your unit testing can exclusively work with createMessageProcessor, there's no need to test akka Flow because it is a simple veneer around the complicated logic that is tested independently.
Don't Use Streams For Concurrency Around 1 Element
The other big problem with your design is that SocketAdapter is using a stream to process just 1 message at a time. This is incredibly wasteful and unnecessary (you're trying to kill a mosquito with a tank).
Given the separated business logic your adapter becomes much simpler and independent of akka:
class SocketAdapter(messageProcessor : MessageProcessor) {
def sendMessage(msg: MessageType): Future[ResponseType] = Future {
messageProcessor(msg)
}
}
Note how easy it is to use Future in some instances and Flow in other scenarios depending on the need. This comes from the fact that the business logic is independent of any concurrency framework.
This is what I came up with using SinkRef and SourceRef:
object TestFlow {
def withProbes[In, Out](implicit actorSystem: ActorSystem,
actorMaterializer: ActorMaterializer)
:(Flow[In, Out, _], TestSubscriber.Probe[In], TestPublisher.Probe[Out]) = {
val f = Flow.fromSinkAndSourceMat(TestSink.probe[In], TestSource.probe[Out])
(Keep.both)
val ((sinkRefFuture, (inProbe, outProbe)), sourceRefFuture) =
StreamRefs.sinkRef[In]()
.viaMat(f)(Keep.both)
.toMat(StreamRefs.sourceRef[Out]())(Keep.both)
.run()
val sinkRef = Await.result(sinkRefFuture, 3.seconds)
val sourceRef = Await.result(sourceRefFuture, 3.seconds)
(Flow.fromSinkAndSource(sinkRef, sourceRef), inProbe, outProbe)
}
}
This gives me a flow I can completely control with the two probes but I can pass it to a client that connects source and sink later, so it seems to solve my problem.
The resulting Flow should only be used once, so it differs from a regular Flow that is rather a flow blueprint and can be materialized several times. However, this restriction applies to the web socket flow I am mocking anyway, as described here.
The only issue I still have is that some warnings are logged when the ActorSystem terminates after the test. This seems to be due to the indirection introduced by the SinkRef and SourceRef.
Update: I found a better solution without SinkRef and SourceRef by using mapMaterializedValue():
def withProbesFuture[In, Out](implicit actorSystem: ActorSystem,
ec: ExecutionContext)
: (Flow[In, Out, _],
Future[(TestSubscriber.Probe[In], TestPublisher.Probe[Out])]) = {
val (sinkPromise, sourcePromise) =
(Promise[TestSubscriber.Probe[In]], Promise[TestPublisher.Probe[Out]])
val flow =
Flow
.fromSinkAndSourceMat(TestSink.probe[In], TestSource.probe[Out])(Keep.both)
.mapMaterializedValue { case (inProbe, outProbe) =>
sinkPromise.success(inProbe)
sourcePromise.success(outProbe)
()
}
val probeTupleFuture = sinkPromise.future
.flatMap(sink => sourcePromise.future.map(source => (sink, source)))
(flow, probeTupleFuture)
}
When the class under test materializes the flow, the Future is completed and I receive the test probes.
I built a pipeline including a DecisionTreeClassifier(dt) like this
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
Then I used this pipeline as the estimator in a CrossValidator in order to get a model with the best set of hyperparameters like this
val c_v = new CrossValidator().setEstimator(pipeline).setEvaluator(new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")).setEstimatorParamMaps(paramGrid).setNumFolds(5)
Finally, I could train a model on a training test with this crossvalidator
val model = c_v.fit(train)
But the question is, I want to view the best trained decision tree model with the parameter .toDebugTree of DecisionTreeClassificationModel. But model is a CrossValidatorModel. Yes, you can use model.bestModel, but it is still of type Model, you cannot apply .toDebugTree to it. And also I assume the bestModel is still a pipline including labelIndexer, featureIndexer, dt, labelConverter.
So does anyone know how I can obtain the decisionTree model from the model fitted by the crossvalidator, which I could view the actual model by toDebugString? Or is there any workaround that I can view the decisionTree model?
Well, in cases like this one the answer is always the same - be specific about the types.
First extract the pipeline model, since what you are trying to train is a Pipeline :
import org.apache.spark.ml.PipelineModel
val bestModel: Option[PipelineModel] = model.bestModel match {
case p: PipelineModel => Some(p)
case _ => None
}
Then you'll need to extract the model from the underlying stage. In your case it's a decision tree classification model :
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
val treeModel: Option[DecisionTreeClassificationModel] = bestModel
flatMap {
_.stages.collect {
case t: DecisionTreeClassificationModel => t
}.headOption
}
To print the tree, for example :
treeModel.foreach(_.toDebugString)
(DISCLAIMER: There is another aspect, which imho deserves its own answer. I know it is a little OT given the question, however, it questions the question. If somebody down votes because he disagrees with the content please also leave a comment)
Should you extract the "best" tree and the answer is typically no.
Why are we doing CV? We are trying to evaluate our choices, to get. The choices are the classifiers used, hyper parameter used, preprocessing like feature selection. For the last one it is important that this happens on the training data. E.g., do not normalise the features on all data. So the output of CV is the pipeline generated. On a side note: the feature selection should evaluated on a "internal cv"
What we are not doing, we are not generating a "pool of classifiers" where we choose the best classifier. However, i've seen this surprisingly often. The problem is that you have an extremely high chance of a twining-effect. Even in a perfectly Iid dataset there are likely (near)duplicated training examples. There is a pretty good chance that the "best" CV classifier is just an indication in which fold you have the best twining.
Hence, what should you do? Once, you have fixed your parameters you should use the entire training data to build the final model. Hopefully, but nobody does this, you have set aside an additional evaluation set, which you have never touched in the process to get an evaluation of your final model.
I've got two streams and I want to be able to consume only one based on a computation that I run every x seconds.
I think I basically need to create a third tick stream - something like every(3.seconds) - that does the computation and then come up with sort of a switch between the other two.
I'm kind of stuck here (and I've only just started fooling around with scalaz-stream).
Thanks!
There are several ways we can approach this problem. One way to approach it is using awakeEvery. For concrete example, see here.
To describe the example briefly, consider that we would like to query twitter in every 5 sec and get the tweets and perform sentiment analysis. We can compose this pipeline as follows:
val source =
awakeEvery(5 seconds) |> buildTwitterQuery(query) through queryChannel flatMap {
Process emitAll _
}
Note that the queryChannel can be stated as follows.
def statusTask(query: Query): Task[List[Status]] = Task {
twitterClient.search(query).getTweets.toList
}
val queryChannel: Channel[Task, Query, List[Status]] = channel lift statusTask
Let me know if you have any question. As stated earlier, for the complete example, see this.
I hope it helps!
Is it possible to mock a RDD without using sparkContext?
I want to unit test the following utility function:
def myUtilityFunction(data1: org.apache.spark.rdd.RDD[myClass1], data2: org.apache.spark.rdd.RDD[myClass2]): org.apache.spark.rdd.RDD[myClass1] = {...}
So I need to pass data1 and data2 to myUtilityFunction. How can I create a data1 from a mock org.apache.spark.rdd.RDD[myClass1], instead of create a real RDD from SparkContext? Thank you!
RDDs are pretty complex, mocking them is probably not the best way to go about creating test data. Instead I'd recommend using sc.parallelize with your data. I'm also (somewhat biased) think that https://github.com/holdenk/spark-testing-base can help by providing a trait to setup & teardown the Spark Context for your tests.
I totally agree with #Holden on that!
Mocking RDDS is difficult; executing your unit tests in a local
Spark context is preferred, as recommended in the programming guide.
I know this may not technically be a unit test, but it is hopefully close
enough.
Unit Testing
Spark is friendly to unit testing with any popular unit test framework.
Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test frameworkâs tearDown method, as Spark does not support two contexts running concurrently in the same program.
But if you are really interested and you still want to try mocking RDDs, I'll suggest that you read the ImplicitSuite test code.
The only reason they are pseudo-mocking the RDD is to test if implict works well with the compiler, but they don't actually need a real RDD.
def mockRDD[T]: org.apache.spark.rdd.RDD[T] = null
And it's not even a real mock. It just creates a null object of type RDD[T]