Stateful Spark Structured Streaming: timeout is not getting triggered - Scala

I've set the timeout duration to "2 minutes" as follows:
def updateAcrossEvents(tuple3: Tuple3[String, String, String], inputs: Iterator[R00tJsonObject],
                       oldState: GroupState[MyState]): OutputRow = {
  println("$$$$ Inside updateAcrossEvents with : " + tuple3._1 + ", " + tuple3._2 + ", " + tuple3._3)
  var state: MyState = if (oldState.exists) oldState.get else MyState(tuple3._1, tuple3._2, tuple3._3)

  if (oldState.hasTimedOut) {
    println("##### oldState has timed out ####")
    // Logic to write the OutputRow
    OutputRow("some values here...")
  } else {
    for (input <- inputs) {
      state = updateWithEvent(state, input)
      oldState.update(state)
      oldState.setTimeoutDuration("2 minutes")
    }
    OutputRow(null, null, null)
  }
}
I have also specified ProcessingTimeTimeout in 'mapGroupsWithState' as follows...
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)
But 'hasTimedOut' is never true so I don't get any output! What am I doing wrong?

It seems timeouts only fire while input data keeps flowing. I had stopped the input job because I had enough data, but apparently timeouts are only evaluated while data is continuously fed. I'm not sure why it's designed that way; it makes unit/integration tests a bit harder to write, but I'm sure there's a reason. Thanks.
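For reference, here is a minimal sketch of how the surrounding query is typically wired up (assuming a Kafka source, import spark.implicits._, and the question's R00tJsonObject/MyState/OutputRow types; parseJson and the grouping fields are placeholders, not part of the original code). Timeouts are evaluated as part of micro-batch processing, which matches the observation above that they only fire while batches keep being processed:

import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode, Trigger}

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder
  .option("subscribe", "events")                  // placeholder
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]
  .map(parseJson) // hypothetical parser: String => R00tJsonObject

val updated = events
  .groupByKey(e => (e.key1, e.key2, e.key3)) // placeholder grouping fields
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)

updated.writeStream
  .outputMode(OutputMode.Update()) // mapGroupsWithState supports Update output mode
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()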

Related

Crafting the body for a request does not work concurrently

I would like to send simultaneous requests through Gatling for some duration.
Below is the snippet of my code where I craft the requests.
The jsonFileContents function is used for crafting the JSON; it is used in the main request.
TestDevice_dev.csv has a list of devices up to TestDevice30; after 30 the list is reused:
TestDevice1
TestDevice2
TestDevice3
.
.
.
val dFeeder = csv("TestDevice_dev.csv").circular

val trip_dte_tunnel_1 = scenario("TripSimulation")
  .feed(dFeeder)
  .exec(session => {
    val key = conf.getString("config.env.sign_key")
    val bodyTrip = CannedRequests.jsonFileContents("${deviceID}") // deviceID comes from the feeder
    // Session is immutable, so chain the set calls and return the resulting session
    session
      .set("trip_sign", SignatureGeneration.getSignature(key, bodyTrip))
      .set("tripBody", bodyTrip)
  })
  .exec(http("trip")
    .post(trip_url)
    .headers(trip_Headers_withsign)
    .body(StringBody("${tripBody}")).asJSON
    .check(status.is(201)))
  .exec(flushSessionCookies)
The scenario is started as below:
val scn_trip = scenario("trip simulation")
  .repeat(1) {
    exec(DataExchange.trip_dte_tunnel_1)
  }

setUp(scn_trip.inject(constantUsersPerSec(5) during (5 seconds)))
It runs fine if there is 1 user for 5 seconds, but not with simultaneous users.
The JSON request which is crafted looks like the below:
"events":[
{
"deviceDetailsDataModel":{
"deviceId":"<deviceID>"
},
"eventDateTime":"<timeStamp>",
"tripInfoDataModel":{
"ignitionStatus":"ON",
"ignitionONTime":"<onTimeStamp>"
}
},
{
"deviceDetailsDataModel":{
"deviceId":"<deviceID>"
},
"eventDateTime":"<timeStamp>",
"tripInfoDataModel":{
"ignitionStatus":"ON",
"ignitionONTime":"<onTimeStamp>"
}
},
{
"deviceDetailsDataModel":{
"deviceId":"<deviceID>"
},
"eventDateTime":"<timeStamp>",
"tripInfoDataModel":{
"ignitionOFFTime":"<onTimeStamp>",
"ignitionStatus":"OFF"
}
}
]
}`
def jsonFileContents(deviceId: String): String = {
  val fileName = "trip-data.json"
  var stringBuilder = ""
  var timeStamp1: Long = ZonedDateTime.now(ZoneId.of("America/Chicago")).toInstant().toEpochMilli() - 10000L
  for (line <- Source.fromFile(fileName).getLines) {
    if (line.contains("eventDateTime")) {
      stringBuilder = stringBuilder + line.replaceAll("<timeStamp>", timeStamp1.toString)
      timeStamp1 = timeStamp1 + 1000L
    } else if (line.contains("onTimeStamp")) {
      stringBuilder = stringBuilder + line.replaceAll("<onTimeStamp>", timeStamp1.toString)
    } else if (line.contains("deviceID")) {
      stringBuilder = stringBuilder + line.replace("<deviceID>", deviceId)
    } else {
      stringBuilder = stringBuilder + line
    }
  }
  stringBuilder
}
Best guess: your feeder contains one single entry and you're using the default queue strategy. Either add more entries in your feeder file to match the number of users, or use a different strategy.
This really is explained in the documentation, including the tutorials. I recommend you take some time to read the documentation before rushing into the code; you'll save lots of time in the end.
You don't need to do your own parameter substitution of values in the JSON file - Gatling supports passing an ELFileBody as the body, where you can have a JSON file with Gatling EL expressions like ${deviceId}.
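For example, a sketch of the scenario above using ELFileBody (spelled ElFileBody in newer Gatling versions) instead of hand-rolled substitution; note the signature step would still need the resolved body, which this sketch leaves out:

val dFeeder = csv("TestDevice_dev.csv").circular

// trip-data.json would contain Gatling EL placeholders, e.g. "deviceId": "${deviceID}"
val trip_dte_tunnel_1 = scenario("TripSimulation")
  .feed(dFeeder)
  .exec(http("trip")
    .post(trip_url)
    .headers(trip_Headers_withsign)
    .body(ELFileBody("trip-data.json")).asJSON
    .check(status.is(201)))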

Can Vertx executeBlocking be used on a list?

Trying to process a list of long-running jobs in a Vert.x way.
One would hope one could do something like:
use the executeBlocking to process the long running job in an async manner
use the composite future to wait for the futures to complete
I'm aware the approach below does not work: the list of Futures is not complete before the code drops into the CompositeFuture.
Is there an executeBlocking approach, or does one have to use the event bus or the Vert.x utilities that support lists?
java.util.ArrayList futureList = new ArrayList()
for (i = 0; i < 100; i++) {
    vertx.executeBlocking({ future ->
        int id = i
        println "Running " + id
        java.lang.Thread.sleep(1000)
        println "Thread done " + id
        future.complete()
    }, true, { res ->
        if (res.succeeded()) {
            print "."
        } else {
            print "x"
        }
    })
}
CompositeFuture.join(futureList).setHandler({ ar ->
    if (ar.succeeded()) {
        System.err.println "all threads should be done.."
    }
})
This results in "all threads should be done.." printing early:
Running 84
Running 87
Running 87
Running 95
all threads should be done..
done.
Thread done 3
Thread done 36
Thread done 3
Thread done 0
In your example, futureList is empty so CompositeFuture.join(futureList) is completed immediately.
Change your example like this:
java.util.ArrayList futureList = new ArrayList()
for (i = 0; i < 100; i++) {
    Future jobFuture = Future.future()
    futureList.add(jobFuture)
    vertx.executeBlocking({ future ->
        int id = i
        println "Running " + id
        java.lang.Thread.sleep(1000)
        println "Thread done " + id
        future.complete()
    }, true, { res ->
        if (res.succeeded()) {
            print "."
        } else {
            print "x"
        }
        jobFuture.complete()
    })
}
Notice the jobFuture creation:
Future jobFuture = Future.future()
futureList.add(jobFuture)
As well as completion:
jobFuture.complete()
Now the CompositeFuture.join(futureList) handler will be executed only after all jobs complete.

Difference between RoundRobinRouter and RoundRobinRoutingLogic

So I was reading a tutorial about Akka and came across this: http://manuel.bernhardt.io/2014/04/23/a-handful-akka-techniques/. I think he explained it pretty well; I just picked up Scala recently and am having difficulties with the tutorial above.
I wonder what the difference is between RoundRobinRouter and the current RoundRobinRoutingLogic? Obviously the implementation is quite different.
Previously, the implementation using RoundRobinRouter was:
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
with processBatch
def processBatch(batch: List[BatchItem]) = {
  if (batch.isEmpty) {
    log.info(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
  } else {
    // reset processing state for the current batch
    currentBatchSize = batch.size
    allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
    currentProcessedItemsCount = 0
    allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
    currentProcessingErrors = List.empty
    // distribute the work
    batch foreach { item =>
      workers ! item
    }
  }
}
Here's my implementation using RoundRobinRoutingLogic:
var mappings: Option[ActorRef] = None

var router = {
  val routees = Vector.fill(100) {
    mappings = Some(context.actorOf(Props[Application3]))
    context watch mappings.get
    ActorRefRoutee(mappings.get)
  }
  Router(RoundRobinRoutingLogic(), routees)
}
and treated the processBatch as such
def processBatch(batch: List[BatchItem]) = {
  if (batch.isEmpty) {
    println(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
  } else {
    // reset processing state for the current batch
    currentBatchSize = batch.size
    allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
    currentProcessedItemsCount = 0
    allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
    currentProcessingErrors = List.empty
    // distribute the work
    batch foreach { item =>
      // println(item.id)
      mappings.get ! item
    }
  }
}
I somehow cannot run this tutorial, and it's stuck at the point where it's iterating the batch list. I wonder what I did wrong.
Thanks
First of all, you have to understand the difference between them.
RoundRobinRouter is a Router that uses round-robin to select a connection.
While
RoundRobinRoutingLogic uses round-robin to select a routee
You can provide your own RoutingLogic (it has helped me understand how Akka works under the hood):
class RedundancyRoutingLogic(nbrCopies: Int) extends RoutingLogic {
  val roundRobin = RoundRobinRoutingLogic()
  def select(message: Any, routees: immutable.IndexedSeq[Routee]): Routee = {
    val targets = (1 to nbrCopies).map(_ => roundRobin.select(message, routees))
    SeveralRoutees(targets)
  }
}
Link to the docs: http://doc.akka.io/docs/akka/2.3.3/scala/routing.html
P.S. This doc is very clear and it has helped me the most.
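If you try that, the custom logic plugs into a Router the same way the built-in one does; a small sketch, assuming the routees vector built as in the question:

import akka.actor.Actor
import akka.routing.Router

val redundantRouter = Router(new RedundancyRoutingLogic(nbrCopies = 3), routees)

// each route() call selects nbrCopies routees and delivers a copy of the message to each
redundantRouter.route("work item", Actor.noSender)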
Actually I misunderstood the method, and found out the solution was to use RoundRobinPool as stated in http://doc.akka.io/docs/akka/2.3-M2/project/migration-guide-2.2.x-2.3.x.html
For example RoundRobinRouter has been renamed to RoundRobinPool or
RoundRobinGroup depending on which type you are actually using.
from
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
to
val workers = context.actorOf(RoundRobinPool(100).props(Props[ItemProcessingWorker]), "router2")
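For completeness, a minimal self-contained sketch of the pool variant (the worker body and names below are placeholders, not the tutorial's code):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Placeholder worker; the real ItemProcessingWorker holds the migration logic.
class ItemProcessingWorker extends Actor {
  def receive = {
    case item => println(s"processing $item")
  }
}

object RouterDemo extends App {
  val system = ActorSystem("demo")
  // The pool creates and supervises 100 routees; messages sent to `workers`
  // are distributed round-robin, like the old RoundRobinRouter(100).
  val workers = system.actorOf(RoundRobinPool(100).props(Props[ItemProcessingWorker]), "router2")
  (1 to 10).foreach(workers ! _)
}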

Terminating a Scala program?

I have used try/catch as part of my MapReduce code. I am reducing my values based on COUNT in the code below. How do I terminate the job in the code below?
class RepReducer extends Reducer[NullWritable, Text, Text, IntWritable] {
  override def reduce(key: NullWritable, values: Iterable[Text],
                      context: Reducer[NullWritable, Text, Text, IntWritable]#Context): Unit = {
    val count = values.toList.length
    if (count == 0) {
      try {
        context.write(new Text("Number of tables with less than 40% coverage"), new IntWritable(count))
      } catch {
        case e: Exception =>
          Console.err.println(" ")
          e.printStackTrace()
      }
    } else {
      System.out.println("terminate job") // here I want to terminate if count is not equal to 0
    }
  }
}
I think you still need to call context.write to return the control back to Hadoop even if you decide to skip certain data in the 'else'.
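There is no clean API for stopping a running job from inside reduce(). A common pattern (a sketch only; the counter group/name "RepReducer"/"BadCoverage" are made up) is to flag the condition in a counter and let the driver decide once the job finishes, or to throw and let the task fail:

import org.apache.hadoop.io.{IntWritable, NullWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import scala.collection.JavaConverters._

class RepReducer extends Reducer[NullWritable, Text, Text, IntWritable] {
  override def reduce(key: NullWritable, values: java.lang.Iterable[Text],
                      context: Reducer[NullWritable, Text, Text, IntWritable]#Context): Unit = {
    val count = values.asScala.size
    if (count == 0) {
      context.write(new Text("Number of tables with less than 40% coverage"), new IntWritable(count))
    } else {
      context.getCounter("RepReducer", "BadCoverage").increment(1)
      // or fail the task outright (the job fails after the configured retries):
      // throw new IllegalStateException("terminate job: count != 0")
    }
  }
}

// Driver side, after job.waitForCompletion(true):
//   val bad = job.getCounters.findCounter("RepReducer", "BadCoverage").getValue
//   if (bad > 0) sys.exit(1)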

Uploading multiple large files and other form data in playframework 2/scala

I am trying to upload multiple large files in the Play framework using Scala. I'm still a Scala and Play noob.
I got some great code from here which got me 90% of the way, but now I'm stuck again.
The main issue I have now is that I can only read the file data, not any other data that's been uploaded, and after poking around the Play docs I'm unclear how to get at that from here. Any suggestions appreciated!
def directUpload(projectId: String) = Secured(parse.multipartFormData(myFilePartHandler)) { implicit request =>
  Ok("Done")
}

def myFilePartHandler: BodyParsers.parse.Multipart.PartHandler[MultipartFormData.FilePart[Result]] = {
  parse.Multipart.handleFilePart {
    case parse.Multipart.FileInfo(partName, filename, contentType) =>
      println("Handling Streaming Upload: " + filename + "/" + partName + ", " + contentType)
      // Set up the PipedOutputStream here, give the input stream to a worker thread
      val pos: PipedOutputStream = new PipedOutputStream()
      val pis: PipedInputStream = new PipedInputStream(pos)
      val worker: UploadFileWorker = new UploadFileWorker(pis, contentType.get)
      worker.start()
      // Read content to the POS
      play.api.libs.iteratee.Iteratee.fold[Array[Byte], PipedOutputStream](pos) { (os, data) =>
        os.write(data)
        os
      }.mapDone { os =>
        os.close()
        worker.join()
        if (worker.success)
          Ok("upload done. Size: " + worker.size)
        else
          Status(503)("Upload Failed")
      }
  }
}
You have to handle the data part. As you can guess (or look up in the documentation), the function to handle the data part is called handleFilePart.
def myFilePartHandler: BodyParsers.parse.Multipart.PartHandler[MultipartFormData.FilePart[Result]] = {
  parse.Multipart.handleFilePart {
    // ...
  }
  parse.Multipart.handleFilePart {
    // ...
  }
}
Another way would be the handlePart method. Check the documentation for more details.
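One more thing worth checking (a sketch, not verified against this exact part handler): the MultipartFormData produced by parse.multipartFormData also exposes the non-file form fields as dataParts, so plain values can be read off the request body in the action; "description" below is a made-up field name:

def directUpload(projectId: String) = Secured(parse.multipartFormData(myFilePartHandler)) { implicit request =>
  // dataParts: Map[String, Seq[String]] holds the non-file form fields
  val description = request.body.dataParts.get("description").flatMap(_.headOption)
  Ok("Done, description = " + description.getOrElse("<none>"))
}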