Benchmark (De)Serialization using Scalameter - scala

So I have been struggling to understand how to use ScalaMeter for my use case. I have two methods, toJson (serialization) and fromJson (deserialization), which belong to a Scala object (i.e. they are effectively static).
What I want to achieve is benchmarking the runtime of these two methods against a variety of datasets, e.g. stateSmall, stateMedium, etc.
I tried something very basic, but I have a feeling this isn't the correct way to define these benchmarks, since each measurement only ran once; preferably I would want each one to run 1000+ times to get an average runtime.
The examples I've seen in the documentation aren't very clear to me on how to achieve this.
object StateSerializerBenchmark extends Bench.LocalTime {
  val stateSerializer = Gen.single("owner")(StateSerializer)

  performance of "StateSerializer" in {
    measure method "toJson - small" in {
      using(stateSerializer) in {
        _.toJson(BenchmarkFixtures.userProfileStateSmall)
      }
    }
    measure method "toJson - medium" in {
      using(stateSerializer) in {
        _.toJson(BenchmarkFixtures.userProfileMedium)
      }
    }
  }
}
Update
An alternative approach, which seems a bit better:
object StateSerializerBenchmark extends Bench.LocalTime {
  val sizes = Gen.range("sizes")(1, 100, 10)

  val userProfileStateSRange = for {
    size <- sizes
  } yield List.fill(size)(BenchmarkFixtures.userProfileStateSmall)

  val userProfileStateMRange = for {
    size <- sizes
  } yield List.fill(size)(BenchmarkFixtures.userProfileStateMedium)

  val userProfileStateLRange = for {
    size <- sizes
  } yield List.fill(size)(BenchmarkFixtures.userProfileStateLarge)

  performance of "StateSerializer" in {
    measure method "toJson - small" in {
      using(userProfileStateSRange) in { s =>
        StateSerializer.toJson(s)
      }
    }
    measure method "toJson - medium" in {
      using(userProfileStateMRange) in { s =>
        StateSerializer.toJson(s)
      }
    }
    measure method "toJson - large" in {
      using(userProfileStateLRange) in { s =>
        StateSerializer.toJson(s)
      }
    }
  }
}
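From the configuration section of the ScalaMeter docs it looks like the number of runs can be controlled per measurement with a config block, something along these lines (an untested sketch, assuming org.scalameter.api._ is imported; the key values are illustrative):

using(userProfileStateSRange) config (
  exec.benchRuns -> 1000,       // measured runs per data point
  exec.minWarmupRuns -> 50,     // warm-up runs before measuring
  exec.independentSamples -> 1
) in { s =>
  StateSerializer.toJson(s)
}

If an average is wanted rather than the default aggregation, it also looks like Bench.LocalTime lets you override its aggregator (e.g. Aggregator.average).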
Any examples or pointers would be appreciated.

Related

Crafting the body for the request does not work concurrently

I would like to send simultaneous requests through Gatling for some duration. Below is the snippet of my code where I am crafting the requests.
The jsonFileContents function builds the JSON body and is used in the main request.
TestDevice_dev.csv has a list of devices up to 30; after 30 I reuse it:
TestDevice1
TestDevice2
TestDevice3
.
.
.
val dFeeder = csv("TestDevice_dev.csv").circular

val trip_dte_tunnel_1 = scenario("TripSimulation")
  .feed(dFeeder)
  .exec(session => {
    val key = conf.getString("config.env.sign_key")
    var bodyTrip = CannedRequests.jsonFileContents("${deviceID}") // deviceID comes from the feeder
    session.set("trip_sign", SignatureGeneration.getSignature(key, bodyTrip))
    session.set("tripBody", bodyTrip)
  })
  .exec(http("trip")
    .post(trip_url)
    .headers(trip_Headers_withsign)
    .body(StringBody("${tripBody}")).asJSON.check(status.is(201)))
  .exec(flushSessionCookies)
The scenario is started as below:
val scn_trip = scenario("trip simulation")
  .repeat(1) {
    exec(DataExchange.trip_dte_tunnel_1)
  }

setUp(scn_trip.inject(constantUsersPerSec(5) during (5 seconds)))
It runs fine if there is 1 user for 5 seconds, but not with simultaneous users.
The JSON request that is crafted looks like the below:
"events":[
{
"deviceDetailsDataModel":{
"deviceId":"<deviceID>"
},
"eventDateTime":"<timeStamp>",
"tripInfoDataModel":{
"ignitionStatus":"ON",
"ignitionONTime":"<onTimeStamp>"
}
},
{
"deviceDetailsDataModel":{
"deviceId":"<deviceID>"
},
"eventDateTime":"<timeStamp>",
"tripInfoDataModel":{
"ignitionStatus":"ON",
"ignitionONTime":"<onTimeStamp>"
}
},
{
"deviceDetailsDataModel":{
"deviceId":"<deviceID>"
},
"eventDateTime":"<timeStamp>",
"tripInfoDataModel":{
"ignitionOFFTime":"<onTimeStamp>",
"ignitionStatus":"OFF"
}
}
]
}`
// Imports needed by this helper
import java.time.{ZoneId, ZonedDateTime}
import scala.io.Source

def jsonFileContents(deviceId: String): String = {
  val fileName = "trip-data.json"
  var stringBuilder = ""
  var timeStamp1: Long = ZonedDateTime.now(ZoneId.of("America/Chicago")).toInstant().toEpochMilli() - 10000L
  for (line <- Source.fromFile(fileName).getLines) {
    if (line.contains("eventDateTime")) {
      stringBuilder = stringBuilder + line.replaceAll("<timeStamp>", timeStamp1.toString)
      timeStamp1 = timeStamp1 + 1000L
    } else if (line.contains("onTimeStamp")) {
      stringBuilder = stringBuilder + line.replaceAll("<onTimeStamp>", timeStamp1.toString)
    } else if (line.contains("deviceID")) {
      stringBuilder = stringBuilder + line.replace("<deviceID>", deviceId)
    } else {
      stringBuilder = stringBuilder + line
    }
  }
  stringBuilder
}
Best guess: your feeder contains one single entry and you're using the default queue strategy. Either add more entries to your feeder file to match the number of users, or use a different strategy.
This really is explained in the documentation, including the tutorials. I recommend you take some time to read the documentation before rushing into the code; you'll save lots of time in the end.
You also don't need to do your own parameter substitution of values in the JSON file - Gatling supports passing an ELFileBody as the body, where you can have a JSON file with Gatling EL expressions like ${deviceId}.
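For illustration, a rough sketch of the request using an EL-based body template (Gatling 2 syntax to match the .asJSON above; it assumes trip-data.json is rewritten to use EL placeholders such as ${deviceID} instead of the hand-rolled <deviceID>/<timeStamp> markers, and that the signature step is adapted accordingly):

val trip_dte_tunnel_1 = scenario("TripSimulation")
  .feed(dFeeder)
  .exec(
    http("trip")
      .post(trip_url)
      .headers(trip_Headers_withsign)
      // Gatling resolves the EL placeholders in the template per virtual user
      .body(ELFileBody("trip-data.json")).asJSON
      .check(status.is(201))
  )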

Performance disadvantage of using Datasets vs RDDs with Spark

I've partially rewritten my code to use Datasets instead of RDDs, but I'm seeing a significant performance decrease for some operations.
For example:
val filtered = trips.filter(t => exportFilter.check(t)).cache()
seems to be much slower, and the CPU is mostly idle.
What is the reason for this? Is it a bad idea to use Datasets when trying to access plain objects?
UPDATE:
Here is the filter check method:
override def check(trip: Trip): Boolean = {
  if (trip == null || !trip.isCompleted) {
    return false
  }
  // Return if no extended filter configured or we already
  if (exportConfiguration.isBasicFilter) {
    return trip.isCompleted
  }
  // Here trip is completed, check other conditions
  // Filter out trips from future
  val isTripTimeOk = checkTripTime(trip)
  return isTripTimeOk
}

/**
 * Trip time should have end time today or inside yesterday midnight interval
 */
def checkTripTime(trip: Trip): Boolean = {
  // Check inclusive trip low bound. Should have end time today or inside yesterday midnight interval
  val isLowBoundOk = tripTimingProcessor.isLaterThanYesterdayMidnightIntervalStarts(trip.getEndTimeMillis)
  if (!isLowBoundOk) {
    updateLowBoundMetrics(trip)
    return false
  }
  // Check trip high bound
  val isHighBoundOk = tripTimingProcessor.isBeforeMidnightIntervalStarts(trip.getEndTimeMillis)
  if (!isHighBoundOk) {
    metricService.inc(trip.getStartTimeMillis, trip.getProviderId,
      ExportMetricName.TRIPS_EXPORTED_S3_SKIPPED_END_INSIDE_MIDNIGHT_INTERVAL)
  }
  return isHighBoundOk
}

private def updateLowBoundMetrics(trip: Trip) = {
  metricService.inc(trip.getStartTimeMillis, trip.getProviderId,
    ExportMetricName.TRIPS_EXPORTED_S3_SKIPPED_END_BEFORE_YESTERDAY_MIDNIGHT_INTERVAL)
  val pointIter = trip.getPoints.iterator()
  while (pointIter.hasNext()) {
    val point = pointIter.next()
    metricService.inc(point.getCaptureTimeMillis, point.getProviderId,
      ExportMetricName.POINT_EXPORTED_S3_SKIPPED_END_BEFORE_YESTERDAY_MIDNIGHT_INTERVAL)
  }
}
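For comparison, a filter expressed on columns rather than a typed lambda lets Catalyst see the predicate (a lambda such as t => exportFilter.check(t) is opaque to the optimizer and forces every row to be deserialized into a Trip). A rough sketch of just the time-bound part, where completed, endTimeMillis, lowerBoundMillis and upperBoundMillis are assumed names:

import org.apache.spark.sql.functions.col

// Assumed column names on the trips Dataset; the bounds would come from the
// same midnight-interval logic as tripTimingProcessor. The metric updates in
// check() would still have to be handled separately.
val filtered = trips
  .filter(col("completed") &&
          col("endTimeMillis") >= lowerBoundMillis &&
          col("endTimeMillis") < upperBoundMillis)
  .cache()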

How to provide the output result of a function as a PUT request in scala?

I have Scala code that converts a layer to a GeoTIFF file. Now I want this GeoTIFF file to be sent in a PUT request as a REST service. How can I do that?
Here is a section of the code:
val labeled_layerstack = {
  // Labeled layerstack
  // val layers_input = Array(layer_dop) ++ layers_sat
  val layers_labeled_input = Array(layer_label) ++ Array(output_layerstack) // ++ layers_input
  ManyLayersToMultibandLayer(layers_labeled_input, output_labeled_layerstack)
  output_labeled_layerstack
}

if (useCleanup) {
  DeleteLayer(layer_label)
  if (useDOP)
    DeleteLayer(layer_dop)
  for (layer_x <- layers_sat)
    DeleteLayer(layer_x)
}
labeled_layerstack
}
else output_labeled_layerstack // if reusing existing layerstack (processing steps w/o "layerstack")

if (processingSteps.isEmpty || processingSteps.get.steps.exists(step => step == "classification")) {
  if (useRandomForest) {
    ClusterTestRandomForest(labeled_layerstack, fileNameClassifier, layerResult, Some(output_layerstack))
    if (useExportResult) {
      LayerToGeotiff(layerResult, fileNameResult, useStitching = useExportStitching)
    }
  }
  else if (useSVM) {
    ClusterTestSVM(labeled_layerstack, fileNameClassifier, layerResult, Some(output_layerstack))
    if (useExportResult) {
      LayerToGeotiff(layerResult, fileNameResult, useStitching = useExportStitching)
    }
  }
}
The original code is quite long and not shareable, so I am sharing the part that is relevant to the problem. The output of LayerToGeotiff should be sent in a PUT request. How can I create such a request?
I suggest the Play framework to send a PUT request.
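A minimal sketch of that suggestion using Play's standalone WS client (assuming the play-ahc-ws-standalone dependency is available; the endpoint URL is a placeholder and fileNameResult is the GeoTIFF path written by LayerToGeotiff):

import java.io.File

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import play.api.libs.ws.DefaultBodyWritables._
import play.api.libs.ws.ahc.StandaloneAhcWSClient

import scala.concurrent.ExecutionContext.Implicits.global

implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
val ws = StandaloneAhcWSClient()

// PUT the exported GeoTIFF to a placeholder REST endpoint
val upload = ws.url("http://example.org/api/geotiffs/result")
  .addHttpHeaders("Content-Type" -> "image/tiff")
  .put(new File(fileNameResult))

upload.onComplete { _ =>
  ws.close()
  system.terminate()
}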

removing elements from a sequence using for/yield

Given a Future[Seq[Widget]], where Widget contains an amount: Int property, I'd like to return a Seq[Widget], but only those Widgets whose amount value is greater than 100. I believe the for { if … } yield { } construct will give me what I want, but I'm unsure how to filter through the sequence. I have:
val myWidgetFuture: Future[Seq[Widget]] = ...
for {
  widgetSeq <- myWidgetFuture
  if (??? amount > 100) // <- what to put here?
} yield {
  widgetSeq
}
If there's a clean non-yield way of doing this, that will also work for me.
You don't even need yield. Use map.
val myWidgetFuture: Future[Seq[Widget]] = ???
myWidgetFuture map { ws => ws filter (_.amount > 100) }
If you want to use for … yield with an if filter, you'll need to use two fors:
for {
  widgetSeq <- myWidgetFuture
} yield for {
  widget <- widgetSeq
  if widget.amount > 100
} yield widget

Terminating a Scala program?

I have used try/catch as part of my MapReduce code. I am reducing my values based on COUNT in the code below. How do I terminate the job using the code below?
class RepReducer extends Reducer[NullWritable, Text, Text, IntWritable] {
  override def reduce(key: NullWritable, values: Iterable[Text], context: Reducer[NullWritable, Text, Text, IntWritable]#Context): Unit = {
    val count = values.toList.length
    if (count == 0) {
      try {
        context.write(new Text("Number of tables with less than 40% coverage"), new IntWritable(count))
      } catch {
        case e: Exception =>
          Console.err.println(" ")
          e.printStackTrace()
      }
    } else {
      System.out.println("terminate job") // here I want to terminate if count is not equal to 0
    }
  }
}
I think you still need to call context.write to return control back to Hadoop, even if you decide to skip certain data in the 'else' branch.
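A minimal sketch of that suggestion (the label written in the else branch is a placeholder, not from the original code):

else {
  // Still emit a record for this key rather than trying to stop the job from
  // inside reduce(); the job finishes normally once all input is processed.
  context.write(new Text("tables at or above 40% coverage"), new IntWritable(count))
}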