Can anyone explain interesting spin-lock behavior? - scala

Given the following code:
case class Score(value: BigInt, random: Long = randomLong) extends Comparable[Score] {
  override def compareTo(that: Score): Int = {
    if (this.value < that.value) -1
    else if (this.value > that.value) 1
    else if (this.random < that.random) -1
    else if (this.random > that.random) 1
    else 0
  }

  override def equals(obj: _root_.scala.Any): Boolean = {
    val that = obj.asInstanceOf[Score]
    this.value == that.value && this.random == that.random
  }
}
@tailrec
private def update(mode: UpdateMode, member: String, newScore: Score, spinCount: Int, spinStart: Long): Unit = {
  // Caution: there is some subtle logic below, so don't modify it unless you grok it
  try {
    Metrics.checkSpinCount(member, spinCount)
  } catch {
    case cause: ConcurrentModificationException =>
      throw new ConcurrentModificationException(Leaderboard.maximumSpinCountExceeded.format("update", member), cause)
  }
  // Set the spin-lock
  put(member, None) match {
    case None =>
      // BEGIN CRITICAL SECTION
      // Member's first time on the board
      if (scoreToMember.put(newScore, member) != null) {
        val message = s"$member: added new member in memberToScore, but found old member in scoreToMember"
        logger.error(message)
        throw new ConcurrentModificationException(message)
      }
      memberToScore.put(member, Some(newScore)) // remove the spin-lock
      // END CRITICAL SECTION
    case Some(option) => option match {
      case None => // Update in progress, so spin until complete
        //logger.debug(s"update: $member locked, spinCount = $spinCount")
        for (i <- -1 to spinCount * 2) { Thread.`yield`() } // dampen contention
        update(mode, member, newScore, spinCount + 1, spinStart)
      case Some(oldScore) =>
        // BEGIN CRITICAL SECTION
        // Member already on the leaderboard
        if (scoreToMember.remove(oldScore) == null) {
          val message = s"$member: oldScore not found in scoreToMember, concurrency defect"
          logger.error(message)
          throw new ConcurrentModificationException(message)
        } else {
          val score =
            mode match {
              case Replace =>
                //logger.debug(s"$member: newScore = $newScore")
                newScore
              case Increment =>
                //logger.debug(s"$member: newScore = $newScore, oldScore = $oldScore")
                Score(newScore.value + oldScore.value)
            }
          //logger.debug(s"$member: updated score = $score")
          scoreToMember.put(score, member)
          memberToScore.put(member, Some(score)) // remove the spin-lock
          //logger.debug(s"update: $member unlocked")
        }
        // END CRITICAL SECTION
        // Do this outside the critical section to reduce time under lock
        if (spinCount > 0) Metrics.checkSpinTime(System.nanoTime() - spinStart)
    }
  }
}
There are two important data structures: memberToScore and scoreToMember. I have experimented with both TrieMap[String, Option[Score]] and ConcurrentHashMap[String, Option[Score]] for memberToScore, and both exhibit the same behavior.
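For reference, here is a minimal sketch (my assumption, not part of the original post) of how the two structures might be declared so that the snippet above type-checks: memberToScore holds Option[Score], with None acting as the lock marker, and scoreToMember is a concurrent sorted map keyed by Score, which is why Score carries the random tie-breaker.
// Hedged sketch of plausible declarations; the question only names the candidate types for memberToScore.
import scala.collection.concurrent.TrieMap
import java.util.concurrent.ConcurrentSkipListMap

val memberToScore = TrieMap.empty[String, Option[Score]]       // None marks "update in progress"
val scoreToMember = new ConcurrentSkipListMap[Score, String]()  // ordered by Score.compareTo; the random field keeps equal values distinct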
So far my testing indicates the code is correct and thread-safe, but the mystery is the performance of the spin-lock. On a system with 12 hardware threads, and 1000 iterations on 12 Futures: hitting the same member all the time results in spin cycles of 50 or more, but hitting a random distribution of members can result in spin cycles of 100 or more. The behavior gets worse if I don't dampen the spin by iterating over Thread.yield() calls.
This seems counter-intuitive: I was expecting the random distribution of keys to result in less spinning than the single key, but testing proves otherwise.
Can anyone offer some insight into this counter-intuitive behavior?
Granted there may be better solutions to my design, and I am open to them, but for now I cannot seem to find a satisfactory explanation for what my tests are showing, and my curiosity leaves me hungry.
As an aside, while the single member test has a lower ceiling for the spin count, the random member test has a lower ceiling for time spinning, which is what I would expect. I just cannot explain why the random member test generally produces a higher ceiling for spin count.
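Since better designs are explicitly welcome, here is one direction as a hedged sketch under my own assumptions (Scala 2.12+, so the lambda converts to a Java BiFunction); it is not a drop-in replacement and says nothing about the spin-count mystery itself: let ConcurrentHashMap.compute perform the per-member update atomically, so memberToScore needs no explicit spin-lock, while reconciling scoreToMember would still have to be handled separately.
// Hedged sketch only: the per-member update becomes a single atomic compute() call.
// Maintenance of scoreToMember is deliberately omitted; that is where the hard part remains.
import java.util.concurrent.ConcurrentHashMap

val members = new ConcurrentHashMap[String, Score]()

def updateScore(mode: UpdateMode, member: String, newScore: Score): Score =
  members.compute(member, (_, old) =>
    mode match {
      case Replace   => newScore
      case Increment => if (old eq null) newScore else Score(old.value + newScore.value)
    })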

Related

Performance disadvantage of using Datasets vs RDD with spark

I've partially rewritten my code to use Datasets instead of RDDs; however, I'm seeing a significant performance decrease for some operations.
For example:
val filtered = trips.filter(t => exportFilter.check(t)).cache()
seems to be much slower, and the CPU is mostly idle.
What is the reason for this? Is it a bad idea to use Datasets when trying to access plain objects?
UPDATE:
Here is the filter check method:
override def check(trip: Trip): Boolean = {
  if (trip == null || !trip.isCompleted) {
    return false
  }
  // Return if no extended filter configured or we already
  if (exportConfiguration.isBasicFilter) {
    return trip.isCompleted
  }
  // Here trip is completed, check other conditions
  // Filter out trips from future
  val isTripTimeOk = checkTripTime(trip)
  return isTripTimeOk
}
/**
 * Trip time should have end time today or inside yesterday midnight interval
 */
def checkTripTime(trip: Trip): Boolean = {
  // Check inclusive trip low bound. Should have end time today or inside yesterday midnight interval
  val isLowBoundOk = tripTimingProcessor.isLaterThanYesterdayMidnightIntervalStarts(trip.getEndTimeMillis)
  if (!isLowBoundOk) {
    updateLowBoundMetrics(trip)
    return false
  }
  // Check trip high bound
  val isHighBoundOk = tripTimingProcessor.isBeforeMidnightIntervalStarts(trip.getEndTimeMillis)
  if (!isHighBoundOk) {
    metricService.inc(trip.getStartTimeMillis, trip.getProviderId,
      ExportMetricName.TRIPS_EXPORTED_S3_SKIPPED_END_INSIDE_MIDNIGHT_INTERVAL)
  }
  return isHighBoundOk
}

private def updateLowBoundMetrics(trip: Trip) = {
  metricService.inc(trip.getStartTimeMillis, trip.getProviderId,
    ExportMetricName.TRIPS_EXPORTED_S3_SKIPPED_END_BEFORE_YESTERDAY_MIDNIGHT_INTERVAL)
  val pointIter = trip.getPoints.iterator()
  while (pointIter.hasNext()) {
    val point = pointIter.next()
    metricService.inc(point.getCaptureTimeMillis, point.getProviderId,
      ExportMetricName.POINT_EXPORTED_S3_SKIPPED_END_BEFORE_YESTERDAY_MIDNIGHT_INTERVAL)
  }
}
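This is not a diagnosis of the idle-CPU symptom, but one commonly cited cost of a typed filter(t => exportFilter.check(t)) is that every row must be deserialized into a Trip object and the predicate is opaque to the Catalyst optimizer, whereas a condition expressed on columns stays in the optimized path. A hedged sketch follows; the column names (completed, endTimeMillis) are my assumptions about the Trip schema.
// Hedged sketch: a column-based predicate instead of a typed lambda filter.
// Column names are assumed; the real Trip schema may differ.
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

def filterCompletedInWindow(trips: Dataset[Trip], windowStartMs: Long, windowEndMs: Long): Dataset[Trip] =
  trips.filter(
    col("completed") &&                        // completed trips only
    col("endTimeMillis") >= windowStartMs &&   // not before yesterday's midnight interval
    col("endTimeMillis") < windowEndMs         // not after today's midnight interval / in the future
  )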

Breadth first search inefficiencies

I'm writing a simple breadth-first search algorithm in Scala and I feel like it should be pretty efficient. However, when I run it on some relatively small problems, I'm managing to run out of memory.
def search(start: State): Option[State] = {
  val queue: mutable.Queue[State] = mutable.Queue[State]()
  queue.enqueue(start)
  while (queue.nonEmpty) {
    val node = queue.dequeue()
    if (self.isGoal(node))
      return Some(node)
    self.successors(node).foreach(queue.enqueue)
  }
  None
}
I believe the enqueue and dequeue methods on a mutable queue are constant time, and foreach is implemented efficiently. The methods isGoal and successors I know are as efficient as they can be. I don't understand how I can be running out of memory so quickly. Are there any inefficiencies in this code that I'm missing?
I think c0der's comment nailed it: you may be getting caught in an infinite loop re-checking nodes that you've already visited. Consider the following changes:
def search(start: State): Option[State] = {
  var visited: Set[State] = Set()                          // change #1
  val queue: mutable.Queue[State] = mutable.Queue[State]()
  queue.enqueue(start)
  while (queue.nonEmpty) {
    val node = queue.dequeue()
    if (!visited.contains(node)) {                         // change #2
      visited += node                                      // change #3
      if (self.isGoal(node))
        return Some(node)
      self.successors(node).foreach(queue.enqueue)
    }
  }
  None
}
1) Initialize a new Set, visited, to keep track of which Nodes you've been to.
2) Immediately after dequeueing a Node, check if you've visited it before. If not, continue checking this Node. Otherwise, ignore it.
3) Make sure to add this Node to the visited Set so it's not checked again in the future.
Hope that helps :D
You have some Java, not Scala, code there. In Scala, vars and while loops are something that you should not use at all. Here is my suggestion for how you could solve this:
class State(val neighbours: List[State]) // I am not sure how your State class looks, but it could look something like this

val goal = new State(List())

def breathFirst(start: State): Option[State] = {
  @scala.annotation.tailrec
  def recursiveFunction(visited: List[State], toVisit: List[State]): Option[State] = { // Recursive function carrying the visited nodes and the nodes still to visit
    if (toVisit.isEmpty) None // If toVisit is empty there is no path from start to goal, so return None
    else {
      val visiting = toVisit.head // Otherwise take the first node from toVisit
      val visitingNeighbours = visiting.neighbours // Take all neighbours of the node we are visiting
      val visitingNeighboursNotYetVisited = visitingNeighbours.filter(x => !visited.contains(x)) // Keep only the neighbours that have not been visited
      if (visitingNeighboursNotYetVisited.contains(goal)) { // If we found the goal, return it
        Some(goal)
      } else {
        // Otherwise add the node visited in this iteration to the visited list, drop it from toVisit (it was the head),
        // and append its unvisited neighbours to toVisit for the next iteration
        recursiveFunction(visited :+ visiting, toVisit.tail ++ visitingNeighboursNotYetVisited)
      }
    }
  }

  if (start == goal) { // If the goal is the start, return the start
    Some(start)
  } else { // Otherwise call the recursive function with an empty visited list and a toVisit list containing the start node
    recursiveFunction(List(), List(start))
  }
}
NOTE: You could change
val visitingNeighboursNotYetVisited = visitingNeighbours.filter(x => !visited.contains(x)) // Keep only the neighbours that have not been visited
to
val visitingNeighboursNotYetVisited = visitingNeighbours
and check whether you go out of memory; as you probably won't, it will show you why you should use tailrec.
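A hedged usage sketch of the answer above (the chain of States is mine, purely to show the call shape): a small graph where the start reaches the goal through two intermediate nodes.
// Assumes the State, goal and breathFirst definitions from the answer above are in scope.
val c = new State(List(goal))
val b = new State(List(c))
val start = new State(List(b))

println(breathFirst(start)) // Some(goal)
println(breathFirst(goal))  // Some(goal), via the start == goal shortcut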

How to create basic pagination logic in Gatling?

So I am trying to create basic pagination in Gatling but failing miserably.
My current scenario is as follows:
1) The first call is always a POST request; the body includes page index 1 and page size 50:
{"pageIndex":1,"pageSize":50}
I then receive 50 objects plus the total count of objects on the environment:
"totalCount":232
Since I need to iterate through all objects on the environment, I will need to POST this call 5 times, each time with an updated pageIndex.
My current (failing) code looks like:
def getAndPaginate(jsonBody: String) = {
  val pageSize = 50
  var totalCount: Int = 0
  var currentPage: Int = 1
  var totalPages: Int = 0
  exec(session => session.set("pageIndex", currentPage))
  exec(http("Get page")
    .post("/api")
    .body(ElFileBody("json/" + jsonBody)).asJson
    .check(jsonPath("$.result.objects[?(#.type == 'type')].id").findAll.saveAs("list")))
    .check(jsonPath("$.result.totalCount").saveAs("totalCount"))
    .exec(session => {
      totalCount = session("totalCount").as[Int]
      totalPages = Math.ceil(totalCount / pageSize).toInt
      session
    })
    .asLongAs(currentPage <= totalPages) {
      exec(http("Get assets action list")
        .post("/api")
        .body(ElFileBody("json/" + jsonBody)).asJson
        .check(jsonPath("$.result.objects[?(#.type == 'type')].id").findAll.saveAs("list")))
      currentPage = currentPage + 1
      exec(session => session.set("pageIndex", currentPage))
      pause(Config.minDelayValue seconds, Config.maxDelayValue seconds)
    }
}
Currently the pagination values are not assigned to the variables that I have created at the beginning of the function; if I create the variables at the Object level then they are assigned, but in a manner which I don't understand. For example, the result of Math.ceil(totalCount/pageSize).toInt is 4 while it should be 5. (It is 5 if I execute it in the immediate window.... I don't get it.) I would then expect asLongAs(currentPage <= totalPages) to repeat 5 times, but it only repeats twice.
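(Side note on the 4-vs-5 result: totalCount / pageSize is integer division when both sides are Int, so Math.ceil(232 / 50) is Math.ceil(4.0), which is 4; converting to Double before dividing gives the expected 5.)
Math.ceil(232 / 50).toInt          // 4: 232 / 50 is truncated to 4 before ceil ever runs
Math.ceil(232.toDouble / 50).toInt // 5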
I tried to create the function in a class rather than an Object because, as far as I understand, there is only one Object. (To prevent multiple users from accessing the same variable, I also ran only one user, with the same result.)
I am obviously missing something basic here (new to Gatling and Scala) so any help would be highly appreciated :)
Using regular Scala variables to hold the values isn't going to work - the Gatling DSL defines builders that are only executed once, at startup, so lines like
.asLongAs(currentPage <= totalPages)
will only ever execute with the initial values.
So you just need to handle everything using session variables:
def getAndPaginate(jsonBody: String) = {
  val pageSize = 50
  exec(session => session.set("notDone", true))
    .asLongAs("${notDone}", "index") {
      exec(http("Get assets action list")
        .post("/api")
        .body(ElFileBody("json/" + jsonBody)).asJson
        .check(
          jsonPath("$.result.totalCount")
            // transform lets us take the result of a check (and an optional session) and modify it before storing -
            // so we can use it to store a boolean that reflects whether we're on the last page
            .transform((totalCount, session) => ((session("index").as[Int] + 1) * pageSize) < totalCount.toInt)
            .saveAs("notDone")
        )
      )
      .pause(Config.minDelayValue seconds, Config.maxDelayValue seconds)
    }
}
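A hedged usage sketch for wiring the chain above into a scenario; the scenario name, JSON file name and httpProtocol are my placeholders, and it assumes the usual io.gatling.core.Predef._ and io.gatling.http.Predef._ imports inside a Simulation.
// Hypothetical wiring; getAndPaginate is the chain defined above.
val paginateAll = scenario("Paginate all objects")
  .exec(getAndPaginate("getObjects.json"))

setUp(
  paginateAll.inject(atOnceUsers(1))
).protocols(httpProtocol)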

Difference between RoundRobinRouter and RoundRobinRoutingLogic

So I was reading a tutorial about Akka and came across this http://manuel.bernhardt.io/2014/04/23/a-handful-akka-techniques/ and I think he explained it pretty well. I just picked up Scala recently and am having difficulties with the tutorial above.
I wonder what the difference is between RoundRobinRouter and the current RoundRobinRoutingLogic? Obviously the implementation is quite different.
Previously the implementation with RoundRobinRouter was
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
with processBatch
def processBatch(batch: List[BatchItem]) = {
  if (batch.isEmpty) {
    log.info(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
  } else {
    // reset processing state for the current batch
    currentBatchSize = batch.size
    allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
    currentProcessedItemsCount = 0
    allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
    currentProcessingErrors = List.empty
    // distribute the work
    batch foreach { item =>
      workers ! item
    }
  }
}
Here's my implementation with RoundRobinRoutingLogic:
var mappings: Option[ActorRef] = None
var router = {
  val routees = Vector.fill(100) {
    mappings = Some(context.actorOf(Props[Application3]))
    context watch mappings.get
    ActorRefRoutee(mappings.get)
  }
  Router(RoundRobinRoutingLogic(), routees)
}
and changed processBatch as follows:
def processBatch(batch: List[BatchItem]) = {
  if (batch.isEmpty) {
    println(s"Done migrating all items for data set $dataSetId. $totalItems processed items, we had ${allProcessingErrors.size} errors in total")
  } else {
    // reset processing state for the current batch
    currentBatchSize = batch.size
    allProcessedItemsCount = currentProcessedItemsCount + allProcessedItemsCount
    currentProcessedItemsCount = 0
    allProcessingErrors = currentProcessingErrors ::: allProcessingErrors
    currentProcessingErrors = List.empty
    // distribute the work
    batch foreach { item =>
      // println(item.id)
      mappings.get ! item
    }
  }
}
I somehow cannot get this tutorial code to run; it gets stuck at the point where it iterates over the batch list. I wonder what I did wrong.
Thanks
First of all, you have to understand the difference between them.
RoundRobinRouter is a Router that uses round-robin to select a connection.
While
RoundRobinRoutingLogic uses round-robin to select a routee
You can provide your own RoutingLogic (it helped me understand how Akka works under the hood):
class RedundancyRoutingLogic(nbrCopies: Int) extends RoutingLogic {
  val roundRobin = RoundRobinRoutingLogic()
  def select(message: Any, routees: immutable.IndexedSeq[Routee]): Routee = {
    val targets = (1 to nbrCopies).map(_ => roundRobin.select(message, routees))
    SeveralRoutees(targets)
  }
}
Link to the docs: http://doc.akka.io/docs/akka/2.3.3/scala/routing.html
P.S. This doc is very clear and it has helped me the most.
Actually I misunderstood the method, and found out the solution was to use RoundRobinPool as stated in http://doc.akka.io/docs/akka/2.3-M2/project/migration-guide-2.2.x-2.3.x.html
For example RoundRobinRouter has been renamed to RoundRobinPool or
RoundRobinGroup depending on which type you are actually using.
from
val workers = context.actorOf(Props[ItemProcessingWorker].withRouter(RoundRobinRouter(100)))
to
val workers = context.actorOf(RoundRobinPool(100).props(Props[ItemProcessingWorker]), "router2")
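For completeness, since the migration guide quote mentions both variants: here is a hedged sketch (the actor paths are my assumptions) of the RoundRobinGroup form, where the router routes to actors you have created and supervise yourself, referenced by path, instead of the pool creating them for you.
// Hypothetical paths; a Group routes to externally created actors, while a Pool creates its own.
import akka.routing.RoundRobinGroup

val workerPaths = (1 to 100).map(i => s"/user/worker-$i").toList
val workers = context.actorOf(RoundRobinGroup(workerPaths).props(), "router3")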

Terminating a Scala program?

I have used try/catch as part of my MapReduce code. I am reducing my values based on COUNT in the code below. How do I terminate the job using the code below?
class RepReducer extends Reducer[NullWritable, Text, Text, IntWritable] {
  override def reduce(key: NullWritable, values: Iterable[Text], context: Reducer[NullWritable, Text, Text, IntWritable]#Context): Unit = {
    val count = values.toList.length
    if (count == 0) {
      try {
        context.write(new Text("Number of tables with less than 40% coverage"), new IntWritable(count))
      } catch {
        case e: Exception =>
          Console.err.println(" ")
          e.printStackTrace()
      }
    } else {
      System.out.println("terminate job") // here I want to terminate if count is not equal to 0
    }
  }
}
I think you still need to call context.write to return control back to Hadoop, even if you decide to skip certain data in the 'else'.
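If the goal really is to stop or fail the whole job based on this condition, a reducer has no clean way to end the job by itself; a common pattern, shown here as a hedged sketch with made-up counter names rather than as part of the answer above, is to bump a custom counter in the reducer and let the driver decide after waitForCompletion.
// Hedged sketch: signal the condition via a custom counter and act on it in the driver.
// The counter group/name are placeholders.
// In the reducer:
//   context.getCounter("coverage", "tables_above_threshold").increment(1)
// In the driver, after submitting the job:
import org.apache.hadoop.mapreduce.Job

def runAndCheck(job: Job): Boolean = {
  val completed = job.waitForCompletion(true)
  val flagged = job.getCounters.findCounter("coverage", "tables_above_threshold").getValue
  completed && flagged == 0 // e.g. treat any flagged table as a failed run
}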