Anyone had issues with subtraction of a Long within an RDD - Scala

I am having an issue with subtracting Longs within an RDD, to filter out items in the RDD that fall within a certain time range.
My code filters an RDD of Auction case classes against a list of successful auctions, each represented as a (Long, String, String) tuple:
auctions.filter(it => relevantAuctions(it, successfulAuctions))
Each successful auction is made up of a timestamp: Long, an itemID: String, and a direction: String (BUY/SELL).
The relevantAuctions function uses tail recursion to find the auctions in a time range for the exact item and direction.
import scala.annotation.tailrec

@tailrec
def relevantAuctions(auction: Auction, successfulAuctions: List[(Long, String, String)]): Boolean =
  successfulAuctions match {
    case sample :: xs => if (isRelevantAuction(auction, sample)) true else relevantAuctions(auction, xs)
    case Nil => false
  }
This feeds into another method, called in the if statement, that checks that the timestamp in the sample is within a 10 ms range, that the item ID is the same, and that the direction matches.
def isRelevantAuction(auction: Auction, successfulAuction: (Long, String, String)): Boolean = {
  // pull the tuple apart so the fields have names
  val (timestampNanos, itemID, direction) = successfulAuction
  (timestampNanos - auction.timestampNanos) >= 0 &&
    (timestampNanos - auction.timestampNanos) < 10000000L &&
    auction.itemID == itemID &&
    auction.direction == direction
}
I am having issues where the range check is not entirely working: the timestamps I am getting back are not within the required range, although the item ID and direction filters seem to be working successfully.
For example, when I have a timestamp of 1431651108749267459 for the successful auction, I am receiving auctions with timestamps GREATER than this, where they should be less.
The auctions I am receiving have these timestamps:
1431651108749326603
1431651108749330732
1431651108749537901
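For reference, doing the subtraction the way the filter does for the first of these gives a negative difference, even though the gap itself is only about 59 microseconds:

val successfulTs = 1431651108749267459L
val auctionTs    = 1431651108749326603L
successfulTs - auctionTs // -59144: negative, so the >= 0 check should reject this auction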
Has anyone experienced this phenomenon?
Thanks!

Related

zipping lists with an optional list to construct a list of objects in Scala

I have a case class like this:
case class Metric(name: String, value: Double, timeStamp: Int)
I receive individual components to build metrics in separate lists and zip them to create a list of Metric objects.
def buildMetric(names: Seq[String], values: Seq[Double], ts: Seq[Int]): Seq[Metric] = {
  (names, values, ts).zipped.toList map {
    case (name, value, time) => Metric(name, value, time)
  }
}
Now I need to add an optional parameter to both buildMetric function and Metric class.
case class Metric(name: String, value: Double, timeStamp: Int, metricType: Option[Type])
&
def buildMetric(names: Seq[String], values: Seq[Double], ts: Seq[Int], types: Option[Seq[Type]]): Seq[Metric]
The idea is that we sometimes receive a sequence of types which, if present, matches the length of the names and values lists. I am not sure how to modify the body of buildMetric to create the Metric objects with the type information idiomatically. I can think of a couple of approaches.
Do an if-else on types.isDefined, and in one branch zip types.get with the other lists, leaving the code as above in the other branch. This makes me write almost the same code twice.
The other option is to simply use a while loop and create each Metric object with types.map(_(i)) passed as the last parameter.
So far I am using the second option, but I wonder if there is a more functional way of handling this problem.
The first option can't be done as written, because zipped only exists for tuples of two or three elements.
The second version might look like this:
def buildMetric(names: Seq[String], values: Seq[Double], ts: Seq[Int], types: Option[Seq[Type]]): Seq[Metric] =
  for {
    (name, i) <- names.zipWithIndex
    value     <- values.lift(i)
    time      <- ts.lift(i)
    optType    = types.flatMap(_.lift(i))
  } yield {
    Metric(name, value, time, optType)
  }
One more option, if you would like to keep the zipped approach: convert types from Option[Seq[Type]] to a Seq[Option[Type]] of the same length as names, filled with None values in case types is None:
val optionTypes: Seq[Option[Type]] =
  types.fold(Seq.fill(names.length)(None: Option[Type]))(_.map(Some(_)))

// zipped is not available for Tuple4, so fall back to nested zips
names.zip(values).zip(ts).zip(optionTypes).toList.map {
  case (((name, value), time), optionType) => Metric(name, value, time, optionType)
}
Hope this helps!
You could just use pattern matching on types:
def buildMetric(names: Seq[String], values: Seq[Double], ts: Seq[Int], types: Option[Seq[Type]]): Seq[Metric] = {
  types match {
    case Some(types) => names.zip(values).zip(ts).zip(types).map {
      case (((name, value), time), t) => Metric(name, value, time, Some(t))
    }
    case None => (names, values, ts).zipped.map(Metric(_, _, _, None))
  }
}
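For what it's worth, a quick usage sketch of the pattern-matching version, with Type stubbed out as a hypothetical placeholder trait and made-up values:

sealed trait Type
case object Gauge extends Type
case object Counter extends Type

buildMetric(Seq("cpu", "mem"), Seq(0.9, 0.4), Seq(1, 2), Some(Seq(Gauge, Counter)))
// List(Metric(cpu,0.9,1,Some(Gauge)), Metric(mem,0.4,2,Some(Counter)))
buildMetric(Seq("cpu", "mem"), Seq(0.9, 0.4), Seq(1, 2), None)
// List(Metric(cpu,0.9,1,None), Metric(mem,0.4,2,None))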

How to extract timed-out sessions using mapWithState

I am updating my code to switch from updateStateByKey to mapWithState in order to get users' sessions based on a time-out of 2 minutes (2 is used for testing purposes only). Each session should aggregate all the streaming data (JSON strings) received within the session before it times out.
This was my old code:
val membersSessions = stream.map[(String, (Long, Long, List[String]))](eventRecord => {
  val parsed = Utils.parseJSON(eventRecord)
  val member_id = parsed.getOrElse("member_id", "")
  val timestamp = parsed.getOrElse("timestamp", "").toLong
  // The timestamp is returned twice because the first one will be used as the
  // start time and the second one as the end time
  (member_id, (timestamp, timestamp, List(eventRecord)))
})
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  // transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
}).reduceByKey((a, b) => {
  // transform to (member_id, (lowestStartTime, maxFinishTime, sumOfCounter, events within session))
  (Math.min(a._1, b._1), Math.max(a._2, b._2), a._3 + b._3, a._4 ++ b._4)
}).updateStateByKey(Utils.updateState)
The problems of updateStateByKey are nicely explained here. One of the key reasons why I decided to use mapWithState is because updateStateByKey was unable to return finished sessions (the ones that have timed out) for further processing.
This is my first attempt to transform the old code to the new version:
val spec = StateSpec.function(updateState _).timeout(Minutes(1))
val latestSessionInfo = membersSessions.map[(String, (Long, Long, Long, List[String]))](a => {
  // transform to (member_id, (time, time, counter, events within session))
  (a._1, (a._2._1, a._2._2, 1, a._2._3))
})
val userSessionSnapshots = latestSessionInfo.mapWithState(spec).stateSnapshots()
I don't quite understand what the content of updateState should be, because as far as I understand the time-out should no longer be calculated manually (it was previously done in my Utils.updateState function), and stateSnapshots should return the timed-out sessions.
Assuming you're always waiting on a timeout of 2 minutes, you can make your mapWithState stream only output data once the timeout is triggered.
What would this mean for your code? It means you now monitor the timeout instead of outputting the tuple on each iteration. I would imagine your mapWithState function will look something along the lines of:
def updateState(key: String,
                value: Option[(Long, Long, Long, List[String])],
                state: State[(Long, Long, Long, List[String])]): Option[(Long, Long, Long, List[String])] = {
  def reduce(first: (Long, Long, Long, List[String]), second: (Long, Long, Long, List[String])) = {
    (Math.min(first._1, second._1), Math.max(first._2, second._2), first._3 + second._3, first._4 ++ second._4)
  }

  value match {
    case Some(currentValue) =>
      val result = state
        .getOption()
        .map(currentState => reduce(currentState, currentValue))
        .getOrElse(currentValue)
      state.update(result)
      None
    case _ if state.isTimingOut() => state.getOption()
    case _ => None // no new value and not timing out: nothing to emit
  }
}
This way, you only output something externally to the stream once the state has timed out; otherwise, you aggregate it inside the state.
This means that your Spark DStream graph can filter out all the values which aren't defined, and only keep those which are:
latestSessionInfo
  .mapWithState(spec)
  .filter(_.isDefined)
After filter, you'll only have states which have timed out.
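If you then want the session tuples themselves rather than Options downstream, a small follow-on sketch (same names as above):

val timedOutSessions = latestSessionInfo
  .mapWithState(spec)
  .filter(_.isDefined)
  .map(_.get) // safe here: everything that survives the filter is a Some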

How to do null checks in Scala in an idiomatic way?

import org.apache.spark.sql.Row

case class YellowAccount(
  name: String,
  uuid: String,
  lat: Double,
  lng: Double
)
object Account {
  def create(row: Row): Option[YellowAccount] = {
    val name = row.getAs[String]("store_name")
    val uuid = row.getAs[String]("uuid")
    val latO = row.getAs[String]("latitude")
    val lngO = row.getAs[String]("longitude")
    // How do you do null checks here in an idiomatic way?
    if (latO == null || lngO == null) {
      return None
    }
    Some(YellowAccount(name, uuid, latO.toDouble, lngO.toDouble))
  }
}
lat/lng are compulsory fields in YellowAccount. How do you do null checks in an idiomatic way?
You can use the Option type to handle null values: wrap a nullable value in Option and then pattern match on it, map over it, and so on. In your example, I think the most concise way to combine the four nullable values is a for-comprehension:
import scala.util.Try

object Account {
  def create(row: Row): Option[YellowAccount] = {
    for {
      name <- Option(row.getAs[String]("store_name"))
      uuid <- Option(row.getAs[String]("uuid"))
      latO <- Try(row.getAs[String]("latitude").toDouble).toOption
      lngO <- Try(row.getAs[String]("longitude").toDouble).toOption
    } yield YellowAccount(name, uuid, latO, lngO)
  }
}
EDIT
Another thing here is the _.toDouble conversion, which may throw an exception if it fails to parse the string, so you can wrap it in Try instead (I updated the code above).
EDIT2
To clarify what's happening here:
when you wrap a value in Option it becomes None if the value is null, or Some(...) with the value otherwise
similarly when wrapping something that may throw an exception in Try, it becomes either Failure with the exception, or Success with the value
toOption method converts Try to Option in a straightforward way: Failure becomes None, Success becomes Some
in the for-comprehension, if any of the four Options is None (i.e. one of the values was null or failed to parse as a number), the whole expression returns None; only if all four yield a value are they passed to the YellowAccount constructor
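As a quick illustration of those two building blocks, with throwaway values:

import scala.util.Try

Option(null)                  // None
Option("abc")                 // Some(abc)
Try("oops".toDouble).toOption // None: the NumberFormatException is swallowed
Try("1.5".toDouble).toOption  // Some(1.5)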
As the other answer suggests, you can use Option to handle possible nulls. You can't use a for-comprehension exactly the way it is suggested there, but there are several ways around it. The easiest is probably to .zip the two Options together (on Scala 2.13, where Option#zip returns an Option of the pair) and then map over the result:
Option(row.getAs[String]("latitude"))
  .zip(Option(row.getAs[String]("longitude")))
  .map { case (lat, long) =>
    YellowAccount(
      row.getAs[String]("store_name"),
      row.getAs[String]("uuid"),
      lat.toDouble,
      long.toDouble
    )
  }

Scala: return based on value of int expression

I'm trying to calculate the time from various dates since a specific period start date. Calculating the days since was fine and easy, but I also want the function to return -1 if the number of days between them is negative. I'm not sure of a clean way to do this.
I've got:
def getTimeBetween(date1: String, date2: String, format: String, dtype: String): Int = {
  dtype match {
    case "fixed" => {
      if (date1 == null) throw new Exception("Specify date1")
      else Days.daysBetween(date2, date1).getDays()
    }
    case _ => throw new Exception("Enter a valid type")
  }
}
I'm pretty new to Scala, so I don't know all the syntax well yet. I'm more used to Java/C++, so I thought nested if/else statements would work, but I've realized that's not the way to go.
Thanks.
A definite improvement would be to not throw runtime exceptions and instead rely on a standard Scala construct; Try comes to mind here.
import scala.util.{ Success, Failure, Try }
import org.joda.time.Days

def getTimeBetween(date1: String, date2: String, format: String, dtype: String): Try[Int] = {
  dtype match {
    case "fixed" =>
      Try(Days.daysBetween(date2, date1).getDays) match {
        case Success(days) if days >= 0 => Success(days)
        case Success(_)                 => Success(-1)
        case Failure(t)                 => Failure(t)
      }
    case _ => Failure(new IllegalArgumentException)
  }
}
This method will return Success(#days) with the number of days between if it's non-negative, or Success(-1) if it's negative.
The Try wrapper around the daysBetween function will catch any exceptions that might be thrown during parsing, such as illegal formats or null values in the parameters. Any runtime exception will result in a Failure(t) with t being the exception caught.
There are a lot of other ways to approach this, like the Either type (see the sketch after the edit below), or using Option with the catching helper from scala.util.control.Exception.
Edit:
Since you left it out of your example I did also, but for the daysBetween method to compile you'll need to parse the strings into DateTime objects, using something like DateTime.parse(date1) etc.
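For illustration, a sketch of the Either variant of the same idea, using the DateTime.parse step just mentioned (a hypothetical getDaysBetween, not tested against your data):

import org.joda.time.{ DateTime, Days }
import scala.util.Try

def getDaysBetween(date1: String, date2: String): Either[Throwable, Int] =
  Try(Days.daysBetween(DateTime.parse(date2), DateTime.parse(date1)).getDays)
    .toEither // Scala 2.12+: Failure becomes Left, Success becomes Right
    .map(days => if (days >= 0) days else -1)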
Assuming your method can return a negative integer and you want it to return -1 in place of any negative value, you could simply put the result in an auxiliary variable and check:
val res = Days.daysBetween(date2, date1).getDays()
if (res < 0) -1 else res
A more compact (and arguably less clear) method, for Int only, is to use max:
Days.daysBetween(date2, date1).getDays().max(-1)
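To see why max(-1) does the trick, with throwaway values:

(-5).max(-1) // -1: any negative day count collapses to -1
3.max(-1)    // 3: non-negative counts pass through unchanged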

Scala, finding max value in arrays

First time I've had to ask a question here, there is not enough info on Scala out there for a newbie like me.
Basically what I have is a file filled with hundreds of thousands of lists formatted like this:
(type, date, count, object)
Rows look something like this:
(food, 30052014, 400, banana)
(food, 30052014, 2, pizza)
All I need to do is find the one row with the highest count.
I know I did this a couple of months ago but can't seem to wrap my head around it now. I'm sure I can do this without a function too. All I want to do is set a value and put that row in it but I can't figure it out.
I think basically what I want to do is a Math.max on the 3rd element in the lists, but I just can't get it.
Any help will be kindly appreciated. Sorry if my wording or formatting of this question isn't the best.
EDIT: There's some extra info I've left out that I should probably add:
All the records are stored in a TSV file. I've done this to split them:
val split_food = food.map(_.split("\t"))
so basically I think I need to use split_food... somehow
Modified version of @Szymon's answer, with your edit addressed:
val split_food = food.map(_.split("\t"))
val max_food = split_food.maxBy(tokens => tokens(2).toInt)
or, analogously:
val max_food = split_food.maxBy { case Array(_, _, count, _) => count.toInt }
In case you're using an Apache Spark RDD, which has only a limited number of the usual Scala collection methods, you have to go with reduce:
val max_food = split_food.reduce { (max: Array[String], current: Array[String]) =>
  val curCount = current(2).toInt
  val maxCount = max(2).toInt // you would probably want to preprocess all items,
                              // so that .toInt is not called again and again
  if (curCount > maxCount) current else max
}
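For what it's worth, a sketch of the preprocessing idea from the comment above (same names; each row is paired with its parsed count exactly once):

val withCounts = split_food.map(tokens => (tokens(2).toInt, tokens))
val max_food = withCounts.reduce((a, b) => if (a._1 >= b._1) a else b)._2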
You should use maxBy function:
case class Purchase(category: String, date: Long, count: Int, name: String)

object Purchase {
  def apply(s: String) = s.split("\t") match {
    case Array(cat, date, count, name) => Purchase(cat, date.toLong, count.toInt, name)
  }
}
foodRows.map(row => Purchase(row)).maxBy(_.count)
Simply:
case class Record(food:String, date:String, count:Int)
val l = List(Record("ciccio", "x", 1), Record("buffo", "y", 4), Record("banana", "z", 3))
l.maxBy(_.count)
>>> res8: Record = Record(buffo,y,4)
Not sure if you got the answer yet, but I had the same issue with maxBy. I found that once I ran the import scala.io.Source, I was able to use maxBy and it worked.