try/catch not working when use tail recursive function - scala

I'm building a tail-recursive function that reads multiple HDFS paths and merges all of them into a single DataFrame. The function works perfectly as long as all the paths exist; if not, it fails and does not finish joining the data from the paths that do exist. To solve this I tried to handle the error using try/catch, but have not been successful.
The error says: could not optimize @tailrec annotated method loop: it contains a recursive call not in tail position
My function is:
def getRangeData(toOdate: String, numMonths: Int, pathRoot: String, ColumnsTable: List[String]): DataFrame = {
  val dataFrameNull = spark.createDataFrame(spark.sparkContext.emptyRDD[Row],
    StructType((ColumnsTable :+ "odate").map(columnName => StructField(columnName, StringType, true))))
  val rangePeriod = getRangeDate(numMonths, toOdate)

  @tailrec
  def unionRangeData(rangePeriod: List[LocalDate], pathRoot: String, df: DataFrame = dataFrameNull): DataFrame = {
    try {
      if (rangePeriod.isEmpty) {
        df
      }
      else {
        val month = "%02d".format(rangePeriod.head.getMonthValue)
        val year = rangePeriod.head.getYear
        val odate = rangePeriod.head.toString
        val path = s"${pathRoot}/partition_data_year_id=${year}/partition_data_month_id=${month}"
        val columns = ColumnsTable.map(columnName => trim(col(columnName)).as(columnName))
        val dfTemporal = spark.read.parquet(path).select(columns: _*).withColumn("odate", lit(odate).cast("date"))
        unionRangeData(rangePeriod.tail, pathRoot, df.union(dfTemporal))
      }
    } catch {
      case e: Exception =>
        logger.error("path not exist")
        dataFrameNull
    }
  }

  unionRangeData(rangePeriod, pathRoot)
}
def getRangeDate(numMonths: Int, toOdate: String, listDate: List[LocalDate] = List()): List[LocalDate] = {
  if (numMonths == 0) {
    listDate
  }
  else {
    getRangeDate(numMonths - 1, toOdate, LocalDate.parse(toOdate).plusMonths(1).minusMonths(numMonths) :: listDate)
  }
}
In advance, thank you very much for your help.

I would suggest you remove the try-catch construct entirely from the function and use it instead at the call site at the bottom of getRangeData.
Alternatively you can also use scala.util.Try to wrap the call: Try(unionRangeData(rangePeriod, pathRoot)), and use one of its combinators to perform your logging or provide a default value in the error case.
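For illustration, a minimal sketch of that call-site handling, assuming it replaces the last line of getRangeData and reuses the same logger and dataFrameNull already defined there:
import scala.util.{Failure, Success, Try}

Try(unionRangeData(rangePeriod, pathRoot)) match {
  case Success(df) => df
  case Failure(e) =>
    logger.error(s"path not exist: ${e.getMessage}")
    dataFrameNull
}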
Related post which explains why the Scala compiler cannot perform tail call optimization inside try-catch:
Why won't Scala optimize tail call with try/catch?


request timeout from flatMapping over cats.effect.IO

I am attempting to transform some data that is encapsulated in cats.effect.IO with a Map that also is in an IO monad. I'm using http4s with blaze server and when I use the following code the request times out:
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
  implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
  implicit val shiftJsonReader = new Reader[ShiftJson] {
    def read(value: JValue): ShiftJson = value.extract[ShiftJson]
  }
  implicit val shiftJsonDec = jsonOf[IO, ShiftJson]

  // get the shifts
  var getDbShifts: IO[List[Shift]] = shiftModel.findByUserId(userId)

  // use the userRoleId to get the RoleId then get the tasks for this role
  val taskMap: IO[Map[String, Double]] = taskModel.findByUserId(userId).flatMap {
    case tskLst: List[Task] => IO(tskLst.map((task: Task) => (task.name -> task.standard)).toMap)
  }

  val traversed: IO[List[Shift]] = for {
    shifts <- getDbShifts
    traversed <- shifts.traverse((shift: Shift) => {
      val lstShiftJson: IO[List[ShiftJson]] = read[List[ShiftJson]](shift.roleTasks)
        .map((sj: ShiftJson) =>
          taskMap.flatMap((tm: Map[String, Double]) =>
            IO(ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / tm.get(sj.name).get)))
        ).sequence

      //TODO: this flatMap is bricking my request
      lstShiftJson.flatMap((sjLst: List[ShiftJson]) => {
        IO(Shift(shift.id, shift.shiftDate, shift.shiftStart, shift.shiftEnd,
          shift.lunchDuration, shift.shiftDuration, shift.breakOffProd, shift.systemDownOffProd,
          shift.meetingOffProd, shift.trainingOffProd, shift.projectOffProd, shift.miscOffProd,
          write[List[ShiftJson]](sjLst), shift.userRoleId, shift.isApproved, shift.score, shift.comments
        ))
      })
    })
  } yield traversed

  traversed.flatMap((sLst: List[Shift]) => Ok(write[List[Shift]](sLst)))
}
As you can see from the TODO comment, I've narrowed the problem in this method down to the flatMap below it. If I remove that flatMap and merely return IO(shift) to the traversed variable, the request does not time out; however, that doesn't help me much, because I need to make use of the lstShiftJson variable, which holds my transformed JSON.
My intuition tells me I'm abusing the IO monad somehow, but I'm not quite sure how.
Thank you for your time in reading this!
So with the guidance of Luis's comment I refactored my code to the following. I don't think it is optimal (i.e. the flatMap at the end seems unnecessary, but I couldn't figure out how to remove it), but it's the best I've got.
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
  implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
  implicit val shiftJsonReader = new Reader[ShiftJson] {
    def read(value: JValue): ShiftJson = value.extract[ShiftJson]
  }
  implicit val shiftJsonDec = jsonOf[IO, ShiftJson]

  // FOR EACH SHIFT
  // - read the shift.roleTasks into a ShiftJson object
  // - divide each task value by the task.standard where task.name = shiftJson.name
  // - write the list of shiftJson back to a string
  val traversed = for {
    taskMap <- taskModel.findByUserId(userId).map((tList: List[Task]) => tList.map((task: Task) => (task.name -> task.standard)).toMap)
    shifts <- shiftModel.findByUserId(userId)
    traversed <- shifts.traverse((shift: Shift) => {
      val lstShiftJson: List[ShiftJson] = read[List[ShiftJson]](shift.roleTasks)
        .map((sj: ShiftJson) => ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / taskMap.get(sj.name).get))

      shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
      IO(shift)
    })
  } yield traversed

  traversed.flatMap((t: List[Shift]) => Ok(write[List[Shift]](t)))
}
Luis mentioned that mapping my List[Task] to a Map[String, Double] is a pure operation, so we want to use map instead of flatMap.
He mentioned that by wrapping every operation that comes from the database in IO I was causing a great deal of recomputation (including DB transactions).
To solve this issue I moved all of the database operations inside the for-comprehension; using the "<-" operator to flatMap each of the return values lets those values reside within the IO monad, preventing the recomputation experienced before.
I do think there must be a better way of returning my result value. flatMapping the "traversed" variable to get back inside of the IO monad seems like unnecessary recomputation, so please, anyone, correct me.
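For reference, a sketch of how that trailing flatMap could be folded into the for-comprehension, since Ok(...) itself returns an IO[Response[IO]]. This assumes the same models, json4s serializers, and cats/http4s imports and implicits as the code above:
def getScoresByUserId(userId: Int): IO[Response[IO]] =
  for {
    taskMap <- taskModel.findByUserId(userId)
                 .map(_.map(task => task.name -> task.standard).toMap)
    shifts <- shiftModel.findByUserId(userId)
    updated <- shifts.traverse { shift =>
                 // divide each task value by its standard, exactly as above
                 val lstShiftJson = read[List[ShiftJson]](shift.roleTasks)
                   .map(sj => ShiftJson(sj.name, sj.taskType, sj.label,
                     sj.value.toString.toDouble / taskMap(sj.name)))
                 shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
                 IO(shift)
               }
    response <- Ok(write[List[Shift]](updated))
  } yield response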

How to define a function in scala for flatMap

I'm new to Scala, and I want to try to rewrite some flatMap code by calling a function instead of writing the whole process inline inside "()".
The original code is like:
val longForm = summary.flatMap(row => {
  /*This is the code I want to replace with a function*/
  val metric = row.getString(0)
  (1 until row.size).map { i =>
    (metric, schema(i).name, row.getString(i).toDouble)
  }
}/*End of function*/)
The function I wrote is:
def tfunc(line: Row): List[Any] = {
  val metric = line.getString(0)
  var res = List[Any]
  for (i <- 1 to line.size) {
    /*Save each iteration result as a List[tuple], then append to the res List.*/
    val tup = (metric, schema(i).name, line.getString(i).toDouble)
    val tempList = List(tup)
    res = res :: tempList
  }
  res
}
The function did not pass compilation, with the following error:
error: missing argument list for method apply in object List
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing apply _ or apply(_) instead of apply.
var res = List[Any]
What is wrong with this function?
And for flatMap, is returning the result as a List the right way?
You haven't explained why you want to replace that code block. Is there a particular goal you're after? There are many, many different ways that block could be rewritten. How can we know which would better meet your requirements?
Here's one approach.
def tfunc(line: Row): List[(String, String, Double)] = {
  val metric = line.getString(0)
  List.tabulate(line.tail.length) { idx =>
    (metric, schema(idx + 1).name, line.getString(idx + 1).toDouble)
  }
}
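Hypothetical usage, assuming summary is the same collection of Rows as in the question:
val longForm = summary.flatMap(tfunc)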

Use of Scala Loan pattern in Success Case

I'm following the tutorial from Alvin Alexander to use the Loan Pattern.
Here is the code that I use:
val year = 2016
val nationalData = {
  val source = io.Source.fromFile(s"resources/Babynames/names/yob$year.txt")
  // names is iterator of String, split() gives the array
  //.toArray & toSeq is a slow process compare to .toSet // .toSeq gives Stream Closed error
  val names = source.getLines().filter(_.nonEmpty).map(_.split(",")(0)).toSet
  source.close()
  names
  // println(names.mkString(","))
}
println("Names " + nationalData)

val info = for (stateFile <- new java.io.File("resources/Babynames/namesbystate").list(); if stateFile.endsWith(".TXT")) yield {
  val source = io.Source.fromFile("resources/Babynames/namesbystate/" + stateFile)
  val names = source.getLines().filter(_.nonEmpty).map(_.split(",")).
    filter(a => a(2).toInt == year).map(a => a(3)).toArray // .toSet
  source.close()
  (stateFile.take(2), names)
}

println(info(0)._2.size + " names from state " + info(0)._1)
println(info(1)._2.size + " names from state " + info(1)._1)

for ((state, sname) <- info) {
  println("State: " + state + " Coverage of name in " + year + " " + sname.count(n => nationalData.contains(n)).toDouble / nationalData.size) // Set doesn't have length method
}
This is how I applied readTextFile and readTextFileWithTry to the code above in order to learn/experiment with the Loan Pattern:
def using[A <: { def close(): Unit }, B](resource: A)(f: A => B): B =
  try {
    f(resource)
  } finally {
    resource.close()
  }

def readTextFile(filename: String): Option[List[String]] = {
  try {
    val lines = using(fromFile(filename)) { source =>
      (for (line <- source.getLines) yield line).toList
    }
    Some(lines)
  } catch {
    case e: Exception => None
  }
}

def readTextFileWithTry(filename: String): Try[List[String]] = {
  Try {
    val lines = using(fromFile(filename)) { source =>
      (for (line <- source.getLines) yield line).toList
    }
    lines
  }
}
val year = 2016

val data = readTextFile(s"resources/Babynames/names/yob$year.txt") match {
  case Some(lines) =>
    val n = lines.filter(_.nonEmpty).map(_.split(",")(0)).toSet
    println(n)
  case None => println("couldn't read file")
}

val data1 = readTextFileWithTry("resources/Babynames/namesbystate")
data1 match {
  case Success(lines) => {
    val info = for (stateFile <- data1; if stateFile.endsWith(".TXT")) yield {
      val source = fromFile("resources/Babynames/namesbystate/" + stateFile)
      val names = source.getLines().filter(_.nonEmpty).map(_.split(",")).
        filter(a => a(2).toInt == year).map(a => a(3)).toArray // .toSet
      (stateFile.take(2), names)
      println(names)
    }
  }
But in the second case, readTextFileWithTry, I am getting the following error -
Failed, message is: java.io.FileNotFoundException: resources\Babynames\namesbystate (Access is denied)
From what I understand from SO, I guess the reason for the failure is:
I am trying to open the same file on each iteration of the for loop
Apart from that, I have a few concerns regarding my usage:
Is this a good way to use it? Can someone help me understand how I can use Try on multiple occasions?
I tried to change the return type of readTextFileWithTry to something like Option[A] or a Set/Map or another Scala collection so that I could apply higher-order functions on it later, but I was not able to succeed. I'm not sure whether that is good practice or not.
How can I use higher-order functions in the Success case? There are multiple operations, so the code block in the Success case gets bigger, and I can't use any field outside of the Success case.
Can someone help me to understand?
I think that your problem has nothing to do with "I am trying to open the same file on each iteration of the for loop" and it is actually the same as in the accepted answer.
Unfortunately you didn't provide a stack trace, so it is not clear on which line this happens. I would guess that the failing call is
val data1 = readTextFileWithTry("resources/Babynames/namesbystate")
And looking at your first code sample:
val info = for (stateFile <- new java.io.File("resources/Babynames/namesbystate").list(); if stateFile.endsWith(".TXT")) yield {
it looks like the path "resources/Babynames/namesbystate" points to a directory. But in your second example you are trying to read it as a file and this is the reason for the error. It comes from the fact that your readTextFileWithTry is not a valid substitute for java.io.File.list call. And File.list doesn't need a wrapper because it doesn't use any intermediate closeable/disposable entity.
P.S. it might make more sense to use File.list(FilenameFilter filter) instead of if stateFile.endsWith(".TXT"))
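For illustration, a sketch of that suggestion (assuming the same directory layout as in the question):
import java.io.{File, FilenameFilter}

val stateFiles: Array[String] =
  new File("resources/Babynames/namesbystate")
    .list(new FilenameFilter {
      // keep only the per-state data files
      override def accept(dir: File, name: String): Boolean = name.endsWith(".TXT")
    })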

Spark custom encoder for dataframe

I know about How to store custom objects in Dataset? but still, it is not really clear to me how to build this custom encoder which properly serializes to multiple fields. Manually, I created some functions https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L122-L154 which map a Polygon back and forth between Dataset - RDD - Dataset by mapping the objects to primitive types Spark can handle, i.e. a tuple (String, Int) (edit: full code below).
For example, to go from the Polygon Object to a tuple of (String, Int) I use the following
def writeSerializableWKT(iterator: Iterator[AnyRef]): Iterator[(String, Int)] = {
  val writer = new WKTWriter()
  iterator.flatMap(cur => {
    val cPoly = cur.asInstanceOf[Polygon]
    // TODO is it efficient to create this collection? Is this a proper iterator 2 iterator transformation?
    List((writer.write(cPoly), cPoly.getUserData.asInstanceOf[Int])).iterator
  })
}
def createSpatialRDDFromLinestringDataSet(geoDataset: Dataset[WKTGeometryWithPayload]): RDD[Polygon] = {
  geoDataset.rdd.mapPartitions(iterator => {
    val reader = new WKTReader()
    iterator.flatMap(cur => {
      try {
        reader.read(cur.lineString) match {
          case p: Polygon => {
            val polygon = p.asInstanceOf[Polygon]
            polygon.setUserData(cur.payload)
            List(polygon).iterator
          }
          case _ => throw new NotImplementedError("Multipolygon or others not supported")
        }
      } catch {
        case e: ParseException =>
          logger.error("Could not parse")
          logger.error(e.getCause)
          logger.error(e.getMessage)
          None
      }
    })
  })
}
I noticed that I already start to do a lot of work twice (see the link to both methods). Now, wanting to be able to handle https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L82-L84 (full code below)
val joinResult = JoinQuery.SpatialJoinQuery(objectRDD, minimalPolygonCustom, true)
// joinResult.map()
val joinResultCounted = JoinQuery.SpatialJoinQueryCountByKey(objectRDD, minimalPolygonCustom, true)
which is a PairRDD[Polygon, HashSet[Polygon]], or respectively a PairRDD[Polygon, Int], how would I need to specify my functions as an Encoder in order to not solve the same problem 2 more times?
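For reference, a heavily hedged sketch of the generic fallback described in the linked question (How to store custom objects in Dataset?): a kryo-based encoder. It stores the Polygon as a single binary column rather than multiple readable fields, so it may not be the final answer here, but it avoids repeating the manual tuple conversion. The Polygon import path is an assumption based on the project:
import org.apache.spark.sql.{Encoder, Encoders}
import com.vividsolutions.jts.geom.Polygon // assumed JTS import used by the project

// hypothetical: lets Spark serialize Polygon opaquely into one binary column
implicit val polygonEncoder: Encoder[Polygon] = Encoders.kryo[Polygon]

With that implicit in scope, Dataset[Polygon] values can be built without the (String, Int) round trip, at the cost of an opaque column.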

Scala extending while loops to do-until expressions

I'm trying to do an experiment with Scala. I'd like to repeat this (randomized) experiment until the expected result comes out, and get that result. If I do this with either a while or a do-while loop, then I need to write something like this (suppose 'body' represents the experiment and 'cond' indicates whether the result is expected):
do {
  val result = body
} while (!cond(result))
It does not work, however, since the last condition cannot refer to local variables from the loop body. We need to modify this control abstraction a little bit like this:
def repeat[A](body: => A)(cond: A => Boolean): A = {
  val result = body
  if (cond(result)) result else repeat(body)(cond)
}
It works somehow but is not perfect for me since I need to call this method by passing two parameters, e.g.:
val result = repeat(body)(a => ...)
I'm wondering whether there is a more efficient and natural way to do this so that it looks more like a built-in structure:
val result = do { body } until (a => ...)
One excellent solution for body without a return value is found in this post: How Does One Make Scala Control Abstraction in Repeat Until?, the last one-liner answer. Its body part in that answer does not return a value, so the until can be a method of the new AnyRef object, but that trick does not apply here, since we want to return A rather than AnyRef. Is there any way to achieve this? Thanks.
You're mixing programming styles and getting in trouble because of it.
Your loop is only good for heating up your processor unless you do some sort of side effect within it.
do {
  val result = bodyThatPrintsOrSomething
} while (!cond(result))
So, if you're going with side-effecting code, just put the condition into a var:
var result: Whatever = _
do {
  result = bodyThatPrintsOrSomething
} while (!cond(result))
or the equivalent:
var result = bodyThatPrintsOrSomething
while (!cond(result)) result = bodyThatPrintsOrSomething
Alternatively, if you take a functional approach, you're going to have to return the result of the computation anyway. Then use something like:
Iterator.continually{ bodyThatGivesAResult }.takeWhile(cond)
(there is a known annoyance of Iterator not doing a great job at taking all the good ones plus the first bad one in a list).
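A hedged variant, assuming the same bodyThatGivesAResult and cond, that directly yields the first result satisfying cond (which matches the question's goal):
val result = Iterator.continually(bodyThatGivesAResult).dropWhile(a => !cond(a)).next()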
Or you can use your repeat method, which is tail-recursive. If you don't trust that it is, check the bytecode (with javap -c), add the #annotation.tailrec annotation so the compiler will throw an error if it is not tail-recursive, or write it as a while loop using the var method:
def repeat[A](body: => A)(cond: A => Boolean): A = {
  var a = body
  while (cond(a)) { a = body }
  a
}
With a minor modification you can turn your current approach in a kind of mini fluent API, which results in a syntax that is close to what you want:
class run[A](body: => A) {
  def until(cond: A => Boolean): A = {
    val result = body
    if (cond(result)) result else until(cond)
  }
}

object run {
  def apply[A](body: => A) = new run(body)
}
Since do is a reserved word, we have to go with run. The result would now look like this:
run {
  // body with a result type A
} until (a => ...)
Edit:
I just realized that I almost reinvented what was already proposed in the linked question. One possibility to extend that approach to return a type A instead of Unit would be:
def repeat[A](body: => A) = new {
  def until(condition: A => Boolean): A = {
    var a = body
    while (!condition(a)) { a = body }
    a
  }
}
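A quick usage sketch of that version (with a hypothetical random draw; the structural type may emit a reflective-calls feature warning):
val x: Int = repeat { scala.util.Random.nextInt(100) } until (_ > 90)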
Just to document a derivative of the suggestions made earlier, I went with a tail-recursive implementation of repeat { ... } until(...) that also included a limit to the number of iterations:
def repeat[A](body: => A) = new {
  def until(condition: A => Boolean, attempts: Int = 10): Option[A] = {
    if (attempts <= 0) None
    else {
      val a = body
      if (condition(a)) Some(a)
      else until(condition, attempts - 1)
    }
  }
}
This allows the loop to bail out after attempts executions of the body:
scala> import java.util.Random
import java.util.Random
scala> val r = new Random()
r: java.util.Random = java.util.Random@cb51256
scala> repeat { r.nextInt(100) } until(_ > 90, 4)
res0: Option[Int] = Some(98)
scala> repeat { r.nextInt(100) } until(_ > 90, 4)
res1: Option[Int] = Some(98)
scala> repeat { r.nextInt(100) } until(_ > 90, 4)
res2: Option[Int] = None
scala> repeat { r.nextInt(100) } until(_ > 90, 4)
res3: Option[Int] = None
scala> repeat { r.nextInt(100) } until(_ > 90, 4)
res4: Option[Int] = Some(94)