Scala IO exception handling

I am trying to read a comma-separated file in Scala and convert it into a list of JSON objects. The code works if all the records are valid. How do I catch the exception for the records which are not valid in my function below? If a record is not valid it should throw an exception and continue reading the file, but in my case, once an invalid record comes, the application stops.
def parseFile(file: String): List[JsObject] = {
  val bufferedSource = Source.fromFile(file)
  try {
    bufferedSource.getLines().map(line => {
      val cols = line.split(",").map(_.trim)
      Json.obj("Name" -> cols(0), "Media" -> cols(1), "Gender" -> cols(2), "Age" -> cols(3).toInt) // Need to handle the exception here.
    }).toList
  } finally {
    bufferedSource.close()
  }
}

I think you may benefit from using Option (http://danielwestheide.com/blog/2012/12/19/the-neophytes-guide-to-scala-part-5-the-option-type.html) and the Try object (http://danielwestheide.com/blog/2012/12/26/the-neophytes-guide-to-scala-part-6-error-handling-with-try.html)
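In short, Try wraps a throwing expression, and toOption collapses a failure into None:

  scala.util.Try("42".toInt).toOption // Some(42)
  scala.util.Try("4x".toInt).toOption // None (the NumberFormatException is captured)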
What you are doing here is stopping all work when an error happens (i.e. you throw to outside the map). A better option is to isolate the failure and return some object that we can filter out. Below is a quick implementation I made:
package csv

import play.api.libs.json.{JsObject, Json}

import scala.io.Source
import scala.util.Try

object CsvParser extends App {

  // Because a bad row can either have != 4 columns or an age that is not an Int,
  // we return an Option that will be ignored by our caller.
  def toTuple(array: Array[String]): Option[(String, String, String, Int)] = {
    array match {
      // if our array has exactly 4 columns
      case Array(name, media, gender, age) => Try((name, media, gender, age.toInt)).toOption
      // any other array size will be ignored
      case _ => None
    }
  }

  def toJson(line: String): Option[JsObject] = {
    val cols = line.split(",").map(_.trim)
    toTuple(cols) match {
      case Some((name: String, media: String, gender: String, age: Int)) =>
        Some(Json.obj("Name" -> name, "Media" -> media, "Gender" -> gender, "Age" -> age))
      case _ => None
    }
  }

  def parseFile(file: String): List[JsObject] = {
    val bufferedSource = Source.fromFile(file)
    try { bufferedSource.getLines().map(toJson).toList.flatten } finally { bufferedSource.close() }
  }

  parseFile("my/csv/file/path")
}
The above code will ignore any rows that do not have exactly 4 columns. It will also contain the NumberFormatException thrown by .toInt.
The idea is to isolate the failure and pass back some type that the caller can either work with when the row was parsed...or ignore when a failure happened.
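For example, feeding toJson a few hypothetical lines shows how bad rows simply vanish from the result:

  toJson("Alice,TV,Female,34")            // Some(JsObject) with all four fields
  toJson("Bob,Radio")                     // None: not exactly 4 columns
  toJson("Carol,Web,Female,not-a-number") // None: the age is not an Int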

Related

Using Scala groupBy(), from method fetchUniqueCodesForARootCode(). I want to get a map from rootCodes to lists of uniqueCodes

I want to return Future[Map[String, List[String]]] from the fetchUniqueCodesForARootCode method.
import scala.concurrent._
import ExecutionContext.Implicits.global

case class DiagnosisCode(rootCode: String, uniqueCode: String, description: Option[String] = None)

object Database {
  private val data: List[DiagnosisCode] = List(
    DiagnosisCode("A00", "A001", Some("Cholera due to Vibrio cholerae")),
    DiagnosisCode("A00", "A009", Some("Cholera, unspecified")),
    DiagnosisCode("A08", "A080", Some("Rotaviral enteritis")),
    DiagnosisCode("A08", "A083", Some("Other viral enteritis"))
  )

  def getAllUniqueCodes: Future[List[String]] = Future {
    Database.data.map(_.uniqueCode)
  }

  def fetchDiagnosisForUniqueCode(uniqueCode: String): Future[Option[DiagnosisCode]] = Future {
    Database.data.find(_.uniqueCode.equalsIgnoreCase(uniqueCode))
  }
}
getAllUniqueCodes returns all unique codes from the data List.
fetchDiagnosisForUniqueCode returns a DiagnosisCode when the uniqueCode matches.
From fetchDiagnosisForUniqueCodes, I am returning Future[List[DiagnosisCode]] using getAllUniqueCodes and fetchDiagnosisForUniqueCode(uniqueCode).
def fetchDiagnosisForUniqueCodes: Future[List[DiagnosisCode]] = {
  val xa: Future[List[Future[DiagnosisCode]]] =
    Database.getAllUniqueCodes.map { (xs: List[String]) =>
      xs.map { (uq: String) =>
        Database.fetchDiagnosisForUniqueCode(uq)
      }
    }.map(n => n.map(y => y.map(_.get)))
  xa.flatMap { listOfFuture =>
    Future.sequence(listOfFuture)
  }
}
Now, def fetchUniqueCodesForARootCode should return Future[Map[String, List[String]]] using fetchDiagnosisForUniqueCodes and groupBy.
Here is the method
def fetchUniqueCodesForARootCode: Future[Map[String, List[String]]] = {
  fetchDiagnosisForUniqueCodes.map { x =>
    x.groupBy(x => (x.rootCode, x.uniqueCode))
  }
}
I need to get the below result from fetchUniqueCodesForARootCode:
A00 -> List(A001, A009), H26 -> List(H26001, H26002), B15 -> List(B150, B159)
It's hard to decode from the question description what the problem is, but if I understood correctly, you want to get a map from rootCodes to lists of uniqueCodes.
The groupBy method takes a function that for every element returns its key. So first you have to group by the rootCodes and then you have to use map to get the correct values.
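For example:

  List("apple", "avocado", "banana").groupBy(_.head)
  // Map('a' -> List("apple", "avocado"), 'b' -> List("banana"))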
groupBy definition: https://dotty.epfl.ch/api/scala/collection/IterableOps.html#groupBy-f68
scastie: https://scastie.scala-lang.org/KacperFKorban/PL1X3joNT3qNOTm6OQ3VUQ
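Putting those two steps together, a minimal sketch (using the Database definitions from the question) might look like:

  def fetchUniqueCodesForARootCode: Future[Map[String, List[String]]] =
    fetchDiagnosisForUniqueCodes.map { diagnoses =>
      diagnoses
        .groupBy(_.rootCode)             // key every DiagnosisCode by its rootCode
        .map { case (root, codes) =>     // then keep only the uniqueCodes as values
          root -> codes.map(_.uniqueCode)
        }
    }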

Comparing the json data types at runtime using Jackson and Scala

I have incoming JSON data that looks like below:
{"id":"1000","premium":29999,"eventTime":"2021-12-22 00:00:00"}
Now, I have created a class that will accept this record and check whether the data types of the incoming record match the data types defined in the case class. However, when I call the method, it always hits the Failure part of the match case.
// imports assumed for this snippet; ScalaObjectMapper's package varies across jackson-module-scala versions
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._ // implicit TypeInformation for OutputTag
import org.apache.flink.util.Collector

import scala.util.{Failure, Success, Try}

case class Premium(id: String, premium: Long, eventTime: String)

class Splitter extends ProcessFunction[String, Premium] {
  val outputTag = new OutputTag[String]("failed")

  def fromJson[T](json: String)(implicit m: Manifest[T]): Either[String, T] = {
    Try {
      println("inside")
      lazy val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.readValue[T](json)
    } match {
      case Success(x)   => Right(x)
      case Failure(err) => Left(json)
    }
  }

  override def processElement(i: String, context: ProcessFunction[String, Premium]#Context, collector: Collector[Premium]): Unit = {
    fromJson(i) match {
      case Right(data) =>
        collector.collect(data)
        println("Good Records: " + data)
      case Left(json) =>
        context.output(outputTag, json)
        println("Bad Records: " + json)
    }
  }
}
Based on the sample record above, it should hit the Success case, but no matter what I pass, it always enters the Failure part. What else is missing?
I am using Scala 2.11.12 and I tried examples from this link and this link but no luck.
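One likely culprit (an assumption, since the actual exception isn't shown): calling fromJson(i) without a type parameter lets the compiler infer T as Nothing, so Jackson never targets Premium at all. Pinning the type explicitly is worth trying:

  // hypothesis: make the implicit Manifest resolve to Manifest[Premium], not Manifest[Nothing]
  fromJson[Premium](i) match {
    case Right(data) => collector.collect(data)
    case Left(json)  => context.output(outputTag, json)
  }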

TypeInformation in Flink

I have a pipeline in place where data is being sent from Flink to a Kafka topic in JSON format. I was also able to get it back from the Kafka topic and extract the JSON attributes as well. Now, like Scala reflect classes, where I can compare data types at runtime, I was trying to do the same thing in Flink using TypeInformation, where I can set some predefined format against which whatever data is read from the topic gets validated, passing or failing accordingly.
I have data like below:
{"policyName":"String", "premium":2400, "eventTime":"2021-12-22 00:00:00" }
For my problem, I came across a couple of examples in Flink's book where it is mentioned how to create a TypeInformation variable, but nothing was mentioned on how to use it, so I tried my own way:
val objectMapper = new ObjectMapper()

val tupleType: TypeInformation[(String, String, String)] =
  Types.TUPLE[(String, Int, String)]
println(tupleType.getTypeClass)

src.map(v => v)
  .map { x =>
    val policyName: String = objectMapper.readTree(x).get("policyName").toString()
    val premium: Int = objectMapper.readTree(x).get("premium").toString().toInt
    val eventTime: String = objectMapper.readTree(x).get("eventTime").toString()
    if ((policyName, premium, eventTime) == tupleType.getTypeClass) {
      println("Good Record: " + (policyName, premium, eventTime))
    } else {
      println("Bad Record: " + (id, category, eventTime))
    }
  }
Now if I pass the input as below to the Flink Kafka producer:
{"policyName":"whatever you feel like","premium":"4000","eventTime":"2021-12-20 00:00:00"}
It should give me the expected output as a "Bad record" and the tuple, since the datatype of premium is String and not Long/Int.
If I pass the input as below:
{"policyName":"whatever you feel like","premium":4000,"eventTime":"2021-12-20 00:00:00"}
It should give me the output as "Good Record" and the tuple.
But according to my code, it is always giving me the else part.
If I create a datastream variable and store the results of the above map and then compare like below then it gives me the correct result:
if (tupleType == datas.getType()) { // where 'datas' is a datastream
  print("Good Records")
} else {
  println("Bad Records")
}
But I want to send the good/bad records to different streams, or maybe have them directly inserted into a Cassandra table. That is why I am using loops to identify the records one by one. Is my way correct? What would be the best practice considering what I am trying to achieve?
Based on Dominik's inputs, I tried creating my own CustomDeserializer class:
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation

import java.nio.charset.StandardCharsets

class sample extends DeserializationSchema[String] {
  override def deserialize(message: Array[Byte]): Tuple3[Int, String, String] = {
    val data = new String(message, StandardCharsets.UTF_8)
    val objectMapper = new ObjectMapper()
    val id: Int = objectMapper.readTree(data).get("id").toString().toInt
    val category: String = objectMapper.readTree(data).get("Category").toString()
    val eventTime: String = objectMapper.readTree(data).get("eventTime").toString()
    return (id, category, eventTime)
  }

  override def isEndOfStream(t: String): Boolean = ???

  override def getProducedType: TypeInformation[String] = TypeInformation.of(classOf[String])
}
I want to try to implement something like below:
src.map(v => v)
  .map { x =>
    if (new sample().deserialize(x) == true) {
      println("Good Record: " + (id, category, eventTime))
    } else {
      println("Bad Record: " + (id, category, eventTime))
    }
  }
But the input is in Array[Byte] form, so how can I implement it? Where exactly am I going wrong? What needs to be modified? This is my first ever attempt at Flink Scala custom classes.
I don't really think that using TypeInformation to do what You want is the best idea. You can simply use something like ProcessFunction that will accept a JSON String and then use the ObjectMapper to deserialize the JSON to a class with the expected structure. You can output the correctly deserialized objects from the ProcessFunction, and the Strings that failed deserialization can be passed as a side output, since they will be Your Bad Records.
This could look like the code below; note that it uses jackson-module-scala to perform deserialization to a case class. You can find more info here
case class Premium(policyName: String, premium: Long, eventTime: String)

class Splitter extends ProcessFunction[String, Premium] {
  val outputTag = new OutputTag[String]("failed")

  def fromJson[T](json: String)(implicit m: Manifest[T]): Either[String, T] = {
    Try {
      lazy val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.readValue[T](json)
    } match {
      case Success(x)   => Right(x)
      case Failure(err) => Left(json)
    }
  }

  override def processElement(i: String, context: ProcessFunction[String, Premium]#Context, collector: Collector[Premium]): Unit = {
    fromJson(i) match {
      case Right(data) => collector.collect(data)
      case Left(json)  => context.output(outputTag, json)
    }
  }
}
Then You can use the outputTag to get the side output data from the stream to get incorrect records.
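As a sketch of the wiring (the stream and variable names here are assumptions):

  // run the Splitter, then pull the failed raw Strings from the side output
  val parsed = src.process(new Splitter())
  val badRecords = parsed.getSideOutput(new OutputTag[String]("failed")) // tags match by name and type
  badRecords.print() // or sink them to Cassandra, a dead-letter topic, etc.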

Scala Action Map Implementation Issue (follow up)

This is a fairly long-winded question and a follow-up to my last one.
I have the following code for an application being built - I am looking to call the function in handleOne, but it is not working in the action map. I think this is due to the Unit return type assigned to statesVotes in the handler. The goal is to create a menu-driven application that performs a set of desired functions. The function in question here is: get all the state values and display them suitably formatted.
I may potentially have to make the states into a map, but I am looking for the same functionality as the case class.
import scala.io.StdIn.readInt

object myApp3 extends App {

  val dataRE = "([^(]+) \\((\\d+)\\),(.+)".r
  val pVotes = "([^:]+):(\\d+)".r

  case class State(name    : String
                  ,code    : Int
                  ,parties : Array[(String, Int)])

  val states: List[State] =
    util.Using(io.Source.fromFile("filename.txt"))(_.getLines().toList)
      .get // will throw if read file fails
      .collect { case dataRE(name, code, votes) =>
        State(name.trim
             ,code.toInt
             ,votes.split(",")
                   .collect { case pVotes(p, v) => (p, v.toInt) })
      }

  val actionMap = Map[Int, () => Boolean](1 -> handleOne)

  var opt = 0
  do {
    opt = readOption
  } while (menu(opt))

  def readOption: Int = {
    println(
      """|Please select one of the following:
         | 1 - Show All States and Votes
         | 2 - CW Option 2
         | 3 - quit""".stripMargin)
    readInt()
  }

  def menu(option: Int): Boolean = {
    actionMap.get(option) match {
      case Some(f) => f()
      case None =>
        println("Command not recognized!")
        true
    }
  }

  // handleOne calls function mnuShowStatesVotes, which invokes function statesVotes
  def handleOne(): Boolean = {
    mnuShowStatesVotes(statesVotes: List[State])
    true
  }

  def mnuShowStatesVotes(f: () => List[State]) = {
    f() foreach (println())
  }

  def statesVotes = states.sortBy(_.name) // alphabetical order of states
    .foreach { st =>
      println(st.name) // print the state name on its own line
      st.parties
        .sortBy(-_._2) // sorts parties by votes in descending order
        .map { case (p, v) => f"\t$p%-12s:$v%9d" }
        .foreach(println)
    }
}
Essentially, I want the menu option handleOne to correctly invoke the function in statesVotes.
The text file being used can be found below:
Alabama (9),Democratic:849624,Republican:1441170,Libertarian:25176,Others:7312
Alaska (3),Democratic:153778,Republican:189951,Libertarian:8897,Others:6904
Arizona (11),Democratic:1672143,Republican:1661686,Libertarian:51465,Green:1557,Others:475
It seems to me that your code would benefit from adopting a clear and distinct separation of roles and responsibilities.
Let's get the preliminaries taken care of.
import scala.util.{Try, Success, Failure, Using}

case class State(name    : String
                ,code    : Int
                ,parties : Array[(String, Int)])
Now let's parse the input data.
This code has one job to do: load the data from the input file. It takes one parameter, the input filename, and returns either Success() with the accumulated data, or Failure() with the error exception.
def readFile(filename: String): Try[List[State]] = {
  val dataRE = "([^(]+) \\((\\d+)\\),(.+)".r
  val pVotes = "([^:]+):(\\d+)".r

  Using(io.Source.fromFile(filename)) {
    _.getLines()
     .toList
     .collect { case dataRE(name, code, votes) =>
       State(name.trim
            ,code.toInt
            ,votes.split(",")
                  .collect { case pVotes(p, v) => (p, v.toInt) })
     }
  }
}
Note that collect() will simply ignore file data that doesn't fit the expected format. If you were to use map() instead, then bad input data would cause a Failure().
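A quick illustration of the difference:

  // collect silently drops elements its partial function doesn't match...
  List("1", "two", "3").collect { case s if s.forall(_.isDigit) => s.toInt }
  // -> List(1, 3)
  // ...whereas mapping .toInt over the same list would throw on "two".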
Now let's put all the output methods, and their descriptions, under one roof. This is most of what the user will see.
class Menu(states: List[State]) {

  def apply(key: String): Boolean = {
    val (_, op, continue) = lookup(key)
    op()
    continue
  }

  private val lookup: Map[String, (String, () => Unit, Boolean)] =
    Map("?"    -> ("show this menu", menu _, true)
       ,"menu" -> ("show this menu", menu _, true)
       ,"all"  -> ("display all voting data", all _, true)
       ,"st"   -> ("vote totals by state", stVotes _, true)
       ,"x"    -> ("exit", done _, false)
       ,"quit" -> ("exit", done _, false)
       ).withDefaultValue(("", unknown _, true))

  private def done(): Unit = println("bye")

  private def unknown(): Unit =
    println("unknown selection ('?' for main menu)")

  private def menu(): Unit =
    lookup.keys.toVector.sorted
          .map(k => s"$k\t: ${lookup(k)._1}")
          .foreach(println)

  private def all(): Unit =
    states.sortBy(_.name) // alphabetical
          .foreach { st =>
            println(st.name) // state name
            st.parties
              .sortBy(-_._2) // votes in decreasing order
              .map { case (p, v) => f"\t$p%-12s:$v%9d" }
              .foreach(println)
          }

  private def stVotes(): Unit =
    states.map(st => (st.name, st.parties.map(_._2).sum))
          .sortBy(-_._2) // votes in decreasing order
          .map { case (state, total) => f"$state%-9s:$total%8d" }
          .foreach(println)
}
Notice that only the apply() method is public. Everything else is private and under wraps.
To create a new data report you just add an entry in the lookup Map and add the new method to produce the output.
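For example, a hypothetical report of states ordered by electoral code would only need one new entry and one private method:

  // hypothetical new entry for the lookup Map:
  //   "code" -> ("states by electoral code", byCode _, true)
  private def byCode(): Unit =
    states.sortBy(_.code)
          .map(st => f"${st.code}%3d : ${st.name}")
          .foreach(println)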
Now all we need is the code to tie the pieces together and to take user input.
def main(args: Array[String]): Unit =
  args.headOption.map(readFile) match {
    case None =>
      println(s"usage: ${this.productPrefix} <data_file>")
    case Some(Failure(exc)) =>
      println(s"Error reading data file: $exc")
    case Some(Success(stateData)) =>
      val menu = new Menu(stateData)
      menu("menu")
      Iterator.continually(menu(io.StdIn.readLine(">> ").toLowerCase))
              .dropWhile(identity)
              .next()
  }
Note that this.productPrefix is made available if the surrounding object is a case object.
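A small demonstration of that detail:

  case object myCli { // hypothetical name
    // case objects extend Product, so productPrefix yields the object's own name
    def usage: String = s"usage: ${this.productPrefix} <data_file>"
  }
  // myCli.usage == "usage: myCli <data_file>"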

Scala pattern matching on generic Map

What's the best way to handle generics and erasure when doing pattern matching in Scala (on a Map, in my case)? I am looking for a proper implementation without compiler warnings. I have a function that I want to return Map[Int, Seq[String]] from. Currently the code looks like:
import scala.collection.mutable

def teams: Map[Int, Seq[String]] = {
  val dateam = new scala.collection.mutable.HashMap[Int, Seq[String]]
  // data.attributes is Map[String, Object] returned from JSON parsing (jackson-module-scala)
  val teamz = data.attributes.get("team_players")
  if (teamz.isDefined) {
    val x = teamz.get
    try {
      x match {
        case m: mutable.Map[_, _] =>
          m.foreach { kv =>
            kv._1 match {
              case teamId: String =>
                kv._2 match {
                  case team: Seq[_] =>
                    val tid: Int = teamId.toInt
                    dateam.put(tid, team.map(s => s.toString))
                }
            }
          }
      }
    } catch {
      case e: Exception =>
        logger.error("Unable to convert the team_players (%s) attribute.".format(x), e)
    }
    dateam
  } else {
    logger.warn("Missing team_players attribute in: %s".format(data.attributes))
  }
  dateam.toMap
}
Use a Scala library to handle it. There are some based on Jackson (Play's ScalaJson, for instance -- see this article on using it stand-alone), as well as libraries not based on Jackson (my preference is Argonaut, though you could also go with Spray-Json).
These libraries, and others, solve this problem. Doing it by hand is awkward and prone to errors, so don't do it.
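As a sketch of the library route using Play JSON (the attribute name comes from the question; treating the payload as a parsed JsValue rather than a Map[String, Object] is an assumption):

  import play.api.libs.json._

  def teams(attributes: JsValue): Map[Int, Seq[String]] =
    (attributes \ "team_players").validate[Map[String, Seq[String]]] match {
      case JsSuccess(m, _) => m.map { case (id, players) => id.toInt -> players }
      case JsError(_)      => Map.empty // or log, mirroring the original's error handling
    }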
It could be reasonable to use a for comprehension (with some built-in pattern matching). We can also take into account that a Map is a list of tuples, in our case of (String, Object) type. For this example we will also ignore probable exceptions, so:
import scala.collection.mutable.HashMap

def convert(json: Map[String, Object]): HashMap[Int, Seq[String]] = {
  val converted = for {
    (id: String, description: Seq[Any]) <- json
  } yield (id.toInt, description.map(_.toString))
  HashMap[Int, Seq[String]](converted.toSeq: _*)
}
So our for comprehension takes into account only tuples of (String, Seq[Any]) type, then yields the String converted to Int together with the Seq[Any] converted to Seq[String], and finally builds a mutable Map.
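For example, with some hypothetical data:

  val json: Map[String, Object] = Map(
    "1" -> Seq("Alice", "Bob"),
    "2" -> Seq("Carol")
  )
  convert(json) // HashMap(1 -> List(Alice, Bob), 2 -> List(Carol))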