Spark custom encoder for DataFrame - Scala

I know about How to store custom objects in Dataset?, but it is still not really clear to me how to build a custom encoder which properly serializes to multiple fields. Manually, I created some functions https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L122-L154 which map a Polygon back and forth between Dataset - RDD - Dataset by mapping the objects to primitive types Spark can handle, i.e. a tuple (String, Int) (edit: full code below).
For example, to go from a Polygon object to a tuple of (String, Int) I use the following:
def writeSerializableWKT(iterator: Iterator[AnyRef]): Iterator[(String, Int)] = {
  val writer = new WKTWriter()
  iterator.flatMap(cur => {
    val cPoly = cur.asInstanceOf[Polygon]
    // TODO is it efficient to create this collection? Is this a proper iterator-to-iterator transformation?
    List((writer.write(cPoly), cPoly.getUserData.asInstanceOf[Int])).iterator
  })
}
def createSpatialRDDFromLinestringDataSet(geoDataset: Dataset[WKTGeometryWithPayload]): RDD[Polygon] = {
  geoDataset.rdd.mapPartitions(iterator => {
    val reader = new WKTReader()
    iterator.flatMap(cur => {
      try {
        reader.read(cur.lineString) match {
          case p: Polygon =>
            val polygon = p.asInstanceOf[Polygon]
            polygon.setUserData(cur.payload)
            List(polygon).iterator
          case _ => throw new NotImplementedError("Multipolygon or others not supported")
        }
      } catch {
        case e: ParseException =>
          logger.error("Could not parse")
          logger.error(e.getCause)
          logger.error(e.getMessage)
          None
      }
    })
  })
}
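For completeness, the opposite direction (RDD back to Dataset) looks roughly like the sketch below. It is not the exact code from the repo; it reuses writeSerializableWKT from above, assumes the case class WKTGeometryWithPayload(lineString: String, payload: Int) whose fields are used elsewhere in this question, and the helper name is made up here:

// Sketch: turn an RDD[Polygon] back into a Dataset via WKT strings
def createDataSetFromPolygonRDD(rdd: RDD[Polygon])(implicit spark: SparkSession): Dataset[WKTGeometryWithPayload] = {
  import spark.implicits._
  rdd
    .mapPartitions(writeSerializableWKT)
    .map { case (wkt, userData) => WKTGeometryWithPayload(wkt, userData) }
    .toDS
}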
I noticed that I already start to do a lot of work twice (see the link to both methods). Now, wanting to be able to handle https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L82-L84 (full code below)
val joinResult = JoinQuery.SpatialJoinQuery(objectRDD, minimalPolygonCustom, true)
// joinResult.map()
val joinResultCounted = JoinQuery.SpatialJoinQueryCountByKey(objectRDD, minimalPolygonCustom, true)
which is a PairRDD[Polygon, HashSet[Polygon]] or, respectively, a PairRDD[Polygon, Int], how would I need to specify my functions as an Encoder in order not to solve the same problem two more times?
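For reference, the closest ready-made thing I am aware of is a Kryo-based binary encoder. This is only a fallback sketch, assuming the JTS Polygon used above: it collapses the whole object into a single binary column rather than the multiple typed fields I am asking about.

import org.apache.spark.sql.{Encoder, Encoders}
import com.vividsolutions.jts.geom.Polygon

// Fallback sketch: Kryo serializes the Polygon into ONE binary column,
// so it does not produce the multi-field schema asked about here.
implicit val polygonEncoder: Encoder[Polygon] = Encoders.kryo[Polygon]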

Related

How to convert multiple strings in a list to be keys in a map

I am trying to write a function that would return a map in which every word is a key and the values are the pages on which the word shows up. Currently, I am stuck at the point where I have data of the following type: List((List(words), page)).
Is there any sensible way to reformat this data? If so, please explain, as I have no idea how to even begin.
object G {
  def main(args: Array[String]): Unit = {
    stwórzIndeks()
  }

  def stwórzIndeks(): Unit = {
    val linie = io.Source
      .fromResource("tekst.txt")
      .getLines
      .toList
    val zippedLinie: List[(String, Int)] = linie.zipWithIndex
    val splitt = zippedLinie.foldLeft(List.empty[(List[String], Int)])((acc, curr) => {
      curr match {
        case (arr, int) =>
          val toAdd = (arr.split("\\s+").toList, zippedLinie.length - int)
          toAdd +: acc
      }
    })
  }
}
You can replace that foldLeft with a flatMap and an inner map to get a flat List of (word, page) pairs.
val wordsAndPage = zippedLinie.flatMap {
  case (line, idx) =>
    line.split("\\s+").toList.map(word => word -> (idx + 1))
}
After that you can check for one of the grouping methods in the scaladoc.
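For example, on Scala 2.13 groupMap does this in one step (a sketch; on 2.12 a groupBy followed by mapping over the values achieves the same):

// word -> list of pages on which it appears
val index: Map[String, List[Int]] = wordsAndPage.groupMap(_._1)(_._2)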

request timeout from flatMapping over cats.effect.IO

I am attempting to transform some data that is encapsulated in cats.effect.IO with a Map that is also in an IO monad. I'm using http4s with the Blaze server, and when I use the following code the request times out:
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
  implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
  implicit val shiftJsonReader = new Reader[ShiftJson] {
    def read(value: JValue): ShiftJson = value.extract[ShiftJson]
  }
  implicit val shiftJsonDec = jsonOf[IO, ShiftJson]
  // get the shifts
  var getDbShifts: IO[List[Shift]] = shiftModel.findByUserId(userId)
  // use the userRoleId to get the RoleId then get the tasks for this role
  val taskMap: IO[Map[String, Double]] = taskModel.findByUserId(userId).flatMap {
    case tskLst: List[Task] => IO(tskLst.map((task: Task) => (task.name -> task.standard)).toMap)
  }
  val traversed: IO[List[Shift]] = for {
    shifts    <- getDbShifts
    traversed <- shifts.traverse((shift: Shift) => {
      val lstShiftJson: IO[List[ShiftJson]] = read[List[ShiftJson]](shift.roleTasks)
        .map((sj: ShiftJson) =>
          taskMap.flatMap((tm: Map[String, Double]) =>
            IO(ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / tm.get(sj.name).get)))
        ).sequence
      // TODO: this flatMap is bricking my request
      lstShiftJson.flatMap((sjLst: List[ShiftJson]) => {
        IO(Shift(shift.id, shift.shiftDate, shift.shiftStart, shift.shiftEnd,
          shift.lunchDuration, shift.shiftDuration, shift.breakOffProd, shift.systemDownOffProd,
          shift.meetingOffProd, shift.trainingOffProd, shift.projectOffProd, shift.miscOffProd,
          write[List[ShiftJson]](sjLst), shift.userRoleId, shift.isApproved, shift.score, shift.comments
        ))
      })
    })
  } yield traversed
  traversed.flatMap((sLst: List[Shift]) => Ok(write[List[Shift]](sLst)))
}
As you can see from the TODO comment, I've narrowed this method down to the flatMap below that comment. If I remove that flatMap and merely return IO(shift) to the traversed variable, the request does not time out; however, that doesn't help me much because I need to make use of the lstShiftJson variable, which holds my transformed JSON.
My intuition tells me I'm abusing the IO monad somehow, but I'm not quite sure how.
Thank you for your time in reading this!
So, with the guidance of Luis's comment, I refactored my code to the following. I don't think it is optimal (i.e. the flatMap at the end seems unnecessary, but I couldn't figure out how to remove it), but it's the best I've got.
def getScoresByUserId(userId: Int): IO[Response[IO]] = {
  implicit val formats = DefaultFormats + ShiftJsonSerializer() + RawShiftSerializer()
  implicit val shiftJsonReader = new Reader[ShiftJson] {
    def read(value: JValue): ShiftJson = value.extract[ShiftJson]
  }
  implicit val shiftJsonDec = jsonOf[IO, ShiftJson]
  // FOR EACH SHIFT
  // - read the shift.roleTasks into a ShiftJson object
  // - divide each task value by the task.standard where task.name = shiftJson.name
  // - write the list of shiftJson back to a string
  val traversed = for {
    taskMap   <- taskModel.findByUserId(userId).map((tList: List[Task]) => tList.map((task: Task) => (task.name -> task.standard)).toMap)
    shifts    <- shiftModel.findByUserId(userId)
    traversed <- shifts.traverse((shift: Shift) => {
      val lstShiftJson: List[ShiftJson] = read[List[ShiftJson]](shift.roleTasks)
        .map((sj: ShiftJson) => ShiftJson(sj.name, sj.taskType, sj.label, sj.value.toString.toDouble / taskMap.get(sj.name).get))
      shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
      IO(shift)
    })
  } yield traversed
  traversed.flatMap((t: List[Shift]) => Ok(write[List[Shift]](t)))
}
Luis mentioned that mapping my List[Task] to a Map[String, Double] is a pure operation, so we want to use map instead of flatMap.
He also mentioned that I was wrapping every value that comes from the database in IO again, which was causing a great deal of recomputation (including DB transactions).
To solve this issue I moved all of the database operations inside of my for comprehension; using the "<-" operator to flatMap each of the return values lets the values in use reside within the IO monad, hence preventing the recomputation experienced before.
I do think there must be a better way of returning my result. flatMapping the "traversed" variable to get back inside the IO monad seems unnecessary, so please correct me if anyone knows a better way.
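One way to fold that last step into the comprehension (a sketch only, using the same implicits and models as above, and relying on the fact that Ok(...) from the http4s DSL is itself an IO[Response[IO]]) would be to bind the response as the final step:

// Sketch: the response is produced inside the same for-comprehension,
// so no separate flatMap on `traversed` is needed afterwards.
val response: IO[Response[IO]] = for {
  taskMap   <- taskModel.findByUserId(userId)
                 .map(_.map(task => task.name -> task.standard).toMap)
  shifts    <- shiftModel.findByUserId(userId)
  traversed <- shifts.traverse { shift =>
                 val lstShiftJson = read[List[ShiftJson]](shift.roleTasks)
                   .map(sj => ShiftJson(sj.name, sj.taskType, sj.label,
                     sj.value.toString.toDouble / taskMap(sj.name)))
                 shift.roleTasks = write[List[ShiftJson]](lstShiftJson)
                 IO(shift)
               }
  resp      <- Ok(write[List[Shift]](traversed))
} yield resp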

try/catch not working when use tail recursive function

I'm building a tail-recursive function that reads multiple HDFS paths and merges all of them into a single DataFrame. The function works perfectly as long as all the paths exist; if not, the function fails and does not finish joining the data of the paths that do exist. To solve this problem I have tried to handle the error using try/catch but have not been successful.
The error says: could not optimize @tailrec annotated method loop: it contains a recursive call not in tail position
My function is :
def getRangeData(toOdate: String, numMonths: Int, pathRoot: String, ColumnsTable: List[String]): DataFrame = {
  val dataFrameNull = spark.createDataFrame(spark.sparkContext.emptyRDD[Row],
    StructType((ColumnsTable :+ "odate").map(columnName => StructField(columnName, StringType, true))))
  val rangePeriod = getRangeDate(numMonths, toOdate)

  @tailrec
  def unionRangeData(rangePeriod: List[LocalDate], pathRoot: String, df: DataFrame = dataFrameNull): DataFrame = {
    try {
      if (rangePeriod.isEmpty) {
        df
      }
      else {
        val month = "%02d".format(rangePeriod.head.getMonthValue)
        val year = rangePeriod.head.getYear
        val odate = rangePeriod.head.toString
        val path = s"${pathRoot}/partition_data_year_id=${year}/partition_data_month_id=${month}"
        val columns = ColumnsTable.map(columnName => trim(col(columnName)).as(columnName))
        val dfTemporal = spark.read.parquet(path).select(columns: _*).withColumn("odate", lit(odate).cast("date"))
        unionRangeData(rangePeriod.tail, pathRoot, df.union(dfTemporal))
      }
    } catch {
      case e: Exception =>
        logger.error("path not exist")
        dataFrameNull
    }
  }

  unionRangeData(rangePeriod, pathRoot)
}
def getRangeDate(numMonths: Int, toOdate: String, listDate: List[LocalDate] = List()): List[LocalDate] = {
  if (numMonths == 0) {
    listDate
  }
  else {
    getRangeDate(numMonths - 1, toOdate, LocalDate.parse(toOdate).plusMonths(1).minusMonths(numMonths) :: listDate)
  }
}
In advance, thank you very much for your help.
I would suggest you remove the try-catch construct entirely from the function and use it instead at the call site at the bottom of getRangeData.
Alternatively you can also use scala.util.Try to wrap the call: Try(unionRangeData(rangePeriod, pathRoot)), and use one of its combinators to perform your logging or provide a default value in the error case.
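For example, a sketch reusing the names from the question:

import scala.util.{Failure, Success, Try}

// The error handling lives at the call site, so the recursive helper
// itself contains no try/catch and can stay @tailrec.
Try(unionRangeData(rangePeriod, pathRoot)) match {
  case Success(df) => df
  case Failure(e)  =>
    logger.error(s"path not exist: ${e.getMessage}")
    dataFrameNull
}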
Related post which explains why the Scala compiler cannot perform tail call optimization inside try-catch:
Why won't Scala optimize tail call with try/catch?

Pattern matching and RDDs

I have a very simple (n00b) question, but I'm somehow stuck. I'm trying to read a set of files in Spark with wholeTextFiles and want to return an RDD[LogEntry], where LogEntry is just a case class. I want to end up with an RDD of valid entries, and I need to use a regular expression to extract the parameters for my case class. When an entry is not valid, I do not want the extractor logic to fail but simply to write an entry to a log. For that I use LazyLogging.
object LogProcessors extends LazyLogging {

  def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[Option[CleaningLogEntry]] = {
    val pattern = "<some pattern>".r
    val logs = sc.wholeTextFiles(path, numPartitions)
    val entries = logs.map(fileContent => {
      val file = fileContent._1
      val content = fileContent._2
      content.split("\\r?\\n").map(line => line match {
        case pattern(dt, ev, seq) => Some(LogEntry(<...>))
        case _ => logger.error(s"Cannot parse $file: $line"); None
      })
    })
That gives me an RDD[Array[Option[LogEntry]]]. Is there a neat way to end up with an RDD of the LogEntrys? I'm somehow missing it.
I was thinking about using Try instead, but I'm not sure if that's any better.
Thoughts greatly appreciated.
To get rid of the Array - simply replace the map command with flatMap - flatMap will treat a result of type Traversable[T] for each record as separate records of type T.
To get rid of the Option - collect only the successful ones: entries.collect { case Some(entry) => entry }.
Note that this collect(p: PartialFunction) overload (which performs something equivalent to a map and a filter combined) is very different from collect() (which sends all data to the driver).
Altogether, this would be something like:
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[CleaningLogEntry] = {
  val pattern = "<some pattern>".r
  val logs = sc.wholeTextFiles(path, numPartitions)
  val entries = logs.flatMap(fileContent => {
    val file = fileContent._1
    val content = fileContent._2
    content.split("\\r?\\n").map(line => line match {
      case pattern(dt, ev, seq) => Some(LogEntry(<...>))
      case _ => logger.error(s"Cannot parse $file: $line"); None
    })
  })

  entries.collect { case Some(entry) => entry }
}

Scala case class modularization

I'm new to Scala and I have a requirement to refactor / modularize my code.
My code looks like this:
case class dim1(col1: String, col2: Int, col3: String)
val dim1 = sc.textFile("s3n://dim1").map { row =>
  val parts = row.split("\t")
  dim1(parts(0), parts(1).toInt, parts(2)) }

case class dim2(col1: String, col2: Int)
val dim2 = sc.textFile("s3n://dim2").map { row =>
  val parts = row.split("\t")
  dim2(parts(0), parts(1).toInt) }

case class dim3(col1: String, col2: Int, col3: String, col4: Int)
val dim3 = sc.textFile("s3n://dim3").map { row =>
  val parts = row.split("\t")
  dim3(parts(0), parts(1).toInt, parts(2), parts(3).toInt) }

case class dim4(col1: String, col2: String, col3: Int)
val dim4 = sc.textFile("s3n://dim4").map { row =>
  val parts = row.split("\t")
  dim4(parts(0), parts(1), parts(2).toInt) }
This is ETL transform code in Scala that runs on Apache Spark.
Here are the steps that I have:
Define a case class for every dimension.
Read a file from S3 and map it to the respective case class. I also need to change the datatype if required.
These steps are highly repetitive, and I would like to write a function like this:
readAndMap(datasetlocation: String, caseclassnametomap: String)
With this, my code will become:
readAndMap("s3n://dim1",dim1)
readAndMap("s3n://dim2",dim2)
readAndMap("s3n://dim3",dim3)
readAndMap("s3n://dim4",dim4)
Some examples / directions would be highly appreciated.
Thanks
You can do something like this:
def readAndMap[A](datasetLocation: String)(createA: List[String] => A) = {
  sc.textFile(datasetLocation).map { row =>
    createA(row.split("\t").toList)
  }
}
You can call it like this:
readAndMap[dim1]("s3n://dim1"){ parts => dim1(parts(0),parts(1).toInt,parts(2)) }
readAndMap[dim2]("s3n://dim2"){ parts => dim2(parts(0),parts(1).toInt) }
readAndMap[dim3]("s3n://dim3"){ parts => dim3(parts(0),parts(1).toInt,parts(2),parts(3).toInt) }
readAndMap[dim4]("s3n://dim4"){ parts => dim4(parts(0),parts(1),parts(2).toInt) }
You cannot directly pass the case class and ask the method to construct an instance, because the arities of the case classes' apply methods differ from each other.
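To illustrate with a quick sketch: dim2.apply has type (String, Int) => dim2 while dim3.apply has type (String, Int, String, Int) => dim3, so there is no single parameter type that could accept the companion of every case class, whereas a List[String] => A function has the same shape for each dimension:

// Sketch: one explicit adapter per case class, all with the same shape
val mkDim2: List[String] => dim2 = parts => dim2(parts(0), parts(1).toInt)
val dim2Rdd = readAndMap[dim2]("s3n://dim2")(mkDim2)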