Scala case class modularization

I'm new to Scala and I have a requirement to refactor/modularize my code.
My code looks like this:
case class dim1(col1: String, col2: Int, col3: String)
val dim1 = sc.textFile("s3n://dim1").map { row =>
  val parts = row.split("\t")
  dim1(parts(0), parts(1).toInt, parts(2)) }

case class dim2(col1: String, col2: Int)
val dim2 = sc.textFile("s3n://dim2").map { row =>
  val parts = row.split("\t")
  dim2(parts(0), parts(1).toInt) }

case class dim3(col1: String, col2: Int, col3: String, col4: Int)
val dim3 = sc.textFile("s3n://dim3").map { row =>
  val parts = row.split("\t")
  dim3(parts(0), parts(1).toInt, parts(2), parts(3).toInt) }

case class dim4(col1: String, col2: String, col3: Int)
val dim4 = sc.textFile("s3n://dim4").map { row =>
  val parts = row.split("\t")
  dim4(parts(0), parts(1), parts(2).toInt) }
This is Scala ETL transform code that runs on Apache Spark.
Here are the steps that I have:
Define a case class for every dimension.
Read a file from S3 and map it to the respective case class, converting data types where required.
These steps are highly repetitive, and I would like to write a function like this:
readAndMap(datasetlocation: String,caseclassnametomap: String)
With this, my code would become:
readAndMap("s3n://dim1",dim1)
readAndMap("s3n://dim2",dim2)
readAndMap("s3n://dim3",dim3)
readAndMap("s3n://dim4",dim4)
Some examples/directions would be highly appreciated.
Thanks

You can do something like this:
def readAndMap[A](datasetLocation: String)(createA: List[String] => A) = {
  sc.textFile(datasetLocation).map { row =>
    createA(row.split("\t").toList)
  }
}
You can call it like this:
readAndMap[dim1]("s3n://dim1"){ parts => dim1(parts(0),parts(1).toInt,parts(2)) }
readAndMap[dim2]("s3n://dim2"){ parts => dim2(parts(0),parts(1).toInt) }
readAndMap[dim3]("s3n://dim3"){ parts => dim3(parts(0),parts(1).toInt,parts(2),parts(3).toInt) }
readAndMap[dim4]("s3n://dim4"){ parts => dim4(parts(0),parts(1),parts(2).toInt) }
You cannot directly pass the case class and ask the method to construct an instance, because the arities of the case classes' apply methods differ from one another.
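Because createA sits in its own parameter list, the type parameter is usually inferred from the mapping function, so the explicit [dim1]-style annotation can be dropped. A minimal usage sketch (the val name is just illustrative):

val d1 = readAndMap("s3n://dim1") { parts => dim1(parts(0), parts(1).toInt, parts(2)) }
// d1: org.apache.spark.rdd.RDD[dim1]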

Related

Spark custom encoder for dataframe

I know about How to store custom objects in Dataset? but it is still not really clear to me how to build a custom encoder which properly serializes to multiple fields. Manually, I created some functions (https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L122-L154) which map a Polygon back and forth between Dataset, RDD, and Dataset by mapping the objects to primitive types Spark can handle, i.e. a tuple (String, Int) (edit: full code below).
For example, to go from a Polygon object to a (String, Int) tuple, I use the following:
def writeSerializableWKT(iterator: Iterator[AnyRef]): Iterator[(String, Int)] = {
  val writer = new WKTWriter()
  iterator.flatMap(cur => {
    val cPoly = cur.asInstanceOf[Polygon]
    // TODO is it efficient to create this collection? Is this a proper iterator-to-iterator transformation?
    List((writer.write(cPoly), cPoly.getUserData.asInstanceOf[Int])).iterator
  })
}
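Regarding the TODO above: since each element maps to exactly one tuple, a plain map should be enough and avoids allocating a one-element List per record. A sketch under the same assumptions (JTS WKTWriter, Polygon with an Int payload in getUserData):

def writeSerializableWKT(iterator: Iterator[AnyRef]): Iterator[(String, Int)] = {
  val writer = new WKTWriter()
  // one output tuple per input polygon, so map is a proper iterator-to-iterator transformation
  iterator.map { cur =>
    val cPoly = cur.asInstanceOf[Polygon]
    (writer.write(cPoly), cPoly.getUserData.asInstanceOf[Int])
  }
}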
def createSpatialRDDFromLinestringDataSet(geoDataset: Dataset[WKTGeometryWithPayload]): RDD[Polygon] = {
  geoDataset.rdd.mapPartitions(iterator => {
    val reader = new WKTReader()
    iterator.flatMap(cur => {
      try {
        reader.read(cur.lineString) match {
          case p: Polygon => {
            val polygon = p.asInstanceOf[Polygon]
            polygon.setUserData(cur.payload)
            List(polygon).iterator
          }
          case _ => throw new NotImplementedError("Multipolygon or others not supported")
        }
      } catch {
        case e: ParseException =>
          logger.error("Could not parse")
          logger.error(e.getCause)
          logger.error(e.getMessage)
          None
      }
    })
  })
}
I noticed that I am already starting to do a lot of work twice (see the links to both methods). Now I want to be able to handle the result of https://github.com/geoHeil/geoSparkScalaSample/blob/master/src/main/scala/myOrg/GeoSpark.scala#L82-L84 (full code below):
val joinResult = JoinQuery.SpatialJoinQuery(objectRDD, minimalPolygonCustom, true)
// joinResult.map()
val joinResultCounted = JoinQuery.SpatialJoinQueryCountByKey(objectRDD, minimalPolygonCustom, true)
which is a PairRDD[Polygon, HashSet[Polygon]] or a PairRDD[Polygon, Int], respectively. How would I need to specify my functions as an Encoder so that I do not have to solve the same problem two more times?
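For context (not from the question itself), the usual fallback from the linked answer is a kryo-based encoder. A minimal sketch, assuming the JTS Polygon class and Spark's Encoders API; note that it serializes each object to a single binary column rather than to multiple typed fields, which is exactly the limitation the question is trying to get around:

import com.vividsolutions.jts.geom.Polygon
import org.apache.spark.sql.{Encoder, Encoders}

// hypothetical sketch: makes Dataset[Polygon] possible, but stores each Polygon as one binary blob
implicit val polygonEncoder: Encoder[Polygon] = Encoders.kryo[Polygon]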

Pattern matching and RDDs

I have a very simple (n00b) question but I'm somehow stuck. I'm trying to read a set of files in Spark with wholeTextFiles and want to return an RDD[LogEntry], where LogEntry is just a case class. I want to end up with an RDD of valid entries, and I need to use a regular expression to extract the parameters for my case class. When an entry is not valid, I do not want the extractor logic to fail but simply to write an entry to a log. For that I use LazyLogging.
object LogProcessors extends LazyLogging {

  def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[Option[CleaningLogEntry]] = {
    val pattern = "<some pattern>".r
    val logs = sc.wholeTextFiles(path, numPartitions)

    val entries = logs.map(fileContent => {
      val file = fileContent._1
      val content = fileContent._2
      content.split("\\r?\\n").map(line => line match {
        case pattern(dt, ev, seq) => Some(LogEntry(<...>))
        case _ => logger.error(s"Cannot parse $file: $line"); None
      })
    })
That gives me an RDD[Array[Option[LogEntry]]]. Is there a neat way to end up with an RDD of the LogEntry objects? I'm somehow missing it.
I was thinking about using Try instead, but I'm not sure if that's any better.
Thoughts greatly appreciated.
To get rid of the Array, simply replace the map command with flatMap: flatMap will treat a result of type Traversable[T] for each record as separate records of type T.
To get rid of the Option, collect only the successful ones: entries.collect { case Some(entry) => entry }.
Note that this collect(p: PartialFunction) overload (which performs something equivalent to a map and a filter combined) is very different from collect() (which sends all data to the driver).
Altogether, this would be something like:
def extractLogs(sc: SparkContext, path: String, numPartitions: Int = 5): RDD[CleaningLogEntry] = {
  val pattern = "<some pattern>".r
  val logs = sc.wholeTextFiles(path, numPartitions)

  val entries = logs.flatMap(fileContent => {
    val file = fileContent._1
    val content = fileContent._2
    content.split("\\r?\\n").map(line => line match {
      case pattern(dt, ev, seq) => Some(LogEntry(<...>))
      case _ => logger.error(s"Cannot parse $file: $line"); None
    })
  })

  entries.collect { case Some(entry) => entry }
}
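An equivalent sketch (same pattern and LogEntry placeholders as above): because Option is implicitly iterable here, the per-line Options can be flattened in the same flatMap, so the trailing collect is not needed:

val entries = logs.flatMap { case (file, content) =>
  content.split("\\r?\\n").flatMap {
    case pattern(dt, ev, seq) => Some(LogEntry(<...>))
    case line => logger.error(s"Cannot parse $file: $line"); None
  }
}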

Counting pattern in scala list

My list looks like the following: List(Person, Invite, Invite, Person, Invite, Person, ...). I am trying to match based on an inviteCountRequired, meaning that the number of Invite objects following each Person object in the list is variable. What is the best way of doing this? My match code so far looks like this:
aList match {
  case List(Person(_,_,_), Invitee(_,_,_), _*) => ...
  case _ => ...
}
First stack question, please go easy on me.
Let
val aList = List(Person(1), Invite(2), Invite(3), Person(2), Invite(4), Person(3), Invite(6), Invite(7))
Then index each location in the list and select the Person instances:
val persons = (aList zip Stream.from(0)).filter {_._1.isInstanceOf[Person]}
namely List((Person(1),0), (Person(2),3), (Person(3),5)). Then define sublists whose lower bound corresponds to a Person instance:
val intervals = persons.map{_._2}.sliding(2,1).toArray
res31: Array[List[Int]] = Array(List(0, 3), List(3, 5))
Construct the sublists:
val latest = aList.drop(intervals.last.last) // last Person and Invitees not paired
val associations = intervals.map { case List(pa, pb, _*) => aList.slice(pa, pb) } :+ latest
Hence the result looks like
Array(List(Person(1), Invite(2), Invite(3)), List(Person(2), Invite(4)), List(Person(3), Invite(6), Invite(7)))
Now,
associations.map { a =>
  val person = a.take(1)
  val invitees = a.drop(1)
  // ...
}
This approach may be seen as a variable-size sliding window.
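If all that is needed is the invite count per person (for the inviteCountRequired check mentioned in the question), the grouped sublists make this immediate; a sketch using the associations computed above:

val inviteCounts = associations.map(group => (group.head, group.size - 1))
// Array((Person(1),2), (Person(2),1), (Person(3),2))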
Thanks for your tips. I ended up creating another case class:
case class BallotInvites(person:Person,invites:List[Any])
Then, I populated it from the original list:
def constructBallotList(ballots: List[Any]): List[BallotInvites] = {
  ballots.zipWithIndex.collect {
    case (iv: Ballot, i) =>
      BallotInvites(iv,
        ballots.distinct.takeRight(ballots.distinct.length - (i + 1)).takeWhile({
          case y: Invitee => true
          case y: Person => true
          case y: Ballot => false
        })
      )
  }
}
val l = Ballot.constructBallotList(ballots)
Then to count based on inviteCountRequired, I did the following:
val count = l.count(b => (b.invites.count(x => x.isInstanceOf[Person]) / contest.inviteCountRequired) > 0)
I am not sure I understand the domain, but you should only need to iterate once to construct a list of (person, invites) tuples.
sealed trait PorI
case class P(i: Int) extends PorI
case class I(i: Int) extends PorI

val l: List[PorI] = List(P(1), I(1), I(1), P(2), I(2), P(3), P(4), I(4))

val res = l.foldLeft(List.empty[(P, List[I])])({ case (res, t) =>
  t match {
    case p @ P(_) => (p, List.empty[I]) :: res // a person starts a new group
    case i @ I(_) =>                           // an invite is attached to the most recent person
      val head :: tail = res
      (head._1, i :: head._2) :: tail
  }
})

res // List((P(4),List(I(4))), (P(3),List()), (P(2),List(I(2))), (P(1),List(I(1), I(1))))
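Since the fold prepends, the result is in reverse order relative to the input; if the original ordering matters, reverse the outer list and each inner list. A small sketch:

val ordered = res.reverse.map { case (p, invites) => (p, invites.reverse) }
// List((P(1),List(I(1), I(1))), (P(2),List(I(2))), (P(3),List()), (P(4),List(I(4))))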

Scala extractors: a cumbersome example

Everything started from a couple of considerations:
Extractors are Scala objects that implement an unapply method with certain peculiarities (directly from «Programming in Scala, 2nd edition», I've checked).
Objects are singletons, lazily initialised in the static scope.
I've tried to implement a sort of «parametric extractor» in the form of case classes, to get an elegant pattern for SHA1 checking.
I'd like to check a list of SHA1s against a buffer to see which of them match. I'd like to write something like this:
val sha1: Array[Byte] = ...
val sha2: Array[Byte] = ...

buffer match {
  case SHA1(sha1) => ...
  case SHA1(sha2) => ...
  ...
}
OK, it looks weird, but bear with me for now.
I've tried to solve the problem by simply implementing a case class like this:
case class SHA1(sha1: Array[Byte]) {
  def unapply(buffer: Array[Byte]): Boolean = ...
}
and use it like
case SHA1(sha1)() =>
and even
case (SHA1(sha1)) =>
but it doesn't work: the compiler rejects both.
Then I changed the code a little:
val sha1 = SHA1(sha1)
val sha2 = SHA1(sha2)

buffer match {
  case sha1() => println("sha1 Match")
  case sha2() => println("sha2 Match")
  ...
}
and it works without any issue.
Questions are:
Q1: Are there any subtle implications in using this kind of «extractor»?
Q2: Given that the last example works, which syntax should I have used to avoid defining temporary vals (if the compiler supports any in match…case expressions)?
EDIT
The solution proposed by Aaron doesn't work either. A snippet:
case class SHA1(sha1: Array[Byte]) {
  def unapply(buffer: Array[Byte]) = buffer.length % 2 == 0
}

object Sha1Sample {
  def main(args: Array[String]) {
    println("Sha1 Sample")
    val b1: Array[Byte] = Array(0, 1, 2)
    val b2: Array[Byte] = Array(0, 1, 2, 3)
    val sha1 = SHA1(b1)
    List(b1, b2) map { b =>
      b match {
        case sha1() => println("Match") // works
        case `sha1` => println("Match") // compiles but is semantically incorrect
        case SHA1(`b1`) => println("SOLVED") // won't compile
        case _ => println("Doesn't Match")
      }
    }
  }
}
Short answer: you need to put backticks around lowercase identifiers if you don't want them to be interpreted as pattern variables.
case SHA1(`sha1`) => // ...
See this question.
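For completeness, a sketch of the val-based matcher with a concrete comparison (the names and the byte-wise check are only illustrative; Scala array equality is reference equality, so java.util.Arrays.equals is used to compare contents):

case class SHA1(expected: Array[Byte]) {
  // Boolean unapply: `case matcher() =>` succeeds when the check passes
  def unapply(buffer: Array[Byte]): Boolean =
    java.util.Arrays.equals(buffer, expected) // stand-in for whatever digest check is actually needed
}

val sha1Matcher = SHA1(sha1)
buffer match {
  case sha1Matcher() => println("sha1 match")
  case _ => println("no match")
}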

map over structure with only partial match

I have a tree-like structure of abstract classes and case classes representing an Abstract Syntax Tree of a small language.
For the top abstract class I've implemented a method map:
abstract class AST {
  ...
  def map(f: (AST => AST)): AST = {
    val b1 = this match {
      case s: STRUCTURAL => s.smap(f) // structural node, for example IF(expr, truebranch, falsebranch)
      case _ => this                  // leaf, like ASSIGN(x, 2)
    }
    f(b1)
  }
  ...
The smap is defined like:
override def smap(f: AST => AST) = {
  this.copy(trueb = trueb.map(f), falseb = falseb.map(f))
}
Now I'm writing different "transformations" to insert, remove, and change nodes in the AST.
For example, remove adjacent NOP nodes from blocks:
def handle_list(l: List[AST]): List[AST] = l match {
  case (NOP :: NOP :: tl) => handle_list(tl)
  case h :: tl => h :: handle_list(tl)
  case Nil => Nil
}

ast.map {
  case BLOCK(listofstatements) => BLOCK(handle_list(listofstatements))
}
If I write it like this, I end up with a MatchError, which I can "fix" by changing the above map to:
ast.map {
  case BLOCK(listofstatements) => BLOCK(handle_list(listofstatements))
  case a => a
}
Should I just live with all those case a => a or could I improve my map method(or other parts) in some way?
Make the argument to map a PartialFunction:
def map(f: PartialFunction[AST, AST]): AST = {
  val idAST: PartialFunction[AST, AST] = { case a => a }
  val g = f.orElse(idAST)
  val b1 = this match {
    case s: STRUCTURAL => s.smap(g)
    case _ => this
  }
  g(b1)
}
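With the PartialFunction version, the call site no longer needs the catch-all case; a usage sketch with the BLOCK transformation from the question:

ast.map {
  case BLOCK(listofstatements) => BLOCK(handle_list(listofstatements))
}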
If tree transformations are more than a minor aspect of your project, I highly recommend you use Kiama's Rewriter module to implement them. It implements Stratego-like strategy-driven transformations. It has a very rich set of strategies and strategy combinators that permit a complete separation of traversal logic (which for the vast majority of cases can be taken "off the shelf" from the supplied strategies and combinators) from (local) transformations (which are specific to your AST and you supply, of course).