Applying multiple map functions to streaming database results in Play 2.6 - scala

I have a large query that seems to be a prime candidate for streaming results.
I would like to call a function that returns an object to which I can apply additional map transformations, and then ultimately convert the entire result into a list. This is because the conversions will result in a set of objects much smaller than the rows in the database, and there are many different transformations that must take place sequentially. Processing one result at a time will save me significant memory.
For example, if the results from the database were a stream (though the correct thing is likely an AkkaStream or an Iteratee), then I could do something like:
def outer(converter1: String => Int, converter2: Int => Double): List[Double] = {
  val sqlIterator = getSqlIterator()
  val mappedIterator1 = sqlIterator.map(x => converter1(x.bigColumn))
  val mappedIterator2 = mappedIterator1.map(x => converter2(x))
  val retVal = mappedIterator2.toList
  retVal
}
def getSqlIterator() = {
  val selectedObjects = SQL("""SELECT * FROM table""").map { x =>
    val id = x[Long]("id")
    val tinyColumn = x[String]("tiny_column")
    val bigColumn = x[String]("big_column")
    NewObject(id, tinyColumn, bigColumn)
  }
  val transformed = UNKNOWN_FUNCTION(selectedObjects)
  transformed
}
Most of the documentation appears to provide a mechanism for applying a "reduce" function to the results rather than a "map" function, but the resulting mapped objects will be much smaller than the raw results, saving me significant memory. What should I do for UNKNOWN_FUNCTION?

The following is a simple example of using Anorm's Akka Streams support to read the values from a single column of type String, applying two transformations to each element, and placing the results in a Seq. I'll leave it as an exercise for you to retrieve the values from multiple columns at a time, if that's what you need.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import anorm._
import scala.collection.immutable.Seq
import scala.concurrent.Future
implicit val system = ActorSystem("MySystem")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher
val convertStringToInt: String => Int = ???
val convertIntToDouble: Int => Double = ???
val result: Future[Seq[Double]] =
  AkkaStream.source(SQL"SELECT big_column FROM table", SqlParser.scalar[String])
    .map(convertStringToInt)
    .map(convertIntToDouble)
    .runWith(Sink.seq[Double])
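For the multi-column exercise mentioned above, a minimal sketch building on the snippet above (the NewObject case class and column names are assumed from the question) could combine Anorm's row parsers instead of SqlParser.scalar:
import anorm.SqlParser.{long, str}

// Hypothetical case class matching the question's columns.
case class NewObject(id: Long, tinyColumn: String, bigColumn: String)

// Combine single-column parsers into one row parser with the ~ combinator.
val rowParser = long("id") ~ str("tiny_column") ~ str("big_column") map {
  case id ~ tiny ~ big => NewObject(id, tiny, big)
}

val multiColumnResult: Future[Seq[NewObject]] =
  AkkaStream.source(SQL"SELECT id, tiny_column, big_column FROM table", rowParser)
    .runWith(Sink.seq[NewObject])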

Related

How to load multiple files into multiple rdd?

I have multiple files that are independent and need to be processed by Spark. How can I load them into separate RDDs in parallel? Thanks!
The coding language is Scala.
If you want concurrent reading/processing of RDDs, you could leverage scala.concurrent.Future (or effects in ZIO, Cats etc).
Sample code for loading function is below:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.concurrent.{ExecutionContext, Future}

def load(paths: Seq[String], spark: SparkSession)
        (implicit ec: ExecutionContext): Seq[Future[RDD[String]]] = {
  def loadSinglePath(path: String): Future[RDD[String]] = Future {
    spark.sparkContext.textFile(path)
  }
  paths map loadSinglePath
}
Sample code for using this function:
import scala.concurrent.duration.DurationInt
import scala.concurrent.{Await, ExecutionContext, Future}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
implicit val ec = ExecutionContext.global

val result = load(Seq("t1.txt", "t2.txt", "t3.txt"), spark).zipWithIndex
  .map { case (rddFuture, idx) =>
    rddFuture.map { rdd =>
      println(s"RDD with index $idx has ${rdd.count()} elements")
    }
  }

Await.result(Future.sequence(result), 1.hour)
For example purposes, the default global ExecutionContext is used, but the code can be configured to run on a custom one (just replace the implicit val ec with your own ExecutionContext).
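For instance, a minimal sketch of a custom ExecutionContext backed by a fixed thread pool (the pool size of 4 is an arbitrary choice for illustration):
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Backs the Futures with a dedicated pool of 4 threads instead of the global pool.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))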

Getting java.lang.ArrayIndexOutOfBoundsException: 1 in spark when applying aggregate functions

I am trying to do some transformations on a data set. After reading the data set, performing df.show() lists the rows in the spark shell. But when I try df.count or any aggregate function, I get
java.lang.ArrayIndexOutOfBoundsException: 1.
val itpostsrow = sc.textFile("/home/jayk/Downloads/spark-data")

import scala.util.control.Exception.catching
import java.sql.Timestamp

implicit class StringImprovements(val s: String) {
  def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
  def toLongsafe = catching(classOf[NumberFormatException]) opt s.toLong
  def toTimeStampsafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
}

case class Post(commentcount: Option[Int],
                lastactivitydate: Option[java.sql.Timestamp],
                ownerUserId: Option[Long],
                body: String,
                score: Option[Int],
                creattiondate: Option[java.sql.Timestamp],
                viewcount: Option[Int],
                title: String,
                tags: String,
                answerCount: Option[Int],
                acceptedanswerid: Option[Long],
                posttypeid: Option[Long],
                id: Long)

def stringToPost(row: String): Post = {
  val r = row.split("~")
  Post(r(0).toIntSafe,
       r(1).toTimeStampsafe,
       r(2).toLongsafe,
       r(3),
       r(4).toIntSafe,
       r(5).toTimeStampsafe,
       r(6).toIntSafe,
       r(7),
       r(8),
       r(9).toIntSafe,
       r(10).toLongsafe,
       r(11).toLongsafe,
       r(12).toLong)
}

val itpostsDFcase1 = itpostsrow.map(x => stringToPost(x))
val itpostsDF = itpostsDFcase1.toDF()
Your function stringToPost() will throw an ArrayIndexOutOfBoundsException if the text file contains an empty row or if the number of fields after the split is not 13.
Because of Spark's lazy evaluation, such errors only surface when an action like count is performed.
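A minimal defensive sketch under that assumption: filter out empty or malformed rows (anything that does not split into the 13 expected fields) before mapping them to Post, so the action no longer hits the exception.
// Keep only rows that split into the 13 expected fields, then map them to Post.
val itpostsDFcase1 = itpostsrow
  .filter(row => row.split("~").length == 13)
  .map(stringToPost)
val itpostsDF = itpostsDFcase1.toDF()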

Aggregator is slow when using a complex intermediate type

I have a custom Aggregator that does a count-min sketch aggregation. It works, but is slow (code below). I get similar slow performance if I use a custom UDAF based on the UserDefinedAggregateFunction class.
This is much faster if I use the Dataset mapPartitions API to aggregate within a partition and then reduce across partitions.
Question: the slowness of the UDAF and Aggregator APIs seems to be caused by the serialization/deserialization (encoding) that happens for each row. Are the UDAF and Aggregator APIs not meant to be used to aggregate into non-trivial data structures like the count-min sketch? Is the mapPartitions approach the best way to handle this?
Aggregator sample code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoder, Row, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.util.sketch.CountMinSketch

object AggTest extends App {
  val input = "2008.csv"

  val conf = new SparkConf().setMaster("local[4]").setAppName("tester")
  val sqlContext = SparkSession.builder().config(conf).getOrCreate().sqlContext
  val df = sqlContext.read.format("csv").option("header", "true").load(input)

  implicit val sketchEncoder = org.apache.spark.sql.Encoders.kryo[CountMinSketch]

  case class SketchAgg(col: String) extends Aggregator[Row, CountMinSketch, CountMinSketch] {
    def zero: CountMinSketch = CountMinSketch.create(500, 4, 2429)

    def reduce(sketch: CountMinSketch, row: Row) = {
      val a = row.getAs[Any](col)
      sketch.add(a)
      sketch
    }

    def merge(sketch1: CountMinSketch, sketch2: CountMinSketch) = {
      sketch1.mergeInPlace(sketch2)
    }

    def finish(sketch: CountMinSketch) = sketch

    def bufferEncoder: Encoder[CountMinSketch] = sketchEncoder

    def outputEncoder: Encoder[CountMinSketch] = sketchEncoder
  }

  val sketch = df.agg(SketchAgg("ArrDelay")
      .toColumn
      .alias("sketch"))
    .select("sketch")
    .as[CountMinSketch]
    .first()
}
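For reference, a minimal sketch of the mapPartitions approach the question refers to (shown standalone; it would live inside the same object, using the df, sketchEncoder, and "ArrDelay" column defined above): build one sketch per partition, then merge the partial sketches with reduce.
// One CountMinSketch per partition; the kryo sketchEncoder above is picked up implicitly.
val partitionSketches = df.mapPartitions { rows =>
  val sketch = CountMinSketch.create(500, 4, 2429)
  rows.foreach(row => sketch.add(row.getAs[Any]("ArrDelay")))
  Iterator(sketch)
}

// Merge the per-partition sketches into a single result.
val merged: CountMinSketch = partitionSketches.reduce((s1, s2) => s1.mergeInPlace(s2))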

Parallelizing access to S3 objects in Scala

I'm writing a Scala program to read objects matching a certain prefix on S3.
At the moment, I'm testing it on my Macbook Pro and it takes 270ms (avg. over 1000 trials) to hit S3, retrieve the 10 objects (avg. object size 150 KB) and process them to print the output.
Here's my code:
val myBucket = "my-test-bucket"
val myPrefix = "t"

val startTime = System.currentTimeMillis()

// Can I make listObjects parallel?
val listObjRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(myBucket)
val listObjResult: Seq[String] = s3.listObjects(listObjRequest)
  .getObjectSummaries
  .par
  .toIndexedSeq
  .map(_.getKey)
  .filter(_ matches s"./.*${myPrefix}.*/*")

// Can I make foreach parallel?
listObjResult foreach println // Could be any function

println(s"Total time: ${System.currentTimeMillis() - startTime}ms")
In the big scheme of things, I've got to sift through 50Gb of data (approx. 350K nested objects) and delete objects following a certain prefix (approx. 40K objects).
Hardware considerations aside, what can I do to optimize my code?
Thanks!
A possible solution would be to batch the objects and send batch-deletion requests to S3. You can group the objects to delete and then run the deletions in parallel with Futures:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion
import com.amazonaws.services.s3.model.{DeleteObjectsRequest, DeleteObjectsResult}
import scala.collection.JavaConverters._
import scala.concurrent._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Try

object AmazonBatchDeletion {
  def main(args: Array[String]): Unit = {
    val filesToDelete: List[String] = ???
    val numOfGroups: Int = ???

    val deletionAttempts: Iterator[Future[Try[DeleteObjectsResult]]] =
      filesToDelete
        .grouped(numOfGroups)
        .map(groupToDelete => Future {
          blocking {
            deleteFilesInBatch(groupToDelete, "bucketName")
          }
        })

    val result: Future[Iterator[Try[DeleteObjectsResult]]] =
      Future.sequence(deletionAttempts)

    // TODO: make sure deletion was successful.
    // Recover from failed futures if needed.
  }

  def deleteFilesInBatch(filesToDelete: List[String],
                         bucketName: String): Try[DeleteObjectsResult] = {
    val amazonClient = new AmazonS3Client()
    val deleteObjectsRequest = new DeleteObjectsRequest(bucketName)
    deleteObjectsRequest.setKeys(filesToDelete.map(new KeyVersion(_)).asJava)
    Try {
      amazonClient.deleteObjects(deleteObjectsRequest)
    }
  }
}
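To address the TODO above, one possible way (an illustrative sketch, not part of the original answer) to wait for the batches inside main and inspect each outcome:
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.{Failure, Success}

// Block until all batch deletions have completed (the timeout is arbitrary).
val outcomes = Await.result(result, 30.minutes)
outcomes.foreach {
  case Success(res) => println(s"Deleted ${res.getDeletedObjects.size} objects")
  case Failure(err) => println(s"Batch deletion failed: $err")
}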

Scala function does not return a value

I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

class TaxiModel(sc: SparkContext, dat: String) {
  val rawData = sc.textFile(dat)
  val head = rawData.take(10)
  val splithead = head.slice(1, 11).foreach(splitData)

  def splitData(dat: String): Array[String] = {
    val splits = dat.split("\",\"")
    val split0 = splits(0).substring(1, splits(0).length)
    val split8 = splits(8).substring(0, splits(8).length - 1)
    Array(split0).union(splits.slice(1, 8)).union(Array(split8))
  }
}
foreach just evaluates an expression for its side effects and does not collect any results while iterating. You probably need map or flatMap (see docs here):
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]
Consider also a for comprehension (which desugars in this case into map),
for (s <- head.slice(1,11)) yield splitData(s)
Note also that Scala strings are equipped with ordered collection methods, thus
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail
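Putting the fix together, a sketch of the class with splithead actually holding the combined results (flatMap is used here; swap in map if you want one array per input line):
import org.apache.spark.SparkContext

class TaxiModel(sc: SparkContext, dat: String) {
  val rawData = sc.textFile(dat)
  val head = rawData.take(10)
  // flatMap collects and flattens the results, unlike foreach which returns Unit.
  val splithead: Array[String] = head.slice(1, 11).flatMap(splitData)

  def splitData(dat: String): Array[String] = {
    val splits = dat.split("\",\"")
    val split0 = splits(0).drop(1)      // equivalent to substring(1, length)
    val split8 = splits(8).dropRight(1) // equivalent to substring(0, length - 1)
    Array(split0) ++ splits.slice(1, 8) ++ Array(split8)
  }
}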