How to load multiple files into multiple rdd?

How to load multiple files into multiple rdd? - scala

I have multiple files those are independent and need processing by spark. How could I load them into separate rdds in parallel? Thanks!
The coding language is scala.

If you want concurrent reading/processing of RDDs, you could leverage scala.concurrent.Future (or effects in ZIO, Cats etc).
Sample code for loading function is below:
def load(paths: Seq[String], spark: SparkSession)
(implicit ec: ExecutionContext): Seq[Future[RDD[String]]] = {
def loadSinglePath(path: String): Future[RDD[String]] = Future {
spark.sparkContext.textFile(path)
}
paths map loadSinglePath
}
Sample code for using this function:
import scala.concurrent.duration.{Duration, DurationInt}
val sc = SparkSession.builder.master("local[*]").getOrCreate()
implicit val ec = ExecutionContext.global
val result = load(Seq("t1.txt", "t2.txt", "t3.txt"), sc).zipWithIndex
.map { case (rddFuture, idx) =>
rddFuture.map( rdd =>
println(s"Rdd with index $idx has ${rdd.count()}")
)
}
Await.result(Future.sequence(result), 1 hour)
For example purposes, default global ExecutionContext is provided, but it could be configurable to run your code inside the custom one (just replace this implicit val ec with your own ExecutionContext)

Related

Applying multiple map functions to streaming database results in Play 2.6

I have a large query that seems to be a prime candidate for streaming results.
I would like to make a call to a function, which returns an object which I can apply additional map transformations on, and then ultimately convert the entire result into a list. This is because the conversions will results in a set of objects much smaller than the results in the database and there are many different transformations that must take place sequentially. Processing each result at a time will save me significant memory.
For example, if the results from the database were a stream (though the correct thing is likely an AkkaStream or an Iteratee), then I could do something like:
def outer(converter1[String, Int}, converter2[Int,Double]) {
val sqlIterator = getSqlIterator()
val mappedIterator1 = sqlIterator.map(x => converter1(x.bigColumn))
val mappedIterator2 = sqlIterator.map(x => converter2(x))
val retVal = mappedIterator.toList
retVal
}
def getSqlIterator() {
val selectedObjects = SQL( """SELECT * FROM table""").map { x =>
val id = x[Long]("id")
val tinyColumn = x[String]("tiny_column")
val bigColumn = x[String]("big_column")
NewObject(id, tinyColumn, bigColumn)
}
val transformed = UNKNOWN_FUNCTION(selectedObjects)
transformed
}
Most of the documentation appears to provide the mechanism to apply a "reduce" function to the results, rather than a "map" function, but the resulting mapped functions will be much smaller, saving me significant memory. What should I do for UNKNOWN_FUNCTION?

The following is a simple example of using Anorm's Akka Streams support to read the values from a single column of type String, applying two transformations to each element, and placing the results in a Seq. I'll leave it as an exercise for you to retrieve the values from multiple columns at a time, if that's what you need.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import anorm._
import scala.collection.immutable.Seq
import scala.concurrent.Future
implicit val system = ActorSystem("MySystem")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher
val convertStringToInt: String => Int = ???
val convertIntToDouble: Int => Double = ???
val result: Future[Seq[Double]] =
AkkaStream.source(SQL"SELECT big_column FROM table", SqlParser.scalar[String])
.map(convertStringToInt)
.map(convertIntToDouble)
.runWith(Sink.seq[Double])

Aggregator is slow when using a complex intermediate type

I have a custom Aggregator that does a count-min sketch aggregation. It works, but is slow (code below). I get similar slow performance if I use a custom UDAF based on the UserDefinedAggregateFunction class.
This is much faster if I use the Dataset mapPartitions API to aggregate within a partition and then reduce across partitions.
Question - the slowness of the UDAF and Aggregator APIs seem to be caused by the serialization/deserialization (encoding) that happens at each row. Are the UDAF and Aggregator APIs not meant to be used to aggregate into non-trivial data structures like the count-min sketch? Is the mapPartitions approach the best way to handle this?
Aggregator sample code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoder, Row, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.util.sketch.CountMinSketch
object AggTest extends App {
val input = "2008.csv"
val conf = new SparkConf().setMaster("local[4]").setAppName("tester")
val sqlContext = SparkSession.builder().config(conf).getOrCreate().sqlContext
val df = sqlContext.read.format("csv").option("header", "true").load(input)
implicit val sketchEncoder = org.apache.spark.sql.Encoders.kryo[CountMinSketch]
case class SketchAgg(col: String) extends Aggregator[Row, CountMinSketch, CountMinSketch] {
def zero: CountMinSketch = CountMinSketch.create(500, 4, 2429)
def reduce(sketch: CountMinSketch, row: Row) = {
val a = row.getAs[Any](col)
sketch.add(a)
sketch
}
def merge(sketch1: CountMinSketch, sketch2: CountMinSketch) = {
sketch1.mergeInPlace(sketch2)
}
def finish(sketch: CountMinSketch) = sketch
def bufferEncoder: Encoder[CountMinSketch] = sketchEncoder
def outputEncoder: Encoder[CountMinSketch] = sketchEncoder
}
val sketch = df.agg(SketchAgg("ArrDelay")
.toColumn
.alias("sketch"))
.select("sketch")
.as[CountMinSketch]
.first()
}

Parallelizing access to S3 objects in Scala

I'm writing a Scala program to read objects matching a certain prefix on S3.
At the moment, I'm testing it on my Macbook Pro and it takes 270ms (avg. over 1000 trials) to hit S3, retrieve the 10 objects (avg. size of object 150Kb) and process it to print the output.
Here's my code:
val myBucket = "my-test-bucket"
val myPrefix = "t"
val startTime = System.currentTimeMillis()
//Can I make listObject parallel?
val listObjRequest: ListObjectsRequest = new ListObjectsRequest().withBucketName(myBucket)
val listObjResult: Seq[String] = s3.listObjects(listObjRequest).getObjectSummaries.par.toIndexedSeq.map(_.getKey).filter(_ matches s"./.*${myPrefix}.*/*")
//Can I make forEach parallel?
listObjResult foreach println //Could be any function
println(s"Total time: ${System.currentTimeMillis() - startTime}ms")
In the big scheme of things, I've got to sift through 50Gb of data (approx. 350K nested objects) and delete objects following a certain prefix (approx. 40K objects).
Hardware considerations aside, what can I do to optimize my code?
Thanks!

A possible solution would be to batch the request objects and send a request for batch deletion in S3. You can group the objects to delete and then parallalize the mapping over the parallel collection:
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion
import com.amazonaws.services.s3.model.{DeleteObjectsRequest, DeleteObjectsResult}
import scala.collection.JavaConverters._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent._
import scala.util.Try
object AmazonBatchDeletion {
def main(args: Array[String]): Unit = {
val filesToDelete: List[String] = ???
val numOfGroups: Int = ???
val deletionAttempts: Iterator[Future[Try[DeleteObjectsResult]]] =
filesToDelete
.grouped(numOfGroups)
.map(groupToDelete => Future {
blocking {
deleteFilesInBatch(groupToDelete, "bucketName")
}
})
val result: Future[Iterator[Try[DeleteObjectsResult]]] =
Future.sequence(deletionAttempts)
// TODO: make sure deletion was successful.
// Recover if needed form faulted futures.
}
def deleteFilesInBatch(filesToDelete: List[String],
bucketName: String): Try[DeleteObjectsResult] = {
val amazonClient = new AmazonS3Client()
val deleteObjectsRequest = new DeleteObjectsRequest(bucketName)
deleteObjectsRequest.setKeys(filesToDelete.map(new KeyVersion(_)).asJava)
Try {
amazonClient.deleteObjects(deleteObjectsRequest)
}
}
}

Saving two RDDs in parallel

accessLogs.saveAsTextFile(outputDirectory1)
accessList.saveAsTextFile(outputDirectory2)
How to save both the RDD in parallel rather than in series?

import scala.concurrent._
import scala.concurrent.duration._
val rdds = Seq(accessLogs, accessLists)
val dirs = Seq("outputDirectory1", "outputDirectory2")
import ExecutionContext.Implicits.global
val future = Future.sequence(
for ((rdd, dir) <- rdds zip dirs) yield Future(rdd.saveAsTextFile(dir))
)
//Await.ready(future, Duration.Inf) //to wait for rdds to be saved...
Note that despite the name, the method sequence on the Future companion object used above will execute the Futures resulting from the for-comprehension in parallel and not sequentially. This sequence method is essentially an applicative functor sequence.

You can save them in threads.
new Thread() {
override def run(): Unit = {
accessLogs.saveAsTextFile(outputDirectory1)
}
}.start()
new Thread() {
override def run(): Unit = {
accessList.saveAsTextFile(outputDirectory2)
}
}.start()
saveAsTextFile doesn't return anything, so I am not sure why are you setting the return value.

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}
final class HBaseSC(sc: SparkContext) extends Serializable {
def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
data map { case (cf, columns) =>
val content = columns map { column =>
val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
column -> interpret(CellUtil.cloneValue(cell))
} toMap
cf -> content
}
def makeConf(table: String) = {
val conf = HBaseConfiguration.create()
conf.setBoolean("hbase.cluster.distributed", true)
conf.setInt("hbase.client.scanner.caching", 10000)
conf.set(TableInputFormat.INPUT_TABLE, table)
conf
}
def hbase[A](table: String, data: Map[String, List[String]])
(interpret: Array[Byte] => A) =
sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
Bytes.toString(key.get) -> extract(data, row, interpret)
}
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the rowkey and the second is a map whose key are column families and the values are maps whose keys are columns and whose content are the cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicit, even using a function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}
final class TsvRDD(val sc: SparkContext) extends Serializable {
def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
val contents = line.split(separator).toList
(fields, contents).zipped.toMap
}
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?

Just add #transient annotation to sc variable:
final class HBaseSC(#transient val sc: SparkContext) extends Serializable {
...
}
and make sure sc is not used within extract function, since it won't be available on workers.
If it's necessary to access Spark context from within distributed computation, rdd.context function might be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
case (k, v) =>
val ctx = rdd.context
....
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to load multiple files into multiple rdd? - scala

I have multiple files those are independent and need processing by spark. How could I load them into separate rdds in parallel? Thanks! The coding language is scala.

Related

Applying multiple map functions to streaming database results in Play 2.6

Aggregator is slow when using a complex intermediate type

Parallelizing access to S3 objects in Scala

Saving two RDDs in parallel

Enriching SparkContext without incurring in serialization issues

Categories

Resources