Stop processing large text files in Apache Spark after a certain number of errors - scala

I'm very new to Spark. I'm working with 1.6.1.
Let's imagine I have a large file that I'm reading into an RDD[String] via textFile.
Then I want to validate each line in some function.
Because the file is huge, I want to stop processing once I've reached a certain number of errors, let's say 1000 lines.
Something like
val rdd = sparkContext.textFile(fileName)
rdd.map(line => myValidator.validate(line))
Here is the validate function:
def validate(line: String): (String, String) = {
  // 1st element of the tuple for the resulting line, 2nd, say, for the validation error.
}
How can I count errors inside 'validate'? It is actually executed in parallel on multiple nodes, isn't it? Broadcasts? Accumulators?

You can achieve this behavior using Spark's laziness by "splitting" the result of the parsing into successes and failures, calling take(n) on the failures, and only using the success data if there were fewer than n failures.
To achieve this more conveniently, I'd suggest changing the signature of validate to return some type that can easily distinguish success from failure, e.g. scala.util.Try:
def validate(line: String): Try[String] = {
  // returns Success[String] on success,
  // Failure (with details in the exception object) otherwise
}
And then, something like:
import scala.util.{Try, Success, Failure}

val maxFailures = 1000
val rdd = sparkContext.textFile(fileName)
val parsed: RDD[Try[String]] = rdd.map(line => myValidator.validate(line)).cache()

val failures: Array[Throwable] = parsed.collect { case Failure(e) => e }.take(maxFailures)

if (failures.size == maxFailures) {
  // report failures...
} else {
  val success: RDD[String] = parsed.collect { case Success(s) => s }
  // continue here...
}
Why would this work?
If there are fewer than 1000 failures, the entire dataset will be parsed when take(maxFailures) is called, and the successful data will be cached and ready to use.
If there are 1000 failures or more, the parsing will stop there, as the take operation won't require any more reads.
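For completeness, here is a minimal sketch of what such a Try-based validate could look like; the three-field rule used here is purely hypothetical:
import scala.util.Try

def validate(line: String): Try[String] = Try {
  // Hypothetical rule: a valid line has exactly three comma-separated fields.
  // require throws IllegalArgumentException, which Try turns into a Failure.
  require(line.split(",", -1).length == 3, s"expected 3 fields in: $line")
  line
}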

Related

Scala - how to get and print the contents of Either?

I'm processing data where some records may be corrupted. So I decided to explore the data and used Either to separate valid and invalid records.
I figured out how to count the number of each kind of record, and I now get output for failedCount and successCount successfully.
But I have a problem with printing out each invalid (Left) sale record. What could be wrong with my approach?
I don't get any output when printing out failedSales.
def filterSales(rawSales: RDD[Sale]): RDD[(String, Sale)] = {
  val filteredSales = rawSales
    .map(sale => {
      val saleOption = Try(sale.id -> sale)
      saleOption match {
        case Success(successSale) => Right(successSale)
        case Failure(e) => Left(s"Corrupted sale: $sale;", e)
      }
    })

  val failedCount: Long = filteredSales.filter(_.isLeft).count()
  val successCount: Long = filteredSales.filter(_.isRight).count()
  println("FAILED SALES COUNT: " + failedCount)
  println("SUCCESS SALES COUNT: " + successCount)

  // Problem here
  val failedSales: RDD[Either.LeftProjection[(String, Throwable), (String, Sale)]] = filteredSales.map(_.left)
  println("FAILED SALES: ")
  // Doesn't produce any output
  failedSales.foreach(println)
}
When you call foreach(fn) on an RDD, the function fn (println in your case) is executed on the worker nodes where the RDD is distributed. So it's happening somewhere, just not in the driver program you're looking at.
If you have a small data set then you could collect() the RDD so the data is returned to your driver and you can println that.
If you have large data, you could saveAsTextFile() so it gets written to HDFS and you can download from there.
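As a rough sketch of the small-data route, assuming you only want the Left side, you could keep the failures, collect them to the driver, and print them there:
// Keep only the (message, exception) pairs, then bring them to the driver
val failedSales: Array[(String, Throwable)] = filteredSales
  .collect { case Left(err) => err }  // RDD.collect(PartialFunction): keeps the Lefts as an RDD
  .collect()                          // RDD.collect(): materializes an Array on the driver
failedSales.foreach(println)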

Streamline results in mapPartitions (Spark)

Is there a way to return partial results in mapPartitions() ?
Currently I use it like this:
myRDD.mapPartitions { iter: Iterator[InputType] =>
  val additionalData = <some costly init operation>
  val results = ArrayBuffer[OutputType]()
  for (input: InputType <- iter) results += transform(input, additionalData)
  results.iterator
}
But of course, if a partition is too big, building the results buffer will throw an OOM exception.
So my question: is there a way to emit partial results every once in a while so as to avoid any OOM?
I want to stick to mapPartitions because I initialize a costly object (e.g. get the value of a big broadcast variable) before processing the input, and I don't want to do that for every record as I would with map.
If additionalData doesn't access the iterator you can just map:
myRDD.mapPartitions { iter: Iterator[InputType] =>
  val additionalData = ???
  iter.map(input => transform(input, additionalData))
}
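To make the laziness point concrete, here is a self-contained sketch; the broadcast lookup table and the string transform are hypothetical. Because iter.map is lazy, records stream through one at a time and nothing is buffered per partition:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def transformAll(sc: SparkContext, input: RDD[String]): RDD[String] = {
  // Hypothetical costly-to-build shared data, shipped once per executor
  val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
  input.mapPartitions { iter: Iterator[String] =>
    val table = lookup.value                                   // costly init done once per partition
    iter.map(line => s"$line -> ${table.getOrElse(line, 0)}")  // lazy, no per-partition buffer
  }
}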

Sink for line-by-line file IO with backpressure

I have a file processing job that currently uses akka actors with manually managed backpressure to handle the processing pipeline, but I've never been able to successfully manage the backpressure at the input file reading stage.
This job takes an input file and groups lines by an ID number present at the start of each line. Once it hits a line with a new ID number, it pushes the grouped lines to a processing actor via a message, then continues with the new ID number, all the way until it reaches the end of the file.
This seems like it would be a good use case for Akka Streams, using the File as a sink, but I'm still not sure of three things:
1) How can I read the file line by line?
2) How can I group by the ID present on every line? I currently use very imperative processing for this, and I don't think I'll have the same ability in a stream pipeline.
3) How can I apply backpressure, such that I don't keep reading lines into memory faster than I can process the data downstream?
Akka Streams' groupBy is one approach. But groupBy has a maxSubstreams parameter which would require you to know the max number of ID ranges up front. So the solution below uses scan to identify same-ID blocks and splitWhen to split into substreams:
import java.io.File

import akka.Done
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing}
import akka.util.ByteString

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object Main extends App {
  implicit val system = ActorSystem("system")
  implicit val materializer = ActorMaterializer()

  def extractId(s: String) = {
    val a = s.split(",")
    a(0) -> a(1)
  }

  val file = new File("/tmp/example.csv")

  private val lineByLineSource = FileIO.fromFile(file)
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024))
    .map(_.utf8String)

  val future: Future[Done] = lineByLineSource
    .map(extractId)
    .scan((false, "", ""))((l, r) => (l._2 != r._1, r._1, r._2))
    .drop(1)
    .splitWhen(_._1)
    .fold(("", Seq[String]()))((l, r) => (r._2, l._2 ++ Seq(r._3)))
    .concatSubstreams
    .runForeach(println)

  private val reply = Await.result(future, 10 seconds)
  println(s"Received $reply")
  Await.ready(system.terminate(), 10 seconds)
}
extractId splits lines into id -> data tuples. scan prepends the id -> data tuples with a start-of-ID-range flag. The drop discards the primer element emitted by scan. splitWhen starts a new substream for each start-of-range. fold concatenates each substream into a list and removes the start-of-ID-range boolean, so that each substream produces a single element. In place of the fold you probably want a custom SubFlow which processes the stream of rows for a single ID and emits some result for the ID range. concatSubstreams merges the per-ID-range substreams produced by splitWhen back into a single stream that is printed by runForeach.
Run with:
$ cat /tmp/example.csv
ID1,some input
ID1,some more input
ID1,last of ID1
ID2,one line of ID2
ID3,2nd before eof
ID3,eof
Output is:
(ID1,List(some input, some more input, last of ID1))
(ID2,List(one line of ID2))
(ID3,List(2nd before eof, eof))
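If, as in the question, each ID group should be handed off for processing rather than printed, one hedged variation (staying inside the same Main object; processGroup here is hypothetical) is to replace runForeach with a mapAsync stage, so that the demand of the processing step backpressures the file read:
import akka.stream.scaladsl.Sink

// Hypothetical per-group processing returning a Future
def processGroup(group: (String, Seq[String])): Future[Done] = ???

val processed: Future[Done] = lineByLineSource
  .map(extractId)
  .scan((false, "", ""))((l, r) => (l._2 != r._1, r._1, r._2))
  .drop(1)
  .splitWhen(_._1)
  .fold(("", Seq[String]()))((l, r) => (r._2, l._2 ++ Seq(r._3)))
  .concatSubstreams
  .mapAsync(parallelism = 1)(processGroup)  // only pulls more lines when processing keeps up
  .runWith(Sink.ignore)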
It appears that the easiest way to add "back pressure" to your system without introducing huge modifications is to simply change the mailbox type of the Actor that consumes the input groups to a BoundedMailbox.
Give the Actor that consumes your lines a bounded mailbox with a high mailbox-push-timeout-time:
bounded-mailbox {
  mailbox-type = "akka.dispatch.BoundedDequeBasedMailbox"
  mailbox-capacity = 1
  mailbox-push-timeout-time = 1h
}

val actor = system.actorOf(
  Props(classOf[InputGroupsConsumingActor]).withMailbox("bounded-mailbox"))
Create an iterator from your file, then a grouped (by id) iterator from that iterator. Then just cycle through the data, sending groups to the consuming Actor. Note that the send will block in this case when the Actor's mailbox gets full.
def iterGroupBy[A, K](iter: Iterator[A])(keyFun: A => K): Iterator[Seq[A]] = {
  def rec(s: Stream[A]): Stream[Seq[A]] =
    if (s.isEmpty) Stream.empty else {
      s.span(keyFun(s.head) == keyFun(_)) match {
        case (prefix, suffix) => prefix.toList #:: rec(suffix)
      }
    }
  rec(iter.toStream).toIterator
}

val lines = Source.fromFile("input.file").getLines()
iterGroupBy(lines) { l => l.headOption }.foreach { lines: Seq[String] =>
  actor.tell(lines, ActorRef.noSender)
}
That's it!
You probably want to move the file-reading code to a separate thread, as it is going to block. Also, by adjusting mailbox-capacity you can regulate the amount of memory consumed. But if reading a batch from the file is always faster than processing it, it seems reasonable to keep the capacity small, like 1 or 2.
Update: iterGroupBy is implemented with Stream, tested not to produce a StackOverflowError.
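A minimal sketch of moving the blocking read-and-send loop off the calling thread; the single-thread executor used here is just one option:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}
import scala.io.Source

// A dedicated pool, so the blocking sends don't starve other work
val blockingEc = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

Future {
  val lines = Source.fromFile("input.file").getLines()
  iterGroupBy(lines) { l => l.headOption }.foreach { group =>
    actor.tell(group, ActorRef.noSender)  // blocks while the bounded mailbox is full
  }
}(blockingEc)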

Spark RDD equivalent to Scala collections partition

This is a minor issue with one of my Spark jobs. It doesn't seem to cause any problems, yet it annoys me every time I see it and I fail to come up with a better solution.
Say I have a Scala collection like this:
val myStuff = List(Try(2/2), Try(2/0))
I can partition this list into successes and failures with partition:
val (successes, failures) = myStuff.partition(_.isSuccess)
Which is nice. The implementation of partition only traverses the source collection once to build the two new collections. However, using Spark, the best equivalent I have been able to devise is this:
val myStuff: RDD[Try[???]] = sourceRDD.map(someOperationThatMayFail)
val successes: RDD[???] = myStuff.collect { case Success(v) => v }
val failures: RDD[Throwable] = myStuff.collect { case Failure(ex) => ex }
Which, aside from the difference of unpacking the Try (which is fine), also requires traversing the data twice. Which is annoying.
Is there any better Spark alternative that can split an RDD without multiple traversals? i.e. having a signature something like this where partition has the behaviour of Scala collections partition rather than RDD partition:
val (successes: RDD[Try[???]], failures: RDD[Try[???]]) = myStuff.partition(_.isSuccess)
For reference, I previously used something like the below to solve this. The potentially failing operation is de-serializing some data from a binary format, and the failures have become interesting enough that they need to be processed and saved as an RDD rather than something logged.
def someOperationThatMayFail(data: Array[Byte]): Option[MyDataType] = {
  try {
    Some(deserialize(data))
  } catch {
    case e: MyDeserializationError =>
      logger.error(e)
      None
  }
}
There might be other solutions, but here you go:
Setup:
import scala.util._
val myStuff = List(Try(2/2), Try(2/0))
val myStuffInSpark = sc.parallelize(myStuff)
Execution:
val myStuffInSparkPartitioned = myStuffInSpark.aggregate((List[Try[Int]](), List[Try[Int]]()))(
  (accum, curr) => if (curr.isSuccess) (curr :: accum._1, accum._2) else (accum._1, curr :: accum._2),
  (first, second) => (first._1 ++ second._1, first._2 ++ second._2))
Let me know if you need an explanation
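For what it's worth, on the two-element example above this yields the two lists in a single pass over the RDD. Note that, unlike the collect-based approach, aggregate returns plain Lists to the driver, so it only suits data that fits in driver memory:
val (successes, failures) = myStuffInSparkPartitioned
// successes contains Success(1); failures contains Failure(java.lang.ArithmeticException: / by zero)
successes.foreach(println)
failures.foreach(println)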

Using Iteratees and Enumerators in Play Scala to Stream Data to S3

I am building a Play Framework application in Scala where I would like to stream an array of bytes to S3. I am using the Play-S3 library to do this. The "Multipart file upload" section of the documentation is what's relevant here:
// Retrieve an upload ticket
val result: Future[BucketFileUploadTicket] =
  bucket initiateMultipartUpload BucketFile(fileName, mimeType)

// Upload the parts and save the tickets
val result: Future[BucketFilePartUploadTicket] =
  bucket uploadPart (uploadTicket, BucketFilePart(partNumber, content))

// Complete the upload using both the upload ticket and the part upload tickets
val result: Future[Unit] =
  bucket completeMultipartUpload (uploadTicket, partUploadTickets)
I am trying to do the same thing in my application but with Iteratees and Enumerators.
The streams and asynchronicity make things a little complicated, but here is what I have so far (Note uploadTicket is defined earlier in the code):
val partNumberStream = Stream.iterate(1)(_ + 1).iterator

val partUploadTicketsIteratee =
  Iteratee.fold[Array[Byte], Future[Vector[BucketFilePartUploadTicket]]](
    Future.successful(Vector.empty[BucketFilePartUploadTicket])) { (partUploadTickets, bytes) =>
    bucket.uploadPart(uploadTicket, BucketFilePart(partNumberStream.next(), bytes))
      .flatMap(partUploadTicket => partUploadTickets.map(_ :+ partUploadTicket))
  }

(body |>>> partUploadTicketsIteratee).andThen {
  case result =>
    result.map(_.map(partUploadTickets => bucket.completeMultipartUpload(uploadTicket, partUploadTickets))) match {
      case Success(x) => x.map(d => println("Success"))
      case Failure(t) => throw t
    }
}
Everything compiles and runs without incident. In fact, "Success" gets printed, but no file ever shows up on S3.
There might be multiple problems with your code. It's a bit unreadable because of the nested map calls. You might have a problem with your Future composition. Another problem might be caused by the fact that all chunks (except for the last) should be at least 5MB.
The code below has not been tested, but shows a different approach. The iteratee approach is one where you can create small building blocks and compose them into a pipe of operations.
To make the code compile I added a trait and a few methods
trait BucketFilePartUploadTicket
val uploadPart: (Int, Array[Byte]) => Future[BucketFilePartUploadTicket] = ???
val completeUpload: Seq[BucketFilePartUploadTicket] => Future[Unit] = ???
val body: Enumerator[Array[Byte]] = ???
Here we create a few parts
// Create 5MB chunks
val chunked = {
  val take5MB = Traversable.takeUpTo[Array[Byte]](1024 * 1024 * 5)
  Enumeratee.grouped(take5MB transform Iteratee.consume())
}

// Add a counter, used as part number later on
val zipWithIndex = Enumeratee.scanLeft[Array[Byte]](0 -> Array.empty[Byte]) {
  case ((counter, _), bytes) => (counter + 1) -> bytes
}

// Map the (Int, Array[Byte]) tuple to a BucketFilePartUploadTicket
val uploadPartTickets = Enumeratee.mapM[(Int, Array[Byte])](uploadPart.tupled)

// Construct the pipe to connect to the enumerator.
// The ><> operator is an alias for compose; it is more intuitive because of
// its arrow-like structure.
val pipe = chunked ><> zipWithIndex ><> uploadPartTickets

// Create a consumer that ends by finishing the upload
val consumeAndComplete =
  Iteratee.getChunks[BucketFilePartUploadTicket] mapM completeUpload
Running it is done by simply connecting the parts
// This is the result, a Future[Unit]
val result = body through pipe run consumeAndComplete
Note that I did not test any code and might have made some mistakes in my approach. This however shows a different way of dealing with the problem and should probably help you to find a good solution.
Note that this approach waits for one part to complete uploading before it takes on the next part. If the connection from your server to Amazon is slower than the connection from the browser to your server, this mechanism will slow down the input.
You could take another approach where you do not wait for the Future of the part upload to complete. This would add another step where you use Future.sequence to convert the sequence of upload futures into a single future containing a sequence of the results. The result would be a mechanism that sends a part to Amazon as soon as you have enough data.
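A rough, untested sketch of that alternative, reusing the chunked and zipWithIndex pieces and the hypothetical uploadPart/completeUpload signatures from above:
import play.api.libs.iteratee.{Enumeratee, Iteratee}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Start each part upload immediately and pass the Future itself downstream
val uploadPartFutures: Enumeratee[(Int, Array[Byte]), Future[BucketFilePartUploadTicket]] =
  Enumeratee.map[(Int, Array[Byte])](uploadPart.tupled)

val eagerPipe = chunked ><> zipWithIndex ><> uploadPartFutures

// Collect the in-flight uploads, wait for all of them, then complete the upload
val eagerResult: Future[Unit] =
  (body through eagerPipe run Iteratee.getChunks[Future[BucketFilePartUploadTicket]])
    .flatMap(tickets => Future.sequence(tickets))
    .flatMap(completeUpload)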