What's the right way to "log and skip" validated transformations in Spark Streaming (Scala)?

I have a Spark Streaming application where I want to do some data transformations before my main operation, but the transformation involves some data validation.
When the validation fails, I want to log the failure cases and then proceed with the rest.
Currently, I have code like this:
def values: DStream[String] = ???
def validate(element: String): Either[String, MyCaseClass] = ???
val validationResults = values.map(validate)
validationResults.foreachRDD { rdd =>
  rdd.foreach {
    case Left(error) => logger.error(error)
    case _ =>
  }
}
val validatedValues: DStream[MyCaseClass] =
  validationResults.mapPartitions { partition =>
    partition.collect { case Right(record) => record }
  }
This currently works, but it feels like I'm doing something wrong.
Questions
As far as I understand, this will perform the validation twice, which is potentially wasteful.
Is it correct to use values.map(validate).persist() to solve that problem?
Even if I persist the values, it still iterates and pattern matches on all the elements twice. It feels like there should be some method I can use to solve this. On a regular Scala collection, I might use some of the cats TraverseFilter API, or with fs2.Stream an evalMapFilter. What DStream API can I use for that? Maybe something with mapPartitions?

I would say that the best way to tackle this is to take advantage of the fact that flatMap accepts a function returning an Option, so invalid elements can be logged and dropped in a single pass:
def values: DStream[String] = ???
def validate(element: String): Either[String, MyCaseClass] = ???
val validatedValues: DStream[MyCaseClass] =
  values.map(validate).flatMap {
    case Left(error) =>
      logger.error(error)
      None
    case Right(record) =>
      Some(record)
  }
You can also be a bit more verbose and use mapPartitions, which should be slightly more efficient.
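A minimal sketch of the mapPartitions variant, written against plain Scala iterators so that it runs without a Spark cluster. The validator, the single field of MyCaseClass and the log callback are placeholders that are not part of the question; in a real job the callback would be something like logger.error.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.Try

object LogAndSkipSketch {
  // Hypothetical record type standing in for the question's MyCaseClass.
  final case class MyCaseClass(value: Int)

  // Hypothetical validator: Left carries the error message to log.
  def validate(element: String): Either[String, MyCaseClass] =
    Try(element.trim.toInt).toOption
      .map(n => MyCaseClass(n))
      .toRight(s"not an integer: '$element'")

  // Single pass over one partition: log each Left, emit each Right.
  def logAndSkip[A](partition: Iterator[Either[String, A]])(log: String => Unit): Iterator[A] =
    partition.flatMap {
      case Left(error) =>
        log(error)
        Iterator.empty[A]
      case Right(record) =>
        Iterator.single(record)
    }

  // Demo on an in-memory "partition"; errors are collected instead of logged.
  val errors = ListBuffer.empty[String]
  val kept: List[MyCaseClass] =
    logAndSkip(Iterator("1", "x", "3").map(validate))(msg => { errors += msg; () }).toList
}
```

With a DStream this could presumably be wired up as values.map(validate).mapPartitions(it => logAndSkip(it)(msg => logger.error(msg))), assuming the logger is usable on the executors. The messages then end up in the executor logs, not the driver log.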

The 'best' option here depends a bit on the rest of your Spark job and your version of Spark.
Ideally you'd pick a mechanism natively understood by Catalyst. The Dataset.observe API introduced in Spark 3, combined with a query listener, may be what you're looking for there. (I haven't seen many examples of using this in the wild, but it seems like this was the motivation behind such a thing.)
In pure Spark SQL, one option is to add a new column with the results of validation, e.g. a column named invalid_reason which is NULL if the record is valid, or some [enumerated] string containing the reason the record failed validation. At this point, you likely want to persist/cache the dataset before doing a groupBy/count/collect/log operation, then filter where invalid_reason is null on the persisted dataframe and continue with the rest of the processing.
tl;dr: consider adding a validation column rather than just applying a 'validate' function. You then 'fork' processing here: log the records which have the invalid_reason column set, and run the rest of your job on the records which don't. It does add some volume to your dataframe, but doesn't require processing the same records twice.

Related

Akka Streams how to write a GraphStage with OrElse

I have the following requirement. I am writing a GraphStage which needs to look up a SQL table (in a different db). Only if that lookup fails should it look up a second table (in a different db), and if that lookup also fails, a third table. If all lookups fail, it should use a default.
I googled and found this
http://doc.akka.io/japi/akka/current/akka/stream/scaladsl/OrElse.html
and also this thread
Alternative flows based on condition for akka stream
But broadcast and partition are not what I am looking for. I don't want to query both tables simultaneously. What I want is that the second flow is used to fetch the value only if the first flow returns None.
Right now I have done something like this
val flow = Flow[Foo].map{foo =>
lookup1(foo.id) orElse lookup2(foo.id) getOrElse default
}
But this makes the flow above very monolithic. It would be nice if I could break it into three separate flows and then connect them via an orElse clause in my GraphStage.
Using flatMapConcat and orElse might help you make your code more generic in terms of the number of sources you want to combine. See the example below:
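val altFlows: List[Flow[Foo, Option[Bar], NotUsed]] = ???
val default: Bar = ???
Flow[Foo].flatMapConcat { foo =>
  // Run foo through each alternative flow, keeping only successful lookups.
  val altSources = altFlows.map(flow => Source.single(foo).via(flow).collect { case Some(x) => x })
  // Fallback used when every lookup returns None.
  val defaultSource = Source.single(default)
  // orElse only consumes the next source when the previous one completes empty.
  (altSources :+ defaultSource).reduce(_ orElse _)
}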
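For comparison, the fallback order itself can be sketched without Akka Streams, using plain functions and a lazy iterator. This is only an illustration of the "try the next lookup only if the previous one returned None" behaviour; Foo, Bar, the three lookups and the call log are hypothetical stand-ins, not the asker's code.

```scala
object FallbackSketch {
  // Hypothetical stand-ins for the question's Foo and Bar types.
  final case class Foo(id: Int)
  final case class Bar(name: String)

  // Records which lookups actually ran, to make the laziness visible.
  var calls = Vector.empty[String]

  def lookup1(id: Int): Option[Bar] = { calls = calls :+ "lookup1"; None }
  def lookup2(id: Int): Option[Bar] = {
    calls = calls :+ "lookup2"
    if (id == 2) Some(Bar("from table 2")) else None
  }
  def lookup3(id: Int): Option[Bar] = { calls = calls :+ "lookup3"; Some(Bar("from table 3")) }

  val default = Bar("default")
  val lookups: List[Int => Option[Bar]] = List(lookup1, lookup2, lookup3)

  // Try each lookup in order; the iterator is lazy, so a later lookup
  // runs only if every earlier one returned None.
  def resolve(foo: Foo): Bar =
    lookups.iterator
      .map(lookup => lookup(foo.id))
      .collectFirst { case Some(bar) => bar }
      .getOrElse(default)

  val result = resolve(Foo(2))
}
```

Here lookup3 is never called for Foo(2), because lookup2 already produced a value.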

Scala's collect inefficient in Spark?

I am currently learning to use Spark with Scala. The problem I am working on requires me to read a file, split each line on a certain character, filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
That means going through the collection three times, which seemed quite unreasonable to me. So I tried replacing them with a single collect (the collect that takes a partial function as an argument). Much to my surprise, this made the job run much slower. I tried the same thing locally on regular Scala collections; as expected, there the collect version is much faster.
So why is that? My guess is that the map, filter and map are not applied sequentially but fused into one operation; in other words, when an action forces evaluation, every element is checked and the pending operations are executed on it. Is that right? And even so, why does collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  // Test the second column and return the first, matching the naive version.
  case l if l.split(" ")(1).contains("hello") => l.split(" ")(0)
}
The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same filter -> map sequence, but less efficient in your case.
In Scala, both the isDefinedAt and the apply method of a PartialFunction evaluate the guard (the if part).
So, in your "collect" example, the split in the guard runs once in isDefinedAt for every input element and again in apply for every element that passes the filter.
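The double evaluation can be observed on plain Scala collections. The sketch below counts how often the guard's split runs when a partial function is applied in the filter(isDefinedAt).map(pf) shape, compared with the standard library's own List.collect. The sample lines and the counter are made up for illustration.

```scala
object GuardEvaluationSketch {
  // Runs `strategy` over two sample lines and reports (results, number of split calls).
  def run(
      strategy: (List[String], PartialFunction[String, String]) => List[String]
  ): (List[String], Int) = {
    var splitCalls = 0
    def countedSplit(line: String): Array[String] = { splitCalls += 1; line.split(" ") }

    // Keeps the first column of lines whose second column contains "hello".
    val firstColumnIfHello: PartialFunction[String, String] = {
      case line if countedSplit(line)(1).contains("hello") => line.takeWhile(_ != ' ')
    }
    val results = strategy(List("a hello", "b world"), firstColumnIfHello)
    (results, splitCalls)
  }

  // Same shape as the RDD.collect shown above: isDefinedAt in the filter, apply in the map.
  val filterThenMap = run((lines, pf) => lines.filter(pf.isDefinedAt).map(pf))

  // Scala's own List.collect, for comparison.
  val stdlibCollect = run((lines, pf) => lines.collect(pf))
}
```

In this sketch filterThenMap evaluates the guard three times (twice while filtering, once more for the matching line), while stdlibCollect evaluates it twice, once per element. That is consistent with the local collect being faster than the Spark one.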

Iterator of InputStream: How to close the InputStreams?

I have an Iterator[InputStream] which I map over to retrieve the individual results:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] = streams flatMap (c => transformToResult(c))
This works as expected, meaning I can retrieve values of type MyResultType from the results iterator. The only problem I have is that the individual InputStreams are never being closed. Is there any way to do this?
There is no magic way to close it, or at least none that guarantees it will get closed. You therefore have to close each stream yourself. Take a look at the Loan Pattern, which makes this less error prone: Loaner Pattern in Scala.
In your case you don't have a single resource to release but a collection of resources, so adjust your custom loan pattern accordingly.
Since you are dealing with an Iterator, you might have an unlimited supply of InputStreams. In that case the stream has to be closed at the element level, either by your transformToResult function or by something wrapped around it.
It could look something like this:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] =
streams flatMap (c => yourLoaner(c)(transformToResult))
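One possible shape for such a loaner, as a hedged sketch: withStream is a hypothetical helper name, transformToResult here is a stand-in that simply yields the bytes of the stream, and TrackedStream exists only so the example can show that close() was called. The helper forces the result before closing, because a lazy Iterator would otherwise try to read from an already closed stream.

```scala
import java.io.{ByteArrayInputStream, InputStream}

object LoanSketch {
  // Lends `stream` to `use`, forces the result, then closes the stream.
  def withStream[S <: InputStream, A](stream: S)(use: S => Iterator[A]): Iterator[A] =
    try use(stream).toVector.iterator
    finally stream.close()

  // Hypothetical stand-in for transformToResult: yields each byte of the stream.
  def transformToResult(in: InputStream): Iterator[Int] =
    Iterator.continually(in.read()).takeWhile(_ != -1)

  // Demo-only stream that remembers whether close() was called.
  final class TrackedStream(bytes: Array[Byte]) extends ByteArrayInputStream(bytes) {
    var closed = false
    override def close(): Unit = { closed = true; super.close() }
  }

  val sources = List(
    new TrackedStream("ab".getBytes("UTF-8")),
    new TrackedStream("c".getBytes("UTF-8"))
  )

  // Each stream is closed as soon as its own results have been read.
  val results: List[Int] =
    sources.iterator.flatMap(s => withStream(s)(transformToResult)).toList
}
```

The trade-off is that the results of one stream are held in memory at a time; if a single stream is too large for that, the closing has to be tied to the exhaustion of the inner iterator instead.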

How to parallelize several apache spark rdds?

I have the following code:
sc.parquetFile("some large parquet file with bc").registerTempTable("bcs")
sc.parquetFile("some large parquet file with imps").registerTempTable("imps")
val bcs = sc.sql("select * from bcs")
val imps = sc.sql("select * from imps")
I want to do:
bcs.map(x => wrapBC(x)).collect
imps.map(x => wrapIMP(x)).collect
but when I do this, the two jobs do not run concurrently. I can do it with Future, like this:
val bcsFuture = Future { bcs.map(x => wrapBC(x)).collect }
val impsFuture = Future { imps.map(x => wrapIMP(x)).collect }
val result = for {
  bcs <- bcsFuture
  imps <- impsFuture
} yield (bcs, imps)
Await.result(result, Duration.Inf) // this returns (Array[Bc], Array[Imp])
I want to do this without Future. How can I do it?
Update: This was originally composed before the question was updated. Given those updates, I agree with @stholzm's answer to use cartesian in this case.
A limited number of actions produce a FutureAction[A] for an RDD[A] and are executed in the background. They are available on the AsyncRDDActions class, and as long as you import SparkContext._, any RDD can be implicitly converted to AsyncRDDActions as needed. For your specific code example that would be:
bcs.map(x => wrapBC(x)).collectAsync
imps.map(x => wrapIMP(x)).collectAsync
In addition to evaluating the DAG up to the action in the background, the FutureAction produced has a cancel method to attempt to end processing early.
Caveat
This may not do what you think it does. If the intent is to get data from both sources and then combine them you're more likely to want to join or group the RDDs instead. For that you can look at the functions available in PairRDDFunctions, again available on RDDs through implicit conversion.
If the intention isn't to have the data graphs interact, then in my experience so far, running batches concurrently might only serve to slow down both, though that may be a consequence of how the cluster is configured. If the resource manager is set up to give each execution stage a monopoly on the cluster in FIFO order (the default in standalone and YARN modes, I believe; I'm not sure about Mesos), then the asynchronous collects will contend with each other for that monopoly, run their tasks, then contend again for the next execution stage.
Compare this to using a Future to wrap blocking calls to downstream services or database, for example, where either the resources in question are completely separate or generally have enough resource capacity to handle multiple requests in parallel without contention.
Update: I misunderstood the question. The desired result is not the cartesian product Array[(Bc, Imp)].
But I'd argue that it does not matter how long the single map calls take because as soon as you add other transformations, Spark tries to combine them in an efficient way. As long as you only chain transformations on RDDs, nothing happens on the data. When you finally apply an action then the execution engine will figure out a way to produce the requested data.
So my advice would be to not think so much about the intermediate steps and avoid collect as much as possible because it will fetch all the data to the driver program.
It seems you are building a cartesian product yourself. Try cartesian instead:
val bc = bcs.map(x => wrapBC(x))
val imp = imps.map(x => wrapIMP(x))
val result = bc.cartesian(imp).collect
Note that collect is called on the final RDD and no longer on intermediate results.
You can use union to solve this problem. For example:
val bcsAny = bcs.map(x => wrapBC(x).asInstanceOf[Any])
val impsAny = imps.map(x => wrapIMP(x).asInstanceOf[Any])
// Union the mapped RDDs (not the originals) and collect once.
val result = (bcsAny union impsAny).collect()
val bcsResult = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }
If you want to use sortBy or other operations, give Bc and Imp a common trait or base class instead of casting to Any.
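A small sketch of the common-trait idea, using plain lists in place of the two mapped RDDs. The Wrapped trait and the id field are assumptions made for the example; the question does not show the real Bc and Imp types.

```scala
object CommonTraitSketch {
  // Hypothetical wrappers sharing a trait, so no cast to Any is needed.
  sealed trait Wrapped { def id: Int }
  final case class Bc(id: Int) extends Wrapped
  final case class Imp(id: Int) extends Wrapped

  // Plain lists stand in for bcs.map(wrapBC) and imps.map(wrapIMP).
  val bcs: List[Wrapped] = List(Bc(2), Bc(1))
  val imps: List[Wrapped] = List(Imp(3))

  // One combined collection that can be sorted on the shared field ...
  val combined: List[Wrapped] = (bcs ++ imps).sortBy(_.id)

  // ... and split back by type once it has been collected.
  val bcsResult: List[Bc] = combined.collect { case bc: Bc => bc }
  val impsResult: List[Imp] = combined.collect { case imp: Imp => imp }
}
```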

Scala polymorphic function for filtering an input List of Either

Seeking a more elegant solution
I have this piece of code, which I only use in test cases where it isn't necessary to do any error handling. What it does is:
take an input list of strings
parse them using the DSJsonMapper.parseDSResult method
filter them and extract the Right value from each Either (Left is an Exception)
The code is as follows:
def parseDs(ins: List[String]) = {
  def filterResults[U, T](in: List[Either[U, T]]): List[T] = {
    in.filter(y => y.isRight).map(z => z.right.get)
  }
  filterResults(ins.map(x => DSJsonMapper.parseDSResult(x)))
}
Now, I haven't written an awful lot of polymorphic functions, but this works. However, I feel like it's a bit ugly. Has anyone got a better suggestion for how to accomplish the same thing?
I'm aware this is going to come down to a case of personal preference. But suggestions are welcome.
collect is made for exactly this kind of situation:
def filterMe[U, T](in: List[Either[U, T]]): List[T] = in.collect {
  case Right(r) => r
}
In fact, it's so good at this you may want to skip the def and just
ins.map(DSJsonMapper.parseDSResult).collect { case Right(r) => r }
Rex's answer is possibly a little clearer, but here's a slightly shorter alternative that parses and "filters" in a single step:
ins.flatMap(DSJsonMapper.parseDSResult(_).right.toOption)
Here we take the right projection of each parse result and turn that into an Option (which will be None if the parse failed and Some(whatever) otherwise). Since we're using flatMap, the Nones don't appear in the result and the values are pulled out of the Somes.
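Both variants can be compared side by side on a toy input. In this sketch parse is a hypothetical stand-in for DSJsonMapper.parseDSResult. It also assumes Scala 2.12 or later, where Either is right-biased and toOption can be called without the .right projection.

```scala
import scala.util.Try

object RightsOnlySketch {
  // Hypothetical parser standing in for DSJsonMapper.parseDSResult.
  def parse(in: String): Either[Throwable, Int] = Try(in.toInt).toEither

  val ins = List("1", "oops", "3")

  // Variant 1: collect keeps only the values matched by the partial function.
  val viaCollect: List[Int] = ins.map(parse).collect { case Right(r) => r }

  // Variant 2: parse and drop failures in one step; a Left becomes None.
  val viaFlatMap: List[Int] = ins.flatMap(s => parse(s).toOption)
}
```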