I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd prefer not to do two separate .filter calls.
Unfortunately the DataFrame API doesn't have such a method; to split by a condition you'll have to perform two separate filter transformations:
myDataFrame.cache() // recommended to prevent repeating the calculation
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence". (EDIT: Just noticed that there is a similar question for the RDD API with an answer that gives similar reasoning.)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing the two filter calls directly in your business code, by writing an implicit class with a booleanSplit utility method that does that part, in a similar way to Tzach Zohar's answer, maybe using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so that the value of the condition is not calculated twice.
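For illustration, a minimal sketch of such a utility (the name booleanSplit and the __condition_value column are illustrative, not part of the Spark API):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, not}

implicit class BooleanSplitOps(df: DataFrame) {
  def booleanSplit(condition: Column): (DataFrame, DataFrame) = {
    // Materialize the condition once and cache, so it isn't computed twice.
    val flagged = df.withColumn("__condition_value", condition).cache()
    val matched = flagged.filter(col("__condition_value")).drop("__condition_value")
    val notMatched = flagged.filter(not(col("__condition_value"))).drop("__condition_value")
    (matched, notMatched)
  }
}

// Usage: val (df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)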
I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
// collList.size == 200,000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the isin method has up to ~200,000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters; it makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet; the base data is too big and queries work on around 1% of it). I already measured everything, and it saved me around 30 minutes of execution time :). Plus, my method already handles the case where the isin list is larger than 200,000.
My problem is that I'm getting some Spark "tasks are too big" (~8 MB per task) warnings. Everything works fine, so it's not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing, as I still get the warning (since the broadcast variable gets resolved in Scala and passed to varargs, I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one, which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work either (the task-too-big warning still appears):
val broadcastedList=df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted=In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin that allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning (it's already partitioned, but not by that field) and such could help, but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a DataFrame broadcast hash join in this case instead of a broadcast variable.
Prepare a DataFrame with the collectedDf("col1") collection list you want to filter with isin, and then use a join between the two DataFrames to filter the matching rows (see the sketch after the links below).
I think it would be more efficient than isin, since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to about 200 MB or 300 MB based on your requirements.
See this BHJ explanation of how it works.
Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe.
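A minimal sketch of the suggested broadcast hash join, assuming collectedDf is the small DataFrame holding the filter values (names follow the question):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// The distinct filter values, as a small single-column DataFrame.
val keysDf: DataFrame = collectedDf.select("col1").distinct()

// broadcast() hints Spark to ship keysDf to every executor and perform a
// broadcast hash join, so the big df is never shuffled.
val retTable = df.join(broadcast(keysDf), Seq("col1"))

Note that, unlike isin, a join is generally not pushed down to the relation provider, which is the trade-off the question is concerned with.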
I'll just live with the big tasks, since I only use it twice in my program (but it saves a lot of time) and I can afford it, but if someone else needs it badly... well, this seems to be the path.
Best alternatives I found to get big-array pushdown:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast garbage around, but as long as your app is not streaming it shouldn't be a problem; alternatively, you could keep the broadcasts in a global list and clean them up after a while.
Add a filter to Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417) which allows broadcast pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "Pushdown" (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcast.
Use coalesce(X) after reading to decrease the number of tasks. This can work sometimes, depending on how the RelationProvider/RDD is implemented.
I don't know if I'm using broadcast variables correctly.
I have two RDDs, rdd1 and rdd2. I want to apply rdd2.mapPartitionsWithIndex(...), and for each partition I need to perform some calculation using the whole rdd1. So I think this is a case for using a broadcast variable. First question: Am I thinking about it right?
To do so, I did this:
val rdd1Broadcast = sc.broadcast(rdd1.collect())
Second question: Why do I need to put .collect()? I saw examples with and without .collect(), but I didn't realize when I need to use it.
Also, I did this:
val rdd3 = rdd2.mapPartitionsWithIndex( myfunction(_, _, rdd1Broadcast), preservesPartitioning = preserves).cache()
Third question: Which is better: passing rdd1Broadcast or rdd1Broadcast.value?
Am I thinking it right?
There is really not enough information to answer this part. Broadcasting is useful only if the broadcasted object is relatively small, or if local access significantly reduces computational complexity.
Why do I need to put .collect()?
Because RDDs can be accessed only on the driver. Broadcasting an RDD is not meaningful, as you cannot access the data from a task.
Which is better: passing rdd1Broadcast or rdd1Broadcast.value?
The argument should be of type Broadcast[_], so don't use rdd1Broadcast.value. If the parameter is passed by value, it will be evaluated and substituted locally, and the broadcast will not be used.
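An illustrative sketch (the element types and body of myfunction are assumptions; the point is that the Broadcast handle is what gets passed in, and .value is called only inside the task):

import org.apache.spark.broadcast.Broadcast

def myfunction(index: Int,
               partition: Iterator[String],
               rdd1Broadcast: Broadcast[Array[Int]]): Iterator[String] = {
  // .value is resolved on the executor; the data is shipped once per
  // executor, not once per record.
  val data = rdd1Broadcast.value
  partition.map(record => s"$index: $record (against ${data.length} elements)")
}

val rdd3 = rdd2.mapPartitionsWithIndex(myfunction(_, _, rdd1Broadcast), preservesPartitioning = preserves)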
I want to read in a CSV log which has as its first column a timestamp of the form hh:mm:ss. I would like to partition the entries into buckets, say hourly. I'm curious what the best approach is that adheres to Scala's semantics, i.e., reading the file as a stream, parsing it (maybe by a match predicate?) and emitting the CSV entries as tuples.
It's been a couple of years since I looked at Scala but this problem seems particularly well suited to the language.
log format example:
[time],[string],[int],[int],[int],[int],[string]
The last field in the input could be mapped to an enum in the output tuple, but I'm not sure there's value in that.
I'd be happy with a general recipe that I could use, with suggestions for certain built-in functions that are well suited to the problem.
The overall goal is a map-reduce, where I want to count elements in a time window but those elements first need to be preprocessed by a regex replace, before sorting and counting.
I've tried to keep the problem abstract, so the problem can be approached as a pattern to follow.
Thanks.
Perhaps as a first pass, a simple groupBy would do the trick?
logLines.groupBy(line => line.timestamp.hours)
Using the groupBy idiom and some filtering, my solution looks like:
val lines: Traversable[String] = source.getLines.map(_.trim).toTraversable
val events: List[String] = lines.filter(line => line.matches("[\\d]+:.*")).toList
val buckets: Map[String, List[String]] = events.groupBy { line => line.substring(0, line.indexOf(":")) }
This gives me 24 buckets, one for each hour. Now I can process each bucket, perform the regex replace I need to de-parameterize the URIs, and finally map-reduce those to find how frequently each route occurs.
Important note: I learned that groupBy doesn't work as desired without first creating a List from the Traversable stream. Without that step, the end result is a single-valued map for each hour. This is possibly not the most performant solution, since it requires all events to be loaded into memory before partitioning. Is there a better solution that can partition a stream? Perhaps something that adds events to a mutable collection as the stream is processed, as in the sketch below?
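A minimal sketch of that incremental idea, using a mutable Map of buffers (the file name is hypothetical; the regex and substring logic follow the solution above):

import scala.collection.mutable
import scala.io.Source

val source = Source.fromFile("access.log") // hypothetical file name
val buckets = mutable.Map.empty[String, mutable.ListBuffer[String]]

// Each event lands in its hour bucket as the stream is read, so the whole
// log never has to be materialized as a List first.
source.getLines()
  .map(_.trim)
  .filter(_.matches("[\\d]+:.*"))
  .foreach { line =>
    val hour = line.substring(0, line.indexOf(":"))
    buckets.getOrElseUpdate(hour, mutable.ListBuffer.empty[String]) += line
  }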
In my application, when taking performance numbers, groupBy is eating up a lot of time.
My RDD has the below structure:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in the RDD, like which week, month, city, etc.
public class CustomTuple implements Serializable{
private Map hierarchyMap = null;
private Map granularMap = null;
private String timePeriod = null;
private String sourceKey = null;
}
Map
This map contains the statistical data about that row like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing the below DAG on this RDD:
1. Apply a filter on one RDD to scope out relevant rows: Filter
2. Apply a filter on the other RDD to scope out relevant rows: Filter
3. Join the two RDDs: Join
4. Apply a map phase to compute investment: Map
5. Apply a GroupBy phase to group the data according to the desired view: GroupBy
6. Apply a map phase to aggregate the data per the grouping achieved in the above step (say, view data across time periods) and create new objects based on the result set desired to be collected: Map
7. Collect the result: Collect
So if the user wants to view investment across time periods, then the below list is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked the time taken by the operations, GroupBy was taking 90% of the time taken to execute the whole DAG.
IMO, we can replace the GroupBy and subsequent Map operations with a single reduce.
But reduce will work on objects of type JavaPairRDD<CustomTuple, Map<String, Double>>.
So my reduce will be like T reduce(Function2<T, T, T>), where T will be (CustomTuple, Map<String, Double>).
Or maybe after step 3 in the above DAG I run another map function that returns an RDD of the appropriate (key, value) type for the metric that needs to be aggregated, and then run a reduce.
Also, I am not sure how the aggregate function works and whether it would be able to help me in this case.
Secondly, my application will receive requests on varying keys. With my current RDD design, each request requires me to repartition or re-group my RDD on that key. This means that for each request, grouping/re-partitioning takes 95% of my time to compute the job.
<"market1", 20>
<"market2", 30>
This is very discouraging, as the current performance of the application without Spark is 10 times better than the performance with Spark.
Any insight is appreciated.
[EDIT] We also noticed that JOIN was taking a lot of time. Maybe that's why groupBy was taking time. [EDIT]
TIA!
The Spark documentation encourages you to avoid groupBy operations; instead, it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). These operations aggregate both before and after the shuffle (in the map and the reduce phases, if we use Hadoop terminology), so your execution times will improve (I don't know if it will be 10 times better, but it has to be better).
If I understand your processing, I think you can use a single combineByKey operation. The following explanation is written for Scala code, but you can translate it to Java without too much effort.
combineByKey has three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: In this operation you create a new class in order to combine your data, so you could aggregate your CustomTuple data into a new class CustomTupleCombiner (I don't know if you only want to make a sum, or if you want to apply some process to this data, but either option can be done in this operation).
mergeValue: In this operation you describe how a CustomTuple is added to a CustomTupleCombiner (again, I am presupposing a simple summing operation). For example, if you want to sum the data by key, your CustomTupleCombiner class will contain a Map, and the operation should be something like CustomTupleCombiner.sum(CustomTuple), which makes CustomTupleCombiner.Map(CustomTuple.key) -> CustomTuple.Map(CustomTuple.key) + CustomTupleCombiner.value.
mergeCombiners: In this operation you define how to merge two combiner classes, CustomTupleCombiner in my example. So this will be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), doing something like CustomTupleCombiner1.Map.keys.foreach(k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k)), or something like that.
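A sketch of what such a single combineByKey might look like, assuming a simple per-key sum of the metric maps (the CustomTuple fields shown are a hypothetical stand-in for the question's key class):

import org.apache.spark.rdd.RDD

case class CustomTuple(timePeriod: String, sourceKey: String)

def aggregate(rdd: RDD[(CustomTuple, Map[String, Double])]): RDD[(CustomTuple, Map[String, Double])] = {
  // Merge one metric map into an accumulator, summing values per metric name.
  def merge(acc: Map[String, Double], m: Map[String, Double]): Map[String, Double] =
    m.foldLeft(acc) { case (a, (k, v)) => a.updated(k, a.getOrElse(k, 0.0) + v) }

  rdd.combineByKey(
    (m: Map[String, Double]) => m, // createCombiner: the first value starts the combiner
    merge,                         // mergeValue: fold another value into the combiner
    merge                          // mergeCombiners: merge partial results across partitions
  )
}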
The pasted code is untested (it might not even compile, because I wrote it in vim), but I think it might work for your scenario.
I hope this will be useful.
Shuffling is triggered by any change in the key of a [K,V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default, partitioning is calculated using the hash value of your key, implemented by the hashCode() method. In your case your key contains two Map instance variables. The default implementation of hashCode() will have to calculate the hashCode() of those maps as well, causing an iteration over all their elements to, in turn, calculate the hashCode() of each of those elements.
The solutions are:
Do not include the Map instances in your Key. This seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map instance variables (see the sketch after this list).
Possibly you can avoid using the Map objects completely. If it is something that is shared amongst multiple elements, you might need to consider using broadcast variables in spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling by tuning your hashing between two consecutive group-bys.
Keep shuffling node-local by choosing a Partitioner that has an affinity for keeping partitions local during consecutive use.
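A hedged sketch of the second option, assuming the key's scalar fields are what actually identify a row (the field names follow the question; this is not the author's code):

class CustomTuple(val timePeriod: String,
                  val sourceKey: String,
                  val hierarchyMap: Map[String, String],
                  val granularMap: Map[String, String]) extends Serializable {

  // Only the cheap scalar fields participate in hashing and equality, so a
  // shuffle never has to iterate over the heavy maps.
  override def hashCode(): Int = (timePeriod, sourceKey).##

  override def equals(other: Any): Boolean = other match {
    case that: CustomTuple =>
      timePeriod == that.timePeriod && sourceKey == that.sourceKey
    case _ => false
  }
}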
Good reading on hashCode(), including a reference to quotes by Josh Bloch, can be found on the wiki.
I've got a deeply nested data structure:
Seq[Seq[(String, Seq[(String, Seq[(String, Try[Boolean])])], Long)]]
Is there a nice functional way to groupBy on Try.isFailure?
With Shapeless it is possible to search in arbitrarily nested data structures, as can be seen here. But finding is only one part of my problem. I saw zippers and lenses; they are nice, but AFAIK they are not the right tool here.
For info, the data represents the results of some test code. The layers are:
permutations of configurations => tested component => mutation on data => testing code. The Strings are descriptions; the Long is the time it took each component test to finish.
I want to create two lists: one with all the failures, keeping all the info on where and when they happened and keeping the exceptions as info, and a corresponding one for the successes.
Is there a solution out there already?
Note: the most sensible approach for this particular case would be to redesign my test code such that two lists, one failure list and one success list, are created from the start. But still, I'd like to know. This kind of problem doesn't seem to be uncommon.
It may not be the most creative solution, but you could partition the outermost Seq as follows:
val partitioned = seq.partition{ s =>
val flat = s.map(_._2).flatten.map(_._2).flatten
flat.find(tup => tup._2.isFailure).isDefined
}
In this example, the first line in the partition body flattens out the nested structure, so you are left with the innermost Seq. Then, from there, the predicate condition to return for the partition call is derived by checking whether the innermost Seq contains at least one Failure. What you are left with is a tuple where the first Seq holds the outermost items that have failures somewhere in their nested structures, and the second one holds those where no failures occurred.
This is probably not the best-performing solution, but it's succinct as far as lines of code are concerned. In fact, you could even do it in one line, as follows:
val partitioned = seq.partition(_.map(_._2).flatten.map(_._2).flatten.find(_._2.isFailure).isDefined)
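For what it's worth, the same predicate reads a little more idiomatically with flatMap and exists:

val partitioned = seq.partition(_.flatMap(_._2).flatMap(_._2).exists(_._2.isFailure))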