I create a PairRDD which contains a Vector.
var newRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
Later on I update the RDD:
newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)
However, although the update prints an updated Vector (as shown in the console), when I next act on newRDD I can see the Vector value has changed. Through testing I have concluded that it has been regenerated by math.random, as the Vector changes every time I reference newRDD. I understand there is a lineage graph and maybe that has something to do with it. I need to update the Vector held in the RDD to new values, and I need to do this repeatedly.
Thanks.
RDDs are immutable structures meant to distribute operations on data over a cluster.
There are two elements playing a role in the behavior you are observing here:
The RDD lineage may be recomputed every time. In this case, it means that an action on newRDD might trigger the lineage computation, re-applying the Vector(Array.fill(2){math.random}) transformation and producing new values each time. The lineage can be broken using cache, in which case the result of the transformation is kept in memory and/or on disk after the first time it is computed.
This results in:
val randomVectorRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
randomVectorRDD.cache()
The second aspect that needs further consideration is the in-place mutation:
newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)
Although this might work on a single machine because all Vector references are local, it will not scale to a cluster: the references returned by lookup are deserialized copies, so mutations on them are not preserved in the RDD. That raises the question of why use Spark for this.
To be implemented on Spark, this algorithm needs to be redesigned so that updates are expressed in terms of transformations instead of punctual lookups and mutations.
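For illustration, a minimal sketch of one such update step, assuming the values behave like a plain Scala Vector[Double] and that delta is a placeholder for the step 0.2 * errorRate(rating) * myVector from the question:

val delta = 0.1 // placeholder for 0.2 * errorRate(rating) * myVector
// Each iteration produces a new RDD instead of mutating the old one.
val updatedRDD = newRDD.mapValues(vec => vec.map(_ + delta))
updatedRDD.cache() // break the lineage so repeated actions reuse the new values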
I have not found a clear answer to this question yet, even though there are multiple similar questions on SO.
I don't fill in all the details of the code below, as the actual transformations are not important for my questions.
// Adding _corrupt_record to have records that are not valid json
val inputDf = spark.read.schema(someSchema.add("_corrupt_record", StringType)).json(path)
/**
* The following lazily persists the DF; it does not return a new DF. Since
* Spark >= 2.3, queries over raw JSON/CSV files are disallowed when the
* referenced columns only include the internal corrupt record column
* (named _corrupt_record by default). Caching is the workaround.
*/
inputDf.persist
val uncorruptedDf = inputDf.filter($"_corrupt_record".isNull)
val corruptedDf = inputDf.filter($"_corrupt_record".isNotNull)
// Doing a count on both derived DFs - corruptedDf will also be output for further investigation
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
corruptedDf.write.json(corruptedOutputPath)
// Not corrupted data will be used for some complicated transformations
val finalDf = uncorruptedDf.groupBy(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
As you can see, I marked the input dataframe inputDf for persistence (see the reason here), but never did a count on it. Then I derived two dataframes and did a count on each.
Question 1: When I do uncorruptedDf.count, what does it do to the parent dataframe inputDf? Does it trigger caching of the whole inputDf, only the part of it that corresponds to uncorruptedDf.count, or nothing? The RDD documentation says:
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
Question 2: Does it make sense at this point (before the two counts) to persist the derived dataframes corruptedDf and uncorruptedDf and to unpersist inputDf? Since there are two actions happening on each derived dataframe, I would say yes, but I am not sure. If so, what is the correct place to unpersist the parent DF below? (A), (B), or (C)?
uncorruptedDf.persist
corruptedDf.persist
// (A) I don't think I should inputDf.unpersist here, since derived DFs are not yet persisted
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
// (B) This seems a reasonable place, to free some memory
val finalDf = uncorruptedDf.groupBy(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
// (C) Is there any value from unpersisting here?
Question 3: The same as the previous question, but for finalDf vs. uncorruptedDf. As can be seen, I perform two actions on finalDf: count and write.
Thanks in advance!
For question 1:
Yes, it will persist inputDf when the first count (uncorruptedDf.count) is called, but it won't persist any transformation that you do on top of inputDf. On the next count it won't read the data from the JSON file again; it will read it from the partitions that it cached.
For question 2:
I think you should not persist inputDf, as there is nothing you gain from it. Persisting corruptedDf and uncorruptedDf makes sense, as you are performing multiple actions on each of them. You are only performing transformations on inputDf to filter corrupt and uncorrupt records, and Spark is smart enough to combine them into one step during its physical planning stage. To conclude: do not persist inputDf, and then you do not have to worry about unpersisting it.
For question 3:
You should not persist the final dataframe, as you are only performing one action on it: writing it to a physical path as a JSON file.
PS: Don't try to cache/persist every dataframe; caching itself has a performance impact, since Spark has to do additional work to keep the data in memory or spill it to disk, depending on the storage level you specify. If the transformations are few and not complex, it is better to avoid caching. You can use the explain command on a dataframe to see its physical and logical plans.
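For example, a quick way to inspect the plans (a minimal sketch, using the dataframes from the question):

// Prints the parsed, analyzed, optimized and physical plans.
uncorruptedDf.explain(true)
// The physical plan shows whether a cached relation (InMemoryRelation)
// is being reused or the source is re-read.
finalDf.explain()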
I'm trying to make an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative; I know them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
// collList.size == 200,000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the isin method has up to ~200,000 unique elements.
I know this doesn't look like the best option, and a join sounds better, but I need those elements pushed down into the filters. It makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS + Parquet): the base data is too big, and queries work on around 1% of that data. I already measured everything, and it saved me around 30 minutes of execution time :). Plus, my method already takes care of the case where the isin list is larger than 200,000 elements.
My problem is that I'm getting some Spark "task too big" (~8 MB per task) warnings. Everything works fine, so it's not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing; I still get the warning (I guess because the broadcast variable gets resolved in Scala and passed to the varargs):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work (the task-too-big warning still appears):
val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin that allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning could help (the data is already partitioned, but not by that field), but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a dataframe broadcast hash join in this case instead of a broadcast variable.
Prepare a dataframe with the collectedDf("col1") collection list you want to filter with isin, and then
use a join between the two dataframes to filter the matching rows, as sketched below.
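A minimal sketch of that idea, assuming collList and df from the question:

val spark = df.sparkSession
import spark.implicits._
import org.apache.spark.sql.functions.broadcast

// Build a one-column dataframe from the ~200,000 collected keys.
val keysDf = collList.toSeq.toDF("col1")

// Broadcast the small side; a left semi join behaves like isin
// (it keeps the matching rows of df and adds no columns from keysDf).
val retTable = df.join(broadcast(keysDf), Seq("col1"), "left_semi")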
I think it would be more efficient than isin, since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to 200 MB or 300 MB based on your requirements.
See this BHJ explanation of how it works.
Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe.
I'll just live with the big tasks, since I only use this twice in my program (but it saves a lot of time) and I can afford it, but if someone else needs it badly... well, this seems to be the path.
The best alternatives I found for big-array pushdown:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast garbage behind, but as long as your app is not a streaming one it shouldn't be a problem; alternatively, you can keep the broadcasts in a global list and clean them after a while.
Add a filter to Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417) which allows broadcast pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "pushdown" (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcast.
Use coalesce(X) after reading to decrease the number of tasks; it can work sometimes (see the sketch below), depending on how the RelationProvider/RDD is implemented.
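A minimal sketch of that last point (the partition count is an arbitrary example):

// Fewer partitions after the scan means fewer tasks, each carrying one
// serialized copy of the pushed-down filter list.
val retTable = df.coalesce(16).filter(col("col1").isin(collList: _*))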
I have the following code (a simplification of a complex situation):
val newRDD = prevRDD.flatMap { a =>
  Array.fill[Int](scala.util.Random.nextInt(10)){scala.util.Random.nextInt(2)}
}.persist()
val a = newRDD.count
val b = newRDD.count
and even though the RDD is supposed to be persisted (and therefore consistent), a and b are not identical in most cases.
Is there a way to keep the results of the first action consistent, so that when the second action is called, the results of the first action are returned?
Edit:
The issue I have is apparently caused by the zipWithIndex method in my code, which creates indices higher than the count. I'll ask about it in a different thread. Thanks.
There is no way to make it 100% consistent.
When you call persist, Spark will try to cache all the partitions in memory, if they fit.
Otherwise, it will recompute the partitions that do not fit in memory.
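If the goal is reproducibility rather than caching, one workaround (a sketch, not from the answer above) is to seed the random generator deterministically from the partition index, so a recomputed partition regenerates exactly the same values:

import scala.util.Random

val newRDD = prevRDD.mapPartitionsWithIndex { (idx, iter) =>
  // Same partition index => same seed => same sequence on recomputation,
  // assuming the parent partition contents are themselves deterministic.
  val rng = new Random(idx)
  iter.flatMap(_ => Array.fill(rng.nextInt(10))(rng.nextInt(2)))
}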
I am trying to convert an ML algorithm to Spark Scala to take advantage of my cluster's power. The relevant bits of pseudo-code are the following:
initialize set of elements
while(set not empty) {
while(...) { remove a given element from the set }
while(...) { add a given element to the set }
}
Is there any way to parallelize such a thing?
I would intuitively say that this is not implementable in a distributed fashion (the number of iterations being unknown), but I have been reading that Spark allows implementation of iterative ML algorithms.
Here is what I tried so far:
I originally used a mutable Set and removed/added elements during the loops in plain Scala. It runs correctly, but I feel the whole code is just executed on the driver, which limits the interest of using Spark.
I then made the set an RDD and replaced the var on every iteration with a new RDD with the element subtracted/added (which I suppose is super heavy?). No error appears, but the variable doesn't actually get updated.
mySetRDD = mySetRDD.subtract(sc.parallelize(Seq(element)))
I looked up Accumulators as a way to keep a set of elements updated (presence/absence of elements) across multiple executors, but they do not seem to allow anything other than simple updates of numerical values.
Create a PairRDD and then partition it by key into, say, x partitions.
After that you can use
PairRdd1.zipPartitions() to get iterators over the partitions of the two RDDs. Then you can write a function that works over the two iterators to produce a third, output iterator.
Since you have partitioned the RDDs by key, you need not keep track of the removals across partitions.
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#zipPartitions(org.apache.spark.rdd.RDD, boolean, scala.Function2, scala.reflect.ClassTag, scala.reflect.ClassTag)
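A minimal sketch of that idea (the element types and the merge logic are placeholders):

import org.apache.spark.rdd.RDD

// Both RDDs must have the same number of partitions (e.g. partitioned by
// the same partitioner) for zipPartitions to line them up.
def subtractByPartition(pairRdd1: RDD[(String, Int)],
                        pairRdd2: RDD[(String, Int)]): RDD[(String, Int)] =
  pairRdd1.zipPartitions(pairRdd2) { (it1, it2) =>
    val toRemove = it2.map(_._1).toSet
    it1.filterNot { case (k, _) => toRemove(k) }
  }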
In my application, when taking performance numbers, groupBy is eating up a lot of time.
My RDD has the below structure:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in the RDD, like which week, month, city, etc.
public class CustomTuple implements Serializable{
private Map hierarchyMap = null;
private Map granularMap = null;
private String timePeriod = null;
private String sourceKey = null;
}
Map:
This map contains the statistical data about that row, like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing the below DAG on this RDD:
1. apply a filter on this RDD to scope out the relevant rows: Filter
2. apply a filter on a second RDD to scope out the relevant rows: Filter
3. join the two RDDs: Join
4. apply a map phase to compute investment: Map
5. apply a GroupBy phase to group the data according to the desired view: GroupBy
6. apply a map phase to aggregate the data as per the grouping achieved in the step above (say, view the data across time periods) and also create new objects based on the result set to be collected: Map
7. collect the result: Collect
So if the user wants to view investment across time periods, then the below list is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked the time taken by each operation, GroupBy was taking 90% of the time needed to execute the whole DAG.
IMO, we can replace the GroupBy and the subsequent Map operation with a single reduce.
But reduce will work on objects of type JavaPairRDD<CustomTuple, Map<String, Double>>.
So my reduce will be like T reduce(Function2<T, T, T>), where T will be Tuple2<CustomTuple, Map<String, Double>>.
Or maybe, after step 3 in the above DAG, I run another map function that returns an RDD of the type needed for the metric that has to be aggregated, and then run a reduce.
Also, I am not sure how the aggregate function works, and whether it will be able to help me in this case.
Secondly, my application will receive requests on varying keys. With my current RDD design, each request requires me to repartition or re-group my RDD on this key. This means that for each request, grouping/re-partitioning takes 95% of my time to compute the job. For example, a request to view the data across markets would return:
<"market1", 20>
<"market2", 30>
This is very discouraging, as the current performance of the application without Spark is 10 times better than the performance with Spark.
Any insight is appreciated.
[EDIT] We also noticed that JOIN was taking a lot of time. Maybe that's why groupBy appeared to take so long. [EDIT]
TIA!
Spark's documentation encourages you to avoid groupBy operations; instead, it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). These operations make an aggregation both before and after the shuffle (in the Map and in the Reduce phase, if we use Hadoop terminology), so your execution times will improve (I don't know if it will be 10 times better, but it has to be better).
If I understand your processing, I think you can use a single combineByKey operation. The following explanation is made for Scala code, but you can translate it to Java without too much effort.
combineByKey has three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: in this operation you create a new class in order to combine your data, so you could aggregate your CustomTuple data into a new class, CustomTupleCombiner (I don't know if you only want to make a sum, or maybe you want to apply some process to this data, but either option can be done in this operation).
mergeValue: in this operation you describe how a CustomTuple is added to another CustomTupleCombiner (again, I am presupposing a simple summing operation). For example, if you want to sum the data by key, your CustomTupleCombiner class will hold a Map, so the operation should be something like CustomTupleCombiner.sum(CustomTuple), which does CustomTupleCombiner.Map(CustomTuple.key) -> CustomTuple.Map(CustomTuple.key) + CustomTupleCombiner.value.
mergeCombiners: in this operation you define how to merge two combiner classes, CustomTupleCombiner in my example. So this will be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), which does something like CustomTupleCombiner1.Map.keys.foreach(k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k)).
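A minimal sketch of those three functions in Scala, assuming the combiner is simply the per-key sum of the Map[String, Double] values, and that rdd stands for the question's data as an RDD[(CustomTuple, Map[String, Double])]:

// Combiner: a running per-metric sum for one key.
type Stats = Map[String, Double]

def createCombiner(v: Stats): Stats = v

def mergeValue(c: Stats, v: Stats): Stats =
  v.foldLeft(c) { case (acc, (k, x)) => acc.updated(k, acc.getOrElse(k, 0.0) + x) }

def mergeCombiners(c1: Stats, c2: Stats): Stats = mergeValue(c1, c2)

// One shuffle, with partial aggregation on the map side.
val aggregated = rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)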
The sketch is not tested (it might not even compile), but I think it might work for your scenario.
I hope this will be useful.
Shuffling is triggered by any change in the key of a [K, V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default, partitioning is calculated using the hash value of your key, as implemented by the hashCode() method. In your case, your key contains two Map instance variables. The default implementation of hashCode() has to calculate the hashCode() of those maps as well, which causes an iteration over all their elements to calculate, in turn, the hashCode() of each of those elements.
The solutions are:
Do not include the Map instances in your key. This seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map instance variables; a sketch follows this list.
Possibly you can avoid using the Map objects completely. If they hold something shared amongst multiple elements, you might need to consider using broadcast variables in Spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling by tuning your hashing between two consecutive group-bys.
Keep shuffling node-local by choosing a Partitioner that has an affinity for keeping partitions local during consecutive use.
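A minimal sketch of such a hashCode() override (the original class is Java; this is a Scala port for brevity, and it assumes timePeriod and sourceKey together identify a row):

class CustomTuple(
    val hierarchyMap: Map[String, String],
    val granularMap: Map[String, String],
    val timePeriod: String,
    val sourceKey: String) extends Serializable {

  // Hash only the cheap scalar fields; never walk the maps.
  override def hashCode(): Int =
    31 * timePeriod.hashCode + sourceKey.hashCode

  // equals must stay consistent with hashCode.
  override def equals(other: Any): Boolean = other match {
    case t: CustomTuple => t.timePeriod == timePeriod && t.sourceKey == sourceKey
    case _ => false
  }
}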
Good reading on hashCode(), including a reference to quotes by Josh Bloch, can be found on Wikipedia.