Improve performance of a function which counts common words - Scala

I have this program, which uses Apache Spark to calculate the frequency of words.
I create an RDD with key/value pairs (word = key, frequency = value). The dataset is distributed over the worker nodes. The function frequentWordCount is executed at regular intervals. It selects strings from the files, which are then converted into key-value pairs and joined with the wordDataset RDD. The words with a frequency of >50 are counted.
I was told that this approach is not performant. Can somebody tell me why, and how I could improve it?
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

val sc = new SparkContext(...)
var wordDataset: RDD[(String, Int)] = sc.sequenceFile[String, Int]("…").persist()

def frequentWordCount(fileName: String): Long = {
  val words = sc.sequenceFile[String](fileName)
  val joined = wordDataset.join(words.map(x => (x, 1)))
  joined.filter(x => x._2._1 > 50).count
}

Approximately how many frequent words will you have? For a lot of reasonable tasks, I think it should be unexpectedly small - small enough to fit into each individual machine's memory. IIRC, words tend to obey a power law distribution, so there shouldn't be that many "common" words. In that case, broadcasting a set of frequent words could be much faster than joining:
import org.apache.spark.broadcast.Broadcast

val sc = new SparkContext(...)
var commonWords: Broadcast[Set[String]] =
  sc.broadcast(sc.sequenceFile[String, Int]("…").filter(_._2 > 50).collect().toSet)

def frequentWordCount(fileName: String): Long = {
  val words = sc.sequenceFile[String](fileName)
  words.filter(commonWords.value.contains).count
}
If you are calling frequentWordCount multiple times, it is probably also better to do it in a single RDD operation where your words are associated with a file name and then grouped and counted, or something along those lines; the specifics depend on how it's used. A rough sketch of that idea follows.
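For illustration, a minimal sketch of that idea, assuming the file names for a batch of calls are known up front (fileNames is hypothetical) and reusing the commonWords broadcast from above:

import org.apache.spark.rdd.RDD

// Hypothetical list of files that would otherwise each trigger a frequentWordCount call.
val fileNames: Seq[String] = Seq("fileA", "fileB", "fileC")

// Tag each word with the file it came from (reading each file the same way
// frequentWordCount does), then count common words per file in a single job.
val wordsByFile: RDD[(String, String)] = sc.union(
  fileNames.map(name => sc.sequenceFile[String](name).map(word => (name, word)))
)

val commonCountsPerFile = wordsByFile
  .filter { case (_, word) => commonWords.value.contains(word) }
  .countByKey() // Map of fileName -> number of common-word occurrences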

If the number of common words is small enough to fit into an in-memory Set, then do what the other answer suggests (except you need a map(_._1) after the filter).
Otherwise, the two things you could improve are (1) filter before the join; you want to throw out extra data as soon as you can rather than unnecessarily scanning over it multiple times, and (2) as a general rule, you want to join the larger dataset to the smaller one, not the other way around:
sc.sequenceFile[String](fileName)
  .keyBy(identity)
  .join(wordDataset.filter(_._2 > 50))
  .count

Related

What is a more efficient way to compare and filter Sequences (multiple calls to a single call)

I have two sequences of Data objects and I want to establish what has been added, removed and is common between DataSeq1 and DataSeq2 based upon the id in the Data objects within each sequence.
I can achieve this using the following:
val dataRemoved = DataSeq1.filterNot(c => DataSeq2.exists(_.id == c.id))
val dataAdded = DataSeq2.filterNot(c => DataSeq1.exists(_.id == c.id))
val dataCommon = DataSeq1.filter(c => DataSeq2.exists(_.id == c.id))
//Based upon what is common I want to filter DataSeq2
var incomingDataToCompare = List[Data]()
dataCommon.foreach { data =>
  incomingDataToCompare = DataSeq2.find(_.id == data.id).get :: incomingDataToCompare
}
However, as the sequences of Data objects get larger, calling filter three separate times may have a performance impact. Is there a more efficient way to achieve the same output (i.e. what has been removed, added, and is common) in a single call?
The short answer is: quite possibly not, unless you are going to add some additional features to the system. I would guess that you need to keep a log of operations in order to improve the time complexity, ideally indexed both by the order in which the operations occurred and by the id of the item that was added/removed. I will leave it to you to discover how such a log can be used.
You might also be able to improve the time complexity if you keep the original sequences sorted by id (or keep a separate sequence of sorted ids; you should be able to maintain that at a log N penalty per mutation). That sequence should be a Vector or something similar, to allow fast random access. Then you can iterate with two pointers (see the sketch below). But this algorithm's efficiency depends greatly on whether the unique ids are bounded, and on whether this "establish added/removed/common" operation is called much more frequently than the operations that mutate the sequences.
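For illustration, a minimal sketch of the two-pointer idea, assuming both sequences are already kept sorted by id in an IndexedSeq (the Data type here is reduced to just its id):

case class Data(id: Int /* other fields elided */)

def diffSorted(a: IndexedSeq[Data], b: IndexedSeq[Data]): (List[Data], List[Data], List[Data]) = {
  var i = 0
  var j = 0
  var removed = List.empty[Data] // in a but not in b
  var added   = List.empty[Data] // in b but not in a
  var common  = List.empty[Data] // ids present in both (elements taken from b)
  while (i < a.length && j < b.length) {
    if (a(i).id == b(j).id)     { common ::= b(j); i += 1; j += 1 }
    else if (a(i).id < b(j).id) { removed ::= a(i); i += 1 }
    else                        { added ::= b(j); j += 1 }
  }
  while (i < a.length) { removed ::= a(i); i += 1 }
  while (j < b.length) { added ::= b(j); j += 1 }
  (removed.reverse, added.reverse, common.reverse)
}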

Scala Spark isin broadcast list

I'm trying to perform an isin filter in as optimized a way as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
// collList.size == 200000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the isin method has up to ~200,000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters. It makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet): the base data is too big and queries work on around 1% of that data. I already measured everything, and it saved me around 30 minutes of execution time :). Plus, my method already handles the case where the isin list is larger than 200,000.
My problem is that I'm getting some Spark "task too big" warnings (~8 MB per task). Everything works fine, so it's not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing, as I still get the warning (since the broadcast variable gets resolved in Scala and passed to varargs, I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one, which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work (task too big still appears):
val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin which allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning (it is already partitioned, but not by that field) and the like could help, but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a DataFrame broadcast hash join in this case instead of a broadcast variable.
Prepare a DataFrame with the collectedDf("col1") collection list you want to filter with isin, and then use a join between the two DataFrames to filter the matching rows.
I think it would be more efficient than isin, since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to 200 MB or 300 MB based on your requirements.
See this BHJ explanation of how it works.
Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe.
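For example, a minimal sketch of that approach; the left_semi join type and the explicit broadcast hint are my assumptions, not part of the answer above:

import org.apache.spark.sql.functions.broadcast

val spark = df.sparkSession
import spark.implicits._

// Put the ~200k filter values into a tiny DataFrame (collList is the question's local list).
val keysDf = collList.toSeq.toDF("col1")

// left_semi keeps only the rows of df whose col1 appears in keysDf, matching the semantics
// of the original isin filter; broadcasting the small side forces a broadcast hash join.
val retTable = df.join(broadcast(keysDf), Seq("col1"), "left_semi")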
I'll just live with the big tasks, since I only use it twice in my program (but it saves a lot of time) and I can afford it, but if someone else needs it badly... well, this seems to be the path.
The best alternatives I found for pushing down big arrays:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast garbage behind, but as long as your app is not streaming it shouldn't be a problem, or you can keep the broadcasts in a global list and clean them up after a while.
Add a filter to Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417) which allows broadcast pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom pushdown (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcast.
Use coalesce(X) after reading to decrease the number of tasks. This can work sometimes, depending on how the RelationProvider/RDD is implemented.

How to properly apply HashPartitioner before a join in Spark?

To reduce shuffling during the joining of two RDDs, I decided to partition them using HashPartitioner first. Here is how I do it. Am I doing it correctly, or is there a better way to do this?
val rddA = ...
val rddB = ...
val numOfPartitions = rddA.getNumPartitions
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions))
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions))
val rddAB = rddApartitioned.join(rddBpartitioned)
"To reduce shuffling during the joining of two RDDs,"
It is a surprisingly common misconception that repartitioning reduces or even eliminates shuffles. It doesn't. Repartitioning is a shuffle in its purest form. It doesn't save time, bandwidth or memory.
The rationale behind using a proactive partitioner is different - it allows you to shuffle once and then reuse that state to perform multiple by-key operations without additional shuffles (though, as far as I am aware, not necessarily without additional network traffic, as co-partitioning doesn't imply co-location, excluding cases where the shuffles occurred in a single action).
So your code is correct, but in a case where you join once it doesn't buy you anything.
Just one comment: it is better to append .persist() after .partitionBy if there are multiple actions on rddApartitioned and rddBpartitioned; otherwise, every action will evaluate the entire lineage of rddApartitioned and rddBpartitioned, which will cause the hash partitioning to take place again and again.
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions)).persist()
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions)).persist()
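To illustrate the "shuffle once, reuse the state" point, a small self-contained sketch; the sample data, the (String, Int) value type, and the SparkContext sc are assumptions for illustration:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical key/value RDDs, just to make the example runnable.
val rddA: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val rddB: RDD[(String, Int)] = sc.parallelize(Seq(("a", 10), ("c", 30)))

val partitioner = new HashPartitioner(rddA.getNumPartitions)
val rddApartitioned = rddA.partitionBy(partitioner).persist()
val rddBpartitioned = rddB.partitionBy(partitioner).persist()

// After the single shuffle caused by partitionBy, these by-key operations
// reuse the existing partitioning instead of shuffling the inputs again.
val joined   = rddApartitioned.join(rddBpartitioned)
val reducedA = rddApartitioned.reduceByKey(_ + _)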

Scala: how to copy part of a List to another List

I am new to Scala.
I have a list
origList = List[Double] with thousands of elements.
I need to create another list
outList = List[Double]
and copy into it the elements of origList with indices
start, start+1, ..., start+nCopy-1
so that the output list has nCopy elements.
This part of the code will be executed many times. What is the most efficient way to do this in Scala?
The way people usually do it in Scala is list.slice(start, start + nCopy).
Note that List in Scala is not a random-access container like ArrayList in Java. It is implemented as a linked list, so, especially if you are going to do this many times, it will help significantly to convert your list to something indexed beforehand: val converted = list.toIndexedSeq or, better, val converted = list.toArray.
.slice on an Array or an IndexedSeq will be much more efficient, especially if the start index is high.
Now, if you are really concerned about the efficiency of this one operation, nothing (unfortunately) beats the good old Java approach:
val converted = list.toArray
val copied = java.util.Arrays.copyOfRange(converted, start, start+nCopy)
This can be orders of magnitude faster than converted.slice (let alone list.slice) when copying a large enough number of elements (hundreds).
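Putting it together, a small usage sketch following the question's names; the sample data and the copySlice helper are illustrative:

// Convert once, up front, then copy ranges cheaply as many times as needed.
val origList: List[Double] = List.fill(100000)(math.random)
val converted: Array[Double] = origList.toArray

def copySlice(start: Int, nCopy: Int): List[Double] =
  java.util.Arrays.copyOfRange(converted, start, start + nCopy).toList

val outList: List[Double] = copySlice(1000, 50) // 50 elements, indices 1000 to 1049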

How do you perform blocking IO in an Apache Spark job?

What if, when I traverse an RDD, I need to calculate values in the dataset by calling an external (blocking) service? How do you think that could be achieved?
val values: Future[RDD[Double]] = Future sequence tasks
I've tried to create a list of Futures, but since an RDD is not Traversable, Future.sequence is not suitable.
I just wonder if anyone has had such a problem, and how did you solve it?
What I'm trying to achieve is to get parallelism on a single worker node, so I can call that external service 3000 times per second.
Probably there is another solution more suitable for Spark, like having multiple worker nodes on a single host.
It's interesting to know how you cope with such a challenge. Thanks.
Here is the answer to my own question:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global

val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets map { item =>
  Future {
    // call native code
  }
}

val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
  val searchFuture: Future[Iterator[Object]] = Future sequence f
  Await result (searchFuture, JOB_TIMEOUT)
}
The idea here is that we get a collection of partitions, where each partition is sent to a specific worker and is the smallest piece of work. Each piece of work contains data that can be processed by calling the native code.
The 'values' collection contains the data returned from the native code, and that work is done across the cluster.
Based on your answer that the blocking call is to compare the provided input with each individual item in the RDD, I would strongly consider rewriting the comparison in Java/Scala so that it can be run as part of your Spark process. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the decrease in complexity and increase in stability in your Spark process, due to not having to make remote calls, will probably make it worth it.
It seems unlikely that your remote service will be able to handle 3000 calls per second, so a local in-process version would be preferable.
If that is absolutely impossible for some reason, then you might be able to create an RDD transformation which turns your data into an RDD of futures, in pseudo-code:
def callRemote(data: Data): Future[Double] = ...
val inputData: RDD[Data] = ...
val transformed: RDD[Future[Double]] = inputData.map(callRemote)
And then carry on from there, computing on your Future[Double] objects, for example as in the sketch below.
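For instance, a hedged sketch of one way to resolve those futures, waiting per partition rather than per element (the timeout value is illustrative):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Resolve all futures belonging to a partition together, so each task blocks only once.
val results: RDD[Double] = transformed.mapPartitions { futures =>
  Await.result(Future.sequence(futures), 10.minutes)
}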
If you know how much parallelism your remote process can handle, it might be best to abandon the Future mode and accept that it is a bottleneck resource.
val remoteParallelism: Int = 100 // some constant
def callRemoteBlocking(data: Data): Double = ...
val inputData: RDD[Data] = ...
val transformed: RDD[Double] = inputData
  .coalesce(remoteParallelism)
  .map(callRemoteBlocking)
Your job will probably take quite some time, but it shouldn't flood your remote service and die horribly.
A final option: if the inputs are reasonably predictable and the range of outcomes is consistent and limited to some reasonable number of outputs (millions or so), you could precompute them all as a dataset using your remote service and look them up at Spark job time using a join, along the lines of the sketch below.
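A rough sketch of that last option, where the paths, the String key, and the storage format are all assumptions for illustration:

// Assume the remote service's results were precomputed offline and saved,
// keyed by the input value (modeled here as a String key for simplicity).
val precomputed: RDD[(String, Double)] = sc.objectFile[(String, Double)]("/path/to/precomputed")

// Key the inputs the same way, then a join replaces the remote calls entirely.
val inputKeyed: RDD[(String, String)] = sc.textFile("/path/to/input").map(line => (line, line))
val resolved: RDD[(String, Double)]   = inputKeyed.join(precomputed).mapValues { case (_, v) => v }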