Spark update cached dataset - scala

I have a sorted dataset which is updated (filtered) inside a loop according to the value of the head of the dataset.
If I cache the dataset every n (e.g., 50) iterations, I have good performance.
However, after a certain number of iterations, the cache seems to stop working, since the program slows down (I guess the memory assigned to caching fills up).
I would like to know whether and how it is possible to keep only the updated dataset in cache, so that memory does not fill up and performance stays good.
Please find below an example of my code:
dataset = dataset.sort(/* sort condition */)
dataset.cache()
var head = dataset.head(1)
var count = 0
while (head.nonEmpty) {
  count += 1
  /* custom operation with the head */
  dataset = dataset.filter(/* filter condition based on the head of the dataset */)
  if (count % 50 == 0) {
    dataset.cache()
  }
  head = dataset.head(1)
}

cache alone won't help you here. With each iteration the lineage and the execution plan grow, and that is not something that can be addressed by caching alone.
You should at least break the lineage:
if (count % 50 == 0) {
  dataset.cache()
  dataset = dataset.checkpoint()
}
although personally I would also write the data to distributed storage and read it back:
if (count % 50 == 0) {
  dataset.write.parquet(s"/some/path/$count")
  dataset = spark.read.parquet(s"/some/path/$count")
}
It might not be acceptable depending on your deployment, but in many cases it behaves more predictably than caching and checkpointing.
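For reference, here is a self-contained sketch of how the loop might look with periodic lineage truncation; the column name score, the input path, and the checkpoint directory are made-up placeholders, not part of the original question:

import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("iterative-filter").getOrCreate()
import spark.implicits._

// checkpoint() requires a checkpoint directory to be configured up front
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

var dataset: Dataset[Row] = spark.read.parquet("/path/to/input").sort($"score".desc)
var head = dataset.head(1)
var count = 0

while (head.nonEmpty) {
  count += 1
  // custom operation with the head: here, derive a filter threshold from it
  val threshold = head(0).getAs[Double]("score")
  dataset = dataset.filter($"score" < threshold)
  if (count % 50 == 0) {
    // checkpoint() materializes the data and returns a new Dataset whose plan
    // starts from the checkpoint files, so the reassignment is what actually
    // breaks the ever-growing lineage
    dataset = dataset.checkpoint()
  }
  head = dataset.head(1)
}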

Try unpersisting the dataset before caching it again; this way you remove the old copy of the dataset from memory and keep only the latest one, avoiding multiple copies in memory. Below is a sample, but you have to keep dataset.unpersist() in the correct location based on your code logic:
if (count % 50 == 0) {
  dataset.unpersist()
}
dataset = dataset.filter(/* filter condition based on the head of the dataset */)
if (count % 50 == 0) {
  dataset.cache()
}
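Note that once dataset is reassigned, the reference to the previously cached copy is gone, so a helper along the following lines (a sketch, not the answer's exact code) makes the hand-off explicit:

import org.apache.spark.sql.DataFrame

// Cache the newly filtered data, materialize it, then drop the old cached copy,
// so at most one copy of the dataset sits in executor memory at a time.
def swapCache(previous: DataFrame, filtered: DataFrame): DataFrame = {
  val next = filtered.cache()
  next.count()          // force the new cache to materialize
  previous.unpersist()  // release the old copy's blocks
  next
}

Inside the loop this would be called every 50 iterations, e.g. dataset = swapCache(dataset, dataset.filter(/* ... */)).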

Related

Difference between size and sizeIs

What is the semantic difference between size and sizeIs? For example,
List(1,2,3).sizeIs > 1 // true
List(1,2,3).size > 1 // true
Luis mentions in a comment that
...on 2.13+ one can use sizeIs > 1 which will be more efficient than
size > 1 as the first one does not compute all the size before
returning
Add size comparison methods to IterableOps #6950 seems to be the pull request that introduced it.
Reading the scaladoc
Returns a value class containing operations for comparing the size of
this $coll to a test value. These operations are implemented in terms
of sizeCompare(Int)
it is not clear to me why sizeIs is more efficient than regular size.
As far as I understand the changes, the idea is that for collections that do not have an O(1) (constant) size, sizeIs can be more efficient, especially for comparisons with small values (like 1 in the comment).
But why?
Simple: instead of computing the whole size and then doing the comparison, sizeIs returns an object which, when computing the comparison, can return early.
For example, let's check the code:
def sizeCompare(otherSize: Int): Int = {
  if (otherSize < 0) 1
  else {
    val known = knownSize
    if (known >= 0) Integer.compare(known, otherSize)
    else {
      var i = 0
      val it = iterator
      while (it.hasNext) {
        if (i == otherSize) return if (it.hasNext) 1 else 0 // HERE!!! - return as fast as possible.
        it.next()
        i += 1
      }
      i - otherSize
    }
  }
}
Thus, in the example from the comment, take that List of three elements (and imagine it being very, very long). sizeIs > 1 will return as soon as it knows that the List has at least one element and the iterator still has more, saving the cost of traversing the other two elements to compute a size of 3 and then doing the comparison.
Note that if the comparing value is greater than or equal to the size of the collection, the performance would be roughly the same (maybe slower than plain size due to the extra comparisons on each step). Thus, I would only recommend this for comparisons with small values, or when you believe the values will be smaller than the size of the collection.
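To make that concrete, here is a minimal usage sketch (Scala 2.13+); the list is made up for illustration:

// A deliberately long List: computing its full size walks all million elements.
val xs = List.fill(1000000)("x")

// Walks the whole list to get 1000000, then compares.
val slow = xs.size > 1

// Stops after inspecting roughly the first two elements, thanks to the early
// return in sizeCompare, and yields the same result.
val fast = xs.sizeIs > 1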

Merge Sort algorithm efficiency

I am currently taking an online algorithms course in which the teacher doesn't give code to solve the algorithm, but rather rough pseudo code. So before taking to the internet for the answer, I decided to take a stab at it myself.
In this case, the algorithm that we were looking at is the merge sort algorithm. After being given the pseudo code, we also dove into analyzing the algorithm for run times against n items in an array. After a quick analysis, the teacher arrived at 6n·log₂(n) + 6n as an approximate run time for the algorithm.
The pseudo code given was for the merge portion of the algorithm only and was given as follows:
C = output [length = n]
A = 1st sorted array [n/2]
B = 2nd sorted array [n/2]
i = 1
j = 1
for k = 1 to n
    if A(i) < B(j)
        C(k) = A(i)
        i++
    else [B(j) < A(i)]
        C(k) = B(j)
        j++
    end
end
He basically did a breakdown of the above taking 4n+2 (2 for the declarations i and j, and 4 for the number of operations performed -- the for, if, array position assignment, and iteration). He simplified this, I believe for the sake of the class, to 6n.
This all makes sense to me; my question arises from the implementation that I am performing, how it affects the algorithm, and some of the trade-offs/inefficiencies it may add.
Below is my code in Swift, using a playground:
func mergeSort<T: Comparable>(_ array: [T]) -> [T] {
    guard array.count > 1 else { return array }

    let lowerHalfArray = array[0..<array.count / 2]
    let upperHalfArray = array[array.count / 2..<array.count]

    let lowerSortedArray = mergeSort(Array(lowerHalfArray))
    let upperSortedArray = mergeSort(Array(upperHalfArray))

    return merge(lhs: lowerSortedArray, rhs: upperSortedArray)
}
func merge<T: Comparable>(lhs: [T], rhs: [T]) -> [T] {
    guard lhs.count > 0 else { return rhs }
    guard rhs.count > 0 else { return lhs }

    var i = 0
    var j = 0
    var mergedArray = [T]()
    let loopCount = (lhs.count + rhs.count)

    for _ in 0..<loopCount {
        if j == rhs.count || (i < lhs.count && lhs[i] < rhs[j]) {
            mergedArray.append(lhs[i])
            i += 1
        } else {
            mergedArray.append(rhs[j])
            j += 1
        }
    }

    return mergedArray
}
let values = [5,4,8,7,6,3,1,2,9]
let sortedValues = mergeSort(values)
My questions for this are as follows:
Do the guard statements at the start of the merge<T:Comparable> function actually make it more inefficient? Considering we are always halving the array, the only time that it will hold true is for the base case and when there is an odd number of items in the array.
This to me seems like it would actually add more processing and give minimal return since the time that it happens is when we have halved the array to the point where one has no items.
Concerning my if statement in the merge: since it is checking more than one condition, does this affect the overall efficiency of the algorithm that I have written? If so, the effect seems to me to vary based on when it breaks out of the if statement (e.g. at the first condition or the second).
Is this something that is considered heavily when analyzing algorithms, and if so how do you account for the variance when it breaks out from the algorithm?
Any other analysis/tips you can give me on what I have written would be greatly appreciated.
You will very soon learn about Big-O and Big-Theta where you don't care about exact runtimes (believe me when I say very soon, like in a lecture or two). Until then, this is what you need to know:
Yes, the guards take some time, but it is the same amount of time in every iteration. So if each iteration takes X amount of time without the guard and you do n function calls, then it takes X*n amount of time in total. Now add in the guards, which take Y amount of time in each call. You now need (X+Y)*n time in total. This is a constant factor, and when n becomes very large the (X+Y) factor becomes negligible compared to the n factor. That is, if you can reduce a function X*n to (X+Y)*(log n) then it is worthwhile to add the Y amount of work because you do fewer iterations in total.
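As a made-up numeric illustration of that trade-off (the constants are arbitrary):

// Suppose each iteration costs X = 5 units without the guards and Y = 1 extra
// unit with them, and there are n = 1,000,000 items.
val x = 5.0
val y = 1.0
val n = 1e6

val withoutGuards = x * n                       // 5,000,000 units
val withGuardsSameIterations = (x + y) * n      // 6,000,000 units: only a constant-factor increase
val withGuardsLogIterations = (x + y) * (math.log(n) / math.log(2)) // ~120 units, if the extra work cut iterations to log n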
The same reasoning applies to your second question. Yes, checking "if X or Y" takes more time than checking "if X" but it is a constant factor. The extra time does not vary with the size of n.
In some languages you only check the second condition if the first fails. How do we account for that? The simplest solution is to realize that the upper bound of the number of comparisons will be 3, while the number of iterations can be potentially millions with a large n. But 3 is a constant number, so it adds at most a constant amount of work per iteration. You can go into nitty-gritty details and try to reason about the distribution of how often the first, second and third condition will be true or false, but often you don't really want to go down that road. Pretend that you always do all the comparisons.
So yes, adding the guards might be bad for your runtime if you do the same number of iterations as before. But sometimes adding extra work in each iteration can decrease the number of iterations needed.

How to connect component in Spark when data is too large

When dealing with connected components on big data, I find it very difficult to merge them in Spark.
The data structure in my research can be simplified to RDD[Array[Int]]. For example:
RDD[Array(1,2,3), Array(1,4), Array(5,6), Array(5,6,7,8), Array(9), Array(1)]
The objective is to merge two Arrays if they have a non-empty intersection, ending up with arrays that have no intersection at all. Therefore, after merging, it should be:
RDD[Array(1,2,3,4), Array(5,6,7,8), Array(9)]
The problem is essentially connected components, as in the Pregel framework for graph algorithms. One solution is to first find the edge connections between Arrays using a cartesian product and then merge them. However, in my case, there are 300K Arrays with a total size of 1 GB. Therefore, the time and memory complexity would be roughly 300K*300K. When I run the program on my Mac Pro in Spark, it gets completely stuck.
Thanks
Here is my solution. It might not be decent enough, but it works for small data. Whether it can be applied to large data needs further proof.
import org.apache.spark.rdd.RDD

def mergeCanopy(canopies: RDD[Array[Int]]): Array[Array[Int]] = {
  /* try to merge two canopies */
  def mergeOrAppend(disjoint: Set[Array[Int]], cluster: Array[Int]): Set[Array[Int]] = {
    var disjoints = disjoint
    for (clus <- disjoint) {
      if ((clus.toSet & cluster.toSet).nonEmpty) {
        disjoints += (clus.toSet ++ cluster.toSet).toArray
        disjoints -= clus
        return disjoints
      }
    }
    disjoints += cluster
    disjoints
  }

  val s = Set[Array[Int]]()
  val c = canopies.aggregate(s)(mergeOrAppend, _ ++ _)
  c.toArray
}
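If the cartesian product is the bottleneck, one alternative worth sketching (not part of the answer above) is to turn each array into graph edges and let GraphX's built-in connectedComponents do the merging; mergeViaGraphX and the edge construction below are illustrative choices:

import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD

def mergeViaGraphX(arrays: RDD[Array[Int]]): RDD[Array[Int]] = {
  // Link every element of an array to the array's head (the head gets a
  // self-loop, so singleton arrays like Array(9) still appear as vertices).
  val edges = arrays.filter(_.nonEmpty)
    .flatMap(a => a.map(x => (a.head.toLong, x.toLong)))
  val graph = Graph.fromEdgeTuples(edges, defaultValue = 0)

  // connectedComponents labels each vertex with the smallest vertex id in its
  // component; grouping by that label yields the merged, disjoint arrays.
  graph.connectedComponents().vertices
    .map { case (id, component) => (component, id.toInt) }
    .groupByKey()
    .map { case (_, ids) => ids.toArray.distinct.sorted }
}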

Find out HDD usage of binary data in a field

I have a collection with a field image that contains images (BinData). I want to find out what percentage of the DB is used by the images. What is the most efficient way to calculate the total size of all images?
I want to avoid fetching all images from the DB server, so I came up with this code:
mapper = Code("""
    function() {
        var n = 0;
        if (this.image) {
            n = this.image.length();
        }
        emit('sizes', n);
    }
""")

reducer = Code("""
    function(key, sizes) {
        var total = 0;
        for (var i = 0; i < sizes.length; i++) {
            total += sizes[i];
        }
        return total;
    }
""")
result = db.files.map_reduce(mapper, reducer, "image_sizes")
During the execution, the memory usage of mongodb gets quite high; it looks as if the whole data set is loaded into memory. How can this be optimized? Also, does it make sense to call this.image.length() in order to find out how many bytes the images occupy on the hard drive?
You cannot avoid loading all the data into memory. MongoDB treats a document as its atomic unit, and by querying all documents, it will pull all of them into memory.
As an alternative, what might possibly help you is just to see how many bytes a collection takes up, but that of course only works if the only thing you have stored in your collection are images. On the shell, you can do this with:
db.files.stats()
which has the field storageSize that shows you approximately how much storage is needed for your images. This is not nearly as accurate as going through all your images, though.

Scala vals vs vars

I'm pretty new to Scala, but I'd like to know what the preferred way of solving this problem is. Say I have a list of items and I want to know the total amount of the items that are checks. I could do something like so:
val total = items.filter(_.itemType == CHECK).map(_.amount).sum
That would give me what I need, the sum of all checks in an immutable variable. But it does it with what seems like 3 iterations: once to filter the checks, again to map the amounts, and then the sum. Another way would be to do something like:
var total = BigDecimal(0)
for (
  item <- items
  if item.itemType == CHECK
) total += item.amount
This gives me the same result but with 1 iteration and a mutable variable, which seems fine too. But if I wanted to extract more information, say the total number of checks, that would require more counters or mutable variables, but I wouldn't have to iterate over the list again. It doesn't seem like the "functional" way of achieving what I need.
var numOfChecks = 0
var total = BigDecimal(0)
items.foreach { item =>
  if (item.itemType == CHECK) {
    numOfChecks += 1
    total += item.amount
  }
}
So if you find yourself needing a bunch of counters or totals on a list, is it preferred to keep mutable variables, or to not worry about it and do something along the lines of:
val checks = items.filter(_.itemType == CHECK)
val total = checks.map(_.amount).sum
return (checks.size, total)
which seems easier to read and only uses vals
Another way of solving your problem in one iteration would be to use views or iterators:
items.iterator.filter(_.itemType == CHECK).map(_.amount).sum
or
items.view.filter(_.itemType == CHECK).map(_.amount).sum
This way the evaluation of the expression is delayed until the call of sum.
If your items are case classes you could also write it like this:
items.iterator.collect { case Item(amount, CHECK) => amount }.sum
I find that speaking of doing "three iterations" is a bit misleading -- after all, each iteration does less work than a single iteration that does everything. So it doesn't automatically follow that iterating three times will take longer than iterating once.
Creating temporary objects, now that is a concern, because you'll be hitting memory (even if cached), which isn't the case for the single iteration. In those cases, view will help, even though it adds more method calls to do the same work. Hopefully, the JVM will optimize that away. See Moritz's answer for more information on views.
You may use foldLeft for that:
(BigDecimal(0) /: items) ((total, item) =>
  if (item.itemType == CHECK)
    total + item.amount
  else
    total
)
The following code will return a tuple (number of checks -> sum of amounts):
((0, BigDecimal(0)) /: items) ((total, item) =>
  if (item.itemType == CHECK)
    (total._1 + 1, total._2 + item.amount)
  else
    total
)
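The /: form is deprecated in more recent Scala versions; the same one-pass tuple accumulation can be written with foldLeft. A sketch, assuming the question's hypothetical Item class with a BigDecimal amount, an itemType, and a CHECK constant:

// Accumulate (number of checks, sum of amounts) in a single pass, vals only.
val (numOfChecks, total) =
  items.foldLeft((0, BigDecimal(0))) { case ((count, sum), item) =>
    if (item.itemType == CHECK) (count + 1, sum + item.amount)
    else (count, sum)
  }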