Preserving index in an RDD in Spark - scala

I would like to create an RDD which contains String elements. Alongside these elements I would like a number indicating the index of each element. However, I do not want this number to change if I remove elements, because it should remain the element's original index. It is also important that the order of the RDD is preserved.
If I use zipWithIndex and then remove some elements, will the indexes change? Which function or structure can I use to keep the indexes unchanged? I was thinking of creating a pair RDD, but my input data does not contain indexes.

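Answering rather than deleting: my problem was easily solved by zipWithIndex, which fulfilled all my requirements. The index is stored together with each element as a pair, so removing elements afterwards (for example with filter) does not change the indexes of the elements that remain.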
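A minimal sketch of why this works, using a plain Scala Vector as a stand-in for the RDD so that it runs without Spark (the sample data and the object name are illustrative assumptions, not part of the original question). zipWithIndex pairs each element with its position once, and a later filter drops pairs without renumbering the rest. On a real RDD, zipWithIndex behaves analogously but yields Long indexes.

```scala
object PreserveIndex {
  // Hypothetical sample data standing in for the RDD's String elements.
  val words: Vector[String] = Vector("alpha", "beta", "gamma", "delta")

  // Attach the original position to every element exactly once.
  val indexed: Vector[(String, Int)] = words.zipWithIndex

  // Removing elements afterwards leaves the stored indexes untouched.
  val remaining: Vector[(String, Int)] =
    indexed.filter { case (word, _) => word != "beta" }

  def main(args: Array[String]): Unit =
    remaining.foreach { case (word, index) => println(s"$index: $word") }
}
```

After the filter, "gamma" still carries index 2 and "delta" index 3, even though "beta" is gone.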

Related

How to efficiently loop through a MongoDB collection in order to update a sequence column?

I am new to MongoDB/Mongoose and have a challenge I'm trying to solve in order to avoid a performance rabbit hole!
I have a MongoDB collection containing a numeric column called 'sequence' and after inserting a new document, I need to cycle through the collection starting at the position of the inserted document and to increment the value of sequence by one. In this way I maintain a collection of documents numbered from 1 to n (i.e. where n = the number of documents in the collection), and can render the collection as a table in which newly inserted records appear in the right place.
Clearly one way to do this is to loop through the collection, doing a seq++ in each iteration, and then using Model.updateOne() to apply the value of seq to sequence for each document in turn.
My concern is that this involves calling updateOne() potentially hundreds of times, which might not be optimal for performance. Any suggestions on how I should approach this in a more efficient way?

Which element does DataFrame.dropDuplicates drop

If I sort a DataFrame in descending order based on a column and then drop the duplicates using df.dropDuplicates, which element will be removed? The element that was smaller according to the sort?
The dropDuplicates method preserves the first element and removes the others.
So yes, after a descending sort only the largest element (according to the sort) is preserved and the others are removed.
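The keep-first semantics described above can be sketched on plain Scala collections. This is only a model of the described behaviour, not Spark code: the rows, the key/value layout, and the object name are assumptions, and distinctBy requires Scala 2.13 or later. Whether a distributed DataFrame keeps the same row after a sort is worth verifying on your own data.

```scala
object KeepFirstDuplicate {
  // Hypothetical rows of (key, value); duplicates are rows sharing a key.
  val rows: List[(String, Int)] = List(("a", 1), ("b", 5), ("a", 9), ("b", 2))

  // Sort descending by value, then keep the first row seen for each key.
  // distinctBy retains the first occurrence of every key it encounters.
  val kept: List[(String, Int)] = rows.sortBy(-_._2).distinctBy(_._1)
}
```

With the descending sort applied first, the row that survives for each key is the one with the largest value.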

Scala Count Occurrences in Large List

In Scala I have a list of tuples List[(String, String)]. So now from this list I want to find how many times each unique tuple appears in the list.
One way to do this would be to apply groupBy { x => x } and then find the length of each group. But my data set is quite large and this is taking a lot of time.
So is there any better way of doing this?
I would do the counting manually, using a Map. Iterate over your list and build a count map as you go. Keys in the count map are the unique items from the original list; values are the number of occurrences of each key. If the item being processed is already in the count map, increase its value by 1. If not, add it with the value 1. You can use getOrElse:
count(current_item) = count.getOrElse(current_item, 0) + 1;
This should be faster than groupBy followed by a length check, and it will also require less memory.
For other suggestions, see also this discussion.
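The single-pass counting described above could be sketched as follows. The object and method names are illustrative assumptions. A mutable map is used during the pass and converted to an immutable Map at the end.

```scala
import scala.collection.mutable

object CountTuples {
  // Count how often each distinct tuple occurs, in one pass over the list.
  def countOccurrences(items: List[(String, String)]): Map[(String, String), Int] = {
    val count = mutable.Map.empty[(String, String), Int]
    for (item <- items) {
      // Unseen items start from 0; every visit adds one.
      count(item) = count.getOrElse(item, 0) + 1
    }
    count.toMap
  }

  def main(args: Array[String]): Unit = {
    val data = List(("a", "x"), ("b", "y"), ("a", "x"))
    println(countOccurrences(data))
  }
}
```

Unlike groupBy, this never materializes a list of duplicates per key; it stores only one counter per distinct tuple.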

How to sort Array[Row] by given column index in Scala

How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect(), which gives me an Array[Row], but I want to sort it based on a given column index.
I have already implemented quick-sort logic and it works, but it needs too many for loops.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it: once you collect it, you lose the distributed (and parallel) computation. You can use DataFrame's sort, for example to sort in ascending order by column "col1":
import org.apache.spark.sql.functions.asc
val sorted = dataframe.sort(asc("col1"))
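If the data has already been collected, the built-in sortBy on Array can sort by a column index in one line. The sketch below uses Vector[Int] as a stand-in for Row so that it runs without Spark on the classpath; that stand-in, along with the object and method names, is an assumption. With a real Row you would replace row(columnIndex) with a typed getter such as row.getInt(columnIndex).

```scala
object SortByColumn {
  // Stand-in for Array[Row]: each row is a Vector[Int] (an assumption made
  // so the example does not depend on Spark).
  def sortByColumn(rows: Array[Vector[Int]], columnIndex: Int): Array[Vector[Int]] =
    rows.sortBy(row => row(columnIndex)) // ascending by the chosen column

  def main(args: Array[String]): Unit = {
    val rows = Array(Vector(3, 1), Vector(1, 2), Vector(2, 0))
    sortByColumn(rows, 0).foreach(println)
  }
}
```

sortBy returns a new sorted array and leaves the input untouched.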

Selectively remove and delete objects from an NSMutableArray in Swift

Basic question. What is the best way to selectively remove and delete items from a mutable Array in Swift?
There are options that do NOT seem to be suited for this, like calling removeObject inside a for-in loop or an enumeration block, and others that appear to work in general, like:
for loop using an index + calling removeObjectAtIndex, even inside the loop
for in loop for populating an arrayWithItemsToRemove and then use originalArray.removeObjectsInArray(arrayWithItemsToRemove)
using .filter to create a new array seems to be really nice, but I am not quite sure how I feel about replacing the whole original array
Is there a recommended, simple and secure way to remove items from an array? One of those I mentioned or something I am missing?
It would be nice to get different takes (with pros and cons) or preferences on this. I still struggle choosing the right one.
If you want to loop and remove elements from an NSMutableArray based on a condition, you can loop over the array in reverse order (from the last index to zero) and remove the objects satisfying the condition.
For example, if you have an array of integers and want to remove the numbers divisible by three, you can run the loop like this:
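import Foundation
let array: NSMutableArray = [1, 2, 3, 4, 5, 6, 7]
// Walk from the last index down to 0 so removals never shift unvisited elements.
for index in stride(from: array.count - 1, through: 0, by: -1) {
    // removeObject(at:) is the Swift 3+ name for removeObjectAtIndex.
    if let value = array[index] as? Int, value % 3 == 0 {
        array.removeObject(at: index)
    }
}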
Looping in reverse order ensures that the indexes of the array elements still to be checked don't change. When looping forward instead, if you remove, for instance, the first element, the element previously at index 1 moves to index 0, and you have to account for that in the code.
Using removeObject (which doesn't work with the above code) in a loop is not recommended for performance reasons, because its implementation loops through all elements of the array and uses isEqual to determine whether to remove each object. The complexity rises from O(n) to O(n^2): in the worst case, where all elements of the array are removed, the array is traversed once in the main loop and traversed again for each element. So all solutions based on enumeration blocks, for-in, etc., should be avoided unless you have a good reason.
filter instead is a good alternative, and it's what I'd use because:
it's concise and clear: 1 line of code as opposed to 5 lines (including closing brackets) of the index based solution
its performance is comparable to the index based solution - it is a bit slower, but I think not by much
It might not be ideal in all cases though, because, as you said, it generates a new array rather than operating in place.
When working with NSMutableArray you shouldn't remove objects while you are looping along the mutable array itself (unless looping backwards, as pointed out by Antonio's answer).
A common solution is to make an immutable copy of the array, iterate over the copy, and selectively remove objects from the original mutable array by calling "removeObject". You can also call "removeObjectAtIndex", but then you have to calculate the index yourself: indexes in the original array and the copy stop matching once objects are removed, so you must decrement the "removal index" each time an object is removed.
A better solution is to loop over the array once, create an NSIndexSet with the indexes of the objects to remove, and then call "removeObjectsAtIndexes:" on the mutable array.
See documentation on NSMutableArray's "removeObjectsAtIndexes:" in Swift.
Some of the options:
For loop over indexes and calling removeObjectAtIndex: 1) You will have to deal with the fact that when you remove, the index of the following object will become the current index, so you have to make sure to not increment the index in that case; you can avoid this by iterating backwards. 2) Each call to removeObjectAtIndex is O(n) (since it must shift all following elements forwards), so the algorithm is O(n^2).
For loop to build a set of elements to remove and then calling removeObjectsInArray: The first part is O(n). removeObjectsInArray uses a hash table to test elements for removal efficiently; hash table access is O(1) on average but O(n) worst-case, so the algorithm is O(n) on average, but O(n^2) worst-case.
Using filter to create a new array: This is O(n). It creates a new array.
For loop to build an index set of indexes of elements to remove (or with indexesOfObjectsPassingTest), then remove them using removeObjectsAtIndexes: I believe this is O(n). It does not create a new array.
Use filterUsingPredicate using a predicate based on a block of your test: I believe this is also O(n). It does not create a new array.