Scala Count Occurrences in Large List

In Scala I have a list of tuples, List[(String, String)]. From this list I want to find how many times each unique tuple appears.
One way to do this would be to apply groupBy(x => x) and then take the length of each group. But my data set is quite large and this is taking a lot of time.
So is there any better way of doing this?

I would do the counting manually, using a Map. Iterate over your collection/list and build a count map as you go: keys in the count map are the unique items from the original collection/list, values are the number of occurrences of each key. If the item being processed is already in the count map, increase its value by 1; if not, add it with value 1. You can handle both cases at once with getOrElse:
count(currentItem) = count.getOrElse(currentItem, 0) + 1
This should be faster than groupBy followed by a length check, and it will also require less memory.
For other suggestions, check also this discussion.
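For reference, here is a minimal sketch of that manual counting approach, assuming the input is a plain List[(String, String)] (the sample data is made up):
import scala.collection.mutable
// Hypothetical input: a large list of (String, String) tuples.
val pairs: List[(String, String)] = List(("a", "b"), ("a", "b"), ("c", "d"))
// Single pass with a mutable Map: constant work per element, no intermediate groups are built.
val counts = mutable.Map.empty[(String, String), Int]
for (pair <- pairs)
  counts(pair) = counts.getOrElse(pair, 0) + 1
// counts is now Map((a,b) -> 2, (c,d) -> 1)
An immutable alternative is pairs.foldLeft(Map.empty[(String, String), Int])((m, p) => m.updated(p, m.getOrElse(p, 0) + 1)), which builds the same counts in one pass without mutation.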

Related

Preserving index in a RDD in Spark

I would like to create an RDD which contains String elements. Alongside these elements I would like a number indicating the index of each element. However, I do not want this number to change if I remove elements, as I want the number to be the original index (thus preserving it). It is also important that the order is preserved in this RDD.
If I use zipWithIndex and thereafter remove some elements, will the indexes change? Which function/structure can I use to have unchanged indexes? I was thinking of creating a Pair RDD, however my input data does not contain indexes.
Answering rather than deleting: my problem was easily solved by zipWithIndex, which fulfilled all my requirements.
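A minimal sketch of that approach, assuming a local Spark context and made-up data: zipWithIndex pairs each element with its original position, and those indexes stay attached even if elements are removed later:
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("zipWithIndexExample").setMaster("local[*]"))
// zipWithIndex pairs each element with its original position (a Long).
val indexed = sc.parallelize(Seq("a", "b", "c", "d")).zipWithIndex()   // RDD[(String, Long)]
// Removing elements afterwards does not change the indexes that were already assigned.
val filtered = indexed.filter { case (value, _) => value != "b" }
// filtered contains (a,0), (c,2), (d,3): the original indexes are preserved.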

How to sort Array[Row] by given column index in Scala

How to sort Array[Row] by given column index in Scala?
I'm using RDD[Row].collect(), which gives me Array[Row], but I want to sort it based on a given column index.
I have already written quick-sort logic and it works, but it involves too many for loops and so on.
I would like to use a Scala built-in API which can do this task with the minimum amount of code.
It would be much more efficient to sort the DataFrame before collecting it - if you collect it first, you lose the distributed (and parallel) computation. You can use DataFrame's sort, for example - ascending order by column "col1":
val sorted = dataframe.sort(asc("col1"))
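For a fuller picture, here is a small sketch of both options; the names dataframe and rows are placeholders for your existing DataFrame and collected Array[Row], and the chosen column is assumed to hold Strings:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.asc
// Preferred: sort in Spark, then collect (dataframe is assumed to already exist).
val sortedRows: Array[Row] = dataframe.sort(asc("col1")).collect()
// If you already hold an Array[Row] (here called rows), sort it locally by a column index,
// assuming that column contains String values.
val colIndex = 0
val locallySorted: Array[Row] = rows.sortBy(_.getString(colIndex))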

How to traverse a RDD efficiently

For example, I have a Scala RDD with 10000 elements, and I want to take each element one by one to deal with. How do I do that? I tried using take(i).drop(i-1), but it is extraordinarily time-consuming.
According to what you said in the comments:
yourRDD.map(tuple => tuple._2.map(elem => doSomething(elem)))
The first map iterates over the tuples inside your RDD (that is why I called the variable tuple); then, for every tuple, we take the second element with ._2 and apply another map, which iterates over all the elements of your Iterable (that is why I called the variable elem).
doSomething() is just a placeholder for whatever function you want to apply to each element.
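Here is a small, self-contained sketch of that pattern, assuming the RDD holds (key, Iterable) pairs such as the output of groupByKey; the data and doSomething below are made up:
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("traverseExample").setMaster("local[*]"))
// Hypothetical function applied to each element.
def doSomething(x: Int): Int = x * 2
// An RDD of (key, Iterable[Int]) pairs, e.g. the output of groupByKey.
val yourRDD = sc.parallelize(Seq(1 -> 10, 1 -> 20, 2 -> 30)).groupByKey()
// The outer map walks the tuples; the inner map walks each tuple's Iterable.
val result = yourRDD.map(tuple => tuple._2.map(elem => doSomething(elem)))
result.collect().foreach(println)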

sorting a list of tuples in Pyspark

I have a set of (key, value) tuple pairs which look like this:
X=[(('cat','mouse'),1),(('dog','rat'),20),(('hamster','skittles'),67)]
which I want to sort in order of the second item in the tuple. Pythonically I would have used:
sorted(X, key=lambda tup:tup[1])
I also want to get the key,value pair with the highest value, again, pythonically this would be simple:
max_X=max(x[1] for x in X)
max_tuple=[x for x in X if x[1]==max_X]
However, I do not know how to translate this into a Spark job.
To get the pair with the highest value, pass a key function to max:
X.max(key=lambda x: x[1])
You could also do it another way, which is probably faster if you need to sort your RDD anyway. But it is slower if you don't need the RDD sorted, because sorting takes longer than simply finding the max. (So, in a vacuum, use the max function.)
X.sortBy(lambda x: x[1], False).first()
This sorts as you did before, but passing False sorts in descending order. Then you take the first element, which will be the largest.
Figured it out in the 2 minutes since posting!
X.sortBy(lambda x: x[1]).collect()

Possible to retrieve multiple random, non-sequential documents from MongoDB?

I'd like to retrieve a random set of documents from a MongoDB database. So far after lots of Googling, I've only seen ways to retrieve one random document OR a set of documents starting at a random skip position but where the documents are still sequential.
I've tried mongoose-simple-random, and unfortunately it doesn't retrieve a "true" random set. What it does is skip to a random position and then retrieve n documents from that position.
Instead, I'd like to retrieve a random set like MySQL does using one query (or a minimal amount of queries), and I need this list to be random every time. I need this to be efficient -- relatively on par with such a query with MySQL. I want to reproduce the following but in MongoDB:
SELECT * FROM products ORDER BY rand() LIMIT 50;
Is this possible? I'm using Mongoose, but an example with any adapter -- or even a straight MongoDB query -- is cool.
I've seen one method of adding a field to each document, generating a random value for that field, and using {rand: {$gte: rand()}} in each query we want randomized. But my concern is that two queries could theoretically return the same set.
You can do it in two requests, but in an efficient way:
Your first request just gets the list of all the "_id" values of the documents in your collection. Be sure to use a projection: db.products.find({}, { '_id': 1 }).
Now that you have the list of "_id" values, just pick N of them at random.
Do a second query using the $in operator.
What is especially important is that your first query is fully supported by an index (because it is on "_id"). This index is likely fully in memory (otherwise you'd probably have performance problems anyway), so only the index is read while running the first query, and it's incredibly fast.
Although the second query means reading actual documents, the index will help a lot.
If you can do things this way, you should try it.
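The "pick N at random" step happens in application code; here is a tiny Scala sketch of it, with ids standing in for the "_id" list returned by the first query (placeholder values):
import scala.util.Random
// ids stands in for the "_id" values returned by the first, projection-only query.
val ids: Seq[String] = Seq("id1", "id2", "id3", "id4", "id5")
val n = 2
// Shuffle and take N distinct ids; these then go into the second query's $in filter.
val picked = Random.shuffle(ids).take(n)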
I don't think MySQL ORDER BY rand() is particularly efficient - as I understand it, it essentially assigns a random number to each row, then sorts the table on this random number column and returns the top N results.
If you're willing to accept some overhead on your inserts to the collection, you can reduce the problem to generating N random integers in a range. Add a counter field to each document: each document is assigned a unique positive integer, sequentially. It doesn't matter which document gets which number, as long as the assignment is unique and the numbers are sequential, and either you don't delete documents or you complicate the counter scheme to handle holes.
You can do this by making your inserts two-step. In a separate counter collection, keep a document holding the first number that hasn't been used yet. When an insert occurs, first findAndModify the counter document to retrieve the next counter value and increment it atomically, then insert the new document with that counter value.
To find N random documents, find the max counter value, generate N distinct random numbers in the range defined by the max counter, then use $in to retrieve the documents. Most languages have random libraries that will handle generating the N random integers in a range.
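Generating the N distinct random integers is plain application code as well; a small Scala sketch, with maxCounter standing in for the current maximum counter value (the numbers here are placeholders):
import scala.util.Random
import scala.collection.mutable
// maxCounter stands in for the highest counter value currently stored in the collection.
val maxCounter = 100000
val n = 50
// Draw N distinct random counter values in [1, maxCounter]; a Set rules out duplicates.
val picks = mutable.Set.empty[Int]
while (picks.size < n) picks += Random.nextInt(maxCounter) + 1
// picks then feeds a $in query on the counter field.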