Selective sampling in spark RDD - scala

I have an RDD of logged events, and I want to take a few samples from each category.
Data is like below
My try
eventlist = ['type1', 'type2'....]
orginalRDD = sc.textFile("/path/to/file/*.gz").map(lambda x: x.split("|"))
samplelist = []
for event in eventlist:
    # take(5) already returns a local list, so no collect() is needed
    eventsample = orginalRDD.filter(lambda x: x[3] == event).take(5)
    samplelist.append(eventsample)
print samplelist
I have two questions about this:
1. Is there a better or more efficient way to collect samples based on a specific condition?
2. Is it possible to collect the unsplit lines instead of the split lines?
Python or Scala suggestions are welcome!

If the sample doesn't have to be random, something like this should work just fine:
n = ... # Number of elements you want to sample
pairs = orginalRDD.map(lambda x: (x[3], x))
pairs.aggregateByKey(
    [], # zero values
    lambda acc, x: (acc + [x])[:n], # Add new value and trim to n elements
    lambda acc1, acc2: (acc1 + acc2)[:n]) # Combine two accumulators and trim
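To see what the two functions do without a cluster, here is a minimal plain-Python sketch that mimics the per-partition and merge steps. It is only an illustration, not Spark code: take_n_per_key, the nested-list "partitions" and the sample lines are all made up for this example. It also keeps the raw, unsplit line as the value, which is one possible way to approach the second question.

```python
def take_n_per_key(partitions, n, key_index=3, sep="|"):
    """Plain-Python stand-in for the aggregateByKey pattern (hypothetical helper)."""
    seq_op = lambda acc, x: (acc + [x])[:n]          # add a value, trim to n
    comb_op = lambda acc1, acc2: (acc1 + acc2)[:n]   # merge two accumulators, trim

    merged = {}
    for partition in partitions:
        local = {}
        for line in partition:
            key = line.split(sep)[key_index]
            # Split only to find the key; store the unsplit line as the value.
            local[key] = seq_op(local.get(key, []), line)
        for key, acc in local.items():
            merged[key] = comb_op(merged.get(key, []), acc)
    return merged


# Made-up input: two "partitions" of pipe-delimited lines.
partitions = [
    ["a|b|c|type1|1", "a|b|c|type2|2", "a|b|c|type1|3"],
    ["a|b|c|type1|4", "a|b|c|type2|5"],
]
samples = take_n_per_key(partitions, n=2)
```

Because every accumulator is trimmed to n elements, no partition ever holds more than n values per key.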
Getting a random sample is a little bit harder. One possible approach is to add a random value and sort before aggregation:
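import binascii
import os
import random
def add_random(iter):
    seed = int(binascii.hexlify(os.urandom(4)), 16)  # works on Python 2 and 3
    rs = random.Random(seed)
    for x in iter:
        yield (rs.random(), x)
(pairs
    .mapPartitions(add_random)  # attach a random sort key to every pair
    .sortByKey()                # order the pairs by that random key
    .values()                   # drop the random key again
    .aggregateByKey(
        [],
        lambda acc, x: (acc + [x])[:n],
        lambda acc1, acc2: (acc1 + acc2)[:n]))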
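The same idea (attach a random value, sort by it, then keep the first n values per key) can be sketched in plain Python. This is a local illustration only; random_sample_per_key, its seed parameter and the sample pairs are assumptions made for the example and are not part of any Spark API.

```python
import random


def random_sample_per_key(pairs, n, seed=None):
    """Shuffle (key, value) pairs by a random sort key, then keep n per key."""
    rs = random.Random(seed)
    # Attach a random number to every pair and sort by it.
    shuffled = sorted(((rs.random(), kv) for kv in pairs), key=lambda t: t[0])
    samples = {}
    for _, (key, value) in shuffled:
        acc = samples.setdefault(key, [])
        if len(acc) < n:          # same effect as trimming the accumulator to n
            acc.append(value)
    return samples


# Made-up input: ten values for type1, three for type2.
pairs = [("type1", i) for i in range(10)] + [("type2", i) for i in range(3)]
samples = random_sample_per_key(pairs, n=5, seed=42)
```

Keys with fewer than n values simply return everything they have.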
For a DataFrame specific solution see Choosing random items from a Spark GroupedData Object


Functional way to build a matrix from the list in scala

I've asked this question elsewhere but haven't got a concrete answer yet. I am given a vector v and I would like to construct a matrix m based on this vector according to the rules specified below. I would like to write the following code in a purely functional way, i.e. as a single expression of the form m = ... or similar. I can do it easily in a procedural way like this:
import scala.util.Random
val v = Vector.fill(50)(Random.nextInt(100))
val m = Array.fill[Int](10, 10)(0)
def populateMatrix(x: Int): Unit = m(x/10)(x%10) += 1
v.foreach(x => populateMatrix(x))
m foreach { row => row foreach print; println }
In words, I am iterating through v, getting a pair of indices (i,j) from each v(k) and updating the matrix m at these positions, i.e., m(i)(j) += 1. But I am looking for a functional way. It is clear to me how to implement this in, e.g., Mathematica:
v=RandomInteger[{99}, 300]
m=SparseArray[{Rule[{Quotient[#, 10] + 1, Mod[#, 10] + 1}, 1]}, {10, 10}] & /@ v // Total // Normal
But how can I do it in Scala, which is a functional language too?
Your populateMatrix approach can be "reversed": map the vector into index tuples, group them, count the size of each group, and turn the result into a map (index tuple -> size), which is then used to fill the corresponding positions of the array with Array.tabulate:
val v = Vector.fill(50)(Random.nextInt(100))
val values = v.map(i => (i/10, i%10)).groupBy(identity).mapValues(_.size)
val result = Array.tabulate(10,10)( (i, j)=> values.getOrElse((i,j), 0))
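For comparison, the same "count index pairs, then tabulate" idea can be sketched in Python. This is only an illustration of the technique; build_matrix and the sample vector are hypothetical names and data chosen for the example.

```python
from collections import Counter


def build_matrix(v, size=10):
    """Count how often each (row, col) index pair occurs, then tabulate."""
    counts = Counter((x // size, x % size) for x in v)
    return [[counts.get((i, j), 0) for j in range(size)] for i in range(size)]


# Made-up input; each value x maps to cell (x // 10, x % 10).
v = [0, 11, 11, 99, 42]
m = build_matrix(v)
```

No cell is ever mutated: the matrix is built in one pass from the immutable counts, which mirrors the Array.tabulate approach.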

How to reduce shuffling and time taken by Spark while making a map of items?

I am using Spark to read a CSV file like this:
x, y, z
x, y
x, y, c, f
x, z
I want to build a map from each item to its count. This is the code I wrote:
private def genItemMap[Item: ClassTag](data: RDD[Array[Item]], partitioner: HashPartitioner): mutable.Map[Item, Long] = {
  val immutableFreqItemsMap = data.flatMap(t => t)
    .map(v => (v, 1L))
    .reduceByKey(partitioner, _ + _)
    .collectAsMap()
  val freqItemsMap = mutable.Map(immutableFreqItemsMap.toSeq: _*)
  freqItemsMap
}
When I run it, it takes a lot of time and shuffle space. Is there a way to reduce the time?
I have a 2-node cluster with 2 cores each and 8 partitions. The CSV file has 170000 lines.
If you just want a count per unique item, then I suppose you can take the following approach:
val data: RDD[Array[Item]] = ???
val itemFrequency = data
  .flatMap(arr => arr.map(item => (item, 1)))
.reduceByKey(_ + _)
Do not provide any partitioner while reducing, otherwise it will cause re-shuffling. Just keep it with the partitioning it already had.
Also... do not collect the distributed data into a local in-memory object like a Map.
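As a rough mental model of why reduceByKey keeps the shuffle small: counts are first combined inside each partition, and only the per-partition totals are merged. The plain-Python sketch below mimics that with the sample rows from the question. It is not Spark code, and count_items and the way the rows are split into two "partitions" are assumptions made for illustration.

```python
from collections import Counter


def count_items(partitions):
    """Combine counts locally per partition, then merge the partial results."""
    total = Counter()
    for partition in partitions:
        # Local (map-side) combine: one small Counter per partition.
        local = Counter(item for row in partition for item in row)
        # Merge step: only the partial counts are combined.
        total.update(local)
    return total


# The four CSV rows from the question, split arbitrarily into two partitions.
partitions = [
    [["x", "y", "z"], ["x", "y"]],
    [["x", "y", "c", "f"], ["x", "z"]],
]
item_frequency = count_items(partitions)
```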

Spark: How to 'scan' RDD collections?

Does Spark have an analog of Scala's scan operation that works on RDD collections?
(for details please see Reduce, fold or scan (Left/Right)?)
For example:
val abc = List("A", "B", "C")
def add(res: String, x: String) = {
  println(s"op: $res + $x = ${res + x}")
  res + x
}
So to get:
abc.scanLeft("z")(add)
// op: z + A = zA // same operations as foldLeft above...
// op: zA + B = zAB
// op: zAB + C = zABC
// res: List[String] = List(z, zA, zAB, zABC) // maps intermediate results
Any other means to achieve the same result?
What is the "Spark" way to solve, for example, the following problem:
Compute elements of the vector as (in pseudocode):
x(i) = SomeFun(for k from 0 to i-1)(y(k))
Should I collect the RDD for this? Is there no other way?
Update 2
Ok, I understand the general problem. Yet maybe you could advise me on the particular case I have to deal with.
I have a list of ints as the input RDD and I have to build an output RDD for which the following holds:
1) input.length == output.length // output list is of the same length as input
2) output(i) = sum( range (0..i), input(i)) / q^i // i-th element of output list equals sum of input elements from 0 to i divided by i-th power of some constant q
In fact I need a combination of map and fold function to solve this.
Another idea is to write a recursive fold on diminishing tails of the input list. But this is super inefficient and AFAIK Spark does not have tail or init function for RDD.
How would you solve this problem in Spark?
You are correct that there is no analog of scan() in the generic RDD.
A potential explanation: such a method would require access to all elements of the distributed collection to process each element of the generated output collection before continuing on to the next output element.
So if your input list had, say, 1 million plus one entries, there would be 1 million shuffle operations on the cluster (even though sorting is not required here; Spark gives it for "free" when doing a cluster collect step).
UPDATE: OP has expanded the question. Here is a response to the expanded question.
From the updated OP:
x(i) = SomeFun(for k from 0 to i-1)(y(k))
You need to distinguish whether the x(i) computation, specifically the y(k) function, is going to either:
require access to the entire dataset x(0 .. i-1)
change the structure of the dataset
on each iteration. That is the case for scan, and given your description it seems to be your purpose. AFAIK this is not supported in Spark. Once again, think about how you would achieve the same if you were developing the distributed framework. It does not seem to be a scalable approach, so yes, you would need to do that computation in an invocation against the original RDD and perform it on the Driver.
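Once the values are on the driver, the specific case from the update (output(i) = sum of input(0..i) divided by q^i) is a running sum followed by a map. A minimal sketch, assuming the collected list fits in driver memory; scaled_prefix_sums and the sample values are hypothetical:

```python
from itertools import accumulate


def scaled_prefix_sums(values, q):
    """output[i] = (values[0] + ... + values[i]) / q**i"""
    return [total / q ** i for i, total in enumerate(accumulate(values))]


# Stand-in for the result of collecting the input RDD on the driver.
collected = [1, 2, 3, 4]
output = scaled_prefix_sums(collected, q=2)
```

With this input the running sums are 1, 3, 6, 10, so the result is [1.0, 1.5, 1.5, 1.25], and the output has the same length as the input.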

How can I create a TF-IDF for Text Classification using Spark?

I have a CSV file with the following format :
The product_idX is an integer and the product_titleX is a String, for example:
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
So far I am using Spark with Scala, following the tutorials I have found on the official page and the Berkeley AmpCamp 3 and 4.
So I'm reading the file :
val file = sc.textFile("offers.csv")
Then I'm mapping it into tuples RDD[Array[String]]
val tuples = file.map(line => line.split(",")).cache
and afterwards I'm transforming the tuples into pairs RDD[(Int, String)]
val pairs = tuples.map(line => (line(0), line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TF-IDF.
To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key, value structure of
document_id, [token_ids]
The second is an inverted index like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
To get tf we need to count the number of occurrences of each token in each document. So
from collections import Counter
def wc_per_row(row):
    # Count how often each token occurs in one document
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return cnt.items()
tf = corpus.map(lambda kv: (kv[0], wc_per_row(kv[1])))
The df is simply the length of each term's inverted index. From that we can calculate the idf.
df = inv_index.map(lambda kv: (kv[0], len(kv[1])))
num_documents = tf.count()
# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
# float() avoids integer division under Python 2
idf = df.map(lambda kv: (kv[0], 1. + log10(float(num_documents) / kv[1]))).collect()
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
    # Pair every (term, tf) with the matching (term, idf) and multiply
    return [(k1, v1 * v2) for (k1, v1) in tf_tuples for
            (k2, v2) in idf_tuples if k1 == k2]
tfidf = tf.map(lambda kv: (kv[0], calc_tfidf(kv[1], idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
If anyone can improve on this, I'm very interested.
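To check the arithmetic on a small example, the same steps (tf from per-document counts, df from the number of documents containing each token, idf = 1 + log10(N / df)) can be run locally. This is a plain-Python sketch, not a Spark solution; the tfidf function, the dict-based corpus and the token ids are all made up for illustration. Looking up idf in a dict also avoids the nested scan that calc_tfidf performs.

```python
from collections import Counter
from math import log10


def tfidf(corpus):
    """corpus: dict mapping document_id -> list of token_ids (hypothetical layout)."""
    num_documents = len(corpus)
    # Term frequency: per-document token counts.
    tf = {doc_id: Counter(tokens) for doc_id, tokens in corpus.items()}
    # Document frequency: number of documents that contain each token.
    df = Counter(token for tokens in corpus.values() for token in set(tokens))
    # Same idf formula as in the answer: 1 + log10(N / df).
    idf = {token: 1.0 + log10(num_documents / count) for token, count in df.items()}
    return {
        doc_id: {token: count * idf[token] for token, count in counts.items()}
        for doc_id, counts in tf.items()
    }


# Made-up corpus: token 10 appears in both documents, 11 and 12 in one each.
corpus = {1: [10, 11, 10], 2: [10, 12]}
weights = tfidf(corpus)
```

Token 10 occurs in every document, so its idf is 1 + log10(1) = 1 and its weight in document 1 is just its count, 2.0.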

How to sum the corresponding values in the List into a Tuple?

I have a list details of this type :
case class Detail(point: List[Double], cluster: Int)
val details = List(Detail(List(2.0, 10.0),1), Detail(List(2.0, 5.0),3),
Detail(List(8.0, 4.0),2), Detail(List(5.0, 8.0),2))
I want to filter this list into a tuple which contains the sum of each corresponding point coordinate where the cluster is 2.
So I filter this List :
details.filter(detail => detail.cluster == 2)
which returns :
List(Detail(List(8.0, 4.0),2), Detail(List(5.0, 8.0),2))
It's the summing of the corresponding values I'm having trouble with. In this example the tuple should contain (8+5, 4+8) = (13, 12)
I'm thinking to flatten the List and then sum each corresponding value but
just returns the same List
How to sum the corresponding values in the List into a Tuple ?
I could achieve this easily using a for loop and just extract the details I need into a counter but what is the functional solution ?
What do you want to happen if the lists for different Details have different lengths?
Or same length which is different from 2? Tuples are generally only used when you need a fixed in advance number of elements; you won't even be able to write a return type if you need tuples of different lengths.
Assuming that all of them are lists of the same length and you get a list in return, something like this should work (untested):
details.filter(_.cluster == 2).map(_.point).transpose.map(_.sum)
I.e. first get all points as a list of lists, transpose it so you get a list for each "coordinate", and sum each of these lists.
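The filter, transpose and sum steps translate directly to Python, where zip(*points) plays the role of transpose. A minimal sketch for comparison; the namedtuple stands in for the case class and sum_points is a hypothetical helper:

```python
from collections import namedtuple

# Stand-in for the Scala case class Detail(point, cluster).
Detail = namedtuple("Detail", ["point", "cluster"])

details = [
    Detail([2.0, 10.0], 1), Detail([2.0, 5.0], 3),
    Detail([8.0, 4.0], 2), Detail([5.0, 8.0], 2),
]


def sum_points(details, cluster):
    """Filter by cluster, transpose the points, and sum each coordinate."""
    points = [d.point for d in details if d.cluster == cluster]
    return [sum(coords) for coords in zip(*points)]


result = sum_points(details, 2)
```

For cluster 2 this gives [13.0, 12.0]; a cluster with no matching details gives an empty list.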
If you do know that each point has two coordinates, this should be reflected in your point type by using (Double, Double) instead of List[Double]; you can then just fold over the list of points, which should be a bit more efficient. Look at the definition of foldLeft and the standard implementation of sum in terms of foldLeft:
def sum(list: List[Int]): Int = list.foldLeft(0)((acc, x) => acc + x)
and it should be easy to do what you want.
You can use just one foldLeft with a partial function (PF), without filter:
details.foldLeft((0.0, 0.0)) {
  case ((accX, accY), Detail(x :: y :: Nil, 2)) => (accX + x, accY + y)
  case (acc, _) => acc
}
res1: (Double, Double) = (13.0,12.0)
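A rough Python counterpart of the single-fold approach uses functools.reduce with a step function that either adds a matching two-coordinate point or passes the accumulator through unchanged. This is a sketch for comparison only; Detail is a namedtuple standing in for the case class and step is a hypothetical helper.

```python
from collections import namedtuple
from functools import reduce

# Stand-in for the Scala case class Detail(point, cluster).
Detail = namedtuple("Detail", ["point", "cluster"])

details = [
    Detail([2.0, 10.0], 1), Detail([2.0, 5.0], 3),
    Detail([8.0, 4.0], 2), Detail([5.0, 8.0], 2),
]


def step(acc, detail):
    """Add the point to the accumulator only for cluster 2 with exactly two coordinates."""
    if detail.cluster == 2 and len(detail.point) == 2:
        x, y = detail.point
        return (acc[0] + x, acc[1] + y)
    return acc  # every other element leaves the accumulator untouched


total = reduce(step, details, (0.0, 0.0))
```

The list is traversed once and no intermediate filtered list is built.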