I'm trying to work out how reduceByKey functions, but this case is confusing me and I can't fully understand it.
The code is:
stream.foreachRDD((rdd: RDD[Record]) => {
  // convert strings to POJOs and generate rows as tuple groups
  val pairs = rdd
    .map(row => (row.timestamp(), jsonDecode(row.value())))
    .map(row => (row._2.getType.name(), (1, row._2.getValue, row._1)))

  val flatten = pairs
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, (y._3 + x._3) / 2))
    .map(f => Row.fromSeq(Seq(f._1, f._2._2 / f._2._1, new Timestamp(f._2._3))))
  // ...
})
Imagine incoming data like:
["oceania", 500], ["australia", 450], and so on.
In the flatten variable I'm trying to aggregate the data by market type (the first field in the JSON). The tuple being generated is:
* the first element is a counter, and its value is 1,
* the second is the rate, received from Kafka,
* the third is the event time, for instance 2017-05-12 16:00:00.

In the map:
* f._1 is the market name,
* we divide the total rate by the total item count: f._2._2 / f._2._1,
* f._2._3 is meant to be the average event time.

Can someone explain what f._2._3 means (I know it's a temporary variable, but what is in there, or what could be) and how the total rate is computed by dividing f._2._2 / f._2._1? What exactly is being divided? Thank you :)
For each row you define the following element in your RDD pairs:
(marketType, (counter, rate, eventTime))
Note that this is a Tuple2 whose second element is a Tuple3. Tuples are special case classes whose n-th element (starting at 1) is named _n. For instance, to access the rate of an element f, you have to write f._2._2 (the second element of the Tuple3, which is itself the second element of the Tuple2).
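For a concrete (made-up) illustration of what each accessor points at, assuming for the moment that the event time is an epoch-millis Long:

// a made-up element of pairs: (marketType, (counter, rate, eventTime))
val f = ("oceania", (1, 500, 1494597600000L))

f._1     // "oceania"       -> the market type (the key)
f._2._1  // 1               -> the counter
f._2._2  // 500             -> the rate
f._2._3  // 1494597600000L  -> the event time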
Since your elements have special meaning, you might want to consider defining a case class MyRow(counter: Int, rate: Int, time: Timestamp), in order to have a clearer view on what your code is doing when you write something like f._2._3 (by the way, the type of eventTime is not clear to me, since you have only represented it as String, but you do numerical operations on it).
Now to what your code really attempts to do:
The reducing function takes two Tuple3 (or MyRow, if you change your code) and outputs another one (here, your reducing function sums the counters and the rates, and takes the average of the two eventTime values).
reduceByKey applies this reducing function as long as it finds two elements with the same key: since the output of the reducing function has the same type as its inputs, it can be applied again, as long as there are other values in your RDD with the same key.
For a simple example, if you have
(key1, (1, 200, 2017/04/04 12:00:00))
(key1, (1, 300, 2017/04/04 12:00:00))
(key1, (1, 500, 2017/04/04 12:00:00))
(key2, (1, 500, 2017/04/04 12:00:00))
Then the reduceByKey will output
(key1, (3, 1000, 2017/04/04 12:00:00))
(key2, (1, 500, 2017/04/04 12:00:00))
And then your last map will work on this by dividing the total rate by the count (i.e. the average rate):
(key1, (333, 2017/04/04 12:00:00))
(key2, (500, 2017/04/04 12:00:00))
You may have noticed that I always used the same time in all the examples. That's because the reducing function on this field will give unexpected results: it is not associative. Try doing the same exercise as above but with different timestamps, and you will see that the reduced value for key1 will differ depending on the order in which you apply the reduction.
Let's see this: say we want to reduce 4, 8, and 16 with this function. We might do it as
((4 + 8) / 2 + 16) / 2
or as
(4 + (8 + 16) / 2) / 2
depending on whether we want to start on the left or on the right (in a real case, there are many more possibilities, and they will happen in Spark, since you do not always know how your values are distributed over the cluster).
Calculating the two possibilities above, we get different values: 11 and 8, so you see that this may cause bigger problems in a real-life case.
A simple solution in your case would be to also sum the timestamps (assuming they are Long values, or even BigInteger, to avoid overflow), and only divide by the number of values at the end, to get the true average time.
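A minimal sketch of that associative variant, assuming the event time in pairs is an epoch-millis Long (as the new Timestamp(f._2._3) call suggests):

val flatten = pairs
  // sum the counter, the rate and the event time: all three operations are associative
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2, x._3 + y._3))
  .map(f => Row.fromSeq(Seq(
    f._1,                             // market name
    f._2._2 / f._2._1,                // average rate = total rate / count
    new Timestamp(f._2._3 / f._2._1)  // average time = total time / count
  )))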
Related
I have a Spark job that needs to compute movie content-based similarities. There are 46k movies. Each movie is represented by a set of SparseVectors (each vector is a feature vector for one of the movie's fields such as Title, Plot, Genres, Actors, etc.). For Actors and Genres, for example, the vector shows whether a given actor is present (1) or absent (0) in the movie.
The task is to find top 10 similar movies for each movie.
I managed to write a script in Scala that performs all those computations and does the job. It works for smaller sets of movies such as 1000 movies but not for the whole dataset (out of memory, etc.).
The way I do this computation is by using a cross join on the movies dataset. Then reduce the problem by only taking rows where movie1_id < movie2_id.
Still the dataset at this point will contain 46000^2/2 rows which is 1058000000.
And each row has significant amount of data.
Then I calculate a similarity score for each row. After the similarity is calculated, I group the results by movie1_id, sort them in descending order by similarity score, and take the top N items using a Window function (similar to how it's described here: Spark get top N highest score results for each (item1, item2, score)).
The question is - can it be done more efficiently in Spark? E.g. without having to perform a crossJoin?
And another question - how does Spark deal with such huge Dataframes (1058000000 rows consisting of multiple SparseVectors)? Does it have to keep all this in memory at a time? Or does it process such dataframes piece by piece somehow?
I'm using the following function to calculate similarity between movie vectors:
def intersectionCosine(movie1Vec: SparseVector, movie2Vec: SparseVector): Double = {
  val a: BSV[Double] = toBreeze(movie1Vec)
  val b: BSV[Double] = toBreeze(movie2Vec)

  // dot product over the active (non-zero) entries of a
  var dot: Double = 0
  var offset: Int = 0
  while (offset < a.activeSize) {
    val index: Int = a.indexAt(offset)
    val value: Double = a.valueAt(offset)
    dot += value * b(index)
    offset += 1
  }

  // b restricted to the indices that are active in a
  val bReduced: BSV[Double] = new BSV(a.index, a.index.map(i => b(i)), a.index.length)
  val maga: Double = magnitude(a)
  val magb: Double = magnitude(bReduced)

  if (maga == 0 || magb == 0)
    return 0
  else
    return dot / (maga * magb)
}
Each row in the Dataframe consists of two joined classes:
final case class MovieVecData(
  imdbID: Int,
  Title: SparseVector,
  Decade: SparseVector,
  Plot: SparseVector,
  Genres: SparseVector,
  Actors: SparseVector,
  Countries: SparseVector,
  Writers: SparseVector,
  Directors: SparseVector,
  Productions: SparseVector,
  Rating: Double
)
It can be done more efficiently, as long as you are fine with approximations and don't require exact results (or an exact number of results).
Similarly to my answer to Efficient string matching in Apache Spark you can use LSH, with:
BucketedRandomProjectionLSH to approximate Euclidean distance.
MinHashLSH to approximate Jaccard Distance.
If the feature space is small (or can be reasonably reduced) and each category is relatively small, you can also optimize your code by hand:
explode feature array to generate #features records from a single record.
Self join result by feature, compute distance and filter out candidates (each pair of records will be compared if and only if they share specific categorical feature).
Take top records using your current code.
A minimal example would be (consider it to be a pseudocode):
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions._
import spark.implicits._

// This is oversimplified. In practice don't assume only the sparse scenario
val indices = udf((v: SparseVector) => v.indices)

val df = Seq(
  (1L, Vectors.sparse(1024, Array(1, 3, 5), Array(1.0, 1.0, 1.0))),
  (2L, Vectors.sparse(1024, Array(3, 8, 12), Array(1.0, 1.0, 1.0))),
  (3L, Vectors.sparse(1024, Array(3, 5), Array(1.0, 1.0))),
  (4L, Vectors.sparse(1024, Array(11, 21), Array(1.0, 1.0))),
  (5L, Vectors.sparse(1024, Array(21, 32), Array(1.0, 1.0)))
).toDF("id", "features")

// explode by feature index and self-join, so only rows sharing a feature are paired
val possibleMatches = df
  .withColumn("key", explode(indices($"features")))
  .transform(df => df.alias("left").join(df.alias("right"), Seq("key")))

def closeEnough(threshold: Double) = udf(
  (v1: SparseVector, v2: SparseVector) => intersectionCosine(v1, v2) > threshold
)

possibleMatches
  .filter(closeEnough(0.75)($"left.features", $"right.features")) // threshold value is arbitrary
  .select($"left.id", $"right.id")
  .distinct
Note that both solutions are worth the overhead only if hashing / features are selective enough (and optimally sparse). In the example shown above you'd compare only rows inside the sets {1, 2, 3} and {4, 5}, never between sets.
However, in the worst-case scenario (M records, N features) we can make N * M^2 comparisons, instead of M^2.
Another thought: given that your matrix is relatively small and sparse, it can fit in memory using a Breeze CSCMatrix[Int].
Then, you can compute co-occurrences using A'B (A.transposed * B), followed by a TopN selection of the LLR (log-likelihood ratio) of each pair. Here, since you keep only the top 10 items per row, the output matrix will be very sparse as well.
You can lookup the details here:
https://github.com/actionml/universal-recommender
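A rough sketch of the co-occurrence part with Breeze (the toy matrix layout and the use of raw counts instead of LLR are my assumptions; the link above covers proper LLR scoring):

import breeze.linalg.CSCMatrix

// toy feature-by-movie matrix: 4 features (rows) x 3 movies (columns), binary entries
val builder = new CSCMatrix.Builder[Double](rows = 4, cols = 3)
builder.add(0, 0, 1.0); builder.add(1, 0, 1.0)
builder.add(1, 1, 1.0); builder.add(2, 1, 1.0)
builder.add(2, 2, 1.0); builder.add(3, 2, 1.0)
val a = builder.result()

// co-occurrence counts: (A' * A)(i, j) = number of features shared by movies i and j
val cooc = a.t * a

// naive top-N per movie by raw co-occurrence (an LLR score would replace v here)
val topN = 10
val similar = cooc.activeIterator
  .collect { case ((i, j), v) if i != j => (i, (j, v)) }
  .toSeq
  .groupBy(_._1)
  .mapValues(_.map(_._2).sortBy(-_._2).take(topN))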
You can borrow from the idea of locality sensitive hashing. Here is one approach:
Define a set of hash keys based on your matching requirements. You would use these keys to find potential matches. For example, a possible hash key could be based on the movie actor vector.
Perform a reduce for each key. This will give sets of potential matches. For each set of potential matches, perform your "exact match"; the exact match will produce sets of exact matches (see the sketch after this list).
Run a connected-components algorithm to merge those sets and get the sets of all exact matches.
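A minimal sketch of the first two steps (all names here, such as movies, hashKey and exactMatch, are hypothetical):

// bucket movies by a hash key (e.g. a signature derived from the actor vector),
// then run the expensive "exact match" only inside each bucket
val exactPairs = movies                 // hypothetical RDD[MovieVecData]
  .keyBy(m => hashKey(m))               // hypothetical hash-key function
  .groupByKey()
  .flatMap { case (_, candidates) =>
    for {
      a <- candidates
      b <- candidates
      if a.imdbID < b.imdbID && exactMatch(a, b)  // hypothetical exact-match predicate
    } yield (a.imdbID, b.imdbID)
  }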
I have implemented something similar using the above approach.
Hope this helps.
Another possible solution would be to use the built-in RowMatrix and its brute-force columnSimilarities, as explained by Databricks:
https://databricks.com/blog/2014/10/20/efficient-similarity-algorithm-now-in-spark-twitter.html
https://datascience.stackexchange.com/questions/14862/spark-item-similarity-recommendation
Notes:
Keep in mind that you will always have N^2 values in the resulting similarity matrix
You will have to concatenate your sparse vectors
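A small sketch of that approach (the toy vectors are mine; note that columnSimilarities compares the columns of the matrix, so each movie has to end up as a column):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// toy data: each row is a feature vector; each of the 3 columns plays the role of a movie
val rows = sc.parallelize(Seq(
  Vectors.sparse(3, Array(0, 1), Array(1.0, 1.0)),
  Vectors.sparse(3, Array(1, 2), Array(1.0, 1.0)),
  Vectors.sparse(3, Array(0, 2), Array(1.0, 1.0))
))
val mat = new RowMatrix(rows)

// exact brute-force cosine similarities between columns (upper triangle only)
val exact = mat.columnSimilarities()
// approximate DIMSUM version described in the Databricks post; 0.1 is a sampling threshold
val approx = mat.columnSimilarities(0.1)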
One very important suggestion that I have used in similar scenarios: if some movies are related like

relation    similarity score
A -> B      8/10
B -> C      7/10
C -> D      9/10

and

E -> A      4    // less than some threshold or hyperparameter

then don't calculate the similarity for

E -> B
E -> C
E -> D
For example, I have two RDDs in PySpark:
((0,0), 1)
((0,1), 2)
((1,0), 3)
((1,1), 4)
and second is just
((0,1), 3)
((1,1), 0)
I want to have the intersection of the first RDD with the second one. Actually, the second RDD has to play the role of a mask for the first. The output should be:
((0,1), 2)
((1,1), 4)
That is, the values from the first RDD, but only for the keys that appear in the second. The lengths of the two RDDs are different.
I have a solution (still have to prove it), something like this:
rdd3 = rdd1.cartesian(rdd2)
rdd4 = rdd3.filter(lambda pair: pair[0][0] == pair[1][0])  # keys match
rdd5 = rdd4.map(lambda pair: pair[0])                      # keep (key, value) from the first RDD
I don't know how efficient this solution is. I would like to hear the opinion of experienced Spark programmers.
Perhaps we shouldn't think of this process as a join. You're not really looking to join two datasets; you're looking to subtract one dataset from the other?
I'm going to state what I am assuming from your question:
You don't care about the values in the second dataset, at all.
You only want to keep the values in the first dataset where the key value pair appears in the second dataset.
Idea 1: Cogroup (I think probably the fastest way). It's basically calculating the intersection of both datasets.
rdd1 = sc.parallelize([((0,0), 1), ((0,1), 2), ((1,0), 3), ((1,1), 4)])
rdd2 = sc.parallelize([((0,1), 3), ((1,1), 0)])
intersection = rdd1.cogroup(rdd2).filter(lambda x: x[1][0] and x[1][1])
final_rdd = intersection.map(lambda x: (x[0], list(x[1][0]))).map(lambda xy: (xy[0], xy[1][0]))
Idea 2: Subtract By Key
rdd1 = sc.parallelize([((0,0), 1), ((0,1), 2), ((1,0), 3), ((1,1), 4)])
rdd2 = sc.parallelize([((0,1), 3), ((1,1), 0)])
unwanted_rows = rdd1.subtractByKey(rdd2)
wanted_rows = rdd1.subtractByKey(unwanted_rows)
I'm not 100% sure if this is faster than your method. It does require two subtractByKey operations, which can be slow. Also, this method does not preserve order (e.g. ((0, 1), 2), despite being first in your first dataset, is second in the final dataset). But I can't imagine this matters.
As to which is faster, I think it depends on how long your cartesian join takes. Mapping and filtering tend to be faster than the shuffle operations needed for subtractByKey, but of course cartesian is a time-consuming process.
Anyway, I figure you can try out this method and see if it works for you!
A sidenote for performance improvements, depending on how large your RDDs are.
If rdd1 is small enough to be held in main memory, the subtraction process can be sped up immensely if you broadcast it and then stream rdd2 against it. However, I acknowledge that this is rarely the case.
What I want to do is something like this:
http://cn.mathworks.com/help/matlab/ref/median.html?requestedDomain=www.mathworks.com
Find the median value of each column.
It could be done by collecting the RDD to the driver, but for big data that becomes impossible.
I know Statistics.colStats() can calculate mean, variance... but median is not included.
Additionally, the vector is high-dimensional and sparse.
Well, I didn't understand the vector part; however, this is my approach (I bet there are better ones):
val a = sc.parallelize(Seq(1, 2, -1, 12, 3, 0, 3))
val n = a.count() / 2
println(n) // outputs 3
val b = a.sortBy(x => x).zipWithIndex()
val median = b.filter(x => x._2 == n).collect()(0)._1 // this part doesn't look nice, I hope someone tells me how to improve it, maybe zero?
println(median) // outputs 2
b.collect().foreach(println) // (-1,0) (0,1) (1,2) (2,3) (3,4) (3,5) (12,6)
The trick is to sort your dataset using sortBy, zip the entries with their index using zipWithIndex, and then get the middle entry. Note that I used an odd number of samples for simplicity, but the essence is there; besides, you have to do this with every column of your dataset.
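For the per-column, high-dimensional sparse case, a possible sketch (assuming Spark >= 2.0, a SparkSession named spark, and that ignoring the implicit zeros is acceptable) is to explode the vectors into (column, value) pairs and use approxQuantile instead of a full sort:

import org.apache.spark.ml.linalg.{SparseVector, Vectors}
import spark.implicits._

// toy sparse vectors; the indices are the "columns" we want medians for
val vecs = sc.parallelize(Seq(
  Vectors.sparse(3, Array(0, 2), Array(1.0, 5.0)),
  Vectors.sparse(3, Array(0, 1), Array(3.0, 2.0)),
  Vectors.sparse(3, Array(0, 2), Array(4.0, 6.0))
))

// explode to (columnIndex, value) pairs -- this only sees the explicitly stored values
val byColumn = vecs
  .flatMap { case v: SparseVector => v.indices.zip(v.values) }
  .toDF("col", "value")

// approximate median of column 0 (a relative error of 0.0 would give the exact median)
val medianCol0 = byColumn
  .filter($"col" === 0)
  .stat.approxQuantile("value", Array(0.5), 0.01)(0)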
I am reading through Twitter's Scala School right now and was looking at the groupBy and partition methods for collections. And I am not exactly sure what the difference between the two methods is.
I did some testing on my own:
scala> List(1, 2, 3, 4, 5, 6).partition(_ % 2 == 0)
res8: (List[Int], List[Int]) = (List(2, 4, 6),List(1, 3, 5))
scala> List(1, 2, 3, 4, 5, 6).groupBy(_ % 2 == 0)
res9: scala.collection.immutable.Map[Boolean,List[Int]] = Map(false -> List(1, 3, 5), true -> List(2, 4, 6))
So does this mean that partition returns a list of two lists and groupBy returns a Map with boolean keys and list values? Both have the same "effect" of splitting a list into two different parts based on a condition. I am not sure why I would use one over the other. So, when would I use partition over groupBy and vice-versa?
groupBy is better suited for lists of more complex objects.
Say, you have a class:
case class Beer(name: String, cityOfBrewery: String)
and a List of beers:
val beers = List(Beer("Bitburger", "Bitburg"), Beer("Frueh", "Cologne") ...)
you can then group beers by cityOfBrewery:
val beersByCity = beers.groupBy(_.cityOfBrewery)
Now you can get yourself a list of all beers brewed in any city you have in your data:
beersByCity("Cologne") // List(Beer("Frueh", "Cologne"), ...)
Neat, isn't it?
And I am not exactly sure what the difference between the two methods is.
The difference is in their signature. partition expects a function A => Boolean while groupBy expects a function A => K.
It appears that in your case the function you apply with groupBy is A => Boolean too, but you won't always want that; sometimes you want to group by a function that doesn't return a Boolean for its input.
For example if you want to group a List of strings by their length, you need to do it with groupBy.
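Simplified signatures (as they appear on List[A]; the real ones live on the generic collection traits) and a tiny group-by-length example:

// simplified, for List[A]:
//   def partition(p: A => Boolean): (List[A], List[A])
//   def groupBy[K](f: A => K): Map[K, List[A]]

List("a", "ab", "abc", "xy").groupBy(_.length)
// Map(1 -> List(a), 2 -> List(ab, xy), 3 -> List(abc))   (map order is unspecified)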
So, when would I use partition over groupBy and vice-versa?
Use groupBy if the image of the function you apply is not the Boolean set (i.e. f(x) yields something other than a Boolean for an input x). If that's not the case then you can use both; it's up to you whether you prefer a Map or a (List, List) as output.
partition is for when you need to split some collection into two based on yes/no logic (even/odd numbers, uppercase/lowercase letters, you name it). groupBy has a more general usage: producing many groups, based on some function. Say you want to split a corpus of words into bins depending on their first letter (resulting in 26 groups); that is simply not possible with partition.
I'm converting several columns of strings to numeric features I can use in a LabeledPoint. I'm considering two approaches:
Create a mapping of strings to doubles, iterate through the RDD and lookup each string and assign the appropriate value.
Sort the RDD by the column, iterate through the RDD with a counter, assign each string to the current counter value until the string changes at which time the counter value is incremented and assigned. Since we never see a string twice (thanks to sorting) this will effectively assign a unique value to each string.
In the first approach we must collect the unique values for our map. I'm not sure how long this takes (linear time?). Then we iterate through the list of values and build up a HashMap - linear time and memory. Finally we iterate and lookup each value, N * eC (effective constant time).
In the second approach we sort (n log n time) and then iterate and keep track of a simple counter and a few variables.
What approach is recommended? There are memory, performance, and coding style considerations. The first feels like 2N + eC * N with N * (String, Double) memory and can be written in a functional style. The second is N log N + N but feels imperative. Will Spark need to transfer the static map around? I could see that being a deal breaker.
The second method unfortunately won't work. The reason is that you cannot read from a counter (accumulator), you can only increment it. What is even worse, you don't really know when the value changes, since you don't have state to remember the previous value. I guess you could use something like mapPartitions and a total-order partitioner. You would have to know that your partitions are processed in order and that the same key can't appear in more than one partition, but this feels really hacky (and I don't know if it would work).
I don't think it's possible to do this in one pass, but you can do it in two. For your first method you can, for example, use a set accumulator: put all your values in it, number them on the driver, and use the result in a second pass to replace them. The complexity would be 2N (assuming that the number of distinct values << N).
Edit:
import org.apache.spark.{Accumulator, AccumulatorParam}

// accumulator that collects the set of distinct values (pre-Spark-2.0 accumulator API)
implicit object SetAcc extends AccumulatorParam[Set[String]] {
  def zero(s: Set[String]) = Set()
  def addInPlace(s1: Set[String], s2: Set[String]) = s1 ++ s2
}

val rdd = sc.parallelize(
  List((1, "a"), (2, "a"), (3, "b"), (4, "a"), (5, "c"), (6, "b"))
)

// first pass: collect all distinct values, then number them on the driver
val acc: Accumulator[Set[String]] = sc.accumulator(Set())
rdd.foreach(p => acc += Set(p._2))
val encoding = acc.value.zipWithIndex.toMap

// second pass: replace each string with its numeric code
val result = rdd.map { p => (p._1, encoding(p._2)) }
If you feel this dictionary is too big you can of course broadcast it. If you have too many features, with many values in them, and you don't want to create so many big accumulators, then you can just use a reduce function to process them all together and collect on the driver. These are just my thoughts about it; I guess you have to try and see what suits your use case best.
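For instance, a sketch of the broadcast variant (building the dictionary with distinct and zipWithIndex instead of an accumulator, reusing the rdd from the snippet above):

// pass 1: build the (value -> index) dictionary and ship it to the executors
val encoding = rdd.map(_._2).distinct().zipWithIndex().collectAsMap()
val bEncoding = sc.broadcast(encoding)

// pass 2: replace each string with its numeric code
val result = rdd.map { case (id, s) => (id, bEncoding.value(s)) }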
Edit:
In MLlib there is a class meant for this purpose: HashingTF. It allows you to translate your data set in one pass. The drawback is that it uses hashing modulo a specified parameter to map objects to vector indices, which can lead to collisions if the parameter is too small.
import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF(numFeatures = 10000)
val transformed = data.map(line => tf.transform(line.split("""\s+""")))
Of course you can do the same thing by hand, without using the HashingTF class.
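A by-hand sketch of the same hashing trick (plain hashCode modulo the feature count, which is the idea but not necessarily HashingTF's exact hash function; data is assumed to be an RDD[String] as above):

import org.apache.spark.mllib.linalg.Vectors

val numFeatures = 10000
val byHand = data.map { line =>
  val counts = line.split("""\s+""")
    .map(w => ((w.hashCode % numFeatures) + numFeatures) % numFeatures)   // term -> non-negative index (collisions possible)
    .groupBy(identity)
    .map { case (idx, hits) => (idx, hits.length.toDouble) }              // term frequency per index
    .toSeq
  Vectors.sparse(numFeatures, counts)
}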