Calculating derived value in Spark Streaming - scala

I have two key-value pair streams of type org.apache.spark.streaming.dstream.DStream[Int].
The first key-value pair is (word, frequency).
The second key-value pair is (number of rows, value).
I would like to divide frequency by value for each word, but I am getting the error below:
value / is not a member of org.apache.spark.streaming.dstream.DStream[Int]
Sample code (f is the frequency of the word, c is the total count, and rdd holds the (word, frequency) pairs):
val cp = rdd.foreachRDD {
  x => (x, f / c)
}

First apply the transform operation on the DStream object; inside it you get the RDD of each batch, and on that RDD you can then apply a map transformation, as follows:
dStream.transform { rdd =>
  rdd.map(x => (x, f / c))
}
If f itself comes from a DStream, collect it first before using it inside an RDD or DStream closure.
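For completeness, here is a minimal sketch of one way to wire the two streams together. The names and the one-total-per-batch layout are illustrative assumptions, not taken from the question: wordCounts: DStream[(String, Int)] carries the (word, frequency) pairs and totals: DStream[(Int, Int)] carries a single (numberOfRows, value) pair per batch.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// transformWith pairs up the RDDs of the two streams that belong to the same batch.
val ratios: DStream[(String, Double)] =
  wordCounts.transformWith(totals, (freqRdd: RDD[(String, Int)], totalRdd: RDD[(Int, Int)]) => {
    // Assumption: totalRdd holds exactly one (numberOfRows, value) pair per batch,
    // so pulling it to the driver with first() (an action, run once per batch) is cheap.
    val c = totalRdd.values.first()
    freqRdd.map { case (word, f) => (word, f.toDouble / c) }
  })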

Related

Unique pairs in cartesian product of an RDD with itself

I have an RDD with 100 items and each item is of the format (String,Int,Option[Set[Int]])
I want to compute the cartesian product of this RDD with itself and keep only unique pairs. For example, without removing anything I would get something like this:
(a,a),(a,b),(a,c),(b,a),(b,b),(b,c),(c,a),(c,b),(c,c)
The output that I want is (a,b),(a,c),(b,c)
I have managed to remove the instances where the pairs are the same value (a,a),(b,b),(c,c) but I am unsure how to remove the reverse pairs.
This is the code I used to do it:
val m = records.cartesian(records).filter{case (a,b) => a != b}.collect()
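One standard trick (a sketch, not taken from the post) is to tag each record with a unique index via zipWithIndex and keep only the pairs whose left index is smaller than the right one; this drops both the self-pairs and the reversed duplicates, and it avoids having to define an ordering on the (String, Int, Option[Set[Int]]) tuples themselves:

// records: RDD[(String, Int, Option[Set[Int]])]
val indexed = records.zipWithIndex()
val uniquePairs = indexed
  .cartesian(indexed)
  .filter { case ((_, i), (_, j)) => i < j }     // keeps (a,b), drops (b,a) and (a,a)
  .map { case ((a, _), (b, _)) => (a, b) }
  .collect()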

Spark - Rdd transformation inside another transformation

I was trying to transform an RDD inside another transformation. Since RDD transformations and actions can only be invoked from the driver, I collected the second RDD and tried to apply the transformation to it inside the other transformation, like below:
val name_match = first_names.map(y => (y, first_names_collection.value.filter(z => soundex.difference(z, y) == 4 ) ))
The above code is throwing the following exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationAttemptNotFoundException): Application attempt appattempt_1468905506642_46091_000001 doesn't exist in ApplicationMasterService cache.
Here, the size of first_names_collection is more than 10 GB. Would that be causing this problem? Is there any other way to do this?
It looks like you want to calculate a difference function between each element of name_match and each element of first_names_collection and find pairs with a difference of 4.
Typically performing pair-wise calculations on two RDDs is done by enumerating all pairs using cartesian first. Your solution would look something like:
first_names.cartesian(first_names_collection)   // generate all pairs
  .filter { case (lhs, rhs) => soundex.difference(lhs, rhs) == 4 }
  .groupByKey()
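Spelled out as a self-contained sketch (the names and the stand-in similarity function below are illustrative; substitute your actual soundex object, which is not shown in the post):

import org.apache.spark.rdd.RDD

// Stand-in for soundex.difference: any (String, String) => Int scoring function fits here.
def difference(a: String, b: String): Int =
  math.min(a.zip(b).count { case (x, y) => x == y }, 4)

def matchNames(firstNames: RDD[String], candidates: RDD[String]): RDD[(String, Iterable[String])] =
  firstNames.cartesian(candidates)                           // enumerate all (name, candidate) pairs
    .filter { case (lhs, rhs) => difference(lhs, rhs) == 4 } // keep only the strongest matches
    .groupByKey()                                            // (name, all candidates that scored 4)

Note that a full cartesian product of two large datasets is expensive; it replaces the 10 GB collect with a potentially very large intermediate pair RDD.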

Flattening an element of a RDD

I am using the Spark Scala API.
prods_grpd has the type RDD[(String, mutable.HashSet[String])]:
val prods_grpd = all_meds.aggregateByKey(initialSet)(addToSet, mergePartitionSets)
prods_grpd.saveAsTextFile("scratch/prods_grpdby_users.tsv")
When I save this RDD, I get the output below. The first value is the key, followed by the set of values.
(8635214,Set(2013-01-01))
(3580112,Set(2013-01-01))
(146086,Set(2010-01-01, 2012-01-01))
(112220,Set(2013-01-01))
(2020,Set(2013-01-01))
(24218,Set(2013-01-01))
However, I want output like:
(8635214, 2013-01-01)
(3580112, 2013-01-01)
(146086, 2010-01-01, 2012-01-01)
(112220, 2013-01-01)
(2020, 2013-01-01)
(24218, 2013-01-01)
I would like to know how to unnest/flatten the second element of each RDD record.
You cannot simply convert the Set to a Tuple, because tuples are not collections and don't support an arbitrary number of elements. Instead you can map entries to strings with the desired format:
prods_grpd.map { case (k, s) =>
  val sstr = s.mkString(",")
  s"($k,$sstr)"
}.saveAsTextFile(...)

reduceByKey: How does it work internally?

I am new to Spark and Scala. I was confused about the way reduceByKey function works in Spark. Suppose we have the following code:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The map function is clear: s is the key and it points to the line from data.txt and 1 is the value.
However, I don't get how reduceByKey works internally. Does "a" point to the key? Or does "a" point to "s"? Then what does a + b represent? How are they filled?
Let's break it down to discrete methods and types. That usually exposes the intricacies for new devs:
pairs.reduceByKey((a, b) => a + b)
becomes
pairs.reduceByKey((a: Int, b: Int) => a + b)
and renaming the variables makes it a little more explicit
pairs.reduceByKey((accumulatedValue: Int, currentValue: Int) => accumulatedValue + currentValue)
So, we can now see that we are simply taking an accumulated value for the given key and summing it with the next value of that key. NOW, let's break it further so we can understand the key part. So, let's visualize the method more like this:
pairs.reduce((accumulatedValue: List[(String, Int)], currentValue: (String, Int)) => {
  // Turn the accumulated value into a true key->value mapping
  val accumAsMap = accumulatedValue.toMap
  // Try to get the key's current value if we've already encountered it
  accumAsMap.get(currentValue._1) match {
    // If we have encountered it, then add the new value to the existing value and overwrite the old
    case Some(value: Int) => (accumAsMap + (currentValue._1 -> (value + currentValue._2))).toList
    // If we have NOT encountered it, then simply add it to the list
    case None => currentValue :: accumulatedValue
  }
})
So, you can see that the reduceByKey takes the boilerplate of finding the key and tracking it so that you don't have to worry about managing that part.
Deeper, truer if you want
All that being said, this is a simplified version of what happens, as there are some optimizations. This operation is associative, so the Spark engine will perform these reductions locally first (often termed map-side reduce) and then once again on the reducer side after the shuffle. This saves network traffic; instead of sending all the data and performing the operation once, the engine can reduce each partition as small as it can and then send that reduction over the wire.
One requirement for the reduceByKey function is that it must be associative. To build some intuition on how reduceByKey works, let's first see how an associative function helps us in a parallel computation.
We can break the original collection into pieces and, by applying the associative function to each piece, accumulate a total. The sequential case is trivial, we are used to it: 1+2+3+4+5+6+7+8+9+10.
Associativity lets us use that same function in sequence and in parallel. reduceByKey uses that property to compute a result out of an RDD, which is a distributed collection consisting of partitions.
Consider the following example:
// collection of the form ("key",1),("key",2),...,("key",20), split among 4 partitions
val rdd = sparkContext.parallelize((1 to 20).map(x => ("key", x)), 4)
rdd.reduceByKey(_ + _).collect()
> Array[(String, Int)] = Array((key,210))
In Spark, data is distributed into partitions (4 in this example). First, we apply the function locally within each partition, sequentially inside the partition, but all 4 partitions run in parallel. Then the results of the local computations are aggregated by applying the same function again, finally arriving at the result.
reduceByKey is a specialization of aggregateByKey. aggregateByKey takes 2 functions: one that is applied to each partition (sequentially) and one that is applied among the results of the partitions (in parallel). reduceByKey uses the same associative function in both cases: to do a sequential computation on each partition and then combine those results into a final result, as illustrated here.
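As a quick illustration of that relationship (a sketch, not part of the answer above): for a plain sum, reduceByKey(_ + _) and aggregateByKey(0)(_ + _, _ + _) give the same result; aggregateByKey merely spells out the per-partition step and the cross-partition step as two separate arguments.

val rdd = sparkContext.parallelize((1 to 20).map(x => ("key", x)), 4)

// One associative function, used for both the local and the cross-partition step
val viaReduce    = rdd.reduceByKey(_ + _).collect()              // Array((key,210))
// The same computation with the two steps written out explicitly
val viaAggregate = rdd.aggregateByKey(0)(_ + _, _ + _).collect() // Array((key,210))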
In your example of
val counts = pairs.reduceByKey((a,b) => a+b)
a and b are both Int accumulators for the _2 of the tuples in pairs. reduceByKey takes two tuples with the same key s and uses their _2 values as a and b, producing a new Tuple2[String, Int]. This operation is repeated until there is only one tuple per key s.
Unlike a non-Spark (or, really, non-parallel) reduce, where the first element is always the accumulator and the second a value, reduceByKey operates in a distributed fashion, i.e. each node reduces its set of tuples into a collection of uniquely-keyed tuples and then reduces the tuples from multiple nodes until there is a final uniquely-keyed set of tuples. This means that, as the results from nodes are reduced, a and b represent already-reduced accumulators.
Spark RDD reduceByKey function merges the values for each key using an associative reduce function.
The reduceByKey function works only on pair RDDs, and it is a transformation, which means it is lazily evaluated. An associative function is passed as a parameter; it is applied to the source RDD and a new RDD is created as a result.
So in your example, the RDD pairs has a set of multiple paired elements like (s1,1), (s2,1) etc. reduceByKey accepts a function (accumulator, n) => accumulator + n, which adds up the elements for each key and returns the result RDD counts, holding the total count paired with each key.
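A minimal runnable sketch of that behaviour (assuming a SparkContext named sc, as in a spark-shell session): the reduceByKey call only records the transformation; the job actually runs when the action collect() is invoked.

val pairs = sc.parallelize(Seq(("s1", 1), ("s2", 1), ("s1", 1)))
val counts = pairs.reduceByKey((accumulator, n) => accumulator + n) // transformation: nothing runs yet
counts.collect()                                                    // action: triggers the job, yielding (s1,2) and (s2,1)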
Simply put, if your input RDD data looks like this:
(aa,1)
(bb,1)
(aa,1)
(cc,1)
(bb,1)
and you apply reduceByKey on the above RDD data, there are a few things to remember:
The reduce function always takes 2 inputs (x, y) and always works with two values at a time.
Since it is reduceByKey, it will combine two rows with the same key and merge their values.
val rdd2 = rdd.reduceByKey((x,y) => x+y)
rdd2.foreach(println)
output:
(aa,2)
(bb,2)
(cc,1)

How do I run the Spark decision tree with a categorical feature set using Scala?

I have a feature set with a corresponding categoricalFeaturesInfo: Map[Int,Int]. However, for the life of me I cannot figure out how I am supposed to get the DecisionTree class to work. It will not accept anything but a LabeledPoint as data. However, LabeledPoint requires (Double, Vector), where the Vector requires Doubles.
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
// Run training algorithm to build the model
val maxDepth: Int = 3
val isMulticlassWithCategoricalFeatures: Boolean = true
val numClassesForClassification: Int = countPossibilities(labelCol)
val model = DecisionTree.train(LP, Classification, Gini, isMulticlassWithCategoricalFeatures, maxDepth, numClassesForClassification,categoricalFeaturesInfo)
The error I get:
scala> val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
<console>:32: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[String])
val LP = featureSet.map(x => LabeledPoint(classMap(x(0)),Vectors.dense(x.tail)))
My resources thus far:
tree config, decision tree, labeledpoint
You can first transform categories to numbers, then load data as if all features are numerical.
When you build a decision tree model in Spark, you just need to tell Spark which features are categorical, along with each feature's arity (the number of distinct categories of that feature), by specifying a Map[Int, Int] from feature index to arity.
For example if you have data as:
1,a,add
2,b,more
1,c,thinking
3,a,to
1,c,me
You can first transform data into numerical format as:
1,0,0
2,1,1
1,2,2
3,0,3
1,2,4
In that format you can load data to Spark. Then if you want to tell Spark the second and the third columns are categorical, you should create a map:
categoricalFeaturesInfo = Map[Int, Int]((1,3),(2,5))
The map tells Spark that the feature with index 1 has arity 3 and the feature with index 2 has arity 5. They will be treated as categorical when we build a decision tree model, passing that map as a parameter of the training function:
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
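Putting that together as a minimal end-to-end sketch (the file name, label layout and parameter values are assumptions for illustration, and a SparkContext sc is assumed to exist). Note that the indices in categoricalFeaturesInfo refer to positions within the feature vector, i.e. after the label column has been stripped off.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Hypothetical input lines of the form "label,f0,f1", already numeric, with 0-based labels
val data = sc.textFile("data/encoded.csv").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}

val numClasses = 3
val categoricalFeaturesInfo = Map(0 -> 3, 1 -> 5) // feature 0 has arity 3, feature 1 has arity 5
val impurity = "gini"
val maxDepth = 5
val maxBins = 32                                  // must be at least the largest arity

val model = DecisionTree.trainClassifier(
  data, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)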
Strings are not supported by LabeledPoint. One way to put such data into a LabeledPoint is to split your string column into multiple columns, given that your strings are categorical.
So for example, if you have the following dataset:
id,String,Intvalue
1,"a",123
2,"b",456
3,"c",789
4,"a",887
Then you could split your string data, making each value of the strings into a new column
a -> 1,0,0
b -> 0,1,0
c -> 0,0,1
As you have 3 distinct string values, you will convert your string column into 3 new columns, and each value will be represented by an indicator in these new columns.
Now your dataset will be:
id,a,b,c,Intvalue
1,1,0,0,123
2,0,1,0,456
3,0,0,1,789
4,1,0,0,887
You can now convert these into Double values and use them in your LabeledPoint.
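A small sketch of that encoding (hypothetical helper; the category list a, b, c is assumed to be known up front, and the quotes around the string values are stripped):

val categories = Array("a", "b", "c")

// Turn one raw line such as 1,"a",123 into the numeric row id,a,b,c,Intvalue
def encode(line: String): Array[Double] = {
  val Array(id, s, intValue) = line.split(',')
  val raw = s.replace("\"", "")                                // "a" -> a
  val oneHot = categories.map(c => if (c == raw) 1.0 else 0.0) // one indicator per category
  (id.toDouble +: oneHot) :+ intValue.toDouble
}

From there, Vectors.dense can wrap the encoded row (or the feature part of it) inside a LabeledPoint once you have decided which column is the label.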
Another way to convert your strings for a LabeledPoint is to create a distinct list of values for each column and convert each string value into its index in that list. This is not recommended, because in this example dataset it would give:
a = 0
b = 1
c = 2
But in this case the algorithm would treat a as closer to b than to c, an ordering that the data does not actually imply.
You need to check the element type of the array x.
The error log says that the items in array x are Strings, which is not supported by Vectors.dense.
Current Spark MLlib Vectors can only be filled with Doubles.
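In other words, the fix for the error above (a sketch, assuming every element of x.tail is a numeric string) is to convert the strings to Double before building the vector:

val LP = featureSet.map(x => LabeledPoint(classMap(x(0)), Vectors.dense(x.tail.map(_.toDouble))))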