I am trying to implement a Spark Streaming application that reads streaming data from Kafka. The streaming data are (key, value) pairs in the form "String,int", and I want to calculate the average value for each key.
The data look like this:
x,20
y,10
z,3
...
I want to compute the average value for each key in a stateful manner, so in the mapping function I intend to save both the running sum of the values and the number of times the corresponding key has appeared into the State.
def mappingFunc(key: String, value: Option[Double], state: State[Double], count: State[Int]): (String, Double) = {
  val sum = value.getOrElse(0.0) + state.getOption.getOrElse(0.0)
  val cnt = count.getOption.getOrElse(1) + 1
  state.update(sum)
  count.update(cnt)
  val output = (key, sum / cnt)
  output
}
The compiler gives me this error:
[error] /Users/Rabbit/Desktop/KTH_Second_Year/Periods/P1/Data-intensive_Computing/Lab_Assignment/lab3/src/sparkstreaming/KafkaSpark.scala:78: wrong number of type parameters for overloaded method value function with alternatives:
[error] [KeyType, ValueType, StateType, MappedType](mappingFunction: org.apache.spark.api.java.function.Function3[KeyType,org.apache.spark.api.java.Optional[ValueType],org.apache.spark.streaming.State[StateType],MappedType])org.apache.spark.streaming.StateSpec[KeyType,ValueType,StateType,MappedType] <and>
[error] [KeyType, ValueType, StateType, MappedType](mappingFunction: org.apache.spark.api.java.function.Function4[org.apache.spark.streaming.Time,KeyType,org.apache.spark.api.java.Optional[ValueType],org.apache.spark.streaming.State[StateType],org.apache.spark.api.java.Optional[MappedType]])org.apache.spark.streaming.StateSpec[KeyType,ValueType,StateType,MappedType] <and>
[error] [KeyType, ValueType, StateType, MappedType](mappingFunction: (KeyType, Option[ValueType], org.apache.spark.streaming.State[StateType]) => MappedType)org.apache.spark.streaming.StateSpec[KeyType,ValueType,StateType,MappedType] <and>
[error] [KeyType, ValueType, StateType, MappedType](mappingFunction: (org.apache.spark.streaming.Time, KeyType, Option[ValueType], org.apache.spark.streaming.State[StateType]) => Option[MappedType])org.apache.spark.streaming.StateSpec[KeyType,ValueType,StateType,MappedType]
How can I keep track of both the sum of the values and the count at the same time in Spark Streaming?
You need to combine the sum and the count into a tuple (Double, Int) that is stored in the state. The following snippet should do the trick:
def mappingFunc(key: String, value: Option[Double], state: State[(Double, Int)]): (String, Double) = {
  val (sum, cnt) = state.getOption.getOrElse((0.0, 0))
  val newSum = value.getOrElse(0.0) + sum
  val newCnt = cnt + 1
  state.update((newSum, newCnt))
  (key, newSum / newCnt)
}
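For completeness, here is a minimal sketch of how the mapping function could be wired into the stream; the `pairs` DStream of (String, Double) parsed from the Kafka records is an assumption, not part of the question:
import org.apache.spark.streaming.StateSpec

// mapWithState requires checkpointing to be enabled, e.g. ssc.checkpoint("checkpoint-dir").
// `pairs` is assumed to be a DStream[(String, Double)] parsed from the "key,value" records.
val stateSpec = StateSpec.function(mappingFunc _)
val averages = pairs.mapWithState(stateSpec)  // DStream[(String, Double)] of running averages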
I am new to Apache Spark, and am not able to get this to work.
I have an RDD of the form (Int, (Int, Int)), and I would like to sum up the first element of each value while collecting the second elements.
For example, I have the following RDD:
[(5,(1,0)), (5,(1,2)), (5,(1,5))]
And I want to be able to get something like this:
(5,3,(0,2,5))
I tried this:
sampleRdd.reduceByKey{case(a,(b,c)) => (a + b)}
But I get this error:
type mismatch;
[error] found : Int
[error] required: String
[error] .reduceByKey{case(a,(b,c)) => (a + b)}
[error] ^
How can I achieve this?
Please try this:
def seqOp = (accumulator: (Int, List[String]), element: (Int, Int)) =>
  (accumulator._1 + element._1, accumulator._2 :+ element._2.toString)

def combOp = (accumulator1: (Int, List[String]), accumulator2: (Int, List[String])) =>
  (accumulator1._1 + accumulator2._1, accumulator1._2 ::: accumulator2._2)

val zeroVal = (0, List.empty[String])

rdd.aggregateByKey(zeroVal)(seqOp, combOp).collect
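Applied to the sample RDD from the question, a quick check could look like this (sc is assumed to be an available SparkContext):
val sampleRdd = sc.parallelize(Seq((5, (1, 0)), (5, (1, 2)), (5, (1, 5))))
sampleRdd.aggregateByKey(zeroVal)(seqOp, combOp).collect
// roughly: Array((5,(3,List(0, 2, 5)))) -- list ordering may vary with partitioning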
I have a Spark Streaming job; the code is below:
val filterActions = userActions.filter(Utils.filterPageType)
val parseAction = filterActions.flatMap(record => ParseOperation.parseMatch(categoryMap, record))
val finalActions = parseAction.filter(record => record.get("invalid") == None)
val userModels = finalActions.map(record => (record("deviceid"), record)).mapWithState(StateSpec.function(stateUpdateFunction))
Everything compiles smoothly except for the mapWithState call. The return type of ParseOperation.parseMatch(categoryMap, record) is ListBuffer[Map[String, Any]], and the error is as below:
[INFO] Compiling 9 source files to /Users/spare/project/campaign-project/stream-official-mall/target/classes at 1530404002409
[ERROR] /Users/spare/project/campaign-project/stream-official-mall/src/main/scala/com/shopee/mall/data/OfficialMallTracker.scala:77: error: overloaded method value function with alternatives:
[ERROR] [KeyType, ValueType, StateType, MappedType](mappingFunction: org.apache.spark.api.java.function.Function3[KeyType,org.apache.spark.api.java.Optional[ValueType],org.apache.spark.streaming.State[StateType],MappedType])org.apache.spark.streaming.StateSpec[KeyType,ValueType,StateType,MappedType] <and>
[ERROR] [KeyType, ValueType, StateType, MappedType](mappingFunction: org.apache.spark.api.java.function.Function4[org.apache.spark.streaming.Time,KeyType,org.apache.spark.api.java.Optional[ValueType],org.apache.spark.streaming.State[StateType],org.apache.spark.api.java.Optional[MappedType]])org.apache.spark.streaming.StateSpec[KeyType,ValueType,StateType,MappedType] <and>
[ERROR] [KeyType, ValueType, StateType, MappedType](mappingFunction: (KeyType, Option[ValueType], org.apache.spark.streaming.State[StateType]) => MappedType)org.apache.spark.streaming.StateSpec[KeyType,ValueType,StateType,MappedType] <and>
[ERROR] [KeyType, ValueType, StateType, MappedType](mappingFunction: (org.apache.spark.streaming.Time, KeyType, Option[ValueType], org.apache.spark.streaming.State[StateType]) => Option[MappedType])org.apache.spark.streaming.StateSpec[KeyType,ValueType,StateType,MappedType]
[ERROR] cannot be applied to ((Any, Map[String,Any], org.apache.spark.streaming.State[Map[String,Any]]) => Some[Map[String,Any]])
[ERROR] val userModels = finalActions.map(record => (record("deviceid"), record)).mapWithState(StateSpec.function(stateUpdateFunction))
[ERROR] ^
[ERROR] one error found
What caused the issue, and how should I modify the code?
I have fixed it. The cause was that StateSpec.function(stateUpdateFunction) requires the type of the input parameter to be Map[String, Any], so before calling it I added a map step; the code is below:
val finalActions = parseAction.filter(record => record.get("invalid") == None).map(Utils.parseFinalRecord)

val parseFinalRecord = (record: Map[String, Any]) => {
  val recordMap = collection.mutable.Map(record.toSeq: _*)
  logger.info(s"recordMap: ${recordMap}")
  recordMap.toMap
}
it works!
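For reference, here is a sketch of a stateUpdateFunction signature that matches the three-argument overload of StateSpec.function shown in the error; the body and names are assumptions for illustration, not the original code:
import org.apache.spark.streaming.State

// Hypothetical sketch: a concrete key type (String) instead of Any -- e.g. by keying on
// record("deviceid").toString -- and the mapped value returned directly, not wrapped in Some.
def stateUpdateFunction(deviceId: String,
                        record: Option[Map[String, Any]],
                        state: State[Map[String, Any]]): Map[String, Any] = {
  val merged = state.getOption.getOrElse(Map.empty[String, Any]) ++ record.getOrElse(Map.empty[String, Any])
  state.update(merged)
  merged
}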
A previous process gave me the accumulator and count of every group in the following way:
val data: Array[(Int, (Double, Int))] = Array((2,(2.1463120403829962,7340)), (1,(1.4532644653720025,4280)))
The structure is (groupId, (accum, count)).
Now I want to reduce to get the sum over all tuples:
(k1,(a1,n1)), (k2,(a2,n2))
I need:
(a1+a2), (n1+n2)
Sounds like a simple task, so I do:
val mainMean = groups.reduce((acc,v)=>(acc._1 + v._1,acc._2 + v._2))
And get:
:33: error: type mismatch; found : (Double, Int)
required: String
val mainMean = groups.reduce((acc,v)=>(acc._1 + v._1,acc._2 + v._2))
Also tried:
val mainMean = groups.reduce((k,(acc,v))=>(acc._1 + v._1,acc._2 + v._2))
and it tells me: Note: Tuples cannot be directly destructured in method or function parameters.
Either create a single parameter accepting the Tuple2,
or consider a pattern matching anonymous function
So:
val mainMean = groups.reduce({case(k,(acc,n))=>(k,(acc._1+n._1,acc._1+n._2))})
and get
error: type mismatch; found : (Int, (Double, Int)) required: Int
I know it's a newbie question, but I am stuck on it.
There can be some difficulties when working with tuples. Below you can see working code, but let me explain.
val data = Array((2,(2.1463120403829962,7340)), (1,(1.4532644653720025,4280)))
def tupleSum(t1: (Int, (Double, Int)), t2: (Int, (Double, Int))): (Int, (Double, Int)) =
  (0, (t1._2._1 + t2._2._1, t1._2._2 + t2._2._2))
val mainMean = data.reduce(tupleSum)._2
We can write the reduce arguments explicitly, like
data.reduce((tuple1, tuple2) => tupleSum(tuple1, tuple2))
where tuple1 acts as a kind of accumulator: on the first iteration it takes the first value of the array, and every subsequent value is added to the accumulator.
So if you want to perform the reduce using pattern matching, it will look like this:
val mainMean = data.reduce((tuple1, tuple2) => {
  val t1 = tuple1 match { case (i, t) => t }
  val t2 = tuple2 match { case (i, t) => t }
  // now t1 and t2 represent the inner tuples of the input tuples
  (0, (t1._1 + t2._1, t1._2 + t2._2))
})
UPD.
I have rewritten the previous listing, adding type annotations and println statements. I hope it helps to get the point; some explanation follows.
val data = Array((3, (3.0, 3)), (2, (2.0, 2)), (1, (1.0, 1)))

val mainMean = data.reduce((tuple1: (Int, (Double, Int)),
                            tuple2: (Int, (Double, Int))) => {
  println("tuple1: " + tuple1)
  println("tuple2: " + tuple2)
  val t1: (Double, Int) = tuple1 match {
    case (i: Int, t: (Double, Int)) => t
  }
  val t2: (Double, Int) = tuple2 match {
    case (i: Int, t: (Double, Int)) => t
  }
  // now t1 and t2 represent the inner tuples of the input tuples
  (0, (t1._1 + t2._1, t1._2 + t2._2))
})

println("mainMean: " + mainMean)
And the output will be:
tuple1: (3,(3.0,3)) // 1st element of the array
tuple2: (2,(2.0,2)) // 2nd element of the array
tuple1: (0,(5.0,5)) // sum of 1st and 2nd elements
tuple2: (1,(1.0,1)) // 3rd element
mainMean: (0,(6.0,6)) // result sum
The type of tuple1 and tuple2 is (Int, (Double, Int)). We know it will always be this type, which is why we use only one case in the pattern match. We unpack tuple1 into i: Int and t: (Double, Int). Since we are not interested in the key, we return only t. Now t1 represents the inner tuple of tuple1. The same goes for tuple2 and t2.
You can find more information about fold and reduce functions in the Scala collections documentation.
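If the keys are not needed at all, a shorter equivalent (just an alternative sketch, not a change to the approach above) is to drop them first and reduce the inner tuples:
val mainMean = data.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
// for the (3,(3.0,3)), (2,(2.0,2)), (1,(1.0,1)) example this yields (6.0, 6)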
I'm trying to generate an RDD from a List and another RDD using Scala and Spark. The idea is to take a list of values and generate an index containing all the entries of the original dataset that contain each value.
Here's the code that I'm trying:
def mcveInvertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
  // filter function
  def hasVal(foo: String)(bar: Int): Boolean =
    foo.toInt == bar
  // call to sc.parallelize to ensure that an RDD is returned
  sc parallelize (
    foos map (_ match {
      case (s: String) => (s, bars filter hasVal(s))
    })
  )
}
Unfortunately this does not compile in sbt
> compile
[info] Compiling 1 Scala source to $TARGETDIR/target/scala-2.11/classes...
[error] $TARGETDIR/src/main/scala/wikipedia/WikipediaRanking.scala:56: type mismatch;
[error] found : List[(String, org.apache.spark.rdd.RDD[Int])]
[error] required: Seq[(String, Iterable[Int])]
[error] Error occurred in an application involving default arguments.
[error] foos map (_ match {
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 1 s, completed Mar 11, 2017 7:11:31 PM
I really don't understand the errors that I'm getting. List is a subclass of Seq, and I presume that RDDs are a subclass of Iterable. Is there something obvious that I've missed?
Here is my solution with a for-comprehension (it should use less memory than a cartesian product):
def mcveInvertIndex(foos: List[String],
                    bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
  // filter function
  def hasVal(foo: String, bar: Int): Boolean =
    foo.toInt == bar
  // Producing RDD[(String, Iterable[Int])]
  (for {
    bar <- bars // it's important to have the RDD
                // in the first position of the for-comprehension
                // to produce the correct result type
    foo <- foos
    if hasVal(foo, bar)
  } yield (foo, bar)).groupByKey()
}
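A quick usage sketch with made-up inputs (sc is assumed to be an available SparkContext):
val foos = List("1", "2")
val bars = sc.parallelize(Seq(1, 2, 2, 3))
mcveInvertIndex(foos, bars).collect()
// roughly: Array((1,CompactBuffer(1)), (2,CompactBuffer(2, 2))) -- ordering may differ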
As mentioned in the comment, RDD is not an Iterable, so you have to combine the two in some way and then aggregate them. This is my quick solution, although there might be a more efficient way:
def mcveInvertIndex(foos: List[String], bars: RDD[Int]): RDD[(String, Iterable[Int])] = {
  sc.makeRDD(foos)
    .cartesian(bars)
    .keyBy(x => x._1)
    .aggregateByKey(Iterable.empty[Int])(
      (agg: Iterable[Int], currVal: (String, Int)) => {
        if (currVal._1.toInt != currVal._2) agg
        else currVal._2 +: agg.toList
      },
      _ ++ _
    )
}
I have this Scala snippet from a custom mapper (for use in a Spark mapPartitions) I'm writing to compute histograms of multiple Int fields simultaneously.
def multiFeatureHistogramFunc(iter: Iterator[Row]): Iterator[(Int, (Int, Long))] = {
  var featureHistMap: Map[Int, (Int, Long)] = Map()
  while (iter.hasNext) {
    val cur = iter.next
    indices.foreach({ index: Int =>
      val v: Int = if (cur.isNullAt(index)) -100 else cur.getInt(index)
      var featureHist: Map[Int, Long] = featureHistMap.getOrElse(index, Map())
      val newCount = featureHist.getOrElse(v, 0L) + 1L
      featureHist += (v -> newCount)
      featureHistMap += (index -> featureHist)
    })
  }
  featureHistMap.iterator
}
But the error I'm getting is this:
<console>:49: error: type mismatch;
found : Equals
required: Map[Int,Long]
var featureHist:Map[Int, Long] =
featureHistMap.getOrElse(index, Map())
^
I couldn't find the answer to this specific issue. It looks to me like the default parameter in featureHistMap.getOrElse has a different type than the value type of featureHistMap itself, and their common parent type is Equals, so this causes the type mismatch. I tried a number of different things, like changing the default parameter to a more specific type, but this just caused a different error.
Can someone explain what's going on here and how to fix it?
The problem is that you declared your featureHistMap as Map[Int, (Int, Long)] - note that you are mapping an Int to a pair (Int, Long). Later, you try to retrieve a value from it as a Map[Int, Long], instead of a pair (Int, Long).
You either need to redeclare the type of featureHistMap to Map[Int, Map[Int, Long]], or the type of featureHist to (Int, Long).
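A minimal sketch of the first option, redeclaring featureHistMap as Map[Int, Map[Int, Long]]; note that the function's return type changes accordingly, and `indices` is assumed to be a Seq[Int] in scope as in the original:
import org.apache.spark.sql.Row

def multiFeatureHistogramFunc(iter: Iterator[Row]): Iterator[(Int, Map[Int, Long])] = {
  // column index -> (value -> count); `indices` is assumed to be defined elsewhere
  var featureHistMap: Map[Int, Map[Int, Long]] = Map()
  while (iter.hasNext) {
    val cur = iter.next
    indices.foreach { index: Int =>
      val v: Int = if (cur.isNullAt(index)) -100 else cur.getInt(index)
      val featureHist: Map[Int, Long] = featureHistMap.getOrElse(index, Map.empty[Int, Long])
      featureHistMap += (index -> (featureHist + (v -> (featureHist.getOrElse(v, 0L) + 1L))))
    }
  }
  featureHistMap.iterator
}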