scala> val dataArray = Array("a,1|2|3","b,4|5|6","a,7|8|9","b,10|11|12")
dataArray: Array[String] = Array(a,1|2|3, b,4|5|6, a,7|8|9, b,10|11|12)
scala> val dataRDD = sc.parallelize(dataArray)
dataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:26
scala> val mapRDD = dataRDD.map(rec => (rec.split(",")(0),rec.split(",")(1).split("\\|")))
mapRDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[9] at map at <console>:25
scala> mapRDD.collect
res20: Array[(String, Array[String])] = Array((a,Array(1, 2, 3)), (b,Array(4, 5, 6)), (a,Array(7, 8, 9)), (b,Array(10, 11, 12)))
scala> mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
<console>:26: error: type mismatch;
found : List[String]
required: Array[String]
mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
I then tried it like this:
val finalRDD = mapRDD.map(elem => (elem._1,elem._2.mkString("#")))
scala> finalRDD.reduceByKey((a,b) => (a.split("#")(0).toInt + b.split("#")(0).toInt).toString )
res31: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[14] at reduceByKey at <console>:26
scala> res31.collect
res32: Array[(String, String)] = Array((b,14), (a,8))
As you can see, I am not able to get the result for all indexes; my code gives the sum only for the first index.
I want the sum applied per index, i.e. the sum of all values at index 0, the sum of all values at index 1, and so on for each key. My expected output is below:
(a,(8,10,12))
(b,(14,16,18))
Please help.
Use transpose and map the result with _.sum:
import org.apache.spark.rdd.RDD

object RddExample {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // project-specific helper that returns a SparkSession
    val sc = spark.sparkContext
    // import spark.implicits._
    val dataArray = Array("a,1|2|3", "b,4|5|6", "a,7|8|9", "b,10|11|12")
    val dataRDD = sc.parallelize(dataArray)
    val mapRDD: RDD[(String, Array[Int])] = dataRDD.map(rec =>
      (rec.split(",")(0), rec.split(",")(1).split("\\|").map(_.toInt)))
    // val mapRdd : Array[(String, Array[String])] = mapRDD.collect

    // Group all value arrays per key, transpose them so values with the same index
    // line up as columns, then sum each column
    val result: Array[(String, List[Int])] = mapRDD.groupByKey().mapValues(itr => {
      itr.toList.transpose.map(_.sum)
    }).collect()

    result.foreach(println) // println(result) would only print the array reference
  }
}
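To see why this works, here is what transpose does to the grouped values for key a (a small sketch using the values from the question's data):
// For key "a" the grouped values are Array(1, 2, 3) and Array(7, 8, 9)
val columns = List(Array(1, 2, 3), Array(7, 8, 9)).transpose // List(List(1, 7), List(2, 8), List(3, 9))
val sums    = columns.map(_.sum)                             // List(8, 10, 12)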
I feel that QuickSilver's answer is perfect.
I also tried this approach using reduceByKey, but I am adding every index manually, so this won't help if there are more indexes.
Up to mapRDD the code is the same as in my question:
.....
.....
mapRDD = code as per the question
scala> val finalRDD = mapRDD.map(elem => (elem._1, elem._2 match {
| case Array(a:String,b:String,c:String) => (a.toInt,b.toInt,c.toInt)
| case _ => (100,200,300)
| }
| )
| )
scala> finalRDD.collect
res39: Array[(String, (Int, Int, Int))] = Array((a,(1,2,3)), (b,(4,5,6)), (a,(7,8,9)), (b,(10,11,12)))
scala> finalRDD.reduceByKey((v1,v2) => (v1._1+v2._1,v1._2+v2._2,v1._3+v2._3))
res40: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ShuffledRDD[18] at reduceByKey at <console>:26
scala> res40.collect
res41: Array[(String, (Int, Int, Int))] = Array((b,(14,16,18)), (a,(8,10,12)))
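For reference, a reduceByKey version that scales to any number of indexes is also possible by zipping the two arrays element-wise. This is only a sketch and assumes the Array[Int]-valued mapRDD from QuickSilver's answer above:
// Element-wise sum of the two arrays reduced for the same key; works for arrays of any length
val summed = mapRDD.reduceByKey((v1, v2) => v1.zip(v2).map { case (x, y) => x + y })
summed.mapValues(_.mkString("(", ",", ")")).collect().foreach(println)
// e.g. (a,(8,10,12)) and (b,(14,16,18))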
Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq(("a",5),("c",6)))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see the Option[Int] data type in rdd3. Is there a way to rectify this so that the collected result is Array[(String, (Int, Int))]? Suppose we can specify a default value (e.g. 999) for None.
scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.
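If you want the replacement to happen while the data is still an RDD (rather than on the collected array on the driver), the same getOrElse can be applied with mapValues, for example:
// Replace the missing right side with 999 before collecting
val rdd4 = rdd3.mapValues { case (left, rightOpt) => (left, rightOpt.getOrElse(999)) }
rdd4.collect() // Array((a,(1,5)), (a,(2,5)), (b,(3,999)))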
I'm creating a shapeless HMap using code like the following, taken from scala-exercises.
import shapeless.HMap
class BiMapIS[K, V]
implicit val intToString = new BiMapIS[Int, String]
implicit val stringToInt = new BiMapIS[String, Int]
val hm = HMap[BiMapIS](23 -> "foo", "bar" -> 13)
I would like to create the HMap from variable arguments as below (I have a long list of arguments, so I'm checking whether I can simplify the code a little bit):
import shapeless.{HMap, HNil}
import java.util.{List => JList}
val entities: JList[(_, _)] = ???
class BiMapIS[K, V]
implicit val intToString = new BiMapIS[Int, String]
implicit val stringToInt = new BiMapIS[String, Int]
import collection.JavaConverters._
val entitiesSeq = entities.asScala.toList
val hm = HMap[BiMapIS](entitiesSeq:_*)
Is there any way I can create an HMap from variable args?
I'm using shapeless 2.3.3 with Scala 2.12: https://mvnrepository.com/artifact/com.chuusai/shapeless_2.12/2.3.3
Try
val entitiesSeq = entities.asScala.toMap[Any, Any]
val hm = new HMap[BiMapIS](entitiesSeq)
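Note that because the entries come from an untyped JList, only the lookups, not the inserted pairs, are checked against the BiMapIS relation. A small usage sketch, assuming the implicit BiMapIS instances above are in scope:
// The key type drives the value type via the BiMapIS relation
val s: Option[String] = hm.get(23)    // Int key    => String value
val i: Option[Int]    = hm.get("bar") // String key => Int value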
Is there a way in Scala to pass an explicitly defined function with additional/extra arguments to an RDD transformation?
For example, the Python code below uses a lambda expression so that map (which requires a function of one argument) can apply the function my_power (which actually takes 2 arguments).
def my_power(a, b):
    res = a ** b
    return res

def my_main(sc, n):
    inputRDD = sc.parallelize([1, 2, 3, 4])
    powerRDD = inputRDD.map(lambda x: my_power(x, n))
    resVAL = powerRDD.collect()
    for item in resVAL:
        print(item)
However, when attempting an equivalent implementation in Scala, I get a Task not serializable exception.
val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
  val res: Int = math.pow(a, b).toInt
  res
}

def myMain(sc: SparkContext, n: Int): Unit = {
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPower(x, n))
  val resVAL: Array[Int] = squareRDD.collect()
  for (item <- resVAL) {
    println(item)
  }
}
This way it worked for me:
package examples

import org.apache.log4j.Level
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RDDTest extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
    val res: Int = math.pow(a, b).toInt
    res
  }

  val scontext = spark.sparkContext
  myMain(scontext, 10)

  def myMain(sc: SparkContext, n: Int): Unit = {
    val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
    val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPower(x, n))
    val resVAL: Array[Int] = squareRDD.collect()
    for (item <- resVAL) {
      println(item)
    }
  }
}
Result:
1
1024
59049
1048576
Another option is to broadcast n with sc.broadcast and access it inside the map closure, for example as sketched below.
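A sketch of that broadcast variant (the method name myMainWithBroadcast is just an illustrative stand-in; it assumes the same SparkContext and imports as above):
def myMainWithBroadcast(sc: SparkContext, n: Int): Unit = {
  val nBroadcast = sc.broadcast(n) // ship n to the executors once, read it via .value
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val powerRDD: RDD[Int] = inputRDD.map(x => math.pow(x, nBroadcast.value).toInt)
  powerRDD.collect().foreach(println)
}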
Simply adding a local variable as a function alias made it work:
val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
  val res: Int = math.pow(a, b).toInt
  res
}

def myMain(sc: SparkContext, n: Int): Unit = {
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val myPowerAlias = myPower
  val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPowerAlias(x, n))
  val resVAL: Array[Int] = squareRDD.collect()
  for (item <- resVAL) {
    println(item)
  }
}
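The usual cause of the original exception is that myPower is a member of a non-serializable enclosing class, so referencing it from the closure drags that whole class in; the local alias captures just the (serializable) function value. Under that assumption, another sketch is to move the function onto a standalone object so no alias is needed (MathFunctions is an illustrative name, not from the question):
object MathFunctions {
  // plain function value on a top-level object, reachable from executors without capturing anything else
  val myPower: (Int, Int) => Int = (a, b) => math.pow(a, b).toInt
}

def myMain(sc: SparkContext, n: Int): Unit = {
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val powerRDD: RDD[Int] = inputRDD.map(x => MathFunctions.myPower(x, n))
  powerRDD.collect().foreach(println)
}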
I have the following test case:
test("check foo") {
val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlc = new SQLContext(sc)
val res = foo("A", "B")
assert(true)
}
Which checks the following method:
def foo(arg1: String, arg2: String)(implicit sqlContext: SQLContext): Seq[String] = {
  // some other code
}
When running the tests I get the following issue:
Error:(65, 42) could not find implicit value for parameter sqlContext: org.apache.spark.sql.SQLContext
val res = foo("A", "B")
How can I share the SQLContext instance I create in the test method with foo?
Put implicit in front of val sqlc:
implicit val sqlc = new SQLContext(sc)
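A minimal sketch of the test after that change (the local[*] master and app name are assumptions added only to make the snippet self-contained):
test("check foo") {
  val conf = new SparkConf().setMaster("local[*]").setAppName("foo-test")
  val sc = new SparkContext(conf)
  implicit val sqlc: SQLContext = new SQLContext(sc) // now visible to implicit search

  val res = foo("A", "B") // the implicit sqlContext parameter resolves to sqlc
  assert(true)
}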