scala> val dataArray = Array("a,1|2|3","b,4|5|6","a,7|8|9","b,10|11|12")
dataArray: Array[String] = Array(a,1|2|3, b,4|5|6, a,7|8|9, b,10|11|12)
scala> val dataRDD = sc.parallelize(dataArray)
dataRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[8] at parallelize at <console>:26
scala> val mapRDD = dataRDD.map(rec => (rec.split(",")(0),rec.split(",")(1).split("\\|")))
mapRDD: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[9] at map at <console>:25
scala> mapRDD.collect
res20: Array[(String, Array[String])] = Array((a,Array(1, 2, 3)), (b,Array(4, 5, 6)), (a,Array(7, 8, 9)), (b,Array(10, 11, 12)))
scala> mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
<console>:26: error: type mismatch;
found : List[String]
required: Array[String]
mapRDD.reduceByKey((value1,value2) => List(value1(0) + value2(0)))
I then tried it like this:
val finalRDD = mapRDD.map(elem => (elem._1,elem._2.mkString("#")))
scala> finalRDD.reduceByKey((a,b) => (a.split("#")(0).toInt + b.split("#")(0).toInt).toString )
res31: org.apache.spark.rdd.RDD[(String, String)] = ShuffledRDD[14] at reduceByKey at <console>:26
scala> res31.collect
res32: Array[(String, String)] = Array((b,14), (a,8))
As you can see, I am not able to get the result for all indexes; my code gives the sum only for the first index.
I want the sum applied per index, i.e. the sum of all values at index 0, the sum of all values at index 1, and so on for each key. My expected output is below:
(a,(8,10,12))
(b,(14,16,18))
Please help.
Use transpose and map the result with _.sum:
import org.apache.spark.rdd.RDD

object RddExample {

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess // project-specific helper that returns a SparkSession
    val sc = spark.sparkContext
    // import spark.implicits._
    val dataArray = Array("a,1|2|3", "b,4|5|6", "a,7|8|9", "b,10|11|12")
    val dataRDD = sc.parallelize(dataArray)
    val mapRDD: RDD[(String, Array[Int])] = dataRDD.map(rec =>
      (rec.split(",")(0), rec.split(",")(1).split("\\|").map(_.toInt)))
    // val mapRdd : Array[(String, Array[String])] = mapRDD.collect

    // Group all value arrays per key, transpose them so values with the same index
    // line up as columns, then sum each column
    val result: Array[(String, List[Int])] = mapRDD.groupByKey().mapValues(itr => {
      itr.toList.transpose.map(_.sum)
    }).collect()

    result.foreach(println) // println(result) would only print the array reference
  }
}
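To see why this works, here is what transpose does to the grouped values for key a (a small sketch using the values from the question's data):
// For key "a" the grouped values are Array(1, 2, 3) and Array(7, 8, 9)
val columns = List(Array(1, 2, 3), Array(7, 8, 9)).transpose // List(List(1, 7), List(2, 8), List(3, 9))
val sums    = columns.map(_.sum)                             // List(8, 10, 12)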
I feel that QuickSilver's answer is perfect.
I also tried this approach using reduceByKey, but I am adding every index manually, so this won't help if there are more indexes.
Up to mapRDD the code is the same as in my question:
.....
.....
mapRDD = code as per the question
scala> val finalRDD = mapRDD.map(elem => (elem._1, elem._2 match {
| case Array(a:String,b:String,c:String) => (a.toInt,b.toInt,c.toInt)
| case _ => (100,200,300)
| }
| )
| )
scala> finalRDD.collect
res39: Array[(String, (Int, Int, Int))] = Array((a,(1,2,3)), (b,(4,5,6)), (a,(7,8,9)), (b,(10,11,12)))
scala> finalRDD.reduceByKey((v1,v2) => (v1._1+v2._1,v1._2+v2._2,v1._3+v2._3))
res40: org.apache.spark.rdd.RDD[(String, (Int, Int, Int))] = ShuffledRDD[18] at reduceByKey at <console>:26
scala> res40.collect
res41: Array[(String, (Int, Int, Int))] = Array((b,(14,16,18)), (a,(8,10,12)))
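For reference, a reduceByKey version that scales to any number of indexes is also possible by zipping the two arrays element-wise. This is only a sketch and assumes the Array[Int]-valued mapRDD from QuickSilver's answer above:
// Element-wise sum of the two arrays reduced for the same key; works for arrays of any length
val summed = mapRDD.reduceByKey((v1, v2) => v1.zip(v2).map { case (x, y) => x + y })
summed.mapValues(_.mkString("(", ",", ")")).collect().foreach(println)
// e.g. (a,(8,10,12)) and (b,(14,16,18))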
Given
scala> val rdd1 = sc.parallelize(Seq(("a",1),("a",2),("b",3)))
scala> val rdd2 = sc.parallelize(Seq(("a",5),("c",6)))
scala> val rdd3 = rdd1.leftOuterJoin(rdd2)
scala> rdd3.collect()
res: Array[(String, (Int, Option[Int]))] = Array((a,(1,Some(5))), (a,(2,Some(5))), (b,(3,None)))
We can see the Option[Int] data type in rdd3. Is there a way to rectify this so that the collected result is Array[(String, (Int, Int))]? Suppose we can specify a default value (e.g. 999) for None.
scala> val result = rdd3.collect()
scala> result.map(t => (t._1, (t._2._1, t._2._2.getOrElse(999))))
This should do it.
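If you want the replacement to happen while the data is still an RDD (rather than on the collected array on the driver), the same getOrElse can be applied with mapValues, for example:
// Replace the missing right side with 999 before collecting
val rdd4 = rdd3.mapValues { case (left, rightOpt) => (left, rightOpt.getOrElse(999)) }
rdd4.collect() // Array((a,(1,5)), (a,(2,5)), (b,(3,999)))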
I'm creating a shapeless HMap using code like the following, taken from scala-exercises.
import shapeless.HMap
class BiMapIS[K, V]
implicit val intToString = new BiMapIS[Int, String]
implicit val stringToInt = new BiMapIS[String, Int]
val hm = HMap[BiMapIS](23 -> "foo", "bar" -> 13)
I would like to create the HMap from variable arguments as below (I have a long list of arguments, so I'm checking whether I can simplify the code a little bit):
import shapeless.{HMap, HNil}
import java.util.{List => JList}
val entities: JList[(_, _)] = ???
class BiMapIS[K, V]
implicit val intToString = new BiMapIS[Int, String]
implicit val stringToInt = new BiMapIS[String, Int]
import collection.JavaConverters._
val entitiesSeq = entities.asScala.toList
val hm = HMap[BiMapIS](entitiesSeq:_*)
Is there any way I can create an HMap from variable args?
I'm using shapeless 2.3.3 with Scala 2.12: https://mvnrepository.com/artifact/com.chuusai/shapeless_2.12/2.3.3
Try
val entitiesSeq = entities.asScala.toMap[Any, Any]
val hm = new HMap[BiMapIS](entitiesSeq)
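Note that because the entries come from an untyped JList, only the lookups, not the inserted pairs, are checked against the BiMapIS relation. A small usage sketch, assuming the implicit BiMapIS instances above are in scope:
// The key type drives the value type via the BiMapIS relation
val s: Option[String] = hm.get(23)    // Int key    => String value
val i: Option[Int]    = hm.get("bar") // String key => Int value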
Is there a way in Scala to pass an explicitly defined function with additional/extra arguments to an RDD transformation?
For example, the Python code below uses a lambda expression so that map (which requires a function of one argument) can apply the function my_power (which actually takes 2 arguments).
def my_power(a, b):
    res = a ** b
    return res

def my_main(sc, n):
    inputRDD = sc.parallelize([1, 2, 3, 4])
    powerRDD = inputRDD.map(lambda x: my_power(x, n))
    resVAL = powerRDD.collect()
    for item in resVAL:
        print(item)
However, when attempting an equivalent implementation in Scala, I get a Task not serializable exception.
val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
  val res: Int = math.pow(a, b).toInt
  res
}

def myMain(sc: SparkContext, n: Int): Unit = {
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPower(x, n))
  val resVAL: Array[Int] = squareRDD.collect()
  for (item <- resVAL) {
    println(item)
  }
}
This way it worked for me:
package examples

import org.apache.log4j.Level
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RDDTest extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
    val res: Int = math.pow(a, b).toInt
    res
  }

  val scontext = spark.sparkContext
  myMain(scontext, 10)

  def myMain(sc: SparkContext, n: Int): Unit = {
    val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
    val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPower(x, n))
    val resVAL: Array[Int] = squareRDD.collect()
    for (item <- resVAL) {
      println(item)
    }
  }
}
Result:
1
1024
59049
1048576
Another option is to broadcast n with sc.broadcast and access it inside the map closure, for example as sketched below.
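A sketch of that broadcast variant (the method name myMainWithBroadcast is just an illustrative stand-in; it assumes the same SparkContext and imports as above):
def myMainWithBroadcast(sc: SparkContext, n: Int): Unit = {
  val nBroadcast = sc.broadcast(n) // ship n to the executors once, read it via .value
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val powerRDD: RDD[Int] = inputRDD.map(x => math.pow(x, nBroadcast.value).toInt)
  powerRDD.collect().foreach(println)
}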
Simply adding a local variable as a function alias made it work:
val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
  val res: Int = math.pow(a, b).toInt
  res
}

def myMain(sc: SparkContext, n: Int): Unit = {
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val myPowerAlias = myPower
  val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPowerAlias(x, n))
  val resVAL: Array[Int] = squareRDD.collect()
  for (item <- resVAL) {
    println(item)
  }
}
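The usual cause of the original exception is that myPower is a member of a non-serializable enclosing class, so referencing it from the closure drags that whole class in; the local alias captures just the (serializable) function value. Under that assumption, another sketch is to move the function onto a standalone object so no alias is needed (MathFunctions is an illustrative name, not from the question):
object MathFunctions {
  // plain function value on a top-level object, reachable from executors without capturing anything else
  val myPower: (Int, Int) => Int = (a, b) => math.pow(a, b).toInt
}

def myMain(sc: SparkContext, n: Int): Unit = {
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val powerRDD: RDD[Int] = inputRDD.map(x => MathFunctions.myPower(x, n))
  powerRDD.collect().foreach(println)
}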
I have the following test case:
test("check foo") {
val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlc = new SQLContext(sc)
val res = foo("A", "B")
assert(true)
}
Which checks the following method:
def foo(arg1: String, arg2: String)(implicit sqlContext: SQLContext): Seq[String] = {
  // some other code
}
When running the tests I get the following issue:
Error:(65, 42) could not find implicit value for parameter sqlContext: org.apache.spark.sql.SQLContext
val res = foo("A", "B")
How can I share the SQLContext instance I create in the test method with foo?
Put implicit in front of val sqlc:
implicit val sqlc = new SQLContext(sc)
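A minimal sketch of the test after that change (the local[*] master and app name are assumptions added only to make the snippet self-contained):
test("check foo") {
  val conf = new SparkConf().setMaster("local[*]").setAppName("foo-test")
  val sc = new SparkContext(conf)
  implicit val sqlc: SQLContext = new SQLContext(sc) // now visible to implicit search

  val res = foo("A", "B") // the implicit sqlContext parameter resolves to sqlc
  assert(true)
}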