Is there a way in Scala to define an explicit function for an RDD Transformation with additional/extra arguments?
For example, the Python code below uses a lambda expression to apply the map transformation (which requires a one-argument function) with the function my_power (which actually takes two arguments).
def my_power(a, b):
    res = a ** b
    return res

def my_main(sc, n):
    inputRDD = sc.parallelize([1, 2, 3, 4])
    powerRDD = inputRDD.map(lambda x: my_power(x, n))
    resVAL = powerRDD.collect()
    for item in resVAL:
        print(item)
However, when attempting an equivalent implementation in Scala, I get a Task not serializable exception.
val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
  val res: Int = math.pow(a, b).toInt
  res
}

def myMain(sc: SparkContext, n: Int): Unit = {
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPower(x, n))
  val resVAL: Array[Int] = squareRDD.collect()
  for (item <- resVAL) {
    println(item)
  }
}
The following works for me:
package examples

import org.apache.log4j.Level
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RDDTest extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
    val res: Int = math.pow(a, b).toInt
    res
  }

  val scontext = spark.sparkContext
  myMain(scontext, 10)

  def myMain(sc: SparkContext, n: Int): Unit = {
    val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
    val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPower(x, n))
    val resVAL: Array[Int] = squareRDD.collect()
    for (item <- resVAL) {
      println(item)
    }
  }
}
Result:
1
1024
59049
1048576
Broadcasting n with sc.broadcast and accessing it inside the map closure is another option.
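For completeness, here is a minimal sketch of that broadcast variant (assuming the same myMain signature as above; nBroadcast is just an illustrative name):

def myMain(sc: SparkContext, n: Int): Unit = {
  // ship n to the executors once instead of capturing it in the closure
  val nBroadcast = sc.broadcast(n)
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val powerRDD: RDD[Int] = inputRDD.map((x: Int) => math.pow(x, nBroadcast.value).toInt)
  powerRDD.collect().foreach(println)
}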
Simply adding a local variable as a function alias made it work; the closure then captures only the serializable function value instead of the enclosing object:
val myPower: (Int, Int) => Int = (a: Int, b: Int) => {
  val res: Int = math.pow(a, b).toInt
  res
}

def myMain(sc: SparkContext, n: Int): Unit = {
  val inputRDD: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4))
  val myPowerAlias = myPower
  val squareRDD: RDD[Int] = inputRDD.map((x: Int) => myPowerAlias(x, n))
  val resVAL: Array[Int] = squareRDD.collect()
  for (item <- resVAL) {
    println(item)
  }
}
Related
I have an object in Scala in which I have defined some functions.
object Sample {
  val listFunction = Seq(func1(a, b), func2(p, q))

  def func1(a: Int, b: Int): Int = {
    val c = a + b
    c
  }

  def func2(p: Int, q: Int): Int = {
    val d = p + q
    d
  }
}

def main(args: Array[String]): Unit = {
  // Want to call the list and execute the functions
  listFunction
}
How do I call the list in the main method and execute the functions?
Given
def func1(a: Int, b: Int): Int = a + b
def func2(p: Int, q: Int): Int = p + q
Consider the difference between
val x: Int = func1(2, 3) // applied function
val f: (Int, Int) => Int = func1 // function as value
So you have to pass the functions as values when building the sequence, like so:
val listFunction: Seq[(Int, Int) => Int] = Seq(func1, func2)
and then map over the list to apply the functions
listFunction.map(f => f.apply(2, 3))
listFunction.map(f => f(2, 3))
listFunction.map(_(2, 3))
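All three forms are equivalent; with func1 and func2 above, each returns List(5, 5).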
scastie
Are these two partial functions equivalent?
val f0: PartialFunction[Int, String] = {
  case 10 => "ten"
  case n: Int => s"$n"
}

val f1 = new PartialFunction[Int, String] {
  override def isDefinedAt(x: Int): Boolean = true
  override def apply(v: Int): String = if (v == 10) "ten" else s"$v"
}
UPD
val pf = new PartialFunction[Int, String] {
  def isDefinedAt(x: Int) = x == 10
  def apply(v: Int) = if (isDefinedAt(v)) "ten" else "undefined"
}

def fun(n: Int)(pf: PartialFunction[Int, String]) = pf.apply(n)

println(fun(100)(pf))
Is it truly a PF now?
I think you need two partial (value) functions to use PartialFunction the way it is designed to be used: one for the value 10, and another for all other Ints:
val f0: PartialFunction[Int, String] = { case 10 => "ten" }
val fDef: PartialFunction[Int, String] = { case n => s"$n" }
And how to apply them:
val t1 = (9 to 11) collect f0
t1 shouldBe Seq("ten")

val t2 = (9 to 11) map (f0 orElse fDef)
t2 shouldBe Seq("9", "ten", "11")
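A quick check of how the two pieces combine (a sketch, assuming the f0 and fDef definitions above):

f0.isDefinedAt(9)                 // false: f0 only matches 10
(f0 orElse fDef).isDefinedAt(9)   // true: fDef catches everything else
(f0 orElse fDef)(9)               // "9"
(f0 orElse fDef)(10)              // "ten"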
Here is my attempt:
case class A(val a: A, val b: Int) {
  override def toString() = b.toString
}

lazy val x: A = A(y, 0)
lazy val y: A = A(z, 1)
lazy val z: A = A(x, 2)
The problem comes when trying to do anything with x: evaluating x sets off a circular evaluation through x, y, and z that ends in a stack overflow. Is there a way of specifying that val a should be computed lazily?
You could use Stream like this:
lazy val stream: Stream[Int] = 0 #:: 1 #:: 2 #:: stream
stream.take(10).toList
// List(0, 1, 2, 0, 1, 2, 0, 1, 2, 0)
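On Scala 2.13 and later, where Stream is deprecated, LazyList works the same way:

lazy val stream: LazyList[Int] = 0 #:: 1 #:: 2 #:: stream
stream.take(10).toList
// List(0, 1, 2, 0, 1, 2, 0, 1, 2, 0)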
In general you should use call-by-name parameters:
class A(_a: => A, val b: Int) {
  lazy val a = _a
  override def toString() = s"A($b)"
}
Usage:
scala> :paste
// Entering paste mode (ctrl-D to finish)
lazy val x: A = new A(y, 0)
lazy val y: A = new A(z, 1)
lazy val z: A = new A(x, 2)
// Exiting paste mode, now interpreting.
x: A = <lazy>
y: A = <lazy>
z: A = <lazy>
scala> z.a.a.a.a.a
res0: A = A(1)
You need to make A.a itself lazy.
You can do it by turning it into a by-name parameter that is used to initialize a lazy field:
class A(a0: => A, val b: Int) {
  lazy val a = a0
  override def toString() = b.toString
}

object A {
  def apply(a0: => A, b: Int) = new A(a0, b)
}
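A usage sketch mirroring the original attempt (the cycle no longer overflows, because the constructor argument is not forced):

lazy val x: A = A(y, 0)
lazy val y: A = A(z, 1)
lazy val z: A = A(x, 2)

println(z.a)   // prints 0: z.a lazily evaluates x, without recursing any further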
You could also do the same using a helper class Lazy:
implicit class Lazy[T](getValue: => T) extends Proxy {
  def apply(): T = value
  lazy val value = getValue
  def self = value
}
It has the advantage that your code is pretty much unchanged, except for changing a: A into a: Lazy[A]:
case class A(val a: Lazy[A], val b: Int) {
  override def toString() = b.toString
}
Note that to access the actual value wrapped in Lazy, you can use either apply or value (as in x.a() or x.a.value).
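A hypothetical usage sketch, assuming the implicit conversion wraps the by-name argument for you:

lazy val x: A = A(y, 0)   // y is wrapped in Lazy without being forced
lazy val y: A = A(x, 1)

println(x.a.value)   // prints 1
println(x.a().b)     // 1: apply() is equivalent to .value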
You can define a lazy circular list using the Stream data type:
lazy val circular: Stream[Int] = 1 #:: 2 #:: 3 #:: circular
You can do the same trick on your own with by-name parameters:
class A(head: Int, tail: => A)
lazy val x = new A(0, y)
lazy val y = new A(1, z)
lazy val z = new A(2, x)
Note that this does not work with case classes.
You could use a by-name parameter.
class A(__a: => A, val b: Int) {
  def a = __a
  override def toString() = b.toString
}

object A {
  def apply(a: => A, b: Int) = new A(a, b)
}
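Note that def a = __a re-evaluates the by-name argument on each access; use lazy val a = __a (as in the answers above) if the result should be computed at most once.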
I'm looking to extend Iterator with a new method takeWhileInclusive, which will operate like takeWhile but also include the last element.
My issue is what the best practice is for extending Iterator to return a new iterator that is lazily evaluated. Coming from a C# background I would normally use IEnumerable and the yield keyword, but such an option doesn't appear to exist in Scala.
For example I could have
List(0,1,2,3,4,5,6,7).iterator.map(complexTimeConsumingAlgorithm).takeWhileInclusive(_ < 6)
So in this case takeWhileInclusive would only evaluate the predicate on values until it hits the first result that fails it, and that first failing result would be included.
So far I have:
object ImplicitIterator {
  implicit def extendIterator(i: Iterator[Any]) = new IteratorExtension(i)
}

class IteratorExtension[T <: Any](i: Iterator[T]) {
  def takeWhileInclusive(predicate: (T) => Boolean) = ?
}
You can use the span method of Iterator to do this pretty cleanly:
class IteratorExtension[A](i: Iterator[A]) {
  def takeWhileInclusive(p: A => Boolean) = {
    val (a, b) = i.span(p)
    a ++ (if (b.hasNext) Some(b.next) else None)
  }
}

object ImplicitIterator {
  implicit def extendIterator[A](i: Iterator[A]) = new IteratorExtension(i)
}
import ImplicitIterator._
Now (0 until 10).toIterator.takeWhileInclusive(_ < 4).toList gives List(0, 1, 2, 3, 4), for example.
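One caveat: span consumes the iterator it is called on, and its documentation warns against reusing the original afterwards; the usage here is safe because ++ is lazy and exhausts the leading iterator before touching the trailing one.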
This is one case where I find the mutable solution superior:
class InclusiveIterator[A](ia: Iterator[A]) {
  def takeWhileInclusive(p: A => Boolean) = {
    var done = false
    val p2 = (a: A) => !done && { if (!p(a)) done = true; true }
    ia.takeWhile(p2)
  }
}

implicit def iterator_can_include[A](ia: Iterator[A]) = new InclusiveIterator(ia)
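A quick check of the mutable variant (a sketch, assuming the implicit above is in scope):

Iterator(1, 2, 3, 7, 5).takeWhileInclusive(_ < 4).toList
// List(1, 2, 3, 7): 7 is the first failing element and is included; 5 is never reached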
The following requires scalaz to get fold on a tuple (A, B):
scala> implicit def Iterator_Is_TWI[A](itr: Iterator[A]) = new {
| def takeWhileIncl(p: A => Boolean)
| = itr span p fold (_ ++ _.toStream.headOption)
| }
Iterator_Is_TWI: [A](itr: Iterator[A])java.lang.Object{def takeWhileIncl(p: A => Boolean): Iterator[A]}
Here it is at work:
scala> List(1, 2, 3, 4, 5).iterator takeWhileIncl (_ < 4)
res0: Iterator[Int] = non-empty iterator
scala> res0.toList
res1: List[Int] = List(1, 2, 3, 4)
You can roll your own fold over a pair like this:
scala> implicit def Pair_Is_Foldable[A, B](pair: (A, B)) = new {
| def fold[C](f: (A, B) => C): C = f.tupled(pair)
| }
Pair_Is_Foldable: [A, B](pair: (A, B))java.lang.Object{def fold[C](f: (A, B) => C): C}
class IteratorExtension[T](i: Iterator[T]) {
  def takeWhileInclusive(predicate: (T) => Boolean) = new Iterator[T] {
    val it = i
    var isLastRead = false
    def hasNext = it.hasNext && !isLastRead
    def next = {
      val res = it.next
      isLastRead = !predicate(res)
      res
    }
  }
}
And there's an error in your implicit. Here it is fixed:
object ImplicitIterator {
  implicit def extendIterator[T](i: Iterator[T]) = new IteratorExtension(i)
}
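With that in scope, the example from the question behaves as asked (a quick sketch; the first failing element, 6, is included):

import ImplicitIterator._

List(0, 1, 2, 3, 4, 5, 6, 7).iterator.takeWhileInclusive(_ < 6).toList
// List(0, 1, 2, 3, 4, 5, 6)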
scala> List(0,1,2,3,4,5,6,7).toStream.filter (_ < 6).take(2)
res8: scala.collection.immutable.Stream[Int] = Stream(0, ?)
scala> res8.toList
res9: List[Int] = List(0, 1)
After your update:
scala> def timeConsumeDummy (n: Int): Int = {
| println ("Time flies like an arrow ...")
| n }
timeConsumeDummy: (n: Int)Int
scala> List(0,1,2,3,4,5,6,7).toStream.filter (x => timeConsumeDummy (x) < 6)
Time flies like an arrow ...
res14: scala.collection.immutable.Stream[Int] = Stream(0, ?)
scala> res14.take (4).toList
Time flies like an arrow ...
Time flies like an arrow ...
Time flies like an arrow ...
res15: List[Int] = List(0, 1, 2, 3)
timeConsumeDummy is called 4 times. Am I missing something?
I have this code:
val arr: Array[Int] = ...

val largestIndex = {
  var i = arr.length - 2
  while (arr(i) > arr(i+1)) i -= 1
  i
}

val smallestIndex = {
  var k = arr.length - 1
  while (arr(largestIndex) > arr(k)) k -= 1
  k
}
But there is too much code duplication. I tried to rewrite this with closures but failed. I tried something like this:
def index(sub: Int, f: => Boolean): Int = {
  var i = arr.length - sub
  while (f) i -= 1
  i
}

val largest = index(2, i => arr(i) > arr(i+1))
val smallest = index(1, i => arr(largest) > arr(i))
The problem is that I can't refer to the variable i of the method index() inside the closure. Is there a way to avoid this problem?
Declare f as a real function of the index, Int => Boolean, instead of a by-name Boolean, and pass i to it explicitly:

val arr = Array(1, 2, 4, 3, 3, 4, 5)

def index(sub: Int, f: Int => Boolean): Int = {
  var i = arr.length - sub
  while (f(i)) i -= 1
  i
}

val largest = index(2, i => arr(i) > arr(i+1))
val smallest = index(1, i => arr(largest) > arr(i))
scala> val arr = Array(1,2,4,3,3,4,5)
arr: Array[Int] = Array(1, 2, 4, 3, 3, 4, 5)
scala> arr.zipWithIndex.max(Ordering.by((x: (Int, Int)) => x._1))._2
res0: Int = 6
scala> arr.zipWithIndex.min(Ordering.by((x: (Int, Int)) => x._1))._2
res1: Int = 0
or
scala> val pairOrdering = Ordering.by((x: (Int, Int)) => x._1)
pairOrdering: scala.math.Ordering[(Int, Int)] = scala.math.Ordering$$anon$4#145ad3d
scala> arr.zipWithIndex.max(pairOrdering)._2
res2: Int = 6
scala> arr.zipWithIndex.min(pairOrdering)._2
res3: Int = 0
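The same thing reads a bit more directly with maxBy/minBy (a hypothetical follow-up, not part of the original session):

arr.zipWithIndex.maxBy(_._1)._2   // 6
arr.zipWithIndex.minBy(_._1)._2   // 0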