Simultaneous access to a utility function in Scala

I have a simple utility function defined on an object, like the one below, which reads an immutable data structure held by the object. Can many threads, or in this case many Akka actors, call this method simultaneously?
object Util {
  private val names = Set("alice", "bob")
  def isName(s: String) = names.contains(s.toLowerCase)
}
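For concreteness, here is a minimal sketch of the access pattern being asked about, using classic Akka actors; the actor, system name, and messages are made up for illustration and are not part of the original code:

import akka.actor.{Actor, ActorSystem, Props}

// Util as defined in the question.
object Util {
  private val names = Set("alice", "bob")
  def isName(s: String) = names.contains(s.toLowerCase)
}

// Hypothetical actor: many instances may call Util.isName at the same time.
class NameChecker extends Actor {
  def receive = {
    case s: String => sender() ! Util.isName(s)
  }
}

object Demo extends App {
  val system = ActorSystem("demo")
  // Several actors share the single Util object concurrently.
  val checkers = (1 to 4).map(i => system.actorOf(Props[NameChecker], s"checker-$i"))
  checkers.foreach(_ ! "Alice")
}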

Related

Writing a Custom Class to HDFS in Apache Flink

I am trying to get familiar with the semantics of Flink after having started with Spark. I would like to write a DataSet[IndexNode] to persistent storage in HDFS so that it can be read later by another process. Spark has a simple ObjectFile API that provides such functionality, but I cannot find a similar option in Flink.
case class IndexNode(vec: Vector[IndexNode],
                     id: Int) extends Serializable {
  // Getters and setters etc. here
}
The built-in sinks tend to serialize the instance based on the toString method, which is not suitable here due to the nested structure of the class. I imagine the solution is to use a FileOutputFormat and provide a translation of the instances to a byte stream. However, I am not sure how to serialize the vector, which is of arbitrary length and can be many levels deep.
You can achieve this by using SerializedOutputFormat and SerializedInputFormat.
Try the following steps:
Make IndexNode extend the IOReadableWritable interface from Flink. Mark unserializable fields @transient. Implement the write(DataOutputView out) and read(DataInputView in) methods. The write method writes out all of the data from IndexNode, and the read method reads it back and rebuilds the internal data fields. For example, instead of serializing all of the data in the arr field of the Result class below, I write out the underlying values and then read them back and rebuild the array in the read method.
import org.apache.flink.core.io.IOReadableWritable
import org.apache.flink.core.memory.{DataInputView, DataOutputView}

class Result(var name: String, var count: Int) extends IOReadableWritable {
  // Derived data: rebuilt in read() instead of being serialized.
  @transient
  var arr = Array(count, count)

  // Flink needs a no-argument constructor to instantiate the type when reading.
  def this() = this("", 1)

  override def write(out: DataOutputView): Unit = {
    out.writeInt(count)
    out.writeUTF(name)
  }

  override def read(in: DataInputView): Unit = {
    count = in.readInt()
    name = in.readUTF()
    arr = Array(count, count)
  }

  override def toString: String = s"$name, $count, ${arr.mkString(", ")}"
}
Write out data with
myDataSet.write(new SerializedOutputFormat[Result], "/tmp/test")
and read it back with
env.readFile(new SerializedInputFormat[Result], "/tmp/test")
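Putting the pieces together, here is a minimal end-to-end sketch, assuming Result implements IOReadableWritable as above; the object name, job name, and path are illustrative, and in practice the write and the read would usually live in separate jobs sharing the same classpath:

import org.apache.flink.api.scala._
import org.apache.flink.api.common.io.{SerializedInputFormat, SerializedOutputFormat}

object SerializedRoundTrip {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Write a DataSet[Result] in Flink's binary record format.
    val results: DataSet[Result] = env.fromElements(new Result("alice", 1), new Result("bob", 2))
    results.write(new SerializedOutputFormat[Result], "/tmp/test")
    env.execute("write results")

    // Read the records back, typically from a later job.
    val restored: DataSet[Result] = env.readFile(new SerializedInputFormat[Result], "/tmp/test")
    restored.print()
  }
}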

Understanding companion objects in Scala

While learning Scala, I came across the interesting concept of the companion object. A companion object can be used to define static-like methods in Scala. I need a few clarifications on the Spark Scala code below with regard to the companion object.
class BballStatCounter extends Serializable {
  val stats: StatCounter = new StatCounter()
  var missing: Long = 0

  def add(x: Double): BballStatCounter = {
    if (x.isNaN) {
      missing += 1
    } else {
      stats.merge(x)
    }
    this
  }
}

object BballStatCounter extends Serializable {
  def apply(x: Double) = new BballStatCounter().add(x)
}
The above code is invoked using val stat3 = stats1.map(b => BballStatCounter(b)).
What is the nature of the variables stats and missing declared in the class? Are they similar to class attributes in Python?
What is the significance of the apply method here?
Here stats and missing are class attributes, and each instance of BballStatCounter will have its own copy of them, just as in Python.
In Scala the method apply serves a special purpose: if an object has an apply method and that object is used with function-call notation, like Obj(), then the compiler replaces the call with its apply method, i.e. Obj.apply().
The apply method is generally used as a factory or constructor in a class's companion object.
All the collection classes in Scala have a companion object with an apply method, which is why you can create a list like List(1, 2, 3, 4).
Thus, in your code above, BballStatCounter(b) will be compiled to BballStatCounter.apply(b).
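As a small illustration of that desugaring (the Point class is made up for the example, e.g. to try in the REPL):

class Point(val x: Int, val y: Int)

object Point {
  // Companion apply acts as a factory, so Point(1, 2) works without 'new'.
  def apply(x: Int, y: Int): Point = new Point(x, y)
}

val p = Point(1, 2)       // the compiler rewrites this to Point.apply(1, 2)
val q = Point.apply(1, 2) // equivalent explicit call
val xs = List(1, 2, 3, 4) // same mechanism: List.apply(1, 2, 3, 4)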
stats and missing are members of the class BballStatCounter. stats is a val, so it cannot be changed once it has been defined. missing is a var, so it is more like a traditional variable and can be updated, as it is in the add method. Every instance of BballStatCounter will have these members. (Unlike Python, you can't add or remove members from a Scala object.)
The apply method is a shortcut that makes objects look like functions. If you have an object x with an apply method, you write x(...) and the compiler will automatically convert this to x.apply(...). In this case it means that you can call BballStatCounter(1.0) and this will call the apply method on the BballStatCounter object.
Neither of these questions is really about companion objects; this is just the normal Scala class machinery.
Please note the remarks in the comments about asking multiple questions.

Scala - how to access an object from a class?

I want to create a Map with information about sales in one class (an object) and then use it in another class, ProcessSales: iterate over the Map keys and use the values. I have already written the logic that creates the Map in an object SalesData.
However, since I started learning Scala not long ago, I am not sure whether implementing the logic in an object is a good approach.
What would be the correct way to access the Map from another class?
Should the Map be created in an object or in a separate class? Or maybe it's better to create an object inside the ProcessSales class that will be using it?
Could you share best practices and examples?
import java.io.InputStream

object SalesData {
  val stream: InputStream = getClass.getResourceAsStream("/sales.csv")
  val salesIterator: Iterator[String] = scala.io.Source.fromInputStream(stream).getLines

  def getSales(salesData: Iterator[String]): Map[Int, String] = {
    salesData
      .map(_.split(","))
      .map(line => (line(0).toInt, line(1)))
      .toMap
  }

  val salesMap: Map[Int, String] = getSales(salesIterator)
}
If you wanted the flexibility to "mix in" this map, you could put the map and getSales() into a new trait.
If, on the other hand, you wanted to ensure that one and only one factory method exists to create the map, you could put getSales() into a companion object, which has to have the same name as your class and be defined in the same source file. For example,
object ProcessSales {
  def getSales(): Map[Int, String] = {...}
}
Remember that methods in a companion object are analogous to static methods in Java.
It is also possible to put the map instance itself into the companion object, if you want the map to be a singleton: one map instance shared by many instances of ProcessSales.
Or, if you want 1 such map per each instance of ProcessSales, you would make it a field within the ProcessSales class.
Or, if you wanted the map to be available to all members of a class hierarchy under ProcessSales, you could make ProcessSales an abstract class. But regarding use of an abstract class, remember that use of a trait affords greater flexibility in case you are not certain that all subclasses in the hierarchy will need the map.
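To make the first two options concrete, here is a small sketch; the trait name and the process method are made up for illustration:

// Option 1: mix the map-building logic in wherever it is needed.
trait SalesLookup {
  def getSales(salesData: Iterator[String]): Map[Int, String] =
    salesData
      .map(_.split(","))
      .map(line => (line(0).toInt, line(1)))
      .toMap
}

class ProcessSales extends SalesLookup {
  def process(lines: Iterator[String]): Unit =
    getSales(lines).foreach { case (id, name) => println(s"$id -> $name") }
}

// Option 2: keep a single shared map in the companion object,
// reusing the SalesData object from the question.
object ProcessSales {
  lazy val salesMap: Map[Int, String] = SalesData.salesMap
}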
It all depends on how you want to use it. Scala is more functionally oriented, so as a good practice you can define getSales in an object and, from another object, pass the parameters in and call it.
For example,
import java.io.InputStream
import packagename.SalesData._ // adjust to the actual package of SalesData

object Check {
  val stream: InputStream = getClass.getResourceAsStream("/sales.csv")
  val salesIterator: Iterator[String] = scala.io.Source.fromInputStream(stream).getLines
  val salesMap = getSales(salesIterator)
}

Akka: using Akka Typed to implement the Active Object pattern

The Akka Typed Actors documentation states that it will be superseded by Akka Typed. I am inferring from this that Akka Typed can be used to implement the Active Object pattern; but it is not too clear to me how. Here is my attempt so far; I'm aware it stinks :D
object HelloWorld {
  final case class Greet(whom: String, replyTo: ActorRef[Greeted])
  final case class Greeted(whom: String)

  private val greeter = Static[Greet] { msg ⇒
    println(s"Hello ${msg.whom}!")
    msg.replyTo ! Greeted(msg.whom)
  }

  private val system = ActorSystem("HelloWorld", Props(greeter))

  def greet(whom: String): Future[Greeted] = system ? (Greet(whom, _))
}
Cheers
The Active Object Pattern as defined by the page you link to is not desirable for all the reasons why TypedActors are being removed: the fact that a method is executed asynchronously is so important that it should not be hidden by technologies like proxy objects that implement normal interfaces. Instead, Akka Typed allows you to write nearly the same code as if it was an Active Object while retaining the asynchronous marker: instead of selecting a method with the . syntax you send a message using ? (or ! if the protocol is not simple request–response). Your example would look like this:
object HelloWorld {
  // replyTo lives in a second parameter list, so it is excluded from the
  // case class's equals/hashCode; 'val' makes it accessible as a field.
  final case class Greet(whom: String)(val replyTo: ActorRef[Greeted])
  final case class Greeted(whom: String)

  val greeter = Static[Greet] { msg ⇒
    println(s"Hello ${msg.whom}!")
    msg.replyTo ! Greeted(msg.whom)
  }
}

object Sample extends App {
  import HelloWorld._

  val system = ActorSystem("HelloWorld", Props(greeter))
  val fg = system ? Greet("John")
}
Please note that creating a separate thread (or ActorSystem) per object may sound okay as per the classical pattern, but doing that foregoes many of the benefits of a message-driven architecture, namely that many Actors can share the same resources for more efficient execution and they can form supervision hierarchies for principled failure handling etc.

In Spark, what is the right way to have a static object on all workers?

I've been looking at the documentation for Spark and it mentions this:
Spark’s API relies heavily on passing functions in the driver program
to run on the cluster. There are two recommended ways to do this:
Anonymous function syntax, which can be used for short pieces of code.
Static methods in a global singleton object. For example, you can
define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions { def func1(s: String): String = { ... } }
myRdd.map(MyFunctions.func1)
Note that while it is also possible to
pass a reference to a method in a class instance (as opposed to a
singleton object), this requires sending the object that contains that
class along with the method. For example, consider:
class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass and call doStuff on it, the map inside there
references the func1 method of that MyClass instance, so the whole
object needs to be sent to the cluster. It is similar to writing
rdd.map(x => this.func1(x)).
Now my question is: what happens if you have attributes on the singleton object (which are supposed to be equivalent to static members)? The same example with a small alteration:
object MyClass {
  val value = 1
  def func1(s: String): String = { s + value }
}
myRdd.map(MyClass.func1)
So the function is still referenced statically, but how far does Spark go in trying to serialize all referenced variables? Will it serialize value, or will it be initialized again on the remote workers?
Additionally, this is all in the context that I have some heavy models inside a singleton object and I would like to find the correct way to serialize them to workers while keeping the ability to reference them from the singleton everywhere, instead of passing them around as function parameters across a pretty deep function call stack.
Any in-depth information on what/how/when does Spark serialize things would be appreciated.
This is less a question about Spark and more of a question of how Scala generates code. Remember that a Scala object is pretty much a Java class full of static methods. Consider a simple example like this:
object foo {
  val value = 42

  def func(i: Int): Int = i + value

  def main(args: Array[String]): Unit = {
    println(Seq(1, 2, 3).map(func).sum)
  }
}
That will be translated to 3 Java classes; one of them will be the closure that is a parameter to the map method. Using javap on that class yields something like this:
public final class foo$$anonfun$main$1 extends scala.runtime.AbstractFunction1$mcII$sp implements scala.Serializable {
  public static final long serialVersionUID;
  public final int apply(int);
  public int apply$mcII$sp(int);
  public final java.lang.Object apply(java.lang.Object);
  public foo$$anonfun$main$1();
}
Note there are no fields or anything. If you look at the disassembled bytecode, all it does is call the func() method. When running in Spark, this is the instance that will get serialized; since it has no fields, there's not much to be serialized.
As for your question about how to initialize static objects: you can have an idempotent initialization function that you call at the start of your closures. The first call will trigger the initialization; subsequent calls will be no-ops. Cleanup, though, is a lot trickier, since I'm not familiar with an API that does something like "run this code on all executors".
One approach that can be useful if you need cleanup is explained in this blog, in the "setup() and cleanup()" section.
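A sketch of one common variant of that idea, using a lazy val in a singleton; HeavyModel and loadModel are hypothetical names, and the map stands in for whatever heavy state needs to live on each executor:

object HeavyModel {
  // Initialized lazily, at most once per JVM (i.e. once per executor),
  // on first access; it is not shipped from the driver with the closure.
  lazy val model: Map[String, Double] = loadModel()

  private def loadModel(): Map[String, Double] = {
    // Placeholder for expensive initialization (e.g. reading from HDFS).
    Map("weight" -> 1.0)
  }

  def score(s: String): Double = model.getOrElse(s, 0.0)
}

// Usage: the closure only references the singleton's static accessor,
// so each executor builds the model locally on first use:
// myRdd.map(HeavyModel.score)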
EDIT: just for clarification, here's the disassembly of the method that actually makes the call.
public int apply$mcII$sp(int);
  Code:
    0: getstatic     #29; //Field foo$.MODULE$:Lfoo$;
    3: iload_1
    4: invokevirtual #32; //Method foo$.func:(I)I
    7: ireturn
See how it just references the static field holding the singleton and calls the func() method.