Where to initialize a reusable object in a ParDo? - apache-beam

An example of my ParDo in my beam job (running with Dataflow runner):
class StreamEventToJsonConverter : DoFn<MyClass, String>() {

    @ProcessElement
    fun processElement(@Element element: MyClass, receiver: OutputReceiver<String>) {
        val gson = Gson()
        val jsonValue = gson.toJson(element)
        receiver.output(jsonValue)
    }
}
My question is: should I initialize the Gson object inside the processElement function?
Is it initialized only once per worker, or every time a new element enters the function (which seems like overkill)?
Note that the Gson object is not serializable.
Thank you.

It turns out I can do this via a DoFn.Setup method:
Annotation for the method to use to prepare an instance for processing bundles of elements.
This is a good place to initialize transient in-memory resources, such as network connections. The resources can then be disposed in DoFn.Teardown.
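For illustration, here is a minimal sketch of that approach, written in Scala against the Beam Java SDK to match the rest of this page (MyClass is assumed from the question; treat this as a shape, not a definitive implementation):

import com.google.gson.Gson
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.{Element, ProcessElement, Setup}

class StreamEventToJsonConverter extends DoFn[MyClass, String] {

  // Gson is not serializable, so it must not travel with the DoFn;
  // mark it transient and build it once per instance in @Setup instead.
  @transient private var gson: Gson = _

  @Setup
  def setup(): Unit = {
    gson = new Gson() // runs once per DoFn instance, not once per element
  }

  @ProcessElement
  def processElement(@Element element: MyClass,
                     receiver: DoFn.OutputReceiver[String]): Unit = {
    receiver.output(gson.toJson(element))
  }
}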

According to the documentation
A given DoFn instance generally gets invoked one or more times to process some arbitrary bundle of elements. However, Beam doesn't guarantee an exact number of invocations; it may be invoked multiple times on a given worker node to account for failures and retries. As such, you can cache information across multiple calls to your processing method, but if you do so, make sure the implementation does not depend on the number of invocations.
Based on this, it seems processElement runs for every element, so with your current code the Gson object will be initialized many times on a given worker node.

Related

Scala: Making a singleton to process client requests

Is it good practice to use a singleton object to process client requests?
Since an object instance is a singleton, if it is in the middle of processing one client request when another request arrives, the same instance is invoked with the new client's data. Won't that make things messy?
When using singleton objects we must ensure that everything inside them, as well as everything they call, is thread-safe. For example, javax.crypto.Cipher does not seem to be thread-safe, so it should probably not be called from a singleton. Consider how Guice uses @Singleton to signal a thread-safety intention:
@Singleton
public class InMemoryTransactionLog implements TransactionLog {
    /* everything here should be threadsafe! */
}
Also consider the example of the Play Framework, which, starting with version 2.4, began moving away from singleton controllers and encouraging class-based controllers.
It depends on whether the object holds any mutable data or not.
If the object is just a holder for pure functions and immutable state, it does not matter how many threads are using it at the same time because they can't affect each other via shared state.
If the object has mutable state then things can definitely go wrong if you access it from multiple threads without some kind of locking, either inside the object or externally.
So it is good practice as long as there is no mutable state. It is a good way of collecting related methods under the same namespace, or for creating a global function (by defining an apply method).
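As a hedged illustration of that distinction (the object names here are invented), compare a stateless singleton with one that holds unsynchronized mutable state:

// Safe: a pure function over immutable state; any number of threads
// can call escape concurrently without affecting each other.
object JsonEscaper {
  private val Replacements = Map('"' -> "\\\"", '\\' -> "\\\\")
  def escape(s: String): String =
    s.flatMap(c => Replacements.getOrElse(c, c.toString))
}

// Unsafe: unsynchronized mutable state shared by every caller.
// Two threads calling next() at the same time can observe the same value.
object RequestCounter {
  private var count = 0L
  def next(): Long = { count += 1; count } // read-modify-write race
}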

Actor accessing things out of scope

I'm using the Akka libraries.
What happens when multiple actors call a function on an object? Would this block other actors from accessing the object?
The reason I ask is that I want to use JBCrypt with Akka actors. Since we can encrypt multiple strings concurrently, I have each actor calling JBcrypt.hash(...). I am not sure how this works because I think that, in Scala, an object exists in one place, and I feel like multiple actors using the same object (library) might prevent the concurrency from actually happening.
Multiple actors calling a function on an object that calls a library will not block unless the library being called uses concurrency-control mechanisms such as synchronized, ThreadLocal, or an object lock.
For example, calling print on the below Printer object will block:
class BlockingPrinter {
  def print(s: String) = synchronized { s }
}
object Printer {
  val printer = new BlockingPrinter()
  def print(str: String) = printer.print(str)
}
But calling it on the Printer object below will not:
class NonBlockingPrinter {
  def print(s: String) = s
}
object Printer {
  val printer = new NonBlockingPrinter()
  def print(str: String) = printer.print(str)
}
In summary, it is the library you are calling that decides how concurrency is handled, not the fact that you are calling an object.
It depends on how the function is implemented. If the function accesses some internal mutable state and tries to synchronize in order to achieve thread safety, then there is a problem. If it is a pure function that does not access any external state, then it is safe. If the function does have mutable state, it must at least confine that state to itself.

Spark Object (singleton) serialization on executors

I am not sure that what I want to achieve is possible. What I do know is that I am accessing a singleton object from an executor to ensure its constructor is called only once on each executor. This pattern is already proven and works as expected for similar use cases in my code base.
However, what I would like to know is whether I can ship the object after it has been initialized on the driver. In this scenario,
when accessing ExecutorAccessedObject.y, ideally it would not call the println but just return the value. This is a highly simplified version; in reality, I would like to make a call to some external system on the driver, so that when the object is accessed on the executor, it does not re-call that external system. I am OK with @transient lazy val x being reinitialized once on the executors, as it will hold a connection pool, which cannot be serialized.
object ExecutorAccessedObject extends Serializable {
  @transient lazy val x: Int = {
    println("OK with initializing this on the executor, e.g. a database connection pool")
    1
  }
  val y: Int = {
    // call some external system to return a value.
    // I do not want to call the external system from the executor
    println(
      """
        |Ideally, this would not be printed on the executor.
        |Return value 1 without re-initializing.
      """.stripMargin)
    1
  }
  println("The constructor will be initialized once on each executor")
}
someRdd.mapPartitions { part =>
  ExecutorAccessedObject   // force initialization
  ExecutorAccessedObject.x // first access should re-evaluate
  ExecutorAccessedObject.y // ideally, never re-evaluates and just returns 1
  part
}
I attempted to solve this with broadcast variables as well, but I am unsure how to access the broadcast variable within the singleton object.
What I would like to know is if I can ship the object after it has been initialized on the driver.
You cannot. Objects, as singletons, are never shipped to executors. They are initialized locally, whenever the object is accessed for the first time.
If the result of the call is serializable, just pass it along, either as an argument to ExecutorAccessedObject (implicitly or explicitly) or by making ExecutorAccessedObject mutable (and adding the required synchronization).
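A hedged sketch of that second option, keeping the question's names (setY is a hypothetical setter invented here, and the external call is stood in by a constant):

object ExecutorAccessedObject extends Serializable {
  @transient lazy val x: Int = 1 // cheap to rebuild once per executor JVM

  // Hypothetical mutable slot, set once per JVM from a value that was
  // computed on the driver and captured in the closure.
  @volatile private var yValue: Option[Int] = None
  def setY(v: Int): Unit = synchronized {
    if (yValue.isEmpty) yValue = Some(v)
  }
  def y: Int = yValue.getOrElse(sys.error("y not initialized on this JVM"))
}

val driverY: Int = 1 // stand-in for the external-system call, driver side only
someRdd.mapPartitions { part =>
  ExecutorAccessedObject.setY(driverY) // driverY rode along with the closure
  ExecutorAccessedObject.x
  ExecutorAccessedObject.y // returns 1 without calling the external system
  part
}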

How to init a val outside the object in Scala?

A Redis cluster client should be shared in many places, am I right? After some googling, I use a RedisCli object:
object RedisCli {
  val jedisClusterNodes = new java.util.HashSet[HostAndPort]()
  jedisClusterNodes.add(new HostAndPort("192.168.1.100", 6379))
  lazy val jedisCluster = new JedisCluster(jedisClusterNodes)
  // ...methods using jedisCluster
}
The problem is: how can I initialize the jedisCluster from outside the object? I want to initialize the HostAndPort in the main method of another object, getting the IP from a properties file passed on the command line. Should I just use a class RedisCli in my circumstance?
I think I am totally lost between class and object.
In Scala, all members of a singleton object have to be defined by the object itself. While you are allowed to modify var members from the outside, take a step back and ask yourself: what is the point of having a singleton object in your case if each client can modify its members? You will only end up with spaghetti code.
I would highly recommend using a dependency injection framework (Spring for example) where you can create beans in a specific place then inject them where you need them.
In a nutshell, singleton objects should be used when you want to define methods and values (I have never seen a case where a var is used) that are not specific to individual instances of a class (think Java static). In your case you seem to want different instances (otherwise, why would they be set from client code?) but want a certain instance to be shared across different clients, and this is exactly what dependency injection allows you to do.
If you don't want to use a DI framework and are okay with having clients modify your instances as they please, then simply use a class as opposed to an object. When you use the class keyword, different instances can be instantiated.
class RedisCli(val ip: String, val port: Int) {
  val hostAndPort: HostAndPort = new HostAndPort(ip, port)
  // etc...
}
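For instance, a construction sketch (the property keys and the args(0) convention are assumptions for illustration):

object Main {
  def main(args: Array[String]): Unit = {
    // Load the properties file whose path is passed on the command line.
    val props = new java.util.Properties()
    val in = new java.io.FileInputStream(args(0))
    try props.load(in) finally in.close()

    val client = new RedisCli(
      ip = props.getProperty("redis.host"),
      port = props.getProperty("redis.port").toInt)
    // pass `client` to whatever needs it instead of reaching for a global
  }
}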
Hope this helps.

Scala folding using Akka

I implemented in Java what I call a "foldable queue", i.e., a LinkedBlockingQueue used by an ExecutorService. The idea is that each task has a unique id; if a task with that id is already in the queue when another task is submitted with the same id, the new task is not added to the queue. The Java code looks like this:
public final class FoldablePricingQueue extends LinkedBlockingQueue<Runnable> {
    @Override
    public boolean offer(final Runnable runnable) {
        if (contains(runnable)) {
            return true; // rejected, but true not to throw an exception
        } else {
            return super.offer(runnable);
        }
    }
}
Threads have to be pre-started, but this is a minor detail. I have an abstract class that implements Runnable and takes a unique id... this is the one passed in.
I would like to implement the same logic using Scala and Akka (Actors).
I would need to have access to the mailbox, and I think I would need to override the ! method and check the mailbox for the event... has anyone done this before?
This is exactly how the Akka mailbox already works: a given mailbox can only exist once in the dispatcher's task queue.
Look at:
https://github.com/jboner/akka/blob/master/akka-actor/src/main/scala/akka/dispatch/Dispatcher.scala#L143
https://github.com/jboner/akka/blob/master/akka-actor/src/main/scala/akka/dispatch/Dispatcher.scala#L198
Very cheaply implemented using an atomic boolean, so no need to traverse the queue.
Also, by the way, your Queue in Java is broken since it doesn't override put, add or offer(E, long, TimeUnit).
Maybe you could do that with two actors: a facade and a worker. Clients send jobs to the facade. The facade forwards them to the worker and remembers them in its internal state, a Set of queuedJobs. When it receives a job that is already queued, it just discards it. Each time the worker starts processing a job (or completes it, whichever suits you), it sends a StartingOn(job) message to the facade, which removes that job from queuedJobs. A rough sketch follows.
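Here is one hedged way to write that design with classic Akka actors (the Job and StartingOn message types and all names are invented for illustration):

import akka.actor.{Actor, Props}

case class Job(id: String, work: () => Unit)
case class StartingOn(job: Job)

// Worker: tells whoever sent the job (the facade) that it is starting,
// then runs the job.
class Worker extends Actor {
  def receive = {
    case job: Job =>
      sender() ! StartingOn(job)
      job.work()
  }
}

// Facade: remembers queued job ids and silently drops duplicates,
// which is the "folding" behaviour of the original Java queue.
class Facade extends Actor {
  private val worker = context.actorOf(Props[Worker](), "worker")
  private var queuedJobs = Set.empty[String]

  def receive = {
    case job: Job if queuedJobs(job.id) =>
      // same id already queued: fold the duplicate away
    case job: Job =>
      queuedJobs += job.id
      worker ! job
    case StartingOn(job) =>
      queuedJobs -= job.id
  }
}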
The proposed design doesn't make sense. The closest thing to a Runnable would be an Actor. Sure, you can keep actors in a list and not add one if it is already there. Such lists are kept by routing actors, which can be created from the ready-made parts provided by Akka, or built from a basic actor using the forward method.
You can't look into another actor's mailbox, and overriding ! makes no sense. What you do is send all your messages to a routing actor, and that routing actor forwards them to the proper destination.
Naturally, since it receives these messages, it can do any logic at that point.