Akka cluster-sharding: Can Entry actors have dynamic props - scala

Akka Cluster Sharding looks like a good match for a use case I have: creating single instances of stateful, persistent actors across Akka nodes.
I'm not clear whether it is possible, though, to have an Entry actor type that requires arguments to construct it. Or maybe I need to reconsider how the Entry actor gets this information.
object Account {
  def apply(region: String, accountId: String): Props =
    Props(new Account(region, accountId))
}

class Account(val region: String, val accountId: String) extends Actor with PersistentActor { ... }
Whereas ClusterSharding.start takes a single Props instance for creating all Entry actors.
From the Akka cluster-sharding docs:
val counterRegion: ActorRef = ClusterSharding(system).start(
  typeName = "Counter",
  entryProps = Some(Props[Counter]),
  idExtractor = idExtractor,
  shardResolver = shardResolver)
It then resolves the Entry actor that receives the message based on how you define the idExtractor. From the shard source code you can see that it uses the id as the name of a given Entry actor instance:
def getEntry(id: EntryId): ActorRef = {
  val name = URLEncoder.encode(id, "utf-8")
  context.child(name).getOrElse {
    log.debug("Starting entry [{}] in shard [{}]", id, shardId)
    val a = context.watch(context.actorOf(entryProps, name))
    idByRef = idByRef.updated(a, id)
    refById = refById.updated(id, a)
    state = state.copy(state.entries + id)
    a
  }
}
It seems I should instead have my Entry actor derive its region and accountId from the name it is given, although this feels a bit hacky now that I'll be parsing them out of a string instead of receiving the values directly. Is this my best option?

I am in a very similar situation to yours. I don't have an exact answer, but I can share with you and the readers what I did, tried, and thought.
Option 1) As you mentioned, you can extract the id, shard and region information from how you name your actors, by parsing the path. The upside is that
a) it's fairly easy to do.
The downsides are that
a) Akka URL-encodes entry names using UTF-8, so if your separator is not a plain URL character (such as ||) you will need to URL-decode the name first. Note that "utf-8" is hard-coded inside Akka as the encoding; there is no function you can call to obtain it, so if Akka changes it tomorrow you'll have to adapt your code too.
b) the mapping between your data and the entry name is no longer a clean one (what you mean by "it feels kinda hacky"), which means you add the risk that your data may, one day, contain your separator string as meaningful content, and your system will mis-parse it.
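A sketch of this approach (the helper names and the ':' separator are mine, not part of the sharding API): pack both constructor arguments into the entry id, and parse them back inside the actor from self.path.name.

```scala
import java.net.{URLDecoder, URLEncoder}

// Hypothetical helpers: ':' is a separator assumed not to occur
// inside a region name or account id (validate this for your data!).
def mkEntryId(region: String, accountId: String): String =
  s"$region:$accountId"

// Inside the entry actor, call this on self.path.name; the shard
// URL-encoded the id with UTF-8, so decode it the same way first.
def parseEntryId(encodedName: String): (String, String) = {
  val raw = URLDecoder.decode(encodedName, "utf-8")
  val idx = raw.indexOf(':')
  (raw.substring(0, idx), raw.substring(idx + 1))
}

// The same encoding the shard applies when naming the entry actor.
val name = URLEncoder.encode(mkEntryId("eu-west", "acct-42"), "utf-8")
```

Keeping the encode/decode pair in one place mitigates (but does not remove) the separator-collision risk described above.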
Option 2) Sharding will spawn your actor if it doesn't exist, so you can have your code always send an init message, carrying the constructor parameters, to actors that haven't been initialized yet. Your sharded actors will then contain something like:
// must be a var so the Init message can fill it in
var par1: Option[Param1Type] = None

def receive = {
  case Init(par1value) => par1 = Some(par1value)
  case Query           => sender ! par1
}
And from your region access actor you can always send the Query message first, and then the Init message if the reply is None. This assumes that your region access actor does not maintain a list of the initialized actors; if it does, you can just spawn them with Init and then use them normally.
The upside is
a) It's elegant
b) it "feels" right
Downside: a) it takes 2x messages (if you don't maintain a list of initialized actors)
Option 3) THIS OPTION HAS BEEN TESTED AND DOESN'T WORK. I'll just leave it here for people to avoid wasting time trying the same.
I originally hadn't tested it myself, because I'm using this scenario in production with special constraints, where fancy stuff is not allowed ^_^
Basically, you start your region with
val counterRegion: ActorRef = ClusterSharding(system).start(
  typeName = "Counter",
  entryProps = Some(Props[Counter]),
  idExtractor = idExtractor,
  shardResolver = shardResolver)
What if, in your region creation actor, you do something like:
var providedPar1 = v1
def providePar1 = providedPar1

val counterRegion: ActorRef = ClusterSharding(system).start(
  typeName = "Counter",
  entryProps = Some(Props(classOf[Counter], providePar1)),
  idExtractor = idExtractor,
  shardResolver = shardResolver)
and then change the value of providedPar1 for each creation? The downside is that, in the event it worked, you would need to avoid changing providedPar1 until you were 100% sure the actor had been created, or you would risk it reading the new, wrong value (yay, race conditions!).
In general you are better off with option 2 IMHO, but in most scenarios the risks introduced by option 1 are small, and you can mitigate them properly given its simplicity (and performance) advantages.
Hope this rant helps!

Related

Akka: when is it safe to send a message

I am creating an actor via:
system.actorOf(Props(....))
or
system.actorOf(SmallestMailboxPool(instances).props(Props(....))).
I usually block the thread that called system.actorOf until the actorSelection resolves:
Await.result(system.actorSelection("/user/" + name).resolveOne(), timeout.duration)
I am wondering whether this is needed at all, or whether I can immediately start using the ActorRef and send (tell) messages to the actor/actor pool.
So the question boils down to this: if I have an ActorRef, does that mean the mailbox has already been created, or might messages sent immediately after calling system.actorOf get dropped?
If you drill down into the implementation of system.actorOf, you see a call to a method named makeChild. Internally, this uses a rather lengthy method called actorOf on the ActorRefProvider trait (implemented by LocalActorRefProvider), which initializes the child actor. The relevant parts are:
val props2 =
  // mailbox and dispatcher defined in deploy should override props
  (if (lookupDeploy) deployer.lookup(path) else deploy) match {
    case Some(d) ⇒
      (d.dispatcher, d.mailbox) match {
        case (Deploy.NoDispatcherGiven, Deploy.NoMailboxGiven) ⇒ props
        case (dsp, Deploy.NoMailboxGiven) ⇒ props.withDispatcher(dsp)
        case (Deploy.NoMailboxGiven, mbx) ⇒ props.withMailbox(mbx)
        case (dsp, mbx) ⇒ props.withDispatcher(dsp).withMailbox(mbx)
      }
    case _ ⇒ props // no deployment config found
  }
Or if a Router is explicitly provided:
val routerDispatcher = system.dispatchers.lookup(p.routerConfig.routerDispatcher)
val routerMailbox = system.mailboxes.getMailboxType(routerProps, routerDispatcher.configurator.config)
// routers use context.actorOf() to create the routees, which does not allow us to pass
// these through, but obtain them here for early verification
val routeeDispatcher = system.dispatchers.lookup(p.dispatcher)
val routeeMailbox = system.mailboxes.getMailboxType(routeeProps, routeeDispatcher.configurator.config)
new RoutedActorRef(system, routerProps, routerDispatcher, routerMailbox, routeeProps, supervisor, path).initialize(async)
Which means that once you get back an ActorRef, the mailbox has been initialized and you shouldn't be scared of sending it messages.
If you think about the semantics of what an ActorRef stands for, it would be a bit pointless to hand out an ActorRef that is partly initialized or not initialized at all. It would weaken the system's guarantees and make the programmer think twice before passing messages, which is the opposite of what the framework wants.

How to pass data from closure without repeating yourself

I'm using Play 2 with Anorm to manage database access. A common pattern I find myself doing is this:
val (futureChecklists, jobsLookup) =
  DB.withConnection { implicit connection =>
    val futureChecklists = futureChecklistRepository.getAllHavingActiveTemplateAndNonNullNextRunDate()
    val jobsLookup = futureChecklistJobRepository.getAllHavingActiveTemplateAndNonNullNextRunDate()
      .groupBy(_.futureChecklist.id)
      .withDefaultValue(List.empty)

    (futureChecklists, jobsLookup)
  }
Which seems kinda weird, because I have to repeat myself. It also gets a bit unruly if I have several variables that I need in the outer scope but don't want to keep the connection open for.
Is there an easy way to pass this information back without having to resort to using vars?
What I would like is something like:
val futureChecklists
val jobsLookup
DB.withConnection { implicit connection =>
futureChecklists = futureChecklistRepository.getAllHavingActiveTemplateAndNonNullNextRunDate()
jobsLookup = futureChecklistJobRepository.getAllHavingActiveTemplateAndNonNullNextRunDate()
.groupBy(_.futureChecklist.id)
.withDefaultValue(List.empty)
}
That way I don't have the same tuple at the beginning and end.
I'm afraid there is no easy way to avoid duplicating the tuple declaration, but var is definitely not the way around it.
You mention that it becomes unwieldy with multiple variables returned as a tuple. This can indeed become tricky and error prone, especially when you end up with large N-tuples of identical parameter types. In that scenario I would consider a dedicated container, i.e. a case class, where you reference variables by name rather than by position in the tuple. A side benefit is that you can assign the whole container to a variable and reference its fields in the natural way.
Last but not least, you don't say much about your particular use case, but it may be worth obtaining the two query results in separate withConnection blocks. If you are using any connection pooling mechanism, there is hardly any benefit to having them in the same block, and with separate blocks you might even gain the flexibility to parallelize the DB queries over separate connections.
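As a sketch of that last idea with standard-library futures (the getUsers/getPosts stubs stand in for the two repository calls; in real code each would wrap its own DB.withConnection block on a pooled connection):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-ins for the two repository calls, each of which would run
// in its own withConnection block on its own pooled connection.
def getUsers: List[String] = List("alice", "bob")
def getPosts: List[String] = List("post1")

// Kick both queries off before awaiting either, so they run in parallel.
val usersF = Future(getUsers)
val postsF = Future(getPosts)

val (users, posts) = Await.result(usersF.zip(postsF), 10.seconds)
```

This still destructures a tuple at the end, but the two queries no longer share one connection and can overlap in time.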
There are three ways that I came up with:
Return the tuple immediately
val (users, posts) =
  DB.withConnection { connection =>
    (connection.getUsers, connection.getPosts)
  }
I think this is OK for simple code and a small number of vals. For more complex code and more vals it can be error prone: someone can accidentally change the order of the tuple elements on just one side of the assignment and assign data to the wrong vals (which the compiler will only report if it also causes a type mismatch).
Use anonymous class
val dbResult =
  DB.withConnection { connection =>
    new {
      val users = connection.getUsers
      val posts = connection.getPosts
    }
  }
If you would rather have users and posts variables instead of dbResult.users and dbResult.posts, you can:
import dbResult._
This solution is a little exotic, but it works just fine and is quite clean.
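A self-contained sketch of the trick (the "queries" are stubbed with plain lists; note that member access on the anonymous class goes through a structural type, hence the reflectiveCalls language import):

```scala
import scala.language.reflectiveCalls

// Stand-in for the DB.withConnection block.
val dbResult = new {
  val users = List("alice", "bob")
  val posts = List("hello world")
}

// Bring the members into scope directly.
import dbResult._

val summary = s"${users.size} users, ${posts.size} posts"
```

After the import, `users` and `posts` read exactly like local vals, with no tuple ordering to get wrong.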
Use case class
First define case class for your return value:
case class DBResult(users: List[User], posts: List[Post])
and then use it:
val DBResult(users, posts) =
  DB.withConnection { connection =>
    DBResult(
      users = connection.getUsers,
      posts = connection.getPosts)
  }
This is best if you intend to reuse this case class multiple times.

Why does Akka application fail with Out-of-Memory Error while executing NLP task?

I notice that my program has a severe memory leak (the memory consumption spirals up). I had to parallelize this NLP task (using StanfordNLP EnglishPCFG Parser and Tregex Matcher). So I built a pipeline of actors (only 6 actors for each task):
val listOfTregexActors = (0 to 5).map(m => system.actorOf(Props(new TregexActor(timer, filePrinter)), "TregexActor" + m))
val listOfParsers = (0 to 5).map(n => system.actorOf(Props(new ParserActor(timer, listOfTregexActors(n), lp)), "ParserActor" + n))
val listOfSentenceSplitters = (0 to 5).map(j => system.actorOf(Props(new SentenceSplitterActor(listOfParsers(j), timer)), "SplitActor" + j))
My actors are pretty standard. They need to stay alive to process all the information (there is no PoisonPill along the way). Memory consumption goes up and up, and I don't have a single clue what's wrong. If I run single-threaded, memory consumption is fine. I read somewhere that if actors don't die, nothing inside them gets released. Should I release things manually?
There are two heavy-lifting actors:
https://github.com/windweller/parallelAkka/blob/master/src/main/scala/blogParallel/ParserActor.scala
https://github.com/windweller/parallelAkka/blob/master/src/main/scala/blogParallel/TregexActor.scala
I wonder if it could be Scala's closure or other mechanism that retains too much information, and GC can't collect it somehow.
Here's part of TregexActor:
def receive = {
  case Match(rows, sen) =>
    println("Entering Pattern matching: " + rows(0))
    val result = patternSearching(sen)
    filePrinter ! Print(rows :+ sen.toString, result)
}

def patternSearching(tree: Tree): List[Array[Int]] = {
  val statsFuture = search(patternFuture, tree)
  val statsPast = search(patternsPast, tree)
  List(statsFuture, statsPast)
}

def search(patterns: List[String], tree: Tree) = {
  val stats = Array.fill[Int](patterns.size)(0)
  for (i <- patterns.indices) {
    val searchPattern = TregexPattern.compile(patterns(i))
    val matcher = searchPattern.matcher(tree)
    if (matcher.find()) {
      stats(i) = stats(i) + 1
    }
    timer ! PatternAddOne
  }
  stats
}
Or, if my code checks out, could it be a memory leak in the StanfordNLP parser or the Tregex matcher? Is there a strategy to release memory manually, or do I need to kill those actors after a while and hand their mailbox tasks to new actors to release memory? (If so, how?)
After some struggling with profiling tools, I was finally able to use VisualVM with IntelliJ. Here are the snapshots: one of the VisualVM monitor, where GC never ran, and the other of the heap dump.
Summary of pipeline:
Raw Files -> SentenceSplit Actors (6) -> Parser Actors (6) -> Tregex Actors (6) -> File Output Actors (done)
Patterns are defined in Entry.scala file: https://github.com/windweller/parallelAkka/blob/master/src/main/scala/blogParallel/Entry.scala
This may not be the correct answer, but I don't have enough space to write it in a comment.
Try moving the actor creation inside a companion object:
val listOfTregexActors = (0 to 5).map(m => system.actorOf(Props(new TregexActor(timer, filePrinter)), "TregexActor" + m))
val listOfParsers = (0 to 5).map(n => system.actorOf(Props(new ParserActor(timer, listOfTregexActors(n), lp)), "ParserActor" + n))
val listOfSentenceSplitters = (0 to 5).map(j => system.actorOf(Props(new SentenceSplitterActor(listOfParsers(j), timer)), "SplitActor" + j))
OR don't use new to create your actors.
I suspect that when you create the actors you are closing over your App object, which is preventing the GC from collecting any garbage.
You can easily verify whether this is the issue by looking at the heap in VisualVM once you have made the change.
Also, how long does it take for you to run out of memory, and what max heap size are you giving your JVM?
EDIT
See - Creating Actors with Props here
A few other things to consider:
Make sure your actors are not dying and being restarted automatically.
Create your NLP objects outside your actors and pass them in when you create the actors.
Use an Akka router instead of your own hashing logic to distribute work between the actors.
I see these pieces of code in your actors:
val cleanedSentences = new java.util.ArrayList[java.util.List[HasWord]]()
Are these objects ever freed? If not, that might explain why memory goes up, especially since you return these newly created objects (e.g. in the cleanSentence method).
UPDATE: instead of creating a new object, you might try modifying the object you received and then flagging its availability in the response (instead of sending the new object back), though from the point of view of thread safety that can also be funky. Another option is external storage (e.g. a database or a Redis key-value store): put the resulting sentences there and reply with "sentence cleaned", so that the client can then fetch the sentence from that store.
In general, using mutable objects (like java.util.List) that might leak out of actors is not a good idea, so it may be worth redesigning your application to use immutable objects wherever possible.
E.g. your cleanSentence method would then look like:
def cleanSentence(sentences: List[HasWord]): List[HasWord] = {
  import TwitterRegex._
  sentences.filter { ref =>
    val word = ref.word() // do not call the same method several times
    !word.contains("#") &&
    !word.contains("@") && // the original listed "#" twice; "@" (mentions) was presumably intended
    !word.matches(searchPattern.toString())
  }
}
You can convert your java.util.List to a Scala List (before sending it to the actor) in the following manner:
import scala.collection.JavaConverters._

val javaList: java.util.List[HasWord] = ...
javaList.asScala.toList
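A runnable illustration with plain Strings standing in for HasWord (which comes from StanfordNLP):

```scala
import scala.collection.JavaConverters._

val javaList = new java.util.ArrayList[String]()
javaList.add("hello")
javaList.add("#tag")

// asScala gives a mutable Buffer view over the Java list;
// toList makes an immutable copy that is safe to send to an actor.
val scalaList: List[String] = javaList.asScala.toList

val cleaned = scalaList.filterNot(_.contains("#"))
```

The copy matters here: sending the immutable List avoids sharing the mutable Java collection between actors.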

Scala Actors: Returning a Future of a type other than Any.

I am working my way through a book on Scala actors, and I am running into a bit of a syntactic hang-up. In practice, I tend to write my variable and function definitions like this:
val v: String = "blahblahblah"
def f(n: Int): Int = n + 1
including the (return) type of the item after its name. While I know this is not necessary, I have grown comfortable with this convention and find that it makes the code easier for me to understand.
That being said, observe the below example:
class Server extends Actor {
  def act() = {
    while (true) {
      receive {
        case Message(string) => reply("Good, very good.")
      }
    }
  }
}

def sendMsg(m: Message, s: Server): Future[String] = {
  s !! m
}
The above code produces an error at compile time, complaining that the server returned a Future[Any] as opposed to a Future[String]. I understand that this problem can be circumvented by removing the return type from sendMsg:
def sendMsg(m: Message, s: Server) = s !! m
However, this is not consistent with my style. Is there a way to specify the type of Future that the server generates (as opposed to Future[Any])?
Your problem is a lot deeper than just style: you get a Future[Any] because the compiler cannot statically know better, with the current Akka actors as well as with the now deprecated scala.actors. In the absence of compile-time checks you need to resort to runtime checks instead, as idonnie already commented:
(actorRef ? m).mapTo[String]
This will chain another Future onto the original one, which is completed either with a String result, with a ClassCastException if the actor was naughty, or with a TimeoutException if the actor did not reply; see the Akka docs.
There may be a way out soon: I'm working on an Akka extension that adds statically typed channels, but it will require you to write your code a little differently, with more type annotations.

Parallel file processing in Scala

Suppose I need to process files in a given folder in parallel. In Java I would create a FolderReader thread to read file names from the folder and a pool of FileProcessor threads. FolderReader reads file names and submits the file processing function (Runnable) to the pool executor.
In Scala I see two options:
create a pool of FileProcessor actors and schedule a file processing function with Actors.Scheduler.
create an actor for each file name while reading the file names.
Does it make sense? What is the best option?
Depending on what you're doing, it may be as simple as
for (file <- files.par) {
  // process the file
}
I suggest with all my energy that you stay as far away from threads as you can. Luckily we have better abstractions that take care of what happens underneath, and in your case it seems to me that you do not need actors (though you could use them); you can use a simpler abstraction called Futures. They are part of the Akka open-source library, and I think they will become part of the Scala standard library as well.
A Future[T] is simply something that will return a T in the future.
All you need to run a future is an implicit ExecutionContext, which you can derive from a Java executor service. Then you can enjoy the elegant API and the fact that a future is a monad, to transform collections into collections of futures, collect the results and so on. I suggest you have a look at http://doc.akka.io/docs/akka/2.0.1/scala/futures.html
import java.util.concurrent.Executors
import akka.dispatch.{Await, ExecutionContext, Future}
import akka.util.duration._

object TestingFutures {
  val executorService = Executors.newFixedThreadPool(20)
  implicit val executorContext = ExecutionContext.fromExecutorService(executorService)

  def testFutures(myList: List[String]): List[String] = {
    val listOfFutures: Future[List[String]] = Future.traverse(myList) { aString =>
      Future {
        aString.reverse
      }
    }
    Await.result(listOfFutures, 1 minute)
  }
}
There's a lot going on here:
I am using Future.traverse, which takes as its first parameter an M[T] <: Traversable[T] and as its second parameter a T => Future[T] (a Function1[T, Future[T]] if you prefer), and returns a Future[M[T]].
I am using the Future.apply method to create an anonymous class of type Future[T].
There are many other reasons to look at Akka futures.
Futures can be mapped because they are monads, i.e. you can chain future executions:
Future { 3 }.map { _ * 2 }.map { _.toString }
Futures have callbacks: future.onComplete, onSuccess, onFailure, andThen, etc.
Futures support not only traverse but also for-comprehensions.
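For example, a for-comprehension over two futures (shown with scala.concurrent, where this API eventually landed; the Akka futures version is nearly identical):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val fa = Future(3)
val fb = Future(4)

// The for-comprehension desugars to map/flatMap: the yield runs
// once both futures have completed.
val sum: Future[Int] = for {
  a <- fa
  b <- fb
} yield a + b

val result = Await.result(sum, 10.seconds)
```

Note that fa and fb are started before the comprehension, so they run concurrently; declaring them inside the for block would serialize them.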
Ideally you should use two actors. One for reading the list of files, and one for actually reading the file.
You start the process by simply sending a single "start" message to the first actor. The actor can then read the list of files, and send a message to the second actor. The second actor then reads the file and processes the contents.
Having multiple actors, which might seem complicated, is actually a good thing, in the sense that you have a bunch of objects communicating with each other, like in a theoretical OO system.
Edit: you REALLY shouldn't be doing concurrent reads of a single file.
I was going to write up exactly what @Edmondo1984 did, except he beat me to it. :) I second his suggestion in a big way. I'll also suggest that you read the documentation for Akka 2.0.2. In addition, here is a slightly more concrete example:
import akka.dispatch.{ExecutionContext, Future, Await}
import akka.util.duration._
import java.util.concurrent.Executors
import java.io.File

val execService = Executors.newCachedThreadPool()
implicit val execContext = ExecutionContext.fromExecutorService(execService)

val tmp = new File("/tmp/")
val files = tmp.listFiles()

val workers = files.map { f =>
  Future {
    f.getAbsolutePath()
  }
}.toSeq

val result = Future.sequence(workers)

result.onSuccess {
  case filenames =>
    filenames.foreach { fn =>
      println(fn)
    }
}

// Artificial just to make things work for the example
Thread.sleep(100)
execContext.shutdown()
Here I use sequence instead of traverse, but the difference is going to depend on your needs.
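The difference can be sketched like this (with scala.concurrent futures; doubling the elements of a list stands in for the real per-file work):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val xs = List(1, 2, 3)

// traverse: start a future per element and collect, in one step
val traversed: Future[List[Int]] = Future.traverse(xs)(x => Future(x * 2))

// sequence: flip an already-built List[Future[_]] into a Future[List[_]]
val sequenced: Future[List[Int]] = Future.sequence(xs.map(x => Future(x * 2)))

val a = Await.result(traversed, 10.seconds)
val b = Await.result(sequenced, 10.seconds)
```

Use traverse when you are building the futures from values anyway; use sequence when something else already handed you the collection of futures.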
Go with the Future, my friend; the Actor is just a more painful approach in this instance.
But if you use actors, what's wrong with that?
Say we have to read from / write to some property files. Here is my example in Java, but still with Akka actors.
Let's say we have an actor, ActorFile, representing one file. Hmm... probably it cannot represent just one file (though it would be nice if it could), so it represents several files, like a PropertyFilesActor.
Why not use something like this:
public class PropertyFilesActor extends UntypedActor {

  Map<String, String> filesContent = new LinkedHashMap<String, String>();

  { // here we should use real files of course
    filesContent.put("file1.xml", "");
    filesContent.put("file2.xml", "");
  }

  @Override
  public void onReceive(Object message) throws Exception {
    if (message instanceof WriteMessage) {
      WriteMessage writeMessage = (WriteMessage) message;
      String content = filesContent.get(writeMessage.fileName);
      String newContent = content + writeMessage.stringToWrite;
      filesContent.put(writeMessage.fileName, newContent);
    }
    else if (message instanceof ReadMessage) {
      ReadMessage readMessage = (ReadMessage) message;
      String currentContent = filesContent.get(readMessage.fileName);
      // Send the current content back to the sender
      getSender().tell(new ReadMessage(readMessage.fileName, currentContent), getSelf());
    }
    else unhandled(message);
  }
}
...each message carries the file name as a parameter.
The actor has its own inbox, accepting messages like:
WriteLine(fileName, string)
ReadLine(fileName, string)
Those messages are stored in the inbox in order, one after another. The actor does its work by taking messages from the inbox, storing or reading the content, and meanwhile sending feedback to the sender (sender ! message).
So, say we write to the property file and then show its content on a web page: we can start rendering the page right after sending the message that stores the data, and as soon as we receive the feedback, update the relevant part of the page with the data from the freshly updated file (via AJAX).
Well, grab your files and stick them in a parallel collection:
scala> new java.io.File("/tmp").listFiles.par
res0: scala.collection.parallel.mutable.ParArray[java.io.File] = ParArray( ... )
Then...
scala> res0 map (_.length)
res1: scala.collection.parallel.mutable.ParArray[Long] = ParArray(4943, 1960, 4208, 103266, 363 ... )