I have an actor-based system in which I am reading an external file sitting in an S3 bucket and sending each of the file's lines over to another actor that processes that particular line. What I am having trouble understanding is what happens when an exception is thrown while reading the file.
My code is as follows:
import akka.actor._

class FileWorker(processorWorker: ActorRef) extends Actor with ActorLogging {

  val fileUtils = new S3Utils()

  private def processFile(fileLocation: String): Unit = {
    try {
      fileUtils.getLinesFromLocation(fileLocation).foreach { r =>
        //Some processing happens for the line
      }
    } catch {
      case e: Exception =>
        log.error("Issue processing files from the following location %s".format(fileLocation))
    }
  }

  def receive = {
    case fileLocation: String =>
      processFile(fileLocation)
  }
}
In my S3Utils class I have defined the getLinesFromLocation method as follows:
def getLinesFromLocation(fileLocation: String): Iterator[String] = {
  try {
    for {
      fileEntry <- getFileInfo(root, fileLocation)
    } yield fileEntry
  } catch {
    case e: Exception =>
      logger.error("Issue with file location %s: %s".format(fileLocation, e.getStackTraceString))
      throw e
  }
}
The file is actually read in the private method getFileInfo:

private def getFileInfo(rootBucket: String, fileLocation: String): Iterator[String] = {
  implicit val codec = Codec(Codec.UTF8)
  codec.onMalformedInput(CodingErrorAction.IGNORE)
  codec.onUnmappableCharacter(CodingErrorAction.IGNORE)
  Source.fromInputStream(s3Client.getObject(rootBucket, fileLocation).getObjectContent()).getLines
}
I have written the above pieces with the assumption that the underlying file sitting on S3 will not be cached anywhere, and that I will simply be iterating through the individual lines in constant space and processing them. If there is an issue with reading a particular line, the iterator should move on without affecting the Actor.
My first question would be: is my understanding of iterators correct? Am I actually reading the lines from the underlying file system (in this case the S3 bucket) without putting pressure on memory or introducing memory leaks?
The next question would be: if the iterator encounters an error while reading an individual entry, is the entire iteration killed, or does it move on to the next entry?
My last question would be, is my file-processing logic written correctly?
It would be great to get some insights into this.
Thanks
It looks like Amazon S3 has no async implementation, so we are stuck with blocking reads and pinned actors. Your implementation is correct, provided you allocate a thread per connection, do not block on input, and do not use too many connections.
Important steps to take:
1) processFile should not block the current thread. Preferably it should delegate its input to another actor:
private def processFile(fileLocation: String): Unit = {
...
fileUtils.getLinesFromLocation(fileLocation).foreach { r =>
lineWorker ! FileLine(fileLocation, r)
}
...
}
2) Make FileWorker a pinned actor:
## in application.conf:
my-pinned-dispatcher {
executor = "thread-pool-executor"
type = PinnedDispatcher
}
// in the code:
val fileWorker = context.actorOf(Props(classOf[FileWorker], lineWorker).withDispatcher("my-pinned-dispatcher"), "FileWorker")
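A pinned dispatcher dedicates a single thread to the actor, so the blocking S3 reads happen on that thread and cannot starve the shared default dispatcher.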
if the iterator encounters an error while reading an individual entry, is the entire iteration killed?
Yes, the entire iteration will be killed, and the actor will take the next job from its mailbox.
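If you would rather skip a single unreadable entry than abort the whole file, a minimal sketch (reusing the lineWorker/FileLine hand-off from step 1 above) is to pull from the iterator manually and guard each next() with Try. Be aware that if the underlying stream itself is broken, hasNext may keep failing too, so this only helps with per-entry problems:

import scala.util.{Failure, Success, Try}

// Hedged sketch: skip individual bad entries instead of killing the iteration.
// Assumes fileUtils, lineWorker, FileLine and log from the snippets above.
private def processFileTolerantly(fileLocation: String): Unit = {
  val lines = fileUtils.getLinesFromLocation(fileLocation)
  while (lines.hasNext) {
    Try(lines.next()) match {
      case Success(line) => lineWorker ! FileLine(fileLocation, line)
      case Failure(e)    => log.error("Skipping unreadable entry in %s: %s".format(fileLocation, e.getMessage))
    }
  }
}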
Information: You might want to skip directly to Edit 4 after the introduction.
Recently I wrote a simple scala server app. The app mostly wrote incoming data to the database or retrieved it. This is my first scala-akka application.
When I deployed the app to my server it failed after about a day. I realized from the simple statistics provided by DigitalOcean that the CPU usage was rising in a linear fashion when I requested data from the server. If I didn't request anything from the server, the CPU usage stayed roughly constant but never fell back from its previous level.
I connected the app to VisualVM and saw that the number of threads is either constant if I don't do anything with the app (I), or grows in a sawtooth fashion if I send requests to the server (II).
There is an obvious correlation here between the number of threads and the CPU usage, which makes sense.
When I checked the threads tab, I saw that most threads are default-akka.actor.default-dispatcher threads. They also don't seem to be doing much.
What could cause this sort of problem? How do I solve it?
Regarding Edit 4: I think I found the source of the problem, but I still don't understand why it happens or how I should solve it.
PS: I must admit that the screenshots are not from the application that failed. I don't have any from the original failure. However, the only difference between this program and the one that failed is that in application.conf I added:
actor {
default-dispatcher {
fork-join-executor {
# Setting this to 1 instead of 3 seems to improve performance.
parallelism-factor = 2.0
parallelism-max = 24
task-peeking-mode = FIFO
}
}
}
This seems to have slowed the rate at which the number of threads rises, but it didn't solve the problem.
Edit: Fragment of WriterActor usage (RestApi)
trait RestApi extends CassandraCluster {
import models._
import cassandraDB.{WriterActor, ReaderActor}
implicit def system: ActorSystem
implicit def materializer: ActorMaterializer
implicit def ec: ExecutionContext
implicit val timeout = Timeout(20 seconds)
val cassandraWriterWorker = system.actorOf(Props(new WriterActor(cluster)), "cassandra-writer-actor")
val cassandraReaderWorker = system.actorOf(Props(new ReaderActor(cluster)), "cassandra-reader-actor")
...
def cassandraReaderCall(message: Any): ToResponseMarshallable = message match {
//...
case message: GetActiveContactsByPhoneNumber => (cassandraReaderWorker ? message)(2 seconds).mapTo[Vector[String]].map(result => Json.obj("active_contacts" -> result))
case _ => StatusCodes.BadRequest
}
def confirmedWriterCall(message: Any) = {
(cassandraWriterWorker ? message).mapTo[Boolean].map(result => result)
}
val apiKeyStringV1 = "test123"
val route =
...
path("contacts") {
parameter('apikey ! apiKeyStringV1) {
post {
entity(as[Contacts]){ contact: Contacts =>
cassandraWriterWorker ! contact
complete(StatusCodes.OK)
}
} ~
get {
parameter('phonenumber) { phoneNumber: String =>
complete(cassandraReaderCall(GetActiveContactsByPhoneNumber(phoneNumber)))
}
}
}
} ~
path("log" / "gps") {
parameter('apikey ! apiKeyStringV1) {
(post & entity(as[GpsLog])) { gpsLog =>
cassandraWriterWorker ! gpsLog
complete(StatusCodes.OK)
}
}
}
}
}
}
Edit 2: Writer Worker relevant code.
I didn't post all the methods since they are all basically the same. But here you can find the whole file
import java.util.UUID
import akka.actor.Actor
import com.datastax.driver.core.Cluster
import models._
class WriterActor(cluster: Cluster) extends Actor{
import scala.collection.JavaConversions._
val session = cluster.connect(Keyspaces.akkaCassandra)
// ... other inserts
// The prepared statement needs a name distinct from the insert method below,
// otherwise the class does not compile.
val insertGpsLogStatement = session.prepare("INSERT INTO gps_logs(id, phone_number, lat, long, time) VALUES (?,?,?,?,?);")
// ...
def insertGpsLog(phoneNumber: String, locWithTime: LocationWithTime): Unit =
  session.executeAsync(insertGpsLogStatement.bind(UUID.randomUUID().toString, phoneNumber, new java.lang.Double(locWithTime.location.lat),
    new java.lang.Double(locWithTime.location.long), new java.lang.Long(locWithTime.time)))
def receive: Receive = {
// ...
case gpsLog: GpsLog => gpsLog.locationWithTimeLog.foreach(locWithTime => insertGpsLog(gpsLog.phoneNumber, locWithTime))
}
}
Edit 3: Misdiagnosis of excessive thread use.
I'm afraid I misdiagnosed the origin of the problem. Later on, I had added a request for data after the write and forgotten about it. When I removed it, the number of threads stopped growing. So this is the most likely place where the mistake was made. I updated the trait where the ReaderActor is used and also added the relevant code of the ReaderActor below.
object ReaderActor {
// ...
case class GetActiveContactsByPhoneNumber(phoneNumber: String)
}
class ReaderActor(cluster: Cluster) extends Actor {
import models._
import ReaderActor._
import akka.pattern.pipe
import scala.collection.JavaConversions._
import cassandra.resultset._
import context.dispatcher
val session = cluster.connect(Keyspaces.akkaCassandra)
def buildActiveContactByPhoneNumberResponse(r: Row): String = {
val phoneNumber = r.getString(ContactsKeys.phoneNumber)
return phoneNumber
}
def buildSubSelectContactsList(r: Row): java.util.List[String] = {
val phoneNumber = r.getSet(ContactsKeys.contacts, classOf[String])
return phoneNumber.toList
}
def receive: Receive = {
//...
case GetActiveContactsByPhoneNumber(phoneNumber: String) =>
val subQuery = QueryBuilder.select(ContactsKeys.contacts).
from(Keyspaces.akkaCassandra, ColumnFamilies.contact).
where(QueryBuilder.eq(ContactsKeys.phoneNumber, phoneNumber))
def queryActiveUsers(phoneNumbers: java.util.List[String]) = QueryBuilder.select(ContactsKeys.phoneNumber).
from(Keyspaces.akkaCassandra, ColumnFamilies.contact).
where(QueryBuilder.in(ContactsKeys.phoneNumber, phoneNumbers))
session.execute(subQuery) map { (row: Row) =>
  session.executeAsync(queryActiveUsers(buildSubSelectContactsList(row)))
    .map(_.all().map(buildActiveContactByPhoneNumberResponse).toVector) pipeTo sender
}
//...
}
}
Edit 4
I ran the code locally, controlling all the requests. When there are no requests the number of running threads oscillates around a certain number, but doesn't tend to go up or down.
I made a variety of requests to see what would change.
The image posted below shows several states.
I - no requests yet. number of threads 44-45
II - after a request to the ReaderActor. number of threads 46-47
III - after a request to the ReaderActor. number of threads 48-49
IV - after a request to the ReaderActor. number of threads 50-51
V - after a request to the WriterActor. number of threads 51-52 (but no problem, notice a daemon thread was started)
VI - after a request to the WriterActor. number of threads 51-52 (constant)
VII - after a request to the ReaderActor (but a different resource then the first three). number of threads 53-54
So what happens is that every time we read from the database (regardless of how many executeAsync calls are used), 2 extra threads are created. The only difference between the read and the write calls is that one uses the ask pattern and the other doesn't. I checked this by changing the route from:
get {
parameter('phonenumber) { phoneNumber: String =>
complete(cassandraReaderCall(GetActiveContactsByPhoneNumber(phoneNumber)))
}
}
to
get {
parameter('phonenumber) { phoneNumber: String =>
cassandraReaderWorker ! GetActiveContactsByPhoneNumber(phoneNumber)
complete(StatusCodes.OK)
}
}
Obviously I am not getting any results now, but those extra threads are not spawned either.
So the answer seems to lie in the ask pattern.
I hope somebody can explain why this happens and how to solve it.
I am trying to continuously read the wikipedia IRC channel using this lib: https://github.com/implydata/wikiticker
I created a custom Akka Publisher, which will be used in my system as a Source.
Here are some of my classes:
import akka.stream.actor.ActorPublisher
import akka.stream.actor.ActorPublisherMessage.{Cancel, Request}
import IrcPublisher.Publish

class IrcPublisher() extends ActorPublisher[String] {
  import scala.collection._

  var queue: mutable.Queue[String] = mutable.Queue()
override def receive: Actor.Receive = {
case Publish(s) =>
println(s"->MSG, isActive = $isActive, totalDemand = $totalDemand")
queue.enqueue(s)
publishIfNeeded()
case Request(cnt) =>
println("Request: " + cnt)
publishIfNeeded()
case Cancel =>
println("Cancel")
context.stop(self)
case _ =>
println("Hm...")
}
def publishIfNeeded(): Unit = {
while (queue.nonEmpty && isActive && totalDemand > 0) {
println("onNext")
onNext(queue.dequeue())
}
}
}
object IrcPublisher {
case class Publish(data: String)
}
I am creating all these objects like so:
def createSource(wikipedias: Seq[String]) {
val dataPublisherRef = system.actorOf(Props[IrcPublisher])
val dataPublisher = ActorPublisher[String](dataPublisherRef)
val listener = new MessageListener {
override def process(message: Message) = {
dataPublisherRef ! Publish(Jackson.generate(message.toMap))
}
}
val ticker = new IrcTicker(
"irc.wikimedia.org",
"imply",
wikipedias map (x => s"#$x.wikipedia"),
Seq(listener)
)
ticker.start() // if I comment this...
Thread.currentThread().join() //... and this I get Request(...)
Source.fromPublisher(dataPublisher)
}
So the problem I am facing is with this Source object. Although this implementation works well with other sources (for example, from a local file), the ActorPublisher doesn't receive Request() messages.
If I comment out the two marked lines, I can see that my actor receives the Request(count) message from my flow. Otherwise, all messages are pushed into the queue but never into my flow (so I can see the MSG messages printed).
I think it's something to do with multithreading/synchronization here.
I am not familiar enough with wikiticker to solve your problem as given. One question I would have is: why is it necessary to join to the current thread?
However, I think you have overcomplicated the usage of Source. It would be easier for you to work with the stream as a whole rather than create a custom ActorPublisher.
You can use Source.actorRef to materialize a stream into an ActorRef and work with that ActorRef. This allows you to utilize akka code to do the enqueueing/dequeueing onto the buffer while you can focus on the "business logic".
Say, for example, your entire stream is only to filter lines above a certain length and print them to the console. This could be accomplished with:
def dispatchIRCMessages(actorRef : ActorRef) = {
val ticker =
new IrcTicker("irc.wikimedia.org",
"imply",
wikipedias map (x => s"#$x.wikipedia"),
Seq(new MessageListener {
override def process(message: Message) =
actorRef ! Jackson.generate(message.toMap) // the stream below is a Source of String, so send the String itself
}))
ticker.start()
Thread.currentThread().join()
}
//these variables control the buffer behavior
val bufferSize = 1024
val overFlowStrategy = akka.stream.OverflowStrategy.dropHead
val minMessageSize = 32
//no need for a custom Publisher/Queue
val streamRef =
Source.actorRef[String](bufferSize, overFlowStrategy)
.via(Flow[String].filter(_.size > minMessageSize))
.to(Sink.foreach[String](println))
.run()
dispatchIRCMessages(streamRef)
dispatchIRCMessages has the added benefit that it will work with any ActorRef, so you aren't required to work only with streams/publishers.
Hopefully this solves your underlying problem...
I think the main problem is Thread.currentThread().join(). This line will hang the current thread, because the thread is waiting for itself to die. Please read https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#join-long- .
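For what it's worth, here is a minimal sketch of an alternative, assuming Akka 2.4+ where ActorSystem exposes whenTerminated: park the main thread on the system's termination future instead of joining the current thread, so the JVM stays alive while the stream runs but can still shut down cleanly.

import akka.actor.ActorSystem
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Hedged sketch: block until the ActorSystem terminates, instead of
// Thread.currentThread().join(), which can never return.
def awaitSystemTermination(system: ActorSystem): Unit =
  Await.ready(system.whenTerminated, Duration.Inf)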
Problem Statement
Assume I have a file with sentences that is processed line by line. In my case, I need to extract Named Entities (Persons, Organizations, ...) from these lines. Unfortunately, the tagger is quite slow. Therefore, I decided to parallelize the computation, such that lines could be processed independent from each other and the result is collected in a central location.
Current Approach
My current approach uses a single-producer, multiple-consumer concept. I'm relatively new to Akka, but I think my problem description fits well into its capabilities. Let me show you some code:
Producer
The Producer reads the file line by line and sends each line to the Consumers. When the number of processed lines reaches the total line count, it propagates the result back to its master, WordCount.
class Producer(consumers: ActorRef) extends Actor with ActorLogging {
var master: Option[ActorRef] = None
var result = immutable.List[String]()
var totalLines = 0
var linesProcessed = 0
override def receive = {
case StartProcessing() => {
master = Some(sender)
Source.fromFile("sent.txt", "utf-8").getLines.foreach { line =>
consumers ! Sentence(line)
totalLines += 1
}
context.stop(self)
}
case SentenceProcessed(list) => {
linesProcessed += 1
result :::= list
//If we are done, we can propagate the result to the creator
if (linesProcessed == totalLines) {
master.map(_ ! result)
}
}
case _ => log.error("message not recognized")
}
}
Consumer
class Consumer extends Actor with ActorLogging {
def tokenize(line: String): Seq[String] = {
line.split(" ").map(_.toLowerCase)
}
override def receive = {
case Sentence(sent) => {
//Assume: This is representative for the extensive computation method
val tokens = tokenize(sent)
sender() ! SentenceProcessed(tokens.toList)
}
case _ => log.error("message not recognized")
}
}
WordCount (Master)
class WordCount extends Actor {
val consumers = context.actorOf(Props[Consumer].
withRouter(FromConfig()).
withDispatcher("consumer-dispatcher"), "consumers")
val producer = context.actorOf(Props(new Producer(consumers)), "producer")
context.watch(consumers)
context.watch(producer)
def receive = {
case Terminated(`producer`) => consumers ! Broadcast(PoisonPill)
case Terminated(`consumers`) => context.system.shutdown
}
}
object WordCount {
def getActor() = new WordCount
def getConfig(routerType: String, dispatcherType: String)(numConsumers: Int) = s"""
akka.actor.deployment {
/WordCount/consumers {
router = $routerType
nr-of-instances = $numConsumers
dispatcher = consumer-dispatcher
}
}
consumer-dispatcher {
type = $dispatcherType
executor = "fork-join-executor"
}"""
}
The WordCount actor is responsible for creating the other actors. When the Consumers are finished, the Producer sends a message with all tokens. But how do I propagate that message further up, and how do I accept it and wait for it? The architecture with the third WordCount actor might be wrong.
Main Routine
case class Run(name: String, actor: () => Actor, config: (Int) => String)
object Main extends App {
val run = Run("push_implementation", WordCount.getActor _, WordCount.getConfig("balancing-pool", "Dispatcher") _)
def execute(run: Run, numConsumers: Int) = {
val config = ConfigFactory.parseString(run.config(numConsumers))
val system = ActorSystem("Counting", ConfigFactory.load(config))
val startTime = System.currentTimeMillis
system.actorOf(Props(run.actor()), "WordCount")
/*
How to get the result here?!
*/
system.awaitTermination
System.currentTimeMillis - startTime
}
execute(run, 4)
}
Problem
As you can see, the actual problem is propagating the result back to the Main routine. Can you tell me how to do this in a proper way? The question is also how to wait for the result until the consumers are finished. I had a brief look into the Akka Futures documentation section, but the whole system is a little bit overwhelming for beginners. Something like val future = actor ? message seems suitable, but I'm not sure how to do this. Also, using the WordCount actor causes additional complexity. Maybe it is possible to come up with a solution that doesn't need this actor?
Consider using the Akka Aggregator Pattern. That takes care of the low-level primitives (watching actors, poison pill, etc). You can focus on managing state.
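A rough, untested sketch of how that might look here, assuming the akka.contrib.pattern.Aggregator trait from the separate akka-contrib module; ExpectTokens is a hypothetical message invented for this example, and SentenceProcessed is the message from your Consumer:

import akka.actor.{Actor, ActorRef}
import akka.contrib.pattern.Aggregator

// Hypothetical message carrying the expected line count and the requester.
case class ExpectTokens(totalLines: Int, requester: ActorRef)

class TokenAggregator extends Actor with Aggregator {
  var results = List.empty[String]
  var seen = 0

  expectOnce {
    case ExpectTokens(total, requester) =>
      // Register a handler for every SentenceProcessed until all lines are in.
      expect {
        case SentenceProcessed(list) =>
          results :::= list
          seen += 1
          if (seen == total) {
            requester ! results
            context.stop(self)
          }
      }
  }
}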
Your call to system.actorOf() returns an ActorRef, but you're not using it. You should ask that actor for results. Something like this:
implicit val timeout = Timeout(5 seconds)
val wCount = system.actorOf(Props(run.actor()), "WordCount")
val answer = Await.result(wCount ? "sent.txt", timeout.duration)
This means your WordCount class needs a receive method that accepts a String message. That section of code should aggregate the results and tell the sender(), like this:
class WordCount extends Actor {
def receive: Receive = {
case filename: String =>
// do all of your code here, using filename
sender() ! results
}
}
Also, rather than blocking on the results with Await above, you can apply some techniques for handling Futures.
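For example, a minimal non-blocking sketch, assuming the same wCount actor and implicit timeout as above, and that WordCount replies with the List[String] the Producer accumulated:

import akka.pattern.ask
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Hedged sketch: react to the result when it arrives instead of blocking with Await.
(wCount ? "sent.txt").mapTo[List[String]].onComplete {
  case Success(tokens) => println(s"Processed ${tokens.size} tokens")
  case Failure(ex)     => println(s"Word count failed: ${ex.getMessage}")
}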
My current application is based on Akka 1.1. It has multiple ProjectAnalysisActors, each responsible for handling analysis tasks for a specific project. The analysis is started when such an actor receives a generic start message. After finishing one step it sends itself a message with the next step, as long as one is defined. The executing code basically looks as follows:
sealed trait AnalysisEvent {
def run(project: Project): Future[Any]
def nextStep: AnalysisEvent = null
}
case class StartAnalysis() extends AnalysisEvent {
override def run ...
override def nextStep: AnalysisEvent = new FirstStep
}
case class FirstStep() extends AnalysisEvent {
override def run ...
override def nextStep: AnalysisEvent = new SecondStep
}
case class SecondStep() extends AnalysisEvent {
...
}
class ProjectAnalysisActor(project: Project) extends Actor {
def receive = {
case event: AnalysisEvent =>
val future = event.run(project)
future.onComplete { f =>
self ! event.nextStep
}
}
}
I have some difficulties with how to implement the run methods for each analysis step. At the moment I create a new Future within each run method. Inside this Future I send all follow-up messages into the different subsystems. Some of them are non-blocking fire-and-forget messages, but some of them return a result which should be stored before the next analysis step is started.
At the moment a typical run-method looks as follows
def run(project: Project): Future[Any] = {
Future {
progressActor ! typicalFireAndForget(project.name)
val calcResult = (calcActor1 !! doCalcMessage(project)).getOrElse(...)
val p: Project = ... // created updated project using calcResult
val result = (storage !! updateProjectInformation(p)).getOrElse(...)
result
}
}
Since those blocking messages should be avoided, I'm wondering if this is the right way. Does it make sense to use them in this use case, or should I still avoid them? If so, what would be a proper solution?
Apparently the only purpose of the ProjectAnalysisActor is to chain future calls. Second, the run methods also seem to wait on results to continue the computation.
So I think you can try refactoring your code to use Future Composition, as explained here: http://akka.io/docs/akka/1.1/scala/futures.html
def run(project: Project): Future[Any] = {
progressActor ! typicalFireAndForget(project.name)
for(
calcResult <- calcActor1 !!! doCalcMessage(project);
p = ... // created updated project using calcResult
result <- storage !!! updateProjectInformation(p)
) yield (
result
)
}
I start two remote actors on one host which just echo whatever is sent to them. I then create another actor which sends some number of messages (using !!) to both actors and keeps a List of Future objects holding the replies from these actors. Then I loop over this List, fetching the result of each Future. The problem is that most of the time some futures never return, even though the actor claims it has sent the reply. The problem happens randomly; sometimes it gets through the whole list, but most of the time it gets stuck at some point and hangs indefinitely.
Here is some code which produces the problem on my machine:
Sink.scala:
import scala.actors.Actor
import scala.actors.Actor._
import scala.actors.Exit
import scala.actors.remote.RemoteActor
import scala.actors.remote.RemoteActor._
object Sink {
def main(args: Array[String]): Unit = {
new RemoteSink("node03-0",43001).start()
new RemoteSink("node03-1",43001).start()
}
}
class RemoteSink(name: String, port: Int) extends Actor
{
def act() {
println(name+" starts")
trapExit=true
alive(port)
register(Symbol(name),self)
loop {
react {
case Exit(from,reason) =>{
exit()
}
case msg => reply{
println(name+" sending reply to: "+msg)
msg+" back at you from "+name
}
}
}
}
}
Source.scala:
import scala.actors.Actor
import scala.actors.Actor._
import scala.actors.remote.Node;
import scala.actors.remote.RemoteActor
import scala.actors.remote.RemoteActor._
object Source {
def main(args: Array[String]):Unit = {
val peer = Node("127.0.0.1", 43001)
val source = new RemoteSource(peer)
source.start()
}
}
class RemoteSource(peer: Node) extends Actor
{
def act() {
trapExit=true
alive(43001)
register(Symbol("source"),self)
val sinks = List(select(peer,Symbol("node03-0"))
,select(peer,Symbol("node03-1"))
)
sinks.foreach(link)
val futures = for(sink <- sinks; i <- 0 to 20) yield sink !! "hello "+i
futures.foreach( f => println(f()))
exit()
}
}
What am I doing wrong?
I'm guessing your problem is due to this line:
futures.foreach( f => println(f()))
in which you loop through all your futures and block on each in turn, waiting for its result. Blocking on futures is generally a bad idea and should be avoided. What you want to do instead is specify an action to carry out when the future's result is available. Try this:
futures.foreach(f => f.foreach(r => println(r)))
Here's an alternate way to say that with a for comprehension:
for (future <- futures; result <- future) { println(result) }
This blog entry is an excellent primer on the problem of blocking on futures and how monadic futures overcome it.
I have seen a similar case as well. When the code inside the thread throws certain types of exceptions and exits, the corresponding future.get never returns. One can try raising a java.lang.Error versus a java.lang.NoSuchMethodError; the future corresponding to the latter never returns.
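For illustration, a hedged sketch with the old scala.actors API showing why this can happen: the reply value is computed before reply() runs, so an Error thrown there kills the actor without ever fulfilling the future.

import scala.actors.Actor._

// Hedged sketch: an actor whose reply computation throws an Error never fulfils
// the future, so the blocking f() call below would hang indefinitely.
val crashy = actor {
  react {
    case msg => reply {
      throw new NoSuchMethodError("simulated failure while building the reply")
    }
  }
}

val f = crashy !! "hello"
// f() // would block forever: the Error killed the actor before reply() completed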