Scala program using futures is not terminating

I am trying to learn concurrency in Scala and I am using Scala futures to generate a dataset of random strings. I want to create an application that generates a file with any number of records and that scales well.
Code:
import java.io.FileWriter
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Random, Success}

object datacreator {
  implicit val ec: ExecutionContext = new ExecutionContext {
    val threadPool: ExecutorService = Executors.newFixedThreadPool(4)
    def execute(runnable: Runnable): Unit = {
      threadPool.submit(runnable)
    }
    def reportFailure(t: Throwable): Unit = {}
  }

  def getRecord: String = {
    "Random string"
  }

  def main(args: Array[String]): Unit = {
    val filename = args(0)
    val number_of_records = args(1)
    val file_Object = new FileWriter(filename, true)

    val data: Future[Iterable[String]] = Future {
      for (i <- 1 to number_of_records.toInt)
        yield getRecord
    }

    val result = data.map { result =>
      result.foreach(record => file_Object.write(record))
    }

    result.onComplete {
      case Success(value) =>
        println("Success")
        file_Object.close()
      case Failure(e) => e.printStackTrace()
    }
  }
}
I am facing the following issues:
When I run the program using sbt it writes the results to the file but never terminates, as if it were running indefinitely:
[info] Loading project definition from /Users/cw0155/PersonalProjects/datagen/project
[info] Loading settings for project datagen from build.sbt ...
[info] Set current project to datagenerator (in build file:/Users/cw0155/PersonalProjects/datagen/)
[info] running com.generator.DataGenerator xyz.csv 100
Success
| => datagen / Compile / runMain 255s
When I run the program from the jar as:
scala -cp target/scala-2.13/datagenerator_2.13-0.1.jar com.generator.DataGenerator "pqr.csv" "1000"
It waits indefinitely and never writes to the file.
Any help is much appreciated :)

Try this version
bar.scala
import scala.concurrent.{Await, Future, ExecutionContext}
import scala.concurrent.duration._
import scala.util.{Success, Failure}
import ExecutionContext.Implicits.global
import java.io.FileWriter

object bar {
  def getRecord: String = "Random string\n"

  def main(args: Array[String]): Unit = {
    val filename = args(0)
    val number_of_records = args(1)

    val data: Future[Iterable[String]] = Future {
      for (i <- 1 to number_of_records.toInt)
        yield getRecord
    }

    val file_Object = new FileWriter(filename, true)
    val result = data.map(r => r.foreach(record => file_Object.write(record)))

    result.onComplete {
      case Success(value) =>
        println("Success")
        file_Object.close()
      case Failure(e) =>
        e.printStackTrace()
    }

    Await.result(result, 10.seconds)
  }
}
Your original version gave me the expected output when I ran it like so
bash-3.2$ scala bar.scala /dev/fd/1 10
Success
Random string
Random string
Random string
Random string
Random string
Random string
Random string
Random string
Random string
Random string
However, without the Await.result your program can exit before the future finishes.
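If you keep the custom ExecutionContext from the question there is a second detail: Executors.newFixedThreadPool creates non-daemon threads, so the JVM stays alive until the pool is shut down. A minimal sketch of that extra step, assuming you lift the threadPool val out of the anonymous ExecutionContext so that main can reach it:
// After awaiting the result, release the pool's non-daemon threads so the JVM can exit.
Await.result(result, 10.seconds)
threadPool.shutdown()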

Related

Spark Job Stuck and Never Finish Executing

I am trying to do some analysis on a dataset using Spark. I am using sbt with Scala 2.12.8 on my local machine (16 GB of RAM).
The issue is that the Spark job never finishes executing when I use a transformation followed by an action; if I use a transformation only or an action only, it executes within seconds.
Code:
package wikipedia

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.io.Source
import org.apache.spark.rdd.RDD

case class WikipediaArticle2(title: String, text: String) {
  /**
   * @return Whether the text of this article mentions `lang` or not
   * @param lang Language to look for (e.g. "Scala")
   */
  def mentionsLanguage(lang: String): Boolean = text.split(' ').contains(lang)
}

object WikipediaData2 {
  def lines: List[String] = {
    Option(getClass.getResourceAsStream("/wikipedia/wikipedia.dat")) match {
      case None => sys.error("Please download the dataset as explained in the assignment instructions")
      case Some(resource) => Source.fromInputStream(resource).getLines().toList
    }
  }

  def parse(line: String): WikipediaArticle2 = {
    val subs = "</title><text>"
    val i = line.indexOf(subs)
    val title = line.substring(14, i)
    val text = line.substring(i + subs.length, line.length - 16)
    WikipediaArticle2(title, text)
  }
}

object WikipediaTest {
  val langs = List(
    "JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
    "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy")

  val conf: SparkConf = new SparkConf()
    .setMaster("local[*]") // tried local[4] 4 cores
    .setAppName("Wikis Most Popular Languages")
  val sc: SparkContext = new SparkContext(conf)
  sc.setLogLevel("ERROR")

  // Hint: use a combination of `sc.parallelize`, `WikipediaData.lines` and `WikipediaData.parse`
  // val wikiRdd: RDD[WikipediaArticle] = sc.parallelize(WikipediaData.lines.map(WikipediaData.parse)).cache()
  val wikiRdd: RDD[WikipediaArticle2] = sc.parallelize(WikipediaData2.lines).map(WikipediaData2.parse)

  def main(args: Array[String]): Unit = {
    // wikiRdd.count() // returned 4086
    // wikiRdd.take(2) // result within less than second
    // wikiRdd.filter(a => a.mentionsLanguage("Java")).take(2) // execution stuck here
    // wikiRdd.filter(a => a.mentionsLanguage("Perl")).collect() // execution stuck here
    sc.stop()
  }
}
On the terminal I execute the code like this:
$ sbt
$ console
scala> import wikipedia.WikipediaTest._
scala> wikiRdd.count() // returned 4086
scala> wikiRdd.take(2) // result within less than second
scala> wikiRdd.filter(a => a.mentionsLanguage("Java")).take(2)
This code snippet also gets the job to freeze:
wikiRdd.filter(a => a.mentionsLanguage("Perl")).collect()
Spark Job UI
I have been stumped for hours and I have no idea what's wrong.
Any suggestions, folks?

Error handling in list of scala futures - Apache Spark

I am having issues handling exceptions in a List of Scala futures. I am calling the getQC_report(qcArgsThread, spark) method within the runner method, which processes an input file and saves it to a Hive table. Code below:
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
import scala.concurrent._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

val spark = SparkSession.builder.master("yarn").enableHiveSupport().getOrCreate()
val sc = spark.sparkContext // sc was undefined in the snippet; bind it to the session's SparkContext

var argsList: List[Array[String]] = List[Array[String]]()
for (ip_file <- INPUT_FILE.asScala.toList) {
  var qcArgs: Array[String] = null
  qcArgs = Array("input_file", ip_file,
    "hiveDB", hiveDB,
    "Outputhive_table", Outputhive_table)
  argsList = qcArgs :: argsList
}

var pool = 0
def poolId = {
  pool = pool + 1
  pool
}

def runner(qcArgsThread: Array[String]) = Future {
  sc.setLocalProperty("spark.scheduler.pool", poolId.toString)
  getQC_report(qcArgsThread, spark)
}

val futures = argsList map (i => runner(i))

futures foreach (f => Await.ready(f, Duration.Inf))

futures.onComplete {
  case Success(x) => {
    println(s"\nresult = $x")
  }
  case Failure(e) => {
    System.err.println("Failure happened!")
    System.err.println(e.getMessage)
  }
}
I am getting an error on the futures.onComplete line.
Error - Cannot resolve symbol onComplete.
Please help me improve the code, as I am new to using Scala futures. Thanks!
The short answer is that because argsList is a List[Array[String]]
val futures = argsList map(i => runner(i))
will have the type List[Future[WhateverGetQC_ReportReturns]]. It specifically is not a Future, so has no onComplete method.
If you want to have a Future which completes when all the futures are completed, Future.sequence will convert a List[Future[T]] into a Future[List[T]]:
// replaces all code after val futures = argsList map ...
import scala.util.control.NonFatal

val allFutures = Future.sequence(futures)
val result: List[WhateverGetQC_ReportReturns] =
  try {
    Await.result(allFutures, Duration.Inf)
  } catch {
    case NonFatal(e) =>
      System.err.println("Failure happened!")
      System.err.println(e.getMessage)
      Nil // fall back to an empty list so the val still has its declared type
  }
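Note that Future.sequence fails as soon as any one of the futures fails, so you only ever see the first exception. If you want the outcome of every report, a sketch of one common pattern (assuming the same futures value, the placeholder WhateverGetQC_ReportReturns type, and an implicit ExecutionContext in scope) is to lift each future into a Try first:
import scala.util.{Failure, Success, Try}
import scala.util.control.NonFatal

// Each Future[T] becomes a Future[Try[T]] that always succeeds,
// so Future.sequence collects every outcome instead of failing fast.
val lifted: List[Future[Try[WhateverGetQC_ReportReturns]]] =
  futures.map { f =>
    f.map(r => Success(r): Try[WhateverGetQC_ReportReturns])
      .recover { case NonFatal(e) => Failure(e) }
  }

val outcomes = Await.result(Future.sequence(lifted), Duration.Inf)
outcomes.foreach {
  case Success(report) => println(s"result = $report")
  case Failure(e)      => System.err.println(s"Failure happened: ${e.getMessage}")
}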

scala (spark) zio convert future to zio

My objective is to run a number of Spark ML regression models (1000s of times) on one dataset, and I want to do this using ZIO instead of Future because the current version is too slow. Below is the working example using Future.
A distinct list of keys is used to filter the partitioned dataset by key and run the model on each partition. I've set up a thread pool with 8 executors to manage it, but performance quickly degrades.
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.duration._
import org.apache.spark.sql.SaveMode
val pool = Executors.newFixedThreadPool(8)
implicit val xc: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(pool)

case class Result(key: String, coeffs: String)

try {
  import spark.implicits._

  val tasks = {
    for (x <- keys)
      yield Future {
        Seq(
          Result(
            x.group,
            runModel(input.filter(col("group") === x)).mkString(",")
          )
        ).toDS()
          .write.mode(SaveMode.Overwrite).option("header", false).csv(
            s"hdfs://namenode:8020/results/$x.csv"
          )
      }
  }.toSeq

  Await.result(Future.sequence(tasks), Duration.Inf)
} finally {
  pool.shutdown()
  pool.awaitTermination(Long.MaxValue, TimeUnit.NANOSECONDS)
}
I've tried to implement this in zio, but I don't know how to implement queues and set a limit of executors like in futures.
Below is my failed attempt so far...
import zio._
import zio.console._
import zio.stm._
import org.apache.spark.sql.{Dataset, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
// example data/signatures
case class ModelResult(key: String, coeffs: String)
case class Data(key: String, sales: Double)
val keys: Array[String] = Array("100_1", "100_2")
def runModel[T](ds: Dataset[T]): Vector[Double]

object MyApp1 extends App {

  val spark = SparkSession
    .builder()
    .getOrCreate()

  import spark.implicits._

  val input: Dataset[Data] = Seq(Data("100_1", 1d), Data("100_2", 2d)).toDS

  def run(args: List[String]): ZIO[ZEnv, Nothing, Int] = {
    for {
      queue <- Queue.bounded[Int](8)
      _ <- ZIO.foreach(1 to 8)(i => queue.offer(i)).fork
      _ <- ZIO.foreach(keys) { k => queue.take.flatMap(_ => readWrite(k, input, queue)) }
    } yield 0
  }

  def writecsv(k: String, v: String) = {
    Seq(ModelResult(k, v))
      .toDS
      .write
      .mode(SaveMode.Overwrite).option("header", value = false)
      .csv(s"hdfs://namenode:8020/results/$k.csv")
  }

  def readWrite[T](key: String, ds: Dataset[T], queue: Queue[Int]): ZIO[ZEnv, Nothing, Int] = {
    (for {
      result <- runModel(ds.filter(col("key") === key)).mkString(",")
      _ <- writecsv(key, result)
      _ <- queue.offer(1)
      _ <- putStrLn(s"successfully wrote output for $key")
    } yield 0)
  }
}

// to run
MyApp1.run(List[String]())
What is the best way to compute this in ZIO?
To parallelize some workload across, say, 8 threads, all you need is
ZIO.foreachParN(8)(1 to 100)(id => zio.blocking.blocking(Task{yourClusterJob(id)}))
But don't expect much of a boost from switching from Futures to ZIO here:
1) The actual workload dominates the coordination overhead, so the difference between ZIO and Future should be marginal.
2) You may not get any boost at all, because the 8 tasks will be fighting for the same resource pool in the Spark cluster.
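Applied to your helpers, a minimal sketch (assuming ZIO 1.x and the keys, input, runModel and writecsv definitions from your attempt) could look like this:
def run(args: List[String]): ZIO[ZEnv, Nothing, Int] =
  ZIO
    .foreachParN(8)(keys.toList) { k =>          // at most 8 models run at once
      zio.blocking.effectBlocking {              // Spark calls block, so run them on the blocking pool
        val coeffs = runModel(input.filter(col("key") === k)).mkString(",")
        writecsv(k, coeffs)
      } *> putStrLn(s"successfully wrote output for $k")
    }
    .fold(_ => 1, _ => 0)                        // turn success/failure into an exit code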

In akka streaming program w/ Source.queue & Sink.queue I offer 1000 items, but it just hangs when I try to get 'em out

I am trying to understand how I should work with Source.queue and Sink.queue in Akka Streams.
In the little test program that I wrote below, I find that I am able to successfully offer 1000 items to the Source.queue.
However, when I wait on the future that should give me the results of pulling all those items off the queue, my future never completes. Specifically, the message 'print what we pulled off the queue' that we should see at the end never prints out -- instead we see the error "TimeoutException: Futures timed out after [10 seconds]".
Any guidance greatly appreciated!
import akka.actor.ActorSystem
import akka.event.{Logging, LoggingAdapter}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import akka.stream.{ActorMaterializer, Attributes}
import org.scalatest.FunSuite
import scala.collection.immutable
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
class StreamSpec extends FunSuite {
implicit val actorSystem: ActorSystem = ActorSystem()
implicit val materializer: ActorMaterializer = ActorMaterializer()
implicit val log: LoggingAdapter = Logging(actorSystem.eventStream, "basis-test")
implicit val ec: ExecutionContext = actorSystem.dispatcher
case class Req(name: String)
case class Response(
httpVersion: String = "",
method: String = "",
url: String = "",
headers: Map[String, String] = Map())
test("put items on queue then take them off") {
val source = Source.queue[String](128, akka.stream.OverflowStrategy.backpressure)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.queue[String]().withAttributes( Attributes.inputBuffer(128, 128))
val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()
(1 to 1000).map( i =>
Future {
println("offerd" + i) // I see this print 1000 times as expected
sourceQueue.offer(s"batch-$i")
}
)
println("DONE OFFER FUTURE FIRING")
// Now use the Sink.queue to pull the items we added onto the Source.queue
val seqOfFutures: immutable.Seq[Future[Option[String]]] =
(1 to 1000).map{ i => sinkQueue.pull() }
val futureOfSeq: Future[immutable.Seq[Option[String]]] =
Future.sequence(seqOfFutures)
val seq: immutable.Seq[Option[String]] =
Await.result(futureOfSeq, 10.second)
// unfortunately our future times out here
println("print what we pulled off the queue:" + seq);
}
}
Looking at this again, I realize that I originally set up and posed my question incorrectly.
The test that accompanies my original question launches a wave of 1000 futures, each of which tries to offer 1 item to the queue. Then the second step in that test attempts to create a 1000-element sequence (seqOfFutures) where each future is trying to pull a value from the queue.
My theory as to why I was getting time-out errors is that there was some kind of deadlock due to running out of threads, or due to one thread waiting on another where the waited-on thread was blocked, or something like that.
I'm not interested in hunting down the exact cause at this point because I have corrected things in the code below (see CORRECTED CODE).
In the new code the test that uses the queue is called "put items on queue then take them off (with async parallelism) - (3)".
In this test I have a set of 10 tasks which run in parallel to do the enqueue operation. Then I have another 10 tasks which do the dequeue operation, which involves not only taking the item off the queue, but also calling stringModifyFunc, which introduces a 1 ms processing delay.
I also wanted to prove that I got some performance benefit from launching tasks in parallel and having the task steps communicate by passing their results through a queue, so test 3 runs as a timed operation, and I found that it takes 1.9 seconds.
Tests (1) and (2) do the same amount of work, but serially -- the first with no intervening queue, and the second using the queue to pass results between steps. These tests run in 13.6 and 15.6 seconds respectively (which shows that the queue adds a bit of overhead, but that this is overshadowed by the efficiency of running tasks in parallel).
CORRECTED CODE
import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.event.{Logging, LoggingAdapter}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import akka.stream.{ActorMaterializer, Attributes, QueueOfferResult}
import org.scalatest.FunSuite
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
class Speco extends FunSuite {
implicit val actorSystem: ActorSystem = ActorSystem()
implicit val materializer: ActorMaterializer = ActorMaterializer()
implicit val log: LoggingAdapter = Logging(actorSystem.eventStream, "basis-test")
implicit val ec: ExecutionContext = actorSystem.dispatcher
val stringModifyFunc: String => String = element => {
Thread.sleep(1)
s"Modified $element"
}
def setup = {
val source = Source.queue[String](128, akka.stream.OverflowStrategy.backpressure)
val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(128, 128))
val (sourceQueue, sinkQueue) = source.toMat(sink)(Keep.both).run()
val offers: Source[String, NotUsed] = Source(
(1 to iterations).map { i =>
s"item-$i"
}
)
(sourceQueue,sinkQueue,offers)
}
val outer = 10
val inner = 1000
val iterations = outer * inner
def timedOperation[T](block : => T) = {
val t0 = System.nanoTime()
val result: T = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) / (1000 * 1000) + " milliseconds")
result
}
test("20k iterations in single threaded loop no queue (1)") {
timedOperation{
(1 to iterations).foreach { i =>
val str = stringModifyFunc(s"tag-${i.toString}")
System.out.println("str:" + str);
}
}
}
test("20k iterations in single threaded loop with queue (2)") {
timedOperation{
val (sourceQueue, sinkQueue, offers) = setup
val resultFuture: Future[Done] = offers.runForeach{ str =>
val itemFuture = for {
_ <- sourceQueue.offer(str)
item <- sinkQueue.pull()
} yield (stringModifyFunc(item.getOrElse("failed")) )
val item = Await.result(itemFuture, 10.second)
System.out.println("item:" + item);
}
val result = Await.result(resultFuture, 20.second)
System.out.println("result:" + result);
}
}
test("put items on queue then take them off (with async parallelism) - (3)") {
timedOperation{
val (sourceQueue, sinkQueue, offers) = setup
def enqueue(str: String) = sourceQueue.offer(str)
def dequeue = {
sinkQueue.pull().map{
maybeStr =>
val str = stringModifyFunc( maybeStr.getOrElse("failed2"))
println(s"dequeud value is $str")
}
}
val offerResults: Source[QueueOfferResult, NotUsed] =
offers.mapAsyncUnordered(10){ string => enqueue(string)}
val dequeueResults: Source[Unit, NotUsed] = offerResults.mapAsyncUnordered(10){ _ => dequeue }
val runAll: Future[Done] = dequeueResults.runForeach(u => u)
Await.result(runAll, 20.second)
}
}
}

Scala parallel execution

I am working on a requirement to get stats about files stored in Linux using Scala.
We will pass the root directory as input and our code will get the complete list of subdirectories of the root directory passed.
Then for each directory in the list I will get the list of files, and for each file I will get the owner, group, permissions, last modified time, creation time, and last access time.
The problem is: how can I process the directory list in parallel to get the stats of the files stored in each directory?
In the production environment we have 100,000+ folders inside the root folder, so my list contains 100,000+ folders.
How can I parallelize my operation (file stats) over this list?
Since I am new to Scala, please help me with this requirement.
Sorry for posting without code snippet.
Thanks.
I ended up using Akka actors.
I made assumptions about your desired output so that the program would be simple and fast. The assumptions I made are that the output is JSON, the hierarchy is not preserved, and that multiple files are acceptable. If you don't like JSON, you can replace it with something else, but the other two assumptions are important for keeping the current speed and simplicity of the program.
There are some command line parameters you can set. If you don't set them, then defaults will be used. The defaults are contained in Main.scala.
The command line parameters are as follows:
(0) the root directory you are starting from; (no default)
(1) the timeout interval (in seconds) for all the timeouts in this program; (default is 60)
(2) the number of printer actors to use; this will be the number of log files created; (default is 50)
(3) the tick interval to use for the monitor actor; (default is 500)
For the timeout, keep in mind this is the value of the time interval to wait at the completion of the program. So if you run a small job and wonder why it is taking a minute to complete, it is because it is waiting for the timeout interval to elapse before closing the program.
Because you are running such a large job, it is possible that the default timeout of 60 is too small. If you are getting exceptions complaining about timeout, increase the timeout value.
Please note that if your tick interval is set too high, there is a chance your program will close prematurely.
To run, just start sbt in the project folder and type
runMain Main <canonical path of root directory>
I couldn't figure out how to get the group of a File in Java. You'll need to research that and add the relevant code to Entity.scala and TraverseActor.scala.
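One possible starting point (a sketch, assuming a POSIX file system; non-POSIX systems will throw UnsupportedOperationException) is java.nio.file.Files with PosixFileAttributes:
import java.nio.file.Files
import java.nio.file.attribute.PosixFileAttributes
import scala.util.Try

// Read the owning group of a file via NIO; falls back to "" on non-POSIX file systems or errors.
def groupOf(f: java.io.File): String =
  Try(Files.readAttributes(f.toPath, classOf[PosixFileAttributes]).group().getName).toOption.getOrElse("")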
Also f.list() in TraverseActor.scala was sometimes coming back as null, which was why I wrapped it in an Option. You'll have to debug that issue to make sure you aren't failing silently on certain files.
Now, here are the contents of all the files.
build.sbt
name := "stackoverflow20191110"
version := "0.1"
scalaVersion := "2.12.1"
libraryDependencies ++= Seq(
"io.circe" %% "circe-core",
"io.circe" %% "circe-generic",
"io.circe" %% "circe-parser"
).map(_ % "0.12.2")
libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.4.16"
Entity.scala
import io.circe.Encoder
import io.circe.generic.semiauto._
sealed trait Entity {
def path: String
def owner: String
def permissions: String
def lastModifiedTime: String
def creationTime: String
def lastAccessTime: String
def hashCode: Int
}
object Entity {
implicit val entityEncoder: Encoder[Entity] = deriveEncoder
}
case class FileEntity(path: String, owner: String, permissions: String, lastModifiedTime: String, creationTime: String, lastAccessTime: String) extends Entity
object fileentityEncoder {
implicit val fileentityEncoder: Encoder[FileEntity] = deriveEncoder
}
case class DirectoryEntity(path: String, owner: String, permissions: String, lastModifiedTime: String, creationTime: String, lastAccessTime: String) extends Entity
object DirectoryEntity {
implicit val directoryentityEncoder: Encoder[DirectoryEntity] = deriveEncoder
}
case class Contents(path: String, files: IndexedSeq[Entity])
object Contents {
implicit val contentsEncoder: Encoder[Contents] = deriveEncoder
}
Main.scala
import akka.actor.ActorSystem
import akka.pattern.ask
import akka.util.Timeout
import java.io.{BufferedWriter, File, FileWriter}
import ShutDownActor.ShutDownYet
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try
object Main {
val defaultNumPrinters = 50
val defaultMonitorTickInterval = 500
val defaultTimeoutInS = 60
def main(args: Array[String]): Unit = {
val timeoutInS = Try(args(1).toInt).toOption.getOrElse(defaultTimeoutInS)
val system = ActorSystem("SearchHierarchy")
val shutdown = system.actorOf(ShutDownActor.props)
val monitor = system.actorOf(MonitorActor.props(shutdown, timeoutInS))
val refs = (0 until Try(args(2).toInt).toOption.getOrElse(defaultNumPrinters)).map{x =>
val name = "logfile" + x
(name, system.actorOf(PrintActor.props(name, Try(args(3).toInt).toOption.getOrElse(defaultMonitorTickInterval), monitor)))
}
val root = system.actorOf(TraverseActor.props(new File(args(0)), refs))
implicit val askTimeout = Timeout(timeoutInS seconds)
var isTimedOut = false
while(!isTimedOut){
Thread.sleep(30000)
val fut = (shutdown ? ShutDownYet).mapTo[Boolean]
isTimedOut = Await.result(fut, timeoutInS seconds)
}
refs.foreach{ x =>
val fw = new BufferedWriter(new FileWriter(new File(x._1), true))
fw.write("{}\n]")
fw.close()
}
system.terminate
}
}
MonitorActor.scala
import MonitorActor.ShutDown
import akka.actor.{Actor, ActorRef, Props, ReceiveTimeout, Stash}
import io.circe.syntax._
import scala.concurrent.duration._
class MonitorActor(shutdownActor: ActorRef, timeoutInS: Int) extends Actor with Stash {
context.setReceiveTimeout(timeoutInS seconds)
override def receive: Receive = {
case ReceiveTimeout =>
shutdownActor ! ShutDown
}
}
object MonitorActor {
def props(shutdownActor: ActorRef, timeoutInS: Int) = Props(new MonitorActor(shutdownActor, timeoutInS))
case object ShutDown
}
PrintActor.scala
import java.io.{BufferedWriter, File, FileWriter, PrintWriter}
import akka.actor.{Actor, ActorRef, Props, Stash}
import PrintActor.{Count, HeartBeat}
class PrintActor(name: String, interval: Int, monitorActor: ActorRef) extends Actor with Stash {
val file = new File(name)
override def preStart = {
val fw = new BufferedWriter(new FileWriter(file, true))
fw.write("[\n")
fw.close()
self ! Count(0)
}
override def receive: Receive = {
case Count(c) =>
context.become(withCount(c))
unstashAll()
case _ =>
stash()
}
def withCount(c: Int): Receive = {
case s: String =>
val fw = new BufferedWriter(new FileWriter(file, true))
fw.write(s)
fw.write(",\n")
fw.close()
if (c == interval) {
monitorActor ! HeartBeat
context.become(withCount(0))
} else {
context.become(withCount(c+1))
}
}
}
object PrintActor {
def props(name: String, interval: Int, monitorActor: ActorRef) = Props(new PrintActor(name, interval, monitorActor))
case class Count(count: Int)
case object HeartBeat
}
ShutDownActor.scala
import MonitorActor.ShutDown
import ShutDownActor.ShutDownYet
import akka.actor.{Actor, Props, Stash}
class ShutDownActor() extends Actor with Stash {
override def receive: Receive = {
case ShutDownYet => sender ! false
case ShutDown => context.become(canShutDown())
}
def canShutDown(): Receive = {
case ShutDownYet => sender ! true
}
}
object ShutDownActor {
def props = Props(new ShutDownActor())
case object ShutDownYet
}
TraverseActor.scala
import java.io.File
import akka.actor.{Actor, ActorRef, PoisonPill, Props, ReceiveTimeout}
import io.circe.syntax._
import scala.collection.JavaConversions
import scala.concurrent.duration._
import scala.util.Try
class TraverseActor(start: File, printers: IndexedSeq[(String, ActorRef)]) extends Actor{
val hash = start.hashCode()
val mod = hash % printers.size
val idx = if (mod < 0) -mod else mod
val myPrinter = printers(idx)._2
override def preStart = {
self ! start
}
override def receive: Receive = {
case f: File =>
val path = f.getCanonicalPath
val files = Option(f.list()).map(_.toIndexedSeq.map(x =>new File(path + "/" + x)))
val directories = files.map(_.filter(_.isDirectory))
directories.foreach(ds => processDirectories(ds))
val entities = files.map{fs =>
fs.map{ f =>
val path = f.getCanonicalPath
val owner = Try(java.nio.file.Files.getOwner(f.toPath).toString).toOption.getOrElse("")
val permissions = Try(java.nio.file.Files.getPosixFilePermissions(f.toPath).toString).toOption.getOrElse("")
val attributes = Try(java.nio.file.Files.readAttributes(f.toPath, "lastModifiedTime,creationTime,lastAccessTime"))
val lastModifiedTime = attributes.flatMap(a => Try(a.get("lastModifiedTime").toString)).toOption.getOrElse("")
val creationTime = attributes.flatMap(a => Try(a.get("creationTime").toString)).toOption.getOrElse("")
val lastAccessTime = attributes.flatMap(a => Try(a.get("lastAccessTime").toString)).toOption.getOrElse("")
if (f.isDirectory) DirectoryEntity(path, owner, permissions, lastModifiedTime, creationTime, lastAccessTime)
else FileEntity(path, owner, permissions, lastModifiedTime, creationTime, lastAccessTime)
}
}
directories match {
case Some(seq) =>
seq match {
case x+:xs =>
case IndexedSeq() => self ! PoisonPill
}
case None => self ! PoisonPill
}
entities.foreach(e => myPrinter ! Contents(f.getCanonicalPath, e).asJson.toString)
}
def processDirectories(directories: IndexedSeq[File]): Unit = {
def inner(fs: IndexedSeq[File]): Unit = {
fs match {
case x +: xs =>
context.actorOf(TraverseActor.props(x, printers))
processDirectories(xs)
case IndexedSeq() =>
}
}
directories match {
case x +: xs =>
self ! x
inner(xs)
case IndexedSeq() =>
}
}
}
object TraverseActor {
def props(start: File, printers: IndexedSeq[(String, ActorRef)]) = Props(new TraverseActor(start, printers))
}
I only tested on a small example, so it is possible this program will run into problems when running your job. If that happens, feel free to ask questions.