scala-akka: Passing command line arguments to sbt run to locate .csv files - scala

I have a command-line program that calculates statistics from humidity sensor data.
I also have .csv files inside src/main/scala/data.
When I run sbt "run data" or sbt "run src/main/scala/data",
it does not seem to locate the .csv files and I get 0 for every result.
output for sbt "run src/main/scala/data"
Looking for CSV files in directory: src/main/scala/data
output for sbt "run data"
Looking for CSV files in directory: data
Num of processed files: 0
Num of processed measurements: 0
Num of failed measurements: 0
Sensors with highest avg humidity:
sensor-id,min,avg,max
Expected output example:
Num of processed files: 2
Num of processed measurements: 7
Num of failed measurements: 2
Sensors with highest avg humidity:
sensor-id,min,avg,max
s2,78,82,88
s1,10,54,98
s3,NaN,NaN,NaN
Code for reference:
import java.io.File
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing, Sink, Source}
import akka.util.ByteString
import scala.collection.mutable
import scala.concurrent.ExecutionContext.Implicits.global
object HumiditySensorStatistics {
case class HumidityData(sum: Double, count: Int) {
def avg: Option[Double] = if (count > 0) Some(sum / count) else None
}
case class SensorStats(min: Option[Double], avg: Option[Double], max: Option[Double])
def main(args: Array[String]): Unit = {
val directoryPath = args(0)
implicit val system: ActorSystem = ActorSystem("HumiditySensorStatistics")
implicit val materializer: ActorMaterializer = ActorMaterializer()
val sensors = mutable.Map[String, HumidityData]()
var failedMeasurements = 0
println(s"Looking for CSV files in directory: $directoryPath")
val fileSource = Source.fromIterator(() => new File(directoryPath).listFiles().iterator)
val measurementSource = fileSource.flatMapConcat(f => FileIO.fromPath(f.toPath))
.via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024, allowTruncation = true))
.drop(1) // skip header line
.map(_.utf8String)
.map(line => {
val fields = line.split(",")
(fields(0), fields(1))
})
val sink = Sink.foreach[(String, String)](data => {
val sensorId = data._1
val humidity = data._2.toDoubleOption
if (humidity.isDefined) {
sensors.put(sensorId, sensors.getOrElse(sensorId, HumidityData(0.0, 0)) match {
case HumidityData(sum, count) => HumidityData(sum + humidity.get, count + 1)
})
} else {
failedMeasurements += 1
}
})
measurementSource.runWith(sink).onComplete(_ => {
val numFilesProcessed = sensors.size
val numMeasurementsProcessed = sensors.values.map(_.count).sum
val numFailedMeasurements = failedMeasurements
println(s"Num of processed files: $numFilesProcessed")
println(s"Num of processed measurements: $numMeasurementsProcessed")
println(s"Num of failed measurements: $numFailedMeasurements")
val statsBySensor = sensors.map {
case (sensorId, humidityData) =>
val stats = SensorStats(
min = Some(humidityData.sum / humidityData.count),
avg = humidityData.avg,
max = Some(humidityData.sum / humidityData.count)
)
(sensorId, stats)
}
println("Sensors with highest avg humidity:")
println("sensor-id,min,avg,max")
statsBySensor.toList.sortBy(_._2.avg).reverse.foreach {
case (sensorId, stats) =>
println(s"$sensorId,${stats.min.getOrElse("NaN")},${stats.avg.getOrElse("NaN")},${stats.max.getOrElse("NaN")}")
}
system.terminate()
})
}
}
build.sbt
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.13.8"
lazy val root = (project in file("."))
.settings(
name := "sensor-task"
)
libraryDependencies ++= Seq(
"com.typesafe.akka" %% "akka-stream" % "2.6.16",
)
.csv file data:
sensor-id,humidity
s1,80
s3,NaN
s2,78
s1,98

Your command sbt "run src/main/scala/data" looks correct, with the assumption that you're running sbt from the Scala project root with source code under "src/main/scala/" and the csv files under "src/main/scala/data/". Note that the relative path is resolved against sbt's working directory (the project root), so sbt "run data" would only find the files if they sat directly under <project-root>/data.
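If you want to double-check how the path is being resolved, a quick throwaway snippet (hypothetical, not part of the program above) run from the REPL or at the top of main will show the absolute path and what listFiles() actually returns:
// Hypothetical debugging snippet: see what a relative path resolves to under
// sbt's working directory and what listFiles() actually returns.
import java.io.File

val directoryPath = "src/main/scala/data" // or whatever was passed as args(0)
val dir = new File(directoryPath)
println(s"Resolved to: ${dir.getAbsolutePath} (exists=${dir.exists}, isDirectory=${dir.isDirectory})")
Option(dir.listFiles()).getOrElse(Array.empty[File])
  .foreach(f => println(s"  found: ${f.getName}"))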
A couple of observed issues with the code:
In creating fileSource, there is a good possibility that new File().listFiles() is picking up more files than you intend to include (e.g. non-csv files, hidden files, etc.), resulting in a single blob after passing through Framing.delimiter() that is subsequently dropped by drop(1). In that case, the "sensors" Map will be empty, resulting in the all-0's output.
I was able to reproduce the all-0's result using your exact source code and "build.sbt", apparently due to non-csv files (in my test case, ".DS_Store") being included by listFiles().
Providing specific file-selection criteria to listFiles(), such as including only "*.csv" files like below, should fix the problem:
val fileSource = Source.fromIterator( () =>
new File(directoryPath).listFiles((_, name) => name.endsWith(".csv")).iterator
)
Another issue is that the computation logic (humidityData.sum / humidityData.count) for min and max is incorrect, essentially repeating the avg calculation. To calculate them properly, one could expand the parameters of HumidityData as follows:
case class HumidityData(sum: Double, count: Int, min: Double, max: Double) {...}
The min/max could then be updated with something like below:
humidity match {
case Some(h) =>
sensors.put(sensorId, sensors.getOrElse(sensorId, HumidityData(0.0, 0, Double.MaxValue, 0.0)) match {
case HumidityData(sum, count, min, max) =>
HumidityData(sum + h, count + 1, Math.min(h, min), Math.max(h, max))
})
case None =>
failedMeasurements += 1
}
As a side note, I would recommend separating data from code by moving the data files out of "src/main/scala/" and placing them under, say, "src/main/resources/data/".
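The revised program below still reads the files from a filesystem path, but if the CSVs were ever packaged into the jar under src/main/resources, they could also be loaded from the classpath. A hypothetical sketch (file name taken from the test data below):
// Hypothetical: load a CSV that was packaged under src/main/resources/data
// from the classpath instead of from a filesystem path.
import scala.io.Source

val lines: List[String] =
  Option(getClass.getResourceAsStream("/data/sensor_data1.csv"))
    .map(is => Source.fromInputStream(is).getLines().toList)
    .getOrElse(Nil)
lines.foreach(println)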
Testing with the following csv data files ...
File src/main/resources/data/sensor_data1.csv:
sensor-id,humidity
s1,80
s3,NaN
s2,78
s1,98
File src/main/resources/data/sensor_data2.csv:
sensor-id,humidity
s1,70
s3,80
s2,60
$ sbt "run src/main/resources/data"
[info] welcome to sbt 1.5.5 (Oracle Corporation Java 1.8.0_181)
[info] loading settings for project global-plugins from idea.sbt ...
[info] loading global plugins from /Users/leo/.sbt/1.0/plugins
[info] loading project definition from /Users/leo/work/so-75459442/project
[info] loading settings for project root from build.sbt ...
[info] set current project to sensor-task (in build file:/Users/leo/work/so-75459442/)
[info] running HumiditySensorStatistics src/main/resources/data
Looking for CSV files in directory: src/main/resources/data
Num of processed sensors: 3
Num of processed measurements: 7
Num of failed measurements: 1
Sensors with highest avg humidity:
sensor-id,min,avg,max
s3,NaN,NaN,NaN
s1,70.0,82.66666666666667,98.0
s2,60.0,69.0,78.0
[success] Total time: 2 s, completed Feb 16, 2023 10:33:35 AM
Appended is the revised source code.
import java.io.File
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing, Sink, Source}
import akka.util.ByteString
import scala.collection.mutable
import scala.concurrent.ExecutionContext.Implicits.global
object HumiditySensorStatistics {
case class HumidityData(sum: Double, count: Int, min: Double, max: Double) {
def avg: Option[Double] = if (count > 0) Some(sum / count) else None
}
case class SensorStats(min: Option[Double], avg: Option[Double], max: Option[Double])
def main(args: Array[String]): Unit = {
val directoryPath = args(0)
implicit val system: ActorSystem = ActorSystem("HumiditySensorStatistics")
// implicit val materializer: ActorMaterializer = ActorMaterializer() // Not needed for Akka Stream 2.6+
val sensors = mutable.Map[String, HumidityData]()
var failedMeasurements = 0
println(s"Looking for CSV files in directory: $directoryPath")
val fileSource = Source.fromIterator( () =>
new File(directoryPath).listFiles((_, name) => name.endsWith(".csv")).iterator
)
val measurementSource = fileSource.flatMapConcat(f => FileIO.fromPath(f.toPath))
.via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024, allowTruncation = true))
.drop(1) // skip header line
.map(_.utf8String)
.map(line => {
val fields = line.split(",")
(fields(0), fields(1))
})
val sink = Sink.foreach[(String, String)](data => {
val sensorId = data._1
val humidity = data._2.toDoubleOption
humidity match {
case Some(h) =>
sensors.put(sensorId, sensors.getOrElse(sensorId, HumidityData(0.0, 0, Double.MaxValue, 0.0)) match {
case HumidityData(sum, count, min, max) =>
HumidityData(sum + h, count + 1, Math.min(h, min), Math.max(h, max))
})
case None =>
failedMeasurements += 1
}
})
measurementSource.runWith(sink).onComplete(_ => {
val numSensorsProcessed = sensors.size
val numMeasurementsProcessed = sensors.values.map(_.count).sum
val numFailedMeasurements = failedMeasurements
println(s"Num of processed sensors: $numSensorsProcessed")
println(s"Num of processed measurements: $numMeasurementsProcessed")
println(s"Num of failed measurements: $numFailedMeasurements")
val statsBySensor = sensors.map {
case (sensorId, humidityData) =>
val stats = SensorStats(
min = Some(humidityData.min),
avg = humidityData.avg,
max = Some(humidityData.max)
)
(sensorId, stats)
}
println("Sensors with highest avg humidity:")
println("sensor-id,min,avg,max")
statsBySensor.toList.sortBy(_._2.avg).reverse.foreach {
case (sensorId, stats) =>
println(s"$sensorId,${stats.min.getOrElse("NaN")},${stats.avg.getOrElse("NaN")},${stats.max.getOrElse("NaN")}")
}
system.terminate()
})
}
}

Related

How do I create an MQTT sink for Spark Streaming?

There are some examples of how to create MQTT sources [1] [2] for Spark Streaming. However, I want to create an MQTT sink where I can publish the results instead of using the print() method. I tried to create an MqttSink, but I am getting an "object not serializable" error. I then based the code on this blog, but I cannot find the send method that I created on the MqttSink object.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.fusesource.mqtt.client.QoS
import org.sense.spark.util.{MqttSink, TaxiRideSource}
object TaxiRideCountCombineByKey {
val mqttTopic: String = "spark-mqtt-sink"
val qos: QoS = QoS.AT_LEAST_ONCE
def main(args: Array[String]): Unit = {
val outputMqtt: Boolean = if (args.length > 0 && args(0).equals("mqtt")) true else false
// Create a local StreamingContext with two working thread and batch interval of 1 second.
// The master requires 4 cores to prevent from a starvation scenario.
val sparkConf = new SparkConf()
.setAppName("TaxiRideCountCombineByKey")
.setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val stream = ssc.receiverStream(new TaxiRideSource())
val driverStream = stream.map(taxiRide => (taxiRide.driverId, 1))
val countStream = driverStream.combineByKey(
(v) => (v, 1), //createCombiner
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), //mergeValue
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2), // mergeCombiners
new HashPartitioner(3)
)
if (outputMqtt) {
println("Use the command below to consume data:")
println("mosquitto_sub -h 127.0.0.1 -p 1883 -t " + mqttTopic)
val mqttSink = ssc.sparkContext.broadcast(MqttSink)
countStream.foreachRDD { rdd =>
rdd.foreach { message =>
mqttSink.value.send(mqttTopic, message.toString()) // "send" method does not exist
}
}
} else {
countStream.print()
}
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
}
import org.fusesource.mqtt.client.{FutureConnection, MQTT, QoS}
class MqttSink(createProducer: () => FutureConnection) extends Serializable {
lazy val producer = createProducer()
def send(topic: String, message: String): Unit = {
producer.publish(topic, message.toString().getBytes, QoS.AT_LEAST_ONCE, false)
}
}
object MqttSink {
def apply(): MqttSink = {
val f = () => {
val mqtt = new MQTT()
mqtt.setHost("localhost", 1883)
val producer = mqtt.futureConnection()
producer.connect().await()
sys.addShutdownHook {
producer.disconnect().await()
}
producer
}
new MqttSink(f)
}
}
As an alternative, you could also use Structured Streaming with the Apache Bahir Spark extension for MQTT.
Complete Example
build.sbt:
name := "MQTT_StructuredStreaming"
version := "0.1"
libraryDependencies += "org.apache.spark" % "spark-core_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-sql_2.12" % "2.4.4"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.12" % "2.4.4" % "provided"
libraryDependencies += "org.apache.bahir" % "spark-sql-streaming-mqtt_2.12" % "2.4.0"
Main.scala
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
object Main extends App {
val brokerURL = "tcp://localhost:1883"
val subTopicName = "/my/subscribe/topic"
val pubTopicName = "/my/publish/topic"
val spark: SparkSession = SparkSession
.builder
.appName("MQTT_StructStreaming")
.master("local[*]")
.config("spark.sql.streaming.checkpointLocation", "/my/sparkCheckpoint/dir")
.getOrCreate
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val lines: Dataset[String] = spark.readStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
.option("topic", subTopicName)
.option("clientId", "some-client-id")
.option("persistence", "memory")
.load(brokerURL)
.selectExpr("CAST(payload AS STRING)").as[String]
// Split the lines into words
val words: Dataset[String] = lines.as[String].flatMap(_.split(";"))
// Generate running word count
val wordCounts: DataFrame = words.groupBy("value").count()
// Start running the query that prints the running counts to the console
val query: StreamingQuery = wordCounts.writeStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSinkProvider")
.outputMode("complete")
.option("topic", pubTopicName)
.option("brokerURL", brokerURL)
.start
query.awaitTermination()
}
This is a working example based on the blog entry Spark and Kafka integration patterns. Note that the broadcast now wraps an instance created with MqttSink() rather than the MqttSink companion object itself.
package org.sense.spark.app
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.fusesource.mqtt.client.QoS
import org.sense.spark.util.{MqttSink, TaxiRideSource}
object TaxiRideCountCombineByKey {
val mqttTopic: String = "spark-mqtt-sink"
val qos: QoS = QoS.AT_LEAST_ONCE
def main(args: Array[String]): Unit = {
val outputMqtt: Boolean = if (args.length > 0 && args(0).equals("mqtt")) true else false
// Create a local StreamingContext with two working thread and batch interval of 1 second.
// The master requires 4 cores to prevent from a starvation scenario.
val sparkConf = new SparkConf()
.setAppName("TaxiRideCountCombineByKey")
.setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val stream = ssc.receiverStream(new TaxiRideSource())
val driverStream = stream.map(taxiRide => (taxiRide.driverId, 1))
val countStream = driverStream.combineByKey(
(v) => (v, 1), //createCombiner
(acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), //mergeValue
(acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2), // mergeCombiners
new HashPartitioner(3)
)
if (outputMqtt) {
println("Use the command below to consume data:")
println("mosquitto_sub -h 127.0.0.1 -p 1883 -t " + mqttTopic)
val mqttSink = ssc.sparkContext.broadcast(MqttSink())
countStream.foreachRDD { rdd =>
rdd.foreach { message =>
mqttSink.value.send(mqttTopic, message.toString()) // publish each count to the MQTT topic
}
}
} else {
countStream.print()
}
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
}
package org.sense.spark.util
import org.fusesource.mqtt.client.{FutureConnection, MQTT, QoS}
class MqttSink(createProducer: () => FutureConnection) extends Serializable {
lazy val producer = createProducer()
def send(topic: String, message: String): Unit = {
producer.publish(topic, message.toString().getBytes, QoS.AT_LEAST_ONCE, false)
}
}
object MqttSink {
def apply(): MqttSink = {
val f = () => {
val mqtt = new MQTT()
mqtt.setHost("localhost", 1883)
val producer = mqtt.futureConnection()
producer.connect().await()
sys.addShutdownHook {
producer.disconnect().await()
}
producer
}
new MqttSink(f)
}
}
package org.sense.spark.util
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.Locale
import java.util.zip.GZIPInputStream
import org.apache.spark.storage._
import org.apache.spark.streaming.receiver._
import org.joda.time.DateTime
import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}
case class TaxiRide(rideId: Long, isStart: Boolean, startTime: DateTime, endTime: DateTime,
startLon: Float, startLat: Float, endLon: Float, endLat: Float,
passengerCnt: Short, taxiId: Long, driverId: Long)
object TimeFormatter {
val timeFormatter: DateTimeFormatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss").withLocale(Locale.US).withZoneUTC()
}
class TaxiRideSource extends Receiver[TaxiRide](StorageLevel.MEMORY_AND_DISK_2) {
val dataFilePath = "/home/flink/nycTaxiRides.gz";
var dataRateListener: DataRateListener = _
/**
* Start the thread that receives data over a connection
*/
def onStart() {
dataRateListener = new DataRateListener()
dataRateListener.start()
new Thread("TaxiRide Source") {
override def run() {
receive()
}
}.start()
}
def onStop() {}
/**
* Periodically generate a TaxiRide event and regulate the emission frequency
*/
private def receive() {
while (!isStopped()) {
val gzipStream = new GZIPInputStream(new FileInputStream(dataFilePath))
val reader: BufferedReader = new BufferedReader(new InputStreamReader(gzipStream, StandardCharsets.UTF_8))
try {
var line: String = null
do {
// start time before reading the line
val startTime = System.nanoTime
// read the line on the file and yield the object
line = reader.readLine
if (line != null) {
val taxiRide: TaxiRide = getTaxiRideFromString(line)
store(taxiRide)
}
// regulate frequency of the source
dataRateListener.busySleep(startTime)
} while (line != null)
} finally {
reader.close
}
}
}
def getTaxiRideFromString(line: String): TaxiRide = {
// println(line)
val tokens: Array[String] = line.split(",")
if (tokens.length != 11) {
throw new RuntimeException("Invalid record: " + line)
}
val rideId: Long = tokens(0).toLong
val (isStart, startTime, endTime) = tokens(1) match {
case "START" => (true, DateTime.parse(tokens(2), TimeFormatter.timeFormatter), DateTime.parse(tokens(3), TimeFormatter.timeFormatter))
case "END" => (false, DateTime.parse(tokens(2), TimeFormatter.timeFormatter), DateTime.parse(tokens(3), TimeFormatter.timeFormatter))
case _ => throw new RuntimeException("Invalid record: " + line)
}
val startLon: Float = if (tokens(4).length > 0) tokens(4).toFloat else 0.0f
val startLat: Float = if (tokens(5).length > 0) tokens(5).toFloat else 0.0f
val endLon: Float = if (tokens(6).length > 0) tokens(6).toFloat else 0.0f
val endLat: Float = if (tokens(7).length > 0) tokens(7).toFloat else 0.0f
val passengerCnt: Short = tokens(8).toShort
val taxiId: Long = tokens(9).toLong
val driverId: Long = tokens(10).toLong
TaxiRide(rideId, isStart, startTime, endTime, startLon, startLat, endLon, endLat, passengerCnt, taxiId, driverId)
}
}

Scala program using futures is not terminating

I am trying to learn concurrency in Scala and am using Scala futures to generate a dataset of random strings. I want to create an application that can generate a file with any number of records and that scales well.
Code:
import java.io.FileWriter
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Random, Success}
import scala.concurrent.duration._
object datacreator {
implicit val ec: ExecutionContext = new ExecutionContext {
val threadPool: ExecutorService = Executors.newFixedThreadPool(4)
def execute(runnable: Runnable) {
threadPool.submit(runnable)
}
def reportFailure(t: Throwable) {}
}
def getRecord : String = {
"Random string"
}
def main(args: Array[String]): Unit = {
val filename = args(0)
val number_of_records = args(1)
val file_Object = new FileWriter(filename, true)
val data: Future[Iterable[String]] = Future {
for (i <- 1 to number_of_records.toInt)
yield getRecord
}
val result = data.map{
result => result.foreach(record => file_Object.write(record))
}
result.onComplete{
case Success(value) => {
println("Success")
file_Object.close()
}
case Failure(e) => e.printStackTrace()
}
}
}
I am facing the following issues:
When I run the program using sbt, it writes the results to the file but does not terminate; it just keeps running.
[info] Loading project definition from /Users/cw0155/PersonalProjects/datagen/project
[info] Loading settings for project datagen from build.sbt ...
[info] Set current project to datagenerator (in build file:/Users/cw0155/PersonalProjects/datagen/)
[info] running com.generator.DataGenerator xyz.csv 100
Success
| => datagen / Compile / runMain 255s
When I run the program from the jar as:
scala -cp target/scala-2.13/datagenerator_2.13-0.1.jar com.generator.DataGenerator "pqr.csv" "1000"
It waits indefinitely and does not write to the file.
Any help is much appreciated :)
Try this version
bar.scala
import scala.concurrent.{Await, Future, ExecutionContext}
import scala.concurrent.duration._
import scala.util.{Success, Failure}
import ExecutionContext.Implicits.global
import java.io.FileWriter
object bar {
def getRecord: String = "Random string\n"
def main(args: Array[String]): Unit = {
val filename = args(0)
val number_of_records = args(1)
val data: Future[Iterable[String]] = Future {
for (i <- 1 to number_of_records.toInt)
yield getRecord
}
val file_Object = new FileWriter(filename, true)
val result = data.map( r => r.foreach(record => file_Object.write(record)) )
result.onComplete {
case Success(value) =>
println("Success")
file_Object.close()
case Failure(e) =>
e.printStackTrace()
}
Await.result( result, 10.second )
}
}
Your original version gave me the expected output when I ran it like so
bash-3.2$ scala bar.scala /dev/fd/1 10
Success
Random string
Random string
Random string
Random string
Random string
Random string
Random string
Random string
Random string
Random string
However without the Await.result your program can exit before the future finishes.
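For completeness: the non-termination in the original version comes from the fixed thread pool, since Executors.newFixedThreadPool creates non-daemon threads that keep the JVM alive until the pool is shut down. If you prefer to keep a dedicated pool rather than the global execution context, a sketch along these lines (object name and timeout are my own choices) should also terminate cleanly:
// Sketch: keep a dedicated thread pool, but shut it down once the work is done
// so the non-daemon pool threads no longer keep the JVM alive.
import java.io.FileWriter
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object DataCreatorWithPool {
  private val threadPool = Executors.newFixedThreadPool(4)
  implicit private val ec: ExecutionContext = ExecutionContext.fromExecutorService(threadPool)

  def getRecord: String = "Random string\n"

  def main(args: Array[String]): Unit = {
    val filename = args(0)
    val numberOfRecords = args(1).toInt
    val fileObject = new FileWriter(filename, true)

    val result = Future {
      (1 to numberOfRecords).map(_ => getRecord)
    }.map(records => records.foreach(r => fileObject.write(r)))

    result.onComplete {
      case Success(_) => println("Success")
      case Failure(e) => e.printStackTrace()
    }

    Await.ready(result, 30.seconds)
    fileObject.close()
    threadPool.shutdown() // without this the JVM would not exit
  }
}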

Scala parallel execution

I am working on a requirement to get stats about files stored in Linux using Scala.
We will pass the root directory as input, and the code will get the complete list of subdirectories for that root directory.
Then for each directory in the list I will get the file list, and for each file I will get the owner, group, permissions, last modified time, creation time, and last access time.
The problem is how I can process the directory list in parallel to get the stats of the files stored in each directory.
In the production environment we have 100,000+ folders inside the root folder,
so my list has 100,000+ entries.
How can I parallelize my operation (file stats) over the available list?
Since I am new to Scala, please help me with this requirement.
Sorry for posting without a code snippet.
Thanks.
I ended up using Akka actors.
I made assumptions about your desired output so that the program would be simple and fast. The assumptions I made are that the output is JSON, the hierarchy is not preserved, and that multiple files are acceptable. If you don't like JSON, you can replace it with something else, but the other two assumptions are important for keeping the current speed and simplicity of the program.
There are some command line parameters you can set. If you don't set them, then defaults will be used. The defaults are contained in Main.scala.
The command line parameters are as follows:
(0) the root directory you are starting from; (no default)
(1) the timeout interval (in seconds) for all the timeouts in this program; (default is 60)
(2) the number of printer actors to use; this will be the number of log files created; (default is 50)
(3) the tick interval to use for the monitor actor; (default is 500)
For the timeout, keep in mind this is the value of the time interval to wait at the completion of the program. So if you run a small job and wonder why it is taking a minute to complete, it is because it is waiting for the timeout interval to elapse before closing the program.
Because you are running such a large job, it is possible that the default timeout of 60 is too small. If you are getting exceptions complaining about timeout, increase the timeout value.
Please note that if your tick interval is set too high, there is a chance your program will close prematurely.
To run, just start sbt in project folder, and type
runMain Main <canonical path of root directory>
I couldn't figure out how to get the group of a File in Java. You'll need to research that and add the relevant code to Entity.scala and TraverseActor.scala; see the sketch below for one possible starting point.
Also, f.list() in TraverseActor.scala was sometimes coming back as null, which is why I wrapped it in an Option. You'll have to debug that issue to make sure you aren't failing silently on certain files.
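For the group lookup (and as a null-free alternative to File.list()), the java.nio.file API can be used; this is only a hypothetical sketch, not wired into the actors above:
// Hypothetical sketch: read the POSIX group of a path and list a directory
// without the null that File.list() can return on I/O errors.
import java.nio.file.{Files, Path, Paths}
import java.nio.file.attribute.PosixFileAttributes
import scala.collection.JavaConverters._
import scala.util.Try

def groupOf(p: Path): Option[String] =
  Try(Files.readAttributes(p, classOf[PosixFileAttributes]).group().getName).toOption

def childrenOf(dir: Path): IndexedSeq[Path] =
  Try(Files.newDirectoryStream(dir)).map { stream =>
    try stream.iterator().asScala.toIndexedSeq
    finally stream.close()
  }.getOrElse(IndexedSeq.empty)

// Example usage (path is just a placeholder):
val root = Paths.get("/tmp")
childrenOf(root).foreach(p => println(s"$p group=${groupOf(p).getOrElse("?")}"))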
Now, here are the contents of all the files.
build.sbt
name := "stackoverflow20191110"
version := "0.1"
scalaVersion := "2.12.1"
libraryDependencies ++= Seq(
"io.circe" %% "circe-core",
"io.circe" %% "circe-generic",
"io.circe" %% "circe-parser"
).map(_ % "0.12.2")
libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.4.16"
Entity.scala
import io.circe.Encoder
import io.circe.generic.semiauto._
sealed trait Entity {
def path: String
def owner: String
def permissions: String
def lastModifiedTime: String
def creationTime: String
def lastAccessTime: String
def hashCode: Int
}
object Entity {
implicit val entityEncoder: Encoder[Entity] = deriveEncoder
}
case class FileEntity(path: String, owner: String, permissions: String, lastModifiedTime: String, creationTime: String, lastAccessTime: String) extends Entity
object fileentityEncoder {
implicit val fileentityEncoder: Encoder[FileEntity] = deriveEncoder
}
case class DirectoryEntity(path: String, owner: String, permissions: String, lastModifiedTime: String, creationTime: String, lastAccessTime: String) extends Entity
object DirectoryEntity {
implicit val directoryentityEncoder: Encoder[DirectoryEntity] = deriveEncoder
}
case class Contents(path: String, files: IndexedSeq[Entity])
object Contents {
implicit val contentsEncoder: Encoder[Contents] = deriveEncoder
}
Main.scala
import akka.actor.ActorSystem
import akka.pattern.ask
import akka.util.Timeout
import java.io.{BufferedWriter, File, FileWriter}
import ShutDownActor.ShutDownYet
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.util.Try
object Main {
val defaultNumPrinters = 50
val defaultMonitorTickInterval = 500
val defaultTimeoutInS = 60
def main(args: Array[String]): Unit = {
val timeoutInS = Try(args(1).toInt).toOption.getOrElse(defaultTimeoutInS)
val system = ActorSystem("SearchHierarchy")
val shutdown = system.actorOf(ShutDownActor.props)
val monitor = system.actorOf(MonitorActor.props(shutdown, timeoutInS))
val refs = (0 until Try(args(2).toInt).toOption.getOrElse(defaultNumPrinters)).map{x =>
val name = "logfile" + x
(name, system.actorOf(PrintActor.props(name, Try(args(3).toInt).toOption.getOrElse(defaultMonitorTickInterval), monitor)))
}
val root = system.actorOf(TraverseActor.props(new File(args(0)), refs))
implicit val askTimeout = Timeout(timeoutInS seconds)
var isTimedOut = false
while(!isTimedOut){
Thread.sleep(30000)
val fut = (shutdown ? ShutDownYet).mapTo[Boolean]
isTimedOut = Await.result(fut, timeoutInS seconds)
}
refs.foreach{ x =>
val fw = new BufferedWriter(new FileWriter(new File(x._1), true))
fw.write("{}\n]")
fw.close()
}
system.terminate
}
}
MonitorActor.scala
import MonitorActor.ShutDown
import akka.actor.{Actor, ActorRef, Props, ReceiveTimeout, Stash}
import io.circe.syntax._
import scala.concurrent.duration._
class MonitorActor(shutdownActor: ActorRef, timeoutInS: Int) extends Actor with Stash {
context.setReceiveTimeout(timeoutInS seconds)
override def receive: Receive = {
case ReceiveTimeout =>
shutdownActor ! ShutDown
}
}
object MonitorActor {
def props(shutdownActor: ActorRef, timeoutInS: Int) = Props(new MonitorActor(shutdownActor, timeoutInS))
case object ShutDown
}
PrintActor.scala
import java.io.{BufferedWriter, File, FileWriter, PrintWriter}
import akka.actor.{Actor, ActorRef, Props, Stash}
import PrintActor.{Count, HeartBeat}
class PrintActor(name: String, interval: Int, monitorActor: ActorRef) extends Actor with Stash {
val file = new File(name)
override def preStart = {
val fw = new BufferedWriter(new FileWriter(file, true))
fw.write("[\n")
fw.close()
self ! Count(0)
}
override def receive: Receive = {
case Count(c) =>
context.become(withCount(c))
unstashAll()
case _ =>
stash()
}
def withCount(c: Int): Receive = {
case s: String =>
val fw = new BufferedWriter(new FileWriter(file, true))
fw.write(s)
fw.write(",\n")
fw.close()
if (c == interval) {
monitorActor ! HeartBeat
context.become(withCount(0))
} else {
context.become(withCount(c+1))
}
}
}
object PrintActor {
def props(name: String, interval: Int, monitorActor: ActorRef) = Props(new PrintActor(name, interval, monitorActor))
case class Count(count: Int)
case object HeartBeat
}
ShutDownActor.scala
import MonitorActor.ShutDown
import ShutDownActor.ShutDownYet
import akka.actor.{Actor, Props, Stash}
class ShutDownActor() extends Actor with Stash {
override def receive: Receive = {
case ShutDownYet => sender ! false
case ShutDown => context.become(canShutDown())
}
def canShutDown(): Receive = {
case ShutDownYet => sender ! true
}
}
object ShutDownActor {
def props = Props(new ShutDownActor())
case object ShutDownYet
}
TraverseActor.scala
import java.io.File
import akka.actor.{Actor, ActorRef, PoisonPill, Props, ReceiveTimeout}
import io.circe.syntax._
import scala.collection.JavaConversions
import scala.concurrent.duration._
import scala.util.Try
class TraverseActor(start: File, printers: IndexedSeq[(String, ActorRef)]) extends Actor{
val hash = start.hashCode()
val mod = hash % printers.size
val idx = if (mod < 0) -mod else mod
val myPrinter = printers(idx)._2
override def preStart = {
self ! start
}
override def receive: Receive = {
case f: File =>
val path = f.getCanonicalPath
val files = Option(f.list()).map(_.toIndexedSeq.map(x =>new File(path + "/" + x)))
val directories = files.map(_.filter(_.isDirectory))
directories.foreach(ds => processDirectories(ds))
val entities = files.map{fs =>
fs.map{ f =>
val path = f.getCanonicalPath
val owner = Try(java.nio.file.Files.getOwner(f.toPath).toString).toOption.getOrElse("")
val permissions = Try(java.nio.file.Files.getPosixFilePermissions(f.toPath).toString).toOption.getOrElse("")
val attributes = Try(java.nio.file.Files.readAttributes(f.toPath, "lastModifiedTime,creationTime,lastAccessTime"))
val lastModifiedTime = attributes.flatMap(a => Try(a.get("lastModifiedTime").toString)).toOption.getOrElse("")
val creationTime = attributes.flatMap(a => Try(a.get("creationTime").toString)).toOption.getOrElse("")
val lastAccessTime = attributes.flatMap(a => Try(a.get("lastAccessTime").toString)).toOption.getOrElse("")
if (f.isDirectory) FileEntity(path, owner, permissions, lastModifiedTime, creationTime, lastAccessTime)
else DirectoryEntity(path, owner, permissions, lastModifiedTime, creationTime, lastAccessTime)
}
}
directories match {
case Some(seq) =>
seq match {
case x+:xs =>
case IndexedSeq() => self ! PoisonPill
}
case None => self ! PoisonPill
}
entities.foreach(e => myPrinter ! Contents(f.getCanonicalPath, e).asJson.toString)
}
def processDirectories(directories: IndexedSeq[File]): Unit = {
def inner(fs: IndexedSeq[File]): Unit = {
fs match {
case x +: xs =>
context.actorOf(TraverseActor.props(x, printers))
processDirectories(xs)
case IndexedSeq() =>
}
}
directories match {
case x +: xs =>
self ! x
inner(xs)
case IndexedSeq() =>
}
}
}
object TraverseActor {
def props(start: File, printers: IndexedSeq[(String, ActorRef)]) = Props(new TraverseActor(start, printers))
}
I only tested on a small example, so it is possible this program will run into problems when running your job. If that happens, feel free to ask questions.

Change datatype on scala Spark Streaming

In that course, in the Module 3 hands-on lab ... there's an example (Spark Fundamentals 1) that I'm using to learn Scala and Spark.
https://courses.cognitiveclass.ai/courses/course-v1:BigDataUniversity+BD0211EN+2016/courseware/14ec4166bc9b4a3a9592b7960f4a5401/b0c736193c834b01b3c1c5bd4ce2d8a8/
I tried to modify the streaming part in order to calculate the moving average as the stream comes in. I haven't figured out how to do it yet, and right now I'm stuck because I don't know how to change the datatype.
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val ssc = new StreamingContext(sc,Seconds(1))
val lines = ssc.socketTextStream("localhost",7777)
import scala.collection.mutable.Queue
var ints = Queue[Double]()
def movingAverage(values: Queue[Double], period: Int): List[Double] = {
val first = (values take period).sum / period
val subtract = values map (_ / period)
val add = subtract drop period
val addAndSubtract = add zip subtract map Function.tupled(_ - _)
val res = (addAndSubtract.foldLeft(first :: List.fill(period - 1)(0.0)) {
(acc, add) => (add + acc.head) :: acc
}).reverse
res
}
val pass = lines.map(_.split(",")).
map(pass=>(pass(7).toDouble))
pass.getClass
class org.apache.spark.streaming.dstream.MappedDStream
ints ++= List(pass).to[Queue]
Name: Compile Error
Message: <console>:41: error: type mismatch;
found : scala.collection.mutable.Queue[org.apache.spark.streaming.dstream.DStream[Double]]
required: scala.collection.TraversableOnce[Double]
ints ++= List(pass).to[Queue]
^
StackTrace:
val pass2 = movingAverage(ints,2)
pass2.print()
ints.dequeue
ssc.start()
ssc.awaitTermination()
How to get the streaming data from pass to ints as a queue of doubles?
After a lot of asking, this is what I ended up with:
val p1 = new scala.collection.mutable.Queue[Double]
pass.foreachRDD( rdd => {
for(item <- rdd.collect().toArray) {
p1 += item ;
println(item +" - "+ movingAverage(p1,2).last) ;
}
})

SparkContext cannot be launched in the same program with a Streaming SparkContext

I created the following test that fits a simple linear regression model to dummy streaming data.
I use hyper-parameter optimisation to find good values for stepSize, numIterations and initialWeights of the linear model.
Everything runs fine, except for the last lines of the code, which are commented out:
// Save the evaluations for further visualization
// val gridEvalsRDD = sc.parallelize(gridEvals)
// gridEvalsRDD.coalesce(1)
// .map(e => "%.3f\t%.3f\t%d\t%.3f".format(e._1, e._2, e._3, e._4))
// .saveAsTextFile("data/mllib/streaming")
The problem is with the SparkContext sc. If I initialize it at the beginning of the test, the program shows errors. It looks like sc should be defined in some special way in order to avoid conflicts with ssc (the streaming Spark context). Any ideas?
The whole code:
// scalastyle:off
package org.apache.spark.mllib.regression
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.util.LinearDataGenerator
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{StreamingContext, TestSuiteBase}
import org.apache.spark.streaming.TestSuiteBase
import org.scalatest.BeforeAndAfter
class StreamingLinearRegressionHypeOpt extends TestSuiteBase with BeforeAndAfter {
// use longer wait time to ensure job completion
override def maxWaitTimeMillis: Int = 20000
var ssc: StreamingContext = _
override def afterFunction() {
super.afterFunction()
if (ssc != null) {
ssc.stop()
}
}
def calculateMSE(output: Seq[Seq[(Double, Double)]], n: Int): Double = {
val mse = output
.map {
case seqOfPairs: Seq[(Double, Double)] =>
val err = seqOfPairs.map(p => math.abs(p._1 - p._2)).sum
err*err
}.sum / n
mse
}
def calculateRMSE(output: Seq[Seq[(Double, Double)]], n: Int): Double = {
val mse = output
.map {
case seqOfPairs: Seq[(Double, Double)] =>
val err = seqOfPairs.map(p => math.abs(p._1 - p._2)).sum
err*err
}.sum / n
math.sqrt(mse)
}
def dummyStringStreamSplit(datastream: Stream[String]) =
datastream.flatMap(txt => txt.split(" "))
test("Test 1") {
// create model initialized with zero weights
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.dense(0.0, 0.0))
.setStepSize(0.2)
.setNumIterations(25)
// generate sequence of simulated data for testing
val numBatches = 10
val nPoints = 100
val inputData = (0 until numBatches).map { i =>
LinearDataGenerator.generateLinearInput(0.0, Array(10.0, 10.0), nPoints, 42 * (i + 1))
}
// Without hyper-parameters optimization
withStreamingContext(setupStreams(inputData, (inputDStream: DStream[LabeledPoint]) => {
model.trainOn(inputDStream)
model.predictOnValues(inputDStream.map(x => (x.label, x.features)))
})) { ssc =>
val output: Seq[Seq[(Double, Double)]] = runStreams(ssc, numBatches, numBatches)
val rmse = calculateRMSE(output, nPoints)
println(s"RMSE = $rmse")
}
// With hyper-parameters optimization
val gridParams = Map(
"initialWeights" -> List(Vectors.dense(0.0, 0.0), Vectors.dense(10.0, 10.0)),
"stepSize" -> List(0.1, 0.2, 0.3),
"numIterations" -> List(25, 50)
)
val gridEvals = for (initialWeights <- gridParams("initialWeights");
stepSize <- gridParams("stepSize");
numIterations <- gridParams("numIterations")) yield {
val lr = new StreamingLinearRegressionWithSGD()
.setInitialWeights(initialWeights.asInstanceOf[Vector])
.setStepSize(stepSize.asInstanceOf[Double])
.setNumIterations(numIterations.asInstanceOf[Int])
withStreamingContext(setupStreams(inputData, (inputDStream: DStream[LabeledPoint]) => {
lr.trainOn(inputDStream)
lr.predictOnValues(inputDStream.map(x => (x.label, x.features)))
})) { ssc =>
val output: Seq[Seq[(Double, Double)]] = runStreams(ssc, numBatches, numBatches)
val cvRMSE = calculateRMSE(output, nPoints)
println(s"RMSE = $cvRMSE")
(initialWeights, stepSize, numIterations, cvRMSE)
}
}
// Save the evaluations for further visualization
// val gridEvalsRDD = sc.parallelize(gridEvals)
// gridEvalsRDD.coalesce(1)
// .map(e => "%.3f\t%.3f\t%d\t%.3f".format(e._1, e._2, e._3, e._4))
// .saveAsTextFile("data/mllib/streaming")
}
}
// scalastyle:on
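Since gridEvals is just a small in-memory collection by the time it would be saved, one way to sidestep the sc/ssc conflict (a hypothetical sketch, not from the original code) is to skip sc.parallelize entirely and write the evaluations with plain java.io:
// Hypothetical alternative to the commented-out block: write the grid
// evaluations directly to a local file, so no second SparkContext is needed.
import java.io.PrintWriter

def saveGridEvals(gridEvals: Seq[(Any, Any, Any, Double)], path: String): Unit = {
  val out = new PrintWriter(path)
  try gridEvals.foreach { case (weights, stepSize, numIterations, rmse) =>
    out.println(s"$weights\t$stepSize\t$numIterations\t${"%.3f".format(rmse)}")
  } finally out.close()
}

// e.g. saveGridEvals(gridEvals, "data/mllib/streaming/grid-evals.tsv")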