I am using Akka Streams to process a CSV file containing 1839 lines. I have added counters to count the number of lines processed.
Here is my source; I have made sure that each line in the input file is less than 700 characters.
case class ParsedLine(input: String, field1: String, field2: String, field3: String)

val counter0 = new AtomicInteger()
val counter1 = new AtomicInteger()

val lineSource = FileIO
  .fromPath(Paths.get(InputFile))
  .via(Framing.delimiter(ByteString("\n"), 1024, allowTruncation = true))
  .map { l =>
    counter0.incrementAndGet()
    l.utf8String
  }
val parseLine = Flow[String].map { l =>
  counter1.incrementAndGet() // Counter1 is incremented here, one stage after Counter0
  val words = l.split(",")
  ParsedLine(l, words(0), words(1), words(2))
}
This source is processed as follows; for every line in the source there should be a processed line in the output.
val done = lineSource
  .via(parseLine)
  .to(Sink.foreach(_.input))
  .run()

done.onComplete {
  case Success(_) =>
    println("Counter0 " + counter0.get())
    println("Counter1 " + counter1.get())
    system.terminate()
  case Failure(e) =>
    println(e.getLocalizedMessage)
    system.terminate()
}
The interesting thing is that the counters print as follows, and each time I get different numbers. If I remove the .to(Sink.foreach(_.input)) line I get both counts as 1839.
Counter0 1445
Counter1 1667
First of all, I expected Counter0 to have a higher value than Counter1, because Counter0 is incremented in a stage before Counter1. More importantly, I expected all the lines to be processed, so both counters should have printed the total number of lines, 1839.
Any idea what is going on here? Is Akka Streams dropping items in between?
You are actually not waiting for the stream to finish.
You are attaching the Sink.foreach(...) using to, which discards the materialized value of the Sink.foreach stage and keeps only the materialized value of the earlier stage.
Also, keep in mind that you are doing the same at every step (via, map, via and then to). So you are only keeping track of the materialized value of the first stage, the one created by FileIO.fromPath(...). This means that you are only waiting for the full file to be read, not for any of the subsequent processing steps to finish.
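To illustrate, a minimal sketch of that default: to keeps only the left (upstream) materialized value, so
lineSource.via(parseLine).to(Sink.foreach(_.input))
// is equivalent to
lineSource.via(parseLine).toMat(Sink.foreach(_.input))(Keep.left)
and the Future you get back from run() here is the file-reading IOResult, not the sink's completion.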
You just need to preserve the materialized values of both stages and wait for both of them to finish.
val stream =
  lineSource
    .via(parseLine)
    .toMat(Sink.foreach(_.input))(Keep.both)

val resultFutures: (Future[IOResult], Future[Done]) = stream.run()

val resultsFuture = Future.sequence(List(resultFutures._1, resultFutures._2))

resultsFuture.onComplete {
  case Success(List(ioResult, done)) =>
    println(ioResult)
    println(done)
    println(counter0.get())
    actorSystem.terminate()
  case Failure(e) =>
    println(e.getLocalizedMessage)
    actorSystem.terminate()
}
Or you can choose to keep track of just the last processing stage (which is Sink.foreach(...) in this case):
val stream =
  lineSource
    .via(parseLine)
    .toMat(Sink.foreach(_.input))(Keep.right)

val resultFuture: Future[Done] = stream.run()

resultFuture.onComplete {
  case Success(_) =>
    println("Counter0 " + counter0.get())
    actorSystem.terminate()
  case Failure(e) =>
    println(e.getLocalizedMessage)
    actorSystem.terminate()
}
Related
I have a graph that reads lines from multiple gzipped files and writes those lines to another set of gzipped files, mapped according to some value in each line.
It works correctly against small data sets, but fails to terminate on larger data. (It may not be the size of the data that's to blame, as I have not run it enough times to be sure - it takes a while).
def files: Source[File, NotUsed] =
  Source.fromIterator(() =>
    Files
      .fileTraverser()
      .breadthFirst(inDir)
      .asScala
      .filter(_.getName.endsWith(".gz"))
      .toIterator)

def extract =
  Flow[File]
    .mapConcat[String](unzip)
    .mapConcat(s =>
      (JsonMethods.parse(s) \ "tk").extract[Array[String]].map(_ -> s).to[collection.immutable.Iterable])
    .groupBy(1 << 16, _._1)
    .groupedWithin(1000, 1.second)
    .map { lines =>
      val w = writer(lines.head._1)
      w.println(lines.map(_._2).mkString("\n"))
      w.close()
      Done
    }
    .mergeSubstreams

def unzip(f: File) = {
  scala.io.Source
    .fromInputStream(new GZIPInputStream(new FileInputStream(f)))
    .getLines
    .toIterable
    .to[collection.immutable.Iterable]
}

def writer(tk: String): PrintWriter =
  new PrintWriter(
    new OutputStreamWriter(
      new GZIPOutputStream(
        new FileOutputStream(new File(outDir, s"$tk.json.gz"), true)
      )))

val process = files.via(extract).toMat(Sink.ignore)(Keep.right).run()

Await.result(process, Duration.Inf)
The thread dump shows that the process is WAITING at Await.result(process, Duration.Inf) and nothing else is happening.
OpenJDK v11 with Akka v2.5.15
Most likely it's stuck in groupBy because it has run out of available threads in the dispatcher to gather items into 2^16 groups for all the sources.
So if I were you I'd probably implement the grouping in extract semi-manually, using statefulMapConcat with a mutable Map[KeyType, List[String]]. Or buffer lines with groupedWithin first and split them into groups that you write to different files in Sink.foreach, as sketched below.
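For example, a minimal sketch of the groupedWithin variant, reusing the files, unzip and writer helpers from the question (and its implicit json4s Formats): batch the key/line pairs and split each batch by key inside the sink, so no groupBy substreams are needed at all.
val process =
  files
    .mapConcat[String](unzip)
    .mapConcat(s =>
      (JsonMethods.parse(s) \ "tk").extract[Array[String]].map(_ -> s).to[collection.immutable.Iterable])
    .groupedWithin(1000, 1.second)
    .toMat(Sink.foreach { batch =>
      // one pass per batch: group the buffered (key, line) pairs
      // and append them to the corresponding output file
      batch.groupBy(_._1).foreach { case (key, lines) =>
        val w = writer(key) // writer(...) opens the file in append mode
        w.println(lines.map(_._2).mkString("\n"))
        w.close()
      }
    })(Keep.right)
    .run()

Await.result(process, Duration.Inf)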
I have a very simple Akka Streams flow that reads a message from Kafka using Alpakka, performs some manipulation on the message, and indexes it into Elasticsearch.
I'm using committableSource, so I have at-least-once semantics. I commit my offset only when indexing into ES succeeds; if it fails, I will read the message again from the latest committed offset.
val decider: Supervision.Decider = {
  case _: Throwable => Supervision.Restart
  case _            => Supervision.Restart
}

val config: Config = context.system.settings.config.getConfig("akka.kafka.consumer")

val flow: Flow[CommittableMessage[String, String], Done, NotUsed] =
  Flow[CommittableMessage[String, String]]
    .map(msg => Event(msg.committableOffset, Success(Json.parse(msg.record.value()))))
    .mapAsync(10) { event => indexEvent(event.json.get).map(f => event.copy(json = f)) }
    .mapAsync(10) { f =>
      f.json match {
        case Success(_)  => f.committableOffset.commitScaladsl()
        case Failure(ex) => throw new StreamFailedException(ex.getMessage, ex)
      }
    }

val r: Flow[CommittableMessage[String, String], Done, NotUsed] = RestartFlow.onFailuresWithBackoff(
  minBackoff = 3.seconds,
  maxBackoff = 3.seconds,
  randomFactor = 0.2, // adds 20% "noise" to vary the intervals slightly
  maxRestarts = 20    // limits the amount of restarts to 20
)(() => {
  println("Creating flow")
  flow
})

val consumerSettings: ConsumerSettings[String, String] =
  ConsumerSettings(config, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("group1")
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

val restartSource: Source[CommittableMessage[String, String], NotUsed] = RestartSource.withBackoff(
  minBackoff = 3.seconds,
  maxBackoff = 30.seconds,
  randomFactor = 0.2, // adds 20% "noise" to vary the intervals slightly
  maxRestarts = 20    // limits the amount of restarts to 20
) { () =>
  Consumer.committableSource(consumerSettings, Subscriptions.topics("test"))
}

implicit val mat: ActorMaterializer =
  ActorMaterializer(ActorMaterializerSettings(context.system).withSupervisionStrategy(decider))

restartSource
  .via(flow)
  .toMat(Sink.ignore)(Keep.both)
  .run()
What I would like to achieve is to restart the entire Source -> Flow -> Sink graph if, for any reason, I was not able to index a message in Elasticsearch.
I tried the following:
Supervision.Decider - it looks like the flow was recreated, but no message was pulled from Kafka, obviously because it remembers its offset.
RestartSource - doesn't help either, because the exception happens in the flow stage.
RestartFlow - doesn't help as well, because it restarts only the Flow, but I need to restart the Source from the last successful offset.
Is there any elegant way to do that?
You can combine a restartable source, flow & sink; nothing prevents you from making each part of the graph restartable.
Update:
code example
val sourceFactory = () => Source(1 to 10).via(Flow.fromFunction(x => { println("problematic flow"); x }))
RestartSource.withBackoff(4.seconds, 4.seconds, 0.2)(sourceFactory)
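Applied to the pipeline in the question, a hedged sketch (reusing consumerSettings and flow from there) could look like this; since the offset is committed inside flow only after a successful index, a restart resumes from the last committed offset:
val restartableStream: Source[Done, NotUsed] =
  RestartSource.withBackoff(
    minBackoff = 3.seconds,
    maxBackoff = 30.seconds,
    randomFactor = 0.2,
    maxRestarts = 20
  ) { () =>
    Consumer
      .committableSource(consumerSettings, Subscriptions.topics("test"))
      .via(flow) // a failure while indexing now tears down the consumer as well
  }

val streamCompletion: Future[Done] = restartableStream.runWith(Sink.ignore)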
I read this article on Akka Streams error handling
http://doc.akka.io/docs/akka/2.5.4/scala/stream/stream-error.html
and wrote this code.
val decider: Supervision.Decider = {
  case _: Exception => Supervision.Restart
  case _            => Supervision.Stop
}

implicit val actorSystem = ActorSystem()
implicit val actorMaterializer =
  ActorMaterializer(ActorMaterializerSettings(actorSystem).withSupervisionStrategy(decider))

val source = Source(1 to 10)
val flow = Flow[Int].map { x => if (x != 9) 2 * x else throw new Exception("9!") }
val sink: Sink[Int, Future[Done]] = Sink.foreach[Int](x => println(x))

val graph = RunnableGraph.fromGraph(GraphDSL.create(sink) { implicit builder => s =>
  import GraphDSL.Implicits._
  source ~> flow ~> s.in
  ClosedShape
})

val future = graph.run()
future.onComplete { _ =>
  actorSystem.terminate()
}

Await.result(actorSystem.whenTerminated, Duration.Inf)
This works very well, except that I need to scan the output to see which row did not get processed. Is there a way for me to print/log the row which failed, without putting explicit try/catch blocks in each and every flow that I write?
For example, if I were using actors (as opposed to streams) I could have overridden a lifecycle hook of an actor and logged when it restarted, along with the message that was being processed at the time of the restart.
But here I am not using actors explicitly (although they are used internally). Are there lifecycle events for a Flow / Source / Sink?
Just a small modification to your code:
val decider: Supervision.Decider = {
  case e: Exception =>
    println("Exception handled, recovering stream: " + e.getMessage)
    Supervision.Restart
  case _ => Supervision.Stop
}
If you pass meaningful messages to the exceptions you throw in the stream (the offending line, for example), you can print them in the supervision decider.
I used println to give a quick and short answer, but I strongly recommend using a logging library such as scala-logging.
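For example, a minimal sketch based on the flow from the question, embedding the failing element in the exception message so the decider above can report it:
val flow = Flow[Int].map { x =>
  if (x != 9) 2 * x
  else throw new Exception(s"failed on element $x") // the decider prints: failed on element 9
}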
I have a stream that
listens for HTTP POSTs receiving a list of events,
mapConcats the list of events into individual stream elements,
converts each event into a Kafka record,
produces the records with Reactive Kafka (the Akka Streams Kafka producer sink).
Here is the simplified code
// flow to split group of lines into lines
val splitLines = Flow[List[Evt]].mapConcat(list => list)

// sink to produce kafka records in kafka
val kafkaSink: Sink[Evt, Future[Done]] = Flow[Evt]
  .map(evt => new ProducerRecord[Array[Byte], String](evt.eventType, evt.value))
  .toMat(Producer.plainSink(kafka))(Keep.right)

val routes = {
  path("ingest") {
    post {
      (entity(as[List[ReactiveEvent]]) & extractMaterializer) { (eventIngestList, mat) =>
        val ingest = Source.single(eventIngestList).via(splitLines).runWith(kafkaSink)(mat)
        val result = onComplete(ingest) {
          case Success(value) => complete(s"OK")
          case Failure(ex)    => complete((StatusCodes.InternalServerError, s"An error occurred: ${ex.getMessage}"))
        }
        complete("eventList ingested: " + result)
      }
    }
  }
}
Could you point out to me what runs in parallel and what runs sequentially?
I think the mapConcat makes the events sequential in the stream, so how could I parallelize the stream so that, after the mapConcat, each step is processed in parallel?
Would a simple mapAsyncUnordered be sufficient? Or should I use the GraphDSL with a Balance and Merge?
In your case it will be sequential, I think. Also, you're receiving the whole request before you start pushing data to Kafka. I'd use the extractDataBytes directive, which gives you src: Source[ByteString, Any]. Then I'd process it like this:
src
  .via(Framing.delimiter(ByteString("\n"), 1024 /* max size of a line */, allowTruncation = true).map(_.utf8String))
  .mapConcat { line =>
    line.split(",").toList // mapConcat needs an immutable collection, so convert the Array
  }
  .async
  .runWith(kafkaSink)(mat)
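As a hedged sketch, the directive could replace the entity-based route from the question like this; kafkaSink is reused from the question, and the Evt(...) construction is hypothetical, since the real mapping from CSV fields to Evt is not shown:
path("ingest") {
  post {
    (extractDataBytes & extractMaterializer) { (src, mat) =>
      val ingest = src
        .via(Framing.delimiter(ByteString("\n"), 1024, allowTruncation = true).map(_.utf8String))
        .map(_.split(","))
        .map(fields => Evt(fields(0), fields(1))) // hypothetical: build an Evt from the parsed fields
        .runWith(kafkaSink)(mat)
      onComplete(ingest) {
        case Success(_)  => complete("OK")
        case Failure(ex) => complete((StatusCodes.InternalServerError, s"An error occurred: ${ex.getMessage}"))
      }
    }
  }
}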
class Cleaner {
  def getDocumentData() = {
    val conf = new SparkConf()
      .setAppName("linkin_spark")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")
      .set("spark.rdd.compress", "true")
      .set("spark.storage.memoryFraction", "1")

    val CorpusReader = new Corpus()
    val files = CorpusReader.getListOfFiles("/home/DATA/doc_collection/")

    val sc = new SparkContext(conf)
    val temp = sc.textFile(files(0).toString())
    println(files(0).toString())

    var count = 0
    val regex = """<TAG>""".r
    for (line <- temp) {
      line match {
        case regex(_*) =>
          println(line)
          count += 1
          println(count)
        case _ => null // Handle error - scala.MatchError
      }
    }
    println(s"There are " + count + " documents.") // this comes out to be 0
  }
}
I have a list of text files that I have to read. They are XML-like files, so I need to extract the relevant text. Since they are not standard XML files, I thought of using a regex to get the text. Every document starts with a <TAG> tag, so I tried to count the number of documents in a file, which should equal the number of <TAG> matches in the file. The function above does just that. Originally the file has 264 docs, but when I run the function I get 127 or 137, one of those two numbers. It does not seem to read the whole file. Also, count at the end comes out to be 0.
I am a Scala/Spark newbie.
UPDATE:
var count = sc.accumulator(0)
val regex = """<TAG>""".r
for (line <- temp) {
  println(line)
  line match {
    case regex(_*) =>
      count += 1
      println(s"$line # $count") // There is no "<TAG> # 264" in the output
    case _ => null
  }
}
println(s"There are " + count.value + " documents.")
This change to the program gives me the correct value of count, i.e. 264, but the file is not printed correctly: it appears to start somewhere in the middle and end somewhere in the middle.
UPDATE II:
This has something to do with threads. The SparkConf() has been initialised with local[2], which means 2 threads, if I am not wrong. As soon as I changed it to local[1] I got the correct answer, but I cannot use only one thread.
The file looks like this:
<TAG>
<DOCNO> AP890825-0001 </DOCNO>
<FILEID>AP-NR-08-25-89 0134EDT</FILEID>
<TEXT>
Some large text.
</TEXT>
</TAG>
<TAG> // new doc started
How should I correct this issue?
This is a closure problem: each node gets its own copy of the count variable. You want to use accumulators, or simply perform a reduce.
Created by:
val tagCounter = sc.accumulator(0, "tagCount")
Updated by: (not readable on the nodes)
tagCounter += 1
Readable on the driver by:
tagCounter.value
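As a minimal sketch of the "simply perform a reduce" alternative (no driver-side mutable state at all): count the matches on the executors and bring only the total back to the driver.
val regex = """<TAG>""".r
val docCount = temp
  .map(line => if (regex.findFirstIn(line).isDefined) 1 else 0)
  .reduce(_ + _)
println(s"There are $docCount documents.")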
Following your update:
var count = sc.accumulator(0)
val regex = """<TAG>""".r
val output = for (line <- temp) yield {
  // println(line)
  line match {
    case regex(_*) =>
      count += 1
      // println(s"$line # $count") // There is no "<TAG> # 264" in the output
      line
    case _ => "ERROR"
  }
}
println(s"output size: ${output.count}")
println(s"There are " + count.value + " documents.")
UPDATE AFTER SEEING THE INPUT FORMAT:
You may end up having to use wholeTextFiles to guarantee ordering. Otherwise the distributed nature means that ordering is often not guaranteed, but if you can guarantee ordering (possibly with a custom partitioner or a custom InputFormat), then something like this should work:
sc.parallelize(list)
  .aggregate(Nil: List[String])((accum, value) => {
    value match {
      case regex(_*) => accum :+ value // a <TAG> line starts a new document
      case _ =>
        accum match {
          case Nil => List(value)
          case _   => accum.init :+ (accum.last + "\n" + value) // append the line to the current (last) document
        }
    }
  }, _ ++ _)
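As for the wholeTextFiles suggestion, a hedged sketch: each file comes back as a single (path, content) pair, so line order within a file is preserved and the documents in it can be counted or split with an ordinary regex.
val regex = """<TAG>""".r
val totalDocs = sc
  .wholeTextFiles("/home/DATA/doc_collection/")
  .map { case (_, content) => regex.findAllIn(content).length } // documents per file
  .reduce(_ + _)
println(s"There are $totalDocs documents.")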