Race condition when generating data with a GenericInputFormat - scala

I'm trying Flink and wrote the following example program:
object IFJob {
@SerialVersionUID(1L)
final class StringInputFormat extends GenericInputFormat[String] {
val N = 100
var i = 0L
override def reachedEnd(): Boolean = this.synchronized {
i >= N
}
override def nextRecord(ot: String): String = this.synchronized {
i += 1
return (i % 2) + ""
}
}
def main(args: Array[String]) {
val env = ExecutionEnvironment.getExecutionEnvironment
val text: DataSet[String] = env.createInput(new StringInputFormat())
val map = text.map {
(_, 1)
}
// map.print()
val by = map.groupBy(0)
val aggregate: AggregateDataSet[(String, Int)] = by.aggregate(Aggregations.SUM, 1)
aggregate.print()
}
}
I create a StringInputFormat once and read it in parallel (with the default parallelism of 8).
When I run the above program, the results vary between executions, i.e., they are not deterministic; the counts come out duplicated anywhere between 1 and 8 times.
For example I get the following results:
// first run
(0,150)
(1,150)
// second run
(0,50)
(1,50)
// third run
(0,200)
(1,200)
The expected result would be
(0,400)
(1,400)
because the StringInputFormat should be read by 8 parallel instances, each generating 50 "0" and 50 "1" records.
I even added synchronization to the input format, but it didn't help.
What am I missing in the Flink computation model?

The behavior you observe is the result of how Flink assigns work to an InputFormat. This works as follows:
On the master (JobManager), the createInputSplits() method is called which returns an array of InputSplit. An InputSplit is a chunk of data to read (or generate). The GenericInputFormat creates one InputSplit for each parallel task instance. In your case, it creates 8 InputSplit objects and each InputSplit should generate 50 "1" and 50 "0" records.
The parallel instances of a DataSourceTask are started on the workers (TaskManagers). Each DataSourceTask has its own instance of the InputFormat.
Once started, a DataSourceTask requests an InputSplit from the master and calls the open() method of its InputFormat with that InputSplit. When the InputFormat has finished processing the InputSplit, the DataSourceTask requests a new one from the master.
In your case, each InputSplit is processed very quickly. Hence, there is a race between the DataSourceTasks requesting InputSplits for their InputFormats, and some InputFormats process more than one InputSplit. Since an InputFormat does not reset its internal state (i.e., set i = 0) when it opens a new InputSplit, it only generates data for the first InputSplit it processes.
You can fix this by adding this method to the StringInputFormat:
override def open(split: GenericInputSplit): Unit = {
super.open(split)
i = 0
}
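Putting the two pieces together, a minimal sketch of the corrected input format (the per-method synchronization from the question is dropped, since each DataSourceTask works on its own InputFormat instance):
@SerialVersionUID(1L)
final class StringInputFormat extends GenericInputFormat[String] {
  val N = 100
  var i = 0L

  // Reset the counter for every InputSplit this instance is asked to process.
  override def open(split: GenericInputSplit): Unit = {
    super.open(split)
    i = 0
  }

  override def reachedEnd(): Boolean = i >= N

  override def nextRecord(ot: String): String = {
    i += 1
    (i % 2) + ""
  }
}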

Related

Cache Cassandra table in scala application

I need to get some data from Cassandra for entries in a Kafka-Streams streaming application. I'd need to perform the join on ID. I'd like to set up a cache to save time used for queries.
The table is simple:
id | name
---|-----
1 |Mike
My plan is straightforward: query the table from the database, then store the results in a Map[Int, String].
The main problem is - data may change in the table and needs to be updated periodically, so I need to query it from time to time.
So far I've come up with a threaded solution like this:
// local database mirror
class Mirror(user: String, password: String) extends Runnable {
var database: Map[Int, String] = Map[Int, String]() withDefaultValue "undefined"
def run(): Unit = {
update()
}
//
def update(): Unit = {
println("update")
database.synchronized {
println("sync-update")
// val c = Driver.getConnection(...)
// database = c.execute(select id, name from table). ...
database += (1 -> "one")
Thread.sleep(100)
// c.close()
}
}
def get(k: Int): Option[String] = {
println("get")
database.synchronized {
println("sync-get")
if (! (database contains k)) {
update()
database.get(k)
} else {
database.get(k)
}
}
}
}
Main looks like this:
def main(args: Array[String]): Unit = {
val db = new Mirror("u", "p")
val ex = new ScheduledThreadPoolExecutor(1)
val f = ex.scheduleAtFixedRate(db, 100, 100, TimeUnit.SECONDS)
while(true) { // simulate stream
val res = db.get(1)
println(res)
Thread.sleep(10000)
}
}
It seems to work fine, but are there any pitfalls in my code? In particular, I'm not confident about the thread safety of the update and get functions.
If you are not opposed to using Akka I would look at Akka Streams; specifically Alpakka to do this. There's no need to reinvent the wheel if you don't have to.
That being said the code has the following problems:
The existence check on the cache will not help if the entries in Cassandra are updated; it only helps if they are missing from your cache.
Look at using a reentrant read/write lock if you believe that most of the time your cache will have the current entries. This will help with contention if you have multiple threads calling your mirror (a sketch follows below).
Again, I would highly recommend you look at Akka Streams with Alpakka, because you can do what you want with that tool without having to write a bunch of code yourself.
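For the read/write-lock suggestion, here is a rough sketch under the question's setup (the Cassandra query is left as a commented placeholder; the refresh-on-miss logic is intentionally dropped, since the existence check does not help with updated entries):
import java.util.concurrent.locks.ReentrantReadWriteLock

// Sketch only: many threads may call get() concurrently; update() takes the write lock.
class Mirror(user: String, password: String) extends Runnable {
  private val lock = new ReentrantReadWriteLock()
  private var database: Map[Int, String] = Map[Int, String]() withDefaultValue "undefined"

  def run(): Unit = update()

  def update(): Unit = {
    val w = lock.writeLock()
    w.lock()
    try {
      // val c = Driver.getConnection(...)          // as in the question
      // database = c.execute(select id, name ...)  // rebuild the map from Cassandra
      database += (1 -> "one")
    } finally w.unlock()
  }

  def get(k: Int): Option[String] = {
    val r = lock.readLock()
    r.lock()
    try database.get(k)
    finally r.unlock()
  }
}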

Akka streams: dealing with futures within graph stage

Within an akka stream stage FlowShape[A, B], part of the processing I need to do on the A's is to save/query a datastore with a query built from A's data. But the datastore driver returns a Future, and I am not sure how best to deal with it (my main question here).
case class Obj(a: String, b: Int, c: String)
case class Foo(myobject: Obj, name: String)
case class Bar(st: String)
//
class SaveAndGetId extends GraphStage[FlowShape[Foo, Bar]] {
val dao = new DbDao // some dao with an async driver
override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) {
setHandlers(in, out, new InHandler with OutHandler {
override def onPush() = {
val foo = grab(in)
val add = foo.record.value()
val result: Future[String] = dao.saveAndGetRecord(add.myobject)//saves and returns id as string
//the naive approach
val record = Await.result(result, Duration.Inf)
push(out, Bar(record))// ***tests pass every time
//mapping the future approach
result.map { x=>
push(out, Bar(x))
} //***tests fail every time
The next stage depends on the id of the db record returned from the query, but I want to avoid Await. I am not sure why the mapping approach fails:
"it should work" in {
val source = Source.single(Foo(Obj("hello", 1, "world")))
val probe = source
.via(new SaveAndGetId)
.runWith(TestSink.probe)
probe
.request(1)
.expectBarwithId("one")//say we know this will be
.expectComplete()
}
private implicit class RichTestProbe(probe: Probe[Bar]) {
def expectBarwithId(expected: String): Probe[Bar] =
probe.expectNextChainingPF{
case r @ Bar(str) if str == expected => r
}
}
When run with mapping future, I get failure:
should work ***FAILED***
java.lang.AssertionError: assertion failed: expected: message matching partial function but got unexpected message OnComplete
at scala.Predef$.assert(Predef.scala:170)
at akka.testkit.TestKitBase$class.expectMsgPF(TestKit.scala:406)
at akka.testkit.TestKit.expectMsgPF(TestKit.scala:814)
at akka.stream.testkit.TestSubscriber$ManualProbe.expectEventPF(StreamTestKit.scala:570)
The async side channels example in the docs has the future in the constructor of the stage, as opposed to building the future within the stage, so it doesn't seem to apply to my case.
I agree with Ramon. Constructing a new FlowShape is not necessary in this case and is overly complicated. It is much more convenient to use the mapAsync method here:
Here is a code snippet to utilize mapAsync:
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
object MapAsyncExample {
val parallelism: Int = 10
def main(args: Array[String]): Unit = {
Source.repeat(5)
.mapAsync(parallelism)(x => asyncSquare(x))
.runWith(Sink.foreach(println))
}
//This method returns a Future
//You can replace this part with your database operations
def asyncSquare(value: Int): Future[Int] = Future {
value * value
}
}
In the snippet above, Source.repeat(5) is a dummy source that emits 5 indefinitely. There is a sample function asyncSquare which takes an integer and calculates its square in a Future. The .mapAsync(parallelism)(x => asyncSquare(x)) line uses that function and emits the result of the Future to the next stage. In this snippet, the next stage is a sink which prints every item.
parallelism is the maximum number of asyncSquare calls that can run concurrently.
I think your GraphStage is unnecessarily overcomplicated. The below Flow performs the same actions without the need to write a custom stage:
val dao = new DbDao
val parallelism = 10 //number of parallel db queries
val SaveAndGetId : Flow[Foo, Bar, _] =
Flow[Foo]
.map(foo => foo.record.value().myobject)
.mapAsync(parallelism)(rec => dao.saveAndGetRecord(rec))
.map(Bar.apply)
I generally try to treat GraphStage as a last resort, there is almost always an idiomatic way of getting the same Flow by using the methods provided by the akka-stream library.
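If it helps, a hedged sketch of plugging that Flow into a stream (fooSource is a placeholder for wherever the Foo elements come from; an implicit ActorSystem/materializer is assumed to be in scope):
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("example")
// Akka versions before 2.6 also need an implicit ActorMaterializer.

val fooSource: Source[Foo, _] = ??? // placeholder source of Foo elements

fooSource
  .via(SaveAndGetId)              // the Flow[Foo, Bar, _] defined above
  .runWith(Sink.foreach(println)) // materializes to a Future[Done]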

Akka Thread Tuning

I have 100 tasks ("threads") and need to process only 12 at a time, no more than that. After those 12 complete, the next 12 have to be processed, and so on. But only the first set of 12 is processed, and then the application terminates.
Here is my Logic :
class AkkaProcessing extends Actor {
def receive = {
case message: List[Any] =>
var meterName = message(0) // Contains only the first 12 meters; it processes them and terminates, and I am unable to get the remaining ones
val sqlContext = message(1).asInstanceOf[SQLContext]
val FlagDF = message(2).asInstanceOf[DataFrame]
{
// ... all the business logic here ...
}
context.system.shutdown()
}
}
}
object Processing {
def main(args: Array[String]) = {
val rawBuff = new ArrayBuffer[Any]()
val actorSystem = ActorSystem("ActorSystem") // Creating ActorSystem
val actor = actorSystem.actorOf(Props[AkkaProcessing].withRouter(RoundRobinPool(200)), "my-Actor")
implicit val executionContext = actorSystem.dispatchers.lookup("akka.actor.my-dispatcher")
for (i <- 0 until meter_list.length) {
var meterName = meter_list(i) // All 100 Meters here
rawBuff.append(meterName, sqlContext, FlagDF)
actor ! rawBuff.toList
}
}
}
Any inputs are highly appreciated.
I think you would be best off creating two actor types: consumers (which run in parallel) and a coordinator (which takes the batch of 12 tasks and passes them to the consumers). The coordinator waits for the consumers to finish and then runs the next batch.
See this answer for a code example: Can Scala actors process multiple messages simultaneously?
Failing that, you could just use Futures in a similar manner.
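For the Futures variant, a minimal sketch that caps concurrency at 12 with a fixed-size thread pool (processMeter is a hypothetical stand-in for the business logic; meter_list is the collection from the question):
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// At most 12 tasks run at once; the remaining ones queue inside the pool.
val pool = Executors.newFixedThreadPool(12)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

def processMeter(meterName: Any): Unit = ??? // hypothetical: your business logic per meter

val work: List[Future[Unit]] =
  meter_list.toList.map(meterName => Future(processMeter(meterName)))

// Wait for everything to finish, then release the pool.
Await.result(Future.sequence(work), Duration.Inf)
pool.shutdown()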

Akka stream - Splitting a stream of ByteString into multiple files

I'm trying to split an incoming Akka stream of bytes (from the body of an http request, but it could also be from a file) into multiple files of a defined size.
For example, if I'm uploading a 10Gb file, it would create something like 10 files of 1Gb. The files would have randomly generated names. My issue is that I don't really know where to start, because all the responses and examples I've read either store the whole chunk in memory or split on a string delimiter, and I can't hold "chunks" of 1Gb in memory and then just write them to disk.
Is there any easy way to perform that kind of operation? My only idea would be to use something like this http://doc.akka.io/docs/akka/2.4/scala/stream/stream-cookbook.html#Chunking_up_a_stream_of_ByteStrings_into_limited_size_ByteStrings but transformed into something like FlowShape[ByteString, File]: writing the chunks into a file myself until the maximum file size is reached, then creating a new file, and so on, and streaming back the created files. That looks like an atrocious idea that doesn't use Akka properly.
Thanks in advance
I often revert to purely functional, non-akka, techniques for problems such as this and then "lift" those functions into akka constructs. By this I mean I try to use only scala "stuff" and then try to wrap that stuff inside of akka later on...
File Creation
Starting with the FileOutputStream creation based on "randomly generated names":
def randomFileNameGenerator : String = ??? //not specified in question
import java.io.FileOutputStream
val randomFileOutGenerator : () => FileOutputStream =
() => new FileOutputStream(randomFileNameGenerator)
State
There needs to be some way of storing the "state" of the current file, e.g. the number of bytes already written:
case class FileState(byteCount : Int = 0,
fileOut : FileOutputStream = randomFileOutGenerator())
File Writing
First we determine if we'd breach the maximum file size threshold with the given ByteString:
import akka.util.ByteString
val isEndOfChunk : (FileState, ByteString, Int) => Boolean =
(state, byteString, maxBytes) =>
state.byteCount + byteString.length > maxBytes
We then have to write the function that creates a new FileState if we've maxed out the capacity of the current one or returns the current state if it is still below capacity:
val closeFileInState : FileState => Unit =
(_ : FileState).fileOut.close()
val getCurrentFileState : (FileState, ByteString, Int) => FileState =
(state, byteString, maxBytes) =>
if(isEndOfChunk(state, byteString, maxBytes)) {
closeFileInState(state)
FileState()
}
else
state
The only thing left is to write to the FileOutputStream:
val writeToFileAndReturn : (FileState, ByteString) => FileState =
(fileState, byteString) => {
fileState.fileOut write byteString.toArray
fileState copy (byteCount = fileState.byteCount + byteString.size)
}
//the signature ordering will become useful
def writeToChunkedFile(maxBytes : Int)(fileState : FileState, byteString : ByteString) : FileState =
writeToFileAndReturn(getCurrentFileState(fileState, byteString, maxBytes), byteString)
Fold On Any GenTraversableOnce
In scala a GenTraversableOnce is any collection, parallel or not, that has the foldLeft operator. These include Iterator, Vector, Array, Seq, scala Stream, ... The final writeToChunkedFile function perfectly matches the signature expected by GenTraversableOnce#foldLeft:
val anyIterable : Iterable[ByteString] = ???
val finalFileState = anyIterable.foldLeft(FileState())(writeToChunkedFile(maxBytes))
One final loose end; the last FileOutputStream needs to be closed as well. Since the fold will only emit that last FileState we can close that one:
closeFileInState(finalFileState)
Akka Streams
Akka's Flow gets its fold from FlowOps#fold, which happens to match the foldLeft signature. Therefore we can "lift" our regular functions into stream values, similar to the way we used Iterable foldLeft:
import akka.stream.scaladsl.Flow
def chunkerFlow(maxBytes : Int) : Flow[ByteString, FileState, _] =
Flow[ByteString].fold(FileState())(writeToChunkedFile(maxBytes))
The nice part about handling the problem with regular functions is that they can be used within other asynchronous frameworks beyond streams, e.g. Futures or Actors. You also don't need an akka ActorSystem in unit testing, just regular language data structures.
import akka.stream.scaladsl.Sink
import scala.concurrent.Future
def byteStringSink(maxBytes : Int) : Sink[ByteString, _] =
chunkerFlow(maxBytes) to (Sink foreach closeFileInState)
You can then use this Sink to drain HttpEntity coming from HttpRequest.
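A rough sketch of that last step with akka-http, assuming an implicit materializer is in scope and picking 1Gb as the maxBytes value from the question:
import akka.http.scaladsl.model.HttpRequest
import akka.stream.scaladsl.Source
import akka.util.ByteString

val maxBytes: Int = 1024 * 1024 * 1024 // assumed 1Gb per file, as in the question

def drainToChunkedFiles(request: HttpRequest) = {
  // entity.dataBytes is a Source[ByteString, Any]; run it into the chunking Sink.
  val body: Source[ByteString, Any] = request.entity.dataBytes
  body.runWith(byteStringSink(maxBytes))
}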
You could write a custom graph stage.
Your issue is similar to the one faced in Alpakka during uploads to Amazon S3 (see the Alpakka S3 connector).
For some reason, however, the S3 connector's DiskBuffer writes the entire incoming source of ByteStrings to a file before emitting the chunk for further stream processing.
What we want is something similar that limits a source of ByteStrings to a specific length. In that example, the incoming Source[ByteString, _] is limited to a source of fixed-size ByteStrings by maintaining a memory buffer. I adapted it to work with files.
The advantage of this is that you can use a dedicated thread pool for this stage to do the blocking IO. For a well-behaved reactive stream, you want to keep blocking IO on a separate thread pool in the actor system (a sketch of attaching such a dispatcher follows the code below).
PS: this does not try to make files of an exact size, so if we read 2KB extra into a 100MB file, we write those extra bytes to the current file rather than trying to achieve the exact size.
import java.io.{FileOutputStream, RandomAccessFile}
import java.nio.channels.FileChannel
import java.nio.file.Path
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import akka.stream._
import akka.util.ByteString
case class MultipartUploadChunk(path: Path, size: Int, partNumber: Int)
//Starts writing the byteStrings received from upstream to a file. Emits a path after writing a partSize number of bytes. Does not attempt to write an exact number of bytes.
class FileChunker(maxSize: Int, tempDir: Path, partSize: Int)
extends GraphStage[FlowShape[ByteString, MultipartUploadChunk]] {
assert(maxSize > partSize, "Max size should be larger than part size. ")
val in: Inlet[ByteString] = Inlet[ByteString]("PartsMaker.in")
val out: Outlet[MultipartUploadChunk] = Outlet[MultipartUploadChunk]("PartsMaker.out")
override val shape: FlowShape[ByteString, MultipartUploadChunk] = FlowShape.of(in, out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
new GraphStageLogic(shape) with OutHandler with InHandler {
var partNumber: Int = 0
var length: Int = 0
var currentBuffer: Option[PartBuffer] = None
override def onPull(): Unit =
if (isClosed(in)) {
emitPart(currentBuffer, length)
} else {
pull(in)
}
override def onPush(): Unit = {
val elem = grab(in)
length += elem.size
val currentPart: PartBuffer = currentBuffer match {
case Some(part) => part
case None =>
val newPart = createPart(partNumber)
currentBuffer = Some(newPart)
newPart
}
currentPart.fileChannel.write(elem.asByteBuffer)
if (length > partSize) {
emitPart(currentBuffer, length)
// 3. Increment the part number and reset the length.
partNumber += 1
length = 0
} else {
pull(in)
}
}
override def onUpstreamFinish(): Unit =
if (length > 0) emitPart(currentBuffer, length) // emit part only if something is still left in current buffer.
private def emitPart(maybePart: Option[PartBuffer], size: Int): Unit = maybePart match {
case Some(part) =>
//1. flush the part buffer and truncate the file.
part.fileChannel.force(false)
// Not sure why this truncate is needed, but it was being done in Alpakka; probably safe to do.
// val ch = new FileOutputStream(part.path.toFile).getChannel
// try {
// println(s"truncating to size $size")
// ch.truncate(size)
// } finally {
// ch.close()
// }
// 2. Emit the part.
val chunk = MultipartUploadChunk(path = part.path, size = length, partNumber = partNumber)
push(out, chunk)
part.fileChannel.close() // TODO: probably close elsewhere.
currentBuffer = None
//complete stage if in is closed.
if (isClosed(in)) completeStage()
case None => if (isClosed(in)) completeStage()
}
private def createPart(partNum: Int): PartBuffer = {
val path: Path = partFile(partNum)
//currentPart.deleteOnExit() //TODO: Enable in prod. requests that the file be deleted when VM dies.
PartBuffer(path, new RandomAccessFile(path.toFile, "rw").getChannel)
}
/**
* Creates a file in the temp directory with name bmcs-buffer-part-$partNumber
* @param partNumber the part number in the multipart upload.
* @return
* TODO: add unique id to the file name. for multiple
*/
private def partFile(partNumber: Int): Path =
tempDir.resolve(s"bmcs-buffer-part-$partNumber.bin")
setHandlers(in, out, this)
}
case class PartBuffer(path: Path, fileChannel: FileChannel) //TODO: see if you need mapped byte buffer. might be ok with just output stream / channel.
}
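To run this stage on a dedicated thread pool for the blocking file IO, as mentioned above, something along these lines should work (the dispatcher name my-blocking-io-dispatcher is a hypothetical entry you would define in your configuration; the size values are arbitrary):
import java.nio.file.Paths
import akka.stream.ActorAttributes
import akka.stream.scaladsl.Flow
import akka.util.ByteString

// Wrap the GraphStage in a Flow and pin it to a blocking-IO dispatcher.
val chunkingFlow: Flow[ByteString, MultipartUploadChunk, _] =
  Flow
    .fromGraph(new FileChunker(maxSize = 1024 * 1024 * 1024, tempDir = Paths.get("/tmp"), partSize = 64 * 1024 * 1024))
    .withAttributes(ActorAttributes.dispatcher("my-blocking-io-dispatcher"))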
The idiomatic way to split a ByteString stream to multiple files is to use Alpakka's LogRotatorSink. From the documentation:
This sink takes a function as a parameter which returns a ByteString => Option[Path] function. If the generated function returns a path, the sink will rotate the file output to this new path and the actual ByteString will be written to this new file too. With this approach the user can define a custom stateful file generation implementation.
The following fileSizeRotationFunction is also from the documentation:
val fileSizeRotationFunction = () => {
val max = 10 * 1024 * 1024
var size: Long = max
(element: ByteString) =>
{
if (size + element.size > max) {
val path = Files.createTempFile("out-", ".log")
size = element.size
Some(path)
} else {
size += element.size
None
}
}
}
An example of its use:
val source: Source[ByteString, _] = ???
source.runWith(LogRotatorSink(fileSizeRotationFunction))

How can Akka streams be materialized continually?

I am using Akka Streams in Scala to poll from an AWS SQS queue using the AWS Java SDK. I created an ActorPublisher which dequeues messages on a two second interval:
class SQSSubscriber(name: String) extends ActorPublisher[Message] {
implicit val materializer = ActorMaterializer()
val schedule = context.system.scheduler.schedule(0 seconds, 2 seconds, self, "dequeue")
val client = new AmazonSQSClient()
client.setRegion(RegionUtils.getRegion("us-east-1"))
val url = client.getQueueUrl(name).getQueueUrl
val MaxBufferSize = 100
var buf = Vector.empty[Message]
override def receive: Receive = {
case "dequeue" =>
val messages = iterableAsScalaIterable(client.receiveMessage(new ReceiveMessageRequest(url)).getMessages).toList
messages.foreach(self ! _)
case message: Message if buf.size == MaxBufferSize =>
log.error("The buffer is full")
case message: Message =>
if (buf.isEmpty && totalDemand > 0)
onNext(message)
else {
buf :+= message
deliverBuf()
}
case Request(_) =>
deliverBuf()
case Cancel =>
context.stop(self)
}
@tailrec final def deliverBuf(): Unit =
if (totalDemand > 0) {
if (totalDemand <= Int.MaxValue) {
val (use, keep) = buf.splitAt(totalDemand.toInt)
buf = keep
use foreach onNext
} else {
val (use, keep) = buf.splitAt(Int.MaxValue)
buf = keep
use foreach onNext
deliverBuf()
}
}
}
In my application, I am attempting to run the flow at a 2 second interval as well:
val system = ActorSystem("system")
val sqsSource = Source.actorPublisher[Message](SQSSubscriber.props("queue-name"))
val flow = Flow[Message]
.map { elem => system.log.debug(s"${elem.getBody} (${elem.getMessageId})"); elem }
.to(Sink.ignore)
system.scheduler.schedule(0 seconds, 2 seconds) {
flow.runWith(sqsSource)(ActorMaterializer()(system))
}
However, when I run my application I receive java.util.concurrent.TimeoutException: Futures timed out after [20000 milliseconds] and subsequent dead letter notices which is caused by the ActorMaterializer.
Is there a recommended approach for continually materializing an Akka Stream?
I don't think you need to create a new ActorPublisher every 2 seconds. This seems redundant and wasteful of memory. Also, I don't think an ActorPublisher is necessary. From what I can tell of the code, your implementation will have an ever growing number of Streams all querying the same data. Each Message from the client will be processed by N different akka Streams and, even worse, N will grow over time.
Iterator For Infinite Loop Querying
You can get the same behavior from your ActorPublisher by using scala's Iterator. It is possible to create an Iterator which continuously queries the client:
//setup the client
val client = {
val sqsClient = new AmazonSQSClient()
sqsClient setRegion (RegionUtils getRegion "us-east-1")
sqsClient
}
val url = client.getQueueUrl(name).getQueueUrl
//single query
def queryClientForMessages : Iterable[Message] = iterableAsScalaIterable {
(client receiveMessage new ReceiveMessageRequest(url)).getMessages
}
def messageListIterator : Iterator[Iterable[Message]] =
Iterator continually queryClientForMessages
//messages one-at-a-time "on demand", no timer pushing you around
def messageIterator() : Iterator[Message] = messageListIterator flatMap identity
This implementation only queries the client when all previous Messages have been consumed and is therefore truly reactive. No need to keep track of a buffer with fixed size. Your solution needs a buffer because the creation of Messages (via a timer) is de-coupled from the consumption of Messages (via println). In my implementation, creation & consumption are tightly coupled via back-pressure.
Akka Stream Source
You can then use this Iterator generator-function to feed an akka stream Source:
def messageSource : Source[Message, _] = Source.fromIterator(() => messageIterator())
Flow Formation
And finally you can use this Source to perform the println (As a side note: your flow value is actually a Sink since Flow + Sink = Sink). Using your flow value from the question:
messageSource runWith flow
One akka Stream processing all messages.
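Tying it together, a minimal sketch of materializing the stream exactly once instead of on a timer (same logging "flow" as in the question; back-pressure, not a scheduler, now drives how often the SQS client is queried):
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink}
import com.amazonaws.services.sqs.model.Message

val system = ActorSystem("sqs-example")
implicit val materializer = ActorMaterializer()(system)

val flow = Flow[Message]
  .map { elem => system.log.debug(s"${elem.getBody} (${elem.getMessageId})"); elem }
  .to(Sink.ignore)

// Materialize once; the Iterator-backed Source only queries SQS when demand arrives.
messageSource runWith flow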