The aim is to stream data from a database, perform some computation on each chunk of data (the computation returns a Future of some case class), and send the result to the user as a chunked response. Currently I can stream the data and send the response without performing any computation. However, I cannot work out how to perform the computation and then stream its result.
This is the route I have implemented.
def streamingDB1 =
path("streaming-db1") {
get {
val src = Source.fromPublisher(db.stream(getRds))
complete(src)
}
}
The function getRds returns the rows of a table mapped into a case class (using Slick). Now consider the function compute, which takes each row as input and returns a Future of another case class. Something like:
def compute(x: Tweet): Future[TweetNew] = ???
How can I apply this function to src and send the result of the computation to the user as a chunked (streamed) response?
You could transform the source using mapAsync:
val src =
Source.fromPublisher(db.stream(getRds))
.mapAsync(parallelism = 3)(compute)
complete(src)
Adjust the level of parallelism as needed.
Note that you might need to configure a few settings as mentioned in the Slick documentation:
Note: Some database systems may require session parameters to be set in a certain way to support streaming without caching all data at once in memory on the client side. For example, PostgreSQL requires both .withStatementParameters(rsType = ResultSetType.ForwardOnly, rsConcurrency = ResultSetConcurrency.ReadOnly, fetchSize = n) (with the desired page size n) and .transactionally for proper streaming.
So if you're using PostgreSQL, for example, then your Source might look something like the following:
val src =
Source.fromPublisher(
db.stream(
getRds.withStatementParameters(
rsType = ResultSetType.ForwardOnly,
rsConcurrency = ResultSetConcurrency.ReadOnly,
fetchSize = 10
).transactionally
)
).mapAsync(parallelism = 3)(compute)
You need a way to marshal TweetNew. Also, if you send a chunk of length 0, the client may close the connection.
This code works with curl:
case class TweetNew(str: String)
def compute(string: String) : Future[TweetNew] = Future {
TweetNew(string)
}
val route = path("hello") {
get {
val byteString: Source[ByteString, NotUsed] = Source.apply(List("t1", "t2", "t3"))
.mapAsync(2)(compute)
.map(tweet => ByteString(tweet.str + "\n"))
complete(HttpEntity(ContentTypes.`text/plain(UTF-8)`, byteString))
}
}
I am developing an Apache Flink application using the Scala API (I am pretty new to this technology).
I am using a hashmap to store some values that come from a database, and I need to refresh these values every hour. Is there any way to refresh this hashmap asynchronously?
Thanks!
I'm not sure what you mean by "refresh this hashmap asynchronously" in the context of a Flink workflow.
For what it's worth, if you have a hashmap that's keyed by some piece of data from records flowing through your workflow, then you can use Flink's support for managed key state to store the value (and checkpoint it), and make it queryable.
I interpret your question to mean that you are using some state in Flink to mirror/cache some data that comes from an external database, and you wish to periodically refresh it.
Typically this sort of thing is done by continuously streaming a Change Data Capture (CDC) stream from the external database into Flink. Continuous, streaming solutions are generally a better fit for Flink. But if you want to do this in hourly batches, you could write a custom source or a ProcessFunction that wakes up once an hour, makes a query to the database, and emits a stream of records that can be used to update the operator holding the state.
You can achieve this with Apache Flink's Asynchronous I/O for External Data Access; see the async I/O documentation for details.
Here's a way to use AsyncDataStream to refresh a map periodically by creating an async function and attaching it to a source stream.
class AsyncEnricherFunction extends RichAsyncFunction[String, (String, String)] {
@transient private var m: Map[String, String] = _
@transient private var client: DataBaseClient = _
@transient private var refreshInterval: Int = _
// time of the last successful load, in epoch milliseconds
@transient private var lastRefreshed: Long = _
@throws(classOf[Exception])
override def open(parameters: Configuration): Unit = {
client = new DataBaseClient(host, port, credentials)
refreshInterval = 60 * 60 * 1000 // one hour, in milliseconds
load()
}
private def load(): Unit = {
val str = "select key, value from KeyValue"
m = client.query(str).asMap
lastRefreshed = System.currentTimeMillis()
}
override def asyncInvoke(input: String, resultFuture: ResultFuture[(String, String)]): Unit = {
Future {
if (System.currentTimeMillis() > lastRefreshed + refreshInterval) load()
val enriched = (input, m(input))
resultFuture.complete(Seq(enriched))
}(ExecutionContext.global)
}
override def close() : Unit = { client.close() }
}
val in: DataStream[String] = env.addSource(src)
val enriched = AsyncDataStream.unorderedWait(in, new AsyncEnricherFunction(), 5000, TimeUnit.MILLISECONDS, 100)
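The refresh logic itself does not depend on Flink, so it can be sketched and tested on its own. Below is a minimal sketch, assuming a hypothetical RefreshingCache helper (it is not a Flink API) that takes a loader function and an injectable clock. It swaps whole snapshots atomically, so several asyncInvoke threads never see a half-updated map:

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical helper, not part of Flink: caches a snapshot and reloads it
// once `refreshIntervalMs` has elapsed. `clock` is injectable for testing.
final class RefreshingCache[K, V](
    load: () => Map[K, V],
    refreshIntervalMs: Long,
    clock: () => Long = () => System.currentTimeMillis()
) {
  private case class Snapshot(data: Map[K, V], loadedAt: Long)

  // Initial load happens eagerly, like load() in open() above
  private val current = new AtomicReference(Snapshot(load(), clock()))

  def get(key: K): Option[V] = {
    val snap = current.get()
    val now = clock()
    if (now - snap.loadedAt >= refreshIntervalMs) {
      val fresh = Snapshot(load(), now)
      // If several threads race, only one swap succeeds; the rest reuse the winner
      current.compareAndSet(snap, fresh)
    }
    current.get().data.get(key)
  }
}
```

In the function above, such a cache could be created in open() with the database query as the loader, and asyncInvoke would call get(input). Two threads that notice the expiry at the same moment may both run the loader, which is wasteful but harmless for a read-only lookup table.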
I'm trying to split an incoming Akka stream of bytes (from the body of an http request, but it could also be from a file) into multiple files of a defined size.
For example, if I'm uploading a 10 GB file, it would create something like 10 files of 1 GB each. The files would have randomly generated names. My issue is that I don't really know where to start, because all the answers and examples I've read either store the whole chunk in memory or use a string-based delimiter. But I can't really hold 1 GB "chunks" in memory and then just write them to disk.
Is there an easy way to perform that kind of operation? My only idea would be to use something like this http://doc.akka.io/docs/akka/2.4/scala/stream/stream-cookbook.html#Chunking_up_a_stream_of_ByteStrings_into_limited_size_ByteStrings but transformed into something like FlowShape[ByteString, File], writing the chunks to a file myself until the max file size is reached, then creating a new file, and so on, and streaming back the created files. That looks like an atrocious idea that doesn't use Akka properly.
Thanks in advance
I often fall back on purely functional, non-Akka techniques for problems such as this and then "lift" those functions into Akka constructs. By this I mean I try to use only plain Scala and then wrap it inside Akka later on.
File Creation
Starting with the FileOutputStream creation based on "randomly generated names":
def randomFileNameGenerator : String = ??? //not specified in question
import java.io.FileOutputStream
val randomFileOutGenerator : () => FileOutputStream =
() => new FileOutputStream(randomFileNameGenerator)
State
There needs to be some way of storing the "state" of the current file, e.g. the number of bytes already written:
case class FileState(byteCount : Int = 0,
fileOut : FileOutputStream = randomFileOutGenerator())
File Writing
First we determine if we'd breach the maximum file size threshold with the given ByteString:
import akka.util.ByteString
val isEndOfChunk : (FileState, ByteString, Int) => Boolean =
(state, byteString, maxBytes) =>
state.byteCount + byteString.length > maxBytes
We then have to write the function that creates a new FileState if we've maxed out the capacity of the current one or returns the current state if it is still below capacity:
val closeFileInState : FileState => Unit =
(_ : FileState).fileOut.close()
val getCurrentFileState : (FileState, ByteString, Int) => FileState =
(state, byteString, maxBytes) =>
if(isEndOfChunk(state, byteString, maxBytes)) {
closeFileInState(state)
FileState()
}
else
state
The only thing left is to write to the FileOutputStream:
val writeToFileAndReturn : (FileState, ByteString) => FileState =
(fileState, byteString) => {
fileState.fileOut write byteString.toArray
fileState copy (byteCount = fileState.byteCount + byteString.size)
}
//the signature ordering will become useful
def writeToChunkedFile(maxBytes : Int)(fileState : FileState, byteString : ByteString) : FileState =
writeToFileAndReturn(getCurrentFileState(fileState, byteString, maxBytes), byteString)
Fold On Any GenTraversableOnce
In Scala, a GenTraversableOnce is any collection, parallel or not, that has the fold operators. These include Iterator, Vector, Array, Seq, Scala Stream, ... The final writeToChunkedFile function matches the signature expected by GenTraversableOnce#foldLeft (plain fold requires the accumulator type to be a supertype of the element type, so it does not fit here):
val anyIterable : Iterable[ByteString] = ???
val finalFileState = anyIterable.foldLeft(FileState())(writeToChunkedFile(maxBytes))
One final loose end; the last FileOutputStream needs to be closed as well. Since the fold will only emit that last FileState we can close that one:
closeFileInState(finalFileState)
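To check the rollover logic without touching the disk, the same fold can be run against in-memory buffers. This is only a sketch under assumptions: ChunkState, writeToChunk and chunkAll are hypothetical names, Array[Byte] stands in for ByteString, and a ByteArrayOutputStream stands in for the FileOutputStream. Unlike the file version above, it does not roll over while the current chunk is still empty, so a single oversized element never produces an empty chunk:

```scala
import java.io.ByteArrayOutputStream

// In-memory stand-in for FileState: finished chunks plus the one being filled.
final case class ChunkState(
    done: Vector[Array[Byte]] = Vector.empty,
    current: ByteArrayOutputStream = new ByteArrayOutputStream()
)

def writeToChunk(maxBytes: Int)(state: ChunkState, bytes: Array[Byte]): ChunkState = {
  // Roll over when appending would push a non-empty chunk past maxBytes
  val rolled =
    if (state.current.size() > 0 && state.current.size() + bytes.length > maxBytes)
      ChunkState(state.done :+ state.current.toByteArray)
    else state
  rolled.current.write(bytes, 0, bytes.length)
  rolled
}

def chunkAll(maxBytes: Int, input: Iterable[Array[Byte]]): Vector[Array[Byte]] = {
  val last = input.foldLeft(ChunkState())(writeToChunk(maxBytes))
  // The fold only returns the final state, so flush the chunk still being filled
  if (last.current.size() > 0) last.done :+ last.current.toByteArray else last.done
}
```

For example, three 4-byte elements with maxBytes = 10 end up as one 8-byte chunk and one 4-byte chunk, because the third element would take the first chunk to 12 bytes.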
Akka Streams
Akka Flow gets its fold from FlowOps#fold, whose signature matches that of foldLeft. Therefore we can "lift" our regular functions into stream values similar to the way we used the Iterable fold:
import akka.stream.scaladsl.Flow
def chunkerFlow(maxBytes : Int) : Flow[ByteString, FileState, _] =
Flow[ByteString].fold(FileState())(writeToChunkedFile(maxBytes))
The nice part about handling the problem with regular functions is that they can be used within other asynchronous frameworks beyond streams, e.g. Futures or Actors. You also don't need an akka ActorSystem in unit testing, just regular language data structures.
import akka.stream.scaladsl.Sink
import scala.concurrent.Future
def byteStringSink(maxBytes : Int) : Sink[ByteString, _] =
chunkerFlow(maxBytes) to (Sink foreach closeFileInState)
You can then use this Sink to drain the HttpEntity coming from an HttpRequest.
You could write a custom graph stage.
Your issue is similar to the one faced in Alpakka during upload to Amazon S3 (search for the Alpakka S3 connector; I'm not allowed to post more than 2 links).
For some reason, however, the S3 connector's DiskBuffer writes the entire incoming source of ByteStrings to a file before emitting the chunk for further stream processing.
What we want is something similar to limiting a source of ByteStrings to a specific length. In that example, they limit the incoming Source[ByteString, _] to a source of fixed-size ByteStrings by maintaining a memory buffer. I adapted it to work with files.
The advantage of this is that you can use a dedicated thread pool for this stage to do blocking IO. For a well-behaved reactive stream you want to keep blocking IO in a separate thread pool in the actor system.
PS: this does not try to make files of an exact size, so if we read 2 KB extra in a 100 MB file, we write those extra bytes to the current file rather than trying to achieve the exact size.
import java.io.{FileOutputStream, RandomAccessFile}
import java.nio.channels.FileChannel
import java.nio.file.Path
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import akka.stream._
import akka.util.ByteString
case class MultipartUploadChunk(path: Path, size: Int, partNumber: Int)
// Starts writing the ByteStrings received from upstream to a file. Emits a path after writing partSize bytes. Does not attempt to write an exact number of bytes.
class FileChunker(maxSize: Int, tempDir: Path, partSize: Int)
extends GraphStage[FlowShape[ByteString, MultipartUploadChunk]] {
assert(maxSize > partSize, "Max size should be larger than part size. ")
val in: Inlet[ByteString] = Inlet[ByteString]("PartsMaker.in")
val out: Outlet[MultipartUploadChunk] = Outlet[MultipartUploadChunk]("PartsMaker.out")
override val shape: FlowShape[ByteString, MultipartUploadChunk] = FlowShape.of(in, out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
new GraphStageLogic(shape) with OutHandler with InHandler {
var partNumber: Int = 0
var length: Int = 0
var currentBuffer: Option[PartBuffer] = None
override def onPull(): Unit =
if (isClosed(in)) {
emitPart(currentBuffer, length)
} else {
pull(in)
}
override def onPush(): Unit = {
val elem = grab(in)
length += elem.size
val currentPart: PartBuffer = currentBuffer match {
case Some(part) => part
case None =>
val newPart = createPart(partNumber)
currentBuffer = Some(newPart)
newPart
}
currentPart.fileChannel.write(elem.asByteBuffer)
if (length > partSize) {
emitPart(currentBuffer, length)
//3. increment part number, reset length.
partNumber += 1
length = 0
} else {
pull(in)
}
}
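override def onUpstreamFinish(): Unit =
if (length > 0) emitPart(currentBuffer, length) // emit the last part only if something is left in the current buffer.
else completeStage() // nothing buffered, so complete now instead of waiting for a pull that may never come.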
private def emitPart(maybePart: Option[PartBuffer], size: Int): Unit = maybePart match {
case Some(part) =>
//1. flush the part buffer and truncate the file.
part.fileChannel.force(false)
// not sure why we do this truncate.. but was being done in alpakka. also maybe safe to do.
// val ch = new FileOutputStream(part.path.toFile).getChannel
// try {
// println(s"truncating to size $size")
// ch.truncate(size)
// } finally {
// ch.close()
// }
//2. emit the part
val chunk = MultipartUploadChunk(path = part.path, size = length, partNumber = partNumber)
push(out, chunk)
part.fileChannel.close() // TODO: probably close elsewhere.
currentBuffer = None
//complete stage if in is closed.
if (isClosed(in)) completeStage()
case None => if (isClosed(in)) completeStage()
}
private def createPart(partNum: Int): PartBuffer = {
val path: Path = partFile(partNum)
//currentPart.deleteOnExit() //TODO: Enable in prod. requests that the file be deleted when VM dies.
PartBuffer(path, new RandomAccessFile(path.toFile, "rw").getChannel)
}
/**
* Creates a file in the temp directory with name bmcs-buffer-part-$partNumber
* @param partNumber the part number in multipart upload.
* @return the path of the part file
* TODO: add a unique id to the file name, for multiple uploads
*/
private def partFile(partNumber: Int): Path =
tempDir.resolve(s"bmcs-buffer-part-$partNumber.bin")
setHandlers(in, out, this)
}
case class PartBuffer(path: Path, fileChannel: FileChannel) //TODO: see if you need mapped byte buffer. might be ok with just output stream / channel.
}
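As noted in the PS, the stage above lets a part overshoot partSize. If exact part sizes matter, one option is to split the incoming element at the part boundary before writing. The following is a rough sketch, not part of the stage above: splitAtPartBoundary is a hypothetical helper, and Array[Byte] is used so that it runs without Akka (ByteString offers a similar splitAt):

```scala
// Split `chunk` so that the first piece exactly fills what is left of the
// current part; the second piece belongs to the next part.
def splitAtPartBoundary(
    written: Long,
    partSize: Long,
    chunk: Array[Byte]
): (Array[Byte], Array[Byte]) = {
  val remaining = math.max(0L, partSize - written)
  val take = math.min(remaining, chunk.length.toLong).toInt
  chunk.splitAt(take)
}
```

Inside onPush, the first piece would be written to the current file and the second piece carried over to the next part. A leftover that is itself larger than partSize would have to be split again, which this sketch leaves out.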
The idiomatic way to split a ByteString stream to multiple files is to use Alpakka's LogRotatorSink. From the documentation:
This sink will takes a function as parameter which returns a Bytestring => Option[Path] function. If the generated function returns a path the sink will rotate the file output to this new path and the actual ByteString will be written to this new file too. With this approach the user can define a custom stateful file generation implementation.
The following fileSizeRotationFunction is also from the documentation:
val fileSizeRotationFunction = () => {
val max = 10 * 1024 * 1024
var size: Long = max
(element: ByteString) =>
{
if (size + element.size > max) {
val path = Files.createTempFile("out-", ".log")
size = element.size
Some(path)
} else {
size += element.size
None
}
}
}
An example of its use:
val source: Source[ByteString, _] = ???
source.runWith(LogRotatorSink(fileSizeRotationFunction))
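To see when such a function decides to rotate, it helps to run the closure by hand. This is a sketch under assumptions: sizeRotation and its nextPath parameter are hypothetical, Array[Byte] stands in for ByteString so that it runs without Akka, and paths are only generated, not created on disk:

```scala
import java.nio.file.Path

// Same shape as the rotation function above: an outer factory that returns a
// stateful element => Option[Path] decision function.
def sizeRotation(max: Long, nextPath: () => Path): () => Array[Byte] => Option[Path] =
  () => {
    var size: Long = max // start "full" so the first non-empty element opens a file
    (element: Array[Byte]) =>
      if (size + element.length > max) {
        size = element.length.toLong // the element goes into the new file
        Some(nextPath())
      } else {
        size += element.length
        None
      }
  }
```

With max = 10 and elements of 4, 4, 4, 10 and 1 bytes, the decisions are: new file, same file, new file, new file, new file. An element is never split, so a file can be smaller than max, and a single element larger than max still lands in one file.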
I've got a SourceQueue. When I offer an element to this I want it to pass through the Stream and when it reaches the Sink have the output returned to the code that offered this element (similar as Sink.head returns an element to the RunnableGraph.run() call).
How do I achieve this? A simple example of my problem would be:
val source = Source.queue[String](100, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.ReturnTheStringSomehow
val graph = source.via(flow).to(sink).run()
val x = graph.offer("foo")
println(x) // Output should be "Modified foo"
val y = graph.offer("bar")
println(y) // Output should be "Modified bar"
val z = graph.offer("baz")
println(z) // Output should be "Modified baz"
Edit: For the example I have given in this question Vladimir Matveev provided the best answer. However, it should be noted that this solution only works if the elements are going into the sink in the same order they were offered to the source. If this cannot be guaranteed the order of the elements in the sink may differ and the outcome might be different from what is expected.
I believe it is simpler to use the already existing primitive for pulling values from a stream, called Sink.queue. Here is an example:
val source = Source.queue[String](128, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(1, 1))
val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()
def getNext: String = Await.result(sinkQueue.pull(), 1.second).get
sourceQueue.offer("foo")
println(getNext)
sourceQueue.offer("bar")
println(getNext)
sourceQueue.offer("baz")
println(getNext)
It does exactly what you want.
Note that setting the inputBuffer attribute for the queue sink may or may not be important for your use case - if you don't set it, the buffer will be zero-sized and the data won't flow through the stream until you invoke the pull() method on the sink.
sinkQueue.pull() yields a Future[Option[T]], which will be completed successfully with Some if the sink receives an element or with a failure if the stream fails. If the stream completes normally, it will be completed with None. In this particular example I'm ignoring this by using Option.get but you would probably want to add custom logic to handle this case.
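One way to handle the three outcomes explicitly instead of calling Option.get is sketched below. It assumes nothing beyond the Future[Option[String]] returned by pull(); PullOutcome and classify are hypothetical names introduced here for illustration:

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

sealed trait PullOutcome
final case class Element(value: String) extends PullOutcome
case object StreamCompleted extends PullOutcome
final case class StreamFailed(cause: Throwable) extends PullOutcome

// Turn the result of a pull into a value the caller can pattern match on.
def classify(pulled: Future[Option[String]])(implicit ec: ExecutionContext): Future[PullOutcome] =
  pulled
    .map[PullOutcome] {
      case Some(value) => Element(value) // the sink received an element
      case None        => StreamCompleted // the stream completed normally
    }
    .recover { case NonFatal(e) => StreamFailed(e) } // the stream failed
```

The getNext helper above could then become classify(sinkQueue.pull()), with the caller deciding what to do on StreamCompleted and StreamFailed.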
Well, you can see what the offer() method returns if you take a look at its definition :) What you can do is create a Source.queue[(Promise[String], String)], write a helper function that pushes a pair into the stream via offer (making sure the offer doesn't fail because the queue is full), then complete the promise inside your stream and use the promise's future to catch the completion event in external code.
I do that to throttle the request rate to an external API used from multiple places in my project.
Here is how it looked in my project before Typesafe added Hub sources to Akka:
import scala.concurrent.Promise
import scala.concurrent.Future
import java.util.concurrent.ConcurrentLinkedDeque
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{OverflowStrategy, QueueOfferResult}
import scala.util.Success
private val queue = Source.queue[(Promise[String], String)](100, OverflowStrategy.backpressure)
.toMat(Sink.foreach({ case (p, param) =>
p.complete(Success(param.reverse))
}))(Keep.left)
.run
private val futureDeque = new ConcurrentLinkedDeque[Future[String]]()
private def sendQueuedRequest(request: String): Future[String] = {
val p = Promise[String]
val offerFuture = queue.offer(p -> request)
def addToQueue(future: Future[String]): Future[String] = {
futureDeque.addLast(future)
future.onComplete(_ => futureDeque.remove(future))
future
}
offerFuture.flatMap {
case QueueOfferResult.Enqueued =>
addToQueue(p.future)
}.recoverWith {
case ex =>
val first = futureDeque.pollFirst()
if (first != null)
addToQueue(first.flatMap(_ => sendQueuedRequest(request)))
else
sendQueuedRequest(request)
}
}
I realize that the blocking synchronized queue may be a bottleneck and may grow indefinitely, but because API calls in my project are made only from other Akka streams, which are backpressured, I never have more than a dozen items in futureDeque. Your situation may differ.
If you create a MergeHub.source[(Promise[String], String)]() instead, you'll get a reusable sink. Then, every time you need to process an item, you create a complete graph and run it. In that case you won't need a hacky Java container to queue requests.
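Stripped of the Akka parts, the core idea is just "send a Promise along with each request and complete it at the end of the pipeline". The sketch below shows only that idea: PromisePipeline is a hypothetical class, and a single worker thread draining a blocking queue stands in for the stream, so there is no backpressure or throttling here:

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.concurrent.{Future, Promise}
import scala.util.Try

// Hypothetical stand-in for the stream: one worker thread drains the queue in order.
final class PromisePipeline(transform: String => String) {
  private val queue = new LinkedBlockingQueue[(Promise[String], String)]()

  private val worker = new Thread(new Runnable {
    def run(): Unit =
      while (true) {
        val (promise, input) = queue.take()
        // Completing the promise hands the result (or the failure) back to the caller
        promise.complete(Try(transform(input)))
      }
  })
  worker.setDaemon(true) // do not keep the JVM alive for this worker
  worker.start()

  // Each caller gets a Future tied to its own request, regardless of ordering.
  def send(input: String): Future[String] = {
    val promise = Promise[String]()
    queue.put(promise -> input)
    promise.future
  }
}
```

Because every request carries its own promise, results reach the right caller even if the processing stage reorders elements.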
I'm building a microservice with Play Framework 2.3.x in Scala (I'm a beginner in both), but I can't figure out a way to stream my request body.
Here is the problem:
I need an endpoint /transform where I can receive a huge TSV file that I will parse and render in another format: a simple transformation. The problem is that every single statement in my controller runs "too late": it waits to receive the full file before starting the code.
Example:
def transform = Action.async {
Future {
Logger.info("Too late")
Ok("A response")
}
}
I want to be able to read the request body line by line during the upload and start processing the request without having to wait for the file to be received completely.
Any hint would be welcome.
This answer applies to Play 2.5.x and higher since it uses the Akka streams API that replaced Play's Iteratee-based streaming in that version.
Basically, you can create a body parser that returns a Source[T] that you can pass to Ok.chunked(...). One way to do this is to use Accumulator.source[T] in the body parser. For example, an action that just returned data sent to it verbatim might look like this:
def verbatimBodyParser: BodyParser[Source[ByteString, _]] = BodyParser { _ =>
// Return the source directly. We need to return
// an Accumulator[Either[Result, T]], so if we were
// handling any errors we could map to something like
// a Left(BadRequest("error")). Since we're not
// we just wrap the source in a Right(...)
Accumulator.source[ByteString]
.map(Right.apply)
}
def stream = Action(verbatimBodyParser) { implicit request =>
Ok.chunked(request.body)
}
If you want to do something like transform a TSV file, you can use a Flow to transform the source, e.g.:
val tsvToCsv: BodyParser[Source[ByteString, _]] = BodyParser { req =>
val transformFlow: Flow[ByteString, ByteString, NotUsed] = Flow[ByteString]
// Chunk incoming bytes by newlines, truncating them if the lines
// are longer than 1000 bytes...
.via(Framing.delimiter(ByteString("\n"), 1000, allowTruncation = true))
// Replace tabs by commas. This is just a silly example and
// you could obviously do something more clever here...
.map(s => ByteString(s.utf8String.split('\t').mkString(",") + "\n"))
Accumulator.source[ByteString]
.map(_.via(transformFlow))
.map(Right.apply)
}
def convert = Action(tsvToCsv) { implicit request =>
Ok.chunked(request.body).as("text/csv")
}
There may be more inspiration in the Directing the Body Elsewhere section of the Play docs.
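The tab-to-comma replacement in the example above yields broken CSV as soon as a field contains a comma or a double quote. A slightly less silly per-line conversion could look like the sketch below; csvField and tsvLineToCsv are hypothetical helpers, and the quoting follows the usual RFC 4180 convention of wrapping such fields in quotes and doubling embedded quotes:

```scala
// Quote a field when it contains a comma, a double quote, or a line break.
def csvField(field: String): String =
  if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
    "\"" + field.replace("\"", "\"\"") + "\""
  else field

// Convert one TSV line (without its trailing newline) to one CSV line.
def tsvLineToCsv(line: String): String =
  line.split("\t", -1).map(csvField).mkString(",") // limit -1 keeps trailing empty fields
```

In the flow above, the map step would then become something like .map(s => ByteString(tsvLineToCsv(s.utf8String) + "\n")).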
Problem statement: We store all incoming request parameters of users for a particular module as rows in a MySQL table (this is a huge amount of data). Now we want to design a process that reads each record from this table, gets more information about the user's request by calling third-party APIs, and then puts the returned meta information into another table.
Current Attempts:
I am using Scala + Slick to do this. As the amount of data to read is huge, I want to read this table one row at a time and process it. I tried using Slick + Akka Streams; however, I am getting a 'java.util.concurrent.RejectedExecutionException'.
The following is the rough logic that I have tried:
implicit val system = ActorSystem("Example")
import system.dispatcher
implicit val materializer = ActorMaterializer()
val future = db.stream(SomeQuery.result)
Source.fromPublisher(future).map(row => {
dataEnrichmentAPI.process(row)
}).runForeach(id => println("Processed row : "+ id))
dataEnrichmentAPI.process: This function makes a third-party REST call and also runs some DB queries to get the required data. Each DB query is run using the 'db.run' method, and the function waits until it finishes (using Await).
e.g.,
def process(row: RequestRecord): Int = {
// SomeQuery2 = Check if data is already there in DB
val retId: Seq[Int] = Await.result(db.run(SomeQuery2.result), Duration.Inf)
if(retId.isEmpty){
val metaData = RestCall()
// SomeQuery3 = Store this metaData in DB
Await.result(db.run(SomeQuery3.result), Duration.Inf)
return metaData.id;
}else{
// SomeQuery4 = Get meta data id
return Await.result(db.run(SomeQuery4.result), Duration.Inf)
}
}
I am getting this exception where I make the blocking calls to the DB. I don't think I can get rid of them, as the return value is required for the rest of the flow to continue.
Is the blocking call the reason behind this exception?
What is the best practice for solving this kind of problem?
Thanks.
I don't know if this is your problem (too few details), but you should never block.
Speaking of best practices, use async stages instead.
This is more or less what your code would look like without using Await.result:
def process(row: RequestRecord): Future[Int] = {
db.run(SomeQuery2.result) flatMap {
case retId if retId.isEmpty =>
// what is this? is it a sync call? if it's a rest call it should return a future
val metaData = RestCall()
db.run(SomeQuery3.result).map(_ => metaData.id)
case _ => db.run(SomeQuery4.result)
}
}
Source.fromPublisher(db.stream(SomeQuery.result))
// choose your own parallelism
.mapAsync(2)(dataEnrichmentAPI.process)
.runForeach(id => println("Processed row : "+ id))
This way you will be handling backpressure and parallelism explicitly and idiomatically.
Try never to call Await.result in production code, and compose futures only with map, flatMap, and for comprehensions.
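As an illustration of composing the three steps without blocking, here is a self-contained sketch. The MetaStore trait and its methods are hypothetical stand-ins for the Slick queries and the REST call in the question, each assumed to return a Future:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical stand-ins for the Slick queries and the REST call in the question.
trait MetaStore {
  def findId(requestId: Int): Future[Option[Int]]          // "is the data already in the DB?"
  def fetchMeta(requestId: Int): Future[Int]               // REST call, yields the meta id
  def storeMeta(requestId: Int, metaId: Int): Future[Unit] // persist the meta data
}

def process(store: MetaStore)(requestId: Int)(implicit ec: ExecutionContext): Future[Int] =
  store.findId(requestId).flatMap {
    case Some(existing) => Future.successful(existing)
    case None =>
      // Steps run one after another, but no thread waits on them
      for {
        metaId <- store.fetchMeta(requestId)
        _      <- store.storeMeta(requestId, metaId)
      } yield metaId
  }
```

A function of this shape can be passed straight to mapAsync, for example .mapAsync(2)(row => process(store)(row.id)), where row.id is likewise an assumption about the row type.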