I am playing with Akka Streams and streaming content from a file using Alpakka. I need to stop the stream after some time so I want to use KillSwitch. But I don't know how to use it because I am using the graph DSL.
My graph looks like this:
val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
  import GraphDSL.Implicits._
  source ~> mainFlow ~> sink
  ClosedShape
})
graph.run()
I found a solution here: How to abruptly stop an akka stream Runnable Graph?
However, I don't know how to apply it if I'm using the graph DSL. Can you give me some advice?
To surface a materialized value in the GraphDSL, you can pass the stage that materializes to that value to the create method. It is easier explained with an example. In your case:
val switch = KillSwitches.single[Int]

val graph: RunnableGraph[UniqueKillSwitch] =
  RunnableGraph.fromGraph(GraphDSL.create(switch) { implicit builder: GraphDSL.Builder[UniqueKillSwitch] => sw =>
    import GraphDSL.Implicits._
    source ~> mainFlow ~> sw ~> sink
    ClosedShape
  })
val ks = graph.run()
ks.shutdown()
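Since the goal is to stop the stream after some time, the shutdown can be scheduled instead of called right away. A small sketch, assuming the ActorSystem is named system and picking an arbitrary 10-second delay:

import scala.concurrent.duration._
import system.dispatcher // execution context for the scheduler

system.scheduler.scheduleOnce(10.seconds)(ks.shutdown())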
Is it possible to write tweets in real time into a relational database (like MySQL or PostgreSQL) without using Spark, in Scala? Instead, I would like to use the streaming module of Akka.
I would really appreciate some resources and advice on this topic.
EDIT: I have tried to connect to the Twitter API and I used Twitter4J to get some tweets and filter them. Now I am at the stage where I want to write the extracted tweets into the database instead of writing them to the console.
class Counter extends StatusAdapter {
  implicit val system = ActorSystem("EmojiTrends")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = system.dispatcher
  implicit val loggingAdapter = Logging(system, classOf[Counter])

  val overflowStrategy = OverflowStrategy.backpressure
  val bufferSize = 1000

  val statusSource = Source.queue[Status](bufferSize, overflowStrategy)
  val sink: Sink[Any, Future[Done]] = Sink.foreach(println)
  val flow: Flow[Status, String, NotUsed] = Flow[Status].map(status => status.getText)

  val graph = statusSource via flow to sink
  val queue = graph.run()

  override def onStatus(status: Status) =
    Await.result(queue.offer(status), Duration.Inf)
}
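For illustration, this is roughly the direction I am thinking of for the database part: a rough, untested sketch that replaces the console sink with a JDBC-backed one (the connection URL, credentials and the tweets(text) table are made up):

import java.sql.DriverManager
import akka.Done
import akka.stream.scaladsl.{Flow, Keep, Sink}
import scala.concurrent.Future

// Hypothetical connection details and table; replace with real ones.
val dbSink: Sink[String, Future[Done]] =
  Flow[String]
    .mapAsync(parallelism = 4) { text =>
      Future {
        val conn = DriverManager.getConnection("jdbc:postgresql://localhost/tweets_db", "user", "pass")
        try {
          val stmt = conn.prepareStatement("INSERT INTO tweets(text) VALUES (?)")
          stmt.setString(1, text)
          stmt.executeUpdate()
        } finally conn.close()
      }
    }
    .toMat(Sink.ignore)(Keep.right)

// The existing graph would then become: statusSource via flow to dbSink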
I have the following code sample that works fine. I want to add some changes to preserve the relation between request and response. How can I achieve that?
The REST API flow's materialized value is NotUsed. Is it possible to somehow use Keep.both for that?
// this flow is provided by some third party library that I can't change in place
val someRestApiFlow: Flow[Int, Int, NotUsed] = Flow[Int].mapAsync(10)(x => Future(x + 1))
val digits: Source[Int, NotUsed] = Source(List(1, 2, 3))
val r = digits.via(someRestApiFlow).runForeach(println)
Result is
2
3
4
I want result to be like
1 -> 2
2 -> 3
3 -> 4
You can use a broadcast element to create 2 separate flows. The first output of broadcast goes through someRestApiFlow, the second output of the broadcast goes unmodified. You then zip the output of the someRestApiFlow with the second output of the broadcast flow. Doing that, you have both the input element and the result of its transformation through someRestApiFlow.
digits ---> broadcast --> someRestApiFlow ---> zip --> result
\----------------------/
I have also encountered this kind of case a couple of times. The only solution I have found is to create a graph using the DSL and make use of broadcast and zip stages.
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source, Zip}
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.{Done, NotUsed}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object Main extends App {
  implicit val system: ActorSystem = ActorSystem("my-system")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder: GraphDSL.Builder[NotUsed] =>
    import GraphDSL.Implicits._

    val src: Source[Int, NotUsed] = Source(List(1, 2, 3))
    val someRestApiFlow: Flow[Int, Int, NotUsed] = Flow[Int].mapAsync(10)(x => Future(x + 1))
    val out: Sink[(Int, Int), Future[Done]] = Sink.foreach[(Int, Int)](println)

    val bcast = builder.add(Broadcast[Int](2))
    val zip = builder.add(Zip[Int, Int])

    src ~> bcast ~> zip.in0
           bcast ~> someRestApiFlow ~> zip.in1
    zip.out ~> out

    ClosedShape
  })

  graph.run()
}
What is being done here is that we broadcast the input both directly to the zip and through the custom flow; the zip waits for the result of that custom flow and pairs it with the original element before sending it to the sink.
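As a follow-up sketch (not part of the original answer), the same broadcast/zip wiring can be packaged as a reusable Flow so the graph doesn't have to be repeated at every call site; it assumes someRestApiFlow is in scope and that akka.stream.FlowShape is imported:

import akka.stream.FlowShape

val pairedWithInput: Flow[Int, (Int, Int), NotUsed] =
  Flow.fromGraph(GraphDSL.create() { implicit builder =>
    import GraphDSL.Implicits._
    val bcast = builder.add(Broadcast[Int](2))
    val zip = builder.add(Zip[Int, Int])

    bcast.out(0) ~> zip.in0
    bcast.out(1) ~> someRestApiFlow ~> zip.in1

    FlowShape(bcast.in, zip.out)
  })

// Usage: src.via(pairedWithInput).runForeach(println) prints (1,2), (2,3), (3,4)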
I want to upload file into S3 using Alpakka and at the same time parse it with Tika to obtain its MimeType.
I have 3 parts of graph at the moment:
val fileSource: Source[ByteString, Any] // comes from Akka-HTTP
val fileUpload: Sink[ByteString, Future[MultipartUploadResult]] // created by S3Client from Alpakka
val mimeTypeDetection: Sink[ByteString, Future[MediaType.Binary]] // my implementation using Apache Tika
I would like to obtain both results at one place, something like:
Future[(MultipartUploadResult, MediaType.Binary)]
I have no issue with broadcasting part:
val broadcast = builder.add(Broadcast[ByteString](2))
source ~> broadcast ~> fileUpload
broadcast ~> mimeTypeDetection
However, I have trouble composing the Sinks. The methods I found in the API and documentation assume either that the combined sinks are of the same type or that I am zipping Flows, not Sinks.
What is the suggested approach in such a case?
Two ways:
1) using alsoToMat (easier, no GraphDSL, enough for your example)
val mat1: (Future[MultipartUploadResult], Future[Binary]) =
  fileSource
    .alsoToMat(fileUpload)(Keep.right)
    .toMat(mimeTypeDetection)(Keep.both)
    .run()
2) using GraphDSL with custom materialized values (more verbose, more flexible; more info on this in the docs)
val mat2: (Future[MultipartUploadResult], Future[Binary]) =
  RunnableGraph.fromGraph(GraphDSL.create(fileUpload, mimeTypeDetection)((_, _)) { implicit builder =>
    (fileUpload, mimeTypeDetection) =>
      import GraphDSL.Implicits._
      val broadcast = builder.add(Broadcast[ByteString](2))
      fileSource ~> broadcast ~> fileUpload
                    broadcast ~> mimeTypeDetection
      ClosedShape
  }).run()
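To end up with the single Future[(MultipartUploadResult, Binary)] you asked about, the two materialized futures (from either approach) can be combined. A small sketch, assuming an ExecutionContext is in scope:

val (uploadF, mimeF) = mat1 // or mat2

val combined: Future[(MultipartUploadResult, Binary)] =
  for {
    upload <- uploadF
    mime   <- mimeF
  } yield (upload, mime)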
I'm reading a csv file. I am using Akka Streams to do this so that I can create a graph of actions to perform on each line. I've got the following toy example up and running.
def main(args: Array[String]): Unit = {
  implicit val system = ActorSystem("MyAkkaSystem")
  implicit val materializer = ActorMaterializer()

  val source = akka.stream.scaladsl.Source.fromIterator(() => Source.fromFile("a.csv").getLines)
  val sink = Sink.foreach(println)

  source.runWith(sink)
}
The two Source types don't sit easy with me. Is this idiomatic or is there is a better way to write this?
Actually, akka-streams provides a function to directly read from a file.
FileIO.fromPath(Paths.get("a.csv"))
  .via(Framing.delimiter(ByteString("\n"), 256, allowTruncation = true))
  .map(_.utf8String)
  .runForeach(println)
Here, the runForeach method is used to print the lines. If you have a proper Sink to process these lines, use it instead. For example, if you want to split each line by ',' and print the number of fields in it:
val sink: Sink[String, Future[Done]] = Sink.foreach[String](x => println(x.split(",").size))

FileIO.fromPath(Paths.get("a.csv"))
  .via(Framing.delimiter(ByteString("\n"), 256, allowTruncation = true))
  .map(_.utf8String)
  .to(sink)
  .run()
The idiomatic way to read a CSV file with Akka Streams is to use the Alpakka CSV connector. The following example reads a CSV file, converts it to a map of column names (assumed to be the first line in the file) and ByteString values, transforms the ByteString values to String values, and prints each line:
FileIO.fromPath(Paths.get("a.csv"))
  .via(CsvParsing.lineScanner())
  .via(CsvToMap.toMap())
  .map(_.mapValues(_.utf8String))
  .runForeach(println)
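If you want typed rows rather than maps, the same pipeline can feed a case class. A sketch (not from the original answer) in which the column names "id" and "name" are hypothetical and must match the header line of your file:

final case class Row(id: String, name: String)

FileIO.fromPath(Paths.get("a.csv"))
  .via(CsvParsing.lineScanner())
  .via(CsvToMap.toMap())
  .map(_.mapValues(_.utf8String))
  .map(fields => Row(fields("id"), fields("name")))
  .runForeach(println)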
Try this:
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl._
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._
object ReadStreamApp extends App {
  implicit val actorSystem = ActorSystem()
  import actorSystem.dispatcher
  implicit val flowMaterializer = ActorMaterializer()

  val logFile = Paths.get("src/main/resources/a.csv")

  val source = FileIO.fromPath(logFile)

  val flow = Framing
    .delimiter(ByteString(System.lineSeparator()), maximumFrameLength = 512, allowTruncation = true)
    .map(_.utf8String)

  val sink = Sink.foreach(println)

  source
    .via(flow)
    .runWith(sink)
    .andThen {
      case _ =>
        actorSystem.terminate()
        Await.ready(actorSystem.whenTerminated, 1 minute)
    }
}
Yeah, it's ok because these are different Sources. But if you don't like scala.io.Source you can read the file yourself (which sometimes we have to do, e.g. when the source CSV file is zipped) and then parse it from the resulting InputStream like this:
StreamConverters.fromInputStream(() => input)
  .via(Framing.delimiter(ByteString("\n"), 4096))
  .map(_.utf8String)
  .collect { // placeholder for any per-line filtering or transformation
    case line => line
  }
Having said that, consider using Apache Commons CSV with akka-stream. You may end up writing less code :)
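For example, a rough sketch of that suggestion (it assumes the org.apache.commons:commons-csv dependency and is not tested):

import java.io.FileReader
import org.apache.commons.csv.CSVFormat
import scala.collection.JavaConverters._
import akka.stream.scaladsl.Source

// Wrap the Commons CSV record iterator in an Akka Streams Source.
val records = Source.fromIterator(() =>
  CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(new FileReader("a.csv")).iterator().asScala
)

records.runForeach(record => println(record.toMap))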
I'm building a REST API that starts some calculation in a Spark cluster and responds with a chunked stream of the results. Given the Spark stream with calculation results, I can use
dstream.foreachRDD()
to send the data out of Spark. I'm sending the chunked HTTP response with akka-http:
val requestHandler: HttpRequest => HttpResponse = {
  case HttpRequest(HttpMethods.GET, Uri.Path("/data"), _, _, _) =>
    HttpResponse(entity = HttpEntity.Chunked(ContentTypes.`text/plain`, source))
}
For simplicity, I'm trying to get plain text working first, will add JSON marshalling later.
But what is the idiomatic way of using the Spark DStream as a Source for the Akka stream? I figured I should be able to do it via a socket, but since the Spark driver and the REST endpoint are sitting on the same JVM, opening a socket just for this seems like overkill.
Not sure about the version of the API at the time of the question. But now, with akka-stream 2.0.3, I believe you can do it like:
val source = Source
  .actorRef[T](/* buffer size */ 100, OverflowStrategy.dropHead)
  .mapMaterializedValue[Unit] { actorRef =>
    dstream.foreach(actorRef ! _)
  }
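As a follow-up sketch (not part of the original answer), the emitted elements can be adapted to HttpEntity.ChunkStreamPart so the source can back the chunked response; the String element type and the use of toLocalIterator are assumptions:

val chunkSource: Source[HttpEntity.ChunkStreamPart, Unit] = Source
  .actorRef[String](100, OverflowStrategy.dropHead)
  .mapMaterializedValue[Unit] { actorRef =>
    // Runs on the driver; pulls each RDD's records locally and feeds the actor.
    dstream.foreachRDD(rdd => rdd.toLocalIterator.foreach(actorRef ! _))
  }
  .map(line => HttpEntity.ChunkStreamPart(ByteString(line + "\n")))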
Edit: This answer only applies to older versions of Spark and Akka. PH88's answer is the correct method for recent versions.
You can use an intermediate akka.actor.Actor that feeds a Source (similar to this question). The solution below is not "reactive" because the underlying Actor would need to maintain a buffer of RDD messages that may be dropped if the downstream http client isn't consuming chunks quickly enough. But this problem occurs regardless of the implementation details, since you cannot connect the "throttling" of the akka stream back-pressure to the DStream in order to slow down the data. This is due to the fact that DStream does not implement org.reactivestreams.Publisher.
The basic topology is:
DStream --> Actor with buffer --> Source
To construct this topology you have to create an Actor similar to the implementation here:
//JobManager definition is provided in the link
val actorRef = actorSystem actorOf JobManager.props
Create a stream Source of ByteStrings (messages) based on the JobManager. Also, convert the ByteString to HttpEntity.ChunkStreamPart which is what the HttpResponse requires:
import akka.stream.actor.ActorPublisher
import akka.stream.scaladsl.{Flow, Source}
import akka.http.scaladsl.model.HttpEntity
import akka.util.ByteString

type Message = ByteString

val messageToChunkPart =
  Flow[Message].map(HttpEntity.ChunkStreamPart(_))

//Actor with buffer --> Source
val source: Source[HttpEntity.ChunkStreamPart, Unit] =
  Source(ActorPublisher[Message](actorRef)) via messageToChunkPart
Link the Spark DStream to the Actor so that each incoming RDD is converted to an Iterable of ByteString and then forwarded to the Actor:
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.rdd.RDD

val dstream: DStream[_] = ???

//This function converts your RDDs to messages being sent
//via the http response
def rddToMessages[T](rdd: RDD[T]): Iterable[Message] = ???

def sendMessageToActor(message: Message) = actorRef ! message

//DStream --> Actor with buffer
dstream foreachRDD { rddToMessages(_) foreach sendMessageToActor }
Provide the Source to the HttpResponse:
val requestHandler: HttpRequest => HttpResponse = {
  case HttpRequest(HttpMethods.GET, Uri.Path("/data"), _, _, _) =>
    HttpResponse(entity = HttpEntity.Chunked(ContentTypes.`text/plain`, source))
}
Note: there should be very little time/code between the dstream foreachRDD line and the HttpResponse, since the Actor's internal buffer will immediately begin to fill with ByteString messages coming from the DStream after the foreachRDD line is executed.
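For completeness, a sketch (not part of the original answer) of actually serving the handler; the host, port and system name are arbitrary, and bindAndHandleSync is the classic (pre akka-http 10.2) binding API:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.stream.ActorMaterializer

implicit val system = ActorSystem("spark-chunked-api")
implicit val materializer = ActorMaterializer()

// Serve the chunked /data endpoint using the request handler defined above.
Http().bindAndHandleSync(requestHandler, interface = "localhost", port = 8080)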