Akka Stream - Source.fromPublisher - scala

I'm trying to execute the following code, based on the Akka Streams quickstart guide:
implicit val system = ActorSystem("QuickStart")
implicit val materializer = ActorMaterializer()
import system.dispatcher // execution context for Future callbacks

val songs = Source.fromPublisher(SongsService.stream)

val count: Flow[Song, Int, NotUsed] = Flow[Song].map(_ => 1)
val sumSink: Sink[Int, Future[Int]] = Sink.fold[Int, Int](0)(_ + _)

val counterGraph: RunnableGraph[Future[Int]] =
  songs
    .via(count)
    .toMat(sumSink)(Keep.right)

val sum: Future[Int] = counterGraph.run()
sum.foreach(c => println(s"Total songs processed: $c"))
The problem is that the future never returns a result. The biggest difference from the documentation example is my Source.
I have a Play enumerator, which I'm converting to an Akka Publisher; the result is SongsService.stream.
When using a defined list as a Source, like:
val songs = Source(list)
it works, but using Source.fromPublisher does not.
But the problem doesn't seem to be the publisher itself; a simple operation works:
val songs = Source.fromPublisher(SongsService.stream)
songs.runForeach(println)
It goes through the database, creates the Play enumerator, converts it to a publisher, and I can iterate over the elements.
Any ideas?

Your publisher is likely never completing.
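One way to verify this (a minimal diagnostic sketch, not part of the original answer): Sink.fold can only complete its Future once the stream finishes, so attach watchTermination to the source and see whether the publisher ever signals completion.

// Diagnostic sketch: the Future from watchTermination completes only when
// the publisher signals onComplete/onError. If it never fires, Sink.fold
// can never produce its result either.
// Assumes an ExecutionContext (e.g. system.dispatcher) is in scope.
val watchedSongs = Source
  .fromPublisher(SongsService.stream)
  .watchTermination() { (mat, done) =>
    done.onComplete(result => println(s"Stream terminated: $result"))
    mat
  }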

Related

How does Akka Stream's Keep right/left/both result in a different output?

I am trying to wrap my head around how Keep works in Akka Streams. Reading answers in "What does Keep in akka stream mean", I understand that it helps control whether we get the materialized value from the left/right/both sides of the graph. However, I still can't build an example where changing left/right/both yields different results.
For example,
implicit val system: ActorSystem = ActorSystem("Playground")
implicit val materializer: ActorMaterializer = ActorMaterializer()

val sentenceSource = Source(List(
  "Materialized values are confusing me",
  "I love streams",
  "Left foo right bar"
))
val wordCounter = Flow[String].fold[Int](0) { (currentWords, newSentence) =>
  currentWords + newSentence.split(" ").length
}
val result = sentenceSource.viaMat(wordCounter)(Keep.left).toMat(Sink.head)(Keep.right).run()
val res = Await.result(result, 2.seconds)
println(res)
In this example, if I change Keep.left to Keep.right, I still get the same result. Can someone provide a basic example where choosing left/right/both leads to a different outcome?
In your example, since:
sentenceSource: akka.stream.scaladsl.Source[String,akka.NotUsed] = ???
wordCounter: akka.stream.scaladsl.Flow[String,Int,akka.NotUsed] = ???
both have NotUsed as their materialization (indicating that they don't have a useful materialization),
sentenceSource.viaMat(wordCounter)(Keep.right)
sentenceSource.viaMat(wordCounter)(Keep.left)
have the same materialized value. However, since Sink.head[T] materializes to Future[T], changing the combiner there clearly has an impact:
val intSource = sentenceSource.viaMat(wordCounter)(Keep.right)
val notUsed = intSource.toMat(Sink.head)(Keep.left)
// akka.stream.scaladsl.RunnableGraph[akka.NotUsed]
val intFut = intSource.toMat(Sink.head)(Keep.right)
// akka.stream.scaladsl.RunnableGraph[scala.concurrent.Future[Int]]
notUsed.run // akka.NotUsed
intFut.run // Future(Success(12))
Most of the sources in the Source companion object materialize to NotUsed, and nearly all of the common Flow operators do as well, so toMat(someSink)(Keep.right) (or the equivalent .runWith(someSink)) is far more prevalent than Keep.left or Keep.both. The most common use cases for source/flow materialization are to provide some sort of control plane, such as:
import akka.Done
import akka.stream.{ CompletionStrategy, OverflowStrategy }
import system.dispatcher

val completionMatcher: PartialFunction[Any, CompletionStrategy] = {
  case Done => CompletionStrategy.draining
}
val failureMatcher: PartialFunction[Any, Throwable] = {
  case 666 => new Exception("""\m/""")
}
val sentenceSource = Source.actorRef[String](
  completionMatcher = completionMatcher,
  failureMatcher = failureMatcher,
  bufferSize = 100,
  overflowStrategy = OverflowStrategy.dropNew
)

// same wordCounter as before
val stream = sentenceSource
  .viaMat(wordCounter)(Keep.left)
  .toMat(Sink.head)(Keep.both)
// akka.stream.scaladsl.RunnableGraph[(akka.actor.ActorRef, scala.concurrent.Future[Int])]

val (sourceRef, intFut) = stream.run()

sourceRef ! "Materialized values are confusing me"
sourceRef ! "I love streams"
sourceRef ! "Left foo right bar"
sourceRef ! Done

intFut.foreach { result =>
  println(result)
  system.terminate()
}
In this case, we use Keep.left to pass through sentenceSource's materialized value (the ActorRef) and then Keep.both to get both that materialized value and that of Sink.head.

Write GeoLocation Twitter4J to Postgres

I am extracting tweets using Twitter4J and Akka Streams. I have chosen a few fields like userId, tweetId, tweet text and so on. This Tweet entity gets written to the database:
class Counter extends StatusAdapter with Databases {
  implicit val system = ActorSystem("TweetsExtractor")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = system.dispatcher
  implicit val loggingAdapter = Logging(system, classOf[Counter])

  val overflowStrategy = OverflowStrategy.backpressure
  val bufferSize = 1000

  val statusSource = Source.queue[Status](bufferSize, overflowStrategy)

  val insertFlow: Flow[Status, Tweet, NotUsed] =
    Flow[Status].map { status =>
      Tweet(status.getId, status.getUser.getId, status.getText, status.getLang,
        status.getFavoriteCount, status.getRetweetCount)
    }

  val insertSink: Sink[Tweet, Future[Done]] = Sink.foreach(tweetRepository.create)

  val insertGraph = statusSource via insertFlow to insertSink

  val queueInsert = insertGraph.run()

  override def onStatus(status: Status) =
    Await.result(queueInsert.offer(status), Duration.Inf)
}
My intention is to add a location field. There is a specific GeoLocation type for that in Twitter4J, which contains a latitude and a longitude of type double. However, when I try to extract the latitude and longitude directly in the flow, nothing is written to the database:
Flow[Status].map(status =>
  Tweet(status.getId, status.getUser.getId, status.getText, status.getLang,
    status.getFavoriteCount, status.getRetweetCount,
    status.getGeoLocation.getLatitude, status.getGeoLocation.getLongitude))
What may be the reason for this behavior, and how can I fix it?
What is happening here, as confirmed in the comments to the question, is that most tweets do not come with geolocation data attached; for those, getGeoLocation returns null, and dereferencing it fails the stage.
A couple of simple checks for empty values should solve the issue.
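A minimal sketch of such a check, assuming a hypothetical variant of Tweet whose location fields are Option[Double]:

val insertFlow: Flow[Status, Tweet, NotUsed] =
  Flow[Status].map { status =>
    // getGeoLocation returns null when the tweet carries no location,
    // so lift it into an Option before reading latitude/longitude.
    val geo = Option(status.getGeoLocation)
    Tweet(status.getId, status.getUser.getId, status.getText, status.getLang,
      status.getFavoriteCount, status.getRetweetCount,
      geo.map(_.getLatitude), geo.map(_.getLongitude))
  }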

Eliminating internal collecting when constructing a Source as a response to data

I have a Flow (createDataPointFlow) that is constructed with a mapAsync collecting data points (via Sink.seq) which I would rather stream directly (i.e. without collecting first).
However, it is not obvious to me how to do this without collecting items. It seems I need some mechanism to publish my items directly to the output side of the flow I am creating, but I'm new to this and don't know how to do that without getting explicit actors involved, which I would like to avoid.
How can I achieve this without collecting things into a Sink first? Remember, what I want to achieve is full streaming, without the explicit buffering that Sink.seq(...) is doing.
object MyProcess {

  def createDataSource(job: Job, dao: DataService): Source[JobDataPoint, NotUsed] = {
    // Imagine the below call is equivalent to streaming a parameterized query using Slick
    val publisher: Publisher[JobDataPoint] = dao.streamData(Criteria(job.name, job.data))
    // Convert to a Source
    val src: Source[JobDataPoint, NotUsed] = Source.fromPublisher(publisher)
    src
  }

  def createDataPointFlow(dao: DataService, parallelism: Int = 1): Flow[Job, JobDataPoint, NotUsed] =
    Flow[Job].mapAsync(parallelism)(job =>
      createDataSource(job, dao).toMat(Sink.seq)(Keep.right).run()
    ).mapConcat(identity)

  def apply(src: Source[Job, NotUsed], dao: DataService, parallelism: Int = 5) =
    RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
      import GraphDSL.Implicits._

      // Source
      val jobs: Outlet[Job] = builder.add(src).out
      //val bcastJobsSrc: Source[Job, NotUsed] = src.toMat(BroadcastHub.sink(256))(Keep.right).run()
      //val bcastOutlet: Outlet[Job] = builder.add(bcastJobsSrc).out

      // Flows
      val bcastJobs: UniformFanOutShape[Job, Job] = builder.add(Broadcast[Job](4))
      val rptMaker = builder.add(MyProcessors.flow(dao, parallelism))
      val dpFlow = createDataPointFlow(dao, parallelism)

      // Sinks
      val jobPrinter: Inlet[Job] =
        builder.add(Sink.foreach[Job](job => println(s"[MyGraph] Received job: ${job.name} => $job"))).in
      val jobList: Inlet[Job] =
        builder.add(Sink.fold(List.empty[Job])((list, job: Job) => job :: list)).in
      val reporter: Inlet[ReportTable] =
        builder.add(Sink.foreach[ReportTable](r => println(s"[Report]: $r"))).in
      val dpSink: Inlet[JobDataPoint] =
        builder.add(Sink.foreach[JobDataPoint](dp => println(s"[DataPoint]: $dp"))).in

      jobs ~> bcastJobs
      bcastJobs ~> jobPrinter
      bcastJobs ~> jobList
      bcastJobs ~> rptMaker ~> reporter
      bcastJobs ~> dpFlow ~> dpSink

      ClosedShape
    })
}
So, after re-reading the documentation about the various stages available, it turns out that what I needed was flatMapConcat:
def createDataPointFlow(dao: DataService, parallelism: Int = 1): Flow[Job, JobDataPoint, NotUsed] =
  Flow[Job].flatMapConcat(createDataSource(_, dao))
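As an aside (not part of the original answer): flatMapConcat consumes the per-job sources one at a time, in order. If interleaving is acceptable and you want several of them running concurrently, flatMapMerge is the breadth-first sibling, sketched here under the same Job/DataService assumptions:

// Runs up to `breadth` per-job sources at once and merges their elements
// as they become available (ordering across jobs is not preserved).
def createDataPointFlowMerged(dao: DataService, breadth: Int = 4): Flow[Job, JobDataPoint, NotUsed] =
  Flow[Job].flatMapMerge(breadth, createDataSource(_, dao))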

How do I dynamically add Source to existing Graph?

What can be an alternative to dynamically changing a running graph? Here is my situation. I have a graph that ingests articles into a DB. The articles come from 3 plugins in different formats, so I have several flows:
val converterFlow1: Flow[ImpArticle, Article, NotUsed]
val converterFlow2: Flow[NewsArticle, Article, NotUsed]
val sinkDB: Sink[Article, Future[Done]]

// These are being created every time I poll plugins
val sourceContentProvider: Source[ImpArticle, NotUsed]
val sourceNews: Source[NewsArticle, NotUsed]
val sourceCit: Source[Article, NotUsed]

val merged = Source.combine(
  sourceContentProvider.via(converterFlow1),
  sourceNews.via(converterFlow2),
  sourceCit)(Merge(_))

val res = merged
  .buffer(10, OverflowStrategy.backpressure)
  .toMat(sinkDB)(Keep.both)
  .run()
The problem is that I get data from the content provider once per 24 hours, from the news feed once per 2 hours, and the last source may emit at any time because it comes from humans.
I realize that graphs are immutable, but how can I periodically attach new instances of Source to my graph so that I have a single point of throttling for the ingestion process?
UPDATE: You could say my data is a stream of Sources, three sources in my case. But I cannot change that, because I get the instances of Source from external classes (so-called plugins). These plugins work independently of my ingestion class, and I can't combine them into one gigantic class to have a single Source.
Okay, in general the correct way would be to join a stream of sources into a single source, i.e. to go from Source[Source[T, _], Whatever] to Source[T, Whatever]. This can be done with flatMapConcat or flatMapMerge. Therefore, if you can get a Source[Source[Article, NotUsed], NotUsed], you can use one of the flatMap* variants and obtain a final Source[Article, NotUsed]. Do it for each of your sources (no pun intended), and then your original approach should work.
I've implemented code based upon the answer given by Vladimir Matveev and want to share it with others, since it looks like a common use case to me.
I knew about Source.queue, which Viktor Klang mentioned, but I wasn't aware of flatMapConcat. It's pure awesomeness.
implicit val system = ActorSystem("root")
implicit val executor = system.dispatcher
implicit val materializer = ActorMaterializer()

case class ImpArticle(text: String)
case class NewsArticle(text: String)
case class Article(text: String)

val converterFlow1: Flow[ImpArticle, Article, NotUsed] = Flow[ImpArticle].map(a => Article("a:" + a.text))
val converterFlow2: Flow[NewsArticle, Article, NotUsed] = Flow[NewsArticle].map(a => Article("a:" + a.text))

val sinkDB: Sink[Article, Future[Done]] = Sink.foreach { a =>
  Thread.sleep(1000)
  println(a)
}

// These are being created every time I poll plugins
val sourceContentProvider: Source[ImpArticle, NotUsed] = Source(List(ImpArticle("cp1"), ImpArticle("cp2")))
val sourceNews: Source[NewsArticle, NotUsed] = Source(List(NewsArticle("news1"), NewsArticle("news2")))
val sourceCit: Source[Article, NotUsed] = Source(List(Article("a1"), Article("a2")))

val (queue, completionFut) = Source
  .queue[Source[Article, NotUsed]](10, OverflowStrategy.backpressure)
  .flatMapConcat(identity)
  .buffer(2, OverflowStrategy.backpressure)
  .toMat(sinkDB)(Keep.both)
  .run()

queue.offer(sourceContentProvider.via(converterFlow1))
queue.offer(sourceNews.via(converterFlow2))
queue.offer(sourceCit)
queue.complete()

completionFut.onComplete {
  case Success(res) =>
    println(res)
    system.terminate()
  case Failure(ex) =>
    ex.printStackTrace()
    system.terminate()
}

Await.result(system.whenTerminated, Duration.Inf)
I'd still check the success of the Future returned by queue.offer, but in my case these calls will be pretty infrequent.
If you cannot model it as a Source[Source[_, _], _], then I'd consider using Source.queue[Source[T, _]](queueSize, overflowStrategy).
What you'll have to be careful about though is what happens if submission fails.
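A minimal sketch of such a check, matching on the QueueOfferResult that the offer Future resolves to (assumes an ExecutionContext such as system.dispatcher is in scope):

import akka.stream.QueueOfferResult

queue.offer(sourceCit).foreach {
  case QueueOfferResult.Enqueued    => // accepted, nothing to do
  case QueueOfferResult.Dropped     => println("submission dropped by the overflow strategy")
  case QueueOfferResult.Failure(ex) => println(s"offer failed: $ex")
  case QueueOfferResult.QueueClosed => println("queue was already completed")
}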

How to consume grouped sub streams with mapAsync in akka streams

I need to do something really similar to this https://github.com/typesafehub/activator-akka-stream-scala/blob/master/src/main/scala/sample/stream/GroupLogFile.scala
My problem is that I have an unknown number of groups, and if the parallelism of the mapAsync is less than the number of groups, I get an error in the last sink:
Tearing down SynchronousFileSink(/Users/sam/dev/projects/akka-streams/target/log-ERROR.txt) due to upstream error (akka.stream.impl.StreamSubscriptionTimeoutSupport$$anon$2)
I tried to put a buffer in the middle, as suggested in the Akka Streams cookbook (http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html):
groupBy {
  case LoglevelPattern(level) => level
  case other => "OTHER"
}.
buffer(1000, OverflowStrategy.backpressure).
// write lines of each group to a separate file
mapAsync(parallelism = 2) {....
but with the same result
Expanding on jrudolph's comment, which is completely correct...
You do not need a mapAsync in this instance. As a basic example, suppose you have a source of tuples:
import akka.stream.scaladsl.{Source, Sink}

def data() = List(("foo", 1),
                  ("foo", 2),
                  ("bar", 1),
                  ("foo", 3),
                  ("bar", 2))

val originalSource = Source(data())
You can then perform a groupBy to create a Source of Sources:
def getID(tuple : (String, Int)) = tuple._1
//a Source of (String, Source[(String, Int),_])
val groupedSource = originalSource groupBy getID
Each one of the grouped Sources can be processed in parallel with just a map, no need for anything fancy. Here is an example of each grouping being summed in an independent stream:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer

implicit val actorSystem = ActorSystem()
implicit val mat = ActorMaterializer()
import actorSystem.dispatcher

def getValues(tuple : (String, Int)) = tuple._2

//does not have to be a def, we can re-use the same sink over-and-over
val sumSink = Sink.fold[Int,Int](0)(_ + _)

//a Source of (String, Future[Int])
val sumSource =
  groupedSource map { case (id, src) =>
    id -> {src map getValues runWith sumSink} //calculate sum in independent stream
  }
Now all of the "foo" numbers are being summed in parallel with all of the "bar" numbers.
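To actually observe the sums, a hedged usage sketch: drain sumSource and print each group's total once its independent stream completes.

sumSource runForeach { case (id, sumFut) =>
  // each group's sum becomes available when its substream completes
  sumFut.foreach(sum => println(s"$id: $sum"))
}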
mapAsync is used when you have an encapsulated function that returns a Future[T] and you're trying to emit a T instead, which is not the case in your question. Further, mapAsync involves waiting for results, which is not reactive...
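For contrast, a minimal sketch of where mapAsync does fit, with a hypothetical lookupUser standing in for any call that already returns a Future:

import scala.concurrent.Future

// Hypothetical non-blocking lookup, e.g. a DB or HTTP call.
def lookupUser(id: Int): Future[String] = Future.successful(s"user-$id")

// Runs up to 4 lookups concurrently and emits the unwrapped Strings
// downstream, preserving input order.
val users = Source(1 to 10).mapAsync(parallelism = 4)(lookupUser)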