How to read a web page, as a stream of lines, using akka-http 2.4.6 - scala

I've done a sample project in GitHub: akauppi/akka-2.4.6-trial
What I want seems simple: read a URL, provide the contents as a line-wise stream of Strings. Now I've struggled with this (and reading documentation) for the whole day so decided to push the sample public and ask for help.
I'm comfortable with Scala. I know Akka, and last time I've used Akka-streams it was probably pre-2.4. Now, I'm lost.
Questions:
On these lines I'd like to return a Source[String,Any], not a Future (note: those lines do not compile).
The problem probably is that Http().singleRequest(...) materialises the flow, and I don't want that. How to just inject the "recipe" of reading a web page without actually reading it?
def sourceAsByteString(url: URL)(implicit as: ActorSystem, mat: Materializer): Source[ByteString, Any] = {
import as.dispatcher
val req: HttpRequest = HttpRequest( uri = url.toString )
val tmp: Source[ByteString, Any] = Http().singleRequest(req).map( resp => resp.entity.dataBytes ) // does not compile, gives a 'Future'
tmp
}

The problem is that the chunks you get from the server are not lines, but might be anything. You will often get small responses in a single chunk. So you have to split the stream to lines yourself.
Something like this should work:
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.client.RequestBuilding._
import akka.stream.ActorMaterializer
implicit val system = ActorSystem("test")
implicit val mat = ActorMaterializer()
val delimiter: Flow[ByteString, ByteString, NotUsed] =
Framing.delimiter(
ByteString("\r\n"),
maximumFrameLength = 100000,
allowTruncation = true)
import system.dispatcher
val f = Http().singleRequest(Get("http://www.google.com")).flatMap { res =>
val lines = res.entity.dataBytes.via(delimiter).map(_.utf8String)
lines.runForeach { line =>
println(line)
}
}
f.foreach { _ =>
system.terminate()
}
Note that if you wanted to return the lines instead of printing them, you would end up with a Future[Source[String, Any]], which is unavoidable because everything in akka-http is async. You could "flatten" this to a Source[String, Any] that produces no elements in case of a failed request, but that would probably not be a good idea.
To get a "recipe" for reading a web page, you could use Http().outgoingConnection("http://www.google.com"), which creates a Flow[HttpRequest, HttpResponse, Future[OutgoingConnection]], so a thing where you put in HttpRequest objects and get back HttpResponse objects.

The problem probably is that Http().singleRequest(...) materialises
the flow, and I don't want that.
That was indeed at the heart of the problem. There are two ways to start:
Http().singleRequest(...) leads to a Future (i.e. materializes the stream, already in the very beginning).
Source.single(HttpRequest(...)) leads to a Source (non-materialized).
http://doc.akka.io/docs/akka/2.4.7/scala/http/client-side/connection-level.html#connection-level-api
Ideally, such an important difference would be visible in the names of the methods used, but it's not. One simply has to know it, and understand the two above approaches are actually vastly different.
#RüdigerKlaehn's answer covers the linewise cutting pretty well, but also see the Cookbook
when dealing with a Source, use mapConcat in place of the flatMap (Futures), to flatten the res.entity.dataBytes (which is an inner stream). Having these two levels of streams (requests, then chunks per request) adds to the mental complexity especially since we only have one of the outer entities.
There might still be some simpler way, but I'm not looking at that actively any more. Things work. Maybe once I become more fluent with akka streams, I'll suggest a further solution.

Code for reading an HTTP response, linewise (akka-http 1.1.0-RC2):
val req: HttpRequest = ???
val fut: Future[Source[String,_]] = Http().singleRequest(req).map { resp =>
resp.entity.dataBytes.via(delimiter)
.map(_.utf8String)
}
delimiter as in #Rüdiger-klaehn's answer.

Related

scala ZIO foreachPar

I'm new to parallel programming and ZIO, i'm trying to get data from an API, by parallel requests.
import sttp.client._
import zio.{Task, ZIO}
ZIO.foreach(files) { file =>
getData(file)
Task(file.getName)
}
def getData(file: File) = {
val data: String = readData(file)
val request = basicRequest.body(data).post(uri"$url")
.headers(content -> "text", char -> "utf-8")
.response(asString)
implicit val backend: SttpBackend[Identity, Nothing, NothingT] = HttpURLConnectionBackend()
request.send().body
resquest.Response match {
case Success(value) => {
val src = new PrintWriter(new File(filename))
src.write(value.toString)
src.close()
}
case Failure(exception) => log error
}
when i execute the program sequentially, it work as expected,
if i tried to run parallel, by changing ZIO.foreach to ZIO.foreachPar.
The program is terminating prematurely, i get that, i'm missing something basic here,
any help is appreciated to help me figure out the issue.
Generally speaking I wouldn't recommend mixing synchronous blocking code as you have with asynchronous non-blocking code which is the primary role of ZIO. There are some great talks out there on how to effectively use ZIO with the "world" so to speak.
There are two key points I would make, one ZIO lets you manage resources effectively by attaching allocation and finalization steps and two, "effects" we could say are "things which actually interact with the world" should be wrapped in the tightest scope possible*.
So lets go through this example a bit, first of all, I would not suggest using the default Identity backed backend with ZIO, I would recommend using the AsyncHttpClientZioBackend instead.
import sttp.client._
import zio.{Task, ZIO}
import zio.blocking.effectBlocking
import sttp.client.asynchttpclient.zio.AsyncHttpClientZioBackend
// Extract the common elements of the request
val baseRequest = basicRequest.post(uri"$url")
.headers(content -> "text", char -> "utf-8")
.response(asString)
// Produces a writer which is wrapped in a `Managed` allowing it to be properly
// closed after being used
def managedWriter(filename: String): Managed[IOException, PrintWriter] =
ZManaged.fromAutoCloseable(UIO(new PrintWriter(new File(filename))))
// This returns an effect which produces an `SttpBackend`, thus we flatMap over it
// to extract the backend.
val program = AsyncHttpClientZioBackend().flatMap { implicit backend =>
ZIO.foreachPar(files) { file =>
for {
// Wrap the synchronous reading of data in a `Task`, but which allows runs this effect on a "blocking" threadpool instead of blocking the main one.
data <- effectBlocking(readData(file))
// `send` will return a `Task` because it is using the implicit backend in scope
resp <- baseRequest.body(data).send()
// Build the managed writer, then "use" it to produce an effect, at the end of `use` it will automatically close the writer.
_ <- managedWriter("").use(w => Task(w.write(resp.body.toString)))
} yield ()
}
}
At this point you will just have the program which you will need to run using one of the unsafe methods or if you are using a zio.App through the main method.
* Not always possible or convenient, but it is useful because it prevents resource hogging by yielding tasks back to the runtime for scheduling.
When you use a purely functional IO library like ZIO, you must not call any side-effecting functions (like getData) except when calling factory methods like Task.effect or Task.apply.
ZIO.foreach(files) { file =>
Task {
getData(file)
file.getName
}
}

Akka Streams, break tuple item apart?

Using the superPool from akka-http, I have a stream that passes down a tuple. I would like to pipeline it to the Alpakka Google Pub/Sub connector. At the end of the HTTP processing, I encode everything for the pub/sub connector and end up with
(PublishRequest, Long) // long is a timestamp
but the interface of the connector is
Flow[PublishRequest, Seq[String], NotUsed]
One first approach is to kill one part:
.map{ case(publishRequest, timestamp) => publishRequest }
.via(publishFlow)
Is there an elegant way to create this pipeline while keeping the Long information?
EDIT: added my not-so-elegant solution in the answers. More answers welcome.
I don't see anything inelegant about your solution using GraphDSL.create(), which I think has an advantage of visualizing the stream structure via the diagrammatic ~> clauses. I do see problem in your code. For example, I don't think publisher should be defined by add-ing a flow to the builder.
Below is a skeletal version (briefly tested) of what I believe publishAndRecombine should look like:
val publishFlow: Flow[PublishRequest, Seq[String], NotUsed] = ???
val publishAndRecombine = Flow.fromGraph(GraphDSL.create() { implicit b =>
import GraphDSL.Implicits._
val bcast = b.add(Broadcast[(PublishRequest, Long)](2))
val zipper = b.add(Zip[Seq[String], Long])
val publisher = Flow[(PublishRequest, Long)].
map{ case (pr, _) => pr }.
via(publishFlow)
val timestamp = Flow[(PublishRequest, Long)].
map{ case (_, ts) => ts }
bcast.out(0) ~> publisher ~> zipper.in0
bcast.out(1) ~> timestamp ~> zipper.in1
FlowShape(bcast.in, zipper.out)
})
There is now a much nicer solution for this which will be released in Akka 2.6.19 (see https://github.com/akka/akka/pull/31123).
In order to use the aformentioned unsafeViaData you would first have to represent (PublishRequest, Long) using FlowWithContext/SourceWithContext. FlowWithContext/SourceWithContext is an abstraction that was specifically designed to solve this problem (see https://doc.akka.io/docs/akka/current/stream/stream-context.html). The problem being you have a stream with the data part that is typically what you want to operate on (in your case the ByteString) and then you have the context (aka metadata) part which you typically just pass along unmodified (in your case the Long).
So in the end you would have something like this
val myFlow: FlowWithContext[PublishRequest, Long, PublishRequest, Long, NotUsed] =
FlowWithContext.fromTuples(originalFlowAsTuple) // Original flow that has `(PublishRequest, Long)` as an output
myFlow.unsafeViaData(publishFlow)
In contrast to Akka Streams, break tuple item apart?, not only is this solution involve much less boilerplate since its part of akka but it also retains the materialized value rather than losing it and always ending up with a NotUsed.
For the people wondering why the method unsafeViaData has unsafe in the name, its because the Flow that you pass into this method cannot add,drop or reorder any of the elements in the stream (doing so would mean that the context no longer properly corresponds to the data part of the stream). Ideally we would use Scala's type system to catch such errors at compile time but doing so would require a lot of changes to akka-stream especially if the changes need to remain backwards compatibility (which when dealing with akka we do). More details are in the PR mentioned earlier.
My not-so-elegant solution is using a custom flows that recombine things:
val publishAndRecombine = Flow.fromGraph(GraphDSL.create() { implicit b =>
val bc = b.add(Broadcast[(PublishRequest, Long)](2))
val publisher = b.add(Flow[(PublishRequest, Long)]
.map { case (pr, _) => pr }
.via(publishFlow))
val zipper = b.add(Zip[Seq[String], Long]).
bc.out(0) ~> publisher ~> zipper.in0
bc.out(1).map { case (pr, long) => long } ~> zipper.in1
FlowShape(bc.in, zipper.out)
})

How do I create an Akka Source[String] from a text file?

This seems like the simplest thing in the world, but I am just getting started and I'm stymied, so please bear with me.
The FileIO object provides the fromFile function, which unsurprisingly returns a Source[ByteString, Future[IOResult]].
But I have an UTF-encoded text file and I want a Source[String, Future[IOResult]] -- that is, a source of a regular string, with unicode characters, not of a byte string of nonsense.
This is a Hello, World-level example, but I'm stuck.
(And what's not helping is the name collision between scala.io.Source, whose fromFile is exactly what I needed, and akka.stream.scaladsl.Source so if anyone could explain that to me, I'd be grateful.)
You can use decodeString function from ByteString to decode to UTF-x:
decodeString(charset: String): String
Or more commonly, for UTF-8:
utf8String: String
So, basically:
val path = Paths.get("/tmp/example.txt")
FileIO.fromPath(path)
.via(Framing.delimiter(ByteString(System.lineSeparator), maximumFrameLength=8192, allowTruncation=true))
.map(_.utf8String)
Please notice FileIO.fromFile is now deprecated.
import akka.NotUsed
import scala.io.{Source => fileSource}
import akka.stream.scaladsl.{Source => strmSource}
val it = fileSource.fromFile("/home/expert/myfile.txt").getLines()
val source : strmSource[String, NotUsed] = strmSource.fromIterator(() => it)
.map { line =>
line.reverse
}
This whole thing is lazily evaluated thus 100GB won't read at once.

Iteratees in Scala that use lazy evaluation or fusion?

I have heard that iteratees are lazy, but how lazy exactly are they? Alternatively, can iteratees be fused with a postprocessing function, so that an intermediate data structure does not have to be built?
Can I in my iteratee for example build a 1 million item Stream[Option[String]] from a java.io.BufferedReader, and then subsequently filter out the Nones, in a compositional way, without requiring the entire Stream to be held in memory? And at the same time guarantee that I don't blow the stack? Or something like that - it doesn't have to use a Stream.
I'm currently using Scalaz 6 but if other iteratee implementations are able to do this in a better way, I'd be interested to know.
Please provide a complete solution, including closing the BufferedReader and calling unsafePerformIO, if applicable.
Here's a quick iteratee example using the Scalaz 7 library that demonstrates the properties you're interested in: constant memory and stack usage.
The problem
First assume we've got a big text file with a string of decimal digits on each line, and we want to find all the lines that contain at least twenty zeros. We can generate some sample data like this:
val w = new java.io.PrintWriter("numbers.txt")
val r = new scala.util.Random(0)
(1 to 1000000).foreach(_ =>
w.println((1 to 100).map(_ => r.nextInt(10)).mkString)
)
w.close()
Now we've got a file named numbers.txt. Let's open it with a BufferedReader:
val reader = new java.io.BufferedReader(new java.io.FileReader("numbers.txt"))
It's not excessively large (~97 megabytes), but it's big enough for us to see easily whether our memory use is actually staying constant while we process it.
Setting up our enumerator
First for some imports:
import scalaz._, Scalaz._, effect.IO, iteratee.{ Iteratee => I }
And an enumerator (note that I'm changing the IoExceptionOrs into Options for the sake of convenience):
val enum = I.enumReader(reader).map(_.toOption)
Scalaz 7 doesn't currently provide a nice way to enumerate a file's lines, so we're chunking through the file one character at a time. This will of course be painfully slow, but I'm not going to worry about that here, since the goal of this demo is to show that we can process this large-ish file in constant memory and without blowing the stack. The final section of this answer gives an approach with better performance, but here we'll just split on line breaks:
val split = I.splitOn[Option[Char], List, IO](_.cata(_ != '\n', false))
And if the fact that splitOn takes a predicate that specifies where not to split confuses you, you're not alone. split is our first example of an enumeratee. We'll go ahead and wrap our enumerator in it:
val lines = split.run(enum).map(_.sequence.map(_.mkString))
Now we've got an enumerator of Option[String]s in the IO monad.
Filtering the file with an enumeratee
Next for our predicate—remember that we said we wanted lines with at least twenty zeros:
val pred = (_: String).count(_ == '0') >= 20
We can turn this into a filtering enumeratee and wrap our enumerator in that:
val filtered = I.filter[Option[String], IO](_.cata(pred, true)).run(lines)
We'll set up a simple action that just prints everything that makes it through this filter:
val printAction = (I.putStrTo[Option[String]](System.out) &= filtered).run
Of course we've not actually read anything yet. To do that we use unsafePerformIO:
printAction.unsafePerformIO()
Now we can watch the Some("0946943140969200621607610...")s slowly scroll by while our memory usage remains constant. It's slow and the error handling and output are a little clunky, but not too bad I think for about nine lines of code.
Getting output from an iteratee
That was the foreach-ish usage. We can also create an iteratee that works more like a fold—for example gathering up the elements that make it through the filter and returning them in a list. Just repeat everything above up until the printAction definition, and then write this instead:
val gatherAction = (I.consume[Option[String], IO, List] &= filtered).run
Kick that action off:
val xs: Option[List[String]] = gatherAction.unsafePerformIO().sequence
Now go get a coffee (it might need to be pretty far away). When you come back you'll either have a None (in the case of an IOException somewhere along the way) or a Some containing a list of 1,943 strings.
Complete (faster) example that automatically closes the file
To answer your question about closing the reader, here's a complete working example that's roughly equivalent to the second program above, but with an enumerator that takes responsibility for opening and closing the reader. It's also much, much faster, since it reads lines, not characters. First for imports and a couple of helper methods:
import java.io.{ BufferedReader, File, FileReader }
import scalaz._, Scalaz._, effect._, iteratee.{ Iteratee => I, _ }
def tryIO[A, B](action: IO[B]) = I.iterateeT[A, IO, Either[Throwable, B]](
action.catchLeft.map(
r => I.sdone(r, r.fold(_ => I.eofInput, _ => I.emptyInput))
)
)
def enumBuffered(r: => BufferedReader) =
new EnumeratorT[Either[Throwable, String], IO] {
lazy val reader = r
def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
k =>
tryIO(IO(reader.readLine())).flatMap {
case Right(null) => s.pointI
case Right(line) => k(I.elInput(Right(line))) >>== apply[A]
case e => k(I.elInput(e))
}
)
}
And now the enumerator:
def enumFile(f: File): EnumeratorT[Either[Throwable, String], IO] =
new EnumeratorT[Either[Throwable, String], IO] {
def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
k =>
tryIO(IO(new BufferedReader(new FileReader(f)))).flatMap {
case Right(reader) => I.iterateeT(
enumBuffered(reader).apply(s).value.ensuring(IO(reader.close()))
)
case Left(e) => k(I.elInput(Left(e)))
}
)
}
And we're ready to go:
val action = (
I.consume[Either[Throwable, String], IO, List] %=
I.filter(_.fold(_ => true, _.count(_ == '0') >= 20)) &=
enumFile(new File("numbers.txt"))
).run
Now the reader will be closed when the processing is done.
I should have read a little bit further... this is precisely what enumeratees are for. Enumeratees are defined in Scalaz 7 and Play 2, but not in Scalaz 6.
Enumeratees are for "vertical" composition (in the sense of "vertically integrated industry") while ordinary iteratees compose monadically in a "horizontal" way.

Parallel file processing in Scala

Suppose I need to process files in a given folder in parallel. In Java I would create a FolderReader thread to read file names from the folder and a pool of FileProcessor threads. FolderReader reads file names and submits the file processing function (Runnable) to the pool executor.
In Scala I see two options:
create a pool of FileProcessor actors and schedule a file processing function with Actors.Scheduler.
create an actor for each file name while reading the file names.
Does it make sense? What is the best option?
Depending on what you're doing, it may be as simple as
for(file<-files.par){
//process the file
}
I suggest with all my energies to keep as far as you can from the threads. Luckily we have better abstractions which take care of what's happening below, and in your case it appears to me that you do not need to use actors (while you can) but you can use a simpler abstraction, called Futures. They are a part of Akka open source library, and I think in the future will be a part of the Scala standard library as well.
A Future[T] is simply something that will return a T in the future.
All you need to run a future, is to have an implicit ExecutionContext, which you can derive from a java executor service. Then you will be able to enjoy the elegant API and the fact that a future is a monad to transform collections into collections of futures, collect the result and so on. I suggest you to give a look to http://doc.akka.io/docs/akka/2.0.1/scala/futures.html
object TestingFutures {
implicit val executorService = Executors.newFixedThreadPool(20)
implicit val executorContext = ExecutionContext.fromExecutorService(executorService)
def testFutures(myList:List[String]):List[String]= {
val listOfFutures : Future[List[String]] = Future.traverse(myList){
aString => Future{
aString.reverse
}
}
val result:List[String] = Await.result(listOfFutures,1 minute)
result
}
}
There's a lot going on here:
I am using Future.traverse which receives as a first parameter which is M[T]<:Traversable[T] and as second parameter a T => Future[T] or if you prefer a Function1[T,Future[T]] and returns Future[M[T]]
I am using the Future.apply method to create an anonymous class of type Future[T]
There are many other reasons to look at Akka futures.
Futures can be mapped because they are monad, i.e. you can chain Futures execution :
Future { 3 }.map { _ * 2 }.map { _.toString }
Futures have callback: future.onComplete, onSuccess, onFailure, andThen etc.
Futures support not only traverse, but also for comprehension
Ideally you should use two actors. One for reading the list of files, and one for actually reading the file.
You start the process by simply sending a single "start" message to the first actor. The actor can then read the list of files, and send a message to the second actor. The second actor then reads the file and processes the contents.
Having multiple actors, which might seem complicated, is actually a good thing in the sense that you have a bunch of objects communicating with eachother, like in a theoretical OO system.
Edit: you REALLY shouldn't be doing doing concurrent reading of a single file.
I was going to write up exactly what #Edmondo1984 did except he beat me to it. :) I second his suggestion in a big way. I'll also suggest that you read the documentation for Akka 2.0.2. As well, I'll give you a slightly more concrete example:
import akka.dispatch.{ExecutionContext, Future, Await}
import akka.util.duration._
import java.util.concurrent.Executors
import java.io.File
val execService = Executors.newCachedThreadPool()
implicit val execContext = ExecutionContext.fromExecutorService(execService)
val tmp = new File("/tmp/")
val files = tmp.listFiles()
val workers = files.map { f =>
Future {
f.getAbsolutePath()
}
}.toSeq
val result = Future.sequence(workers)
result.onSuccess {
case filenames =>
filenames.foreach { fn =>
println(fn)
}
}
// Artificial just to make things work for the example
Thread.sleep(100)
execContext.shutdown()
Here I use sequence instead of traverse, but the difference is going to depend on your needs.
Go with the Future, my friend; the Actor is just a more painful approach in this instance.
But if use actors, what's wrong with that?
If we have to read / write to some property file. There is my Java example. But still with Akka Actors.
Lest's say we have an actor ActorFile represents one file. Hm.. Probably it can not represent One file. Right? (would be nice it could). So then it represents several files like PropertyFilesActor then:
Why would not use something like this:
public class PropertyFilesActor extends UntypedActor {
Map<String, String> filesContent = new LinkedHashMap<String, String>();
{ // here we should use real files of cource
filesContent.put("file1.xml", "");
filesContent.put("file2.xml", "");
}
#Override
public void onReceive(Object message) throws Exception {
if (message instanceof WriteMessage) {
WriteMessage writeMessage = (WriteMessage) message;
String content = filesContent.get(writeMessage.fileName);
String newContent = content + writeMessage.stringToWrite;
filesContent.put(writeMessage.fileName, newContent);
}
else if (message instanceof ReadMessage) {
ReadMessage readMessage = (ReadMessage) message;
String currentContent = filesContent.get(readMessage.fileName);
// Send the current content back to the sender
getSender().tell(new ReadMessage(readMessage.fileName, currentContent), getSelf());
}
else unhandled(message);
}
}
...a message will go with parameter (fileName)
It has its own in-box, accepting messages like:
WriteLine(fileName, string)
ReadLine(fileName, string)
Those messages will be storing into to the in-box in the order, one after antoher. The actor would do its work by receiving messages from the box - storing/reading, and meanwhile sending feedback sender ! message back.
Thus, let's say if we write to the property file, and send showing the content on the web page. We can start showing page (right after we sent message to store a data to the file) and as soon as we received the feedback, update part of the page with a data from just updated file (by ajax).
Well, grab your files and stick them in a parallel structure
scala> new java.io.File("/tmp").listFiles.par
res0: scala.collection.parallel.mutable.ParArray[java.io.File] = ParArray( ... )
Then...
scala> res0 map (_.length)
res1: scala.collection.parallel.mutable.ParArray[Long] = ParArray(4943, 1960, 4208, 103266, 363 ... )