I'm quite new to Akka Streams and Akka HTTP.
I'd like to build a simple HTTP server that generates a zip file from the contents of a folder and sends it to the client.
org.zeroturnaround.zip.ZipUtil makes creating a zip file very easy, but it needs an OutputStream.
Here is my solution (written in Scala):
val os = new ByteArrayOutputStream()
ZipUtil.pack(myFolder, os)
HttpResponse(entity = HttpEntity(
This solution works, but it keeps the entire contents in memory, so it doesn't scale.
I think the key for solving this is to use this:
val source = StreamConverters.asOutputStream()
but I don't know how to use it. :-(
Any help please?
Try this:
val byteSource: Source[ByteString, Unit] = StreamConverters.asOutputStream()
.mapMaterializedValue(os => ZipUtil.pack(myFolder, os))
HttpResponse(entity = HttpEntity(
You only get access to the OutputStream once the source is materialized,
which might not happen immediately. In theory the source could also be materialized multiple times, so your code should be able to deal with that.
I had the same problem. To make it backpressure-compatible, I had to write an artificial InputStream, which is then converted to a Source via StreamConverters.fromInputStream(() => input), which in turn you return from the complete directive of the Akka HTTP DSL.
Here is what I wrote.
import java.io.{File, IOException, InputStream}
import java.nio.charset.StandardCharsets
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.commons.compress.archivers.sevenz.{SevenZArchiveEntry, SevenZFile}
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}
class DownloadStatsZipReader(path: String, password: String) extends InputStream {
private val (archive, targetDate) = {
val inputFile = new SevenZFile(new File(path), password.getBytes(StandardCharsets.UTF_16LE))
def findValidEntry(): Option[(LocalDate, SevenZArchiveEntry)] =
Option(inputFile.getNextEntry) match {
case Some(entry) =>
if (!entry.isDirectory) {
val parts = entry.getName.toLowerCase.split("\\.(?=[^\\.]+$)")
if (parts(1) == "tab" && entry.getSize > 0)
Try(LocalDate.parse(parts(0), DateTimeFormatter.ISO_LOCAL_DATE)) match {
case Success(localDate) =>
Some(localDate -> entry)
case Failure(_) =>
} else
case None => None
val (date, _) = findValidEntry().getOrElse {
throw new RuntimeException(s"$path has no files named as `YYYY-MM-DD.tab`")
inputFile -> date
private val buffer = new Array[Byte](1024)
private var offsetBuffer: Int = 0
private var sizeBuffer: Int = 0
def getTargetDate: LocalDate = targetDate
override def read(): Int =
sizeBuffer match {
case -1 =>
case 0 =>
case _ =>
if (offsetBuffer < sizeBuffer) {
val result = buffer(offsetBuffer)
offsetBuffer += 1
} else {
sizeBuffer = 0
override def close(): Unit = {
private def loadNextChunk(): Unit = try {
val bytesRead = archive.read(buffer)
if (bytesRead >= 0) {
offsetBuffer = 0
sizeBuffer = bytesRead
} else {
offsetBuffer = -1
sizeBuffer = -1
} catch {
case ex: Throwable =>
throw ex
If you find bugs in my code please let me know.
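One thing worth checking in any hand-written InputStream is the contract of read(): it must return a value in 0..255, or -1 at end of stream. Scala's Byte is signed, so a raw buffer byte has to be masked before it is returned. Here is a minimal sketch of the same chunk-buffer idea over an arbitrary underlying stream; ChunkedInputStream and its parameters are hypothetical names for illustration, not part of the code above.

```scala
import java.io.InputStream

// Hypothetical helper: serves single bytes out of a chunk buffer that is
// refilled from the underlying stream on demand.
class ChunkedInputStream(underlying: InputStream, chunkSize: Int = 1024) extends InputStream {
  require(chunkSize > 0, "chunkSize must be positive")

  private val buffer = new Array[Byte](chunkSize)
  private var offset = 0
  private var size = 0 // becomes -1 once the underlying stream is exhausted

  override def read(): Int = {
    // Refill until there is unread data or the underlying stream has ended.
    while (size != -1 && offset >= size) {
      size = underlying.read(buffer)
      offset = 0
    }
    if (size == -1) -1
    else {
      val result = buffer(offset) & 0xff // mask: read() must return 0..255, never a negative byte
      offset += 1
      result
    }
  }

  override def close(): Unit = underlying.close()
}
```

Without the mask, a byte such as 0xFF would come back as -1 and the consumer would treat it as end of stream.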
I am using a WebSocket client and would like to test against the received message. I've chosen the ScalaTest framework, and I know that the test has to be carried out asynchronously.
The WebSocket client looks like this:
import akka.Done
import akka.http.scaladsl.Http
import akka.stream.scaladsl._
import akka.http.scaladsl.model.ws._
import io.circe.syntax._
import scala.concurrent.Future
object WsClient {
import Trigger._
private val convertJson: PreMsg => String = msg =>
val send: PreMsg => (String => Unit) => RunnableGraph[Future[Done]] = msg => fn =>
and the test:
feature("Process incoming messages") {
info("As a user, I want that incoming messages is going to process appropriately.")
info("A message should contain the following properties: `sap_id`, `sap_event`, `payload`")
scenario("Message is not intended for the server") {
Given("A message with `sap_id:unknown`")
val msg = PreMsg("unknown", "unvalid", "{}")
When("the message gets validated")
val ws = WsClient.send(msg)
Then("it should has the `status: REJECT` in the response content")
ws { msg =>
//Would like test against the msg here
.map(_ => assert(1 == 1))
I would like to test against the content of msg, but I do not know how to do it.
I followed the play-scala-websocket-example.
They use a WebSocketClient as a helper (see WebSocketClient.java).
A test then looks like this:
Helpers.running(TestServer(port, app)) {
val myPublicAddress = s"localhost:$port"
val serverURL = s"ws://$myPublicAddress/ws"
val asyncHttpClient: AsyncHttpClient = client.underlying[AsyncHttpClient]
val webSocketClient = new WebSocketClient(asyncHttpClient)
val queue = new ArrayBlockingQueue[String](10)
val origin = serverURL
val consumer: Consumer[String] = new Consumer[String] {
override def accept(message: String): Unit = queue.put(message)
val listener = new WebSocketClient.LoggingListener(consumer)
val completionStage = webSocketClient.call(serverURL, origin, listener)
val f = FutureConverters.toScala(completionStage)
// Test we can get good output from the websocket
whenReady(f, timeout = Timeout(1.second)) { webSocket =>
val condition: Callable[java.lang.Boolean] = new Callable[java.lang.Boolean] {
override def call(): java.lang.Boolean = webSocket.isOpen && queue.peek() != null
val input: String = queue.take()
val json:JsValue = Json.parse(input)
val symbol = (json \ "symbol").as[String]
List(symbol) must contain oneOf("AAPL", "GOOG", "ORCL")
See here: FunctionalSpec.scala
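The underlying idea, handing the message from the callback over to the test, can also be sketched with a Promise instead of a blocking queue. This is only an illustration under assumptions: fakeSend is a hypothetical stand-in for a client that invokes a callback on another thread, not the WsClient from the question.

```scala
import scala.concurrent.{ExecutionContext, Future, Promise}

object CallbackToFuture {
  private implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for a client: delivers one reply to the callback on another thread.
  def fakeSend(request: String)(onMessage: String => Unit): Future[Unit] =
    Future(onMessage(s"status: REJECT, sap_id: $request"))

  // Completes with the first message the callback receives.
  def firstReply(request: String): Future[String] = {
    val received = Promise[String]()
    fakeSend(request)(msg => received.trySuccess(msg))
    received.future
  }
}
```

In an asynchronous ScalaTest spec, the returned future could then be mapped to an assertion, for example `CallbackToFuture.firstReply("unknown").map(msg => assert(msg.contains("REJECT")))`.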
I want to stream some files and zip them on the fly, so users can download multiple files as a single zip file without anything being written to the local disk. However, my current implementation holds everything in memory and will not work for large files. Is there any way to fix it?
I was looking at this implementation: https://gist.github.com/kirked/03c7f111de0e9a1f74377bf95d3f0f60, but couldn't figure out how to use it.
import java.io.{BufferedOutputStream, ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}
import akka.stream.scaladsl.StreamConverters
import org.apache.commons.io.FileUtils
import play.api.mvc.{Action, Controller}
class HomeController extends Controller {
def single() = Action {
content = new java.io.File("C:\\Users\\a.csv"),
fileName = _ => "a.csv"
def zip() = Action {
CONTENT_TYPE -> "application/zip",
CONTENT_DISPOSITION -> s"attachment; filename = test.zip"
def fileByteData(): ByteArrayInputStream = {
val fileList = List(
new java.io.File("C:\\Users\\a.csv"),
new java.io.File("C:\\Users\\b.csv")
val baos = new ByteArrayOutputStream()
val zos = new ZipOutputStream(new BufferedOutputStream(baos))
try {
fileList.map(file => {
zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
} finally {
new ByteArrayInputStream(baos.toByteArray)
Instead of using a ByteArrayOutputStream to buffer the contents in an array and then putting them into a ByteArrayInputStream, you could use Java's piping mechanism.
Here's a sketch solution:
def zip() = Action {
// Create Source that listens to an OutputStream
// and pass it to `fileByteData` method.
val zipSource: Source[ByteString, Unit] =
CONTENT_TYPE -> "application/zip",
CONTENT_DISPOSITION -> s"attachment; filename = test.zip")
// Send the file data, given an OutputStream to write to.
def fileByteData(os: OutputStream): Unit = {
val fileList = List(
new java.io.File("C:\\Users\\a.csv"),
new java.io.File("C:\\Users\\b.csv")
val zos = new ZipOutputStream(os)
val buffer: Array[Byte] = new Array[Byte](2048)
try {
for (file <- fileList) {
zos.putNextEntry(new ZipEntry(file.toPath.getFileName.toString))
val fis = Files.newInputStream(file.toPath)
try {
def zipFile(): Unit = {
val bytesRead = fis.read(buffer)
if (bytesRead == -1) () else {
zos.write(buffer, 0, bytesRead)
} finally fis.close()
} finally {
This is just an outline of an approach. You'll also want to make sure:
- the threading is OK - the fileByteData will hopefully run on a different thread to the sending thread
- the error handling is OK - e.g. all streams are closed properly if there's an error on either the server (e.g. file not found) or client side (early disconnect)
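To make the threading and cleanup points concrete, here is a minimal sketch of the piping idea using only the JDK. It is an illustration under assumptions, not the Play wiring itself: PipedZip, zipAsStream and the shape of `entries` are hypothetical names, and error propagation to the reader is left crude.

```scala
import java.io.{InputStream, PipedInputStream, PipedOutputStream}
import java.util.zip.{ZipEntry, ZipOutputStream}

object PipedZip {
  // Returns an InputStream that yields zip bytes while a background thread produces them.
  // `entries` pairs each entry name with a function that opens its content (hypothetical shape).
  def zipAsStream(entries: Seq[(String, () => InputStream)]): InputStream = {
    val in = new PipedInputStream(64 * 1024) // bounded pipe: the writer blocks when it is full
    val out = new PipedOutputStream(in)
    val writer = new Thread(() => {
      val zos = new ZipOutputStream(out)
      try {
        val buffer = new Array[Byte](8192)
        for ((name, open) <- entries) {
          zos.putNextEntry(new ZipEntry(name))
          val src = open()
          try {
            var n = src.read(buffer)
            while (n != -1) {
              zos.write(buffer, 0, n)
              n = src.read(buffer)
            }
          } finally src.close() // always release the source, even if writing fails
          zos.closeEntry()
        }
      } finally zos.close() // closing the zip stream closes the pipe, which ends the reader's stream
    })
    writer.setDaemon(true) // do not keep the JVM alive if the reader goes away
    writer.start()
    in
  }
}
```

The writer runs on its own thread because a pipe deadlocks if the same thread both writes and reads. The resulting stream could then be handed to something like StreamConverters.fromInputStream to obtain a Source for the response.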
I made a Source for an Akka Stream based on a ReactiveStreams Publisher like this:
object FlickrSource {
val apiKey = Play.current.configuration.getString("flickr.apikey")
val flickrUserId = Play.current.configuration.getString("flickr.userId")
val flickrPhotoSearchUrl = s"https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=$apiKey&user_id=$flickrUserId&min_taken_date=%s&max_taken_date=%s&format=json&nojsoncallback=1&page=%s&per_page=500"
def byDate(date: LocalDate): Source[JsValue, Unit] = {
Source(new FlickrPhotoSearchPublisher(date))
class FlickrPhotoSearchPublisher(date: LocalDate) extends Publisher[JsValue] {
override def subscribe(subscriber: Subscriber[_ >: JsValue]) {
try {
val from = new LocalDate()
val fromSeconds = from.toDateTimeAtStartOfDay.getMillis
val toSeconds = from.plusDays(1).toDateTimeAtStartOfDay.getMillis
def pageGet(page: Int): Unit = {
val url = flickrPhotoSearchUrl format (fromSeconds, toSeconds, page)
Logger.debug("Flickr search request: " + url)
val photosFound = WS.url(url).get().map { response =>
val json = response.json
val photosThisPage = (json \ "photos" \ "photo").as[JsArray]
val numPages = (json \ "photos" \ "pages").as[JsNumber].value.toInt
Logger.debug(s"pages: $numPages")
Logger.debug(s"photos this page: ${photosThisPage.value.size}")
photosThisPage.value.foreach { photo =>
if (numPages > page) {
pageGet(page + 1)
} else {
} catch {
case ex: Exception => {
It will make a search request to Flickr and source the results as JsValues. I tried to wire it to lots of different Flows and Sinks, but this would be the most basic setup:
val source: Source[JsValue, Unit] = FlickrSource.byDate(date)
val sink: Sink[JsValue, Future[Unit]] = Sink.foreach(println)
val stream = source.toMat(sink)(Keep.right)
I see that the onNext gets called a couple of times, and then the onComplete. However, the Sink does not receive anything. What am I missing, is this not a valid way to create a Source?
I mistakenly assumed that Publisher was a simple interface like Observable that you can implement yourself. The Akka team pointed out that this is not the correct way to implement a Publisher. In fact, Publisher is a complicated interface that is supposed to be implemented by libraries rather than end users. The Source.apply(Publisher) method used in the question is there for interoperability with other Reactive Streams implementations.
The reason I want an implementation of Source is that I want a backpressured source that fetches the search results from Flickr (capped at 500 per request) without making more (or faster) requests than are needed downstream. This can be achieved by implementing an ActorPublisher.
This is the ActorPublisher that does what I want: create a Source that produces search results, but only makes as many REST calls as are needed downstream. I think there is still room for improvement, so feel free to edit it.
import akka.actor.Props
import akka.stream.actor.ActorPublisher
import akka.stream.actor.ActorPublisherMessage.{Cancel, Request}
import org.joda.time.LocalDate
import play.api.Play.current
import play.api.libs.json.{JsArray, JsNumber, JsValue}
import play.api.libs.ws.WS
import play.api.{Logger, Play}
import scala.concurrent.ExecutionContext.Implicits.global
object FlickrSearchActorPublisher {
val apiKey = Play.current.configuration.getString("flickr.apikey")
val flickrUserId = Play.current.configuration.getString("flickr.userId")
val flickrPhotoSearchUrl = s"https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=$apiKey&user_id=$flickrUserId&min_taken_date=%s&max_taken_date=%s&format=json&nojsoncallback=1&per_page=500&page="
def byDate(from: LocalDate): Props = {
val fromSeconds = from.toDateTimeAtStartOfDay.getMillis / 1000
val toSeconds = from.plusDays(1).toDateTimeAtStartOfDay.getMillis / 1000
val url = flickrPhotoSearchUrl format (fromSeconds, toSeconds)
Props(new FlickrSearchActorPublisher(url))
class FlickrSearchActorPublisher(url: String) extends ActorPublisher[JsValue] {
var currentPage = 1
var numPages = 1
var photos = Seq[JsValue]()
def searching: Receive = {
case Request(count) =>
Logger.debug(s"Received Request for $count results from Subscriber, ignoring as we are still searching")
case Cancel =>
Logger.info("Cancel Message Received, stopping")
case _ =>
def accepting: Receive = {
case Request(count) =>
Logger.debug(s"Received Request for $count results from Subscriber")
case Cancel =>
Logger.info("Cancel Message Received, stopping")
case _ =>
def getNextPageOrStop(): Unit = {
if (currentPage > numPages) {
Logger.debug("No more pages, stopping")
} else {
val pageUrl = url + currentPage
Logger.debug("Flickr search request: " + pageUrl)
WS.url(pageUrl).get().map { response =>
val json = response.json
val photosThisPage = (json \ "photos" \ "photo").as[JsArray]
numPages = (json \ "photos" \ "pages").as[JsNumber].value.toInt
Logger.debug(s"page $currentPage of $numPages")
Logger.debug(s"photos this page: ${photosThisPage.value.size}")
photos = photosThisPage.value.seq
if (photos.isEmpty) {
Logger.debug("No photos found, stopping")
} else {
currentPage = currentPage + 1
def sendSearchResults(): Unit = {
if (photos.isEmpty) {
} else {
while(isActive && totalDemand > 0) {
photos = photos.tail
if (photos.isEmpty) {
val receive = searching
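The core idea, fetching the next page only when the consumer has used up the previous one, can also be illustrated outside Akka with a plain pull-based iterator. This is a simplified, synchronous sketch and not the ActorPublisher above: PagedResults, Page and fetchPage are hypothetical names, and fetchPage blocks, whereas the WS call in the actor is asynchronous.

```scala
object PagedResults {
  // One page of results plus the total number of pages reported by the service.
  final case class Page[A](items: Seq[A], totalPages: Int)

  // Walks pages 1..totalPages, calling `fetchPage` only when the consumer
  // asks for an element and the current page is exhausted.
  def iterate[A](fetchPage: Int => Page[A]): Iterator[A] = new Iterator[A] {
    private var nextPage = 1
    private var totalPages = 1 // updated from every response
    private var current: Iterator[A] = Iterator.empty

    def hasNext: Boolean = {
      while (!current.hasNext && nextPage <= totalPages) {
        val page = fetchPage(nextPage)
        totalPages = page.totalPages
        nextPage += 1
        current = page.items.iterator
      }
      current.hasNext
    }

    def next(): A =
      if (hasNext) current.next()
      else throw new NoSuchElementException("no more results")
  }
}
```

In more recent Akka Streams versions, such an iterator could be wrapped with Source.fromIterator, so that downstream demand decides when the next request is made.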
I have a zip file containing multiple text files.
I want to read each of the files and build a list of RDDs containing the content of each file.
val test = sc.textFile("/Volumes/work/data/kaggle/dato/test/5.zip")
will just read the entire file, but how do I iterate through each entry of the zip and then save it in an RDD using Spark?
I am fine with Scala or Python.
A possible solution in Python with Spark:
archive = zipfile.ZipFile(archive_path, 'r')
file_paths = archive.namelist()
for file_path in file_paths:
urls = file_path.split("/")
urlId = urls[-1].split('_')[0]
Apache Spark default compression support
I have written up all the necessary theory in another answer that you might want to refer to: https://stackoverflow.com/a/45958182/1549135
Read zip containing multiple files
I followed the advice given by @Herman and used ZipInputStream. This gave me the following solution, which returns an RDD[String] of the zip content.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream
import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {
def readFile(path: String,
minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
if (path.endsWith(".zip")) {
sc.binaryFiles(path, minPartitions)
.flatMap { case (name: String, content: PortableDataStream) =>
val zis = new ZipInputStream(content.open)
.takeWhile {
case null => zis.close(); false
case _ => true
.flatMap { _ =>
val br = new BufferedReader(new InputStreamReader(zis))
Stream.continually(br.readLine()).takeWhile(_ != null)
} else {
sc.textFile(path, minPartitions)
Simply use it by importing the implicit class and calling the readFile method on SparkContext:
import com.github.atais.spark.Implicits.ZipSparkContext
If you are reading binary files use sc.binaryFiles. This will return an RDD of tuples containing the file name and a PortableDataStream. You can feed the latter into a ZipInputStream.
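The ZipInputStream part can be sketched without Spark, using only the JDK and the Scala standard library. ZipLines and readEntries are hypothetical names for illustration; the stream passed in would be whatever the PortableDataStream opens.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets
import java.util.zip.ZipInputStream
import scala.io.Source

object ZipLines {
  // Reads every non-directory entry of a zip stream and returns (entryName, lines) pairs.
  def readEntries(in: InputStream): List[(String, List[String])] = {
    val zis = new ZipInputStream(in)
    try {
      Iterator
        .continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .filterNot(_.isDirectory)
        .map { entry =>
          // The stream is positioned at this entry; reading stops at the entry's end.
          val lines = Source.fromInputStream(zis, StandardCharsets.UTF_8.name()).getLines().toList
          entry.getName -> lines
        }
        .toList // force evaluation before the stream is closed
    } finally zis.close()
  }
}
```

With Spark this could be used along the lines of `sc.binaryFiles(path).flatMap { case (_, content) => ZipLines.readEntries(content.open()) }`. Note that this variant materializes each entry's lines in memory, so it suits archives whose individual files are moderately sized.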
Here's a working version of @Atais's solution (which needed enhancing by closing the streams):
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {
def readFile(path: String,
minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
if (path.toLowerCase.contains("zip")) {
sc.binaryFiles(path, minPartitions)
.flatMap {
case (zipFilePath, zipContent) ⇒
val zipInputStream = new ZipInputStream(zipContent.open())
.takeWhile(_ != null)
.map { _ ⇒
scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString("\n")
} #::: { zipInputStream.close; Stream.empty[String] }
} else {
sc.textFile(path, minPartitions)
Then all you have to do to read a zip file is the following:
This returns only the first line. Can anyone share their insights? I am trying to read a zipped CSV file and create a JavaRDD for further processing.
JavaPairRDD<String, PortableDataStream> zipData =
JavaRDD<Record> newRDDRecord = zipData.flatMap(
new FlatMapFunction<Tuple2<String, PortableDataStream>, Record>(){
public Iterator<Record> call(Tuple2<String,PortableDataStream> content) throws Exception {
List<Record> records = new ArrayList<Record>();
ZipInputStream zin = new ZipInputStream(content._2.open());
ZipEntry zipEntry;
while ((zipEntry = zin.getNextEntry()) != null) {
if (!zipEntry.isDirectory()) {
Record sd;
String line;
InputStreamReader streamReader = new InputStreamReader(zin);
BufferedReader bufferedReader = new BufferedReader(streamReader);
line = bufferedReader.readLine();
String[] records= new CSVParser().parseLineMulti(line);
sd = new Record(TimeBuilder.convertStringToTimestamp(records[0]),
return records.iterator();
Here is another working solution, which also emits the file name; it can later be split off and used to create separate schemas.
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {
def readFile(path: String,
minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {
if (path.toLowerCase.contains("zip")) {
sc.binaryFiles(path, minPartitions)
.flatMap {
case (zipFilePath, zipContent) ⇒
val zipInputStream = new ZipInputStream(zipContent.open())
.takeWhile(_ != null)
.map { x ⇒
val filename1 = x.getName
scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString(s"~${filename1}\n")+s"~${filename1}"
} #::: { zipInputStream.close; Stream.empty[String] }
} else {
sc.textFile(path, minPartitions)
full code is here
I am trying to write a simple application based on akka-http and akka-streams that handles HTTP requests, always with one precompiled stream, because I plan to use long-running processing with back-pressure in my requestProcessor stream.
My application code:
import akka.actor.{ActorSystem, Props}
import akka.http.scaladsl._
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server._
import akka.stream.ActorFlowMaterializer
import akka.stream.actor.ActorPublisher
import akka.stream.scaladsl.{Sink, Source}
import scala.annotation.tailrec
import scala.concurrent.Future
object UserRegisterSource {
def props: Props = Props[UserRegisterSource]
final case class RegisterUser(username: String)
class UserRegisterSource extends ActorPublisher[UserRegisterSource.RegisterUser] {
import UserRegisterSource._
import akka.stream.actor.ActorPublisherMessage._
val MaxBufferSize = 100
var buf = Vector.empty[RegisterUser]
override def receive: Receive = {
case request: RegisterUser =>
if (buf.isEmpty && totalDemand > 0)
else {
buf :+= request
case Request(_) =>
case Cancel =>
@tailrec final def deliverBuf(): Unit =
if (totalDemand > 0) {
if (totalDemand <= Int.MaxValue) {
val (use, keep) = buf.splitAt(totalDemand.toInt)
buf = keep
use foreach onNext
} else {
val (use, keep) = buf.splitAt(Int.MaxValue)
buf = keep
use foreach onNext
object Main extends App {
val host = ""
val port = 8094
implicit val system = ActorSystem("my-testing-system")
implicit val fm = ActorFlowMaterializer()
implicit val executionContext = system.dispatcher
val serverSource: Source[Http.IncomingConnection, Future[Http.ServerBinding]] = Http(system).bind(interface = host, port = port)
val mySource = Source.actorPublisher[UserRegisterSource.RegisterUser](UserRegisterSource.props)
val requestProcessor = mySource
val route: Route =
get {
path("test") {
parameter('test) { case t: String =>
requestProcessor ! UserRegisterSource.RegisterUser(t)
def fakeSaveUserAndReturnCreatedUserId(param: UserRegisterSource.RegisterUser): Future[Int] =
Future.successful {
serverSource.to(Sink.foreach {
connection =>
connection handleWith Route.handlerFlow(route)
I found a solution for creating a Source that can dynamically accept new items to process, but I can't find any solution for obtaining the result of the stream execution in my route.
The direct answer to your question is to materialize a new Stream for each HttpRequest and use Sink.head to get the value you're looking for. Modifying your code:
val requestStream =
//.run() - don't materialize here
val route: Route =
get {
path("test") {
parameter('test) { case t: String =>
//materialize a new Stream here
val userIdFut : Future[Int] = requestStream.run()
requestProcessor ! UserRegisterSource.RegisterUser(t)
//get the result of the Stream
userIdFut onSuccess { case userId : Int => ...}
However, I think your question is ill-posed. In your code example, the only thing you're using an Akka Stream for is to create a new UserId. Futures readily solve this problem without the need for a materialized Stream (and all the accompanying overhead):
val route: Route =
get {
path("test") {
parameter('test) { case t: String =>
val user = RegisterUser(t)
fakeSaveUserAndReturnCreatedUserId(user) onSuccess { case userId : Int =>
If you want to limit the number of concurrent calls to fakeSaveUserAndReturnCreatedUserId, you can create an ExecutionContext with a defined ThreadPool size, as explained in the answer to this question, and use that ExecutionContext to create the Futures:
val ThreadCount = 10 //concurrent queries
val limitedExecutionContext =
def fakeSaveUserAndReturnCreatedUserId(param: UserRegisterSource.RegisterUser): Future[Int] =
Future { 1 }(limitedExecutionContext)
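As a self-contained illustration of that idea, here is one way such a bounded ExecutionContext could be built with a fixed-size JDK thread pool. This is an assumption about the approach rather than the exact definition used above; LimitedPool, slowTask, runAll and maxObserved are hypothetical names, and the counters exist only to make the limit observable.

```scala
import java.util.concurrent.{ExecutorService, Executors}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object LimitedPool {
  val ThreadCount = 2 // upper bound on concurrently running tasks

  val pool: ExecutorService = Executors.newFixedThreadPool(ThreadCount)
  val limitedExecutionContext: ExecutionContext = ExecutionContext.fromExecutorService(pool)

  private val running = new AtomicInteger(0)
  val maxObserved = new AtomicInteger(0) // highest concurrency seen so far

  // Hypothetical slow operation standing in for a database call.
  def slowTask(id: Int): Future[Int] = Future {
    val now = running.incrementAndGet()
    maxObserved.updateAndGet(m => math.max(m, now))
    try { Thread.sleep(50); id } finally running.decrementAndGet()
  }(limitedExecutionContext)

  // Starts n tasks at once and waits for all of them.
  def runAll(n: Int): Seq[Int] = {
    implicit val ec: ExecutionContext = limitedExecutionContext
    Await.result(Future.sequence((1 to n).map(slowTask)), 10.seconds)
  }
}
```

Tasks beyond the pool size wait in the executor's queue, so the limit bounds concurrency but does not apply backpressure to callers. The pool should be shut down when the application stops.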