I'm creating a ByteArrayOutputStream using ZIO Streams, i.e.:
lazy val byteArrayOutputStream = new ByteArrayOutputStream()
val sink = ZSink.fromOutputStream(byteArrayOutputStream).contramapChunks[String](_.flatMap(_.getBytes))
val data = ZStream.unwrap(callToFunction).run(sink)
This works fine - now I need to stream this data back to the client using Akka HTTP.
I can do this:
val arr = byteArrayOutputStream.toByteArray
complete(HttpEntity(ContentTypes.`application/octet-stream`, arr))
which works, but of course toByteArray pulls the whole output stream into memory, i.e. I don't actually stream the data. I'm missing something obvious - is there an easy way to do this?
You can convert the output stream to an Akka Streams Source:
val byteArrayOutputStream = new ByteArrayOutputStream()
val source = StreamConverters.asOutputStream().mapMaterializedValue(_ => byteArrayOutputStream)
and then simply create a chunked HTTP entity:
HttpResponse(entity = HttpEntity.Chunked.fromData(ContentTypes.`application/octet-stream`, source))
More about chunked transfer: https://datatracker.ietf.org/doc/html/rfc7230#section-4.1
For ZIO, you could probably use something like this:
val zSource = ZStream.fromOutputStreamWriter(os => byteArrayOutputStream.writeTo(os))
However, you need to find a way to convert the ZStream to an Akka Streams Source.
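One way to bridge the two is a sketch like the following, assuming the zio-interop-reactivestreams dependency is on the classpath and that you already have a byteStream: ZStream[Any, Throwable, Byte]; the names here are illustrative, not from the original answer. The idea is to expose the ZStream as a Reactive Streams Publisher and wrap it in an Akka Source:
import akka.http.scaladsl.model.{ContentTypes, HttpEntity, HttpResponse}
import akka.stream.scaladsl.Source
import akka.util.ByteString
import zio._
import zio.interop.reactivestreams._
import zio.stream.ZStream

def chunkedResponse(byteStream: ZStream[Any, Throwable, Byte]): UIO[HttpResponse] =
  byteStream.toPublisher.map { publisher =>
    // One ByteString per byte for simplicity; batch (e.g. with .grouped) in practice.
    val source: Source[ByteString, Any] = Source.fromPublisher(publisher).map(b => ByteString(b))
    HttpResponse(entity = HttpEntity.Chunked.fromData(ContentTypes.`application/octet-stream`, source))
  }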
I should get a Map[String, String] back from a Kafka consumer, but I don't really know how. I managed to configure the consumer and it works fine, but I don't understand how I can get the Map.
implicit val system: ActorSystem = ActorSystem()
val consumerConfig = system.settings.config.getConfig("akka.kafka.consumer")
val kafkaConsumerSettings =
  ConsumerSettings(consumerConfig, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9094")
    .withGroupId("group1")
Consumer
.plainSource(kafkaConsumerSettings, Subscriptions.topics(entity.entity_name))
.toMat(Sink.foreach(println))(DrainingControl.apply)
.run()
Lightbend's recommendation is to deal with byte arrays when deserializing incoming data from Kafka:
The general recommendation for de-/serialization of messages is to use byte arrays (or Strings) as value and do the de-/serialization in a map operation in the Akka Stream instead of implementing it directly in Kafka de-/serializers. When deserialization is handled explicitly within the Akka Stream, it is easier to implement the desired error handling strategy as the examples below show.
To do so, you can set up the consumer with these settings:
val consumerSettings = ConsumerSettings(consumerConfig, new StringDeserializer, new ByteArrayDeserializer)
Then get the raw bytes by calling the .value() method on each record. To deserialize them, I would recommend using circe + jawn. This code should do the trick:
import java.nio.ByteBuffer
import io.circe.jawn
import io.circe.generic.auto._
val bytes = record.value()
val data = jawn.parseByteBuffer(ByteBuffer.wrap(bytes)).flatMap(_.as[Map[String, String]])
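To put the pieces together, here is a minimal sketch, reusing consumerConfig and the implicit ActorSystem from the snippet above; the topic name is a placeholder and the error handling is only illustrative:
import java.nio.ByteBuffer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.scaladsl.Sink
import io.circe.jawn
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}

val byteConsumerSettings =
  ConsumerSettings(consumerConfig, new StringDeserializer, new ByteArrayDeserializer)
    .withBootstrapServers("localhost:9094")
    .withGroupId("group1")

Consumer
  .plainSource(byteConsumerSettings, Subscriptions.topics("my-topic")) // placeholder topic
  .map(record => jawn.parseByteBuffer(ByteBuffer.wrap(record.value())).flatMap(_.as[Map[String, String]]))
  .map {
    case Right(map) => println(map)                       // the decoded Map[String, String]
    case Left(err)  => println(s"failed to decode: $err") // decide how to handle bad payloads
  }
  .runWith(Sink.ignore)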
I want to upload file into S3 using Alpakka and at the same time parse it with Tika to obtain its MimeType.
I have 3 parts of the graph at the moment:
val fileSource: Source[ByteString, Any] // comes from Akka-HTTP
val fileUpload: Sink[ByteString, Future[MultipartUploadResult]] // created by S3Client from Alpakka
val mimeTypeDetection: Sink[ByteString, Future[MediaType.Binary]] // my implementation using Apache Tika
I would like to obtain both results in one place, something like:
Future[(MultipartUploadResult, MediaType.Binary)]
I have no issue with broadcasting part:
val broadcast = builder.add(Broadcast[ByteString](2))
source ~> broadcast ~> fileUpload
broadcast ~> mimeTypeDetection
However, I have trouble composing the Sinks. The methods I found in the API and documentation assume either that the combined sinks are of the same type or that I am zipping Flows, not Sinks.
What is the suggested approach in such a case?
Two ways:
1) using alsoToMat (easier, no GraphDSL, enough for your example)
val mat1: (Future[MultipartUploadResult], Future[Binary]) =
  fileSource
    .alsoToMat(fileUpload)(Keep.right)
    .toMat(mimeTypeDetection)(Keep.both)
    .run()
2) using GraphDSL with custom materialized values (more verbose, more flexible; more info on this in the docs)
val mat2: (Future[MultipartUploadResult], Future[Binary]) =
  RunnableGraph.fromGraph(GraphDSL.create(fileUpload, mimeTypeDetection)((_, _)) {
    implicit builder => (fileUpload, mimeTypeDetection) =>
      import GraphDSL.Implicits._
      val broadcast = builder.add(Broadcast[ByteString](2))
      fileSource ~> broadcast ~> fileUpload
                    broadcast ~> mimeTypeDetection
      ClosedShape
  }).run()
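In both cases you end up with a pair of futures rather than a single future of a pair; a small follow-up sketch (plain standard-library futures, nothing Akka-specific) to get the Future[(MultipartUploadResult, MediaType.Binary)] you asked for:
val (uploadResult, mimeType) = mat1 // or mat2; both materialize the same pair
val combined: Future[(MultipartUploadResult, MediaType.Binary)] = uploadResult.zip(mimeType)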
I have a use case where I need to call a REST API from Spark Streaming after messages are read from Kafka, perform some calculation, and save the result back to HDFS and a third-party application.
I have a few doubts here:
How can we call the REST API directly from Spark Streaming?
How do we manage the REST API timeout relative to the streaming batch time?
This code will not compile as it is, but it shows the approach for the given use case.
val conf = new SparkConf().setAppName("App name").setMaster("yarn")
val ssc = new StreamingContext(conf, Seconds(1))
val dstream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
dstream.foreachRDD { rdd =>
//Write the rdd to HDFS directly
rdd.saveAsTextFile("hdfs/location/to/save")
//loop through each partition in the rdd
rdd.foreachPartition { partitionOfRecords =>
//1. Create HttpClient object here
//2.a POST data to API
//Use it if you want record-level control in the rdd or partition
partitionOfRecords.foreach { record =>
//2.b POST the data to the API
record.toString
}
}
//Use 2.a or 2.b to POST data as per your req
}
ssc.start()
ssc.awaitTermination()
Most HTTP clients (for REST calls) support request timeouts.
Sample HTTP POST call with a timeout using Apache HttpClient:
import org.apache.http.HttpResponse
import org.apache.http.client.config.RequestConfig
import org.apache.http.client.entity.EntityBuilder
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.{CloseableHttpClient, HttpClientBuilder}

val CONNECTION_TIMEOUT_MS = 20000 // Timeout in millis (20 sec).
val requestConfig = RequestConfig.custom()
  .setConnectionRequestTimeout(CONNECTION_TIMEOUT_MS)
  .setConnectTimeout(CONNECTION_TIMEOUT_MS)
  .setSocketTimeout(CONNECTION_TIMEOUT_MS)
  .build()
val client: CloseableHttpClient = HttpClientBuilder.create().build()
val url = "https://selfsolve.apple.com/wcResults.do"
val post = new HttpPost(url)
//Set the timeout config on the request
post.setConfig(requestConfig)
post.setEntity(EntityBuilder.create.setText("some text to post to API").build())
val response: HttpResponse = client.execute(post)
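To tie the two snippets together, here is a minimal sketch (not the author's exact code; the endpoint URL is a placeholder) that builds one client per partition and POSTs each record:
import org.apache.http.client.config.RequestConfig
import org.apache.http.client.entity.EntityBuilder
import org.apache.http.client.methods.HttpPost
import org.apache.http.impl.client.HttpClientBuilder
import org.apache.http.util.EntityUtils

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Build the config and client inside the partition so nothing has to be serialized to the executors
    val requestConfig = RequestConfig.custom()
      .setConnectTimeout(20000)
      .setSocketTimeout(20000)
      .build()
    val client = HttpClientBuilder.create().setDefaultRequestConfig(requestConfig).build()
    try {
      partitionOfRecords.foreach { record =>
        val post = new HttpPost("https://example.com/api/endpoint") // placeholder URL
        post.setEntity(EntityBuilder.create.setText(record.toString).build())
        val response = client.execute(post)
        EntityUtils.consumeQuietly(response.getEntity) // release the connection for reuse
      }
    } finally {
      client.close() // one client per partition, closed when the partition is done
    }
  }
}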
I'm building a REST API that starts some calculation in a Spark cluster and responds with a chunked stream of the results. Given the Spark stream with calculation results, I can use
dstream.foreachRDD()
to send the data out of Spark. I'm sending the chunked HTTP response with akka-http:
val requestHandler: HttpRequest => HttpResponse = {
case HttpRequest(HttpMethods.GET, Uri.Path("/data"), _, _, _) =>
HttpResponse(entity = HttpEntity.Chunked(ContentTypes.`text/plain`, source))
}
For simplicity, I'm trying to get plain text working first, will add JSON marshalling later.
But what is the idiomatic way of using the Spark DStream as a Source for the Akka stream? I figured I should be able to do it via a socket, but since the Spark driver and the REST endpoint sit on the same JVM, opening a socket just for this seems like overkill.
Not sure about the version of the API at the time of the question, but now, with akka-stream 2.0.3, I believe you can do it like this:
val source = Source
  .actorRef[T](/* buffer size */ 100, OverflowStrategy.dropHead)
  .mapMaterializedValue[Unit] { actorRef =>
    // Wire the DStream into the materialized actor
    dstream.foreach(actorRef ! _)
  }
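A hedged usage sketch (current akka-http names; assuming the stream elements are Strings) for plugging that source into the chunked response from the question:
import akka.http.scaladsl.model.{ContentTypes, HttpEntity, HttpResponse}

val entity = HttpEntity.Chunked(
  ContentTypes.`text/plain(UTF-8)`,
  source.map(s => HttpEntity.ChunkStreamPart(s)))
val response = HttpResponse(entity = entity)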
Edit: this answer only applies to older versions of Spark and Akka. PH88's answer is the correct method for recent versions.
You can use an intermediate akka.actor.Actor that feeds a Source (similar to this question). The solution below is not "reactive" because the underlying Actor would need to maintain a buffer of RDD messages that may be dropped if the downstream HTTP client isn't consuming chunks quickly enough. But this problem occurs regardless of the implementation details, since you cannot connect the "throttling" of Akka Stream back-pressure to the DStream in order to slow down the data: DStream does not implement org.reactivestreams.Publisher.
The basic topology is:
DStream --> Actor with buffer --> Source
To construct this topology you have to create an Actor similar to the implementation here:
//JobManager definition is provided in the link
val actorRef = actorSystem actorOf JobManager.props
Create a stream Source of ByteStrings (messages) based on the JobManager. Also, convert each ByteString to an HttpEntity.ChunkStreamPart, which is what the HttpResponse requires:
import akka.stream.actor.ActorPublisher
import akka.stream.scaladsl.{Flow, Source}
import akka.http.scaladsl.model.HttpEntity
import akka.util.ByteString
type Message = ByteString
val messageToChunkPart =
Flow[Message].map(HttpEntity.ChunkStreamPart(_))
//Actor with buffer --> Source
val source : Source[HttpEntity.ChunkStreamPart, Unit] =
Source(ActorPublisher[Message](actorRef)) via messageToChunkPart
Link the Spark DStream to the Actor so that each incoming RDD is converted to an Iterable of ByteStrings and then forwarded to the Actor:
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.rdd.RDD
val dstream : DStream = ???
//This function converts your RDDs to messages being sent
//via the http response
def rddToMessages[T](rdd : RDD[T]) : Iterable[Message] = ???
def sendMessageToActor(message : Message) = actorRef ! message
//DStream --> Actor with buffer
dstream foreachRDD {rddToMessages(_) foreach sendMessageToActor}
Provide the Source to the HttpResponse:
val requestHandler: HttpRequest => HttpResponse = {
case HttpRequest(HttpMethods.GET, Uri.Path("/data"), _, _, _) =>
HttpResponse(entity = HttpEntity.Chunked(ContentTypes.`text/plain`, source))
}
Note: there should be very little time/code between the dstream foreachRDD line and the HttpResponse, since the Actor's internal buffer will immediately begin to fill with ByteString messages coming from the DStream once the foreach line is executed.
I've been using spark to stream data from kafka and it's pretty easy.
I thought using the MQTT utils would also be easy, but it is not for some reason.
I'm trying to execute the following piece of code.
val sparkConf = new SparkConf(true).setAppName("amqStream").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val actorSystem = ActorSystem()
implicit val kafkaProducerActor = actorSystem.actorOf(Props[KafkaProducerActor])
MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
.foreachRDD { rdd =>
println("got rdd: " + rdd.toString())
rdd.foreach { msg =>
println("got msg: " + msg)
}
}
ssc.start()
ssc.awaitTermination()
The weird thing is that Spark logs the msg I sent in the console, but not my println.
It logs something like this:
19:38:18.803 [RecurringTimer - BlockGenerator] DEBUG
o.a.s.s.receiver.BlockGenerator - Last element in
input-0-1435790298600 is SOME MESSAGE
foreach is a distributed action, so your println may be executing on the workers. If you want to see some of the messages printed out locally, you could use the built-in print function on the DStream, or, instead of your foreachRDD, collect (or take) some of the elements back to the driver and print them there. Hope that helps and best of luck with Spark Streaming :)
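For example, a small sketch of the take-and-print-on-the-driver option (take(10) is an arbitrary sample size, not from the original answer):
MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
  .foreachRDD { rdd =>
    rdd.take(10).foreach(msg => println("got msg: " + msg)) // runs on the driver
  }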
If you wish to just print incoming messages, try something like this instead of the foreach (translating from a working Python version, so do check for Scala typos):
val mqttStream = MQTTUtils.createStream(ssc, "tcp://localhost:1883", "AkkaTest")
mqttStream.print()