Play 2.1: Await result from enumerator - scala

I'm working on testing my WebSocket code in Play Framework 2.1. My approach is to get the iteratee/enumerator pair used for the actual WebSocket and just test pushing data in and pulling data out.
Unfortunately, I just cannot figure out how to get data out of an Enumerator. Right now my code looks roughly like this:
val (in, out) = createClient(FakeRequest("GET", "/myendpoint"))
in.feed(Input.El("My input here"))
in.feed(Input.EOF)
//no idea how to get data from "out"
As far as I can tell, the only way to get data out of an enumerator is through an iteratee. But I can't figure out how to just wait until I get the full list of strings coming out of the enumerator. What I want is a List[String], not a Future[Iteratee[A,String]] or an Expectable[Iteratee[String]] or yet another Iteratee[String]. The documentation is confusing at best.
How do I do that?

You can consume an Enumerator like this:
import scala.concurrent.Await
import scala.concurrent.duration._
import play.api.libs.iteratee.{Enumerator, Iteratee}

val out = Enumerator("one", "two")
val consumer = Iteratee.getChunks[String]
// applying the consumer yields a Future of the Iteratee in its final state
val appliedEnumeratorFuture = out.apply(consumer)
val appliedEnumerator = Await.result(appliedEnumeratorFuture, 1.second)
// running the Iteratee extracts its result
val result = Await.result(appliedEnumerator.run, 1.second)
println(result) // List("one", "two")
Note that you need to await a Future twice because the Enumerator and the Iteratee each control the speed of producing and consuming values, respectively.
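If you prefer a single await, you can also flatMap the applied iteratee into its run step; a minimal sketch, assuming the same out enumerator as above and an implicit execution context:
import scala.concurrent.ExecutionContext.Implicits.global

// apply returns Future[Iteratee[String, List[String]]]; flatMapping into run
// collapses it to a single Future[List[String]], so only one Await is needed
val resultFuture = out.apply(Iteratee.getChunks[String]).flatMap(_.run)
val chunks = Await.result(resultFuture, 1.second) // List("one", "two")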
A more elaborate example of an Iteratee -> Enumerator chain, where feeding the Iteratee causes the Enumerator to produce values:
// create an enumerator to which messages can be pushed
// using a channel
val (out, channel) = Concurrent.broadcast[String]
// create the input stream. When it receives a string, it
// will push the info into the channel
val in =
  Iteratee.foreach[String] { s =>
    channel.push(s)
  }.map(_ => channel.eofAndEnd())
// Instead of using the complex `feed` method, we just create
// an enumerator that we can use to feed the input stream
val producer = Enumerator("one", "two").andThen(Enumerator.eof)
// Apply the input stream to the producer (feed the input stream)
val producedElementsFuture = producer.apply(in)
// Create a consumer for the output stream
val consumer = Iteratee.getChunks[String]
// Apply the consumer to the output stream (consume the output stream)
val consumedOutputStreamFuture = out.apply(consumer)
// Await the construction of the input stream chain
Await.result(producedElementsFuture, 1.second)
// Await the construction of the output stream chain
val consumedOutputStream = Await.result(consumedOutputStreamFuture, 1.second)
// Await consuming the output stream
val result = Await.result(consumedOutputStream.run, 1.second)
println(result) // List("one", "two")
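Applied back to the original question, the same two-step pattern can drive the (in, out) pair from createClient; a sketch, assuming in is an Iteratee[String, _] and out is an Enumerator[String]:
val (in, out) = createClient(FakeRequest("GET", "/myendpoint"))
// feed the WebSocket's Iteratee from a plain Enumerator instead of calling `feed`
Await.result(Enumerator("My input here").andThen(Enumerator.eof).apply(in), 1.second)
// collect what the endpoint pushed to the client; note that `run` only
// completes once the endpoint sends EOF (i.e. closes its side of the socket)
val received = Await.result(
  Await.result(out.apply(Iteratee.getChunks[String]), 1.second).run,
  1.second)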

Related

Scala making parallel network calls using Futures

I'm new to Scala. I have a method that reads data from a given list of files, makes API calls with the data, and writes the responses to a file.
listOfFiles.map { file =>
  val bufferedSource = Source.fromFile(file)
  val data = bufferedSource.mkString
  bufferedSource.close()
  val response = doApiCall(data) // time consuming task
  if (response.nonEmpty) writeFile(response, outputLocation)
}
The method above takes too much time during the network call, so I tried to use parallel processing to reduce it.
I tried wrapping the time-consuming block of code in a Future, but the program ends quickly and does not generate any output, unlike the code above.
import scala.concurrent.ExecutionContext.Implicits.global

listOfFiles.map { file =>
  val bufferedSource = Source.fromFile(file)
  val data = bufferedSource.mkString
  bufferedSource.close()
  Future {
    val response = doApiCall(data) // time consuming task
    if (response.nonEmpty) writeFile(response, outputLocation)
  }
}
Any suggestions would be helpful.
(I also tried using "par", and it works fine;
I'm exploring options other than 'par' and frameworks like 'akka', 'cats', etc.)
Building on Jatin's answer: instead of using the default execution context, which consists of daemon threads,
import scala.concurrent.ExecutionContext.Implicits.global
define an execution context with non-daemon threads:
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

implicit val nonDaemonEc = ExecutionContext.fromExecutor(Executors.newCachedThreadPool)
You can also use Future.traverse and Await, like so:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

val resultF = Future.traverse(listOfFiles) { file =>
  val bufferedSource = Source.fromFile(file)
  val data = bufferedSource.mkString
  bufferedSource.close()
  Future {
    val response = doApiCall(data) // time consuming task
    if (response.nonEmpty) writeFile(response, outputLocation)
  }
}
Await.result(resultF, Duration.Inf)
traverse takes a List[A] and a function A => Future[B] and returns a Future[List[B]]; in effect it maps each element and then turns the resulting List[Future[B]] into a single Future[List[B]].
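For intuition, a minimal sketch (with illustrative names) showing that traverse behaves like map followed by Future.sequence:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val inputs = List("a", "bb", "ccc")
def call(s: String): Future[Int] = Future(s.length) // stand-in for doApiCall

val viaTraverse: Future[List[Int]] = Future.traverse(inputs)(call)
val viaSequence: Future[List[Int]] = Future.sequence(inputs.map(call))

Await.result(viaTraverse, Duration.Inf) // List(1, 2, 3)
Await.result(viaSequence, Duration.Inf) // List(1, 2, 3)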

How can I use a value in an Akka Stream to instantiate a GooglePubSub Flow?

I'm attempting to create a Flow to be used with a Source queue. I would like this to work with the Alpakka Google PubSub connector: https://doc.akka.io/docs/alpakka/current/google-cloud-pub-sub.html
In order to use this connector, I need to create a Flow that depends on the topic name provided as a String, as shown in the above link and in the code snippet.
val publishFlow: Flow[PublishRequest, Seq[String], NotUsed] =
  GooglePubSub.publish(topic, config)
The question
I would like to be able to setup a Source queue that receives the topic and message required for publishing a message. I first create the necessary PublishRequest out of the message String. I then want to run this through the Flow that is instantiated by running GooglePubSub.publish(topic, config). However, I don't know how to get the topic to that part of the flow.
val gcFlow: Flow[(String, String), PublishRequest, NotUsed] = Flow[(String, String)]
  .map(messageData => {
    PublishRequest(Seq(
      PubSubMessage(new String(Base64.getEncoder.encode(messageData._1.getBytes))))
    )
  })
  .via(GooglePubSub.publish(topic, config))
val bufferSize = 10
val elementsToProcess = 5
// newSource is a Source[PublishRequest, NotUsed]
val (queue, newSource) = Source
  .queue[(String, String)](bufferSize, OverflowStrategy.backpressure)
  .via(gcFlow)
  .preMaterialize()
I'm not sure if there's a way to get the topic into the queue without it being a part of the initial data stream. And I don't know how to get the stream value into the dynamic Flow.
If I have improperly used some terminology, please keep in mind that I'm new to this.
You can achieve it by using flatMapConcat and generating a new Source within it:
// using tuple assuming (Topic, Message)
val gcFlow: Flow[(String, String), (String, PublishRequest), NotUsed] = Flow[(String, String)]
  .map(messageData => {
    val pr = PublishRequest(immutable.Seq(
      PubSubMessage(new String(Base64.getEncoder.encode(messageData._2.getBytes)))))
    // output flow shape of (String, PublishRequest)
    (messageData._1, pr)
  })

val publishFlow: Flow[(String, PublishRequest), Seq[String], NotUsed] =
  Flow[(String, PublishRequest)].flatMapConcat {
    case (topic: String, pr: PublishRequest) =>
      // Create a Source[PublishRequest]
      Source.single(pr).via(GooglePubSub.publish(topic, config))
  }
// wire it up
val (queue, newSource) = Source
  .queue[(String, String)](bufferSize, OverflowStrategy.backpressure)
  .via(gcFlow)
  .via(publishFlow)
  .preMaterialize()
Optionally, you could substitute the tuple with a case class to document it better:
case class Something(topic: String, payload: PublishRequest)
// output flow shape of Something
Something(messageData._1, pr)

Flow[Something].flatMapConcat { s =>
  Source.single(s.payload)... // etc
}
Explanation:
In gcFlow we output a FlowShape of the tuple (String, PublishRequest), which is passed to publishFlow. There the input is the tuple (String, PublishRequest), and in flatMapConcat we generate a new Source[PublishRequest] that is run through GooglePubSub.publish.
There is a slight overhead in creating a new Source for every item, but it shouldn't have a measurable impact on performance.
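For completeness, a sketch of how the pre-materialized queue might be used; the topic name and sink here are purely illustrative:
// assumes an implicit ActorSystem / Materializer is in scope
// publishFlow emits the message ids returned by Pub/Sub as Seq[String]
newSource.runWith(Sink.foreach(ids => println(s"published message ids: $ids")))

// offer (topic, message) pairs to the queue; gcFlow builds the PublishRequest
queue.offer(("my-topic", "hello pub/sub"))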

Directive to complete with RandomAccessFile read

I have a large data file and respond to GET requests with very small portions of that file as Array[Byte]
The directive is:
get {
  dataRepo.load(param).map(data =>
    complete(
      HttpResponse(
        entity = HttpEntity(myContentType, data),
        headers = List(gzipContentEncoding)
      )
    )
  ).getOrElse(complete(HttpResponse(status = StatusCodes.NoContent)))
}
Where dataRepo.load is a function along the lines of:
val pointers: Option[(Long, Int)] = calculateFilePointers(param)
pointers.map { case (index, length) =>
  val dataReader = new RandomAccessFile(dataFile, "r")
  dataReader.seek(index)
  val data = Array.ofDim[Byte](length)
  dataReader.readFully(data)
  data
}
Is there a more efficient way to pipe the RandomAccessFile read directly back in the response, rather than having to read it fully first?
Instead of reading the data into an Array[Byte] you could create an Iterator[Array[Byte]] which reads chunks of the file at a time:
val dataReader = new RandomAccessFile(dataFile, "r")
val chunkSize = 1024

Iterator
  .range(0, length, chunkSize)
  .map { offset =>
    // read at most chunkSize bytes, starting at index + offset
    val currentBytes =
      Array.ofDim[Byte](Math.min(chunkSize, length - offset))
    dataReader.seek(index + offset)
    dataReader.readFully(currentBytes)
    currentBytes
  }
This iterator can now feed an Akka Streams Source (with dataRepo.load now returning the Iterator[Array[Byte]] built above):
val source: Source[Array[Byte], _] =
  Source.fromIterator(() => dataRepo.load(param))
Which can then feed an HttpEntity:
val byteStrSource : Source[ByteString, _] = source.map(ByteString.apply)
val httpEntity = HttpEntity(myContentType, byteStrSource)
Now each client will only use 1024 bytes of memory at a time instead of the full length of the file read. This makes your server much more efficient at handling multiple concurrent requests, and it allows dataRepo.load to return immediately with a lazy Source value instead of using a Future.
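Tying it back to the route, a sketch assuming dataRepo.load is changed to return an Option[Source[ByteString, _]] built as above:
get {
  dataRepo.load(param).map(byteStrSource =>
    complete(
      HttpResponse(
        entity = HttpEntity(myContentType, byteStrSource),
        headers = List(gzipContentEncoding)
      )
    )
  ).getOrElse(complete(HttpResponse(status = StatusCodes.NoContent)))
}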

[Spark Streaming]How to load the model every time a new message comes in?

In Spark Streaming, every time a new message is received, a model is used to predict something based on this new message. But as time goes by, the model can change for some reason, so I want to re-load the model whenever a new message comes in. My code looks like this:
def loadingModel(@transient sc: SparkContext) = {
  val model = LogisticRegressionModel.load(sc, "/home/zefu/BIA800/LRModel")
  model
}
var error=0.0
var size=0.0
implicit def bool2int(b:Boolean) = if (b) 1 else 0
def updateState(batchTime: Time, key: String, value: Option[String], state: State[Array[Double]]): Option[(String, Double, Double)] = {
  val model = loadingModel(sc)
  val parts = value.getOrElse("0,0,0,0").split(",").map { _.toDouble }
  val pairs = LabeledPoint(parts(0), Vectors.dense(parts.tail))
  val prediction = model.predict(pairs.features)
  val wrong = prediction != pairs.label
  error = state.getOption().getOrElse(Array(0.0, 0.0))(0) + 1.0 * (wrong: Int)
  size = state.getOption().getOrElse(Array(0.0, 0.0))(1) + 1.0
  val output = (key, error, size)
  state.update(Array(error, size))
  Some(output)
}
val stateSpec = StateSpec.function(updateState _)
  .numPartitions(1)
setupLogging()
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = List("test").toSet
val lines = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics).mapWithState(stateSpec)
When I run this code, I get an exception like this:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
If you need more information, please let me know.
Thank you!
When a model is used within a DStream function, Spark seems to serialize the context object (because the model's load function uses sc), and it fails because the context object is not serializable. One workaround is to convert the DStream to an RDD, collect the result, and then run model prediction/scoring in the driver.
I used the netcat utility to simulate streaming and tried the following code to convert the DStream to an RDD; it works. See if it helps.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("xxx", 9998)
val linedstream = lines.map(lineRDD => Vectors.dense(lineRDD.split(" ").map(_.toDouble)))
val logisModel = LogisticRegressionModel.load(sc, "/path/LR_Model")
linedstream.foreachRDD(rdd => {
  for (item <- rdd.collect().toArray) {
    val predictedVal = logisModel.predict(item)
    println(predictedVal + "|" + item)
  }
})
I understand collect is not scalable here, but if your streaming messages are few in number per interval, this is probably an option. This is what I see as possible in Spark 1.4.0; higher versions probably have a fix for this. See this if it's useful:
Save ML model for future usage

spark: How to get data via foreachRDD from a DStream created by ReceiverLauncher.launch?

// consume data from kafka
val tmp_stream = ReceiverLauncher.launch(ssc, props, numberOfReceivers, StorageLevel.MEMORY_ONLY)
tmp_stream.foreachRDD(rdd => {
  rdd.collect()
  val count = rdd.count() // here I can get the count of the records
  // How can I get the data here?
})
Any idea how to complete this code to get the data inside foreachRDD from the stream created by ReceiverLauncher.launch?
You can call getPayload to get the raw byte[]. Something like this:
val stream = tmp_stream.map(x => { val s = new String(x.getPayload); s })
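From there, a minimal sketch of getting the decoded strings out inside foreachRDD (printing is just illustrative):
stream.foreachRDD { rdd =>
  // collect() brings the batch to the driver; fine for small batches,
  // otherwise process on the executors with rdd.foreach / rdd.map
  rdd.collect().foreach(println)
}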