How to read a file into an array of bytes from GridFS? - mongodb

I want to get an Array/List[Byte] from an Enumerator[Array[Byte]]. Some articles show how to do it in the Play Framework, but I'm not using Play Framework for this project. This is what I have:
lazy val gridfs = GridFS[BSONSerializationPack.type](db, "resource")

gridfs.find(BSONDocument("_id" -> BSONObjectID(id))).headOption.map {
  case Some(file) =>
    // this gives me Enumerator[Array[Byte]].
    // I'm not using playframework, how to get Future[Array[Byte]] from here?
    gridfs.enumerate(file)
}

Using Play Iteratees is not specific to a Play application; it is a streaming abstraction, comparable to Akka Streams or Rx.
You can consume all the chunks of such an enumerator using Iteratee.consume:
val sink: Iteratee[Array[Byte], Array[Byte]] = Iteratee.consume[Array[Byte]]()
val allInMem: Future[Array[Byte]] = enumerator |>>> sink
For obvious reasons, it's recommended not to load big data entirely into memory this way, but to use an appropriate Iteratee to process the data in a streaming fashion.
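Putting it together with the snippet from the question, here is a minimal sketch (assuming the play-iteratees dependency is on the classpath and that gridfs and id are as defined above):

import play.api.libs.iteratee.Iteratee
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// hedged sketch: resolves to None when no file matches the id
val bytes: Future[Option[Array[Byte]]] =
  gridfs.find(BSONDocument("_id" -> BSONObjectID(id))).headOption.flatMap {
    case Some(file) =>
      val sink = Iteratee.consume[Array[Byte]]()
      (gridfs.enumerate(file) |>>> sink).map(Some(_))
    case None =>
      Future.successful(None)
  }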

Related

Akka FileIO.fromPath - How to deal with IOResult and get the data instead?

I looked at a lot of examples and posts about this. I got it working one way, but I haven't quite gotten the idea yet; I'm still getting tripped up by Future[IOResult] when trying to read a file into a stream of record objects, one per line. What I want instead is Future[List[LineRecordCaseClass]].
val source = FileIO.fromPath(Paths.get("/tmp/junk_data.csv"))
val flow = makeFlow() // Framing.delimiter->split(",")->map to LineRecordCaseClass
val sink = Sink.collection[LineRecordCaseClass, List[LineRecordCaseClass]]
val graph = source.via(flow).to(sink)
val typeMismatchError: Future[List[LineRecordCaseClass]] = graph.run()
Why does graph.run() return a Future[IOResult] instead? Perhaps I'm missing a Keep.left somewhere, or something? If so, what and where? There must be some concept I'm missing.
Here are the types of your vals:
val source: Source[ByteString, Future[IOResult]] =
val flow: Flow[ByteString, LineRecordCaseClass, NotUsed] =
val sink: Sink[LineRecordCaseClass, Future[List[LineRecordCaseClass]]] =
From the akka-stream docs, in the code snippet:
By default, the materialized value of the leftmost stage is preserved
The materialized value at your leftmost stage (the source) is Future[IOResult].
In source.via(flow).to(sink), if you look at the implementation of .to, it calls .toMat with a default of Keep.left.
The type for Keep.both is
val check: RunnableGraph[(Future[IOResult], Future[List[LineRecordCaseClass]])] = source.via(flow).toMat(sink)(Keep.both)
So if you want Future[List[LineRecordCaseClass]], you can do
source.via(flow).toMat(sink)(Keep.right)
I recommend this video, which explains materialized values.
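Putting it together with the vals above, a minimal sketch (assuming the usual akka.stream.scaladsl imports and an implicit materializer/ActorSystem are in scope):

// materialize the sink's value instead of the source's
val records: Future[List[LineRecordCaseClass]] =
  source.via(flow).toMat(sink)(Keep.right).run()

// equivalently, runWith keeps the sink's materialized value for you
val sameRecords: Future[List[LineRecordCaseClass]] =
  source.via(flow).runWith(sink)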

Possible encoding issue with Google PubSub

When running a subscription source from the Alpakka Pub/Sub library, I receive data that appears to be encoded.
@Singleton
class Consumer @Inject()(config: Configuration, credentialsService: google.creds.Service)(implicit actorSystem: ActorSystem) {
  implicit val m: ActorMaterializer = ActorMaterializer.create(actorSystem)
  val logger = Logger(this.getClass)

  val subName: String = config.get[String]("google.pubsub.subname")
  val credentials: Credentials = credentialsService.getCredentials
  val pubSubConfig = PubSubConfig(credentials.projectId, credentials.clientEmail, credentials.privateKey)

  val subSource: Source[ReceivedMessage, NotUsed] = GooglePubSub.subscribe(subName, pubSubConfig)
  val ackSink: Sink[AcknowledgeRequest, Future[Done]] = GooglePubSub.acknowledge(subName, pubSubConfig)

  val computeGraph = Flow[ReceivedMessage].map { x =>
    logger.info(x.message.data)
    x
  }

  val ackGraph = Flow.fromFunction((msgs: Seq[ReceivedMessage]) => AcknowledgeRequest(msgs.map(_.ackId).toList))

  subSource
    .via(computeGraph)
    .groupedWithin(10, 5.minutes)
    .via(ackGraph)
    .to(ackSink)
    .run()
}
I publish the message from the Pub/Sub console. I expect my test message to appear, but when I publish test I receive dGVzdA==. Is this expected? I have had issues importing the private key; could this be a result of that?
The consumer is bound eagerly with Guice.
Data that is received over the REST APIs will be base64-encoded. My guess would be that the Alpakka Pub/Sub library, which uses the REST APIs, is not decoding the received data for you. It looks like they also have a library that uses the gRPC Pub/Sub client as the underlying layer, which may not suffer from this defect. You can also use the Cloud Pub/Sub Java client library from Scala directly.
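As a workaround, you could decode the payload yourself in the flow; a minimal sketch, assuming x.message.data is the base64 string shown in the question:

import java.util.Base64
import java.nio.charset.StandardCharsets

val computeGraph = Flow[ReceivedMessage].map { x =>
  // "dGVzdA==" decodes to "test"
  val decoded = new String(Base64.getDecoder.decode(x.message.data), StandardCharsets.UTF_8)
  logger.info(decoded)
  x
}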

Testing Twitter with Spark Streaming API

I am new to Spark's Streaming framework and was trying to process the Twitter stream.
I am in the process of writing test cases for it, and I understand that I can use StreamingSuiteBase, which helps me test input as a stream against my functions.
But I have written a function that takes DStream[Status] as input and, after processing, gives DStream[String] as output.
The API I am using from StreamingSuiteBase is testOperation.
test("Filter only words Starting with #") {
val inputTweet = List(List("this is #firstHash"), List("this is #secondHash"), List("this is #thirdHash"))
val expected = List(List("#firstHash"), List("#secondHash"), List("#thirdHash"))
testOperation(inputTweet, TransformTweets.getText _, expected, ordered = false)
And this is the function to which the input is sent:
def getText(englishTweets: DStream[Status]): DStream[String] = {
  println(englishTweets.toString)
  val hashTags = englishTweets.flatMap(x => x.getText.split(" ").filter(_.startsWith("#")))
  hashTags
}
But I am getting a "type mismatch" error because of DStream[Status] vs. DStream[String]. How do I mock DStream[Status]?
I resolved this issue by building the Twitter status with the createStatus API of TwitterObjectFactory. There was no need to mock Status, and even if you manage to mock it, there are serialization issues. So this is the best solution:
import scala.io.Source
import twitter4j.TwitterObjectFactory

val rawJson = Source.fromURL(getClass.getResource("/tweetStatus.json")).getLines.mkString
val tweetStatus = TwitterObjectFactory.createStatus(rawJson)
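The fixture status can then replace the raw strings in the testOperation call; a hedged sketch, assuming the /tweetStatus.json fixture holds a tweet whose text contains #firstHash:

test("Filter hashtags from a fixture Status") {
  // hypothetical fixture: the tweet's text contains "#firstHash"
  val inputTweet = List(List(tweetStatus))
  val expected = List(List("#firstHash"))
  testOperation(inputTweet, TransformTweets.getText _, expected, ordered = false)
}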
Hope this helps someone!

Large file download with Play framework

I have sample download code that works fine if the file is not zipped, because I know the length and can provide it; that way, I think, Play does not have to bring the whole file into memory while streaming, and it works. The code below works:
def downloadLocalBackup() = Action {
  var pathOfFile = "/opt/mydir/backups/big/backup"
  val file = new java.io.File(pathOfFile)
  val path: java.nio.file.Path = file.toPath
  val source: Source[ByteString, _] = FileIO.fromPath(path)
  logger.info("from local backup set the length in header as " + file.length())
  Ok.sendEntity(HttpEntity.Streamed(source, Some(file.length()), Some("application/zip")))
    .withHeaders("Content-Disposition" -> s"attachment; filename=backup")
}
I don't know how the streaming in the above case handles the difference in speed between disk reads (which are faster than the network). It never runs out of memory, even for large files. But when I use the code below, which uses a ZipOutputStream, I am not sure why it runs out of memory. Somehow the same 3 GB file does not work when I try to use it with the zip stream.
def downloadLocalBackup2() = Action {
  var pathOfFile = "/opt/mydir/backups/big/backup"
  val file = new java.io.File(pathOfFile)
  val path: java.nio.file.Path = file.toPath
  val enumerator = Enumerator.outputStream { os =>
    val zipStream = new ZipOutputStream(os)
    zipStream.putNextEntry(new ZipEntry("backup2"))
    val is = new BufferedInputStream(new FileInputStream(pathOfFile))
    val buf = new Array[Byte](1024)
    var len = is.read(buf)
    var totalLength = 0L
    var logged = false
    while (len >= 0) {
      zipStream.write(buf, 0, len)
      len = is.read(buf)
      if (!logged) {
        logged = true
        logger.info("logging the while loop just one time")
      }
    }
    is.close()
    zipStream.close()
  }
  logger.info("log right before sendEntity")
  val kk = Ok.sendEntity(HttpEntity.Streamed(
    Source.fromPublisher(Streams.enumeratorToPublisher(enumerator)).map { x =>
      val kk = Writeable.wByteArray.transform(x); kk
    },
    None,
    Some("application/zip")
  )).withHeaders("Content-Disposition" -> s"attachment; filename=backupfile.zip")
  kk
}
In the first example, Akka Streams handles all the details for you. It knows how to read the input file without loading it completely into memory. That is the advantage of using Akka Streams, as explained in the docs:
The way we consume services from the Internet today includes many instances of streaming data, both downloading from a service as well as uploading to it or peer-to-peer data transfers. Regarding data as a stream of elements instead of in its entirety is very useful because it matches the way computers send and receive them (for example via TCP), but it is often also a necessity because data sets frequently become too large to be handled as a whole. We spread computations or analyses over large clusters and call it “big data”, where the whole principle of processing them is by feeding those data sequentially—as a stream—through some CPUs.
...
The purpose [of Akka Streams] is to offer an intuitive and safe way to formulate stream processing setups such that we can then execute them efficiently and with bounded resource usage—no more OutOfMemoryErrors. In order to achieve this our streams need to be able to limit the buffering that they employ, they need to be able to slow down producers if the consumers cannot keep up. This feature is called back-pressure and is at the core of the Reactive Streams initiative of which Akka is a founding member.
In the second example, you are handling the input/output streams yourself, using the standard blocking API. I'm not 100% sure how writing to a ZipOutputStream works here, but it is possible that it is not flushing the writes and is accumulating everything until close.
The good thing is that you don't need to handle this manually, since Akka Streams provides a way to gzip a Source of ByteStrings:
import javax.inject.Inject

import akka.util.ByteString
import akka.stream.scaladsl.{Compression, FileIO, Source}
import play.api.http.HttpEntity
import play.api.mvc.{BaseController, ControllerComponents}

class FooController @Inject()(val controllerComponents: ControllerComponents) extends BaseController {

  def download = Action {
    val pathOfFile = "/opt/mydir/backups/big/backup"
    val file = new java.io.File(pathOfFile)
    val path: java.nio.file.Path = file.toPath
    val source: Source[ByteString, _] = FileIO.fromPath(path)
    val gzipped = source.via(Compression.gzip)
    // the compressed length is not known up front, so no Content-Length is provided
    Ok.sendEntity(HttpEntity.Streamed(gzipped, None, Some("application/gzip")))
      .withHeaders("Content-Disposition" -> s"attachment; filename=backup.gz")
  }
}
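Note that Compression.gzip produces a gzip stream rather than a zip archive. If an actual .zip file is required, the Alpakka file connector offers an Archive.zip flow; the following is only a sketch, under the assumption that the akka-stream-alpakka-file module is available and its Archive.zip/ArchiveMetadata API matches the documented signatures:

import akka.NotUsed
import akka.stream.alpakka.file.ArchiveMetadata
import akka.stream.alpakka.file.scaladsl.Archive
import akka.stream.scaladsl.{FileIO, Source}
import akka.util.ByteString

// one archive entry named "backup", streamed from disk and zipped on the fly
val entry = (ArchiveMetadata("backup"), FileIO.fromPath(path))
val zipped: Source[ByteString, NotUsed] = Source.single(entry).via(Archive.zip())
Ok.sendEntity(HttpEntity.Streamed(zipped, None, Some("application/zip")))
  .withHeaders("Content-Disposition" -> "attachment; filename=backup.zip")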

Play 2.1: Await result from enumerator

I'm working on testing my WebSocket code in Play Framework 2.1. My approach is to get the iteratee/enumerator pair that is used for the actual WebSocket, and just test pushing data in and pulling data out.
Unfortunately, I just cannot figure out how to get data out of an Enumerator. Right now my code looks roughly like this:
val (in, out) = createClient(FakeRequest("GET", "/myendpoint"))
in.feed(Input.El("My input here"))
in.feed(Input.EOF)
//no idea how to get data from "out"
As far as I can tell, the only way to get data out of an enumerator is through an iteratee. But I can't figure out how to just wait until I get the full list of strings coming out of the enumerator. What I want is a List[String], not a Future[Iteratee[A,String]] or an Expectable[Iteratee[String]] or yet another Iteratee[String]. The documentation is confusing at best.
How do I do that?
You can consume an Enumerator like this:
import play.api.libs.iteratee.{Enumerator, Iteratee}
import scala.concurrent.Await
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val out = Enumerator("one", "two")
val consumer = Iteratee.getChunks[String]
val appliedEnumeratorFuture = out.apply(consumer)
val appliedEnumerator = Await.result(appliedEnumeratorFuture, 1.second)
val result = Await.result(appliedEnumerator.run, 1.second)
println(result) // List("one", "two")
Note that you need to await a Future twice because Enumerator and Iteratee control the speed of producing and consuming values, respectively.
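If you only need the final result, you can also let the library run the chain for you with |>>> (an alias for run), which collapses the two steps into a single Future; a small sketch using the same enumerator:

// |>>> applies the iteratee and runs it, yielding one Future with the result
val resultFuture: scala.concurrent.Future[List[String]] = out |>>> Iteratee.getChunks[String]
val result2 = Await.result(resultFuture, 1.second)
println(result2) // List("one", "two")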
A more elaborate example for an Iteratee -> Enumerator chain where feeding the Iteratee results in the Enumerator producing values:
// create an enumerator to which messages can be pushed
// using a channel
val (out, channel) = Concurrent.broadcast[String]
// create the input stream. When it receives a string, it
// will push the info into the channel
val in =
  Iteratee.foreach[String] { s =>
    channel.push(s)
  }.map(_ => channel.eofAndEnd())
// Instead of using the complex `feed` method, we just create
// an enumerator that we can use to feed the input stream
val producer = Enumerator("one", "two").andThen(Enumerator.eof)
// Apply the input stream to the producer (feed the input stream)
val producedElementsFuture = producer.apply(in)
// Create a consumer for the output stream
val consumer = Iteratee.getChunks[String]
// Apply the consumer to the output stream (consume the output stream)
val consumedOutputStreamFuture = out.apply(consumer)
// Await the construction of the input stream chain
Await.result(producedElementsFuture, 1.second)
// Await the construction of the output stream chain
val consumedOutputStream = Await.result(consumedOutputStreamFuture, 1.second)
// Await consuming the output stream
val result = Await.result(consumedOutputStream.run, 1.second)
println(result) // List("one", "two")