How to convert Source[ByteString, Any] to InputStream - scala

akka-http represents a file uploaded using multipart/form-data encoding as Source[ByteString, Any]. I need to unmarshal it using a Java library that expects an InputStream.
How can a Source[ByteString, Any] be turned into an InputStream?

As of Akka Streams 2.x you can achieve this with the following code:
import java.io.InputStream
import java.util.concurrent.TimeUnit
import scala.concurrent.duration.FiniteDuration
import akka.stream.scaladsl.StreamConverters
...
val inputStream: InputStream = entity.dataBytes
  .runWith(
    StreamConverters.asInputStream(FiniteDuration(3, TimeUnit.SECONDS))
  )
See: http://doc.akka.io/docs/akka-stream-and-http-experimental/2.0.1/scala/migration-guide-1.0-2.x-scala.html
Note: this was broken in version 2.0.2 and fixed in 2.4.2.
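
Putting this together with the original multipart question, here is a rough sketch of feeding one uploaded part into a blocking Java parser. The part name "file" and the parse callback are hypothetical stand-ins, and blocking work like this would ideally run on a dedicated dispatcher:
import java.io.InputStream
import java.util.concurrent.TimeUnit
import scala.concurrent.Future
import scala.concurrent.duration.FiniteDuration
import akka.Done
import akka.http.scaladsl.model.Multipart
import akka.stream.Materializer
import akka.stream.scaladsl.{Sink, StreamConverters}

// `parse` stands in for whatever Java API expects the InputStream.
def handleUpload(formData: Multipart.FormData, parse: InputStream => Unit)(
    implicit mat: Materializer): Future[Done] =
  formData.parts
    .filter(_.name == "file") // hypothetical part name
    .runWith(Sink.foreach { part =>
      val in: InputStream = part.entity.dataBytes
        .runWith(StreamConverters.asInputStream(FiniteDuration(3, TimeUnit.SECONDS)))
      parse(in) // blocking call into the Java library
    })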

You could try using an OutputStreamSink that writes to a PipedOutputStream, and feed that into a PipedInputStream which your other code uses as its input stream. It's a bit of a rough idea, but it could work. The code would look like this:
import akka.util.ByteString
import akka.stream.scaladsl.Source
import java.io.PipedInputStream
import java.io.PipedOutputStream
import akka.stream.io.OutputStreamSink
import java.io.BufferedReader
import java.io.InputStreamReader
import akka.actor.ActorSystem
import akka.stream.ActorFlowMaterializer
object PipedStream extends App {
  implicit val system = ActorSystem("flowtest")
  implicit val mater = ActorFlowMaterializer()

  val lines = for (i <- 1 to 100) yield ByteString(s"This is line $i\n")
  val source = Source(lines)

  val pipedIn = new PipedInputStream()
  val pipedOut = new PipedOutputStream(pipedIn)
  val flow = source.to(OutputStreamSink(() => pipedOut))
  flow.run()

  val reader = new BufferedReader(new InputStreamReader(pipedIn))
  var line: String = reader.readLine
  while (line != null) {
    println(s"Reader received line: $line")
    line = reader.readLine
  }
}
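
For reference, later Akka Streams releases express the same piping idea with StreamConverters.fromOutputStream; a minimal sketch against the source and pipedOut defined above, assuming Akka Streams 2.x:
import akka.stream.scaladsl.StreamConverters

// Materializes a Future[IOResult] that tells you when (and whether) all bytes were written.
val io = source.runWith(StreamConverters.fromOutputStream(() => pipedOut))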

You could extract an iterator from the ByteString and then get the InputStream. Something like this (pseudocode):
source.map { data: ByteString =>
  data.iterator.asInputStream
}
Update
A more elaborate sample, starting from a Multipart.FormData:
def isSourceFromFormData(formData: Multipart.FormData): Source[InputStream, Any] =
  formData.parts.map { part =>
    part.entity.dataBytes
      .map(_.iterator.asInputStream)
  }.flatten(FlattenStrategy.concat)
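
In later Akka Streams versions FlattenStrategy was removed; assuming Akka Streams 2.4+, an equivalent sketch uses flatMapConcat:
import java.io.InputStream
import akka.http.scaladsl.model.Multipart
import akka.stream.scaladsl.Source

def isSourceFromFormData(formData: Multipart.FormData): Source[InputStream, Any] =
  formData.parts.flatMapConcat { part =>
    part.entity.dataBytes.map(_.iterator.asInputStream)
  }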

Why Files.write can't use Array[Byte]()?

I have this code:
val socket=new ServerSocket(25)
val client=socket.accept()
val inputStream=client.getInputStream
var dataBuffer=new Array[Byte](4096)
inputStream.read(dataBuffer)
Files.write("",dataBuffer)
Files.write needs a byte array, and I have given it a byte array, so why do I get an error on the last line:
Type mismatch, expected: Iterable[_ <: CharSequence], actual: Array[Byte]
inputStream.read also needs a byte array parameter and accepts dataBuffer, so why does the next line give an error? How can I fix it? Thanks!
If you use java.nio.file.Files, you have to pass a Path as the first parameter.
val b: Array[Byte] = Array()
Files.write(Paths.get(""), b)
import java.net.ServerSocket
import java.nio.file.{Files, Paths}
object Test1 {
  def main(args: Array[String]): Unit = {
    val socket = new ServerSocket(9999)
    val client = socket.accept()
    val inputStream = client.getInputStream
    var dataBuffer = new Array[Byte](4096)
    inputStream.read(dataBuffer)
    Files.write(Paths.get("/home/eiffel/a.txt"), dataBuffer)
  }
}
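
Note also that InputStream.read may fill only part of the buffer and returns the number of bytes actually read; if you only want to persist what was read, a small sketch continuing the snippet above (the output path is hypothetical):
val count = inputStream.read(dataBuffer) // bytes actually read, or -1 at end of stream
if (count > 0)
  Files.write(Paths.get("/tmp/a.txt"), dataBuffer.take(count))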

How to read PDF files and xml files in Apache Spark scala?

My sample code for reading a text file is:
val text = sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.defaultMinPartitions)
var rddwithPath = text.asInstanceOf[HadoopRDD[LongWritable, Text]].mapPartitionsWithInputSplit { (inputSplit, iterator) ⇒
  val file = inputSplit.asInstanceOf[FileSplit]
  iterator.map { tpl ⇒ (file.getPath.toString, tpl._2.toString) }
}.reduceByKey((a, b) => a)
How can I read PDF and XML files in the same way?
PDF & XML can be parsed using Tika:
Have a look at Apache Tika, a content analysis toolkit:
- https://tika.apache.org/1.9/api/org/apache/tika/parser/xml/
- http://tika.apache.org/0.7/api/org/apache/tika/parser/pdf/PDFParser.html
- https://tika.apache.org/1.9/api/org/apache/tika/parser/AutoDetectParser.html
Below is an example integration of Spark with Tika:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.input.PortableDataStream
import org.apache.tika.metadata._
import org.apache.tika.parser._
import org.apache.tika.sax.WriteOutContentHandler
import java.io._
object TikaFileParser {

  def tikaFunc(a: (String, PortableDataStream)) = {
    val file: File = new File(a._1.drop(5))
    val myparser: AutoDetectParser = new AutoDetectParser()
    val stream: InputStream = new FileInputStream(file)
    val handler: WriteOutContentHandler = new WriteOutContentHandler(-1)
    val metadata: Metadata = new Metadata()
    val context: ParseContext = new ParseContext()

    myparser.parse(stream, handler, metadata, context)
    stream.close

    println(handler.toString())
    println("------------------------------------------------")
  }

  def main(args: Array[String]) {
    val filesPath = "/home/user/documents/*"
    val conf = new SparkConf().setAppName("TikaFileParser")
    val sc = new SparkContext(conf)
    val fileData = sc.binaryFiles(filesPath)
    fileData.foreach(x => tikaFunc(x))
  }
}
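
A side note: println inside foreach runs on the executors in cluster mode, so the output ends up in the executor logs. If you want the extracted text back as an RDD instead, a sketch reusing the same Tika API (the output path is hypothetical):
import org.apache.tika.Tika

// Build an RDD of (path, extracted text) instead of printing.
val parsed = fileData.map { case (path, pds) =>
  (path, new Tika().parseToString(pds.open()))
}
parsed.saveAsTextFile("/home/user/parsed_output")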
PDF can be parsed in PySpark as follows:
If the PDF is stored in HDFS, use sc.binaryFiles(), since PDFs are stored in binary format.
The binary content can then be sent to pdfminer for parsing.
import io

import pdfminer
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def return_device_content(cont):
    fp = io.BytesIO(cont)
    parser = PDFParser(fp)
    document = PDFDocument(parser)

filesPath = "/user/root/*.pdf"
fileData = sc.binaryFiles(filesPath)
file_content = fileData.map(lambda content: content[1])
file_content1 = file_content.map(return_device_content)
Further parsing can be done using the functionality provided by pdfminer.
You can simply use spark-shell with Tika and run the code below, either sequentially or distributed, depending upon your use case:
spark-shell --jars tika-app-1.8.jar
val binRDD = sc.binaryFiles("/data/")
val textRDD = binRDD.map(file => {new org.apache.tika.Tika().parseToString(file._2.open())})
textRDD.saveAsTextFile("/output/")
System.exit(0)

akka HttpResponse read body as String scala

So I have a function with this signature (akka.http.model.HttpResponse):
def apply(query: Seq[(String, String)], accept: String): HttpResponse
I simply get a value in a test like:
val resp = TagAPI(Seq.empty[(String, String)], api.acceptHeader)
I want to check its body in a test, something like:
resp.entity.asString == "tags"
My question is: how can I get the response body as a String?
import akka.actor.ActorSystem
import akka.http.scaladsl.unmarshalling.Unmarshal
import akka.stream.ActorFlowMaterializer
import scala.concurrent.Future
implicit val system = ActorSystem("System")
implicit val materializer = ActorFlowMaterializer()
import system.dispatcher // execution context needed by Unmarshal
val responseAsString: Future[String] = Unmarshal(entity).to[String]
Since Akka Http is streams based, the entity is streaming as well. If you really need the entire string at once, you can convert the incoming request into a Strict one:
This is done by using the toStrict(timeout: FiniteDuration)(mat: Materializer) API to collect the request into a strict entity within a given time limit (this is important since you don't want to "try to collect the entity forever" in case the incoming request never actually ends):
import akka.stream.ActorFlowMaterializer
import akka.actor.ActorSystem
import akka.util.ByteString
import scala.concurrent.Future
implicit val system = ActorSystem("Sys") // your actor system, only 1 per app
implicit val materializer = ActorFlowMaterializer() // you must provide a materializer
import system.dispatcher
import scala.concurrent.duration._
val timeout = 300.millis
val bs: Future[ByteString] = entity.toStrict(timeout).map { _.data }
val s: Future[String] = bs.map(_.utf8String) // if you indeed need a `String`
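If the test then needs the concrete String value, a small sketch of waiting on that future (the timeout is arbitrary):
import scala.concurrent.Await
import scala.concurrent.duration._

val body: String = Await.result(s, 2.seconds)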
You can also try this:
responseObject.entity.dataBytes.runFold(ByteString(""))(_ ++ _).map(_.utf8String) map println
Unmarshaller.stringUnmarshaller(someHttpEntity)
works like a charm; an implicit materializer is needed as well.
Here is a simple directive that extracts a String from the request's body:
import scala.concurrent.duration._
import akka.http.scaladsl.server.Directive1
import akka.http.scaladsl.server.Directives._

def withString(): Directive1[String] = {
  extractStrictEntity(3.seconds).flatMap { entity =>
    provide(entity.data.utf8String)
  }
}
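A usage sketch for this directive inside a route, assuming the usual akka-http routing DSL (the "echo" path is hypothetical):
import akka.http.scaladsl.server.Route

val route: Route =
  path("echo") {
    post {
      withString { body =>
        complete(body) // echoes the raw request body back
      }
    }
  }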
Unfortunately, in my case Unmarshal to String didn't work properly, complaining: Unsupported Content-Type, supported: application/json. That would have been the more elegant solution, but I had to use another way. In my test I used a Future extracted from the entity of the response and Await (from scala.concurrent) to get the result from the Future:
Put("/post/item", requestEntity) ~> route ~> check {
val responseContent: Future[Option[String]] =
response.entity.dataBytes.map(_.utf8String).runWith(Sink.lastOption)
val content: Option[String] = Await.result(responseContent, 10.seconds)
content.get should be(errorMessage)
response.status should be(StatusCodes.InternalServerError)
}
If you need to go through all the lines in a response, you can use the Source's runForeach:
response.entity.dataBytes.map(_.utf8String).runForeach(data => println(data))
Here is my working example:
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.stream.ActorMaterializer
import akka.util.ByteString
import scala.concurrent.Future
import scala.util.{ Failure, Success }
def getDataAkkaHTTP: Unit = {
  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()
  // needed for the future flatMap/onComplete in the end
  implicit val executionContext = system.dispatcher

  val url = "http://localhost:8080/"
  val responseFuture: Future[HttpResponse] = Http().singleRequest(HttpRequest(uri = url))

  responseFuture.onComplete {
    case Success(res) => {
      val HttpResponse(statusCodes, headers, entity, _) = res
      println(entity)
      entity.dataBytes.runFold(ByteString(""))(_ ++ _).foreach(body => println(body.utf8String))
      system.terminate()
    }
    case Failure(_) => sys.error("something wrong")
  }
}

Scala Futures not executing when sending to Kinesis (Amazon AWS)

I am attempting to asynchronously write messages to Amazon Kinesis using Scala Futures so I can load test an application.
This code works, and I can see data moving down my pipeline as well as the output printing to the console.
import com.amazonaws.services.kinesis.AmazonKinesisClient
import java.nio.CharBuffer
import java.nio.charset.Charset
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
object KinesisDummyDataProducer extends App {
  val kinesis = new AmazonKinesisClient(PipelineConfig.awsCredentials)
  println("Connected")

  lazy val encoder = Charset.forName("UTF-8").newEncoder()
  lazy val tz = TimeZone.getTimeZone("UTC")
  lazy val df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'Z")
  df.setTimeZone(tz)

  (1 to args(0).toInt).map(int => send(int)).map(msg => println(msg))

  private def send(int: Int) = {
    val msg = "{\"event_name\":\"test\",\"timestamp\":\"%s\",\"int\":%s}".format(df.format(new Date()), int.toString)
    val bytes = encoder.encode(CharBuffer.wrap(msg))
    encoder.flush(bytes)
    kinesis.putRecord("PrimaryEventStream", bytes, "123")
    msg
  }
}
This code works with Scala Futures.
import scala.concurrent.future
import scala.concurrent.ExecutionContext.Implicits.global
def doIt(x: Int) = {Thread.sleep(1000); x + 1}
(1 to 10).map(x => future{doIt(x)}).map(y => y.onSuccess({case x => println(x)}))
You'll note that the syntax is nearly identical on the mapping of sequences. However, the following does not work (i.e., it neither prints to the console nor sends data down my pipeline).
import com.amazonaws.services.kinesis.AmazonKinesisClient
import java.nio.CharBuffer
import java.nio.charset.Charset
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}
import scala.concurrent.future
import scala.concurrent.ExecutionContext.Implicits.global
object KinesisDummyDataProducer extends App {
  val kinesis = new AmazonKinesisClient(PipelineConfig.awsCredentials)
  println("Connected")

  lazy val encoder = Charset.forName("UTF-8").newEncoder()
  lazy val tz = TimeZone.getTimeZone("UTC")
  lazy val df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'Z")
  df.setTimeZone(tz)

  (1 to args(0).toInt).map(int => future {send(int)}).map(f => f.onSuccess({case msg => println(msg)}))

  private def send(int: Int) = {
    val msg = "{\"event_name\":\"test\",\"timestamp\":\"%s\",\"int\":%s}".format(df.format(new Date()), int.toString)
    val bytes = encoder.encode(CharBuffer.wrap(msg))
    encoder.flush(bytes)
    kinesis.putRecord("PrimaryEventStream", bytes, "123")
    msg
  }
}
Some more notes about this project. I am using Maven to do the build (from the command line), and running all of the above code (also from the command line) works just dandy.
My question is: why, using the same syntax, does my function send appear not to execute?
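
A likely explanation, though not confirmed here: the global ExecutionContext runs on daemon threads, and an App's main thread finishes as soon as the mapping returns, so the JVM exits before the futures ever run. A minimal sketch of blocking until they complete, reusing send from the object above:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Start all sends, then block the main thread until every future completes,
// so the JVM does not exit while work is still queued on daemon threads.
val futures = (1 to args(0).toInt).map(int => Future(send(int)))
val results = Await.result(Future.sequence(futures), 5.minutes)
results.foreach(println)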

Error: "could not find implicit value for parameter readFileReader" trying to save a file using GridFS with reactivemongo

I am trying to save an attachment using reactivemongo in Play 2.1 using the following code:
def upload = Action(parse.multipartFormData) { request =>
  request.body.file("carPicture").map { picture =>
    val filename = picture.filename
    val contentType = picture.contentType
    val gridFS = new GridFS(db, "attachments")
    val fileToSave = DefaultFileToSave(filename, contentType)
    val futureResult: Future[ReadFile[BSONValue]] = gridFS.writeFromInputStream(fileToSave, new FileInputStream(new File(filename)))

    Ok(Json.obj("e" -> 0))
  }.getOrElse {
    Redirect(routes.Application.index).flashing(
      "error" -> "Missing file"
    )
  }
}
I am getting the following error:
could not find implicit value for parameter readFileReader: reactivemongo.bson.BSONDocumentReader[reactivemongo.api.gridfs.ReadFile[reactivemongo.bson.BSONValue]]
[error] val futureResult: Future[ReadFile[BSONValue]] = gridFS.writeFromInputStream(fileToSave, new FileInputStream(new File(filename)))
What am I missing?
Thanks,
GA
Most likely you don't have the DefaultReadFileReader implicit object in scope, which you can fix by adding an import:
import reactivemongo.api.gridfs.Implicits._
The following compiles fine for me (using the Play 2.1 reactivemongo module, version 0.9):
package controllers
import java.io.{ File, FileInputStream }
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import play.api._
import play.api.mvc._
import play.api.libs.json._
import reactivemongo.api._
import reactivemongo.bson._
import reactivemongo.api.gridfs._
import reactivemongo.api.gridfs.Implicits._
import play.modules.reactivemongo.MongoController
object Application extends Controller with MongoController {

  def index = Action {
    Ok(views.html.index("Hello, world..."))
  }

  def upload = Action(parse.multipartFormData) { request =>
    request.body.file("carPicture").map { picture =>
      val filename = picture.filename
      val contentType = picture.contentType
      val gridFS = new GridFS(db, "attachments")
      val fileToSave = DefaultFileToSave(filename, contentType)
      val futureResult: Future[ReadFile[BSONValue]] = gridFS.writeFromInputStream(fileToSave, new FileInputStream(new File(filename)))

      Ok(Json.obj("e" -> 0))
    }.getOrElse {
      Redirect(routes.Application.index).flashing(
        "error" -> "Missing file"
      )
    }
  }
}
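
As an aside, futureResult is discarded above, so a failed GridFS write would go unnoticed. A minimal sketch for at least logging the outcome, reusing the execution context already imported in this controller:
import scala.util.{Failure, Success}

futureResult.onComplete {
  case Success(file) => println(s"Stored attachment: $file")
  case Failure(e)    => println(s"GridFS write failed: ${e.getMessage}")
}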