Reading a large file using Akka Streams - scala

I'm trying out Akka Streams and here is a short snippet that I have:
override def main(args: Array[String]) {
  val filePath = "/Users/joe/Softwares/data/FoodFacts.csv"//args(0)
  val file = new File(filePath)
  println(file.getAbsolutePath)
  // read 1MB of file as a stream
  val fileSource = SynchronousFileSource(file, 1 * 1024 * 1024)
  val shaFlow = fileSource.map(chunk => {
    println(s"the string obtained is ${chunk.toString}")
  })
  shaFlow.to(Sink.foreach(println(_))).run // fails with a null pointer
  def sha256(s: String) = {
    val messageDigest = MessageDigest.getInstance("SHA-256")
    messageDigest.digest(s.getBytes("UTF-8"))
  }
}
When I run this snippet, I get:
Exception in thread "main" java.lang.NullPointerException
at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:365)
at com.test.api.consumer.DataScienceBoot$.main(DataScienceBoot.scala:30)
at com.test.api.consumer.DataScienceBoot.main(DataScienceBoot.scala)
It seems to me that fileSource is simply empty. Why is this? Any ideas? FoodFacts.csv is 40 MB in size, and all I'm trying to do is read it as a stream of 1 MB chunks!
Even using the default chunk size of 8192 did not work!

Well, Akka Streams 1.0 is deprecated, so if you can, use 2.x.
When I tried version 2.0.1, using FileIO.fromFile(file) instead of SynchronousFileSource, the code failed to compile, whereas the snippet above fails with a null pointer. This was simply because there was no ActorMaterializer in scope. Adding one makes it work:
object TImpl extends App {
  import java.io.File
  implicit val system = ActorSystem("Sys")
  // runForeach needs an implicit materializer in scope
  implicit val materializer = ActorMaterializer()
  val file = new File("somefile.csv")
  // read the file in chunks of 1 MB
  val fileSource = FileIO.fromFile(file, 1 * 1024 * 1024)
  val shaFlow: Source[String, Future[Long]] = fileSource.map(chunk => {
    s"the string obtained is ${chunk.toString()}"
  })
  shaFlow.runForeach(println(_))
}
This works for a file of any size, because the file is read chunk by chunk rather than loaded whole. For more information on configuring the dispatcher, refer to the Akka documentation.
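Since the question defines a sha256 helper, here is a minimal sketch of hashing the file while it streams. It assumes a later Akka release in which FileIO.fromPath is available (it replaced FileIO.fromFile), and the names FileSha256 and sha256Hex are hypothetical, not part of the original answer.

```scala
import java.nio.file.Path
import java.security.MessageDigest

import akka.stream.Materializer
import akka.stream.scaladsl.FileIO

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical helper, not part of the original answer.
object FileSha256 {
  // Reads the file in chunks of chunkSize bytes and feeds each chunk into a
  // single MessageDigest, so the whole file is never held in memory at once.
  def sha256Hex(path: Path, chunkSize: Int = 1024 * 1024)(
      implicit materializer: Materializer,
      ec: ExecutionContext): Future[String] =
    FileIO
      .fromPath(path, chunkSize)
      .runFold(MessageDigest.getInstance("SHA-256")) { (digest, chunk) =>
        digest.update(chunk.toArray)
        digest
      }
      // Render the final 32-byte digest as lowercase hex.
      .map(digest => digest.digest().map("%02x".format(_)).mkString)
}
```

With an implicit ActorSystem, materializer and execution context in scope (as in the answer above), FileSha256.sha256Hex(Paths.get("somefile.csv")) would return a Future holding the hex digest.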

Related

Akka chunk size exception

I am trying to make a singleRequest from Akka using this code:
val request = HttpRequest(
  method = HttpMethods.GET,
  uri = "url"
)
val responseFuture: Future[HttpResponse] = Http().singleRequest(request)
val entityFuture: Future[HttpEntity.Strict] = responseFuture.flatMap(response => response.entity.toStrict(2.seconds))
entityFuture.map(entity => entity.data.utf8String)
However, when I request a large JSON string, I get the following exception:
akka.http.scaladsl.model.EntityStreamException: HTTP chunk size exceeds the configured limit of 1048576 bytes
How do I configure this? I am not using Akka Typed (I think), just this:
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
import system.dispatcher
You can raise the limit with the following setting, which controls the maximum size of a single HTTP chunk that the parser accepts:
akka.http.parsing.max-chunk-size = 16m
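The setting normally goes into application.conf. If you would rather set it in code, a rough sketch (assuming the Typesafe Config library that ships with Akka; the object name ChunkSizeConfig and the system name "client" are made up for illustration) could look like this:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical holder object, for illustration only.
object ChunkSizeConfig {
  // Override only the chunk limit; every other setting falls back to
  // application.conf and the reference.conf files on the classpath.
  val config: Config = ConfigFactory
    .parseString("akka.http.parsing.max-chunk-size = 16m")
    .withFallback(ConfigFactory.load())

  // The ActorSystem has to be created with this config for the override to apply.
  def newSystem(): ActorSystem = ActorSystem("client", config)
}
```

Note that the 2-second toStrict timeout in the question is a separate limit and may also need raising for large responses.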

How to send multiple files to kafka producer using akka stream in Scala

I am trying to send data from multiple files to a Kafka producer using Akka Streams. I have written the producer itself, but I am struggling with how to use Akka Streams file IO to read the multiple files whose contents I want to send to the producer. This is my code:
object App {
  def main(args: Array[String]): Unit = {
    val file = Paths.get("233339.8.1231731728115136.1722327129833578.log")
    // val file = Paths.get("example.csv")
    //
    // val foreach: Future[IOResult] = FileIO.fromPath(file)
    // .to(Sink.ignore)
    // .run()
    println("Hello from producer")
    implicit val system:ActorSystem = ActorSystem("producer-example")
    implicit val materializer:Materializer = ActorMaterializer()
    val producerSettings = ProducerSettings(system,new StringSerializer,new StringSerializer)
    val done: Future[Done] =
      Source(1 to 955)
        .map(value => new ProducerRecord[String, String]("test-topic", s"$file : $value"))
        .runWith(Producer.plainSink(producerSettings))
    implicit val ec: ExecutionContextExecutor = system.dispatcher
    done onComplete {
      case Success(_) => println("Done"); system.terminate()
      case Failure(err) => println(err.toString); system.terminate()
    }
  }
}
Given multiple file names:
val fileNames : scala.collection.immutable.Iterable[String] = ??? // Source.apply requires an immutable Iterable
It is possible to create a Source that emits the contents of the files concatenated together using flatMapConcat:
val chunkSize = 8192
val chunkSource : Source[ByteString, _] =
  Source.apply(fileNames)
    .map(fileName => Paths get fileName)
    .flatMapConcat(path => FileIO.fromPath(path, chunkSize))
This will emit ByteString values of chunkSize bytes each, except that the last chunk of each file may be smaller. Chunk boundaries are arbitrary and do not line up with line boundaries.
If you want to break up the data by some delimiter, such as a newline, you can use Framing:
val delimiter : ByteString = ???
val maxFrameLength : Int = ???
val framingSource : Source[ByteString, _] =
  chunkSource.via(Framing.delimiter(delimiter, maxFrameLength))
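To connect this back to the producer in the question, here is a rough sketch that frames each file by newlines and sends every line as one record. The object name FilesToKafka, the 8192-byte line limit, and the bootstrap server address are assumptions for illustration, not values from the question.

```scala
import java.nio.file.Paths

import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.Materializer
import akka.stream.scaladsl.{FileIO, Framing, Source}
import akka.util.ByteString
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

import scala.concurrent.Future

// Hypothetical helper, for illustration only.
object FilesToKafka {
  // Emits the lines of every file, one file after another. Framing is applied
  // per file, so a line never spans two files. allowTruncation = true also
  // emits a final line that has no trailing newline.
  def lines(fileNames: List[String], maxLineLength: Int = 8192): Source[String, NotUsed] =
    Source(fileNames)
      .flatMapConcat { name =>
        FileIO
          .fromPath(Paths.get(name))
          .via(Framing.delimiter(ByteString("\n"), maxLineLength, allowTruncation = true))
      }
      .map(_.utf8String)

  // Sends each line as the value of one record on the given topic.
  def publish(fileNames: List[String], topic: String)(
      implicit system: ActorSystem,
      materializer: Materializer): Future[Done] = {
    val settings = ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092") // assumed broker address
    lines(fileNames)
      .map(line => new ProducerRecord[String, String](topic, line))
      .runWith(Producer.plainSink(settings))
  }
}
```

The returned Future[Done] can be handled with onComplete exactly as in the question's code.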

File Upload and processing using akka-http websockets

I'm using some sample Scala code to make a server that receives a file over websocket, stores the file temporarily, runs a bash script on it, and then returns stdout by TextMessage.
Sample code was taken from this github project.
I edited the code slightly within echoService so that it runs another function that processes the temporary file.
object WebServer {
  def main(args: Array[String]) {
    implicit val actorSystem = ActorSystem("akka-system")
    implicit val flowMaterializer = ActorMaterializer()
    val interface = "localhost"
    val port = 3000
    import Directives._
    val route = get {
      pathEndOrSingleSlash {
        complete("Welcome to websocket server")
      }
    } ~
    path("upload") {
      handleWebSocketMessages(echoService)
    }
    val binding = Http().bindAndHandle(route, interface, port)
    println(s"Server is now online at http://$interface:$port\nPress RETURN to stop...")
    StdIn.readLine()
    binding.flatMap(_.unbind()).onComplete(_ => actorSystem.shutdown())
    println("Server is down...")
  }
  implicit val actorSystem = ActorSystem("akka-system")
  implicit val flowMaterializer = ActorMaterializer()
  val echoService: Flow[Message, Message, _] = Flow[Message].mapConcat {
    case BinaryMessage.Strict(msg) => {
      val decoded: Array[Byte] = msg.toArray
      val imgOutFile = new File("/tmp/" + "filename")
      val fileOuputStream = new FileOutputStream(imgOutFile)
      fileOuputStream.write(decoded)
      fileOuputStream.close()
      TextMessage(analyze(imgOutFile))
    }
    case BinaryMessage.Streamed(stream) => {
      stream
        .limit(Int.MaxValue) // Max frames we are willing to wait for
        .completionTimeout(50 seconds) // Max time until last frame
        .runFold(ByteString(""))(_ ++ _) // Merges the frames
        .flatMap { (msg: ByteString) =>
          val decoded: Array[Byte] = msg.toArray
          val imgOutFile = new File("/tmp/" + "filename")
          val fileOuputStream = new FileOutputStream(imgOutFile)
          fileOuputStream.write(decoded)
          fileOuputStream.close()
          Future(Source.single(""))
        }
      TextMessage(analyze(imgOutFile))
    }
    private def analyze(imgfile: File): String = {
      val p = Runtime.getRuntime.exec(Array("./run-vision.sh", imgfile.toString))
      val br = new BufferedReader(new InputStreamReader(p.getInputStream, StandardCharsets.UTF_8))
      try {
        val result = Stream
          .continually(br.readLine())
          .takeWhile(_ ne null)
          .mkString
        result
      } finally {
        br.close()
      }
    }
  }
}
During testing using Dark WebSocket Terminal, case BinaryMessage.Strict works fine.
Problem: However, case BinaryMessage.Streamed doesn't finish writing the file before the analyze function runs, resulting in a blank response from the server.
I'm trying to wrap my head around how Futures are being used here with the Flows in Akka-HTTP, but I'm not having much luck outside trying to get through all the official documentation.
Currently, .mapAsync seems promising, or basically finding a way to chain futures.
I'd really appreciate some insight.
Yes, mapAsync will help you in this case. It is a combinator that runs Futures (potentially in parallel) in your stream and emits their results on the output side.
To keep the types homogeneous and make the type checker happy, you'll need to wrap the result of the Strict case in Future.successful.
A quick fix for your code could be:
val echoService: Flow[Message, Message, _] = Flow[Message].mapAsync(parallelism = 5) {
  case BinaryMessage.Strict(msg) => {
    val decoded: Array[Byte] = msg.toArray
    val imgOutFile = new File("/tmp/" + "filename")
    val fileOuputStream = new FileOutputStream(imgOutFile)
    fileOuputStream.write(decoded)
    fileOuputStream.close()
    Future.successful(TextMessage(analyze(imgOutFile)))
  }
  case BinaryMessage.Streamed(stream) =>
    stream
      .limit(Int.MaxValue) // Max frames we are willing to wait for
      .completionTimeout(50 seconds) // Max time until last frame
      .runFold(ByteString(""))(_ ++ _) // Merges the frames
      .flatMap { (msg: ByteString) =>
        val decoded: Array[Byte] = msg.toArray
        val imgOutFile = new File("/tmp/" + "filename")
        val fileOuputStream = new FileOutputStream(imgOutFile)
        fileOuputStream.write(decoded)
        fileOuputStream.close()
        Future.successful(TextMessage(analyze(imgOutFile)))
      }
}
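As a possible variation on the quick fix, the frames can be streamed straight to disk instead of being merged in memory first. The sketch below is only an illustration under a few assumptions: it uses FileIO.toPath from later Akka releases, takes analyze as a parameter, and writes each upload to its own temp file (with parallelism above 1, a shared "/tmp/filename" could be overwritten by a concurrent message). The names UploadFlow and uploadFlow are hypothetical.

```scala
import java.io.File

import akka.http.scaladsl.model.ws.{BinaryMessage, Message, TextMessage}
import akka.stream.Materializer
import akka.stream.scaladsl.{FileIO, Flow, Sink}

import scala.concurrent.ExecutionContext

// Hypothetical helper, for illustration only.
object UploadFlow {
  def uploadFlow(analyze: File => String)(
      implicit materializer: Materializer,
      ec: ExecutionContext): Flow[Message, Message, _] =
    Flow[Message].mapAsync(parallelism = 1) {
      case bm: BinaryMessage =>
        // dataStream covers both strict and streamed binary messages.
        val out = File.createTempFile("upload-", ".bin")
        bm.dataStream
          .runWith(FileIO.toPath(out.toPath)) // completes once the file is fully written
          .map(_ => TextMessage(analyze(out))) // analyze runs only after the write
      case tm: TextMessage =>
        // Drain unexpected text messages so the connection does not stall.
        tm.textStream
          .runWith(Sink.ignore)
          .map(_ => TextMessage("expected a binary message"))
    }
}
```

Because analyze blocks on an external process, it may be worth running it on a dedicated dispatcher in a real server.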

500 Internal Server Error in Akka Scala server

This is my code for the server, written using the Akka framework:
case class Sentence(data: String)
case class RawTriples(triples: List[String])
trait Protocols extends DefaultJsonProtocol {
  implicit val sentenceRequestFormat = jsonFormat1(Sentence)
  implicit val rawTriplesFormat = jsonFormat1(RawTriples)
}
trait Service extends Protocols {
  implicit val system: ActorSystem
  implicit def executor: ExecutionContextExecutor
  implicit val materializer: Materializer
  val openie = new OpenIE
  def config: Config
  val logger: LoggingAdapter
  lazy val ipApiConnectionFlow: Flow[HttpRequest, HttpResponse, Any] =
    Http().outgoingConnection(config.getString("services.ip-api.host"), config.getInt("services.ip-api.port"))
  def ipApiRequest(request: HttpRequest): Future[HttpResponse] = Source.single(request).via(ipApiConnectionFlow).runWith(Sink.head)
  val routes = {
    logRequestResult("akka-http-microservice") {
      pathPrefix("openie") {
        post {
          decodeRequest{
            entity(as[Sentence]){ sentence =>
              complete {
                var rawTriples = openie.extract(sentence.data)
                val resp: MutableList[String] = MutableList()
                for(rtrip <- rawTriples){
                  resp += (rtrip.toString())
                }
                val response: List[String] = resp.toList
                println(response)
                response
              }
            }
          }
        }
      }
    }
  }
}
object AkkaHttpMicroservice extends App with Service {
  override implicit val system = ActorSystem()
  override implicit val executor = system.dispatcher
  override implicit val materializer = ActorMaterializer()
  override val config = ConfigFactory.load()
  override val logger = Logging(system, getClass)
  Http().bindAndHandle(routes, config.getString("http.interface"), config.getInt("http.port"))
}
The server accepts a POST request containing a sentence and returns a JSON array. It works fine, but if I make requests to it too frequently from parallelized code, it returns a 500 Internal Server Error. Is there a parameter I can set on the server to avoid that (for example, the number of threads ready to accept requests)?
In log files, the error is logged as:
[ERROR] [05/31/2017 11:48:38.110]
[default-akka.actor.default-dispatcher-6]
[akka.actor.ActorSystemImpl(default)] Error during processing of
request: 'null'. Completing with 500 Internal Server Error response.
The doc on the bindAndHandle method shows what you want:
/**
* Convenience method which starts a new HTTP server at the given endpoint and uses the given `handler`
* [[akka.stream.scaladsl.Flow]] for processing all incoming connections.
*
* The number of concurrently accepted connections can be configured by overriding
* the `akka.http.server.max-connections` setting. Please see the documentation in the reference.conf for more
* information about what kind of guarantees to expect.
*
* To configure additional settings for a server started using this method,
* use the `akka.http.server` config section or pass in a [[akka.http.scaladsl.settings.ServerSettings]] explicitly.
*/
akka.http.server.max-connections is probably what you want. As the doc suggests, you can also dig deeper into the akka.http.server config section.
Add the following to your application.conf file:
akka.http {
  server {
    server-header = akka-http/${akka.http.version}
    idle-timeout = infinite
    request-timeout = infinite
  }
}

Spark Streaming using Scala to insert to Hbase Issue

I am trying to read records from Kafka messages and put them into HBase. Although the Scala script runs without any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
  }
}
object TheMain extends Serializable{
  def run() {
    val ssc = new StreamingContext(sc, Seconds(1))
    val topicmap = Map("test" -> 1)
    val lines = KafkaUtils.createStream(ssc,"127.0.0.1:2181", "test-consumer-group",topicmap).map(_._2)
    val words = lines.map(line => line.split(",")).map(line => (line(0),line(1)))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}
TheMain.run()
From the API documentation for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call it at the end of your blah() method. It looks like the Puts are currently being buffered but either never executed or executed at some arbitrary later time.
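A minimal sketch of that change, assuming the same pre-2.0 HBase client API that the question uses (HTable, Put.add and flushCommits were deprecated or removed in later releases). The toPut helper is a hypothetical addition that separates building the Put from talking to the cluster.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

object Blaher {
  // Builds the Put for one "rowkey,value" record; needs no cluster connection.
  def toPut(row: Array[String]): Put = {
    val put = new Put(Bytes.toBytes(row(0)))
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    put
  }

  def blah(row: Array[String]): Unit = {
    val hTable = new HTable(HBaseConfiguration.create(), "test")
    try {
      hTable.put(toPut(row))
      hTable.flushCommits() // send any buffered Puts to the region servers now
    } finally {
      hTable.close() // release the table's resources
    }
  }
}
```

Opening a table for every record is costly, so in a real job it may be better to open one table per partition (for example inside rdd.foreachPartition) and flush once at the end.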