Converting disrete chunks of Stdin to usable form - scala

Simply, if the user pastes a chunk of text (multiple lines) into the console all at once, I want to be able to grab that chunk and use it.
Currently my code is
val stringLines: List[String] = io.Source.stdin.getLines().toList
doStuff(stringLines)
However doStuff is never called. I realize the stdin iterator doesn't have an 'end', but how do I get the input as it is currently? I've checked many SO answers, but none of them work for multiple lines that need to be collated. I need to have all the lines of the user input at once, and it will always come as a single paste of data.

This is a rough outline, but it seems to work. I had to delve into Java Executors, which I hadn't encountered before, and I might not be using correctly.
You might want to play around with the timeout value, but 10 milliseconds passes my tests.
import java.util.concurrent.{Callable, Executors, TimeUnit}
import scala.util.{Failure, Success, Try}
def getChunk: List[String] = { // blocks until StdIn has data
val executor = Executors.newSingleThreadExecutor()
val callable = new Callable[String]() {
def call(): String = io.StdIn.readLine()
}
def nextLine(acc: List[String]): List[String] =
Try {executor.submit(callable).get(10L, TimeUnit.MILLISECONDS)} match {
case Success(str) => nextLine(str :: acc)
case Failure(_) => executor.shutdownNow() // should test for Failure type
acc.reverse
}
nextLine(List(io.StdIn.readLine())) // this is the blocking part
}
Usage is quite simple.
val input: List[String] = getChunk

Related

Scala: Most efficient way to process files in folder based on a file list

I am trying to find the most efficient way to process files in multiple folders based on a list of allowed files.
I have a list of allowed files that I should process.
The proces is as follows
val allowedFiles = List("File1.json","File2.json","File3.json")
Get list of folders in directory. For this I could use:
def getListOfSubDirectories(dir: File): List[String] =
dir.listFiles
.filter(_.isDirectory)
.map(_.getName)
.toList
Loop through each folder from step 2. and get all files. For this I would use :
def getListOfFiles(dir: String):List[File] = {
val d = new File(dir)
if (d.exists && d.isDirectory) {
d.listFiles.filter(_.isFile).toList
} else {
List[File]()
}
}
If file from step 3. are in list of allowed files call another method that process the file
So I need to loop through a first directory, get files, check if file need to be procssed and then call another functionn. I was thinking about double loop which would work but is the most efficient way. I know in scala I should be using resursive funstions but failed with this double recursive function with call to extra method.
Any ideas welcome.
Files.find() will do both the depth search and filter.
import java.nio.file.{Files,Paths,Path}
import scala.jdk.StreamConverters._
def getListOfFiles(dir: String, targets:Set[String]): List[Path] =
Files.find( Paths.get(dir)
, 999
, (p, _) => targets(p.getFileName.toString)
).toScala(List)
usage:
val lof = getListOfFiles("/DataDir", allowedFiles.toSet)
But, depending on what kind of processing is required, instead of returning a List you might just process each file as it is encountered.
import java.nio.file.{Files,Paths,Path}
def processFile(path: Path): Unit = ???
def processSelected(dir: String, targets:Set[String]): Unit =
Files.find( Paths.get(dir)
, 999
, (p, _) => targets(p.getFileName.toString)
).forEach(processFile)
You can use Files.walk
The code would look like this (I didn't compile it, so it may have some typos)
import java.nio.file.{Files, Path}
import scala.jdk.StreamConverters._
def getFilesRecursive(initialFolder: Path, allowedFiles: Set[String]): List[Path] =
Files
.walk(initialFolder)
.filter(path => allowedFiles.contains(path.getFileName.toString.toLowerCase))
.toScala(List)
I'm no expert on Scala (last time I dabbled in it was probably 18 years ago) but I figured there had to be a way to take this code:
def getListOfSubDirectories(dir: File): List[String] =
dir.listFiles
.filter(_.isDirectory)
.map(_.getName)
.toList
And eliminate at least one extra list creation. I found this SO question which was instructive, and then did a Google search for withFilter.
Looks like you can take that bit above and translate it to the following. By replacing filter with withFilter, a new list is not created and then iterated over.
def getListOfSubDirectories(dir: File): List[String] =
dir.listFiles
.withFilter(_.isDirectory)
.map(_.getName)
.toList

Wait for a list of futures with composing Option in Scala

I have to get a list of issues for each file of a given list from a REST API with Scala. I want to do the requests in parallel, and use the Dispatch library for this. My method is called from a Java framework and I have to wait at the end of this method for the result of all the futures to yield the overall result back to the framework. Here's my code:
def fetchResourceAsJson(filePath: String): dispatch.Future[json4s.JValue]
def extractLookupId(json: org.json4s.JValue): Option[String]
def findLookupId(filePath: String): Future[Option[String]] =
for (json <- fetchResourceAsJson(filePath))
yield extractLookupId(json)
def searchIssuesJson(lookupId: String): Future[json4s.JValue]
def extractIssues(json: org.json4s.JValue): Seq[Issue]
def findIssues(lookupId: String): Future[Seq[Issue]] =
for (json <- searchIssuesJson(componentId))
yield extractIssues(json)
def getFilePathsToProcess: List[String]
def thisIsCalledByJavaFramework(): java.util.Map[String, java.util.List[Issue]] = {
val finalResultPromise = Promise[Map[String, Seq[Issue]]]()
// (1) inferred type of issuesByFile not as expected, cannot get
// the type system happy, would like to have Seq[Future[(String, Seq[Issue])]]
val issuesByFile = getFilePathsToProcess map { f =>
findLookupId(f).flatMap { lookupId =>
(f, findIssues(lookupId)) // I want to yield a tuple (String, Seq[Issue]) here
}
}
Future.sequence(issuesByFile) onComplete {
case Success(x) => finalResultPromise.success(x) // (2) how to return x here?
case Failure(x) => // (3) how to return null from here?
}
//TODO transform finalResultPromise to Java Map
}
This code snippet has several issues. First, I'm not getting the type I would expect for issuesByFile (1). I would like to just ignore the result of findLookUpId if it is not able to find the lookUp ID (i.e., None). I've read in various tutorials that Future[Option[X]] is not easy to handle in function compositions and for expressions in Scala. So I'm also curious what the best practices are to handle these properly.
Second, I somehow have to wait for all futures to finish, but don't know how to return the result to the calling Java framework (2). Can I use a promise here to achieve this? If yes, how can I do it?
And last but not least, in case of any errors, I would just like to return null from thisIsCalledByJavaFramework but don't know how (3).
Any help is much appreciated.
Thanks,
Michael
Several points:
The first problem at (1) is that you don't handle the case where findLookupId returns None. You need to decide what to do in this case. Fail the whole process? Exclude that file from the list?
The second problem at (1) is that findIssues will itself return a Future, which you need to map before you can build the result tuple
There's a shortcut for map and then Future.sequence: Future.traverse
If you cannot change the result type of the method because the Java interface is fixed and cannot be changed to support Futures itself you must wait for the Future to be completed. Use Await.ready or Await.result to do that.
Taking all that into account and choosing to ignore files for which no id could be found results in this code:
// `None` in an entry for a file means that no id could be found
def entryForFile(file: String): Future[(String, Option[Seq[Issue]])] =
findLookupId(file).flatMap {
// the need for this kind of pattern match shows
// the difficulty of working with `Future[Option[T]]`
case Some(id) ⇒ findIssues(id).map(issues ⇒ file -> Some(issues))
case None ⇒ Future.successful(file -> None)
}
def thisIsCalledByJavaFramework(): java.util.Map[String, java.util.List[Issue]] = {
val issuesByFile: Future[Seq[(String, Option[Seq[Issue]])]] =
Future.traverse(getFilePathsToProcess)(entryForFile)
import scala.collection.JavaConverters._
try
Await.result(issuesByFile, 10.seconds)
.collect {
// here we choose to ignore entries where no id could be found
case (f, Some(issues)) ⇒ f -> issues
}
.toMap.mapValues(_.asJava).asJava
catch {
case NonFatal(_) ⇒ null
}
}

Using Iteratees and Enumerators in Play Scala to Stream Data to S3

I am building a Play Framework application in Scala where I would like to stream an array of bytes to S3. I am using the Play-S3 library to do this. The "Multipart file upload" of the documentation section is what's relevant here:
// Retrieve an upload ticket
val result:Future[BucketFileUploadTicket] =
bucket initiateMultipartUpload BucketFile(fileName, mimeType)
// Upload the parts and save the tickets
val result:Future[BucketFilePartUploadTicket] =
bucket uploadPart (uploadTicket, BucketFilePart(partNumber, content))
// Complete the upload using both the upload ticket and the part upload tickets
val result:Future[Unit] =
bucket completeMultipartUpload (uploadTicket, partUploadTickets)
I am trying to do the same thing in my application but with Iteratees and Enumerators.
The streams and asynchronicity make things a little complicated, but here is what I have so far (Note uploadTicket is defined earlier in the code):
val partNumberStream = Stream.iterate(1)(_ + 1).iterator
val partUploadTicketsIteratee = Iteratee.fold[Array[Byte], Future[Vector[BucketFilePartUploadTicket]]](Future.successful(Vector.empty[BucketFilePartUploadTicket])) { (partUploadTickets, bytes) =>
bucket.uploadPart(uploadTicket, BucketFilePart(partNumberStream.next(), bytes)).flatMap(partUploadTicket => partUploadTickets.map( _ :+ partUploadTicket))
}
(body |>>> partUploadTicketsIteratee).andThen {
case result =>
result.map(_.map(partUploadTickets => bucket.completeMultipartUpload(uploadTicket, partUploadTickets))) match {
case Success(x) => x.map(d => println("Success"))
case Failure(t) => throw t
}
}
Everything compiles and runs without incident. In fact, "Success" gets printed, but no file ever shows up on S3.
There might be multiple problems with your code. It's a bit unreadable caused by the map method calls. You might have a problem with your future composition. Another problem might be caused by the fact that all chunks (except for the last) should be at least 5MB.
The code below has not been tested, but shows a different approach. The iteratee approach is one where you can create small building blocks and compose them into a pipe of operations.
To make the code compile I added a trait and a few methods
trait BucketFilePartUploadTicket
val uploadPart: (Int, Array[Byte]) => Future[BucketFilePartUploadTicket] = ???
val completeUpload: Seq[BucketFilePartUploadTicket] => Future[Unit] = ???
val body: Enumerator[Array[Byte]] = ???
Here we create a few parts
// Create 5MB chunks
val chunked = {
val take5MB = Traversable.takeUpTo[Array[Byte]](1024 * 1024 * 5)
Enumeratee.grouped(take5MB transform Iteratee.consume())
}
// Add a counter, used as part number later on
val zipWithIndex = Enumeratee.scanLeft[Array[Byte]](0 -> Array.empty[Byte]) {
case ((counter, _), bytes) => (counter + 1) -> bytes
}
// Map the (Int, Array[Byte]) tuple to a BucketFilePartUploadTicket
val uploadPartTickets = Enumeratee.mapM[(Int, Array[Byte])](uploadPart.tupled)
// Construct the pipe to connect to the enumerator
// the ><> operator is an alias for compose, it is more intuitive because of
// it's arrow like structure
val pipe = chunked ><> zipWithIndex ><> uploadPartTickets
// Create a consumer that ends by finishing the upload
val consumeAndComplete =
Iteratee.getChunks[BucketFilePartUploadTicket] mapM completeUpload
Running it is done by simply connecting the parts
// This is the result, a Future[Unit]
val result = body through pipe run consumeAndComplete
Note that I did not test any code and might have made some mistakes in my approach. This however shows a different way of dealing with the problem and should probably help you to find a good solution.
Note that this approach waits for one part to complete upload before it takes on the next part. If the connection from your server to amazon is slower than the connection from the browser to you server this mechanism will slow the input.
You could take another approach where you do not wait for the Future of the part upload to complete. This would result in another step where you use Future.sequence to convert the sequence of upload futures into a single future containing a sequence of the results. The result would be a mechanism sending a part to amazon as soon as you have enough data.

How to create a play.api.libs.iteratee.Enumerator which inserts some data between the items of a given Enumerator?

I use Play framework with ReactiveMongo. Most of ReactiveMongo APIs are based on the Play Enumerator. As long as I fetch some data from MongoDB and return it "as-is" asynchronously, everything is fine. Also the transformation of the data, like converting BSON to String, using Enumerator.map is obvious.
But today I faced a problem which at the bottom line narrowed to the following code. I wasted half of the day trying to create an Enumerator which would consume items from the given Enumerator and insert some items between them. It is important not to load all the items at once, as there could be many of them (the code example has only two items "1" and "2"). But semantically it is similar to mkString of the collections. I am sure it can be done very easily, but the best I could come with - was this code. Very similar code creating an Enumerator using Concurrent.broadcast serves me well for WebSockets. But here even that does not work. The HTTP response never comes back. When I look at Enumeratee, it looks that it is supposed to provide such functionality, but I could not find the way to do the trick.
P.S. Tried to call chan.eofAndEnd in Iteratee.mapDone, and chunked(enums >>> Enumerator.eof instead of chunked(enums) - did not help. Sometimes the response comes back, but does not contain the correct data. What do I miss?
def trans(in:Enumerator[String]):Enumerator[String] = {
val (res, chan) = Concurrent.broadcast[String]
val iter = Iteratee.fold(true) { (isFirst, curr:String) =>
if (!isFirst)
chan.push("<-------->")
chan.push(curr)
false
}
in.apply(iter)
res
}
def enums:Enumerator[String] = {
val en12 = Enumerator[String]("1", "2")
trans(en12)
//en12 //if I comment the previous line and uncomment this, it prints "12" as expected
}
def enum = Action {
Ok.chunked(enums)
}
Here is my solution which I believe to be correct for this type of problem. Comments are welcome:
def fill[From](
prefix: From => Enumerator[From],
infix: (From, From) => Enumerator[From],
suffix: From => Enumerator[From]
)(implicit ec:ExecutionContext) = new Enumeratee[From, From] {
override def applyOn[A](inner: Iteratee[From, A]): Iteratee[From, Iteratee[From, A]] = {
//type of the state we will use for fold
case class State(prev:Option[From], it:Iteratee[From, A])
Iteratee.foldM(State(None, inner)) { (prevState, newItem:From) =>
val toInsert = prevState.prev match {
case None => prefix(newItem)
case Some(prevItem) => infix (prevItem, newItem)
}
for(newIt <- toInsert >>> Enumerator(newItem) |>> prevState.it)
yield State(Some(newItem), newIt)
} mapM {
case State(None, it) => //this is possible when our input was empty
Future.successful(it)
case State(Some(lastItem), it) =>
suffix(lastItem) |>> it
}
}
}
// if there are missing integers between from and to, fill that gap with 0
def fillGap(from:Int, to:Int)(implicit ec:ExecutionContext) = Enumerator enumerate List.fill(to-from-1)(0)
def fillFrom(x:Int)(input:Int)(implicit ec:ExecutionContext) = fillGap(x, input)
def fillTo(x:Int)(input:Int)(implicit ec:ExecutionContext) = fillGap(input, x)
val ints = Enumerator(10, 12, 15)
val toStr = Enumeratee.map[Int] (_.toString)
val infill = fill(
fillFrom(5),
fillGap,
fillTo(20)
)
val res = ints &> infill &> toStr // res will have 0,0,0,0,10,0,12,0,0,15,0,0,0,0
You wrote that you are working with WebSockets, so why don't you use dedicated solution for that? What you wrote is better for Server-Sent-Events rather than WS. As I understood you, you want to filter your results before sending them back to client? If its correct then you Enumeratee instead of Enumerator. Enumeratee is transformation from-to. This is very good piece of code how to use Enumeratee. May be is not directly about what you need but I found there inspiration for my project. Maybe when you analyze given code you would find best solution.

Using Scalaz Stream for parsing task (replacing Scalaz Iteratees)

Introduction
I use Scalaz 7's iteratees in a number of projects, primarily for processing large-ish files. I'd like to start switching to Scalaz streams, which are designed to replace the iteratee package (which frankly is missing a lot of pieces and is kind of a pain to use).
Streams are based on machines (another variation on the iteratee idea), which have also been implemented in Haskell. I've used the Haskell machines library a bit, but the relationship between machines and streams isn't completely obvious (to me, at least), and the documentation for the streams library is still a little sparse.
This question is about a simple parsing task that I'd like to see implemented using streams instead of iteratees. I'll answer the question myself if nobody else beats me to it, but I'm sure I'm not the only one who's making (or at least considering) this transition, and since I need to work through this exercise anyway, I figured I might as well do it in public.
Task
Supposed I've got a file containing sentences that have been tokenized and tagged with parts of speech:
no UH
, ,
it PRP
was VBD
n't RB
monday NNP
. .
the DT
equity NN
market NN
was VBD
illiquid JJ
. .
There's one token per line, words and parts of speech are separated by a single space, and blank lines represent sentence boundaries. I want to parse this file and return a list of sentences, which we might as well represent as lists of tuples of strings:
List((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
List((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.)
As usual, we want to fail gracefully if we hit invalid input or file reading exceptions, we don't want to have to worry about closing resources manually, etc.
An iteratee solution
First for some general file reading stuff (that really ought to be part of the iteratee package, which currently doesn't provide anything remotely this high-level):
import java.io.{ BufferedReader, File, FileReader }
import scalaz._, Scalaz._, effect.IO
import iteratee.{ Iteratee => I, _ }
type ErrorOr[A] = EitherT[IO, Throwable, A]
def tryIO[A, B](action: IO[B]) = I.iterateeT[A, ErrorOr, B](
EitherT(action.catchLeft).map(I.sdone(_, I.emptyInput))
)
def enumBuffered(r: => BufferedReader) = new EnumeratorT[String, ErrorOr] {
lazy val reader = r
def apply[A] = (s: StepT[String, ErrorOr, A]) => s.mapCont(k =>
tryIO(IO(Option(reader.readLine))).flatMap {
case None => s.pointI
case Some(line) => k(I.elInput(line)) >>== apply[A]
}
)
}
def enumFile(f: File) = new EnumeratorT[String, ErrorOr] {
def apply[A] = (s: StepT[String, ErrorOr, A]) => tryIO(
IO(new BufferedReader(new FileReader(f)))
).flatMap(reader => I.iterateeT[String, ErrorOr, A](
EitherT(
enumBuffered(reader).apply(s).value.run.ensuring(IO(reader.close()))
)
))
}
And then our sentence reader:
def sentence: IterateeT[String, ErrorOr, List[(String, String)]] = {
import I._
def loop(acc: List[(String, String)])(s: Input[String]):
IterateeT[String, ErrorOr, List[(String, String)]] = s(
el = _.trim.split(" ") match {
case Array(form, pos) => cont(loop(acc :+ (form, pos)))
case Array("") => cont(done(acc, _))
case pieces =>
val throwable: Throwable = new Exception(
"Invalid line: %s!".format(pieces.mkString(" "))
)
val error: ErrorOr[List[(String, String)]] = EitherT.left(
throwable.point[IO]
)
IterateeT.IterateeTMonadTrans[String].liftM(error)
},
empty = cont(loop(acc)),
eof = done(acc, eofInput)
)
cont(loop(Nil))
}
And finally our parsing action:
val action =
I.consume[List[(String, String)], ErrorOr, List] %=
sentence.sequenceI &=
enumFile(new File("example.txt"))
We can demonstrate that it works:
scala> action.run.run.unsafePerformIO().foreach(_.foreach(println))
List((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
List((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))
And we're done.
What I want
More or less the same program implemented using Scalaz streams instead of iteratees.
A scalaz-stream solution:
import scalaz.std.vector._
import scalaz.syntax.traverse._
import scalaz.std.string._
val action = linesR("example.txt").map(_.trim).
splitOn("").flatMap(_.traverseU { s => s.split(" ") match {
case Array(form, pos) => emit(form -> pos)
case _ => fail(new Exception(s"Invalid input $s"))
}})
We can demonstrate that it works:
scala> action.collect.attempt.run.foreach(_.foreach(println))
Vector((no,UH), (,,,), (it,PRP), (was,VBD), (n't,RB), (monday,NNP), (.,.))
Vector((the,DT), (equity,NN), (market,NN), (was,VBD), (illiquid,JJ), (.,.))
And we're done.
The traverseU function is a common Scalaz combinator. In this case it's being used to traverse, in the Process monad, the sentence Vector generated by splitOn. It's equivalent to map followed by sequence.