Elegant traversal of a source in Scala

As a data scientist I frequently use the following pattern for data extraction (e.g. reading from a DB, a file, and other sources):
val source = open(sourceName)
var item = source.getNextItem()
while (item != null) {
  processItem(item)
  item = source.getNextItem()
}
source.close()
My (current) dream is to wrap this verbosity into a Scala object "SourceTrav" that would allow this elegance:
SourceTrav(sourceName).foreach(item => processItem(item))
with the same functionality as above, but without running into StackOverflowError, as might happen with the examples in Semantics of Scala Traversable, Iterable, Sequence, Stream and View?
Any idea?

If Scala's standard library (for example scala.io.Source) doesn't suit your needs, you can use different Iterator or Stream companion object methods to wrap manual iterator traversal.
In this case, for example, you can do the following, when you already have an open source:
Iterator.continually(source.getNextItem()).takeWhile(_ != null).foreach(processItem)
If you also want to add automatic opening and closing of the source, don't forget to add try-finally or some other flavor of the loan pattern:
case class SourceTrav(sourceName: String) {
  def foreach(processItem: Item => Unit): Unit = {
    val source = open(sourceName)
    try {
      Iterator.continually(source.getNextItem()).takeWhile(_ != null).foreach(processItem)
    } finally {
      source.close()
    }
  }
}
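For illustration, usage then looks like this (a sketch assuming the open, Item, and processItem definitions from the question; "mySource" is a made-up source name):
// Each foreach call opens the source and closes it even if processItem throws.
SourceTrav("mySource").foreach(item => processItem(item))

// Because SourceTrav defines foreach, a plain for loop works as well:
for (item <- SourceTrav("mySource")) {
  processItem(item)
}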

Related

Scala: How to create an Array of defs/objects and then call those defs in a foreach?

I have a bunch of Scala objects with defs that do a bunch of processing:
Foo\CatProcessing (def processing)
Foo\DogProcessing (def processing)
Foo\BirdProcessing (def processing)
Then I have a main def that calls each object's processing def, passing in common parameter values and such.
I am trying to put all of the objects into an Array or List, and then do a foreach to loop through the list, passing in the parameter values. i.e.
foreach(object in objList){
  object.processing(parameters)
}
Coming from C#, I could do this via binders or the like, so how would I manage this in Scala?
for (obj <- objList) {
  obj.processing(parameters) // `object` is a reserved keyword in Scala
}
or
objList.foreach(obj => obj.processing(parameters))
They are actually the same thing, the former being "syntactic sugar" for the latter.
In the second case, you can bind the only parameter of the anonymous function passed to the foreach function to _, resulting in the following
objList.foreach(_.processing(parameters))
for comprehensions in Scala can be quite expressive and go beyond simple iteration; if you're curious, you can read more about them here.
Since you are coming from C#, if by any chance you have had any exposure to LINQ, you will find yourself at home with the Scala collection API. The official documentation is quite extensive in this regard and you can read more about it here.
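As a small illustration of that expressiveness, a for comprehension with a guard and a yield desugars into calls on the collection API (the data is made up):
val names = List("cat", "dog", "bird")
// A for comprehension with a guard and a yield...
val loud = for (n <- names if n.length > 3) yield n.toUpperCase
// ...desugars to withFilter/map on the collection:
val same = names.withFilter(_.length > 3).map(_.toUpperCase)
// Both evaluate to List("BIRD")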
As it came up in the comments following my reply, you also need the objects you want to iterate over to have a common type that exposes the processing method. Alternatively, Scala allows you to use structural typing, but that relies on runtime reflection and is unlikely to be something you really need or want in this case.
You can achieve it by having a common trait for your objects, as in the following example:
trait Processing {
  def processing(): Unit
}
final class CatProcessing extends Processing {
  def processing(): Unit = println("cat")
}
final class DogProcessing extends Processing {
  def processing(): Unit = println("dog")
}
final class BirdProcessing extends Processing {
  def processing(): Unit = println("bird")
}
val cat = new CatProcessing
val dog = new DogProcessing
val bird = new BirdProcessing
for (process <- List(cat, dog, bird)) {
  process.processing()
}
You can run the code above and play around with it here on Scastie.
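For comparison, here is a minimal sketch of the structural-typing alternative mentioned above; it compiles, but every call is dispatched through runtime reflection, which is why the common trait is preferable:
import scala.language.reflectiveCalls

// Any value with a parameterless `processing(): Unit` method conforms to this type.
type HasProcessing = { def processing(): Unit }

val structural: List[HasProcessing] = List(new CatProcessing, new DogProcessing)
structural.foreach(_.processing()) // each call goes through reflection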
Using a Map instead, you can do it like this (and the same works for other collection types):
val test = Map("foobar" -> new CatProcessing)
test.values.foreach(
  (p) => p.processing()
)

Wrapping a class with side-effects

After reading chapter six of Functional Programming in Scala and trying to understand the State monad, I have a question regarding wrapping a side-effecting class.
Say I have a class that modifies itself in some way.
class SideEffect(x: Int) {
  var value = x
  def modifyValue(newValue: Int): Unit = { value = newValue }
}
My understanding is that if we wrapped this in a State monad as below, it would still modify the original making the point of wrapping it moot.
case class State[S, +A](run: S => (A, S)) { // See footnote
  // map, flatMap, unit, utility functions
}
val sideEffect = new SideEffect(20)
println(sideEffect.value) // Prints "20"
val stateMonad = State[SideEffect, Int](state => {
  state.modifyValue(10)
  (state.value, state)
})
stateMonad.run(sideEffect) // run the modification
println(sideEffect.value) // Prints "10" i.e. we have modified the original state
The only solution to this that I can see is to make a copy of the class and modify that, but that seems computationally expensive as SideEffect grows. Plus if we wanted to wrap something like a Java class that doesn't implement Cloneable, we'd be out of luck.
val stateMonad = State[SideEffect, Int](state => {
  val newState = new SideEffect(state.value) // Easier if it were a case class, but if one were working with, say, a Java library, one would not have this luxury
  newState.modifyValue(10)
  (newState.value, newState)
})
stateMonad.run(sideEffect) // run the modification
println(sideEffect.value) // Prints "20", original state not modified
Am I using the State monad wrong? How would one go about wrapping a side-effecting class without having to copy it or is this the only way?
The implementation for the State monad I'm using here is from the book, and might differ from Scalaz implementation
You can't do much with a mutable object except hide the mutation inside some wrapper; that way, the scope of the program that requires extra attention in testing becomes much smaller. Your first sample is good enough. One point only: it would be better to hide the outer reference entirely. Instead of stateMonad.run(sideEffect), use something like stateMonad.run(new SideEffect(20)) or
def initState: SideEffect = new SideEffect(20)
val (value, state) = stateMonad.run(initState) // run returns (A, S): the value first, then the state
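For comparison, here is a minimal sketch of how the same logic looks when the state is itself an immutable value, which is the setting the book's State monad is designed for (the Counter class is made up for illustration; no copying of mutable objects is needed because nothing mutates):
// Immutable state: "modification" just returns a new value.
final case class Counter(value: Int)

// Uses the book's State[S, +A](run: S => (A, S)) from the question.
val setToTen: State[Counter, Int] =
  State(c => (c.value, Counter(10))) // return the old value, thread the new state

val (oldValue, newCounter) = setToTen.run(Counter(20))
// oldValue == 20, newCounter == Counter(10); Counter(20) itself is untouched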

How to write efficient type bounded code if the types are unrelated in Scala

I want to improve the following Cassandra-related Scala code. I have two unrelated user-defined types which are actually in Java source files (leaving out the details).
public class Blob { .. }
public class Meta { .. }
So here is how I use them currently from Scala:
private val blobMapper: Mapper[Blob] = mappingManager.mapper(classOf[Blob])
private val metaMapper: Mapper[Meta] = mappingManager.mapper(classOf[Meta])
def save(entity: Object) = {
  entity match {
    case blob: Blob => blobMapper.saveAsync(blob)
    case meta: Meta => metaMapper.saveAsync(meta)
    case _ => // exception
  }
}
While this works, how can I avoid the following problems:
repetition when adding new user-defined classes like Blob or Meta
pattern matching repetition when adding new methods like save
having Object as parameter type
You can definitely use Mapper as a typeclass, doing:
def save[A](entity: A)(implicit mapper: Mapper[A]) = mapper.saveAsync(entity)
Now you have a generic method able to perform a save operation on every type A for which a Mapper[A] is in scope.
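For illustration, the two mappers from the question could be provided once as implicit instances (a sketch; Mapper, mappingManager, Blob, and Meta are from the question's setup):
implicit val blobMapper: Mapper[Blob] = mappingManager.mapper(classOf[Blob])
implicit val metaMapper: Mapper[Meta] = mappingManager.mapper(classOf[Meta])

def saveBoth(blob: Blob, meta: Meta) = {
  save(blob) // the compiler supplies blobMapper
  save(meta) // the compiler supplies metaMapper
  // save("a string") would not compile: no Mapper[String] is in scope
}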
Also, the mappingManager.mapper implementation could probably be improved to avoid classOf, but it's hard to tell from the question in its current state.
A few questions:
Is mappingManager.mapper(cls) expensive?
How much do you care about handling subclasses of Blob or Meta?
Can something like this work for you?
def save[T: Manifest](entity: T) = {
  mappingManager.mapper(manifest[T].runtimeClass).saveAsync(entity)
}
If you do care about making sure that subclasses of Meta grab the proper mapper then you may find isAssignableFrom helpful in your .mapper (and store found sub-classes in a HashMap so you only have to look once).
EDIT: Then maybe you want something like this (ignoring threading concerns):
import scala.collection.mutable

private[this] val mapperMap = mutable.HashMap[Class[_], Mapper[_]]()
def save[T: Manifest](entity: T) = {
  val cls = manifest[T].runtimeClass
  mapperMap.getOrElseUpdate(cls, mappingManager.mapper(cls))
    .asInstanceOf[Mapper[T]]
    .saveAsync(entity)
}

Parallel file processing in Scala

Suppose I need to process files in a given folder in parallel. In Java I would create a FolderReader thread to read file names from the folder and a pool of FileProcessor threads. FolderReader reads file names and submits the file processing function (Runnable) to the pool executor.
In Scala I see two options:
create a pool of FileProcessor actors and schedule a file processing function with Actors.Scheduler.
create an actor for each file name while reading the file names.
Does it make sense? What is the best option?
Depending on what you're doing, it may be as simple as
for (file <- files.par) {
  // process the file
}
I strongly suggest staying as far away from raw threads as you can. Luckily we have better abstractions that take care of what's happening underneath, and in your case it appears to me that you do not need actors (though you could use them); you can use a simpler abstraction, called Futures. They are part of the Akka open-source library, and I think they will be part of the Scala standard library as well in the future.
A Future[T] is simply something that will return a T in the future.
All you need to run a future is an implicit ExecutionContext, which you can derive from a Java executor service. Then you will be able to enjoy the elegant API and the fact that a future is a monad, i.e. you can transform collections into collections of futures, collect the results, and so on. I suggest you take a look at http://doc.akka.io/docs/akka/2.0.1/scala/futures.html
import java.util.concurrent.Executors
import akka.dispatch.{Await, ExecutionContext, Future}
import akka.util.duration._

object TestingFutures {
  implicit val executorService = Executors.newFixedThreadPool(20)
  implicit val executorContext = ExecutionContext.fromExecutorService(executorService)

  def testFutures(myList: List[String]): List[String] = {
    // Reverse each string asynchronously, collecting the results in a single future
    val listOfFutures: Future[List[String]] = Future.traverse(myList) { aString =>
      Future {
        aString.reverse
      }
    }
    val result: List[String] = Await.result(listOfFutures, 1 minute)
    result
  }
}
There's a lot going on here:
I am using Future.traverse, which takes as its first parameter an M[T] <: Traversable[T] and as its second a T => Future[T] (or, if you prefer, a Function1[T, Future[T]]), and returns Future[M[T]]
I am using the Future.apply method to create an anonymous class of type Future[T]
There are many other reasons to look at Akka futures.
Futures can be mapped because they are monads, i.e. you can chain Future executions:
Future { 3 }.map { _ * 2 }.map { _.toString }
Futures have callbacks: future.onComplete, onSuccess, onFailure, andThen, etc.
Futures support not only traverse, but also for comprehensions, as shown below
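For instance, a minimal sketch of the for-comprehension style, written against the standard scala.concurrent API into which Akka's futures were later merged:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val combined: Future[Int] = for {
  a <- Future(2 + 2)  // runs asynchronously
  b <- Future(a * 10) // starts once `a` is available
} yield a + b

// Blocking here only to keep the example self-contained
println(Await.result(combined, 1.second)) // prints 44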
Ideally you should use two actors. One for reading the list of files, and one for actually reading the file.
You start the process by simply sending a single "start" message to the first actor. The actor can then read the list of files, and send a message to the second actor. The second actor then reads the file and processes the contents.
Having multiple actors, which might seem complicated, is actually a good thing in the sense that you have a bunch of objects communicating with each other, like in a theoretical OO system.
Edit: you REALLY shouldn't be doing concurrent reads of a single file.
I was going to write up exactly what @Edmondo1984 did except he beat me to it. :) I second his suggestion in a big way. I'll also suggest that you read the documentation for Akka 2.0.2. As well, I'll give you a slightly more concrete example:
import akka.dispatch.{ExecutionContext, Future, Await}
import akka.util.duration._
import java.util.concurrent.Executors
import java.io.File

val execService = Executors.newCachedThreadPool()
implicit val execContext = ExecutionContext.fromExecutorService(execService)

val tmp = new File("/tmp/")
val files = tmp.listFiles()
val workers = files.map { f =>
  Future {
    f.getAbsolutePath()
  }
}.toSeq
val result = Future.sequence(workers)
result.onSuccess {
  case filenames =>
    filenames.foreach { fn =>
      println(fn)
    }
}
// Artificial just to make things work for the example
Thread.sleep(100)
execContext.shutdown()
Here I use sequence instead of traverse, but the difference is going to depend on your needs.
Go with the Future, my friend; the Actor is just a more painful approach in this instance.
But if we use actors, what's wrong with that?
Say we have to read from / write to some property file. Here is my example in Java, but still with Akka actors.
Let's say we have an actor ActorFile that represents one file. Hmm... it probably cannot represent just one file (it would be nice if it could), so it represents several files instead, like a PropertyFilesActor:
Why not use something like this:
public class PropertyFilesActor extends UntypedActor {
    Map<String, String> filesContent = new LinkedHashMap<String, String>();
    { // here we should use real files, of course
        filesContent.put("file1.xml", "");
        filesContent.put("file2.xml", "");
    }
    @Override
    public void onReceive(Object message) throws Exception {
        if (message instanceof WriteMessage) {
            WriteMessage writeMessage = (WriteMessage) message;
            String content = filesContent.get(writeMessage.fileName);
            String newContent = content + writeMessage.stringToWrite;
            filesContent.put(writeMessage.fileName, newContent);
        }
        else if (message instanceof ReadMessage) {
            ReadMessage readMessage = (ReadMessage) message;
            String currentContent = filesContent.get(readMessage.fileName);
            // Send the current content back to the sender
            getSender().tell(new ReadMessage(readMessage.fileName, currentContent), getSelf());
        }
        else unhandled(message);
    }
}
...a message will go with a parameter (fileName)
It has its own inbox, accepting messages like:
WriteLine(fileName, string)
ReadLine(fileName, string)
These messages are stored in the inbox in order, one after another. The actor does its work by receiving messages from the inbox, storing/reading, and meanwhile sending feedback back to the sender via sender ! message.
Thus, say we write to the property file and then show the content on a web page: we can start rendering the page right after sending the message that stores the data in the file, and as soon as we receive the feedback, update part of the page (by Ajax) with the data from the just-updated file.
Well, grab your files and stick them in a parallel structure
scala> new java.io.File("/tmp").listFiles.par
res0: scala.collection.parallel.mutable.ParArray[java.io.File] = ParArray( ... )
Then...
scala> res0 map (_.length)
res1: scala.collection.parallel.mutable.ParArray[Long] = ParArray(4943, 1960, 4208, 103266, 363 ... )

Is there a FIFO stream in Scala?

I'm looking for a FIFO stream in Scala, i.e., something that provides the functionality of
immutable.Stream (a stream that can be finite and memoizes the elements that have already been read)
mutable.Queue (which allows elements to be added to the FIFO)
The stream should be closable and should block access to the next element until the element has been added or the stream has been closed.
Actually I'm a bit surprised that the collection library does not (seem to) include such a data structure, since it is IMO a quite classical one.
My questions:
1) Did I overlook something? Is there already a class providing this functionality?
2) OK, if it's not included in the collection library then it might be just a trivial combination of existing collection classes. However, I tried to write this trivial code, but my implementation still looks quite complex for such a simple problem. Is there a simpler solution for such a FifoStream?
import java.io.Closeable
import scala.collection.mutable.Queue

class FifoStream[T] extends Closeable {
  val queue = new Queue[Option[T]]
  lazy val stream = nextStreamElem
  private def nextStreamElem: Stream[T] = next() match {
    case Some(elem) => Stream.cons(elem, nextStreamElem)
    case None => Stream.empty
  }
  /** Returns next element in the queue (may wait for it to be inserted). */
  private def next() = {
    queue.synchronized {
      while (queue.isEmpty) queue.wait() // `while` guards against spurious wakeups
      queue.dequeue()
    }
  }
  /** Adds new elements to this stream. */
  def enqueue(elems: T*) {
    queue.synchronized {
      queue.enqueue(elems.map{Some(_)}: _*)
      queue.notify()
    }
  }
  /** Closes this stream. */
  def close() {
    queue.synchronized {
      queue.enqueue(None)
      queue.notify()
    }
  }
}
Paradigmatic's solution (slightly modified)
Thanks for your suggestions. I slightly modified paradigmatic's solution so that toStream returns an immutable stream (allows for repeatable reads) so that it fits my needs. Just for completeness, here is the code:
import collection.JavaConversions._
import java.util.concurrent.{LinkedBlockingQueue, BlockingQueue}

class FIFOStream[A]( private val queue: BlockingQueue[Option[A]] = new LinkedBlockingQueue[Option[A]]() ) {
  lazy val toStream: Stream[A] = queue2stream
  private def queue2stream: Stream[A] = queue take match {
    case Some(a) => Stream cons ( a, queue2stream )
    case None    => Stream empty
  }
  def close() = queue add None
  def enqueue( as: A* ) = queue addAll as.map( Some(_) )
}
In Scala, streams are "functional iterators". People expect them to be pure (no side effects) and immutable. In your case, every time you iterate over the stream you modify the queue (so it's not pure). This can create a lot of misunderstanding, because iterating over the same stream twice will give two different results.
That being said, you should rather use Java BlockingQueues than roll your own implementation. They are considered well implemented in terms of safety and performance. Here is the cleanest code I can think of (using your approach):
import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue}
import scala.collection.JavaConversions._

class FIFOStream[A]( private val queue: BlockingQueue[Option[A]] ) {
  def toStream: Stream[A] = queue take match {
    case Some(a) => Stream cons ( a, toStream )
    case None    => Stream empty
  }
  def close() = queue add None
  def enqueue( as: A* ) = queue addAll as.map( Some(_) )
}
object FIFOStream {
  // Wrap a fresh blocking queue so callers get a ready-to-use stream
  def apply[A]() = new FIFOStream( new LinkedBlockingQueue[Option[A]] )
}
I'm assuming you're looking for something like java.util.concurrent.BlockingQueue?
Akka has a BoundedBlockingQueue implementation of this interface. There are of course the implementations available in java.util.concurrent.
You might also consider using Akka's actors for whatever it is you are doing. Use actors to be notified of, or pushed, new events or messages instead of pulling.
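For instance, here is a minimal producer-consumer sketch built directly on LinkedBlockingQueue, where the blocking take() replaces hand-rolled wait/notify (the strings are made up):
import java.util.concurrent.LinkedBlockingQueue

val queue = new LinkedBlockingQueue[Option[String]]()

val consumer = new Thread(new Runnable {
  def run(): Unit = {
    // take() blocks until an element is available; None signals end-of-stream
    Iterator.continually(queue.take()).takeWhile(_.isDefined)
      .foreach(item => println(item.get))
  }
})
consumer.start()

queue.put(Some("first"))
queue.put(Some("second"))
queue.put(None) // close the stream
consumer.join()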
1) It seems you're looking for a dataflow stream seen in languages like Oz, which supports the producer-consumer pattern. Such a collection is not available in the collections API, but you could always create one yourself.
2) The data flow stream relies on the concept of single-assignment variables (they don't have to be initialized at the point of declaration, and reading them prior to initialization causes blocking):
val x: Int
startThread {
  println(x)
}
println("The other thread waits for the x to be assigned")
x = 1
It would be straightforward to implement such a stream if single-assignment (or dataflow) variables were supported in the language (see the link). Since they are not a part of Scala, you have to use the wait-synchronized-notify pattern just like you did.
Concurrent queues from Java can be used to achieve that as well, as the other user suggested.