Writing a custom Enumeratee with Scala and Play 2

I'm having a hard time understanding the Iteratee/Enumeratee/Enumerator concept. It looks like I've understood how to create a custom Iteratee - there are some good examples of that.
Now I'm trying to write my own custom Enumeratee. I started digging through the code; there aren't many comments, but there is a lot of fold(), fold0(), foldM() and joinI(). I understand that an Enumeratee is essentially an Iteratee with some extra sauce, but I still can't grasp how to write my own. So, if somebody could help me with the example task below, it would point me in the right direction. Let's consider this example:
val stringEnumerator = Enumerator("abc", "def,ghi", "jkl,mnopqrstuvwxyz")
val myEnumeratee: Enumeratee[String, Int] = ... // ???
val lengthEnumerator: Enumerator[Int] = stringEnumerator through myEnumeratee // should be equal to Enumerator(6, 6, 14)
myEnumeratee should repartition the stream by splitting the incoming character flow on commas and returning the length of each chunk (the length of "abc" + "def" is 6, the length of "ghi" + "jkl" is 6, and so on). How do I write it?
P.S.
There is an Iteratee I've written for counting the length of each chunk and eventually returning a List[Int]. Maybe it will help.
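Roughly, it is just a fold over the incoming chunks, something like this simplified sketch (assuming Iteratee.fold and an implicit execution context):

import play.api.libs.iteratee._
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch: accumulate the length of each incoming chunk into a List[Int]
val chunkLengths: Iteratee[String, List[Int]] =
  Iteratee.fold(List.empty[Int]) { (acc, chunk: String) => acc :+ chunk.length }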

The fancy thing you are trying to do here is repartition the characters not according to their preexisting iteratee Input boundaries but by the comma boundaries. After that it is as simple as composing with Enumeratee.map{_.length}. Here is your example using the Scala interpreter in paste mode. You can see that result1 at the bottom contains the repartitioned strings and result2 is the length of each.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import play.api.libs.iteratee._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Await
import scala.concurrent.duration._
def repartitionStrings: Enumeratee[String, String] = {
  Enumeratee.grouped[String](Traversable.splitOnceAt[String, Char](c => c != ',') transform Iteratee.consume())
}
val stringEnumerator = Enumerator("abc", "def,ghi", "jkl,mnopqrstuvwxyz")
val repartitionedEnumerator: Enumerator[String] = stringEnumerator.through(repartitionStrings)
val lengthEnumerator: Enumerator[Int] = stringEnumerator.through(repartitionStrings).through(Enumeratee.map{_.length}) // should be equal to Enumerator(6, 6, 14)
val result1 = Await.result(repartitionedEnumerator.run(Iteratee.getChunks[String]), 200 milliseconds)
val result2 = Await.result(lengthEnumerator.run(Iteratee.getChunks[Int]), 200 milliseconds)
// Exiting paste mode, now interpreting.
import play.api.libs.iteratee._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Await
import scala.concurrent.duration._
repartitionStrings: play.api.libs.iteratee.Enumeratee[String,String]
stringEnumerator: play.api.libs.iteratee.Enumerator[String] = play.api.libs.iteratee.Enumerator$$anon$19@77e8800b
repartitionedEnumerator: play.api.libs.iteratee.Enumerator[String] = play.api.libs.iteratee.Enumerator$$anon$3@73216e8d
lengthEnumerator: play.api.libs.iteratee.Enumerator[Int] = play.api.libs.iteratee.Enumerator$$anon$3@2046e423
result1: List[String] = List(abcdef, ghijkl, mnopqrstuvwxyz)
result2: List[Int] = List(6, 6, 14)
Enumeratee.grouped is a powerful method that groups together traversable things (Seq, String, ...) according to a small internal custom Iteratee that you define. This iteratee consumes elements from the stream and produces the value that becomes the first element emitted by the outer enumeratee; it is then rerun on the remaining input to produce the second outer element, and so on. We achieve this by using the helper method Traversable.splitOnceAt, which does precisely what we are looking for; we just need to compose it with a simple iteratee that concatenates all of these chunks into the string that is returned at the end (Iteratee.consume).
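Putting it together, the myEnumeratee from the question is then just the composition of the two stages; a sketch using the repartitionStrings helper above (><> is the Enumeratee composition operator, and the execution context import is assumed as in the paste session):

val myEnumeratee: Enumeratee[String, Int] =
  repartitionStrings ><> Enumeratee.map(_.length)

// stringEnumerator &> myEnumeratee produces 6, 6, 14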

Related

Applying multiple map functions to streaming database results in Play 2.6

I have a large query that seems to be a prime candidate for streaming results.
I would like to call a function that returns an object to which I can apply additional map transformations, and then ultimately convert the entire result into a list. This is because the conversions will result in a set of objects much smaller than the results in the database, and there are many different transformations that must take place sequentially. Processing one result at a time will save me significant memory.
For example, if the results from the database were a stream (though the correct thing is likely an Akka Stream or an Iteratee), then I could do something like:
def outer(converter1: String => Int, converter2: Int => Double): List[Double] = {
  val sqlIterator = getSqlIterator()
  val mappedIterator1 = sqlIterator.map(x => converter1(x.bigColumn))
  val mappedIterator2 = mappedIterator1.map(x => converter2(x))
  val retVal = mappedIterator2.toList
  retVal
}
def getSqlIterator() = {
  val selectedObjects = SQL("""SELECT * FROM table""").map { x =>
    val id = x[Long]("id")
    val tinyColumn = x[String]("tiny_column")
    val bigColumn = x[String]("big_column")
    NewObject(id, tinyColumn, bigColumn)
  }
  val transformed = UNKNOWN_FUNCTION(selectedObjects)
  transformed
}
Most of the documentation appears to provide a mechanism for applying a "reduce" function to the results rather than a "map" function, but the resulting mapped objects will be much smaller, saving me significant memory. What should I do for UNKNOWN_FUNCTION?
The following is a simple example of using Anorm's Akka Streams support to read the values from a single column of type String, applying two transformations to each element, and placing the results in a Seq. I'll leave it as an exercise for you to retrieve the values from multiple columns at a time, if that's what you need.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import anorm._
import scala.collection.immutable.Seq
import scala.concurrent.Future
implicit val system = ActorSystem("MySystem")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher
val convertStringToInt: String => Int = ???
val convertIntToDouble: Int => Double = ???
val result: Future[Seq[Double]] =
  AkkaStream.source(SQL"SELECT big_column FROM table", SqlParser.scalar[String])
    .map(convertStringToInt)
    .map(convertIntToDouble)
    .runWith(Sink.seq[Double])
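If you do need several columns per row, a hedged sketch of the same pipeline (assuming the id/tiny_column/big_column layout and the NewObject case class from the question, Anorm's row-parser combinators, and the same implicits as above, including a database Connection in scope) could look like:

case class NewObject(id: Long, tinyColumn: String, bigColumn: String)

// Combine column parsers with ~ and map the result into the case class
val rowParser: RowParser[NewObject] =
  (SqlParser.long("id") ~ SqlParser.str("tiny_column") ~ SqlParser.str("big_column")).map {
    case id ~ tiny ~ big => NewObject(id, tiny, big)
  }

val objects: Future[Seq[NewObject]] =
  AkkaStream.source(SQL"SELECT id, tiny_column, big_column FROM table", rowParser)
    .runWith(Sink.seq[NewObject])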

Scala - Remove header from Pair RDD

I am new to Scala and want to remove the header from my data. I have the data below:
recordid,income
1,50000000
2,50070000
3,50450000
5,50920000
and I am using the code below to read it:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object PAN {
  def main(args: Array[String]) {
    case class income(recordid: Int, income: Int)
    val sc = new SparkContext(new SparkConf().setAppName("income").setMaster("local[2]"))
    val income_data = sc.textFile("file:///home/user/Documents/income_info.txt").map(_.split(","))
    val income_recs = income_data.map(r => (r(0).toInt, income(r(0).toInt, r(1).toInt)))
  }
}
I want to remove the header from the pair RDD but can't figure out how.
Thanks.
===============================Edit=========================================
I was playing with the code below:
val header = income_data.first()
val a = income_data.filter(row => row != header)
a.foreach { println }
but it returns the output below:
[Ljava.lang.String;@1657737
[Ljava.lang.String;@75c5d3
[Ljava.lang.String;@ed63f
[Ljava.lang.String;@13f04a
[Ljava.lang.String;@1048c5d
Your technique of removing the header by filtering it out will work fine. The problem is how you are trying to print the array.
Arrays in Scala do not override toString, so when you try to print one you get the default string representation, which is just the class name and hash code and usually not very useful.
If you want to print an array, turn it into a string first with the mkString method, or use a nested foreach(println):
a.foreach { array => println(array.mkString("[", ", ", "]")) }
or
a.foreach { array => array.foreach(println) }
Both will print out the elements of your array so you can see what they contain.
Keep in mind that when working with Spark, printing inside transformations and actions only works in local mode. Once you move to a cluster, the work is done on remote executors, so you won't be able to see any console output from them.
val income_data = sc.textFile("file:///home/user/Documents/income_info.txt")
income_data.collect().drop(1)
When you create the RDD it is an RDD[String]; calling collect() on it returns an Array[String], and drop(n) is a function on Array that removes the first n elements, in this case the header row.
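Alternatively, to avoid collecting the whole RDD to the driver, a hedged sketch is to filter the header out on the raw lines (Strings compare by value) before splitting, and then build the pair RDD exactly as in the question:

val raw = sc.textFile("file:///home/user/Documents/income_info.txt")
val header = raw.first() // "recordid,income"
val income_recs = raw
  .filter(_ != header) // drop the header line
  .map(_.split(","))
  .map(r => (r(0).toInt, income(r(0).toInt, r(1).toInt)))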

Tokenization by Stanford parser is slow?

Question summary: tokenization by the Stanford parser is slow on my local machine, but unreasonably much faster on Spark. Why?
I'm using the Stanford CoreNLP tool to tokenize sentences.
My script in Scala is like this:
import java.util.Properties
import scala.collection.JavaConversions._
import scala.collection.immutable.ListMap
import scala.io.Source
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val properties = new Properties()
val coreNLP = new StanfordCoreNLP(properties)
def tokenize(s: String) = {
  properties.setProperty("annotators", "tokenize")
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}
tokenize("Here is my sentence.")
One call to the tokenize function takes roughly (at least) 0.1 sec.
This is very, very slow because I have 3 million sentences.
(3M * 0.1 sec = 300K sec ≈ 83 hours)
As an alternative approach, I have applied the tokenizer on Spark (with four worker machines).
import java.util.List
import java.util.Properties
import scala.collection.JavaConversions._
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val file = sc.textFile("hdfs:///myfiles")
def tokenize(s: String) = {
  val properties = new Properties()
  properties.setProperty("annotators", "tokenize")
  val coreNLP = new StanfordCoreNLP(properties)
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.toString)
}
def normalizeToken(t: String) = {
  val ts = t.toLowerCase
  val num = "[0-9]+[,0-9]*".r
  ts match {
    case num() => "NUMBER"
    case _ => ts
  }
}
val tokens = file.map(tokenize(_))
val tokenList = tokens.flatMap(_.map(normalizeToken))
val wordCount = tokenList.map((_,1)).reduceByKey(_ + _).sortBy(_._2, false)
wordCount.saveAsTextFile("wordcount")
This script finishes tokenization and word counting of 3 million sentences in just 5 minutes!
And the results seem reasonable.
Why is this so fast? Or rather, why is the first Scala script so slow?
The problem with your first approach is that you set the annotators property after you initialize the StanfordCoreNLP object. Therefore CoreNLP is initialized with the default list of annotators, which includes the part-of-speech tagger and the parser, both of which are orders of magnitude slower than the tokenizer.
To fix this, simply move the line
properties.setProperty("annotators", "tokenize")
before the line
val coreNLP = new StanfordCoreNLP(properties)
This should be even slightly faster than your second approach as you don't have to reinitialize CoreNLP for each sentence.
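In other words, the corrected setup from the first script would look roughly like this (same code, with the property set before the pipeline is constructed):

val properties = new Properties()
properties.setProperty("annotators", "tokenize") // configure before constructing the pipeline
val coreNLP = new StanfordCoreNLP(properties)

def tokenize(s: String) = {
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}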

Scala function does not return a value

I think I understand the rules of implicit returns but I can't figure out why splithead is not being set. This code is run via
val m = new TaxiModel(sc, file)
and then I expect
m.splithead
to give me an array of strings. Note that head is an array of strings.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
class TaxiModel(sc: SparkContext, dat: String) {
  val rawData = sc.textFile(dat)
  val head = rawData.take(10)
  val splithead = head.slice(1, 11).foreach(splitData)

  def splitData(dat: String): Array[String] = {
    val splits = dat.split("\",\"")
    val split0 = splits(0).substring(1, splits(0).length)
    val split8 = splits(8).substring(0, splits(8).length - 1)
    Array(split0).union(splits.slice(1, 8)).union(Array(split8))
  }
}
foreach just evaluates an expression for its side effects and does not collect any data while iterating (it returns Unit). You probably need map or flatMap (see the docs):
head.slice(1,11).map(splitData) // gives you Array[Array[String]]
head.slice(1,11).flatMap(splitData) // gives you Array[String]
Consider also a for comprehension (which desugars in this case into map),
for (s <- head.slice(1,11)) yield splitData(s)
Note also that Scala strings are equipped with ordered-collection methods, so
splits(0).substring(1, splits(0).length)
proves equivalent to any of the following
splits(0).drop(1)
splits(0).tail
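And applied back to the class in the question, the splithead field would then become, for example:

val splithead: Array[String] = head.slice(1, 11).flatMap(splitData)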

Using scalaz-stream to calculate a digest

So I was wondering how I might use scalaz-stream to generate the digest of a file using java.security.MessageDigest?
I would like to do this using a constant-memory buffer size (for example 4KB). I think I understand how to start by reading the file, but I am struggling to understand how to:
1) call digest.update(buf) for each 4KB chunk, which is effectively a side effect on the Java MessageDigest instance and which I guess should happen inside the scalaz-stream framework, and
2) finally call digest.digest() to receive the calculated digest back from within the scalaz-stream framework somehow?
I think I understand kinda how to start:
import scalaz.stream._
import java.security.MessageDigest
val f = "/a/b/myfile.bin"
val bufSize = 4096
val digest = MessageDigest.getInstance("SHA-256")
Process.constant(bufSize).toSource
.through(io.fileChunkR(f, bufSize))
But then I am stuck!
Any hints, please? I guess it must also be possible to wrap the creation, update, retrieval (of the actual digest calculation) and destruction of the digest object in a scalaz-stream Sink or something, and then call .to() passing in that Sink? Sorry if I am using the wrong terminology; I am completely new to scalaz-stream. I have been through a few of the examples but am still struggling.
Since version 0.4, scalaz-stream contains processes to calculate digests. They are available in the hash module and use java.security.MessageDigest under the hood. Here is a minimal example of how you could use them:
import scalaz.concurrent.Task
import scalaz.stream._
object Sha1Sum extends App {
  val fileName = "testdata/celsius.txt"
  val bufferSize = 4096

  val sha1sum: Task[Option[String]] =
    Process.constant(bufferSize)
      .toSource
      .through(io.fileChunkR(fileName, bufferSize))
      .pipe(hash.sha1)
      .map(sum => s"${sum.toHex} $fileName")
      .runLast

  sha1sum.run.foreach(println)
}
The update() and digest() calls are all contained inside the hash.sha1 Process1.
So I have something working, but it could probably be improved:
import java.io._
import java.security.MessageDigest
import resource._
import scodec.bits.ByteVector
import scalaz._, Scalaz._
import scalaz.concurrent.Task
import scalaz.stream._
import scalaz.stream.io._
val f = "/a/b/myfile.bin"
val bufSize = 4096
val md = MessageDigest.getInstance("SHA-256")
def _digestResource(md: => MessageDigest): Sink[Task, ByteVector] =
  resource(Task.delay(md))(md => Task.delay(()))(
    md => Task.now((bytes: ByteVector) => Task.delay(md.update(bytes.toArray))))

Process.constant(bufSize).toSource
  .through(fileChunkR(f, bufSize))
  .to(_digestResource(md))
  .run
  .run

md.digest()
However, it seems to me that there should be a cleaner way to do this, by moving the creation of the MessageDigest inside the scalaz-stream code and having the final .run yield the md.digest().
Better answers welcome...
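One possible refinement along those lines, as a hedged sketch (assuming Process#fold and the same imports and values as above), is to fold the MessageDigest through the stream itself and emit the final digest as the stream's last value:

val sha256: Task[Option[ByteVector]] =
  Process.constant(bufSize).toSource
    .through(fileChunkR(f, bufSize))
    .fold(MessageDigest.getInstance("SHA-256")) { (md, bytes) =>
      md.update(bytes.toArray); md // side-effecting update, run sequentially
    }
    .map(md => ByteVector.view(md.digest())) // emit the final digest bytes
    .runLast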