How might I use scalaz-stream to generate the digest of a file using java.security.MessageDigest?
I would like to do this using a constant memory buffer size (for example 4KB). I think I understand how to start with reading the file, but I am struggling to understand how to:
1) call digest.update(buf) for each 4KB chunk, which is effectively a side effect on the Java MessageDigest instance and which I guess should happen inside the scalaz-stream framework.
2) finally call digest.digest() to receive back the calculated digest from within the scalaz-stream framework somehow?
I think I roughly understand how to start:
import scalaz.stream._
import java.security.MessageDigest
val f = "/a/b/myfile.bin"
val bufSize = 4096
val digest = MessageDigest.getInstance("SHA-256")
Process.constant(bufSize).toSource
.through(io.fileChunkR(f, bufSize))
But then I am stuck!
Any hints please? I guess it must also be possible to wrap the creation, update, retrieval (of the actual digest calculation) and destruction of the digest object in a scalaz-stream Sink or something, and then call .to() passing in that Sink? Sorry if I am using the wrong terminology; I am completely new to scalaz-stream. I have been through a few of the examples but am still struggling.
Since version 0.4, scalaz-stream contains processes to calculate digests. They are available in the hash module and use java.security.MessageDigest under the hood. Here is a minimal example of how you could use them:
import scalaz.concurrent.Task
import scalaz.stream._
object Sha1Sum extends App {
  val fileName = "testdata/celsius.txt"
  val bufferSize = 4096

  val sha1sum: Task[Option[String]] =
    Process.constant(bufferSize)
      .toSource
      .through(io.fileChunkR(fileName, bufferSize))
      .pipe(hash.sha1)
      .map(sum => s"${sum.toHex} $fileName")
      .runLast

  sha1sum.run.foreach(println)
}
The update() and digest() calls are all contained inside the hash.sha1 Process1.
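The hash module is not limited to SHA-1. Since the question asks for SHA-256, the same pipeline should work with hash.sha256 (assuming that process is available in your scalaz-stream version); here is a sketch reusing fileName and bufferSize from the example above:
// Same pipeline, but producing a SHA-256 digest.
val sha256sum: Task[Option[String]] =
  Process.constant(bufferSize)
    .toSource
    .through(io.fileChunkR(fileName, bufferSize))
    .pipe(hash.sha256) // assumed to exist alongside hash.sha1
    .map(sum => s"${sum.toHex} $fileName")
    .runLast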
So I have something working, but it could probably be improved:
import java.io._
import java.security.MessageDigest
import resource._
import scodec.bits.ByteVector
import scalaz._, Scalaz._
import scalaz.concurrent.Task
import scalaz.stream._
import scalaz.stream.io._
val f = "/a/b/myfile.bin"
val bufSize = 4096
val md = MessageDigest.getInstance("SHA-256")
// Sink that updates the given MessageDigest with every chunk that flows through it.
def _digestResource(md: => MessageDigest): Sink[Task, ByteVector] =
  resource(Task.delay(md))(_ => Task.delay(()))(
    md => Task.now((bytes: ByteVector) => Task.delay(md.update(bytes.toArray))))

Process.constant(bufSize).toSource
  .through(fileChunkR(f, bufSize))
  .to(_digestResource(md))
  .run
  .run

md.digest()
However, it seems to me that there should be a cleaner way to do this, by moving the creation of the MessageDigest inside the scalaz-stream code and having the final .run yield md.digest().
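One possible cleanup along those lines, sketched here and not tested against a specific scalaz-stream version: fold the chunks into a MessageDigest inside a Process1 (much like hash.sha1 above), so that the final run yields the digest directly. The digestP1 helper is hypothetical and reuses the imports and the f/bufSize definitions from above.
// Hypothetical helper: updates a fresh MessageDigest with every chunk and emits
// the final digest as a single ByteVector (process1.fold emits only the last
// accumulated value). The fold mutates the digest, so don't reuse this Process1
// across runs.
def digestP1(algorithm: String): Process1[ByteVector, ByteVector] =
  process1
    .fold(MessageDigest.getInstance(algorithm)) { (md, bytes: ByteVector) =>
      md.update(bytes.toArray); md
    }
    .map(md => ByteVector.view(md.digest()))

val digest: Option[ByteVector] =
  Process.constant(bufSize).toSource
    .through(fileChunkR(f, bufSize))
    .pipe(digestP1("SHA-256"))
    .runLast
    .run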
Better answers welcome...
Related
While using monix.eval.Task or zio.Task, is there a simple way to convert Option of Task to Task of Option?
If you want a pure ZIO solution, you can use .foreach with identity:
import zio.{UIO, ZIO}

val fx: Option[UIO[Int]] = Option(ZIO.effectTotal(42))
val res: UIO[Option[Int]] = ZIO.foreach(fx)(identity)
If you're also using cats, the method you're looking for is called .sequence.
import cats.implicits.toTraverseOps
import zio.interop.catz._
import zio.{UIO, ZIO}
val fx: Option[UIO[Int]] = Option(ZIO.effectTotal(42))
val res: UIO[Option[Int]] = fx.sequence
The other way around is not possible as one would need to materialize the Task in order to be able to lift it into an Option[T].
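To make that concrete, here is a small sketch (assuming ZIO 1.x and its default Runtime): getting an Option[Int] back out of a UIO[Option[Int]] means actually running the effect.
import zio.{Runtime, UIO}

val effect: UIO[Option[Int]] = UIO.succeed(Some(42))
// There is no pure conversion back; the effect has to be run (materialized).
val materialized: Option[Int] = Runtime.default.unsafeRun(effect)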
I'm trying to get basic file attributes using Scala, and my reference is this Java question:
Determine file creation date in Java
and this piece of code I'm trying to rewrite in Scala:
static void getAttributes(String pathStr) throws IOException {
Path p = Paths.get(pathStr);
BasicFileAttributes view
= Files.getFileAttributeView(p, BasicFileAttributeView.class)
.readAttributes();
System.out.println(view.creationTime()+" is the same as "+view.lastModifiedTime());
}
The thing I just can't figure out is this line of code. I don't understand how to pass a class in this way using Scala, or why Java insists on this in the first place instead of taking an actual constructed object as the parameter. Can someone please help me write this line of code so it works? I must be using the wrong syntax:
val attr = Files.readAttributes(f,Class[BasicFileAttributeView])
Try this:
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributeView

def attrs(pathStr: String) =
  Files.getFileAttributeView(
    Paths.get(pathStr),
    classOf[BasicFileAttributeView] // the view class, not BasicFileAttributes
  ).readAttributes
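For completeness, a small usage sketch mirroring the Java snippet from the question (the path is just an example):
val view = attrs("/tmp/test.sql") // returns BasicFileAttributes
println(s"${view.creationTime} is the same as ${view.lastModifiedTime}")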
Get the file creation date in Scala from BasicFileAttributes:
// option 1,
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributes
val pathStr = "/tmp/test.sql"
Files.readAttributes(Paths.get(pathStr), classOf[BasicFileAttributes]).creationTime
res3: java.nio.file.attribute.FileTime = 2018-03-06T00:25:52Z
// option 2,
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributeView
val pathStr = "/tmp/test.sql"
{
Files
.getFileAttributeView(Paths.get(pathStr), classOf[BasicFileAttributeView])
.readAttributes.creationTime
}
res20: java.nio.file.attribute.FileTime = 2018-03-07T19:00:19Z
I'm having a hard time understanding the Iteratee/Enumeratee/Enumerator concept. It looks like I have understood how to create a custom Iteratee; there are some good examples of that.
Now I'm going to write my custom Enumeratee. I started digging through the code for that; there are not many comments, but a lot of fold(), fold0(), foldM(), joinI(). I understand that an Enumeratee is essentially an Iteratee with some extra sauce, but I still can't grasp how to write my own. So if somebody helps me with this example task, it will point me in the right direction. Let's consider this example:
val stringEnumerator = Enumerator("abc", "def,ghi", "jkl,mnopqrstuvwxyz")
val myEnumeratee: Enumeratee[String, Int] = ... // ???
val lengthEnumerator: Enumerator[Int] = stringEnumerator through myEnumeratee // should be equal to Enumerator(6, 6, 14)
myEnumeratee should resample the stream by splitting the given character flow at commas and returning the length of each chunk (the length of "abc" + "def" is 6, the length of "ghi" + "jkl" is 6, and so on). How do I write it?
P.S.
There is an Iteratee I've written for counting the length of each chunk and eventually returning a List[Int]. Maybe it will help.
The fancy thing you are trying to do here is repartition the characters not according to their preexisting iteratee Input boundaries but by the comma boundaries. After that it is as simple as composing with Enumeratee.map{_.length}. Here is your example using the Scala interpreter in paste mode. You can see that result1 at the bottom contains the repartitioned strings and result2 is just the length of each.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import play.api.libs.iteratee._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Await
import scala.concurrent.duration._
def repartitionStrings: Enumeratee[String, String] = {
Enumeratee.grouped[String](Traversable.splitOnceAt[String, Char](c => c != ',') transform Iteratee.consume())
}
val stringEnumerator = Enumerator("abc", "def,ghi", "jkl,mnopqrstuvwxyz")
val repartitionedEnumerator: Enumerator[String] = stringEnumerator.through(repartitionStrings)
val lengthEnumerator: Enumerator[Int] = stringEnumerator.through(repartitionStrings).through(Enumeratee.map{_.length}) // should be equal to Enumerator(6, 6, 14)
val result1 = Await.result(repartitionedEnumerator.run(Iteratee.getChunks[String]), 200 milliseconds)
val result2 = Await.result(lengthEnumerator.run(Iteratee.getChunks[Int]), 200 milliseconds)
// Exiting paste mode, now interpreting.
import play.api.libs.iteratee._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Await
import scala.concurrent.duration._
repartitionStrings: play.api.libs.iteratee.Enumeratee[String,String]
stringEnumerator: play.api.libs.iteratee.Enumerator[String] = play.api.libs.iteratee.Enumerator$$anon$19#77e8800b
repartitionedEnumerator: play.api.libs.iteratee.Enumerator[String] = play.api.libs.iteratee.Enumerator$$anon$3#73216e8d
lengthEnumerator: play.api.libs.iteratee.Enumerator[Int] = play.api.libs.iteratee.Enumerator$$anon$3#2046e423
result1: List[String] = List(abcdef, ghijkl, mnopqrstuvwxyz)
result2: List[Int] = List(6, 6, 14)
Enumeratee.grouped is a powerful method that will group together traversable things (Seq, String, ...) according to a small internal custom Iteratee that you define. This iteratee consumes elements from the stream and produces the first element of the outer enumeratee; it is then rerun on the remaining input when it is time for the second outer element, and so on. We achieve this by using the special helper method Traversable.splitOnceAt, which does precisely what we are looking for; we just need to compose it with a simple iteratee that concatenates all of these chunks together into the string returned at the end (Iteratee.consume).
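To see the inner iteratee in isolation, here is a small sketch (reusing the imports from the paste above): it consumes characters up to the first comma and concatenates everything it saw, which is exactly the first repartitioned element.
// Consumes string chunks until the first ',' and concatenates them.
val untilComma: Iteratee[String, String] =
  Traversable.splitOnceAt[String, Char](_ != ',') transform Iteratee.consume[String]()

val firstChunk = Await.result(Enumerator("abc", "def,ghi").run(untilComma), 200.millis)
// firstChunk == "abcdef"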
Question summary: tokenization with the Stanford parser is slow on my local machine, but unreasonably faster on Spark. Why?
I'm using the Stanford CoreNLP tool to tokenize sentences.
My script in Scala is like this:
import java.util.Properties
import scala.collection.JavaConversions._
import scala.collection.immutable.ListMap
import scala.io.Source
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val properties = new Properties()
val coreNLP = new StanfordCoreNLP(properties)
def tokenize(s: String) = {
properties.setProperty("annotators", "tokenize")
val annotation = new Annotation(s)
coreNLP.annotate(annotation)
annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}
tokenize("Here is my sentence.")
One call of the tokenize function takes roughly (at least) 0.1 sec.
This is very slow because I have 3 million sentences.
(3M × 0.1 sec = 300K sec = 5,000 min ≈ 83 hours)
As an alternative approach, I have applied the tokenizer on Spark (with four worker machines).
import java.util.List
import java.util.Properties
import scala.collection.JavaConversions._
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val file = sc.textFile("hdfs:///myfiles")
def tokenize(s: String) = {
val properties = new Properties()
properties.setProperty("annotators", "tokenize")
val coreNLP = new StanfordCoreNLP(properties)
val annotation = new Annotation(s)
coreNLP.annotate(annotation)
annotation.get(classOf[TokensAnnotation]).map(_.toString)
}
def normalizeToken(t: String) = {
val ts = t.toLowerCase
val num = "[0-9]+[,0-9]*".r
ts match {
case num() => "NUMBER"
case _ => ts
}
}
val tokens = file.map(tokenize(_))
val tokenList = tokens.flatMap(_.map(normalizeToken))
val wordCount = tokenList.map((_,1)).reduceByKey(_ + _).sortBy(_._2, false)
wordCount.saveAsTextFile("wordcount")
This script finishes tokenization and word count of the 3 million sentences in just 5 minutes!
And the results seem reasonable.
Why is this so fast? Or rather, why is the first Scala script so slow?
The problem with your first approach is that you set the annotators property after you initialize the StanfordCoreNLP object. Therefore CoreNLP is initialized with the default list of annotators, which includes the part-of-speech tagger and the parser; these are orders of magnitude slower than the tokenizer.
To fix this, simply move the line
properties.setProperty("annotators", "tokenize")
before the line
val coreNLP = new StanfordCoreNLP(properties)
This should even be slightly faster than your second approach, as you don't have to reinitialize CoreNLP for each sentence.
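Putting it together, the first script with that single change applied would look roughly like this:
import java.util.Properties
import scala.collection.JavaConversions._
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Set the annotator list *before* constructing the pipeline, so only the
// tokenizer is loaded instead of the default (and much slower) annotators.
val properties = new Properties()
properties.setProperty("annotators", "tokenize")
val coreNLP = new StanfordCoreNLP(properties)

def tokenize(s: String) = {
  val annotation = new Annotation(s)
  coreNLP.annotate(annotation)
  annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}

tokenize("Here is my sentence.")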
But their documentation seems to assume I'm already familiar with Scala, Akka and Spray itself. I mean, I couldn't find out how to do this simple, basic thing, which I would love to see as a single snippet of code on their home page...
The only thing I could find is how to build a request with their spray-httpx:
import spray.httpx.RequestBuilder._
val req = Get("http://url")
The object doesn't have an operation to send itself anywhere, so I'm sure I'm supposed to use Akka things to do it, but their documentation doesn't show the process. Please tell me how to do it. If spray-can can do the same thing (I know it can), I would prefer that way.
There is an example here: http://spray.io/documentation/1.1-SNAPSHOT/spray-client/
import akka.actor.ActorSystem
import scala.concurrent.Future
import spray.http._
import spray.client.pipelining._

implicit val system = ActorSystem()
import system.dispatcher // execution context for futures

val pipeline: HttpRequest => Future[HttpResponse] = sendReceive
val response: Future[HttpResponse] = pipeline(Get("http://spray.io/"))
and even simpler example here: https://github.com/spray/spray/wiki/spray-client
val conduit = new HttpConduit("github.com")
val responseFuture = conduit.sendReceive(HttpRequest(GET, uri = "/"))
In both cases you have to process the result like you normally process a Future, e.g.:
for {response <- responseFuture} yield { someFunction(response) }
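For example, with the pipeline snippet above, you could print the status and body once the Future completes:
// Runs when the Future completes successfully; prints status and body.
response.foreach { r =>
  println(s"${r.status} -> ${r.entity.asString}")
}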