I am trying to use Monix Observable to read a large file in smaller chunks of bytes, so that loading the file's bytes doesn't use up too much RAM. However, when I use Observable.fromInputStream, it doesn't seem to feed the Array[Byte] chunks into MessageDigest's update() function. Any suggestions on my code?
def SHA256_5(file: File) = {
  val sha256 = MessageDigest.getInstance("SHA-256")
  val in: Observable[Array[Byte]] =
    Observable.fromInputStream(Task(new FileInputStream(file)))
  in.map(byteArray => sha256.update(byteArray)).completed
  sha256.digest().map("%02x".format(_)).mkString
}
def main(args: Array[String]): Unit = {
  val path = "C:\\Users\\ME\\IdeaProjects\\HELLO\\src\\main\\scala\\TRY.scala"
  println(SHA256_5(new File(path)))
}
in.map(byteArray => sha256.update(byteArray)).completed
returns a Task, which means you have to execute that Task, and only when it finishes can you call
sha256.digest().map("%02x".format(_)).mkString
because Task lazily describes an asynchronous operation; building it does not run it.
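To illustrate the laziness (a minimal sketch of my own, using Monix's standard API):

// Constructing a Task performs no work; only running it triggers the effect.
import monix.eval.Task
import monix.execution.Scheduler.Implicits.global

val task = Task { println("side effect!") } // nothing is printed here
task.runToFuture                            // "side effect!" is printed now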
Try this instead:
def calculateSHA(file: File) = for {
  sha256 <- Task(MessageDigest.getInstance("SHA-256"))
  in      = Observable.fromInputStream(Task(new FileInputStream(file)))
  _      <- in.map(byteArray => sha256.update(byteArray)).completed
} yield sha256.digest().map("%02x".format(_)).mkString
def main(args: Array[String]): Unit = {
  val path = "C:\\Users\\ME\\IdeaProjects\\HELLO\\src\\main\\scala\\TRY.scala"
  import monix.execution.Scheduler.Implicits.global
  import scala.concurrent.Await
  import scala.concurrent.duration.Duration
  Await.result(calculateSHA(new File(path)).runToFuture, Duration.Inf)
}
That works for starters; or, if you want to use the built-in Monix TaskApp instead of hacks for running an asynchronous computation in a synchronous main:
object Test extends TaskApp {

  def calculateSHA(file: File) = for {
    sha256 <- Task(MessageDigest.getInstance("SHA-256"))
    in      = Observable.fromInputStream(Task(new FileInputStream(file)))
    _      <- in.map(byteArray => sha256.update(byteArray)).completed
  } yield sha256.digest().map("%02x".format(_)).mkString

  def run(args: List[String]) = {
    val path = "C:\\Users\\ME\\IdeaProjects\\HELLO\\src\\main\\scala\\TRY.scala"
    for {
      sha <- calculateSHA(new File(path))
      _    = println(sha)
    } yield ExitCode.Success
  }
}
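As a footnote (my own sketch, assuming Monix's Observable#foreachL is available in the version used), the map/completed pair in calculateSHA can be collapsed into a single Task:

def calculateSHA(file: File): Task[String] = for {
  sha256 <- Task(MessageDigest.getInstance("SHA-256"))
  in      = Observable.fromInputStream(Task(new FileInputStream(file)))
  _      <- in.foreachL(sha256.update) // runs update for every chunk, completes when the stream ends
} yield sha256.digest().map("%02x".format(_)).mkString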
First, let me describe what I want to do. I have an API that takes a function as an argument (it looks like this: dataFromApi => { /* do sth */ }), and I would like to process this data with Flink. I wrote this code to simulate the API:
val myIterator = new TestIterator
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

val th1 = new Thread {
  override def run(): Unit = {
    for (i <- 0 to 10) {
      Thread sleep 1000
      myIterator.addToQueue("test" + i)
    }
  }
}
th1.start()

val texts: DataStream[String] = env
  .fromCollection(new TestIterator)
texts.print()
This is my iterator:
class TestIterator extends Iterator[String] with Serializable {
  private val q: BlockingQueue[String] = new LinkedBlockingQueue[String]

  def addToQueue(s: String): Unit = {
    println("Put")
    q.put(s)
  }

  override def hasNext: Boolean = true

  override def next(): String = {
    println("Wait for queue")
    q.take()
  }
}
My idea was to execute myIterator.addToQueue(dataFromApi) whenever I receive data, but this code doesn't work. Despite adding to the queue, execution blocks on q.take(). I tried to write my own SourceFunction based on the queue idea, and I also tried this: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/ but I couldn't achieve what I want.
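One possible direction (a hedged sketch of the queue-backed SourceFunction idea, with hypothetical names, not a tested answer; note also that the snippet above passes new TestIterator to fromCollection while the thread writes to myIterator, a different instance):

import java.util.concurrent.{BlockingQueue, LinkedBlockingQueue}
import org.apache.flink.streaming.api.functions.source.SourceFunction

// The queue lives in a companion object so that both the API callback and the
// running source task can reach it; this only works when the job runs in the
// same JVM as the API (e.g. a local environment), since Flink serializes the
// source instance itself.
object QueueSource {
  val queue: BlockingQueue[String] = new LinkedBlockingQueue[String]
}

class QueueSource extends SourceFunction[String] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[String]): Unit =
    while (running) {
      val next = QueueSource.queue.take() // blocks until the API pushes data
      ctx.getCheckpointLock.synchronized {
        ctx.collect(next)
      }
    }

  override def cancel(): Unit = running = false
}

// Usage: the API callback calls QueueSource.queue.put(dataFromApi),
// and the job consumes env.addSource(new QueueSource).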
I am trying to process some files asynchronously, with the ability to choose the number of threads the program should use. But I want to wait until processFiles() has finished processing all the files, so I am looking for a way to keep the function from returning until all the Futures have completed. Any ideas on how to approach this would be very helpful. Here is my sample code.
object FileReader {

  def processFiles(files: Array[File]) = {
    val execService = Executors.newFixedThreadPool(5)
    implicit val execContext = ExecutionContext.fromExecutorService(execService)

    val processed = files.map { f =>
      Future {
        val name = f.getAbsolutePath()
        val fp = Source.fromFile(name)
        var data = ""
        fp.getLines().foreach(x => {
          data = data ++ s"$x\n"
        })
        fp.close()
        // process the data.
        println("Processing ....")
        data
      }
    }
    execContext.shutdown()
  }

  def main(args: Array[String]): Unit = {
    println("Start")
    val tmp = new File("/path/to/files")
    val files = tmp.listFiles()
    val result = processFiles(files)
    println("done processing")
    println("done work")
  }
}
If my usage of Future here is wrong, please correct me.
My expected output:
Start
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
done processing
done work
My current output:
Start
done processing
done work
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
Processing ....
You need to use Future.traverse to combine all the Futures for the individual file-processing steps, and then Await.result on the combined Future afterwards:
import java.io.File
import java.util.concurrent.Executors
import scala.io.Source
import scala.concurrent._
import scala.concurrent.duration._
import scala.language.postfixOps
object FileReader {

  def processFiles(files: Array[File]) = {
    val execService = Executors.newFixedThreadPool(5)
    implicit val execContext = ExecutionContext.fromExecutorService(execService)

    // Turn `List[Future[String]]` into `Future[List[String]]`
    val processed = Future.traverse(files.toList) { f =>
      Future {
        val name = f.getAbsolutePath()
        val fp = Source.fromFile(name)
        var data = ""
        fp.getLines().foreach(x => {
          data = data ++ s"$x\n"
        })
        fp.close()
        // process the data.
        println("Processing ....")
        data
      }
    }

    // TODO: Put a proper timeout here.
    // Execution will block until all futures have completed.
    Await.result(processed, 30 minutes)
    execContext.shutdown()
  }

  def main(args: Array[String]): Unit = {
    println("Start")
    val tmp = new File("/path/to/file")
    val files = tmp.listFiles()
    val result = processFiles(files)
    println("done processing")
    println("done work")
  }
}
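As an aside (my own note, not part of the answer): Future.traverse(xs)(f) is equivalent to Future.sequence(xs.map(f)), so the original Array-of-Futures version could also be kept and flipped afterwards:

// A sketch assuming a hypothetical helper `processOne: File => Future[String]`
// that wraps the body of the Future above.
val futures: List[Future[String]] = files.toList.map(processOne)
val processed: Future[List[String]] = Future.sequence(futures)
Await.result(processed, 30.minutes)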
Thanks to @Ivan Kurchenko. The solution worked. I am posting the final version of the code that worked for me.
object FileReader {

  def processFiles(files: Seq[File]) = {
    val execService = Executors.newFixedThreadPool(5)
    implicit val execContext = ExecutionContext.fromExecutorService(execService)

    // Turn `Seq[Future[File]]` into `Future[Seq[File]]`
    val processed = Future.traverse(files) { f =>
      Future {
        val name = f.getAbsolutePath()
        val fp = Source.fromFile(name)
        var data = ""
        fp.getLines().foreach(x => {
          data = data ++ s"$x\n"
        })
        fp.close()
        // process the data.
        println("Processing ....")
        f
      }
    }

    // TODO: Put a proper timeout here.
    // Execution will block until all futures have completed.
    Await.result(processed, 30.minutes)
    execContext.shutdown()
  }

  def main(args: Array[String]): Unit = {
    println("Start")
    val tmp = new File("/path/to/file")
    val files = tmp.listFiles.toSeq
    val result = processFiles(files)
    println("done processing")
    println("done work")
  }
}
I am parsing 50,000 records that contain titles and URLs from a web page. While parsing, I write them to a PostgreSQL database. I deployed my application with docker-compose, but it keeps stopping at some page for no apparent reason. I added some logging to figure out what's happening, but I don't see a connection error or anything like that.
Here is my code for parsing and writing to the database:
object App {

  val db = Database.forURL("jdbc:postgresql://db:5432/toloka?user=user&password=password")
  val browser = JsoupBrowser()
  val catRepo = new CategoryRepo(db)
  val torrentRepo = new TorrentRepo(db)
  val torrentForParseRepo = new TorrentForParseRepo(db)
  val parallelismFactor = 10
  val groupFactor = 10

  implicit val system = ActorSystem("TolokaParser")
  implicit val materializer = ActorMaterializer()
  implicit val executionContext = system.dispatcher

  def parseAndWriteTorrentsForParseToDb(doc: App.browser.DocumentType) = {
    Source(getRecordsLists(doc))
      .grouped(groupFactor)
      .mapAsync(parallelismFactor) { torrentForParse: Seq[TorrentForParse] =>
        torrentForParseRepo.createInBatch(torrentForParse)
      }
      .runWith(Sink.ignore)
  }

  def getRecordsLists(doc: App.browser.DocumentType) = {
    val pages = generatePagesFromHomePage(doc)
    println("torrent links generated")
    println(pages.size)
    val result = for {
      page <- pages
    } yield {
      println(s"Parsing torrent list...$page")
      val tmp = getTitlesAndLinksTuple(getTitlesList(browser.get(page)), getLinksList(browser.get(page)))
      println(tmp.size)
      tmp
    }
    println("torrent links and names tupled")
    result.flatten
  }
}
What may be the cause of such problems?
Add a supervision strategy so that an error does not terminate the stream. Something like:
val decider: Supervision.Decider = {
  case _ => Supervision.Resume
}

def parseAndWriteTorrentsForParseToDb = {
  Source.fromIterator(() => List(1, 2, 3).toIterator)
    .grouped(1)
    .mapAsync(1) { torrentForParse: Seq[Int] =>
      Future { 0 }
    }
    .withAttributes(ActorAttributes.supervisionStrategy(decider))
    .runWith(Sink.ignore)
}
With this configuration of the async stage, the stream should not stop.
My stream works for a smaller file of 1,000 lines, but it stops when I test it on a large file (~12MB, ~250,000 lines). I tried applying backpressure with a buffer and throttling the stream, and still the same thing...
Here is my data streamer:
class UserDataStreaming(usersFile: File) {

  implicit val system = ActorSystemContainer.getInstance().getSystem
  implicit val materializer = ActorSystemContainer.getInstance().getMaterializer

  def startStreaming() = {
    val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
      val usersSource = builder.add(Source.fromIterator(() => usersDataLines)).out
      val stringToUserFlowShape: FlowShape[String, User] = builder.add(csvToUser)
      val averageAgeFlowShape: FlowShape[User, (String, Int, Int)] = builder.add(averageUserAgeFlow)
      val averageAgeSink = builder.add(Sink.foreach(averageUserAgeSink)).in

      usersSource ~> stringToUserFlowShape ~> averageAgeFlowShape ~> averageAgeSink

      ClosedShape
    })
    graph.run()
  }

  val usersDataLines = scala.io.Source.fromFile(usersFile, "ISO-8859-1").getLines().drop(1)

  val csvToUser = Flow[String].map(_.split(";").map(_.trim)).map(csvLinesArrayToUser)

  def csvLinesArrayToUser(line: Array[String]) = User(line(0), line(1), line(2))

  def averageUserAgeSink[usersSource](source: usersSource) {
    source match {
      case (age: String, count: Int, totalAge: Int) =>
        println(s"age = $age; Average reader age is: ${Try(totalAge / count).getOrElse(0)} count = $count and total age = $totalAge")
      case bad => println(s"Bad case: $bad")
    }
  }

  def averageUserAgeFlow = Flow[User].fold(("", 0, 0)) { (nums: (String, Int, Int), user: User) =>
    var counter: Option[Int] = None
    var totalAge: Option[Int] = None
    val ageInt = Try(user.age.substring(1, user.age.length - 1).toInt)
    if (ageInt.isSuccess) {
      counter = Some(nums._2 + 1)
      totalAge = Some(nums._3 + ageInt.get)
    } else {
      counter = Some(nums._2 + 0)
      totalAge = Some(nums._3 + 0)
    }
    //println(counter.get)
    (user.age, counter.get, totalAge.get)
  }
}
Here is my Main:
object Main {

  def main(args: Array[String]): Unit = {
    implicit val system = ActorSystemContainer.getInstance().getSystem
    implicit val materializer = ActorSystemContainer.getInstance().getMaterializer

    val usersFile = new File("data/BX-Users.csv")
    println(usersFile.length())
    val userDataStreamer = new UserDataStreaming(usersFile)
    userDataStreamer.startStreaming()
  }
}
It's possible that there is an error related to one row of your csv file. In that case, the stage fails and the stream stops. Try defining your flow like this:
Flow[String].map { user =>
  csvToUser(user)
}.withAttributes(ActorAttributes.supervisionStrategy {
  case ex: Throwable =>
    log.error("Error parsing row event: {}", ex)
    Supervision.Resume
})
In this case, the exception is captured and the stream ignores the error and continues. If you use Supervision.Stop instead, the stream stops.
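Alternatively (a sketch of my own, not part of the answer above, assuming an actor system and a logger are in scope and the classic pre-2.6 Akka Streams API used in the code above), the decider can be installed once at the materializer level, so every stage materialized with it resumes on failure:

// A default supervision strategy for the whole materializer.
val decider: Supervision.Decider = {
  case ex: Throwable =>
    log.error("Error in stream: {}", ex)
    Supervision.Resume
}

implicit val materializer = ActorMaterializer(
  ActorMaterializerSettings(system).withSupervisionStrategy(decider)
)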
To embed Scala as a "scripting language", I need to be able to compile text fragments to simple objects, such as a Function0[Unit], that can be serialised to and deserialised from disk, and that can be loaded into the current runtime and executed.
How would I go about this?
Say for example, my text fragment is (purely hypothetical):
Document.current.elements.headOption.foreach(_.open())
This might be wrapped into the following complete text:
package myapp.userscripts
import myapp.DSL._
object UserFunction1234 extends Function0[Unit] {
  def apply(): Unit = {
    Document.current.elements.headOption.foreach(_.open())
  }
}
What comes next? Should I use IMain to compile this code? I don't want to use the normal interpreter mode, because the compilation should be "context-free" and not accumulate requests.
What I need to get hold of from the compilation is, I guess, the binary class file? In that case, serialisation is straightforward (a byte array). How would I then load that class into the runtime and invoke the apply method?
What happens if the code compiles to multiple auxiliary classes? The example above contains a closure, _.open(). How do I make sure I "package" all those auxiliary things into one object to serialise and class-load?
Note: Given that Scala 2.11 is imminent and the compiler API has probably changed, I am happy to receive hints on how to approach this problem on Scala 2.11.
Here is one idea: use a regular Scala compiler instance. Unfortunately it seems to require the use of hard disk files both for input and output. So we use temporary files for that. The output will be zipped up in a JAR which will be stored as a byte array (that would go into the hypothetical serialization process). We need a special class loader to retrieve the class again from the extracted JAR.
The following assumes Scala 2.10.3 with the scala-compiler library on the class path:
import scala.tools.nsc
import java.io._
import scala.annotation.tailrec
Wrap the user-provided code in a function class with a synthetic name that is incremented for each new fragment:
val packageName = "myapp"

var userCount = 0

def mkFunName(): String = {
  val c = userCount
  userCount += 1
  s"Fun$c"
}

def wrapSource(source: String): (String, String) = {
  val fun = mkFunName()
  val code = s"""package $packageName
                |
                |class $fun extends Function0[Unit] {
                |  def apply(): Unit = {
                |    $source
                |  }
                |}
                |""".stripMargin
  (fun, code)
}
A function to compile a source fragment and return the byte array of the resulting jar:
/** Compiles source code consisting of a body which is wrapped in a `Function0`
  * apply method, and returns the function's class name (without package) and the
  * raw jar file produced in the compilation.
  */
def compile(source: String): (String, Array[Byte]) = {
  val set = new nsc.Settings
  val d   = File.createTempFile("temp", ".out")
  d.delete(); d.mkdir()
  set.d.value = d.getPath
  set.usejavacp.value = true
  val compiler = new nsc.Global(set)
  val f   = File.createTempFile("temp", ".scala")
  val out = new BufferedOutputStream(new FileOutputStream(f))
  val (fun, code) = wrapSource(source)
  out.write(code.getBytes("UTF-8"))
  out.flush(); out.close()
  val run = new compiler.Run()
  run.compile(List(f.getPath))
  f.delete()
  val bytes = packJar(d)
  deleteDir(d)
  (fun, bytes)
}

def deleteDir(base: File): Unit = {
  base.listFiles().foreach { f =>
    if (f.isFile) f.delete()
    else deleteDir(f)
  }
  base.delete()
}
Note: Doesn't handle compiler errors yet!
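One way to surface them (a sketch of my own, assuming the 2.10 reporter API): pass a StoreReporter to Global and inspect it after the run.

import scala.tools.nsc.reporters.StoreReporter

// Inside `compile`, replace `new nsc.Global(set)` with a reporter-aware instance:
val reporter = new StoreReporter
val compiler = new nsc.Global(set, reporter)
val run      = new compiler.Run()
run.compile(List(f.getPath))
if (reporter.hasErrors) {
  val msgs = reporter.infos.collect {
    case info if info.severity == reporter.ERROR => s"${info.pos}: ${info.msg}"
  }
  sys.error(s"Compilation failed:\n${msgs.mkString("\n")}")
}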
The packJar method uses the compiler output directory and produces an in-memory jar file from it:
// cf. http://stackoverflow.com/questions/1281229
def packJar(base: File): Array[Byte] = {
  import java.util.jar._

  val mf = new Manifest
  mf.getMainAttributes.put(Attributes.Name.MANIFEST_VERSION, "1.0")
  val bs  = new java.io.ByteArrayOutputStream
  val out = new JarOutputStream(bs, mf)

  def add(prefix: String, f: File): Unit = {
    val name0 = prefix + f.getName
    val name  = if (f.isDirectory) name0 + "/" else name0
    val entry = new JarEntry(name)
    entry.setTime(f.lastModified())
    out.putNextEntry(entry)
    if (f.isFile) {
      val in = new BufferedInputStream(new FileInputStream(f))
      try {
        val buf = new Array[Byte](1024)
        @tailrec def loop(): Unit = {
          val count = in.read(buf)
          if (count >= 0) {
            out.write(buf, 0, count)
            loop()
          }
        }
        loop()
      } finally {
        in.close()
      }
    }
    out.closeEntry()
    if (f.isDirectory) f.listFiles.foreach(add(name, _))
  }

  base.listFiles().foreach(add("", _))
  out.close()
  bs.toByteArray
}
A utility function that takes the byte array obtained from deserialisation and creates a map from class names to class byte code:
def unpackJar(bytes: Array[Byte]): Map[String, Array[Byte]] = {
  import java.util.jar._
  import scala.annotation.tailrec

  val in = new JarInputStream(new ByteArrayInputStream(bytes))
  val b  = Map.newBuilder[String, Array[Byte]]

  @tailrec def loop(): Unit = {
    val entry = in.getNextJarEntry
    if (entry != null) {
      if (!entry.isDirectory) {
        val name = entry.getName
        // cf. http://stackoverflow.com/questions/8909743
        val bs = new ByteArrayOutputStream
        var i = 0
        while (i >= 0) {
          i = in.read()
          if (i >= 0) bs.write(i)
        }
        val bytes = bs.toByteArray
        b += mkClassName(name) -> bytes
      }
      loop()
    }
  }
  loop()
  in.close()
  b.result()
}

def mkClassName(path: String): String = {
  require(path.endsWith(".class"))
  path.substring(0, path.length - 6).replace("/", ".")
}
A suitable class loader:
class MemoryClassLoader(map: Map[String, Array[Byte]]) extends ClassLoader {
  override protected def findClass(name: String): Class[_] =
    map.get(name).map { bytes =>
      println(s"defineClass($name, ...)")
      defineClass(name, bytes, 0, bytes.length)
    } .getOrElse(super.findClass(name)) // throws ClassNotFoundException
}
And a test case which contains additional classes (closures):
val exampleSource =
  """val xs = List("hello", "world")
    |println(xs.map(_.capitalize).mkString(" "))
    |""".stripMargin

def test(fun: String, cl: ClassLoader): Unit = {
  val clName = s"$packageName.$fun"
  println(s"Resolving class '$clName'...")
  val clazz = Class.forName(clName, true, cl)
  println("Instantiating...")
  val x = clazz.newInstance().asInstanceOf[() => Unit]
  println("Invoking 'apply':")
  x()
}

locally {
  println("Compiling...")
  val (fun, bytes) = compile(exampleSource)
  val map = unpackJar(bytes)
  println("Classes found:")
  map.keys.foreach(k => println(s"  '$k'"))
  val cl = new MemoryClassLoader(map)
  test(fun, cl) // should call `defineClass`
  test(fun, cl) // should find the cached class
}