I'm very new to FS2 and need some help about the desing. I'm trying to design a stream which will pull the chunks from the underlying InputStream till it's over. Here is what I tried:
import java.io.{File, FileInputStream, InputStream}
import cats.effect.IO
import cats.effect.IO._
object Fs2 {
def main(args: Array[String]): Unit = {
val is = new FileInputStream(new File("/tmp/my-file.mf"))
val stream = fs2.Stream.eval(read(is))
stream.compile.drain.unsafeRunSync()
}
def read(is: InputStream): IO[Array[Byte]] = IO {
val buf = new Array[Byte](4096)
is.read(buf)
println(new String(buf))
buf
}
}
And the program prints the only first chunk. This is reasonable. But I want to find a way to "signal" where to stop reading and where to not stop. I mean keep calling read(is) till its end. Is there a way to achieve that?
I also tried repeatEval(read(is)) but it keeps reading forever... I need something in between.
Use fs2.io.readInputStream or fs2.io.readInputStreamAsync. The former blocks the current thread; the latter blocks a thread in the ExecutionContext. For example:
val is: InputStream = new FileInputStream(new File("/tmp/my-file.mf"))
val stream = fs2.io.readInputStreamAsync(IO(is), 128)
Related
i'm new to Scala, i have a method, that reads data from the given list of files and does api calls with
the data, and writes the response to a file.
listOfFiles.map { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
the above method, is taking too much time, during the network call, so tried to do using parallel
processing to reduce the time.
so i tried wrapping the block of code, which consumes more time, but the program ends quickly
and its not generating any output, as the above code.
import scala.concurrent.ExecutionContext.Implicits.global
listOfFiles.map { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
Future {
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
}
it would be helpful, if you have any suggestions.
(I also tried using "par", it works fine,
I'm exploring other options other than 'par' and using frameworks like 'akka', 'cats' etc)
Based on Jatin instead of using default execution context which contains deamon threads
import scala.concurrent.ExecutionContext.Implicits.global
define execution context with non-deamon threads
implicit val nonDeamonEc = ExecutionContext.fromExecutor(Executors.newCachedThreadPool)
Also you can use Future.traverse and Await like so
val resultF = Future.traverse(listOfFiles) { file =>
val bufferedSource = Source.fromFile(file)
val data = bufferedSource.mkString
bufferedSource.close()
Future {
val response = doApiCall(data) // time consuming task
if (response.nonEmpty) writeFile(response, outputLocation)
}
}
Await.result(resultF, Duration.Inf)
traverse converts List[Future[A]] to Future[List[A]].
I'd like to convert fs2.Stream to java.io.InputStream so I can pass that input stream to an http framework (Finch and Akka Http).
I found a fs2.io.toInputStream, but this doesn't work (it prints nothing):
import java.io.{ByteArrayInputStream, InputStream}
import cats.effect.IO
import scala.concurrent.ExecutionContext.Implicits.global
object IOTest {
def main(args: Array[String]): Unit = {
val is: InputStream = new ByteArrayInputStream("test".getBytes)
val stream: fs2.Stream[IO, Byte] = fs2.io.readInputStream(IO(is), 128)
val test: Seq[InputStream] = stream.through(fs2.io.toInputStream).compile.toList.unsafeRunSync()
println(scala.io.Source.fromInputStream(test.head).mkString)
}
}
As far as I understand when I run .unsafeRunSync() it's consuming the whole stream, so even though it returns a Seq[InputStream] the under-laying input stream is already consumed.
Is there any way I can convert fs2.Stream[IO, Byte] to java.io.InputStream without it being consumed?
Thnaks!
The problem is that compile is being invoked prematurely. I'm sure that under the hood fs2.io.toInputStream does the correct thing and brackets the created InputStream. Which means that the InputStream must be accessed inside the Stream itself (e.g., in a map/flatMap call):
val wire: fs2.Stream[IO, Byte] = ???
val result: fs2.Stream[IO, String] = for {
is <- wire.through(fs2.io.toInputStream)
str = scala.io.Source.fromInputStream(is).mkString //<--- use the InputStream here
} yield str
println( result.compile.lastOrError.unsafeRunSync() ) //<--- compile at the _very_ end
Outputs:
test
It looks that Finch has fs2 support https://github.com/finagle/finch/tree/master/fs2 and Akka also has it's stream implementation and there are fs2 - Akka Stream interop libraries like https://github.com/krasserm/streamz/tree/master/streamz-converter
So i recommend you to take a look to the implementations because they take care of the resources life cycle. Probably you don't need the whole library but it serves as guideline.
And if you are starting at the "safe zone" with fs2, why moving out of there :)
I have this code below:
import java.util.Properties
import com.google.gson._
import com.typesafe.config.ConfigFactory
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.cep.scala.CEP
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
object WindowedWordCount {
val configFactory = ConfigFactory.load()
def main(args: Array[String]) = {
val brokers = configFactory.getString("kafka.broker")
val topicChannel1 = configFactory.getString("kafka.topic1")
val props = new Properties()
...
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val dataStream = env.addSource(new FlinkKafkaConsumer010[String](topicChannel1, new SimpleStringSchema(), props))
val partitionedInput = dataStream.keyBy(jsonString => {
val jsonParser = new JsonParser()
val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
jsonObject.get("account")
})
val priceCheck = Pattern.begin[String]("start").where({jsonString =>
val jsonParser = new JsonParser()
val jsonObject = jsonParser.parse(jsonString).getAsJsonObject()
jsonObject.get("account").toString == "iOS"})
val pattern = CEP.pattern(partitionedInput, priceCheck)
val newStream = pattern.select(x =>
x.get("start").map({str =>
str
})
)
newStream.print()
env.execute()
}
}
For some reason in the above code at the newStream.print() nothing is being printed out. I am positive that there is data in Kafka that matches the pattern that I defined above but for some reason nothing is being printed out.
Can anyone please help me spot an error in this code?
EDIT:
class TimestampExtractor extends AssignerWithPeriodicWatermarks[String] with Serializable {
override def extractTimestamp(e: String, prevElementTimestamp: Long) = {
val jsonParser = new JsonParser()
val context = jsonParser.parse(e).getAsJsonObject.getAsJsonObject("context")
Instant.parse(context.get("serverTimestamp").toString.replaceAll("\"", "")).toEpochMilli
}
override def getCurrentWatermark(): Watermark = {
new Watermark(System.currentTimeMillis())
}
}
I saw on the flink doc that you can just return prevElementTimestamp in the extractTimestamp method (if you are using Kafka010) and new Watermark(System.currentTimeMillis) in the getCurrentWatermark method.
But I don't understand what prevElementTimestamp is or why one would return new Watermark(System.currentTimeMillis) as the WaterMark and not something else. Can you please elaborate on why we do this on how WaterMark and EventTime work together please?
You do setup your job to work in EventTime, but you do not provide timestamp and watermark extractor.
For more on working in event time see those docs. If you want to use the kafka embedded timestamps this docs may help you.
In EventTime the CEP library buffers events upon watermark arrival so to properly handle out-of-order events. In your case there are no watermarks generated, so the events are buffered infinitly.
Edit:
For the prevElementTimestamp I think the docs are pretty clear:
There is no need to define a timestamp extractor when using the timestamps from Kafka. The previousElementTimestamp argument of the extractTimestamp() method contains the timestamp carried by the Kafka message.
Since Kafka 0.10.x Kafka messages can have embedded timestamp.
Generating Watermark as new Watermark(System.currentTimeMillis) in this case is not a good idea. You should create Watermark based on your knowledge of the data. For information on how Watermark and EventTime work together I could not be more clear than the docs
I'm trying to make a very simple scala socket program that will "echo" out any input it recieves from multiple clients
This program does work but only for a single client. I think this is because execution is always in while(true) loop
import java.net._
import java.io._
import scala.io._
//println(util.Properties.versionString)
val server = new ServerSocket(9999)
println("initialized server")
val client = server.accept
while(true){
val in = new BufferedReader(new InputStreamReader(client.getInputStream)).readLine
val out = new PrintStream(client.getOutputStream)
println("Server received:" + in) // print out the input message
out.println("Message received")
out.flush
}
I've tried
making this modification
while(true){
val client = server.accept
val in = new BufferedReader(new InputStreamReader(client.getInputStream)).readLine
val out = new PrintStream(client.getOutputStream)
println("Server received:" + in)
}
But this does'nt work beyond "echo"ing out a single message
I'd like multiple clients to connect to the socket and constantly receive the output of whatever they type in
Basically you should accept the connection and create a new Future for each client. Beware that the implicit global ExecutionContext might be limited, you might need to find a different one that better fits your use cases.
You can use Scala async if you need more complex tasks with futures, but I think this is probably fine.
Disclaimer, I have not tried this, but something similar might work (based on your code and the docs):
import scala.concurrent._
import ExecutionContext.Implicits.global
...
while(true){
val client = server.accept
Future {
val inReader = new BufferedReader(new InputStreamReader(client.getInputStream))
val out = new PrintStream(client.getOutputStream)
do {
val in = inReader.readLine
println("Server received:" + in)
} while (true/*or a better condition to close the connection */)
client.close
}
}
Here you can find an example for the scala language:
[http://www.scala-lang.org/old/node/55][1]
And this is also a good example from scala twitter school, that works with java libraries:
import java.net.{Socket, ServerSocket}
import java.util.concurrent.{Executors, ExecutorService}
import java.util.Date
class NetworkService(port: Int, poolSize: Int) extends Runnable {
val serverSocket = new ServerSocket(port)
def run() {
while (true) {
// This will block until a connection comes in.
val socket = serverSocket.accept()
(new Handler(socket)).run()
}
}
}
class Handler(socket: Socket) extends Runnable {
def message = (Thread.currentThread.getName() + "\n").getBytes
def run() {
socket.getOutputStream.write(message)
socket.getOutputStream.close()
}
}
(new NetworkService(2020, 2)).run
Say, in an action I have:
val linesEnu = {
val is = new java.io.FileInputStream(path)
val isr = new java.io.InputStreamReader(is, "UTF-8")
val br = new java.io.BufferedReader(isr)
import scala.collection.JavaConversions._
val rows: scala.collection.Iterator[String] = br.lines.iterator
Enumerator.enumerate(rows)
}
Ok.feed(linesEnu).as(HTML)
How to close readers/streams?
There is a onDoneEnumerating callback that functions like finally (will always be called whether or not the Enumerator fails). You can close the streams there.
val linesEnu = {
val is = new java.io.FileInputStream(path)
val isr = new java.io.InputStreamReader(is, "UTF-8")
val br = new java.io.BufferedReader(isr)
import scala.collection.JavaConversions._
val rows: scala.collection.Iterator[String] = br.lines.iterator
Enumerator.enumerate(rows).onDoneEnumerating {
is.close()
// ... Anything else you want to execute when the Enumerator finishes.
}
}
The IO tools provided by Enumerator give you this kind of resource management out of the box—e.g. if you create an enumerator with fromStream, the stream is guaranteed to get closed after running (even if you only read a single line, etc.).
So for example you could write the following:
import play.api.libs.iteratee._
val splitByNl = Enumeratee.grouped(
Traversable.splitOnceAt[Array[Byte], Byte](_ != '\n'.toByte) &>>
Iteratee.consume()
) compose Enumeratee.map(new String(_, "UTF-8"))
def fileLines(path: String): Enumerator[String] =
Enumerator.fromStream(new java.io.FileInputStream(path)).through(splitByNl)
It's a shame that the library doesn't provide a linesFromStream out of the box, but I personally would still prefer to use fromStream with hand-rolled splitting, etc. over using an iterator and providing my own resource management.