What are the key differences between run and runWith?
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}
object RunAndRunWith extends App {
  implicit val system: ActorSystem = ActorSystem("Run_RunWith")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Explicitly connect source and sink, keep the sink's materialized value, then run
  Source(1 to 10).toMat(Sink.foreach[Int](println))(Keep.right).run()

  // Shortcut: connect and run in one call, keeping the sink's materialized value
  Source(1 to 10).runWith(Sink.foreach[Int](println))
}
How do I know which one to use?
to(Sink) and toMat(Sink) terminate the source with the sink and produce a RunnableGraph, which you can execute with run(). This also gives you the chance to set stream attributes for the whole graph before running it, or to hand it to some other function or method that will run it (or do something else with it entirely).
This form also gives you some control over where the materialized value should come from, if you need that.
Since terminating and running a source with a sink, without any additional attributes, while keeping the materialized value of the sink, is so common, runWith(Sink) is a convenient shortcut for exactly that.
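To illustrate that last point about the materialized value, here is a minimal sketch (using the same example stream as above) of how toMat lets you choose which side's materialized value to keep:

import akka.{Done, NotUsed}
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}

import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("MatValues")
implicit val materializer: ActorMaterializer = ActorMaterializer()

// Keep.right keeps the sink's materialized value: a Future[Done] that completes when the stream finishes
val done: Future[Done] =
  Source(1 to 10).toMat(Sink.foreach[Int](println))(Keep.right).run()

// Keep.both keeps both: NotUsed from the source and Future[Done] from the sink
val (notUsed: NotUsed, alsoDone: Future[Done]) =
  Source(1 to 10).toMat(Sink.foreach[Int](println))(Keep.both).run()

runWith(Sink) always behaves like Keep.right here: you get the sink's materialized value and nothing else.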
I'm a Scala newbie, using PySpark extensively (on Databricks, FWIW). I'm finding that Protobuf deserialization is too slow for me in Python, so I'm porting my deserialization UDF to Scala.
I've compiled my .proto files to Scala and then to a JAR using ScalaPB, as described here.
When I try to follow these instructions to create a UDF like this:
import gnmi.gnmi._
import org.apache.spark.sql.{Dataset, DataFrame, functions => F}
import spark.implicits.StringToColumn
import scalapb.spark.ProtoSQL
// import scalapb.spark.ProtoSQL.implicits._
import scalapb.spark.Implicits._
val deserialize_proto_udf = ProtoSQL.udf { bytes: Array[Byte] => SubscribeResponse.parseFrom(bytes) }
I get the following error:
command-4409173194576223:9: error: could not find implicit value for evidence parameter of type frameless.TypedEncoder[Array[Byte]]
val deserialize_proto_udf = ProtoSQL.udf { bytes: Array[Byte] => SubscribeResponse.parseFrom(bytes) }
I've double-checked that I'm importing the correct implicits, to no avail. I'm pretty fuzzy on implicits, evidence parameters, and Scala in general.
I would really appreciate it if someone would point me in the right direction. I don't even know how to start diagnosing!
Update
It seems like frameless doesn't include an implicit encoder for Array[Byte]?
This works:
frameless.TypedEncoder[Byte]
this does not:
frameless.TypedEncoder[Array[Byte]]
The code for frameless.TypedEncoder seems to include a generic Array encoder, but I'm not sure I'm reading it correctly.
@Dymtro, thanks for the suggestion. That helped.
Does anyone have ideas about what is going on here?
Update
OK, progress: this looks like a Databricks issue. I think that the notebook does something like the following on startup:
import spark.implicits._
I'm using ScalaPB, which requires that you don't do that.
I'm now hunting for a way to disable that automatic import, or to "unimport" or shadow those implicits after they get imported.
If spark.implicits._ has already been imported, then one way to "unimport" it (hide or shadow those implicits) is to create a duplicate object and import that too:
import org.apache.spark.sql.{SQLContext, SQLImplicits}

object implicitShadowing extends SQLImplicits with Serializable {
  // Never called; only needed to satisfy the abstract member of SQLImplicits
  protected override def _sqlContext: SQLContext = ???
}

import implicitShadowing._
Testing with case class Person(id: Long, name: String):
// no import
List(Person(1, "a")).toDS() // doesn't compile, value toDS is not a member of List[Person]
import spark.implicits._
List(Person(1, "a")).toDS() // compiles
import spark.implicits._
import implicitShadowing._
List(Person(1, "a")).toDS() // doesn't compile, value toDS is not a member of List[Person]
How to override an implicit value?
Wildcard Import, then Hide Particular Implicit?
How to override an implicit value, that is imported?
How can an implicit be unimported from the Scala repl?
Not able to hide Scala Class from Import
NullPointerException on implicit resolution
Constructing an overridable implicit
Caching the circe implicitly resolved Encoder/Decoder instances
Scala implicit def do not work if the def name is toString
Is there a workaround for this format parameter in Scala?
Please check whether this helps.
A possible problem is that you don't just want to unimport spark.implicits._ (scalapb.spark.Implicits._); you probably want to import scalapb.spark.ProtoSQL.implicits._ too. And I don't know whether implicitShadowing._ shadows some of those as well.
Another possible workaround is to resolve the implicits manually and use them explicitly.
I'm writing a UCI interpreter as an Akka finite state machine. As per the specification, the interpreter must write its output to stdout and take its input from stdin.
I have a test suite for the actor, and I can test some (message-related) aspects of its behaviour, but I don't know how to capture stdout to make assertions on it, nor how to send the actor its input through stdin. I've explored the ScalaTest API to the best of my abilities, but can't find how to achieve what I need.
This is the current test class:
package org.chess

import akka.actor.ActorSystem
import akka.testkit.{TestKit, TestProbe}
import org.chess.Keyword.Quit
import org.scalatest.wordspec.AnyWordSpecLike
import org.scalatest.{BeforeAndAfterAll, Matchers}

import scala.concurrent.duration._
import scala.language.postfixOps

class UCIInterpreterSpec(_system: ActorSystem)
  extends TestKit(_system)
    with Matchers
    with AnyWordSpecLike
    with BeforeAndAfterAll {

  def this() = this(ActorSystem("UCIInterpreterSpec"))

  override def afterAll(): Unit = {
    super.afterAll()
    shutdown(system)
  }

  "A UCI interpreter" should {
    "be able to quit" in {
      val testProbe = TestProbe()
      val interpreter = system.actorOf(UCIInterpreter.props)
      testProbe watch interpreter
      interpreter ! Command(Quit, Nil)
      testProbe.expectTerminated(interpreter, 3 seconds)
    }
  }
}
Of course, knowing that the interpreter can quit is useful... but not very useful. I need to test, for example, that sending the string isready to the interpreter makes it return readyok.
Is it possible that I'm overcomplicating the test by using akka.testkit instead of a simpler framework? I would like to keep using a single testing framework for simplicity, and I will need to test many other actor-related elements of the system, so if this could be solved without leaving the akka-testkit/ScalaTest domain, that would be fantastic.
Any help will be appreciated. Thanks in advance.
You need to change the design of your Actor.
The Actor should not read stdin or write stdout directly. Instead, give the actor objects in the Props that provide input and accept output. stdin could be something like () => String that is called each time input is needed. stdout could be String => Unit that is called each time output is generated. Or you could use Streams or similar constructs that are designed to be abstract sources and sinks of data.
In production code you pass objects that use stdin and stdout, but for test code you pass objects that read and write memory buffers. You can then check that the appropriate input is consumed by the Actor and that the appropriate output is generated by the Actor.
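Here is a minimal sketch of that idea, covering only the output side; SimpleInterpreter is a stand-in for the asker's UCIInterpreter, not the actual class:

import akka.actor.{Actor, ActorSystem, Props}
import akka.testkit.TestKit
import org.scalatest.wordspec.AnyWordSpecLike

import scala.collection.mutable

// A simplified interpreter that writes through an injected function
// instead of printing to stdout directly.
class SimpleInterpreter(out: String => Unit) extends Actor {
  def receive: Receive = {
    case "isready" => out("readyok")
  }
}

object SimpleInterpreter {
  def props(out: String => Unit): Props = Props(new SimpleInterpreter(out))
}

class SimpleInterpreterSpec extends TestKit(ActorSystem("SimpleInterpreterSpec"))
  with AnyWordSpecLike {

  "A simple interpreter" should {
    "answer isready with readyok" in {
      val output = mutable.Buffer.empty[String]
      val interpreter = system.actorOf(SimpleInterpreter.props(line => { output += line; () }))

      interpreter ! "isready"

      // The actor processes messages asynchronously, so poll until the output appears
      awaitAssert(assert(output.contains("readyok")))
    }
  }
}

In the real application you would construct the interpreter with line => println(line) (and, for the input side, something like () => scala.io.StdIn.readLine()). For a production-grade test you might prefer a thread-safe collection, or route the output to a TestProbe, instead of a plain Buffer.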
My Scala application kicks off an external process that writes a file to disk. In a separate thread, I want to read that file and copy its contents to an OutputStream until the process is done and the file is no longer growing.
There are a couple of edge cases to consider:
The file may not exist yet when the thread is ready to start.
The thread may copy faster than the process is writing. In other words, it may reach the end of the file while the file is still growing.
BTW I can pass the thread a processCompletionFuture variable which indicates when the file is done growing.
Is there an elegant and efficient way to do this? Perhaps using Akka Streams or actors? (I've tried using an Akka Stream off of the FileInputStream, but the stream seems to terminate as soon as there are no more bytes in the input stream, which happens in case #2).
Alpakka, a library that is built on Akka Streams, has a FileTailSource utility that mimics the tail -f Unix command. For example:
import akka.NotUsed
import akka.stream._
import akka.stream.scaladsl._
import akka.stream.alpakka.file.scaladsl._
import akka.util.{ ByteString, Timeout }
import java.io.OutputStream
import java.nio.file.Path
import scala.concurrent._
import scala.concurrent.duration._
val path: Path = ???
val maxLineSize = 10000
val tailSource: Source[ByteString, NotUsed] = FileTailSource(
  path = path,
  maxChunkSize = maxLineSize,
  startingPosition = 0,
  pollingInterval = 500.millis
).via(Framing.delimiter(ByteString(System.lineSeparator), maxLineSize, true))
The above tailSource reads an entire file line-by-line and continually reads freshly appended data every 500 milliseconds. To copy the stream contents to an OutputStream, connect the source to a StreamConverters.fromOutputStream sink:
val stream: Future[IOResult] =
  tailSource
    .runWith(StreamConverters.fromOutputStream(() => new OutputStream {
      override def write(i: Int): Unit = ???
      override def write(bytes: Array[Byte]): Unit = ???
    }))
(Note that there is a FileTailSource.lines method that produces a Source[String, NotUsed], but in this scenario it's more felicitous to work with ByteString instead of String. This is why the example uses FileTailSource.apply(), which produces a Source[ByteString, NotUsed].)
The stream will fail if the file doesn't exist at the time of materialization. Therefore, you'll need to confirm the existence of the file before running the stream. This might be overkill, but one idea is to use Alpakka's DirectoryChangesSource for that.
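As a rough sketch of that idea (the directory, file name, polling interval, and buffer size below are placeholders, not values from the question):

import akka.NotUsed
import akka.stream.alpakka.file.DirectoryChange
import akka.stream.alpakka.file.scaladsl.DirectoryChangesSource
import akka.stream.scaladsl.Source

import java.nio.file.{Files, Path, Paths}
import scala.concurrent.duration._

val dir: Path = Paths.get("/some/watched/dir")   // placeholder
val expected: Path = dir.resolve("output.dat")   // placeholder

// Emits the expected path once it exists: either it is already there,
// or the directory watcher reports a Creation event for it.
val fileExists: Source[Path, NotUsed] =
  if (Files.exists(expected)) Source.single(expected)
  else
    DirectoryChangesSource(dir, 1.second, 1000) // poll interval, max buffer size
      .collect { case (p, DirectoryChange.Creation) if p == expected => p }
      .take(1)

You could then flatMapConcat from fileExists into the tailSource shown above, so that tailing only starts once the file is actually present.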
I want to test the RMQSource class for receiving data from RabbitMQ, but I don't know how to configure the RabbitMQ virtual host for my exchange, and I think that is the problem I have. My code:
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.rabbitmq.RMQSource
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
object rabbitjob {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val stream = env.addSource(new RMQSource[String]("192.168.1.11", 5672, "user", "pass", "inbound.input.data", false, new SimpleStringSchema())).print

  def main(args: Array[String]) {
    env.execute("Test Rabbit")
  }
}
Error in IntelliJ IDE:
Error:(10, 29) could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[String]
val stream = env.addSource(new RMQSource[String]("192.168.1.11", 5672,"user","pass", "inbound.input.data",false, new SimpleStringSchema())).print
^
Error:(10, 29) not enough arguments for method addSource: (implicit evidence$7: org.apache.flink.api.common.typeinfo.TypeInformation[String])org.apache.flink.streaming.api.scala.DataStream[String].
Unspecified value parameter evidence$7.
val stream = env.addSource(new RMQSource[String]("192.168.1.11", 5672,"user","pass", "inbound.input.data",false, new SimpleStringSchema())).print
^
Any idea how to solve this, or any alternatives?
Thank you in advance.
Things have changed over time.
Please have a look at RMQConnectionConfig: there you can find how to specify a virtual host through the builder pattern.
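For example, something along these lines (host, credentials, and queue name taken from the question; the virtual host name is a placeholder):

import org.apache.flink.api.scala._   // provides the implicit TypeInformation[String] needed by addSource
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.rabbitmq.RMQSource
import org.apache.flink.streaming.connectors.rabbitmq.common.RMQConnectionConfig
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment

val connectionConfig = new RMQConnectionConfig.Builder()
  .setHost("192.168.1.11")
  .setPort(5672)
  .setUserName("user")
  .setPassword("pass")
  .setVirtualHost("TestVHost")   // placeholder: the virtual host your exchange lives in
  .build()

val stream = env
  .addSource(new RMQSource[String](connectionConfig, "inbound.input.data", false, new SimpleStringSchema()))
  .print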
The error you're seeing is a Scala compile-time error caused by a missing import. Whenever you use the Flink Scala API, you should include the following:
import org.apache.flink.api.scala._
This will solve the compile-time problem you're having.
You need to provide the vhost name as well. Take a look at the AMQP URI spec.
In your case the whole AMQP URI would look like "amqp://user:pass@192.168.1.11:5672/TestVHost".
I want to compare the read performance of different storage systems (e.g. HDFS, S3N) using Spark. I have written a small Scala program for this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val file = sc.textFile("s3n://test/wordtest")
    val splits = file.map(word => word)
    splits.saveAsTextFile("s3n://test/myoutput")
  }
}
My question is, is it possible to run a read-only test with Spark? For the program above, isn't saveAsTextFile() causing some write as well?
I am not sure that is possible at all. In order to run a transformation, a subsequent action is necessary.
From the official Spark documentation:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
Taking this into account, saveAsTextFile is not the lightest of the available actions. Several lightweight alternatives exist, such as count or first. These leave almost all of the work to the transformation phase, letting you measure the read performance of your solution.
You might want to check the available actions and choose the one that best fits your requirements.
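As a sketch, the write in the program above could be replaced with count (the app name and timing code below are illustrative; the paths are the ones from the question):

import org.apache.spark.{SparkConf, SparkContext}

object ReadOnlyTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReadOnlyTest")
    val sc = new SparkContext(conf)

    val file = sc.textFile("s3n://test/wordtest")

    // count forces the whole file to be read, but writes nothing back to storage
    val start = System.nanoTime()
    val lines = file.count()
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(s"Read $lines lines in $elapsedMs ms")
    sc.stop()
  }
}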
Yes."saveAsTextFile" writes the RDD data to text file using given path.