Apache Spark Receiver Scheduling - scala

I have implemented a receiver that is supposed to connect to a WebSocket stream and get messages for processing. Here is the implementation I have so far:
class WebSocketReader(wsConfig: WebSocketConfig,
                      stringMessageHandler: String => Option[String],
                      storageLevel: StorageLevel) extends Receiver[String](storageLevel) {

  // TODO: avoid using a var
  private var wsClient: WebSocketClient = _

  def sendRequest(isRequest: Boolean, msgCount: Int) = {
    while (isRequest) {
      wsClient.send(msgCount.toString)
      Thread.sleep(1000)
    }
  }

  // TODO: avoid using Synchronization...
  private def connect(): Unit = {
    Try {
      wsClient = createWsClient
    } match {
      case Success(_) =>
        wsClient.connect().map {
          case result if result.isSuccess => sendRequest(true, 10)
          case _ => connect()
        }
      case Failure(ex) =>
        // TODO: how to signal a failure so that it is tried the next time....
        ex.printStackTrace()
    }
  }

  def onStart(): Unit = {
    new Thread(getClass.getSimpleName) {
      override def run() { connect() }
    }.start()
  }

  override def onStop(): Unit =
    if (wsClient != null) wsClient.disconnect()

  private def createWsClient = {
    new DefaultHookupClient(new HookupClientConfig(new URI(wsConfig.wsUrl))) {
      override def receive: Receive = {
        case Disconnected(_) =>
          // TODO: use Logging framework, try reconnecting....
          println("the web socket is disconnected")
        case TextMessage(message) =>
          stringMessageHandler(message).foreach(store)
        case JsonMessage(jsValue) =>
          stringMessageHandler(jsValue.toString).foreach(store)
      }
    }
  }
}
How is this Receiver run? Does it run on the worker nodes or on the driver node? And is sleeping a thread like this the correct approach?
The reason I want to do this is that the server exposing the WebSocket endpoint needs a count of the messages that I want to receive. Say I ask the server for 100 messages; it would give me 100 messages, and so on. So I need a way to schedule this request to the server periodically. Currently I'm using the Thread.sleep mechanism. Is this advisable? What could be the alternative?
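For reference: Spark Streaming runs receivers on the executors (worker nodes), not on the driver, and onStart() must return quickly without blocking, which is why the code above spawns its own thread. Instead of a sleep loop, the periodic request could be handed to a ScheduledExecutorService. A minimal sketch of members that could be added to the receiver above (the startPolling/stopPolling names are hypothetical):

import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Schedules the count request every second instead of blocking a thread in a sleep loop.
private val pollScheduler: ScheduledExecutorService =
  Executors.newSingleThreadScheduledExecutor()

private def startPolling(msgCount: Int): Unit =
  pollScheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = wsClient.send(msgCount.toString)
  }, 0, 1, TimeUnit.SECONDS)

// Call this from onStop() so the receiver shuts down cleanly and
// can be restarted by Spark after a failure.
private def stopPolling(): Unit = pollScheduler.shutdown()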

Related

How to work with Source.Queue in Akka-Stream

I am toying around trying to use a Source.queue from an actor. I am stuck pattern matching on the result of an offer operation:
class MarcReaderActor(file: File, sourceQueue: SourceQueueWithComplete[Record]) extends Actor {
  val inStream = file.newInputStream
  val reader = new MarcStreamReader(inStream)

  override def receive: Receive = {
    case Process =>
      if (reader.hasNext()) {
        val record = reader.next()
        pipe(sourceQueue.offer(record)) to self
      }
    case f: Future[QueueOfferResult] =>
  }
}
I don't know how to check whether the result was Enqueued, Dropped or Failure.
If I write f: Future[QueueOfferResult.Enqueued], the compiler complains.
Since you use pipeTo, you do not need to match on futures: the contents of the future will be sent to the actor when the future completes, not the future itself. Do this:
override def receive: Receive = {
  case Process =>
    if (reader.hasNext()) {
      val record = reader.next()
      pipe(sourceQueue.offer(record)) to self
    }
  case r: QueueOfferResult =>
    r match {
      case QueueOfferResult.Enqueued    => // element has been consumed
      case QueueOfferResult.Dropped     => // element has been ignored because of backpressure
      case QueueOfferResult.QueueClosed => // the queue upstream has terminated
      case QueueOfferResult.Failure(e)  => // the queue upstream has failed with an exception
    }
  case Status.Failure(e) => // future has failed, e.g. because of invalid usage of `offer()`
}
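Note that this code assumes import akka.pattern.pipe and an implicit ExecutionContext are in scope. For completeness, here is one way the SourceQueueWithComplete[Record] handed to the actor could be materialized (a sketch only; the buffer size and the printing sink are placeholder choices):

import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Sink, Source}

implicit val system = ActorSystem("marc")
implicit val materializer = ActorMaterializer()

// Running the stream yields the queue that MarcReaderActor offers records to.
val sourceQueue: SourceQueueWithComplete[Record] =
  Source.queue[Record](bufferSize = 64, OverflowStrategy.backpressure)
    .to(Sink.foreach(record => println(record)))
    .run()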

NATS streaming server subscriber rate limiting and exactly once delivery

I am playing a bit with NATS Streaming and I have a problem with subscriber rate limiting. When I set the max in flight to 1 and the ack timeout to 1 second, and I have a consumer which is basically a Thread.sleep(1000), then I get the same event multiple times. I thought that by limiting the in-flight count and using manual acks this should not happen. How can I get exactly-once delivery with very slow consumers?
case class EventBus[I, O](inputTopic: String, outputTopic: String, connection: Connection,
                          eventProcessor: StatefulEventProcessor[I, O]) {

  // the event bus could be some abstract class while the `Connection` could be injected using DI
  val subscriptionOptions: SubscriptionOptions = new SubscriptionOptions.Builder()
    .setManualAcks(true)
    .setDurableName("foo")
    .setMaxInFlight(1)
    .setAckWait(1, TimeUnit.SECONDS)
    .build()

  if (!inputTopic.isEmpty) {
    connection.subscribe(inputTopic, new MessageHandler() {
      override def onMessage(m: Message) {
        m.ack()
        try {
          val event = eventProcessor.deserialize(m.getData)
          eventProcessor.onEvent(event)
        } catch {
          case any: Throwable =>
            try {
              val command = new String(m.getData)
              eventProcessor.onCommand(command)
            } catch {
              case any: Throwable => println(s"de-serialization error: $any")
            }
        } finally {
          println("got event")
        }
      }
    }, subscriptionOptions)
  }

  if (!outputTopic.isEmpty) {
    eventProcessor.setBus(e => {
      try {
        connection.publish(outputTopic, eventProcessor.serialize(e))
      } catch {
        case ex: Throwable => println(s"serialization error $ex")
      }
    })
  }
}

abstract class StatefulEventProcessor[I, O] {
  private var bus: Option[O => Unit] = None

  def onEvent(event: I): Unit
  def onCommand(command: String): Unit

  def serialize(o: O): Array[Byte] =
    SerializationUtils.serialize(o.asInstanceOf[java.io.Serializable])

  def deserialize(in: Array[Byte]): I =
    SerializationUtils.deserialize[I](in)

  def setBus(push: O => Unit): Unit = {
    if (bus.isDefined) {
      throw new IllegalStateException("bus already set")
    } else {
      bus = Some(push)
    }
  }

  def push(event: O) =
    bus.get.apply(event)
}

EventBus("out-1", "out-2", sc, new StatefulEventProcessor[String, String] {
  override def onEvent(event: String): Unit = {
    Thread.sleep(1000)
    push("!!!" + event)
  }
  override def onCommand(command: String): Unit = {}
})

(0 until 100).foreach(i => sc.publish("out-1", SerializationUtils.serialize(s"test-$i")))
(0 until 100).foreach(i => sc.publish("out-1", SerializationUtils.serialize(s"test-$i")))
First, there is no exactly-once (re)delivery guarantee with NATS Streaming. What MaxInflight gives you is the assurance that the server will not send new messages to the subscriber until the number of unacknowledged messages is below that number. So with MaxInflight(1), you are asking the server to send the next new message only after it receives the ack for the previously delivered one. However, this does not block redelivery of unacknowledged messages.
The server has no guarantee, nor any knowledge, that a message actually reached a subscriber. This is what the ACK is for: to let the server know that the message was properly processed by the subscriber. If the server did not honor redelivery (even when MaxInflight is reached), a "lost" message would stall your subscription forever. Keep in mind that the NATS Streaming server and its clients are not directly connected to each other with a TCP connection (they are both connected to a NATS server, aka gnatsd).
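Given those semantics, one common mitigation (fewer duplicates, though still not exactly-once) is to acknowledge only after processing succeeds, and to set AckWait comfortably above the worst-case processing time. A sketch of the handler from the question with the ack moved; the 30-second AckWait is an assumed value:

val subscriptionOptions: SubscriptionOptions = new SubscriptionOptions.Builder()
  .setManualAcks(true)
  .setDurableName("foo")
  .setMaxInFlight(1)
  .setAckWait(30, TimeUnit.SECONDS) // assumed value, well above the ~1s processing time
  .build()

connection.subscribe(inputTopic, new MessageHandler() {
  override def onMessage(m: Message) {
    try {
      val event = eventProcessor.deserialize(m.getData)
      eventProcessor.onEvent(event) // may take ~1 second
      m.ack()                       // ack only after successful processing
    } catch {
      case any: Throwable => println(s"processing error: $any") // no ack => message is redelivered
    }
  }
}, subscriptionOptions)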

Update state in actor from within a future

Consider the following code sample:
class MyActor(httpClient: HttpClient) extends Actor {
  var canSendMore = true

  override def receive: Receive = {
    case PayloadA(name: String) => send(urlA)
    case PayloadB(name: String) => send(urlB)
  }

  def send(url: String) {
    if (canSendMore)
      httpClient.post(url).map(response => canSendMore = response.canSendMore)
    else {
      Thread.sleep(5000) // this will be done in a more elegant way, it's just for the example
      httpClient.post(url).map(response => canSendMore = response.canSendMore)
    }
  }
}
Each message handled results in an async HTTP request (post returns a Future[Response]).
My problem is that I want to update the canSendMore flag safely; at the moment there is a race condition.
BTW, I must somehow update the flag on the same thread, or at least before any other message is processed by this actor.
Is this possible?
You can use a become + stash combination to keep stashing messages while the HTTP request future is in flight:
import akka.actor.{Actor, Stash}
import scala.util.{Failure, Success}

case object FreeToProcess
case class PayloadA(name: String)

class MyActor(httpClient: HttpClient) extends Actor with Stash {
  import context.dispatcher

  def receive: Receive = canProcessReceive

  def canProcessReceive: Receive = {
    case PayloadA(name: String) =>
      // become an actor which just stashes messages
      context.become(canNotProcessReceive, discardOld = false)
      httpClient.post(urlA).onComplete {
        case Success(x) =>
          // Use your result
          self ! FreeToProcess
        case Failure(e) =>
          // Use your failure
          self ! FreeToProcess
      }
  }

  def canNotProcessReceive: Receive = {
    case FreeToProcess =>
      // replay stash to mailbox
      unstashAll()
      // start processing messages
      context.unbecome()
    case msg =>
      stash()
  }
}
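To also carry the flag from the question back into the actor safely, the future's result can be mapped to a message and piped to self, so the mutable state is only ever touched from the actor's own message processing. A fragment-level sketch against MyActor above; the SendResult carrier message is hypothetical:

import akka.pattern.pipe

case class SendResult(canSendMore: Boolean) // hypothetical carrier message

// in canProcessReceive, instead of signalling FreeToProcess from onComplete:
//   httpClient.post(urlA).map(r => SendResult(r.canSendMore)) pipeTo self

def canNotProcessReceive: Receive = {
  case SendResult(flag) =>
    canSendMore = flag // mutated only inside receive, so no race condition
    unstashAll()
    context.unbecome()
  case _ => stash()
}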

spray.can.Http$ConnectionException: Premature connection close

In my test below, I tried to simulate a timeout and then send a normal request. However, I got spray.can.Http$ConnectionException: Premature connection close (the server doesn't appear to support request pipelining):
class SprayCanTest extends ModuleTestKit("/SprayCanTest.conf") with FlatSpecLike with Matchers {

  import system.dispatcher

  var app = Actor.noSender

  protected override def beforeAll(): Unit = {
    super.beforeAll()
    app = system.actorOf(Props(new MockServer))
  }

  override protected def afterAll(): Unit = {
    system.stop(app)
    super.afterAll()
  }

  "response time out" should "work" in {
    val setup = Http.HostConnectorSetup("localhost", 9101, false)
    connect(setup).onComplete {
      case Success(conn) =>
        conn ! HttpRequest(HttpMethods.GET, "/timeout")
    }
    expectMsgPF() {
      case Status.Failure(t) =>
        t shouldBe a[RequestTimeoutException]
    }
  }

  "normal http response" should "work" in {
    //Thread.sleep(5000)
    val setup = Http.HostConnectorSetup("localhost", 9101, false)
    connect(setup).onComplete {
      case Success(conn) =>
        conn ! HttpRequest(HttpMethods.GET, "/hello")
    }
    expectMsgPF() {
      case HttpResponse(status, entity, _, _) =>
        status should be(StatusCodes.OK)
        entity should be(HttpEntity("Helloworld"))
    }
  }

  def connect(setup: HostConnectorSetup)(implicit system: ActorSystem) = {
    // for the actor 'asks'
    import system.dispatcher
    implicit val timeout: Timeout = Timeout(1 second)
    (IO(Http) ? setup) map {
      case Http.HostConnectorInfo(connector, _) => connector
    }
  }

  class MockServer extends Actor {
    //implicit val timeout: Timeout = 1.second
    implicit val system = context.system

    // Register connection service
    IO(Http) ! Http.Bind(self, interface = "localhost", port = 9101)

    def receive: Actor.Receive = {
      case _: Http.Connected => sender ! Http.Register(self)
      case HttpRequest(GET, Uri.Path("/timeout"), _, _, _) =>
        Thread.sleep(3000)
        sender ! HttpResponse(entity = HttpEntity("ok"))
      case HttpRequest(GET, Uri.Path("/hello"), _, _, _) =>
        sender ! HttpResponse(entity = HttpEntity("Helloworld"))
    }
  }
}
And my config for the test:
spray {
  can {
    client {
      response-chunk-aggregation-limit = 0
      connecting-timeout = 1s
      request-timeout = 1s
    }
    host-connector {
      max-retries = 0
    }
  }
}
I found that in both cases the "conn" object is the same.
So I guess that when the RequestTimeoutException happens, spray puts the conn back into the pool (of, by default, 4 connections?) and the next case uses the same conn; but at this point the conn is kept alive, so the server treats the request as chunked.
If I put some sleep in the second case, it just passes.
So I guess I must close the conn when I get the RequestTimeoutException and make sure the second case uses a fresh connection, right?
How should I do that? Are there any configuration settings?
Thanks,
Leon
You should not block inside an actor (your MockServer). While it is blocked, it is unable to respond to any messages. You can wrap the Thread.sleep and the response inside a Future, or even better, use the Akka Scheduler. Be sure to assign the sender to a val, because it may change when you respond to the request asynchronously. This should do the trick:
val savedSender = sender()
context.system.scheduler.scheduleOnce(3.seconds) {
  savedSender ! HttpResponse(entity = HttpEntity("ok"))
}
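For completeness, the Future-wrapping alternative mentioned above would look like this (a sketch; it assumes an implicit ExecutionContext such as import context.dispatcher is in scope):

import scala.concurrent.Future

val savedSender = sender()
Future {
  // the blocking sleep now runs on the dispatcher, not in the actor's message loop
  Thread.sleep(3000)
  savedSender ! HttpResponse(entity = HttpEntity("ok"))
}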

Play 2.2.2-WebSocket / Equivalent of in.onClose() in Scala

I use Play 2.2.2 with Scala.
I have this code in my controller:
def wsTest = WebSocket.using[JsValue] { implicit request =>
  val (out, channel) = Concurrent.broadcast[JsValue]
  val in = Iteratee.foreach[JsValue] {
    msg => println(msg)
  }
  userAuthenticatorRequest.tracked match { // detecting whether the user is authenticated
    case Some(u) =>
      MySubscriber.start(u.id, channel)
    case _ =>
      channel push Json.toJson("{error: Sorry, you aren't authenticated yet}")
  }
  (in, out)
}
which calls this code:
object MySubscriber {
  def start(userId: String, channel: Concurrent.Channel[JsValue]) {
    ctx.getBean(classOf[ActorSystem]).actorOf(
      Props(classOf[MySubscriber], Seq("comment"), channel),
      name = "mySubscriber") ! "start"
    // a simple refresh would involve a duplication of this actor!
  }
}

class MySubscriber(redisChannels: Seq[String], channel: Concurrent.Channel[JsValue])
  extends RedisSubscriberActor(new InetSocketAddress("localhost", 6379), redisChannels, Nil)
  with ActorLogging {

  def onMessage(message: Message) {
    println(s"message received: $message")
    channel.push(Json.parse(message.data))
  }

  override def onPMessage(pmessage: PMessage) {
    // not used
    println(s"message received: $pmessage")
  }
}
The problem is that when the user refreshes the page, a new WebSocket starts, duplicating the actors named mySubscriber.
I noticed that Play's Java API has a way to detect a closed connection, in order to shut down an actor.
Example:
// When the socket is closed.
in.onClose(new Callback0() {
  public void invoke() {
    // Shutdown the actor
    defaultRoom.shutdown();
  }
});
How can I handle the same thing with the Scala WebSocket API? I want to stop the actor each time the socket is closed.
As @Mik378 suggested, Iteratee.map serves the role of onClose:
val in = Iteratee.foreach[JsValue] {
  msg => println(msg)
} map { _ =>
  println("Connection has closed")
}
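Applied to the question, the cleanup can go into that callback. A sketch, assuming MySubscriber.start is changed to return the ActorRef it creates (it currently returns Unit) so there is something to stop:

import akka.actor.{ActorRef, PoisonPill}

val subscriber: ActorRef = MySubscriber.start(u.id, channel) // assumes start now returns the ActorRef

val in = Iteratee.foreach[JsValue] {
  msg => println(msg)
} map { _ =>
  // runs once the socket is closed: stop this connection's subscriber actor
  subscriber ! PoisonPill
}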