Flow of my spark streaming program - scala

Small question about the flow of my spark streaming program.
I have this function :
def parse(msg: String): Seq[String]
which splits a "good" message into multiple strings and, if the message is "bad", returns an empty Seq.
I'm reading the messages from a kafka topic, and I want to send the results of the parsing into two different topics:
If the message is "good", send the result of the parsing to the topic "good_msg_topic"
If the message is "bad", send the "bad" message to the topic "bad_msg_topic"
To achieve that, I did this :
stream.foreachRDD(rdd => {
  val res = rdd.map(msg => msg.value() -> parse(msg.value()))
  res.foreach(pair => {
    if (pair._2.isEmpty) {
      producer.send(junkTopic, pair._1)
    } else {
      pair._2.foreach(m => producer.send(splitTopic, m))
    }
  })
})
However, I feel like this is not optimal. Using map to pair each original message with its parsing result may slow down the process...
I'm beginning with Spark and Scala, so I think one could do better.
Any idea on how I could improve that? Changing the signature of the parse function is also possible if you think it's better.
Thank you

I wouldn't be too concerned about performance if you haven't already measured this and found a bottleneck.
One thing I can think of which might make this code clearer is to use an ADT to describe the result type:
sealed trait Result
case class GoodResult(seq: Seq[String]) extends Result
case class BadResult(original: String) extends Result
Have parse return Result
def parse(s: String): Result
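A sketch of what parse could look like with that signature; splitMessage below is a hypothetical stand-in for whatever splitting logic you already have:
def parse(msg: String): Result = {
  // splitMessage is hypothetical: your existing logic that returns Seq.empty for a "bad" message
  val parts: Seq[String] = splitMessage(msg)
  if (parts.isEmpty) BadResult(msg) else GoodResult(parts)
}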
And then use map on the DStream instead of the RDD:
stream
  .map(msg => parse(msg.value()))
  .foreachRDD { rdd =>
    rdd.foreach { result =>
      result match {
        case GoodResult(seq)     => seq.foreach(value => producer.send(splitTopic, value))
        case BadResult(original) => producer.send(junkTopic, original)
      }
    }
  }

Related

Akka Stream continuously consume websocket

I'm kinda new to Scala and Akka Streams and I'm trying to get JSON string messages from a websocket and push them to a Kafka topic.
For now I am only working on the "get messages from a WS" part.
Messages coming from the websocket look like this:
{
"bitcoin":"6389.06534240",
"ethereum":"192.93111286",
"monero":"108.90302506",
"litecoin":"52.25484165"
}
I want to split this JSON message into multiple messages:
{"coin": "bitcoin", "price": "6389.06534240"}
{"coin": "ethereum", "price": "192.93111286"}
{"coin": "monero", "price": "108.90302506"}
{"coin": "litecoin", "price": "52.25484165"}
And then push each of these messages to a Kafka topic.
Here's what I have achieved so far:
val message_decomposition: Flow[Message, String, NotUsed] = Flow[Message].mapConcat(
  msg => msg.toString.replaceAll("[{})(]", "").split(",")
).map(msg => {
  val splitted = msg.split(":")
  s"{'coin': ${splitted(0)}, 'price': ${splitted(1)}}"
})
val sink: Sink[String, Future[Done]] = Sink.foreach[String](println)
val flow: Flow[Message, Message, Promise[Option[Message]]] =
  Flow.fromSinkAndSourceMat(
    message_decomposition.to(sink),
    Source.maybe[Message])(Keep.right)

val (upgradeResponse, promise) = Http().singleWebSocketRequest(
  WebSocketRequest("wss://ws.coincap.io/prices?assets=ALL"),
  flow)
It's working and I'm getting the expected output JSON messages, but I was wondering if I could write this producer in a more "Akka-ish" style, like using the GraphDSL. So I have a few questions:
Is it possible to continuously consume a WebSocket using the GraphDSL? If yes, can you show me an example please?
Is it a good idea to consume the WS using the GraphDSL?
Should I decompose the received JSON message like I'm doing before sending it to Kafka? Or is it better to send it as it is for lower latency?
After producing the messages to Kafka, I am planning to consume them using Apache Storm. Is that a good idea? Or should I stick with Akka?
Thanks for reading. Regards,
Arès
That code is plenty Akka-ish: scaladsl is just as Akka as the GraphDSL or implementing a custom GraphStage. The only reason, IMO/E, to go to the GraphDSL is if the actual shape of the graph isn't readily expressible in the scaladsl.
I would personally go the route of defining a CoinPrice class to make the model explicit
case class CoinPrice(coin: String, price: BigDecimal)
And then have a Flow[Message, CoinPrice, NotUsed] which parses 1 incoming message into zero or more CoinPrices. Something (using Play JSON here) like:
val toCoinPrices =
  Flow[Message]
    .mapConcat { msg =>
      Json.parse(msg.toString)
        .asOpt[JsObject]
        .toList
        .flatMap { json =>
          json.underlying.flatMap { kv =>
            import scala.util.Try
            kv match {
              case (coin, JsString(priceStr)) =>
                Try(BigDecimal(priceStr)).toOption
                  .map(p => CoinPrice(coin, p))
              case (coin, JsNumber(price)) => Some(CoinPrice(coin, price))
              case _ => None
            }
          }
        }
    }
You might, depending on the size of the JSON in each message, want to break that into separate stream stages to allow for an async boundary between the JSON parsing and the extraction into CoinPrices. For example:
Flow[Message]
  .mapConcat { msg =>
    Json.parse(msg.toString).asOpt[JsObject].toList
  }
  .async
  .mapConcat { json =>
    json.underlying.flatMap { kv =>
      import scala.util.Try
      kv match {
        case (coin, JsString(priceStr)) =>
          Try(BigDecimal(priceStr)).toOption
            .map(p => CoinPrice(coin, p))
        case (coin, JsNumber(price)) => Some(CoinPrice(coin, price))
        case _ => None
      }
    }
  }
In the above, the stages on either side of the async boundary will execute in separate actors and thus, possibly concurrently (if there's enough CPU cores available etc.), at the cost of extra overhead for the actors to coordinate and exchange messages. That extra coordination/communication overhead (cf. Gunther's Universal Scalability Law) is only going to be worth it if the JSON objects are sufficiently large and coming in sufficiently quickly (consistently coming in before the previous one has finished processing).
If your intention is to consume the websocket until the program stops, you might find it clearer to just use Source.never[Message].
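If you later want the CoinPrices to end up in Kafka rather than on stdout, one option is Alpakka Kafka's Producer.plainSink. A minimal sketch, where the bootstrap servers, the topic name "coin-prices" and the string encoding of the price are assumptions to adapt to your setup:
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

// assumed connection settings; point these at your actual cluster
val producerSettings =
  ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")

// turn each CoinPrice into a ProducerRecord and hand it to the Kafka sink
val kafkaSink: Sink[CoinPrice, NotUsed] =
  Flow[CoinPrice]
    .map(cp => new ProducerRecord[String, String]("coin-prices", cp.coin, cp.price.toString))
    .to(Producer.plainSink(producerSettings))

// plug it in where the println sink was
val wsFlow =
  Flow.fromSinkAndSourceMat(
    toCoinPrices.to(kafkaSink),
    Source.maybe[Message])(Keep.right)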

How to use Futures within Kafka Streams

When using the org.apache.kafka.streams.KafkaStreams library in Scala, I have been trying to read from an input stream, pass each record to a method validateAll(infoToValidate) that returns a Future, resolve that, and then send the result to an output stream.
Example:
builder.stream[String, Object](REQUEST_TOPIC)
  .mapValues(v => ValidateFormat.from(v.asInstanceOf[GenericRecord]))
  .mapValues(infoToValidate => {
    SuccessFailFormat.to(validateAll(infoToValidate))
  })
Is there any documentation on how to do this? I have looked into filter() and transform() but I'm still not sure how to deal with Futures in KStreams.
The answer depends on whether you need to preserve the original order of your messages. If yes, then you will have to block in one way or another. For example:
import scala.concurrent.Await
import scala.concurrent.duration._

val duration = 10.seconds // whatever your timeout should be, or Duration.Inf

sourceStream
  .mapValues(x => Await.result(validate(x), duration))
  .to(outputTopic)
If however the order is not important, you can simply use a Kafka producer:
sourceStream
  .mapValues(x => validate(x)) // now you have a KStream[..., Future[...]]
  .foreach { (key, future) =>
    future.foreach { item => // requires an implicit ExecutionContext in scope
      val record = new ProducerRecord(outputTopic, key, item)
      producer.send(record) // a plain KafkaProducer configured with matching serializers
    }
  }
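For completeness, a minimal sketch of how the plain producer used above might be constructed (the bootstrap servers and the String key/value types are assumptions; match them to your output topic's format):
import java.util.Properties
import org.apache.kafka.clients.producer.KafkaProducer
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // adjust to your cluster

// serializers are passed explicitly here instead of via properties
val producer = new KafkaProducer[String, String](props, new StringSerializer, new StringSerializer)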

Processing an akka stream asynchronously and writing to a file sink

I am trying to write a piece of code that would consume a stream of tickers (stock exchange symbol of a company) and fetch company information from a REST API for each ticker.
I want to fetch information for multiple companies asynchronously.
I would like to save the results to a file in a continuous manner as the entire data set might not fit into memory.
Following the documentation of akka streams and resources that I was able to google on this subject I have come up with the following piece of code (some parts are omitted for brevity):
implicit val actorSystem: ActorSystem = ActorSystem("stock-fetcher-system")
implicit val materializer: ActorMaterializer = ActorMaterializer(None, Some("StockFetcher"))(actorSystem)
implicit val context = system.dispatcher
import CompanyJsonMarshaller._
val parallelism = 10
val connectionPool = Http().cachedHostConnectionPoolHttps[String](s"api.iextrading.com")
val listOfSymbols = symbols.toList
val outputPath = "out.txt"
Source(listOfSymbols)
  .mapAsync(parallelism) {
    stockSymbol => Future(HttpRequest(uri = s"https://api.iextrading.com/1.0/stock/${stockSymbol.symbol}/company"), stockSymbol.symbol)
  }
  .via(connectionPool)
  .map {
    case (Success(response), _) => Unmarshal(response.entity).to[Company]
    case (Failure(ex), symbol) =>
      println(s"Unable to fetch char data for $symbol")
      "x"
  }
  .runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))
  .onComplete { _ =>
    bufferedSource.close
    actorSystem.terminate()
  }
This is the problematic line:
runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))
which doesn't compile, and the compiler gives me this mysterious-looking error:
Type mismatch, expected: Graph[SinkShape[Any], NotInferedMat2], actual: Sink[ByteString, Future[IOResult]]
If I change the sink to Sink.ignore or println(_) it works.
I'd appreciate some more detailed explanation.
As the compiler is indicating, the types don't match. In the call to .map...
.map {
  case (Success(response), _) =>
    Unmarshal(response.entity).to[Company]
  case (Failure(ex), symbol) =>
    println(s"Unable to fetch char data for $symbol")
    "x"
}
...you're returning either a Company instance or a String, so the compiler infers the closest supertype (the "least upper bound") to be Any. The Sink expects input elements of type ByteString, not Any.
One approach is to send the response to the file sink without unmarshalling the response:
Source(listOfSymbols)
  .mapAsync(parallelism) {
    ...
  }
  .via(connectionPool)
  .collect { case (Success(response), _) => response.entity.dataBytes } // keep successful responses; dataBytes is a Source[ByteString, _]
  .flatMapConcat(identity)
  .runWith(FileIO.toPath(...))
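If you do want the unmarshalled Company values in the file, another sketch is to resolve the unmarshalling futures with mapAsync and turn each Company back into a ByteString before the sink. This assumes you have (or write) a serializer from Company to a JSON string; toJsonString below is a hypothetical placeholder for it:
import akka.util.ByteString

Source(listOfSymbols)
  .mapAsync(parallelism) {
    ...
  }
  .via(connectionPool)
  .mapAsync(parallelism) {
    case (Success(response), _) =>
      Unmarshal(response.entity).to[Company].map(Some(_))
    case (Failure(ex), symbol) =>
      println(s"Unable to fetch char data for $symbol")
      Future.successful(None)
  }
  .collect { case Some(company) => ByteString(toJsonString(company) + "\n") } // toJsonString: hypothetical Company => String serializer
  .runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))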

Combining Futures dependent on each other

I'm using Scala to make HTTP GET requests to an API (Play Framework's WS, to be exact) which responds with a JSON response that looks like:
{
  data: [
    {text: "Hello there", id: 1},
    {text: "Hello there again", id: 2}
  ],
  next_url: 'http://request-this-for-more.com/api?page=2' //optional
}
So, the next_url field in the returned JSON may or may not be present.
What my method needs to do is start with calling the first URL, check if the response has a next_url and then do a GET on that. In the end, I should have all the data fields from the responses combined into one single future of all the data fields. I terminate when the response has no next_url present in it.
Now, doing this in a blocking way is easier, but I don't want to do that. What is the best way to tackle a problem like this?
There's probably a method to do this somewhere in scalaz, but if you don't know a specific solution it's usually possible to construct one with recursion and flatMap. Something like:
// Assume we have an async fetch method that returns a Result and an Option of the next Url
def fetch(url: Url): Future[(Result, Option[Url])] = ...

// Then we can define fetchAll with recursion:
def fetchAll(url: Url): Future[Vector[Result]] =
  fetch(url) flatMap {
    case (result, None) => Future.successful(Vector(result))
    case (result, Some(nextUrl)) =>
      fetchAll(nextUrl) map { results => result +: results }
  }
(Note that this uses a stack frame for each call - if you want to do thousands of fetches then we need to write it a little more carefully so that it's tail-recursive)
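For instance, a sketch of an accumulator-passing variant with the same fetch signature, which builds the Vector on the way down instead of prepending on the way back up:
def fetchAll(url: Url): Future[Vector[Result]] = {
  // carry the results gathered so far in an accumulator
  def loop(url: Url, acc: Vector[Result]): Future[Vector[Result]] =
    fetch(url) flatMap {
      case (result, None)          => Future.successful(acc :+ result)
      case (result, Some(nextUrl)) => loop(nextUrl, acc :+ result)
    }
  loop(url, Vector.empty)
}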
The Future.flatMap method exists exactly for cases like this.
Suppose you have things like these:
case class Data(...)
def getContent(url:String):Future[String]
def parseJson(source:String):Try[JsValue]
def getData(value: JsValue):Seq[Data]
and the JsValue type has methods inspired by the Play JSON library:
def \ (fieldName: String): JsValue
def as[T](implicit ...):T //probably throwing exception
you could compose the final result like this:
import scala.util.{Failure, Success, Try}

def innerContent(url: String): Future[Seq[Data]] = for {
  first <- getContent(url)
  json <- Future.fromTry(parseJson(first))
  nextUrlAttempt = Try((json \ "next_url").as[String])
  dataAttempt = Try(getData(json \ "data"))
  data <- Future.fromTry(dataAttempt)
  result <- nextUrlAttempt match {
    case Success(nextUrl) => innerContent(nextUrl)
    case Failure(_)       => Future.successful(Seq())
  }
} yield data ++ result
Also check out libraries that target complex asynchronous streams like yours:
play iteratees
scalaz iteratees
scalaz stream

Simple Scala actor question

I'm sure this is a very simple question, but I'm embarrassed to say I can't get my head around it:
I have a list of values in Scala.
I would like to use actors to make some (external) calls with each value, in parallel.
I would like to wait until all values have been processed, and then proceed.
There's no shared values being modified.
Could anyone advise?
Thanks
There's an actor-using class in Scala that's made precisely for this kind of problem: Futures. This problem would be solved like this:
// This assigns futures that will execute in parallel
// In the example, the computation is performed by the "process" function
val tasks = list map (value => scala.actors.Futures.future { process(value) })
// The result of a future may be extracted with the apply() method, which
// will block if the result is not ready.
// Since we do want to block until all results are ready, we can call apply()
// directly instead of using a method such as Futures.awaitAll()
val results = tasks map (future => future.apply())
There you go. Just that.
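As a self-contained usage sketch (process here is just a stand-in for the external call, doubling its input):
import scala.actors.Futures

def process(value: Int): Int = value * 2 // stand-in for the real external call

val list = List(1, 2, 3, 4)
val tasks = list map (value => Futures.future { process(value) })
val results = tasks map (future => future())
println(results) // List(2, 4, 6, 8)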
Create workers and ask them for futures using !!; then read off the results (which will be calculated in parallel and come in as they're ready; you can then use them). For example:
object Example {
  import scala.actors._

  class Worker extends Actor {
    def act() { Actor.loop { react {
      case s: String => reply(s.length)
      case _ => exit()
    }}}
  }

  def main(args: Array[String]) {
    val arguments = args.toList
    val workers = arguments.map(_ => (new Worker).start)
    val futures = for ((w, a) <- workers zip arguments) yield w !! a
    val results = futures.map(f => f() match {
      case i: Int => i
      case _ => throw new Exception("Whoops--didn't expect to get that!")
    })
    println(results)
    workers.foreach(_ ! None)
  }
}
This does a very inexpensive computation--calculating the length of a string--but you can put something expensive there to make sure it really does happen in parallel (the last thing that case of the act block should do is reply with the answer). Note that we also include a case for the worker to shut itself down, and when we're all done, we tell the workers to shut down. (In this case, any non-string shuts down the worker.)
And we can try this out to make sure it works:
scala> Example.main(Array("This","is","a","test"))
List(4, 2, 1, 4)