Spark Scala Get Data Back from rdd.foreachPartition - scala

I have some code like this:
println("\nBEGIN Last Revs Class: "+ distinctFileGidsRDD.getClass)
val lastRevs = distinctFileGidsRDD.
  foreachPartition(iter => {
    SetupJDBC(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    while (iter.hasNext) {
      val item = iter.next()
      //println(item(0))
      println("String: " + item(0).toString())
      val jsonStr = DB.readOnly { implicit session =>
        sql"SELECT jsonStr FROM lasttail WHERE fileGId = ${item(0)}::varchar".
          map { resultSet => resultSet.string(1) }.single.apply()
      }
      println("\nJSON: " + jsonStr)
    }
  })
println("\nEND Last Revs Class: "+ lastRevs.getClass)
The code outputs (with heavy edits) something like:
BEGIN Last Revs Class: class org.apache.spark.rdd.MapPartitionsRDD
String: 1fqhSXPE3GwrJ6SZzC65gJnBaB5_b7j3pWNSfqzU5FoM
JSON: Some({"Struct":{"fileGid":"1fqhSXPE3GwrJ6SZzC65gJnBaB5_b7j3pWNSfqzU5FoM",... )
String: 1eY2wxoVq17KGMUBzCZZ34J9gSNzF038grf5RP38DUxw
JSON: Some({"Struct":{"fileGid":"1fqhSXPE3GwrJ6SZzC65gJnBaB5_b7j3pWNSfqzU5FoM",... )
...
JSON: None()
END Last Revs Class: void
QUESTION 1:
How can I get the lastRevs value to be in a useful format like the JSON string/null or an option like Some / None?
QUESTION 2:
My preference: is there another way to get at the partition data in an RDD-like format (rather than through the iterator)?
dstream.foreachRDD { (rdd, time) =>
  rdd.foreachPartition { partitionIterator =>
    val partitionId = TaskContext.get.partitionId()
    val uniqueId = generateUniqueId(time.milliseconds, partitionId)
    // use this uniqueId to transactionally commit the data in partitionIterator
  }
}
from http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
QUESTION 3: Is the method of getting data that I am using a sane one (given that I am following the link above)? (Put aside the fact that this uses JDBC via scalikejdbc right now; this is going to be a key-value store of some type other than in this prototype.)

To create a transformation that uses resources local to the executor (such as a DB or network connection), you should use rdd.mapPartitions. It allows you to initialize some code locally on the executor and use those local resources to process the data in the partition.
The code should look like:
val lastRevs = distinctFileGidsRDD.
  mapPartitions { iter =>
    SetupJDBC(jdbcDriver, jdbcUrl, jdbcUser, jdbcPassword)
    iter.map { element =>
      DB.readOnly { implicit session =>
        sql"SELECT jsonStr FROM lasttail WHERE fileGId = ${element(0)}::varchar"
          .map { resultSet => resultSet.string(1) }.single.apply()
      }
    }
  }
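Because mapPartitions returns a new RDD (unlike foreachPartition, which returns Unit), lastRevs above is an RDD of the per-element query results that you can keep transforming or bring back to the driver. A minimal sketch of that, assuming each lookup yields an Option[String] as scalikejdbc's single.apply() does:
// lastRevs: RDD[Option[String]] -- one Some(jsonStr) or None per input element
val revs: Array[Option[String]] = lastRevs.collect()
revs.foreach(println)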

Related

TypeInformation in Flink

I have a pipeline in place where data is being sent from Flink to a Kafka topic in JSON format. I was also able to read it back from the Kafka topic and extract the JSON attributes. Now, similar to Scala reflection classes where I can compare data types at runtime, I was trying to do the same thing in Flink using TypeInformation: I want to define some expected format, and whatever data is read from the topic should go through this validation and pass or fail accordingly.
I have data like below:
{"policyName":"String", "premium":2400, "eventTime":"2021-12-22 00:00:00" }
For my problem, I came across a couple of examples in Flink's book that show how to create a TypeInformation variable, but nothing about how to use it, so I tried it my way:
val objectMapper = new ObjectMapper()
val tupleType: TypeInformation[(String, Int, String)] =
  Types.TUPLE[(String, Int, String)]
println(tupleType.getTypeClass)
src.map(v => v)
  .map { x =>
    val policyName: String = objectMapper.readTree(x).get("policyName").toString()
    val premium: Int = objectMapper.readTree(x).get("premium").toString().toInt
    val eventTime: String = objectMapper.readTree(x).get("eventTime").toString()
    if ((policyName, premium, eventTime) == tupleType.getTypeClass) {
      println("Good Record: " + (policyName, premium, eventTime))
    }
    else {
      println("Bad Record: " + (policyName, premium, eventTime))
    }
  }
Now if I pass the input as below to the flink kafka producer:
{"policyName":"whatever you feel like","premium":"4000","eventTime":"2021-12-20 00:00:00"}
It should give me the expected output as a "Bad record" and the tuple since the datatype of premium is String and not Long/Int.
If I pass the input as below:
{"policyName":"whatever you feel like","premium":4000,"eventTime":"2021-12-20 00:00:00"}
It should give me the output as "Good Record" and the tuple
But according to my code, it is always giving me the else part.
If I create a datastream variable and store the results of the above map and then compare like below then it gives me the correct result:
if (tupleType == datas.getType()) { // where 'datas' is a datastream
  print("Good Records")
} else {
  println("Bad Records")
}
But I want to send the good/bad records to different streams, or maybe insert them directly into a Cassandra table. So that is why I am checking the records one by one. Is my way correct? What would be the best practice considering what I am trying to achieve?
Based on Dominik's inputs, I tried creating my own CustomDeserializer class:
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.flink.api.common.serialization.DeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import java.nio.charset.StandardCharsets
class sample extends DeserializationSchema[String] {
  override def deserialize(message: Array[Byte]): Tuple3[Int, String, String] = {
    val data = new String(message, StandardCharsets.UTF_8)
    val objectMapper = new ObjectMapper()
    val id: Int = objectMapper.readTree(data).get("id").toString().toInt
    val category: String = objectMapper.readTree(data).get("Category").toString()
    val eventTime: String = objectMapper.readTree(data).get("eventTime").toString()
    return (id, category, eventTime)
  }
  override def isEndOfStream(t: String): Boolean = ???
  override def getProducedType: TypeInformation[String] = TypeInformation.of(classOf[String])
}
I want to implement something like below:
src.map(v => v)
  .map { x =>
    if (new sample().deserialize(x) == true) {
      println("Good Record: " + (id, category, eventTime))
    }
    else {
      println("Bad Record: " + (id, category, eventTime))
    }
  }
But the input is in Array[Byte] form. So how can I implement it? Where exactly am I going wrong? What needs to be modified? This is my first ever attempt at custom classes in Flink Scala.
I don't really think that using TypeInformation to do what you want is the best idea. You can simply use something like a ProcessFunction that accepts a JSON String and then uses an ObjectMapper to deserialize the JSON into a class with the expected structure. You can output the correctly deserialized objects from the ProcessFunction, and the Strings that failed deserialization can be passed as a side output, since they will be your Bad Records.
This could look like below; note that this uses the Jackson Scala module to deserialize into a case class. You can find more info here.
case class Premium(policyName: String, premium: Long, eventTime: String)

class Splitter extends ProcessFunction[String, Premium] {
  val outputTag = new OutputTag[String]("failed")

  def fromJson[T](json: String)(implicit m: Manifest[T]): Either[String, T] = {
    Try {
      lazy val mapper = new ObjectMapper() with ScalaObjectMapper
      mapper.registerModule(DefaultScalaModule)
      mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
      mapper.readValue[T](json)
    } match {
      case Success(x) => Right(x)
      case Failure(err) => Left(json)
    }
  }

  override def processElement(i: String, context: ProcessFunction[String, Premium]#Context, collector: Collector[Premium]): Unit = {
    fromJson[Premium](i) match {
      case Right(data) => collector.collect(data)
      case Left(json) => context.output(outputTag, json)
    }
  }
}
Then you can use the outputTag to get the side-output data, i.e. the incorrect records, from the stream.
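For reference, a rough sketch of how that side output could be consumed downstream (assuming src is a DataStream[String] and the usual Flink Scala imports are in scope; the tag only needs to match the id and element type used inside Splitter):
val goodRecords: DataStream[Premium] = src.process(new Splitter())
val badRecords: DataStream[String] = goodRecords.getSideOutput(new OutputTag[String]("failed"))
// goodRecords can go on to Cassandra, badRecords to wherever failed input should land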

How to save the outcome of collection.find into Array

object ConnHelper extends Serializable {
  lazy val jedis = new Jedis("localhost")
  lazy val mongoClient = MongoClient("mongodb://localhost:27017/recommender")
}

val ratingCollection = ConnHelper.mongoClient.getDatabase(mongoConfig.db).getCollection(MONGODB_RATING_COLLECTION)

val Existratings: Observable[Option[BsonValue]] = ratingCollection
  .find(equal("userId", 1234))
  .map { item => item.get("productId") }
The documents look like this:
{
  "id": ****,
  "userId": 4567,
  "productId": 12345,
  "score": 5.0
}
I use Scala and mongo-scala-driver 2.9.0 to connect to MongoDB and find the documents whose "userId" field equals 1234; then I want to save the value of "productId" from those documents into an Array, but the returned value is an Observable.
Could anyone tell me how to save the query outcome into an Array? I would appreciate it very much.
Please try a method which uses a Promise/Future structure to find the sequence of documents that match the search criteria. For example:
import org.mongodb.scala.bson._
def find(search_key: String, search_value: String, collection_name: String): Seq[Document] = {
  // The application will need to wait for the find operation thread to complete
  // in order to process the returned value.
  log.debug(s"Starting database find_all operation thread")
  // Set up new client connection, database, and collection
  val _client: MongoClient = MongoClient(config_client)
  val _database: MongoDatabase = _client.getDatabase(config_database)
  val collection: MongoCollection[Document] = _database.getCollection(collection_name)
  // Set up result sequence
  var result_seq: Seq[Document] = Seq.empty
  // Set up Promise container to wait for the database operation to complete
  val promise = Promise[Boolean]()
  // Start the find operation thread; once the thread has finished, read the resulting documents.
  collection.find(equal(search_key, search_value)).collect().subscribe((results: Seq[Document]) => {
    log.trace(s"Find operation thread completed")
    // Append found documents to the results
    result_seq = result_seq ++ results
    log.trace(s" Result sequence: $result_seq")
    promise.success(true)              // set Promise container
    _client.close                      // close client connection to avoid memory leaks
  })
  val future = promise.future          // Promise completion result
  Await.result(future, Duration.Inf)   // wait for the promise completion result
  // Return document sequence
  result_seq
}
Then you can iterate through the document sequence and pull the products into a List (better than Array).
def read: List[String] = {
  val document_seq = Database.find("userID", "1234", collection)
  // Set up an empty return map
  val return_map: mutable.Map[String, String] = mutable.Map.empty
  // Translate data from each document into Product object
  document_seq.foreach(_document => {
    return_map.put(
      _document("id").asString.getValue,
      _document("productId").asString.getValue
    )
  })
  // Convert values to list map and return
  return_map.values.toList
}
The Mongo Scala driver uses the Observable model, which is composed of three parts (Observable, Observer, and Subscription).
You need to subscribe an Observer to the Observable. Take a look at the examples.
The quickest solution is to complete with a toFuture call:
val Existratings =
  ratingCollection
    .find(equal("userId", 1234))
    .map { item => item.get("productId") }
    .toFuture()
That will return a Seq of BsonValues with the result set.
Or maybe this:
val productIds = ratingCollection
  .find(equal("userId", 1234))
  .map { _.get("productId") }
  .toArray
The most direct solution to get an Array is to fold directly into one:
ratingCollection
  .find(???)
  .map { ??? }
  .foldLeft(Array.empty[Item]) { _ :+ _ }
  .head() // in order to get a Future[Array[Item]]
  .onComplete {
    case Success(values: Array[Item]) => // capture the array
    case Failure(exception) => // fail logic
  }
It's probably best to work with the Future rather than build your own Observer logic for the subscription.
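As a concrete illustration of the Future-based route, here is a minimal sketch that blocks for the result and converts it into an Array (assuming blocking with Await is acceptable here, and that item.get("productId") returns an Option[BsonValue] as in the question's snippet):
import scala.concurrent.Await
import scala.concurrent.duration._

val productIds: Array[BsonValue] = Await.result(
  ratingCollection
    .find(equal("userId", 1234))
    .map(_.get("productId"))
    .toFuture(),
  30.seconds
).flatten.toArray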

Scala & Play Websockets: Storing messages exchanged

I started playing around with Scala and came across this particular boilerplate for a WebSocket chatroom in Scala.
It uses MergeHub.source() and BroadcastHub.sink() as the Source and Sink for sending messages to all connected clients.
The example works fine for exchanging messages as it is.
private val (chatSink, chatSource) = {
  // Don't log MergeHub$ProducerFailed as error if the client disconnects.
  // recoverWithRetries -1 is essentially "recoverWith"
  val source = MergeHub.source[WSMessage]
    .log("source")
    .recoverWithRetries(-1, { case _: Exception ⇒ Source.empty })
  val sink = BroadcastHub.sink[WSMessage]
  source.toMat(sink)(Keep.both).run()
}

private val userFlow: Flow[WSMessage, WSMessage, _] = {
  Flow.fromSinkAndSource(chatSink, chatSource)
}
def chat(): WebSocket = {
  WebSocket.acceptOrResult[WSMessage, WSMessage] {
    case rh if sameOriginCheck(rh) =>
      Future.successful(userFlow).map { flow =>
        Right(flow)
      }.recover {
        case e: Exception =>
          val msg = "Cannot create websocket"
          logger.error(msg, e)
          val result = InternalServerError(msg)
          Left(result)
      }
    case rejected =>
      logger.error(s"Request ${rejected} failed same origin check")
      Future.successful {
        Left(Forbidden("forbidden"))
      }
  }
}
I want to store the messages that are exchanged in the chatroom in a DB.
I tried adding map and fold functions to source and sink to get hold of the messages that are sent but I wasn't able to.
I tried adding a Flow stage between MergeHub and BroadcastHub like below
val flow = Flow[WSMessage].map(element => println(s"Message: $element"))
source.via(flow).toMat(sink)(Keep.both).run()
But it throws a compilation error saying that toMat cannot be referenced with such a signature.
Can someone help or point me to how I can get hold of the messages that are sent and store them in a DB?
Link for full template:
https://github.com/playframework/play-scala-chatroom-example
Let's look at your flow:
val flow = Flow[WSMessage].map(element => println(s"Message: $element"))
It takes elements of type WSMessage and returns nothing (Unit). Here it is again with the full type spelled out:
val flow: Flow[WSMessage, Unit, NotUsed] = Flow[WSMessage].map(element => println(s"Message: $element"))
This will clearly not work, as the sink expects WSMessage and not Unit.
Here's how you can fix the above problem:
val flow = Flow[WSMessage].map { element =>
  println(s"Message: $element")
  element
}
Note that for persisting messages in the database, you will most likely want to use an async stage, roughly:
val flow = Flow[WSMessage].mapAsync(parallelism) { element =>
  println(s"Message: $element")
  // assuming DB.write() returns a Future[Unit]
  DB.write(element).map(_ => element)
}
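Putting it together, a rough sketch of how such a persisting stage could sit between the hubs in the template above (DB.write here is a stand-in for your own persistence call returning a Future[Unit]; the implicit Materializer and ExecutionContext from the controller are assumed):
val persistingFlow = Flow[WSMessage].mapAsync(4) { element =>
  DB.write(element).map(_ => element) // store the message, then pass it downstream unchanged
}

private val (chatSink, chatSource) = {
  val source = MergeHub.source[WSMessage]
    .log("source")
    .recoverWithRetries(-1, { case _: Exception ⇒ Source.empty })
  val sink = BroadcastHub.sink[WSMessage]
  source.via(persistingFlow).toMat(sink)(Keep.both).run()
}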

How to get count of invalid data during parse

We are using Spark to parse a big CSV file, which may contain invalid data.
We want to save the valid data into the data store, and also return how many valid records we imported and how many were invalid.
I am wondering how we can do this in Spark. What's the standard approach when reading data?
My current approach uses an Accumulator, but it's not accurate due to how accumulators work in Spark (updates made inside transformations can be applied more than once when tasks are retried).
// we define case class CSVInputData: all fields are defined as string
val csvInput = spark.read.option("header", "true").csv(csvFile).as[CSVInputData]
val newDS = csvInput
  .flatMap { row =>
    Try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      data
    } match {
      case Success(map) => Seq(map)
      case _ =>
        errorAcc.add(1)
        Seq()
    }
  }
I tried to use Either, but it failed with the exception:
java.lang.NoClassDefFoundError: no Java class corresponding to Product with Serializable with scala.util.Either[xx.CSVInputData,xx.DomainData] found
Update
I think Either doesn't work with the Spark 2.0 Dataset API:
spark.read.option("header", "true").csv("any.csv").map { row =>
try {
Right("")
} catch { case e: Throwable => Left(""); }
}
If we change to use sc (the RDD API), it works:
sc.parallelize('a' to 'z').map { row =>
  try {
    Right("")
  } catch { case e: Throwable => Left("") }
}.collect()
In the current latest Scala (2.11, http://www.scala-lang.org/api/2.11.x/index.html#scala.util.Either), Either doesn't implement the Serializable trait:
sealed abstract class Either[+A, +B] extends AnyRef
In the upcoming 2.12 (http://www.scala-lang.org/api/2.12.x/scala/util/Either.html), it does:
sealed abstract class Either[+A, +B] extends Product with Serializable
Update 2 with workaround
More info at Spark ETL: Using Either to handle invalid data
As Spark Datasets don't work with Either, the workaround is to call ds.rdd and then use try/Left/Right to capture both valid and invalid data.
spark.read.option("header", "true").csv("/Users/yyuan/jyuan/1.csv").rdd.map ( { row =>
try {
Right("")
} catch { case e: Throwable => Left(""); }
}).collect()
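For illustration, a slightly fuller sketch of that RDD-based workaround that both keeps the valid records and counts each side (reusing the CSVInputData/DomainData pattern from the snippet above; caching avoids re-parsing the file for each count):
val parsed = csvInput.rdd.map { row =>
  try {
    val data = new DomainData()
    data.setScore(row.score.trim.toDouble)
    data.setDate(Util.parseDate(row.eventTime.trim))
    Right(data)
  } catch {
    case e: Throwable => Left(row)
  }
}.cache()

val invalidCount = parsed.filter(_.isLeft).count()
val validData = parsed.collect { case Right(data) => data } // RDD[DomainData] to save
val validCount = validData.count()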
Have you considered using an Either?
val counts = csvInput
  .map { row =>
    try {
      val data = new DomainData()
      data.setScore(row.score.trim.toDouble)
      data.setId(UUID.randomUUID().toString())
      data.setDate(Util.parseDate(row.eventTime.trim))
      data.setUpdateDate(new Date())
      Right(data)
    } catch {
      case e: Throwable => Left(row)
    }
  }
val failedCount = counts.filter(_.isLeft).count()
val successCount = counts.filter(_.isRight).count()
Did you try Spark DDQ? It has most of the data quality rules that you will need. You can even extend and customize it.
Link: https://github.com/FRosner/drunken-data-quality

Concurrent.patchPanel not sending data from multiple enumerators - only sends from last enumerator added

I'm hoping someone else has used patchPanel to combine multiple enumerators going down to a client over a WebSocket. The issue I'm running into is that the patchPanel only sends the data feed from the last enumerator added into it.
I followed the example from http://lambdaz.blogspot.ca/2012/12/play-21-multiplexing-enumerators-into.html, which is the only reference I've been able to find regarding patchPanel.
Versions: Play! 2.1.1 (using Java 1.7.0_11 and Scala 2.10.0)
The WebSocket method:
def monitorStream = WebSocket.async[JsValue] { request =>
  val promiseIn = promise[Iteratee[JsValue, Unit]]
  val out = Concurrent.patchPanel[JsValue] { patcher =>
    val in = Iteratee.foreach[JsValue] { json =>
      val event: Option[String] = (json \ "event").asOpt[String]
      val systemId = (json \ "systemId").as[Long]
      event.getOrElse("") match {
        case "join" =>
          val physicalSystem = SystemIdHandler.getById(systemId)
          val monitorOut = MonitorStreamActor.joinMonitor(physicalSystem)
          monitorOut map { enum =>
            val success = patcher.patchIn(enum)
          }
      }
    }.mapDone { _ => Logger.info("Disconnected") }
    promiseIn.success(in)
  }
  future((Iteratee.flatten(promiseIn.future), out))
}
The MonitorStreamActor call:
def joinMonitor(physicalSystem: PhysicalSystem): scala.concurrent.Future[Enumerator[JsValue]] = {
  val monitorActor = ActorBase.akkaSystem.actorFor("/user/system-" + physicalSystem.name + "/stream")
  (monitorActor ? MonitorJoin()).map {
    case MonitorConnected(enumerator) => enumerator
  }
}
The enumerator is returned fine, and the data fed into it is coming from a timer calling the actor. Actor definition; the timer hits the UpdatedTranStates case:
class MonitorStreamActor() extends Actor {
  val (monitorEnumerator, monitorChannel) = Concurrent.broadcast[JsValue]
  import play.api.Play.current

  def receive = {
    case MonitorJoin() =>
      Logger.debug("Actor monitor join")
      sender ! MonitorConnected(monitorEnumerator)
    case UpdatedTranStates(systemName, tranStates) =>
      //println("Got updated Tran States")
      val json = Json.toJson(tranStates.map(m => Map("State" -> m._1, "Count" -> m._2)))
      //println("Pushing updates to monitorChannel")
      sendUpdateToClients(systemName, "states", json)
  }

  def sendUpdateToClients(systemName: String, updateType: String, json: JsValue) {
    monitorChannel.push(Json.toJson(
      Map(
        "dataType" -> Json.toJson(updateType),
        "systemName" -> Json.toJson(systemName),
        "data" -> json)))
  }
}
I've poked around on this for a while and haven't found a reason why only the last enumerator added into the patchPanel has its data sent. The API docs are not much help; it really sounds like all you have to do is call patchIn and it should combine all the enumerators, but that doesn't seem to be the case.
The PatchPanel by design replaces the current enumerator with the new one provided by the patchIn method.
In order to have multiple Enumerators combined, you need to use the interleave or andThen methods. Interleave is preferred for this case, as it takes events from each Enumerator as they become available, versus emptying one and then moving to the next (as the andThen operator does).
I.e., in monitorStream:
val monitorOut = MonitorStreamActor.joinMonitor(physicalSystem)
monitorOut map { enum =>
  mappedEnums += ((physicalSystem.name, enum))
  patcher.patchIn(Enumerator.interleave[JsValue](mappedEnums.values.toSeq))
}
patcher is the patch panel, and mappedEnums is a HashMap[String, Enumerator[JsValue]]. Re-patch each time the Enumerators change (add or delete). It works; not sure if it's the best way, but it'll do for now :)
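For completeness, a minimal sketch of the pieces that snippet assumes (the mutable map lives next to the patchPanel callback in monitorStream; the names come from the answer above):
import scala.collection.mutable
import play.api.libs.iteratee.Enumerator
import play.api.libs.json.JsValue

// One entry per joined system; the interleave of all current values is re-patched whenever this changes
val mappedEnums = mutable.HashMap.empty[String, Enumerator[JsValue]]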