handle spark state istimingout - scala

I'm using the new mapWithState function in spark streaming (1.6) with a timing out state. I want to use the timing out state and add it to another rdd in order to use it for calculations further down the road:
val aggedlogs = sc.emptyRDD[MyLog];
val mappingFunc = (key: String, newlog: Option[MyLog], state: State[MyLog]) => {
val _newLog = newlog.getOrElse(null)
if ((state.exists())&&(_newLog!=null))
{
val stateLog = state.get()
val combinedLog = LogUtil.CombineLogs(_newLog, stateLog);
state.update(combinedLog)
}
else if (_newLog !=null) {
state.update(_newLog);
}
if (state.isTimingOut())
{
val stateLog = state.get();
aggedlogs.union(sc.parallelize(List(stateLog), 1))
}
val stateLog = state.get();
(key,stateLog);
}
val stateDstream = reducedlogs.mapWithState(StateSpec.function(mappingFunc).timeout(Seconds(10)))
but when I try to add it to an rdd in the StateSpec function, I get an error that the function is not serializable. Any thoughts on how I can get pass this?
EDIT:
After drilling deeper i found that my approach was wrong. before trying this solution i tried to get the timing out logs from the statesnapeshot(), but they were not there anymore, changing the mapping function to :
def mappingFunc(key: String, newlog: Option[MyLog], state: State[KomoonaLog]) : Option[(String, MyLog)] = {
val _newLog = newlog.getOrElse(null)
if ((state.exists())&&(_newLog!=null))
{
val stateLog = state.get()
val combinedLog = LogUtil.CombineLogs(_newLog, stateLog);
state.update(combinedLog)
Some(key,combinedLog);
}
else if (_newLog !=null) {
state.update(_newLog);
Some(key,_newLog);
}
if (state.isTimingOut())
{
val stateLog = state.get();
stateLog.timinigOut = true;
System.out.println("timinigOut : " +key );
Some(key, stateLog);
}
val stateLog = state.get();
Some(key,stateLog);
}
i managed to filter the mapedwithstatedstream for the logs that are timing out in each batch:
val stateDstream = reducedlogs.mapWithState(
StateSpec.function(mappingFunc _).timeout(Seconds(60)))
val tiningoutlogs= stateDstream.filter (filtertimingout)

Related

why mapPartitions does not see my val - SCALA/SPARK?

I define val like this :
val config = Config(args)
val product_type = config.product_type
then I send product_type as "AA"
and my code is this :
val scores = df.mapPartitions(iterator => {
val inputStream =
if(product_type == "AA" ) {
getClass().getClassLoader().getResourceAsStream("my_aa.hdf5")
}
else {
getClass().getClassLoader().getResourceAsStream("my_bb.hdf5")
}
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
But my inputStream equals "my_bb.hdf5" which is not correct. This value comes from else statement. So why my product_type variable cant read in mappartition?
I print my product_type value before code and I checked it , it is : "AA"
it occurs because of i get this variable from argument in spark submit.sh
and it can not read from mappartition.
It works like this:
val scores =
if (product_type == "AA") {
df.mapPartitions(iterator => {
val inputStream = getClass().getClassLoader().getResourceAsStream("AA.hdf5")
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
} else {
df.mapPartitions(iterator => {
val inputStream = getClass().getClassLoader().getResourceAsStream("BB.hdf5")
val multiLayerNetwork: MultiLayerNetwork = KerasModelImport.importKerasSequentialModelAndWeights(inputStream, false)
val wrapped: ParallelInference = new ParallelInference.Builder(multiLayerNetwork).build()
val res = iterator.map(row => {
wrapped.output(row).toDoubleVector
})
res
})
}

How to persist the list which we made dynamically from dataFrame in scala spark

def getAnimalName(dataFrame: DataFrame): List[String] = {
dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().collect.toList
}
I am basicaly Calling this function 2 times For getting the list for different purposes . I just want to know is there a way to retain the list in memory and we dont have to call the same function again and again to generate the list and only have to generate the list only one time in scala spark.
Try something as below and you can also check the performance using time func.
Also find the code explanation inline
import org.apache.spark.rdd
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, functions}
object HandleCachedDF {
var cachedAnimalDF : rdd.RDD[String] = _
def main(args: Array[String]): Unit = {
val spark = Constant.getSparkSess
val df = spark.read.json("src/main/resources/hugeTest.json") // Load your Dataframe
val df1 = time[rdd.RDD[String]] {
getAnimalName(df)
}
val resultList = df1.collect().toList
val df2 = time{
getAnimalName(df)
}
val resultList1 = df2.collect().toList
println(resultList.equals(resultList1))
}
def getAnimalName(dataFrame: DataFrame): rdd.RDD[String] = {
if (cachedAnimalDF == null) { // Check if this the first initialization of your dataframe
cachedAnimalDF = dataFrame.select("animal").
filter(functions.col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().cache() // Cache your dataframe
}
cachedAnimalDF // Return your cached dataframe
}
def time[R](block: => R): R = { // COmpute the time taken by function to execute
val t0 = System.nanoTime()
val result = block // call-by-name
val t1 = System.nanoTime()
println("Elapsed time: " + (t1 - t0) + "ns")
result
}
}
You would have to persist or cache at this point
dataFrame.select("animal").
filter(col("animal").isNotNull && col("animal").notEqual("")).
rdd.map(r => r.getString(0)).distinct().persist
and then call the function as follow
def getAnimalName(dataFrame: DataFrame): List[String] = {
dataFrame.collect.toList
}
as many times as you need it without repeat the process.
I hope it helps.

Unable to Analyse data

val patterns = ctx.getBroadcastState(patternStateDescriptor)
The imports I made
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.common.state.{MapStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.api.scala.typeutils.Types
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.datastream.BroadcastStream
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
Here's the code
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers","localhost:9092")
val patternStream = new FlinkKafkaConsumer010("patterns", new SimpleStringSchema, properties)
val patterns = env.addSource(patternStream)
var patternData = patterns.map {
str =>
val splitted_str = str.split(",")
PatternStream(splitted_str(0).trim, splitted_str(1).trim, splitted_str(2).trim)
}
val logsStream = new FlinkKafkaConsumer010("logs", new SimpleStringSchema, properties)
// logsStream.setStartFromEarliest()
val logs = env.addSource(logsStream)
var data = logs.map {
str =>
val splitted_str = str.split(",")
LogsTest(splitted_str.head.trim, splitted_str(1).trim, splitted_str(2).trim)
}
val keyedData: KeyedStream[LogsTest, String] = data.keyBy(_.metric)
val bcStateDescriptor = new MapStateDescriptor[Unit, PatternStream]("patterns", Types.UNIT, Types.of[PatternStream]) // first type defined is for the key and second data type defined is for the value
val broadcastPatterns: BroadcastStream[PatternStream] = patternData.broadcast(bcStateDescriptor)
val alerts = keyedData
.connect(broadcastPatterns)
.process(new PatternEvaluator())
alerts.print()
// println(alerts.getClass)
// val sinkProducer = new FlinkKafkaProducer010("output", new SimpleStringSchema(), properties)
env.execute("Flink Broadcast State Job")
}
class PatternEvaluator()
extends KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)] {
private lazy val patternStateDescriptor = new MapStateDescriptor("patterns", classOf[String], classOf[String])
private var lastMetricState: ValueState[String] = _
override def open(parameters: Configuration): Unit = {
val lastMetricDescriptor = new ValueStateDescriptor("last-metric", classOf[String])
lastMetricState = getRuntimeContext.getState(lastMetricDescriptor)
}
override def processElement(reading: LogsTest,
readOnlyCtx: KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)]#ReadOnlyContext,
out: Collector[(String, String, String)]): Unit = {
val metrics = readOnlyCtx.getBroadcastState(patternStateDescriptor)
if (metrics.contains(reading.metric)) {
val metricPattern: String = metrics.get(reading.metric)
val metricPatternValue: String = metrics.get(reading.value)
val lastMetric = lastMetricState.value()
val logsMetric = (reading.metric)
val logsValue = (reading.value)
if (logsMetric == metricPattern) {
if (metricPatternValue == logsValue) {
out.collect((reading.timestamp, reading.value, reading.metric))
}
}
}
}
override def processBroadcastElement(
update: PatternStream,
ctx: KeyedBroadcastProcessFunction[String, LogsTest, PatternStream, (String, String, String)]#Context,
out: Collector[(String, String, String)]
): Unit = {
val patterns = ctx.getBroadcastState(patternStateDescriptor)
if (update.metric == "IP") {
patterns.put(update.metric /*,update.operator*/ , update.value)
}
// else if (update.metric == "username"){
// patterns.put(update.metric, update.value)
// }
// else {
// println("No required data found")
// }
// }
}
}
Sample Data :- Logs Stream
"21/09/98","IP", "5.5.5.5"
Pattern Stream
"IP","==","5.5.5.5"
I'm unable to analyse data by getting desired result, i.e = 21/09/98,IP,5.5.5.5
There's no error as of now, it's just not analysing the data
The code is reading streams (Checked)
One common source of trouble in cases like this is that the API offers no control over the order in which the patterns and the data are ingested. It could be that processElement is being called before processBroadcastElement.

Why does my Akka data stream stops processing a huge file (~250,000 lines of strings) but works for small file?

My stream works for smaller file of 1000 lines but stops when I test it on a large file ~12MB and ~250,000 lines? I tried applying backpressure with a buffer and throttling it and still same thing...
Here is my data streamer:
class UserDataStreaming(usersFile: File) {
implicit val system = ActorSystemContainer.getInstance().getSystem
implicit val materializer = ActorSystemContainer.getInstance().getMaterializer
def startStreaming() = {
val graph = RunnableGraph.fromGraph(GraphDSL.create() {
implicit builder =>
val usersSource = builder.add(Source.fromIterator(() => usersDataLines)).out
val stringToUserFlowShape: FlowShape[String, User] = builder.add(csvToUser)
val averageAgeFlowShape: FlowShape[User, (String, Int, Int)] = builder.add(averageUserAgeFlow)
val averageAgeSink = builder.add(Sink.foreach(averageUserAgeSink)).in
usersSource ~> stringToUserFlowShape ~> averageAgeFlowShape ~> averageAgeSink
ClosedShape
})
graph.run()
}
val usersDataLines = scala.io.Source.fromFile(usersFile, "ISO-8859-1").getLines().drop(1)
val csvToUser = Flow[String].map(_.split(";").map(_.trim)).map(csvLinesArrayToUser)
def csvLinesArrayToUser(line: Array[String]) = User(line(0), line(1), line(2))
def averageUserAgeSink[usersSource](source: usersSource) {
source match {
case (age: String, count: Int, totalAge: Int) => println(s"age = $age; Average reader age is: ${Try(totalAge/count).getOrElse(0)} count = $count and total age = $totalAge")
case bad => println(s"Bad case: $bad")
}
}
def averageUserAgeFlow = Flow[User].fold(("", 0, 0)) {
(nums: (String, Int, Int), user: User) =>
var counter: Option[Int] = None
var totalAge: Option[Int] = None
val ageInt = Try(user.age.substring(1, user.age.length-1).toInt)
if (ageInt.isSuccess) {
counter = Some(nums._2 + 1)
totalAge = Some(nums._3 + ageInt.get)
}
else {
counter = Some(nums._2 + 0)
totalAge = Some(nums._3 + 0)
}
//println(counter.get)
(user.age, counter.get, totalAge.get)
}
}
Here is my Main:
object Main {
def main(args: Array[String]): Unit = {
implicit val system = ActorSystemContainer.getInstance().getSystem
implicit val materializer = ActorSystemContainer.getInstance().getMaterializer
val usersFile = new File("data/BX-Users.csv")
println(usersFile.length())
val userDataStreamer = new UserDataStreaming(usersFile)
userDataStreamer.startStreaming()
}
It´s possible that there may be any error related to one row of your csv file. In that case, the stream materializes and stops. Try to define your flows like that:
FlowFlowShape[String, User].map {
case (user) => try {
csvToUser(user)
}
}.withAttributes(ActorAttributes.supervisionStrategy {
case ex: Throwable =>
log.error("Error parsing row event: {}", ex)
Supervision.Resume
}
In this case the possible exception is captured and the stream ignores the error and continues.
If you use Supervision.Stop, the stream stops.

Perfomance using orientdb graph api vs sql query

Actually i'm using this api for standards operations (Read, Remove, Find, Save)
http://orientdb.com/docs/last/Graph-Database-Tinkerpop.html
I noticed that the performance of this method for remove are really bad
object Odb {
val factory = new OrientGraphFactory("remote:localhost:2424/recommendation-system","root","root").setupPool(1,10)
def clearDb = {
val graph = factory.getNoTx
val vertices = graph.getVertices().asScala.map(v => v.remove())
}
}
object TagsOdb extends TagsDao {
override def count: Future[Long] = Future {
val graph = Odb.factory.getNoTx
val count = graph.countVertices("Tags")
count
}
override def update(newTag: Tag, oldTag: Tag): Future[Boolean] = Future { synchronized {
val graph = Odb.factory.getTx
val tagVertices = graph.getVertices("Tags.tag",oldTag.flatten).asScala
if(tagVertices.isEmpty) throw new Exception("Tag not found: "+oldTag.id)
tagVertices.head.setProperty("tag",newTag.flatten)
graph.commit()
true
}}
override def all: Future[List[Tag]] = Future {
val graph = Odb.factory.getNoTx
val tagVertices = graph.getVerticesOfClass("Tags").asScala
val tagList = tagVertices.map(v => Tag(v.getProperty("tag"),None)).toList
tagList
}
override def remove(e: Tag): Future[Boolean] = Future { synchronized {
val graph = Odb.factory.getTx
val tagVertices = graph.getVertices("Tags.tag",e.flatten).asScala
if(tagVertices.isEmpty) throw new Exception("Tag not found: "+e.flatten)
tagVertices.head.remove()
graph.commit()
true
}}
override def save(e: Tag, upsert: Boolean = false): Future[Boolean] = Future { synchronized {
val graph = Odb.factory.getTx
val v = graph.getVertices("Tags.tag",e.flatten).asScala
if(v.nonEmpty) {
if (upsert)
v.head.setProperty("tag", e.flatten)
else
throw new Exception("Element already in database")
} else {
val tagVertex = graph.addVertex("Tags", null)
tagVertex.setProperty("tag", e.flatten)
}
graph.commit()
true
}}
override def find(query: String): Future[List[Tag]] = Future {
val graph = Odb.factory.getNoTx
val res: OrientDynaElementIterable = graph.command(new OCommandSQL(query)).execute()
val ridTags: Iterable[Vertex] = res.asScala.asInstanceOf[Iterable[Vertex]]
def getTag(rid: AnyRef): Tag = {
val tagVertex = graph.getVertex(rid)
Tag(tagVertex.getProperty("tag"),None)
}
ridTags.map(r => getTag(r)).toList
}
}
is there any way to get better performance ?
I should use SQL queries?
Since you're connected via remote interface, foreach remove you have a RPC. Try to execute a DELETE VERTEX on the server.