Materialize mapWithState stateSnapShots to database for later resume of spark streaming app - scala

I have a Spark Scala streaming app that sessionizes user-generated events coming from Kafka, using mapWithState. I want to mature the setup by making it possible to pause and resume the app for maintenance. I’m already writing Kafka offset information to a database, so when restarting the app I can pick up at the last offset processed. But I also want to keep the state information.
So my goal is to:
1. materialize session information after a key identifying the user times out.
2. materialize a .stateSnapshots() when I gracefully shut down the application, so I can use that data when restarting the app by feeding it as a parameter to StateSpec.
Number 1 is working; with 2 I have issues.
For the sake of completeness, I also describe 1 because I’m always interested in a better solution for it:
1) materializing session info after key time out
Inside my update function for mapWithState, I have:
if (state.isTimingOut()) {
// if key is timing out.
val output = (key, stateFilterable(isTimingOut = true
, start = state.get().start
, end = state.get().end
, duration = state.get().duration
))
That isTimingOut boolean I then later on use as:
streamParsed
.filter(a => a._2.isTimingOut)
.foreachRDD(rdd =>
rdd
.map(stuff => Model(key = stuff._1,
start = stuff._2.start,
duration = stuff._2.duration))
.saveToCassandra(keyspaceName, tableName)
)
2) materialize a .stateSnapshot() with graceful shutdown
Materializing the snapshot info doesn’t work. What I tried:
// define a class Listener
class Listener(ssc: StreamingContext, state: DStream[(String, stateFilterable)]) extends Runnable {
def run {
if( ssc == null )
System.out.println("The spark context is null")
else
System.out.println("The spark context is fine!!!")
var input = "continue"
while( !input.equals("D")) {
input = readLine("Press D to kill: ")
System.out.println(input + " " + input.equals("D"))
}
System.out.println("Accessing snapshot and saving:")
state.foreachRDD(rdd =>
rdd
.map(stuff => Model(key = stuff._1,
start = stuff._2.start,
duration = stuff._2.duration))
.saveToCassandra("some_keyspace", "some_table")
)
System.out.println("Stopping context!")
ssc.stop(true, true)
System.out.println("We have stopped!")
}
}
// Inside the app object:
val state = streamParsed.stateSnapshots()
var listener = new Thread(new Listener(ssc, state))
listener.start()
So the full code becomes:
package main.scala.cassandra_sessionizing
import java.text.SimpleDateFormat
import java.util.Calendar
import org.apache.spark.streaming.dstream.{DStream, MapWithStateDStream}
import scala.collection.immutable.Set
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming._
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType, LongType, ArrayType, IntegerType}
import _root_.kafka.serializer.StringDecoder
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
case class userAction(datetimestamp: Double
, action_name: String
, user_key: String
, page_id: Integer
)
case class actionTuple(pages: scala.collection.mutable.Set[Int]
, start: Double
, end: Double)
case class stateFilterable(isTimingOut: Boolean
, start: Double
, end: Double
, duration: Int
, pages: Set[Int]
, events: Int
)
case class Model(user_key: String
, start: Double
, duration: Int
, pages: Set[Int]
, events: Int
)
class Listener(ssc: StreamingContext, state: DStream[(String, stateFilterable)]) extends Runnable {
def run {
var input = "continue"
while( !input.equals("D")) {
input = readLine("Press D to kill: ")
System.out.println(input + " " + input.equals("D"))
}
// Accessing snapshot and saving:
state.foreachRDD(rdd =>
rdd
.map(stuff => Model(user_key = stuff._1,
start = stuff._2.start,
duration = stuff._2.duration,
pages = stuff._2.pages,
events = stuff._2.events))
.saveToCassandra("keyspace1", "snapshotstuff")
)
// Stopping context
ssc.stop(true, true)
}
}
object cassandra_sessionizing {
// where we'll store the stuff in Cassandra
val tableName = "sessionized_stuff"
val keyspaceName = "keyspace1"
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("cassandra-sessionizing")
.set("spark.cassandra.connection.host", "10.10.10.10")
.set("spark.cassandra.auth.username", "keyspace1")
.set("spark.cassandra.auth.password", "blabla")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// setup the cassandra connector and recreate the table we'll use for storing the user session data.
val cc = CassandraConnector(conf)
cc.withSessionDo { session =>
session.execute(s"""DROP TABLE IF EXISTS $keyspaceName.$tableName;""")
session.execute(
s"""CREATE TABLE IF NOT EXISTS $keyspaceName.$tableName (
user_key TEXT
, start DOUBLE
, duration INT
, pages SET<INT>
, events INT
, PRIMARY KEY(user_key, start)) WITH CLUSTERING ORDER BY (start DESC)
;""")
}
// setup the streaming context and make sure we can checkpoint, given we're using mapWithState.
val ssc = new StreamingContext(sc, Seconds(60))
ssc.checkpoint("hdfs:///user/keyspace1/streaming_stuff/")
// Defining the stream connection to Kafka.
val kafkaStream = {
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
Map("metadata.broker.list" -> "kafka1.prod.stuff.com:9092,kafka2.prod.stuff.com:9092"), Set("theTopic"))
}
// this schema definition is needed so the json string coming from Kafka can be parsed into a dataframe using spark read.json.
// if an event does not conform to this structure, it will result in all null values, which are filtered out later.
val struct = StructType(
StructField("datetimestamp", DoubleType, nullable = true) ::
StructField("action", StructType( // named "action" to match the selectExpr("action.name", ...) calls further down
StructField("user_key", StringType, nullable = true) ::
StructField("page_id", IntegerType, nullable = true) ::
StructField("name", StringType, nullable = true) :: Nil), nullable = true) :: Nil
)
/*
this is the function needed to keep track of an user key's session.
3 options:
1) key already exists, and new values are coming in to be added to the state.
2) key is new, so initialize the state with the incoming value
3) key is timing out, so mark it with a boolean that can be used by filtering later on. Given the boolean, the data can be materialized to cassandra.
*/
def trackStateFunc(batchTime: Time
, key: String
, value: Option[actionTuple]
, state: State[stateFilterable])
: Option[(String, stateFilterable)] = {
// 1 : if key already exists and we have a new value for it
if (state.exists() && value.orNull != null) {
var current_set = state.getOption().get.pages
var current_start = state.getOption().get.start
var current_end = state.getOption().get.end
if (value.get.pages != null) {
current_set ++= value.get.pages
}
current_start = Array(current_start, value.get.start).min // the starting epoch is used to initialize the state, but maybe some earlier events are processed a bit later.
current_end = Array(current_end, value.get.end).max // always update the end time of the session with new events coming in.
val new_event_counter = state.getOption().get.events + 1
val new_output = stateFilterable(isTimingOut = false
, start = current_start
, end = current_end
, duration = (current_end - current_start).toInt
, pages = current_set
, events = new_event_counter)
val output = (key, new_output)
state.update(new_output)
return Some(output)
}
// 2: if key does not exist and we have a new value for it
else if (value.orNull != null) {
var new_set: Set[Int] = Set()
val current_value = value.get.pages
if (current_value != null) {
new_set ++= current_value
}
val event_counter = 1
val current_start = value.get.start
val current_end = value.get.end
val new_output = stateFilterable(isTimingOut = false
, start = current_start
, end = current_end
, duration = (current_end - current_start).toInt
, pages = new_set
, events = event_counter)
val output = (key, new_output)
state.update(new_output)
return Some(output)
}
// 3: if key is timing out
if (state.isTimingOut()) {
val output = (key, stateFilterable(isTimingOut = true
, start = state.get().start
, end = state.get().end
, duration = state.get().duration
, pages = state.get().pages
, events = state.get().events
))
return Some(output)
}
// this part of the function should never be reached.
throw new Error(s"Entered dead end with $key $value")
}
// defining the state specification used later on as a step in the stream pipeline.
val stateSpec = StateSpec.function(trackStateFunc _)
.numPartitions(16)
.timeout(Seconds(4000))
// RDD 1
val streamParsedRaw = kafkaStream
.map { case (k, v: String) => v } // key is empty, so get the value containing the json string.
.transform { rdd =>
val df = sqlContext.read.schema(struct).json(rdd) // apply schema defined above and parse the json into a dataframe,
.selectExpr("datetimestamp"
, "action.name AS action_name"
, "action.user_key"
, "action.page_id"
)
df.as[userAction].rdd // transform dataframe into spark Dataset so we easily cast to the case class userAction.
}
val initialCount = actionTuple(pages = collection.mutable.Set(), start = 0.0, end = 0.0)
val addToCounts = (left: actionTuple, ua: userAction) => {
val current_start = ua.datetimestamp
val current_end = ua.datetimestamp
if (ua.page_id != null) left.pages += ua.page_id
actionTuple(left.pages, current_start, current_end)
}
val sumPartitionCounts = (p1: actionTuple, p2: actionTuple) => {
val current_start = Array(p1.start, p2.start).min
val current_end = Array(p1.end, p2.end).max
actionTuple(p1.pages ++= p2.pages, current_start, current_end)
}
// RDD 2: add the mapWithState part.
val streamParsed = streamParsedRaw
.map(s => (s.user_key, s)) // create key value tuple so we can apply the mapWithState to the user_key.
.transform(rdd => rdd.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)) // reduce to one row per user key for each batch.
.mapWithState(stateSpec)
// RDD 3: if the app is shutdown, this rdd should be materialized.
val state = streamParsed.stateSnapshots()
state.print(2)
// RDD 4: Crucial: look up sessions timing out, extract the fields that we want to keep and materialize in Cassandra.
streamParsed
.filter(a => a._2.isTimingOut)
.foreachRDD(rdd =>
rdd
.map(stuff => Model(user_key = stuff._1,
start = stuff._2.start,
duration = stuff._2.duration,
pages = stuff._2.pages,
events = stuff._2.events))
.saveToCassandra(keyspaceName, tableName)
)
// add a listener hook that we can use to gracefully shutdown the app and materialize the RDD containing the state snapshots.
var listener = new Thread(new Listener(ssc, state))
listener.start()
ssc.start()
ssc.awaitTermination()
}
}
But when I run this (launching the app, waiting several minutes for some state information to build up, and then entering 'D'), I get the exception below. So I can't do anything 'new' with a DStream once the ssc has been started. I hoped to move from a DStream RDD to a regular RDD, quit the streaming context, and wrap up by saving the normal RDD, but I don't know how. Hope someone can help!
Exception in thread "Thread-52" java.lang.IllegalStateException: Adding new inputs, transformations, and output operations after starting a context is not supported
at org.apache.spark.streaming.dstream.DStream.validateAtInit(DStream.scala:222)
at org.apache.spark.streaming.dstream.DStream.<init>(DStream.scala:64)
at org.apache.spark.streaming.dstream.ForEachDStream.<init>(ForEachDStream.scala:34)
at org.apache.spark.streaming.dstream.DStream.org$apache$spark$streaming$dstream$DStream$$foreachRDD(DStream.scala:687)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply$mcV$sp(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:659)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:659)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at org.apache.spark.streaming.StreamingContext.withScope(StreamingContext.scala:260)
at org.apache.spark.streaming.dstream.DStream.foreachRDD(DStream.scala:659)
at main.scala.feaUS.Listener.run(feaUS.scala:119)
at java.lang.Thread.run(Thread.java:745)

There are 2 main changes to the code which should make it work
1> Use the checkpointed directory to start the spark streaming context.
val ssc = StreamingContext.getOrCreate(checkpointDirectory,
() => createContext(checkpointDirectory));
where the createContext method contains the logic to create and define the new streams, and stores the checkpointed data in checkpointDirectory.
2> The sql context needs to be constructed in a slightly different way.
val streamParsedRaw = kafkaStream
.map { case (k, v: String) => v } // key is empty, so get the value containing the json string.
.map(s => s.replaceAll("""(\"hotel_id\")\:\"([0-9]+)\"""", "\"hotel_id\":$2")) // some events contain the hotel_id in quotes, making it a string. remove these quotes.
.transform { rdd =>
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
import sqlContext.implicits._
val df = sqlContext.read.schema(struct).json(rdd) // apply schema defined above and parse the json into a dataframe,
.selectExpr("__created_epoch__ AS created_epoch" // the parsed json dataframe needs a bit of type cleaning and name changing
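A rough sketch of what the first change could look like as a whole. This is not the answerer's actual code: the socket source and the count().print() output are placeholders I made up so the example is complete, and the checkpoint path is taken from the command line. In the real app, the Kafka stream, mapWithState, stateSnapshots and every foreachRDD output would be defined inside createContext.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {

  // Builds a brand-new context. getOrCreate only calls this when no usable
  // checkpoint is found in checkpointDirectory.
  def createContext(checkpointDirectory: String): StreamingContext = {
    val conf = new SparkConf().setAppName("cassandra-sessionizing")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDirectory)

    // Placeholder pipeline (assumption): replace with the Kafka direct stream,
    // mapWithState and the Cassandra outputs. All of it has to be set up here,
    // before the context is returned and started.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    val checkpointDirectory = args(0) // e.g. an HDFS path

    // Recover from the checkpoint if one exists, otherwise build a new context.
    val ssc = StreamingContext.getOrCreate(
      checkpointDirectory,
      () => createContext(checkpointDirectory))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Keep in mind that a context restored from a checkpoint uses the stream definitions stored in that checkpoint, so nothing should be added to the streams outside createContext.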

I feel your pain! While checkpointing is useful, it does not actually work if the code changes, and we change the code frequently!
What we are doing is to save the state, as json, every cycle, to hbase. So, if snapshotStream is your stream with the state info, we simply save it, as json, to hbase each window. While expensive, it is the only way we can guarantee the state is available upon restart even if the code changes.
Upon startup we load it, deserialize it, and pass it to the stateSpec as the initial rdd.
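To make that concrete, here is a minimal sketch of the save-every-batch and reload-on-startup idea. It is only an illustration under several assumptions: the state is simplified to an Int count per key, the snapshot is written as tab-separated text files instead of JSON in HBase, the socket source is a placeholder, and the names trackCount, encode and decode are invented for this example. The parts that come from Spark itself are StateSpec.initialState, stateSnapshots and foreachRDD. Note that the foreachRDD on the snapshot stream is registered before ssc.start(), which is what the exception in the question is complaining about.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext, Time}

object InitialStateSketch {

  // Simplified state: a running event count per user key.
  val trackCount = (key: String, value: Option[Int], state: State[Int]) => {
    val total = state.getOption().getOrElse(0) + value.getOrElse(0)
    state.update(total)
    (key, total)
  }

  // Hypothetical text encoding of one state entry (the answer uses JSON).
  def encode(entry: (String, Int)): String = s"${entry._1}\t${entry._2}"

  def decode(line: String): (String, Int) = {
    val Array(key, count) = line.split("\t", 2)
    (key, count.toInt)
  }

  def main(args: Array[String]): Unit = {
    val checkpointDir = args(0)
    val snapshotDir = args(1)                               // where snapshots are written
    val previousSnapshot = if (args.length > 2) Some(args(2)) else None

    val conf = new SparkConf().setAppName("initial-state-sketch")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir)

    // On startup: load the last saved snapshot, or start with empty state.
    val restored: RDD[(String, Int)] = previousSnapshot match {
      case Some(path) => ssc.sparkContext.textFile(path).map(decode)
      case None       => ssc.sparkContext.parallelize(Seq.empty[(String, Int)])
    }

    val spec = StateSpec.function(trackCount).initialState(restored)

    // Placeholder input (assumption): one event per line, the line is the key.
    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))
    val mapped = events.mapWithState(spec)

    // Registered BEFORE ssc.start(): every batch writes the full state to a
    // new directory named after the batch time.
    mapped.stateSnapshots().foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>
      rdd.map(encode).saveAsTextFile(s"$snapshotDir/${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this layout, a graceful ssc.stop(true, true) needs no extra output step at shutdown: the directory of the last completed batch already holds the state to feed back in on the next start.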

Related

not able to store result in hdfs when code runs for second iteration

I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks for missing values in one column, stores them in outputrdd, and loops over them to calculate each missing value. The code works well when there is only one missing value in the file. Since HDFS does not allow writing to the same location again, it fails if there is more than one missing value. Can you please assist in writing finalrdd to a particular location once the missing values for all occurrences have been calculated?
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("app").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val files = sc.wholeTextFiles("/input/raw_files/")
val file = files.map { case (filename, content) => filename }
file.collect.foreach(filename => {
cleaningData(filename)
})
def cleaningData(file: String) = {
//headers has column headers of the files
var hdr = headers.toString()
var vl = hdr.split("\t")
sqlContext.clearCache()
if (hdr.contains("COLUMN_HEADER")) {
//Checks for missing values in dataframe and stores missing values' in outputrdd
if (!outputrdd.isEmpty()) {
logger.info("value is zero then performing further operation")
val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
val outputdatetimerdd = outputdatetimedf.rdd
val strings = outputdatetimerdd.map(row => row.mkString).collect()
for (i <- strings) {
if (Coddition check) {
//Calculates missing value and stores in finalrdd
finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
logger.info("file is written in file")
}
}
}
}
}
}
It is not clear how (Coddition check) works in your example.
In any case function .saveAsTextFile("/output") should be called only once.
So I would rewrite your example into this:
val strings = outputdatetimerdd
.map(row => row.mkString) // no '.collect()' here: it has to stay an RDD so that saveAsTextFile is available below
val finalrdd = strings
.filter(str => Coddition check str) //don't know how this Coddition works
.map (x => x.mkString("\t"))
// this part is called only once but not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")

Spark DF: Schema for type Unit is not supported

I am new to Scala and Spark and am trying to build on some samples I found. Essentially, I am trying to call a function from within a DataFrame to get the state from a zip code using the Google API.
I have the pieces of code working separately, but not together.
Here is the piece of code not working...
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2837)
at MovieRatings$.getstate(MovieRatings.scala:51)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:48)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:47)...
Line 51 starts with def getstate = udf {(zipcode:String)...
...
code:
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, zipcode as state FROM Users")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("transformed") else c)
val newDF = zipcodesDF.select(mappedCols:_*).show()
}
def getstate = udf {(zipcode:String) => {
val url = "http://maps.googleapis.com/maps/api/geocode/json?address="+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val shortnames = for {
JObject(address_components) <- address
JField("short_name", short_name) <- address_components
} yield short_name
val state = shortnames(3)
//return state.toString()
val stater = state.toString()
}
}
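For what it's worth, the exception is consistent with how Scala types a block: when the last statement of a block is a definition (here val stater = ...), the block evaluates to Unit, so the function handed to udf has return type Unit. The following small sketch shows the same effect without Spark; the names broken and fixed and the use of trim are made up for the illustration.

```scala
object UnitBlockDemo {

  // Mirrors the shape of the failing UDF body: the last statement is a `val`
  // definition, so the block evaluates to Unit and the type is String => Unit.
  val broken = (zipcode: String) => {
    val stater = zipcode.trim
  }

  // Ending the block with an expression makes the type String => String.
  val fixed = (zipcode: String) => {
    val stater = zipcode.trim
    stater
  }

  def main(args: Array[String]): Unit = {
    val b: Unit = broken(" 90210 ")
    val f: String = fixed(" 90210 ")
    println(s"broken returned: $b, fixed returned: $f")
    // prints "broken returned: (), fixed returned: 90210"
  }
}
```

Applied to the UDF, that would mean ending the body with the value to return (for example stater) rather than with a val definition.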
Thanks for the responses. I think I figured it out. Here is the code that works. One thing to note is that the Google API has restrictions, so some valid zip codes don't have state info. That is not an issue for me, though.
private def loaduserdata(spark: SparkSession): Unit = {
import spark.implicits._
// Create an RDD of User objects from a text file, convert it to a Dataframe
val userDF = spark.sparkContext
.textFile("examples/src/main/resources/users.csv")
.map(_.split("::"))
.map(attributes => users(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt, attributes(3), attributes(4)))
.toDF()
// Register the DataFrame as a temporary view
userDF.createOrReplaceTempView("Users")
// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, substr(zipcode,1,5) as state FROM Users ORDER BY zipcode desc") // zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
if (c.toString() == theColumn.toString()) getstate(c).as("state") else c)
val geoDF = zipcodesDF.select(mappedCols:_*)//.show()
geoDF.createOrReplaceTempView("Geo")
}
val getstate = udf {(zipcode: String) =>
val url = "http://maps.googleapis.com/maps/api/geocode/json?address="+zipcode
val result = scala.io.Source.fromURL(url).mkString
val address = parse(result)
val statenm = for {
JObject(statename) <- address
JField("types", JArray(types)) <- statename
JField("short_name", JString(short_name)) <- statename
if types.toString().equals("List(JString(administrative_area_level_1), JString(political))")
// if types.head.equals("JString(administrative_area_level_1)")
} yield short_name
if (statenm.isEmpty) "N/A" else statenm.head // last expression is the UDF's return value; ending on a 'val' would make it Unit again
}

spark job freeze when started in ParArray

I want to convert a set of time-series data from multiple CSV files to LabeledPoint and save it to a Parquet file. The CSV files are small, usually < 10 MiB.
When I start it with ParArray, it submits 4 jobs at a time and freezes. Code here:
val idx = Another_DataFrame
ListFiles(new File("data/stock data"))
.filter(_.getName.contains(".csv")).zipWithIndex
.par //comment this line and code runs smoothly
.foreach{
f=>
val stk = spark_csv(f._1.getPath) //doing good
ColMerge(stk,idx,RESULT_PATH(f)) //freeze here
stk.unpersist()
}
and the freeze part:
def ColMerge(ori:DataFrame,index:DataFrame,PATH:String) = {
val df = ori.join(index,ori("date")===index("index_date")).drop("index_date").orderBy("date").cache
val head = df.head
val col = df.columns.filter(e=>e!="code"&&e!="date"&&e!="name")
val toMap = col.filter{
e=>head.get(head.fieldIndex(e)).isInstanceOf[String]
}.sorted
val toCast = col.diff(toMap).filterNot(_=="data")
val res: Array[((String, String, Array[Double]), Long)] = df.sort("date").map{
row=>
val res1= toCast.map{
col=>
row.getDouble(row.fieldIndex(col))
}
val res2= toMap.flatMap{
col=>
val mapping = new Array[Double](GlobalConfig.ColumnMapping(col).size)
row.getString(row.fieldIndex(col)).split(";").par.foreach{
word=>
mapping(GlobalConfig.ColumnMapping(col)(word)) = 1
}
mapping
}
(
row.getString(row.fieldIndex("code")),
row.getString(row.fieldIndex("date")),
res1++res2++row.getAs[Seq[Double]]("data")
)
}.zipWithIndex.collect
df.unpersist
val dataset = GlobalConfig.sctx.makeRDD(res.map{
day=>
(day._1._1,
day._1._2,
try{
new LabeledPoint(GetHighPrice(res(day._2.toInt+2)._1._3.slice(0,4))/GetLowPrice(res(day._2.toInt)._1._3.slice(0,4))*1.03,Vectors.dense(day._1._3))
}
catch {
case ex:ArrayIndexOutOfBoundsException=>
new LabeledPoint(-1,Vectors.dense(day._1._3))
}
)
}).filter(_._3.label != -1).toDF("code","date","labeledpoint")
dataset.write.mode(SaveMode.Overwrite).parquet(PATH)
}
The exact job that freezes is the DataFrame.sort() or zipWithIndex when generating res in ColMerge.
Since most of the work gets done after collect, I really want to use ParArray to accelerate ColMerge, but this weird freeze stops me from doing so. Do I need to create a new thread pool to do this?

Using Spark Context in map of Spark Streaming Context to retrieve documents after Kafka Event

I'm new to Spark.
What I'm trying to do is retrieve all related documents from a Couchbase view, for a given id coming from Spark Kafka streaming.
When I try to get these documents from the Spark context, I always get the error Task not serializable.
From there, I understand that I can't use nested RDDs or multiple Spark contexts in the same JVM, but I want to find a workaround.
Here is my current approach:
package xxx.xxx.xxx
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.client.java.view.ViewQuery
import com.couchbase.spark._
import org.apache.spark.broadcast.Broadcast
import _root_.kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{ProducerRecord, KafkaProducer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object Streaming {
// Method to create a Json document from Key and Value
def CreateJsonDocument(s: (String, String)): JsonDocument = {
//println("- Parsing document")
//println(s._1)
//println(s._2)
val return_doc = JsonDocument.create(s._1, JsonObject.fromJson(s._2))
(return_doc)
//(return_doc.content().getString("click"), return_doc)
}
def main(args: Array[String]): Unit = {
// get arguments as key value
val arguments = args.grouped(2).collect { case Array(k,v) => k.replaceAll("--", "") -> v }.toMap
println("----------------------------")
println("Arguments passed to class")
println("----------------------------")
println("- Arguments")
println(arguments)
println("----------------------------")
// If one of the required arguments is missing (Map.get returns an Option, never null)
if (arguments.get("brokers").isEmpty || arguments.get("topics").isEmpty) {
// Provide system error
System.err.println("Usage: --brokers <broker1:9092> --topics <topic1,topic2,topic3>")
}
// Create the Spark configuration with app name
val conf = new SparkConf().setAppName("Streaming")
// Create the Spark context
val sc = new SparkContext(conf)
// Create the Spark Streaming Context
val ssc = new StreamingContext(sc, Seconds(2))
// Setup the broker list
val kafkaParams = Map("metadata.broker.list" -> arguments.getOrElse("brokers", ""))
// Setup the topic list
val topics = arguments.getOrElse("topics", "").split(",").toSet
// Get the message stream from kafka
val docs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
docs
// Separate the key and the content
.map({ case (key, value) => (key, value) })
// Parse the content to transform in JSON Document
.map(s => CreateJsonDocument(s))
// Call the view to all related Review Application Documents
//.map(messagedDoc => RetrieveAllReviewApplicationDocs(messagedDoc, sc))
.map(doc => {
sc.couchbaseView(ViewQuery.from("my-design-document", "stats").key(doc.content.getString("id"))).collect()
})
.foreachRDD(
rdd => {
//Create a report of my documents and store it in Couchbase
rdd.foreach( println )
}
)
// Start the streaming context
ssc.start()
// Wait for termination and catch error if there is a problem in the process
ssc.awaitTermination()
}
}
Found the solution by using the Couchbase client instead of the Couchbase Spark context.
I don't know if it is the best way to go performance-wise, but I can retrieve the docs I need for computation.
package xxx.xxx.xxx
import com.couchbase.client.java.{Bucket, Cluster, CouchbaseCluster}
import com.couchbase.client.java.document.JsonDocument
import com.couchbase.client.java.document.json.JsonObject
import com.couchbase.client.java.view.{ViewResult, ViewQuery}
import _root_.kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{ProducerRecord, KafkaProducer}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
object Streaming {
// Method to create a Json document from Key and Value
def CreateJsonDocument(s: (String, String)): JsonDocument = {
//println("- Parsing document")
//println(s._1)
//println(s._2)
val return_doc = JsonDocument.create(s._1, JsonObject.fromJson(s._2))
(return_doc)
//(return_doc.content().getString("click"), return_doc)
}
// Method to retrieve related documents
def RetrieveDocs (doc: JsonDocument, arguments: Map[String, String]): ViewResult = {
val cbHosts = arguments.getOrElse("couchbase-hosts", "")
val cbBucket = arguments.getOrElse("couchbase-bucket", "")
val cbPassword = arguments.getOrElse("couchbase-password", "")
val cluster: Cluster = CouchbaseCluster.create(cbHosts)
val bucket: Bucket = cluster.openBucket(cbBucket, cbPassword)
val docs : ViewResult = bucket.query(ViewQuery.from("my-design-document", "my-view").key(doc.content().getString("id")))
cluster.disconnect()
println(docs)
(docs)
}
def main(args: Array[String]): Unit = {
// get arguments as key value
val arguments = args.grouped(2).collect { case Array(k,v) => k.replaceAll("--", "") -> v }.toMap
println("----------------------------")
println("Arguments passed to class")
println("----------------------------")
println("- Arguments")
println(arguments)
println("----------------------------")
// If one of the required arguments is missing (Map.get returns an Option, never null)
if (arguments.get("brokers").isEmpty || arguments.get("topics").isEmpty) {
// Provide system error
System.err.println("Usage: --brokers <broker1:9092> --topics <topic1,topic2,topic3>")
}
// Create the Spark configuration with app name
val conf = new SparkConf().setAppName("Streaming")
// Create the Spark context
val sc = new SparkContext(conf)
// Create the Spark Streaming Context
val ssc = new StreamingContext(sc, Seconds(2))
// Setup the broker list
val kafkaParams = Map("metadata.broker.list" -> arguments.getOrElse("brokers", ""))
// Setup the topic list
val topics = arguments.getOrElse("topics", "").split(",").toSet
// Get the message stream from kafka
val docs = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
// Get broadcast arguments
val argsBC = sc.broadcast(arguments)
docs
// Separate the key and the content
.map({ case (key, value) => (key, value) })
// Parse the content to transform in JSON Document
.map(s => CreateJsonDocument(s))
// Call the view to all related Review Application Documents
.map(doc => RetrieveDocs(doc, argsBC.value)) // unwrap the broadcast: RetrieveDocs expects a Map[String, String]
.foreachRDD(
rdd => {
//Create a report of my documents and store it in Couchbase
rdd.foreach( println )
}
)
// Start the streaming context
ssc.start()
// Wait for termination and catch error if there is a problem in the process
ssc.awaitTermination()
}
}

Spark job not parallelising locally (using Parquet + Avro from local filesystem)

edit 2
Indirectly solved the problem by repartitioning the RDD into 8 partitions. I hit a roadblock with Avro objects not being "java serialisable", and found a snippet here to delegate Avro serialisation to Kryo. The original problem still remains.
edit 1: Removed local variable reference in map function
I'm writing a driver to run a compute-heavy job on Spark using Parquet and Avro for IO/schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?
I am just getting my head around how Hadoop organises files. AFAIK, since my file has a gigabyte of raw data, I should expect to see things parallelising with the default block and page sizes.
The function to ETL my input for processing looks as follows :
def genForum {
class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
override def write(t: Topic) {
synchronized {
super.write(t)
}
}
}
def makeTopic(x: ForumTopic): Topic = {
// Omitted to save space
}
val writer = new MyWriter
val q =
DBCrawler.db.withSession {
Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
}
val sz = q.size
val c = new AtomicInteger(0)
q.par.foreach {
x =>
writer.write(makeTopic(x))
val count = c.incrementAndGet()
print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
}
writer.close()
}
And my transformation looks as follows :
def sparkNLPTransformation() {
val sc = new SparkContext("local[8]", "forumAddNlp")
// io configuration
val job = new Job()
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
ParquetOutputFormat.setWriteSupportClass(job,classOf[AvroWriteSupport])
AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)
// configure annotator
val props = new Properties()
props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
val an = DAnnotator(props)
// annotator function
def annotatePosts(ann : DAnnotator, top : Topic) : Topic = {
val new_p = top.getPosts.map{ x=>
val at = new Annotation(x.getPostText.toString)
ann.annotator.annotate(at)
val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList
val r = SpecificData.get().deepCopy[Post](x.getSchema,x)
if(t.nonEmpty) r.setTrees(t)
r
}
val new_t = SpecificData.get().deepCopy[Topic](top.getSchema,top)
new_t.setPosts(new_p)
new_t
}
// transformation
val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
val new_ds = ds.map(x=> ( null, annotatePosts(x._2) ) )
new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
classOf[Void],
classOf[Topic],
classOf[ParquetOutputFormat[Topic]],
job.getConfiguration
)
}
Can you confirm that the data is indeed in multiple blocks in HDFS? The total block count on the forum_dataset.parq file