How do Scala object members work with Spark RDDs?

I have a Spark application that writes its output to Redis.
It works fine in local mode, but in yarn-cluster mode it cannot connect to the redisHost that I pass in as args(0) (e.g. 10.242.10.100).
The redisHost stays at its default of 127.0.0.1.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.slf4j.{Logger, LoggerFactory}
import redis.clients.jedis.Jedis

object TestSparkClosure {
  val logger: Logger = LoggerFactory.getLogger(TestSparkClosure.getClass)
  var redisHost = "127.0.0.1"
  var redisPort = 6379

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("TestSparkClosure")
    if (args.length > 0) {
      redisHost = args(0)
    } else {
      conf.setMaster("local")
    }
    val sparkContext = new SparkContext(conf)
    var rdd = getRdd(sparkContext)
    rdd.foreachPartition(partitionOfRecords => {
      logger.info("host:port:" + redisHost + ":" + redisPort.toString)
      val jedis = new Jedis(redisHost, redisPort)
      partitionOfRecords.foreach(pair => {
        val keystr = pair._1
        val valuestr = pair._2
        jedis.set(keystr, valuestr)
      })
    })
  }

  def getRdd(spark: SparkContext): RDD[(String, String)] = {
    val rdd = spark.parallelize(List("2017\t1", "2018\t2", "2017\t3", "2018\t4", "2017\t5", "2018\t6")).map(line => {
      val cols = line.split("\t")
      (cols(0), cols(1))
    })
    rdd.reduceByKey((x, y) => {
      ((x.toInt + y.toInt).toString)
    }, 3)
  }
}
When I replace redisHost with a local variable like this, it works fine again.
var localRedisHost = redisHost
rdd.foreachPartition(partitionOfRecords => {
  logger.info("host:port:" + localRedisHost + ":" + redisPort.toString)
  val jedis = new Jedis(localRedisHost, redisPort)
  partitionOfRecords.foreach(pair => {
    val keystr = pair._1
    val valuestr = pair._2
    jedis.set(keystr, valuestr)
  })
})
Can anyone explain how the Spark closure works here?
Thanks so much.

It's because redisHost and redisPort are members of a singleton object, and object members are not serialized with the closure that Spark ships to the executors; each executor initializes the object afresh, so redisHost falls back to its default of 127.0.0.1. When you copy the value into a local variable, that variable is captured and serialized with the closure, so the value you set on the driver is the one you see inside the RDD operation.
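Not part of the original answer, but a minimal sketch of the same pattern, grounded in the fix shown above: snapshot the object members into local vals on the driver so their current values travel with the serialized closure.

// Sketch: capture the object members into local vals on the driver.
// The locals are plain captured values, so each executor sees the value
// that was set on the driver (e.g. args(0)).
val host = redisHost
val port = redisPort
rdd.foreachPartition { partition =>
  val jedis = new Jedis(host, port)
  partition.foreach { case (k, v) => jedis.set(k, v) }
  jedis.close()
}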

Related

recursive value x$5 needs type

I am getting an error at this line:
val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) =
  Utils.parseCommandLineWithTwitterCredentials(args)
The errors are:
recursive value x$7 needs type
recursive value x$1 needs type
What do these errors mean? Please guide me on how to resolve them.
object Collect {
  private var numTweetsCollected = 0L
  private var partNum = 0
  private var gson = new Gson()

  def main(args: Array[String]) {
    // Process program arguments and set properties
    if (args.length < 3) {
      System.err.println("Usage: " + this.getClass.getSimpleName +
        "<outputDirectory> <numTweetsToCollect> <intervalInSeconds> <partitionsEachInterval>")
      System.exit(1)
    }
    val Array(outputDirectory, Utils.IntParam(numTweetsToCollect), Utils.IntParam(intervalSecs), Utils.IntParam(partitionsEachInterval)) =
      Utils.parseCommandLineWithTwitterCredentials(args)
    val outputDir = new File(outputDirectory.toString)
    if (outputDir.exists()) {
      System.err.println("ERROR - %s already exists: delete or specify another directory".format(
        outputDirectory))
      System.exit(1)
    }
    outputDir.mkdirs()

    println("Initializing Streaming Spark Context...")
    val conf = new SparkConf().setAppName(this.getClass.getSimpleName)
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(intervalSecs))

    val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
      .map(gson.toJson(_))
    tweetStream.foreachRDD((rdd, time) => {
      val count = rdd.count()
      if (count > 0) {
        val outputRDD = rdd.repartition(partitionsEachInterval)
        outputRDD.saveAsTextFile(outputDirectory + "/tweets_" + time.milliseconds.toString)
        numTweetsCollected += count
        if (numTweetsCollected > numTweetsToCollect) {
          System.exit(0)
        }
      }
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
Try removing the Utils.IntParam(...) extractors from your pattern-matched values. Extract the values first, then parse them separately, as sketched below.
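For illustration only (not from the original answer), a sketch of that suggestion, assuming Utils.parseCommandLineWithTwitterCredentials(args) returns the leftover arguments in an Array whose element type may not be String (hence the .toString before parsing):

// Bind plain names in the pattern, then convert each value explicitly.
val Array(outputDirectory, numTweetsToCollectArg, intervalSecsArg, partitionsEachIntervalArg) =
  Utils.parseCommandLineWithTwitterCredentials(args)
val numTweetsToCollect = numTweetsToCollectArg.toString.toInt
val intervalSecs = intervalSecsArg.toString.toInt
val partitionsEachInterval = partitionsEachIntervalArg.toString.toInt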

How to speed up log parsing Spark job?

My current architecture is: AWS ELB writes logs to S3, and S3 sends a message to SQS for further processing by Spark Streaming. It works, but my problem is that it's taking quite a bit of time. I'm new to Spark and Scala, so I just want to make sure I'm not doing something stupid.
val conf = new SparkConf()
  .setAppName("SparrowOrc")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
val streamContext = new StreamingContext(sc, Seconds(1))

val sqs = streamContext.receiverStream(new SQSReceiver("queue")
  .at(Regions.US_EAST_1)
  .withTimeout(5))

// Got 10 messages at a time
val s3Keys = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  val s3Key = "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
  s3Key
})

val rawLogs: DStream[String] = s3Keys.transform(keys => {
  val fileKeys = keys.collect()
  val files = fileKeys.map(f => {
    sc.textFile(f)
  })
  sc.union(files)
})

val jsonRows = rawLogs.mapPartitions(partition => {
  // Parsing raw log to json
  val txfm = new LogLine2Json
  val log = Logger.getLogger("parseLog")
  partition.map(line => {
    try {
      txfm.parseLine(line)
    }
    catch {
      case e: Throwable => { log.info(line); ""; }
    }
  }).filter(line => line != "{}")
})

val sqlSession = SparkSession
  .builder()
  .getOrCreate()

// Write to S3
jsonRows.foreachRDD(r => {
  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = "bucket" + parsedDate
  val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(r)
  jsonDf.write.mode("append").format("orc").option("compression", "zlib").save(outputPath)
})

streamContext.start()
streamContext.awaitTermination()
}
Here's the DAG, and it seems like everything is merged in the union transformation.

Spark Streaming - Broadcast variable - Case Class

My requirement is to enrich data stream records with profile information from an HBase table. I was looking to use a broadcast variable. The whole code is enclosed here.
The output of the HBase data is as follows.
In the driver node:
HBaseReaderBuilder(org.apache.spark.SparkContext#3c58b102,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List())
In the worker node:
HBaseReaderBuilder(null,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List())
As you can see, it has lost the SparkContext. When I issue the statement val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3)) I get a NullPointerException:
java.lang.NullPointerException
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toSimpleHBaseRDD(HBaseReaderBuilder.scala:83)
at it.nerdammer.spark.hbase.package$.toSimpleHBaseRDD(package.scala:5)
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toHBaseRDD(HBaseReaderBuilder.scala:67)
at it.nerdammer.spark.hbase.package$.toHBaseRDD(package.scala:5)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:34)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:33)
object testPartition {
  def main(args: Array[String]): Unit = {
    val sparkMaster = "spark://x.x.x.x:7077"
    val ipaddress = "x.x.x.x:2181" // Zookeeper
    val hadoopHome = "/home/hadoop/software/hadoop-2.6.0"
    val topicname = "new_events_test_topic"
    val mainConf = new SparkConf().setMaster(sparkMaster).setAppName("testingPartition")
    val mainSparkContext = new SparkContext(mainConf)
    val ssc = new StreamingContext(mainSparkContext, Seconds(30))
    val eventsStream = KafkaUtils.createStream(ssc, "x.x.x.x:2181", "receive_rest_events", Map(topicname.toString -> 2))
    val docRdd = mainSparkContext.hbaseTable[(String, Option[String], Option[String])]("hbase_customer_profile").select("gender", "age").inColumnFamily("data")
    println("docRDD from Driver ", docRdd)
    val broadcastedprof = mainSparkContext.broadcast(docRdd)

    eventsStream.foreachRDD(dstream => {
      dstream.foreachPartition(records => {
        println("Broadcasted docRDD - in Worker ", broadcastedprof.value)
        val bcdocRdd = broadcastedprof.value
        records.foreach(record => {
          //val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
          //myRdd.foreach(println)
          val Rows = record._2.split("\r\n")
        })
      })
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
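No answer is included in the post, but as an illustration of the NullPointerException above: an RDD (and the reader builder that wraps the SparkContext) is only usable on the driver, so it cannot be broadcast and then mapped inside foreachPartition. A minimal sketch of the usual workaround, assuming the profile data fits in driver memory; the names profileMap, broadcastedProfiles and customerId are hypothetical:

// Hypothetical sketch: materialize the profile rows on the driver and
// broadcast a plain Map instead of the RDD itself.
val profileMap: Map[String, (Option[String], Option[String])] =
  docRdd.map(r => (r._1, (r._2, r._3))).collect().toMap
val broadcastedProfiles = mainSparkContext.broadcast(profileMap)

eventsStream.foreachRDD(rdd => {
  rdd.foreachPartition(records => {
    val profiles = broadcastedProfiles.value // a plain Map, safe to use on executors
    records.foreach(record => {
      val rows = record._2.split("\r\n")
      // enrich each row with profiles.get(customerId) as needed
    })
  })
})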

Handle Spark state isTimingOut

I'm using the new mapWithState function in Spark Streaming (1.6) with a timing-out state. I want to take the timed-out state and add it to another RDD in order to use it for calculations further down the road:
val aggedlogs = sc.emptyRDD[MyLog];

val mappingFunc = (key: String, newlog: Option[MyLog], state: State[MyLog]) => {
  val _newLog = newlog.getOrElse(null)
  if ((state.exists()) && (_newLog != null)) {
    val stateLog = state.get()
    val combinedLog = LogUtil.CombineLogs(_newLog, stateLog);
    state.update(combinedLog)
  }
  else if (_newLog != null) {
    state.update(_newLog);
  }
  if (state.isTimingOut()) {
    val stateLog = state.get();
    aggedlogs.union(sc.parallelize(List(stateLog), 1))
  }
  val stateLog = state.get();
  (key, stateLog);
}

val stateDstream = reducedlogs.mapWithState(StateSpec.function(mappingFunc).timeout(Seconds(10)))
But when I try to add it to an RDD in the StateSpec function, I get an error that the function is not serializable. Any thoughts on how I can get past this?
EDIT:
After drilling deeper I found that my approach was wrong. Before trying this solution I tried to get the timing-out logs from stateSnapshot(), but they were not there anymore. Changing the mapping function to:
def mappingFunc(key: String, newlog: Option[MyLog], state: State[MyLog]): Option[(String, MyLog)] = {
  val _newLog = newlog.getOrElse(null)
  if ((state.exists()) && (_newLog != null)) {
    val stateLog = state.get()
    val combinedLog = LogUtil.CombineLogs(_newLog, stateLog);
    state.update(combinedLog)
    Some(key, combinedLog);
  }
  else if (_newLog != null) {
    state.update(_newLog);
    Some(key, _newLog);
  }
  if (state.isTimingOut()) {
    val stateLog = state.get();
    stateLog.timinigOut = true;
    System.out.println("timinigOut : " + key);
    Some(key, stateLog);
  }
  val stateLog = state.get();
  Some(key, stateLog);
}
I managed to filter the mapWithState DStream for the logs that are timing out in each batch:
val stateDstream = reducedlogs.mapWithState(
  StateSpec.function(mappingFunc _).timeout(Seconds(60)))
val tiningoutlogs = stateDstream.filter(filtertimingout)
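The filtertimingout predicate is not shown in the post; a minimal sketch of what it might look like, assuming the mapped DStream carries the Option[(String, MyLog)] values returned by the mapping function and that MyLog exposes the timinigOut flag set above:

// Hypothetical sketch: keep only the entries whose state was flagged as
// timing out by the mapping function in this batch.
def filtertimingout(entry: Option[(String, MyLog)]): Boolean =
  entry.exists { case (_, log) => log.timinigOut }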

Bulk loading to Phoenix using Spark

I was trying to code some utilities to bulk load data through HFiles from Spark RDDs.
I was following the pattern of CSVBulkLoadTool from Phoenix. I managed to generate some HFiles and load them into HBase, but I can't see the rows using sqlline (whereas using the hbase shell it is possible). I would be more than grateful for any suggestions.
BulkPhoenixLoader.scala:
class BulkPhoenixLoader[A <: ImmutableBytesWritable : ClassTag, T <: KeyValue : ClassTag](rdd: RDD[(A, T)]) {

  def createConf(tableName: String, inConf: Option[Configuration] = None): Configuration = {
    val conf = inConf.map(HBaseConfiguration.create).getOrElse(HBaseConfiguration.create())
    val job: Job = Job.getInstance(conf, "Phoenix bulk load")
    job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
    job.setMapOutputValueClass(classOf[KeyValue])
    // initialize credentials to possibly run in a secure env
    TableMapReduceUtil.initCredentials(job)
    val htable: HTable = new HTable(conf, tableName)
    // Auto configure partitioner and reducer according to the Main Data table
    HFileOutputFormat2.configureIncrementalLoad(job, htable)
    conf
  }

  def bulkSave(tableName: String, outputPath: String, conf: Option[Configuration]) = {
    val configuration: Configuration = createConf(tableName, conf)
    rdd.saveAsNewAPIHadoopFile(
      outputPath,
      classOf[ImmutableBytesWritable],
      classOf[Put],
      classOf[HFileOutputFormat2],
      configuration)
  }
}
ExtendedProductRDDFunctions.scala:
class ExtendedProductRDDFunctions[A <: scala.Product](data: org.apache.spark.rdd.RDD[A]) extends
    ProductRDDFunctions[A](data) with Serializable {

  def toHFile(tableName: String,
              columns: Seq[String],
              conf: Configuration = new Configuration,
              zkUrl: Option[String] = None): RDD[(ImmutableBytesWritable, KeyValue)] = {
    val config = ConfigurationUtil.getOutputConfiguration(tableName, columns, zkUrl, Some(conf))
    val tableBytes = Bytes.toBytes(tableName)
    val encodedColumns = ConfigurationUtil.encodeColumns(config)
    val jdbcUrl = zkUrl.map(getJdbcUrl).getOrElse(getJdbcUrl(config))
    val conn = DriverManager.getConnection(jdbcUrl)
    val query = QueryUtil.constructUpsertStatement(tableName,
      columns.toList.asJava,
      null)
    data.flatMap(x => mapRow(x, jdbcUrl, encodedColumns, tableBytes, query))
  }

  def mapRow(product: Product,
             jdbcUrl: String,
             encodedColumns: String,
             tableBytes: Array[Byte],
             query: String): List[(ImmutableBytesWritable, KeyValue)] = {
    val conn = DriverManager.getConnection(jdbcUrl)
    val preparedStatement = conn.prepareStatement(query)
    val columnsInfo = ConfigurationUtil.decodeColumns(encodedColumns)
    columnsInfo.zip(product.productIterator.toList).zipWithIndex.foreach(setInStatement(preparedStatement))
    preparedStatement.execute()
    val uncommittedDataIterator = PhoenixRuntime.getUncommittedDataIterator(conn, true)
    val hRows = uncommittedDataIterator.asScala.filter(kvPair =>
      Bytes.compareTo(tableBytes, kvPair.getFirst) == 0
    ).flatMap(kvPair => kvPair.getSecond.asScala.map(
      kv => {
        val byteArray = kv.getRowArray.slice(kv.getRowOffset, kv.getRowOffset + kv.getRowLength - 1) :+ 1.toByte
        (new ImmutableBytesWritable(byteArray, 0, kv.getRowLength), kv)
      }))
    conn.rollback()
    conn.close()
    hRows.toList
  }

  def setInStatement(statement: PreparedStatement): (((ColumnInfo, Any), Int)) => Unit = {
    case ((c, v), i) =>
      if (v != null) {
        // Both Java and Joda dates used to work in 4.2.3, but now they must be java.sql.Date
        val (finalObj, finalType) = v match {
          case dt: DateTime => (new Date(dt.getMillis), PDate.INSTANCE.getSqlType)
          case d: util.Date => (new Date(d.getTime), PDate.INSTANCE.getSqlType)
          case _ => (v, c.getSqlType)
        }
        statement.setObject(i + 1, finalObj, finalType)
      } else {
        statement.setNull(i + 1, c.getSqlType)
      }
  }

  private def getIndexTables(conn: Connection, qualifiedTableName: String): List[(String, String)] = {
    val table: PTable = PhoenixRuntime.getTable(conn, qualifiedTableName)
    val tables = table.getIndexes.asScala.map(x => x.getIndexType match {
      case IndexType.LOCAL => (x.getTableName.getString, MetaDataUtil.getLocalIndexTableName(qualifiedTableName))
      case _ => (x.getTableName.getString, x.getTableName.getString)
    }).toList
    tables
  }
}
I load the generated HFiles with the HBase utility tool as follows:
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles path/to/hfile tableName
You could just convert your CSV file to an RDD of Product and use the .saveToPhoenix method. This is generally how I load CSV data into Phoenix.
Please see: https://phoenix.apache.org/phoenix_spark.html
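Not part of the original answer, but a minimal sketch of that approach based on the phoenix-spark documentation linked above, assuming an existing SparkContext sc; the table name, column names, input path and ZooKeeper URL are placeholders:

import org.apache.phoenix.spark._ // adds saveToPhoenix to RDDs of Product

// Hypothetical CSV layout: id,col1,col2
val dataRdd = sc.textFile("hdfs:///path/to/input.csv")
  .map(_.split(","))
  .map(cols => (cols(0).toLong, cols(1), cols(2).toInt)) // an RDD of a Product (a tuple)

// saveToPhoenix upserts through the Phoenix client, so the rows show up in
// sqlline without any manual HFile handling.
dataRdd.saveToPhoenix(
  "OUTPUT_TABLE",
  Seq("ID", "COL1", "COL2"),
  zkUrl = Some("zookeeper-host:2181"))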