Error while running Spark Scala code to do a bulk load into HBase

I am using the following code in the REPL to create HFiles and bulk load them into HBase. The same code ran fine with no errors when I submitted it with spark-submit, but when I run it in the REPL it throws an error.
import org.apache.spark._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable}
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StringType
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.hbase.KeyValue
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
val cdt = "dt".getBytes
val ctemp="temp".getBytes
val ctemp_min="temp_min".getBytes
val ctemp_max="temp_max".getBytes
val cpressure="pressure".getBytes
val csea_level="sea_level".getBytes
val cgrnd_level="grnd_level".getBytes
val chumidity="humidity".getBytes
val ctemp_kf="temp_kf".getBytes
val cid="id".getBytes
val cweather_main="weather_main".getBytes
val cweather_description="weather_description".getBytes
val cweather_icon="weather_icon".getBytes
val cclouds_all="clouds_all".getBytes
val cwind_speed="wind_speed".getBytes
val cwind_deg="wind_deg".getBytes
val csys_pod="sys_pod".getBytes
val cdt_txt="dt_txt".getBytes
val crain="rain".getBytes
val COLUMN_FAMILY = "data".getBytes
val cols = ArrayBuffer(cdt,ctemp,ctemp_min,ctemp_max,cpressure,csea_level,cgrnd_level,chumidity,ctemp_kf,cid,cweather_main,cweather_description,cweather_icon,cclouds_all,cwind_speed,cwind_deg,csys_pod,cdt_txt,crain)
val rowKey = new ImmutableBytesWritable()
val conf = HBaseConfiguration.create()
val ZOOKEEPER_QUORUM = "address"
conf.set("hbase.zookeeper.quorum", ZOOKEEPER_QUORUM);
val connection = ConnectionFactory.createConnection(conf)
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferschema","true").load("Hbasedata/Weatherdata.csv")
val rdd = df.flatMap(x => { //Error when i run this
rowKey.set(x(0).toString.getBytes)
for(i <- 0 to cols.length - 1) yield {
val index = x.fieldIndex(new String(cols(i)))
val value = if (x.isNullAt(index)) "".getBytes else x(index).toString.getBytes
(rowKey,new KeyValue(rowKey.get, COLUMN_FAMILY, cols(i), value))
}
})
It throws the following error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:333)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:332)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
at org.apache.spark.sql.DataFrame.flatMap(DataFrame.scala:1418)
The error is thrown when I try to create the RDD. The same code works fine when run through spark-submit.

The issue is in this statement:
val rowKey = new ImmutableBytesWritable()
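ImmutableBytesWritable is not serializable, and rowKey is declared outside the flatMap function, so the closure passed to flatMap captures it and Spark cannot serialize the task. Please check the full stack trace of the exception to confirm.
As a first check, move that statement inside the flatMap function.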
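The capture effect described above can be reproduced without Spark or HBase. The following is only a minimal sketch, not the asker's code: it assumes java.nio.ByteBuffer as a stand-in for ImmutableBytesWritable (both are mutable byte holders that do not implement java.io.Serializable) and uses plain Java serialization, which is what Spark applies to closures. The names ClosureCaptureDemo, isSerializable, capturingOuterBuffer and creatingBufferInside are made up for illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

object ClosureCaptureDemo {
  // True if plain Java serialization accepts the value.
  def isSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // Buffer created OUTSIDE the function: the closure captures it, so the
  // closure can only be serialized if the buffer can (and it cannot).
  def capturingOuterBuffer(): String => Int = {
    val shared = ByteBuffer.allocate(64) // stand-in for the outer rowKey
    (s: String) => {
      shared.clear()
      shared.put(s.getBytes(UTF_8))
      shared.position()
    }
  }

  // Buffer created INSIDE the function: nothing non-serializable is captured.
  def creatingBufferInside(): String => Int =
    (s: String) => ByteBuffer.wrap(s.getBytes(UTF_8)).remaining()
}
```

In the question's code, the equivalent change would be to create the ImmutableBytesWritable inside the function passed to flatMap instead of once at the top level.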

Related

Why do I get an error while reading a text file in Spark?

The path given to the text file is correct, but I still get the error "Input path does not exist: file:/C:/Users/cmpil/Downloads/hunger_games.txt". Why is this happening?
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object WordCountDataSet {
case class Book(value:String)
def main(args:Array[String]): Unit ={
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.appName("WordCount")
.master("local[*]")
.getOrCreate()
import spark.implicits._
//Another way of doing it
val bookRDD = spark.sparkContext.textFile("C:/Users/cmpil/Downloads/hunger_games.txt")
val wordsRDD = bookRDD.flatMap(x => x.split("\\W+"))
val wordsDS = wordsRDD.toDS()
val lowercaseWordsDS = wordsDS.select(lower($"value").alias("word"))
val wordCountsDS = lowercaseWordsDS.groupBy("word").count()
val wordCountsSortedDS = wordCountsDS.sort("count")
wordCountsSortedDS.show(wordCountsSortedDS.count().toInt)
}
}
On Windows you have to use '\\' in place of '/'.
Try using "C:\\Users\\cmpil\\Downloads\\hunger_games.txt".
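Whichever separator is used, it may help to confirm first that the JVM can actually see a readable file at that exact path, since the error message only says that the input path does not exist. The following is a small sketch using java.nio.file; InputPathCheck and isReadableFile are hypothetical names, not part of Spark.

```scala
import java.nio.file.{Files, Paths}

object InputPathCheck {
  // True only if the path points to an existing, readable regular file.
  def isReadableFile(path: String): Boolean = {
    val p = Paths.get(path)
    Files.isRegularFile(p) && Files.isReadable(p)
  }
}
```

For example, calling InputPathCheck.isReadableFile("C:\\Users\\cmpil\\Downloads\\hunger_games.txt") before calling textFile shows whether the problem is the path itself (a misspelled name or a different extension, for instance) rather than Spark.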

Can't find temp table in Zeppelin

I receive an error when I try to run a select over my temp table. Can somebody help me, please?
object StreamingLinReg extends java.lang.Object{
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1").setAppName("Streaming Liniar Regression")
.set("spark.cassandra.connection.port", "9042")
.set("spark.driver.allowMultipleContexts", "true")
.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val sc = new SparkContext(conf);
val ssc = new StreamingContext(sc, Seconds(1));
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._
val trainingData = ssc.cassandraTable[String]("features","consumodata").select("consumo", "consumo_mensal", "soma_pf", "tempo_gasto").map(LabeledPoint.parse)
trainingData.toDF.registerTempTable("training")
val dstream = new ConstantInputDStream(ssc, trainingData)
val numFeatures = 100
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
.setNumIterations(1)
.setStepSize(0.1)
.setMiniBatchFraction(1.0)
model.trainOn(dstream)
model.predictOnValues(dstream.map(lp => (lp.label, lp.features))).foreachRDD { rdd =>
val metrics = new RegressionMetrics(rdd)
val MSE = metrics.meanSquaredError //Mean squared error
val RMSE = metrics.rootMeanSquaredError //Root mean squared error
val MAE = metrics.meanAbsoluteError //Mean absolute error
val Rsquared = metrics.r2
//val Explained variance = metrics.explainedVariance
rdd.toDF.registerTempTable("liniarRegressionModel")
}
}
ssc.start()
ssc.awaitTermination()
//}
}
%sql
select * from liniarRegressionModel limit 10
When I select from the temporary table I get an error message. I run the first paragraph and then execute the select over the temp table.
org.apache.spark.sql.AnalysisException: Table not found: liniarRegressionModel; line 1 pos 14
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:305)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:314)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:309)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
My output after executing the code:
import java.lang.Object
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.StreamingContext._
import com.datastax.spark.connector.streaming._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.mllib.evaluation.RegressionMetrics
defined module StreamingLinReg
FINISHED
Took 15 seconds

Scala/Spark serialization error - streaming data to HBASE

I am a newbie to Scala/Spark. In the following code, I am extracting Twitter public stream content and writing it to HBase.
If I comment out the last four lines (the put commands for HBase), I am able to print the content of each tweet on the terminal, but with them in place I am unable to write the tweets to the HBase table.
I need help with the following:
1. How can I overcome the serialization error?
2. Are there efficient methods (maybe using Kryo serialization) to overcome this error?
Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
Serialization stack:
- object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml)
import twitter4j.auth._
import twitter4j.conf._
import twitter4j._
import twitter4j.json._
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.hbase.util.Bytes
import java.io._
import org.apache.spark.streaming.twitter.TwitterUtils
////////////////////////////
val conf = new SparkConf().setAppName("model1").setMaster("local[*]")
// val sc = new SparkContext(conf)
val TABLE_NAME = "publicrd"
val CF_USER = "user"
val CF_TWEET = "tweet"
val CF_ENTITIES = "entities"
val CF_PLACES = "places"
val hadoopConf = new Configuration
val conf = HBaseConfiguration.create(hadoopConf)
val admin = new HBaseAdmin(conf)
val tableDesc = new HTableDescriptor(Bytes.toBytes(TABLE_NAME))
// Define column family descriptor
val ColumnFamilyDesc1 = new HColumnDescriptor(Bytes.toBytes(CF_USER))
val ColumnFamilyDesc2 = new HColumnDescriptor(Bytes.toBytes(CF_TWEET))
val ColumnFamilyDesc3 = new HColumnDescriptor(Bytes.toBytes(CF_ENTITIES))
val ColumnFamilyDesc4 = new HColumnDescriptor(Bytes.toBytes(CF_PLACES))
// Add column family in table descriptor
tableDesc.addFamily(ColumnFamilyDesc1)
tableDesc.addFamily(ColumnFamilyDesc2)
tableDesc.addFamily(ColumnFamilyDesc3)
tableDesc.addFamily(ColumnFamilyDesc4)
// Check if the table exists
if (admin.tableExists(TABLE_NAME)){
print(">>>>>" + TABLE_NAME + " already exists <<<<<")
admin.disableTable(TABLE_NAME)
admin.deleteTable(TABLE_NAME)
}
// Create HBASE table
admin.createTable(tableDesc)
val table = new HTable(conf, TABLE_NAME)
/////
val timewindow = 2 // seconds
val ssc = new StreamingContext(sc, Seconds(timewindow))
val cb = new ConfigurationBuilder
val ckey = "ckey"
val csecret = "csecret"
val atoken = "atoken"
val atokensecret = "atokensecret"
cb.setDebugEnabled(true).
setOAuthConsumerKey(ckey).
setOAuthConsumerSecret(csecret).
setOAuthAccessToken(atoken).
setOAuthAccessTokenSecret(atokensecret).
setJSONStoreEnabled(true)
val auth = new OAuthAuthorization(cb.build)
val tweets = TwitterUtils.createStream(ssc,Some(auth))
val status = tweets.filter(_.getLang()=="en")
status.foreachRDD(foreachFunc = rdd => {
rdd.foreachPartition {
records => while (records.hasNext) {
var record = records.next
print("\n\n>>>>"+record)
var tweetID = record.getUser().getId().toString//.isInstanceOf[Int]
print("\ntweetID : "+tweetID)
var tweetBody = record.getText()//.toString
print("\ntweetBody : "+tweetBody)
var favoritesCount = record.getFavoriteCount()//.toInt
print("\nfavoritesCount : "+favoritesCount)
var keyrow = "t_"+tweetID //"t_"+
print("\nkeyrow : "+keyrow+"\n")
var theput= new Put(Bytes.toBytes(keyrow))
theput.add(Bytes.toBytes(CF_TWEET),Bytes.toBytes("tweetid"),Bytes.toBytes(tweetID))
theput.add(Bytes.toBytes(CF_TWEET),Bytes.toBytes("tweetid"),Bytes.toBytes(tweetBody))
theput.add(Bytes.toBytes(CF_USER),Bytes.toBytes("tweetid"),Bytes.toBytes(favoritesCount))
table.put(theput)
}
}
}
)
The code is run on the terminal via:
spark-shell --driver-class-path /opt/hadoop/hbase-1.2.1/lib/hbase-server-1.1.4.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-protocol-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-hadoop2-compat-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-client-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-common-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/htrace-core-3.2.0-incubating.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/guava-19.0.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/spark-streaming-twitter_2.10-1.6.1.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-async-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-core-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-examples-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-media-support-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-stream-4.0.4.jar
The error says that the object org.apache.hadoop.conf.Configuration is not serializable, which means it does not implement the Serializable interface although serialization is required here. To get rid of the error, add the @transient annotation:
@transient val hadoopConf = new Configuration
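What @transient does can be seen with plain Java serialization, without Spark. The sketch below is only illustrative: FakeConfiguration is a hypothetical stand-in for org.apache.hadoop.conf.Configuration (a class that is not Serializable), and TransientFieldDemo, PlainHolder, TransientHolder and isSerializable are made-up names.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object TransientFieldDemo {
  // Hypothetical stand-in for org.apache.hadoop.conf.Configuration: NOT Serializable.
  final class FakeConfiguration

  // Serializable holder with a plain field: serialization fails on that field.
  class PlainHolder extends Serializable {
    val hadoopConf = new FakeConfiguration
  }

  // Same holder, but Java serialization skips the annotated field.
  // On the receiving side the field is null and has to be re-created there.
  class TransientHolder extends Serializable {
    @transient val hadoopConf = new FakeConfiguration
  }

  // True if plain Java serialization accepts the value.
  def isSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

Note that @transient only keeps the annotated field out of the serialized data. Any other non-serializable object that the closure uses on the executors, possibly the table handle in the question, would still have to be created inside the function that runs there.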

Wiki xml parser - org.apache.spark.SparkException: Task not serializable

I am a newbie to both Scala and Spark and am trying some of the tutorials; this one is from Advanced Analytics with Spark. The following code is supposed to work:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "/home/petr/Downloads/wiki/wiki"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._
def wikiXmlToPlainText(xml: String): Option[(String, String)] = {
val page = new EnglishWikipediaPage()
WikipediaPage.readPage(page, xml)
if (page.isEmpty) None
else Some((page.getTitle, page.getContent))
}
val plainText = rawXmls.flatMap(wikiXmlToPlainText)
But it gives
scala> val plainText = rawXmls.flatMap(wikiXmlToPlainText)
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:295)
...
I am running Spark v1.3.0 locally (and I have loaded only about 21 MB of the wiki articles, just to test it).
None of the results at https://stackoverflow.com/search?q=org.apache.spark.SparkException%3A+Task+not+serializable gave me any clue.
Thanks.
Try this:
import com.cloudera.datascience.common.XmlInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io._
val path = "/home/terrapin/Downloads/enwiki-20150304-pages-articles1.xml-p000000010p000010000"
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, "<page>")
conf.set(XmlInputFormat.END_TAG_KEY, "</page>")
val kvs = sc.newAPIHadoopFile(path, classOf[XmlInputFormat],
classOf[LongWritable], classOf[Text], conf)
val rawXmls = kvs.map(p => p._2.toString)
import edu.umd.cloud9.collection.wikipedia.language._
import edu.umd.cloud9.collection.wikipedia._
val plainText = rawXmls.flatMap{line =>
val page = new EnglishWikipediaPage()
WikipediaPage.readPage(page, line)
if (page.isEmpty) None
else Some((page.getTitle, page.getContent))
}
My first guess is that all your code is wrapped in the object where the SparkContext is defined. Spark tries to serialize this object in order to ship the wikiXmlToPlainText function to the nodes. Try creating a separate object that contains only the wikiXmlToPlainText function.
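The difference between a method that lives on an enclosing instance and one that lives in a standalone object can be checked with plain Java serialization. This is a minimal sketch under stated assumptions: Wrapper is a hypothetical stand-in for whatever non-serializable wrapper surrounds the tutorial code, the parsing logic is replaced by a trivial placeholder, and all names other than wikiXmlToPlainText are invented for illustration.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object EnclosingInstanceDemo {
  // Hypothetical non-serializable wrapper, e.g. a class that also holds the SparkContext.
  class Wrapper {
    // Placeholder logic; the real function parses a Wikipedia page.
    def wikiXmlToPlainText(xml: String): Option[(String, String)] =
      if (xml.isEmpty) None else Some(("title", xml))

    // Calling an instance method inside the lambda captures `this`.
    def asFunction: String => Option[(String, String)] =
      (xml: String) => wikiXmlToPlainText(xml)
  }

  // The same logic in a standalone object: no Wrapper instance is captured.
  object WikiParsing {
    def wikiXmlToPlainText(xml: String): Option[(String, String)] =
      if (xml.isEmpty) None else Some(("title", xml))
  }

  def standaloneFunction: String => Option[(String, String)] =
    (xml: String) => WikiParsing.wikiXmlToPlainText(xml)

  // True if plain Java serialization accepts the value.
  def isSerializable(value: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(value); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }
}
```

Inlining the body into the flatMap call has the same effect: the closure then refers only to its own argument.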

Spark Streaming into HBase with filtering logic

I have been trying to understand how Spark Streaming and HBase connect, but have not been successful. What I am trying to do is, given a Spark stream, process that stream and store the results in an HBase table. So far this is what I have:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
def blah(row: Array[String]) {
val hConf = new HBaseConfiguration()
val hTable = new HTable(hConf, "table")
val thePut = new Put(Bytes.toBytes(row(0)))
thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), Bytes.toBytes(row(0)))
hTable.put(thePut)
}
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.map(_.split(","))
val store = words.foreachRDD(rdd => rdd.foreach(blah))
ssc.start()
I am currently running the above code in spark-shell. I am not sure what I am doing wrong.
I get the following error in the shell:
14/09/03 16:21:03 ERROR scheduler.JobScheduler: Error running job streaming job 1409786463000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I also double checked the hbase table, just in case, and nothing new is written in there.
I am running nc -lk 9999 on another terminal to feed in data into the spark-shell for testing.
With help from users on the Spark user group, I was able to figure out how to get this to work. It looks like I needed to wrap my streaming, mapping and foreach calls in a serializable object:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
object Blaher {
def blah(row: Array[String]) {
val hConf = new HBaseConfiguration()
val hTable = new HTable(hConf, "table")
val thePut = new Put(Bytes.toBytes(row(0)))
thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), Bytes.toBytes(row(0)))
hTable.put(thePut)
}
}
object TheMain extends Serializable{
def run() {
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.map(_.split(","))
val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
ssc.start()
}
}
TheMain.run()
This seems to be a typical antipattern: blah creates a new HBaseConfiguration and a new HTable for every single record.
See the "Design Patterns for using foreachRDD" chapter at http://spark.apache.org/docs/latest/streaming-programming-guide.html for the correct pattern.
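The pattern recommended in that chapter is to create the connection inside rdd.foreachPartition, so that it is created on the executor (and never serialized) and is shared by all records of a partition instead of being opened once per record. The sketch below only imitates that shape with plain Scala collections so that it runs without Spark or HBase: FakeTable is a hypothetical stand-in for an HBase table handle, each inner Seq plays the role of one partition, and PerPartitionWriteDemo and writeAll are made-up names.

```scala
object PerPartitionWriteDemo {
  // Hypothetical stand-in for an HBase table handle; it only counts puts.
  final class FakeTable {
    var puts = 0
    def put(rowKey: String): Unit = puts += 1
    def close(): Unit = ()
  }

  // Returns (handles opened, rows written). The loop body has the shape of
  // the function that would be passed to rdd.foreachPartition.
  def writeAll(partitions: Seq[Seq[Array[String]]]): (Int, Int) = {
    var opened = 0
    var written = 0
    partitions.foreach { rows =>
      val table = new FakeTable // created where it is used, once per partition
      opened += 1
      try rows.foreach(row => table.put(row(0)))
      finally {
        written += table.puts
        table.close()
      }
    }
    (opened, written)
  }
}
```

With two partitions holding three rows in total, writeAll opens two handles rather than three, which is the saving the per-partition pattern is after.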