Writing to HBase via Spark: Task not serializable - scala

I'm trying to write some simple data into HBase (0.96.0-hadoop2) using Spark 1.0, but I keep getting serialization problems. Here is the relevant code:
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import java.util.Properties
import java.io.FileInputStream
import org.apache.hadoop.hbase.client.Put
object PutRawDataIntoHbase {
  def main(args: Array[String]): Unit = {
    var propFileName = "hbaseConfig.properties"
    if (args.size > 0) {
      propFileName = args(0)
    }

    /** Load properties here **/

    val theData = sc.textFile(prop.getProperty("hbase.input.filename"))
      .map(l => l.split("\t"))
      .map(a => Array("%010d".format(a(9).toInt) + "-" + a(0), a(1)))

    val tableName = prop.getProperty("hbase.table.name")
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.rootdir", prop.getProperty("hbase.rootdir"))
    hbaseConf.addResource(prop.getProperty("hbase.site.xml"))
    val myTable = new HTable(hbaseConf, tableName)

    theData.foreach(a => {
      var p = new Put(Bytes.toBytes(a(0)))
      p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
      myTable.put(p)
    })
  }
}
Running the code results in:
Failed to run foreach at putDataIntoHBase.scala:79
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:org.apache.hadoop.hbase.client.HTable
Replacing the foreach with map doesn't crash, but it doesn't write anything either.
Any help will be greatly appreciated.

The class HBaseConfiguration represents a pool of connections to HBase servers. Obviously, it can't be serialized and sent to the worker nodes. Since HTable uses this pool to communicate with the HBase servers, it can't be serialized either.
Basically, there are three ways to handle this problem:
Open a connection on each of the worker nodes.
Note the use of the foreachPartition method:
val tableName = prop.getProperty("hbase.table.name")
<......>
theData.foreachPartition { iter =>
  val hbaseConf = HBaseConfiguration.create()
  <... configure HBase ...>
  val myTable = new HTable(hbaseConf, tableName)
  iter.foreach { a =>
    var p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    myTable.put(p)
  }
}
Note that each of the worker nodes must have access to the HBase servers and must have the required jars preinstalled or provided via ADD_JARS.
Also note that since a connection pool is opened for each partition, it would be a good idea to reduce the number of partitions roughly to the number of worker nodes (with the coalesce function), as sketched below. It's also possible to share a single HTable instance on each worker node, but that is not so trivial.
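A minimal sketch of the coalesce idea (the partition count is an assumption; pick something close to your number of workers):
val numPartitions = 4 // assumption: roughly the number of worker nodes
theData.coalesce(numPartitions).foreachPartition { iter =>
  val hbaseConf = HBaseConfiguration.create()
  // <... configure HBase as above ...>
  val myTable = new HTable(hbaseConf, tableName)
  iter.foreach { a =>
    val p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    myTable.put(p)
  }
  myTable.close() // flushes buffered puts and releases the connection
}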
Serialize all data to a single box and write it to HBase
It's possible to write all the data from an RDD from a single machine, even if the data doesn't fit in memory. The details are explained in this answer: Spark: Best practice for retrieving big data from RDD to local machine
Of course, it would be slower than distributed writing, but it's simple, doesn't bring painful serialization issues and might be the best approach if the data size is reasonable.
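A sketch of that approach, assuming your Spark version has RDD.toLocalIterator (the linked answer shows how to pull partitions to the driver one by one if it doesn't); the driver machine needs access to the HBase servers:
val hbaseConf = HBaseConfiguration.create()
// <... configure HBase as above ...>
val myTable = new HTable(hbaseConf, tableName)
// toLocalIterator fetches one partition at a time, so the whole RDD
// never has to fit into driver memory at once.
theData.toLocalIterator.foreach { a =>
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  myTable.put(p)
}
myTable.close()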
Use HadoopOutputFormat
It's possible to create a custom HadoopOutputFormat for HBase or use an existing one. I'm not sure if there exists something that fits your needs, but Google should help here.
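For what it's worth, HBase ships its own OutputFormat (org.apache.hadoop.hbase.mapreduce.TableOutputFormat). A sketch of using it from Spark could look like this; details vary between HBase versions, so treat it as a starting point rather than a drop-in solution:
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

val hbaseConf = HBaseConfiguration.create()
// <... configure HBase as above ...>
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, tableName)

theData
  .map { a =>
    val p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    (new ImmutableBytesWritable, p) // the key is ignored by TableOutputFormat
  }
  .saveAsNewAPIHadoopDataset(job.getConfiguration) // the writes run on the workers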
P.S. By the way, the map call doesn't crash because it never gets evaluated: RDD transformations are lazy and only run when an action forces them. If you followed the map with an action such as count, it would crash in the same way.
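A tiny illustration of that point:
// Nothing runs here; the closure (which captures myTable) is only recorded.
val wouldFail = theData.map { a => myTable.put(new Put(Bytes.toBytes(a(0)))); a }
// The NotSerializableException only appears once an action forces evaluation:
wouldFail.count()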

Related

MiniDFS cluster setup for multiple test classes throws java.net.BindException: Address already in use

I am writing unit test cases for Spark code that reads/writes data from/to both HDFS files and Spark's catalog. For this I created a separate trait that provides initialisation of a MiniDFS cluster, and I use the generated HDFS URI as the value of spark.sql.warehouse.dir when creating the SparkSession object. Here is the code for it -
trait TestSparkSession extends BeforeAndAfterAll {
  self: Suite =>

  var hdfsCluster: MiniDFSCluster = _

  def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"

  def withLocalSparkSession(tests: SparkSession => Any): Any = {
    val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    val builder = new MiniDFSCluster.Builder(conf)
    hdfsCluster = builder.nameNodePort(9000)
      .manageNameDfsDirs(true)
      .manageDataDfsDirs(true)
      .format(true)
      .build()
    hdfsCluster.waitClusterUp()

    val testSpark = SparkSession
      .builder()
      .master("local")
      .appName("Test App")
      .config("spark.sql.warehouse.dir", s"${nameNodeURI}spark-warehouse/")
      .getOrCreate()
    tests(testSpark)
  }

  def stopHdfs(): Unit = hdfsCluster.shutdown(true, true)

  override def afterAll(): Unit = stopHdfs()
}
While writing my tests, I inherit this trait and then write test cases like -
class SampleSpec extends FunSuite with TestSparkSession {
  withLocalSparkSession { testSpark =>
    import testSpark.implicits._
    // Test 1 Here
    // Test 2 Here
  }
}
Everything works fine when I run my test classes one at a time, but when I run them all at once it throws java.net.BindException: Address already in use.
That should mean that the already created hdfsCluster is not yet down when the next set of tests is executed, which is why it is unable to create another one bound to the same port. But I do stop the hdfsCluster in afterAll().
My question is: can I share a single instance of the HDFS cluster and the Spark session instead of initialising them every time? I have tried to extract the initialisation out of the method, but it still throws the same exception. And even if I can't share it, how can I properly stop my cluster and re-initialise it on the next test class execution?
Also, please let me know if my approach for writing 'unit' test cases that use SparkSession and HDFS storage is correct.
Any help will be greatly appreciated.
I resolved it by creating the HDFS cluster in a companion object instead, so that a single instance of it is created for all the test suites.
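A rough sketch of that layout (assuming all suites can share one cluster and one port, and that a temporary directory is acceptable for HDFS_MINIDFS_BASEDIR):
import java.io.File
import org.apache.hadoop.hdfs.{HdfsConfiguration, MiniDFSCluster}
import org.scalatest.{BeforeAndAfterAll, Suite}

object TestSparkSession {
  // Initialised at most once per JVM, so the name node port is bound only once
  // no matter how many suites mix in the trait.
  lazy val hdfsCluster: MiniDFSCluster = {
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR,
      new File(System.getProperty("java.io.tmpdir"), "miniHDFS").getAbsolutePath)
    val cluster = new MiniDFSCluster.Builder(conf)
      .nameNodePort(9000)
      .manageNameDfsDirs(true)
      .manageDataDfsDirs(true)
      .format(true)
      .build()
    cluster.waitClusterUp()
    cluster
  }
}

trait TestSparkSession extends BeforeAndAfterAll {
  self: Suite =>

  // Reuse the shared cluster instead of building one per suite.
  def hdfsCluster: MiniDFSCluster = TestSparkSession.hdfsCluster
  def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"

  // withLocalSparkSession stays as before, minus the cluster construction,
  // and afterAll no longer shuts the shared cluster down.
}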

Processing a big table with Slick fails with OutOfMemoryError

I am querying a big MySQL table with Akka Streams and Slick, but it fails with an OutOfMemoryError. It seems that Slick is loading all the results into memory (it does not fail if the query is limited to a few rows). Why is this the case, and what is the solution?
val dbUrl = "jdbc:mysql://..."
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.alpakka.slick.scaladsl.SlickSession
import akka.stream.alpakka.slick.scaladsl.Slick
import akka.stream.scaladsl.Source
import akka.stream.{ActorMaterializer, Materializer}
import com.typesafe.config.ConfigFactory
import slick.jdbc.GetResult
import scala.concurrent.Await
import scala.concurrent.duration.Duration
val slickDbConfig = s"""
|profile = "slick.jdbc.MySQLProfile$$"
|db {
| dataSourceClass = "slick.jdbc.DriverDataSource"
| properties = {
| driver = "com.mysql.jdbc.Driver",
| url = "$dbUrl"
| }
|}
|""".stripMargin
implicit val actorSystem: ActorSystem = ActorSystem()
implicit val materializer: Materializer = ActorMaterializer()
implicit val slickSession: SlickSession = SlickSession.forConfig(ConfigFactory.parseString(slickDbConfig))
import slickSession.profile.api._
val responses: Source[String, NotUsed] = Slick.source(
sql"select my_text from my_table".as(GetResult(r => r.nextString())) // limit 100
)
val future = responses.runForeach((myText: String) =>
println("my_text: " + myText.length)
)
Await.result(future, Duration.Inf)
From the Slick documentation:
Note: Some database systems may require session parameters to be set in a certain way to support streaming without caching all data at once in memory on the client side. For example, PostgreSQL requires both .withStatementParameters(rsType = ResultSetType.ForwardOnly, rsConcurrency = ResultSetConcurrency.ReadOnly, fetchSize = n) (with the desired page size n) and .transactionally for proper streaming.
In other words, to prevent the JDBC driver from loading all the query results into memory on the client side, one might need additional configuration. This configuration is database dependent. The MySQL documentation states the following:
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row.
To set the above configuration in Slick:
import slick.jdbc._
val query =
sql"select my_text from my_table".as(GetResult(r => r.nextString()))
.withStatementParameters(
rsType = ResultSetType.ForwardOnly,
rsConcurrency = ResultSetConcurrency.ReadOnly,
fetchSize = Int.MinValue
)//.transactionally <-- I'm not sure whether you need ".transactionally"
val responses: Source[String, NotUsed] = Slick.source(query)

Running MLlib via Spark Job Server

I was practising developing a sample model using the online resources provided on the Spark website. I managed to create the model and run it on sample data using spark-shell, but how do I actually run the model in a production environment? Is it via Spark Job Server?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfectly when I run it in spark-shell, but I have no idea how to actually run the model in a production environment. I tried to run it via Spark Job Server but I get an error,
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure it's because I am passing a String value whereas the program expects vector elements. Can someone guide me on how to achieve this? And is this how data is passed to the model in a production environment, or is it some other way?
Spark Job-server is used in production use-cases, where you want to design pipelines of Spark jobs, and also (optionally) use the SparkContext across jobs, over a REST API. Sparkplug is an alternative to Spark Job-server, providing similar constructs.
However, to answer your question on how to run a (singular) Spark job in production environments, the answer is that you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is:
package runner
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}
/**
*
*/
object SparkRunner {

  def main(args: Array[String]) {
    val config: Config = ConfigFactory.load("app-default-config") /* Use a library to read a config file */
    val sc: SparkContext = constructSparkContext(config)

    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }
    var svm = new SVMWithSGD().setIntercept(true)
    val model = svm.run(parsedData)
    var predictedValue = model.predict(Vectors.dense(5, 1, 1, 1, 2, 1, 3, 1, 1))
    println(predictedValue)
  }

  def constructSparkContext(config: Config): SparkContext = {
    val conf = new SparkConf()
    conf
      .setMaster(config.getString("spark.master"))
      .setAppName(config.getString("app.name"))
    /* Set more configuration values here */
    new SparkContext(conf)
  }
}
Optionally, you can also use the spark-submit script, or the SparkSubmit class it wraps, provided in the Spark distribution itself.
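For reference, a typical spark-submit invocation for the code above might look like this (the jar name and master URL are placeholders for your own build output and cluster):
spark-submit \
  --class runner.SparkRunner \
  --master spark://your-master-host:7077 \
  target/scala-2.10/your-app-assembly.jar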

Will a standalone Scala program take advantage of distributed/parallel processing, or does Spark require separate Scala code?

First of all, sorry for asking such a basic question here, but an explanation of the below would be appreciated.
I am very new to Scala and Spark, so my question is: if I write a standalone Scala program and execute it on Spark (1 master, 3 workers), will the program take advantage of distributed/parallel processing, or do I need to write a separate program to get the advantage of distributed processing?
For example, we have Scala code that converts a particular file format to a comma-separated file. It takes a directory as input, parses all the files and writes the output to a single file (each file is usually 100-200 MB). Here is the code.
import scala.io.Source
import java.io.File
import java.io.PrintWriter
import scala.collection.mutable.ListBuffer
import java.util.Calendar
//import scala.io.Source
//import org.apache.spark.SparkContext
//import org.apache.spark.SparkContext._
//import org.apache.spark.SparkConf
object Parser {
  def main(args: Array[String]) {
    //val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
    //val sc = new SparkContext(conf)
    var inp = new File(args(0))
    var ext: String = ""
    if (args.length == 1) {
      ext = "log"
    } else {
      ext = args(1)
    }
    var files: List[String] = List("")
    if (inp.exists && inp.isDirectory) {
      files = getListOfFiles(inp, ext)
    } else if (inp.exists) {
      files = List(inp.toString)
    } else {
      println("Enter the correct Directory/File name")
      System.exit(0)
    }
    if (files.length <= 0) {
      println(s"No file found with extention '.$ext'")
    } else {
      var out_file_name = "output_" + Calendar.getInstance().getTime.toString.replace(" ", "-").replace(":", "-") + ".log"
      var data = getHeader(files(0))
      var writer = new PrintWriter(new File(out_file_name))
      var record_count = 0
      //var allrecords = data.mkString(",")+("\n")
      //writer.write(allrecords)
      for (eachFile <- files) {
        record_count += parseFile(writer, data, eachFile)
      }
      writer.close()
      println(record_count + s" processed into $out_file_name")
    }
  }
  // all the helper functions (getListOfFiles, getHeader, parseFile) are defined here.
}
Files from the specific dir are read using scala.io
Source.fromFile(file).getLines
So my question is: can the above code (a standalone program) be executed on a distributed Spark system? Will I get the advantage of parallel processing?
OK, how about using sc to read the files; will it then use distributed processing?
val conf = new SparkConf().setAppName("fileParsing").setMaster("local[*]")
val sc = new SparkContext(conf)
...
...
for(eachFile <- files)
{
record_count += parseFile(sc,writer,data,eachFile)
}
------------------------------------
def parseFile(......)
sc.textFile(file).getLines
So if I edit the code above to make use of sc, will it then be processed on a distributed Spark system?
No it won't. To make use of distributed computing using Spark, you need to use SparkContext.
If you run the application you have provided using spark-submit you will not be using the Spark cluster at all. You have to rewrite it to use the SparkContext. Please read through the Spark Programming Guide.
It is extremely helpful to watch some introductory videos on Youtube for getting to know how Apache Spark works in general.
For example, these:
https://www.youtube.com/watch?v=7k4yDKBYOcw
https://www.youtube.com/watch?v=rvDpBTV89AM&list=PLF6snu5Jy-v-WRAcCfWNHks7lcNO-zrTI&index=4
It is very important to understand this in order to use Spark.
"advantage of distributed processing"
Using Spark can give you the advantages of distributing processing across a multi-server cluster. So if you are going to move your application to a cluster later, it makes sense to develop the application using the Spark model and its corresponding API.
Well, you can run a Spark application locally on your own machine, but in that case you won't get all the advantages that Spark can provide.
Anyway, as was said before, Spark is a separate framework with its own libraries for development. So you have to rewrite your application using the SparkContext and the Spark API, i.e. objects like RDDs or DataFrames and their corresponding methods.
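As a rough sketch of what that rewrite could look like for the file-parsing example (parseLine is a hypothetical stand-in for the existing per-line parsing logic, and the paths come from the command line):
import org.apache.spark.{SparkConf, SparkContext}

object DistributedParser {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("fileParsing") // master is supplied by spark-submit
    val sc = new SparkContext(conf)

    // textFile accepts a directory or a glob; the files are split into
    // partitions and parsed in parallel on the worker nodes.
    val records = sc.textFile(args(0)).map(parseLine)

    // The output is also written in parallel, as one part-file per partition.
    records.saveAsTextFile(args(1))

    sc.stop()
  }

  // Hypothetical placeholder for the existing per-line parsing code.
  def parseLine(line: String): String = line
}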

How to Cache an Array of Dataframes/Values in Spark

I am trying to build a large number of random forest models by group using Spark. My approach is to cache a large input data file, split it into pieces based on School_ID, cache the individual school input data in memory, run a model on each of them, and then extract the labels and predictions.
model_input.cache()
val schools = model_input.select("School_ID").distinct.collect.flatMap(_.toSeq)
val bySchoolArray = schools.map(School_ID => model_input.where($"School_ID" <=> School_ID).cache)
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.{Pipeline, PipelineModel}
def trainModel(df: DataFrame): PipelineModel = {
val rf = new RandomForestClassifier()
//omit some parameters
val pipeline = new Pipeline().setStages(Array(rf))
pipeline.fit(df)
}
val bySchoolArrayModels = bySchoolArray.map(df => trainModel(df))
val preds = (0 to schools.length - 1).map { i =>
  val pred = bySchoolArrayModels(i).transform(bySchoolArray(i)).select("prediction", "label")
  pred.write.format("com.databricks.spark.csv").
    option("header", "true").
    save("predictions/pred" + schools(i))
  pred
}
The code works fine on a small subset, but it takes longer than I expected. It seems to me that every time I run an individual model, Spark reads the entire file, and it takes forever to complete all the model runs. I was wondering whether I did not cache the files correctly or whether something went wrong with the way I coded it.
Any suggestions would be useful. Thanks!
cache() does not modify your variable in place; it returns the reference that is marked for caching. So assign the result of cache() to another variable and reuse that one everywhere; otherwise you are not working with the cached data.
val cachedModelInput = model_input.cache()
val schools = cachedModelInput.select("School_ID").distinct.collect.flatMap(_.toSeq)
....
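Continuing that idea, a sketch of how the per-school DataFrames could also be derived from the cached reference (and materialised up front, since cache() is lazy):
// Build every per-school DataFrame from the cached input and cache each one as well.
val bySchoolArray = schools.map(id => cachedModelInput.where($"School_ID" <=> id).cache())

// cache() only marks the data; an action is needed to actually materialise it,
// so the later model runs reuse the cached pieces instead of re-reading the input.
bySchoolArray.foreach(_.count())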