MiniDFS cluster setup for multiple test classes throws java.net.BindException: Address already in use - scala

I am writing unit test cases for Spark code that reads and writes data from/to both HDFS files and Spark's catalog. For this I created a separate trait that provides initialisation of the MiniDFS cluster, and I use the generated HDFS URI as the value of spark.sql.warehouse.dir when creating the SparkSession object. Here is the code for it:
trait TestSparkSession extends BeforeAndAfterAll {
  self: Suite =>

  var hdfsCluster: MiniDFSCluster = _

  def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"

  def withLocalSparkSession(tests: SparkSession => Any): Any = {
    val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    val builder = new MiniDFSCluster.Builder(conf)
    hdfsCluster = builder.nameNodePort(9000)
      .manageNameDfsDirs(true)
      .manageDataDfsDirs(true)
      .format(true)
      .build()
    hdfsCluster.waitClusterUp()

    val testSpark = SparkSession
      .builder()
      .master("local")
      .appName("Test App")
      .config("spark.sql.warehouse.dir", s"${nameNodeURI}spark-warehouse/")
      .getOrCreate()

    tests(testSpark)
  }

  def stopHdfs(): Unit = hdfsCluster.shutdown(true, true)

  override def afterAll(): Unit = stopHdfs()
}
When writing my tests, I mix in this trait and then write test cases like this:
class SampleSpec extends FunSuite with TestSparkSession {
  withLocalSparkSession { testSpark =>
    import testSpark.implicits._

    // Test 1 here
    // Test 2 here
  }
}
Everything works fine when I run my test classes one at a time, but when I run them all at once it throws java.net.BindException: Address already in use.
This should mean that the previously created hdfsCluster is not yet down when the next set of tests executes, which is why it cannot create another cluster bound to the same port. But I do stop the hdfsCluster in afterAll().
My question is: can I share a single instance of the HDFS cluster and the Spark session instead of initialising them every time? I have tried to extract the initialisation out of the method, but it still throws the same exception. Even if I can't share them, how can I properly stop my cluster and re-initialise it when the next test class executes?
Also, please let me know whether my approach of writing 'unit' test cases that use a SparkSession and HDFS storage is correct.
Any help will be greatly appreciated.

I resolved it by creating the HDFS cluster in a companion object instead, so that a single instance of it is created and shared by all the test suites.
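For reference, a minimal sketch of that companion-object approach, based on the trait above (the shutdown hook is my addition rather than part of the original fix):

import java.io.File
import org.apache.hadoop.hdfs.{HdfsConfiguration, MiniDFSCluster}
import org.apache.hadoop.test.PathUtils

object TestSparkSession {
  // Built lazily on first use and shared by every suite that mixes in the trait.
  lazy val hdfsCluster: MiniDFSCluster = {
    val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    val cluster = new MiniDFSCluster.Builder(conf)
      .nameNodePort(9000)
      .manageNameDfsDirs(true)
      .manageDataDfsDirs(true)
      .format(true)
      .build()
    cluster.waitClusterUp()
    // Tear the cluster down once, when the test JVM exits.
    sys.addShutdownHook(cluster.shutdown(true, true))
    cluster
  }

  lazy val nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"
}

The trait then reads TestSparkSession.nameNodeURI for spark.sql.warehouse.dir and drops the shutdown from afterAll(), so every suite in the JVM talks to the one NameNode instead of racing to bind port 9000 again.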

Related

Spark Session Dispose after Unit test for specified file is Done

I'm writing unit tests for Spark Scala code and facing this issue:
when I run the unit test files separately everything is fine, but when I run all of the unit tests in the module using Maven, the test cases fail.
How can we create a local instance of Spark, or mock one, for unit tests?
Cannot call methods on a stopped SparkContext. This stopped SparkContext was created at:
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:947)
Methods I tried:
creating a private Spark session in each unit test file;
creating a common Spark session trait for all unit test files;
calling spark.stop() at the end of each file and then removing it from all of them.
To reproduce, make two unit test files and try to execute them together. Both files should have a Spark session.
import org.apache.spark.SparkContext
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.scalatest.flatspec.AnyFlatSpec

class test1 extends AnyFlatSpec {
  val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .getOrCreate()
  val sc: SparkContext = spark.sparkContext
  val sqlCont: SQLContext = spark.sqlContext

  "test1" should "take spark session spark context and sql context" in {
    // do something
  }
}

class test2 extends AnyFlatSpec {
  val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .getOrCreate()
  val sc: SparkContext = spark.sparkContext
  val sqlCont: SQLContext = spark.sqlContext

  "test2" should "take spark session spark context and sql context" in {
    // do something
  }
}
When you run these independently, each file works fine, but when you run them together using mvn test they fail.
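For reference, the "common Spark session trait" approach mentioned above can work if no suite ever stops the shared session; here is a minimal sketch with the session held in a companion object (names such as SharedSparkSession are placeholders, not from the original post):

import org.apache.spark.sql.SparkSession
import org.scalatest.flatspec.AnyFlatSpec

object SharedSparkSession {
  // One session per test JVM, created lazily on first use and never stopped by a suite.
  lazy val spark: SparkSession = SparkSession.builder
    .master("local[*]")
    .appName("unit-tests")
    .getOrCreate()
}

trait SharedSparkSession {
  def spark: SparkSession = SharedSparkSession.spark
}

class Test1 extends AnyFlatSpec with SharedSparkSession {
  "test1" should "use the shared session" in {
    assert(spark.range(10).count() == 10)
  }
}

class Test2 extends AnyFlatSpec with SharedSparkSession {
  "test2" should "reuse the same, still-running session" in {
    assert(spark.sparkContext.parallelize(1 to 3).sum() == 6)
  }
}

If cleanup is really needed, stop the session once in a JVM shutdown hook rather than in each suite's afterAll().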

When is onQueryTerminated triggered in Apache Spark StreamingQueryListeners?

I'm developing a custom StreamingQueryListener and I'd like to trigger its onQueryTerminated method in a test.
This is what I tried implementing:
import org.apache.spark.sql.{ SQLContext, SparkSession }
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{ col, to_date }
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.scalatest.flatspec.AnyFlatSpec
class MyListener extends StreamingQueryListener {
  override def onQueryStarted(event: StreamingQueryListener.QueryStartedEvent): Unit = {}
  override def onQueryProgress(event: StreamingQueryListener.QueryProgressEvent): Unit = {}
  override def onQueryTerminated(event: StreamingQueryListener.QueryTerminatedEvent): Unit = println(event.exception)
}

class ListenerSpec extends AnyFlatSpec {
  it should "trigger onQueryTerminated" in {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    spark.streams.addListener(new MyListener())
    implicit val sqlContext: SQLContext = spark.sqlContext
    import spark.implicits._

    val stream = MemoryStream[Int]
    stream.addData(Seq(1, 3, 4))

    val query = stream
      .toDF()
      .withColumn("columnDoesntExist", to_date(col("names")))
      .writeStream
      .format("console")
      .start()

    query.awaitTermination()
  }
}
However, this doesn't work: the code raises an AnalysisException, but the onQueryTerminated method isn't triggered by the termination of the streaming query.
In which situations is that method triggered with event.exception being Some(exception)?
Update
The following code successfully triggers the execution of onQueryTerminated:
val exceptionUdf = udf(() => throw new Exception())
val query = stream
  .toDF()
  .withColumn("exception", exceptionUdf())
  .writeStream
  .format("console")
  .start()
Refer to the accepted answer for an explanation as to why.
According to the book "Stream Processing with Apache Spark" (published by O'Reilly) the onQueryTerminated method gets
"Called when a streaming query is stopped. The event contains id and runId fields that correlate with the start event. It also provides an exception field that contains an exception if the query failed due to an error."
As you are getting an AnalysisException, your query has not even started yet. It only got through the first of the four phases of the Catalyst optimizer, the "Analysis" phase, and was never transformed into runnable code.
More details on the Catalyst Optimizer.
The AnalysisException just means that there are issues in the code related to the Catalog, which is exactly what you intended to provoke: you refer to a column that does not exist (in the Catalog).
If you want to trigger the execution of the onQueryTerminated method, you need working code that fails while it is already running (e.g. by feeding it a wrong input data type).
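To make the contrast concrete, here is a small hypothetical test (reusing the MyListener class from the question): a query that passes analysis and is then stopped also fires onQueryTerminated, but with event.exception empty; only a failure at runtime populates it. Note that listener events are delivered asynchronously, so a real assertion may need a short wait.

import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.scalatest.flatspec.AnyFlatSpec

class StopListenerSpec extends AnyFlatSpec {
  "a stopped streaming query" should "fire onQueryTerminated with an empty exception" in {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    spark.streams.addListener(new MyListener())
    implicit val sqlContext: SQLContext = spark.sqlContext
    import spark.implicits._

    val stream = MemoryStream[Int]
    stream.addData(Seq(1, 2, 3))

    val query = stream.toDF().writeStream.format("console").start()
    query.processAllAvailable()   // let at least one micro-batch complete
    query.stop()                  // normal termination: event.exception is None
    query.awaitTermination()
  }
}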

How can I read HDFS files from a spark executor?

I have a large (> 500m row) CSV file. Each row in this CSV file contains a path to a binary file located on HDFS. I want to use Spark to read each of these files, process them, and write out the results to another CSV file or a table.
Doing this is simple enough in the driver, and the following code gets the job done
val hdfsFilePathList = // read paths from CSV, collect into list
hdfsFilePathList.map( pathToHdfsFile => {
  sqlContext.sparkContext.binaryFiles(pathToHdfsFile).mapPartitions {
    functionToProcessBinaryFiles(_)
  }
})
The major problem with this is that the driver is doing too much of the work. I would like to farm out the work done by binaryFiles to the executors. I've found some promising examples that I thought would allow me to access the sparkContext from an executor:
Use SparkContext hadoop configuration within RDD methods/closures, like foreachPartition
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/SerializableConfiguration.scala
but they don't seem to work the way I thought they would. I'd expect the following to work:
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
class ConfigSerDeser(var conf: Configuration) extends Serializable {

  def this() {
    this(new Configuration())
  }

  def get(): Configuration = conf

  private def writeObject(out: java.io.ObjectOutputStream): Unit = {
    conf.write(out)
  }

  private def readObject(in: java.io.ObjectInputStream): Unit = {
    conf = new Configuration()
    conf.readFields(in)
  }

  private def readObjectNoData(): Unit = {
    conf = new Configuration()
  }
}

val serConf = new ConfigSerDeser(sc.hadoopConfiguration)
val mappedIn = inputDf.map( row => {
  serConf.get()
})
But it fails with KryoException: java.util.ConcurrentModificationException.
Is it possible to have the executors access HDFS files, or the HDFS file system, directly? Alternatively, is there an efficient way to read millions of binary files on HDFS/S3 and process them with Spark?
I had a similar use case where I was trying to do the same, but realised that SparkSession and SparkContext are not serializable and thus cannot be accessed from executors.
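What usually does work is to skip SparkContext on the executors entirely and read the files through the Hadoop FileSystem API inside mapPartitions. A rough sketch under those assumptions (the file-list location and the per-file processing are placeholders; commons-io is used only for brevity):

import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// One fully qualified HDFS path per record, read from the driver-side CSV of paths.
val pathsRdd = spark.read.textFile("hdfs:///path/to/file-list.csv").rdd  // illustrative location

val processed = pathsRdd.mapPartitions { pathIter =>
  // This closure runs on the executors, so no SparkContext is needed here.
  // A fresh Configuration picks up core-site.xml / hdfs-site.xml from the
  // executor classpath; alternatively, ship the driver configuration with a
  // serializable wrapper such as the ConfigSerDeser above and call get() here.
  val conf = new Configuration()
  pathIter.map { pathStr =>
    val path = new Path(pathStr)
    val fs = path.getFileSystem(conf)       // FileSystem instances are cached per scheme/authority
    val in = fs.open(path)
    try {
      val bytes = IOUtils.toByteArray(in)   // whole binary file into memory
      (pathStr, bytes.length)               // stand-in for real per-file processing
    } finally {
      in.close()
    }
  }
}

With this layout, binaryFiles is not needed at all; each executor opens and processes only the paths in its own partitions.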

Running Mlib via Spark Job Server

I was practising developing a sample model using the online resources provided on the Spark website. I managed to create the model and run it on sample data using spark-shell, but how do you actually run a model in a production environment? Is it via Spark Job Server?
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}

var svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)
The above code works perfectly when I run it in spark-shell, but I have no idea how to actually run the model in a production environment. I tried to run it via Spark Job Server but I get an error:
curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'
I am sure it's because I am passing a String value whereas the program expects vector elements. Can someone guide me on how to achieve this? Also, is this how data is passed to the model in a production environment, or is it done some other way?
Spark Job Server is used in production use cases where you want to design pipelines of Spark jobs and (optionally) share the SparkContext across jobs over a REST API. Sparkplug is an alternative to Spark Job Server, providing similar constructs.
However, to answer your question on how to run a (single) Spark job in a production environment: you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For your code snippet, all that is needed is:
package runner

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors
import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}

object SparkRunner {

  def main(args: Array[String]): Unit = {
    val config: Config = ConfigFactory.load("app-default-config") /* Use a library to read a config file */
    val sc: SparkContext = constructSparkContext(config)

    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }

    var svm = new SVMWithSGD().setIntercept(true)
    val model = svm.run(parsedData)
    var predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
    println(predictedValue)
  }

  def constructSparkContext(config: Config): SparkContext = {
    val conf = new SparkConf()
    conf
      .setMaster(config.getString("spark.master"))
      .setAppName(config.getString("app.name"))
    /* Set more configuration values here */
    new SparkContext(conf)
  }
}
Optionally, you can also use SparkSubmit, the wrapper for the spark-submit script that is provided in the Spark library itself.
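If you would rather stay inside the JVM than shell out, Spark also ships a programmatic wrapper around spark-submit, org.apache.spark.launcher.SparkLauncher. A rough sketch with a placeholder jar path and master:

import org.apache.spark.launcher.SparkLauncher

object Launcher {
  def main(args: Array[String]): Unit = {
    // Builds and forks a spark-submit invocation for the SparkRunner job above.
    val sparkProcess = new SparkLauncher()
      .setAppResource("/path/to/your-app-assembly.jar")  // placeholder jar location
      .setMainClass("runner.SparkRunner")
      .setMaster("yarn")                                 // or local[*], spark://host:7077, ...
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch()

    val exitCode = sparkProcess.waitFor()
    println(s"Spark job finished with exit code $exitCode")
  }
}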

Writing to HBase via Spark: Task not serializable

I'm trying to write some simple data to HBase (0.96.0-hadoop2) using Spark 1.0, but I keep getting serialization problems. Here is the relevant code:
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import java.util.Properties
import java.io.FileInputStream
import org.apache.hadoop.hbase.client.Put
object PutRawDataIntoHbase {

  def main(args: Array[String]): Unit = {
    var propFileName = "hbaseConfig.properties"
    if (args.size > 0) {
      propFileName = args(0)
    }

    /** Load properties here **/

    val theData = sc.textFile(prop.getProperty("hbase.input.filename"))
      .map(l => l.split("\t"))
      .map(a => Array("%010d".format(a(9).toInt) + "-" + a(0), a(1)))

    val tableName = prop.getProperty("hbase.table.name")
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set("hbase.rootdir", prop.getProperty("hbase.rootdir"))
    hbaseConf.addResource(prop.getProperty("hbase.site.xml"))

    val myTable = new HTable(hbaseConf, tableName)
    theData.foreach { a =>
      var p = new Put(Bytes.toBytes(a(0)))
      p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
      myTable.put(p)
    }
  }
}
Running the code results in:
Failed to run foreach at putDataIntoHBase.scala:79
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException:org.apache.hadoop.hbase.client.HTable
Replacing the foreach with map doesn't crash, but then nothing gets written either.
Any help will be greatly appreciated.
The class HBaseConfiguration represents a pool of connections to the HBase servers. Obviously, it can't be serialized and sent to the worker nodes. Since HTable uses this pool to communicate with the HBase servers, it can't be serialized either.
Basically, there are three ways to handle this problem:
Open a connection on each of the worker nodes.
Note the use of the foreachPartition method:
val tableName = prop.getProperty("hbase.table.name")
<......>
theData.foreachPartition { iter =>
  val hbaseConf = HBaseConfiguration.create()
  <... configure HBase ...>
  val myTable = new HTable(hbaseConf, tableName)
  iter.foreach { a =>
    var p = new Put(Bytes.toBytes(a(0)))
    p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
    myTable.put(p)
  }
}
Note that each of the worker nodes must have access to the HBase servers and must have the required jars preinstalled or provided via ADD_JARS.
Also note that since the connection pool is opened for each partition, it is a good idea to reduce the number of partitions roughly to the number of worker nodes (with the coalesce function). It's also possible to share a single HTable instance on each of the worker nodes, but that is not so trivial.
Serialize all data to a single box and write it to HBase
It's possible to write all data from an RDD with a single computer, even if the data doesn't fit in memory. The details are explained in this answer: Spark: Best practice for retrieving big data from RDD to local machine
Of course, it would be slower than distributed writing, but it's simple, doesn't bring painful serialization issues and might be the best approach if the data size is reasonable.
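In recent Spark versions, RDD.toLocalIterator is a simple way to do that driver-side write: it pulls one partition at a time to the driver instead of collecting everything at once. A rough sketch, reusing the names from the question:

val myTable = new HTable(hbaseConf, tableName)
theData.toLocalIterator.foreach { a =>
  // Runs entirely on the driver, so HTable never needs to be serialized.
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  myTable.put(p)
}
myTable.close()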
Use HadoopOutputFormat
It's possible to create a custom HadoopOutputFormat for HBase or use an existing one. I'm not sure if there exists something that fits your needs, but Google should help here.
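For what it's worth, HBase already ships a TableOutputFormat in org.apache.hadoop.hbase.mapreduce that plugs into Spark's saveAsNewAPIHadoopDataset. A rough sketch of how it could replace the foreach above, reusing theData, tableName and hbaseColFamily from the question (exact class and property names may differ between HBase versions):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._   // PairRDDFunctions implicits (needed on older Spark)

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
val job = Job.getInstance(hbaseConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// Turn each record into a (rowkey, Put) pair; nothing in this closure holds an HTable.
val puts = theData.map { a =>
  val p = new Put(Bytes.toBytes(a(0)))
  p.add(Bytes.toBytes(hbaseColFamily), Bytes.toBytes("col"), Bytes.toBytes(a(1)))
  (new ImmutableBytesWritable(Bytes.toBytes(a(0))), p)
}
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)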
P.S. By the way, the map call doesn't crash since it doesn't get evaluated: RDDs aren't evaluated until you invoke a function with side-effects. For example, if you called theData.map(....).persist, it would crash too.