Apache Spark 2.2 JobProgressListener alternative - scala

JobProgressListener has been deprecated since Apache Spark 2.2.0. What should I use instead?

Create your own SparkListener instead: add a class extending SparkListener and override the methods you want to use to track the Spark job's progress. Example:
class CustomSparkListener extends SparkListener {
  override def onApplicationStart(ev: SparkListenerApplicationStart): Unit =
    println(s"Application ${ev.appName} started") // replace with your own tracking
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
    println(s"Stage ${stageSubmitted.stageInfo.stageId} submitted")
}
You also have to declare your new Spark Listener in the SparkSession configuration:
.config("spark.extraListeners", "listeners.CustomSparkListener")
Hope it helped!

Related

MiniDFS cluster setup for multiple test classes throws java.net.BindException: Address already in use

I am writing unit test cases for Spark code that reads/writes data from/to both HDFS files and Spark's catalog. For this I created a separate trait that provides initialisation of a MiniDFS cluster, and I use the generated HDFS URI as the value of spark.sql.warehouse.dir when creating the SparkSession object. Here is the code for it:
import java.io.File

import org.apache.hadoop.hdfs.{HdfsConfiguration, MiniDFSCluster}
import org.apache.hadoop.test.PathUtils
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

trait TestSparkSession extends BeforeAndAfterAll {
  self: Suite =>

  var hdfsCluster: MiniDFSCluster = _

  def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"

  def withLocalSparkSession(tests: SparkSession => Any): Any = {
    val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    val builder = new MiniDFSCluster.Builder(conf)
    hdfsCluster = builder.nameNodePort(9000)
      .manageNameDfsDirs(true)
      .manageDataDfsDirs(true)
      .format(true)
      .build()
    hdfsCluster.waitClusterUp()

    val testSpark = SparkSession
      .builder()
      .master("local")
      .appName("Test App")
      .config("spark.sql.warehouse.dir", s"${nameNodeURI}spark-warehouse/")
      .getOrCreate()

    tests(testSpark)
  }

  def stopHdfs(): Unit = hdfsCluster.shutdown(true, true)

  override def afterAll(): Unit = stopHdfs()
}
While writing my tests, I inherit this trait and then write test cases like this:
class SampleSpec extends FunSuite with TestSparkSession {
  withLocalSparkSession { testSpark =>
    import testSpark.implicits._
    // Test 1 Here
    // Test 2 Here
  }
}
Everything works fine when I run my test classes one at a time, but when I run them all at once it throws java.net.BindException: Address already in use.
This should mean that the previously created hdfsCluster is not yet down when the next set of tests is executed, which is why it cannot create another one bound to the same port. But I do stop the hdfsCluster in afterAll().
My question is: can I share a single instance of the HDFS cluster and the Spark session instead of initialising them every time? I have tried to move the initialisation outside of the method, but it still throws the same exception. Even if I can't share them, how can I properly stop my cluster and re-initialise it on the next test class execution?
Also, please let me know if my approach for writing 'unit' test cases that use SparkSession and HDFS storage is correct.
Any help will be greatly appreciated.
I resolved it by creating the HDFS cluster in a companion object instead, so that a single instance of it is shared by all the test suites.
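A minimal sketch of that approach, reusing the code from the question (the lazy initialisation and the shutdown hook are one possible way to build the cluster once per JVM and tear it down only when all suites are done):
import java.io.File

import org.apache.hadoop.hdfs.{HdfsConfiguration, MiniDFSCluster}
import org.apache.hadoop.test.PathUtils
import org.scalatest.Suite

object TestSparkSession {
  // Built lazily, once per JVM, and shared by every suite mixing in the trait.
  lazy val hdfsCluster: MiniDFSCluster = {
    val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    val cluster = new MiniDFSCluster.Builder(conf)
      .nameNodePort(9000)
      .manageNameDfsDirs(true)
      .manageDataDfsDirs(true)
      .format(true)
      .build()
    cluster.waitClusterUp()
    // Tear the shared cluster down when the test JVM exits, not in afterAll().
    sys.addShutdownHook(cluster.shutdown(true, true))
    cluster
  }
}

trait TestSparkSession { self: Suite =>
  def hdfsCluster: MiniDFSCluster = TestSparkSession.hdfsCluster
  def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"
  // withLocalSparkSession stays as before, minus the cluster creation
  // and minus the afterAll() shutdown.
}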

Avoid import tax when using spark implicits

In my testing, I have a test trait that provides a SparkSession:
trait SparkTestTrait {
  lazy val spark: SparkSession = SparkSession.builder().getOrCreate()
}
The problem is that I need to add an import in every test function:
test("test1) {
import spark.implicits._
}
I managed to reduce this to one per file by adding the following to the SparkTestTrait:
object testImplicits extends SQLImplicits {
  protected override def _sqlContext: SQLContext = spark.sqlContext
}
and then in the constructor of the implementing file:
import testImplicits._
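For example, a suite mixing in the trait can then look like this (a sketch; the suite name and test body are illustrative, and it assumes the SparkSession from the trait is otherwise fully configured):
import org.scalatest.FunSuite

class SampleSpec extends FunSuite with SparkTestTrait {
  // One import for the whole file, resolved from the trait's testImplicits member.
  import testImplicits._

  test("toDS works") {
    val ds = Seq(1, 2, 3).toDS()
    assert(ds.count() == 3)
  }
}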
However, I would prefer to have these implicits imported to all classes implementing SparkTestTrait (I can't have SparkTestTrait extend SQLImplicits because the implementing classes already extend an abstract class).
Is there a way to do this?

ClickHouse Spark Connector - Scala Dependency

I am using https://github.com/DmitryBe/clickhouse-spark-connector
I cloned the repo, created my jar with sbt assembly, and then added my import statements:
import io.clickhouse.ext.ClickhouseConnectionFactory
import io.clickhouse.ext.spark.ClickhouseSparkExt._
These imports fail to compile with: object clickhouse is not a member of package spark.jobserver.io
I can see that these paths exist, and they are added as dependencies the same way I added all the others. I have cleaned and rebuilt, etc., but it has made no difference. I am using Scala IDE (Eclipse).
You can try using https://github.com/yandex/clickhouse-jdbc
Here is a snippet you can use to write a DataFrame into ClickHouse using your own dialect. ClickhouseDialect here is a class that extends JdbcDialect. You can create your dialect and register it with JdbcDialects.registerDialect(clickhouse):
def write(data: DataFrame, jdbcUrl: String, tableName: String): Unit = {
  val clickhouse = new ClickhouseDialect()
  JdbcDialects.registerDialect(clickhouse)
  val props = new java.util.Properties()
  props.put("driver", "ru.yandex.clickhouse.ClickHouseDriver")
  props.put("connection", jdbcUrl)
  val repartitionedData = data.repartition(100)
  repartitionedData.write
    .mode(SaveMode.Append)
    .jdbc(jdbcUrl, tableName, props)
  JdbcDialects.unregisterDialect(clickhouse)
}
You can check here how to create your own dialect. You might have to override the canHandle, getJDBCType and getCatalystType methods of JdbcDialect for your usage.
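For reference, a dialect along those lines might look roughly like this (a sketch; the URL prefix and the type mappings are assumptions to adapt to your ClickHouse schema):
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

class ClickhouseDialect extends JdbcDialect {

  // Claim JDBC URLs that point at ClickHouse.
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:clickhouse")

  // Map a few common Spark types to ClickHouse column types when writing;
  // anything unmapped falls back to Spark's defaults.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("String", Types.VARCHAR))
    case IntegerType => Some(JdbcType("Int32", Types.INTEGER))
    case LongType    => Some(JdbcType("Int64", Types.BIGINT))
    case DoubleType  => Some(JdbcType("Float64", Types.DOUBLE))
    case _           => None
  }
}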

How to write ElasticsearchSink for Spark Structured streaming

I'm using Spark Structured Streaming to process high-volume data from a Kafka queue and doing some heavy ML computation, but I need to write the result to Elasticsearch.
I tried using ForeachWriter but can't get a SparkContext inside it; the other option is probably to do an HTTP POST inside the ForeachWriter.
Right now, I am thinking of writing my own ElasticsearchSink.
Is there any documentation out there on creating a Sink for Spark Structured Streaming?
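The ForeachWriter/HTTP option mentioned above might be sketched like this (assuming each record is already a JSON string; the endpoint URL is a placeholder and error handling is omitted):
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.ForeachWriter

class HttpEsWriter(endpoint: String) extends ForeachWriter[String] {

  override def open(partitionId: Long, version: Long): Boolean = true

  // POST each JSON record to the Elasticsearch endpoint.
  override def process(json: String): Unit = {
    val conn = new URL(endpoint).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    try out.write(json.getBytes(StandardCharsets.UTF_8)) finally out.close()
    conn.getResponseCode // force the request to be sent
    conn.disconnect()
  }

  override def close(errorOrNull: Throwable): Unit = ()
}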
If you are using Spark 2.2+ and ES 6.x, then there is an ES sink out of the box:
df
  .writeStream
  .outputMode(OutputMode.Append())
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "mappingId")
  .start("index/type") // index/type
If you are using ES 5.x like I was, you need to implement an EsSink and an EsSinkProvider:
EsSinkProvider:
class EsSinkProvider extends StreamSinkProvider with DataSourceRegister {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = {
    EsSink(sqlContext, parameters, partitionColumns, outputMode)
  }

  override def shortName(): String = "my-es-sink"
}
EsSink:
case class EsSink(sqlContext: SQLContext,
                  options: Map[String, String],
                  partitionColumns: Seq[String],
                  outputMode: OutputMode)
  extends Sink {

  override def addBatch(batchId: Long, df: DataFrame): Unit = synchronized {
    val schema = df.schema
    // this ensures that the same query plan will be used
    val rdd: RDD[String] = df.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row]).map(_.getAs[String](0))
    }

    // from the org.elasticsearch.spark.rdd library
    EsSpark.saveJsonToEs(rdd, "index/type", Map("es.mapping.id" -> "mappingId"))
  }
}
And lastly, when writing the stream, use this provider class as the format:
df
  .writeStream
  .queryName("ES-Writer")
  .outputMode(OutputMode.Append())
  .format("path.to.EsSinkProvider")
  .start()
You can take a look at ForeachSink. It shows how to implement a Sink and convert a DataFrame to an RDD (it's very tricky and has a large comment). However, please be aware that the Sink API is still private and immature; it might change in the future.

(Play 2.4.2, Play Slick 1.0.0) How do I apply database Evolutions to a Slick managed database within a test?

I would like to write database integration tests against a Play Slick managed database, and apply and unapply Evolutions using the helper methods described in the Play documentation, namely Evolutions.applyEvolutions(database) and Evolutions.cleanupEvolutions(database). However, these require a play.api.db.Database instance, which is not possible to get hold of from what I can see. The jdbc library conflicts with play-slick, so how do I get the database instance from Slick? I use the following to get a Slick database def for running Slick queries:
val dbConfig = DatabaseConfigProvider.get[JdbcProfile]("my-test-db")(FakeApplication())
import dbConfig.driver.api._
val db = dbConfig.db
Thanks,
Leanne
Here is how I do it with Guice. I inject the DBApi:
lazy val appBuilder = new GuiceApplicationBuilder()
lazy val injector = appBuilder.injector()
lazy val databaseApi = injector.instanceOf[DBApi] //here is the important line
(You have to import play.api.db.DBApi.)
And in my tests, I simply do the following (actually I use another database for my tests):
override def beforeAll() = {
  Evolutions.applyEvolutions(databaseApi.database("default"))
}

override def afterAll() = {
  Evolutions.cleanupEvolutions(databaseApi.database("default"))
}
(I'm using ScalaTest, but it is the same thing with another testing framework.)
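Putting the pieces together, a full suite could look roughly like this (a sketch assuming ScalaTest and the GuiceApplicationBuilder approach above; the suite name and test body are illustrative):
import org.scalatest.{BeforeAndAfterAll, FunSuite}
import play.api.db.DBApi
import play.api.db.evolutions.Evolutions
import play.api.inject.guice.GuiceApplicationBuilder

class MyRepositorySpec extends FunSuite with BeforeAndAfterAll {

  private lazy val appBuilder = new GuiceApplicationBuilder()
  private lazy val injector = appBuilder.injector()
  private lazy val databaseApi = injector.instanceOf[DBApi]

  override def beforeAll(): Unit = {
    // Apply the evolutions of the "default" database before any test runs.
    Evolutions.applyEvolutions(databaseApi.database("default"))
  }

  override def afterAll(): Unit = {
    // Revert them once the suite has finished.
    Evolutions.cleanupEvolutions(databaseApi.database("default"))
  }

  test("a repository query") {
    // Slick queries against dbConfig.db go here.
  }
}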