(Play 2.4.2, Play Slick 1.0.0) How do I apply database Evolutions to a Slick managed database within a test? - playframework-2.4

I would like to write database integration tests against a Play Slick managed database, applying and unapplying Evolutions with the helper methods described in the Play documentation, namely Evolutions.applyEvolutions(database) and Evolutions.cleanupEvolutions(database). However, these require a play.api.db.Database instance, which I cannot see any way to get hold of. The jdbc library conflicts with play-slick, so how do I get the database instance from Slick? I use the following to obtain a Slick database for running Slick queries:
val dbConfig = DatabaseConfigProvider.get[JdbcProfile]("my-test-db")(FakeApplication())
import dbConfig.driver.api._
val db = dbConfig.db
Thanks,
Leanne

Here is how I do it with Guice. I inject the DBApi:
lazy val appBuilder = new GuiceApplicationBuilder()
lazy val injector = appBuilder.injector()
lazy val databaseApi = injector.instanceOf[DBApi] //here is the important line
(You have to import play.api.db.DBApi.)
And in my tests, I simply do the following (actually I use another database for my tests):
override def beforeAll() = {
  Evolutions.applyEvolutions(databaseApi.database("default"))
}

override def afterAll() = {
  Evolutions.cleanupEvolutions(databaseApi.database("default"))
}
(I'm using ScalaTest, but it's the same with any other testing framework.)
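For completeness, here is a minimal sketch of how these pieces might fit together in a single ScalaTest trait (the trait name WithEvolutions is just a placeholder, and "default" should match the database name in your test configuration):

import org.scalatest.{BeforeAndAfterAll, Suite}
import play.api.db.DBApi
import play.api.db.evolutions.Evolutions
import play.api.inject.guice.GuiceApplicationBuilder

// Mix this into a Suite to apply evolutions before the tests and clean them up afterwards.
trait WithEvolutions extends BeforeAndAfterAll { self: Suite =>
  lazy val appBuilder = new GuiceApplicationBuilder()
  lazy val injector = appBuilder.injector()
  lazy val databaseApi = injector.instanceOf[DBApi]

  override def beforeAll() = {
    Evolutions.applyEvolutions(databaseApi.database("default"))
  }

  override def afterAll() = {
    Evolutions.cleanupEvolutions(databaseApi.database("default"))
  }
}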

Related

MiniDFS cluster setup for multiple test classes throws java.net.BindException: Address already in use

I am writing unit test cases for Spark code that reads/writes data from/to both HDFS files and Spark's catalog. For this I created a separate trait that initialises a MiniDFS cluster, and I use the generated HDFS URI as the value of spark.sql.warehouse.dir when creating the SparkSession object. Here is the code for it:
import java.io.File
import org.apache.hadoop.hdfs.{HdfsConfiguration, MiniDFSCluster}
import org.apache.hadoop.test.PathUtils
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, Suite}

trait TestSparkSession extends BeforeAndAfterAll {
  self: Suite =>

  var hdfsCluster: MiniDFSCluster = _

  def nameNodeURI: String = s"hdfs://localhost:${hdfsCluster.getNameNodePort}/"

  def withLocalSparkSession(tests: SparkSession => Any): Any = {
    // Start a MiniDFS cluster bound to port 9000
    val baseDir = new File(PathUtils.getTestDir(getClass), "miniHDFS")
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    val builder = new MiniDFSCluster.Builder(conf)
    hdfsCluster = builder.nameNodePort(9000)
      .manageNameDfsDirs(true)
      .manageDataDfsDirs(true)
      .format(true)
      .build()
    hdfsCluster.waitClusterUp()

    // Create a SparkSession whose warehouse dir points at the MiniDFS cluster
    val testSpark = SparkSession
      .builder()
      .master("local")
      .appName("Test App")
      .config("spark.sql.warehouse.dir", s"${nameNodeURI}spark-warehouse/")
      .getOrCreate()
    tests(testSpark)
  }

  def stopHdfs(): Unit = hdfsCluster.shutdown(true, true)

  override def afterAll(): Unit = stopHdfs()
}
While writing my tests, I inherit this trait and then write test cases like this:
class SampleSpec extends FunSuite with TestSparkSession {
  withLocalSparkSession { testSpark =>
    import testSpark.implicits._
    // Test 1 Here
    // Test 2 Here
  }
}
Everything works fine when I run my test classes one at a time, but when I run them all at once it throws java.net.BindException: Address already in use.
This should mean that the previously created hdfsCluster is not yet down when the next set of tests executes, which is why it cannot create another one bound to the same port. But I do stop the hdfsCluster in afterAll().
My question is: can I share a single instance of the HDFS cluster and the Spark session instead of initialising them for every test class? I have tried to move the initialisation out of the method, but it still throws the same exception. Even if I can't share it, how can I properly stop my cluster and re-initialise it for the next test class?
Also, please let me know whether my approach of writing 'unit' test cases that use SparkSession and HDFS storage is correct.
Any help will be greatly appreciated.
I resolved it by creating the HDFS cluster in a companion object instead, so that a single instance is created for all the test suites.
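Here is a minimal sketch of that idea (the names, the lazy initialisation, and dropping the fixed port 9000 in favour of a random free port are illustrative choices, not the exact original code):

// A single MiniDFSCluster shared by every suite in the test JVM; it is torn down when the JVM exits.
object TestSparkSession {
  lazy val hdfsCluster: MiniDFSCluster = {
    val baseDir = new File(PathUtils.getTestDir(classOf[TestSparkSession]), "miniHDFS")
    val conf = new HdfsConfiguration()
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    val cluster = new MiniDFSCluster.Builder(conf).format(true).build()
    cluster.waitClusterUp()
    cluster
  }
}

trait TestSparkSession extends BeforeAndAfterAll { self: Suite =>
  // Every suite reads the port from the shared cluster instead of starting its own.
  def nameNodeURI: String = s"hdfs://localhost:${TestSparkSession.hdfsCluster.getNameNodePort}/"
  // Build the SparkSession against nameNodeURI as before, and do not shut the shared cluster down in afterAll().
}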

Apache Spark 2.2 JobProgressListener alternative

JobProgressListener is deprecated since Apache Spark 2.2.0. Which one should I use instead of that?
Create your own SparkListener instead. Add a class extending SparkListener and override the methods you want to use to track the Spark Job progress. Example:
class CustomSparkListener extends SparkListener {
  override def onApplicationStart(ev: SparkListenerApplicationStart): Unit = ???
  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = ???
}
You also have to declare your new Spark Listener in the SparkSession configuration:
.config("spark.extraListeners", "listeners.CustomSparkListener")
Hope it helped!

ClickHouse Spark Connector - Scala Dependency

I am using https://github.com/DmitryBe/clickhouse-spark-connector
I cloned the repo, created my jar with sbt assembly, and then added my import statements:
import io.clickhouse.ext.ClickhouseConnectionFactory
import io.clickhouse.ext.spark.ClickhouseSparkExt._
object clickhouse is not a member of package spark.jobserver.io
I can see that these paths exist, and they are added as dependencies the same way I added all the others. I have cleaned and rebuilt, etc., but it has made no difference. I am using Scala IDE (Eclipse).
You can try using https://github.com/yandex/clickhouse-jdbc
Here is a snippet you can use to write a DataFrame into ClickHouse using your own dialect. ClickhouseDialect is a class that extends JdbcDialect; you create your dialect and register it with JdbcDialects.registerDialect(clickhouse):
def write(data: DataFrame, jdbcUrl: String, tableName: String): Unit = {
  val clickhouse = new ClickhouseDialect()
  JdbcDialects.registerDialect(clickhouse)
  val props = new java.util.Properties()
  props.put("driver", "ru.yandex.clickhouse.ClickHouseDriver")
  props.put("connection", jdbcUrl)
  val repartitionedData = data.repartition(100)
  repartitionedData.write
    .mode(SaveMode.Append)
    .jdbc(jdbcUrl, tableName, props)
  JdbcDialects.unregisterDialect(clickhouse)
}
You can check the Spark source to see how to create your own dialect. You might have to override the canHandle, getJDBCType and getCatalystType methods of JdbcDialect for your usage.
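For illustration, a bare-bones dialect might look roughly like this; the URL prefix and the type mappings are examples only, not a complete or verified ClickHouse dialect:

import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcType}
import org.apache.spark.sql.types._

// Sketch of a custom dialect; register it with JdbcDialects.registerDialect(new ClickhouseDialect()).
class ClickhouseDialect extends JdbcDialect {
  // Claim URLs of the form jdbc:clickhouse://...
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:clickhouse")

  // Map a few Catalyst types to ClickHouse column types when writing.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("String", Types.VARCHAR))
    case IntegerType => Some(JdbcType("Int32", Types.INTEGER))
    case DoubleType  => Some(JdbcType("Float64", Types.DOUBLE))
    case _           => None
  }
}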

JDBC-HiveServer:'client_protocol is unset!'-Both 1.1.1 in CS

Before asking this question, I had already read many articles found through Google. Many answers say it is a version mismatch between the client side and the server side, so I decided to copy the jars from the server side to the client side directly, and the result is, as you can guess, the same exception:
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
It goes well when I connect to hiveserver2 through beeline :)
see my connection.
So I thought it would work when I use JDBC too, but unfortunately it throws that exception. Below are the jars in my project:
hive-jdbc-1.1.1.jar
hive-jdbc-standalone.jar
hive-metastore-1.1.1.jar
hive-service-1.1.1.jar
those hive jars are copied from server-side.
import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

def connect_hive(master: String) {
  val conf = new SparkConf()
    .setMaster(master)
    .setAppName("Hive")
    .set("spark.local.dir", "./tmp")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val url = "jdbc:hive2://192.168.40.138:10000"
  val prop = new Properties()
  prop.setProperty("user", "hive")
  prop.setProperty("password", "hive")
  prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(url, prop)
  sc.stop()
}
The configuration of my server:
hadoop 2.7.3
spark 1.6.0
hive 1.1.1
Has anyone encountered the same situation when connecting to Hive through Spark JDBC?
Since beeline works, your program would also be expected to execute correctly, so the problem is most likely the client classpath. Print the current project classpath; you can try something like this to see it for yourself:
import java.net.URL
import java.net.URLClassLoader
import scala.collection.JavaConversions._

object App {
  def main(args: Array[String]) {
    val cl = ClassLoader.getSystemClassLoader
    val urls = cl.asInstanceOf[URLClassLoader].getURLs
    for (url <- urls) {
      println(url.getFile)
    }
  }
}
Also check hive.aux.jars.path=<file urls> to understand what jars are present in the classpath.
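If you are on an sbt build, an alternative to copying jars by hand is to pin the client artifacts to the server's versions in your build definition (a sketch only; adjust to your actual build and repositories):

// build.sbt sketch: match the Hive 1.1.1 / Hadoop 2.7.3 versions running on the server.
libraryDependencies ++= Seq(
  "org.apache.hive" % "hive-jdbc" % "1.1.1",
  "org.apache.hadoop" % "hadoop-common" % "2.7.3"
)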

How to save RandomForestClassifier Spark model in scala?

I built a random forest model using the following code:
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier
val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")
val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
val training = labelIndexer.transform(df)
val model = rf.fit(training)
Now I want to save the model in order to predict later using the following code:
val predictions: DataFrame = model.transform(testData)
I've looked into the Spark documentation and didn't find any option to do that. Any idea?
It took me a few hours to build the model; if Spark crashes I won't be able to get it back.
It's possible to save and reload tree-based models in HDFS with Spark 1.6 using saveAsObjectFile(), for both Pipeline-based and basic models.
Below is an example for a Pipeline-based model.
// model
val model = pipeline.fit(trainingData)
// Create an RDD containing the model, using Seq
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs://filepath")
// Reload the model by using its class
// You can get the class of an object using object.getClass()
val sameModel = sc.objectFile[PipelineModel]("filepath").first()
To save & load a RandomForestClassifier model (tested with Spark 1.6.2 + Scala in ml; in Spark 2.0 you have a direct save option for the model):
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier // imports

val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(3)
  .setNumTrees(20).setFeatureSubsetStrategy("auto").setSeed(5043)
val model = classifier.fit(trainingData)
sc.parallelize(Seq(model), 1).saveAsObjectFile(modelSavePath) // save model
val linRegModel = sc.objectFile[RandomForestClassificationModel](modelSavePath).first() // load model
val predictions1 = linRegModel.transform(testData) // predictions1 is a DataFrame
It is in the MLWriter interface, which is accessed via the write attribute on your model:
model.asInstanceOf[MLWritable].write.save(path)
Here is the interface:
abstract class MLWriter extends BaseReadWrite with Logging {

  protected var shouldOverwrite: Boolean = false

  /**
   * Saves the ML instances to the input path.
   */
  @Since("1.6.0")
  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
  def save(path: String): Unit = {
This is a refactoring from earlier versions of mllib/spark.ml.
Update: it appears that the model was not writable:
Exception in thread "main" java.lang.UnsupportedOperationException:
Pipeline write will fail on this Pipeline because it contains a stage
which does not implement Writable. Non-Writable stage:
rfc_4e467607406f of type class
org.apache.spark.ml.classification.RandomForestClassificationModel
So there may not be a straightforward solution for this.
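As noted in the answer above, Spark 2.0+ adds a direct save option on the model itself; if upgrading is possible, a minimal sketch looks like this (the path is a placeholder):

// Spark 2.0+ only: the model implements MLWritable, so no saveAsObjectFile workaround is needed.
import org.apache.spark.ml.classification.RandomForestClassificationModel

model.write.overwrite().save("hdfs:///some/path/rfModel")
val reloaded = RandomForestClassificationModel.load("hdfs:///some/path/rfModel")
val predictions = reloaded.transform(testData)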
Here is a PySpark v1.6 implementation corresponding to the Scala saveAsObjectFile() answer above.
It coerces the Python objects to/from Java objects to achieve serialisation with saveAsObjectFile().
Without the Java coercion I had weird Py4J errors on serialisation. If anyone has a simpler implementation, please edit or comment.
Save a trained RandomForestClassificationModel object:
# Save RandomForestClassificationModel to hdfs
gateway = sc._gateway
java_list = gateway.jvm.java.util.ArrayList()
java_list.add(rfModel._java_obj)
modelRdd = sc._jsc.parallelize(java_list)
modelRdd.saveAsObjectFile("hdfs:///some/path/rfModel")
Load a trained RandomForestClassificationModel object:
# Load RandomForestClassificationModel from hdfs
rfObjectFileLoaded = sc._jsc.objectFile("hdfs:///some/path/rfModel")
rfModelLoaded_JavaObject = rfObjectFileLoaded.first()
rfModelLoaded = RandomForestClassificationModel(rfModelLoaded_JavaObject)
predictions = rfModelLoaded.transform(test_input_df)