Unit testing in Spark with SQLContext implicits - scala

I'm trying to run multiple unit tests in Spark and have copied (and slightly adapted) the bit from the source code:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, Suite}
trait SharedSparkContext extends BeforeAndAfterAll {
self: Suite =>
@transient private var _sc: SparkContext = _
@transient private var _sqlContext: SQLContext = _
def sc: SparkContext = _sc
def sqlContext: SQLContext = _sqlContext
private var conf = new SparkConf(false)
override def beforeAll() {
super.beforeAll()
_sc = new SparkContext("local[*]", "Test Suites", conf)
_sqlContext = new SQLContext(_sc)
}
override def afterAll() {
try {
LocalSparkContext.stop(_sc)
_sc = null
} finally {
super.afterAll()
}
}
}
The LocalSparkContext class and its companion object are simply copied from the Spark source.
I thought about using it as follows, which gives me an error saying that a stable identifier is required, because the def sqlContext does not have the member implicits:
class MySuite extends FlatSpec with SharedSparkContext {
import sqlContext.implicits._
// ...
}
I have tried replacing it with the following, but that gives me null pointer exceptions:
class MySuite extends FlatSpec with SharedSparkContext {
val sqlCtxt = sqlContext
import sqlCtxt.implicits._
// ...
}
I am using Spark 1.4.1 and I have set parallelExecution in test := false.
How can I get this to work (without using additional packages)?

Instead of using a trait, you can use a simple object that holds all your variables. Here's what I do for my tests:
object TestConfiguration extends Serializable {
private val sparkConf = new SparkConf()
.setAppName("Tests")
.setMaster("local")
private lazy val sparkContext = new SparkContext(sparkConf)
private lazy val sqlContext = new SQLContext(sparkContext)
def getSqlContext() = {
sqlContext
}
}
Then, you'll be able to use the sqlContext in a test suite.
class MySuite extends FlatSpec with SharedSparkContext {
val sqlCtxt = TestConfiguration.getSqlContext
import sqlCtxt.implicits._
// ...
}
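If you would rather keep the trait from the question, one sketch that may work (an assumption here, not verified against Spark 1.4.1) is to make the alias lazy, so the SQLContext is only dereferenced inside a test, after beforeAll has created it:
class MySuite extends FlatSpec with SharedSparkContext {
  // lazy: not evaluated at construction time, only when a test first uses the implicits
  lazy val sqlCtxt = sqlContext
  import sqlCtxt.implicits._
  // ...
}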

Related

Where is the implicit value for parameter P: cats.Parallel[cats.effect.IO,F]

Running the example code snippet under the parSequence subtopic in the Cats Effect documentation throws an error:
import cats._, cats.data._, cats.syntax.all._, cats.effect.IO
val anIO = IO(1)
val aLotOfIOs = NonEmptyList.of(anIO, anIO)
val ioOfList = aLotOfIOs.parSequence
<console>:44: error: could not find implicit value for parameter P: cats.Parallel[cats.effect.IO,F]
I included an implicit Timer[IO], i.e. implicit val timer = IO.timer(ExecutionContext.global), but it does not work. Please advise. Thanks.
Update #1
For a complete working snippet:
import cats._, cats.data._, cats.syntax.all._, cats.effect.IO
import scala.concurrent.ExecutionContext.Implicits.global
implicit val contextShift = IO.contextShift(global)
val anIO = IO(1)
val aLotOfIOs = NonEmptyList.of(anIO, anIO)
val ioOfList = aLotOfIOs.parSequence
The implicit you're looking for is defined in cats.effect.IOInstances and you can bring it in scope by importing cats.effect.IO._.
private[effect] abstract class IOInstances extends IOLowPriorityInstances {
//....
implicit def ioParallel(implicit cs: ContextShift[IO]): Parallel[IO, IO.Par] =
new Parallel[IO, IO.Par] {
final override val applicative: Applicative[IO.Par] =
parApplicative(cs)
final override val monad: Monad[IO] =
ioConcurrentEffect(cs)
final override val sequential: ~>[IO.Par, IO] =
new FunctionK[IO.Par, IO] { def apply[A](fa: IO.Par[A]): IO[A] = IO.Par.unwrap(fa) }
final override val parallel: ~>[IO, IO.Par] =
new FunctionK[IO, IO.Par] { def apply[A](fa: IO[A]): IO.Par[A] = IO.Par(fa) }
}
}
object IO extends IOInstances {
// ...
}
Note that you will need to have an implicit ContextShift[IO] in scope if you want to use the ioParallel instance.
It is a common pattern in Scala to have implicit instances defined as part of the companion object for the class (in this case IO).
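As a tiny illustration of that pattern (a made-up Meters type, nothing to do with cats):
final case class Meters(value: Double)
object Meters {
  // Lives in the companion object, so implicit search finds it with no import at the use site.
  implicit val ordering: Ordering[Meters] = Ordering.by(_.value)
}
// List(Meters(3), Meters(1)).sorted // compiles without importing Meters.ordering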

Scala, Slick: too many clients already

I've come across this problem:
Caused by: org.postgresql.util.PSQLException: FATAL: sorry, too many clients already
The app works, but I can only make about 3 or 4 requests before I get this error, so what's happening (I'm guessing) is that I'm creating new connections per request. I'd like to keep one connection for the app's lifecycle; any idea how to modify the code to do so?
I tried injecting UsersDao into my controller instead of using it as an object, but that changes nothing.
I'm really new to Scala, so any help is appreciated.
Dao
import config.DatabaseConfig
import domain.{User, UsersTable}
import slick.jdbc.PostgresProfile.api._
import slick.sql.SqlAction
import scala.concurrent.Future
trait BaseDao extends DatabaseConfig {
val usersTable = TableQuery[UsersTable]
protected implicit def executeFromDb[A](action: SqlAction[A, NoStream, _ <: slick.dbio.Effect]): Future[A] = {
println(db)
db.run(action)
}
}
object UsersDao extends BaseDao {
def findAll: Future[Seq[User]] = usersTable.result
def create(user: User): Future[Long] = usersTable.returning(usersTable.map(_.id)) += user
def findByFirstName(firstName: String): Future[Seq[User]] = usersTable.filter(_.firstName === firstName).result
def findById(userId: Long): Future[User] = usersTable.filter(_.id === userId).result.head
def delete(userId: Long): Future[Int] = usersTable.filter(_.id === userId).delete
}
DatabaseConfig
trait DatabaseConfig extends Config {
val driver = slick.jdbc.PostgresProfile
import driver.api._
def db = Database.forConfig("dbname")
implicit val session: Session = db.createSession()
}
Controller
import domain.User
import javax.inject.{Inject, Singleton}
import play.api.libs.json.Json
import play.api.mvc._
import repository.UsersDao
import scala.concurrent.{ExecutionContext, Future}
@Singleton
class UserController @Inject()(cc: ControllerComponents, parsers: PlayBodyParsers)(implicit exec: ExecutionContext) extends AbstractController(cc) {
def addUser = Action.async(parse.json[User]) { req => {
UsersDao.create(req.body).map({ user =>
Ok(Json.toJson(user))
})
}
}
}
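Given the code above, a likely culprit is that def db builds a new connection pool on every call, and the extra createSession opens yet another connection. One possible sketch of DatabaseConfig that shares a single Database for the app's lifetime (an assumption based on the posted code, not a verified fix):
trait DatabaseConfig extends Config {
  val driver = slick.jdbc.PostgresProfile
  import driver.api._
  // lazy val: the pool is created once per object, not once per call to db
  lazy val db: Database = Database.forConfig("dbname")
  // no db.createSession() here; db.run obtains and releases connections itself
}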

Spark/scala create empty dataset using generics in a trait

I have a trait that takes a type parameter, and one of its methods needs to be able to create an empty typed Dataset.
trait MyTrait[T] {
val sparkSession: SparkSession
val spark = sparkSession
val sparkContext = spark.sparkContext
def createEmptyDataset(): Dataset[T] = {
import spark.implicits._ // to access .toDS() function
// DOESN'T WORK.
val emptyRDD = sparkContext.parallelize(Seq[T]())
val accumulator = emptyRDD.toDS()
...
}
}
So far I have not gotten it to work. It complains that there is no ClassTag for T, and that value toDS is not a member of org.apache.spark.rdd.RDD[T].
Any help would be appreciated. Thanks!
You have to provide both ClassTag[T] and Encoder[T] in the same scope. For example:
import org.apache.spark.sql.{SparkSession, Dataset, Encoder}
import scala.reflect.ClassTag
trait MyTrait[T] {
val ct: ClassTag[T]
val enc: Encoder[T]
val sparkSession: SparkSession
lazy val sparkContext = sparkSession.sparkContext // lazy, so the abstract sparkSession is not dereferenced before the concrete class sets it
def createEmptyDataset(): Dataset[T] = {
val emptyRDD = sparkContext.emptyRDD[T](ct)
sparkSession.createDataset(emptyRDD)(enc)
}
}
with concrete implementation:
class Foo extends MyTrait[Int] {
val sparkSession = SparkSession.builder.getOrCreate()
import sparkSession.implicits._
val ct = implicitly[ClassTag[Int]]
val enc = implicitly[Encoder[Int]]
}
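For reference, a usage sketch of the concrete class above (assuming a SparkSession can be created in your environment):
val ds: Dataset[Int] = (new Foo).createEmptyDataset()
println(ds.count()) // 0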
It is also possible to skip the RDD:
import org.apache.spark.sql.{SparkSession, Dataset, Encoder}
trait MyTrait[T] {
val enc: Encoder[T]
val sparkSession: SparkSession
lazy val sparkContext = sparkSession.sparkContext
def createEmptyDataset(): Dataset[T] = {
sparkSession.emptyDataset[T](enc)
}
}
Check How to declare traits as taking implicit "constructor parameters"?, specifically the answers by Blaisorblade and Alexey Romanov.
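For comparison, here is a sketch of pushing the Encoder onto the method instead of the trait (my own variation on the answers linked above, not tested against your setup):
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
trait MyTrait[T] {
  val sparkSession: SparkSession
  // The Encoder is requested where it is used, so the call site only needs
  // import sparkSession.implicits._ (or an explicit Encoder[T]) in scope.
  def createEmptyDataset()(implicit enc: Encoder[T]): Dataset[T] =
    sparkSession.emptyDataset[T]
}
At the call site you would bring an Encoder[T] into scope (typically via sparkSession.implicits._) before calling createEmptyDataset().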

How do I supply an implicit value for an akka.stream.Materializer when sending a FakeRequest?

I'm trying to make sense of the error(s) I'm seeing below, and to learn how to fix it.
could not find implicit value for parameter materializer: akka.stream.Materializer
val fut: Future[Result] = action.apply(fakeRequest).run
^
not enough arguments for method run (implicit materializer: akka.stream.Materializer)scala.concurrent.Future[play.api.mvc.Result].
Unspecified value parameter materializer.
val fut: Future[Result] = action.apply(fakeRequest).run
^
Here is the test code that produced the error(s):
package com.foo.test
import com.foo.{Api, BoundingBox}
import org.scalatest.{FlatSpec, Matchers}
import play.api.libs.json._
import play.api.mvc._
import play.api.test.{FakeHeaders, FakeRequest}
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}
class TestJmlPlay extends FlatSpec with Matchers {
val bbox = new BoundingBox(-76.778154438007732F, 39.239828198015971F, -76.501003519894326F, 39.354663763993926F)
"latitudes" should "be between swLat and neLat" in {
val action: Action[AnyContent] = (new Api).getForPlay(bbox)
val jsonStr = getStringFromAction(action)
areLatitudesOk(jsonStr, bbox) shouldBe true
}
private def getStringFromAction(action:Action[AnyContent]):String = {
val fakeRequest: Request[String] = new FakeRequest("fakeMethod", "fakeUrl", new FakeHeaders, "fakeBody")
val fut: Future[Result] = action.apply(fakeRequest).run // <== ERROR!
val result = Await.result(fut, 5000 milliseconds)
result.body.toString
}
private def areLatitudesOk(jsonStr: String, bbox: BoundingBox): Boolean = ...
}
You can create an implicit ActorMaterializer within your test class, which will use the TestKit's ActorSystem:
import akka.testkit.TestKit
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
class TestJmlPlay(_system : ActorSystem) extends TestKit(_system) ... {
implicit val materializer: ActorMaterializer = ActorMaterializer()
val bbox = ...
You don't need a Materializer.
I believe you are not calling the right action.apply method.
You want def apply(request: Request[A]): Future[Result].
To call the right one, you need a FakeRequest[AnyContent], the same parametrized type as action: Action[AnyContent]. This type is forced by the PlayBodyParser I believe you set for your action.
After that you don't need the .run call.
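A sketch of that suggestion, reusing the imports from the test above (FakeRequest("GET", "fakeUrl") yields a FakeRequest[AnyContentAsEmpty.type], which is compatible with Action[AnyContent]):
private def getStringFromAction(action: Action[AnyContent]): String = {
  val fakeRequest = FakeRequest("GET", "fakeUrl")      // Request[AnyContent]-compatible
  val fut: Future[Result] = action.apply(fakeRequest)  // the apply(Request[A]) overload: no .run, no Materializer
  val result = Await.result(fut, 5000.milliseconds)
  result.body.toString
}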

How do I use infinite Scala streams as source in Spark Streaming?

Suppose I essentially want Stream.from(0) as InputDStream. How would I go about this? The only way I can see is to use StreamingContext#queueStream, but I'd have to either enqueue elements from another thread or subclass Queue to create a queue that behaves like an infinite stream, both of which feel like a hack.
What's the correct way to do this?
I don't think it's available in Spark by default, but it's easy to implement with ReceiverInputDStream.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver
class InfiniteStreamInputDStream[T](
@transient ssc_ : StreamingContext,
stream: Stream[T],
storageLevel: StorageLevel
) extends ReceiverInputDStream[T](ssc_) {
override def getReceiver(): Receiver[T] = {
new InfiniteStreamReceiver(stream, storageLevel)
}
}
class InfiniteStreamReceiver[T](stream: Stream[T], storageLevel: StorageLevel) extends Receiver[T](storageLevel) {
// Stateful iterator
private val streamIterator = stream.iterator
private class ReadAndStore extends Runnable {
def run(): Unit = {
while (streamIterator.hasNext) {
val next = streamIterator.next()
store(next)
}
}
}
override def onStart(): Unit = {
new Thread(new ReadAndStore).start() // start(), not run(), so storing happens on its own thread instead of blocking onStart
}
override def onStop(): Unit = { }
}
Slightly modified code that works with Spark 2.0:
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver
import scala.reflect.ClassTag
class InfiniteDStream[T: ClassTag](
@transient ssc_ : StreamingContext,
stream: Stream[T],
storageLevel: StorageLevel
) extends ReceiverInputDStream[T](ssc_) {
override def getReceiver(): Receiver[T] = {
new InfiniteStreamReceiver(stream, storageLevel)
}
}
class InfiniteStreamReceiver[T](stream: Stream[T], storageLevel: StorageLevel) extends Receiver[T](storageLevel) {
private class ReadAndStore extends Runnable {
def run(): Unit = {
stream.foreach(store)
}
}
override def onStart(): Unit = {
val t = new Thread(new ReadAndStore)
t.setDaemon(true)
t.start() // start(), not run(), so storing happens on the daemon thread
}
override def onStop(): Unit = {}
}