Integration tests for a simple Spark application - scala

I need to write some unit and integration tests for a small research project. It is a simple Spark application that reads data from a file and prints the length of each line. I am using ScalaTest for the unit tests, but I could not come up with integration tests for this project. According to the project flow I need to run the unit tests, package a jar file, and then execute the integration tests using that jar. I have a file with data as a resource for the tests. Should I package this file with the source code, or should I put it in a separate location? What kinds of integration tests can I write for this application?
Simple Spark application looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SparkExample {
  def readFile(sparkContext: SparkContext, fileName: String): RDD[String] =
    sparkContext.textFile(fileName)

  def mapStringToLength(data: RDD[String]): RDD[Int] =
    data.map(fileData => fileData.length)

  def printIntFileData(data: RDD[Int]): Unit =
    data.foreach(fileInt => println(fileInt.toString))

  def printFileData(data: RDD[String]): Unit =
    data.foreach(fileString => println(fileString))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("TestApp")
      .getOrCreate()

    val dataFromFile = readFile(spark.sparkContext, args(0))
    println("\nAll the data:")
    val dataToInt = mapStringToLength(dataFromFile)
    printFileData(dataFromFile)
    printIntFileData(dataToInt)
    spark.stop()
  }
}
Unit tests I wrote:
import org.apache.hadoop.mapred.InvalidInputException
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}

class SparkExampleTest extends FunSuite with BeforeAndAfter with Matchers {
  val master = "local"
  val appName = "TestApp"
  var sparkContext: SparkContext = _
  val fileContent = "This is the text only for the test purposes. There is no sense in it completely. This is the test of the Spark Application"
  val fileName = "src/test/resources/test_data.txt"
  val noPathFileName = "test_data.txt"
  val errorFileName = "test_data1.txt"

  before {
    val sparkSession = SparkSession
      .builder
      .master(master)
      .appName(appName)
      .getOrCreate()
    sparkContext = sparkSession.sparkContext
  }

  test("SparkExample.readFile") {
    assert(SparkExample.readFile(sparkContext, fileName).collect() sameElements Array(fileContent))
  }

  test("SparkExample.mapStringToLength") {
    val stringLength = fileContent.length
    val rdd = sparkContext.makeRDD(Array(fileContent))
    assert(SparkExample.mapStringToLength(rdd).collect() sameElements Array(stringLength))
  }

  test("SparkExample.mapStringToLength Negative") {
    val stringLength = fileContent.length
    val rdd = sparkContext.makeRDD(Array(fileContent + " "))
    // != on arrays compares references and is always true here, so negate sameElements instead
    assert(!(SparkExample.mapStringToLength(rdd).collect() sameElements Array(stringLength)))
  }

  test("SparkExample.readFile does not throw Exception") {
    noException should be thrownBy SparkExample.readFile(sparkContext, fileName).collect()
  }

  test("SparkExample.readFile throws InvalidInputException without filePath") {
    an[InvalidInputException] should be thrownBy SparkExample.readFile(sparkContext, noPathFileName).collect()
  }

  test("SparkExample.readFile throws InvalidInputException with wrong filename") {
    an[InvalidInputException] should be thrownBy SparkExample.readFile(sparkContext, errorFileName).collect()
  }
}

Spark Testing Base is the way to go: it is basically a lightweight embedded Spark for your tests. It would probably be more on the "integration tests" side of things than unit tests, but you can also track code coverage etc., e.g. with scoverage.
https://github.com/holdenk/spark-testing-base
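For illustration, a suite built on spark-testing-base could exercise the whole read-then-map pipeline against the packaged test resource. This is a sketch only: it assumes spark-testing-base and ScalaTest are on the test classpath, and the class name and resource path are placeholders.

```scala
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

// End-to-end check of the readFile -> mapStringToLength pipeline.
// SharedSparkContext supplies a shared local SparkContext as `sc`.
class SparkExampleIntegrationTest extends FunSuite with SharedSparkContext {

  test("readFile followed by mapStringToLength yields the line lengths") {
    val lines   = SparkExample.readFile(sc, "src/test/resources/test_data.txt")
    val lengths = SparkExample.mapStringToLength(lines).collect()
    // every output element should be the length of the corresponding input line
    assert(lengths sameElements lines.collect().map(_.length))
  }
}
```

Other integration-level checks in the same style could feed the jar a multi-line file, an empty file, or a directory of files and compare the collected lengths against values computed with plain Scala.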

Related

Spark Job Stuck and Never Finish Executing

I am trying to do some analysis on a dataset using Spark. I am using sbt with Scala 2.12.8 on my local machine (16 GB of RAM).
The issue is that the Spark job never finishes when I chain a transformation and then an action; if I use a transformation only or an action only, it executes within seconds.
Code:
package wikipedia

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.io.Source
import org.apache.spark.rdd.RDD

case class WikipediaArticle2(title: String, text: String) {
  /**
   * @return Whether the text of this article mentions `lang` or not
   * @param lang Language to look for (e.g. "Scala")
   */
  def mentionsLanguage(lang: String): Boolean = text.split(' ').contains(lang)
}

object WikipediaData2 {
  def lines: List[String] = {
    Option(getClass.getResourceAsStream("/wikipedia/wikipedia.dat")) match {
      case None => sys.error("Please download the dataset as explained in the assignment instructions")
      case Some(resource) => Source.fromInputStream(resource).getLines().toList
    }
  }

  def parse(line: String): WikipediaArticle2 = {
    val subs = "</title><text>"
    val i = line.indexOf(subs)
    val title = line.substring(14, i)
    val text = line.substring(i + subs.length, line.length - 16)
    WikipediaArticle2(title, text)
  }
}

object WikipediaTest {
  val langs = List(
    "JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
    "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy")

  val conf: SparkConf = new SparkConf()
    .setMaster("local[*]") // tried local[4] (4 cores)
    .setAppName("Wikis Most Popular Languages")
  val sc: SparkContext = new SparkContext(conf)
  sc.setLogLevel("ERROR")

  // Hint: use a combination of `sc.parallelize`, `WikipediaData.lines` and `WikipediaData.parse`
  // val wikiRdd: RDD[WikipediaArticle] = sc.parallelize(WikipediaData.lines.map(WikipediaData.parse)).cache()
  val wikiRdd: RDD[WikipediaArticle2] = sc.parallelize(WikipediaData2.lines).map(WikipediaData2.parse)

  def main(args: Array[String]): Unit = {
    // wikiRdd.count() // returned 4086
    // wikiRdd.take(2) // result in less than a second
    // wikiRdd.filter(a => a.mentionsLanguage("Java")).take(2) // execution stuck here
    // wikiRdd.filter(a => a.mentionsLanguage("Perl")).collect() // execution stuck here
    sc.stop()
  }
}
On the terminal I execute the code like this:
$ sbt
> console
scala> import wikipedia.WikipediaTest._
scala> wikiRdd.count() // returned 4086
scala> wikiRdd.take(2) // result in less than a second
scala> wikiRdd.filter(a => a.mentionsLanguage("Java")).take(2)
This code snippet also gets the job to freeze:
wikiRdd.filter(a => a.mentionsLanguage("Perl")).collect()
[Spark Job UI screenshot]
I have been stumped for hours and I have no idea what's wrong.
Any suggestions, folks?
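As a side note, the fixed offsets in `parse` (14 and 16) encode the markup surrounding each article, and the slicing can be checked in isolation. The markup in the sample line below is an assumption chosen to match those offsets, not necessarily the real `wikipedia.dat` format:

```scala
object ParseDemo {
  case class Article(title: String, text: String)

  // Same slicing logic as WikipediaData2.parse
  def parse(line: String): Article = {
    val subs = "</title><text>"
    val i = line.indexOf(subs)
    val title = line.substring(14, i)                             // skip the 14-char prefix
    val text = line.substring(i + subs.length, line.length - 16)  // drop the 16-char suffix
    Article(title, text)
  }

  def main(args: Array[String]): Unit = {
    // "<page> <title>" is 14 chars and " </text> </page>" is 16, matching the offsets above
    val line = "<page> <title>" + "My Title" + "</title><text>" + "some text" + " </text> </page>"
    println(parse(line)) // Article(My Title,some text)
  }
}
```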

How to connect to Neo4j in Spark worker nodes?

I need to get a small subgraph in a Spark map function. I have tried AnormCypher and neo4j-spark-connector, but neither works. AnormCypher leads to a Java IOException (I build the connection in a mapPartitions function, tested against a localhost server), and neo4j-spark-connector causes the Task not serializable exception below.
Is there a good way to get a subgraph (or connect to a graph database like Neo4j) from the Spark worker nodes?
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
at ....
My code snippet, using neo4j-spark-connector 2.0.0-M2:
val neo = Neo4j(sc) // this runs on the driver

// this runs in a map function
def someFunctionToBeMapped(p: List[Long]) = {
  val metaGraph = neo.cypher("match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace) " +
    "return a.id, r.distance, b.id").loadRowRdd
    .map(row => ((row(0).asInstanceOf[Long], row(2).asInstanceOf[Long]), row(1).asInstanceOf[Double]))
    .collect().toList
}
The AnormCypher code is:
def partitionMap(partition: Iterator[List[Long]]) = {
  import org.anormcypher._
  import play.api.libs.ws._

  // Provide an instance of WSClient
  val wsclient = ning.NingWSClient()
  // Set up the REST client.
  // Need to add the Neo4jConnection type annotation so that the default
  // Neo4jConnection -> Neo4jTransaction conversion is in the implicit scope
  implicit val connection: Neo4jConnection = Neo4jREST("127.0.0.1", 7474, "neo4j", "000000")(wsclient)
  // Provide an ExecutionContext
  implicit val ec = scala.concurrent.ExecutionContext.global

  val res = partition.filter(placeList => {
    val startPlace = Cypher("match p = (a:TourPlace) -[r:could_go_to] -> (b:TourPlace) " +
      "return p")().flatMap(row => row.data)
    startPlace.nonEmpty // the filter predicate must return a Boolean
  })
  wsclient.close()
  res
}
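For context, the usual way around Task not serializable is to create the non-serializable client inside mapPartitions, so it is instantiated on the worker rather than captured from the driver (as happens when `neo` is built driver-side and used in a closure). A minimal sketch of that per-partition pattern, where FakeClient is a hypothetical stand-in for a real Neo4j connection:

```scala
object PartitionClientDemo {
  // Stand-in for a non-serializable connection (e.g. a Neo4j REST client)
  class FakeClient {
    def lookup(id: Long): Long = id * 2
    def close(): Unit = ()
  }

  // The function handed to rdd.mapPartitions: one client per partition,
  // created worker-side, with results materialized before the client closes.
  def partitionMap(partition: Iterator[Long]): Iterator[Long] = {
    val client = new FakeClient
    val result = partition.map(client.lookup).toList // force evaluation now
    client.close()
    result.iterator
  }

  def main(args: Array[String]): Unit = {
    println(partitionMap(Iterator(1L, 2L, 3L)).toList) // List(2, 4, 6)
  }
}
```

With Spark this would be invoked as `rdd.mapPartitions(PartitionClientDemo.partitionMap)`; materializing before `close()` matters because iterators are lazy.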
I have used Spark standalone mode and was able to connect to the Neo4j database.
Versions used:
spark 2.1.0
neo4j-spark-connector 2.1.0-M2
My code:
val sparkConf = new SparkConf().setAppName("Neo4j").setMaster("local")
val sc = new SparkContext(sparkConf)
println("*** Getting Started ***")
val neo = Neo4j(sc)
val df = neo.cypher("MATCH (n) RETURN id(n) as id").loadDataFrame
println(df.count)
Spark submit:
spark-submit --class package.classname --jars pathofneo4jsparkconnectoryJAR --conf spark.neo4j.bolt.password=***** targetJarFile.jar

Spark JobServer JobEnvironment

def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("local[4]").setAppName("LongPiJob")
  val sc = new SparkContext(conf)
  val env = new JobEnvironment {
    def jobId: String = "abcdef"
    // scalastyle:off
    def namedObjects: NamedObjects = ???
    def contextConfig: Config = ConfigFactory.empty
  }
  val results = runJob(sc, env, 5)
  println("Result is " + results)
}
I took this code from the LongPiJob example in the spark-jobserver repo, which uses the new API. I don't understand what `new JobEnvironment` is, or any of the variables inside it, and my IDE complains with these default settings.
https://github.com/spark-jobserver/spark-jobserver/blob/spark-2.0-preview/job-server-tests/src/main/scala/spark/jobserver/LongPiJob.scala
JobEnvironment carries runtime information about the job, such as jobId, contextConfig and namedObjects.
This makes it easy to access that information from runJob.

Spark execution for Twitter streaming

Hi, I'm new to Spark and Scala. I'm trying to stream some tweets through Spark Streaming with the following code:
object TwitterStreaming {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("WrongUsage: PropertiesFile, [<filters>]")
      System.exit(-1)
    }
    StreamingExamples.setStreaningLogLevels()
    val myConfigFile = args(0)
    val batchInterval_s = 1
    val fileConfig = ConfigFactory.parseFile(new File(myConfigFile))
    val appConf = ConfigFactory.load(fileConfig)

    // Set the system properties so that the Twitter4j library used by the twitter stream
    // can use them to generate OAuth credentials
    System.setProperty("twitter4j.oauth.consumerKey", appConf.getString("consumerKey"))
    System.setProperty("twitter4j.oauth.consumerSecret", appConf.getString("consumerSecret"))
    System.setProperty("twitter4j.oauth.accessToken", appConf.getString("accessToken"))
    System.setProperty("twitter4j.oauth.accessTokenSecret", appConf.getString("accessTokenSecret"))

    val sparkConf = new SparkConf().setAppName("TwitterStreaming").setMaster(appConf.getString("SPARK_MASTER")) // local[2]
    val ssc = new StreamingContext(sparkConf, Seconds(batchInterval_s)) // creating the spark streaming context
    val stream = TwitterUtils.createStream(ssc, None)
    val tweet_data = stream.map(status => TweetData(status.getId, "#" + status.getUser.getScreenName, status.getText.trim()))
    tweet_data.foreachRDD(rdd => {
      println(s"A sample of tweets I gathered over ${batchInterval_s}s: ${rdd.take(10).mkString(" ")} (total tweets fetched: ${rdd.count()})")
    })
  }
}

case class TweetData(id: BigInt, author: String, tweetText: String)
My error:
Exception in thread "main" com.typesafe.config.ConfigException$WrongType: /WorkSpace/InputFiles/application.conf: 5: Cannot concatenate object or list with a non-object-or-list, ConfigString("local") and SimpleConfigList([2]) are not compatible
at com.typesafe.config.impl.ConfigConcatenation.join(ConfigConcatenation.java:116)
Can anyone check the code and tell me where I'm going wrong?
If your config file contains:
SPARK_MASTER=local[2]
change it to:
SPARK_MASTER="local[2]"
Without the quotes, HOCON parses local[2] as the string "local" concatenated with the list [2], which cannot be joined; that is exactly what the exception says.

sortByKey in Spark

New to Spark and Scala. I am trying to sort a word-counting example; my code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 QuickStart VM; I think it bundles Spark version 0.9.
Source of the Scala job:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
"No implicit view" means there is no Scala function like this defined:
implicit def serializableToOrdered(x: java.io.Serializable) = new Ordered[java.io.Serializable](x) // note: no such conversion exists
The error appears because your function returns two different types whose common supertype is java.io.Serializable (one branch is a String, the other an Array[String]). sortByKey requires an ordering on the key type, and java.io.Serializable has none. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now the function always returns a String instead of two different types, so the keys have an ordering and sortByKey compiles.
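The type-inference issue can be seen without Spark at all: with mixed branch types the compiler infers the least upper bound, java.io.Serializable, which has no Ordering. The helper below is an illustration, not part of the original job:

```scala
object LubDemo {
  // Branches return String and Array[String]; the inferred result type is
  // their least upper bound, java.io.Serializable - not sortable.
  def bad(x: Array[String]): java.io.Serializable =
    if (x.length > 3) x(3) else Array("NO NAME")

  // Both branches return String, so the results can be sorted.
  def good(x: Array[String]): String =
    if (x.length > 3) x(3) else "NO NAME"

  def main(args: Array[String]): Unit = {
    val rows = List(Array("a", "b", "c", "zeta"), Array("short"))
    println(rows.map(good).sorted) // List(NO NAME, zeta)
    // rows.map(bad).sorted        // would not compile: no Ordering[java.io.Serializable]
  }
}
```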