Spark JobServer JobEnvironment - scala

def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("local[4]").setAppName("LongPiJob")
  val sc = new SparkContext(conf)
  val env = new JobEnvironment {
    def jobId: String = "abcdef"
    //scalastyle:off
    def namedObjects: NamedObjects = ???
    def contextConfig: Config = ConfigFactory.empty
  }
  val results = runJob(sc, env, 5)
  println("Result is " + results)
}
I took this code from the LongPi example for Spark JobServer (the new API), which is part of the GitHub repo. I don't understand what new JobEnvironment is, or any of the variables inside it, and my IDE complains with these default settings.
https://github.com/spark-jobserver/spark-jobserver/blob/spark-2.0-preview/job-server-tests/src/main/scala/spark/jobserver/LongPiJob.scala

JobEnvironment carries runtime information about the job, such as jobId, contextConfig and namedObjects.
This makes it easy to access that information from runJob.
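For example, a runJob written against this API (a minimal sketch assuming the signature used by the linked LongPiJob example; the body and the returned string are illustrative only) can read those fields directly:
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.api.JobEnvironment // new job-server API, as in the linked LongPiJob example

object EnvDemoJob {
  def runJob(sc: SparkContext, env: JobEnvironment, samples: Int): String = {
    val id = env.jobId                    // runtime id the job server assigned to this run
    val cfg: Config = env.contextConfig   // Typesafe Config the context was started with
    s"Job $id ran with $samples samples and ${cfg.entrySet().size()} context config entries"
  }
}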

Apache Spark with Scala

I ran the below code:
object AirportsInUsaSolution {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("airports").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val airports = sc.textFile("in/airports.text")
    val airportsInUSA = airports.filter(line => line.split(Utils.COMMA_DELIMITER)(3) == "\"United States\"")

    val airportsNameAndCityNames = airportsInUSA.map(line => {
      val splits = line.split(Utils.COMMA_DELIMITER)
      splits(1) + ", " + splits(2)
    })

    airportsNameAndCityNames.saveAsTextFile("out/airports_in_usa.text")
  }
}
And received the following error:
Process finished with exit code 1
The output file that my code saves the data to is empty. How can I solve this?
The program creates the file in the right directory, but the file itself is empty.

Integration tests for a simple Spark application

I need to write some unit and integration tests for a small research project. I am using a simple Spark application which reads data from a file and outputs the number of characters in the file. I am using ScalaTest for writing the unit tests, but I could not come up with integration tests for this project. According to the project flow I need to execute unit tests, package a jar file, and then use this jar to execute integration tests. I have a file with data as a resource for the tests. Should I package this file with the source code, or should I put it into a separate location? What kinds of integration tests can I write for this application?
The simple Spark application looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SparkExample {
  def readFile(sparkContext: SparkContext, fileName: String) = {
    sparkContext.textFile(fileName)
  }

  def mapStringToLength(data: RDD[String]) = {
    data.map(fileData => fileData.length)
  }

  def printIntFileData(data: RDD[Int]) = {
    data.foreach(fileString =>
      println(fileString.toString)
    )
  }

  def printFileData(data: RDD[String]) = {
    data.foreach(fileString =>
      println(fileString)
    )
  }

  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("TestApp")
      .getOrCreate()

    val dataFromFile = readFile(spark.sparkContext, args(0))

    println("\nAll the data:")

    val dataToInt = mapStringToLength(dataFromFile)
    printFileData(dataFromFile)
    printIntFileData(dataToInt)

    spark.stop()
  }
}
Unit tests I wrote:
import org.apache.hadoop.mapred.InvalidInputException
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfter, FunSuite, Matchers}

class SparkExampleTest extends FunSuite with BeforeAndAfter with Matchers {
  val master = "local"
  val appName = "TestApp"
  var sparkContext: SparkContext = _
  val fileContent = "This is the text only for the test purposes. There is no sense in it completely. This is the test of the Spark Application"
  val fileName = "src/test/resources/test_data.txt"
  val noPathFileName = "test_data.txt"
  val errorFileName = "test_data1.txt"

  before {
    val sparkSession = SparkSession
      .builder
      .master(master)
      .appName(appName)
      .getOrCreate()
    sparkContext = sparkSession.sparkContext
  }

  test("SparkExample.readFile") {
    assert(SparkExample.readFile(sparkContext, fileName).collect() sameElements Array(fileContent))
  }

  test("SparkExample.mapStringToLength") {
    val stringLength = fileContent.length
    val rdd = sparkContext.makeRDD(Array(fileContent))
    assert(SparkExample.mapStringToLength(rdd).collect() sameElements Array(stringLength))
  }

  test("SparkExample.mapStringToLength Negative") {
    val stringLength = fileContent.length
    val rdd = sparkContext.makeRDD(Array(fileContent + " "))
    assert(SparkExample.mapStringToLength(rdd).collect() != Array(stringLength))
  }

  test("SparkExample.readFile does not throw Exception") {
    noException should be thrownBy SparkExample.readFile(sparkContext, fileName).collect()
  }

  test("SparkExample.readFile throws InvalidInputException without filePath") {
    an[InvalidInputException] should be thrownBy SparkExample.readFile(sparkContext, noPathFileName).collect()
  }

  test("SparkExample.readFile throws InvalidInputException with wrong filename") {
    an[InvalidInputException] should be thrownBy SparkExample.readFile(sparkContext, errorFileName).collect()
  }
}
Spark Testing Base is the way to go; it is basically a lightweight embedded Spark for your tests. It is probably more on the "integration tests" side of things than unit tests, but you can also track code coverage etc., e.g. with scoverage.
https://github.com/holdenk/spark-testing-base
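A minimal sketch of what such a test could look like (assuming spark-testing-base's SharedSparkContext trait and the SparkExample object above; the library version must match your Spark version):
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class SparkExampleIntegrationTest extends FunSuite with SharedSparkContext {
  // sc is created and torn down by SharedSparkContext, so no before/after boilerplate is needed
  test("mapStringToLength computes the length of every line") {
    val rdd = sc.parallelize(Seq("abc", "de"))
    assert(SparkExample.mapStringToLength(rdd).collect().toSeq === Seq(3, 2))
  }
}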

Spark Execution for Twitter Streaming

Hi, I'm new to Spark and Scala. I'm trying to stream some tweets through Spark Streaming with the following code:
import java.io.File

import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterStreaming {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("WrongUsage: PropertiesFile, [<filters>]")
      System.exit(-1)
    }
    StreamingExamples.setStreamingLogLevels() // helper from the Spark examples project; just sets sensible log levels

    val myConfigFile = args(0)
    val batchInterval_s = 1
    val fileConfig = ConfigFactory.parseFile(new File(myConfigFile))
    val appConf = ConfigFactory.load(fileConfig)

    // Set the system properties so that the Twitter4j library used by the twitter stream
    // can use them to generate OAuth credentials
    System.setProperty("twitter4j.oauth.consumerKey", appConf.getString("consumerKey"))
    System.setProperty("twitter4j.oauth.consumerSecret", appConf.getString("consumerSecret"))
    System.setProperty("twitter4j.oauth.accessToken", appConf.getString("accessToken"))
    System.setProperty("twitter4j.oauth.accessTokenSecret", appConf.getString("accessTokenSecret"))

    val sparkConf = new SparkConf().setAppName("TwitterStreaming").setMaster(appConf.getString("SPARK_MASTER")) // e.g. local[2]
    val ssc = new StreamingContext(sparkConf, Seconds(batchInterval_s)) // creating the Spark Streaming context

    val stream = TwitterUtils.createStream(ssc, None)
    val tweet_data = stream.map(status => TweetData(status.getId, "#" + status.getUser.getScreenName, status.getText.trim()))
    tweet_data.foreachRDD(rdd => {
      println(s"A sample of tweets I gathered over ${batchInterval_s}s: ${rdd.take(10).mkString(" ")} (total tweets fetched: ${rdd.count()})")
    })

    ssc.start()            // start consuming the stream
    ssc.awaitTermination() // keep the application alive
  }
}

case class TweetData(id: BigInt, author: String, tweetText: String)
My Error:
Exception in thread "main" com.typesafe.config.ConfigException$WrongType:/WorkSpace/InputFiles/application.conf: 5: Cannot concatenate object or list with a non-object-or-list, ConfigString("local") and SimpleConfigList([2]) are not compatible
at com.typesafe.config.impl.ConfigConcatenation.join(ConfigConcatenation.java:116)
Can anyone check the code and tell me where I'm going wrong?
If your config file contains:
SPARK_MASTER=local[2]
change it to:
SPARK_MASTER="local[2]"
Without the quotes, HOCON parses the value as the string local concatenated with the list [2], which is exactly the "Cannot concatenate object or list with a non-object-or-list" error you are seeing; quoting it makes it a single string.
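To see why the quotes matter, here is a small standalone sketch (not part of the original code) that parses the corrected value with the same Typesafe Config library:
import com.typesafe.config.ConfigFactory

object ConfigCheck {
  def main(args: Array[String]): Unit = {
    // With the quotes, HOCON reads the whole value as a single string instead of
    // trying to concatenate the string "local" with the list [2].
    val fixed = ConfigFactory.parseString("""SPARK_MASTER = "local[2]" """)
    println(fixed.getString("SPARK_MASTER")) // prints local[2]
  }
}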

Spark Streaming using Scala to insert into HBase issue

I am trying to read records from a Kafka message and put them into HBase. Though the Scala script runs without any issue, the inserts are not happening. Please help me.
Input:
rowkey1,1
rowkey2,2
Here is the code which I am using:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
    hTable.put(thePut)
  }
}

object TheMain extends Serializable {
  def run() {
    // assumes this script is run in spark-shell, where sc (the SparkContext) is already defined
    val ssc = new StreamingContext(sc, Seconds(1))
    val topicmap = Map("test" -> 1)
    val lines = KafkaUtils.createStream(ssc, "127.0.0.1:2181", "test-consumer-group", topicmap).map(_._2)
    val words = lines.map(line => line.split(",")).map(line => (line(0), line(1)))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}

TheMain.run()
From the API doc for HTable's flushCommits() method: "Executes all the buffered Put operations". You should call this at the end of your blah() method -- it looks like they're currently being buffered but never executed or executed at some random time.
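A hedged sketch of the suggested change, using the same old HTable API as the question (newer HBase client versions replace this client-side buffering with BufferedMutator):
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

object Blaher {
  def blah(row: Array[String]): Unit = {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "test")
    try {
      val thePut = new Put(Bytes.toBytes(row(0)))
      thePut.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes(row(1)))
      hTable.put(thePut)
      hTable.flushCommits() // execute the buffered Put now instead of leaving it in the client buffer
    } finally {
      hTable.close()
    }
  }
}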

sortByKey in Spark

New to Spark and Scala. Trying to sort a word counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
"No implicit view" means there is no Scala function like this defined:
implicit def SerializableToOrdered(x: java.io.Serializable) = new Ordered[java.io.Serializable](x) // note: this function doesn't work
The error comes up because your function f returns two different types whose common supertype is java.io.Serializable (one branch returns a String, the other an Array[String]). sortByKey requires an implicit view from the key type to Ordered (i.e. an Ordering), and there is none for java.io.Serializable. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now the function returns a String in both branches instead of two different types, so Ordering[String] is available and sortByKey compiles.
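As a self-contained illustration (with hypothetical input data, not from the original question), sortByKey compiles here because the key type is String and an implicit Ordering[String] is in scope:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on older Spark versions such as 0.9

object SortByKeyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SortByKeyDemo").setMaster("local[2]"))
    val names = sc.parallelize(Seq("BOB", "ALICE", "BOB"))
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey() // Ordering[String] drives the sort
    println(wordCounts.collect().mkString("\n")) // (ALICE,1) then (BOB,2)
    sc.stop()
  }
}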