ArrayIndexOutOfBoundsException while reading from dataframe in scala - scala

I'm working on scala streaming unit test but while reading from csv file getting ArrayOutOfBoundsException
Code :
import org.scalatest.matchers.should.Matchers
import org.scalatest.wordspec.AnyWordSpecLike
import org.apache.spark.sql.SparkSession
class StreamingTest extends AnyWordSpecLike with Matchers {
val sparkses = SparkSession.builder.appName("MyApp").config("spark.master","local").getOrCreate()
val df = sparkses.read.format("csv").load("file.csv")
df.printSchema()
}
The code works fine without extending AnyWordSpecLike with Matchers, but we need it to work with EmbeddedKafka.
Any guidance would be helpful.

Related

how to fix Scala error with "Not found type"

I'm newbie in Scala, just trying to learn it in Spark. Now I'm writing a Scala app to load csv file from hadoop into dataframe, then I want to add a new column in that dataframe. There is a function to populate the content of that new column, for testing the function just uppercase the column from csv file, the csv file only contains one column: emp_id and it's string.. the function is defined in Object TestService. My IDE is Eclipse. Now I have error: not found: type TestService
Very appreciate if anyone can help me.
\\This is the main:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
import com.poc.spark.service.TestService;
object SparkIntTest {
def main(args:Array[String]){
sys.props.+=(("hadoop.home.dir","C:\\OpenSource\\Hadoop"))
val sparkConf = new SparkConf().setMaster("local").setAppName("employee").set("spark.testing.memory", "2147480000")
val sparkContext = new SparkContext(sparkConf)
val spark = SparkSession.builder().appName("employee").getOrCreate()
val df = spark.read.option("header", "true").csv(".\\src\\main\\resources\\employee.csv")
df.show();
println(df.schema);
val df_Applied = df.withColumn("award_rule",runAllRulesUDF(df("emp_id")))
df_Applied.show();
println(df_Applied.schema)
}
def runAllRulesUDF = udf(new TestService().runAllRulesForUDF(_:String))
}
Here is the Object TestService:
package com.poc.spark.service
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.functions._
object TestService {
def runAllRulesForUDF(empid: String): String = {
empid.toUpperCase();
}
}
TestService is an object, which means that it is a statically created singleton. So instead of
new TestService()
You can just say
TestService

scala import library wildcard

I am new to scala. Please be gentle.
The import below imports everything (every class, trait and object) under ml.
import org.apache.spark.ml._
but NOT ParamMap, which is under
import org.apache.spark.ml.param._
In other words, for the code below, if I do:
import org.apache.spark.ml.param._
import org.apache.spark.ml._
class Kmeans extends Transformer {
def copy(extra: ParamMap): Unit = {
defaultCopy(extra)
}}
Then I have no import errors, but if I comment import org.apache.spark.ml.param._:
//import org.apache.spark.ml.param._
import org.apache.spark.ml._
class Kmeans extends Transformer {
def copy(extra: ParamMap): Unit = {
defaultCopy(extra)
}}
It gives an import error on ParamMap.
Question
why isn't this import org.apache.spark.ml.param.ParamMap included import org.apache.spark.ml.param._
Scala imports are not recursive - import org.apache.spark.ml._ means import all classes and fields directly under ml package but not the ones under its sub-packages.
Since ParamMap is under one of ml's sub-packages (ml.param), you'll have to import that package or ParamMap class directly.

value toDF is not a member of Seq[(Int,String)]

I am trying to execute the following code but getting this error:
value toDF is not a member of Seq[(Int,String)].
I have the case class outside main and I have imported implicits too. But still I am getting this error. Can someone help me to resolve this ? I am using Spark 2.11-2.1.0 and Scala 2.11.8
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark._
final case class Email(id: Int, text: String)
object SampleKMeans {
def main(args: Array[String]) = {
val spark = SparkSession.builder.appName("SampleKMeans")
.master("yarn")
.getOrCreate()
import spark.implicits._
val emails = Seq(
"This is an email from...",
"SPAM SPAM spam",
"Hello, We'd like to offer you")
.zipWithIndex.map(_.swap).toDF("id", "text").as[Email]
}
}
You already have a SparkSession you can just import the spark.implicits._ will work in your case
val spark = SparkSession.builder.appName("SampleKMeans")
.master("local[*]")
.getOrCreate()
import spark.implicits._
Now toDF method works as expected.
If the error still exists, You need to check the version of spark and scala libraries that you are using.
Hope this helps!

How should I write unit tests in Spark, for a basic data frame creation example?

I'm struggling to write a basic unit test for creation of a data frame, using the example text file provided with Spark, as follows.
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
private var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).getOrCreate()
}
import spark.implicits._
case class Person(name: String, age: Int)
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0),attributes(1).trim.toInt))
.toDF()
test("Creating dataframe should produce data from of correct size") {
assert(df.count() == 3)
assert(df.take(1).equals(Array("Michael",29)))
}
override def afterEach(): Unit = {
spark.stop()
}
}
I know that the code itself works (from spark.implicits._ .... toDF()) because I have verified this in the Spark-Scala shell, but inside the test class I'm getting lots of errors; the IDE does not recognise 'import spark.implicits._, or toDF(), and therefore the tests don't run.
I am using SparkSession which automatically creates SparkConf, SparkContext and SQLContext under the hood.
My code simply uses the example code from the Spark repo.
Any ideas why this is not working? Thanks!
NB. I have already looked at the Spark unit test questions on StackOverflow, like this one: How to write unit tests in Spark 2.0+?
I have used this to write the test but I'm still getting the errors.
I'm using Scala 2.11.8, and Spark 2.2.0 with SBT and IntelliJ. These dependencies are correctly included within the SBT build file. The errors on running the tests are:
Error:(29, 10) value toDF is not a member of org.apache.spark.rdd.RDD[dataLoadTest.this.Person]
possible cause: maybe a semicolon is missing before `value toDF'?
.toDF()
Error:(20, 20) stable identifier required, but dataLoadTest.this.spark.implicits found.
import spark.implicits._
IntelliJ won't recognise import spark.implicits._ or the .toDF() method.
I have imported:
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FlatSpec, FunSuite, Matchers}
you need to assign sqlContext to a val for implicits to work . Since your sparkSession is a var, implicits won't work with it
So you need to do
val sQLContext = spark.sqlContext
import sQLContext.implicits._
Moreover you can write functions for your tests so that your test class looks as following
class dataLoadTest extends FunSuite with Matchers with BeforeAndAfterEach {
private val master = "local[*]"
private val appName = "data_load_testing"
var spark: SparkSession = _
override def beforeEach() {
spark = new SparkSession.Builder().appName(appName).master(master).getOrCreate()
}
test("Creating dataframe should produce data from of correct size") {
val sQLContext = spark.sqlContext
import sQLContext.implicits._
val df = spark.sparkContext
.textFile("/Applications/spark-2.2.0-bin-hadoop2.7/examples/src/main/resources/people.txt")
.map(_.split(","))
.map(attributes => Person(attributes(0), attributes(1).trim.toInt))
.toDF()
assert(df.count() == 3)
assert(df.take(1)(0)(0).equals("Michael"))
}
override def afterEach() {
spark.stop()
}
}
case class Person(name: String, age: Int)
There are many libraries for unit testing of spark, one of the mostly used is
spark-testing-base: By Holden Karau
This library have all with sc as the SparkContext below is a simple example
class TestSharedSparkContext extends FunSuite with SharedSparkContext {
val expectedResult = List(("a", 3),("b", 2),("c", 4))
test("Word counts should be equal to expected") {
verifyWordCount(Seq("c a a b a c b c c"))
}
def verifyWordCount(seq: Seq[String]): Unit = {
assertResult(expectedResult)(new WordCount().transform(sc.makeRDD(seq)).collect().toList)
}
}
Here, every thing is prepared with sc as a SparkContext
Another approach is to create a TestWrapper and use for the multiple testcases as below
import org.apache.spark.sql.SparkSession
trait TestSparkWrapper {
lazy val sparkSession: SparkSession =
SparkSession.builder().master("local").appName("spark test example ").getOrCreate()
}
And use this TestWrapper for all the tests with Scala-test, playing with BeforeAndAfterAll and BeforeAndAfterEach.
Hope this helps!

IllegalAccessException .. can not access a member of class with modifiers "protected"

this is my scala code . i am trying to ingest geotiff file into HDFS using the geotrellis library.
package RasterDataIngest.RasterDataIngestIntoHadoop
import geotrellis.spark._
import geotrellis.spark.ingest._
import geotrellis.spark.io.hadoop._
import geotrellis.spark.io.index._
import geotrellis.spark.tiling._
import geotrellis.spark.utils.SparkUtils
import geotrellis.vector._
import org.apache.hadoop.fs.Path
import org.apache.spark._
import com.quantifind.sumac.ArgMain
import com.quantifind.sumac.validation.Required
class HadoopIngestArgs extends IngestArgs {
#Required var catalog: String = _
def catalogPath = new Path(catalog)
}
object HadoopIngest extends ArgMain[HadoopIngestArgs] with Logging {
def main(args: HadoopIngestArgs): Unit = {
System.setProperty("com.sun.media.jai.disableMediaLib", "true")
implicit val sparkContext = SparkUtils.createSparkContext("Ingest")
val conf = sparkContext.hadoopConfiguration
conf.set("io.map.index.interval", "1")
val catalog = HadoopRasterCatalog(args.catalogPath)
val source = sparkContext.hadoopGeoTiffRDD(args.inPath)
val layoutScheme = ZoomedLayoutScheme()
Ingest[ProjectedExtent, SpatialKey](source, args.destCrs, layoutScheme, args.pyramid){ (rdd, level) =>
catalog
.writer[SpatialKey](RowMajorKeyIndexMethod, args.clobber)
.write(LayerId(args.layerName, level.zoom), rdd)
}
}
}
When i run this code , i get the following error.
Please help me to solve this error.
java.lang.IllegalAccessException: Class org.osgeo.proj4j.Registry can not access a member of class org.osgeo.proj4j.proj.Projection with modifiers "protected"
I believe the problem is related to a bad sbt cache or Java version mismatch. Try the latest stable GeoTrellis version: 0.10.3 (Scala 2.10/2.11, Java 8, Spark 1.6.x). If you plan to use GeoTrellis with Spark 2, take a look at the GeoTrellis snapshot (version 1.0.0 will support Spark 2+, Java 8, and Scala 2.11).