Spark kryo encoder ArrayIndexOutOfBoundsException - scala

I'm trying to create a dataset with some geo data using spark and esri. If Foo only have Point field, it'll work but if I add some other fields beyond a Point, I get ArrayIndexOutOfBoundsException.
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}
object Main {
case class Foo(position: Point, name: String)
object MyEncoders {
implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
}
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local"))
val sqlContext = new SQLContext(sc)
import MyEncoders.{FooEncoder, PointEncoder}
import sqlContext.implicits._
Seq(new Foo(new Point(0, 0), "bar")).toDS.show
}
}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
at
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
at
org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73) at
org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65) at
org.apache.spark.sql.Dataset.showString(Dataset.scala:263) at
org.apache.spark.sql.Dataset.show(Dataset.scala:230) at
org.apache.spark.sql.Dataset.show(Dataset.scala:193) at
org.apache.spark.sql.Dataset.show(Dataset.scala:201) at
Main$.main(Main.scala:24) at Main.main(Main.scala)

Kryo create encoder for complex data types based on Spark SQL Data Types. So check the result of schema that kryo create:
val enc: Encoder[Foo] = Encoders.kryo[Foo]
println(enc.schema) // StructType(StructField(value,BinaryType,true))
val numCols = schema.fieldNames.length // 1
So you have one column data in Dataset and it's in Binary format. But It's strange that why Spark attempting to show Dataset in more than one column (and that error occurs). To fix this, upgrade Spark version to 2.0.0.
By using Spark 2.0.0, you still have problem with columns data types. I hope writing manual schema works if you can write StructType for esri Point class:
val schema = StructType(
Seq(
StructField("point", StructType(...), true),
StructField("name", StringType, true)
)
)
val rdd = sc.parallelize(Seq(Row(new Point(0,0), "bar")))
sqlContext.createDataFrame(rdd, schema).toDS

Related

UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row

I am trying to create a dataFrame. It seems that spark is unable to create a dataframe from a scala.Tuple2 type. How can I do it? I am new to scala and spark.
Below is a part of the error trace from the code run
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:666)
..........
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:299)
at SparkMapReduce$.runMapReduce(SparkMapReduce.scala:46)
at Entrance$.queryLoader(Entrance.scala:64)
at Entrance$.paramsParser(Entrance.scala:43)
at Entrance$.main(Entrance.scala:30)
at Entrance.main(Entrance.scala)
Below is the code that is a part of the entire program. The problem occurs in the line above the exclamation marks in a comment
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
object SparkMapReduce {
Logger.getLogger("org.spark_project").setLevel(Level.WARN)
Logger.getLogger("org.apache").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)
Logger.getLogger("com").setLevel(Level.WARN)
def runMapReduce(spark: SparkSession, pointPath: String, rectanglePath: String): DataFrame =
{
var pointDf = spark.read.format("csv").option("delimiter",",").option("header","false").load(pointPath);
pointDf = pointDf.toDF()
pointDf.createOrReplaceTempView("points")
pointDf = spark.sql("select ST_Point(cast(points._c0 as Decimal(24,20)),cast(points._c1 as Decimal(24,20))) as point from points")
pointDf.createOrReplaceTempView("pointsDf")
// pointDf.show()
var rectangleDf = spark.read.format("csv").option("delimiter",",").option("header","false").load(rectanglePath);
rectangleDf = rectangleDf.toDF()
rectangleDf.createOrReplaceTempView("rectangles")
rectangleDf = spark.sql("select ST_PolygonFromEnvelope(cast(rectangles._c0 as Decimal(24,20)),cast(rectangles._c1 as Decimal(24,20)), cast(rectangles._c2 as Decimal(24,20)), cast(rectangles._c3 as Decimal(24,20))) as rectangle from rectangles")
rectangleDf.createOrReplaceTempView("rectanglesDf")
// rectangleDf.show()
val joinDf = spark.sql("select rectanglesDf.rectangle as rectangle, pointsDf.point as point from rectanglesDf, pointsDf where ST_Contains(rectanglesDf.rectangle, pointsDf.point)")
joinDf.createOrReplaceTempView("joinDf")
// joinDf.show()
import spark.implicits._
val joinRdd = joinDf.rdd
val resmap = joinRdd.map(x=>(x, 1))
val reduced = resmap.reduceByKey(_+_)
val final_datablock = reduced.collect()
val trying : List[Float] = List()
print(final_datablock)
// .toDF("rectangles", "count")
// val dataframe_final1 = spark.createDataFrame(reduced)
val dataframe_final2 = spark.createDataFrame(reduced).toDF("rectangles", "count")
// ^ !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!Line above creates problem
// You need to complete this part
var result = spark.emptyDataFrame
return result // You need to change this part
}
}
Your first column of reduced has a type of ROW and you do not specified it when converting from RDD to DF. A dataframe must have a schema. So you need to use the following method by defining a right schema for your RDD to covert to DataFrame.
createDataFrame(RDD<Row> rowRDD, StructType schema)
for example:
val schema = new StructType()
.add(Array(
StructField("._1a",IntegerType),
StructField("._1b", ArrayType(StringType))
))
.add(StructField("count", IntegerType, true))

Unable to find encoder for type stored in a Dataset in Spark Scala

I am trying to execute the following simple example in Spark. However, I am getting below error
"could not find implicit value for evidence parameter of type org.apache.spark.sql.Encoder[mydata]"
How do I fix this?
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
case class mydata(ID: Int,Salary: Int)
object SampleKMeans {
def main(args: Array[String]) = {
val spark = SparkSession.builder
.appName("SampleKMeans")
.master("yarn")
.getOrCreate()
import spark.implicits._
val ds = spark.read
.option("header","true")
.option("inferSchema","true")
.csv("data/mydata.csv")
.as[mydata]
val assembler = new VectorAssembler()
.setInputCols(Array("Salary"))
.setOutputCol("SalaryOut")
val a = assembler.transform(ds)
}
The error went off after I explicitly specified the schema. Thanks everyone for helping me out.
val ds = spark.read
.schema("Int","Int")
.option("header","true")
.csv("data/mydata.csv").as[mydata]
You need to provide schema information.
case class mydata(ID: Int,Salary: Int)
val schema = StructType(Array(
StructField("ID", IntegerType, false),
StructField("Salary", IntegerType, false)))
Provide the above piece of code inside main method.
And your call for reading CSV will be
spark.read.schema(schema).csv("path").as[mydata]
With this, you can use your rest of the code.
Hope this helps!
An example you provided works on Spark 2.2.0. I guess it's not code you trying to run, but only an example for stackoverflow.
Check if your case class is top level object (and not declared inside method)

How to declare an empty dataset in Spark?

I am new in Spark and Spark dataset. I was trying to declare an empty dataset using emptyDataset but it was asking for org.apache.spark.sql.Encoder. The data type I am using for the dataset is an object of case class Tp(s1: String, s2: String, s3: String).
All you need is to import implicit encoders from SparkSession instance before you create empty Dataset: import spark.implicits._
See full example here
EmptyDataFrame
package com.examples.sparksql
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object EmptyDataFrame {
def main(args: Array[String]){
//Create Spark Conf
val sparkConf = new SparkConf().setAppName("Empty-Data-Frame").setMaster("local")
//Create Spark Context - sc
val sc = new SparkContext(sparkConf)
//Create Sql Context
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//Import Sql Implicit conversions
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType}
//Create Schema RDD
val schema_string = "name,id,dept"
val schema_rdd = StructType(schema_string.split(",").map(fieldName => StructField(fieldName, StringType, true)) )
//Create Empty DataFrame
val empty_df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema_rdd)
//Some Operations on Empty Data Frame
empty_df.show()
println(empty_df.count())
//You can register a Table on Empty DataFrame, it's empty table though
empty_df.registerTempTable("empty_table")
//let's check it ;)
val res = sqlContext.sql("select * from empty_table")
res.show
}
}
Alternatively you can convert an empty list into a Dataset:
import sparkSession.implicits._
case class Employee(name: String, id: Int)
val ds: Dataset[Employee] = List.empty[Employee].toDS()

Spark 2 to Spark 1.6

I am trying to convert following code to run on the spark 1.6 but, on which I am facing certain issues. while converting the sparksession to context
object TestData {
def makeIntegerDf(spark: SparkSession, numbers: Seq[Int]): DataFrame =
spark.createDataFrame(
spark.sparkContext.makeRDD(numbers.map(Row(_))),
StructType(List(StructField("column", IntegerType, nullable = false)))
)
}
How Do I convert it to make it run on spark 1.6
SparkSession is supported from spark 2.0 on-wards only. So if you want to use spark 1.6 then you would need to create SparkContext and sqlContext in driver class and pass them to the function.
so you can create
val conf = new SparkConf().setAppName("simple")
val sparkContext = new SparkContext(conf)
val sqlContext = new SQLContext(sparkContext)
and then call the function as
val callFunction = makeIntegerDf(sparkContext, sqlContext, numbers)
And your function should be as
def makeIntegerDf(sparkContext: SparkContext, sqlContext: SQLContext, numbers: Seq[Int]): DataFrame =
sqlContext.createDataFrame(
sparkContext.makeRDD(numbers.map(Row(_))),
StructType(List(StructField("column", IntegerType, nullable = false)))
)
The only main difference here is the use of spark which is a spark session as opposed to spark context.
So you would do something like this:
object TestData {
def makeIntegerDf(sc: SparkContext, sqlContext: SQLContext, numbers: Seq[Int]): DataFrame =
sqlContext.createDataFrame(
sc.makeRDD(numbers.map(Row(_))),
StructType(List(StructField("column", IntegerType, nullable = false)))
)
}
Of course you would need to create a spark context instead of spark session in order to provide it to the function.

Converting error with RDD operation in Scala

I am new to Scala and I ran into the error while doing some practice.
I tried to convert RDD into DataFrame and following is my code.
package com.sclee.examples
import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType};
object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("examples").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Long)
val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
val df = personRDD.map({
case Row(val1: String, val2: Long) => Person(val1,val2)
}).toDS()
// val ds = personRDD.toDS()
}
}
I followed the instructions in Spark documentation and also referenced some blogs showing me how to convert rdd into dataframe but the I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
Although I tried to fix the problem by myself but failed. Any help will be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
case class Person(name: String, age: Long)
object SparkTest {
def main(args: Array[String]): Unit = {
// use the SparkSession of Spark 2
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
import spark.implicits._
// this your RDD - just a sample how to create an RDD
val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A",10),Person("B",20)))
// the sparksession has a method to convert to an Dataset
val ds = spark.createDataset(personRDD)
println(ds.count())
}
}
I made the following changes:
use SparkSession instead of SparkContext and SqlContext
move Person class out of the App (I'm not sure why I had to do
this)
use createDataset for conversion
However, I guess it's pretty uncommon to do this conversion and you probably want to read your input directly into an Dataset using the read method