Unable to find encoder for type stored in a Dataset in Spark Scala

I am trying to execute the following simple example in Spark, but I am getting the error below:
"could not find implicit value for evidence parameter of type org.apache.spark.sql.Encoder[mydata]"
How do I fix this?
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler

case class mydata(ID: Int, Salary: Int)

object SampleKMeans {
  def main(args: Array[String]) = {
    val spark = SparkSession.builder
      .appName("SampleKMeans")
      .master("yarn")
      .getOrCreate()
    import spark.implicits._

    val ds = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/mydata.csv")
      .as[mydata]

    val assembler = new VectorAssembler()
      .setInputCols(Array("Salary"))
      .setOutputCol("SalaryOut")
    val a = assembler.transform(ds)
  }
}

The error went away after I explicitly specified the schema. Thanks everyone for helping me out.
val ds = spark.read
  .schema("ID INT, Salary INT")  // DDL-style schema string (Spark 2.3+); on older versions pass a StructType instead
  .option("header", "true")
  .csv("data/mydata.csv")
  .as[mydata]

You need to provide the schema information.

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

case class mydata(ID: Int, Salary: Int)

val schema = StructType(Array(
  StructField("ID", IntegerType, false),
  StructField("Salary", IntegerType, false)))

Put the schema definition inside the main method. Your CSV read then becomes:
spark.read.schema(schema).csv("path").as[mydata]
With this, the rest of your code works unchanged.
Hope this helps!
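Putting the pieces together, a sketch of the full program with an explicit schema (the file path, column names, and VectorAssembler step are taken from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import org.apache.spark.ml.feature.VectorAssembler

case class mydata(ID: Int, Salary: Int)

object SampleKMeans {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("SampleKMeans")
      .master("yarn")
      .getOrCreate()
    import spark.implicits._

    // Explicit schema instead of inferSchema, so the columns line up with the case class.
    val schema = StructType(Array(
      StructField("ID", IntegerType, false),
      StructField("Salary", IntegerType, false)))

    val ds = spark.read
      .schema(schema)
      .option("header", "true")
      .csv("data/mydata.csv")
      .as[mydata]

    val assembler = new VectorAssembler()
      .setInputCols(Array("Salary"))
      .setOutputCol("SalaryOut")
    assembler.transform(ds).show()
  }
}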

The example you provided works on Spark 2.2.0. I guess it's not the exact code you are trying to run, but only an example for Stack Overflow.
Check that your case class is a top-level object (and not declared inside a method).
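To illustrate what "top level" means here, a minimal sketch (the object name and data are illustrative):

import org.apache.spark.sql.SparkSession

// OK: declared at the top level, outside any method, so Spark can derive an Encoder for it.
case class mydata(ID: Int, Salary: Int)

object PlacementCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("PlacementCheck").getOrCreate()
    import spark.implicits._

    // If `case class mydata(...)` were declared here, inside main, the implicit
    // Encoder lookup for .toDS()/.as[mydata] would fail with the error from the question.
    val ds = Seq(mydata(1, 100)).toDS()
    ds.show()
  }
}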

Related

How to make Scala script infer custom column types in a csv/json schema?

I am trying to infer the schema when I load a CSV file using SparkSession. Please note that I do not want to use a case class here, as I am trying to infer the data file's schema as soon as it is loaded; I have no information about the data types or column names of the file before loading it.
Here is what I am trying out in Scala:
package example

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import java.io.File
import org.apache.spark.sql.SparkSession
//import sqlContext.implicits._

object SimpleScalaSpark {
  def main(args: Array[String]) {
    //val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", "local")
      .getOrCreate()

    //val etl1Rdd = spark.sparkContext.wholeTextFiles("etl1.json").map(x => x._2)
    val jsonTbl = spark.sqlContext.read.format("org.apache.spark.csv")
      .option("header", true)
      .option("inferSchema", true)
      .option("dateFormat", "MM/dd/yyyy HH:mm")
      .csv("s1.csv")

    // print the inferred schema
    jsonTbl.printSchema
  }
}
I am able to get DateTime, Integer, Double, and String as data types for my file. But I want to implement custom data types based on my own regex patterns, for fields like SSN, VIN-ID, PhoneNumber etc., which all have a fixed pattern that can be detected with a regex. This would make the schema-extraction process more accurate and precise. For example, suppose I have a column whose values consist of 5 or more letters and 2 or more digits; I could then say that this column is of type ID.
Any ideas on whether it is possible to do this using Scala/Spark? Please include the implementation approach as well if possible, or a pointer to technical documentation.
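There is no built-in hook for this in Spark's CSV schema inference, but one possible approach (a sketch under assumptions, not an existing API): read the columns as strings, sample each column, and tag it with a custom type name when every sampled value matches one of your regex patterns. The customPatterns map, the detectCustomType helper, and the concrete patterns below are illustrative.

import org.apache.spark.sql.{DataFrame, SparkSession}

object CustomTypeSketch {
  // Hypothetical custom "types", keyed by a regex every value must match.
  val customPatterns: Map[String, String] = Map(
    "SSN"         -> """\d{3}-\d{2}-\d{4}""",
    "PhoneNumber" -> """\+?\d{10,12}""",
    "ID"          -> """[A-Za-z]{5,}\d{2,}"""  // 5+ letters followed by 2+ digits
  )

  // Sample a column (read as strings, i.e. without inferSchema) and return the
  // first custom type whose regex matches every non-null sampled value, if any.
  def detectCustomType(df: DataFrame, column: String, sampleSize: Int = 100): Option[String] = {
    val sample = df.select(column).na.drop().limit(sampleSize)
      .collect()
      .map(_.getString(0))
    customPatterns.collectFirst {
      case (typeName, pattern) if sample.nonEmpty && sample.forall(_.matches(pattern)) => typeName
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("CustomTypeSketch").getOrCreate()
    // Read everything as strings so the regex check sees the raw text.
    val df = spark.read.option("header", true).csv("s1.csv")
    df.columns.foreach { c =>
      println(s"$c -> ${detectCustomType(df, c).getOrElse("use Spark's inferred type")}")
    }
  }
}

Columns that match no pattern would simply keep whatever type Spark's normal inference assigns.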

Converting error with RDD operation in Scala

I am new to Scala and ran into an error while doing some practice.
I tried to convert an RDD into a DataFrame; the following is my code.
package com.sclee.examples

import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("examples").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
    val df = personRDD.map({
      case Row(val1: String, val2: Long) => Person(val1, val2)
    }).toDS()
    // val ds = personRDD.toDS()
  }
}
I followed the instructions in the Spark documentation and also referenced some blogs showing how to convert an RDD into a DataFrame, but I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
I tried to fix the problem by myself but failed. Any help will be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object SparkTest {
  def main(args: Array[String]): Unit = {
    // use the SparkSession of Spark 2
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
    import spark.implicits._

    // this is your RDD - just a sample of how to create an RDD
    val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A", 10), Person("B", 20)))

    // the SparkSession has a method to convert an RDD to a Dataset
    val ds = spark.createDataset(personRDD)
    println(ds.count())
  }
}
I made the following changes:
use SparkSession instead of SparkContext and SQLContext
move the Person class out of the App object (I'm not sure why I had to do this)
use createDataset for the conversion
However, I guess it's pretty uncommon to do this conversion, and you probably want to read your input directly into a Dataset using the read method, as sketched below.
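For example, a minimal sketch of reading a file straight into a Dataset[Person] (the "people.json" path is a placeholder):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object ReadIntoDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ReadIntoDataset").master("local[*]").getOrCreate()
    import spark.implicits._

    // Read the file and map it onto the case class in one step.
    val people = spark.read.json("people.json").as[Person]
    people.show()
  }
}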

Unable to find encoder for type stored in Dataset when attempting to perform flatMap on a DataFrame in Spark 2.0 [duplicate]

This question already has answers here:
Encoder error while trying to map dataframe row to updated row
(4 answers)
Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?
(3 answers)
Closed 5 years ago.
I keep getting the following compile time error:
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
I've just upgraded from Spark v1.6 to v2.0.2, and a whole bunch of code using DataFrames is now complaining about this error. The code it complains about looks like the following.
def doSomething(data: DataFrame): Unit = {
  data.flatMap(row => {
      ...
    })
    .reduceByKey(_ + _)
    .sortByKey(ascending = false)
}
Previous SO posts suggest to
pull out the case class (defined in an object)
perform the implicit imports
However, I don't have any case classes, as I am using DataFrame, which is equal to Dataset[Row]. Also, I've inlined the two implicit imports as follows, without any luck getting rid of this message.
val sparkSession: SparkSession = ???
val sqlContext: SQLContext = ???
import sparkSession.implicits._
import sqlContext.implicits._
Note that I've looked at the documentation for Dataset and Encoder. The docs say something like the following.
Scala
Encoders are generally created automatically through implicits from a
SparkSession, or can be explicitly created by calling static methods on
Encoders.
import spark.implicits._
val ds = Seq(1, 2, 3).toDS() // implicitly provided (spark.implicits.newIntEncoder)
However, my method doesn't have access to a SparkSession. Also, when I try the line import spark.implicits._, IntelliJ can't even find it. When I say that my DataFrame is a Dataset[Row], I really do mean it.
This question is marked as a possible duplicate, but please let me clarify.
I have no case class or business object associated with the data.
I am using .flatMap while the other question is using .map.
Implicit imports do not seem to help.
Passing a RowEncoder produces a compile-time error, e.g. data.flatMap(row => { ... }, RowEncoder(data.schema)) (too many arguments).
I'm reading the other posts, and let me add that I guess I don't know how this new Spark 2.0 Dataset/DataFrame API is supposed to work. In the Spark shell, the code below works. Note that I start the Spark shell like this: $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val schema = StructType(Array(
  StructField("x1", StringType, true),
  StructField("x2", StringType, true),
  StructField("x3", StringType, true),
  StructField("x4", StringType, true),
  StructField("x5", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load("/Users/jwayne/Downloads/mydata.csv")

df.columns.map(col => {
    df.groupBy(col)
      .count()
      .map(_.getString(0))
      .collect()
      .toList
  })
  .toList
However, when I run this as part of a unit test, I get the same "unable to find encoder" error. Why does this work in the shell but not in my unit tests?
In the shell, I typed :imports and :implicits and placed those imports in my Scala source files, but that doesn't help either.
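For what it's worth, a sketch of how the encoder can be supplied in this situation: Dataset.flatMap takes its Encoder in a second (implicit) parameter list, so a RowEncoder can be passed there explicitly rather than as a second argument (which is what triggers the "too many arguments" error). The function body below is a placeholder, and it assumes Spark 2.x:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// flatMap over a DataFrame (Dataset[Row]) without a case class:
// the encoder goes in flatMap's second (implicit) parameter list.
def doSomething(data: DataFrame): DataFrame = {
  data.flatMap { row =>
    Seq(row, row) // placeholder body: emit each row twice
  }(RowEncoder(data.schema))
}

Note also that reduceByKey and sortByKey from the original snippet are RDD operations; on a Dataset you would reach for groupByKey/orderBy or drop down to .rdd first.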

Spark dataframe join is failing if key column contains a period(".") in the end

I am getting the exception below when I join two dataframes in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the dataframe column does not contain a period. Please do help me out.
Below is the code that I am using.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row

object Demo {
  lazy val sc: SparkContext = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("demooo")
      .set("spark.driver.allowMultipleContexts", "true")
    new SparkContext(conf)
  }
  sc.setLogLevel("ERROR")
  lazy val sqlcontext = new SQLContext(sc)

  val data = List(Row("a", "b"), Row("v", "b"))
  val dataRdd = sc.parallelize(data)
  val schema = new StructType(Array(StructField("col.1", StringType, true), StructField("col2", StringType, true)))
  val df1 = sqlcontext.createDataFrame(dataRdd, schema)

  val data2 = List(Row("a", "b"), Row("v", "b"))
  val dataRdd2 = sc.parallelize(data2)
  val schema2 = new StructType(Array(StructField("col3", StringType, true), StructField("col4", StringType, true)))
  val df2 = sqlcontext.createDataFrame(dataRdd2, schema2)

  val val1 = "col.1"
  val df3 = df1.join(df2, df1.col(val1).equalTo(df2.col("col3")), "outer").show
}
In general, a period is used to access members of a struct field.
The Spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions, so upgrading might solve the issue on its own.
That said, you can simply use withColumnRenamed to rename the column to something which does not have a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3 = dfTmp.join(df2, dfTmp.col("JOIN_COL").equalTo(df2.col("col3")), "outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so you probably meant df3 to be the expression without it and then call df3.show separately.
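As an aside, if renaming is not an option, Spark SQL can also treat a dotted name as a single column when it is quoted with backticks; whether this works on 1.5 depends on the exact version, so treat the following as a sketch to try rather than a guaranteed fix:

// Backticks quote the dotted column name so it is parsed as one column
// rather than as struct-field access (verify on your Spark version).
val df3 = df1.join(df2, df1.col("`col.1`").equalTo(df2.col("col3")), "outer")
df3.show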

Spark kryo encoder ArrayIndexOutOfBoundsException

I'm trying to create a dataset with some geo data using Spark and Esri. If Foo only has a Point field, it works, but if I add some other fields beyond the Point, I get an ArrayIndexOutOfBoundsException.
import com.esri.core.geometry.Point
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  case class Foo(position: Point, name: String)

  object MyEncoders {
    implicit def PointEncoder: Encoder[Point] = Encoders.kryo[Point]
    implicit def FooEncoder: Encoder[Foo] = Encoders.kryo[Foo]
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("app").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import MyEncoders.{FooEncoder, PointEncoder}
    import sqlContext.implicits._
    Seq(new Foo(new Point(0, 0), "bar")).toDS.show
  }
}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:71)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1$$anonfun$apply$2.apply(Queryable.scala:70)
  at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:70)
  at org.apache.spark.sql.execution.Queryable$$anonfun$formatString$1.apply(Queryable.scala:69)
  at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
  at org.apache.spark.sql.execution.Queryable$class.formatString(Queryable.scala:69)
  at org.apache.spark.sql.Dataset.formatString(Dataset.scala:65)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:263)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:230)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:193)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:201)
  at Main$.main(Main.scala:24)
  at Main.main(Main.scala)
Kryo creates encoders for complex data types based on Spark SQL data types. So check the schema that the kryo encoder produces:
val enc: Encoder[Foo] = Encoders.kryo[Foo]
println(enc.schema) // StructType(StructField(value,BinaryType,true))
val numCols = enc.schema.fieldNames.length // 1
So your Dataset has a single column of data, and it is in binary format. But it's strange that Spark attempts to show the Dataset as more than one column (and that is where the error occurs). To fix this, upgrade your Spark version to 2.0.0.
With Spark 2.0.0, you will still have a problem with the column data types. Writing a manual schema may work, if you can write a StructType for the esri Point class:
val schema = StructType(
  Seq(
    StructField("point", StructType(...), true),
    StructField("name", StringType, true)
  )
)
val rdd = sc.parallelize(Seq(Row(new Point(0, 0), "bar")))
sqlContext.createDataFrame(rdd, schema).toDS
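An alternative workaround, not from the original answer: skip kryo entirely and flatten the geometry into primitive fields, so Spark's built-in product encoder can handle the case class. The FlatFoo name and the x/y extraction below are illustrative assumptions:

import com.esri.core.geometry.Point
import org.apache.spark.sql.SparkSession

// Only primitives and Strings, so the built-in product encoder works
// and show() displays proper columns instead of one binary blob.
case class FlatFoo(x: Double, y: Double, name: String)

object FlattenExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlattenExample").master("local").getOrCreate()
    import spark.implicits._

    // Convert each (Point, name) pair into the flat case class before building the Dataset.
    val ds = Seq((new Point(0, 0), "bar"))
      .map { case (p, name) => FlatFoo(p.getX, p.getY, name) }
      .toDS()
    ds.show()
  }
}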