How to manually create a Dataset with a Set column in Scala

I'm trying to manually create a Dataset with a Set-typed column:
case class Files(Record: String, ids: Set)

val files = Seq(
  Files("202110260931", Set(770010, 770880)),
  Files("202110260640", Set(770010, 770880)),
  Files("202110260715", Set(770010, 770880))
).toDS()

files.show()
This gives me the error:
>command-1888379816641405:10: error: type Set takes type parameters
case class Files(s3path: String, ids: Set)
What am I doing wrong?

Set is a parameterized type, so when you declare it in your Files case class you need to specify its element type, e.g. Set[Int] for a set of integers. Your Files case class definition should therefore be:
case class Files(Record: String, ids: Set[Int])
And so the complete code to create a dataset with a set column:
import org.apache.spark.sql.SparkSession

object ToDataset {
  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("test-app")
    .config("spark.ui.enabled", "false")
    .config("spark.driver.host", "localhost")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    import spark.implicits._
    val files = Seq(
      Files("202110260931", Set(770010, 770880)),
      Files("202110260640", Set(770010, 770880)),
      Files("202110260715", Set(770010, 770880))
    ).toDS()
    files.show()
  }

  case class Files(Record: String, ids: Set[Int])
}
That will return the following dataset:
+------------+----------------+
| Record| ids|
+------------+----------------+
|202110260931|[770010, 770880]|
|202110260640|[770010, 770880]|
|202110260715|[770010, 770880]|
+------------+----------------+
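Note how show() prints the ids column with array-style brackets. A quick printSchema call (my addition to the example above, not part of the original answer) confirms that, as far as I know, Spark stores a Scala Set[Int] as an array column:

// Inside main, after building the files Dataset:
files.printSchema()
// Expected output, roughly:
// root
//  |-- Record: string (nullable = true)
//  |-- ids: array (nullable = true)
//  |    |-- element: integer (containsNull = false)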

Related

Problem creating dataset in Spark and Scala

I ran into a problem using a Spark Dataset!
I keep getting an exception about encoders when I use a case class.
The code is simple:
case class OrderDataType (orderId: String, customerId: String, orderDate: String)
import spark.implicits._
val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
I get this exception during compile:
Unable to find encoder for type OrderDataType. An implicit Encoder[OrderDataType] is needed to store OrderDataType instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I have already added import spark.implicits._ but it doesn't solve the problem!
According to the Spark and Scala documentation, the encoder should be derived implicitly.
What is wrong with this code, and what should I do to fix it?
Define your case class outside of the main method, then read the CSV file inside main and convert it to a Dataset.
Example:
case class OrderDataType(orderId: String, customerId: String, orderDate: String)

def main(args: Array[String]): Unit = {
  val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
}

// or read as a tuple instead of a case class:
def main(args: Array[String]): Unit = {
  val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[(String, String, String)]
}
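For completeness, here is a self-contained sketch of the first option (the object name OrdersMain and the locally built SparkSession are my own additions for illustration; the data\\orders.csv path comes from the question):

import org.apache.spark.sql.SparkSession

// Case class at the top level, outside of any method.
case class OrderDataType(orderId: String, customerId: String, orderDate: String)

object OrdersMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("orders")
      .getOrCreate()
    import spark.implicits._

    // With the case class at the top level, Encoder[OrderDataType] can be derived implicitly.
    val ds = spark.read.option("header", "true").csv("data\\orders.csv").as[OrderDataType]
    ds.show()
  }
}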
Another way: put everything directly inside object Orders extends App (with no def main, the case class is defined at the object level and the encoder can be found).
mydata/Orders.csv
orderId,customerId,orderDate
1,2,21/08/1977
1,2,21/08/1978
Example code:
package examples

import org.apache.log4j.Level
import org.apache.spark.sql._

object Orders extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder.appName(getClass.getName)
    .master("local[*]").getOrCreate

  case class OrderDataType(orderId: String, customerId: String, orderDate: String)

  import spark.implicits._

  val ds1 = spark.read.option("header", "true").csv("mydata/Orders.csv").as[OrderDataType]
  ds1.show
}
Result:
+-------+----------+----------+
|orderId|customerId| orderDate|
+-------+----------+----------+
| 1| 2|21/08/1977|
| 1| 2|21/08/1978|
+-------+----------+----------+
Why does the case class have to live outside of def main? This seems to be by design of Encoder, whose error message comes from its @implicitNotFound annotation, as sketched below.
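Here is a minimal, standalone sketch of how @implicitNotFound produces that kind of compile error (MyEncoder, toFakeDataset, and the message text are made up for illustration; the real annotation lives on org.apache.spark.sql.Encoder):

import scala.annotation.implicitNotFound

// Stand-in for org.apache.spark.sql.Encoder, just to show the mechanism.
@implicitNotFound("Unable to find MyEncoder for type ${T}. Define the case class at the top level and import the implicits so an instance can be derived.")
trait MyEncoder[T]

object ImplicitNotFoundDemo {
  implicit val intEncoder: MyEncoder[Int] = new MyEncoder[Int] {}

  def toFakeDataset[T](data: Seq[T])(implicit enc: MyEncoder[T]): Seq[T] = data

  def main(args: Array[String]): Unit = {
    println(toFakeDataset(Seq(1, 2, 3)))     // compiles: MyEncoder[Int] is in scope
    // println(toFakeDataset(Seq("a", "b"))) // fails to compile with the custom message above
  }
}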

NullPointerException when referencing DataFrame column names with $ method call

Following is a simple word-count Spark app using the DataFrame API and the corresponding unit tests using spark-testing-base. It works if I use the following:
def toWords(linesDf: DataFrame) = {
  linesDf
    .select(linesDf("line"),
      explode(split(linesDf("line"), WhitespaceRegex)).as("word"))
}
But it doesn't work if I use the $ method to reference the columns, as shown below:
def toWords(linesDf: DataFrame) = {
  import spark.implicits._
  linesDf
    .select($"line",
      explode(split($"line", WhitespaceRegex)).as("word"))
}
Error
java.lang.NullPointerException was thrown.
java.lang.NullPointerException
at com.aravind.oss.eg.wordcount.spark.WordCountDFApp$.toWords(WordCountDFApp.scala:42)
at com.aravind.oss.eg.wordcount.spark.WordCountDFAppTestSpec2$$anonfun$1.apply$mcV$sp(WordCountDFAppTestSpec2.scala:32)
at com.aravind.oss.eg.wordcount.spark.WordCountDFAppTestSpec2$$anonfun$1.apply(WordCountDFAppTestSpec2.scala:17)
at com.aravind.oss.eg.wordcount.spark.WordCountDFAppTestSpec2$$anonfun$1.apply(WordCountDFAppTestSpec2.scala:17)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
Spark App
object WordCountDFApp extends App with Logging {
  logInfo("WordCount with Dataframe API")

  val paths = getPaths(args)
  val cluster = getClusterCfg(args)
  if (paths.size > 1) {
    logInfo("More than one file to process")
  }
  logInfo("Path(s): " + paths)
  logInfo("Cluster: " + cluster)

  val spark = getSparkSession("WordCountDFApp", cluster)
  val linesDf: DataFrame = spark.read
    .textFile(paths: _*)
    .toDF("line") //Dataset[Row]
  logInfo("DataFrame before splitting line")
  linesDf.show(false)

  import spark.implicits._
  import org.apache.spark.sql.functions._

  val wordsDf = toWords(linesDf)
  logInfo("Inferred schema")
  wordsDf.printSchema()
  logInfo("DataFrame after splitting the line into words")
  wordsDf.show(false)
  countWords(wordsDf).show(false)

  def toWords(linesDf: DataFrame) = {
    linesDf
      .select(linesDf("line"),
        explode(split(linesDf("line"), WhitespaceRegex)).as("word"))
  }
}
Test
class WordCountDFAppTestSpec2 extends FlatSpec with DataFrameSuiteBase {
  val input: Seq[String] = Seq(
    ("one"),
    ("two"),
    (""),
    ("three Three")
  )

  "toWords" should "split the file into words" in {
    val sqlCtx = sqlContext
    import sqlCtx.implicits._
    val sourceDf = input.toDF("line")
    // sourceDf.show(false)

    val expectedDF = Seq(
      ("one", "one"),
      ("two", "two"),
      ("", ""),
      ("three Three", "three"),
      ("three Three", "Three")
    ).toDF("line", "word")
    // expectedDF.show(false)

    val actualDF = WordCountDFApp.toWords(sourceDf)
    // actualDF.show(false)

    assertDataFrameEquals(actualDF, expectedDF)
  }
}
The main problem is that the implicits are not in scope at runtime; you need to add this line:
import linesDf.sparkSession.implicits._
in your method, e.g.:
def toWords(linesDf: DataFrame) = {
  import linesDf.sparkSession.implicits._
  linesDf
    .select($"line",
      explode(split(linesDf("line"), WhitespaceRegex)).as("word"))
}
and that will fix the problem.
You should import sqlContext.implicits to access $ (the dollar-sign column syntax) in your code:
import spark.sqlContext.implicits._
So your full imports look like this:
import spark.implicits._
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._

No implicits found for parameters i0: TypedColumn.Exists

I am trying out the frameless library for Scala and getting a "No implicits found for parameters i0: TypedColumn.Exists" error. If you can help me resolve it, that would be awesome.
I am using Spark 2.4.0 and frameless 0.8.0.
Following is my code:
import org.apache.spark.sql.SparkSession
import frameless.TypedDataset

object TestSpark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("Spark Test")
      .getOrCreate

    import spark.implicits._

    val empDS = spark.read
      .option("header", true)
      .option("delimiter", ",")
      .csv("emp.csv")
      .as[Emp]

    val empTyDS = TypedDataset.create(empDS)

    import frameless.syntax._
    empTyDS.show(10, false).run

    val deptCol = empTyDS('dept) // Get the error here.
  }
}
The case class for the code is:
case class Emp(
  name: String,
  dept: String,
  manager: String,
  salary: String
)

Converting error with RDD operation in Scala

I am new to Scala and ran into an error while doing some practice.
I tried to convert an RDD into a DataFrame; the following is my code.
package com.sclee.examples

import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object App {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("examples").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    case class Person(name: String, age: Long)

    val personRDD = sc.makeRDD(Seq(Person("A", 10), Person("B", 20)))
    val df = personRDD.map({
      case Row(val1: String, val2: Long) => Person(val1, val2)
    }).toDS()
    // val ds = personRDD.toDS()
  }
}
I followed the instructions in the Spark documentation and also referenced some blogs showing how to convert an RDD into a DataFrame, but I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
I tried to fix the problem by myself but failed. Any help will be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

object SparkTest {
  def main(args: Array[String]): Unit = {
    // use the SparkSession of Spark 2
    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    import spark.implicits._

    // this is your RDD - just a sample of how to create an RDD
    val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A", 10), Person("B", 20)))

    // the SparkSession has a method to convert an RDD to a Dataset
    val ds = spark.createDataset(personRDD)
    println(ds.count())
  }
}
I made the following changes:
- used SparkSession instead of SparkContext and SQLContext
- moved the Person class out of the App object (I'm not sure why I had to do this)
- used createDataset for the conversion
However, I guess it's pretty uncommon to do this conversion, and you probably want to read your input directly into a Dataset using the read method, as sketched below.
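A minimal sketch of that read-based approach (the object name ReadDirect, the file people.csv, and its "name,age" header are assumptions for illustration):

import org.apache.spark.sql.SparkSession

// Top-level case class so the implicit Encoder can be derived.
case class Person(name: String, age: Long)

object ReadDirect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("read-direct")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical CSV with a header line "name,age".
    val people = spark.read
      .option("header", "true")
      .option("inferSchema", "true") // so age is read as a numeric type
      .csv("people.csv")
      .as[Person]

    people.show()
    spark.stop()
  }
}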

Why is the error "Unable to find encoder for type stored in a Dataset" when encoding JSON using case classes?

I've written a Spark job:
object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val ctx = new org.apache.spark.sql.SQLContext(sc)
    import ctx.implicits._

    case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)
    case class Person2(name: String, age: Long, city: String)

    val persons = ctx.read.json("/tmp/persons.json").as[Person]
    persons.printSchema()
  }
}
When I run the main function in the IDE, 2 errors occur:
Error:(15, 67) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val persons = ctx.read.json("/tmp/persons.json").as[Person]
^
Error:(15, 67) not enough arguments for method as: (implicit evidence$1: org.apache.spark.sql.Encoder[Person])org.apache.spark.sql.Dataset[Person].
Unspecified value parameter evidence$1.
val persons = ctx.read.json("/tmp/persons.json").as[Person]
^
But in the Spark shell I can run this job without any error. What is the problem?
The error message says that no Encoder can be found for the Person case class.
Error:(15, 67) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
Move the declaration of the case class outside the scope of SimpleApp.
You get the same error if you import both sqlContext.implicits._ and spark.implicits._ in SimpleApp (the order doesn't matter).
Removing one or the other is the solution:
val spark = SparkSession
  .builder()
  .getOrCreate()
val sqlContext = spark.sqlContext

import sqlContext.implicits._ //sqlContext OR spark implicits
//import spark.implicits._ //sqlContext OR spark implicits

case class Person(age: Long, city: String)

val persons = spark.read.json("/tmp/persons.json").as[Person]
Tested with Spark 2.1.0
The funny thing is that if you import the same object's implicits twice, you will not have any problems.
@Milad Khajavi: Define the Person case classes outside object SimpleApp.
Also, add import sqlContext.implicits._ inside the main() function.
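Putting those suggestions together, a minimal corrected version of the job could look like this (a sketch that keeps the SQLContext style of the original question and assumes the same /tmp/persons.json input):

import org.apache.spark.{SparkConf, SparkContext}

// Case class moved to the top level so its Encoder can be derived.
case class Person(age: Long, city: String, id: String, lname: String, name: String, sex: String)

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
    val sc = new SparkContext(conf)
    val ctx = new org.apache.spark.sql.SQLContext(sc)
    import ctx.implicits._

    // The implicit Encoder[Person] is now found because Person is a top-level case class.
    val persons = ctx.read.json("/tmp/persons.json").as[Person]
    persons.printSchema()
    persons.show()
  }
}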