How to create a Dataset from a CSV that has no header and more than 150 columns using Scala Spark

I have a CSV that I need to read as a Dataset. The CSV has 140 columns and no header.
I created a schema with StructType(Seq(StructField(...), StructField(...), ...)) and the code to read it is as follows:
object dataParser {
def getData(inputPath: String, delimeter: String)(implicit spark: SparkSession): Dataset[MyCaseClass] = {
val parsedData: Dataset[MyCaseClass] = spark.read
.option("header", "false")
.option("delimeter", "delimeter")
.option("inferSchema", "true")
.schema(mySchema)
.load(inputPath)
.as[MyCaseClass]
parsedData
}
}
And the case class I created is like:-
case class MycaseClass(
mycaseClass1: MyCaseClass1,
mycaseClass2: MyCaseClass2,
mycaseClass3: MyCaseClass3,
mycaseClass4: MyCaseClass4,
mycaseClass5: MyCaseClass5,
mycaseClass6: MyCaseClass6,
mycaseClass7: MyCaseClass7,
)
MyCaseClass1(
first 20 columns of csv: it's datatypes
)
MyCaseClass2(
next 20 columns of csv: it's datatypes
)
and so on.
But when I'm trying to compile it, it gives me an error as below:-
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
[error] .as[myCaseClass]
I'm calling this from my Scala App as :-
object MyTestApp{
def main(args: Array[String]): Unit ={
implicit val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
import spark.implicits._
run(args)
}
def run(args: Array[String])(implicit spark: SparkSession): Unit = {
val inputPath = args.get("inputData")
val delimeter = Constants.delimeter
val myData = Dataparser.getData(inputPath, delimeter)
}
}
I'm also not sure about the approach, as I'm new to Datasets.
I saw several answers about this issue, but they mostly dealt with a small number of columns that fit in a single case class, and with a header, which makes things a little simpler.
Any help would be really appreciated.

Thanks to all the viewers. I found the issue and am posting the answer here so that others who run into the same problem can resolve it.
I needed to import spark.implicits._ inside the method that calls .as[MyCaseClass], because the import in main is not in scope there:
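object dataParser {
  def getData(inputPath: String, delimiter: String)(implicit spark: SparkSession): Dataset[MyCaseClass] = {
    // Brings the implicit Encoder[MyCaseClass] into scope for .as[MyCaseClass].
    import spark.implicits._
    val parsedData: Dataset[MyCaseClass] = spark.read
      .option("header", "false")
      // The CSV option is named "delimiter"; pass the variable, not a string literal.
      .option("delimiter", delimiter)
      // inferSchema is not needed when an explicit schema is supplied.
      .schema(mySchema)
      // Read as CSV; a bare load() would fall back to the default source (Parquet).
      .csv(inputPath)
      .as[MyCaseClass]
    parsedData
  }
}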
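One further point about the approach: Spark's CSV source reads flat columns only, so a nested case class needs the flat columns regrouped into structs before calling .as[...]. The following is a minimal sketch, not the original poster's code. Part1, Part2, Record, NestedCsvExample and the four column names are hypothetical, shortened stand-ins for the 20-column groups described in the question.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical stand-ins for MyCaseClass1, MyCaseClass2, ... and the outer case class.
case class Part1(id: Int, name: String)
case class Part2(city: String, zip: String)
case class Record(part1: Part1, part2: Part2)

object NestedCsvExample {
  // Flat schema: one field per CSV column, in file order (assumed column names).
  val flatSchema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType),
    StructField("city", StringType),
    StructField("zip", StringType)
  ))

  def read(path: String, delimiter: String)(implicit spark: SparkSession): Dataset[Record] = {
    import spark.implicits._ // provides the Encoder[Record] that .as[Record] needs

    val flat = spark.read
      .option("header", "false")
      .option("delimiter", delimiter)
      .schema(flatSchema)
      .csv(path)

    // Regroup the flat columns into struct columns named after the outer case class fields.
    flat.select(
      struct(col("id"), col("name")).as("part1"),
      struct(col("city"), col("zip")).as("part2")
    ).as[Record]
  }
}
```

With 140 columns the same pattern applies, with one struct(...) per group of 20 columns.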

Related

Functional Programming in Spark/Scala

I am learning more about Scala and Spark but have become stuck on how to structure a function that takes two tables as input. My goal is to condense my code and make more use of functions. I am unsure how to structure the functions when I intend to join the two tables. My code without a function looks like:
val spark = SparkSession
.builder()
.master("local[*]")
.appName("XX1")
.getOrCreate()
val df1 = spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("C:/Users/YYY/Documents/YYY.csv")
// df1: org.apache.spark.sql.DataFrame = [customerID: int, StoreID: int, FirstName: string, Surname: string, dateofbirth: int]
val df2 = spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.load("C:/Users/XXX/Documents/XXX.csv")
df1.printSchema()
df1.createOrReplaceTempView("customerinfo")
df2.createOrReplaceTempView("customerorders")
def innerjoinA(df1: DataFrame, df2:Dataframe): Array[String]={
val innerjoindf= df1.join(df2,"customerId")
}
innerjoin().show()
}
My question is: how do I properly define the function innerjoinA (and why), and how do I call it later in the program? More broadly, what else in this example could I turn into a function?
You could do something like this.
Create one function that builds the SparkSession and another that reads a CSV. Put them in a separate file if other programs need to call them as well.
A plain join does not need its own function. However, you could create one to make the business flow clearer and give it a meaningful name.
import org.apache.spark.sql.{DataFrame, SparkSession}
// getOrCreate() reuses the running session, so this is safe to call repeatedly.
def getSparkSession(): SparkSession =
  SparkSession
    .builder()
    .master("local[*]")
    .appName("XX1")
    .getOrCreate()
// Reads a CSV file with a header row and lets Spark infer the column types.
def readCSV(filePath: String): DataFrame =
  getSparkSession().read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", ",")
    .option("inferSchema", "true")
    .load(filePath)
// A named join makes the business intent explicit.
def getCustomerDetails(customer: DataFrame, details: DataFrame): DataFrame =
  customer.join(details, "customerId")
val xxxDF = readCSV("C:/Users/XXX/Documents/XXX.csv")
val yyyDF = readCSV("C:/Users/XXX/Documents/YYY.csv")
getCustomerDetails(xxxDF, yyyDF).show()
The basic premise of grouping complex transformations and joins into methods is sound. Only you know whether a dedicated innerjoin method makes sense in your use case.
I usually define them as extension methods so I can chain them one after another.
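import org.apache.spark.sql.DataFrame
// Can be declared as an object (import its members) or as a trait (mix it in).
object DataFrameExtensions {
  implicit class JoinDataFrameExtensions(df: DataFrame) {
    def innerJoin(df2: DataFrame): DataFrame = df.join(df2, Seq("ColumnName"))
  }
}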
Later in the code, I import or mix in the methods I want and call them on the DataFrame.
originalDataFrame.innerJoin(toBeJoinedDataFrame).show()
I prefer extension methods, but you can also declare a method of type DataFrame => DataFrame and pass it to the .transform method already defined on the Dataset API.
def innerJoin(df2: DataFrame)(df1: DataFrame): DataFrame = df1.join(df2, Seq("ColumnName"))
val join = innerJoin(toBeJoinedDataFrame) _
originalDataFrame.transform(join).show()

Passing case class into function arguments

Sorry for asking a simple question. I want to pass a case class as a function argument and use it inside the function. So far I have tried TypeTag and ClassTag, but I have not been able to use them properly, or maybe I am not looking in the right place.
The use case is something like this:
case class infoData(colA:Int,colB:String)
case class someOtherData(col1:String,col2:String,col3:Int)
def readCsv[T:???](path:String,passedCaseClass:???): Dataset[???] = {
sqlContext
.read
.option("header", "true")
.csv(path)
.as[passedCaseClass]
}
It will be called something like this:
val infoDf = readCsv("/src/main/info.csv",infoData)
val otherDf = readCsv("/src/main/someOtherData.csv",someOtherData)
There are two things you should pay attention to:
Class names should be in CamelCase, so InfoData.
Once you have bound a type to a Dataset, it is no longer a DataFrame. DataFrame is just a special name for a Dataset of general-purpose Row.
What you need is to ensure that an implicit Encoder instance for your class is in the current scope.
case class InfoData(colA: Int, colB: String)
Encoder instances for primitive types (Int, String, etc) and case classes can be obtained by importing spark.implicits._
def readCsv[T](path: String)(implicit encoder: Encoder[T]): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
Or, you can use context bound,
def readCsv[T: Encoder](path: String): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
Now, you can use it as following,
val spark = ...
import spark.implicits._
def readCsv[T: Encoder](path: String): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
val infoDS = readCsv[InfoData]("/src/main/info.csv")
First change your function definition to:
object t0 {
def readCsv[T] (path: String)(implicit spark: SparkSession, encoder: Encoder[T]): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
}
You don't need any kind of reflection to create a generic readCsv function. The key here is that Spark needs the encoder at compile time, so you can declare it as an implicit parameter and the compiler will supply it.
Because spark.implicits._ provides default encoders for product types (your case classes), calling your function is easy:
case class infoData(colA: Int, colB: String)
case class someOtherData(col1: String, col2: String, col3: Int)
object test {
import t0._
implicit val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
readCsv[infoData]("/tmp")
}
Hope it helps
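One caveat with the readCsv variants above: without inferSchema or an explicit schema, the CSV reader yields only string columns, so .as[T] typically fails with an up-cast error for a case class with non-string fields such as colA: Int. A possible sketch that takes the schema from the encoder itself is shown below. The wrapper object TypedCsv is a hypothetical name, and the approach assumes the CSV columns appear in the same order as the case class fields.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

// Top-level case class, as in the answers above.
case class InfoData(colA: Int, colB: String)

object TypedCsv {
  // The context bound [T: Encoder] is shorthand for an extra implicit Encoder[T] parameter.
  def readCsv[T: Encoder](path: String)(implicit spark: SparkSession): Dataset[T] = {
    // Every Encoder exposes the schema of the type it encodes.
    val schema = implicitly[Encoder[T]].schema
    spark.read
      .option("header", "true")
      .schema(schema) // columns are parsed with the case class field types
      .csv(path)
      .as[T]
  }
}
```

A call then looks like TypedCsv.readCsv[InfoData]("/src/main/info.csv"), with an implicit SparkSession and spark.implicits._ in scope at the call site.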

Unable to find encoder for type stored in a Dataset in Spark Scala

I am trying to execute the following simple example in Spark. However, I am getting below error
"could not find implicit value for evidence parameter of type org.apache.spark.sql.Encoder[mydata]"
How do I fix this?
import org.apache.spark.sql._
import org.apache.spark.ml.clustering._
import org.apache.spark.ml.feature.VectorAssembler
case class mydata(ID: Int,Salary: Int)
object SampleKMeans {
def main(args: Array[String]) = {
val spark = SparkSession.builder
.appName("SampleKMeans")
.master("yarn")
.getOrCreate()
import spark.implicits._
val ds = spark.read
.option("header","true")
.option("inferSchema","true")
.csv("data/mydata.csv")
.as[mydata]
val assembler = new VectorAssembler()
.setInputCols(Array("Salary"))
.setOutputCol("SalaryOut")
val a = assembler.transform(ds)
}
The error went away after I explicitly specified the schema. Thanks, everyone, for helping me out.
val ds = spark.read
// schema accepts a StructType or a DDL string (the String overload needs Spark 2.3+)
.schema("ID INT, Salary INT")
.option("header","true")
.csv("data/mydata.csv").as[mydata]
You need to provide schema information.
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
case class mydata(ID: Int, Salary: Int)
val schema = StructType(Array(
StructField("ID", IntegerType, false),
StructField("Salary", IntegerType, false)))
Put the schema definition inside the main method (keep the case class at the top level).
Your call for reading the CSV then becomes
spark.read.schema(schema).csv("path").as[mydata]
With this, the rest of your code can stay as it is.
Hope this helps!
The example you provided works on Spark 2.2.0. I guess it is not the code you are actually running, but only an example for Stack Overflow.
Check that your case class is a top-level object (and not declared inside a method).

Converting error with RDD operation in Scala

I am new to Scala and I ran into an error while practicing.
I tried to convert an RDD into a DataFrame, and the following is my code.
package com.sclee.examples
import com.sun.org.apache.xalan.internal.xsltc.compiler.util.IntType
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType};
object App {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("examples").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Long)
val personRDD = sc.makeRDD(Seq(Person("A",10),Person("B",20)))
val df = personRDD.map({
case Row(val1: String, val2: Long) => Person(val1,val2)
}).toDS()
// val ds = personRDD.toDS()
}
}
I followed the instructions in the Spark documentation and also consulted some blogs showing how to convert an RDD into a DataFrame, but I got the error below.
Error:(20, 27) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val df = personRDD.map({
I tried to fix the problem myself but failed. Any help will be appreciated.
The following code works:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
case class Person(name: String, age: Long)
object SparkTest {
def main(args: Array[String]): Unit = {
// use the SparkSession of Spark 2
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
import spark.implicits._
// this is your RDD - just a sample of how to create an RDD
val personRDD: RDD[Person] = spark.sparkContext.parallelize(Seq(Person("A",10),Person("B",20)))
// the SparkSession has a method to convert to a Dataset
val ds = spark.createDataset(personRDD)
println(ds.count())
}
}
I made the following changes:
use SparkSession instead of SparkContext and SQLContext
move the Person class out of the App (I'm not sure why I had to do this)
use createDataset for conversion
However, I guess this conversion is pretty uncommon, and you probably want to read your input directly into a Dataset using the read method.
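With Person declared at the top level, the commented-out personRDD.toDS() from the question should also work, because spark.implicits._ adds toDS to RDDs whose element type has an encoder. A small sketch follows. The object name RddToDataset and the method name build are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Declared at the top level so that spark.implicits._ can derive an Encoder for it.
case class Person(name: String, age: Long)

object RddToDataset {
  def build(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._ // adds toDS() to RDD[Person] and supplies Encoder[Person]

    val personRDD: RDD[Person] =
      spark.sparkContext.parallelize(Seq(Person("A", 10L), Person("B", 20L)))

    // Equivalent in effect to spark.createDataset(personRDD).
    personRDD.toDS()
  }
}
```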

Does Spark support multiple output files with Parquet format

The business case is that we'd like to split a big Parquet file into smaller ones, partitioned by a column. We've tested using dataframe.partition("xxx").write(...). It took about 1 hour for 100K records. So we are going to use MapReduce to generate separate Parquet files in different folders. Sample code:
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
key.asInstanceOf[String]+"/aa"
}
object Split {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("SplitTest")
val sc = new SparkContext(conf)
sc.parallelize(List(("w", "www"), ("b", "blog"), ("c", "com"), ("w", "bt")))
.map(value => (value._1, value._2 + "Test"))
.partitionBy(new HashPartitioner(3))//.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass, conf)
.saveAsHadoopFile(args(0), classOf[String], classOf[String],
classOf[RDDMultipleTextOutputFormat])
sc.stop()
}
}
The sample above just generates text files. How can we generate Parquet files with multipleoutputformat?
Spark supports Parquet partitioning since 1.4.0 (1.5+ syntax):
df.write.partitionBy("some")
and bucketing since 2.0.0 (bucketBy takes the number of buckets first; numBuckets here is a placeholder):
df.write.bucketBy(numBuckets, "some")
with an optional sortBy clause.
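To make the partitioning variant concrete, here is a minimal sketch of a complete write. The object name PartitionedParquet, the method writeByColumn and the choice of SaveMode.Overwrite are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object PartitionedParquet {
  // Writes one sub-directory per distinct value of `column`,
  // for example <outputPath>/key=w/ containing that value's Parquet part files.
  def writeByColumn(df: DataFrame, column: String, outputPath: String): Unit =
    df.write
      .mode(SaveMode.Overwrite) // assumption: replace any existing output at outputPath
      .partitionBy(column)
      .parquet(outputPath)
}
```

For the sample data in the question, a DataFrame with columns key and value written with writeByColumn(df, "key", path) would produce the folders key=w, key=b and key=c under path.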