Apache Spark - Generic method for loading csv data to dataset - scala

I would like to write generic method with three input parameters:
filePath - String
schema - ?
case class
So, my idea is to write a method like this:
def load_sms_ds(filePath: String, schemaInfo: ?, cc: ?) = {
val ds = spark.read
.format("csv")
.option("header", "true")
.schema(?)
.option("delimiter",",")
.option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
.load(schemaInfo)
.as[?]
ds
}
and to return dataset depending on a input parameters. I am not sure though what type should parameters schemaInfo and cc be?

First of all I would reccommend reading the spark sql programming guide. This contains some thing that I think will generally help you as you learn spark.
Lets run through the process of reading in a csv file using a case class to define the schema.
First add the varioud imports needed for this example:
import java.io.{File, PrintWriter} // for reading / writing the example data
import org.apache.spark.sql.types.{StringType, StructField} // to define the schema
import org.apache.spark.sql.catalyst.ScalaReflection // used to generate the schema from a case class
import scala.reflect.runtime.universe.TypeTag // used to provide type information of the case class at runtime
import org.apache.spark.sql.Dataset, SparkSession}
import org.apache.spark.sql.Encoder // Used by spark to generate the schema
Define a case class, the different types available can be found here:
case class Example(
stringField : String,
intField : Int,
doubleField : Double
)
Add the method for extracting a schema (StructType) given a case class type as a parameter:
// T : TypeTag means that an implicit value of type TypeTag[T] must be available at the method call site. Scala will automatically generate this for you. See [here][3] for further details.
def schemaOf[T: TypeTag]: StructType = {
ScalaReflection
.schemaFor[T] // this method requires a TypeTag for T
.dataType
.asInstanceOf[StructType] // cast it to a StructType, what spark requires as its Schema
}
Defnie a method to read in a csv file from a path with the schema defined using a case class:
// The implicit Encoder is needed by the `.at` method in order to create the Dataset[T]. The TypeTag is required by the schemaOf[T] call.
def readCSV[T : Encoder : TypeTag](
filePath: String
)(implicit spark : SparkSession) : Dataset[T]= {
spark.read
.option("header", "true")
.option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
.schema(schemaOf[T])
.csv(filePath) // spark provides this more explicit call to read from a csv file by default it uses comma and the separator but this can be changes.
.as[T]
}
Create a sparkSession:
implicit val spark = SparkSession.builder().master("local").getOrCreate()
Write some sample data to a temp file:
val data =
s"""|stringField,intField,doubleField
|hello,1,1.0
|world,2,2.0
|""".stripMargin
val file = File.createTempFile("test",".csv")
val pw = new PrintWriter(file)
pw.write(data)
pw.close()
An example of calling this method:
import spark.implicits._ // so that an implicit Encoder gets pulled in for the case class
val df = readCSV[Example](file.getPath)
df.show()

Related

Fetching a DataFrame into a Case Class instead results instead in reading a Tuple1

Given a case class :
case class ScoringSummary(MatchMethod: String="",
TP: Double=0,
FP: Double=0,
Precision: Double=0,
Recall: Double=0,
F1: Double=0)
We are writing summary records out as:
summaryDf.write.parquet(path)
Later we (attempt to) read the parquet file into a new dataframe:
implicit val generalRowEncoder: Encoder[ScoringSummary] =
org.apache.spark.sql.Encoders.kryo[ScoringSummary]
val summaryDf = spark.read.parquet(path).as[ScoringSummary]
But this fails - for some reason spark believes the contents of the data were Tuple1 instead of ScoringSummary:
Try to map struct<MatchMethod:string,TP:double,FP:double,Precision:double,
Recall:double,F1:double> to Tuple1,
but failed as the number of fields does not line up.;
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer$
.org$apache$spark$sql$catalyst$analysis$Analyzer$
ResolveDeserializer$$fail(Analyzer.scala:2168)
What step / setting is missing/incorrect for the correct translation?
Use import spark.implicits._ instead of registering an Encoder
I had forgotten that it is required to import spark.implicits. The incorrect approach was to add the Encoder: i.e. do not include the following line
implicit val generalRowEncoder: Encoder[ScoringSummary] =
org.apache.spark.sql.Encoders.kryo[ScoringSummary] // Do NOT add this Encoder
Here is the error when removing the Encoder line
Error:(59, 113) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._ Support for serializing
other types will be added in future releases.
val summaryDf = ParquetLoader.loadParquet(sparkEnv,res.state.dfs(ScoringSummaryTag).copy(df=None)).df.get.as[ScoringSummary]
Instead the following code should be added
import spark.implicits._
And then the same code works:
val summaryDf = spark.read.parquet(path).as[ScoringSummary]
As an aside: encoders are not required for case class'es or primitive types: and the above is a case class. kryo becomes handy for complex object types.

Passing case class into function arguments

sorry for asking a simple question. I want to pass a case class to a function argument and I want to use it further inside the function. Till now I have tried this with TypeTag and ClassTag but for some reason, I am unable to properly use it or may be I am not looking at the correct place.
Use cases is something similar to this:
case class infoData(colA:Int,colB:String)
case class someOtherData(col1:String,col2:String,col3:Int)
def readCsv[T:???](path:String,passedCaseClass:???): Dataset[???] = {
sqlContext
.read
.option("header", "true")
.csv(path)
.as[passedCaseClass]
}
It will be called something like this:
val infoDf = readCsv("/src/main/info.csv",infoData)
val otherDf = readCsv("/src/main/someOtherData.csv",someOtherData)
There are two things which you should pay attention to,
class names should be in CamelCase, so InfoData.
Once you have bound a type to a DataSet, its not a DataFrame. DataFrame is a special name for a DataSet of general purpose Row.
What you need is to ensure that your provided class has an implicit instance of corresponding Encoder in current scope.
case class InfoData(colA: Int, colB: String)
Encoder instances for primitive types (Int, String, etc) and case classes can be obtained by importing spark.implicits._
def readCsv[T](path: String)(implicit encoder: Encoder: T): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
Or, you can use context bound,
def readCsv[T: Encoder[T]](path: String): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
Now, you can use it as following,
val spark = ...
import spark.implicits._
def readCsv[T: Encoder[T]](path: String): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
val infoDS = readCsv[InfoData]("/src/main/info.csv")
First change your function definition to:
object t0 {
def readCsv[T] (path: String)(implicit spark: SparkSession, encoder: Encoder[T]): Dataset[T] = {
spark
.read
.option("header", "true")
.csv(path)
.as[T]
}
}
You donĀ“t need to perform any kind of reflection to create a generic readCsv function. The key here is that Spark needs the encoder at compile time. So you can pass it as implicit parameter and the compiler will add it.
Because Spark SQL can deserialize product types(your case classes) including the default encoders, it is easy to call your function like:
case class infoData(colA: Int, colB: String)
case class someOtherData(col1: String, col2: String, col3: Int)
object test {
import t0._
implicit val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
readCsv[infoData]("/tmp")
}
Hope it helps

How to make Scala script infer custom column types in a csv/json schema?

I am trying to infer schema when I load a csv file in my SQLContext using SparkSession. Please note that I do not want to use class here as I am trying to infer the data file schema as soon it is loaded as I do not have any info about the data types or column names of the file before loading it.
Here is what I am trying out in Scala:
package example
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import java.io.File
import org.apache.spark.sql.SparkSession
//import sqlContext.implicits._
object SimpleScalaSpark {
def main(args: Array[String]) {
//val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", "local")
.getOrCreate()
//val etl1Rdd = spark.sparkContext.wholeTextFiles("etl1.json").map(x => x._2)
val jsonTbl = spark.sqlContext.read.format("org.apache.spark.csv")
.option("header", true)
.option("inferSchema", true)
.option("dateFormat","MM/dd/yyyy HH:mm")
.csv("s1.csv")
// print the inferred schema
jsonTbl.printSchema
}
}
I am able to get DateTime, Integer, Double, String as data types for my file. But I want to implement custom data types based on my own regex patterns such as fields like SSN, VIN-ID, PhoneNumber etc. which all have a fixed pattern which can be detected using regex. This would make schema extraction process for me more accurate and precise. For example, suppose I have a column which contains data formed of 5 or more alphabets and 2 or more numbers, I can say that this column is of type ID.
Any ideas on if it is possible to do this using Scala/Spark? Please let me know the implementation part as well if possible or a source to technical documentation.

How to convert RDD[GenericRecord] to dataframe in scala?

I get tweets from kafka topic with Avro (serializer and deserializer).
Then i create a spark consumer which extracts tweets in Dstream of RDD[GenericRecord].
Now i want to convert each rdd to a dataframe to analyse these tweets via SQL.
Any solution to convert RDD[GenericRecord] to dataframe please ?
I spent some time trying to make this work (specially how deserialize the data properly but it looks like you already cover this) ... UPDATED
//Define function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType : SchemaConverters.SchemaType): Row = {
val objectArray = new Array[Any](record.asInstanceOf[GenericRecord].getSchema.getFields.size)
import scala.collection.JavaConversions._
for (field <- record.getSchema.getFields) {
objectArray(field.pos) = record.get(field.pos)
}
new GenericRowWithSchema(objectArray, sqlType.dataType.asInstanceOf[StructType])
}
//Inside your stream foreachRDD
val yourGenericRecordRDD = ...
val schema = new Schema.Parser().parse(...) // your schema
val sqlType = SchemaConverters.toSqlType(new Schema.Parser().parse(strSchema))
var rowRDD = yourGeneircRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD , sqlType.dataType.asInstanceOf[StructType])
As you see, I am using a SchemaConverter to get the dataframe structure from the schema that you used to deserialize (this could be more painful with schema registry). For this you need the following dependency
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>3.2.0</version>
</dependency>
you will need to change your spark version depending on yours.
UPDATE: the code above only works for flat avro schemas.
For nested structures I used something different. You can copy the class SchemaConverters, it has to be inside of com.databricks.spark.avro (it uses some protected classes from the databricks package) or you can try to use the spark-bigquery dependency. The class will not be accessible by default, so you will need to create a class inside a package com.databricks.spark.avro to access the factory method.
package com.databricks.spark.avro
import com.databricks.spark.avro.SchemaConverters.createConverterToSQL
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
class SchemaConverterUtils {
def converterSql(schema : Schema, sqlType : StructType) = {
createConverterToSQL(schema, sqlType)
}
}
After that you should be able to convert the data like
val schema = .. // your schema
val sqlType = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
....
//inside foreach RDD
var genericRecordRDD = deserializeAvroData(rdd)
///
var converter = SchemaConverterUtils.converterSql(schema, sqlType)
...
val rowRdd = genericRecordRDD.flatMap(record => {
Try(converter(record).asInstanceOf[Row]).toOption
})
//To DataFrame
val df = sqlContext.createDataFrame(rowRdd, sqlType)
A combination of https://stackoverflow.com/a/48828303/5957143 and https://stackoverflow.com/a/47267060/5957143 works for me.
I used the following to create MySchemaConversions
package com.databricks.spark.avro
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.DataType
object MySchemaConversions {
def createConverterToSQL(avroSchema: Schema, sparkSchema: DataType): (GenericRecord) => Row =
SchemaConverters.createConverterToSQL(avroSchema, sparkSchema).asInstanceOf[(GenericRecord) => Row]
}
And then I used
val myAvroType = SchemaConverters.toSqlType(schema).dataType
val myAvroRecordConverter = MySchemaConversions.createConverterToSQL(schema, myAvroType)
// unionedResultRdd is unionRDD[GenericRecord]
var rowRDD = unionedResultRdd.map(record => MyObject.myConverter(record, myAvroRecordConverter))
val df = sparkSession.createDataFrame(rowRDD , myAvroType.asInstanceOf[StructType])
The advantage of having myConverter in the object MyObject is that you will not encounter serialization issues (java.io.NotSerializableException).
object MyObject{
def myConverter(record: GenericRecord,
myAvroRecordConverter: (GenericRecord) => Row): Row =
myAvroRecordConverter.apply(record)
}
Even though something like this may help you,
val stream = ...
val dfStream = stream.transform(rdd:RDD[GenericRecord]=>{
val df = rdd.map(_.toSeq)
.map(seq=> Row.fromSeq(seq))
.toDF(col1,col2, ....)
df
})
I'd like to suggest you an alternate approach. With Spark 2.x you can skip the whole process of creating DStreams. Instead, you can do something like this with structured streaming,
val df = ss.readStream
.format("com.databricks.spark.avro")
.load("/path/to/files")
This will give you a single dataframe which you can directly query. Here, ss is the instance of spark session. /path/to/files is the place where all your avro files are being dumped from kafka.
PS: You may need to import spark-avro
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
Hope this helped. Cheers
You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example for converting an RDD of an old DataFrame:
import sqlContext.implicits.
val rdd = oldDF.rdd
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.

How to convert RDD of custom Java class objects to a DataFrame with toDF()?

I am trying to convert a Spark RDD to a Spark SQL dataframe with toDF(). I have used this function successfully many times, but in this case I'm getting a compiler error:
error: value toDF is not a member of org.apache.spark.rdd.RDD[com.example.protobuf.SensorData]
Here is my code below:
// SensorData is an auto-generated class
import com.example.protobuf.SensorData
def loadSensorDataToRdd : RDD[SensorData] = ???
object MyApplication {
def main(argv: Array[String]): Unit = {
val conf = new SparkConf()
conf.setAppName("My application")
conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val sensorDataRdd = loadSensorDataToRdd()
val sensorDataDf = sensorDataRdd.toDF() // <-- CAUSES COMPILER ERROR
}
}
I am guessing that the problem is with the SensorData class, which is a Java class that was auto-generated from a Protocol Buffer. What can I do in order to convert the RDD to a dataframe?
The reason for the compilation error is that there's no Encoder in scope to convert a RDD with com.example.protobuf.SensorData to a Dataset of com.example.protobuf.SensorData.
Encoders (ExpressionEncoders to be exact) are used to convert InternalRow objects into JVM objects according to the schema (usually a case class or a Java bean).
There's a hope you can create an Encoder for the custom Java class using org.apache.spark.sql.Encoders object's bean method.
Creates an encoder for Java Bean of type T.
Something like the following:
import org.apache.spark.sql.Encoders
implicit val SensorDataEncoder = Encoders.bean(classOf[com.example.protobuf.SensorData])
If SensorData uses unsupported types you'll have to map the RDD[SensorData] to an RDD of some simpler type(s), e.g. a tuple of the fields, and only then expect toDF work.