I have been trying to run some experiments on datasets in Zeppelin 0.9 running locally. However, I am running into NPEs when performing operations on Datasets. The same operations seem to work on DataFrames. Here is an example of what is failing:
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
case class Person(firstname: String, middlename: String, lastname: String, id: String, gender: String, salary: Int)
val simpleData = Seq(
  Row("James", "", "Smith", "36636", "M", 3000),
  Row("Michael", "Rose", "", "40288", "M", 4000),
  Row("Robert", "", "Williams", "42114", "M", 4000),
  Row("Maria", "Anne", "Jones", "39192", "F", 4000),
  Row("Jen", "Mary", "Brown", "", "F", -1)
)

val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(simpleData), simpleSchema).as[Person]
df.filter( x => x.firstname == "James").show()
This is the error that I get:
java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.sql.catalyst.encoders.OuterScopes$$anonfun$getOuterScope$1.apply(OuterScopes.scala:70)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance$$anonfun$10.apply(objects.scala:485)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.catalyst.expressions.objects.NewInstance.doGenCode(objects.scala:485)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
at scala.Option.getOrElse(Option.scala:121)
You have to define the Person case class in a different cell from the one that uses as[Person]. When the case class is declared in the same cell as the code that needs its encoder, the Zeppelin Scala interpreter wraps it in an outer object that Spark cannot resolve during code generation (see the OuterScopes.getOuterScope and NewInstance.doGenCode frames in your stack trace), which surfaces as this NullPointerException. Split the code like this:
Cell 1:
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
case class Person(firstname: String, middlename: String, lastname: String, id: String, gender: String, salary: Int)
Cell 2:
val simpleData = Seq(
  Row("James", "", "Smith", "36636", "M", 3000),
  Row("Michael", "Rose", "", "40288", "M", 4000),
  Row("Robert", "", "Williams", "42114", "M", 4000),
  Row("Maria", "Anne", "Jones", "39192", "F", 4000),
  Row("Jen", "Mary", "Brown", "", "F", -1)
)

val simpleSchema = StructType(Array(
  StructField("firstname", StringType, true),
  StructField("middlename", StringType, true),
  StructField("lastname", StringType, true),
  StructField("id", StringType, true),
  StructField("gender", StringType, true),
  StructField("salary", IntegerType, true)
))

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(simpleData), simpleSchema).as[Person]
df.filter( x => x.firstname == "James").show()
Related
I get an execution error when I try to create a schema for a dataframe in Spark Scala. It says:
Exception in thread "main" java.lang.IllegalArgumentException: No support for Spark SQL type DateType
at org.apache.kudu.spark.kudu.SparkUtil$.sparkTypeToKuduType(SparkUtil.scala:81)
at org.apache.kudu.spark.kudu.SparkUtil$.org$apache$kudu$spark$kudu$SparkUtil$$createColumnSchema(SparkUtil.scala:134)
at org.apache.kudu.spark.kudu.SparkUtil$$anonfun$kuduSchema$3.apply(SparkUtil.scala:120)
at org.apache.kudu.spark.kudu.SparkUtil$$anonfun$kuduSchema$3.apply(SparkUtil.scala:119)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.kudu.spark.kudu.SparkUtil$.kuduSchema(SparkUtil.scala:119)
at org.apache.kudu.spark.kudu.KuduContext.createSchema(KuduContext.scala:234)
at org.apache.kudu.spark.kudu.KuduContext.createTable(KuduContext.scala:210)
The code looks like this:
val invoicesSchema = StructType(
  List(
    StructField("id", StringType, false),
    StructField("invoicenumber", StringType, false),
    StructField("invoicedate", DateType, true)
  ))
kuduContext.createTable("invoices", invoicesSchema, Seq("id","invoicenumber"), new CreateTableOptions().setNumReplicas(3).addHashPartitions(List("id").asJava, 6))
How can I use DateType here? StringType and FloatType don't have this issue in the same code.
A workaround, with an example that you will need to tailor, but it gives you the gist of what you need to know, I think:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
import org.apache.spark.sql.functions._
val df = Seq( ("2018-01-01", "2018-01-31", 80)
, ("2018-01-07","2018-01-10", 10)
, ("2018-01-07","2018-01-31", 10)
, ("2018-01-11","2018-01-31", 5)
, ("2018-01-25","2018-01-27", 5)
, ("2018-02-02","2018-02-23", 100)
).toDF("sd","ed","coins")
val schema = List(("sd", "date"), ("ed", "date"), ("coins", "integer"))
val newColumns = schema.map(c => col(c._1).cast(c._2))
val newDF = df.select(newColumns:_*)
newDF.show(false)
...
...
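If it helps, a quick way to confirm the casts took effect (my own check, not Kudu-specific): after the cast, sd and ed should be DateType and coins IntegerType.
newDF.printSchema()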
I have multiple schemas like the one below, with different column names and data types.
I want to generate test/simulated data with a DataFrame in Scala for each schema and save it to a Parquet file.
Below is an example schema (taken from a sample JSON) for which I want to generate data dynamically with dummy values:
val schema1 = StructType(
  List(
    StructField("a", DoubleType, true),
    StructField("aa", StringType, true),
    StructField("p", LongType, true),
    StructField("pp", StringType, true)
  )
)
I need an RDD/DataFrame like the following, with 1000 rows each, whose columns match the schema above.
val data = Seq(
Row(1d, "happy", 1L, "Iam"),
Row(2d, "sad", 2L, "Iam"),
Row(3d, "glad", 3L, "Iam")
)
Basically, there are about 200 datasets like this for which I need to generate data dynamically; writing a separate program for each schema is practically impossible for me.
Please help me with your ideas or an implementation, as I am new to Spark.
Is it possible to generate dynamic data based on schema of different types?
Following @JacekLaskowski's advice, you could generate dynamic data using ScalaCheck generators (Gen) based on the field names and types you are expecting.
It could look like this:
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import org.scalacheck._
import scala.collection.JavaConverters._
val dynamicValues: Map[(String, DataType), Gen[Any]] = Map(
  ("a", DoubleType)  -> Gen.choose(0.0, 100.0),
  ("aa", StringType) -> Gen.oneOf("happy", "sad", "glad"),
  ("p", LongType)    -> Gen.choose(0L, 10L),
  ("pp", StringType) -> Gen.oneOf("Iam", "You're")
)

val schemas = Map(
  "schema1" -> StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("aa", StringType, true),
      StructField("p", LongType, true),
      StructField("pp", StringType, true)
    )),
  "schema2" -> StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("pp", StringType, true),
      StructField("p", LongType, true)
    )
  )
)

val numRecords = 1000

schemas.foreach {
  case (name, schema) =>
    // create a data frame
    spark.createDataFrame(
      // of #numRecords records
      (0 until numRecords).map { _ =>
        // each of them a row
        Row.fromSeq(schema.fields.map { field =>
          // with fields based on the schema's fieldname & type else null
          dynamicValues.get((field.name, field.dataType)).flatMap(_.sample).orNull
        })
      }.asJava, schema)
      // store to parquet
      .write.mode(SaveMode.Overwrite).parquet(name)
}
ScalaCheck is a framework for generating data; you generate raw data based on the schema using your custom generators.
Visit the ScalaCheck documentation.
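If you have ~200 different schemas, keying every generator by (name, type) gets tedious. A small extension I would suggest (my own sketch, same ScalaCheck idea, not from the answer above) is to fall back to a generator keyed by the Spark DataType alone:
import org.apache.spark.sql.types._
import org.scalacheck.Gen

// Fallback generators keyed only by type, used when no (name, type) entry matches.
val byTypeOnly: Map[DataType, Gen[Any]] = Map(
  DoubleType -> Gen.choose(0.0, 100.0),
  LongType   -> Gen.choose(0L, 1000L),
  StringType -> Gen.alphaStr
)

def valueFor(field: StructField, byNameAndType: Map[(String, DataType), Gen[Any]]): Any =
  byNameAndType.get((field.name, field.dataType))
    .orElse(byTypeOnly.get(field.dataType))
    .flatMap(_.sample)
    .orNull
With that, the Row.fromSeq(...) line above becomes Row.fromSeq(schema.fields.map(valueFor(_, dynamicValues))), and each new schema only needs generators for types you have not covered yet.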
You could do something like this:
import org.apache.spark.SparkConf
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.json4s
import org.json4s.JsonAST._
import org.json4s.jackson.JsonMethods._
import scala.util.Random
object Test extends App {

  val structType: StructType = StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("aa", StringType, true),
      StructField("p", LongType, true),
      StructField("pp", StringType, true)
    )
  )

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .config(new SparkConf())
    .getOrCreate()

  import spark.implicits._

  val df = createRandomDF(structType, 1000)

  def createRandomDF(structType: StructType, size: Int, rnd: Random = new Random()): DataFrame = {
    // "0 until size" yields exactly `size` rows ("0 to size" would produce one extra)
    spark.read.schema(structType).json((0 until size).map { _ => compact(randomJson(rnd, structType)) }.toDS())
  }

  def randomJson(rnd: Random, dataType: DataType): JValue = {
    dataType match {
      case v: DoubleType =>
        json4s.JDouble(rnd.nextDouble())
      case v: StringType =>
        JString(rnd.nextString(10))
      case v: IntegerType =>
        JInt(rnd.nextInt())
      case v: LongType =>
        JInt(rnd.nextLong())
      case v: FloatType =>
        JDouble(rnd.nextFloat())
      case v: BooleanType =>
        JBool(rnd.nextBoolean())
      case v: ArrayType =>
        val size = rnd.nextInt(10)
        JArray(
          (0 to size).map(_ => randomJson(rnd, v.elementType)).toList
        )
      case v: StructType =>
        JObject(
          v.fields.flatMap { f =>
            if (f.nullable && rnd.nextBoolean())
              None
            else
              Some(JField(f.name, randomJson(rnd, f.dataType)))
          }.toList
        )
    }
  }
}
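The question also asks to save the result to Parquet; a line you could add inside Test, right after the val df line (the output path "simulated/schema1" is my placeholder):
// Write the simulated data out as Parquet.
df.write.mode(org.apache.spark.sql.SaveMode.Overwrite).parquet("simulated/schema1")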
How do I create/mock a Spark Scala dataframe with a case class nested inside the top level?
root
|-- _id: long (nullable = true)
|-- continent: string (nullable = true)
|-- animalCaseClass: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- gender: string (nullable = true)
I am currently unit testing a function which outputs a dataframe with the above schema. To check equality I used toDF(), which unfortunately gives a schema with nullable = false for "_id" in the mocked dataframe, thus making the test fail (note that the "actual" output from the function has nullable = true for everything).
I also tried creating the mocked dataframe a different way which led to errors: https://pastebin.com/WtxtgMJA
Here is what I tried in this approach:
import org.apache.spark.sql.Encoders
val animalSchema = Encoders.product[AnimalCaseClass].schema
val schema = List(
StructField("_id", LongType, true),
StructField("continent", StringType, true),
StructField("animalCaseClass", animalSchema, true)
)
val data = Seq(Row(12345L, "Asia", AnimalCaseClass("tiger", "male")), Row(12346L, "Asia", AnimalCaseClass("tigress", "female")))
val expected = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
I had to use this approach to make the nullable true for those fields where toDF makes the nullable false by default.
How can I make a dataframe with the same schema as the output of the mocked function, with row values that can also include a case class?
From the logs you provided, you can see that
Caused by: java.lang.RuntimeException: models.AnimalCaseClass is not a valid external type for schema of struct<name:String,gender:String,,... 3 more fields>
which means you are trying to insert an object of type AnimalCaseClass into a column of type struct<name:String,gender:String>; this is caused by wrapping the values in Row objects. Build the data from plain tuples with toDF() instead, and then apply the expected (nullable) schema to the underlying RDD:
import org.apache.spark.SparkConf
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.SparkSession
case class AnimalCaseClass(name: String, gender: String)
object Test extends App {
val conf: SparkConf = new SparkConf()
conf.setAppName("Test")
conf.setMaster("local[2]")
conf.set("spark.sql.test", "")
conf.set(SQLConf.CODEGEN_FALLBACK.key, "false")
val spark: SparkSession = SparkSession.builder().config(conf).enableHiveSupport().getOrCreate()
// ** The relevant part **
import org.apache.spark.sql.Encoders
val animalSchema = Encoders.product[AnimalCaseClass].schema
val expectedSchema: StructType = StructType(Seq(
StructField("_id", LongType, true),
StructField("continent", StringType, true),
StructField("animalCaseClass", animalSchema, true)
))
import spark.implicits._
val data = Seq((12345L, "Asia", AnimalCaseClass("tiger", "male")), (12346L, "Asia", AnimalCaseClass("tigress", "female"))).toDF()
val expected = spark.createDataFrame(data.rdd, expectedSchema)
expected.printSchema()
expected.show()
spark.stop()
}
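With expected built this way, the test comparison could look like the following (the assertion style and the name actual are mine; actual stands for the DataFrame returned by the function under test):
assert(actual.schema == expected.schema)
assert(actual.collect().toSet == expected.collect().toSet)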
I have a 2D list named tuppleSlides in the following format:
List(List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7), List(10,4,2,4,5,2,6,2,5,7))
I have created the following schema:
val schema = StructType(
Array(
StructField("1", IntegerType, true),
StructField("2", IntegerType, true),
StructField("3", IntegerType, true),
StructField("4", IntegerType, true),
StructField("5", IntegerType, true),
StructField("6", IntegerType, true),
StructField("7", IntegerType, true),
StructField("8", IntegerType, true),
StructField("9", IntegerType, true),
StructField("10", IntegerType, true) )
)
and I am creating a dataframe like so:
val tuppleSlidesDF = sparkSession.createDataFrame(tuppleSlides, schema)
but it won't even compile. How am I supposed to do it properly?
Thank you.
You need to convert the 2d list to a RDD[Row] object before creating a data frame:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val rdd = sc.parallelize(tuppleSlides).map(Row.fromSeq(_))
sqlContext.createDataFrame(rdd, schema)
// res7: org.apache.spark.sql.DataFrame = [1: int, 2: int, 3: int, 4: int, 5: int, 6: int, 7: int, 8: int, 9: int, 10: int]
Also note that in Spark 2.x, sqlContext is replaced with spark:
spark.createDataFrame(rdd, schema)
// res1: org.apache.spark.sql.DataFrame = [1: int, 2: int ... 8 more fields]
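Putting it together in Spark 2.x, using the schema from the question (the two sample rows are copied from there):
import org.apache.spark.sql.Row

val tuppleSlides = List(
  List(10, 4, 2, 4, 5, 2, 6, 2, 5, 7),
  List(10, 4, 2, 4, 5, 2, 6, 2, 5, 7)
)
val rdd = spark.sparkContext.parallelize(tuppleSlides).map(Row.fromSeq(_))
val tuppleSlidesDF = spark.createDataFrame(rdd, schema)
tuppleSlidesDF.show()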
I want to create a DataFrame with a specified schema in Scala. I have tried to use a JSON read (I mean reading an empty file), but I don't think that's the best practice.
Let's assume you want a data frame with the following schema:
root
|-- k: string (nullable = true)
|-- v: integer (nullable = false)
You simply define the schema for a data frame and use an empty RDD[Row]:
import org.apache.spark.sql.types.{
StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.Row
val schema = StructType(
StructField("k", StringType, true) ::
StructField("v", IntegerType, false) :: Nil)
// Spark < 2.0
// sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
spark.createDataFrame(sc.emptyRDD[Row], schema)
The PySpark equivalent is almost identical:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
StructField("k", StringType(), True), StructField("v", IntegerType(), False)
])
# or df = sc.parallelize([]).toDF(schema)
# Spark < 2.0
# sqlContext.createDataFrame([], schema)
df = spark.createDataFrame([], schema)
Using implicit encoders (Scala only) with Product types like Tuple:
import spark.implicits._
Seq.empty[(String, Int)].toDF("k", "v")
or case class:
case class KV(k: String, v: Int)
Seq.empty[KV].toDF
or
spark.emptyDataset[KV].toDF
As of Spark 2.0.0, you can do the following.
Case Class
Let's define a Person case class:
scala> case class Person(id: Int, name: String)
defined class Person
Import spark SparkSession implicit Encoders:
scala> import spark.implicits._
import spark.implicits._
And use SparkSession to create an empty Dataset[Person]:
scala> spark.emptyDataset[Person]
res0: org.apache.spark.sql.Dataset[Person] = [id: int, name: string]
Schema DSL
You could also use a Schema "DSL" (see Support functions for DataFrames in org.apache.spark.sql.ColumnName).
scala> val id = $"id".int
id: org.apache.spark.sql.types.StructField = StructField(id,IntegerType,true)
scala> val name = $"name".string
name: org.apache.spark.sql.types.StructField = StructField(name,StringType,true)
scala> import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructType
scala> val mySchema = StructType(id :: name :: Nil)
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(id,IntegerType,true), StructField(name,StringType,true))
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val emptyDF = spark.createDataFrame(sc.emptyRDD[Row], mySchema)
emptyDF: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> emptyDF.printSchema
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
Java version to create an empty Dataset<Row>:
public Dataset<Row> emptyDataSet() {
    SparkSession spark = SparkSession.builder().appName("Simple Application")
            .config("spark.master", "local").getOrCreate();
    Dataset<Row> emptyDataSet = spark.createDataFrame(new ArrayList<>(), getSchema());
    return emptyDataSet;
}

public StructType getSchema() {
    String schemaString = "column1 column2 column3 column4 column5";
    List<StructField> fields = new ArrayList<>();

    StructField indexField = DataTypes.createStructField("column0", DataTypes.LongType, true);
    fields.add(indexField);

    for (String fieldName : schemaString.split(" ")) {
        StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
        fields.add(field);
    }

    StructType schema = DataTypes.createStructType(fields);
    return schema;
}
import scala.reflect.runtime.{universe => ru}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType

def createEmptyDataFrame[T: ru.TypeTag] =
  hiveContext.createDataFrame(
    sc.emptyRDD[Row],
    ScalaReflection.schemaFor(ru.typeTag[T].tpe).dataType.asInstanceOf[StructType]
  )
case class RawData(id: String, firstname: String, lastname: String, age: Int)
val sourceDF = createEmptyDataFrame[RawData]
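In Spark 2.x the same idea can be expressed without the reflection boilerplate by deriving the schema from an encoder (a sketch, using a SparkSession named spark instead of the HiveContext):
import org.apache.spark.sql.{Encoders, Row}

val rawDataSchema = Encoders.product[RawData].schema
val emptyRawDataDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], rawDataSchema)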
Here you can create a schema using StructType in Scala and pass an empty RDD, so you are able to create an empty table.
The following code does the same:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.BooleanType
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.types.StringType
//import org.apache.hadoop.hive.serde2.objectinspector.StructField
object EmptyTable extends App {
val conf = new SparkConf()
val sc = new SparkContext(conf)
//create sparksession object
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
//Created schema for three columns
val schema = StructType(
StructField("Emp_ID", LongType, true) ::
StructField("Emp_Name", StringType, false) ::
StructField("Emp_Salary", LongType, false) :: Nil)
//Created Empty RDD
var dataRDD = sc.emptyRDD[Row]
//pass rdd and schema to create dataframe
val newDFSchema = sparkSession.createDataFrame(dataRDD, schema)
newDFSchema.createOrReplaceTempView("tempSchema")
sparkSession.sql("create table Finaltable AS select * from tempSchema")
}
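As a quick check (my addition, placed just before the closing brace), the new table should come back with the three declared columns and no rows:
// Should print an empty result set with Emp_ID, Emp_Name and Emp_Salary.
sparkSession.sql("SELECT * FROM Finaltable").show()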
This is helpful for testing purposes.
Seq.empty[String].toDF()
Here is a solution that creates an empty dataframe in PySpark 2.0.0 or later.
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

sc = spark.sparkContext
sqlContext = SQLContext(sc)

schema = StructType([StructField('col1', StringType(), False), StructField('col2', IntegerType(), True)])
sqlContext.createDataFrame(sc.emptyRDD(), schema)
I had a special requirement where I already had a dataframe but, given a certain condition, I had to return an empty dataframe, so I returned df.limit(0) instead.
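A minimal sketch of that pattern (the helper name and the returnEmpty flag are mine, purely illustrative):
import org.apache.spark.sql.DataFrame

// df.limit(0) keeps df's schema but drops all rows.
def resultOrEmpty(df: DataFrame, returnEmpty: Boolean): DataFrame =
  if (returnEmpty) df.limit(0) else df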
I'd like to add the following syntax which was not yet mentioned:
Seq[(String, Integer)]().toDF("k", "v")
It makes it clear that the () part is for values. It's empty, so the dataframe is empty.
This syntax is also beneficial for adding null values manually. It just works, while other options either don't or are overly verbose.
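For example (my own illustration of the null case), java.lang.Integer accepts a null where a Scala Int would not:
import spark.implicits._

// The "b" row gets a null in the v column.
val withNull = Seq[(String, Integer)](("a", 1), ("b", null)).toDF("k", "v")
withNull.show()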
As of Spark 2.4.3
val df = SparkSession.builder().getOrCreate().emptyDataFrame