How to generate datasets dynamically based on schema? - scala

I have multiple schemas like the one below, with different column names and data types.
I want to generate test/simulated data with a DataFrame in Scala for each schema and save it to a Parquet file.
Below is an example schema (derived from a sample JSON) for which I want to generate data dynamically with dummy values.
val schema1 = StructType(
  List(
    StructField("a", DoubleType, true),
    StructField("aa", StringType, true),
    StructField("p", LongType, true),
    StructField("pp", StringType, true)
  )
)
I need an RDD/DataFrame like the one below, with 1000 rows each, generated according to the columns in the schema above.
val data = Seq(
  Row(1d, "happy", 1L, "Iam"),
  Row(2d, "sad", 2L, "Iam"),
  Row(3d, "glad", 3L, "Iam")
)
Basically, there are about 200 such datasets for which I need to generate data dynamically; writing a separate program for each schema is not feasible for me.
Please help me with your ideas or an implementation, as I am new to Spark.
Is it possible to generate dynamic data based on schemas of different types?

Following @JacekLaskowski's advice, you could generate dynamic data with ScalaCheck generators (Gen) based on the fields/types you are expecting.
It could look like this:
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SaveMode}
import org.scalacheck._

import scala.collection.JavaConverters._

val dynamicValues: Map[(String, DataType), Gen[Any]] = Map(
  ("a", DoubleType) -> Gen.choose(0.0, 100.0),
  ("aa", StringType) -> Gen.oneOf("happy", "sad", "glad"),
  ("p", LongType) -> Gen.choose(0L, 10L),
  ("pp", StringType) -> Gen.oneOf("Iam", "You're")
)

val schemas = Map(
  "schema1" -> StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("aa", StringType, true),
      StructField("p", LongType, true),
      StructField("pp", StringType, true)
    )),
  "schema2" -> StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("pp", StringType, true),
      StructField("p", LongType, true)
    )
  )
)

val numRecords = 1000

schemas.foreach {
  case (name, schema) =>
    // create a data frame
    spark.createDataFrame(
      // of #numRecords records
      (0 until numRecords).map { _ =>
        // each of them a row
        Row.fromSeq(schema.fields.map(field => {
          // with fields based on the schema's field name & type, else null
          dynamicValues.get((field.name, field.dataType)).flatMap(_.sample).orNull
        }))
      }.asJava, schema)
      // store to parquet
      .write.mode(SaveMode.Overwrite).parquet(name)
}

ScalaCheck is a framework for generating data; you generate raw data based on the schema using your custom generators.
See the ScalaCheck documentation.
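For example, here is a minimal standalone sketch of a few ScalaCheck generators (the names nameGen, scoreGen and idGen are just illustrative) that you could plug into the dynamicValues map above:

import org.scalacheck.Gen

// illustrative generators you might register in the dynamicValues map above
val nameGen: Gen[String]  = Gen.oneOf("alice", "bob", "carol")   // pick from a fixed set of values
val scoreGen: Gen[Double] = Gen.choose(0.0, 1.0)                 // uniform double in a range
val idGen: Gen[Long]      = Gen.choose(1L, 1000000L)             // uniform long in a range

// sample returns an Option, so handle the (rare) None case explicitly
val sampledName: String = nameGen.sample.getOrElse("unknown")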

You could do something like this:
import org.apache.spark.SparkConf
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.json4s
import org.json4s.JsonAST._
import org.json4s.jackson.JsonMethods._

import scala.util.Random

object Test extends App {

  val structType: StructType = StructType(
    List(
      StructField("a", DoubleType, true),
      StructField("aa", StringType, true),
      StructField("p", LongType, true),
      StructField("pp", StringType, true)
    )
  )

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .config(new SparkConf())
    .getOrCreate()

  import spark.implicits._

  val df = createRandomDF(structType, 1000)

  def createRandomDF(structType: StructType, size: Int, rnd: Random = new Random()): DataFrame = {
    // generate `size` random JSON documents and let Spark parse them with the given schema
    spark.read.schema(structType).json((0 until size).map { _ => compact(randomJson(rnd, structType)) }.toDS())
  }

  def randomJson(rnd: Random, dataType: DataType): JValue = {
    dataType match {
      case v: DoubleType =>
        json4s.JDouble(rnd.nextDouble())
      case v: StringType =>
        JString(rnd.nextString(10))
      case v: IntegerType =>
        JInt(rnd.nextInt())
      case v: LongType =>
        JInt(rnd.nextLong())
      case v: FloatType =>
        JDouble(rnd.nextFloat())
      case v: BooleanType =>
        JBool(rnd.nextBoolean())
      case v: ArrayType =>
        val size = rnd.nextInt(10)
        JArray(
          (0 to size).map(_ => randomJson(rnd, v.elementType)).toList
        )
      case v: StructType =>
        JObject(
          v.fields.flatMap {
            f =>
              // randomly omit nullable fields to simulate missing values
              if (f.nullable && rnd.nextBoolean())
                None
              else
                Some(JField(f.name, randomJson(rnd, f.dataType)))
          }.toList
        )
    }
  }
}
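To tie this back to the original requirement of writing a Parquet file per schema, here is a minimal sketch (assuming it is placed inside the Test object above so that createRandomDF is in scope; the allSchemas map and the output/ path are illustrative):

// illustrative: generate 1000 rows per schema and write each dataset to Parquet
val allSchemas = Map("schema1" -> structType) // add your other schemas here

allSchemas.foreach { case (name, schema) =>
  createRandomDF(schema, 1000)
    .write
    .mode("overwrite")
    .parquet(s"output/$name") // illustrative output location
}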

Related

How to refine Spark StructType Schema based on a list of required fields?

I am trying to create a StructType schema from an already existing schema. I have a list of the fields required for the new schema. The tough part is that the schema describes nested JSON data with complex fields, including ArrayType(StructType).
Here is the code for the schema:
val schema1: Seq[StructField] = Seq(
  StructField("playerId", StringType, true),
  StructField("playerName", StringType, true),
  StructField("playerCountry", StringType, true),
  StructField("playerBloodType", StringType, true)
)

val schema2: Seq[StructField] =
  Seq(
    StructField("PlayerHistory", ArrayType(
      StructType(
        Seq(
          StructField("Rating", StringType, true),
          StructField("Height", StringType, true),
          StructField("Weight", StringType, true),
          StructField("CoachDetails",
            StructType(
              Seq(
                StructField("CoachName", StringType, true),
                StructField("Address",
                  StructType(
                    Seq(
                      StructField("AddressLine1", StringType, true),
                      StructField("AddressLine2", StringType, true),
                      StructField("CoachCity", StringType, true))), true),
                StructField("Suffix", StringType, true))), true),
          StructField("GoalHistory", ArrayType(
            StructType(
              Seq(
                StructField("MatchDate", StringType, true),
                StructField("NumberofGoals", StringType, true),
                StructField("SubstitutionIndicator", StringType, true))), true), true),
          StructField("receive_date", DateType, true))
      ), true
    )))

val requiredFields = List("playerId", "playerName", "Rating", "CoachName", "CoachCity", "MatchDate", "NumberofGoals")
val schema: StructType = StructType(schema1 ++ schema2)
The variable schema is the current schema and requiredFields holds the fields we require for the new schema. We also need the parent blocks in the new schema.
The output schema should look somewhat like this:
val outputSchema =
  Seq(
    StructField("playerId", StringType, true),
    StructField("playerName", StringType, true),
    StructField("PlayerHistory",
      ArrayType(StructType(
        StructField("Rating", StringType, true),
        StructField("CoachDetails",
          StructType(
            StructField("CoachName", StringType, true),
            StructField("Address", StructType(
              StructField("CoachCity", StringType, true)), true),
            StructField("GoalHistory", ArrayType(
              StructType(
                StructField("MatchDate", StringType, true),
                StructField("NumberofGoals", StringType, true)), true), true)))
I have tried approaching the problem in a recursive manner with the following piece of code.
schema.fields.map(f => filterSchema(f, requiredFields)).filter(_.name != "")

def filterSchema(field: StructField, requiredColumns: Seq[String]): StructField = {
  field match {
    case StructField(_, inner: StructType, _, _) =>
      StructField(field.name, StructType(inner.fields.map(f => filterSchema(f, requiredColumns))))
    case StructField(_, ArrayType(structType: StructType, _), _, _) =>
      if (requiredColumns.contains(field.name))
        StructField(field.name, ArrayType(StructType(structType.fields.map(f => filterSchema(f, requiredColumns))), true), true)
      else
        StructField("", StringType, true)
    case StructField(_, _, _, _) =>
      if (requiredColumns.contains(field.name)) field else StructField("", StringType, true)
  }
}
However, I am having trouble filtering out the inner StructFields.
I feel the base condition of the recursive function may need some modification.
Any help here would be highly appreciated. Thanks in advance.
Here's how I did it:
class SchemaRefiner(schema: StructType, requiredColumns: Seq[String]) {

  var FINALSCHEMA: Array[StructField] = Array[StructField]()

  private def refine(schematoRefine: StructType, requiredColumns: Seq[String]): Unit = {
    schematoRefine.foreach(f => {
      if (requiredColumns.contains(f.name)) {
        f match {
          // nested struct: recurse and keep only the required inner fields
          case StructField(_, inner: StructType, _, _) =>
            FINALSCHEMA = FINALSCHEMA :+ StructField(f.name, StructType(new SchemaRefiner(inner, requiredColumns).getRefinedSchema), true)
          // array of struct: recurse into the element type
          case StructField(_, ArrayType(structType: StructType, _), _, _) =>
            FINALSCHEMA = FINALSCHEMA :+ StructField(f.name, ArrayType(StructType(new SchemaRefiner(structType, requiredColumns).getRefinedSchema)), true)
          // simple field that is in the required list
          case StructField(_, _, _, _) =>
            FINALSCHEMA = FINALSCHEMA :+ f
        }
      }
    })
  }

  def getRefinedSchema: Array[StructField] = {
    refine(schema, requiredColumns)
    this.FINALSCHEMA
  }
}
This iterates through the StructFields, and each time a new StructType is encountered the function is called recursively to build a new StructType.
val fields = new SchemaRefiner(schema,requiredFields)
val newSchema = fields.getRefinedSchema
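As a small follow-up usage sketch (nothing beyond the code above is assumed), the returned Array[StructField] can be wrapped back into a StructType and inspected:

val refinedSchema: StructType = StructType(newSchema) // wrap the refined fields back into a StructType
println(refinedSchema.treeString)                     // print the nested schema for a quick sanity check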

Spark Scala DateType schema execution error

I get an execution error when I try to create a schema for a DataFrame in Spark Scala. The error says:
Exception in thread "main" java.lang.IllegalArgumentException: No support for Spark SQL type DateType
at org.apache.kudu.spark.kudu.SparkUtil$.sparkTypeToKuduType(SparkUtil.scala:81)
at org.apache.kudu.spark.kudu.SparkUtil$.org$apache$kudu$spark$kudu$SparkUtil$$createColumnSchema(SparkUtil.scala:134)
at org.apache.kudu.spark.kudu.SparkUtil$$anonfun$kuduSchema$3.apply(SparkUtil.scala:120)
at org.apache.kudu.spark.kudu.SparkUtil$$anonfun$kuduSchema$3.apply(SparkUtil.scala:119)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.kudu.spark.kudu.SparkUtil$.kuduSchema(SparkUtil.scala:119)
at org.apache.kudu.spark.kudu.KuduContext.createSchema(KuduContext.scala:234)
at org.apache.kudu.spark.kudu.KuduContext.createTable(KuduContext.scala:210)
where the code is like:
val invoicesSchema = StructType(
  List(
    StructField("id", StringType, false),
    StructField("invoicenumber", StringType, false),
    StructField("invoicedate", DateType, true)
  ))

kuduContext.createTable("invoices", invoicesSchema, Seq("id", "invoicenumber"),
  new CreateTableOptions().setNumReplicas(3).addHashPartitions(List("id").asJava, 6))
How can I use DateType here? StringType and FloatType don't have this issue in the same code.
A workaround, as I call it, with an example that you will need to tailor, but it gives you the gist of what you need to know:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
import org.apache.spark.sql.functions._

val df = Seq( ("2018-01-01", "2018-01-31", 80)
            , ("2018-01-07", "2018-01-10", 10)
            , ("2018-01-07", "2018-01-31", 10)
            , ("2018-01-11", "2018-01-31", 5)
            , ("2018-01-25", "2018-01-27", 5)
            , ("2018-02-02", "2018-02-23", 100)
            ).toDF("sd", "ed", "coins")

val schema = List(("sd", "date"), ("ed", "date"), ("coins", "integer"))
val newColumns = schema.map(c => col(c._1).cast(c._2))
val newDF = df.select(newColumns:_*)
newDF.show(false)
...
...
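For the Kudu case itself, the stack trace shows Kudu's Spark integration rejecting DateType when it builds the table schema. Below is a minimal sketch of one workaround, assuming your Kudu version supports timestamp columns (otherwise fall back to StringType and store the date as text):

import org.apache.spark.sql.types.{StructType, StructField, StringType, TimestampType}

// same table definition as in the question, with DateType swapped for a supported type
// (assumption: your Kudu version accepts TimestampType; otherwise use StringType)
val invoicesSchemaKudu = StructType(
  List(
    StructField("id", StringType, false),
    StructField("invoicenumber", StringType, false),
    StructField("invoicedate", TimestampType, true)
  ))

// before writing, cast the DataFrame column to match, e.g.:
// df.withColumn("invoicedate", col("invoicedate").cast("timestamp"))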

Define StructType as input datatype of a Function Spark-Scala 2.11 [duplicate]

This question already has an answer here: Defining a UDF that accepts an Array of objects in a Spark DataFrame? (1 answer). Closed 3 years ago.
I'm trying to write a Spark UDF in Scala, and I need to define the function's input datatype.
I have a schema variable of type StructType, shown below.
import org.apache.spark.sql.types._
val relationsSchema = StructType(
  Seq(
    StructField("relation", ArrayType(
      StructType(Seq(
        StructField("attribute", StringType, true),
        StructField("email", StringType, true),
        StructField("fname", StringType, true),
        StructField("lname", StringType, true)
      )
      ), true
    ), true)
  )
)
I'm trying to write a function like the one below:
val relationsFunc: Array[Map[String,String]] => Array[String] = _.map(do something)
val relationUDF = udf(relationsFunc)
input.withColumn("relation",relationUDF(col("relation")))
The above code throws the exception below:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(relation)' due to data type mismatch: argument 1 requires array<map<string,string>> type, however, '`relation`' is of array<struct<attribute:string,email:string,fname:string,lname:string>> type.;;
'Project [relation#89, UDF(relation#89) AS proc#273]
If I give the input type as
val relationsFunc: StructType => Array[String] =
I'm not able to implement the logic, as _.map gives me metadata, field names, etc.
Please advise how to use relationsSchema as the input datatype in the function below.
val relationsFunc: ? => Array[String] = _.map(somelogic)
Your structure under relation is a Row, so your function should have the following signature:
val relationsFunc: Array[Row] => Array[String]
Then you can access your data either by position or by name, i.e.:
{ r: Row => r.getAs[String]("email") }
Check the mapping table in the documentation to determine the data type representations between Spark SQL and Scala: https://spark.apache.org/docs/2.4.4/sql-reference.html#data-types
Your relation field is a Spark SQL complex type of type StructType, which is represented by the Scala type org.apache.spark.sql.Row, so this is the input type you should be using.
I used your code to create this complete working example that extracts email values:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val relationsSchema = StructType(
  Seq(
    StructField("relation", ArrayType(
      StructType(
        Seq(
          StructField("attribute", StringType, true),
          StructField("email", StringType, true),
          StructField("fname", StringType, true),
          StructField("lname", StringType, true)
        )
      ), true
    ), true)
  )
)

// one record whose "relation" column is an array containing a single struct,
// matching the schema above
val data = Seq(
  Row(Seq(Row("1", "johnny@example.com", "Johnny", "Appleseed")))
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  relationsSchema
)

val relationsFunc = (relation: Array[Row]) => relation.map(_.getAs[String]("email"))
val relationUdf = udf(relationsFunc)

df.withColumn("relation", relationUdf(col("relation")))
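One hedged follow-up: to verify end-to-end you need an action, and depending on your Spark version the array column may be handed to the UDF as a Seq[Row] rather than an Array[Row] (this is an assumption to check against your version; if you hit a ClassCastException, the Seq signature below is the usual fix):

// a minimal sketch: the same extraction with a Seq[Row] parameter, then force evaluation
val relationsFuncSeq = (relation: Seq[Row]) => relation.map(_.getAs[String]("email"))
val relationUdfSeq = udf(relationsFuncSeq)

df.withColumn("relation", relationUdfSeq(col("relation"))).show(false)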

Converting RDD into Dataframe

I am new to Spark/Scala.
I have created the RDD below by loading data from multiple paths. Now I want to create a DataFrame from it for further operations.
Below is the schema the DataFrame should have:
schema[UserId, EntityId, WebSessionId, ProductId]
rdd.foreach(println)
545456,5615615,DIKFH6545614561456,PR5454564656445454
875643,5485254,JHDSFJD543514KJKJ4
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR5454564656445454
545456,5615615,DIKFH6545614561456,PR54545DSKJD541054
264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515
732543,8765984,UJHSG4240323545144
564574,6276832,KJDXSGFJFS2545DSAS
Will anyone please help me?
I have tried the same by defining a schema class and mapping it against the RDD, but I get the error
"ArrayIndexOutOfBoundsException: 3"
If you treat your columns as String, you can create the DataFrame with the following:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val rdd: RDD[Row] = ???

val df = spark.createDataFrame(rdd, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))
Note that you must map your RDD to an RDD[Row] for the compiler to let you use the createDataFrame method. For the missing fields you can declare the columns as nullable in the DataFrame schema.
In your example you are using spark.sparkContext.textFile(), which returns an RDD[String]; that means each element of your RDD is a line. But you need an RDD[Row], so you have to split each line by commas, like this:
val list =
  List("545456,5615615,DIKFH6545614561456,PR5454564656445454",
    "875643,5485254,JHDSFJD543514KJKJ4",
    "545456,5615615,DIKFH6545614561456,PR5454564656445454",
    "545456,5615615,DIKFH6545614561456,PR5454564656445454",
    "545456,5615615,DIKFH6545614561456,PR54545DSKJD541054",
    "264264,3254564,MNXZCBMNABC5645SAD,PR5142545564542515",
    "732543,8765984,UJHSG4240323545144", "564574,6276832,KJDXSGFJFS2545DSAS")

val FilterReadClicks = spark.sparkContext.parallelize(list)

val rows: RDD[Row] = FilterReadClicks.map(line => line.split(",")).map { arr =>
  // foldLeft prepends each value, so the fields end up in reverse order
  // (which is why the columns appear reversed in the output below)
  val array = Row.fromSeq(arr.foldLeft(List[Any]())((a, b) => b :: a))
  if (array.length == 4)
    array
  else
    Row.fromSeq(array.toSeq.:+(""))
}

rows.foreach(el => println(el.toSeq))

val df = spark.createDataFrame(rows, StructType(Seq(
  StructField("userId", StringType, false),
  StructField("EntityId", StringType, false),
  StructField("WebSessionId", StringType, false),
  StructField("ProductId", StringType, true))))

df.show()
+------------------+------------------+------------+---------+
| userId| EntityId|WebSessionId|ProductId|
+------------------+------------------+------------+---------+
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|JHDSFJD543514KJKJ4| 5485254| 875643| |
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|PR5454564656445454|DIKFH6545614561456| 5615615| 545456|
|PR54545DSKJD541054|DIKFH6545614561456| 5615615| 545456|
|PR5142545564542515|MNXZCBMNABC5645SAD| 3254564| 264264|
|UJHSG4240323545144| 8765984| 732543| |
|KJDXSGFJFS2545DSAS| 6276832| 564574| |
+------------------+------------------+------------+---------+
With the rows RDD you will be able to create the DataFrame.
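If you would rather keep the columns in their original order (userId first), here is a minimal alternative sketch, under the assumption that padding short lines with "" before building the Row is acceptable:

val orderedRows: RDD[Row] = FilterReadClicks.map { line =>
  val fields = line.split(",").padTo(4, "") // pad missing trailing fields, keeping the original order
  Row.fromSeq(fields.toSeq)
}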

how to convert VertexRDD to DataFrame

I have a VertexRDD[DenseVector[Double]] and I want to convert it to a DataFrame. I don't understand how to map the values from the DenseVector to new columns in a DataFrame.
I am trying to specify the schema as:
val schemaString = "id prop1 prop2 prop3 prop4 prop5 prop6 prop7"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
I think an option is to convert my VertexRDD - where the breeze.linalg.DenseVector holds all the values - into an RDD[Row], so that I can finally create a DataFrame like:
val myRDD = myvertexRDD.map(f => Row(f._1, f._2.toScalaVector().toSeq))
val mydataframe = SQLContext.createDataFrame(myRDD, schema)
But I get a
scala.MatchError: 20502 (of class java.lang.Long)
Any hint is more than welcome.
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType, DoubleType}

val rows = myvertexRDD.map {
  case (id, v) => Row.fromSeq(id +: v.toArray)
}

val schema = StructType(
  StructField("id", LongType, false) +:
    (1 to 7).map(i => StructField(s"prop$i", DoubleType, false)))

val df = sqlContext.createDataFrame(rows, schema)
Notes:
Declared types have to match actual types. You cannot declare a string and pass a long or a double.
The structure of the row has to match the declared structure. In your case you're trying to create a row with a Long and a Vector[Double] but declare 8 columns.
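If you do want the all-StringType schema from the question, here is a minimal sketch under that assumption: convert every value to a String before building the Row, so the declared StringType columns match the actual data.

import org.apache.spark.sql.types.{StructType, StructField, StringType}

val stringRows = myvertexRDD.map {
  case (id, v) => Row.fromSeq(id.toString +: v.toArray.map(_.toString)) // every value as a String
}

val stringSchema = StructType(
  "id prop1 prop2 prop3 prop4 prop5 prop6 prop7".split(" ")
    .map(name => StructField(name, StringType, true)))

val stringDf = sqlContext.createDataFrame(stringRows, stringSchema)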