Spark: Programmatically creating dataframe schema in scala

I have a smallish dataset that will be the result of a Spark job. I am thinking about converting this dataset to a dataframe for convenience at the end of the job, but have struggled to correctly define the schema. The problem is the last field below (topValues); it is an ArrayBuffer of tuples -- keys and counts.
val innerSchema =
  StructType(
    Array(
      StructField("value", StringType),
      StructField("count", LongType)
    )
  )

val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable = false),
      StructField("index", IntegerType, nullable = false),
      StructField("count", LongType, nullable = false),
      StructField("empties", LongType, nullable = false),
      StructField("nulls", LongType, nullable = false),
      StructField("uniqueValues", LongType, nullable = false),
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", innerSchema)
    )
  )
val result = stats.columnStats.map { c =>
  Row(c._2.name, c._1, c._2.count, c._2.empties, c._2.nulls, c._2.uniqueValues,
    c._2.mean, c._2.min, c._2.max, c._2.topValues.topN)
}

val rdd = sc.parallelize(result.toSeq)
val outputDf = sqlContext.createDataFrame(rdd, outputSchema)
outputDf.show()
The error I'm getting is a MatchError: scala.MatchError: ArrayBuffer((10,2), (20,3), (8,1)) (of class scala.collection.mutable.ArrayBuffer)
When I debug and inspect my objects, I'm seeing this:
rdd: ParallelCollectionRDD[2]
rdd.data: "ArrayBuffer" size = 2
rdd.data(0): [age,2,6,0,0,3,14.666666666666666,8.0,20.0,ArrayBuffer((10,2), (20,3), (8,1))]
rdd.data(1): [gender,3,6,0,0,2,0.0,0.0,0.0,ArrayBuffer((M,4), (F,2))]
It seems to me that I've accurately described the ArrayBuffer of tuples in my innerSchema, but Spark disagrees.
Any idea how I should be defining the schema?

You need an ArrayType to describe an array column; for example:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rdd = sc.parallelize(Array(Row(ArrayBuffer(1, 2, 3, 4))))
val df = sqlContext.createDataFrame(
  rdd,
  StructType(Seq(StructField("arr", ArrayType(IntegerType, false), false)))
)
df.printSchema
root
|-- arr: array (nullable = false)
| |-- element: integer (containsNull = false)
df.show
+------------+
| arr|
+------------+
|[1, 2, 3, 4]|
+------------+

As David pointed out, I needed to use an ArrayType. Spark is happy with this:
val outputSchema =
  StructType(
    Array(
      StructField("name", StringType, nullable = false),
      StructField("index", IntegerType, nullable = false),
      StructField("count", LongType, nullable = false),
      StructField("empties", LongType, nullable = false),
      StructField("nulls", LongType, nullable = false),
      StructField("uniqueValues", LongType, nullable = false),
      StructField("mean", DoubleType),
      StructField("min", DoubleType),
      StructField("max", DoubleType),
      StructField("topValues", ArrayType(StructType(Array(
        StructField("value", StringType),
        StructField("count", LongType)
      ))))
    )
  )
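For reference, a minimal sketch (the values are illustrative, borrowed from the debug output above) of a row that should fit this schema; a Seq or ArrayBuffer of (String, Long) tuples lines up with the array-of-struct topValues column:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.Row

// One row shaped like the "age" column statistics shown earlier
val sampleRow = Row("age", 2, 6L, 0L, 0L, 3L,
  14.666666666666666, 8.0, 20.0,
  ArrayBuffer(("10", 2L), ("20", 3L), ("8", 1L)))

val sampleDf = sqlContext.createDataFrame(sc.parallelize(Seq(sampleRow)), outputSchema)
sampleDf.show(false)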

import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val searchPath = "/path/to/.csv"
val columns = "col1,col2,col3,col4,col5,col6,col7"
val fields = columns.split(",").map(fieldName =>
  StructField(fieldName, StringType, nullable = true))
val customSchema = StructType(fields)

var dfPivot = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(customSchema)
  .load(searchPath)
Loading the data with a custom schema will be much faster than loading it with the default, inferred schema.
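As a rough sketch of the difference (using the same placeholder path and schema as above): schema inference needs an extra pass over the file, which an explicit schema avoids.
// With inference: Spark scans the data once just to work out the column types
val dfInferred = spark.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .load(searchPath)

// With an explicit schema: the inference pass is skipped entirely
val dfExplicit = spark.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "false")
  .schema(customSchema)
  .load(searchPath)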

Related

Creating Schema of JSON type and Reading it using Spark in Scala [Error : cannot resolve jsontostructs]

I have a JSON file like below :
{"Codes":[{"CName":"012","CValue":"XYZ1234","CLevel":"0","msg":"","CType":"event"},{"CName":"013","CValue":"ABC1234","CLevel":"1","msg":"","CType":"event"}}
I wanted to create the schema for this, and if the JSON file is empty ({}) it should be an empty string.
However, this is the output I get from df.show:
[[012, XYZ1234, 0, event, ], [013, ABC1234, 1, event, ]]
I created the schema like below:
val schemaF = ArrayType(
  StructType(
    Array(
      StructField("CName", StringType),
      StructField("CValue", StringType),
      StructField("CLevel", StringType),
      StructField("msg", StringType),
      StructField("CType", StringType)
    )
  )
)
When I tried the following,
val df1 = df.withColumn("Codes", from_json('Codes, schemaF))
it gives an AnalysisException:
org.apache.spark.sql.AnalysisException: cannot resolve
'jsontostructs(Codes)' due to data type mismatch: argument 1
requires string type, however, 'Codes' is of
array<struct<CName:string,CValue:string,CLevel:string,CType:string,msg:string>>
type.;; 'Project [valid#51,
jsontostructs(ArrayType(StructType(StructField(CName,StringType,true),
StructField(CValue,StringType,true),
StructField(CLevel,StringType,true), StructField(msg,StringType,true),
StructField(CType,StringType,true)),true), Codes#8,
Some(America/Bogota)) AS errorCodes#77]
Can someone please tell me why and how to resolve this issue?
val schema =
  StructType(
    Array(
      StructField("CName", StringType),
      StructField("CValue", StringType),
      StructField("CLevel", StringType),
      StructField("msg", StringType),
      StructField("CType", StringType)
    )
  )

val df0 = spark.read.schema(schema).json("/path/to/data.json")
Your schema does not correspond to the JSON file you're trying to read. It's missing the Codes field of array type; it should look like this:
val schema = StructType(
  Array(
    StructField(
      "Codes",
      ArrayType(
        StructType(
          Array(
            StructField("CLevel", StringType, true),
            StructField("CName", StringType, true),
            StructField("CType", StringType, true),
            StructField("CValue", StringType, true),
            StructField("msg", StringType, true)
          )
        ), true),
      true)
  )
)
And you should apply it when reading the JSON, not with the from_json function:
val df = spark.read.schema(schema).json("path/to/json/file")
df.printSchema
//root
// |-- Codes: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- CLevel: string (nullable = true)
// | | |-- CName: string (nullable = true)
// | | |-- CType: string (nullable = true)
// | | |-- CValue: string (nullable = true)
// | | |-- msg: string (nullable = true)
EDIT:
For the follow-up question from the comments, you can use this schema definition:
val schema = StructType(
  Array(
    StructField(
      "Codes",
      ArrayType(
        StructType(
          Array(
            StructField("CLevel", StringType, true),
            StructField("CName", StringType, true),
            StructField("CType", StringType, true),
            StructField("CValue", StringType, true),
            StructField("msg", StringType, true)
          )
        ), true),
      true),
    StructField("lid", StructType(Array(StructField("idNo", StringType, true))), true)
  )
)
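A hedged usage sketch with this extended schema (imports shown for completeness; explode is just one way to flatten the parsed array):
import org.apache.spark.sql.functions.explode
import spark.implicits._

val df = spark.read.schema(schema).json("path/to/json/file")

// One row per element of the Codes array, with the nested lid.idNo alongside
df.select(explode($"Codes").alias("code"), $"lid.idNo")
  .select("code.*", "idNo")
  .show(false)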

How do I specify a schema when loading a csv from S3 in Spark with Scala?

I've googled through multiple syntax variations on Stack Overflow, and none of them are working for me. My code is as follows:
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType};
val schema1 = (new StructType)
.add("PASSENGERID", IntegerType, true)
.add("PCLASS", IntegerType, true)
.add("NAME", IntegerType, true)
.add("SEX", StringType, true)
.add("AGE", DoubleType, true)
.add("SIBSP", IntegerType, true)
.add("PARCH", IntegerType, true)
.add("TICKET", StringType, true)
.add("FARE", DoubleType, true)
.add("CABIN", StringType, true)
.add("EMBARKED", StringType, true)
val schema2 = StructType(
StructField("PASSENGERID", IntegerType, true) ::
StructField("PCLASS", IntegerType, true) ::
StructField("NAME", IntegerType, true) ::
StructField("SEX", StringType, true) ::
StructField("AGE", DoubleType, true) ::
StructField("SIBSP", IntegerType, true) ::
StructField("PARCH", IntegerType, true) ::
StructField("TICKET", StringType, true) ::
StructField("FARE", DoubleType, true) ::
StructField("CABIN", StringType, true) ::
StructField("EMBARKED", StringType, true) :: Nil)
val schema3 = StructType(Array(
StructField("PASSENGERID", IntegerType, true),
StructField("PCLASS", IntegerType, true),
StructField("NAME", IntegerType, true),
StructField("SEX", StringType, true),
StructField("AGE", DoubleType, true),
StructField("SIBSP", IntegerType, true),
StructField("PARCH", IntegerType, true),
StructField("TICKET", StringType, true),
StructField("FARE", DoubleType, true),
StructField("CABIN", StringType, true),
StructField("EMBARKED", StringType, true)))
val schema4 = StructType(Seq(
StructField("PASSENGERID", IntegerType, true),
StructField("PCLASS", IntegerType, true),
StructField("NAME", IntegerType, true),
StructField("SEX", StringType, true),
StructField("AGE", DoubleType, true),
StructField("SIBSP", IntegerType, true),
StructField("PARCH", IntegerType, true),
StructField("TICKET", StringType, true),
StructField("FARE", DoubleType, true),
StructField("CABIN", StringType, true),
StructField("EMBARKED", StringType, true)
))
val schema5 = StructType(
List(
StructField("PASSENGERID", IntegerType, true),
StructField("PCLASS", IntegerType, true),
StructField("NAME", IntegerType, true),
StructField("SEX", StringType, true),
StructField("AGE", DoubleType, true),
StructField("SIBSP", IntegerType, true),
StructField("PARCH", IntegerType, true),
StructField("TICKET", StringType, true),
StructField("FARE", DoubleType, true),
StructField("CABIN", StringType, true),
StructField("EMBARKED", StringType, true)
)
)
/*
val df = spark.read
.option("header", true)
.csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
.schema(schema)
*/
//this works
val df = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df.show(false)
df.printSchema()
//fun errors
val df1 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema1)
val df2 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema2)
val df3 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema3)
val df4 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema4)
val df5 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema5)
The data is the Kaggle Titanic survival set, with the header fields capitalized. I've tried this both as a script submitted to spark-shell and by running the commands in spark-shell manually. spark-shell -i spits out some syntax errors on the dfX reads; if I load any of the schemas manually they seem fine, but the reads all fail with the same error.
scala> val df4 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema4)
<console>:26: error: overloaded method value apply with alternatives:
(fieldIndex: Int)org.apache.spark.sql.types.StructField <and>
(names: Set[String])org.apache.spark.sql.types.StructType <and>
(name: String)org.apache.spark.sql.types.StructField
cannot be applied to (org.apache.spark.sql.types.StructType)
val df4 = spark.read.option("header", true).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv").schema(schema4)
I don't understand what I'm doing wrong. I'm on Spark version 2.4.4 on AWS EMR.
Set the inferSchema option to false so that Spark does not infer the schema while loading the data.
Move your .schema before .csv: .schema(...) is a method on DataFrameReader, not on DataFrame. On a DataFrame, .schema just returns its StructType, and applying (schema4) to that StructType is what produces the overloaded apply error.
Please check the code below.
scala> val df1 = spark.read.option("header", true).option("inferSchema", false).schema(schema1).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df1: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]
scala> val df2 = spark.read.option("header", true).option("inferSchema", false).schema(schema2).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df2: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]
scala> val df3 = spark.read.option("header", true).option("inferSchema", false).schema(schema3).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df3: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]
scala> val df4 = spark.read.option("header", true).option("inferSchema", false).schema(schema4).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df4: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]
scala> val df5 = spark.read.option("header", true).option("inferSchema", false).schema(schema5).csv("s3a://mybucket/ybspark/input/PASSENGERS.csv")
df5: org.apache.spark.sql.DataFrame = [PASSENGERID: int, PCLASS: int ... 9 more fields]

Parsing Event Hub messages using spark streaming

I am trying to parse Azure Event Hub messages generated from Azure blob file events using Spark Streaming and Scala.
import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
object eventhub {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("Event Hub")
      //.config("spark.some.config.option", "some-value")
      .master("local")
      .getOrCreate()

    import spark.implicits._

    // Event hub configurations
    // Replace values below with yours
    val eventHubName = "xxx"
    val eventHubNSConnStr = "Endpoint=xxxxx"
    val connStr = ConnectionStringBuilder(eventHubNSConnStr).setEventHubName(eventHubName).build
    val customEventhubParameters = EventHubsConf(connStr).setMaxEventsPerTrigger(5)

    val incomingStream = spark.readStream.format("eventhubs")
      .options(customEventhubParameters.toMap).load()
    incomingStream.printSchema

    val testSchema = new StructType()
      //.add("offset", StringType)
      //.add("Time", StringType)
      //.add("Timestamp", LongType)
      .add("Body", new ArrayType(new StructType()
        .add("topic", StringType)
        .add("subject", StringType)
        .add("eventType", StringType)
        .add("eventTime", StringType)
        .add("id", StringType)
        .add("data", new StructType()
          .add("api", StringType)
          .add("clientRequestId", StringType)
          .add("requestId", StringType)
          .add("eTag", StringType)
          .add("contentType", StringType)
          .add("contentLength", LongType)
          .add("blobType", StringType)
          .add("url", StringType)
          .add("sequencer", StringType)
          .add("storageDiagnostics", new StructType()
            .add("batchId", StringType)))
        .add("dataVersion", StringType)
        .add("metadataVersion", StringType), false))

    // Event Hub message format is JSON and contains "body" field
    // Body is binary, so you cast it to string to see the actual content of the message
    val messages = incomingStream.select($"body".cast(StringType)).alias("body")
      //.select(explode($"body")).alias("newbody")
      .select(from_json($"body", testSchema)).alias("newbody")
      .select("newbody.*")
/*
Output of val messages = incomingStream.select($"body".cast(StringType)).alias("body")
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|body |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"topic":"A1","subject":"A2","eventType":"A3","eventTime":"2019-07-26T17:00:32.4820786Z","id":"1","data":{"api":"PutBlob","clientRequestId":"A4","requestId":"A5","eTag":"A6","contentType":"A7","contentLength":496,"blobType":"BlockBlob","url":"https://test.blob.core.windows.net/test/20190726125719.csv","sequencer":"1","storageDiagnostics":{"batchId":"1"}},"dataVersion":"","metadataVersion":"1"}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
*/
    messages.writeStream
      .outputMode("append")
      .format("console")
      .option("truncate", false)
      .start()
      .awaitTermination()
  }
}
Here are the structures of the original incoming stream and of "body" after the cast to string:
root
|-- body: binary (nullable = true)
|-- partition: string (nullable = true)
|-- offset: string (nullable = true)
|-- sequenceNumber: long (nullable = true)
|-- enqueuedTime: timestamp (nullable = true)
|-- publisher: string (nullable = true)
|-- partitionKey: string (nullable = true)
|-- properties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- systemProperties: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
root
|-- body: string (nullable = true)
Looking at the output of "body", it feels like an array that needs to be exploded, but the data type of "body" comes out as string, and Spark complains when I try to use the "explode" function.
It is not parsing correctly when I pass the schema, since the column is a string, and I am not sure what exactly the structure should be or how to get the JSON parsed. Currently I get NULL output because the JSON parsing is obviously failing. Any input is appreciated. Thank you for your help.
Based on the output of body printed above, there is no element named Body, which is why it returns null. Please use the modified schema definition below; it should help.
val testSchema = new StructType()
  .add("topic", StringType)
  .add("subject", StringType)
  .add("eventType", StringType)
  .add("eventTime", StringType)
  .add("id", StringType)
  .add("data", new StructType()
    .add("api", StringType)
    .add("clientRequestId", StringType)
    .add("requestId", StringType)
    .add("eTag", StringType)
    .add("contentType", StringType)
    .add("contentLength", LongType)
    .add("blobType", StringType)
    .add("url", StringType)
    .add("sequencer", StringType)
    .add("storageDiagnostics", new StructType()
      .add("batchId", StringType)))
  .add("dataVersion", StringType)
  .add("metadataVersion", StringType)
If your input payload contains more than one object in the array, then from_json with the above schema will return null. If you expect more than one object in the array, the schema below should help.
val testSchema = new ArrayType(new StructType()
  .add("topic", StringType)
  .add("subject", StringType)
  .add("eventType", StringType)
  .add("eventTime", StringType)
  .add("id", StringType)
  .add("data", new StructType()
    .add("api", StringType)
    .add("clientRequestId", StringType)
    .add("requestId", StringType)
    .add("eTag", StringType)
    .add("contentType", StringType)
    .add("contentLength", LongType)
    .add("blobType", StringType)
    .add("url", StringType)
    .add("sequencer", StringType)
    .add("storageDiagnostics", new StructType()
      .add("batchId", StringType)))
  .add("dataVersion", StringType)
  .add("metadataVersion", StringType), false)

StructType from Array

What I need to do?
Create a schema for a DataFrame that should look like this:
root
|-- doubleColumn: double (nullable = false)
|-- longColumn: long (nullable = false)
|-- col0: double (nullable = true)
|-- col1: double (nullable = true)
...
Columns with prefix col can vary in number. Their names are stored in an array ar: Array[String].
My attempt
val schema = StructType(
  StructField("doubleColumn", DoubleType, false) ::
  StructField("longColumn", LongType, false) ::
  ar.map(item => StructField(item, DoubleType, true)) // how to reduce it?
  Nil
)
I have a problem with the commented line (4): I don't know how to pass this array.
There is no need to reduce anything. You can just prepend a list of the known columns:
val schema = StructType(Seq(
  StructField("doubleColumn", DoubleType, false),
  StructField("longColumn", LongType, false)
) ++ ar.map(item => StructField(item, DoubleType, true)))
You might also fold over the array:
ar.foldLeft(StructType(Seq(
  StructField("doubleColumn", DoubleType, false),
  StructField("longColumn", LongType, false)
)))((acc, name) => acc.add(name, DoubleType, true))
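A quick sanity check of either construction, assuming for illustration that ar = Array("col0", "col1"):
import org.apache.spark.sql.types._

val ar = Array("col0", "col1")
val schema = StructType(Seq(
  StructField("doubleColumn", DoubleType, false),
  StructField("longColumn", LongType, false)
) ++ ar.map(item => StructField(item, DoubleType, true)))

schema.printTreeString()
// root
//  |-- doubleColumn: double (nullable = false)
//  |-- longColumn: long (nullable = false)
//  |-- col0: double (nullable = true)
//  |-- col1: double (nullable = true)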

Overloaded method value apply with alternatives:

I am new to Spark and I was trying to define a schema for some JSON data, and I ran into the following error in spark-shell:
<console>:28: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
val schema = StructType(Array(StructField("type", StructType(StructField("name", StringType, true), StructField("version", StringType, true)), true) :: StructField("value", StructType(StructField("answerBlacklistedEntities", StringType, true) :: StructField("answerBlacklistedPhrase", StringType, true) :: StructField("answerEntities", StringType, true) :: StructField("answerText", StringType, true) :: StructField("blacklistReason", StringType, true) :: StructField("blacklistedDomains", StringType, true) :: StructField("blacklistedEntities", ArrayType(StringType, true), true) :: StructField("customerId", StringType, true) :: StructField("impolitePhrase", StringType, true) :: StructField("isResponseBlacklisted", BooleanType, true) :: StructField("queryString", StringType, true) :: StructField("utteranceDomains", StringType, true) :: StructField("utteranceEntities", ArrayType(StringType, true), true) :: StructField("utteranceId", StructType(StructField("identifier", StringType, true)), true)) :: Nil)))
Can anybody guide me to what's going on here? :) I'd really appreciate your help!
This happens because of this:
val schema = StructType(Array(StructField("type",
StructType(StructField("name", StringType, true), ...))
You create a StructType and pass StructFields directly as arguments, while it should receive a sequence (or Array) of StructFields:
val schema = StructType(Array(StructField("type",
StructType(Array(StructField("name", StringType, true), ...)) ...)
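As a minimal sketch of the fix for just the "type" field (the rest of the schema follows the same pattern of wrapping the fields in an Array):
import org.apache.spark.sql.types._

// Wrap the nested StructFields in an Array so StructType gets a single sequence argument
val typeField = StructField("type", StructType(Array(
  StructField("name", StringType, true),
  StructField("version", StringType, true)
)), true)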