I am attempting to add a column containing a List[Annotation] to a Spark DataFrame using the code below (I've reformatted everything so it can be reproduced by copying and pasting directly).
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class Annotation(
  field1: String,
  field2: String,
  field3: Int,
  field4: Float,
  field5: Int,
  field6: List[Mapping]
)

case class Mapping(
  fieldA: String,
  fieldB: String,
  fieldC: String,
  fieldD: String,
  fieldE: String
)

object StructTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val annotationStruct =
      StructType(
        Array(
          StructField("field1", StringType, nullable = true),
          StructField("field2", StringType, nullable = true),
          StructField("field3", IntegerType, nullable = false),
          StructField("field4", FloatType, nullable = false),
          StructField("field5", IntegerType, nullable = false),
          StructField(
            "field6",
            ArrayType(
              StructType(Array(
                StructField("fieldA", StringType, nullable = true),
                StructField("fieldB", StringType, nullable = true),
                StructField("fieldC", StringType, nullable = true),
                StructField("fieldD", StringType, nullable = true),
                StructField("fieldE", StringType, nullable = true)
              ))
            ),
            nullable = true
          )
        )
      )

    val df = List(1).toDF
    val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
    val schema = df.schema
    val newSchema = schema.add("annotations", ArrayType(annotationStruct), false)
    val rdd = df.rdd.map(x => Row.fromSeq(x.toSeq :+ List(annotation)))
    val newDF = spark.createDataFrame(rdd, newSchema)
    newDF.printSchema
    newDF.show
  }
}
However, I'm getting an error when running this code.
Caused by: java.lang.RuntimeException: Annotation is not a valid external type for schema of struct<field1:string,field2:string,field3:int,field4:float,field5:int,field6:array<struct<fieldA:string,fieldB:string,fieldC:string,fieldD:string,fieldE:string>>>
The schema I am passing in (ArrayType(annotationStruct)) appears to be of the incorrect form when creating a DataFrame using createDataFrame, yet it seems to match the schema Spark produces for DataFrames that contain only List[Annotation].
Edit: Example of modifying a DF schema in this fashion with a simple type instead of a case class.
val df = List(1).toDF
spark.createDataFrame(df.rdd.map(x => Row.fromSeq(x.toSeq :+ "moose")), df.schema.add("moose", StringType, false)).show
+-----+-----+
|value|moose|
+-----+-----+
| 1|moose|
+-----+-----+
Edit 2: I've pared this down a bit more. Sadly, I don't have the option of creating a DataFrame directly from the case class, which is why I am trying to mirror it as a StructType using ScalaReflection. In this case, I am not altering a previous schema, just attempting to create a DataFrame from an RDD of Rows that contain lists of my case class. Spark had an issue in 1.6 that affected parsing arrays of structs which may be empty or null - I'm wondering if these are linked.
import org.apache.spark.sql.catalyst.ScalaReflection

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val annotationSchema = ScalaReflection.schemaFor[Annotation].dataType.asInstanceOf[StructType]
val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
val testRDD = spark.sparkContext.parallelize(List(List(annotation))).map(x => Row(x))
val testSchema = StructType(
  Array(StructField("annotations", ArrayType(annotationSchema), false))
)
spark.createDataFrame(testRDD, testSchema).show
If you are concerned with adding a complex column to an existing dataframe, then the following solution should work for you.
val df = List(1).toDF
val annotation = sc.parallelize(List(Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))))
val newDF = df.rdd.zip(annotation).map(x => Merged(x._1.get(0).asInstanceOf[Int], x._2)).toDF
newDF.printSchema
newDF.show(false)
which should give you
root
|-- value: integer (nullable = false)
|-- annotations: struct (nullable = true)
| |-- field1: string (nullable = true)
| |-- field2: string (nullable = true)
| |-- field3: integer (nullable = false)
| |-- field4: float (nullable = false)
| |-- field5: integer (nullable = false)
| |-- field6: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- fieldA: string (nullable = true)
| | | |-- fieldB: string (nullable = true)
| | | |-- fieldC: string (nullable = true)
| | | |-- fieldD: string (nullable = true)
| | | |-- fieldE: string (nullable = true)
+-----+---------------------------------------+
|value|annotations |
+-----+---------------------------------------+
|1 |[1,2,1,0.5,1,WrappedArray([a,b,c,d,e])]|
+-----+---------------------------------------+
The case classes used are the same as yours, with an additional Merged case class:
case class Merged(value : Int, annotations: Annotation)
case class Annotation(field1: String, field2: String, field3: Int, field4: Float, field5: Int, field6: List[Mapping])
case class Mapping(fieldA: String, fieldB: String, fieldC: String, fieldD: String, fieldE: String)
When case classes are used, we don't need to define a schema, and the way columns are created from case classes differs from creating a DataFrame with sqlContext.createDataFrame and an explicit schema.
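If you do need to go through spark.createDataFrame with an RDD[Row] and an explicit schema (as in the original snippet), keep in mind that the external type Spark expects for a struct field is Row itself, not a case class instance. A minimal sketch of that variant, reusing the question's df, annotationStruct and newSchema and building the nested values as Rows:
// Sketch: express the annotation as nested Rows so it matches ArrayType(annotationStruct).
val annotationRow = Row("1", "2", 1, 0.5f, 1, List(Row("a", "b", "c", "d", "e")))
val rowRDD = df.rdd.map(x => Row.fromSeq(x.toSeq :+ List(annotationRow)))
spark.createDataFrame(rowRDD, newSchema).show(false)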
Related
I have a JSON string that I load into a Spark DataFrame. The JSON string can have between 0 and 3 key-value pairs.
When more than one key-value pair is sent, product_facets.entry is correctly formatted as an array, like below:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":[{"key":"test","value":"success"}, {"key": "test2","value" : "fail"}]}
}}}
I can now use the explode function:
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", explode($"productData.product.product_facets.entry") as "kvPairs")
However, when only one key-value pair is sent, the source JSON string for entry is not formatted as an array with square brackets:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":{"key":"test","value":"success"}}
}}}
The schema for product tag looks like:
| |-- product: struct (nullable = true)
| | |-- product_facets: struct (nullable = true)
| | | |-- entry: string (nullable = true)
| | |-- product_name: string (nullable = true)
How can I change entry into an array of key-value pairs that is compatible with the explode function? My end goal is to pivot the keys into individual columns, and I want to group by after exploding the key-value pairs. I tried using from_json but could not get it to work.
val schema =
  StructType(
    Seq(
      StructField("entry", ArrayType(
        StructType(
          Seq(
            StructField("key", StringType),
            StructField("value", StringType)
          )
        )
      ))
    )
  )
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", from_json($"productData.product.product_facets.entry", schema) as "kvPairsFromJson")
But the above creates a new column kvPairsFromJson that looks like "[]", and using explode does nothing.
Any pointers on what's going on, or whether there is a better way to do this?
I think one approach could be:
1. Create a udf which takes the entry value as a JSON string and converts it to a List(Tuple(K, V)).
2. In the udf, check whether the entry value is an array or not and convert accordingly.
The code below implements this approach:
// one row where entry is an array and one where it is a single object
val ds = Seq("""{"id":1,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":[{"key":"test","value":"success"},{"key":"test2","value":"fail"}]}}}}""", """{"id":2,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":{"key":"test","value":"success"}}}}}""").toDS
val df = spark.read.json(ds)
// Schema used by udf to generate output column
import org.apache.spark.sql.types._
val outputSchema = ArrayType(StructType(Seq(
StructField("key", StringType, false),
StructField("value", StringType, false)
)))
// Converts non-array entry value to array
val toArray = udf((json: String) => {
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule
  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)
  if (!json.startsWith("[")) {
    val jsonMap = jsonMapper.readValue(json, classOf[Map[String, String]])
    List((jsonMap("key"), jsonMap("value")))
  } else {
    jsonMapper.readValue(json, classOf[List[Map[String, String]]]).map(f => (f("key"), f("value")))
  }
}, outputSchema)
val arrayResult = df.select(col("id").as("id"), toArray(col("productData.product.product_facets.entry")).as("entry"))
val arrayExploded = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))
val explodedToCols = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry")).select(col("id"), col("entry.key").as("key"), col("entry.value").as("value"))
Results in:
scala> arrayResult.printSchema
root
|-- id: long (nullable = true)
|-- entry: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = false)
| | |-- value: string (nullable = false)
scala> arrayExploded.printSchema
root
|-- id: long (nullable = true)
|-- entry: struct (nullable = true)
| |-- key: string (nullable = false)
| |-- value: string (nullable = false)
scala> arrayResult.show(false)
+---+--------------------------------+
|id |entry |
+---+--------------------------------+
|1 |[[test, success], [test2, fail]]|
|2 |[[test, success]] |
+---+--------------------------------+
scala> arrayExploded.show(false)
+---+---------------+
|id |entry |
+---+---------------+
|1 |[test, success]|
|1 |[test2, fail] |
|2 |[test, success]|
+---+---------------+
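Since the stated end goal was to pivot the keys into individual columns, a possible follow-up on explodedToCols (a sketch, assuming first() is an acceptable aggregate because each key appears at most once per id) is:
import org.apache.spark.sql.functions.first
// Turn each distinct key into its own column, keeping the corresponding value.
val pivoted = explodedToCols.groupBy("id").pivot("key").agg(first("value"))
pivoted.show(false)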
I am trying to change the datatype of a column present in a dataframe that I am reading from an RDBMS database.
To do that, I got the schema of the dataframe in the below way:
val dataSchema = dataDF.schema
To see the schema of the dataframe, I used the below statement:
println(dataSchema)
Output: StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DecimalType(15,0),true), StructField(creation_date,TimestampType,true), StructField(created_by,DecimalType(15,0),true), StructField(created_by_name,StringType,true), StructField(entered_dr,DecimalType(38,30),true), StructField(entered_cr,DecimalType(38,30),true))
My requirement is to find the DecimalType columns in the above schema and change them to DoubleType.
I can get the column names and datatypes using dataDF.dtypes, but it gives me the datatypes in the format ((columnName1, column datatype), (columnName2, column datatype), ..., (columnNameN, column datatype)).
I have been trying, in vain, to find a way to parse the StructType and change the schema in dataSchema.
Could anyone let me know if there is a way to parse the StructType so that I can change the datatypes as required and get it in the below format?
StructType(StructField(je_header_id,LongType,true), StructField(je_line_num,LongType,true), StructField(last_update_date,TimestampType,true), StructField(last_updated_by,DoubleType,true), StructField(creation_date,TimestampType,true), StructField(created_by,DoubleType,true), StructField(created_by_name,StringType,true), StructField(entered_dr,DoubleType,true), StructField(entered_cr,DoubleType,true))
To modify a DataFrame schema for a specific data type, you can pattern-match against StructField's dataType, as shown below:
import org.apache.spark.sql.types._
val df = Seq(
(1L, BigDecimal(12.34), "a", BigDecimal(10.001)),
(2L, BigDecimal(56.78), "b", BigDecimal(20.002))
).toDF("c1", "c2", "c3", "c4")
val newSchema = df.schema.fields.map {
  case StructField(name, _: DecimalType, nullable, _) =>
    StructField(name, DoubleType, nullable)
  case field => field
}
// newSchema: Array[org.apache.spark.sql.types.StructField] = Array(
// StructField(c1,LongType,false), StructField(c2,DoubleType,true),
// StructField(c3,StringType,true), StructField(c4,DoubleType,true)
// )
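Note that the result above is an Array[StructField]; if an actual StructType is needed (for example to pass to createDataFrame once the underlying data has also been converted to matching external types), it can simply be wrapped, as in this small sketch:
val newStructType = StructType(newSchema)
// newStructType: org.apache.spark.sql.types.StructType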
However, assuming your end goal is to transform the dataset with the column types changed, it would be easier to just collect the columns of the targeted data type and iteratively cast them, like below:
import org.apache.spark.sql.functions._
val df2 = df.dtypes.
collect{ case (dn, dt) if dt.startsWith("DecimalType") => dn }.
foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))
df2.printSchema
// root
// |-- c1: long (nullable = false)
// |-- c2: double (nullable = true)
// |-- c3: string (nullable = true)
// |-- c4: double (nullable = true)
[UPDATE]
Per an additional requirement from the comments, if you want to change the schema only for DecimalType columns with a positive scale, just apply a Regex pattern-match as the guard condition in the collect method:
val pattern = """DecimalType\(\d+,(\d+)\)""".r
val df2 = df.dtypes.
collect{ case (dn, dt) if pattern.findFirstMatchIn(dt).map(_.group(1)).getOrElse("0") != "0" => dn }.
foldLeft(df)((accDF, c) => accDF.withColumn(c, col(c).cast("Double")))
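For reference, the type strings returned by df.dtypes look like "DecimalType(38,18)", so the guard keeps a column only when the captured scale is non-zero. A quick standalone check of the regex (a small sketch):
pattern.findFirstMatchIn("DecimalType(38,18)").map(_.group(1)) // Some("18") -> column gets cast
pattern.findFirstMatchIn("DecimalType(15,0)").map(_.group(1))  // Some("0")  -> column is left as-is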
Here is another way:
data.show(false)
data.printSchema
+----+------------------------+----+----------------------+
|col1|col2 |col3|col4 |
+----+------------------------+----+----------------------+
|1 |0.003200000000000000 |a |23.320000000000000000 |
|2 |78787.990030000000000000|c |343.320000000000000000|
+----+------------------------+----+----------------------+
root
|-- col1: integer (nullable = false)
|-- col2: decimal(38,18) (nullable = true)
|-- col3: string (nullable = true)
|-- col4: decimal(38,18) (nullable = true)
Create a schema that you want:
Example:
val newSchema = StructType(
  Seq(
    StructField("col1", StringType, true),
    StructField("col2", DoubleType, true),
    StructField("col3", StringType, true),
    StructField("col4", DoubleType, true)
  )
)
Cast the columns to the required datatype.
val newDF = data.selectExpr(newSchema.map(
  col => s"CAST(${col.name} AS ${col.dataType.sql}) ${col.name}"
): _*)
newDF.printSchema
root
|-- col1: string (nullable = false)
|-- col2: double (nullable = true)
|-- col3: string (nullable = true)
|-- col4: double (nullable = true)
newDF.show(false)
+----+-----------+----+------+
|col1|col2 |col3|col4 |
+----+-----------+----+------+
|1 |0.0032 |a |23.32 |
|2 |78787.99003|c |343.32|
+----+-----------+----+------+
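For clarity, the map over newSchema simply produces plain SQL cast expressions; printing them (a small sketch) shows exactly what selectExpr receives:
newSchema.map(col => s"CAST(${col.name} AS ${col.dataType.sql}) ${col.name}").foreach(println)
// CAST(col1 AS STRING) col1
// CAST(col2 AS DOUBLE) col2
// CAST(col3 AS STRING) col3
// CAST(col4 AS DOUBLE) col4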
The accepted solution works great, but it is very costly because of the huge overhead of withColumn: the analyzer has to analyze the whole DataFrame for each withColumn call, and with a large number of columns this gets very expensive. I would rather suggest doing this:
val transformedColumns = inputDataDF.dtypes
  .collect {
    case (dn, dt) if dt.startsWith("DecimalType") => (dn, DoubleType)
  }

val transformedDF = inputDataDF.select(transformedColumns.map { fieldType =>
  inputDataDF(fieldType._1).cast(fieldType._2)
}: _*)
For a very small dataset, the withColumn approach took over a minute on my machine, while the select approach took about 100 ms.
You can read more about the cost of withColumn here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
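Note that the select above keeps only the cast decimal columns. If the non-decimal columns should be preserved as well, a variation along the same lines (a sketch; transformedAllColsDF is just an illustrative name) can map over all columns and cast only the decimal ones:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType
// Cast decimal columns to double in place; pass the other columns through unchanged.
val transformedAllColsDF = inputDataDF.select(inputDataDF.dtypes.map {
  case (dn, dt) if dt.startsWith("DecimalType") => col(dn).cast(DoubleType).as(dn)
  case (dn, _) => col(dn)
}: _*)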
I need to create a schema using an existing df field.
Consider this example dataframe
scala> case class prd (a:Int, b:Int)
defined class prd
scala> val df = Seq((Array(prd(10,20),prd(15,30),prd(20,25)))).toDF("items")
df: org.apache.spark.sql.DataFrame = [items: array<struct<a:int,b:int>>]
scala> df.printSchema
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
I need one more field, "items_day1", similar to "items", for df2. Right now I'm doing it like below, which is a workaround:
scala> val df2=df.select('items,'items.as("item_day1"))
df2: org.apache.spark.sql.DataFrame = [items: array<struct<a:int,b:int>>, item_day1: array<struct<a:int,b:int>>]
scala> df2.printSchema
root
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
|-- item_day1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)
scala>
But how can I get that using the df.schema.add() or df.schema.copy() functions?
EDIT1:
I'm trying it like below:
val (a,b) = (df.schema,df.schema) // works
a("items") //works
b.add(a("items").as("items_day1")) //Error..
To add a new field to your DataFrame schema (which is a StructType) with the same structure as an existing field but a different top-level name, you can copy the existing StructField with a modified name, as shown below:
import org.apache.spark.sql.types._
case class prd (a:Int, b:Int)
val df = Seq((Array(prd(10,20), prd(15,30), prd(20,25)))).toDF("items")
val schema = df.schema
// schema: org.apache.spark.sql.types.StructType = StructType(
// StructField(items, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true)
// )
val newSchema = schema.find(_.name == "items") match {
  case Some(field) => schema.add(field.copy(name = "items_day1"))
  case None        => schema
}
// newSchema: org.apache.spark.sql.types.StructType = StructType(
// StructField(items, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true),
// StructField(items_day1, ArrayType(
// StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false)
// ), true), true)
// )
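The copied schema only describes the desired shape; to actually materialize a DataFrame with it, one option (a sketch combining the select workaround for the data with the copied schema for the types) is:
// Duplicate the column to supply the data, then re-apply the copied schema explicitly.
val df2 = spark.createDataFrame(
  df.select($"items", $"items".as("items_day1")).rdd,
  newSchema
)
df2.printSchema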
This question already has answers here: Querying Spark SQL DataFrame with complex types (3 answers).
I have a Spark DataFrame whose content looks like the following JSON:
[{
  "map1": {
    "completed-stages": 1,
    "total-stages": 1
  },
  "rec": "test-plan",
  "status": {
    "state": "SUCCESS"
  }
},
{
  "map2": {
    "completed-stages": 1,
    "total-stages": 1
  },
  "rec": "test-proc",
  "status": {
    "state": "FAILED"
  }
}]
I want to transform it into another DF that looks like the following:
[{"rec": "test-plan", "status": "SUCCESS"}, {"rec": "test-proc", "status": "FAILED"}]
I have written the following code, but it doesn't compile and complains of wrong encoding.
val fdf = DF.map(f => {
  val listCommands = f.get(0).asInstanceOf[WrappedArray[Map[String, Any]]]
  val m = listCommands.map(h => {
    var rec = "none"
    var status = "none"
    if (h.exists("status" == "state" -> _)) {
      status = (h.get("status") match {
        case Some(x) => x.asInstanceOf[HashMap[String, String]].getOrElse("state", "none")
        case _ => "none"
      })
      if (h.contains("rec")) {
        rec = (h.get("rec") match {
          case Some(x: String) => x
          case _ => "none"
        })
      }
    }
    Map("status" -> status, "rec" -> rec)
  })
  val rm = m.flatten
  rm
})
Please suggest the right way.
That's going to be tricky, since the top-level elements of the JSONs are not the same, i.e. you have map1 and map2, and hence the schema is inconsistent. I'd speak to the "data producer" and request a change so that the name of the command is described by a separate element.
Given the schema of the DataFrame is as follows:
scala> commands.printSchema
root
|-- commands: array (nullable = true)
| |-- element: string (containsNull = true)
and the number of elements (rows) in it:
scala> commands.count
res1: Long = 1
You have to explode the commands array of elements first followed by accessing the fields of interest.
// 1. Explode the array
val commandsExploded = commands.select(explode($"commands") as "command")
scala> commandsExploded.count
res2: Long = 2
Let's create the schema of the JSON-encoded records. One could be as follows.
// Note that it accepts map1 and map2 fields
import org.apache.spark.sql.types._
val schema = StructType(
  StructField("map1",
    StructType(
      StructField("completed-stages", LongType, true) ::
      StructField("total-stages", LongType, true) :: Nil), true) ::
  StructField("map2",
    StructType(
      StructField("completed-stages", LongType, true) ::
      StructField("total-stages", LongType, true) :: Nil), true) ::
  StructField("rec", StringType, true) ::
  StructField("status", StructType(
    StructField("state", StringType, true) :: Nil), true) :: Nil)
With that, you should use the from_json standard function, which takes a column of JSON-encoded strings and a schema.
val commands = commandsExploded.select(from_json($"command", schema) as "command")
scala> commands.show(truncate = false)
+-------------------------------+
|command |
+-------------------------------+
|[[1, 1],, test-plan, [SUCCESS]]|
|[, [1, 1], test-proc, [FAILED]]|
+-------------------------------+
Let's have a look at the schema of the commands dataset.
scala> commands.printSchema
root
|-- command: struct (nullable = true)
| |-- map1: struct (nullable = true)
| | |-- completed-stages: long (nullable = true)
| | |-- total-stages: long (nullable = true)
| |-- map2: struct (nullable = true)
| | |-- completed-stages: long (nullable = true)
| | |-- total-stages: long (nullable = true)
| |-- rec: string (nullable = true)
| |-- status: struct (nullable = true)
| | |-- state: string (nullable = true)
The fields of interest, rec and status.state, are nested inside the command struct and are .-accessible.
val recs = commands.select(
$"command.rec" as "rec",
$"command.status.state" as "status")
scala> recs.show
+---------+-------+
| rec| status|
+---------+-------+
|test-plan|SUCCESS|
|test-proc| FAILED|
+---------+-------+
Converting it to a single-record JSON-encoded dataset requires Dataset.toJSON followed by the collect_list standard function.
val result = recs.toJSON.agg(collect_list("value"))
scala> result.show(truncate = false)
+-------------------------------------------------------------------------------+
|collect_list(value) |
+-------------------------------------------------------------------------------+
|[{"rec":"test-plan","status":"SUCCESS"}, {"rec":"test-proc","status":"FAILED"}]|
+-------------------------------------------------------------------------------+
You didn't provide the schema for the df, so the below might not work for you.
I saved the JSON sample in a test.json file and read it with:
val df = spark.read.option("multiLine", true).json("test.json")
In that case, to get the JSON you want, you can just do:
df.select($"rec", $"status.state").write.json("test1.json")
I'm currently trying to extract a database from MongoDB and use Spark to ingest into ElasticSearch with geo_points.
The Mongo database has latitude and longitude values, but ElasticSearch requires them to be casted into the geo_point type.
Is there a way in Spark to copy the lat and lon columns to a new column that is an array or struct?
Any help is appreciated!
I assume you start with some kind of flat schema like this:
root
|-- lat: double (nullable = false)
|-- long: double (nullable = false)
|-- key: string (nullable = false)
First let's create some example data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._
val rdd = sc.parallelize(
  Row(52.23, 21.01, "Warsaw") :: Row(42.30, 9.15, "Corte") :: Nil)

val schema = StructType(
  StructField("lat", DoubleType, false) ::
  StructField("long", DoubleType, false) ::
  StructField("key", StringType, false) :: Nil)
val df = sqlContext.createDataFrame(rdd, schema)
An easy way is to use a udf and a case class:
case class Location(lat: Double, long: Double)
val makeLocation = udf((lat: Double, long: Double) => Location(lat, long))
val dfRes = df.
withColumn("location", makeLocation(col("lat"), col("long"))).
drop("lat").
drop("long")
dfRes.printSchema
and we get
root
|-- key: string (nullable = false)
|-- location: struct (nullable = true)
| |-- lat: double (nullable = false)
| |-- long: double (nullable = false)
A hard way is to transform your data and apply schema afterwards:
val rddRes = df.rdd.
  map { case Row(lat, long, key) => Row(key, Row(lat, long)) }

val schemaRes = StructType(
  StructField("key", StringType, false) ::
  StructField("location", StructType(
    StructField("lat", DoubleType, false) ::
    StructField("long", DoubleType, false) :: Nil
  ), true) :: Nil
)
sqlContext.createDataFrame(rddRes, schemaRes).show
and we get an expected output
+------+-------------+
| key| location|
+------+-------------+
|Warsaw|[52.23,21.01]|
| Corte| [42.3,9.15]|
+------+-------------+
Creating a nested schema from scratch can be tedious, so if you can, I would recommend the first approach. It can be easily extended if you need a more sophisticated structure:
case class Pin(location: Location)
val makePin = udf((lat: Double, long: Double) => Pin(Location(lat, long)))
df.
withColumn("pin", makePin(col("lat"), col("long"))).
drop("lat").
drop("long").
printSchema
and we get expected output:
root
|-- key: string (nullable = false)
|-- pin: struct (nullable = true)
| |-- location: struct (nullable = true)
| | |-- lat: double (nullable = false)
| | |-- long: double (nullable = false)
Unfortunately you have no control over the nullable field, so if it is important for your project you'll have to specify the schema.
Finally, you can use the struct function introduced in 1.4:
import org.apache.spark.sql.functions.struct
df.select($"key", struct($"lat", $"long").alias("location"))
Try this:
import org.apache.spark.sql.functions._
df.registerTempTable("dt")
val dfres = sqlContext.sql("select struct(lat, `long`) as colName from dt")