I am trying to 'lift' the fields of a struct to the top level in a dataframe, as illustrated by this example:
case class A(a1: String, a2: String)
case class B(b1: String, b2: A)
val df = Seq(B("X",A("Y","Z"))).toDF
df.show
+---+-----+
| b1| b2|
+---+-----+
| X|[Y,Z]|
+---+-----+
df.printSchema
root
|-- b1: string (nullable = true)
|-- b2: struct (nullable = true)
| |-- a1: string (nullable = true)
| |-- a2: string (nullable = true)
val lifted = df.withColumn("a1", $"b2.a1").withColumn("a2", $"b2.a2").drop("b2")
lifted.show
+---+---+---+
| b1| a1| a2|
+---+---+---+
| X| Y| Z|
+---+---+---+
lifted.printSchema
root
|-- b1: string (nullable = true)
|-- a1: string (nullable = true)
|-- a2: string (nullable = true)
This works. I would like to create a little utility method which does this for me, probably through pimping DataFrame to enable something like df.lift("b2").
To do this, I think I want a way of obtaining a list of all fields within a Struct. E.g. given "b2" as input, return ["a1","a2"]. How do I do this?
If I understand your question correctly, you want to be able to list the nested fields of column b2.
So you would need to filter on b2, access the StructType of b2 and then map the names of the columns from within the fields (StructField):
import org.apache.spark.sql.types.StructType
val nested_fields = df.schema
  .filter(c => c.name == "b2")
  .flatMap(_.dataType.asInstanceOf[StructType].fields)
  .map(_.name)
// nested_fields: Seq[String] = List(a1, a2)
Actually, you can index the schema by column name and use ".fieldNames.toList" on its StructType:
val nested_fields = df.schema("b2").dataType.asInstanceOf[StructType].fieldNames.toList
It returns a list of Strings. If you want a list of Columns instead, map over the names, e.g. nested_fields.map(n => col(s"b2.$n")).
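Putting the two answers together, a minimal sketch of the df.lift("b2") utility described in the question could look like the following; the implicit class name and the decision to drop the original struct column are my own assumptions:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical enrichment ("pimp") class; the name is illustrative only.
implicit class StructLifter(df: DataFrame) {
  // Promote every field of the given struct column to the top level,
  // then drop the original struct column.
  def lift(structCol: String): DataFrame = {
    val fieldNames = df.schema(structCol).dataType.asInstanceOf[StructType].fieldNames
    fieldNames.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, col(s"$structCol.$name"))
    }.drop(structCol)
  }
}

// Usage: df.lift("b2") then yields the columns b1, a1, a2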
I hope it helps.
Related
I am trying to convert the data types of some columns based on a case class.
val simpleDf = Seq(("James",34,"2006-01-01","true","M",3000.60),
  ("Michael",33,"1980-01-10","true","F",3300.80),
  ("Robert",37,"1995-01-05","false","M",5000.50)
).toDF("firstName","age","jobStartDate","isGraduated","gender","salary")
// Output
simpleDf.printSchema()
root
|-- firstName: string (nullable = true)
|-- age: integer (nullable = false)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = false)
Here I wanted to change the datatype of jobStartDate to Timestamp and isGraduated to Boolean. I am wondering if that conversion is possible using the case class?
I am aware this can be done by casting each column but in my case, I need to map the incoming DF based on a case class defined.
case class empModel(
  firstName: String,
  age: Integer,
  jobStartDate: java.sql.Timestamp,
  isGraduated: Boolean,
  gender: String,
  salary: Double
)
val newDf = simpleDf.as[empModel].toDF
newDf.show(false)
I am getting errors because of the string to timestamp conversion. Is there any workaround?
You can generate the schema from the case class using ScalaReflection:
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[empModel].dataType.asInstanceOf[StructType]
Now, you can pass this schema when you load your files into dataframe.
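For example, reading the files with the generated schema might look like this (the input path is just a placeholder):
// A sketch: apply the schema derived from the case class at read time.
val typedDf = spark.read
  .schema(schema)
  .json("/path/to/employees.json")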
Or, if you prefer to cast some or all of the columns after you read the dataframe, you can iterate over the schema fields and cast each one to its corresponding data type, for example using foldLeft:
val df = schema.fields.foldLeft(simpleDf) {
  (df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))
}
df.printSchema
//root
// |-- firstName: string (nullable = true)
// |-- age: integer (nullable = true)
// |-- jobStartDate: timestamp (nullable = true)
// |-- isGraduated: boolean (nullable = false)
// |-- gender: string (nullable = true)
// |-- salary: double (nullable = false)
df.show
//+---------+---+-------------------+-----------+------+------+
//|firstName|age| jobStartDate|isGraduated|gender|salary|
//+---------+---+-------------------+-----------+------+------+
//| James| 34|2006-01-01 00:00:00| true| M|3000.6|
//| Michael| 33|1980-01-10 00:00:00| true| F|3300.8|
//| Robert| 37|1995-01-05 00:00:00| false| M|5000.5|
//+---------+---+-------------------+-----------+------+------+
I have a dataframe:
+--------------------+------+
|people |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
It's schema:
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Here, root--person is a string. So, I can update this field using udf as:
def updateString = udf((s: String) => {
  "Mr. " + s
})
df.withColumn("person", updateString(col("person"))).select("person").show(false)
output:
+---------+
|person |
+---------+
|Mr. joker|
+---------+
I want to do the same operation on the root--people--person column, which contains an array of person. How do I achieve this using a udf?
def updateArray = udf((arr: Seq[Row]) => ???)
df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)
expected:
+------------------------------+
|people |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve its schema after updating root--people--person.
Expected schema of people:
df.select("people").printSchema()
root
|-- people: struct (nullable = false)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
Thanks,
The problem here is that people is a struct with only 1 field. In your UDF, you need to return a Tuple1 and then further cast the output of your UDF to keep the names correct:
def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))
val newDF = df
  .withColumn("people", updateArray($"people").cast("struct<person:array<string>>"))
newDF.printSchema()
newDF.show()
gives
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
+--------------------+------+
| people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
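As a side note, on Spark 2.4 or later the same result can also be obtained without a UDF by rebuilding the struct with the transform higher-order function; this is only a sketch using the column names from the question:
import org.apache.spark.sql.functions.{expr, struct}

// Rebuild the single-field struct so the schema people.person: array<string> is kept.
val newDF2 = df.withColumn(
  "people",
  struct(expr("transform(people.person, x -> concat('Mr. ', x))").as("person"))
)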
You just need to update your function and everything else remains the same.
Here is the code snippet.
scala> df2.show
+------+------------------+
|people| person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// just the column order is different here
I just updated your function: instead of using Row, I am using Seq[String] here.
scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person |test |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// kept all the columns for testing purposes; you can drop the ones you don't want
Let me know if you want to know more about this.
Let's create data for testing
scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]
scala> data.printSchema
root
|-- people: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Create a UDF for our requirements
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]
scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
Applying the udf
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people |person|dasd |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
You may need to tweak it a bit (though I think hardly any tweaking is required), but this contains most of what you need to solve your problem.
Within the JSON objects I am attempting to process, I am being given a nested StructType where each key represents a specific location, which then contains a currency and price:
|-- id: string (nullable = true)
|-- pricingByCountry: struct (nullable = true)
| |-- regionPrices: struct (nullable = true)
| | |-- AT: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
| | |-- BT: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
| | |-- CL: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
...etc.
and I'd like to explode it so that rather than having a column per country, I can have a row for each country:
+---+--------+---------+------+
| id| country| currency| price|
+---+--------+---------+------+
| 0| AT| EUR| 100|
| 0| BT| NGU| 400|
| 0| CL| PES| 200|
+---+--------+---------+------+
These solutions make sense intuitively: Spark DataFrame exploding a map with the key as a member and Spark scala - Nested StructType conversion to Map, but unfortunately they don't work because I'm passing in a column and not a whole row to be mapped. I don't want to manually map the whole row, just a specific column that contains nested structs. There are several other attributes at the same level as "id" that I'd like to maintain in the structure.
I think it can be done as below:
// JSON test data
val ds = Seq("""{"id":"abcd","pricingByCountry":{"regionPrices":{"AT":{"currency":"EUR","price":100.00},"BT":{"currency":"NGE","price":200.00},"CL":{"currency":"PES","price":300.00}}}}""").toDS
val df = spark.read.json(ds)
// Schema to map udf output
import org.apache.spark.sql.types._

val outputSchema = ArrayType(StructType(Seq(
  StructField("country", StringType, false),
  StructField("currency", StringType, false),
  StructField("price", DoubleType, false)
)))
// UDF takes value of `regionPrices` json string and converts
// it to Array of tuple(country, currency, price)
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val toMap = udf((jsonString: String) => {
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule
  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)
  val jsonMap = jsonMapper.readValue(jsonString, classOf[Map[String, Map[String, Double]]])
  jsonMap.map(f => (f._1, f._2("currency"), f._2("price"))).toSeq
}, outputSchema)
val result = df
  .select(col("id").as("id"), explode(toMap(to_json(col("pricingByCountry.regionPrices")))).as("temp"))
  .select(col("id"), col("temp.country").as("country"), col("temp.currency").as("currency"), col("temp.price").as("price"))
Output will be:
scala> result.show
+----+-------+--------+-----+
| id|country|currency|price|
+----+-------+--------+-----+
|abcd| AT| EUR|100.0|
|abcd| BT| NGE|200.0|
|abcd| CL| PES|300.0|
+----+-------+--------+-----+
I have a column in a dataframe that is an array [always of a single item], that looks like this:
root
|-- emdaNo: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- _value: string (nullable = true)
| | |-- id: string (nullable = true)
I can't for the life of me work out how to get the _value from it into a string...
Assuming x is the dataframe, I've tried:
x.select($"arrayName._value") // Yields ["myStringHere"]
and
x.select($"arrayName[0]._value") // Errors
How do I get a nice string of the value held in _value out, please?
case class Element(_value: String, id: String)
val df = Seq(Array(Element("foo", "bar"))).toDF("emdaNo")
df.select(element_at($"emdaNo._value", 1) as "_value").show()
Output:
+------+
|_value|
+------+
| foo|
+------+
Alternatively (and before Spark 2.4)
df.select($"emdaNo._value"(0))
or
df.select($"emdaNo._value".getItem(0))
I apologize for the verbose title, but I really couldn't come up with something better.
Basically, I have data with the following schema:
|-- id: string (nullable = true)
|-- mainkey: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- price: double (nullable = true)
| | | |-- recordtype: string (nullable = true)
Let me use the following example data:
{"id":1, "mainkey":{"key1":[{"price":0.01,"recordtype":"BID"}],"key2":[{"price":4.3,"recordtype":"FIXED"}],"key3":[{"price":2.0,"recordtype":"BID"}]}}
{"id":2, "mainkey":{"key4":[{"price":2.50,"recordtype":"BID"}],"key5":[{"price":2.4,"recordtype":"BID"}],"key6":[{"price":0.19,"recordtype":"BID"}]}}
For each of the two records above, I want to calculate mean of all prices when the recordtype is "BID". So, for the first record (with "id":1), we have 2 such bids, with prices 0.01 and 2.0, so the mean rounded to 2 decimal places is 1.01. For the second record (with "id":2), there are 3 bids, with prices 2.5, 2.4 and 0.19, and the mean is 1.70. So I want the following output:
+---+---------+
| id|meanvalue|
+---+---------+
| 1| 1.01|
| 2| 1.7|
+---+---------+
The following code does it:
val exSchema = (new StructType().add("id", StringType).add("mainkey", MapType(StringType, new ArrayType(new StructType().add("price", DoubleType).add("recordtype", StringType), true))))
val exJsonDf = spark.read.schema(exSchema).json("file:///data/json_example")
// RecordValue(price: Double, recordtype: String) is a case class defined elsewhere
var explodeExJson = exJsonDf.select($"id", explode($"mainkey")).explode($"value") {
  case Row(recordValue: Seq[Row] @unchecked) => recordValue.map { recordValue =>
    val price = recordValue(0).asInstanceOf[Double]
    val recordtype = recordValue(1).asInstanceOf[String]
    RecordValue(price, recordtype)
  }
}.cache()
val filteredExJson = explodeExJson.filter($"recordtype"==="BID")
val aggExJson = filteredExJson.groupBy("id").agg(round(mean("price"),2).alias("meanvalue"))
The problem is that it uses an "expensive" explode operation and it becomes a problem when I am dealing with lots of data, especially when there can be a lot of keys in the map.
Please let me know if you can think of a simpler solution, using UDFs or otherwise. Please also keep in mind that I am a beginner to Spark, and hence may have missed some stuff that would be obvious to you.
Any help would be really appreciated. Thanks in advance!
If the aggregation is limited to a single Row, a udf will solve this:
import org.apache.spark.util.StatCounter
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Row
val meanPrice = udf((map: Map[String, Seq[Row]]) => {
  val prices = map.values
    .flatMap(x => x)
    .filter(_.getAs[String]("recordtype") == "BID")
    .map(_.getAs[Double]("price"))
  StatCounter(prices).mean
})
df.select($"id", meanPrice($"mainkey"))