Use spark dataframe column value as an alias of another column - scala

Using Spark and Scala, I would like to build a struct and use the value of one column as the alias of another column.
I have this dataframe:
root
|-- type: string (nullable = true)
|-- metadata: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- age: long (nullable = true)
| |-- gender: string (nullable = true)
| |-- country: string (nullable = true)
And I would like to have this:
root
|-- metadata: struct (nullable = true)
| |-- TYPE_VALUE: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- age: long (nullable = true)
| | |-- gender: string (nullable = true)
| | |-- country: string (nullable = true)
On my dataframe I tried struct($"metadata".as($"type".toString())).alias("metadata"), but it doesn't work: it takes the field name instead of the value.

Well that is not going to work, because that would require a dynamic schema that is not known beforehand.
The best you could do is create a mapping out of it:
df.select(
  map('type, 'metadata).as("metadata")
)
With an output like:
+-------------------------------+
|metadata |
+-------------------------------+
|Map(type1 -> [Tom,38,M,NL]) |
|Map(type2 -> [Marijke,37,F,NL])|
+-------------------------------+
root
|-- metadata: map (nullable = false)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: long (nullable = false)
| | |-- gender: string (nullable = true)
| | |-- country: string (nullable = true)
Or just split the data based on the type and process each type as a separate dataframe, as sketched below.
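A minimal sketch of that split approach, assuming df is the original dataframe, the set of distinct type values is small, and the output path is only a placeholder:

import org.apache.spark.sql.functions.col

// Collect the distinct type values, then filter and handle each subset on its own.
val types = df.select("type").distinct().collect().map(_.getString(0))

types.foreach { typeValue =>
  val subset = df.filter(col("type") === typeValue).select("metadata")
  // each subset now corresponds to exactly one type value and has a fixed, known schema
  subset.write.mode("overwrite").json(s"/tmp/output/$typeValue")  // placeholder path
}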

Related

Spark reading Open Street Map data and selecting entries

I have OpenStreetMap (OSM) data for a country, read from an .orc file into var nlorc, from which I am trying to extract data for specific cities. As far as I know, a city entity is defined as a 'relation' in OSM. nlorc.printSchema() returns the following:
root
|-- id: long (nullable = true)
|-- type: string (nullable = true)
|-- tags: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- lat: decimal(9,7) (nullable = true)
|-- lon: decimal(10,7) (nullable = true)
|-- nds: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ref: long (nullable = true)
|-- members: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: string (nullable = true)
| | |-- ref: long (nullable = true)
| | |-- role: string (nullable = true)
|-- changeset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- uid: long (nullable = true)
|-- user: string (nullable = true)
|-- version: long (nullable = true)
|-- visible: boolean (nullable = true)
As an example, https://www.openstreetmap.org/relation/47798#map=13/51.4373/4.8888 shows that the name of the city is part of "Tags". How can I access the keys of Tags and select specific cities?
You can use getItem to access the elements of the map:
val df = ...
df.filter(df("tags").getItem("name") === "Baarle-Nassau").show()
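If you only want the relation entries (cities), you can combine the tag lookup with a filter on type; a sketch based on the schema above, using the nlorc dataframe from the question:

import org.apache.spark.sql.functions.col

// Keep only OSM relations whose "name" tag matches the city you are looking for.
val cities = nlorc
  .filter(col("type") === "relation")
  .filter(col("tags").getItem("name") === "Baarle-Nassau")

cities.select(col("id"), col("tags").getItem("name").as("city_name")).show()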

Extract struct fields in Spark scala

I have a df with this schema:
|-- Data: struct (nullable = true)
| |-- address_billing: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| |-- address_shipping: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- city: string (nullable = true)
| |-- cancelled_initiator: string (nullable = true)
| |-- cancelled_reason: string (nullable = true)
| |-- statuses: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- store_code: string (nullable = true)
| |-- store_name: string (nullable = true)
| |-- tax_code: string (nullable = true)
| |-- total: string (nullable = true)
| |-- updated_at: string (nullable = true)
I need to extract all of its fields into separate columns without naming them manually.
Is there any way to do this?
I tried:
val df2=df1.select(df1.col("Data.*"))
but got the error
org.apache.spark.sql.AnalysisException: No such struct field * in address_billing, address_shipping,....
Also, can anyone suggest how to add a prefix to all these columns, since some of the column names may be the same?
The output should look like:
address_billing_address1
address_billing_address2
.
.
.
Just change df1.col to col. Any of these should work:
df1.select(col("Data.*"))
df1.select($"Data.*")
df1.select("Data.*")
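To also get the prefixed names (e.g. address_billing_address1), one option is to build the select list from the struct's schema. A sketch, assuming at most one extra level of nesting inside Data:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

val dataSchema = df1.schema("Data").dataType.asInstanceOf[StructType]

val flattened = dataSchema.fields.flatMap { f =>
  f.dataType match {
    case s: StructType =>
      // nested struct: prefix each child with the parent name, e.g. address_billing_address1
      s.fieldNames.toSeq.map(c => col(s"Data.${f.name}.$c").alias(s"${f.name}_$c"))
    case _ =>
      // flat field (string, array, ...): keep its own name
      Seq(col(s"Data.${f.name}").alias(f.name))
  }
}

val df2 = df1.select(flattened: _*)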

Spark Scala: How to Replace a Field in Deeply Nested DataFrame

I have a DataFrame which contains multiple nested columns. The schema is not static and could change upstream of my Spark application. Schema evolution is guaranteed to always be backward compatible. An anonymized, shortened version of the schema is pasted below
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: string (nullable = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
The Spark job is supposed to transform "structX.uriArguments" from string to map(string, string). There is a somewhat similar situation in this post. However, the answer assumes the schema is static and does not change, so a case class does not work in my situation.
What would be the best way to transform structX.uriArguments without hard-coding the entire schema inside the code? The outcome should look like this:
root
|-- isXPresent: boolean (nullable = true)
|-- isYPresent: boolean (nullable = true)
|-- isZPresent: boolean (nullable = true)
|-- createTime: long (nullable = true)
<snip>
|-- structX: struct (nullable = true)
| |-- hostIPAddress: integer (nullable = true)
| |-- uriArguments: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
<snip>
|-- structY: struct (nullable = true)
| |-- lang: string (nullable = true)
| |-- cookies: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: string (containsNull = true)
<snip>
Thanks
You could try DataFrame.withColumn(). It allows you to reference nested fields, so you could add a new map column and drop the flat one. This question shows how to handle structs with withColumn.
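A possible sketch, assuming Spark 3.1+ for Column.withField and assuming uriArguments is formatted like "k1=v1&k2=v2" (the delimiters are an assumption):

import org.apache.spark.sql.functions.{col, expr}

// Replace only structX.uriArguments; the rest of the (evolving) schema is left untouched.
// str_to_map parses "k1=v1&k2=v2" into a map<string,string>; '&' and '=' are assumed delimiters.
val transformed = df.withColumn(
  "structX",
  col("structX").withField(
    "uriArguments",
    expr("str_to_map(structX.uriArguments, '&', '=')")
  )
)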

Spark Read json object data as MapType

I have written a sample Spark app where I create a dataframe with a MapType and write it to disk. Then I read the same file back and print its schema. But the output file schema is different from the input schema, and I don't see the MapType in the output. How can I read that output file with the MapType?
Code
import org.apache.spark.sql.{SaveMode, SparkSession}

case class Department(Id: String, Description: String)
case class Person(name: String, department: Map[String, Department])

object sample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Custom Poc").getOrCreate
    import spark.implicits._

    val schemaData = Seq(
      Person("Persion1", Map("It" -> Department("1", "It Department"), "HR" -> Department("2", "HR Department"))),
      Person("Persion2", Map("It" -> Department("1", "It Department")))
    )
    val df = spark.sparkContext.parallelize(schemaData).toDF()

    println("Input schema")
    df.printSchema()
    df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")

    println("Output schema")
    spark.read.json("D:\\save\\output\\*.json").printSchema()
  }
}
Output
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- department: struct (nullable = true)
| |-- HR: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
| |-- It: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
|-- name: string (nullable = true)
JSON file
{"name":"Persion1","department":{"It":{"Id":"1","Description":"It Department"},"HR":{"Id":"2","Description":"HR Department"}}}
{"name":"Persion2","department":{"It":{"Id":"1","Description":"It Department"}}}
EDIT:
The file-saving part above is only there to illustrate my requirement. In the actual scenario I will just be reading the JSON data shown above and working on that dataframe.
You can pass the schema from the previous dataframe while reading the JSON data:
println("Input schema")
df.printSchema()
df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")
println("Output schema")
spark.read.schema(df.schema).json("D:\\save\\output").printSchema()
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
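If there is no earlier dataframe to borrow the schema from (as in the EDIT), the same schema can be declared explicitly; a sketch mirroring the Person/Department case classes above:

import org.apache.spark.sql.types._

// Declare the map schema by hand instead of taking it from another dataframe.
val departmentType = StructType(Seq(
  StructField("Id", StringType),
  StructField("Description", StringType)
))
val personSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("department", MapType(StringType, departmentType))
))

spark.read.schema(personSchema).json("D:\\save\\output").printSchema()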
Hope this helps!

Access nested columns in transformers setInputCol() method

I am trying to parse a Wikipedia dump using the Databricks XML parser and Spark's pipeline approach. The goal is to compute a feature vector for the text field, which is a nested column.
The schema of the XML is as follows:
root
|-- id: long (nullable = true)
|-- ns: long (nullable = true)
|-- revision: struct (nullable = true)
| |-- comment: string (nullable = true)
| |-- contributor: struct (nullable = true)
| | |-- id: long (nullable = true)
| | |-- ip: string (nullable = true)
| | |-- username: string (nullable = true)
| |-- format: string (nullable = true)
| |-- id: long (nullable = true)
| |-- minor: string (nullable = true)
| |-- model: string (nullable = true)
| |-- parentid: long (nullable = true)
| |-- sha1: string (nullable = true)
| |-- text: struct (nullable = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _bytes: long (nullable = true)
| | |-- _space: string (nullable = true)
| |-- timestamp: string (nullable = true)
|-- title: string (nullable = true)
After reading in the dump with
val raw = spark.read.format("com.databricks.spark.xml").option("rowTag", "page").load("some.xml")
I am able to access the respective text using
raw.select("revision.text._VALUE").show(10)
I have then as my first stage in the Spark pipeline a RegexTokenizer, which needs to access revision.text._VALUE in order to transform the data:
val tokenizer = new RegexTokenizer().
  setInputCol("revision.text._VALUE").
  setOutputCol("tokens").
  setMinTokenLength(3).
  setPattern("\\s+|\\/|_|-").
  setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(raw)
However, this step fails with:
Name: java.lang.IllegalArgumentException
Message: Field "revision.text._VALUE" does not exist.
Any advice on how to use nested columns in the setInputCol method?
Thanks a lot!
Try creating a temp column before using it in the RegexTokenizer:
val rawTemp = raw.withColumn("temp", $"revision.text._VALUE")
Then you can use the rawTemp dataframe and the temp column in the RegexTokenizer:
val tokenizer = new RegexTokenizer().
  setInputCol("temp").
  setOutputCol("tokens").
  setMinTokenLength(3).
  setPattern("\\s+|\\/|_|-").
  setToLowercase(true)
val pipeline = new Pipeline().setStages(Array(tokenizer))
val model = pipeline.fit(rawTemp)
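If you prefer to keep the flattening step inside the pipeline itself, a SQLTransformer stage can produce the temp column as part of the pipeline; a sketch, not tested against the Wikipedia dump:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, SQLTransformer}

// Flatten the nested field as a pipeline stage instead of a pre-processing withColumn call.
val flatten = new SQLTransformer()
  .setStatement("SELECT *, revision.text._VALUE AS temp FROM __THIS__")

val tokenizer = new RegexTokenizer()
  .setInputCol("temp")
  .setOutputCol("tokens")
  .setMinTokenLength(3)
  .setPattern("\\s+|\\/|_|-")
  .setToLowercase(true)

val pipeline = new Pipeline().setStages(Array(flatten, tokenizer))
val model = pipeline.fit(raw)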
Hope the answer is helpful