Unflatten a DataFrame to a specific structure - Scala

I have a flat dataframe (df) with the structure as below:
root
|-- first_name: string (nullable = true)
|-- middle_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- title: string (nullable = true)
|-- start_date: string (nullable = true)
|-- end_Date: string (nullable = true)
|-- city: string (nullable = true)
|-- zip_code: string (nullable = true)
|-- state: string (nullable = true)
|-- country: string (nullable = true)
|-- email_name: string (nullable = true)
|-- company: struct (nullable = true)
|-- org_name: string (nullable = true)
|-- company_phone: string (nullable = true)
|-- partition_column: string (nullable = true)
And I need to convert this dataframe into a structure like the following (as my next data will be in this format):
root
|-- firstName: string (nullable = true)
|-- middleName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- currentPosition: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- title: string (nullable = true)
| | |-- startDate: string (nullable = true)
| | |-- endDate: string (nullable = true)
| | |-- address: struct (nullable = true)
| | | |-- city: string (nullable = true)
| | | |-- zipCode: string (nullable = true)
| | | |-- state: string (nullable = true)
| | | |-- country: string (nullable = true)
| | |-- emailName: string (nullable = true)
| | |-- company: struct (nullable = true)
| | | |-- orgName: string (nullable = true)
| | | |-- companyPhone: string (nullable = true)
|-- partitionColumn: string (nullable = true)
So far I have implemented this:
case class IndividualCompany(orgName: String,
                             companyPhone: String)
case class IndividualAddress(city: String,
                             zipCode: String,
                             state: String,
                             country: String)
case class IndividualPosition(title: String,
                              startDate: String,
                              endDate: String,
                              address: IndividualAddress,
                              emailName: String,
                              company: IndividualCompany)
case class Individual(firstName: String,
                      middleName: String,
                      lastName: String,
                      currentPosition: Seq[IndividualPosition],
                      partitionColumn: String)
val makeCompany = udf((orgName: String, companyPhone: String) => IndividualCompany(orgName, companyPhone))
val makeAddress = udf((city: String, zipCode: String, state: String, country: String) => IndividualAddress(city, zipCode, state, country))
val makePosition = udf((title: String, startDate: String, endDate: String, address: IndividualAddress, emailName: String, company: IndividualCompany)
  => List(IndividualPosition(title, startDate, endDate, address, emailName, company)))
val selectData = df.select(
  col("first_name").as("firstName"),
  col("middle_name").as("middleName"),
  col("last_name").as("lastName"),
  makePosition(col("job_title"),
               col("start_date"),
               col("end_Date"),
               makeAddress(col("city"),
                           col("zip_code"),
                           col("state"),
                           col("country")),
               col("email_name"),
               makeCompany(col("org_name"),
                           col("company_phone"))).as("currentPosition"),
  col("partition_column").as("partitionColumn")
).as[Individual]
selectData.printSchema()
selectData.show(10)
I can see a proper schema generated for selectData, but the last line, where I try to get some actual data, fails with an error saying it failed to execute a user-defined function:
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$4: (string, string, string, struct<city:string,zipCode:string,state:string,country:string>, string, struct<orgName:string,companyPhone:string>) => array<struct<title:string,startDate:string,endDate:string,address:struct<city:string,zipCode:string,state:string,country:string>,emailName:string,company:struct<orgName:string,companyPhone:string>>>)
Is there any better way to achieve this?

The problem here is that a UDF can't take IndividualAddress and IndividualCompany directly as input. These are represented as structs in Spark, and to use them in a UDF the correct input type is Row. That means you need to change the declaration of makePosition to:
val makePosition = udf((title: String,
                        startDate: String,
                        endDate: String,
                        address: Row,
                        emailName: String,
                        company: Row)
Inside the UDF you now need to use e.g. address.getAs[String]("city") to access the struct's fields, and to use the case class as a whole you need to construct it again.
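To make that concrete, a minimal sketch of the full UDF could look like this (same case classes as above; the struct field names are the ones produced by makeAddress and makeCompany):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val makePosition = udf((title: String,
                        startDate: String,
                        endDate: String,
                        address: Row,
                        emailName: String,
                        company: Row) =>
  List(
    IndividualPosition(
      title,
      startDate,
      endDate,
      // rebuild the case classes from the Row inputs field by field
      IndividualAddress(
        address.getAs[String]("city"),
        address.getAs[String]("zipCode"),
        address.getAs[String]("state"),
        address.getAs[String]("country")),
      emailName,
      IndividualCompany(
        company.getAs[String]("orgName"),
        company.getAs[String]("companyPhone"))
    )
  )
)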
The easier and better alternative would be to do everything in a single UDF as follows:
val makePosition = udf((title: String,
                        startDate: String,
                        endDate: String,
                        city: String,
                        zipCode: String,
                        state: String,
                        country: String,
                        emailName: String,
                        orgName: String,
                        companyPhone: String) =>
  Seq(
    IndividualPosition(
      title,
      startDate,
      endDate,
      IndividualAddress(city, zipCode, state, country),
      emailName,
      IndividualCompany(orgName, companyPhone)
    )
  )
)
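With that version the select passes the flat columns straight through, along these lines (a sketch, assuming the column names from the schema in the question, i.e. title rather than job_title; if org_name and company_phone actually sit inside the company struct they would be referenced as col("company.org_name") and col("company.company_phone")):
val selectData = df.select(
  col("first_name").as("firstName"),
  col("middle_name").as("middleName"),
  col("last_name").as("lastName"),
  makePosition(
    col("title"),
    col("start_date"),
    col("end_Date"),
    col("city"),
    col("zip_code"),
    col("state"),
    col("country"),
    col("email_name"),
    col("org_name"),
    col("company_phone")).as("currentPosition"),
  col("partition_column").as("partitionColumn")
).as[Individual]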

I had a similar requirement.
What I did was create a typed user-defined aggregation that produces a List of elements.
import org.apache.spark.sql.{Encoder, TypedColumn}
import org.apache.spark.sql.expressions.Aggregator
import scala.collection.mutable
object ListAggregator {
  private type Buffer[T] = mutable.ListBuffer[T]

  /** Returns a column that aggregates all elements of type T in a List. */
  def create[T](columnName: String)
               (implicit listEncoder: Encoder[List[T]], listBufferEncoder: Encoder[Buffer[T]]): TypedColumn[T, List[T]] =
    new Aggregator[T, Buffer[T], List[T]] {
      override def zero: Buffer[T] =
        mutable.ListBuffer.empty[T]

      override def reduce(buffer: Buffer[T], elem: T): Buffer[T] =
        buffer += elem

      override def merge(b1: Buffer[T], b2: Buffer[T]): Buffer[T] =
        if (b1.length >= b2.length) b1 ++= b2 else b2 ++= b1

      override def finish(reduction: Buffer[T]): List[T] =
        reduction.toList

      override def bufferEncoder: Encoder[Buffer[T]] =
        listBufferEncoder

      override def outputEncoder: Encoder[List[T]] =
        listEncoder
    }.toColumn.name(columnName)
}
Now you can use it like this.
import org.apache.spark.sql.SparkSession
val spark =
  SparkSession
    .builder
    .master("local[*]")
    .getOrCreate()
import spark.implicits._
final case class Flat(id: Int, name: String, age: Int)
final case class Grouped(age: Int, users: List[(Int, String)])
val data =
  List(
    (1, "Luis", 21),
    (2, "Miguel", 21),
    (3, "Sebastian", 16)
  ).toDF("id", "name", "age").as[Flat]
val grouped =
  data
    .groupByKey(flat => flat.age)
    .mapValues(flat => (flat.id, flat.name))
    .agg(ListAggregator.create(columnName = "users"))
    .map(tuple => Grouped(age = tuple._1, users = tuple._2))
// grouped: org.apache.spark.sql.Dataset[Grouped] = [age: int, users: array<struct<_1:int,_2:string>>]
grouped.show(truncate = false)
// +---+------------------------+
// |age|users |
// +---+------------------------+
// |16 |[[3, Sebastian]] |
// |21 |[[1, Luis], [2, Miguel]]|
// +---+------------------------+
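Applied back to the original question, a rough sketch could look like this. FlatIndividual and flatDS are hypothetical names for a case class mirroring the flat input columns and for the dataset read into it; the Individual* case classes are the ones defined in the question:
// Hypothetical flat case class mirroring the input dataframe's columns.
final case class FlatIndividual(
  firstName: String, middleName: String, lastName: String,
  title: String, startDate: String, endDate: String,
  city: String, zipCode: String, state: String, country: String,
  emailName: String, orgName: String, companyPhone: String,
  partitionColumn: String)

val individuals =
  flatDS                                   // hypothetical Dataset[FlatIndividual]
    .groupByKey(r => (r.firstName, r.middleName, r.lastName, r.partitionColumn))
    .mapValues(r => IndividualPosition(
      r.title, r.startDate, r.endDate,
      IndividualAddress(r.city, r.zipCode, r.state, r.country),
      r.emailName,
      IndividualCompany(r.orgName, r.companyPhone)))
    .agg(ListAggregator.create[IndividualPosition](columnName = "currentPosition"))
    .map { case ((first, middle, last, partition), positions) =>
      Individual(first, middle, last, positions, partition)
    }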


Create a new column in a dataset using a case class structure

Assuming the following schema for a table - Places:
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
The val places is of type Dataset[Row],
and I have the following case class:
case class csm(
city: Option[String] = None,
stateProvince: Option[String] = None,
country: Option[String] = None
)
How would I go about altering or creating a new data set that has the following schema
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- subpremise: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neighborhood: string (nullable = true)
|-- csm: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state_province: string (nullable = true)
| |-- country: string (nullable = true)
I've been looking into withColumn, which seems to require a UDF. The challenge is that I have to specify the columns manually; that is easy for this use case, but as the problem scales it will be difficult to maintain them by hand.
Used this as a reference: https://intellipaat.com/community/16433/how-to-add-a-new-struct-column-to-a-dataframe
In your case class declaration you have a stateProvince parameter, but in your dataframe the column is state_province.
I'm not sure whether that's a typo, so first, a quick-and-dirty, not-thoroughly-tested camelCase to snake_case converter, just in case:
def normalize(x: String): String =
"([a-z])([A-Z])".r replaceAllIn(x, m => s"${m.group(1)}_${m.group(2).toLowerCase}")
Next, let's get the parameters of a case class:
val case_class_params = Seq[csm]().toDF.columns
And with this, we can now get columns for our case class struct:
val csm_cols = case_class_params.map(x => col(normalize(x)))
val df2 = df.withColumn("csm", struct(csm_cols:_*))
+--------+--------------+---------+--------------+-----------+------------+-----------+----------------------------------------+
|place_id|street_address|city |state_province|postal_code|country |neghborhood|csm |
+--------+--------------+---------+--------------+-----------+------------+-----------+----------------------------------------+
|123 |str_addr |some_city|some_province |some_zip |some_country|NA |{some_city, some_province, some_country}|
+--------+--------------+---------+--------------+-----------+------------+-----------+----------------------------------------+
root
|-- place_id: string (nullable = true)
|-- street_address: string (nullable = true)
|-- city: string (nullable = true)
|-- state_province: string (nullable = true)
|-- postal_code: string (nullable = true)
|-- country: string (nullable = true)
|-- neghborhood: string (nullable = true)
|-- csm: struct (nullable = false)
| |-- city: string (nullable = true)
| |-- state_province: string (nullable = true)
| |-- country: string (nullable = true)
Nested case classes
Seq[csm]().toDF.columns won't give you nested columns. For that some basic schema traversal is required. E.g., one way to do it, adapted from here:
def flatten(schema: StructType): Seq[String] =
  schema.fields.flatMap { field =>
    field.dataType match {
      case structType: StructType =>
        flatten(structType)
      case _ =>
        field.name :: Nil
    }
  }
case class StateProvince(
stateProvince: Option[String] = None,
country: Option[String] = None)
case class csm(
city: Option[String] = None,
state: StateProvince
)
val case_class_params = flatten(Seq[csm]().toDF.schema)
// case_class_params: Seq[String] = ArraySeq(city, stateProvince, country)
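Those flattened names map back to source columns through normalize just like before. Rebuilding the nested shape itself then needs one struct per level; a sketch for the nested csm above, assuming the snake_case columns exist at the top level of the dataframe:
val df2 = df.withColumn(
  "csm",
  struct(
    col("city"),
    struct(
      col("state_province").as("stateProvince"),
      col("country")
    ).as("state")
  )
)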
case class Source(
place_id: Option[String],
street_address: Option[String],
city: Option[String],
state_province: Option[String],
postal_code: Option[String],
country: Option[String],
neighborhood: Option[String]
)
case class Csm(
city: Option[String] = None,
stateProvince: Option[String] = None,
country: Option[String] = None
)
case class Result(
place_id: Option[String],
street_address: Option[String],
subpremise: Option[String],
city: Option[String],
state_province: Option[String],
postal_code: Option[String],
country: Option[String],
neighborhood: Option[String],
csm: Csm
)
import spark.implicits._
val sourceDF = Seq(
Source(
Some("s-1-1"),
Some("s-1-2"),
Some("s-1-3"),
Some("s-1-4"),
Some("s-1-5"),
Some("s-1-6"),
Some("s-1-7")
),
Source(
Some("s-2-1"),
Some("s-2-2"),
Some("s-2-3"),
Some("s-2-4"),
Some("s-2-5"),
Some("s-2-6"),
Some("s-2-7")
)
).toDF()
val resultDF = sourceDF
  .map(r => {
    Result(
      Some(r.getAs[String]("place_id")),
      Some(r.getAs[String]("street_address")),
      Some("set your value"),
      Some(r.getAs[String]("city")),
      Some(r.getAs[String]("state_province")),
      Some(r.getAs[String]("postal_code")),
      Some(r.getAs[String]("country")),
      Some(r.getAs[String]("neighborhood")),
      Csm(
        Some(r.getAs[String]("city")),
        Some(r.getAs[String]("state_province")),
        Some(r.getAs[String]("country"))
      )
    )
  })
  .toDF()
resultDF.printSchema()
// root
// |-- place_id: string (nullable = true)
// |-- street_address: string (nullable = true)
// |-- subpremise: string (nullable = true)
// |-- city: string (nullable = true)
// |-- state_province: string (nullable = true)
// |-- postal_code: string (nullable = true)
// |-- country: string (nullable = true)
// |-- neighborhood: string (nullable = true)
// |-- csm: struct (nullable = true)
// | |-- city: string (nullable = true)
// | |-- stateProvince: string (nullable = true)
// | |-- country: string (nullable = true)
resultDF.show(false)
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
// |place_id|street_address|subpremise |city |state_province|postal_code|country|neighborhood|csm |
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+
// |s-1-1 |s-1-2 |set your value|s-1-3|s-1-4 |s-1-5 |s-1-6 |s-1-7 |[s-1-3, s-1-4, s-1-6]|
// |s-2-1 |s-2-2 |set your value|s-2-3|s-2-4 |s-2-5 |s-2-6 |s-2-7 |[s-2-3, s-2-4, s-2-6]|
// +--------+--------------+--------------+-----+--------------+-----------+-------+------------+---------------------+

Exception in thread “main” org.apache.spark.sql.AnalysisException: No such struct field startId in _1, _2;

I have 2 case classes as below:
case class IcijAddressRaw(
node_id: Option[Long],
name: Option[String],
address: Option[String],
country_codes: Option[String],
countries: Option[String],
sourceID: Option[String],
valid_until: Option[String],
note: Option[String]
)
case class IcijEdgesRaw(
START_ID: Option[Long],
TYPES: Option[String],
END_ID: Option[Long],
link: Option[String],
start_date: Option[java.sql.Date],
end_date: Option[java.sql.Date],
sourceID: Option[String],
valid_until: Option[String]
)
I join both datasets with the above case classes as below:
val addressWithEdgesDS = addressRawDS
  .joinWith(edgesRawDS, edgesRawDS("END_ID") === addressRawDS("node_id"), "inner")

val addressGroupDS = addressWithEdgesDS.groupByKey { fullAddress => fullAddress }.mapGroups {
  case (startId, fullAddress) =>
    (startId, fullAddress.toSeq)
}

def thirdPartyWithAddressDS = thirdPartyDS
  .joinWith(addressGroupDS, thirdPartyDS("thirdPartyId") === addressGroupDS("_2.startId"), "left_outer")
  .map {
    case (thirdParty, null) => thirdParty
    case (thirdParty, (thirdPartyId, addressSeq)) => thirdParty.copy(addresses = addressSeq.map(addressCaseClassMap[ThirdPartyCC]))
  }
I can build the jar with Gradle without a problem. However, when I run the below script:
./runQSS.sh -s com.quantexa.academytask.etl.projects.icij.CreateCaseClass -e local -r etl.icij -c /home/training-admin/academy-task/application.conf
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field startId in _1, _2;
I have tried the column name "START_ID" as well as "_2.START_ID", but I get the same error. My understanding is that this joined dataset's struct needs _1 or _2 to refer to the left or right dataset's columns, but in this case it seems to fail to point to the correct column.
My joined schema is as below:
scala> addressWithEdgesDS.printSchema()
root
|-- _1: struct (nullable = false)
| |-- node_id: integer (nullable = true)
| |-- name: string (nullable = true)
| |-- address: string (nullable = true)
| |-- country_codes: string (nullable = true)
| |-- countries: string (nullable = true)
| |-- sourceID: string (nullable = true)
| |-- valid_until: string (nullable = true)
| |-- note: string (nullable = true)
|-- _2: struct (nullable = false)
| |-- START_ID: integer (nullable = true)
| |-- TYPES: string (nullable = true)
| |-- END_ID: integer (nullable = true)
| |-- link: string (nullable = true)
| |-- start_date: string (nullable = true)
| |-- end_date: string (nullable = true)
| |-- sourceID: string (nullable = true)
| |-- valid_until: string (nullable = true)

Reordering fields in nested dataframe

How do I reorder fields in a nested dataframe in Scala?
For example, below are the current and desired schemas:
currently->
root
|-- domain: struct (nullable = false)
| |-- assigned: string (nullable = true)
| |-- core: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- action: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- dqid: string (nullable = true)
expected->
root
|-- domain: struct (nullable = false)
| |-- core: string (nullable = true)
| |-- assigned: string (nullable = true)
| |-- createdBy: long (nullable = true)
|-- Event: struct (nullable = false)
| |-- dqid: string (nullable = true)
| |-- eventid: string (nullable = true)
| |-- action: string (nullable = true)
You need to define the schema before you read the dataframe.
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("domain", StructType(Array(
    StructField("core", StringType, true),
    StructField("assigned", StringType, true),
    StructField("createdBy", LongType, true))), true),
  StructField("Event", StructType(Array(
    StructField("dqid", StringType, true),
    StructField("eventid", StringType, true),
    StructField("action", StringType, true))), true)))
Now, you can apply this schema while reading your file.
val df = spark.read.schema(schema).json("path/to/json")
Should work with any nested data.
Hope this helps!
The most efficient approach might be to just select the nested elements and wrap them in a couple of structs, as shown below:
case class Domain(assigned: String, core: String, createdBy: Long)
case class Event(action: String, eventid: String, dqid: String)
val df = Seq(
(Domain("a", "b", 1L), Event("c", "d", "e")),
(Domain("f", "g", 2L), Event("h", "i", "j"))
).toDF("domain", "event")
val df2 = df.select(
struct($"domain.core", $"domain.assigned", $"domain.createdBy").as("domain"),
struct($"event.dqid", $"event.action", $"event.eventid").as("event")
)
df2.printSchema
// root
// |-- domain: struct (nullable = false)
// | |-- core: string (nullable = true)
// | |-- assigned: string (nullable = true)
// | |-- createdBy: long (nullable = true)
// |-- event: struct (nullable = false)
// | |-- dqid: string (nullable = true)
// | |-- action: string (nullable = true)
// | |-- eventid: string (nullable = true)
An alternative would be to apply a row-wise map:
import org.apache.spark.sql.Row
val df2 = df.map{ case Row(Row(as: String, co: String, cr: Long), Row(ac: String, ev: String, dq: String)) =>
((co, as, cr), (dq, ac, ev))
}.toDF("domain", "event")
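One caveat with the tuple version: the resulting structs get field names _1, _2, _3 rather than the original ones. If the names matter, mapping to case classes declared in the desired order (the hypothetical DomainReordered/EventReordered below) keeps proper field names:
// Hypothetical case classes declaring the desired field order.
case class DomainReordered(core: String, assigned: String, createdBy: Long)
case class EventReordered(dqid: String, action: String, eventid: String)

val df3 = df.map { case Row(Row(as: String, co: String, cr: Long), Row(ac: String, ev: String, dq: String)) =>
  (DomainReordered(co, as, cr), EventReordered(dq, ac, ev))
}.toDF("domain", "event")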

Scala - Spark: build a Graph (graphX) from vertices and edges dataframe

I have two dataframes with these schemas:
edges
|-- src: string (nullable = true)
|-- dst: string (nullable = true)
|-- relationship: struct (nullable = false)
| |-- business_id: string (nullable = true)
| |-- normalized_influence: double (nullable = true)
vertices
root
|-- id: string (nullable = true)
|-- state: boolean (nullable = true)
To build the graph I converted these dataframes in this way:
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import scala.util.hashing.MurmurHash3
case class Relationship(business_id: String, normalized_influence: Double)
case class MyEdge(src: String, dst: String, relationship: Relationship)
val edgesRDD: RDD[Edge[Relationship]] = communityEdgeDF.as[MyEdge].rdd.map { edge =>
  Edge(
    MurmurHash3.stringHash(edge.src).toLong,
    MurmurHash3.stringHash(edge.dst).toLong,
    edge.relationship
  )
}

case class MyVertex(id: String, state: Boolean)

val verticesRDD: RDD[(VertexId, (String, Boolean))] = communityVertexDF.as[MyVertex].rdd.map { vertex =>
  (
    MurmurHash3.stringHash(vertex.id).toLong,
    (vertex.id, vertex.state)
  )
}
val graphX = Graph(verticesRDD, edgesRDD)
This is part of the output of the vertices:
res6: Array[(org.apache.spark.graphx.VertexId, (String, Boolean))] = Array((1874415454,(KRZALzi0ZgrGYyjZNg72_g,false)), (1216259959,(JiFBQ_-vWgJtRZEEruSStg,false)), (-763896211,(LZge-YpVL0ukJVD2nw5sag,false)), (-2032982683,(BHP3LVkTOfh3w4UIhgqItg,false)), (844547135,(JRC3La2fiNkK0VU7qZ9vyQ,false))
and this is the edges:
res3: Array[org.apache.spark.graphx.Edge[Relationship]] = Array(Edge(-268040669,1495494297,Relationship(cJWbbvGmyhFiBpG_5hf5LA,0.0017532149785518423)), Edge(-268040669,-125364603,Relationship(cJWbbvGmyhFiBpG_5hf5LA,0.0017532149785518423))
But doing this:
graphX.vertices.collect
I get this wrong output:
Array((1981723824,null), (-333497649,null), (-597749329,null), (451246392,null), (-1287295481,null), (1013727024,null), (-194805089,null), (1621180464,null), (1874415454,(KRZALzi0ZgrGYyjZNg72_g,false)), (1539311488,null)
What is the problem? Did I build the Graph incorrectly?

Nested JSON in Spark

I have the following JSON loaded as a DataFrame:
root
|-- data: struct (nullable = true)
| |-- field1: string (nullable = true)
| |-- field2: string (nullable = true)
|-- moreData: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- more1: string (nullable = true)
| | |-- more2: string (nullable = true)
| | |-- more3: string (nullable = true)
I want to get the following RDD from this DataFrame:
RDD[(more1, more2, more3, field1, field2)]
How can I do this? I think I have to use flatMap for the nested JSON?
A combination of explode and dot syntax should do the trick:
import org.apache.spark.sql.functions.explode
case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)
val df = sc.parallelize(Seq(
(Data("foo", "bar"), Array(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
)).toDF("data", "moreData")
df.printSchema
// root
// |-- data: struct (nullable = true)
// | |-- field1: string (nullable = true)
// | |-- field2: string (nullable = true)
// |-- moreData: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- more1: string (nullable = true)
// | | |-- more2: string (nullable = true)
// | | |-- more3: string (nullable = true)
val columns = Seq(
$"moreData.more1", $"moreData.more2", $"moreData.more3",
$"data.field1", $"data.field2")
val aRDD = df.withColumn("moreData", explode($"moreData"))
.select(columns: _*)
.rdd
aRDD.collect
// Array[org.apache.spark.sql.Row] = Array([a,b,c,foo,bar], [d,e,f,foo,bar])
Depending on your requirements you can follow this with map to extract values from the rows:
import org.apache.spark.sql.Row
aRDD.map{case Row(m1: String, m2: String, m3: String, f1: String, f2: String) =>
(m1, m2, m3, f1, f2)}
See also Querying Spark SQL DataFrame with complex types