Map customer and account data to a case class using Spark/Scala

So I have a case class CustomerData and a case class AccountData as follows:
case class CustomerData(
  customerId: String,
  forename: String,
  surname: String
)
case class AccountData(
  customerId: String,
  accountId: String,
  balance: Long
)
I need to join these two to form the following case class:
case class CustomerAccountOutput(
  customerId: String,
  forename: String,
  surname: String,
  //Accounts for this customer
  accounts: Seq[AccountData],
  //Statistics of the accounts
  numberAccounts: Int,
  totalBalance: Long,
  averageBalance: Double
)
I need to show that if null appears in accountId or balance, then the number of accounts is 0, the total balance is null, and the average balance is also null. Replacing the null with 0 is also accepted.
The final result should be something like this:
+----------+-----------+-------+----------------------------------------------+--------------+------------+--------------+
|customerId|forename   |surname|accounts                                      |numberAccounts|totalBalance|averageBalance|
+----------+-----------+-------+----------------------------------------------+--------------+------------+--------------+
|IND0113   |Leonard    |Ball   |[[IND0113,ACC0577,531]]                       |1             |531         |531.0         |
|IND0277   |Victoria   |Hodges |[[IND0277,null,null]]                         |0             |null        |null          |
|IND0055   |Ella       |Taylor |[[IND0055,ACC0156,137], [IND0055,ACC0117,148]]|2             |285         |142.5         |
|IND0129   |Christopher|Young  |[[IND0129,null,null]]                         |0             |null        |null          |
+----------+-----------+-------+----------------------------------------------+--------------+------------+--------------+
I have already got the two case classes to join, and here is the code:
val customerDS = customerDF.as[CustomerData]
val accountDS = accountDF.withColumn("balance", 'balance.cast("long")).as[AccountData]
//END GIVEN CODE
val customerAccountsDS = customerDS.joinWith(accountDS, customerDS("customerId") === accountDS("customerId"), "leftouter")
How do I go about getting the above result? I am NOT allowed to use the spark.sql.functions._ library at all.

You should be able to do this using the concat_ws and collect_list functions in Spark.
//Creating sample data
import spark.implicits._

case class CustomerData(
  customerId: String,
  forename: String,
  surname: String
)
case class AccountData(
  customerId: String,
  accountId: String,
  balance: Long
)

val customercolumns = Seq("customerId", "forename", "surname")
val acccolumns = Seq("customerId", "accountId", "balance")
val custdata = Seq(("IND0113", "Leonard", "Ball"), ("IND0277", "Victoria", "Hodges"), ("IND0055", "Ella", "Taylor"), ("IND0129", "Christopher", "Young")).toDF(customercolumns: _*).as[CustomerData]
val acctdata = Seq(("IND0113", "ACC0577", 531L), ("IND0055", "ACC0156", 137L), ("IND0055", "ACC0117", 148L)).toDF(acccolumns: _*).as[AccountData]

//Left outer join so customers without accounts are kept
val customerAccountsDS = custdata.join(acctdata, custdata("customerId") === acctdata("customerId"), "leftouter").drop(acctdata.col("customerId"))

import org.apache.spark.sql.functions._
//Concatenate the account fields into one string per row, then collect them per customer
val result = customerAccountsDS.withColumn("accounts", concat_ws(",", $"customerId", $"accountId", $"balance"))
val finalresult = result.groupBy("customerId", "forename", "surname").agg(collect_list($"accounts").as("accounts"))
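Note that concat_ws and collect_list live in spark.sql.functions._, which the question says is off limits. As an alternative, here is a hedged sketch that stays entirely on the typed Dataset API (groupByKey/mapGroups), starting from the asker's customerAccountsDS produced by the leftouter joinWith; with that join, unmatched customers come back with a null AccountData on the right side, and this sketch falls back to 0 for the statistics, which the question says is acceptable.
//Sketch only, not the accepted answer's approach
import spark.implicits._

val outputDS = customerAccountsDS
  .groupByKey { case (customer, _) => customer.customerId }
  .mapGroups { case (_, pairs) =>
    val rows     = pairs.toSeq
    val customer = rows.head._1
    //Right side is null (or has a null accountId) when the customer has no accounts
    val accounts = rows.collect { case (_, acc) if acc != null && acc.accountId != null => acc }
    val total    = accounts.map(_.balance).sum
    CustomerAccountOutput(
      customer.customerId,
      customer.forename,
      customer.surname,
      accounts,
      accounts.size,
      total,                                                         //0 when there are no accounts
      if (accounts.nonEmpty) total.toDouble / accounts.size else 0.0 //0.0 instead of null
    )
  }

outputDS.show(false)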

Related

Generate dynamic header using Scala case class for Spark Table

I have an existing case class with many fields:
case class output(
  userId: String,
  timeStamp: String,
  ...
)
And I am using it to generate the header for a Spark job, like this:
---------------------
|userId | timeStamp |
---------------------
|1      | 2324444444|
|2      | 2334445556|
Now I want to add more columns to this, and these columns will come from a map(attributeName, attributeValue), with the attribute names used as column names. So my question is: how can I add a map to the case class, and how can I then use the map keys as column names to generate dynamic columns? After this, my final output should be like:
-----------------------------------------------------
|userId | timeStamp | attributeName1 | attributeName2|
-----------------------------------------------------
|1      | 2324444444|                |               |
|2      | 2334445554|                |               |
You can do something like this:
case class output(
  userId: String,
  timeStamp: String,
  keyvalues: Map[String, String],
  ...
)
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.read.textFile(inputlocation).as[output]
//Collect the distinct map keys, then turn each key into a column holding its value
val keysDF = df.select(explode(map_keys($"keyvalues"))).distinct()
val keyCols = keysDF.collect().map(f => f.get(0)).map(f => col("keyvalues").getItem(f).as(f.toString))
df.select(col("userId") +: keyCols: _*)
Or you can check this thread for other ways to do it.
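A minimal runnable sketch of the same idea on in-memory data (the Sample case class and the attribute names/values here are made up for illustration): distinct map keys become the dynamic column names, and rows missing a key show null in that column.
import org.apache.spark.sql.functions.{col, explode, map_keys}
import spark.implicits._

case class Sample(userId: String, timeStamp: String, keyvalues: Map[String, String])

val sample = Seq(
  Sample("1", "2324444444", Map("attributeName1" -> "a")),
  Sample("2", "2334445556", Map("attributeName2" -> "b"))
).toDS()

//Each distinct key becomes a column selecting that key's value from the map
val keyCols = sample
  .select(explode(map_keys($"keyvalues"))).distinct()
  .collect()
  .map(_.getString(0))
  .map(k => col("keyvalues").getItem(k).as(k))

sample.select(col("userId") +: col("timeStamp") +: keyCols: _*).show()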

why spark to_json() not populating null values?

Can try in spark-shell:
import org.apache.spark.sql.functions._
import spark.implicits._

case class Employee(id: Int, name: String, department: String, salary: Option[Double])

val data = List(Employee(1, "XYZ", "dep1", Some(1234.0)), Employee(0, null, "unknown", None)).toDS()
data.select($"id", to_json(struct($"id", $"name", $"department", $"salary")).as("json_data")).show(false)
return =>
+---+---------------------------------------------------------+
|id |json_data                                                |
+---+---------------------------------------------------------+
|1  |{"id":1,"name":"XYZ","department":"dep1","salary":1234.0}|
|0  |{"id":0,"department":"unknown"}                          |
+---+---------------------------------------------------------+
expecting =>
+---+------------------------------------------------------------+
|id |json_data                                                   |
+---+------------------------------------------------------------+
|1  |{"id":1,"name":"XYZ","department":"dep1","salary":1234.0}   |
|0  |{"id":0,"name": null, "department":"unknown","salary":null} |
+---+------------------------------------------------------------+
Null fields (name & salary) should also be populated in the resulting JSON. I don't want to use lit("null") to populate null values.
A feature was recently added to preserve null values when generating JSON, and should be available in the upcoming Spark 3.0 release. See SPARK-29444 for details. In 3.0, you'll be able to control this via:
data.select($"id", to_json(struct($"id",$"name", $"department", $"salary"), Map("ignoreNullFields" -> "false")).as("json_data")).show(false)
AFAIK, there are no plans at present to add this to the 2.x branch.
For Spark 3.0, you could use Map("ignoreNullFields" -> "false") as an option to the to_json method.
For Spark 2.x and below, you can use the implementation below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def convertIntoJsonWithNullValuesIncluded(df: DataFrame): DataFrame = {
  //Interleave column names and values so map() builds {"colName": value, ...}
  val colnms_n_vals = df.columns.flatMap { c => Array(lit(c), col(c)) }
  val jsonDf = df.withColumn("myMap", map(colnms_n_vals: _*)).select(to_json(struct(col("myMap"))).alias("json"))
  val cutStringUdf = udf((x: String) => cutString(x))
  jsonDf.withColumn("value", cutStringUdf(col("json"))).drop("json")
}
//Strip the surrounding {"myMap": ... } wrapper so only the inner JSON object remains
def cutString(s: String): String = {
  s.substring(9, s.length - 1)
}
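For what it's worth, a hedged usage sketch of the above on the question's Employee data: Spark's map() requires all of its value columns to share a single type, so this sketch casts every column to string first (that cast is my assumption, not something the answer states, and it means numeric values end up as JSON strings).
import org.apache.spark.sql.functions.col

val base = data.toDF()
//Cast every column to string so map() receives values of one common type
val stringDf = base.columns.foldLeft(base) { (d, c) => d.withColumn(c, col(c).cast("string")) }
convertIntoJsonWithNullValuesIncluded(stringDf).show(false)
//Null map values (e.g. name and salary for id 0) are kept in the generated JSON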

Index of an array item in scala

I have a table with an array column like:
+-Name-+
array
0: {"given_name":"B. A.", "surname":"Name1"}
1: {"given_name":"A.", "surname":"Name2"}
2: {"given_name":"C." "surname":"Name3"}
I would like to add one more element item "index" starting with 1 into an array to find the sequence of the author, like
+-Name-+
array
0: {"given_name":"B. A.", "surname":"Name1", "index":"1"}
1: {"given_name":"A.", "surname":"Name2", "index":"2"}
2: {"given_name":"C." "surname":"Name3", "index":"3"}
How to do this in Scala, your help is much appreciated.
Here's one approach using a UDF that maps each element of the array-typed column to include also the element index:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
case class Name(given_name: String, surname: String)
case class NameIdx(given_name: String, surname: String, index: Int)
val df = Seq(
Seq(Name("John", "Doe"), Name("Jane", "Smith"), Name("Mike", "Davis")),
Seq(Name("Rachel", "Smith"), Name("Steve", "Thompson"))
).toDF("name")
val addIndex = udf((names: Seq[Row]) => names.zipWithIndex.map {
  //Rebuild each struct with its 1-based position appended as "index"
  case (Row(gn: String, sn: String), i) => NameIdx(gn, sn, i + 1)
})
df.select(addIndex($"name").as("name")).show(false)
// +----------------------------------------------+
// |name |
// +----------------------------------------------+
// |[[John,Doe,1], [Jane,Smith,2], [Mike,Davis,3]]|
// |[[Rachel,Smith,1], [Steve,Thompson,2]] |
// +----------------------------------------------+
To produce JSON values, apply to_json as follows:
df.select(to_json(addIndex($"name")).as("name")).show(false)
// +-----------------------------------------------------------------------------------------------------------------------------------------------------+
// |name |
// +-----------------------------------------------------------------------------------------------------------------------------------------------------+
// |[{"given_name":"John","surname":"Doe","index":1},{"given_name":"Jane","surname":"Smith","index":2},{"given_name":"Mike","surname":"Davis","index":3}]|
// |[{"given_name":"Rachel","surname":"Smith","index":1},{"given_name":"Steve","surname":"Thompson","index":2}] |
// +-----------------------------------------------------------------------------------------------------------------------------------------------------+
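As a side note, on Spark 2.4+ the same index can be added without a UDF via the transform higher-order function (a sketch, not part of the original answer; it assumes the two-field given_name/surname structs shown above):
df.selectExpr(
  "transform(name, (x, i) -> named_struct('given_name', x.given_name, 'surname', x.surname, 'index', i + 1)) AS name"
).show(false)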

How to extract values from an RDD based on the parameter passed

I have created a key-value RDD, but I am not sure how to select values from it.
val mapdf = merchantData_df.rdd.map(row => {
val Merchant_Name = row.getString(0)
val Display_Name = row.getString(1)
val Store_ID_name = row.getString(2)
val jsonString = s"{Display_Name: $Display_Name, Store_ID_name: $Store_ID_name}"
(Merchant_Name, jsonString)
})
scala> mapdf.take(4).foreach(println)
(Amul,{Display_Name: Amul, Store_ID_name: null})
(Nestle,{Display_Name: Nestle, Store_ID_name: null})
(Ace,{Display_Name: Ace , Store_ID_name: null})
(Acme ,{Display_Name: Acme Fresh Market, Store_ID_name: Acme Markets})
Now suppose the input string to a function is Amul; my expected output for Display_Name is Amul, and another function for Store_ID_name should return NULL.
How can I achieve this?
I don't want to use SparkSQL for this purpose
Given input dataframe as
+-----------------+-----------------+-------------+
|Merchant_Name |Display_Name |Store_ID_name|
+-----------------+-----------------+-------------+
|Fitch |Fitch |null |
|Kids |Kids |null |
|Ace Hardware |Ace Hardware |null |
| Fresh Market |Acme Market |Acme Markets |
|Adventure | Island |null |
+-----------------+-----------------+-------------+
You can write a function with string parameter as
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name")
And calling the function as
filterRowsWithKey("Fitch").show(false)
would give you
+------------+-------------+
|Display_Name|Store_ID_name|
+------------+-------------+
|Fitch |null |
+------------+-------------+
I hope the answer is helpful
Updated
If you want first row as string to be returned from the function then you can do
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name").first().mkString(",")
println(filterRowsWithKey("Fitch"))
which should give you
Fitch,null
The above function will throw an exception if the key passed is not found, so to be safe you can use the following function:
import org.apache.spark.sql.functions._
def filterRowsWithKey(key: String) = {
val filteredDF = df.filter(col("Merchant_Name") === key).select("Display_Name", "Store_ID_name")
if(filteredDF.count() > 0) filteredDF.first().mkString(",") else "key not found"
}
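Since the question starts from a key-value RDD rather than a DataFrame, a minimal sketch directly on the asker's mapdf (assuming it stays an RDD[(String, String)]) could instead use lookup from PairRDDFunctions:
//lookup returns every value stored under the key, or an empty Seq when the key is absent
def findByMerchant(key: String): String =
  mapdf.lookup(key).headOption.getOrElse("key not found")

println(findByMerchant("Amul")) // {Display_Name: Amul, Store_ID_name: null}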

Change value of nested column in DataFrame

I have a dataframe with two levels of nested fields:
root
|-- request: struct (nullable = true)
| |-- dummyID: string (nullable = true)
| |-- data: struct (nullable = true)
| | |-- fooID: string (nullable = true)
| | |-- barID: string (nullable = true)
I want to update the value of the fooID column here. I was able to update the value at the first level, for example the dummyID column, by using this question as a reference: How to add a nested column to a DataFrame
Input data:
{
"request": {
"dummyID": "test_id",
"data": {
"fooID": "abc",
"barID": "1485351"
}
}
}
output data:
{
"request": {
"dummyID": "test_id",
"data": {
"fooID": "def",
"barID": "1485351"
}
}
}
How can I do it using Scala?
Here is a generic solution to this problem that makes it possible to update any number of nested values, at any level, based on an arbitrary function applied in a recursive traversal:
import org.apache.spark.sql.{Column, DataFrame, Dataset, Encoder, Row}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

def mutate(df: DataFrame, fn: Column => Column): DataFrame = {
// Get a projection with fields mutated by `fn` and select it
// out of the original frame with the schema reassigned to the original
// frame (explained later)
df.sqlContext.createDataFrame(df.select(traverse(df.schema, fn):_*).rdd, df.schema)
}
def traverse(schema: StructType, fn: Column => Column, path: String = ""): Array[Column] = {
schema.fields.map(f => {
f.dataType match {
case s: StructType => struct(traverse(s, fn, path + f.name + "."): _*)
case _ => fn(col(path + f.name))
}
})
}
This is effectively equivalent to the usual "just redefine the whole struct as a projection" solutions, but it automates re-nesting fields with the original structure AND preserves nullability/metadata (which are lost when you redefine the structs manually). Annoyingly, preserving those properties isn't possible while creating the projection (afaict) so the code above redefines the schema manually.
An example application:
case class Organ(name: String, count: Int)
case class Disease(id: Int, name: String, organ: Organ)
case class Drug(id: Int, name: String, alt: Array[String])
val df = Seq(
(1, Drug(1, "drug1", Array("x", "y")), Disease(1, "disease1", Organ("heart", 2))),
(2, Drug(2, "drug2", Array("a")), Disease(2, "disease2", Organ("eye", 3)))
).toDF("id", "drug", "disease")
df.show(false)
+---+------------------+-------------------------+
|id |drug |disease |
+---+------------------+-------------------------+
|1 |[1, drug1, [x, y]]|[1, disease1, [heart, 2]]|
|2 |[2, drug2, [a]] |[2, disease2, [eye, 3]] |
+---+------------------+-------------------------+
// Update the integer field ("count") at the lowest level:
val df2 = mutate(df, c => if (c.toString == "disease.organ.count") c - 1 else c)
df2.show(false)
+---+------------------+-------------------------+
|id |drug |disease |
+---+------------------+-------------------------+
|1 |[1, drug1, [x, y]]|[1, disease1, [heart, 1]]|
|2 |[2, drug2, [a]] |[2, disease2, [eye, 2]] |
+---+------------------+-------------------------+
// This will NOT necessarily be equal unless the metadata and nullability
// of all fields is preserved (as the code above does)
assertResult(df.schema.toString)(df2.schema.toString)
A limitation of this is that it cannot add new fields, only update existing ones (though the map can be changed into a flatMap and the function to return Array[Column] for that, if you don't care about preserving nullability/metadata).
Additionally, here is a more generic version for Dataset[T]:
case class Record(id: Int, drug: Drug, disease: Disease)
def mutateDS[T](df: Dataset[T], fn: Column => Column)(implicit enc: Encoder[T]): Dataset[T] = {
df.sqlContext.createDataFrame(df.select(traverse(df.schema, fn):_*).rdd, enc.schema).as[T]
}
// To call as typed dataset:
val fn: Column => Column = c => if (c.toString == "disease.organ.count") c - 1 else c
mutateDS(df.as[Record], fn).show(false)
// To call as untyped dataset:
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) // This is necessary regardless of sparkSession.implicits._ imports
mutateDS(df, fn).show(false)
One way, although cumbersome, is to fully unpack and recreate the column by explicitly referencing each element of the original struct.
dataFrame.withColumn("person",
  struct(
    col("person.age").alias("age"),
    struct(
      col("person.name.first").alias("first"),
      lit("some new value").alias("last")).alias("name")))