I have the following csv file:
LID,Name,age,CID
122,David,29,ECB4
122,Frank,31,ECB4
567,David,29,ECB4
567,Daniel,35,ECB4
I want to group the data first by CID and then by LID and save it as JSON so that it has roughly this structure:
{
  "CID": "ECB4",
  "logs": [
    {
      "LID": 122,
      "body": [
        { "name": "David", "age": 29 },
        { "name": "Frank", "age": 31 }
      ]
    },
    {
      "LID": 567,
      "body": [
        { "name": "David", "age": 29 },
        { "name": "Daniel", "age": 35 }
      ]
    }
  ]
}
I have already defined a schema and loaded the data into a dataframe:
sparkSession.sqlContext.read.format("csv")
.option("delimiter",",").schema(someSchema).load("...")
But I have no idea how to group the dataframe in the wanted way. The groupBy function returns a RelationalGroupedDataset, which I cannot save as JSON.
A SQL query requires an aggregation after the grouping.
I would appreciate any help.
groupBy only defines the groupings that you can later run aggregations on. In order to have the result saved in JSON format, you have to define the final aggregation that will act on those groupings.
groupBy(col1: String, cols: String*): RelationalGroupedDataset Groups the Dataset using the specified columns, so that we can run aggregation on them.
See RelationalGroupedDataset for all the available aggregate functions.
In other words, you have to execute aggregations using RelationalGroupedDataset interface from which you can use the most generic agg operator.
agg(expr: Column, exprs: Column*): DataFrame Compute aggregates by specifying a series of aggregate columns.
If I'm not mistaken (by looking at the output JSON file), you do groupBy to collect the name and age fields per LID.
You should do the following then:
// Load your dataset
val cids = spark.read.option("header", true).csv("cids.csv")
scala> cids.show
+---+------+---+----+
|LID| Name|age| CID|
+---+------+---+----+
|122| David| 29|ECB4|
|122| Frank| 31|ECB4|
|567| David| 29|ECB4|
|567|Daniel| 35|ECB4|
+---+------+---+----+
Given that dataset, you first have to wrap the name and age columns in a struct, since they participate in the aggregation together.
val name_ages = cids.withColumn("name_age", struct("name", "age"))
scala> name_ages.show
+---+------+---+----+-----------+
|LID| Name|age| CID| name_age|
+---+------+---+----+-----------+
|122| David| 29|ECB4| [David,29]|
|122| Frank| 31|ECB4| [Frank,31]|
|567| David| 29|ECB4| [David,29]|
|567|Daniel| 35|ECB4|[Daniel,35]|
+---+------+---+----+-----------+
Now, it should be fairly straightforward.
val logs = name_ages.groupBy("CID", "LID")
.agg(collect_list("name_age") as "logs") // <-- that's the missing piece in the puzzle
scala> logs.show(truncate = false)
+----+---+-------------------------+
|CID |LID|logs |
+----+---+-------------------------+
|ECB4|122|[[David,29], [Frank,31]] |
|ECB4|567|[[David,29], [Daniel,35]]|
+----+---+-------------------------+
Save away...(left as a home exercise :))
Hint: You may want to use struct once more.
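For completeness, a minimal sketch of that exercise (wrapping LID and its rows in a struct, the "body" rename, and the output path are my own choices, not spelled out in the answer above):

```scala
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Wrap each (LID, collected rows) pair in a struct, then collect those per CID.
val nested = logs
  .withColumn("log", struct(col("LID"), col("logs") as "body"))
  .groupBy("CID")
  .agg(collect_list("log") as "logs")

// One JSON document per CID, matching the structure in the question.
nested.write.json("cids-nested.json")  // hypothetical output path
```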
Related
How do I write the code below in a type-safe manner in Spark Scala with the Dataset API:
val schema: StructType = Encoders.product[CaseClass].schema
//read json from a file
val readAsDataSet: Dataset[CaseClass] = sparkSession.read.option("mode", mode).schema(schema).json(path).as[CaseClass]
//below code needs to be written in type safe way:
val someDF= readAsDataSet.withColumn("col1",explode(col("col_to_be_exploded")))
.select(from_unixtime(col("timestamp").divide(1000))
.as("date"), col("col1"))
As someone in the comments said, you can create a Dataset[CaseClass] and do your operations on there. Let's set it up:
import spark.implicits._
case class MyTest (timestamp: Long, col_explode: Seq[String])
val df = Seq(
MyTest(1673850366000L, Seq("some", "strings", "here")),
MyTest(1271850365998L, Seq("pasta", "with", "cream")),
MyTest(611850366000L, Seq("tasty", "food"))
).toDF("timestamp", "col_explode").as[MyTest]
df.show(false)
+-------------+---------------------+
|timestamp |col_explode |
+-------------+---------------------+
|1673850366000|[some, strings, here]|
|1271850365998|[pasta, with, cream] |
|611850366000 |[tasty, food] |
+-------------+---------------------+
Typically, you can do many operations with the map function and plain Scala.
A map function returns the same number of elements as its input. The explode function that you're using, however, does not return the same number of elements. You can implement that behaviour using the flatMap function.
So, using the Scala language and the flatMap function together, you can do something like this:
import java.time.LocalDateTime
import java.time.ZoneOffset
case class Exploded (datetime: String, exploded: String)
val output = df.flatMap{ case MyTest(timestamp, col_explode) =>
col_explode.map( value => {
val date = LocalDateTime.ofEpochSecond(timestamp/1000, 0, ZoneOffset.UTC).toString
Exploded(date, value)
}
)
}
output.show(false)
+-------------------+--------+
|datetime |exploded|
+-------------------+--------+
|2023-01-16T06:26:06|some |
|2023-01-16T06:26:06|strings |
|2023-01-16T06:26:06|here |
|2010-04-21T11:46:05|pasta |
|2010-04-21T11:46:05|with |
|2010-04-21T11:46:05|cream |
|1989-05-22T14:26:06|tasty |
|1989-05-22T14:26:06|food |
+-------------------+--------+
As you see, we've created a second case class called Exploded which we use to type our output dataset. Our output dataset has the following type: org.apache.spark.sql.Dataset[Exploded] so everything is completely type safe.
I have seen some variations of this question asked but haven't found exactly what I'm looking for. Here is the question:
I have some report names that I have collected in a dataframe and pivoted. The trouble I am having concerns the resilience of report_name. I can't be assured that data will be present every 90 days and that Rpt1, Rpt2, and Rpt3 will all be there. So how do I go about creating a calculation ONLY if the column is present? I have outlined how my code looks right now. It works if all columns are there, but I'd like to future-proof it so that if a report is not present in the 90 day window the pipeline will not error out, but instead just skips the .withColumn addition.
df1 = (reports.alias("r")
  .filter("current_date <= 90")
  .groupBy(uniqueid)
  .pivot(report_name))
/*
Result would be the following columns: uniqueid, Rpt1, Rpt2, Rpt3
+---+----+----+----+
|id |Rpt1|Rpt2|Rpt3|
+---+----+----+----+
|205|72  |36  |12  |
+---+----+----+----+
*/
df2 = (df1.alias("d1")
  .withColumn("new_calc", expr("Rpt2/Rpt3")))
You can catch the error with a Try monad and return the original dataframe if withColumn fails.
import scala.util.Try
val df2 = Try(df1.withColumn("new_calc", expr("Rpt2/Rpt3")))
.getOrElse(df1)
.alias("d1")
You can also define it as a method if you want to reuse it:
import org.apache.spark.sql.{Column, DataFrame}
def withColumnIfExist(df: DataFrame, colName: String, col: Column) =
  Try(df.withColumn(colName, col)).getOrElse(df)
val df3 = withColumnIfExist(df1, "new_calc", expr("Rpt2/Rpt3"))
  .alias("d1")
And if you need to chain multiple transformations you can use it with transform:
val df4 = df1.alias("d1")
.transform(withColumnIfExist(_, "new_calc", expr("Rpt2/Rpt3")))
.transform(withColumnIfExist(_, "new_calc_2", expr("Rpt1/Rpt2")))
Or you can implement it as an extension method with an implicit class:
implicit class RichDataFrame(df: DataFrame) {
  def withColumnIfExist(colName: String, col: Column): DataFrame =
    Try(df.withColumn(colName, col)).getOrElse(df)
}
val df5 = df1.alias("d1")
.withColumnIfExist("new_calc", expr("Rpt2/Rpt3"))
.withColumnIfExist("new_calc_2", expr("Rpt1/Rpt2"))
Since withColumn works on any Dataset, you can also make withColumnIfExist generic so it works for every Dataset, including DataFrames:
import org.apache.spark.sql.Dataset
implicit class RichDataset[A](ds: Dataset[A]) {
  def withColumnIfExist(colName: String, col: Column): DataFrame =
    Try(ds.withColumn(colName, col)).getOrElse(ds.toDF)
}
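As an alternative to catching the AnalysisException, here is a minimal sketch that checks the column names up front instead; withColumnIfPresent and its required-columns parameter are my own naming, not part of the answer above:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.expr

// Add the column only when every column it depends on is actually present.
def withColumnIfPresent(df: DataFrame, colName: String, col: Column, required: Seq[String]): DataFrame =
  if (required.forall(df.columns.contains)) df.withColumn(colName, col) else df

val df6 = withColumnIfPresent(df1, "new_calc", expr("Rpt2/Rpt3"), Seq("Rpt2", "Rpt3"))
  .alias("d1")
```

This avoids swallowing unrelated analysis errors that the Try(...) approach would also catch.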
I have records like below. If the third attribute is All, I would like to convert that single record into two records, one with the value EXTERNAL and one with INTERNAL.
Input dataset:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,All
Expected output:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,INTERNAL
Ajay,tcs,EXTERNAL
My Spark code:
case class Customer(name: String, organisation: String, campaign_type: String)
val custRDD = sc.textFile("/user/cloudera/input_files/customer.txt")
val mapRDD = custRDD.map(record => record.split(","))
  .map(arr => (arr(0), arr(1), arr(2)))
  .map(tuple => {
    val name = tuple._1.trim
    val organisation = tuple._2.trim
    val campaign_type = tuple._3.trim.toUpperCase
    Customer(name, organisation, campaign_type)
  })
mapRDD.toDF().registerTempTable("customer_processed")
sqlContext.sql("SELECT * FROM customer_processed").show
Could someone help me fix this issue?
Since it's Scala...
If you want to write more idiomatic Scala code (perhaps trading some performance, since Spark cannot optimize the lambda the way it optimizes built-in operators), you can use the flatMap operator (the implicit parameter is removed from the signature below):
flatMap[U](func: (T) ⇒ TraversableOnce[U]): Dataset[U] Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
NOTE: flatMap is equivalent to explode function, but you don't have to register a UDF (as in the other answer).
A solution could be as follows:
// I don't care about the names of the columns since we use Scala
// as you did when you tried to write the code
scala> input.show
+--------+---+--------+
| _c0|_c1| _c2|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs| All|
+--------+---+--------+
val result = input.
as[(String, String, String)].
flatMap { case r @ (name, org, campaign) =>
if ("all".equalsIgnoreCase(campaign)) {
Seq("INTERNAL", "EXTERNAL").map { cname =>
(name, org, cname)
}
} else Seq(r)
}
scala> result.show
+--------+---+--------+
| _1| _2| _3|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs|INTERNAL|
| Ajay|tcs|EXTERNAL|
+--------+---+--------+
Comparing the performance of the two queries, i.e. the flatMap-based vs the explode-based one, I think the explode-based query may be slightly faster and better optimized, since some of the code is under Spark's control (it uses logical operators before they get mapped to their physical counterparts). With flatMap, the entire optimization is your responsibility as a Scala developer.
In the web UI, the part of the query plan that corresponds to the flatMap-based code is flagged with the very expensive DeserializeToObject and SerializeFromObject operators.
What's also interesting is the number of Spark jobs per query and their durations. It appears that the explode-based query takes 2 Spark jobs and 200 ms, while the flatMap-based one takes only 1 Spark job and 43 ms.
That surprises me a lot and suggests that the flatMap-based query could actually be faster (!)
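If you want to check this yourself, a small sketch (not part of the original comparison) that prints the physical plans; look for DeserializeToObject / SerializeFromObject in the flatMap-based one:

```scala
// Physical plan of the flatMap-based query defined above.
result.explain()

// The executed plan is also available programmatically.
println(result.queryExecution.executedPlan)
```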
You can use a UDF to transform the campaign_type column into a Seq of campaign types and then explode it:
val campaignType_ : (String => Seq[String]) = {
case s if s == "ALL" => Seq("EXTERNAL", "INTERNAL")
case s => Seq(s)
}
val campaignType = udf(campaignType_)
val df = Seq(("Surender", "cts", "INTERNAL"),
("Raja", "cts", "EXTERNAL"),
("Ajay", "tcs", "ALL"))
.toDF("name", "organisation", "campaign_type")
val step1 = df.withColumn("campaign_type", campaignType($"campaign_type"))
step1.show
// +--------+------------+--------------------+
// | name|organisation| campaign_type|
// +--------+------------+--------------------+
// |Surender| cts| [INTERNAL]|
// | Raja| cts| [EXTERNAL]|
// | Ajay| tcs|[EXTERNAL, INTERNAL]|
// +--------+------------+--------------------+
val step2 = step1.select($"name", $"organisation", explode($"campaign_type"))
step2.show
// +--------+------------+--------+
// | name|organisation| col|
// +--------+------------+--------+
// |Surender| cts|INTERNAL|
// | Raja| cts|EXTERNAL|
// | Ajay| tcs|EXTERNAL|
// | Ajay| tcs|INTERNAL|
// +--------+------------+--------+
EDIT:
You don't actually need a udf; you can use a when().otherwise predicate instead for step1, as follows:
val step1 = df.withColumn("campaign_type",
  when(col("campaign_type") === "ALL", array(lit("EXTERNAL"), lit("INTERNAL")))
    .otherwise(array(col("campaign_type"))))
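The explode step afterwards is the same as step2 earlier; for completeness:

```scala
// Explode the array produced by when/otherwise, exactly as in step2 above.
val step2 = step1.select($"name", $"organisation", explode($"campaign_type") as "campaign_type")
step2.show()
```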
I have a Spark SQL question. I'd appreciate some guidance on the best way to do a conditional select from a nested array of structs.
I have an example json document below
```
{
"id":"p1",
"externalIds":[
{"system":"a","id":"1"},
{"system":"b","id":"2"},
{"system":"c","id":"3"}
]
}
```
In Spark SQL I want to select the "id" of one of the array structs based on some conditional logic.
E.g. for the above, select the id field of the array sub-element that has "system" = "b", i.e. the id of "2".
How best to do this in SparkSQL?
Cheers and thanks!
Using a UDF, this could look like this, given a DataFrame (all attributes of type String):
+---+---------------------+
|id |externalIds |
+---+---------------------+
|p1 |[[a,1], [b,2], [c,3]]|
+---+---------------------+
Define a UDF to traverse your array and find the desired element:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

def getExternal(system: String) = {
  udf((row: Seq[Row]) =>
    row.map(r => (r.getString(0), r.getString(1)))
      .find { case (s, _) => s == system }
      .map(_._2)
  )
}
and use it like this:
df
.withColumn("external",getExternal("b")($"externalIds"))
.show(false)
+---+---------------------+--------+
|id |externalIds |external|
+---+---------------------+--------+
|p1 |[[a,1], [b,2], [c,3]]|2 |
+---+---------------------+--------+
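If you are on Spark 2.4 or later, a UDF-free alternative is the built-in filter higher-order function; this is a sketch using the same column names as above (the table name in the SQL comment is a placeholder):

```scala
import org.apache.spark.sql.functions.expr

df
  .withColumn("external", expr("filter(externalIds, x -> x.system = 'b')[0].id"))
  .show(false)

// Equivalent in plain Spark SQL:
// SELECT id, filter(externalIds, x -> x.system = 'b')[0].id AS external FROM your_table
```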
Spark SQL/dataframes: please help me out or provide a good suggestion on how to read this JSON:
{
"billdate":"2016-08-08',
"accountid":"xxx"
"accountdetails":{
"total":"1.1"
"category":[
{
"desc":"one",
"currentinfo":{
"value":"10"
},
"subcategory":[
{
"categoryDesc":"sub",
"value":"10",
"currentinfo":{
"value":"10"
}
}]
}]
}
}
Thanks,
You can try the following code to read the JSON file based on a schema in Spark 2.2:
import org.apache.spark.sql.types.{DataType, StructType}
//Read Json Schema and Create Schema_Json
val schema_json=spark.read.json("/user/Files/ActualJson.json").schema.json
//add the schema
val newSchema=DataType.fromJson(schema_json).asInstanceOf[StructType]
//read the json files based on schema
val df=spark.read.schema(newSchema).json("Json_Files/Folder Path")
It seems like your JSON is not valid.
Please check it with http://www.jsoneditoronline.org/
Please see an-introduction-to-json-support-in-spark-sql.html
If you want to register it as a table, you can register it like below and print the schema.
DataFrame df = sqlContext.read().json("/path/to/validjsonfile").toDF();
df.registerTempTable("df");
df.printSchema();
Below is a sample code snippet:
DataFrame app = df.select("toplevel");
app.registerTempTable("toplevel");
app.printSchema();
app.show();
DataFrame appName = app.select("toplevel.sublevel");
appName.registerTempTable("sublevel");
appName.printSchema();
appName.show();
Example with Scala:
{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}
{"name":"Andy", "cities":["santa cruz"], "schools":[{"sname":"ucsb", "year":2011}]}
{"name":"Justin", "cities":["portland"], "schools":[{"sname":"berkeley", "year":2014}]}
val people = sqlContext.read.json("people.json")
people: org.apache.spark.sql.DataFrame
Reading top level field
val names = people.select('name).collect()
names: Array[org.apache.spark.sql.Row] = Array([Michael], [Andy], [Justin])
names.map(row => row.getString(0))
res88: Array[String] = Array(Michael, Andy, Justin)
Use the select() method to specify the top-level field, collect() to collect it into an Array[Row], and the getString() method to access a column inside each Row.
Flatten and Read a JSON Array
Each person has an array of "cities". Let's flatten these arrays and read out all of their elements.
val flattened = people.explode("cities", "city"){c: List[String] => c}
flattened: org.apache.spark.sql.DataFrame
val allCities = flattened.select('city).collect()
allCities: Array[org.apache.spark.sql.Row]
allCities.map(row => row.getString(0))
res92: Array[String] = Array(palo alto, menlo park, santa cruz, portland)
The explode() method explodes, or flattens, the cities array into a new column named "city". We then use select() to select the new column, collect() to collect it into an Array[Row], and getString() to access the data inside each Row.
Read an Array of Nested JSON Objects, Unflattened
Now read out the "schools" data, which is an array of nested JSON objects. Each element of the array holds the school name and year:
val schools = people.select('schools).collect()
schools: Array[org.apache.spark.sql.Row]
val schoolsArr = schools.map(row => row.getSeq[org.apache.spark.sql.Row](0))
schoolsArr: Array[Seq[org.apache.spark.sql.Row]]
schoolsArr.foreach(schools => {
schools.map(row => print(row.getString(0), row.getLong(1)))
print("\n")
})
(stanford,2010)(berkeley,2012)
(ucsb,2011)
(berkeley,2014)
Use select() and collect() to select the "schools" array and collect it into an Array[Row]. Now, each "schools" array is of type List[Row], so we read it out with the getSeq[Row]() method. Finally, we can read the information for each individual school, by calling getString() for the school name and getLong() for the school year.