I'm trying to turn the result of a query into nested JSON using the array_to_json and row_to_json functions that were added in PostgreSQL. I'm having trouble figuring out the best way to handle the nested objects. Here is the query I've tried, the contents of table test3, and the output I get:
select array_to_json(array_agg(row_to_json(test3)))
from (select student_name, student_id, promoted
      from test3
      where loc in (select max(loc) from test3)) test3;
 student_id | student_name | promoted | maths | Science | loc
------------+--------------+----------+-------+---------+-------------------------------
         19 | John         | yes      |    75 |      76 | 2022-06-28 06:10:27.25911-04
         12 | Peter        | no       |    79 |      65 | 2022-06-28 06:10:27.25911-04
         87 | Steve        | yes      |    69 |      76 | 2022-06-28 06:59:57.216754-04
         98 | Smith        | yes      |    78 |      82 | 2022-06-28 06:59:57.216754-04
[ { "student_name": "Steve", "student_id" : 87, "promoted" :
"yes", }, { "student_name": "Smith", "student_id" : 98,
"promoted" : "yes",
} ]
But I need to generate JSON output like this instead:
{ "students" : [
{ "student_name" : "Steve", "Details":{
"student_id" : 87,
"promoted" : "yes" }
}, { "student_name" : "Smith", "Details":{
"student_id" : 98,
"promoted" : "yes" }}]}
I am working on some complex logic where I need to redistribute a quantity from one dataset to another dataset.
This question is a continuation of an earlier question.
In the example below I am introducing several new dimensions. After aggregating and distributing all the quantities I expect the same total quantity, but I am seeing some differences.
See the example below:
package playground
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, round, sum}
object sample3 {

  val spark = SparkSession
    .builder()
    .appName("Sample app")
    .master("local")
    .getOrCreate()

  val sc = spark.sparkContext

  final case class Owner(a: Long,
                         b: String,
                         c: Long,
                         d: Short,
                         e: String,
                         f: String,
                         o_qtty: Double)

  // notice column d is not present in Invoice
  final case class Invoice(c: Long,
                           a: Long,
                           b: String,
                           e: String,
                           f: String,
                           i_qtty: Double)
  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.OFF)

    import spark.implicits._

    val ownerData = Seq(
      Owner(11, "A", 666, 2017, "x", "y", 50),
      Owner(11, "A", 222, 2018, "x", "y", 20),
      Owner(33, "C", 444, 2018, "x", "y", 20),
      Owner(33, "C", 555, 2018, "x", "y", 120),
      Owner(22, "B", 555, 2018, "x", "y", 20),
      Owner(99, "D", 888, 2018, "x", "y", 100),
      Owner(11, "A", 888, 2018, "x", "y", 100),
      Owner(11, "A", 666, 2018, "x", "y", 80),
      Owner(33, "C", 666, 2018, "x", "y", 80),
      Owner(11, "A", 444, 2018, "x", "y", 50)
    )

    val invoiceData = Seq(
      Invoice(444, 33, "C", "x", "y", 10),
      Invoice(999, 22, "B", "x", "y", 200),
      Invoice(666, 11, "A", "x", "y", 15),
      Invoice(555, 22, "B", "x", "y", 200),
      Invoice(888, 11, "A", "x", "y", 12)
    )

    val owners = spark
      .createDataset(ownerData)
      .as[Owner]
      .cache()

    val invoices = spark
      .createDataset(invoiceData)
      .as[Invoice]
      .cache()
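    // p1: owner rows that have a matching invoice; reduce o_qtty by i_qtty,
    // unless that would go negative, in which case keep o_qtty and distribute nothing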
    val p1 = owners
      .join(invoices, Seq("a", "c", "e", "f", "b"))
      .selectExpr(
        "a",
        "d",
        "b",
        "e",
        "f",
        "c",
        "IF(o_qtty - i_qtty < 0, o_qtty, o_qtty - i_qtty) AS qtty",
        "IF(o_qtty - i_qtty < 0, 0, i_qtty) AS to_distribute"
      )
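    // p2: owner rows with no matching invoice (the left join produced nulls);
    // keep the original quantity as qtty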
    val p2 = owners
      .join(invoices, Seq("a", "c", "e", "f", "b"), "left_outer")
      .filter(row => row.anyNull)
      .drop(col("i_qtty"))
      .withColumnRenamed("o_qtty", "qtty")
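    // distribute: the invoiced quantity that still has to be redistributed, per group;
    // proportion: the total unmatched quantity per group, used as the redistribution base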
    val distribute = p1
      .groupBy("a", "d", "b", "e", "f")
      .agg(sum(col("to_distribute")).as("to_distribute"))

    val proportion = p2
      .groupBy("a", "d", "b", "e", "f")
      .agg(sum(col("qtty")).as("proportion"))
    val result = p2
      .join(distribute, Seq("a", "d", "b", "e", "f"))
      .join(proportion, Seq("a", "d", "b", "e", "f"))
      .withColumn(
        "qtty",
        round(((col("to_distribute") / col("proportion")) * col("qtty")) + col("qtty"), 2)
      )
      .drop("to_distribute", "proportion")
      .union(p1.drop("to_distribute"))

    result.show(false)
    result.selectExpr("SUM(qtty)").show()
    owners.selectExpr("SUM(o_qtty)").show()
/*
+---+----+---+---+---+---+-----+
|a |d |b |e |f |c |qtty |
+---+----+---+---+---+---+-----+
|11 |2018|A |x |y |222|27.71|
|33 |2018|C |x |y |555|126.0|
|33 |2018|C |x |y |666|84.0 |
|11 |2018|A |x |y |444|69.29|
|11 |2017|A |x |y |666|35.0 |
|33 |2018|C |x |y |444|10.0 |
|22 |2018|B |x |y |555|20.0 |
|11 |2018|A |x |y |888|88.0 |
|11 |2018|A |x |y |666|65.0 |
+---+----+---+---+---+---+-----+
+---------+
|sum(qtty)|
+---------+
| 525.0|
+---------+
+-----------+
|sum(o_qtty)|
+-----------+
| 640.0|
+-----------+
*/
}
}
Also, note that the aggregation must not produce any negative quantities.
Here is the code where changes were necessary:
val distribute = p1
  .groupBy("a", "b", "e", "f") // now we don't need to aggregate by field "d"
  .agg(sum(col("to_distribute")).as("to_distribute"))

val proportion = p2
  .groupBy("a", "b", "e", "f") // now we don't need to aggregate by field "d"
  .agg(sum(col("qtty")).as("proportion"))

// Here we remove "d" from the join.
// If the distribution is null (there is no data in invoices for that owner)
// then we keep the original "qtty".
// Column "d" from the p2 dataframe is renamed as "year".
val result = p2
  .join(distribute, Seq("a", "b", "e", "f"), "left_outer")
  .join(proportion, Seq("a", "b", "e", "f"))
  .selectExpr(
    "a",
    "b",
    "e",
    "f",
    "c",
    "IF(ROUND(((to_distribute / proportion) * qtty) + qtty, 2) IS NULL, qtty, ROUND(((to_distribute / proportion) * qtty) + qtty, 2)) AS qtty",
    "d AS year"
  )
  .union(p1.withColumn("year", col("d")).drop("d", "to_distribute"))
  .orderBy(col("b"))
****EXPECTED OUTPUT****
+---+---+---+---+---+-----+----+
|a |b |e |f |c |qtty |year|
+---+---+---+---+---+-----+----+
|11 |A |x |y |444|80.0 |2018|
|11 |A |x |y |222|32.0 |2018|
|11 |A |x |y |666|65.0 |2018|
|11 |A |x |y |888|88.0 |2018|
|11 |A |x |y |666|35.0 |2017|
|22 |B |x |y |555|20.0 |2018|
|33 |C |x |y |555|126.0|2018|
|33 |C |x |y |444|10.0 |2018|
|33 |C |x |y |666|84.0 |2018|
|99 |D |x |y |888|100.0|2018|
+---+---+---+---+---+-----+----+
+---------+
|sum(qtty)|
+---------+
| 640.0|
+---------+
+-----------+
|sum(o_qtty)|
+-----------+
| 640.0|
+-----------+
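As a quick sanity check on the two constraints above (the redistributed total must match the original total, and the aggregation must not produce negative quantities), something along these lines could be appended after result is built. This is only a sketch, and it assumes a Spark version with Dataset.isEmpty (2.4+):

// sanity-check sketch (not part of the original solution)
val totalResult = result.agg(sum("qtty")).first().getDouble(0)
val totalOwners = owners.agg(sum("o_qtty")).first().getDouble(0)
assert(math.abs(totalResult - totalOwners) < 0.01, s"totals differ: $totalResult vs $totalOwners")
assert(result.filter(col("qtty") < 0).isEmpty, "negative quantity produced by the redistribution")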
I have a dataframe in which one of the string-type columns contains a list of items that I want to explode and make part of the parent dataframe. How can I do that?
Here is the code to create a sample dataframe:
from pyspark.sql import Row
from collections import OrderedDict
def convert_to_row(d: dict) -> Row:
    return Row(**OrderedDict(sorted(d.items())))

df = sc.parallelize([
    {"arg1": "first", "arg2": "John", "arg3": '[{"name" : "click", "datetime" : "1570103345039", "event" : "entry" }, {"name" : "drag", "datetime" : "1580133345039", "event" : "exit" }]'},
    {"arg1": "second", "arg2": "Joe", "arg3": '[{"name" : "click", "datetime" : "1670105345039", "event" : "entry" }, {"name" : "drop", "datetime" : "1750134345039", "event" : "exit" }]'},
    {"arg1": "third", "arg2": "Jane", "arg3": '[{"name" : "click", "datetime" : "1580105245039", "event" : "entry" }, {"name" : "drop", "datetime" : "1650134345039", "event" : "exit" }]'}
]).map(convert_to_row).toDF()
Running this code will create a dataframe as shown below:
+------+----+--------------------+
| arg1|arg2| arg3|
+------+----+--------------------+
| first|John|[{"name" : "click...|
|second| Joe|[{"name" : "click...|
| third|Jane|[{"name" : "click...|
+------+----+--------------------+
The arg3 column contains a list which I want to explode into detailed columns. I want the dataframe to have the following columns:
arg1 | arg2 | arg3 | name | datetime | event
How can I achieve that?
You need to specify an array type in the schema passed to the from_json function:
from pyspark.sql.functions import explode, from_json
schema = 'array<struct<name:string,datetime:string,event:string>>'
df.withColumn('data', explode(from_json('arg3', schema))) \
  .select(*df.columns, 'data.*') \
  .show()
+------+----+--------------------+-----+-------------+-----+
| arg1|arg2| arg3| name| datetime|event|
+------+----+--------------------+-----+-------------+-----+
| first|John|[{"name" : "click...|click|1570103345039|entry|
| first|John|[{"name" : "click...| drag|1580133345039| exit|
|second| Joe|[{"name" : "click...|click|1670105345039|entry|
|second| Joe|[{"name" : "click...| drop|1750134345039| exit|
| third|Jane|[{"name" : "click...|click|1580105245039|entry|
| third|Jane|[{"name" : "click...| drop|1650134345039| exit|
+------+----+--------------------+-----+-------------+-----+
Note: if your Spark version does not support the simpleString format for schemas, try the following:
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

schema = ArrayType(
    StructType([
        StructField('name', StringType()),
        StructField('datetime', StringType()),
        StructField('event', StringType())
    ])
)
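This ArrayType schema can then be passed to from_json in exactly the same way as the simpleString schema above, for example:

df.withColumn('data', explode(from_json('arg3', schema))) \
  .select(*df.columns, 'data.*') \
  .show()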
I am working with a DataFrame created from this json:
{"id" : "1201", "name" : "satish", "age" : "25"},
{"id" : "1202", "name" : "krishna", "age" : "28"},
{"id" : "1203", "name" : "amith", "age" : "39"},
{"id" : "1204", "name" : "javed", "age" : "23"},
{"id" : "1205", "name" : "mendy", "age" : "25"},
{"id" : "1206", "name" : "rob", "age" : "24"},
{"id" : "1207", "name" : "prudvi", "age" : "23"}
Initially the Dataframe looks like this:
+---+----+-------+
|age| id| name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203| amith|
| 23|1204| javed|
| 25|1205| mendy|
| 24|1206| rob|
| 23|1207| prudvi|
+---+----+-------+
What I need is to group all students with the same age, ordering them by their id. This is how I'm approaching it so far:
*Note: I'm pretty sure there are more efficient ways than adding a new column using withColumn("newCol", ..) and then selecting it with select("newCol"), but I don't know how to do it better.
val conf = new SparkConf().setAppName("SimpleApp")
  .set("spark.driver.allowMultipleContexts", "true").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.read.json("students.json")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val mergedDF = df.withColumn("newCol",
  collect_list(struct("age", "id", "name")).over(Window.partitionBy("age").orderBy("id")))
  .select("newCol")
The output I am getting is this:
[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([24,1206,rob])]
[WrappedArray([23,1204,javed])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([28,1202,krishna])]
[WrappedArray([39,1203,amith])]
Now, how can I keep only the rows which have more than one element? That is, I want my final dataframe to be:
[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
My best approach so far is:
val mergedDF = df.withColumn("newCol",
  collect_list(struct("age", "id", "name")).over(Window.partitionBy("age").orderBy("id")))
val filterd = mergedDF.withColumn("count", count("age").over(Window.partitionBy("age")))
  .filter($"count" > 1).select("newCol")
But I must be missing something, because the result is not the expected one:
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([25,1201,satish])]
[WrappedArray([25,1201,satish], [25,1205,mendy])]
You can use size() to filter your data:
import org.apache.spark.sql.functions.{col,size}
mergedDF.filter(size(col("newCol"))>1).show(false)
+---+----+------+-----------------------------------+
|age|id |name |newCol |
+---+----+------+-----------------------------------+
|23 |1207|prudvi|[[23,1204,javed], [23,1207,prudvi]]|
|25 |1205|mendy |[[25,1201,satish], [25,1205,mendy]]|
+---+----+------+-----------------------------------+
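If you would rather avoid the running window frame altogether (it is what produces the partial lists such as [25,1201,satish] in the first place), a plain groupBy can build one sorted list per age. This is just an alternative sketch, not part of the answer above:

import org.apache.spark.sql.functions.{col, collect_list, size, sort_array, struct}

// one row per age: collect the structs, sort them (structs compare field by field,
// so effectively by id within an age group), and keep ages with more than one student
val grouped = df
  .groupBy("age")
  .agg(sort_array(collect_list(struct("age", "id", "name"))).as("newCol"))
  .filter(size(col("newCol")) > 1)

grouped.show(false)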