Spark DataFrame: Pivot with sorting - Scala

I am reading the following JSON file into a DataFrame in Spark:
{"id" : "a", "country" : "uk", "date" : "2016-01-01"}
{"id" : "b", "country" : "uk", "date" : "2016-01-02"}
{"id" : "c", "country" : "fr", "date" : "2016-02-01"}
{"id" : "d", "country" : "de", "date" : "2016-03-01"}
{"id" : "e", "country" : "tk", "date" : "2016-04-01"}
{"id" : "f", "country" : "be", "date" : "2016-05-01"}
{"id" : "g", "country" : "nl", "date" : "2016-06-01"}
{"id" : "h", "country" : "uk", "date" : "2016-06-01"}
I then apply groupBy on it and pivot it on date; here's the (pseudo)code:
val df = spark.read.json("file.json")
val dfWithFormattedDate = df.withColumn("date", date_format(col("date"), "yyyy-MM"))
dfWithFormattedDate.groupBy("country").pivot("date").agg(countDistinct("id").alias("count")).orderBy("country")
This gives me a DataFrame with country and the pivoted dates (months) as columns. I would then like to order the results in descending order of total count. However, I don't have count as one of the columns, and I can't apply pivot after calling count() on the groupBy, since count() returns a Dataset and not a RelationalGroupedDataset. I have tried the following as well:
dfWithFormattedDate.groupBy("country").pivot("date").count()
This does not give me a count column either. Is there any way I can have both the count and the pivoted dates in the resulting Dataset, so that I can order by count descending?
Update
Here's the current output:
country|2016-01|2016-02|2016-03| ....
fr | null | 1 | null |
be | null | null | null |
uk | 2 | null | null |
Here's the expected output:
country|count|2016-01|2016-02|2016-03| ....
uk | 3 | 2 | null | null |
fr | 1 | null | 1 | null |
be | 1 | null | null | null |
As you can see, I need the count column in the result, with the rows ordered in descending order of count. Ordering without explicitly having a count column is fine as well.

If our starting point is this DataFrame:
import org.apache.spark.sql.functions.{date_format, col, countDistinct}

val result = df.withColumn("date", date_format(col("date"), "yyyy-MM"))
  .groupBy("country").pivot("date").agg(countDistinct("id").alias("count"))
  .na.fill(0)
We can then simply calculate the row sum over all columns except the country column:
import org.apache.spark.sql.functions.desc

val test = result.withColumn("count",
    result.columns
      .filter(_ != "country")
      .map(c => col(c))
      .reduce((x, y) => x + y))
  .orderBy(desc("count"))
test.show()
+-------+-------+-------+-------+-------+-------+-------+-----+
|country|2016-01|2016-02|2016-03|2016-04|2016-05|2016-06|count|
+-------+-------+-------+-------+-------+-------+-------+-----+
|     uk|      2|      0|      0|      0|      0|      1|    3|
|     be|      0|      0|      0|      0|      1|      0|    1|
|     de|      0|      0|      1|      0|      0|      0|    1|
|     tk|      0|      0|      0|      1|      0|      0|    1|
|     nl|      0|      0|      0|      0|      0|      1|    1|
|     fr|      0|      1|      0|      0|      0|      0|    1|
+-------+-------+-------+-------+-------+-------+-------+-----+
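If you would rather keep the nulls in the pivoted cells instead of filling them with 0, another option (just a sketch, reusing dfWithFormattedDate from the question) is to compute the per-country totals separately and join them back before sorting:

import org.apache.spark.sql.functions.{countDistinct, desc}

// Total distinct ids per country, computed on the unpivoted data
val totals = dfWithFormattedDate
  .groupBy("country")
  .agg(countDistinct("id").alias("count"))

// Pivot as before, then attach the totals and sort by them
val pivoted = dfWithFormattedDate
  .groupBy("country")
  .pivot("date")
  .agg(countDistinct("id"))

val ordered = pivoted
  .join(totals, Seq("country"))
  .orderBy(desc("count"))

The month columns stay nullable and the count column is appended at the end, so the result can be ordered exactly as in the row-sum approach above.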

Related

Query to create nested JSON object with PostgreSQL

I'm trying to convert the result of a query to nested JSON using the array_to_json and row_to_json functions that were added in PostgreSQL. I'm having trouble figuring out the best way to build the nested objects. Here is what I've tried with table test3:
select array_to_json(array_agg(row_to_json(test3)))
from (select student_name, student_id, promoted
      from test3
      where loc in (select max(loc) from test3)) test3;
Table test3:
student_id | student_name | promoted | maths | Science | loc
-----------|--------------|----------|-------|---------|------------------------------
        19 | John         | yes      | 75    | 76      | 2022-06-28 06:10:27.25911-04
        12 | Peter        | no       | 79    | 65      | 2022-06-28 06:10:27.25911-04
        87 | Steve        | yes      | 69    | 76      | 2022-06-28 06:59:57.216754-04
        98 | Smith        | yes      | 78    | 82      | 2022-06-28 06:59:57.216754-04
Current output:
[
  { "student_name": "Steve", "student_id": 87, "promoted": "yes" },
  { "student_name": "Smith", "student_id": 98, "promoted": "yes" }
]
But I need to generate the JSON output as below:
{
  "students": [
    { "student_name": "Steve",
      "Details": { "student_id": 87, "promoted": "yes" } },
    { "student_name": "Smith",
      "Details": { "student_id": 98, "promoted": "yes" } }
  ]
}

join datasets with different dimensions - how to aggregate data properly

I am working on a complex piece of logic where I need to redistribute a quantity from one dataset to another dataset.
This question is a continuation of this question.
In the example below I am introducing several new dimensions. After aggregating and distributing all the quantities I expect the same total quantity; however, I have some differences.
See the example below:
package playground

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, round, sum}

object sample3 {

  val spark = SparkSession
    .builder()
    .appName("Sample app")
    .master("local")
    .getOrCreate()

  val sc = spark.sparkContext

  final case class Owner(a: Long,
                         b: String,
                         c: Long,
                         d: Short,
                         e: String,
                         f: String,
                         o_qtty: Double)

  // notice column d is not present in Invoice
  final case class Invoice(c: Long,
                           a: Long,
                           b: String,
                           e: String,
                           f: String,
                           i_qtty: Double)

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.OFF)

    import spark.implicits._

    val ownerData = Seq(
      Owner(11, "A", 666, 2017, "x", "y", 50),
      Owner(11, "A", 222, 2018, "x", "y", 20),
      Owner(33, "C", 444, 2018, "x", "y", 20),
      Owner(33, "C", 555, 2018, "x", "y", 120),
      Owner(22, "B", 555, 2018, "x", "y", 20),
      Owner(99, "D", 888, 2018, "x", "y", 100),
      Owner(11, "A", 888, 2018, "x", "y", 100),
      Owner(11, "A", 666, 2018, "x", "y", 80),
      Owner(33, "C", 666, 2018, "x", "y", 80),
      Owner(11, "A", 444, 2018, "x", "y", 50)
    )

    val invoiceData = Seq(
      Invoice(444, 33, "C", "x", "y", 10),
      Invoice(999, 22, "B", "x", "y", 200),
      Invoice(666, 11, "A", "x", "y", 15),
      Invoice(555, 22, "B", "x", "y", 200),
      Invoice(888, 11, "A", "x", "y", 12)
    )

    val owners = spark
      .createDataset(ownerData)
      .as[Owner]
      .cache()

    val invoices = spark
      .createDataset(invoiceData)
      .as[Invoice]
      .cache()

    val p1 = owners
      .join(invoices, Seq("a", "c", "e", "f", "b"))
      .selectExpr(
        "a",
        "d",
        "b",
        "e",
        "f",
        "c",
        "IF(o_qtty-i_qtty < 0,o_qtty,o_qtty - i_qtty) AS qtty",
        "IF(o_qtty-i_qtty < 0,0,i_qtty) AS to_distribute"
      )

    val p2 = owners
      .join(invoices, Seq("a", "c", "e", "f", "b"), "left_outer")
      .filter(row => row.anyNull)
      .drop(col("i_qtty"))
      .withColumnRenamed("o_qtty", "qtty")

    val distribute = p1
      .groupBy("a", "d", "b", "e", "f")
      .agg(sum(col("to_distribute")).as("to_distribute"))

    val proportion = p2
      .groupBy("a", "d", "b", "e", "f")
      .agg(sum(col("qtty")).as("proportion"))

    val result = p2
      .join(distribute, Seq("a", "d", "b", "e", "f"))
      .join(proportion, Seq("a", "d", "b", "e", "f"))
      .withColumn(
        "qtty",
        round(((col("to_distribute") / col("proportion")) * col("qtty")) + col("qtty"), 2)
      )
      .drop("to_distribute", "proportion")
      .union(p1.drop("to_distribute"))

    result.show(false)
    result.selectExpr("SUM(qtty)").show()
    owners.selectExpr("SUM(o_qtty)").show()

    /*
    +---+----+---+---+---+---+-----+
    |a  |d   |b  |e  |f  |c  |qtty |
    +---+----+---+---+---+---+-----+
    |11 |2018|A  |x  |y  |222|27.71|
    |33 |2018|C  |x  |y  |555|126.0|
    |33 |2018|C  |x  |y  |666|84.0 |
    |11 |2018|A  |x  |y  |444|69.29|
    |11 |2017|A  |x  |y  |666|35.0 |
    |33 |2018|C  |x  |y  |444|10.0 |
    |22 |2018|B  |x  |y  |555|20.0 |
    |11 |2018|A  |x  |y  |888|88.0 |
    |11 |2018|A  |x  |y  |666|65.0 |
    +---+----+---+---+---+---+-----+

    +---------+
    |sum(qtty)|
    +---------+
    |    525.0|
    +---------+

    +-----------+
    |sum(o_qtty)|
    +-----------+
    |      640.0|
    +-----------+
    */
  }
}
Also, note that the aggregation must not produce any negative quantity.
Below I show the code where it was necessary to make changes.
val distribute = p1
  .groupBy("a", "b", "e", "f") // now we don't need to aggregate by field "d"
  .agg(sum(col("to_distribute")).as("to_distribute"))

val proportion = p2
  .groupBy("a", "b", "e", "f") // now we don't need to aggregate by field "d"
  .agg(sum(col("qtty")).as("proportion"))

// Here we remove "d" from the join.
// If the distribution is null (there is no data in invoices for that owner)
// then we keep the original "qtty".
// Column "d" from the p2 dataframe was renamed to "year".
val result = p2
  .join(distribute, Seq("a", "b", "e", "f"), "left_outer")
  .join(proportion, Seq("a", "b", "e", "f"))
  .selectExpr(
    "a", "b", "e", "f", "c",
    "IF(ROUND(((to_distribute/proportion) * qtty) + qtty, 2) IS NULL, qtty, ROUND(((to_distribute/proportion) * qtty) + qtty, 2)) AS qtty",
    "d AS year"
  )
  .union(p1.withColumn("year", col("d")).drop("d", "to_distribute"))
  .orderBy(col("b"))
****EXPECTED OUTPUT****
+---+---+---+---+---+-----+----+
|a  |b  |e  |f  |c  |qtty |year|
+---+---+---+---+---+-----+----+
|11 |A  |x  |y  |444|80.0 |2018|
|11 |A  |x  |y  |222|32.0 |2018|
|11 |A  |x  |y  |666|65.0 |2018|
|11 |A  |x  |y  |888|88.0 |2018|
|11 |A  |x  |y  |666|35.0 |2017|
|22 |B  |x  |y  |555|20.0 |2018|
|33 |C  |x  |y  |555|126.0|2018|
|33 |C  |x  |y  |444|10.0 |2018|
|33 |C  |x  |y  |666|84.0 |2018|
|99 |D  |x  |y  |888|100.0|2018|
+---+---+---+---+---+-----+----+

+---------+
|sum(qtty)|
+---------+
|    640.0|
+---------+

+-----------+
|sum(o_qtty)|
+-----------+
|      640.0|
+-----------+
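As a quick sanity check of the reworked version, the redistributed total should equal the original owner total (640.0 in this example). A minimal sketch, assuming the result and owners values defined in the code above:

import org.apache.spark.sql.functions.sum

// Both aggregates should return 640.0 if no quantity was lost or duplicated
val redistributedTotal = result.agg(sum("qtty")).first().getDouble(0)
val originalTotal = owners.agg(sum("o_qtty")).first().getDouble(0)
assert(math.abs(redistributedTotal - originalTotal) < 1e-6)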

Convert list to a dataframe column in pyspark

I have a dataframe in which one of the string-type columns contains a list of items that I want to explode and make part of the parent dataframe. How can I do it?
Here is the code to create a sample dataframe:
from pyspark.sql import Row
from collections import OrderedDict

def convert_to_row(d: dict) -> Row:
    return Row(**OrderedDict(sorted(d.items())))
df=sc.parallelize([{"arg1": "first", "arg2": "John", "arg3" : '[{"name" : "click", "datetime" : "1570103345039", "event" : "entry" }, {"name" : "drag", "datetime" : "1580133345039", "event" : "exit" }]'},{"arg1": "second", "arg2": "Joe", "arg3": '[{"name" : "click", "datetime" : "1670105345039", "event" : "entry" }, {"name" : "drop", "datetime" : "1750134345039", "event" : "exit" }]'},{"arg1": "third", "arg2": "Jane", "arg3" : '[{"name" : "click", "datetime" : "1580105245039", "event" : "entry" }, {"name" : "drop", "datetime" : "1650134345039", "event" : "exit" }]'}]) \
.map(convert_to_row).toDF()
Running this code will create a dataframe as shown below:
+------+----+--------------------+
|  arg1|arg2|                arg3|
+------+----+--------------------+
| first|John|[{"name" : "click...|
|second| Joe|[{"name" : "click...|
| third|Jane|[{"name" : "click...|
+------+----+--------------------+
The arg3 column contains a list which I want to explode into the detailed columns. I want the dataframe as follows:
arg1 | arg2 | arg3 | name | datetime | event
How can I achieve that?
You need to specify an array type as the schema in the from_json function:
from pyspark.sql.functions import explode, from_json
schema = 'array<struct<name:string,datetime:string,event:string>>'
df.withColumn('data', explode(from_json('arg3', schema))) \
.select(*df.columns, 'data.*') \
.show()
+------+----+--------------------+-----+-------------+-----+
|  arg1|arg2|                arg3| name|     datetime|event|
+------+----+--------------------+-----+-------------+-----+
| first|John|[{"name" : "click...|click|1570103345039|entry|
| first|John|[{"name" : "click...| drag|1580133345039| exit|
|second| Joe|[{"name" : "click...|click|1670105345039|entry|
|second| Joe|[{"name" : "click...| drop|1750134345039| exit|
| third|Jane|[{"name" : "click...|click|1580105245039|entry|
| third|Jane|[{"name" : "click...| drop|1650134345039| exit|
+------+----+--------------------+-----+-------------+-----+
Note: if your Spark version does not support simpleString format for schema, try the following:
from pyspark.sql.types import ArrayType, StringType, StructType, StructField

schema = ArrayType(
    StructType([
        StructField('name', StringType()),
        StructField('datetime', StringType()),
        StructField('event', StringType())
    ])
)

Spark - after a withColumn("newCol", collect_list(...)) select rows with more than one element

I am working with a DataFrame created from this json:
{"id" : "1201", "name" : "satish", "age" : "25"},
{"id" : "1202", "name" : "krishna", "age" : "28"},
{"id" : "1203", "name" : "amith", "age" : "39"},
{"id" : "1204", "name" : "javed", "age" : "23"},
{"id" : "1205", "name" : "mendy", "age" : "25"},
{"id" : "1206", "name" : "rob", "age" : "24"},
{"id" : "1207", "name" : "prudvi", "age" : "23"}
Initially the Dataframe looks like this:
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
| 23|1204|  javed|
| 25|1205|  mendy|
| 24|1206|    rob|
| 23|1207| prudvi|
+---+----+-------+
What I need is to group all students with the same age, ordering them depending on their id. This is how I'm approaching this so far:
*Note: I'm pretty sure there are more efficient ways than adding a new column using withColumn("newCol", ..) and then using select("newCol"), but I don't know how to solve it better.
val conf = new SparkConf().setAppName("SimpleApp").set("spark.driver.allowMultipleContexts", "true").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val df = sqlContext.read.json("students.json")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id"))).select("newCol")
The output I am getting is this:
[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([24,1206,rob])]
[WrappedArray([23,1204,javed])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([28,1202,krishna])]
[WrappedArray([39,1203,amith])]
Now, how can I filter the rows which have more than one element? That is, I want my final dataframe to be:
[WrappedArray([25,1201,satish], [25,1205,mendy])]
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
My best approach so far is:
val mergedDF = df.withColumn("newCol", collect_list(struct("age","id","name")).over(Window.partitionBy("age").orderBy("id")))
val filterd = mergedDF.withColumn("count", count("age").over(Window.partitionBy("age"))).filter($"count" > 1).select("newCol")
But I must be missing something, because the result is not the expected one:
[WrappedArray([23,1204,javed], [23,1207,prudvi])]
[WrappedArray([25,1201,satish])]
[WrappedArray([25,1201,satish], [25,1205,mendy])]
You can use size() to filter your data:
import org.apache.spark.sql.functions.{col,size}
mergedDF.filter(size(col("newCol"))>1).show(false)
+---+----+------+-----------------------------------+
|age|id  |name  |newCol                             |
+---+----+------+-----------------------------------+
|23 |1207|prudvi|[[23,1204,javed], [23,1207,prudvi]]|
|25 |1205|mendy |[[25,1201,satish], [25,1205,mendy]]|
+---+----+------+-----------------------------------+
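If you also want to avoid the duplicate partial lists that the running window produces (the concern in the note above), a possible alternative is to aggregate with groupBy instead of a window. This is only a sketch, reusing the same df and assuming sort_array may be used to order the collected structs:

import org.apache.spark.sql.functions.{col, collect_list, size, sort_array, struct}

// One row per age; sort_array orders the structs by (age, id), and since age is
// constant within each group this effectively orders the students by id
val grouped = df
  .groupBy("age")
  .agg(sort_array(collect_list(struct("age", "id", "name"))).alias("newCol"))
  .filter(size(col("newCol")) > 1)

grouped.show(false)

Note that collect_list after a plain groupBy gives no ordering guarantee by itself, which is why the array is sorted explicitly afterwards.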

how to limit only one document for every user

I have one mongo collection like this:
| user_id | log | create_at |
| 1 | "login" | 1490688500 |
| 1 | "logout" | 1490688400 |
| 2 | "view_xxx" | 1490688300 |
| 2 | "cpd" | 1490688100 |
How can I get only the latest log for every user, such as:
| 1 | "login" | 1490688500 |
| 2 | "view_xxx" | 1490688300 |
You can use the MongoDB aggregation framework and run the following command:
db.collection.aggregate(
    [
        // sort ascending by create_at so that $last picks the most recent log per user
        { '$sort' : { 'create_at' : 1 } },
        { '$group' : { '_id' : '$user_id', 'log' : { '$last' : '$log' }, 'create_at' : { '$last' : '$create_at' } } }
    ]
)
docs:
https://docs.mongodb.com/manual/reference/operator/aggregation/last/
https://docs.mongodb.com/manual/reference/operator/aggregation/sort/
https://docs.mongodb.com/manual/reference/operator/aggregation/group/