String aggregation and group by in PySpark - pyspark

I have a dataset that has Id, Value and Timestamp columns. Id and Value columns are strings. Sample:
Id
Value
Timestamp
Id1
100
1658919600
Id1
200
1658919602
Id1
300
1658919601
Id2
433
1658919677
I want to concatenate Values that belong to the same Id, and order them by Timestamp. E.g. for rows with Id1 the result would look like:
Id
Values
Id1
100;300;200
Some pseudo code would be:
res = SELECT Id,
STRING_AGG(Value,";") WITHING GROUP ORDER BY Timestamp AS Values
FROM table
GROUP BY Id
Can someone help me write this in Databricks? PySpark and SQL are both fine.

You can collect lists of struct ofTimestamp and Value (in that order) for each Id, sort them (sort_array will sort by the first value of struct, i.e Timestamp) and combine Value's values into string using concat_ws.
PySpark (Spark 3.1.2)
import pyspark.sql.functions as F
(df
.groupBy("Id")
.agg(F.expr("concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values"))
).show(truncate=False)
# +---+-----------+
# |Id |Values |
# +---+-----------+
# |Id1|100;300;200|
# |Id2|433 |
# +---+-----------+
in SparkSQL
SELECT Id, concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values
FROM table
GROUP BY Id

This is a beautiful question!! This is a perfect use case for Fugue which can port Python and Pandas code to PySpark. I think this is something that is hard to express in Spark but easy to express in native Python or Pandas.
Let's just concern ourselves with 1 ID first. For one ID, using pure native Python, it would look like below. Assume the Timestamps are already sorted when this is applied.
import pandas as pd
df = pd.DataFrame({"Id": ["Id1", "Id1", "Id1", "Id2","Id2","Id2"],
"Value": [100,200,300,433, 500,600],
"Timestamp": [1658919600, 1658919602, 1658919601, 1658919677, 1658919670, 1658919672]})
from typing import Iterable, List, Dict, Any
def logic(df: List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
_id = df[0]['Id']
items = []
for row in df:
items.append(row['Value'])
yield {"Id": _id, "Values": items}
Now we can call Fugue with one line of code to run this on Pandas. Fugue uses the type annotation from the logic function to handle conversions for you as it enters the function. We can run this for 1 ID (not sorted yet).
from fugue import transform
transform(df.loc[df["Id"] == "Id1"], logic, schema="Id:str,Values:[int]")
and that generates this:
Id Values
0 Id1 [100, 200, 300]
Now we are ready to bring it to Spark. All we need to do is add the engine and partitioning strategy to the transform call.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = transform(df,
logic,
schema="Id:str,Values:[int]",
partition={"by": "Id", "presort": "Timestamp asc"},
engine=spark)
sdf.show()
Because we passed in the SparkSession, this code will run on Spark.sdf is a SparkDataFrame so we need .show() because it evaluates lazily. Schema is a requirement for Spark so we need it too on Fugue but it's significantly simplified. The partitioning strategy will run logic on each Id, and will sort the items by Timestamp for each partition.
For the FugueSQL version, you can do:
from fugue_sql import fsql
fsql(
"""
SELECT *
FROM df
TRANSFORM PREPARTITION BY Id PRESORT Timestamp ASC USING logic SCHEMA Id:str,Values:[int]
PRINT
"""
).run(spark)

Easiest Solution :
df1=df.sort(asc('Timestamp')).groupBy("id").agg(collect_list('Value').alias('newcol'))
+---+---------------+
| id| newcol|
+---+---------------+
|Id1|[100, 300, 200]|
|Id2| [433]|
+---+---------------+
df1.withColumn('newcol',concat_ws(";",col("newcol"))).show()
+---+-----------+
| id| newcol|
+---+-----------+
|Id1|100;300;200|
|Id2| 433|
+---+-----------+

Related

how to pivot /transpose rows of a column in to individual columns in spark-scala without using the pivot method

Please check below image for the reference to my use case
You can get the same result without using pivot by adding the columns manually, if you know all the names of the new columns:
import org.apache.spark.sql.functions.{col, when}
dataframe
.withColumn("cheque", when(col("ttype") === "cheque", col("tamt")))
.withColumn("draft", when(col("ttype") === "draft", col("tamt")))
.drop("tamt", "ttype")
As this solution does not trigger shuffle, your processing will be faster than using pivot.
It can be generalized if you don't know the name of the columns. However, in this case you should benchmark to check whether pivot is more performant:
import org.apache.spark.sql.functions.{col, when}
val newColumnNames = dataframe.select("ttype").distinct.collect().map(_.getString(0))
newColumnNames
.foldLeft(dataframe)((df, columnName) => {
df.withColumn(columnName, when(col("ttype") === columnName, col("tamt")))
})
.drop("tamt", "ttype")
Use groupBy,pivot & agg functions. Check below code.
Added inline comments.
scala> df.show(false)
+----------+------+----+
|tdate |ttype |tamt|
+----------+------+----+
|2020-10-15|draft |5000|
|2020-10-18|cheque|7000|
+----------+------+----+
scala> df
.groupBy($"tdate") // Grouping data based on tdate column.
.pivot("ttype",Seq("cheque","draft")) // pivot based on ttype and "draft","cheque" are new column name
.agg(first("tamt")) // aggregation by "tamt" column.
.show(false)
+----------+------+-----+
|tdate |cheque|draft|
+----------+------+-----+
|2020-10-18|7000 |null |
|2020-10-15|null |5000 |
+----------+------+-----+

Spark Column merging all list into 1 single list

I want the below column to merge into a single list for n-gram calculation. I am not sure how can I merge all the lists in a column into a single one.
+--------------------+
| author|
+--------------------+
| [Justin, Lee]|
|[Chatbots, were, ...|
|[Our, hopes, were...|
|[And, why, wouldn...|
|[At, the, Mobile,...|
+--------------------+
(Edit)Some more info:
I would like this as a spark df column and all the words including the repeated ones in a single list. The data is kind of big so I want to try avoiding methods like collect
OP wants to aggregate all the arrays/lists into the top row.
values = [(['Justin','Lee'],),(['Chatbots','were'],),(['Our','hopes','were'],),
(['And','why','wouldn'],),(['At','the','Mobile'],)]
df = sqlContext.createDataFrame(values,['author',])
df.show()
+------------------+
| author|
+------------------+
| [Justin, Lee]|
| [Chatbots, were]|
|[Our, hopes, were]|
|[And, why, wouldn]|
| [At, the, Mobile]|
+------------------+
This step suffices.
from pyspark.sql import functions as F
df = df.groupby().agg(F.collect_list('author').alias('list_of_authors'))
df.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|list_of_authors |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray(Justin, Lee), WrappedArray(Chatbots, were), WrappedArray(Our, hopes, were), WrappedArray(And, why, wouldn), WrappedArray(At, the, Mobile)]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
DataFrames, same as other distributed data structures, are not iterable and by only using dedicated higher order function and / or SQL methods can be accessed
Suppose your Dataframe is DF1 and Output is DF2
You need something like :
values = [(['Justin', 'Lee'],), (['Chatbots', 'were'],), (['Our', 'hopes', 'were'],),
(['And', 'why', 'wouldn'],), (['At', 'the', 'Mobile'],)]
df = spark.createDataFrame(values, ['author', ])
df.agg(F.collect_list('author').alias('author')).show(truncate=False)
Upvote if works

Filtering dataframe spark scala for dates greater than current time

I have a data frame in spark 1.6 that I would like to select all rows greater than the current time. I am filtering on "time_occurred" column with this type of format "yyyy-MM-dd'T'HH:mm:ss.SSS". I was wondering what the best way is to achieve this?
Best way would be casting the field to timestamp type by using Regexp_replace function to replace 'T'.
Then by using current_timestamp function we can filter out data in the dataframe.
Example:
Spark-scala-1.6:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
//sample data
val df=sc.parallelize(Seq(("2019-10-17'T'18:30:45.123"),("2019-10-15'T'18:30:45.123"))).toDF("ts")
df.filter(regexp_replace('ts,"'T'"," ").cast("timestamp") > current_timestamp).show(false)
Result:
+-------------------------+
|ts |
+-------------------------+
|2019-10-17'T'18:30:45.123|
+-------------------------+
In case if you need to replace 'T' to get timestamp type for ts field then use this approach.
df.withColumn("ts",regexp_replace('ts,"'T'"," ").cast("timestamp"))
.filter('ts > current_timestamp).show(false)
Result:
+-----------------------+
|ts |
+-----------------------+
|2019-10-17 18:30:45.123|
+-----------------------+
Result ts field will be having Timestamp type.

Filtering a DataFrame on date columns comparison

I am trying to filter a DataFrame comparing two date columns using Scala and Spark. Based on the filtered DataFrame there are calculations running on top to calculate new columns.
Simplified my data frame has the following schema:
|-- received_day: date (nullable = true)
|-- finished: int (nullable = true)
On top of that I create two new column t_start and t_end that would be used for filtering the DataFrame. They have 10 and 20 days difference from the original column received_day:
val dfWithDates= df
.withColumn("t_end",date_sub(col("received_day"),10))
.withColumn("t_start",date_sub(col("received_day"),20))
I now want to have a new calculated column that indicates for each row of data how many rows of the dataframe are in the t_start to t_end period. I thought I can achieve this the following way:
val dfWithCount = dfWithDates
.withColumn("cnt", lit(
dfWithDates.filter(
$"received_day".lt(col("t_end"))
&& $"received_day".gt(col("t_start"))).count()))
However, this count only returns 0 and I believe that the problem is in the the argument that I am passing to lt and gt.
From following that issue here Filtering a spark dataframe based on date I realized that I need to pass a string value. If I try with hard coded values like lt(lit("2018-12-15")), then the filtering works. So I tried casting my columns to StringType:
val dfWithDates= df
.withColumn("t_end",date_sub(col("received_day"),10).cast(DataTypes.StringType))
.withColumn("t_start",date_sub(col("received_day"),20).cast(DataTypes.StringType))
But the filter still returns an empty dataFrame.
I would assume that I am not handling the data type right.
I am running on Scala 2.11.0 with Spark 2.0.2.
Yes you are right. For $"received_day".lt(col("t_end") each reveived_day value is compared with the current row's t_end value, not the whole dataframe. So each time you'll get zero as count.
You can solve this by writing a simple udf. Here is the way how you can solve the issue:
Creating sample input dataset:
import org.apache.spark.sql.{Row, SparkSession}
import java.sql.Date
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq((Date.valueOf("2018-10-12"),1),
(Date.valueOf("2018-10-13"),1),
(Date.valueOf("2018-09-25"),1),
(Date.valueOf("2018-10-14"),1)).toDF("received_day", "finished")
val dfWithDates= df
.withColumn("t_start",date_sub(col("received_day"),20))
.withColumn("t_end",date_sub(col("received_day"),10))
dfWithDates.show()
+------------+--------+----------+----------+
|received_day|finished| t_start| t_end|
+------------+--------+----------+----------+
| 2018-10-12| 1|2018-09-22|2018-10-02|
| 2018-10-13| 1|2018-09-23|2018-10-03|
| 2018-09-25| 1|2018-09-05|2018-09-15|
| 2018-10-14| 1|2018-09-24|2018-10-04|
+------------+--------+----------+----------+
Here for 2018-09-25 we desire count 3
Generate output:
val count_udf = udf((received_day:Date) => {
(dfWithDates.filter((col("t_end").gt(s"$received_day")) && col("t_start").lt(s"$received_day")).count())
})
val dfWithCount = dfWithDates.withColumn("count",count_udf(col("received_day")))
dfWithCount.show()
+------------+--------+----------+----------+-----+
|received_day|finished| t_start| t_end|count|
+------------+--------+----------+----------+-----+
| 2018-10-12| 1|2018-09-22|2018-10-02| 0|
| 2018-10-13| 1|2018-09-23|2018-10-03| 0|
| 2018-09-25| 1|2018-09-05|2018-09-15| 3|
| 2018-10-14| 1|2018-09-24|2018-10-04| 0|
+------------+--------+----------+----------+-----+
To make computation faster i would suggest to cache dfWithDates as there are repetition of same operation for each row.
You can cast date value to string with any pattern using DateTimeFormatter
import java.time.format.DateTimeFormatter
date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))

get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of Array of String:
Id, feat1,feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"],[]
2, ["feat1_2"],["feat2_1","feat2_2"]
3,["feat1_4"],["feat2_3"]
I want to get the list of distinct elements inside each feature column, so the output will be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]
what is the best way to do this in Scala?
You can use the collect_set to find the distinct values of the corresponding column after applying the explode function on each column to unnest the array element in each cell. Suppose your data frame is called df:
import org.apache.spark.sql.functions._
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
withColumn("feat2", explode(col("feat2"))).
agg(collect_set("feat1").alias("distinct_feat1"),
collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
one more solution for spark 2.4+
.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))
beware, if one of columns is null, result will be null
The method provided by Psidom works great, here is a function that does the same given a Dataframe and a list of fields:
def array_unique_values(df, fields):
from pyspark.sql.functions import col, collect_set, explode
from functools import reduce
data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
And then:
data = array_unique_values(df, my_fields)
data.take(1)