get row corresponding to latest timestamp in pyspark - pyspark

I have a dataframe as:
+--------------+-----------------+-------------------+
|          ecid|    creation_user| creation_timestamp|
+--------------+-----------------+-------------------+
|ECID-195000300|         USER_ID1|2018-08-31 20:00:00|
|ECID-195000300|         USER_ID2|2016-08-31 20:00:00|
+--------------+-----------------+-------------------+
I need to get the row with the earliest timestamp, as:
+--------------+-----------------+-------------------+
|          ecid|    creation_user| creation_timestamp|
+--------------+-----------------+-------------------+
|ECID-195000300|         USER_ID2|2016-08-31 20:00:00|
+--------------+-----------------+-------------------+
How can I achieve this in pyspark?
I tried
df.groupBy("ecid").agg(min("creation_timestamp"))
However, I only get the ecid and timestamp fields. I want to keep all the fields, not just those two.
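One way to keep every column with the groupBy approach is to join the aggregated minimum back onto the original dataframe. A minimal, untested sketch, assuming the dataframe is named df:
from pyspark.sql.functions import min as min_

min_df = df.groupBy("ecid").agg(min_("creation_timestamp").alias("creation_timestamp"))
# inner join on both columns keeps only the earliest row per ecid, with all original fields
df.join(min_df, on=["ecid", "creation_timestamp"], how="inner").show()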

Use the window row_number function, partitioning by ecid and ordering by creation_timestamp.
Example:
# sample data
df = spark.createDataFrame(
    [("ECID-195000300", "USER_ID1", "2018-08-31 20:00:00"),
     ("ECID-195000300", "USER_ID2", "2016-08-31 20:00:00")],
    ["ecid", "creation_user", "creation_timestamp"])

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w = Window.partitionBy("ecid").orderBy("creation_timestamp")
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn").show()
#+--------------+-------------+-------------------+
#| ecid|creation_user| creation_timestamp|
#+--------------+-------------+-------------------+
#|ECID-195000300| USER_ID2|2016-08-31 20:00:00|
#+--------------+-------------+-------------------+

I think you will need a window function plus a filter for that. I can propose the following untested solution:
import pyspark.sql.window as psw
import pyspark.sql.functions as psf

w = psw.Window.partitionBy("ecid")
df = (df.withColumn("min_tmp", psf.min("creation_timestamp").over(w))
        .filter(psf.col("min_tmp") == psf.col("creation_timestamp"))
     )
The window function returns the min over each ecid as a new column of your DataFrame, and the filter keeps only the rows where the timestamp equals that minimum.
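If you do not want to keep the helper column, you can drop it afterwards. A minimal sketch, assuming the sample df created in the first answer:
result = (df.withColumn("min_tmp", psf.min("creation_timestamp").over(w))
            .filter(psf.col("min_tmp") == psf.col("creation_timestamp"))
            .drop("min_tmp"))
result.show()
# expected to keep only the USER_ID2 / 2016-08-31 20:00:00 row for ECID-195000300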


String aggregation and group by in PySpark

I have a dataset that has Id, Value and Timestamp columns. Id and Value columns are strings. Sample:
+---+-----+----------+
| Id|Value| Timestamp|
+---+-----+----------+
|Id1|  100|1658919600|
|Id1|  200|1658919602|
|Id1|  300|1658919601|
|Id2|  433|1658919677|
+---+-----+----------+
I want to concatenate Values that belong to the same Id, ordered by Timestamp. E.g. for rows with Id1 the result would look like:
+---+-----------+
| Id|     Values|
+---+-----------+
|Id1|100;300;200|
+---+-----------+
Some pseudo code would be:
res = SELECT Id,
             STRING_AGG(Value, ';') WITHIN GROUP (ORDER BY Timestamp) AS Values
      FROM table
      GROUP BY Id
Can someone help me write this in Databricks? PySpark and SQL are both fine.
You can collect lists of structs of Timestamp and Value (in that order) for each Id, sort them (sort_array will sort by the first field of the struct, i.e. Timestamp) and combine the Value values into a string using concat_ws.
PySpark (Spark 3.1.2)
import pyspark.sql.functions as F

(df
 .groupBy("Id")
 .agg(F.expr("concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values"))
).show(truncate=False)
# +---+-----------+
# |Id |Values |
# +---+-----------+
# |Id1|100;300;200|
# |Id2|433 |
# +---+-----------+
In Spark SQL:
SELECT Id, concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values
FROM table
GROUP BY Id
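To run the SQL version from PySpark, you can first register the dataframe as a temporary view. A minimal sketch, assuming the data is in a Spark DataFrame named df ("agg_table" is just a hypothetical view name):
df.createOrReplaceTempView("agg_table")
spark.sql("""
    SELECT Id,
           concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) AS Values
    FROM agg_table
    GROUP BY Id
""").show(truncate=False)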
This is a beautiful question! It is a perfect use case for Fugue, which can port Python and Pandas code to PySpark. This is something that is hard to express in Spark but easy to express in native Python or Pandas.
Let's just concern ourselves with 1 ID first. For one ID, using pure native Python, it would look like below. Assume the Timestamps are already sorted when this is applied.
import pandas as pd
from typing import Iterable, List, Dict, Any

df = pd.DataFrame({"Id": ["Id1", "Id1", "Id1", "Id2", "Id2", "Id2"],
                   "Value": [100, 200, 300, 433, 500, 600],
                   "Timestamp": [1658919600, 1658919602, 1658919601, 1658919677, 1658919670, 1658919672]})

def logic(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    # collect all Values for one Id (rows arrive presorted by Timestamp)
    _id = df[0]['Id']
    items = []
    for row in df:
        items.append(row['Value'])
    yield {"Id": _id, "Values": items}
Now we can call Fugue with one line of code to run this on Pandas. Fugue uses the type annotation from the logic function to handle conversions for you as it enters the function. We can run this for 1 ID (not sorted yet).
from fugue import transform
transform(df.loc[df["Id"] == "Id1"], logic, schema="Id:str,Values:[int]")
and that generates this:
Id Values
0 Id1 [100, 200, 300]
Now we are ready to bring it to Spark. All we need to do is add the engine and partitioning strategy to the transform call.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = transform(df,
                logic,
                schema="Id:str,Values:[int]",
                partition={"by": "Id", "presort": "Timestamp asc"},
                engine=spark)
sdf.show()
Because we passed in the SparkSession, this code will run on Spark. sdf is a Spark DataFrame, so we need .show() because it evaluates lazily. Schema is a requirement for Spark, so we need it on Fugue too, but it's significantly simplified. The partitioning strategy will run logic on each Id, and will sort the items by Timestamp within each partition.
For the FugueSQL version, you can do:
from fugue_sql import fsql
fsql(
"""
SELECT *
FROM df
TRANSFORM PREPARTITION BY Id PRESORT Timestamp ASC USING logic SCHEMA Id:str,Values:[int]
PRINT
"""
).run(spark)
Easiest solution:
from pyspark.sql.functions import asc, col, collect_list, concat_ws

df1 = df.sort(asc('Timestamp')).groupBy("id").agg(collect_list('Value').alias('newcol'))
df1.show()
+---+---------------+
| id| newcol|
+---+---------------+
|Id1|[100, 300, 200]|
|Id2| [433]|
+---+---------------+
df1.withColumn('newcol',concat_ws(";",col("newcol"))).show()
+---+-----------+
| id| newcol|
+---+-----------+
|Id1|100;300;200|
|Id2| 433|
+---+-----------+

How to count frequency of min and max for all columns from a pyspark dataframe?

I have a pyspark dataframe where I am finding the min/max values and the count of min/max values for each column. I am able to select the min/max values using:
df.select([min(col(c)).alias(c) for c in df.columns])
I want to have the count of the min/max values as well, in the same dataframe.
Specific output I need:
...| col_n | col_m |...
...| xn | xm |... min(col(coln))
...| count(col_n==xn) | count(col_m==xm) |...
Try this:
from pyspark.sql import functions as F

tst = sqlContext.createDataFrame([(1,7,2,11),(1,3,4,12),(1,5,6,13),(1,7,8,14),(2,9,10,15),(2,11,12,16),(2,13,14,17)],
                                 schema=['col1','col2','col3','col4'])
# max of every column
expr = [F.max(coln).alias(coln + '_max') for coln in tst.columns]
tst_mx = tst.select(*expr)
#%%
tst_dict = tst_mx.collect()[0].asDict()
#%%
# count, per column, how many rows equal that column's max
expr1 = [F.count(F.when(F.col(coln) == tst_dict[coln + '_max'], F.col(coln))).alias(coln + '_max_count') for coln in tst.columns]
#%%
tst_res = tst.select(*(expr + expr1))
In expr, I have only tried the max function. You can scale this to other functions like min, mean, etc., and even use a list comprehension over a list of functions. Refer to this answer for such scaling: pyspark: groupby and aggregate avg and first on multiple columns. It is explained there for agg, but the same can be done with a select statement, as sketched below.
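For example, extending the same pattern to the min side (a minimal sketch, reusing tst, expr and expr1 from above):
expr_min = [F.min(coln).alias(coln + '_min') for coln in tst.columns]
tst_mn_dict = tst.select(*expr_min).collect()[0].asDict()
expr_min_cnt = [F.count(F.when(F.col(coln) == tst_mn_dict[coln + '_min'], F.col(coln))).alias(coln + '_min_count')
                for coln in tst.columns]
tst_res = tst.select(*(expr + expr1 + expr_min + expr_min_cnt))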

Filtering dataframe spark scala for dates greater than current time

I have a data frame in Spark 1.6 from which I would like to select all rows greater than the current time. I am filtering on the "time_occurred" column, which has the format "yyyy-MM-dd'T'HH:mm:ss.SSS". I was wondering what the best way is to achieve this?
The best way would be to cast the field to timestamp type, using the regexp_replace function to replace 'T'.
Then, by using the current_timestamp function, we can filter the data in the dataframe.
Example:
Spark-scala-1.6:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
//sample data
val df=sc.parallelize(Seq(("2019-10-17'T'18:30:45.123"),("2019-10-15'T'18:30:45.123"))).toDF("ts")
df.filter(regexp_replace('ts,"'T'"," ").cast("timestamp") > current_timestamp).show(false)
Result:
+-------------------------+
|ts |
+-------------------------+
|2019-10-17'T'18:30:45.123|
+-------------------------+
In case you need to replace 'T' and keep the ts field as timestamp type, use this approach:
df.withColumn("ts",regexp_replace('ts,"'T'"," ").cast("timestamp"))
.filter('ts > current_timestamp).show(false)
Result:
+-----------------------+
|ts |
+-----------------------+
|2019-10-17 18:30:45.123|
+-----------------------+
The resulting ts field will have timestamp type.
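For a PySpark version of the same idea (a rough, untested sketch on a recent Spark, assuming a SparkSession named spark; the original answer is Scala on Spark 1.6):
from pyspark.sql.functions import regexp_replace, current_timestamp, col

df = spark.createDataFrame([("2019-10-17'T'18:30:45.123",), ("2019-10-15'T'18:30:45.123",)], ["ts"])
# strip the quoted 'T', cast to timestamp, and keep rows later than now
df.withColumn("ts", regexp_replace("ts", "'T'", " ").cast("timestamp")) \
  .filter(col("ts") > current_timestamp()) \
  .show(truncate=False)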

casting to string of column for pyspark dataframe throws error

I have pyspark dataframe with two columns with datatypes as
[('area', 'int'), ('customer_play_id', 'int')]
+----+----------------+
|area|customer_play_id|
+----+----------------+
| 100|         8606738|
| 110|         8601843|
| 130|         8602984|
+----+----------------+
I want to cast the column area to str using pyspark commands, but I am getting errors as below.
I tried:
str(df['area']): it didn't change the datatype to str
df.area.astype(str): gave "TypeError: unexpected type: "
df['area'].cast(str): same error as above
Any help will be appreciated.
I want the datatype of area to be string, using a pyspark dataframe operation.
You can simply do any of these:
Option1:
df1 = df.select('*',df.area.cast("string"))
select - All the columns you want in df1 should be mentioned in select
Option2:
df1 = df.selectExpr("*","cast(area as string) AS new_area")
selectExpr - All the columns you want in df1 should be mentioned in selectExpr
Option3:
df1 = df.withColumn("new_area", df.area.cast("string"))
withColumn will add a new column (in addition to the existing columns of df)
"*" in select and selectExpr represent all the columns.
Use the withColumn function to change the data type or the values of a field in Spark, e.g. as shown below:
import pyspark.sql.functions as F
df = df.withColumn("area",F.col("area").cast("string"))
You can use this UDF function:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

tofloatfunc = udf(lambda x: x, FloatType())
changedTypedf = df.withColumn("Column_name", df["Column_name"].cast(FloatType()))

Pyspark Window Function

I am trying to calculate the row_number on a data set based on certain columns, but I am getting the below error:
AttributeError: 'module' object has no attribute 'rowNumber'
I am using the below script to get the row number based on MID and ClaimID. Any thoughts on why this is coming up?
from pyspark.sql.functions import first
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window
import pyspark.sql.functions as func
def Codes(pharmacyCodes):
    df_data = pharmacyCodes
    (df_data
        .select("MID", "claimid",
                F.rowNumber()
                 .over(Window
                       .partitionBy("MID")
                       .orderBy("MID"))
                 .alias("rowNum"))
        .show())
I think you're looking for row_number rather than rowNumber. The mixture of camel case and snake case in PySpark can get confusing.
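A corrected version of that snippet would look roughly like this (an untested sketch):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def Codes(pharmacyCodes):
    df_data = pharmacyCodes
    (df_data
        .select("MID", "claimid",
                F.row_number()
                 .over(Window.partitionBy("MID").orderBy("MID"))
                 .alias("rowNum"))
        .show())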