PySpark Window Function

I am trying to calculate the row_number on a dataset based on certain columns, but I am getting the error below:
AttributeError: 'module' object has no attribute 'rowNumber'
I am using the script below to get the row number based on MID and ClaimID. Any thoughts on why this is coming up?
from pyspark.sql.functions import first
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window
import pyspark.sql.functions as func

def Codes(pharmacyCodes):
    df_data = pharmacyCodes
    (df_data
        .select("MID", "claimid",
                F.rowNumber()
                .over(Window
                      .partitionBy("MID")
                      .orderBy("MID"))
                .alias("rowNum"))
        .show())

I think you're looking for row_number rather than rowNumber. The mixture of camel case and snake case in PySpark can get confusing.
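For reference, a minimal sketch of the corrected call using row_number (same DataFrame and columns as in the question; row_number replaced rowNumber in pyspark.sql.functions around Spark 1.6):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

def Codes(pharmacyCodes):
    df_data = pharmacyCodes
    (df_data
        .select("MID", "claimid",
                F.row_number()  # snake_case name in pyspark.sql.functions
                .over(Window.partitionBy("MID").orderBy("MID"))
                .alias("rowNum"))
        .show())

Note that partitioning and ordering by the same column makes the numbering within each MID essentially arbitrary; ordering by claimid or a timestamp is usually what is intended.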

Related

Azure Databricks: analyze if the column names are lower case, using the islower() function

This is my logic on pyspark:
df2 = spark.sql(f" SELECT tbl_name, column_name, data_type, current_count FROM {database_name}.{tablename}")
query_df = spark.sql(f"SELECT tbl_name, COUNT(column_name) as `num_cols` FROM {database_name}.{tablename} GROUP BY tbl_name")
df_join = df2.join(query_df,['tbl_name'])
Then I want to add another column to the DataFrame, called 'column_case_lower', which analyzes whether the column names are lower case using the islower() function.
I'm using this logic to do the analysis:
df_join.withColumn("column_case_lower",
    when((col("column_name").islower()) == 'true'.otherwise('false'))
The error is: unexpected EOF while parsing
I'm expecting a new column with true/false values for each column name.
islower() can't be applied to a Column type. Use the code below, which uses a UDF instead.
from pyspark.sql.functions import col, udf, when
from pyspark.sql.types import BooleanType

def checkCase(col_value):
    # str.islower() works on plain Python strings, hence the UDF
    return col_value.islower() if col_value is not None else None

checkUDF = udf(lambda z: checkCase(z), BooleanType())

df.withColumn("new_col", when(checkUDF(col('column_name')) == True, "True")
              .otherwise("False")).show()

get row corresponding to latest timestamp in pyspark

I have a dataframe as:
+--------------+-----------------+-------------------+
|          ecid|    creation_user| creation_timestamp|
+--------------+-----------------+-------------------+
|ECID-195000300|         USER_ID1|2018-08-31 20:00:00|
|ECID-195000300|         USER_ID2|2016-08-31 20:00:00|
+--------------+-----------------+-------------------+
I need to keep only the row with the earliest timestamp, as:
+--------------+-----------------+-------------------+
|          ecid|    creation_user| creation_timestamp|
+--------------+-----------------+-------------------+
|ECID-195000300|         USER_ID2|2016-08-31 20:00:00|
+--------------+-----------------+-------------------+
How can I achieve this in pyspark?
I tried
df.groupBy("ecid").agg(min("creation_timestamp"))
However, I am just getting the ecid and timestamp fields. I want to have all the fields, not just those two.
Use the window row_number function with a partitionBy on ecid and an orderBy on creation_timestamp.
Example:
#sampledata
df=spark.createDataFrame([("ECID-195000300","USER_ID1","2018-08-31 20:00:00"),("ECID-195000300","USER_ID2","2016-08-31 20:00:00")],["ecid","creation_user","creation_timestamp"])
from pyspark.sql import Window
from pyspark.sql.functions import *
w = Window.partitionBy('ecid').orderBy("creation_timestamp")
df.withColumn("rn",row_number().over(w)).filter(col("rn") ==1).drop("rn").show()
#+--------------+-------------+-------------------+
#| ecid|creation_user| creation_timestamp|
#+--------------+-------------+-------------------+
#|ECID-195000300| USER_ID2|2016-08-31 20:00:00|
#+--------------+-------------+-------------------+
I think you will need a window function plus a filter for that. I can propose the following untested solution:
import pyspark.sql.window as psw
import pyspark.sql.functions as psf

w = psw.Window.partitionBy('ecid')
df = (df.withColumn("min_tmp", psf.min('creation_timestamp').over(w))
        .filter(psf.col("min_tmp") == psf.col("creation_timestamp"))
     )
The window function allows you to return the min over each ecid as a new column of your DataFrame
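For completeness, a hedged sketch of the join-back variant of the original groupBy attempt (assuming rows tied on the exact minimum timestamp should all be kept):

from pyspark.sql.functions import min as min_

# Compute the earliest timestamp per ecid, then join back to recover all columns.
earliest = df.groupBy("ecid").agg(min_("creation_timestamp").alias("creation_timestamp"))
result = df.join(earliest, ["ecid", "creation_timestamp"])
result.show()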

pyspark function to change datatype

The code works outside the function; however, when I move it inside the function and adjust for the var argument passed in, I get an error. Thanks for the help!
from pyspark.sql.types import DateType
from pyspark.sql.functions import col, unix_timestamp, to_date

def change_string_to_date(df, var):
    df = df.withColumn("{}".format(var),
                       to_date(unix_timestamp(col("{}".format(var))), 'yyyy-MM-dd').cast("timestamp"))
    return df

df_data = change_string_to_date(df_data, 'mis_dt')
Figured it out: "unix_timestamp" was causing the problem. Very silly error.
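For reference, a hedged sketch of what the function might look like without the unix_timestamp wrapper (assuming the source column holds 'yyyy-MM-dd' strings, as the format suggests):

from pyspark.sql.functions import col, to_date

def change_string_to_date(df, var):
    # to_date can parse the string column directly with the given format.
    return df.withColumn(var, to_date(col(var), 'yyyy-MM-dd'))

df_data = change_string_to_date(df_data, 'mis_dt')

This yields a DateType column; add .cast("timestamp") if a timestamp is really needed.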

pyspark's window functions fn.avg() only output same data

Here is my code:
import pandas as pd
from pyspark.sql import SparkSession, SQLContext
import pyspark.sql.functions as fn
from pyspark.sql.functions import isnan, isnull
from pyspark.sql.functions import lit
from pyspark.sql.window import Window

spark = SparkSession.builder.appName(" ").getOrCreate()

file = r"D:\project\HistoryData.csv"
lines = pd.read_csv(file)
spark_df = spark.createDataFrame(lines, ['id', 'time', 'average', 'max', 'min'])

temp = Window.partitionBy("time").orderBy("id").rowsBetween(-1, 1)
df = spark_df.withColumn("movingAvg", fn.avg("average").over(temp))
df.show()
But the output is wrong: it shows the same data repeated for movingAvg, and some data disappears.
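One possible explanation, offered only as a guess: if each "time" value is unique, every partition contains a single row, so the three-row frame averages just that one row and movingAvg simply repeats the average column. A hedged sketch, assuming the intent is a moving average over time within each id:

from pyspark.sql.window import Window
import pyspark.sql.functions as fn

# Partition by the series (id) and order by time, so the frame spans the
# previous, current, and next observations of the same series.
temp = Window.partitionBy("id").orderBy("time").rowsBetween(-1, 1)
df = spark_df.withColumn("movingAvg", fn.avg("average").over(temp))
df.show()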

Converting Scala code to PySpark

I have found the following code for selecting n rows from a dataframe grouped by unique_id.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val window = Window.partitionBy("userId").orderBy($"rating".desc)
dataframe.withColumn("r", row_number.over(window)).where($"r" <= n)
I have tried the following:
from pyspark.sql.functions import row_number, desc
from pyspark.sql.window import Window
w = Window.partitionBy(post_tags.EntityID).orderBy(post_tags.Weight)
newdata=post_tags.withColumn("r", row_number.over(w)).where("r" <= 3)
I get the following error:
AttributeError: 'function' object has no attribute 'over'
Please help me on the same.
I found the answer to this:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

window = Window.partitionBy(df['user_id']).orderBy(df['score'].desc())
(df.select('*', rank().over(window).alias('rank'))
   .filter(col('rank') <= 2)
   .show())
Credit to @mtoto for his answer: https://stackoverflow.com/a/38398563/5165377
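For what it's worth, the original AttributeError came from referencing row_number without calling it. A hedged sketch of a more literal translation of the Scala snippet (assuming post_tags, EntityID, and Weight as in the question, with n = 3):

from pyspark.sql.functions import row_number, desc, col
from pyspark.sql.window import Window

# row_number is a function: call it first, then attach the window with .over(w).
w = Window.partitionBy("EntityID").orderBy(desc("Weight"))
newdata = post_tags.withColumn("r", row_number().over(w)).where(col("r") <= 3)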