Converting Scala code to PySpark

I have found the following code for selecting n rows from a dataframe, grouped by a unique id.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val window = Window.partitionBy("userId").orderBy($"rating".desc)
dataframe.withColumn("r", row_number.over(window)).where($"r" <= n)
I have tried the following:
from pyspark.sql.functions import row_number, desc
from pyspark.sql.window import Window
w = Window.partitionBy(post_tags.EntityID).orderBy(post_tags.Weight)
newdata=post_tags.withColumn("r", row_number.over(w)).where("r" <= 3)
I get the following error:
AttributeError: 'function' object has no attribute 'over'
Please help me with this.

I found the answer to this:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col
window = Window.partitionBy(df['user_id']).orderBy(df['score'].desc())
(df.select('*', rank().over(window).alias('rank'))
   .filter(col('rank') <= 2)
   .show())
Credits to @mtoto for his answer: https://stackoverflow.com/a/38398563/5165377
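For completeness, here is a minimal sketch of the fix applied directly to the attempted code (untested): row_number must be called as a function before .over(), and the filter condition needs a Column rather than a bare string. The EntityID and Weight names come from the attempt above; descending order is assumed to mirror the Scala version.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

# window per EntityID, highest Weight first (assumed, to match the Scala desc ordering)
w = Window.partitionBy(post_tags.EntityID).orderBy(post_tags.Weight.desc())
newdata = post_tags.withColumn("r", row_number().over(w)).where(col("r") <= 3)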


Azure Databricks: analyze if the column names are lowercase, using the islower() function

This is my logic in PySpark:
df2 = spark.sql(f" SELECT tbl_name, column_name, data_type, current_count FROM {database_name}.{tablename}")
query_df = spark.sql(f"SELECT tbl_name, COUNT(column_name) as `num_cols` FROM {database_name}.{tablename} GROUP BY tbl_name")
df_join = df2.join(query_df,['tbl_name'])
Then I want to add another column to the DataFrame, called 'column_case_lower', that analyzes whether the column names are lowercase using the islower() function.
I'm using this logic to do the analysis:
df_join.withColumn("column_case_lower",
when((col("column_name").islower()) == 'true'.otherwise('false'))
The error is: unexpected EOF while parsing
I'm expecting a column with true/false values as the result.
islower() can't be applied to a Column; it is a Python string method. Use the below code, which uses a UDF instead.
from pyspark.sql.functions import col, udf, when
from pyspark.sql.types import BooleanType

def checkCase(col_value):
    # True when the string value is entirely lowercase
    return col_value.islower()

checkUDF = udf(lambda z: checkCase(z), BooleanType())

df.withColumn("new_col", when(checkUDF(col('column_name')), "True")
              .otherwise("False")).show()
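As a side note, a UDF may not be strictly necessary here. A sketch of an alternative (untested), using the df_join from the question, compares the column to its lowercased version with Spark's built-in lower function; note the semantics differ slightly from Python's str.islower() for values that contain no letters.
from pyspark.sql.functions import col, lower, when

# "True" when the column name equals its lowercased form
df_join.withColumn(
    "column_case_lower",
    when(col("column_name") == lower(col("column_name")), "True").otherwise("False")
).show()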

get row corresponding to latest timestamp in pyspark

I have a dataframe as:
+--------------+-----------------+-------------------+
|          ecid|    creation_user| creation_timestamp|
+--------------+-----------------+-------------------+
|ECID-195000300|         USER_ID1|2018-08-31 20:00:00|
|ECID-195000300|         USER_ID2|2016-08-31 20:00:00|
+--------------+-----------------+-------------------+
I need to get the row with the earliest timestamp, as:
+--------------+-----------------+-------------------+
|          ecid|    creation_user| creation_timestamp|
+--------------+-----------------+-------------------+
|ECID-195000300|         USER_ID2|2016-08-31 20:00:00|
+--------------+-----------------+-------------------+
How can I achieve this in PySpark?
I tried:
df.groupBy("ecid").agg(min("creation_timestamp"))
However, I am only getting the ecid and timestamp fields. I want to keep all the fields, not just those two.
Use the row_number window function, partitioned by ecid and ordered by creation_timestamp.
Example:
# sample data
df = spark.createDataFrame([("ECID-195000300", "USER_ID1", "2018-08-31 20:00:00"),
                            ("ECID-195000300", "USER_ID2", "2016-08-31 20:00:00")],
                           ["ecid", "creation_user", "creation_timestamp"])
from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

w = Window.partitionBy("ecid").orderBy("creation_timestamp")
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn").show()
#+--------------+-------------+-------------------+
#| ecid|creation_user| creation_timestamp|
#+--------------+-------------+-------------------+
#|ECID-195000300| USER_ID2|2016-08-31 20:00:00|
#+--------------+-------------+-------------------+
I think you will need a window function plus a filter for that. I can propose the following untested solution:
import pyspark.sql.window as psw
import pyspark.sql.functions as psf
w = psw.Window.partitionBy('ecid')
df = (df.withColumn("min_tmp", psf.min('creation_timestamp').over(w))
.filter(psf.col("min_tmp") == psf.col("creation_timestamp"))
)
The window function allows you to return the min over each ecid as a new column of your DataFrame.
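For reference, the groupBy approach from the question can also keep every column by joining the aggregated minimum back onto the original DataFrame (a sketch, untested; note that ties on the minimum timestamp would keep multiple rows per ecid):
from pyspark.sql.functions import min as min_

# minimum timestamp per ecid, then join back to recover the remaining columns
min_ts = df.groupBy("ecid").agg(min_("creation_timestamp").alias("creation_timestamp"))
df.join(min_ts, on=["ecid", "creation_timestamp"], how="inner").show()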

pyspark function to change datatype

The code works outside the function; however, when I move it inside the function and adjust it for the var argument passed in, I get an error. Thanks for the help!
from pyspark.sql.types import DateType
from pyspark.sql.functions import col, unix_timestamp, to_date
def change_string_to_date(df, var):
    df = df.withColumn("{}".format(var), to_date(unix_timestamp(col("{}".format(var))), 'yyyy-MM-dd').cast("timestamp"))
    return df
df_data = change_string_to_date(df_data,'mis_dt')
Figured it out: unix_timestamp was causing the problem. Very silly error.
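Since the self-answer does not show the corrected code, here is a sketch of what the fix presumably looks like, assuming the column holds date strings in yyyy-MM-dd format: to_date can parse the string directly, with no unix_timestamp round-trip. The .cast("timestamp") from the original could be appended if a timestamp rather than a date is required.
from pyspark.sql.functions import col, to_date

def change_string_to_date(df, var):
    # parse the string column in place as a date column
    return df.withColumn(var, to_date(col(var), 'yyyy-MM-dd'))

df_data = change_string_to_date(df_data, 'mis_dt')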

pyspark's window function fn.avg() only outputs the same data

Here is my code:
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as fn
from pyspark.sql.functions import isnan, isnull, lit
from pyspark.sql.window import Window

spark = SparkSession.builder.appName(" ").getOrCreate()
file = r"D:\project\HistoryData.csv"
lines = pd.read_csv(file)
# build a Spark DataFrame from the pandas DataFrame read above
spark_df = spark.createDataFrame(lines, ['id', 'time', 'average', 'max', 'min'])

temp = Window.partitionBy("time").orderBy("id").rowsBetween(-1, 1)
df = spark_df.withColumn("movingAvg", fn.avg("average").over(temp))
df.show()
But the output is not what I expect: it shows the same data for the moving average, and some of the data disappears.
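No answer is recorded here, but one possible explanation (an assumption, not confirmed by the post): partitionBy("time") computes the moving average within each distinct time value, so if every timestamp is unique, each window contains a single row and movingAvg simply repeats average. If the intent is a moving average over time for each id, the partition and order columns would be swapped; a sketch, reusing the names from the snippet above:
# assumption: a per-id moving average over a 3-row window ordered by time
temp = Window.partitionBy("id").orderBy("time").rowsBetween(-1, 1)
df = spark_df.withColumn("movingAvg", fn.avg("average").over(temp))
df.show()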

Pyspark Window Function

I am trying to calculate the row_number on a dataset based on certain columns, but I am getting the below error:
AttributeError: 'module' object has no attribute 'rowNumber'
I am using the below script to get the row number based on MID and ClaimID. Any thoughts on why this is coming up?
from pyspark.sql.functions import first
from pyspark.sql.types import *
from pyspark.sql import *
from pyspark.sql import Row, functions as F
from pyspark.sql.window import Window
import pyspark.sql.functions as func

def Codes(pharmacyCodes):
    df_data = pharmacyCodes
    (df_data
        .select("MID", "claimid",
                F.rowNumber()
                 .over(Window
                       .partitionBy("MID")
                       .orderBy("MID"))
                 .alias("rowNum"))
        .show())
I think you're looking for row_number rather than rowNumber. The mixture of camel case and snake case in PySpark can get confusing.
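For reference, a sketch of the corrected call using row_number (untested, assuming the same DataFrame and columns as in the question):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def Codes(pharmacyCodes):
    # row_number per MID partition, aliased as rowNum
    (pharmacyCodes
        .select("MID", "claimid",
                F.row_number()
                 .over(Window.partitionBy("MID").orderBy("MID"))
                 .alias("rowNum"))
        .show())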