Transform Scala code to Pyspark - Where clause - scala

I have the scala code as follows, I need to transform it to Pyspark, but beforehand I need to undrestand that what exactly where is doing here ?
val df= df1.join(df2,Seq("ID")).where('EVENT>='CONTINOUS_ENROL_START && 'EVENT <= 'CONTINOUS_ENROL_END)

Related

Filter rows if value exists in array column

NOTE: I'm working in Spark 2.4.4
I have the following dataset
col1
['{"key1": "val1"}','{"key2": "val2"}']
['{"key1": "val1"}','{"key2": "val3"}']
Essentially, I'd like to filter out any rows where key2 is not val2.
col1
['{"key1": "val1"}','{"key2": "val2"}']
In trino SQL, I'm doing it like this:
any_match(col1, x -> json_extract_scalar(x, '$.key2') = 'val2')
But this isn't available in Spark 2.4
My only idea is to explode and then use the following code which isn't efficient.
df.filter(F.get_json_object(F.col("col1"), '$.key2') == 'val2')
I'm wondering if I can do this without exploding in my version of spark (2.4.4)
For spark >=2.4, you can use the exists function of spark SQL.
df = df.withColumn('flag', F.expr('exists(col1, x -> get_json_object(x, "$.key2") == "val2")')) \
.filter(F.col('flag')).drop('flag')
df.show(truncate=False)

cast string column to decimal in scala dataframe

I have a dataframe (scala)
I am using both pyspark and scala in a notebook
#pyspark
spark.read.csv(output_path + '/dealer', header = True).createOrReplaceTempView('dealer_dl')
%scala
import org.apache.spark.sql.functions._
val df = spark.sql("select * from dealer_dl")
How to convert a string column (amount) into decimal in scala dataframe.
I tried as below.
%scala
df = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))
But I am getting an error as below:
error: reassignment to val
I am used to pyspark and quite new to scala. I need to do by scala to proceed further. Please let me know. Thanks.
in scala you can't reasign references defined as val but val is immutable reference. if you want to use reasigning some ref you can use var but better solution is not reasign something to the same reference name and use another val.
For example:
val dfWithDecimalAmount = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))

How to get the common value by comparing two pyspark dataframes

I am migrating the pandas dataframe to pyspark. I have two dataframes in pyspark with different counts. The below code I am able to achieve in pandas but not in pyspark. How to compare the 2 dataframes values in pyspark and put the value as new column in df2.
def impute_value (row,df_custom):
for index,row_custom in df_custom.iterrows():
if row_custom["Identifier"] == row["IDENTIFIER"]:
row["NEW_VALUE"] = row_custom['CUSTOM_VALUE']
return row["NEW_VALUE"]
df2['VALUE'] = df2.apply(lambda row: impute_value(row, df_custom),axis =1)
How can I convert this particular function to pyspark dataframe? In pyspark, I cannot pass the row wise value to the function(impute_value).
I tried the following.
df3= df2.join(df_custom, df2["IDENTIFIER"]=df_custom["Identifier"],"left")
df3.WithColumnRenamed("CUSTOM_VALUE","NEW_VALUE")
This is not giving me the result.
the left join itself should do the needful
import pyspark.sql.functions as f
df3= df2.join(df_custom.withColumnRenamed('Identifier','Id'), df2["IDENTIFIER"]=df_custom["Id"],"left")
df3=df3.withColumn('NEW_VALUE',f.col('CUSTOM_VALUE')).drop('CUSTOM_VALUE','Id')

spark dataframe filter and select

I have a spark scala dataframe and need to filter the elements based on condition and select the count.
val filter = df.groupBy("user").count().alias("cnt")
val **count** = filter.filter(col("user") === ("subscriber").select("cnt")
The error i am facing is value select is not a member of org.apache.spark.sql.Column
Also for some reasons count is Dataset[Row]
Any thoughts to get the count in a single line?
DataSet[Row] is DataFrame
RDD[Row] is DataFrame so no need to worry.. its dataframe
see this for better understanding... Difference between DataFrame, Dataset, and RDD in Spark
Regarding select is not a member of org.apache.spark.sql.Column its purely compile error.
val filter = df.groupBy("user").count().alias("cnt")
val count = filter.filter (col("user") === ("subscriber"))
.select("cnt")
will work since you are missing ) braces which is closing brace for filter.
You are missing ")" before .select, Please check below code.
Column class don't have .select method, you have to invoke select on Dataframe.
val filter = df.groupBy("user").count().alias("cnt")
val **count** = filter.filter(col("user") === "subscriber").select("cnt")

scala: select column where not contains elements into dataframe

I have this ligne of code that should create a dataframe from list of columns that not contain a string. I tried this but it doesn't work:
val exemple = hiveObj.sql("show tables in database").select("tableName")!==="ABC".collect()
Try using the filter method:
import org.apache.spark.sql.functions._
import spark.implicits._
val exemple = hiveObj.sql("your query here").filter($"columnToFilter" =!= "ABC").show
NOTE: the inequality operator =!=is only available for Spark 2.0.0+. If you're using an older version, you must use !==. You can see the documentation here.
If you need to filter several columns you can do so:
.filter($"columnToFilter" =!= "ABC" and $"columnToFilter2" =!= "ABC")
another alternative answer to my question:
val exemple1 = hiveObj.sql("show tables in database").filter(!$"tableName".contains("ABC")).show()