Pyspark regex to data frame

I have code similar to this:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def regex_filter(x):
    regexs = ['.*123.*']
    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True
    return False

filter_udf = udf(regex_filter, BooleanType())
df_filtered = df.filter(filter_udf(df.fieldXX))
I want to use the "regexs" variable to verify whether "123" appears in "fieldXX".
I don't know what I did wrong. Could anyone help me with this?

The regexp is incorrect.
I think it should be something like:
regexs = ['.*[123].*']
(this matches any of the digits 1, 2 or 3 anywhere in the value, rather than the exact sequence "123").

You can use a SQL expression to achieve this:
df.createOrReplaceTempView("df_temp")
df_1 = spark.sql("select *, case when col1 like '%123%' then 'TRUE' else 'FALSE' end as col2 from df_temp")
A disadvantage of using a UDF is that you cannot save the data frame back or do further manipulations on that data frame.
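If the goal is to filter the rows rather than flag them, a minimal sketch of the same idea against the temp view could look like this (assuming, as in the question, that the column to check is fieldXX):
# Hypothetical sketch: keep only rows whose fieldXX contains "123",
# reusing the temp view registered above.
df_filtered = spark.sql("select * from df_temp where fieldXX like '%123%'")
df_filtered.show()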


Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join on more than one column for aliased tables. Thx!!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")
Try using it as below:
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on=["id", "cpid"], how="inner")
Your join condition is overcomplicated. It can be as simple as this
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')
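If you do want to keep the table aliases and spell out each equality explicitly (which is what the original '&' attempt was aiming for), a sketch along these lines should work; each condition has to be a Column expression, not a string:
from pyspark.sql import functions as F

# Sketch: explicit column-equality conditions on the aliased DataFrames;
# Spark ANDs a list of Column conditions together.
df_initial_sample = df_crm.alias('crm').join(
    df_cngpt.alias('cng'),
    on=[F.col('crm.id') == F.col('cng.id'), F.col('crm.cpid') == F.col('cng.cpid')],
    how='inner'
)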

How to convert the Int column into a string in Pyspark?

Since I am a beginner with PySpark, can anyone help with converting an integer column into a string?
Here is my code in AWS Athena; I need to convert it into a PySpark dataframe expression.
case when A.[HHs Reach] = 0 or A.[HHs Reach] is null then '0'
when A.[HHs Reach] = 1000000000 then '*'
else cast(A.[HHs Reach] as varchar) end as [HHs Reach]
Assuming df is your dataframe, something like this:
from pyspark.sql import functions as F

df = df.withColumn(
    "HHs Reach",
    F.when(F.col("HHs Reach").isNull() | (F.col("HHs Reach") == 0), '0')
    .when(F.col("HHs Reach") == 1000000000, '*')
    .otherwise(F.col("HHs Reach").cast("string"))
)
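If all you need is the plain conversion from the title, without the special cases, a minimal sketch is just a cast (column name taken from the question):
# Sketch: straight int-to-string cast of the column.
df = df.withColumn("HHs Reach", F.col("HHs Reach").cast("string"))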

replace column and get ltrim of the column value

I want to modify a column in a dataframe and need the Scala syntax for this.
Controlling_Area = CC2
Hierarchy_Name = CC2HIDNE
Need to write it as: HIDENE
i.e. remove the Controlling_Area value that is present in Hierarchy_Name.
val dfPC = ReadLatest("/Full", "parquet")
.select(
LRTIM( REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"") ),
Col(ColumnN),
Col(ColumnO)
)
notebook:3: error: not found: value REPLACE
REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"")
^
I am expecting to get the ltrim and replace code in Scala.
You can use withColumnRenamed to achieve that:
import org.apache.spark.sql.functions._

val dfPC = ReadLatest("/Full", "parquet")
  .withColumnRenamed("Hierarchy_Name", "Controlling_Area")
  .withColumn("Controlling_Area", ltrim(col("Controlling_Area")))

window functions (lag, lead) implementation in pyspark?

Below is the T-SQL code, followed by my attempt to convert it to PySpark using window functions.
case
    when eventaction = 'IN'
         and lead(eventaction, 1) over (partition by barcode order by barcode, eventdate, transactionid) in ('IN', 'OUT')
    then lead(eventaction, 1) over (partition by barcode order by barcode, eventdate, transactionid)
    else ''
end as next_action
The PySpark code below gives an error when using the window function lead:
Tgt_df = Tgt_df.withColumn((('Lead', lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate")) == 'IN' )|
('1', lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate")) == 'OUT')
, (lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate"))).otherwise('').alias("next_action")))
But it's not working. What to do!?
The withColumn method should be used as df.withColumn('name_of_col', value_of_column); that's why you get an error.
From your T-SQL request, the corresponding PySpark code should be:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("barcode").orderBy("barcode", "eventdate", "transactionid")
Tgt_df = Tgt_df.withColumn(
    'next_action',
    F.when(
        (F.col('eventaction') == 'IN') & (F.lead('eventaction', 1).over(w).isin(['IN', 'OUT'])),
        F.lead('eventaction', 1).over(w)
    ).otherwise('')
)
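Since the title also asks about lag: the same window spec can be reused to look backwards instead of forwards. A minimal sketch, with the column name assumed from the question:
# Sketch: previous event for each barcode, reusing the window w defined above.
Tgt_df = Tgt_df.withColumn('prev_action', F.lag('eventaction', 1).over(w))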

SPARK SQL: Implement AND condition inside a CASE statement

I am aware of how to implement a simple CASE-WHEN-THEN clause in Spark SQL using Scala. I am using version 1.6.2, but I need to specify an AND condition on multiple columns inside the CASE-WHEN clause. How can I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
case when sd.numberOfShares = 0 and
isnull(sd.derivatives,0) = 0 and
sd.holdingTypeId not in (3,10)
then
8
else
holdingTypeId
end
as holdingTypeId
from sd;
First read the table as a dataframe:
val table = sqlContext.table("sd")
Then select with an expression, aligning the syntax with your database:
val result = table.selectExpr("standardizationId", "case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative option, if you want to avoid the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val sd = sqlContext.table("sd")

val conditionedColumn: Column = when(
  (sd("numberOfShares") === 0) and
  (coalesce(sd("derivatives"), lit(0)) === 0) and
  (!sd("holdingTypeId").isin(Seq(3, 10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")

val result = sd.select(sd("standardizationId"), conditionedColumn)