I have code similar to this:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def regex_filter(x):
    regexs = ['.*123.*']
    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True
    return False

filter_udf = udf(regex_filter, BooleanType())
df_filtered = df.filter(filter_udf(df.fieldXX))
I want to use the "regexs" variable to verify if any digit "123" is in "fieldXX".
I don't know what I did wrong!
Could anyone help me with this?
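The regexp is incorrect if you want to match any one of the digits 1, 2 or 3.
I think it should be something like this (keep it a list, because the function loops over it):
regexs = ['.*[123].*']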
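To see how the two patterns differ, here is a small plain-Python sketch that needs no Spark. The helper names contains_sequence and contains_any_digit are made up for illustration. It uses re.search, which scans the whole string, so the leading and trailing .* are not needed.

```python
import re


def contains_sequence(text):
    """True if the literal substring '123' appears anywhere in text."""
    return bool(text) and re.search(r"123", text) is not None


def contains_any_digit(text):
    """True if at least one of the characters 1, 2 or 3 appears in text."""
    return bool(text) and re.search(r"[123]", text) is not None
```

Either function could be wrapped in udf(..., BooleanType()) the same way as regex_filter in the question. Which one is right depends on whether "123" is meant as a sequence or as a set of digits.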
You can use a SQL function to achieve this:
df.createOrReplaceTempView("df_temp")
df_1 = spark.sql("select *, case when col1 like '%123%' then 'TRUE' else 'FALSE' end col2 from df_temp")
The disadvantage of using a Python UDF is that Spark cannot optimize it and has to ship every row to a Python worker, so built-in functions such as like are usually faster.
Related
Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join on more than one column for aliased tables. Thx!!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")
Try it as below, passing both column names in a single list:
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on=["id", "cpid"], how="inner")
Your join condition is overcomplicated. It can be as simple as this:
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')
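The error in the question comes from plain Python, not from Spark, so it can be reproduced without a cluster. The sketch below is only an illustration (try_and_operator is a made-up helper). It shows that & between two strings raises a TypeError, and that Python's and between two lists does not merge them, which is a related pitfall when building the on argument.

```python
def try_and_operator(left, right):
    """Return left & right, or a short description of the TypeError raised."""
    try:
        return left & right
    except TypeError as exc:
        return f"TypeError: {exc}"


# Two plain strings cannot be combined with &, whatever they contain.
string_result = try_and_operator("crm.id = cng.id", "crm.cpid = cng.cpid")

# `and` between two non-empty lists simply returns the second list.
list_result = ["id"] and ["cpid"]

# One list holding both names is what join(on=...) expects for a multi-column equi-join.
join_keys = ["id", "cpid"]
```

If the conditions need to stay as expressions, they have to be Column objects (for example built with col) before & can combine them.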
Since I am a beginner with PySpark, can anyone help me convert an integer column into a string?
Here is my code in AWS Athena; I need to convert it to a PySpark dataframe expression.
case when A.[HHs Reach] = 0 or A.[HHs Reach] is null then '0'
     when A.[HHs Reach] = 1000000000 then '*'
     else cast(A.[HHs Reach] as varchar) end as [HHs Reach]
Assuming df is your dataframe, something like this:
from pyspark.sql import functions as F

# Zero and null both map to '0', as in the original CASE expression
df = df.withColumn(
    "HHs Reach",
    F.when((F.col("HHs Reach") == 0) | F.col("HHs Reach").isNull(), "0")
    .when(F.col("HHs Reach") == 1000000000, "*")
    .otherwise(F.col("HHs Reach").cast("string")),
)
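As a quick way to check the result on a few sample values, the CASE expression can also be written as a plain-Python reference function. This is only a sketch for comparison; hhs_reach_label is a made-up name and is not part of the Spark code.

```python
def hhs_reach_label(value):
    """Mirror the CASE expression: null or 0 -> '0', 1000000000 -> '*', else the number as text."""
    if value is None or value == 0:
        return "0"
    if value == 1000000000:
        return "*"
    return str(value)
```

Comparing a handful of collected rows against this function is an easy way to confirm that both the zero and the null branch behave as intended.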
I want to replace the values of a column in a dataframe and need the Scala syntax for it.
Controlling_Area = CC2
Hierarchy_Name = CC2HIDNE
Needs to be written as: HIDNE
i.e. remove the Controlling_Area value that is present in Hierarchy_Name.
val dfPC = ReadLatest("/Full", "parquet")
.select(
LRTIM( REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"") ),
Col(ColumnN),
Col(ColumnO)
)
notebook:3: error: not found: value REPLACE
REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"")
^
I am expecting to get the LTRIM and REPLACE equivalents in Scala.
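You can use withColumn with the SQL replace function (through expr, available in Spark 2.3 and later) and ltrim to achieve that:
import org.apache.spark.sql.functions._
val dfPC = ReadLatest("/Full", "parquet")
  // Remove the Controlling_Area value from Hierarchy_Name, then trim leading spaces
  .withColumn("Hierarchy_Name", ltrim(expr("replace(Hierarchy_Name, Controlling_Area, '')")))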
Below is the T-SQL code. I tried to convert it to PySpark using window functions; that attempt is also shown below.
case
    when eventaction = 'IN' and lead(eventaction,1) over (PARTITION BY barcode order by barcode,eventdate,transactionid) in('IN','OUT')
    then lead(eventaction,1) over (PARTITION BY barcode order by barcode,eventdate,transactionid)
    else ''
end as next_action
PySpark code that gives an error when using the window function lead:
Tgt_df = Tgt_df.withColumn((('Lead', lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate")) == 'IN' )|
('1', lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate")) == 'OUT')
, (lead('eventaction').over(Window.partitionBy("barcode").orderBy("barcode","transactionid", "eventdate"))).otherwise('').alias("next_action")))
But it's not working. What to do!?
The withColumn method must be called as df.withColumn('name_of_col', value_of_column); that is why you get an error.
Based on your T-SQL query, the corresponding PySpark code should be:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Same ordering as the T-SQL OVER clause: barcode, eventdate, transactionid
w = Window.partitionBy("barcode").orderBy("barcode", "eventdate", "transactionid")
next_event = F.lead("eventaction", 1).over(w)
Tgt_df = Tgt_df.withColumn(
    "next_action",
    F.when((F.col("eventaction") == "IN") & next_event.isin(["IN", "OUT"]), next_event)
    .otherwise(""),
)
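To make the expected output of the lead logic concrete, here is a plain-Python sketch of the same rule on a list of dicts. It is not Spark code: add_next_action is a made-up helper, the column names follow the question, and rows are ordered by barcode, eventdate, transactionid as in the T-SQL.

```python
from itertools import groupby
from operator import itemgetter


def add_next_action(rows):
    """Emulate lead(eventaction) per barcode and apply the CASE rule from the question."""
    sort_key = itemgetter("barcode", "eventdate", "transactionid")
    result = []
    # Sorting by barcode first lets groupby act as the PARTITION BY.
    for _, group in groupby(sorted(rows, key=sort_key), key=itemgetter("barcode")):
        events = list(group)
        for i, row in enumerate(events):
            # lead(..., 1): the next row's action, or None on the last row of a partition
            following = events[i + 1]["eventaction"] if i + 1 < len(events) else None
            keep = row["eventaction"] == "IN" and following in ("IN", "OUT")
            result.append({**row, "next_action": following if keep else ""})
    return result


sample = [
    {"barcode": "A", "eventdate": "2020-01-01", "transactionid": 1, "eventaction": "IN"},
    {"barcode": "A", "eventdate": "2020-01-02", "transactionid": 2, "eventaction": "OUT"},
    {"barcode": "B", "eventdate": "2020-01-01", "transactionid": 3, "eventaction": "IN"},
]
next_actions = [row["next_action"] for row in add_next_action(sample)]
```

The last row of each barcode has no following event, so it always falls into the else branch and gets an empty string.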
I am aware of how to implement a simple CASE-WHEN-THEN clause in SPARK SQL using Scala. I am using Version 1.6.2. But, I need to specify AND condition on multiple columns inside the CASE-WHEN clause. How to achieve this in SPARK using Scala ?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
       case when sd.numberOfShares = 0 and
                 isnull(sd.derivatives,0) = 0 and
                 sd.holdingTypeId not in (3,10)
            then 8
            else holdingTypeId
       end as holdingTypeId
from sd;
First read the table as a dataframe:
val table = sqlContext.table("sd")
Then select with an expression. Adjust the syntax to Spark SQL here; for example, T-SQL's two-argument isnull(x, 0) becomes coalesce(x, 0):
val result = table.selectExpr("standardizationId", "case when numberOfShares = 0 and coalesce(derivatives, 0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative, if you want to avoid writing the full expression as a string, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
  (sd("numberOfShares") === 0) and
  (coalesce(sd("derivatives"), lit(0)) === 0) and
  (!sd("holdingTypeId").isin(Seq(3,10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)
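For checking individual rows by hand, the same multi-column condition can be sketched as a plain-Python reference function. This is an illustration only: resolve_holding_type is a made-up name, and None stands in for SQL NULL. As in SQL, a NULL numberOfShares or holdingTypeId makes the condition fail, so the original holdingTypeId is returned.

```python
def resolve_holding_type(number_of_shares, derivatives, holding_type_id):
    """Return 8 when all three conditions hold, otherwise the original holding_type_id."""
    # coalesce(derivatives, 0)
    derivatives = 0 if derivatives is None else derivatives
    matches = (
        number_of_shares == 0
        and derivatives == 0
        and holding_type_id is not None
        and holding_type_id not in (3, 10)
    )
    return 8 if matches else holding_type_id
```

All three conditions must hold at once, which is what chaining them with and (or with && on Spark columns) expresses.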