Python: using a list parameter in pyspark.sql like a macro in SAS - pyspark

I have a list and want to use it in a pyspark.sql statement:
VLIST=['afhjh', 'aikn5','hsa76']
INC=pyspark.sql("select * from table1 where VIG=$VLIST")
I tried to use it like a SAS statement (using $ instead of &), which failed.
How can I use it correctly?

You can try something a bit different (assuming your SparkSession is named spark):
VLIST = ('afhjh', 'aikn5', 'hsa76')
INC = spark.sql(f"select * from table1 where VIG in {VLIST}")
or another way:
from pyspark.sql import functions as F
INC = spark.table("table1")
INC = INC.where(F.col("VIG").isin(*VLIST))
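One caveat with the f-string approach above: if the list can ever contain a single element, Python's tuple repr adds a trailing comma (e.g. ('afhjh',)), which is not valid SQL. A small sketch that builds the IN list explicitly avoids this (assuming a SparkSession named spark and values that contain no quotes):
VLIST = ['afhjh', 'aikn5', 'hsa76']
# quote each value and join them, so the clause is valid for any list length
in_clause = ", ".join(f"'{v}'" for v in VLIST)
INC = spark.sql(f"select * from table1 where VIG in ({in_clause})")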

Pyspark join on multiple aliased table columns

Python doesn't like the ampersand below.
I get the error: & is not a supported operation for types str and str. Please review your code.
Any idea how to get this right? I've never tried to join on more than one column for aliased tables. Thx!!
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on= (("crm.id=cng.id") & ("crm.cpid = cng.cpid")), how = "inner")
Try using column objects and the & operator instead of strings for the join condition:
from pyspark.sql import functions as F
df_initial_sample = df_crm.alias('crm').join(df_cngpt.alias('cng'), on=(F.col("crm.id") == F.col("cng.id")) & (F.col("crm.cpid") == F.col("cng.cpid")), how="inner")
Your join condition is overcomplicated. It can be as simple as this
df_initial_sample = df_crm.join(df_cngpt, on=['id', 'cpid'], how = 'inner')
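Note that the two forms behave a bit differently: joining on a list of column names keeps a single id and cpid column in the result, while an explicit column condition keeps both sides' copies. With the aliased form you can keep just one side afterwards, for example:
# after an explicit-condition join, keep only the crm side's columns
df_initial_sample = df_initial_sample.select('crm.*')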

Pyspark parent-child recursive lookup on the same dataframe

I have the following two DataFrames that store diagnostics and part changes for helicopter parts.
The diagnostic DataFrame stores the date each maintenance activity was carried out. The part change DataFrame stores all part removals for all the helicopter parts, parent (rotor) and children (turbofan, axle, module).
What I am trying to achieve is quite complex: based on the diagnostic df, I want the first removal for the same part, along with its parents rolled all the way up, so that I get the helicopter serial number at that maintenance date.
Here is the initial code to generate the sample datasets:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql import Window as W
diagnostic_list = [
{"SN":"tur26666","PN": "turbofan", "maintenance":"23/05/2016"}
]
part_change_list= [
{"SN":"tur26666","PN": "turbofan", "removal":"30/03/2019", "Parent_SN" : "com26666","Parent_PN": "compressor"},
{"SN":"tur26666","PN": "turbofan", "removal":"30/06/2016", "Parent_SN" : "com26666","Parent_PN": "compressor"},
{"SN":"com26666","PN": "compressor", "removal":"13/03/2019", "Parent_SN" : "rot71777","Parent_PN": "rotorcraft"},
{"SN":"com26666","PN": "compressor", "removal":"30/06/2016", "Parent_SN" : "rot26666","Parent_PN": "rotorcraft"},
{"SN":"rot26666","PN": "rotorcraft", "removal":"31/12/2019", "Parent_SN" : "OYAAA","Parent_PN": "helicopter"},
{"SN":"rot26666","PN": "rotorcraft", "removal":"24/06/2016", "Parent_SN" : "OYHZZ","Parent_PN": "helicopter"},
]
spark = SparkSession.builder.getOrCreate()
diagnostic_df = spark.createDataFrame(Row(**x) for x in diagnostic_list)
part_change_df = spark.createDataFrame(Row(**x) for x in part_change_list)
diagnostic_df.show()
+--------+--------+-----------+
| SN| PN|maintenance|
+--------+--------+-----------+
|tur26666|turbofan| 23/05/2016|
+--------+--------+-----------+
part_change_df.show()
+--------+----------+----------+---------+----------+
| SN| PN| removal|Parent_SN| Parent_PN|
+--------+----------+----------+---------+----------+
|tur26666| turbofan|30/03/2019| com26666|compressor|
|tur26666| turbofan|30/06/2016| com26666|compressor|
|com26666|compressor|13/03/2019| rot71777|rotorcraft|
|com26666|compressor|29/06/2016| rot26666|rotorcraft|
|rot26666|rotorcraft|31/12/2019| OYAAA|helicopter|
|rot26666|rotorcraft|24/06/2016| OYHZZ|helicopter|
+--------+----------+----------+---------+----------+
I was able to get the first removal for the child turbofan with the code below:
working_df = (
    diagnostic_df.join(part_change_df, ["SN", "PN"], how="inner")
    .filter(F.col("removal") >= F.col("maintenance"))
    .withColumn(
        "rank",
        F.rank().over(
            W.partitionBy([F.col(col) for col in ["SN", "PN", "maintenance"]]).orderBy(
                F.col("removal")
            )
        ),
    )
    .filter(F.col("rank") == 1)
    .drop("rank")
)
working_df.show()
+--------+--------+-----------+----------+---------+----------+
| SN| PN|maintenance| removal|Parent_SN| Parent_PN|
+--------+--------+-----------+----------+---------+----------+
|tur26666|turbofan| 23/05/2016|30/06/2016| com26666|compressor|
+--------+--------+-----------+----------+---------+----------+
How can I create a for loop or a recursive loop over part_change_df that takes each parent of the first child, makes it the next child, and gets that parent's first removal after the first child's (turbofan's) maintenance date, producing results like this?
+--------+--------+-----------+----------+---------+----------+--------------+--------------+--------------+-------------------+-------------------+-------------------+
| SN| PN|maintenance| removal|Parent_SN| Parent_PN|Parent_removal|next_Parent_SN|next_Parent_PN|next_Parent_removal|next_next_Parent_PN|next_next_Parent_SN|
+--------+--------+-----------+----------+---------+----------+--------------+--------------+--------------+-------------------+-------------------+-------------------+
|tur26666|turbofan| 23/05/2016|30/06/2016| com26666|compressor| 29/06/2016| rot26666| rotorcraft| 24/06/2016| helicopter| OYHZZ|
+--------+--------+-----------+----------+---------+----------+--------------+--------------+--------------+-------------------+-------------------+-------------------+
I could hardcode each parent and join the working dataframe with the part change dataframe, but the problem is that I don't know exactly how many parents a child will have.
The ultimate goal is to take the child maintenance date and roll up all the way to the final parent removal date and the helicopter serial number:
+--------+--------+-----------+----------+-------------------+-------------------+-------------------+
| SN| PN|maintenance| removal|next_Parent_removal|next_next_Parent_PN|next_next_Parent_SN|
+--------+--------+-----------+----------+-------------------+-------------------+-------------------+
|tur26666|turbofan| 23/05/2016|30/06/2016| 24/06/2016| helicopter| OYHZZ|
+--------+--------+-----------+----------+-------------------+-------------------+-------------------+
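One possible direction (a rough sketch, not from the original thread): parse the date strings with to_date so that the comparisons are chronological, then repeat a single parent join inside a loop, carrying the previous level's parent keys forward. Level-numbered column names (p1_, p2_, ...) are used instead of the next_Parent_ naming above, the loop bound max_levels is an assumed upper limit on the hierarchy depth, and asc_nulls_last assumes Spark 2.4+.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
# parse the string dates so that ">=" and the ordering are chronological
part_change_dt = part_change_df.withColumn("removal", F.to_date("removal", "dd/MM/yyyy"))
current = (
    working_df
    .withColumn("maintenance", F.to_date("maintenance", "dd/MM/yyyy"))
    .withColumn("removal", F.to_date("removal", "dd/MM/yyyy"))
)
prev_sn, prev_pn = "Parent_SN", "Parent_PN"
max_levels = 5  # assumed upper bound on the depth of the hierarchy
for level in range(1, max_levels + 1):
    parent = part_change_dt.select(
        F.col("SN").alias("_join_SN"),
        F.col("PN").alias("_join_PN"),
        F.col("removal").alias(f"p{level}_removal"),
        F.col("Parent_SN").alias(f"p{level}_Parent_SN"),
        F.col("Parent_PN").alias(f"p{level}_Parent_PN"),
    )
    current = current.join(
        parent,
        (F.col(prev_sn) == F.col("_join_SN"))
        & (F.col(prev_pn) == F.col("_join_PN"))
        & (F.col(f"p{level}_removal") >= F.col("maintenance")),
        how="left",
    ).drop("_join_SN", "_join_PN")
    # keep only the earliest removal of this parent, mirroring the rank step above;
    # rows whose hierarchy ended at an earlier level simply carry nulls forward
    w = W.partitionBy("SN", "PN", "maintenance").orderBy(F.col(f"p{level}_removal").asc_nulls_last())
    current = current.withColumn("rk", F.row_number().over(w)).filter(F.col("rk") == 1).drop("rk")
    prev_sn, prev_pn = f"p{level}_Parent_SN", f"p{level}_Parent_PN"
current.show()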

replace column and get ltrim of the column value

I want to replace a column value in a DataFrame and need the Scala syntax for this.
Controlling_Area = CC2
Hierarchy_Name = CC2HIDNE
Need to write it as: HIDNE
i.e. remove the Controlling_Area value that is present in Hierarchy_Name.
val dfPC = ReadLatest("/Full", "parquet")
.select(
LRTIM( REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"") ),
Col(ColumnN),
Col(ColumnO)
)
notebook:3: error: not found: value REPLACE
REPLACE(col("Hierarchy_Name"),col("Controlling_Area"),"")
^
I am expecting to get the LTRIM and replace code in Scala.
You can use the built-in SQL replace function (via expr, available in Spark 2.3 and later) together with ltrim to achieve that:
import org.apache.spark.sql.functions._
val dfPC = ReadLatest("/Full", "parquet")
  .withColumn("Hierarchy_Name", ltrim(expr("replace(Hierarchy_Name, Controlling_Area, '')")))

Pyspark regex to data frame

I have code similar to this:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
def regex_filter(x):
    regexs = ['.*123.*']
    if x and x.strip():
        for r in regexs:
            if re.match(r, x, re.IGNORECASE):
                return True
    return False
filter_udf = udf(regex_filter, BooleanType())
df_filtered = df.filter(filter_udf(df.fieldXX))
I want to use "regexs" var to verify if any digit "123" is in "fieldXX"
i don't know what i did wrong!
Could anyone help me with this?
The regexp is incorrect if you want to match any one of the digits 1, 2 or 3 (rather than the sequence "123"). In that case it should be something like:
regexs = ['.*[123].*']
You can use a SQL expression to achieve this:
df.createOrReplaceTempView("df_temp")
df_1 = spark.sql("select *, case when fieldXX like '%123%' then 'TRUE' else 'FALSE' end as col2 from df_temp")
A disadvantage of using a Python UDF is that Spark cannot optimize it and every row has to be serialized to and from Python, so the built-in SQL expression is usually faster.
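Alternatively, the same filter can be written without a UDF or a temp view using rlike (assuming the column is named fieldXX as in the question):
from pyspark.sql import functions as F
# keep rows where fieldXX contains the substring "123"
df_filtered = df.filter(F.col("fieldXX").rlike("123"))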

SPARK SQL: Implement AND condition inside a CASE statement

I am aware of how to implement a simple CASE-WHEN-THEN clause in Spark SQL using Scala. I am using version 1.6.2, but I need to specify an AND condition on multiple columns inside the CASE-WHEN clause. How can I achieve this in Spark using Scala?
Thanks in advance for your time and help!
Here's the SQL query that I have:
select sd.standardizationId,
       case when sd.numberOfShares = 0 and
                 isnull(sd.derivatives, 0) = 0 and
                 sd.holdingTypeId not in (3, 10)
            then 8
            else holdingTypeId
       end as holdingTypeId
from sd;
First read the table as a DataFrame:
val table = sqlContext.table("sd")
Then select with an expression, adjusting the syntax to your SQL dialect:
val result = table.selectExpr("standardizationId","case when numberOfShares = 0 and isnull(derivatives,0) = 0 and holdingTypeId not in (3,10) then 8 else holdingTypeId end as holdingTypeId")
And show the result:
result.show
An alternative option, if you want to avoid the full string expression, is the following:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
val sd = sqlContext.table("sd")
val conditionedColumn: Column = when(
  (sd("numberOfShares") === 0) and
  (coalesce(sd("derivatives"), lit(0)) === 0) and
  (!sd("holdingTypeId").isin(Seq(3,10): _*)), 8
).otherwise(sd("holdingTypeId")).as("holdingTypeId")
val result = sd.select(sd("standardizationId"), conditionedColumn)
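For comparison, a rough PySpark equivalent of the same when/otherwise logic (assuming a SparkSession named spark and a table registered as "sd"):
from pyspark.sql import functions as F
sd = spark.table("sd")
result = sd.select(
    "standardizationId",
    F.when(
        (F.col("numberOfShares") == 0)
        & (F.coalesce(F.col("derivatives"), F.lit(0)) == 0)
        & (~F.col("holdingTypeId").isin(3, 10)),
        8,
    ).otherwise(F.col("holdingTypeId")).alias("holdingTypeId"),
)
result.show()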