PySpark regexp_replace

I am trying to replace values in a dataframe based on a when condition.
i.e., the data frame values are:
|country_code|country_code_iso3|
|       UA-09|             null|
|       UA-14|             null|
|       UA-43|               UA|
I run this code:
df.withColumn('country_code_iso3',
    when(df.country_code.startswith('UA-'), regexp_replace('country_code_iso3', '', 'UKR'))
    .when(df.country_code.startswith('UA-'), regexp_replace('country_code_iso3', 'UA', 'UKR'))
    .otherwise(df.country_code_iso3))
but my results end up like this:
|country_code|country_code_iso3|
|       UA-09|             null|
|       UA-14|             null|
|       UA-43|      UKRUUKRAUKR|
I want it to look like this:
|country_code|country_code_iso3|
|       UA-09|              UKR|
|       UA-14|              UKR|
|       UA-43|              UKR|
Any idea how I can tweak my code to fix this?
Thanks!

If you just want 'UKR' whenever country_code starts with 'UA-', you can just do this. Also, in your code the first when statement is causing the issue, and your second when is never hit because its condition has already been met.
from pyspark.sql.functions import when, lit

df = (df
      .withColumn('country_code_iso3',
                  when(df.country_code.startswith('UA-'), lit('UKR'))
                  .otherwise(df.country_code_iso3))
     )
df.show()
+------------+-----------------+
|country_code|country_code_iso3|
+------------+-----------------+
|       UA-09|              UKR|
|       UA-14|              UKR|
|       UA-43|              UKR|
+------------+-----------------+
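As for why the original code produced UKRUUKRAUKR on the one non-null row: an empty regex pattern matches at every position in a string, so regexp_replace inserts the replacement before every character and at the end. A minimal sketch reproducing this (assuming an active SparkSession named spark; column names taken from the question):
from pyspark.sql.functions import regexp_replace

demo = spark.createDataFrame([("UA-43", "UA")], ["country_code", "country_code_iso3"])

# The empty pattern matches before 'U', before 'A', and at the end of 'UA',
# so 'UKR' is inserted three times: UKR + U + UKR + A + UKR = UKRUUKRAUKR
demo.select(regexp_replace("country_code_iso3", "", "UKR").alias("result")).show()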

Related

Get last n items in pyspark

For a dataset like -
+---+------+----------+
| id| item| timestamp|
+---+------+----------+
| 1| apple|2022-08-15|
| 1| peach|2022-08-15|
| 1| apple|2022-08-15|
| 1|banana|2022-08-14|
| 2| apple|2022-08-15|
| 2|banana|2022-08-14|
| 2|banana|2022-08-14|
| 2| water|2022-08-14|
| 3| water|2022-08-15|
| 3| water|2022-08-14|
+---+------+----------+
Can I use PySpark functions directly to get the last three items the user purchased in the past 5 days? I know a udf can do that, but I am wondering if any existing function can achieve this.
My expected output is like below, or anything similar is okay too.
id last_three_item
1 [apple, peach, apple]
2 [water, banana, apple]
3 [water, water]
Thanks!
You can use pandas_udf for this.
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

@f.pandas_udf(returnType=ArrayType(StringType()), functionType=f.PandasUDFType.GROUPED_AGG)
def pudf_get_top_3(x):
    # x is a pandas Series holding the group's items; take the first three rows
    return x.head(3).to_list()

sdf \
    .orderBy("timestamp") \
    .groupBy("id") \
    .agg(pudf_get_top_3("item")
         .alias("last_three_item")) \
    .show()
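For reference, a minimal sketch that builds the sample frame used above (assuming an active SparkSession named spark), so the snippet can be run end to end:
sdf = spark.createDataFrame(
    [(1, "apple", "2022-08-15"), (1, "peach", "2022-08-15"), (1, "apple", "2022-08-15"),
     (1, "banana", "2022-08-14"), (2, "apple", "2022-08-15"), (2, "banana", "2022-08-14"),
     (2, "banana", "2022-08-14"), (2, "water", "2022-08-14"), (3, "water", "2022-08-15"),
     (3, "water", "2022-08-14")],
    ["id", "item", "timestamp"],
)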

PySpark: Remove leading numbers and full stop from dataframe column

I'm trying to remove numbers and full stops that lead the names of horses in a betting dataframe.
The format is a leading number and full stop before the name, e.g. 123.Horse Name.
I would like the resulting df column to just have the horse's name.
I've tried splitting the column at the full stop but am not getting the required result.
import pyspark.sql.functions as F
runners_returns = runners_returns.withColumn('runner_name', F.split(F.col('runner_name'), '.'))
Any help is greatly appreciated
With a Dataframe like the following.
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| 123.John|
| 2| 5.42Anna|
| 3| .203Josh|
| 4| 102Paul|
+---+-----------+
You can remove the leading numbers and periods like this.
import pyspark.sql.functions as F

df = (df.withColumn("runner_name",
                    F.regexp_replace('runner_name', r'(^[\d\.]+)', '')))
df.show()
+---+-----------+
| ID|runner_name|
+---+-----------+
| 1| John|
| 2| Anna|
| 3| Josh|
| 4| Paul|
+---+-----------+
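As for the original split attempt: the second argument to F.split is treated as a regular expression, and an unescaped . matches any character, so the column gets split at every single character. A sketch of what escaping the dot would look like (the name_parts column here is just illustrative; this still leaves the digits in the first array element, which is why the regexp_replace answer above is the simpler route):
import pyspark.sql.functions as F

# Escaping the dot makes split treat it as a literal full stop rather than "any character"
df = df.withColumn("name_parts", F.split(F.col("runner_name"), r"\."))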

Reading a tsv file in pyspark

I want to read a TSV file that has no header, so I am creating my own schema and then trying to read the TSV file, but after applying the schema it shows all column values as null. Below is my code and result.
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
schema = StructType([StructField("id_code", IntegerType()),StructField("description", StringType())])
df = spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv", schema=schema)
df.show();
+-------+-----------+
|id_code|description|
+-------+-----------+
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
+-------+-----------+
If I read it simply without applying any schema:
df=spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv",sep="/t")
df.show()
+-----------------+
| _c0|
+-----------------+
| 0 Not Specified |
| 1 Modem |
| 2 LAN/Wifi |
| 3 Unknown |
| 4 Mobile Carrier|
+-----------------+
It is not coming out properly. Can anyone please help me with this? My sample file is a .tsv file and it has the records below.
0 Specified
1 Modemwifi
2 LAN/Wifi
3 Unknown
4 Mobile user
Add the sep option; if the file really is tab-separated, this will work.
spark.read.option("inferSchema", "true").option("sep", "\t").csv("test.tsv").show()
+---+-----------+
|_c0| _c1|
+---+-----------+
| 0| Specified|
| 1| Modemwifi|
| 2| LAN/Wifi|
| 3| Unknown|
| 4|Mobile user|
+---+-----------+
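To keep the named columns from the question, the same sep option can be combined with the user-defined schema. A minimal sketch, reusing the schema and path from the question (assuming the file really is tab-separated):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField("id_code", IntegerType()),
                     StructField("description", StringType())])

# sep="\t" makes the CSV reader split on tabs instead of commas
df = spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv", schema=schema, sep="\t")
df.show()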

Find a substring from a column and write into new column (Multiple column search)

I am a newbie to PySpark.
I am reading the data from a table and updating the same table. I have a requirement where I have to search for a small string in two columns, and if it is found I need to write a value into a new column.
Logic is like this:
IF
(Terminal_Region is not NULL & Terminal_Region contains "WC") OR
(Terminal_Footprint is not NULL & Terminal_Footprint contains "WC")
THEN REGION = "EOR"
ELSE
REGION ="WOR"
If both of those fields are NULL, then REGION = 'NotMapped'.
I need to create a new REGION column in the DataFrame using PySpark. Can somebody help me?
+-------------------+-------------------+----------+
|Terminal_Region    |Terminal_footprint |REGION    |
+-------------------+-------------------+----------+
|west street WC     |                   |EOR       |
|WC 87650           |                   |EOR       |
|BOULVEVARD WC      |                   |EOR       |
|                   |                   |Not Mapped|
|                   |landinf dr WC      |EOR       |
|                   |FOX VALLEY WC 76543|EOR       |
+-------------------+-------------------+----------+
I think the following code should create your desired output. The code should work with Spark 2.2, which includes the contains function.
from pyspark.sql.functions import col, when

# Creating the example DataFrame
df = spark.createDataFrame([("west street WC", None),
                            ("WC 87650", None),
                            ("BOULVEVARD WC", None),
                            (None, None),
                            (None, "landinf dr WC"),
                            (None, "FOX VALLEY WC 76543")],
                           ["Terminal_Region", "Terminal_footprint"])
df.show()  # print initial df

# If both columns are null -> NotMapped, otherwise search both columns for "WC"
df.withColumn("REGION",
              when(col("Terminal_Region").isNull() & col("Terminal_footprint").isNull(), "NotMapped")
              .otherwise(when(col("Terminal_Region").contains("WC") |
                              col("Terminal_footprint").contains("WC"), "EOR")
                         .otherwise("WOR"))).show()
Output:
#initial dataframe
+---------------+-------------------+
|Terminal_Region| Terminal_footprint|
+---------------+-------------------+
| west street WC| null|
| WC 87650| null|
| BOULVEVARD WC| null|
| null| null|
| null| landinf dr WC|
| null|FOX VALLEY WC 76543|
+---------------+-------------------+
# df with the logic applied
+---------------+-------------------+---------+
|Terminal_Region| Terminal_footprint| REGION|
+---------------+-------------------+---------+
| west street WC| null| EOR|
| WC 87650| null| EOR|
| BOULVEVARD WC| null| EOR|
| null| null|NotMapped|
| null| landinf dr WC| EOR|
| null|FOX VALLEY WC 76543| EOR|
+---------------+-------------------+---------+

Sum of single column across rows based on a condition in Spark Dataframe

Consider the following dataframe:
+-------+-----------+-------+
| rid| createdon| count|
+-------+-----------+-------+
| 124| 2017-06-15| 1 |
| 123| 2017-06-14| 2 |
| 123| 2017-06-14| 1 |
+-------+-----------+-------+
I need to add up the count column across rows which have the same createdon and rid.
Therefore the resultant dataframe should be as follows:
+-------+-----------+-------+
| rid| createdon| count|
+-------+-----------+-------+
| 124| 2017-06-15| 1 |
| 123| 2017-06-14| 3 |
+-------+-----------+-------+
I am using Spark 2.0.2.
I have tried agg, conditions inside select, etc., but couldn't find a solution. Can anyone help me?
Try this
import org.apache.spark.sql.{functions => func}
df.groupBy($"rid", $"createdon").agg(func.sum($"count").alias("count"))
this should do what you want:
import org.apache.spark.sql.functions.sum
df
.groupBy($"rid",$"createdon")
.agg(sum($"count").as("count"))
.show
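Since the rest of this page is PySpark, a roughly equivalent aggregation in Python (a sketch, assuming the same column names) would be:
from pyspark.sql import functions as F

df.groupBy("rid", "createdon") \
  .agg(F.sum("count").alias("count")) \
  .show()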