Removing leading zeros after joining with ( | ) in pyspark

How can I remove leading zeros after joining, for example,
100|0000000086,
200|000000000087,
100|00000075
300|00007505
I want this data to be
100|86,
200|87,
100|75,
300|7505
Thank you in advance!!

You can use a regex to replace the leading zeros after the |. In pyspark you can use regexp_replace to achieve your desired result. For example,
from pyspark.sql import functions as F

df = df.withColumn('new_a', F.regexp_replace(F.col('a'), r'\|0*', '|'))
df.show(truncate=False)
Output:
+-----------------------------------------------------------+-------------------------------+
|a |new_a |
+-----------------------------------------------------------+-------------------------------+
|100|0000000086, 200|000000000087, 100|00000075 300|00007505|100|86, 200|87, 100|75 300|7505|
+-----------------------------------------------------------+-------------------------------+
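Spark's regexp_replace uses the same pattern semantics here as plain Python's re module, so the pattern can be sanity-checked without a Spark session (a standalone sketch, not the pyspark call itself):

```python
import re

# '|' followed by a run of zeros is collapsed back to just '|'
pattern = r"\|0*"

samples = ["100|0000000086", "200|000000000087", "100|00000075", "300|00007505"]
cleaned = [re.sub(pattern, "|", s) for s in samples]
print(cleaned)  # ['100|86', '200|87', '100|75', '300|7505']
```

One caveat: a value that is all zeros (e.g. 100|0000) would become 100| with this pattern; if that case can occur, a pattern like r"\|0*(\d+)" with replacement r"|\1" keeps the last digit.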

Related

Change prefix in a integer column in pyspark

I want to convert the prefix from 222.. to 999.. in pyspark.
Expected: a new column new_id with the prefix changed to 999..s.
I will be using this column for an inner merge between 2 pyspark dataframes.
+-------------+-------------+
|id           |new_id       |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099  |99999955099  |
|22222955099  |99999955099  |
|22222955099  |99999955099  |
|222285678    |999985678    |
+-------------+-------------+
You can achieve it with something like this,
from pyspark.sql import functions as F

# First calculate the number of "2"s from the start until some other value is found, e.g. '2223' gives a length of 3
# Use that calculated value to repeat "9" that many times
# Replace the starting "2"s with the calculated "9" string
# Finally drop the intermediate columns
df.withColumn("len_2", F.length(F.regexp_extract(F.col("value"), r"^2*(?!2)", 0)).cast('int'))\
.withColumn("to_replace_with", F.expr("repeat('9', len_2)"))\
.withColumn("new_value", F.expr("regexp_replace(value, '^2*(?!2)', to_replace_with)")) \
.drop("len_2", "to_replace_with")\
.show(truncate=False)
Output:
+-------------+-------------+
|value |new_value |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|222285678 |999985678 |
+-------------+-------------+
I have used the column name as value, you would have to substitute it with id.
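The core of the transformation, count the leading run of 2s and emit an equal-length run of 9s, can be sketched in plain Python to verify the regex logic independently of Spark:

```python
import re

def replace_prefix(value: str) -> str:
    # Count the leading run of '2's, then emit an equal-length run of '9's
    n = len(re.match(r"2*", value).group(0))
    return "9" * n + value[n:]

print(replace_prefix("2222238308750"))  # 9999938308750
print(replace_prefix("222285678"))      # 999985678
```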
You can try the following:
from pyspark.sql.functions import *
df = df.withColumn("tempcol1", regexp_extract("id", "^2*", 0)) \
    .withColumn("tempcol2", split(regexp_replace("id", "^2*", "_"), "_")[1]) \
    .withColumn("new_id", concat(regexp_replace("tempcol1", "2", "9"), "tempcol2")) \
    .drop("tempcol1", "tempcol2")
The id column is split into two temp columns, one having the prefix and the other the rest of the string. The prefix column values are replaced and concatenated back with the second temp column.
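The same two-step idea, isolate the prefix, map 2 to 9, and concatenate the remainder, can be mirrored in plain Python (a sketch of the logic, not the pyspark code):

```python
import re

def replace_prefix_split(s: str) -> str:
    prefix = re.match(r"^2*", s).group(0)   # the run of leading '2's
    rest = s[len(prefix):]                  # everything after the prefix
    return prefix.replace("2", "9") + rest

print(replace_prefix_split("22222955099"))  # 99999955099
```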

pyspark regexp_replace replacing multiple values in a column

I have the url https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating dataframe to replicate the issue
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried below regexp_replace with pipe function. But it is not working as expected.
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
The "backslash r" (\r) is not showing in your original spark.createDataFrame object because it has to be escaped, so your spark.createDataFrame call should be as follows (note the double backslashes):
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
.withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
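If instead the column holds an actual carriage-return character (rather than a literal backslash followed by r), the key fix is simply removing the spaces around the alternation, since 'https:// | \r' matches the literal text " | " too. A plain-Python sketch of that case:

```python
import re

url = "https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r"
# No spaces around '|' in the pattern
cleaned = re.sub(r"https://|\r", "", url)
print(cleaned)  # www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
```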

Removal of ( ) from string in pyspark

We are receiving data like tx=Reach % (YouTube) and we need only YouTube from that. How can we remove the rest without hardcoding? For hardcoding I was using
df = df.withColumn('tx', F.regexp_replace('tx', 'Reach % (YouTube)', 'YouTube'))
but we do not want to hardcode values like YouTube. How can we apply a check so that if there is Reach %, everything except the string inside the brackets is removed in pyspark?
df = spark.createDataFrame(
[("tx=Reach % (YouTube)", )],
schema=['col1']
)
df.show(10, False)
+--------------------+
|col1 |
+--------------------+
|tx=Reach % (YouTube)|
+--------------------+
import pyspark.sql.functions as func

df.select(func.regexp_extract(func.col('col1'), r'\((.*?)\)', 1).alias('val')).show(10, False)
+-------+
|val |
+-------+
|YouTube|
+-------+
You can use regexp_extract instead of regexp_replace
df.withColumn("tx", F.regexp_extract(F.col("tx"), r"\(([\w]+)\)", 1))
\( is (
\) is )
[\w]+ is 1 or more word characters like [a-zA-Z0-9_]+.
Note that this will only extract word characters. If you want to extract anything between parenthesis, you can use .*? instead of [\w]+.
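The parenthesis-capture pattern behaves the same way in plain Python's re module, which makes it easy to check (a standalone sketch, not the pyspark call):

```python
import re

s = "tx=Reach % (YouTube)"
# \( and \) match literal parentheses; group 1 captures the shortest run between them
m = re.search(r"\((.*?)\)", s)
print(m.group(1))  # YouTube
```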

Extract specific string from a column in pyspark dataframe

I have below pyspark dataframe.
column_a
name,age,pct_physics,country,class
name,age,pct_chem,class
pct_math,class
I have to extract only the part of the string which begins with pct and discard the rest.
Expected output:
column_a
pct_physics
pct_chem
pct_math
How can I achieve this in pyspark?
Use regexp_extract function.
Example:
df.withColumn("output",regexp_extract(col("column_a"),"(pct_.*?),",1)).show(10,False)
#+----------------------------------+-----------+
#|column_a |output |
#+----------------------------------+-----------+
#|name,age,pct_physics,country,class|pct_physics|
#|name,age,pct_chem,class |pct_chem |
#+----------------------------------+-----------+
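Note that the pattern (pct_.*?), relies on a comma appearing after the pct field, which holds for all three sample rows. A plain-Python sketch of the same extraction:

```python
import re

rows = ["name,age,pct_physics,country,class",
        "name,age,pct_chem,class",
        "pct_math,class"]
# Non-greedy capture up to the next comma
extracted = [re.search(r"(pct_.*?),", r).group(1) for r in rows]
print(extracted)  # ['pct_physics', 'pct_chem', 'pct_math']
```

If the pct field could be the last one in the string (no trailing comma), a pattern like (pct_[^,]*) would be safer.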

Pyspark dataframe split and pad delimited column value into Array of N index

There is a pyspark source dataframe having a column named X. The column X consists of '-' delimited values. There can be any number of delimited values in that particular column.
Example of source dataframe given below:
X
A123-B345-C44656-D4423-E3445-F5667
X123-Y345
Z123-N345-T44656-M4423
X123
Now, I need to split this column on the delimiter and pull exactly N=4 separate delimited values. If there are more than 4 delimited values, we keep the first 4 and discard the rest. If there are fewer than 4 delimited values, we keep the existing ones and pad the rest with the empty string "".
Resulting output should be like below:
+----------------------------------+----+----+------+-----+
|X                                 |Col1|Col2|Col3  |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345                         |X123|Y345|      |     |
|Z123-N345-T44656-M4423            |Z123|N345|T44656|M4423|
|X123                              |X123|    |      |     |
+----------------------------------+----+----+------+-----+
Have easily accomplished this in python as per the below code, but thinking of a pyspark approach to do this:
from itertools import chain, repeat, islice

def pad_infinite(iterable, padding=None):
    return chain(iterable, repeat(padding))

def pad(iterable, size, padding=None):
    return islice(pad_infinite(iterable, padding), size)

colA, colB, colC, colD = list(pad(X.split('-'), 4, ''))
You can split the string into an array, separate the elements of the array into columns and then fill the null values with an empty string:
df = ...
df.withColumn("arr", F.split("X", "-")) \
.selectExpr("X", "arr[0] as Col1", "arr[1] as Col2", "arr[2] as Col3", "arr[3] as Col4") \
.na.fill("") \
.show(truncate=False)
Output:
+----------------------------------+----+----+------+-----+
|X |Col1|Col2|Col3 |Col4 |
+----------------------------------+----+----+------+-----+
|A123-B345-C44656-D4423-E3445-F5667|A123|B345|C44656|D4423|
|X123-Y345 |X123|Y345| | |
|Z123-N345-T44656-M4423 |Z123|N345|T44656|M4423|
|X123 |X123| | | |
+----------------------------------+----+----+------+-----+
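The split-and-pad logic itself is easy to verify in plain Python (a sketch mirroring the pyspark output above; split_pad is an illustrative helper, not part of either answer):

```python
def split_pad(s: str, n: int = 4, sep: str = "-") -> list:
    """Split s on sep, keep at most the first n parts, pad with "" up to n."""
    parts = s.split(sep)[:n]
    return parts + [""] * (n - len(parts))

print(split_pad("A123-B345-C44656-D4423-E3445-F5667"))  # ['A123', 'B345', 'C44656', 'D4423']
print(split_pad("X123-Y345"))                           # ['X123', 'Y345', '', '']
```

This mirrors the pyspark behavior: arr[3] on a short array yields null, which na.fill("") then turns into the empty string.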