I have the URL https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in a dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating a dataframe to replicate the issue:
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below with a pipe in the pattern, but it is not working as expected.
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
the "backslash"r (\r) is not showing in your original spark.createDataFrame object because you have to escape it. so your spark.createDataFrame should be. please note the double backslashes
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
.withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
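As a side note, if the raw value actually contains a real carriage-return character (rather than the two literal characters \ and r), a pattern anchored at both ends handles it. A minimal sketch, assuming that form of the data:

import pyspark.sql.functions as F

c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r",)], ['str'])
# Remove the scheme at the start and any trailing whitespace (including \r) at the end.
c = c.withColumn("str", F.regexp_replace("str", r"^https://|\s+$", ""))
c.first()
# Row(str='www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793')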
How can I remove leading zeros after joining? For example,
100|0000000086,
200|000000000087,
100|00000075
300|00007505
I want this data to be
100|86,
200|87,
100|75,
300|7505
Thank you in advance!!
You can use a regex to replace the leading zeros after |. In PySpark you can use regexp_replace to achieve your desired result. For example,
df = df.withColumn('new_a', F.regexp_replace(F.col('a'), r'\|0*', '|'))
df.show(truncate=False)
Output:
+-----------------------------------------------------------+-------------------------------+
|a |new_a |
+-----------------------------------------------------------+-------------------------------+
|100|0000000086, 200|000000000087, 100|00000075 300|00007505|100|86, 200|87, 100|75 300|7505|
+-----------------------------------------------------------+-------------------------------+
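For reference, a minimal runnable sketch of the same idea; the column name a and the one-value-per-row layout are assumptions:

import pyspark.sql.functions as F

df = spark.createDataFrame(
    [('100|0000000086',), ('200|000000000087',), ('100|00000075',), ('300|00007505',)],
    ['a'])
# '\|0*' matches the pipe plus any run of zeros right after it and puts back just the pipe.
df = df.withColumn('new_a', F.regexp_replace(F.col('a'), r'\|0*', '|'))
df.show(truncate=False)
# new_a: 100|86, 200|87, 100|75, 300|7505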
We are receiving data like tx=Reach % (YouTube) and we need only YouTube from that. How can we remove the rest without hardcoding? For hardcoding I was using
df=df.withColumn('tx', F.regexp_replace('tx', 'Reach % (YouTube)', 'YouTube'))
but we do not want to hardcode values like YouTube etc. How can we apply a check so that if there is Reach %, we remove everything except the string inside the brackets, in PySpark?
df = spark.createDataFrame(
[("tx=Reach % (YouTube)", )],
schema=['col1']
)
df.show(10, False)
+--------------------+
|col1 |
+--------------------+
|tx=Reach % (YouTube)|
+--------------------+
df.select(func.regexp_extract(func.col('col1'), r'\((.*?)\)', 1).alias('val')).show(10, False)
+-------+
|val |
+-------+
|YouTube|
+-------+
You can use regexp_extract instead of regexp_replace
df.withColumn("tx", F.regexp_extract(F.col("tx"), r"\(([\w]+)\)", 1))
\( is (
\) is )
[\w]+ is 1 or more word characters like [a-zA-Z0-9_]+.
Note that this will only extract word characters. If you want to extract anything between parentheses, you can use .*? instead of [\w]+.
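For illustration, a small sketch of the .*? variant; the value with a space inside the parentheses is made up to show the difference:

import pyspark.sql.functions as F

df = spark.createDataFrame([("tx=Reach % (You Tube)",)], ['col1'])
# [\w]+ would stop at the space; .*? captures everything up to the closing parenthesis.
df.select(F.regexp_extract(F.col('col1'), r'\((.*?)\)', 1).alias('val')).show(truncate=False)
# val: You Tube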
I need to remove repeated punctuation characters and keep the last occurrence only.
For example: !!!! -> !
!!$$ -> !$
I have a dataset that looks like below
temp = spark.createDataFrame([
(0, "This is Spark!!!!"),
(1, "I wish Java could use case classes!!##"),
(2, "Data science is cool#$#!"),
(3, "Machine!!$$")
], ["id", "words"])
+---+--------------------------------------+
|id |words |
+---+--------------------------------------+
|0 |This is Spark!!!! |
|1 |I wish Java could use case classes!!##|
|2 |Data science is cool#$#! |
|3 |Machine!!$$ |
+---+--------------------------------------+
I tried a regex to remove specific punctuation characters, shown below:
df2 = temp.select(
[F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)
but the above is not working. Can anyone tell me how to achieve this in PySpark?
Below is the desired output.
+---+------------------------------------+
|id |words                               |
+---+------------------------------------+
|0  |This is Spark!                      |
|1  |I wish Java could use case classes!#|
|2  |Data science is cool#$#!            |
|3  |Machine!$                           |
+---+------------------------------------+
You can use this regex.
df2 = temp.select('id',
F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))
Regex explanation.
( -> Group anything between this and ) and create a capturing group
[ -> Match any characters between this and ]
([!$#]) -> Create the capturing group that match any of !, $, #
\1 -> Reference the first capturing group
+ -> Match 1 or more of a preceding group or character
([!$#])\1+ -> Match any of !, $, # that repeats more than 1 time.
The last argument of regexp_replace is set to $1, which references the first capturing group (a single !, $, or #), so each run of repeated characters is replaced with just that single character.
You can add more characters between [] for matching more special characters.
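If you would rather not enumerate the characters at all, one possible generalisation is Java's \p{Punct} class (Spark's regexp_replace uses Java regex), which matches any ASCII punctuation character. A sketch, assuming that is the behaviour you want:

import pyspark.sql.functions as F

df2 = temp.select(
    'id',
    # Collapse any run of the same punctuation character down to a single occurrence.
    F.regexp_replace('words', r'(\p{Punct})\1+', '$1').alias('words'))
df2.show(truncate=False)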
I need to clean a column from a DataFrame which contains trailing whitespace. Something like this:
'17063256 '
'17403492 '
'17390052 '
First, I tried to remove the whitespace using trim:
df.withColumn("col1_cleansed", trim(col("col1")))
Then I thought it may be trailing "tabs", so I tried as well with:
df.withColumn("col1_cleansed", regexp_replace(col("col1"), "\t", ""))
However, neither of these two solutions seems to be working.
What is the correct way to remove "tab" characters from a string column in Spark?
The trim and rtrim methods seem to have trouble handling general whitespace such as tabs. To remove trailing whitespace, consider using regexp_replace with the regex pattern \\s+$ (with '$' representing the end of the string), as shown below:
val df = Seq(
"17063256 ", // space
"17403492 ", // tab
"17390052 " // space + tab
).toDF("c1")
df.withColumn("c1_trimmed", regexp_replace($"c1", "\\s+$", "")).show
// Output (prettified)
// +------------+----------+
// | c1|c1_trimmed|
// +------------+----------+
// | 17063256 | 17063256|
// | 17403492 | 17403492|
// |17390052 | 17390052|
// +------------+----------+
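A rough PySpark equivalent of the same approach, assuming the sample values end in spaces and/or tab characters:

import pyspark.sql.functions as F

df = spark.createDataFrame([("17063256 ",), ("17403492\t",), ("17390052 \t",)], ['c1'])
# \s matches spaces, tabs and other whitespace; the $ anchors the match to the end of the string.
df = df.withColumn("c1_trimmed", F.regexp_replace("c1", r"\s+$", ""))
df.show(truncate=False)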
Try the UDF below and change it as per your needs.
val normalize = udf((in: String) => {
import java.text.Normalizer.{normalize ⇒ jnormalize, _}
val cleaned = in.trim.toLowerCase
val normalized = jnormalize(cleaned, Form.NFD).replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{IsM}\\p{IsLm}\\p{IsSk}]+", "")
normalized.replaceAll("'s", "")
.replaceAll("ß", "ss")
.replaceAll("ø", "o")
.replaceAll("[^a-zA-Z0-9-]+", " ")
})
df.withColumn("col1_cleansed", normalize(col("col1")))
You can use regexp_replace to replace the spaces with "":
df.withColumn("new", regexp_replace($"id", " ",""))
.show(false)
Output:
+------------------------------------------------------+----------+
|id |new |
+------------------------------------------------------+----------+
|'17063256 ' |'17063256'|
|'17403492 '|'17403492'|
|'17390052 ' |'17390052'|
+------------------------------------------------------+----------+
Another way to look at the problem: extract only the required portion from the column. This will work if you are expecting only alphanumeric values and nothing else.
Feel free to modify it to accept numbers only if required.
df.withColumn("cleansed_col",regexp_extract(col("input"),"[a-z0-9]+",0))
I am trying to split a string column of a DataFrame in Spark based on the delimiter ":|:|:"
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 only has 2 characters and the rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.
You're almost there; you just need to escape | within your delimiter, as follows:
val df = Seq(
(1, "TEST:|:|:51:|:|:PHT054008056"),
(2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate it's a literal pipe (not the regex OR operator):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
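For the PySpark side, a rough equivalent of the same fix (escaping the pipes so the delimiter is matched literally rather than read as regex alternation) might look like this:

import pyspark.sql.functions as F

df = spark.createDataFrame([("TEST:|:|:51:|:|:PHT054008056",)], ['testcolumn'])
df = (df
      .withColumn("part1", F.split(F.col("testcolumn"), r":\|:\|:").getItem(0))
      .withColumn("part2", F.split(F.col("testcolumn"), r":\|:\|:").getItem(1))
      .withColumn("part3", F.split(F.col("testcolumn"), r":\|:\|:").getItem(2)))
df.show(truncate=False)
# part3 = PHT054008056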