Remove repeated punctuation from a pyspark dataframe - pyspark

I need to remove repeated punctuation and keep only the last occurrence.
For example: !!!! -> !
!!$$ -> !$
I have a dataset that looks like the one below
temp = spark.createDataFrame([
    (0, "This is Spark!!!!"),
    (1, "I wish Java could use case classes!!##"),
    (2, "Data science is cool#$#!"),
    (3, "Machine!!$$")
], ["id", "words"])
+---+--------------------------------------+
|id |words |
+---+--------------------------------------+
|0 |This is Spark!!!! |
|1 |I wish Java could use case classes!!##|
|2 |Data science is cool#$#! |
|3 |Machine!!$$ |
+---+--------------------------------------+
I tried the regex below to remove specific punctuation characters
from pyspark.sql import functions as F

df2 = temp.select(
    [F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)
but the above does not produce what I want. Can anyone tell me how to achieve this in PySpark?
Below is the desired output.
   id  words
0   0  This is Spark!
1   1  I wish Java could use case classes!#
2   2  Data science is cool#$#!
3   3  Machine!$

You can use this regex.
df2 = temp.select(
    'id',
    F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words')
)
Regex explanation.
( -> Start a capturing group; everything up to the matching ) is grouped
[ -> Start a character class; match any one of the characters up to ]
([!$#]) -> A capturing group that matches any one of !, $, #
\1 -> Backreference to the first capturing group
+ -> Match 1 or more of the preceding group or character
([!$#])\1+ -> Match any of !, $, # that repeats more than once
The last argument of regexp_replace is set to $1, which references the first capturing group (a single !, $, or #), so each run of repeated characters is replaced with just that single character.
You can add more characters between [] to match more special characters.
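If you want to collapse runs of any repeated punctuation character instead of listing each one, a minimal sketch (assuming Java's \p{Punct} character class is acceptable here, since Spark's regexp_replace uses Java regex) would be:
from pyspark.sql import functions as F

# collapse any run of the same punctuation character down to a single occurrence
df2 = temp.select(
    'id',
    F.regexp_replace('words', r'(\p{Punct})\1+', '$1').alias('words')
)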

Related

Change prefix in an integer column in pyspark

I want to convert the prefix from 222.. to 999.. in pyspark.
Expected: a new column new_id with the prefix changed to 999..
I will be using this column for an inner merge between 2 pyspark dataframes
id              new_id
2222238308750   9999938308750
222222579844    999999579844
222225701296    999995701296
2222250087899   9999950087899
2222250087899   9999950087899
2222237274658   9999937274658
22222955099     99999955099
22222955099     99999955099
22222955099     99999955099
222285678       999985678
You can achieve it with something like this,
from pyspark.sql import functions as F

# First calculate the number of "2"s from the start till some other value is found, e.g. '2223' should give you 3 as the length
# Use that calculated value to repeat the "9" that many times
# Replace the starting "2"s with the calculated "9" string
# Finally drop all the intermediate columns
df.withColumn("len_2", F.length(F.regexp_extract(F.col("value"), r"^2*(?!2)", 0)).cast('int')) \
  .withColumn("to_replace_with", F.expr("repeat('9', len_2)")) \
  .withColumn("new_value", F.expr("regexp_replace(value, '^2*(?!2)', to_replace_with)")) \
  .drop("len_2", "to_replace_with") \
  .show(truncate=False)
Output:
+-------------+-------------+
|value |new_value |
+-------------+-------------+
|2222238308750|9999938308750|
|222222579844 |999999579844 |
|222225701296 |999995701296 |
|2222250087899|9999950087899|
|2222250087899|9999950087899|
|2222237274658|9999937274658|
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|22222955099 |99999955099 |
|222285678 |999985678 |
+-------------+-------------+
I have used value as the column name; you would have to substitute it with id.
You can try the following:
from pyspark.sql.functions import *
df = df.withColumn("tempcol1", regexp_extract("id", "^2*", 0)) \
       .withColumn("tempcol2", split(regexp_replace("id", "^2*", "_"), "_")[1]) \
       .withColumn("new_id", concat(regexp_replace("tempcol1", "2", "9"), "tempcol2")) \
       .drop("tempcol1", "tempcol2")
The id column is split into two temp columns, one having the prefix and the other the rest of the string. The prefix column values are replaced and concatenated back with the second temp column.
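If you prefer to avoid the temporary columns, the same idea can be collapsed into a single expression. A sketch, assuming the id column is (or has been cast to) a string:
from pyspark.sql import functions as F

# build a prefix of 9s as long as the run of leading 2s, then append whatever follows that run
df = df.withColumn(
    "new_id",
    F.expr("concat(repeat('9', length(regexp_extract(id, '^2*', 0))), "
           "regexp_replace(id, '^2*', ''))")
)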

pyspark regexp_replace replacing multiple values in a column

I have the url https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in a dataset. I want to remove the https:// at the start of the string and the \r at the end of the string.
Creating dataframe to replicate the issue
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below with a pipe in the pattern, but it is not working as expected.
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
The backslash-r (\r) is not showing up in your original spark.createDataFrame object because you have to escape it, so your spark.createDataFrame should be as below. Please note the double backslash.
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
     .withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
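If the \r in the data is a real carriage return rather than the two literal characters backslash and r, a sketch that anchors both ends of the pattern would be:
from pyspark.sql import functions as F

# strip a leading "https://" and a trailing carriage return in one pass
c = c.withColumn("str", F.regexp_replace("str", r"^https://|\r$", ""))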

How can I group by a column and use it to group by the other column?

I am classifying a column into different parts based on its first letters. That means that if two values have the same first 4 letters, they are in the same class. I use the following code to do that:
# this code extracts the first 4 characters of each title
df1_us2 = df1_us2.withColumn("first_2_char", df1_us2.clean_company_name.substr(1,4))
# this code groups them in a list
group_user = df1_us2.groupBy('first_2_char').agg(collect_set('col1').alias('cal11'))
Each title has a description, and I want this classification to happen for the description as well:
Example:
col1           description
summer         a season
summary        it is a brief
common         having similar
communication  null
house          living place
output:
col11                        description1
['summer', 'summary']        ['a season', 'it is a brief']
['common', 'communication']  ['having similar', null]
['house']                    ['living place']
How can I modify the above code to get description1?
Note: if a description is null, the null should be in the list, because I am going to use the index of elements in col1 to get their description. So both columns should have lists of the same size in each row.
collect_list should work as the aggregation function:
from pyspark.sql import functions as F

df = ...
df.withColumn('f2c', df.col1.substr(1, 2)) \
  .fillna('null') \
  .groupby('f2c') \
  .agg(F.collect_list('col1').alias('col11'),
       F.collect_list('description').alias('description1')) \
  .drop('f2c') \
  .show(truncate=False)
To include the null values in the arrays, they are replaced with the string 'null' first.
Output:
+-----------------------+-------------------------+
|col11 |description1 |
+-----------------------+-------------------------+
|[house] |[living place] |
|[common, communication]|[having similar, null] |
|[summer, summary] |[a season, it is a brief]|
+-----------------------+-------------------------+
For further processing, the two arrays can be combined into a map using map_from_arrays:
[...]
.withColumn('map', F.map_from_arrays('col11', 'description1')) \
.show(truncate=False)
Output:
+-----------------------+-------------------------+-------------------------------------------------+
|col11 |description1 |map |
+-----------------------+-------------------------+-------------------------------------------------+
|[house] |[living place] |{house -> living place} |
|[common, communication]|[having similar, null] |{common -> having similar, communication -> null}|
|[summer, summary] |[a season, it is a brief]|{summer -> a season, summary -> it is a brief} |
+-----------------------+-------------------------+-------------------------------------------------+
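If you need the arrays to contain real nulls rather than the string 'null', one alternative (a sketch, not part of the answer above) is to collect (col1, description) pairs as structs; collect_list drops null elements, but a struct with a null field is not itself null, so nothing is lost:
from pyspark.sql import functions as F

df.withColumn('f2c', df.col1.substr(1, 2)) \
  .groupby('f2c') \
  .agg(F.collect_list(F.struct('col1', 'description')).alias('pairs')) \
  .select(F.col('pairs.col1').alias('col11'),
          F.col('pairs.description').alias('description1')) \
  .show(truncate=False)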

Spark - Replace column value - regex pattern value contains a backslash - how to handle it?

Data Frame:
+-------------------+-------------------+
| Desc| replaced_columns|
+-------------------+-------------------+
|India is my Country|India is my Country|
| Delhi is my Nation| Delhi is my Nation|
| I Love India\Delhi| I Love India\Delhi|
| I Love USA| I Love USA|
|I am stay in USA\SA|I am stay in USA\SA|
+-------------------+-------------------+
Desc is the original column name from the DataFrame; replaced_columns is after we do some transformation. In the Desc column, I need to replace the "India\Delhi" value with "-". I tried the code below.
dataDF.withColumn("replaced_columns", regexp_replace(dataDF("Desc"), "India\\Delhi", "-")).show()
It is NOT replacing with the "-" string. How can I do that?
I found 3 approaches for the above question:
val approach1 = dataDF.withColumn("replaced_columns", regexp_replace(col("Desc"), "\\\\", "-")).show() // (it should be 4 backslashes when actually running it in the IDE)
val approach2 = dataDF.select($"Desc", translate($"Desc", "\\", "-").as("replaced_columns")).show()
The one below is for the specific records you asked about above (in the Desc column, replace the "India\Delhi" value with "-"):
val approach3 = dataDF
  .withColumn("replaced_columns", when(col("Desc").like("%Delhi"),
    regexp_replace(col("Desc"), "\\\\", "-")).otherwise(col("Desc")))
  .show()
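For reference, if the goal is literally to turn the substring India\Delhi into -, a PySpark sketch (assuming the same dataDF and Desc column) looks like this; the backslash is escaped once for the regex, and the raw string keeps Python from consuming it:
from pyspark.sql import functions as F

# the regex India\\Delhi matches the literal text India\Delhi
dataDF.withColumn("replaced_columns",
                  F.regexp_replace("Desc", r"India\\Delhi", "-")).show()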

How to split using multi-char separator with pipe?

I am trying to split a string column of a dataframe in spark based on a delimiter ":|:|:"
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
  .withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
  .withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
  .withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
  .withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 only has 2 characters and the rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.
You're almost there, you just need to escape | within your delimiter, as follows:
val df = Seq(
  (1, "TEST:|:|:51:|:|:PHT054008056"),
  (2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")

df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate it's a literal pipe (not an OR in the regex):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
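For reference, the same fix in PySpark (a sketch assuming the same dataframe1 and testcolumn names) escapes the pipes in exactly the same way:
from pyspark.sql import functions as F

# escape the pipes so the whole ":|:|:" sequence is treated as one separator
dataframe1.withColumn("part3",
                      F.split(F.col("testcolumn"), r":\|:\|:").getItem(2))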