Removal of ( ) from string in pyspark

We receive data like tx=Reach % (YouTube) and need only YouTube from it. How can we extract it without hardcoding? With hardcoding, I was using:
df=df.withColumn('tx', F.regexp_replace('tx', 'Reach % (YouTube)', 'YouTube'))
but we do not want to hardcode values such as YouTube. How can we apply a check so that, if the string contains Reach %, everything except the text inside the parentheses is removed in PySpark?

from pyspark.sql import functions as func

df = spark.createDataFrame(
[("tx=Reach % (YouTube)", )],
schema=['col1']
)
df.show(10, False)
+--------------------+
|col1 |
+--------------------+
|tx=Reach % (YouTube)|
+--------------------+
df.select(func.regexp_extract(func.col('col1'), r'\((.*?)\)', 1).alias('val')).show(10, False)
+-------+
|val |
+-------+
|YouTube|
+-------+

You can use regexp_extract instead of regexp_replace:
df.withColumn("tx", F.regexp_extract(F.col("tx"), r"\(([\w]+)\)", 1))
\( is (
\) is )
[\w]+ is 1 or more word characters like [a-zA-Z0-9_]+.
Note that this will only extract word characters. If you want to extract anything between the parentheses, you can use .*? instead of [\w]+.
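To see how the two patterns differ, and how the "only when Reach % is present" check from the question could be folded into the pattern itself, here is a minimal sketch in plain Python. It uses the standard re module rather than Spark, on the assumption that these simple patterns behave the same way in Spark's Java regex engine. extract_group is a hypothetical helper that imitates regexp_extract returning an empty string when nothing matches, and the "Prime Video" and "Clicks" rows are made-up sample values.

```python
import re

def extract_group(pattern, text, idx=1):
    """Hypothetical stand-in for regexp_extract: "" when there is no match."""
    m = re.search(pattern, text)
    return m.group(idx) if m else ""

WORD_ONLY = r"\(([\w]+)\)"         # word characters only between ( and )
ANYTHING = r"\((.*?)\)"            # anything up to the first ")"
REACH_ONLY = r"Reach % \((.*?)\)"  # additionally requires the "Reach % " prefix

# Made-up sample values for illustration
samples = [
    "tx=Reach % (YouTube)",
    "tx=Reach % (Prime Video)",
    "tx=Clicks (YouTube)",
]

for s in samples:
    print(s, "|", extract_group(WORD_ONLY, s), "|",
          extract_group(ANYTHING, s), "|", extract_group(REACH_ONLY, s))
```

In PySpark, the same check could probably also be written by wrapping regexp_extract in F.when(F.col('tx').contains('Reach %'), ...).otherwise(F.col('tx')), which would leave non-matching rows unchanged instead of returning an empty string.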

Related

Removing leading zeros after joining with ( | ) pyspark

How can I remove leading zeros after joining, for example,
100|0000000086,
200|000000000087,
100|00000075
300|00007505
I want this data to be
100|86,
200|87,
100|75,
300|7505
Thank you in advance!!

You can use a regex to replace the leading zeros after |. In PySpark, regexp_replace achieves the desired result. For example:
df = df.withColumn('new_a', F.regexp_replace(F.col('a'), r'\|0*', '|'))
df.show(truncate=False)
Output:
+-----------------------------------------------------------+-------------------------------+
|a |new_a |
+-----------------------------------------------------------+-------------------------------+
|100|0000000086, 200|000000000087, 100|00000075 300|00007505|100|86, 200|87, 100|75 300|7505|
+-----------------------------------------------------------+-------------------------------+
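One edge case worth checking: the pattern \|0* also strips a value that consists only of zeros, leaving nothing after the pipe. The sketch below uses Python's re module (not Spark) to compare it with a lookahead variant, \|0+(?=\d), that always keeps at least one digit. The "400|0000" row is a made-up value, and the sketch assumes the lookahead behaves the same way in Spark's Java regex engine.

```python
import re

# The last row is a made-up all-zero value to show the edge case
rows = ["100|0000000086", "200|000000000087", "300|00007505", "400|0000"]

# Pattern from the answer: removes every zero directly after the pipe
strip_all = [re.sub(r"\|0*", "|", r) for r in rows]

# Variant: remove leading zeros only while another digit still follows
keep_one_digit = [re.sub(r"\|0+(?=\d)", "|", r) for r in rows]

print(strip_all)
print(keep_one_digit)
```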

pyspark regexp_replace replacing multiple values in a column

I have the URL https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in a dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating a dataframe to replicate the issue:
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below with a pipe (alternation), but it does not work as expected.
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793

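The \r (a backslash followed by r) does not show up in your original spark.createDataFrame object because Python interprets \r inside a regular string literal as a single carriage-return character. To keep a literal backslash followed by r, you have to escape the backslash, so your spark.createDataFrame call should be as follows (note the double backslash):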
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
.withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
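Which pattern is needed depends on whether the column holds a real carriage return or the two characters backslash and r. The sketch below illustrates both cases with Python's re module instead of Spark, assuming the escapes behave the same way in Spark's Java regex engine. The variable names are chosen for illustration.

```python
import re

url = "www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793"

with_cr = "https://" + url + "\r"        # ends with a real carriage return (1 char)
with_literal = "https://" + url + "\\r"  # ends with backslash + r (2 chars)

# A real carriage return is matched by the regex escape \r
clean_cr = re.sub(r"https://|\r", "", with_cr)

# A literal backslash followed by r needs the backslash itself matched,
# here via a character class as in the answer
clean_literal = re.sub("https://|[\\\\]r", "", with_literal)

print(repr(clean_cr))
print(repr(clean_literal))
```

Printing with repr makes the difference between the two inputs visible, which can help when deciding which of the two patterns applies to the actual data.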

Remove the repeated punctuation from pyspark dataframe

I need to remove the repeated punctuations and keep the last occurrence only.
For example: !!!! -> !
!!$$ -> !$
I have a dataset that looks like below
temp = spark.createDataFrame([
(0, "This is Spark!!!!"),
(1, "I wish Java could use case classes!!##"),
(2, "Data science is cool#$#!"),
(3, "Machine!!$$")
], ["id", "words"])
+---+--------------------------------------+
|id |words |
+---+--------------------------------------+
|0 |This is Spark!!!! |
|1 |I wish Java could use case classes!!##|
|2 |Data science is cool#$#! |
|3 |Machine!!$$ |
+---+--------------------------------------+
I tried the regex below to remove specific punctuation characters:
df2 = temp.select(
[F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)
but it does not work. Can anyone tell me how to achieve this in PySpark?
Below is the desired output.
id words
0 0 This is Spark!
1 1 I wish Java could use case classes!#
2 2 Data science is cool#$#!
3 3 Machine!$

You can use this regex.
df2 = temp.select('id',
F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))
Regex explanation.
( -> Group anything between this and ) and create a capturing group
[ -> Match any characters between this and ]
([!$#]) -> Create the capturing group that match any of !, $, #
\1 -> Reference the first capturing group
+ -> Match 1 or more of a preceding group or character
([!$#])\1+ -> Match any of !, $, # that is repeated 2 or more times in a row.
The last argument of regexp_replace is $1, which references the first capturing group (a single !, $, or #), so each run of a repeated character is replaced with a single occurrence of that character.
You can add more characters between [] to match more special characters.
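The backreference approach can be tried outside Spark with Python's re module, as in the sketch below. Note one syntax difference: Python writes the replacement backreference as \1, whereas regexp_replace uses the Java-style $1. squeeze is a hypothetical helper name, and its chars parameter is an assumption added to show how the character class could be extended.

```python
import re

def squeeze(text, chars="!$#"):
    """Collapse runs of the same punctuation character into one occurrence."""
    # Build ([!$#])\1+ with the characters escaped for use inside [...]
    pattern = "([" + re.escape(chars) + r"])\1+"
    return re.sub(pattern, r"\1", text)

words = [
    "This is Spark!!!!",
    "I wish Java could use case classes!!##",
    "Data science is cool#$#!",
    "Machine!!$$",
]

for w in words:
    print(squeeze(w))
```

The third row is left unchanged because no character in it is immediately followed by itself.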

How to split column into multiple columns in Spark 2?

I am reading the data from HDFS into DataFrame using Spark 2.2.0 and Scala 2.11.8:
val df = spark.read.text(outputdir)
df.show()
I see this result:
+--------------------+
| value|
+--------------------+
|(4056,{community:...|
|(56,{community:56...|
|(2056,{community:...|
+--------------------+
If I run df.head(), I see more details about the structure of each row:
[(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})]
I want to get the following output:
+---------+----------+
| id | value|
+---------+----------+
|4056 |1 |
|56 |56 |
|2056 |20 |
+---------+----------+
How can I do it? I tried using .map(row => row.mkString(",")),
but I don't know how to extract the data as I showed.

The problem is that you are getting the data as a single column of strings. The data format is not really specified in the question (ideally it would be something like JSON), but given what we know, we can use a regular expression to extract the number on the left (id) and the community field:
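import org.apache.spark.sql.functions.regexp_extract
import spark.implicits._ // enables the $"value" column syntax
// Group 1 is the leading id; group 2 is the number after "community:"
val r = """\((\d+),\{.*community:(\d+).*\}\)"""
df.select(
regexp_extract($"value", r, 1).as("id"),
regexp_extract($"value", r, 2).as("community")
).show()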
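As a quick check of what the two capturing groups pick up, the same pattern can be run against the sample row from the question. This sketch uses Python's re module rather than Spark or Scala, on the assumption that the pattern behaves identically in Java's regex engine. The variable names are illustrative only.

```python
import re

PATTERN = r"\((\d+),\{.*community:(\d+).*\}\)"

# Sample row copied from the df.head() output in the question
row = "(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})"

m = re.search(PATTERN, row)
row_id, community = m.group(1), m.group(2)
print(row_id, community)
```

The second group lands on the community field and not on communitySigmaTot because the pattern requires a colon directly after the word community.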

A combination of regular expressions should give you the required result:
df.select(
regexp_extract($"value", "^\\(([0-9]+),.*$", 1) as "id",
explode(split(regexp_extract($"value", "^\\(([0-9]+),\\{(.*)\\}\\)$", 2), ",")) as "value"
).withColumn("value", split($"value", ":")(1))

If your data is always of the following format
(4056,{community:1,communitySigmaTot:1020457,internalWeight:0,nodeWeight:1020457})
then you can simply use the built-in split and regexp_replace functions to get your desired output DataFrame:
import org.apache.spark.sql.functions._
df.select(regexp_replace((split(col("value"), ",")(0)), "\\(", "").as("id"), regexp_replace((split(col("value"), ",")(1)), "\\{community:", "").as("value") ).show()
I hope the answer is helpful

How to split using multi-char separator with pipe?

I am trying to split a string column of a dataframe in spark based on a delimiter ":|:|:"
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 only has 2 characters and rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.

You're almost there – just need to escape | within your delimiter, as follows:
val df = Seq(
(1, "TEST:|:|:51:|:|:PHT054008056"),
(2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate that it is a literal pipe (not the regex alternation operator):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
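The effect of the unescaped pipe can be reproduced outside Spark. The sketch below uses Python's re.split (not Spark's split function) to show that ":|:|:" is read as "colon OR colon OR colon", which is why the question ended up with seven pieces at indices 0, 3 and 6. The last line shows re.escape as one way to avoid escaping by hand; on the JVM, java.util.regex.Pattern.quote plays a similar role, though using it with Spark's split is an assumption not taken from the answer.

```python
import re

s = "TEST:|:|:51:|:|:PHT054008056"

# Unescaped: the pattern means ":" or ":" or ":", so every colon is a separator
unescaped = re.split(":|:|:", s)

# Escaped pipes: the whole ":|:|:" sequence is one literal delimiter
escaped = re.split(r":\|:\|:", s)

# Let the library escape the delimiter instead of doing it by hand
quoted = re.split(re.escape(":|:|:"), s)

print(unescaped)
print(escaped)
print(quoted)
```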
