I am trying to split a string column of a DataFrame in Spark on the delimiter ":|:|:"
Input:
TEST:|:|:51:|:|:PHT054008056
Test code:
dataframe1
.withColumn("splitColumn", split(col("testcolumn"), ":|:|:"))
Result:
+------------------------------+
|splitColumn |
+------------------------------+
|[TEST, |, |, 51, |, |, P] |
+------------------------------+
Test code:
dataframe1
.withColumn("part1", split(col("testcolumn"), ":|:|:").getItem(0))
.withColumn("part2", split(col("testcolumn"), ":|:|:").getItem(3))
.withColumn("part3", split(col("testcolumn"), ":|:|:").getItem(6))
part1 and part2 work correctly.
part3 only has 2 characters and the rest of the string is truncated.
part3:
P
I want to get the entire part3 string.
Any help is appreciated.
You're almost there: split interprets the delimiter as a regular expression, so you just need to escape | (the regex alternation operator) within your delimiter, as follows:
val df = Seq(
(1, "TEST:|:|:51:|:|:PHT054008056"),
(2, "TEST:|:|:52:|:|:PHT053007057")
).toDF("id", "testcolumn")
df.withColumn("part3", split($"testcolumn", ":\\|:\\|:").getItem(2)).show
// +---+--------------------+------------+
// | id| testcolumn| part3|
// +---+--------------------+------------+
// | 1|TEST:|:|:51:|:|:P...|PHT054008056|
// | 2|TEST:|:|:52:|:|:P...|PHT053007057|
// +---+--------------------+------------+
[UPDATE]
You could also use triple quotes for the delimiter, in which case you still have to escape | to indicate it's a literal pipe (not the regex alternation "or"):
df.withColumn("part3", split($"testcolumn", """:\|:\|:""").getItem(2)).show
Note that with triple quotes, you need only a single escape character \, whereas without the triple quotes the escape character itself needs to be escaped (hence \\).
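For completeness, the same fix in PySpark would look roughly like this (a minimal sketch; dataframe1 and testcolumn simply mirror the names used in the question):
from pyspark.sql import functions as F

# Escape the pipes so the whole ":|:|:" delimiter is matched literally.
# Once the full delimiter matches, the three parts sit at indices 0, 1 and 2
# (rather than 0, 3 and 6 as in the broken split on ":").
result = (dataframe1
    .withColumn("part1", F.split(F.col("testcolumn"), r":\|:\|:").getItem(0))
    .withColumn("part2", F.split(F.col("testcolumn"), r":\|:\|:").getItem(1))
    .withColumn("part3", F.split(F.col("testcolumn"), r":\|:\|:").getItem(2)))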
Related
I have the URL https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r in a dataset. I want to remove https:// at the start of the string and \r at the end of the string.
Creating a dataframe to replicate the issue:
c = spark.createDataFrame([('https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r',)], ['str'])
I tried the regexp_replace below with a pipe in the pattern, but it is not working as expected.
c.select(F.regexp_replace('str', 'https:// | \\r', '')).first()
Actual output:
www.youcuomizei.comEquaion-Kid-Backack-Peronalized301793
Expected output:
www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793
the "backslash"r (\r) is not showing in your original spark.createDataFrame object because you have to escape it. so your spark.createDataFrame should be. please note the double backslashes
c = spark.createDataFrame([("https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\\r",)], ['str'])
which will give this output:
+------------------------------------------------------------------------------+
|str |
+------------------------------------------------------------------------------+
|https://www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793\r|
+------------------------------------------------------------------------------+
Your regex https://|[\\r] will not remove the \r. The regex should be:
c = (c
.withColumn("str", F.regexp_replace("str", "https://|[\\\\]r", ""))
)
which will give this output:
+--------------------------------------------------------------------+
|str |
+--------------------------------------------------------------------+
|www.youcustomizeit.com/p/Equations-Kids-Backpack-Personalized/301793|
+--------------------------------------------------------------------+
We are receiving data like tx=Reach % (YouTube) and we need only YouTube from that. How can we remove the rest without hardcoding? For hardcoding I was using:
df=df.withColumn('tx', F.regexp_replace('tx', 'Reach % (YouTube)', 'YouTube'))
but we do not want to hardcode values like YouTube. How can we apply a check such that if there is Reach %, we remove everything except the string inside the brackets, in PySpark?
df = spark.createDataFrame(
[("tx=Reach % (YouTube)", )],
schema=['col1']
)
df.show(10, False)
+--------------------+
|col1 |
+--------------------+
|tx=Reach % (YouTube)|
+--------------------+
df.select(func.regexp_extract(func.col('col1'), '\((.*?)\)', 1).alias('val')).show(10, False)
+-------+
|val |
+-------+
|YouTube|
+-------+
You can use regexp_extract instead of regexp_replace
df.withColumn("tx", F.regexp_extract(F.col("tx"), r"\(([\w]+)\)", 1))
\( matches a literal (
\) matches a literal )
[\w]+ matches 1 or more word characters, i.e. [a-zA-Z0-9_]+.
Note that this will only extract word characters. If you want to extract anything between the parentheses, you can use .*? instead of [\w]+.
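For example, a small sketch of the more permissive pattern (same df and tx column as in the snippet above):
# Extract whatever sits between the parentheses, not just word characters.
df.withColumn("tx", F.regexp_extract(F.col("tx"), r"\((.*?)\)", 1))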
I need to remove the repeated punctuations and keep the last occurrence only.
For example: !!!! -> !
!!$$ -> !$
I have a dataset that looks like the one below:
temp = spark.createDataFrame([
(0, "This is Spark!!!!"),
(1, "I wish Java could use case classes!!##"),
(2, "Data science is cool#$#!"),
(3, "Machine!!$$")
], ["id", "words"])
+---+--------------------------------------+
|id |words |
+---+--------------------------------------+
|0 |This is Spark!!!! |
|1 |I wish Java could use case classes!!##|
|2 |Data science is cool#$#! |
|3 |Machine!!$$ |
+---+--------------------------------------+
I tried a regex to remove specific punctuation marks, shown below:
df2 = temp.select(
[F.regexp_replace(col, r',|\.|&|\\|\||-|_', '').alias(col) for col in temp.columns]
)
but the above is not working. Can anyone tell me how to achieve this in PySpark?
Below is the desired output.
id  words
0   This is Spark!
1   I wish Java could use case classes!#
2   Data science is cool#$#!
3   Machine!$
You can use this regex.
df2 = temp.select('id',
F.regexp_replace('words', r'([!$#])\1+', '$1').alias('words'))
Regex explanation:
( -> starts a capturing group; everything up to the matching ) is captured
[ ] -> matches any single character listed between the brackets
([!$#]) -> a capturing group that matches any one of !, $, #
\1 -> a backreference to whatever the first capturing group matched
+ -> matches 1 or more of the preceding group or character
([!$#])\1+ -> matches any of !, $, # repeated 2 or more times in a row
The last argument of regexp_replace is set to $1, which references the first capturing group (the single !, $ or # that was matched), so the whole run of repeated characters is replaced with just that single character.
You can add more characters between [ and ] to match more special characters.
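If you would rather not enumerate the characters, one alternative is the Java regex punctuation class, roughly as below (a sketch against the same temp DataFrame; \p{Punct} only covers ASCII punctuation, which is an assumption about your data):
from pyspark.sql import functions as F

# Collapse any run of one repeated punctuation character down to a single occurrence.
df2 = temp.select(
    'id',
    F.regexp_replace('words', r'(\p{Punct})\1+', '$1').alias('words'))
df2.show(truncate=False)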
I need to clean a column from a DataFrame which contains trailing whitespace. Something like this:
'17063256 '
'17403492 '
'17390052 '
First, I tried to remove the whitespace using trim:
df.withColumn("col1_cleansed", trim(col("col1")))
Then I thought it may be trailing "tabs", so I also tried:
df.withColumn("col1_cleansed", regexp_replace(col("col1"), "\t", ""))
However, neither of these two solutions seems to be working.
What is the correct way to remove "tab" characters from a string column in Spark?
The trim and rtrim methods do seem to have a problem handling general whitespace. To remove trailing whitespace, consider using regexp_replace with the regex pattern \\s+$ (with '$' representing end of string), as shown below:
val df = Seq(
"17063256 ", // space
"17403492 ", // tab
"17390052 " // space + tab
).toDF("c1")
df.withColumn("c1_trimmed", regexp_replace($"c1", "\\s+$", "")).show
// Output (prettified)
// +------------+----------+
// | c1|c1_trimmed|
// +------------+----------+
// | 17063256 | 17063256|
// | 17403492 | 17403492|
// |17390052 | 17390052|
// +------------+----------+
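For PySpark users, the equivalent call would be along these lines (a sketch, assuming a string column named c1 as above):
from pyspark.sql import functions as F

# Strip any run of whitespace characters (spaces, tabs, ...) anchored at the end of the string.
df.withColumn("c1_trimmed", F.regexp_replace("c1", r"\s+$", ""))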
Try the UDF below and change it as per your needs.
val normalize = udf((in: String) => {
  import java.text.Normalizer.{normalize => jnormalize, _}
  // Trim and lower-case, then decompose accented characters (NFD) and strip the combining marks
  val cleaned = in.trim.toLowerCase
  val normalized = jnormalize(cleaned, Form.NFD)
    .replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{IsM}\\p{IsLm}\\p{IsSk}]+", "")
  // Drop possessives, transliterate a few special letters, and collapse anything
  // that is not alphanumeric or '-' into a single space
  normalized.replaceAll("'s", "")
    .replaceAll("ß", "ss")
    .replaceAll("ø", "o")
    .replaceAll("[^a-zA-Z0-9-]+", " ")
})
df.withColumn("col1_cleansed", normalize(col("col1")))
You can use regexp_replace to replace " " with "":
df.withColumn("new", regexp_replace($"id", " ",""))
.show(false)
Output:
+------------------------------------------------------+----------+
|id |new |
+------------------------------------------------------+----------+
|'17063256 ' |'17063256'|
|'17403492 '|'17403492'|
|'17390052 ' |'17390052'|
+------------------------------------------------------+----------+
Another way to look at the problem: extract only the required portion from the column. This will work if you are expecting only alphanumeric values and nothing else.
Feel free to modify it to accept numbers only if required.
df.withColumn("cleansed_col",regexp_extract(col("input"),"[a-z0-9]+",0))
I'm reading a CSV in Spark with Scala. It handles line one correctly in the example below, but line two has an ending quote character with no leading quote character for the first column. This causes an issue by shifting the data over and outputting bad|col in the final result, which is incorrect.
"good,col","good,col"
bad,col","good,col"
Is there an option in the read specification to handle quote characters that don't have a matching leading (or ending) quote when reading the file in Spark with Scala?
Hm... by using an RDD and some string replacements, I can obtain what you want:
val df = rdd
  .map(r => r.replaceAll("\",\"", "|")   // temporarily mark the boundary between the two quoted columns with '|'
             .replaceAll("\"", "")       // drop all remaining quotes, balanced or not
             .split("\\|"))              // split on the temporary marker
  .map { case Array(a, b) => (a, b) }
  .toDF("col1", "col2")
df.show()
+--------+--------+
| col1| col2|
+--------+--------+
|good,col|good,col|
| bad,col|good,col|
+--------+--------+